[Alerting] research refactoring task_runner module #124206

ersin-erdal · 2022-02-01T10:09:25Z

⚠️ This is a meta issue.

This is a meta issue to refactor task_runner.ts module.

This module and its unit tests became too big and hard to follow in time. We can create sub-issues (extract type, interfaces and test fixtures, extract rule execution etc... ) and refactor them as small chunks.

Task breakdown

[Alerting] Refactoring task_runner, Move interface and test fixtures to their own files. #125416
[Response Ops] Refactor alerting task runner - combine loadRuleAttributesAndRun() and validateAndExecuteRule() #131544
[Response Ops] Refactor alerting task runner - refactor determination of active/new/recovered alerts #131546
[Response Ops] Refactor alerting task runner - combine logActiveAndRecoveredAlerts and generateNewAndRecoveredAlertEvents #131547
[Response Ops] Refactor alerting task runner - single function for scheduling actions #131548
[ResponseOps] can we remove some tests after refactoring task runner? #133831
[Response Ops] Refactor error handling in alerting task runner #134579
Refactor alerting task runner - remove RecoveryActionGroupId #139208

The text was updated successfully, but these errors were encountered:

elasticmachine · 2022-02-01T10:13:37Z

Pinging @elastic/response-ops (Team:ResponseOps)

mikecote · 2022-02-07T15:33:52Z

We can create sub-issues (extract type, interfaces and test fixtures, extract rule execution etc... ) and refactor them as small chunks.

We can have a team discussion before breaking it down or starting the work.

gmmorris · 2022-04-25T14:05:07Z

I've been thinking about this issue.... I want to make sure we break this down into concrete steps and make each step an independent PR.
This is both to make sure that we deliver incremental value even if we run out of time in this minor, and to avoid a +25000-99999 PR to review 😂😭 .

I suggest we enumerate specific problems that we want to address.
I'll name several, but please add to this list:

Alerts are grouped and classified (active/recovered/etc.) in several places - this should only be done once, in one place, and should be an independent unit.
There isn't a clear breakdown of domains. This class does "the stuff that the runs the rule executor", but that touches on a bunch of domains - I think each domain belongs in its own unit:
1. Rule Saved Object lifecycle
2. Monitoring metric calculations
3. Creating the services using a the fake request
4. Event Log event creation and operations
5. Alert lifecycle (classification, alert duration calculations etc.)

mikecote · 2022-04-26T18:57:30Z

+1 I think it would be good to avoid moving and changing code in the same PR.

Some quick thoughts:

Alerts are grouped and classified (active/recovered/etc.) in several places - this should only be done once, in one place, and should be an independent unit.

It would be good to assign the alerts into three groups (newAlerts, ongoingAlerts, recoveredAlerts) and pass those everywhere. I've had good success using this approach in a POC and also using a fourth group generatedAlerts which references new and ongoing alerts.

I've also seen performance increase calculating these objects in a for loop (reducing event loop blocking time).

example code block from my ON week

const newAlerts: Record<
    string,
    CreatedAlert<InstanceState, InstanceContext, ActionGroupIds>
> = {};
const ongoingAlerts: Record<
    string,
    CreatedAlert<InstanceState, InstanceContext, ActionGroupIds>
> = {};
const generatedAlerts: Record<
    string,
    CreatedAlert<InstanceState, InstanceContext, ActionGroupIds>
> = {};
const recoveredAlerts: Record<
    string,
    CreatedAlert<InstanceState, InstanceContext, RecoveryActionGroupId>
> = {};
for (const id in alerts) {
    if (alerts.hasOwnProperty(id)) {
        const hasScheduledActions = alerts[id].hasScheduledActions();
        if (hasScheduledActions && !originalAlertIds.has(id)) {
            newAlerts[id] = alerts[id];
            generatedAlerts[id] = alerts[id];
        } else if (hasScheduledActions) {
            ongoingAlerts[id] = alerts[id];
            generatedAlerts[id] = alerts[id];
        } else {
            recoveredAlerts[id] = alerts[id];
        }
    }
}

There isn't a clear breakdown of domains. This class does "the stuff that the runs the rule executor", but that touches on a bunch of domains - I think each domain belongs in its own unit:

I could also see a domain for scheduling the actions. It happens once for ongoing alerts and another time for recovered alerts.

The logging of events could also be done in its own module for all types of alert events (new, active, recovered, etc).

ymao1 · 2022-05-04T13:24:12Z

The alerting task runner is a class that primarily implements 2 functions:

run()
cancel()

Every other method in the class and helper functions in the file are in support of these two functions. We'll focus on run() first since it contains the bulk of the functionality.

`run()`

starts APM transaction
initializes event log record keeping and writes execute-start event
calls loadRuleAttributesAndRun
processes the monitoring and metrics captured during run
processes the state returned during run
writes event log execute event
completes APM transaction
calculates the next interval for the rule

`loadRuleAttributesAndRun()`

retrieves encrypted attributes from rule SO
retrieves rule SO using rules client (this follows RBAC)
ensures rule type is enabled
calls validateAndExecuteRule

`validateAndExecuteRule()`

validates rule parameters using rule type validation
creates "execution handler" callback
calls executeRule with execution handler callback

`executeRule()`

sets up required services and calls the rule type executor
determines active/recovered/new alerts after rule type executor is done - this determination is unnecessarily done multiple times and in different places
writes event log documents for alerts

Here is my proposal for the work to be done

encapsulate all event log transactions into a class (AlertingEventLog) with the following public functions:

initialize() - initializes the class with all the data required to populate the common metadata fields
start() - sets the start time for the transaction (shared between execute-start and execute and writes out the execute-start event
logActiveAlert()/logNewAlert()/logRecoveredAlert()/logTimeout()
complete() - sets metrics data for rule run and writes out execute event

combine loadRuleAttributesAndRun() and validateAndExecuteRule() and extract it into a library function with a complete set of unit tests
move the determination of new/active/recovered alerts to the AlertFactory. The AlertFactory contains all the information needed to make this determination. We should be able to add functions to getNewAlerts(), getActiveAlerts() and getRecoveredAlerts() and store it in the alert factory so even if they are called multiple times, the calculations are only done once. Keeping in mind that we should not be exposing these functions to the rule type executors (we want them only to use create() and done()). We should also see if we can move the functionality in trackAlertDurations into the alert factory so getActiveAlerts() returns alerts with updated duration and getRecoveredAlerts() returns alerts with updated duration and end time set.
combine logActiveAndRecoveredAlerts, generateNewAndRecoveredAlertEvents so we're logging to the server and the event log in a single place. Consider extracting it to a library function with a complete set of unit tests.
combine the logic inside

await Promise.all(
        alertsWithExecutableActions.map(
          ([alertId, alert]: [string, Alert<InstanceState, InstanceContext>]) =>
            this.executeAlert(alertId, alert, executionHandler, ruleRunMetricsStore)
        )
      );

await scheduleActionsForRecoveredAlerts<
        InstanceState,
        InstanceContext,
        RecoveryActionGroupId
      >({

into a single library function (scheduleActionsForAlerts) with a full set of unit tests.

ymao1 · 2022-05-04T16:29:59Z

Closing this meta issue as all followup issues have been created

…tributesAndRun() and validateAndExecuteRule() resolves elastic#131544 Extract the `loadRuleAttributesAndRun()` and `validateAndExecuteRule()` methods from the alerting task manager, into a separate module. meta issue: elastic#124206

…utesAndRun() and validateAndExecuteRule() (#132079) resolves #131544 Extract the `loadRuleAttributesAndRun()` and `validateAndExecuteRule()` methods from the alerting task runner, into a separate module. meta issue: #124206

botelastic bot added the needs-team Issues missing a team label label Feb 1, 2022

ersin-erdal changed the title ~~[Alerting] Refactoring task_manager module~~ [Alerting] Refactoring task_runner module Feb 1, 2022

ersin-erdal added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label Feb 1, 2022

botelastic bot removed the needs-team Issues missing a team label label Feb 1, 2022

ersin-erdal added the Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework label Feb 1, 2022

mikecote changed the title ~~[Alerting] Refactoring task_runner module~~ [Alerting] research refactoring task_runner module Feb 7, 2022

ersin-erdal mentioned this issue Feb 11, 2022

[Alerting] Refactoring task_runner, Move interface and test fixtures to their own files. #125416

Closed

mikecote added the Meta label Feb 22, 2022

ymao1 self-assigned this May 4, 2022

ymao1 closed this as completed May 4, 2022

pmuellr mentioned this issue May 11, 2022

[ResponseOps]: Refactor alerting task runner - combine loadRuleAttributesAndRun() and validateAndExecuteRule() #132079

Merged

2 tasks

pmuellr mentioned this issue Jun 7, 2022

[ResponseOps] can we remove some tests after refactoring task runner? #133831

Closed

ymao1 mentioned this issue Jun 16, 2022

[Response Ops] Refactor error handling in alerting task runner #134579

Closed

doakalexi mentioned this issue Aug 22, 2022

Refactor alerting task runner - remove RecoveryActionGroupId #139208

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Alerting] research refactoring task_runner module #124206

[Alerting] research refactoring task_runner module #124206

ersin-erdal commented Feb 1, 2022 •

edited by doakalexi

elasticmachine commented Feb 1, 2022

mikecote commented Feb 7, 2022 •

edited

gmmorris commented Apr 25, 2022

mikecote commented Apr 26, 2022

ymao1 commented May 4, 2022

ymao1 commented May 4, 2022

[Alerting] research refactoring task_runner module #124206

[Alerting] research refactoring task_runner module #124206

Comments

ersin-erdal commented Feb 1, 2022 • edited by doakalexi

Task breakdown

elasticmachine commented Feb 1, 2022

mikecote commented Feb 7, 2022 • edited

gmmorris commented Apr 25, 2022

mikecote commented Apr 26, 2022

ymao1 commented May 4, 2022

run()

loadRuleAttributesAndRun()

validateAndExecuteRule()

executeRule()

ymao1 commented May 4, 2022

ersin-erdal commented Feb 1, 2022 •

edited by doakalexi

mikecote commented Feb 7, 2022 •

edited

`run()`

`loadRuleAttributesAndRun()`

`validateAndExecuteRule()`

`executeRule()`