Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Alerting] research refactoring task_runner module #124206

Closed
6 of 8 tasks
ersin-erdal opened this issue Feb 1, 2022 · 6 comments
Closed
6 of 8 tasks

[Alerting] research refactoring task_runner module #124206

ersin-erdal opened this issue Feb 1, 2022 · 6 comments
Assignees
Labels
Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Meta Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@botelastic botelastic bot added the needs-team Issues missing a team label label Feb 1, 2022
@ersin-erdal ersin-erdal changed the title [Alerting] Refactoring task_manager module [Alerting] Refactoring task_runner module Feb 1, 2022
@ersin-erdal ersin-erdal added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label Feb 1, 2022
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Feb 1, 2022
@ersin-erdal ersin-erdal added the Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework label Feb 1, 2022
@mikecote
Copy link
Contributor

mikecote commented Feb 7, 2022

We can create sub-issues (extract type, interfaces and test fixtures, extract rule execution etc... ) and refactor them as small chunks.

We can have a team discussion before breaking it down or starting the work.

@mikecote mikecote changed the title [Alerting] Refactoring task_runner module [Alerting] research refactoring task_runner module Feb 7, 2022
@mikecote mikecote added the Meta label Feb 22, 2022
@gmmorris
Copy link
Contributor

I've been thinking about this issue.... I want to make sure we break this down into concrete steps and make each step an independent PR.
This is both to make sure that we deliver incremental value even if we run out of time in this minor, and to avoid a +25000-99999 PR to review 😂😭 .

I suggest we enumerate specific problems that we want to address.
I'll name several, but please add to this list:

  1. Alerts are grouped and classified (active/recovered/etc.) in several places - this should only be done once, in one place, and should be an independent unit.
  2. There isn't a clear breakdown of domains. This class does "the stuff that the runs the rule executor", but that touches on a bunch of domains - I think each domain belongs in its own unit:
    1. Rule Saved Object lifecycle
    2. Monitoring metric calculations
    3. Creating the services using a the fake request
    4. Event Log event creation and operations
    5. Alert lifecycle (classification, alert duration calculations etc.)

@mikecote
Copy link
Contributor

+1 I think it would be good to avoid moving and changing code in the same PR.

Some quick thoughts:

Alerts are grouped and classified (active/recovered/etc.) in several places - this should only be done once, in one place, and should be an independent unit.

It would be good to assign the alerts into three groups (newAlerts, ongoingAlerts, recoveredAlerts) and pass those everywhere. I've had good success using this approach in a POC and also using a fourth group generatedAlerts which references new and ongoing alerts.

I've also seen performance increase calculating these objects in a for loop (reducing event loop blocking time).

example code block from my ON week
const newAlerts: Record<
    string,
    CreatedAlert<InstanceState, InstanceContext, ActionGroupIds>
> = {};
const ongoingAlerts: Record<
    string,
    CreatedAlert<InstanceState, InstanceContext, ActionGroupIds>
> = {};
const generatedAlerts: Record<
    string,
    CreatedAlert<InstanceState, InstanceContext, ActionGroupIds>
> = {};
const recoveredAlerts: Record<
    string,
    CreatedAlert<InstanceState, InstanceContext, RecoveryActionGroupId>
> = {};
for (const id in alerts) {
    if (alerts.hasOwnProperty(id)) {
        const hasScheduledActions = alerts[id].hasScheduledActions();
        if (hasScheduledActions && !originalAlertIds.has(id)) {
            newAlerts[id] = alerts[id];
            generatedAlerts[id] = alerts[id];
        } else if (hasScheduledActions) {
            ongoingAlerts[id] = alerts[id];
            generatedAlerts[id] = alerts[id];
        } else {
            recoveredAlerts[id] = alerts[id];
        }
    }
}

There isn't a clear breakdown of domains. This class does "the stuff that the runs the rule executor", but that touches on a bunch of domains - I think each domain belongs in its own unit:

I could also see a domain for scheduling the actions. It happens once for ongoing alerts and another time for recovered alerts.

The logging of events could also be done in its own module for all types of alert events (new, active, recovered, etc).

@ymao1 ymao1 self-assigned this May 4, 2022
@ymao1
Copy link
Contributor

ymao1 commented May 4, 2022

The alerting task runner is a class that primarily implements 2 functions:

  • run()
  • cancel()

Every other method in the class and helper functions in the file are in support of these two functions. We'll focus on run() first since it contains the bulk of the functionality.

run()

  • starts APM transaction
  • initializes event log record keeping and writes execute-start event
  • calls loadRuleAttributesAndRun
  • processes the monitoring and metrics captured during run
  • processes the state returned during run
  • writes event log execute event
  • completes APM transaction
  • calculates the next interval for the rule

loadRuleAttributesAndRun()

  • retrieves encrypted attributes from rule SO
  • retrieves rule SO using rules client (this follows RBAC)
  • ensures rule type is enabled
  • calls validateAndExecuteRule

validateAndExecuteRule()

  • validates rule parameters using rule type validation
  • creates "execution handler" callback
  • calls executeRule with execution handler callback

executeRule()

  • sets up required services and calls the rule type executor
  • determines active/recovered/new alerts after rule type executor is done - this determination is unnecessarily done multiple times and in different places
  • writes event log documents for alerts

Here is my proposal for the work to be done

  1. encapsulate all event log transactions into a class (AlertingEventLog) with the following public functions:
  • initialize() - initializes the class with all the data required to populate the common metadata fields
  • start() - sets the start time for the transaction (shared between execute-start and execute and writes out the execute-start event
  • logActiveAlert()/logNewAlert()/logRecoveredAlert()/logTimeout()
  • complete() - sets metrics data for rule run and writes out execute event
  1. combine loadRuleAttributesAndRun() and validateAndExecuteRule() and extract it into a library function with a complete set of unit tests

  2. move the determination of new/active/recovered alerts to the AlertFactory. The AlertFactory contains all the information needed to make this determination. We should be able to add functions to getNewAlerts(), getActiveAlerts() and getRecoveredAlerts() and store it in the alert factory so even if they are called multiple times, the calculations are only done once. Keeping in mind that we should not be exposing these functions to the rule type executors (we want them only to use create() and done()). We should also see if we can move the functionality in trackAlertDurations into the alert factory so getActiveAlerts() returns alerts with updated duration and getRecoveredAlerts() returns alerts with updated duration and end time set.

  3. combine logActiveAndRecoveredAlerts, generateNewAndRecoveredAlertEvents so we're logging to the server and the event log in a single place. Consider extracting it to a library function with a complete set of unit tests.

  4. combine the logic inside

await Promise.all(
        alertsWithExecutableActions.map(
          ([alertId, alert]: [string, Alert<InstanceState, InstanceContext>]) =>
            this.executeAlert(alertId, alert, executionHandler, ruleRunMetricsStore)
        )
      );

await scheduleActionsForRecoveredAlerts<
        InstanceState,
        InstanceContext,
        RecoveryActionGroupId
      >({

into a single library function (scheduleActionsForAlerts) with a full set of unit tests.

@ymao1
Copy link
Contributor

ymao1 commented May 4, 2022

Closing this meta issue as all followup issues have been created

@ymao1 ymao1 closed this as completed May 4, 2022
pmuellr added a commit to pmuellr/kibana that referenced this issue May 11, 2022
…tributesAndRun() and validateAndExecuteRule()

resolves elastic#131544

Extract the `loadRuleAttributesAndRun()` and `validateAndExecuteRule()`
methods from the alerting task manager, into a separate module.

meta issue: elastic#124206
pmuellr added a commit to pmuellr/kibana that referenced this issue May 16, 2022
…tributesAndRun() and validateAndExecuteRule()

resolves elastic#131544

Extract the `loadRuleAttributesAndRun()` and `validateAndExecuteRule()`
methods from the alerting task manager, into a separate module.

meta issue: elastic#124206
pmuellr added a commit to pmuellr/kibana that referenced this issue May 18, 2022
…tributesAndRun() and validateAndExecuteRule()

resolves elastic#131544

Extract the `loadRuleAttributesAndRun()` and `validateAndExecuteRule()`
methods from the alerting task manager, into a separate module.

meta issue: elastic#124206
pmuellr added a commit to pmuellr/kibana that referenced this issue May 19, 2022
…tributesAndRun() and validateAndExecuteRule()

resolves elastic#131544

Extract the `loadRuleAttributesAndRun()` and `validateAndExecuteRule()`
methods from the alerting task manager, into a separate module.

meta issue: elastic#124206
pmuellr added a commit to pmuellr/kibana that referenced this issue May 25, 2022
…tributesAndRun() and validateAndExecuteRule()

resolves elastic#131544

Extract the `loadRuleAttributesAndRun()` and `validateAndExecuteRule()`
methods from the alerting task manager, into a separate module.

meta issue: elastic#124206
pmuellr added a commit to pmuellr/kibana that referenced this issue May 26, 2022
…tributesAndRun() and validateAndExecuteRule()

resolves elastic#131544

Extract the `loadRuleAttributesAndRun()` and `validateAndExecuteRule()`
methods from the alerting task manager, into a separate module.

meta issue: elastic#124206
pmuellr added a commit to pmuellr/kibana that referenced this issue Jun 2, 2022
…tributesAndRun() and validateAndExecuteRule()

resolves elastic#131544

Extract the `loadRuleAttributesAndRun()` and `validateAndExecuteRule()`
methods from the alerting task manager, into a separate module.

meta issue: elastic#124206
pmuellr added a commit that referenced this issue Jun 14, 2022
…utesAndRun() and validateAndExecuteRule() (#132079)

resolves #131544

Extract the `loadRuleAttributesAndRun()` and `validateAndExecuteRule()`
methods from the alerting task runner, into a separate module.

meta issue: #124206
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Meta Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)
Projects
No open projects
Development

No branches or pull requests

5 participants