[meta] Alert Summaries #143200

Open
10 of 11 tasks
ersin-erdal opened this issue Oct 12, 2022 · 4 comments
Labels
Feature:Alerting Meta Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@ersin-erdal
Contributor

ersin-erdal commented Oct 12, 2022

Breakdown

Groundwork

Feature development

To onboard detection rules onto alert summaries

Summary

In order to reduce action storms, the system needs to provide the capability of invoking a single action whenever the rule finds one or many alerts.

Problem

  • The granularity at which actions are triggered is locked to the granularity at which alerts are created.
  • Users face noisy action storms whenever an alerting rule encounters a high cardinality of alerts.
  • Users, at times, only want to be notified once when an alerting rule finds one or many alerts.

Terminology

New Alert (for both "Lifecycle Alerts" and "Persistent Alerts"):

For summary of a single rule run: A new alert is an alert that is generated by the current execution.

For summary on a custom interval: A new alert is an alert that has been generated within the given interval.

Ongoing Alert (for "Lifecycle Alerts"):

For summary of a single rule run: An ongoing alert is an alert that was generated by one of the previous executions and is still active (not recovered yet).

For summary on a custom interval: An ongoing alert is an alert that was generated before the given interval and is still active (not recovered yet).

Recovered Alert (for "Lifecycle Alerts"):

For summary of a single rule run: A recovered alert is an alert that was generated by one of the previous executions but is no longer active (no longer matches the rule conditions).

For summary on a custom interval: A recovered alert is an alert that was generated before the given custom interval but is no longer active (no longer matches the rule conditions).
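
The three definitions above can be sketched as a single classification function. This is only an illustration: the `AlertRecord` shape, its field names, and `classifyAlert` are hypothetical and are not taken from the Kibana codebase. For a single rule run, `windowStart` would be the start of the current execution; for a custom interval, it would be the start of that interval. The terminology does not say how to treat an alert that is both generated and recovered inside the window, so the sketch assumes it counts as new.

```typescript
// Hypothetical alert shape; field names are illustrative only.
interface AlertRecord {
  id: string;
  startedAt: number; // epoch ms when the alert was first generated
  recoveredAt?: number; // epoch ms when it stopped matching the rule, if it did
}

type AlertStatus = 'new' | 'ongoing' | 'recovered';

// Classify one alert relative to a summary window starting at `windowStart`.
function classifyAlert(alert: AlertRecord, windowStart: number): AlertStatus {
  // Generated inside the window: new, regardless of later recovery (assumption).
  if (alert.startedAt >= windowStart) return 'new';
  // Generated before the window and no longer active: recovered.
  if (alert.recoveredAt !== undefined) return 'recovered';
  // Generated before the window and still active: ongoing.
  return 'ongoing';
}
```

For example, with `windowStart = 100`, an alert with `startedAt: 200` is classified as `'new'`, one with `startedAt: 50` and no `recoveredAt` as `'ongoing'`, and one with `startedAt: 50, recoveredAt: 150` as `'recovered'`.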

Functional Specification

  1. The summary feature is to be available to rules using alerts-as-data.
  2. Alert summarization (a single notification) should be configurable per action.
  3. Users are able to make the rule generate summaries in the following scenarios:
    a. Single action summarizing alerts from a single rule run (this should be available only to rules that have a "Group By" option)
    b. Single action summarizing alerts from rule runs over a specific interval
  4. When a summary interval (summary of alerts on a custom interval) is shorter than the rule interval, users will be prompted to fix it.
  5. Mustache variables payload (example: used as {{rule.name}})
{
  date,
  kibanaBaseUrl,
  rule: { id, name, type, spaceId, tags},
  alerts: {
    total: number,
    new: { 
       count: number, 
       data: [
          // alerts-as-data documents
        ]
    },
    ongoing: same structure as new alerts,
    recovered: same structure as new alerts,
    all: same structure as new alerts,
  }
}
  6. Summary email example
Since DD.MM.YYYY (Last summary),  
There were X New,  Y Ongoing, and Z Resolved alerts.

You can see the most recent alerts grouped by status below. 
Please click here to see the latest list of alerts

Since					Reason
Sep 13, 2022, 19:13:58.070 (UTC+2)  	system.cpu is 90% in the last 1 min for 0. Alert when > 10%.
Sep 13, 2022, 19:16:58.070 (UTC+2)  	system.cpu is 90% in the last 1 min for 0. Alert when > 10%.
Sep 13, 2022, 19:20:58.070 (UTC+2)	system.cpu is 90% in the last 1 min for 0. Alert when > 10%.
  7. A default summary message should be provided per Rule and Action type.
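
As a rough sketch of the payload described in the specification above, the structure could be typed as follows. The property names come from the payload listing. The concrete types of `date`, `kibanaBaseUrl`, and the `rule` fields are assumptions, as is the idea that `total` equals the size of `all`. The helper names (`toGroup`, `buildSummaryAlerts`, `isSummaryIntervalValid`) are hypothetical and only illustrate the shape and the interval check.

```typescript
// One status bucket: a count plus the alerts-as-data documents.
interface AlertGroup<T> {
  count: number;
  data: T[];
}

interface SummaryAlerts<T> {
  total: number;
  new: AlertGroup<T>;
  ongoing: AlertGroup<T>;
  recovered: AlertGroup<T>;
  all: AlertGroup<T>;
}

// Field types below are assumptions; the spec lists only the names.
interface SummaryPayload<T> {
  date: string;
  kibanaBaseUrl: string;
  rule: { id: string; name: string; type: string; spaceId: string; tags: string[] };
  alerts: SummaryAlerts<T>;
}

function toGroup<T>(data: T[]): AlertGroup<T> {
  return { count: data.length, data };
}

// Assemble the `alerts` object from already-classified alert documents.
function buildSummaryAlerts<T>(newAlerts: T[], ongoing: T[], recovered: T[]): SummaryAlerts<T> {
  const all = [...newAlerts, ...ongoing, ...recovered];
  return {
    total: all.length, // assumption: total counts every alert in `all`
    new: toGroup(newAlerts),
    ongoing: toGroup(ongoing),
    recovered: toGroup(recovered),
    all: toGroup(all),
  };
}

// A summary interval shorter than the rule interval is invalid;
// the UI would prompt the user to fix it.
function isSummaryIntervalValid(summaryIntervalMs: number, ruleIntervalMs: number): boolean {
  return summaryIntervalMs >= ruleIntervalMs;
}
```

With this shape, a Mustache template would address the buckets by path, for example `{{alerts.new.count}}` alongside `{{rule.name}}`.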

Non-Functional Specification

  1. Alert summary feature should support reporting at least the last 1000 alerts

Feature Metrics

Initially we can collect the new summary feature adoption rate:

  • Number of actions using the summary feature
  • Number of rules that have at least one action using the summary feature
  • Actions using the summary feature grouped by connector types.
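
The three adoption metrics above could be computed in one pass over the rules. This is a minimal sketch under assumed shapes: `RuleConfig`, `ActionConfig`, the `usesSummary` flag, and `collectSummaryAdoption` are hypothetical names, not the actual Kibana telemetry schema.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface ActionConfig {
  connectorType: string;
  usesSummary: boolean;
}

interface RuleConfig {
  id: string;
  actions: ActionConfig[];
}

interface SummaryAdoption {
  summaryActions: number;
  rulesWithSummaryAction: number;
  summaryActionsByConnectorType: Record<string, number>;
}

function collectSummaryAdoption(rules: RuleConfig[]): SummaryAdoption {
  const result: SummaryAdoption = {
    summaryActions: 0,
    rulesWithSummaryAction: 0,
    summaryActionsByConnectorType: {},
  };
  for (const rule of rules) {
    const summaryActions = rule.actions.filter((action) => action.usesSummary);
    // Count the rule once if any of its actions uses the summary feature.
    if (summaryActions.length > 0) result.rulesWithSummaryAction += 1;
    for (const action of summaryActions) {
      result.summaryActions += 1;
      const byType = result.summaryActionsByConnectorType;
      byType[action.connectorType] = (byType[action.connectorType] ?? 0) + 1;
    }
  }
  return result;
}
```
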
@banderror
Contributor

In Security Solution, users can include context.alerts (which is an array) into their action body templates.

@ersin-erdal Is the proposal to replace context.alerts array with the new data structure:

  alerts: {
    total: number,
    new: { 
       count: number, 
       data: [
          // alerts-as-data documents
        ]
    },
    ongoing: same structure as new alerts,
    recovered: same structure as new alerts,
    all: same structure as new alerts,
  }

thus making this a breaking change, or to add alerts (the new data structure) in addition to the existing context.alerts array?

@banderror
Contributor

cc @XavierM

@ersin-erdal
Contributor Author

ersin-erdal commented Nov 8, 2022

Hi @banderror

Sorry for the confusion, I deleted my previous comment.

Yes, this will be in addition to the existing context.alerts for the sake of backward compatibility.
But you need to start using the new alerts object once you adopt the new summary feature.

@banderror
Copy link
Contributor

@ersin-erdal Gotcha, thank you for the clarification!
