[Response Ops][Docs] Alerting circuit breaker docs (#131459)
* Circuit breaker docs

* Apply suggestions from code review

Co-authored-by: Lisa Cawley <lcawley@elastic.co>
ymao1 and lcawl committed May 4, 2022
1 parent c43a51d commit 2dcbcb4
Showing 2 changed files with 49 additions and 2 deletions.
4 changes: 2 additions & 2 deletions docs/settings/alert-action-settings.asciidoc
@@ -198,13 +198,13 @@ Specifies the minimum schedule interval for rules. This minimum is applied to al
+
`<count>[s,m,h,d]`
+
- For example, `20m`, `24h`, `7d`. Default: `1m`.
+ For example, `20m`, `24h`, `7d`. This duration cannot exceed `1d`. Default: `1m`.

`xpack.alerting.rules.minimumScheduleInterval.enforce`::
Specifies the behavior when a new or changed rule has a schedule interval less than the value defined in `xpack.alerting.rules.minimumScheduleInterval.value`. If `false`, rules with schedules less than the interval will be created but warnings will be logged. If `true`, rules with schedules less than the interval cannot be created. Default: `false`.

`xpack.alerting.rules.run.actions.max`::
- Specifies the maximum number of actions that a rule can trigger each time detection checks run.
+ Specifies the maximum number of actions that a rule can generate each time detection checks run.

`xpack.alerting.rules.run.timeout`::
Specifies the default timeout for tasks associated with all types of rules. The time is formatted as:
@@ -64,3 +64,50 @@ Because {kib} uses the documents to display historic data, you should set the de

For more information on index lifecycle management, see:
{ref}/index-lifecycle-management.html[Index Lifecycle Policies].

[float]
[[alerting-circuit-breakers]]
=== Circuit breakers

Running alerting rules and actions can degrade the overall health of a {kib} instance, either by clogging up Task Manager throughput or by consuming so much CPU or memory that other operations cannot complete in a reasonable amount of time. Several <<alert-settings,configurable>> circuit breakers help minimize these effects.

[float]
==== Rules with very short intervals

Running large numbers of rules at very short intervals can quickly clog up Task Manager throughput, leading to higher schedule drift. Use `xpack.alerting.rules.minimumScheduleInterval.value` to set a minimum schedule interval for rules. The default (and recommended) value for this configuration is `1m`. Use `xpack.alerting.rules.minimumScheduleInterval.enforce` to specify whether to strictly enforce this minimum. While the default value for this setting is `false` to maintain backwards compatibility with existing rules, set this to `true` to prevent new and updated rules from running at an interval below the minimum.
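
As a minimal sketch, assuming these settings are placed in `kibana.yml`, the following configuration keeps the default `1m` minimum and enforces it strictly. The values shown are illustrative rather than a recommendation for every deployment.

[source,yaml]
--
# Illustrative values only; adjust to your deployment.
xpack.alerting.rules.minimumScheduleInterval:
  value: '1m'     # minimum schedule interval for rules
  enforce: true   # reject new or updated rules that run below the minimum
--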

[float]
==== Rules that run for a long time

Rules that run for a long time typically do so because they are issuing resource-intensive {es} queries or performing CPU-intensive processing. This can block the event loop, making {kib} inaccessible while the rule runs. By default, rule processing is cancelled after `5m`, but this can be overridden using the `xpack.alerting.rules.run.timeout` configuration. This value can also be configured per rule type using `xpack.alerting.rules.run.ruleTypeOverrides`. For example, the following configuration sets the global timeout value to `1m` while allowing *Index Threshold* rules to run for `10m` before being cancelled.

[source,yaml]
--
xpack.alerting.rules.run:
  timeout: '1m'
  ruleTypeOverrides:
    - id: '.index-threshold'
      timeout: '10m'
--

When a rule run is cancelled, any alerts and actions that were generated during the run are discarded. This behavior is controlled by the `xpack.alerting.cancelAlertsOnRuleTimeout` configuration, which defaults to `true`. Set this to `false` to receive alerts and actions after the timeout, although be aware that these may be incomplete and possibly inaccurate.
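
As a sketch, assuming the setting is placed in `kibana.yml` alongside the other alerting settings, the following configuration keeps the alerts and actions generated by a rule run that timed out:

[source,yaml]
--
# Keep alerts and actions from timed-out runs.
# They may be incomplete or inaccurate.
xpack.alerting.cancelAlertsOnRuleTimeout: false
--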

[float]
==== Rules that spawn too many actions

Rules that spawn too many actions can quickly clog up Task Manager throughput. This can occur if:

* A rule configured with a single action generates many alerts. For example, if a rule configured to run a single email action generates 100,000 alerts, then 100,000 actions will be scheduled during a run.
* A rule configured with multiple actions generates alerts. For example, if a rule configured to run an email action, a server log action and a webhook action generates 30,000 alerts, then 90,000 actions will be scheduled during a run.

Use `xpack.alerting.rules.run.actions.max` to limit the number of actions a rule can generate per run. This value can also be configured by connector type using `xpack.alerting.rules.run.actions.connectorTypeOverrides`. For example, the following configuration sets the global maximum number of actions to 100 while allowing rules with *Email* actions to generate up to 200 actions.

[source,yaml]
--
xpack.alerting.rules.run:
  actions:
    max: 100
    connectorTypeOverrides:
      - id: '.email'
        max: 200
--
