From 2dcbcb45d106a0a5fe16ae09dccba93bb924bf89 Mon Sep 17 00:00:00 2001
From: Ying Mao
Date: Wed, 4 May 2022 15:04:25 -0400
Subject: [PATCH] [Response Ops][Docs] Alerting circuit breaker docs (#131459)

* Circuit breaker docs

* Apply suggestions from code review

Co-authored-by: Lisa Cawley

Co-authored-by: Lisa Cawley
---
 docs/settings/alert-action-settings.asciidoc  |  4 +-
 ...lerting-production-considerations.asciidoc | 47 +++++++++++++++++++
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/docs/settings/alert-action-settings.asciidoc b/docs/settings/alert-action-settings.asciidoc
index 33c97373a76c4e..f3fa027cb186e7 100644
--- a/docs/settings/alert-action-settings.asciidoc
+++ b/docs/settings/alert-action-settings.asciidoc
@@ -198,13 +198,13 @@ Specifies the minimum schedule interval for rules. This minimum is applied to al
 +
 `[s,m,h,d]`
 +
-For example, `20m`, `24h`, `7d`. Default: `1m`.
+For example, `20m`, `24h`, `7d`. This duration cannot exceed `1d`. Default: `1m`.
 
 `xpack.alerting.rules.minimumScheduleInterval.enforce`::
 Specifies the behavior when a new or changed rule has a schedule interval less than the value defined in `xpack.alerting.rules.minimumScheduleInterval.value`. If `false`, rules with schedules less than the interval will be created but warnings will be logged. If `true`, rules with schedules less than the interval cannot be created. Default: `false`.
 
 `xpack.alerting.rules.run.actions.max`::
-Specifies the maximum number of actions that a rule can trigger each time detection checks run.
+Specifies the maximum number of actions that a rule can generate each time detection checks run.
 
 `xpack.alerting.rules.run.timeout`::
 Specifies the default timeout for tasks associated with all types of rules. The time is formatted as:
diff --git a/docs/user/production-considerations/alerting-production-considerations.asciidoc b/docs/user/production-considerations/alerting-production-considerations.asciidoc
index 42f9a17cc6f888..f94ec1c3d04bbc 100644
--- a/docs/user/production-considerations/alerting-production-considerations.asciidoc
+++ b/docs/user/production-considerations/alerting-production-considerations.asciidoc
@@ -64,3 +64,50 @@ Because {kib} uses the documents to display historic data, you should set the de
 
 For more information on index lifecycle management, see:
 {ref}/index-lifecycle-management.html[Index Lifecycle Policies].
+
+[float]
+[[alerting-circuit-breakers]]
+=== Circuit breakers
+
+There are several scenarios where running alerting rules and actions can start to negatively impact the overall health of a {kib} instance either by clogging up Task Manager throughput or by consuming so much CPU/memory that other operations cannot complete in a reasonable amount of time. There are several <> circuit breakers to help minimize these effects.
+
+[float]
+==== Rules with very short intervals
+
+Running large numbers of rules at very short intervals can quickly clog up Task Manager throughput, leading to higher schedule drift. Use `xpack.alerting.rules.minimumScheduleInterval.value` to set a minimum schedule interval for rules. The default (and recommended) value for this configuration is `1m`. Use `xpack.alerting.rules.minimumScheduleInterval.enforce` to specify whether to strictly enforce this minimum. While the default value for this setting is `false` to maintain backwards compatibility with existing rules, set this to `true` to prevent new and updated rules from running at an interval below the minimum.
+
+[float]
+==== Rules that run for a long time
+
+Rules that run for a long time typically do so because they are issuing resource-intensive {es} queries or performing CPU-intensive processing. This can block the event loop, making {kib} inaccessible while the rule runs. By default, rule processing is cancelled after `5m` but this can be overridden using the `xpack.alerting.rules.run.timeout` configuration. This value can also be configured per rule type using `xpack.alerting.rules.run.ruleTypeOverrides`. For example, the following configuration sets the global timeout value to `1m` while allowing *Index Threshold* rules to run for `10m` before being cancelled.
+
+[source,yaml]
+--
+xpack.alerting.rules.run:
+  timeout: '1m'
+  ruleTypeOverrides:
+    - id: '.index-threshold'
+      timeout: '10m'
+--
+
+When a rule run is cancelled, any alerts and actions that were generated during the run are discarded. This behavior is controlled by the `xpack.alerting.cancelAlertsOnRuleTimeout` configuration, which defaults to `true`. Set this to `false` to receive alerts and actions after the timeout, although be aware that these may be incomplete and possibly inaccurate.
+
+[float]
+==== Rules that spawn too many actions
+
+Rules that spawn too many actions can quickly clog up Task Manager throughput. This can occur if:
+
+* A rule configured with a single action generates many alerts. For example, if a rule configured to run a single email action generates 100,000 alerts, then 100,000 actions will be scheduled during a run.
+* A rule configured with multiple actions generates alerts. For example, if a rule configured to run an email action, a server log action and a webhook action generates 30,000 alerts, then 90,000 actions will be scheduled during a run.
+
+Use `xpack.alerting.rules.run.actions.max` to limit the maximum number of actions a rule can generate per run. This value can also be configured by connector type using `xpack.alerting.rules.run.actions.connectorTypeOverrides`. For example, the following config sets the global maximum number of actions to 100 while allowing rules with *Email* actions to generate up to 200 actions.
+
+[source,yaml]
+--
+xpack.alerting.rules.run:
+  actions:
+    max: 100
+    connectorTypeOverrides:
+      - id: '.email'
+        max: 200
+--
\ No newline at end of file
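
The "Rules with very short intervals" section added by this patch is the only circuit breaker documented without a configuration example. The fragment below is a minimal sketch of what one might look like in `kibana.yml`. It assumes the two setting names given in the patch and the same nested-key layout as the patch's other examples. The `5m` value is hypothetical and chosen only for illustration, not a recommendation.

[source,yaml]
--
# Illustrative sketch only; the values below are assumptions, not project guidance.
xpack.alerting.rules.minimumScheduleInterval:
  value: '5m'     # hypothetical; the documented default is '1m' and the limit is '1d'
  enforce: true   # reject new or updated rules scheduled below `value`
--

With `enforce` left at its documented default of `false`, the same `value` would only cause warnings to be logged for rules scheduled below it.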