
[ML] Context for recovered alerts #132496

Merged: 23 commits merged into elastic:main on May 23, 2022

Conversation

@darnautov darnautov commented May 19, 2022

Summary

Resolves #126803

Adds context for recovered alerts created by Anomaly Detection and Anomaly Detection Health rule types.

The recovered context uses the same fields as the context of active alerts, so no new fields are introduced.

For Anomaly Detection, recovered alerts contain the job IDs and a link to the Anomaly Explorer page.
For Anomaly Detection Health, the message field is updated according to the test, and the results contain extra data about the executed tests.
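
For reference, the mechanism works roughly as sketched below. This is an illustrative sketch only, assuming the alerting framework API of the time (doesSetRecoveryContext on the rule type definition and getRecoveredAlerts() from services.alertFactory.done() in the executor, as noted in the review below); the parameter shape and context values are hypothetical placeholders, not the PR's actual code.

// Minimal sketch of setting context on recovered alerts inside a rule executor.
// Assumes the Kibana alerting framework API of this era; the param shape and
// context values below are hypothetical placeholders.
const anomalyDetectionRuleType = {
  id: 'xpack.ml.anomaly_detection_alert',
  // Required so the framework lets the executor attach context to recovered alerts.
  doesSetRecoveryContext: true,
  async executor({ services, params }: { services: any; params: any }) {
    // ... evaluate the rule and schedule actions for active alerts first ...

    // Once active alerts have been reported, the framework exposes the recovered ones.
    const { getRecoveredAlerts } = services.alertFactory.done();
    for (const recoveredAlert of getRecoveredAlerts()) {
      recoveredAlert.setContext({
        jobIds: params.jobSelection?.jobIds ?? [],  // hypothetical param shape
        anomalyExplorerUrl: '/app/ml/explorer',     // the PR builds a time-ranged deep link
        message: 'No anomalies have been found that exceed the severity threshold.',
      });
    }
  },
};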

How to test

Anomaly detection rule
  1. Create a rule instance.
  2. Attach an action that runs when "Recovered", with a connector of your choice (e.g. Slack). At the moment the framework doesn't allow providing a default message for recovered alerts, so you can copy-paste the message from the "Anomaly score matched the condition" action group. Default message example:
RECOVERED!
[{{rule.name}}] Elastic Stack Machine Learning Alert:
- Job IDs: {{context.jobIds}}
- Time: {{date}}

{{context.message}}

{{! Replace kibanaBaseUrl if not configured in Kibana }}
[Open in Anomaly Explorer]({{{kibanaBaseUrl}}}{{{context.anomalyExplorerUrl}}})
  3. An alert recovers when no conclusive anomaly is found, so it's quite easy to test: ingest some data into the source index that triggers an anomaly, then wait for some time (how long depends on your job configuration, in particular the bucket_span and the query_delay of the datafeed).

(screenshot)

Note that for recovered alerts only the following context fields are populated (because no anomaly has actually been found); an illustrative sketch follows the list:

  1. context.jobIds
  2. context.anomalyExplorerUrl
  3. context.message
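
For illustration, the resolved context for a recovered anomaly detection alert might look like the sketch below; all values are hypothetical, and the message wording follows the recoveredMessage discussed in the review further down.

// Hypothetical example of the context a "Recovered" action receives; only the
// three fields listed above are populated. Values are made up for illustration.
const exampleRecoveredContext = {
  jobIds: ['my_anomaly_job'],
  anomalyExplorerUrl: '/app/ml/explorer?...', // deep link; query string elided
  message:
    'No anomalies have been found in the past 15m that exceed the severity threshold of 75.',
};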
Anomaly detection health rule

An example with the "Datafeed is not started" test:

  1. Create a rule instance and assign some open jobs; make sure the "Datafeed is not started" test is enabled.
  2. Attach an action that runs when "Recovered", with a connector of your choice (e.g. Slack). At the moment the framework doesn't allow providing a default message for recovered alerts, so you can copy-paste the message from the "Issue detected" action group. Default message example:
RECOVERED!
[{{rule.name}}] Anomaly detection jobs health check result:
{{context.message}}
{{#context.results}}
Job ID: {{job_id}}
{{#datafeed_id}}Datafeed ID: {{datafeed_id}}
{{/datafeed_id}}{{#datafeed_state}}Datafeed state: {{datafeed_state}}
{{/datafeed_state}}{{#memory_status}}Memory status: {{memory_status}}
{{/memory_status}}{{#model_bytes}}Model size: {{model_bytes}}
{{/model_bytes}}{{#model_bytes_memory_limit}}Model memory limit: {{model_bytes_memory_limit}}
{{/model_bytes_memory_limit}}{{#peak_model_bytes}}Peak model bytes: {{peak_model_bytes}}
{{/peak_model_bytes}}{{#model_bytes_exceeded}}Model exceeded: {{model_bytes_exceeded}}
{{/model_bytes_exceeded}}{{#log_time}}Memory logging time: {{log_time}}
{{/log_time}}{{#failed_category_count}}Failed category count: {{failed_category_count}}
{{/failed_category_count}}{{#annotation}}Annotation: {{annotation}}
{{/annotation}}{{#missed_docs_count}}Number of missed documents: {{missed_docs_count}}
{{/missed_docs_count}}{{#end_timestamp}}Latest finalized bucket with missing docs: {{end_timestamp}}
{{/end_timestamp}}{{#errors}}Error message: {{message}} {{/errors}}
{{/context.results}}
  5. Stop the datafeed of any anomaly detection job assigned to this rule.
  6. Start the datafeed again. Make sure that ALL jobs assigned to this rule have a started datafeed.
  7. Check the message; it should be similar to the screenshot below (an illustrative context sketch follows).

(screenshot)
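
For illustration, the recovered context for the "Datafeed is not started" check might resolve to something like the sketch below; the field names follow the default message template above, while the values and the exact message wording are assumptions.

// Hypothetical recovered context for the "Datafeed is not started" health check.
// Field names follow the template above; values and message wording are made up.
const exampleRecoveredHealthContext = {
  message: 'All datafeeds are started.', // assumption, not the PR's exact text
  results: [
    {
      job_id: 'my_anomaly_job',
      datafeed_id: 'datafeed-my_anomaly_job',
      datafeed_state: 'started',
    },
  ],
};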

Checklist

@darnautov darnautov self-assigned this May 19, 2022
@darnautov darnautov added the :ml, Feature:Alerting/RuleTypes, Team:ML, v8.3.0, and Feature:Anomaly Detection labels May 20, 2022
@darnautov darnautov marked this pull request as ready for review May 20, 2022 08:25
@darnautov darnautov requested a review from a team as a code owner May 20, 2022 08:25
@elasticmachine (Contributor) commented:
Pinging @elastic/ml-ui (:ml)

@darnautov darnautov requested a review from ymao1 May 20, 2022 09:13
@darnautov (author) commented:
@elasticmachine merge upstream


if (hardLimitCount > 0) {
message = i18n.translate('xpack.ml.alertTypes.jobsHealthAlertingRule.mmlMessage', {
defaultMessage: `{count, plural, one {Job} other {Jobs}} {jobsString} reached the hard model memory limit. Assign the job more memory and restore from a snapshot from prior to reaching the hard limit.`,
A reviewer suggested this change:
defaultMessage: `{count, plural, one {Job} other {Jobs}} {jobsString} reached the hard model memory limit. Assign the job more memory and restore from a snapshot from prior to reaching the hard limit.`,
defaultMessage: `{count, plural, one {Job} other {Jobs}} {jobsString} reached the hard model memory limit. Assign more memory to the job and restore it from a snapshot taken prior to reaching the hard limit.`,

@darnautov (author) replied:

FYI this text has been there for a while. It's also been reviewed before

@darnautov (author):

Updated in 552d005

'xpack.ml.alertTypes.jobsHealthAlertingRule.mmlSoftLimitMessage',
{
defaultMessage:
'{count, plural, one {Job} other {Jobs}} {jobsString} reached the soft model memory limit. Assign the job more memory or edit the datafeed filter to limit scope of analysis.',
A reviewer suggested this change:
'{count, plural, one {Job} other {Jobs}} {jobsString} reached the soft model memory limit. Assign the job more memory or edit the datafeed filter to limit scope of analysis.',
'{count, plural, one {Job} other {Jobs}} {jobsString} reached the soft model memory limit. Assign more memory to the job or edit the datafeed filter to limit the scope of analysis.',

@darnautov (author) replied:

FYI this text has been there for a while. It's also been reviewed before

The reviewer replied:

Thanks for the context. It still sounds better this way in my opinion. Please take it or leave it as you wish.

@darnautov (author):

Updated in 552d005.

x-pack/plugins/ml/server/lib/alerts/jobs_health_service.ts (outdated review comment, resolved)
@ymao1 (Contributor) left a comment:

LGTM! Code review only for use of getRecoveredAlerts() inside the rule executors

@peteharverson (Contributor) left a comment:

Left another suggestion for one of the messages, but otherwise tested and LGTM.

'xpack.ml.alertTypes.anomalyDetectionAlertingRule.recoveredMessage',
{
defaultMessage:
'No anomalies have been found that exceeded the [{severity}] threshold.',
The reviewer commented:

I'd edit this to No anomalies have been found that exceed the severity threshold of {severity}.

I don't think it needs the square brackets around the severity value.

Or better still, could you include the lookback interval in there too? Such as

No anomalies have been found in the past {lookbackInterval} that exceed the severity threshold of {severity}.

@darnautov (author):

Updated in b27c03a
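
For illustration, assuming the reviewer's suggested wording was adopted as-is, the updated message could look roughly like the sketch below (the exact change is in b27c03a):

import { i18n } from '@kbn/i18n';

// Hedged reconstruction of the recovered message, not the exact code from b27c03a.
function getRecoveredMessage(lookbackInterval: string, severity: number): string {
  return i18n.translate('xpack.ml.alertTypes.anomalyDetectionAlertingRule.recoveredMessage', {
    defaultMessage:
      'No anomalies have been found in the past {lookbackInterval} that exceed the severity threshold of {severity}.',
    values: { lookbackInterval, severity },
  });
}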

@darnautov darnautov enabled auto-merge (squash) May 23, 2022 15:17
@szabosteve (Contributor) left a comment:

UI text LGTM.

@kibana-ci (Collaborator) commented:
💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id | before | after | diff
ml | 3.3MB | 3.3MB | +24.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @darnautov

@darnautov darnautov merged commit 7a5fef1 into elastic:main May 23, 2022
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label May 23, 2022
@darnautov darnautov deleted the ml-126803-recovered-alerts-context branch May 24, 2022 12:44
j-bennet pushed a commit to j-bennet/kibana that referenced this pull request Jun 2, 2022
* recovered context for ad alerting rule

* datafeed report for recovered alerts

* mml report for recovered alerts

* update executor for setting recovered context

* update jest tests, fix mml check

* update error messages check

* update jest tests

* update delayed data test

* fix the mml check

* enable doesSetRecoveryContext

* add rule.name to the default message

* fix datafeed check

* recovered message

* refactor, update anomaly explorer URL time range for recovered alerts

* update message for recovered errorMessage alert

* update delayedDataRecoveryMessage

* fix time range

* update message for recovered anomaly detection alert

* update mml messages
Labels
backport:skip, Feature:Alerting/RuleTypes, Feature:Anomaly Detection, :ml, release_note:enhancement, Team:ML, v8.3.0
Development

Successfully merging this pull request may close these issues: [ML] Add context for recovered alerts (#126803)
7 participants