
[ML] Context for recovered alerts #132496

Merged: 23 commits merged into elastic:main on May 23, 2022

Conversation

@darnautov darnautov commented May 19, 2022

Summary

Resolves #126803

Adds context for recovered alerts created by Anomaly Detection and Anomaly Detection Health rule types.

The recovered context uses the same fields as the context of active alerts, so no new fields are introduced.

For Anomaly Detection, recovered alerts contain the job IDs and a link to the Anomaly Explorer page.
For Anomaly Detection Health, the message field is updated according to the test, and the results contain extra data about the executed tests.
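
For reference, the mechanism works roughly as sketched below. This is an illustrative sketch only, assuming the alerting framework API of the time (doesSetRecoveryContext on the rule type definition and getRecoveredAlerts() from services.alertFactory.done() in the executor, as noted in the review below); the parameter shape and context values are hypothetical placeholders, not the PR's actual code.

// Minimal sketch of setting context on recovered alerts inside a rule executor.
// Assumes the Kibana alerting framework API of this era; the param shape and
// context values below are hypothetical placeholders.
const anomalyDetectionRuleType = {
  id: 'xpack.ml.anomaly_detection_alert',
  // Required so the framework lets the executor attach context to recovered alerts.
  doesSetRecoveryContext: true,
  async executor({ services, params }: { services: any; params: any }) {
    // ... evaluate the rule and schedule actions for active alerts first ...

    // Once active alerts have been reported, the framework exposes the recovered ones.
    const { getRecoveredAlerts } = services.alertFactory.done();
    for (const recoveredAlert of getRecoveredAlerts()) {
      recoveredAlert.setContext({
        jobIds: params.jobSelection?.jobIds ?? [],  // hypothetical param shape
        anomalyExplorerUrl: '/app/ml/explorer',     // the PR builds a time-ranged deep link
        message: 'No anomalies have been found that exceed the severity threshold.',
      });
    }
  },
};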

How to test

Anomaly detection rule
  1. Create a rule instance.
  2. Attach an action that runs when "Recovered", with a connector of your choice (e.g. Slack). At the moment the framework doesn't allow providing a default message for recovered alerts, so you can copy-paste the message from the "Anomaly score matched the condition" action group. Default message example:
RECOVERED!
[{{rule.name}}] Elastic Stack Machine Learning Alert:
- Job IDs: {{context.jobIds}}
- Time: {{date}}

{{context.message}}

{{! Replace kibanaBaseUrl if not configured in Kibana }}
[Open in Anomaly Explorer]({{{kibanaBaseUrl}}}{{{context.anomalyExplorerUrl}}})
  3. An alert recovers when no conclusive anomaly is found, so it's quite easy to test: ingest some data into the source index that triggers an anomaly, then wait for some time (how long depends on your job configuration, in particular the bucket_span and the query_delay of the datafeed).

(screenshot)

Note that for recovered alerts only the following context fields are populated (because no anomaly has actually been found); an illustrative sketch follows the list:

  1. context.jobIds
  2. context.anomalyExplorerUrl
  3. context.message
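
For illustration, the resolved context for a recovered anomaly detection alert might look like the sketch below; all values are hypothetical, and the message wording follows the recoveredMessage discussed in the review further down.

// Hypothetical example of the context a "Recovered" action receives; only the
// three fields listed above are populated. Values are made up for illustration.
const exampleRecoveredContext = {
  jobIds: ['my_anomaly_job'],
  anomalyExplorerUrl: '/app/ml/explorer?...', // deep link; query string elided
  message:
    'No anomalies have been found in the past 15m that exceed the severity threshold of 75.',
};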
Anomaly detection health rule

An example with the "Datafeed is not started" test:

  1. Create a rule instance and assign some open jobs; make sure the "Datafeed is not started" test is enabled.
  2. Attach an action that runs when "Recovered", with a connector of your choice (e.g. Slack). At the moment the framework doesn't allow providing a default message for recovered alerts, so you can copy-paste the message from the "Issue detected" action group. Default message example:
RECOVERED!
[{{rule.name}}] Anomaly detection jobs health check result:
{{context.message}}
{{#context.results}}
Job ID: {{job_id}}
{{#datafeed_id}}Datafeed ID: {{datafeed_id}}
{{/datafeed_id}}{{#datafeed_state}}Datafeed state: {{datafeed_state}}
{{/datafeed_state}}{{#memory_status}}Memory status: {{memory_status}}
{{/memory_status}}{{#model_bytes}}Model size: {{model_bytes}}
{{/model_bytes}}{{#model_bytes_memory_limit}}Model memory limit: {{model_bytes_memory_limit}}
{{/model_bytes_memory_limit}}{{#peak_model_bytes}}Peak model bytes: {{peak_model_bytes}}
{{/peak_model_bytes}}{{#model_bytes_exceeded}}Model exceeded: {{model_bytes_exceeded}}
{{/model_bytes_exceeded}}{{#log_time}}Memory logging time: {{log_time}}
{{/log_time}}{{#failed_category_count}}Failed category count: {{failed_category_count}}
{{/failed_category_count}}{{#annotation}}Annotation: {{annotation}}
{{/annotation}}{{#missed_docs_count}}Number of missed documents: {{missed_docs_count}}
{{/missed_docs_count}}{{#end_timestamp}}Latest finalized bucket with missing docs: {{end_timestamp}}
{{/end_timestamp}}{{#errors}}Error message: {{message}} {{/errors}}
{{/context.results}}
  5. Stop the datafeed of any anomaly detection job assigned to this rule.
  6. Start the datafeed again. Make sure that ALL jobs assigned to this rule have a started datafeed.
  7. Check the message; it should be similar to the screenshot below (an illustrative context sketch follows).

(screenshot)
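
For illustration, the recovered context for the "Datafeed is not started" check might resolve to something like the sketch below; the field names follow the default message template above, while the values and the exact message wording are assumptions.

// Hypothetical recovered context for the "Datafeed is not started" health check.
// Field names follow the template above; values and message wording are made up.
const exampleRecoveredHealthContext = {
  message: 'All datafeeds are started.', // assumption, not the PR's exact text
  results: [
    {
      job_id: 'my_anomaly_job',
      datafeed_id: 'datafeed-my_anomaly_job',
      datafeed_state: 'started',
    },
  ],
};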

Checklist

@darnautov darnautov self-assigned this May 19, 2022
@darnautov darnautov added the :ml, Feature:Alerting/RuleTypes, Team:ML, v8.3.0, and Feature:Anomaly Detection labels May 20, 2022
@darnautov darnautov marked this pull request as ready for review May 20, 2022 08:25
@darnautov darnautov requested a review from a team as a code owner May 20, 2022 08:25
@elasticmachine (Contributor) commented:
Pinging @elastic/ml-ui (:ml)

@darnautov darnautov requested a review from ymao1 May 20, 2022 09:13
@darnautov (author) commented:
@elasticmachine merge upstream


if (hardLimitCount > 0) {
message = i18n.translate('xpack.ml.alertTypes.jobsHealthAlertingRule.mmlMessage', {
defaultMessage: `{count, plural, one {Job} other {Jobs}} {jobsString} reached the hard model memory limit. Assign the job more memory and restore from a snapshot from prior to reaching the hard limit.`,
A reviewer suggested this change:
defaultMessage: `{count, plural, one {Job} other {Jobs}} {jobsString} reached the hard model memory limit. Assign the job more memory and restore from a snapshot from prior to reaching the hard limit.`,
defaultMessage: `{count, plural, one {Job} other {Jobs}} {jobsString} reached the hard model memory limit. Assign more memory to the job and restore it from a snapshot taken prior to reaching the hard limit.`,

@darnautov (author) replied:

FYI this text has been there for a while. It's also been reviewed before

@darnautov (author):

Updated in 552d005

'xpack.ml.alertTypes.jobsHealthAlertingRule.mmlSoftLimitMessage',
{
defaultMessage:
'{count, plural, one {Job} other {Jobs}} {jobsString} reached the soft model memory limit. Assign the job more memory or edit the datafeed filter to limit scope of analysis.',
A reviewer suggested this change:
'{count, plural, one {Job} other {Jobs}} {jobsString} reached the soft model memory limit. Assign the job more memory or edit the datafeed filter to limit scope of analysis.',
'{count, plural, one {Job} other {Jobs}} {jobsString} reached the soft model memory limit. Assign more memory to the job or edit the datafeed filter to limit the scope of analysis.',

@darnautov (author) replied:

FYI this text has been there for a while. It's also been reviewed before

The reviewer replied:

Thanks for the context. It still sounds better this way in my opinion. Please take it or leave it as you wish.

@darnautov (author):

Updated in 552d005.

x-pack/plugins/ml/server/lib/alerts/jobs_health_service.ts (outdated review comment, resolved)
@ymao1 (Contributor) left a comment:

LGTM! Code review only for use of getRecoveredAlerts() inside the rule executors

@peteharverson (Contributor) left a comment:

Left another suggestion for one of the messages, but otherwise tested and LGTM.

'xpack.ml.alertTypes.anomalyDetectionAlertingRule.recoveredMessage',
{
defaultMessage:
'No anomalies have been found that exceeded the [{severity}] threshold.',
The reviewer commented:

I'd edit this to No anomalies have been found that exceed the severity threshold of {severity}.

I don't think it needs the square brackets around the severity value.

Or better still, could you include the lookback interval in there too? Such as

No anomalies have been found in the past {lookbackInterval} that exceed the severity threshold of {severity}.

@darnautov (author):

Updated in b27c03a
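
For illustration, assuming the reviewer's suggested wording was adopted as-is, the updated message could look roughly like the sketch below (the exact change is in b27c03a):

import { i18n } from '@kbn/i18n';

// Hedged reconstruction of the recovered message, not the exact code from b27c03a.
function getRecoveredMessage(lookbackInterval: string, severity: number): string {
  return i18n.translate('xpack.ml.alertTypes.anomalyDetectionAlertingRule.recoveredMessage', {
    defaultMessage:
      'No anomalies have been found in the past {lookbackInterval} that exceed the severity threshold of {severity}.',
    values: { lookbackInterval, severity },
  });
}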

@darnautov darnautov enabled auto-merge (squash) May 23, 2022 15:17
@szabosteve (Contributor) left a comment:

UI text LGTM.

@kibana-ci (Collaborator) commented:
💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id | before | after | diff
ml | 3.3MB | 3.3MB | +24.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @darnautov

@darnautov darnautov merged commit 7a5fef1 into elastic:main May 23, 2022
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label May 23, 2022
@darnautov darnautov deleted the ml-126803-recovered-alerts-context branch May 24, 2022 12:44
j-bennet pushed a commit to j-bennet/kibana that referenced this pull request Jun 2, 2022
* recovered context for ad alerting rule

* datafeed report for recovered alerts

* mml report for recovered alerts

* update executor for setting recovered context

* update jest tests, fix mml check

* update error messages check

* update jest tests

* update delayed data test

* fix the mml check

* enable doesSetRecoveryContext

* add rule.name to the default message

* fix datafeed check

* recovered message

* refactor, update anomaly explorer URL time range for recovered alerts

* update message for recovered errorMessage alert

* update delayedDataRecoveryMessage

* fix time range

* update message for recovered anomaly detection alert

* update mml messages
Labels
backport:skip, Feature:Alerting/RuleTypes, Feature:Anomaly Detection, :ml, release_note:enhancement, Team:ML, v8.3.0
Development

Successfully merging this pull request may close these issues: [ML] Add context for recovered alerts (#126803)
7 participants