
[ML] Check for error messages in the Anomaly Detection jobs health rule type #108701

Merged
merged 14 commits into elastic:master on Aug 17, 2021

Conversation

@darnautov (Contributor) commented Aug 16, 2021

Summary

Part of #101028

Adds a test for errors in the job messages to the Anomaly detection jobs health rule type.

[screenshot]


@darnautov added the :ml, Feature:Anomaly Detection, release_note:feature, auto-backport, v7.15.0, Feature:Alerting/RuleTypes, and 8.0.0 labels on Aug 16, 2021
@darnautov requested a review from a team as a code owner on Aug 16, 2021
@darnautov self-assigned this on Aug 16, 2021
@elasticmachine (Contributor)

Pinging @elastic/ml-ui (:ml)

@szabosteve (Contributor) left a comment

Thanks for the text change, LGTM!

@sophiec20 (Contributor) commented Aug 16, 2021

  1. Can we please add some UI helper text to explain that these operational alerts are best suited for your mission-critical or important jobs? For example, the "datafeed is not started" alert is only useful if applied to a datafeed that is operationally critical (i.e. a real-time job for which you probably already have an alert running on the anomaly detection results).

  2. Re errors in job messages - the other alerts can all be resolved, e.g. a datafeed can be started and job memory amended. How are we expecting the job message errors to be resolved? Does it take the "Clear job messages" option into account? Or is there a time frame over which to look back for errors, in which case they will age out? The helper text should explain.

  3. "There are errors in the job messages" - this wording does not seem in keeping with the rest of the operational alerts.

  4. How do we tell which jobs are experiencing which problems? Until an integrated alerting UI is available, we are relying on the alert action (e.g. email message) to describe which jobs are experiencing which problem. Therefore, we rely on easy(ish) access to this context info and well-written documentation that describes how to do it. Is this part of this PR or will it be a follow-up?

@szabosteve (Contributor)

@darnautov As that part of the text is not edited in this PR, I cannot add a suggestion to the "There are errors in job messages" text that Sophie mentioned, so I leave some options here as a comment:

  • Errors in job messages (I'd prefer this one.)
  • Job messages contain errors

@darnautov (Contributor, Author)

Thanks for the feedback, @sophiec20!

> Can we please add some UI helper text to explain that these operational alerts are best suited for your mission critical or important jobs. For example, the "datafeed is not started" alert is only useful if applied to a datafeed that is operationally critical (i.e. that is a real-time job for which you probably already have an alert running on the anomaly detection results).

Do you suggest updating the rule type helper text and the health check description as well?
[screenshot]

> Re errors in job messages - The other alerts can all be solved. e.g. a datafeed can be started, and job memory amended. How are we expecting the job message errors to be resolved? Does it take the "Clear job messages" option into account? or is there a time frame over which to look back for errors in which case it will age out? -- the helper text should explain.

@droberts195 suggested notifying about errors only once, and I think it makes sense. So during the initial check we query for any existing error messages in the specified jobs, and for subsequent executions we apply a time range based on the previous execution time.

"There are errors in the job messages" - this wording does not seem in keeping with the rest of the operational alerts.

@szabosteve @lcawl do you have any suggestions?

> How do we tell which jobs are experiencing which problems? - Until an integrated alerting UI is available, we are relying on the alert action (e.g. email message) to describe which jobs are experiencing which problem. Therefore, we rely on easy (ish) access to this context info and well written documentation to describe how to do it. Is this part of this PR or will it be a follow up?

There is a limitation of the Alerts and actions framework. Our alerting context contains a collection, i.e. for each health check we provide a set of results (an array of objects), and it is not possible to describe such context variables. I created an enhancement request, but I haven't received an estimate yet. The best we can do so far is:

  • Provide a predefined default message that contains a mustache template with all possible fields. It's already in place (see the sketch after this list).
  • Describe the alerting context in the documentation, similar to context.hits in the Elasticsearch query rule type.
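
For illustration only, a rough sketch of what such a default message could look like as a mustache template over a collection-valued context variable. The variable names (context.results, job_id, message) are hypothetical stand-ins, not the rule type's documented fields:

```ts
// Hypothetical default action message: iterates over a collection in the alert
// context using a mustache section. Field names are illustrative only.
export const DEFAULT_ACTION_MESSAGE = `[{{rule.name}}] Anomaly detection jobs health check results:
{{#context.results}}
- job_id: {{job_id}}
  message: {{message}}
{{/context.results}}`;
```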

@alvarezmelissa87 (Contributor) left a comment

LGTM ⚡

@sophiec20 (Contributor)

> Do you suggest updating the rule type helper text and the health check description as well?

Rule type helper text. Please work with our docs team for suitable wording.

> notifying about errors only once ... So during the initial check, we query for any existing error messages in specified jobs, and for consecutive executions applying a time range according to the previous execution time.

I think we need to think through this a little more. On the first invocation, it would not be ideal to search for any error since the beginning of time, because this could be last year for a very long-running job. Or it could be from before the job got reset, as we do not clear out job messages. Perhaps it should only ever check since the previous execution time, and use the first invocation to set the execution time.
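
A minimal sketch of that behaviour, assuming the rule executor keeps the previous execution timestamp in persisted state and that a getJobsErrors-style helper is injected; the state shape and function names are illustrative, not the PR's actual implementation:

```ts
interface JobsHealthRuleState {
  // Timestamp (ms) of the previous execution, persisted between rule runs.
  previousStartedAt?: number;
}

interface JobsHealthCheckResult {
  errors: unknown[];
  state: JobsHealthRuleState;
}

async function checkErrorsInJobMessages(
  jobIds: string[],
  state: JobsHealthRuleState,
  now: number,
  getJobsErrors: (jobIds: string[], earliestMs: number) => Promise<unknown[]>
): Promise<JobsHealthCheckResult> {
  if (state.previousStartedAt === undefined) {
    // First invocation: only record the execution time; errors are reported
    // starting from the second execution onwards.
    return { errors: [], state: { previousStartedAt: now } };
  }

  // Subsequent invocations: look back only as far as the previous execution time.
  const errors = await getJobsErrors(jobIds, state.previousStartedAt);
  return { errors, state: { previousStartedAt: now } };
}
```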

@lukasolson removed the 8.0.0 label on Aug 17, 2021
latest_errors: Pick<estypes.SearchResponse<JobMessage>, 'hits'>;
}>;

const result = errors.buckets.map((bucket) => {
Review comment (Member):

Total nit: result isn't needed.
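
In other words, the mapped array can be returned directly instead of being assigned to an intermediate variable; roughly, with the bucket shape simplified for illustration:

```ts
// Simplified bucket shape, for illustration only.
interface JobErrorsBucket {
  key: string;
  latest_errors: { hits: { hits: unknown[] } };
}

function mapJobErrors(buckets: JobErrorsBucket[]) {
  // Return the mapped result directly; no intermediate `result` variable needed.
  return buckets.map((bucket) => ({
    job_id: bucket.key,
    errors: bucket.latest_errors.hits.hits,
  }));
}
```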

@darnautov (Contributor, Author) replied:

Changed in 54cc87a

* Retrieve list of errors per job.
* @param jobIds
*/
async function getJobsErrors(jobIds: string[], earliestMs?: number): Promise<JobsErrorsResponse> {
Review comment (Member):

This function could take the message level as a parameter, possibly defaulting to MESSAGE_LEVEL.ERROR, to make it more reusable.

@darnautov (Contributor, Author) replied:

Yeah, I was thinking about it, but I'm not sure about the use case, i.e. whether we'd ever want to retrieve warnings or info messages.

Review comment (Member):

Yeah, I can't see a use case at the moment; it would just make the function potentially more reusable at no extra cost, especially if it had a default message level set to error.
If not, I think the function should be renamed to getJobsErrorMessages to conform to the general naming convention in the file.
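
A rough sketch of the first suggestion, a message level parameter defaulting to error; the MESSAGE_LEVEL constant, JobMessage type, and response shape below are illustrative stand-ins rather than the plugin's actual exports:

```ts
// Illustrative stand-ins for the ML plugin's shared constants and types.
const MESSAGE_LEVEL = { ERROR: 'error', WARNING: 'warning', INFO: 'info' } as const;
type MessageLevel = typeof MESSAGE_LEVEL[keyof typeof MESSAGE_LEVEL];

interface JobMessage {
  job_id: string;
  message: string;
  level: MessageLevel;
  timestamp: number;
}

type JobsErrorsResponse = Array<{ job_id: string; errors: JobMessage[] }>;

/**
 * Retrieve the list of messages per job, defaulting to error-level messages so
 * the helper stays reusable for warnings or info messages at no extra cost.
 */
async function getJobsErrors(
  jobIds: string[],
  earliestMs?: number,
  level: MessageLevel = MESSAGE_LEVEL.ERROR
): Promise<JobsErrorsResponse> {
  // Stub: a real implementation would query the ML notifications index,
  // filtering by jobIds, the optional earliestMs range, and the requested level.
  return jobIds.map((jobId) => ({ job_id: jobId, errors: [] }));
}
```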

...(earliestMs ? [{ range: { timestamp: { gte: earliestMs } } }] : []),
{ terms: { job_id: jobIds } },
{
term: { level: { value: 'error' } },
Review comment (Member):

If the comment above about making the message level a parameter isn't addressed, this should be MESSAGE_LEVEL.ERROR.
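
For illustration, assuming such a constant is exported by the plugin, the clause would reference it instead of the hard-coded string:

```ts
// Illustrative: MESSAGE_LEVEL.ERROR stands in for the hard-coded 'error' literal.
const MESSAGE_LEVEL = { ERROR: 'error' } as const;

export const levelFilter = { term: { level: { value: MESSAGE_LEVEL.ERROR } } };
```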

@darnautov (Contributor, Author) replied:

Changed in 54cc87a

@szabosteve (Contributor) commented Aug 17, 2021

@darnautov I suggest the following alternative for the rule type helper text:

Alert when anomaly detection jobs experience operational issues. Enable suitable alerts for critically important jobs.

And then add the link to the documentation, as the screenshot above shows.

@kibanamachine (Contributor)

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
ml 6.0MB 6.0MB +146.0B


To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @darnautov

@darnautov merged commit f243b05 into elastic:master on Aug 17, 2021
@darnautov deleted the ml-101028-errors branch on Aug 17, 2021
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Aug 17, 2021
[ML] Check for error messages in the Anomaly Detection jobs health rule type (elastic#108701)

* [ML] retrieve job errors

* [ML] account for previous execution time

* [ML] update default message

* [ML] update description

* [ML] update unit tests

* [ML] update unit tests

* [ML] update action name

* [ML] update errorMessages name

* [ML] update a default message to avoid line breaks

* [ML] update rule helper text

* [ML] refactor getJobsErrors

* [ML] perform errors check starting from the second execution
@kibanamachine (Contributor)

💚 Backport successful

Branch: 7.x

This backport PR will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Aug 17, 2021
[ML] Check for error messages in the Anomaly Detection jobs health rule type (#108701) (#108918)

* [ML] retrieve job errors

* [ML] account for previous execution time

* [ML] update default message

* [ML] update description

* [ML] update unit tests

* [ML] update unit tests

* [ML] update action name

* [ML] update errorMessages name

* [ML] update a default message to avoid line breaks

* [ML] update rule helper text

* [ML] refactor getJobsErrors

* [ML] perform errors check starting from the second execution

Co-authored-by: Dima Arnautov <dmitrii.arnautov@elastic.co>