
[APM] APM Anomaly rule type might miss anomalies #92839

Closed
Tracked by #109787
dgieselaar opened this issue Feb 25, 2021 · 6 comments
Labels
apm:alerting bug Fixes for quality problems that affect the customer experience Team:APM All issues that need APM UI Team support

Comments

@dgieselaar
Member

dgieselaar commented Feb 25, 2021

The APM Anomaly alert type always queries now-15m when retrieving anomaly data. Anomaly records are timestamped at the leading edge of the bucket and are only finalized when the bucket has closed, which means the final result is likely indexed in Elasticsearch with a timestamp more than 15 minutes before the time of ingestion. As a result, we are likely to miss final results. We might fire on interim results, which can be generated halfway through the bucket, but those results may later be corrected: the severity might go up or down, or the anomaly might be deleted entirely.

Additionally, the alert might recover if its window crosses over into the next bucket, for which no interim result has been generated yet; if the next (interim) result is again anomalous, the alert fires again, ad infinitum.

We should increase the lookback window to at least 2x the bucket span, which is 30m, and figure out how to deal with interim results. A sketch of what the widened query could look like follows below.
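A minimal sketch of the widened query, assuming a 15m bucket span and the standard fields of the ML results index (`.ml-anomalies-*`); the wiring below is illustrative, not the rule type's actual implementation:

```ts
// Sketch only: widen the lookback to 2x the bucket span and skip
// interim results. Assumes a 15m bucket span.
const BUCKET_SPAN_MINUTES = 15;

const anomalySearchParams = {
  index: '.ml-anomalies-*',
  size: 100,
  query: {
    bool: {
      filter: [
        { term: { result_type: 'record' } },
        // Finalized records are timestamped at the leading edge of a
        // closed bucket, so now-15m can miss them; 2x the bucket span
        // keeps them inside the window.
        { range: { timestamp: { gte: `now-${2 * BUCKET_SPAN_MINUTES}m` } } },
        // Interim results may be revised or deleted once the bucket
        // closes, so exclude them here.
        { term: { is_interim: false } },
      ],
    },
  },
};
```

Excluding interim results trades up to a bucket's worth of latency for stable alerts; an alternative is to include them but re-evaluate once the final result lands.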

@dgieselaar dgieselaar added bug Fixes for quality problems that affect the customer experience Team:APM All issues that need APM UI Team support v7.12.0 v7.11.2 labels Feb 25, 2021
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:apm)

@jasonrhodes
Member

jasonrhodes commented Feb 25, 2021

Ideally, there would be an ML API we could query that answers "how many anomalies are in a reportable state for the user for a given time range?". The logic for answering all of the complicated questions about what qualifies for that state would live inside that API, inside the ML app.

Is this a possibility, @elastic/machine-learning ?
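Something like the following hypothetical shape, perhaps; every name here is illustrative, not an existing Kibana or ML API:

```ts
// Hypothetical contract for an ML-owned "reportable anomalies" service.
interface ReportableAnomaliesRequest {
  jobIds: string[];
  start: number; // epoch ms
  end: number; // epoch ms
  minSeverityScore: number; // minimum record score (0-100)
}

interface ReportableAnomaly {
  jobId: string;
  timestamp: number;
  severityScore: number;
  partitionFieldValue?: string; // e.g. service environment
  byFieldValue?: string; // e.g. transaction type
}

// The ML app would own the decision of which results (final vs.
// interim, corrected, deleted) count as reportable for a time range.
type GetReportableAnomalies = (
  req: ReportableAnomaliesRequest
) => Promise<ReportableAnomaly[]>;
```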

@darnautov
Contributor

darnautov commented Feb 25, 2021

hey @dgieselaar @jasonrhodes,
We're considering extending the ML alerting shared service available on the Kibana server side to cover the needs of solution apps (not only APM), so the logic for retrieving anomalies would indeed live in the ML app.
Shall we prioritize it for 7.13?

The execute method might be sufficient already, but I need more context on which results you expect. I'll review the APM anomaly alerting logic and post an update next week.

@dgieselaar
Member Author

dgieselaar commented Feb 26, 2021

Some other things to note about the APM Anomaly alert type:

  1. When an alert is created for all environments and an anomaly exceeding the specified threshold is found, the alert fires for every started job, regardless of whether the anomaly was generated by that job. E.g., suppose the user has started ML jobs for production, testing, and (NOT DEFINED), and creates an alert for all environments; if an anomaly is found for opbeans-rum in production for transaction type page-load, it will fire three alerts: one for production, one for testing, and one for (NOT DEFINED).

  2. Additionally, we use an avg aggregation at the service level, with no breakdown per environment or per transaction type. Consider the following example:

  • The user has set the severity level to low (0)
  • There is data for production and testing, but not for (NOT DEFINED)
  • an anomaly of 95 was found for opbeans-rum:testing:route-change
  • an anomaly of 95 was found for opbeans-rum:testing:http-request
  • an anomaly of 95 was found for opbeans-rum:testing:user-interaction
  • an anomaly of 1 was found for opbeans-rum:production:page-load

In this scenario, we will fire the following alerts, all with severity critical (score 71, the service-level average of 95, 95, 95, and 1):

  • opbeans-rum:testing:route-change
  • opbeans-rum:testing:http-request
  • opbeans-rum:testing:user-interaction
  • opbeans-rum:testing:page-load
  • opbeans-rum:production:route-change
  • opbeans-rum:production:http-request
  • opbeans-rum:production:user-interaction
  • opbeans-rum:production:page-load
  • opbeans-rum:(NOT DEFINED):route-change
  • opbeans-rum:(NOT DEFINED):http-request
  • opbeans-rum:(NOT DEFINED):user-interaction
  • opbeans-rum:(NOT DEFINED):page-load

So it (a) fires too many alerts, and (b) fires them with the wrong severity levels. This is on top of it firing only on interim results, which may be false positives or false negatives. A sketch of a per-environment, per-transaction-type breakdown follows below.
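A minimal sketch of what such a breakdown could look like, assuming the jobs expose the environment and transaction type via `partition_field_value` and `by_field_value` (an assumption; the actual APM job configuration may map these fields differently):

```ts
// Sketch only: group anomaly scores per environment and transaction
// type instead of averaging at the service level.
const severityBreakdownAggs = {
  by_environment: {
    terms: { field: 'partition_field_value' }, // assumed: environment
    aggs: {
      by_transaction_type: {
        terms: { field: 'by_field_value' }, // assumed: transaction type
        aggs: {
          // Max (not avg) per bucket, so a single low-scoring series
          // can neither mask nor inflate another's severity.
          max_record_score: { max: { field: 'record_score' } },
        },
      },
    },
  },
};
// Each (environment, transaction type) bucket then gets its own alert
// decision, instead of one service-wide average fanned out to every
// combination.
```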

@botelastic

botelastic bot commented Dec 27, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@botelastic botelastic bot added the stale Used to mark issues that were closed for being stale label Dec 27, 2021
@botelastic botelastic bot removed the stale Used to mark issues that were closed for being stale label Dec 20, 2022
@dgieselaar
Member Author

This is fixed: we now group anomaly data by service environment, and we look back 30m at a minimum.
