
[APM] APM Anomaly rule type might miss anomalies #92839

Closed
Tracked by #109787
dgieselaar opened this issue Feb 25, 2021 · 6 comments
Labels
apm:alerting bug Fixes for quality problems that affect the customer experience Team:APM All issues that need APM UI Team support

Comments

@dgieselaar
Member

dgieselaar commented Feb 25, 2021

The APM Anomaly alert type always queries now-15m when retrieving anomaly data. Anomaly records are timestamped at the leading edge of the bucket and are only finalized when the bucket has closed, which means the final result is likely indexed in Elasticsearch with a timestamp more than 15 minutes before the time of ingestion. As a result, we are likely to miss final results. We might fire on interim results, which can be generated halfway through the bucket, but those results may later be corrected: the severity might go up or down, or the anomaly might be deleted entirely.

Additionally, the alert might recover if its window crosses over into the next bucket, for which no interim result has been generated yet; if the next (interim) result is again anomalous, the alert fires again, ad infinitum.

We should increase the lookback window to at least 2x the bucket span, which is 30m, and figure out how to deal with interim results. A sketch of what the widened query could look like follows below.
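A minimal sketch of the widened query, assuming a 15m bucket span and the standard fields of the ML results index (`.ml-anomalies-*`); the wiring below is illustrative, not the rule type's actual implementation:

```ts
// Sketch only: widen the lookback to 2x the bucket span and skip
// interim results. Assumes a 15m bucket span.
const BUCKET_SPAN_MINUTES = 15;

const anomalySearchParams = {
  index: '.ml-anomalies-*',
  size: 100,
  query: {
    bool: {
      filter: [
        { term: { result_type: 'record' } },
        // Finalized records are timestamped at the leading edge of a
        // closed bucket, so now-15m can miss them; 2x the bucket span
        // keeps them inside the window.
        { range: { timestamp: { gte: `now-${2 * BUCKET_SPAN_MINUTES}m` } } },
        // Interim results may be revised or deleted once the bucket
        // closes, so exclude them here.
        { term: { is_interim: false } },
      ],
    },
  },
};
```

Excluding interim results trades up to a bucket's worth of latency for stable alerts; an alternative is to include them but re-evaluate once the final result lands.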

@dgieselaar dgieselaar added bug Fixes for quality problems that affect the customer experience Team:APM All issues that need APM UI Team support v7.12.0 v7.11.2 labels Feb 25, 2021
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:apm)

@jasonrhodes
Member

jasonrhodes commented Feb 25, 2021

Ideally, there would be an ML API we could query that answers "how many anomalies are in a reportable state for the user for a given time range?". The logic for answering all of the complicated questions about what qualifies for that state would live inside that API, inside the ML app.

Is this a possibility, @elastic/machine-learning ?
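Something like the following hypothetical shape, perhaps; every name here is illustrative, not an existing Kibana or ML API:

```ts
// Hypothetical contract for an ML-owned "reportable anomalies" service.
interface ReportableAnomaliesRequest {
  jobIds: string[];
  start: number; // epoch ms
  end: number; // epoch ms
  minSeverityScore: number; // minimum record score (0-100)
}

interface ReportableAnomaly {
  jobId: string;
  timestamp: number;
  severityScore: number;
  partitionFieldValue?: string; // e.g. service environment
  byFieldValue?: string; // e.g. transaction type
}

// The ML app would own the decision of which results (final vs.
// interim, corrected, deleted) count as reportable for a time range.
type GetReportableAnomalies = (
  req: ReportableAnomaliesRequest
) => Promise<ReportableAnomaly[]>;
```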

@darnautov
Contributor

darnautov commented Feb 25, 2021

hey @dgieselaar @jasonrhodes,
We're considering extending the ML alerting shared service available on the Kibana server side to cover the needs of solution apps (not only APM), so the logic for retrieving anomalies would indeed live in the ML app.
Shall we prioritize it for 7.13?

The execute method might be sufficient already, but I need more context on which results you expect. I'll review the APM anomaly alerting logic and post an update next week.

@dgieselaar
Member Author

dgieselaar commented Feb 26, 2021

Some other things to note about the APM Anomaly alert type:

  1. When an alert is created for all environments and an anomaly exceeding the specified threshold is found, the alert fires for every started job, regardless of whether the anomaly was generated by that job. E.g., suppose the user has started ML jobs for production, testing, and (NOT DEFINED), and creates an alert for all environments; if an anomaly is found for opbeans-rum in production for transaction type page-load, it will fire three alerts: one for production, one for testing, and one for (NOT DEFINED).

  2. Additionally, we use an avg aggregation at the service level, with no breakdown per environment or per transaction type. Consider the following example:

  • The user has set the severity level to low (0)
  • There is data for production and testing, but not for (NOT DEFINED)
  • an anomaly of 95 was found for opbeans-rum:testing:route-change
  • an anomaly of 95 was found for opbeans-rum:testing:http-request
  • an anomaly of 95 was found for opbeans-rum:testing:user-interaction
  • an anomaly of 1 was found for opbeans-rum:production:page-load

In this scenario, we will fire the following alerts, all with severity critical (score 71, the service-level average of 95, 95, 95, and 1):

  • opbeans-rum:testing:route-change
  • opbeans-rum:testing:http-request
  • opbeans-rum:testing:user-interaction
  • opbeans-rum:testing:page-load
  • opbeans-rum:production:route-change
  • opbeans-rum:production:http-request
  • opbeans-rum:production:user-interaction
  • opbeans-rum:production:page-load
  • opbeans-rum:(NOT DEFINED):route-change
  • opbeans-rum:(NOT DEFINED):http-request
  • opbeans-rum:(NOT DEFINED):user-interaction
  • opbeans-rum:(NOT DEFINED):page-load

So it (a) fires too many alerts, and (b) fires them with the wrong severity levels. This is on top of it firing only on interim results, which may be false positives or false negatives. A sketch of a per-environment, per-transaction-type breakdown follows below.
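A minimal sketch of what such a breakdown could look like, assuming the jobs expose the environment and transaction type via `partition_field_value` and `by_field_value` (an assumption; the actual APM job configuration may map these fields differently):

```ts
// Sketch only: group anomaly scores per environment and transaction
// type instead of averaging at the service level.
const severityBreakdownAggs = {
  by_environment: {
    terms: { field: 'partition_field_value' }, // assumed: environment
    aggs: {
      by_transaction_type: {
        terms: { field: 'by_field_value' }, // assumed: transaction type
        aggs: {
          // Max (not avg) per bucket, so a single low-scoring series
          // can neither mask nor inflate another's severity.
          max_record_score: { max: { field: 'record_score' } },
        },
      },
    },
  },
};
// Each (environment, transaction type) bucket then gets its own alert
// decision, instead of one service-wide average fanned out to every
// combination.
```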

@botelastic

botelastic bot commented Dec 27, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@botelastic botelastic bot added the stale Used to mark issues that were closed for being stale label Dec 27, 2021
@botelastic botelastic bot removed the stale Used to mark issues that were closed for being stale label Dec 20, 2022
@dgieselaar
Member Author

This is fixed: we now group anomaly data by service environment, and we look back 30m at a minimum.
