[APM] APM Anomaly rule type might miss anomalies #92839
Comments
Pinging @elastic/apm-ui (Team:apm)
Ideally, there would be an ML API that we could query to answer "how many anomalies are in a reportable state to the user for a given time range". The logic for deciding what qualifies for that state would live inside that API, inside the ML app. Is this a possibility, @elastic/machine-learning?
Hey @dgieselaar @jasonrhodes, the execute method might already be sufficient, but I need more context on which results you expect. I will check the logic related to APM anomaly alerting and post an update next week.
Some other things to note about the APM Anomaly alert type:
In this scenario, we will fire the following alerts, all with severity critical (71):
So it A) fires too many alerts and B) fires alerts with the wrong severity levels. This is in addition to it firing only on interim results, which might be false positives or false negatives.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This is fixed: we now properly group anomaly data by service environment, and we look back 30m at a minimum.
The APM Anomaly alert type always queries now-15m when retrieving anomaly data. Anomaly records are timestamped at the leading edge of the bucket and are only finalized when the bucket has closed, which means that the final result is likely indexed in Elasticsearch with a timestamp more than 15 minutes before the time of ingestion. As a result, we are likely to miss final results. We might fire on interim results, which can be generated halfway through the bucket, but these results might be corrected, i.e. the severity might go up or down, or the anomaly might be deleted.
Additionally, the alert might recover if its window crosses over to the next bucket, for which no interim result has been generated yet. If the next (interim) result is again anomalous, the alert will fire again, ad infinitum.
We should increase the lookback window to at least 2x bucket span, which is 30m, and figure out how to deal with interim results.
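To illustrate why a 15m lookback can miss the final result while a 2x bucket span lookback does not, here is a minimal sketch of the timing arithmetic. It assumes a 15m bucket span (as implied above) and bucket boundaries aligned to multiples of the span, and it ignores any delay between a bucket closing and its record being indexed. All function names are hypothetical and are not taken from the Kibana or ML code.

```typescript
// Assumption: 15m bucket span, as implied by "2x bucket span, which is 30m".
const MINUTE_MS = 60 * 1000;
const BUCKET_SPAN_MS = 15 * MINUTE_MS;

// Start of the bucket containing `timeMs`. Anomaly records carry this
// timestamp (the leading edge of the bucket).
function bucketStart(timeMs: number, bucketSpanMs: number = BUCKET_SPAN_MS): number {
  return Math.floor(timeMs / bucketSpanMs) * bucketSpanMs;
}

// Timestamp of the most recently closed bucket at `nowMs`, i.e. the newest
// record that can be final rather than interim.
function latestFinalizedBucketStart(nowMs: number, bucketSpanMs: number = BUCKET_SPAN_MS): number {
  return bucketStart(nowMs, bucketSpanMs) - bucketSpanMs;
}

// Whether a query over [now - lookback, now] includes that final record.
function seesLatestFinalResult(
  nowMs: number,
  lookbackMs: number,
  bucketSpanMs: number = BUCKET_SPAN_MS
): boolean {
  return latestFinalizedBucketStart(nowMs, bucketSpanMs) >= nowMs - lookbackMs;
}

// Proposed minimum lookback: twice the bucket span.
function minimumLookbackMs(bucketSpanMs: number = BUCKET_SPAN_MS): number {
  return 2 * bucketSpanMs;
}

// Example: the rule runs at 12:20. The open bucket started at 12:15, so the
// newest final record is timestamped 12:00.
const now = (12 * 60 + 20) * MINUTE_MS;
console.log(seesLatestFinalResult(now, 15 * MINUTE_MS)); // window starts at 12:05
console.log(seesLatestFinalResult(now, minimumLookbackMs())); // window starts at 11:50
```

Under these assumptions, the newest final record is always less than two bucket spans old, so a lookback of 2x bucket span always covers it. This sketch does not address how to treat interim results inside the window.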