
General API response time alarms #2501

Closed
sarayourfriend opened this issue Jun 29, 2023 · 8 comments
Assignees
krysal
Labels
💻 aspect: code Concerns the software code in the repository 🌟 goal: addition Addition of new feature 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: infra Related to the Terraform config and other infrastructure 🔒 staff only Restricted to staff members
Milestone
ECS Alarms

Comments

@sarayourfriend
Contributor

Description

Project thread: #2344
Implementation plan: https://docs.openverse.org/projects/proposals/monitoring/20230606_implementation_plan_ecs_alarms.html

Create alarms in next/modules/monitoring/production-api for the following (a rough sketch of one such alarm follows the list):

  • Average response time anomaly
  • Average response time over threshold
  • p99 response time anomaly
  • p99 response time over threshold
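
To make the shape of these alarms concrete, here is a minimal Terraform sketch of the "p99 response time over threshold" case. The namespace, metric, dimension, threshold, and variable names are all illustrative assumptions, not the actual values in next/modules/monitoring/production-api:

```hcl
# Hypothetical static-threshold alarm on p99 response time.
# Metric, namespace, and threshold values are placeholders, not the real config.
resource "aws_cloudwatch_metric_alarm" "p99_response_time_threshold" {
  alarm_name        = "production-api-p99-response-time-over-threshold"
  alarm_description = "p99 API response time exceeded the static threshold"

  namespace          = "AWS/ApplicationELB" # assumption: ALB target response time
  metric_name        = "TargetResponseTime"
  extended_statistic = "p99"                # percentile stats use extended_statistic
  period             = 60                   # seconds per datapoint

  comparison_operator = "GreaterThanThreshold"
  threshold           = 3                   # seconds; illustrative only
  evaluation_periods  = 5
  datapoints_to_alarm = 5                   # all 5 of the last 5 datapoints must breach

  dimensions = {
    LoadBalancer = var.load_balancer_arn_suffix # hypothetical variable
  }

  alarm_actions = [var.alerts_sns_topic_arn]    # hypothetical variable
  ok_actions    = [var.alerts_sns_topic_arn]
}
```

The average-response-time variant would be the same resource with `statistic = "Average"` in place of `extended_statistic`.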

Additional context

This issue will remain open until the alarms are stabilised.

Blocked by the baseline configuration changes in #2499

@sarayourfriend sarayourfriend added 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work ✨ goal: improvement Improvement to an existing user-facing feature ⛔ status: blocked Blocked & therefore, not ready for work labels Jun 29, 2023
@sarayourfriend sarayourfriend added this to the ECS Alarms milestone Jun 29, 2023
@dhruvkb dhruvkb added 🟧 priority: high Stalls work on the project or its dependents 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: infra Related to the Terraform config and other infrastructure and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work ✨ goal: improvement Improvement to an existing user-facing feature labels Jul 3, 2023
@sarayourfriend sarayourfriend removed the ⛔ status: blocked Blocked & therefore, not ready for work label Jul 17, 2023
@krysal krysal self-assigned this Jul 25, 2023
@krysal krysal added the 🔒 staff only Restricted to staff members label Jul 27, 2023
@krysal
Member

krysal commented Sep 8, 2023

@sarayourfriend What is the difference between the alarms for time over threshold and time anomaly here? The PR for the former is up at https://github.com/WordPress/openverse-infrastructure/pull/614

@stacimc
Contributor

stacimc commented Sep 18, 2023

What is the difference between the alarms for time over threshold and time anomaly here?

@sarayourfriend can double-check me here, but my understanding is time over threshold means alerting when response time exceeds a pre-determined static threshold, while anomaly detection is a more sophisticated way of detecting that a metric has gone outside of a "normal" range based on past values. It fits a 'band' to the graph of the metric, and alerts whenever the metric falls outside that range.

So instead of configuring a static threshold ("alert when response time is higher than x"), you configure how wide the band is. A way of thinking about it is alerting when the metric is more than x standard deviations from the line of best fit (I don't know anything about the actual math involved and it's probably more complicated than that, but I think that's the general idea). I think these are the relevant docs in our case.

@sarayourfriend
Contributor Author

sarayourfriend commented Sep 18, 2023

That's correct, Staci. Threshold alarms are generally good for being alerted quickly when a "very bad" condition is met, like a rapid spike in response times over 3 seconds (as an example). Anomaly detection, by its very nature, produces lots of false positives if you alert on a single datapoint outside the expected deviation (the "band" or "normal range" Staci mentioned), so it typically takes more time to alert you: you need multiple datapoints outside the band in a row, or within a given period, before it counts as a "true anomaly". Anomaly detection can help us know when things are trending in one direction or another "too fast", while still accounting for normal variation.

Anomaly detection also accounts for daily fluctuations. That allows us to confidently alarm when request count goes down for that particular time of day: with a static threshold we have to pick a number that won't false-alarm during expected quiet hours, but because our request counts fluctuate quite a bit over the day, we would still want to know if a significant dip happened in the middle of the busy period, even if it never dropped below the off-peak threshold. Anomaly detection also incorporates seasonality at a weekly, monthly, or yearly level, and so again "self-adjusts" over the year to account for general trends that are not statistically anomalous.

In a basic sense, anomaly detection is real statistical analysis of metrics on an ongoing basis, detecting persistent outliers. Threshold detection is much simpler and inflexible: all it can do is tell us whether a pre-determined "bad state" has been reached.
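
For comparison with the threshold sketch above, here is a rough Terraform sketch of the anomaly flavour using CloudWatch's ANOMALY_DETECTION_BAND expression. Again, the metric, namespace, band width, and variable names are illustrative assumptions rather than our actual configuration:

```hcl
# Hypothetical anomaly-detection alarm on average response time.
# The band width (2 standard deviations) and metric details are placeholders.
resource "aws_cloudwatch_metric_alarm" "avg_response_time_anomaly" {
  alarm_name        = "production-api-avg-response-time-anomaly"
  alarm_description = "Average API response time left the expected band"

  comparison_operator = "GreaterThanUpperThreshold"
  threshold_metric_id = "band"              # compare against the band, not a static number
  evaluation_periods  = 3
  datapoints_to_alarm = 2                   # 2 of the last 3 datapoints must be anomalous

  metric_query {
    id          = "band"
    expression  = "ANOMALY_DETECTION_BAND(m1, 2)" # 2 = band width in standard deviations
    label       = "Expected response time"
    return_data = true
  }

  metric_query {
    id          = "m1"
    return_data = true

    metric {
      namespace   = "AWS/ApplicationELB"    # assumption: ALB target response time
      metric_name = "TargetResponseTime"
      period      = 300
      stat        = "Average"

      dimensions = {
        LoadBalancer = var.load_balancer_arn_suffix # hypothetical variable
      }
    }
  }

  alarm_actions = [var.alerts_sns_topic_arn] # hypothetical variable
}
```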

You can read more about anomaly detection around the internet; there are heaps of resources explaining the different use cases, and CloudWatch's documentation covers their features comprehensively, with helpful illustrations: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html. It's a good starting point for understanding better what anomaly detection is all about, but like I said, there are probably thousands of blog posts about it 🙂 Here are some other resources I was able to find that I would consider good explanations. Start with them and then it may be easier to find reliable resources elsewhere.

Datadog also has anomaly detection, and while their feature set is slightly different from CloudWatch's (it's frankly a lot more comprehensive), their examples section looks like another good starting point for understanding it: https://docs.datadoghq.com/monitors/types/anomaly/#examples

Here is a Grafana blog post on "outlier detection", which looks for all intents and purposes like the same thing, by a different name: https://grafana.com/blog/2022/12/22/introducing-outlier-detection-in-grafana-machine-learning-for-grafana-cloud/

Here is a blog post that explains the maths behind it, which might be useful as well if that's an approach that works for you: https://blog.davidvassallo.me/2021/10/01/grafana-prometheus-detecting-anomalies-in-time-series/. It's surprisingly approachable, in the grand scheme of things, as long as you're familiar with the relationship between standard deviations and averages. I found the mathematical approach to understanding it very helpful when I was learning this, and it made it much more intuitive to understand how to apply it, but YMMV depending on how you prefer to approach things 🙂

@sarayourfriend
Contributor Author

A way of thinking about it is alerting when the metric is more than x standard deviations from the line of best fit (I don't know anything about the actual math involved and it's probably more complicated than that, but I think that's the general idea)

Also, @stacimc, that is basically all there is to it 😁 I don't know how they calculate rolling means and standard deviations, or how seasonality is accounted for; that's all much more complicated. But the general principle is all I've ever needed for the idea to be intuitive enough to work with.
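
Written down, the intuition amounts to something like this (hand-waving over how the expected value is actually modelled):

$$
\text{lower}_t = \hat{\mu}_t - k\,\hat{\sigma}_t
\qquad
\text{upper}_t = \hat{\mu}_t + k\,\hat{\sigma}_t
$$

where $\hat{\mu}_t$ and $\hat{\sigma}_t$ are the model's expected value and spread for the metric at time $t$, and $k$ is the band width you configure. A datapoint is an outlier when it falls outside $[\text{lower}_t, \text{upper}_t]$, and the alarm fires only when some $m$ of the last $n$ datapoints are outliers.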

@sarayourfriend
Contributor Author

Sorry, one more thing, you'll also want to learn about monitor downtime, which I talked about in the project proposal here: https://docs.openverse.org/projects/proposals/monitoring/20230606_implementation_plan_ecs_alarms.html#downtime

Anomaly alarms sometimes need them if the metric swings very widely. Threshold alarms, like ES CPU usage during data refresh, also sometimes need them.

@krysal
Member

krysal commented Sep 19, 2023

Thank you both, I had a general idea of what they were. I guess I phrased my question wrong. 😅 I meant to ask if we want both types of alarms (anomaly & over threshold) and why. To me, it seems we want to choose one or the other.

@sarayourfriend
Contributor Author

Sorry! If we only want one or the other, then anomaly is the correct choice. Like I said, I think a combination is helpful to catch "very bad spikes" that get thrown out as false positives by anomaly detection. If average response time spikes to 10 seconds but only for one datapoint, we probably want to know that happened so we can look into it. But we also do not want an anomaly alarm to go off on every outlier; it should usually fire only on repeated outliers (2 of 2 or 2 of 3 datapoints, for example), otherwise we will get heaps of false alarms for single datapoints that are only, say, 1.1 standard deviations away, which isn't that big a deal.

To me, they are different tools, and I think we should use both. But if we only want to use one or the other, then anomaly is the more useful one.
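
In CloudWatch/Terraform terms, that "2 of 2 or 2 of 3 datapoints" behaviour maps onto just two attributes of the alarm resource (a fragment that would slot into the hypothetical anomaly sketch above, not a complete resource on its own):

```hcl
  # "M out of N": fire only when 2 of the last 3 datapoints are outside the band.
  evaluation_periods  = 3
  datapoints_to_alarm = 2
```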

@krysal
Member

krysal commented Sep 20, 2023

@sarayourfriend, I see your point! And I've experienced it firsthand these last few days with the API struggles. It makes sense to me to set a reasonably high threshold for one alarm and let the anomaly detection catch most things. The PR to complete this is up :) https://github.com/WordPress/openverse-infrastructure/pull/622

@krysal krysal closed this as completed Jan 18, 2024