
General API response time alarms #2501

Closed
sarayourfriend opened this issue Jun 29, 2023 · 8 comments
Assignees
krysal
Labels
💻 aspect: code Concerns the software code in the repository 🌟 goal: addition Addition of new feature 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: infra Related to the Terraform config and other infrastructure 🔒 staff only Restricted to staff members
Milestone
ECS Alarms

Comments

@sarayourfriend
Contributor

Description

Project thread: #2344
Implementation plan: https://docs.openverse.org/projects/proposals/monitoring/20230606_implementation_plan_ecs_alarms.html

Create alarms in next/modules/monitoring/production-api for the following (a rough sketch of one such alarm follows the list):

  • Average response time anomaly
  • Average response time over threshold
  • p99 response time anomaly
  • p99 response time over threshold
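
To make the shape of these alarms concrete, here is a minimal Terraform sketch of the "p99 response time over threshold" case. The namespace, metric, dimension, threshold, and variable names are all illustrative assumptions, not the actual values in next/modules/monitoring/production-api:

```hcl
# Hypothetical static-threshold alarm on p99 response time.
# Metric, namespace, and threshold values are placeholders, not the real config.
resource "aws_cloudwatch_metric_alarm" "p99_response_time_threshold" {
  alarm_name        = "production-api-p99-response-time-over-threshold"
  alarm_description = "p99 API response time exceeded the static threshold"

  namespace          = "AWS/ApplicationELB" # assumption: ALB target response time
  metric_name        = "TargetResponseTime"
  extended_statistic = "p99"                # percentile stats use extended_statistic
  period             = 60                   # seconds per datapoint

  comparison_operator = "GreaterThanThreshold"
  threshold           = 3                   # seconds; illustrative only
  evaluation_periods  = 5
  datapoints_to_alarm = 5                   # all 5 of the last 5 datapoints must breach

  dimensions = {
    LoadBalancer = var.load_balancer_arn_suffix # hypothetical variable
  }

  alarm_actions = [var.alerts_sns_topic_arn]    # hypothetical variable
  ok_actions    = [var.alerts_sns_topic_arn]
}
```

The average-response-time variant would be the same resource with `statistic = "Average"` in place of `extended_statistic`.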

Additional context

This issue will remain open until the alarms are stabilised.

Blocked by the baseline configuration changes in #2499

@sarayourfriend sarayourfriend added 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work ✨ goal: improvement Improvement to an existing user-facing feature ⛔ status: blocked Blocked & therefore, not ready for work labels Jun 29, 2023
@sarayourfriend sarayourfriend added this to the ECS Alarms milestone Jun 29, 2023
@dhruvkb dhruvkb added 🟧 priority: high Stalls work on the project or its dependents 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: infra Related to the Terraform config and other infrastructure and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work ✨ goal: improvement Improvement to an existing user-facing feature labels Jul 3, 2023
@sarayourfriend sarayourfriend removed the ⛔ status: blocked Blocked & therefore, not ready for work label Jul 17, 2023
@krysal krysal self-assigned this Jul 25, 2023
@krysal krysal added the 🔒 staff only Restricted to staff members label Jul 27, 2023
@krysal
Member

krysal commented Sep 8, 2023

@sarayourfriend What is the difference between the alarms for time over threshold and time anomaly here? The PR for the former is up at https://github.com/WordPress/openverse-infrastructure/pull/614

@stacimc
Contributor

stacimc commented Sep 18, 2023

What is the difference between the alarms for time over threshold and time anomaly here?

@sarayourfriend can double-check me here, but my understanding is time over threshold means alerting when response time exceeds a pre-determined static threshold, while anomaly detection is a more sophisticated way of detecting that a metric has gone outside of a "normal" range based on past values. It fits a 'band' to the graph of the metric, and alerts whenever the metric falls outside that range.

So instead of configuring a static threshold ("alert when response time is higher than x"), you configure how wide the band is. A way of thinking about it is alerting when the metric is more than x standard deviations from the line of best fit (I don't know anything about the actual math involved and it's probably more complicated than that, but I think that's the general idea). I think these are the relevant docs in our case.

@sarayourfriend
Contributor Author

sarayourfriend commented Sep 18, 2023

That's correct, Staci. Threshold alarms are generally good for being alerted quickly when a "very bad" condition is met, like a rapid spike in response times over 3 seconds (as an example). Anomaly detection, by its very nature, produces lots of false positives if you alert on a single datapoint outside the expected deviation (the "band" or "normal range" Staci mentioned), so it typically takes more time to alert you: you need multiple datapoints outside the band in a row, or within a given period, before it counts as a "true anomaly". Anomaly detection can help us know when things are trending in one direction or another "too fast", while still accounting for normal variation.

Anomaly detection also accounts for daily fluctuations. That allows us to confidently alarm when request count goes down for that particular time of day: with a static threshold we have to pick a number that won't false-alarm during expected quiet hours, but because our request counts fluctuate quite a bit over the day, we would still want to know if a significant dip happened in the middle of the busy period, even if it never dropped below the off-peak threshold. Anomaly detection also incorporates seasonality at a weekly, monthly, or yearly level, and so again "self-adjusts" over the year to account for general trends that are not statistically anomalous.

In a basic sense, anomaly detection is real statistical analysis of metrics on an ongoing basis, detecting persistent outliers. Threshold detection is much simpler and inflexible: all it can do is tell us whether a pre-determined "bad state" has been reached.
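
For comparison with the threshold sketch above, here is a rough Terraform sketch of the anomaly flavour using CloudWatch's ANOMALY_DETECTION_BAND expression. Again, the metric, namespace, band width, and variable names are illustrative assumptions rather than our actual configuration:

```hcl
# Hypothetical anomaly-detection alarm on average response time.
# The band width (2 standard deviations) and metric details are placeholders.
resource "aws_cloudwatch_metric_alarm" "avg_response_time_anomaly" {
  alarm_name        = "production-api-avg-response-time-anomaly"
  alarm_description = "Average API response time left the expected band"

  comparison_operator = "GreaterThanUpperThreshold"
  threshold_metric_id = "band"              # compare against the band, not a static number
  evaluation_periods  = 3
  datapoints_to_alarm = 2                   # 2 of the last 3 datapoints must be anomalous

  metric_query {
    id          = "band"
    expression  = "ANOMALY_DETECTION_BAND(m1, 2)" # 2 = band width in standard deviations
    label       = "Expected response time"
    return_data = true
  }

  metric_query {
    id          = "m1"
    return_data = true

    metric {
      namespace   = "AWS/ApplicationELB"    # assumption: ALB target response time
      metric_name = "TargetResponseTime"
      period      = 300
      stat        = "Average"

      dimensions = {
        LoadBalancer = var.load_balancer_arn_suffix # hypothetical variable
      }
    }
  }

  alarm_actions = [var.alerts_sns_topic_arn] # hypothetical variable
}
```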

You can read more about anomaly detection around the internet; there are heaps of resources explaining the different use cases, and CloudWatch's documentation covers their features comprehensively, with helpful illustrations: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html. It's a good starting point for understanding better what anomaly detection is all about, but like I said, there are probably thousands of blog posts about it 🙂 Here are some other resources I was able to find that I would consider good explanations. Start with them and then it may be easier to find reliable resources elsewhere.

Datadog also has anomaly detection, and while their feature set is slightly different from CloudWatch's (it's frankly a lot more comprehensive), their examples section looks like another good starting point for understanding it: https://docs.datadoghq.com/monitors/types/anomaly/#examples

Here is a Grafana blog post on "outlier detection", which looks for all intents and purposes like the same thing, by a different name: https://grafana.com/blog/2022/12/22/introducing-outlier-detection-in-grafana-machine-learning-for-grafana-cloud/

Here is a blog post that explains the maths behind it, which might be useful as well if that's an approach that works for you: https://blog.davidvassallo.me/2021/10/01/grafana-prometheus-detecting-anomalies-in-time-series/. It's surprisingly approachable, in the grand scheme of things, as long as you're familiar with the relationship between standard deviations and averages. I found the mathematical approach to understanding it very helpful when I was learning this, and it made it much more intuitive to understand how to apply it, but YMMV depending on how you prefer to approach things 🙂

@sarayourfriend
Contributor Author

A way of thinking about it is alerting when the metric is more than x standard deviations from the line of best fit (I don't know anything about the actual math involved and it's probably more complicated than that, but I think that's the general idea)

Also, @stacimc, that is basically all there is to it 😁 I don't know how they calculate rolling means and standard deviations, or how seasonality is accounted for; that's all much more complicated. But the general principle is all I've ever needed for the idea to be intuitive enough to work with.
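
Written down, the intuition amounts to something like this (hand-waving over how the expected value is actually modelled):

$$
\text{lower}_t = \hat{\mu}_t - k\,\hat{\sigma}_t
\qquad
\text{upper}_t = \hat{\mu}_t + k\,\hat{\sigma}_t
$$

where $\hat{\mu}_t$ and $\hat{\sigma}_t$ are the model's expected value and spread for the metric at time $t$, and $k$ is the band width you configure. A datapoint is an outlier when it falls outside $[\text{lower}_t, \text{upper}_t]$, and the alarm fires only when some $m$ of the last $n$ datapoints are outliers.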

@sarayourfriend
Contributor Author

Sorry, one more thing, you'll also want to learn about monitor downtime, which I talked about in the project proposal here: https://docs.openverse.org/projects/proposals/monitoring/20230606_implementation_plan_ecs_alarms.html#downtime

Anomaly alarms sometimes need them if the metric swings very widely. Threshold alarms, like ES CPU usage during data refresh, also sometimes need them.

@krysal
Member

krysal commented Sep 19, 2023

Thank you both, I had a general idea of what they were. I guess I phrased my question wrong. 😅 I meant to ask if we want both types of alarms (anomaly & over threshold) and why. To me, it seems we want to choose one or the other.

@sarayourfriend
Contributor Author

Sorry! If we only want one or the other, then anomaly is the correct choice. Like I said, I think a combination is helpful to catch "very bad spikes" that get thrown out as false positives by anomaly detection. If average response time spikes to 10 seconds but only for one datapoint, we probably want to know that happened so we can look into it. But we also do not want an anomaly alarm to go off on every outlier; it should usually fire only on repeated outliers (2 of 2 or 2 of 3 datapoints, for example), otherwise we will get heaps of false alarms for single datapoints that are only, say, 1.1 standard deviations away, which isn't that big a deal.

To me, they are different tools, and I think we should use both. But if we only want to use one or the other, then anomaly is the more useful one.
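
In CloudWatch/Terraform terms, that "2 of 2 or 2 of 3 datapoints" behaviour maps onto just two attributes of the alarm resource (a fragment that would slot into the hypothetical anomaly sketch above, not a complete resource on its own):

```hcl
  # "M out of N": fire only when 2 of the last 3 datapoints are outside the band.
  evaluation_periods  = 3
  datapoints_to_alarm = 2
```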

@krysal
Member

krysal commented Sep 20, 2023

@sarayourfriend, I see your point! And I've experienced it firsthand these last few days with the API struggles. It makes sense to me to set a reasonably high threshold for one alarm and let the anomaly detection catch most things. The PR to complete this is up :) https://github.com/WordPress/openverse-infrastructure/pull/622

@krysal krysal closed this as completed Jan 18, 2024