[alerts] change load avg alert to warning and route to Slack #11420
/hold
In the platform sync, the concern came up that this alert would be too noisy and lead to animosity among on-callers. Adding @vulkoingim as a reviewer to share ideas on how to make it less noisy.
Heya, we discussed this a bit in the platform sync. I think the 1-minute load average is not a good indicator of high load: the interval is too short, and high load over a short period is common, so you would be alerting on spikes. Ideally you should look at the 5m or, even better, the 15m load average. There's no recording rule for the longer timeframe, so I would suggest adding one similar to: https://github.com/gitpod-io/gitpod/blob/main/operations/observability/mixins/workspace/rules/components/nodes/rules.libsonnet
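To illustrate the point about spikes, here is a rough Python sketch (not the kernel's actual fixed-point code) of how exponentially damped averages with 1-minute and 15-minute time constants react to the same short burst. The workload, the function name `damped_average`, and the 5-second sampling interval are assumptions for illustration; the threshold of 10 is the one proposed in this PR.

```python
import math

SAMPLE_INTERVAL_S = 5  # assumption: one sample every 5 seconds, roughly how Linux updates load averages


def damped_average(samples, window_s, interval_s=SAMPLE_INTERVAL_S):
    """Return the exponentially damped average after each sample.

    This is a floating-point approximation of how load averages are smoothed;
    the real kernel uses fixed-point arithmetic.
    """
    decay = math.exp(-interval_s / window_s)
    avg = 0.0
    history = []
    for sample in samples:
        avg = avg * decay + sample * (1.0 - decay)
        history.append(avg)
    return history


# Hypothetical workload: 5 min idle, a 2 min burst of 40 runnable tasks, 5 min idle.
samples = [0] * 60 + [40] * 24 + [0] * 60

load1 = damped_average(samples, window_s=60)    # 1-minute time constant
load15 = damped_average(samples, window_s=900)  # 15-minute time constant

peak_load1 = max(load1)
peak_load15 = max(load15)

THRESHOLD = 10  # threshold proposed in this PR
print(f"peak load1  = {peak_load1:.1f}, crosses threshold: {peak_load1 > THRESHOLD}")
print(f"peak load15 = {peak_load15:.1f}, crosses threshold: {peak_load15 > THRESHOLD}")
```

With this made-up burst, the 1-minute average peaks at about 34.6 while the 15-minute average peaks at about 5.0, so only the short window would trip a threshold of 10.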
@sagor999 one suggestion that I think will make this more palatable
operations/observability/mixins/workspace/rules/components/nodes/alerts.libsonnet
/unhold
🚀 🙏
Description
Add a critical alert for when the load avg is high for too long.
First, why change to a critical alert:
If the load average is high for too long, it affects all workspaces on that node, so we want to react to it fast, hence the critical alert (otherwise the on-call person may be working on something else and miss it for some time).
Why change the duration to 10 min:
We want to minimize false positives as much as possible. We should alert or wake someone up only on a sustained high load average.
Why change the load avg check to above 10:
Again, we want to be alerted only on a legitimate signal. A load avg above 10 for more than 10 min should signal, in almost all cases, that something really bad is going on on that node.
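The "above 10 for more than 10 min" condition can be sketched in Python as follows. This is a simplified model of a sustained-threshold check, in the spirit of the `for` clause on a Prometheus alerting rule, not the rule from this PR; the function names, the one-minute evaluation interval, and the sample values are all made up for illustration.

```python
def firing_times(samples, threshold, hold_s):
    """Return the timestamps at which the alert would be firing.

    samples: list of (timestamp_s, value) pairs in time order.
    The alert fires only once the value has stayed above `threshold`
    continuously for at least `hold_s` seconds; any sample at or below
    the threshold resets the pending timer.
    """
    pending_since = None
    firing = []
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            if ts - pending_since >= hold_s:
                firing.append(ts)
        else:
            pending_since = None  # condition broken, start over
    return firing


def series(values, step_s=60):
    """Attach timestamps to values, assuming one evaluation per minute."""
    return [(i * step_s, v) for i, v in enumerate(values)]


# Hypothetical load avg readings, one per minute.
short_spike = series([2, 14, 15, 13, 12, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2])
sustained = series([2] + [12] * 14)

print(firing_times(short_spike, threshold=10, hold_s=600))  # []
print(firing_times(sustained, threshold=10, hold_s=600))    # [660, 720, 780, 840]
```

The four-minute spike never fires, while the sustained load starts firing ten minutes after it first crossed the threshold, which is the behavior the 10 min duration is meant to achieve.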
Related Issue(s)
Fixes #
How to test
Release Notes
Documentation
Werft options: