[alerts] change load avg alert to warning and route to Slack #11420

Merged
merged 1 commit into from
Aug 5, 2022


sagor999
Contributor

Description

Add critical alert for when load avg is high for too long.

First, why change to critical alert:
If load average stays high for too long, it affects every workspace on that node, so we want to react quickly, hence a critical alert. Otherwise the on-call person may be working on something else and miss it for some time.

Why change duration to 10 min:
We want to keep false positives to a minimum. We should alert or wake someone up only on sustained high load average.

Why change load avg check to above 10:
Again, we want to be alerted only on a legitimate signal. A load avg above 10 for more than 10 min should, in almost all cases, signal that something is seriously wrong on that node.
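
To make the "above 10 for more than 10 min" condition concrete, here is a minimal Python sketch of a sustained-threshold check. It is not the alert rule from this PR: the function name `sustained_breach`, the `(timestamp, value)` sample format, and the one-minute sampling in the usage lines are all assumptions for illustration. Only the threshold of 10 and the 10 minute duration come from the description above.

```python
from typing import Iterable, Optional, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],
    threshold: float = 10.0,      # load avg threshold from the description
    hold_seconds: float = 600.0,  # 10 minutes, from the description
) -> bool:
    """Return True if the value stays above `threshold` without interruption
    for at least `hold_seconds`.

    `samples` is a hypothetical input: (unix_timestamp, load_avg) pairs in
    ascending time order.
    """
    breach_start: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of this breach
            if timestamp - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the clock
    return False


# A 5 minute spike followed by normal load: should not page anyone.
spike = [(t, 12.0 if t < 300 else 2.0) for t in range(0, 1200, 60)]
# Load pinned above the threshold for a full 10 minutes.
sustained = [(t, 12.0) for t in range(0, 660, 60)]

print(sustained_breach(spike))      # False
print(sustained_breach(sustained))  # True
```

The reset on every dip is what keeps short spikes from firing. The trade-off is that a node hovering right around the threshold may never trigger.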

Related Issue(s)

Fixes #

How to test

Release Notes

none

Documentation

Werft options:

  • /werft with-preview

@sagor999 sagor999 requested a review from a team July 15, 2022 21:05
@sagor999
Contributor Author

/hold
wait for runbook to be merged first: https://github.com/gitpod-io/runbooks/pull/371

@github-actions github-actions bot added the team: workspace (Issue belongs to the Workspace team) label Jul 15, 2022
@meysholdt
Member

In the platform sync, the concern came up that this alert would be too noisy and lead to animosity among on-callers. Adding @vulkoingim as a reviewer to share ideas on how to make it less noisy.

@vulkoingim
Contributor

Heya, we discussed this a bit in the platform sync. I think the 1 min load average is not a good indicator of high load: the interval is too short, and short bursts of high load are common, so you'll be alerting on spikes. Ideally you should be looking at 5m or, even better, 15m. There's no recording rule for the longer timeframes, so I would suggest adding one similar to: https://github.com/gitpod-io/gitpod/blob/main/operations/observability/mixins/workspace/rules/components/nodes/rules.libsonnet
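
To illustrate why the longer window is less spiky, here is a small Python simulation. It is a simplified model, not the kernel's code and not the recording rule in the linked file: it treats load average as an exponentially damped moving average updated every 5 seconds, which is roughly how Linux computes the 1, 5, and 15 minute figures. The helper `damped_average` and the workload (a 2 minute burst of 40 runnable tasks on a node otherwise at load 1) are assumptions made up for the example.

```python
import math
from typing import List, Sequence


def damped_average(
    samples: Sequence[float],
    window_seconds: float,
    step_seconds: float = 5.0,
    initial: float = 1.0,
) -> List[float]:
    """Exponentially damped moving average over `samples`.

    Each sample is the number of runnable tasks at one tick; ticks are
    `step_seconds` apart. `initial` is the assumed starting average.
    """
    decay = math.exp(-step_seconds / window_seconds)
    average = initial
    history = []
    for sample in samples:
        average = average * decay + sample * (1.0 - decay)
        history.append(average)
    return history


# Hypothetical workload: 10 min quiet, 2 min burst, 10 min quiet (5 s ticks).
workload = [1.0] * 120 + [40.0] * 24 + [1.0] * 120

one_min = damped_average(workload, window_seconds=60)
fifteen_min = damped_average(workload, window_seconds=900)

THRESHOLD = 10.0  # threshold proposed in the PR description
print(f"peak 1m load avg:  {max(one_min):.1f}")
print(f"peak 15m load avg: {max(fifteen_min):.1f}")
print("1m crosses threshold: ", max(one_min) > THRESHOLD)
print("15m crosses threshold:", max(fifteen_min) > THRESHOLD)
```

In this model the 2 minute burst pushes the 1 minute average well past 10, while the 15 minute average stays below it. That is the behaviour the comment above relies on.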

Contributor

@kylos101 kylos101 left a comment


@sagor999 one suggestion that I think will make this more palatable

@sagor999
Contributor Author

sagor999 commented Aug 5, 2022

/unhold

Contributor

@kylos101 kylos101 left a comment


🚀 🙏

@roboquat roboquat merged commit 06a686a into main Aug 5, 2022
@roboquat roboquat deleted the pavel/alert branch August 5, 2022 19:11
@kylos101 kylos101 changed the title from "[alerts] change load avg alert to critical" to "[alerts] change load avg alert to warning and route to Slack" Aug 5, 2022
@roboquat roboquat added the deployed: workspace (Workspace team change is running in production) and deployed (Change is completely running in production) labels Aug 10, 2022
Labels
  • deployed: workspace (Workspace team change is running in production)
  • deployed (Change is completely running in production)
  • release-note-none
  • size/XS
  • team: workspace (Issue belongs to the Workspace team)
Status: Done

5 participants