[alerts] change load avg alert to warning and route to Slack #11420

sagor999 · 2022-07-15T21:05:00Z

Description

Add critical alert for when load avg is high for too long.

First, why change to critical alert:
If load average is high for too long, it affects all workspaces on that node, so we want to react fast, hence a critical alert (otherwise the on-call person may be working on something else and miss it for some time).

Why change duration to 10 min:
We want to minimize the number of false positives as much as possible. We should alert/wake someone up only on a sustained high load average.

Why change load avg check to above 10:
Again, we want to be alerted only on a legitimate signal. A load avg above 10 for more than 10 min should, in almost all cases, signal that something is seriously wrong on that node.
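
To make the intent of "above 10 for more than 10 min" concrete, here is a minimal sketch of a sustained-threshold check. It is only an illustration of the idea, not the actual rule from alerts.libsonnet (which is not reproduced here); the class name `SustainedThreshold` and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedThreshold:
    """Fires only after the value has stayed above `threshold` for `hold_seconds`."""

    threshold: float = 10.0       # load avg limit from the description
    hold_seconds: float = 600.0   # 10 minutes
    _breach_start: Optional[float] = None  # when the current breach began

    def observe(self, timestamp: float, load_avg: float) -> bool:
        # Any reading at or below the threshold resets the timer.
        if load_avg <= self.threshold:
            self._breach_start = None
            return False
        # First reading above the threshold starts the timer.
        if self._breach_start is None:
            self._breach_start = timestamp
        # Fire only once the breach has lasted long enough.
        return timestamp - self._breach_start >= self.hold_seconds


alert = SustainedThreshold()
print(alert.observe(0, 12.0))    # False: breach just started
print(alert.observe(300, 12.0))  # False: only 5 minutes so far
print(alert.observe(600, 12.0))  # True: sustained for 10 minutes
```

A short dip below the threshold resets the timer, which is what keeps brief spikes from paging anyone.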

Related Issue(s)

Fixes #

How to test

Release Notes

none

Documentation

Werft options:

/werft with-preview

sagor999 · 2022-07-15T21:05:12Z

/hold
wait for runbook to be merged first: https://github.com/gitpod-io/runbooks/pull/371

meysholdt · 2022-07-25T12:51:59Z

In the platform sync, a concern came up that this alert would be too noisy and lead to animosity among on-callers. Adding @vulkoingim as a reviewer to share ideas on how to make it less noisy.

vulkoingim · 2022-07-25T13:10:09Z

Heya, we discussed this a bit in the platform sync. I think that looking at the 1 min load average is not a good indicator of high load: the interval is too short, and it's common to have high load over a short time, so you'll be alerting on spikes. Ideally you should be looking at 5m or, even better, 15m. There's no recording rule for the longer timeframes, so I would suggest adding one similar to: https://github.com/gitpod-io/gitpod/blob/main/operations/observability/mixins/workspace/rules/components/nodes/rules.libsonnet
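
A rough sketch of why a longer window is less spiky, assuming hypothetical per-minute load readings and a plain mean over the window. Real load averages are exponentially damped rather than simple means, so this only illustrates the smoothing effect; the sample data and the `window_mean` helper are made up for the example.

```python
from typing import List


def window_mean(samples: List[float], window: int) -> float:
    """Mean of the last `window` samples (or all of them if fewer exist)."""
    recent = samples[-window:]
    return sum(recent) / len(recent)


# Hypothetical readings: a quiet node with a 2-minute spike at the end.
samples = [2.0] * 13 + [20.0, 20.0]
threshold = 10.0

short_view = window_mean(samples, 1)   # reacts to the spike immediately
long_view = window_mean(samples, 15)   # averages the spike away

print(short_view > threshold)  # True
print(long_view > threshold)   # False
```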

kylos101

@sagor999 one suggestion that I think will make this more palatable

operations/observability/mixins/workspace/rules/components/nodes/alerts.libsonnet

sagor999 · 2022-08-05T19:08:28Z

/unhold

kylos101

🚀 🙏

sagor999 requested a review from a team July 15, 2022 21:05

roboquat added the release-note-none label Jul 15, 2022

roboquat added do-not-merge/hold size/S labels Jul 15, 2022

github-actions bot added the team: workspace label Jul 15, 2022

meysholdt requested a review from vulkoingim July 25, 2022 12:50

kylos101 requested changes Aug 5, 2022

operations/observability/mixins/workspace/rules/components/nodes/alerts.libsonnet

[alerts] change load avg alert to critical

01a6aad

sagor999 force-pushed the pavel/alert branch from 238e60e to 01a6aad Compare August 5, 2022 19:07

sagor999 requested a review from kylos101 August 5, 2022 19:07

roboquat added size/XS and removed size/S labels Aug 5, 2022

roboquat removed the do-not-merge/hold label Aug 5, 2022

kylos101 approved these changes Aug 5, 2022

roboquat merged commit 06a686a into main Aug 5, 2022

roboquat deleted the pavel/alert branch August 5, 2022 19:11

kylos101 changed the title ~~[alerts] change load avg alert to critical~~ [alerts] change load avg alert to warning and route to Slack Aug 5, 2022

roboquat added the deployed: workspace and deployed labels Aug 10, 2022
