-
Notifications
You must be signed in to change notification settings - Fork 27
feat: configurable alarm severities #1768
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: github-actions <github-actions@github.com>
Did you see #1767? Wonder if orchestration failure should be medium severity instead. |
src/package-sources/npmjs.ts
Outdated
// there is nothing that can be done from our end, besides waiting for the replica to be all | ||
// caught up. | ||
monitoring.addLowSeverityAlarm( | ||
monitoring.addMediumSeverityAlarm( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this is the first time we are using addMediumSeverityAlarm
anywhere. Not an issue in itself, but this alarm will not initialize without alarmActions.mediumSeverity
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap. I'm actually revising this PR to support configurable severities, which won't change the default here.
Pull request was converted to draft
Yeah, maybe you're right. Lets discuss in ops. |
Followup on an internal team discussion:
Backend Orchestration Failed
This alarm fires on a single execution failure -> setting it to a high severity is too harsh because sporadic singular failures can happen from time to time, and its not an indication of a severe problem.
Given we also have a high severity alarm for when execution failure rate is above 75%, we want to set this alarm to be low severity.
New version visibility SLA breached
This alarm fires when the package canary is not ingested in time, according to the defined SLA. It is, presumably, an indication of a major problem because the package is predictable and processable. However, this alarm is also impacted by the npm replica, which, from to time, experiences significant delays out of our control. We therefore decided to set it to a low severity initally.
Since then, we have increased our SLA and the replica itself has also undergone substantial changes. Seems like an appropriate timing to start increasing the severity.
Note
We could have also just changed the severities, without providing a way to configure them. This however would be a breaking change. Construct hub is not a stable package, so we would have gotten away with it, but I decide to play it safe.
Configurable severities seems like a good feature to have anyway.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license