Skip to content

Conversation

iliapolo
Copy link
Contributor

@iliapolo iliapolo commented Jul 20, 2025

Followup on an internal team discussion:

Backend Orchestration Failed

This alarm fires on a single execution failure -> setting it to a high severity is too harsh because sporadic singular failures can happen from time to time, and its not an indication of a severe problem.

Given we also have a high severity alarm for when execution failure rate is above 75%, we want to set this alarm to be low severity.

New version visibility SLA breached

This alarm fires when the package canary is not ingested in time, according to the defined SLA. It is, presumably, an indication of a major problem because the package is predictable and processable. However, this alarm is also impacted by the npm replica, which, from to time, experiences significant delays out of our control. We therefore decided to set it to a low severity initally.

Since then, we have increased our SLA and the replica itself has also undergone substantial changes. Seems like an appropriate timing to start increasing the severity.

Note

We could have also just changed the severities, without providing a way to configure them. This however would be a breaking change. Construct hub is not a stable package, so we would have gotten away with it, but I decide to play it safe.
Configurable severities seems like a good feature to have anyway.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

Signed-off-by: github-actions <github-actions@github.com>
@iankhou
Copy link
Contributor

iankhou commented Jul 21, 2025

Did you see #1767? Wonder if orchestration failure should be medium severity instead.

// there is nothing that can be done from our end, besides waiting for the replica to be all
// caught up.
monitoring.addLowSeverityAlarm(
monitoring.addMediumSeverityAlarm(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is the first time we are using addMediumSeverityAlarm anywhere. Not an issue in itself, but this alarm will not initialize without alarmActions.mediumSeverity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap. I'm actually revising this PR to support configurable severities, which won't change the default here.

@iliapolo iliapolo changed the title fix!: some alarm severities are improperly configured feat: custom alarm severities Jul 21, 2025
@iliapolo iliapolo marked this pull request as draft July 21, 2025 07:30
auto-merge was automatically disabled July 21, 2025 07:30

Pull request was converted to draft

@iliapolo
Copy link
Contributor Author

Wonder if orchestration failure should be medium severity instead.

Yeah, maybe you're right. Lets discuss in ops.

@iliapolo iliapolo changed the title feat: custom alarm severities feat: configurable alarm severities Jul 21, 2025
@iliapolo iliapolo marked this pull request as ready for review July 21, 2025 07:57
@cdklabs-automation cdklabs-automation added this pull request to the merge queue Jul 21, 2025
Merged via the queue into main with commit 2126bd2 Jul 21, 2025
10 checks passed
@cdklabs-automation cdklabs-automation deleted the epolon/reconfigure-alarm-severities branch July 21, 2025 08:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants