Add SLO burn-rate alerting and error budget dashboard #147

Merged
scotwells merged 12 commits into main from fix/alerting-and-resources
Mar 26, 2026

@scotwells
Contributor

Summary

Adds SLO-based alerting with multi-window burn-rate detection for the activity-apiserver. Validated against production incident data.

SLOs defined

| Category | Threshold | Target |
|---|---|---|
| Metadata (activitypolicies) | p99 < 1s | 99% |
| Audit log queries | p99 < 3s | 99% |
| Activity queries | p99 < 3s | 99% |
| Event queries | p99 < 3s | 99% |
| Availability | Non-5xx | 99% |

A 99.9% target is also tracked on the dashboard as a north-star goal.
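
To make the gap between the 99% target and the 99.9% north-star concrete, here is a minimal Python sketch of the error budget each one allows. The 30-day period is an assumption for illustration only; the PR does not state the SLO period, and `error_budget_minutes` is a hypothetical helper, not part of the mixin.

```python
# Assumption: a 30-day SLO period. The PR does not state the period length.
PERIOD_MINUTES = 30 * 24 * 60


def error_budget_minutes(target: float, period_minutes: int = PERIOD_MINUTES) -> float:
    """Minutes of total failure that `target` tolerates over the period."""
    return (1.0 - target) * period_minutes


if __name__ == "__main__":
    # Compare the alerting target with the north-star goal.
    for target in (0.99, 0.999):
        budget = error_budget_minutes(target)
        print(f"{target:.1%} target: {budget:.1f} budget minutes")
```

Under that assumed period, moving from 99% to 99.9% shrinks the budget by a factor of ten, which is presumably why the stricter figure is tracked rather than alerted on.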

What's included

  • Recording rules: SLI good/total ratios and error ratios at 5 time windows per SLO
  • 15 burn-rate alerts: 5 SLOs × 3 urgency tiers (page/ticket/low) with zero-traffic guards
  • Grafana dashboard: SLO status, error budget remaining, burn rate trends, per-category latency

Validation

Tested against data from the March 24 production incident:

  • Audit query SLO would have fired at 19:12 UTC, about 2 hours before peak impact at 21:25 UTC
  • Both 1h and 5m windows exceeded the 0.144 page threshold (peaks: 0.999 and 1.0)
  • Availability SLO would have fired at ~20:00 UTC
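
The 0.144 page threshold is consistent with a 14.4x burn rate against a 1% error budget. The following Python sketch shows how a two-window page condition with a zero-traffic guard might be evaluated. The 14.4 factor is inferred from 0.144 / 0.01 rather than quoted from the rules, and `should_page` is a hypothetical name; the actual alerts are Prometheus rules, not Python.

```python
SLO_TARGET = 0.99
ERROR_BUDGET = 1.0 - SLO_TARGET          # 1% of requests may fail
PAGE_BURN_RATE = 14.4                    # assumption: inferred from 0.144 / 0.01
PAGE_THRESHOLD = PAGE_BURN_RATE * ERROR_BUDGET


def should_page(error_ratio_1h: float, error_ratio_5m: float, total_rate: float) -> bool:
    """Page only if both windows exceed the threshold and there is traffic."""
    if total_rate <= 0:
        # Zero-traffic guard: never alert on an idle endpoint.
        return False
    return error_ratio_1h > PAGE_THRESHOLD and error_ratio_5m > PAGE_THRESHOLD


if __name__ == "__main__":
    # Peak values reported for the March 24 incident; the traffic rate is made up.
    print(should_page(error_ratio_1h=0.999, error_ratio_5m=1.0, total_rate=2.0))
```

Requiring both windows means the long window confirms the burn is sustained while the short window confirms it is still happening.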

Test plan

  • Deploy to staging and verify all 15 alerts evaluate without errors
  • Confirm alerts are inactive during normal operation (zero-traffic guards working)
  • Verify dashboard panels render correctly in Grafana
  • Confirm recording rules produce data after ~5 minutes of scraping

Closes #145

🤖 Generated with Claude Code

scotwells and others added 11 commits March 25, 2026 22:40
Defines five SLOs for the activity-apiserver (99% target, 99.9% north-star):
- Metadata operations (activitypolicies): p99 < 1s
- Audit log queries: p99 < 3s
- Activity queries: p99 < 3s
- Event queries: p99 < 3s
- Availability: ratio of non-5xx responses

Adds 15 multi-window burn-rate alerts (5 SLOs × 3 urgency tiers) with
zero-traffic guards to prevent false positives during idle periods.
Includes Grafana dashboard for error budget tracking.

Validated against production March 24 incident data: the audit query SLO
would have fired at 19:12 UTC, ~2 hours before peak impact at 21:25 UTC.

Closes #145

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Output of `task observability:build-mixin` after adding SLO recording
rules and dashboard Jsonnet source.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move generated outputs to the correct config/ paths. The previous commit
placed them under observability/config/ because of a working-directory issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change clamp_min denominator from 1 to 0.001 in error ratio calculations.
With rates (req/s) as the denominator, clamping to 1 massively inflated
the error ratio when traffic was low (e.g., 0.013 req/s of fully successful
traffic clamped to 1 produced 1 - 0.013/1 = 0.987, a 98.7% error ratio).
The zero-traffic guard in alerts prevents firing when total is 0, but the
clamp value must be proportionate to expected rate magnitudes.

Validated in staging: all 15 SLO alerts now correctly inactive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
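
The effect of the clamp floor described in the commit above can be reproduced with a few lines of Python. This is a sketch that assumes the error ratio has the form `1 - good / clamp_min(total, floor)`; the function names are hypothetical stand-ins for the PromQL recording rules.

```python
def clamp_min(value: float, floor: float) -> float:
    """Python stand-in for PromQL's clamp_min()."""
    return max(value, floor)


def error_ratio(good_rate: float, total_rate: float, floor: float) -> float:
    """Error ratio with a clamped denominator (assumed rule shape)."""
    return 1.0 - good_rate / clamp_min(total_rate, floor)


if __name__ == "__main__":
    # 0.013 req/s, every request successful.
    good = total = 0.013
    print("floor=1    :", error_ratio(good, total, floor=1.0))
    print("floor=0.001:", error_ratio(good, total, floor=0.001))
```

With a floor of 1 the healthy low-traffic endpoint looks almost entirely broken; with a floor of 0.001 the real denominator is used and the ratio is zero.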
Add GrafanaDashboard CRD and ConfigMap generator entry so the SLO
dashboard is deployed automatically via Flux.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dashboard panels now return no data during zero-traffic periods instead of
computing 0/0.001 = 0 and displaying a red 0% panel. An AND filter on
total > 0 drops idle series, so Grafana renders "No data" for idle endpoints.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use Grafana value mappings to display "No traffic" in a neutral color
instead of "No data" in red when an endpoint has zero requests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reduce panel widths from 6 to 4 columns so all 5 SLO stat and gauge
panels fit on one line within Grafana's 24-column grid. Use auto-layout
via util.grid.makeGrid for positioning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
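
The width change in the commit above comes down to simple grid arithmetic: five panels of width 4 need 20 columns, while five of width 6 need 30 and overflow the 24-column grid. The Python sketch below illustrates that wrapping behaviour. `wrap_panels` is a hypothetical helper written for this example; it is not the `util.grid` implementation, and the panel height of 4 is an arbitrary assumption.

```python
GRID_COLUMNS = 24  # Grafana's dashboard grid width


def wrap_panels(widths, height=4, columns=GRID_COLUMNS):
    """Place panels left to right, starting a new row when one would overflow."""
    positions = []
    x = y = 0
    for w in widths:
        if x + w > columns:
            # Panel does not fit on the current row: wrap to the next one.
            x = 0
            y += height
        positions.append({"x": x, "y": y, "w": w, "h": height})
        x += w
    return positions


if __name__ == "__main__":
    for width in (6, 4):
        rows = {p["y"] for p in wrap_panels([width] * 5)}
        print(f"width {width}: {len(rows)} row(s)")
```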
Replace manual Y coordinate tracking with util.grid.wrapPanels for
automatic panel positioning. Extract reusable helpers for burn rate and
latency panels to reduce duplication. Add verb!="WATCH" to metadata
latency panel to exclude long-lived watch connections from the chart.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The burnRateTarget helper was producing 'error_ratio:raterate5m' instead
of 'error_ratio:rate5m' because the window parameter already included
the 'rate' prefix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
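
The double-prefix bug in the commit above is easy to reproduce outside Jsonnet. The Python sketch below assumes the helper concatenated a literal `rate` with a window argument that already carried it; both function names are hypothetical, and the actual fix in the mixin may differ (for example, it could strip the prefix from the argument instead).

```python
def burn_rate_target_buggy(window: str) -> str:
    """Prepends 'rate' even though callers already pass e.g. 'rate5m'."""
    return f"error_ratio:rate{window}"


def burn_rate_target_fixed(window: str) -> str:
    """Uses the window string as-is, since it already includes 'rate'."""
    return f"error_ratio:{window}"


if __name__ == "__main__":
    print(burn_rate_target_buggy("rate5m"))
    print(burn_rate_target_fixed("rate5m"))
```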
@scotwells scotwells requested a review from kevwilliams March 26, 2026 14:24
Error ratio rules were computing 1 - (0 / 0.001) = 1.0 when there was
no traffic, causing burn rate panels to spike to 100%. Add a zero-traffic
guard using `* on() (total > bool 0)`, which multiplies the ratio by 0 when
total is zero, so the rule returns 0 instead of 1. This makes burn rate
charts show 0 (healthy) during idle periods instead of false 100% spikes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
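
The guard in the commit above works because `> bool 0` yields 1 or 0 rather than filtering, so multiplying by it zeroes the ratio when nothing was served. A minimal Python sketch of the same arithmetic follows; the function names are hypothetical and the 0.001 floor is taken from the earlier clamp change.

```python
def error_ratio_unguarded(good: float, total: float, floor: float = 0.001) -> float:
    """1 - good / clamp_min(total, floor), as in the original rules."""
    return 1.0 - good / max(total, floor)


def error_ratio_guarded(good: float, total: float, floor: float = 0.001) -> float:
    """Same ratio multiplied by a 0/1 traffic indicator, like `total > bool 0`."""
    has_traffic = 1.0 if total > 0 else 0.0
    return error_ratio_unguarded(good, total, floor) * has_traffic


if __name__ == "__main__":
    # Idle endpoint: no good requests, no total requests.
    print("unguarded:", error_ratio_unguarded(0.0, 0.0))
    print("guarded  :", error_ratio_guarded(0.0, 0.0))
```

When there is traffic the indicator is 1 and the guarded ratio equals the unguarded one, so real error spikes are unaffected.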
@scotwells scotwells merged commit 64ce455 into main Mar 26, 2026
7 checks passed
@scotwells scotwells deleted the fix/alerting-and-resources branch March 26, 2026 15:27

Development

Successfully merging this pull request may close these issues.

Define SLOs and add burn-rate alerting for activity-apiserver
