Add SLO burn-rate alerting and error budget dashboard by scotwells · Pull Request #147 · datum-cloud/activity

scotwells · 2026-03-26T03:40:49Z

Summary

Adds SLO-based alerting with multi-window burn-rate detection for the activity-apiserver. Validated against production incident data.

SLOs defined

Category	Threshold	Target
Metadata (activitypolicies)	p99 < 1s	99%
Audit log queries	p99 < 3s	99%
Activity queries	p99 < 3s	99%
Event queries	p99 < 3s	99%
Availability	Non-5xx	99%

99.9% tracked on dashboard as a north-star goal.

What's included

Recording rules: SLI good/total ratios and error ratios at 5 time windows per SLO
15 burn-rate alerts: 5 SLOs × 3 urgency tiers (page/ticket/low) with zero-traffic guards
Grafana dashboard: SLO status, error budget remaining, burn rate trends, per-category latency

Validation

Tested against production March 24 incident data:

Audit query SLO would have fired at 19:12 UTC, about 2 hours before peak impact at 21:25 UTC
Both 1h and 5m windows exceeded the 0.144 page threshold (peaks: 0.999 and 1.0)
Availability SLO would have fired at ~20:00 UTC
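
The 0.144 page threshold is consistent with a burn-rate factor of 14.4 applied to the 1% error budget of a 99% target (14.4 × 0.01 = 0.144). The sketch below illustrates that arithmetic and the two-window check in Python. The factor of 14.4 is inferred from the threshold rather than stated in the PR, and the function and parameter names are hypothetical, not taken from the actual alert rules.

```python
def page_threshold(slo_target: float = 0.99, burn_factor: float = 14.4) -> float:
    """Error-ratio threshold for the page tier: burn factor times error budget.

    burn_factor=14.4 is an assumption inferred from the 0.144 threshold.
    """
    return burn_factor * (1.0 - slo_target)


def should_page(error_ratio_1h: float, error_ratio_5m: float,
                slo_target: float = 0.99, burn_factor: float = 14.4) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    threshold = page_threshold(slo_target, burn_factor)
    return error_ratio_1h > threshold and error_ratio_5m > threshold


if __name__ == "__main__":
    print(round(page_threshold(), 3))   # 0.144
    print(should_page(0.999, 1.0))      # True: the peaks reported for the incident
    print(should_page(0.20, 0.05))      # False: the short window is below threshold
```

Requiring both windows means a brief spike that has already subsided (high 1h ratio, low 5m ratio) does not page.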

Test plan

Deploy to staging and verify all 15 alerts evaluate without errors
Confirm alerts are inactive during normal operation (zero-traffic guards working)
Verify dashboard panels render correctly in Grafana
Confirm recording rules produce data after ~5 minutes of scraping

Closes #145

🤖 Generated with Claude Code

Defines five SLOs for the activity-apiserver (99% target, 99.9% north-star):

- Metadata operations (activitypolicies): p99 < 1s
- Audit log queries: p99 < 3s
- Activity queries: p99 < 3s
- Event queries: p99 < 3s
- Availability: non-5xx error rate

Adds 15 multi-window burn-rate alerts (5 SLOs × 3 urgency tiers) with zero-traffic guards to prevent false positives during idle periods. Includes Grafana dashboard for error budget tracking.

Validated against production March 24 incident data: the audit query SLO would have fired at 19:12 UTC, ~2 hours before peak impact at 21:25 UTC.

Closes #145

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Output of `task observability:build-mixin` after adding SLO recording rules and dashboard Jsonnet source. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Move generated outputs to the correct config/ paths. Previous commit placed them under observability/config/ due to working directory issue. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Change the clamp_min floor on the denominator from 1 to 0.001 in error ratio calculations. With rates (req/s) as the denominator, clamping to 1 massively inflated the denominator when traffic was low, and with it the error ratio (e.g., 0.013 req/s clamped to 1 produced a 98.7% error ratio). The zero-traffic guard in alerts prevents firing when total is 0, but the clamp value must be proportionate to expected rate magnitudes. Validated in staging: all 15 SLO alerts are now correctly inactive. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
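
The 98.7% figure can be reproduced with a small Python sketch of the clamped ratio. It assumes every request in the low-traffic window succeeded, so the true error ratio is 0. The function name and signature are hypothetical and only mimic the arithmetic of `1 - good / clamp_min(total, floor)`.

```python
def error_ratio(good_rate: float, total_rate: float, clamp_floor: float) -> float:
    """Return 1 - good/total, with the denominator clamped to at least clamp_floor."""
    return 1.0 - good_rate / max(total_rate, clamp_floor)


if __name__ == "__main__":
    # Assumed scenario: 0.013 req/s of traffic, all of it successful.
    low_traffic = 0.013
    # Old floor of 1: the denominator is inflated from 0.013 to 1.
    print(round(error_ratio(low_traffic, low_traffic, clamp_floor=1.0), 3))  # 0.987
    # New floor of 0.001: the real rate is above the floor, so it is left alone.
    print(error_ratio(low_traffic, low_traffic, clamp_floor=0.001))          # 0.0
```

The floor only needs to prevent division by zero, so it should sit well below any request rate the service is expected to see.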

Add GrafanaDashboard CRD and ConfigMap generator entry so the SLO dashboard is deployed automatically via Flux. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Dashboard panels now return NaN during zero-traffic periods instead of computing 0/0.001=0 and displaying a red 0% panel. Uses an AND filter on total > 0 so Grafana renders "No data" for idle endpoints. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use Grafana value mappings to display "No traffic" in a neutral color instead of "No data" in red when an endpoint has zero requests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Reduce panel widths from 6 to 4 columns so all 5 SLO stat and gauge panels fit on one line within Grafana's 24-column grid. Use auto-layout via util.grid.makeGrid for positioning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace manual Y coordinate tracking with util.grid.wrapPanels for automatic panel positioning. Extract reusable helpers for burn rate and latency panels to reduce duplication. Add verb!="WATCH" to metadata latency panel to exclude long-lived watch connections from the chart. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The burnRateTarget helper was producing 'error_ratio:raterate5m' instead of 'error_ratio:rate5m' because the window parameter already included the 'rate' prefix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
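
The doubled prefix is easy to reproduce. The following Python sketch is only an illustration of the string bug: the real helper lives in the dashboard's Jsonnet source, and both function names here are hypothetical.

```python
def burn_rate_target_buggy(window: str) -> str:
    """Prepends 'rate' even though the caller already passes e.g. 'rate5m'."""
    return f"error_ratio:rate{window}"


def burn_rate_target_fixed(window: str) -> str:
    """Uses the window parameter as-is, since it already carries the prefix."""
    return f"error_ratio:{window}"


if __name__ == "__main__":
    print(burn_rate_target_buggy("rate5m"))  # error_ratio:raterate5m
    print(burn_rate_target_fixed("rate5m"))  # error_ratio:rate5m
```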

Error ratio rules were computing 1 - (0 / 0.001) = 1.0 when there was no traffic, causing burn rate panels to spike to 100%. Add a zero-traffic guard using `* on() (total > bool 0)` which returns 0 instead of 1 when the denominator has no data. This makes burn rate charts show 0 (healthy) during idle periods instead of false 100% spikes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
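
The effect of the guard can be sketched in Python by treating `total > bool 0` as a 0-or-1 multiplier. This is an illustration under that reading, not the recording rule itself; the function names and the default floor argument are hypothetical.

```python
def unguarded_error_ratio(good_rate: float, total_rate: float,
                          clamp_floor: float = 0.001) -> float:
    """1 - good/total with a clamped denominator; yields 1.0 when both rates are 0."""
    return 1.0 - good_rate / max(total_rate, clamp_floor)


def guarded_error_ratio(good_rate: float, total_rate: float,
                        clamp_floor: float = 0.001) -> float:
    """Multiply by a 0/1 traffic indicator, mirroring `* on() (total > bool 0)`."""
    has_traffic = 1.0 if total_rate > 0 else 0.0
    return unguarded_error_ratio(good_rate, total_rate, clamp_floor) * has_traffic


if __name__ == "__main__":
    print(unguarded_error_ratio(0.0, 0.0))  # 1.0, the false 100% spike
    print(guarded_error_ratio(0.0, 0.0))    # 0.0, idle periods read as healthy
    print(guarded_error_ratio(0.5, 1.0))    # 0.5, unchanged when there is traffic
```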

scotwells and others added 11 commits March 25, 2026 22:40

Generate SLO dashboard JSON and updated recording rules

72b7c65


Fix generated file paths for SLO dashboard and recording rules

d7d3b33


Add generated SLO dashboard and updated recording rules

db7f84c


Add SLO dashboard to kustomize manifests

29f140b


Show neutral 'No traffic' for idle SLO panels

db19f32


Fit all SLO panels on single rows

e1d43e1


Fix burn rate panel metric names (raterate → rate)

59c8406


scotwells requested a review from kevwilliams March 26, 2026 14:24

kevwilliams approved these changes Mar 26, 2026


scotwells merged commit 64ce455 into main Mar 26, 2026
7 checks passed

scotwells deleted the fix/alerting-and-resources branch March 26, 2026 15:27

scotwells merged 12 commits into main from fix/alerting-and-resources
