Add SLO burn-rate alerting and error budget dashboard#147
Merged
Defines five SLOs for the activity-apiserver (99% target, 99.9% north-star):
- Metadata operations (activitypolicies): p99 < 1s
- Audit log queries: p99 < 3s
- Activity queries: p99 < 3s
- Event queries: p99 < 3s
- Availability: non-5xx error rate

Adds 15 multi-window burn-rate alerts (5 SLOs × 3 urgency tiers) with zero-traffic guards to prevent false positives during idle periods. Includes Grafana dashboard for error budget tracking.

Validated against production March 24 incident data: the audit query SLO would have fired at 19:12 UTC, ~2 hours before peak impact at 21:25 UTC.

Closes #145

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
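For readers unfamiliar with multi-window burn-rate alerting, the decision logic can be sketched in Python. This is an illustration only, not the PR's actual rules: the tier names, window pairs, and burn-rate multipliers below are assumptions borrowed from commonly used defaults, and `firing_tiers` is a hypothetical helper. Only the 99% target and the zero-traffic guard come from the commit message.

```python
from dataclasses import dataclass

SLO_TARGET = 0.99                 # 99% target stated in the commit message
ERROR_BUDGET = 1 - SLO_TARGET     # fraction of requests allowed to fail


@dataclass(frozen=True)
class Tier:
    name: str
    long_window: str
    short_window: str
    burn_rate: float              # multiple of the budget consumption rate


# Assumed tiers: the PR says "3 urgency tiers" but does not list
# their windows or thresholds, so these values are illustrative.
TIERS = (
    Tier("critical", "1h", "5m", 14.4),
    Tier("high", "6h", "30m", 6.0),
    Tier("low", "3d", "6h", 1.0),
)


def firing_tiers(error_ratio, request_rate):
    """Return the names of tiers whose alert condition holds.

    error_ratio maps a window name to the observed error ratio.
    request_rate is the total request rate in req/s.
    """
    # Zero-traffic guard: an idle endpoint must never fire.
    if request_rate <= 0:
        return []
    fired = []
    for tier in TIERS:
        threshold = tier.burn_rate * ERROR_BUDGET
        # Both windows must exceed the threshold, so a short spike
        # alone or a long-resolved incident alone does not page.
        if (error_ratio[tier.long_window] > threshold
                and error_ratio[tier.short_window] > threshold):
            fired.append(tier.name)
    return fired


ratios = {"5m": 0.20, "30m": 0.10, "1h": 0.16, "6h": 0.07, "3d": 0.005}
print(firing_tiers(ratios, request_rate=2.0))   # ['critical', 'high']
print(firing_tiers(ratios, request_rate=0.0))   # []
```

Requiring both the long and the short window to exceed the threshold is what keeps the alert from firing on brief spikes and lets it clear soon after recovery.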
Output of `task observability:build-mixin` after adding SLO recording rules and dashboard Jsonnet source. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move generated outputs to the correct config/ paths. The previous commit placed them under observability/config/ because the build ran from the wrong working directory. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change clamp_min denominator from 1 to 0.001 in error ratio calculations. With rates (req/s) as the denominator, clamping to 1 massively inflated the value when traffic was low (e.g., 0.013 req/s clamped to 1 produced a 98.7% error ratio). The zero-traffic guard in alerts prevents firing when total is 0, but the clamp value must be proportionate to expected rate magnitudes. Validated in staging: all 15 SLO alerts now correctly inactive. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
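The arithmetic behind this fix can be reproduced with a small Python sketch. It assumes the recording rule has the shape `1 - good / clamp_min(total, floor)`; the function names are hypothetical stand-ins, not the PR's rule names.

```python
def clamp_min(value, floor):
    """Mimic PromQL clamp_min for a single sample."""
    return max(value, floor)


def error_ratio(good_rate, total_rate, floor):
    """Assumed rule shape: 1 - good / clamp_min(total, floor)."""
    return 1 - good_rate / clamp_min(total_rate, floor)


# A quiet endpoint: 0.013 req/s, every request successful.
good = total = 0.013

print(error_ratio(good, total, floor=1))      # about 0.987, a false 98.7%
print(error_ratio(good, total, floor=0.001))  # 0.0, as expected
```

With a floor of 1, any total rate below 1 req/s is replaced by 1, so the success fraction shrinks and the error ratio is inflated. A floor of 0.001 sits below the observed rate and leaves the real denominator in place.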
Add GrafanaDashboard CRD and ConfigMap generator entry so the SLO dashboard is deployed automatically via Flux. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dashboard panels now return NaN during zero-traffic periods instead of computing 0/0.001=0 and displaying a red 0% panel. Uses an AND filter on total > 0 so Grafana renders "No data" for idle endpoints. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use Grafana value mappings to display "No traffic" in a neutral color instead of "No data" in red when an endpoint has zero requests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reduce panel widths from 6 to 4 columns so all 5 SLO stat and gauge panels fit on one line within Grafana's 24-column grid. Use auto-layout via util.grid.makeGrid for positioning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace manual Y coordinate tracking with util.grid.wrapPanels for automatic panel positioning. Extract reusable helpers for burn rate and latency panels to reduce duplication. Add verb!="WATCH" to metadata latency panel to exclude long-lived watch connections from the chart. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix the burnRateTarget helper, which was producing 'error_ratio:raterate5m' instead of 'error_ratio:rate5m' because the window parameter already included the 'rate' prefix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Error ratio rules were computing 1 - (0 / 0.001) = 1.0 when there was no traffic, causing burn rate panels to spike to 100%. Add a zero-traffic guard using `* on() (total > bool 0)` which returns 0 instead of 1 when the denominator has no data. This makes burn rate charts show 0 (healthy) during idle periods instead of false 100% spikes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
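The effect of the guard can be illustrated in Python for a single sample. This is a sketch under the assumption that the rule is `1 - good / clamp_min(total, 0.001)` multiplied by the 0-or-1 result of `total > bool 0`; both function names are hypothetical.

```python
FLOOR = 0.001  # clamp_min floor from the earlier commit


def unguarded_error_ratio(good_rate, total_rate):
    """Assumed pre-fix rule: 1 - good / clamp_min(total, FLOOR)."""
    return 1 - good_rate / max(total_rate, FLOOR)


def guarded_error_ratio(good_rate, total_rate):
    """Same rule multiplied by a 0/1 traffic flag."""
    # Mirrors `total > bool 0`: 1 when there is traffic, else 0.
    has_traffic = 1 if total_rate > 0 else 0
    return unguarded_error_ratio(good_rate, total_rate) * has_traffic


# Idle endpoint: no requests at all.
print(unguarded_error_ratio(0, 0))   # 1.0, the false 100% spike
print(guarded_error_ratio(0, 0))     # 0.0, shown as healthy

# Endpoint with traffic: 9 of 10 req/s succeed, so the guard is a no-op.
print(guarded_error_ratio(9, 10))
```

Multiplying by the flag leaves the ratio untouched whenever there is traffic and forces it to 0 when idle, which is why the burn-rate charts no longer spike during quiet periods.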
kevwilliams
approved these changes
Mar 26, 2026
Summary
Adds SLO-based alerting with multi-window burn-rate detection for the activity-apiserver. Validated against production incident data.
SLOs defined
99.9% tracked on dashboard as a north-star goal.
What's included
Validation
Tested against production March 24 incident data: the audit query SLO would have fired at 19:12 UTC, ~2 hours before peak impact at 21:25 UTC.
Test plan
Closes #145
🤖 Generated with Claude Code