Fix false 100% burn rates for low-traffic SLO endpoints #167

Merged
scotwells merged 2 commits into main from fix/alerting-and-resources
Mar 30, 2026

@scotwells
Contributor

Summary

Fixes multi-window SLO error ratio recording rules that were producing false 100% values for low-traffic endpoints (metadata, audit queries, activity queries).

Problem

The 1h, 6h, and 3d error ratio rules used `clamp_min(total, 0.001)` to avoid division by zero, combined with a `* (rate5m > bool 0)` guard. During brief traffic pulses, the guard opened momentarily while the window's total was still near zero, producing `1 - (0/0.001) = 1` (a 100% error ratio). This caused the burn rate dashboard to show false spikes.
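
The arithmetic behind the false spike can be modeled in a few lines of Python. This is only a sketch of the expression shape described above, not the recording rule itself: the function `old_error_ratio` and its parameter names are hypothetical, and it assumes the ratio was computed as 1 minus good over total.

```python
def clamp_min(value: float, floor: float) -> float:
    """Mirror PromQL's clamp_min: never return less than `floor`."""
    return max(value, floor)


def old_error_ratio(good_rate: float, total_rate: float, total_rate_5m: float) -> float:
    """Model of the old long-window rule (hypothetical names).

    good_rate and total_rate are request rates over the long window
    (1h, 6h, or 3d); total_rate_5m is the 5m total used by the guard.
    """
    ratio = 1 - good_rate / clamp_min(total_rate, 0.001)
    guard = 1.0 if total_rate_5m > 0 else 0.0  # `> bool 0` yields 1 or 0
    return ratio * guard


# No traffic anywhere: the guard stays closed and the result is 0.
print(old_error_ratio(0.0, 0.0, 0.0))   # 0.0
# Brief pulse: the 5m guard opens while the long-window total is still 0.
print(old_error_ratio(0.0, 0.0, 0.01))  # 1.0
```

The second call reproduces the `1 - (0/0.001) = 1` case: nothing failed, yet the modeled rule reports a 100% error ratio.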

Fix

Replace `clamp_min` + bool multiplication with direct division and per-window `and total > 0` guards, the same pattern already working correctly for the 5m rules. Each window now uses its own total for the guard instead of referencing the 5m total.
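
A minimal Python model of the per-window guard, under the same caveat: `new_error_ratio` and its parameters are hypothetical names, and `None` stands in for a window that returns no data.

```python
from typing import Optional


def new_error_ratio(good_rate: float, total_rate: float) -> Optional[float]:
    """Model of the per-window pattern (hypothetical names).

    The guard uses the same window's total as the division. When that
    total is not positive, the sample is dropped (None here) instead of
    being reported as 0 or 1. The check runs first because Python raises
    on division by zero, whereas PromQL would produce NaN and then
    filter it out with `and total > 0`.
    """
    if not total_rate > 0:
        return None
    return 1 - good_rate / total_rate


# No traffic in this window: no data rather than a false 100%.
print(new_error_ratio(0.0, 0.0))  # None
# Three of four requests succeeded in this window.
print(new_error_ratio(3.0, 4.0))  # 0.25
```

Because the guard and the denominator come from the same window, a pulse in the 5m window can no longer unlock a longer window whose own total is still zero.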

Test plan

  • Deploy and verify metadata error_ratio:rate1h shows 0 or no data (not 1)
  • Verify availability SLO continues measuring correctly
  • Confirm burn rate dashboard panels show clean data without false spikes

🤖 Generated with Claude Code

@scotwells scotwells requested a review from kevwilliams March 26, 2026 21:20
Replace clamp_min + bool multiplication guard with direct division and
per-window 'and total > 0' guard. The previous approach produced false
100% error ratios during traffic transitions because:

1. clamp_min(total, 0.001) computed 1-(0/0.001)=1 when total was 0
2. The bool guard used the 5m total which could briefly be >0 during
   traffic pulses, letting the false 1 through to the longer windows

The new approach uses each window's own total for the guard and returns
no data (instead of 1) when there's no traffic in that specific window.
This matches the pattern already working correctly for the 5m rules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@scotwells scotwells force-pushed the fix/alerting-and-resources branch from fd9047d to bb7307b March 26, 2026 21:41
@scotwells scotwells merged commit 956cf6f into main Mar 30, 2026
7 checks passed
@scotwells scotwells deleted the fix/alerting-and-resources branch March 30, 2026 22:49
2 participants