Reduce alert noise and increase apiserver CPU headroom #141

Merged
scotwells merged 3 commits into main from fix/alerting-and-resources on Mar 26, 2026

Conversation

@scotwells
Contributor

Summary

  • Fix alerts that fired as false positives for 7+ days during low-traffic periods
  • Raise apiserver CPU limits to prevent throttling under load spikes
  • Exclude WATCH connections from latency metrics so dashboards reflect real query performance
  • Add investigation documents from the production alert review

Changes

Alerts: Replace rate-based NATS consumer alerts with backlog-aware versions that tolerate expected zero-traffic periods. Fix a pipeline stall alert that could never fire due to a wrong label selector. Lower the latency alert threshold to catch real user-impacting slowdowns. Add a new alert for ClickHouse keeper session health.
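
For reviewers, a minimal sketch of why a backlog-aware condition tolerates idle periods where a rate-based one does not. This is an illustration in Python rather than the actual alert rules; the `ConsumerSample` fields, the function names, and the `min_rate` threshold are all hypothetical and do not come from this PR.

```python
from dataclasses import dataclass


@dataclass
class ConsumerSample:
    """One observation of a consumer (field names are hypothetical)."""
    pending_messages: int    # messages waiting to be processed
    acks_per_second: float   # recent acknowledgement rate


def rate_based_alert(sample: ConsumerSample, min_rate: float = 0.1) -> bool:
    """Rate-only condition: fires whenever throughput is low, even when idle."""
    return sample.acks_per_second < min_rate


def backlog_aware_alert(sample: ConsumerSample, min_rate: float = 0.1) -> bool:
    """Fires only when work is waiting and the consumer is not draining it."""
    return sample.pending_messages > 0 and sample.acks_per_second < min_rate


# Three illustrative states (values are made up for the example).
idle = ConsumerSample(pending_messages=0, acks_per_second=0.0)
stuck = ConsumerSample(pending_messages=500, acks_per_second=0.0)
healthy = ConsumerSample(pending_messages=500, acks_per_second=50.0)
```

In this sketch the idle consumer trips the rate-only check but not the backlog-aware one, while the stuck consumer trips both.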

CPU resources: Increase the apiserver CPU limit from 500m to 1 core to eliminate CFS throttling observed during load spikes. Proportionally increase dev/test limits.
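
As a rough illustration of the throttling arithmetic, the sketch below assumes the default 100 ms CFS period and a hypothetical burst that needs 80 ms of CPU time within one period. The function names and the burst size are assumptions for the example, not measurements from this PR.

```python
CFS_PERIOD_US = 100_000  # assumed default CFS period (100 ms)


def cfs_quota_us(cpu_limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time the container may consume per period, in microseconds."""
    return cpu_limit_millicores * period_us // 1000


def throttled_us(demand_us: int, cpu_limit_millicores: int) -> int:
    """CPU time in one period that exceeds the quota and must wait."""
    return max(0, demand_us - cfs_quota_us(cpu_limit_millicores))


# Hypothetical burst: 80 ms of CPU demand inside a single 100 ms period.
burst_us = 80_000
deferred_at_500m = throttled_us(burst_us, 500)    # quota is 50 ms per period
deferred_at_1000m = throttled_us(burst_us, 1000)  # quota is 100 ms per period
```

Under these assumptions, the 500m limit defers 30 ms of the burst to later periods, while the 1-core limit lets it finish within the period.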

Latency metrics: Exclude WATCH requests from latency recording rules. WATCH connections stay open for minutes and were pinning the p99 dashboard to 60s.
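
To show how a small share of long-lived connections can dominate a high percentile, here is a sketch with made-up request samples. The verbs mirror the description above, but the durations, the sample mix, and the `percentile` helper are hypothetical and are not the recording rules changed in this PR.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    # Ceiling division with integers avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical samples: (verb, duration in seconds).
requests = [("GET", 0.05)] * 90 + [("LIST", 0.20)] * 8 + [("WATCH", 60.0)] * 2

all_durations = [duration for _, duration in requests]
non_watch = [duration for verb, duration in requests if verb != "WATCH"]

p99_all = percentile(all_durations, 99)
p99_non_watch = percentile(non_watch, 99)
```

With this sample mix, two WATCH entries out of 100 are enough to put the p99 at 60 s; filtering them out brings it down to the slowest LIST request.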

Investigations: Six root cause analysis documents covering the alerts and incidents observed during the past week.

Test plan

  • Deploy to staging and verify new alerts are not firing when no backlog exists
  • Confirm ActivityDataPipelineStalled alert evaluates correctly against Vector's ClickHouse sinks
  • Verify latency dashboards no longer show 60s readings
  • Confirm apiserver pods are not CPU-throttled under normal load with new limits

Closes #140

🤖 Generated with Claude Code

scotwells and others added 3 commits March 25, 2026 19:52
Fixes several alerts that fired incorrectly over the past week due to
low-traffic thresholds and a broken label selector. Raises apiserver CPU
limits to prevent throttling under load. Excludes long-lived WATCH
connections from latency metrics so dashboards show accurate query
performance.

Closes #140

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Point-in-time incident reports don't belong in the codebase. Key findings
posted to relevant GitHub issues instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@scotwells scotwells merged commit d95dcb6 into main Mar 26, 2026
7 checks passed
@scotwells scotwells deleted the fix/alerting-and-resources branch March 26, 2026 02:14

Development

Successfully merging this pull request may close these issues.

Improve alerting resilience and apiserver CPU resources