Reduce alert noise and increase apiserver CPU headroom #141
Merged
Fixes several alerts that fired incorrectly over the past week due to low-traffic thresholds and a broken label selector. Raises apiserver CPU limits to prevent throttling under load. Excludes long-lived WATCH connections from latency metrics so dashboards show accurate query performance. Closes #140 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Point-in-time incident reports don't belong in the codebase. Key findings posted to relevant GitHub issues instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kevwilliams
approved these changes
Mar 26, 2026
Summary
Changes
Alerts: Replace rate-based NATS consumer alerts with backlog-aware versions that tolerate expected zero-traffic periods. Fix a pipeline stall alert that could never fire due to a wrong label selector. Lower the latency alert threshold to catch real user-impacting slowdowns. Add a new alert for ClickHouse keeper session health.
CPU resources: Increase the apiserver CPU limit from 500m to 1 core to eliminate CFS throttling observed during load spikes. Proportionally increase dev/test limits.
Latency metrics: Exclude WATCH requests from latency recording rules. WATCH connections stay open for minutes and were pinning the p99 dashboard to 60s.
Investigations: Six root cause analysis documents covering the alerts and incidents observed during the past week.
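To illustrate the latency change, here is a minimal Python sketch of why a small share of long-lived WATCH connections can pin a p99 at 60s, and how filtering on the request verb restores a meaningful number. The sample data, the `percentile` helper, and the verb mix are all hypothetical and chosen only for illustration; the actual change lives in the recording rules, not in application code.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request samples as (verb, duration in seconds).
# WATCH connections are assumed to stay open for about 60s each.
samples = (
    [("GET", 0.02)] * 90
    + [("LIST", 0.15)] * 5
    + [("WATCH", 60.0)] * 5
)

# p99 over every request, WATCH included.
all_durations = [duration for _, duration in samples]
p99_all = percentile(all_durations, 99)

# p99 over query-style requests only, WATCH excluded.
query_durations = [duration for verb, duration in samples if verb != "WATCH"]
p99_queries = percentile(query_durations, 99)

print(f"p99 with WATCH:    {p99_all}s")
print(f"p99 without WATCH: {p99_queries}s")
```

In this made-up mix, only 5% of requests are WATCH, yet they own the entire top of the distribution, so the unfiltered p99 reports connection lifetime instead of query latency.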
Test plan
ActivityDataPipelineStalled alert evaluates correctly against Vector's ClickHouse sinks
Closes #140
🤖 Generated with Claude Code