Reduce alert noise and increase apiserver CPU headroom by scotwells · Pull Request #141 · datum-cloud/activity

scotwells · 2026-03-26T00:53:10Z

Summary

Fix alerts that fired incorrectly for 7+ days because they treated expected low-traffic periods as failures
Raise apiserver CPU limits to prevent throttling under load spikes
Exclude WATCH connections from latency metrics so dashboards reflect real query performance
Add investigation documents from the production alert review

Changes

Alerts: Replace rate-based NATS consumer alerts with backlog-aware versions that tolerate expected zero-traffic periods. Fix a pipeline stall alert that could never fire due to a wrong label selector. Lower the latency alert threshold to catch real user-impacting slowdowns. Add a new alert for ClickHouse keeper session health.
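The difference between a rate-based and a backlog-aware condition can be sketched as follows. This is a minimal illustration, not the rule shipped in this PR: the field names `pending_messages` and `delivered_per_sec` and the `min_backlog` parameter are hypothetical stand-ins for whatever the NATS consumer metrics actually expose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsumerSample:
    """One observation of a consumer (field names are hypothetical)."""
    pending_messages: int     # messages waiting to be delivered (the backlog)
    delivered_per_sec: float  # recent delivery rate


def rate_only_alert(sample: ConsumerSample) -> bool:
    # Rate-based condition: fires whenever nothing is being delivered,
    # including quiet periods where there is simply nothing to deliver.
    return sample.delivered_per_sec == 0


def backlog_aware_alert(sample: ConsumerSample, min_backlog: int = 1) -> bool:
    # Backlog-aware condition: fires only when work is waiting
    # AND nothing is being delivered.
    return sample.pending_messages >= min_backlog and sample.delivered_per_sec == 0


# Illustrative samples, not production data.
idle = ConsumerSample(pending_messages=0, delivered_per_sec=0.0)
stalled = ConsumerSample(pending_messages=250, delivered_per_sec=0.0)
healthy = ConsumerSample(pending_messages=250, delivered_per_sec=40.0)
```

With these samples, the rate-only check flags the idle consumer even though nothing is wrong, while the backlog-aware check flags only the stalled one.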

CPU resources: Increase the apiserver CPU limit from 500m to 1 core to eliminate CFS throttling observed during load spikes. Proportionally increase dev/test limits.
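To see why the limit matters, here is a rough sketch of how a CPU limit maps to a CFS quota, assuming the default 100 ms CFS period. The helper names are hypothetical, and the throttle-ratio inputs correspond to the `nr_throttled` and `nr_periods` counters in a cgroup's `cpu.stat`.

```python
CFS_PERIOD_US = 100_000  # default CFS period of 100 ms (assumed here)


def cfs_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (microseconds) a container may use per CFS period."""
    return limit_millicores * period_us // 1000


def throttle_ratio(nr_throttled: int, nr_periods: int) -> float:
    """Fraction of CFS periods in which the container hit its quota."""
    if nr_periods == 0:
        return 0.0
    return nr_throttled / nr_periods


old_quota = cfs_quota_us(500)   # 500m limit
new_quota = cfs_quota_us(1000)  # 1 core limit
```

Under that assumption, a 500m limit allows 50 ms of CPU time per 100 ms period and a 1 core limit allows the full 100 ms, so a short burst that needs more than half a core no longer exhausts the quota mid-period.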

Latency metrics: Exclude WATCH requests from latency recording rules. WATCH connections stay open for minutes and were pinning the p99 dashboard to 60s.
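The effect on the p99 can be reproduced with a small sketch. The sample values and the 95/5 split below are illustrative only, and the nearest-rank percentile is a simplification of how a histogram-based recording rule would estimate quantiles. In Prometheus the exclusion itself would typically be a label matcher such as `verb!="WATCH"`, which is an assumption about the metric's labels rather than a quote from the rules in this PR.

```python
import math
from typing import List, Tuple

Request = Tuple[str, float]  # (verb, duration in seconds)


def percentile(durations: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(durations)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def p99(requests: List[Request], exclude_watch: bool) -> float:
    durations = [
        seconds
        for verb, seconds in requests
        if not (exclude_watch and verb == "WATCH")
    ]
    return percentile(durations, 99)


# Illustrative traffic: mostly fast queries plus a few long-lived WATCH connections.
requests: List[Request] = [("GET", 0.05)] * 95 + [("WATCH", 60.0)] * 5

p99_all = p99(requests, exclude_watch=False)
p99_queries = p99(requests, exclude_watch=True)
```

Here `p99_all` is 60.0 seconds while `p99_queries` is 0.05 seconds: once WATCH connections make up more than 1% of samples, they alone determine the p99.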

Investigations: Six root cause analysis documents covering the alerts and incidents observed during the past week.

Test plan

Deploy to staging and verify new alerts are not firing when no backlog exists
Confirm ActivityDataPipelineStalled alert evaluates correctly against Vector's ClickHouse sinks
Verify latency dashboards no longer show 60s readings
Confirm apiserver pods are not CPU-throttled under normal load with new limits
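
For the ActivityDataPipelineStalled check, it helps to remember why a wrong label selector makes an alert silent rather than noisy: a selector that matches no series yields an empty result, and an alert expression over an empty result never fires. A minimal sketch of that behavior, using hypothetical label names and values (the real metric and labels are not shown on this page):

```python
from typing import Dict, List

Series = Dict[str, str]  # label name -> label value


def select(series: List[Series], matchers: Dict[str, str]) -> List[Series]:
    """Return the series whose labels equal every matcher (exact match only)."""
    return [s for s in series if all(s.get(k) == v for k, v in matchers.items())]


# Hypothetical exported series for two ClickHouse sinks.
exported: List[Series] = [
    {"component_kind": "sink", "component_type": "clickhouse", "component_id": "events"},
    {"component_kind": "sink", "component_type": "clickhouse", "component_id": "audit"},
]

# A selector with a value that no series carries matches nothing.
wrong = select(exported, {"component_type": "clickhouse_sink"})
# A selector that matches the exported label value finds both sinks.
right = select(exported, {"component_type": "clickhouse"})
```

A practical staging check is therefore to run the alert's selector on its own and confirm it returns at least one series before trusting that the alert is quiet for the right reason.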

Closes #140

🤖 Generated with Claude Code

Fixes several alerts that fired incorrectly over the past week due to low-traffic thresholds and a broken label selector. Raises apiserver CPU limits to prevent throttling under load. Excludes long-lived WATCH connections from latency metrics so dashboards show accurate query performance.

Closes #140

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


scotwells and others added 3 commits March 25, 2026 19:52

Remove investigation documents from repo

d805795

Point-in-time incident reports don't belong in the codebase. Key findings posted to relevant GitHub issues instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove remaining investigation documents from repo

d320d38

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

scotwells requested review from JoseSzycho, drewr, ecv, kevwilliams and zachsmith1 March 26, 2026 01:00

kevwilliams approved these changes Mar 26, 2026


scotwells merged commit d95dcb6 into main Mar 26, 2026
7 checks passed

scotwells deleted the fix/alerting-and-resources branch March 26, 2026 02:14

scotwells merged 3 commits into main from fix/alerting-and-resources
