Summary
Following a week-long analysis of production alert firings, several alert rules need tuning and the apiserver CPU resources need adjustment to prevent throttling.
Alert rule changes
Fixed mis-firing alerts:
- Replace VectorNATSActivitiesSourceStopped and VectorNATSEventsSourceStopped with backlog-aware versions that fire only when NATS messages are pending but Vector isn't consuming. A zero rate with zero backlog is expected for low-volume streams.
- Fix ActivityDataPipelineStalled: component_id="clickhouse" doesn't match any Vector component, so the selector is changed to component_type="clickhouse".
- Widen ActivityGenerationStalled rate window from 5m to 1h and lower severity to warning (a 1h30m detection delay is too slow for critical).
- Tighten VectorNATSSourceStalledWithBacklog backlog selector to stream="AUDIT_EVENTS".
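For reviewers, the backlog-aware condition in the first bullet amounts to roughly the following. This is a Python sketch of the decision logic only, not the PromQL in the rule files; the function and parameter names are hypothetical.

```python
def source_stopped_old(consume_rate: float) -> bool:
    """Old behaviour: a zero consume rate alone fires the alert."""
    return consume_rate == 0.0


def source_stalled(consume_rate: float, pending_messages: int) -> bool:
    """Backlog-aware behaviour: fire only if messages are waiting
    in NATS and Vector is consuming none of them."""
    return pending_messages > 0 and consume_rate == 0.0


# A quiet low-volume stream: nothing produced, nothing consumed.
quiet_old = source_stopped_old(0.0)      # old rule fires (false positive)
quiet_new = source_stalled(0.0, 0)       # new rule stays silent

# A real stall: messages pending, nothing consumed.
stall_new = source_stalled(0.0, 42)
```

The point of the change is the `quiet_*` case: an idle stream no longer pages anyone, while a genuine stall with pending messages still does.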
Improved thresholds:
- Lower ActivityQueryLatencyHigh threshold from >10s to >5s: a production incident showed user impact at 4s p99 that the 10s threshold would have missed.
New alert:
- ClickHouseKeeperSessionErrors: fires on ZooKeeper session expirations. Covers a gap discovered when a keeper outage caused a 7-hour KEEPER_EXCEPTION storm with no alert.
CPU resource changes
Raise apiserver CPU to prevent CFS throttling observed during the March 24 load spike:
| Environment | CPU Request | CPU Limit |
| --- | --- | --- |
| Base (staging/prod) | 100m → 250m | 500m → 1000m (1 core) |
| Dev | 50m → 100m | 250m → 500m |
| Test-infra | 50m → 100m | 250m → 500m |
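To make the throttling effect concrete, the sketch below converts a CPU limit into its CFS quota. It assumes the default CFS period of 100 ms; a cluster that overrides the period would see different numbers. The helper names and the 80 ms burst are hypothetical, chosen only for illustration.

```python
CFS_PERIOD_US = 100_000  # assumed default CFS period: 100 ms


def cfs_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (microseconds) a container may use per CFS period."""
    return limit_millicores * period_us // 1000


def is_throttled(demand_us: int, limit_millicores: int) -> bool:
    """True if the CPU time wanted in one period exceeds the quota."""
    return demand_us > cfs_quota_us(limit_millicores)


# Base environment limits, before and after this change.
old_quota = cfs_quota_us(500)    # 500m limit
new_quota = cfs_quota_us(1000)   # 1 core limit

# Hypothetical burst wanting 80 ms of CPU within one 100 ms period.
burst_us = 80_000
throttled_before = is_throttled(burst_us, 500)
throttled_after = is_throttled(burst_us, 1000)
```

Under these assumptions a 500m limit allows 50 ms of CPU per 100 ms period, so any burst above that is paused until the next period; doubling the limit doubles the headroom.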
Recording rule changes
Exclude WATCH requests from latency recording rules (activity:apiserver_request_duration:p50/p95/p99). WATCH connections are long-lived (minutes to hours) and saturate the histogram at the 60s bucket ceiling, making latency dashboards unusable.
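The saturation can be reproduced with a simplified model of bucket-based quantile estimation. The sketch below is not the recording rule itself: the bucket bounds (other than the 60s ceiling) and the request counts are made up, and the function is a minimal approximation of how Prometheus-style histogram quantiles interpolate within buckets and fall back to the highest finite bound when the quantile lands in the +Inf bucket.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by bound and end with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls past the last finite bucket:
                # the estimate is pinned to the highest finite bound.
                return buckets[i - 1][0]
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket.
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for 1000 non-WATCH requests.
without_watch = [(0.1, 900), (1.0, 980), (10.0, 1000), (60.0, 1000), (math.inf, 1000)]
# Same traffic plus 50 WATCH connections that each outlive the 60s bucket.
with_watch = [(0.1, 900), (1.0, 980), (10.0, 1000), (60.0, 1000), (math.inf, 1050)]

p99_without_watch = histogram_quantile(0.99, without_watch)
p99_with_watch = histogram_quantile(0.99, with_watch)
```

In this toy data a small share of WATCH connections is enough to pin p99 at the 60s ceiling, hiding the real latency of ordinary requests.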