Improve alerting resilience and apiserver CPU resources #140

@scotwells

Summary

Following a week-long analysis of production alert firings, several alert rules need tuning and the apiserver CPU resources need adjustment to prevent throttling.

Alert rule changes

Fixed mis-firing alerts:

  • Replace VectorNATSActivitiesSourceStopped and VectorNATSEventsSourceStopped with backlog-aware versions that only fire when NATS messages are pending but Vector isn't consuming. Zero rate with zero backlog is expected for low-volume streams.
  • Fix ActivityDataPipelineStalled: the selector component_id="clickhouse" doesn't match any Vector component. Changed to component_type="clickhouse".
  • Widen ActivityGenerationStalled rate window from 5m to 1h and lower severity to warning (1h30m detection delay is too slow for critical).
  • Tighten VectorNATSSourceStalledWithBacklog backlog selector to stream="AUDIT_EVENTS".
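
The backlog-aware condition described in the first bullet can be sketched as a small predicate. This is only an illustration of the decision logic, not the actual rule definition: the function and parameter names (should_fire, consume_rate, pending_messages) are hypothetical, and the real alerts would express the same idea as a query over Vector and NATS metrics.

```python
def should_fire(consume_rate: float, pending_messages: int) -> bool:
    """Return True only when a backlog exists and nothing is being consumed.

    consume_rate: events per second Vector is reading from the stream (assumed input).
    pending_messages: messages waiting in NATS for the consumer (assumed input).
    """
    return pending_messages > 0 and consume_rate == 0.0


# Idle low-volume stream: nothing pending, nothing consumed, so no alert.
idle = should_fire(consume_rate=0.0, pending_messages=0)
# Stalled source: messages are pending but Vector consumes nothing, so alert.
stalled = should_fire(consume_rate=0.0, pending_messages=42)
# Healthy source: the backlog is draining, so no alert.
draining = should_fire(consume_rate=3.5, pending_messages=42)

print(idle, stalled, draining)  # False True False
```

The old rules effectively checked only the rate, which is why the idle case fired even though nothing was wrong.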

Improved thresholds:

  • Lower ActivityQueryLatencyHigh from >10s to >5s — production incident showed user impact at 4s p99 that the 10s threshold would have missed.

New alert:

  • ClickHouseKeeperSessionErrors — fires on ZooKeeper session expirations. Covers a gap discovered when a keeper outage caused a 7-hour KEEPER_EXCEPTION storm with no alert.

CPU resource changes

Raise apiserver CPU to prevent CFS throttling observed during the March 24 load spike:

Environment          | CPU Request  | CPU Limit
Base (staging/prod)  | 100m → 250m  | 500m → 1 core
Dev                  | 50m → 100m   | 250m → 500m
Test-infra           | 50m → 100m   | 250m → 500m
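
To make the effect of the limit change concrete, here is a rough sketch of how a CPU limit translates into a CFS quota, assuming the default 100 ms CFS period. The helper names and the 80 ms burst are hypothetical and chosen only for illustration; the issue does not state the actual CPU demand during the spike.

```python
CFS_PERIOD_US = 100_000  # default CFS period: 100 ms, in microseconds


def cfs_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (microseconds) a container may use per CFS period."""
    return limit_millicores * period_us // 1000


def throttled_us(demand_us: int, limit_millicores: int) -> int:
    """CPU time from one period's demand that must wait for a later period."""
    return max(0, demand_us - cfs_quota_us(limit_millicores))


# Hypothetical burst: the apiserver wants 80 ms of CPU inside one 100 ms period.
burst_us = 80_000

print(cfs_quota_us(500), throttled_us(burst_us, 500))    # 50000 30000
print(cfs_quota_us(1000), throttled_us(burst_us, 1000))  # 100000 0
```

Under these assumptions, the old 500m limit would defer 30 ms of that burst to the next period, while a 1 core limit absorbs it entirely.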

Recording rule changes

Exclude WATCH requests from latency recording rules (activity:apiserver_request_duration:p50/p95/p99). WATCH connections are long-lived (minutes to hours) and saturate the histogram at the 60s bucket ceiling, making latency dashboards unusable.
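
The saturation effect can be reproduced with a simplified quantile estimate over cumulative histogram buckets. The sketch below follows the usual approach of interpolating linearly inside a bucket and returning the highest finite bound when the quantile lands in the +Inf bucket. The bucket bounds and request counts are made-up illustration values, not production data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by bound and end with a math.inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile is in the overflow bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Hypothetical data: 1000 ordinary requests, all finishing within 5s.
without_watch = [(0.1, 900), (1.0, 990), (5.0, 1000), (60.0, 1000), (math.inf, 1000)]
# Same traffic plus 50 WATCH connections that each lasted longer than 60s.
with_watch = [(0.1, 900), (1.0, 990), (5.0, 1000), (60.0, 1000), (math.inf, 1050)]

p99_without = histogram_quantile(0.99, without_watch)
p99_with = histogram_quantile(0.99, with_watch)
print(p99_without, p99_with)
```

In this example a small share of WATCH observations is enough to pin the p99 estimate at the 60s ceiling, even though every ordinary request finished within 5s.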
