Improve alerting resilience and apiserver CPU resources #140

## Summary

A week-long analysis of production alert firings showed that several alert rules need tuning and that the apiserver CPU resources need to be raised to prevent throttling.

## Alert rule changes

**Fixed mis-firing alerts:**
- Replace `VectorNATSActivitiesSourceStopped` and `VectorNATSEventsSourceStopped` with backlog-aware versions that only fire when NATS messages are pending but Vector isn't consuming. Zero rate with zero backlog is expected for low-volume streams.
- Fix `ActivityDataPipelineStalled`: the selector `component_id="clickhouse"` doesn't match any Vector component, so it is changed to `component_type="clickhouse"`.
- Widen `ActivityGenerationStalled` rate window from 5m to 1h and lower severity to warning (a 1h30m detection delay is too slow for a critical alert).
- Tighten `VectorNATSSourceStalledWithBacklog` backlog selector to `stream="AUDIT_EVENTS"`.
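
The backlog-aware condition described above can be sketched in a few lines. This is only an illustration of the intended logic, not the actual alert expression: the `StreamSample` shape, the function names, and the sample values are all hypothetical, and the "old" predicate is an assumption about how the replaced alerts behaved.

```python
from dataclasses import dataclass


@dataclass
class StreamSample:
    """One evaluation-time snapshot for a NATS stream (hypothetical shape)."""
    consume_rate: float  # events/s Vector consumed over the rate window
    pending: int         # messages waiting in NATS for the consumer


def fires_on_zero_rate(s: StreamSample) -> bool:
    # Assumed old behaviour: fire whenever nothing is being consumed.
    return s.consume_rate == 0


def fires_backlog_aware(s: StreamSample) -> bool:
    # New behaviour: fire only when messages are pending and none are consumed.
    return s.pending > 0 and s.consume_rate == 0


idle = StreamSample(consume_rate=0.0, pending=0)         # quiet low-volume stream
stalled = StreamSample(consume_rate=0.0, pending=1200)   # backlog, no consumption
healthy = StreamSample(consume_rate=35.0, pending=40)    # backlog being drained

for name, sample in [("idle", idle), ("stalled", stalled), ("healthy", healthy)]:
    print(name, fires_on_zero_rate(sample), fires_backlog_aware(sample))
```

The idle case is the one that separates the two predicates: a zero-rate check fires on it, while the backlog-aware check stays silent.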

**Improved thresholds:**
- Lower `ActivityQueryLatencyHigh` from >10s to >5s: a production incident showed user impact at 4s p99, which the 10s threshold would have missed.

**New alert:**
- `ClickHouseKeeperSessionErrors`: fires on ZooKeeper session expirations. It covers a gap discovered when a keeper outage caused a 7-hour `KEEPER_EXCEPTION` storm without any alert firing.

## CPU resource changes

Raise apiserver CPU requests and limits to prevent the CFS throttling observed during the March 24 load spike:

| Environment | CPU Request | CPU Limit |
|-------------|------------|-----------|
| Base (staging/prod) | 100m → 250m | 500m → 1000m (1 core) |
| Dev | 50m → 100m | 250m → 500m |
| Test-infra | 50m → 100m | 250m → 500m |
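
To make the effect of the new limits concrete, the sketch below converts each CPU limit into the CFS quota it implies. It assumes the default 100 ms CFS period; a cluster that overrides the kubelet's CFS period would get different numbers. The helper names are hypothetical.

```python
CFS_PERIOD_US = 100_000  # assumed default CFS period (100 ms)


def parse_cpu(value: str) -> float:
    """Convert a Kubernetes CPU quantity such as '500m' or '1' to cores."""
    if value.endswith("m"):
        return int(value[:-1]) / 1000
    return float(value)


def cfs_quota_us(limit: str, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time in microseconds a container may use per period before throttling."""
    return round(parse_cpu(limit) * period_us)


# Old and new limits taken from the table above.
changes = [
    ("base (staging/prod)", "500m", "1"),
    ("dev", "250m", "500m"),
    ("test-infra", "250m", "500m"),
]

for env, old, new in changes:
    print(f"{env}: {cfs_quota_us(old)} us -> {cfs_quota_us(new)} us per period")
```

Under that assumption, the base apiserver goes from 50 ms to 100 ms of CPU time per 100 ms period, which doubles the burst it can absorb before the kernel throttles it.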

## Recording rule changes

Exclude WATCH requests from latency recording rules (`activity:apiserver_request_duration:p50/p95/p99`). WATCH connections are long-lived (minutes to hours) and saturate the histogram at the 60s bucket ceiling, making latency dashboards unusable.
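
The saturation effect can be reproduced with a small model of how PromQL's `histogram_quantile` treats classic histograms: it interpolates linearly inside a bucket and returns the highest finite bucket bound when the quantile lands in the `+Inf` bucket. Only the 60s top bucket comes from the description above; the other bucket bounds and all request counts are invented for illustration.

```python
import math

# Hypothetical bucket upper bounds in seconds; only the 60s ceiling is from the text.
BOUNDS = [0.1, 0.5, 1.0, 5.0, 60.0, math.inf]


def histogram_quantile(q, cumulative):
    """Estimate quantile q from cumulative bucket counts aligned with BOUNDS."""
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(BOUNDS, cumulative):
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in +Inf: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up cumulative counts: 1000 ordinary requests, 200 WATCH requests over 60s.
non_watch = [900, 980, 995, 1000, 1000, 1000]
watch = [0, 0, 0, 0, 0, 200]
combined = [a + b for a, b in zip(non_watch, watch)]

p99_without_watch = histogram_quantile(0.99, non_watch)
p95_with_watch = histogram_quantile(0.95, combined)
p99_with_watch = histogram_quantile(0.99, combined)
print(p99_without_watch, p95_with_watch, p99_with_watch)
```

In this model, p95 and p99 both pin to 60 once WATCH observations are mixed in, while p99 for the non-WATCH traffic alone stays under one second. That is the distortion the exclusion is meant to remove.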
