Add alerts for silent failure modes found during staging validation #144

@scotwells

Summary

Post-deploy validation of the alert rule changes (#140) surfaced five issues in staging over the past week that no alert caught. These represent gaps in our current alerting coverage.

Missing alerts

1. NATS consumer lag (critical gap)

The clickhouse-ingest consumer stalled for 10 hours with 15,544 messages stuck. No alert fired.

```yaml
- alert: NATSConsumerLagHigh
  expr: nats_jetstream_consumer_num_pending{stream="AUDIT_EVENTS",consumer="clickhouse-ingest"} > 1000
  for: 5m
```

2. AuditLogQuery 504 errors

Users received 504 errors on audit log queries, and no alert fired. The existing error-rate alert uses a 1% threshold, which is too high to catch these failures during low-traffic periods.

```yaml
- alert: AuditLogQuery504Errors
  expr: rate(apiserver_request_total{job="activity-apiserver",code="504",resource="auditlogqueries"}[5m]) > 0
  for: 5m
```

3. ClickHouse query iteration errors

activity_clickhouse_query_errors_total{error_type="iteration"} indicates client disconnects or row-iteration failures, but no alert covers it.
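A possible rule for this gap, following the shape of the rules above. The alert name and the 10m threshold are suggestions, not settled values:

```yaml
- alert: ClickHouseQueryIterationErrors
  expr: rate(activity_clickhouse_query_errors_total{error_type="iteration"}[5m]) > 0
  for: 10m
```

The `for: 10m` guard is there so a single transient client disconnect does not page; tune it once we see how noisy the metric is in staging.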

4. DLQ slow-leak detection

The existing DLQQueueGrowing alert requires more than 100 events within 15 minutes, so a slow leak of ~59 events over a week goes undetected. Lower the threshold or add an hourly increase check.
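A sketch of the hourly increase check. The metric name `activity_dlq_events_total` is a placeholder; substitute whatever counter DLQQueueGrowing already uses:

```yaml
- alert: DLQSlowLeak
  expr: increase(activity_dlq_events_total[1h]) > 0
  for: 1h
```

Since any event landing in the DLQ is a failure, alerting on any hourly increase (at warning severity) seems reasonable; if that proves noisy, a small threshold like `> 5` would still catch a ~59-events-per-week leak.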

5. Pipeline end-to-end latency

P99 pipeline latency spiked to 100-114s with no alert. Consider alerting on activity_pipeline_end_to_end_latency_seconds p99 > 30s.
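A candidate rule for the 30s threshold above, assuming the metric is a Prometheus histogram (i.e. `activity_pipeline_end_to_end_latency_seconds_bucket` series exist):

```yaml
- alert: PipelineEndToEndLatencyHigh
  expr: histogram_quantile(0.99, sum(rate(activity_pipeline_end_to_end_latency_seconds_bucket[5m])) by (le)) > 30
  for: 10m
```

If the metric is instead exported as a summary with precomputed quantiles, the expr would alert directly on the p99 series rather than via histogram_quantile.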
