Summary
Post-deploy validation of the alert rule changes (#140) found 5 issues in staging over the past week that were not caught by any alert. These represent gaps in our current alerting coverage.
Missing alerts
1. NATS consumer lag (critical gap)
The clickhouse-ingest consumer stalled for 10 hours with 15,544 messages stuck. No alert fired.
    - alert: NATSConsumerLagHigh
      expr: nats_jetstream_consumer_num_pending{stream="AUDIT_EVENTS",consumer="clickhouse-ingest"} > 1000
      for: 5m
2. AuditLogQuery 504 errors
Users received 504 errors on audit log queries with no alert. The existing error rate alert uses a 1% threshold, which is too high for low-traffic periods.
    - alert: AuditLogQuery504Errors
      expr: rate(apiserver_request_total{job="activity-apiserver",code="504",resource="auditlogqueries"}[5m]) > 0
      for: 5m
3. ClickHouse query iteration errors
activity_clickhouse_query_errors_total{error_type="iteration"} indicates client disconnects or row iteration failures. No alert exists.
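A rule for this gap might look like the sketch below. The metric and its error_type label come from the description above. The alert name, the 15-minute window, the for duration, and the severity label are assumptions to adjust.

```yaml
# Sketch only: alert name, window, for duration, and severity are assumed,
# not taken from the existing rule files.
- alert: ClickHouseQueryIterationErrors
  # Fires if any iteration error was recorded in the last 15 minutes.
  expr: sum(increase(activity_clickhouse_query_errors_total{error_type="iteration"}[15m])) > 0
  for: 5m
  labels:
    severity: warning
```

The window is deliberately longer than the for duration, so that a single error keeps the expression true long enough for the alert to fire.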
4. DLQ slow-leak detection
The existing DLQQueueGrowing alert requires >100 events in 15 minutes. A slow leak of ~59 events over a week goes undetected. Lower the threshold or add an hourly increase check.
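For scale, 59 events over a week is about 0.35 events per hour (59 / 168), far below the rate of more than 100 events in 15 minutes that DLQQueueGrowing requires. The hourly increase check could be sketched as below. The metric name activity_dlq_events_total is hypothetical, since the metric behind DLQQueueGrowing is not named here. The alert name and severity are also assumptions.

```yaml
# Sketch only: activity_dlq_events_total is a placeholder metric name;
# substitute the counter that DLQQueueGrowing already uses.
- alert: DLQSlowLeak
  # Fires when at least one event reached the DLQ in the last hour.
  expr: sum(increase(activity_dlq_events_total[1h])) > 0
  labels:
    severity: warning
```

This sketch has no for clause: at roughly one event every three hours, the expression would not stay true across consecutive hours, and a long for duration would keep resetting.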
5. Pipeline end-to-end latency
P99 pipeline latency spiked to 100-114s with no alert. Consider alerting on activity_pipeline_end_to_end_latency_seconds p99 > 30s.
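The p99 check could be sketched as below, assuming activity_pipeline_end_to_end_latency_seconds is exported as a Prometheus histogram, so that a _bucket series with an le label exists. The alert name, the 5-minute rate window, the for duration, and the severity are also assumptions.

```yaml
# Sketch only: assumes a histogram metric; if the metric is a summary,
# query its quantile="0.99" series instead.
- alert: PipelineEndToEndLatencyHigh
  # p99 over the last 5 minutes, aggregated across all instances.
  expr: histogram_quantile(0.99, sum by (le) (rate(activity_pipeline_end_to_end_latency_seconds_bucket[5m]))) > 30
  for: 10m
  labels:
    severity: warning
```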