Add alerts for silent failure modes found during staging validation #144

@scotwells

Summary

Post-deploy validation of the alert rule changes (#140) surfaced five issues in staging over the past week that no alert caught. These represent gaps in our current alerting coverage.

Missing alerts

1. NATS consumer lag (critical gap)

The clickhouse-ingest consumer stalled for 10 hours with 15,544 messages stuck. No alert fired.

```yaml
- alert: NATSConsumerLagHigh
  expr: nats_jetstream_consumer_num_pending{stream="AUDIT_EVENTS",consumer="clickhouse-ingest"} > 1000
  for: 5m
```

2. AuditLogQuery 504 errors

Users received 504 errors on audit log queries, and no alert fired. The existing error-rate alert uses a 1% threshold, which is too high to catch these failures during low-traffic periods.

```yaml
- alert: AuditLogQuery504Errors
  expr: rate(apiserver_request_total{job="activity-apiserver",code="504",resource="auditlogqueries"}[5m]) > 0
  for: 5m
```

3. ClickHouse query iteration errors

activity_clickhouse_query_errors_total{error_type="iteration"} indicates client disconnects or row-iteration failures, but no alert covers it.
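A possible rule for this gap, following the shape of the rules above. The alert name and the 10m threshold are suggestions, not settled values:

```yaml
- alert: ClickHouseQueryIterationErrors
  expr: rate(activity_clickhouse_query_errors_total{error_type="iteration"}[5m]) > 0
  for: 10m
```

The `for: 10m` guard is there so a single transient client disconnect does not page; tune it once we see how noisy the metric is in staging.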

4. DLQ slow-leak detection

The existing DLQQueueGrowing alert requires more than 100 events within 15 minutes, so a slow leak of ~59 events over a week goes undetected. Lower the threshold or add an hourly increase check.
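A sketch of the hourly increase check. The metric name `activity_dlq_events_total` is a placeholder; substitute whatever counter DLQQueueGrowing already uses:

```yaml
- alert: DLQSlowLeak
  expr: increase(activity_dlq_events_total[1h]) > 0
  for: 1h
```

Since any event landing in the DLQ is a failure, alerting on any hourly increase (at warning severity) seems reasonable; if that proves noisy, a small threshold like `> 5` would still catch a ~59-events-per-week leak.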

5. Pipeline end-to-end latency

P99 pipeline latency spiked to 100-114s with no alert. Consider alerting on activity_pipeline_end_to_end_latency_seconds p99 > 30s.
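A candidate rule for the 30s threshold above, assuming the metric is a Prometheus histogram (i.e. `activity_pipeline_end_to_end_latency_seconds_bucket` series exist):

```yaml
- alert: PipelineEndToEndLatencyHigh
  expr: histogram_quantile(0.99, sum(rate(activity_pipeline_end_to_end_latency_seconds_bucket[5m])) by (le)) > 30
  for: 10m
```

If the metric is instead exported as a summary with precomputed quantiles, the expr would alert directly on the p99 series rather than via histogram_quantile.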
