Catch silent failures that slipped past existing alerts #153

Merged
scotwells merged 2 commits into main from worktree-agent-a5ea0f97 on Mar 26, 2026

@scotwells
Contributor

Summary

Adds five new alerts covering failure modes that were discovered during a production validation and that existing alerting did not detect.

New alerts

  • NATS consumer lag — fires when audit event backlog exceeds 5,000 messages for 5+ minutes
  • Audit query timeouts — catches 504 errors on audit log queries at a sustained rate
  • ClickHouse iteration errors — surfaces row streaming failures during query execution
  • Vector write pipeline stopped — detects when Vector receives events from NATS but stops writing to ClickHouse
  • DLQ slow leak — catches low-rate persistent failures that fall below the existing fast-growth alert
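
For illustration, two of the alerts above could take roughly the following shape. This is a minimal sketch, not the rules shipped in this PR: the alert names come from the commit message, the 5,000-message / 5-minute and 10-events / 6-hour values come from the text, and everything else (the metric names `nats_consumer_num_pending` and `audit_dlq_events_total`, the `stream_name` and `consumer_name` labels, the severities, and the resource name) is an assumption.

```yaml
# Hypothetical sketch only; metric names, labels, and severities are assumed.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: audit-pipeline-silent-failures   # hypothetical resource name
spec:
  groups:
    - name: audit-pipeline.silent-failures
      rules:
        - alert: NATSConsumerLagHigh
          # Threshold plus duration: the backlog must stay above 5,000
          # pending messages for the full 5 minutes before the alert fires.
          # (The commit message cites 1000 as the threshold instead.)
          expr: |
            max by (consumer_name) (
              nats_consumer_num_pending{stream_name="AUDIT_EVENTS"}
            ) > 5000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "AUDIT_EVENTS consumer backlog above 5,000 for 5+ minutes"
        - alert: DLQSlowLeak
          # A long increase() window catches growth that is too slow to trip
          # a short-window growth alert. Assumes the DLQ metric is a counter.
          expr: |
            increase(audit_dlq_events_total[6h]) > 10
          labels:
            severity: warning
          annotations:
            summary: "More than 10 events reached the DLQ in the last 6 hours"
```

If the DLQ is exposed as a queue-depth gauge rather than a counter, `delta()` over the same window would be the closer fit than `increase()`.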

What prompted this

Post-deploy validation found three failures that raised no alert: a 48-hour gap in which Vector stopped writing events to ClickHouse, a 10-hour NATS consumer stall, and ongoing DLQ accumulation from a broken policy.

Test plan

  • Deploy to staging and verify all new alerts evaluate without errors
  • Confirm alerts are inactive during normal healthy operation
  • Verify that VectorClickHouseWritesStopped uses the component_type selector correctly
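
As a reference point for the last check, a "receiving but not writing" rule that selects on `component_type` might be sketched as follows. This assumes Vector's internal metrics are scraped under the `vector_` namespace, that the source and sink report `component_type="nats"` and `component_type="clickhouse"`, and that a 10-minute rate window and 15-minute hold are acceptable; none of these values are taken from this PR's diff.

```yaml
# Hypothetical sketch only; selectors, windows, and severity are assumed.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vector-clickhouse-writes-stopped   # hypothetical resource name
spec:
  groups:
    - name: audit-pipeline.vector
      rules:
        - alert: VectorClickHouseWritesStopped
          # Fires when the NATS source is still receiving events but the
          # ClickHouse sink has sent none. Both sides are aggregated with
          # sum() so they share an empty label set and "unless" can match.
          # "unless" also covers the case where the sink series is absent.
          expr: |
            sum(rate(vector_component_received_events_total{component_type="nats"}[10m])) > 0
            unless
            sum(rate(vector_component_sent_events_total{component_type="clickhouse"}[10m])) > 0
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "Vector is receiving from NATS but not writing to ClickHouse"
```

Selecting on `component_type` rather than a component ID keeps the rule independent of how the source and sink are named in the Vector configuration, at the cost of matching every component of that type.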

Closes #144

🤖 Generated with Claude Code

scotwells and others added 2 commits March 26, 2026 10:31
Add 5 new PrometheusRule alerts covering gaps found during staging and
production validation:

- NATSConsumerLagHigh: warns when AUDIT_EVENTS pending count exceeds
  1000, an early signal before the existing critical backlog alert fires
- AuditLogQuery504Errors: catches per-resource timeout errors on
  auditlogqueries that the aggregate error rate alert misses
- ClickHouseQueryIterationErrors: surfaces row iteration failures during
  result streaming, previously a silent failure mode
- VectorClickHouseWritesStopped: detects the case where Vector is
  receiving from NATS but not writing to ClickHouse (split-brain
  pipeline failure)
- DLQSlowLeak: catches low-rate but persistent DLQ growth (>10 events
  in 6h) that falls below the existing DLQQueueGrowing threshold

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@scotwells scotwells requested a review from kevwilliams March 26, 2026 16:21
@scotwells scotwells merged commit 8e1c585 into main Mar 26, 2026
7 checks passed
@scotwells scotwells deleted the worktree-agent-a5ea0f97 branch March 26, 2026 16:22
Development

Successfully merging this pull request may close these issues.

Add alerts for silent failure modes found during staging validation
