
Reduce alert noise during cascading failures #152

Merged
scotwells merged 3 commits into main from worktree-agent-a7235e9f on Mar 26, 2026

Conversation

Contributor

@scotwells commented Mar 26, 2026

Summary

Adds alert inhibition rules so that when a root-cause alert fires, its predictable downstream symptoms are suppressed. This prevents alert storms during infrastructure failures.

What's included

An AlertmanagerConfig resource (Prometheus-compatible, auto-converted by the VM operator) with 10 inhibition rules:

  • Severity-based: Critical alerts suppress related warnings for the same component
  • Apiserver down: Suppresses error rate and latency alerts
  • ClickHouse unavailable: Suppresses query latency and pipeline stall alerts
  • All NATS sources stalled: Suppresses individual consumer alerts
  • Processor down: Suppresses generation stalled, error rate, and all DLQ alerts
  • NATS disconnected: Suppresses generation stalled
  • Keeper session errors: Suppresses ZooKeeper exception alerts
  • Pipeline backlog critical: Suppresses consumer lag warning
  • SLO page-burn: Suppresses equivalent threshold alerts to avoid duplicates
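
For reviewers less familiar with inhibition semantics, here is a minimal Python sketch of how a rule of this shape is evaluated. It is a simplified model, not the Alertmanager implementation and not the manifest in this PR: the `InhibitRule` class, the helper functions, and every alert name and label value are hypothetical, and only exact label equality is modelled.

```python
from dataclasses import dataclass, field


@dataclass
class InhibitRule:
    source: dict  # labels the root-cause alert must carry
    target: dict  # labels the downstream symptom alert must carry
    equal: list = field(default_factory=list)  # labels that must agree on both


def matches(labels, matchers):
    """True when every matcher key has exactly that value in labels."""
    return all(labels.get(key) == value for key, value in matchers.items())


def is_inhibited(alert, firing, rules):
    """True when some firing source alert suppresses `alert` under any rule."""
    for rule in rules:
        if not matches(alert, rule.target):
            continue
        for src in firing:
            # An alert never inhibits itself in this simplified model.
            if src is alert or not matches(src, rule.source):
                continue
            if all(src.get(name) == alert.get(name) for name in rule.equal):
                return True
    return False


# Hypothetical severity-based rule: critical suppresses warning
# for the same component.
rules = [
    InhibitRule(
        source={"severity": "critical"},
        target={"severity": "warning"},
        equal=["component"],
    )
]

# Hypothetical alerts; these names are not taken from the PR.
firing = [
    {"alertname": "ExampleProcessorDown", "severity": "critical", "component": "processor"},
    {"alertname": "ExampleErrorRate", "severity": "warning", "component": "processor"},
    {"alertname": "ExampleQueryLatency", "severity": "warning", "component": "clickhouse"},
]

suppressed = [a["alertname"] for a in firing if is_inhibited(a, firing, rules)]
```

In this model only the processor warning is suppressed. The ClickHouse warning still fires because no critical alert shares its `component` value, which is why the `equal` labels matter for keeping unrelated alerts visible.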

Validated in staging

  • AlertmanagerConfig accepted and auto-converted by the VM operator
  • All 10 inhibition rules loaded into the generated Alertmanager config (verified via config secret)
  • Required a minimal route/receiver to pass VM operator CRD validation

Test plan

  • Deploy to staging and verify resource is accepted
  • Confirm inhibition rules appear in the generated Alertmanager config
  • Simulate a critical alert and verify downstream warnings are suppressed
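
One possible way to run the simulation step is to post a synthetic alert to the Alertmanager v2 API. The sketch below assumes the staging Alertmanager is reachable over HTTP at a base URL you supply. The function names, the annotation text, and the label values are hypothetical and would need to match the source matchers of the rule under test.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_alert(labels, minutes=5, now=None):
    """Build a v2 alerts payload that stays active for `minutes`."""
    now = now or datetime.now(timezone.utc)
    return [
        {
            "labels": labels,
            "annotations": {"summary": "synthetic alert for inhibition test"},
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
        }
    ]


def post_alert(base_url, payload):
    """POST the payload to <base_url>/api/v2/alerts and return the HTTP status."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical labels; replace with the labels of a real root-cause alert.
payload = build_alert(
    {"alertname": "ExampleProcessorDown", "severity": "critical", "component": "processor"}
)
# post_alert("http://localhost:9093", payload)  # assumed address, not from the PR
```

After posting, listing alerts through the same API should show the downstream warnings with a non-empty `inhibitedBy` entry in their status while the synthetic alert is active.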

Closes #146

🤖 Generated with Claude Code

…ules (#146)

Add activity-alert-inhibitions VMAlertmanagerConfig to suppress redundant
downstream alerts when root-cause alerts are already firing. Covers
apiserver down, ClickHouse unavailable, NATS stalled, processor down,
SLO burn-rate, and ClickHouse keeper/merge scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@scotwells force-pushed the worktree-agent-a7235e9f branch from 5fc990d to 75ccddc on March 26, 2026 15:43
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@scotwells force-pushed the worktree-agent-a7235e9f branch from 75ccddc to 5748c4c on March 26, 2026 15:47
@scotwells requested a review from kevwilliams on March 26, 2026 15:54
@scotwells merged commit 2d9a040 into main on Mar 26, 2026
7 checks passed
@scotwells deleted the worktree-agent-a7235e9f branch on March 26, 2026 17:21

Development

Successfully merging this pull request may close these issues.

Add VMAlertmanagerConfig with alert inhibition rules

2 participants