fix(monitoring): eliminate false 100% CPU spikes and fix alert double-evaluation #183

Merged: simonjcarr merged 1 commit into main from fix/monitoring-cpu-spikes-and-alert-evaluation on Apr 14, 2026
@simonjcarr
Collaborator

Summary

  • False 100% CPU spikes: Immediate heartbeats (triggered by check/task result delivery) were calling collectMetrics() with a near-zero delta window milliseconds after the previous regular tick. Any system activity in that window inflated the CPU reading to ~100% on otherwise idle hosts. Fixed by introducing refreshMetrics() that caches the metric snapshot; only the regular 30-second ticker refreshes the cache — immediate heartbeats reuse it.
  • Alert double-evaluation: GetAlertRulesForHost fetched all rules where host_id IS NULL, including is_global_default = true templates. When a host is approved these templates are cloned as host-specific rules, but the originals continued firing alongside the clones. Disabling the clones had no effect. Fixed by adding AND is_global_default = false to the query.
  • Global defaults invisible in host Alerts tab: Users couldn't see org-wide default rules from the host page, so they had no way to understand why alerts kept firing after they had disabled all visible rules. Added a read-only "Organisation-wide Default Rules" section that fetches and displays the global defaults, with a link to Settings → Alerts.

Test plan

  • Deploy updated agent; verify CPU chart shows steady low values with no false 100% spikes during check execution
  • Disable all host-specific alert rules; verify no alerts fire from global defaults
  • Navigate to host Alerts tab; verify "Organisation-wide Default Rules" section appears and lists configured defaults

🤖 Generated with Claude Code

…-evaluation

**CPU metric spikes (agent)**
Immediate heartbeats triggered by check/task results (resultsReady) fired
milliseconds after the previous regular tick, causing readCPUPercent() to
measure a near-zero delta window. Any system activity in that window inflated
the CPU reading towards 100% even on an idle host.
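
To make the arithmetic concrete, here is a minimal language-neutral sketch in Python. It assumes the reading is derived from the difference between two samples of cumulative busy/total tick counters; the PR does not show how readCPUPercent() works internally, so the function and the numbers below are illustrative only.

```python
def cpu_percent(prev, curr):
    """Percentage of busy ticks between two cumulative samples.

    prev and curr are (busy_ticks, total_ticks) tuples. This is a
    hypothetical stand-in for the agent's readCPUPercent().
    """
    busy = curr[0] - prev[0]
    total = curr[1] - prev[1]
    if total <= 0:
        return 0.0  # no time elapsed, nothing to measure
    return 100.0 * busy / total


# Regular tick: 30 s at an assumed 100 ticks/s gives 3000 ticks, 60 of them busy.
regular = cpu_percent((1000, 50000), (1060, 53000))

# Immediate heartbeat about 10 ms later: one tick elapsed, and it was busy.
immediate = cpu_percent((1060, 53000), (1061, 53001))

print(regular, immediate)  # 2.0 100.0
```

With only one tick in the window, a single busy tick is the whole sample, so an idle host reports 100%.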

Fix: introduce refreshMetrics() which collects all system metrics and caches
them in a new hostMetricsSnapshot on Runner. The regular 30-second ticker
calls refreshMetrics() before each heartbeat; the resultsReady path sends the
cached snapshot without re-collecting, so the CPU delta always spans a full
tick interval.
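
The caching pattern described above can be sketched roughly as follows. This is a Python illustration, not the agent's code: the class layout, the lock, the collector callable, and the method names on_tick and on_results_ready are assumptions; only refreshMetrics(), hostMetricsSnapshot, and Runner are named in this commit.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class HostMetricsSnapshot:
    cpu_percent: float  # the real snapshot presumably holds more fields


class Runner:
    def __init__(self, collect):
        self._collect = collect      # hypothetical metrics collector
        self._lock = threading.Lock()
        self._snapshot = None        # last snapshot taken by the ticker

    def refresh_metrics(self):
        """Collect fresh metrics and cache them. Called only by the ticker."""
        snap = self._collect()
        with self._lock:
            self._snapshot = snap
        return snap

    def on_tick(self):
        """Regular 30-second path: refresh, then send the fresh snapshot."""
        return self.refresh_metrics()

    def on_results_ready(self):
        """Immediate path: reuse the cached snapshot without re-collecting.

        Returns None if no tick has run yet; how the real agent handles
        that case is not stated in the PR.
        """
        with self._lock:
            return self._snapshot
```

Because only on_tick touches the collector, consecutive CPU samples are always a full tick interval apart, regardless of how many immediate heartbeats fire in between.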

**Alert double-evaluation (ingest)**
GetAlertRulesForHost fetched all rules with host_id IS NULL, which includes
global default rules (is_global_default = true). When a host is approved,
global defaults are cloned as host-specific rules. The originals were then
evaluated a second time, firing alerts even when the host-specific clones were
disabled.

Fix: add AND is_global_default = false to the query so global defaults are
treated as templates only; the host-specific clones are the sole evaluated
source.
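
The effect of the extra predicate can be reproduced with a small in-memory SQLite sketch. The table name, the columns other than host_id and is_global_default, the enabled filter, and the overall query shape are assumptions for illustration; the PR only states that AND is_global_default = false was added. SQLite stores booleans as integers, so the sketch compares against 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE alert_rules (
           id INTEGER PRIMARY KEY,
           host_id TEXT,
           name TEXT NOT NULL,
           is_global_default INTEGER NOT NULL DEFAULT 0,
           enabled INTEGER NOT NULL DEFAULT 1)"""
)
conn.executemany(
    "INSERT INTO alert_rules (host_id, name, is_global_default, enabled) "
    "VALUES (?, ?, ?, ?)",
    [
        (None, "cpu-high", 1, 1),       # global default template
        ("host-a", "cpu-high", 0, 0),   # clone for host-a, disabled by the user
        (None, "org-wide-disk", 0, 1),  # org-wide rule that is not a template
    ],
)


def rules_for_host(conn, host_id, exclude_defaults):
    """Hypothetical stand-in for GetAlertRulesForHost."""
    sql = (
        "SELECT name, host_id FROM alert_rules "
        "WHERE enabled = 1 AND (host_id = ? OR host_id IS NULL)"
    )
    if exclude_defaults:
        sql += " AND is_global_default = 0"  # the predicate added by the fix
    sql += " ORDER BY id"
    return conn.execute(sql, (host_id,)).fetchall()


before = rules_for_host(conn, "host-a", exclude_defaults=False)
after = rules_for_host(conn, "host-a", exclude_defaults=True)
print(before)  # [('cpu-high', None), ('org-wide-disk', None)]
print(after)   # [('org-wide-disk', None)]
```

Before the fix, the template still matches even though the user disabled the host-specific clone; with the predicate, only the non-template rule remains.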

**Global defaults visible in host Alerts tab (web)**
The Alerts tab only showed host-specific rules, so users had no visibility
into the org-wide global defaults that were also firing. Added a separate
query for getGlobalAlertDefaults and a read-only "Organisation-wide Default
Rules" section linking to Settings → Alerts for management.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonjcarr merged commit 01e7146 into main on Apr 14, 2026
8 checks passed