fix(monitoring): eliminate false 100% CPU spikes and fix alert double-evaluation #183
Merged
…-evaluation

**CPU metric spikes (agent)**

Immediate heartbeats triggered by check/task results (resultsReady) fired milliseconds after the previous regular tick, causing readCPUPercent() to measure a near-zero delta window. Any system activity in that window inflated the CPU reading towards 100% even on an idle host.

Fix: introduce refreshMetrics(), which collects all system metrics and caches them in a new hostMetricsSnapshot on Runner. The regular 30-second ticker calls refreshMetrics() before each heartbeat; the resultsReady path sends the cached snapshot without re-collecting, so the CPU delta always spans a full tick interval.

**Alert double-evaluation (ingest)**

GetAlertRulesForHost fetched all rules with host_id IS NULL, which includes global default rules (is_global_default = true). When a host is approved, global defaults are cloned as host-specific rules. The originals were then evaluated a second time, firing alerts even when the host-specific clones were disabled.

Fix: add AND is_global_default = false to the query so global defaults are treated as templates only; the host-specific clones are the sole evaluated source.

**Global defaults visible in host Alerts tab (web)**

The Alerts tab only showed host-specific rules, so users had no visibility into the org-wide global defaults that were also firing. Added a separate query for getGlobalAlertDefaults and a read-only "Organisation-wide Default Rules" section linking to Settings → Alerts for management.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
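To make the CPU fix concrete, here is a minimal Go sketch of the caching idea described above. It is not the agent's actual code: `refreshMetrics`, `hostMetricsSnapshot`, and `Runner` are named after the commit message, while `cpuSample`, `cpuPercent`, `heartbeatPayload`, the struct fields, and the single-CPU counter model are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// cpuSample is a hypothetical cumulative counter reading
// (busy CPU time and total elapsed CPU time since boot).
type cpuSample struct {
	busy, total time.Duration
}

// cpuPercent computes utilisation over the window between two samples.
// With a tiny window, a few milliseconds of activity dominate the result.
func cpuPercent(prev, cur cpuSample) float64 {
	dt := cur.total - prev.total
	if dt <= 0 {
		return 0
	}
	return 100 * float64(cur.busy-prev.busy) / float64(dt)
}

// hostMetricsSnapshot is named after the PR; its fields are assumed.
type hostMetricsSnapshot struct {
	CPUPercent float64
}

// Runner holds the previous counter reading and the cached snapshot.
type Runner struct {
	prev     cpuSample
	snapshot hostMetricsSnapshot
}

// refreshMetrics is meant to be called only from the regular ticker,
// so the delta always spans a full tick interval.
func (r *Runner) refreshMetrics(cur cpuSample) {
	r.snapshot = hostMetricsSnapshot{CPUPercent: cpuPercent(r.prev, cur)}
	r.prev = cur
}

// heartbeatPayload is what an immediate (resultsReady-style) heartbeat
// would send: the cached snapshot, with no re-collection.
func (r *Runner) heartbeatPayload() hostMetricsSnapshot {
	return r.snapshot
}

func main() {
	r := &Runner{}

	// Regular tick: 300ms of busy time over a 30s window on an idle host.
	r.refreshMetrics(cpuSample{busy: 300 * time.Millisecond, total: 30 * time.Second})

	// 5ms later a result arrives. Re-sampling here is the old behaviour:
	// 5ms of activity over a 5ms window.
	later := cpuSample{busy: 305 * time.Millisecond, total: 30*time.Second + 5*time.Millisecond}
	fmt.Printf("naive re-read: %.1f%%\n", cpuPercent(r.prev, later))

	// New behaviour: reuse the cached snapshot.
	fmt.Printf("cached value:  %.1f%%\n", r.heartbeatPayload().CPUPercent)
}
```

In this toy run the naive re-read reports 100% while the cached snapshot stays at 1%, which is the spike the change is meant to remove. The trade-off is that an immediate heartbeat can carry metrics that are up to one tick interval old.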
Summary
- Immediate heartbeats called `collectMetrics()` with a near-zero delta window, milliseconds after the previous regular tick. Any system activity in that window inflated the CPU reading to ~100% on otherwise idle hosts. Fixed by introducing `refreshMetrics()`, which caches the metric snapshot; only the regular 30-second ticker refreshes the cache, and immediate heartbeats reuse it.
- `GetAlertRulesForHost` fetched all rules where `host_id IS NULL`, including `is_global_default = true` templates. When a host is approved these templates are cloned as host-specific rules, but the originals continued firing alongside the clones, so disabling the clones had no effect. Fixed by adding `AND is_global_default = false` to the query.

Test plan
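A check along these lines could exercise the alert-rule change. It is a hypothetical in-memory sketch, not the project's test suite or its SQL: `AlertRule`, its fields, and `rulesForHost` are invented for illustration, and it assumes the real query also returns rules whose `host_id` matches the host.

```go
package main

import "fmt"

// AlertRule is a hypothetical stand-in for a row in the alert rules table.
type AlertRule struct {
	Name            string
	HostID          *string // nil models host_id IS NULL
	IsGlobalDefault bool
	Enabled         bool
}

// rulesForHost mimics the fixed selection: the host's own rules, plus
// host_id IS NULL rules that are NOT global default templates.
func rulesForHost(all []AlertRule, hostID string) []AlertRule {
	var out []AlertRule
	for _, r := range all {
		switch {
		case r.HostID != nil && *r.HostID == hostID:
			out = append(out, r)
		case r.HostID == nil && !r.IsGlobalDefault:
			out = append(out, r)
		}
	}
	return out
}

func main() {
	host := "h1"
	all := []AlertRule{
		// Global default template: should no longer be evaluated directly.
		{Name: "cpu-high", IsGlobalDefault: true, Enabled: true},
		// Host-specific clone created on approval, disabled by the user.
		{Name: "cpu-high", HostID: &host, Enabled: false},
		// Rule with no host that is not a global default.
		{Name: "disk-full", Enabled: true},
	}

	for _, r := range rulesForHost(all, host) {
		fmt.Printf("%s enabled=%v globalDefault=%v\n", r.Name, r.Enabled, r.IsGlobalDefault)
	}
}
```

With the template filtered out, the disabled clone is the only `cpu-high` rule left for the host, so disabling it takes effect.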
🤖 Generated with Claude Code