fix(alerts): shorter detection window + investigator wait bumps #60
Merged
Addresses items 1a, 1c, and 1d from the 2026-05-30 eval surfaced detection failures.

1a (low-rate / low-volume services). The alert evaluator now computes curr_requests/curr_errors directly from spans over a -15m window instead of reading home_service_summary's -1h $vt_results. The -1h window was diluting 7-min flag bursts on low-traffic services (cart, ad, product-reviews, recommendation) below the 1% threshold; -15m keeps the signal fresh. Baseline (prev) still reads the -1h criblapm_alert_prev lookup so the comparison is against a longer healthy reference.

Side effects handled:
- traffic_ratio now normalizes both windows to requests-per-minute (curr/15 vs prev/60) so a healthy service doesn't look like it's at 25% of prior traffic from window-duration mismatch alone.
- Added a curr_requests >= 5 floor to the error_rate clause. Without it, very-low-traffic services with 1 error out of 1 span would false-fire (e.g. product-reviews micro-noise).

1c (investigator timeouts). emailMemoryLeak and recommendationCacheFailure investigator waitMs bumped 5 → 10 minutes. Both timed out in the 2026-05-30 eval with multi-step playbooks (uptime, memory metric, slope, downstream health). leakFingerprint already had 10 min.

1d (adHighCpu p99 surface threshold). Pattern was ≥5ms; eval saw ~3-4ms readings. Loosened to ≥3ms.

Not addressed in this PR (intentional):
- 1b: alert_history_send leakage (paymentFailure had state firing but no dataset event). Needs deeper instrumentation; flagged in ROADMAP for follow-up.
- 1b: UI surfaces missing firing state on productCatalogFailure. Same diagnosis-needed reason.
- llmRateLimitError navigation failure when product-reviews drops off the Services list (eval-side workaround).
- failedReadinessProbe cluster-config check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses items 1a, 1c, and 1d from the 2026-05-30 eval (PR #59 results).
1a — Alert state machine doesn't fire on low-rate / low-volume services
Five scenarios in the 2026-05-30 eval (cart, ad, product-reviews, recommendation, checkout) never reached the FIRING state because the alert evaluator was reading home_service_summary's -1h $vt_results. Over a -1h window, a 7-min burst on a low-traffic service is averaged together with 53 minutes of healthy traffic, which pulls the error rate below the 1% threshold.
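To make the dilution concrete, here is a small arithmetic sketch. The traffic figures (2 requests per minute, 5% of requests failing during the burst) are invented for illustration and are not taken from the eval; `windowErrorRate` is a hypothetical helper, not a function in this codebase.

```typescript
// Hypothetical illustration of window dilution. All traffic numbers are assumed.

// Error rate seen over `windowMinutes` when errors only occur during a burst
// of `burstMinutes` at the end of the window.
function windowErrorRate(
  requestsPerMinute: number,
  windowMinutes: number,
  burstMinutes: number,
  burstErrorFraction: number,
): number {
  const burst = Math.min(burstMinutes, windowMinutes);
  const requests = requestsPerMinute * windowMinutes;
  const errors = requestsPerMinute * burst * burstErrorFraction;
  return requests === 0 ? 0 : errors / requests;
}

const THRESHOLD = 0.01; // the 1% threshold mentioned in the PR description

// Assumed service: 2 req/min, 5% of requests fail during a 7-minute burst.
const oneHour = windowErrorRate(2, 60, 7, 0.05); // 0.7 errors / 120 requests
const fifteenMin = windowErrorRate(2, 15, 7, 0.05); // 0.7 errors / 30 requests

console.log(oneHour > THRESHOLD); // false: roughly 0.58%, under the threshold
console.log(fifteenMin > THRESHOLD); // true: roughly 2.3%, over the threshold
```

Under these assumed numbers the same burst stays below 1% when averaged over an hour but clears it over 15 minutes, which matches the behavior described above.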
Fix: alertEvaluator() now computes curr_requests/curr_errors directly from spans over a -15m window. Baseline (prev) still reads the -1h criblapm_alert_prev lookup so the comparison is against a longer healthy reference.

Side effects handled:
- traffic_ratio normalizes both windows to requests-per-minute (curr/15 vs prev/60) so a healthy service doesn't look like it's at 25% of prior traffic from window-duration mismatch alone.
- curr_requests >= 5 floor on the error_rate clause. Without it, very-low-traffic services with 1 error out of 1 span would false-fire.

The three alert searches (criblapm__home_alerts, _alert_state_export, _alert_history_send) all updated to earliest: '-15m'.

1c — Investigator timeouts on gradual-onset scenarios
emailMemoryLeak and recommendationCacheFailure investigator waitMs bumped 5 → 10 minutes. Both timed out in the 2026-05-30 eval; leakFingerprint already had 10 min.

1d — adHighCpu p99 surface threshold
Pattern was ≥5ms; eval saw ~3-4ms readings. Loosened to ≥3ms.
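Returning to the two 1a side effects (per-minute normalization of traffic_ratio and the curr_requests >= 5 floor), the logic can be sketched roughly as follows. This is not the actual search definition: the `Counts` shape, the function names, and the strict `>` comparison against the 1% threshold are assumptions made for illustration. Only the window lengths, the floor of 5, and the 1% figure come from the description.

```typescript
// Hypothetical sketch of the 1a side-effect handling; names and types are assumed.
const CURR_WINDOW_MIN = 15; // current window: -15m
const PREV_WINDOW_MIN = 60; // baseline window: -1h
const MIN_CURR_REQUESTS = 5; // floor on the error_rate clause
const ERROR_RATE_THRESHOLD = 0.01; // 1%

interface Counts {
  currRequests: number;
  currErrors: number;
  prevRequests: number;
}

// Compare requests-per-minute rather than raw counts, so that different
// window lengths do not distort the ratio.
function trafficRatio(c: Counts): number | null {
  const prevRate = c.prevRequests / PREV_WINDOW_MIN;
  if (prevRate === 0) return null; // no baseline traffic to compare against
  return c.currRequests / CURR_WINDOW_MIN / prevRate;
}

// The error-rate clause only applies once enough requests have been seen.
function errorRateFires(c: Counts): boolean {
  if (c.currRequests < MIN_CURR_REQUESTS) return false;
  return c.currErrors / c.currRequests > ERROR_RATE_THRESHOLD;
}

// Steady service: 240 requests in the last hour, 60 in the last 15 minutes.
const steady: Counts = { currRequests: 60, currErrors: 0, prevRequests: 240 };
console.log(steady.currRequests / steady.prevRequests); // 0.25 (raw counts)
console.log(trafficRatio(steady)); // 1 (normalized per minute)

// One error out of one span is suppressed by the floor.
console.log(errorRateFires({ currRequests: 1, currErrors: 1, prevRequests: 4 })); // false
```

Comparing raw counts would make the steady service appear to be at 25% of its prior traffic; comparing per-minute rates gives a ratio of 1.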
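Not addressed (intentional)
- 1b: alert_history_send leakage (paymentFailure had state firing but no dataset event). Needs deeper instrumentation; flagged in ROADMAP for follow-up.
- 1b: UI surfaces missing firing state on productCatalogFailure. Same diagnosis-needed reason.
- llmRateLimitError navigation failure when product-reviews drops off the Services list (eval-side workaround).
- failedReadinessProbe cluster-config check.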
Test plan
- npx tsc --noEmit: clean
- npm run lint: 0 errors
- npm test: 107/107 passing
- npm run deploy: deployed, verified auto:health:cart lookup is fresh (no stale state)

Followed by a full eval re-run; results to land in a separate session log.
🤖 Generated with Claude Code