fix(alerts): shorter detection window + investigator wait bumps by coccyx · Pull Request #60 · criblio/apm

coccyx · 2026-05-31T05:44:16Z

Addresses items 1a, 1c, and 1d from the 2026-05-30 eval (PR #59 results).

1a — Alert state machine doesn't fire on low-rate / low-volume services

5 scenarios in the 2026-05-30 eval (cart, ad, product-reviews, recommendation, checkout) never reached FIRING state because the alert evaluator was reading home_service_summary's -1h $vt_results. A 7-min burst on a low-traffic service gets diluted by 53 minutes of healthy traffic below the 1% threshold.

Fix: alertEvaluator() now computes curr_requests / curr_errors directly from spans over a -15m window. Baseline (prev) still reads the -1h criblapm_alert_prev lookup so the comparison is against a longer healthy reference.

Side effects handled:

traffic_ratio normalizes both windows to requests-per-minute (curr/15 vs prev/60) so a healthy service doesn't look like it's at 25% of prior traffic from window-duration mismatch alone.
Added curr_requests >= 5 floor on the error_rate clause. Without it, very-low-traffic services with 1 error out of 1 span would false-fire.

The three alert searches (criblapm__home_alerts, _alert_state_export, _alert_history_send) all updated to earliest: '-15m'.

1c — Investigator timeouts on gradual-onset scenarios

emailMemoryLeak and recommendationCacheFailure Investigator waitMs bumped 5 → 10 minutes. Both timed out in the 2026-05-30 eval; leakFingerprint already had 10 min.

1d — adHighCpu p99 surface threshold

Pattern was ≥5ms; eval saw ~3-4ms readings. Loosened to ≥3ms.

Not addressed (intentional)

1b: alert_history_send leakage (paymentFailure had state firing but no dataset event). Needs deeper instrumentation; flagged in ROADMAP follow-up.
1b: UI surfaces missing firing state on productCatalogFailure (same diagnosis-needed reason).
llmRateLimitError navigation failure when product-reviews drops off the Services list (eval-side workaround).
failedReadinessProbe cluster-config check.

Test plan

npx tsc --noEmit — clean
npm run lint — 0 errors
npm test — 107/107 passing
npm run deploy — deployed, verified auto:health:cart lookup is fresh (no stale state)
MCP probe of new query shape returns sensible per-service rows

Followed by a full eval re-run; results to land in a separate session log.

🤖 Generated with Claude Code

Addresses items 1a, 1c, and 1d from the 2026-05-30 eval surfaced detection failures. 1a (low-rate / low-volume services). The alert evaluator now computes curr_requests/curr_errors directly from spans over a -15m window instead of reading home_service_summary's -1h $vt_results. The -1h window was diluting 7-min flag bursts on low-traffic services (cart, ad, product-reviews, recommendation) below the 1% threshold; -15m keeps the signal fresh. Baseline (prev) still reads the -1h criblapm_alert_prev lookup so the comparison is against a longer healthy reference. Side effects handled: - traffic_ratio now normalizes both windows to requests-per-minute (curr/15 vs prev/60) so a healthy service doesn't look like it's at 25% of prior traffic from window-duration mismatch alone. - Added a curr_requests >= 5 floor to the error_rate clause. Without it, very-low-traffic services with 1 error out of 1 span would false-fire (e.g. product-reviews micro-noise). 1c (investigator timeouts). emailMemoryLeak and recommendationCacheFailure investigator waitMs bumped 5 → 10 minutes. Both timed out in the 2026-05-30 eval with multi-step playbooks (uptime, memory metric, slope, downstream health). leakFingerprint already had 10 min. 1d (adHighCpu p99 surface threshold). Pattern was ≥5ms; eval saw ~3-4ms readings. Loosened to ≥3ms. Not addressed in this PR (intentional): - 1b: alert_history_send leakage (paymentFailure had state firing but no dataset event). Needs deeper instrumentation; flagged in ROADMAP for follow-up. - 1b: UI surfaces missing firing state on productCatalogFailure. Same diagnosis-needed reason. - llmRateLimitError navigation failure when product-reviews drops off the Services list (eval-side workaround). - failedReadinessProbe cluster-config check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coccyx merged commit 462d5f1 into master May 31, 2026
3 checks passed

coccyx deleted the fix/alert-detection-window branch May 31, 2026 05:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(alerts): shorter detection window + investigator wait bumps#60

fix(alerts): shorter detection window + investigator wait bumps#60
coccyx merged 1 commit into
masterfrom
fix/alert-detection-window

coccyx commented May 31, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

coccyx commented May 31, 2026

1a — Alert state machine doesn't fire on low-rate / low-volume services

1c — Investigator timeouts on gradual-onset scenarios

1d — adHighCpu p99 surface threshold

Not addressed (intentional)

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant