docs: eval v0.9.0 results + DEMOS.md + roadmap fixes for detection gaps by coccyx · Pull Request #59 · criblio/apm

coccyx · 2026-05-31T04:13:06Z

Summary

Two deliverables from the 2026-05-30 eval session:

DEMOS.md — new doc telling operators how to demonstrate each of the 14 scenarios manually (flag flip, expected service, surface checks, Investigator prompt with passing rubric). Pulled directly from the eval scenario declarations.
Full eval results + ROADMAP additions — ran the 14-scenario eval on v0.9.0 over 6h 45m, captured the per-scenario report, and prepended a new top-of-roadmap entry breaking down four distinct detection failure patterns into actionable sub-items.

Eval results

Mean score: 0.66 (was 0.59 in the 2026-04-23 baseline). 3 of 14 fully detected. Investigator scored root-cause-correct on 10 of 11 scenarios that ran it — diagnosis is solid; detection is where the score hides.

| | Fully detected | Investigator failures |
|---|---|---|
| Now | adManualGc, kafkaQueueProblems, loadGeneratorFloodHomepage | emailMemoryLeak, leakFingerprint, recommendationCacheFailure (all timeouts) |
| April | paymentFailure (only) | — |

Full per-scenario table + comparison vs the April baseline in docs/sessions/2026-05-30-eval-v0.9.0.md.

Failure patterns → ROADMAP priorities

The new Priority #1: Alert detection entry breaks down what the eval surfaced into four sub-items:

1a. Alert state machine doesn't fire for low-rate / low-volume services. The -1h baseline window dilutes 7-min flag bursts below the 1% error_rate threshold. Affects 5 scenarios. Proposed: shorter analysis window + low-volume floor.
1b. State machine → UI render path is leaky. productCatalogFailure shows state firing but 3 UI surfaces stay silent. paymentFailure shows state firing but the alertHistory event is missing from the dataset.
1c. Investigator times out on gradual-onset scenarios (emailMemoryLeak, leakFingerprint, recommendationCacheFailure). The 5-min waitMs is too short for multi-step playbooks.
1d. Surface contract mismatches (adHighCpu p99 threshold, llmRateLimitError navigation when service drops off list, failedReadinessProbe cluster behavior unclear).
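
The dilution described in 1a can be sanity-checked with rough arithmetic. The sketch below assumes uniform request volume across the window and no errors outside the burst, neither of which is stated in the eval report. `dilutedErrorRate` and `minBurstRateToTrip` are hypothetical helper names rather than functions from this repo, and the 10-minute window is only an illustrative value for "shorter analysis window".

```typescript
// Hypothetical helpers. Assumes uniform traffic and an error-free baseline
// outside the burst; neither assumption comes from the eval report.

/** Error rate seen over the whole window when errors occur only during the burst. */
function dilutedErrorRate(burstErrorRate: number, burstMin: number, windowMin: number): number {
  const overlap = Math.min(burstMin, windowMin);
  return burstErrorRate * (overlap / windowMin);
}

/** Smallest in-burst error rate that still reaches the threshold over the window. */
function minBurstRateToTrip(threshold: number, burstMin: number, windowMin: number): number {
  const overlap = Math.min(burstMin, windowMin);
  return threshold * (windowMin / overlap);
}

const THRESHOLD = 0.01; // 1% error_rate threshold (from the PR description)
const BURST_MIN = 7;    // 7-min flag burst (from the PR description)

// A 5% in-burst error rate measured over the 60-min window stays under 1%.
const overHour = dilutedErrorRate(0.05, BURST_MIN, 60);
// The same burst measured over an assumed 10-min window clears the threshold.
const overTenMin = dilutedErrorRate(0.05, BURST_MIN, 10);
// In-burst error rate needed to trip the 60-min window (roughly 8.6%).
const neededOverHour = minBurstRateToTrip(THRESHOLD, BURST_MIN, 60);

console.log(overHour >= THRESHOLD, overTenMin >= THRESHOLD, neededOverHour.toFixed(4));
```

Under these assumptions, a service would need roughly a 9% error rate during the burst before the hour-long window registers 1%, which is consistent with low-rate services going undetected.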

Old roadmap items 2-11 shift down one slot (now 3-12) to make room.

Files

DEMOS.md (new)
docs/sessions/2026-05-30-eval-v0.9.0.md (new)
ROADMAP.md (Priority #1 added; #2-#12 renumbered)

Test plan

Doc-only PR. No code changes:

npx tsc --noEmit — clean
npm run lint — 0 errors
npm test — 107/107 passing
Eval run captured to /tmp/apm-eval-2026-05-30.log + /tmp/apm-eval-runs/transcript-*.jsonl

🤖 Generated with Claude Code

Practical "how to demo each failure scenario" pulled from the eval suite (14 scenarios). For each: flag flip, expected service, surface checks per page, the Investigator prompt + scoring rubric, and what makes the scenario interesting. Companion to FAILURE-SCENARIOS.md (deeper reference) and eval/ (automated harness).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ran the 14-scenario eval against the v0.9.0 deployed pack. Mean score 0.66, 3 fully detected (adManualGc, kafkaQueueProblems, loadGeneratorFloodHomepage). Detail in docs/sessions/2026-05-30-eval-v0.9.0.md.

The Investigator scored root-cause-correct on 10/11 scenarios it ran on: the diagnosis layer is healthy. The detection layer is where the score is hiding.

Four distinct failure patterns identified; each gets a sub-item under the new top-of-roadmap entry:

1a. Alert state machine doesn't fire on low-rate/low-volume services (cart, ad, product-reviews, recommendation, checkout-via-readiness). -1h baseline window dilutes 7-min burst below 1% threshold; FIRE_AFTER=2 × 5min cadence requires ≥10min to reach FIRING.
1b. Alert state → UI render path is leaky. productCatalogFailure fires in state machine + history dataset but 3 UI surfaces don't show it. paymentFailure fires in state but history event never lands.
1c. Investigator times out on gradual-onset scenarios (emailMemoryLeak, leakFingerprint, recommendationCacheFailure): 5-min wait too short for multi-step playbooks.
1d. Surface contract mismatches (adHighCpu p99 threshold, llmRateLimitError navigation when service drops off list, failedReadinessProbe cluster behavior unclear).

ROADMAP items 2-11 renumbered (3-12) to make room for the new detection priority at slot 1.

Companion: DEMOS.md (added in this branch's first commit) tells operators how to manually trigger each scenario.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
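
The FIRE_AFTER timing in 1a can be illustrated with a small consecutive-breach state machine. This is a minimal sketch and not the repo's alert code: only `FIRE_AFTER`, the 5-min cadence, and the FIRING state come from the commit message, while the `OK` and `PENDING` state names, `runStateMachine`, and the rule "reset the counter on any non-breaching evaluation" are assumptions made for illustration.

```typescript
// Hypothetical model of a consecutive-breach alert rule.
// Only FIRING, FIRE_AFTER, and the 5-min cadence come from the commit message.
type AlertState = "OK" | "PENDING" | "FIRING";

const FIRE_AFTER = 2;  // consecutive breaching evaluations required
const CADENCE_MIN = 5; // minutes between evaluations

/** Walk per-evaluation breach flags and return the state after each tick. */
function runStateMachine(breaches: boolean[], fireAfter: number = FIRE_AFTER): AlertState[] {
  let consecutive = 0;
  return breaches.map((breached): AlertState => {
    // Assumed rule: any non-breaching evaluation resets the counter.
    consecutive = breached ? consecutive + 1 : 0;
    if (consecutive >= fireAfter) return "FIRING";
    return consecutive > 0 ? "PENDING" : "OK";
  });
}

// Minutes of sustained breach needed when the fault starts just after a tick.
const worstCaseMinutesToFire = FIRE_AFTER * CADENCE_MIN;

// A single breaching evaluation followed by recovery never reaches FIRING.
const shortBurst = runStateMachine([false, true, false]);
// Two consecutive breaching evaluations reach FIRING.
const sustained = runStateMachine([false, true, true]);

console.log(worstCaseMinutesToFire, shortBurst, sustained);
```

In this model a burst that breaches on only one evaluation tick returns to OK without ever firing, which is one way a 7-min flag burst could be missed when the rule needs up to 10 minutes of sustained breach.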

coccyx and others added 2 commits May 30, 2026 14:27

coccyx merged commit 56a73c1 into master May 31, 2026
3 checks passed

coccyx deleted the docs/eval-2026-05-30 branch May 31, 2026 05:37

coccyx mentioned this pull request May 31, 2026

fix(alerts): shorter detection window + investigator wait bumps #60

Merged

5 tasks

coccyx merged 2 commits into master from docs/eval-2026-05-30
