docs: eval v0.9.0 results + DEMOS.md + roadmap fixes for detection gaps #59
Merged
Practical "how to demo each failure scenario" pulled from the eval suite (14 scenarios). For each: flag flip, expected service, surface checks per page, the Investigator prompt + scoring rubric, and what makes the scenario interesting. Companion to FAILURE-SCENARIOS.md (deeper reference) and eval/ (automated harness).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ran the 14-scenario eval against the v0.9.0 deployed pack. Mean
score 0.66, 3 fully detected (adManualGc, kafkaQueueProblems,
loadGeneratorFloodHomepage). Detail in
docs/sessions/2026-05-30-eval-v0.9.0.md.
The Investigator scored root-cause-correct on 10/11 scenarios it
ran on — diagnosis layer is healthy. Detection layer is where
the score is hiding. Four distinct failure patterns identified;
each gets a sub-item under the new top-of-roadmap entry:
1a. Alert state machine doesn't fire on low-rate/low-volume
services (cart, ad, product-reviews, recommendation,
checkout-via-readiness). -1h baseline window dilutes 7-min
burst below 1% threshold; FIRE_AFTER=2 × 5min cadence
requires ≥10min to reach FIRING.
1b. Alert state → UI render path is leaky. productCatalogFailure
fires in state machine + history dataset but 3 UI surfaces
don't show it. paymentFailure fires in state but history
event never lands.
1c. Investigator times out on gradual-onset scenarios
(emailMemoryLeak, leakFingerprint, recommendationCacheFailure)
— 5-min wait too short for multi-step playbooks.
1d. Surface contract mismatches (adHighCpu p99 threshold,
llmRateLimitError navigation when service drops off list,
failedReadinessProbe cluster behavior unclear).
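The arithmetic behind 1a can be sketched as follows. This is a rough illustration, not the alerting code itself: it assumes traffic is uniform across the window and that the service is error-free outside the burst, and every name in it is hypothetical. Only the figures (1h window, 7-min burst, 1% threshold, FIRE_AFTER=2, 5-min cadence) come from the notes above.

```typescript
// Hypothetical constants mirroring the figures quoted in item 1a.
const WINDOW_MIN = 60;   // the -1h baseline window
const BURST_MIN = 7;     // length of the injected failure burst
const THRESHOLD = 0.01;  // 1% error-rate threshold
const FIRE_AFTER = 2;    // breaching evaluations needed before FIRING
const CADENCE_MIN = 5;   // minutes between evaluations

/** Error rate over the whole window when only the burst contains errors
 *  (assumes uniform traffic and a clean baseline). */
function windowedErrorRate(burstErrorRate: number): number {
  return burstErrorRate * (BURST_MIN / WINDOW_MIN);
}

/** Smallest in-burst error rate that still reaches the threshold after dilution. */
function minBurstRateToBreach(): number {
  return THRESHOLD * (WINDOW_MIN / BURST_MIN);
}

// Time to reach FIRING, using the FIRE_AFTER x cadence reading from item 1a.
const minutesToFiring = FIRE_AFTER * CADENCE_MIN;

console.log(windowedErrorRate(0.05).toFixed(4)); // 0.0058, below the 1% threshold
console.log(minBurstRateToBreach().toFixed(4));  // 0.0857
console.log(minutesToFiring);                    // 10
```

Under these assumptions, a burst with a 5% error rate is diluted to roughly 0.6% over the hour, so it never breaches; the burst would need an error rate of about 8.6% to cross 1%.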
ROADMAP items 2-11 renumbered (3-12) to make room for the new
detection priority at slot 1.
Companion: DEMOS.md (added in this branch's first commit) tells
operators how to manually trigger each scenario.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Two deliverables from the 2026-05-30 eval session:
Eval results
Mean score: 0.66 (was 0.59 in the 2026-04-23 baseline). 3 of 14 fully detected. Investigator scored root-cause-correct on 10 of 11 scenarios that ran it — diagnosis is solid; detection is where the score hides.
Full per-scenario table + comparison vs the April baseline in docs/sessions/2026-05-30-eval-v0.9.0.md.
Failure patterns → ROADMAP priorities
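The new Priority #1: Alert detection entry breaks down what the eval surfaced into four sub-items:
1a. Alert state machine doesn't fire on low-rate/low-volume services.
1b. Alert state → UI render path is leaky.
1c. Investigator times out on gradual-onset scenarios.
1d. Surface contract mismatches.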
Old roadmap items 2-11 shift down one slot (now 3-12) to make room.
Files
DEMOS.md (new)
docs/sessions/2026-05-30-eval-v0.9.0.md (new)
ROADMAP.md (Priority #1 added; #2-#12 renumbered)
Test plan
Doc-only PR. No code changes:
npx tsc --noEmit: clean
npm run lint: 0 errors
npm test: 107/107 passing
/tmp/apm-eval-2026-05-30.log + /tmp/apm-eval-runs/transcript-*.jsonl
🤖 Generated with Claude Code