docs: eval v0.9.0 results + DEMOS.md + roadmap fixes for detection gaps#59

Merged: coccyx merged 2 commits into master from docs/eval-2026-05-30 on May 31, 2026

@coccyx coccyx commented May 31, 2026

Summary

Two deliverables from the 2026-05-30 eval session:

  1. DEMOS.md — new doc telling operators how to demonstrate each of the 14 scenarios manually (flag flip, expected service, surface checks, Investigator prompt with passing rubric). Pulled directly from the eval scenario declarations.
  2. Full eval results + ROADMAP additions — ran the 14-scenario eval on v0.9.0 over 6h 45m, captured the per-scenario report, and prepended a new top-of-roadmap entry breaking down four distinct detection failure patterns into actionable sub-items.

Eval results

Mean score: 0.66 (up from 0.59 in the 2026-04-23 baseline). 3 of 14 scenarios were fully detected. The Investigator scored root-cause-correct on 10 of the 11 scenarios where it ran, so diagnosis is solid; the lost points are in detection.

| Run | Fully detected | Investigator failures |
| --- | --- | --- |
| Now | adManualGc, kafkaQueueProblems, loadGeneratorFloodHomepage | emailMemoryLeak, leakFingerprint, recommendationCacheFailure (all timeouts) |
| April | paymentFailure (only) | |

The full per-scenario table and the comparison against the April baseline are in docs/sessions/2026-05-30-eval-v0.9.0.md.

Failure patterns → ROADMAP priorities

The new Priority #1: Alert detection entry breaks down what the eval surfaced into four sub-items:

  • 1a. Alert state machine doesn't fire for low-rate / low-volume services. The -1h baseline window dilutes 7-min flag bursts below the 1% error_rate threshold. Affects 5 scenarios. Proposed: shorter analysis window + low-volume floor.
  • 1b. State machine → UI render path is leaky. productCatalogFailure shows the state firing but 3 UI surfaces stay silent. paymentFailure shows the state firing but the alertHistory event is missing from the dataset.
  • 1c. Investigator times out on gradual-onset scenarios (emailMemoryLeak, leakFingerprint, recommendationCacheFailure). The 5-min waitMs is too short for multi-step playbooks.
  • 1d. Surface contract mismatches (adHighCpu p99 threshold, llmRateLimitError navigation when service drops off list, failedReadinessProbe cluster behavior unclear).
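
The dilution in 1a can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the repo: the function names are hypothetical, and it assumes uniform request volume across the window and no errors outside the burst. Under those assumptions, a 7-min burst seen through a 60-min window contributes only 7/60 of its error rate.

```typescript
// Hypothetical helpers, not from the repo. Assumptions: request volume is
// uniform across the analysis window and errors occur only during the burst.

/** Error rate the alert sees once the burst is averaged over the whole window. */
function dilutedErrorRate(
  burstErrorRate: number,
  burstMinutes: number,
  windowMinutes: number,
): number {
  const overlap = Math.min(burstMinutes, windowMinutes);
  return burstErrorRate * (overlap / windowMinutes);
}

/** Smallest in-burst error rate that still reaches the threshold after dilution. */
function minBurstRateToBreach(
  threshold: number,
  burstMinutes: number,
  windowMinutes: number,
): number {
  const overlap = Math.min(burstMinutes, windowMinutes);
  return threshold * (windowMinutes / overlap);
}

const THRESHOLD = 0.01; // 1% error_rate threshold, from the PR description
const BURST_MIN = 7;    // 7-min flag burst, from the PR description
const WINDOW_MIN = 60;  // -1h baseline window, from the PR description

// A burst with 5% errors averages out to roughly 0.58%, below the 1% threshold.
console.log(dilutedErrorRate(0.05, BURST_MIN, WINDOW_MIN));

// The burst needs roughly 8.6% errors before the 60-min average reaches 1%.
console.log(minBurstRateToBreach(THRESHOLD, BURST_MIN, WINDOW_MIN));

// With a hypothetical 10-min window, roughly 1.4% would be enough.
console.log(minBurstRateToBreach(THRESHOLD, BURST_MIN, 10));
```

This is why a shorter analysis window helps: the required in-burst rate scales with window length divided by burst length. It does not model the proposed low-volume floor.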

Old roadmap items 2-11 shift down one slot (becoming 3-12) to make room.

Files

Test plan

Doc-only PR. No code changes:

  • npx tsc --noEmit — clean
  • npm run lint — 0 errors
  • npm test — 107/107 passing
  • Eval run captured to /tmp/apm-eval-2026-05-30.log + /tmp/apm-eval-runs/transcript-*.jsonl

🤖 Generated with Claude Code

coccyx and others added 2 commits May 30, 2026 14:27
Practical "how to demo each failure scenario" pulled from the eval
suite (14 scenarios). For each: flag flip, expected service,
surface checks per page, the Investigator prompt + scoring rubric,
and what makes the scenario interesting.

Companion to FAILURE-SCENARIOS.md (deeper reference) and eval/
(automated harness).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ran the 14-scenario eval against the v0.9.0 deployed pack. Mean
score 0.66, 3 fully detected (adManualGc, kafkaQueueProblems,
loadGeneratorFloodHomepage). Detail in
docs/sessions/2026-05-30-eval-v0.9.0.md.

The Investigator scored root-cause-correct on 10/11 scenarios it
ran on — diagnosis layer is healthy. Detection layer is where
the score is hiding. Four distinct failure patterns identified;
each gets a sub-item under the new top-of-roadmap entry:

  1a. Alert state machine doesn't fire on low-rate/low-volume
      services (cart, ad, product-reviews, recommendation,
      checkout-via-readiness). -1h baseline window dilutes 7-min
      burst below 1% threshold; FIRE_AFTER=2 × 5min cadence
      requires ≥10min to reach FIRING.
  1b. Alert state → UI render path is leaky. productCatalogFailure
      fires in state machine + history dataset but 3 UI surfaces
      don't show it. paymentFailure fires in state but history
      event never lands.
  1c. Investigator times out on gradual-onset scenarios
      (emailMemoryLeak, leakFingerprint, recommendationCacheFailure)
      — 5-min wait too short for multi-step playbooks.
  1d. Surface contract mismatches (adHighCpu p99 threshold,
      llmRateLimitError navigation when service drops off list,
      failedReadinessProbe cluster behavior unclear).
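
The FIRE_AFTER timing in 1a can be sketched as a small state machine. This is a minimal illustration, not the pack's actual implementation: the type and function names are hypothetical, and it assumes breaches must be consecutive and that a single clean evaluation resets the count. Only FIRE_AFTER=2 and the 5-min cadence come from the commit message.

```typescript
// Hypothetical sketch of a consecutive-breach alert state machine.
// Assumption: one clean evaluation resets the breach counter.
type AlertState = "OK" | "PENDING" | "FIRING";

interface AlertSnapshot {
  status: AlertState;
  breaches: number; // consecutive breached evaluations so far
}

const FIRE_AFTER = 2;  // from the commit message
const CADENCE_MIN = 5; // evaluation cadence in minutes, from the commit message

/** Advance the state machine by one evaluation. */
function step(prev: AlertSnapshot, breached: boolean): AlertSnapshot {
  if (!breached) {
    return { status: "OK", breaches: 0 };
  }
  const breaches = prev.breaches + 1;
  return { status: breaches >= FIRE_AFTER ? "FIRING" : "PENDING", breaches };
}

/** Replay a sequence of evaluation results and return the status after each. */
function replay(evaluations: boolean[]): AlertState[] {
  let snapshot: AlertSnapshot = { status: "OK", breaches: 0 };
  return evaluations.map((breached) => {
    snapshot = step(snapshot, breached);
    return snapshot.status;
  });
}

// A breach seen by a single evaluation only ever reaches PENDING.
console.log(replay([true, false]));

// Two consecutive breached evaluations are needed to reach FIRING.
console.log(replay([true, true]));

// Minutes of evaluation cadence that must show a breach before FIRING.
console.log(FIRE_AFTER * CADENCE_MIN);
```

Under these assumptions, a breach that shows up in only one evaluation never gets past PENDING, which matches the commit message's point that FIRING needs at least FIRE_AFTER × cadence = 10 min.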

ROADMAP items 2-11 renumbered (3-12) to make room for the new
detection priority at slot 1.

Companion: DEMOS.md (added in this branch's first commit) tells
operators how to manually trigger each scenario.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coccyx coccyx merged commit 56a73c1 into master May 31, 2026
3 checks passed
@coccyx coccyx deleted the docs/eval-2026-05-30 branch May 31, 2026 05:37