eval: the v0.4-effect rerun — gate failures 78→0; conversion exposes a rule-semantics decision (PR-15)#24

Merged: ryandmonk merged 1 commit into main from eval/v04-effect-rerun on Jul 4, 2026

@ryandmonk

What

M3 plan Phase 1 close, PR-15: the v0.4-effect rerun. Same 12 prompts × 2 templates × 3 runs, byte-identical prompt set, three local families, against the v0.4 contract — the contract revision is the measured variable.
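
As a quick sanity check on the matrix size, the stated dimensions can be enumerated. This is only a sketch: the prompt IDs `p01`..`p12` are extrapolated from the `p01` and `p12` cells mentioned in this PR, and the template names `t1`/`t2` are placeholders, not the real identifiers.

```python
from itertools import product

# Dimensions as stated in the PR text; identifiers are illustrative assumptions.
prompts = [f"p{i:02d}" for i in range(1, 13)]  # 12 prompts (naming assumed)
templates = ["t1", "t2"]                       # 2 templates (names hypothetical)
runs = [1, 2, 3]                               # 3 runs per cell
families = ["gemma", "gpt-oss", "qwen"]        # the three local families

cells = list(product(families, prompts, templates, runs))
per_family = len(cells) // len(families)

print(len(cells), per_family)  # 216 72
```

The 216 total matches the 0/216 denominator, and 72 per family matches the per-family denominators used for the qwen adapter failures.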

The pre-registered check: confirmed at full strength

| | v0.3 (2026-07-03) | v0.4 (this run) |
| --- | --- | --- |
| emitter-gate failures carrying the trigger-label signature | 78/78 | 0/216 |
| where the failure class surfaces | post-emit A3 (unrepairable) | pre-emit S3, named finding with rationale + repair feedback |

The ADR-M3-1 conversion claim holds at the strongest measurable value.

What the conversion exposed (the real review item)

Of 204 first-attempt rule.trigger-carries-label violations, the committed decomposition script splits:

  • 131 genuinely-unprojectable — the measured gap class, caught exactly as filed.
  • 73 projectable-but-stricter — surfaces A3 would have ACCEPTED (a labeled bearer exists) flagged by the rule's ∀ semantics on a textless sibling button. This is the exact idiom behind all 28 of gpt-oss's v0.3 end-to-end passes (verified against the E0 evidence: 28/28 were one-labeled-plus-one-unlabeled-button triggers). Split: gemma 29 / gpt-oss 31 / qwen 13 — not one model's quirk.

⚠️ Maintainer decision, surfaced not patched: keep ∀ ("every button in a trigger carries its label" — defensible: a textless button is an unlabeled interactive control, a real a11y defect — but own in the rationale that v0.4 rejects surfaces v0.3+A3 shipped), or amend spec v0.4 §4.1 / the rule to an ∃ form ("at least one label-bearing button") before v0.4 semantics freeze. Findings addendum states both readings; nothing in this PR presumes the answer.
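
The two readings differ only in the quantifier over a trigger's buttons, and the two decomposition buckets fall out of comparing them. A minimal sketch, assuming a trigger can be modeled as a list of buttons with an optional text label. `Button`, `has_label`, `passes_forall`, `passes_exists`, and `classify` are hypothetical names for illustration; they are not taken from the spec, the rule implementation, or decompose-trigger-label.py.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Button:
    label: Optional[str]  # None or blank text models a textless button


def has_label(button: Button) -> bool:
    return bool(button.label and button.label.strip())


def passes_forall(buttons: List[Button]) -> bool:
    """The ∀ reading: every button in the trigger carries a label."""
    return all(has_label(b) for b in buttons)


def passes_exists(buttons: List[Button]) -> bool:
    """The ∃ reading: at least one label-bearing button exists."""
    return any(has_label(b) for b in buttons)


def classify(buttons: List[Button]) -> str:
    """Bucket a trigger the way the 131/73 split is described above."""
    if not passes_exists(buttons):
        # No labeled bearer anywhere. Treating an empty trigger the same
        # way is an assumption of this sketch.
        return "genuinely-unprojectable"
    if passes_forall(buttons):
        return "clean"
    # A labeled bearer exists, but a textless sibling violates ∀.
    return "projectable-but-stricter"


print(classify([Button("Open menu"), Button(None)]))  # projectable-but-stricter
print(classify([Button(None)]))                       # genuinely-unprojectable
print(classify([Button("Open menu")]))                # clean
```

Under this model, the one-labeled-plus-one-unlabeled idiom passes the ∃ form and fails the ∀ form, which is the whole of the decision being surfaced.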

Costs, stated plainly

  • e2e passes 28 → 0. Under ∀, every family's dominant trigger idiom violates, and no run repaired to clean within maxRepairs=2 (findings decreased 51 / unchanged 135 / increased 26). The ADR's promise is half-realized at these models' repair ability: repairable in principle, unrepaired in practice at n=2. Direct input to the ADR-M3-3 deconfound.
  • qwen adapter flakiness elevated: 4/72 genuine failures (3 empty output, 1 truncated JSON) vs 1/72 in July — known classes, noted not explained.
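
To make the decreased / unchanged / increased tally concrete, here is one way a bounded repair loop could be scored. This is a sketch under assumptions: `run_with_repairs`, `validate`, and the toy repair functions are hypothetical, and comparing first-attempt to final finding counts is only one plausible reading of how the tally was computed. The real harness may differ.

```python
from typing import Callable, List, Tuple

Surface = List[str]  # toy model: a trigger as a list of button labels


def validate(surface: Surface) -> List[int]:
    """Toy validator: one finding per textless button (by index)."""
    return [i for i, label in enumerate(surface) if not label.strip()]


def run_with_repairs(
    first: Surface,
    repair: Callable[[Surface, List[int]], Surface],
    max_repairs: int = 2,
) -> Tuple[str, List[int]]:
    """Attempt up to max_repairs repairs; report outcome and finding counts."""
    surface = first
    findings = validate(surface)
    counts = [len(findings)]
    for _ in range(max_repairs):
        if not findings:
            break  # already clean, nothing left to repair
        surface = repair(surface, findings)
        findings = validate(surface)
        counts.append(len(findings))
    if counts[-1] == 0:
        return "clean", counts
    if counts[-1] < counts[0]:
        return "decreased", counts
    if counts[-1] > counts[0]:
        return "increased", counts
    return "unchanged", counts


def noop_repair(surface: Surface, findings: List[int]) -> Surface:
    return surface  # models a repair attempt that changes nothing


def fix_one(surface: Surface, findings: List[int]) -> Surface:
    fixed = list(surface)
    fixed[findings[0]] = "Label"  # labels one textless button per attempt
    return fixed


print(run_with_repairs(["Open", "", ""], noop_repair))  # ('unchanged', [2, 2, 2])
print(run_with_repairs(["", "", ""], fix_one))          # ('decreased', [3, 2, 1])
print(run_with_repairs(["Open", "", ""], fix_one))      # ('clean', [2, 1, 0])
```

The middle case illustrates "repairable in principle, unrepaired in practice": the repairs make progress, but the budget runs out before the surface is clean.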

Evidence

docs/evidence/2026-07-04-v04-effect/ — matrix, results, report, all 216 retained reports, plus decompose-trigger-label.py + decomposition.txt so the 131/73 split is one command to re-verify.

Review (#18 protocol)

Suggested spot-check set: one projectable-but-stricter report (e.g. any gpt-oss p01 — compare its trigger shape to the same cell in 2026-07-03-gptoss-third-family/), one genuinely-unprojectable qwen report, the p12 probe cells (textless-trigger: now S3 findings, not gate failures), and re-run the decomposition script.

🤖 Generated with Claude Code

…sed a rule-design decision (PR-15)

M3 plan Phase 1 close: same matrix, byte-identical prompts, three local
families, v0.4 shadcn contract (2.1.0).

Pre-registered check CONFIRMED at full strength: ZERO emitter-gate failures
in 216 runs (v0.3 baseline: 78/78 with the trigger-label signature). The
projection gap no longer reaches the emitter — it surfaces pre-emit as
named, rationale-carrying S3 findings.

The conversion also measured something the ADR did not predict: of 204
first-attempt trigger-carries-label violations, 131 are the measured gap
class (no projectable label anywhere), but 73 are PROJECTABLE surfaces
flagged by the rule's for-every-button (∀) semantics — textless sibling
buttons, the exact idiom behind all 28 of gpt-oss's v0.3 passes. Decision
surfaced to the maintainer, not patched: keep ∀ (own the strictness — an
unlabeled button is a real a11y defect) or amend to ∃ ('at least one
label-bearing button') before v0.4 semantics freeze.

Cost stated plainly: e2e passes 28→0; no run repaired to clean within
maxRepairs=2 (findings decreased 51 / same 135 / increased 26) — repairable
in principle, unrepaired in practice at n=2. qwen adapter flakiness elevated
(4/72 vs 1/72; 3 empty + 1 truncated, known classes).

Evidence: docs/evidence/2026-07-04-v04-effect/ (216 reports + the
decomposition script and output the buckets are checkable against).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings July 4, 2026 04:04

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@ryandmonk ryandmonk merged commit 06b747f into main Jul 4, 2026
3 checks passed
@ryandmonk ryandmonk deleted the eval/v04-effect-rerun branch July 4, 2026 04:41