eval: the v0.4-effect rerun — gate failures 78→0; conversion exposes a rule-semantics decision (PR-15)#24
…sed a rule-design decision (PR-15)
M3 plan Phase 1 close: same matrix, byte-identical prompts, three local
families, v0.4 shadcn contract (2.1.0).
Pre-registered check CONFIRMED at full strength: ZERO emitter-gate failures
in 216 runs (v0.3 baseline: 78/78 with the trigger-label signature). The
projection gap no longer reaches the emitter — it surfaces pre-emit as
named, rationale-carrying S3 findings.
The conversion also measured something the ADR did not predict: of 204
first-attempt trigger-carries-label violations, 131 are the measured gap
class (no projectable label anywhere), but 73 are PROJECTABLE surfaces
flagged by the rule's for-every-button (∀) semantics — textless sibling
buttons, the exact idiom behind all 28 of gpt-oss's v0.3 passes. Decision
surfaced to the maintainer, not patched: keep ∀ (own the strictness — an
unlabeled button is a real a11y defect) or amend to ∃ ('at least one
label-bearing button') before v0.4 semantics freeze.
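The difference between the two candidate semantics can be sketched in a few lines. This is an illustration only, not the project's rule implementation: the `Button` type and both function names are hypothetical, and a trigger is modeled simply as a list of buttons whose label may be absent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Button:
    # Hypothetical model: None (or "") stands for a textless button.
    label: Optional[str]


def passes_forall(buttons: List[Button]) -> bool:
    """Strict reading: every button must carry a label."""
    return all(bool(b.label) for b in buttons)


def passes_exists(buttons: List[Button]) -> bool:
    """Relaxed reading: at least one button must carry a label."""
    return any(bool(b.label) for b in buttons)


# A labeled button with a textless sibling: the idiom the two readings disagree on.
sibling_idiom = [Button("Open menu"), Button(None)]
# No label anywhere: both readings flag it.
no_label = [Button(None), Button(None)]

print(passes_forall(sibling_idiom), passes_exists(sibling_idiom))
print(passes_forall(no_label), passes_exists(no_label))
```

Under this toy model the sibling idiom fails the ∀ reading and passes the ∃ reading, while a surface with no label at all fails both, which matches the shape of the 73 / 131 split described above.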
Cost stated plainly: e2e passes 28→0; no run repaired to clean within
maxRepairs=2 (findings decreased 51 / same 135 / increased 26) — repairable
in principle, unrepaired in practice at n=2. qwen adapter flakiness elevated
(4/72 vs 1/72; 3 empty + 1 truncated, known classes).
Evidence: docs/evidence/2026-07-04-v04-effect/ (216 reports + the
decomposition script and output the buckets are checkable against).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
What
M3 plan Phase 1 close, PR-15: the v0.4-effect rerun. Same 12 prompts × 2 templates × 3 runs, byte-identical prompt set, three local families, against the v0.4 contract — the contract revision is the measured variable.
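As a quick sanity check on the matrix arithmetic, the cell count can be enumerated directly. This is a sketch under assumptions: the template names and the third family's name are placeholders (the source does not name them), and the `pNN` prompt ids are extrapolated from the `p01` and `p12` ids mentioned in the review notes.

```python
from itertools import product

# Placeholder identifiers; only the counts (12, 2, 3, 3) come from the PR text.
PROMPTS = [f"p{i:02d}" for i in range(1, 13)]   # 12 prompts, assumed ids p01..p12
TEMPLATES = ["template-a", "template-b"]        # 2 templates, names hypothetical
RUNS = [1, 2, 3]                                # 3 runs per cell
FAMILIES = ["gpt-oss", "qwen", "family-3"]      # third family name not given in the source

cells = list(product(FAMILIES, PROMPTS, TEMPLATES, RUNS))
runs_per_family = len(cells) // len(FAMILIES)

print(len(cells))        # 216 total runs
print(runs_per_family)   # 72 runs per family
```

The per-family figure of 72 is the denominator used in the adapter-flakiness rates (4/72 vs 1/72).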
The pre-registered check: confirmed at full strength
The ADR-M3-1 conversion claim holds at the strongest measurable value.
What the conversion exposed (the real review item)
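Of 204 first-attempt rule.trigger-carries-label violations, the committed decomposition script splits:
genuinely-unprojectable (no projectable label anywhere): 131
projectable-but-stricter (projectable surfaces flagged by the rule's ∀ semantics): 73
Costs, stated plainly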
Evidence
docs/evidence/2026-07-04-v04-effect/: matrix, results, report, all 216 retained reports, plus decompose-trigger-label.py + decomposition.txt, so the 131/73 split is one command to re-verify.
Review (#18 protocol)
Suggested spot-check set: one projectable-but-stricter report (e.g. any gpt-oss p01; compare its trigger shape to the same cell in 2026-07-03-gptoss-third-family/), one genuinely-unprojectable qwen report, the p12 probe cells (textless-trigger: now S3 findings, not gate failures), and re-run the decomposition script.
🤖 Generated with Claude Code