eval: the v0.4-effect rerun — gate failures 78→0; conversion exposes a rule-semantics decision (PR-15) by ryandmonk · Pull Request #24 · aestheticfunction/dspack-gen

ryandmonk · 2026-07-04T04:04:22Z

What

M3 plan Phase 1 close, PR-15: the v0.4-effect rerun. Same 12 prompts × 2 templates × 3 runs, byte-identical prompt set, three local families, against the v0.4 contract — the contract revision is the measured variable.
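For scale, the matrix arithmetic can be sketched as below. This is an illustration only, not the harness code: the family names are taken from the split reported later in this description, and the `p01`..`p12` prompt IDs and the 0-based template/run indices are assumptions made for the example.

```python
from itertools import product

PROMPTS = [f"p{i:02d}" for i in range(1, 13)]   # 12 prompts (ID format assumed)
TEMPLATES = range(2)                            # 2 templates (indices assumed)
RUNS = range(3)                                 # 3 runs per cell
FAMILIES = ("gemma", "gpt-oss", "qwen")         # three local families

# One family covers prompts x templates x runs.
cells_per_family = len(PROMPTS) * len(TEMPLATES) * len(RUNS)

# The full matrix is the same grid repeated for each family.
matrix = list(product(FAMILIES, PROMPTS, TEMPLATES, RUNS))
total_runs = len(matrix)

print(cells_per_family, total_runs)  # 72 216
```

The 72-per-family figure is the denominator used for the qwen adapter failures, and 216 is the denominator for the gate-failure check.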

The pre-registered check: confirmed at full strength

Measure	v0.3 (2026-07-03)	v0.4 (this run)
emitter-gate failures carrying the trigger-label signature	78 of 78 gate failures	0 in 216 runs
where the failure class surfaces	post-emit A3 (unrepairable)	pre-emit S3, named finding with rationale + repair feedback

The ADR-M3-1 conversion claim holds at the strongest measurable value.

What the conversion exposed (the real review item)

Of 204 first-attempt rule.trigger-carries-label violations, the committed decomposition script splits:

131 genuinely-unprojectable — the measured gap class, caught exactly as filed.
73 projectable-but-stricter — surfaces A3 would have ACCEPTED (a labeled bearer exists) flagged by the rule's ∀ semantics on a textless sibling button. This is the exact idiom behind all 28 of gpt-oss's v0.3 end-to-end passes (verified against the E0 evidence: 28/28 were one-labeled-plus-one-unlabeled-button triggers). Split: gemma 29 / gpt-oss 31 / qwen 13 — not one model's quirk.

⚠️ Maintainer decision, surfaced not patched. Two options before v0.4 semantics freeze:

Keep ∀ ("every button in a trigger carries its label"). This is defensible: a textless button is an unlabeled interactive control, a real a11y defect. The rationale must then own that v0.4 rejects surfaces v0.3+A3 shipped.
Amend spec v0.4 §4.1 / the rule to an ∃ form ("at least one label-bearing button").

The findings addendum states both readings; nothing in this PR presumes the answer.
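The difference between the two readings, and how it produces the two buckets, can be sketched as follows. This is a simplified illustration under stated assumptions, not the rule implementation or the committed decomposition script: `Button`, `has_label`, `passes_forall`, `passes_exists`, and `bucket` are hypothetical names, and a trigger is modeled as a flat list of buttons with an optional text label.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Button:
    # Hypothetical model: a button with an optional text label.
    label: Optional[str] = None


def has_label(button: Button) -> bool:
    """A button counts as labeled if it has non-blank text."""
    return bool(button.label and button.label.strip())


def passes_forall(buttons: List[Button]) -> bool:
    """∀ reading: every button in the trigger carries a label.
    Note: an empty list passes vacuously in this sketch."""
    return all(has_label(b) for b in buttons)


def passes_exists(buttons: List[Button]) -> bool:
    """∃ reading: at least one button in the trigger carries a label."""
    return any(has_label(b) for b in buttons)


def bucket(buttons: List[Button]) -> str:
    """Classify a trigger the way the two buckets are described above."""
    if passes_forall(buttons):
        return "clean"
    if passes_exists(buttons):
        return "projectable-but-stricter"   # a labeled bearer exists, ∀ still flags it
    return "genuinely-unprojectable"        # no label anywhere


# The one-labeled-plus-one-unlabeled idiom: fails ∀, passes ∃.
mixed = [Button("Open"), Button(None)]
print(bucket(mixed))             # projectable-but-stricter
print(bucket([Button(None)]))    # genuinely-unprojectable
```

Under this model the maintainer decision amounts to choosing whether the `mixed` trigger is a violation (∀) or acceptable (∃); the genuinely-unprojectable case fails under both readings.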

Costs, stated plainly

e2e passes 28 → 0. Under ∀, every family's dominant trigger idiom violates the rule, and no run repaired to clean within maxRepairs=2 (findings decreased 51 / unchanged 135 / increased 26). The ADR's promise is half-realized at these models' repair ability: repairable in principle, unrepaired in practice at n=2. Direct input to the ADR-M3-3 deconfound.
qwen adapter flakiness elevated: 4/72 genuine failures (3 empty output, 1 truncated JSON) vs 1/72 in July — known classes, noted not explained.
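As a reading aid, the cost figures above can be reconciled with the run count. The sketch below uses only numbers stated in this description; that the four qwen adapter failures are exactly the runs missing from the repair tallies is an inference from the arithmetic, not something the description states.

```python
TOTAL_RUNS = 216

# Repair outcome per run within maxRepairs=2 (findings count vs first attempt).
repair_outcomes = {"decreased": 51, "unchanged": 135, "increased": 26}

# qwen adapter failures by class.
adapter_failures = {"empty output": 3, "truncated JSON": 1}

tallied = sum(repair_outcomes.values())
failed = sum(adapter_failures.values())

# Assumption: runs with an adapter failure carry no repair tally.
accounted = tallied + failed

print(tallied, failed, accounted == TOTAL_RUNS)  # 212 4 True
```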

Evidence

docs/evidence/2026-07-04-v04-effect/ — matrix, results, report, all 216 retained reports, plus decompose-trigger-label.py + decomposition.txt so the 131/73 split is one command to re-verify.
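Independently of the committed script, the published counts can be checked for internal consistency. The sketch below does not run or reproduce decompose-trigger-label.py; it only re-adds the figures quoted in this description, and the variable names are illustrative.

```python
first_attempt_violations = 204

buckets = {
    "genuinely-unprojectable": 131,
    "projectable-but-stricter": 73,
}

# Per-family split of the projectable-but-stricter bucket.
stricter_by_family = {"gemma": 29, "gpt-oss": 31, "qwen": 13}

buckets_sum_ok = sum(buckets.values()) == first_attempt_violations
family_sum_ok = (
    sum(stricter_by_family.values()) == buckets["projectable-but-stricter"]
)

# Share of first-attempt violations that an ∃ reading would not flag.
stricter_share = buckets["projectable-but-stricter"] / first_attempt_violations

print(buckets_sum_ok, family_sum_ok, f"{stricter_share:.1%}")  # True True 35.8%
```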

Review (#18 protocol)

Suggested spot-check set: one projectable-but-stricter report (e.g. any gpt-oss p01 — compare its trigger shape to the same cell in 2026-07-03-gptoss-third-family/), one genuinely-unprojectable qwen report, the p12 probe cells (textless-trigger: now S3 findings, not gate failures), and re-run the decomposition script.

🤖 Generated with Claude Code

…sed a rule-design decision (PR-15)

M3 plan Phase 1 close: same matrix, byte-identical prompts, three local families, v0.4 shadcn contract (2.1.0).

Pre-registered check CONFIRMED at full strength: ZERO emitter-gate failures in 216 runs (v0.3 baseline: 78/78 with the trigger-label signature). The projection gap no longer reaches the emitter; it surfaces pre-emit as named, rationale-carrying S3 findings.

The conversion also measured something the ADR did not predict: of 204 first-attempt trigger-carries-label violations, 131 are the measured gap class (no projectable label anywhere), but 73 are PROJECTABLE surfaces flagged by the rule's for-every-button (∀) semantics: textless sibling buttons, the exact idiom behind all 28 of gpt-oss's v0.3 passes. Decision surfaced to the maintainer, not patched: keep ∀ (own the strictness: an unlabeled button is a real a11y defect) or amend to ∃ ('at least one label-bearing button') before v0.4 semantics freeze.

Cost stated plainly: e2e passes 28→0; no run repaired to clean within maxRepairs=2 (findings decreased 51 / same 135 / increased 26), so repairable in principle, unrepaired in practice at n=2. qwen adapter flakiness elevated (4/72 vs 1/72; 3 empty + 1 truncated, known classes).

Evidence: docs/evidence/2026-07-04-v04-effect/ (216 reports + the decomposition script and output the buckets are checkable against).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Copilot AI review requested due to automatic review settings July 4, 2026 04:04

Copilot started reviewing on behalf of ryandmonk July 4, 2026 04:04

Copilot AI reviewed Jul 4, 2026

ryandmonk merged commit 06b747f into main Jul 4, 2026
3 checks passed

ryandmonk deleted the eval/v04-effect-rerun branch July 4, 2026 04:41

ryandmonk mentioned this pull request Jul 4, 2026

spec: v0.4 draft amendment — textScope, ∃-within, trigger rule re-anchor (ADR-M3-1 amendment) aestheticfunction/dspack#13

Merged

ryandmonk merged 1 commit into main from eval/v04-effect-rerun
