feat: SWE-bench Phase 2 v2 — grounded rerun confirms 8B model-capability ceiling by coredipper · Pull Request #56 · coredipper/operon

coredipper · 2026-04-17T17:43:38Z

Reruns the inconclusive 2026-04-16 SWE-bench Phase 2 experiment using the patch-apply pipeline shipped in #53 (sanitizer + tightened prompts) and #55 (repo grounding + fuzzy path correction + hardened diff parser). Same model (gemma4:latest, digest c6eb396dbd59, 8B/Q4_K_M), same 10 instances, --grounding active.

Result

1/30 reaches harness verdict (django-11001 baseline, unresolved). 29/30 honest empty_patch (sanitizer refused malformed output). Zero error outcomes.

Condition	Unresolved	Empty patch	Evaluated	Mean latency
baseline	1	9	1/10	104 s
organism	0	10	0/10	170 s
langgraph	0	10	0/10	172 s

What this rerun resolves

The 2026-04-16 run confounded two failure modes: (a) model selecting wrong file paths and (b) model writing malformed unified diffs. Grounding solves (a) — every prompt now contains the actual files at base_commit. The fact that 29 of 30 still drop to empty_patch — sanitizer rejecting placeholder hunks, malformed counts, and invented paths — localizes the bottleneck to (b): at 8B/Q4_K_M, diff-format discipline is the binding constraint.
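
To make the "malformed counts" rejection concrete, here is a minimal sketch of one check a diff sanitizer could perform: comparing each hunk body against the line counts in its `@@ -a,b +c,d @@` header. This is not the sanitizer shipped in #53. The helper name `hunk_counts_ok` and its exact behavior are assumptions made for illustration, and it relies only on the standard unified-diff header form.

```python
import re

# Standard unified-diff hunk header: @@ -start[,count] +start[,count] @@
HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def hunk_counts_ok(diff_text: str) -> bool:
    """Return True if every hunk body consumes exactly its header counts.

    Hypothetical helper for illustration; not the project's sanitizer.
    Limitation: extra lines trailing a completed hunk are not detected.
    """
    expected = None  # (old_remaining, new_remaining) for the current hunk
    seen_hunk = False
    for line in diff_text.splitlines():
        m = HUNK_RE.match(line)
        if m:
            if expected is not None and expected != (0, 0):
                return False  # previous hunk ended before its counts ran out
            # An omitted count means 1 in the unified format.
            expected = (int(m.group(1) or 1), int(m.group(2) or 1))
            seen_hunk = True
            continue
        if expected is None or expected == (0, 0):
            continue  # file headers, or text between completed hunks
        if line.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        old_left, new_left = expected
        if line.startswith("+"):
            new_left -= 1
        elif line.startswith("-"):
            old_left -= 1
        else:  # context line: counts toward both sides
            old_left -= 1
            new_left -= 1
        if old_left < 0 or new_left < 0:
            return False  # more lines on one side than the header claims
        expected = (old_left, new_left)
    # A placeholder header such as "@@ ... @@" never matches, so seen_hunk stays False.
    return seen_hunk and expected == (0, 0)
```

A diff whose header claims three lines per side but supplies only two, or whose header is a placeholder like `@@ ... @@`, fails this check, which corresponds to the failure classes named above.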

Side observation: the original organism-vs-baseline empty_patch gap (4 vs 0) closed under the new pipeline. The Phase A "[edit] stage emits a single fenced diff and nothing else" instruction eliminated the multi-stage format leak. The remaining 1-vs-0 gap (baseline produced one applicable patch, organism/langgraph zero) suggests a small discipline tax from juggling stage outputs.

Latency cost of grounding

Condition	Original	Now	Δ
baseline	44 s	104 s	+136%
organism	88 s	170 s	+93%
langgraph	90 s	172 s	+91%

The 30 KB of repository context per prompt roughly doubles per-call wall-clock with no compensating gain in evaluated instances at this model scale. Grounding's cost-benefit changes with stronger models, but at 8B it is overhead.
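
The Δ column can be reproduced from the Original and Now columns. The short sketch below does only that arithmetic; the helper name `pct_increase` is introduced here for illustration and is not part of the repository.

```python
def pct_increase(before_s: float, after_s: float) -> int:
    """Relative increase from before_s to after_s, rounded to a whole percent."""
    return round((after_s - before_s) / before_s * 100)


# (original mean latency, grounded mean latency) in seconds, from the table above.
latencies = {"baseline": (44, 104), "organism": (88, 170), "langgraph": (90, 172)}
deltas = {name: pct_increase(before, after) for name, (before, after) in latencies.items()}
```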

Changes

eval/results/swebench_phase2.json — regenerated by the writer (sanitizer + grounding + harness=ok, model_identity_post_run_check.status=match)
article/paper5/sections/06-experiments.tex — §6.3 retitled "8B Format Discipline Is the Ceiling"; new outcome table; before/after comparison; reframed interpretation; explicit "what this does and does not say"
article/paper5/sections/07-conclusion.tex — limitations item updated to reference the grounded-rerun confirmation

PDF not rebuilt — leaving that to whoever pushes to arXiv.

Test plan

pytest tests/unit/test_swebench_phase2_identity.py -q — 9/9 pass (artifact matches writer's contract).
python -m eval.swebench_phase2 --rewrite-envelope eval/results/swebench_phase2.json — round-trips with post_run_check.status=match.
PDF compile (deferred).

What this is not

Not a Phase C deliverable. Format-correction retry, LLM-localized candidate ranking, and a stronger-model rerun are all open follow-ups.
Not a claim that grounding is useless — at 8B it's overhead, but the same infrastructure should add real value at larger model scales where format discipline isn't the binding constraint.

🤖 Generated with Claude Code

…ity ceiling

Reruns the inconclusive 2026-04-16 SWE-bench Phase 2 experiment using the patch-apply pipeline shipped in PR #53 (sanitizer + tightened prompts) and PR #55 (repo grounding + fuzzy path correction + hardened diff parser). Same model (gemma4:latest, digest c6eb396dbd59, 8B/Q4_K_M), same 10 instances, --grounding active.

Result: 1/30 reaches harness verdict (django-11001 baseline, unresolved). 29/30 honest empty_patch (sanitizer refused malformed output). Zero `error` outcomes.

Old vs new outcome distribution per condition:
- baseline: 1/10 evaluated, 9 error -> 1/10 evaluated, 9 empty_patch
- organism: 1/10 evaluated, 5 error + 4 empty_patch -> 0/10 evaluated, 10 empty_patch
- langgraph: 0/10 evaluated, 6 error + 4 empty_patch -> 0/10 evaluated, 10 empty_patch

The original run confounded two failure modes: model selecting wrong file paths (a) and model writing malformed unified diffs (b). Grounding solves (a). The fact that 29 of 30 still drop to empty_patch (sanitizer rejecting placeholder hunks, malformed counts, and invented paths) localizes the bottleneck to (b). At 8B/Q4_K_M, diff-format discipline is the binding constraint, not file selection.

Side observation: the original organism vs baseline gap (4 vs 0 empty_patch) closed under the new pipeline. The Phase A "[edit] stage emits a single fenced diff and nothing else" instruction eliminated the multi-stage format leak.

Latency cost: grounding ~doubles wall-clock per call (baseline 44s -> 104s; organism 88s -> 170s; langgraph 90s -> 172s) for zero evaluated-rate gain at this model scale.

Files:
- eval/results/swebench_phase2.json: regenerated by the writer (sanitizer + grounding + harness=ok, post_run_check=match)
- article/paper5/sections/06-experiments.tex: §6.3 retitled "8B Format Discipline Is the Ceiling"; new outcome table; before/after comparison; reframed interpretation
- article/paper5/sections/07-conclusion.tex: limitations item updated to reference the grounded-rerun confirmation

9/9 schema tests still pass. PDF rebuild deferred to user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… empty_patch

HIGH eval/swebench_phase2.py + eval/results/swebench_phase2.json: the 2026-04-17 grounded rerun had two baseline rows with latency_ms=0.0. These were silent runtime errors (Ollama API timeouts on astropy-12907 and astropy-14995) that the script's exception handler recorded as an empty model_patch. The harness then categorized them as empty_patch, indistinguishable from sanitizer rejections.

Paper §6.3 therefore overstated:
- "29/30 honest empty_patch": actually 27 sanitizer-rejected + 2 silent API failures.
- "Zero error outcomes": true at the git-apply level, but 2 runtime failures were hidden inside empty_patch.
- "104 s baseline mean latency": the latency sum divided by 10 (two zeros included) gives 104 s, but the same sum divided by the 8 instances that actually completed gives 131 s.

Fix:
- Added EVAL_RUNTIME_ERROR = "runtime_error" status constant.
- Added Prediction.error_reason: str | None field (set in main()'s exception handler with the exception class + truncated message).
- Summary computation overrides the harness's empty_patch -> runtime_error whenever the prediction carries error_reason. The "evaluated" denominator (resolved + unresolved) is unchanged, but the status_counts dict now exposes runtime_error separately.
- mean_latency_ms now divides by `n_completed` (predictions where error_reason is None), not by n. Two new summary fields, n_runtime_errors and n_completed, surface the breakdown directly.
- Per-result writer emits eval_status='runtime_error' (not empty_patch) for these cases and records the error_reason.

Post-processed the committed eval/results/swebench_phase2.json:
- Reclassified the 2 baseline rows with latency_ms=0.0 as runtime_error with error_reason="TimeoutError: openai-compatible request timed out".
- Recomputed all three summaries with the corrected status_counts and the divide-by-n_completed mean_latency_ms.

5 new tests (test_swebench_phase2_identity.py):
* EVAL_RUNTIME_ERROR is distinct from EVAL_EMPTY
* Prediction.error_reason defaults to None
* Prediction.error_reason carries the failure tag when set
* Committed artifact's runtime_error rows have non-null error_reason and latency_ms=0.0 (locks the bug-#747 invariant)
* Committed artifact's summary exposes n_runtime_errors, n_completed, and a 'runtime_error' key in status_counts

Updated paper §6.3 + §7:
* New table columns: Sanitizer-rejected and Runtime error replace the single "Empty patch" column, so baseline 7+2 (was 9), organism 10+0, langgraph 10+0 are all separately legible.
* Mean-latency footnote: "computed over predictions that completed".
* Baseline mean latency corrected from 104 s to 131 s.
* §6.3 prose softened: "27 sanitizer-dropped + 2 runtime errors" instead of the false "29 empty_patch". The model-capability conclusion is unchanged: the 2 runtime-error instances also failed via organism + langgraph (which completed normally), so they don't refute the format-discipline ceiling.
* §7 limitations updated to match.

106/106 tests pass across all touched suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
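
The divide-by-n_completed change can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in eval/swebench_phase2.py: the reduced `Prediction` class, the `summarize_latency` helper, and the 0.0 fallback when nothing completed are hypothetical. The field names error_reason, latency_ms, n_completed, n_runtime_errors and mean_latency_ms come from the commit message. The per-instance latencies are uniform placeholders chosen so the totals land near the reported 104 s and 131 s figures; they are not the measured values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    # Reduced to the two fields this sketch needs (assumption: the real class has more).
    latency_ms: float
    error_reason: Optional[str] = None  # set when the model call raised


def summarize_latency(preds: list) -> dict:
    """Average latency over predictions that completed, excluding runtime errors."""
    completed = [p for p in preds if p.error_reason is None]
    n_completed = len(completed)
    # Fallback of 0.0 when nothing completed is an assumption of this sketch.
    mean = sum(p.latency_ms for p in completed) / n_completed if n_completed else 0.0
    return {
        "n_completed": n_completed,
        "n_runtime_errors": len(preds) - n_completed,
        "mean_latency_ms": mean,
    }


# Placeholder data: 8 completed calls and 2 timeouts recorded with latency_ms=0.0.
timeout = "TimeoutError: openai-compatible request timed out"
preds = [Prediction(130_550.0) for _ in range(8)] + [Prediction(0.0, timeout) for _ in range(2)]

summary = summarize_latency(preds)
naive_mean_ms = sum(p.latency_ms for p in preds) / len(preds)  # old behavior: divide by n
```

With these placeholder values the naive mean is about 104 s while the completed-only mean is about 131 s, which is the size of the distortion the two zero-latency rows introduced.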

… pin runtime-error rows

MEDIUM eval/swebench_phase2.py: the previous reclassification only fired when harness_status == empty_patch. With --skip-harness, with a missing/failed harness report, or when the harness happens to mark a phantom 'error', a Prediction.error_reason set by main()'s exception handler would be silently lost: per-result eval_status reverts to not_evaluated (or whatever the harness emitted), n_runtime_errors drops to 0, and the failure mode this whole change exists to preserve disappears from the artifact.

Fix: extracted classify_prediction(error_reason, harness_status) as a small pure function. error_reason is always authoritative: if the model call raised, the result is runtime_error regardless of what the harness said. Both the summary loop and the per-result writer now delegate to classify_prediction, so there's a single contract that can't drift between the two emission sites.

LOW tests/unit/test_swebench_phase2_identity.py: the prior "runtime_error rows are well-formed" test passed vacuously when zero rows existed. A silent regeneration that lost the reclassification would slip through.

Fix: pinned the count and identities of expected runtime errors. The 2026-04-17 grounded rerun's two baseline timeouts (astropy-12907 and astropy-14995) are now asserted explicitly; both must remain condition='baseline' with non-null error_reason and latency_ms=0.0. The summary's n_runtime_errors and status_counts['runtime_error'] must equal 2 for baseline, 0 for organism, 0 for langgraph.

5 new classify_prediction unit tests covering the override matrix:
* error_reason wins over empty_patch (review #747 case)
* error_reason wins over not_evaluated (review #748 case A: --skip-harness)
* error_reason wins over harness 'error' (review #748 case B: a phantom git-apply error on a patch that wouldn't exist if the model call had raised)
* no error_reason -> harness status passes through (resolved, unresolved, empty_patch, error, not_evaluated)
* defensive: empty harness_status defaults to not_evaluated rather than leaking the empty value

111/111 tests pass across all touched suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
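
The contract described for classify_prediction can be sketched in a few lines. The signature classify_prediction(error_reason, harness_status) and the EVAL_RUNTIME_ERROR constant are taken from the commit messages; the function body and the constant name `EVAL_NOT_EVALUATED` are assumptions reconstructed from the described override matrix, not the committed implementation.

```python
from typing import Optional

EVAL_RUNTIME_ERROR = "runtime_error"
EVAL_NOT_EVALUATED = "not_evaluated"  # constant name assumed for this sketch


def classify_prediction(error_reason: Optional[str], harness_status: Optional[str]) -> str:
    """Map a prediction to its final eval status.

    error_reason is authoritative: if the model call raised, the result is
    runtime_error no matter what the harness reported.
    """
    if error_reason is not None:
        return EVAL_RUNTIME_ERROR
    # A missing or empty harness status falls back to not_evaluated
    # instead of leaking the empty value into the artifact.
    return harness_status or EVAL_NOT_EVALUATED
```

Keeping this as a pure function means the summary loop and the per-result writer can share one rule, and the override matrix can be tested without running the harness.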

…update)

Captures the work landed on this branch:
- PR #53 (already merged): patch sanitizer + tightened prompts (Phase A)
- PR #55 (already merged): repo grounding + fuzzy path correction + hardened diff parser (Phase B)
- PR #56 (this): grounded rerun artifact + paper §6.3 reframed as "8B format-discipline ceiling"; runtime_error vs sanitizer-rejected classification (reviews #747, #748)

Bumps:
- pyproject.toml 0.34.4 → 0.34.5
- operon_ai/__init__.py 0.34.4 → 0.34.5
- README.md badge v0.34.4 → v0.34.5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The .tex source for §6.3 ("8B Format Discipline Is the Ceiling") and §7 limitations was updated in PR #56 (merged on main as 18579b5), but the PDF wasn't rebuilt at the time. This commit runs `tectonic article/paper5/main.tex` so the published PDF matches the post-v0.34.5 narrative — the new outcome-distribution table (with separate sanitizer-rejected and runtime-error columns), the corrected baseline mean latency (131s, was 104s), and the reframed conclusion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coredipper and others added 3 commits April 17, 2026 19:43

coredipper force-pushed the feat/swebench-phase2-v2-grounded-rerun branch from 1192861 to c3a2f55 Compare April 17, 2026 18:14

coredipper merged commit 18579b5 into main Apr 17, 2026
4 checks passed

coredipper deleted the feat/swebench-phase2-v2-grounded-rerun branch April 22, 2026 14:49

coredipper merged 4 commits into main from feat/swebench-phase2-v2-grounded-rerun
