feat: SWE-bench Phase 2 v2 — grounded rerun confirms 8B model-capability ceiling #56
Merged
…ity ceiling

Reruns the inconclusive 2026-04-16 SWE-bench Phase 2 experiment using the patch-apply pipeline shipped in PR #53 (sanitizer + tightened prompts) and PR #55 (repo grounding + fuzzy path correction + hardened diff parser). Same model (gemma4:latest, digest c6eb396dbd59, 8B/Q4_K_M), same 10 instances, --grounding active.

Result: 1/30 reaches harness verdict (django-11001 baseline, unresolved). 29/30 honest empty_patch (sanitizer refused malformed output). Zero `error` outcomes.

Old vs new outcome distribution per condition:
  baseline   1/10 evaluated, 9 error                  -> 1/10 evaluated, 9 empty_patch
  organism   1/10 evaluated, 5 error + 4 empty_patch  -> 0/10 evaluated, 10 empty_patch
  langgraph  0/10 evaluated, 6 error + 4 empty_patch  -> 0/10 evaluated, 10 empty_patch

The original run confounded two failure modes: (a) the model selecting wrong file paths and (b) the model writing malformed unified diffs. Grounding solves (a). The fact that 29 of 30 still drop to empty_patch (sanitizer rejecting placeholder hunks, malformed counts, and invented paths) localizes the bottleneck to (b). At 8B/Q4_K_M, diff-format discipline is the binding constraint, not file selection.

Side observation: the original organism vs baseline gap (4 vs 0 empty_patch) closed under the new pipeline. The Phase A "[edit] stage emits a single fenced diff and nothing else" instruction eliminated the multi-stage format leak.

Latency cost: grounding ~doubles wall-clock per call (baseline 44s -> 104s; organism 88s -> 170s; langgraph 90s -> 172s) for zero evaluated-rate gain at this model scale.

Files:
- eval/results/swebench_phase2.json: regenerated by the writer (sanitizer + grounding + harness=ok, post_run_check=match)
- article/paper5/sections/06-experiments.tex: §6.3 retitled "8B Format Discipline Is the Ceiling"; new outcome table; before/after comparison; reframed interpretation
- article/paper5/sections/07-conclusion.tex: limitations item updated to reference the grounded-rerun confirmation

9/9 schema tests still pass. PDF rebuild deferred to user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… empty_patch
HIGH eval/swebench_phase2.py + eval/results/swebench_phase2.json — the
2026-04-17 grounded rerun had two baseline rows with latency_ms=0.0:
silent runtime errors (Ollama API timeouts on astropy-12907 and
astropy-14995) that the script's exception handler recorded as
empty model_patch. The harness then categorized them as empty_patch,
indistinguishable from sanitizer-rejections. Paper §6.3 therefore
overstated:
- "29/30 honest empty_patch" — actually 27 sanitizer-rejected + 2
silent API failures.
- "Zero error outcomes" — true at the git-apply level, but 2 runtime
failures were hidden inside empty_patch.
- "104 s baseline mean latency" — sum/10 with two zeros = 104 s, but
sum over the 8 instances that actually completed = 131 s.
Fix:
- Added EVAL_RUNTIME_ERROR = "runtime_error" status constant.
- Added Prediction.error_reason: str | None field (set in main()'s
exception handler with the exception class + truncated message).
- Summary computation overrides harness's empty_patch -> runtime_error
whenever the prediction carries error_reason. The "evaluated"
denominator (resolved + unresolved) is unchanged, but the
status_counts dict now exposes runtime_error separately.
- mean_latency_ms now divides by `n_completed` (predictions where
error_reason is None), not by n. Two new summary fields
n_runtime_errors and n_completed surface the breakdown directly.
- Per-result writer emits eval_status='runtime_error' (not
empty_patch) for these cases and records the error_reason.
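
A minimal sketch of the fix described above, under stated assumptions: only `EVAL_RUNTIME_ERROR`, `Prediction.error_reason`, `n_runtime_errors`, `n_completed`, `status_counts`, and `mean_latency_ms` come from this commit message. The `summarize` helper, the other `Prediction` fields, and the shape of the harness mapping are hypothetical and may not match eval/swebench_phase2.py.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass

EVAL_EMPTY = "empty_patch"
EVAL_RUNTIME_ERROR = "runtime_error"


@dataclass
class Prediction:
    # Field names other than error_reason are assumptions for illustration.
    instance_id: str
    model_patch: str
    latency_ms: float
    error_reason: str | None = None  # set only when the model call raised


def summarize(predictions: list[Prediction], harness_status: dict[str, str]) -> dict:
    """Hypothetical summary helper: separates runtime errors from empty patches."""
    counts: Counter[str] = Counter()
    for p in predictions:
        status = harness_status.get(p.instance_id, "not_evaluated")
        # Override described in this commit: empty_patch -> runtime_error
        # whenever the prediction carries an error_reason.
        if p.error_reason is not None and status == EVAL_EMPTY:
            status = EVAL_RUNTIME_ERROR
        counts[status] += 1

    # Mean latency is taken over completed predictions only, so failed calls
    # recorded with latency_ms=0.0 no longer drag the average down.
    completed = [p for p in predictions if p.error_reason is None]
    mean = sum(p.latency_ms for p in completed) / len(completed) if completed else None
    return {
        "status_counts": dict(counts),
        "n_runtime_errors": len(predictions) - len(completed),
        "n_completed": len(completed),
        "mean_latency_ms": mean,
    }
```

With toy inputs of two completed predictions at 100 ms and 200 ms plus one failed call at 0 ms, this reports a mean of 150 ms over `n_completed=2` rather than 100 ms over all three.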
Post-processed the committed eval/results/swebench_phase2.json:
- Reclassified the 2 baseline rows with latency_ms=0.0 as
runtime_error with error_reason="TimeoutError: openai-compatible
request timed out".
- Recomputed all three summaries with the corrected status_counts and
the divide-by-n_completed mean_latency_ms.
5 new tests (test_swebench_phase2_identity.py):
* EVAL_RUNTIME_ERROR is distinct from EVAL_EMPTY
* Prediction.error_reason defaults to None
* Prediction.error_reason carries the failure tag when set
* Committed artifact's runtime_error rows have non-null error_reason
and latency_ms=0.0 (locks the bug-#747 invariant)
* Committed artifact's summary exposes n_runtime_errors,
n_completed, and a 'runtime_error' key in status_counts
Updated paper §6.3 + §7:
* New table column: Sanitizer-rejected + Runtime error replace the
single "Empty patch" column, so baseline 7+2 (was 9), organism
10+0, langgraph 10+0 are all separately legible.
* Mean-latency footnote: "computed over predictions that completed".
* Baseline mean latency corrected from 104 s to 131 s.
* §6.3 prose softened: "27 sanitizer-dropped + 2 runtime errors"
instead of the false "29 empty_patch". The model-capability
conclusion is unchanged — the 2 runtime-error instances also
failed via organism + langgraph (which completed normally), so
they don't refute the format-discipline ceiling.
* §7 limitations updated to match.
106/106 tests pass across all touched suites.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… pin runtime-error rows
MEDIUM eval/swebench_phase2.py — the previous reclassification only
fired when harness_status == empty_patch. With --skip-harness, with a
missing/failed harness report, or when the harness happens to mark a
phantom 'error', a Prediction.error_reason set by main()'s exception
handler would be silently lost: per-result eval_status reverts to
not_evaluated (or whatever the harness emitted), n_runtime_errors
drops to 0, and the failure mode this whole change exists to preserve
disappears from the artifact.
Fix: extracted classify_prediction(error_reason, harness_status) as a
small pure function. error_reason is always authoritative — if the
model call raised, the result is runtime_error regardless of what the
harness said. Both the summary loop and the per-result writer now
delegate to classify_prediction, so there's a single contract that
can't drift between the two emission sites.
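
The contract above can be sketched as a small pure function. The name `classify_prediction` and its two parameters come from this commit message; the constant `EVAL_NOT_EVALUATED` and the exact body are assumptions, not the repository's code.

```python
from __future__ import annotations

EVAL_RUNTIME_ERROR = "runtime_error"
EVAL_NOT_EVALUATED = "not_evaluated"  # assumed constant name


def classify_prediction(error_reason: str | None, harness_status: str | None) -> str:
    """Return the final eval status for one prediction.

    error_reason is authoritative: if the model call raised, the result is
    runtime_error no matter what the harness reported.
    """
    if error_reason is not None:
        return EVAL_RUNTIME_ERROR
    # No runtime failure: pass the harness status through, falling back to
    # not_evaluated when the harness gave nothing (None or empty string).
    return harness_status or EVAL_NOT_EVALUATED
```

Because the function has no side effects and no I/O, both the summary loop and the per-result writer can call it and are guaranteed to agree on the status of any given prediction.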
LOW tests/unit/test_swebench_phase2_identity.py — the prior
"runtime_error rows are well-formed" test passed vacuously when zero
rows existed. A silent regeneration that lost the reclassification
would slip through.
Fix: pinned the count and identities of expected runtime errors. The
2026-04-17 grounded rerun's two baseline timeouts (astropy-12907 and
astropy-14995) are now asserted explicitly; both must remain
condition='baseline' with non-null error_reason and latency_ms=0.0.
The summary's n_runtime_errors and status_counts['runtime_error']
must equal 2 for baseline, 0 for organism, 0 for langgraph.
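
A rough sketch of what pinning the expected rows might look like. The two instance names are taken from this commit message; the row keys (`instance_id`, `condition`, `eval_status`, `error_reason`, `latency_ms`) and the helper `check_pinned_runtime_errors` are assumptions about the artifact schema, not the actual test code.

```python
# Instance names as written in the commit message above.
EXPECTED_RUNTIME_ERRORS = {"astropy-12907", "astropy-14995"}


def check_pinned_runtime_errors(results: list[dict]) -> list[str]:
    """Return a list of violations; an empty list means the invariant holds."""
    rows = [r for r in results if r.get("eval_status") == "runtime_error"]
    found = {r.get("instance_id") for r in rows}
    problems = []

    # Pin both the count and the identities, so an artifact with zero
    # runtime_error rows fails instead of passing vacuously.
    if len(rows) != len(EXPECTED_RUNTIME_ERRORS) or found != EXPECTED_RUNTIME_ERRORS:
        problems.append(f"unexpected runtime_error rows: {sorted(map(str, found))}")

    for r in rows:
        rid = r.get("instance_id")
        if r.get("condition") != "baseline":
            problems.append(f"{rid}: condition is not 'baseline'")
        if r.get("error_reason") is None:
            problems.append(f"{rid}: error_reason is null")
        if r.get("latency_ms") != 0.0:
            problems.append(f"{rid}: latency_ms is not 0.0")
    return problems
```

Returning a list of violations rather than asserting inline keeps the check reusable; a test can simply assert that the list is empty and print it on failure.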
5 new classify_prediction unit tests covering the override matrix:
* error_reason wins over empty_patch (review #747 case)
* error_reason wins over not_evaluated (review #748 case A:
--skip-harness)
* error_reason wins over harness 'error' (review #748 case B: a
phantom git-apply error on a patch that wouldn't exist if the
model call had raised)
* no error_reason -> harness status passes through (resolved,
unresolved, empty_patch, error, not_evaluated)
* defensive: empty harness_status defaults to not_evaluated rather
than leaking the empty value
111/111 tests pass across all touched suites.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coredipper added a commit that referenced this pull request Apr 17, 2026
…update)

Captures the work landed on this branch:
- PR #53 (already merged): patch sanitizer + tightened prompts (Phase A)
- PR #55 (already merged): repo grounding + fuzzy path correction + hardened diff parser (Phase B)
- PR #56 (this): grounded rerun artifact + paper §6.3 reframed as "8B format-discipline ceiling"; runtime_error vs sanitizer-rejected classification (reviews #747, #748)

Bumps:
- pyproject.toml 0.34.4 → 0.34.5
- operon_ai/__init__.py 0.34.4 → 0.34.5
- README.md badge v0.34.4 → v0.34.5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1192861 to c3a2f55
coredipper added a commit that referenced this pull request Apr 17, 2026
The .tex source for §6.3 ("8B Format Discipline Is the Ceiling") and
§7 limitations was updated in PR #56 (merged on main as 18579b5), but
the PDF wasn't rebuilt at the time. This commit runs `tectonic
article/paper5/main.tex` so the published PDF matches the post-v0.34.5
narrative — the new outcome-distribution table (with separate
sanitizer-rejected and runtime-error columns), the corrected baseline
mean latency (131s, was 104s), and the reframed conclusion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reruns the inconclusive 2026-04-16 SWE-bench Phase 2 experiment using the patch-apply pipeline shipped in #53 (sanitizer + tightened prompts) and #55 (repo grounding + fuzzy path correction + hardened diff parser). Same model (`gemma4:latest`, digest `c6eb396dbd59`, 8B/Q4_K_M), same 10 instances, `--grounding` active.

Result

1/30 reaches harness verdict (django-11001 baseline, `unresolved`). 29/30 honest `empty_patch` (sanitizer refused malformed output). Zero `error` outcomes.

What this rerun resolves

The 2026-04-16 run confounded two failure modes: (a) model selecting wrong file paths and (b) model writing malformed unified diffs. Grounding solves (a): every prompt now contains the actual files at `base_commit`. The fact that 29 of 30 still drop to `empty_patch` (sanitizer rejecting placeholder hunks, malformed counts, and invented paths) localizes the bottleneck to (b): at 8B/Q4_K_M, diff-format discipline is the binding constraint.

Side observation: the original organism-vs-baseline `empty_patch` gap (4 vs 0) closed under the new pipeline. The Phase A "`[edit]` stage emits a single fenced diff and nothing else" instruction eliminated the multi-stage format leak. The remaining 1-vs-0 gap (baseline produced one applicable patch, organism/langgraph zero) suggests a small discipline tax from juggling stage outputs.

Latency cost of grounding

The 30 KB of repository context per prompt roughly doubles per-call wall-clock with no compensating gain in evaluated instances at this model scale. Grounding's cost-benefit changes with stronger models, but at 8B it is overhead.

Changes

- `eval/results/swebench_phase2.json`: regenerated by the writer (sanitizer + grounding + `harness=ok`, `model_identity_post_run_check.status=match`)
- `article/paper5/sections/06-experiments.tex`: §6.3 retitled "8B Format Discipline Is the Ceiling"; new outcome table; before/after comparison; reframed interpretation; explicit "what this does and does not say"
- `article/paper5/sections/07-conclusion.tex`: limitations item updated to reference the grounded-rerun confirmation

PDF not rebuilt; leaving that to whoever pushes to arXiv.

Test plan

- `pytest tests/unit/test_swebench_phase2_identity.py -q`: 9/9 pass (artifact matches writer's contract).
- `python -m eval.swebench_phase2 --rewrite-envelope eval/results/swebench_phase2.json`: round-trips with `post_run_check.status=match`.

What this is not
🤖 Generated with Claude Code