feat: Phase C — format-correction retry + per-model output paths#58
Merged
Conversation
Adds the format-correction retry wedge from the v0.34.5 blog post's "what's next" list. When the sanitizer rejects a model's diff, the runner can optionally re-prompt the model once with the specific rejection reason and try again. Sanitizer (eval/_patch_sanitizer.py): - New sanitize_with_reason(patch, repo_slug, *, tree_paths) returning (cleaned_patch, reason) tuple. On success reason is "". On failure, reason is a machine-readable code from SANITIZE_REASONS: placeholder_hunk / truncated_hunk / overlong_hunk / malformed_metadata / path_not_found / ambiguous_path / empty_extraction. - sanitize() kept as backward-compat wrapper returning just the patch. - Internal helpers _fuzzy_correct_paths and _validate_hunks refactored to _with_reason variants that propagate the rejection reason up; the legacy names are kept as thin ok/empty wrappers so existing callers don't break. Runner (eval/swebench_phase2.py): - _sanitize_for_submission(raw, repo_slug, tree_paths, retry_callback=None) now optionally invokes retry_callback(reason, failed_output) -> str on rejection. The callback returns a fresh raw response; up to _FORMAT_RETRY_MAX (=1) retries are attempted. When the callback is None (the v0.34.5-era default), behavior is unchanged. - _build_retry_prompt(original_prompt, reason, failed_output) builds a targeted retry prompt with reason-specific guidance (placeholder_hunk says "use real line numbers"; path_not_found says "use exact paths from the repo context"; etc.). Embeds the failed output verbatim so the model corrects rather than regenerates. - run_baseline / run_organism / run_langgraph accept retry_on_reject: bool = False. When True, each constructs a runner-appropriate callback (direct _llm_call for baseline, whole-organism re-run for organism/langgraph) that passes the reason into _build_retry_prompt. CLI: - New --retry-on-reject flag (default off) gates retry behavior. - New --output PATH flag lets Phase C reruns preserve separate artifacts per model / per config without overwriting the v0.34.5 baseline artifact. Defaults to eval/results/swebench_phase2.json so existing callers are unchanged. Tests: - 12 new reason-code tests in test_patch_sanitizer.py exercising every code in SANITIZE_REASONS (including empty_extraction for headers-only and prose-only inputs, placeholder_hunk for XXX/N/?, truncated/overlong distinguishing, path_not_found / ambiguous_path via the tree oracle). - 8 new tests in test_swebench_phase2_retry.py covering _FORMAT_RETRY_MAX contract, no-callback-is-old-behavior, callback-fires-on-reject, callback-not-fired-on-success, retry-also-rejects-returns-empty, retry-cap-enforced, and _build_retry_prompt embedding reason+failed output with reason-specific guidance. 139/139 tests pass across sanitizer, retry, timeout, identity, extraction, repo_cache, repo_grounding suites (+20 new since v0.34.5). All TDD RED→GREEN: each new test was verified to fail before the corresponding implementation landed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…long bare-empty misclassification
MEDIUM eval/swebench_phase2.py:954-958, 801-856 — --output was
silently ignored in --rewrite-envelope mode because main() returned
before the new output-path logic ran, and _rewrite_envelope always
wrote back to the input path. A caller trying to preserve the source
artifact while rewriting the envelope could still overwrite it.
Fix: added output_path parameter to _rewrite_envelope (defaults None =
in-place, matching v0.34.5 behavior). main() threads args.output
through when set. The source artifact is untouched when output is
explicit, and the new log line makes the from/to pair visible.
LOW eval/_patch_sanitizer.py:740-746 — overlong hunks whose first
extra body line is a bare empty context line were misclassified as
truncated_hunk. The check for body-shaped post-count content only
recognized "+", "-", " " prefixes but bare empty "" is also body-
shaped (it's the whitespace-stripped-blank-context case the repair
pass handles). The retry prompt would say "include every line" when
the model actually needed to trim extras — a directional error.
Root cause was deeper than just the classification: both
_normalize_paths and _repair_bare_empty_context use splitlines()
followed by "\\n".join(), which silently drops trailing empty lines
from the representation. So even with the reason-code fix, the
validator never saw the offending empty line because normalize had
already eaten it.
Fix (two layers):
1. Moved _validate_hunks_with_reason to run on the pristine input,
BEFORE any string-round-tripping. Validation is invariant under
path normalization and bare-empty repair (neither changes hunk
structure), so running it first is safe and produces accurate
reason codes regardless of downstream round-trip losses.
2. Added "" to the body-shape check in _validate_hunks_with_reason
so bare-empty overlong content is classified as overlong_hunk
rather than truncated_hunk, per the reviewer's direct suggestion.
3 new tests:
* test_sanitize_with_reason_overlong_hunk_bare_empty_extra (LOW)
* test_rewrite_envelope_writes_to_output_path_when_provided (MEDIUM)
* test_rewrite_envelope_defaults_to_input_path_when_output_omitted
(backward-compat regression check)
142/142 tests pass across sanitizer + retry + timeout + identity +
extraction + repo_cache + repo_grounding (was 139, +3 new).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coredipper
added a commit
that referenced
this pull request
Apr 18, 2026
Ships the Phase C experiment described in the v0.34.5 blog's "what's next" list: a locally-runnable cross-model check of the 8B format- discipline ceiling claim, with format-correction retry active. Artifact: eval/results/swebench_phase2_deepseek_retry.json - model: deepseek-r1:8b (DeepSeek-R1 distilled onto Qwen3-8B), digest 6995872bfe4c, 8.2B params, Q4_K_M, 131k context - Same 10 SWE-bench-lite instances as v0.34.5, same grounded pipeline - --retry-on-reject active (format-correction retry) - post_run_check.status = match - harness=ok across all three conditions - Total wall-clock: ~22 hours (reasoning models are slow) Outcome (all three conditions, 10 instances each): resolved=0, unresolved=0, sanitizer-rejected=10, runtime_error=0 mean_latency: baseline 1084s, organism 1073s, langgraph 1130s retry-recovered patches: 0 across all 30 submissions Two load-bearing findings for Paper 5 §6.3 Phase C: 1. Cross-model ceiling is sharp. Both gemma4 (v0.34.5) and deepseek-r1:8b (v0.35.0) produce zero resolved instances. Gemma4 crossed the sanitizer once (django-11001 baseline -> unresolved); deepseek-r1 crossed it zero times. The single v0.34.5 "survivor" was gemma4-specific, not a structurally easy issue. 2. Retry-with-reason-code did not help. The retry mechanism fires correctly (we observed overlong_hunk and truncated_hunk reason codes in the live logs), retry prompts embed the reason + failed output verbatim with reason-specific guidance, but zero "retry recovered" events logged across 30 submissions. At 8B, diff-format failure is not mistake-that-can-be-corrected but cannot-produce-format-at-all. Retry helps occasional misformatters, not capability-ceiling models. Sharper negative result than the v0.34.5 paper predicted (we suggested retry "might recover 20-40% of sanitizer drops"); at 8B it recovers zero. Paper 5 §6.3 adds new subsubsection "Phase C: Cross-Model Check with Format-Correction Retry" with: - Setup (deepseek-r1:8b identity details) - Per-condition outcome table - Comparison table vs gemma4 v0.34.5 - What Phase C does and does not say - Next wedge: 70B+ via cloud GPU, outside local-only scope Paper 5 §7 Limitations updated: "SWE-bench-lite ceiling at 8B, confirmed cross-model". Paper PDF rebuilt via tectonic (114KB -> 121KB; the Phase C content is the delta). Infrastructure shipped in this release (already on main from PRs #57, #58, #57's #753 and #58's #755 follow-ups): - 900s LLM timeout + 60s probe timeout split (PR #57) - sanitize_with_reason() returning (patch, reason_code) tuple with 8 machine-readable reason codes (PR #58) - _FORMAT_RETRY_MAX = 1 retry callback plumbing in all three runners (PR #58) - EVAL_RUNTIME_ERROR status distinct from sanitizer-rejected empty_patch (carried over from v0.34.5) - classify_prediction() as authoritative override (carried over) - --retry-on-reject and --output CLI flags (PR #58) - --output honored under --rewrite-envelope (PR #58's #755 fix) - overlong-hunk classification handling of bare empty extra (same) Versions bumped: - pyproject.toml 0.34.5 -> 0.35.0 - operon_ai/__init__.py 0.34.5 -> 0.35.0 - README.md badge v0.34.5 -> v0.35.0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
First half of Phase C per the approved plan: format-correction retry infrastructure and the CLI plumbing for preserving per-model artifacts. Reruns land in a follow-up commit on this branch once benchmark data is in.
Problem
v0.34.5's grounded Phase 2 run had 27 of 30 submissions sanitizer-rejected with diff-format issues the sanitizer can identify (placeholder hunks, truncated bodies, invented paths) but the model never got a chance to correct. The v0.34.5 blog post named format-correction retry as the #2 wedge for moving the result; this PR implements it.
Changes
Sanitizer — reason-code surfaces
eval/_patch_sanitizer.pygets a parallel tuple-return API:The old
sanitize()is kept as a backward-compat wrapper so every existing test and caller continues to work unchanged. Internal helpers (_fuzzy_correct_paths,_validate_hunks) refactored to propagate reason codes.Runner — retry callback plumbing
_build_retry_prompt(original_prompt, reason, failed_output)builds a targeted retry prompt with reason-specific guidance (placeholder_hunk → "use real line numbers"; path_not_found → "use exact paths from the repo context"; etc.). The failed output is embedded verbatim so the model corrects rather than regenerates._FORMAT_RETRY_MAX = 1retries per instance. The constant exists so a future experiment can tune it without diff churn.run_baseline,run_organism,run_langgraph) acceptretry_on_reject: bool = False. When True, each constructs a runner-appropriate callback — direct_llm_callfor baseline; whole-organism re-run for the multi-stage conditions.CLI
Two new flags, both opt-in so v0.34.5 behavior is preserved:
--retry-on-reject--output PATHeval/results/swebench_phase2.jsonso Phase C reruns can be preserved per-model without overwriting the v0.34.5 baseline.Tests (20 new, all TDD RED→GREEN)
Sanitizer reason codes —
tests/unit/test_patch_sanitizer.py(+12):""reason.empty_extraction.@@ -XXX,N +XXX,N @@→placeholder_hunk(XXX and single-letter variants).truncated_hunk.overlong_hunk.path_not_found.ambiguous_path.sanitize()wrapper returnssanitize_with_reason()[0](backward-compat contract).Retry plumbing —
tests/unit/test_swebench_phase2_retry.py(+8):_FORMAT_RETRY_MAX == 1(diagnostic default contract).""(no infinite loop, no fallback to original)._build_retry_promptembeds reason + failed output.Test plan
pytest tests/unit/test_patch_sanitizer.py tests/unit/test_swebench_phase2_retry.py tests/unit/test_swebench_phase2_timeout.py tests/unit/test_swebench_phase2_identity.py tests/unit/test_patch_extraction.py tests/unit/test_repo_cache.py tests/unit/test_repo_grounding.py -q— 139/139 pass (+20 from v0.34.5's 119).python -m eval.swebench_phase2 --help— both new flags render cleanly.--model gemma4:latest --grounding --retry-on-reject --n 10 --output eval/results/swebench_phase2_gemma4_retry.json— follow-up commit on this branch once the 2-3h benchmark completes.Follow-up on this branch (not in this commit)
Once this is merged (or before, if you prefer squash): rerun artifacts with the new flag against gemma4 and deepseek-r1:8b (now unblocked by PR #57's 900s timeout), Paper 5 §6.3 Phase C subsection, v0.35.0 version bump, blog follow-up, release.
🤖 Generated with Claude Code