feat: Phase C — format-correction retry + per-model output paths by coredipper · Pull Request #58 · coredipper/operon

coredipper · 2026-04-17T21:24:38Z

First half of Phase C per the approved plan: format-correction retry infrastructure and the CLI plumbing for preserving per-model artifacts. Reruns land in a follow-up commit on this branch once benchmark data is in.

Problem

v0.34.5's grounded Phase 2 run had 27 of 30 submissions sanitizer-rejected with diff-format issues the sanitizer can identify (placeholder hunks, truncated bodies, invented paths) but the model never got a chance to correct. The v0.34.5 blog post named format-correction retry as the #2 wedge for moving the result; this PR implements it.

Changes

Sanitizer — reason-code surfaces

eval/_patch_sanitizer.py gets a parallel tuple-return API:

patch, reason = sanitize_with_reason(raw_patch, "django/django", tree_paths=tree)
# reason ∈ {"", "placeholder_hunk", "truncated_hunk", "overlong_hunk",
#          "malformed_metadata", "path_not_found", "ambiguous_path",
#          "empty_extraction"}

The old sanitize() is kept as a backward-compat wrapper so every existing test and caller continues to work unchanged. Internal helpers (_fuzzy_correct_paths, _validate_hunks) refactored to propagate reason codes.

Runner — retry callback plumbing

patch = _sanitize_for_submission(
    raw, repo_slug, tree_paths,
    retry_callback=lambda reason, failed: _llm_call(provider, retry_prompt(reason, failed)),
)

_build_retry_prompt(original_prompt, reason, failed_output) builds a targeted retry prompt with reason-specific guidance (placeholder_hunk → "use real line numbers"; path_not_found → "use exact paths from the repo context"; etc.). The failed output is embedded verbatim so the model corrects rather than regenerates.
Up to _FORMAT_RETRY_MAX = 1 retries per instance. The constant exists so a future experiment can tune it without diff churn.
All three runners (run_baseline, run_organism, run_langgraph) accept retry_on_reject: bool = False. When True, each constructs a runner-appropriate callback — direct _llm_call for baseline; whole-organism re-run for the multi-stage conditions.

CLI

Two new flags, both opt-in so v0.34.5 behavior is preserved:

Flag	Purpose
`--retry-on-reject`	Enable the retry callback. Off by default.
`--output PATH`	Write the artifact somewhere other than `eval/results/swebench_phase2.json` so Phase C reruns can be preserved per-model without overwriting the v0.34.5 baseline.

Tests (20 new, all TDD RED→GREEN)

Sanitizer reason codes — tests/unit/test_patch_sanitizer.py (+12):

Happy path returns "" reason.
Empty input / whitespace / headers-only / prose-only → empty_extraction.
@@ -XXX,N +XXX,N @@ → placeholder_hunk (XXX and single-letter variants).
Declared counts > body → truncated_hunk.
Body extra beyond counts → overlong_hunk.
Invented path with no basename match in tree oracle → path_not_found.
Basename matches multiple files → ambiguous_path.
sanitize() wrapper returns sanitize_with_reason()[0] (backward-compat contract).

Retry plumbing — tests/unit/test_swebench_phase2_retry.py (+8):

_FORMAT_RETRY_MAX == 1 (diagnostic default contract).
No callback = old behavior.
Callback fires on reject with reason + failed output.
Callback does NOT fire on success.
Retry also rejects → "" (no infinite loop, no fallback to original).
Retry cap enforced (chronically failing model can't trigger unbounded retries).
_build_retry_prompt embeds reason + failed output.
Reason-specific guidance differs between codes.

Test plan

pytest tests/unit/test_patch_sanitizer.py tests/unit/test_swebench_phase2_retry.py tests/unit/test_swebench_phase2_timeout.py tests/unit/test_swebench_phase2_identity.py tests/unit/test_patch_extraction.py tests/unit/test_repo_cache.py tests/unit/test_repo_grounding.py -q — 139/139 pass (+20 from v0.34.5's 119).
python -m eval.swebench_phase2 --help — both new flags render cleanly.
Live smoke on --model gemma4:latest --grounding --retry-on-reject --n 10 --output eval/results/swebench_phase2_gemma4_retry.json — follow-up commit on this branch once the 2-3h benchmark completes.

Follow-up on this branch (not in this commit)

Once this is merged (or before, if you prefer squash): rerun artifacts with the new flag against gemma4 and deepseek-r1:8b (now unblocked by PR #57's 900s timeout), Paper 5 §6.3 Phase C subsection, v0.35.0 version bump, blog follow-up, release.

🤖 Generated with Claude Code

Adds the format-correction retry wedge from the v0.34.5 blog post's "what's next" list. When the sanitizer rejects a model's diff, the runner can optionally re-prompt the model once with the specific rejection reason and try again. Sanitizer (eval/_patch_sanitizer.py): - New sanitize_with_reason(patch, repo_slug, *, tree_paths) returning (cleaned_patch, reason) tuple. On success reason is "". On failure, reason is a machine-readable code from SANITIZE_REASONS: placeholder_hunk / truncated_hunk / overlong_hunk / malformed_metadata / path_not_found / ambiguous_path / empty_extraction. - sanitize() kept as backward-compat wrapper returning just the patch. - Internal helpers _fuzzy_correct_paths and _validate_hunks refactored to _with_reason variants that propagate the rejection reason up; the legacy names are kept as thin ok/empty wrappers so existing callers don't break. Runner (eval/swebench_phase2.py): - _sanitize_for_submission(raw, repo_slug, tree_paths, retry_callback=None) now optionally invokes retry_callback(reason, failed_output) -> str on rejection. The callback returns a fresh raw response; up to _FORMAT_RETRY_MAX (=1) retries are attempted. When the callback is None (the v0.34.5-era default), behavior is unchanged. - _build_retry_prompt(original_prompt, reason, failed_output) builds a targeted retry prompt with reason-specific guidance (placeholder_hunk says "use real line numbers"; path_not_found says "use exact paths from the repo context"; etc.). Embeds the failed output verbatim so the model corrects rather than regenerates. - run_baseline / run_organism / run_langgraph accept retry_on_reject: bool = False. When True, each constructs a runner-appropriate callback (direct _llm_call for baseline, whole-organism re-run for organism/langgraph) that passes the reason into _build_retry_prompt. CLI: - New --retry-on-reject flag (default off) gates retry behavior. - New --output PATH flag lets Phase C reruns preserve separate artifacts per model / per config without overwriting the v0.34.5 baseline artifact. Defaults to eval/results/swebench_phase2.json so existing callers are unchanged. Tests: - 12 new reason-code tests in test_patch_sanitizer.py exercising every code in SANITIZE_REASONS (including empty_extraction for headers-only and prose-only inputs, placeholder_hunk for XXX/N/?, truncated/overlong distinguishing, path_not_found / ambiguous_path via the tree oracle). - 8 new tests in test_swebench_phase2_retry.py covering _FORMAT_RETRY_MAX contract, no-callback-is-old-behavior, callback-fires-on-reject, callback-not-fired-on-success, retry-also-rejects-returns-empty, retry-cap-enforced, and _build_retry_prompt embedding reason+failed output with reason-specific guidance. 139/139 tests pass across sanitizer, retry, timeout, identity, extraction, repo_cache, repo_grounding suites (+20 new since v0.34.5). All TDD RED→GREEN: each new test was verified to fail before the corresponding implementation landed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…long bare-empty misclassification MEDIUM eval/swebench_phase2.py:954-958, 801-856 — --output was silently ignored in --rewrite-envelope mode because main() returned before the new output-path logic ran, and _rewrite_envelope always wrote back to the input path. A caller trying to preserve the source artifact while rewriting the envelope could still overwrite it. Fix: added output_path parameter to _rewrite_envelope (defaults None = in-place, matching v0.34.5 behavior). main() threads args.output through when set. The source artifact is untouched when output is explicit, and the new log line makes the from/to pair visible. LOW eval/_patch_sanitizer.py:740-746 — overlong hunks whose first extra body line is a bare empty context line were misclassified as truncated_hunk. The check for body-shaped post-count content only recognized "+", "-", " " prefixes but bare empty "" is also body- shaped (it's the whitespace-stripped-blank-context case the repair pass handles). The retry prompt would say "include every line" when the model actually needed to trim extras — a directional error. Root cause was deeper than just the classification: both _normalize_paths and _repair_bare_empty_context use splitlines() followed by "\\n".join(), which silently drops trailing empty lines from the representation. So even with the reason-code fix, the validator never saw the offending empty line because normalize had already eaten it. Fix (two layers): 1. Moved _validate_hunks_with_reason to run on the pristine input, BEFORE any string-round-tripping. Validation is invariant under path normalization and bare-empty repair (neither changes hunk structure), so running it first is safe and produces accurate reason codes regardless of downstream round-trip losses. 2. Added "" to the body-shape check in _validate_hunks_with_reason so bare-empty overlong content is classified as overlong_hunk rather than truncated_hunk, per the reviewer's direct suggestion. 3 new tests: * test_sanitize_with_reason_overlong_hunk_bare_empty_extra (LOW) * test_rewrite_envelope_writes_to_output_path_when_provided (MEDIUM) * test_rewrite_envelope_defaults_to_input_path_when_output_omitted (backward-compat regression check) 142/142 tests pass across sanitizer + retry + timeout + identity + extraction + repo_cache + repo_grounding (was 139, +3 new). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ships the Phase C experiment described in the v0.34.5 blog's "what's next" list: a locally-runnable cross-model check of the 8B format- discipline ceiling claim, with format-correction retry active. Artifact: eval/results/swebench_phase2_deepseek_retry.json - model: deepseek-r1:8b (DeepSeek-R1 distilled onto Qwen3-8B), digest 6995872bfe4c, 8.2B params, Q4_K_M, 131k context - Same 10 SWE-bench-lite instances as v0.34.5, same grounded pipeline - --retry-on-reject active (format-correction retry) - post_run_check.status = match - harness=ok across all three conditions - Total wall-clock: ~22 hours (reasoning models are slow) Outcome (all three conditions, 10 instances each): resolved=0, unresolved=0, sanitizer-rejected=10, runtime_error=0 mean_latency: baseline 1084s, organism 1073s, langgraph 1130s retry-recovered patches: 0 across all 30 submissions Two load-bearing findings for Paper 5 §6.3 Phase C: 1. Cross-model ceiling is sharp. Both gemma4 (v0.34.5) and deepseek-r1:8b (v0.35.0) produce zero resolved instances. Gemma4 crossed the sanitizer once (django-11001 baseline -> unresolved); deepseek-r1 crossed it zero times. The single v0.34.5 "survivor" was gemma4-specific, not a structurally easy issue. 2. Retry-with-reason-code did not help. The retry mechanism fires correctly (we observed overlong_hunk and truncated_hunk reason codes in the live logs), retry prompts embed the reason + failed output verbatim with reason-specific guidance, but zero "retry recovered" events logged across 30 submissions. At 8B, diff-format failure is not mistake-that-can-be-corrected but cannot-produce-format-at-all. Retry helps occasional misformatters, not capability-ceiling models. Sharper negative result than the v0.34.5 paper predicted (we suggested retry "might recover 20-40% of sanitizer drops"); at 8B it recovers zero. Paper 5 §6.3 adds new subsubsection "Phase C: Cross-Model Check with Format-Correction Retry" with: - Setup (deepseek-r1:8b identity details) - Per-condition outcome table - Comparison table vs gemma4 v0.34.5 - What Phase C does and does not say - Next wedge: 70B+ via cloud GPU, outside local-only scope Paper 5 §7 Limitations updated: "SWE-bench-lite ceiling at 8B, confirmed cross-model". Paper PDF rebuilt via tectonic (114KB -> 121KB; the Phase C content is the delta). Infrastructure shipped in this release (already on main from PRs #57, #58, #57's #753 and #58's #755 follow-ups): - 900s LLM timeout + 60s probe timeout split (PR #57) - sanitize_with_reason() returning (patch, reason_code) tuple with 8 machine-readable reason codes (PR #58) - _FORMAT_RETRY_MAX = 1 retry callback plumbing in all three runners (PR #58) - EVAL_RUNTIME_ERROR status distinct from sanitizer-rejected empty_patch (carried over from v0.34.5) - classify_prediction() as authoritative override (carried over) - --retry-on-reject and --output CLI flags (PR #58) - --output honored under --rewrite-envelope (PR #58's #755 fix) - overlong-hunk classification handling of bare empty extra (same) Versions bumped: - pyproject.toml 0.34.5 -> 0.35.0 - operon_ai/__init__.py 0.34.5 -> 0.35.0 - README.md badge v0.34.5 -> v0.35.0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coredipper and others added 2 commits April 17, 2026 23:24

coredipper merged commit 4024e7a into main Apr 17, 2026
4 checks passed

coredipper mentioned this pull request Apr 18, 2026

release: v0.35.0 — Phase C cross-model confirmation + retry null result #59

Merged

5 tasks

coredipper deleted the feat/phase2-format-retry branch April 22, 2026 14:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Phase C — format-correction retry + per-model output paths#58

feat: Phase C — format-correction retry + per-model output paths#58
coredipper merged 2 commits into
mainfrom
feat/phase2-format-retry

coredipper commented Apr 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

coredipper commented Apr 17, 2026

Problem

Changes

Sanitizer — reason-code surfaces

Runner — retry callback plumbing

CLI

Tests (20 new, all TDD RED→GREEN)

Test plan

Follow-up on this branch (not in this commit)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant