Skip to content

feat: Phase C — format-correction retry + per-model output paths#58

Merged
coredipper merged 2 commits into
mainfrom
feat/phase2-format-retry
Apr 17, 2026
Merged

feat: Phase C — format-correction retry + per-model output paths#58
coredipper merged 2 commits into
mainfrom
feat/phase2-format-retry

Conversation

@coredipper
Copy link
Copy Markdown
Owner

First half of Phase C per the approved plan: format-correction retry infrastructure and the CLI plumbing for preserving per-model artifacts. Reruns land in a follow-up commit on this branch once benchmark data is in.

Problem

v0.34.5's grounded Phase 2 run had 27 of 30 submissions sanitizer-rejected with diff-format issues the sanitizer can identify (placeholder hunks, truncated bodies, invented paths) but the model never got a chance to correct. The v0.34.5 blog post named format-correction retry as the #2 wedge for moving the result; this PR implements it.

Changes

Sanitizer — reason-code surfaces

eval/_patch_sanitizer.py gets a parallel tuple-return API:

patch, reason = sanitize_with_reason(raw_patch, "django/django", tree_paths=tree)
# reason ∈ {"", "placeholder_hunk", "truncated_hunk", "overlong_hunk",
#          "malformed_metadata", "path_not_found", "ambiguous_path",
#          "empty_extraction"}

The old sanitize() is kept as a backward-compat wrapper so every existing test and caller continues to work unchanged. Internal helpers (_fuzzy_correct_paths, _validate_hunks) refactored to propagate reason codes.

Runner — retry callback plumbing

patch = _sanitize_for_submission(
    raw, repo_slug, tree_paths,
    retry_callback=lambda reason, failed: _llm_call(provider, retry_prompt(reason, failed)),
)
  • _build_retry_prompt(original_prompt, reason, failed_output) builds a targeted retry prompt with reason-specific guidance (placeholder_hunk → "use real line numbers"; path_not_found → "use exact paths from the repo context"; etc.). The failed output is embedded verbatim so the model corrects rather than regenerates.
  • Up to _FORMAT_RETRY_MAX = 1 retries per instance. The constant exists so a future experiment can tune it without diff churn.
  • All three runners (run_baseline, run_organism, run_langgraph) accept retry_on_reject: bool = False. When True, each constructs a runner-appropriate callback — direct _llm_call for baseline; whole-organism re-run for the multi-stage conditions.

CLI

Two new flags, both opt-in so v0.34.5 behavior is preserved:

Flag Purpose
--retry-on-reject Enable the retry callback. Off by default.
--output PATH Write the artifact somewhere other than eval/results/swebench_phase2.json so Phase C reruns can be preserved per-model without overwriting the v0.34.5 baseline.

Tests (20 new, all TDD RED→GREEN)

Sanitizer reason codestests/unit/test_patch_sanitizer.py (+12):

  • Happy path returns "" reason.
  • Empty input / whitespace / headers-only / prose-only → empty_extraction.
  • @@ -XXX,N +XXX,N @@placeholder_hunk (XXX and single-letter variants).
  • Declared counts > body → truncated_hunk.
  • Body extra beyond counts → overlong_hunk.
  • Invented path with no basename match in tree oracle → path_not_found.
  • Basename matches multiple files → ambiguous_path.
  • sanitize() wrapper returns sanitize_with_reason()[0] (backward-compat contract).

Retry plumbingtests/unit/test_swebench_phase2_retry.py (+8):

  • _FORMAT_RETRY_MAX == 1 (diagnostic default contract).
  • No callback = old behavior.
  • Callback fires on reject with reason + failed output.
  • Callback does NOT fire on success.
  • Retry also rejects → "" (no infinite loop, no fallback to original).
  • Retry cap enforced (chronically failing model can't trigger unbounded retries).
  • _build_retry_prompt embeds reason + failed output.
  • Reason-specific guidance differs between codes.

Test plan

  • pytest tests/unit/test_patch_sanitizer.py tests/unit/test_swebench_phase2_retry.py tests/unit/test_swebench_phase2_timeout.py tests/unit/test_swebench_phase2_identity.py tests/unit/test_patch_extraction.py tests/unit/test_repo_cache.py tests/unit/test_repo_grounding.py -q — 139/139 pass (+20 from v0.34.5's 119).
  • python -m eval.swebench_phase2 --help — both new flags render cleanly.
  • Live smoke on --model gemma4:latest --grounding --retry-on-reject --n 10 --output eval/results/swebench_phase2_gemma4_retry.json — follow-up commit on this branch once the 2-3h benchmark completes.

Follow-up on this branch (not in this commit)

Once this is merged (or before, if you prefer squash): rerun artifacts with the new flag against gemma4 and deepseek-r1:8b (now unblocked by PR #57's 900s timeout), Paper 5 §6.3 Phase C subsection, v0.35.0 version bump, blog follow-up, release.

🤖 Generated with Claude Code

coredipper and others added 2 commits April 17, 2026 23:24
Adds the format-correction retry wedge from the v0.34.5 blog post's
"what's next" list. When the sanitizer rejects a model's diff, the
runner can optionally re-prompt the model once with the specific
rejection reason and try again.

Sanitizer (eval/_patch_sanitizer.py):
- New sanitize_with_reason(patch, repo_slug, *, tree_paths) returning
  (cleaned_patch, reason) tuple. On success reason is "". On failure,
  reason is a machine-readable code from SANITIZE_REASONS:
  placeholder_hunk / truncated_hunk / overlong_hunk /
  malformed_metadata / path_not_found / ambiguous_path /
  empty_extraction.
- sanitize() kept as backward-compat wrapper returning just the patch.
- Internal helpers _fuzzy_correct_paths and _validate_hunks refactored
  to _with_reason variants that propagate the rejection reason up;
  the legacy names are kept as thin ok/empty wrappers so existing
  callers don't break.

Runner (eval/swebench_phase2.py):
- _sanitize_for_submission(raw, repo_slug, tree_paths, retry_callback=None)
  now optionally invokes retry_callback(reason, failed_output) -> str
  on rejection. The callback returns a fresh raw response; up to
  _FORMAT_RETRY_MAX (=1) retries are attempted. When the callback
  is None (the v0.34.5-era default), behavior is unchanged.
- _build_retry_prompt(original_prompt, reason, failed_output) builds
  a targeted retry prompt with reason-specific guidance
  (placeholder_hunk says "use real line numbers"; path_not_found
  says "use exact paths from the repo context"; etc.). Embeds the
  failed output verbatim so the model corrects rather than regenerates.
- run_baseline / run_organism / run_langgraph accept
  retry_on_reject: bool = False. When True, each constructs a
  runner-appropriate callback (direct _llm_call for baseline,
  whole-organism re-run for organism/langgraph) that passes the
  reason into _build_retry_prompt.

CLI:
- New --retry-on-reject flag (default off) gates retry behavior.
- New --output PATH flag lets Phase C reruns preserve separate
  artifacts per model / per config without overwriting the v0.34.5
  baseline artifact. Defaults to eval/results/swebench_phase2.json
  so existing callers are unchanged.

Tests:
- 12 new reason-code tests in test_patch_sanitizer.py exercising
  every code in SANITIZE_REASONS (including empty_extraction for
  headers-only and prose-only inputs, placeholder_hunk for XXX/N/?,
  truncated/overlong distinguishing, path_not_found / ambiguous_path
  via the tree oracle).
- 8 new tests in test_swebench_phase2_retry.py covering
  _FORMAT_RETRY_MAX contract, no-callback-is-old-behavior,
  callback-fires-on-reject, callback-not-fired-on-success,
  retry-also-rejects-returns-empty, retry-cap-enforced, and
  _build_retry_prompt embedding reason+failed output with
  reason-specific guidance.

139/139 tests pass across sanitizer, retry, timeout, identity,
extraction, repo_cache, repo_grounding suites (+20 new since v0.34.5).

All TDD RED→GREEN: each new test was verified to fail before the
corresponding implementation landed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…long bare-empty misclassification

MEDIUM eval/swebench_phase2.py:954-958, 801-856 — --output was
silently ignored in --rewrite-envelope mode because main() returned
before the new output-path logic ran, and _rewrite_envelope always
wrote back to the input path. A caller trying to preserve the source
artifact while rewriting the envelope could still overwrite it.

Fix: added output_path parameter to _rewrite_envelope (defaults None =
in-place, matching v0.34.5 behavior). main() threads args.output
through when set. The source artifact is untouched when output is
explicit, and the new log line makes the from/to pair visible.

LOW eval/_patch_sanitizer.py:740-746 — overlong hunks whose first
extra body line is a bare empty context line were misclassified as
truncated_hunk. The check for body-shaped post-count content only
recognized "+", "-", " " prefixes but bare empty "" is also body-
shaped (it's the whitespace-stripped-blank-context case the repair
pass handles). The retry prompt would say "include every line" when
the model actually needed to trim extras — a directional error.

Root cause was deeper than just the classification: both
_normalize_paths and _repair_bare_empty_context use splitlines()
followed by "\\n".join(), which silently drops trailing empty lines
from the representation. So even with the reason-code fix, the
validator never saw the offending empty line because normalize had
already eaten it.

Fix (two layers):
1. Moved _validate_hunks_with_reason to run on the pristine input,
   BEFORE any string-round-tripping. Validation is invariant under
   path normalization and bare-empty repair (neither changes hunk
   structure), so running it first is safe and produces accurate
   reason codes regardless of downstream round-trip losses.
2. Added "" to the body-shape check in _validate_hunks_with_reason
   so bare-empty overlong content is classified as overlong_hunk
   rather than truncated_hunk, per the reviewer's direct suggestion.

3 new tests:
  * test_sanitize_with_reason_overlong_hunk_bare_empty_extra (LOW)
  * test_rewrite_envelope_writes_to_output_path_when_provided (MEDIUM)
  * test_rewrite_envelope_defaults_to_input_path_when_output_omitted
    (backward-compat regression check)

142/142 tests pass across sanitizer + retry + timeout + identity +
extraction + repo_cache + repo_grounding (was 139, +3 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coredipper coredipper merged commit 4024e7a into main Apr 17, 2026
4 checks passed
coredipper added a commit that referenced this pull request Apr 18, 2026
Ships the Phase C experiment described in the v0.34.5 blog's "what's
next" list: a locally-runnable cross-model check of the 8B format-
discipline ceiling claim, with format-correction retry active.

Artifact: eval/results/swebench_phase2_deepseek_retry.json
- model: deepseek-r1:8b (DeepSeek-R1 distilled onto Qwen3-8B),
  digest 6995872bfe4c, 8.2B params, Q4_K_M, 131k context
- Same 10 SWE-bench-lite instances as v0.34.5, same grounded pipeline
- --retry-on-reject active (format-correction retry)
- post_run_check.status = match
- harness=ok across all three conditions
- Total wall-clock: ~22 hours (reasoning models are slow)

Outcome (all three conditions, 10 instances each):
  resolved=0, unresolved=0, sanitizer-rejected=10, runtime_error=0
  mean_latency: baseline 1084s, organism 1073s, langgraph 1130s
  retry-recovered patches: 0 across all 30 submissions

Two load-bearing findings for Paper 5 §6.3 Phase C:

1. Cross-model ceiling is sharp. Both gemma4 (v0.34.5) and
   deepseek-r1:8b (v0.35.0) produce zero resolved instances. Gemma4
   crossed the sanitizer once (django-11001 baseline -> unresolved);
   deepseek-r1 crossed it zero times. The single v0.34.5 "survivor"
   was gemma4-specific, not a structurally easy issue.

2. Retry-with-reason-code did not help. The retry mechanism fires
   correctly (we observed overlong_hunk and truncated_hunk reason
   codes in the live logs), retry prompts embed the reason + failed
   output verbatim with reason-specific guidance, but zero
   "retry recovered" events logged across 30 submissions. At 8B,
   diff-format failure is not mistake-that-can-be-corrected but
   cannot-produce-format-at-all. Retry helps occasional misformatters,
   not capability-ceiling models. Sharper negative result than the
   v0.34.5 paper predicted (we suggested retry "might recover 20-40%
   of sanitizer drops"); at 8B it recovers zero.

Paper 5 §6.3 adds new subsubsection "Phase C: Cross-Model Check with
Format-Correction Retry" with:
- Setup (deepseek-r1:8b identity details)
- Per-condition outcome table
- Comparison table vs gemma4 v0.34.5
- What Phase C does and does not say
- Next wedge: 70B+ via cloud GPU, outside local-only scope

Paper 5 §7 Limitations updated: "SWE-bench-lite ceiling at 8B,
confirmed cross-model".

Paper PDF rebuilt via tectonic (114KB -> 121KB; the Phase C content
is the delta).

Infrastructure shipped in this release (already on main from
PRs #57, #58, #57's #753 and #58's #755 follow-ups):
- 900s LLM timeout + 60s probe timeout split (PR #57)
- sanitize_with_reason() returning (patch, reason_code) tuple with
  8 machine-readable reason codes (PR #58)
- _FORMAT_RETRY_MAX = 1 retry callback plumbing in all three runners
  (PR #58)
- EVAL_RUNTIME_ERROR status distinct from sanitizer-rejected
  empty_patch (carried over from v0.34.5)
- classify_prediction() as authoritative override (carried over)
- --retry-on-reject and --output CLI flags (PR #58)
- --output honored under --rewrite-envelope (PR #58's #755 fix)
- overlong-hunk classification handling of bare empty extra (same)

Versions bumped:
- pyproject.toml 0.34.5 -> 0.35.0
- operon_ai/__init__.py 0.34.5 -> 0.35.0
- README.md badge v0.34.5 -> v0.35.0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coredipper coredipper deleted the feat/phase2-format-retry branch April 22, 2026 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant