SWE-bench Phase 2 Phase B: repo-content grounding + fuzzy path correction by coredipper · Pull Request #55 · coredipper/operon

coredipper · 2026-04-17T10:44:30Z

Supersedes #54 (auto-closed when #53 merged and its base branch was deleted). All 8 commits are the same; review chain #724–#736 is closed and #738 came back Pass.

Summary

Real repo grounding. When --grounding is set, each instance's {repo}@{base_commit} is shallow-fetched into .cache/swebench/. Heuristics mine the issue text for hints, rank Python files, and inject a tree listing + top-5 candidate file contents into the task prompt. The grounded task propagates to every organism stage unchanged.
Sanitizer gains a tree oracle. sanitize(..., tree_paths=...) applies stateful, per-file-diff correction. Source-side paths that don't exist are rewritten to a unique basename match when exactly one exists; otherwise the patch is rejected. Target-side paths of create / rename / copy pass through unchanged. Modify blocks mirror source corrections onto target and diff --git.
Hardened diff parser (from review chain #724 → #726 → #729 → #730 → #733 → #735 → #736):
- Bare multi-file diffs split correctly (per-block seen_plus state).
- Hunk-body deletions shaped --- or +++ never misparsed as file headers (count-driven _scan_hunk_extent).
- Overlong hunks rejected with ok=False when body lines spill past declared counts.
- Post-hunk boundaries require either @@ / diff --git / git-metadata, or a real bare-diff --- + +++ + @@ triplet — shape-only ---/+++ lines after counts drain are treated as overlong body content.
Prompt-injection hardening: file content fences use dynamic backtick runs longer than any sequence in the content; an included file with its own ``` cannot close the outer fence.
Opt-in + safe default. --grounding is off by default. Without it, behavior matches Phase A byte-for-byte.
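
The tree-oracle rule for source-side paths can be sketched as below. This is a minimal illustration rather than the PR's _correct_file_diff: the name correct_source_path and the convention of returning None to signal "reject the patch" are assumptions made for the example.

```python
from pathlib import PurePosixPath
from typing import FrozenSet, Optional


def correct_source_path(path: str, tree_paths: FrozenSet[str]) -> Optional[str]:
    """Return a path that exists in the tree, or None to reject the patch.

    Hypothetical helper: the name and the None-on-reject convention are
    assumptions, not the PR's actual API.
    """
    # /dev/null is always accepted; existing paths pass through unchanged.
    if path == "/dev/null" or path in tree_paths:
        return path
    # Otherwise look for files in the tree that share the same basename.
    name = PurePosixPath(path).name
    matches = [p for p in tree_paths if PurePosixPath(p).name == name]
    # Rewrite only when the basename match is unique; ambiguous or
    # missing matches reject the whole patch.
    return matches[0] if len(matches) == 1 else None
```

With a tree containing pkg/core.py, pkg/util/helpers.py and tests/helpers.py, a header naming src/core.py is rewritten to pkg/core.py, while helpers.py (two candidates) and missing.py (none) both reject.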

Implementation

New files:

eval/_repo_cache.py — ensure_repo_at(repo_slug, base_commit, cache_dir) -> Path. Cache key {owner}__{repo}-{commit[:12]}. Cleans up on partial failure.
eval/_repo_grounding.py — extract_hints, rank_candidate_files, format_context_block, walk_tree_paths. Pure filesystem + regex.
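
As a rough sketch of the cache layout, the key format above and the git sequence listed in the first commit message can be written out as pure functions. The helper names (cache_key, cache_path, shallow_fetch_commands) and the GitHub clone URL are assumptions for illustration; the real ensure_repo_at also runs the commands and cleans up on failure, which this sketch does not.

```python
from pathlib import Path
from typing import List


def cache_key(repo_slug: str, base_commit: str) -> str:
    """Build the {owner}__{repo}-{commit[:12]} key described in the PR."""
    owner, repo = repo_slug.split("/", 1)
    return f"{owner}__{repo}-{base_commit[:12]}"


def cache_path(cache_dir: str, repo_slug: str, base_commit: str) -> Path:
    """Directory where this repo/commit pair would be checked out."""
    return Path(cache_dir) / cache_key(repo_slug, base_commit)


def shallow_fetch_commands(repo_slug: str, base_commit: str) -> List[List[str]]:
    """argv lists for the single-commit shallow fetch.

    Assumption: the remote is the public GitHub URL for the slug.
    """
    url = f"https://github.com/{repo_slug}.git"
    return [
        ["git", "init"],
        ["git", "remote", "add", "origin", url],
        ["git", "fetch", "--depth", "1", "origin", base_commit],
        ["git", "checkout", "FETCH_HEAD"],
    ]
```

Keying on the first 12 characters of the commit keeps directory names short while still separating different commits of the same repository.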

Extended:

eval/_patch_sanitizer.py — sanitize(..., tree_paths=...) optional arg. Count-driven _scan_hunk_extent as single source of truth for hunk-body boundaries. Stateful _correct_file_diff classifies each file-diff block (create / rename / copy / delete / modify).
eval/swebench_phase2.py — --grounding / --cache-dir flags, Grounding dataclass with .none() sentinel, per-instance _build_grounding call that degrades gracefully on any failure, pass-through of grounding to all three run_* conditions.
pyproject.toml — eval extras now require datasets + swebench.

Answer-leak posture: grounding only reads problem_statement + hints_text + repo files at base_commit. instance["patch"] and instance["test_patch"] are never accessed.

Test plan

pytest tests/unit/test_patch_sanitizer.py tests/unit/test_patch_extraction.py tests/unit/test_swebench_phase2_identity.py tests/unit/test_repo_cache.py tests/unit/test_repo_grounding.py — 101 pass.
python -m eval.swebench_phase2 --rewrite-envelope eval/results/swebench_phase2.json — round-trips with post_run_check.status=match.
End-to-end real clone: astropy/astropy at a real SWE-bench-lite base_commit produced a 30 KB context block with all 5 candidate files under astropy/modeling/.
Actual SWE-bench rerun — deferred; will land as a separate PR (updated eval/results/swebench_phase2.json + paper §6.3 rewrite) only if the numbers justify it.

🤖 Generated with Claude Code

…rrection (Phase B)

Phase A (PR #53) eliminated deterministic apply-failure modes in pure Python. It cannot fix content errors: sanitized patches still have to hit the right lines of the right file. Phase B grounds the prompt and the sanitizer in the actual repository at {repo}@{base_commit}.

Changes:

eval/_repo_cache.py (new): ensure_repo_at(repo_slug, base_commit, cache_dir) shallow-fetches a single commit via git CLI: git init / remote add origin / fetch --depth 1 origin {sha} / checkout FETCH_HEAD. Cache key is {owner}__{repo}-{commit[:12]} so different commits of the same repo don't collide. RepoCacheError on any failure; half-populated cache entries are cleaned up so the next attempt starts fresh.

eval/_repo_grounding.py (new): four helpers.
- extract_hints(text): regex-harvest .py paths + CamelCase + snake_case identifiers from issue text; stopwords filtered.
- rank_candidate_files(repo_path, hints, k): filesystem walk, score each .py file by (+3 stem match, +1 path match), tie-break by shorter path. No LLM calls.
- format_context_block(repo_path, paths, max_lines): tree listing + '## <path>' + fenced file content (truncated at max_lines).
- walk_tree_paths(repo_path): frozenset of relative paths, used as the sanitizer's tree oracle. Skips .git, __pycache__, .venv, etc.

eval/_patch_sanitizer.py: add optional tree_paths=frozenset[str] param to sanitize(). When supplied, any file header path that does NOT exist in the tree is rewritten to a unique basename match if exactly one exists, or the entire patch is rejected (returns ""). /dev/null always accepted. When tree_paths is None, behavior matches Phase A exactly; all 19 prior tests stay green.

eval/swebench_phase2.py:
- --grounding flag (default off) and --cache-dir (default .cache/swebench) added.
- New Grounding dataclass + _build_grounding() called per-instance when --grounding is on. Failures fall back to Grounding.none() with a diagnostic; the run continues without grounding for that instance.
- _format_task takes an optional grounding_block that's appended as 'Repository context:'. It reaches all organism stages since task propagates unchanged across stages.
- _sanitize_for_submission takes tree_paths and passes it through.
- run_baseline/run_organism/run_langgraph gain a grounding parameter.

pyproject.toml: eval extras now require datasets + swebench (the Phase 2 script already needed both; just formalized).

Tests (+44 new, 72 pass total across the sanitizer/extraction/identity/repo-cache/repo-grounding suites):
- test_repo_cache.py (7): cache hit path, first-call git sequence, error surfacing, commit-key separation, malformed inputs, cleanup on partial failure.
- test_repo_grounding.py (15): hint extraction (paths, CamelCase, snake_case, stopwords, dedup), ranking (stem > path, tie-break, VCS-skip), context-block formatting (tree+snippets, truncation, empty case), tree walking.
- test_patch_sanitizer.py (+6): fuzzy rewrite on unique basename match, reject on ambiguous/no match, /dev/null accepted, prior behavior preserved when tree_paths=None.

Verification:
- 72/72 pytest.
- python -m eval.swebench_phase2 --rewrite-envelope ... still round-trips with post_run_check.status=match.
- End-to-end real clone: astropy/astropy at a Phase-2 instance's base_commit produced 5 candidate files (all under astropy/modeling/ for a modeling issue) and a 30KB context block.

Out of scope (Phase C later): retry-on-format-error, LLM-localized candidate selection, formal answer-leak audit, actual benchmark rerun. Phase C decisions gate on Phase B's observed effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
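
The ranking heuristic from this commit (+3 stem match, +1 path match, tie-break by shorter path) might look roughly like the following. It is a sketch over an in-memory list of paths rather than a filesystem walk; the names score_path and rank_candidates, the case-insensitive comparison, and the decision to drop zero-score files are all assumptions, not details confirmed by the PR.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def score_path(path: str, hints: Iterable[str]) -> int:
    """+3 when a hint equals the file stem, +1 when it only appears in the path."""
    stem = PurePosixPath(path).stem.lower()
    lowered = path.lower()
    score = 0
    for hint in hints:
        h = hint.lower()
        if h == stem:
            score += 3
        elif h in lowered:
            score += 1
    return score


def rank_candidates(paths: Iterable[str], hints: List[str], k: int = 5) -> List[str]:
    """Return up to k .py paths, highest score first, shorter path on ties."""
    scored = [(score_path(p, hints), p) for p in paths if p.endswith(".py")]
    # Assumption: files that match no hint are not candidates at all.
    scored = [(s, p) for s, p in scored if s > 0]
    scored.sort(key=lambda sp: (-sp[0], len(sp[1]), sp[1]))
    return [p for _, p in scored[:k]]
```

For hints ["separable", "modeling"], astropy/modeling/separable.py scores 4 (stem plus path) and outranks astropy/modeling/core.py at 1, while files outside astropy/modeling/ score 0 and are dropped.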

MEDIUM eval/_patch_sanitizer.py (over-aggressive fuzzy correction): _fuzzy_correct_paths previously forced every non-/dev/null file-header path to exist in tree_paths (the repo at base_commit). That's wrong for any patch where a target path legitimately doesn't exist yet: file creation (--- /dev/null; +++ b/new_path), rename (rename to / +++ b/new_path), and copy (copy to / +++ b/new_path). Worst case, the oracle would silently rewrite a new file's path to some unrelated unique-basename match, turning a valid create into a wrong modify.

Fix: stateful per-file-diff correction.
- _split_file_diffs groups lines by `diff --git` / `---` boundaries.
- _correct_file_diff classifies each block as create / rename / copy / delete / modify based on /dev/null markers and rename-from/to, copy-from/to metadata.
- Source-side paths (--- a/, rename from, copy from, diff --git a-side when not a create) are oracle-checked and rejected on fail.
- Target-side paths in create/rename/copy blocks pass through unchanged; they legitimately don't exist at base_commit.
- In modify blocks, any correction applied to --- a/X is mirrored on +++ b/X and the diff --git line so paths stay internally consistent.

7 new tests in test_patch_sanitizer.py covering:
- create target absent from tree → accepted unchanged, not rewritten
- rename target not oracle-checked (source still is)
- copy target not oracle-checked (source still is)
- new file not silently rewritten to unrelated unique basename match (the specific regression #724 called out)
- delete source must exist (still enforced)
- delete with absent source still rejected
- modify correction mirrors source → target → diff --git

MEDIUM eval/_repo_grounding.py (prompt-injection via unescaped fence): format_context_block wrapped each file in a fixed ```python ... ``` fence. A file whose content contains ``` (very common: docstring markdown examples) would close the fence early, leaking the remainder of the file into the prompt as bare instructions. Untrusted repo content becoming prompt content is a prompt-injection surface.

Fix: _safe_fence scans each file's content for the longest run of backticks and uses an outer fence one backtick longer (minimum 3). A Markdown closing fence must be at least as long as the opening fence, so the outer fence cannot be closed by any backtick run inside the content.

2 new tests in test_repo_grounding.py: triple-backtick file yields outer fence ≥4 backticks; quadruple/longer runs correctly adapt.

81/81 tests pass across all touched suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
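
The dynamic-fence idea can be illustrated with a few lines of Python. This is a sketch of the technique as the commit message describes it, not the repository's _safe_fence; the function names safe_fence and fence_file and the python info string are assumptions.

```python
import re


def safe_fence(content: str, minimum: int = 3) -> str:
    """Return a backtick fence one longer than the longest run in content."""
    runs = re.findall(r"`+", content)
    longest = max((len(run) for run in runs), default=0)
    return "`" * max(minimum, longest + 1)


def fence_file(path: str, content: str) -> str:
    """Wrap one file as '## <path>' followed by a fenced block.

    Hypothetical formatter: the layout mirrors the '## <path>' convention
    mentioned in the PR, but the exact output format is an assumption.
    """
    fence = safe_fence(content)
    return f"## {path}\n{fence}python\n{content}\n{fence}"
```

A file with no backticks gets the usual three-backtick fence; a file containing a triple-backtick docstring example gets a four-backtick outer fence, so the inner run cannot terminate it.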

MEDIUM eval/_patch_sanitizer.py: _split_file_diffs used a sticky in_block flag that was set True on the first bare `---` header and never cleared. Subsequent `--- a/<file2>` / `--- a/<file3>` headers in a bare multi-file diff (the form extract_patch explicitly supports) therefore collapsed into one block. _correct_file_diff would then apply a single create/rename/copy/modify classification across multiple files and leak one file's source_rewrite onto another, misclassifying mixed modify/create sections or leaving later `+++` headers unmirrored.

Fix: replace the sticky flag with per-block state. A `---` header starts a new block when the current block has already emitted its own `+++` (which signals the previous file's definition is complete). `diff --git` unambiguously always starts a new block; `--- /dev/null` seen before any `+++` in the current block is still part of that block (the create marker after a `diff --git` prelude).

Rule: new block iff:
- line starts with `diff --git`, OR
- line starts with `---` AND (no current block OR current block has seen `+++`)

4 new tests in test_patch_sanitizer.py covering:
- bare multi-file modify: each file oracle-checked independently
- bare multi-file mixed (modify + create): create target passes through while modify source gets oracle-checked
- bare multi-file with bad source in second file: unique-basename rewrite applied to second only, first block untouched
- bare multi-file rejects whole patch if any source unresolvable

85/85 tests pass across all touched suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
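
The block-splitting rule stated in this commit can be sketched as a small state machine. The function name split_file_diffs and the choice to ignore lines before the first header are assumptions; the sketch also deliberately stops at this commit's shape-only rule, which later commits on this page replace with count-driven scanning inside hunk bodies.

```python
from typing import List, Optional


def split_file_diffs(lines: List[str]) -> List[List[str]]:
    """Group diff lines into per-file blocks using per-block seen_plus state."""
    blocks: List[List[str]] = []
    current: Optional[List[str]] = None
    seen_plus = False  # has the current block already emitted its '+++'?
    for line in lines:
        starts_block = line.startswith("diff --git ") or (
            line.startswith("--- ") and (current is None or seen_plus)
        )
        if starts_block:
            current = [line]
            blocks.append(current)
            seen_plus = False
            continue
        if current is None:
            # Assumption: text before the first header is ignored here.
            continue
        if line.startswith("+++ "):
            seen_plus = True
        current.append(line)
    return blocks
```

A bare two-file diff yields two blocks because the second `---` arrives after the first block has seen its `+++`, while `--- /dev/null` following a `diff --git` prelude stays inside the same block.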

MEDIUM eval/_patch_sanitizer.py: _starts_new_block treated any line matching '--- ' as a file-header boundary. But in a unified diff, a deletion of content starting with '-- ' is encoded on the wire as '--- ...' (the leading '-' is the delete marker). So a hunk body containing a deleted line whose text starts with '-- a/foo.py' or '---' (e.g. YAML frontmatter) would be split mid-hunk, and _correct_file_diff and _validate_hunks would then misinterpret the body text as a file-header shape.

Fix: add _is_minus_file_header(line, next_line) as the single disambiguation point. A real '---' file header has TWO properties:
(1) the path after '--- ' matches 'a/<path>' or '/dev/null'
(2) the very next line is '+++ '
A hunk-body deletion of content shaped like '--- a/foo.py' is always followed by another body line, never by '+++ '. Use this helper in both _split_file_diffs (boundary detection) and _validate_hunks (hunk-body termination).

Extra hardening inside _correct_file_diff: the source-path correction and line-rewrite passes only apply BEFORE the first '@@' hunk header. Everything after the first '@@' is hunk-body content and must pass through verbatim, even when it happens to match the '--- a/...' or '+++ b/...' shape.

2 new tests in test_patch_sanitizer.py:
- hunk deletes '-- a/foo.py' content → single-file modify survives
- YAML frontmatter deletion ('---' content on both sides) → passes

87/87 tests pass across sanitizer / extraction / identity / repo-cache / repo-grounding suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… disambiguation

MEDIUM eval/_patch_sanitizer.py: _is_minus_file_header asked only that the next line startswith('+++ '). A hunk where the deletion is '-- a/foo.py' (wire: '--- a/foo.py') and the addition is '++ plus' (wire: '+++ plus') would therefore still be misread as a file-header pair. The sibling line must match a REAL +++ file-header shape: '+++ b/<path>' or '+++ /dev/null' (with optional tab + timestamp).

Fix: added _is_plus_file_header(line) with a regex that requires '+++ b/' or '+++ /dev/null' after the '+++ ' prefix. _is_minus_file_header now calls it instead of checking startswith. An addition of arbitrary content (even if shaped '+++ some content') no longer triggers a false file-header match.

LOW _repair_bare_empty_context used the old naive check (line.startswith('--- ') or '+++ ') to decide when to leave hunk mode. A valid hunk with an ambiguous '---' deletion followed by a whitespace-stripped blank context line would leave hunk mode on the ambiguous line, so the blank would never get repaired, and validate would then reject the patch.

Fix: _repair_bare_empty_context now calls _is_minus_file_header and _is_plus_file_header too, the same lookahead used by the splitter and validator. All three sanitizer passes now interpret ambiguous '---'/'+++' body lines consistently.

Also tightened _validate_hunks: replaced the old 'startswith("+++ ") and not startswith("+++ b/") and != "+++ /dev/null"' check with a direct call to _is_plus_file_header.

2 new tests:
- hunk body with '--- a/foo.py' delete + '+++ some content' add (ambiguous pair) survives as a single-file modify
- hunk with ambiguous delete followed by blank context line gets the blank repaired to ' ' before validation

89/89 tests pass across sanitizer / extraction / identity / repo-cache / repo-grounding suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
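
A shape check along the lines this commit describes could be written as follows. The regexes and the names is_plus_file_header / is_minus_file_header are illustrative assumptions; the actual patterns in eval/_patch_sanitizer.py are not shown on this page.

```python
import re
from typing import Optional

# '+++ b/<path>' or '+++ /dev/null', optionally followed by a tab and timestamp.
PLUS_HEADER_RE = re.compile(r"^\+\+\+ (?:b/\S.*|/dev/null(?:\t.*)?)$")
# '--- a/<path>' or '--- /dev/null', with the same optional suffix.
MINUS_HEADER_RE = re.compile(r"^--- (?:a/\S.*|/dev/null(?:\t.*)?)$")


def is_plus_file_header(line: str) -> bool:
    """True only for a '+++' line with a real file-header shape."""
    return PLUS_HEADER_RE.match(line) is not None


def is_minus_file_header(line: str, next_line: Optional[str]) -> bool:
    """True when line has '---' header shape AND is followed by a '+++' header."""
    if MINUS_HEADER_RE.match(line) is None or next_line is None:
        return False
    return is_plus_file_header(next_line)
```

With these checks, the pair '--- a/foo.py' / '+++ plus' no longer reads as a file header, although a body pair that imitates both shapes still would.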

…uation loop

MEDIUM eval/_patch_sanitizer.py: the shape-only ``_is_plus_file_header`` still matched arbitrary hunk-body additions whose content happened to start with ``++ b/`` or ``++ /dev/null`` (wire: ``+++ b/...`` and ``+++ /dev/null``). Paired with a ``-- a/...`` deletion, such content would even satisfy the ``_is_minus_file_header`` lookahead. The only structurally-correct distinction between a file-header pair and a body pair is POSITION relative to the declared ``@@ -a,b +c,d @@`` counts: inside the declared body, every line's role is decided by its first character alone.

Fix: introduce ``_parse_hunk_counts`` and ``_scan_hunk_extent``, a shared count-driven body walker. ``_scan_hunk_extent`` consumes exactly ``(old_count, new_count)`` body lines, decrementing the counters based on the leading character (``+``, ``-``, ``' '`` / empty). A line shaped ``--- a/foo.py`` inside a body is an unambiguous deletion; ``+++ b/bar.py`` inside a body is an unambiguous addition.

Refactored three callers to use the scanner:
* ``_split_file_diffs``: when at ``@@``, appends the exact body extent verbatim and resumes boundary detection only afterwards.
* ``_repair_bare_empty_context``: repairs blanks only within the declared extent; no shape-based exit heuristic.
* ``_validate_hunks``: delegates body consumption to the scanner; ``ok`` return tells us whether counts balanced.

The shape-only helpers ``_is_plus_file_header`` and ``_is_minus_file_header`` are retained for OUTSIDE-hunk boundary detection in ``_split_file_diffs``, where they remain correct because outside a hunk body there's no competing "body content" interpretation.

3 new tests covering the review #733 adversarial cases:
* body add shaped ``+++ b/foo.py`` (wire of ``++ b/foo.py`` content)
* body pair ``--- a/foo.py`` + ``+++ b/bar.py`` (matching both halves of a file-header pair by shape)
* body add shaped ``+++ /dev/null``

92/92 tests pass across all touched suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
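
A count-driven body walker of the kind described in this commit can be sketched as below. The names parse_hunk_counts and scan_hunk_extent mirror the commit message without the leading underscore, but the signatures, the (end_index, ok) return shape, and the handling of the no-newline marker are assumptions. The post-hunk boundary check added by later commits is intentionally left out.

```python
import re
from typing import List, Optional, Tuple

HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def parse_hunk_counts(header: str) -> Optional[Tuple[int, int]]:
    """Return (old_count, new_count); an omitted count defaults to 1."""
    m = HUNK_RE.match(header)
    if m is None:
        return None
    old = int(m.group(1)) if m.group(1) is not None else 1
    new = int(m.group(2)) if m.group(2) is not None else 1
    return old, new


def scan_hunk_extent(lines: List[str], start: int) -> Tuple[int, bool]:
    """Walk the body of the hunk whose '@@' header is lines[start].

    Returns (index just past the body, whether the counts balanced).
    """
    counts = parse_hunk_counts(lines[start])
    if counts is None:
        return start + 1, False
    old, new = counts
    i = start + 1
    while (old > 0 or new > 0) and i < len(lines):
        line = lines[i]
        if line.startswith("\\"):  # '\ No newline at end of file'
            i += 1
            continue
        lead = line[:1]
        if lead == "+":
            new -= 1
        elif lead == "-":
            old -= 1
        elif lead in (" ", ""):  # context line (possibly whitespace-stripped)
            old -= 1
            new -= 1
        else:
            return i, False
        if old < 0 or new < 0:
            return i, False
        i += 1
    ok = old == 0 and new == 0
    if ok and i < len(lines) and lines[i].startswith("\\"):
        i += 1  # marker attached to the final body line
    return i, ok
```

Because only the first character is consulted while counts remain, a body line such as `--- a/foo.py` is counted as one deletion and `+++ b/bar.py` as one addition, regardless of how header-like they look.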

HIGH eval/_patch_sanitizer.py: the count-driven _scan_hunk_extent stopped as soon as (old_count, new_count) reached zero. Extra body lines AFTER that boundary were silently left in the patch as "between-hunks content". A declared '@@ -1 +1 @@' with body '-old / +new / +extra' would validate, but git apply would still choke on the dangling '+extra'. That's the exact failure class the validator is supposed to prevent.

Fix: after counts drain, _scan_hunk_extent now additionally requires the NEXT line (if any) to be a legal post-hunk boundary: another '@@', a '---' / '+++' file header, 'diff --git', a git metadata line, or EOF. Body-shaped content ('+', '-', ' ', or empty) after counts drain returns ok=False.

A trailing '\ No newline at end of file' marker is consumed first (it applies to the hunk's final body line) before the boundary check, so its presence doesn't trigger a false-overlong rejection.

4 new tests:
* overlong by addition (extra '+' line): rejected
* overlong by addition in grounding mode: rejected
* overlong by context (extra ' ' line): rejected
* trailing '\ No newline at end of file' marker: accepted

96/96 tests pass across all touched suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

HIGH eval/_patch_sanitizer.py: _is_post_hunk_boundary accepted any line shaped like a ---/+++ file header based on shape alone. Body content shaped '--- a/foo.py' (wire form of deletion '-- a/foo.py') or '+++ b/bar.py' (wire form of addition '++ b/bar.py') after counts drain therefore passed as a valid boundary rather than being rejected as overlong body content. The validator would accept the patch, but git apply would still reject.

Fix: tightened _is_post_hunk_boundary. Unambiguous accepts (no body line can fake these shapes):
* EOF
* '@@' (next hunk of same file)
* 'diff --git ' (new file-diff)
* git metadata prefixes ('index ', 'new file mode', 'rename from', etc.); these appear between 'diff --git' and the '---' pair, so they imply a new file

For bare multi-file diffs without 'diff --git', a legitimate file boundary requires the full TRIPLET '---' + '+++' + '@@' on three consecutive lines. A body-content imposter pair '-- a/X' + '++ b/X' followed by arbitrary body content would not put '@@' at the third position.

Shape-only '---' or '+++' after counts drain (with no '@@' at the expected position) is now treated as overlong body content and rejects the patch.

5 new tests:
* extra '+++ b/bar.py' after counts (overlong by addition): reject
* extra '--- a/foo.py' after counts (overlong by deletion): reject
* extra '+++ /dev/null' after counts (overlong /dev/null-shaped addition): reject
* legitimate bare multi-file triplet: accept
* two-line --- / +++ pair without following @@: reject (overlong body content, not a real boundary)

101/101 tests pass across all touched suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
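
The tightened boundary test can be approximated in a few lines. This is a sketch under assumptions: the function signature, and the exact contents of GIT_METADATA_PREFIXES (a list of standard git extended-header prefixes chosen here for illustration), may differ from what eval/_patch_sanitizer.py actually uses.

```python
from typing import List

# Assumption: an illustrative set of git extended-header prefixes.
GIT_METADATA_PREFIXES = (
    "index ", "new file mode", "deleted file mode", "old mode", "new mode",
    "similarity index", "rename from", "rename to", "copy from", "copy to",
)


def is_post_hunk_boundary(lines: List[str], i: int) -> bool:
    """Decide whether lines[i] may legally follow a fully drained hunk."""
    if i >= len(lines):  # EOF
        return True
    line = lines[i]
    # Shapes that no body line (leading '+', '-', ' ' or empty) can imitate.
    if line.startswith("@@") or line.startswith("diff --git "):
        return True
    if line.startswith(GIT_METADATA_PREFIXES):
        return True
    # Bare multi-file diff: require the full '---' / '+++' / '@@' triplet.
    if line.startswith("--- "):
        return (
            i + 2 < len(lines)
            and lines[i + 1].startswith("+++ ")
            and lines[i + 2].startswith("@@")
        )
    # Anything else, including a lone '+++ ...', is overlong body content.
    return False
```

A caller would run this at the index returned by the count-driven scanner and treat a False result as ok=False for the whole patch.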

…ity ceiling

Reruns the inconclusive 2026-04-16 SWE-bench Phase 2 experiment using the patch-apply pipeline shipped in PR #53 (sanitizer + tightened prompts) and PR #55 (repo grounding + fuzzy path correction + hardened diff parser). Same model (gemma4:latest, digest c6eb396dbd59, 8B/Q4_K_M), same 10 instances, --grounding active.

Result: 1/30 reaches harness verdict (django-11001 baseline, unresolved). 29/30 honest empty_patch (sanitizer refused malformed output). Zero `error` outcomes.

Old vs new outcome distribution per condition:
  baseline   1/10 evaluated, 9 error                   -> 1/10 evaluated, 9 empty_patch
  organism   1/10 evaluated, 5 error + 4 empty_patch   -> 0/10 evaluated, 10 empty_patch
  langgraph  0/10 evaluated, 6 error + 4 empty_patch   -> 0/10 evaluated, 10 empty_patch

The original run confounded two failure modes: model selecting wrong file paths (a) and model writing malformed unified diffs (b). Grounding solves (a). The fact that 29 of 30 still drop to empty_patch (sanitizer rejecting placeholder hunks, malformed counts, and invented paths) localizes the bottleneck to (b). At 8B/Q4_K_M, diff-format discipline is the binding constraint, not file selection.

Side observation: the original organism vs baseline gap (4 vs 0 empty_patch) closed under the new pipeline. The Phase A "[edit] stage emits a single fenced diff and nothing else" instruction eliminated the multi-stage format leak.

Latency cost: grounding ~doubles wall-clock per call (baseline 44s -> 104s; organism 88s -> 170s; langgraph 90s -> 172s) for zero evaluated-rate gain at this model scale.

Files:
- eval/results/swebench_phase2.json: regenerated by the writer (sanitizer + grounding + harness=ok, post_run_check=match)
- article/paper5/sections/06-experiments.tex: §6.3 retitled "8B Format Discipline Is the Ceiling"; new outcome table; before/after comparison; reframed interpretation
- article/paper5/sections/07-conclusion.tex: limitations item updated to reference the grounded-rerun confirmation

9/9 schema tests still pass. PDF rebuild deferred to user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…update)

Captures the work landed on this branch:
- PR #53 (already merged): patch sanitizer + tightened prompts (Phase A)
- PR #55 (already merged): repo grounding + fuzzy path correction + hardened diff parser (Phase B)
- PR #56 (this): grounded rerun artifact + paper §6.3 reframed as "8B format-discipline ceiling"; runtime_error vs sanitizer-rejected classification (reviews #747, #748)

Bumps:
- pyproject.toml 0.34.4 → 0.34.5
- operon_ai/__init__.py 0.34.4 → 0.34.5
- README.md badge v0.34.4 → v0.34.5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coredipper and others added 8 commits April 17, 2026 00:42

coredipper merged commit 174b731 into main Apr 17, 2026
4 checks passed

coredipper mentioned this pull request Apr 17, 2026

feat: SWE-bench Phase 2 v2 — grounded rerun confirms 8B model-capability ceiling #56

Merged

3 tasks

coredipper deleted the feat/swebench-phase2-grounding branch April 22, 2026 14:49
