SWE-bench Phase 2 Phase B: repo-content grounding + fuzzy path correction #55
Merged
…rrection (Phase B)

Phase A (PR #53) eliminated deterministic apply-failure modes in pure
Python. It cannot fix content errors: sanitized patches still have to
hit the right lines of the right file. Phase B grounds the prompt and
the sanitizer in the actual repository at {repo}@{base_commit}.

Changes:

eval/_repo_cache.py (new): ensure_repo_at(repo_slug, base_commit,
cache_dir) shallow-fetches a single commit via git CLI:
  git init / remote add origin / fetch --depth 1 origin {sha} /
  checkout FETCH_HEAD
Cache key is {owner}__{repo}-{commit[:12]} so different commits of the
same repo don't collide. RepoCacheError on any failure; half-populated
cache entries are cleaned up so the next attempt starts fresh.

eval/_repo_grounding.py (new): four helpers.
- extract_hints(text): regex-harvest .py paths + CamelCase + snake_case
  identifiers from issue text; stopwords filtered.
- rank_candidate_files(repo_path, hints, k): filesystem walk, score
  each .py file by (+3 stem match, +1 path match), tie-break by shorter
  path. No LLM calls.
- format_context_block(repo_path, paths, max_lines): tree listing +
  '## <path>' + fenced file content (truncated at max_lines).
- walk_tree_paths(repo_path): frozenset of relative paths, used as the
  sanitizer's tree oracle. Skips .git, __pycache__, .venv, etc.

eval/_patch_sanitizer.py: add optional tree_paths=frozenset[str] param
to sanitize(). When supplied, any file header path that does NOT exist
in the tree is rewritten to a unique basename match if exactly one
exists, or the entire patch is rejected (returns ""). /dev/null always
accepted. When tree_paths is None, behavior matches Phase A exactly;
all 19 prior tests stay green.

eval/swebench_phase2.py:
- --grounding flag (default off) and --cache-dir (default
  .cache/swebench) added.
- New Grounding dataclass + _build_grounding() called per-instance when
  --grounding is on. Failures fall back to Grounding.none() with a
  diagnostic; the run continues without grounding for that instance.
- _format_task takes an optional grounding_block that's appended as
  'Repository context:'. It reaches all organism stages since task
  propagates unchanged across stages.
- _sanitize_for_submission takes tree_paths and passes it through.
- run_baseline/run_organism/run_langgraph gain a grounding parameter.

pyproject.toml: eval extras now require datasets + swebench (the Phase
2 script already needed both; just formalized).

Tests (+44 new, 72 pass total across the sanitizer/extraction/identity/
repo-cache/repo-grounding suites):
- test_repo_cache.py (7): cache hit path, first-call git sequence,
  error surfacing, commit-key separation, malformed inputs, cleanup on
  partial failure.
- test_repo_grounding.py (15): hint extraction (paths, CamelCase,
  snake_case, stopwords, dedup), ranking (stem > path, tie-break,
  VCS-skip), context-block formatting (tree+snippets, truncation,
  empty case), tree walking.
- test_patch_sanitizer.py (+6): fuzzy rewrite on unique basename match,
  reject on ambiguous/no match, /dev/null accepted, prior behavior
  preserved when tree_paths=None.

Verification:
- 72/72 pytest.
- python -m eval.swebench_phase2 --rewrite-envelope ... still
  round-trips with post_run_check.status=match.
- End-to-end real clone: astropy/astropy at a Phase-2 instance's
  base_commit produced 5 candidate files (all under astropy/modeling/
  for a modeling issue) and a 30KB context block.

Out of scope (Phase C later): retry-on-format-error, LLM-localized
candidate selection, formal answer-leak audit, actual benchmark rerun.
Phase C decisions gate on Phase B's observed effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
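The ranking heuristic described for rank_candidate_files (+3 stem match, +1 path match, shorter path wins ties) can be sketched roughly as below. This is an illustrative reconstruction, not the code in eval/_repo_grounding.py: the name rank_candidates, the SKIP_DIRS set, and the choice to award either the stem bonus or the path bonus per hint (not both) are assumptions.

```python
import os
from pathlib import Path

# Assumed skip list; the commit message names .git, __pycache__, .venv, "etc."
SKIP_DIRS = {".git", "__pycache__", ".venv"}


def rank_candidates(repo_path, hints, k=5):
    """Return up to k relative .py paths, best score first (hypothetical helper)."""
    lowered = [h.lower() for h in hints]
    scored = []
    for root, dirs, files in os.walk(repo_path):
        # Prune skipped directories in place so os.walk never descends into them.
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
        for name in files:
            if not name.endswith(".py"):
                continue
            rel = Path(root, name).relative_to(repo_path).as_posix()
            stem = Path(name).stem.lower()
            score = 0
            for hint in lowered:
                if hint == stem:
                    score += 3  # hint names the file itself
                elif hint in rel.lower():
                    score += 1  # hint appears somewhere in the path
            if score:
                # Negative score sorts best-first; path length breaks ties.
                scored.append((-score, len(rel), rel))
    scored.sort()
    return [rel for _, _, rel in scored[:k]]
```

Files that match no hint are dropped entirely in this sketch, so the result may contain fewer than k entries.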
MEDIUM eval/_patch_sanitizer.py (over-aggressive fuzzy correction) —
_fuzzy_correct_paths previously forced every non-/dev/null file-header
path to exist in tree_paths (the repo at base_commit). That's wrong
for any patch where a target path legitimately doesn't exist yet:
file creation (--- /dev/null; +++ b/new_path), rename (rename to /
+++ b/new_path), and copy (copy to / +++ b/new_path). Worst case, the
oracle would silently rewrite a new file's path to some unrelated
unique-basename match, turning a valid create into a wrong modify.
Fix: stateful per-file-diff correction.
- _split_file_diffs groups lines by `diff --git` / `---` boundaries.
- _correct_file_diff classifies each block as create / rename / copy
/ delete / modify based on /dev/null markers and rename-from/to,
copy-from/to metadata.
- Source-side paths (--- a/, rename from, copy from, diff --git
a-side when not a create) are oracle-checked and rejected on fail.
- Target-side paths in create/rename/copy blocks pass through
unchanged — they legitimately don't exist at base_commit.
- In modify blocks, any correction applied to --- a/X is mirrored on
+++ b/X and the diff --git line so paths stay internally
consistent.
7 new tests in test_patch_sanitizer.py covering:
- create target absent from tree → accepted unchanged, not rewritten
- rename target not oracle-checked (source still is)
- copy target not oracle-checked (source still is)
- new file not silently rewritten to unrelated unique basename match
(the specific regression #724 called out)
- delete source must exist (still enforced)
- delete with absent source still rejected
- modify correction mirrors source → target → diff --git
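As a rough illustration of the per-block classification and source-side oracle check described above, the following sketch classifies one file-diff block from its header lines and resolves a source path against the tree. The names classify_block and correct_source are hypothetical, and the real _correct_file_diff likely handles more metadata than this.

```python
def classify_block(lines):
    """Classify one file-diff block as create/rename/copy/delete/modify.

    Only header lines (everything before the first '@@') are inspected,
    so hunk-body text can never influence the classification.
    """
    header = []
    for line in lines:
        if line.startswith("@@"):
            break
        header.append(line)
    if any(l.startswith(("rename from ", "rename to ")) for l in header):
        return "rename"
    if any(l.startswith(("copy from ", "copy to ")) for l in header):
        return "copy"
    if any(l.startswith("--- /dev/null") for l in header):
        return "create"
    if any(l.startswith("+++ /dev/null") for l in header):
        return "delete"
    return "modify"


def correct_source(path, tree_paths):
    """Return the path to use for a source-side header, or None to reject.

    A path already in the tree is kept; otherwise it is rewritten only
    when exactly one tree path shares its basename.
    """
    if path in tree_paths:
        return path
    base = path.rsplit("/", 1)[-1]
    matches = [p for p in tree_paths if p.rsplit("/", 1)[-1] == base]
    return matches[0] if len(matches) == 1 else None
```

In this sketch a caller would run correct_source only on source-side paths, and skip it for the target side of create, rename, and copy blocks.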
MEDIUM eval/_repo_grounding.py (prompt-injection via unescaped fence)
— format_context_block wrapped each file in a fixed ```python ...
``` fence. A file whose content contains ``` (very common: docstring
markdown examples) would close the fence early, leaking the remainder
of the file into the prompt as bare instructions. Untrusted repo
content becoming prompt content is a prompt-injection surface.
Fix: _safe_fence scans each file's content for the longest run of
backticks and uses an outer fence one backtick longer (minimum 3).
A closing fence must be at least as long as the opening fence, so the
outer fence cannot be closed by any shorter backtick run inside the
content.
2 new tests in test_repo_grounding.py: triple-backtick file yields
outer fence ≥4 backticks; quadruple/longer runs correctly adapt.
81/81 tests pass across all touched suites.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
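A minimal sketch of the fence-length rule described for _safe_fence, assuming it only needs the longest backtick run in the content. The function name safe_fence and its signature are illustrative, not the project's actual API.

```python
import re


def safe_fence(content: str, minimum: int = 3) -> str:
    """Return a backtick fence strictly longer than any backtick run in content."""
    runs = re.findall(r"`+", content)
    longest = max((len(run) for run in runs), default=0)
    # One backtick longer than the longest inner run, but never fewer than `minimum`.
    return "`" * max(minimum, longest + 1)
```

A caller would then wrap a file as fence + "python\n" + content + "\n" + fence, so no run inside the file can terminate the block early.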
MEDIUM eval/_patch_sanitizer.py — _split_file_diffs used a sticky
in_block flag that was set True on the first bare `---` header and
never cleared. Subsequent `--- a/<file2>` / `--- a/<file3>` headers
in a bare multi-file diff (the form extract_patch explicitly supports)
therefore collapsed into one block. _correct_file_diff would then
apply a single create/rename/copy/modify classification across
multiple files and leak one file's source_rewrite onto another —
misclassifying mixed modify/create sections or leaving later `+++`
headers unmirrored.
Fix: replace the sticky flag with per-block state. A `---` header
starts a new block when the current block has already emitted its
own `+++` (which signals the previous file's definition is complete).
`diff --git` always starts a new block; `--- /dev/null`
seen before any `+++` in the current block is still part of that
block (the create marker after a `diff --git` prelude).
Rule:
new block iff:
- line starts with `diff --git`, OR
- line starts with `---` AND (no current block OR current block
has seen `+++`)
4 new tests in test_patch_sanitizer.py covering:
- bare multi-file modify: each file oracle-checked independently
- bare multi-file mixed (modify + create): create target passes
through while modify source gets oracle-checked
- bare multi-file with bad source in second file: unique-basename
rewrite applied to second only, first block untouched
- bare multi-file rejects whole patch if any source unresolvable
85/85 tests pass across all touched suites.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
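The block-splitting rule from this commit can be sketched as follows. This is a simplified illustration under the stated rule only (the function name split_file_diffs is hypothetical), and it deliberately ignores the hunk-body ambiguity that the later commits in this PR address.

```python
def split_file_diffs(lines):
    """Group diff lines into per-file blocks using per-block seen_plus state."""
    blocks = []
    current = None
    seen_plus = False
    for line in lines:
        starts_block = line.startswith("diff --git ") or (
            line.startswith("--- ") and (current is None or seen_plus)
        )
        if starts_block:
            current = []
            blocks.append(current)
            seen_plus = False  # state is reset for every new block
        if current is None:
            # Preamble text before any header gets a block of its own.
            current = []
            blocks.append(current)
        current.append(line)
        if line.startswith("+++ "):
            seen_plus = True
    return blocks
```

Because seen_plus is cleared whenever a block starts, a `--- /dev/null` line that follows a `diff --git` prelude stays in the same block, while a `---` line after a completed `+++` opens the next file.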
MEDIUM eval/_patch_sanitizer.py — _starts_new_block treated any line
matching '--- ' as a file-header boundary. But in a unified diff, a
deletion of content starting with '-- ' is encoded on the wire as
'--- ...' (the leading '-' is the delete marker). So a hunk body
containing a deleted line whose text starts with '-- a/foo.py' or
'---' (e.g. YAML frontmatter) would be split mid-hunk, and
_correct_file_diff and _validate_hunks would then misinterpret the
body text as a file-header shape.
Fix: add _is_minus_file_header(line, next_line) as the single
disambiguation point. A real '---' file header has TWO properties:
(1) the path after '--- ' matches 'a/<path>' or '/dev/null'
(2) the very next line is '+++ '
A hunk-body deletion of content shaped like '--- a/foo.py' is always
followed by another body line, never by '+++ '. Use this helper in
both _split_file_diffs (boundary detection) and _validate_hunks
(hunk-body termination).
Extra hardening inside _correct_file_diff: the source-path correction
and line-rewrite passes only apply BEFORE the first '@@' hunk header.
Everything after the first '@@' is hunk-body content and must pass
through verbatim — even when it happens to match the '--- a/...' or
'+++ b/...' shape.
2 new tests in test_patch_sanitizer.py:
- hunk deletes '-- a/foo.py' content → single-file modify survives
- YAML frontmatter deletion ('---' content on both sides) → passes
87/87 tests pass across sanitizer / extraction / identity /
repo-cache / repo-grounding suites.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… disambiguation
MEDIUM eval/_patch_sanitizer.py — _is_minus_file_header asked only
that the next line startswith('+++ '). A hunk where the deletion is
'-- a/foo.py' (wire: '--- a/foo.py') and the addition is '++ plus'
(wire: '+++ plus') would therefore still be misread as a file-header
pair. The sibling line must match a REAL +++ file-header shape —
'+++ b/<path>' or '+++ /dev/null' (with optional tab + timestamp).
Fix: added _is_plus_file_header(line) with a regex that requires
'+++ b/' or '+++ /dev/null' after the '+++ ' prefix. _is_minus_file_header
now calls it instead of checking startswith. An addition of arbitrary
content (even if shaped '+++ some content') no longer triggers a
false file-header match.
LOW _repair_bare_empty_context used the old naive check
(the line starts with '--- ' or '+++ ') to decide when to leave hunk
mode. A valid hunk with an ambiguous '---' deletion followed by a
whitespace-stripped blank context line would leave hunk mode on the
ambiguous line, so the blank would never get repaired, and validate
would then reject the patch.
Fix: _repair_bare_empty_context now calls _is_minus_file_header and
_is_plus_file_header too — the same lookahead used by the splitter
and validator. All three sanitizer passes now interpret ambiguous
'---'/'+++' body lines consistently.
Also tightened _validate_hunks: replaced the old
'startswith("+++ ") and not startswith("+++ b/") and != "+++ /dev/null"'
check with a direct call to _is_plus_file_header.
2 new tests:
- hunk body with '--- a/foo.py' delete + '+++ some content' add
(ambiguous pair) survives as a single-file modify
- hunk with ambiguous delete followed by blank context line gets
the blank repaired to ' ' before validation
89/89 tests pass across sanitizer / extraction / identity /
repo-cache / repo-grounding suites.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
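The two shape checks described above might look roughly like this. The regular expressions are assumptions reconstructed from the prose ('a/<path>' or '/dev/null' on the minus side, 'b/<path>' or '/dev/null' on the plus side, optional tab plus timestamp); the real helpers may differ.

```python
import re

# Assumed shapes; an optional "\t<timestamp>" suffix is tolerated.
_MINUS_HEADER = re.compile(r"^--- (a/\S.*|/dev/null)(\t.*)?$")
_PLUS_HEADER = re.compile(r"^\+\+\+ (b/\S.*|/dev/null)(\t.*)?$")


def is_plus_file_header(line):
    """True only for '+++ b/<path>' or '+++ /dev/null' shaped lines."""
    return bool(_PLUS_HEADER.match(line))


def is_minus_file_header(line, next_line):
    """True when line is '---'-header shaped AND the next line is a real '+++' header."""
    if not _MINUS_HEADER.match(line):
        return False
    return next_line is not None and is_plus_file_header(next_line)
```

These checks are shape-only, so they are suitable outside hunk bodies; inside a declared hunk body the same shapes can still be ordinary deletions and additions.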
…uation loop
MEDIUM eval/_patch_sanitizer.py — the shape-only ``_is_plus_file_header``
still matched arbitrary hunk-body additions whose content happened to
start with ``++ b/`` or ``++ /dev/null`` (wire: ``+++ b/...`` and
``+++ /dev/null``). Paired with a ``-- a/...`` deletion, such content
would even satisfy the ``_is_minus_file_header`` lookahead. The only
structurally-correct distinction between a file-header pair and a
body pair is POSITION relative to the declared ``@@ -a,b +c,d @@``
counts: inside the declared body, every line's role is decided by
its first character alone.
Fix: introduce ``_parse_hunk_counts`` and ``_scan_hunk_extent``, a
shared count-driven body walker. ``_scan_hunk_extent`` consumes
exactly ``(old_count, new_count)`` body lines, decrementing the
counters based on the leading character (``+``, ``-``, ``' '`` /
empty). A line shaped ``--- a/foo.py`` inside a body is an unambiguous
deletion; ``+++ b/bar.py`` inside a body is an unambiguous addition.
Refactored three callers to use the scanner:
* ``_split_file_diffs`` — when at ``@@``, appends the exact body
extent verbatim and resumes boundary detection only afterwards.
* ``_repair_bare_empty_context`` — repairs blanks only within the
declared extent; no shape-based exit heuristic.
* ``_validate_hunks`` — delegates body consumption to the scanner;
``ok`` return tells us whether counts balanced.
The shape-only helpers ``_is_plus_file_header`` and
``_is_minus_file_header`` are retained for OUTSIDE-hunk boundary
detection in ``_split_file_diffs``, where they remain correct
because outside a hunk body there's no competing "body content"
interpretation.
3 new tests covering the review #733 adversarial cases:
* body add shaped ``+++ b/foo.py`` (wire of ``++ b/foo.py`` content)
* body pair ``--- a/foo.py`` + ``+++ b/bar.py`` (matching both
halves of a file-header pair by shape)
* body add shaped ``+++ /dev/null``
92/92 tests pass across all touched suites.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
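A count-driven body walker of the kind described for _parse_hunk_counts and _scan_hunk_extent could be sketched like this. The names, the return shape (end_index, ok), and the handling of the "\ No newline" marker are assumptions for illustration; per the unified diff format, an omitted count defaults to 1.

```python
import re

_HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_counts(header):
    """Return (old_count, new_count) from an '@@ -a,b +c,d @@' line, or None."""
    m = _HUNK_RE.match(header)
    if not m:
        return None
    old = int(m.group(2)) if m.group(2) is not None else 1
    new = int(m.group(4)) if m.group(4) is not None else 1
    return old, new


def scan_hunk_extent(lines, start):
    """Walk the body of the hunk whose header is lines[start].

    Returns (end_index, ok): end_index is the first line after the
    consumed body, ok is True when both counts drained exactly.
    """
    counts = parse_hunk_counts(lines[start])
    if counts is None:
        return start + 1, False
    old, new = counts
    i = start + 1
    while (old > 0 or new > 0) and i < len(lines):
        tag = lines[i][:1]
        if tag == "\\":
            i += 1  # "\ No newline at end of file" consumes no counts
            continue
        if tag == "-":
            old -= 1
        elif tag == "+":
            new -= 1
        elif tag in (" ", ""):
            old -= 1  # context (or a bare blank line) counts on both sides
            new -= 1
        else:
            return i, False
        if old < 0 or new < 0:
            return i, False
        i += 1
    return i, old == 0 and new == 0
```

Only the first character decides a body line's role here, so a line such as `--- a/foo.py` inside the declared extent is consumed as a deletion rather than treated as a header.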
HIGH eval/_patch_sanitizer.py — the count-driven _scan_hunk_extent
stopped as soon as (old_count, new_count) reached zero. Extra body
lines AFTER that boundary were silently left in the patch as
"between-hunks content". A declared '@@ -1 +1 @@' with body
'-old / +new / +extra' would validate, but git apply would still
choke on the dangling '+extra'. That's the exact failure class the
validator is supposed to prevent.
Fix: after counts drain, _scan_hunk_extent now additionally requires
the NEXT line (if any) to be a legal post-hunk boundary: another
'@@', a '---' / '+++' file header, 'diff --git', a git metadata
line, or EOF. Body-shaped content ('+', '-', ' ', or empty) after
counts drain returns ok=False.
A trailing '\ No newline at end of file' marker is consumed first
(it applies to the hunk's final body line) before the boundary
check, so its presence doesn't trigger a false-overlong rejection.
4 new tests:
* overlong by addition (extra '+' line) — rejected
* overlong by addition in grounding mode — rejected
* overlong by context (extra ' ' line) — rejected
* trailing '\ No newline at end of file' marker — accepted
96/96 tests pass across all touched suites.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HIGH eval/_patch_sanitizer.py — _is_post_hunk_boundary accepted any
line shaped like a ---/+++ file header based on shape alone. Body
content shaped '--- a/foo.py' (wire form of deletion '-- a/foo.py')
or '+++ b/bar.py' (wire form of addition '++ b/bar.py') after counts
drain therefore passed as a valid boundary rather than being
rejected as overlong body content. The validator would accept the
patch, but git apply would still reject.
Fix: tightened _is_post_hunk_boundary. Unambiguous accepts (no body
line can fake these shapes):
* EOF
* '@@' (next hunk of same file)
* 'diff --git ' (new file-diff)
* git metadata prefixes ('index ', 'new file mode', 'rename from',
etc.) — these appear between 'diff --git' and the '---' pair, so
they imply a new file
For bare multi-file diffs without 'diff --git', a legitimate file
boundary requires the full TRIPLET '---' + '+++' + '@@' on three
consecutive lines. A body-content imposter pair '-- a/X' + '++ b/X'
followed by arbitrary body content would not put '@@' at the third
position.
Shape-only '---' or '+++' after counts drain (with no '@@' at the
expected position) is now treated as overlong body content and
rejects the patch.
5 new tests:
* extra '+++ b/bar.py' after counts (overlong by addition) — reject
* extra '--- a/foo.py' after counts (overlong by deletion) — reject
* extra '+++ /dev/null' after counts (overlong /dev/null-shaped
addition) — reject
* legitimate bare multi-file triplet — accept
* two-line --- / +++ pair without following @@ — reject (overlong
body content, not a real boundary)
101/101 tests pass across all touched suites.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
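The tightened boundary rule can be illustrated with a small predicate. This is a sketch under the rules listed in the commit message; the function signature and the exact metadata prefix list are assumptions (the prefixes shown are standard git extended-header lines).

```python
# Assumed prefix list; the commit message names 'index ', 'new file mode',
# 'rename from', "etc."
GIT_METADATA_PREFIXES = (
    "index ", "new file mode", "deleted file mode", "old mode", "new mode",
    "similarity index", "rename from", "rename to", "copy from", "copy to",
)


def is_post_hunk_boundary(lines, i):
    """True when lines[i] may legally follow a hunk whose counts have drained."""
    if i >= len(lines):
        return True  # EOF
    line = lines[i]
    if line.startswith("@@") or line.startswith("diff --git "):
        return True
    if line.startswith(GIT_METADATA_PREFIXES):
        return True
    # Bare multi-file diff: require the full '---' / '+++' / '@@' triplet.
    if line.startswith("--- ") and i + 2 < len(lines):
        return lines[i + 1].startswith("+++ ") and lines[i + 2].startswith("@@")
    return False
```

Anything else after the counts drain, including a lone `+++ b/bar.py` or a `---` / `+++` pair with no `@@` behind it, is reported as not a boundary, which a validator would treat as overlong body content.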
coredipper added a commit that referenced this pull request on Apr 17, 2026
…ity ceiling

Reruns the inconclusive 2026-04-16 SWE-bench Phase 2 experiment using
the patch-apply pipeline shipped in PR #53 (sanitizer + tightened
prompts) and PR #55 (repo grounding + fuzzy path correction + hardened
diff parser). Same model (gemma4:latest, digest c6eb396dbd59,
8B/Q4_K_M), same 10 instances, --grounding active.

Result: 1/30 reaches harness verdict (django-11001 baseline,
unresolved). 29/30 honest empty_patch (sanitizer refused malformed
output). Zero `error` outcomes.

Old vs new outcome distribution per condition:
  baseline   1/10 evaluated, 9 error                 -> 1/10 evaluated, 9 empty_patch
  organism   1/10 evaluated, 5 error + 4 empty_patch -> 0/10 evaluated, 10 empty_patch
  langgraph  0/10 evaluated, 6 error + 4 empty_patch -> 0/10 evaluated, 10 empty_patch

The original run confounded two failure modes: model selecting wrong
file paths (a) and model writing malformed unified diffs (b).
Grounding solves (a). The fact that 29 of 30 still drop to empty_patch
(sanitizer rejecting placeholder hunks, malformed counts, and invented
paths) localizes the bottleneck to (b). At 8B/Q4_K_M, diff-format
discipline is the binding constraint, not file selection.

Side observation: the original organism vs baseline gap (4 vs 0
empty_patch) closed under the new pipeline. The Phase A "[edit] stage
emits a single fenced diff and nothing else" instruction eliminated
the multi-stage format leak.

Latency cost: grounding ~doubles wall-clock per call (baseline 44s ->
104s; organism 88s -> 170s; langgraph 90s -> 172s) for zero
evaluated-rate gain at this model scale.

Files:
- eval/results/swebench_phase2.json: regenerated by the writer
  (sanitizer + grounding + harness=ok, post_run_check=match)
- article/paper5/sections/06-experiments.tex: §6.3 retitled "8B Format
  Discipline Is the Ceiling"; new outcome table; before/after
  comparison; reframed interpretation
- article/paper5/sections/07-conclusion.tex: limitations item updated
  to reference the grounded-rerun confirmation

9/9 schema tests still pass. PDF rebuild deferred to user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coredipper added a commit that referenced this pull request on Apr 17, 2026
…update)

Captures the work landed on this branch:
- PR #53 (already merged): patch sanitizer + tightened prompts
  (Phase A)
- PR #55 (already merged): repo grounding + fuzzy path correction +
  hardened diff parser (Phase B)
- PR #56 (this): grounded rerun artifact + paper §6.3 reframed as
  "8B format-discipline ceiling"; runtime_error vs sanitizer-rejected
  classification (reviews #747, #748)

Bumps:
- pyproject.toml 0.34.4 → 0.34.5
- operon_ai/__init__.py 0.34.4 → 0.34.5
- README.md badge v0.34.4 → v0.34.5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Supersedes #54 (auto-closed when #53 merged and its base branch was deleted). All 8 commits are the same; review chain #724–#736 is closed and #738 came back Pass.
Summary
- When `--grounding` is set, each instance's `{repo}@{base_commit}` is shallow-fetched into `.cache/swebench/`. Heuristics mine the issue text for hints, rank Python files, and inject a tree listing + top-5 candidate file contents into the task prompt. The grounded task propagates to every organism stage unchanged.
- `sanitize(..., tree_paths=...)` applies stateful, per-file-diff correction. Source-side paths that don't exist are rewritten to unique basename matches or reject the patch; target-side paths of create / rename / copy pass through unchanged. Modify blocks mirror source corrections onto target and `diff --git`.
- Bare multi-file diffs are split per file (per-block `seen_plus` state).
- Hunk-body lines shaped like `---` or `+++` are never misparsed as file headers (count-driven `_scan_hunk_extent`).
- The validator returns `ok=False` when body lines spill past declared counts.
- A post-hunk boundary must be EOF, `@@` / `diff --git` / git-metadata, or a real bare-diff `---` + `+++` + `@@` triplet; shape-only `---` / `+++` lines after counts drain are treated as overlong body content.
- `--grounding` is off by default. Without it, behavior matches Phase A byte-for-byte.
Implementation
New files:
- `eval/_repo_cache.py`: `ensure_repo_at(repo_slug, base_commit, cache_dir) -> Path`. Cache key `{owner}__{repo}-{commit[:12]}`. Cleans up on partial failure.
- `eval/_repo_grounding.py`: `extract_hints`, `rank_candidate_files`, `format_context_block`, `walk_tree_paths`. Pure filesystem + regex.
Extended:
- `eval/_patch_sanitizer.py`: `sanitize(..., tree_paths=...)` optional arg. Count-driven `_scan_hunk_extent` as single source of truth for hunk-body boundaries. Stateful `_correct_file_diff` classifies each file-diff block (create / rename / copy / delete / modify).
- `eval/swebench_phase2.py`: `--grounding` / `--cache-dir` flags, `Grounding` dataclass with `.none()` sentinel, per-instance `_build_grounding` call that degrades gracefully on any failure, pass-through of `grounding` to all three `run_*` conditions.
- `pyproject.toml`: `eval` extras now require `datasets` + `swebench`.
Answer-leak posture: grounding only reads `problem_statement` + `hints_text` + repo files at `base_commit`. `instance["patch"]` and `instance["test_patch"]` are never accessed.
Test plan
- `pytest tests/unit/test_patch_sanitizer.py tests/unit/test_patch_extraction.py tests/unit/test_swebench_phase2_identity.py tests/unit/test_repo_cache.py tests/unit/test_repo_grounding.py`: 101 pass.
- `python -m eval.swebench_phase2 --rewrite-envelope eval/results/swebench_phase2.json`: round-trips with `post_run_check.status=match`.
- End-to-end real clone: astropy/astropy at a Phase-2 instance's `base_commit` produced a 30 KB context block with all 5 candidate files under `astropy/modeling/`.
- … `eval/results/swebench_phase2.json` + paper §6.3 rewrite) only if the numbers justify it.
🤖 Generated with Claude Code