
SWE-bench Phase 2 Phase B: repo-content grounding + fuzzy path correction #55

Merged: coredipper merged 8 commits into main from feat/swebench-phase2-grounding on Apr 17, 2026


@coredipper
Owner

Supersedes #54 (auto-closed when #53 merged and its base branch was deleted). All 8 commits are the same; review chain #724–#736 is closed and #738 came back Pass.

Summary

  • Real repo grounding. When --grounding is set, each instance's {repo}@{base_commit} is shallow-fetched into .cache/swebench/. Heuristics mine the issue text for hints, rank Python files, and inject a tree listing + top-5 candidate file contents into the task prompt. The grounded task propagates to every organism stage unchanged.
  • Sanitizer gains a tree oracle. sanitize(..., tree_paths=...) applies stateful, per-file-diff correction. A source-side path that doesn't exist is rewritten to its unique basename match; if there is no unique match, the patch is rejected. Target-side paths of create / rename / copy pass through unchanged. Modify blocks mirror source corrections onto target and diff --git.
  • Hardened diff parser (from review chain #724 → #726 → #729 → #730 → #733 → #735 → #736):
    • Bare multi-file diffs split correctly (per-block seen_plus state).
    • Hunk-body deletions shaped --- or +++ never misparsed as file headers (count-driven _scan_hunk_extent).
    • Overlong hunks rejected with ok=False when body lines spill past declared counts.
    • Post-hunk boundaries require either @@ / diff --git / git-metadata, or a real bare-diff --- + +++ + @@ triplet — shape-only ---/+++ lines after counts drain are treated as overlong body content.
  • Prompt-injection hardening: file content fences use dynamic backtick runs longer than any sequence in the content; an included file with its own ``` cannot close the outer fence.
  • Opt-in + safe default. --grounding is off by default. Without it, behavior matches Phase A byte-for-byte.

Implementation

New files:

  • eval/_repo_cache.py: ensure_repo_at(repo_slug, base_commit, cache_dir) -> Path. Cache key {owner}__{repo}-{commit[:12]}. Cleans up on partial failure.
  • eval/_repo_grounding.py: extract_hints, rank_candidate_files, format_context_block, walk_tree_paths. Pure filesystem + regex.

Extended:

  • eval/_patch_sanitizer.py: sanitize(..., tree_paths=...) optional arg. Count-driven _scan_hunk_extent as single source of truth for hunk-body boundaries. Stateful _correct_file_diff classifies each file-diff block (create / rename / copy / delete / modify).
  • eval/swebench_phase2.py: --grounding / --cache-dir flags, Grounding dataclass with .none() sentinel, per-instance _build_grounding call that degrades gracefully on any failure, pass-through of grounding to all three run_* conditions.
  • pyproject.toml: eval extras now require datasets + swebench.

Answer-leak posture: grounding only reads problem_statement + hints_text + repo files at base_commit. instance["patch"] and instance["test_patch"] are never accessed.

Test plan

  • pytest tests/unit/test_patch_sanitizer.py tests/unit/test_patch_extraction.py tests/unit/test_swebench_phase2_identity.py tests/unit/test_repo_cache.py tests/unit/test_repo_grounding.py — 101 pass.
  • python -m eval.swebench_phase2 --rewrite-envelope eval/results/swebench_phase2.json — round-trips with post_run_check.status=match.
  • End-to-end real clone: astropy/astropy at a real SWE-bench-lite base_commit produced a 30 KB context block with all 5 candidate files under astropy/modeling/.
  • Actual SWE-bench rerun — deferred; will land as a separate PR (updated eval/results/swebench_phase2.json + paper §6.3 rewrite) only if the numbers justify it.

🤖 Generated with Claude Code

coredipper and others added 8 commits April 17, 2026 00:42
…rrection (Phase B)

Phase A (PR #53) eliminated deterministic apply-failure modes in pure
Python. It cannot fix content errors — sanitized patches still have to
hit the right lines of the right file. Phase B grounds the prompt and
the sanitizer in the actual repository at {repo}@{base_commit}.

Changes:

eval/_repo_cache.py (new): ensure_repo_at(repo_slug, base_commit,
cache_dir) shallow-fetches a single commit via git CLI:
  git init / remote add origin / fetch --depth 1 origin {sha}
  / checkout FETCH_HEAD
Cache key is {owner}__{repo}-{commit[:12]} so different commits of the
same repo don't collide. RepoCacheError on any failure; half-populated
cache entries are cleaned up so the next attempt starts fresh.
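
A minimal sketch of the fetch sequence and cache key described above, not the code in eval/_repo_cache.py. The GitHub URL scheme, the helper names cache_key and git_commands, and the --quiet flags are assumptions made for illustration; only the four git steps, the key format, and RepoCacheError come from this message.

```python
import shutil
import subprocess
from pathlib import Path


class RepoCacheError(RuntimeError):
    """Raised when the shallow fetch fails (name taken from the commit message)."""


def cache_key(repo_slug: str, base_commit: str) -> str:
    # {owner}__{repo}-{commit[:12]}: distinct commits of one repo get distinct dirs.
    owner, repo = repo_slug.split("/", 1)
    return f"{owner}__{repo}-{base_commit[:12]}"


def git_commands(repo_slug: str, base_commit: str) -> list[list[str]]:
    # Assumption: repos are fetched from GitHub over HTTPS.
    url = f"https://github.com/{repo_slug}.git"
    return [
        ["git", "init", "--quiet"],
        ["git", "remote", "add", "origin", url],
        ["git", "fetch", "--depth", "1", "origin", base_commit],
        ["git", "checkout", "--quiet", "FETCH_HEAD"],
    ]


def ensure_repo_at(repo_slug: str, base_commit: str, cache_dir: Path) -> Path:
    dest = Path(cache_dir) / cache_key(repo_slug, base_commit)
    if dest.is_dir():
        return dest  # cache hit: no git calls at all
    dest.mkdir(parents=True)
    try:
        for cmd in git_commands(repo_slug, base_commit):
            subprocess.run(cmd, cwd=dest, check=True, capture_output=True)
    except (OSError, subprocess.CalledProcessError) as exc:
        # Remove the half-populated entry so the next attempt starts fresh.
        shutil.rmtree(dest, ignore_errors=True)
        raise RepoCacheError(f"could not fetch {repo_slug}@{base_commit}") from exc
    return dest
```

Keeping the command list separate from the runner makes the first-call git sequence checkable without network access.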

eval/_repo_grounding.py (new): four helpers.
  - extract_hints(text): regex-harvest .py paths + CamelCase +
    snake_case identifiers from issue text; stopwords filtered.
  - rank_candidate_files(repo_path, hints, k): filesystem walk, score
    each .py file by (+3 stem match, +1 path match), tie-break by
    shorter path. No LLM calls.
  - format_context_block(repo_path, paths, max_lines): tree listing
    + '## <path>' + fenced file content (truncated at max_lines).
  - walk_tree_paths(repo_path): frozenset of relative paths, used as
    the sanitizer's tree oracle. Skips .git, __pycache__, .venv, etc.
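
The ranking rule can be sketched as follows. This is an illustration under assumptions, not the shipped rank_candidate_files: it takes a list of relative paths instead of walking the filesystem, and it reads "stem match" as a hint equal to the file stem and "path match" as a hint appearing anywhere else in the path. The names score_path and rank_candidates are hypothetical.

```python
from pathlib import PurePosixPath


def score_path(rel_path: str, hints: set[str]) -> int:
    """+3 when a hint equals the file stem, +1 when it only appears in the path."""
    stem = PurePosixPath(rel_path).stem.lower()
    lowered = rel_path.lower()
    score = 0
    for hint in hints:
        h = hint.lower()
        if h == stem:
            score += 3
        elif h in lowered:
            score += 1
    return score


def rank_candidates(py_paths: list[str], hints: set[str], k: int = 5) -> list[str]:
    scored = [(score_path(p, hints), p) for p in py_paths if p.endswith(".py")]
    scored = [(s, p) for s, p in scored if s > 0]
    # Highest score first; ties go to the shorter path, then alphabetical order.
    scored.sort(key=lambda sp: (-sp[0], len(sp[1]), sp[1]))
    return [p for _, p in scored[:k]]
```

With hints {"separable", "modeling"}, a file named astropy/modeling/separable.py scores 4 and outranks any file that matches only on its directory.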

eval/_patch_sanitizer.py: add optional tree_paths=frozenset[str] param
to sanitize(). When supplied, any file header path that does NOT exist
in the tree is rewritten to a unique basename match if exactly one
exists, or the entire patch is rejected (returns ""). /dev/null
always accepted. When tree_paths is None, behavior matches Phase A
exactly — all 19 prior tests stay green.
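
A sketch of the unique-basename rule for a single path, assuming the a/ or b/ prefix has already been stripped. The helper name correct_path and the use of None as the reject signal are illustrative choices, not the sanitizer's actual interface.

```python
import posixpath
from typing import Optional


def correct_path(path: str, tree_paths: frozenset[str]) -> Optional[str]:
    """Return a path that exists in the tree, or None to signal 'reject the patch'."""
    if path == "/dev/null" or path in tree_paths:
        return path
    base = posixpath.basename(path)
    matches = [p for p in tree_paths if posixpath.basename(p) == base]
    # Rewrite only when the basename is unambiguous; zero or many matches reject.
    return matches[0] if len(matches) == 1 else None
```

Rejecting on ambiguity keeps the correction conservative: a wrong rewrite would turn a near-miss into a patch against the wrong file.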

eval/swebench_phase2.py:
  - --grounding flag (default off) and --cache-dir (default
    .cache/swebench) added.
  - New Grounding dataclass + _build_grounding() called per-instance
    when --grounding is on. Failures fall back to Grounding.none()
    with a diagnostic; the run continues without grounding for that
    instance.
  - _format_task takes an optional grounding_block that's appended as
    'Repository context:' — reaches all organism stages since task
    propagates unchanged across stages.
  - _sanitize_for_submission takes tree_paths and passes it through.
  - run_baseline/run_organism/run_langgraph gain a grounding parameter.
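
One way the sentinel and the per-instance fallback could look. The field names context_block and tree_paths and the loader callback are assumptions for illustration; the message only specifies a Grounding dataclass, a .none() sentinel, and a diagnostic on failure.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Grounding:
    context_block: str = ""
    tree_paths: Optional[frozenset[str]] = None

    @classmethod
    def none(cls) -> "Grounding":
        # Sentinel: empty prompt block and no tree oracle, i.e. Phase A behavior.
        return cls()


def build_grounding(instance: dict, loader: Callable[[dict], Grounding]) -> Grounding:
    try:
        return loader(instance)
    except Exception as exc:
        # Any failure degrades to an ungrounded run for this instance only.
        print(f"grounding skipped for {instance.get('instance_id')}: {exc}")
        return Grounding.none()
```

Because the sentinel is an ordinary Grounding value, the run_* functions need no special case for the ungrounded path.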

pyproject.toml: eval extras now require datasets + swebench (the
Phase 2 script already needed both; just formalized).

Tests (+44 new, 72 pass total across the sanitizer/extraction/identity/
repo-cache/repo-grounding suites):
  - test_repo_cache.py (7): cache hit path, first-call git sequence,
    error surfacing, commit-key separation, malformed inputs, cleanup
    on partial failure.
  - test_repo_grounding.py (15): hint extraction (paths, CamelCase,
    snake_case, stopwords, dedup), ranking (stem > path, tie-break,
    VCS-skip), context-block formatting (tree+snippets, truncation,
    empty case), tree walking.
  - test_patch_sanitizer.py (+6): fuzzy rewrite on unique basename
    match, reject on ambiguous/no match, /dev/null accepted, prior
    behavior preserved when tree_paths=None.

Verification:
  - 72/72 pytest.
  - python -m eval.swebench_phase2 --rewrite-envelope ... still
    round-trips with post_run_check.status=match.
  - End-to-end real clone: astropy/astropy at a Phase-2 instance's
    base_commit produced 5 candidate files (all under astropy/modeling/
    for a modeling issue) and a 30KB context block.

Out of scope (Phase C later): retry-on-format-error, LLM-localized
candidate selection, formal answer-leak audit, actual benchmark
rerun. Phase C decisions gate on Phase B's observed effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MEDIUM eval/_patch_sanitizer.py (over-aggressive fuzzy correction) —
_fuzzy_correct_paths previously forced every non-/dev/null file-header
path to exist in tree_paths (the repo at base_commit). That's wrong
for any patch where a target path legitimately doesn't exist yet:
file creation (--- /dev/null; +++ b/new_path), rename (rename to /
+++ b/new_path), and copy (copy to / +++ b/new_path). Worst case, the
oracle would silently rewrite a new file's path to some unrelated
unique-basename match, turning a valid create into a wrong modify.

Fix: stateful per-file-diff correction.
- _split_file_diffs groups lines by `diff --git` / `---` boundaries.
- _correct_file_diff classifies each block as create / rename / copy
  / delete / modify based on /dev/null markers and rename-from/to,
  copy-from/to metadata.
- Source-side paths (--- a/, rename from, copy from, diff --git
  a-side when not a create) are oracle-checked and rejected on fail.
- Target-side paths in create/rename/copy blocks pass through
  unchanged — they legitimately don't exist at base_commit.
- In modify blocks, any correction applied to --- a/X is mirrored on
  +++ b/X and the diff --git line so paths stay internally
  consistent.
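
The classification step can be sketched like this, working only on the header lines that precede a block's first '@@'. The function names and the precedence order (rename, copy, create, delete, modify) are assumptions; the message says only that classification uses /dev/null markers and rename/copy metadata.

```python
def classify_block(header_lines: list[str]) -> str:
    """Classify one file-diff block from the lines before its first '@@'."""
    minus = next((l for l in header_lines if l.startswith("--- ")), "")
    plus = next((l for l in header_lines if l.startswith("+++ ")), "")
    if any(l.startswith("rename from ") for l in header_lines):
        return "rename"
    if any(l.startswith("copy from ") for l in header_lines):
        return "copy"
    # Drop an optional tab + timestamp before comparing against /dev/null.
    if minus.split("\t")[0].rstrip() == "--- /dev/null":
        return "create"
    if plus.split("\t")[0].rstrip() == "+++ /dev/null":
        return "delete"
    return "modify"


def target_passes_through(kind: str) -> bool:
    # Targets of create / rename / copy legitimately do not exist at base_commit.
    return kind in {"create", "rename", "copy"}
```

Only blocks where target_passes_through is False have their target path tied to the oracle-checked source path.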

7 new tests in test_patch_sanitizer.py covering:
  - create target absent from tree → accepted unchanged, not rewritten
  - rename target not oracle-checked (source still is)
  - copy target not oracle-checked (source still is)
  - new file not silently rewritten to unrelated unique basename match
    (the specific regression #724 called out)
  - delete source must exist (still enforced)
  - delete with absent source still rejected
  - modify correction mirrors source → target → diff --git

MEDIUM eval/_repo_grounding.py (prompt-injection via unescaped fence)
— format_context_block wrapped each file in a fixed ```python ...
``` fence. A file whose content contains ``` (very common: docstring
markdown examples) would close the fence early, leaking the remainder
of the file into the prompt as bare instructions. Untrusted repo
content becoming prompt content is a prompt-injection surface.

Fix: _safe_fence scans each file's content for the longest run of
backticks and uses an outer fence one backtick longer (minimum 3).
A Markdown closing fence must be at least as long as the opening
fence, so the outer fence cannot be closed by any shorter backtick
run inside the content.
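
A minimal sketch of the fence-length rule, assuming a regex scan for backtick runs. The helper fence_file and its lang parameter are hypothetical additions to show the fence in use.

```python
import re


def safe_fence(content: str) -> str:
    """Return a backtick fence longer than any backtick run in content (min 3)."""
    longest = max((len(run) for run in re.findall(r"`+", content)), default=0)
    return "`" * max(3, longest + 1)


def fence_file(content: str, lang: str = "python") -> str:
    fence = safe_fence(content)
    body = content.rstrip("\n")
    # Opening and closing fences have equal length, longer than any inner run.
    return f"{fence}{lang}\n{body}\n{fence}"
```

A file containing a triple-backtick docstring example therefore gets a four-backtick outer fence, and the inner run is rendered as content.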

2 new tests in test_repo_grounding.py: triple-backtick file yields
outer fence ≥4 backticks; quadruple/longer runs correctly adapt.

81/81 tests pass across all touched suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MEDIUM eval/_patch_sanitizer.py — _split_file_diffs used a sticky
in_block flag that was set True on the first bare `---` header and
never cleared. Subsequent `--- a/<file2>` / `--- a/<file3>` headers
in a bare multi-file diff (the form extract_patch explicitly supports)
therefore collapsed into one block. _correct_file_diff would then
apply a single create/rename/copy/modify classification across
multiple files and leak one file's source_rewrite onto another —
misclassifying mixed modify/create sections or leaving later `+++`
headers unmirrored.

Fix: replace the sticky flag with per-block state. A `---` header
starts a new block when the current block has already emitted its
own `+++` (which signals the previous file's definition is complete).
`diff --git` unambiguously always starts a new block; `--- /dev/null`
seen before any `+++` in the current block is still part of that
block (the create marker after a `diff --git` prelude).

Rule:
  new block iff:
    - line starts with `diff --git`, OR
    - line starts with `---` AND (no current block OR current block
      has seen `+++`)
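
The rule above, sketched as a standalone splitter. This mirrors only the rule as stated in this commit: it is shape-based and does not yet handle '---' lines inside hunk bodies, which later commits in the chain address. The function name is illustrative.

```python
def split_file_diffs(lines: list[str]) -> list[list[str]]:
    blocks: list[list[str]] = []
    seen_plus = False  # per-block: has the current block emitted its '+++'?
    for line in lines:
        starts_new = line.startswith("diff --git ") or (
            line.startswith("--- ") and (not blocks or seen_plus)
        )
        if starts_new or not blocks:
            blocks.append([])
            seen_plus = False  # state resets with every new block
        blocks[-1].append(line)
        if line.startswith("+++ "):
            seen_plus = True
    return blocks
```

Resetting seen_plus per block is what lets a '--- /dev/null' create marker stay attached to its 'diff --git' prelude while a second bare '---' header still opens a new block.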

4 new tests in test_patch_sanitizer.py covering:
  - bare multi-file modify: each file oracle-checked independently
  - bare multi-file mixed (modify + create): create target passes
    through while modify source gets oracle-checked
  - bare multi-file with bad source in second file: unique-basename
    rewrite applied to second only, first block untouched
  - bare multi-file rejects whole patch if any source unresolvable

85/85 tests pass across all touched suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MEDIUM eval/_patch_sanitizer.py — _starts_new_block treated any line
matching '--- ' as a file-header boundary. But in a unified diff, a
deletion of content starting with '-- ' is encoded on the wire as
'--- ...' (the leading '-' is the delete marker). So a hunk body
containing a deleted line whose text starts with '-- a/foo.py' or
'---' (e.g. YAML frontmatter) would be split mid-hunk, and
_correct_file_diff and _validate_hunks would then misinterpret the
body text as a file-header shape.

Fix: add _is_minus_file_header(line, next_line) as the single
disambiguation point. A real '---' file header has TWO properties:
  (1) the path after '--- ' matches 'a/<path>' or '/dev/null'
  (2) the very next line is '+++ '
A hunk-body deletion of content shaped like '--- a/foo.py' is always
followed by another body line, never by '+++ '. Use this helper in
both _split_file_diffs (boundary detection) and _validate_hunks
(hunk-body termination).

Extra hardening inside _correct_file_diff: the source-path correction
and line-rewrite passes only apply BEFORE the first '@@' hunk header.
Everything after the first '@@' is hunk-body content and must pass
through verbatim — even when it happens to match the '--- a/...' or
'+++ b/...' shape.

2 new tests in test_patch_sanitizer.py:
  - hunk deletes '-- a/foo.py' content → single-file modify survives
  - YAML frontmatter deletion ('---' content on both sides) → passes

87/87 tests pass across sanitizer / extraction / identity /
repo-cache / repo-grounding suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… disambiguation

MEDIUM eval/_patch_sanitizer.py — _is_minus_file_header asked only
that the next line startswith('+++ '). A hunk where the deletion is
'-- a/foo.py' (wire: '--- a/foo.py') and the addition is '++ plus'
(wire: '+++ plus') would therefore still be misread as a file-header
pair. The sibling line must match a REAL +++ file-header shape —
'+++ b/<path>' or '+++ /dev/null' (with optional tab + timestamp).

Fix: added _is_plus_file_header(line) with a regex that requires
'+++ b/' or '+++ /dev/null' after the '+++ ' prefix. _is_minus_file_header
now calls it instead of checking startswith. An addition of arbitrary
content (even if shaped '+++ some content') no longer triggers a
false file-header match.
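
A sketch of the two shape checks, assuming regexes along these lines; the actual patterns in the sanitizer are not shown in this message. As the next commit in the chain notes, shape alone is still insufficient inside a hunk body.

```python
import re
from typing import Optional

# '+++ b/<path>' or '+++ /dev/null', optionally followed by a tab and timestamp.
_PLUS_HEADER = re.compile(r"^\+\+\+ (?:b/\S.*?|/dev/null)(?:\t.*)?$")
_MINUS_HEADER = re.compile(r"^--- (?:a/\S.*?|/dev/null)(?:\t.*)?$")


def is_plus_file_header(line: str) -> bool:
    return _PLUS_HEADER.match(line) is not None


def is_minus_file_header(line: str, next_line: Optional[str]) -> bool:
    # A real '---' header must be immediately followed by a real '+++' header.
    return (
        _MINUS_HEADER.match(line) is not None
        and next_line is not None
        and is_plus_file_header(next_line)
    )
```

With this pairing, a deletion shaped '--- a/foo.py' followed by an addition shaped '+++ plus' is no longer read as a file header.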

LOW _repair_bare_empty_context used the old naive check
(line.startswith('--- ') or '+++ ') to decide when to leave hunk
mode. A valid hunk with an ambiguous '---' deletion followed by a
whitespace-stripped blank context line would leave hunk mode on the
ambiguous line, so the blank would never get repaired, and validate
would then reject the patch.

Fix: _repair_bare_empty_context now calls _is_minus_file_header and
_is_plus_file_header too — the same lookahead used by the splitter
and validator. All three sanitizer passes now interpret ambiguous
'---'/'+++' body lines consistently.

Also tightened _validate_hunks: replaced the old
'startswith("+++ ") and not startswith("+++ b/") and != "+++ /dev/null"'
check with a direct call to _is_plus_file_header.

2 new tests:
  - hunk body with '--- a/foo.py' delete + '+++ some content' add
    (ambiguous pair) survives as a single-file modify
  - hunk with ambiguous delete followed by blank context line gets
    the blank repaired to ' ' before validation

89/89 tests pass across sanitizer / extraction / identity /
repo-cache / repo-grounding suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uation loop

MEDIUM eval/_patch_sanitizer.py — the shape-only ``_is_plus_file_header``
still matched arbitrary hunk-body additions whose content happened to
start with ``++ b/`` or ``++ /dev/null`` (wire: ``+++ b/...`` and
``+++ /dev/null``). Paired with a ``-- a/...`` deletion, such content
would even satisfy the ``_is_minus_file_header`` lookahead. The only
structurally-correct distinction between a file-header pair and a
body pair is POSITION relative to the declared ``@@ -a,b +c,d @@``
counts: inside the declared body, every line's role is decided by
its first character alone.

Fix: introduce ``_parse_hunk_counts`` and ``_scan_hunk_extent``, a
shared count-driven body walker. ``_scan_hunk_extent`` consumes
exactly ``(old_count, new_count)`` body lines, decrementing the
counters based on the leading character (``+``, ``-``, ``' '`` /
empty). A line shaped ``--- a/foo.py`` inside a body is an unambiguous
deletion; ``+++ b/bar.py`` inside a body is an unambiguous addition.
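
A minimal sketch of a count-driven body walker, assuming the return shape (end index, ok). It covers only the counting described here; the overlong-body check added by a later commit is not included, and the function names drop the leading underscore to mark them as illustrations.

```python
import re
from typing import Optional

_HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def parse_hunk_counts(header: str) -> Optional[tuple[int, int]]:
    m = _HUNK_RE.match(header)
    if m is None:
        return None
    # An omitted count means 1 in unified diff format.
    return int(m.group(1) or 1), int(m.group(2) or 1)


def scan_hunk_extent(lines: list[str], start: int) -> tuple[int, bool]:
    """Walk the body after the '@@' at lines[start]; return (end index, ok)."""
    counts = parse_hunk_counts(lines[start])
    if counts is None:
        return start + 1, False
    old_left, new_left = counts
    i = start + 1
    while (old_left > 0 or new_left > 0) and i < len(lines):
        first = lines[i][:1]
        if first == "-":
            old_left -= 1
        elif first == "+":
            new_left -= 1
        elif first in (" ", ""):  # context; a stripped blank counts as context
            old_left -= 1
            new_left -= 1
        elif first == "\\":  # '\ No newline at end of file' consumes no count
            pass
        else:
            return i, False
        if old_left < 0 or new_left < 0:
            return i, False  # more lines of one kind than the header declared
        i += 1
    return i, old_left == 0 and new_left == 0
```

Inside the declared extent only the first character matters, so '--- a/foo.py' is counted as one deletion and '+++ b/bar.py' as one addition.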

Refactored three callers to use the scanner:
  * ``_split_file_diffs`` — when at ``@@``, appends the exact body
    extent verbatim and resumes boundary detection only afterwards.
  * ``_repair_bare_empty_context`` — repairs blanks only within the
    declared extent; no shape-based exit heuristic.
  * ``_validate_hunks`` — delegates body consumption to the scanner;
    ``ok`` return tells us whether counts balanced.

The shape-only helpers ``_is_plus_file_header`` and
``_is_minus_file_header`` are retained for OUTSIDE-hunk boundary
detection in ``_split_file_diffs``, where they remain correct
because outside a hunk body there's no competing "body content"
interpretation.

3 new tests covering the review #733 adversarial cases:
  * body add shaped ``+++ b/foo.py`` (wire of ``++ b/foo.py`` content)
  * body pair ``--- a/foo.py`` + ``+++ b/bar.py`` (matching both
    halves of a file-header pair by shape)
  * body add shaped ``+++ /dev/null``

92/92 tests pass across all touched suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HIGH eval/_patch_sanitizer.py — the count-driven _scan_hunk_extent
stopped as soon as (old_count, new_count) reached zero. Extra body
lines AFTER that boundary were silently left in the patch as
"between-hunks content". A declared '@@ -1 +1 @@' with body
'-old / +new / +extra' would validate, but git apply would still
choke on the dangling '+extra'. That's the exact failure class the
validator is supposed to prevent.

Fix: after counts drain, _scan_hunk_extent now additionally requires
the NEXT line (if any) to be a legal post-hunk boundary: another
'@@', a '---' / '+++' file header, 'diff --git', a git metadata
line, or EOF. Body-shaped content ('+', '-', ' ', or empty) after
counts drain returns ok=False.

A trailing '\ No newline at end of file' marker is consumed first
(it applies to the hunk's final body line) before the boundary
check, so its presence doesn't trigger a false-overlong rejection.

4 new tests:
  * overlong by addition (extra '+' line) — rejected
  * overlong by addition in grounding mode — rejected
  * overlong by context (extra ' ' line) — rejected
  * trailing '\ No newline at end of file' marker — accepted

96/96 tests pass across all touched suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HIGH eval/_patch_sanitizer.py — _is_post_hunk_boundary accepted any
line shaped like a ---/+++ file header based on shape alone. Body
content shaped '--- a/foo.py' (wire form of deletion '-- a/foo.py')
or '+++ b/bar.py' (wire form of addition '++ b/bar.py') after counts
drain therefore passed as a valid boundary rather than being
rejected as overlong body content. The validator would accept the
patch, but git apply would still reject.

Fix: tightened _is_post_hunk_boundary. Unambiguous accepts (no body
line can fake these shapes):
  * EOF
  * '@@' (next hunk of same file)
  * 'diff --git ' (new file-diff)
  * git metadata prefixes ('index ', 'new file mode', 'rename from',
    etc.) — these appear between 'diff --git' and the '---' pair, so
    they imply a new file

For bare multi-file diffs without 'diff --git', a legitimate file
boundary requires the full TRIPLET '---' + '+++' + '@@' on three
consecutive lines. A body-content imposter pair '-- a/X' + '++ b/X'
followed by arbitrary body content would not put '@@' at the third
position.

Shape-only '---' or '+++' after counts drain (with no '@@' at the
expected position) is now treated as overlong body content and
rejects the patch.
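
The tightened boundary check can be sketched as below. The metadata prefix list is an assumption built from git's standard extended headers (the message lists a few and ends with "etc."), and the sketch expects the caller to have already consumed any trailing '\ No newline at end of file' marker.

```python
# Assumption: git's standard extended-header prefixes.
_GIT_METADATA = (
    "index ", "new file mode", "deleted file mode", "old mode", "new mode",
    "similarity index", "rename from", "rename to", "copy from", "copy to",
)


def is_post_hunk_boundary(lines: list[str], i: int) -> bool:
    """True when lines[i] may legally follow a hunk whose counts have drained."""
    if i >= len(lines):
        return True  # EOF
    line = lines[i]
    if line.startswith("@@") or line.startswith("diff --git "):
        return True
    if line.startswith(_GIT_METADATA):
        return True
    # Bare multi-file diff: require '---', '+++', '@@' on three consecutive lines.
    return (
        line.startswith("--- ")
        and i + 2 < len(lines)
        and lines[i + 1].startswith("+++ ")
        and lines[i + 2].startswith("@@")
    )
```

Anything else after the counts drain, including a lone '+++ b/bar.py' or a '---' / '+++' pair with no '@@' behind it, is treated as overlong body content.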

5 new tests:
  * extra '+++ b/bar.py' after counts (overlong by addition) — reject
  * extra '--- a/foo.py' after counts (overlong by deletion) — reject
  * extra '+++ /dev/null' after counts (overlong /dev/null-shaped
    addition) — reject
  * legitimate bare multi-file triplet — accept
  * two-line --- / +++ pair without following @@ — reject (overlong
    body content, not a real boundary)

101/101 tests pass across all touched suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coredipper merged commit 174b731 into main on Apr 17, 2026
4 checks passed
coredipper added a commit that referenced this pull request Apr 17, 2026
…ity ceiling

Reruns the inconclusive 2026-04-16 SWE-bench Phase 2 experiment using
the patch-apply pipeline shipped in PR #53 (sanitizer + tightened
prompts) and PR #55 (repo grounding + fuzzy path correction +
hardened diff parser). Same model (gemma4:latest, digest c6eb396dbd59,
8B/Q4_K_M), same 10 instances, --grounding active.

Result: 1/30 reaches harness verdict (django-11001 baseline,
unresolved). 29/30 honest empty_patch (sanitizer refused malformed
output). Zero `error` outcomes.

Old vs new outcome distribution per condition:

  condition  old run                                  new run
  baseline   1/10 evaluated, 9 error                  1/10 evaluated, 9 empty_patch
  organism   1/10 evaluated, 5 error + 4 empty_patch  0/10 evaluated, 10 empty_patch
  langgraph  0/10 evaluated, 6 error + 4 empty_patch  0/10 evaluated, 10 empty_patch

The original run confounded two failure modes: model selecting wrong
file paths (a) and model writing malformed unified diffs (b).
Grounding solves (a). The fact that 29 of 30 still drop to
empty_patch — sanitizer rejecting placeholder hunks, malformed counts,
and invented paths — localizes the bottleneck to (b). At 8B/Q4_K_M,
diff-format discipline is the binding constraint, not file selection.

Side observation: the original organism vs baseline gap (4 vs 0
empty_patch) closed under the new pipeline. The Phase A "[edit] stage
emits a single fenced diff and nothing else" instruction eliminated
the multi-stage format leak.

Latency cost: grounding ~doubles wall-clock per call (baseline 44s ->
104s; organism 88s -> 170s; langgraph 90s -> 172s) for zero
evaluated-rate gain at this model scale.

Files:
- eval/results/swebench_phase2.json: regenerated by the writer
  (sanitizer + grounding + harness=ok, post_run_check=match)
- article/paper5/sections/06-experiments.tex: §6.3 retitled
  "8B Format Discipline Is the Ceiling"; new outcome table; before/after
  comparison; reframed interpretation
- article/paper5/sections/07-conclusion.tex: limitations item updated
  to reference the grounded-rerun confirmation

9/9 schema tests still pass. PDF rebuild deferred to user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coredipper added a commit that referenced this pull request Apr 17, 2026
…update)

Captures the work landed on this branch:
- PR #53 (already merged): patch sanitizer + tightened prompts (Phase A)
- PR #55 (already merged): repo grounding + fuzzy path correction +
  hardened diff parser (Phase B)
- PR #56 (this): grounded rerun artifact + paper §6.3 reframed as
  "8B format-discipline ceiling"; runtime_error vs sanitizer-rejected
  classification (reviews #747, #748)

Bumps:
- pyproject.toml 0.34.4 → 0.34.5
- operon_ai/__init__.py 0.34.4 → 0.34.5
- README.md badge v0.34.4 → v0.34.5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coredipper deleted the feat/swebench-phase2-grounding branch on April 22, 2026 14:49