mini_swe_agent_v2: prevent re-exploration loops after compaction#54
mini_swe_agent_v2: prevent re-exploration loops after compaction#54akhatua2 wants to merge 1 commit into
Conversation
The default compaction summarizer was producing terse, generic summaries that lost the file contents agents had already read. After compaction the agent would re-cat/grep the same files, hit the 100-step limit, and fail to submit. In one full coop run, 91% of agents (63/72 sampled trajectories) hit LimitsExceeded with mean ~99 of 100 steps used. Two changes, both targeted at this loop: 1. summarize_context now serializes the prior turns into a single user message as a tagged transcript instead of passing them as role-wise messages. Without this, the model would sometimes role-play as the next assistant turn (emitting text shaped like tool calls) rather than producing a summary. 2. The default compaction_summary_prompt is now a structured template with explicit headings (FILE MAP, RELEVANT CODE READ, KEY SYMBOLS, SEARCH RESULTS, EDITS, BUILD/TEST OUTPUT, COLLEAGUE MESSAGES, OPEN QUESTIONS, CURRENT PLAN), asks for verbatim quoting of relevant snippets with file:line citations, and includes an anti-hallucination guard for line counts. The system prompt and task instance are preserved separately by the existing prefix handling, so the prompt tells the model not to restate them. Measured on an A/B replay of 9 real solver segments (Rust, Go, JS, Py) against the same Qwen3.5-9B endpoint: coverage of post-compaction re-explores went from 40% to 57% on a heuristic file+symbol scorer. Qualitative reads of the produced summaries show the new ones consistently quote the actual code regions agents would otherwise re-cat (e.g. macro definitions cited at exact line ranges). Added unit tests for _serialize_transcript_for_summary covering role labels, tool-call rendering, and empty/missing content edge cases.
f1f97d5 to
e00ac13
Compare
|
LGTM, the following are some compaction features in claude code that we can borrow. |
Mapping this PR to Claude Code's compaction designFor context if it helps prioritize follow-ups: I cross-referenced this PR against the compaction subsystem in the public claude-code skeleton (
Suggested follow-ups, ranked1. Deterministic post-compact file re-injection — highest leverage, most directly compounding with what this PR already does. In The summary prompt currently asks the model to quote 2. Microcompact-style truncation inside Before flattening to transcript, walk 3. Reactive fallback on When the agent still hits the limit after the 28k threshold (single-turn overshoot), call compaction again with 4. Add Read of the empirical results40% → 57% coverage with summary tokens going from ~1.1k to ~2.5k is the right direction, but coverage is still below what deterministic restoration would guarantee for the recent files. I'd expect (1) on its own to push the recent-file portion of coverage close to 100% — at the cost of another ~10–25k tokens of post-compact context, which is well inside the 32k budget given that compaction fires at 28k pre-compact. |
|
Addendum — I should have separated per-message compaction from the post-compact summarizer pipeline in the table above. Distinct mechanisms in CC:
For this PR's failure mode,
How this would compose with what's already in this PR + the file re-injection suggestion from my prior comment:
The harness-level implementation is small: a per-session cache Worth noting CC's stub references "the earlier Read tool_result in this conversation" — that wording is load-bearing post-compact, because the summary is not a tool result. If you do this in coop and combine it with file re-injection, the stub message should reference the re-injected attachment by name/section so the model knows where to look. Otherwise post-compact the model sees "refer to the earlier tool result" and there is no earlier tool result. Sequence to think about (low → high effort):
|
Summary
In a full coop run with Qwen3.5-9B (32K ctx), 91% of agents (63/72 sampled trajectories) hit
LimitsExceededwith mean ~99 of 100 steps consumed. Inspecting trajectories: post-compaction, every agent re-ran read commands it had already executed pre-compaction (e.g.wc -l src/ensure.rsx6,sed -n '870,1000p' …x5). The compaction summary was preserving "files examined: X" without the contents — so the agent re-cat'd to recover the info it needed, burning steps until it ran out.Two complementary fixes:
1. Serialize the prior conversation into a single transcript message
LitellmModel.summarize_contextpreviously passed the prior turns to the summarizer as role-wise messages (system,user,assistant,tool, …) followed by a finalusersummary request. The model would sometimes role-play as the next assistant turn — emitting text shaped like tool calls — instead of producing a summary.New behavior: flatten the prior messages into a tagged text transcript (
[system]\n…\n\n[assistant]\n…\n\n[tool_output]\n…) and wrap it as a singleusermessage:The model is unambiguously an outside observer summarizing a record, not the next participant.
2. Replace the default
compaction_summary_promptwith a structured, prescriptive templateThe old default was a one-paragraph "summarize, be thorough" instruction. The new default asks for these sections, with verbatim quoting where it matters:
## FILE MAP— paths + read ranges + modified flag (line counts only when actually known from awc)## RELEVANT CODE READ— filepath:lines a-bheaders + verbatim fenced snippets## KEY SYMBOLS / IDENTIFIERS—name -> path:lineindex for previously discovered symbols## SEARCH RESULTS WORTH KEEPING— incl. negative results so the agent doesn't re-grep## EDITS ALREADY APPLIED— diff blocks## BUILD / TEST OUTPUT— verbatim error lines only## COLLEAGUE MESSAGES— everysend_message/inbound message verbatim## OPEN QUESTIONS / UNREAD REGIONS— targeted next-reads to avoid exploratory re-ls## CURRENT PLAN— last stated planIt also includes an explicit anti-hallucination clause for line counts/sizes (an early iteration of this prompt fabricated
src/lib.rs: 3210 lineswhen 3210 was actually the file's byte count fromls -la).The system prompt and task instance are kept in the agent's
prefixand preserved across compactions by existing code, so the prompt instructs the model NOT to restate them in the summary.Empirical results
A/B replay of 9 real solver segments (Rust, Go, JS, Py) against the same
Qwen/Qwen3.5-9Bendpoint via a heuristic file+symbol coverage scorer over post-compaction commands:Qualitatively: the new summaries consistently quote the exact code regions agents would otherwise re-cat — e.g. macro definitions cited at
src/ensure.rs lines 870-919with verbatim Rust bodies — which is the specific behavior we wanted.Test plan
uv run ruff check src/cooperbench/— cleanuv run ruff format --check src/cooperbench/— 59 files already formatteduv run python -m mypy src/cooperbench/— 59 source files, no issuesuv run python -m pytest tests/ --ignore=tests/integration— 210 passed, 63 skipped (6 new tests added)New tests:
tests/agents/mini_swe_agent_v2/test_litellm_model.py::TestSerializeTranscriptForSummarycovers role labels, tool-call rendering, missing content, unknown roles, and empty input.