mini_swe_agent_v2: prevent re-exploration loops after compaction by akhatua2 · Pull Request #54 · cooperbench/CooperBench

akhatua2 · 2026-05-17T02:18:21Z

Summary

In a full coop run with Qwen3.5-9B (32K ctx), 91% of agents (63/72 sampled trajectories) hit LimitsExceeded with mean ~99 of 100 steps consumed. Inspecting trajectories: post-compaction, every agent re-ran read commands it had already executed pre-compaction (e.g. wc -l src/ensure.rs x6, sed -n '870,1000p' … x5). The compaction summary was preserving "files examined: X" without the contents — so the agent re-cat'd to recover the info it needed, burning steps until it ran out.

Two complementary fixes:

1. Serialize the prior conversation into a single transcript message

LitellmModel.summarize_context previously passed the prior turns to the summarizer as role-wise messages (system, user, assistant, tool, …) followed by a final user summary request. The model would sometimes role-play as the next assistant turn — emitting text shaped like tool calls — instead of producing a summary.

New behavior: flatten the prior messages into a tagged text transcript ([system]\n…\n\n[assistant]\n…\n\n[tool_output]\n…) and wrap it as a single user message:

<summary prompt>

--- BEGIN TRANSCRIPT ---
[system]
…
[assistant]
…
  -> tool_call bash({"command":"ls -la"})
[tool_output]
…
--- END TRANSCRIPT ---

The model is unambiguously an outside observer summarizing a record, not the next participant.

2. Replace the default `compaction_summary_prompt` with a structured, prescriptive template

The old default was a one-paragraph "summarize, be thorough" instruction. The new default asks for these sections, with verbatim quoting where it matters:

## FILE MAP — paths + read ranges + modified flag (line counts only when actually known from a wc)
## RELEVANT CODE READ — file path:lines a-b headers + verbatim fenced snippets
## KEY SYMBOLS / IDENTIFIERS — name -> path:line index for previously discovered symbols
## SEARCH RESULTS WORTH KEEPING — incl. negative results so the agent doesn't re-grep
## EDITS ALREADY APPLIED — diff blocks
## BUILD / TEST OUTPUT — verbatim error lines only
## COLLEAGUE MESSAGES — every send_message/inbound message verbatim
## OPEN QUESTIONS / UNREAD REGIONS — targeted next-reads to avoid exploratory re-ls
## CURRENT PLAN — last stated plan

It also includes an explicit anti-hallucination clause for line counts/sizes (an early iteration of this prompt fabricated src/lib.rs: 3210 lines when 3210 was actually the file's byte count from ls -la).

The system prompt and task instance are kept in the agent's prefix and preserved across compactions by existing code, so the prompt instructs the model NOT to restate them in the summary.

Empirical results

A/B replay of 9 real solver segments (Rust, Go, JS, Py) against the same Qwen/Qwen3.5-9B endpoint via a heuristic file+symbol coverage scorer over post-compaction commands:

	mean coverage	mean summary tokens
current	40%	~1,100
new	57%	~2,500

Qualitatively: the new summaries consistently quote the exact code regions agents would otherwise re-cat — e.g. macro definitions cited at src/ensure.rs lines 870-919 with verbatim Rust bodies — which is the specific behavior we wanted.

Test plan

uv run ruff check src/cooperbench/ — clean
uv run ruff format --check src/cooperbench/ — 59 files already formatted
uv run python -m mypy src/cooperbench/ — 59 source files, no issues
uv run python -m pytest tests/ --ignore=tests/integration — 210 passed, 63 skipped (6 new tests added)
CI green on GitHub Actions

New tests: tests/agents/mini_swe_agent_v2/test_litellm_model.py::TestSerializeTranscriptForSummary covers role labels, tool-call rendering, missing content, unknown roles, and empty input.

The default compaction summarizer was producing terse, generic summaries that lost the file contents agents had already read. After compaction the agent would re-cat/grep the same files, hit the 100-step limit, and fail to submit. In one full coop run, 91% of agents (63/72 sampled trajectories) hit LimitsExceeded with mean ~99 of 100 steps used. Two changes, both targeted at this loop: 1. summarize_context now serializes the prior turns into a single user message as a tagged transcript instead of passing them as role-wise messages. Without this, the model would sometimes role-play as the next assistant turn (emitting text shaped like tool calls) rather than producing a summary. 2. The default compaction_summary_prompt is now a structured template with explicit headings (FILE MAP, RELEVANT CODE READ, KEY SYMBOLS, SEARCH RESULTS, EDITS, BUILD/TEST OUTPUT, COLLEAGUE MESSAGES, OPEN QUESTIONS, CURRENT PLAN), asks for verbatim quoting of relevant snippets with file:line citations, and includes an anti-hallucination guard for line counts. The system prompt and task instance are preserved separately by the existing prefix handling, so the prompt tells the model not to restate them. Measured on an A/B replay of 9 real solver segments (Rust, Go, JS, Py) against the same Qwen3.5-9B endpoint: coverage of post-compaction re-explores went from 40% to 57% on a heuristic file+symbol scorer. Qualitative reads of the produced summaries show the new ones consistently quote the actual code regions agents would otherwise re-cat (e.g. macro definitions cited at exact line ranges). Added unit tests for _serialize_transcript_for_summary covering role labels, tool-call rendering, and empty/missing content edge cases.

ProKil · 2026-05-17T17:39:25Z

LGTM, the following are some compaction features in claude code that we can borrow.

ProKil · 2026-05-17T18:02:33Z

Mapping this PR to Claude Code's compaction design

For context if it helps prioritize follow-ups: I cross-referenced this PR against the compaction subsystem in the public claude-code skeleton (src/services/compact/ — autoCompact.ts, compact.ts, microCompact.ts, apiMicrocompact.ts, prompt.ts, postCompactCleanup.ts, sessionMemoryCompact.ts, grouping.ts, etc.). The two changes here already implement the most transferable parts of that design. Mapping each CC concept to PR 54:

CC concept	PR 54 status	Notes
Structured multi-section summary prompt	Done, better-tailored	CC's `BASE_COMPACT_PROMPT` uses 9 generic sections (Primary Request / Key Technical Concepts / Files and Code Sections / Errors and fixes / Problem Solving / All user messages / Pending Tasks / Current Work / Optional Next Step). The sections in this PR (FILE MAP / RELEVANT CODE READ / KEY SYMBOLS / SEARCH RESULTS / EDITS / TESTS / COLLEAGUE MESSAGES / OPEN QUESTIONS / CURRENT PLAN) are tighter and aimed precisely at the re-exploration failure mode
Disambiguate summarizer-as-observer	Done, more robust	CC uses `NO_TOOLS_PREAMBLE` (a prose warning at the top of the prompt) and notes a 2.79% role-play failure rate on Sonnet 4.6 vs 0.01% on 4.5. This PR's flatten-to-`[role]`-transcript inside a single user message is structural rather than instructional — the model can't role-play because it's not in a role position
`<analysis>` scratchpad before `<summary>`	Not done	CC has the model write `<analysis>…</analysis><summary>…</summary>` and strips analysis before the summary lands. Cheap to add and may help summary quality
Post-compact verbatim re-injection of last N files	Not done	CC re-injects up to 5 most recently read files after compaction. Constants in `compact.ts`: `POST_COMPACT_MAX_FILES_TO_RESTORE = 5`, `POST_COMPACT_TOKEN_BUDGET = 50_000`, `POST_COMPACT_MAX_TOKENS_PER_FILE = 5_000`. This is a deterministic guarantee that the most-recently-read files survive — independent of summary quality. Directly attacks the failure mode this PR describes (`wc -l src/ensure.rs` x6, `sed -n '870,1000p' …` x5 to recover dropped content)
Microcompact (clear oversized stale tool outputs before summarizing)	Not done	`microCompact.ts` clears old Bash / Grep / Read / Glob / WebFetch / WebSearch / Edit / Write outputs before the summarizer call so it has more room. For mini_swe_agent_v2, equivalent is truncating big shell-output blocks in `old_turns` before passing to `summarize_context`
Reactive recovery on prompt-too-long	Not done	CC has a fallback path that re-summarizes by API-round groups when proactive autocompact didn't fire soon enough. With this PR's `compaction_token_trigger: 28000`, a single turn that blows past 28k still produces `LimitsExceeded` rather than retrying with a tighter `compaction_keep_recent_turns`
Forked-agent with shared cache key	N/A	CC-API-specific; LiteLLM-agnostic harness can't share cache keys across providers
API-side `clear_tool_uses_20250919` / `clear_thinking_20251015`	N/A	Anthropic API beta; doesn't apply to Qwen3.5-9B via LiteLLM
Session memory background subagent	N/A	Heavy infra (post-sampling hook, persistent file, GrowthBook config, background extraction). Overkill for one-shot eval sessions
Auto-compact warning/error/blocking bands	N/A	UI concept
Pre/post-compact user hooks	N/A	Eval harness doesn't expose hooks
`groupMessagesByApiRound`, `adjustIndexToPreserveAPIInvariants`	N/A	This PR's flatten-to-text sidesteps both — no tool_use/tool_result split concerns, no streaming-`message.id` merge concerns
3-failure circuit breaker	N/A	Coop fails the run after one
Skill re-injection (`POST_COMPACT_MAX_TOKENS_PER_SKILL = 5_000`, `POST_COMPACT_SKILLS_TOKEN_BUDGET = 25_000`)	N/A	No skills in coop

Suggested follow-ups, ranked

1. Deterministic post-compact file re-injection — highest leverage, most directly compounding with what this PR already does.

In default.py:_compact_messages, walk old_turns newest-first, identify file-read tool outputs (parse the preceding assistant tool call rather than regexing the output is cleaner), pick up to 5 most recent unique file paths, and append a synthetic user message after [summary_msg] formatted like the prompt's ## RELEVANT CODE READ section, capped per file. Result: prefix + [summary_msg, restored_files_msg] + recent_turns.

The summary prompt currently asks the model to quote ## RELEVANT CODE READ, but a model under context pressure may quote selectively or paraphrase. Re-attaching deterministically removes that risk.

2. Microcompact-style truncation inside LitellmModel.summarize_context — modest leverage, complementary.

Before flattening to transcript, walk summarizer_input and truncate any single tool-output block over ~4k tokens to its first+last N lines with a [M lines elided] marker. The summarizer rarely needs the middle of a 10k-token shell dump, and freeing that space lets it spend more tokens on the structured output. Mirrors CC's TIME_BASED_MC_CLEARED_MESSAGE = '[Old tool result content cleared]' pattern.

3. Reactive fallback on LimitsExceeded — lowest leverage, only worth adding if (1) and (2) don't close the gap.

When the agent still hits the limit after the 28k threshold (single-turn overshoot), call compaction again with compaction_keep_recent_turns = 1 (or 0) before failing the run. CC's MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3 is overkill here; one retry is probably enough.

4. Add <analysis> scratchpad — minor; defer until you have data on whether summary quality is the bottleneck.

Read of the empirical results

40% → 57% coverage with summary tokens going from ~1.1k to ~2.5k is the right direction, but coverage is still below what deterministic restoration would guarantee for the recent files. I'd expect (1) on its own to push the recent-file portion of coverage close to 100% — at the cost of another ~10–25k tokens of post-compact context, which is well inside the 32k budget given that compaction fires at 28k pre-compact.

ProKil · 2026-05-17T18:06:15Z

Addendum — I should have separated per-message compaction from the post-compact summarizer pipeline in the table above. Distinct mechanisms in CC:

Mechanism	Layer	Where
`FILE_UNCHANGED_STUB`	Tool result	`tools/FileReadTool/prompt.ts:7-8`, emitted at `FileReadTool.ts:690`. When `Read` is called on a file already read in this conversation and unchanged on disk, the tool returns `"File unchanged since last read. The content from the earlier Read tool_result in this conversation is still current — refer to that instead of re-reading."` instead of the file content. The compactor recognizes these stubs at `compact.ts:1617-1623` to map them back to the original read for restoration
Microcompact (the one I called out)	Pre-summarizer client-side	`microCompact.ts` — replaces old tool result content with `'[Old tool result content cleared]'` for a whitelist of tools
`stripImagesFromMessages`	Pre-summarizer	`compact.ts` — per-message replacement of image/document blocks with text markers, because images can themselves trip prompt_too_long on the summarizer call
API-side `clear_tool_uses_20250919` / `clear_thinking_20251015`	Server-side	`apiMicrocompact.ts`
Tool-level output caps	Tool	`MAX_LINES_TO_READ = 2000` for `Read`, Bash output truncation

For this PR's failure mode, FILE_UNCHANGED_STUB is the most directly relevant one I missed flagging. It's a defense that fires within a conversation (not just post-compact): even before compaction, if the agent re-runs cat src/ensure.rs, the tool result is the stub, not the file content. This:

Costs the agent a step but doesn't re-flood the context with duplicate file bytes.
Tells the model in-band that the content is still in context — usually enough to stop the loop.
Reduces the amount of duplicate content the summarizer has to consolidate when compaction does fire.

How this would compose with what's already in this PR + the file re-injection suggestion from my prior comment:

Within a conversation (pre-compact): intercept Read-like tool calls on a path already read in this conversation, return a stub message. Keyed by (path, mtime, range) or a content hash. Saves context tokens by not re-emitting the same file bytes.
At compaction time (this PR + re-injection): the structured summary captures what was read; deterministic re-injection guarantees the actual bytes for the last ≤5 files survive verbatim.
Post-compact (defense in depth): the same stub mechanism keeps working — if the agent reads a re-injected file again, it gets the stub pointing back to the re-injected content.

The harness-level implementation is small: a per-session cache {path -> (mtime, content_hash, last_read_step)} checked by whichever shell-command interception layer already exists in mini_swe_agent_v2 (the file-detection in connectors/git.py and the bash environment in environments/ suggest the plumbing is already there). The cache needs to be invalidated on any Write/Edit to the same path so the next read returns real content.

Worth noting CC's stub references "the earlier Read tool_result in this conversation" — that wording is load-bearing post-compact, because the summary is not a tool result. If you do this in coop and combine it with file re-injection, the stub message should reference the re-injected attachment by name/section so the model knows where to look. Otherwise post-compact the model sees "refer to the earlier tool result" and there is no earlier tool result.

Sequence to think about (low → high effort):

FILE_UNCHANGED_STUB equivalent at the harness/tool layer — small, prevents the loop even pre-compaction
Post-compact deterministic file re-injection (POST_COMPACT_MAX_FILES_TO_RESTORE) — moderate, prevents the loop when compaction does fire
Microcompact-style truncation of large stale tool outputs inside summarize_context — small, frees summary tokens
Reactive fallback — small, last-resort

(1) + (2) are complementary: (1) reduces how often compaction has to deal with the problem at all, (2) ensures the problem is solved when it does. Either alone is a clear win; both together is the closest analogue to the CC design.

akhatua2 force-pushed the improve-compaction-prompt branch from f1f97d5 to e00ac13 Compare May 17, 2026 02:20

ProKil mentioned this pull request May 18, 2026

agents/codex: add Codex adapter; lift shared coop bits into _coop #51

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mini_swe_agent_v2: prevent re-exploration loops after compaction#54

mini_swe_agent_v2: prevent re-exploration loops after compaction#54
akhatua2 wants to merge 1 commit into
mainfrom
improve-compaction-prompt

akhatua2 commented May 17, 2026 •

edited

Loading

Uh oh!

ProKil commented May 17, 2026 •

edited

Loading

Uh oh!

ProKil commented May 17, 2026

Uh oh!

ProKil commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

akhatua2 commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. Serialize the prior conversation into a single transcript message

2. Replace the default compaction_summary_prompt with a structured, prescriptive template

Empirical results

Test plan

Uh oh!

ProKil commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ProKil commented May 17, 2026

Mapping this PR to Claude Code's compaction design

Suggested follow-ups, ranked

Read of the empirical results

Uh oh!

ProKil commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

akhatua2 commented May 17, 2026 •

edited

Loading

2. Replace the default `compaction_summary_prompt` with a structured, prescriptive template

ProKil commented May 17, 2026 •

edited

Loading