Skip to content

mini_swe_agent_v2: prevent re-exploration loops after compaction#54

Open
akhatua2 wants to merge 1 commit into
mainfrom
improve-compaction-prompt
Open

mini_swe_agent_v2: prevent re-exploration loops after compaction#54
akhatua2 wants to merge 1 commit into
mainfrom
improve-compaction-prompt

Conversation

@akhatua2
Copy link
Copy Markdown
Collaborator

@akhatua2 akhatua2 commented May 17, 2026

Summary

In a full coop run with Qwen3.5-9B (32K ctx), 91% of agents (63/72 sampled trajectories) hit LimitsExceeded with mean ~99 of 100 steps consumed. Inspecting trajectories: post-compaction, every agent re-ran read commands it had already executed pre-compaction (e.g. wc -l src/ensure.rs x6, sed -n '870,1000p' … x5). The compaction summary was preserving "files examined: X" without the contents — so the agent re-cat'd to recover the info it needed, burning steps until it ran out.

Two complementary fixes:

1. Serialize the prior conversation into a single transcript message

LitellmModel.summarize_context previously passed the prior turns to the summarizer as role-wise messages (system, user, assistant, tool, …) followed by a final user summary request. The model would sometimes role-play as the next assistant turn — emitting text shaped like tool calls — instead of producing a summary.

New behavior: flatten the prior messages into a tagged text transcript ([system]\n…\n\n[assistant]\n…\n\n[tool_output]\n…) and wrap it as a single user message:

<summary prompt>

--- BEGIN TRANSCRIPT ---
[system]
…
[assistant]
…
  -> tool_call bash({"command":"ls -la"})
[tool_output]
…
--- END TRANSCRIPT ---

The model is unambiguously an outside observer summarizing a record, not the next participant.

2. Replace the default compaction_summary_prompt with a structured, prescriptive template

The old default was a one-paragraph "summarize, be thorough" instruction. The new default asks for these sections, with verbatim quoting where it matters:

  • ## FILE MAP — paths + read ranges + modified flag (line counts only when actually known from a wc)
  • ## RELEVANT CODE READ — file path:lines a-b headers + verbatim fenced snippets
  • ## KEY SYMBOLS / IDENTIFIERSname -> path:line index for previously discovered symbols
  • ## SEARCH RESULTS WORTH KEEPING — incl. negative results so the agent doesn't re-grep
  • ## EDITS ALREADY APPLIED — diff blocks
  • ## BUILD / TEST OUTPUT — verbatim error lines only
  • ## COLLEAGUE MESSAGES — every send_message/inbound message verbatim
  • ## OPEN QUESTIONS / UNREAD REGIONS — targeted next-reads to avoid exploratory re-ls
  • ## CURRENT PLAN — last stated plan

It also includes an explicit anti-hallucination clause for line counts/sizes (an early iteration of this prompt fabricated src/lib.rs: 3210 lines when 3210 was actually the file's byte count from ls -la).

The system prompt and task instance are kept in the agent's prefix and preserved across compactions by existing code, so the prompt instructs the model NOT to restate them in the summary.

Empirical results

A/B replay of 9 real solver segments (Rust, Go, JS, Py) against the same Qwen/Qwen3.5-9B endpoint via a heuristic file+symbol coverage scorer over post-compaction commands:

mean coverage mean summary tokens
current 40% ~1,100
new 57% ~2,500

Qualitatively: the new summaries consistently quote the exact code regions agents would otherwise re-cat — e.g. macro definitions cited at src/ensure.rs lines 870-919 with verbatim Rust bodies — which is the specific behavior we wanted.

Test plan

  • uv run ruff check src/cooperbench/ — clean
  • uv run ruff format --check src/cooperbench/ — 59 files already formatted
  • uv run python -m mypy src/cooperbench/ — 59 source files, no issues
  • uv run python -m pytest tests/ --ignore=tests/integration — 210 passed, 63 skipped (6 new tests added)
  • CI green on GitHub Actions

New tests: tests/agents/mini_swe_agent_v2/test_litellm_model.py::TestSerializeTranscriptForSummary covers role labels, tool-call rendering, missing content, unknown roles, and empty input.

The default compaction summarizer was producing terse, generic summaries
that lost the file contents agents had already read. After compaction
the agent would re-cat/grep the same files, hit the 100-step limit, and
fail to submit. In one full coop run, 91% of agents (63/72 sampled
trajectories) hit LimitsExceeded with mean ~99 of 100 steps used.

Two changes, both targeted at this loop:

1. summarize_context now serializes the prior turns into a single user
   message as a tagged transcript instead of passing them as role-wise
   messages. Without this, the model would sometimes role-play as the
   next assistant turn (emitting text shaped like tool calls) rather
   than producing a summary.

2. The default compaction_summary_prompt is now a structured template
   with explicit headings (FILE MAP, RELEVANT CODE READ, KEY SYMBOLS,
   SEARCH RESULTS, EDITS, BUILD/TEST OUTPUT, COLLEAGUE MESSAGES, OPEN
   QUESTIONS, CURRENT PLAN), asks for verbatim quoting of relevant
   snippets with file:line citations, and includes an anti-hallucination
   guard for line counts. The system prompt and task instance are
   preserved separately by the existing prefix handling, so the prompt
   tells the model not to restate them.

Measured on an A/B replay of 9 real solver segments (Rust, Go, JS, Py)
against the same Qwen3.5-9B endpoint: coverage of post-compaction
re-explores went from 40% to 57% on a heuristic file+symbol scorer.
Qualitative reads of the produced summaries show the new ones
consistently quote the actual code regions agents would otherwise
re-cat (e.g. macro definitions cited at exact line ranges).

Added unit tests for _serialize_transcript_for_summary covering role
labels, tool-call rendering, and empty/missing content edge cases.
@akhatua2 akhatua2 force-pushed the improve-compaction-prompt branch from f1f97d5 to e00ac13 Compare May 17, 2026 02:20
@ProKil
Copy link
Copy Markdown
Member

ProKil commented May 17, 2026

LGTM, the following are some compaction features in claude code that we can borrow.

@ProKil
Copy link
Copy Markdown
Member

ProKil commented May 17, 2026

Mapping this PR to Claude Code's compaction design

For context if it helps prioritize follow-ups: I cross-referenced this PR against the compaction subsystem in the public claude-code skeleton (src/services/compact/autoCompact.ts, compact.ts, microCompact.ts, apiMicrocompact.ts, prompt.ts, postCompactCleanup.ts, sessionMemoryCompact.ts, grouping.ts, etc.). The two changes here already implement the most transferable parts of that design. Mapping each CC concept to PR 54:

CC concept PR 54 status Notes
Structured multi-section summary prompt Done, better-tailored CC's BASE_COMPACT_PROMPT uses 9 generic sections (Primary Request / Key Technical Concepts / Files and Code Sections / Errors and fixes / Problem Solving / All user messages / Pending Tasks / Current Work / Optional Next Step). The sections in this PR (FILE MAP / RELEVANT CODE READ / KEY SYMBOLS / SEARCH RESULTS / EDITS / TESTS / COLLEAGUE MESSAGES / OPEN QUESTIONS / CURRENT PLAN) are tighter and aimed precisely at the re-exploration failure mode
Disambiguate summarizer-as-observer Done, more robust CC uses NO_TOOLS_PREAMBLE (a prose warning at the top of the prompt) and notes a 2.79% role-play failure rate on Sonnet 4.6 vs 0.01% on 4.5. This PR's flatten-to-[role]-transcript inside a single user message is structural rather than instructional — the model can't role-play because it's not in a role position
<analysis> scratchpad before <summary> Not done CC has the model write <analysis>…</analysis><summary>…</summary> and strips analysis before the summary lands. Cheap to add and may help summary quality
Post-compact verbatim re-injection of last N files Not done CC re-injects up to 5 most recently read files after compaction. Constants in compact.ts: POST_COMPACT_MAX_FILES_TO_RESTORE = 5, POST_COMPACT_TOKEN_BUDGET = 50_000, POST_COMPACT_MAX_TOKENS_PER_FILE = 5_000. This is a deterministic guarantee that the most-recently-read files survive — independent of summary quality. Directly attacks the failure mode this PR describes (wc -l src/ensure.rs x6, sed -n '870,1000p' … x5 to recover dropped content)
Microcompact (clear oversized stale tool outputs before summarizing) Not done microCompact.ts clears old Bash / Grep / Read / Glob / WebFetch / WebSearch / Edit / Write outputs before the summarizer call so it has more room. For mini_swe_agent_v2, equivalent is truncating big shell-output blocks in old_turns before passing to summarize_context
Reactive recovery on prompt-too-long Not done CC has a fallback path that re-summarizes by API-round groups when proactive autocompact didn't fire soon enough. With this PR's compaction_token_trigger: 28000, a single turn that blows past 28k still produces LimitsExceeded rather than retrying with a tighter compaction_keep_recent_turns
Forked-agent with shared cache key N/A CC-API-specific; LiteLLM-agnostic harness can't share cache keys across providers
API-side clear_tool_uses_20250919 / clear_thinking_20251015 N/A Anthropic API beta; doesn't apply to Qwen3.5-9B via LiteLLM
Session memory background subagent N/A Heavy infra (post-sampling hook, persistent file, GrowthBook config, background extraction). Overkill for one-shot eval sessions
Auto-compact warning/error/blocking bands N/A UI concept
Pre/post-compact user hooks N/A Eval harness doesn't expose hooks
groupMessagesByApiRound, adjustIndexToPreserveAPIInvariants N/A This PR's flatten-to-text sidesteps both — no tool_use/tool_result split concerns, no streaming-message.id merge concerns
3-failure circuit breaker N/A Coop fails the run after one
Skill re-injection (POST_COMPACT_MAX_TOKENS_PER_SKILL = 5_000, POST_COMPACT_SKILLS_TOKEN_BUDGET = 25_000) N/A No skills in coop

Suggested follow-ups, ranked

1. Deterministic post-compact file re-injection — highest leverage, most directly compounding with what this PR already does.

In default.py:_compact_messages, walk old_turns newest-first, identify file-read tool outputs (parse the preceding assistant tool call rather than regexing the output is cleaner), pick up to 5 most recent unique file paths, and append a synthetic user message after [summary_msg] formatted like the prompt's ## RELEVANT CODE READ section, capped per file. Result: prefix + [summary_msg, restored_files_msg] + recent_turns.

The summary prompt currently asks the model to quote ## RELEVANT CODE READ, but a model under context pressure may quote selectively or paraphrase. Re-attaching deterministically removes that risk.

2. Microcompact-style truncation inside LitellmModel.summarize_context — modest leverage, complementary.

Before flattening to transcript, walk summarizer_input and truncate any single tool-output block over ~4k tokens to its first+last N lines with a [M lines elided] marker. The summarizer rarely needs the middle of a 10k-token shell dump, and freeing that space lets it spend more tokens on the structured output. Mirrors CC's TIME_BASED_MC_CLEARED_MESSAGE = '[Old tool result content cleared]' pattern.

3. Reactive fallback on LimitsExceeded — lowest leverage, only worth adding if (1) and (2) don't close the gap.

When the agent still hits the limit after the 28k threshold (single-turn overshoot), call compaction again with compaction_keep_recent_turns = 1 (or 0) before failing the run. CC's MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3 is overkill here; one retry is probably enough.

4. Add <analysis> scratchpad — minor; defer until you have data on whether summary quality is the bottleneck.

Read of the empirical results

40% → 57% coverage with summary tokens going from ~1.1k to ~2.5k is the right direction, but coverage is still below what deterministic restoration would guarantee for the recent files. I'd expect (1) on its own to push the recent-file portion of coverage close to 100% — at the cost of another ~10–25k tokens of post-compact context, which is well inside the 32k budget given that compaction fires at 28k pre-compact.

@ProKil
Copy link
Copy Markdown
Member

ProKil commented May 17, 2026

Addendum — I should have separated per-message compaction from the post-compact summarizer pipeline in the table above. Distinct mechanisms in CC:

Mechanism Layer Where
FILE_UNCHANGED_STUB Tool result tools/FileReadTool/prompt.ts:7-8, emitted at FileReadTool.ts:690. When Read is called on a file already read in this conversation and unchanged on disk, the tool returns "File unchanged since last read. The content from the earlier Read tool_result in this conversation is still current — refer to that instead of re-reading." instead of the file content. The compactor recognizes these stubs at compact.ts:1617-1623 to map them back to the original read for restoration
Microcompact (the one I called out) Pre-summarizer client-side microCompact.ts — replaces old tool result content with '[Old tool result content cleared]' for a whitelist of tools
stripImagesFromMessages Pre-summarizer compact.ts — per-message replacement of image/document blocks with text markers, because images can themselves trip prompt_too_long on the summarizer call
API-side clear_tool_uses_20250919 / clear_thinking_20251015 Server-side apiMicrocompact.ts
Tool-level output caps Tool MAX_LINES_TO_READ = 2000 for Read, Bash output truncation

For this PR's failure mode, FILE_UNCHANGED_STUB is the most directly relevant one I missed flagging. It's a defense that fires within a conversation (not just post-compact): even before compaction, if the agent re-runs cat src/ensure.rs, the tool result is the stub, not the file content. This:

  • Costs the agent a step but doesn't re-flood the context with duplicate file bytes.
  • Tells the model in-band that the content is still in context — usually enough to stop the loop.
  • Reduces the amount of duplicate content the summarizer has to consolidate when compaction does fire.

How this would compose with what's already in this PR + the file re-injection suggestion from my prior comment:

  1. Within a conversation (pre-compact): intercept Read-like tool calls on a path already read in this conversation, return a stub message. Keyed by (path, mtime, range) or a content hash. Saves context tokens by not re-emitting the same file bytes.
  2. At compaction time (this PR + re-injection): the structured summary captures what was read; deterministic re-injection guarantees the actual bytes for the last ≤5 files survive verbatim.
  3. Post-compact (defense in depth): the same stub mechanism keeps working — if the agent reads a re-injected file again, it gets the stub pointing back to the re-injected content.

The harness-level implementation is small: a per-session cache {path -> (mtime, content_hash, last_read_step)} checked by whichever shell-command interception layer already exists in mini_swe_agent_v2 (the file-detection in connectors/git.py and the bash environment in environments/ suggest the plumbing is already there). The cache needs to be invalidated on any Write/Edit to the same path so the next read returns real content.

Worth noting CC's stub references "the earlier Read tool_result in this conversation" — that wording is load-bearing post-compact, because the summary is not a tool result. If you do this in coop and combine it with file re-injection, the stub message should reference the re-injected attachment by name/section so the model knows where to look. Otherwise post-compact the model sees "refer to the earlier tool result" and there is no earlier tool result.

Sequence to think about (low → high effort):

  1. FILE_UNCHANGED_STUB equivalent at the harness/tool layer — small, prevents the loop even pre-compaction
  2. Post-compact deterministic file re-injection (POST_COMPACT_MAX_FILES_TO_RESTORE) — moderate, prevents the loop when compaction does fire
  3. Microcompact-style truncation of large stale tool outputs inside summarize_context — small, frees summary tokens
  4. Reactive fallback — small, last-resort

(1) + (2) are complementary: (1) reduces how often compaction has to deal with the problem at all, (2) ensures the problem is solved when it does. Either alone is a clear win; both together is the closest analogue to the CC design.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants