
[FEATURE]: ~90% of compaction cost is avoidable cache miss #25120

@lloydzhou

Description


Feature hasn't been suggested before.

  • I have verified this feature I'm about to request hasn't been suggested before.

Describe the enhancement you want to request

Why compaction matters

opencode runs long multi-turn agent sessions. As conversation history grows, we periodically compact it — asking the LLM to summarize old messages and replacing them with a concise summary. This is essential for staying within the context window.

The paradox

Compaction is supposed to save tokens. But to summarize the history, we have to re-send that same history to the LLM. The compaction request itself becomes one of the largest requests in the entire session — often 40K–50K tokens of input.

That would be acceptable if the provider's prompt cache helped. But it doesn't.

Why the cache misses

opencode's normal chat requests share a common prefix — system prompt, tool definitions — that hits the provider's prompt cache across turns. This is already working well.

The compaction request, however, builds a completely different structure:

  • No system prompt (compaction agent has none)
  • No tools (empty tool set)
  • Different message history (prior compaction messages filtered out)
  • Different serialization (media stripped, tool output truncated)

Because the request shares zero prefix overlap with normal chat requests, every token is charged at full input price. The cache that was built up over dozens of turns is completely wasted.
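The mismatch above can be made concrete with a small sketch (names and message shapes are illustrative, not opencode's actual API): because provider prompt caches match on an exact serialized prefix, a compaction request that starts with a different first message reuses nothing.

```typescript
type Part = { role: string; content: string };

// A normal chat turn: system prompt + tools + history.
const chatRequest: Part[] = [
  { role: "system", content: "You are opencode..." },  // system prompt
  { role: "system", content: "<tool definitions>" },   // tool defs
  { role: "user", content: "fix the failing test" },   // history...
];

// The compaction request: no system prompt, no tools, re-serialized history.
const compactionRequest: Part[] = [
  { role: "user", content: "fix the failing test (media stripped)" },
  { role: "user", content: "Summarize the conversation above." },
];

// Count how many leading messages are byte-identical — this is the most
// a prefix-based prompt cache could ever reuse.
function sharedPrefix(a: Part[], b: Part[]): number {
  let n = 0;
  while (
    n < a.length && n < b.length &&
    a[n].role === b[n].role && a[n].content === b[n].content
  ) n++;
  return n;
}

console.log(sharedPrefix(chatRequest, compactionRequest)); // → 0
```

With zero shared prefix, every one of the ~45K input tokens is billed at the full rate.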

The cost in numbers

Compacting ~45K tokens of history with Claude Sonnet 4 ($3.00/MTok input, $0.30/MTok cached):

| Segment | Tokens | Full price | If cached |
| --- | --- | --- | --- |
| System prompt | ~2K | $0.006 | $0.0006 |
| Tools | ~3K | $0.009 | $0.0009 |
| Dropped messages | ~40K | $0.120 | $0.012 |
| Summary instruction | ~200 | $0.0006 | $0.0006 (not cached) |
| **Total** | ~45.2K | $0.136 | $0.014 (~90% saved) |

The difference is ~10x. Over a long session with multiple compactions, this adds up — especially ironic for an operation whose entire purpose is cost reduction.
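A back-of-envelope check of the table, using the published Sonnet 4 rates and the token estimates above (segment names and counts are this issue's estimates, not measured values):

```typescript
const FULL = 3.0 / 1_000_000;   // $ per input token
const CACHED = 0.3 / 1_000_000; // $ per cached input token

const segments = [
  { name: "system prompt", tokens: 2_000, cacheable: true },
  { name: "tools", tokens: 3_000, cacheable: true },
  { name: "dropped messages", tokens: 40_000, cacheable: true },
  { name: "summary instruction", tokens: 200, cacheable: false },
];

// Cost if every token is billed at the full input rate.
const full = segments.reduce((sum, s) => sum + s.tokens * FULL, 0);

// Cost if the cacheable prefix is served from the prompt cache.
const cached = segments.reduce(
  (sum, s) => sum + s.tokens * (s.cacheable ? CACHED : FULL), 0);

console.log(full.toFixed(3), cached.toFixed(3)); // 0.136 0.014
```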

The fix is straightforward

The compaction request should share the same prefix as normal chat requests. Instead of building a separate request structure, keep the system prompt, tool definitions, and message serialization identical — only append a short "please summarize" instruction at the end.

Normal chat:   [system] + [tools] + [old messages] + [kept messages] + [user input]
Compaction:    [system] + [tools] + [old messages] + [summary prompt]
                                          ↑ cache hit up to here

The prefix [system] + [tools] + [old messages] was already sent in previous turns. The provider serves it from cache at 10% of the normal price. Only the ~200-token summary instruction is new.
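A minimal sketch of the proposed shape (function and type names are hypothetical, not opencode's actual code): both builders take the same already-serialized prefix, and compaction only appends the summary ask.

```typescript
type Msg = { role: "system" | "user" | "assistant"; content: string };

// Normal chat turn: shared prefix + the new user input.
function buildChatRequest(prefix: Msg[], userInput: string): Msg[] {
  return [...prefix, { role: "user", content: userInput }];
}

// Compaction turn: the *identical* prefix, so the provider serves it from
// cache; only the ~200-token summary instruction is new input.
function buildCompactionRequest(prefix: Msg[]): Msg[] {
  return [...prefix, {
    role: "user",
    content: "Summarize the conversation above into a concise briefing.",
  }];
}
```

The key invariant is that the prefix is reused byte-for-byte — any re-serialization (stripping media, truncating tool output) must happen before the prefix is first sent, or the cache key changes and the hit is lost.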

This is the same approach described in the bash-agent wiki on Cache-Aligned Summarization, where it cuts compaction cost by ~90% in practice.


Labels

  • core: Anything pertaining to core functionality of the application (opencode server stuff)
  • discussion: Used for feature requests, proposals, ideas, etc. Open discussion
  • perf: Indicates a performance issue or need for optimization
