Feature hasn't been suggested before.
Describe the enhancement you want to request
Drafted by Kimi K2.6
Note: Distinct from client-side response caching (#25974). That caches full query-response pairs locally. This is about server-side prefix caching — reusing the LLM's internal KV cache when different requests share the beginning of their prompt. Both are complementary.
What this is about
When an LLM processes a prompt, it builds an internal state called a KV cache. If the next request starts with the exact same text, the model can reuse that state instead of recomputing it from scratch. This saves time and money.
In multi-agent setups, different agents often share the same starting blocks:
| Block | What's in it | How often it changes | Typical size |
|---|---|---|---|
| A (Global) | System rules, tool schemas, output format | Rarely | 10k–50k tokens |
| B, C (Agent) | Agent skills, project memory, codebase summary | Occasionally | 5k–20k tokens |
| D, E (Task) | Current file, user query, scratchpad | Every turn | 2k–10k tokens |
Example:
- Agent 1 (backend): A + B + D
- Agent 2 (frontend): A + C + E
- Agent 3 (docs): A + B + E
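All three start with block A. Without prefix caching, each agent pays full price to process A on every request. With it, Agents 2 and 3 reuse the KV state cached for A by Agent 1's request, and Agent 3 also reuses the A + B prefix. If cached tokens are billed at about 10% of the normal input price, shared blocks become ~90% cheaper after the first request.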
How the cache should work
Segments are identified by hash
Each block of text gets a hash (e.g. SHA-256 of its content). This hash identifies the segment.
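A minimal sketch of how a client or server might derive such a hash. The `sha256:` prefix mirrors the request examples in this issue; the UTF-8 encoding and the helper name `segment_hash` are assumptions, not part of any existing API.

```python
import hashlib


def segment_hash(content: str) -> str:
    """Return the content hash in 'sha256:<hex>' form (UTF-8 encoding assumed)."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


rules = "You are a careful coding assistant."
# Identical text always yields the same hash, so it can serve as a segment identity.
print(segment_hash(rules) == segment_hash(rules))        # True
# Any change, even one trailing space, produces a different segment.
print(segment_hash(rules) == segment_hash(rules + " "))  # False
```

Because the hash covers raw text, whitespace or ordering changes inside a block create a new segment and a cache miss.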
The cache is a tree
The tree has levels. Level 1 is the first segment, level 2 is the second segment, and so on.
Level 1 (first segment):
  hash(A) ──→ KV state after processing A
  hash(F) ──→ KV state after processing F
Level 2 (second segment, branching under each level-1 node):
  under hash(A):
    hash(B) ──→ KV state after processing A+B
    hash(C) ──→ KV state after processing A+C
  under hash(F):
    hash(G) ──→ KV state after processing F+G
Level 3:
  under hash(A)→hash(B):
    hash(D) ──→ KV state after processing A+B+D
    hash(E) ──→ KV state after processing A+B+E
To serve a request, the server walks the tree:
- Check level 1 for hash(segment_1). If found, load that KV state.
- Check under that node at level 2 for hash(segment_2). If found, extend.
- Continue until a segment is missing. Process remaining segments from scratch, then add the new branch to the tree.
Why a tree: A flat hash of the entire prefix (hash(A+B+D)) would miss partial overlaps. With a tree, hash(A+B+E) reuses the hash(A)→hash(B) branch even though the full prefix differs.
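The walk and the partial-overlap behaviour can be sketched as a small in-memory tree. This is an illustration only: `Node`, `PrefixTree`, `lookup`, and `insert` are hypothetical names, and the string values stand in for real KV tensors.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Node:
    kv_state: Any = None  # stand-in for the real KV tensors
    children: Dict[str, "Node"] = field(default_factory=dict)


class PrefixTree:
    def __init__(self) -> None:
        self.root = Node()

    def lookup(self, hashes: List[str]) -> Tuple[int, Any]:
        """Return (number of leading segments found, KV state at that depth)."""
        node, depth = self.root, 0
        for h in hashes:
            child = node.children.get(h)
            if child is None:
                break  # first miss: everything after this is processed from scratch
            node, depth = child, depth + 1
        return depth, node.kv_state

    def insert(self, hashes: List[str], states: List[Any]) -> None:
        """Store one KV state per prefix, creating missing branches."""
        node = self.root
        for h, state in zip(hashes, states):
            node = node.children.setdefault(h, Node())
            node.kv_state = state


tree = PrefixTree()
tree.insert(["hA", "hB", "hD"], ["kv(A)", "kv(A+B)", "kv(A+B+D)"])
print(tree.lookup(["hA", "hB", "hE"]))  # (2, 'kv(A+B)')
print(tree.lookup(["hA", "hC", "hE"]))  # (1, 'kv(A)')
print(tree.lookup(["hF"]))              # (0, None)
```

A request for A+B+E gets two segments for free even though A+B+E was never seen, which is the case a flat hash of the full prefix would miss.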
API
POST /v1/chat/completions
{
  "model": "kimi-k2.6",
  "segments": [
    {"id": "global-rules", "hash": "sha256:abc...", "content": "..."},
    {"id": "backend-agent", "hash": "sha256:def...", "content": "..."}
  ],
  "messages": [
    {"role": "user", "content": "Refactor auth"}
  ]
}
The server walks the tree using the segment hashes. If all segments hit, it only processes the user message. If a segment misses, it processes that segment and grows the tree.
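On the client side, building such a request could look roughly like this. The field names follow the proposed body above; `make_segment` and `build_request` are hypothetical helpers, and the segment texts are placeholders.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def make_segment(segment_id: str, content: str) -> Dict[str, str]:
    """Build one entry of the proposed segments[] array."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return {"id": segment_id, "hash": f"sha256:{digest}", "content": content}


def build_request(model: str, segments: List[Tuple[str, str]], user_message: str) -> dict:
    """Assemble the proposed request body; segment order defines the prefix."""
    return {
        "model": model,
        "segments": [make_segment(sid, text) for sid, text in segments],
        "messages": [{"role": "user", "content": user_message}],
    }


body = build_request(
    "kimi-k2.6",
    [
        ("global-rules", "placeholder text for block A"),
        ("backend-agent", "placeholder text for block B"),
    ],
    "Refactor auth",
)
print(json.dumps(body, indent=2))
```

Order matters: the most stable block has to come first, since a miss at level 1 means nothing below it can be reused.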
Response:
{
  "usage": {
    "input_tokens": 62000,
    "output_tokens": 1500,
    "cached_prefix_tokens": 60000
  }
}
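A client could turn these fields into a hit ratio for monitoring. This sketch assumes `input_tokens` counts all prompt tokens, cached ones included, which is how the numbers in the example read; `cache_hit_ratio` is a hypothetical helper.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of input tokens served from the prefix cache."""
    total = usage.get("input_tokens", 0)
    if total == 0:
        return 0.0
    return usage.get("cached_prefix_tokens", 0) / total


usage = {"input_tokens": 62000, "output_tokens": 1500, "cached_prefix_tokens": 60000}
print(f"{cache_hit_ratio(usage):.1%}")  # 96.8%
```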
Who this helps
- Multi-agent frameworks (LangGraph, CrewAI): parallel agents sharing global context get cache hits automatically.
- IDE plugins: codebase context blocks are hashed on file change. Unchanged files reuse cache across sessions.
- Enterprise platforms: company-wide compliance guides are pre-cached once and shared by all employee agents.
- Power users: stable system prompt parts are cached across all conversations.
Rough impact estimate
The first request to a cold cache incurs a small write cost to store the KV state. Subsequent requests that hit cached prefixes avoid recomputation.
| Scenario | No cache | With prefix cache |
|---|---|---|
| Agent 1, turn 1 | 62k tokens (~$0.31) | 62k tokens (~$0.31 + write cost) |
| Agent 2, turn 1 | 62k tokens (~$0.31) | 50k cached + 12k new (~$0.08) |
| Agent 1, turn 2 | 62k tokens (~$0.31) | 60k cached + 2k new (~$0.04) |
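For a fleet of 10 agents sharing a 50k-token global block, the shared block alone costs roughly $500/day uncached vs $50/day cached. At the ~$5 per million input tokens implied by the table above, that corresponds to about 2,000 requests per day across the fleet.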
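The arithmetic behind these estimates can be reproduced under two assumptions that are not stated provider pricing: an input price of $5 per million tokens (implied by 62k tokens costing about $0.31) and cached tokens billed at a 90% discount. The 200 requests per agent per day is likewise a hypothetical figure, chosen because it reproduces the $500/day total. Write cost is ignored.

```python
PRICE_PER_TOKEN = 5.00 / 1_000_000  # assumed: $5 per million input tokens
CACHED_DISCOUNT = 0.90              # assumed: cached tokens cost 10% of normal


def request_cost(new_tokens: int, cached_tokens: int = 0) -> float:
    """Input cost of one request, ignoring the one-time cache write cost."""
    cached_price = PRICE_PER_TOKEN * (1 - CACHED_DISCOUNT)
    return new_tokens * PRICE_PER_TOKEN + cached_tokens * cached_price


print(f"{request_cost(62_000):.3f}")          # no cache
print(f"{request_cost(12_000, 50_000):.3f}")  # Agent 2, turn 1
print(f"{request_cost(2_000, 60_000):.3f}")   # Agent 1, turn 2

# Fleet estimate for the shared 50k-token block only.
AGENTS = 10
REQUESTS_PER_AGENT_PER_DAY = 200  # hypothetical
daily_requests = AGENTS * REQUESTS_PER_AGENT_PER_DAY
print(round(daily_requests * request_cost(50_000)))     # uncached, dollars per day
print(round(daily_requests * request_cost(0, 50_000)))  # cached, dollars per day
```

Under these assumptions the per-request figures come to about $0.31, $0.085, and $0.04, and the fleet figures to about $500 and $50 per day. A different discount or request volume shifts the numbers proportionally.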
Operational considerations
Multi-tenancy: Cache segments must be scoped per-API-key or per-organization to prevent cross-tenant leakage.
Token boundaries: Segment boundaries should align with token boundaries. If a segment splits a token, the server should reject the request or reprocess the overlapping tokens.
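To illustrate the problem, here is a toy check using a greedy longest-match tokenizer over a five-entry vocabulary. Real tokenizers (BPE and similar) behave differently in detail, so treat this purely as a sketch of the idea: a boundary is safe when tokenizing the two segments separately gives the same tokens as tokenizing their concatenation. All names here are hypothetical.

```python
from typing import List, Set


def tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match tokenizer over a toy vocabulary (illustration only)."""
    tokens, i = [], 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return tokens


def boundary_is_token_aligned(left: str, right: str, vocab: Set[str]) -> bool:
    """True if splitting at the segment boundary does not change the token sequence."""
    return tokenize(left, vocab) + tokenize(right, vocab) == tokenize(left + right, vocab)


VOCAB = {"a", "b", "c", "ab", " "}
print(boundary_is_token_aligned("c ", "ab", VOCAB))  # True: "ab" stays one token
print(boundary_is_token_aligned("ca", "b", VOCAB))   # False: the split breaks "ab"
```

In the second case the cached KV state for the first segment ends on a token that would not exist in the joined prompt, which is why the server needs to reject or reprocess.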
Eviction: KV caches consume significant GPU memory. The tree should support a configurable max size and an LRU eviction policy.
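A rough sketch of LRU eviction with a token budget, using `collections.OrderedDict`. It is deliberately flat and ignores parent/child dependencies; a real tree would likely need to evict leaves before their ancestors. The class and method names are hypothetical.

```python
from collections import OrderedDict
from typing import List


class LRUSegmentCache:
    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used_tokens = 0
        self._entries: "OrderedDict[str, int]" = OrderedDict()  # key -> token count

    def touch(self, key: str, tokens: int) -> List[str]:
        """Record a hit or an insert; return the keys evicted to stay in budget."""
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return []
        self._entries[key] = tokens
        self.used_tokens += tokens
        evicted = []
        # Drop least recently used entries, but never the one just inserted.
        while self.used_tokens > self.max_tokens and len(self._entries) > 1:
            old_key, old_tokens = self._entries.popitem(last=False)
            self.used_tokens -= old_tokens
            evicted.append(old_key)
        return evicted


cache = LRUSegmentCache(max_tokens=100)
cache.touch("A", 50)
cache.touch("F", 40)
cache.touch("A", 50)         # hit: A becomes most recently used
print(cache.touch("G", 30))  # ['F']
print(cache.used_tokens)     # 80
```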
Model versioning: The cache key should include the model version and quantization identifier, so KV state is invalidated when the underlying model changes.
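One way to cover both tenant scoping and model versioning is to give each (tenant, model version, quantization) combination its own tree root. The sketch below derives a namespace key for that root; the function name and the example values are hypothetical placeholders.

```python
import hashlib


def cache_namespace(tenant_id: str, model_version: str, quantization: str) -> str:
    """Key that selects the tree root for one tenant and one exact model build."""
    # Join with a separator that should not appear in the fields, so that
    # ("ab", "c") and ("a", "bc") cannot collide.
    raw = "\x1f".join([tenant_id, model_version, quantization])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


org_a = cache_namespace("org-a", "model-v1", "fp8")
org_b = cache_namespace("org-b", "model-v1", "fp8")
org_a_new_model = cache_namespace("org-a", "model-v2", "fp8")

print(org_a != org_b)            # True: tenants never share a tree
print(org_a != org_a_new_model)  # True: a model change starts from a cold cache
```

With separate roots, invalidation on a model change needs no extra step: old trees simply stop being reachable and age out through eviction.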
Suggested rollout
- Add segments[] to the chat completions API. Server hashes content, walks the tree, and reuses KV caches automatically.
- Expose cache telemetry (cached_prefix_tokens in usage) so users can see what's hitting.
- Pre-warm endpoint:
  POST /v1/cache/warm
  {
    "model": "kimi-k2.6",
    "segments": [
      {"id": "global-rules", "hash": "sha256:abc...", "content": "..."}
    ]
  }
  // Response: { "cached_segments": 1, "root_hash": "sha256:abc..." }
- TTL and eviction controls for segments that should expire.
- Named segment registry as a convenience layer — users reference segments by name instead of managing hashes themselves.
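The named registry in the last step could be as simple as a name-to-segment map kept by the client or the server. A minimal sketch, with `SegmentRegistry` and its methods as hypothetical names:

```python
import hashlib
from typing import Dict, List


class SegmentRegistry:
    """Maps human-readable names to segments so callers never handle hashes."""

    def __init__(self) -> None:
        self._by_name: Dict[str, Dict[str, str]] = {}

    def register(self, name: str, content: str) -> str:
        """Store or replace a named segment and return its content hash."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        segment_hash = f"sha256:{digest}"
        self._by_name[name] = {"id": name, "hash": segment_hash, "content": content}
        return segment_hash

    def resolve(self, names: List[str]) -> List[Dict[str, str]]:
        """Return segments[] entries in the given order; unknown names raise KeyError."""
        return [self._by_name[name] for name in names]


registry = SegmentRegistry()
registry.register("global-rules", "placeholder text for block A")
registry.register("backend-agent", "placeholder text for block B")
segments = registry.resolve(["global-rules", "backend-agent"])
print([s["id"] for s in segments])  # ['global-rules', 'backend-agent']
```

Re-registering a name with new content changes its hash, so the next request naturally misses the old branch and grows a new one.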
Bottom line
Multi-agent orchestration is compositional. Agents share global blocks, diverge in specialization, and converge on tasks. OpenCode Go should support this with a segment-based prefix cache built as a hash tree. The minimum viable step: accept segments[] in chat requests, hash each segment, walk the tree for matches, and reuse cached KV state.
Environment: OpenCode Go subscription, models Kimi K2.6, DeepSeek V4, GLM-5, Qwen3.6, MiniMax M2.7.