[FEATURE]: Segment-Based Prefix Caching for Multi-Agent Orchestration

### Feature hasn't been suggested before.

- [x] I have verified this feature I'm about to request hasn't been suggested before.

### Describe the enhancement you want to request


> Drafted by Kimi K2.6

**Note:** This is distinct from client-side response caching (#25974), which caches full query-response pairs locally. This request is about **server-side prefix caching**: reusing the LLM's internal KV cache when different requests share the beginning of their prompt. The two are complementary.

## What this is about

When an LLM processes a prompt, it builds an internal state called a **KV cache** (the attention keys and values computed for each token). If the next request starts with exactly the same token sequence, the server can reuse that state instead of recomputing it from scratch. This cuts both latency and cost.

In multi-agent setups, different agents often share the same starting blocks:

| Block | What's in it | How often it changes | Typical size |
|-------|-------------|----------------------|--------------|
| A — Global | System rules, tool schemas, output format | Rarely | 10k–50k tokens |
| B, C — Agent | Agent skills, project memory, codebase summary | Occasionally | 5k–20k tokens |
| D, E — Task | Current file, user query, scratchpad | Every turn | 2k–10k tokens |

**Example:**
- Agent 1 (backend): A + B + D
- Agent 2 (frontend): A + C + E
- Agent 3 (docs): A + B + E

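All three start with block A. Without prefix caching, each agent pays full price to process A on every request. With it, Agents 2 and 3 reuse the KV state that Agent 1's request cached for A, and Agent 3 also reuses the cached A+B. Assuming cached tokens are billed at roughly a 90% discount, shared blocks become ~90% cheaper after the first request.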

## How the cache should work

### Segments are identified by hash

Each block of text gets a hash (e.g. SHA-256 of its content). This hash identifies the segment.
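As a rough illustration of the hashing step, here is a minimal Python sketch. The function name `segment_hash` and the `sha256:<hex>` string format are assumptions taken from the example payloads in this request, not an existing API.

```python
import hashlib


def segment_hash(content: str) -> str:
    """Return a content hash in the `sha256:<hex>` form used in this request."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


rules = "You are a careful coding assistant."
print(segment_hash(rules))
# Any edit to the content, even trailing whitespace, yields a different hash.
print(segment_hash(rules) == segment_hash(rules + " "))  # False
```

Hashing the UTF-8 bytes keeps the result stable across clients, as long as they agree on exact content (line endings, trailing whitespace).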

### The cache is a tree

The tree has levels. Level 1 is the first segment, level 2 is the second segment, and so on.

```
Level 1 (first segment):
  hash(A) ──→ KV state after processing A
  hash(F) ──→ KV state after processing F

Level 2 (second segment, branching under each level-1 node):
  under hash(A):
    hash(B) ──→ KV state after processing A+B
    hash(C) ──→ KV state after processing A+C
  under hash(F):
    hash(G) ──→ KV state after processing F+G

Level 3:
  under hash(A)→hash(B):
    hash(D) ──→ KV state after processing A+B+D
    hash(E) ──→ KV state after processing A+B+E
```

**To serve a request, the server walks the tree:**
1. Check level 1 for `hash(segment_1)`. If found, load that KV state.
2. Check under that node at level 2 for `hash(segment_2)`. If found, extend.
3. Continue until a segment is missing. Process remaining segments from scratch, then add the new branch to the tree.

**Why a tree:** A flat hash of the entire prefix (`hash(A+B+D)`) would miss partial overlaps. With a tree, a request for A+B+E reuses the `hash(A)→hash(B)` branch even though the full prefix differs.
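The walk described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `Node`, `PrefixTree`, and `serve` are hypothetical names, and the KV state is a placeholder string rather than real tensors.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    kv_state: Optional[str] = None  # placeholder for the real KV tensors
    children: dict = field(default_factory=dict)  # segment hash -> Node


class PrefixTree:
    def __init__(self) -> None:
        self.root = Node()

    def serve(self, hashes: list) -> int:
        """Walk the tree, grow it on a miss, return how many leading segments hit."""
        node = self.root
        hits = 0
        for h in hashes:
            child = node.children.get(h)
            if child is None:
                # Miss: "process" the segment and add the new branch.
                # A fresh node has no children, so every later segment also misses.
                child = Node(kv_state=f"kv-after-{h}")
                node.children[h] = child
            else:
                hits += 1
            node = child
        return hits


tree = PrefixTree()
print(tree.serve(["hash(A)", "hash(B)", "hash(D)"]))  # 0: cold cache
print(tree.serve(["hash(A)", "hash(B)", "hash(E)"]))  # 2: reuses A and A+B
print(tree.serve(["hash(A)", "hash(C)", "hash(E)"]))  # 1: reuses A only
```

The third call shows why position matters: `hash(E)` already exists under `hash(A)→hash(B)`, but it is not reused under `hash(A)→hash(C)` because the KV state depends on everything before it.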

### API

```json
POST /v1/chat/completions
{
  "model": "kimi-k2.6",
  "segments": [
    {"id": "global-rules", "hash": "sha256:abc...", "content": "..."},
    {"id": "backend-agent", "hash": "sha256:def...", "content": "..."}
  ],
  "messages": [
    {"role": "user", "content": "Refactor auth"}
  ]
}
```

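The server walks the tree using the segment hashes. It should recompute (or at least verify) each hash from `content` rather than trust the client-supplied value, so that a wrong hash cannot map to another segment's KV state. If all segments hit, it only processes the user message. If a segment misses, it processes that segment and grows the tree.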

**Response:**
```json
{
  "usage": {
    "input_tokens": 62000,
    "output_tokens": 1500,
    "cached_prefix_tokens": 60000
  }
}
```
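From the client side, building such a request and reading the telemetry could look roughly like the sketch below. The `segments` field and `cached_prefix_tokens` are the proposed additions from this request, not an existing API, and the helper names are made up for illustration.

```python
import hashlib
import json


def make_segment(segment_id: str, content: str) -> dict:
    """Build one entry for the proposed `segments` array."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return {"id": segment_id, "hash": f"sha256:{digest}", "content": content}


def build_request(model: str, segments: list, user_message: str) -> dict:
    """Assemble the request body in the shape proposed in this issue."""
    return {
        "model": model,
        "segments": segments,
        "messages": [{"role": "user", "content": user_message}],
    }


def cache_hit_ratio(usage: dict) -> float:
    """Fraction of input tokens served from the prefix cache."""
    input_tokens = usage.get("input_tokens", 0)
    if input_tokens == 0:
        return 0.0
    return usage.get("cached_prefix_tokens", 0) / input_tokens


body = build_request(
    "kimi-k2.6",
    [make_segment("global-rules", "..."), make_segment("backend-agent", "...")],
    "Refactor auth",
)
print(json.dumps(body, indent=2))

usage = {"input_tokens": 62000, "output_tokens": 1500, "cached_prefix_tokens": 60000}
print(f"{cache_hit_ratio(usage):.1%}")  # 96.8%
```

Tracking this ratio per agent would make it easy to spot a segment that changes more often than expected and keeps breaking the prefix.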

## Who this helps

- **Multi-agent frameworks** (LangGraph, CrewAI): parallel agents sharing global context get cache hits automatically.
- **IDE plugins**: codebase context blocks are re-hashed on file change. Unchanged files reuse the cache across sessions, as long as the segments placed before them are unchanged too.
- **Enterprise platforms**: company-wide compliance guides are pre-cached once and shared by all employee agents.
- **Power users**: stable system prompt parts are cached across all conversations.

## Rough impact estimate

The first request to a cold cache incurs a small write cost to store the KV state. Subsequent requests that hit cached prefixes avoid recomputation.

| Scenario | No cache | With prefix cache |
|----------|----------|-------------------|
| Agent 1, turn 1 | 62k tokens (~$0.31) | 62k tokens (~$0.31 + write cost) |
| Agent 2, turn 1 | 62k tokens (~$0.31) | 50k cached + 12k new (~$0.08) |
| Agent 1, turn 2 | 62k tokens (~$0.31) | 60k cached + 2k new (~$0.04) |

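For a fleet of 10 agents sharing a 50k-token global block: at the rate implied above (62k tokens ≈ $0.31), the block alone costs about $0.25 per request, so at roughly 2,000 requests per day across the fleet that is about $500/day uncached vs $50/day with a ~90% cached discount.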
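The arithmetic behind these estimates can be reproduced with a small sketch. Both constants are assumptions: the $5 per million input tokens is inferred from "62k tokens ≈ $0.31", and the cached rate of 10% of full price reflects the "~90% cheaper" figure. Write costs are ignored, and actual provider pricing may differ.

```python
PRICE_PER_MTOK = 5.00         # assumed: inferred from 62k tokens ≈ $0.31
CACHED_PRICE_FRACTION = 0.10  # assumed: cached tokens billed at ~10% of full price


def request_cost(new_tokens: int, cached_tokens: int = 0) -> float:
    """Input cost in dollars for one request, ignoring cache write costs."""
    full = new_tokens * PRICE_PER_MTOK / 1_000_000
    cached = cached_tokens * PRICE_PER_MTOK * CACHED_PRICE_FRACTION / 1_000_000
    return full + cached


scenarios = [
    ("No cache, 62k tokens", 62_000, 0),
    ("Agent 2, turn 1", 12_000, 50_000),
    ("Agent 1, turn 2", 2_000, 60_000),
]
for label, new, cached in scenarios:
    print(f"{label}: ${request_cost(new, cached):.3f}")
```

Treat the outputs as order-of-magnitude figures: they shift directly with whatever cached-read rate and write surcharge end up applying.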

## Operational considerations

**Multi-tenancy:** Cache segments must be scoped per-API-key or per-organization to prevent cross-tenant leakage.

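**Token boundaries:** Segment boundaries should align with token boundaries. Tokenizers can merge characters across a boundary, so tokenizing segments separately may not match tokenizing the concatenated prompt. If a boundary splits a token, the server should reject the request or reprocess the overlapping tokens.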

**Eviction:** KV caches consume significant GPU memory. The tree should support a configurable max size and an LRU eviction policy.
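One possible shape for the eviction policy is sketched below. It is a simplification: `LruKvStore` is a hypothetical name, sizes are counted in tokens instead of GPU bytes, and each cached prefix is keyed by its tuple of segment hashes. Touching the deepest prefix first means an ancestor is always at least as recent as its descendants, so plain LRU order evicts leaves before the nodes they build on.

```python
from collections import OrderedDict


class LruKvStore:
    """Token-budgeted LRU over cached prefixes (path tuple -> token count)."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used_tokens = 0
        self.entries = OrderedDict()

    def access(self, path: tuple, segment_tokens: list) -> list:
        """Record use of every prefix of `path`; return the prefixes evicted."""
        # Deepest prefix first, so the root segment ends up most recently used.
        for depth in range(len(path), 0, -1):
            prefix = path[:depth]
            if prefix in self.entries:
                self.entries.move_to_end(prefix)
            else:
                self.entries[prefix] = segment_tokens[depth - 1]
                self.used_tokens += segment_tokens[depth - 1]
        evicted = []
        while self.used_tokens > self.max_tokens and self.entries:
            victim, size = self.entries.popitem(last=False)  # least recently used
            self.used_tokens -= size
            evicted.append(victim)
        return evicted


store = LruKvStore(max_tokens=100)
print(store.access(("A", "B"), [50, 20]))  # []
print(store.access(("A", "C"), [50, 20]))  # []
print(store.access(("A", "D"), [50, 20]))  # [('A', 'B')]
```

In the last call the budget is exceeded, and the oldest leaf `('A', 'B')` is dropped while the shared root `('A',)` stays cached.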

**Model versioning:** The cache key should include the model version and quantization identifier, so KV state is invalidated when the underlying model changes.
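The multi-tenancy and versioning points could both be handled by giving each (tenant, model version, quantization) combination its own tree root. A minimal sketch, where `cache_namespace` and the example identifiers are placeholders rather than real values:

```python
import hashlib


def cache_namespace(tenant_id: str, model_version: str, quantization: str) -> str:
    """Derive a root key so trees never mix tenants, model versions, or quantizations."""
    # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide.
    parts = [tenant_id, model_version, quantization]
    material = "".join(f"{len(p)}:{p}" for p in parts)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


org_a = cache_namespace("org-a", "model-v1", "int8")
org_b = cache_namespace("org-b", "model-v1", "int8")
upgraded = cache_namespace("org-a", "model-v2", "int8")

print(org_a != org_b)     # True: tenants never share a tree
print(org_a != upgraded)  # True: a model change starts from a cold tree
```

With this scheme a model upgrade needs no explicit invalidation: old trees simply stop receiving hits and age out through eviction.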

## Suggested rollout

1. **Add `segments[]` to the chat completions API.** Server hashes content, walks the tree, and reuses KV caches automatically.
2. **Expose cache telemetry** (`cached_prefix_tokens` in usage) so users can see what's hitting.
3. **Pre-warm endpoint:**
   ```json
   POST /v1/cache/warm
   {
     "model": "kimi-k2.6",
     "segments": [
       {"id": "global-rules", "hash": "sha256:abc...", "content": "..."}
     ]
   }
   // Response: { "cached_segments": 1, "root_hash": "sha256:abc..." }
   ```
4. **TTL and eviction controls** for segments that should expire.
5. **Named segment registry** as a convenience layer — users reference segments by name instead of managing hashes themselves.

## Bottom line

Multi-agent orchestration is compositional. Agents share global blocks, diverge in specialization, and converge on tasks. OpenCode Go should support this with a segment-based prefix cache built as a hash tree. The minimum viable step: accept `segments[]` in chat requests, hash each segment, walk the tree for matches, and reuse cached KV state.

**Environment:** OpenCode Go subscription, models Kimi K2.6, DeepSeek V4, GLM-5, Qwen3.6, MiniMax M2.7.
