[FEATURE]: Segment-Based Prefix Caching for Multi-Agent Orchestration #26288

@Prathyushmnchla

Feature hasn't been suggested before.

  • I have verified this feature I'm about to request hasn't been suggested before.

Describe the enhancement you want to request

Drafted by Kimi K2.6

Note: Distinct from client-side response caching (#25974). That caches full query-response pairs locally. This is about server-side prefix caching — reusing the LLM's internal KV cache when different requests share the beginning of their prompt. Both are complementary.

What this is about

When an LLM processes a prompt, it builds an internal state called a KV cache. If the next request starts with the exact same text, the model can reuse that state instead of recomputing it from scratch. This saves time and money.

In multi-agent setups, different agents often share the same starting blocks:

| Block | What's in it | How often it changes | Typical size |
| --- | --- | --- | --- |
| A (Global) | System rules, tool schemas, output format | Rarely | 10k–50k tokens |
| B, C (Agent) | Agent skills, project memory, codebase summary | Occasionally | 5k–20k tokens |
| D, E (Task) | Current file, user query, scratchpad | Every turn | 2k–10k tokens |

Example:

  • Agent 1 (backend): A + B + D
  • Agent 2 (frontend): A + C + E
  • Agent 3 (docs): A + B + E

All three start with block A. Without prefix caching, every agent pays full price to process A on every request. With it, Agents 2 and 3 reuse the KV state that Agent 1's request cached for A, and Agent 3 also reuses the cached A + B. Shared blocks become roughly 90% cheaper after the first request that caches them.

How the cache should work

Segments are identified by hash

Each block of text gets a hash (e.g. SHA-256 of its content). This hash identifies the segment.
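As a minimal sketch of the hashing step, assuming UTF-8 encoding and the `sha256:<hex>` string form that appears in the request examples (the function name `segment_hash` is hypothetical):

```python
import hashlib

def segment_hash(content: str) -> str:
    """Return a content hash in the 'sha256:<hex>' form used in this proposal."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"

rules = "You are a careful coding assistant."
print(segment_hash(rules))
# Any change to the text, even trailing whitespace, yields a different segment.
print(segment_hash(rules) == segment_hash(rules + " "))  # False
```

Because the hash covers the exact bytes, clients would need to keep whitespace and encoding stable for a segment to hit.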

The cache is a tree

The tree has levels. Level 1 is the first segment, level 2 is the second segment, and so on.

Level 1 (first segment):
  hash(A) ──→ KV state after processing A
  hash(F) ──→ KV state after processing F

Level 2 (second segment, branching under each level-1 node):
  under hash(A):
    hash(B) ──→ KV state after processing A+B
    hash(C) ──→ KV state after processing A+C
  under hash(F):
    hash(G) ──→ KV state after processing F+G

Level 3:
  under hash(A)→hash(B):
    hash(D) ──→ KV state after processing A+B+D
    hash(E) ──→ KV state after processing A+B+E

To serve a request, the server walks the tree:

  1. Check level 1 for hash(segment_1). If found, load that KV state.
  2. Check under that node at level 2 for hash(segment_2). If found, extend.
  3. Continue until a segment is missing. Process remaining segments from scratch, then add the new branch to the tree.

Why a tree: A flat hash of the entire prefix, hash(A+B+D), would miss partial overlaps. With a tree, a request for A+B+E reuses the hash(A)→hash(B) branch even though the full prefix differs.
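The walk described above can be sketched as a small trie keyed by segment hash. This is an illustration only: `Node`, `PrefixTree`, `longest_match`, and `insert` are hypothetical names, and strings stand in for real KV tensors.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

def segment_hash(content: str) -> str:
    return "sha256:" + hashlib.sha256(content.encode("utf-8")).hexdigest()

@dataclass
class Node:
    kv_state: Any = None  # stand-in for the KV state after this prefix
    children: Dict[str, "Node"] = field(default_factory=dict)

class PrefixTree:
    def __init__(self) -> None:
        self.root = Node()

    def longest_match(self, hashes: List[str]) -> Tuple[int, Any]:
        """Return (leading segments found, KV state at the deepest hit)."""
        node, depth = self.root, 0
        for h in hashes:
            child = node.children.get(h)
            if child is None:
                break  # first miss: everything after this is processed from scratch
            node, depth = child, depth + 1
        return depth, node.kv_state

    def insert(self, hashes: List[str], states: List[Any]) -> None:
        """Store states[i] as the KV state for the prefix hashes[:i + 1]."""
        node = self.root
        for h, state in zip(hashes, states):
            # setdefault keeps an existing node, so shared prefixes are not overwritten
            node = node.children.setdefault(h, Node(kv_state=state))

tree = PrefixTree()
A, B, D, E = (segment_hash(s) for s in ("A", "B", "D", "E"))
tree.insert([A, B, D], ["kv(A)", "kv(A+B)", "kv(A+B+D)"])
print(tree.longest_match([A, B, E]))  # (2, 'kv(A+B)')
```

The request for A+B+E hits two levels and only E needs fresh computation, which is the partial overlap a flat hash of the whole prefix would miss.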

API

POST /v1/chat/completions
{
  "model": "kimi-k2.6",
  "segments": [
    {"id": "global-rules", "hash": "sha256:abc...", "content": "..."},
    {"id": "backend-agent", "hash": "sha256:def...", "content": "..."}
  ],
  "messages": [
    {"role": "user", "content": "Refactor auth"}
  ]
}

The server walks the tree using the segment hashes. If all segments hit, it only processes the user message. If a segment misses, it processes that segment and grows the tree.
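A client might assemble the proposed request body along these lines. The `segments` field is the one proposed in this issue and does not exist in any current API; `make_segment` and `build_request` are hypothetical helpers, and the segment contents are placeholders.

```python
import hashlib
import json
from typing import Dict, List, Tuple

def make_segment(segment_id: str, content: str) -> Dict[str, str]:
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return {"id": segment_id, "hash": f"sha256:{digest}", "content": content}

def build_request(model: str, segments: List[Tuple[str, str]], user_message: str) -> dict:
    """Build the proposed body; segments should be ordered most stable first."""
    return {
        "model": model,
        "segments": [make_segment(sid, text) for sid, text in segments],
        "messages": [{"role": "user", "content": user_message}],
    }

body = build_request(
    "kimi-k2.6",
    [("global-rules", "<global rules text>"), ("backend-agent", "<backend agent text>")],
    "Refactor auth",
)
print(json.dumps(body, indent=2))
```

Order matters: putting the rarely changing block first keeps the longest possible prefix shared across agents.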

Response:

{
  "usage": {
    "input_tokens": 62000,
    "output_tokens": 1500,
    "cached_prefix_tokens": 60000
  }
}
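A caller could turn the proposed usage block into a hit ratio as sketched below. This assumes `input_tokens` counts cached and uncached tokens together, which is how the example numbers read; `cache_hit_ratio` is a hypothetical helper.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of input tokens served from the prefix cache."""
    total = usage.get("input_tokens", 0)
    cached = usage.get("cached_prefix_tokens", 0)
    return cached / total if total else 0.0

usage = {"input_tokens": 62000, "output_tokens": 1500, "cached_prefix_tokens": 60000}
print(f"{cache_hit_ratio(usage):.1%}")  # 96.8%
```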

Who this helps

  • Multi-agent frameworks (LangGraph, CrewAI): parallel agents sharing global context get cache hits automatically.
  • IDE plugins: codebase context blocks are hashed on file change. Unchanged files reuse cache across sessions.
  • Enterprise platforms: company-wide compliance guides are pre-cached once and shared by all employee agents.
  • Power users: stable system prompt parts are cached across all conversations.

Rough impact estimate

The first request to a cold cache incurs a small write cost to store the KV state. Subsequent requests that hit cached prefixes avoid recomputation.

| Scenario | No cache | With prefix cache |
| --- | --- | --- |
| Agent 1, turn 1 | 62k tokens (~$0.31) | 62k tokens (~$0.31 + write cost) |
| Agent 2, turn 1 | 62k tokens (~$0.31) | 50k cached + 12k new (~$0.08) |
| Agent 1, turn 2 | 62k tokens (~$0.31) | 60k cached + 2k new (~$0.04) |

For a fleet of 10 agents sharing a 50k-token global block, that is roughly $500/day without caching versus roughly $50/day with it.
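The per-request arithmetic can be reproduced with a short sketch. The rates are assumptions, not published prices: about $5 per million input tokens (inferred from 62k tokens costing about $0.31) and a 90% discount on cached tokens. The one-time cache write cost is ignored, and the rounded figures in the table may differ slightly from these results.

```python
PRICE_PER_TOKEN = 5.00 / 1_000_000  # assumed input price, inferred from 62k tokens ~ $0.31
CACHED_DISCOUNT = 0.90              # assumed discount for tokens served from cache

def input_cost(new_tokens: int, cached_tokens: int = 0) -> float:
    """Input cost in dollars for one request, excluding any cache write cost."""
    cached_rate = PRICE_PER_TOKEN * (1 - CACHED_DISCOUNT)
    return new_tokens * PRICE_PER_TOKEN + cached_tokens * cached_rate

print(round(input_cost(62_000), 3))          # 0.31   (no cache)
print(round(input_cost(12_000, 50_000), 3))  # 0.085  (Agent 2, turn 1)
print(round(input_cost(2_000, 60_000), 3))   # 0.04   (Agent 1, turn 2)
```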

Operational considerations

Multi-tenancy: Cache segments must be scoped per-API-key or per-organization to prevent cross-tenant leakage.

Token boundaries: Segment boundaries should align with token boundaries. If a segment splits a token, the server should reject the request or reprocess the overlapping tokens.
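One way to picture the check: tokenize the concatenated prompt and flag any segment boundary that does not fall on a token start. The tokenizer below is a toy stand-in (a word plus its trailing whitespace), not any model's real tokenizer, and all names are hypothetical.

```python
import re
from itertools import accumulate
from typing import List, Tuple

def toy_token_spans(text: str) -> List[Tuple[int, int]]:
    """Toy tokenizer: each token is a word plus trailing whitespace."""
    return [m.span() for m in re.finditer(r"\S+\s*|\s+", text)]

def split_boundaries(segments: List[str]) -> List[int]:
    """Return character offsets of segment boundaries that land inside a token."""
    text = "".join(segments)
    token_starts = {start for start, _ in toy_token_spans(text)}
    # Cumulative lengths give each boundary; the last one is the end of the text.
    boundaries = list(accumulate(len(s) for s in segments))[:-1]
    return [b for b in boundaries if b not in token_starts]

print(split_boundaries(["Follow the rules. ", "Refactor auth"]))  # []
print(split_boundaries(["Follow the ru", "les."]))                # [13]
```

A server could reject a request with a non-empty result, or fall back to reprocessing the tokens around each flagged offset.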

Eviction: KV caches consume significant GPU memory. The tree should support a configurable max size and an LRU eviction policy.
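A size-bounded LRU policy could look roughly like this. It is a flat sketch keyed by an opaque string such as the full hash path; a real tree would likely also need to evict leaves before their parents. `LRUSegmentCache` and its methods are hypothetical.

```python
from collections import OrderedDict
from typing import Any, Optional

class LRUSegmentCache:
    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries: "OrderedDict[str, tuple]" = OrderedDict()  # key -> (state, size)

    def get(self, key: str) -> Optional[Any]:
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def put(self, key: str, kv_state: Any, size_bytes: int) -> None:
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        self._entries[key] = (kv_state, size_bytes)
        self.used_bytes += size_bytes
        # Evict least recently used entries until the cache fits its budget.
        while self.used_bytes > self.max_bytes and self._entries:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size

cache = LRUSegmentCache(max_bytes=100)
cache.put("A", "kv(A)", 60)
cache.put("A/B", "kv(A+B)", 30)
cache.get("A")                   # touch A so A/B becomes the oldest entry
cache.put("A/C", "kv(A+C)", 30)  # over budget: evicts A/B
print(cache.get("A/B"), cache.used_bytes)  # None 90
```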

Model versioning: The cache key should include the model version and quantization identifier, so KV state is invalidated when the underlying model changes.
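Tenant scoping and model versioning can both be handled by deriving the tree's root key from those fields, so that a change in any of them lands in a separate tree. The sketch below is one possible shape; the field names and the example values ("org-1", "2025-01", "fp8") are made up for illustration.

```python
import hashlib

def root_cache_key(tenant_id: str, model: str, model_version: str, quantization: str) -> str:
    """Derive the key under which one tenant's tree for one model build is stored."""
    # The unit-separator character keeps adjacent fields from running together.
    scope = "\x1f".join([tenant_id, model, model_version, quantization])
    return hashlib.sha256(scope.encode("utf-8")).hexdigest()

key_a = root_cache_key("org-1", "kimi-k2.6", "2025-01", "fp8")
key_b = root_cache_key("org-2", "kimi-k2.6", "2025-01", "fp8")
key_c = root_cache_key("org-1", "kimi-k2.6", "2025-01", "int4")
print(key_a != key_b, key_a != key_c)  # True True
```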

Suggested rollout

  1. Add segments[] to the chat completions API. Server hashes content, walks the tree, and reuses KV caches automatically.
  2. Expose cache telemetry (cached_prefix_tokens in usage) so users can see what's hitting.
  3. Pre-warm endpoint:
    POST /v1/cache/warm
    {
      "model": "kimi-k2.6",
      "segments": [
        {"id": "global-rules", "hash": "sha256:abc...", "content": "..."}
      ]
    }
    // Response: { "cached_segments": 1, "root_hash": "sha256:abc..." }
  4. TTL and eviction controls for segments that should expire.
  5. Named segment registry as a convenience layer — users reference segments by name instead of managing hashes themselves.
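The named registry in step 5 could be as small as a mapping from name to hashed segment, sketched here on the client side. `SegmentRegistry`, `register`, and `resolve` are hypothetical names, and a server-side registry would presumably also apply the per-tenant scoping discussed earlier.

```python
import hashlib
from typing import Dict, List

class SegmentRegistry:
    def __init__(self) -> None:
        self._segments: Dict[str, Dict[str, str]] = {}

    def register(self, name: str, content: str) -> str:
        """Store a segment under a name and return its content hash."""
        digest = "sha256:" + hashlib.sha256(content.encode("utf-8")).hexdigest()
        self._segments[name] = {"id": name, "hash": digest, "content": content}
        return digest

    def resolve(self, names: List[str]) -> List[Dict[str, str]]:
        """Expand names, in order, into entries for a segments[] list."""
        return [self._segments[name] for name in names]

registry = SegmentRegistry()
registry.register("global-rules", "<global rules text>")
registry.register("backend-agent", "<backend agent text>")
segments = registry.resolve(["global-rules", "backend-agent"])
print([s["id"] for s in segments])  # ['global-rules', 'backend-agent']
```

Re-registering a name with new content changes its hash, so requests that reference the name pick up the new segment and simply miss the old cache entry.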

Bottom line

Multi-agent orchestration is compositional. Agents share global blocks, diverge in specialization, and converge on tasks. OpenCode Go should support this with a segment-based prefix cache built as a hash tree. The minimum viable step: accept segments[] in chat requests, hash each segment, walk the tree for matches, and reuse cached KV state.

Environment: OpenCode Go subscription, models Kimi K2.6, DeepSeek V4, GLM-5, Qwen3.6, MiniMax M2.7.
