Problem
The OpenKB compiler reuses a "base context A" (system + document) across N+M+2 LLM calls per document (summary → concepts-plan → N create + M update concept pages). Without `cache_control` markers, every call re-bills the full document content as input tokens.
For Anthropic Sonnet 4.5 (via OpenRouter or direct), prompt caching can cut input cost by ~90% on the cached prefix and reduce TTFT. The minimum cacheable prefix is 1,024 tokens, easily exceeded by typical document content.
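As a rough illustration (hypothetical document size; the 1.25× cache-write and 0.1× cache-read multipliers are Anthropic's published pricing for the default 5-minute ephemeral cache): with a 20k-token document and N+M+2 = 12 calls, uncached input is ~240k billed tokens. With caching, the first call writes the prefix (~25k token-equivalents) and the remaining 11 calls read it (~22k), roughly an 80% reduction before counting per-call suffixes.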
Proposal
Add `cache_control: {"type": "ephemeral"}` markers at two breakpoints in `openkb/agent/compiler.py`:
- End of `doc_msg` in `compile_short_doc` + `compile_long_doc` — caches system + doc for all downstream calls (summary, plan, every concept).
- End of assistant summary message in `_compile_concepts` (3 call sites: plan, create, update) — caches system + doc + summary for all concept generation calls.
Two breakpoints, well within Anthropic's max-4 limit.
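A minimal sketch of the first breakpoint, assuming `doc_msg` is currently a plain user message dict (the helper name `build_doc_msg` and the exact message shape are assumptions; the block-list format is what Anthropic's prompt caching expects):

```python
# Sketch: mark the end of the document message as a cache breakpoint, so the
# system prompt + document are served from cache on every downstream call.
def build_doc_msg(document_text: str) -> dict:
    return {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": document_text,
                # Caches everything up to and including this block (~5 min TTL).
                "cache_control": {"type": "ephemeral"},
            }
        ],
    }
```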
Compatibility
- Anthropic / OpenRouter→Anthropic: `cache_control` honored.
- OpenAI: list-of-blocks content format is valid (Vision API uses it); `cache_control` silently ignored.
- Other providers: LiteLLM normalizes/strips unknown fields.
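If stripping ever proves unreliable for some provider, a belt-and-braces variant (hypothetical helper, not part of this patch) would gate the marker on the model name:

```python
def maybe_cache(block: dict, model: str) -> dict:
    # Attach the marker only for models known to honor it; all other
    # providers get the plain text block. The match below is a heuristic.
    if "claude" in model or model.startswith("anthropic/"):
        return {**block, "cache_control": {"type": "ephemeral"}}
    return block
```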
Side fix
`_llm_call_async` currently does not forward `**kwargs` while `_llm_call` does (asymmetry noted in memory #82886). Add `**kwargs` for parity.
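A sketch of the parity fix, assuming `_llm_call_async` wraps `litellm.acompletion` the same way `_llm_call` wraps `litellm.completion` (the actual signatures in `compiler.py` may carry more parameters):

```python
import litellm

async def _llm_call_async(model: str, messages: list, **kwargs):
    # Forward **kwargs exactly as the sync _llm_call does, so both paths
    # accept the same per-call options (temperature, metadata, etc.).
    return await litellm.acompletion(model=model, messages=messages, **kwargs)
```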
Out of scope
- OpenRouter Response Caching (`X-OpenRouter-Cache: true`) — different mechanism, evaluated separately.
- Refactoring messages into a dedicated builder module — keep patch surgical.
Test plan
- Existing pytest suite passes (mocks accept `*args, **kwargs`).
- New assertion: the completion payload contains a `cache_control` block on `doc_msg` (see the sketch after this list).
- Manual smoke test against a real Anthropic key: observe `cached_tokens` in `prompt_tokens_details` on calls 2..N.
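A sketch of the new assertion, assuming the suite patches `litellm` inside the compiler module via `unittest.mock` (the fixture name, the message index, and the entry-point call shape are assumptions):

```python
from unittest.mock import patch

from openkb.agent.compiler import compile_short_doc

def test_doc_msg_carries_cache_control(fake_doc):
    with patch("openkb.agent.compiler.litellm") as mock_litellm:
        compile_short_doc(fake_doc)
        _, call_kwargs = mock_litellm.completion.call_args
        # doc_msg is assumed to sit at messages[1], after the system message.
        doc_blocks = call_kwargs["messages"][1]["content"]
        assert any(
            block.get("cache_control") == {"type": "ephemeral"}
            for block in doc_blocks
        )
```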
References
- Memory observation S11144 (3-5 line patch feasibility, OpenKB compiler audit).
- CLAUDE.md compiler architecture: "Designed around prompt-cache reuse: a single base context A reused across summary → concept-plan → concept-page calls."