Problem

`openkb add` retries `compile_*_doc` once when it fails (the document hash is registered only after compile succeeds — see `cli.add_single_file`). On retry, every LLM call (summary, plan, N+M concept pages) runs again with identical prompts. Without OpenRouter's Response Caching, the retry rebills every token.
The same applies to repeated `openkb lint` runs and dev iteration, where the same compile is re-run against the same document.
Proposal
Add an opt-in config flag that enables OpenRouter Response Caching by sending `X-OpenRouter-Cache: true` (and an optional `X-OpenRouter-Cache-TTL`) on every LiteLLM call from the compiler. When a request is identical (same model, messages, params), OpenRouter returns the cached response in 80–300 ms with zero token billing (docs).
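Config

In `.openkb/config.yaml` — a minimal sketch of the flag; `response_cache: true` matches the test plan below, while the TTL key name is an assumption:

```yaml
# Opt-in OpenRouter response caching (default: off)
response_cache: true
# Optional cache TTL in seconds (hypothetical key name)
response_cache_ttl: 3600
```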
Behaviour

Default OFF — opt-in, to avoid surprises on KBs holding sensitive content (response caching stores responses on OpenRouter, which conflicts with a strict ZDR posture).
Headers are only emitted when `model` starts with `openrouter/`. For direct Anthropic/OpenAI/etc., the headers would have no effect; skipping them avoids confusing reviewers and stray bytes on the wire.
Headers are passed via LiteLLM's standard `extra_headers` kwarg.
Scope

`compile_short_doc` and `compile_long_doc` only (the only direct LiteLLM callers).
Out of scope: `query`, `chat`, `linter` — those go through the OpenAI Agents SDK; threading custom headers through it is a separate, larger change.
Privacy guard
Default OFF. Document in the PR body, with a brief note in CLAUDE.md / docs, that enabling it stores responses on OpenRouter. KBs with classified content (e.g. ISMS data) should leave it disabled or use `X-OpenRouter-Cache-Clear` per call.
Test plan
Unit: `_response_cache_headers` returns `{}` when disabled, `{}` when the model is non-OpenRouter, and the right dict when enabled (with and without TTL).
Integration: with `response_cache: true` in config and `model="openrouter/..."`, `litellm.completion` is called with `extra_headers={"X-OpenRouter-Cache": "true"}`.
Regression: with the flag off (default), no `extra_headers` is passed (existing behaviour).
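The integration case can be exercised with a mock in place of LiteLLM — a sketch, with `compile_call` standing in for the real compiler call site (hypothetical name and shape):

```python
from unittest.mock import MagicMock

fake_litellm = MagicMock()

def compile_call(model, messages, config, llm=fake_litellm):
    """Stand-in for the compiler's LLM call site (hypothetical)."""
    extra_headers = {}
    if config.get("response_cache") and model.startswith("openrouter/"):
        extra_headers = {"X-OpenRouter-Cache": "true"}
    return llm.completion(model=model, messages=messages, extra_headers=extra_headers)

# Enabled + OpenRouter model: the cache header must reach litellm.completion.
compile_call("openrouter/foo", [{"role": "user", "content": "hi"}], {"response_cache": True})
```

The real test would monkeypatch `litellm.completion` and call the actual compiler entry points instead.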
Depends on
#38 (uses the `**kwargs` symmetry fix on `_llm_call_async`).