feat: opt-in OpenRouter Response Caching for compiler retry path #39

@pitimon

Description

Problem

`openkb add` retries `compile_*_doc` once when it fails (the document hash is registered only after compilation succeeds — see `cli.add_single_file`). On retry, every LLM call (summary, plan, N+M concept pages) runs again with identical prompts. Without OpenRouter's Response Caching, the retry re-bills every token.

The same applies to repeated `openkb lint` runs and to dev iteration, where the same compile is re-run against the same document.

Proposal

Add an opt-in config flag that enables OpenRouter Response Caching by sending `X-OpenRouter-Cache: true` (and an optional `X-OpenRouter-Cache-TTL`) on every LiteLLM call from the compiler. When the request is identical (same model, messages, params), OpenRouter returns the cached response in 80–300 ms with zero token billing (docs).

Config

`.openkb/config.yaml`:

```yaml
response_cache: true       # default: false
response_cache_ttl: 600    # optional, seconds (1-86400; OpenRouter default: 300)
```

Behaviour

  • Default OFF — opt-in, to avoid surprises on KBs holding sensitive content (response caching stores responses on OpenRouter's side, which conflicts with a strict ZDR posture).
  • Headers are only emitted when the model starts with `openrouter/`. For direct Anthropic/OpenAI/etc. calls they would have no effect; skipping them avoids confusing reviewers and stray bytes on the wire.
  • Headers are passed via LiteLLM's standard `extra_headers` kwarg.
  • This complements #38 (feat(compiler): add `cache_control` breakpoints for Anthropic prompt caching). They are orthogonal: prompt caching reduces cost on the cached prefix of each call; response caching skips the model call entirely on identical-payload re-runs.

Scope

  • `compile_short_doc` and `compile_long_doc` only (the only direct LiteLLM callers).
  • Out of scope: query, chat, linter — those go through the OpenAI Agents SDK; threading custom headers through it is a separate, larger change.

Privacy guard

Default OFF. Document in the PR body, plus a brief note in CLAUDE.md / docs, that enabling it stores responses on OpenRouter. KBs with classified content (e.g. ISMS data) should leave it disabled or use `X-OpenRouter-Cache-Clear` per call.

Test plan

  • Unit: `_response_cache_headers` returns `{}` when disabled and when the model is non-OpenRouter, and the right dict when enabled (with and without TTL).
  • Integration: with `response_cache: true` in config and `model="openrouter/..."`, `litellm.completion` is called with `extra_headers={"X-OpenRouter-Cache": "true"}`.
  • Regression: with the flag off (default), no `extra_headers` is passed (existing behaviour).
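The integration and regression cases can be checked without the network by stubbing `litellm.completion` with a `Mock` — a sketch, where `call_llm` stands in for the compiler's call site and is not a real openkb function:

```python
from unittest import mock

def call_llm(completion_fn, model, messages, extra_headers=None):
    """Pass extra_headers only when non-empty, so the flag-off path keeps
    the existing call shape (no stray kwarg)."""
    kwargs = {"model": model, "messages": messages}
    if extra_headers:
        kwargs["extra_headers"] = extra_headers
    return completion_fn(**kwargs)

# Integration-style check: the stub records the kwargs LiteLLM would receive.
fake = mock.Mock()
call_llm(fake, "openrouter/openai/gpt-4o", [{"role": "user", "content": "hi"}],
         {"X-OpenRouter-Cache": "true"})
fake.assert_called_once_with(
    model="openrouter/openai/gpt-4o",
    messages=[{"role": "user", "content": "hi"}],
    extra_headers={"X-OpenRouter-Cache": "true"},
)

# Regression-style check: flag off -> no extra_headers kwarg at all.
fake.reset_mock()
call_llm(fake, "openrouter/openai/gpt-4o", [])
assert "extra_headers" not in fake.call_args.kwargs
```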

Depends on

#38 (uses the `**kwargs` symmetry fix on `_llm_call_async`).
