Description
Add an `ai-cache` plugin that provides two-layer caching for LLM API responses:
- L1 exact-match: hash-based lookup on the prompt content; deterministic, with no embedding or LLM call
- L2 semantic: embedding-based vector similarity search when L1 misses, catching semantically equivalent queries (e.g., "how to return an item" ≈ "what is the return policy")
This is one of the highest-ROI capabilities for AI Gateway — industry reports show 30-60% cost and latency savings for common workloads (FAQ bots, document Q&A, translation).
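To make the flow concrete, here is a minimal Python sketch of the intended lookup path, not a proposed implementation (the production plugin would be Lua/OpenResty inside APISIX). The key layout and the `embed_fn` / `knn_fn` callables are illustrative assumptions:

```python
import hashlib
import json

import redis  # stand-in client; the real plugin would use APISIX's Lua Redis client

r = redis.Redis()

def make_cache_key(messages, consumer=None, extra_vars=None):
    """Build the L1 key; consumer / extra_vars mirror include_consumer / include_vars."""
    material = {"messages": messages, "consumer": consumer, "vars": extra_vars or {}}
    digest = hashlib.sha256(json.dumps(material, sort_keys=True).encode()).hexdigest()
    return f"ai-cache:exact:{digest}"      # hypothetical key layout

def lookup(messages, embed_fn, knn_fn, consumer=None, extra_vars=None,
           similarity_threshold=0.95, exact_ttl=3600):
    """Two-layer lookup: exact hash first, then embedding similarity."""
    key = make_cache_key(messages, consumer, extra_vars)

    # L1: exact match -- deterministic, no embedding call needed
    cached = r.get(key)
    if cached is not None:
        return cached, "HIT-L1"

    # L2: semantic match -- embed the prompt, then vector similarity search
    vector = embed_fn(messages)            # e.g. OpenAI text-embedding-3-small
    hit, similarity = knn_fn(vector)       # e.g. KNN search against Redis Stack
    if hit is not None and similarity >= similarity_threshold:
        r.set(key, hit, ex=exact_ttl)      # backfill L1 so the next identical query is exact
        return hit, "HIT-L2"

    return None, "MISS"
```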
Background:
All major AI gateway products already ship semantic caching: Kong (`ai-semantic-cache`), LiteLLM, Portkey, Higress (`ai-cache`), Helicone. APISIX has `proxy-cache` for generic HTTP caching but nothing that understands LLM prompt semantics.
Proposed design (for reference, open to adjustment during implementation):
```yaml
plugins:
  ai-cache:
    layers: [exact, semantic]
    cache_key:
      include_consumer: false   # default shared; set true for multi-tenant isolation
      include_vars: []           # additional variables for cache key scoping
    exact:
      ttl: 3600
      match_fields: [messages]
    semantic:
      vector_backend: redis      # Redis Stack with RediSearch module
      similarity_threshold: 0.95
      top_k: 1
      ttl: 86400
      embedding:
        provider: openai
        model: text-embedding-3-small
        endpoint: "https://api.openai.com/v1/embeddings"
        auth_ref: "$secret://..."
    bypass_on:
      - header: "X-Cache-Bypass"
        equals: "1"
    headers:
      cache_status: "X-AI-Cache-Status"        # HIT-L1 / HIT-L2 / MISS / BYPASS
      cache_similarity: "X-AI-Cache-Similarity"
      cache_age: "X-AI-Cache-Age"
```
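Assuming the header names from the config above, a client could observe cache behavior or force a bypass like this (the gateway URL, route, and payload are placeholders):

```python
import requests

GATEWAY = "http://127.0.0.1:9080/v1/chat/completions"   # placeholder route
payload = {"model": "gpt-4o-mini",
           "messages": [{"role": "user", "content": "How do I return an item?"}]}

# Normal request: the plugin answers from cache when possible
resp = requests.post(GATEWAY, json=payload)
print(resp.headers.get("X-AI-Cache-Status"))       # HIT-L1 / HIT-L2 / MISS
print(resp.headers.get("X-AI-Cache-Similarity"))   # only meaningful on HIT-L2

# Sensitive prompt: skip the cache entirely via the configured bypass header
resp = requests.post(GATEWAY, json=payload, headers={"X-Cache-Bypass": "1"})
print(resp.headers.get("X-AI-Cache-Status"))       # BYPASS
```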
Key design points:
- L1 hash → L2 embedding dual-layer (industry consensus, not either/or)
- Cache key isolation is user-configured: the default is shared (maximizes hit rate for public-knowledge scenarios); multi-tenant or RAG scenarios explicitly enable `include_consumer` or `include_vars`
- An L2 hit backfills L1, so the next identical query resolves at L1
- Cache writes happen only on a complete, successful upstream response (2xx)
- Streaming responses: accumulate chunks, then write on completion; cached replay preserves the SSE contract (sketched below)
- Redis Stack as the vector backend (APISIX already depends on Redis for the `limit-*` plugins; Redis Stack just adds the RediSearch module); see the sketch after this list
- Response headers expose cache status for debugging and client-side logic
- Prometheus metrics: `apisix_ai_cache_hits_total{layer}`, `apisix_ai_cache_misses_total`, `apisix_ai_cache_embedding_latency_seconds`
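A rough sketch of how the L2 layer could sit on Redis Stack's RediSearch module, written with redis-py for readability (the real plugin would go through a Lua Redis client); the index name, key prefix, and field names are assumptions:

```python
import hashlib

import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis()
INDEX = "ai_cache_semantic"       # hypothetical index name
PREFIX = "ai-cache:sem:"          # hypothetical key prefix
DIM = 1536                        # text-embedding-3-small dimension

def ensure_index():
    try:
        r.ft(INDEX).create_index(
            fields=[
                TextField("response"),
                VectorField("embedding", "HNSW",
                            {"TYPE": "FLOAT32", "DIM": DIM, "DISTANCE_METRIC": "COSINE"}),
            ],
            definition=IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH),
        )
    except redis.ResponseError:
        pass  # index already exists

def store(vector, response, ttl=86400):
    """Write a cached response plus its embedding, with the semantic-layer TTL."""
    key = PREFIX + hashlib.sha256(response.encode()).hexdigest()
    r.hset(key, mapping={
        "embedding": np.asarray(vector, dtype=np.float32).tobytes(),
        "response": response,
    })
    r.expire(key, ttl)

def knn_search(vector, top_k=1):
    """Return the closest cached response and its cosine similarity."""
    q = (Query(f"*=>[KNN {top_k} @embedding $vec AS score]")
         .sort_by("score")
         .return_fields("response", "score")
         .dialect(2))
    res = r.ft(INDEX).search(
        q, query_params={"vec": np.asarray(vector, dtype=np.float32).tobytes()})
    if not res.docs:
        return None, 0.0
    doc = res.docs[0]
    return doc.response, 1.0 - float(doc.score)   # COSINE metric returns distance, not similarity
```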
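For streaming, the bullet above amounts to teeing upstream SSE chunks to the client while buffering them, writing to the cache only when the stream finishes cleanly, and replaying stored chunks with the same framing on a hit. A rough sketch with callback stand-ins for the proxy plumbing:

```python
def proxy_stream(upstream_chunks, send_to_client, cache_write):
    """Tee SSE chunks to the client; cache only after a complete stream.

    Assumes the upstream already returned a 2xx status before streaming began.
    """
    buffered = []
    completed = False
    for chunk in upstream_chunks:          # e.g. b"data: {...}\n\n" frames
        send_to_client(chunk)
        buffered.append(chunk)
        if chunk.strip().endswith(b"[DONE]"):
            completed = True
    if completed:                          # incomplete streams are never cached
        cache_write(b"".join(buffered))

def replay_stream(cached_body, send_to_client):
    """Replay a cached streaming response, preserving the SSE contract."""
    for frame in cached_body.split(b"\n\n"):
        if frame:
            send_to_client(frame + b"\n\n")
```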
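The proposed metric names map onto standard Prometheus types roughly as follows (shown with prometheus_client purely for illustration; in APISIX they would be exported through the existing prometheus plugin):

```python
from prometheus_client import Counter, Histogram

AI_CACHE_HITS = Counter(
    "apisix_ai_cache_hits_total", "AI cache hits by layer", ["layer"])   # layer = exact | semantic
AI_CACHE_MISSES = Counter("apisix_ai_cache_misses_total", "AI cache misses")
EMBEDDING_LATENCY = Histogram(
    "apisix_ai_cache_embedding_latency_seconds", "Latency of embedding API calls")

# Usage at the relevant points in the request path:
AI_CACHE_HITS.labels(layer="exact").inc()
AI_CACHE_MISSES.inc()
with EMBEDDING_LATENCY.time():
    pass  # call the embedding provider here
```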
Typical use cases:
| Scenario | Recommended config |
| --- | --- |
| Public FAQ / translation | Default (shared cache), max ROI |
| Multi-tenant SaaS | `include_consumer: true` |
| RAG with different retrieval contexts | `include_consumer: true` + `include_vars: [retrieval scope vars]` |
| Sensitive prompts | Use `bypass_on` to skip caching |
Happy to submit a PR if this direction makes sense.