Intelligent Prompt Enhancement & Token Caching Proxy
A self-hosted, provider-agnostic LLM proxy that cuts costs by 30-60% and latency by 40%+ through aggressive token caching and cache-safe prompt engineering.
- Overview
- Why LayerCache?
- Core Concept: The Layered Prompt Architecture
- Features
- Quick Start
- Usage Examples
- API Reference
- Configuration
- Docker Deployment
- Architecture
- Development
- Documentation
- License
LayerCache sits between your application and LLM providers (Anthropic, OpenAI, Google Gemini). It is a drop-in replacement for your LLM provider's base URL — just point your OpenAI SDK at LayerCache.
In the background, LayerCache:
- Canonicalizes your prompts for byte-for-byte deterministic output (maximizing prefix cache hits)
- Injects provider-specific cache markers at stable layer boundaries
- Truncates long conversations to fit within a token budget (keeping recent turns, dropping old ones)
- Warns when your prefix is too short for provider caching to work
- Applies prompt enhancements (Chain of Thought, few-shot examples, etc.) without breaking the cache
- Caches semantically similar queries to bypass the LLM entirely on repeat requests
- Tracks metrics — token savings, cost reduction, cache hit rates — via Prometheus and a built-in web dashboard
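As a rough illustration of the truncation and prefix-warning steps in the list above, the sketch below keeps the most recent turns that fit a token budget and flags a prefix that falls under the caching threshold. The function names and the 4-characters-per-token estimate are assumptions made for this example, not LayerCache's actual tokenizer or API.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption): roughly 4 characters per token."""
    return max(1, len(text) // 4)


def truncate_recent(history: List[Message], max_tokens: int) -> List[Message]:
    """Keep the newest turns whose combined estimate fits within max_tokens."""
    kept: List[Message] = []
    used = 0
    # Walk from the newest turn backwards and stop at the first turn that overflows.
    for message in reversed(history):
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def prefix_below_threshold(prefix_text: str, threshold: int = 1024) -> bool:
    """True when the prefix is likely too short for provider-side caching."""
    return estimate_tokens(prefix_text) < threshold


history = [
    {"role": "user", "content": "a" * 400},       # ~100 tokens
    {"role": "assistant", "content": "b" * 400},  # ~100 tokens
    {"role": "user", "content": "c" * 400},       # ~100 tokens
]
trimmed = truncate_recent(history, max_tokens=200)  # drops the oldest turn
```

Dropping whole turns from the old end keeps the surviving history contiguous, which is what a "keep recent turns" strategy needs; a score-based strategy would require a different selection rule.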
| Problem | LayerCache Solution |
|---|---|
| Prompt prefix cache misses due to whitespace/ordering differences | Automatic canonicalization ensures identical prompts produce byte-for-byte identical output |
| Adding prompt enhancements (CoT, few-shots) breaks provider caching | Layered architecture (L0-L4) ensures enhancements are injected after the cached prefix |
| No visibility into cache performance or cost savings | Built-in Prometheus metrics and JSON dashboard showing hit rates, tokens saved, and $ saved |
| Different providers have different caching mechanisms | Provider adapters handle Anthropic (ephemeral markers), OpenAI (auto-caching), and Gemini (CachedContent) |
| Repeated similar queries waste tokens and money | Semantic cache with embedding similarity matching bypasses the LLM for near-duplicate queries |
| Long conversations grow an unbounded prefix, reducing cache effectiveness | Automatic L2 session truncation keeps only the last N tokens of conversation history |
| Silent cache misses with no diagnostic | Runtime warning when L0+L1+L2 is below the provider caching threshold (~1024 tokens) |
| Cross-conversation semantic cache hits near zero due to session history in the cache key | Prefix hash redesigned to L0+L1 only — L2 and session_id excluded from the cache key |
| Model names differ between client config and upstream API | Configurable model aliases + automatic upstream model discovery |
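To make the canonicalization row concrete, here is a minimal sketch of how whitespace normalization, JSON minification, and tool sorting can yield identical bytes for equivalent prompts. The `canonicalize` function and the flat `name` key on tools are simplifying assumptions for illustration; the real canonicalizer may apply different rules.

```python
import json
import re
from typing import Any, Dict, List


def normalize_whitespace(text: str) -> str:
    """Unify newlines, strip trailing spaces, collapse runs of blank lines."""
    lines = [line.rstrip() for line in text.replace("\r\n", "\n").split("\n")]
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return collapsed.strip()


def canonicalize(messages: List[Dict[str, str]], tools: List[Dict[str, Any]]) -> bytes:
    """Serialize a prompt so that equivalent inputs produce identical bytes."""
    canonical_messages = [
        {"role": m["role"], "content": normalize_whitespace(m["content"])}
        for m in messages
    ]
    # Sort tools by name so that declaration order does not change the output.
    canonical_tools = sorted(tools, key=lambda tool: tool["name"])
    payload = {"messages": canonical_messages, "tools": canonical_tools}
    # sort_keys plus compact separators gives a minified, deterministic encoding.
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


variant_a = canonicalize(
    [{"role": "system", "content": "You are helpful.  \r\n\r\n\r\n\r\nBe brief.\n"}],
    [{"name": "search", "description": "x"}, {"name": "calc", "description": "y"}],
)
variant_b = canonicalize(
    [{"content": "You are helpful.\n\nBe brief.", "role": "system"}],
    [{"description": "y", "name": "calc"}, {"name": "search", "description": "x"}],
)
```

Both variants differ only in whitespace, key order, and tool order, so they serialize to the same bytes and would share a provider prefix cache entry.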
The key insight behind LayerCache is that prompts have naturally occurring layers with different stability profiles. By enforcing strict separation between these layers, we can optimize caching and enhance prompts without invalidating provider prefix caches.
| Layer | Content | Mutability | Cache Status |
|---|---|---|---|
| L0: System | Core persona, safety rules, output format | Immutable | Cached |
| L1: Context | Domain knowledge, tool definitions, static few-shots | Updated rarely | Cached |
| L2: Session | Conversation history, user preferences | Per session/turn | Cached (short TTL) |
| L3: Enhancement | Dynamic instructions (CoT, RAG, dynamic few-shots) | Per request | Uncached |
| L4: User Input | The actual user query | Dynamic | Uncached |
Cache breakpoints are placed at L0/L1/L2 boundaries. Enhancements are injected at L3, ensuring they never invalidate the stable prefix.
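The property described above can be sketched in a few lines: if the prompt is always assembled in L0 to L4 order and the prefix hash covers only the stable layers, then adding or changing L3 content leaves the hash untouched. The `assemble` and `prefix_hash` helpers are hypothetical names for this illustration, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from typing import Dict, List, Sequence

LAYER_ORDER = ["L0", "L1", "L2", "L3", "L4"]


def assemble(layers: Dict[str, List[str]]) -> List[str]:
    """Flatten layer contents in strict L0 -> L4 order."""
    ordered: List[str] = []
    for layer in LAYER_ORDER:
        ordered.extend(layers.get(layer, []))
    return ordered


def prefix_hash(layers: Dict[str, List[str]], upto: Sequence[str] = ("L0", "L1", "L2")) -> str:
    """Hash only the stable layers; L3 and L4 never contribute."""
    digest = hashlib.sha256()
    for layer in upto:
        for block in layers.get(layer, []):
            # NUL separators keep block boundaries unambiguous.
            digest.update(layer.encode("utf-8") + b"\x00")
            digest.update(block.encode("utf-8") + b"\x00")
    return digest.hexdigest()


base = {
    "L0": ["You are a helpful assistant."],
    "L1": ["Tool definitions and domain notes."],
    "L2": ["user: hi", "assistant: hello"],
    "L4": ["What is the time complexity of quicksort?"],
}
enhanced = dict(base, L3=["Reason step by step before answering."])

same_prefix = prefix_hash(base) == prefix_hash(enhanced)  # L3 does not affect the prefix
prompt = assemble(enhanced)
```

Passing `upto=("L0", "L1")` gives the narrower key that the semantic cache is described as using elsewhere in this README.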
- Prompt Canonicalizer — Whitespace normalization, JSON minification, tool sorting for byte-for-byte deterministic output
- Layered Architecture (L0-L4) — Separates system, context, session, enhancement, and user content so enhancements never invalidate the cached prefix
- Provider Cache Markers — Anthropic cache_control, OpenAI auto-prefix caching, Gemini CachedContent
- Injection at Stable Layers — Markers placed at L0/L1/L2 boundaries; L3/L4 left uncached
- Smart Truncation — Automatically drop old conversation turns to fit within token budgets (recent or important strategies)
- Prefix Threshold Diagnostics — Info-level warning when L0+L1+L2 is below the ~1024-token caching threshold
- Dual Backend Support — SQLite (dev) or Redis (production) with automatic fallback
- Local Embeddings — FastEmbed (BAAI/bge-small-en-v1.5) in ProcessPoolExecutor
- Dual-Key Strategy — Prefix hash (exact) + query embedding (semantic similarity)
- Session Isolation — Prevent cross-session cache pollution with automatic session ID management
- Configurable TTLs — Per-request and default TTLs with automatic cleanup
- Enhancement API — Composable prompt engineering via request metadata
- Suffix Injection — Enhancements injected at L3, never breaking L0-L2 cache
- Dynamic Few-Shot Selector — Embedding-based retrieval of relevant examples
- Prompt Registry — Named, versioned prompt templates (YAML/JSON)
- Analytics Dashboard — Interactive charts for cache hit rates, token savings, latency trends, and cost tracking
- Prometheus + JSON Metrics — Token savings, cost reduction, cache hit rates
- Web Dashboard — Overview charts, per-model breakdown, cache browser, config editor, live log viewer (Jinja2 + HTMX + Chart.js)
- Persistent Time-Series — Metric snapshots in SQLite with background collection loop
- Config Hot-Reload — Update log level, pipeline timeout/retries at runtime without restart
- Universal Routing — LiteLLM-based multi-provider routing with automatic failover
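The dual-key strategy listed above can be pictured as a two-stage lookup: an exact match on the prefix hash narrows the candidates, then cosine similarity on the query embedding picks a near-duplicate. This is a sketch under assumptions: `CacheEntry` and `lookup` are hypothetical names, and the 2-D vectors stand in for real embeddings from the configured embedder.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class CacheEntry:
    prefix_hash: str
    embedding: List[float]
    response: str


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup(
    entries: Sequence[CacheEntry],
    prefix_hash: str,
    query_embedding: Sequence[float],
    threshold: float = 0.95,
) -> Optional[str]:
    """Return the best cached response at or above the threshold, else None."""
    best_response: Optional[str] = None
    best_score = threshold
    for entry in entries:
        if entry.prefix_hash != prefix_hash:
            continue  # exact key: a different prefix never matches
        score = cosine_similarity(entry.embedding, query_embedding)
        if score >= best_score:
            best_score = score
            best_response = entry.response
    return best_response


entries = [CacheEntry("abc", [1.0, 0.0], "cached answer")]
hit = lookup(entries, "abc", [0.99, 0.05])        # near-duplicate query
miss = lookup(entries, "abc", [0.0, 1.0])         # unrelated query
other_prefix = lookup(entries, "zzz", [1.0, 0.0]) # same query, different prefix
```

Requiring the exact prefix match first means a semantically similar question asked under a different system prompt is never served a cached answer written for another persona.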
# Clone the repository
git clone https://github.com/ZeroClue/layercache.git
cd layercache
# Set your API keys
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
# For Redis backend (production):
# export LAYERCACHE_REDIS_URL=redis://localhost:6379/0
# Start the proxy
docker-compose up -d

# Install dependencies
pip install -r requirements.txt
# Set environment variables
export ANTHROPIC_API_KEY=your-key
export OPENAI_API_KEY=your-key
# Run the proxy
uvicorn layercache.main:app --host 0.0.0.0 --port 8000

curl http://localhost:8000/health
# {"status":"healthy","version":"1.5.0","semantic_cache":true}

Open http://localhost:8000/dashboard for the web dashboard (config editor, metrics charts, logs, template CRUD).
Screenshot: Dashboard overview with live metrics
Screenshot: Per-model breakdown with adapter column
Just point your existing OpenAI client at LayerCache. No code changes needed — caching works automatically.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-ant-your-anthropic-key",  # Provider key passed through
)
response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain async/await in Python."},
    ],
)

Add Chain of Thought reasoning without breaking the cache prefix:
response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the time complexity of quicksort?"},
    ],
    extra_body={
        "lc_enhancements": ["chain_of_thought"],
    },
)

Reference a named template from the registry instead of sending L0/L1 with every request:
response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=[
        {"role": "user", "content": "Review this code for bugs."},
    ],
    extra_body={
        "lc_template": "code-assistant",
    },
)

# Skip semantic cache for this request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_body={
        "lc_cache_ttl": 0,  # No semantic caching
        "lc_enhancements": ["self_critique"],
    },
)
# Custom TTL (10 minutes)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_body={
        "lc_cache_ttl": 600,
    },
)

# JSON dashboard
curl http://localhost:8000/v1/cache/metrics
# Prometheus metrics
curl http://localhost:8000/metrics

| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/chat/completions | Chat completions (drop-in OpenAI replacement) |
| POST | /v1/messages | Anthropic Messages API (drop-in Claude Code replacement) |
| GET | /v1/models | List available models |
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /v1/cache/metrics | Cache performance metrics (JSON) |
| GET | /v1/cache/metrics/history | Bucketed time-series for charting |
| GET | /v1/cache/metrics/status | Snapshot age tracking |
| GET | /metrics | Prometheus metrics (text/plain) |
| GET | /v1/prompts/templates | List prompt templates |
| POST | /v1/prompts/templates | Create/update a template |
| DELETE | /v1/prompts/templates/{name} | Delete a template |
| POST | /v1/prompts/reload | Reload templates from disk |
| Method | Endpoint | Description |
|---|---|---|
| GET | /dashboard | Overview with stat cards + charts |
| GET | /dashboard/models | Provider/model table |
| GET | /dashboard/cache | Semantic cache stats + invalidation |
| GET | /dashboard/templates | Prompt template CRUD |
| GET | /dashboard/config | YAML config editor |
| POST | /dashboard/config/save | Save config (HTMX, CSRF-protected) |
| GET | /dashboard/logs | Log tail from ring buffer |
| GET | /dashboard/login | Login form (when proxy key is set) |
| POST | /dashboard/login | Login action |
These fields can be added to any POST /v1/chat/completions request:
| Field | Type | Default | Description |
|---|---|---|---|
| lc_template | string | null | Name of a prompt template to use for L0/L1 |
| lc_enhancements | string[] | [] | Enhancement names to apply at L3 |
| lc_cache_ttl | int | 300 | Semantic cache TTL in seconds (0 = skip) |
| lc_layer_hints | object | null | Explicit index -> layer mapping |
| lc_skip_semantic_cache | bool | false | Skip semantic cache lookup entirely |
| lc_bypass_cache | bool | false | Skip all caching (semantic + provider) |
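The three cache-related fields overlap, so it can help to see one possible way they combine into a single decision. The sketch below is an assumed reading of the table, not the proxy's actual logic: `CachePolicy` and `resolve_cache_policy` are hypothetical names, and the choice that `lc_skip_semantic_cache` skips only the lookup (while the response may still be stored) is a guess.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class CachePolicy:
    semantic_lookup: bool     # consult the semantic cache before calling the LLM
    semantic_store_ttl: int   # seconds to keep the response; 0 means do not store
    provider_markers: bool    # inject provider cache markers into the prompt


def resolve_cache_policy(body: Mapping[str, Any], default_ttl: int = 300) -> CachePolicy:
    """Combine the lc_* cache fields of a request body into one policy."""
    bypass = bool(body.get("lc_bypass_cache", False))
    skip_lookup = bool(body.get("lc_skip_semantic_cache", False))
    ttl = int(body.get("lc_cache_ttl", default_ttl))
    if ttl < 0:
        raise ValueError("lc_cache_ttl must be >= 0")

    semantic_allowed = not bypass and ttl > 0
    return CachePolicy(
        semantic_lookup=semantic_allowed and not skip_lookup,
        semantic_store_ttl=ttl if semantic_allowed else 0,
        provider_markers=not bypass,
    )


default_policy = resolve_cache_policy({})
no_semantic = resolve_cache_policy({"lc_cache_ttl": 0})
bypass_all = resolve_cache_policy({"lc_bypass_cache": True})
skip_lookup_only = resolve_cache_policy({"lc_skip_semantic_cache": True, "lc_cache_ttl": 600})
```

Resolving the flags once, up front, keeps the rest of a request pipeline free of repeated field checks.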
| Name | Description |
|---|---|
| chain_of_thought | Instructs the LLM to reason step-by-step |
| structured_json | Enforces JSON output format (optional schema) |
| self_critique | Instructs the LLM to review and refine its own response |
| dynamic_few_shot | Retrieves relevant few-shot examples from a local vector store |
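To show why enhancements like these can be cache-safe, the sketch below injects the selected instructions immediately before the final user message, so every earlier message stays unchanged. The instruction wording, the `ENHANCEMENTS` mapping, and the use of a `system` role for the injected message are all assumptions for illustration; the real enhancement classes may differ.

```python
from typing import Dict, List, Sequence

Message = Dict[str, str]

# Hypothetical instruction text for two of the enhancement names.
ENHANCEMENTS: Dict[str, str] = {
    "chain_of_thought": "Think through the problem step by step before answering.",
    "self_critique": "Review your draft answer and refine it before responding.",
}


def apply_enhancements(messages: Sequence[Message], names: Sequence[str]) -> List[Message]:
    """Insert enhancement instructions just before the last user message."""
    unknown = [name for name in names if name not in ENHANCEMENTS]
    if unknown:
        raise KeyError(f"unknown enhancements: {unknown}")
    enhanced = list(messages)
    if not names:
        return enhanced
    instructions = "\n".join(ENHANCEMENTS[name] for name in names)
    # Index of the last user message; append at the end if there is none.
    position = max(
        (i for i, m in enumerate(enhanced) if m["role"] == "user"),
        default=len(enhanced),
    )
    enhanced.insert(position, {"role": "system", "content": instructions})
    return enhanced


original = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the time complexity of quicksort?"},
]
result = apply_enhancements(original, ["chain_of_thought", "self_critique"])
```

Because the injected message sits after the stable prefix, the messages before it are identical with or without enhancements, which is the condition a provider prefix cache needs.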
All configuration is done via layercache.yaml. A JSON Schema is provided for IDE autocompletion (VS Code, PyCharm). Regenerate it with layercache-schema:
# yaml-language-server: $schema=./layercache.schema.json
proxy:
host: 0.0.0.0
port: 8000
proxy_api_key: "your-optional-proxy-secret" # Protect the proxy itself
providers:
anthropic:
api_key_env: ANTHROPIC_API_KEY # Env var holding the key
openai:
api_key_env: OPENAI_API_KEY
gemini:
api_key_env: GOOGLE_API_KEY
deepseek:
api_key_env: DEEPSEEK_API_KEY # Any LiteLLM provider works
# adapter: openai # Override cache strategy (auto-detected if unset)
opencode-go:
api_key_env: OPENCODE_ZEN_API_KEY
base_url: https://opencode.ai/zen/go/v1
# No model_aliases needed — Go models use the same names client sends
caching:
semantic:
enabled: true
backend: "sqlite" # or "redis" for production
db_path: /data/semantic_cache.db
redis_url: "redis://localhost:6379/0" # Redis backend URL (optional)
redis_pool_size: 20 # Redis connection pool size
default_ttl: 300 # 5 minutes
similarity_threshold: 0.95 # Cosine similarity for semantic cache
embedder: "BAAI/bge-small-en-v1.5"
session_id_header: X-Session-ID # Header for opt-in session isolation
max_session_tokens: 2000 # Optional: truncate L2 to keep within token budget
prefix_hash_max_tokens: 250 # Max L0 tokens for cache key (excludes per-project context)
truncation_strategy: "recent" # or "important" (score-based)
metrics:
db_path: /data/metrics.db # Time-series snapshot storage
snapshot_interval_seconds: 60 # Background snapshot interval
snapshot_retention_days: 7 # Snapshot retention
enhancements:
registered:
- name: chain_of_thought
- name: structured_json
- name: self_critique
- name: dynamic_few_shot
config:
vector_store: /data/few_shots/examples.json
top_k: 3

When a proxy sits between the client and the upstream API, model names sent by the client may differ from the names the upstream accepts. LayerCache supports two resolution mechanisms:
- Explicit aliases (model_aliases in ProviderConfig): Maps a client-side model name to an upstream name. Use this when an upstream renames or versions a model (e.g., qwen3.5-plus → qwen-3.5-plus-v2).
- Auto-discovery: On startup, LayerCache fetches GET /v1/models from each configured provider's base_url and builds a reverse index. If a requested model name isn't in the upstream list but matches a single ID by prefix (e.g., deepseek-v4-flash matches deepseek-v4-flash-free on Zen), it resolves automatically.
Auto-discovery only matches the same model — it does not redirect to a different product. deepseek-v4-flash (paid, Zero Retention on Go) and deepseek-v4-flash-free (free, data-collecting on Zen) are different models with different privacy guarantees; auto-discovery would not map between them for a subscription-based provider.
providers:
  opencode:
    api_key_env: OPENCODE_ZEN_API_KEY
    base_url: https://opencode.ai/zen/v1
    model_aliases:
      my-custom-model-name: qwen3.5-plus

Auto-discovery requires api_key_env to be set so the model list endpoint can be authenticated. If both an explicit alias and an auto-discovered match exist, the explicit alias takes precedence.
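The precedence rules can be summarized in a short sketch: explicit alias first, then an exact upstream match, then a prefix match only when it is unambiguous. `resolve_model` is a hypothetical helper written for this example; it does not model the provider-specific exclusions mentioned above, and the real resolver works from a reverse index built at startup.

```python
from typing import Iterable, Mapping


def resolve_model(
    requested: str,
    aliases: Mapping[str, str],
    upstream_models: Iterable[str],
) -> str:
    """Map a client-side model name to the name the upstream accepts."""
    # 1. An explicit alias always wins.
    if requested in aliases:
        return aliases[requested]
    upstream = list(upstream_models)
    # 2. The name is already valid upstream.
    if requested in upstream:
        return requested
    # 3. Auto-discovery: accept a prefix match only when exactly one ID matches.
    matches = [model for model in upstream if model.startswith(requested)]
    if len(matches) == 1:
        return matches[0]
    # Ambiguous or unknown: pass the name through unchanged.
    return requested


aliased = resolve_model(
    "my-custom-model-name",
    {"my-custom-model-name": "qwen3.5-plus"},
    ["qwen3.5-plus"],
)
discovered = resolve_model("deepseek-v4-flash", {}, ["deepseek-v4-flash-free", "qwen3.5-plus"])
ambiguous = resolve_model("deepseek", {}, ["deepseek-a", "deepseek-b"])
```

Refusing to guess when several IDs share the prefix avoids silently routing a request to a model the caller did not ask for.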
| Variable | Description | Required |
|---|---|---|
| ANTHROPIC_API_KEY | Anthropic API key | If using Anthropic |
| OPENAI_API_KEY | OpenAI API key | If using OpenAI |
| GOOGLE_API_KEY | Google Gemini API key | If using Gemini |
| (custom) | Any env var name per providers.{name}.api_key_env in config | Depends on config |
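The indirection in the last row (the config names an environment variable rather than holding the key itself) might be resolved along these lines. `resolve_api_key` is a hypothetical helper, shown only to illustrate the lookup; the injectable `environ` parameter is an assumption that makes the function easy to test.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    provider_config: Mapping[str, str],
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Read the key from the env var named by api_key_env; None if absent or empty."""
    env_name = provider_config.get("api_key_env")
    if not env_name:
        return None
    value = environ.get(env_name, "").strip()
    return value or None


fake_env = {"DEEPSEEK_API_KEY": "test-key", "EMPTY_KEY": "  "}
found = resolve_api_key({"api_key_env": "DEEPSEEK_API_KEY"}, fake_env)
missing = resolve_api_key({"api_key_env": "OPENAI_API_KEY"}, fake_env)
blank = resolve_api_key({"api_key_env": "EMPTY_KEY"}, fake_env)
```

Keeping only the variable name in the YAML file means the config can be committed or mounted read-only without exposing secrets.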
# Build and start
docker-compose up -d
# View logs
docker-compose logs -f layercache
# Stop
docker-compose down

| Host Path | Container Path | Purpose |
|---|---|---|
| ./data | /data | Persistent storage (cache DB, templates, examples) |
| ./layercache.yaml | /app/layercache.yaml | Configuration file (read-only) |
Client Application
│
▼
┌──────────────────────────────────────┐
│ LayerCache Proxy │
│ ┌────────────────────────────────┐ │
│ │ Request Pipeline │ │
│ │ 1. Semantic Cache Lookup │ │
│ │ 2. Stratify (L0→L4) │ │
│ │ 3. Canonicalize │ │
│ │ 3b. Truncate Session │ │
│ │ 3c. Prefix Threshold Check │ │
│ │ 4. Enhance (L3 injection) │ │
│ │ 5. Inject Cache Markers │ │
│ │ 6. Route via LiteLLM │ │
│ │ 7. Handle Response │ │
│ │ 8. Store & Record Metrics │ │
│ └────────────────────────────────┘ │
│ │
│ ┌──────────┐ ┌────────┐ ┌────────┐ │
│ │ Semantic │ │ Prompt │ │Metrics │ │
│ │ Cache │ │Registry│ │Collector│ │
│ └──────────┘ └────────┘ └────────┘ │
└──────────────────────────────────────┘
│ │ │
▼ ▼ ▼
Anthropic OpenAI Gemini
- Python 3.11+
- pip
# Create virtual environment
python -m venv .venv
source .venv/bin/activate
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=layercache --cov-report=term-missing

# All tests
pytest tests/ -v
# Specific test file
pytest tests/test_stratifier.py -v
# With verbose output and short tracebacks
pytest tests/ -v --tb=short

# Lint and format
ruff check layercache/
ruff format layercache/
# Type checking
mypy layercache/

layercache/
├── layercache/ # Core package
│ ├── main.py # FastAPI application
│ ├── pipeline.py # Request processing pipeline
│ ├── models.py # Pydantic data models
│ ├── stratifier.py # L0-L4 message classification
│ ├── canonicalizer.py # Prompt normalization
│ ├── config.py # YAML configuration
│ ├── schema.py # JSON Schema generator for IDE autocompletion
│ ├── adapters/ # Provider cache marker injection
│ │ ├── anthropic.py # Anthropic cache_control
│ │ ├── anthropic_messages.py # /v1/messages wire-format shim
│ │ ├── openai.py # OpenAI auto-caching
│ │ └── gemini.py # Gemini CachedContent
│ ├── enhancements/ # Cache-safe prompt enhancements
│ │ ├── base.py # BaseEnhancement ABC
│ │ ├── chain_of_thought.py # Step-by-step reasoning
│ │ ├── structured_output.py # JSON format enforcement
│ │ ├── self_critique.py # Self-review injection
│ │ └── dynamic_few_shot.py # Vector-based example retrieval
│ ├── cache/ # Semantic caching
│ │ ├── semantic.py # SQLite-backed cache
│ │ └── embedder.py # FastEmbed wrapper
│ ├── dashboard/ # Web dashboard (Jinja2 + HTMX)
│ │ ├── router.py # Dashboard routes
│ │ └── templates/ # Jinja2 templates
│ ├── metrics/ # Observability
│ │ ├── collector.py # Prometheus + ROI tracking
│ │ └── storage.py # Persistent time-series snapshots
│ ├── static/ # Dashboard assets
│ └── registry/ # Prompt template management
│ └── prompt_registry.py # YAML/JSON template loader
├── tests/ # Test suite (117 tests)
├── data/ # Sample data
│ ├── prompts/ # Prompt templates
│ └── few_shots/ # Few-shot examples
├── docs/ # Documentation
│ ├── PRD.md # Product Requirements
│ ├── TDD.md # Technical Design
│ ├── IMPLEMENTATION_PLAN.md # Sprint plan
│ ├── ARCHITECTURE.md # Architecture deep-dive
│ ├── DEPLOYMENT.md # Deployment guide
│ ├── USER_GUIDE.md # User guide
│ └── API.md # API reference
├── Dockerfile # Production image
├── docker-compose.yml # Docker Compose config
├── layercache.yaml # Default configuration
├── pyproject.toml # Python project config
├── layercache.schema.json # JSON Schema for IDE autocompletion
└── requirements.txt # Dependencies
| Document | Description |
|---|---|
| PRD | Product Requirements Document |
| TDD | Technical Design Document |
| Implementation Plan | 8-sprint development roadmap |
| Architecture | System architecture deep-dive |
| Roadmap | Prioritized future development plan |
| Deployment Guide | Production deployment instructions |
| User Guide | Comprehensive usage guide |
| API Reference | Full API documentation |
| Contributing | How to contribute, setup, and PR process |
| CHANGELOG | Version history and changes |
| Document | Description |
|---|---|
| Redis Setup | Production Redis configuration and tuning |
| Migration Guide | Zero-downtime migration from SQLite to Redis |
| Load Test Report | Performance benchmarks and throughput analysis |
Built with OpenCode Go — fork, automate, ship.
This project is licensed under the MIT License. See LICENSE for the full text.
What this means:
- ✅ You can freely use, copy, modify, and distribute this software
- ✅ You can use it for commercial and private purposes
- ✅ You must include a copy of the license and copyright notice
- ⚠️ The software is provided "as-is" without warranty
Third-party code: Vendored libraries (Chart.js, HTMX) are included under their own licenses in THIRD_PARTY_NOTICES.md.
For more information, visit https://opensource.org/licenses/MIT
