Give AI agents a cache for repetition: verified recipes on Cloudflare, escalation for everything new.
AI agents repeatedly spend powerful-model turns rediscovering work that is already understood: inspecting status, summarizing a known public test result, or finding public documentation. cache-layer makes the difference between that repeated work and genuinely new work visible.
The repository includes an executable pi extension, a real local recipe proof, and an honest safety benchmark. If work requires edits, private data, external effects, or new judgment, it escalates instead of pretending.
request
→ approved read-only recipe match?
yes → verify and route cheaply
no → escalate to the frontier model
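The flow above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the repository's implementation: the `Recipe` and `Decision` shapes, the `normalize` helper, and the exact-phrase matching are all assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; not the repository's actual types.
type Recipe = { id: string; risk: "read_only"; handles: string[] };
type Decision =
  | { kind: "hit"; recipeId: string }
  | { kind: "escalate"; reason: string };

// Normalize so differences in case, spacing, or trailing punctuation still match.
function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ").replace(/[?.!]+$/, "");
}

// Return a hit only for an exact (normalized) phrase match; everything else escalates.
function route(request: string, approved: Recipe[]): Decision {
  const wanted = normalize(request);
  for (const recipe of approved) {
    if (recipe.handles.some((handle) => normalize(handle) === wanted)) {
      return { kind: "hit", recipeId: recipe.id };
    }
  }
  return { kind: "escalate", reason: "no approved read-only recipe match" };
}

const approved: Recipe[] = [
  {
    id: "git-status-summary",
    risk: "read_only",
    handles: ["summarize my git status", "what changed locally?"],
  },
];

console.log(route("Summarize my git status", approved));
console.log(route("refactor the authentication architecture", approved));
```

Escalation is the default branch here: a request has to earn a hit, and anything unrecognized falls through to the frontier model.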
The proof site deploys on Cloudflare Workers. The local semantic-router benchmark uses Ollama with public/synthetic inputs; Workers AI is the next hosted verifier experiment, not a shipped dependency of the proof site.
git clone https://github.com/acoyfellow/cache-layer
cd cache-layer
bun install
bun run dev
Open the local URL printed by Wrangler to inspect the implementation proof and benchmark evidence.
Run the verification loop:
bun run verify # typecheck, tests, Worker dry-run build
bun run smoke # headless HTTP checks against a local Worker
bun run bench # reproducible deterministic routing benchmark
bun run record # record a browser proof video to recordings/
This repo includes a reproducible synthetic benchmark for a narrow but important question: can a router recover approved read-only recipe hits without incorrectly handling work that must escalate?
Measured locally on an Apple M4 Pro with Ollama. Local models ran three shuffled repetitions after warm-up (138 decisions each); the deterministic gate ran five repetitions (230 decisions).
| Router | Correct | Approved hits recovered | Unsafe false hits | Median latency | p95 latency |
|---|---|---|---|---|---|
| Deterministic policy/index | 175 / 230 (76.1%) | 35 / 90 (38.9%) | 0 / 140 | < 0.1 ms | < 0.1 ms |
| Ollama gpt-oss:20b | 84 / 138 (60.9%) | 0 / 54 (0.0%) | 0 / 84 | 1,254.5 ms | 2,025.9 ms |
| Ollama qwen3-coder:30b | 132 / 138 (95.7%) | 54 / 54 (100.0%) | 6 / 84 | 158.8 ms | 213.0 ms |
The evidence is useful precisely because it is not flattering: qwen3-coder:30b recovered intended hits, but also produced unsafe false hits on novel/unbounded requests. A local model therefore cannot be the safety boundary. The architecture must keep deterministic policy in front of semantic routing.
This is routing evidence, not proof of coding quality or premium-token savings. See docs/benchmarks.md for method, failure cases, raw results, and reproduction commands. A pre-publication simulated red-team review and resolved concerns are summarized in docs/red-team.md.
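To make the table's columns concrete, here is a small scoring sketch. It is an assumption about how such counts could be derived, not the code in docs/benchmarks.md: the `Outcome` and `Score` types are hypothetical, and the per-decision outcomes are synthetic, rebuilt only from the totals in the qwen3-coder:30b row.

```typescript
type Expected = "hit" | "escalate";
type Outcome = { expected: Expected; actual: Expected };

type Score = {
  correct: number;
  total: number;
  hitsRecovered: number;
  approvedTotal: number;
  unsafeFalseHits: number;
  escalateTotal: number;
};

// Tally the three quantities the table reports for one router.
function score(outcomes: Outcome[]): Score {
  const s: Score = {
    correct: 0,
    total: outcomes.length,
    hitsRecovered: 0,
    approvedTotal: 0,
    unsafeFalseHits: 0,
    escalateTotal: 0,
  };
  for (const o of outcomes) {
    if (o.expected === o.actual) s.correct++;
    if (o.expected === "hit") {
      s.approvedTotal++;
      if (o.actual === "hit") s.hitsRecovered++;
    } else {
      s.escalateTotal++;
      // A hit on work that should have escalated is the unsafe case.
      if (o.actual === "hit") s.unsafeFalseHits++;
    }
  }
  return s;
}

function repeat(count: number, outcome: Outcome): Outcome[] {
  return Array.from({ length: count }, () => ({ ...outcome }));
}

const pct = (n: number, d: number): string => `${((100 * n) / d).toFixed(1)}%`;

// Synthetic reconstruction from the table totals (not the raw results file).
const qwenLike: Outcome[] = [
  ...repeat(54, { expected: "hit", actual: "hit" }),
  ...repeat(78, { expected: "escalate", actual: "escalate" }),
  ...repeat(6, { expected: "escalate", actual: "hit" }),
];

const result = score(qwenLike);
console.log(`${result.correct} / ${result.total} (${pct(result.correct, result.total)})`);
console.log(`unsafe false hits: ${result.unsafeFalseHits} / ${result.escalateTotal}`);
```

The same tally shows why overall accuracy alone is misleading: the row with the highest "Correct" score is also the only one with a nonzero unsafe count.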
The repository also contains a minimal pi extension at extensions/cache-layer/ and one executable read-only recipe: git-status-summary.
Against this public repository, a real in-memory pi AgentSession was prompted with:
summarize my git status
The extension executed git status --short --branch, produced a local result, and the session contained zero assistant/frontier-model messages:
| Prompt | Local recipe result | Frontier assistant messages | Elapsed time |
|---|---|---|---|
| summarize my git status | hit | 0 | 305.3 ms |
Raw proof: benchmarks/results/pi-extension-public-repo.json.
That proves one real, narrow avoidance path. It does not yet prove net token savings across realistic workloads: prompts expected to escalate have deliberately not been run, and therefore not paid for, until a controlled upstream baseline is designed.
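For a sense of what a local summary of `git status --short --branch` can involve, the sketch below parses that command's text output into counts. It is illustrative only and is not the extension's code: the `StatusSummary` shape and `summarizeShortStatus` function are hypothetical, and the parser covers only the common cases of the short format (a `## ` branch header, then two status columns per path).

```typescript
type StatusSummary = {
  branch: string;
  staged: number;
  unstaged: number;
  untracked: number;
  clean: boolean;
};

// Parse the text produced by `git status --short --branch`.
// Running git itself is left to the caller; this function only reads the output.
function summarizeShortStatus(output: string): StatusSummary {
  const summary: StatusSummary = {
    branch: "(unknown)",
    staged: 0,
    unstaged: 0,
    untracked: 0,
    clean: true,
  };
  for (const line of output.split("\n")) {
    if (line.length === 0) continue;
    if (line.startsWith("## ")) {
      // Header looks like "## main...origin/main [ahead 1]"; keep the local branch name.
      const head = line.slice(3);
      const upstreamAt = head.indexOf("...");
      summary.branch = upstreamAt === -1 ? head : head.slice(0, upstreamAt);
      continue;
    }
    // Column 1 is the index (staged) state, column 2 the working-tree state.
    const index = line.charAt(0);
    const worktree = line.charAt(1);
    if (index === "?" && worktree === "?") {
      summary.untracked++;
    } else {
      if (index !== " ") summary.staged++;
      if (worktree !== " ") summary.unstaged++;
    }
  }
  summary.clean = summary.staged + summary.unstaged + summary.untracked === 0;
  return summary;
}

const sample = [
  "## main...origin/main [ahead 1]",
  " M README.md",
  "A  src/new.ts",
  "?? notes.txt",
  "MM both.ts",
].join("\n");

console.log(summarizeShortStatus(sample));
```

Because the function reads text and touches nothing, it fits the read-only boundary: no files are modified and no remote calls are made.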
Reproduce it:
bun run bench:pi
Click Deploy to Cloudflare above, or deploy from a checkout:
bun install
bun run deploy
The app uses:
| Primitive | Purpose |
|---|---|
| Workers | Public proof site and health endpoint |
| Observability | Deployed Worker request visibility |
The initial release intentionally avoids persistence, user repository access, and a hosted prompt box. The executable behavior belongs in the local agent extension; the deployed site publishes proof and architecture.
0.0.1 proves one narrow claim:
A real pi extension can complete one bounded read-only recipe without a frontier-model turn, while the routing benchmark demonstrates why semantic matching cannot own the safety boundary.
Example safe recipe:
id: git-status-summary
risk: read_only
handles:
- summarize my git status
- what changed locally?
proof:
- runs read-only git inspection
- modifies no files
- makes no remote calls
fallback: escalate
Example escalation:
refactor the authentication architecture
→ ESCALATE
novel, write-capable, or judgment-heavy work is outside the read-only cache boundary
This project is deliberately conservative.
| Request type | 0.0.1 behavior |
|---|---|
| Public, read-only, objectively bounded workflow | May route to a verified recipe |
| Public or dummy output summarization | May route locally |
| Source edits, commits, deploys, comments, external actions | Escalate |
| Private repositories, non-public code, secrets, customer or personal data | Escalate |
| Architecture or security judgment | Escalate |
| No high-confidence match | Escalate |
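The escalation rows in this table lend themselves to a deterministic check. The sketch below is one hedged way to express them, assuming simple keyword rules: the `DENY_RULES` list and `checkPolicy` function are hypothetical, and the patterns are deliberately small examples rather than a complete or production-ready policy.

```typescript
type PolicyResult = { allowed: true } | { allowed: false; reason: string };

// Illustrative deny patterns only; a real policy would need a far more careful list.
const DENY_RULES: Array<{ pattern: RegExp; reason: string }> = [
  {
    pattern: /\b(edit|write|refactor|commit|push|deploy|delete|comment|merge)\b/i,
    reason: "write or external action",
  },
  {
    pattern: /\b(private|secrets?|tokens?|passwords?|credentials?|customers?|personal)\b/i,
    reason: "private or sensitive data",
  },
  {
    pattern: /\b(architecture|security)\b/i,
    reason: "judgment-heavy request",
  },
];

// Deterministic gate: the first matching rule forces escalation.
function checkPolicy(request: string): PolicyResult {
  for (const rule of DENY_RULES) {
    if (rule.pattern.test(request)) {
      return { allowed: false, reason: rule.reason };
    }
  }
  return { allowed: true };
}

console.log(checkPolicy("summarize my git status"));
console.log(checkPolicy("commit my changes"));
console.log(checkPolicy("show the API token"));
```

Passing this gate does not by itself authorize local execution. Per the last table row, a request that is allowed but has no high-confidence recipe match still escalates.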
No prompt bodies are accepted or stored by the deployed proof site. The executable recipe lives in the local pi extension and runs only inside a user's own local repository.
For fully local experimentation, use Ollama and public OSS or dummy input only. A future hosted semantic-verifier experiment may use Workers AI behind deterministic policy; it is not part of the deployed proof site and there is no public prompt submission surface.
This repository does not claim that all model weights or third-party local runtimes are approved by any employer or organization. Confirm your own tool, model-license, and data-handling policies before using local inference for work.
Cloudflare Worker
├── static proof site
└── /health
pi extension
└── AuthenticationRouter
├── PublicReadOnlyPolicy → deterministic authorization boundary
├── ExampleRecipeMatcher → candidate lookup only
└── AuthorizedRecipeService → execute or frontier fallback
The deterministic policy rejects obvious writes and sensitive/private contexts before local recipe execution. Matching never grants authority to a risky request. Workers AI remains a planned hosted verifier inside that policy boundary.
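The ordering matters more than any single component, and a short sketch can show why. The interfaces below (`Policy`, `Matcher`, `RecipeService`) and the `createRouter` function are assumptions for illustration and do not reflect the extension's real class signatures. The matcher is deliberately over-eager, standing in for a semantic router that produces false hits.

```typescript
type Decision =
  | { kind: "hit"; recipeId: string; output: string }
  | { kind: "escalate"; reason: string };

// Hypothetical interfaces; the real components may look different.
interface Policy { permits(request: string): boolean; }
interface Matcher { match(request: string): string | null; }
interface RecipeService { execute(recipeId: string): string; }

function createRouter(policy: Policy, matcher: Matcher, service: RecipeService) {
  return (request: string): Decision => {
    // 1. Deterministic policy runs first and cannot be overridden by a match.
    if (!policy.permits(request)) {
      return { kind: "escalate", reason: "blocked by deterministic policy" };
    }
    // 2. The matcher only proposes a candidate recipe.
    const recipeId = matcher.match(request);
    if (recipeId === null) {
      return { kind: "escalate", reason: "no high-confidence match" };
    }
    // 3. Only an authorized candidate is executed.
    return { kind: "hit", recipeId, output: service.execute(recipeId) };
  };
}

// A toy policy and a matcher that claims every request is a git-status hit.
const policy: Policy = {
  permits: (request) => !/\b(refactor|edit|commit|deploy|private|secret)\b/i.test(request),
};
const overEagerMatcher: Matcher = { match: () => "git-status-summary" };

let executions = 0;
const service: RecipeService = {
  execute: (recipeId) => {
    executions++;
    return `ran ${recipeId}`;
  },
};

const route = createRouter(policy, overEagerMatcher, service);
console.log(route("refactor the authentication architecture"));
console.log(route("summarize my git status"));
```

Even with a matcher that is wrong about everything, the risky request never reaches execution, which is the property the benchmark's unsafe false hits argue for.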
The /health endpoint returns the deployed service/version posture.
- Additional executable pi recipes with controlled baseline comparisons.
- Opt-in recipe persistence and local metrics.
- Local reduction of large public tool outputs before premium inference.
- Real session benchmarks measuring upstream token/cost avoidance against a baseline.
- User-approved proposal flow for turning successful repeated tasks into recipes.
Not next: silent file-edit recipes or pretending a small router should solve novel engineering tasks.
0.0.1. Public-data-only experiment. MIT.