Your AI agent is paying to send the same file dump five times. tokdiet is a local proxy that sits between your agent and the model API, meters every token, puts your bloated context on a diet, and proves the answer didn't get worse.
ccusage that shrinks the bill, without losing quality.

Live demo (watch one request lose the weight): agiwhitelist.github.io/tokdiet

Launch write-up + full benchmark methodology: "I cut an AI agent's input tokens by 71% and quality held: here's the 66-task benchmark"
Every "context optimizer" cuts tokens. The scary question is the one they can't answer:
"If I cut the context, does the model get dumber?"
So we measured it. A 66-task A/B benchmark across 6 categories on a real model (MiniMax-M3), each task run twice, full context (baseline) vs through tokdiet (governed), graded against the known answer, repeated ×3 and majority-voted to cancel model noise:

| | baseline | tokdiet | |
|---|---|---|---|
| input tokens | 5.07M | 1.46M | -71% |
| quality (66 tasks) | 64/66 | 63/66 | ≈ parity (95-97%) |

198 paired runs · LLM-judge 92% similarity · confirmed on a 2nd model (MiniMax-M2.5: -72%)

71% fewer input tokens, quality on par with baseline. Real requests, real grading, not a mock. The ~1-2 task gap is model nondeterminism plus the model declining to echo a secret, not context loss. The hardest "needle buried in junk" adversarial cases pass because tokdiet doesn't delete blindly: it pages cold context out recoverably and protects anything on-topic. Reproduce it yourself: `node bench/run.mjs` (needs an API key in env).
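The headline numbers can be re-derived from the figures quoted above. This is a minimal sketch of the arithmetic and of majority voting over repeated runs; the helper names are hypothetical and are not taken from `bench/run.mjs`.

```typescript
// Hypothetical helpers for illustration; the real harness is bench/run.mjs.

/** Percentage reduction from `before` to `after`, rounded to one decimal place. */
function reductionPct(before: number, after: number): number {
  return Math.round(((before - after) / before) * 1000) / 10;
}

/** A task counts as passed when a strict majority of its repeated runs passed. */
function majorityPass(runs: boolean[]): boolean {
  const passes = runs.filter(Boolean).length;
  return passes * 2 > runs.length;
}

console.log(reductionPct(5.07e6, 1.46e6)); // 71.2
console.log(66 * 3); // 198 paired runs (66 tasks, each repeated 3 times)
console.log(majorityPass([true, false, true])); // true
```

Majority voting over an odd number of repeats means a single noisy run cannot flip a task's verdict.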
| | shows your bill | cuts the bill | proves quality held |
|---|---|---|---|
| eyeballing `/cost`, ccusage | ✓ | ✗ | ✗ |
| manual `/compact`, hand-pruning context | ✗ | ✓ (blind) | ✗ |
| tokdiet | ✓ | ✓ | ✓ measured + auto safe-mode |
Everyone shows the bill or cuts it blind. tokdiet is the one that cuts it and proves the model didn't get dumber, and stops cutting the moment it might.
    # 1. Start the proxy (and live dashboard); no install needed
    npx tokdiet start

    # 2. Point your agent at the proxy instead of the real API
    export ANTHROPIC_BASE_URL=http://localhost:7787
    export OPENAI_BASE_URL=http://localhost:7787/v1

Now run your agent (Claude Code, Cursor, Codex, your own script) as usual. Traffic flows through tokdiet, gets metered and compacted, and is forwarded upstream unchanged in every way that matters.

Your API key stays with you. tokdiet reads `x-api-key` / `Authorization` only to forward them upstream. They are never written to SQLite and never written to any log. And it's fail-open: if anything inside the governor errors, it falls back to transparent passthrough, so the proxy will never break your request or surface its own 5xx.

Default ports: proxy `7787`, dashboard `7878`. Override with `--port` / `--dashboard-port`.
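The fail-open behavior described above can be pictured as a try/catch around the rewriting step. The sketch below is an illustration only: `Governor` and `governFailOpen` are hypothetical names, not tokdiet's internal API.

```typescript
// Hypothetical type: stands in for whatever rewrites the request body.
type Governor = (body: string) => string;

/** Apply the governor, but fall back to the untouched body if it throws. */
function governFailOpen(
  body: string,
  govern: Governor,
): { body: string; governed: boolean } {
  try {
    return { body: govern(body), governed: true };
  } catch {
    // Fail-open: any internal error means transparent passthrough.
    return { body, governed: false };
  }
}

const brokenGovernor: Governor = () => {
  throw new Error("compactor bug");
};
const result = governFailOpen('{"messages":[]}', brokenGovernor);
console.log(result.governed, result.body); // false {"messages":[]}
```

The key property is that the original bytes are still in hand when the error is caught, so the request can be forwarded exactly as the agent sent it.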
tokdiet ships as a Claude Code plugin via its own marketplace:
    /plugin marketplace add agiwhitelist/tokdiet
    /plugin install tokdiet

What the plugin does, and what it doesn't. The plugin ships a lightweight metering hook plus a `/tokdiet` command. The hook runs on every tool call (PreToolUse + PostToolUse) and logs tool I/O byte sizes to `~/.tokdiet/tool-meter.log`. It does not save tokens by itself: a plugin can't set `ANTHROPIC_BASE_URL` for the Claude Code process, so it can't route your traffic through the compacting proxy.

The actual token savings come from the proxy. Start it and point Claude Code at it (this is what gives you the ~71% token reduction):

    npx tokdiet start
    export ANTHROPIC_BASE_URL=http://localhost:7787   # then launch Claude Code from this shell

View metered tokens, cost, and savings any time with `npx tokdiet report`, or run `/tokdiet` inside Claude Code for these instructions.
Claude Code is the flagship use case, and it has two landmines a naive compacting proxy walks straight into. tokdiet handles both:
- Prompt caching. Claude Code marks a cached prefix with `cache_control`; cached input costs ~10% of normal. Rewriting that prefix invalidates the cache and can make a request cost more. tokdiet is cache-aware: it never touches content at or before a `cache_control` breakpoint.
- Extended thinking. Claude Code sends signed `thinking` blocks that Anthropic requires returned verbatim; touching one is an instant `400`. tokdiet is thinking-safe: signed/thinking blocks are never surfaced or mutated.
Both are covered by regression tests (tests/cc-compat.test.ts).
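To make the two rules concrete, here is a minimal sketch of how a compactor could decide what it may touch. It assumes the Anthropic Messages shape (content blocks with an optional `cache_control` field, plus `thinking` / `redacted_thinking` block types) and simplifies by protecting whole messages. The function names are hypothetical and this is not the code exercised by `tests/cc-compat.test.ts`.

```typescript
// Simplified message shape; real requests also carry system and tools sections.
interface Block {
  type: string;
  text?: string;
  cache_control?: { type: string };
}
interface Message {
  role: "user" | "assistant";
  content: string | Block[];
}

/** Index of the first message that may be compacted; everything before it is protected. */
function firstCompactableIndex(messages: Message[]): number {
  let lastBreakpoint = -1;
  messages.forEach((m, i) => {
    if (Array.isArray(m.content) && m.content.some((b) => b.cache_control !== undefined)) {
      lastBreakpoint = i; // content at or before a breakpoint stays byte-identical
    }
  });
  return lastBreakpoint + 1;
}

/** Thinking blocks must be passed through verbatim, never rewritten. */
function isProtectedBlock(b: Block): boolean {
  return b.type === "thinking" || b.type === "redacted_thinking";
}
```

With a breakpoint on the first message of a three-message conversation, `firstCompactableIndex` returns `1`, so only the later two messages are candidates for compaction.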
A note on honesty: the dollar-savings story applies to pay-per-token API keys (MiniMax, Anthropic API, OpenAI, …). On a flat Claude subscription there are no per-token charges to cut, so the value there is metering, budgets, and the live dashboard, not dollars.
tokdiet is a streaming reverse proxy. SSE responses are proxied incrementally (never buffered whole), so your agent's tokens still stream in real time.
                                  tokdiet (localhost:7787)
    agent ───────────────────────────────────────────────────────────────────────────────► model API
    (Claude   request   ┌───────────┐  ┌───────┐  ┌────────┐  ┌───────────┐               (Anthropic /
     Code,   ─────────► │interceptor│─►│ meter │─►│ budget │─►│ compactor │──────────────► OpenAI /
     Cursor,  raw key   └───────────┘  └───────┘  └────────┘  └─────┬─────┘                Gemini /
     Codex,   forwarded   detect        count      session/         │ dedup / elision /    MiniMax)
     …)                   provider,     tokens     day / repo       │ mid-summarize
                          keep body     & cost     limits           ▼
                          byte-faithful                     ┌───────────────┐
              response                                      │ quality guard │
            ◄───────────────────────────────────────────────┤ shadow-eval + │
              streamed back, token-for-token                │   safe-mode   │
                          ┌──────────────┐                  └───────┬───────┘
                          │ store(SQLite)│◄─────────────────────────┘
                          │ + dashboard  │  telemetry, savings, degradation
                          └──────────────┘
Blind compaction is "delete and pray." tokdiet treats your context like virtual memory: hot content (recent, pinned, relevant to the current question) stays resident; cold content (stale, redundant) is paged out to a local store as a recoverable stub, not deleted. The full block is kept in SQLite keyed by an id, so it can be audited and (roadmap) paged back in on demand when the model actually needs it.
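The virtual-memory idea can be sketched in a few lines. This is an illustration under stated assumptions: an in-memory `Map` stands in for the SQLite store, and the `PageStore` class, the `pg-N` ids, and the stub format are all hypothetical.

```typescript
// In-memory stand-in for the SQLite store; ids and stub format are hypothetical.
class PageStore {
  private pages = new Map<string, string>();
  private next = 1;

  /** Store the full block and return a short stub that replaces it in context. */
  pageOut(block: string, previewChars = 240): string {
    const id = `pg-${this.next++}`;
    this.pages.set(id, block);
    const preview = block.slice(0, previewChars);
    return `[paged out id=${id} chars=${block.length}] ${preview}`;
  }

  /** Recover the original block by id, or undefined if the id is unknown. */
  pageIn(id: string): string | undefined {
    return this.pages.get(id);
  }
}

const store = new PageStore();
const stub = store.pageOut("x".repeat(1000));
console.log(stub.length < 1000); // true: the stub is much smaller than the block
console.log(store.pageIn("pg-1")?.length); // 1000: the full block is still recoverable
```

Because the stub carries the id, an audit (or a later page-in) can always map the shortened context back to the exact bytes that were removed.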
| Mechanism | What it does |
|---|---|
| Shadow-eval | Re-runs a sampled fraction of compacted requests against the un-compacted baseline and scores the divergence (0 = identical, 100 = unrelated). This is the measurement that answers "did quality drop?" |
| Quality budget | A hard ceiling on acceptable measured degradation (`qualityBudget.maxDegradationPct`, default 2%). As you approach it, the compactor restricts itself to its safest strategies. |
| Safe-mode | If rolling degradation exceeds the budget, the offending strategy is disabled (per-strategy) and a safe-mode event fires. Savings stop before quality does. |
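The three mechanisms in the table compose into a small feedback loop. The sketch below assumes that "rolling degradation" is the mean of the most recent divergence scores per strategy; the README does not specify the exact statistic, and the `QualityGuard` class and its window size are hypothetical.

```typescript
type Strategy = "dedup" | "elision" | "midSummarize";

// Assumption: rolling degradation = mean of the last `window` shadow-eval scores.
class QualityGuard {
  private readonly scores = new Map<Strategy, number[]>();
  private readonly disabled = new Set<Strategy>();
  private readonly maxDegradationPct: number;
  private readonly window: number;

  constructor(maxDegradationPct = 2.0, window = 20) {
    this.maxDegradationPct = maxDegradationPct;
    this.window = window;
  }

  /** Record one shadow-eval divergence score (0 = identical, 100 = unrelated). */
  record(strategy: Strategy, divergence: number): void {
    const recent = [...(this.scores.get(strategy) ?? []), divergence].slice(-this.window);
    this.scores.set(strategy, recent);
    const mean = recent.reduce((a, b) => a + b, 0) / recent.length;
    // Safe-mode: disable only the offending strategy, leave the others running.
    if (mean > this.maxDegradationPct) this.disabled.add(strategy);
  }

  isEnabled(strategy: Strategy): boolean {
    return !this.disabled.has(strategy);
  }
}

const guard = new QualityGuard(2.0, 3);
guard.record("elision", 1); // mean 1.0, within budget
guard.record("elision", 6); // mean 3.5, over budget
console.log(guard.isEnabled("elision"), guard.isEnabled("dedup")); // false true
```

Tracking scores per strategy is what lets one misbehaving strategy be switched off without giving up the savings from the others.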
- Dedup (loss-free). When the same large block is re-pasted across a conversation, keep the freshest copy verbatim and replace earlier copies with a pointer marker. Works on near-duplicates too (a file re-pasted with a few lines changed), not just byte-identical ones.
- Elision (recoverable). Page out the bulk of old tool results (file dumps, command output), keeping a preview plus the salient lines (errors, ids, `KEY=VALUE`, URLs, paths, numbers) and storing the full body for recovery. Recent, pinned, and question-relevant results are kept intact.
- Mid-summarize (off by default). Summarize mid-history with a cheap model. Opt-in (it costs money).
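As an illustration of the elision strategy above, the sketch below keeps a head preview plus lines that look salient. The regular expressions are hypothetical approximations of "errors, `KEY=VALUE`, URLs, paths" (ids and numbers are left out for brevity), and the defaults mirror `elisionPreviewChars` / `elisionSalientLines`. Storing the full body for recovery is omitted here.

```typescript
// Hypothetical patterns; the real salience rules may differ.
const SALIENT: RegExp[] = [
  /\berror\b/i, // error messages
  /\b[A-Z_][A-Z0-9_]*=\S+/, // KEY=VALUE pairs
  /https?:\/\/\S+/, // URLs
  /(?:^|\s)\/[\w.-]+\/[\w./-]+/, // absolute paths
];

/** Keep a head preview and up to `maxSalient` salient lines from the rest. */
function elide(body: string, previewChars = 240, maxSalient = 12): string {
  if (body.length <= previewChars) return body; // nothing worth eliding
  const preview = body.slice(0, previewChars);
  const salient = body
    .slice(previewChars)
    .split("\n")
    .filter((line) => SALIENT.some((re) => re.test(line)))
    .slice(0, maxSalient);
  const marker = `[... ${body.length - previewChars} chars elided ...]`;
  return [preview, marker, ...salient].join("\n");
}

const dump = "a".repeat(300) + "\nnothing here\nERROR: disk full\nPORT=8080\nplain\n";
console.log(elide(dump).includes("ERROR: disk full")); // true
console.log(elide(dump).includes("nothing here")); // false
```

The salient-line pass is what separates this from plain truncation: an error buried deep in a long command output survives even though most of the output is dropped.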
    tokdiet <command> [flags]   # alias: td

| Command | What it does | Key flags |
|---|---|---|
| `start` | Run the proxy + live dashboard | `--port`, `--dashboard-port`, `--no-dashboard`, `--config <path>` |
| `report` | Print a usage report (or export) | `--since <days>`, `--json`, `--csv <file>`, `--config <path>` |
| `init` | Scaffold `tokdiet.config.json` in the cwd | `--force` |
| `install-claude-plugin` | Install an idempotent Claude Code metering hook | `--settings <path>` |

Run tokdiet init to create tokdiet.config.json, or pass one with --config. All fields are optional and merge over sensible defaults.

| Field | Default | Description |
|---|---|---|
| `proxyPort` / `dashboardPort` | `7787` / `7878` | Ports (both bound to loopback only). |
| `dashboardEnabled` | `true` | Start the dashboard alongside the proxy. |
| `contextWindowTokens` | `"auto"` | Window size for utilization %; `"auto"` infers from the model. |
| `contextUtilizationThreshold` | `0.7` | Compaction triggers once input utilization reaches this fraction. |
| `onBudgetExceeded` | `"warn"` | `"warn"`, `"compact"`, or `"block"` when a spend budget is hit. |
| `budgets.perSessionUSD` / `perDayUSD` / `perRepoMonthlyUSD` | `5` / `50` / `400` | Spend ceilings (any may be `null`). |
| `compaction.strategies.{elision,dedup,midSummarize}` | `true` / `true` / `false` | Per-strategy switches. |
| `compaction.keepRecentToolResults` | `4` | Most-recent tool results always kept intact. |
| `compaction.minToolResultTokens` | `500` | Only elide tool results at least this large. |
| `compaction.elisionPreviewChars` / `elisionSalientLines` | `240` / `12` | How much of a paged-out block to keep (head + salient lines). |
| `compaction.relevanceProtect` | `true` | Shield blocks lexically on-topic with the latest question. |
| `compaction.recoverable` | `true` | Persist paged-out blocks for recovery/audit (virtual memory). |
| `compaction.protectCachedPrefix` | `true` | Never compact a provider cache (`cache_control`) prefix. |
| `compaction.semanticDedup` | `true` | Collapse near-duplicates, not just exact ones. |
| `qualityBudget.maxDegradationPct` | `2.0` | Max measured degradation before safe-mode trips. |
| `shadowEval.enabled` / `sampleRate` | `true` / `0.05` | Whether/how often to shadow-evaluate. |
| `shadowEval.judge` | `"heuristic"` | `"heuristic"` or `"llm"` (`"embedding"` reserved, falls back to heuristic). |
| `shadowEval.judgeModel` | `"claude-haiku-4"` | Cheap model for the LLM judge / mid-summarize. |
| `pageFault` | `{ enabled: true, maxReinjections: 1 }` | Re-inject a paged-out block if the model can't answer without it. |
| `safeMode` | `true` | Auto-disable a strategy when it exceeds the quality budget. |
| `dataDir` | `~/.tokdiet` | Where SQLite telemetry lives. |
| `pricingPath` | `null` | Override path for `pricing.json` (`null` = bundled). |

Upstream overrides (point at a non-default origin, e.g. MiniMax): `TOKDIET_ANTHROPIC_UPSTREAM`, `TOKDIET_OPENAI_UPSTREAM`, `TOKDIET_GEMINI_UPSTREAM` (legacy `CTXGOV_*_UPSTREAM` still read for back-compat).
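"All fields are optional and merge over sensible defaults" can be read as a nested merge. The sketch below shows one plausible reading using a handful of defaults copied from the table; the `mergeConfig` function and the exact merge semantics are assumptions, not tokdiet's loader.

```typescript
// A few defaults copied from the table above; the merge logic is an assumption.
const defaults = {
  proxyPort: 7787,
  contextUtilizationThreshold: 0.7,
  compaction: { keepRecentToolResults: 4, minToolResultTokens: 500 },
  qualityBudget: { maxDegradationPct: 2.0 },
};

type Config = typeof defaults;
type PartialConfig = {
  [K in keyof Config]?: Config[K] extends object ? Partial<Config[K]> : Config[K];
};

/** Overlay user settings on the defaults, merging nested sections field by field. */
function mergeConfig(user: PartialConfig): Config {
  return {
    ...defaults,
    ...user,
    compaction: { ...defaults.compaction, ...user.compaction },
    qualityBudget: { ...defaults.qualityBudget, ...user.qualityBudget },
  };
}

const cfg = mergeConfig({ compaction: { keepRecentToolResults: 8 } });
console.log(cfg.compaction.keepRecentToolResults); // 8 (overridden)
console.log(cfg.compaction.minToolResultTokens); // 500 (default kept)
```

Merging nested sections field by field means overriding one `compaction` key does not silently reset its siblings.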
With the proxy running, open http://localhost:7878, a single self-contained page that streams live updates over SSE (loopback only; your cost data never leaves the machine):

    ┌─ tokdiet ───────────────────────────────────────────── ● live · :7878 ─┐
    │                                                                        │
    │  SESSION   claude-code › my-repo › MiniMax-M3                          │
    │  context   ██████████████████░░░░░░░░░░  64%   128,402 / 200,000 tok   │
    │                                                                        │
    │  ┌── TODAY ─────────────────┐  ┌── SAVED (cumulative) ──────────────┐  │
    │  │ sent     1.43M tok       │  │ $12.40                             │  │
    │  │                          │  │ saving $1.07/h                     │  │
    │  │ saved    3.64M tok       │  │ 3.6M tokens never left this box    │  │
    │  │ spend    $0.43           │  │ -71.8% on real traffic             │  │
    │  └──────────────────────────┘  └────────────────────────────────────┘  │
    │                                                                        │
    │  QUALITY GUARD   measured degradation 0.4%   budget 2.0%               │
    │                  72 shadow-evals             safe-mode ● ON · OK       │
    │                                                                        │
    │  STRATEGY LEADERBOARD          fires   tokens saved    Δ quality       │
    │  ▸ dedup         ███████████     312      1.91M          +0.0%         │
    │  ▸ elision       ██████          168      1.42M          +0.6%         │
    │  ▸ midSummarize  · off ·           0        -              -           │
    │                                                                        │
    │  BY TOOL  claude-code ████████ $0.31  cursor ███ $0.09  codex █ $0.03  │
    └────────────────────────────────────────────────────────────────────────┘
Five live screens: Live session, Savings, Quality (degradation + safe-mode status), By tool & repo, and Strategy leaderboard, all updating in real time over SSE.
    npm run build && node scripts/demo.mjs

Stands up a mock Anthropic upstream on loopback, starts the real tokdiet proxy in front of it, and sends one realistic bloated agent request through the whole pipeline: actual interceptor, tokenizer, compactor, pricing, telemetry, and shadow-eval. No external network, no real key. It prints a before/after table showing the input shrank while the answer stayed identical (so shadow-eval reports ~0% degradation). (The scenario is synthetic; your real savings depend on how much your own conversations repeat.)

| Provider | Endpoint detected | Base URL to set |
|---|---|---|
| Anthropic | `/v1/messages` | `ANTHROPIC_BASE_URL=http://localhost:7787` |
| OpenAI | `/v1/chat/completions` | `OPENAI_BASE_URL=http://localhost:7787/v1` |
| Gemini | `:generateContent` / `/v1beta/…` | point the Gemini SDK base URL at the proxy |
| MiniMax (and any OpenAI/Anthropic-compatible API) | mimics OpenAI `/v1` & Anthropic `/anthropic` | `OPENAI_BASE_URL=http://localhost:7787/v1` + `TOKDIET_OPENAI_UPSTREAM=https://api.minimax.io` |

Prices come from `pricing.json` (USD per 1,000,000 tokens, dated, user-updatable, reloaded on start). A model name is resolved by exact match first, then by longest matching prefix.
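The "exact match, then longest prefix" rule is easy to get subtly wrong, so here is a minimal sketch of it. The `Price` shape, the model names, and the dollar figures are made up for illustration; the real `pricing.json` schema may differ.

```typescript
// Hypothetical price shape: USD per 1,000,000 tokens.
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

/** Resolve a model's price: exact key first, otherwise the longest key that prefixes the name. */
function lookupPrice(table: Map<string, Price>, model: string): Price | undefined {
  const exact = table.get(model);
  if (exact !== undefined) return exact;
  let best: string | undefined;
  for (const key of Array.from(table.keys())) {
    if (model.startsWith(key) && (best === undefined || key.length > best.length)) {
      best = key;
    }
  }
  return best === undefined ? undefined : table.get(best);
}

/** Estimated input cost in USD for a token count. */
function inputCostUSD(tokens: number, price: Price): number {
  return (tokens / 1e6) * price.inputPerMTok;
}

// Made-up models and prices, purely for illustration.
const table = new Map<string, Price>([
  ["example-model", { inputPerMTok: 3, outputPerMTok: 15 }],
  ["example-model-mini", { inputPerMTok: 1, outputPerMTok: 5 }],
]);
const price = lookupPrice(table, "example-model-mini-2026");
console.log(price?.inputPerMTok); // 1: the longer prefix wins
```

Preferring the longest prefix keeps a dated or suffixed model name from falling back to the price of a broader, differently priced family.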
tokdiet needs no per-tool plugin. The rule is simple:
If your tool lets you override the model base URL and speaks Anthropic / OpenAI / Gemini, it works with tokdiet: point it at `http://localhost:7787/v1` (OpenAI) or `http://localhost:7787` (Anthropic) and run as usual.
Confirmed from each tool's official docs (2026-06-18):
| Tool | Where you set the base URL |
|---|---|
| Claude Code | ANTHROPIC_BASE_URL env |
| opencode | `opencode.json` → `provider.<id>.options.baseURL` |
| Aider | `OPENAI_API_BASE` / `ANTHROPIC_API_BASE` env (model prefixed `openai/` · `anthropic/`) |
| Continue.dev | `config.yaml` → `apiBase` |
| Cline / Roo Code / Kilo Code | GUI: "OpenAI Compatible" Base URL, or Anthropic "Use custom base URL" |
| Goose | `OPENAI_HOST` env (host root, no `/v1`) |
| Zed | `settings.json` → `language_models.openai_compatible.<name>.api_url` |
| JetBrains AI Assistant | Settings → AI Assistant → third-party OpenAI-compatible URL |
| Open Interpreter | `--api_base` flag |
| llm (Datasette) | `extra-openai-models.yaml` → `api_base` |
| Crush (Charm) | `crush.json` → `providers.<id>.base_url` (both formats) |
| pi (earendil-works) · oh-my-pi | `models.json` / `models.yml` → `providers.<id>.baseUrl` |
| Hermes Agent (Nous) | `~/.hermes/config.yaml` → `model.base_url` |

Per-tool copy-paste setup, gotchas (`/v1` vs bare host, dummy keys), and the partial/unsupported details: see docs/integrations.md.
- Page-fault auto-reinjection: when the model references a paged-out id or signals it's missing content, restore it and retry automatically (partially shipped).
- Semantic dedup (shipped): near-duplicate collapsing.
- Embedding judge: local semantic scoring instead of the heuristic.
- Self-calibrating policy: learn safe aggressiveness per repo from shadow-eval outcomes.
- Quality ledger: auditable before/after + measured-degradation record.
See docs/DESIGN-context-virtual-memory.md for the full design.
- The default judge is a heuristic (word/char similarity), not a semantic oracle. Switch `shadowEval.judge` to `"llm"` for a model-graded score. Embedding judge isn't implemented yet.
- Shadow-eval costs money: it's a real extra upstream request, so it's sampled (5% default) and its cost is reported separately.
- Session inference is heuristic: per-session/per-repo attribution is inferred from request metadata.
- Page-fault recovery is limited for streaming responses.
- Cost figures are estimates, only as accurate as your `pricing.json`.
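To show why a word-similarity judge is "not a semantic oracle", here is a minimal sketch of one such heuristic: Jaccard overlap of word sets, mapped onto the 0 (identical) to 100 (unrelated) divergence scale. This is an assumed stand-in, not tokdiet's actual scoring function.

```typescript
/** Divergence between two answers: 0 = same word set, 100 = no words in common. */
function divergence(a: string, b: string): number {
  const words = (s: string) => new Set(s.toLowerCase().match(/[a-z0-9]+/g) ?? []);
  const wa = words(a);
  const wb = words(b);
  if (wa.size === 0 && wb.size === 0) return 0; // two empty answers are identical
  const shared = Array.from(wa).filter((w) => wb.has(w)).length;
  const union = wa.size + wb.size - shared;
  return Math.round((1 - shared / union) * 100);
}

console.log(divergence("the answer is 42", "The answer is 42.")); // 0
console.log(divergence("alpha beta", "gamma delta")); // 100
```

Two answers that say the same thing in different words score as divergent under this kind of measure, and two that differ in one critical token score as nearly identical, which is the limitation the first caveat above points at.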
Point Claude Code at tokdiet instead of the model API directly. tokdiet is a local streaming reverse proxy that sits between your agent and the API, meters every token, and compacts bloated context before it hits the model:
    npx tokdiet start   # proxy :7787 + dashboard :7878 (loopback only)
    export ANTHROPIC_BASE_URL=http://localhost:7787
    export OPENAI_BASE_URL=http://localhost:7787/v1

In our 66-task A/B benchmark on a real model (MiniMax-M3), input tokens dropped from 5.07M to 1.46M (-71%) while quality stayed at parity (baseline 64/66 vs governed 63/66). That is the mechanism behind Claude Code token optimization here: the proxy shrinks the context, not your workflow.
Note: there is also a Claude Code plugin (`/plugin marketplace add agiwhitelist/tokdiet`), but the plugin is only a metering hook: it cannot set `ANTHROPIC_BASE_URL` for the Claude Code process, so the plugin alone does not save tokens. The proxy is what cuts the bill.
It depends on how you pay. Dollar savings apply to pay-per-token API keys (MiniMax, the Anthropic API, OpenAI), where fewer input tokens means a smaller bill. If you are on a flat Claude subscription there are no per-token charges to cut, so the value there is the metering, budgets, and live local dashboard: you see exactly where tokens go, not a smaller invoice. For anyone who finds Claude Code too expensive on a metered API key, the context compression is where the cost reduction comes from.
Yes. Think of it as "ccusage that shrinks the bill." It does ccusage-style token and USD cost tracking, plus a live local dashboard, but it goes further: it is an active LLM cost proxy that compacts context to cut pay-per-token API spend, then runs shadow-eval to verify quality held. If you came looking for a ccusage alternative or a Claude Code usage monitor that does more than report, that is the difference: measurement plus reduction.
In our testing, no: quality held within model noise. The honest framing: this is "≈ parity," not "lossless." On the 66-task benchmark, baseline scored 64/66 and governed 63/66; across 198 paired runs an LLM judge reported 92% similarity, and a second model (MiniMax-M2.5) confirmed -72% tokens at parity. The ~1-2 task gap is model nondeterminism plus the model declining to echo a secret, not context loss. The hardest "needle buried in junk" adversarial cases pass.
Three mechanisms keep it honest:
- shadow-eval re-runs a sampled fraction (5% by default) of compacted requests against the uncompacted baseline and scores divergence (0 = identical … 100 = unrelated). This is the measurement.
- quality budget is a hard ceiling on measured degradation (default 2%); near it, the compactor restricts itself to the safest strategies.
- safe-mode disables any offending strategy per-strategy when rolling degradation exceeds budget. Savings stop before quality does.
Not overall, and it pays to be precise here. Only dedup is loss-free: re-pasted blocks keep the freshest copy verbatim and replace earlier copies with a marker (it handles near-duplicates too). Elision is recoverable, not lossless: it pages out the bulk of old tool results to local SQLite while keeping a preview plus salient lines (errors, ids, KEY=VALUE pairs, URLs, paths, numbers), and stores the full body by id for recovery. We model context as virtual memory: hot content (recent, pinned, question-relevant) stays resident; cold content is paged out as a recoverable stub, not deleted. A third strategy, mid-conversation summarize, is off by default and opt-in because it costs money.
Yes. tokdiet speaks the Anthropic Messages API, OpenAI Chat Completions, Gemini, and MiniMax, plus any OpenAI-compatible or Anthropic-compatible API. Any tool that respects `ANTHROPIC_BASE_URL` / `OPENAI_BASE_URL` works: Claude Code, Cursor, Codex, and custom scripts. So this doubles as a way to reduce Cursor token usage, track Cursor API cost, and track or reduce Anthropic and OpenAI API spend, all through one self-hosted local proxy.
No, and this is regression-tested. tokdiet is cache-aware: it never rewrites a `cache_control` prefix, so it won't break your Claude Code prompt cache. It is also thinking-safe: it never mutates signed/thinking blocks, so it won't trigger a 400 on extended thinking. Because it is a streaming reverse proxy, SSE is proxied incrementally, so tokens still stream live to your editor.
No. API keys are forwarded only; they are never written to SQLite and never written to any log. The proxy binds to loopback only (no external interface), and it is fail-open: if anything goes wrong in the pipeline, your request still reaches the API. It is self-hosted and local by design; the only things persisted to SQLite are metering data and paged-out context bodies (kept by id for recovery), not credentials.
Most self-hosted LLM API gateway / AI gateway tools route and log traffic. tokdiet is a LiteLLM alternative focused on cost: it intercepts LLM API traffic as a reverse proxy, then actively compresses context and prompts to shrink input tokens, and verifies the result with shadow-eval. The pipeline is: interceptor → meter → budget → compactor → quality guard → store (SQLite) + dashboard. So it is a local LLM proxy dashboard plus a context compression proxy in one, not just a passthrough router.
Reproduce the benchmark: node bench/run.mjs (you need an API key in env). It runs a 66-task A/B across 6 categories, each task run twice β full context (baseline) vs through tokdiet (governed) β graded against the known answer, repeated x3 and majority-voted to cancel model noise. You will also see live token-usage and cost tracking on the dashboard at http://localhost:7878 as you work.
Honest caveats: the default quality judge is a heuristic (an LLM judge is opt-in); shadow-eval costs money because it re-runs sampled requests; session detection is heuristic (inferred, not guaranteed); page-fault recovery is currently limited for streaming responses; and the per-request cost figures are estimates. tiktoken is used for token counting. tokdiet is MIT-licensed and built on TypeScript (Node 20+).