Named academic benchmarks: TruthfulQA, SQuAD v2, BFCL. The quality A/B now ships
three more standard suites alongside GSM8K, so the accuracy-preservation results name
the benchmarks a reader already knows. bench/scripts/download.py fetches them
reproducibly (download.py 40 truthfulqa,squad2,bfcl, sha256-pinned in the manifest), bench suite runs them at a conservative shape-matched preset, and the results table is
in the README. BFCL uses the multi-tool live_multiple slice (2 to 37 candidate
functions per call), where tool selection cuts 33% of input by dropping the schemas the
query doesn't need, at unchanged tool-call accuracy. SQuAD v2's unanswerable questions
are handled correctly: a right "no answer" scores as a hit. A new choice (MC1) scorer
grades TruthfulQA by the selected option letter, not by any letter the model mentions in
passing.
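The MC1 grading rule above can be sketched roughly as follows. This is an illustration only, not llmtrim's scorer: the names `selected_letter` and `score_mc1`, the two regular expressions, and the choice to let the last explicit "Answer: X" statement win are all assumptions.

```python
import re
from typing import Optional

# Hypothetical patterns; the real scorer's rules are not shown in these notes.
# An explicit commitment such as "Answer: C" or "answer is (B)".
_EXPLICIT = re.compile(r"(?i:answer\s*(?:is|:))\s*\(?([A-Z])\)?(?![A-Za-z])")
# A reply that consists of nothing but one option letter, e.g. "B" or "(b)".
_BARE = re.compile(r"^\s*\(?([A-Za-z])\)?[.)]?\s*$")


def selected_letter(reply: str) -> Optional[str]:
    """Return the option letter the model committed to, or None if it chose nothing."""
    bare = _BARE.match(reply)
    if bare:
        return bare.group(1).upper()
    matches = _EXPLICIT.findall(reply)
    # Assumption: if the model restates its answer, the last statement wins.
    return matches[-1].upper() if matches else None


def score_mc1(reply: str, correct: str) -> bool:
    """Grade by the selected letter only; letters mentioned in passing are ignored."""
    return selected_letter(reply) == correct.upper()
```

With this sketch, a reply like "Option A is a common myth. Answer: C" is graded on C alone, and a reply that never commits to an option scores as a miss rather than matching on an incidental letter.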
llmtrim mcp runs an MCP server over stdio. Any MCP client (Claude Code, Cursor,
custom agents) can spawn llmtrim mcp and call the engine as tools: llmtrim_compress
(compress a full request body and report the token deltas, honoring your ~/.llmtrim
config like the proxy and CLI), llmtrim_compress_text (shrink a single text blob with
the lossless safe preset, independent of config), and llmtrim_stats (read the savings
ledger, the same data llmtrim status --json shows). Every call records to the same
ledger, so MCP traffic shows up in llmtrim status. The server sits behind the mcp feature, which ships
in the default build. llmtrim mcp install registers the server with Claude Code via its claude mcp add CLI (idempotent); llmtrim mcp install --print emits the config block to
paste into any other client.
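For a custom agent, a tool call over stdio is a newline-delimited JSON-RPC 2.0 message with the method `tools/call`, as defined by the MCP specification. The sketch below only builds that message. The helper name `tools_call_line` and the `text` argument are assumptions, since the tools' input schemas are not listed here; a real client would first complete the MCP `initialize` handshake and read the schema from `tools/list`.

```python
import json


def tools_call_line(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize one MCP tools/call request as a newline-terminated JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message) + "\n"


# "text" is a hypothetical argument name; check the server's tools/list reply for the real one.
line = tools_call_line(2, "llmtrim_compress_text", {"text": "some long blob"})
```

The resulting line would be written to the stdin of a spawned `llmtrim mcp` process, with the reply read back from its stdout.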
Changed
The benchmark commands are now one bench subcommand group. llmtrim bench and llmtrim bench-agent are replaced by llmtrim bench quality and llmtrim bench agent,
joined by three new axes under the same dispatcher: bench suite (the full corpus matrix
in one process, replacing the run_all.sh shell script and its per-corpus cargo run
spawns), bench latency (the warm compress-path micro-bench, folded in from the loose latency.rs), and bench compare <headroom|caveman> (a thin dispatcher over the Python
head-to-head comparators). bench suite refuses to run live while an *_PROXY var is set,
so the llmtrim proxy can no longer silently contaminate the A/B baseline.
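The proxy guard can be pictured as a check over the environment before a live run starts. This is a sketch under assumptions, not the actual implementation: the function names are hypothetical, and treating names case-insensitively and ignoring empty values are guesses about behavior the notes do not specify. Taken literally, the `*_PROXY` pattern also matches `NO_PROXY`.

```python
import os
from typing import List, Mapping


def proxy_vars(env: Mapping[str, str]) -> List[str]:
    """Names of non-empty variables whose name ends in _PROXY (case-insensitive)."""
    return sorted(
        name for name, value in env.items()
        if name.upper().endswith("_PROXY") and value
    )


def ensure_clean_baseline(env: Mapping[str, str] = os.environ) -> None:
    """Raise before a live run if any proxy variable could reroute baseline traffic."""
    found = proxy_vars(env)
    if found:
        raise RuntimeError("refusing live run; unset: " + ", ".join(found))
```

Failing fast here is the point of the design: a baseline that silently passes through the proxy would already be compressed, so the A/B delta would understate the savings.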
Benchmark result JSON now carries a shared envelope. Every --json-out (quality,
suite, agent) wraps its body in { schema, produced_at, commit, llmtrim_version, meta, result }, so any consumer can identify the schema and the code that produced it. The
README/chart synthesizers unwrap it transparently and still read pre-envelope files.
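A consumer that accepts both enveloped and pre-envelope files might unwrap them along these lines. The field names come from the envelope described above; the function name `unwrap` and the detection rule (all six keys present) are assumptions about how a reader could tell the two formats apart, not the synthesizers' actual logic.

```python
from typing import Any, Dict

ENVELOPE_KEYS = {"schema", "produced_at", "commit", "llmtrim_version", "meta", "result"}


def unwrap(doc: Dict[str, Any]) -> Any:
    """Return the result body of an enveloped file, or the document itself if pre-envelope."""
    if ENVELOPE_KEYS.issubset(doc):
        return doc["result"]
    # Pre-envelope files carry the body at the top level.
    return doc
```

Callers would typically pass the output of `json.load` on a `--json-out` file and work with the returned body either way.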
bench quality --offline --json-out now writes its results. Previously --json-out
was honored only on live runs, so the free offline savings pass produced nothing on disk.
It now writes a quality-offline-v1 envelope (per-case input-token before/after plus the
totals), which makes bench suite --offline reproducible without an API key.
Fixed
setup's caveman warning no longer claims llmtrim shapes output the same way caveman
does. caveman users run coding agents, which route to the agent preset where auto
deliberately leaves output unshaped, so the old "llmtrim already does this (Stage F)" reason
was wrong for exactly the people who saw it. The warning now explains that auto already
shapes output where it pays (code, long context, plain prose) and skips tool-call traffic
because terse shaping saves no tokens on short replies (bench: quality neutral), so caveman
is redundant either way.