llmtrim 0.1.11

@github-actions github-actions released this 14 Jun 20:54
· 38 commits to main since this release

Added

  • Named academic benchmarks: TruthfulQA, SQuAD v2, BFCL. The quality A/B now ships
    three more standard suites alongside GSM8K, so the accuracy-preservation results name
    the benchmarks a reader already knows. bench/scripts/download.py fetches them
    reproducibly (download.py 40 truthfulqa,squad2,bfcl, sha256-pinned in the manifest),
    bench suite runs them at a conservative shape-matched preset, and the results table is
    in the README. BFCL uses the multi-tool live_multiple slice (2 to 37 candidate
    functions per call), where tool selection cuts 33% of input by dropping the schemas the
    query doesn't need, at unchanged tool-call accuracy. SQuAD v2's unanswerable questions
    are handled correctly: a right "no answer" scores as a hit. A new choice (MC1) scorer
    grades TruthfulQA by the selected option letter, not by any letter the model mentions in
    passing.
  • llmtrim mcp runs an MCP server over stdio. Any MCP client (Claude Code, Cursor,
    custom agents) can spawn llmtrim mcp and call the engine as tools: llmtrim_compress
    (compress a full request body and report the token deltas, honoring your ~/.llmtrim
    config like the proxy and CLI), llmtrim_compress_text (shrink a single text blob with
    the lossless safe preset, independent of config), and llmtrim_stats (read the savings
    ledger, the same data llmtrim status --json shows). Every call records to the same
    ledger, so MCP traffic shows up in llmtrim status. The server sits behind the mcp
    feature, which ships in the default build. llmtrim mcp install registers the server with Claude Code via its
    claude mcp add CLI (idempotent); llmtrim mcp install --print emits the config block to
    paste into any other client.
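
The two scoring rules in the benchmarks item above (MC1 graded by the selected letter only, and a correct "no answer" counted as a hit on SQuAD v2) can be sketched roughly as follows. This is an illustrative sketch, not llmtrim's actual scorer: the function names, the "Answer: X" marker convention, the no-answer phrases, and the exact-match criterion are all assumptions made for the example.

```python
import re
from typing import Optional, Sequence

# Hypothetical helpers for illustration; llmtrim's real scorers may differ.

def selected_letter(reply: str) -> Optional[str]:
    """Return the option letter the model committed to, or None."""
    # Case 1: the whole reply is just a letter, e.g. "B", "(b)", "C."
    bare = re.fullmatch(r"\(?([A-Za-z])\)?[.)]?", reply.strip())
    if bare:
        return bare.group(1).upper()
    # Case 2: an explicit marker such as "Answer: B" or "the answer is (B)".
    # Letters that merely appear in the discussion are ignored.
    marks = re.findall(r"\b[Aa]nswer\s*(?:is|:)\s*\(?([A-Z])\)?(?![A-Za-z])", reply)
    return marks[-1] if marks else None

def score_mc1(reply: str, gold_letter: str) -> bool:
    """Hit only when the selected letter equals the gold letter."""
    return selected_letter(reply) == gold_letter.upper()

def _normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())

# Assumed phrases that count as declining to answer.
NO_ANSWER = {"", "no answer", "unanswerable"}

def score_squad2(prediction: str, gold_answers: Sequence[str]) -> bool:
    """An empty gold_answers list marks the question as unanswerable."""
    pred = _normalize(prediction)
    declined = pred in NO_ANSWER
    if not gold_answers:
        return declined  # a right "no answer" is a hit
    return not declined and any(_normalize(g) == pred for g in gold_answers)
```

Under these assumptions, a reply such as "A is tempting, but the answer is B." is graded as B, and a reply that only discusses the options without committing to one is a miss.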

Changed

  • The benchmark commands are now one bench subcommand group. llmtrim bench and
    llmtrim bench-agent are replaced by llmtrim bench quality and llmtrim bench agent,
    joined by three new axes under the same dispatcher: bench suite (the full corpus matrix
    in one process, replacing the run_all.sh shell script and its per-corpus cargo run
    spawns), bench latency (the warm compress-path micro-bench, folded in from the loose
    latency.rs), and bench compare <headroom|caveman> (a thin dispatcher over the Python
    head-to-head comparators). bench suite refuses to run live while an *_PROXY var is set,
    so the llmtrim proxy can no longer silently contaminate the A/B baseline.
  • Benchmark result JSON now carries a shared envelope. Every --json-out (quality,
    suite, agent) wraps its body in
    { schema, produced_at, commit, llmtrim_version, meta, result }, so any consumer can
    identify the schema and the code that produced it. The README/chart synthesizers
    unwrap it transparently and still read pre-envelope files.
  • bench quality --offline --json-out now writes its results. Previously --json-out
    was honored only on live runs, so the free offline savings pass produced nothing on disk.
    It now writes a quality-offline-v1 envelope (per-case input-token before/after plus the
    totals), which makes bench suite --offline reproducible without an API key.
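
For consumers of the --json-out files, a reader that accepts both the enveloped and the pre-envelope layout might look like the sketch below. The six envelope keys come from the notes above. The detection rule (all six keys present), the function name, and the fields inside the sample result body are assumptions for illustration, not the logic of the README/chart synthesizers.

```python
import json
from typing import Any, Dict, Tuple

# Envelope keys as listed in the release notes.
ENVELOPE_KEYS = {"schema", "produced_at", "commit", "llmtrim_version", "meta", "result"}

def unwrap(doc: Any) -> Tuple[Dict[str, Any], Any]:
    """Split a parsed result file into (header, body).

    Assumed rule: a dict carrying every envelope key is an envelope.
    Anything else is treated as a pre-envelope file and returned as the
    body with an empty header.
    """
    if isinstance(doc, dict) and ENVELOPE_KEYS.issubset(doc.keys()):
        header = {key: doc[key] for key in ENVELOPE_KEYS if key != "result"}
        return header, doc["result"]
    return {}, doc

# Sample documents with placeholder values; the body fields are hypothetical.
enveloped = json.loads("""
{
  "schema": "quality-offline-v1",
  "produced_at": "placeholder-timestamp",
  "commit": "placeholder-commit",
  "llmtrim_version": "0.1.11",
  "meta": {},
  "result": {"cases": []}
}
""")
legacy = {"cases": []}

header, body = unwrap(enveloped)
legacy_header, legacy_body = unwrap(legacy)
```

Checking the header's schema value before touching the body lets a consumer reject a file it does not understand instead of misreading it.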

Fixed

  • setup's caveman warning no longer claims llmtrim shapes output the same way caveman
    does. caveman users run coding agents, which route to the agent preset where auto
    deliberately leaves output unshaped, so the old "llmtrim already does this (Stage F)" reason
    was wrong for exactly the people who saw it. The warning now explains that auto already
    shapes output where it pays (code, long context, plain prose) and skips tool-call traffic
    because terse shaping saves no tokens on short replies (bench: quality neutral), so caveman
    is redundant either way.