llmtrim 0.1.7

github-actions released this 13 Jun 21:03


4d70afb

Added

LLMTRIM_CAPTURE_DIR records the applied stages. Each capture JSON now carries a
stages array — the names of the compression stages that actually rewrote the request.
Previously only plan (the output-rehydration plan, a different axis and usually empty)
was recorded, so an external auditor could not tell a lossless run that dropped content
(a bug) from a lossy stage doing its job.
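As a rough illustration of the audit this enables, the sketch below reads a capture's stages array and classifies a run in which content was dropped. Only the stages and plan keys come from the notes above. The stage names in LOSSY_STAGES, the function names, and the idea that the auditor supplies its own content_dropped verdict are assumptions made for the example.

```python
import json
from pathlib import Path

# ASSUMPTION: these stage names are hypothetical placeholders. Take the real
# names from the stages arrays found in your own capture files.
LOSSY_STAGES = {"tool_select", "history_summarize"}


def load_stages(capture_path: str) -> list:
    """Return the stages array of one capture JSON file (empty if absent)."""
    capture = json.loads(Path(capture_path).read_text(encoding="utf-8"))
    return capture.get("stages", [])


def classify_capture(capture: dict, content_dropped: bool) -> str:
    """Classify one capture record.

    content_dropped is the auditor's own finding (for example from diffing
    the original and rewritten request); it is not read from the capture.
    """
    if not content_dropped:
        return "ok: no content dropped"
    lossy_applied = sorted(LOSSY_STAGES.intersection(capture.get("stages", [])))
    if lossy_applied:
        return "expected: lossy stage(s) applied: " + ", ".join(lossy_applied)
    return "suspect: content dropped but no lossy stage recorded"
```

A capture whose stages list contains a lossy stage is classified as expected behaviour, while dropped content with no lossy stage recorded is flagged for investigation.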

UniFFI bindings (llmtrim-uniffi) + Python wheel. A new binding crate exposes
llmtrim-core to Python, Ruby, Swift and Kotlin from one Rust definition: a flat
compress(input, provider, preset) -> CompressOutput call with errors mapped to native
exceptions, running natively in-process (no server, no extra model calls). Each language
ships as a published package with the compiled engine bundled (no Rust toolchain needed
by consumers): a Python wheel (PyPI), a Ruby gem (RubyGems), a Kotlin/JVM jar (Maven
Central) and a Swift package (SwiftPM/XCFramework), built for Linux, macOS and Windows.
All four are exercised in CI. See crates/llmtrim-uniffi/README.md.
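A minimal sketch of calling the flat API from Python follows. The Python module name is not stated here, so the binding's compress function is passed in as an argument rather than imported. The helper name try_compress, the provider string, and the assumption that the input is the request body as a string are illustrative only; crates/llmtrim-uniffi/README.md is the authority on the real names.

```python
from typing import Any, Callable, Optional, Tuple


def try_compress(
    compress: Callable[[str, str, str], Any],
    request_body: str,
    provider: str,
    preset: str,
) -> Tuple[Optional[Any], Optional[Exception]]:
    """Call compress(input, provider, preset) and capture any native exception.

    Returns (output, None) on success and (None, error) on failure, so the
    caller can fall back to sending the uncompressed request.
    """
    try:
        return compress(request_body, provider, preset), None
    except Exception as exc:  # binding errors surface as ordinary exceptions
        return None, exc


# Stand-in for the real binding, used only to make the sketch runnable.
def fake_compress(request_body: str, provider: str, preset: str) -> dict:
    if preset not in {"agent", "aggressive"}:
        raise ValueError("unknown preset: " + preset)
    return {"body": request_body, "preset": preset}


output, error = try_compress(fake_compress, '{"messages": []}', "openai", "agent")
```

Because the call runs in-process, a failure can be handled with an ordinary try/except instead of a network error path.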

Changed

Split into a Cargo workspace: llmtrim-core (engine) + llmtrim (CLI/proxy).
The deterministic compression engine — compress/compress_with_config/route/
rehydrate/CompressResult plus the pipeline, stage, provider, tokenizer, gate and
config modules — now lives in a standalone llmtrim-core crate with no async/tokio
in its dependency tree, so it can be embedded as a library. The llmtrim binary,
MITM interceptor, daemon, token ledger, live benchmark and terminal UI move to the
llmtrim CLI crate, which depends on llmtrim-core. No behavior change; the llmtrim
command and its install paths are unchanged. rehydrate is now pub (the CLI's
interceptor calls it across the crate boundary).

Fixed

Tool selection no longer churns the cached prompt prefix on agent loops (#9). Tool
selection keeps only the tools that its relevance ranking scores as relevant to the
conversation, so the kept subset changes from turn to turn. Providers fold the tools[]
block into the cached prompt prefix, so a changing block invalidated the prefix on every
turn of an agent loop: provider prompt-cache reads dropped and the prefix was rebilled as
fresh input, which on a cache-warm loop can cost more than not compressing at all.
Selection now runs only on the first turn of a conversation, where there is no prior
prefix to bust and the saving is free. From the second turn on the tool set is left
intact, and only the deterministic description-trim and schema-minify stages shrink the
block. Both are pure functions of the toolset, so the block stays byte-identical from turn
to turn (regression-tested). This applies to every preset that selects tools (agent,
aggressive). A single-shot request with a large toolset still gets the full pruning saving.
On a cache-warm multi-turn loop the fix keeps the tool prefix reusable instead of rebilling
it each turn. An exploratory gpt-4o-mini run showed it roughly halving freshly billed input
once the prefix is warm; that figure is indicative, not a committed benchmark. Because the
first turn ships the pruned set and the second turn the full set, there is a one-time
prefix change at that boundary (a single extra cache write, a ~25% premium over base input
on Anthropic) before the prefix stays warm. The fix stabilizes the tool block only; keeping
earlier-turn message content byte-stable across turns still relies on the turn-stability
memo (memo = true, the default).
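The turn rule above can be sketched as follows. This is an illustration of the idea, not the engine's code: the function names, the 120-character description limit, the use of turn_index == 0 to mean "first turn", and the compact-JSON serialization are all assumptions made for the example.

```python
import json
from typing import Callable, List


def trim_tool(tool: dict, max_description_chars: int = 120) -> dict:
    """Pure function of one tool: shorten its description, keep the rest."""
    trimmed = dict(tool)
    if "description" in trimmed:
        trimmed["description"] = trimmed["description"][:max_description_chars]
    return trimmed


def serialize_tools(tools: List[dict]) -> bytes:
    """Deterministic compact serialization; input order is preserved."""
    compact = json.dumps(
        [trim_tool(t) for t in tools], separators=(",", ":"), ensure_ascii=False
    )
    return compact.encode("utf-8")


def tools_for_turn(
    tools: List[dict], turn_index: int, select: Callable[[List[dict]], List[dict]]
) -> bytes:
    """Prune with select() on the first turn only; later turns keep every tool."""
    chosen = select(tools) if turn_index == 0 else tools
    return serialize_tools(chosen)


tools = [
    {"name": "search", "description": "x" * 300, "parameters": {"type": "object"}},
    {"name": "read_file", "description": "Read a file", "parameters": {"type": "object"}},
]
turn0 = tools_for_turn(tools, 0, lambda ts: ts[:1])  # pruned set
turn1 = tools_for_turn(tools, 1, lambda ts: ts[:1])  # full set, selector ignored
turn2 = tools_for_turn(tools, 2, lambda ts: ts[1:])  # same bytes as turn1
```

Whatever the selector would have picked on later turns, the serialized block depends only on the toolset, so turn1 and turn2 are byte-identical and the only prefix change is the one between turn0 and turn1.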
