llmtrim 0.1.7

@github-actions github-actions released this 13 Jun 21:03
· 75 commits to main since this release

Added

  • LLMTRIM_CAPTURE_DIR records the applied stages. Each capture JSON now carries a
    stages array — the names of the compression stages that actually rewrote the request.
    Previously only plan (the output-rehydration plan, a different axis and usually empty)
    was recorded, so an external auditor could not tell a lossless run that dropped content
    (a bug) from a lossy stage doing its job.
  • UniFFI bindings (llmtrim-uniffi) + Python wheel. A new binding crate exposes
    llmtrim-core to Python, Ruby, Swift and Kotlin from one Rust definition: a flat
    compress(input, provider, preset) -> CompressOutput call with errors mapped to native
    exceptions, running natively in-process (no server, no extra model calls). Each language
    ships as a published package with the compiled engine bundled (no Rust toolchain needed
    by consumers): a Python wheel (PyPI), a Ruby gem (RubyGems), a Kotlin/JVM jar (Maven
    Central) and a Swift package (SwiftPM/XCFramework), built for Linux, macOS and Windows.
    All four are exercised in CI. See crates/llmtrim-uniffi/README.md.
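
As a rough illustration of how an external auditor might use the new stages array, the sketch below reads every capture in a directory and flags requests that were rewritten by a stage outside an allow-list. It assumes captures are *.json files with a top-level stages array of strings. The helper names and the stage names in the usage note are hypothetical, not taken from llmtrim.

```python
import json
from pathlib import Path


def applied_stages(capture_dir):
    """Map each capture file name to the list in its `stages` field.

    Assumption: captures are *.json files with a top-level `stages` array
    of stage-name strings; a missing field is treated as an empty list.
    """
    result = {}
    for path in sorted(Path(capture_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            capture = json.load(fh)
        result[path.name] = list(capture.get("stages", []))
    return result


def unexpected_rewrites(captures, allowed):
    """Return only the captures where a stage outside `allowed` rewrote the request."""
    allowed = set(allowed)
    return {
        name: [s for s in stages if s not in allowed]
        for name, stages in captures.items()
        if any(s not in allowed for s in stages)
    }
```

Calling unexpected_rewrites(applied_stages(some_dir), ["schema-minify"]) would then list every capture in which something other than the (hypothetically named) schema-minify stage touched the request, which is the distinction the plan field alone could not provide.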

Changed

  • Split into a Cargo workspace: llmtrim-core (engine) + llmtrim (CLI/proxy).
    The deterministic compression engine — compress/compress_with_config/route/
    rehydrate/CompressResult plus the pipeline, stage, provider, tokenizer, gate and
    config modules — now lives in a standalone llmtrim-core crate with no async/tokio
    in its dependency tree, so it can be embedded as a library. The llmtrim binary,
    MITM interceptor, daemon, token ledger, live benchmark and terminal UI move to the
    llmtrim CLI crate, which depends on llmtrim-core. No behavior change; the llmtrim
    command and its install paths are unchanged. rehydrate is now pub (the CLI's
    interceptor calls it across the crate boundary).

Fixed

  • Tool selection no longer churns the cached prompt prefix on agent loops (#9): tool
    selection keeps only the tools that its relevance ranking scores highly against the
    conversation, so the kept subset changes from turn to turn. Providers fold the tools[] block into the
    cached prompt prefix, so a changing block invalidated the prefix on every turn of an agent
    loop — provider prompt-cache reads dropped and the prefix was rebilled as fresh input,
    which on a cache-warm loop can cost more than not compressing at all. Selection now runs
    only on the first turn of a conversation (where there is no prior prefix to bust and the
    saving is free); from the second turn on the tool set is left intact, and only the
    deterministic description-trim and schema-minify stages shrink the block — they are pure
    functions of the toolset, so the block stays byte-identical turn to turn (regression-tested).
    Applies to every preset that selects tools (agent, aggressive). A single-shot request with
    a large toolset still gets the full pruning saving. On a cache-warm multi-turn loop this keeps
    the tool prefix reusable instead of rebilling it each turn (an exploratory gpt-4o-mini run
    showed it roughly halving freshly-billed input once the prefix is warm — indicative, not a
    committed benchmark). The first turn ships the pruned set and turn two the full set, so there is
    a one-time prefix change at that boundary (a single extra cache write, a ~25% premium on Anthropic) before
    it stays warm. This stabilizes the tool block on its own; keeping earlier-turn message content
    byte-stable across turns still relies on the turn-stability memo (memo = true, default).
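
The first-turn-only policy described above can be sketched as follows. This is a minimal illustration, not llmtrim's implementation: tools_block, trim_tool, the is_relevant callback and the 80-character limit are all hypothetical stand-ins. The point is that from the second turn on the serialized block depends only on the toolset, never on the conversation.

```python
import json


def trim_tool(tool, max_desc=80):
    """Hypothetical stand-in for description trimming: a pure function of the tool."""
    return {**tool, "description": tool["description"][:max_desc]}


def tools_block(tools, turn, is_relevant):
    """Serialize the tools[] block for one request.

    turn 0: prune by relevance (no cached prefix exists yet), then trim.
    turn >= 1: keep every tool and only trim, so the output is a function
    of the toolset alone and stays byte-identical from turn to turn.
    """
    kept = [t for t in tools if is_relevant(t)] if turn == 0 else tools
    trimmed = [trim_tool(t) for t in kept]
    # Sorted keys and fixed separators make the serialization reproducible.
    return json.dumps(trimmed, sort_keys=True, separators=(",", ":"))
```

With this shape, the relevance callback may return different answers on every turn without affecting the block after the first request; the only change in the prefix is the one-time switch from the pruned set to the full set between the first and second turns.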