
v0.2.0 — Split mode, estimate endpoint, auth & rate limiting


@finktech-dev finktech-dev released this 08 Jun 01:51

llm-zip v0.2.0

Split mode, estimate endpoint, API key auth, and rate limiting.

What's new

Split mode — run the API and the inference engine as separate containers.
DEPLOY_MODE=split in config (or via env var) makes llmzip-api stateless
and delegates compression and scoring to llmzip-models over HTTP.
Scale the API layer independently without duplicating the ~700 MB of model weights.
See docker-compose.split.yml and Dockerfile.api / Dockerfile.models.
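
As a rough illustration of the mode switch described above, the sketch below reads DEPLOY_MODE from the environment and validates it. The function name, the "monolith" value and the choice of "monolith" as the default are assumptions made for this example; only DEPLOY_MODE=split comes from these notes.

```python
import os


def resolve_deploy_mode(env=None):
    """Return the deployment mode named by DEPLOY_MODE.

    Hypothetical helper: the accepted values and the default below are
    assumptions, not the project's documented behaviour.
    """
    env = os.environ if env is None else env
    mode = env.get("DEPLOY_MODE", "monolith").strip().lower()
    if mode not in ("split", "monolith"):
        raise ValueError(f"unsupported DEPLOY_MODE: {mode!r}")
    return mode


if __name__ == "__main__":
    # In split mode the API process would delegate to llmzip-models over HTTP.
    print(resolve_deploy_mode({"DEPLOY_MODE": "split"}))
```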

Estimate endpoint: POST /v1/estimate returns token counts and savings
estimates without performing actual compression. Useful for agents deciding
whether compression is worth the CPU cost before committing.
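
A client call to the estimate endpoint might be prepared as in the sketch below. The request body shape (a single "text" field), the base URL and the helper name are assumptions for illustration; these notes only specify the POST /v1/estimate path and the optional Bearer header.

```python
import json
import urllib.request


def build_estimate_request(base_url, text, api_key=None):
    """Build (but do not send) a POST request for /v1/estimate."""
    # Assumption: the endpoint accepts a JSON object with a "text" field.
    payload = json.dumps({"text": text}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the server has API_KEY configured.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/estimate",
        data=payload,
        headers=headers,
        method="POST",
    )


if __name__ == "__main__":
    req = build_estimate_request("http://localhost:8000", "A long prompt ...")
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req) and json-decode the response.
```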

API key auth — set API_KEY in [server] to require
Authorization: Bearer <key> on all endpoints. Health checks remain public.
Off by default — if no key is set, the API is unauthenticated.
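
The behaviour described above (open when no key is set, Bearer token required otherwise) could be checked along the following lines. This is a minimal sketch, not the project's actual middleware; the function name is hypothetical and the constant-time comparison is a general good practice rather than something these notes confirm.

```python
import hmac


def is_authorized(auth_header, configured_key):
    """Return True if the request may proceed.

    auth_header is the raw Authorization header value (or None);
    configured_key is the API_KEY setting (or None/empty when unset).
    """
    if not configured_key:
        return True  # no key configured: the API is unauthenticated
    if not auth_header:
        return False
    scheme, _, token = auth_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking the key through timing differences.
    return hmac.compare_digest(token.strip().encode(), configured_key.encode())


if __name__ == "__main__":
    print(is_authorized("Bearer s3cret", "s3cret"))
```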

Rate limiting: slowapi integration with configurable REQUESTS_PER_MINUTE
and REQUESTS_PER_DAY in .llmzip.config. Off by default.
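
One plausible way to turn the two settings into limiter rules is sketched below. slowapi accepts limit strings such as "60/minute"; the helper name and the idea that unset values disable a limit are assumptions for this example.

```python
def build_limits(requests_per_minute=None, requests_per_day=None):
    """Translate the two config values into limit strings.

    Unset or zero values produce no rule, mirroring "off by default".
    """
    limits = []
    if requests_per_minute:
        limits.append(f"{int(requests_per_minute)}/minute")
    if requests_per_day:
        limits.append(f"{int(requests_per_day)}/day")
    return limits


if __name__ == "__main__":
    # The resulting list could be handed to slowapi, for example:
    #   Limiter(key_func=get_remote_address, default_limits=build_limits(60, 1000))
    print(build_limits(60, 1000))
```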

Concurrency — removed the global lock around PromptCompressor inference.
Batch requests now compress items in true parallel on CPU.
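
To illustrate what lock-free batch processing can look like, the sketch below fans items out over a thread pool. It uses a stand-in compress function rather than PromptCompressor, and these notes do not say whether the project uses threads or another mechanism, so treat the structure as an assumption. Threads only run in parallel on CPU when the underlying inference code releases the GIL.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_compress(text):
    """Stand-in for real inference: keep every other word."""
    return " ".join(text.split()[::2])


def compress_batch(items, compress=fake_compress, max_workers=4):
    """Compress items concurrently, preserving input order.

    No shared lock is taken, so slow items do not serialize the batch.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(compress, items))


if __name__ == "__main__":
    print(compress_batch(["a b c d", "one two three"]))
```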

Scorer reliability: SCORER_TIMEOUT and SCORER_MODEL are now
configurable. Slow embedding models no longer hang the entire request.
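
A timeout of this kind could be enforced roughly as follows. The helper name and the choice to return None on timeout are assumptions; the project may surface a timeout differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def score_with_timeout(score_fn, text, timeout_s):
    """Run score_fn(text), giving up after timeout_s seconds.

    Returns None on timeout. The worker thread is not killed; the caller
    simply stops waiting so the request can continue.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(score_fn, text)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return None
    finally:
        pool.shutdown(wait=False)


if __name__ == "__main__":
    slow = lambda t: time.sleep(0.5) or 1.0  # simulated slow embedding model
    print(score_with_timeout(slow, "prompt", timeout_s=0.05))
```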

Fixed

  • CLI --json flag now silences human-readable metrics — output is valid JSON
  • Token counting uses tiktoken.encoding_for_model() — fixes ambiguous matches
    between model families (e.g. gpt-4o vs gpt-4)
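
The kind of ambiguity mentioned in the last fix can be illustrated with a small stand-alone example. The release notes do not describe how the old matching worked, so the first-match lookup below is a hypothetical reconstruction of the failure mode, and the table is a toy one, not tiktoken's.

```python
# Toy table for illustration only; gpt-4 and gpt-4o use different encodings.
PREFIX_TO_ENCODING = {"gpt-4": "cl100k_base", "gpt-4o": "o200k_base"}


def naive_lookup(model):
    """First-prefix-wins matching: 'gpt-4' shadows 'gpt-4o'."""
    for prefix, encoding in PREFIX_TO_ENCODING.items():
        if model.startswith(prefix):
            return encoding
    return None


def longest_prefix_lookup(model):
    """Prefer the most specific matching prefix."""
    matches = [p for p in PREFIX_TO_ENCODING if model.startswith(p)]
    return PREFIX_TO_ENCODING[max(matches, key=len)] if matches else None


if __name__ == "__main__":
    print(naive_lookup("gpt-4o-mini"), longest_prefix_lookup("gpt-4o-mini"))
```

Delegating the lookup to tiktoken.encoding_for_model(), as the fix does, avoids maintaining such a table by hand.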

Upgrading from 0.1.x

No breaking changes. Copy the new keys from .llmzip.config.example into your
existing config if you want to use auth, rate limiting, or split mode.
Monolith mode (docker-compose up) works exactly as before.