v0.2.0 — Split mode, estimate endpoint, auth & rate limiting
llm-zip v0.2.0
Split mode, estimate endpoint, API key auth, and rate limiting.
What's new
Split mode — run the API and the inference engine as separate containers.
DEPLOY_MODE=split in config (or via env var) makes llmzip-api stateless
and delegates compression and scoring to llmzip-models over HTTP.
Scale the API layer independently without duplicating the ~700 MB of model weights.
See docker-compose.split.yml and Dockerfile.api / Dockerfile.models.
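
A minimal sketch of how the mode switch could be resolved from the config value and the environment. The helper name resolve_deploy_mode, the "monolith" mode label, and the rule that the env var takes precedence over the config file are assumptions for illustration, not confirmed llm-zip behavior.

```python
import os
from typing import Mapping, Optional

# "split" comes from the release notes; "monolith" as a mode label is an assumption.
VALID_MODES = {"monolith", "split"}


def resolve_deploy_mode(config_value: Optional[str],
                        env: Mapping[str, str] = os.environ) -> str:
    """Pick the deploy mode; the env var is assumed to override the config file."""
    raw = env.get("DEPLOY_MODE") or config_value or "monolith"
    mode = raw.strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"unknown DEPLOY_MODE: {raw!r}")
    return mode
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment.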
Estimate endpoint — POST /v1/estimate returns token counts and savings
estimates without performing actual compression. Useful for agents deciding
whether compression is worth the CPU cost before committing.
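
The decision an agent makes before compressing could look roughly like the sketch below. Only the POST /v1/estimate path comes from these notes. The request field "text" and the response fields "original_tokens" and "estimated_tokens" are hypothetical placeholders, so check the actual schema before relying on them.

```python
import json
import urllib.request


def build_estimate_request(base_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/estimate request."""
    body = json.dumps({"text": text}).encode("utf-8")  # "text" is a hypothetical field
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/estimate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def worth_compressing(estimate: dict, min_savings: float = 0.2) -> bool:
    """Return True when the estimated token savings reach min_savings."""
    original = estimate["original_tokens"]     # hypothetical field name
    compressed = estimate["estimated_tokens"]  # hypothetical field name
    if original <= 0:
        return False
    return (original - compressed) / original >= min_savings
```

The request object can be sent with urllib.request.urlopen(req) once the base URL of your deployment is known. The 20% threshold is an arbitrary example value.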
API key auth — set API_KEY in [server] to require
Authorization: Bearer <key> on all endpoints except health checks, which remain public.
Off by default — if no key is set, the API is unauthenticated.
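
The Bearer scheme described above can be sketched from both sides. The function names auth_headers and is_authorized are made up for this example, and the server-side check is an assumption about how such a check is commonly written, not llm-zip's actual code.

```python
import hmac
from typing import Dict, Optional

PREFIX = "Bearer "


def auth_headers(api_key: Optional[str]) -> Dict[str, str]:
    """Client side: add the Authorization header only when a key is set."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = PREFIX + api_key
    return headers


def is_authorized(authorization: Optional[str],
                  configured_key: Optional[str]) -> bool:
    """Server side: accept everything when no key is configured (the default)."""
    if not configured_key:
        return True
    if not authorization or not authorization.startswith(PREFIX):
        return False
    presented = authorization[len(PREFIX):]
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(presented.encode(), configured_key.encode())
```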
Rate limiting — slowapi integration with configurable REQUESTS_PER_MINUTE
and REQUESTS_PER_DAY in .llmzip.config. Off by default.
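
As a sketch of how the two settings might map onto slowapi, the helper below turns them into "N/period" limit strings, the format slowapi accepts (for example through the default_limits argument of its Limiter). The helper name build_limits and the treatment of unset or zero values as "off" are assumptions.

```python
from typing import List, Optional


def build_limits(requests_per_minute: Optional[int],
                 requests_per_day: Optional[int]) -> List[str]:
    """Turn the two config values into "N/period" limit strings."""
    limits: List[str] = []
    if requests_per_minute:
        limits.append(f"{requests_per_minute}/minute")
    if requests_per_day:
        limits.append(f"{requests_per_day}/day")
    return limits  # an empty list means rate limiting stays off
```

With slowapi installed, the result could be passed along the lines of Limiter(key_func=get_remote_address, default_limits=build_limits(60, 1000)).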
Concurrency — removed the global lock around PromptCompressor inference.
Batch requests now compress items in true parallel on CPU.
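
The general pattern of compressing batch items concurrently, with no shared lock, can be illustrated with a thread pool. This is not the project's implementation: fake_compress is a stand-in for the real compressor, and whether threads give true CPU parallelism depends on the inference library releasing the GIL during native computation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def compress_batch(items: List[str],
                   compress: Callable[[str], str],
                   max_workers: int = 4) -> List[str]:
    """Compress every item concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(compress, items))


def fake_compress(text: str) -> str:
    """Stand-in for the real compressor: keeps every other word."""
    return " ".join(text.split()[::2])
```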
Scorer reliability — SCORER_TIMEOUT and SCORER_MODEL are now
configurable. Slow embedding models no longer hang the entire request.
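
One common way to keep a slow scorer from blocking a request is to bound the wait on its result. The sketch below assumes timeout_s carries the SCORER_TIMEOUT value. The function names and the choice to return None on timeout are illustrative, and the mechanism llm-zip actually uses is not described in these notes.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Optional


def score_with_timeout(scorer: Callable[[str], float],
                       text: str,
                       timeout_s: float) -> Optional[float]:
    """Run scorer(text); return None if it takes longer than timeout_s seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(scorer, text)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return None  # the caller can skip scoring instead of hanging
    finally:
        # Do not wait for a slow scorer; its thread finishes in the background.
        pool.shutdown(wait=False)


def slow_scorer(text: str) -> float:
    """Stand-in for a slow embedding model."""
    time.sleep(0.3)
    return 1.0
```

Note that the timed-out thread is not cancelled, only abandoned, so this bounds request latency rather than CPU use.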
Fixed
- CLI --json flag now silences human-readable metrics, so output is valid JSON
- Token counting uses tiktoken.encoding_for_model(), which fixes ambiguous matches
  between model families (e.g. gpt-4o vs gpt-4)
Upgrading from 0.1.x
No breaking changes. Copy the new keys from .llmzip.config.example into your
existing config if you want to use auth, rate limiting, or split mode.
Monolith mode (docker-compose up) works exactly as before.