Releases: finktech-dev/llm-zip
v0.2.2: Cascading chunker, core optimizations, and stability fixes
v0.2.0
This release finalizes the stabilization phase, focusing on processing reliability and CI/CD robustness.
Changelog
- fix(core): Implemented a 4-level cascading chunker (Paragraphs → Sentences → Lines → Sliding Window) to prevent silent BERT truncation on dense prose.
- perf(core): Optimized token counting by grouping encodings and caching tiktoken objects, reducing latency in batch requests.
- fix(types): Performed a comprehensive refactor of type annotations across the API, CLI, and Pricing modules to satisfy strict CI/CD pipelines.
- feat(ci): Added GitHub Actions workflows for automated linting, type checking, and PyPI releases.
- feat(i18n): Added truncation warning keys in five languages: English (en), Spanish (es), Portuguese (pt), Chinese (zh), and Japanese (ja).
- chore(ci): Initialized default environment configuration and installed inference dependencies for GitHub Actions runners.
- fix(tests): Marked API integration tests with the integration marker to prevent out-of-memory (OOM) errors in memory-constrained environments (Win32).
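The cascading chunker in the first changelog item can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the whitespace-based `count_tokens` stands in for a real BERT tokenizer, the function names are hypothetical, and the final sliding window uses no overlap. It only shows the fallback order (paragraphs, then sentences, then lines, then fixed windows), not the project's actual implementation.

```python
import re
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace words instead of BERT tokens.
    return len(text.split())


# Splitters ordered from coarsest to finest; each returns non-empty pieces.
SPLITTERS: List[Callable[[str], List[str]]] = [
    lambda t: [p for p in re.split(r"\n\s*\n", t) if p.strip()],        # paragraphs
    lambda t: [s for s in re.split(r"(?<=[.!?])\s+", t) if s.strip()],  # sentences
    lambda t: [ln for ln in t.splitlines() if ln.strip()],              # lines
]


def sliding_window(text: str, limit: int) -> List[str]:
    # Last resort: fixed windows of `limit` words, no overlap (assumption).
    words = text.split()
    return [" ".join(words[i:i + limit]) for i in range(0, len(words), limit)]


def cascade(text: str, limit: int, level: int = 0) -> List[str]:
    """Split `text` into chunks of at most `limit` tokens, coarsest split first."""
    if not text.strip():
        return []
    if count_tokens(text) <= limit:
        return [text.strip()]
    if level >= len(SPLITTERS):
        return sliding_window(text, limit)
    chunks: List[str] = []
    for piece in SPLITTERS[level](text):
        # Only pieces that are still too large fall through to a finer level.
        chunks.extend(cascade(piece, limit, level + 1))
    return chunks
```

A piece is split further only when it still exceeds the limit, so short paragraphs stay intact and only dense passages reach the sliding window.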
llm-zip v0.2.1
Health probes, structured logging, system info endpoint, and dependency fixes.
What's New
Health probes
Added Kubernetes-compliant /health/live and /health/ready endpoints.
- live guarantees the HTTP server is running.
- ready remains unavailable until inference models are fully loaded into memory, handling the typical 2–5 minute cold-start latency.
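The difference between the two probes can be sketched without any web framework. The class name, response bodies, and the use of 503 for "not ready" are assumptions for illustration; the release notes only state that ready stays unavailable until the models are loaded. Kubernetes treats any status from 200 to 399 as a passing probe, so 503 is a conventional choice for the failing case.

```python
import threading
from typing import Dict, Tuple


class ProbeState:
    """Tracks model loading; both probe handlers read from it (hypothetical)."""

    def __init__(self) -> None:
        self._models_loaded = threading.Event()

    def mark_ready(self) -> None:
        # Call once the inference models are fully in memory.
        self._models_loaded.set()

    def live(self) -> Tuple[int, Dict[str, str]]:
        # Liveness: the process is up and can answer HTTP at all.
        return 200, {"status": "alive"}

    def ready(self) -> Tuple[int, Dict[str, str]]:
        # Readiness: fail until loading finishes so no traffic is routed early.
        if self._models_loaded.is_set():
            return 200, {"status": "ready"}
        return 503, {"status": "loading"}
```

With a split like this, an orchestrator can avoid restarting the container during a long cold start (liveness passes) while still withholding traffic (readiness fails).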
Structured logging
Added rotating JSON file logging in logs/llmzip.log alongside colored console output.
Logs now include structured fields such as:
- tokens_in
- tokens_out
- ratio
- elapsed_ms
This makes ingestion by monitoring platforms such as Datadog and Loki significantly easier.
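A minimal sketch of rotating JSON logs with the standard library follows. The formatter class, logger name, rotation size, and backup count are assumptions; only the four structured field names come from the notes above.

```python
import json
import logging
from logging.handlers import RotatingFileHandler


class JsonFormatter(logging.Formatter):
    # Extra fields copied into the JSON object when present on the record.
    FIELDS = ("tokens_in", "tokens_out", "ratio", "elapsed_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for name in self.FIELDS:
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload)


def build_logger(path: str) -> logging.Logger:
    logger = logging.getLogger("llmzip.sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    # Rotate at roughly 5 MB and keep 3 backups (both values are assumptions).
    handler = RotatingFileHandler(path, maxBytes=5_000_000, backupCount=3)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

A call such as `logger.info("compressed", extra={"tokens_in": 1200, "tokens_out": 600})` would then emit one JSON object per line, which is the shape most log shippers expect.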
Info endpoint
Added GET /v1/info.
Returns:
- Current system configuration
- Loaded models
- Enabled features
- Active hardware limits (e.g. max_tokens, max_file_size_mb)
File size limits
Enforced MAX_FILE_SIZE_MB (default: 50 MB) on the /v1/compress/file endpoint to prevent memory exhaustion when processing large documents.
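The size check itself is simple; a sketch under stated assumptions is shown below. The function name is hypothetical, the notes do not say which status code the endpoint returns (413 is assumed here as the conventional "payload too large" response), and MB is interpreted as 1024 × 1024 bytes.

```python
MAX_FILE_SIZE_MB = 50  # default stated in the release notes


def check_upload_size(num_bytes: int, limit_mb: int = MAX_FILE_SIZE_MB) -> int:
    """Return an HTTP status: 200 if acceptable, 413 if the upload is too large."""
    limit_bytes = limit_mb * 1024 * 1024  # assumes MB means MiB here
    return 413 if num_bytes > limit_bytes else 200
```

Rejecting on size before parsing means an oversized document never has to be loaded into memory in full.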
CLI commands
Added:
- llmzip version to quickly verify the installed package version.
Documentation
Added:
- DOCKER.md with detailed guidance for monolith and split deployments, including Kubernetes examples.
- KNOWN_LIMITATIONS.md documenting current architectural constraints and expected behavior.
Fixed
Docker dependencies
Resolved a ModuleNotFoundError affecting split-mode deployments by ensuring sentence-transformers is installed in the stateless API container when semantic scoring is enabled.
Dependency scope
Moved heavy machine learning dependencies (llmlingua, markitdown) into the optional [inference] dependency group in pyproject.toml.
API reliability
Fixed a NameError involving _get_warning that could trigger HTTP 500 responses during single-file and batch compression requests.
Upgrading from 0.2.0
No breaking changes.
The logs/ directory will be created automatically on startup.
If you want to override the default 50 MB upload limit, copy MAX_FILE_SIZE_MB from .llmzip.config.example into your existing configuration file.
v0.2.0 — Split mode, estimate endpoint, auth & rate limiting
llm-zip v0.2.0
Split mode, estimate endpoint, API key auth, and rate limiting.
What's new
Split mode — run the API and the inference engine as separate containers.
DEPLOY_MODE=split in config (or via env var) makes llmzip-api stateless
and delegates compression and scoring to llmzip-models over HTTP.
Scale the API layer independently without duplicating the ~700 MB of model weights.
See docker-compose.split.yml and Dockerfile.api / Dockerfile.models.
Estimate endpoint — POST /v1/estimate returns token counts and savings
estimates without performing actual compression. Useful for agents deciding
whether compression is worth the CPU cost before committing.
API key auth — set API_KEY in [server] to require
Authorization: Bearer <key> on all endpoints. Health checks remain public.
Off by default — if no key is set, the API is unauthenticated.
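From the client side, the Bearer scheme described above amounts to one extra header. The sketch below only builds the request and does not send it. The host, port, key value, and the `{"text": ...}` payload shape are assumptions; only the `Authorization: Bearer <key>` header and the `/v1/estimate` path come from the notes.

```python
import json
import urllib.request


def build_request(base_url: str, api_key: str, text: str) -> urllib.request.Request:
    # The payload shape is assumed; check the API docs for the real schema.
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/estimate",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical host and key, for illustration only.
req = build_request("http://localhost:8000", "my-secret-key", "hello")
```

Passing `req` to `urllib.request.urlopen` would perform the call; when no API_KEY is configured on the server, the header is simply unnecessary.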
Rate limiting — slowapi integration with configurable REQUESTS_PER_MINUTE
and REQUESTS_PER_DAY in .llmzip.config. Off by default.
Concurrency — removed the global lock around PromptCompressor inference.
Batch requests now compress items in true parallel on CPU.
Scorer reliability — SCORER_TIMEOUT and SCORER_MODEL are now
configurable. Slow embedding models no longer hang the entire request.
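One common way to keep a slow scorer from blocking a request is to wait on it with a deadline. The sketch below is an assumption about the general technique, not the project's code: the function names are hypothetical and the timeout is taken to be in seconds.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Optional

_executor = ThreadPoolExecutor(max_workers=2)


def score_with_timeout(scorer: Callable[[], float], timeout_s: float) -> Optional[float]:
    """Run `scorer` in a worker thread; return None if it exceeds `timeout_s`."""
    future = _executor.submit(scorer)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker keeps running in the background; only the caller stops waiting.
        return None


def slow_scorer() -> float:
    # Simulates a slow embedding model.
    time.sleep(0.3)
    return 0.9
```

Note the trade-off: the timed-out work is abandoned rather than cancelled, so a pool with a bounded worker count is needed to keep stragglers from piling up.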
Fixed
- CLI --json flag now silences human-readable metrics, so the output is valid JSON
- Token counting uses tiktoken.encoding_for_model(), which fixes ambiguous matches between model families (e.g. gpt-4o vs gpt-4)
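The kind of ambiguity mentioned in the token-counting fix can be illustrated with a toy lookup table. This is not how llm-zip or tiktoken is implemented; the two lookup functions are hypothetical. The encoding names are real: gpt-4 uses cl100k_base and gpt-4o uses o200k_base, so picking the wrong family yields wrong token counts.

```python
# Toy prefix table; dict order matters for the naive lookup below.
ENCODINGS = {"gpt-4": "cl100k_base", "gpt-4o": "o200k_base"}


def naive_lookup(model: str) -> str:
    # First prefix that matches wins, so "gpt-4" shadows "gpt-4o".
    for prefix, encoding in ENCODINGS.items():
        if model.startswith(prefix):
            return encoding
    raise KeyError(model)


def strict_lookup(model: str) -> str:
    # Exact name first, then the longest matching prefix.
    if model in ENCODINGS:
        return ENCODINGS[model]
    matches = [p for p in ENCODINGS if model.startswith(p)]
    if not matches:
        raise KeyError(model)
    return ENCODINGS[max(matches, key=len)]
```

Here `naive_lookup("gpt-4o-mini")` resolves to the gpt-4 encoding, while `strict_lookup` picks the gpt-4o one. Delegating to `tiktoken.encoding_for_model()` avoids maintaining such a table by hand.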
Upgrading from 0.1.x
No breaking changes. Copy the new keys from .llmzip.config.example into your
existing config if you want to use auth, rate limiting, or split mode.
Monolith mode (docker-compose up) works exactly as before.
Large document support via smart paragraph-based chunking
v0.1.9
- Implemented a smart paragraph-aware chunking layer to overcome the 512-token architectural limitation of BERT-based models (like bert-base-multilingual used in LLMLingua-2). Previously, processing documents larger than the model's native context window would trigger transformer indexing warnings and could result in unstable behavior or silent data truncation. The new logic splits input text on double newlines (\n\n) into chunks that fit within a 400-token safety margin (configurable via CHUNK_SIZE). These segments are compressed independently using a thread-safe workflow and then reassembled. This allows llm-zip to compress large RAG contexts and multi-page documents of any length while preserving semantic coherence at the paragraph level and keeping the compression model within its optimal efficiency range.
- Integrated a comprehensive internal benchmark suite into the README using real-world technical and academic datasets. These results include tests on 100+ page academic PDFs (290k+ tokens) and complex technical manuals in Spanish. The data shows that llm-zip maintains a preservation score above 0.89 and achieves compression ratios between 1.7x and 2.5x on high-density material, giving developers empirical evidence of token savings before deployment.
- Added CHUNK_SIZE configuration to the [compression] section of .llmzip.config, enabling fine-grained control over the segmentation process based on specific hardware capabilities and document structures.
- Reformatted all documentation tables with manual pipe and cell alignment to improve readability in terminal-based pagers, plain-text editors, and the GitHub web interface.
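The paragraph-based chunking in the first item above (split on blank lines, pack into a token budget, compress each chunk independently, reassemble in order) can be sketched as follows. The whitespace token counter, the function names, the worker count, and the `compress` callable are all assumptions standing in for the real tokenizer and compressor.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for a real tokenizer (assumption)


def pack_paragraphs(text: str, chunk_size: int = 400) -> List[str]:
    """Greedily pack whole paragraphs into chunks of at most `chunk_size` tokens."""
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        n = count_tokens(para)
        if current and used + n > chunk_size:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)  # an oversized paragraph becomes its own chunk
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def compress_document(text: str, compress: Callable[[str], str],
                      chunk_size: int = 400) -> str:
    chunks = pack_paragraphs(text, chunk_size)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() yields results in input order, so reassembly is a plain join.
        compressed = list(pool.map(compress, chunks))
    return "\n\n".join(compressed)
```

In this sketch a single paragraph longer than the budget is passed through as one oversized chunk; handling that case needs a finer-grained split than paragraphs alone.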
Concurrency and reliability fixes
v0.1.8
- Fixed ready endpoint returning incorrect status under multi-worker uvicorn deployments (disk-based marker replaces in-memory global)
- Fixed race condition in price resolver: LiteLLM fetch now runs outside the lock, preventing duplicate concurrent HTTP requests
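The disk-based marker mentioned in the first fix works because every uvicorn worker is a separate process with its own globals, while the filesystem is shared. A minimal sketch, with a hypothetical marker path and function names:

```python
import os
import tempfile
from pathlib import Path

# Marker location is hypothetical; any path visible to every worker would do.
MARKER = Path(tempfile.gettempdir()) / "llmzip_sketch.ready"


def mark_ready(marker: Path = MARKER) -> None:
    # Written by whichever process finishes loading the models.
    marker.write_text(str(os.getpid()))


def is_ready(marker: Path = MARKER) -> bool:
    # Every worker checks the same file, so they all report the same status.
    return marker.exists()


def clear_ready(marker: Path = MARKER) -> None:
    # Remove a stale marker on startup or shutdown.
    marker.unlink(missing_ok=True)
```

Clearing the marker at startup matters in practice: a file left over from a previous run would otherwise report ready before any model is loaded.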
v0.1.7: Pricing data corrections and internal consistency fixes
v0.1.7
- Fixed incorrect fallback prices for GPT-5.5, GPT-5.4 family, Gemini 3.x, and DeepSeek V4 (verified 2026-06-06)
- Added missing models to fallback: gpt-5.4, gpt-4.1-nano, claude-opus-4-6
- Fixed redundant count_tokens call in /v1/compress route
- Compression failure warnings now route through i18n instead of returning raw strings
- Centralized FEATURED_MODELS into core/featured_models.py, removing duplicate definitions in savings_calculator.py and compress_cmd.py
v0.1.6 — Hotfix
Fixed: literal newline artifact in lingua_adapter.py — force_tokens=["\n"] had a raw newline character instead of the escape sequence, causing SyntaxError on startup
Fixed: missing import (from typing import NoReturn) in loader.py, which caused a NameError on startup
v0.1.5 — Concurrency & config fixes
- Fixed: CompressRequest and BatchItem no longer hardcode gpt-4o-mini; the model now falls back to config.default_model from .llmzip.config when not specified in the request
- Fixed: added threading.Lock with double-checked locking to resolver.py, which prevents simultaneous LiteLLM fetches under concurrent batch load
- Fixed: _meta in fetcher and resolver now includes an explicit "source" field ("litellm" or "fallback") instead of inferring it from the note string
- Fixed: _fail() in loader.py now correctly typed as NoReturn
- Fixed: convert_bytes() in file_converter.py now closes the tempfile before passing it to MarkItDown, fixing PermissionError on Windows
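Double-checked locking, as named in the resolver fix above, checks the cached value once without the lock and once more after acquiring it. The sketch below is a generic illustration: the class name and the injected `fetch` callable are hypothetical and stand in for the real price fetch.

```python
import threading
from typing import Callable, Dict, Optional


class PriceCache:
    def __init__(self, fetch: Callable[[], Dict[str, float]]) -> None:
        self._fetch = fetch
        self._lock = threading.Lock()
        self._prices: Optional[Dict[str, float]] = None

    def get(self) -> Dict[str, float]:
        # First check without the lock: the common, already-loaded path.
        if self._prices is None:
            with self._lock:
                # Second check: another thread may have loaded it while we waited.
                if self._prices is None:
                    self._prices = self._fetch()
        return self._prices
```

Under a burst of concurrent requests, only the first thread to take the lock performs the fetch; the others block briefly and then reuse its result.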
v0.1.4 — Consolidated fixes
Fixed: threading.Lock and count_tokens import were missing from lingua_adapter.py — batch compression under concurrency would crash at runtime
Fixed: SemanticScorer now accepts and uses models_dir, ensuring the CLI and API both download the scorer model to the same volume
Fixed: compress_file.py had corrupted trailing code from a previous patch — cleaned up
Fixed: importlib.metadata import was missing from app.py despite the dynamic version call being present
Fixed: NamedTemporaryFile fix from v0.1.1 was not present in the built wheel — reapplied
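The NamedTemporaryFile fix addresses a Windows behavior: a temp file that is still open cannot be reopened by path by another consumer. A sketch of the usual pattern follows; the helper name is hypothetical and `convert` is a stand-in for whatever reads the file by path.

```python
import os
import tempfile
from typing import Callable


def convert_via_tempfile(data: bytes, suffix: str,
                         convert: Callable[[str], str]) -> str:
    """Write `data` to a temp file, close it, then hand the path to `convert`."""
    # delete=False lets the file outlive the handle so it can be reopened by path.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(data)
        tmp.close()  # on Windows an open handle blocks a second open of the path
        return convert(tmp.name)
    finally:
        tmp.close()  # harmless if already closed
        os.unlink(tmp.name)
```

Because `delete=False` disables automatic cleanup, the explicit `os.unlink` in the `finally` block is what keeps temp files from accumulating.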
v0.1.3 — Critical bugfix
Fixed: POST /v1/compress was calling lingua.compress() without the target_model argument, causing a TypeError on every request — this was the main compression endpoint
Fixed: API version in Swagger UI was hardcoded as 0.1.0; now reads dynamically from package metadata
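Reading the version from package metadata, as in the Swagger fix above, is typically done with `importlib.metadata`. In this sketch the helper name, the fallback value, and the distribution name "llm-zip" are assumptions.

```python
from importlib.metadata import PackageNotFoundError, version


def package_version(dist_name: str, fallback: str = "0.0.0") -> str:
    """Return the installed version of `dist_name`, or `fallback` if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # E.g. running from a source checkout without an installed distribution.
        return fallback


# "llm-zip" as the distribution name is an assumption.
API_VERSION = package_version("llm-zip")
```

The resulting string can then be passed to the web framework's version setting, so the documentation UI always matches the installed package.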