Releases: finktech-dev/llm-zip

v0.2.2: Cascading chunker, core optimizations, and stability fixes

10 Jun 17:55

v0.2.2

This release finalizes the stabilization phase, focusing on processing reliability and CI/CD robustness.

Changelog

  • fix(core): Implemented a 4-level cascading chunker (Paragraphs → Sentences → Lines → Sliding Window) to prevent silent BERT truncation on dense prose.
  • perf(core): Optimized token counting by grouping encodings and caching tiktoken objects, reducing latency in batch requests.
  • fix(types): Performed a comprehensive refactor of type annotations across the API, CLI, and Pricing modules to satisfy strict CI/CD pipelines.
  • feat(ci): Added GitHub Actions workflows for automated linting, type checking, and PyPI releases.
  • feat(i18n): Added truncation warning keys in five languages: English (en), Spanish (es), Portuguese (pt), Chinese (zh), and Japanese (ja).
  • chore(ci): Initialized default environment configuration and installed inference dependencies for GitHub Actions runners.
  • fix(tests): Marked API integration tests with the integration marker to prevent out-of-memory (OOM) errors in memory-constrained environments (Win32).
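
The four-level cascade named in the first changelog item can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: cascade_split, fixed_windows, the regexes, and the whitespace-based count_tokens are hypothetical stand-ins (the real chunker would count BERT word pieces), and the final level uses non-overlapping windows for simplicity.

```python
import re
from typing import List

def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace words, not BERT word pieces.
    return len(text.split())

def fixed_windows(text: str, max_tokens: int) -> List[str]:
    # Last resort: cut the text into windows of at most max_tokens words.
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

# Splitters ordered from coarsest to finest.
SPLITTERS = [
    lambda t: re.split(r"\n\s*\n", t),         # paragraphs
    lambda t: re.split(r"(?<=[.!?])\s+", t),   # sentences
    lambda t: t.splitlines(),                  # lines
]

def cascade_split(text: str, max_tokens: int, level: int = 0) -> List[str]:
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens:
        return [text]                          # already fits, stop descending
    if level >= len(SPLITTERS):
        return fixed_windows(text, max_tokens)
    chunks: List[str] = []
    for piece in SPLITTERS[level](text):
        # Only pieces that are still too large fall through to the next level.
        chunks.extend(cascade_split(piece, max_tokens, level + 1))
    return chunks
```

The sketch never merges small neighbouring pieces back together, so it produces more chunks than strictly necessary; its only guarantee is that no chunk exceeds max_tokens.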

llm-zip v0.2.1

09 Jun 04:58

Health probes, structured logging, system info endpoint, and dependency fixes.

What's New

Health probes

Added Kubernetes-compliant /health/live and /health/ready endpoints.

  • live guarantees the HTTP server is running.
  • ready remains unavailable until inference models are fully loaded into memory, handling the typical 2–5 minute cold-start latency.
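
The split between the two probes can be sketched with the standard library alone. Everything named here is an assumption for illustration: models_ready, probe_status, and ProbeHandler are hypothetical, and the 503 status for a not-yet-ready service is a common convention that the release notes do not confirm.

```python
import threading
from http.server import BaseHTTPRequestHandler

# Set by the model loader once inference models are in memory (hypothetical flag).
models_ready = threading.Event()

def probe_status(path: str) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/health/live":
        return 200                      # the process is up and serving HTTP
    if path == "/health/ready":
        # Assumed convention: 503 until the models have finished loading.
        return 200 if models_ready.is_set() else 503
    return 404

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(probe_status(self.path))
        self.send_header("Content-Length", "0")
        self.end_headers()
```

Keeping liveness independent of model loading lets an orchestrator hold traffic back during the cold start without restarting a container that is merely still loading.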

Structured logging

Added rotating JSON file logging in logs/llmzip.log alongside colored console output.

Logs now include structured fields such as:

  • tokens_in
  • tokens_out
  • ratio
  • elapsed_ms

This makes ingestion by monitoring platforms such as Datadog and Loki significantly easier.
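
A rotating JSON log of this kind can be approximated with Python's logging module. The sketch below is not the project's logger: JsonFormatter, build_logger, the logger name, and the rotation sizes are assumptions, and only the four field names come from the notes above.

```python
import json
import logging
from logging.handlers import RotatingFileHandler

STRUCTURED_FIELDS = ("tokens_in", "tokens_out", "ratio", "elapsed_ms")

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in STRUCTURED_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)

def build_logger(path: str, max_bytes: int = 5_000_000, backups: int = 3) -> logging.Logger:
    # Rotation sizes are illustrative defaults, not the project's settings.
    logger = logging.getLogger("llmzip.sketch")
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups, encoding="utf-8")
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as logger.info("compressed", extra={"tokens_in": 1200, "tokens_out": 600, "ratio": 2.0, "elapsed_ms": 35}) would then emit one JSON object per line, which is the shape log shippers generally expect.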

Info endpoint

Added GET /v1/info.

Returns:

  • Current system configuration
  • Loaded models
  • Enabled features
  • Active hardware limits (e.g. max_tokens, max_file_size_mb)

File size limits

Enforced MAX_FILE_SIZE_MB (default: 50 MB) on the /v1/compress/file endpoint to prevent memory exhaustion when processing large documents.
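
The check itself is small. A minimal sketch, assuming the limit is measured in mebibytes (1024 × 1024 bytes) and that the caller turns the error into an HTTP rejection; FileTooLargeError and check_upload_size are hypothetical names.

```python
class FileTooLargeError(ValueError):
    """Raised when an upload exceeds the configured limit."""

def check_upload_size(size_bytes: int, max_file_size_mb: int = 50) -> None:
    # Assumption: 1 MB is treated as 1024 * 1024 bytes.
    limit = max_file_size_mb * 1024 * 1024
    if size_bytes > limit:
        raise FileTooLargeError(
            f"upload is {size_bytes} bytes, limit is {limit} bytes "
            f"(MAX_FILE_SIZE_MB={max_file_size_mb})"
        )
```

Running the check before the document is converted or tokenized is what keeps an oversized upload from being expanded in memory.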

CLI commands

Added:

llmzip version

to quickly verify the installed package version.

Documentation

Added:

  • DOCKER.md with detailed guidance for monolith and split deployments, including Kubernetes examples.
  • KNOWN_LIMITATIONS.md documenting current architectural constraints and expected behavior.

Fixed

Docker dependencies

Resolved a ModuleNotFoundError affecting split-mode deployments by ensuring sentence-transformers is installed in the stateless API container when semantic scoring is enabled.

Dependency scope

Moved heavy machine learning dependencies (llmlingua, markitdown) into the optional [inference] dependency group in pyproject.toml.

API reliability

Fixed a NameError involving _get_warning that could trigger HTTP 500 responses during single-file and batch compression requests.

Upgrading from 0.2.0

No breaking changes.

The logs/ directory will be created automatically on startup.

If you want to override the default 50 MB upload limit, copy MAX_FILE_SIZE_MB from .llmzip.config.example into your existing configuration file.

v0.2.0 — Split mode, estimate endpoint, auth & rate limiting

08 Jun 01:51

llm-zip v0.2.0

Split mode, estimate endpoint, API key auth, and rate limiting.

What's new

Split mode — run the API and the inference engine as separate containers.
DEPLOY_MODE=split in config (or via env var) makes llmzip-api stateless
and delegates compression and scoring to llmzip-models over HTTP.
Scale the API layer independently without duplicating the ~700 MB of model weights.
See docker-compose.split.yml and Dockerfile.api / Dockerfile.models.

Estimate endpoint: POST /v1/estimate returns token counts and savings
estimates without performing actual compression. Useful for agents deciding
whether compression is worth the CPU cost before committing.

API key auth — set API_KEY in [server] to require
Authorization: Bearer <key> on all endpoints. Health checks remain public.
Off by default — if no key is set, the API is unauthenticated.
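
The rules in this paragraph can be sketched as a single check. This is a framework-free illustration rather than the project's middleware: is_authorized and PUBLIC_PREFIXES are hypothetical, and the "/health/" prefix is an assumption about which paths count as health checks.

```python
import hmac
from typing import Mapping, Optional

PUBLIC_PREFIXES = ("/health/",)  # assumed location of the health checks

def is_authorized(path: str, headers: Mapping[str, str], api_key: Optional[str]) -> bool:
    if not api_key:
        return True                      # no key configured: auth is off
    if path.startswith(PUBLIC_PREFIXES):
        return True                      # health checks stay public
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme != "Bearer":
        return False
    # Constant-time comparison avoids leaking the key through timing.
    return hmac.compare_digest(token.encode(), api_key.encode())
```

A request that fails the check would typically be answered with HTTP 401 before it reaches any compression code.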

Rate limiting: slowapi integration with configurable REQUESTS_PER_MINUTE
and REQUESTS_PER_DAY in .llmzip.config. Off by default.

Concurrency — removed the global lock around PromptCompressor inference.
Batch requests now compress items in true parallel on CPU.
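
Without a global lock, a batch can be fanned out across a thread pool. A minimal sketch, assuming a thread-safe compress callable; compress_batch and max_workers are hypothetical names, not the project's API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

def compress_batch(
    items: Sequence[str],
    compress: Callable[[str], str],
    max_workers: int = 4,
) -> List[str]:
    # pool.map returns results in input order even when items finish out of order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(compress, items))
```

Threads only run in parallel on CPU when the underlying inference library releases the GIL during its native computation, so the actual speedup depends on the model backend.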

Scorer reliability: SCORER_TIMEOUT and SCORER_MODEL are now
configurable. Slow embedding models no longer hang the entire request.
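
One way to bound the scoring step is to run it in a worker thread and stop waiting after the timeout. The sketch below is an assumption about the general technique, not the project's code: score_with_timeout and the shared pool are hypothetical, and timeout_s stands in for whatever SCORER_TIMEOUT is set to.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Optional

_pool = ThreadPoolExecutor(max_workers=2)

def score_with_timeout(
    score: Callable[[str, str], float],
    original: str,
    compressed: str,
    timeout_s: float,
) -> Optional[float]:
    future = _pool.submit(score, original, compressed)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread cannot be killed; we only stop waiting for it.
        future.cancel()
        return None
```

Returning None lets the request finish with the compressed text and no score, at the cost of the slow scoring call continuing to occupy a worker until it ends.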

Fixed

  • CLI --json flag now silences human-readable metrics — output is valid JSON
  • Token counting uses tiktoken.encoding_for_model() — fixes ambiguous matches
    between model families (e.g. gpt-4o vs gpt-4)
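
The notes do not say how the old lookup worked, so the following only illustrates how this kind of ambiguity can arise and why exact or longest-prefix resolution avoids it. The table and both functions are hypothetical and are not tiktoken's internals; the two encoding names reflect the encodings tiktoken assigns to these model families.

```python
from typing import Optional

# Illustrative subset: gpt-4o models use o200k_base, gpt-4 models use cl100k_base.
PREFIX_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-4o": "o200k_base",
}

def naive_encoding(model: str) -> Optional[str]:
    # First substring hit wins, so "gpt-4" shadows "gpt-4o".
    for prefix, encoding in PREFIX_TO_ENCODING.items():
        if prefix in model:
            return encoding
    return None

def resolve_encoding(model: str) -> Optional[str]:
    # Longest matching prefix wins, so the more specific family is preferred.
    matches = [p for p in PREFIX_TO_ENCODING if model.startswith(p)]
    if not matches:
        return None
    return PREFIX_TO_ENCODING[max(matches, key=len)]
```

With this table, naive_encoding("gpt-4o-mini") picks the gpt-4 entry while resolve_encoding("gpt-4o-mini") picks the gpt-4o one. Delegating to tiktoken.encoding_for_model() hands this resolution to the library instead of reimplementing it.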

Upgrading from 0.1.x

No breaking changes. Copy the new keys from .llmzip.config.example into your
existing config if you want to use auth, rate limiting, or split mode.
Monolith mode (docker-compose up) works exactly as before.

Large document support via smart paragraph-based chunking

07 Jun 15:16

v0.1.9

  • Implemented a smart paragraph-aware chunking layer to overcome the 512-token architectural limitation of BERT-based models (like bert-base-multilingual used in LLMLingua-2).

    Previously, processing documents larger than the model's native context window would trigger transformer indexing warnings and could result in unstable behavior or silent data truncation. The new logic splits input text on double newlines (\n\n) into chunks that fit within a 400-token safety margin (configurable via CHUNK_SIZE).

    These segments are compressed independently using a thread-safe workflow and then reassembled. This allows llm-zip to compress large RAG contexts and multi-page documents of any length while preserving semantic coherence at the paragraph level and ensuring the compression model operates within its optimal efficiency range.

  • Integrated a comprehensive internal benchmark suite into the README using real-world technical and academic datasets.

    These results include tests on 100+ page academic PDFs (290k+ tokens) and complex technical manuals in Spanish. The data verifies that llm-zip maintains a preservation score above 0.89 and achieves compression ratios between 1.7x and 2.5x on high-density material, providing developers with empirical evidence of token savings before deployment.

  • Added CHUNK_SIZE configuration to the [compression] section of .llmzip.config, enabling fine-grained control over the segmentation process based on specific hardware capabilities and document structures.

  • Reformatted all documentation tables with manual pipe and cell alignment to ensure perfect readability in terminal-based pagers, plain-text editors, and the GitHub web interface.
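
The paragraph-based chunking described in the first item can be sketched as greedy packing followed by reassembly. This is a simplified, sequential illustration under stated assumptions: pack_paragraphs and compress_document are hypothetical names, count_tokens is a whitespace stand-in for the real tokenizer, and the thread-safe parallel step is omitted.

```python
from typing import Callable, List

def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace words.
    return len(text.split())

def pack_paragraphs(text: str, chunk_size: int = 400) -> List[str]:
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for para in text.split("\n\n"):
        if not para.strip():
            continue
        n = count_tokens(para)
        # Flush the running chunk when the next paragraph would overflow it.
        if current and current_tokens + n > chunk_size:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def compress_document(text: str, compress: Callable[[str], str], chunk_size: int = 400) -> str:
    # Compress each chunk independently, then restore paragraph boundaries.
    return "\n\n".join(compress(chunk) for chunk in pack_paragraphs(text, chunk_size))
```

A single paragraph longer than chunk_size still ends up as one oversized chunk in this sketch, which is the case the cascading chunker in v0.2.2 is described as addressing.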

Concurrency and reliability fixes

07 Jun 02:56

v0.1.8

  • Fixed ready endpoint returning incorrect status under multi-worker uvicorn deployments (disk-based marker replaces in-memory global)
  • Fixed race condition in price resolver: LiteLLM fetch now runs outside the lock, preventing duplicate concurrent HTTP requests
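
The readiness fix follows from how multi-worker deployments behave: each uvicorn worker is a separate process with its own globals, so a flag set in one worker is invisible to the others, whereas a file on disk is shared. A minimal sketch of a disk-based marker, with a hypothetical file name and helper functions:

```python
from pathlib import Path

MARKER_NAME = "models.ready"  # hypothetical file name

def marker_path(directory: str) -> Path:
    return Path(directory) / MARKER_NAME

def mark_ready(directory: str) -> None:
    path = marker_path(directory)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()

def clear_ready(directory: str) -> None:
    # Call on startup so a stale marker from a previous run is not trusted.
    marker_path(directory).unlink(missing_ok=True)

def is_ready(directory: str) -> bool:
    return marker_path(directory).exists()
```

The ready endpoint in any worker can then answer by calling is_ready(), regardless of which process loaded the models.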

v0.1.7: Pricing data corrections and internal consistency fixes

07 Jun 02:49

v0.1.7

  • Fixed incorrect fallback prices for GPT-5.5, GPT-5.4 family, Gemini 3.x, and DeepSeek V4 (verified 2026-06-06)
  • Added missing models to fallback: gpt-5.4, gpt-4.1-nano, claude-opus-4-6
  • Fixed redundant count_tokens call in /v1/compress route
  • Compression failure warnings now route through i18n instead of returning raw strings
  • Centralized FEATURED_MODELS into core/featured_models.py, removing duplicate definitions in savings_calculator.py and compress_cmd.py

v0.1.6 — Hotfix

06 Jun 23:31

  • Fixed: literal newline artifact in lingua_adapter.py. force_tokens=["\n"] contained a raw newline character instead of the escape sequence, causing a SyntaxError on startup
  • Fixed: loader.py was missing from typing import NoReturn, causing a NameError on startup

v0.1.5 — Concurrency & config fixes

06 Jun 22:48

  • Fixed: CompressRequest and BatchItem no longer hardcode gpt-4o-mini — model now falls back to config.default_model from .llmzip.config when not specified in the request
  • Fixed: added threading.Lock with double-checked locking to resolver.py — prevents simultaneous LiteLLM fetches under concurrent batch load
  • Fixed: _meta in fetcher and resolver now includes an explicit "source" field ("litellm" or "fallback") instead of inferring it from the note string
  • Fixed: _fail() in loader.py now correctly typed as NoReturn
  • Fixed: convert_bytes() in file_converter.py now closes the tempfile before passing it to MarkItDown, fixing PermissionError on Windows
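
Double-checked locking, as mentioned for resolver.py, can be sketched as follows. PriceCache and its fetch callable are hypothetical stand-ins, since the notes do not show the resolver's actual structure.

```python
import threading
from typing import Callable, Dict, Optional

class PriceCache:
    def __init__(self, fetch: Callable[[], Dict[str, float]]) -> None:
        self._fetch = fetch
        self._lock = threading.Lock()
        self._prices: Optional[Dict[str, float]] = None

    def get(self) -> Dict[str, float]:
        if self._prices is None:              # first check: cheap, no lock
            with self._lock:
                if self._prices is None:      # second check: another thread may have filled it
                    self._prices = self._fetch()
        return self._prices
```

Once the cache is filled, callers skip the lock entirely. In this form the fetch runs while the lock is held, so concurrent first callers wait for it to finish.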

v0.1.4 — Consolidated fixes

06 Jun 22:14

  • Fixed: threading.Lock and count_tokens import were missing from lingua_adapter.py — batch compression under concurrency would crash at runtime
  • Fixed: SemanticScorer now accepts and uses models_dir, ensuring the CLI and API both download the scorer model to the same volume
  • Fixed: compress_file.py had corrupted trailing code from a previous patch — cleaned up
  • Fixed: importlib.metadata import was missing from app.py despite the dynamic version call being present
  • Fixed: NamedTemporaryFile fix from v0.1.1 was not present in the built wheel — reapplied
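
These notes do not spell out the NamedTemporaryFile fix. The v0.1.5 entry for convert_bytes() describes closing the temporary file before passing it to MarkItDown, and a pattern along those lines might look like the sketch below. The signature and the convert callable are assumptions, with convert standing in for the real converter.

```python
import os
import tempfile
from typing import Callable

def convert_bytes(data: bytes, suffix: str, convert: Callable[[str], str]) -> str:
    # delete=False so the file survives close() and can be reopened by path.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(data)
        # Close before handing the path on: Windows refuses a second open
        # of a temporary file that is still held open.
        tmp.close()
        return convert(tmp.name)
    finally:
        tmp.close()           # harmless if already closed
        os.unlink(tmp.name)   # clean up manually because delete=False
```

On POSIX systems the file could be reopened while still open, which is why this class of bug tends to show up only on Windows.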

v0.1.3 — Critical bugfix

06 Jun 19:11

  • Fixed: POST /v1/compress was calling lingua.compress() without the target_model argument, causing a TypeError on every request — this was the main compression endpoint
  • Fixed: API version in Swagger UI was hardcoded as 0.1.0; now reads dynamically from package metadata
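
Reading the version from package metadata is usually done with importlib.metadata. A minimal sketch, assuming the distribution is published as "llm-zip" (the notes do not state the distribution name) and using a hypothetical package_version helper with a fallback for source checkouts:

```python
from importlib.metadata import PackageNotFoundError, version

def package_version(dist_name: str = "llm-zip", fallback: str = "0.0.0") -> str:
    # dist_name is an assumption; use the name the package is installed under.
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Not installed (for example, running from a source tree).
        return fallback
```

The returned string can then be passed to the API framework's version setting so the documentation UI always matches the installed release.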