Releases · finktech-dev/llm-zip

10 Jun 17:55

finktech-dev

v0.2.2

53028b5

v0.2.2: Cascading chunker, core optimizations, and stability fixes

Latest

v0.2.2

This release finalizes the stabilization phase, focusing on processing reliability and CI/CD robustness.

Changelog

fix(core): Implemented a 4-level cascading chunker (Paragraphs → Sentences → Lines → Sliding Window) to prevent silent BERT truncation on dense prose.
perf(core): Optimized token counting by grouping encodings and caching tiktoken objects, reducing latency in batch requests.
fix(types): Refactored type annotations across the API, CLI, and Pricing modules so they pass strict type checking in CI.
feat(ci): Added GitHub Actions workflows for automated linting, type checking, and PyPI releases.
feat(i18n): Added truncation warning keys in five languages: English (en), Spanish (es), Portuguese (pt), Chinese (zh), and Japanese (ja).
chore(ci): Initialized default environment configuration and installed inference dependencies for GitHub Actions runners.
fix(tests): Marked API integration tests with the integration marker to prevent out-of-memory (OOM) errors in memory-constrained environments (Win32).
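
The cascading idea from the first changelog entry can be sketched as follows. This is a minimal illustration, not the project's implementation: the whitespace-based count_tokens, the regexes, and the function names are all assumptions, and the last level here is a fixed-size window with no overlap.

```python
import re
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (whitespace words); a real setup would use the model tokenizer."""
    return len(text.split())


# Splitters ordered from coarsest to finest: paragraphs, sentences, lines.
SPLITTERS: List[Callable[[str], List[str]]] = [
    lambda t: [p for p in re.split(r"\n\s*\n", t) if p.strip()],
    lambda t: [s for s in re.split(r"(?<=[.!?])\s+", t) if s.strip()],
    lambda t: [ln for ln in t.splitlines() if ln.strip()],
]


def window(text: str, limit: int) -> List[str]:
    """Last resort: cut into consecutive windows of at most `limit` words."""
    words = text.split()
    return [" ".join(words[i:i + limit]) for i in range(0, len(words), limit)]


def cascade(text: str, limit: int, level: int = 0) -> List[str]:
    """Split `text` so every chunk fits `limit`, trying finer levels only when needed."""
    if count_tokens(text) <= limit:
        return [text]
    if level >= len(SPLITTERS):
        return window(text, limit)
    chunks: List[str] = []
    for piece in SPLITTERS[level](text):
        # A piece that is still too long falls through to the next, finer level.
        chunks.extend(cascade(piece, limit, level + 1))
    return chunks
```

Each level only runs on pieces that the previous level could not bring under the limit, so well-structured text is split at paragraph boundaries and only dense, unbroken prose reaches the window step.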

09 Jun 04:58

finktech-dev

v0.2.1

a064d97

llm-zip v0.2.1

Health probes, structured logging, system info endpoint, and dependency fixes.

What's New

Health probes

Added Kubernetes-compliant /health/live and /health/ready endpoints.

live confirms only that the HTTP server is running.
ready stays unavailable until the inference models are fully loaded into memory, which covers the typical 2–5 minute cold start.
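
A client or deploy script can wait out the cold start by polling the readiness probe. The sketch below is an assumption-laden example: the helper names are hypothetical, and it treats any non-2xx response or connection failure as "not ready", since the release notes do not state which status code the endpoint returns while models load.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True only when the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False


def wait_until_ready(probe: Callable[[], bool], deadline_s: float,
                     interval_s: float = 5.0) -> bool:
    """Call `probe` until it succeeds or `deadline_s` seconds have elapsed."""
    end = time.monotonic() + deadline_s
    while True:
        if probe():
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(interval_s)
```

For example, wait_until_ready(lambda: http_probe("http://localhost:8000/health/ready"), deadline_s=600) would allow ten minutes for model loading; the host and port are placeholders.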

Structured logging

Added rotating JSON file logging in logs/llmzip.log alongside colored console output.

Logs now include structured fields such as:

tokens_in
tokens_out
ratio
elapsed_ms

This makes ingestion by monitoring platforms such as Datadog and Loki significantly easier.
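
For readers who want a comparable setup in their own service, here is a minimal sketch built on the standard library. It is not llm-zip's logging code: the formatter class, logger name, rotation size, and the extra "ts", "level", and "message" keys are assumptions; only the four structured field names and the log path come from the notes above.

```python
import json
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path

# Structured fields named in the release notes.
FIELDS = ("tokens_in", "tokens_out", "ratio", "elapsed_ms")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for name in FIELDS:
            # Values passed via `extra=` become attributes on the record.
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload)


def build_logger(path: str = "logs/llmzip.log") -> logging.Logger:
    """Create a logger that writes rotating JSON lines to `path`."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(path, maxBytes=5_000_000, backupCount=3,
                                  encoding="utf-8")
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("llmzip.example")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as logger.info("compressed", extra={"tokens_in": 1200, "tokens_out": 600, "ratio": 2.0, "elapsed_ms": 340}) then produces one JSON line that log shippers can parse without custom patterns.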

Info endpoint

Added GET /v1/info.

Returns:

Current system configuration
Loaded models
Enabled features
Active hardware limits (e.g. max_tokens, max_file_size_mb)

File size limits

Enforced MAX_FILE_SIZE_MB (default: 50 MB) on the /v1/compress/file endpoint to prevent memory exhaustion when processing large documents.
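
One way to enforce such a cap without first buffering an oversized upload is to read in blocks and stop as soon as the limit is crossed. This is a sketch of the general technique, not the endpoint's actual code; the function and exception names are hypothetical, and 1 MB is taken as 1024 * 1024 bytes by assumption.

```python
import io
from typing import BinaryIO


class FileTooLargeError(ValueError):
    """Raised when an upload exceeds the configured size limit."""


def read_capped(stream: BinaryIO, max_mb: int = 50, block: int = 64 * 1024) -> bytes:
    """Read `stream` fully, aborting once more than `max_mb` megabytes were seen."""
    limit = max_mb * 1024 * 1024
    buf = bytearray()
    while True:
        chunk = stream.read(block)
        if not chunk:
            return bytes(buf)
        buf.extend(chunk)
        if len(buf) > limit:
            # Stop early so memory use stays close to the limit.
            raise FileTooLargeError(f"upload exceeds {max_mb} MB")
```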

CLI commands

Added llmzip version to quickly verify the installed package version.

Documentation

Added:

DOCKER.md with detailed guidance for monolith and split deployments, including Kubernetes examples.
KNOWN_LIMITATIONS.md documenting current architectural constraints and expected behavior.

Fixed

Docker dependencies

Resolved a ModuleNotFoundError affecting split-mode deployments by ensuring sentence-transformers is installed in the stateless API container when semantic scoring is enabled.

Dependency scope

Moved heavy machine learning dependencies (llmlingua, markitdown) into the optional [inference] dependency group in pyproject.toml.

API reliability

Fixed a NameError involving _get_warning that could trigger HTTP 500 responses during single-file and batch compression requests.

Upgrading from 0.2.0

No breaking changes.

The logs/ directory will be created automatically on startup.

If you want to override the default 50 MB upload limit, copy MAX_FILE_SIZE_MB from .llmzip.config.example into your existing configuration file.

08 Jun 01:51

finktech-dev

v0.2.0

b8d0a68

v0.2.0 — Split mode, estimate endpoint, auth & rate limiting

llm-zip v0.2.0

Split mode, estimate endpoint, API key auth, and rate limiting.

What's new

Split mode — run the API and the inference engine as separate containers.
DEPLOY_MODE=split in config (or via env var) makes llmzip-api stateless
and delegates compression and scoring to llmzip-models over HTTP.
Scale the API layer independently without duplicating the ~700 MB of model weights.
See docker-compose.split.yml and Dockerfile.api / Dockerfile.models.

Estimate endpoint — POST /v1/estimate returns token counts and savings
estimates without performing actual compression. Useful for agents deciding
whether compression is worth the CPU cost before committing.

API key auth — set API_KEY in [server] to require
Authorization: Bearer <key> on all endpoints. Health checks remain public.
Off by default — if no key is set, the API is unauthenticated.

Rate limiting — slowapi integration with configurable REQUESTS_PER_MINUTE
and REQUESTS_PER_DAY in .llmzip.config. Off by default.

Concurrency — removed the global lock around PromptCompressor inference.
Batch requests now compress items in true parallel on CPU.

Scorer reliability — SCORER_TIMEOUT and SCORER_MODEL are now
configurable. Slow embedding models no longer hang the entire request.
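
As an illustration of the estimate endpoint combined with API key auth, the sketch below builds a request with the standard library. The endpoint path and the Authorization: Bearer scheme come from the notes above; the JSON body shape (a single "text" field), the helper name, and the base URL are assumptions, so check the API schema before relying on them.

```python
import json
import urllib.request
from typing import Optional


def build_estimate_request(base_url: str, text: str,
                           api_key: Optional[str] = None) -> urllib.request.Request:
    """Prepare a POST /v1/estimate request, adding a Bearer token if a key is given."""
    body = json.dumps({"text": text}).encode("utf-8")  # payload shape is assumed
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/estimate",
        data=body,
        headers=headers,
        method="POST",
    )
```

Passing the result to urllib.request.urlopen sends it. With no API_KEY configured on the server, the api_key argument can simply be omitted.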

Fixed

CLI --json flag now silences human-readable metrics — output is valid JSON
Token counting uses tiktoken.encoding_for_model() — fixes ambiguous matches
between model families (e.g. gpt-4o vs gpt-4)

Upgrading from 0.1.x

No breaking changes. Copy the new keys from .llmzip.config.example into your
existing config if you want to use auth, rate limiting, or split mode.
Monolith mode (docker-compose up) works exactly as before.

07 Jun 15:16

finktech-dev

v0.1.9

dfd1ae9

Large document support via smart paragraph-based chunking

v0.1.9

Implemented a smart paragraph-aware chunking layer to overcome the 512-token architectural limitation of BERT-based models (like bert-base-multilingual used in LLMLingua-2).

Previously, processing documents larger than the model's native context window would trigger transformer indexing warnings and could result in unstable behavior or silent data truncation. The new logic segments input text on double newlines (\n\n) into chunks that fit within a 400-token safety margin (configurable via CHUNK_SIZE).

These segments are compressed independently using a thread-safe workflow and then reassembled. This allows llm-zip to compress large RAG contexts and multi-page documents of any length while preserving semantic coherence at the paragraph level and keeping the compression model within its optimal efficiency range.

Integrated a comprehensive internal benchmark suite into the README using real-world technical and academic datasets.

These results include tests on 100+ page academic PDFs (290k+ tokens) and complex technical manuals in Spanish. The data shows that llm-zip maintains a preservation score above 0.89 and achieves compression ratios between 1.7x and 2.5x on high-density material, giving developers empirical evidence of token savings before deployment.

Added CHUNK_SIZE configuration to the [compression] section of .llmzip.config, enabling fine-grained control over segmentation based on hardware capabilities and document structure.

Reformatted all documentation tables with manual pipe and cell alignment so they stay readable in terminal-based pagers, plain-text editors, and the GitHub web interface.
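
The split, compress, and reassemble flow described for this release might look roughly like the sketch below. It is an illustration under assumptions: paragraphs are packed greedily into chunks (the notes do not say how chunks are filled), count_tokens is a whitespace stand-in, and compress is any callable supplied by the caller rather than the real compressor.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

CHUNK_SIZE = 400  # token budget per chunk, matching the default mentioned above


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (whitespace words)."""
    return len(text.split())


def pack_paragraphs(text: str, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Greedily group paragraphs so each chunk stays within `chunk_size` tokens."""
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for para in text.split("\n\n"):
        if not para.strip():
            continue
        n = count_tokens(para)
        if current and used + n > chunk_size:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def compress_document(text: str, compress: Callable[[str], str],
                      chunk_size: int = CHUNK_SIZE, workers: int = 4) -> str:
    """Compress each chunk independently and reassemble in the original order."""
    chunks = pack_paragraphs(text, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        compressed = list(pool.map(compress, chunks))  # map preserves input order
    return "\n\n".join(compressed)
```

Note that a single paragraph longer than chunk_size still comes out as one oversized chunk in this sketch, which is the dense-prose case the cascading chunker in v0.2.2 addresses.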

07 Jun 02:56

finktech-dev

v0.1.8

a217735

Concurrency and reliability fixes

v0.1.8

Fixed ready endpoint returning incorrect status under multi-worker uvicorn deployments (disk-based marker replaces in-memory global)
Fixed race condition in price resolver: LiteLLM fetch now runs outside the lock, preventing duplicate concurrent HTTP requests
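
The readiness fix rests on a general point: each worker process has its own globals, so a flag set in one worker's memory is invisible to the others, while a file on disk is shared. A minimal sketch of a disk-based marker follows; the marker path and function names are hypothetical and not taken from the project.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical marker location; the real path is not given in the release notes.
MARKER = Path(tempfile.gettempdir()) / "llmzip.ready"


def mark_ready(marker: Path = MARKER) -> None:
    """Create the marker atomically so readers never see a half-written file."""
    tmp = marker.with_suffix(".tmp")
    tmp.write_text(str(os.getpid()))
    os.replace(tmp, marker)


def clear_ready(marker: Path = MARKER) -> None:
    """Remove the marker, for example at startup before models are loaded."""
    marker.unlink(missing_ok=True)


def is_ready(marker: Path = MARKER) -> bool:
    """Any worker process can answer the readiness probe by checking the file."""
    return marker.exists()
```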

07 Jun 02:49

finktech-dev

v0.1.7

8fe2928

v0.1.7: Pricing data corrections and internal consistency fixes

v0.1.7

Fixed incorrect fallback prices for GPT-5.5, GPT-5.4 family, Gemini 3.x, and DeepSeek V4 (verified 2026-06-06)
Added missing models to fallback: gpt-5.4, gpt-4.1-nano, claude-opus-4-6
Fixed redundant count_tokens call in /v1/compress route
Compression failure warnings now route through i18n instead of returning raw strings
Centralized FEATURED_MODELS into core/featured_models.py, removing duplicate definitions in savings_calculator.py and compress_cmd.py

06 Jun 23:31

finktech-dev

v0.1.6

74b7d6c

v0.1.6 — Hotfix

Fixed: literal newline artifact in lingua_adapter.py — force_tokens=["\n"] had a raw newline character instead of the escape sequence, causing SyntaxError on startup
Fixed: loader.py was missing the line from typing import NoReturn, causing NameError on startup
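
The first fix is easy to reproduce in isolation. The snippet below compiles two one-line sources, one with the escape sequence and one with a raw newline inside the string literal; the variable and helper names are only for illustration.

```python
# The escape sequence stays on one source line and compiles.
good_source = 'force_tokens = ["\\n"]'
# A raw newline inside the quotes leaves the string literal unterminated.
bad_source = 'force_tokens = ["\n"]'


def compiles(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False
```

Because the error is raised when the module is parsed, it surfaces at import time, which matches the crash on startup described above.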

06 Jun 22:48

finktech-dev

v0.1.5

dbe8325

v0.1.5 — Concurrency & config fixes

Fixed: CompressRequest and BatchItem no longer hardcode gpt-4o-mini — model now falls back to config.default_model from .llmzip.config when not specified in the request

Fixed: added threading.Lock with double-checked locking to resolver.py — prevents simultaneous LiteLLM fetches under concurrent batch load

Fixed: _meta in fetcher and resolver now includes an explicit "source" field ("litellm" or "fallback") instead of inferring it from the note string

Fixed: _fail() in loader.py now correctly typed as NoReturn

Fixed: convert_bytes() in file_converter.py now closes the tempfile before passing it to MarkItDown, fixing PermissionError on Windows
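
The double-checked locking mentioned for resolver.py follows a well-known pattern, sketched here in generic form. The class name and the injected fetch callable are hypothetical stand-ins; the real resolver and its LiteLLM fetch are not shown in these notes.

```python
import threading
from typing import Callable, Dict, Optional


class PriceCache:
    """Load a price table once, even when many threads ask at the same time."""

    def __init__(self, fetch: Callable[[], Dict[str, float]]) -> None:
        self._fetch = fetch
        self._lock = threading.Lock()
        self._prices: Optional[Dict[str, float]] = None

    def get(self) -> Dict[str, float]:
        prices = self._prices
        if prices is not None:
            # First check, without the lock: the common fast path.
            return prices
        with self._lock:
            if self._prices is None:
                # Second check, under the lock: only the first thread fetches.
                self._prices = self._fetch()
            return self._prices
```

Threads that arrive while the first fetch is in progress block on the lock, then see the populated table on the second check and skip their own fetch.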

06 Jun 22:14

finktech-dev

v0.1.4

2cad520

v0.1.4 — Consolidated fixes

Fixed: threading.Lock and count_tokens import were missing from lingua_adapter.py — batch compression under concurrency would crash at runtime
Fixed: SemanticScorer now accepts and uses models_dir, ensuring the CLI and API both download the scorer model to the same volume
Fixed: compress_file.py had corrupted trailing code from a previous patch — cleaned up
Fixed: importlib.metadata import was missing from app.py despite the dynamic version call being present
Fixed: NamedTemporaryFile fix from v0.1.1 was not present in the built wheel — reapplied
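
The NamedTemporaryFile fix, described in the v0.1.5 notes as closing the temp file before handing it to the converter, corresponds to a common cross-platform pattern. On Windows, a NamedTemporaryFile that is still open cannot be opened a second time by name. The sketch below is hypothetical: the function name and the convert_path callable stand in for the real convert_bytes() and MarkItDown call.

```python
import os
import tempfile
from typing import Callable


def convert_bytes_sketch(data: bytes, suffix: str,
                         convert_path: Callable[[str], str]) -> str:
    """Write `data` to a temp file, close it, then let `convert_path` reopen it."""
    # delete=False keeps the file on disk after close(); we remove it ourselves.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(data)
        tmp.close()  # release the handle before another reader opens the path
        return convert_path(tmp.name)
    finally:
        tmp.close()  # harmless if already closed
        os.unlink(tmp.name)
```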

06 Jun 19:11

finktech-dev

v0.1.3

1f1cd3e

v0.1.3 — Critical bugfix

Fixed: POST /v1/compress was calling lingua.compress() without the target_model argument, causing a TypeError on every request — this was the main compression endpoint
Fixed: API version in Swagger UI was hardcoded as 0.1.0; now reads dynamically from package metadata
