Large document support via smart paragraph-based chunking
v0.1.9
- Implemented a paragraph-aware chunking layer to work around the 512-token architectural limit of BERT-based models (such as `bert-base-multilingual`, used in LLMLingua-2). Previously, processing documents larger than the model's native context window would trigger transformer indexing warnings and could result in unstable behavior or silent data truncation. The new logic splits input text on double newlines (`\n\n`) into chunks of at most 400 tokens (configurable via `CHUNK_SIZE`), which leaves a safety margin below the 512-token limit. These chunks are compressed independently using a thread-safe workflow and then reassembled. This allows `llm-zip` to compress large RAG contexts and multi-page documents of any length while preserving semantic coherence at the paragraph level and keeping the compression model within its optimal efficiency range.
- Integrated a comprehensive internal benchmark suite into the README using real-world technical and academic datasets. The results include tests on 100+ page academic PDFs (290k+ tokens) and complex technical manuals in Spanish. The data shows that `llm-zip` maintains a preservation score above 0.89 and achieves compression ratios between 1.7x and 2.5x on high-density material, giving developers empirical evidence of token savings before deployment.
- Added a `CHUNK_SIZE` setting to the `[compression]` section of `.llmzip.config`, enabling fine-grained control over segmentation based on hardware capabilities and document structure.
- Reformatted all documentation tables with manual pipe and cell alignment so they stay readable in terminal-based pagers, plain-text editors, and the GitHub web interface.
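
The chunking flow described in the first item can be illustrated with a minimal sketch. It is not the project's actual implementation: `count_tokens` is a hypothetical whitespace-based stand-in for the model tokenizer, `compress` is any caller-supplied function, and the hard split of oversized paragraphs is an assumption about how such a case might be handled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

CHUNK_SIZE = 400  # per-chunk token budget, below the 512-token model limit


def count_tokens(text: str) -> int:
    """Hypothetical stand-in: counts whitespace-separated words, not real tokens."""
    return len(text.split())


def chunk_paragraphs(text: str, limit: int = CHUNK_SIZE) -> List[str]:
    """Greedily pack paragraphs (split on blank lines) into chunks of <= limit tokens."""
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        size = count_tokens(para)
        if size > limit:
            # Assumption: a single oversized paragraph is hard-split by words.
            if current:
                chunks.append("\n\n".join(current))
                current, used = [], 0
            words = para.split()
            for i in range(0, len(words), limit):
                chunks.append(" ".join(words[i:i + limit]))
            continue
        if current and used + size > limit:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def compress_document(
    text: str,
    compress: Callable[[str], str],
    limit: int = CHUNK_SIZE,
    workers: int = 4,
) -> str:
    """Compress each chunk independently, then reassemble in the original order."""
    chunks = chunk_paragraphs(text, limit)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, so reassembly is a join.
        compressed = list(pool.map(compress, chunks))
    return "\n\n".join(compressed)
```

Because paragraphs are never split unless one exceeds the budget on its own, each chunk handed to the compressor starts and ends on a paragraph boundary.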
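
As a rough illustration of the `CHUNK_SIZE` setting, the sketch below reads it from an INI-style `[compression]` section. The file syntax, the fallback of 400, and the upper-bound check against 512 are assumptions made for this example; `read_chunk_size` is a hypothetical helper, not part of `llm-zip`.

```python
import configparser

DEFAULT_CHUNK_SIZE = 400  # assumed default, matching the value in these notes
MODEL_LIMIT = 512         # token limit of BERT-based models


def read_chunk_size(config_text: str) -> int:
    """Return CHUNK_SIZE from an INI-style [compression] section (assumed format)."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key case so CHUNK_SIZE is not lowercased
    parser.read_string(config_text)
    size = parser.getint("compression", "CHUNK_SIZE", fallback=DEFAULT_CHUNK_SIZE)
    # Hypothetical sanity check: a chunk must fit inside the model's window.
    if not 0 < size <= MODEL_LIMIT:
        raise ValueError(f"CHUNK_SIZE must be between 1 and {MODEL_LIMIT}, got {size}")
    return size


example = """
[compression]
CHUNK_SIZE = 300
"""
print(read_chunk_size(example))
```

A smaller value trades some compression context for lower per-chunk memory use, which is the kind of hardware-specific tuning the setting is meant to allow.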