Large document support via smart paragraph-based chunking

@finktech-dev finktech-dev released this 07 Jun 15:16
· 10 commits to main since this release

v0.1.9

  • Implemented a smart paragraph-aware chunking layer to overcome the 512-token architectural limitation of BERT-based models (like bert-base-multilingual used in LLMLingua-2).

    Previously, processing documents larger than the model's native context window would trigger transformer indexing warnings and could result in unstable behavior or silent data truncation. The new logic splits input text on double newlines (\n\n) and groups the resulting paragraphs into chunks of at most 400 tokens by default (configurable via CHUNK_SIZE), which leaves a safety margin below the 512-token limit.

    These chunks are compressed independently using a thread-safe workflow and then reassembled in their original order. This allows llm-zip to compress large RAG contexts and multi-page documents of any length while preserving semantic coherence at the paragraph level and keeping every model input within the supported context window.

  • Integrated a comprehensive internal benchmark suite into the README using real-world technical and academic datasets.

    The results include tests on 100+ page academic PDFs (290k+ tokens) and complex technical manuals in Spanish. They show that llm-zip maintains a preservation score above 0.89 and achieves compression ratios between 1.7x and 2.5x on high-density material, giving developers empirical evidence of token savings before deployment.

  • Added CHUNK_SIZE configuration to the [compression] section of .llmzip.config, enabling fine-grained control over the segmentation process based on specific hardware capabilities and document structures.

  • Reformatted all documentation tables with manually aligned pipes and cells so they stay readable in terminal-based pagers, plain-text editors, and the GitHub web interface.
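
The paragraph-based chunking and the CHUNK_SIZE setting described above can be illustrated with a minimal sketch. This is not llm-zip's actual code: the function names (`read_chunk_size`, `chunk_paragraphs`, `compress_document`), the whitespace token counter, and the INI-style reading of `.llmzip.config` are all assumptions made for illustration. A real run would count tokens with the model's tokenizer and pass the actual compression call as `compress`.

```python
import configparser
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

DEFAULT_CHUNK_SIZE = 400  # tokens; the default stated in the release notes


def count_tokens(text: str) -> int:
    """Hypothetical stand-in: counts whitespace-separated words, not BERT tokens."""
    return len(text.split())


def read_chunk_size(config_text: str) -> int:
    """Read CHUNK_SIZE from a [compression] section, assuming INI-style syntax."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Falls back to the default when the section or the key is missing.
    return parser.getint("compression", "CHUNK_SIZE", fallback=DEFAULT_CHUNK_SIZE)


def chunk_paragraphs(
    text: str,
    chunk_size: int = DEFAULT_CHUNK_SIZE,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Greedily pack paragraphs (split on blank lines) into chunks of at most chunk_size tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for para in text.split("\n\n"):
        if not para.strip():
            continue  # ignore empty paragraphs produced by extra blank lines
        n = counter(para)
        # Close the current chunk if this paragraph would push it past the limit.
        if current and current_len + n > chunk_size:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        # Note: a single paragraph longer than chunk_size is kept whole in this sketch.
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def compress_document(
    text: str,
    compress: Callable[[str], str],
    chunk_size: int = DEFAULT_CHUNK_SIZE,
    max_workers: int = 4,
) -> str:
    """Compress each chunk in a thread pool and rejoin the results in input order."""
    chunks = chunk_paragraphs(text, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        compressed = list(pool.map(compress, chunks))  # map preserves chunk order
    return "\n\n".join(compressed)
```

For example, `compress_document(document_text, my_compressor, chunk_size=read_chunk_size(config_text))` would apply a user-supplied `my_compressor` to each chunk. Lowering the chunk size trades larger batches for a wider margin below the 512-token limit, which may matter when the real tokenizer produces more tokens per word than a simple word count suggests.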