A comprehensive exploration and evaluation of document chunking strategies for Retrieval-Augmented Generation (RAG) systems. This project implements and compares various chunking methods to help developers choose the best approach for their document processing pipelines.
This project implements the following chunking methods:
- Simple and fast implementation
- Predictable chunk sizes
- Best for homogeneous data like log files
- Preserves context at chunk boundaries
- Reduces retrieval failures
- Ideal for long-form documents
- Preserves complete meaningful units
- Excellent for factual Q&A systems
- Groups 3-5 sentences for optimal context
- Natural, human-readable boundaries
- Good balance between size and context
- Works well for structured documents
- Uses document hierarchy (headings, sections)
- Excellent for technical documentation
- Leverages Docling's HybridChunker
- Most semantically coherent chunks
- Best retrieval quality for complex documents
- Uses sentence embeddings for topic detection
- Advanced semantic chunking with multiple scales
- Balances local and global context
- Tunable parameters for different use cases
- Respects natural language structure
- Widely used in production RAG systems
- Default strategy for most applications
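
As a rough illustration of the simpler strategies listed above, the sketch below shows fixed-size chunking, the same splitter with overlap at chunk boundaries, and sentence grouping. It is not the project's implementation: the function names `fixed_size_chunks` and `sentence_group_chunks`, the character-based sizing, and the regex sentence splitter are all assumptions made for this example.

```python
import re


def fixed_size_chunks(text, size, overlap=0):
    """Split text into windows of `size` characters.

    With overlap > 0, each chunk repeats the last `overlap` characters of
    the previous chunk, so context at the boundary is preserved.
    Sizes are in characters here (an assumption; a real pipeline may
    count tokens instead).
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once a window has reached the end of the text.
        if start + size >= len(text):
            break
    return chunks


def sentence_group_chunks(text, group_size=3):
    """Group consecutive sentences into chunks of `group_size` sentences.

    Sentences are split naively on ., ! or ? followed by whitespace
    (an assumption; abbreviations such as "e.g." will be split too).
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + group_size])
        for i in range(0, len(sentences), group_size)
    ]
```

Setting `overlap=0` gives plain fixed-size chunks. A `group_size` between 3 and 5 matches the sentence grouping described in the list.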
The project includes a comprehensive evaluation system with three levels:
- Chunk count and size statistics
- Token distribution analysis
- Empty chunk detection
- Measures topical tightness within chunks
- Uses sentence embeddings for similarity scoring
- Proxy for chunk quality
- Q&A pair evaluation using document content
- Hit rate calculation with keyword matching
- Similarity-based retrieval assessment
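
To make two of the evaluation levels above concrete, here is a minimal sketch of chunk statistics with empty chunk detection, and of a hit rate computed with keyword matching. It is an illustration under stated assumptions rather than the project's evaluation code: `chunk_stats` and `keyword_hit_rate` are hypothetical names, tokens are approximated by whitespace-separated words, and retrieval is simulated by ranking chunks on word overlap with the question.

```python
import re
from statistics import mean


def chunk_stats(chunks):
    """Return count, empty-chunk count, and token size statistics.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    sizes = [len(c.split()) for c in chunks]
    return {
        "count": len(chunks),
        "empty": sum(1 for c in chunks if not c.strip()),
        "min_tokens": min(sizes, default=0),
        "max_tokens": max(sizes, default=0),
        "mean_tokens": mean(sizes) if sizes else 0.0,
    }


def _words(text):
    """Lowercased set of alphanumeric words in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_hit_rate(chunks, qa_pairs, top_k=1):
    """Fraction of questions whose top_k retrieved chunks contain a keyword.

    qa_pairs is a list of (question, expected_keywords) tuples.
    Retrieval is a stand-in: chunks are ranked by how many words they
    share with the question (an assumption, not an embedding search).
    """
    if not qa_pairs:
        return 0.0

    chunk_words = [_words(c) for c in chunks]
    hits = 0
    for question, keywords in qa_pairs:
        q_words = _words(question)
        ranked = sorted(
            range(len(chunks)),
            key=lambda i: len(q_words & chunk_words[i]),
            reverse=True,
        )
        retrieved = [chunks[i].lower() for i in ranked[:top_k]]
        # A hit means any expected keyword appears in any retrieved chunk.
        if any(kw.lower() in chunk for chunk in retrieved for kw in keywords):
            hits += 1
    return hits / len(qa_pairs)
```

Running both functions on the output of each chunking method gives a quick way to compare methods on the same document before moving to embedding-based scoring.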
- Clone the repository:

      git clone <repository-url>
      cd ChunkMethods

- Create a virtual environment with uv:

      uv venv
      source .venv/bin/activate  # On Windows: .venv\Scripts\activate

- Install dependencies with uv:

      uv sync