ChunkMethods

A comprehensive exploration and evaluation of document chunking strategies for Retrieval-Augmented Generation (RAG) systems. This project implements and compares various chunking methods to help developers choose the best approach for their document processing pipelines.

Features

This project implements the following chunking methods:

1. Fixed Size Chunking

Simple and fast implementation
Predictable chunk sizes
Best for homogeneous data like log files
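
As a rough illustration of the idea (not the notebook's actual code), fixed size chunking can be sketched in a few lines. The function name `fixed_size_chunks` and the character-based `chunk_size` parameter are assumptions made for this example; a token-based size would work the same way.

```python
def fixed_size_chunks(text: str, chunk_size: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the text in strides of chunk_size; the last chunk may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


if __name__ == "__main__":
    for chunk in fixed_size_chunks("abcdefghij", chunk_size=4):
        print(repr(chunk))
```

Every chunk except possibly the last has exactly `chunk_size` characters, which is what makes the sizes predictable.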

2. Overlapping Chunking

Preserves context at chunk boundaries
Reduces retrieval failures
Ideal for long-form documents
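
A minimal sketch of overlapping chunking, assuming character-based windows. The names `overlapping_chunks`, `chunk_size`, and `overlap` are hypothetical and chosen only for this example.

```python
def overlapping_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Slide a window of chunk_size characters, repeating `overlap` characters each step."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    print(overlapping_chunks("abcdefghij", chunk_size=4, overlap=2))
```

Because consecutive windows share `overlap` characters, a sentence that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.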

3. Sentence-Based Chunking

Preserves complete meaningful units
Excellent for factual Q&A systems
Groups 3-5 sentences per chunk to keep enough surrounding context
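
The grouping step could look roughly like the sketch below. It assumes a naive regex sentence splitter (which will mis-split abbreviations such as "e.g."), and the names `sentence_chunks` and `sentences_per_chunk` are hypothetical.

```python
import re


def sentence_chunks(text: str, sentences_per_chunk: int = 4) -> list[str]:
    """Group consecutive sentences into chunks of sentences_per_chunk sentences."""
    if sentences_per_chunk <= 0:
        raise ValueError("sentences_per_chunk must be positive")
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


if __name__ == "__main__":
    sample = "A one. B two! C three? D four. E five."
    print(sentence_chunks(sample, sentences_per_chunk=2))
```

A production pipeline would likely swap the regex for a proper sentence tokenizer; the grouping logic stays the same.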

4. Paragraph-Based Chunking

Natural, human-readable boundaries
Good balance between size and context
Works well for structured documents
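
One way to sketch paragraph-based chunking, assuming paragraphs are separated by blank lines. The function name `paragraph_chunks` and the `max_chars` budget are assumptions for this example; short paragraphs are merged until the budget would be exceeded.

```python
import re


def paragraph_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Split on blank lines, then merge adjacent paragraphs up to max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Either the merged text still fits, or this is the first paragraph
            # of a new chunk (kept whole even if it exceeds the budget).
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(paragraph_chunks("aaaa\n\nbbbb\n\ncccc", max_chars=10))
```

Note that a single paragraph longer than `max_chars` is kept intact here, which trades size predictability for readable boundaries.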

5. Document Structure-Based Chunking

Uses document hierarchy (headings, sections)
Excellent for technical documentation
Leverages Docling's HybridChunker
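
To illustrate the general idea of hierarchy-aware chunking without depending on Docling, here is a standard-library sketch that splits Markdown text at headings and attaches the heading path to each chunk. This is a simplified stand-in, not Docling's HybridChunker, and the function name `heading_chunks` and its output shape are assumptions.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def heading_chunks(markdown: str) -> list[dict]:
    """Return one chunk per section body, tagged with its heading path."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # (level, title) of the enclosing headings
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headings": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section before changing the path
            level = len(match.group(1))
            # Drop headings at the same or deeper level than the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    doc = "# Guide\nIntro.\n## Setup\nInstall it.\n## Usage\nRun it."
    for chunk in heading_chunks(doc):
        print(chunk)
```

Keeping the heading path alongside the text lets a retriever show or embed the section context together with the chunk body.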

6. Semantic Chunking

Most semantically coherent chunks
Best retrieval quality for complex documents
Uses sentence embeddings for topic detection
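
The topic-detection step can be sketched as follows: embed each sentence, compare neighbours, and start a new chunk where similarity drops. To stay runnable without extra dependencies, the example uses a toy bag-of-words vector (`bow_embed`) as a stand-in for real sentence embeddings; that function, `semantic_chunks`, and the `threshold` value are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bow_embed(sentence: str) -> Counter:
    """Toy embedding: word counts. Replace with a real sentence-embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], embed=bow_embed, threshold: float = 0.2) -> list[str]:
    """Start a new chunk whenever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) < threshold:
            groups.append([sentence])      # similarity dropped: topic boundary
        else:
            groups[-1].append(sentence)    # same topic: extend the current chunk
    return [" ".join(group) for group in groups]


if __name__ == "__main__":
    print(semantic_chunks([
        "Cats purr loudly.", "Cats sleep often.",
        "Stocks fell sharply.", "Stocks rose later.",
    ]))
```

With a real embedding model, `embed` would return dense vectors and `cosine` would be adapted accordingly; the boundary logic is unchanged.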

7. Multiscale Semantic Chunking

Advanced semantic chunking with multiple scales
Balances local and global context
Tunable parameters for different use cases
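
This README does not spell out the multiscale algorithm, so the following is only one possible reading: score each sentence gap by comparing windows of several sizes on either side, then average the scores so that both local (small window) and broader (large window) context influence the boundary decision. The word-count embedding, the function names, and the `scales` and `threshold` parameters are all assumptions for this sketch.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy word-count embedding standing in for a sentence-embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def boundary_scores(sentences: list[str], scales=(1, 2)) -> list[float]:
    """For each gap between sentences, average left/right similarity over window sizes."""
    scores = []
    for gap in range(1, len(sentences)):
        per_scale = []
        for width in scales:
            left = " ".join(sentences[max(0, gap - width):gap])
            right = " ".join(sentences[gap:gap + width])
            per_scale.append(cosine(embed(left), embed(right)))
        scores.append(sum(per_scale) / len(per_scale))
    return scores


def multiscale_chunks(sentences: list[str], scales=(1, 2), threshold: float = 0.2) -> list[str]:
    """Split where the scale-averaged similarity across a gap falls below threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for sentence, score in zip(sentences[1:], boundary_scores(sentences, scales)):
        if score < threshold:
            groups.append([sentence])
        else:
            groups[-1].append(sentence)
    return [" ".join(group) for group in groups]
```

In this reading, `scales` and `threshold` are the tunable parameters: larger windows smooth out short digressions, while smaller ones react to local topic shifts.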

8. Recursive Chunking

Respects natural language structure
Widely used in production RAG systems
A sensible default strategy for most applications
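
A simplified sketch of the recursive idea: try the coarsest separator first (paragraph breaks), and only fall back to finer ones (lines, sentences, words, then a hard cut) for pieces that are still too long. This is not the implementation of any particular library; `recursive_chunks`, `max_chars`, and the separator list are assumptions for this example.

```python
def recursive_chunks(text: str, max_chars: int = 200,
                     separators: tuple = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split text on progressively finer separators until every chunk fits max_chars."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_chunks(text, max_chars, finer)
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too long: split it with the finer separators.
            chunks.extend(recursive_chunks(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


if __name__ == "__main__":
    print(recursive_chunks("aaa bbb\n\nccc ddd eee", max_chars=8))
```

The effect is that paragraph boundaries are preserved where possible, and word-level or character-level cuts happen only inside passages that leave no better option.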

Evaluation Framework

The project includes a comprehensive evaluation system with three levels:

Level 1: Intrinsic Metrics

Chunk count and size statistics
Token distribution analysis
Empty chunk detection
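
The Level 1 statistics can be computed with the standard library alone. In this sketch, token counts are approximated by whitespace splitting, which is an assumption (a real tokenizer would give different numbers); the function name `intrinsic_metrics` and the returned keys are hypothetical.

```python
import statistics


def intrinsic_metrics(chunks: list[str]) -> dict:
    """Summarize chunk count, character sizes, approximate token counts, and empties."""
    sizes = [len(c) for c in chunks]
    tokens = [len(c.split()) for c in chunks]  # whitespace tokens, an approximation
    return {
        "chunk_count": len(chunks),
        "empty_chunks": sum(1 for c in chunks if not c.strip()),
        "min_chars": min(sizes, default=0),
        "max_chars": max(sizes, default=0),
        "mean_chars": statistics.mean(sizes) if sizes else 0,
        "mean_tokens": statistics.mean(tokens) if tokens else 0,
        "stdev_tokens": statistics.pstdev(tokens) if tokens else 0,
    }


if __name__ == "__main__":
    print(intrinsic_metrics(["one two", "", "three four five"]))
```

A high token standard deviation or any empty chunks usually points to a splitter that needs tuning before the more expensive evaluation levels are run.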

Level 2: Semantic Coherence

Measures topical tightness within chunks
Uses sentence embeddings for similarity scoring
Proxy for chunk quality
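
One plausible way to measure topical tightness is the mean pairwise cosine similarity of the sentences inside a chunk. The sketch below again substitutes a toy word-count vector for real sentence embeddings; `bow_embed`, `chunk_coherence`, and the convention that a single-sentence chunk scores 1.0 are assumptions.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bow_embed(sentence: str) -> Counter:
    """Toy embedding: word counts. Replace with a real sentence-embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_coherence(sentences: list[str], embed=bow_embed) -> float:
    """Mean pairwise similarity of a chunk's sentences; higher means tighter topic."""
    vectors = [embed(s) for s in sentences]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # convention: zero or one sentence is trivially coherent
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    print(chunk_coherence(["cats purr", "cats sleep"]))
    print(chunk_coherence(["cats purr", "stocks fell"]))
```

Averaging this score over all chunks gives a single number per chunking method, which can then be compared across methods on the same document.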

Level 3: Retrieval Quality

Q&A pair evaluation using document content
Hit rate calculation with keyword matching
Similarity-based retrieval assessment
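
A hit-rate calculation along these lines might look like the sketch below: retrieve the top-k chunks for each question by similarity, and count a hit when the expected keyword appears in any retrieved chunk. Jaccard word overlap stands in for embedding similarity here, and the names `retrieve`, `hit_rate`, and the `(question, keyword)` pair format are assumptions.

```python
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Rank chunks by word overlap with the question and return the top k."""
    q = tokens(question)
    ranked = sorted(chunks, key=lambda c: jaccard(q, tokens(c)), reverse=True)
    return ranked[:k]


def hit_rate(qa_pairs: list[tuple[str, str]], chunks: list[str], k: int = 1) -> float:
    """Fraction of questions whose expected keyword appears in a retrieved chunk."""
    if not qa_pairs:
        return 0.0
    hits = sum(
        any(keyword.lower() in chunk.lower() for chunk in retrieve(question, chunks, k))
        for question, keyword in qa_pairs
    )
    return hits / len(qa_pairs)


if __name__ == "__main__":
    chunks = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
    pairs = [("What is the capital of France?", "Paris"),
             ("What is the capital of Germany?", "Berlin")]
    print(hit_rate(pairs, chunks, k=1))
```

Running the same Q&A pairs against the output of each chunking method makes the hit rates directly comparable.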

Installation

Clone the repository:

git clone <repository-url>
cd ChunkMethods

Create a virtual environment with uv:

uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install dependencies with uv:

uv sync
