ajanujaj/Chunking_Methods

ChunkMethods

A comprehensive exploration and evaluation of document chunking strategies for Retrieval-Augmented Generation (RAG) systems. This project implements and compares various chunking methods to help developers choose the best approach for their document processing pipelines.

Features

This project implements the following chunking methods:

1. Fixed Size Chunking

  • Simple and fast implementation
  • Predictable chunk sizes
  • Best for homogeneous data like log files
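
As a minimal sketch of the idea (not necessarily the project's own implementation), fixed size chunking can be written as a single slice loop. The function name `fixed_size_chunks` and the choice to measure size in characters rather than tokens are assumptions made for illustration.

```python
def fixed_size_chunks(text: str, chunk_size: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the text in strides of chunk_size; the last chunk may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


if __name__ == "__main__":
    for chunk in fixed_size_chunks("2024-01-01 INFO start\n2024-01-01 INFO done\n", 22):
        print(repr(chunk))
```

Because boundaries ignore content, a chunk can end mid-word or mid-sentence, which is why this approach suits uniform data better than prose.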

2. Overlapping Chunking

  • Preserves context at chunk boundaries
  • Reduces retrieval failures caused by relevant text being split across a chunk boundary
  • Ideal for long-form documents
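
The sliding-window idea behind overlapping chunking can be sketched as follows. The name `overlapping_chunks`, the character-based sizes, and the default values are illustrative assumptions, not taken from the project.

```python
def overlapping_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

A larger overlap repeats more text across chunks, which costs storage and embedding time in exchange for more context at each boundary.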

3. Sentence-Based Chunking

  • Preserves complete meaningful units
  • Excellent for factual Q&A systems
  • Groups 3-5 sentences per chunk so that each chunk carries enough surrounding context
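
A rough sketch of sentence grouping, assuming a simple regular expression is good enough to find sentence boundaries. The function name `sentence_chunks` and the default of 4 sentences per chunk are assumptions; a production pipeline would more likely use a proper sentence tokenizer, since this regex misfires on abbreviations such as "e.g.".

```python
import re


def sentence_chunks(text: str, sentences_per_chunk: int = 4) -> list[str]:
    """Group consecutive sentences into chunks of sentences_per_chunk sentences."""
    if sentences_per_chunk <= 0:
        raise ValueError("sentences_per_chunk must be positive")
    # Naive boundary rule: split on whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
```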

4. Paragraph-Based Chunking

  • Natural, human-readable boundaries
  • Good balance between size and context
  • Works well for structured documents
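
One way to sketch paragraph-based chunking is to split on blank lines and merge neighbouring paragraphs until a size budget is reached. The name `paragraph_chunks` and the `max_chars` budget are hypothetical; the source does not say whether the project merges small paragraphs at all.

```python
import re


def paragraph_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Split on blank lines, then merge adjacent paragraphs up to max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Note that a single paragraph longer than `max_chars` is kept whole in this sketch rather than being split further.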

5. Document Structure-Based Chunking

  • Uses document hierarchy (headings, sections)
  • Excellent for technical documentation
  • Leverages Docling's HybridChunker
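
To illustrate the general idea of splitting along a document's hierarchy, here is a standard-library sketch that chunks Markdown by headings and records the heading path of each chunk. This is a stand-in written for illustration only; it is not Docling's HybridChunker and does not reproduce its behaviour, and the name `heading_chunks` is an assumption.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def heading_chunks(markdown: str) -> list[dict]:
    """Split Markdown at headings; each chunk keeps the heading path above it."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # (level, title) of the enclosing headings
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append({"headings": [title for _, title in path], "text": content})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the section that ends at this heading
            level = len(match.group(1))
            # Drop headings at the same or a deeper level than the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the heading path with each chunk lets a retriever show or embed the section context alongside the text. This sketch does not special-case fenced code, so a `#` comment inside a code block would be mistaken for a heading.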

6. Semantic Chunking

  • Most semantically coherent chunks
  • Best retrieval quality for complex documents
  • Uses sentence embeddings for topic detection
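
The topic-detection step can be sketched as: embed each sentence, compare neighbours, and start a new chunk when similarity drops below a threshold. To stay runnable without a model download, the sketch below uses a toy bag-of-words embedding (`bow_embed`) in place of real sentence embeddings; that substitution, the function names, and the `threshold` default are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable, Mapping


def bow_embed(sentence: str) -> Mapping[str, float]:
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: list[str],
    threshold: float = 0.2,
    embed: Callable[[str], Mapping[str, float]] = bow_embed,
) -> list[str]:
    """Start a new chunk whenever similarity to the previous sentence falls below threshold."""
    chunks: list[str] = []
    current: list[str] = []
    previous = None
    for sentence in sentences:
        vector = embed(sentence)
        if current and cosine(previous, vector) < threshold:
            chunks.append(" ".join(current))  # topic shift detected
            current = []
        current.append(sentence)
        previous = vector
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a real embedding model, `embed` and `cosine` would be replaced by the model's encoder and a dense-vector similarity, and the threshold would need retuning.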

7. Multiscale Semantic Chunking

  • Advanced semantic chunking with multiple scales
  • Balances local and global context
  • Tunable parameters for different use cases

8. Recursive Chunking

  • Respects natural language structure
  • Widely used in production RAG systems
  • A sensible default strategy for most applications
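
The recursive idea is to try the coarsest separator first (paragraph breaks), and fall back to finer ones (lines, sentences, words) only for pieces that are still too long. The sketch below is a simplified illustration under that assumption; the name `recursive_chunks`, the separator list, and the character-based limit are hypothetical and do not mirror any particular library's splitter.

```python
def recursive_chunks(
    text: str,
    max_chars: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on the coarsest separator that applies, recursing into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for index, sep in enumerate(separators):
        if sep not in text:
            continue
        finer = separators[index + 1:]
        chunks: list[str] = []
        current = ""
        for piece in text.split(sep):
            candidate = f"{current}{sep}{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate  # keep packing pieces into this chunk
                continue
            if current:
                chunks.append(current)
            if len(piece) > max_chars:
                # Piece is still too long: retry with finer separators.
                chunks.extend(recursive_chunks(piece, max_chars, finer))
                current = ""
            else:
                current = piece
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to a hard character split.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Every returned chunk is at most `max_chars` long, and text is only cut mid-word when no separator is available.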

Evaluation Framework

The project includes a comprehensive evaluation system with three levels:

Level 1: Intrinsic Metrics

  • Chunk count and size statistics
  • Token distribution analysis
  • Empty chunk detection
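
These intrinsic checks need nothing beyond the chunks themselves. A minimal sketch, assuming whitespace-separated words as a rough proxy for tokens (a real tokenizer would give different counts) and a hypothetical function name `intrinsic_metrics`:

```python
from statistics import mean, median


def intrinsic_metrics(chunks: list[str]) -> dict:
    """Summarise chunk count, empty chunks, and size distribution."""
    sizes = [len(chunk.split()) for chunk in chunks]  # whitespace tokens per chunk
    non_empty = [size for size in sizes if size > 0]
    return {
        "chunk_count": len(chunks),
        "empty_chunks": len(sizes) - len(non_empty),
        "min_tokens": min(non_empty, default=0),
        "max_tokens": max(non_empty, default=0),
        "mean_tokens": mean(non_empty) if non_empty else 0.0,
        "median_tokens": median(non_empty) if non_empty else 0.0,
    }
```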

Level 2: Semantic Coherence

  • Measures topical tightness within chunks
  • Uses sentence embeddings for similarity scoring
  • Proxy for chunk quality
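
One common way to score topical tightness is the mean pairwise cosine similarity between the embeddings of a chunk's sentences. The sketch below assumes the sentence vectors have already been computed by some embedding model; the function name `chunk_coherence` and the convention of returning 1.0 for a single-sentence chunk are assumptions, and the project may define its score differently.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def chunk_coherence(sentence_vectors: Sequence[Sequence[float]]) -> float:
    """Mean pairwise cosine similarity of the sentence vectors in one chunk."""
    pairs = list(combinations(sentence_vectors, 2))
    if not pairs:
        return 1.0  # assumed convention: a single sentence is trivially coherent
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

Averaging this score over all chunks gives one number per chunking method, which makes methods comparable on the same document.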

Level 3: Retrieval Quality

  • Q&A pair evaluation using document content
  • Hit rate calculation with keyword matching
  • Similarity-based retrieval assessment
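
A hit-rate calculation with keyword matching can be sketched like this: for each question, retrieve the top-k chunks and count a hit if the expected keyword appears in any of them. The retriever here ranks chunks by word overlap purely as a dependency-free stand-in for embedding similarity; the names `retrieve` and `hit_rate` and the `(question, keyword)` pair format are assumptions.

```python
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Rank chunks by word overlap with the question (stand-in for embedding similarity)."""
    query = tokens(question)
    ranked = sorted(chunks, key=lambda chunk: len(query & tokens(chunk)), reverse=True)
    return ranked[:k]


def hit_rate(qa_pairs: list[tuple[str, str]], chunks: list[str], k: int = 1) -> float:
    """Fraction of questions whose expected keyword appears in the top-k chunks."""
    if not qa_pairs:
        return 0.0
    hits = sum(
        any(keyword.lower() in chunk.lower() for chunk in retrieve(question, chunks, k))
        for question, keyword in qa_pairs
    )
    return hits / len(qa_pairs)
```

Running the same question set against the output of each chunking method gives a directly comparable retrieval score per method.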

Installation

  1. Clone the repository:
     git clone <repository-url>
     cd ChunkMethods
  2. Create a virtual environment with uv:
     uv venv
     source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies with uv:
     uv sync
