# Document Chunking Strategies

Chunking is the process of breaking down documents into smaller pieces that can be efficiently processed by language models and retrieval systems. In this notebook, we'll explore different chunking strategies and compare their impact on retrieval quality.

## Learning Objectives

By the end of this notebook, you will be able to:
- Understand why chunking is essential for RAG systems
- Implement different chunking strategies (sentence, token, hierarchical, structure-aware)
- Compare chunking approaches using retrieval experiments
- Choose appropriate chunk sizes and overlap for your use case
- Understand tradeoffs between chunk size, context, and precision

## Why Chunking Matters

The way you chunk your documents directly impacts:

- **Retrieval Precision** - How accurately your system finds relevant information
- **Context Preservation** - How much surrounding information is maintained
- **Token Economy** - How efficiently you use your LLM's context window
- **Storage Requirements** - How much vector storage you need
- **Search Quality** - How well chunks balance granularity against coherence
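
To make these effects concrete, here is a minimal sketch of a sliding-window chunker over whitespace-separated words. The function name `chunk_words` and the word-based notion of size are assumptions made for illustration only; real splitters count tokens and usually respect sentence boundaries.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words. Hypothetical helper for illustration.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Ten dummy words: w0 ... w9
text = " ".join(f"w{i}" for i in range(10))
for size, overlap in [(4, 0), (4, 2), (8, 2)]:
    chunks = chunk_words(text, size, overlap)
    print(f"size={size}, overlap={overlap}: {len(chunks)} chunks")
```

On this toy input, smaller windows and larger overlaps both increase the number of chunks, which is the storage side of the tradeoff listed above.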

Let's explore different chunking strategies and their practical implications.

In [None]:
## Sample Document Setup

from llama_index.core.schema import Document
import textwrap

# Sample document with clear structure
sample_text = """
# Introduction to Vector Databases

Vector databases are specialized database systems designed to store and query vector embeddings efficiently.
Unlike traditional databases optimized for exact matches, vector databases excel at similarity searches.

## Key Advantages

Vector databases offer several advantages for AI applications:
- Efficient similarity search using algorithms like HNSW and IVF
- Support for high-dimensional vector data
- Optimized for retrieval-augmented generation (RAG) applications

## Common Operations

The most common operations in vector databases include:
1. Adding vectors with associated metadata
2. Searching for similar vectors using distance metrics
3. Filtering results based on metadata
4. Building and optimizing indexes for faster retrieval

# Performance Considerations

When working with vector databases at scale, consider:
- Index construction time vs. query performance
- Memory usage vs. search accuracy
- Batch processing for efficient vector insertion
"""

# Create a Document
document = Document(text=sample_text)

# Display document info
print("Sample Document:\n")
print("=" * 80)
print(f"Total length: {len(document.text):,} characters")
print("Preview (first 200 chars):\n")
print(textwrap.fill(document.text[:200], 80))
print("...\n")
print("✓ Document loaded for chunking experiments")

In [None]:
## Strategy 1: Sentence-Based Chunking

from llama_index.core.node_parser import SentenceSplitter

# Sentence-based chunking
sentence_splitter = SentenceSplitter(
    chunk_size=200,  # Target chunk size (in tokens, not characters)
    chunk_overlap=20  # Overlap between chunks (in tokens)
)

sentence_nodes = sentence_splitter.get_nodes_from_documents([document])

print("Sentence-Based Chunking Results:\n")
print("=" * 80)
print(f"Chunks created: {len(sentence_nodes)}\n")

print("Sample chunks:\n")
for i in range(min(3, len(sentence_nodes))):
    print(f"Chunk {i+1}:")
    print(f"  Length: {len(sentence_nodes[i].text)} characters")
    print(f"  Text: {textwrap.fill(sentence_nodes[i].text[:150], 76)}...")
    print()

print("✓ Sentence-based chunking preserves natural sentence boundaries")

In [None]:
## Strategy 2: Token-Based Chunking

from llama_index.core.node_parser import TokenTextSplitter

# Token-based chunking
token_splitter = TokenTextSplitter(
    chunk_size=100,  # Target chunk size (in tokens)
    chunk_overlap=20  # Overlap between chunks (in tokens)
)

token_nodes = token_splitter.get_nodes_from_documents([document])

print("Token-Based Chunking Results:\n")
print("=" * 80)
print(f"Chunks created: {len(token_nodes)}\n")

print("Sample chunks:\n")
for i in range(min(3, len(token_nodes))):
    print(f"Chunk {i+1}:")
    print(f"  Length: {len(token_nodes[i].text)} characters")
    print(f"  Text: {textwrap.fill(token_nodes[i].text[:150], 76)}...")
    print()

print("✓ Token-based chunking keeps each chunk within a fixed token budget for LLM processing")

In [None]:
## Strategy 3: Hierarchical Chunking

from llama_index.core.node_parser import HierarchicalNodeParser

# Hierarchical chunking with multiple levels
hierarchical_splitter = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[512, 256, 128]  # Multi-level chunking from large to small
)

hierarchical_nodes = hierarchical_splitter.get_nodes_from_documents([document])

print("Hierarchical Chunking Results:\n")
print("=" * 80)
print(f"Nodes created (all levels): {len(hierarchical_nodes)}\n")

print("Sample chunks:\n")
for i in range(min(3, len(hierarchical_nodes))):
    print(f"Chunk {i+1}:")
    print(f"  Length: {len(hierarchical_nodes[i].text)} characters")
    print(f"  Text: {textwrap.fill(hierarchical_nodes[i].text[:150], 76)}...")
    print()

print("✓ Hierarchical chunking creates multi-level representations")

In [None]:
## Strategy 4: Structure-Aware Chunking (Markdown)

from llama_index.core.node_parser import MarkdownNodeParser

# Structure-aware chunking for Markdown documents
markdown_splitter = MarkdownNodeParser()

markdown_nodes = markdown_splitter.get_nodes_from_documents([document])

print("Markdown-Aware Chunking Results:\n")
print("=" * 80)
print(f"Chunks created: {len(markdown_nodes)}\n")

print("Sample chunks with structure metadata:\n")
for i in range(min(3, len(markdown_nodes))):
    header_path = markdown_nodes[i].metadata.get('header_path', 'N/A')
    print(f"Chunk {i+1}:")
    print(f"  Header path: {header_path}")
    print(f"  Length: {len(markdown_nodes[i].text)} characters")
    print(f"  Text: {textwrap.fill(markdown_nodes[i].text[:150], 76)}...")
    print()

print("✓ Markdown-aware chunking preserves document structure and hierarchy")

## Setup: Install Embedding Model

Before we can compare retrieval quality, we need to install the embedding model library.

In [None]:
# Install embedding model library
%env UV_LINK_MODE=copy
!uv pip install llama-index-embeddings-huggingface

In [None]:
## Comparing Chunking Strategies with Retrieval

from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Create a local embedding model
print("Loading embedding model...")
local_embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("✓ Embedding model loaded\n")

# Create vector indexes with different chunking strategies
print("Creating vector indexes...")
sentence_index = VectorStoreIndex(sentence_nodes, embed_model=local_embed_model)
token_index = VectorStoreIndex(token_nodes, embed_model=local_embed_model)
markdown_index = VectorStoreIndex(markdown_nodes, embed_model=local_embed_model)
print("✓ Vector indexes created\n")

# Test query
query = "What are the common operations in vector databases?"

# Get retrieval results from each strategy
print("=" * 80)
print(f"Query: '{query}'\n")
print("=" * 80)

sentence_results = sentence_index.as_retriever(similarity_top_k=1).retrieve(query)
token_results = token_index.as_retriever(similarity_top_k=1).retrieve(query)
markdown_results = markdown_index.as_retriever(similarity_top_k=1).retrieve(query)

# Compare results
print("\n1. Sentence-based chunking result:")
print(f"   Score: {sentence_results[0].score:.4f}")
print(f"   Text: {textwrap.fill(sentence_results[0].node.text[:200], 76)}...")

print("\n2. Token-based chunking result:")
print(f"   Score: {token_results[0].score:.4f}")
print(f"   Text: {textwrap.fill(token_results[0].node.text[:200], 76)}...")

print("\n3. Markdown-aware chunking result:")
print(f"   Score: {markdown_results[0].score:.4f}")
print(f"   Text: {textwrap.fill(markdown_results[0].node.text[:200], 76)}...")

print("\n" + "=" * 80)
print("✓ Retrieval comparison complete")

## Summary

We've explored four chunking strategies and compared three of them (sentence-based, token-based, and Markdown-aware) in a retrieval experiment:

### Chunking Strategies Compared

1. **Sentence-Based Chunking**
   - Preserves natural sentence boundaries
   - Good for maintaining semantic coherence
   - Variable chunk sizes
   - Best for: General text documents with clear sentence structure

2. **Token-Based Chunking**
   - Consistent token counts for LLM processing
   - Predictable memory usage
   - May split mid-sentence
   - Best for: Token-limited scenarios, cost optimization

3. **Hierarchical Chunking**
   - Multi-level representations (coarse to fine)
   - Enables multi-scale retrieval
   - More complex to implement
   - Best for: Large documents, when you need both overview and detail

4. **Structure-Aware Chunking**
   - Respects document structure (headings, sections)
   - Preserves hierarchical metadata
   - Format-specific (Markdown, HTML)
   - Best for: Structured documents with clear organization
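
The multi-level idea behind hierarchical chunking can be sketched without any library. The following is a simplified two-level illustration, not how `HierarchicalNodeParser` is implemented: `build_hierarchy` is a hypothetical helper, and it measures size in whitespace-separated words rather than tokens.

```python
def build_hierarchy(text: str, parent_size: int, child_size: int):
    """Split text into coarse parent chunks, then split each parent into children.

    Each child records the index of its parent so that a retriever could match
    on the small child and return the larger parent for context.
    """
    if parent_size <= 0 or child_size <= 0:
        raise ValueError("sizes must be positive")

    words = text.split()
    parents, children = [], []
    for p_index, p_start in enumerate(range(0, len(words), parent_size)):
        parent_words = words[p_start:p_start + parent_size]
        parents.append(" ".join(parent_words))
        for c_start in range(0, len(parent_words), child_size):
            child_words = parent_words[c_start:c_start + child_size]
            children.append({"parent": p_index, "text": " ".join(child_words)})
    return parents, children


text = " ".join(f"w{i}" for i in range(10))
parents, children = build_hierarchy(text, parent_size=6, child_size=3)
for child in children:
    print(f"child {child['text']!r} -> parent {parents[child['parent']]!r}")
```

Keeping a parent reference on every child is what makes multi-scale retrieval possible: search over the fine-grained children, then hand the LLM the parent text.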

### Key Tradeoffs

**Chunk Size:**
- **Smaller chunks** → More precise retrieval, less context per chunk, more vectors to store
- **Larger chunks** → More context per chunk, less precise retrieval, fewer vectors to store

**Chunk Overlap:**
- **More overlap** → Better boundary handling, more storage, potential duplication
- **Less overlap** → Efficient storage, risk of missing boundary information
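
As a rough sketch of the storage side of these tradeoffs, the number of fixed-size chunks can be estimated from document length, chunk size, and overlap. `estimate_chunks` is a hypothetical helper, and it assumes every chunk is filled up to `chunk_size`; sentence-aware splitters produce variable sizes, so treat the result as an approximation.

```python
def estimate_chunks(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Estimate how many sliding-window chunks cover `total_tokens` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1

    step = chunk_size - overlap            # new tokens contributed by each extra chunk
    remaining = total_tokens - chunk_size  # tokens not covered by the first chunk
    return 1 + (remaining + step - 1) // step  # integer ceiling division


# A 10,000-token document with 500-token chunks and 0%, 10%, 20% overlap
for overlap in (0, 50, 100):
    count = estimate_chunks(10_000, 500, overlap)
    print(f"overlap={overlap:>3}: {count} chunks")
```

For this example the estimate grows from 20 chunks with no overlap to 23 at 10% and 25 at 20%, which gives a feel for how much extra embedding and storage work overlap adds.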

**Structure Awareness:**
- **Structure-aware** → Better semantic coherence, format-specific
- **Structure-agnostic** → Simpler, works with any text, may break logical units
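
To show what "structure-aware" means in practice, here is a minimal, library-free sketch that splits Markdown on headings and records each section's heading trail. The `split_by_headings` helper and the `header_path` key with its `" / "` separator are assumptions for this illustration; they do not reproduce the exact output of `MarkdownNodeParser`. The sketch also ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

# ATX-style heading: one to six '#' characters, whitespace, then the title
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into sections, one per heading, with the heading trail."""
    sections = []
    path = []  # stack of (level, title) for the headings currently in scope
    body = []  # lines collected for the section being built

    def flush():
        text = "\n".join(body).strip()
        if text:
            trail = " / ".join(title for _, title in path)
            sections.append({"header_path": trail, "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level than the new one
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections


doc = "# A\nintro\n## B\nbody b\n# C\nbody c\n"
for section in split_by_headings(doc):
    print(section)
```

Each section stays a complete logical unit and carries its place in the outline, at the cost of working only for Markdown-style input.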

### Best Practices

1. **Start with structure-aware chunking** if your documents have clear structure
2. **Use 200-500 tokens** as a starting chunk size for most applications
3. **Add 10-20% overlap** to handle boundary cases
4. **Test with actual queries** to validate your chunking strategy
5. **Consider your use case**:
   - Q&A systems: Smaller chunks (100-300 tokens)
   - Summarization: Larger chunks (500-1000 tokens)
   - General RAG: Medium chunks (200-500 tokens)
6. **Monitor retrieval quality** and adjust based on results
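
One way to test with actual queries is a small hit-rate check: for each query, record a phrase the top result should contain, then count how often each chunking strategy retrieves it. The sketch below is an assumption-laden illustration. `retrieve_top1` is a toy word-overlap retriever standing in for an embedding-based one, and `hit_rate`, the sample chunks, and the labeled queries are all hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_top1(query: str, chunks: list[str]) -> str:
    """Toy retriever: return the chunk with the highest Jaccard word overlap."""
    query_words = tokenize(query)

    def score(chunk: str) -> float:
        chunk_words = tokenize(chunk)
        union = query_words | chunk_words
        return len(query_words & chunk_words) / len(union) if union else 0.0

    return max(chunks, key=score)


def hit_rate(chunks: list[str], labeled_queries: list[tuple[str, str]]) -> float:
    """Fraction of queries whose top chunk contains the expected phrase."""
    hits = sum(
        expected.lower() in retrieve_top1(query, chunks).lower()
        for query, expected in labeled_queries
    )
    return hits / len(labeled_queries)


# Two ways of chunking the same two sentences
by_sentence = [
    "Vector databases store embeddings and support similarity search.",
    "Common operations include adding vectors, searching, and filtering by metadata.",
]
mid_sentence = [
    "Vector databases store embeddings and support similarity search. Common operations include",
    "adding vectors, searching, and filtering by metadata.",
]
labeled_queries = [
    ("What are the common operations?", "filtering by metadata"),
    ("What do vector databases store?", "store embeddings"),
]

print("by_sentence :", hit_rate(by_sentence, labeled_queries))   # prints 1.0
print("mid_sentence:", hit_rate(mid_sentence, labeled_queries))  # prints 0.5
```

In a real evaluation you would swap the toy retriever for the retrievers built in this notebook and use queries drawn from your own users or documents.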

Chunking strategy significantly affects RAG system performance, so invest time in testing different approaches with your own documents and queries.