## What is Text Chunking?

**Text chunking** is the process of breaking down large documents into smaller, manageable pieces called "chunks". 

In RAG (Retrieval-Augmented Generation) systems, chunking solves several critical problems:

### 1. **Context Window Limitations**
- Language models (like GPT) can only process a limited amount of text at once
- Example: If a model can handle 4,000 tokens but your document is 10,000 tokens, you need to split it
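
As a rough illustration of the arithmetic in the example above, the sketch below estimates how many chunks a document needs for a given token budget. The function name `chunks_needed` and its optional `overlap` parameter are assumptions made for this illustration and do not come from any library.

```python
import math

def chunks_needed(doc_tokens: int, max_tokens: int, overlap: int = 0) -> int:
    """Estimate how many chunks of at most `max_tokens` cover `doc_tokens` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    if doc_tokens <= 0:
        return 0
    if doc_tokens <= max_tokens:
        return 1
    # Each chunk after the first contributes (max_tokens - overlap) new tokens
    step = max_tokens - overlap
    return 1 + math.ceil((doc_tokens - max_tokens) / step)

print(chunks_needed(10_000, 4_000))  # 3
```

With no overlap, the 10,000-token document from the example needs three chunks under a 4,000-token limit.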

### 2. **Better Search & Retrieval**
- Smaller chunks allow for more precise searching
- Instead of finding a whole document, users can find the specific paragraph they need

### 3. **Improved Relevance**
- When a user asks a question, the system can retrieve the most relevant chunk(s) instead of the entire document
- This leads to more focused and accurate answers
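
To make the retrieval idea concrete, here is a toy sketch that scores each chunk by how many words it shares with the question and returns the best match. Real RAG systems typically rank chunks by embedding similarity. The word-overlap score and the helper names `_words` and `best_chunk` are simplifying assumptions for illustration only.

```python
import re
from typing import List, Set

def _words(text: str) -> Set[str]:
    # Lowercase the text and keep simple word tokens
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def best_chunk(question: str, chunks: List[str]) -> str:
    """Return the chunk that shares the most words with the question."""
    q = _words(question)
    return max(chunks, key=lambda c: len(q & _words(c)))

chunks = [
    "Chunk size controls how many words go into each piece.",
    "Overlap repeats a few words at chunk boundaries.",
    "Recursive chunking splits by paragraphs, then sentences, then words.",
]
print(best_chunk("What does overlap do at chunk boundaries?", chunks))
```

Only the second chunk is returned, rather than the whole set of texts.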


## Key Chunking Concepts

**Chunk Size**: The maximum number of words or tokens per chunk
- Too small → each chunk loses its surrounding context
- Too large → retrieval becomes less precise

**Chunk Overlap**: How many words or tokens adjacent chunks share
- Prevents cutting off important information at chunk boundaries
- Helps maintain context between adjacent chunks
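
A minimal sketch of how chunk size and overlap interact, assuming word-based counting. The function `sliding_window_chunks` is a hypothetical helper written for this illustration, not part of any library.

```python
from typing import List

def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into word windows of `chunk_size`, sharing `chunk_overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks

text = "one two three four five six seven eight nine ten"
for c in sliding_window_chunks(text, chunk_size=4, chunk_overlap=1):
    print(c)
```

With `chunk_size=4` and `chunk_overlap=1`, each chunk starts with the last word of the previous one ("four", then "seven"), so a phrase that straddles a boundary still appears whole in at least one chunk.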

**Preserving Structure**: Keeping paragraphs, headings, and formatting intact
- Maintains readability and context
- Keeps related content, such as a heading and the text under it, in coherent units

# Recursive Chunking

## What is Recursive Chunking?

Instead of cutting text at arbitrary positions, recursive chunking follows a **hierarchical approach**:

1. **First**: Try to split by paragraphs (keeps ideas together)
2. **Then**: If paragraphs are still too big, split by sentences
3. **Finally**: If sentences are too big, split by words

This ensures we **preserve meaning and structure** as much as possible.
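
The three steps above can be sketched compactly as a recursive function. This is a simplified illustration, not the implementation developed in the rest of this notebook. It assumes paragraphs are separated by `"\n\n"` and sentences by `". "`, it drops the separators, and it does not merge small pieces back together. The names `SEPARATORS` and `recursive_split` are hypothetical.

```python
from typing import List

# Assumed separators, coarse to fine: paragraphs first, then sentences
SEPARATORS = ["\n\n", ". "]

def recursive_split(text: str, max_words: int, level: int = 0) -> List[str]:
    words = text.split()
    if len(words) <= max_words:
        return [text.strip()]  # already small enough
    if level >= len(SEPARATORS):
        # Last resort: fixed-size groups of words
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    chunks = []
    for piece in text.split(SEPARATORS[level]):
        if piece.strip():
            # Pieces that are still too big are split at the next, finer level
            chunks.extend(recursive_split(piece, max_words, level + 1))
    return chunks

doc = ("Short intro.\n\n"
       "First sentence here is brief. "
       "Second sentence is a little bit longer than the first one.")
for chunk in recursive_split(doc, max_words=6):
    print(repr(chunk))
```

The short paragraph and the short sentence come back whole. Only the 11-word sentence falls through to the word-level split.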

In [5]:
import re
from typing import List
from langchain_text_splitters.base import TextSplitter

In [6]:
def _split_paragraphs(text: str) -> List[str]:
    """
    Split text into paragraphs while preserving formatting.
    - Takes a big block of text
    - Breaks it into separate paragraphs
    - Keeps the original formatting (spaces, line breaks)
    - Removes empty paragraphs
    """
    
    paragraphs = re.split(r'\n\s*\n', text)
    return [p for p in paragraphs if p.strip()]


def _split_sentences(text: str) -> List[str]:
    """
    Split text into sentences while preserving original formatting.
    
    - Takes a paragraph or block of text
    - Breaks it into individual sentences
    - Keeps the original spacing and punctuation
    - Preserves how the text was originally formatted
    """
    

    pattern = r'(?<=[.!?])(\s+)'
    parts = re.split(pattern, text)
    
    sentences = []
    for i in range(0, len(parts), 2):
        sentence = parts[i] 
        
        if i + 1 < len(parts):
            sentence += parts[i + 1]  
            
        sentences.append(sentence)

    return [s for s in sentences if s.strip()]
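
The loop in `_split_sentences` relies on a detail of `re.split`: when the pattern contains a capturing group, the matched separators are returned in the result list too, so sentences sit at even indices and the whitespace after them at odd indices. The quick check below illustrates this with made-up example strings. It also shows one limitation of this simple pattern.

```python
import re

pattern = r'(?<=[.!?])(\s+)'

original = "Hi there!  How are you? Fine."
parts = re.split(pattern, original)
print(parts)  # ['Hi there!', '  ', 'How are you?', ' ', 'Fine.']

# Because the separators are kept, joining the parts restores the input exactly
print("".join(parts) == original)  # True

# Limitation: any ., ! or ? followed by whitespace counts as a sentence end,
# so abbreviations are split as well
abbrev_parts = re.split(pattern, "Dr. Smith arrived.")
print(abbrev_parts)  # ['Dr.', ' ', 'Smith arrived.']
```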

In [7]:
print("=" * 50)

# Test paragraph splitting
test_text = """This is the first paragraph.

This is the second paragraph with some text.


This is the third paragraph after extra empty lines."""

print("\nOriginal text:")
print(test_text)

print("\nSplit into paragraphs:")
paragraphs = _split_paragraphs(test_text)
for i, para in enumerate(paragraphs, 1):
    print(f"Paragraph {i}: {para}")

test_sentence = "Hello world! How are you today? I'm doing great.  Let's learn about chunking!"

print("\nOriginal sentences:")
print(test_sentence)

print("\nSplit into sentences:")
sentences = _split_sentences(test_sentence)
for i, sent in enumerate(sentences, 1):
    print(f"Sentence {i}: {sent}")


Original text:
This is the first paragraph.

This is the second paragraph with some text.


This is the third paragraph after extra empty lines.

Split into paragraphs:
Paragraph 1: This is the first paragraph.
Paragraph 2: This is the second paragraph with some text.
Paragraph 3: This is the third paragraph after extra empty lines.

Original sentences:
Hello world! How are you today? I'm doing great.  Let's learn about chunking!

Split into sentences:
Sentence 1: Hello world! 
Sentence 2: How are you today? 
Sentence 3: I'm doing great.  
Sentence 4: Let's learn about chunking!


In [12]:
class RecursiveMarkdownSplitter(TextSplitter):
    """
    Split text into smaller chunks while preserving meaning and structure
    
    This class implements our recursive chunking strategy.
    
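    - Treats blank-line-separated Markdown blocks (headings, code blocks, etc.) as paragraphs; it does not parse Markdown syntax
    - Follows natural text boundaries (paragraphs → sentences → words)
    - Configurable chunk size and overlap (overlap is applied only in the word-level split)
    - Maintains original spacing within paragraphs and sentences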
    """
    
    def __init__(self, chunk_size: int = 100, chunk_overlap: int = 0):
        """
        Args:
            chunk_size (int): Maximum number of words per chunk (default: 100)
            chunk_overlap (int): How many words should overlap between chunks (default: 0)
        """
        super().__init__(keep_separator=True)
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        
        print(f"   RecursiveMarkdownSplitter initialized!")
        print(f"   Chunk size: {chunk_size} words")
        print(f"   Chunk overlap: {chunk_overlap} words")

    def split_text(self, text: str) -> List[str]:
        return self._recursive_split(text)
    
    def chunk(self, text: str) -> List[str]:
        return self._recursive_split(text)

    def _count_words(self, text: str) -> int:
        """
        This is a simple word counter. We split on any whitespace (spaces, tabs, newlines) and count the results.
        """
        return len(text.split())

    def _recursive_split(self, text: str) -> List[str]:
        """
        This method implements our hierarchical splitting strategy:
        
        1. Check if text is already small enough
        2. Try splitting by paragraphs first
        3. If paragraphs are too big, try sentences  
        4. If sentences are too big, split by words
        """

        if self._count_words(text) <= self.chunk_size:
            print(f"text is small enough ({self._count_words(text)} words): Using as one chunk")
            return [text]

        
        paragraphs = _split_paragraphs(text)
        print(f"Found {len(paragraphs)} paragraphs")
        
        if len(paragraphs) > 1:
            chunks = []
            current = ""
            
            for p in paragraphs:
                # Tentatively append this paragraph to the chunk being built
                candidate = current + ("\n\n" if current else "") + p
                
                if self._count_words(candidate) <= self.chunk_size or not current:
                    current = candidate
                else:
                    chunks.extend(self._recursive_split(current))
                    current = p  # Start new chunk with this paragraph
            
            if current:
                chunks.extend(self._recursive_split(current))
                
            print(f"Paragraph split complete: {len(chunks)} chunks created")
            return chunks

        sentences = _split_sentences(text)
        print(f" Found {len(sentences)} sentences")
        
        if len(sentences) > 1:
            chunks = []
            current = ""
            
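            for s in sentences:
                # Note: a single sentence longer than chunk_size is kept whole here;
                # it is not split further by words
                if self._count_words(current + s) <= self.chunk_size or not current:
                    current += s
                else:
                    chunks.append(current)
                    current = s 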
            
            if current:
                chunks.append(current)
                
            print(f"Sentence split complete: {len(chunks)} chunks created")
            return chunks

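        # Last resort: split the text into fixed-size groups of words
        words = text.split()
        chunks = []
        # Guard against a non-positive step (overlap >= chunk size), which would loop forever
        step = max(1, self.chunk_size - self.chunk_overlap)
        start = 0
        
        while start < len(words):
            end = start + self.chunk_size
            chunk_text = " ".join(words[start:end])
            chunks.append(chunk_text)
            
            if end >= len(words):
                break  # Last window reached; avoid a trailing chunk made only of overlap words
            start += step
                
        print(f"Word split complete: {len(chunks)} chunks created")
        return chunks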
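
The paragraph and sentence branches of `_recursive_split` both use the same greedy packing idea: keep adding units to the current chunk until the next one would exceed the limit, then start a new chunk. The standalone sketch below isolates that step. The function `pack_units` and its `joiner` parameter are hypothetical names introduced for this illustration, and unlike the class above it does not recurse into oversized units.

```python
from typing import List

def pack_units(units: List[str], max_words: int, joiner: str = "\n\n") -> List[str]:
    """Greedily merge consecutive units into chunks of at most `max_words` words."""
    chunks: List[str] = []
    current = ""
    for unit in units:
        candidate = current + (joiner if current else "") + unit
        # Always accept the first unit, even if it is oversized on its own
        if not current or len(candidate.split()) <= max_words:
            current = candidate
        else:
            chunks.append(current)
            current = unit  # start a new chunk with this unit
    if current:
        chunks.append(current)
    return chunks

paras = ["a b c", "d e", "f g h i", "j"]
print(pack_units(paras, max_words=5))  # ['a b c\n\nd e', 'f g h i\n\nj']
```

The first two paragraphs fit together in five words, the third would push the total to nine, so it opens a new chunk that then absorbs the last one-word paragraph.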

In [None]:
sample = """
# Example Document

This is a short example demonstrating how the Recursive Markdown
Splitter works. It keeps paragraphs and sentence spacing in place.

Here is another paragraph to force paragraph-level splitting. This paragraph is intentionally longer to show how the chunker handles content that exceeds the chunk size limit.

## Code Example

```python
def hello_world():
    print("Hello, chunking world!")
```

This demonstrates that code blocks and formatting are preserved properly in our chunks.
"""

print("=" * 60)

# Test with different chunk sizes to see the behavior
chunk_sizes = [15, 25, 50]

for size in chunk_sizes:
    print(f"\nTesting with chunk size: {size} words")
    print("-" * 40)
    
    splitter = RecursiveMarkdownSplitter(chunk_size=size)
    
    chunks = splitter.chunk(sample)
    
    print(f"\nResults: {len(chunks)} chunks created")
    
    for i, chunk in enumerate(chunks, 1):
        word_count = splitter._count_words(chunk)
        print(f"\n--- Chunk {i} ({word_count} words) ---")
        print("Content:")
        print(chunk)


Testing with chunk size: 15 words
----------------------------------------
   RecursiveMarkdownSplitter initialized!
   Chunk size: 15 words
   Chunk overlap: 0 words
Found 6 paragraphs
text is small enough (3 words): Using as one chunk
Found 1 paragraphs
 Found 2 sentences
Sentence split complete: 2 chunks created
Found 1 paragraphs
 Found 2 sentences
Sentence split complete: 2 chunks created
text is small enough (10 words): Using as one chunk
text is small enough (13 words): Using as one chunk
Paragraph split complete: 7 chunks created

Results: 7 chunks created

--- Chunk 1 (3 words) ---
Content:

# Example Document

--- Chunk 2 (12 words) ---
Content:
This is a short example demonstrating how the Recursive Markdown
Splitter works. 

--- Chunk 3 (8 words) ---
Content:
It keeps paragraphs and sentence spacing in place.

--- Chunk 4 (8 words) ---
Content:
Here is another paragraph to force paragraph-level splitting. 

--- Chunk 5 (18 words) ---
Content:
This paragraph is intentionall