## What is Text Chunking?

**Text chunking** is the process of breaking down large documents into smaller, manageable pieces called "chunks". 

In RAG (Retrieval-Augmented Generation) systems, chunking solves several critical problems:

### 1. **Context Window Limitations**
- Language models (like GPT) can only process a limited amount of text at once
- Example: If a model can handle 4,000 tokens but your document is 10,000 tokens, you need to split it
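
The arithmetic behind that example can be sketched in a few lines. The 4,000 and 10,000 token figures come from the example above; `min_chunks_needed` is a hypothetical helper, and a real system would usually also reserve part of the window for the prompt and the generated answer.

```python
import math

def min_chunks_needed(doc_tokens: int, window_tokens: int) -> int:
    """Smallest number of non-overlapping chunks that each fit in the window."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return math.ceil(doc_tokens / window_tokens)

print(min_chunks_needed(10_000, 4_000))  # 3
```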

### 2. **Better Search & Retrieval**
- Smaller chunks allow for more precise searching
- Instead of finding a whole document, users can find the specific paragraph they need

### 3. **Improved Relevance**
- When a user asks a question, the system can retrieve the most relevant chunk(s) instead of the entire document
- This leads to more focused and accurate answers


## Key Chunking Concepts

**Chunk Size**: How many words/tokens per chunk
- Too small → Lose context
- Too large → Poor search precision

**Chunk Overlap**: How much chunks should overlap
- Prevents cutting off important information at chunk boundaries
- Helps maintain context between adjacent chunks

**Preserving Structure**: Keeping paragraphs, headings, and formatting intact
- Maintains readability and context
- Helps with better understanding
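
Chunk size and chunk overlap are easiest to see on a plain sliding window over words. The sketch below is only an illustration of the two parameters, assuming size is counted in whitespace-separated words; `sliding_window_chunks` is a hypothetical name and not part of any library.

```python
from typing import List

def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

demo = "one two three four five six seven eight nine ten"
for chunk in sliding_window_chunks(demo, chunk_size=4, chunk_overlap=1):
    print(chunk)
# one two three four
# four five six seven
# seven eight nine ten
```

Each window repeats the last word of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk.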

# Recursive Chunking

## What is Recursive Chunking?

Instead of cutting text at arbitrary positions, recursive chunking follows a **hierarchical approach**:

1. **First**: Try to split by paragraphs (keeps ideas together)
2. **Then**: If paragraphs are still too big, split by sentences
3. **Finally**: If sentences are too big, split by words

This **preserves meaning and structure** as far as possible.

In [5]:
import re
from typing import List
from langchain_text_splitters.base import TextSplitter

In [6]:
def _split_paragraphs(text: str) -> List[str]:
    """
    Split text into paragraphs while preserving formatting.
    - Takes a big block of text
    - Breaks it into separate paragraphs
    - Keeps the original formatting (spaces, line breaks)
    - Removes empty paragraphs
    """
    
    paragraphs = re.split(r'\n\s*\n', text)
    return [p for p in paragraphs if p.strip()]


def _split_sentences(text: str) -> List[str]:
    """
    Split text into sentences while preserving original formatting.
    
    - Takes a paragraph or block of text
    - Breaks it into individual sentences
    - Keeps the original spacing and punctuation
    - Preserves how the text was originally formatted
    """
    

    pattern = r'(?<=[.!?])(\s+)'
    parts = re.split(pattern, text)
    
    sentences = []
    for i in range(0, len(parts), 2):
        sentence = parts[i] 
        
        if i + 1 < len(parts):
            sentence += parts[i + 1]  
            
        sentences.append(sentence)

    return [s for s in sentences if s.strip()]

In [7]:
print("=" * 50)

# Test paragraph splitting
test_text = """This is the first paragraph.

This is the second paragraph with some text.


This is the third paragraph after extra empty lines."""

print("\nOriginal text:")
print(test_text)

print("\nSplit into paragraphs:")
paragraphs = _split_paragraphs(test_text)
for i, para in enumerate(paragraphs, 1):
    print(f"Paragraph {i}: {para}")

test_sentence = "Hello world! How are you today? I'm doing great.  Let's learn about chunking!"

print(f"\nOriginal sentences:")
print(test_sentence)

print(f"\nSplit into sentences:")
sentences = _split_sentences(test_sentence)
for i, sent in enumerate(sentences, 1):
    print(f"Sentence {i}: {sent}")


Original text:
This is the first paragraph.

This is the second paragraph with some text.


This is the third paragraph after extra empty lines.

Split into paragraphs:
Paragraph 1: This is the first paragraph.
Paragraph 2: This is the second paragraph with some text.
Paragraph 3: This is the third paragraph after extra empty lines.

Original sentences:
Hello world! How are you today? I'm doing great.  Let's learn about chunking!

Split into sentences:
Sentence 1: Hello world! 
Sentence 2: How are you today? 
Sentence 3: I'm doing great.  
Sentence 4: Let's learn about chunking!


In [12]:
class RecursiveMarkdownSplitter(TextSplitter):
    """
    Split text into smaller chunks while preserving meaning and structure
    
    This class implements our recursive chunking strategy.
    
    - Preserves Markdown formatting (headings, code blocks, etc.)
    - Follows natural text boundaries (paragraphs → sentences → words)
    - Configurable chunk size and overlap
    - Maintains original spacing and formatting
    """
    
    def __init__(self, chunk_size: int = 100, chunk_overlap: int = 0):
        """
        Args:
            chunk_size (int): Maximum number of words per chunk (default: 100)
            chunk_overlap (int): How many words should overlap between chunks (default: 0).
                Only the word-level fallback applies the overlap.
        """
        super().__init__(keep_separator=True)
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        
        print(f"   RecursiveMarkdownSplitter initialized!")
        print(f"   Chunk size: {chunk_size} words")
        print(f"   Chunk overlap: {chunk_overlap} words")

    def split_text(self, text: str) -> List[str]:
        return self._recursive_split(text)
    
    def chunk(self, text: str) -> List[str]:
        return self._recursive_split(text)

    def _count_words(self, text: str) -> int:
        """
        This is a simple word counter. We split on whitespace and count the results.
        """
        return len(text.split())

    def _recursive_split(self, text: str) -> List[str]:
        """
        This method implements our hierarchical splitting strategy:
        
        1. Check if text is already small enough
        2. Try splitting by paragraphs first
        3. If paragraphs are too big, try sentences  
        4. If sentences are too big, split by words
        """

        if self._count_words(text) <= self.chunk_size:
            print(f"text is small enough ({self._count_words(text)} words): Using as one chunk")
            return [text]

        
        paragraphs = _split_paragraphs(text)
        print(f"Found {len(paragraphs)} paragraphs")
        
        if len(paragraphs) > 1:
            chunks = []
            current = ""
            
            for p in paragraphs:
                test_text = current + ("\n\n" if current else "") + p
                
                if self._count_words(test_text) <= self.chunk_size or not current:
                    current += ("\n\n" if current else "") + p
                else:
                    chunks.extend(self._recursive_split(current))
                    current = p  # Start new chunk with this paragraph
            
            if current:
                chunks.extend(self._recursive_split(current))
                
            print(f"Paragraph split complete: {len(chunks)} chunks created")
            return chunks

        sentences = _split_sentences(text)
        print(f" Found {len(sentences)} sentences")
        
        if len(sentences) > 1:
            chunks = []
            current = ""
            
            for s in sentences:
                if self._count_words(current + s) <= self.chunk_size or not current:
                    current += s
                else:
                    chunks.append(current)
                    current = s 
            
            if current:
                chunks.append(current)
                
            print(f"Sentence split complete: {len(chunks)} chunks created")
            return chunks

        # Fall back to fixed-size word windows.
        words = text.split()
        chunks = []
        # Advance by at least one word so an overlap >= chunk_size cannot loop forever.
        step = max(1, self.chunk_size - self.chunk_overlap)
        start = 0

        while start < len(words):
            end = start + self.chunk_size
            chunks.append(" ".join(words[start:end]))
            if end >= len(words):
                break  # the last window already covers the tail
            start += step

        print(f"Word split complete: {len(chunks)} chunks created")
        return chunks
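
One behavior of the sentence branch in `_recursive_split` is worth noting: a single sentence longer than `chunk_size` is appended unchanged, so a chunk can exceed the limit. A minimal sketch of one way to enforce the limit afterwards, assuming a hypothetical helper `enforce_word_limit` (not part of the class) that hard-splits any oversized chunk by words:

```python
from typing import List

def enforce_word_limit(chunks: List[str], chunk_size: int) -> List[str]:
    """Hard-split any chunk that has more than chunk_size words."""
    limited = []
    for chunk in chunks:
        words = chunk.split()
        if len(words) <= chunk_size:
            limited.append(chunk)  # small enough: keep original spacing untouched
            continue
        # Oversized: cut into consecutive windows of chunk_size words.
        for start in range(0, len(words), chunk_size):
            limited.append(" ".join(words[start:start + chunk_size]))
    return limited

print(enforce_word_limit(["a b c d e", "short one"], chunk_size=2))
# ['a b', 'c d', 'e', 'short one']
```

The trade-off is that a hard split normalizes whitespace inside the pieces it creates, so it is best kept as a last resort.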

In [None]:
sample = """
# Example Document

This is a short example demonstrating how the Recursive Markdown
Splitter works. It keeps paragraphs and sentence spacing in place.

Here is another paragraph to force paragraph-level splitting. This paragraph is intentionally longer to show how the chunker handles content that exceeds the chunk size limit.

## Code Example

```python
def hello_world():
    print("Hello, chunking world!")
```

This demonstrates that code blocks and formatting are preserved properly in our chunks.
"""

print("=" * 60)

# Test with different chunk sizes to see the behavior
chunk_sizes = [15, 25, 50]

for size in chunk_sizes:
    print(f"\nTesting with chunk size: {size} words")
    print("-" * 40)
    
    splitter = RecursiveMarkdownSplitter(chunk_size=size)
    
    chunks = splitter.chunk(sample)
    
    print(f"\nResults: {len(chunks)} chunks created")
    
    for i, chunk in enumerate(chunks, 1):
        word_count = splitter._count_words(chunk)
        print(f"\n--- Chunk {i} ({word_count} words) ---")
        print("Content:")
        print(chunk)


Testing with chunk size: 15 words
----------------------------------------
   RecursiveMarkdownSplitter initialized!
   Chunk size: 15 words
   Chunk overlap: 0 words
Found 6 paragraphs
text is small enough (3 words): Using as one chunk
Found 1 paragraphs
 Found 2 sentences
Sentence split complete: 2 chunks created
Found 1 paragraphs
 Found 2 sentences
Sentence split complete: 2 chunks created
text is small enough (10 words): Using as one chunk
text is small enough (13 words): Using as one chunk
Paragraph split complete: 7 chunks created

Results: 7 chunks created

--- Chunk 1 (3 words) ---
Content:

# Example Document

--- Chunk 2 (12 words) ---
Content:
This is a short example demonstrating how the Recursive Markdown
Splitter works. 

--- Chunk 3 (8 words) ---
Content:
It keeps paragraphs and sentence spacing in place.

--- Chunk 4 (8 words) ---
Content:
Here is another paragraph to force paragraph-level splitting. 

--- Chunk 5 (18 words) ---
Content:
This paragraph is intentionall

# Hierarchical Chunking

## What is Hierarchical Chunking?

This is a two-step chunking strategy, followed by an aggregation pass:

1. **First**: Split the document into logical sections based on Markdown headers
2. **Then**: If any section exceeds the max chunk size, apply a secondary chunking method that preserves LaTeX formulas and tables
3. **Finally**: Try to aggregate subsections into sections when possible
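
The aggregation pass in the last step can be sketched as a greedy merge. This is only one possible reading of it, assuming sections are plain strings in document order and size is counted in words; `merge_small_sections` and `max_words` are hypothetical names, and the sketch ignores header levels for simplicity.

```python
from typing import List

def merge_small_sections(sections: List[str], max_words: int) -> List[str]:
    """Greedily merge adjacent sections while the merged text stays within max_words."""
    merged: List[str] = []
    for section in sections:
        if merged and len((merged[-1] + " " + section).split()) <= max_words:
            merged[-1] = merged[-1] + "\n\n" + section  # fold into the previous section
        else:
            merged.append(section)  # start a new aggregated section
    return merged

sections = ["# A\nintro text", "## A.1\nmore text here", "## A.2\n" + "word " * 10]
print(len(merge_small_sections(sections, max_words=10)))  # 2
```

Here the first two sections (9 words together) are merged, while the third (12 words) stays on its own.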


In [3]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

levels = [1, 2, 3, 4, 5, 6]
headers_to_split_on = [("#" * level, level) for level in levels]
headers_to_split_on # we will be splitting the markdown on all these levels

[('#', 1), ('##', 2), ('###', 3), ('####', 4), ('#####', 5), ('######', 6)]

In [5]:
sample = """
# Example Document

This is a short example text.

## Code Example

```python
def hello_world():
    print("Hello, chunking world!")
```

This demonstrates that code blocks and formatting are preserved properly in our chunks.

### Comments Example

also make sure to add comments in your code
"""
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
sections = markdown_splitter.split_text(sample)

In [10]:
for idx, section in enumerate(sections):
    print(f"\n--- Section {idx + 1} ---")
    print(section)


--- Section 1 ---
page_content='# Example Document  
This is a short example text.' metadata={1: 'Example Document'}

--- Section 2 ---
page_content='## Code Example  
```python
def hello_world():
print("Hello, chunking world!")
```  
This demonstrates that code blocks and formatting are preserved properly in our chunks.' metadata={1: 'Example Document', 2: 'Code Example'}

--- Section 3 ---
page_content='### Comments Example  
also make sure to add comments in your code' metadata={1: 'Example Document', 2: 'Code Example', 3: 'Comments Example'}
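
Each section carries its header ancestry in `metadata`, keyed by the level values passed in `headers_to_split_on`. A small sketch of turning that mapping into a breadcrumb string that could be stored next to a chunk; it works on a plain dict shaped like the output above, and `header_path` is a hypothetical helper.

```python
from typing import Dict

def header_path(metadata: Dict[int, str], sep: str = " > ") -> str:
    """Join header titles from the shallowest to the deepest level."""
    return sep.join(metadata[level] for level in sorted(metadata))

meta = {1: "Example Document", 2: "Code Example", 3: "Comments Example"}
print(header_path(meta))  # Example Document > Code Example > Comments Example
```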


## Sentence Text Splitter

Now that the text is divided into meaningful sections, we need to split each section into smaller fragments, such as sentences, while preserving meaning.

Let's start with a utility that finds LaTeX environments.

In [18]:
import re

def find_latex_environments(text):
    """
    Identify all LaTeX environments in the text, handling nested environments correctly.

    Args:
        text: The text to analyze

    Returns:
        List of (start, end) tuples for all LaTeX environments
    """
    environments = []
    pos = 0

    while True:
        # Find the next \begin
        begin_pos = text.find("\\begin{", pos)
        if begin_pos == -1:
            break

        # Find the matching \end
        end_pos = find_matching_end(text, begin_pos)
        if end_pos == -1:
            # Skip this \begin if there's no matching \end
            pos = begin_pos + 6  # Move past "\begin"
            continue

        environments.append((begin_pos, end_pos))
        pos = end_pos

    return environments

def find_matching_end(text, begin_pos):
    """
    Find the matching \\end{...} for a \\begin{...} at the given position.
    Handles nested environments correctly.

    Args:
        text: The text to search in
        begin_pos: Position of the \\begin{...} command

    Returns:
        Position of the end of the matching \\end{...} command or -1 if not found
    """
    # Extract the environment name
    begin_match = re.search(r'\\begin\{([^}]+)\}', text[begin_pos:])
    if not begin_match:
        return -1

    env_name = begin_match.group(1)
    env_begin = f"\\begin{{{env_name}}}"
    env_end = f"\\end{{{env_name}}}"

    # Find the end of the current \begin command
    current_pos = begin_pos + len(env_begin)
    nesting_level = 1

    while nesting_level > 0 and current_pos < len(text):
        # Look for the next \begin or \end of the same environment
        begin_idx = text.find(env_begin, current_pos)
        end_idx = text.find(env_end, current_pos)

        # If no more begin/end tags, environment is not properly closed
        if end_idx == -1:
            return -1

        # If we find an end tag first or no more begin tags
        if begin_idx == -1 or end_idx < begin_idx:
            nesting_level -= 1
            current_pos = end_idx + len(env_end)
        else:
            nesting_level += 1
            current_pos = begin_idx + len(env_begin)

    return current_pos if nesting_level == 0 else -1

In [27]:
text = """Here is some text.

\\begin{figure}
    Some figure content.

    \\begin{center}
        This is centered text.
    \\end{center}

    More figure content.
\\end{figure}

End of document.
"""

envs = find_latex_environments(text)
# Print each detected environment using the spans the function returned
for start, end in envs:
    print(text[start:end])

\begin{figure}
    Some figure content.

    \begin{center}
        This is centered text.
    \end{center}

    More figure content.
\end{figure}


Along with LaTeX environments, we also need to preserve other segments, such as Markdown tables.

In [23]:
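def find_tables(text):
    """Return (start, end) spans of runs of two or more consecutive pipe-delimited lines."""
    markdown_tables = []
    # A first row followed by one or more further rows; every row must end with a newline.
    table_pattern = re.compile(r'(\|[^\n]+\|\n)((?:\|[^\n]+\|\n)+)')
    for match in table_pattern.finditer(text):
        markdown_tables.append((match.start(), match.end()))
    return markdown_tables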

In [30]:
text = """Here is some text before the table.

| Name | Age | City |
| John | 30  | NY   |
| Ana  | 22  | LA   |

Some text after the table.

| Product | Price |
| Apple   | 1.2   |
| Orange  | 0.8   |
"""

tables_idx = find_tables(text)
print(tables_idx)
print(text[37:103], text[132: 192])

[(37, 103), (132, 192)]
| Name | Age | City |
| John | 30  | NY   |
| Ana  | 22  | LA   |
 | Product | Price |
| Apple   | 1.2   |
| Orange  | 0.8   |



Put them together as a single utility:

In [33]:
def identify_preserved_spans(text):
    """
    Identify all spans in the text that should be preserved atomically.

    Args:
        text: The text to analyze

    Returns:
        List of (start, end) tuples for preserved spans
    """
    preserved_spans = []

    # Find LaTeX environments (tables, equations, etc.)
    preserved_spans.extend(find_latex_environments(text))

    # Find markdown tables that are not already inside a LaTeX environment
    for table_start, table_end in find_tables(text):
        is_inside_env = any(start <= table_start and table_end <= end
                            for start, end in preserved_spans)
        if not is_inside_env:
            preserved_spans.append((table_start, table_end))

    # Sort and merge overlapping spans
    if preserved_spans:
        preserved_spans.sort()
        merged_spans = []
        current_start, current_end = preserved_spans[0]

        for start, end in preserved_spans[1:]:
            if start <= current_end:  # Spans overlap
                current_end = max(current_end, end)
            else:  # No overlap
                merged_spans.append((current_start, current_end))
                current_start, current_end = start, end

        merged_spans.append((current_start, current_end))
        preserved_spans = merged_spans

    return preserved_spans
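
The sort-and-merge step at the end of `identify_preserved_spans` can be tried in isolation. A compact sketch of the same idea, with `merge_spans` as a hypothetical standalone name; like the loop above, it treats spans that touch (`start <= current_end`) as overlapping.

```python
from typing import List, Tuple

Span = Tuple[int, int]

def merge_spans(spans: List[Span]) -> List[Span]:
    """Sort (start, end) spans and merge any that overlap or touch."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Extend the previous span instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(merge_spans([(132, 192), (20, 166), (37, 103)]))  # [(20, 192)]
```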

Now we are ready to split text into chunks based on sentences while preserving LaTeX content and tables.

In [32]:
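import re
import nltk

# nltk.sent_tokenize needs the Punkt tokenizer data. Depending on the NLTK version,
# run nltk.download("punkt") or nltk.download("punkt_tab") once beforehand.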

def tokenize_with_protection(text):
    # Store patterns that should be protected
    protected_patterns = []

    # Find and replace LaTeX formulas with placeholders
    def replace_protected(match):
        protected_patterns.append(match.group(0))
        return f"PROTECTED_PLACEHOLDER_{len(protected_patterns) - 1}"

    # Pattern to match LaTeX formulas enclosed in \[ \] or $ $
    latex_pattern = r'\\\[.*?\\\]|\$.*?\$'

    # Pattern to match figure references like "Fig. 2:" or "Table 1."
    figure_pattern = r'(Fig\.|Figure|Tab\.|Table|Eq\.|Equation)\s+\d+[\.:][^\.]*?'

    # Combine patterns
    combined_pattern = f"({latex_pattern})|({figure_pattern})"

    # Replace protected elements with placeholders
    protected_text = re.sub(combined_pattern, replace_protected, text, flags=re.DOTALL)

    # Tokenize the protected text
    sentences = nltk.sent_tokenize(protected_text)

    # Restore protected elements. The regex reads the whole index, so
    # PROTECTED_PLACEHOLDER_1 is never confused with PROTECTED_PLACEHOLDER_10.
    def restore(match):
        return protected_patterns[int(match.group(1))]

    sentences = [re.sub(r"PROTECTED_PLACEHOLDER_(\d+)", restore, s) for s in sentences]

    return sentences
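
The protect-and-restore idea can be tried without NLTK. The sketch below masks inline `$...$` formulas, splits with a naive sentence pattern (an assumption standing in for `nltk.sent_tokenize`), and restores each formula by its placeholder index. Looking the index up through a regex that reads all digits keeps `PROTECTED_PLACEHOLDER_1` from also matching the start of `PROTECTED_PLACEHOLDER_10`. The names `mask_formulas` and `unmask` are hypothetical.

```python
import re
from typing import List, Tuple

PLACEHOLDER = re.compile(r"PROTECTED_PLACEHOLDER_(\d+)")

def mask_formulas(text: str) -> Tuple[str, List[str]]:
    """Replace each $...$ formula with an indexed placeholder."""
    stored: List[str] = []

    def _mask(match):
        stored.append(match.group(0))
        return f"PROTECTED_PLACEHOLDER_{len(stored) - 1}"

    return re.sub(r"\$.*?\$", _mask, text, flags=re.DOTALL), stored

def unmask(text: str, stored: List[str]) -> str:
    """Put the original formulas back, looking each one up by its full index."""
    return PLACEHOLDER.sub(lambda m: stored[int(m.group(1))], text)

text = "Let $a. b$ be given. Then $c = 1.5$ holds."
masked, stored = mask_formulas(text)
# Naive splitter used only for this demo: break after ., ! or ? followed by whitespace.
sentences = [unmask(s, stored) for s in re.split(r"(?<=[.!?])\s+", masked)]
print(sentences)  # ['Let $a. b$ be given.', 'Then $c = 1.5$ holds.']
```

Without the masking step, the period inside `$a. b$` would be treated as a sentence boundary.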

In [34]:
def split_text(text):
    preserved_spans = identify_preserved_spans(text)

    # Tokenize text into sentences, but skip the preserved spans
    sentences = []
    last_end = 0

    for start, end in preserved_spans:
        # Process text before the preserved span
        if start > last_end:
            before_text = text[last_end:start]
            if before_text.strip():
                # Split the text before the preserved span into sentences
                before_sentences = tokenize_with_protection(before_text)
                sentences.extend(before_sentences)

        # Add the preserved span as a single "sentence"
        sentences.append(text[start:end])
        last_end = end

    # Process text after the last preserved span
    if last_end < len(text):
        after_text = text[last_end:]
        if after_text.strip():
            after_sentences = tokenize_with_protection(after_text)
            sentences.extend(after_sentences)
    
    return sentences
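
`split_text` returns sentence-level fragments plus the preserved spans. To complete the secondary chunking step, these fragments still have to be grouped into chunks below the size limit. A minimal sketch of one greedy way to do that, assuming size is counted in words and that a fragment larger than the limit (for example a long table) is kept whole rather than broken; `pack_fragments` and `max_words` are hypothetical names.

```python
from typing import List

def pack_fragments(fragments: List[str], max_words: int) -> List[str]:
    """Greedily group consecutive fragments without exceeding max_words per chunk."""
    chunks: List[str] = []
    current: List[str] = []
    current_words = 0
    for fragment in fragments:
        n_words = len(fragment.split())
        if current and current_words + n_words > max_words:
            chunks.append("\n".join(current))  # close the chunk before it overflows
            current, current_words = [], 0
        current.append(fragment.strip())
        current_words += n_words
    if current:
        chunks.append("\n".join(current))
    return chunks

fragments = ["First sentence here.", "Second one.", "| a | b |\n| 1 | 2 |\n", "Last."]
for chunk in pack_fragments(fragments, max_words=6):
    print(repr(chunk))
```

Fragments are joined with a newline so that a table row always starts on its own line inside a chunk.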

In [37]:
text = """Here is some text before the table.

| Name | Age | City |
| John | 30  | NY   |
| Ana  | 22  | LA   |

Some text after the table.

| Product | Price |
| Apple   | 1.2   |
| Orange  | 0.8   |

Here is some text.

\\begin{figure}
    Some figure content.

    \\begin{center}
        This is centered text.
    \\end{center}

    More figure content.
\\end{figure}

End of document.
"""
sentences = split_text(text)
for idx, sent in enumerate(sentences):
    print(f"\n--- Chunk {idx + 1} ---")
    print(sent)


--- Chunk 1 ---
Here is some text before the table.

--- Chunk 2 ---
| Name | Age | City |
| John | 30  | NY   |
| Ana  | 22  | LA   |


--- Chunk 3 ---

Some text after the table.

--- Chunk 4 ---
| Product | Price |
| Apple   | 1.2   |
| Orange  | 0.8   |


--- Chunk 5 ---

Here is some text.

--- Chunk 6 ---
\begin{figure}
    Some figure content.

    \begin{center}
        This is centered text.
    \end{center}

    More figure content.
\end{figure}

--- Chunk 7 ---


End of document.
