# Document Splitting for RAG Systems

## Overview

This notebook explores **text splitting strategies** for Retrieval Augmented Generation (RAG) systems. When building RAG applications, effective document chunking is critical for:

1. **Context Window Management** - LLMs have token limits; we must fit relevant context within those constraints
2. **Semantic Coherence** - Chunks should contain complete thoughts/concepts for accurate retrieval
3. **Retrieval Quality** - Smaller, focused chunks improve embedding similarity matching
4. **Cost Optimization** - Efficient chunking reduces token usage and API costs

## Key Concepts

- **Chunk Size**: Maximum length of each text segment (characters or tokens)
- **Chunk Overlap**: Overlapping characters between adjacent chunks to preserve context at boundaries
- **Separators**: Hierarchical delimiters used to split text at natural boundaries (paragraphs, sentences, words)

## What We'll Cover

1. Character-based splitting strategies
2. Recursive splitting for semantic preservation
3. Token-aware splitting for LLM compatibility
4. Format-specific splitting (PDF, Markdown)

## Environment Setup

Initialize OpenAI client and load environment variables. The `tiktoken` library provides token counting capabilities aligned with OpenAI's tokenization.

In [1]:
import os
import openai
import tiktoken  # OpenAI's tokenizer for accurate token counting
from dotenv import load_dotenv, find_dotenv

# Load environment variables from .env file (contains OPENAI_API_KEY)
_ = load_dotenv(find_dotenv())

openai.api_key = os.environ['OPENAI_API_KEY']

## Text Splitter Imports

LangChain provides two primary text splitting strategies:

### RecursiveCharacterTextSplitter (Recommended)
- **Use Case**: General-purpose text splitting
- **Strategy**: Tries separators hierarchically (`\n\n` → `\n` → space → character)
- **Benefit**: Preserves semantic structure by keeping paragraphs/sentences intact when possible
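
To make the separator hierarchy concrete, here is a minimal pure-Python sketch of the idea. It is not LangChain's implementation: `recursive_split` is a hypothetical helper, and it leaves out overlap and whitespace stripping to keep the control flow visible.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitting (no overlap); not LangChain's code."""
    if len(text) <= chunk_size or not separators:
        return [text] if text else []

    # Use the first separator that occurs in the text; "" always applies.
    index = next(
        (i for i, s in enumerate(separators) if s == "" or s in text),
        len(separators) - 1,
    )
    sep, finer = separators[index], separators[index + 1:]
    pieces = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the open chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd", 7))  # ['aaa bbb', 'ccc ddd']
```

Paragraph breaks are tried first, so the two paragraphs land in separate chunks. Spaces and single characters are only used when a piece is still too long.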

### CharacterTextSplitter
- **Use Case**: Simple splitting with a single separator
- **Strategy**: Splits on one delimiter only
- **Benefit**: Predictable, deterministic behavior for structured data

In [2]:
from langchain_text_splitters import RecursiveCharacterTextSplitter, CharacterTextSplitter

## Chunking Parameters Configuration

**Design Decisions:**
- `chunk_size=26`: Small size for demonstration (production typically uses 500-2000)
- `chunk_overlap=4`: ~15% overlap ensures context continuity across chunk boundaries

**Why Overlap Matters:**
If a key concept spans a chunk boundary, overlap ensures both chunks contain sufficient context for retrieval.

In [3]:
# Small sizes for demonstration (production: 500-2000)
chunk_size = 26
chunk_overlap = 4  # ~15% overlap for context continuity

## Initialize Both Splitters

Creating instances with identical parameters to compare behavior side-by-side.

In [4]:
# Recursive splitter - tries separators hierarchically (recommended)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

# Character splitter - uses single separator (default: '\n\n')
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

## Experiment 1: Text Shorter Than Chunk Size

**Hypothesis**: When text length ≤ chunk_size, no splitting should occur.

**Test Data**: 26-character alphabet (exactly equals chunk_size)

In [5]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [6]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

**Result**: Single chunk returned - no splitting needed.

---

## Experiment 2: Text Exceeding Chunk Size

**Test Data**: 33 characters (exceeds chunk_size by 7)

**Expected Behavior**: Split into 2 chunks with 4-character overlap

In [7]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [8]:
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

**Result Analysis**:
- First chunk: `'abcdefghijklmnopqrstuvwxyz'` (26 chars)
- Second chunk: `'wxyzabcdefg'` (11 chars)
- Overlap: `'wxyz'` appears in both chunks (4 chars as configured)
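
The arithmetic behind this result can be checked with a plain sliding window. This is a sketch under the assumption that the text contains no usable separators, as in `text2`. The helper `sliding_window_chunks` is hypothetical and only mirrors the numbers above, not the library's internals.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size character windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_window_chunks("abcdefghijklmnopqrstuvwxyzabcdefg", 26, 4))
# ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
```

With `chunk_size=26` and `chunk_overlap=4`, each window advances 22 characters, so the second window starts at `'w'`.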

---

## Experiment 3: Space-Separated Text (RecursiveCharacterTextSplitter)

**Test Data**: Alphabet with spaces between each character (51 total chars)

**Key Difference**: RecursiveCharacterTextSplitter will try to split on spaces *before* falling back to splitting between arbitrary characters

In [9]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [10]:
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [11]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

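**Result**: `RecursiveCharacterTextSplitter` created 3 chunks, splitting at spaces rather than between arbitrary characters. `CharacterTextSplitter`, still using its default `\n\n` separator, returned the text as a single chunk.

Notice the overlap in the recursive output: `'l m'` appears in chunks 1-2, `'w x'` appears in chunks 2-3.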
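
The overlap seen in the recursive splitter's output can be approximated by merging separator-delimited pieces and carrying a short tail into the next chunk. The sketch below is a simplified stand-in, not LangChain's merge routine, and `merge_with_overlap` is a hypothetical name.

```python
def merge_with_overlap(pieces, sep, chunk_size, chunk_overlap):
    """Greedily join pieces into chunks, carrying a tail of at most
    `chunk_overlap` characters from one chunk into the next."""
    def joined_len(parts):
        return len(sep.join(parts))

    chunks, window = [], []
    for piece in pieces:
        if window and joined_len(window + [piece]) > chunk_size:
            chunks.append(sep.join(window))
            # Drop leading pieces until the tail fits the overlap budget
            # and the new piece fits alongside it.
            while window and (
                joined_len(window) > chunk_overlap
                or joined_len(window + [piece]) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(sep.join(window))
    return chunks


text = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(merge_with_overlap(text.split(" "), " ", chunk_size=26, chunk_overlap=4))
# ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
```

Only whole pieces are carried over, so the tail is `'l m'` (3 characters) rather than exactly 4: a third letter would push it to 5 characters.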

---

## Experiment 4: Space-Separated Text (CharacterTextSplitter - No Separator)

**Hypothesis**: Without specifying a separator, CharacterTextSplitter may behave differently.

**Expected**: The default separator (`\n\n`) does not occur in the text, so the text is returned unsplit; `CharacterTextSplitter` does not fall back to splitting by character count.

In [12]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [13]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

**Result**: With the default `\n\n` separator, `text3` was not split. The separator never occurs in it, so it came back as one chunk even though it exceeds `chunk_size`.

---

## Experiment 5: CharacterTextSplitter with Space Separator

Reconfiguring CharacterTextSplitter to split on spaces (matching RecursiveCharacterTextSplitter behavior).

In [14]:
len(some_text)

496

In [15]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

**Result**: With `separator=' '`, `CharacterTextSplitter` now produces the same chunks for `text3` as `RecursiveCharacterTextSplitter`: both split on spaces and keep the configured overlap.

**Key Insight**: RecursiveCharacterTextSplitter automates separator hierarchy; CharacterTextSplitter requires manual configuration.

---

## Real-World Example: Document Structure

This example demonstrates splitting structured prose with paragraphs, sentences, and words.

In [16]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [17]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

**Text Analysis**: 496 characters containing:
- Two paragraphs (separated by `\n\n`)
- Multiple sentences
- Natural language structure

Let's compare how splitters handle this at 450-character chunks.

In [18]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "\. ", " ", ""]
)
r_splitter.split_text(some_text)

  separators=["\n\n", "\n", "\. ", " ", ""]


["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [19]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]
)
r_splitter.split_text(some_text)

  separators=["\n\n", "\n", "(?<=\. )", " ", ""]


["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

## Reconfigure Splitters for Longer Text

- `chunk_size=450`: More realistic size for production
- `chunk_overlap=0`: Disabled to see clean splits
- `separators=["\n\n", "\n", " ", ""]`: RecursiveCharacterTextSplitter hierarchy

In [20]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("./99-DPDPA.pdf")
pages = loader.load()

## CharacterTextSplitter Result

**Problem Identified**: Splits mid-sentence (`"...but also,` / `have a space..."`)

This breaks semantic coherence - the chunk boundary cuts through a thought.

In [21]:
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [22]:
docs = text_splitter.split_documents(pages)

## RecursiveCharacterTextSplitter Result

**Improvement**: Splits at paragraph boundary (`\n\n` separator) instead of mid-sentence.

Each chunk now contains complete thoughts, preserving semantic integrity.

In [23]:
len(pages)

21

In [24]:
from langchain_text_splitters import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
docs = text_splitter.split_documents(pages) 
pages[0].metadata


{'producer': 'iTextSharp™ 5.5.13.1 ©2000-2019 iText Group NV (AGPL-version)',
 'creator': 'PyPDF',
 'creationdate': '2023-08-12T02:13:03+05:30',
 'moddate': '2023-08-12T02:14:35+05:30',
 'source': './99-DPDPA.pdf',
 'total_pages': 21,
 'page': 0,
 'page_label': '1'}

## Adding Sentence-Level Splitting

Reducing chunk_size to 150 and adding sentence separator (`\. ` - period + space).

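**Note**: `"\. "` is not a valid Python escape sequence, so Python emits a syntax warning; write it as a raw string, `r"\. "`. Separators are also matched as literal text unless `is_separator_regex=True` is passed, so in the run shown here the pattern never matches and the splitter falls back to spaces.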

In [25]:
from langchain_community.document_loaders import NotionDirectoryLoader
from langchain_text_splitters import MarkdownHeaderTextSplitter

In [26]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

**Result**: 4 smaller chunks. The boundaries fall on spaces rather than at sentence ends (the first chunk stops at `For example,`), and Python prints a syntax warning for the `\.` escape.

---

## Using Regex Lookbehind for Sentence Splitting

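`(?<=\. )` is a **positive lookbehind**: it matches the empty position right after a period and a space, so a regex split at that point keeps the period with the preceding sentence. `RecursiveCharacterTextSplitter` only interprets separators as regular expressions when `is_separator_regex=True` is set.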
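
The difference between a consuming pattern and a lookbehind is easiest to see with Python's own `re` module. This is a standalone illustration on a made-up sentence, independent of LangChain. Splitting on a zero-width match requires Python 3.7 or later.

```python
import re

text = "One. Two. Three."

# Consuming pattern: the period and space are removed from the pieces.
consuming = re.split(r"\. ", text)

# Lookbehind: zero-width match after ". ", so the separator stays attached.
lookbehind = re.split(r"(?<=\. )", text)

# Escaped pattern: treated as literal text, which never occurs in `text`.
literal = re.split(re.escape(r"(?<=\. )"), text)

print(consuming)   # ['One', 'Two', 'Three.']
print(lookbehind)  # ['One. ', 'Two. ', 'Three.']
print(literal)     # ['One. Two. Three.']
```

The last case shows what happens when a regex-looking separator is matched literally: nothing is split at that level.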

In [27]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [28]:
md_header_splits[0]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}, page_content='Hi this is Jim  \nHi this is Joe')

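**Result**: Identical chunking to the previous run. Without `is_separator_regex=True` the lookbehind is matched as literal text, never occurs in the input, and the splitter again falls back to spaces. The non-raw string still triggers a syntax warning in Python.

**Production Recommendation**: Use raw strings such as `r"(?<=\. )"` together with `is_separator_regex=True`, or rely on the default separators.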

---

## Real-World Application: PDF Document Splitting

Loading the DPDPA (Digital Personal Data Protection Act) PDF to demonstrate production-scale text splitting.

In [29]:
md_header_splits[1]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}, page_content='Hi this is Lance')

## Configure Character-Based Splitter for PDF

**Production Parameters:**
- `chunk_size=1000`: Balances context vs granularity
- `chunk_overlap=150`: 15% overlap for context preservation
- `separator="\n"`: Split on newlines (line breaks in the text extracted from the PDF)
- `length_function=len`: Count characters (not tokens yet)

## Split PDF Pages into Chunks

`split_documents()` processes Document objects (with metadata) rather than raw strings.

Metadata (page numbers, source file) is preserved in each chunk for traceability.
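
As a rough illustration of what metadata preservation means, the sketch below splits dictionary-based documents and copies each parent's metadata onto its chunks. The dictionaries and the `split_documents_sketch` helper are hypothetical stand-ins for LangChain's `Document` objects and `split_documents()`, and the sample page text is made up.

```python
def split_documents_sketch(documents, chunk_size):
    """Split each document's text and copy its metadata onto every chunk."""
    chunks = []
    for doc in documents:
        text = doc["page_content"]
        for start in range(0, len(text), chunk_size):
            chunks.append({
                "page_content": text[start:start + chunk_size],
                # Copy so later edits to one chunk do not affect the others.
                "metadata": dict(doc["metadata"]),
            })
    return chunks


pages_sketch = [
    {"page_content": "x" * 25, "metadata": {"source": "./99-DPDPA.pdf", "page": 0}},
    {"page_content": "y" * 5, "metadata": {"source": "./99-DPDPA.pdf", "page": 1}},
]
chunks_sketch = split_documents_sketch(pages_sketch, chunk_size=10)
print(len(chunks_sketch))            # 4
print(chunks_sketch[2]["metadata"])  # {'source': './99-DPDPA.pdf', 'page': 0}
```

Every chunk can be traced back to its page, which is what makes citations possible after retrieval.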

**Verification**: 21 pages loaded from PDF. After splitting, we'll have more chunks than pages.

---

## Token-Based Splitting

### Why Token Splitting Matters

Character count ≠ token count. Different encodings produce different token counts:
- "Hello" might be 1 token
- "🤗" might be 3+ tokens

**LLM Context Windows** measure capacity in **tokens**, not characters. For accurate context management, split by tokens.
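
One reason character counts are a poor proxy is that byte-level tokenizers work on UTF-8 bytes rather than on characters. The standard-library sketch below compares only characters with bytes. It does not call a tokenizer, so real token counts are an assumption left to `tiktoken`.

```python
samples = ["Hello", "🤗"]

# Map each sample to (character count, UTF-8 byte count).
sizes = {s: (len(s), len(s.encode("utf-8"))) for s in samples}

for sample, (chars, utf8_bytes) in sizes.items():
    print(f"{sample!r}: {chars} character(s), {utf8_bytes} UTF-8 byte(s)")
# 'Hello': 5 character(s), 5 UTF-8 byte(s)
# '🤗': 1 character(s), 4 UTF-8 byte(s)
```

A single emoji occupies four bytes, which is why it can cost several tokens while a common five-letter word may cost only one.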

### TokenTextSplitter

Uses `tiktoken` (OpenAI's tokenizer) to ensure chunks respect actual token limits.

**Demonstration**:
- `chunk_size=1`: Splits text into individual tokens
- Example: "foo bar bazzyfoo" → tokenized and inspected
- `chunk_size=10`: Deliberately tiny token-based chunks of the PDF pages, for demonstration
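
The idea of measuring chunks in tokens can be sketched without a real tokenizer. Here `str.split` stands in for the tokenizer, which is an assumption made only to keep the example dependency-free. A real tokenizer such as `tiktoken` would split "bazzyfoo" into sub-word pieces instead. `chunk_by_tokens` is a hypothetical helper.

```python
def chunk_by_tokens(text, chunk_size, chunk_overlap=0,
                    tokenize=str.split, detokenize=" ".join):
    """Window over tokens instead of characters (whitespace tokens here)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = tokenize(text)
    if not tokens:
        return []
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while True:
        chunks.append(detokenize(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window reached the final token
        start += step
    return chunks


print(chunk_by_tokens("foo bar bazzyfoo", chunk_size=1))
# ['foo', 'bar', 'bazzyfoo']
print(chunk_by_tokens("foo bar bazzyfoo", chunk_size=2, chunk_overlap=1))
# ['foo bar', 'bar bazzyfoo']
```

Swapping in a real tokenizer only requires passing different `tokenize` and `detokenize` callables. The windowing logic stays the same.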

**Result**: Metadata preserved, showing source page and PDF attributes.

---

## Format-Aware Splitting: Markdown

### Challenge

Markdown documents have hierarchical structure (headers, subheaders). Naive splitting can:
- Lose context (chunk missing its parent header)
- Break semantic boundaries mid-section

### Solution: MarkdownHeaderTextSplitter

Splits on header boundaries AND injects header hierarchy into chunk metadata.
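
A minimal sketch of header-aware splitting follows, assuming ATX-style headers (`#`, `##`, `###`) at the start of a line. It is a simplified illustration rather than `MarkdownHeaderTextSplitter` itself. The function name and the dictionary output format are hypothetical.

```python
def split_markdown_by_headers(text, headers_to_split_on):
    """Group body lines under the headers currently in effect."""
    active = {}  # header level -> (metadata key, header title)
    chunks, body = [], []

    def flush():
        content = "\n".join(body).strip()
        if content:
            metadata = {key: title for _, (key, title) in sorted(active.items())}
            chunks.append({"metadata": metadata, "content": content})
        body.clear()

    for raw in text.splitlines():
        line = raw.strip()
        for marker, key in headers_to_split_on:
            # The trailing space keeps "## x" from matching the "#" marker.
            if line.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header replaces any header at the same or deeper level.
                for stale in [lvl for lvl in active if lvl >= level]:
                    del active[stale]
                active[level] = (key, line[len(marker):].strip())
                break
        else:
            if line:
                body.append(line)
    flush()
    return chunks


doc = "\n\n".join([
    "# Title", "## Chapter 1", "Hi this is Jim", "Hi this is Joe",
    "### Section", "Hi this is Lance", "## Chapter 2", "Hi this is Molly",
])
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for chunk in split_markdown_by_headers(doc, headers):
    print(chunk)
```

When `## Chapter 2` starts, the `### Section` entry is dropped, so the last chunk carries only `Header 1` and `Header 2`.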

## Markdown Example Document

Structure:
```
# Title
  ## Chapter 1
    Hi this is Jim
    Hi this is Joe
    ### Section
      Hi this is Lance
  ## Chapter 2
    Hi this is Molly
```

**Configuration**: Map each header level to metadata keys for retrieval filtering.

## Split Markdown by Headers

Each resulting chunk includes metadata showing its position in the document hierarchy.

### Chunk 0 Analysis

**Content**: "Hi this is Jim  \nHi this is Joe"

**Metadata**: `{'Header 1': 'Title', 'Header 2': 'Chapter 1'}`

The chunk preserves hierarchical context - we know this content belongs to Title → Chapter 1.

### Chunk 1 Analysis

**Content**: "Hi this is Lance"

**Metadata**: `{'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}`

Full hierarchy preserved: Title → Chapter 1 → Section.

### Benefits for RAG

1. **Retrieval Filtering**: Query "Chapter 1" to get only relevant chunks
2. **Context Injection**: Prepend headers to chunk text during LLM calls
3. **Citation Tracing**: Source attribution down to section level
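
The first two benefits can be sketched on plain dictionaries that mirror the example's chunks. The helpers `filter_by_header` and `with_breadcrumb` are hypothetical, and the bracketed breadcrumb format is just one possible convention.

```python
sample_chunks = [
    {"metadata": {"Header 1": "Title", "Header 2": "Chapter 1"},
     "content": "Hi this is Jim\nHi this is Joe"},
    {"metadata": {"Header 1": "Title", "Header 2": "Chapter 1", "Header 3": "Section"},
     "content": "Hi this is Lance"},
    {"metadata": {"Header 1": "Title", "Header 2": "Chapter 2"},
     "content": "Hi this is Molly"},
]


def filter_by_header(chunks, key, value):
    """Retrieval filtering: keep chunks whose metadata has key == value."""
    return [c for c in chunks if c["metadata"].get(key) == value]


def with_breadcrumb(chunk):
    """Context injection: prepend the header trail to the chunk text."""
    trail = " > ".join(chunk["metadata"].values())
    return f"[{trail}]\n{chunk['content']}"


chapter_1 = filter_by_header(sample_chunks, "Header 2", "Chapter 1")
print(len(chapter_1))                 # 2
print(with_breadcrumb(chapter_1[1]))
# [Title > Chapter 1 > Section]
# Hi this is Lance
```

The same metadata also serves citation tracing, since the trail identifies the section a retrieved passage came from.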

---

## Summary & Production Recommendations

| Splitter Type | Use Case | Key Benefit |
|--------------|----------|-------------|
| **RecursiveCharacterTextSplitter** | General text, prose, documentation | Semantic coherence via hierarchical separators |
| **CharacterTextSplitter** | Structured data, single-separator needs | Simple, predictable |
| **TokenTextSplitter** | LLM context window management | Accurate token counting |
| **MarkdownHeaderTextSplitter** | Markdown docs, wikis, README files | Hierarchy preservation in metadata |

### Production Tuning Guidelines

- **chunk_size**: 500-2000 tokens (balance context vs granularity)
- **chunk_overlap**: 10-20% of chunk_size (context continuity)
- **separators**: Hierarchical (`\n\n`, `\n`, `.`, ` `) for RecursiveCharacterTextSplitter
- **Metadata**: Always preserve source, page, section for traceability
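
The overlap guideline is simple arithmetic, sketched here with a hypothetical helper `overlap_for`. The 15% default matches the values used earlier in this notebook and is not a library default.

```python
def overlap_for(chunk_size, percent=15):
    """Return an overlap equal to `percent` of chunk_size, rounded."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= percent < 100:
        raise ValueError("percent must be in [0, 100)")
    return round(chunk_size * percent / 100)


print(overlap_for(26))        # 4
print(overlap_for(1000))      # 150
print(overlap_for(1000, 10))  # 100
```

Both configurations used above, 4 of 26 and 150 of 1000, come out of the same 15% rule.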