# Document Chunking Strategies

This notebook demonstrates five document chunking strategies using LangChain.


## 1. Character Text Splitting

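Basic splitting: the text is split on a single separator, and the resulting pieces are merged into chunks of up to `chunk_size` characters, with about `chunk_overlap` characters shared between neighbouring chunks.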


In [1]:
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader

# Load the text file
loader = TextLoader('SteveJobsSpeech.txt')
docs = loader.load()
text = docs[0].page_content

print(f"Original text length: {len(text)} characters")
print(f"First 200 characters: {text[:200]}...")


Original text length: 12098 characters
First 200 characters: ‘You’ve got to find what you love,’ Jobs says

I’m honored to be with you today for your commencement from one of the finest universities in the world. Truth be told, I never graduated from college. A...


In [2]:
# Character-based splitting
char_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=100
)

char_chunks = char_splitter.split_text(text)
print(f"Number of chunks: {len(char_chunks)}")
print(f"First chunk length: {len(char_chunks[0])} characters")
print(f"First chunk: {char_chunks[0][:300]}...")


Number of chunks: 14
First chunk length: 858 characters
First chunk: ‘You’ve got to find what you love,’ Jobs says

I’m honored to be with you today for your commencement from one of the finest universities in the world. Truth be told, I never graduated from college. And this is the closest I’ve ever gotten to a college graduation.

Today I want to tell you three sto...
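
To make the roles of `separator`, `chunk_size`, and `chunk_overlap` concrete, here is a minimal pure-Python sketch of the split-then-merge idea. It is an illustration under simplifying assumptions, not LangChain's implementation; the function name `split_on_separator` and the toy input are made up for this example.

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000, chunk_overlap=100):
    """Split on one separator, then greedily merge pieces up to chunk_size.

    Simplified sketch: overlap is carried over in whole pieces, and a piece
    longer than chunk_size is kept intact rather than cut.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], []

    def length(parts):
        # Length of the chunk these parts would form once joined.
        return len(separator.join(parts))

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until what remains fits the overlap budget
            # and leaves room for the incoming piece.
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
print(split_on_separator(sample, chunk_size=10, chunk_overlap=4))
# ['aaaa\n\nbbbb', 'bbbb\n\ncccc', 'cccc\n\ndddd']
```

Because whole pieces are never cut in this scheme, chunks usually come out shorter than `chunk_size`, which is consistent with the 858-character first chunk in the output above.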


## 2. Token-based Chunking

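Splitting based on token count rather than character count, using tiktoken for tokenization. Chunk boundaries fall on token positions, so a chunk can begin or end mid-word.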


In [13]:
from langchain.text_splitter import TokenTextSplitter

# Token-based splitting
token_splitter = TokenTextSplitter(
    chunk_size=200,
    chunk_overlap=20
)

token_chunks = token_splitter.split_text(text)
print(f"Number of token-based chunks: {len(token_chunks)}")
print(f"Sixth token chunk: {token_chunks[5][:500]}...")


Number of token-based chunks: 17
Sixth token chunk: y class, and personal computers might not have the wonderful typography that they do. Of course, it was impossible to connect the dots looking forward when I was in college. But it was very, very clear looking backwards, ten years later. Again, you can’t connect the dots looking forward. You can only connect them looking backwards, so you have to trust that the dots will somehow connect in your future.

You have to trust in something: your gut, destiny, life, karma, whatever. Because believing t...
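
The windowing behind token-based chunking can be sketched without any tokenizer: given a list of tokens, take windows of `chunk_size` tokens and advance by `chunk_size - chunk_overlap` each time. The sketch below is an assumption-laden illustration, not LangChain's code; `sliding_windows` is a hypothetical helper, and whitespace-separated words stand in for real tiktoken tokens.

```python
def sliding_windows(tokens, chunk_size=200, chunk_overlap=20):
    """Return overlapping windows over a token list.

    Each window holds up to chunk_size tokens and starts
    chunk_size - chunk_overlap tokens after the previous one.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Words stand in for tokens here; a real tokenizer yields sub-word units.
words = "one two three four five six seven eight nine ten".split()
for window in sliding_windows(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(window))
# one two three four
# four five six seven
# seven eight nine ten
```

With the settings used in the cell above (`chunk_size=200`, `chunk_overlap=20`), each window would start 180 tokens after the previous one.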


## 3. Recursive Character Text Splitting

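Structure-aware splitting that tries a list of separators in order (paragraph breaks, then line breaks, then spaces, then individual characters), falling back to the next separator only when a piece is still larger than `chunk_size`.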


In [14]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Recursive character splitting
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n", " ", ""]
)

recursive_chunks = recursive_splitter.split_text(text)
print(f"Number of recursive chunks: {len(recursive_chunks)}")
print(f"First recursive chunk: {recursive_chunks[0][:300]}...")


Number of recursive chunks: 14
First recursive chunk: ‘You’ve got to find what you love,’ Jobs says

I’m honored to be with you today for your commencement from one of the finest universities in the world. Truth be told, I never graduated from college. And this is the closest I’ve ever gotten to a college graduation.

Today I want to tell you three sto...
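
The fallback through the `separators` list can be sketched as a short recursive function. This is a simplified illustration that ignores `chunk_overlap` and is not LangChain's implementation; the name `recursive_split` and the sample text are assumptions made for the example.

```python
def recursive_split(text, chunk_size=1000, separators=("\n\n", "\n", " ", "")):
    """Split text with the first separator; recurse with the next one
    only for pieces that are still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []

    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: fixed-width character slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest or ("",)))
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\n\ngamma delta epsilon\n\nzeta"
print(recursive_split(sample, chunk_size=12))
# ['alpha beta', 'gamma delta', 'epsilon', 'zeta']
```

In the toy input only the middle paragraph exceeds the limit, so only that paragraph is broken down further, on spaces; the other paragraphs stay whole.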


## 4. Markdown Header Text Splitting

Splits a markdown document at the configured header levels and records the titles of the enclosing headers in each chunk's metadata.


In [15]:
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Load markdown file
md_loader = TextLoader('examplemdfile.md')
md_docs = md_loader.load()
md_text = md_docs[0].page_content

# Markdown header splitting
headers_to_split_on = [
    ("#", "Header1"),
    ("##", "Header2"),
    ("###", "Header3"),
]

md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_chunks = md_splitter.split_text(md_text)

print(f"Number of markdown chunks: {len(md_chunks)}")
for i, chunk in enumerate(md_chunks[:3]):
    print(f"\nChunk {i+1}:")
    print(f"Metadata: {chunk.metadata}")
    print(f"Content preview: {chunk.page_content[:200]}...")


Number of markdown chunks: 20

Chunk 1:
Metadata: {}
Content preview: ---
__Advertisement :)__  
- __[pica](https://nodeca.github.io/pica/demo/)__ - high quality and fast image
resize in browser.
- __[babelfish](https://github.com/nodeca/babelfish/)__ - developer friend...

Chunk 2:
Metadata: {'Header1': 'h1 Heading 8-)', 'Header2': 'h2 Heading', 'Header3': 'h3 Heading'}
Content preview: #### h4 Heading
##### h5 Heading
###### h6 Heading...

Chunk 3:
Metadata: {'Header1': 'h1 Heading 8-)', 'Header2': 'Horizontal Rules'}
Content preview: ___  
---  
***...
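
The way header titles end up in each chunk's metadata can be sketched in a few lines of plain Python. This is a rough approximation for illustration only, not the `MarkdownHeaderTextSplitter` implementation: the function name `split_by_headers` is hypothetical, chunks are plain dicts rather than `Document` objects, and fenced code blocks are not treated specially.

```python
def split_by_headers(md_text, headers_to_split_on):
    """Split markdown at the given header markers, tracking active headers."""
    # Test longer markers first so "###" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    chunks = []
    active = {}  # header level -> (metadata key, header title)
    body = []

    def flush():
        content = "\n".join(body).strip()
        if content:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            chunks.append({"metadata": metadata, "content": content})
        body.clear()

    for line in md_text.splitlines():
        stripped = line.strip()
        for marker, name in markers:
            if stripped.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes all headers at the same or deeper level.
                for lvl in [l for l in active if l >= level]:
                    del active[lvl]
                active[level] = (name, stripped[len(marker):].strip())
                break
        else:
            body.append(line)  # not a tracked header: ordinary content
    flush()
    return chunks


md = "intro\n# A\ntext a\n## B\ntext b\n#### deep\n# C\ntext c"
headers = [("#", "Header1"), ("##", "Header2"), ("###", "Header3")]
for chunk in split_by_headers(md, headers):
    print(chunk)
```

Headers that are not listed in `headers_to_split_on` (such as `####` here) stay in the chunk content, matching what Chunk 2 in the output above shows.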


## 5. Semantic Chunking

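Chunking based on semantic similarity: sentences are embedded with OpenAI embeddings, and a new chunk starts where consecutive sentences are sufficiently dissimilar.

**Note:** Requires an OpenAI API key, either set in the `OPENAI_API_KEY` environment variable or passed to `OpenAIEmbeddings`.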


In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
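import os

# Initialize embeddings; the API key is read from the OPENAI_API_KEY environment variable.
# If it is not set there, set it before this line, e.g. os.environ["OPENAI_API_KEY"] = "..."
embeddings = OpenAIEmbeddings()

# Create semantic chunker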
semantic_chunker = SemanticChunker(embeddings)

# Split text semantically (first 2000 characters only, for the demo)
semantic_chunks = semantic_chunker.split_text(text[:2000])

print(f"Number of semantic chunks: {len(semantic_chunks)}")
for i, chunk in enumerate(semantic_chunks):
    print(f"\nSemantic Chunk {i+1} (length: {len(chunk)}):")
    print(f"{chunk[:200]}...")
