# **Chunking Strategies**

## **What's Covered?**
1. Why Chunking Matters?
2. What is Chunking?
3. Visualize Chunking
4. How Chunking Works?
5. Choosing Chunk Size
6. Types of Text Splitters
7. Length based splitting - CharacterTextSplitter
8. Recursive Character Splitting
9. Recursive Character Splitting with tiktoken
10. Document Structure-Based Splitting
    - MarkdownHeaderTextSplitter
    - HTMLHeaderTextSplitter
    - RecursiveJsonSplitter
    - Code Splitter
11. Passing Metadata with Chunks


## **Why Chunking Matters?**
After you load a Document object from a source, you end up with strings by grabbing them from page_content. In some situations, these strings are too long to feed into a model, whether an embedding model or a chat model.

**Note: LLMs have a fixed maximum context window. If your document contains more tokens than the maximum context window size, the LLM won't be able to process it. So, it is important to break documents into chunks.**

In short:
- Good chunks → high retrieval accuracy
- Bad chunks → LLM hallucinations and incomplete answers
- As a rule of thumb, chunking quality accounts for a large share (roughly half) of RAG performance.

## **What is Chunking?**
Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is splitting a long document into smaller chunks that fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

**Note: Think of it like organizing a massive book into chapters and sections. When someone asks a question, you only retrieve the relevant chapters, not the entire book, making retrieval faster and more accurate.**

A good chunk:
- covers one clear idea or theme (not 5 unrelated things)
- is long enough to contain useful information
- is not too long for the embedding model
- fits the retrieval goal (FAQs, legal documents, and code need different chunk sizes)

## **Visualize Chunking**  
You can evaluate text splitters with the [Chunkviz utility](https://www.chunkviz.com/) created by Greg Kamradt. Chunkviz is a great tool for visualizing how your text splitter is working. It shows how your text is being split up and helps you tune the splitting parameters.

## **How Chunking Works?**
1. Split the text up into small, semantically meaningful chunks (often sentences).
2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

That means there are two different axes along which you can customize your text splitter:

1. How the text is split
2. How the chunk size is measured
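
The three steps above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `chunk_text`, the regex-based sentence split, and the `overlap_sentences` parameter are all assumptions made for this example, and size is measured with `len` (characters).

```python
import re

def chunk_text(text, chunk_size=80, overlap_sentences=1):
    """Hypothetical helper: merge sentences into chunks of about chunk_size characters."""
    # Step 1: split the text into small, meaningful pieces (here: sentences).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        # Step 2: keep combining pieces until the size limit would be exceeded.
        if current and len(candidate) > chunk_size:
            # Step 3: emit the chunk, then start the next one with some overlap.
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "Cats sleep a lot. Dogs love walks. Birds can fly. Fish swim in water."
for chunk in chunk_text(sample, chunk_size=40):
    print(repr(chunk))
```

Each chunk after the first starts with the last sentence of the previous chunk, which is the overlap. Because the overlap is carried over before the size check, a chunk in this sketch can slightly exceed `chunk_size`.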

## **Choosing Chunk Size**
- **Small chunks (250-500 tokens):** Better for specific factual questions, reduces noise in retrieval
- **Medium chunks (500-1000 tokens):** Balanced approach, good for general Q&A
- **Large chunks (1000+ tokens):** Preserves more context, better for complex reasoning but retrieves more text

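Pro tip: Start with **chunk_size=1000** and **chunk_overlap=200**, then adjust if retrieval quality is poor. Note that the character-based splitters below measure these values in characters, not tokens (for English text, 1000 characters is very roughly 250 tokens).

- chunk_size=1000: Target chunk size in characters. Adjust based on your model's context window.
- chunk_overlap=200: Number of characters shared between consecutive chunks. Prevents context loss at chunk boundaries.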
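
To make the two parameters concrete, here is a minimal sketch of a fixed-size sliding window over characters. The helper name `sliding_window_chunks` is hypothetical, and real splitters cut on separators rather than at arbitrary positions; the sketch only shows how `chunk_size` and `chunk_overlap` interact.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Hypothetical helper: cut text into windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    # Each new window starts (chunk_size - chunk_overlap) characters after the previous one.
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

A larger overlap means more windows and more duplicated text in the index, so it trades storage and retrieval noise for safety at chunk boundaries.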

## **Types of Text Splitters**
LangChain offers many different types of text splitters. These all live in the `langchain-text-splitters` package. They differ mainly in what they split on, whether they add metadata about where each chunk came from, and when they are recommended. The splitters covered in this notebook are grouped below.

#### **1. Length Based Splitting**
- CharacterTextSplitter()

#### **2. Recursive Character Splitting**
- RecursiveCharacterTextSplitter()
- RecursiveCharacterTextSplitter.from_tiktoken_encoder()

#### **3. Document Structure-Based Splitting**
- Markdown: MarkdownHeaderTextSplitter()
- HTML: HTMLHeaderTextSplitter()
- JSON: RecursiveJsonSplitter()
- Code: RecursiveCharacterTextSplitter.from_language()




## **Length based splitting - CharacterTextSplitter**

**When to use? - Use when you need simple, consistent chunk sizes.**

This is the simplest method. It splits on a single separator string (by default `"\n\n"`) and measures chunk length by number of characters.

It splits documents based purely on length, measured in either characters or tokens. This is straightforward but doesn't respect document structure, so it can split sentences or ideas midway.

1. **How the text is split:** by a single separator string.
2. **How the chunk size is measured:** by number of characters.

### **Loading the data**

In [86]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/misc_files/text_file.txt")

data = loader.load()

doc_content = data[0].page_content

print(doc_content)

This is pretty much
what's happened so far.

Ross was in love
with Rachel since forever.

Every time he tried to tell her,
something got in the way

Iike cats, Italian guys.

And finally, Chandler was,
like, "Forget about her."


### **Applying CharacterTextSplitter**

In [87]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=100,
    chunk_overlap=0
)

chunks = text_splitter.split_text(doc_content)

print("Type of chunks variable:", type(chunks))
print()
print("Type of each object inside the list:", type(chunks[0]))
print()
print("Total number of chunks inside the list:", len(chunks))

Type of chunks variable: <class 'list'>

Type of each object inside the list: <class 'str'>

Total number of chunks inside the list: 3


In [65]:
print("* Content of chunk:")
print(chunks[0])

* Content of chunk:
This is pretty much
what's happened so far.

Ross was in love
with Rachel since forever.


In [66]:
for chunk in chunks:
    print("Chunk Size: ", len(chunk))

Chunk Size:  88
Chunk Size:  83
Chunk Size:  52
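
The sizes above can be reproduced with a simplified pure-Python version of the split-and-merge logic. This is a sketch under assumptions, not the library's code: `split_by_separator` is a hypothetical name, there is no overlap handling, and the sample text is retyped from the output shown earlier.

```python
def split_by_separator(text, separator="\n\n", chunk_size=100):
    """Hypothetical helper: split on one separator, then greedily merge pieces."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would exceed the limit: close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

paragraphs = [
    "This is pretty much\nwhat's happened so far.",
    "Ross was in love\nwith Rachel since forever.",
    "Every time he tried to tell her,\nsomething got in the way",
    "Iike cats, Italian guys.",
    'And finally, Chandler was,\nlike, "Forget about her."',
]
sample_text = "\n\n".join(paragraphs)

for chunk in split_by_separator(sample_text, separator="\n\n", chunk_size=100):
    print("Chunk Size: ", len(chunk))
```

Because splitting happens only on `"\n\n"`, a single paragraph longer than `chunk_size` would still come out as one oversized chunk; the recursive splitter in the next section addresses that.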


## **Recursive Character Splitting**

**When to use? - Use this for most cases; it's the default strategy that works well out of the box.**

This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

**Note: The RecursiveCharacterTextSplitter intelligently splits text while preserving meaning. It works hierarchically: first tries to split at paragraphs, then sentences, then words, until chunks fit the size limit. This maintains natural language flow and semantic coherence.**

1. **How the text is split:** by list of characters.
2. **How the chunk size is measured:** by number of characters.
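
The hierarchy of separators can be illustrated with a short recursive function. This is a simplified sketch, not LangChain's implementation: the names `recursive_split` and `merge_parts` are hypothetical, and overlap is left out to keep the recursion easy to follow.

```python
def merge_parts(parts, separator, chunk_size):
    """Greedily join parts (each already <= chunk_size) back together."""
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + separator + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks

def recursive_split(text, separators, chunk_size):
    """Try each separator in order until every chunk fits in chunk_size characters."""
    if len(text) <= chunk_size:
        return [text]
    separator = separators[0] if separators else ""
    if separator == "":
        # Last resort: hard cut at fixed positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, small_parts = [], []
    for part in text.split(separator):
        if not part:
            continue
        if len(part) <= chunk_size:
            small_parts.append(part)
        else:
            # Flush what fits, then retry the oversized part with finer separators.
            chunks.extend(merge_parts(small_parts, separator, chunk_size))
            small_parts = []
            chunks.extend(recursive_split(part, separators[1:], chunk_size))
    chunks.extend(merge_parts(small_parts, separator, chunk_size))
    return chunks

sample = "one two three four\n\nfive six\n\nseven"
print(recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=12))
```

In the example, only the first paragraph is too long for 12 characters, so only that paragraph is broken down further on spaces; the other paragraphs stay intact.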

### **Loading the data**

In [67]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader('data/subtitles', glob="*.srt", show_progress=True, loader_cls=TextLoader)

data = loader.load()

print("Type of Data Variable: ", type(data))

print("Number of Documents:", len(data))

100%|█████████████████████████████████████████| 10/10 [00:00<00:00, 4146.21it/s]

Type of Data Variable:  <class 'list'>
Number of Documents: 10





In [68]:
doc_contents = [doc.page_content for doc in data]

### **Applying RecursiveCharacterTextSplitter**

In [69]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=500,
    chunk_overlap=50,
)

chunks = text_splitter.create_documents(doc_contents)

print("Type of variable:", type(chunks))
print()
print("Type of each object inside the list:", type(chunks[0]))
print()
print("Total number of documents inside list:", len(chunks))

Type of variable: <class 'list'>

Type of each object inside the list: <class 'langchain_core.documents.base.Document'>

Total number of documents inside list: 514


In [70]:
print("* Content of first chunk:")
print(chunks[0].page_content)

* Content of first chunk:
1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.

2
00:00:04,395 --> 00:00:07,179
Ross was in love
with Rachel since forever.

3
00:00:07,423 --> 00:00:10,437
Every time he tried to tell her,
something got in the way...

4
00:00:10,651 --> 00:00:12,529
...Iike cats, Italian guys.

5
00:00:12,736 --> 00:00:15,922
And finally, Chandler was,
like, "Forget about her."

6
00:00:16,166 --> 00:00:20,762
When Ross was in China, Chandler
let it slip that Ross loved Rachel.


In [71]:
sum_of_chunk_size = 0
for chunk in chunks:
    sum_of_chunk_size += len(chunk.page_content)

print("Average Chunk Size: ", sum_of_chunk_size/len(chunks))

Average Chunk Size:  459.02334630350197


## **Recursive Character Splitting with tiktoken**

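**When to use? - Use when chunks must respect a model's token limit. Here `chunk_size` and `chunk_overlap` are measured in tokens rather than characters, which is more precise for LLM context windows.**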

In [14]:
# !pip install tiktoken

In [78]:
# tiktoken is a fast BPE tokenizer created by OpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # OpenAI encoding
    chunk_size=100,               # measured in tokens, not characters
    chunk_overlap=20
)

chunks = text_splitter.create_documents(doc_contents)

print("Type of variable:", type(chunks))
print()
print("Type of each object inside the list:", type(chunks[0]))
print()
print("Total number of documents inside list:", len(chunks))

Type of variable: <class 'list'>

Type of each object inside the list: <class 'langchain_core.documents.base.Document'>

Total number of documents inside list: 1161


In [81]:
print("* Content of first chunk:")
print(chunks[0].page_content)

* Content of first chunk:
1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.

2
00:00:04,395 --> 00:00:07,179
Ross was in love
with Rachel since forever.

3
00:00:07,423 --> 00:00:10,437
Every time he tried to tell her,
something got in the way...
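
The only thing that changes with a token-based splitter is how chunk size is measured. The sketch below shows that idea with a pluggable length function. It is an illustration under assumptions: `merge_pieces` is a hypothetical helper, and `count_words` is a crude stand-in for a real tokenizer such as tiktoken, which would give different counts.

```python
def merge_pieces(pieces, chunk_size, length_function=len, joiner=" "):
    """Hypothetical helper: merge pieces until length_function exceeds chunk_size."""
    chunks, current = [], []
    for piece in pieces:
        candidate = joiner.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            chunks.append(joiner.join(current))
            current = []
        current.append(piece)
    if current:
        chunks.append(joiner.join(current))
    return chunks

def count_words(text):
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())

words = "Ross was in love with Rachel since forever".split()

# Size measured in characters (the default, len).
print(merge_pieces(words, chunk_size=20))
# Size measured in "tokens" via the stand-in counter.
print(merge_pieces(words, chunk_size=4, length_function=count_words))
```

The same text yields different chunk boundaries depending on the unit, which is why a `chunk_size` of 100 tokens above produced more, smaller chunks than 500 characters did earlier.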


## **Document Structure-Based Splitting**
Use when documents have inherent structure (Markdown, HTML, JSON, code).

These strategies leverage document structure to keep semantically related content together.

### **MarkdownHeaderTextSplitter**

For structured docs in Markdown. This preserves the document's hierarchical structure, so each chunk knows its context (e.g., "Section 2.1 under Chapter 3").
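
To show how header context can travel with each chunk, here is a minimal pure-Python sketch. It is not the library's implementation: `split_markdown_by_headers` is a hypothetical function, chunks are plain dicts, and headers inside fenced code blocks are not handled.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def split_markdown_by_headers(markdown, max_level=3):
    """Hypothetical helper: one chunk per section, with its header path as metadata."""
    chunks = []
    headers = {}   # header level -> title of the section currently open
    buffer = []    # body lines collected since the last header

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            metadata = {f"Header {level}": headers[level] for level in sorted(headers)}
            chunks.append({"page_content": content, "metadata": metadata})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new header closes every open header at the same or a deeper level.
            headers = {lvl: title for lvl, title in headers.items() if lvl < level}
            headers[level] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks

sample_md = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it."
for chunk in split_markdown_by_headers(sample_md):
    print(chunk)
```

Each chunk carries the chain of headers above it, so a retriever can later filter or display results by section.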

In [108]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/misc_files/langchain_chunking.md")

data = loader.load()

markdown_text = data[0].page_content

In [110]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3")
    ]
)

chunks = md_splitter.split_text(markdown_text)

print("Type of variable:", type(chunks))
print()
print("Type of each object inside the list:", type(chunks[0]))
print()
print("Total number of documents inside list:", len(chunks))

Type of variable: <class 'list'>

Type of each object inside the list: <class 'langchain_core.documents.base.Document'>

Total number of documents inside list: 5


In [111]:
print("* Content of first chunk:")
print(chunks[0].page_content)

* Content of first chunk:
Notice that after you load a Documnet object from a source, you end up with strings by grabbing them from page_content. In certain situations, the length of the strings may be too large to feed into a model, both embedding and chat model.  
**Note: LLMs have fixed maximum context window. If you document contains more token then the maximum context window size, LLMs won't be able to process it. So, it is important to break the documents into chunks.**  
In short:
- Good chunks → high retrieval accuracy
- Bad chunks → LLM hallucinations + incomplete answers
- Chunking is 50% of RAG performance.


In [112]:
for chunk in chunks:
    print("Chunk Size: ", len(chunk.page_content))

Chunk Size:  599
Chunk Size:  953
Chunk Size:  284
Chunk Size:  519
Chunk Size:  1758


### **HTMLHeaderTextSplitter**

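This works well for web pages. It splits on the header tags you list and attaches those headers as metadata to the text that follows them.

Handles:
- `<h1>`
- `<h2>`
- `<h3>`
- the paragraph text under each header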
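
The same idea can be sketched with only the standard library's `html.parser`. This is a simplified illustration, not how LangChain implements it: the class `HeaderChunker` and the function `split_html_by_headers` are hypothetical names, and text fragments are simply joined with spaces.

```python
from html.parser import HTMLParser

class HeaderChunker(HTMLParser):
    """Hypothetical parser: groups text under the most recent h1/h2/h3 headers."""
    HEADER_TAGS = {"h1": "Header 1", "h2": "Header 2", "h3": "Header 3"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.headers = {}      # tag name -> header text currently in effect
        self.in_header = None  # tag name while inside a header element
        self.header_text = []
        self.body_text = []

    def _flush(self):
        text = " ".join(self.body_text).strip()
        if text:
            metadata = {self.HEADER_TAGS[t]: self.headers[t] for t in sorted(self.headers)}
            self.chunks.append({"page_content": text, "metadata": metadata})
        self.body_text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADER_TAGS:
            self._flush()  # a new header closes the previous section
            self.in_header = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag == self.in_header:
            # Drop headers at the same or a deeper level, then record this one.
            self.headers = {t: v for t, v in self.headers.items() if t < tag}
            self.headers[tag] = " ".join(self.header_text).strip()
            self.in_header = None

    def handle_data(self, data):
        data = data.strip()
        if data:
            (self.header_text if self.in_header else self.body_text).append(data)

def split_html_by_headers(html):
    parser = HeaderChunker()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit the final section
    return parser.chunks

sample_html = "<h1>Site</h1><p>Welcome.</p><h2>About</h2><p>Some <b>bold</b> text.</p>"
for chunk in split_html_by_headers(sample_html):
    print(chunk)
```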

In [113]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/misc_files/web_page.html")

data = loader.load()

doc_content = data[0].page_content

# print(doc_content)

In [114]:
html_string = data[0].page_content

In [116]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[
        ("h1", "Header 1"),
        ("h2", "Header 2"),
        ("h3", "Header 3"),
    ]
)

chunks = html_splitter.split_text(html_string)

print("Type of variable:", type(chunks))
print()
print("Type of each object inside the list:", type(chunks[0]))
print()
print("Total number of documents inside list:", len(chunks))

Type of variable: <class 'list'>

Type of each object inside the list: <class 'langchain_core.documents.base.Document'>

Total number of documents inside list: 12


In [117]:
print("* Content of a chunk:")
print(chunks[3].page_content)
print(chunks[3].metadata)

* Content of a chunk:
ThatAIGuy's Avatar  
Hello! I'm ThatAIGuy, a developer obsessed with **Artificial Intelligence** and **Machine Learning**. My passion lies in creating tools that simplify complex tasks using cutting-edge models.  
Click for a quick quote...  
"The best way to predict the future is to invent it."  
—  
ThatAIGuy
{'Header 1': 'ThatAIGuy: The Future is Now', 'Header 2': '🤖 About ThatAIGuy'}


### **RecursiveJsonSplitter**

Splits JSON data into smaller, structured chunks while preserving hierarchy.

This class provides methods to split JSON data into smaller dictionaries or JSON-formatted strings based on configurable maximum and minimum chunk sizes. It supports nested JSON structures, optionally converts lists into dictionaries for better chunking, and allows the creation of document objects for further use.
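
The core idea, walking the nested structure and starting a new chunk when the serialized size would exceed the limit, can be sketched as follows. This is an illustration under assumptions, not the library's algorithm: `split_json_sketch` and `_set_nested` are hypothetical names, lists are treated as indivisible values, and a single value larger than the limit still ends up in its own oversized chunk.

```python
import copy
import json

def _set_nested(target, path, value):
    """Store value in target under the nested key path, creating dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value

def split_json_sketch(data, max_chunk_size, path=(), chunks=None):
    """Hypothetical helper: split a dict into dicts whose JSON text fits max_chunk_size."""
    chunks = [{}] if chunks is None else chunks
    for key, value in data.items():
        new_path = path + (key,)
        # Try adding the value (with its full key path) to the current chunk.
        candidate = copy.deepcopy(chunks[-1])
        _set_nested(candidate, new_path, value)
        if len(json.dumps(candidate)) <= max_chunk_size:
            chunks[-1] = candidate
        elif isinstance(value, dict) and value:
            # Too big as a whole: descend and place its children one by one.
            split_json_sketch(value, max_chunk_size, new_path, chunks)
        else:
            # A leaf that does not fit: start a new chunk for it.
            if chunks[-1]:
                chunks.append({})
            _set_nested(chunks[-1], new_path, value)
    return chunks

product = {
    "id": 1,
    "name": "Headphones",
    "price": {"base": 249.99, "currency": "USD"},
    "stock": {"level": 350, "available": True},
}
for chunk in split_json_sketch(product, max_chunk_size=50):
    print(json.dumps(chunk))
```

Every chunk keeps the full key path of the values it contains, which is what "preserving hierarchy" means here.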

In [158]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/misc_files/product_list.json")

data = loader.load()

doc_content = data[0].page_content

# print(doc_content)

### **Applying RecursiveJsonSplitter**

- **.split_json():** Expects a single JSON object (dict) and splits it into a list of smaller JSON chunks (dicts).
- **.create_documents():** Expects a list of JSON objects (dicts) and splits them into a list of Documents.

In [159]:
import json

prod_json = json.loads(doc_content)

print(type(prod_json))

print(prod_json[0])

<class 'list'>
{'product_id': 'ELC001', 'name': 'Noise-Cancelling Over-Ear Headphones', 'brand': 'AcoustiVox', 'category': 'Electronics', 'sku': 'AV-HP-NC2024', 'description': 'Premium over-ear headphones with industry-leading noise cancellation and 40-hour battery life. Perfect for travel and focus.', 'price_data': {'base_price': 249.99, 'currency': 'USD', 'is_on_sale': True, 'sale_price': 199.99}, 'inventory': {'stock_level': 350, 'is_available': True}, 'attributes': [{'key': 'Color', 'value': 'Midnight Black'}, {'key': 'Connection', 'value': 'Bluetooth 5.2'}], 'reviews': [{'user_id': 101, 'rating': 5, 'comment': 'Incredible sound quality and comfort!'}]}


In [206]:
from langchain_text_splitters import RecursiveJsonSplitter

json_splitter = RecursiveJsonSplitter(
    max_chunk_size=300
)

chunks = json_splitter.create_documents(prod_json)

print("Type of variable:", type(chunks))
print()
print("Type of each object inside the list:", type(chunks[0]))
print()
print("Total number of documents inside list:", len(chunks))

Type of variable: <class 'list'>

Type of each object inside the list: <class 'langchain_core.documents.base.Document'>

Total number of documents inside list: 28


In [207]:
print("* Content of a chunk:")
print(chunks[3].page_content)
print(chunks[3].metadata)

* Content of a chunk:
{"product_id": "FSH005", "name": "Organic Cotton Crewneck T-Shirt", "brand": "EcoThread", "category": "Apparel", "sku": "ET-TS-CRC005", "description": "Soft, breathable, and sustainably sourced cotton t-shirt. Available in multiple sizes."}
{}


### **Code Splitter**

For code, uses language-specific delimiters (functions, classes, etc.) instead of generic newlines.
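
As a rough illustration of a language-specific delimiter, the sketch below splits Python source at top-level `def` and `class` statements using a regular expression. It is a simplification and an assumption of this example (the function name `split_python_top_level` is hypothetical): it ignores chunk size and falls back to nothing finer, whereas the library combines such delimiters with the recursive strategy shown earlier. Splitting on an empty match requires Python 3.7 or later.

```python
import re

def split_python_top_level(source):
    """Hypothetical helper: one chunk per top-level def/class block."""
    # Zero-width split at the start of every line that begins with "def " or "class ".
    parts = re.split(r"^(?=def |class )", source, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]

sample_code = '''def fun(x):
    print(f"Input: {x}")
    return x**2

def main():
    fun(3)

main()
'''

for chunk in split_python_top_level(sample_code):
    print(chunk)
    print("---")
```

Unlike a generic newline split, each function body stays in one piece, so a retrieved chunk is more likely to be a complete, readable unit of code.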

In [175]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/misc_files/code.py")

data = loader.load()

doc_content = data[0].page_content

print(doc_content)

def fun(x):
    print(f"Input: {x}")
    return x**2

def main():
    fun(3)

main()


In [179]:
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Split code by language-specific delimiters
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=50,
    chunk_overlap=10
)

chunks = code_splitter.split_text(doc_content)

print("Type of variable:", type(chunks))
print()
print("Type of each object inside the list:", type(chunks[0]))
print()
print("Total number of documents inside list:", len(chunks))

Type of variable: <class 'list'>

Type of each object inside the list: <class 'str'>

Total number of documents inside list: 3


In [185]:
print("* Content of a chunk:")
print(chunks[0])

* Content of a chunk:
def fun(x):
    print(f"Input: {x}")


## **Passing Metadata with Chunks**

If there is more than one document, you should attach metadata to each chunk to identify which document it belongs to.

Here's an example of passing metadata along with the documents. Notice that each document's metadata is carried over to every chunk split from it.

We will use the **create_documents()** method to do this.
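
Conceptually, the metadata handling amounts to pairing each text with its metadata and copying that metadata onto every chunk. The sketch below shows this with plain Python; `Chunk` and `create_chunks` are hypothetical stand-ins for this example, not LangChain classes, and `split_fn` can be any function that turns a string into a list of strings.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Hypothetical stand-in for a document chunk."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def create_chunks(texts, metadatas, split_fn):
    """Split each text and copy its metadata onto every resulting chunk."""
    chunks = []
    for text, metadata in zip(texts, metadatas):
        for piece in split_fn(text):
            # Copy the dict so later edits to one chunk do not affect the others.
            chunks.append(Chunk(page_content=piece, metadata=dict(metadata)))
    return chunks

texts = ["first paragraph\n\nsecond paragraph", "only paragraph"]
metadatas = [{"source": "file_1.txt"}, {"source": "file_2.txt"}]

for chunk in create_chunks(texts, metadatas, split_fn=lambda t: t.split("\n\n")):
    print(chunk.metadata["source"], "->", repr(chunk.page_content))
```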

In [7]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader('data/text', glob="*.txt", show_progress=True, loader_cls=TextLoader)

data = loader.load()

print("Type of Data Variable: ", type(data))

print("Number of Documents:", len(data))

100%|███████████████████████████████████████████| 2/2 [00:00<00:00, 2397.43it/s]

Type of Data Variable:  <class 'list'>
Number of Documents: 2





In [8]:
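# Alternatively, pass custom metadata, e.g. metadatas = [{"document": 1}, {"document": 2}]
doc_contents = [doc.page_content for doc in data]
meta_datas = [doc.metadata for doc in data]

# text_splitter is a splitter created earlier in the notebook
chunks = text_splitter.create_documents(doc_contents, meta_datas)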

print("Type of variable:", type(chunks))
print()
print("Type of each object inside the list:", type(chunks[0]))
print()
print("Total number of documents inside chunks:", len(chunks))
print()
print("Content of chunks:", chunks)

Type of variable: <class 'list'>

Type of each object inside the list: <class 'langchain_core.documents.base.Document'>

Total number of documents inside chunks: 3

Content of chunks: [Document(metadata={'source': 'data/text/file_1.txt'}, page_content="This is pretty much\nwhat's happened so far.\n\nRoss was in love\nwith Rachel since forever.\n\nEvery time he tried to tell her,\nsomething got in the way\n\nIike cats, Italian guys."), Document(metadata={'source': 'data/text/file_1.txt'}, page_content='Iike cats, Italian guys.\n\nAnd finally, Chandler was,\nlike, "Forget about her."'), Document(metadata={'source': 'data/text/file_2.txt'}, page_content="These were unbelievely expensive\nand he'll grow out of them\n\nin 20 minutes,\nbut I couldn't resist!\n\nLook at these.")]
