# Lesson 3.2: Text Splitters

---

In the previous lesson, we learned how to load data from various sources into LangChain as `Document` objects. However, these documents can be very long, exceeding the token limits of Large Language Models (LLMs) or reducing the efficiency of search tasks. This is where **Text Splitters** come into play.

## 1. Why is Splitting Large Texts (Chunking) Necessary?

### 1.1. LLM Token Limits

Each LLM has a limit on the number of tokens (words or parts of words) it can process in a single API call. If a document is too long, you cannot send its entire content to the LLM.

* **Problems:**
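    * **Limit-exceeded errors:** The API rejects a request whose prompt does not fit in the model's context window (the maximum number of tokens the model can handle in one call, input and output combined).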
    * **High cost:** Even if not exceeding the limit, sending too many tokens will be more expensive.
    * **Reduced performance:** The LLM can be "overwhelmed" by large amounts of information, leading to less accurate or irrelevant responses.
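
To make the limit concrete, here is a minimal sketch of a pre-flight check that estimates whether a text fits in a model's context window. The ratio of roughly 4 characters per token is only a rule of thumb for English text, and the names `estimate_tokens`, `fits_in_context`, and `reserved_for_output` are hypothetical helpers for this illustration; a real tokenizer such as `tiktoken` gives exact counts.

```python
import math

# Assumption: ~4 characters per token, a rough rule of thumb for English text.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate; use a real tokenizer when exact counts matter."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text: str, context_window: int, reserved_for_output: int = 0) -> bool:
    """Check whether the prompt plus the tokens reserved for the answer fit."""
    return estimate_tokens(text) + reserved_for_output <= context_window


# A 40,000-character document (the sentence below is 40 characters long).
document = "LangChain helps build LLM applications. " * 1000

print("Estimated tokens:", estimate_tokens(document))
print("Fits in an 8,000-token window:",
      fits_in_context(document, context_window=8000, reserved_for_output=500))
```

A check like this is cheap to run before every call and tells you early when a document needs to be split.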



### 1.2. Search Optimization (Retrieval)

In RAG (Retrieval-Augmented Generation) systems, when you search for relevant information, you want to retrieve small, precise text snippets containing the necessary context. If you store entire long documents in a vector database, searching will be less efficient:

* **Low accuracy:** A long document might contain multiple topics. When searching for a specific topic, you might retrieve the entire document, including irrelevant parts.
* **Costly and slow:** Embedding and searching on large text blocks will consume more resources and time.
* **"Lost in the middle":** Even if a large document is retrieved, the LLM might miss crucial information if it's located in the middle of a very long text segment.
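
The accuracy point can be illustrated with a toy retrieval experiment. The sketch below uses a bag-of-words cosine similarity as a stand-in for real embeddings (an assumption made so the example runs without any model), and the sample paragraphs and query are invented for illustration. It compares the score of the whole document with the score of each paragraph taken as its own chunk.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude substitute for an embedding vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


paragraphs = [
    "Refunds are issued within 14 days of the return request.",
    "Shipping takes 3 to 5 business days for domestic orders.",
    "Support is available by email on weekdays.",
]
whole_document = " ".join(paragraphs)
query = "how many days do refunds take"

query_vector = bag_of_words(query)
document_score = cosine(query_vector, bag_of_words(whole_document))
chunk_scores = [cosine(query_vector, bag_of_words(p)) for p in paragraphs]
best_index = max(range(len(paragraphs)), key=lambda i: chunk_scores[i])

print(f"Whole document score: {document_score:.3f}")
for paragraph, score in zip(paragraphs, chunk_scores):
    print(f"{score:.3f}  {paragraph}")
print("Best chunk:", paragraphs[best_index])
```

In this toy setup the paragraph about refunds scores higher than the document as a whole, and retrieving it returns only the relevant text rather than the shipping and support details around it.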

### 1.3. The Role of Text Splitters (Chunking)

**Text Splitters** are tools in LangChain that help break down large documents into smaller text segments, called **chunks**. This process is known as **chunking**.

* **Benefits:**
    * **Adhere to token limits:** Ensures each chunk is small enough to fit within the LLM's token limit.
    * **Improve search:** Smaller chunks enable more precise semantic search, as each chunk focuses on a specific topic or idea.
    * **Optimize cost:** Reduces the number of tokens needed to send to the LLM for each query.
    * **Easy to manage:** Smaller chunks are easier to store, embed, and process.




---

## 2. Text Splitting Strategies

Splitting text is not just about cutting strings by a certain number of characters. The goal is to split text "intelligently" so that chunks still retain semantic meaning. LangChain offers various text splitting strategies.
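
To see why a naive approach falls short, consider a minimal sketch that cuts a string every `chunk_size` characters. The function name `naive_split` is hypothetical and this is not a LangChain API; it only shows what happens when boundaries are ignored.

```python
def naive_split(text: str, chunk_size: int) -> list[str]:
    """Cut the text every chunk_size characters, ignoring word and sentence boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


text = "LangChain is a powerful framework. It helps build LLM applications."

for chunk in naive_split(text, 20):
    print(repr(chunk))
```

The first chunk ends in the middle of the word "powerful", which is exactly the kind of cut the splitters below try to avoid.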

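To run the examples below, make sure the `langchain` library and any other required dependencies are installed. Depending on your LangChain version, the splitter classes may need to be imported from the `langchain_text_splitters` package instead of `langchain.text_splitter`.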

```bash
# Install the libraries if they are not already installed:
pip install langchain
pip install tiktoken  # To estimate tokens for some splitters
```

### 2.1. `CharacterTextSplitter`: Splitting by Character

* **Concept:** This is the simplest splitter. It splits text on a single separator (default is `"\n\n"`, i.e. paragraph breaks) and then merges the resulting pieces into chunks as close to `chunk_size` as possible. Because it only ever cuts at the separator, a piece that is longer than `chunk_size` is kept whole, so chunks can exceed the limit.
* **When to Use:** When you need a basic splitting method and are not overly concerned with preserving complex semantic structure.
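
The split-then-merge idea behind a single-separator splitter can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `split_on_separator` is hypothetical, and chunk overlap is left out to keep the logic short.

```python
def split_on_separator(text: str, separator: str = "\n\n", chunk_size: int = 100) -> list[str]:
    """Split on one separator, then greedily merge pieces up to chunk_size."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
        else:
            if current:
                chunks.append(current)
            current = piece  # may be longer than chunk_size on its own
    if current:
        chunks.append(current)
    return chunks


sample = "Para one.\n\nPara two.\n\n" + "x" * 30
for chunk in split_on_separator(sample, chunk_size=25):
    print(len(chunk), repr(chunk))
```

The two short paragraphs are merged into one chunk, while the 30-character piece stays whole even though it is longer than `chunk_size`, because there is no separator inside it to cut on.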

```python
from langchain.text_splitter import CharacterTextSplitter

text = """
Artificial intelligence (AI) is changing the world.
It includes machine learning and deep learning.

LangChain is a powerful framework.
It helps build LLM applications.
"""

# Initialize CharacterTextSplitter
# chunk_size: target maximum size of each chunk (in characters)
# chunk_overlap: number of overlapping characters between adjacent chunks
splitter = CharacterTextSplitter(
    separator="\n\n", # Split on paragraph breaks
    chunk_size=50,
    chunk_overlap=0,
    length_function=len, # Function to calculate length (default is len)
    is_separator_regex=False # Treat the separator as a literal string, not a regex
)

# Note: both paragraphs are longer than 50 characters. Since this splitter only cuts
# at the separator, each paragraph becomes one chunk that exceeds chunk_size.
chunks = splitter.split_text(text)

print("--- CharacterTextSplitter ---")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} (len={len(chunk)}):\n'{chunk}'\n---")
```

### 2.2. `RecursiveCharacterTextSplitter`: Recursive Splitting, Attempting to Maintain Meaningful Chunks

* **Concept:** This is the recommended and most commonly used splitter. It works through a list of separators (default is `["\n\n", "\n", " ", ""]`) from coarsest to finest. It first splits by the first separator; any piece that is still larger than `chunk_size` is split again with the next separator, and so on, while small pieces are merged back together up to `chunk_size`. This helps maintain context by avoiding cuts in the middle of a sentence or paragraph wherever possible.
* **When to Use:** Most cases, especially when you need to preserve the context and structure of the text.
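
The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not LangChain's actual implementation: the function name `recursive_split` is hypothetical, overlap is omitted, and separators are dropped at chunk boundaries.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split with the coarsest separator first; recurse into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: try again with a finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


sample = ("First paragraph is short.\n\n"
          "Second paragraph is quite a bit longer than the first one and needs more splitting.")

for chunk in recursive_split(sample, chunk_size=40):
    print(len(chunk), repr(chunk))
```

The short first paragraph survives as a single chunk, while the long second paragraph falls through to word-level splitting, so no word is cut in half.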

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """
# Main Article Title

## Introduction

This is a long paragraph about the importance of artificial intelligence in modern life. AI is changing the way we work, learn, and entertain ourselves. It is not just a technology but also a driving force for innovation globally.

### Benefits of AI

AI brings many significant benefits, from automating repetitive tasks to providing deep insights from big data. It helps improve efficiency, reduce costs, and unlock new possibilities that were previously unattainable.

## Conclusion

In summary, AI is a promising field with immense potential. Understanding and applying AI correctly will be key to future development.
"""

# Initialize RecursiveCharacterTextSplitter
# It will try to split by "\n\n", then "\n", then " ", then ""
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200, # Maximum size of each chunk
    chunk_overlap=20, # Number of overlapping characters between chunks
    length_function=len,
    add_start_index=True # Add start index of chunk to metadata
)

# create_documents() returns Document objects, so metadata such as start_index is available.
# split_text() would return plain strings, which have no metadata or page_content.
chunks = splitter.create_documents([text])

print("--- RecursiveCharacterTextSplitter ---")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} (len={len(chunk.page_content)}, start_index={chunk.metadata.get('start_index')}):\n'{chunk.page_content}'\n---")
```

**Explanation:**
* `chunk_size`: The maximum size of each chunk, measured with `length_function`.
* `chunk_overlap`: The amount of text (in the same unit as `chunk_size`) that adjacent chunks share. This helps maintain context when an idea spans a chunk boundary.
* `length_function`: The function used to calculate the length of a chunk. The default is `len` (number of characters). For LLMs, you might want to use a token-counting function instead.
* `add_start_index`: Adds the starting index of the chunk in the original document to the metadata, useful for debugging or provenance.
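
The effect of `length_function` is easiest to see with a small experiment. The sketch below is not LangChain code: `pack_words` is a hypothetical greedy word packer, and `approx_token_length` is a crude stand-in for a real tokenizer (it counts words and punctuation marks). The same `chunk_size` value means something quite different depending on which length function is plugged in.

```python
import re


def approx_token_length(text: str) -> int:
    """Assumption: one token per word or punctuation mark (a real tokenizer differs)."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def pack_words(text: str, chunk_size: int, length_function=len) -> list[str]:
    """Greedily pack whole words into chunks whose measured length stays within chunk_size."""
    chunks: list[str] = []
    current: list[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and length_function(candidate) > chunk_size:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


sentence = "AI is changing the way we work, learn, and entertain ourselves."

print("chunk_size=20 characters:", pack_words(sentence, 20))
print("chunk_size=20 tokens:    ", pack_words(sentence, 20, approx_token_length))
print("chunk_size=5 tokens:     ", pack_words(sentence, 5, approx_token_length))
```

With characters, a limit of 20 produces several small chunks; with the approximate token counter, the whole sentence fits in a single chunk of the same nominal size.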

### 2.3. Specialized Splitters: `MarkdownTextSplitter`, `PythonCodeTextSplitter`

LangChain also provides splitters designed to understand the semantic structure or specific formatting of certain text types.

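* **Concept:** These splitters use knowledge of the language's or format's syntax (e.g., Markdown, Python) to split text at structural boundaries, trying not to cut in the middle of code blocks, headings, lists, etc. `MarkdownTextSplitter` and `PythonCodeTextSplitter` build on `RecursiveCharacterTextSplitter` with format-specific separators, so a block that is larger than `chunk_size` can still be cut.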
* **When to Use:** When you are working with clearly structured document types like source code, Markdown documents, HTML, etc.
* **Requirement:** Some specialized splitters might require additional library installations (e.g., `tiktoken` for token-based splitters).
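
As a rough illustration of structure-aware splitting, the sketch below cuts a Markdown string in front of every heading line. It is a simplified stand-in rather than what LangChain's Markdown splitter does internally: the function name `split_markdown_by_heading` is hypothetical, there is no size limit, and lines starting with `#` inside code fences would be mistaken for headings.

```python
import re


def split_markdown_by_heading(markdown: str) -> list[str]:
    """Split before each line that starts with one to six '#' characters and a space."""
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [section.strip() for section in sections if section.strip()]


markdown = """# Title

Intro paragraph.

## Section 1
- Item A
- Item B

## Section 2
Closing remarks.
"""

for section in split_markdown_by_heading(markdown):
    print(repr(section))
    print("---")
```

Each resulting section starts with its own heading, so the heading text travels with the content it describes.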

````python
from langchain.text_splitter import MarkdownTextSplitter, PythonCodeTextSplitter

# Example MarkdownTextSplitter
markdown_text = """
# Main Title

This is the introduction paragraph.

## Section 1: Introduction
- Item A
- Item B

### Subsection 1.1

This is the content of the subsection.

```python
def hello_world():
    print("Hello, World!")
```

## Section 2: Conclusion
"""

markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
markdown_chunks = markdown_splitter.split_text(markdown_text)

print("--- MarkdownTextSplitter ---")
for i, chunk in enumerate(markdown_chunks):
    print(f"Chunk {i+1} (len={len(chunk)}):\n'{chunk}'\n---")

# Example PythonCodeTextSplitter
python_code = """
import os

class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return f"Hello, {self.name}!"

def main():
    obj = MyClass("LangChain")
    print(obj.greet())

if __name__ == "__main__":
    main()
"""

python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)
python_chunks = python_splitter.split_text(python_code)

print("\n--- PythonCodeTextSplitter ---")
for i, chunk in enumerate(python_chunks):
    print(f"Chunk {i+1} (len={len(chunk)}):\n'{chunk}'\n---")
````


---

## 3. The Importance of `chunk_size` and `chunk_overlap`

These two parameters are extremely important when splitting text and need to be carefully adjusted depending on your use case.

### 3.1. `chunk_size`

* **Concept:** The maximum size (in characters or tokens) that each text segment (chunk) is allowed to have.
* **Impact:**
    * **Small `chunk_size`:** More chunks, each focusing on a small idea. Good for very specific searches, but can lose context if a large idea is split.
    * **Large `chunk_size`:** Fewer chunks, each containing more context. Good for capturing overall ideas, but can exceed LLM token limits or reduce search accuracy if the chunk is too broad.
* **Selection:** Balance the LLM's token limit against the need for context. For common LLMs, a `chunk_size` of 500-1500 tokens is often a good starting point. Keep in mind that with the default `length_function=len`, `chunk_size` counts characters, not tokens.

### 3.2. `chunk_overlap`

* **Concept:** The number of characters (or tokens) that adjacent chunks will overlap.
* **Impact:**
    * **Small/no `chunk_overlap`:** Chunks are completely separate. Can lead to loss of context if an important sentence or idea spans the boundary between two chunks.
    * **Large `chunk_overlap`:** Chunks have a lot of overlapping information. Ensures context is maintained, but can lead to redundancy and increased embedding/search costs.
* **Selection:** Often set to 10-20% of `chunk_size`. The goal is to ensure that important ideas are not cut off and context is preserved when transitioning from one chunk to another.
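
The interaction between the two parameters can be shown with a plain sliding window over characters. This is a minimal sketch, not how LangChain chooses its boundaries (it works on separator-delimited pieces); the function name `sliding_chunks` is hypothetical. Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so a text of length `n` (longer than the overlap) yields `ceil((n - chunk_overlap) / (chunk_size - chunk_overlap))` chunks.

```python
import math


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size character windows where each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


alphabet = "abcdefghijklmnopqrstuvwxyz"

with_overlap = sliding_chunks(alphabet, chunk_size=10, chunk_overlap=2)
without_overlap = sliding_chunks(alphabet, chunk_size=10, chunk_overlap=0)

print("overlap=2:", with_overlap)
print("overlap=0:", without_overlap)
print("expected count with overlap:", math.ceil((len(alphabet) - 2) / (10 - 2)))
```

With an overlap of 2, the last two characters of each chunk reappear at the start of the next one; a larger overlap shrinks the step and therefore increases the number of chunks to embed and store.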

[Image illustrating chunk_size and chunk_overlap visually]


---

## 4. Practical Example: Splitting Loaded Documents

We will use the documents loaded from Lesson 3.1 and apply `RecursiveCharacterTextSplitter` to split them.

```python
# Install all necessary libraries for this example
# pip install langchain langchain-community pypdf beautifulsoup4 tiktoken reportlab

import os
from langchain_community.document_loaders import TextLoader, PyPDFLoader, WebBaseLoader, CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# --- Prepare sample files (similar to Lesson 3.1) ---
txt_content = "This is important information from a text note.\nIt talks about the benefits of learning programming."
with open("note.txt", "w", encoding="utf-8") as f:
    f.write(txt_content)

csv_content = "Product,Price,Quantity\nLaptop,1200,50\nPhone,800,120\nHeadphones,50,300"
with open("products.csv", "w", encoding="utf-8") as f:
    f.write(csv_content)

pdf_test_path = "document.pdf"
try:
    from reportlab.pdfgen import canvas
    c = canvas.Canvas(pdf_test_path)
    c.drawString(100, 750, "This is content from a sample PDF file.")
    c.drawString(100, 730, "It is generated for illustration purposes.")
    c.save()
except ImportError:
    with open(pdf_test_path, "w") as f:
        f.write("This is a dummy PDF file. Please replace with a real PDF.\n")
    print("Could not create real PDF. Using dummy file.")

# --- Load data using different loaders ---
all_documents = []

print("Loading documents...")
txt_loader = TextLoader("note.txt", encoding="utf-8")
all_documents.extend(txt_loader.load())

csv_loader = CSVLoader("products.csv", encoding="utf-8")
all_documents.extend(csv_loader.load())

web_loader = WebBaseLoader("https://www.langchain.com/blog")
try:
    all_documents.extend(web_loader.load())
except Exception as e:
    print(f"Error loading from web: {e}. Skipping web load.")

try:
    pdf_loader = PyPDFLoader(pdf_test_path)
    all_documents.extend(pdf_loader.load())
except Exception as e:
    print(f"Error loading from PDF: {e}. Skipping PDF load.")

print(f"Total original documents loaded: {len(all_documents)}\n")

# --- Split documents using RecursiveCharacterTextSplitter ---
print("Splitting documents...")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, # Desired chunk size
    chunk_overlap=50, # Overlap
    length_function=len, # Calculate length by characters
    add_start_index=True # Add start index to metadata
)

# split_documents() takes a list of Documents and returns a list of split Documents
chunks = text_splitter.split_documents(all_documents)

print(f"Total chunks after splitting: {len(chunks)}\n")

print("--- Split chunks (a few examples) ---")
for i, chunk in enumerate(chunks[:5]): # Print first 5 chunks for illustration
    print(f"Chunk {i+1}:")
    print(f"  Content (partial): {chunk.page_content[:200]}...")
    print(f"  Length: {len(chunk.page_content)}")
    print(f"  Metadata: {chunk.metadata}")
    print("=" * 50)

# --- Clean up sample files ---
os.remove("note.txt")
os.remove("products.csv")
os.remove(pdf_test_path)
print("\nSample files removed.")
```

**Explanation:**
This example illustrates a complete process:
1.  Load documents from various sources using Document Loaders.
2.  Use `RecursiveCharacterTextSplitter` to split all loaded documents into chunks with desired size and overlap.
3.  Print information about the split chunks, including content, length, and metadata (including `start_index` if enabled).

This splitting process is an essential step before creating embeddings and storing them in a Vector Store, ensuring that the embedded and retrieved text segments are optimal for the LLM.
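
Before moving on to embeddings, it can be useful to sanity-check the chunk sizes. The helper below is a small sketch that works on plain strings; the name `summarize_chunks` is hypothetical and not part of LangChain. With `Document` chunks you would pass something like `[c.page_content for c in chunks]`.

```python
from statistics import mean


def summarize_chunks(texts: list[str], chunk_size: int) -> dict:
    """Return basic length statistics and the number of chunks longer than chunk_size."""
    if not texts:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "oversized": 0}
    lengths = [len(t) for t in texts]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
        "oversized": sum(length > chunk_size for length in lengths),
    }


sample_chunks = ["aaaa", "bb", "cccccc"]
print(summarize_chunks(sample_chunks, chunk_size=5))
```

A non-zero `oversized` count, or many very short chunks, is usually a hint that `chunk_size` or the separators need adjusting.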


---

## Lesson Summary

This lesson explained why it's necessary to **split large texts (chunking)** in LLM applications, primarily to address **LLM token limits** and **optimize search efficiency**. We explored various **text splitting strategies** in LangChain:
* **`CharacterTextSplitter`** for simple splitting on a single separator.
* **`RecursiveCharacterTextSplitter`** (recommended) for more intelligent splitting, attempting to preserve context.
* Specialized splitters like **`MarkdownTextSplitter`** and **`PythonCodeTextSplitter`** for splitting by semantic structure.

The lesson also emphasized the **importance of `chunk_size` and `chunk_overlap`**, two key parameters affecting the size and overlap of text segments, and how to adjust them for optimal performance. Finally, a practical example illustrated how to apply `RecursiveCharacterTextSplitter` to split loaded documents, laying a solid foundation for the next steps in building RAG systems.