# **Text Transformation using Text Splitters**

Notice that after you load a Document object from a source, you end up with strings by grabbing them from `page_content`. In some cases these strings are too long to feed into a model, whether an embedding model or a chat model.

**Note: LLMs have a fixed maximum context window. If your document contains more tokens than the maximum context window size, the LLM won't be able to process it. So it is important to break documents into chunks.**

Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

At a high level, text splitters work as follows:

1. Split the text up into small, semantically meaningful chunks (often sentences).
2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

That means there are two different axes along which you can customize your text splitter:

1. How the text is split
2. How the chunk size is measured
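
The three steps and the two customization axes above can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: `split_then_merge` is a hypothetical helper, the `separator` argument stands in for "how the text is split", and `length_function` stands in for "how the chunk size is measured".

```python
from typing import Callable, List


def split_then_merge(
    text: str,
    separator: str = "\n\n",
    chunk_size: int = 40,
    chunk_overlap: int = 10,
    length_function: Callable[[str], int] = len,
) -> List[str]:
    """Toy splitter (hypothetical): split on `separator`, then greedily merge pieces."""
    # Step 1: split the text into small pieces.
    pieces = [p for p in text.split(separator) if p]
    chunks: List[str] = []
    current: List[str] = []

    for piece in pieces:
        # Step 2: keep combining pieces until adding one more would exceed chunk_size.
        if current and length_function(separator.join(current + [piece])) > chunk_size:
            # Step 3: emit the finished chunk ...
            chunks.append(separator.join(current))
            # ... and keep only trailing pieces that fit in chunk_overlap
            # (and still leave room for the next piece).
            while current and (
                length_function(separator.join(current)) > chunk_overlap
                or length_function(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


text = "aaaa bbbb cccc dddd eeee"

# Axis 2 varied: the same text measured in characters, then in words.
by_chars = split_then_merge(text, separator=" ", chunk_size=14, chunk_overlap=4)
by_words = split_then_merge(
    text,
    separator=" ",
    chunk_size=2,
    chunk_overlap=1,
    length_function=lambda s: len(s.split()),
)

print(by_chars)  # ['aaaa bbbb cccc', 'cccc dddd eeee']
print(by_words)  # ['aaaa bbbb', 'bbbb cccc', 'cccc dddd', 'dddd eeee']
```

Swapping `length_function` changes only how size is measured; the split and merge logic stays the same.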

## **Types of Text Splitters**

**Evaluate text splitters**  
You can evaluate text splitters with the [Chunkviz utility](https://www.chunkviz.com/) created by Greg Kamradt. Chunkviz is a great tool for visualizing how your text splitter is working. It shows how your text is being split up and helps you tune the splitting parameters.

LangChain offers many different types of text splitters. These all live in the `langchain-text-splitters` package. Below is a table listing all of them, along with a few characteristics:
> **Name:** Name of the text splitter  
> **Splits On:** How this text splitter splits text  
> **Adds Metadata:** Whether or not this text splitter adds metadata about where each chunk came from.  
> **Description:** Description of the splitter, including recommendation on when to use it.

| Name | Splits On | Adds Metadata | Description |
| :--- | :--- | :---: | :--- |
| Recursive | A list of user defined characters | . | Recursively splits text. Splitting text recursively serves the purpose of trying to keep related pieces of text next to each other. This is the recommended way to start splitting text. |
| HTML | HTML specific characters | ✅ | Splits text based on HTML-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the HTML) |
| Markdown | Markdown specific characters | ✅ | Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown) |
| Code | Code (Python, JS) specific characters | . | Splits text based on characters specific to coding languages. 15 different languages are available to choose from. |
| Token | Tokens | . | Splits text on tokens. There exist a few different ways to measure tokens. |
| Character | A user defined character | . | Splits text based on a user defined character. One of the simpler methods. |
| [Experimental] Semantic Chunker | Sentences | . | First splits on sentences, then combines adjacent ones if they are semantically similar enough. Taken from Greg Kamradt. |

## **Loading the data**

In [1]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/subtitles_data/Friends - [2x01] - The One with Ross's New Girlfriend.srt")

doc = loader.load()

In [2]:
doc[0].page_content[:100]

"1\n00:00:01,435 --> 00:00:04,082\nThis is pretty much\nwhat's happened so far.\n\n2\n00:00:04,395 --> 00:0"

In [3]:
doc_content = doc[0].page_content

## **1. Split by Character**

This is the simplest method. It splits on a single separator (by default `"\n\n"`) and measures chunk length by number of characters.

1. **How the text is split:** by a single separator.
2. **How the chunk size is measured:** by number of characters.

In [4]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

In [5]:
texts = text_splitter.create_documents([doc_content])

print("Type of texts variable:", type(texts))

print("Type of each object inside the list:", type(texts[0]))

print("Total number of documents inside texts list:", len(texts))

print("Content of first document:", texts[0])

Type of texts variable: <class 'list'>
Type of each object inside the list: <class 'langchain_core.documents.base.Document'>
Total number of documents inside texts list: 140
Content of first document: page_content="1\n00:00:01,435 --> 00:00:04,082\nThis is pretty much\nwhat's happened so far.\n\n2\n00:00:04,395 --> 00:00:07,179\nRoss was in love\nwith Rachel since forever."


In [6]:
texts[:5]

[Document(page_content="1\n00:00:01,435 --> 00:00:04,082\nThis is pretty much\nwhat's happened so far.\n\n2\n00:00:04,395 --> 00:00:07,179\nRoss was in love\nwith Rachel since forever."),
 Document(page_content='3\n00:00:07,423 --> 00:00:10,437\nEvery time he tried to tell her,\nsomething got in the way...\n\n4\n00:00:10,651 --> 00:00:12,529\n...Iike cats, Italian guys.'),
 Document(page_content='5\n00:00:12,736 --> 00:00:15,922\nAnd finally, Chandler was,\nlike, "Forget about her."\n\n6\n00:00:16,166 --> 00:00:20,762\nWhen Ross was in China, Chandler\nlet it slip that Ross loved Rachel.'),
 Document(page_content='7\n00:00:20,975 --> 00:00:22,818\nShe was, like, "Oh, my God!"\n\n8\n00:00:23,061 --> 00:00:25,845\nSo she went to the airport to meet him.'),
 Document(page_content="9\n00:00:26,089 --> 00:00:29,710\nShe didn 't know Ross was getting\noff the plane with another woman.\n\n10\n00:00:31,165 --> 00:00:33,651\nThat's pretty much everything\nyou need to know.")]

### **Passing Metadata with Documents**

If there is more than one document, you should add some metadata to each chunk to identify which document it belongs to.

Here’s an example of passing metadata along with the documents. Notice that the metadata is carried along with every chunk when the documents are split.

In [7]:
metadatas = [{"document": 1}, {"document": 2}]

chunks = text_splitter.create_documents(
    [doc_content, doc_content], metadatas=metadatas
)

print("Type of texts variable:", type(chunks))

print("Type of each object inside the list:", type(chunks[0]))

print("Total number of documents inside texts list:", len(chunks))

print("Content of first document:", chunks[0])

Type of texts variable: <class 'list'>
Type of each object inside the list: <class 'langchain_core.documents.base.Document'>
Total number of documents inside texts list: 280
Content of first document: page_content="1\n00:00:01,435 --> 00:00:04,082\nThis is pretty much\nwhat's happened so far.\n\n2\n00:00:04,395 --> 00:00:07,179\nRoss was in love\nwith Rachel since forever." metadata={'document': 1}


In [8]:
print(chunks[0])
print()
print(chunks[-1])

page_content="1\n00:00:01,435 --> 00:00:04,082\nThis is pretty much\nwhat's happened so far.\n\n2\n00:00:04,395 --> 00:00:07,179\nRoss was in love\nwith Rachel since forever." metadata={'document': 1}

page_content="351\n00:23:48,261 --> 00:23:51,360\nAndie MacDowell is the guy\nfrom Planet of the Apes.\n\n352\n00:23:54,450 --> 00:23:56,316\n-Thank you.\n-You're welcome." metadata={'document': 2}
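
To make the pairing explicit, here is a small plain-Python sketch of how one metadata dict per source text can be propagated to every chunk cut from that text. `Chunk` and `create_chunks` are hypothetical names used only for illustration; they mimic the behaviour seen in the output above and are not LangChain's code.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def create_chunks(
    texts: List[str],
    metadatas: Optional[List[Dict[str, Any]]] = None,
    separator: str = "\n\n",
) -> List[Chunk]:
    """Split each text on `separator` and attach that text's metadata to every piece."""
    if metadatas is None:
        metadatas = [{} for _ in texts]
    if len(metadatas) != len(texts):
        raise ValueError("Expected exactly one metadata dict per input text.")

    chunks: List[Chunk] = []
    for text, meta in zip(texts, metadatas):
        for piece in text.split(separator):
            if piece:
                # Copy the dict so that editing one chunk's metadata
                # does not silently change its siblings.
                chunks.append(Chunk(page_content=piece, metadata=dict(meta)))
    return chunks


chunks = create_chunks(
    ["a\n\nb", "c"],
    metadatas=[{"document": 1}, {"document": 2}],
)
print([(c.page_content, c.metadata) for c in chunks])
# [('a', {'document': 1}), ('b', {'document': 1}), ('c', {'document': 2})]
```

With the metadata in place, chunks can later be filtered or grouped by source, for example `[c for c in chunks if c.metadata["document"] == 1]`.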


## **2. Recursively split by character**

This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

1. **How the text is split:** by list of characters.
2. **How the chunk size is measured:** by number of characters.
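
The "try each separator in order" idea can be sketched as a short recursive function. This is a simplified illustration under stated assumptions: `recursive_split` is a hypothetical helper, it measures length with `len`, and unlike the real splitter it does not merge small pieces back together or add overlap.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
    chunk_size: int = 100,
) -> List[str]:
    """Split on the coarsest separator first; recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # No finer separator left to try.

    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut the text into fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    pieces: List[str] = []
    for part in text.split(sep):
        # Parts that are already small enough are returned as-is by the base case.
        pieces.extend(recursive_split(part, finer, chunk_size))
    return pieces


sample = "para one line\nsecond line here\n\nshort"
print(recursive_split(sample, chunk_size=16))
# ['para one line', 'second line here', 'short']

print(recursive_split("abcdefghij", chunk_size=4))
# ['abcd', 'efgh', 'ij']
```

In the first call the paragraph break is tried first, and only the oversized paragraph is split further on single newlines. The second call has no separators to split on, so it falls through to the empty-string separator and is cut by width.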

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

texts = text_splitter.create_documents([doc_content])

print("Type of texts variable:", type(texts))

print("Type of each object inside the list:", type(texts[0]))

print("Total number of documents inside texts list:", len(texts))

print("Content of first document:", texts[0])

Type of texts variable: <class 'list'>
Type of each object inside the list: <class 'langchain_core.documents.base.Document'>
Total number of documents inside texts list: 335
Content of first document: page_content="1\n00:00:01,435 --> 00:00:04,082\nThis is pretty much\nwhat's happened so far."


In [10]:
print(texts[0])
print()
print(texts[1])

page_content="1\n00:00:01,435 --> 00:00:04,082\nThis is pretty much\nwhat's happened so far."

page_content='2\n00:00:04,395 --> 00:00:07,179\nRoss was in love\nwith Rachel since forever.'


## **Split by tokens**

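Language models have a token limit, and you should not exceed it. It is therefore a good idea to count tokens when you split your text into chunks. There are many tokenizers; when you count tokens in your text, use the same tokenizer as the language model. Note that the splitters below differ in how they measure `chunk_size`: the NLTK and spaCy splitters split on sentences and measure length in characters, while `from_tiktoken_encoder` measures length in tiktoken tokens.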

1. nltk tokenizer
```python
# Splits on sentence boundaries found by the NLTK tokenizer.
from langchain_text_splitters import NLTKTextSplitter
text_splitter = NLTKTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(doc_content)
```
2. spaCy tokenizer
```python
# Splits on sentence boundaries found by the spaCy tokenizer.
from langchain_text_splitters import SpacyTextSplitter
text_splitter = SpacyTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(doc_content)
```
3. tiktoken (created by OpenAI)
```python
# tiktoken is a fast BPE tokenizer created by OpenAI.
from langchain_text_splitters import CharacterTextSplitter
# The keyword is `encoding_name`; chunk_size is measured in tiktoken tokens.
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(doc_content)
```
4. sentence-transformers tokenizer

5. KoNLPY (for Korean Language)

6. Hugging Face tokenizer
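
Whichever tokenizer you pick, the underlying idea is the same: measure chunk size with a token counter instead of `len`. The sketch below shows this with a deliberately crude regex tokenizer so that it runs without any extra packages. `toy_tokenize`, `count_tokens`, and `split_by_token_budget` are hypothetical names, and the regex is an assumption that will not match the token counts of any real model's tokenizer.

```python
import re
from typing import List


def toy_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption): each word and each punctuation mark is one token.
    return re.findall(r"\w+|[^\w\s]", text)


def count_tokens(text: str) -> int:
    return len(toy_tokenize(text))


def split_by_token_budget(text: str, max_tokens: int) -> List[str]:
    """Group whitespace-separated words into chunks of at most `max_tokens` toy tokens."""
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for word in text.split():
        n = count_tokens(word)
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(word)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Ross was in love with Rachel since forever."
print(count_tokens(sample))  # 9
print(split_by_token_budget(sample, max_tokens=4))
# ['Ross was in love', 'with Rachel since', 'forever.']
```

Note that `forever.` counts as two tokens here (the word and the full stop), which is why the last chunk is closed earlier than a word count would suggest. Real tokenizers show the same kind of mismatch between characters, words, and tokens.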

### **NLTK Text Splitter**

In [11]:
from langchain_text_splitters import NLTKTextSplitter

nltk_text_splitter = NLTKTextSplitter(chunk_size=1000)

texts = nltk_text_splitter.split_text(doc_content)

print(len(texts))

print(type(texts[0]))

29
<class 'str'>


In [12]:
print(texts[0][:300])

1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.

2
00:00:04,395 --> 00:00:07,179
Ross was in love
with Rachel since forever.

3
00:00:07,423 --> 00:00:10,437
Every time he tried to tell her,
something got in the way...

4
00:00:10,651 --> 00:00:12,529
...Iike cats, Italia


### **spaCy Text Splitter**

In [13]:
from langchain_text_splitters import SpacyTextSplitter

spacy_text_splitter = SpacyTextSplitter(chunk_size=1000)

texts = spacy_text_splitter.split_text(doc_content)

print(len(texts))

print(type(texts[0]))

30
<class 'str'>




In [14]:
print(texts[0][:300])

1
00:00:01,435 --> 00:00:04,082


This is pretty much
what's happened so far.



2
00:00:04,395 --> 00:00:07,179
Ross was in love
with Rachel since forever.



3
00:00:07,423 --> 00:00:10,437


Every time he tried to tell her,
something got in the way...

4
00:00:10,651 --> 00:00:12,529
...

Iike ca


### **tiktoken Text Splitter**

In [15]:
# !pip install tiktoken

In [16]:
# tiktoken is a fast BPE tokenizer created by OpenAI.
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)

texts = text_splitter.split_text(doc_content)

print(len(texts))

print(type(texts[0]))

33
<class 'str'>


In [17]:
print(texts[0][:300])

1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.

2
00:00:04,395 --> 00:00:07,179
Ross was in love
with Rachel since forever.

3
00:00:07,423 --> 00:00:10,437
Every time he tried to tell her,
something got in the way...

4
00:00:10,651 --> 00:00:12,529
...Iike cats, Italia


## **Semantic Chunking**

Splits the text based on semantic similarity.

In [18]:
# f = open('keys/.openai_api_key.txt')
# OPENAI_API_KEY = f.read()

In [19]:
# from langchain_openai import OpenAIEmbeddings

# embedding_model = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [20]:
# from langchain_experimental.text_splitter import SemanticChunker

# text_splitter = SemanticChunker(embedding_model)
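
The idea of splitting on sentences and then combining neighbours that are similar enough can be illustrated without an API key. The sketch below is not `SemanticChunker`: it replaces the embedding model with a bag-of-words vector (`embed`, an assumption made so the example is self-contained) and uses a fixed similarity `threshold`, which is also an assumption. `semantic_chunks` is a hypothetical name.

```python
import math
import re
from collections import Counter
from typing import List


def embed(sentence: str) -> Counter:
    # Stand-in "embedding" (assumption): word counts instead of a real embedding model.
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> List[str]:
    """Start a new chunk whenever adjacent sentences fall below the similarity threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []

    groups: List[List[str]] = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) >= threshold:
            groups[-1].append(cur)  # Similar enough: keep in the same chunk.
        else:
            groups.append([cur])  # Topic shift: start a new chunk.
    return [" ".join(group) for group in groups]


text = (
    "Ross loves Rachel. Rachel loves Ross too. "
    "The plane landed in China. The plane was late."
)
for chunk in semantic_chunks(text):
    print(chunk)
# Ross loves Rachel. Rachel loves Ross too.
# The plane landed in China. The plane was late.
```

With a real embedding model, sentences can be judged similar even when they share no words, which this word-count stand-in cannot capture.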