# **Text Transformation using Text Splitters**

Notice that after you load a Document object from a source, you end up with strings by reading its `page_content`. In some situations, those strings are too long to feed into a model, whether an embedding model or a chat model.

**Note: LLMs have a fixed maximum context window. If your document contains more tokens than the maximum context window size, the LLM won't be able to process it in a single call, so it is important to break documents into chunks.**

Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is splitting a long document into smaller chunks that fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

At a high level, text splitters work as follows:

1. Split the text up into small, semantically meaningful chunks (often sentences).
2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

That means there are two different axes along which you can customize your text splitter:

1. How the text is split
2. How the chunk size is measured
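
The three steps and the two axes above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not LangChain's implementation: `merge_pieces` and `split_sentences` are hypothetical helper names, the sentence regex is deliberately naive, and `length_fn` defaults to `len` (characters).

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    # Axis 1 (how the text is split): a naive sentence splitter, assumed for illustration.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def merge_pieces(
    pieces: List[str],
    chunk_size: int,
    chunk_overlap: int,
    length_fn: Callable[[str], int] = len,  # Axis 2 (how the chunk size is measured)
    joiner: str = " ",
) -> List[str]:
    """Greedily pack small pieces into chunks and carry some overlap forward."""
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        if current and length_fn(joiner.join(current + [piece])) > chunk_size:
            # Step 3: the chunk is full, so emit it.
            chunks.append(joiner.join(current))
            # Keep only trailing pieces that fit the overlap budget
            # and still leave room for the incoming piece.
            while current and (
                length_fn(joiner.join(current)) > chunk_overlap
                or length_fn(joiner.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        # Step 2: keep combining pieces into the current chunk.
        current.append(piece)
    if current:
        chunks.append(joiner.join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
sketch_chunks = merge_pieces(split_sentences(text), chunk_size=35, chunk_overlap=16)
for chunk in sketch_chunks:
    print(repr(chunk))
# 'One two three. Four five six.'
# 'Four five six. Seven eight nine.'
# 'Ten eleven twelve.'
```

Swapping `split_sentences` for a different splitting function, or `length_fn` for a token counter, changes one axis without touching the other.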

## **Types of Text Splitters**

**Evaluate text splitters**  
You can evaluate text splitters with the [Chunkviz utility](https://www.chunkviz.com/) created by Greg Kamradt. Chunkviz is a great tool for visualizing how your text splitter is working. It shows how your text is being split up and helps you tune the splitting parameters.

LangChain offers many different types of text splitters. These all live in the `langchain-text-splitters` package. Below is a table listing all of them, along with a few characteristics:
> **Name:** Name of the text splitter  
> **Splits On:** How this text splitter splits text  
> **Adds Metadata:** Whether or not this text splitter adds metadata about where each chunk came from.  
> **Description:** Description of the splitter, including recommendation on when to use it.

| Name | Splits On | Adds Metadata | Description |
| :--- | :--- | :---: | :--- |
| Character | A user defined character | . | Splits text based on a user defined character. One of the simpler methods. |
| Recursive | A list of user defined characters | . | Recursively splits text. Splitting text recursively serves the purpose of trying to keep related pieces of text next to each other. This is the recommended way to start splitting text. |
| HTML | HTML specific characters | ✅ | Splits text based on HTML-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the HTML) |  
| Markdown | Markdown specific characters | ✅ | Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown) |  
| Code | Code (Python, JS) specific characters | . | Splits text based on characters specific to coding languages. 15 different languages are available to choose from. |
| Token | Tokens | . | Splits text on tokens. There exist a few different ways to measure tokens. |
| [Experimental] Semantic Chunker | Sentences | . | First splits on sentences. Then combines ones next to each other if they are semantically similar enough. Taken from Greg Kamradt |
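
To make the last row of the table more concrete, here is a toy sketch of "combine neighbouring sentences if they are similar enough". It is only an illustration: `jaccard` and `semantic_merge` are hypothetical names, and a simple word-overlap score stands in for the similarity measure, whereas an embedding-based splitter would compare sentence vectors instead.

```python
import re
from typing import List


def jaccard(a: str, b: str) -> float:
    # Toy similarity (assumption): shared words divided by all words.
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def semantic_merge(sentences: List[str], threshold: float) -> List[str]:
    """Group each sentence with the previous one when they look related."""
    groups: List[List[str]] = []
    for sentence in sentences:
        if groups and jaccard(groups[-1][-1], sentence) >= threshold:
            groups[-1].append(sentence)
        else:
            groups.append([sentence])
    return [" ".join(group) for group in groups]


sentences = ["Cats chase mice.", "Mice fear cats.", "Stocks fell today."]
merged = semantic_merge(sentences, threshold=0.3)
print(merged)
# ['Cats chase mice. Mice fear cats.', 'Stocks fell today.']
```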

## **Loading the data**

In [1]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/text/file_1.txt")

data = loader.load()



In [2]:
print(data[0].page_content)

This is pretty much
what's happened so far.

Ross was in love
with Rachel since forever.

Every time he tried to tell her,
something got in the way

Iike cats, Italian guys.

And finally, Chandler was,
like, "Forget about her."


In [3]:
doc_content = data[0].page_content

## **1. Split by Character - CharacterTextSplitter**

This is the simplest method. It splits on a single separator string (by default `"\n\n"`) and measures chunk length by number of characters.

1. **How the text is split:** by a single user-defined separator.
2. **How the chunk size is measured:** by number of characters.

In [4]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=200,
    chunk_overlap=50
)

In [5]:
chunks = text_splitter.split_text(doc_content)

print("Type of texts variable:", type(chunks))
print()
print("Type of each object inside the list:", type(chunks[0]))
print()
print("Total number of documents inside texts list:", len(chunks))
print()
print("* Content of first chunk:", chunks[0])
print()
print("* Content of second chunk:", chunks[1])

Type of texts variable: <class 'list'>

Type of each object inside the list: <class 'str'>

Total number of documents inside texts list: 2

* Content of first chunk: This is pretty much
what's happened so far.

Ross was in love
with Rachel since forever.

Every time he tried to tell her,
something got in the way

Iike cats, Italian guys.

* Content of second chunk: Iike cats, Italian guys.

And finally, Chandler was,
like, "Forget about her."


In [6]:
for chunk in chunks:
    print(len(chunk))

173
78
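
The two lengths printed above can be checked by hand. The sketch below rebuilds the five paragraphs of the sample text as plain strings (copied from the printed output, so it assumes the file has no extra trailing whitespace) and joins them with the same `"\n\n"` separator.

```python
separator = "\n\n"

# The five paragraphs of file_1.txt, as printed earlier in this notebook.
pieces = [
    "This is pretty much\nwhat's happened so far.",
    "Ross was in love\nwith Rachel since forever.",
    "Every time he tried to tell her,\nsomething got in the way",
    "Iike cats, Italian guys.",
    'And finally, Chandler was,\nlike, "Forget about her."',
]

first_chunk = separator.join(pieces[:4])   # paragraphs 1 to 4
second_chunk = separator.join(pieces[3:])  # paragraph 4 (the overlap) plus paragraph 5
everything = separator.join(pieces)

print(len(first_chunk), len(second_chunk))  # 173 78
print(len(everything))                      # 227, which is over chunk_size=200
print(len(pieces[3]))                       # 24, which fits inside chunk_overlap=50
print(len(separator.join(pieces[2:4])))     # 83, too large to carry over as overlap
```

Adding the fifth paragraph to the first chunk would push it past 200 characters, so the splitter closes the chunk there. Only the fourth paragraph is repeated because it is the largest trailing run of whole paragraphs that stays within the 50-character overlap.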


### **Passing Metadata with Chunks**

If there is more than one document, you should attach some metadata to each chunk to identify which document it belongs to.

Here is an example of passing metadata along with the documents. Notice that the metadata is carried along with each chunk when the documents are split.

We will use the **create_documents()** method to do this.

In [7]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('data/text', glob="*.txt", show_progress=True, loader_cls=TextLoader)

data = loader.load()

print("Type of Data Variable: ", type(data))

print("Number of Documents:", len(data))

100%|███████████████████████████████████████████| 2/2 [00:00<00:00, 2397.43it/s]

Type of Data Variable:  <class 'list'>
Number of Documents: 2





In [8]:
# metadatas = [{"document": 1}, {"document": 2}]
doc_contents = [doc.page_content for doc in data]
meta_datas = [doc.metadata for doc in data]

chunks = text_splitter.create_documents(doc_contents, meta_datas)

print("Type of variable:", type(chunks))
print()
print("Type of each object inside the list:", type(chunks[0]))
print()
print("Total number of documents inside chunks:", len(chunks))
print()
print("Content of chunks:", chunks)

Type of variable: <class 'list'>

Type of each object inside the list: <class 'langchain_core.documents.base.Document'>

Total number of documents inside chunks: 3

Content of chunks: [Document(metadata={'source': 'data/text/file_1.txt'}, page_content="This is pretty much\nwhat's happened so far.\n\nRoss was in love\nwith Rachel since forever.\n\nEvery time he tried to tell her,\nsomething got in the way\n\nIike cats, Italian guys."), Document(metadata={'source': 'data/text/file_1.txt'}, page_content='Iike cats, Italian guys.\n\nAnd finally, Chandler was,\nlike, "Forget about her."'), Document(metadata={'source': 'data/text/file_2.txt'}, page_content="These were unbelievely expensive\nand he'll grow out of them\n\nin 20 minutes,\nbut I couldn't resist!\n\nLook at these.")]
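
In the output above, every chunk carries the `source` of the file it came from. A minimal sketch of that pairing, assuming a hypothetical `SimpleDoc` stand-in for the Document class and a plain `str.split` in place of the real splitter:

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def create_docs(
    texts: List[str], metadatas: List[Dict[str, str]], separator: str = "\n\n"
) -> List[SimpleDoc]:
    docs: List[SimpleDoc] = []
    # texts[i] and metadatas[i] describe the same source document.
    for text, metadata in zip(texts, metadatas):
        for piece in text.split(separator):
            # Copy the metadata so chunks do not share one mutable dict.
            docs.append(SimpleDoc(piece, copy.deepcopy(metadata)))
    return docs


texts = ["first paragraph\n\nsecond paragraph", "only paragraph"]
metadatas = [{"source": "a.txt"}, {"source": "b.txt"}]
docs = create_docs(texts, metadatas)
print([doc.metadata["source"] for doc in docs])
# ['a.txt', 'a.txt', 'b.txt']
```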


## **2. Recursively split by character**

This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

1. **How the text is split:** by list of characters.
2. **How the chunk size is measured:** by number of characters.
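
The "try each separator in order" idea can be sketched as a short recursive function. This is a simplified illustration, not the library's code: `recursive_split` is a hypothetical name, and the sketch leaves out chunk overlap.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the coarsest separator, recursing with finer ones on oversized parts."""
    if len(text) <= chunk_size or not separators:
        # Small enough, or nothing finer to try (the text may stay oversized).
        return [text]
    sep, finer = separators[0], separators[1:]
    # An empty separator means "split into individual characters".
    parts = list(text) if sep == "" else text.split(sep)
    chunks: List[str] = []
    buffer = ""
    for part in parts:
        if not part:
            continue
        candidate = part if not buffer else buffer + sep + part
        if len(candidate) <= chunk_size:
            buffer = candidate  # keep related parts together while they fit
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(part) <= chunk_size:
            buffer = part
        else:
            # This part alone is too large: retry with the next, finer separator.
            chunks.extend(recursive_split(part, finer, chunk_size))
    if buffer:
        chunks.append(buffer)
    return chunks


sample = "alpha beta gamma\n\ndelta\n\nepsilon zeta eta theta iota"
result = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=20)
print(result)
# ['alpha beta gamma', 'delta', 'epsilon zeta eta', 'theta iota']
```

The first two paragraphs survive intact, while the third is too long for one chunk and is split on spaces instead.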

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=200,
    chunk_overlap=50,
)

chunks = text_splitter.create_documents(doc_contents, meta_datas)

print("Type of variable:", type(chunks))
print()
print("Type of each object inside the list:", type(chunks[0]))
print()
print("Total number of documents inside list:", len(chunks))
print()
print("* Content of first chunk:", chunks[0])
print()
print("* Content of second chunk:", chunks[1])

Type of variable: <class 'list'>

Type of each object inside the list: <class 'langchain_core.documents.base.Document'>

Total number of documents inside list: 3

* Content of first chunk: page_content='This is pretty much
what's happened so far.

Ross was in love
with Rachel since forever.

Every time he tried to tell her,
something got in the way

Iike cats, Italian guys.' metadata={'source': 'data/text/file_1.txt'}

* Content of second chunk: page_content='Iike cats, Italian guys.

And finally, Chandler was,
like, "Forget about her."' metadata={'source': 'data/text/file_1.txt'}


## **3. Split by tokens**

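Language models have a token limit that you should not exceed. When you split your text into chunks, it is therefore a good idea to count the number of tokens. There are many tokenizers, and when you count tokens in your text you should use the same tokenizer as the one used by the language model. Note that some of the splitters listed below (NLTK, spaCy) use a tokenizer only to find sentence boundaries and still measure chunk size in characters.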

1. nltk tokenizer
```python
from langchain_text_splitters import NLTKTextSplitter
text_splitter = NLTKTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(text_string)
```
2. spaCy tokenizer
```python
from langchain_text_splitters import SpacyTextSplitter
text_splitter = SpacyTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(text_string)
```
3. tiktoken (created by OpenAI)
```python
# tiktoken is a fast BPE tokenizer created by OpenAI.
# text_string is a placeholder for the text you want to split.
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(text_string)
```
4. sentence-transformers tokenizer

5. KoNLPY (for Korean Language)

6. Hugging Face tokenizer
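
To illustrate what measuring chunk size in tokens means, here is a minimal sliding-window sketch. The whitespace tokenizer is an assumption made to keep the example dependency-free (real models use subword tokenizers such as BPE), and `split_by_tokens` is a hypothetical helper name.

```python
from typing import Callable, List


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return text.split()


def split_by_tokens(
    text: str,
    tokenize: Callable[[str], List[str]],
    chunk_size: int,
    chunk_overlap: int,
) -> List[str]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = tokenize(text)
    step = chunk_size - chunk_overlap
    windows: List[str] = []
    for start in range(0, len(tokens), step):
        windows.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return windows


token_chunks = split_by_tokens("a b c d e f g h i j", whitespace_tokenize, 4, 1)
print(token_chunks)
# ['a b c d', 'd e f g', 'g h i j']
```

Here `chunk_size` and `chunk_overlap` are counted in tokens rather than characters, so each window holds four tokens and shares one token with its neighbour.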

### **NLTK Text Splitter**

In [10]:
# !pip install nltk

In [11]:
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kanavbansal/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:
from langchain_text_splitters import NLTKTextSplitter

nltk_text_splitter = NLTKTextSplitter(chunk_size=100, chunk_overlap=50)

chunks = nltk_text_splitter.create_documents(doc_contents, meta_datas)

print("Type of variable:", type(chunks))
print()
print("Type of each object inside the list:", type(chunks[0]))
print()
print("Total number of documents inside list:", len(chunks))
print()
print("* Content of first chunk:", chunks[0])
print()
print("* Content of second chunk:", chunks[1])

Type of variable: <class 'list'>

Type of each object inside the list: <class 'langchain_core.documents.base.Document'>

Total number of documents inside list: 5

* Content of first chunk: page_content='This is pretty much
what's happened so far.

Ross was in love
with Rachel since forever.' metadata={'source': 'data/text/file_1.txt'}

* Content of second chunk: page_content='Every time he tried to tell her,
something got in the way

Iike cats, Italian guys.' metadata={'source': 'data/text/file_1.txt'}


In [13]:
print(chunks)

[Document(metadata={'source': 'data/text/file_1.txt'}, page_content="This is pretty much\nwhat's happened so far.\n\nRoss was in love\nwith Rachel since forever."), Document(metadata={'source': 'data/text/file_1.txt'}, page_content='Every time he tried to tell her,\nsomething got in the way\n\nIike cats, Italian guys.'), Document(metadata={'source': 'data/text/file_1.txt'}, page_content='And finally, Chandler was,\nlike, "Forget about her."'), Document(metadata={'source': 'data/text/file_2.txt'}, page_content="These were unbelievely expensive\nand he'll grow out of them\n\nin 20 minutes,\nbut I couldn't resist!"), Document(metadata={'source': 'data/text/file_2.txt'}, page_content='Look at these.')]


### **tiktoken Text Splitter**

In [14]:
# !pip install tiktoken

In [15]:
# tiktoken is a fast BPE tokenizer created by OpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-3.5-turbo",
    chunk_size=50,
    chunk_overlap=20,
)

chunks = text_splitter.create_documents(doc_contents, meta_datas)

print("Type of variable:", type(chunks))
print()
print("Type of each object inside the list:", type(chunks[0]))
print()
print("Total number of documents inside list:", len(chunks))
print()
print("* Content of first chunk:", chunks[0])
print()
print("* Content of second chunk:", chunks[1])

Type of variable: <class 'list'>

Type of each object inside the list: <class 'langchain_core.documents.base.Document'>

Total number of documents inside list: 3

* Content of first chunk: page_content='This is pretty much
what's happened so far.

Ross was in love
with Rachel since forever.

Every time he tried to tell her,
something got in the way

Iike cats, Italian guys.' metadata={'source': 'data/text/file_1.txt'}

* Content of second chunk: page_content='Iike cats, Italian guys.

And finally, Chandler was,
like, "Forget about her."' metadata={'source': 'data/text/file_1.txt'}


### **Sentence Transformers Token Text Splitter**

In [16]:
# !pip install sentence-transformers

In [17]:
from langchain_text_splitters.sentence_transformers import SentenceTransformersTokenTextSplitter

st_text_splitter = SentenceTransformersTokenTextSplitter(model_name="sentence-transformers/all-mpnet-base-v2", 
                                                         chunk_size=100, 
                                                         chunk_overlap=50)

chunks = st_text_splitter.create_documents(doc_contents, meta_datas)

print("Type of variable:", type(chunks))
print()
print("Type of each object inside the list:", type(chunks[0]))
print()
print("Total number of documents inside list:", len(chunks))
print()
print("* Content of first chunk:", chunks[0])
print()
print("* Content of second chunk:", chunks[1])

Type of variable: <class 'list'>

Type of each object inside the list: <class 'langchain_core.documents.base.Document'>

Total number of documents inside list: 2

* Content of first chunk: page_content='this is pretty much what's happened so far. ross was in love with rachel since forever. every time he tried to tell her, something got in the way iike cats, italian guys. and finally, chandler was, like, " forget about her. "' metadata={'source': 'data/text/file_1.txt'}

* Content of second chunk: page_content='these were unbelievely expensive and he'll grow out of them in 20 minutes, but i couldn't resist! look at these.' metadata={'source': 'data/text/file_2.txt'}
