# Document Splitting

We just went over how to load documents into a standard format. 

Now, we're going to talk about how to split them up into smaller chunks. 

This may sound really easy, but there are a lot of subtleties here that have a big impact down the line. 

![Split](immagini/14_splitting.png)

__Document splitting__ happens after you load your data into the document format, but before it goes into the __vector store__. 

This may seem really simple: you can just split the text into chunks according to the number of characters or something like that. 

But to see why this is both trickier and more important than it seems, let's take a look at this __example__. 

We've got a sentence about the Toyota Camry and some specifications. If we did a simple splitting, we could __end up with part of the sentence in one chunk, and the other part of the sentence in another chunk__. Then, when we try to answer a question down the line about the specifications of the Camry, __we actually don't have the right information in either chunk__, because it's been split apart. So we __wouldn't be able to answer this question correctly__. 

So, there's a lot of nuance and importance in __HOW YOU SPLIT THE CHUNKS__ so that you get __semantically relevant chunks together__.
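To make the problem concrete, here is a minimal sketch of naive fixed-width splitting in plain Python. The sentence and the `naive_split` helper are hypothetical and only for illustration; they are not the slide's exact text or a LangChain function.

```python
# Hypothetical sentence for illustration; not the exact text from the slide.
sentence = "The Toyota Camry has a head-snapping 80 HP and an eight-speed automatic transmission."

def naive_split(text, chunk_size):
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = naive_split(sentence, 40)
for chunk in chunks:
    print(repr(chunk))
```

No single chunk holds the whole specification, and a word can even be cut in half, which is exactly the situation where a later question about the Camry can't be answered from any one chunk.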


![Split](immagini/15_splitting.png)

The basis of all the text splitters in LangChain is splitting into chunks of some __chunk size__ with some __chunk overlap__. The diagram shows what that looks like.

The __CHUNK SIZE__ is the size of a chunk, which can be measured in a few different ways, and we'll talk about a few of those in this lesson. To support this, the splitters allow passing in a length function that measures the size of a chunk. This is often characters or tokens.  

The __CHUNK OVERLAP__ is a small overlap kept between two consecutive chunks, like a sliding window as we move from one to the next. This lets the same piece of context appear at the end of one chunk and at the start of the next, which helps create some notion of consistency. 
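The sliding-window idea can be sketched in a few lines of plain Python. This is a simplified illustration that measures size in characters; `sliding_chunks` is a hypothetical helper and not how LangChain's splitters are implemented.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Return character windows of chunk_size that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyzabcdefg", chunk_size=26, chunk_overlap=4)
print(chunks)
```

The last four characters of the first chunk are repeated as the first four characters of the second, which is the overlap at work.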

The __TEXT SPLITTERS__ in LangChain all have a `create_documents` and a `split_documents` method. 
Both use the same logic under the hood; they just expose slightly different interfaces: 
- `create_documents` takes in a list of texts, and 
- `split_documents` takes in a list of documents. 
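The relationship between the two interfaces can be sketched without LangChain. Everything below is a hypothetical stand-in: the `Doc` class, the toy fixed-size `split_text` logic, and the function bodies are assumptions for illustration, not the library's code.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a document object: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def split_text(text, chunk_size):
    """Toy splitting logic shared by both interfaces (fixed-size, no overlap)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def create_documents(texts, metadatas=None, chunk_size=10):
    """Take a list of raw strings and return chunked Doc objects."""
    metadatas = metadatas or [{} for _ in texts]
    docs = []
    for text, meta in zip(texts, metadatas):
        for chunk in split_text(text, chunk_size):
            docs.append(Doc(chunk, dict(meta)))  # each chunk gets its own copy
    return docs

def split_documents(documents, chunk_size=10):
    """Take a list of Doc objects, unpack them, and reuse create_documents."""
    texts = [d.page_content for d in documents]
    metadatas = [d.metadata for d in documents]
    return create_documents(texts, metadatas, chunk_size)

pages = [Doc("first page of text", {"page": 0}), Doc("second page", {"page": 1})]
from_docs = split_documents(pages)
from_texts = create_documents(["first page of text", "second page"])
```

Both calls produce the same chunk texts; the document-based one also carries each page's metadata onto the chunks that came from it.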


![Split](immagini/16_splitting.png)

There are a lot of different types of splitters in LangChain, and we'll cover a few of them in this lesson. But I would encourage you to check out the rest of them in your spare time. These text splitters vary across a bunch of dimensions. 

They can vary in how they split the chunks and which characters they split on. They can vary in how they measure the length of the chunks. Is it by characters? Is it by tokens? There are even some that use other, smaller models to __determine where the end of a sentence might be and use that as a way of splitting chunks__.

Another important part of splitting into chunks is the __METADATA__: maintaining the same metadata across all chunks, but also adding in new pieces of metadata when relevant. There are some text splitters that are really focused on that. 

The splitting of chunks can often be specific to the type of document that we're working with, and this is really apparent when you're splitting code. So, we have a __language text splitter__ that has a bunch of different separators for a variety of languages like Python, Ruby, and C. When splitting these documents, it takes the language and its relevant separators into account. 


In [None]:
# Environment
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

Next, we're going to import two of the most common types of text splitters in LangChain. 

### THE RECURSIVE CHARACTER TEXT SPLITTER & THE CHARACTER TEXT SPLITTER. 

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

We're going to first play around with a few toy use cases just to get a sense of what exactly these do. 

In [None]:
# We're going to set a relatively small chunk size of 26, 
# and an even smaller chunk overlap of 4, just so we can see what these can do.

chunk_size = 26
chunk_overlap = 4

Let's initialize these two different text splitters as *r_splitter* and *c_splitter*. 

In [None]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Why doesn't this split the string below?

In [None]:
# load in the first string
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [None]:
r_splitter.split_text(text1)

*OUTPUT*
```
['abcdefghijklmnopqrstuvwxyz']
```

The string is exactly 26 characters long, which fits within the chunk size, so there's no need to do any splitting.

In [None]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [None]:
r_splitter.split_text(text2)

*OUTPUT*
```
['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
```

We can see that the second chunk starts with W, X, Y, Z. Those four characters are the CHUNK OVERLAP, and then it continues with the rest of the string. 

Next, let's put spaces between the characters and see how the overlap behaves.

In [None]:
# spaces between characters

text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [None]:
r_splitter.split_text(text3)

*OUTPUT*
```
['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
```

It's split into three chunks because the spaces take up room: the string is now 51 characters long.

The overlap looks like only two characters, L and M, but the spaces around them count toward the chunk overlap of 4 as well. 

In [None]:
c_splitter.split_text(text3)

*OUTPUT*
```
['a b c d e f g h i j k l m n o p q r s t u v w x y z']
```

The issue is that the CHARACTER TEXT SPLITTER splits on a single separator, and by default that separator is a double newline (`"\n\n"`). But here, there are no newlines.

In [None]:
# If we set the separator to be a single space, the text is split in the same way as before.

c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)

*OUTPUT*
```
['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
```

Try your own examples!

## Recursive splitting details

`RecursiveCharacterTextSplitter` is recommended for generic text. 

Its first separator is the double newline, `\n\n`, which is a typical separator between paragraphs.

In [None]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [None]:
len(some_text)

*OUTPUT*

496

In [None]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
    # list of separators, and these are the default separators 
    # but we're just putting them in this notebook to better show what's going on.
)

What this means is that when you're splitting a piece of text, it will first try to split it by double newlines. If it still needs to split the individual chunks further, it will go on to single newlines. If it still needs to do more, it goes on to spaces. And finally, it will just go character by character if it really needs to. 
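The fallback order described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: it has no overlap, drops the separators it splits on, and uses hypothetical helpers `merge_pieces` and `recursive_split`; it is not LangChain's implementation.

```python
def merge_pieces(pieces, sep, chunk_size):
    """Greedily join neighbouring pieces with sep while staying within chunk_size."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

def recursive_split(text, chunk_size, separators):
    """Split text into chunks of at most chunk_size, trying separators in order."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut character by character into fixed-size pieces.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, small = [], []
    for part in text.split(sep):
        if not part:
            continue
        if len(part) <= chunk_size:
            small.append(part)  # small enough; may be merged with its neighbours
        else:
            chunks.extend(merge_pieces(small, sep, chunk_size))
            small = []
            # Still too big: retry this part with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, finer))
    chunks.extend(merge_pieces(small, sep, chunk_size))
    return chunks

demo = recursive_split("aaa bbb ccc\n\nddd eee fff ggg", 11, ["\n\n", "\n", " ", ""])
print(demo)
```

The first paragraph fits in one chunk, so it is kept whole; only the second, longer paragraph falls through to the space separator.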

In [None]:
c_splitter.split_text(some_text)

*OUTPUT*
```
['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.'] 
```

In [None]:
r_splitter.split_text(some_text)

*OUTPUT*
```
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
```

Let's reduce the chunk size a bit and add a period to our separators:

In [None]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "\. ", " ", ""]
)
r_splitter.split_text(some_text)

*OUTPUT*
```
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related",
 '. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns',
 '. Carriage returns are the "backslash n" you see embedded in this string',
 '. Sentences have a period at the end, but also, have a space.and words are separated by space.']
```
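If we run this text splitter, we can see that it splits on sentences, but the periods are actually in the wrong places: they land at the start of the next chunk. This is because of the regex that's going on behind the scenes. To fix this, we can specify a slightly more complicated regex with a lookbehind.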

In [None]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]  # split properly
)
r_splitter.split_text(some_text)

*OUTPUT*
```
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.",
 'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns.',
 'Carriage returns are the "backslash n" you see embedded in this string.',
 'Sentences have a period at the end, but also, have a space.and words are separated by space.']
 ```
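The difference between the two separators can be reproduced with Python's `re` module alone (Python 3.7 or later, where `re.split` can split on empty matches). The sample text is made up, and gluing each captured separator onto the following piece is an assumption about what happens inside the splitter, shown only because it reproduces the misplaced periods.

```python
import re

text = "One idea. Another idea. Last idea."

# Capturing the separator returns it as its own list item.
pieces = re.split(r"(\. )", text)

# Gluing each separator onto the piece that FOLLOWS it
# puts the period at the start of the next sentence.
glued = [pieces[0]] + [pieces[i] + pieces[i + 1] for i in range(1, len(pieces) - 1, 2)]

# A lookbehind matches the empty position right after ". ",
# so the period stays with the sentence it ends.
lookbehind = [part.strip() for part in re.split(r"(?<=\. )", text)]

print(glued)
print(lookbehind)
```

Because the lookbehind consumes no characters, nothing is removed from the text and there is no separator left over to attach to the wrong side.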

## Real-world example with one of the PDFs that we worked with in the first document loading section

In [None]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

In [None]:
from langchain.text_splitter import CharacterTextSplitter

# define our text splitter 

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len   # using LEN, the Python built-in
)

In [None]:
# Using the split documents method

docs = text_splitter.split_documents(pages)

In [None]:
len(docs)

*OUTPUT*

77

In [None]:
len(pages)

*OUTPUT*

22

If we compare the number of resulting documents to the number of original pages, we can see that a bunch more documents have been created as a result of this splitting. 

## Notion_DB

In [None]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
notion_db = loader.load()

In [None]:
docs = text_splitter.split_documents(notion_db)

In [None]:
len(notion_db)

*OUTPUT*

52

In [None]:
len(docs)

*OUTPUT*

353

## Token splitting

We can also split on token count explicitly, if we want.

This can be useful because LLMs often have context windows designated in tokens, so it's important to know what the tokens are and where they appear. Splitting on them gives a slightly more representative idea of how the LLM would view the text.

Tokens are often ~4 characters.

In [None]:
from langchain.text_splitter import TokenTextSplitter

To really get a sense of the difference between tokens and characters, let's initialize the token text splitter with a chunk size of 1 and a chunk overlap of 0, so that it splits any text into a list of its tokens. Let's create a fun made-up text. When we split it, we can see that it's split into a bunch of different tokens, and they all differ a little in the number of characters they contain. 


In [None]:
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [None]:
text1 = "foo bar bazzyfoo"

In [None]:
text_splitter.split_text(text1)

*OUTPUT*

```
['foo', ' bar', ' b', 'az', 'zy', 'foo']
```
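The token list above can be examined in plain Python to see how much token lengths vary. This sketch only reuses the output shown above; it does not call a tokenizer.

```python
# Tokens copied from the splitter output above.
tokens = ['foo', ' bar', ' b', 'az', 'zy', 'foo']

lengths = [len(t) for t in tokens]      # characters per token
average = sum(lengths) / len(tokens)    # mean characters per token

print(lengths, round(average, 2))
print("".join(tokens))                  # the tokens reassemble the original text
```

For this made-up text the tokens are shorter than the ~4 character rule of thumb, since an unusual word like "bazzyfoo" is broken into several small tokens.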

In [None]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [None]:
docs = text_splitter.split_documents(pages)

In [None]:
docs[0]

*OUTPUT*

```
Document(page_content='MachineLearning-Lecture01  \n', metadata={'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0})
```

In [None]:
pages[0].metadata

*OUTPUT*

```
{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}
```

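Each chunk carries the same metadata as the page it came from, as we can see by comparing it with the metadata of the first page. Metadata can contain information like where in the document the chunk came from and where it is relative to other things or concepts in the document. Generally, this information can be used when answering questions to provide more context about what this chunk is exactly. 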

## Context aware splitting

Chunking aims to keep text with common context together.

Text splitting often uses sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be explicitly used in splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.

This text splitter splits a Markdown file based on its headers and subheaders, then adds those headers to the metadata fields, which get passed along to any chunks that originate from those splits.  

In [None]:
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [None]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

In [None]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [None]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [None]:
md_header_splits[0]

*OUTPUT*

```
Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})
```

In [None]:
md_header_splits[1]

*OUTPUT*

```
Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'})
```
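To make the mechanism concrete, here is a minimal plain-Python sketch of header-aware splitting. The `split_by_headers` function and its dictionary output are hypothetical, written only to illustrate how headers can become metadata; `MarkdownHeaderTextSplitter` itself may differ in details.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group lines under their headers and record the active headers as metadata."""
    active = {}             # header level (number of '#') -> (metadata key, title)
    sections, body = [], []

    def flush():
        # Emit the lines collected so far together with the headers above them.
        if body:
            meta = {key: title for _, (key, title) in sorted(active.items())}
            sections.append({"content": "\n".join(body), "metadata": meta})
            body.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        if not line:
            continue
        for marker, key in headers_to_split_on:
            if line.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header replaces any header at the same or a deeper level.
                for old in [lvl for lvl in active if lvl >= level]:
                    del active[old]
                active[level] = (key, line[len(marker):].strip())
                break
        else:
            body.append(line)  # not a header: ordinary content
    flush()
    return sections

doc = ("# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
       "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly")
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
sections = split_by_headers(doc, headers)
for section in sections:
    print(section)
```

When "Chapter 2" starts, the "Section" header from Chapter 1 is dropped, so the last chunk carries only the title and its own chapter.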

Let's try this on a real Markdown file, like a Notion database.

In [None]:
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])

In [None]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

In [None]:
md_header_splits = markdown_splitter.split_text(txt)

In [None]:
md_header_splits[0]

*OUTPUT*

```
Document(page_content="This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change.  \n**Everything related to working at Blendle and the people of Blendle, made public.**  \nThese are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pvs=21) to [how we increase salaries](https://www.notion.so/Salary-Review-e11b6161c6d34f5c9568bb3e83ed96b6?pvs=21), from [how we hire](https://www.notion.so/Hiring-451bbcfe8d9b49438c0633326bb7af0a?pvs=21) and [fire](https://www.notion.so/Firing-5567687a2000496b8412e53cd58eed9d?pvs=21) to [how we think people should give each other feedback](https://www.notion.so/Our-Feedback-Process-eb64f1de796b4350aeab3bc068e3801f?pvs=21) — and much more.  \nWe've made this document public because we want to learn from you. We're very much interested in your feedback (including weeding out typo's and Dunglish ;)). Email us at hr@blendle.com. If you're starting your own company or if you're curious as to how we do things at Blendle, we hope that our employee handbook inspires you.  \nIf you want to work at Blendle you can check our [job ads here](https://blendle.homerun.co/). If you want to be kept in the loop about Blendle, you can sign up for [our behind the scenes newsletter](https://blendle.homerun.co/yes-keep-me-posted/tr/apply?token=8092d4128c306003d97dd3821bad06f2).", metadata={'Header 1': "Blendle's Employee Handbook"})
```