# Document Splitting

This notebook covers:
- **Using LangChain document splitters**

## Contents
1. [Installation](#installation)
2. [CharacterTextSplitter vs RecursiveCharacterTextSplitter](#vs)
3. [Recursive Text Splitter](#recursivetextsplitter)  
4. [Token splitting](#token)
5. [Context aware splitting](#context)

**Source:** https://learn.deeplearning.ai/langchain-chat-with-your-data/lesson/3/document-splitting

![overview.png](./images/overview.png)

# **Installation** <a name="installation"></a>

In [None]:
!pip install -U langchain openai python-dotenv

In [1]:
import os
import openai
import sys

sys.path.append('../..')

# Load the API key from a local .env file instead of hardcoding it in the notebook
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

# **CharacterTextSplitter vs RecursiveCharacterTextSplitter** <a name="vs"></a>

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

**Parameters:**  
- `chunk_size`: Maximum size of chunks to return
- `chunk_overlap`: Overlap in characters between chunks

In [3]:
chunk_size = 26
chunk_overlap = 4

In [4]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [13]:
text1 = 'abcdefghijklmnopqrstuvwxyz'
print(len(text1))

print(r_splitter.split_text(text1))
print(c_splitter.split_text(text1))

26
['abcdefghijklmnopqrstuvwxyz']
['abcdefghijklmnopqrstuvwxyz']


In [15]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
print(len(text2))

print(r_splitter.split_text(text2))
print(c_splitter.split_text(text2))

33
['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
['abcdefghijklmnopqrstuvwxyzabcdefg']


- Neither `RecursiveCharacterTextSplitter` nor `CharacterTextSplitter` splits `text1`, because its length (26) does not exceed `chunk_size` (26).  
- `CharacterTextSplitter` does not split `text2` either: its default separator is `'\n\n'` (`separator: str = '\n\n'`), which never occurs in the string, so there is nowhere to cut.

<img src="images/chunk_overlap.png" width=500 />
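The overlap behaviour seen for `text2` can be approximated with a plain sliding window. This is a minimal sketch, assuming character-level splitting with no separators; `sliding_window_chunks` is a hypothetical helper written for illustration, not LangChain's implementation.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window already reached the end
            break
    return chunks


demo_text = "abcdefghijklmnopqrstuvwxyzabcdefg"
print(sliding_window_chunks(demo_text, chunk_size=26, chunk_overlap=4))
```

With `chunk_size=26` and `chunk_overlap=4`, the second window starts at character 22, which is why it begins with `wxyz`.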

In [16]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(len(text3))

print(r_splitter.split_text(text3))
print(c_splitter.split_text(text3))

51
['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
['a b c d e f g h i j k l m n o p q r s t u v w x y z']


With a space as its `separator`, `CharacterTextSplitter` produces the same chunks as the recursive splitter:

In [17]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' ',
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
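To see why the chunks come out this way, here is a rough sketch of a split-then-merge strategy: split on the separator, then greedily pack pieces into chunks, carrying a few trailing pieces over as overlap. `merge_pieces` is a hypothetical helper; it only approximates what the library does.

```python
def merge_pieces(pieces, separator, chunk_size, chunk_overlap):
    """Greedily join pieces into chunks of at most chunk_size characters."""
    chunks, current = [], []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Drop leading pieces until what is left fits in the overlap budget
            # and leaves room for the incoming piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


demo_text = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(merge_pieces(demo_text.split(" "), " ", chunk_size=26, chunk_overlap=4))
```

Each chunk holds 13 letters (25 characters), and the last two letters of one chunk (3 characters, within the overlap of 4) open the next one.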

# **Recursive Text Splitter** <a name="recursivetextsplitter"></a>

`RecursiveCharacterTextSplitter` is recommended for generic text. 

In [18]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [19]:
len(some_text)

496

In [20]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator=' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

In [21]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [22]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

<div class="alert alert-info"> 💡<b>Separators:</b>
<ul>
    <li><code>CharacterTextSplitter</code> takes a single <i>string</i> as its separator: <code>separator: str = '\n\n'</code></li>
    <li><code>RecursiveCharacterTextSplitter</code> takes a <i>list</i> of <i>strings</i> that it tries in order: <code>separators: Optional[List[str]]</code></li>
</ul>
</div>
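The idea of trying separators in order can be sketched in a few lines. This is a simplified illustration, assuming no overlap and no merging of small pieces back together (the real splitter does merge them); `recursive_split` is a hypothetical helper.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator found; recurse with finer separators on oversized parts."""
    if len(text) <= chunk_size:
        return [text]
    for index, separator in enumerate(separators):
        if separator == "":
            # Last resort: hard cut every chunk_size characters.
            return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        if separator in text:
            pieces = []
            for part in text.split(separator):
                if part:
                    # Parts that are still too long are split with the remaining separators.
                    pieces.extend(recursive_split(part, separators[index + 1:], chunk_size))
            return pieces
    return [text]


demo_text = "First paragraph here.\n\nSecond one."
print(recursive_split(demo_text, ["\n\n", "\n", " ", ""], chunk_size=25))
```

Because the paragraph break is tried first, the text is cut between paragraphs before any sentence or word boundary is considered.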

Let's reduce the chunk size a bit and add a period to our separators:

In [24]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""]
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [25]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
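The two outputs above are identical and neither cuts at a period, which is consistent with the separators being matched as literal strings in this run. As an assumption to verify against your installed version: recent LangChain releases expose an `is_separator_regex` flag on `RecursiveCharacterTextSplitter` that must be set to `True` for regex separators. The intended difference between the two patterns can be shown with the standard `re` module alone (Python 3.7+):

```python
import re

sample = "First sentence. Second sentence. Third"

# Plain pattern: the matched ". " is consumed, so the period is lost.
consumed = re.split(r"\. ", sample)

# Lookbehind: splits at the zero-width position right after ". ",
# so the period stays attached to the sentence it ends.
kept = re.split(r"(?<=\. )", sample)

print(consumed)
print(kept)
```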

**Load PDF file**

In [35]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("data/MachineLearning-Lecture01.pdf")
pages = loader.load()

print(type(pages))
print(pages[0:1])

<class 'list'>
[Document(page_content='MachineLearning-Lecture01  \nInstructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine \nlearning class. So what I wanna do today is ju st spend a little time going over the logistics \nof the class, and then we\'ll start to  talk a bit about machine learning.  \nBy way of introduction, my name\'s  Andrew Ng and I\'ll be instru ctor for this class. And so \nI personally work in machine learning, and I\' ve worked on it for about 15 years now, and \nI actually think that machine learning is th e most exciting field of all the computer \nsciences. So I\'m actually always excited about  teaching this class. Sometimes I actually \nthink that machine learning is not only the most exciting thin g in computer science, but \nthe most exciting thing in all of human e ndeavor, so maybe a little bias there.  \nI also want to introduce the TAs, who are all graduate students doing research in or \nrelated to the machine learni ng and all aspec

In [28]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [29]:
docs = text_splitter.split_documents(pages)

<div class="alert alert-info"> 💡<b>Split documents:</b>
<code>split_documents(documents: Iterable[Document]) → List[Document]</code>
</div>
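Conceptually, `split_documents` splits each document's text and attaches the parent's metadata to every resulting chunk. A minimal sketch, assuming a stand-in `Doc` class and a hypothetical `split_documents_sketch` helper (neither is LangChain's own code):

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_documents_sketch(documents, split_text):
    """Apply split_text to each document and copy its metadata onto every chunk."""
    chunks = []
    for doc in documents:
        for piece in split_text(doc.page_content):
            # Copy the dict so chunks do not share one mutable metadata object.
            chunks.append(Doc(page_content=piece, metadata=dict(doc.metadata)))
    return chunks


demo_pages = [Doc("line one\nline two", {"source": "example.pdf", "page": 0})]
demo_chunks = split_documents_sketch(demo_pages, lambda text: text.split("\n"))
print(demo_chunks)
```

This is why each chunk can still be traced back to its source file and page after splitting.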

In [40]:
print(len(docs))
print(len(docs[0].page_content))

print(docs[0])


77
985
page_content="MachineLearning-Lecture01  \nInstructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine \nlearning class. So what I wanna do today is ju st spend a little time going over the logistics \nof the class, and then we'll start to  talk a bit about machine learning.  \nBy way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so \nI personally work in machine learning, and I' ve worked on it for about 15 years now, and \nI actually think that machine learning is th e most exciting field of all the computer \nsciences. So I'm actually always excited about  teaching this class. Sometimes I actually \nthink that machine learning is not only the most exciting thin g in computer science, but \nthe most exciting thing in all of human e ndeavor, so maybe a little bias there.  \nI also want to introduce the TAs, who are all graduate students doing research in or \nrelated to the machine learni ng and all aspects of machin e learning

In [31]:
len(pages)

22

Our PDF has **22 pages**. The text splitter has split these `Document`s into **77 chunks**, where each chunk's `page_content` is no longer than 1000 characters (`chunk_size`).

**Load Notion zip file**

In [41]:
!unzip -o "data/27261042-ae4b-4a74-b2d7-6154d7246eb4_Export-0d6d611c-c647-4296-b0f6-4c989f8c5d0d.zip" -d "data/Notion_DB"

Archive:  data/27261042-ae4b-4a74-b2d7-6154d7246eb4_Export-0d6d611c-c647-4296-b0f6-4c989f8c5d0d.zip
  inflating: data/Notion_DB/MLOps tools 2023 830865f5e014447eb4c8c2cf5dbb7367.md  


In [50]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("data/Notion_DB")
notion_db = loader.load()
notion_db

[Document(page_content='# MLOps tools 2023\n\n[https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/MLOps-Landscape-in-2023-Top-Tools-and-Platforms-5.png?ssl=1](https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/MLOps-Landscape-in-2023-Top-Tools-and-Platforms-5.png?ssl=1)\n\n![https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/MLOps-Landscape-in-2023-Top-Tools-and-Platforms-5.png?ssl=1](https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/MLOps-Landscape-in-2023-Top-Tools-and-Platforms-5.png?ssl=1)', metadata={'source': 'data/Notion_DB/MLOps tools 2023 830865f5e014447eb4c8c2cf5dbb7367.md'})]

In [47]:
docs = text_splitter.split_documents(notion_db)

In [48]:
len(notion_db)

1

In [49]:
len(docs)

1

# **Token splitting** <a name="token"></a>

We can also split on token count explicitly, if we want.

This can be useful because LLMs often have context windows designated in **tokens**.

Tokens are often ~4 characters.
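As a back-of-the-envelope illustration of that rule of thumb, the sketch below converts a character count into an approximate token count. The ratio of 4 is only the rough figure quoted above; real counts depend on the tokenizer and the text, and `estimate_tokens` is a hypothetical helper.

```python
import math

CHARS_PER_TOKEN = 4  # rough rule of thumb, not an exact value


def estimate_tokens(text):
    """Approximate the number of tokens in text from its character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


# A 1000-character chunk is on the order of 250 tokens.
print(estimate_tokens("x" * 1000))
```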

In [53]:
!pip install -U tiktoken

Collecting tiktoken
  Obtaining dependency information for tiktoken from https://files.pythonhosted.org/packages/f4/2e/0adf6e264b996e263b1c57cad6560ffd5492a69beb9fd779ed0463d486bc/tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.5.1


In [61]:
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

text1 = "foo bar bazzyfoo"

text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']

<div class="alert alert-info"> 💡 <b>Token splitter:</b>

<code>TokenTextSplitter</code> splits raw text using a <i>model tokenizer</i>: it first converts the text into <b>BPE tokens</b>, then groups these tokens into chunks, and finally converts the tokens of each chunk back into text.
<ul>
    <li> It relies on an OpenAI tokenizer (via <code>tiktoken</code>), not on a call to a language model, to split text into fragments based on tokens. By default it uses <code>encoding_name: str = 'gpt2'</code>.</li>
    <li> <b>Byte-Pair Encoding (BPE)</b> was initially developed as an algorithm to compress texts, and then used by OpenAI for tokenization when pretraining the GPT model. It’s used by a lot of Transformer models, including GPT, GPT-2, RoBERTa, BART, and DeBERTa. <a href="https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt#byte-pair-encoding-tokenization">More</a>.</li>
</ul>
</div>
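The tokenize, group, decode sequence described above can be sketched without any tokenizer library. The example assumes a toy whitespace tokenizer (`toy_tokenize`, hypothetical) in place of a real BPE encoding, so its token boundaries differ from what `tiktoken` would produce.

```python
import re


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: each word keeps its leading whitespace."""
    return re.findall(r"\s*\S+", text)


def split_by_tokens(text, chunk_size, chunk_overlap=0):
    """Group tokens into windows of chunk_size tokens and join each window back into text."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = toy_tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append("".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):  # last window reached the end
            break
    return chunks


print(split_by_tokens("foo bar bazzyfoo", chunk_size=2))
```

A real BPE tokenizer would break `bazzyfoo` into several sub-word tokens, as the `chunk_size=1` output earlier in this section shows.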


In [65]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [71]:
docs = text_splitter.split_documents(pages)
docs[0]

Document(page_content='MachineLearning-Lecture01  \n', metadata={'source': 'data/MachineLearning-Lecture01.pdf', 'page': 0})

In [68]:
pages[0].metadata

{'source': 'data/MachineLearning-Lecture01.pdf', 'page': 0}

# **Context aware splitting** <a name="context"></a>

Chunking aims to keep text with common context together.

Text splitters often use sentences or other delimiters to keep related text together, but many documents (such as Markdown files) have structure (headers) that can be used explicitly when splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.

In [72]:
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

**Parameters:** 
- `headers_to_split_on` _List[Tuple[str, str]]_ – Headers we want to track
- `return_each_line` _bool_ – Return each line w/ associated headers

In [75]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

In [76]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [94]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    return_each_line=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)

print(len(md_header_splits))
print(md_header_splits)
print()
print(md_header_splits[0])
print(md_header_splits[1])


3
[Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}), Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}), Document(page_content='Hi this is Molly', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 2'})]

page_content='Hi this is Jim  \nHi this is Joe' metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}
page_content='Hi this is Lance' metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}


In [95]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    return_each_line=True
)
md_header_splits = markdown_splitter.split_text(markdown_document)

print(len(md_header_splits))
print(md_header_splits)
print()
print(md_header_splits[0])
print(md_header_splits[1])

4
[Document(page_content='Hi this is Jim', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}), Document(page_content='Hi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}), Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}), Document(page_content='Hi this is Molly', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 2'})]

page_content='Hi this is Jim' metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}
page_content='Hi this is Joe' metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}


<div class="alert alert-info"> 💡 <b>Markdown splitter:</b>

<code>return_each_line</code>: returns each line as its own chunk instead of aggregating lines that share the same headers

</div>
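The header-tracking idea can be sketched with the standard library: remember the most recent header at each level, and tag every content line with the headers currently in effect. `split_by_headers` is a hypothetical helper that mimics the aggregated (`return_each_line=False`) behaviour; it is not the library's implementation.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group content lines by the Markdown headers that are in effect above them."""
    # Check longer markers first so "##" is never mistaken for "#".
    headers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    current = {}  # header level -> (metadata key, header title)
    chunks = []
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        for marker, name in headers:
            if line.startswith(marker + " "):
                level = len(marker)
                # A new header closes every header at the same or a deeper level.
                current = {lvl: v for lvl, v in current.items() if lvl < level}
                current[level] = (name, line[len(marker):].strip())
                break
        else:
            metadata = {name: title for name, title in current.values()}
            if chunks and chunks[-1]["metadata"] == metadata:
                chunks[-1]["content"] += "\n" + line  # same headers: aggregate
            else:
                chunks.append({"content": line, "metadata": metadata})
    return chunks


demo_markdown = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
demo_headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for chunk in split_by_headers(demo_markdown, demo_headers):
    print(chunk)
```

Starting `## Chapter 2` clears the `Header 3` entry, so Molly's line is tagged only with the title and its chapter.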


Let's try it on a real Markdown file, such as a Notion database export.

In [96]:
loader = NotionDirectoryLoader("data/Notion_DB")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])

In [100]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

md_header_splits = markdown_splitter.split_text(txt)

md_header_splits

[Document(page_content='[https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/MLOps-Landscape-in-2023-Top-Tools-and-Platforms-5.png?ssl=1](https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/MLOps-Landscape-in-2023-Top-Tools-and-Platforms-5.png?ssl=1)  \n![https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/MLOps-Landscape-in-2023-Top-Tools-and-Platforms-5.png?ssl=1](https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/MLOps-Landscape-in-2023-Top-Tools-and-Platforms-5.png?ssl=1)', metadata={'Header 1': 'MLOps tools 2023'})]