# Document Splitting
<img src="./images/L2-Document_splitting.png" alt="workflow" width="500"/>

<img src="./images/L2_textsplitter.png" alt="methods" width="500"/>

- Splitting breaks a long document into smaller chunks that can **fit** into your **model's context window**.
- LangChain has built-in document transformers that make it easy to **split, combine, filter**, and otherwise **manipulate documents**. They ship in the **`langchain-text-splitters`** package.
- A good splitter keeps semantically related pieces of text together.

1. How the text is split2. 
How the chunk size is measure

- Split the text up into small, semantically meaningful chunks (often sentences).
- Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
- Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).
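
The three steps above can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: the function name `merge_with_overlap` and its parameters are hypothetical, and the small pieces are assumed to be already split (here, on whitespace).

```python
from typing import Callable, List


def merge_with_overlap(
    pieces: List[str],
    chunk_size: int,
    chunk_overlap: int,
    length_fn: Callable[[str], int] = len,
    joiner: str = " ",
) -> List[str]:
    """Greedily pack small pieces into chunks, reusing trailing pieces as overlap."""
    chunks: List[str] = []
    current: List[str] = []

    def size(parts: List[str]) -> int:
        # Chunk size is measured on the joined text by a pluggable function.
        return length_fn(joiner.join(parts))

    for piece in pieces:
        if current and size(current + [piece]) > chunk_size:
            # The chunk is full: emit it as its own piece of text.
            chunks.append(joiner.join(current))
            # Keep only a tail that fits the overlap budget and still
            # leaves room for the incoming piece.
            while current and (
                size(current) > chunk_overlap
                or size(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(joiner.join(current))
    return chunks


pieces = "aa bb cc dd".split()
print(merge_with_overlap(pieces, chunk_size=5, chunk_overlap=2))
# ['aa bb', 'bb cc', 'cc dd']
```

Swapping `length_fn` for a token counter changes how size is measured without touching how the text is split, which is the second axis listed above.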

In [240]:
!pip install langchain
!pip install -U langchain-community
!pip install pymupdf
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.7.0-cp312-cp312-win_amd64.whl.metadata (6.8 kB)
Downloading tiktoken-0.7.0-cp312-cp312-win_amd64.whl (799 kB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.7.0


In [166]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

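## RecursiveCharacterTextSplitter
- `split_text`: takes a string and returns a list of strings
- `create_documents`: takes a list of strings and returns a list of `Document` objects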

In [197]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=5,
    chunk_overlap=2,
    separators=[
        "\n\n",
        "\n",
        " ",
        "",
        # ---- Optional extra separators for punctuation and for languages without word spaces ----
        # ".",
        # ",",
        # "\u200b",  # Zero-width space
        # "\uff0c",  # Fullwidth comma
        # "\u3001",  # Ideographic comma
        # "\uff0e",  # Fullwidth full stop
        # "\u3002",  # Ideographic full stop    
    ],
    length_function=len,
)
text1 = '12\n\n123 456789\n123456\n\n123456789'

split_list = text_splitter.split_text(text1)
print("After Split: ",split_list)
for i in range(len(split_list)):
    split_list[i] = split_list[i]+"->"+str(len(split_list[i])) 
print(split_list)

text2 = text_splitter.create_documents([text1])
# text2



After Split:  ['12', '123', '4567', '6789', '1234', '3456', '1234', '34567', '6789']
['12->2', '123->3', '4567->4', '6789->4', '1234->4', '3456->4', '1234->4', '34567->5', '6789->4']
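
The idea of trying separators in order, from coarse to fine, can be sketched without the library. This is a simplified illustration under stated assumptions: the hypothetical `recursive_split` below only splits; it does not merge small pieces back together or add overlap, so its output differs from the `RecursiveCharacterTextSplitter` result above.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the first separator; recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing finer to try: keep the oversized piece whole.
        return [text]

    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut into fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    result: List[str] = []
    for piece in text.split(sep):
        result.extend(recursive_split(piece, finer, chunk_size))
    return result


text1 = "12\n\n123 456789\n123456\n\n123456789"
print(recursive_split(text1, ["\n\n", "\n", " ", ""], chunk_size=5))
# ['12', '123', '45678', '9', '12345', '6', '12345', '6789']
```

Every piece ends up at most `chunk_size` characters long because the empty-string separator guarantees a final fallback.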


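## CharacterTextSplitter
- Splits on a single separator only. Unlike `RecursiveCharacterTextSplitter`, it has no finer separators to fall back on, so a piece longer than `chunk_size` is kept whole.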

In [170]:
text4 = '''12345  6789\n\n123456789123456789123456789\n123456789 '''
c_splitter = CharacterTextSplitter(
    chunk_size=10,
    chunk_overlap=1,
    length_function= len,
    separator = '\n\n'
)
split_list2 = c_splitter.split_text(text4)
print("After Split: ",split_list2)
for i in range(len(split_list2)):
    split_list2[i] = split_list2[i]+"->"+str(len(split_list2[i])) 
print(split_list2)

Created a chunk of size 11, which is longer than the specified 10


After Split:  ['12345  6789', '123456789123456789123456789\n123456789']
['12345  6789->11', '123456789123456789123456789\n123456789->37']
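
The oversized chunks above follow from splitting on one separator only. A plain-Python sketch, assuming just a split-and-strip step (it does not model the library's merging or overlap), reproduces the same pieces and lengths:

```python
# Same content as text4 above, written as a regular string literal.
text4 = "12345  6789\n\n123456789123456789123456789\n123456789 "
chunk_size = 10

# Split on the single separator and strip surrounding whitespace.
pieces = [p.strip() for p in text4.split("\n\n")]

for piece in pieces:
    status = "too long" if len(piece) > chunk_size else "ok"
    print(repr(piece), len(piece), status)
# '12345  6789' 11 too long
# '123456789123456789123456789\n123456789' 37 too long
```

Neither piece contains another `"\n\n"`, so there is nothing left to split on and both stay above the 10-character limit.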


Next, let's load a PDF and split it on newlines into chunks of up to 1000 characters:


In [209]:
from langchain.document_loaders import PyMuPDFLoader

# Specify the path to your PDF file
pdf_file_path = "docs/cs229_lectures/MachineLearning-Lecture01.pdf"

# Load the PDF document
loader = PyMuPDFLoader(pdf_file_path)
documents = loader.load()

# # Now you can use the documents in LangChain
# for doc in documents:
#     print(doc)

In [211]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [213]:
docs = text_splitter.split_documents(documents)

In [215]:
len(docs)

78

In [244]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
notion_db = loader.load()

In [246]:
len(notion_db)

52

In [248]:
len(docs)

78

## Token splitting

We can also split on token count explicitly, if we want.

This can be useful because LLM context windows are usually specified in tokens.

A token is often about 4 characters of English text.
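
The 4-characters rule of thumb gives a quick way to size chunks before running a real tokenizer. The helper below is a hypothetical sketch, not part of LangChain or tiktoken, and the ratio is only an approximation: actual counts depend on the tokenizer and the text.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


sample = "foo bar bazzyfoo"
print(len(sample), estimate_tokens(sample))
# 16 4
```

A real tokenizer may produce a different count for the same string, so use the estimate for budgeting only and measure with the model's tokenizer when the limit matters.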


In [242]:
from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']

In [251]:
docs = text_splitter.split_documents(docs)

In [253]:
docs[0]

Document(metadata={'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'file_path': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0, 'total_pages': 22, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'PScript5.dll Version 5.2.2', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creationDate': "D:20080711112523-07'00'", 'modDate': "D:20080711112523-07'00'", 'trapped': ''}, page_content='Machine')

In [33]:
pages[0].metadata

{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}

## Context aware splitting

Chunking aims to keep text with common context together.

Text splitters often use sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly when splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.


In [255]:
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [271]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)
type(md_header_splits),len(md_header_splits)

(list, 3)

In [273]:
md_header_splits[0]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}, page_content='Hi this is Jim  \nHi this is Joe')

In [275]:
md_header_splits[1]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}, page_content='Hi this is Lance')
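
The header tracking behind these results can be sketched in plain Python. This is a simplified illustration, not the `MarkdownHeaderTextSplitter` implementation: the function name `split_by_headers` is hypothetical, chunks are returned as plain dicts rather than `Document` objects, and lines are joined with a single newline.

```python
import re
from typing import Dict, List, Tuple


def split_by_headers(markdown: str, headers: List[Tuple[str, str]]) -> List[Dict]:
    """Group body lines under their enclosing headers, stored as metadata."""
    names = dict(headers)                     # marker ("##") -> metadata key
    active: Dict[int, Tuple[str, str]] = {}   # header level -> (key, title)
    buffer: List[str] = []
    chunks: List[Dict] = []

    def flush() -> None:
        # Emit the buffered body text with the headers currently in effect.
        content = "\n".join(buffer).strip()
        if content:
            metadata = {key: title for _, (key, title) in sorted(active.items())}
            chunks.append({"metadata": metadata, "content": content})
        buffer.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        match = re.match(r"^(#+)\s+(.*)$", line)
        if match and match.group(1) in names:
            flush()
            level = len(match.group(1))
            # A new header closes any header at the same or a deeper level.
            for lvl in [k for k in active if k >= level]:
                del active[lvl]
            active[level] = (names[match.group(1)], match.group(2).strip())
        elif line:
            buffer.append(line)
    flush()
    return chunks


doc = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
header_map = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for chunk in split_by_headers(doc, header_map):
    print(chunk)
```

Each chunk carries the full header path that was in effect where its text appeared, which is what lets the Chapter 2 chunk drop the `Header 3` entry inherited by the Section chunk.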

In [285]:
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])
# txt

In [44]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

In [45]:
md_header_splits = markdown_splitter.split_text(txt)

In [46]:
md_header_splits[0]

{'content': "We kick off with the practical side of things and then dig into the idea behind it.  \n- **How *time off* works at Blendle**\n- Time off is about the time you **need,** not about a **quota.**\n- At Blendle, **HR doesn't keep track** of your holidays **and we don't 'pay out' at the end of the ride.** When in doubt: 4-6 weeks is a good bandwidth. Less than that is not enough, more than that can happen, just check with your lead if you're in doubt if it's reasonable.\n- **We stick to the commonly used national holidays**, which comes down to ~8 days per year. We are a startup and there are teams who have work to be done 24/7. We don't like being told whether we are off or not on Eid al-Fitr (Suikerfeest, ending of Ramadan) by some rules someone made up, so this is a guideline we use: feel free to work or be off when you want.\n- Make sure to **take enough moments of rest when you have periods of working hard**. For example: worked a few nights to finish a project? Go home at 