# Text Splitting Documents

Splitting text into chunks lets us control how much text, and therefore roughly how many tokens, we send to a model as context.

We start with the PDF example from the previous section:

In [14]:
# To Read a PDF file; note this requires the PyPDF library
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("sample-files/king.dreamspeech.excerpts.pdf")

docs = loader.load()
print(docs)

[Document(metadata={'producer': 'Adobe PDF Library 10.0', 'creator': 'Acrobat PDFMaker 10.1 for Word', 'creationdate': '2015-02-25T11:08:01-05:00', 'author': 'Sasha Rolon-Pereira', 'moddate': '2015-02-25T11:08:44-05:00', 'title': 'Martin Luther King Jr.pdf', 'source': 'sample-files/king.dreamspeech.excerpts.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}, page_content='©2014 The Gilder Lehrman Institute of American History \nwww.gilderlehrman.org \n“I Have a Dream” Speech by the Rev. Martin Luther King Jr. at the “March on Washington,” \n1963 (excerpts ) \n \nI am happy to join with you today in what will go down in history as the greatest demonstration for \nfreedom in the history of our nation. \nFive score years ago a great American in whose symbolic shadow we stand today signed the \nEmancipation Proclamation. This momentous decree is a great beacon light of hope to millions of Negro \nslaves who had been seared in the flames of withering injustice. It came as a joyous daybre

In [15]:
# Recursively split the documents by characters 
# we need the `langchain-text-splitters` package for this 
from langchain_text_splitters import RecursiveCharacterTextSplitter

# setting up the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, # define how many characters per chunk
    chunk_overlap=200, # define how many characters to overlap between chunks
    length_function=len, # function to calculate the length of the text
)
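The splitter above measures length in characters because `length_function=len`. As a rough illustration of how a different length function changes the accounting, the sketch below counts whitespace-separated words instead. The `word_count` helper is hypothetical, and words are only a crude stand-in for tokens; a real tokenizer will give different numbers.

```python
def word_count(text: str) -> int:
    """Crude token proxy (an assumption): number of whitespace-separated words."""
    return len(text.split())


sentence = "I am happy to join with you today"

char_length = len(sentence)         # what length_function=len measures
word_length = word_count(sentence)  # what a word-based length function would measure

print("characters:", char_length)
print("words:", word_length)
```

Passing a function like this as `length_function` would make `chunk_size` and `chunk_overlap` count words rather than characters.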

In [16]:
# create_documents expects a list of strings, so passing Document objects raises a TypeError
split_docs = text_splitter.create_documents(docs)

TypeError: expected string or bytes-like object, got 'Document'

In [None]:
type(docs[0]) # confirming this returns a Document object

langchain_core.documents.base.Document
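The `TypeError` comes from handing `Document` objects to a method that wants plain strings. The sketch below shows the relationship between the two entry points using toy stand-ins: `Doc`, `split_text`, `create_docs` and `split_docs` are hypothetical names written for this illustration, not LangChain's classes or functions, and the toy splitter ignores overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_text(text: str, chunk_size: int) -> list[str]:
    """Toy splitter: fixed-size pieces, no overlap."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def create_docs(texts: list[str], metadatas: list[dict], chunk_size: int) -> list[Doc]:
    """Takes raw strings: each string is split and every piece is wrapped in a Doc."""
    out = []
    for text, meta in zip(texts, metadatas):
        for piece in split_text(text, chunk_size):
            out.append(Doc(piece, dict(meta)))  # copy metadata onto each piece
    return out


def split_docs(docs: list[Doc], chunk_size: int) -> list[Doc]:
    """Takes Doc objects: unpacks text and metadata, then reuses create_docs."""
    texts = [d.page_content for d in docs]
    metadatas = [d.metadata for d in docs]
    return create_docs(texts, metadatas, chunk_size)


pages = [Doc("I am happy to join with you today", {"page": 0})]
pieces = split_docs(pages, chunk_size=10)
for piece in pieces:
    print(piece)
```

The point of the design is that the document-based entry point is a thin wrapper: it keeps each page's metadata attached to every chunk cut from that page.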

In [None]:
# The loader returns Document objects rather than strings, so we use split_documents instead

final_docs = text_splitter.split_documents(docs) # this will split the documents into smaller chunks as a list of Document objects

print(final_docs[0]) # printing the first chunk to see the result

page_content='©2014 The Gilder Lehrman Institute of American History 
www.gilderlehrman.org 
“I Have a Dream” Speech by the Rev. Martin Luther King Jr. at the “March on Washington,” 
1963 (excerpts ) 
 
I am happy to join with you today in what will go down in history as the greatest demonstration for 
freedom in the history of our nation. 
Five score years ago a great American in whose symbolic shadow we stand today signed the 
Emancipation Proclamation. This momentous decree is a great beacon light of hope to millions of Negro 
slaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to end the 
long night of their captivity. But 100 years later the Negro still is not free. One hundred years later the 
life of the Negro is still badly crippled by the manacles of segregation and the chains of discrimination. 
One hundred years later the Negro lives on a lonely island of poverty in the midst of a vast ocean of' metadata={'producer': 'Adobe PDF Libra

If you compare consecutive documents in `final_docs`, you'll see that some sentences are repeated. This comes from the `chunk_overlap` parameter.

Each chunk also keeps its page content and metadata for reference.
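The repetition between neighbouring chunks can be illustrated with a minimal sliding-window sketch. This is not how the recursive splitter works internally (it cuts at separators, so its overlap is approximate rather than exact); `sliding_chunks` is a hypothetical helper that only shows what `chunk_size` and `chunk_overlap` mean.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows, where each window starts
    chunk_overlap characters before the previous window ended."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Here the last four characters of each chunk reappear as the first four characters of the next one, which is the same effect seen in the repeated sentences of `final_docs`.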

## With Text files

In [None]:
from langchain_community.document_loaders import TextLoader

# Initialize the TextLoader with the path to your text file
loader = TextLoader("sample-files/sample-text.txt", encoding="utf-8")

# Nothing is read yet; evaluating `loader` just displays the TextLoader object
loader

<langchain_community.document_loaders.text.TextLoader at 0x20fffd1b380>

In [None]:
# Read the file directly to get the text as a plain string rather than a Document object
with open("sample-files/sample-text.txt", "r", encoding="utf-8") as f:
    text = f.read()  # read the content of the file


text

'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse laoreet sapien at scelerisque aliquet. Mauris congue suscipit neque id dapibus. Nullam blandit leo id felis mollis, at porttitor nisi lobortis. Aliquam mi ipsum, egestas et scelerisque eget, ultricies vitae odio. Praesent nec neque porta, mollis neque gravida, congue tortor. Donec aliquet, nulla ut ornare vulputate, est turpis pulvinar nunc, eget mattis ante lorem rutrum arcu. Pellentesque elementum eros nulla, vitae malesuada nisi placerat quis. Integer dapibus, turpis a laoreet lacinia, mauris ex ultrices diam, faucibus mollis leo ex eu ex. Nam at nisi egestas, feugiat diam at, fringilla velit. Sed pulvinar tellus sed iaculis pharetra. Maecenas ut gravida diam. Integer accumsan, massa vel convallis maximus, sem purus vestibulum est, eu congue odio diam vitae leo. Donec in auctor dui. Cras vitae est scelerisque urna placerat mollis. In neque metus, fringilla id turpis a, iaculis ultrices lorem. Maecenas vitae rutrum

In [None]:
# Then we can use the recursive text splitter to split the text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # define how many characters per chunk
    chunk_overlap=200,  # define how many characters to overlap between chunks
    length_function=len,  # function to calculate the length of the text
)  

text_split = text_splitter.create_documents([text])  # create_documents takes a list of strings and returns a list of Document objects

print(text_split[0])  # printing the first chunk to see the result

page_content='Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse laoreet sapien at scelerisque aliquet. Mauris congue suscipit neque id dapibus. Nullam blandit leo id felis mollis, at porttitor nisi lobortis. Aliquam mi ipsum, egestas et scelerisque eget, ultricies vitae odio. Praesent nec neque porta, mollis neque gravida, congue tortor. Donec aliquet, nulla ut ornare vulputate, est turpis pulvinar nunc, eget mattis ante lorem rutrum arcu. Pellentesque elementum eros nulla, vitae malesuada nisi placerat quis. Integer dapibus, turpis a laoreet lacinia, mauris ex ultrices diam, faucibus mollis leo ex eu ex. Nam at nisi egestas, feugiat diam at, fringilla velit. Sed pulvinar tellus sed iaculis pharetra. Maecenas ut gravida diam. Integer accumsan, massa vel convallis maximus, sem purus vestibulum est, eu congue odio diam vitae leo. Donec in auctor dui. Cras vitae est scelerisque urna placerat mollis. In neque metus, fringilla id turpis a, iaculis ultrices lorem. Maecenas

# Text Splitting with Character Text Splitter

The recursive character text splitter is recommended for generic text. It tries a list of separators in order, from paragraph breaks down to single characters, which has the effect of keeping paragraphs together for as long as possible.
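A simplified sketch of that idea follows. It is an illustration under stated assumptions, not LangChain's implementation: `recursive_split` is a hypothetical function, it ignores `chunk_overlap`, and it assumes the separator list ends with the empty string as a last resort.

```python
def recursive_split(text: str, chunk_size: int,
                    separators=("\n\n", "\n", " ", "")) -> list[str]:
    """Split on the coarsest separator first and only fall back to finer
    separators for pieces that are still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging while the result still fits
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece is too long on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


demo_text = (
    "First paragraph is short.\n\n"
    "Second paragraph is a bit longer and will not fit in one chunk.\n\n"
    "Third."
)
demo_chunks = recursive_split(demo_text, chunk_size=40)
for chunk in demo_chunks:
    print(repr(chunk))
```

Short paragraphs survive intact, and only the paragraph that exceeds the limit is broken up further, at word boundaries.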

Another option is the character text splitter, which splits on a single given character sequence; this defaults to two consecutive newlines (`\n\n`). Chunk length is measured in characters.

In [None]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",  # define the separator for splitting the text
    chunk_size=1000,  # define how many characters per chunk
    chunk_overlap=200,  # define how many characters to overlap between chunks
)

text_splitter.split_documents(docs)

[Document(metadata={'producer': 'Adobe PDF Library 10.0', 'creator': 'Acrobat PDFMaker 10.1 for Word', 'creationdate': '2015-02-25T11:08:01-05:00', 'author': 'Sasha Rolon-Pereira', 'moddate': '2015-02-25T11:08:44-05:00', 'title': 'Martin Luther King Jr.pdf', 'source': 'sample-files/king.dreamspeech.excerpts.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}, page_content='©2014 The Gilder Lehrman Institute of American History \nwww.gilderlehrman.org \n“I Have a Dream” Speech by the Rev. Martin Luther King Jr. at the “March on Washington,” \n1963 (excerpts ) \n \nI am happy to join with you today in what will go down in history as the greatest demonstration for \nfreedom in the history of our nation. \nFive score years ago a great American in whose symbolic shadow we stand today signed the \nEmancipation Proclamation. This momentous decree is a great beacon light of hope to millions of Negro \nslaves who had been seared in the flames of withering injustice. It came as a joyous daybre

In [None]:
# wherever the split happens determines the chunk size, so the chunks may not be exactly 1000 characters long
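To make the point about uneven chunk sizes concrete, here is a minimal sketch of splitting on a single separator and then merging neighbouring pieces while they fit. `split_on_separator` is a hypothetical helper, not the `CharacterTextSplitter` code, and it leaves out overlap handling.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split only at `separator`, then glue neighbouring pieces back together
    while the result stays within chunk_size. A piece that contains no
    separator is never cut, so it can end up longer than chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample_text = "aaaa\n\nbbbb\n\n" + "c" * 30 + "\n\ndd"
parts = split_on_separator(sample_text, separator="\n\n", chunk_size=12)
print([len(part) for part in parts])
```

The chunk lengths come out uneven, and the 30-character piece stays whole even though it exceeds the limit, because there is no separator inside it to cut at.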


# HTML Text Splitter

In [None]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample HTML Document</title> 
</head>
<body> 
    <h1>Welcome to the Sample HTML Document</h1>
    <p>This is a sample paragraph in the HTML document.</p>
    <h2>Subheading</h2>
    <p>This is another paragraph under the subheading.</p>
    <h3>Another Subheading</h3>
    <p>Yet another paragraph under a different subheading.</p>
</body> 
</html>
    """

# First create all the headers you want to split by within the HTML document
headers = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]  # each tuple is (tag, label); the label becomes the metadata key in the resulting Documents

html_splitter = HTMLHeaderTextSplitter(headers)

html_header_splits = html_splitter.split_text(html_string)

print(html_header_splits)  # printing the splits to see the result


[Document(metadata={'Header 1': 'Welcome to the Sample HTML Document'}, page_content='Welcome to the Sample HTML Document'), Document(metadata={'Header 1': 'Welcome to the Sample HTML Document'}, page_content='This is a sample paragraph in the HTML document.'), Document(metadata={'Header 1': 'Welcome to the Sample HTML Document', 'Header 2': 'Subheading'}, page_content='Subheading'), Document(metadata={'Header 1': 'Welcome to the Sample HTML Document', 'Header 2': 'Subheading'}, page_content='This is another paragraph under the subheading.'), Document(metadata={'Header 1': 'Welcome to the Sample HTML Document', 'Header 2': 'Subheading', 'Header 3': 'Another Subheading'}, page_content='Another Subheading'), Document(metadata={'Header 1': 'Welcome to the Sample HTML Document', 'Header 2': 'Subheading', 'Header 3': 'Another Subheading'}, page_content='Yet another paragraph under a different subheading.')]
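The metadata in the output above shows each piece of text carrying the headers it sits under. A rough sketch of that bookkeeping, using only the standard library's `html.parser`, is shown below. `HeaderTracker` is a hypothetical class for illustration; it assumes flat, well-formed HTML with no tags nested inside `<p>` or header elements, and it is not how `HTMLHeaderTextSplitter` is implemented.

```python
from html.parser import HTMLParser


class HeaderTracker(HTMLParser):
    """Collect (metadata, text) pairs, where metadata holds the headers
    currently in effect above each paragraph."""

    def __init__(self, labels: dict[str, str]):
        super().__init__()
        self.labels = labels                # e.g. {"h1": "Header 1"}
        self.active: dict[str, str] = {}    # tag -> header text currently in effect
        self.current = None                 # tag we are inside, if any
        self.results: list[tuple[dict, str]] = []

    def handle_starttag(self, tag, attrs):
        self.current = tag

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self.current is None:
            return
        if self.current in self.labels:
            # A new header replaces headers at the same or a deeper level.
            # Comparing "h1" < "h2" as strings works for h1 through h6.
            self.active = {t: v for t, v in self.active.items() if t < self.current}
            self.active[self.current] = text
        elif self.current == "p":
            metadata = {self.labels[t]: v for t, v in self.active.items()}
            self.results.append((metadata, text))


sample_html = """
<h1>Welcome</h1>
<p>Intro paragraph.</p>
<h2>Subheading</h2>
<p>Paragraph under the subheading.</p>
<h2>Second subheading</h2>
<p>Paragraph under the second subheading.</p>
"""

tracker = HeaderTracker({"h1": "Header 1", "h2": "Header 2"})
tracker.feed(sample_html)
for metadata, text in tracker.results:
    print(metadata, "->", text)
```

When the second `<h2>` arrives it replaces the first one in the metadata, while the `<h1>` above it stays in effect.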


## With a URL

In [None]:
# grab a url to split on:
url = "https://plato.stanford.edu/entries/goedel/"

# keeping the same headers as before

html_header_splits = html_splitter.split_text_from_url(url)

print(html_header_splits)  # printing the splits to see the result

[Document(metadata={}, page_content='End container NOTE: Script required for drop-down button to work (mirrors).  \nEnd header wrapper End content End footer  \nEnd header  \nEnd navigation End search  \nStanford Encyclopedia of Philosophy  \nMenu  \nBrowse  \nTable of Contents  \nWhat\'s New  \nRandom Entry  \nChronological  \nArchives  \nAbout  \nEditorial Information  \nAbout the SEP  \nEditorial Board  \nHow to Cite the SEP  \nSpecial Characters  \nAdvanced Tools  \nContact  \nSupport SEP  \nSupport the SEP  \nPDFs for SEP Friends  \nMake a Donation  \nSEPIA for Libraries  \nBegin article sidebar End article sidebar NOTE: Article content must have two wrapper divs: id="article" and id="article-content" End article NOTE: article banner is outside of the id="article" div. End article-banner  \nEntry Navigation  \nEntry Contents  \nBibliography  \nAcademic Tools  \nFriends PDF Preview  \nAuthor and Citation Info  \nBack to Top  \nEnd article-content  \nBEGIN ARTICLE HTML #aueditable D

# Recursive JSON Splitter

In [24]:
import json
import requests 
from langchain_text_splitters import RecursiveJsonSplitter

# Get the data
json_data = requests.get("https://jsonplaceholder.typicode.com/posts").json()  # this endpoint returns a list of JSON objects, not a single dictionary

# Debug the data structure
print(f"Type: {type(json_data)}")
print(f"Length: {len(json_data)}")
print(f"First item: {json_data[0]}")
print(f"First item keys: {json_data[0].keys()}")

Type: <class 'list'>
Length: 100
First item: {'userId': 1, 'id': 1, 'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit', 'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'}
First item keys: dict_keys(['userId', 'id', 'title', 'body'])


In [34]:
# Wrap the list in a dictionary to give it structure
wrapped_data = {"posts": json_data}

In [35]:


json_splitter = RecursiveJsonSplitter(max_chunk_size=10)
json_chunks = json_splitter.split_json(wrapped_data)

if json_chunks:
    print(f"Created {len(json_chunks)} chunks")
    print("First chunk:", json_chunks[0])
    # Lists are not split by default, so the whole "posts" list lands in a single chunk
else:
    print("No chunks were created.")

Created 1 chunks
First chunk: {'posts': [{'userId': 1, 'id': 1, 'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit', 'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'}, {'userId': 1, 'id': 2, 'title': 'qui est esse', 'body': 'est rerum tempore vitae\nsequi sint nihil reprehenderit dolor beatae ea dolores neque\nfugiat blanditiis voluptate porro vel nihil molestiae ut reiciendis\nqui aperiam non debitis possimus qui neque nisi nulla'}, {'userId': 1, 'id': 3, 'title': 'ea molestias quasi exercitationem repellat qui ipsa sit aut', 'body': 'et iusto sed quo iure\nvoluptatem occaecati omnis eligendi aut ad\nvoluptatem doloribus vel accusantium quis pariatur\nmolestiae porro eius odio et labore et velit aut'}, {'userId': 1, 'id': 4, 'title': 'eum et est occaecati', 'body': 'ullam et saepe reiciendis voluptatem adipisci\nsit amet autem ass

In [36]:
for chunk in json_chunks[:3]:
    print(chunk)

{'posts': [{'userId': 1, 'id': 1, 'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit', 'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'}, {'userId': 1, 'id': 2, 'title': 'qui est esse', 'body': 'est rerum tempore vitae\nsequi sint nihil reprehenderit dolor beatae ea dolores neque\nfugiat blanditiis voluptate porro vel nihil molestiae ut reiciendis\nqui aperiam non debitis possimus qui neque nisi nulla'}, {'userId': 1, 'id': 3, 'title': 'ea molestias quasi exercitationem repellat qui ipsa sit aut', 'body': 'et iusto sed quo iure\nvoluptatem occaecati omnis eligendi aut ad\nvoluptatem doloribus vel accusantium quis pariatur\nmolestiae porro eius odio et labore et velit aut'}, {'userId': 1, 'id': 4, 'title': 'eum et est occaecati', 'body': 'ullam et saepe reiciendis voluptatem adipisci\nsit amet autem assumenda provident rerum culpa\n
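Only one chunk comes back even with `max_chunk_size=10`, because the splitter recurses through dictionaries and by default treats a list as a single value. `split_json` accepts a `convert_lists` argument (default `False`) that first turns lists into dictionaries keyed by index. The sketch below illustrates that preprocessing idea in plain Python; `lists_to_dicts` is a hypothetical helper and the sample data is made up for the example.

```python
import json


def lists_to_dicts(data):
    """Recursively replace every list with a dict keyed by each item's index,
    so that a dictionary-based splitter has keys to split on."""
    if isinstance(data, dict):
        return {key: lists_to_dicts(value) for key, value in data.items()}
    if isinstance(data, list):
        return {str(index): lists_to_dicts(item) for index, item in enumerate(data)}
    return data


# Made-up sample with the same shape as the wrapped posts data
wrapped = {"posts": [{"id": 1, "title": "first"}, {"id": 2, "title": "second"}]}
converted = lists_to_dicts(wrapped)
print(json.dumps(converted, indent=2))
```

In the notebook, `json_splitter.split_json(wrapped_data, convert_lists=True)` would be the call to try; the resulting chunk count is not shown here.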

In [None]:
# This splitter can also output documents instead of chunks of JSON data

docs = json_splitter.create_documents(texts=json_data)  # the `texts` parameter expects a list of JSON objects (dicts), so we pass json_data directly instead of the wrapped dictionary

for docs_chunk in docs[:3]:
    print(docs_chunk)  # Each chunk is a Document object with metadata and text content
    print("Text:", docs_chunk.page_content)  # Access the text content of the Document
    print("Metadata:", docs_chunk.metadata)  # Access the metadata of the Document

page_content='{"userId": 1, "id": 1, "title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit"}'
Text: {"userId": 1, "id": 1, "title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit"}
Metadata: {}
page_content='{"body": "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"}'
Text: {"body": "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"}
Metadata: {}
page_content='{"userId": 1, "id": 2, "title": "qui est esse", "body": "est rerum tempore vitae\nsequi sint nihil reprehenderit dolor beatae ea dolores neque\nfugiat blanditiis voluptate porro vel nihil molestiae ut reiciendis\nqui aperiam non debitis possimus qui neque nisi nulla"}'
Text: {"userId": 1, "id": 2, "title": "qui est esse", "body": "est rerum tempore