# Text Splitting in RAG

## Character Text Splitter

In [10]:
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader

In [16]:
loader = TextLoader("state_of_the_union.txt")
documents = loader.load()

In [17]:
documents

[Document(metadata={'source': 'state_of_the_union.txt'}, page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determin

In [18]:
# Split on the default separator ("\n\n"), then merge the pieces into chunks of at most 500 characters
char_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
docs = char_splitter.split_documents(documents)
len(docs)

88

In [19]:
# Count chunks that exceed the 500-character target, printing each chunk's length and text
mtc_count = 0
for doc in docs:
    if len(doc.page_content) > 500:
        mtc_count += 1
    print(len(doc.page_content))
    print(doc.page_content)
    print("\n" + "=" * 50)
print(f"Total documents with more than 500 characters: {mtc_count}")

490
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.

446
Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. 

He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. 

He met the Ukrainian people. 

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.

424
Groups of citizens blocking tanks with their bod
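
The lengths printed above (490, 446, 424) all stay under 500 without hitting it exactly. One way to read this: the splitter first cuts the text on its separator (`"\n\n"` by default) and then packs whole pieces into a chunk until the next piece would overflow `chunk_size`. The sketch below illustrates that idea in plain Python. `split_and_merge` is a hypothetical helper written for illustration, not LangChain's implementation.

```python
def split_and_merge(text: str, separator: str = "\n\n", chunk_size: int = 500) -> list[str]:
    """Simplified illustration: cut on `separator`, then greedily pack whole pieces."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for piece in pieces:
        # Cost of appending this piece, including the separator that rejoins it.
        extra = len(piece) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(piece)
        current.append(piece)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "\n\n".join(f"Paragraph {i} " + "x" * 40 for i in range(10))
chunks = split_and_merge(sample, chunk_size=120)
print([len(c) for c in chunks])
```

In this sketch a single piece longer than `chunk_size` is kept whole, so a text with no separators can still yield an oversized chunk.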

In [45]:
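# Split on single newlines; chunk_overlap is not set, so the library default (200) applies
# and consecutive chunks share some text
text_splitter1 = CharacterTextSplitter(separator="\n", chunk_size=500, is_separator_regex=False)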
docs1 = text_splitter1.split_documents(documents)

In [46]:
len(docs1)

124

In [47]:
mtc_count = 0
for doc in docs1:
    if len(doc.page_content) > 500:
        mtc_count += 1
    print(len(doc.page_content))
    print(doc.page_content)
    print("\n" + "=" * 50)
print(f"Total documents with more than 500 characters: {mtc_count}")

486
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  
Last year COVID-19 kept us apart. This year we are finally together again. 
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 
With a duty to one another to the American people to the Constitution. 
And with an unwavering resolve that freedom will always triumph over tyranny.

472
With a duty to one another to the American people to the Constitution. 
And with an unwavering resolve that freedom will always triumph over tyranny. 
Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. 
He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. 
He met the Ukrainian people.

413
He thought he could roll i
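
The second chunk above starts with lines that already closed the first one. That is consistent with a non-zero `chunk_overlap` (no value was passed in this run, so the library default applies). The sketch below shows one way such an overlap can arise: after a chunk is emitted, trailing pieces are carried over as long as they fit in the overlap budget. `merge_with_overlap` is a hypothetical helper and a simplification, not LangChain's code.

```python
def merge_with_overlap(pieces, separator="\n", chunk_size=500, chunk_overlap=200):
    """Greedily pack pieces into chunks, carrying a tail of earlier pieces forward."""

    def joined_len(items):
        # Length of the items once rejoined with the separator.
        return sum(len(p) for p in items) + len(separator) * max(len(items) - 1, 0)

    chunks = []
    window = []
    for piece in pieces:
        if window and joined_len(window + [piece]) > chunk_size:
            chunks.append(separator.join(window))
            # Drop pieces from the front until the carried tail fits in the
            # overlap budget and leaves room for the incoming piece.
            while window and (
                joined_len(window) > chunk_overlap
                or joined_len(window + [piece]) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(separator.join(window))
    return chunks


lines = [f"line {i:02d} " + "y" * 30 for i in range(8)]
with_overlap = merge_with_overlap(lines, chunk_size=120, chunk_overlap=50)
no_overlap = merge_with_overlap(lines, chunk_size=120, chunk_overlap=0)
print(len(with_overlap), len(no_overlap))
```

Overlap raises the chunk count for the same text, which may partly explain why this run produced 124 chunks against 88 for the run with `chunk_overlap=0`; the different separator also plays a role.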

## Recursive Text Splitter

In [48]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [49]:
rc_splitter = RecursiveCharacterTextSplitter(
    chunk_size= 500
)

In [50]:
docs2 = rc_splitter.split_documents(documents)

In [52]:
len(docs2)

124

In [53]:
mtc_count = 0
for doc in docs2:
    if len(doc.page_content) > 500:
        mtc_count += 1
    print(len(doc.page_content))
    print(doc.page_content)
    print("\n" + "=" * 50)
print(f"Total documents with more than 500 characters: {mtc_count}")

490
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.

476
With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny. 

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. 

He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. 

He met the Ukrainian people.

416
He thought he coul

In [54]:
rc_splitter._separators

['\n\n', '\n', ' ', '']
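
The separator list above is tried in order: paragraphs first, then lines, then words, then single characters. A rough sketch of that recursion, under the assumption that only oversized pieces are split further, might look like this. `recursive_split` is a hypothetical helper; it omits the step that merges small pieces back together, so it is not a drop-in replacement for the library class.

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=500):
    """Simplified sketch: break text into pieces no longer than chunk_size where possible."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Pick the first separator that occurs in the text; "" always matches.
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            break
    else:
        return [text]  # no usable separator left: keep the oversized piece
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
    remaining = separators[i + 1:]
    pieces = []
    for part in text.split(sep):
        if not part:
            continue
        if len(part) <= chunk_size:
            pieces.append(part)
        else:
            # Still too long: retry with the finer-grained separators.
            pieces.extend(recursive_split(part, remaining, chunk_size))
    return pieces


text = "aaa bbb\n\nccc ddd eee fff\nggg"
print(recursive_split(text, chunk_size=8))
```

Because `""` matches any text, a separator listed after it is never tried in this sketch.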

In [56]:
rc_splitter1 = RecursiveCharacterTextSplitter(
    # "" matches anywhere, so it goes last; a separator listed after it would never be tried
    separators=["\n\n", "\n", " ", ",", ".", ""],
    chunk_size=500
)

In [57]:
docs3 = rc_splitter1.split_documents(documents)
len(docs3)

124

## Token Splitter

In [58]:
# Option 1: TokenTextSplitter, where chunk_size is measured in tokens rather than characters
from langchain_text_splitters import TokenTextSplitter

In [59]:
tt_splitter = TokenTextSplitter(chunk_size= 500)

In [60]:
docs4 = tt_splitter.split_documents(documents)
len(docs4)

31

In [61]:
# chunk_size=500 is a token count here, so character lengths are expected to exceed 500
mtc_count = 0
for doc in docs4:
    if len(doc.page_content) > 500:
        mtc_count += 1
    print(len(doc.page_content))
    print(doc.page_content)
    print("\n" + "=" * 50)
print(f"Total documents with more than 500 characters: {mtc_count}")

2148
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny. 

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. 

He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. 

He met the Ukrainian people. 

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. 

Groups of citizens blocking tanks with their bodies. 
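
The first chunk above is 2148 characters long even though `chunk_size` is 500, because the limit counts tokens, and one token usually covers several characters. The sketch below makes the same point with a toy tokenizer. `toy_tokenize` is a hypothetical regex-based stand-in for a real BPE tokenizer, so its token counts will not match tiktoken's.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    # Stand-in tokenizer (an assumption for illustration): a word or punctuation
    # mark with its leading whitespace, or a trailing run of whitespace.
    return re.findall(r"\s*\w+|\s*[^\w\s]|\s+", text)


def split_by_tokens(text: str, chunk_size: int) -> list[str]:
    """Cut the token stream into fixed-size windows and join each window back into text."""
    tokens = toy_tokenize(text)
    return ["".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


text = "Last year COVID-19 kept us apart. This year we are finally together again. " * 20
token_chunks = split_by_tokens(text, chunk_size=50)
for chunk in token_chunks:
    # Token count stays at or below 50 while the character count is several times larger.
    print(len(toy_tokenize(chunk)), len(chunk))
```

Fixed token windows ignore paragraph and sentence boundaries, so a chunk can begin or end mid-sentence.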

In [66]:
# Option 2: recursive character splitting, with chunk length measured in tiktoken tokens
rct_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size= 500)

In [67]:
docs5 = rct_splitter.split_documents(documents)
len(docs5)

32
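
This variant yields 32 chunks against 31 for `TokenTextSplitter`. A plausible reason is that it still breaks on separators and only measures length in tokens, so chunks close at a boundary before the limit instead of filling every window completely. The sketch below shows the general pattern of a packer with a pluggable length function. `pack` and `word_count` are hypothetical names, and counting words is only a stand-in for counting tokens.

```python
from typing import Callable


def pack(pieces: list[str], chunk_size: int,
         length_function: Callable[[str], int] = len,
         separator: str = " ") -> list[str]:
    """Greedily join whole pieces while length_function(chunk) stays within chunk_size."""
    chunks: list[str] = []
    current: list[str] = []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


def word_count(text: str) -> int:
    # Stand-in for a token counter (assumption for illustration only).
    return len(text.split())


sentences = [
    "Last year COVID-19 kept us apart.",
    "This year we are finally together again.",
    "Tonight, we meet as Democrats Republicans and Independents.",
    "But most importantly as Americans.",
]
by_chars = pack(sentences, chunk_size=80)
by_words = pack(sentences, chunk_size=15, length_function=word_count)
print(len(by_chars), len(by_words))
```

The same pieces group differently depending on the unit of measurement, while every chunk still ends on a sentence boundary.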

## HTML Header Text Splitter

Fetch a page by URL and split it on its header tags.

In [68]:
from langchain_text_splitters import HTMLHeaderTextSplitter

In [70]:
headers_to_split = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3")
]

# return_each_element=True returns every HTML element as its own Document
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split, return_each_element=True)

In [71]:
page_url = "https://python.langchain.com/docs/concepts/text_splitters/"

In [72]:
splitted_html = html_splitter.split_text_from_url(page_url)

In [73]:
len(splitted_html)

430

In [74]:
for doc in splitted_html:
    print(doc.metadata)
    print(doc.page_content)
    print("\n" + "=" * 50)

{}
!function(){function t(t){document.documentElement.setAttribute("data-theme",t)}var e=function(){try{return new URLSearchParams(window.location.search).get("docusaurus-theme")}catch(t){}}()||function(){try{return window.localStorage.getItem("theme")}catch(t){}}();null!==e?t(e):window.matchMedia("(prefers-color-scheme: dark)").matches?t("dark"):(window.matchMedia("(prefers-color-scheme: light)").matches,t("light"))}(),function(){try{const n=new URLSearchParams(window.location.search).entries();for(var[t,e]of n)if(t.startsWith("docusaurus-data-")){var a=t.replace("docusaurus-data-","data-");document.documentElement.setAttribute(a,e)}}catch(t){}}(),document.documentElement.setAttribute("data-announcement-bar-initially-dismissed",function(){try{return"true"===localStorage.getItem("docusaurus.announcement.dismiss")}catch(t){}return!1}())

{}
Skip to main content

{}
Our course is now available on LangChain Academy!

{}
Building Ambient Agents with LangGraph

{}
Integrations

{}
API Refer

In [78]:
# Without return_each_element, elements that share the same headers are grouped into one Document
html_splitter1 = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split)

In [79]:
splitted_html1 = html_splitter1.split_text_from_url(page_url)
len(splitted_html1)

19

In [80]:
for i, text in enumerate(splitted_html1):
    print(f"Text {i+1}:")
    print(text.metadata)
    print(text.page_content)
    print("\n" + "=" * 50)

Text 1:
{}
!function(){function t(t){document.documentElement.setAttribute("data-theme",t)}var e=function(){try{return new URLSearchParams(window.location.search).get("docusaurus-theme")}catch(t){}}()||function(){try{return window.localStorage.getItem("theme")}catch(t){}}();null!==e?t(e):window.matchMedia("(prefers-color-scheme: dark)").matches?t("dark"):(window.matchMedia("(prefers-color-scheme: light)").matches,t("light"))}(),function(){try{const n=new URLSearchParams(window.location.search).entries();for(var[t,e]of n)if(t.startsWith("docusaurus-data-")){var a=t.replace("docusaurus-data-","data-");document.documentElement.setAttribute(a,e)}}catch(t){}}(),document.documentElement.setAttribute("data-announcement-bar-initially-dismissed",function(){try{return"true"===localStorage.getItem("docusaurus.announcement.dismiss")}catch(t){}return!1}())  
Skip to main content  
Our course is now available on LangChain Academy!  
Building Ambient Agents with LangGraph  
Integrations  
API Referen
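
In both runs the first chunk has empty metadata and contains inline JavaScript: text that appears before any header has no header to attach to, and script content is treated like any other text. The sketch below shows the header-grouping idea with the standard library only, and additionally skips `<script>` and `<style>` content. `HeaderSplitter` is a hypothetical class written for illustration, not the LangChain implementation.

```python
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    """Stdlib-only sketch: group text under the most recent h1/h2/h3 headers."""

    HEADERS = {"h1": "Header 1", "h2": "Header 2", "h3": "Header 3"}
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.sections = []        # list of (metadata dict, text) pairs
        self._meta = {}
        self._buffer = []
        self._in_header = None    # tag name while inside a header element
        self._header_text = []
        self._skip_depth = 0

    def _flush(self):
        text = " ".join(self._buffer).strip()
        if text:
            self.sections.append((dict(self._meta), text))
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADERS:
            self._flush()         # text so far belongs to the previous headers
            self._in_header = tag
            self._header_text = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == self._in_header:
            level = list(self.HEADERS).index(tag)
            # A new header replaces its own level and clears all deeper levels.
            for name in list(self.HEADERS.values())[level:]:
                self._meta.pop(name, None)
            self._meta[self.HEADERS[tag]] = " ".join(self._header_text).strip()
            self._in_header = None

    def handle_data(self, data):
        if self._skip_depth or not data.strip():
            return
        target = self._header_text if self._in_header else self._buffer
        target.append(data.strip())

    def close(self):
        super().close()
        self._flush()


html = """
<html><head><script>var t = 1;</script></head><body>
<p>Intro text</p>
<h1>Guide</h1><p>Overview</p>
<h2>Setup</h2><p>Install it</p>
<h1>FAQ</h1><p>Answers</p>
</body></html>
"""
parser = HeaderSplitter()
parser.feed(html)
parser.close()
for meta, text in parser.sections:
    print(meta, text)
```

Filtering out chunks with empty metadata after splitting is another option when page chrome such as navigation links should not reach the index.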

## HTML Section Splitter

In [75]:
from langchain_text_splitters import HTMLSectionSplitter

In [83]:
sections_to_split = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("p", "Paragraph"),
    ("div", "Div")
]

In [84]:
section_splitter = HTMLSectionSplitter(sections_to_split, return_each_element=True)

In [86]:
import requests
r = requests.get(page_url)

In [87]:
splitted_html_sections = section_splitter.split_text(r.text)

In [88]:
len(splitted_html_sections)

66

In [89]:
for i, text in enumerate(splitted_html_sections):
    print(f"Text {i+1}:")
    print(text.metadata)
    print(text.page_content)
    print("\n" + "=" * 50)

Text 1:
{'Header 1': '#TITLE#'}
!function(){function t(t){document.documentElement.setAttribute("data-theme",t)}var e=function(){try{return new URLSearchParams(window.location.search).get("docusaurus-theme")}catch(t){}}()||function(){try{return window.localStorage.getItem("theme")}catch(t){}}();null!==e?t(e):window.matchMedia("(prefers-color-scheme: dark)").matches?t("dark"):(window.matchMedia("(prefers-color-scheme: light)").matches,t("light"))}(),function(){try{const n=new URLSearchParams(window.location.search).entries();for(var[t,e]of n)if(t.startsWith("docusaurus-data-")){var a=t.replace("docusaurus-data-","data-");document.documentElement.setAttribute(a,e)}}catch(t){}}(),document.documentElement.setAttribute("data-announcement-bar-initially-dismissed",function(){try{return"true"===localStorage.getItem("docusaurus.announcement.dismiss")}catch(t){}return!1}())

Text 2:
{'Div': 'Skip to main content'}
Skip to main content

Text 3:
{'Div': 'Our Building Ambient Agents with LangGraph 