## Chunking often aims to keep text with common context together

In [81]:
import os
os.environ["OPENAI_API_KEY"] = "sk--"

In [2]:
%pip install -qU langchain-text-splitters

Note: you may need to restart the kernel to use updated packages.


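CharacterTextSplitter splits text on a single separator (by default "\n\n") and then merges the resulting pieces into chunks of up to chunk_size characters. It never cuts inside a piece, so a long paragraph that contains no separator can come out larger than chunk_size.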

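RecursiveCharacterTextSplitter also targets a character limit, but it works through a list of separators (by default "\n\n", "\n", " ", and "") from coarsest to finest, falling back to the next separator only for pieces that are still too long. This keeps paragraphs, then lines, then words together for as long as possible.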
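To make the separator fallback concrete, here is a minimal pure-Python sketch of the idea. It is a simplified illustration, not the library's implementation: the helper names `merge_pieces` and `recursive_split` are hypothetical, and chunk overlap is left out entirely.

```python
def merge_pieces(pieces, sep, chunk_size):
    """Greedily join adjacent pieces with `sep` while staying within chunk_size."""
    merged, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged


def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator; recurse with finer ones only for long pieces."""
    if len(text) <= chunk_size or not separators:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, small = [], []
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) <= chunk_size:
            small.append(piece)
        else:
            # Flush the short pieces collected so far, then split the long one further.
            chunks.extend(merge_pieces(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(piece, chunk_size, finer))
    chunks.extend(merge_pieces(small, sep, chunk_size))
    return chunks


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is quite a bit longer and will not fit in one chunk."
)
chunks = recursive_split(sample, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The short paragraph survives whole because the "\n\n" split already fits, while the long one falls through to the " " separator and is cut at word boundaries.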

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load example document
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

In [4]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,  # maximum number of characters allowed in each chunk
    chunk_overlap=20,  # consecutive chunks may share up to 20 characters
    length_function=len,  # measure chunk length in characters
    is_separator_regex=False,  # treat separators as plain text, not regex patterns
)

In [5]:
texts = text_splitter.create_documents([state_of_the_union])

In [6]:
texts

[Document(metadata={}, page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and'),
 Document(metadata={}, page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.'),
 Document(metadata={}, page_content='Last year COVID-19 kept us apart. This year we are finally together again.'),
 Document(metadata={}, page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.'),
 Document(metadata={}, page_content='With a duty to one another to the American people to the Constitution.'),
 Document(metadata={}, page_content='And with an unwavering resolve that freedom will always triumph over tyranny.'),
 Document(metadata={}, page_content='Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he'),
 Document(metadata={}, page_content='world thinking he could make it bend to his menacing ways. But he badly miscalculated

In [7]:
print(texts[0])
print("\n")
print("----")
print("\n")
print(texts[1])

page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and'


----


page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.'


In [8]:
texts2 = text_splitter.split_text(state_of_the_union)

In [9]:
texts2[:3]

['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and',
 'of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.',
 'Last year COVID-19 kept us apart. This year we are finally together again.']
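The effect of chunk_overlap can be checked directly on the chunks shown above. The sketch below uses a hypothetical helper, `shared_overlap`, that finds the longest suffix of one chunk that is also a prefix of the next; the three strings are copied from the output.

```python
def shared_overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


first = (
    "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. "
    "Members of Congress and"
)
second = "of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans."
third = "Last year COVID-19 kept us apart. This year we are finally together again."

overlap = shared_overlap(first, second)
print(repr(overlap), len(overlap))
print(repr(shared_overlap(second, third)))
```

The shared text between the first two chunks is shorter than the configured 20 characters, which suggests that chunk_overlap acts as an upper bound applied to whole pieces rather than an exact character count. The second and third chunks share nothing, so overlap is not guaranteed between every pair.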

In [12]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,  # maximum number of characters allowed in each chunk
    chunk_overlap=20,  # consecutive chunks may share up to 20 characters
    length_function=len,  # measure chunk length in characters
    is_separator_regex=False,  # treat separators as plain text, not regex patterns
    # Separators are tried in order. The final ", " adds nothing here: any text
    # that contains ", " is already split by the earlier ",".
    separators=[
        "\n\n",
        "\n",
        ".",
        ",",
        ", "]
)
text_splitter.split_text(state_of_the_union)

['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman',
 '. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.',
 'Last year COVID-19 kept us apart. This year we are finally together again.',
 'Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.',
 'With a duty to one another to the American people to the Constitution.',
 'And with an unwavering resolve that freedom will always triumph over tyranny.',
 'Six days ago',
 ', Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways',
 '. But he badly miscalculated.',
 'He thought he could roll into Ukraine and the world would roll over',
 '. Instead he met a wall of strength he never imagined.',
 'He met the Ukrainian people.',
 'From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination',
 ', inspires the world',
 '.',
 'Groups of

# HTMLHeaderTextSplitter 

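HTMLHeaderTextSplitter splits HTML at the header tags you specify and attaches the headers that apply to each chunk as metadata. It can return chunks element by element or combine elements with the same metadata, with the objectives of 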

(a) keeping related text grouped (more or less) semantically and 

(b) preserving context-rich information encoded in document structures. 
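The idea of carrying header context along with each chunk can be sketched with the standard library's html.parser. This is a toy illustration under simplifying assumptions (only flat h1 to h6 and p tags, no nested inline markup); the class name `HeaderGrouper` is hypothetical and the output will not match the library's exactly.

```python
from html.parser import HTMLParser


class HeaderGrouper(HTMLParser):
    """Toy header-aware splitter: tags every text node with the headers above it."""

    def __init__(self, header_names):
        super().__init__()
        self.header_names = header_names  # e.g. {"h1": "Header 1", "h2": "Header 2"}
        self.active = {}                  # header tag -> current header text
        self.open_tag = None              # tag whose text is currently being read
        self.chunks = []                  # list of (metadata, text) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tag = tag

    def handle_endtag(self, tag):
        self.open_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.open_tag in self.header_names:
            # A new header replaces itself and clears every deeper level.
            level = int(self.open_tag[1])
            self.active = {t: v for t, v in self.active.items() if int(t[1]) < level}
            self.active[self.open_tag] = text
        else:
            metadata = {self.header_names[t]: v for t, v in sorted(self.active.items())}
            self.chunks.append((metadata, text))


sample_html = """
<h1>Ecosystems</h1>
<p>Overview of ecosystems.</p>
<h2>Forest Ecosystems</h2>
<p>Dense tree cover.</p>
<h2>Desert Ecosystems</h2>
<p>Limited water.</p>
"""

grouper = HeaderGrouper({"h1": "Header 1", "h2": "Header 2"})
grouper.feed(sample_html)
for metadata, text in grouper.chunks:
    print(metadata, "->", text)
```

Each paragraph keeps the chain of headers it sits under, which is the context a retriever would otherwise lose once the page is cut into chunks.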

In [44]:
html_string = """
<!DOCTYPE html>
<html>
<head>
    <title>School Project on Ecosystems</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            line-height: 1.6;
            color: #333;
        }
        h1, h2, h3 {
            color: #2c3e50;
        }
        .section {
            margin-bottom: 20px;
        }
        .subsection {
            margin-left: 20px;
        }
    </style>
</head>
<body>
    <div>
        <h1>Ecosystems</h1>
        <p>This project explores the components, types, and importance of ecosystems in the environment.</p>
        
        <div class="section">
            <h2>Introduction to Ecosystems</h2>
            <p>An ecosystem includes all the living organisms and non-living elements in a specific area and how they interact.</p>
        </div>

        <div class="section">
            <h2>Types of Ecosystems</h2>
            <p>There are various types of ecosystems, each with unique characteristics and species.</p>

            <div class="subsection">
                <h3>Forest Ecosystems</h3>
                <p>Forests are characterized by dense tree cover and a rich diversity of plants and animals.</p>
            </div>
            <div class="subsection">
                <h3>Aquatic Ecosystems</h3>
                <p>Aquatic ecosystems include freshwater and marine environments, each supporting various life forms adapted to water.</p>
            </div>
            <div class="subsection">
                <h3>Desert Ecosystems</h3>
                <p>Deserts have limited water availability and vegetation but support specialized plant and animal life.</p>
            </div>
        </div>

        <div class="section">
            <h2>Importance of Ecosystems</h2>
            <p>Ecosystems provide numerous essential services such as air purification, climate regulation, and biodiversity.</p>
        </div>
        
        <div class="section">
            <h2>Conclusion</h2>
            <p>In conclusion, ecosystems are critical for maintaining balance in the environment, supporting life, and providing resources.</p>
        </div>
    </div>
</body>
</html>
"""

In [45]:
from langchain_text_splitters import HTMLHeaderTextSplitter
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)



In [46]:
html_header_splits

[Document(metadata={}, page_content='Ecosystems'),
 Document(metadata={'Header 1': 'Ecosystems'}, page_content='This project explores the components, types, and importance of ecosystems in the environment.  \nIntroduction to Ecosystems'),
 Document(metadata={'Header 1': 'Ecosystems', 'Header 2': 'Introduction to Ecosystems'}, page_content='An ecosystem includes all the living organisms and non-living elements in a specific area and how they interact.'),
 Document(metadata={'Header 1': 'Ecosystems'}, page_content='Types of Ecosystems'),
 Document(metadata={'Header 1': 'Ecosystems', 'Header 2': 'Types of Ecosystems'}, page_content='There are various types of ecosystems, each with unique characteristics and species.  \nForest Ecosystems'),
 Document(metadata={'Header 1': 'Ecosystems', 'Header 2': 'Types of Ecosystems', 'Header 3': 'Forest Ecosystems'}, page_content='Forests are characterized by dense tree cover and a rich diversity of plants and animals.'),
 Document(metadata={'Header 1':

In [18]:
html_header_splits[1].metadata

{'Header 1': 'Ecosystems'}

In [19]:
html_header_splits[1].page_content

'This project explores the components, types, and importance of ecosystems in the environment.  \nIntroduction to Ecosystems'

In [47]:
html_header_splits[4].metadata

{'Header 1': 'Ecosystems', 'Header 2': 'Types of Ecosystems'}

In [20]:
html_header_splits[4].page_content

'There are various types of ecosystems, each with unique characteristics and species.  \nForest Ecosystems'

In [21]:
html_header_splits[5].page_content

'Forests are characterized by dense tree cover and a rich diversity of plants and animals.'

In [22]:
html_header_splits[6].page_content

'Aquatic Ecosystems'

In [39]:
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on,
    return_each_element=True,
)
html_header_splits_elements = html_splitter.split_text(html_string)

In [40]:
html_header_splits_elements

[Document(metadata={}, page_content='Ecosystems'),
 Document(metadata={'Header 1': 'Ecosystems'}, page_content='This project explores the components, types, and importance of ecosystems in the environment.'),
 Document(metadata={'Header 1': 'Ecosystems'}, page_content='Introduction to Ecosystems'),
 Document(metadata={'Header 1': 'Ecosystems', 'Header 2': 'Introduction to Ecosystems'}, page_content='An ecosystem includes all the living organisms and non-living elements in a specific area and how they interact.'),
 Document(metadata={'Header 1': 'Ecosystems'}, page_content='Types of Ecosystems'),
 Document(metadata={'Header 1': 'Ecosystems', 'Header 2': 'Types of Ecosystems'}, page_content='There are various types of ecosystems, each with unique characteristics and species.'),
 Document(metadata={'Header 1': 'Ecosystems', 'Header 2': 'Types of Ecosystems'}, page_content='Forest Ecosystems'),
 Document(metadata={'Header 1': 'Ecosystems', 'Header 2': 'Types of Ecosystems', 'Header 3': 'Fo

In [41]:
html_header_splits_elements[10]

Document(metadata={'Header 1': 'Ecosystems', 'Header 2': 'Types of Ecosystems'}, page_content='Desert Ecosystems')

In [43]:
html_header_splits_elements[11].page_content

'Deserts have limited water availability and vegetation but support specialized plant and animal life.'

In [23]:
from langchain_text_splitters import HTMLHeaderTextSplitter
headers_to_split_on = [
    ("div", "Division")
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)



In [27]:
html_header_splits

[Document(metadata={}, page_content='Ecosystems  \nThis project explores the components, types, and importance of ecosystems in the environment.  \nIntroduction to Ecosystems  \nAn ecosystem includes all the living organisms and non-living elements in a specific area and how they interact.  \nTypes of Ecosystems  \nThere are various types of ecosystems, each with unique characteristics and species.  \nForest Ecosystems  \nForests are characterized by dense tree cover and a rich diversity of plants and animals.  \nAquatic Ecosystems  \nAquatic ecosystems include freshwater and marine environments, each supporting various life forms adapted to water.  \nDesert Ecosystems  \nDeserts have limited water availability and vegetation but support specialized plant and animal life.  \nImportance of Ecosystems  \nEcosystems provide numerous essential services such as air purification, climate regulation, and biodiversity.  \nConclusion  \nIn conclusion, ecosystems are critical for maintaining bal

In [28]:
print(html_header_splits[0].page_content)

Ecosystems  
This project explores the components, types, and importance of ecosystems in the environment.  
Introduction to Ecosystems  
An ecosystem includes all the living organisms and non-living elements in a specific area and how they interact.  
Types of Ecosystems  
There are various types of ecosystems, each with unique characteristics and species.  
Forest Ecosystems  
Forests are characterized by dense tree cover and a rich diversity of plants and animals.  
Aquatic Ecosystems  
Aquatic ecosystems include freshwater and marine environments, each supporting various life forms adapted to water.  
Desert Ecosystems  
Deserts have limited water availability and vegetation but support specialized plant and animal life.  
Importance of Ecosystems  
Ecosystems provide numerous essential services such as air purification, climate regulation, and biodiversity.  
Conclusion  
In conclusion, ecosystems are critical for maintaining balance in the environment, supporting life, and provid

In [48]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
# splits[80:85]

In [49]:
splits

[Document(metadata={}, page_content='Ecosystems'),
 Document(metadata={'Header 1': 'Ecosystems'}, page_content='This project explores the components, types, and importance of ecosystems in the environment.  \nIntroduction to Ecosystems'),
 Document(metadata={'Header 1': 'Ecosystems', 'Header 2': 'Introduction to Ecosystems'}, page_content='An ecosystem includes all the living organisms and non-living elements in a specific area and how they interact.'),
 Document(metadata={'Header 1': 'Ecosystems'}, page_content='Types of Ecosystems'),
 Document(metadata={'Header 1': 'Ecosystems', 'Header 2': 'Types of Ecosystems'}, page_content='There are various types of ecosystems, each with unique characteristics and species.  \nForest Ecosystems'),
 Document(metadata={'Header 1': 'Ecosystems', 'Header 2': 'Types of Ecosystems', 'Header 3': 'Forest Ecosystems'}, page_content='Forests are characterized by dense tree cover and a rich diversity of plants and animals.'),
 Document(metadata={'Header 1':

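# HTMLSectionSplitter

Like HTMLHeaderTextSplitter, the HTMLSectionSplitter is a "structure-aware" chunker that splits text at the element level and adds metadata for each header "relevant" to any given chunk. In the output below, each chunk carries only its own section header rather than the full header hierarchy.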

In [50]:
from langchain_text_splitters import HTMLSectionSplitter

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"),("h3","Header 3")]

html_splitter = HTMLSectionSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'Header 1': 'Ecosystems'}, page_content='Ecosystems \n This project explores the components, types, and importance of ecosystems in the environment.'),
 Document(metadata={'Header 2': 'Introduction to Ecosystems'}, page_content='Introduction to Ecosystems \n An ecosystem includes all the living organisms and non-living elements in a specific area and how they interact.'),
 Document(metadata={'Header 2': 'Types of Ecosystems'}, page_content='Types of Ecosystems \n There are various types of ecosystems, each with unique characteristics and species.'),
 Document(metadata={'Header 3': 'Forest Ecosystems'}, page_content='Forest Ecosystems \n Forests are characterized by dense tree cover and a rich diversity of plants and animals.'),
 Document(metadata={'Header 3': 'Aquatic Ecosystems'}, page_content='Aquatic Ecosystems \n Aquatic ecosystems include freshwater and marine environments, each supporting various life forms adapted to water.'),
 Document(metadata={'Header 3': 

In [51]:
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3")
]

html_splitter = HTMLSectionSplitter(headers_to_split_on)

html_header_splits = html_splitter.split_text(html_string)

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits

[Document(metadata={'Header 1': 'Ecosystems'}, page_content='Ecosystems \n This project explores the components, types, and importance of ecosystems in the environment.'),
 Document(metadata={'Header 2': 'Introduction to Ecosystems'}, page_content='Introduction to Ecosystems \n An ecosystem includes all the living organisms and non-living elements in a specific area and how they interact.'),
 Document(metadata={'Header 2': 'Types of Ecosystems'}, page_content='Types of Ecosystems \n There are various types of ecosystems, each with unique characteristics and species.'),
 Document(metadata={'Header 3': 'Forest Ecosystems'}, page_content='Forest Ecosystems \n Forests are characterized by dense tree cover and a rich diversity of plants and animals.'),
 Document(metadata={'Header 3': 'Aquatic Ecosystems'}, page_content='Aquatic Ecosystems \n Aquatic ecosystems include freshwater and marine environments, each supporting various life forms adapted to water.'),
 Document(metadata={'Header 3': 

# CharacterTextSplitter

In [54]:
from langchain_text_splitters import CharacterTextSplitter

# Load an example document
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=10,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])

page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny. 

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. 

He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. 

He met the Ukrainian people. 

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'


In [55]:
texts

[Document(metadata={}, page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'),
 Doc

In [57]:
text_splits = text_splitter.split_text(state_of_the_union)

In [58]:
text_splits[0]

'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'

# Markdown doc


In [59]:
markdown_doc = """
# Love

**Love** is a powerful and universal emotion that binds us to others in profound ways. It comes in many forms and is essential to human connection.

## Types of Love

1. **Romantic Love**  
   The deep affection felt between partners, often intense and passionate.

2. **Familial Love**  
   The love shared among family members, providing a foundation of support.

3. **Friendship**  
   A bond of trust and companionship that enhances our lives.

4. **Self-Love**  
   Caring for oneself, essential for mental and emotional well-being.

## Conclusion

Love, in all its forms, enriches our lives, giving us purpose, happiness, and resilience.

"""

In [61]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_doc)
md_header_splits

[Document(metadata={'Header 1': 'Love'}, page_content='**Love** is a powerful and universal emotion that binds us to others in profound ways. It comes in many forms and is essential to human connection.'),
 Document(metadata={'Header 1': 'Love', 'Header 2': 'Types of Love'}, page_content='1. **Romantic Love**\nThe deep affection felt between partners, often intense and passionate.  \n2. **Familial Love**\nThe love shared among family members, providing a foundation of support.  \n3. **Friendship**\nA bond of trust and companionship that enhances our lives.  \n4. **Self-Love**\nCaring for oneself, essential for mental and emotional well-being.'),
 Document(metadata={'Header 1': 'Love', 'Header 2': 'Conclusion'}, page_content='Love, in all its forms, enriches our lives, giving us purpose, happiness, and resilience.')]

By default the headers are moved into the metadata and stripped from page_content. Passing strip_headers=False keeps each header in page_content as well.


In [65]:
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on, strip_headers=False)
md_header_splits = markdown_splitter.split_text(markdown_doc)
md_header_splits

[Document(metadata={'Header 1': 'Love'}, page_content='# Love  \n**Love** is a powerful and universal emotion that binds us to others in profound ways. It comes in many forms and is essential to human connection.'),
 Document(metadata={'Header 1': 'Love', 'Header 2': 'Types of Love'}, page_content='## Types of Love  \n1. **Romantic Love**\nThe deep affection felt between partners, often intense and passionate.  \n2. **Familial Love**\nThe love shared among family members, providing a foundation of support.  \n3. **Friendship**\nA bond of trust and companionship that enhances our lives.  \n4. **Self-Love**\nCaring for oneself, essential for mental and emotional well-being.'),
 Document(metadata={'Header 1': 'Love', 'Header 2': 'Conclusion'}, page_content='## Conclusion  \nLove, in all its forms, enriches our lives, giving us purpose, happiness, and resilience.')]
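The difference between the two settings can be reproduced with a small pure-Python sketch. It is a simplified stand-in, not the library's code: the function name `split_markdown_by_headers` is hypothetical, only heading levels 1 to 3 are handled, and fenced code blocks are ignored.

```python
def split_markdown_by_headers(text, strip_headers=True):
    """Toy header splitter for '#'-style headings (levels 1 to 3 only)."""
    sections, active, lines = [], {}, []

    def flush():
        # Emit the section collected so far, tagged with the headers in effect.
        content = "\n".join(lines).strip()
        if content:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(active.items())}
            sections.append((metadata, content))
        lines.clear()

    for line in text.splitlines():
        stripped = line.strip()
        hashes = len(stripped) - len(stripped.lstrip("#"))
        if 1 <= hashes <= 3 and stripped[hashes:hashes + 1] == " ":
            flush()
            # Drop this level and any deeper ones, then record the new header.
            for level in [lvl for lvl in active if lvl >= hashes]:
                del active[level]
            active[hashes] = stripped[hashes:].strip()
            if not strip_headers:
                lines.append(stripped)
        elif stripped:
            lines.append(stripped)
    flush()
    return sections


sample = """
# Love

Love binds us to others.

## Types of Love

1. Romantic Love
2. Friendship
"""

stripped_sections = split_markdown_by_headers(sample)
kept_sections = split_markdown_by_headers(sample, strip_headers=False)
print(stripped_sections[0])
print(kept_sections[0])
```

With headers kept, each chunk is self-describing even if the metadata is dropped later, at the cost of a few extra characters per chunk.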

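return_each_line=True returns each line as a separate Document instead of merging lines that share the same headers. As the output below shows, a list item and the description directly beneath it still stay together.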

In [66]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on,
    return_each_line=True,
)
md_header_splits = markdown_splitter.split_text(markdown_doc)
md_header_splits

[Document(metadata={'Header 1': 'Love'}, page_content='**Love** is a powerful and universal emotion that binds us to others in profound ways. It comes in many forms and is essential to human connection.'),
 Document(metadata={'Header 1': 'Love', 'Header 2': 'Types of Love'}, page_content='1. **Romantic Love**\nThe deep affection felt between partners, often intense and passionate.'),
 Document(metadata={'Header 1': 'Love', 'Header 2': 'Types of Love'}, page_content='2. **Familial Love**\nThe love shared among family members, providing a foundation of support.'),
 Document(metadata={'Header 1': 'Love', 'Header 2': 'Types of Love'}, page_content='3. **Friendship**\nA bond of trust and companionship that enhances our lives.'),
 Document(metadata={'Header 1': 'Love', 'Header 2': 'Types of Love'}, page_content='4. **Self-Love**\nCaring for oneself, essential for mental and emotional well-being.'),
 Document(metadata={'Header 1': 'Love', 'Header 2': 'Conclusion'}, page_content='Love, in all 

In [68]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_doc)

# Char-level splits
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 250
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(md_header_splits)
splits

[Document(metadata={'Header 1': 'Love'}, page_content='# Love  \n**Love** is a powerful and universal emotion that binds us to others in profound ways. It comes in many forms and is essential to human connection.'),
 Document(metadata={'Header 1': 'Love', 'Header 2': 'Types of Love'}, page_content='## Types of Love  \n1. **Romantic Love**\nThe deep affection felt between partners, often intense and passionate.  \n2. **Familial Love**\nThe love shared among family members, providing a foundation of support.  \n3. **Friendship**'),
 Document(metadata={'Header 1': 'Love', 'Header 2': 'Types of Love'}, page_content='3. **Friendship**\nA bond of trust and companionship that enhances our lives.  \n4. **Self-Love**\nCaring for oneself, essential for mental and emotional well-being.'),
 Document(metadata={'Header 1': 'Love', 'Header 2': 'Conclusion'}, page_content='## Conclusion  \nLove, in all its forms, enriches our lives, giving us purpose, happiness, and resilience.')]

RecursiveJsonSplitter splits JSON data while allowing control over chunk sizes.

It tries to keep nested JSON objects whole, but will split them if needed to keep chunks between min_chunk_size and max_chunk_size.

In [69]:
import json

import requests

# This is a large nested json object and will be loaded as a python dict
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()

In [71]:
json_data

{'openapi': '3.1.0',
 'info': {'title': 'LangSmith', 'version': '0.1.0'},
 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'],
    'summary': 'Read Tracer Session',
    'description': 'Get a specific session.',
    'operationId': 'read_tracer_session_api_v1_sessions__session_id__get',
    'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}],
    'parameters': [{'name': 'session_id',
      'in': 'path',
      'required': True,
      'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}},
     {'name': 'include_stats',
      'in': 'query',
      'required': False,
      'schema': {'type': 'boolean',
       'default': False,
       'title': 'Include Stats'}},
     {'name': 'accept',
      'in': 'header',
      'required': False,
      'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
       'title': 'Accept'}}],
    'responses': {'200': {'description': 'Successful Response',
      'content': {'application/json': {'sch

In [70]:
from langchain_text_splitters import RecursiveJsonSplitter

splitter = RecursiveJsonSplitter(max_chunk_size=300)

In [72]:
# Recursively split json data - If you need to access/manipulate the smaller json chunks
json_chunks = splitter.split_json(json_data=json_data)

for chunk in json_chunks[:3]:
    print("+++++++++++ \n")
    print(chunk)

+++++++++++ 

{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.'}}}}
+++++++++++ 

{'paths': {'/api/v1/sessions/{session_id}': {'get': {'operationId': 'read_tracer_session_api_v1_sessions__session_id__get', 'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}
+++++++++++ 

{'paths': {'/api/v1/sessions/{session_id}': {'get': {'parameters': [{'name': 'session_id', 'in': 'path', 'required': True, 'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}}, {'name': 'include_stats', 'in': 'query', 'required': False, 'schema': {'type': 'boolean', 'default': False, 'title': 'Include Stats'}}, {'name': 'accept', 'in': 'header', 'required': False, 'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Accept'}}]}}}}
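The way each chunk above repeats its key path ('paths', then the route, then 'get') can be illustrated with a simplified sketch. The helpers `json_size`, `set_nested`, and `split_json` are hypothetical names, size is assumed to mean the length of the serialized JSON, and min_chunk_size is ignored; the real splitter may differ in its details.

```python
import copy
import json


def json_size(data):
    """Size of a value, measured as the length of its JSON serialization."""
    return len(json.dumps(data))


def set_nested(target, path, value):
    """Set target[path[0]][path[1]]... = value, creating dicts along the way."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json(data, max_chunk_size, path=(), chunks=None):
    """Walk nested dicts, starting a new chunk whenever the current one would overflow."""
    chunks = [{}] if chunks is None else chunks
    for key, value in data.items():
        new_path = path + (key,)
        wrapped = {}
        set_nested(wrapped, new_path, value)
        if isinstance(value, dict) and value and json_size(wrapped) > max_chunk_size:
            # Too large to keep whole: descend and place its children one by one.
            split_json(value, max_chunk_size, new_path, chunks)
            continue
        candidate = copy.deepcopy(chunks[-1])
        set_nested(candidate, new_path, value)
        if chunks[-1] and json_size(candidate) > max_chunk_size:
            chunks.append(wrapped)
        else:
            chunks[-1] = candidate
    return chunks


data = {
    "info": {"title": "Demo", "version": "0.1.0"},
    "paths": {
        "/a": {"get": {"summary": "Read A", "description": "Get a specific A."}},
        "/b": {"get": {"summary": "Read B", "description": "Get a specific B."}},
    },
}

chunks = split_json(data, max_chunk_size=100)
for chunk in chunks:
    print(json_size(chunk), chunk)
```

Values that are not dicts, such as lists, are never split in this sketch, so a single long list can still produce an oversized chunk.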


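By default the splitter does not split lists, so a chunk that contains a long list can exceed max_chunk_size (the third chunk above is well over 300 characters). Pass convert_lists=True to preprocess the JSON, converting list content to dicts with index:item as key:value pairs, so that list content can be split as well: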

In [73]:
texts = splitter.split_text(json_data=json_data, convert_lists=True)

In [75]:
print(texts[3])

{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": {"1": {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}}}}}}
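The list preprocessing itself is easy to sketch. The function below, with the hypothetical name `lists_to_dicts`, replaces every list with a dict keyed by the item's index as a string, matching the "1" key visible in the output above; treat it as an illustration rather than the library's exact code.

```python
def lists_to_dicts(data):
    """Recursively replace lists with dicts keyed by the item's index (as a string)."""
    if isinstance(data, dict):
        return {key: lists_to_dicts(value) for key, value in data.items()}
    if isinstance(data, list):
        return {str(index): lists_to_dicts(item) for index, item in enumerate(data)}
    return data


parameters = {"parameters": [{"name": "session_id"}, {"name": "include_stats"}]}
converted = lists_to_dicts(parameters)
print(converted)
```

Once every list is a dict, the items can be distributed across chunks like any other nested object.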


How to split text into chunks based on semantic similarity



At a high level, SemanticChunker splits the text into sentences, groups them into windows of 3 sentences, and then merges the groups that are similar in the embedding space.

In [77]:
%pip install --quiet langchain_experimental langchain_openai

This chunker works by determining where to "break" the text between sentences. It computes the difference between the embeddings of consecutive sentences, and wherever that difference exceeds a threshold, it starts a new chunk.
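The breakpoint rule can be illustrated without calling an embedding API. In the sketch below the two-dimensional "embeddings" are made-up toy vectors, the function names are hypothetical, and cosine distance is assumed as the difference measure.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def split_on_breakpoints(sentences, embeddings, threshold):
    """Start a new chunk wherever consecutive sentences differ by more than threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


sentences = ["Cats purr.", "Cats nap a lot.", "Stocks fell today.", "Bond yields rose."]
# Toy vectors: the first two point one way, the last two another.
embeddings = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]

semantic_chunks = split_on_breakpoints(sentences, embeddings, threshold=0.5)
print(semantic_chunks)
```

Only the jump from the second to the third sentence is large enough to cross the threshold, so the text splits into one chunk about cats and one about markets.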



In [82]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())

In [83]:
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students t

In [84]:
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
)
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0])

page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone f

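Besides "percentile", breakpoint_threshold_type accepts three other values:

breakpoint_threshold_type="standard_deviation"

breakpoint_threshold_type="interquartile"

breakpoint_threshold_type="gradient"

Each one turns the differences between consecutive sentences into a cutoff in a different way:

Percentile: 
The value below which a given percentage of the data falls. For example, the 90th percentile is the value below which 90% of the data lies. Differences above the chosen percentile become breakpoints.

Standard Deviation: 
Shows how spread out the data is from the average. A small standard deviation means the data is close to the average; a large one means it is spread out. Differences that lie more than a chosen number of standard deviations above the average become breakpoints.

Interquartile Range (IQR): 
The range between the 25th percentile (Q1) and the 75th percentile (Q3). It measures the spread of the middle half of the data, which helps to identify outliers. Differences that exceed the average by more than a chosen multiple of the IQR become breakpoints.

Gradient: 
The rate of change between values, like the slope of a line. It shows how quickly values increase or decrease. Here the percentile rule is applied to the gradient of the differences rather than to the differences themselves.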
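As a rough sketch of how the first three rules turn a list of distances into a cutoff, the code below computes each threshold with the standard library. The formulas reflect my understanding of the rules described above and are assumptions, not a copy of the library's code; the names `percentile` and `breakpoint_threshold` are hypothetical, the distances are made up, and the gradient rule is left out.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def breakpoint_threshold(distances, kind, amount):
    """Return the cutoff above which a distance counts as a breakpoint."""
    if kind == "percentile":
        return percentile(distances, amount)
    if kind == "standard_deviation":
        return statistics.mean(distances) + amount * statistics.pstdev(distances)
    if kind == "interquartile":
        iqr = percentile(distances, 75) - percentile(distances, 25)
        return statistics.mean(distances) + amount * iqr
    raise ValueError(f"unknown threshold type: {kind}")


# Made-up distances between consecutive sentences; one clear topic shift at index 3.
distances = [0.05, 0.07, 0.06, 0.70, 0.04, 0.08, 0.05]
settings = {"percentile": 95, "standard_deviation": 1.0, "interquartile": 1.5}

breakpoints = {}
for kind, amount in settings.items():
    cutoff = breakpoint_threshold(distances, kind, amount)
    breakpoints[kind] = [i for i, d in enumerate(distances) if d > cutoff]
    print(kind, round(cutoff, 3), breakpoints[kind])
```

All three rules single out the same large jump here, but they react differently to the amount setting, so it is worth inspecting the resulting chunks on real data before settling on one.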

In [85]:
%pip install --upgrade --quiet langchain-text-splitters tiktoken

Note: you may need to restart the kernel to use updated packages.


In [86]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)