# Text Splitters

When working with large documents we will usually need to split them into smaller, more manageable chunks.  
  
This is especially important in retrieval-augmented generation (RAG).  
RAG fetches relevant pieces of information from a large dataset and uses them to generate accurate, context-aware responses. If the text is poorly split, retrieval can miss critical information or return irrelevant data. Well-defined chunks make the most relevant passages easier to retrieve, which improves both the efficiency of retrieval and the quality of the generated responses. That makes text splitters an important tool in the RAG workflow.

In this notebook I will use several text splitters from LangChain.  


## Key parameters

- separator: the character sequence on which the text is split.
- chunk_size: the maximum size of each chunk; smaller values give more granular chunks, larger values give broader ones.
- chunk_overlap: the number of characters shared by consecutive chunks. The overlap keeps context across chunk boundaries so that information isn't lost where the text is cut.
- length_function: the function used to measure the length of a chunk (for example len, which counts characters).
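
To make chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch of a sliding window over characters. It is only an illustration of the idea, not LangChain's implementation, and the function name chunk_by_characters is made up for this notebook.

```python
def chunk_by_characters(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_by_characters(sample, chunk_size=10, chunk_overlap=3):
    print(repr(chunk))
# 'abcdefghij'
# 'hijklmnopq'
# 'opqrstuvwx'
# 'vwxyz'
```

The last three characters of each chunk reappear at the start of the next one, which is the context that chunk_overlap preserves.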

In [28]:
# import necessary libraries
import urllib.request
# all splitters used below are exported by the langchain_text_splitters package
from langchain_text_splitters import (
    CharacterTextSplitter,
    HTMLHeaderTextSplitter,
    HTMLSectionSplitter,
    Language,
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

In [29]:
# download a sample document
import os

url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/YRYau14UJyh0DdiLDdzFcA/companypolicies.txt"

filename = 'data/companypolicies.txt'
os.makedirs(os.path.dirname(filename), exist_ok=True)  # urlretrieve does not create missing folders
urllib.request.urlretrieve(url, filename)

('data/companypolicies.txt', <http.client.HTTPMessage at 0x1f50fb14dd0>)

In [30]:
with open("data/companypolicies.txt") as f:
    companypolicies = f.read()

# a long document that can be split up
print(companypolicies)

1.	Code of Conduct

Our Code of Conduct outlines the fundamental principles and ethical standards that guide every member of our organization. We are committed to maintaining a workplace that is built on integrity, respect, and accountability.
Integrity: We hold ourselves to the highest ethical standards. This means acting honestly and transparently in all our interactions, whether with colleagues, clients, or the broader community. We respect and protect sensitive information, and we avoid conflicts of interest.
Respect: We embrace diversity and value each individual's contributions. Discrimination, harassment, or any form of disrespectful behavior is unacceptable. We create an inclusive environment where differences are celebrated and everyone is treated with dignity and courtesy.
Accountability: We take responsibility for our actions and decisions. We follow all relevant laws and regulations, and we strive to continuously improve our practices. We report any potential violations of 

## Split by Character

CharacterTextSplitter  
  
This is the simplest method. It splits the text on a single separator string (by default "\n\n") and measures chunk length by the number of characters.  

How the text is split: by a single separator string.  
How the chunk size is measured: by number of characters.  

In the first example below we will use:  
- Separator: set to '', so the text can be cut at any character position once a chunk reaches the size limit.
- Chunk size: set to 200, so each chunk holds at most 200 characters.
- Chunk overlap: set to 20, so consecutive chunks share 20 characters.
- Length function: set to len, so length is counted in characters.

In [None]:
# defining the splitter
text_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,
)

texts = text_splitter.split_text(companypolicies)
print(texts)

print("Number of chunks: ", len(texts))

Number of chunks:  87


In [32]:
# to get Document objects with metadata instead of plain strings, use create_documents
texts = text_splitter.create_documents([companypolicies], metadatas=[{"document":"Company Policies"}]) 
texts[0:4]

[Document(metadata={'document': 'Company Policies'}, page_content='1.\tCode of Conduct\n\nOur Code of Conduct outlines the fundamental principles and ethical standards that guide every member of our organization. We are committed to maintaining a workplace that is built'),
 Document(metadata={'document': 'Company Policies'}, page_content='kplace that is built on integrity, respect, and accountability.\nIntegrity: We hold ourselves to the highest ethical standards. This means acting honestly and transparently in all our interactions, whe'),
 Document(metadata={'document': 'Company Policies'}, page_content='ur interactions, whether with colleagues, clients, or the broader community. We respect and protect sensitive information, and we avoid conflicts of interest.\nRespect: We embrace diversity and value e'),
 Document(metadata={'document': 'Company Policies'}, page_content="iversity and value each individual's contributions. Discrimination, harassment, or any form of disrespectful behavi

In [33]:
# let's split on a different character, "\n"
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=300,
    chunk_overlap=20,
    length_function=len,
)

# with a fixed separator the splitter cannot always respect chunk_size:
# a piece of text that contains no "\n" cannot be cut any further
texts = text_splitter.split_text(companypolicies)
print(texts)

Created a chunk of size 325, which is longer than the specified 300
Created a chunk of size 321, which is longer than the specified 300
Created a chunk of size 694, which is longer than the specified 300
Created a chunk of size 323, which is longer than the specified 300
Created a chunk of size 326, which is longer than the specified 300
Created a chunk of size 421, which is longer than the specified 300
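
The warnings above appear because a single-separator splitter can only cut where the separator occurs. A rough sketch of that behaviour, assuming a simplified greedy merge with no overlap (the helper split_on_separator is hypothetical and is not LangChain code):

```python
def split_on_separator(text, separator, chunk_size, length_function=len):
    """Split on separator, then greedily merge neighbouring pieces.

    Pieces are joined back together while the result still fits in
    chunk_size. A single piece that is already longer than chunk_size is
    emitted as it is, because there is no other place to cut it.
    """
    pieces = [piece for piece in text.split(separator) if piece]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if length_function(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size on its own
    if current:
        chunks.append(current)
    return chunks


text = "short line\n" + "x" * 40 + "\nanother short line"
chunks = split_on_separator(text, separator="\n", chunk_size=25)
print([len(chunk) for chunk in chunks])
# [10, 40, 18]
```

The middle line has 40 characters and no "\n" inside it, so it ends up as one oversized chunk, just like the chunks reported in the warnings.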




## Recursively Split by Character

RecursiveCharacterTextSplitter class from LangChain  
  
This text splitter is the recommended one for generic text. It is parameterized by a list of separators and tries them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This avoids the oversized chunks produced by the single-separator example above.  
  
It first attempts to split the text on the first separator, \n\n. If any resulting chunk is still too large, it moves to the next separator, \n, and splits that chunk again. This process continues through the list until every chunk fits within the specified chunk size.  
  
This method aims to keep all paragraphs (then sentences, then words) together as much as possible, as these are generally the most semantically related pieces of text.  
  
How the text is split: by list of characters.  
How the chunk size is measured: by number of characters.  
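
The recursion described above can be sketched in a few lines of plain Python. This is a simplified illustration under my own assumptions (no overlap, length counted with len, a made-up function name recursive_split); the real LangChain class handles more cases.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the first separator and recurse into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    pieces = [piece for piece in text.split(separator) if piece]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # this piece is too long: try the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


text = "Para one is short.\n\nPara two is a bit longer than the limit allows."
for chunk in recursive_split(text, chunk_size=30):
    print(repr(chunk))
# 'Para one is short.'
# 'Para two is a bit longer than'
# 'the limit allows.'
```

The first paragraph stays whole, and only the second one, which is too long, is broken further at word boundaries.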
 
In the example below we will use the parameters set to:  
- Separators: the default list, which is ["\n\n", "\n", " ", ""].
- Chunk size: 100.
- Chunk overlap: 20.
- Length function: len.

In [34]:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)

# this yields a much nicer result
texts = text_splitter.create_documents([companypolicies])
texts

[Document(metadata={}, page_content='1.\tCode of Conduct'),
 Document(metadata={}, page_content='Our Code of Conduct outlines the fundamental principles and ethical standards that guide every'),
 Document(metadata={}, page_content='that guide every member of our organization. We are committed to maintaining a workplace that is'),
 Document(metadata={}, page_content='a workplace that is built on integrity, respect, and accountability.'),
 Document(metadata={}, page_content='Integrity: We hold ourselves to the highest ethical standards. This means acting honestly and'),
 Document(metadata={}, page_content='acting honestly and transparently in all our interactions, whether with colleagues, clients, or the'),
 Document(metadata={}, page_content='clients, or the broader community. We respect and protect sensitive information, and we avoid'),
 Document(metadata={}, page_content='and we avoid conflicts of interest.'),
 Document(metadata={}, page_content="Respect: We embrace diversity and valu

## Code splitter

CodeTextSplitter 
   
We can also split code written in many programming languages using LangChain. The splitting follows the RecursiveCharacterTextSplitter strategy, with a separator list chosen for each language. Import the Language enum and pass the language to RecursiveCharacterTextSplitter.from_language.

In [35]:
# let's see all the supported languages
[e.value for e in Language]

['cpp',
 'go',
 'java',
 'kotlin',
 'js',
 'ts',
 'php',
 'proto',
 'python',
 'rst',
 'ruby',
 'rust',
 'scala',
 'swift',
 'markdown',
 'latex',
 'html',
 'sol',
 'csharp',
 'cobol',
 'c',
 'lua',
 'perl',
 'haskell',
 'elixir',
 'powershell']

In [36]:
# let's see which default separators it uses for Python
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)

['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
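
The first two separators in this list cut the source just before a top-level class or def. As a small stand-alone illustration of that idea (not the LangChain implementation; the helper split_python_top_level is hypothetical), a regular expression with a lookahead can split at those points while keeping the keyword with the code that follows it:

```python
import re


def split_python_top_level(source):
    """Cut before each line that starts a top-level def or class."""
    # The newline is consumed by the split; the lookahead keeps
    # "def " / "class " at the start of the next piece.
    pieces = re.split(r"\n(?=def |class )", source)
    return [piece.strip("\n") for piece in pieces if piece.strip()]


source = (
    "import os\n"
    "\n"
    "def a():\n"
    "    return 1\n"
    "\n"
    "class B:\n"
    "    def m(self):\n"
    "        return 2\n"
)
for piece in split_python_top_level(source):
    print(repr(piece))
# 'import os'
# 'def a():\n    return 1'
# 'class B:\n    def m(self):\n        return 2'
```

The indented method m is not split off, because its line starts with spaces rather than with def. Splitting on such language-specific boundaries first is what keeps whole functions and classes together when they fit in a chunk.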

In [37]:
# some sample python code
PYTHON_CODE = """
    def hello_world():
        print("Hello, World!")
    
    # Call the function
    hello_world()
"""

# same splitter as before, this time created with from_language and the target language
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

[Document(metadata={}, page_content='def hello_world():'),
 Document(metadata={}, page_content='print("Hello, World!")'),
 Document(metadata={}, page_content='# Call the function\n    hello_world()')]

In [38]:
# same thing but for JS
JS_CODE = """
    function helloWorld() {
      console.log("Hello, World!");
    }
    
    // Call the function
    helloWorld();
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)
js_docs = js_splitter.create_documents([JS_CODE])
js_docs

[Document(metadata={}, page_content='function helloWorld() {'),
 Document(metadata={}, page_content='console.log("Hello, World!");\n    }'),
 Document(metadata={}, page_content='// Call the function\n    helloWorld();')]

## Markdown Splitter

MarkdownHeaderTextSplitter  

We can split a Markdown file by its headers.

In [39]:
# example of markdown
md = "# Foo\n\n## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n### Boo \n\nHi this is Lance \n\n## Baz\n\nHi this is Molly"

# specify the headers to split on
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]   

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(md)
# the page_content contains the text under the headings, and the metadata contains the header information corresponding to the page_content.
md_header_splits

[Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='Hi this is Jim  \nHi this is Joe'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}, page_content='Hi this is Lance'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='Hi this is Molly')]
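
To show where the metadata in this output comes from, here is a minimal sketch of header-based splitting in plain Python. It assumes simple "#"-style headers on their own lines and returns plain dicts instead of Document objects; the function name split_markdown_by_headers is made up, and this is not the LangChain implementation.

```python
def split_markdown_by_headers(markdown, headers_to_split_on):
    """Group the text under each header and record the active headers as metadata."""
    # Check longer markers first so "###" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    active = {}            # header level -> (metadata key, header title)
    sections, body = [], []

    def flush():
        # Emit the text collected so far together with the headers above it.
        if body:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            sections.append({"metadata": metadata, "page_content": "\n".join(body)})
            body.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        for marker, name in markers:
            if stripped.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes every header at the same or a deeper level.
                for closed in [lvl for lvl in active if lvl >= level]:
                    del active[closed]
                active[level] = (name, stripped[len(marker):].strip())
                break
        else:
            if stripped:
                body.append(stripped)
    flush()
    return sections


example = "# Foo\n\n## Bar\n\nHi this is Jim\n\n### Boo\n\nHi this is Lance\n\n## Baz\n\nHi this is Molly"
rules = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for section in split_markdown_by_headers(example, rules):
    print(section)
```

When "## Baz" is reached, the entries for "Bar" and "Boo" are dropped, which is why the last chunk carries only Header 1 and Header 2.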

In [40]:
# If you want the headers to appear in page_content as well, pass strip_headers=False when you create the MarkdownHeaderTextSplitter
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
md_header_splits = markdown_splitter.split_text(md)
md_header_splits

[Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='# Foo  \n## Bar  \nHi this is Jim  \nHi this is Joe'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}, page_content='### Boo  \nHi this is Lance'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='## Baz  \nHi this is Molly')]

## Split by HTML

### Split by HTML header

HTMLHeaderTextSplitter  

In [41]:
# sample html
html_string = """
    <!DOCTYPE html>
    <html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
    </html>
"""

# set the headers to split on
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]


html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
# the content under each heading is extracted into page_content; the metadata contains the header information.
html_header_splits

[Document(metadata={'Header 1': 'Foo'}, page_content='Foo'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Some intro text about Foo.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}, page_content='Bar main section'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}, page_content='Some intro text about Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}, page_content='Bar subsection 1'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}, page_content='Some text about the first subtopic of Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}, page_content='Bar subsection 2'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}, page_content='Some text about the second subtopic of Bar.'),
 Document(metadata={'Header 

### Split by HTML section

HTMLSectionSplitter  

In [42]:
# same parameters and example as above but splitting by section
html_splitter = HTMLSectionSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'Header 1': 'Foo'}, page_content='Foo \n Some intro text about Foo.'),
 Document(metadata={'Header 2': 'Bar main section'}, page_content='Bar main section \n Some intro text about Bar.'),
 Document(metadata={'Header 3': 'Bar subsection 1'}, page_content='Bar subsection 1 \n Some text about the first subtopic of Bar.'),
 Document(metadata={'Header 3': 'Bar subsection 2'}, page_content='Bar subsection 2 \n Some text about the second subtopic of Bar.'),
 Document(metadata={'Header 2': 'Baz'}, page_content='Baz \n Some text about Baz \n \n \n Some concluding text about Foo')]
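
The section-style output above, with one chunk per header and only that header in the metadata, can be approximated with the standard library alone. The sketch below uses html.parser and returns plain dicts; the class SectionCollector and the function split_html_by_section are hypothetical names, and this is an assumption-laden illustration rather than how HTMLSectionSplitter works internally.

```python
from html.parser import HTMLParser


class SectionCollector(HTMLParser):
    """Start a new section at every configured header tag and collect the text after it."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_names = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.sections = []
        self.open_header = None  # tag name while we are inside a header element

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            self.open_header = tag
            self.sections.append({"name": self.header_names[tag], "title": "", "text": []})

    def handle_endtag(self, tag):
        if tag == self.open_header:
            self.open_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.sections:
            return  # skip whitespace and anything before the first header
        if self.open_header:
            self.sections[-1]["title"] += text
        else:
            self.sections[-1]["text"].append(text)


def split_html_by_section(html, headers_to_split_on):
    collector = SectionCollector(headers_to_split_on)
    collector.feed(html)
    collector.close()
    return [
        {
            "metadata": {section["name"]: section["title"]},
            "page_content": " ".join([section["title"], *section["text"]]),
        }
        for section in collector.sections
    ]


page = "<h1>Foo</h1><p>Intro.</p><h2>Bar</h2><p>About Bar.</p><p>More.</p>"
for section in split_html_by_section(page, [("h1", "Header 1"), ("h2", "Header 2")]):
    print(section)
# {'metadata': {'Header 1': 'Foo'}, 'page_content': 'Foo Intro.'}
# {'metadata': {'Header 2': 'Bar'}, 'page_content': 'Bar About Bar. More.'}
```

As in the output above, text that follows the last nested header is attached to that header's section, even if it logically belongs to an outer one.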