# Text Splitters

https://python.langchain.com/docs/modules/data_connection/document_transformers/character_text_splitter

#### API

https://api.python.langchain.com/en/latest/text_splitters_api_reference.html

https://api.python.langchain.com/en/stable/text_splitter/langchain.text_splitter.TextSplitter.html#langchain.text_splitter.TextSplitter.split_documents

## 1. Recursive Character Text Splitter

https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html

#### Sample PDF
Download it to the local file system: https://constitutioncenter.org/media/files/constitution.pdf

#### Note
The separate load and split steps in the sample code below can be replaced with a single call: *pdf_loader.load_and_split(pdf_text_splitter)*

In [1]:
from langchain_community.document_loaders import PyPDFLoader

# Load from the local file system or from a URL
# You may also use the PDF web loader
pdf_source = 'C:/temp/us-constitution.pdf'
pdf_loader = PyPDFLoader(pdf_source) 
documents = pdf_loader.load()

# print(documents[0].page_content)  # uncomment to inspect the text of the first page

print("Document list size : ", len(documents))
print("Metadata : ", documents[0].metadata)


Document list size :  52
Metadata :  {'source': 'C:/temp/us-constitution.pdf', 'page': 0}


* chunk_size – Maximum size of the chunks to return
* chunk_overlap – Overlap in characters between consecutive chunks
* length_function – Function that measures the length of a given chunk
* keep_separator – Whether to keep the separator in the chunks
* add_start_index – If True, includes each chunk's start index in its metadata
* strip_whitespace – If True, strips whitespace from the start and end of every document
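
To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal standard-library sketch. It is a simplification and not the library's algorithm: RecursiveCharacterTextSplitter prefers to cut at separators such as paragraph breaks, while the hypothetical `fixed_window_chunks` helper below cuts at fixed character offsets. The `start_index` field only mimics the kind of value that add_start_index records in the metadata.

```python
from typing import Dict, List


def fixed_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[Dict]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Hypothetical helper
    for illustration only; the real splitter tries to cut at separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in fixed_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk["start_index"], chunk["text"])
```

Each window begins with the last three characters of the previous one, which is the context that the overlap preserves across chunk boundaries.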

In [2]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 784
chunk_overlap = 100

# Create an instance of the splitter
pdf_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)

In [3]:
chunked_documents = pdf_text_splitter.split_documents(documents)

print("Number of chunks : ", len(chunked_documents))



Number of chunks :  0


In [4]:
print('Chunk length = ', len(chunked_documents[20].page_content))
print(chunked_documents[20].metadata)
print('------------')
print(chunked_documents[20].page_content)
print('------------')
print(chunked_documents[21].page_content)

IndexError: list index out of range

## 2. JSON

The JSON splitter traverses JSON data depth first and builds smaller JSON chunks. It attempts to keep nested JSON objects whole, but splits them when needed to keep chunks between min_chunk_size and max_chunk_size. If a value is not nested JSON but a very large string, the string is not split. If you need a hard cap on the chunk size, consider following this splitter with a recursive character text splitter on those chunks. There is an optional pre-processing step that splits lists by first converting them to JSON (dict) and then splitting them as such.
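
The depth-first traversal described above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not RecursiveJsonSplitter's implementation: `split_json_sketch`, `json_size`, and `set_nested` are hypothetical names, size is measured as the length of the serialized JSON, there is no min_chunk_size handling, and lists are treated as leaf values.

```python
import json
from typing import Any, Dict, List


def json_size(data: Any) -> int:
    """Size of a value, measured as the length of its serialized JSON."""
    return len(json.dumps(data))


def set_nested(target: Dict, path: List[str], value: Any) -> None:
    """Store value in target under the nested keys listed in path."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json_sketch(data: Dict, max_chunk_size: int) -> List[Dict]:
    """Depth-first split of a dict into chunks (hypothetical helper)."""
    chunks: List[Dict] = [{}]

    def walk(node: Dict, path: List[str]) -> None:
        for key, value in node.items():
            new_path = path + [key]
            piece: Dict = {}
            set_nested(piece, new_path, value)
            if json_size(chunks[-1]) + json_size(piece) <= max_chunk_size:
                # The whole value fits: keep it together in the current chunk.
                set_nested(chunks[-1], new_path, value)
            elif isinstance(value, dict) and value:
                # A nested object that does not fit is split by descending into it.
                walk(value, new_path)
            else:
                # A leaf that does not fit starts a new chunk. An oversized
                # string is stored whole, so it can exceed max_chunk_size.
                if chunks[-1]:
                    chunks.append({})
                set_nested(chunks[-1], new_path, value)

    walk(data, [])
    return chunks


person = {"name": "John", "age": 30, "work": {"company": "XYZ Corp", "years": 8}}
for chunk in split_json_sketch(person, max_chunk_size=40):
    print(json_size(chunk), chunk)
```

Every chunk repeats the full key path down to its values, so each chunk remains valid JSON that can be read on its own.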


https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_json_splitter

#### API

https://api.python.langchain.com/en/latest/json/langchain_text_splitters.json.RecursiveJsonSplitter.html#langchain_text_splitters.json.RecursiveJsonSplitter

- max_chunk_size (int) – Maximum size of a chunk
- min_chunk_size (Optional[int]) – Minimum size of a chunk

In [None]:
sample_json = {
  "name": "John",
  "age": 30,
  "city": "New York",
  "pets": [
    {
      "type": "dog",
      "name": "Buddy",
      "age": 5,
      "traits": ["friendly", "energetic"]
    },
    {
      "type": "cat",
      "name": "Whiskers",
      "age": 3,
      "traits": ["independent", "playful"]
    }
  ],
  "work": {
    "company": "XYZ Corp",
    "position": "Software Engineer",
    "years_of_experience": 8,
    "projects": [
      {
        "name": "Project A",
        "status": "completed",
        "team_members": ["Alice", "Bob", "Charlie"]
      },
      {
        "name": "Project B",
        "status": "in_progress",
        "team_members": ["Dave", "Eve"]
      }
    ]
  }
}


In [None]:
from langchain_text_splitters import RecursiveJsonSplitter

splitter = RecursiveJsonSplitter(max_chunk_size=50)

json_chunks = splitter.split_json(json_data=sample_json)

In [None]:
print('Number of chunks : ', len(json_chunks))

for chunk in json_chunks:
    print('---------', type(chunk),'----------')
    print(chunk)
    

## 3. HTML Splitter

https://python.langchain.com/docs/modules/data_connection/document_transformers/HTML_header_metadata

#### API
https://api.python.langchain.com/en/latest/html/langchain_text_splitters.html.HTMLHeaderTextSplitter.html#langchain_text_splitters.html.HTMLHeaderTextSplitter

* headers_to_split_on (List[Tuple[str, str]]) – List of tuples of headers we want to track, mapped to (arbitrary) keys for metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].
* return_each_element (bool) – Return each element with its associated headers.
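
The way headers_to_split_on maps header tags to metadata keys can be illustrated with a small standard-library sketch. It is an approximation under stated assumptions, not HTMLHeaderTextSplitter's implementation: `HeaderTrackingParser` and `split_html_by_headers` are hypothetical names, headers are assumed to contain plain text only, and adjacent text under the same headers is simply joined with a space.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class HeaderTrackingParser(HTMLParser):
    """Collect text sections tagged with the headers currently in effect."""

    def __init__(self, headers_to_split_on: List[Tuple[str, str]]):
        super().__init__()
        self.keys = dict(headers_to_split_on)   # tag name -> metadata key
        self.active: Dict[str, str] = {}        # tag name -> current header text
        self.current_header: Optional[str] = None
        self.sections: List[Dict] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keys:
            self.current_header = tag

    def handle_endtag(self, tag):
        if tag == self.current_header:
            self.current_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_header is not None:
            # A new header replaces headers at the same or a deeper level.
            level = int(self.current_header[1])
            self.active = {t: v for t, v in self.active.items() if int(t[1]) < level}
            self.active[self.current_header] = text
            return
        metadata = {self.keys[t]: v for t, v in self.active.items()}
        if self.sections and self.sections[-1]["metadata"] == metadata:
            self.sections[-1]["content"] += " " + text
        else:
            self.sections.append({"metadata": metadata, "content": text})


def split_html_by_headers(html: str, headers_to_split_on: List[Tuple[str, str]]) -> List[Dict]:
    parser = HeaderTrackingParser(headers_to_split_on)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.sections


html = "<h1>Foo</h1><p>Intro.</p><h2>Bar</h2><p>Bar text.</p><h2>Baz</h2><p>Baz text.</p>"
for section in split_html_by_headers(html, [("h1", "Header 1"), ("h2", "Header 2")]):
    print(section)
```

Each section carries the full chain of headers above it, which is what makes the metadata useful for filtering or for giving a retrieved chunk its context.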

In [None]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

In [None]:
from langchain_text_splitters import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter

url = "https://plato.stanford.edu/entries/goedel/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# for local file use html_splitter.split_text_from_file(<path_to_file>)
html_header_splits = html_splitter.split_text_from_url(url)

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits[80:85]

## 4. Semantic Splitter

EXPERIMENTAL

https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker