### Text Splitting from Documents - RecursiveCharacterTextSplitter

This text splitter is the recommended one for generic text. It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This keeps paragraphs (then sentences, then words) together for as long as possible, since those tend to be the most strongly semantically related pieces of text.

->How the text is split: by list of characters.
->How the chunk size is measured: by number of characters.
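
To make the "try each separator in order" idea concrete, here is a minimal pure-Python sketch. It is not the library's implementation: `recursive_split` is a hypothetical helper written for illustration, it ignores `chunk_overlap`, and it drops the separator at chunk boundaries.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified illustration of recursive character splitting (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to split on
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = "para one is here\n\npara two\n\n" + " ".join(["word"] * 10)
chunks = recursive_split(sample, chunk_size=20)
print(chunks)
```

Paragraph breaks are tried first, so the two short paragraphs survive intact; only the long run of words falls through to the space separator.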

In [24]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader('AutismFinalPPT.pdf')
docs = loader.load()
docs[3].page_content


'IDEA/SOLUTION: Implementation of an interactive\nAI model of same age group to treat autism by\ntracking the progress of the individual.\nUnique software that utilize an AI model that\ncommunicates like a peer of the same age group.\nTailored activities for kids below 8. Teens lead fun\nactivities for those above 8, improving\ncommunication through their interests in singing,\ndancing, and arts, etc.\nComprehensive section offering available solutions\nand tips for parents to support their children\neffectively.\nSpecial Focus on Teenagers where there are no\nsystem to treat themProposed Solution\nTECHNOLOGY STACK'

In [25]:
type(docs[0])

langchain_core.documents.base.Document

In [28]:
### Recursively Splitting Text By Characters
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
final_documents = text_splitter.split_documents(docs)
final_documents

[Document(metadata={'source': 'AutismFinalPPT.pdf', 'page': 0}, page_content='Auti Talk'),
 Document(metadata={'source': 'AutismFinalPPT.pdf', 'page': 1}, page_content='IMPAIRATHON - 2024\nName of the Student : ANIL KATROTH\nDepartment and Year : CSE - IV\nCollege Name : RGUKT, BasarMinistry/Organization :  KARPAGAM INNOVATION\nAND INCUBATION COUNCIL'),
 Document(metadata={'source': 'AutismFinalPPT.pdf', 'page': 2}, page_content='Problem  Statement\nPERVASIVE DEVELOPMENT:\nSystem to develop communication for\nautistic children'),
 Document(metadata={'source': 'AutismFinalPPT.pdf', 'page': 3}, page_content='IDEA/SOLUTION: Implementation of an interactive\nAI model of same age group to treat autism by\ntracking the progress of the individual.\nUnique software that utilize an AI model that\ncommunicates like a peer of the same age group.\nTailored activities for kids below 8. Teens lead fun\nactivities for those above 8, improving\ncommunication through their interests in singing,\ndancin

In [31]:
type(docs[0])

langchain_core.documents.base.Document

### How to split by character - CharacterTextSplitter

This is the simplest method. It splits on a given character sequence, which defaults to "\n\n". Chunk length is measured by number of characters.

->How the text is split: by single character separator.
->How the chunk size is measured: by number of characters.
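
The single-separator rule can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the library code: `split_on_separator` is a hypothetical helper and it ignores `chunk_overlap`. It shows why a chunk can end up longer than `chunk_size`: a piece that contains no separator is never cut.

```python
def split_on_separator(text, separator="\n\n", chunk_size=200):
    """Split on one separator and merge neighbouring pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # The piece is kept whole even if it is longer than chunk_size
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "a" * 300 + "\n\n" + "b" * 50 + "\n\n" + "c" * 50
result = split_on_separator(text, chunk_size=200)
print([len(chunk) for chunk in result])
```

The first paragraph has 300 characters and no internal "\n\n", so it comes back as a single oversized chunk, while the two short paragraphs are merged.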

In [40]:
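from langchain_text_splitters import CharacterTextSplitter

# Read the whole text file into a single string
with open("speech.txt") as f:
    speech = f.read()

text_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=500, chunk_overlap=50)
# create_documents takes a list of raw strings; split_documents expects Document objects
text_splitter.create_documents([speech])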


[Document(page_content='Rajiv Gandhi University of Knowledge Technologies (RGUKT) Basar is unique university which actively uses Information and Communication Technology (ICT) in teaching. It is perhaps the first of its kind in the country with an educational model that is intensely ICT based. Established by the Government of erstwhile Andhra Pradesh vide a special act of legislation, this campus is loacated at the holy land of Basar (the abode of Gnyana Saraswathi, Goddess of knowledge) in Nirmal District (Telangana State). The campus is set in about 272 acres of salubrious and serene surrounding just a short distance from the banks of river Godavari.'),
 Document(page_content='The primary objective of establishing RGUKT is to provide high quality educational opportunities for the rural youth of the state. The selection process follows approved rules and has very high competition where only the top rural graduates (mostly within the top 5%) get the opportunity to study at RGUKT.'),
 D

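The same text file can be split again with a smaller chunk size. Because CharacterTextSplitter only cuts at the separator ("\n\n" by default), a paragraph longer than chunk_size is kept whole and a warning is logged, as the output below shows.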

In [41]:
text_splitter = CharacterTextSplitter(chunk_size=200,chunk_overlap=30)
text = text_splitter.create_documents([speech])
print(text[0])
print(text[1])

Created a chunk of size 636, which is longer than the specified 200
Created a chunk of size 306, which is longer than the specified 200
Created a chunk of size 662, which is longer than the specified 200


page_content='Rajiv Gandhi University of Knowledge Technologies (RGUKT) Basar is unique university which actively uses Information and Communication Technology (ICT) in teaching. It is perhaps the first of its kind in the country with an educational model that is intensely ICT based. Established by the Government of erstwhile Andhra Pradesh vide a special act of legislation, this campus is loacated at the holy land of Basar (the abode of Gnyana Saraswathi, Goddess of knowledge) in Nirmal District (Telangana State). The campus is set in about 272 acres of salubrious and serene surrounding just a short distance from the banks of river Godavari.'
page_content='The primary objective of establishing RGUKT is to provide high quality educational opportunities for the rural youth of the state. The selection process follows approved rules and has very high competition where only the top rural graduates (mostly within the top 5%) get the opportunity to study at RGUKT.'


### How to split by HTML header

HTMLHeaderTextSplitter is a "structure-aware" chunker that splits text at the HTML element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.
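
The idea of attaching the "relevant" headers to each chunk can be sketched with the standard library's `html.parser`. This is a rough illustration only: `HeaderChunker` is a hypothetical class, it assumes the headers are listed from shallowest to deepest, and it does not reproduce how HTMLHeaderTextSplitter handles nested `<div>` elements.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Collect text elements together with the headers currently in effect."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_names = dict(headers_to_split_on)       # tag -> metadata key
        self.levels = [tag for tag, _ in headers_to_split_on]
        self.active = {}        # metadata key -> header text currently in effect
        self.in_header = None   # header tag being read, if any
        self.chunks = []        # list of (metadata, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_header:
            # A new header replaces headers at the same or a deeper level
            level = self.levels.index(self.in_header)
            for tag in self.levels[level:]:
                self.active.pop(self.header_names[tag], None)
            self.active[self.header_names[self.in_header]] = text
        else:
            self.chunks.append((dict(self.active), text))


sample_html = (
    "<h1>Foo</h1><p>Intro</p>"
    "<h2>Bar</h2><p>Bar text</p>"
    "<h2>Baz</h2><p>Baz text</p>"
)
chunker = HeaderChunker([("h1", "Header 1"), ("h2", "Header 2")])
chunker.feed(sample_html)
for metadata, text in chunker.chunks:
    print(metadata, text)
```

Each text element inherits every header above it, which is what lets a retrieved chunk carry its section context along.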

In [43]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string="""
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

headers_to_split_on=[
    ("h1","Header 1"),
    ("h2","Header 2"),
    ("h3","Header 3")
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits


[Document(page_content='Foo'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Some intro text about Foo.  \nBar main section Bar subsection 1 Bar subsection 2'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}, page_content='Some intro text about Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}, page_content='Some text about the first subtopic of Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}, page_content='Some text about the second subtopic of Bar.'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Baz'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='Some text about Baz'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Some concluding text about Foo')]

In [44]:
url = "https://plato.stanford.edu/entries/goedel/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)
html_header_splits

[Document(page_content="Stanford Encyclopedia of Philosophy  \nMenu  \nBrowse About Support SEP  \nTable of Contents What's New Random Entry Chronological Archives  \nEditorial Information About the SEP Editorial Board How to Cite the SEP Special Characters Advanced Tools Contact  \nSupport the SEP PDFs for SEP Friends Make a Donation SEPIA for Libraries  \nEntry Navigation  \nEntry Contents Bibliography Academic Tools Friends PDF Preview Author and Citation Info Back to Top  \nKurt Gödel"),
 Document(metadata={'Header 1': 'Kurt Gödel'}, page_content='First published Tue Feb 13, 2007; substantive revision Fri Dec 11, 2015  \nKurt Friedrich Gödel (b. 1906, d. 1978) was one of the principal founders of the modern, metamathematical era in mathematical logic. He is widely known for his Incompleteness Theorems, which are among the handful of landmark theorems in twentieth century mathematics, but his work touched every field of mathematical logic, if it was not in most cases their original 

### JSON Splitter

This JSON splitter splits JSON data while allowing control over chunk sizes. It traverses the JSON data depth first and builds smaller JSON chunks. It attempts to keep nested JSON objects whole but will split them if needed to keep chunks between min_chunk_size and max_chunk_size.

If a value is not nested JSON but rather a very large string, the string will not be split. If you need a hard cap on the chunk size, consider composing this with a recursive text splitter on those chunks. There is an optional pre-processing step that splits lists by first converting them to JSON (dict) and then splitting them as such.

->How the text is split: by JSON value.
->How the chunk size is measured: by number of characters.
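
The depth-first traversal described above can be sketched in plain Python. This is a simplified illustration, not the library's algorithm: `split_json_sketch` is a hypothetical helper, it has no `min_chunk_size`, it handles only dicts, and, as in the description, a single long string is never cut.

```python
import json


def split_json_sketch(data, max_chunk_size=100):
    """Depth-first split of a nested dict into smaller dicts (illustration only)."""
    chunks = [{}]

    def size(obj):
        return len(json.dumps(obj))

    def set_nested(target, path, value):
        # Recreate the key path inside the chunk, then store the value
        for key in path[:-1]:
            target = target.setdefault(key, {})
        target[path[-1]] = value

    def walk(node, path):
        for key, value in node.items():
            new_path = path + [key]
            remaining = max_chunk_size - size(chunks[-1])
            if size({key: value}) < remaining:
                set_nested(chunks[-1], new_path, value)   # fits whole
            elif isinstance(value, dict) and value:
                walk(value, new_path)                     # too big: descend
            else:
                if chunks[-1]:
                    chunks.append({})                     # start a new chunk
                set_nested(chunks[-1], new_path, value)   # leaf is kept whole

    walk(data, [])
    return [chunk for chunk in chunks if chunk]


data = {
    "a": {"x": "1" * 30, "y": "2" * 30},
    "b": {"z": "3" * 30},
    "c": "4" * 30,
}
json_chunks = split_json_sketch(data, max_chunk_size=60)
for chunk in json_chunks:
    print(len(json.dumps(chunk)), chunk)
```

Every chunk keeps the full key path of the values it holds, so the chunks can be merged back into the original structure.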

In [45]:
import json
import requests

json_data=requests.get("https://api.smith.langchain.com/openapi.json").json()
json_data


{'openapi': '3.1.0',
 'info': {'title': 'LangSmith', 'version': '0.1.0'},
 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'],
    'summary': 'Read Tracer Session',
    'description': 'Get a specific session.',
    'operationId': 'read_tracer_session_api_v1_sessions__session_id__get',
    'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}],
    'parameters': [{'name': 'session_id',
      'in': 'path',
      'required': True,
      'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}},
     {'name': 'include_stats',
      'in': 'query',
      'required': False,
      'schema': {'type': 'boolean',
       'default': False,
       'title': 'Include Stats'}},
     {'name': 'accept',
      'in': 'header',
      'required': False,
      'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
       'title': 'Accept'}}],
    'responses': {'200': {'description': 'Successful Response',
      'content': {'application/json': {'sch