### Data Transformation

This step in RAG (Retrieval-Augmented Generation) splits large documents into smaller text chunks that can be embedded and retrieved individually.

In [1]:
# Load data 
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("data/cricket.pdf")
docs = loader.load()
docs

[Document(metadata={'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': 'PyPDF', 'creationdate': 'D:20250609032520', 'source': 'data/cricket.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content="Cricket: A Gentleman's Game\nCricket is a bat-and-ball game played between two teams of eleven players on a field at the center\nof which is a 22-yard pitch with a wicket at each end. \nThe game has its origins in 16th-century England and has evolved into a global sport, particularly\npopular in countries like India, Australia, England, Pakistan, and South Africa.\nThere are different formats of cricket, including Test matches, One Day Internationals (ODIs), and\nTwenty20 (T20) games. Each format has its own charm, with Test matches known for their strategic\ndepth, ODIs for balanced gameplay, and T20s for fast-paced entertainment.\nKey players in the history of cricket include legends such as Sachin Tendulkar, Don Bradman,\nMuttiah Muralitharan, and Jacques Kall

### Recursive Character Text Splitter

This text splitter is the recommended one for generic text. It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, since those are generally the most strongly semantically related pieces of text. <br>

- Text is split by a list of characters
- Chunk size is measured by number of characters.
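
The "try each separator in order" idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, and overlap handling is left out to keep the recursion easy to follow.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging while the chunk still fits
            continue
        if current:
            chunks.append(current)       # flush what has been merged so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry it with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


text = "one two three\n\nfour five six seven eight nine ten"
chunks = recursive_split(text, chunk_size=20)
print(chunks)
# ['one two three', 'four five six seven', 'eight nine ten']
```

The first paragraph fits in one chunk, so it stays whole. The second paragraph is too long, so only that paragraph falls back to splitting on spaces.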

Two parameters are usually set: <br>

<b>chunk_size</b>: Maximum length of each chunk, in characters. <br>
<b>chunk_overlap</b>: Number of characters (at most) from the end of the previous chunk that are repeated at the start of the current chunk.
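
To see how the two parameters interact, here is a minimal character-level sketch. The helper `sliding_chunks` is hypothetical and cuts at fixed positions; LangChain's splitters instead build the overlap from whole pieces between separators, so the real overlap is often shorter than `chunk_overlap`.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that each repeat the tail of the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap    # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                        # the last window already reached the end
    return chunks


windows = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(windows)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each window starts with the last three characters of the one before it, which is what keeps context that straddles a chunk boundary available to both chunks.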

In [2]:
# split_documents => split a list of Document objects (each chunk keeps its source metadata)
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=15)
split_documents = text_splitter.split_documents(docs)
split_documents

[Document(metadata={'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': 'PyPDF', 'creationdate': 'D:20250609032520', 'source': 'data/cricket.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content="Cricket: A Gentleman's Game"),
 Document(metadata={'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': 'PyPDF', 'creationdate': 'D:20250609032520', 'source': 'data/cricket.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='Cricket is a bat-and-ball game played between two teams of eleven players on a field at the center'),
 Document(metadata={'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': 'PyPDF', 'creationdate': 'D:20250609032520', 'source': 'data/cricket.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='of which is a 22-yard pitch with a wicket at each end.'),
 Document(metadata={'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': 'PyPDF', 'creationdate': 'D:20250609032

In [3]:
# create_documents => build split Documents from raw text strings (metadata starts empty)

import fitz  # PyMuPDF

# Extract text from PDF
cricket_data = ""
with fitz.open("data/cricket.pdf") as pdf:
    for page in pdf:
        cricket_data += page.get_text()

# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=15)
split_documents = text_splitter.create_documents([cricket_data])
split_documents  


[Document(metadata={}, page_content="Cricket: A Gentleman's Game"),
 Document(metadata={}, page_content='Cricket is a bat-and-ball game played between two'),
 Document(metadata={}, page_content='between two teams of eleven players on a field at'),
 Document(metadata={}, page_content='on a field at the center'),
 Document(metadata={}, page_content='of which is a 22-yard pitch with a wicket at each'),
 Document(metadata={}, page_content='wicket at each end.'),
 Document(metadata={}, page_content='The game has its origins in 16th-century England'),
 Document(metadata={}, page_content='England and has evolved into a global sport,'),
 Document(metadata={}, page_content='global sport, particularly'),
 Document(metadata={}, page_content='popular in countries like India, Australia,'),
 Document(metadata={}, page_content='Australia, England, Pakistan, and South Africa.'),
 Document(metadata={}, page_content='There are different formats of cricket, including'),
 Document(metadata={}, page_conten

### Character Text Splitter

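This is the simplest method. It splits on a single given character sequence, which defaults to "\n\n", and then merges adjacent pieces up to the chunk size. Chunk length is measured in characters. A piece that is longer than the chunk size is never split further, so it is emitted as-is with a warning. <br>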
- Text is split by a single character separator.
- Chunk size is measured by number of characters.
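
The behaviour described above can be sketched without LangChain. The function `split_on_separator` is a hypothetical, simplified stand-in (no overlap, no whitespace stripping) that shows why a single long line ends up larger than `chunk_size`.

```python
def split_on_separator(text, separator="\n\n", chunk_size=50):
    """Split on one separator, then merge neighbouring pieces while they fit in chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece.strip():
            continue                     # skip empty pieces between repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece              # may exceed chunk_size; it is never cut further
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Cricket: A Gentleman's Game\n"
    "Cricket is a bat-and-ball game played between two teams of eleven players "
    "on a field at the center"
)
parts = split_on_separator(sample, separator="\n", chunk_size=50)
for part in parts:
    print(len(part), repr(part))
```

The second line has no newline inside it, so there is nowhere to split and it is returned whole even though it is longer than 50 characters.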

In [4]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("data/cricket.pdf")
docs = loader.load()
docs

[Document(metadata={'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': 'PyPDF', 'creationdate': 'D:20250609032520', 'source': 'data/cricket.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content="Cricket: A Gentleman's Game\nCricket is a bat-and-ball game played between two teams of eleven players on a field at the center\nof which is a 22-yard pitch with a wicket at each end. \nThe game has its origins in 16th-century England and has evolved into a global sport, particularly\npopular in countries like India, Australia, England, Pakistan, and South Africa.\nThere are different formats of cricket, including Test matches, One Day Internationals (ODIs), and\nTwenty20 (T20) games. Each format has its own charm, with Test matches known for their strategic\ndepth, ODIs for balanced gameplay, and T20s for fast-paced entertainment.\nKey players in the history of cricket include legends such as Sachin Tendulkar, Don Bradman,\nMuttiah Muralitharan, and Jacques Kall

In [5]:
from langchain_text_splitters import CharacterTextSplitter
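# Most lines in this PDF are longer than 50 characters, so they exceed chunk_size and trigger the warnings below
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=50, chunk_overlap=15)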
text_splitter.split_documents(docs)

Created a chunk of size 98, which is longer than the specified 50
Created a chunk of size 55, which is longer than the specified 50
Created a chunk of size 98, which is longer than the specified 50
Created a chunk of size 80, which is longer than the specified 50
Created a chunk of size 98, which is longer than the specified 50
Created a chunk of size 96, which is longer than the specified 50
Created a chunk of size 73, which is longer than the specified 50
Created a chunk of size 92, which is longer than the specified 50
Created a chunk of size 99, which is longer than the specified 50
Created a chunk of size 74, which is longer than the specified 50
Created a chunk of size 98, which is longer than the specified 50
Created a chunk of size 76, which is longer than the specified 50


[Document(metadata={'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': 'PyPDF', 'creationdate': 'D:20250609032520', 'source': 'data/cricket.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content="Cricket: A Gentleman's Game"),
 Document(metadata={'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': 'PyPDF', 'creationdate': 'D:20250609032520', 'source': 'data/cricket.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='Cricket is a bat-and-ball game played between two teams of eleven players on a field at the center'),
 Document(metadata={'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': 'PyPDF', 'creationdate': 'D:20250609032520', 'source': 'data/cricket.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='of which is a 22-yard pitch with a wicket at each end.'),
 Document(metadata={'producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', 'creator': 'PyPDF', 'creationdate': 'D:20250609032

In [6]:
import fitz  # PyMuPDF

cricket_data = ""
with fitz.open("data/cricket.pdf") as file:
    for page in file:
        cricket_data += page.get_text()

text_splitter = CharacterTextSplitter(separator="\n", chunk_size=50, chunk_overlap=15)
text_splitter.create_documents([cricket_data])

Created a chunk of size 98, which is longer than the specified 50
Created a chunk of size 55, which is longer than the specified 50
Created a chunk of size 98, which is longer than the specified 50
Created a chunk of size 80, which is longer than the specified 50
Created a chunk of size 98, which is longer than the specified 50
Created a chunk of size 96, which is longer than the specified 50
Created a chunk of size 73, which is longer than the specified 50
Created a chunk of size 92, which is longer than the specified 50
Created a chunk of size 99, which is longer than the specified 50
Created a chunk of size 74, which is longer than the specified 50
Created a chunk of size 98, which is longer than the specified 50
Created a chunk of size 76, which is longer than the specified 50


[Document(metadata={}, page_content="Cricket: A Gentleman's Game"),
 Document(metadata={}, page_content='Cricket is a bat-and-ball game played between two teams of eleven players on a field at the center'),
 Document(metadata={}, page_content='of which is a 22-yard pitch with a wicket at each end.'),
 Document(metadata={}, page_content='The game has its origins in 16th-century England and has evolved into a global sport, particularly'),
 Document(metadata={}, page_content='popular in countries like India, Australia, England, Pakistan, and South Africa.'),
 Document(metadata={}, page_content='There are different formats of cricket, including Test matches, One Day Internationals (ODIs), and'),
 Document(metadata={}, page_content='Twenty20 (T20) games. Each format has its own charm, with Test matches known for their strategic'),
 Document(metadata={}, page_content='depth, ODIs for balanced gameplay, and T20s for fast-paced entertainment.'),
 Document(metadata={}, page_content='Key players

### HTMLHeaderTextSplitter

HTMLHeaderTextSplitter is a "structure-aware" chunker that splits text at the HTML element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements that share the same metadata. Its objectives are: <br>

- Keeping semantically related text grouped together.
- Preserving context-rich information encoded in the document structure. <br>

It can be used with other text splitters as part of a chunking pipeline.
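
The header-tracking idea can be illustrated with the standard library's `html.parser`. This is a rough sketch rather than how HTMLHeaderTextSplitter is implemented: the class `HeaderChunker` is hypothetical, it assumes header tags of the form `h1` to `h6`, and unlike the real splitter it does not emit the header text itself as a chunk.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Collect text chunks and tag each one with the headers currently in scope."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.labels = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.active = {}         # header tag -> header text currently in scope
        self.open_header = None  # tracked header tag we are inside, if any
        self.chunks = []         # list of (metadata, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.labels:
            self.open_header = tag

    def handle_endtag(self, tag):
        if tag == self.open_header:
            self.open_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.open_header:
            # A new header ends every header of the same or a deeper level.
            level = int(self.open_header[1])
            self.active = {t: v for t, v in self.active.items() if int(t[1]) < level}
            self.active[self.open_header] = text
            return
        metadata = {self.labels[t]: v for t, v in sorted(self.active.items())}
        self.chunks.append((metadata, text))


html = "<h1>Intro</h1><p>First</p><h2>Details</h2><p>Second</p><h1>Outro</h1><p>Third</p>"
chunker = HeaderChunker([("h1", "Header 1"), ("h2", "Header 2")])
chunker.feed(html)
chunker.close()
for metadata, text in chunker.chunks:
    print(metadata, text)
```

Each paragraph carries the headers it sits under, so a later, purely size-based splitter can cut the text further without losing that context.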

In [7]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = '''
<!DOCTYPE HTML>
<html>
    <body>
        <div>
            <h1>Charan</h1>
            <p>Let me introduce myself</p>
            <h3>I am from IIITDM Jabalpur</h3>
            <p>I have done bachelors in computer science</p>
            <h4>I have graduated in 2024</h4>
        </div>
    </body>
</html>
'''

headers_to_split = [
    ("h1", "Header 1"),
    ("h3", "Header 3"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'Header 1': 'Charan'}, page_content='Charan'),
 Document(metadata={'Header 1': 'Charan'}, page_content='Let me introduce myself'),
 Document(metadata={'Header 1': 'Charan', 'Header 3': 'I am from IIITDM Jabalpur'}, page_content='I am from IIITDM Jabalpur'),
 Document(metadata={'Header 1': 'Charan', 'Header 3': 'I am from IIITDM Jabalpur'}, page_content='I have done bachelors in computer science  \nI have graduated in 2024')]

In [8]:
# Fetches the live page, so the result includes navigation text and can change over time
cricket_url = "https://en.wikipedia.org/wiki/Cricket"

headers_to_split = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split)
html_header_splits = html_splitter.split_text_from_url(cricket_url)
html_header_splits

[Document(metadata={}, page_content='Jump to content  \nMain menu  \nMain menu  \nmove to sidebar  \nhide  \nNavigation  \nMain page  \nContents  \nCurrent events  \nRandom article  \nAbout Wikipedia  \nContact us  \nContribute  \nHelp  \nLearn to edit  \nCommunity portal  \nRecent changes  \nUpload file  \nSpecial pages  \nSearch  \nSearch  \nAppearance  \nDonate  \nCreate account  \nLog in  \nPersonal tools  \nDonate  \nCreate account  \nLog in  \nPages for logged out editors  \nlearn more  \nContributions  \nTalk  \nCentralNotice'),
 Document(metadata={'Header 2': 'Contents'}, page_content='Contents'),
 Document(metadata={}, page_content='move to sidebar  \nhide  \n(Top)  \n1  \nHistory  \nToggle History subsection  \n1.1  \nOrigins  \n1.2  \nGrowth of amateur and professional cricket in England  \n1.3  \nEnglish cricket in the 18th and 19th centuries  \n1.4  \nCricket becomes an international sport  \n1.5  \nCricket in the 20th century  \n1.6  \nCricket in the 21st century  \n2  \n