##### Introduction to Data Ingestion

In [19]:
from typing import List,Dict,Any
import os
import pandas as pd


In [20]:
from langchain_core.documents import Document
print("Setup completed")

Setup completed


###### Understanding Document Structure in LangChain

In [21]:
# create a simple document
doc = Document(
    page_content="This is the main text content that will be embedded and searched.",
    metadata = {
        "source":"example.txt",
        "page":1,
        "author":"bhopindrasingh parmar",
        "date_created":"2025-08-23",
        "custom_field":"any_value"
    }
)

print("Document Structure:-")
print(f"Content:{doc.page_content}")
print(f"metadata: {doc.metadata}")


Document Structure:-
Content:This is the main text content that will be embedded and searched.
metadata: {'source': 'example.txt', 'page': 1, 'author': 'bhopindrasingh parmar', 'date_created': '2025-08-23', 'custom_field': 'any_value'}
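
The two fields play different roles: `page_content` is the text that gets embedded, while `metadata` is typically used for filtering and source attribution. The following is a minimal sketch of that idea, using a hypothetical `SimpleDocument` dataclass as a stand-in for LangChain's `Document` so it runs without LangChain installed. The helper `filter_by_metadata` is also an illustrative name, not a LangChain API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[SimpleDocument], **criteria: Any) -> List[SimpleDocument]:
    """Return the documents whose metadata matches every key/value in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


docs = [
    SimpleDocument("Page one text.", {"source": "example.txt", "page": 1}),
    SimpleDocument("Page two text.", {"source": "example.txt", "page": 2}),
    SimpleDocument("Other file text.", {"source": "other.txt", "page": 1}),
]

# Keep only page 1 of example.txt; the content itself is never inspected.
matches = filter_by_metadata(docs, source="example.txt", page=1)
print([d.page_content for d in matches])
```

Keeping the filter criteria in `metadata` rather than in the text means the embedded content stays clean while the source information remains available at query time.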


###### Text Files (.txt)

In [22]:
import os
os.makedirs("data/text_files",exist_ok=True)

In [23]:
sample_texts = {
    "data/text_files/python_intro.txt":'''
Python is a high-level, interpreted programming language that has become one of the most popular tools in the tech world. Known for its simple syntax and readability, it allows developers to write clean and efficient code without getting lost in overly complex structures. Unlike low-level languages, Python abstracts many details of computer operations, making it beginner-friendly while still being powerful enough for advanced projects. Its vast ecosystem of libraries and frameworks gives it a huge advantage in almost every domain of computing.

Python is used across a wide range of industries and applications. In web development, frameworks like Django and Flask help developers build robust websites and APIs. In data science and analytics, libraries like Pandas, NumPy, and Matplotlib are essential for handling and visualizing data. Python is also dominant in artificial intelligence and machine learning thanks to libraries such as TensorFlow, PyTorch, and Scikit-learn. Beyond that, it’s used in automation, scripting, cybersecurity, DevOps, and even game development. Its versatility is one of the main reasons it continues to dominate the programming landscape.

Python in Machine Learning
One of Python’s strongest suits is its role in machine learning (ML). Machine learning is a subset of artificial intelligence where systems learn patterns from data instead of being explicitly programmed. Python makes this process easier because of its extensive libraries and active community. Developers and researchers use it to build models that can predict outcomes, classify information, and even generate new content.

Types of Machine Learning Techniques
There are several types of machine learning techniques, each serving different purposes:

Supervised Learning – This involves training a model on labeled data, meaning the inputs and correct outputs are known. It’s commonly used for tasks like spam detection, stock price prediction, or medical diagnosis.

Unsupervised Learning – Here, the model works with unlabeled data to discover hidden patterns or groupings. A good example is customer segmentation, where businesses group users based on behavior without prior labels.

Reinforcement Learning – In this technique, an agent learns by interacting with its environment and receiving rewards or penalties. It’s used in robotics, gaming (like AlphaGo), and self-driving cars.

Semi-Supervised Learning – A mix of supervised and unsupervised approaches, where models are trained with a small portion of labeled data and a large portion of unlabeled data. It’s helpful when labeling data is expensive or time-consuming.

Deep Learning – A subset of machine learning that uses neural networks with many layers to handle complex tasks like image recognition, natural language processing, and voice assistants.
''',

"data/text_files/intro_to_rag.txt":'''
Retrieval-Augmented Generation, or RAG, is a framework that combines two powerful AI techniques: information retrieval and text generation. Traditional large language models (LLMs) like GPT rely solely on their pre-trained knowledge, which is limited to what they’ve seen during training. This can cause them to produce outdated or incorrect information. RAG fixes this issue by retrieving relevant, up-to-date information from an external knowledge source—like a database, vector store, or even the web—and then using a language model to generate responses based on that retrieved context.

How RAG Works
The RAG process happens in two main steps: retrieval and generation. In the retrieval step, the system takes the user’s query and searches for the most relevant pieces of information, often using a vector database that stores text in numerical embeddings. Then in the generation step, the retrieved documents are passed to a large language model, which uses them as context to produce an accurate, coherent answer. This approach reduces hallucinations, improves factual accuracy, and ensures that the responses are grounded in real data rather than guesses.

Applications of RAG
RAG has a wide range of applications across industries. In customer support, it can provide instant, accurate answers by pulling information from product manuals or FAQs. In healthcare, it can assist doctors by retrieving the latest medical research before generating a summary. In legal and finance, it helps professionals by finding relevant case studies, policies, or market data to create well-informed reports. Essentially, RAG is the backbone of modern AI assistants that need to be both intelligent and reliable.

Why RAG Matters
The real strength of RAG lies in its ability to extend the capabilities of language models without retraining them from scratch. Instead of building an entirely new model every time knowledge changes, developers can simply update the external database. This makes RAG not only powerful but also scalable and cost-efficient. As AI continues to grow, RAG will play a crucial role in building applications that can reason over private, domain-specific, or constantly changing data sources.
'''

}

for file_path,content in sample_texts.items():
    with open(file_path,'w',encoding='utf-8') as f:
        f.write(content)

print("Sample text files created!")


Sample text files created!


###### TextLoader

In [24]:
from langchain_community.document_loaders import TextLoader

# loading a single text file
loader = TextLoader("data/text_files/python_intro.txt",encoding='utf-8')

documents = loader.load()

print(f"Loaded {len(documents)} document")
print(f"Content preview:{documents[0].page_content[:100]}...")
print(f"metadata: {documents[0].metadata}")


Loaded 1 document
Content preview:
Python is a high-level, interpreted programming language that has become one of the most popular to...
metadata: {'source': 'data/text_files/python_intro.txt'}
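
The output above shows the shape of the result: one file becomes one document whose metadata records the source path. As a rough sketch of that behavior (not LangChain's actual implementation), the same result can be produced with the standard library. The function `load_text_file` and the plain-dict document format are assumptions made for illustration.

```python
import os
import tempfile
from typing import Any, Dict, List


def load_text_file(path: str, encoding: str = "utf-8") -> List[Dict[str, Any]]:
    """Read one file and wrap it as a single document-like dict."""
    with open(path, "r", encoding=encoding) as f:
        content = f.read()
    # One file -> one document, with the path kept as the source.
    return [{"page_content": content, "metadata": {"source": path}}]


# Write a throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "demo.txt")
with open(tmp_path, "w", encoding="utf-8") as f:
    f.write("Hello from a plain text file.")

loaded = load_text_file(tmp_path)
print(len(loaded), loaded[0]["metadata"])
```

Passing the encoding explicitly, as the `TextLoader` call above does, avoids platform-dependent defaults when files contain non-ASCII characters such as curly quotes.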


##### DirectoryLoader (reading an entire directory of files)

In [25]:
from langchain_community.document_loaders import DirectoryLoader

# Load all text files from the directory
dir_loader = DirectoryLoader(
    "data/text_files",
    glob="**/*.txt",  # pattern to match files (recursive)
    loader_cls=TextLoader,  # loader class used for each matched file
    loader_kwargs={'encoding':'utf-8'},
    show_progress=True
)

documents = dir_loader.load()

print(f"Loaded {len(documents)} documents")
for i,doc in enumerate(documents):
    print(f"\nDocument {i+1}")
    print(f"Source:{doc.metadata['source']}")
    print(f"Length: {len(doc.page_content)} characters")

100%|██████████| 2/2 [00:00<00:00, 2871.83it/s]

Loaded 2 documents

Document 1
Source:data/text_files/intro_to_rag.txt
Length: 2210 characters

Document 2
Source:data/text_files/python_intro.txt
Length: 2826 characters
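
The `glob="**/*.txt"` pattern matches `.txt` files in the directory and in any subdirectory. A minimal standard-library sketch of the same idea is shown below; it is an approximation of the loader's behavior, and `load_text_dir` plus the dict-based document format are hypothetical names chosen for this example.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_text_dir(root: str, pattern: str = "**/*.txt", encoding: str = "utf-8") -> List[Dict[str, Any]]:
    """Load every file under root that matches pattern as one document each."""
    docs = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            docs.append({
                "page_content": path.read_text(encoding=encoding),
                "metadata": {"source": str(path)},
            })
    return docs


# Build a small throwaway directory tree for the demonstration.
root = Path(tempfile.mkdtemp())
(root / "nested").mkdir()
(root / "a.txt").write_text("first file", encoding="utf-8")
(root / "nested" / "b.txt").write_text("second file", encoding="utf-8")
(root / "notes.md").write_text("ignored by the pattern", encoding="utf-8")

dir_docs = load_text_dir(str(root))
for d in dir_docs:
    print(d["metadata"]["source"], len(d["page_content"]))
```

The `.md` file is skipped because it does not match the pattern, while the nested `.txt` file is picked up through the `**` component.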





###### Text Splitting Strategies

In [26]:
# In recent LangChain releases the splitters live in the langchain_text_splitters package
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
)

In [27]:
print(documents)

[Document(metadata={'source': 'data/text_files/intro_to_rag.txt'}, page_content='\nRetrieval-Augmented Generation, or RAG, is a framework that combines two powerful AI techniques: information retrieval and text generation. Traditional large language models (LLMs) like GPT rely solely on their pre-trained knowledge, which is limited to what they’ve seen during training. This can cause them to produce outdated or incorrect information. RAG fixes this issue by retrieving relevant, up-to-date information from an external knowledge source—like a database, vector store, or even the web—and then using a language model to generate responses based on that retrieved context.\n\nHow RAG Works\nThe RAG process happens in two main steps: retrieval and generation. In the retrieval step, the system takes the user’s query and searches for the most relevant pieces of information, often using a vector database that stores text in numerical embeddings. Then in the generation step, the retrieved documents

###### Method 1: Character Text Splitter

In [28]:
text = documents[0].page_content
text

'\nRetrieval-Augmented Generation, or RAG, is a framework that combines two powerful AI techniques: information retrieval and text generation. Traditional large language models (LLMs) like GPT rely solely on their pre-trained knowledge, which is limited to what they’ve seen during training. This can cause them to produce outdated or incorrect information. RAG fixes this issue by retrieving relevant, up-to-date information from an external knowledge source—like a database, vector store, or even the web—and then using a language model to generate responses based on that retrieved context.\n\nHow RAG Works\nThe RAG process happens in two main steps: retrieval and generation. In the retrieval step, the system takes the user’s query and searches for the most relevant pieces of information, often using a vector database that stores text in numerical embeddings. Then in the generation step, the retrieved documents are passed to a large language model, which uses them as context to produce an 

In [30]:
char_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size = 200,
    chunk_overlap = 20,
    length_function = len
)

char_chunk=char_splitter.split_text(text)
print(f"Created {len(char_chunk)} chunks")
print(f"First chunk: {char_chunk[0][:100]}...")

Created a chunk of size 590, which is longer than the specified 200
Created a chunk of size 557, which is longer than the specified 200
Created a chunk of size 519, which is longer than the specified 200


Created 7 chunks
First chunk: Retrieval-Augmented Generation, or RAG, is a framework that combines two powerful AI techniques: inf...


In [34]:
print(char_chunk[0])
print("-"*30)
print(char_chunk[1])
print("-"*30)
print(char_chunk[2])

Retrieval-Augmented Generation, or RAG, is a framework that combines two powerful AI techniques: information retrieval and text generation. Traditional large language models (LLMs) like GPT rely solely on their pre-trained knowledge, which is limited to what they’ve seen during training. This can cause them to produce outdated or incorrect information. RAG fixes this issue by retrieving relevant, up-to-date information from an external knowledge source—like a database, vector store, or even the web—and then using a language model to generate responses based on that retrieved context.
------------------------------
How RAG Works
------------------------------
The RAG process happens in two main steps: retrieval and generation. In the retrieval step, the system takes the user’s query and searches for the most relevant pieces of information, often using a vector database that stores text in numerical embeddings. Then in the generation step, the retrieved documents are passed to a large la
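
The "longer than the specified 200" warnings above have a simple cause: the text is cut only at the separator (`"\n"`), so a paragraph with no newline inside it cannot be made smaller. That is also why the headings end up as chunks of their own. The sketch below is a simplified approximation of this split-then-merge behavior, written for illustration only. It ignores overlap, and `split_and_merge` is a hypothetical function, not LangChain's code.

```python
from typing import List


def split_and_merge(text: str, separator: str, chunk_size: int) -> List[str]:
    """Split on separator, then greedily merge pieces up to chunk_size.

    Simplified: no overlap handling. A single piece longer than chunk_size
    is emitted as-is, since there is no separator inside it to cut at.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size on its own
    if current:
        chunks.append(current)
    return chunks


demo = "short line\n" + "x" * 50 + "\nanother short line\nlast"
demo_chunks = split_and_merge(demo, "\n", chunk_size=30)
for chunk in demo_chunks:
    print(len(chunk), repr(chunk[:20]))
```

In the demo, the 50-character line is kept whole even though `chunk_size` is 30, while the two short trailing lines are merged into one chunk.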

###### Method 2: Recursive Character Splitting (Recommended)

In [42]:
recursive_splitter = RecursiveCharacterTextSplitter(
    # Only spaces are used here; the default list is ["\n\n", "\n", " ", ""]
    separators=[" "],
    chunk_size = 200,
    chunk_overlap = 20,
    length_function = len
)

In [43]:
recursive_chunks=recursive_splitter.split_text(text)
print(f"Created {len(recursive_chunks)} chunks")
print(f"First chunk: {recursive_chunks[0][:100]}...")

Created 13 chunks
First chunk: Retrieval-Augmented Generation, or RAG, is a framework that combines two powerful AI techniques: inf...


In [44]:
print(recursive_chunks[0])
print("-"*30)
print(recursive_chunks[1])
print("-"*30)
print(recursive_chunks[2])

Retrieval-Augmented Generation, or RAG, is a framework that combines two powerful AI techniques: information retrieval and text generation. Traditional large language models (LLMs) like GPT rely
------------------------------
like GPT rely solely on their pre-trained knowledge, which is limited to what they’ve seen during training. This can cause them to produce outdated or incorrect information. RAG fixes this issue by
------------------------------
fixes this issue by retrieving relevant, up-to-date information from an external knowledge source—like a database, vector store, or even the web—and then using a language model to generate responses


In [41]:
# Create sample text without natural break points (no newlines)

sample_text = "Retrieval-Augmented Generation,or RAG, is a framework that combines two powerful AI techniques: information retrieval and text generation.Traditional large language models (LLMs) like GPT rely solely on their pre-trained knowledge, which is limited to what they’ve seen during training. This can cause them to produce outdated or incorrect information. RAG fixes this issue by retrieving relevant, up-to-date information from an external knowledge source—like a database, vector store, or even the web—and then using a language model to generate responses based on that retrieved context."

splitter = RecursiveCharacterTextSplitter(
    separators=[" "],
    chunk_size = 80,
    chunk_overlap = 20,
    length_function = len
)


chunks = splitter.split_text(sample_text)

print(f"\nSample text example - {len(chunks)} chunks:\n")

for i in range(len(chunks) - 1):
    print(f"chunk {i+1}: '{chunks[i]}'")
    print(f"chunk {i+2}: '{chunks[i+1]}'")

    print()


Sample text example - 10 chunks:

chunk 1: 'Retrieval-Augmented Generation,or RAG, is a framework that combines two powerful'
chunk 2: 'two powerful AI techniques: information retrieval and text'

chunk 2: 'two powerful AI techniques: information retrieval and text'
chunk 3: 'retrieval and text generation.Traditional large language models (LLMs) like GPT'

chunk 3: 'retrieval and text generation.Traditional large language models (LLMs) like GPT'
chunk 4: '(LLMs) like GPT rely solely on their pre-trained knowledge, which is limited to'

chunk 4: '(LLMs) like GPT rely solely on their pre-trained knowledge, which is limited to'
chunk 5: 'which is limited to what they’ve seen during training. This can cause them to'

chunk 5: 'which is limited to what they’ve seen during training. This can cause them to'
chunk 6: 'can cause them to produce outdated or incorrect information. RAG fixes this'

chunk 6: 'can cause them to produce outdated or incorrect information. RAG fixes this'
chunk 7: 'RA
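
The paired output above makes the overlap visible: the end of each chunk reappears at the start of the next one. A small helper can measure that shared text directly. This is a sketch for inspection purposes, with `shared_overlap` being a hypothetical name; the two strings are copied from the printed chunks 1 and 2.

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of left that is also a prefix of right."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


chunk_1 = "Retrieval-Augmented Generation,or RAG, is a framework that combines two powerful"
chunk_2 = "two powerful AI techniques: information retrieval and text"

overlap = shared_overlap(chunk_1, chunk_2)
print(repr(overlap), len(overlap))
```

The shared text here is shorter than the configured `chunk_overlap` of 20 because the splitter only cuts at spaces, so the overlap is built from whole words that fit within the limit.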

###### Method 3: Token-Based Splitting

In [46]:
token_splitter = TokenTextSplitter(
    chunk_size = 50,
    chunk_overlap = 10
)

token_chunk = token_splitter.split_text(text)
print(f"Total {len(token_chunk)} chunks")
print(f"First chunk:{token_chunk[0][:100]}...")

Total 11 chunks
First chunk:
Retrieval-Augmented Generation, or RAG, is a framework that combines two powerful AI techniques: in...
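
With token-based splitting, `chunk_size` and `chunk_overlap` are counted in tokens rather than characters, and the text is cut with a sliding window over the token sequence. The sketch below illustrates that windowing under a loud assumption: whitespace-separated words stand in for real tokenizer output, so the counts will not match what `TokenTextSplitter` produces. `sliding_token_chunks` is a hypothetical helper written for this example.

```python
from typing import List


def sliding_token_chunks(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Cut a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break  # the last window reached the end of the input
        start += step
    return chunks


# Assumption: words act as tokens here; a real tokenizer splits differently.
words = "RAG retrieves relevant context before the model generates an answer".split()
token_windows = sliding_token_chunks(words, chunk_size=4, chunk_overlap=1)
for window in token_windows:
    print(" ".join(window))
```

Counting in tokens is useful when chunks must fit a model's context limit, since that limit is also expressed in tokens rather than characters.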
