### Data Ingestion

In [2]:
from langchain_core.documents import Document

In [3]:
doc = Document(
    page_content="this is the main text content I am using to create RAG",
    metadata = {
        "source":"example.txt",
        "pages":1,
        "author":"devam",
        "date_created":"2024-06-10"
    }
)
doc

Document(metadata={'source': 'example.txt', 'pages': 1, 'author': 'devam', 'date_created': '2024-06-10'}, page_content='this is the main text content I am using to create RAG')

In [5]:
# Create the directory that will hold the sample text files
import os
os.makedirs("../data/text_files", exist_ok=True)

In [6]:
sample_text = {
    "../data/text_files/python_intro.txt":"""Python programming introduction
    
    Python is one of the most popular programming languages in the world today. It’s known for being beginner-friendly, highly readable, and incredibly versatile. Whether you want to build websites, analyze data, or create Artificial Intelligence, Python is often the go-to choice.

Here is a foundational overview to get you started.

1. Why Choose Python?
Python's philosophy focuses on code readability and simplicity. Here's why it stands out:

Easy to Learn: The syntax looks a lot like English. It avoids the complex symbols (like ; and {}) found in languages like C++ or Java.

Interpreted: Python executes code line-by-line, which makes debugging much faster.

Huge Library Support: There are "packages" for almost everything—from scientific computing (NumPy) to web development (Django).

Community: Because it's so popular, you can find solutions to almost any problem online instantly.

2. Core Concepts
To understand Python, you need to be familiar with its basic building blocks:

Variables: Used to store information (e.g., name = "Alice").

Data Types: Python handles numbers (integers/floats), text (strings), and true/false values (booleans).

Indentation: Unlike other languages, Python uses whitespace (indentation) to define blocks of code. If your indentation is wrong, the code won't run!

Functions: Reusable blocks of code that perform a specific task.
    """
}
for filepath, content in sample_text.items():
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(content)
        
print("Sample text files created.")

Sample text files created.


In [17]:
from langchain_community.document_loaders import TextLoader

# TextLoader reads one file and returns a list containing a single Document
loaders = TextLoader("../data/text_files/python_intro.txt", encoding="utf-8")
document = loaders.load()
document

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='Python programming introduction\n\n    Python is one of the most popular programming languages in the world today. It’s known for being beginner-friendly, highly readable, and incredibly versatile. Whether you want to build websites, analyze data, or create Artificial Intelligence, Python is often the go-to choice.\n\nHere is a foundational overview to get you started.\n\n1. Why Choose Python?\nPython\'s philosophy focuses on code readability and simplicity. Here\'s why it stands out:\n\nEasy to Learn: The syntax looks a lot like English. It avoids the complex symbols (like ; and {}) found in languages like C++ or Java.\n\nInterpreted: Python executes code line-by-line, which makes debugging much faster.\n\nHuge Library Support: There are "packages" for almost everything—from scientific computing (NumPy) to web development (Django).\n\nCommunity: Because it\'s so popular, you can find solutions to almos

In [18]:
# Load every .txt file under the directory (recursively) with TextLoader
from langchain_community.document_loaders import DirectoryLoader
dir_loader = DirectoryLoader("../data/text_files",
                             glob="**/*.txt",
                             loader_cls=TextLoader,
                             loader_kwargs={"encoding": "utf-8"},
                             show_progress=False)
documents = dir_loader.load()
documents

[Document(metadata={'source': '..\\data\\text_files\\machine_learning.txt'}, page_content='Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on building systems that learn from data to improve their performance over time, rather than being explicitly programmed for every specific task.\n\nIf Python is the engine, Machine Learning is one of the most powerful destinations you can reach with it.\n\n1. How Machine Learning Works\nAt its core, ML is about finding patterns. Instead of writing complex "if-then" rules, you feed a computer a large amount of data, and it builds a Model to make predictions or decisions.\n\nData Collection: Gathering historical information (e.g., past house prices).\n\nTraining: Feeding that data into an algorithm.\n\nThe Model: The "brain" created after training.\n\nPrediction: Giving the model new data (e.g., a house\'s square footage) to get an output (the estimated price).\n\n2. The Three Main Types of ML\nMachine Learning is gener
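
Before handing a directory to `DirectoryLoader`, it can help to preview which files the `glob` pattern picks up. The sketch below uses only `pathlib` and assumes the pattern behaves like `Path.glob`; `preview_matches` is a hypothetical helper, not part of LangChain.

```python
from pathlib import Path


def preview_matches(root: str, pattern: str = "**/*.txt") -> list[str]:
    """Return sorted, POSIX-style relative paths of files under root matching pattern."""
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix()
        for p in base.glob(pattern)
        if p.is_file()
    )
```

Calling `preview_matches("../data/text_files")` lists the text files relative to that folder, including any in subfolders, which makes it easy to spot a file that should not be ingested.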

In [19]:
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader

# PyMuPDFLoader needs the pymupdf package; each PDF page becomes its own Document
dir_loader = DirectoryLoader("../data/pdf",
                             glob="**/*.pdf",
                             loader_cls=PyMuPDFLoader,
                             show_progress=False)
documents_pdf = dir_loader.load()
documents_pdf

[Document(metadata={'producer': 'Microsoft® Word 2016', 'creator': 'Microsoft® Word 2016', 'creationdate': '2026-01-08T15:52:59+05:30', 'source': '..\\data\\pdf\\JD_AI Engineer Salhakar.pdf', 'file_path': '..\\data\\pdf\\JD_AI Engineer Salhakar.pdf', 'total_pages': 2, 'format': 'PDF 1.5', 'title': '', 'author': 'Un-named', 'subject': '', 'keywords': '', 'moddate': '2026-01-08T15:52:59+05:30', 'trapped': '', 'modDate': "D:20260108155259+05'30'", 'creationDate': "D:20260108155259+05'30'", 'page': 0}, page_content='JOB DESCRIPTION  \n \nAI Engineer Intern  \n \n \nABOUT THE ROLE  \n  \n \nInternship Type: Hybrid  \nDuration: 4–6 Months  \nStipend: ₹10,000 – ₹20,000 per month (based on skills and prior experience)  \n  \nWe are looking for an AI Engineer Intern to work closely with our core tech team on designing, building, and optimizing \nAI systems for large-scale legal research applications. This role offers hands-on exposure to LLM-powered pipelines, \nRetrieval-Augmented Generation (
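
The metadata above carries `source`, `page`, and `total_pages`, which suggests that each PDF page arrives as its own Document. If whole-file text is needed, the pages can be regrouped by `source`. This is a minimal sketch: `Page` is a hypothetical stand-in for the real `Document` class, `merge_pages_by_source` is an illustrative helper, and the sample pages are placeholders rather than content from the PDF.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a loaded Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def merge_pages_by_source(pages):
    """Join page texts per source file, ordered by the 'page' metadata value."""
    grouped = defaultdict(list)
    for page in pages:
        grouped[page.metadata["source"]].append(page)
    return {
        source: "\n".join(
            p.page_content for p in sorted(items, key=lambda p: p.metadata["page"])
        )
        for source, items in grouped.items()
    }


# Placeholder pages, deliberately out of order
pages = [
    Page("second page", {"source": "a.pdf", "page": 1}),
    Page("first page", {"source": "a.pdf", "page": 0}),
]
merged = merge_pages_by_source(pages)
```

Because the helper only reads `page_content` and `metadata`, it should also accept a list such as `documents_pdf`, provided every item has `source` and `page` in its metadata.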

In [28]:
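from langchain_community.document_loaders import JSONLoader

# JSONLoader relies on the jq package; jq_schema selects what becomes page_content
loader = JSONLoader(
    file_path="../data/json/linear_ode_auto_000.json",
    jq_schema=".input.equation",  # extract only the equation string
    text_content=True  # require the selected value to be a string
)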

In [29]:
# load() returns a list of Documents, one per value matched by jq_schema
loader1 = loader.load()
loader1

[Document(metadata={'source': 'C:\\Users\\devam\\RAG_pipeline_from_scratch\\data\\json\\linear_ode_auto_000.json', 'seq_num': 1}, page_content='dy/dt + 1.36y = 2.65*exp(-2.1t)')]
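
`jq_schema=".input.equation"` selects a nested field from the JSON file. For a simple dotted path like this one, the selection is roughly equivalent to chained dictionary lookups. The sketch below uses only the standard library; `select_path` is a hypothetical helper that handles plain dotted keys only, not the full jq language, and the record is a trimmed copy of the file's contents.

```python
import json


def select_path(data, path: str):
    """Follow a simple dotted jq-style path such as '.input.equation'."""
    current = data
    for key in path.strip(".").split("."):
        if key:  # a bare "." selects the root object
            current = current[key]
    return current


raw = '{"problem_id": "linear_ode_auto_000", "input": {"equation": "dy/dt + 1.36y = 2.65*exp(-2.1t)", "domain": "t >= 0"}}'
record = json.loads(raw)
equation = select_path(record, ".input.equation")
print(equation)
```

A missing key raises `KeyError` here, which is a convenient way to check a path against a sample record before configuring the loader.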

In [36]:
def metadata_func(record: dict, metadata: dict) -> dict:
    """Copy problem_id and category from the JSON record into the metadata."""
    metadata["problem_id"] = record.get("problem_id")
    metadata["category"] = record.get("category")
    return metadata

loader = JSONLoader(
    file_path="../data/json/linear_ode_auto_000.json",
    jq_schema=".",  # load the root object
    # content_key="input",  # optional: use only the 'input' section as the main text
    metadata_func=metadata_func,
    text_content=False  # page_content is the whole record serialized as JSON
)

docs = loader.load()
# docs[0].metadata includes 'problem_id': 'linear_ode_auto_000' and 'category': 'linear_ode'
docs

[Document(metadata={'source': 'C:\\Users\\devam\\RAG_pipeline_from_scratch\\data\\json\\linear_ode_auto_000.json', 'seq_num': 1, 'problem_id': 'linear_ode_auto_000', 'category': 'linear_ode'}, page_content='{"problem_id": "linear_ode_auto_000", "category": "linear_ode", "input": {"equation": "dy/dt + 1.36y = 2.65*exp(-2.1t)", "initial_conditions": {"y(0)": 1.43}, "domain": "t >= 0"}, "analysis": {"problem_type": "First-order linear ODE", "linearity": "linear", "stiffness": "non-stiff", "symbolic_solution_possible": true}, "symbolic_solution": {"method": "Integrating factor", "solution": "-3.58108108108108*exp(-2.1*t) + 5.01108108108108*exp(-1.36*t)", "verified_with": "SymPy"}, "numerical_solution": {"solver": "RK45", "time_span": [0, 5], "num_points": 200}, "validation": {"strategy": "symbolic_vs_numerical", "status": "PASS", "max_error": 0.0005199923331993261, "tolerance": 0.001}}')]
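
The output shows the loader's default metadata (`source`, `seq_num`) extended with the two fields added by `metadata_func`. The function can be exercised on its own, without any loader, to see how it behaves. In this sketch the `base` dictionary is an assumed stand-in for the default metadata, with a shortened, hypothetical `source` value.

```python
def metadata_func(record: dict, metadata: dict) -> dict:
    """Copy selected record fields into the Document metadata."""
    metadata["problem_id"] = record.get("problem_id")
    metadata["category"] = record.get("category")
    return metadata


# Assumed default metadata; the source path is shortened and hypothetical
base = {"source": "linear_ode_auto_000.json", "seq_num": 1}
record = {"problem_id": "linear_ode_auto_000", "category": "linear_ode"}

enriched = metadata_func(record, dict(base))
print(enriched)

# A record without these keys yields None values rather than raising KeyError
sparse = metadata_func({}, dict(base))
print(sparse)
```

Using `record.get(...)` keeps ingestion from failing on incomplete records, at the cost of `None` values that may need filtering before the metadata is stored.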

### Embedding part and VectorStore DB