### Data Ingestion

In [24]:
from langchain_core.documents import Document

In [25]:
doc = Document(
    page_content="this is the main text content I am using to create RAG",
    metadata = {
        "source":"example.txt",
        "pages":1,
        "author":"devam",
        "date_created":"2024-06-10"
    }
)
doc

Document(metadata={'source': 'example.txt', 'pages': 1, 'author': 'devam', 'date_created': '2024-06-10'}, page_content='this is the main text content I am using to create RAG')

In [26]:
# Create the directory that will hold the sample text files
import os
os.makedirs("../data/text_files", exist_ok=True)

In [27]:
sample_text = {
    "../data/text_files/python_intro.txt":"""Python programming introduction
    
    Python is one of the most popular programming languages in the world today. It’s known for being beginner-friendly, highly readable, and incredibly versatile. Whether you want to build websites, analyze data, or create Artificial Intelligence, Python is often the go-to choice.

Here is a foundational overview to get you started.

1. Why Choose Python?
Python's philosophy focuses on code readability and simplicity. Here's why it stands out:

Easy to Learn: The syntax looks a lot like English. It avoids the complex symbols (like ; and {}) found in languages like C++ or Java.

Interpreted: Python executes code line-by-line, which makes debugging much faster.

Huge Library Support: There are "packages" for almost everything—from scientific computing (NumPy) to web development (Django).

Community: Because it's so popular, you can find solutions to almost any problem online instantly.

2. Core Concepts
To understand Python, you need to be familiar with its basic building blocks:

Variables: Used to store information (e.g., name = "Alice").

Data Types: Python handles numbers (integers/floats), text (strings), and true/false values (booleans).

Indentation: Unlike other languages, Python uses whitespace (indentation) to define blocks of code. If your indentation is wrong, the code won't run!

Functions: Reusable blocks of code that perform a specific task.
    """
}
for filepath, content in sample_text.items():
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(content)
        
print("Sample text files created.")

Sample text files created.


In [28]:
from langchain_community.document_loaders import TextLoader
loaders = TextLoader("../data/text_files/python_intro.txt", encoding="utf-8")
document=loaders.load()
document

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='Python programming introduction\n\n    Python is one of the most popular programming languages in the world today. It’s known for being beginner-friendly, highly readable, and incredibly versatile. Whether you want to build websites, analyze data, or create Artificial Intelligence, Python is often the go-to choice.\n\nHere is a foundational overview to get you started.\n\n1. Why Choose Python?\nPython\'s philosophy focuses on code readability and simplicity. Here\'s why it stands out:\n\nEasy to Learn: The syntax looks a lot like English. It avoids the complex symbols (like ; and {}) found in languages like C++ or Java.\n\nInterpreted: Python executes code line-by-line, which makes debugging much faster.\n\nHuge Library Support: There are "packages" for almost everything—from scientific computing (NumPy) to web development (Django).\n\nCommunity: Because it\'s so popular, you can find solutions to almos

In [29]:
# Load every .txt file under the directory
from langchain_community.document_loaders import DirectoryLoader
dir_loader = DirectoryLoader(
    "../data/text_files", glob="**/*.txt", loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"}, show_progress=False,
)
documents = dir_loader.load()
documents

[Document(metadata={'source': '..\\data\\text_files\\machine_learning.txt'}, page_content='Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on building systems that learn from data to improve their performance over time, rather than being explicitly programmed for every specific task.\n\nIf Python is the engine, Machine Learning is one of the most powerful destinations you can reach with it.\n\n1. How Machine Learning Works\nAt its core, ML is about finding patterns. Instead of writing complex "if-then" rules, you feed a computer a large amount of data, and it builds a Model to make predictions or decisions.\n\nData Collection: Gathering historical information (e.g., past house prices).\n\nTraining: Feeding that data into an algorithm.\n\nThe Model: The "brain" created after training.\n\nPrediction: Giving the model new data (e.g., a house\'s square footage) to get an output (the estimated price).\n\n2. The Three Main Types of ML\nMachine Learning is gener

In [30]:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

dir_loader = DirectoryLoader(
    "../data/pdf",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True
)

documents_pdf = dir_loader.load()
documents_pdf


100%|██████████| 3/3 [06:49<00:00, 136.52s/it]
100%|██████████| 3/3 [00:02<00:00,  1.25it/s]


[Document(metadata={'producer': 'pdfTeX-1.40.17', 'creator': 'LaTeX with hyperref package', 'creationdate': '2019-05-28T00:07:51+00:00', 'author': '', 'keywords': '', 'moddate': '2019-05-28T00:07:51+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': '..\\data\\pdf\\1810.04805v2.pdf', 'total_pages': 16, 'page': 0, 'page_label': '1'}, page_content='BERT: Pre-training of Deep Bidirectional Transformers for\nLanguage Understanding\nJacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova\nGoogle AI Language\n{jacobdevlin,mingweichang,kentonl,kristout}@google.com\nAbstract\nWe introduce a new language representa-\ntion model called BERT, which stands for\nBidirectional Encoder Representations from\nTransformers. Unlike recent language repre-\nsentation models (Peters et al., 2018a; Rad-\nford et al., 2018), BERT is designed to pre-\ntrain deep bidirectional representa

In [31]:
import pathlib
from langchain_community.document_loaders import PyMuPDFLoader

# Load each PDF on its own so that a failure can be traced to a specific file
path = pathlib.Path("../data/pdf")
for pdf_file in path.glob("**/*.pdf"):
    try:
        print(f"Processing: {pdf_file}")
        loader = PyMuPDFLoader(str(pdf_file))
        loader.load()  # result is discarded; this cell only checks that the file parses
    except Exception as e:
        print(f"!!! FAILED to load {pdf_file}: {e}")

Processing: ..\data\pdf\1810.04805v2.pdf
Processing: ..\data\pdf\2010.11929v2.pdf
Processing: ..\data\pdf\attention_is_all_you_need.pdf


In [32]:
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path="../data/json/linear_ode_auto_000.json",
    jq_schema=".input.equation",
    text_content=True
)

In [33]:
loader1 = loader.load()  # despite the name, this is a list of Document objects
loader1

[Document(metadata={'source': 'C:\\Users\\devam\\RAG_pipeline_from_scratch\\data\\json\\linear_ode_auto_000.json', 'seq_num': 1}, page_content='dy/dt + 1.36y = 2.65*exp(-2.1t)')]

In [34]:
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["problem_id"] = record.get("problem_id")
    metadata["category"] = record.get("category")
    return metadata

loader = JSONLoader(
    file_path="../data/json/linear_ode_auto_000.json",
    jq_schema=".", # Load the root object
    # content_key="input", # Use the 'input' section as the main text
    metadata_func=metadata_func,
    text_content=False
)

docs = loader.load()
# print(docs[0].metadata) # Result: {'problem_id': 'linear_ode_auto_000', ...}
docs

[Document(metadata={'source': 'C:\\Users\\devam\\RAG_pipeline_from_scratch\\data\\json\\linear_ode_auto_000.json', 'seq_num': 1, 'problem_id': 'linear_ode_auto_000', 'category': 'linear_ode'}, page_content='{"problem_id": "linear_ode_auto_000", "category": "linear_ode", "input": {"equation": "dy/dt + 1.36y = 2.65*exp(-2.1t)", "initial_conditions": {"y(0)": 1.43}, "domain": "t >= 0"}, "analysis": {"problem_type": "First-order linear ODE", "linearity": "linear", "stiffness": "non-stiff", "symbolic_solution_possible": true}, "symbolic_solution": {"method": "Integrating factor", "solution": "-3.58108108108108*exp(-2.1*t) + 5.01108108108108*exp(-1.36*t)", "verified_with": "SymPy"}, "numerical_solution": {"solver": "RK45", "time_span": [0, 5], "num_points": 200}, "validation": {"strategy": "symbolic_vs_numerical", "status": "PASS", "max_error": 0.0005199923331993261, "tolerance": 0.001}}')]
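
The two `JSONLoader` calls above differ mainly in what `jq_schema` selects. As a rough illustration in plain Python (no `jq` or LangChain involved), `.input.equation` corresponds to walking nested keys, while `.` returns the whole record. The `select` helper below is hypothetical and handles only simple dotted paths; the record is a trimmed copy of the output shown above.

```python
import json

# Trimmed copy of the record shown in the output above
raw = '{"problem_id": "linear_ode_auto_000", "category": "linear_ode", "input": {"equation": "dy/dt + 1.36y = 2.65*exp(-2.1t)", "domain": "t >= 0"}}'
record = json.loads(raw)

def select(obj, path):
    """Hypothetical helper: follow a simple dotted path such as '.input.equation'.

    Covers only plain key lookups, not the full jq language.
    """
    for key in filter(None, path.split(".")):
        obj = obj[key]
    return obj

equation = select(record, ".input.equation")  # a single string field
whole = select(record, ".")                   # the root object itself

print(equation)
print(sorted(whole.keys()))
```

This is why the first loader yields only the equation as `page_content`, while the second yields the full JSON record and relies on `metadata_func` to lift `problem_id` and `category` into the metadata.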

### Chunking

In [35]:
# Creating Data Chunks 

from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_documents(documents, chunk_size=1000, chunk_overlap=200):
    """
    Split documents into smaller chunks for better RAG performance.
    
    Parameters:
    - documents: List of LangChain Document objects to split
    - chunk_size: Maximum characters per chunk (adjust based on your LLM)
    - chunk_overlap: Characters to overlap between chunks (preserves context)
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,  # Upper bound on chunk length, in characters
        chunk_overlap=chunk_overlap,  # Characters shared between neighboring chunks
        length_function=len,  # Measure length in characters
        separators=["\n\n", "\n", " ", ""]  # Try paragraphs, then lines, then words, then characters
    )
    # Split the documents; each chunk keeps the metadata of its source document
    split_docs = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(split_docs)} chunks")
    
    # Show what a chunk looks like
    if split_docs:
        print("\nExample chunk:")
        print(f"Content: {split_docs[0].page_content[:200]}...")
        print(f"Metadata: {split_docs[0].metadata}")
    
    return split_docs

chunks = split_documents(documents_pdf)
chunks

Split 53 documents into 223 chunks

Example chunk:
Content: BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding
Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout...
Metadata: {'producer': 'pdfTeX-1.40.17', 'creator': 'LaTeX with hyperref package', 'creationdate': '2019-05-28T00:07:51+00:00', 'author': '', 'keywords': '', 'moddate': '2019-05-28T00:07:51+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': '..\\data\\pdf\\1810.04805v2.pdf', 'total_pages': 16, 'page': 0, 'page_label': '1'}


[Document(metadata={'producer': 'pdfTeX-1.40.17', 'creator': 'LaTeX with hyperref package', 'creationdate': '2019-05-28T00:07:51+00:00', 'author': '', 'keywords': '', 'moddate': '2019-05-28T00:07:51+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': '..\\data\\pdf\\1810.04805v2.pdf', 'total_pages': 16, 'page': 0, 'page_label': '1'}, page_content='BERT: Pre-training of Deep Bidirectional Transformers for\nLanguage Understanding\nJacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova\nGoogle AI Language\n{jacobdevlin,mingweichang,kentonl,kristout}@google.com\nAbstract\nWe introduce a new language representa-\ntion model called BERT, which stands for\nBidirectional Encoder Representations from\nTransformers. Unlike recent language repre-\nsentation models (Peters et al., 2018a; Rad-\nford et al., 2018), BERT is designed to pre-\ntrain deep bidirectional representa
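
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character windows with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the separators listed above); the function name `sliding_chunks` is made up for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighboring windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

With a window of 10 and an overlap of 4, the last four characters of each piece reappear at the start of the next one, which is the context-preserving effect the docstring describes.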

### Embeddings and VectorStore DB

In [36]:
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity

In [37]:
class EmbeddingManager:
    """Handles document embedding generation using SentenceTransformers"""
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        """
        Initialize the embedding manager.
        
        Args:
            model_name (str, optional): Name of the SentenceTransformer model to load. Defaults to "all-MiniLM-L6-v2".
        """
        self.model_name = model_name
        self.model = None
        self._load_model()
        
    def _load_model(self):
        """Load the SentenceTransformer model"""
        try:
            print(f"Loading embedding model: {self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded successfully. Embedding dimension: {self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"Error loading model {self.model_name}: {e}")
            raise
        
    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
        """
        Generate embeddings for a list of texts.

        Args:
            texts: List of text strings to embed
        Returns:
            numpy array of embeddings with shape (len(texts), embedding_dim)
        """
        if not self.model:
            raise ValueError("Embedding model is not loaded.")
        print(f"Generating embeddings for {len(texts)} texts...")
        embeddings = self.model.encode(texts, show_progress_bar=True)
        print(f"Generated embeddings with shape: {embeddings.shape}")
        return embeddings
    
    # def get_embedding_dimension(self) -> int:
    #     """Get the embedding dimension of the model
    #     """
    #     if not self.model:
    #         raise ValueError("Embedding model is not loaded.")
    #     return self.model.get_sentence_embedding_dimension()
    
embedding_manager = EmbeddingManager()
embedding_manager

Loading embedding model: all-MiniLM-L6-v2
Model loaded successfully. Embedding dimension: 384


<__main__.EmbeddingManager at 0x24f058bd7d0>
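
`cosine_similarity` is imported above but not used in this section. For reference, here is a minimal pure-Python sketch of the quantity it computes for a pair of vectors. The three-dimensional vectors are toy values, not real model output (the model loaded above produces 384 dimensions), and raising on a zero vector is a choice made for this sketch.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors, made up for illustration
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]
orthogonal = [0.0, 3.0, 0.0]

print(cosine(query, same_direction))  # close to 1: same direction, different length
print(cosine(query, orthogonal))      # 0: no shared direction
```

Because the score depends only on direction, two texts with similar meaning can score highly even when their embedding vectors differ in magnitude.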

### VectorStore

In [38]:
class VectorStore:
    """Manages document embeddings in a ChromaDB vector store."""
    def __init__(self, collection_name: str = "pdf_documents", persist_directory: str = "../data/vector_store"):
        """Initialize the vector store.

        Args:
            collection_name (str, optional): Name of the ChromaDB collection.
            persist_directory (str, optional): Directory to persist the vector store.
        """
        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()
        
    def _initialize_store(self):
        """Initialize ChromaDB client and collection"""
        try:
            # Create persistent ChromaDB client
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path=self.persist_directory)
            
            #get or create collection
            self.collection = self.client.get_or_create_collection(
                name=self.collection_name,
                metadata={"description":"PDF document embeddings"}
            )
            print(f"Vector store initialized with collection: {self.collection_name}")
            print(f"Existing documents in collection: {self.collection.count()}")
        except Exception as e:
            print(f"Error initializing vector store: {e}")
            raise
        
    def add_documents(self, documents: List[Any], embeddings: np.ndarray):
        """Add documents and their embeddings to the vector store

        Args:
            documents (List[Any]): List of document objects with metadata
            embeddings (np.ndarray): Corresponding embeddings for the documents
        """
        if len(documents) != embeddings.shape[0]:
            raise ValueError("Number of documents and embeddings must match.")
        print(f"Adding {len(documents)} documents to the vector store...")
        ids = []
        metadatas = []
        documents_texts = []
        embedding_lists = []
        
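        for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
            # Random ID per chunk: re-running this method adds the same chunks again under new IDs
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)
            # Copy the metadata so the original Document is not modified
            metadata = dict(doc.metadata)
            metadata['doc_index'] = i
            metadata['content_length'] = len(doc.page_content)
            metadatas.append(metadata)
            documents_texts.append(doc.page_content)
            # ChromaDB expects plain lists rather than numpy arrays
            embedding_lists.append(embedding.tolist())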
        try:
            self.collection.add(
                ids = ids,
                embeddings = embedding_lists,
                metadatas = metadatas,
                documents= documents_texts
            )
            print(f"Successfully added {len(documents)} documents.")
            print(f"Total documents in collection now: {self.collection.count()}")
            
        except Exception as e:
            print(f"Error adding documents to vector store: {e}")
            raise
        
        
vector_store = VectorStore()
vector_store

Vector store initialized with collection: pdf_documents
Existing documents in collection: 3


<__main__.VectorStore at 0x24f05d56a10>

In [39]:
# Extract the text of each chunk
texts = [doc.page_content for doc in chunks]

# Generate embeddings
embeddings = embedding_manager.generate_embeddings(texts)

# Store the chunks and their embeddings in the vector store
vector_store.add_documents(chunks, embeddings)

Generating embeddings for 223 texts...


Batches: 100%|██████████| 7/7 [00:03<00:00,  2.29it/s]


Generated embeddings with shape: (223, 384)
Adding 223 documents to the vector store...
Successfully added 223 documents.
Total documents in collection now: 226
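
The collection already held 3 documents before this run, and `add_documents` generates a random ID for every chunk, so running the cell again would add the same 223 chunks a second time. One possible way around that, sketched here as an assumption rather than something the notebook does, is to derive each ID from the chunk itself. The helper `stable_chunk_id` and the choice of fields are hypothetical.

```python
import hashlib

def stable_chunk_id(source, page, content):
    """Hypothetical helper: the same chunk always maps to the same ID."""
    key = f"{source}|{page}|{content}".encode("utf-8")
    return "doc_" + hashlib.sha256(key).hexdigest()[:16]

title = "BERT: Pre-training of Deep Bidirectional Transformers"
id_a = stable_chunk_id("1810.04805v2.pdf", 0, title)
id_b = stable_chunk_id("1810.04805v2.pdf", 0, title)
id_c = stable_chunk_id("1810.04805v2.pdf", 1, title)

print(id_a == id_b)  # identical input gives an identical ID
print(id_a == id_c)  # a different page gives a different ID
```

With stable IDs, ChromaDB's `collection.upsert(...)` could take the place of `collection.add(...)` so that a re-run overwrites existing entries instead of duplicating them.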
