# Data Ingestion

## 📚 Dataset Description

### 📌 Dataset Overview

The dataset used in this RAG-based Academic Study Assistant consists of structured Machine Learning course material organized into chapter-wise PDF documents.

The dataset contains the following files:

1. **Chapter 1 – Introduction to Machine Learning**
2. **Chapter 2 – Simple Linear Regression**
3. **Chapter 3 – Logistic Regression**
4. **Chapter 4 – Decision Trees**
5. **Chapter 5 – Neural Networks**
6. **Chapter 6 – Overfitting & Regularization**
7. **Chapter 7 – Bias-Variance Tradeoff**
8. **Chapter 8 – Gradient Descent**
9. **Chapter 9 – Evaluation Metrics**

---

### 📊 Dataset Characteristics

- **Format:** PDF  
- **Type:** Academic lecture notes / textbook chapters  
- **Domain:** Machine Learning  
- **Content Includes:**
  - Theoretical explanations
  - Mathematical derivations
  - Regression equations
  - Optimization formulas
  - Algorithm descriptions
  - Conceptual diagrams (if present)
- **Complexity Level:** Undergraduate / Early Postgraduate ML coursework  

---

### 🧠 Nature of the Content

The dataset includes:

- Mathematical equations (e.g., least squares, gradient descent updates)
- Statistical notations
- Model definitions
- Algorithm steps
- Conceptual explanations
- Performance metrics and evaluation formulas

Some chapters (e.g., Regression, Neural Networks, Gradient Descent) contain heavy mathematical notation and symbolic expressions.

---

### ⚠ Challenges Identified in the Dataset

1. Mathematical equations may appear in symbolic or multi-line format.
2. PDF formatting can break equations during text extraction.
3. Technical terms and formula-heavy sections require careful chunking.
4. Certain chapters are concept-heavy, while others are equation-heavy.

These challenges are addressed later in this notebook through equation normalization (`preprocess_equations`) and two chunking strategies: fixed-size chunking with overlap and sentence-based chunking.

---

### 🎯 Purpose of Using This Dataset

This dataset is used to build and evaluate a Retrieval-Augmented Generation (RAG) based academic assistant capable of:

- Answering conceptual ML questions
- Explaining mathematical formulas
- Retrieving relevant theory sections
- Providing context-aware explanations

---

### 📈 Why This Dataset Is Suitable for RAG Evaluation

This dataset is well suited to evaluating RAG performance because:

- It contains both conceptual and mathematical content.
- It spans multiple interconnected ML topics.
- It allows evaluation of retrieval performance across:
  - Definition-based questions
  - Derivation-based questions
  - Algorithm explanation queries
  - Formula-based queries
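
One simple way to quantify retrieval performance across these query types is a hit rate at k: the fraction of labelled questions whose expected source file appears among the top-k retrieved chunks. The sketch below is only an illustration. The function `hit_rate_at_k`, the stand-in retriever `fake_retrieve`, and the file names are hypothetical and not part of this notebook.

```python
from typing import Callable, Dict, List


def hit_rate_at_k(
    labelled_queries: Dict[str, str],
    retrieve_sources: Callable[[str, int], List[str]],
    k: int = 3,
) -> float:
    """Fraction of queries whose expected source appears in the top-k results."""
    if not labelled_queries:
        return 0.0
    hits = 0
    for query, expected_source in labelled_queries.items():
        top_sources = retrieve_sources(query, k)[:k]
        if expected_source in top_sources:
            hits += 1
    return hits / len(labelled_queries)


# Hypothetical stand-in for a real retriever: returns canned source lists.
def fake_retrieve(query: str, k: int) -> List[str]:
    canned = {
        "What is machine learning?": ["chapter1.pdf", "chapter5.pdf"],
        "Derive the least squares estimate": ["chapter3.pdf", "chapter8.pdf"],
    }
    return canned.get(query, [])[:k]


# Hypothetical labels: each question is mapped to the file that should answer it.
labelled = {
    "What is machine learning?": "chapter1.pdf",
    "Derive the least squares estimate": "chapter2.pdf",
}

score = hit_rate_at_k(labelled, fake_retrieve, k=2)
print(score)
```

Grouping the labelled questions by type (definition, derivation, algorithm, formula) and computing the score per group would show which kind of query the retriever handles worst.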

### Document Data Structure

In [1]:
from langchain_core.documents import Document

In [60]:
doc=Document(
    page_content="This is the main text content I am using to create the RAG pipeline",
    metadata={
        "source": "example",
        "page": 1,
        "author": "Bhushan",
        "date_created": "2025-04-05"
    }
)
doc

Document(metadata={'source': 'example', 'page': 1, 'author': 'Bhushan', 'date_created': '2025-04-05'}, page_content='This is the main text content I am using to create the RAG pipeline')

In [18]:
# Create the directory that will hold the sample text files
import os
os.makedirs("../data/text_files", exist_ok=True)

In [4]:
sample_text = {
    "../data/text_files/python_intro.txt" : """Python Programming Introduction
     
    Python is a high-level, interpreted programming language known for its simplicity and readability.
    Created by Guido van Rossum and first released in 1991, Python has become one of the most popular 
    programming languages in the world.

    Key Features:
    - Easy to learn and use
    - Extensive standard library
    - Cross-platform compatibility
    - Strong community support

    Python is widely used in web development, data science, artificial intelligence, and automation.""",

    "../data/text_files/machine_learning.txt" : """ Machine Learning Basics
    
    Machine learning is a subset of artificial intelligence that enables systems to learn and improve 
    from experience without being explicitly programmed. It focuses on developing computer programs
    that can access data and use it to learn for themselves.

    Types of Machine Learning:
    1. Supervised Learning: Learning with labeled data
    2. Unsupervised Learning: Finding patterns in unlabeled data
    3. Reinforcement Learning: Learning through rewards and penalties

    Applications include image recognition, speech processing, and recommendation systems
    """
}

for filepath,content in sample_text.items():
    with open(filepath, 'w', encoding="utf-8") as f:
        f.write(content)

print("✅ Sample text files created!")

✅ Sample text files created!


### Text Loader

In [30]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/python_intro.txt",encoding="utf-8")
document = loader.load()
print(document)

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='Python Programming Introduction\n\n    Python is a high-level, interpreted programming language known for its simplicity and readability.\n    Created by Guido van Rossum and first released in 1991, Python has become one of the most popular \n    programming languages in the world.\n\n    Key Features:\n    - Easy to learn and use\n    - Extensive standard library\n    - Cross-platform compatibility\n    - Strong community support\n\n    Python is widely used in web development, data science, artificial intelligence, and automation.')]


### Directory Loader

In [13]:
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import DirectoryLoader

dir_loader = DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt",
    loader_cls= TextLoader,
    loader_kwargs={'encoding': 'utf-8'},
    show_progress=False 
)

documents = dir_loader.load()
documents

[Document(metadata={'source': '..\\data\\text_files\\machine_learning.txt'}, page_content=' Machine Learning Basics\n\n    Machine learning is a subset of artificial intelligence that enables systems to learn and improve \n    from experience without being explicitly programmed. It focuses on developing computer programs\n    that can access data and use it to learn for themselves.\n\n    Types of Machine Learning:\n    1. Supervised Learning: Learning with labeled data\n    2. Unsupervised Learning: Finding patterns in unlabeled data\n    3. Reinforcement Learning: Learning through rewards and penalties\n\n    Applications include image recognition, speech processing, and recommendation systems\n    '),
 Document(metadata={'source': '..\\data\\text_files\\python_intro.txt'}, page_content='Python Programming Introduction\n\n    Python is a high-level, interpreted programming language known for its simplicity and readability.\n    Created by Guido van Rossum and first released in 1991, 

In [2]:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader, PyMuPDFLoader
dir_loader = DirectoryLoader(
    "../data/pdf",
    glob="**/*.pdf",
    loader_cls= PyMuPDFLoader,
    show_progress=False 
)

pdf_documents = dir_loader.load()



In [9]:
pdf_documents

[Document(metadata={'producer': '', 'creator': '', 'creationdate': '', 'source': '..\\data\\pdf\\Chapter 1 - Intro to ML.pdf', 'file_path': '..\\data\\pdf\\Chapter 1 - Intro to ML.pdf', 'total_pages': 120, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': '', 'page': 0}, page_content='MACHINE LEARNING  \n[R17A0534] \nLECTURE NOTES \n \nB.TECH IV YEAR – I SEM(R17) \n(2020-21) \n \n \n \n \n \n \nDEPARTMENT OF \nCOMPUTER SCIENCE AND ENGINEERING \nMALLA REDDY COLLEGE OF ENGINEERING & \nTECHNOLOGY \n(Autonomous Institution – UGC, Govt. of India) \nRecognized under 2(f) and 12 (B) of UGC ACT 1956 \n(Affiliated to JNTUH, Hyderabad, Approved by AICTE - Accredited by NBA & NAAC – ‘A’ Grade - ISO 9001:2015 Certified) \nMaisammaguda, Dhulapally (Post Via. Hakimpet), Secunderabad – 500100, Telangana State, India'),
 Document(metadata={'producer': '', 'creator': '', 'creationdate': '', 'source': '

### Equation Preprocessing Function

In [4]:
import re

def preprocess_equations(text: str) -> str:
    """
    Normalize common math symbols and LaTeX expressions
    to improve embedding quality.
    """

    # Greek letters
    replacements = {
        r"\\theta": "theta",
        r"\\alpha": "alpha",
        r"\\beta": "beta",
        r"\\gamma": "gamma",
        r"\\lambda": "lambda",

        r"\^T": " transpose ",
        r"\^2": " squared ",
        r"\^3": " cubed ",

        r"=": " equals ",
        r"\+": " plus ",
        r"-": " minus ",
        r"\*": " times ",
        r"/": " divided by ",
    }

    for pattern, replacement in replacements.items():
        text = re.sub(pattern, replacement, text)

    # Remove excessive whitespace
    text = re.sub(r"\s+", " ", text)

    return text.strip()

In [29]:
pdf_documents = dir_loader.load()

for doc in pdf_documents:
    doc.page_content = preprocess_equations(doc.page_content)

pdf_documents

[Document(metadata={'source': '..\\data\\text_files\\machine_learning.txt'}, page_content='Machine Learning Basics Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing computer programs that can access data and use it to learn for themselves. Types of Machine Learning: 1. Supervised Learning: Learning with labeled data 2. Unsupervised Learning: Finding patterns in unlabeled data 3. Reinforcement Learning: Learning through rewards and penalties Applications include image recognition, speech processing, and recommendation systems'),
 Document(metadata={'source': '..\\data\\text_files\\python_intro.txt'}, page_content='Python Programming Introduction Python is a high minus level, interpreted programming language known for its simplicity and readability. Created by Guido van Rossum and first released in 1991, Python has become one of the most popular programming langua
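
The normalized output above contains `high minus level`: the `-` pattern in `preprocess_equations` matches every hyphen, not only subtraction signs, and the same applies to `/`, `=`, `+` and `*` in ordinary prose. One possible refinement, sketched here under the assumption that operators in extracted formulas are surrounded by whitespace, is to rewrite an operator only when it stands alone between spaces. The function name `normalize_operators` is hypothetical and not part of the notebook.

```python
import re


def normalize_operators(text: str) -> str:
    """Spell out arithmetic operators only where they stand alone between spaces."""
    # Each pattern requires whitespace on both sides, so hyphenated words
    # such as "high-level" and paths such as "a/b" are left untouched.
    spaced_operators = {
        r"(?<=\s)=(?=\s)": "equals",
        r"(?<=\s)\+(?=\s)": "plus",
        r"(?<=\s)-(?=\s)": "minus",
        r"(?<=\s)\*(?=\s)": "times",
        r"(?<=\s)/(?=\s)": "divided by",
    }
    for pattern, word in spaced_operators.items():
        text = re.sub(pattern, word, text)
    # Collapse runs of whitespace, as the original function does
    return re.sub(r"\s+", " ", text).strip()


print(normalize_operators("Python is a high-level language"))
print(normalize_operators("y = w * x - b"))
```

The trade-off is that tightly written expressions such as `a-b` are no longer rewritten, so this variant favors leaving prose intact over catching every formula.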

### Fixed-size chunking

In [30]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
def chunk_documents(documents, chunk_size=500, chunk_overlap=100):
    """
    Split documents into smaller chunks for better RAG retrieval.
    """

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ".", " ", ""]
    )

    chunked_docs = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunked_docs)} chunks")

    if chunked_docs:
        print(f"\n Example Chunk:")
        print(f"Content: {chunked_docs[0].page_content[:200]}...")
        print(f"Metadata: {chunked_docs[0].metadata}")

    return chunked_docs

In [31]:
chunks = chunk_documents(pdf_documents)
chunks

Split 2 documents into 4 chunks

 Example Chunk:
Content: Machine Learning Basics Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing ...
Metadata: {'source': '..\\data\\text_files\\machine_learning.txt'}


[Document(metadata={'source': '..\\data\\text_files\\machine_learning.txt'}, page_content='Machine Learning Basics Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing computer programs that can access data and use it to learn for themselves. Types of Machine Learning: 1. Supervised Learning: Learning with labeled data 2. Unsupervised Learning: Finding patterns in unlabeled data 3'),
 Document(metadata={'source': '..\\data\\text_files\\machine_learning.txt'}, page_content='. Unsupervised Learning: Finding patterns in unlabeled data 3. Reinforcement Learning: Learning through rewards and penalties Applications include image recognition, speech processing, and recommendation systems'),
 Document(metadata={'source': '..\\data\\text_files\\python_intro.txt'}, page_content='Python Programming Introduction Python is a high minus level, interpreted programming language kn
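
The second chunk above starts with text that also ends the first chunk; that repetition is what `chunk_overlap` produces. The sketch below shows only the overlap mechanics with a plain character window. It is not how `RecursiveCharacterTextSplitter` works internally (the splitter prefers to cut at the listed separators), and the function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows where consecutive windows share characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


pieces = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(pieces)
```

Each window repeats the last two characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.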

### Sentence-Based Chunking

In [47]:
import nltk
from nltk.tokenize import sent_tokenize
from langchain_core.documents import Document

nltk.download('punkt')

def sentence_based_chunking(documents, max_chunk_size=800):
    """
    Chunk documents by preserving sentence boundaries.
    Sentences are grouped until max_chunk_size is reached.
    """
    chunked_docs = []

    for doc in documents:
        sentences = sent_tokenize(doc.page_content)
        current_chunk = ""
        
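        for sentence in sentences:
            if len(current_chunk) + len(sentence) <= max_chunk_size:
                current_chunk += " " + sentence
            else:
                # Flush the chunk built so far; skip it when empty, which
                # happens if the very first sentence exceeds max_chunk_size
                if current_chunk.strip():
                    chunked_docs.append(
                        Document(
                            page_content=current_chunk.strip(),
                            metadata=doc.metadata
                        )
                    )
                current_chunk = sentence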
        
        if current_chunk:
            chunked_docs.append(
                Document(
                    page_content=current_chunk.strip(),
                    metadata=doc.metadata
                )
            )

    print(f"Sentence-based chunking created {len(chunked_docs)} chunks")
    return chunked_docs

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SHIVA\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [48]:
sentence_chunks = sentence_based_chunking(pdf_documents, max_chunk_size=800)
sentence_chunks

Sentence-based chunking created 2 chunks


[Document(metadata={'source': '..\\data\\text_files\\machine_learning.txt'}, page_content='Machine Learning Basics Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing computer programs that can access data and use it to learn for themselves. Types of Machine Learning: 1. Supervised Learning: Learning with labeled data 2. Unsupervised Learning: Finding patterns in unlabeled data 3. Reinforcement Learning: Learning through rewards and penalties Applications include image recognition, speech processing, and recommendation systems'),
 Document(metadata={'source': '..\\data\\text_files\\python_intro.txt'}, page_content='Python Programming Introduction Python is a high minus level, interpreted programming language known for its simplicity and readability. Created by Guido van Rossum and first released in 1991, Python has become one of the most popular programming langua

### Embedding and Vector Store DB

In [49]:
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity

In [50]:
class EmbeddingManager:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model_name = model_name
        self.model = None
        self._load_model()

    def _load_model(self):
        try:
            print(f"Loading embedding model: {self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded successfully. Embedding dimension: {self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"Error loading model {self.model_name}: {e}")
            raise
    
    def generate_embeddings(self, texts: List[str]) -> np.ndarray:

        if not self.model:
            raise ValueError("Model not loaded")
        
        print(f"Generating embeddings for {len(texts)} texts...")
        embeddings = self.model.encode(texts, show_progress_bar=True)
        print(f"Generated embeddings with shape: {embeddings.shape}")
        return embeddings
    

embedding_manager = EmbeddingManager()
embedding_manager

Loading embedding model: all-MiniLM-L6-v2


Loading weights: 100%|██████████| 103/103 [00:00<00:00, 719.94it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Model loaded successfully. Embedding dimension: 384


<__main__.EmbeddingManager at 0x225b24c10d0>
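
Retrieval compares the query embedding with the stored chunk embeddings, and cosine similarity is a common measure for that comparison. The following standard-library sketch shows the computation on small hand-written vectors rather than real 384-dimensional embeddings. The helper name `cosine_sim` is hypothetical; the notebook itself imports scikit-learn's `cosine_similarity` for the same purpose.

```python
import math
from typing import Sequence


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


same_direction = cosine_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 regardless of their length, and unrelated (orthogonal) vectors score 0.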

### Vector Store

In [51]:
class VectorStore:
    def __init__(self, collection_name: str = "pdf_documents", persist_directory: str = "../data/vector_store"):

        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()
    
    def _initialize_store(self):
        try:
            #create ChromaDB client
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path=self.persist_directory)

            # Get or create collection
            self.collection = self.client.get_or_create_collection(
                name=self.collection_name,
                metadata={"description" : "PDF document embeddings for RAG"}
            )
            print(f"Vector store initialized. Collection: {self.collection_name}")
            print(f"Existing documents in collection: {self.collection.count()}")

        except Exception as e:
            print(f"Error initializing vector store: {e}")
            raise
        
    def add_documents(self, documents: List[Any], embeddings: np.ndarray):
        if len(documents) != len(embeddings):
            raise ValueError("Number of documents must match number of embeddings")
        
        print(f"Adding {len(documents)} documents to vector store...")

        #data for chromaDB
        ids = []
        metadatas = []
        documents_text = []
        embeddings_list = []

        for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
            #unique Id
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)

            # Prepare metadata
            metadata = dict(doc.metadata)
            metadata['doc_index'] = i
            metadata['content_length'] = len(doc.page_content)
            metadatas.append(metadata)

            # Document content
            documents_text.append(doc.page_content)

            # Embedding
            embeddings_list.append(embedding.tolist())

        # Add to collection
        try:
            self.collection.add(
                ids=ids,
                embeddings=embeddings_list,
                metadatas=metadatas,
                documents=documents_text
            )
            print(f"Successfully added {len(documents)} documents to vector store")
            print(f"Total documents in collection: {self.collection.count()}")

        except Exception as e:
            print(f"Error adding documents to vector store: {e}")
            raise

vectorStore_ = VectorStore()
vectorStore_

Vector store initialized. Collection: pdf_documents
Existing documents in collection: 5532


<__main__.VectorStore at 0x225b2455790>

In [37]:
### convert the text to embeddings
texts=[doc.page_content for doc in chunks]

### Generate the embedding
embeddings = embedding_manager.generate_embeddings(texts)

### Store in vector
vectorStore_.add_documents(chunks, embeddings)

Generating embeddings for 4 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 11.74it/s]

Generated embeddings with shape: (4, 384)
Adding 4 documents to vector store...
Successfully added 4 documents to vector store
Total documents in collection: 5532
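
Because `add_documents` builds every ID from a random `uuid4` fragment, running the ingestion cell again stores the same chunks under new IDs. One way to make re-runs idempotent, sketched below as an assumption rather than as the notebook's method, is to derive the ID from the chunk itself. The helper `stable_chunk_id` is hypothetical. Chroma collections also expose an `upsert` method that takes the same arguments as `add`, which would pair naturally with such IDs.

```python
import hashlib


def stable_chunk_id(source: str, page: int, content: str) -> str:
    """Derive a deterministic ID from where a chunk came from and what it says."""
    key = f"{source}|{page}|{content}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # 16 hex characters keep the ID short while making collisions very unlikely
    return f"doc_{digest[:16]}"


first = stable_chunk_id("chapter1.pdf", 0, "Machine learning is ...")
again = stable_chunk_id("chapter1.pdf", 0, "Machine learning is ...")
other = stable_chunk_id("chapter1.pdf", 1, "Machine learning is ...")
print(first == again, first == other)
```

The same chunk always maps to the same ID, so a second ingestion run would overwrite existing entries instead of duplicating them.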





### Retriever Pipeline From VectorStore

In [52]:
class RAGRetriever:
    def __init__(self, vector_store: VectorStore, embedding_manager: EmbeddingManager):
        self.vector_store = vector_store
        self.embedding_manager = embedding_manager

    def retrieve(self, query: str, top_k: int = 5, score_threshold: float = 0.0) -> List[Dict[str, Any]]:

        print(f"Retrieving documents for query: '{query}'")
        print(f"Top K: {top_k}, Score threshold: {score_threshold}")

        # Generate the query embedding
        query_embedding = self.embedding_manager.generate_embeddings([query])[0]

        # Search in vector store
        try:
            results = self.vector_store.collection.query(
                query_embeddings=[query_embedding.tolist()],
                n_results=top_k
            )

            # Process results
            retrieved_docs = []

            if results['documents'] and results['documents'][0]:
                documents = results['documents'][0]
                metadatas = results['metadatas'][0]
                distances = results['distances'][0]
                ids = results['ids'][0]

                for i, (doc_id, document, metadata, distance) in enumerate(zip(ids, documents, metadatas, distances)):
                    similarity_score = 1 - distance

                    if similarity_score >= score_threshold:
                        retrieved_docs.append({
                            'id': doc_id,
                            'content': document,
                            'metadata': metadata,
                            'similarity_score': similarity_score,
                            'distance': distance,
                            'rank': i+1
                        })

                        print(f"Retrieved {len(retrieved_docs)} documents (after filtering)")
                    
            else:
                print("No documents found")

            return retrieved_docs
        
        except Exception as e:
            print(f"Error during retrieval: {e}")
            return []
                
rag_retriever = RAGRetriever(vectorStore_, embedding_manager)

In [53]:
rag_retriever

<__main__.RAGRetriever at 0x225c1d43350>
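
A caveat on `similarity_score = 1 - distance` in `retrieve`: that expression equals cosine similarity only when the collection measures cosine distance. As far as the Chroma documentation describes, a collection uses squared L2 distance unless its metadata sets `hnsw:space` to `cosine`; treat that default as an assumption to check against the installed version. For unit-length embeddings the two are related by `squared_l2 = 2 - 2 * cos`, which the standard-library sketch below verifies on toy vectors. All function names here are hypothetical.

```python
import math
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_from_squared_l2(distance: float) -> float:
    """Valid for unit-length vectors only: ||a - b||^2 = 2 - 2*cos(a, b)."""
    return 1.0 - distance / 2.0


a = [1.0, 0.0]  # two orthogonal unit vectors
b = [0.0, 1.0]
d = squared_l2(a, b)
naive = 1 - d                              # what `1 - distance` would report
corrected = similarity_from_squared_l2(d)  # matches the true cosine
print(d, naive, corrected, cosine(a, b))
```

With squared L2 distances, `1 - distance` can go negative for unrelated chunks, which matters once `score_threshold` is set above zero.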

In [40]:
rag_retriever.retrieve("What is Machine Learning")

Retrieving documents for query: 'What is Machine Learning'
Top K: 5, Score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 93.48it/s]

Generated embeddings with shape: (1, 384)
Retrieved 1 documents (after filtering)
Retrieved 2 documents (after filtering)
Retrieved 3 documents (after filtering)
Retrieved 4 documents (after filtering)
Retrieved 5 documents (after filtering)





[{'id': 'doc_3f0a412a_15',
  'content': '1 \n \nUNIT I  \nIntroduction to Machine Learning \n1. Introduction \n \n1.1 What Is Machine Learning?  \nMachine learning is programming computers to optimize a performance criterion using example \ndata or past experience. We have a model defined up to some parameters, and learning is the \nexecution of a computer program to optimize the parameters of the model using the training data or \npast experience. The model may be predictive to make predictions in the future, or descriptive to gain',
  'metadata': {'source': '..\\data\\pdf\\Chapter 1 - Intro to ML.pdf',
   'format': 'PDF 1.4',
   'modDate': '',
   'moddate': '',
   'creationdate': '',
   'creationDate': '',
   'content_length': 487,
   'page': 5,
   'total_pages': 120,
   'title': '',
   'file_path': '..\\data\\pdf\\Chapter 1 - Intro to ML.pdf',
   'trapped': '',
   'doc_index': 15,
   'keywords': '',
   'creator': '',
   'author': '',
   'producer': '',
   'subject': ''},
  'similari

In [41]:
rag_retriever.retrieve("Multi Layer Artificial Neural Networks")

Retrieving documents for query: 'Multi Layer Artificial Neural Networks'
Top K: 5, Score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 63.47it/s]

Generated embeddings with shape: (1, 384)
Retrieved 1 documents (after filtering)
Retrieved 2 documents (after filtering)
Retrieved 3 documents (after filtering)
Retrieved 4 documents (after filtering)
Retrieved 5 documents (after filtering)





[{'id': 'doc_5806b333_1013',
  'content': 'efforts (Werbos 1974, Parker 1982, and Rumelhart, Hinton and Williams 1986) \nprovides a systematic means for training multi-layer networks, thereby overcoming \nlimitations presented by Minsky. \n1.3.0 Characteristics of ANN \nArtificial neural networks are biologically inspired; that is, they are composed of \nelements that perform in a manner that is analogous to the most elementary functions of the \nbiological neuron. The important characteristics of artificial neural networks are learning',
  'metadata': {'subject': '',
   'creator': 'Microsoft® Word for Office 365',
   'author': 'Harish  Balaga',
   'content_length': 493,
   'title': 'Introduction to ANN',
   'creationdate': '2019-07-02T13:53:30+05:30',
   'format': 'PDF 1.7',
   'keywords': '',
   'total_pages': 36,
   'page': 2,
   'source': '..\\data\\pdf\\chapter 5 -Neural Networks.pdf',
   'creationDate': "D:20190702135330+05'30'",
   'producer': 'Microsoft® Word for Office 365',

### RAG pipeline with LLM

In [None]:
from langchain_groq import ChatGroq
import os 
from dotenv import load_dotenv
load_dotenv()

groq_api_key = os.getenv("GROQ_API_KEY")

llm = ChatGroq(groq_api_key=groq_api_key, model_name="llama-3.1-8b-instant", temperature=0.1, max_tokens=1024)

#RAG function

def rag_simple(query, retriever, llm, top_k=3):
    
    results = retriever.retrieve(query, top_k=top_k)
    context = "\n\n".join([doc['content'] for doc in results]) if results else ""

    if not context:
        return "No relevant context found to answer the question."
    
    ## generate ans
    prompt = f"""
        You are an academic Machine Learning assistant.

        Use only the provided context to answer the question.

        Guidelines:
        - Answer clearly and concisely.
        - Do not add information outside the context.
        - If a mathematical formula appears, rewrite it in clean readable format.
        - Explain technical terms briefly if necessary.

        Context:
        {context}

        Question:
        {query}

        Answer:
        """
    
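    # `prompt` is an f-string and is already fully rendered; calling .format()
    # on it again would raise if the retrieved context contains braces
    response = llm.invoke(prompt)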

    return response.content

#### Test Simple RAG

In [62]:
answer = rag_simple("What is The simple linear regression model?", rag_retriever, llm)
print(answer)

Retrieving documents for query: 'What is The simple linear regression model?'
Top K: 3, Score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 82.40it/s]

Generated embeddings with shape: (1, 384)
Retrieved 1 documents (after filtering)
Retrieved 2 documents (after filtering)
Retrieved 3 documents (after filtering)





The Simple Linear Regression model is a statistical model that shows a linear or sloped straight line relationship between a dependent variable (continuous/real value) and a single independent variable (can be continuous or categorical value).
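
`rag_simple` joins the retrieved chunks with blank lines, so the model cannot tell where one chunk ends or which file it came from. A possible alternative, sketched here with hypothetical names (`format_context`, `max_chars`, `demo_results`), numbers each chunk, tags it with the `source` and `page` metadata that the retriever already returns, and caps the total context length. The 4000-character budget is an arbitrary illustrative value.

```python
from typing import Any, Dict, List


def format_context(results: List[Dict[str, Any]], max_chars: int = 4000) -> str:
    """Join retrieved chunks into a numbered context block with source tags."""
    parts: List[str] = []
    used = 0
    for n, doc in enumerate(results, start=1):
        meta = doc.get("metadata", {})
        header = f"[{n}] source={meta.get('source', 'unknown')} page={meta.get('page', 'unknown')}"
        block = f"{header}\n{doc['content'].strip()}"
        # Always keep the first chunk; drop later ones that exceed the budget
        if parts and used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


# Hand-written results in the same shape that RAGRetriever.retrieve returns
demo_results = [
    {"content": " First chunk. ", "metadata": {"source": "a.pdf", "page": 1}},
    {"content": "Second chunk.", "metadata": {"source": "b.pdf", "page": 2}},
]
print(format_context(demo_results))
```

The numbered tags also give the prompt something concrete to cite, for example by asking the model to reference `[1]` or `[2]` in its answer.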


In [None]:
def rag_advanced(query, retriever, llm, top_k=5, min_score=0.2, return_context=False):
    results = retriever.retrieve(query, top_k=top_k, score_threshold=min_score)
    if not results:
        return {'answer': 'No relevant context found.', 'sources': [], 'confidence': 0.0, 'context': ''}
    
    context = "\n\n".join([doc['content'] for doc in results])
    sources = [{
        'source': doc['metadata'].get('source_file', doc['metadata'].get('source', 'unknown')),
        'page': doc['metadata'].get('page', 'unknown'),
        'score': doc['similarity_score'],
        'preview': doc['content'][:300]+ '...'
    } for doc in results]

    confidence = max([doc['similarity_score'] for doc in results])

    # Generate the answer

    prompt = f"""
        You are a machine learning academic assistant.

        Use the retrieved context to answer clearly.

        If mathematical equations appear:
        - Rewrite them in clean readable format.
        - Explain each symbol.
        - Do not copy broken PDF formatting.

        Context:
        {context}

        Question:
        {query}

        Answer:
        """    
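    # `prompt` is an f-string and is already fully rendered; calling .format()
    # on it again would raise if the retrieved context contains braces
    response = llm.invoke(prompt)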

    output = {
        'answer': response.content,
        'sources': sources,
        'confidence': confidence
    }
    if return_context:
        output['context'] = context
    return output
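
When several retrieved chunks come from the same page, the `sources` list in `rag_advanced` repeats that page once per chunk. A small post-processing step can collapse the repeats and keep the best score for each page. This is a sketch of one possible approach; `dedupe_sources` and the sample data are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def dedupe_sources(sources: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep one entry per (source, page), choosing the highest-scoring one."""
    best: Dict[Tuple[Any, Any], Dict[str, Any]] = {}
    for entry in sources:
        key = (entry["source"], entry["page"])
        if key not in best or entry["score"] > best[key]["score"]:
            best[key] = entry
    # Present the strongest evidence first
    return sorted(best.values(), key=lambda e: e["score"], reverse=True)


# Hand-written entries in the shape built by rag_advanced (preview omitted)
sample_sources = [
    {"source": "a.pdf", "page": 1, "score": 0.4},
    {"source": "a.pdf", "page": 1, "score": 0.7},
    {"source": "b.pdf", "page": 3, "score": 0.5},
]
unique_sources = dedupe_sources(sample_sources)
print(unique_sources)
```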


#### Test Advanced RAG

##### 1st Question

In [63]:
result = rag_advanced("What is Machine Learning?", rag_retriever, llm, top_k=3, min_score=0.1, return_context=True)
print("Answer:", result['answer'])
print("Sources:", result['sources'])
print("Confidence:", result['confidence'])
print("Context Preview:", result['context'][:300])

Retrieving documents for query: 'What is Machine Learning?'
Top K: 3, Score threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 19.78it/s]

Generated embeddings with shape: (1, 384)
Retrieved 1 documents (after filtering)
Retrieved 2 documents (after filtering)
Retrieved 3 documents (after filtering)





Answer: **What is Machine Learning?**

Machine learning is a subfield of artificial intelligence (AI) that involves programming computers to optimize a performance criterion using example data or past experience.

**Key Components:**

1. **Model**: A mathematical representation of a system or a relationship between variables.
2. **Parameters**: Adjustable values within the model that need to be optimized.
3. **Training Data**: A set of examples used to train the model and optimize its parameters.
4. **Performance Criterion**: A measure of how well the model performs, such as accuracy or loss.

**Machine Learning Process:**

1. **Model Definition**: Define a model with adjustable parameters.
2. **Training**: Execute a computer program to optimize the parameters of the model using the training data.
3. **Model Evaluation**: Evaluate the performance of the trained model using a performance criterion.

**Types of Machine Learning Models:**

1. **Predictive Models**: Designed to make predic

##### 2nd Question

In [64]:
result = rag_advanced("What is Version Spaces?", rag_retriever, llm, top_k=3, min_score=0.1, return_context=True)
print("Answer:", result['answer'])
print("Sources:", result['sources'])
print("Confidence:", result['confidence'])
print("Context Preview:", result['context'][:300])

Retrieving documents for query: 'What is Version Spaces?'
Top K: 3, Score threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 62.34it/s]

Generated embeddings with shape: (1, 384)
Retrieved 1 documents (after filtering)
Retrieved 2 documents (after filtering)
Retrieved 3 documents (after filtering)





Answer: **Version Spaces: A Concept in Machine Learning**

Version spaces are a fundamental concept in machine learning, particularly in the field of inductive inference. They were introduced by Mitchell in 1982 as a way to represent the set of possible hypotheses that are consistent with a given set of training examples.

**Definition:**

The version space, denoted as V(H, D), is the subset of hypotheses from the hypothesis space H that are consistent with the training examples in D.

**Mathematical Representation:**

Let H be the hypothesis space, D be the set of training examples, and V(H, D) be the version space. Then:

V(H, D) = {h ∈ H | h(x) = c(x) ∀ x ∈ D}

**Explanation:**

* h ∈ H: This means that h is a hypothesis in the hypothesis space H.
* h(x) = c(x): This means that the hypothesis h correctly classifies the example x, where c(x) is the correct classification of x.
* ∀ x ∈ D: This means that the hypothesis h must correctly classify all examples in the training

##### 3rd Question

In [66]:
result = rag_advanced("Explain Least squares estimation?", rag_retriever, llm, top_k=3, min_score=0.1, return_context=True)
print("Answer:", result['answer'])
print("Sources:", result['sources'])
print("Confidence:", result['confidence'])
print("Context Preview:", result['context'][:300])

Retrieving documents for query: 'Explain Least squares estimation?'
Top K: 3, Score threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 44.49it/s]

Genrated emebeddings with shape: (1, 384)
Retrieved 1 documents (after filtering)
Retrieved 2 documents (after filtering)
Retrieved 3 documents (after filtering)





Answer: Least Squares Estimation is a popular method of estimation in statistics and machine learning. It is used to estimate the parameters of a linear regression model by minimizing the sum of squares of the residuals.

**Simple Linear Regression Model**

The simple linear regression model is given by:

y_i = β_0 + β_1x_i + ε_i

where:

* y_i is the dependent variable (response variable)
* x_i is the independent variable (predictor variable)
* β_0 is the intercept or constant term
* β_1 is the slope coefficient
* ε_i is the error term or residual

**Least Squares Estimation**

The principle of least squares estimates the parameters β_0 and β_1 by minimizing the sum of squares of the residuals. The sum of squares of the residuals is given by:

S = ∑[y_i - (β_0 + β_1x_i)]^2

where the sum is taken over all n observations.
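For a concrete feel of this objective, here is a minimal Python sketch. It uses the standard closed-form estimates for simple linear regression (slope as the ratio of the centered cross-product sum to the centered sum of squares of x, intercept from the means). The function names and the sample data are assumptions chosen for illustration.

```python
def sum_of_squares(xs, ys, beta_0, beta_1):
    """S = sum over i of [y_i - (beta_0 + beta_1 * x_i)]^2."""
    return sum((y - (beta_0 + beta_1 * x)) ** 2 for x, y in zip(xs, ys))


def fit_simple_linear_regression(xs, ys):
    """Closed-form least squares estimates (beta_0, beta_1) for one predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    beta_1 = s_xy / s_xx  # assumes the x values are not all identical
    beta_0 = y_mean - beta_1 * x_mean
    return beta_0, beta_1


# Made-up data lying exactly on y = 1 + 2x, so the fitted S should be zero.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
beta_0, beta_1 = fit_simple_linear_regression(xs, ys)
print(beta_0, beta_1, sum_of_squares(xs, ys, beta_0, beta_1))
```

Any other choice of intercept and slope gives a larger value of S on the same data, which is what "least squares" refers to.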

**Minimizing the Sum of Squares**

To minimize the sum of squares, we take the partial derivatives of S with respect to β_0 and β_1, and set them equal

##### 4th Question

In [67]:
result = rag_advanced("What is Logistic Regression?", rag_retriever, llm, top_k=3, min_score=0.1, return_context=True)
print("Answer:", result['answer'])
print("Sources:", result['sources'])
print("Confidence:", result['confidence'])
print("Context Preview:", result['context'][:300])

Retrieving documents for query: 'What is Logistic Regression?'
Top K: 3, Score threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 48.45it/s]

Genrated emebeddings with shape: (1, 384)
Retrieved 1 documents (after filtering)
Retrieved 2 documents (after filtering)
Retrieved 3 documents (after filtering)





Answer: Based on the provided context, Logistic Regression is the case where the response is binomial, with 'n' equal to the number of data-points with the given 'x' (often but not always 1), and 'p' is given by the following equation:

p = 1 / (1 + e^(-z))

Here's a clean and readable format of the equation:

p = 1 / (1 + e^(-z))

Explanation of symbols:

- p: The probability of the response variable
- e: The base of the natural logarithm (approximately 2.718)
- z: The linear predictor, which is a linear combination of the input features 'x' and the model parameters
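The symbols listed above map directly onto a few lines of Python. The sketch below is illustrative only: `predict_probability`, the weights, and the bias are assumed names and values, and z is taken to be a bias plus a weighted sum of the features.

```python
import math


def sigmoid(z):
    """p = 1 / (1 + e^(-z)); not guarded against overflow for very negative z."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Compute the linear predictor z and map it to a probability."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical parameters: here z = 0.5 * 1.0 - 0.25 * 2.0 = 0, so p = 0.5.
p = predict_probability([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(p)
```

Larger values of z push p toward 1 and smaller values push it toward 0, with z = 0 sitting exactly at p = 0.5.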

In Logistic Regression, the relationship between the parameters and the linear predictor is changed using a link function, which is a mathematical function that maps the linear predictor to the probability of the response variable. This is done for computational reasons, as it allows for the use of optimization algorithms to find the model parameters.

In essence, Logistic Regression is a type of regression analysis that 

##### 5th Question

In [68]:
result = rag_advanced("Explain Decision Tree Learning Algorithm", rag_retriever, llm, top_k=3, min_score=0.1, return_context=True)
print("Answer:", result['answer'])
print("Sources:", result['sources'])
print("Confidence:", result['confidence'])
print("Context Preview:", result['context'][:300])

Retrieving documents for query: 'Explain Decision Tree Learning Algorithm'
Top K: 3, Score threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 33.69it/s]

Genrated emebeddings with shape: (1, 384)
Retrieved 1 documents (after filtering)
Retrieved 2 documents (after filtering)
Retrieved 3 documents (after filtering)





Answer: Decision Tree Learning Algorithm

Decision Tree Learning is a supervised learning algorithm used for both classification and regression tasks. It's a simple yet powerful algorithm that works by recursively partitioning the data into smaller subsets based on the input features.

**Decision Tree Structure**

A decision tree consists of:

1. **Root Node**: The topmost node in the tree, which represents the entire dataset.
2. **Internal Nodes**: These nodes represent a feature or attribute of the data. Each internal node performs a Boolean test on the input feature.
3. **Leaf Nodes**: These nodes represent the predicted class or target value.
4. **Edges**: These are the connections between nodes, labeled with the values of the input feature.
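The structure listed above can be sketched as a small Python data type. This is a hand-built tree, not a learned one; the feature names, thresholds, and class labels are made up, and the Boolean test at each internal node is assumed to be a simple threshold comparison.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    prediction: str  # predicted class at this leaf


@dataclass
class Node:
    feature: str                    # feature tested at this internal node
    threshold: float                # Boolean test: sample[feature] <= threshold
    if_true: Union["Node", Leaf]    # edge followed when the test is true
    if_false: Union["Node", Leaf]   # edge followed when the test is false


def predict(tree, sample):
    """Walk from the root to a leaf, following the outcome of each test."""
    while isinstance(tree, Node):
        passed = sample[tree.feature] <= tree.threshold
        tree = tree.if_true if passed else tree.if_false
    return tree.prediction


# Hypothetical tree: the root tests hours_studied, its right child tests attendance.
tree = Node(
    "hours_studied", 2.0,
    Leaf("fail"),
    Node("attendance", 0.5, Leaf("fail"), Leaf("pass")),
)
print(predict(tree, {"hours_studied": 5.0, "attendance": 0.9}))
```

A learning algorithm would choose the features and thresholds from data; only the prediction walk is shown here.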

**Decision Tree Learning Algorithm**

The decision tree learning algorithm can be summarized as follows:

1. **Choose a Root Node**: Select a feature or attribute to be the root node of the tree.
2. **Split the Data**: Partition the data into 

##### 6th Question

In [69]:
result = rag_advanced("What is Multi Layer Artificial Neural Networks?", rag_retriever, llm, top_k=3, min_score=0.1, return_context=True)
print("Answer:", result['answer'])
print("Sources:", result['sources'])
print("Confidence:", result['confidence'])
print("Context Preview:", result['context'][:300])

Retrieving documents for query: 'What is Multi Layer Artificial Neural Networks?'
Top K: 3, Score threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 46.97it/s]

Genrated emebeddings with shape: (1, 384)
Retrieved 1 documents (after filtering)
Retrieved 2 documents (after filtering)
Retrieved 3 documents (after filtering)





Answer: **Multi Layer Artificial Neural Networks (MLANN)**

A Multi Layer Artificial Neural Network (MLANN) is a type of neural network that consists of multiple layers of interconnected nodes or neurons. These layers are:

1. **Input Layer**: The first layer that receives the input data.
2. **Hidden Layers**: One or more layers between the input and output layers that perform complex computations.
3. **Output Layer**: The final layer that produces the output.
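To make the three layers concrete, here is a minimal forward pass in plain Python. It is a sketch under assumptions: one hidden layer, a tanh activation, a linear output layer, and hand-picked weights. None of these choices come from the course material.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: each row of weights feeds one neuron."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer (tanh nonlinearity) -> linear output layer."""
    hidden = [math.tanh(v) for v in dense(x, hidden_w, hidden_b)]
    return dense(hidden, out_w, out_b)


# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output.
x = [1.0, -1.0]
hidden_w = [[0.5, 0.5], [1.0, -1.0]]
hidden_b = [0.0, 0.0]
out_w = [[1.0, 1.0]]
out_b = [0.5]
output = forward(x, hidden_w, hidden_b, out_w, out_b)
print(output)
```

Without the nonlinearity in the hidden layer, the two layers would collapse into a single linear map, which is why the activation matters.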

**Function of Each Neuron in Hidden Layers**

The neurons in the hidden layers are responsible for:

* **Nonlinear Function Approximation**: The ability to approximate complex relationships between inputs and outputs using nonlinear functions.
* **Learning Generalization**: The ability to learn from a limited set of examples and apply that knowledge to new, unseen data.
* **Nonlinear Classification**: The ability to classify data into different categories using nonlinear boundaries.

**Architecture of a Multi Layer Neural Networ

##### 7th Question

In [70]:
result = rag_advanced("what is L1 regularization (LASSO)", rag_retriever, llm, top_k=3, min_score=0.1, return_context=True)
print("Answer:", result['answer'])
print("Sources:", result['sources'])
print("Confidence:", result['confidence'])
print("Context Preview:", result['context'][:300])

Retrieving documents for query: 'what is L1 regularization (LASSO)'
Top K: 3, Score threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 39.45it/s]

Genrated emebeddings with shape: (1, 384)
Retrieved 1 documents (after filtering)
Retrieved 2 documents (after filtering)
Retrieved 3 documents (after filtering)





Answer: L1 regularization, also known as LASSO (Least Absolute Shrinkage and Selection Operator), is a type of regularization technique used in linear regression and other machine learning models. It is a method for reducing overfitting by adding a penalty term to the loss function.

The L1 regularization equation is:

ŵ = arg min_w (Y − Xw)^T (Y − Xw) + λ∥w∥_1

where:

- ŵ is the estimated weight vector
- w is the weight vector
- Y is the target variable
- X is the design matrix
- λ is the regularization parameter (λ ≥ 0)
- ∥w∥_1 is the L1 norm of the weight vector, defined as:

∥w∥_1 = ∑_{j=1}^{D} |w_j|

In simpler terms, the L1 norm is the sum of the absolute values of the weights.
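As a small illustration, the L1 norm and the penalized objective can be evaluated directly in Python. The sketch only computes the objective for a given weight vector and does not solve the minimization; the names `l1_norm`, `lasso_objective`, and `lam`, and the sample data, are assumptions.

```python
def l1_norm(w):
    """Sum of the absolute values of the weights."""
    return sum(abs(w_j) for w_j in w)


def lasso_objective(X, Y, w, lam):
    """Squared residual sum (Y - Xw)^T (Y - Xw) plus lam times the L1 norm of w."""
    residuals = [
        y - sum(x_j * w_j for x_j, w_j in zip(row, w)) for row, y in zip(X, Y)
    ]
    return sum(r * r for r in residuals) + lam * l1_norm(w)


# Made-up data: residuals are [0, 4], so the objective is 16 + 0.5 * 3 = 17.5.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [1.0, 2.0]
w = [1.0, -2.0]
print(l1_norm(w), lasso_objective(X, Y, w, lam=0.5))
```

Setting `lam` to 0 recovers the ordinary least squares objective, and larger values penalize large weights more heavily.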

The L1 regularization term (λ∥w∥_1) encourages the model to set some of the weights to zero, effectively selecting the most important features and reducing overfitting.

LASSO has two main implications:

1. **No closed-form solution**: Unlike L2 regularization, LASSO does not have a clos

##### 8th Question

In [71]:
result = rag_advanced("What is Bias-Variance Decomposition of the 0-1 Loss", rag_retriever, llm, top_k=3, min_score=0.1, return_context=True)
print("Answer:", result['answer'])
print("Sources:", result['sources'])
print("Confidence:", result['confidence'])
print("Context Preview:", result['context'][:300])

Retrieving documents for query: 'What is Bias-Variance Decomposition of the 0-1 Loss'
Top K: 3, Score threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 52.24it/s]

Genrated emebeddings with shape: (1, 384)
Retrieved 1 documents (after filtering)
Retrieved 2 documents (after filtering)
Retrieved 3 documents (after filtering)





Answer: The Bias-Variance Decomposition of the 0-1 Loss, as formulated by Kong & Dietterich (1995), is given by:

**Bias-Variance Decomposition of 0-1 Loss:**

Let's denote the true label as y, the predicted label as \hat{y}, and the probability of the predicted label as p(\hat{y}).

The 0-1 loss function is defined as:

L(y, \hat{y}) = \begin{cases}
0, & \text{if } y = \hat{y} \\
1, & \text{if } y \neq \hat{y}
\end{cases}
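
The piecewise definition above translates directly into code. The following sketch covers only the loss itself and its average over a sample, not the decomposition; the function names and the label lists are illustrative assumptions.

```python
def zero_one_loss(y, y_hat):
    """0 if the prediction matches the true label, 1 otherwise."""
    return 0 if y == y_hat else 1


def mean_zero_one_loss(ys, y_hats):
    """Average 0-1 loss over a sample, i.e. the misclassification rate."""
    return sum(zero_one_loss(y, y_hat) for y, y_hat in zip(ys, y_hats)) / len(ys)


# Made-up labels: two of the four predictions are wrong.
true_labels = [1, 0, 1, 1]
predicted_labels = [1, 1, 1, 0]
print(mean_zero_one_loss(true_labels, predicted_labels))
```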

The expected 0-1 loss is given by:

E[L(y, \hat{y})] = E[1 - 2y\hat{y}]

Using the law of iterated expectations, we can rewrite this as:

E[L(y, \hat{y})] = E[E[L(y, \hat{y}) | \hat{y}]]

Now, we can expand the inner expectation:

E[L(y, \hat{y}) | \hat{y}] = E[1 - 2y\hat{y} | \hat{y}]

Using the fact that E[y | \hat{y}] = \hat{y}, we get:

E[L(y, \hat{y}) | \hat{y}] = 1 - 2\hat{y}^2

Now, we can take the outer expectation:

E[L(y, \hat{y})] = E[1 - 2\hat{y}^2]

Using the fact that E[\hat{y}^2] = Var(\hat{y}) + (E[\hat{y}])^2, we get:

E[L(y, \hat{y})] = 1 - 2(E[\ha

##### 9th Question

In [72]:
result = rag_advanced("Explain Gradient Descent", rag_retriever, llm, top_k=3, min_score=0.1, return_context=True)
print("Answer:", result['answer'])
print("Sources:", result['sources'])
print("Confidence:", result['confidence'])
print("Context Preview:", result['context'][:300])

Retrieving documents for query: 'Explain Gradient Descent'
Top K: 3, Score threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 34.28it/s]

Genrated emebeddings with shape: (1, 384)
Retrieved 1 documents (after filtering)
Retrieved 2 documents (after filtering)
Retrieved 3 documents (after filtering)





Answer: **What is Gradient Descent?**

Gradient Descent is a fundamental optimization algorithm used in machine learning to minimize the loss function of a model. It's a popular technique for finding the optimal parameters of a model by iteratively adjusting them based on the gradient of the loss function.
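
The iterative adjustment described above can be sketched on a one-parameter problem. This is a minimal example under assumptions: the loss is f(w) = (w - 3)^2, the gradient is supplied by hand, and `learning_rate` and `steps` are names and values picked for illustration.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly move w a small step against the gradient of the loss."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


# Loss f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_min)
```

With this step size the distance to the minimum shrinks by a constant factor on every iteration; a step size that is too large would make the iterates diverge instead.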

**How Gradient Descent Works**

The goal of Gradient Descent is to find the minimum value of the loss function, which measures the difference between the model's predictions and the actual values. The algorithm works as follows:

1. **Initialization**: Initialize the model's parameters (weights and biases) randomly.
2. **Forward Pass**: Compute the model's predictions using the current parameters.
3. **Compute Loss**: Calculate the loss function using the predictions and actual values.
4. **Compute Gradient**: Calculate the gradient of the loss function with respect to each parameter.
5. **Update Parameters**: Update each parameter by subtracting a fraction of the gradient, scaled b

##### 10th Question

In [73]:
result = rag_advanced("Explain Evaluation Metrics? ", rag_retriever, llm, top_k=3, min_score=0.1, return_context=True)
print("Answer:", result['answer'])
print("Sources:", result['sources'])
print("Confidence:", result['confidence'])
print("Context Preview:", result['context'][:300])

Retrieving documents for query: 'Explain Evaluation Metrics? '
Top K: 3, Score threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 61.49it/s]

Genrated emebeddings with shape: (1, 384)
Retrieved 1 documents (after filtering)
Retrieved 2 documents (after filtering)
Retrieved 3 documents (after filtering)





Answer: Evaluation Metrics for Machine Learning Algorithms

As a data scientist, it's essential to understand various evaluation metrics to assess the performance of machine learning algorithms. These metrics help identify issues caused by an imbalance in the prevalence of categories.

### Confusion Matrix

A confusion matrix is a table used to evaluate the performance of a classification model. It displays the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).

|  | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| **Actual Positive** | TP | FN |
| **Actual Negative** | FP | TN |

### Sensitivity (Recall)

Sensitivity, also known as recall, measures the proportion of actual positive instances that are correctly predicted by the model.

**Sensitivity (Recall) = TP / (TP + FN)**

* TP: True Positives (correctly predicted positive instances)
* FN: False Negatives (missed positive instances)
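
As a quick check of the formula, the confusion matrix counts and the recall can be computed from two label lists. The helper names and the sample labels below are assumptions for illustration, with 1 treated as the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification task."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def recall(tp, fn):
    """Sensitivity = TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels: 4 actual positives, of which 3 are found by the model.
actual = [1, 1, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 1]
tp, fp, tn, fn = confusion_counts(actual, predicted)
print(tp, fp, tn, fn, recall(tp, fn))
```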

Sensitivity is particularly useful