# Data Scientist Assignment

In this notebook, I discuss the concept of Retrieval-Augmented Generation (RAG): how the bot understands questions, how it retrieves information, and how the LLM ultimately uses that information to generate answers.

<img src="../Data/images/RAG Pipeline.png" width="500">


## Phase 1: Preparation

It is essential to understand the core components of RAG:

1 - **Chunking documents**  
2 - **Embedding**  
3 - **Ingestion**  
4 - **Vector Search**  
5 - **(Retrieval + Generation)**  
6 - **Memory**

## HOW DO RAGs WORK?

<div style="text-align:center;">
  <img src="../Data/images/RAG Work.png" width="600">
</div>

In the first step, an encoder converts your raw text and documents into numerical form so that the computer can compare them. Every word, sentence, or entire document in your external knowledge base is converted into a vector (an embedding), and these embeddings are then stored in a vector database. This representation is a good way of capturing the semantics of words, their relationships to other words, and the topics they represent.
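To make "capturing semantics as vectors" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors below are invented for illustration only; real embeddings come from a model and have hundreds of dimensions or more.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", made up for this example.
toy_vectors = {
    "gradient descent": [0.9, 0.1, 0.0],
    "learning rate": [0.8, 0.3, 0.1],
    "chocolate cake": [0.0, 0.2, 0.9],
}

query = toy_vectors["gradient descent"]
for text, vec in toy_vectors.items():
    # Texts pointing in a similar direction score close to 1.
    print(f"{text!r}: {cosine_similarity(query, vec):.3f}")
```

With these toy values, "learning rate" scores much closer to "gradient descent" than "chocolate cake" does, which is the behaviour we hope a real embedding model shows for related and unrelated text.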

## 1 - Chunking documents

Chunking breaks large documents into smaller pieces so that AI models can better find, understand, and use the information when answering questions.

In [101]:
from openai import OpenAI
from dotenv import load_dotenv
import os
from langchain_community.document_loaders import PyPDFLoader

load_dotenv(override=True)

True

In [102]:
file_path = "../Data/MlQuestions_Final_Comprehensive.pdf"
loader = PyPDFLoader(file_path=file_path)
pages = loader.load()

In [103]:
len(pages)

8

In [104]:
page_one = pages[0]
print(page_one.page_content)

Part 1: Machine Learning Interview 
Questions & Answers  
Disclaimer 
Machine Learning is an important concept when it comes to Data Science Interviews. 
Prepare for your Machine Learning Interviews with these most asked interview questions. 
Q. 1: Explain Bias-Variance Tradeoff. 
Ans: The bias-variance tradeoff represents the balance between the model's ability to 
generalize across different datasets (bias) and its sensitivity to small fluctuations in the 
training set (variance). A high-bias model is too simple and underfits the data, missing the 
underlying trend. A high-variance model is too complex, overfitting the data and capturing 
noise as if it were a real pattern. The goal is to find a sweet spot that minimizes the total 
error. 
Q. 2: How does Gradient Descent Work? 
Ans: Gradient Descent is an optimization algorithm used to minimize some function by 
iteratively moving in the direction of the steepest descent as defined by the negative of the 
gradient. In machine learnin

In [105]:
page_one.metadata

{'producer': 'Microsoft® Word LTSC',
 'creator': 'Microsoft® Word LTSC',
 'creationdate': '2026-01-22T10:57:39+02:00',
 'author': 'python-docx',
 'moddate': '2026-01-22T10:57:39+02:00',
 'source': '../Data/MlQuestions_Final_Comprehensive.pdf',
 'total_pages': 8,
 'page': 0,
 'page_label': '1'}

### Why do we need Document Splitting or Chunking?

In LangChain, document splitting (or text chunking) is an essential preprocessing step before feeding large documents into language models or vector databases.

LLMs (Large Language Models) like GPT can only handle a limited number of tokens per request, so we split long documents into smaller, manageable chunks. This allows:

* Efficient retrieval
* Better embedding generation
* Faster and more accurate responses during question answering or summarization tasks
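The idea of chunk size and overlap can be sketched with a plain sliding window. This is a simplified illustration, not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to break on the separators it is given); the `chunk_text` helper is a hypothetical name used only here.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.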


In [106]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 1500 
chunk_overlap = 200

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", ". ", " ", ""]
)

In [107]:
chunks = r_splitter.split_documents(pages)

print(f"Total chunks created: {len(chunks)}")
print(f"First chunk preview: {chunks[0].page_content[:200]}")

Total chunks created: 15
First chunk preview: Part 1: Machine Learning Interview 
Questions & Answers  
Disclaimer 
Machine Learning is an important concept when it comes to Data Science Interviews. 
Prepare for your Machine Learning Interviews w


## 2 - Embedding 

### What are Text Embeddings?

Text embeddings are numerical representations of text — words, sentences, or even entire documents — that capture their meaning in a way that computers can understand and compare.

Embeddings are the foundation for many NLP (Natural Language Processing) tasks:

* Semantic search: Finding documents similar in meaning, not just keyword match.
* Chatbots / Retrieval-Augmented Generation (RAG): Retrieving relevant context from a database to answer questions.
* Clustering: Grouping similar texts (reviews or news articles).
* Recommendation systems: Suggesting similar content.
* Sentiment or topic analysis.

In [108]:
openai_api_key = os.getenv("OPENAI_API_KEY")
google_api_key = os.getenv("GOOGLE_API_KEY")
cohere_api_key = os.getenv("COHERE_API_KEY")

generation_model_id = os.getenv("GENERATION_MODEL_ID")
embedded_model_id = os.getenv("EMBEDDING_MODEL_ID")
embedding_model_size = os.getenv("EMBEDDING_MODEL_SIZE")

if openai_api_key:
    print(f"OpenAI key found and start with: {openai_api_key[:5]}")
else:
    print("OpenAI key not found")

OpenAI key found and start with: sk-pr


In [109]:
from langchain_cohere import CohereEmbeddings

# Initialize Cohere embeddings
embeddings = CohereEmbeddings(
    model=embedded_model_id,
    cohere_api_key=cohere_api_key
)

# Extract text content from Document objects
chunk_texts = [chunk.page_content for chunk in chunks]

# Test the embeddings
print(f"Embedding {len(chunk_texts)} chunks...")
chunk_embeddings = embeddings.embed_documents(chunk_texts)
print(f"Created {len(chunk_embeddings)} embeddings")
print(f"Embedding dimension: {len(chunk_embeddings[0])}")
print(f"First chunk: {chunk_texts[0][:100]}...")
print(f"First embedding (first 5 values): {chunk_embeddings[0][:5]}")

Embedding 15 chunks...
Created 15 embeddings
Embedding dimension: 1024
First chunk: Part 1: Machine Learning Interview 
Questions & Answers  
Disclaimer 
Machine Learning is an importa...
First embedding (first 5 values): [-0.0126571655, -0.009361267, -0.09033203, 0.046203613, -0.049041748]


In [110]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# Initialize Hugging Face embeddings
embeddings_model_huggingface = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"  
)

# Extract text content from Document objects
chunk_texts = [chunk.page_content for chunk in chunks]

# Embed your chunks
print(f"Embedding {len(chunk_texts)} chunks...")
chunk_embeddings = embeddings_model_huggingface.embed_documents(chunk_texts)

print(f"Created {len(chunk_embeddings)} embeddings")
print(f"Embedding dimension: {len(chunk_embeddings[0])}")
print(f"First chunk: {chunk_texts[0][:100]}...")
print(f"First embedding (first 5 values): {chunk_embeddings[0][:5]}")

Embedding 15 chunks...
Created 15 embeddings
Embedding dimension: 384
First chunk: Part 1: Machine Learning Interview 
Questions & Answers  
Disclaimer 
Machine Learning is an importa...
First embedding (first 5 values): [-0.05354616791009903, 0.061014801263809204, 0.053887173533439636, 0.030077779665589333, 0.06359510868787766]


## 3 - Ingestion

In this step, we store the text embeddings in a vector database (MongoDB Atlas) with LangChain.

In [111]:
from pymongo import MongoClient
from langchain_mongodb import MongoDBAtlasVectorSearch

# MongoDB Configuration
mongodb_uri = os.getenv("MONGODB_URI")
db_name = os.getenv("MONGODB_DATABASE")
collection_name = os.getenv("MONGODB_COLLECTION")

print("\nConnecting to MongoDB...")
client = MongoClient(mongodb_uri)
client.admin.command('ping')
database = client[db_name]
collection = database[collection_name]
print(f"✓ Connected to {db_name}.{collection_name}")


Connecting to MongoDB...
✓ Connected to rag_database.document_chunks


In [112]:
collection.delete_many({})
print("Docs after delete:", collection.count_documents({}))

Docs after delete: 0


In [113]:
# Extract text content from Document objects
chunk_texts = [chunk.page_content for chunk in chunks]

print(f"\nIngesting {len(chunk_texts)} chunks into MongoDB...")
vector_store = MongoDBAtlasVectorSearch.from_texts(
    texts=chunk_texts,
    embedding=embeddings,
    collection=collection,
    index_name="vector_index"
)
print("Docs after insert:", collection.count_documents({}))


Ingesting 15 chunks into MongoDB...
Docs after insert: 15
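The ingestion cell refers to an index called `vector_index`, but the notebook does not show how that index was defined. Below is a minimal sketch of what an Atlas Vector Search index definition could look like. The field path `embedding` and the `cosine` similarity are assumptions on my part (check them against your own collection), `build_vector_index_definition` is a hypothetical helper, and 1024 is the dimension printed for the Cohere embeddings above.

```python
def build_vector_index_definition(num_dimensions, path="embedding", similarity="cosine"):
    """Return an Atlas Vector Search index definition as a plain dict.

    `path` is the document field that holds the vector (assumed to be
    "embedding" here) and `similarity` is the distance function (assumed cosine).
    """
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }

# 1024 matches the embedding dimension reported for the Cohere model above.
definition = build_vector_index_definition(num_dimensions=1024)
print(definition)
```

A definition like this is typically supplied when creating the index in the Atlas UI or through a driver. The important point is that `numDimensions` must match the embedding model: switching to the 384-dimensional MiniLM model would require an index built for 384 dimensions.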


## 4 - Vector Search


Vector search (also called semantic search or similarity search) is the process of finding the most relevant chunks of text in your database based on the meaning of a query, not just keyword matching.
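As a rough sketch of what a similarity search does under the hood, the snippet below scores every stored vector against the query and keeps the top `k`. It is a brute-force version with made-up vectors; a production vector database typically uses an approximate index instead of scanning everything, and the `top_k` and `normalize` helpers are hypothetical names.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, indexed, k=2):
    """Return the k (score, text) pairs most similar to the query, best first."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), text)
        for text, vec in indexed
    )
    return heapq.nlargest(k, scored)

# Invented texts and 3-dimensional vectors, for illustration only.
indexed = [
    ("L1 regularization adds an absolute-value penalty", [0.9, 0.2, 0.1]),
    ("Bagging trains models on bootstrap samples", [0.1, 0.9, 0.2]),
    ("L2 regularization adds a squared penalty", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]

for score, text in top_k(query_vec, indexed, k=2):
    print(f"{score:.3f}  {text}")
```

With these toy values the two regularization texts come back first and the bagging text is left out, because its vector points in a different direction from the query.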


In [114]:
# Check current count
current_count = collection.count_documents({})
print(f"Current documents in collection: {current_count}")

Current documents in collection: 15


In [118]:
def get_answer(query, vector_store, k=10):
    results = vector_store.similarity_search(query, k=k)
    
    if not results:
        return "No results found"
    
    # Use the top result
    content = results[0].page_content
    
    # Extract answer between "Ans:" and next "Q."
    ans_start = content.find("Ans:")
    
    if ans_start != -1:
        # Find next question
        next_q = content.find("Q.", ans_start + 4)
        
        if next_q != -1:
            answer = content[ans_start + 4:next_q].strip()
        else:
            answer = content[ans_start + 4:].strip()
        
        return answer
    else:
        return content.strip()

# Test
test_query = "What is Regularization, and what is the difference between L1 (Lasso) and L2 (Ridge)?"
answer = get_answer(test_query, vector_store)

print(f"Query: {test_query}\n")
print("Answer:")
print(answer)

Query: What is Regularization, and what is the difference between L1 (Lasso) and L2 (Ridge)?

Answer:
Both Bagging and Boosting are ensemble techniques to improve model predictions, but 
they work differently. 
 
Comparison Table: 
1. Bagging: The simplest way of combining predictions that belong to the same type. 
Boosting: A way of combining predictions that belong to the different types.


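That is a bad answer: the query asks about regularization, but the returned text is about Bagging and Boosting.

In a RAG pipeline, this usually happens for one of these reasons:

1 - Retrieval is weak (the vector search returns irrelevant chunks)

2 - The LLM is answering without using the retrieved context

3 - Your chunks are too big / too small, so the meaning is lost

4 - You’re not storing metadata (page number, source), so the model can’t ground answers

5 - Your prompt is not forcing “answer only from context”

In this case, note also that `get_answer` returns whatever follows the first `Ans:` in the top chunk. A 1500-character chunk can hold several question/answer pairs, so the extracted answer may belong to a different question than the one that matched the query.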
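One of the failure modes can be reproduced without any database: when a chunk holds several question/answer pairs, taking the text after the first `Ans:` may return the wrong answer. The sketch below splits a chunk on the `Q. n:` / `Ans:` markers seen in the PDF and picks the pair whose question shares the most words with the query. The word-overlap score is a crude heuristic chosen only for illustration, and the sample chunk and helper names are hypothetical.

```python
import re

# Matches "Q. <n>: <question> Ans: <answer>" up to the next question or the end.
QA_PATTERN = re.compile(
    r"Q\.\s*\d+:\s*(.*?)\s*Ans:\s*(.*?)(?=Q\.\s*\d+:|\Z)", re.DOTALL
)

def split_qa_pairs(chunk):
    """Split a chunk into (question, answer) pairs."""
    return [(q.strip(), a.strip()) for q, a in QA_PATTERN.findall(chunk)]

def words(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def best_answer(chunk, query):
    """Return the answer whose question overlaps most with the query."""
    pairs = split_qa_pairs(chunk)
    if not pairs:
        return chunk.strip()  # no markers found: fall back to the raw chunk
    query_words = words(query)
    _question, answer = max(pairs, key=lambda p: len(words(p[0]) & query_words))
    return answer

# Made-up chunk containing two Q/A pairs, like a large chunk of the PDF might.
chunk = (
    "Q. 7: What is the difference between Bagging and Boosting? "
    "Ans: Both are ensemble techniques. "
    "Q. 8: What is Regularization? "
    "Ans: Regularization penalizes large weights."
)
print(best_answer(chunk, "What is Regularization?"))
```

Here the first `Ans:` in the chunk belongs to the Bagging question, while the pair-aware version returns the regularization answer. Smaller chunks or an LLM that reads the full retrieved context are more general ways to address the same problem.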


In [124]:
from langchain_core.documents import Document

INDEX_NAME = "vector_index"

def ingest_pdf(pdf_path: str):
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=120
    )

    docs = []
    for page in pages:
        chunks = splitter.split_text(page.page_content)
        for chunk in chunks:
            docs.append(
                Document(
                    page_content=chunk,
                    metadata={
                        "source": pdf_path,
                        "page": page.metadata.get("page", 0)
                    }
                )
            )

    collection.delete_many({})

    vector_store = MongoDBAtlasVectorSearch.from_documents(
        documents=docs,
        embedding=embeddings,
        collection=collection,
        index_name=INDEX_NAME
    )

    return vector_store

In [125]:
def retrieve_docs(vector_store, query: str, k: int = 6):
    docs = vector_store.similarity_search(query, k=k)
    return docs


def format_docs(docs):
    text = ""
    for d in docs:
        page = d.metadata.get("page", "N/A")
        source = d.metadata.get("source", "N/A")
        text += f"[Source: {source} | Page: {page}]\n{d.page_content}\n\n"
    return text.strip()

In [129]:
client_openai = OpenAI(api_key=openai_api_key)

def ask_rag(vector_store, query: str):
    docs = retrieve_docs(vector_store, query, k=6)

    if not docs:
        return "No documents retrieved. Check MongoDB index or embeddings.", []

    context = format_docs(docs)

    prompt = f"""
You are a helpful assistant.
Answer ONLY using the context below.
If the answer is not in the context, say:
"I don't know based on the document."

Context:
{context}

Question: {query}

Answer:
"""

    response = client_openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    answer = response.choices[0].message.content

    sources = []
    for d in docs:
        sources.append(
            f"{d.metadata.get('source')} (page {d.metadata.get('page')})"
        )

    return answer, sources



In [135]:
def ask_rag(vector_store, query: str, k: int = 6):
    docs = vector_store.similarity_search(query, k=k)

    print("Retrieved inside ask_rag:", len(docs))

    if not docs:
        return "No documents retrieved. Check MongoDB index or embeddings.", []

    context = "\n\n".join(
        [
            f"[Source: {d.metadata.get('source')} | Page: {d.metadata.get('page')}]\n{d.page_content}"
            for d in docs
        ]
    )

    prompt = f"""
You are a helpful assistant.
Answer ONLY using the context below.
Write the answer in clear bullet points.
If the answer is not in the context, say:
"I don't know based on the document."

Context:
{context}

Question: {query}
Answer:
"""

    response = client_openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    answer = response.choices[0].message.content.strip()

    sources = sorted(
        set(
            [f"{d.metadata.get('source')} (page {d.metadata.get('page')})" for d in docs]
        ),
        key=lambda x: int(x.split("page ")[-1].replace(")", ""))
    )

    return answer, sources


In [143]:
query = "What are the main AI roles defined by AWS (developers, deployers, end users)?"
answer, sources = ask_rag(vector_store, query)

print("Answer:\n", answer)

print("\nSources:")
for s in sources:
    print("-", s)

Retrieved inside ask_rag: 6
Answer:
 - **AI Developers**: Create and develop AI models or systems, define intended use cases, and assess potential risks.
- **AI Deployers**: Deploy AI systems to end users, assess suitability and performance in their unique operating context.
- **AI End Users**: Provide inputs or receive outputs from an AI system and are encouraged to share feedback for improvements.

Sources:
- ../Data/MlQuestions_Final_Comprehensive.pdf (page 4)
- ../Data/MlQuestions_Final_Comprehensive.pdf (page 5)
