# 🔍 Domain-Specific Q&A Using RAG (Retrieval-Augmented Generation)
## AI/Machine Learning Domain

**Course:** Natural Language Processing  
**Date:** February 2026  
**Runtime:** Google Colab (Free Tier)

---

## 1. Introduction

### Objective
Build a complete **Retrieval-Augmented Generation (RAG)** pipeline for answering technical questions in the **AI/Machine Learning** domain. The system ingests academic PDFs, creates a searchable vector store, retrieves relevant passages for a given query, and uses a language model to generate accurate, context-grounded answers.

### Why RAG?
Standard language models generate answers from their parametric memory (training data), which can lead to hallucinations or outdated information. RAG addresses this by **retrieving relevant documents first**, then feeding them as context to the generator, grounding the answer in actual source material.

### Domain & Knowledge Base
We use **6 PDF documents** covering core AI/ML topics:

| # | Document | Description |
|---|----------|-------------|
| 1 | `attention_is_all_you_need.pdf` | Original Transformer paper (Vaswani et al., 2017) |
| 2 | `cnn_transformers_intro.pdf` | Introduction to CNNs and Transformers |
| 3 | `cs224n_merged_notes.pdf` | Stanford CS224N (NLP) merged lecture notes |
| 4 | `cs224n_transformers_2024.pdf` | CS224N Transformers lecture (2024 edition) |
| 5 | `cs231n_full_notes.pdf` | Stanford CS231N (Computer Vision) full notes |
| 6 | `neural_networks_backprop.pdf` | Neural networks & backpropagation notes |

### Pipeline Architecture

```
┌─────────────┐    ┌──────────────┐    ┌────────────┐    ┌────────────────┐
│  6 AI/ML    │───▶│ Text Extract │───▶│  Chunking  │───▶│  Embedding     │
│  PDFs       │    │ + Cleaning   │    │ (500 tok,  │    │ (MiniLM-L6-v2) │
└─────────────┘    └──────────────┘    │ 50 overlap)│    │  384-dim       │
                                       └────────────┘    └───────┬────────┘
                                                                 │
                                                                 ▼
┌─────────────┐    ┌──────────────┐    ┌────────────┐    ┌────────────────┐
│  Generated  │◀───│  FLAN-T5     │◀───│  Prompt    │◀───│  FAISS Index   │
│  Answer     │    │  -large      │    │  Building  │    │  (Flat L2)     │
└─────────────┘    └──────────────┘    └────────────┘    └────────────────┘
                                            ▲
                                            │
                                       ┌────────────┐
                                       │  User      │
                                       │  Query     │
                                       └────────────┘
```

**Flow:** PDFs → Extract Text → Clean → Chunk → Embed → Store in FAISS → User asks question → Embed query → Retrieve top-k chunks → Build prompt with context → FLAN-T5-large generates answer

## 2. Methodology

### Pipeline Stages

Our RAG system operates in **7 sequential stages**:

1. **PDF Text Extraction**: Read raw text from each PDF page using PyPDF2
2. **Text Cleaning**: Collapse excess whitespace and rejoin words hyphenated across line breaks
3. **Chunking**: Split each cleaned page into overlapping chunks (`chunk_size=400` words, roughly 500 tokens, with a 50-word overlap) and record source metadata
4. **Embedding Generation**: Encode each chunk into a 384-dimensional dense vector using `all-MiniLM-L6-v2`
5. **FAISS Indexing**: Store all embeddings in a FAISS flat L2 index for exact nearest-neighbor search
6. **Retrieval**: Given a user query, embed it and retrieve the top-k most similar chunks from the index
7. **Generation**: Feed retrieved chunks as context into a prompt template, then generate an answer using FLAN-T5-large

### Model Choices & Justification

| Component | Model / Tool | Justification |
|-----------|-------------|---------------|
| Embeddings | `all-MiniLM-L6-v2` | Fast, 384-dim, excellent quality-to-size ratio, fits Colab free tier |
| Vector Store | FAISS `IndexFlatL2` | Exact search (no approximation error), dataset is small enough, simple to explain |
| Generator | `google/flan-t5-large` (~780M params) | Instruction-tuned, runs on T4 GPU, good at following prompt templates |
| PDF Parsing | PyPDF2 | Lightweight, no system dependencies, handles standard academic PDFs |
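
To make the `IndexFlatL2` choice concrete, here is a minimal pure-Python sketch of what an exact (brute-force) L2 search computes: the squared Euclidean distance from the query to every stored vector, followed by a selection of the `k` smallest. The function name `flat_l2_search` and the toy 3-dimensional vectors are illustrative assumptions and are not part of FAISS or of this notebook; FAISS performs the equivalent computation with optimized kernels.

```python
from typing import List, Tuple


def flat_l2_search(query: List[float],
                   vectors: List[List[float]],
                   k: int) -> List[Tuple[int, float]]:
    """Return (index, squared L2 distance) pairs for the k nearest vectors."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared Euclidean distance between the query and stored vector i.
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((i, dist))
    # Exact search: every vector was scored, so sorting gives the true top-k.
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical toy "index" of four 3-dimensional vectors.
toy_vectors = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [3.0, 3.0, 3.0],
]
hits = flat_l2_search([0.9, 0.1, 0.0], toy_vectors, k=2)
print(hits)  # nearest first: vector 1, then vector 0
```

Because every stored vector is compared against the query, the cost grows linearly with the number of chunks, which is why a flat index suits a small knowledge base like this one.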

### Evaluation Plan

We compare **RAG answers** vs. **Baseline answers** (no retrieval) on 10 domain-specific questions using:

- **ROUGE-L**: Measures longest-common-subsequence overlap with reference answers (lexical similarity)
- **Cosine Semantic Similarity**: Uses MiniLM embeddings to measure meaning-level similarity with reference answers

This dual evaluation captures both surface-level word overlap and deeper semantic alignment.
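
The arithmetic behind the two metrics can be sketched in plain Python. This is an illustration under simplifying assumptions, not the implementation used later in the notebook: tokens are lowercased whitespace-separated words (the `rouge-score` package applies its own tokenization and optional stemming, so its numbers can differ), and the vectors passed to `cosine` are toy values standing in for MiniLM embeddings. The helper names `lcs_length`, `rouge_l_f1`, and `cosine` are hypothetical.

```python
import math
from typing import List, Sequence


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """F-measure built from LCS-based precision and recall."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


score = rouge_l_f1(
    "backpropagation computes gradients with the chain rule",
    "backpropagation applies the chain rule to compute gradients",
)
print(round(score, 4))
print(cosine([1.0, 2.0], [2.0, 4.0]), cosine([1.0, 0.0], [0.0, 1.0]))
```

In the example, the longest common subsequence is "backpropagation the chain rule" (4 tokens), giving precision 4/7, recall 4/8, and an F-measure of 8/15. Note that "computes" and "compute" do not match here; stemming would change that.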

## 3. Implementation

### 3A. Environment Setup

Mount Google Drive and install all required packages.

In [68]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [69]:
import os

PROJECT_PATH = "/content/drive/MyDrive/rag_project"
DATA_PATH = os.path.join(PROJECT_PATH, "data")

os.makedirs(DATA_PATH, exist_ok=True)

print("Project directory created at:", PROJECT_PATH)

Project directory created at: /content/drive/MyDrive/rag_project


In [70]:
import os

print("Files in data folder:")
print(os.listdir(DATA_PATH))

Files in data folder:
['cs231n_full_notes.pdf', 'cnn_transformers_intro.pdf', 'cs224n_transformers_2024.pdf', 'attention_is_all_you_need.pdf', 'cs224n_merged_notes.pdf', 'neural_networks_backprop.pdf']


In [71]:
!pip install -q PyPDF2 sentence-transformers faiss-cpu transformers rouge-score torch pandas scikit-learn

In [72]:
import torch
import faiss
import numpy as np
import pandas as pd
import re

from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from rouge_score import rouge_scorer
from sklearn.metrics.pairwise import cosine_similarity

In [73]:
documents = []

for file in os.listdir(DATA_PATH):
    if file.endswith(".pdf"):
        reader = PdfReader(os.path.join(DATA_PATH, file))
        for page_num, page in enumerate(reader.pages):
            text = page.extract_text()
            if text:
                documents.append({
                    "source": file,
                    "page": page_num,
                    "text": text
                })

print("Total pages loaded:", len(documents))

Total pages loaded: 594


In [74]:
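def clean_text(text):
    # Collapse newlines and runs of whitespace into single spaces.
    text = re.sub(r'\n+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    # Rejoin words hyphenated across line breaks. Note: this also drops any
    # other hyphen that is followed by whitespace (e.g. "pre- and" -> "preand").
    text = re.sub(r'-\s+', '', text)
    return text.strip()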

for doc in documents:
    doc["text"] = clean_text(doc["text"])

print("Sample cleaned text:\n")
print(documents[0]["text"][:500])

Sample cleaned text:

DEEP LEARNING STUDY NOTES All credits go to L. Fei-Fei, A. Karpathy, J.Johnson teachers of the CS231n course. Thank you for this amazing course!! by Albert Pumarola
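
In `clean_text`, the hyphen rule runs after newlines have already been collapsed to spaces, so it removes every hyphen that is followed by whitespace, not only those at line breaks. The sketch below shows one possible alternative that joins a word only when the hyphen sits directly before a line break, and does so before whitespace is collapsed. The name `clean_text_dehyphenate` is hypothetical, and this variant is not what produced the results reported in this notebook; switching to it would change the chunk text and therefore the downstream numbers.

```python
import re


def clean_text_dehyphenate(text: str) -> str:
    """Variant of clean_text that only joins words split across a line break."""
    # "implemen-\ntation" -> "implementation"; hyphens not followed by a
    # newline (e.g. "pre- and") are left untouched.
    text = re.sub(r'(\w)-[ \t]*\n[ \t]*(\w)', r'\1\2', text)
    # Collapse all remaining whitespace, including newlines, to single spaces.
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


sample = (
    "Self-attention is an implemen-\n"
    "tation detail of pre- and post-norm\n"
    "Transformer blocks."
)
print(clean_text_dehyphenate(sample))
```

One limitation remains: a genuine compound that happens to break at its hyphen (for example "self-" at the end of a line followed by "attention") is also joined, because the two cases cannot be told apart from the text alone.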


In [75]:
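def chunk_text(text, chunk_size=400, overlap=50):
    # Sizes are counted in whitespace-separated words, not model tokens.
    words = text.split()
    chunks = []

    # Slide a chunk_size-word window forward by (chunk_size - overlap) words,
    # so consecutive chunks share `overlap` words.
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)

    return chunks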


all_chunks = []
chunk_metadata = []

for doc in documents:
    chunks = chunk_text(doc["text"])
    for chunk in chunks:
        all_chunks.append(chunk)
        chunk_metadata.append({
            "source": doc["source"],
            "page": doc["page"]
        })

print("Total chunks created:", len(all_chunks))

Total chunks created: 674
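
`chunk_text` advances by `chunk_size - overlap` words, so with the defaults each new chunk starts 350 words after the previous one. A side effect is that a page with 351 to 400 words yields a second chunk that lies entirely inside the first. The following sketch shows one way to avoid that by stopping as soon as a window reaches the end of the text. The name `chunk_words` is hypothetical, small sizes are used for readability, and this variant was not used for the results in this notebook.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 400, overlap: int = 50) -> List[str]:
    """Overlapping word-window chunker that skips a fully redundant tail chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Once this window covers the last word, any further window would
        # only repeat words that are already inside this chunk.
        if start + chunk_size >= len(words):
            break
    return chunks


toy_text = " ".join(f"w{i}" for i in range(10))
toy_chunks = chunk_words(toy_text, chunk_size=4, overlap=1)
print(toy_chunks)
```

With 10 words, a window of 4, and an overlap of 1, the sketch produces three chunks; the original loop would add a fourth chunk containing only `w9`, which is already the last word of the third.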


In [76]:
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chunk_embeddings = embedding_model.encode(
    all_chunks,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True
)

print("Embedding matrix shape:", chunk_embeddings.shape)

Embedding matrix shape: (674, 384)


In [77]:
dimension = chunk_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(chunk_embeddings)

print("Total vectors in index:", index.ntotal)

Total vectors in index: 674


In [78]:
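def retrieve(query, top_k=5):
    # Embed the query with the same model that embedded the chunks.
    query_embedding = embedding_model.encode([query], convert_to_numpy=True)
    # IndexFlatL2 returns squared L2 distances; only the indices are used here.
    distances, indices = index.search(query_embedding, top_k)

    results = []
    for idx in indices[0]:
        results.append({
            "chunk": all_chunks[idx],
            # Full metadata dict for the chunk: {"source": ..., "page": ...}
            "source": chunk_metadata[idx]
        })

    return results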
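
`retrieve` drops the distances returned by `index.search` and stores the whole metadata dictionary under the `"source"` key, which is why later printouts show `Source: {'source': ..., 'page': ...}`. The sketch below is one possible formatting step that keeps the distance and flattens the metadata. `format_hits` and the toy data are hypothetical; the function takes one row of distances and indices as plain sequences so that it runs without FAISS. It skips negative indices on the assumption that the index pads missing results with `-1`, as FAISS does when fewer than `k` vectors are available.

```python
from typing import Dict, List, Sequence


def format_hits(distances: Sequence[float],
                indices: Sequence[int],
                chunks: List[str],
                metadata: List[Dict]) -> List[Dict]:
    """Pair each retrieved chunk with its rank, file, page, and distance."""
    hits = []
    for rank, (dist, idx) in enumerate(zip(distances, indices), start=1):
        if idx < 0:
            # Placeholder entry: no vector was found for this slot.
            continue
        hits.append({
            "rank": rank,
            "chunk": chunks[idx],
            "source": metadata[idx]["source"],
            "page": metadata[idx]["page"],
            "distance": float(dist),
        })
    return hits


# Hypothetical stand-ins for all_chunks / chunk_metadata and one search row.
toy_chunks = ["chunk about CNNs", "chunk about attention"]
toy_meta = [{"source": "a.pdf", "page": 0}, {"source": "b.pdf", "page": 3}]
toy_hits = format_hits([0.42, 0.87, 3.4e38], [1, 0, -1], toy_chunks, toy_meta)
for hit in toy_hits:
    print(hit["rank"], hit["source"], hit["page"], hit["distance"])
```

Inside the notebook, such a helper could be called as `format_hits(distances[0], indices[0], all_chunks, chunk_metadata)`, though downstream cells that read `doc["source"]` as a dictionary would then need adjusting.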

In [79]:
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large").to(device)

print("Generator loaded on:", device)


Generator loaded on: cuda


In [80]:
PROMPT_TEMPLATE = """
You are an AI assistant answering questions about AI and Machine Learning.

Use the provided context to answer the question clearly and completely.
Write a short explanatory paragraph (3-5 sentences).
Do not answer with a single word.
Do not copy raw fragments.
Explain the concept in your own words using the context.

Context:
{context}

Question:
{question}

Answer:
"""

In [81]:
def build_rag_prompt(question, retrieved_docs, max_context_tokens=800):
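    """
    Build a prompt whose context stays within max_context_tokens, leaving room
    for the template and question under the 1024-token cap applied at
    generation time (max_length=1024). That cap is a choice made in this
    notebook; the tokenizer itself reports a default maximum of 512 tokens.
    Uses the generator's tokenizer for token-level truncation.
    """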
    context_parts = []
    used_tokens = 0

    for doc in retrieved_docs:
        chunk_text = doc["chunk"]
        chunk_ids = tokenizer.encode(chunk_text, add_special_tokens=False)

        if used_tokens + len(chunk_ids) <= max_context_tokens:
            context_parts.append(chunk_text)
            used_tokens += len(chunk_ids)
        else:
            remaining = max_context_tokens - used_tokens
            if remaining > 30:
                truncated_text = tokenizer.decode(chunk_ids[:remaining], skip_special_tokens=True)
                context_parts.append(truncated_text)
                used_tokens += remaining
            break

    context = "\n\n".join(context_parts)
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return prompt


GENERATION_CONFIG = {
    "max_new_tokens": 180,
    "min_new_tokens": 40,
    "num_beams": 4,
    "length_penalty": 1.1,
    "no_repeat_ngram_size": 3,
    "early_stopping": True,
}


def generate_rag_answer(question, top_k=5):
    """RAG pipeline: Retrieve → Build token-aware prompt → Generate."""
    retrieved_docs = retrieve(question, top_k)
    prompt = build_rag_prompt(question, retrieved_docs)

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(device)
    outputs = model.generate(**inputs, **GENERATION_CONFIG)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return answer, retrieved_docs


# Quick check: verify prompt fits
test_retrieved = retrieve("What is attention?", top_k=5)
test_prompt = build_rag_prompt("What is attention?", test_retrieved)
test_tokens = len(tokenizer.encode(test_prompt))
print(f"✓ RAG generation function defined")
print(f"  Test prompt: {test_tokens} tokens (limit: 1024)")
print(f"  Fits: {'YES ✓' if test_tokens <= 1024 else 'NO ✗'}")
print(f"\n--- Prompt preview (first 400 chars) ---")
print(test_prompt[:400])

Token indices sequence length is longer than the specified maximum sequence length for this model (546 > 512). Running this sequence through the model will result in indexing errors


✓ RAG generation function defined
  Test prompt: 546 tokens (limit: 1024)
  Fits: YES ✓

--- Prompt preview (first 400 chars) ---

You are an AI assistant answering questions about AI and Machine Learning.

Use the provided context to answer the question clearly and completely.
Write a short explanatory paragraph (3-5 sentences).
Do not answer with a single word.
Do not copy raw fragments.
Explain the concept in your own words using the context.

Context:
Attention is weighted averaging, which lets you do lookups! 11 Attenti


In [82]:
def generate_baseline_answer(question):
    """Baseline: Generate answer WITHOUT retrieval (parametric knowledge only)."""
    prompt = f"Answer the question clearly in 3-5 sentences.\n\nQuestion:\n{question}\n\nAnswer:"

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(device)
    outputs = model.generate(**inputs, **GENERATION_CONFIG)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return answer

print("✓ Baseline generation function defined")

✓ Baseline generation function defined


In [83]:
demo_questions = [
    "What is the key innovation of the Transformer architecture?",
    "What is backpropagation?"
]

for question in demo_questions:
    print("=" * 70)
    print("QUESTION:", question)
    print("=" * 70)

    baseline_answer = generate_baseline_answer(question)
    rag_answer, sources = generate_rag_answer(question)

    print(f"\nBASELINE ANSWER (no retrieval):\n  {baseline_answer}")
    print(f"\nRAG ANSWER (with retrieval):\n  {rag_answer}")
    print(f"\nSOURCES: {[s['source'] for s in sources[:3]]}")
    print()

QUESTION: What is the key innovation of the Transformer architecture?

BASELINE ANSWER (no retrieval):
  This is the key innovation of the Transformer architecture. Therefore the final question is: what are the key innovations of the transformer architecture?. It's impossible to answer this question because there are multiple possible answers. The final answer may not be the best one for your situation.

RAG ANSWER (with retrieval):
  Parallelizability allows for efficient pretraining, and have made them the de-facto standard. On this popular aggregate benchmark, for example: All top models are Transformer (and pretraining)-based. More results Thursday when we discuss pretraining.

SOURCES: [{'source': 'cs224n_transformers_2024.pdf', 'page': 68}, {'source': 'cs224n_transformers_2024.pdf', 'page': 62}, {'source': 'cs224n_transformers_2024.pdf', 'page': 52}]

QUESTION: What is backpropagation?

BASELINE ANSWER (no retrieval):
  Backpropagation is a technique used by computer programmers 

In [84]:
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

def compute_rouge(pred, ref):
    return scorer.score(ref, pred)['rougeL'].fmeasure

def compute_cosine(pred, ref):
    emb = embedding_model.encode([pred, ref], convert_to_numpy=True)
    return float(cosine_similarity([emb[0]], [emb[1]])[0][0])

In [85]:
test_data = [
    {
        "question": "What is the key innovation of the Transformer architecture?",
        "reference": "The key innovation of the Transformer is the self-attention mechanism, which allows the model to process all positions in a sequence simultaneously rather than sequentially, eliminating the need for recurrence and enabling better parallelization."
    },
    {
        "question": "How does multi-head attention work in Transformers?",
        "reference": "Multi-head attention runs multiple attention functions in parallel, each with different learned linear projections of queries, keys, and values. The outputs from each head are concatenated and linearly projected to produce the final output."
    },
    {
        "question": "What is backpropagation?",
        "reference": "Backpropagation is an algorithm for computing gradients of the loss function with respect to each weight in a neural network by applying the chain rule of calculus, propagating the error signal backward from the output layer through hidden layers."
    },
    {
        "question": "What is the purpose of positional encoding in Transformers?",
        "reference": "Positional encoding injects information about the position of tokens in the sequence since the Transformer has no inherent notion of order. The original paper uses sinusoidal functions of different frequencies added to the input embeddings."
    },
    {
        "question": "What is the vanishing gradient problem?",
        "reference": "The vanishing gradient problem occurs when gradients become extremely small as they propagate backward through many layers of a deep neural network, making it difficult for earlier layers to learn effectively."
    },
    {
        "question": "How do Convolutional Neural Networks process images?",
        "reference": "CNNs process images by applying learnable convolutional filters across the input to detect local features like edges and textures. Through successive convolutional and pooling layers, the network builds hierarchical feature representations."
    },
    {
        "question": "What is the softmax function and where is it used in attention?",
        "reference": "The softmax function converts raw scores into a probability distribution by exponentiating each score and normalizing. In attention mechanisms, softmax is applied to the query-key dot products to produce attention weights that determine how much each value contributes."
    },
    {
        "question": "What is dropout regularization and why is it used?",
        "reference": "Dropout is a regularization technique where randomly selected neurons are temporarily removed during training. This prevents co-adaptation and forces the network to learn more robust features, reducing overfitting."
    },
    {
        "question": "What is the difference between self-attention and cross-attention?",
        "reference": "In self-attention, queries, keys, and values all come from the same sequence. In cross-attention, queries come from one sequence while keys and values come from another, enabling the decoder to attend to the encoder output."
    },
    {
        "question": "How does the encoder-decoder architecture work in sequence-to-sequence models?",
        "reference": "The encoder processes the input sequence and produces hidden representations. The decoder generates output tokens one at a time, using cross-attention to attend to the encoder states. This handles inputs and outputs of different lengths."
    }
]

print(f"✓ Defined {len(test_data)} test questions with reference answers")
for i, item in enumerate(test_data):
    print(f"  Q{i+1}: {item['question'][:60]}...")

✓ Defined 10 test questions with reference answers
  Q1: What is the key innovation of the Transformer architecture?...
  Q2: How does multi-head attention work in Transformers?...
  Q3: What is backpropagation?...
  Q4: What is the purpose of positional encoding in Transformers?...
  Q5: What is the vanishing gradient problem?...
  Q6: How do Convolutional Neural Networks process images?...
  Q7: What is the softmax function and where is it used in attenti...
  Q8: What is dropout regularization and why is it used?...
  Q9: What is the difference between self-attention and cross-atte...
  Q10: How does the encoder-decoder architecture work in sequence-t...


In [86]:
results = []

for i, item in enumerate(test_data):
    q = item["question"]
    ref = item["reference"]

    print(f"Processing Q{i+1}/{len(test_data)}: {q[:50]}...")

    rag_ans, sources = generate_rag_answer(q)
    base_ans = generate_baseline_answer(q)

    results.append({
        "question": q,
        "reference": ref,
        "rag_answer": rag_ans,
        "baseline_answer": base_ans,
        "rag_rouge": compute_rouge(rag_ans, ref),
        "baseline_rouge": compute_rouge(base_ans, ref),
        "rag_cosine": compute_cosine(rag_ans, ref),
        "baseline_cosine": compute_cosine(base_ans, ref),
        "top_source": sources[0]["source"] if sources else "N/A"
    })

print(f"\n✓ All {len(results)} questions processed!")

df = pd.DataFrame(results)
df[["question", "rag_rouge", "baseline_rouge", "rag_cosine", "baseline_cosine"]]

Processing Q1/10: What is the key innovation of the Transformer arch...
Processing Q2/10: How does multi-head attention work in Transformers...
Processing Q3/10: What is backpropagation?...
Processing Q4/10: What is the purpose of positional encoding in Tran...
Processing Q5/10: What is the vanishing gradient problem?...
Processing Q6/10: How do Convolutional Neural Networks process image...
Processing Q7/10: What is the softmax function and where is it used ...
Processing Q8/10: What is dropout regularization and why is it used?...
Processing Q9/10: What is the difference between self-attention and ...
Processing Q10/10: How does the encoder-decoder architecture work in ...

✓ All 10 questions processed!


|   | question | rag_rouge | baseline_rouge | rag_cosine | baseline_cosine |
|---|----------|-----------|----------------|------------|-----------------|
| 0 | What is the key innovation of the Transformer architecture? | 0.114286 | 0.289157 | 0.512925 | 0.543925 |
| 1 | How does multi-head attention work in Transformers? | 0.2 | 0.205882 | 0.532234 | 0.581249 |
| 2 | What is backpropagation? | 0.216216 | 0.222222 | 0.846071 | 0.792735 |
| 3 | What is the purpose of positional encoding in Transformers? | 0.16129 | 0.229508 | 0.700931 | 0.665509 |
| 4 | What is the vanishing gradient problem? | 0.176471 | 0.2 | 0.801976 | 0.81754 |
| 5 | How do Convolutional Neural Networks process images? | 0.205128 | 0.103896 | 0.527853 | -0.043547 |
| 6 | What is the softmax function and where is it used in attention? | 0.289855 | 0.222222 | 0.731861 | 0.716506 |
| 7 | What is dropout regularization and why is it used? | 0.162791 | 0.155844 | 0.794623 | 0.702842 |
| 8 | What is the difference between self-attention and cross-attention? | 0.195652 | 0.162162 | 0.554893 | 0.310571 |
| 9 | How does the encoder-decoder architecture work in sequence-to-sequence models? | 0.307692 | 0.233333 | 0.636457 | 0.634363 |


In [87]:
def show_retrieval_details(question, top_k=5):
    retrieved = retrieve(question, top_k)

    print("QUESTION:", question)
    print("\nTop Retrieved Chunks:\n")

    for i, doc in enumerate(retrieved):
        print(f"--- Rank {i+1} ---")
        print("Source:", doc["source"])
        print("Preview:", doc["chunk"][:300], "...\n")

In [88]:
show_retrieval_details("What is backpropagation?")

QUESTION: What is backpropagation?

Top Retrieved Chunks:

--- Rank 1 ---
Source: {'source': 'neural_networks_backprop.pdf', 'page': 4}
Preview: Fei-Fei Li, Ehsan AdeliLecture 4 April 11, 2024Administrative: Discussion SectionDiscussion section tomorrow (led by Lucas Leanza):Backpropagation 5 ...

--- Rank 2 ---
Source: {'source': 'neural_networks_backprop.pdf', 'page': 58}
Preview: Fei-Fei Li, Ehsan AdeliLecture 4 April 11, 202460 Backpropagation: a simple example ...

--- Rank 3 ---
Source: {'source': 'cs231n_full_notes.pdf', 'page': 32}
Preview: Chapter 7 Backpropagation Introduction Motivation In this section we will develop expertise with an intuitive understanding of backpropagation, which is a way of computing gradients of expressions through recursive application ofchain rule . Understanding of this process and its subtleties is critic ...

--- Rank 4 ---
Source: {'source': 'neural_networks_backprop.pdf', 'page': 0}
Preview: Fei-Fei Li, Ehsan AdeliLecture 4 April 11, 20241Lectu

In [89]:
# Build results_df from already-computed results (no need to re-generate)
results_df = pd.DataFrame([{
    "Question": r["question"][:60] + "...",
    "RAG ROUGE-L": round(r["rag_rouge"], 4),
    "Baseline ROUGE-L": round(r["baseline_rouge"], 4),
    "RAG Cosine": round(r["rag_cosine"], 4),
    "Baseline Cosine": round(r["baseline_cosine"], 4),
    "ROUGE Δ": round(r["rag_rouge"] - r["baseline_rouge"], 4),
    "Cosine Δ": round(r["rag_cosine"] - r["baseline_cosine"], 4),
} for r in results])

# Print answer comparison for each question
print("=" * 80)
print("         RAG vs BASELINE: ANSWER COMPARISON")
print("=" * 80)
for i, r in enumerate(results):
    print(f"\nQ{i+1}: {r['question']}")
    print(f"  Baseline: {r['baseline_answer']}")
    print(f"  RAG:      {r['rag_answer']}")
    print(f"  Source:   {r['top_source']}")
    print(f"  ROUGE-L:  baseline={r['baseline_rouge']:.4f}  RAG={r['rag_rouge']:.4f}")
    print(f"  Cosine:   baseline={r['baseline_cosine']:.4f}  RAG={r['rag_cosine']:.4f}")

         RAG vs BASELINE: ANSWER COMPARISON

Q1: What is the key innovation of the Transformer architecture?
  Baseline: This is the key innovation of the Transformer architecture. Therefore the final question is: what are the key innovations of the transformer architecture?. It's impossible to answer this question because there are multiple possible answers. The final answer may not be the best one for your situation.
  RAG:      Parallelizability allows for efficient pretraining, and have made them the de-facto standard. On this popular aggregate benchmark, for example: All top models are Transformer (and pretraining)-based. More results Thursday when we discuss pretraining.
  Source:   {'source': 'cs224n_transformers_2024.pdf', 'page': 68}
  ROUGE-L:  baseline=0.2892  RAG=0.1143
  Cosine:   baseline=0.5439  RAG=0.5129

Q2: How does multi-head attention work in Transformers?
  Baseline: Multi-Head Attenuation works in Transformers by focusing on one head at a time. Therefore the f

In [90]:
results_df

|   | Question | RAG ROUGE-L | Baseline ROUGE-L | RAG Cosine | Baseline Cosine | ROUGE Δ | Cosine Δ |
|---|----------|-------------|------------------|------------|-----------------|---------|----------|
| 0 | What is the key innovation of the Transformer architecture?... | 0.1143 | 0.2892 | 0.5129 | 0.5439 | -0.1749 | -0.031 |
| 1 | How does multi-head attention work in Transformers?... | 0.2 | 0.2059 | 0.5322 | 0.5812 | -0.0059 | -0.049 |
| 2 | What is backpropagation?... | 0.2162 | 0.2222 | 0.8461 | 0.7927 | -0.006 | 0.0533 |
| 3 | What is the purpose of positional encoding in Transformers?... | 0.1613 | 0.2295 | 0.7009 | 0.6655 | -0.0682 | 0.0354 |
| 4 | What is the vanishing gradient problem?... | 0.1765 | 0.2 | 0.802 | 0.8175 | -0.0235 | -0.0156 |
| 5 | How do Convolutional Neural Networks process images?... | 0.2051 | 0.1039 | 0.5279 | -0.0435 | 0.1012 | 0.5714 |
| 6 | What is the softmax function and where is it used in attenti... | 0.2899 | 0.2222 | 0.7319 | 0.7165 | 0.0676 | 0.0154 |
| 7 | What is dropout regularization and why is it used?... | 0.1628 | 0.1558 | 0.7946 | 0.7028 | 0.0069 | 0.0918 |
| 8 | What is the difference between self-attention and cross-atte... | 0.1957 | 0.1622 | 0.5549 | 0.3106 | 0.0335 | 0.2443 |
| 9 | How does the encoder-decoder architecture work in sequence-t... | 0.3077 | 0.2333 | 0.6365 | 0.6344 | 0.0744 | 0.0021 |


In [91]:
print("=" * 60)
print("           EVALUATION SUMMARY")
print("=" * 60)

avg_rag_rouge = results_df["RAG ROUGE-L"].mean()
avg_base_rouge = results_df["Baseline ROUGE-L"].mean()
avg_rag_cos = results_df["RAG Cosine"].mean()
avg_base_cos = results_df["Baseline Cosine"].mean()

print(f"\n  ROUGE-L    →  Baseline: {avg_base_rouge:.4f}  |  RAG: {avg_rag_rouge:.4f}  |  Δ: {avg_rag_rouge - avg_base_rouge:+.4f}")
print(f"  Cosine Sim →  Baseline: {avg_base_cos:.4f}  |  RAG: {avg_rag_cos:.4f}  |  Δ: {avg_rag_cos - avg_base_cos:+.4f}")

rouge_wins = sum(1 for r in results if r["rag_rouge"] > r["baseline_rouge"])
cosine_wins = sum(1 for r in results if r["rag_cosine"] > r["baseline_cosine"])
print(f"\n  RAG wins on ROUGE-L:    {rouge_wins}/{len(results)} questions")
print(f"  RAG wins on Cosine Sim: {cosine_wins}/{len(results)} questions")

           EVALUATION SUMMARY

  ROUGE-L    →  Baseline: 0.2024  |  RAG: 0.2030  |  Δ: +0.0005
  Cosine Sim →  Baseline: 0.5722  |  RAG: 0.6640  |  Δ: +0.0918

  RAG wins on ROUGE-L:    5/10 questions
  RAG wins on Cosine Sim: 7/10 questions
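
With only 10 questions, a single item can dominate an average. As a quick sensitivity check on the cosine result, the sketch below recomputes the mean improvement from the per-question values in the results table, then compares it with the median and with the mean after leaving out the question with the largest gain. The lists are copied from the rounded table values; the variable names are illustrative.

```python
from statistics import mean, median

# Per-question cosine similarity, copied from the rounded results table (Q1..Q10).
rag_cosine = [0.5129, 0.5322, 0.8461, 0.7009, 0.8020,
              0.5279, 0.7319, 0.7946, 0.5549, 0.6365]
baseline_cosine = [0.5439, 0.5812, 0.7927, 0.6655, 0.8175,
                   -0.0435, 0.7165, 0.7028, 0.3106, 0.6344]

deltas = [r - b for r, b in zip(rag_cosine, baseline_cosine)]

# Position of the question with the single largest RAG gain.
largest = max(range(len(deltas)), key=lambda i: deltas[i])
without_largest = [d for i, d in enumerate(deltas) if i != largest]

print(f"mean delta:               {mean(deltas):+.4f}")
print(f"median delta:             {median(deltas):+.4f}")
print(f"largest gain:             Q{largest + 1} ({deltas[largest]:+.4f})")
print(f"mean delta without it:    {mean(without_largest):+.4f}")
```

On these numbers, the mean gain of about +0.092 falls to about +0.039 once Q6 (the CNN question, where the baseline scored -0.0435) is excluded, and the median gain is about +0.025. The direction of the cosine result holds, but much of its size comes from one question.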


In [92]:
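def retrieval_accuracy(test_data, top_k=5):
    """Loose proxy: a hit means any of the first three reference words occurs
    (as a substring) in the retrieved text. Common words such as "The" or "is"
    match almost any chunk, so this score saturates easily."""
    hits = 0
    total = 0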

    for item in test_data:
        q = item["question"]
        expected_keywords = item["reference"].split()[:3]  # simple keyword proxy

        retrieved = retrieve(q, top_k)
        combined_text = " ".join([doc["chunk"] for doc in retrieved])

        if any(word.lower() in combined_text.lower() for word in expected_keywords):
            hits += 1

        total += 1

    return hits / total

In [93]:
print("Approximate Retrieval Accuracy:", retrieval_accuracy(test_data))

Approximate Retrieval Accuracy: 1.0
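
The score of 1.0 should be read with care: the proxy counts a hit whenever any of the first three reference words appears as a substring of the retrieved text, and those words are often function words such as "The", "is", or "a". A somewhat stricter proxy, sketched below, measures what fraction of the reference's content words appear as whole words in the retrieved text. The function names, the small hand-picked `STOPWORDS` set, and the example strings are all assumptions for illustration; this is still a keyword heuristic and not a labeled relevance judgment.

```python
import re
from typing import Set

# Small hand-picked stopword list (an assumption; not exhaustive).
STOPWORDS: Set[str] = {
    "the", "a", "an", "is", "are", "of", "to", "in", "and", "that",
    "by", "for", "with", "on", "it", "as", "this", "which",
}


def content_words(text: str) -> Set[str]:
    """Lowercased alphanumeric words, minus stopwords and very short tokens."""
    return {
        w for w in re.findall(r"[a-z0-9]+", text.lower())
        if w not in STOPWORDS and len(w) > 2
    }


def keyword_recall(reference: str, retrieved_text: str) -> float:
    """Fraction of the reference's content words found in the retrieved text."""
    ref_words = content_words(reference)
    if not ref_words:
        return 0.0
    found = ref_words & content_words(retrieved_text)
    return len(found) / len(ref_words)


reference = "Dropout is a regularization technique that reduces overfitting."
on_topic = ("Dropout randomly removes units during training, "
            "a regularization technique against overfitting.")
off_topic = "The convolution is a sliding filter applied to the image."

print(keyword_recall(reference, on_topic))
print(keyword_recall(reference, off_topic))
```

Here the off-topic passage scores 0.0, whereas the first-three-words proxy would count it as a hit because "is" and "a" both occur in it. A per-question threshold on `keyword_recall`, or a check that the top-k results include an expected source file, would give a more informative retrieval figure.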
