## What is Retrieval-Augmented Generation (RAG)?

RAG is a technique that enhances Large Language Models (LLMs) by combining:
1. **Retrieval** – Search a knowledge base for the most relevant documents.
2. **Generation** – Feed those retrieved documents into an LLM to produce a context-aware answer.

### Why RAG?
- LLMs have a fixed knowledge cutoff: they know nothing about events after training or about private, external data.
- Retrieval allows dynamic access to updated or domain-specific knowledge.
- It reduces hallucinations by grounding responses in real data.

**Example:**
- Question: *"What is LoRA fine-tuning?"*
- RAG retrieves your Day 5 summary about LoRA → LLM generates a concise explanation.


### How Does LoRA Complement RAG?

While RAG (Retrieval-Augmented Generation) focuses on **grounding a language model's answers using external knowledge**, LoRA (Low-Rank Adaptation) enhances **how the model expresses that knowledge**.

### Why Combine Them?
- **RAG's Role**: Fetches and injects relevant, up-to-date information from a database or vector store.
- **LoRA's Role**: Fine-tunes only a small subset of the model's parameters to adapt its tone, style, or domain-specific understanding.

### Benefits of RAG + LoRA
1. **Domain-Specific RAG Pipelines**: Fine-tune the LLM via LoRA to speak in your company's language while RAG fetches fresh policies, research, or FAQs.
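2. **Lightweight & Efficient**: LoRA trains only a small fraction of the parameters (often around 1–2% or less, depending on the rank and on which layers are adapted), making it cost-effective for frequently updated domains.
3. **Reduced Hallucinations**: RAG grounds answers in retrieved sources, and LoRA shapes how they are presented. Neither guarantees correctness, so answer quality still depends on what the retriever finds.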
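
As a rough illustration of why LoRA is lightweight: for one frozen weight matrix of shape `d_out × d_in`, LoRA trains two low-rank factors with `rank * (d_in + d_out)` parameters in total. The sketch below does that arithmetic. The 768-dimensional layer and rank 8 are assumed values chosen for illustration, not taken from any model used in this notebook, and both helper functions are hypothetical names.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable LoRA parameters relative to the frozen d_out x d_in matrix."""
    return lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)


# Assumed example: a 768 x 768 projection adapted with rank 8
added = lora_trainable_params(768, 768, 8)
print(added)                                # 12288 extra trainable weights
print(f"{lora_fraction(768, 768, 8):.2%}")  # 2.08% of that one matrix
```

The share of the whole model is typically smaller still, because adapters are usually attached to only some of the weight matrices.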

**Example:**
- Build a RAG-based assistant for healthcare.
- Use LoRA fine-tuning to teach the LLM medical terminology and patient-friendly explanation styles.
- Result: Answers are both *accurate (via RAG)* and *professionally aligned (via LoRA)*.


## Setting Up the Retriever

A retriever is responsible for fetching the **top-k most relevant documents** from a knowledge base, given a user query.

In our case:
- Knowledge base = Wikipedia summaries on AI/ML/Data Science
- Representation = Sentence embeddings (vector form of summaries).
- Similarity metric = Cosine similarity.

We'll recreate the knowledge base:
1. Define a list of topics related to AI, ML, and Data Science.
2. Fetch their summaries using Python's `wikipedia` library.
3. Store the results as a JSON file inside an `artifacts/` folder.
4. Generate sentence embeddings for each summary using a Transformer model.

This knowledge base will serve as the retrieval source for our RAG pipeline.

Workflow:
1. Convert user query into an embedding (same model used for corpus).
2. Compute cosine similarity between query embedding and all stored embeddings.
3. Select top-k most similar documents to feed into the generator.


In [None]:
!pip install wikipedia
topics = [
    'Artificial Intelligence',
    'Machine Learning',
    'Deep Learning',
    'Neural Networks',
    'Generative AI',
    'Computer Vision',
    'Large Language Model',
    'Retrieval-augmented generation',
    'Object Detection',
    'Face Recognition',
    'Natural Language Processing',
    'Image Processing',
    'Data Science',
    'Data Mining',
    'Big Data',
    'Data Analytics',
    'Predictive Analytics',
    'Statistical Modeling',
    'Data Visualization',
    'Exploratory Data Analysis',
    'Data Cleaning',
    'ETL (Extract Transform Load)',
    'Business Intelligence',
    'Data Warehousing',
    'Feature Engineering',
    'Time Series Analysis',
    'Reinforcement Learning',
    'Anomaly Detection',
    'Data Governance',
    'Data Ethics',
    'Cloud Computing for Data Science'
]

In [None]:
import wikipedia
import json
import os

# Create artifacts folder if not exists
os.makedirs("artifacts", exist_ok=True)

successful_topics = []  # track which topics succeeded
corpus = []             # your summaries

for topic in topics:
    try:
        search_results = wikipedia.search(topic)
        if not search_results:
            print(f"No results for: {topic}")
            continue
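        # auto_suggest=False: load the exact search hit rather than a "suggested" title
        page = wikipedia.page(search_results[0], auto_suggest=False)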
        corpus.append({
            "requested_topic": topic,
            "fetched_title": page.title,
            "summary": page.summary
        })
        successful_topics.append(page.title)  # store actual page title
        print(f"Fetched: {page.title}")
    except wikipedia.exceptions.DisambiguationError as e:
        print(f"Skipped {topic} due to disambiguation: {e}")
    except wikipedia.exceptions.PageError:
        print(f"Page not found for: {topic}")

print(f"\nTotal successful topics: {len(successful_topics)}")

# Save corpus
with open("artifacts/wikipedia_corpus_rag.json", "w") as f:
    json.dump(corpus, f, indent=2)

print(f"\nTotal valid articles fetched: {len(corpus)}")

## Generate Embeddings for Knowledge Base

To enable fast and meaningful retrieval in RAG, we convert each document's summary
into a dense numerical representation (embedding). These embeddings capture
semantic meaning, allowing similar concepts to be close in vector space.

- We use a pre-trained sentence embedding model from Hugging Face
  (e.g., `sentence-transformers/all-MiniLM-L6-v2`).
- Each summary is transformed into a 384-dimensional vector.
- These vectors are saved as a NumPy array whose row order matches the corpus, so row *i* belongs to document *i*.


In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import json

# Load the previously saved corpus
with open("artifacts/wikipedia_corpus_rag.json", "r") as f:
    corpus = json.load(f)

# Extract summaries
texts = [item["summary"] if isinstance(item, dict) else item for item in corpus]

# Load a sentence embedding model (lightweight, free-tier friendly)
embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Generate embeddings
embeddings = embedder.encode(texts, convert_to_numpy=True, show_progress_bar=True)

# Save embeddings
np.save("artifacts/wikipedia_embeddings_rag.npy", embeddings)

print(f"Generated embeddings for {len(texts)} documents.")

## Visualize Embeddings using PCA

To confirm the quality of our generated embeddings, we reduce their dimensionality from 384 → 2
using **Principal Component Analysis (PCA)**. This allows us to:
- Visually inspect how different topics are distributed in semantic space.
- Check for any obvious clustering or overlap.

Interpretation:
- Each point represents one Wikipedia topic summary.
- Closer points → more semantically similar content.
- Farther points → less related content.
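
One caveat when reading the plot: two principal components may retain only part of the variance of a 384-dimensional space, so distances in 2D are approximate. Below is a minimal sketch of how to check this, using NumPy's SVD on random stand-in data. The `fake_embeddings` array and the `explained_variance_ratio` helper are assumptions for illustration; substitute the real embedding matrix.

```python
import numpy as np


def explained_variance_ratio(X: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Share of total variance captured by the first principal components."""
    X_centered = X - X.mean(axis=0)          # PCA operates on centered data
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2         # proportional to per-component variance
    return variances[:n_components] / variances.sum()


# Stand-in for the real (n_documents, 384) embedding matrix
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(30, 384))

ratio = explained_variance_ratio(fake_embeddings)
print(ratio, ratio.sum())  # a low sum means the 2D plot hides most of the structure
```

With scikit-learn, the fitted `pca` object exposes the same numbers as `pca.explained_variance_ratio_`.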


In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np
import json

# Load corpus and embeddings
with open("artifacts/wikipedia_corpus_rag.json", "r") as f:
    corpus = json.load(f)

embeddings = np.load("artifacts/wikipedia_embeddings_rag.npy")

# Extract topic names using the "fetched_title" key written when the corpus was saved
# (the corpus has no "title" or "topic" key, so those lookups always fell back to "Topic N")
topics = [item.get("fetched_title", f"Topic {idx}") if isinstance(item, dict)
          else f"Topic {idx}" for idx, item in enumerate(corpus)]

# Reduce dimensions using PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

# Plot
plt.figure(figsize=(10, 8))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c='blue', alpha=0.6)

# Annotate each point (add "..." only when a label is actually shortened)
for i, topic in enumerate(topics):
    label = topic if len(topic) <= 20 else topic[:20] + "..."
    plt.annotate(label, (embeddings_2d[i, 0] + 0.02, embeddings_2d[i, 1]))

plt.title("Wikipedia RAG Corpus – PCA Visualization")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.grid(True)
plt.show()

## Build a Retriever Function

The retriever is the backbone of a RAG system. Its job:
- Take a **query** as input.
- Convert it into an embedding (using the same model as the corpus).
- Compute **cosine similarity** between the query embedding and all document embeddings.
- Return the **top-k most relevant documents**.

Why cosine similarity?
- Measures how close two vectors are in terms of direction (not magnitude).
- Works well for high-dimensional embeddings like those produced by MiniLM/BERT.
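
A small sketch of the "direction, not magnitude" point, written in plain Python so it runs without the embedding model. Cosine similarity is `dot(a, b) / (|a| * |b|)`; the three vectors are made-up values for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
scaled = [10.0 * x for x in v]      # same direction, 10x the length
other = [3.0, -1.0, 0.5]            # different direction

print(round(cosine(v, scaled), 6))  # 1.0: scaling does not change the score
print(round(cosine(v, other), 6))   # lower, because the direction differs
```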

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import json
from sentence_transformers import SentenceTransformer

# Load corpus and embeddings
with open("artifacts/wikipedia_corpus_rag.json", "r") as f:
    corpus = json.load(f)

embeddings = np.load("artifacts/wikipedia_embeddings_rag.npy")
embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Retrieve function
def retrieve_top_k(query, k=5):
    query_embedding = embedder.encode([query], convert_to_numpy=True)
    similarities = cosine_similarity(query_embedding, embeddings)[0]
    top_k_idx = np.argsort(similarities)[::-1][:k]
    results = []
    for idx in top_k_idx:
        # The saved corpus stores the page title under "fetched_title"
        topic_name = corpus[idx].get("fetched_title", f"Topic {idx}") if isinstance(corpus[idx], dict) else f"Topic {idx}"
        summary = corpus[idx]["summary"] if isinstance(corpus[idx], dict) and "summary" in corpus[idx] else corpus[idx]
        results.append({
            "topic": topic_name,
            "similarity": float(similarities[idx]),
            "summary": summary
        })
    return results

# Test it
query = "How do neural networks learn?"
top_results = retrieve_top_k(query, k=5)
for r in top_results:
    print(f"Topic: {r['topic']}\nScore: {r['similarity']:.4f}\nSummary: {r['summary'][:150]}...\n")

Suppose we have:

Corpus (3 documents):
1. "Cats are small animals that like to sleep."
2. "Dogs are loyal pets and like to play."
3. "Neural networks are used in artificial intelligence."

Query: "Tell me about pets."

1. Encode Corpus (already done once and saved) - Each document is turned into an embedding vector.
2. Encode Query - The query "Tell me about pets." → [0.33, 0.87, 0.15, ...]
3. Calculate Cosine Similarity - Compare the query vector with each document vector:
  - Cosine(Query, Doc1) = 0.62
  - Cosine(Query, Doc2) = 0.91  ← highest similarity
  - Cosine(Query, Doc3) = 0.10
4. Sort and Pick Top-k:
  - Sort scores in descending order: [0.91, 0.62, 0.10]
  - Take top 2 indices → [Doc2, Doc1]
5. Return Results - The function returns something like:
```
[
    {"topic": "Dogs are loyal pets...", "similarity": 0.91},
    {"topic": "Cats are small animals...", "similarity": 0.62}
]
```

**Key Points:**
- `query_embedding = embedder.encode([query])` converts the search query into a vector.
- `cosine_similarity(query_embedding, embeddings)` finds how close each document is to the query.
- `np.argsort(similarities)[::-1][:k]` sorts in descending order and takes the top-k.
- The function returns both the score and the matching document summary.
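
The walkthrough above can be reproduced with hand-made vectors. In the sketch below, the 3-dimensional `doc_vectors` and `query_vector` are invented stand-ins for real embeddings, so the scores will not match the illustrative 0.91 / 0.62 / 0.10 figures. `top_k` is a hypothetical helper that mirrors the sort-and-slice step.

```python
import numpy as np

docs = [
    "Cats are small animals that like to sleep.",
    "Dogs are loyal pets and like to play.",
    "Neural networks are used in artificial intelligence.",
]

# Invented toy "embeddings": one row per document
doc_vectors = np.array([
    [0.6, 0.7, 0.1],
    [0.3, 0.9, 0.1],
    [0.9, 0.0, 0.4],
])
query_vector = np.array([0.2, 0.9, 0.1])


def top_k(query_vec, matrix, k=2):
    """Return (index, cosine score) pairs for the k most similar rows."""
    # Normalize rows and query so the dot product equals cosine similarity
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = matrix_unit @ query_unit
    order = np.argsort(scores)[::-1][:k]     # descending order, keep the first k
    return [(int(i), float(scores[i])) for i in order]


for idx, score in top_k(query_vector, doc_vectors, k=2):
    print(f"{score:.2f}  {docs[idx]}")
```

With these toy vectors the dog sentence ranks first and the cat sentence second, matching the ordering in the walkthrough.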



## Integrate Retriever with a Language Model (RAG Flow)

- **Goal:** Combine semantic search (retriever) with a language model (generator).
- **Retriever:** Finds top-k most relevant documents.
- **Generator (LLM):** Uses the retrieved context to craft a response.
- **Benefit:** The model does not need to "memorize" all facts. It retrieves them dynamically.

We will:
1. Accept a query from the user.
2. Retrieve top 2–3 summaries using cosine similarity.
3. Concatenate these summaries into a context string.
4. Pass this context + query to a text generation pipeline.

In [None]:
from transformers import pipeline

# Load a lightweight language model (can use flan-t5-base or similar)
rag_generator = pipeline("text2text-generation", model="google/flan-t5-base", device=-1)

def rag_answer(query, top_k=3, max_context_tokens=400):
    # Step 1: Retrieve top-k relevant summaries
    similarities = cosine_similarity(embedder.encode([query]), embeddings)[0]
    top_indices = np.argsort(similarities)[::-1][:top_k]
    retrieved_docs = [texts[i] for i in top_indices]

    # Step 2: Build context (truncate if too long)
    context = "\n".join(retrieved_docs)
    context_tokens = context.split()
    if len(context_tokens) > max_context_tokens:
        context = " ".join(context_tokens[:max_context_tokens])

    # Step 3: Combine query with context
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    # Step 4: Generate answer
    result = rag_generator(prompt, max_new_tokens=150, do_sample=True, temperature=0.7, top_p=0.9)
    return result[0]["generated_text"], [texts[i] for i in top_indices]


# Example
query = "What is the role of feature engineering in data science?"
print(rag_answer(query))

We hit the token-limit warning because the prompt (context + question) is still fed to the model with more than 512 tokens. Word-based truncation isn't enough, since one word often maps to several tokens. We need token-aware truncation using the tokenizer, and we should bypass the pipeline so we control the inputs precisely.

The revised version below:

- Builds the context within a token budget.
- Tokenizes with truncation to 512.
- Calls the model directly (no pipeline).
- Eliminates the warning.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# 1) Load generator model + tokenizer
tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
gen = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Helper: build context within a token budget (encoder side)
def build_context_token_budget(retrieved_docs, query, budget=480):
    # keep some room for instructions + question (we’ll enforce with tokenizer)
    parts = []
    used = len(tok.encode(f"Question: {query}\n", add_special_tokens=False))
    for doc in retrieved_docs:
        chunk = "\n---\n" + doc
        chunk_tokens = tok.encode(chunk, add_special_tokens=False)
        if used + len(chunk_tokens) > budget:
            remaining = budget - used
            if remaining <= 0:
                break
            # truncate this chunk to fit the budget
            chunk_tokens = chunk_tokens[:remaining]
            chunk = tok.decode(chunk_tokens, skip_special_tokens=True)
            parts.append(chunk)
            break
        parts.append(chunk)
        used += len(chunk_tokens)
    return "".join(parts)

# 2) RAG answer with token-aware truncation
def rag_answer(query, top_k=3, token_budget=480, max_new_tokens=128, temperature=0.7, top_p=0.9):
    # retrieve top-k by cosine similarity
    sims = cosine_similarity(embedder.encode([query], convert_to_numpy=True), embeddings)[0]
    top_idx = np.argsort(sims)[::-1][:top_k]
    retrieved_docs = [texts[i] for i in top_idx]

    # build context within a safe token budget
    context = build_context_token_budget(retrieved_docs, query, budget=token_budget)

    # final prompt
    prompt = (
        "Use the context to answer the question concisely. "
        "If the context is insufficient, say you don't know.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

    # tokenize with hard truncation to model’s max input length (512 for Flan-T5)
    inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=512)

    # generate
    outputs = gen.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=top_p
    )
    answer = tok.decode(outputs[0], skip_special_tokens=True)

    retrieved_info = [
        {"rank": r+1, "index": int(i), "similarity": float(sims[i])}
        for r, i in enumerate(top_idx)
    ]
    return answer, retrieved_info

# Example test
q = "Explain the importance of feature engineering in data science."
ans, used = rag_answer(q, top_k=3)
print("Answer:\n", ans)
print("\nRetrieved (rank, idx, sim):\n", used)


## Evaluate the RAG System's Performance

Now that the RAG pipeline is working with token-safe truncation, we will:
1. Test multiple **queries** to check the relevance of retrieved documents.
2. Inspect the **generated answers** for correctness and grounding.
3. View the **retrieved documents (top-k)** for each query.

This helps us:
- Understand whether truncation affects the retrieved context.
- Verify if the model is **using retrieved knowledge** effectively.
- Identify cases where RAG may fail or retrieve irrelevant content.
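
Eyeballing printed output scales poorly, so one option is a small retrieval metric. The sketch below computes hit rate@k: the share of queries whose expected document appears among the top-k results. The `expected_titles` mapping and the `fake_retrieved` data are hypothetical, hand-labelled values used only to show the calculation; in practice the ranked titles would come from the retriever.

```python
def hit_rate_at_k(retrieved_by_query, expected_by_query, k=3):
    """Fraction of queries whose expected title is among the top-k retrieved titles."""
    hits = 0
    for query, expected in expected_by_query.items():
        top_titles = retrieved_by_query.get(query, [])[:k]
        if expected in top_titles:
            hits += 1
    return hits / len(expected_by_query)


# Hypothetical hand-labelled ground truth: query -> title that should be retrieved
expected_titles = {
    "What is reinforcement learning?": "Reinforcement learning",
    "What is a large language model?": "Large language model",
}

# Hypothetical retriever output: query -> ranked list of titles
fake_retrieved = {
    "What is reinforcement learning?": [
        "Reinforcement learning", "Machine learning", "Deep learning",
    ],
    "What is a large language model?": [
        "Generative artificial intelligence", "Natural language processing", "Deep learning",
    ],
}

print(hit_rate_at_k(fake_retrieved, expected_titles, k=3))  # 0.5: one of two queries hit
```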

In [None]:
# Define a set of evaluation queries
eval_queries = [
    "What is reinforcement learning?",
    "How does data visualization help in analysis?",
    "Explain feature engineering in machine learning.",
    "What is a large language model?",
    "Describe data cleansing."
]

# Evaluate each query
for query in eval_queries:
    answer, retrieved = rag_answer(query, top_k=3)
    print(f"\n--- Query: {query} ---")
    print(f"Answer: {answer}")
    print("Retrieved Documents (Top 3):")
    for r in retrieved:
        print(f"  Rank {r['rank']}: Index {r['index']}, Similarity {r['similarity']:.4f}")

## Visualizing Retrieved Document Similarities

To better understand how the RAG pipeline scores the documents during retrieval:
- We will plot a **heatmap** of cosine similarity scores.
- Each row will represent a query.
- Each column will represent a rank position (Doc 1 = best match for that query), so the same column can refer to different documents in different rows.
- Darker cells = higher similarity (more relevant context).

This allows us to visually inspect:
- How strongly the top-ranked documents match each query, and how quickly the scores fall off with rank.
- Whether there is consistent retrieval behavior across queries.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Define the same evaluation queries
eval_queries = [
    "What is reinforcement learning?",
    "How does data visualization help in analysis?",
    "Explain feature engineering in machine learning.",
    "What is a large language model?",
    "Describe data cleansing."
]

# Collect similarity scores
heatmap_data = []

for query in eval_queries:
    # Retrieval only: there is no need to run the generator just to read similarity scores
    retrieved = retrieve_top_k(query, k=5)
    scores = [r["similarity"] for r in retrieved]
    heatmap_data.append(scores)

# Convert to numpy array
heatmap_array = np.array(heatmap_data)

# Plot heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(
    heatmap_array,
    annot=True,
    fmt=".2f",
    cmap="YlGnBu",
    xticklabels=[f"Doc {i+1}" for i in range(heatmap_array.shape[1])],
    yticklabels=[f"Q{i+1}" for i in range(heatmap_array.shape[0])],
)
plt.title("RAG Retrieved Document Similarity Scores")
plt.xlabel("Retrieved Documents")
plt.ylabel("Queries")
plt.show()

## Summary
- Built a **RAG pipeline** to answer questions using a Wikipedia-based corpus on AI/ML/Data Science.
- Steps covered:
  - Created a **Wikipedia knowledge base** with topics and summaries.
  - Converted the summaries into **embeddings** using a Transformer model.
  - Implemented **cosine similarity-based retrieval** to fetch top-k relevant documents.
  - Integrated retrieval results into a **Flan-T5 text generation pipeline**.
  - Added **visualizations** (heatmap & PCA plots) to understand document similarity.
- Key concepts:
  - **Embeddings** → Dense vector representations of text.
  - **Cosine Similarity** → Measures closeness between embeddings.
  - **Top-k Retrieval** → Selects the most relevant documents for context.
  - **RAG Workflow** → Retrieve → Combine Context → Generate Answer.
- Tech stack:
  - `transformers` (Hugging Face) for model & tokenizer
  - `sentence-transformers` for embeddings
  - `scikit-learn` & `numpy` for cosine similarity, top-k ranking, and PCA (`torch` serves only as the model backend)
  - `matplotlib` & `seaborn` for visualization
