# Problem 2 - rag_evaluation_prime_ministers

## Introduction

This notebook implements Question 2(b) of the assignment, which evaluates a Retrieval-Augmented Generation (RAG) system built on top of a custom Wikipedia index of all Prime Ministers of India. The goal is to measure how effectively a language model can answer factual political questions when supported by relevant retrieved context.

We begin by rebuilding an index of the Wikipedia articles for each Prime Minister. The articles are cleaned, split into chunks (1024-token windows with a 20-token overlap), and embedded with the BAAI/bge-small-en-v1.5 encoder. The chunks are stored in a LlamaIndex VectorStoreIndex.
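
To make the chunking parameters concrete, here is a minimal sketch of fixed-size windows with overlap. It is an illustration only: the helper name `chunk_tokens` and the toy token list are hypothetical, and the splitter that LlamaIndex actually applies is more elaborate than this plain sliding window.

```python
def chunk_tokens(tokens, chunk_size=1024, overlap=20):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens.

    Hypothetical helper for illustration; not the LlamaIndex splitter.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Toy input: 25 placeholder tokens, small window so the overlap is easy to see.
tokens = [f"t{i}" for i in range(25)]
chunks = chunk_tokens(tokens, chunk_size=10, overlap=2)
for i, chunk in enumerate(chunks):
    print(i, chunk[0], "...", chunk[-1], f"({len(chunk)} tokens)")
```

With the notebook's settings (1024 and 20), each window would advance by 1004 tokens, so consecutive chunks repeat only a short span of text at their boundary.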

The LLM used for generation is llama-3.1-8b-instant, served via the Groq API for fast inference.
The dataset used for evaluation is QA.xlsx, which contains factual question–answer pairs about Indian political history.

For each question, we retrieve the top-K most similar Wikipedia chunks, generate an answer with the LLM based on the retrieved context, and compare the output to the ground-truth answer using Jaccard similarity. Finally, we report the mean Jaccard score for topK = 1 and topK = 2, along with observations on retrieval effectiveness and model behavior.
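
The retrieval step can be pictured as ranking chunk embeddings by their similarity to the question embedding and keeping the best K. The sketch below assumes cosine similarity as the score and uses made-up 3-dimensional vectors and chunk names; real BGE embeddings have many more dimensions, and the index handles this ranking internally.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunk_vecs, k):
    """Return the names of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical toy embeddings, for illustration only.
toy_chunks = {
    "nehru_chunk": [0.9, 0.1, 0.0],
    "shastri_chunk": [0.1, 0.9, 0.1],
    "gandhi_chunk": [0.6, 0.5, 0.2],
}
query = [1.0, 0.2, 0.0]

top1 = top_k(query, toy_chunks, 1)
top2 = top_k(query, toy_chunks, 2)
print(top1, top2)
```

Moving from topK = 1 to topK = 2 only adds the next-best chunk to the context; whether that extra context helps or distracts the LLM is what the two mean Jaccard scores compare.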

---



In [1]:

!pip install -q groq llama-index llama-index-llms-groq llama-index-embeddings-huggingface sentence-transformers beautifulsoup4 requests tqdm pandas


In [None]:
import os
from llama_index.core import Settings
from llama_index.llms.groq import Groq
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Set your Groq API key
os.environ["GROQ_API_KEY"] = "-redacted-"

# Make sure no OpenAI key is set, so nothing falls back to the OpenAI defaults
os.environ.pop("OPENAI_API_KEY", None)

# HuggingFace embedding model
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

# Groq LLM
Settings.llm = Groq(
    model="llama-3.1-8b-instant",
    temperature=0,
)

# Chunking
Settings.chunk_size = 1024
Settings.chunk_overlap = 20

print("Settings configured: Groq llama-3.1-8b-instant + BGE embeddings.")


2025-11-22 10:28:15.485272: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Settings configured: Groq llama-3.1-8b-instant + BGE embeddings.


In [3]:
import requests
from bs4 import BeautifulSoup
from pathlib import Path
import time

PM_NAMES = [
    "Jawaharlal_Nehru",
    "Lal_Bahadur_Shastri",
    "Gulzarilal_Nanda",
    "Indira_Gandhi",
    "Morarji_Desai",
    "Charan_Singh",
    "Rajiv_Gandhi",
    "V._P._Singh",
    "Chandra_Shekhar",
    "P._V._Narasimha_Rao",
    "Atal_Bihari_Vajpayee",
    "H._D._Deve_Gowda",
    "Inder_Kumar_Gujral",
    "Manmohan_Singh",
    "Narendra_Modi",
]

WIKI_BASE = "https://en.wikipedia.org/wiki/"
OUT_DIR = Path("wikipedia_pm_pages")
OUT_DIR.mkdir(exist_ok=True)

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

def download_and_extract_wiki(name):
    url = WIKI_BASE + name
    print(f"Downloading {url}")
    r = requests.get(url, headers=HEADERS, timeout=30)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")

    # Strip non-content elements
    for el in soup(["script", "style", "aside", "footer", "nav", "sup", "table"]):
        el.decompose()

    content = soup.find("div", {"class": "mw-parser-output"})
    if content is None:
        return soup.get_text(separator="\n")

    paras = []
    for p in content.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5']):
        txt = p.get_text(strip=True)
        if txt:
            paras.append(txt)
    return "\n\n".join(paras)

downloaded = {}

for name in PM_NAMES:
    try:
        text = download_and_extract_wiki(name)
        fn = OUT_DIR / f"{name}.txt"
        fn.write_text(text, encoding="utf-8")
        downloaded[name] = str(fn)
        time.sleep(1)
    except Exception as e:
        print(f"Failed to download {name}: {e}")

print("Downloaded pages:", len(downloaded))
print("Keys:", list(downloaded.keys()))


Downloading https://en.wikipedia.org/wiki/Jawaharlal_Nehru
Downloading https://en.wikipedia.org/wiki/Lal_Bahadur_Shastri
Downloading https://en.wikipedia.org/wiki/Gulzarilal_Nanda
Downloading https://en.wikipedia.org/wiki/Indira_Gandhi
Downloading https://en.wikipedia.org/wiki/Morarji_Desai
Downloading https://en.wikipedia.org/wiki/Charan_Singh
Downloading https://en.wikipedia.org/wiki/Rajiv_Gandhi
Downloading https://en.wikipedia.org/wiki/V._P._Singh
Downloading https://en.wikipedia.org/wiki/Chandra_Shekhar
Downloading https://en.wikipedia.org/wiki/P._V._Narasimha_Rao
Downloading https://en.wikipedia.org/wiki/Atal_Bihari_Vajpayee
Downloading https://en.wikipedia.org/wiki/H._D._Deve_Gowda
Downloading https://en.wikipedia.org/wiki/Inder_Kumar_Gujral
Downloading https://en.wikipedia.org/wiki/Manmohan_Singh
Downloading https://en.wikipedia.org/wiki/Narendra_Modi
Downloaded pages: 15
Keys: ['Jawaharlal_Nehru', 'Lal_Bahadur_Shastri', 'Gulzarilal_Nanda', 'Indira_Gandhi', 'Morarji_Desai', 'Ch

In [4]:
from pathlib import Path
from llama_index.core import Document, VectorStoreIndex

docs = []
for name, path in downloaded.items():
    text = Path(path).read_text(encoding="utf-8")

    # Wrap each page in a Document and tag it with the PM page it came from
    docs.append(Document(text=text, metadata={"source": name}))

print("Total documents:", len(docs))
if docs:
    print("Example metadata of first doc:", docs[0].metadata)


Total documents: 15
Example metadata of first doc: {'source': 'Jawaharlal_Nehru'}


In [5]:
index = VectorStoreIndex.from_documents(docs)
print("Index built.")


Index built.


In [7]:
import pandas as pd

QA_PATH = "QA.xlsx"

qa_df = pd.read_excel(QA_PATH)
print("Columns in QA.xlsx:", qa_df.columns.tolist())
qa_df.head()


Columns in QA.xlsx: ['Question', 'Answer']


|   | Question | Answer |
|---|----------|--------|
| 0 | Who was the first Prime Minister of independen... | Jawaharlal Nehru |
| 1 | Which Prime Minister gave the famous slogan "J... | Lal Bahadur Shastri |
| 2 | Who was the first and only woman to serve as t... | Indira Gandhi |
| 3 | Who was the first Prime Minister of India not ... | Morarji Desai |
| 4 | Which Prime Minister of India never faced the ... | Charan Singh |


In [8]:
QUESTION_COL = "Question"
ANSWER_COL   = "Answer"

qa_df = qa_df[[QUESTION_COL, ANSWER_COL]].dropna().reset_index(drop=True)
print("Total questions:", len(qa_df))
qa_df.head()


Total questions: 42


|   | Question | Answer |
|---|----------|--------|
| 0 | Who was the first Prime Minister of independen... | Jawaharlal Nehru |
| 1 | Which Prime Minister gave the famous slogan "J... | Lal Bahadur Shastri |
| 2 | Who was the first and only woman to serve as t... | Indira Gandhi |
| 3 | Who was the first Prime Minister of India not ... | Morarji Desai |
| 4 | Which Prime Minister of India never faced the ... | Charan Singh |


In [9]:
import string

def normalize_text(text: str):
    """Lowercase, strip punctuation, split into tokens."""
    if not isinstance(text, str):
        text = str(text)
    text = text.lower()
    text = text.translate(str.maketrans(
        string.punctuation,
        " " * len(string.punctuation)
    ))
    return text.split()

def jaccard_similarity(pred: str, gold: str) -> float:
    """
    Jaccard = |words(pred) ∩ words(gold)| / |words(pred) ∪ words(gold)|
    """
    pred_tokens = set(normalize_text(pred))
    gold_tokens = set(normalize_text(gold))

    if len(pred_tokens) == 0 and len(gold_tokens) == 0:
        return 1.0
    if len(pred_tokens) == 0 or len(gold_tokens) == 0:
        return 0.0

    inter = pred_tokens & gold_tokens
    union = pred_tokens | gold_tokens
    return len(inter) / len(union)
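

A short worked example shows how this metric reacts to answer length. The sketch restates the same normalization in compact form so that it runs on its own; `token_set` and `jaccard` are hypothetical names used only here, and the predicted answers are invented for illustration rather than taken from the model.

```python
import string

# Map every punctuation character to a space, as normalize_text does.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def token_set(text):
    """Lowercase, replace punctuation with spaces, and return the set of words."""
    return set(str(text).lower().translate(_PUNCT_TO_SPACE).split())


def jaccard(pred, gold):
    """|words(pred) ∩ words(gold)| / |words(pred) ∪ words(gold)|."""
    p, g = token_set(pred), token_set(gold)
    if not p and not g:
        return 1.0  # two empty answers count as identical
    return len(p & g) / len(p | g)


gold = "Jawaharlal Nehru"
short_pred = "Jawaharlal Nehru."  # invented prediction: name only
verbose_pred = (  # invented prediction: a full sentence
    "The first Prime Minister of independent India was Jawaharlal Nehru."
)

print(jaccard(short_pred, gold))    # both words match, nothing extra
print(jaccard(verbose_pred, gold))  # 2 shared words out of 10 distinct words
```

A factually correct but full-sentence answer scores 2/10 here because every extra word enlarges the union. This is worth keeping in mind when reading the mean scores: a low Jaccard value does not by itself mean the answer was wrong.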


In [10]:
from tqdm import tqdm

def evaluate_topk(top_k: int):
    """
    For each QA pair:
      - retrieve the top_k most similar chunks from the index
      - generate an answer with RAG
      - compute Jaccard(pred, gold)
      - record the PM page each retrieved chunk came from (metadata['source'])
    """
    query_engine = index.as_query_engine(similarity_top_k=top_k)
    retriever    = index.as_retriever(similarity_top_k=top_k)

    results = []

    for _, row in tqdm(qa_df.iterrows(), total=len(qa_df)):
        question    = row[QUESTION_COL]
        gold_answer = row[ANSWER_COL]

        # Retrieval
        retrieved_nodes = retriever.retrieve(question)
        retrieved_sources = []
        for node in retrieved_nodes:
            # NodeWithScore vs Node handling
            meta = getattr(node, "metadata", None)
            if meta is None and hasattr(node, "node"):
                meta = getattr(node.node, "metadata", {})
            if meta is None:
                meta = {}
            src = meta.get("source", "UNKNOWN_SOURCE")
            retrieved_sources.append(src)

        # Generation (RAG answer)
        response = query_engine.query(question)
        pred_answer = str(response)

        # Jaccard similarity
        jac = jaccard_similarity(pred_answer, gold_answer)

        results.append({
            "top_k": top_k,
            "question": question,
            "gold_answer": gold_answer,
            "pred_answer": pred_answer,
            "jaccard": jac,
            "retrieved_sources": retrieved_sources,
        })

    # Guard against an empty QA set to avoid a division by zero
    mean_jac = sum(r["jaccard"] for r in results) / len(results) if results else 0.0
    return results, mean_jac


In [11]:
results_k1, mean_j_k1 = evaluate_topk(1)
results_k2, mean_j_k2 = evaluate_topk(2)

print(f"Mean Jaccard (topK=1): {mean_j_k1:.4f}")
print(f"Mean Jaccard (topK=2): {mean_j_k2:.4f}")


100%|██████████| 42/42 [05:48<00:00,  8.30s/it]
100%|██████████| 42/42 [12:11<00:00, 17.42s/it]

Mean Jaccard (topK=1): 0.2047
Mean Jaccard (topK=2): 0.2119
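
Because each result also records `retrieved_sources`, retrieval quality can be inspected separately from answer wording. The sketch below is one possible check, under a loudly stated assumption: the gold answer is exactly a Prime Minister's name that matches a key in `PM_NAMES` once spaces are replaced by underscores. The helper names and the toy results are hypothetical; they are not outputs of the run above.

```python
def source_hit(result):
    """True if the page named by the gold answer is among the retrieved sources.

    Assumes the gold answer is a PM name such as "Indira Gandhi" (hypothetical
    simplification; answers in other forms would always count as misses).
    """
    expected = "_".join(str(result["gold_answer"]).split())
    return expected in result["retrieved_sources"]


def retrieval_hit_rate(results):
    """Fraction of questions whose expected page was retrieved."""
    if not results:
        return 0.0
    return sum(source_hit(r) for r in results) / len(results)


# Invented records with the same keys as the dicts built in evaluate_topk.
toy_results = [
    {"gold_answer": "Jawaharlal Nehru", "retrieved_sources": ["Jawaharlal_Nehru"]},
    {"gold_answer": "Indira Gandhi", "retrieved_sources": ["Rajiv_Gandhi", "Indira_Gandhi"]},
    {"gold_answer": "Charan Singh", "retrieved_sources": ["V._P._Singh"]},
]

rate = retrieval_hit_rate(toy_results)
print(f"Hit rate: {rate:.2f}")
```

Applied to `results_k1` and `results_k2`, a measure like this would help separate retrieval misses from cases where the right page was retrieved but the answer was phrased differently from the gold string.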



