# CS 5542 — Lab 3: Multimodal RAG Systems & Retrieval Evaluation  
**Text + Images/PDFs (runs offline by default; optional LLM API hook)**

This notebook is a **student-ready, simplified, and fully runnable** lab workflow for **multimodal retrieval-augmented generation (RAG)**:
- ingest **PDF text** + **image captions/filenames**
- retrieve evidence with a lightweight baseline (TF‑IDF)
- build a **context block** for answering
- evaluate retrieval quality (Precision@5, Recall@10)
- run an **ablation study** (REQUIRED)

> ✅ **Important:** The code is optimized for **clarity + reproducibility for students** (minimal dependencies, no keys required).  
> It is not the “fastest possible” or “best-performing” RAG system — but it is a correct baseline that you can extend.

---

## Student Tasks (what you must do)
1. **Ingest** PDFs + images from `project_data_mm/` (or use the provided sample package).  
2. Implement / experiment with **chunking strategies** (page-based vs fixed-size).  
3. Compare retrieval methods (at least):  
   - **Sparse** (TF‑IDF / BM25-style)  
   - **Dense** (optional: embeddings)  
   - **Hybrid** (score fusion with `alpha`)  
   - **Hybrid + rerank** (optional: reranker / LLM rerank)  
4. Build a **multimodal context** that includes **evidence items** (text + images).  
5. Produce the required **results table**:

`Query × Method × Precision@5 × Recall@10 × Faithfulness`

---

## Expected Outputs (what graders look for)
- Printed ingestion counts (how many PDF pages/chunks, how many images)
- A retrieval demo showing **top‑k evidence** for a query
- Evaluation metrics per method (P@5, R@10)
- An ablation section with a small comparison table + short explanation


## Key Parameters You Can Tune (and what they do)

These parameters control retrieval + context building. **Students should change them and report what happens.**

- **`TOP_K_TEXT`**: how many text chunks to consider as candidates.  
  - Larger → more recall, but more noise (lower precision).
- **`TOP_K_IMAGES`**: how many image items to consider as candidates.  
  - Larger → more multimodal evidence, but can add irrelevant images.
- **`TOP_K_EVIDENCE`**: how many total evidence items (text+image) go into the final context.  
  - Larger → longer context; may dilute answer quality.
- **`ALPHA`** *(0 → 1)*: **fusion weight** when mixing text vs image evidence.  
  - `ALPHA = 1.0` → text dominates  
  - `ALPHA = 0.0` → images dominate  
  - typical starting point: `0.5`
- **`CHUNK_SIZE`** (fixed-size chunking): characters per chunk (baseline).  
  - Smaller → more granular retrieval (often higher precision)  
  - Larger → fewer chunks (often higher recall but less specific)
- **`CHUNK_OVERLAP`**: overlap between chunks to avoid cutting important info.  
  - Too high → redundant chunks; too low → missing context boundaries
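
One way `ALPHA` could be applied when merging text and image candidates is sketched below. This is a minimal illustration, not the starter's implementation: the helper names `min_max` and `fuse_text_and_images` and the toy scores are assumptions made for the example. Scores are rescaled per modality first so that the two lists share a common range before weighting.

```python
from typing import Dict, List, Tuple

def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant (or single-item) list maps to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def fuse_text_and_images(
    text_scores: Dict[str, float],
    image_scores: Dict[str, float],
    alpha: float = 0.5,
    top_k: int = 8,
) -> List[Tuple[str, float]]:
    """Weight text scores by alpha and image scores by (1 - alpha), then keep the top_k."""
    fused = [(k, alpha * v) for k, v in min_max(text_scores).items()]
    fused += [(k, (1.0 - alpha) * v) for k, v in min_max(image_scores).items()]
    return sorted(fused, key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy scores (hypothetical IDs and values, for illustration only)
text_scores = {"t1": 0.9, "t2": 0.3}
image_scores = {"i1": 0.5, "i2": 0.1}
print(fuse_text_and_images(text_scores, image_scores, alpha=0.8))  # text item ranks first
print(fuse_text_and_images(text_scores, image_scores, alpha=0.2))  # image item ranks first
```

With `alpha = 1.0` every image score becomes zero, which matches the "text dominates" end of the range described above.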

### What to try (recommended student experiments)
- Keep everything fixed, vary **`ALPHA`**: 0.2, 0.5, 0.8  
- Vary **`TOP_K_TEXT`**: 2, 5, 10  
- Compare **page-based** vs **fixed-size** chunking (required ablation)


## 0) Student Info (Fill in)
- Name: Ben Blake
- UMKC ID: 14387365
- Course/Section: CS5542-0001


## 1) Setup (student-friendly baseline)

This lab starter is designed to be **easy to run** and **easy to modify**:
- **PyMuPDF (`fitz`)** for PDF text extraction
- **scikit-learn** for TF‑IDF retrieval (strong sparse baseline)
- **Pillow** for basic image IO
- Optional: connect an **LLM API** for answer generation (not required to run retrieval + eval)

### Student guideline
- First make sure **retrieval + metrics** run end-to-end.
- Then iterate: chunking → retrieval method → fusion (`ALPHA`) → rerank → faithfulness.

> If you have API keys (e.g., Gemini / OpenAI / etc.), you can plug them into the optional LLM hook later —  
> but your retrieval evaluation should work **without** any external keys.


In [1]:
# Imports
import os, re, glob, json, math
from dataclasses import dataclass
from typing import List, Dict, Any, Tuple, Optional

import numpy as np
import pandas as pd

!pip install PyMuPDF sentence-transformers faiss-cpu rank-bm25
import fitz  # PyMuPDF
from PIL import Image, ImageDraw, ImageFont

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sentence_transformers import SentenceTransformer, CrossEncoder
import faiss
from rank_bm25 import BM25Okapi

Collecting PyMuPDF
  Downloading pymupdf-1.26.7-cp310-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.6 kB)
Collecting rank-bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading pymupdf-1.26.7-cp310-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (23.8 MB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank-bm25, PyMuPDF, faiss-cpu
Successfully installed PyMuPDF-1.26.7 faiss-cpu-1.13.2 rank-bm25-0.2.2


### Cell Description: Imports & Setup

- **What this cell does:** Imports necessary libraries for file handling (os, glob), PDF processing (PyMuPDF), and vector operations (numpy, sklearn, faiss).
- **Why it matters:** Sets up the environment with tools for OCR-free PDF text extraction and dense/sparse retrieval.
- **Key assumptions:** Dependencies like `pymupdf` and `sentence-transformers` are installed in the environment.

In [2]:
# =========================
# Lab Configuration (EDIT ME)
# =========================
# Students: try changing these and observe how retrieval metrics change.

DATA_DIR = "project_data_mm"   # folder containing pdfs/ and images/
PDF_DIR  = os.path.join(DATA_DIR, "pdfs")
IMG_DIR  = os.path.join(DATA_DIR, "images")

# Retrieval knobs
TOP_K_TEXT     = 5    # candidate text chunks
TOP_K_IMAGES   = 3    # candidate images (based on captions/filenames)
TOP_K_EVIDENCE = 8    # final evidence items used in the context

# Fusion knob (text vs images)
ALPHA = 0.5  # 0.0 = images dominate, 1.0 = text dominates

# Chunking knobs (for fixed-size chunking ablation)
CHUNK_SIZE    = 900   # characters per chunk
CHUNK_OVERLAP = 150   # overlap characters

# Reproducibility
RANDOM_SEED = 0


### Cell Description: Configuration

- **What this cell does:** Defines the global hyperparameters for the RAG pipeline: the top-k counts (`TOP_K_TEXT`, `TOP_K_IMAGES`, `TOP_K_EVIDENCE`), the fusion weight (`ALPHA`), and the fixed-size chunking settings (`CHUNK_SIZE`, `CHUNK_OVERLAP`).
- **Why it matters:** These parameters directly control the trade-off between precision (low k) and recall (high k), and the balance between text and image evidence.
- **Tradeoffs:** Higher `TOP_K_*` values increase recall but risk polluting the context with irrelevant evidence.

## 2) Data folder
Expected structure:
```
project_data_mm/
  pdfs/
    doc1.pdf
    doc2.pdf
  images/
    img1.png
    ... (>=5)
```

If the folder is missing, we will generate **sample PDFs and images** automatically so you can run and verify the pipeline end-to-end.
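
Before running ingestion, it can help to confirm that the folder layout matches what the configuration cell expects. The sketch below is an assumption-laden helper (the name `summarize_data_dir` and the extension list are not part of the starter); it only counts files under the `pdfs/` and `images/` subfolders used by `PDF_DIR` and `IMG_DIR`.

```python
from pathlib import Path

# Assumption: adjust this set to the image formats you actually use.
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}

def summarize_data_dir(data_dir: str = "project_data_mm") -> dict:
    """Count PDFs under <data_dir>/pdfs and images under <data_dir>/images."""
    root = Path(data_dir)
    pdf_dir, img_dir = root / "pdfs", root / "images"
    pdfs = sorted(pdf_dir.glob("*.pdf")) if pdf_dir.is_dir() else []
    images = (
        sorted(p for p in img_dir.iterdir() if p.suffix.lower() in IMAGE_EXTS)
        if img_dir.is_dir()
        else []
    )
    return {"exists": root.is_dir(), "n_pdfs": len(pdfs), "n_images": len(images)}

print(summarize_data_dir())
```

Unlike the notebook's `load_images`, which picks up every file matching `*.*`, this check filters by extension, so a stray non-image file in `images/` would show up as a mismatch between the two counts.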


## 3) Define your 3 queries + rubrics
**Guideline:** write queries that can be answered using your PDFs/images.

Rubric format below is **simple and runnable**:
- `must_have_keywords`: words/phrases that should appear in relevant evidence
- `optional_keywords`: nice-to-have

Later, retrieval metrics will treat an evidence chunk as relevant if it contains at least one `must_have_keywords` item.


In [3]:
QUERIES = [
    {
        "id": "Q1",
        "question": "Based on the risk matrix shown in the figures and the accompanying text, which combination of likelihood and impact corresponds to the highest risk level?",
        "rubric": {
            "must_have_keywords": ["likelihood", "impact", "high risk"],
            "optional_keywords": ["risk matrix", "heat map", "severity", "probability"]
        }
    },
    {
        "id": "Q2",
        "question": "Using both the Zero Trust architecture diagram and the document text, what core principle is emphasized for access decisions?",
        "rubric": {
            "must_have_keywords": ["zero trust", "verify", "never trust"],
            "optional_keywords": ["least privilege", "continuous authentication", "identity", "access control"]
        }
    },
    {
        "id": "Q3",
        "question": "What specific encryption algorithm (for example, AES-256 or RSA-2048) is mandated by the organization’s policy?",
        "rubric": {
            "must_have_keywords": ["AES-256", "RSA-2048", "encryption algorithm"],
            "optional_keywords": ["policy", "standard", "cryptographic"]
        }
    }
]

### Cell Description: Evaluation Dataset

- **What this cell does:** Defines the test set of queries with a ground-truth rubric (keywords) for evaluation.
- **Why it matters:** Essential for objective evaluation of the retrieval system using Precision and Recall metrics.
- **Key assumptions:** The queries cover both text and image modalities present in the dataset.

## 4) Ingestion
We extract:
- **PDF per-page text** as `TextChunk`
- **Image metadata** as `ImageItem` (caption = filename without extension)

> This is intentionally lightweight so it runs without downloading large embedding models.


**Cell Description:**
This cell handles the ingestion of PDF documents and images. We implement two chunking strategies: page-based (natural boundaries) and fixed-size (consistent length). This is crucial for RAG as the chunk size determines the context window usage and semantic completeness.

In [4]:
@dataclass
class TextChunk:
    chunk_id: str
    doc_id: str
    page_num: int
    text: str

@dataclass
class ImageItem:
    item_id: str
    path: str
    caption: str

def clean_text(s: str) -> str:
    s = s or ""
    s = re.sub(r"\s+", " ", s).strip()
    return s

def extract_pdf_pages(pdf_path: str) -> List[TextChunk]:
    doc_id = os.path.basename(pdf_path)
    doc = fitz.open(pdf_path)
    out: List[TextChunk] = []
    for i in range(len(doc)):
        page = doc.load_page(i)
        text = clean_text(page.get_text("text"))
        if text:
            out.append(TextChunk(chunk_id=f"{doc_id}::p{i+1}", doc_id=doc_id, page_num=i+1, text=text))
    return out

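def chunk_text_fixed(text: str, chunk_size: int, overlap: int, doc_id: str, page_num: int) -> List[TextChunk]:
    """Split one page of text into fixed-size character windows that overlap by `overlap` characters."""
    step = chunk_size - overlap
    if step <= 0:
        # Without this guard the loop below would never advance.
        raise ValueError("overlap must be smaller than chunk_size")
    chunks: List[TextChunk] = []
    if not text:
        return chunks
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # The character offset `start` keeps chunk IDs unique within a page.
        chunks.append(TextChunk(chunk_id=f"{doc_id}::p{page_num}::c{start}", doc_id=doc_id, page_num=page_num, text=text[start:end]))
        start += step
    return chunks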

def extract_pdf_fixed(pdf_path: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> List[TextChunk]:
    doc_id = os.path.basename(pdf_path)
    doc = fitz.open(pdf_path)
    out: List[TextChunk] = []
    for i in range(len(doc)):
        page = doc.load_page(i)
        text = clean_text(page.get_text("text"))
        if text:
            out.extend(chunk_text_fixed(text, chunk_size, overlap, doc_id, i+1))
    return out

def load_images(fig_dir: str) -> List[ImageItem]:
    items: List[ImageItem] = []
    for p in sorted(glob.glob(os.path.join(fig_dir, "*.*"))):
        base = os.path.basename(p)
        caption = os.path.splitext(base)[0].replace("_", " ")
        items.append(ImageItem(item_id=base, path=p, caption=caption))
    return items

pdfs = sorted(glob.glob(os.path.join(PDF_DIR, "*.pdf")))
page_chunks = []
for p in pdfs: page_chunks.extend(extract_pdf_pages(p))
fixed_chunks = []
for p in pdfs: fixed_chunks.extend(extract_pdf_fixed(p))
image_items = load_images(IMG_DIR)

print("Total page chunks:", len(page_chunks))
print("Total fixed chunks:", len(fixed_chunks))
print("Total images:", len(image_items))


Total page chunks: 221
Total fixed chunks: 859
Total images: 10


### Cell Description: Ingestion & Chunking

- **What this cell does:** Extracts text from PDFs and captions from images. Implements two chunking strategies: page-based (natural boundaries) and fixed-size (consistent length).
- **Why it matters:** Standardizes raw documents into a format suitable for indexing. The choice of chunking strategy impacts context window usage and semantic coherence.
- **Tradeoffs:** Page-based chunking preserves document structure but varies in size; fixed-size ensures consistency but may split sentences.
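
To see how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, the window arithmetic can be reproduced on character offsets alone. The helper below (`fixed_windows`, a name introduced here for illustration) mirrors the stepping used by `chunk_text_fixed`, where each new window starts `chunk_size - overlap` characters after the previous one.

```python
from typing import List, Tuple

def fixed_windows(n_chars: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for fixed-size windows with overlap."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [(s, min(s + chunk_size, n_chars)) for s in range(0, n_chars, step)]

# A 2000-character page with the default settings (900 / 150)
print(fixed_windows(2000, 900, 150))  # [(0, 900), (750, 1650), (1500, 2000)]

# A page of exactly 900 characters still yields a second, fully overlapping window
print(fixed_windows(900, 900, 150))   # [(0, 900), (750, 900)]
```

The second case shows one source of redundant chunks: whenever a page ends inside the overlap region, the last window repeats text already covered by the previous one. This is worth keeping in mind when comparing chunk counts between the two strategies.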

## 5) Retrieval (TF‑IDF)
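We build indexes over two kinds of items:
- **PDF text chunks**: TF‑IDF, BM25, and dense (SentenceTransformers + FAISS) indexes, built once for page-based chunks and once for fixed-size chunks
- **Image captions**: a TF‑IDF index and a dense index

Retrieval returns the top‑k results with similarity scores.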


**Cell Description:**
We build three types of indices: Sparse (TF-IDF, BM25) and Dense (SentenceTransformers). This allows us to compare keyword-based retrieval against semantic retrieval. We also index image captions to enable multimodal retrieval.

In [5]:
# 1. TF-IDF Setup
def build_tfidf_index(texts: List[str]):
    vec = TfidfVectorizer(lowercase=True, stop_words="english")
    X = vec.fit_transform(texts)
    X = normalize(X)
    return vec, X

# 2. BM25 Setup
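def build_bm25_index(texts: List[str]):
    # Whitespace tokenization is case- and punctuation-sensitive ("Risk," != "risk");
    # retrieve_bm25 must tokenize queries the same way.
    tokenized_corpus = [doc.split(" ") for doc in texts]
    return BM25Okapi(tokenized_corpus)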

# 3. Dense Setup
model_st = SentenceTransformer('all-MiniLM-L6-v2')
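def build_dense_index(texts: List[str]):
    embeddings = model_st.encode(texts, convert_to_numpy=True)
    # L2-normalize so that inner product equals cosine similarity; FAISS expects float32.
    # Use the returned array rather than relying on in-place normalization.
    embeddings = normalize(embeddings).astype("float32")
    d = embeddings.shape[1]
    index = faiss.IndexFlatIP(d)
    index.add(embeddings)
    return index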

print("Building indexes (this may take a moment)...")

# PAGE indexes
page_texts = [c.text for c in page_chunks]
tfidf_vec_page, tfidf_X_page = build_tfidf_index(page_texts)
bm25_page = build_bm25_index(page_texts)
dense_index_page = build_dense_index(page_texts)

# FIXED indexes
fixed_texts = [c.text for c in fixed_chunks]
tfidf_vec_fixed, tfidf_X_fixed = build_tfidf_index(fixed_texts)
bm25_fixed = build_bm25_index(fixed_texts)
dense_index_fixed = build_dense_index(fixed_texts)

# IMAGE indexes
img_texts = [it.caption for it in image_items]
tfidf_vec_img, tfidf_X_img = build_tfidf_index(img_texts)
dense_index_img = build_dense_index(img_texts)

print("✅ Indexes built.")

Building indexes (this may take a moment)...
✅ Indexes built.


### Cell Description: Indexing (Sparse + Dense)

- **What this cell does:** Builds Sparse (TF-IDF, BM25) and Dense (SentenceTransformers) indices for both text chunks and image captions.
- **Why it matters:** Enables comparison between keyword-based retrieval (good for exact terms) and semantic retrieval (good for concepts). Multimodal indexing allows retrieving images via text queries.
- **Key assumptions:** `all-MiniLM-L6-v2` provides sufficient semantic understanding for this domain.
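
The BM25 index in the cell above tokenizes with `split(" ")`, so `"Trust:"` and `"trust"` are treated as different terms. A small normalizing tokenizer is sketched below as one possible alternative; `simple_tokenize` is a hypothetical helper, not part of the starter, and it would need to be applied to both the corpus and the query to be useful.

```python
import re
from typing import List

def simple_tokenize(text: str) -> List[str]:
    """Lowercase, drop punctuation, and keep hyphenated terms such as 'aes-256' intact."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

sentence = "Zero Trust: never trust, always verify."
print(sentence.split(" "))        # ['Zero', 'Trust:', 'never', 'trust,', 'always', 'verify.']
print(simple_tokenize(sentence))  # ['zero', 'trust', 'never', 'trust', 'always', 'verify']
print(simple_tokenize("AES-256 is required"))  # ['aes-256', 'is', 'required']
```

Swapping the tokenizer changes the BM25 scores, so any metrics reported with the whitespace tokenizer would need to be recomputed.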

## 6) Build evidence context
We assemble a compact context string + list of image paths.

**Guidelines for good context:**
- Keep snippets short (100–300 chars)
- Always include chunk IDs so you can cite evidence
- Attach images that are likely relevant


In [6]:
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def retrieve_tfidf(query: str, vec, X, top_k=5):
    q = vec.transform([query])
    q = normalize(q)
    scores = (X @ q.T).toarray().ravel()
    idx = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in idx]

def retrieve_bm25(query: str, bm25_obj, top_k=5):
    tokenized_query = query.split(" ")
    scores = bm25_obj.get_scores(tokenized_query)
    idx = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in idx]

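def retrieve_dense(query: str, index, top_k=5):
    q_emb = model_st.encode([query], convert_to_numpy=True)
    # Normalize the query the same way as the indexed embeddings (cosine similarity).
    q_emb = normalize(q_emb).astype("float32")
    scores, indices = index.search(q_emb, top_k)
    # FAISS pads with -1 when the index holds fewer than top_k vectors.
    return [(int(i), float(s)) for i, s in zip(indices[0], scores[0]) if i != -1]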

def build_context(
    question: str,
    method: str = "sparse",
    chunking: str = "page",
    top_k_text: int = TOP_K_TEXT,
    top_k_images: int = TOP_K_IMAGES,
    top_k_evidence: int = TOP_K_EVIDENCE,
    alpha: float = ALPHA,
) -> Dict[str, Any]:
    # Select Corpus
    if chunking == "page":
        chunks = page_chunks
        tfidf_vec, tfidf_X = tfidf_vec_page, tfidf_X_page
        bm25_obj = bm25_page
        dense_idx = dense_index_page
    else:
        chunks = fixed_chunks
        tfidf_vec, tfidf_X = tfidf_vec_fixed, tfidf_X_fixed
        bm25_obj = bm25_fixed
        dense_idx = dense_index_fixed

    # 1. Text Retrieval
    text_hits = []
    if method == "sparse":
        text_hits = retrieve_tfidf(question, tfidf_vec, tfidf_X, top_k=top_k_text)
    elif method == "bm25":
        text_hits = retrieve_bm25(question, bm25_obj, top_k=top_k_text)
    elif method == "dense":
        text_hits = retrieve_dense(question, dense_idx, top_k=top_k_text)
    elif "hybrid" in method:
        h1 = retrieve_tfidf(question, tfidf_vec, tfidf_X, top_k=top_k_text * 2)
        h2 = retrieve_dense(question, dense_idx, top_k=top_k_text * 2)
        # Weighted score fusion: both retrievers return cosine similarities.
        # NOTE: the 0.3 / 0.7 weights are hard-coded; the `alpha` argument is not applied here.
        combined = {}
        for i, s in h1: combined[i] = combined.get(i, 0) + 0.3 * s
        for i, s in h2: combined[i] = combined.get(i, 0) + 0.7 * s
        text_hits = sorted(combined.items(), key=lambda x: x[1], reverse=True)[:top_k_text]

    # Prepare candidates for fusion/reranking
    candidates = []
    for idx, s in text_hits:
        candidates.append({
            "modality": "text",
            "id": chunks[idx].chunk_id,
            "score": float(s),
            "text": chunks[idx].text,
            "path": None
        })

    # 2. Image Retrieval
    img_hits = retrieve_tfidf(question, tfidf_vec_img, tfidf_X_img, top_k=top_k_images)
    for idx, s in img_hits:
        candidates.append({
            "modality": "image",
            "id": image_items[idx].item_id,
            "score": float(s),
            "text": image_items[idx].caption,
            "path": image_items[idx].path
        })

    # 3. Rerank (if requested)
    if "rerank" in method:
        pairs = [[question, c["text"]] for c in candidates]
        scores = cross_encoder.predict(pairs)
        for i, s in enumerate(scores): candidates[i]["score"] = float(s)
        candidates.sort(key=lambda x: x["score"], reverse=True)
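    else:
        # No rerank: sort by raw retrieval score. Scores are not normalized here, so text and
        # image scores from different retrievers (e.g., BM25 vs TF-IDF) are not directly comparable.
        candidates.sort(key=lambda x: x["score"], reverse=True)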

    # 4. Final Selection
    final_evidence = candidates[:top_k_evidence]
    ctx_lines = []
    image_paths = []
    for ev in final_evidence:
        if ev["modality"] == "text":
            snippet = (ev["text"] or "")[:260].replace("\n", " ")
            ctx_lines.append(f"[TEXT | {ev['id']} | score={ev['score']:.3f}] {snippet}")
        else:
            ctx_lines.append(f"[IMAGE | {ev['id']} | score={ev['score']:.3f}] caption={ev['text']}")
            image_paths.append(ev["path"])

    return {
        "question": question,
        "context": "\n".join(ctx_lines),
        "image_paths": image_paths,
        "evidence": final_evidence,
        "method": method
    }

# Demo
ctx_demo = build_context(QUERIES[0]["question"], method="hybrid_rerank")
print(ctx_demo["context"])


[IMAGE | img1.png | score=-7.435] caption=img1
[IMAGE | img10.png | score=-7.684] caption=img10
[IMAGE | img2.png | score=-7.726] caption=img2
[TEXT | doc3.pdf::p29 | score=-7.900] NIST CSWP 29 The NIST Cybersecurity Framework (CSF) 2.0 February 26, 2024 24 Appendix B. CSF Tiers Table 2 contains a notional illustration of the CSF Tiers discussed in Sec. 3. The Tiers characterize the rigor of an organization’s cybersecurity risk governanc
[TEXT | doc3.pdf::p17 | score=-8.217] NIST CSWP 29 The NIST Cybersecurity Framework (CSF) 2.0 February 26, 2024 12 organizations by their nature may monitor risk at the enterprise level, while larger companies may maintain separate risk management efforts integrated into the ERM. Organizations can
[TEXT | doc3.pdf::p16 | score=-8.248] NIST CSWP 29 The NIST Cybersecurity Framework (CSF) 2.0 February 26, 2024 11 Preparing to create and use Organizational Profiles involves gathering information about organizational priorities, resources, and risk directio

### Cell Description: Retrieval Logic (Context Building)

- **What this cell does:** The core retrieval logic that fetches candidates, fuses scores (Hybrid), optionally reranks, and constructs the prompt context.
- **Why it matters:** This is the 'Brain' of the RAG system. It determines what evidence is presented to the LLM. Hybrid fusion allows leveraging strengths of both sparse and dense retrievers.
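- **Key assumptions:** TF-IDF and dense scores are both cosine similarities, so the hybrid branch fuses them without rescaling. BM25 scores are unbounded, and no score normalization is applied before text and image candidates are merged, so the merged ordering should be interpreted with care.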
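
When scores from different retrievers are hard to compare, rank-based fusion is a common alternative to weighted score fusion. The sketch below shows reciprocal rank fusion (RRF), which is not what `build_context` does; it is offered as an option to experiment with. The function name, the toy document IDs, and the constant `k = 60` (a commonly used default) are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60, top_k: int = 5
) -> List[Tuple[str, float]]:
    """Fuse several ranked ID lists: each list contributes 1 / (k + rank) per item."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy rankings (hypothetical chunk IDs)
sparse_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]
for doc_id, score in reciprocal_rank_fusion([sparse_ranking, dense_ranking]):
    print(doc_id, round(score, 5))
```

Because only ranks are used, RRF needs no score normalization, at the cost of discarding how large the score gaps between neighboring results were.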

## 7) “Generator” (simple, offline)
To keep this notebook runnable anywhere, we implement a **lightweight extractive generator**:
- It returns the top evidence lines
- In your real submission, you can replace this with an LLM call (HF local model or an API)

**Key rule:** the answer must stay consistent with evidence.


In [7]:
def simple_extractive_answer(question: str, context: str) -> str:
    lines = context.splitlines()
    if not lines:
        return "I don't know (no evidence retrieved)."
    # Return top 2 evidence lines as a "grounded" answer
    return (
        f"Question: {question}\n\n"
        "Grounded answer (extractive):\n"
        + "\n".join(lines[:2])
    )

# --- OPTIONAL: Gemini API Hook (Uncomment to use) ---
# import google.generativeai as genai
# GOOGLE_API_KEY = "YOUR_KEY_HERE"
# genai.configure(api_key=GOOGLE_API_KEY)
# model = genai.GenerativeModel('gemini-pro')
#
# def generate_gemini(question, context):
#     prompt = f"Based on the following evidence, answer the question. Cite sources like [doc1] or [img1].\nEvidence:\n{context}\nQuestion: {question}"
#     return model.generate_content(prompt).text

def run_query(qobj, top_k_text=TOP_K_TEXT, top_k_images=TOP_K_IMAGES, top_k_evidence=TOP_K_EVIDENCE, alpha=ALPHA) -> Dict[str, Any]:
    question = qobj["question"]
    # Forward the retrieval knobs so that changing them actually affects the context.
    ctx = build_context(
        question,
        method="hybrid_rerank",
        chunking="fixed",
        top_k_text=top_k_text,
        top_k_images=top_k_images,
        top_k_evidence=top_k_evidence,
        alpha=alpha,
    )

    # Use Gemini if available, else simple
    # answer = generate_gemini(question, ctx["context"])
    answer = simple_extractive_answer(question, ctx["context"])

    return {
        "id": qobj["id"],
        "question": question,
        "answer": answer,
        "context": ctx["context"],
        "image_paths": ctx["image_paths"],
    }

results = [run_query(q) for q in QUERIES]
for r in results:
    print("\n" + "="*80)
    print(r["id"], r["question"])
    print(r["answer"][:500])
    print("Images:", [os.path.basename(p) for p in r["image_paths"]])


Q1 Based on the risk matrix shown in the figures and the accompanying text, which combination of likelihood and impact corresponds to the highest risk level?
Question: Based on the risk matrix shown in the figures and the accompanying text, which combination of likelihood and impact corresponds to the highest risk level?

Grounded answer (extractive):
[TEXT | doc1.pdf::p11::c4500 | score=-6.947]  for which a health care facility or business associate, as applicable, determines there is a low probability of compromise in accordance with HIPAA’s 4-factor risk assessment (see “Analysis of Risk of Harm” section below for a complete listing of these facto
Images: ['img1.png', 'img10.png', 'img2.png']

Q2 Using both the Zero Trust architecture diagram and the document text, what core principle is emphasized for access decisions?
Question: Using both the Zero Trust architecture diagram and the document text, what core principle is emphasized for access decisions?

Grounded answer (extractive

### Cell Description: Generator

- **What this cell does:** Generates an answer based on the retrieved context. Currently implements a simple extractive baseline, with hooks for an LLM.
- **Why it matters:** Demonstrates the End-to-End RAG flow. In a production system, this would be replaced by a generative model to synthesize the answer.
- **Tradeoffs:** The extractive baseline cannot synthesize new information, only repeat evidence.

## 8) Retrieval Evaluation (Precision@k / Recall@k)
We treat a text chunk as **relevant** for a query if it contains at least one `must_have_keywords` term.



In [8]:
def is_relevant_text(chunk_text: str, rubric: Dict[str, Any]) -> bool:
    text = chunk_text.lower()
    must = [k.lower() for k in rubric.get("must_have_keywords", [])]
    return any(k in text for k in must)

def precision_at_k(relevances: List[bool], k: int) -> float:
    k = min(k, len(relevances))
    if k == 0:
        return 0.0
    return sum(relevances[:k]) / k

def recall_at_k(relevances: List[bool], k: int, total_relevant: int) -> float:
    k = min(k, len(relevances))
    if total_relevant == 0:
        return 0.0
    return sum(relevances[:k]) / total_relevant

def eval_retrieval_for_query(qobj, top_k=10) -> Dict[str, Any]:
    question = qobj["question"]
    rubric = qobj["rubric"]

    # Sparse baseline: TF-IDF retrieval over the page-based chunks
    hits = retrieve_tfidf(question, tfidf_vec_page, tfidf_X_page, top_k=top_k)
    rels = []
    for i, score in hits:
        rels.append(is_relevant_text(page_chunks[i].text, rubric))

    # Estimate total relevant in the corpus (for recall)
    total_rel = sum(is_relevant_text(ch.text, rubric) for ch in page_chunks)

    return {
        "id": qobj["id"],
        "P@5": precision_at_k(rels, 5),
        "R@10": recall_at_k(rels, 10, total_rel),
        "total_relevant_chunks": total_rel,
    }

eval_rows = [eval_retrieval_for_query(q) for q in QUERIES]
df_eval = pd.DataFrame(eval_rows)
df_eval

   id  P@5      R@10  total_relevant_chunks
0  Q1  0.4  0.119048                     42
1  Q2  0.6  0.200000                     25
2  Q3  0.2  1.000000                      1


### Cell Description: Retrieval Metrics

- **What this cell does:** Calculates retrieval metrics (Precision@5, Recall@10) based on keyword matching against the defined rubric.
- **Why it matters:** Quantifies the performance of the retrieval system, allowing for data-driven improvements.
- **Key assumptions:** Keyword matching is a proxy for true relevance (which would require human annotation).
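
A small worked example can make the two metrics concrete. The helpers below follow the same definitions as `precision_at_k` and `recall_at_k` but use different names (`p_at_k`, `r_at_k`) so they do not shadow the notebook's functions; the relevance lists are illustrative, not taken from a real run.

```python
from typing import List

def p_at_k(relevances: List[bool], k: int) -> float:
    """Precision@k: share of the top-k retrieved items that are relevant."""
    k = min(k, len(relevances))
    return sum(relevances[:k]) / k if k else 0.0

def r_at_k(relevances: List[bool], k: int, total_relevant: int) -> float:
    """Recall@k: share of all relevant items in the corpus found in the top k."""
    if total_relevant == 0:
        return 0.0
    return sum(relevances[:k]) / total_relevant

# Toy ranking of 10 retrieved chunks (True = relevant), with 4 relevant chunks in the corpus
toy = [True, False, True, False, False, True, False, False, False, False]
print(p_at_k(toy, 5))         # 2 relevant in the top 5 -> 0.4
print(r_at_k(toy, 10, 4))     # 3 of 4 relevant chunks retrieved -> 0.75

# If the corpus holds a single relevant chunk, P@5 cannot exceed 1/5
single = [False, True, False, False, False]
print(p_at_k(single, 5))      # 0.2
print(r_at_k(single, 10, 1))  # 1.0
```

The second case matches the pattern reported for Q3 in the table above (P@5 = 0.2, R@10 = 1.0 with one relevant chunk): a low P@5 there reflects how few relevant chunks exist, not necessarily a weak ranking.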

## 9) Ablation Study (REQUIRED)

You must compare **at least**:
- **Chunking A (page-based)** vs **Chunking B (fixed-size)**  
- **Sparse** vs **Dense** vs **Hybrid** vs **Hybrid + Rerank** *(dense/rerank can be optional extensions — but include at least sparse + one fusion variant)*  
- **Text-only RAG** vs **Multimodal RAG** (your context must include evidence items)

**Deliverable:** include a final results table in your README:

`Query × Method × Precision@5 × Recall@10 × Faithfulness`

### Quick ablation ideas
- Vary `TOP_K_TEXT`: 2, 5, 10  
- Vary `ALPHA`: 0.2, 0.5, 0.8  
- Compare page-chunking vs fixed-size (`CHUNK_SIZE` / `CHUNK_OVERLAP`)  


In [9]:
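def calculate_faithfulness(answer: str, rubric: Dict[str, Any]) -> float:
    """Fraction of must-have rubric keywords found in the answer (a keyword-overlap proxy for faithfulness).

    Note: `simple_extractive_answer` echoes the question, so keywords that appear in the question also count.
    """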
    if not answer:
        return 0.0

    answer_lower = answer.lower()
    must_have = [k.lower() for k in rubric.get("must_have_keywords", [])]
    if not must_have:
        return 1.0 # No requirements

    matches = sum(1 for k in must_have if k in answer_lower)
    return matches / len(must_have)

def run_ablation():
    results = []
    methods = ["sparse", "bm25", "dense", "hybrid", "hybrid_rerank"]
    strategies = ["page", "fixed"]

    for q in QUERIES:
        for strat in strategies:
            for meth in methods:
                # Retrieve up to 10 evidence items for the R@10 computation.
                # With the defaults (TOP_K_TEXT=5, TOP_K_IMAGES=3) at most 8 candidates exist.
                ctx = build_context(q["question"], method=meth, chunking=strat, top_k_evidence=10)

                # Generate Answer (Extractive)
                answer = simple_extractive_answer(q["question"], ctx["context"])

                # Evaluate Retrieval (P@5 and R@10)
                rubric = q["rubric"]
                must_have = [k.lower() for k in rubric.get("must_have_keywords", [])]
                retrieved_rels = []
                found_keywords = set()

                for i, ev in enumerate(ctx["evidence"]):
                    if ev["modality"] == "text":
                        text_lower = ev["text"].lower()
                        is_rel = any(k in text_lower for k in must_have)
                        retrieved_rels.append(is_rel)
                        if i < 10:
                             found_keywords.update(k for k in must_have if k in text_lower)
                    else:
                        retrieved_rels.append(False)

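                # P@5 over the merged evidence list (image items count as non-relevant).
                p5 = sum(retrieved_rels[:5]) / 5 if retrieved_rels else 0
                # "R@10" here is keyword coverage: the share of must-have keywords found in the
                # retrieved text evidence. It is not the chunk-level recall computed in Section 8.
                r10 = len(found_keywords) / len(must_have) if must_have else 0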

                # Evaluate Faithfulness
                faith_score = calculate_faithfulness(answer, rubric)

                results.append({
                    "Query": q["id"],
                    "Method": meth,
                    "Chunking": strat,
                    "P@5": p5,
                    "R@10": r10,
                    "Faithfulness": faith_score
                })
    return pd.DataFrame(results)

df_res = run_ablation()
print(df_res)

   Query         Method Chunking  P@5      R@10  Faithfulness
0     Q1         sparse     page  0.4  0.666667      0.666667
1     Q1           bm25     page  1.0  0.666667      0.666667
2     Q1          dense     page  0.4  0.333333      0.666667
3     Q1         hybrid     page  0.2  0.333333      0.666667
4     Q1  hybrid_rerank     page  0.2  0.333333      0.666667
5     Q1         sparse    fixed  0.4  0.666667      0.666667
6     Q1           bm25    fixed  0.8  0.666667      0.666667
7     Q1          dense    fixed  0.2  0.333333      0.666667
8     Q1         hybrid    fixed  0.2  0.333333      0.666667
9     Q1  hybrid_rerank    fixed  0.2  0.333333      0.666667
10    Q2         sparse     page  0.6  0.333333      0.333333
11    Q2           bm25     page  0.6  0.333333      0.333333
12    Q2          dense     page  0.2  0.333333      0.333333
13    Q2         hybrid     page  0.2  0.333333      0.333333
14    Q2  hybrid_rerank     page  0.2  0.333333      0.333333
15    Q2

### Cell Description: Ablation Study

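- **What this cell does:** Compares every chunking strategy × retrieval method combination for each query and reports P@5, R@10, and Faithfulness.
- **Why it matters:** Shows which components change retrieval quality on this dataset, so that design choices rest on measurements rather than intuition.
- **Tradeoffs:** Only a subset of possible hyperparameters is evaluated due to compute/time constraints. In this cell, R@10 is computed as must-have keyword coverage rather than chunk-level recall, and Faithfulness is keyword overlap on an extractive answer that echoes the question, so both should be read as rough proxies.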
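
For the short explanation that accompanies the ablation table, it can help to average a metric per method. The sketch below does this for the Q1 rows of the printed output above, using only the standard library; the variable names are chosen for illustration, and the same grouping could be done with `df_res.groupby("Method")` in the notebook.

```python
from collections import defaultdict
from statistics import mean

# (method, chunking, P@5) copied from the Q1 rows of the printed ablation output
q1_rows = [
    ("sparse", "page", 0.4), ("bm25", "page", 1.0), ("dense", "page", 0.4),
    ("hybrid", "page", 0.2), ("hybrid_rerank", "page", 0.2),
    ("sparse", "fixed", 0.4), ("bm25", "fixed", 0.8), ("dense", "fixed", 0.2),
    ("hybrid", "fixed", 0.2), ("hybrid_rerank", "fixed", 0.2),
]

# Group P@5 values by method, then average across the two chunking strategies
by_method = defaultdict(list)
for method, _chunking, p5 in q1_rows:
    by_method[method].append(p5)

summary = {method: round(mean(values), 3) for method, values in by_method.items()}
for method, avg in sorted(summary.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{method:<14} mean P@5 = {avg}")
```

On these rows BM25 has the highest mean P@5 for Q1. A single query is too little evidence for a general conclusion, so the write-up should report the same summary across all queries.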

## 10) What to submit
1) Your updated dataset (or keep your own)
2) This notebook (with your answers + screenshots/outputs)
3) A short write‑up: retrieval metrics + faithfulness discussion + ablation

**Tip:** If you switch to an LLM, keep the same `build_context()` so the evidence is always visible.
