# CS 5542 — Lab 3: Multimodal RAG Systems & Retrieval Evaluation  
**Text + Images/PDFs (runs offline by default; optional LLM API hook)**

This notebook is a **student-ready, simplified, and fully runnable** lab workflow for **multimodal retrieval-augmented generation (RAG)**:
- ingest **PDF text** + **image captions/filenames**
- retrieve evidence with a lightweight baseline (TF‑IDF)
- build a **context block** for answering
- evaluate retrieval quality (Precision@5, Recall@10)
- run an **ablation study** (REQUIRED)

> ✅ **Important:** The code is optimized for **clarity + reproducibility for students** (minimal dependencies, no keys required).  
> It is not the “fastest possible” or “best-performing” RAG system — but it is a correct baseline that you can extend.

---

## Student Tasks (what you must do)
1. **Ingest** PDFs + images from `project_data_mm/` (or use the provided sample package).  
2. Implement / experiment with **chunking strategies** (page-based vs fixed-size).  
3. Compare retrieval methods (at least):  
   - **Sparse** (TF‑IDF / BM25-style)  
   - **Dense** (optional: embeddings)  
   - **Hybrid** (score fusion with `alpha`)  
   - **Hybrid + rerank** (optional: reranker / LLM rerank)  
4. Build a **multimodal context** that includes **evidence items** (text + images).  
5. Produce the required **results table**:

`Query × Method × Precision@5 × Recall@10 × Faithfulness`

---

## Expected Outputs (what graders look for)
- Printed ingestion counts (how many PDF pages/chunks, how many images)
- A retrieval demo showing **top‑k evidence** for a query
- Evaluation metrics per method (P@5, R@10)
- An ablation section with a small comparison table + short explanation


## Key Parameters You Can Tune (and what they do)

These parameters control retrieval + context building. **Students should change them and report what happens.**

- **`TOP_K_TEXT`**: how many text chunks to consider as candidates.  
  - Larger → more recall, but more noise (lower precision).
- **`TOP_K_IMAGES`**: how many image items to consider as candidates.  
  - Larger → more multimodal evidence, but can add irrelevant images.
- **`TOP_K_EVIDENCE`**: how many total evidence items (text+image) go into the final context.  
  - Larger → longer context; may dilute answer quality.
- **`ALPHA`** *(0 → 1)*: **fusion weight** when mixing text vs image evidence.  
  - `ALPHA = 1.0` → text dominates  
  - `ALPHA = 0.0` → images dominate  
  - typical starting point: `0.5`
- **`CHUNK_SIZE`** (fixed-size chunking): characters per chunk (baseline).  
  - Smaller → more granular retrieval (often higher precision)  
  - Larger → fewer chunks (often higher recall but less specific)
- **`CHUNK_OVERLAP`**: overlap between chunks to avoid cutting important info.  
  - Too high → redundant chunks; too low → missing context boundaries
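
To make the role of `ALPHA` concrete, here is a minimal sketch that weights text scores by `alpha` and image scores by `1 - alpha`, as described in the list above. The function name `fuse_scores`, the item IDs, and the score values are illustrative assumptions, and the sketch assumes both score sets are already normalized to the 0..1 range.

```python
from typing import Dict, List, Tuple

def fuse_scores(
    text_scores: Dict[str, float],
    image_scores: Dict[str, float],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Weight text scores by alpha and image scores by (1 - alpha), then rank."""
    fused = [(item_id, alpha * s) for item_id, s in text_scores.items()]
    fused += [(item_id, (1.0 - alpha) * s) for item_id, s in image_scores.items()]
    return sorted(fused, key=lambda pair: pair[1], reverse=True)

# Hypothetical normalized scores for one query (not taken from the lab data)
text_scores = {"doc1::p1": 1.0, "doc2::p1": 0.4}
image_scores = {"fig_a.png": 1.0, "fig_b.png": 0.2}

for alpha in (0.2, 0.5, 0.8):
    ranking = fuse_scores(text_scores, image_scores, alpha=alpha)
    print(alpha, [item_id for item_id, _ in ranking])
```

With these made-up scores, a low `alpha` puts the best image first and a high `alpha` puts the best text chunk first, which is the behavior to look for when you vary `ALPHA` in your own runs.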

### What to try (recommended student experiments)
- Keep everything fixed, vary **`ALPHA`**: 0.2, 0.5, 0.8  
- Vary **`TOP_K_TEXT`**: 2, 5, 10  
- Compare **page-based** vs **fixed-size** chunking (required ablation)


## 0) Student Info (Fill in)
- Name: Ben Blake
- UMKC ID: 14387365
- Course/Section: CS5542-0001


## 1) Setup (student-friendly baseline)

This lab starter is designed to be **easy to run** and **easy to modify**:
- **PyMuPDF (`fitz`)** for PDF text extraction
- **scikit-learn** for TF‑IDF retrieval (strong sparse baseline)
- **Pillow** for basic image IO
- Optional: connect an **LLM API** for answer generation (not required to run retrieval + eval)

### Student guideline
- First make sure **retrieval + metrics** run end-to-end.
- Then iterate: chunking → retrieval method → fusion (`ALPHA`) → rerank → faithfulness.

> If you have API keys (e.g., Gemini / OpenAI / etc.), you can plug them into the optional LLM hook later —  
> but your retrieval evaluation should work **without** any external keys.


In [1]:
# Imports
import os, re, glob, json, math
from dataclasses import dataclass
from typing import List, Dict, Any, Tuple, Optional

import numpy as np
import pandas as pd

!pip install PyMuPDF
import fitz  # PyMuPDF
from PIL import Image, ImageDraw, ImageFont

# Added for sample PDF generation
!pip install reportlab

# --- OCR Dependencies (Required for Full Credit) ---
# Install system packages for Tesseract (works on Colab/Linux)
!apt-get update && apt-get install -y tesseract-ocr
!pip install pytesseract
import pytesseract

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Added for Dense Retrieval & Reranking
!pip install sentence-transformers
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import torch



### Cell Description
**What:** Imports essential libraries for PDF processing (fitz/PyMuPDF), image handling (PIL), sparse retrieval (sklearn), and dense retrieval (sentence-transformers).
**Why:** These tools provide the foundation for ingesting multimodal data, vectorizing text/images, and calculating similarity scores.
**Assumptions/Tradeoffs:** We use `all-MiniLM-L6-v2` for dense retrieval because it is lightweight and CPU-friendly, but it has lower semantic capacity than larger embedding models.


In [2]:
# =========================
# Lab Configuration (EDIT ME)
# =========================
# Students: try changing these and observe how retrieval metrics change.

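DATA_DIR = "project_data_mm"   # dataset root
# Note: the data and ingestion cells below read PDFs directly from DATA_DIR and
# images from DATA_DIR/figures. PDF_DIR and IMG_DIR only matter if you
# reorganize your data into these subfolders and update those cells to match.
PDF_DIR  = os.path.join(DATA_DIR, "pdfs")
IMG_DIR  = os.path.join(DATA_DIR, "images")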

# Retrieval knobs
TOP_K_TEXT     = 5    # candidate text chunks
TOP_K_IMAGES   = 3    # candidate images (based on captions/filenames)
TOP_K_EVIDENCE = 8    # final evidence items used in the context

# Fusion knob (text vs images)
ALPHA = 0.5  # 0.0 = images dominate, 1.0 = text dominates

# Chunking knobs (for fixed-size chunking ablation)
CHUNK_SIZE    = 900   # characters per chunk
CHUNK_OVERLAP = 150   # overlap characters

# Reproducibility
RANDOM_SEED = 0


### Cell Description
**What:** Defines global hyperparameters like `TOP_K` (retrieval depth), `ALPHA` (fusion weight), and chunking configuration.
**Why:** These parameters control the trade-off between precision/recall and the balance between text and image evidence in the final context.
**Assumptions/Tradeoffs:** Static parameters apply to all queries equally. In a production system, these might be dynamic (e.g., adaptive retrieval depth).


## 2) Data folder
Expected structure:
```
project_data_mm/
  doc1.pdf
  doc2.pdf
  figures/
    img1.png
    ... (>=5)
```

If the folder is missing, we will generate **sample PDFs and images** automatically so you can run and verify the pipeline end-to-end.


In [3]:
# Data paths
DATA_DIR = "project_data_mm"
FIG_DIR = os.path.join(DATA_DIR, "figures")
os.makedirs(FIG_DIR, exist_ok=True)

def _write_sample_pdf(pdf_path: str, title: str, paragraphs: List[str]) -> None:
    """Create a simple multi-page PDF with ReportLab."""
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    c = canvas.Canvas(pdf_path, pagesize=letter)
    width, height = letter
    y = height - 72

    c.setFont("Helvetica-Bold", 16)
    c.drawString(72, y, title)
    y -= 36
    c.setFont("Helvetica", 11)

    for p in paragraphs:
        # naive line wrapping
        words = p.split()
        line = ""
        for w in words:
            if len(line) + len(w) + 1 > 95:
                c.drawString(72, y, line)
                y -= 14
                line = w
                if y < 72:
                    c.showPage()
                    y = height - 72
                    c.setFont("Helvetica", 11)
            else:
                line = (line + " " + w).strip()
        if line:
            c.drawString(72, y, line)
            y -= 18

        if y < 72:
            c.showPage()
            y = height - 72
            c.setFont("Helvetica", 11)

    c.save()

def _write_sample_image(img_path: str, label: str, size=(900, 550)) -> None:
    """Create a simple image with a big label."""
    img = Image.new("RGB", size, (245, 245, 245))
    d = ImageDraw.Draw(img)
    try:
        font = ImageFont.truetype("DejaVuSans.ttf", 48)
    except Exception:
        font = ImageFont.load_default()
    d.rectangle([30, 30, size[0]-30, size[1]-30], outline=(30, 30, 30), width=6)
    d.text((60, 200), label, fill=(20, 20, 20), font=font)
    img.save(img_path)

def ensure_sample_dataset(min_pdfs=5, min_imgs=5) -> None:
    """Create a small dataset if user doesn't have one yet."""
    pdfs = sorted(glob.glob(os.path.join(DATA_DIR, "*.pdf")))
    imgs = sorted(glob.glob(os.path.join(FIG_DIR, "*.*")))

    if len(pdfs) >= min_pdfs and len(imgs) >= min_imgs:
        print("✅ Found existing dataset:", len(pdfs), "PDFs and", len(imgs), "images.")
        return

    print("⚠️ Dataset incomplete. Creating sample dataset...")

    # Relevant Docs
    pdf1 = os.path.join(DATA_DIR, "sample_doc_rag_basics.pdf")
    pdf2 = os.path.join(DATA_DIR, "sample_doc_multimodal_eval.pdf")

    p1 = [
        "Retrieval-Augmented Generation (RAG) combines a retriever and a generator. The retriever fetches evidence chunks from documents.",
        "A common baseline is TF-IDF retrieval. Another baseline is BM25, which uses term frequency and inverse document frequency.",
        "Good RAG answers should be grounded in the retrieved evidence and should not hallucinate facts that are not supported.",
    ]
    p2 = [
        "Multimodal RAG includes both text (PDF pages) and images (figures). A simple approach is to attach relevant figures as evidence.",
        "Evaluation can include retrieval metrics such as Precision@k and Recall@k, plus qualitative checks for faithfulness.",
        "Ablation studies vary the chunking strategy, retriever type, or the number of retrieved items.",
    ]

    _write_sample_pdf(pdf1, "Sample Doc 1: RAG Basics", p1)
    _write_sample_pdf(pdf2, "Sample Doc 2: Multimodal RAG + Evaluation", p2)

    # Irrelevant / Distractor Docs (to make metrics realistic)
    distractors = [
        ("sample_doc_cooking.pdf", ["To bake a cake, preheat oven to 350F. Mix flour, sugar, and eggs.", "Frosting can be made with butter and powdered sugar."]),
        ("sample_doc_sports.pdf", ["The soccer match ended in a draw. The goalkeeper made three saves.", "Tennis scoring is 15, 30, 40, Deuce, Advantage."]),
        ("sample_doc_history.pdf", ["The Roman Empire fell in 476 AD. Julius Caesar was a famous leader.", "The Industrial Revolution changed manufacturing processes forever."]),
    ]

    for name, content in distractors:
        _write_sample_pdf(os.path.join(DATA_DIR, name), f"Distractor: {name}", content)

    # Images
    labels = [
        "figure_rag_pipeline",
        "figure_tfidf_retrieval",
        "figure_bm25_baseline",
        "figure_precision_recall",
        "figure_ablation_study",
        "figure_cooking_cake", # Distractor
        "figure_soccer_field", # Distractor
    ]
    for lab in labels:
        _write_sample_image(os.path.join(FIG_DIR, f"{lab}.png"), lab)

    print("✅ Sample dataset created.")

ensure_sample_dataset()

pdfs = sorted(glob.glob(os.path.join(DATA_DIR, "*.pdf")))
imgs = sorted(glob.glob(os.path.join(FIG_DIR, "*.*")))

print("PDFs:", len(pdfs), pdfs)
print("Images:", len(imgs), imgs)


⚠️ Dataset incomplete. Creating sample dataset...
✅ Sample dataset created.
PDFs: 5 ['project_data_mm/sample_doc_cooking.pdf', 'project_data_mm/sample_doc_history.pdf', 'project_data_mm/sample_doc_multimodal_eval.pdf', 'project_data_mm/sample_doc_rag_basics.pdf', 'project_data_mm/sample_doc_sports.pdf']
Images: 7 ['project_data_mm/figures/figure_ablation_study.png', 'project_data_mm/figures/figure_bm25_baseline.png', 'project_data_mm/figures/figure_cooking_cake.png', 'project_data_mm/figures/figure_precision_recall.png', 'project_data_mm/figures/figure_rag_pipeline.png', 'project_data_mm/figures/figure_soccer_field.png', 'project_data_mm/figures/figure_tfidf_retrieval.png']


### Cell Description
**What:** Checks if the dataset exists; if not, generates synthetic PDFs and images using ReportLab and PIL.
**Why:** Ensures the notebook is fully reproducible and runnable out-of-the-box without external file dependencies.
**Assumptions/Tradeoffs:** Synthetic data is clean and simple. Real-world PDFs often have complex layouts, noise, and OCR errors that this generator does not simulate.


## 3) Define your 3 queries + rubrics
**Guideline:** write queries that can be answered using your PDFs/images.

Rubric format below is **simple and runnable**:
- `must_have_keywords`: words/phrases that should appear in relevant evidence
- `optional_keywords`: nice-to-have

Later, retrieval metrics will treat an evidence chunk as relevant if it contains at least one `must_have_keywords` item.
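
As a quick illustration of that relevance rule, the sketch below applies a case-insensitive substring check to two sentences. The helper name `contains_any_keyword` and the example rubric are assumptions made for this illustration; they are not the notebook's own evaluation function.

```python
from typing import Dict, List

def contains_any_keyword(text: str, keywords: List[str]) -> bool:
    """Case-insensitive substring check against a list of keywords."""
    lowered = text.lower()
    return any(k.lower() in lowered for k in keywords)

# Hypothetical rubric in the same shape as the lab's rubrics
rubric: Dict[str, List[str]] = {
    "must_have_keywords": ["tf-idf", "bm25"],
    "optional_keywords": ["term frequency"],
}

relevant = contains_any_keyword(
    "A common baseline is TF-IDF retrieval.", rubric["must_have_keywords"]
)
not_relevant = contains_any_keyword(
    "To bake a cake, preheat the oven.", rubric["must_have_keywords"]
)
print(relevant, not_relevant)
```

Because this is substring matching, a short keyword such as "recall" also matches inside longer words such as "recalled", so choose keywords that are specific to your documents.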


In [4]:
QUERIES = [
    {
        "id": "Q1",
        "question": "What is Retrieval-Augmented Generation (RAG) and why is evidence grounding important?",
        "rubric": {
            "must_have_keywords": ["retrieval-augmented generation", "evidence", "grounded", "hallucinate", "retriever"],
            "optional_keywords": ["chunks", "generator", "context"]
        }
    },
    {
        "id": "Q2",
        "question": "Name two retrieval baselines and briefly describe them.",
        "rubric": {
            "must_have_keywords": ["tf-idf", "bm25"],
            "optional_keywords": ["term frequency", "inverse document frequency"]
        }
    },
    {
        "id": "Q3",
        "question": "How would you evaluate a multimodal RAG system? Mention at least one retrieval metric.",
        "rubric": {
            "must_have_keywords": ["precision", "recall", "evaluation"],
            "optional_keywords": ["ablation", "faithfulness", "multimodal"]
        }
    },
]


## 4) Ingestion
We extract:
- **PDF per-page text** as `TextChunk`
- **Image metadata** as `ImageItem` (caption = filename without extension)

> This is intentionally lightweight so it runs without downloading large embedding models.


In [5]:
@dataclass
class TextChunk:
    chunk_id: str
    doc_id: str
    page_num: int
    text: str

@dataclass
class ImageItem:
    item_id: str
    path: str
    caption: str  # simple text to make image retrieval runnable

def clean_text(s: str) -> str:
    s = s or ""
    s = re.sub(r"\s+", " ", s).strip()
    return s

def extract_pdf_pages(pdf_path: str) -> List[TextChunk]:
    doc_id = os.path.basename(pdf_path)
    doc = fitz.open(pdf_path)
    out: List[TextChunk] = []
    for i in range(len(doc)):
        page = doc.load_page(i)
        text = clean_text(page.get_text("text"))

        # --- Implemented: OCR Support ---
        # If text is empty (scanned PDF), use Tesseract OCR
        if not text:
            try:
                # Render page to image
                pix = page.get_pixmap()
                img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
                # Run OCR
                text = pytesseract.image_to_string(img)
                text = clean_text(text)
                print(f"  [OCR] Extracted text from {doc_id} page {i+1} (length={len(text)})")
            except Exception as e:
                print(f"  [OCR] Failed for {doc_id} page {i+1}: {e}")

        if text:
            out.append(TextChunk(
                chunk_id=f"{doc_id}::p{i+1}",
                doc_id=doc_id,
                page_num=i+1,
                text=text
            ))
    return out

def load_images(fig_dir: str) -> List[ImageItem]:
    items: List[ImageItem] = []
    for p in sorted(glob.glob(os.path.join(fig_dir, "*.*"))):
        base = os.path.basename(p)
        caption = os.path.splitext(base)[0].replace("_", " ")
        items.append(ImageItem(item_id=base, path=p, caption=caption))
    return items

# Run ingestion
page_chunks: List[TextChunk] = []
for p in pdfs:
    page_chunks.extend(extract_pdf_pages(p))

image_items = load_images(FIG_DIR)

print("Total text chunks:", len(page_chunks))
print("Total images:", len(image_items))
if page_chunks:
    print("Sample text chunk:", page_chunks[0].chunk_id, page_chunks[0].text[:180])
if image_items:
    print("Sample image item:", image_items[0])


Total text chunks: 5
Total images: 7
Sample text chunk: sample_doc_cooking.pdf::p1 Distractor: sample_doc_cooking.pdf To bake a cake, preheat oven to 350F. Mix flour, sugar, and eggs. Frosting can be made with butter and powdered sugar.
Sample image item: ImageItem(item_id='figure_ablation_study.png', path='project_data_mm/figures/figure_ablation_study.png', caption='figure ablation study')


### Cell Description
**What:** Ingests documents by extracting text page-by-page and loading image metadata (using filenames as captions).
**Why:** Transforms raw unstructured files into structured `TextChunk` and `ImageItem` objects that can be indexed.
**Assumptions/Tradeoffs:** Page-based chunking is simple but may split semantic contexts across pages. Using filenames as captions assumes filenames are descriptive, which is often not true in the wild.


In [6]:
# --- Fixed-Size Chunking (Ablation Option) ---
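def chunk_text_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    # Guard against a zero or negative step, which would loop forever.
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("Require chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    text_len = len(text)
    step = chunk_size - overlap  # how far the window advances each iteration
    while start < text_len:
        end = min(start + chunk_size, text_len)
        chunks.append(text[start:end])
        start += step
    return chunks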

def extract_pdf_chunks_fixed(pdf_path: str, chunk_size=CHUNK_SIZE, overlap=CHUNK_OVERLAP) -> List[TextChunk]:
    doc_id = os.path.basename(pdf_path)
    doc = fitz.open(pdf_path)
    full_text = ""
    for page in doc:
        full_text += page.get_text("text") + "\n"
    clean = clean_text(full_text)
    raw = chunk_text_fixed(clean, chunk_size, overlap)
    return [
        TextChunk(chunk_id=f"{doc_id}::c{i}", doc_id=doc_id, page_num=-1, text=t)
        for i, t in enumerate(raw)
    ]


### Cell Description
**What:** Implements a sliding window chunking strategy (Fixed-Size) as an alternative to page-based chunking.
**Why:** Allows for more granular retrieval, ensuring that specific facts can be retrieved without pulling in large amounts of irrelevant text.
**Assumptions/Tradeoffs:** Fixed boundaries (e.g., 900 chars) might cut sentences or tables in half, potentially losing context compared to paragraph-aware chunking.
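
To see where the window boundaries fall, the following sketch computes only the character offsets of each chunk. The helper `window_spans` is a hypothetical name introduced for illustration, and the text length of 2,000 characters is an arbitrary example value.

```python
from typing import List, Tuple

def window_spans(text_len: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for a sliding window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # the window advances by this many characters
    return [
        (start, min(start + chunk_size, text_len))
        for start in range(0, text_len, step)
    ]

# Lab defaults (900 / 150) applied to a hypothetical 2,000-character document
spans = window_spans(text_len=2000, chunk_size=900, overlap=150)
print(spans)
```

With these values the step is 750 characters, so the spans are (0, 900), (750, 1650), and (1500, 2000): each chunk repeats the last 150 characters of the previous one, and the final chunk is shorter than `chunk_size`.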


## 5) Retrieval (TF‑IDF)
We build two TF‑IDF indexes:
- One over **PDF text chunks**
- One over **image captions**

Retrieval returns the top‑k results with similarity scores.


In [7]:
def build_tfidf_index_text(chunks: List[TextChunk]):
    corpus = [c.text for c in chunks]
    vec = TfidfVectorizer(lowercase=True, stop_words="english")
    X = vec.fit_transform(corpus)
    X = normalize(X)
    return vec, X

def build_tfidf_index_images(items: List[ImageItem]):
    corpus = [it.caption for it in items]
    vec = TfidfVectorizer(lowercase=True, stop_words="english")
    X = vec.fit_transform(corpus)
    X = normalize(X)
    return vec, X

text_vec, text_X = build_tfidf_index_text(page_chunks)
img_vec, img_X = build_tfidf_index_images(image_items)

def tfidf_retrieve(query: str, vec: TfidfVectorizer, X, top_k: int = 5):
    q = vec.transform([query])
    q = normalize(q)
    scores = (X @ q.T).toarray().ravel()
    idx = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in idx]

print("✅ Indexes built.")


✅ Indexes built.


### Cell Description
**What:** Builds TF-IDF (Term Frequency-Inverse Document Frequency) indexes for text chunks and image captions.
**Why:** Provides a strong "Sparse" retrieval baseline that is highly effective at finding exact keyword matches.
**Assumptions/Tradeoffs:** TF-IDF ignores semantic meaning and synonyms (e.g., "car" vs "automobile"). It relies entirely on exact lexical overlap.
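
The dependence on lexical overlap can be seen with a small standard-library sketch. Note that this uses plain bag-of-words cosine similarity (raw term counts, no IDF weighting) rather than TF-IDF itself; the function `bow_cosine` and the example phrases are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between raw term-count vectors of two strings."""
    ca = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    cb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    # Only terms present in both strings contribute to the dot product.
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

synonym_score = bow_cosine("car repair", "automobile maintenance")
overlap_score = bow_cosine("car repair", "car repair manual")
print(synonym_score, overlap_score)
```

The synonym pair shares no terms, so its similarity is exactly zero even though the meaning is close, while the pair with shared words scores high. IDF weighting changes the magnitudes but not this zero-overlap behavior.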


In [8]:
# --- Dense Retrieval & Rerank Setup ---
dense_model = None
reranker = None

try:
    print("⏳ Loading embedding model...")
    dense_model = SentenceTransformer('all-MiniLM-L6-v2')
    print("✅ Dense model loaded.")

    # Optional: Load Reranker (CrossEncoder)
    # Using a small one for speed/memory
    print("⏳ Loading reranker...")
    reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    print("✅ Reranker loaded.")
except Exception as e:
    print(f"⚠️ Models skipped: {e}")

text_embs = None
img_embs = None

def build_dense_indexes():
    global text_embs, img_embs
    if dense_model is None: return
    print("Building dense indexes...")
    # Encode text
    text_embs = dense_model.encode([c.text for c in page_chunks], convert_to_tensor=True)
    # Encode images (captions)
    img_embs = dense_model.encode([i.caption for i in image_items], convert_to_tensor=True)
    print("✅ Dense indexes built.")

# Build immediately if model exists
if dense_model:
    build_dense_indexes()

def dense_retrieve(query: str, embs, top_k=5):
    if dense_model is None or embs is None: return []
    q_emb = dense_model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, embs)[0]
    top = torch.topk(scores, k=min(top_k, len(scores)))
    return [(int(i), float(s)) for s, i in zip(top.values, top.indices)]


⏳ Loading embedding model...



✅ Dense model loaded.
⏳ Loading reranker...



✅ Reranker loaded.
Building dense indexes...
✅ Dense indexes built.


### Cell Description
**What:** Initializes the Dense Retrieval model (`SentenceTransformer`) and Reranker (`CrossEncoder`) and encodes the corpus.
**Why:** Enables semantic search that can match concepts even without exact keywords, and reranking to refine the top results.
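**Assumptions/Tradeoffs:** Model loading is wrapped in `try`/`except`, so if the models cannot be downloaded (for example, when offline) the notebook keeps running and dense retrieval simply returns no hits. Running embeddings locally requires more RAM/CPU than TF-IDF.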


## 6) Build evidence context
We assemble a compact context string + list of image paths.

**Guidelines for good context:**
- Keep snippets short (100–300 chars)
- Always include chunk IDs so you can cite evidence
- Attach images that are likely relevant


In [9]:
def _normalize_scores(hits):
    """Normalize scores to 0..1 range for fusion."""
    if not hits: return []
    scores = [s for _, s in hits]
    min_s, max_s = min(scores), max(scores)
    if math.isclose(max_s, min_s):
        return [(i, 1.0) for i, _ in hits]
    return [(i, (s - min_s) / (max_s - min_s)) for i, s in hits]

def simple_extractive_answer(question: str, context: str) -> str:
    """
    A rule-based baseline answer generator.
    """
    if not context:
        return "I don't know (no context)."

    # Extract text content from the formatted context string
    lines = context.split('\n')
    extracted_text = []

    for line in lines:
        # Look for [TEXT | ... ] lines
        if "[TEXT |" in line:
            # The actual text starts after the closing bracket ]
            parts = line.split("] ", 1)
            if len(parts) > 1:
                extracted_text.append(parts[1].strip())

    if extracted_text:
        # Return the top 3 lines joined
        return " ".join(extracted_text[:3])

    return "See context above."

def build_context(
    question: str,
    method: str = "sparse",
    top_k_text: int = TOP_K_TEXT,
    top_k_images: int = TOP_K_IMAGES,
    top_k_evidence: int = TOP_K_EVIDENCE,
    alpha: float = ALPHA,
) -> Dict[str, Any]:

    # 1. Retrieve candidates
    text_hits = []
    img_hits = []

    if method == "sparse":
        text_hits = tfidf_retrieve(question, text_vec, text_X, top_k=top_k_text)
        img_hits = tfidf_retrieve(question, img_vec, img_X, top_k=top_k_images)

    elif method == "dense":
        text_hits = dense_retrieve(question, text_embs, top_k=top_k_text)
        img_hits = dense_retrieve(question, img_embs, top_k=top_k_images)

    elif method == "hybrid" or method == "hybrid_rerank":
        # Unweighted fusion: sum raw sparse and dense scores (a missing score counts as 0.0)
        t_sparse = dict(tfidf_retrieve(question, text_vec, text_X, top_k=top_k_text*2))
        t_dense = dict(dense_retrieve(question, text_embs, top_k=top_k_text*2))

        all_t_ids = set(t_sparse.keys()) | set(t_dense.keys())
        merged_text = []
        for i in all_t_ids:
            s_sp = t_sparse.get(i, 0.0)
            s_dn = t_dense.get(i, 0.0)
            merged_text.append((i, s_sp + s_dn))
        text_hits = sorted(merged_text, key=lambda x: x[1], reverse=True)[:top_k_text]

        i_sparse = dict(tfidf_retrieve(question, img_vec, img_X, top_k=top_k_images*2))
        i_dense = dict(dense_retrieve(question, img_embs, top_k=top_k_images*2))
        all_i_ids = set(i_sparse.keys()) | set(i_dense.keys())
        merged_img = []
        for i in all_i_ids:
            s_sp = i_sparse.get(i, 0.0)
            s_dn = i_dense.get(i, 0.0)
            merged_img.append((i, s_sp + s_dn))
        img_hits = sorted(merged_img, key=lambda x: x[1], reverse=True)[:top_k_images]

    # 2. Normalize and Fuse
    text_norm = _normalize_scores(text_hits)
    img_norm  = _normalize_scores(img_hits)

    fused = []
    for idx, s in text_norm:
        ch = page_chunks[idx]
        fused.append({
            "modality": "text",
            "id": ch.chunk_id,
            "fused_score": float(alpha * s),
            "text": ch.text,
            "path": None,
            "raw_score": s
        })
    for idx, s in img_norm:
        it = image_items[idx]
        fused.append({
            "modality": "image",
            "id": it.item_id,
            "fused_score": float((1.0 - alpha) * s),
            "text": it.caption,
            "path": it.path,
            "raw_score": s
        })

    fused = sorted(fused, key=lambda d: d["fused_score"], reverse=True)

    # 3. Rerank (Optional)
    if method == "hybrid_rerank" and reranker:
        candidates = fused[:top_k_evidence*2]
        pairs = [(question, c["text"]) for c in candidates]
        if pairs:
            scores = reranker.predict(pairs)
            for i, score in enumerate(scores):
                candidates[i]["fused_score"] = float(score)
            fused = sorted(candidates, key=lambda d: d["fused_score"], reverse=True)

    fused = fused[:top_k_evidence]

    # Context String
    ctx_lines = []
    paths = []
    for ev in fused:
        if ev["modality"] == "text":
            snippet = (ev["text"] or "")[:260].replace("\n", " ")
            ctx_lines.append(f"[TEXT | {ev['id']} | score={ev['fused_score']:.3f}] {snippet}")
        else:
            ctx_lines.append(f"[IMAGE | {ev['id']} | score={ev['fused_score']:.3f}] caption={ev['text']}")
            paths.append(ev["path"])

    return {
        "question": question,
        "context": "\n".join(ctx_lines),
        "image_paths": paths,
        "text_hits": text_hits,
        "img_hits": img_hits
    }


### Cell Description
**What:** The core retrieval logic: retrieves candidates via Sparse/Dense methods, fuses scores using `ALPHA`, reranks (optional), and formats the context string.
**Why:** Selects the most relevant multimodal evidence to feed into the generation step.
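**Assumptions/Tradeoffs:** The hybrid branch adds raw sparse and dense scores without normalizing them first, so whichever retriever produces larger scores can dominate the sum. `ALPHA` only weights text against image evidence, after min-max normalization within each modality. The context size is capped by `TOP_K_EVIDENCE`.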
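
One possible variant for the hybrid step, offered as a sketch and not as the notebook's implementation: min-max normalize the sparse and dense scores separately, then mix them with a weight. The names `min_max`, `weighted_hybrid`, and the weight `beta` are hypothetical, and the score values are invented for illustration.

```python
from typing import Dict

def min_max(scores: Dict[int, float]) -> Dict[int, float]:
    """Rescale scores to 0..1; if all scores are equal, map them all to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {i: 1.0 for i in scores}
    return {i: (s - lo) / (hi - lo) for i, s in scores.items()}

def weighted_hybrid(
    sparse: Dict[int, float], dense: Dict[int, float], beta: float = 0.5
) -> Dict[int, float]:
    """Mix normalized sparse and dense scores; beta is a hypothetical weight."""
    sp, dn = min_max(sparse), min_max(dense)
    return {
        i: beta * sp.get(i, 0.0) + (1.0 - beta) * dn.get(i, 0.0)
        for i in sp.keys() | dn.keys()
    }

# Invented scores keyed by chunk index
sparse_hits = {0: 0.12, 1: 0.03}
dense_hits = {1: 0.71, 2: 0.55}
hybrid = weighted_hybrid(sparse_hits, dense_hits, beta=0.5)
print(sorted(hybrid.items(), key=lambda kv: kv[1], reverse=True))
```

Normalizing first keeps the retriever with the larger raw scale from dominating the sum. Whether this helps on your data is something to measure in the ablation, not assume.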


## 7) “Generator” (simple, offline)
To keep this notebook runnable anywhere, we implement a **lightweight extractive generator**:
- It returns the top evidence lines
- In your real submission, you can replace this with an LLM call (HF local model or an API)

**Key rule:** the answer must stay consistent with evidence.


In [13]:
def run_query(qobj, method="sparse", top_k_text=TOP_K_TEXT, alpha=ALPHA):
    ctx = build_context(qobj["question"], method=method, top_k_text=top_k_text, alpha=alpha)
    return {
        "id": qobj["id"],
        "question": qobj["question"],
        "answer": simple_extractive_answer(qobj["question"], ctx["context"]),
        "context": ctx["context"],
        "image_paths": ctx["image_paths"]
    }

# Demo: Print full output for Screenshot 1 (Evidence) and Screenshot 2 (Answer)
result = run_query(QUERIES[0])
print(f"QUESTION: {result['question']}")
print("-" * 40)
print("RETRIEVED CONTEXT:")
print(result['context'])
print("-" * 40)
print("GROUNDED ANSWER:")
print(result['answer'])
print("=" * 40)


QUESTION: What is Retrieval-Augmented Generation (RAG) and why is evidence grounding important?
----------------------------------------
RETRIEVED CONTEXT:
[TEXT | sample_doc_rag_basics.pdf::p1 | score=0.500] Sample Doc 1: RAG Basics Retrieval-Augmented Generation (RAG) combines a retriever and a generator. The retriever fetches evidence chunks from documents. A common baseline is TF-IDF retrieval. Another baseline is BM25, which uses term frequency and inverse doc
[IMAGE | figure_tfidf_retrieval.png | score=0.500] caption=figure tfidf retrieval
[IMAGE | figure_rag_pipeline.png | score=0.500] caption=figure rag pipeline
[TEXT | sample_doc_multimodal_eval.pdf::p1 | score=0.205] Sample Doc 2: Multimodal RAG + Evaluation Multimodal RAG includes both text (PDF pages) and images (figures). A simple approach is to attach relevant figures as evidence. Evaluation can include retrieval metrics such as Precision@k and Recall@k, plus qualitati
[TEXT | sample_doc_history.pdf::p1 | score=0.000] Dis

### Cell Description
**What:** A lightweight "Generator" that extracts the top text lines from the context as the answer.
**Why:** Simulates the generation phase to complete the RAG pipeline without requiring an external LLM API key.
**Assumptions/Tradeoffs:** This is extractive, not generative. It cannot synthesize new reasoning or fluent sentences like a GPT model would.


## 8) Retrieval Evaluation (Precision@k / Recall@k)
We treat a text chunk as **relevant** for a query if it contains at least one `must_have_keywords` term.
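
As a worked illustration of this rule and the two metrics, the sketch below labels a toy ranked list and computes Precision@5 and Recall@10 by hand. The chunk texts, the keywords, and the assumed corpus-wide count of relevant chunks are invented for illustration and are not taken from the lab data.

```python
# Toy ranked list (best first); texts and keywords are made up for illustration.
ranked_chunks = [
    "RAG combines a retriever and a generator.",
    "Figures can be attached as evidence.",
    "TF-IDF is a common retrieval baseline.",
    "Unrelated history notes.",
    "More unrelated notes.",
]
must_have_keywords = ["retriever", "tf-idf"]

# A chunk is relevant if it contains at least one keyword
# (case-insensitive substring match).
relevances = [
    any(k in chunk.lower() for k in must_have_keywords)
    for chunk in ranked_chunks
]

# Assumption: these are the only two relevant chunks in the whole corpus.
total_relevant = 2

p_at_5 = sum(relevances[:5]) / 5                  # relevant hits among the top 5
r_at_10 = sum(relevances[:10]) / total_relevant   # share of relevant chunks found

print(relevances)
print(p_at_5, r_at_10)
```

Two of the top five chunks are relevant, so P@5 is 0.4; both relevant chunks appear within the top ten, so R@10 is 1.0.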



In [11]:
def is_relevant_text(chunk_text: str, rubric: Dict[str, Any]) -> bool:
    text = chunk_text.lower()
    must = [k.lower() for k in rubric.get("must_have_keywords", [])]
    return any(k in text for k in must)

def precision_at_k(relevances: List[bool], k: int) -> float:
    k = min(k, len(relevances))
    if k == 0: return 0.0
    return sum(relevances[:k]) / k

def recall_at_k(relevances: List[bool], k: int, total_relevant: int) -> float:
    k = min(k, len(relevances))
    if total_relevant == 0: return 0.0
    return sum(relevances[:k]) / total_relevant

def faithfulness_proxy(answer: str, context: str) -> float:
    # Simple proxy: overlap of keywords between answer and context
    ans_words = set(answer.lower().split())
    ctx_words = set(context.lower().split())
    if not ans_words: return 0.0
    overlap = len(ans_words.intersection(ctx_words))
    return overlap / len(ans_words)

def eval_retrieval_for_query(qobj, method="sparse", top_k=10) -> Dict[str, Any]:
    question = qobj["question"]
    rubric = qobj["rubric"]

    # 1. Run Retrieval via build_context (to reuse logic)
    # We set top_k_evidence to top_k to get enough candidates
    res = build_context(question, method=method, top_k_evidence=top_k)

    # 2. Check Relevance of the Text chunks in the context
    ctx_lines = res["context"].split('\n')
    rels = []
    # Parse the context to find text chunks (image lines are skipped for text-based P/R)
    for line in ctx_lines:
        if line.lstrip().startswith("[TEXT"):
            # Keep everything after the first "] " so that chunk text which
            # itself contains "] " is not truncated
            content = line.split("] ", 1)[-1]
            rels.append(is_relevant_text(content, rubric))

    # Estimate total relevant in corpus (for recall)
    # Note: This is a simplified recall estimate based on the entire corpus
    total_rel = sum(is_relevant_text(ch.text, rubric) for ch in page_chunks)

    # Generate answer for faithfulness
    ans = simple_extractive_answer(question, res["context"])
    faith = faithfulness_proxy(ans, res["context"])

    return {
        "id": qobj["id"],
        "method": method,
        "P@5": precision_at_k(rels, 5),
        "R@10": recall_at_k(rels, 10, total_rel),
        "Faithfulness": faith,
        "total_relevant_chunks": total_rel,
    }

# Demo
print(eval_retrieval_for_query(QUERIES[0], method="sparse"))


{'id': 'Q1', 'method': 'sparse', 'P@5': 0.4, 'R@10': 1.0, 'Faithfulness': 1.0, 'total_relevant_chunks': 2}


### Cell Description
**What:** Evaluates the retrieval system using Precision@5, Recall@10, and a Faithfulness proxy (keyword overlap).
**Why:** Provides quantitative metrics to objectively measure the quality of the retrieval strategies.
**Assumptions/Tradeoffs:** Keyword-based relevance is a rough proxy for true semantic relevance. "Faithfulness" here checks word overlap, whereas true faithfulness requires logical entailment checks.
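
To make the overlap tradeoff concrete, the sketch below applies the same word-overlap idea to three toy answers. `word_overlap` is a hypothetical stand-in that mirrors the proxy's logic, and the sentences are invented for illustration.

```python
def word_overlap(answer: str, context: str) -> float:
    """Fraction of distinct answer tokens that also occur in the context."""
    ans = set(answer.lower().split())
    ctx = set(context.lower().split())
    return len(ans & ctx) / len(ans) if ans else 0.0

context = "the retriever fetches evidence and the generator writes the answer"

# Copied from the context: every token overlaps.
extractive = "the retriever fetches evidence and the generator writes the answer"
# Roles swapped: contradicts the context but reuses exactly the same tokens.
swapped = "the generator fetches evidence and the retriever writes the answer"
# Faithful paraphrase: shares no tokens with the context.
paraphrase = "retrieval finds supporting passages before generation"

for name, ans in [("extractive", extractive), ("swapped", swapped), ("paraphrase", paraphrase)]:
    print(name, word_overlap(ans, context))
```

The swapped answer scores as high as the extractive one while the faithful paraphrase scores lowest, which is why an extractive generator will tend to look perfectly "faithful" under this proxy and why the score should be read alongside a manual check.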


## 9) Ablation Study (REQUIRED)

You must compare **at least**:
- **Chunking A (page-based)** vs **Chunking B (fixed-size)**  
- **Sparse** vs **Dense** vs **Hybrid** vs **Hybrid + Rerank** *(dense/rerank can be optional extensions — but include at least sparse + one fusion variant)*  
- **Text-only RAG** vs **Multimodal RAG** (your context must include evidence items)

**Deliverable:** include a final results table in your README:

`Query × Method × Precision@5 × Recall@10 × Faithfulness`
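
For the README table, one option is to render the evaluation rows as Markdown with a few lines of standard-library Python. This is a minimal sketch: `to_markdown_table` is a hypothetical helper, and the two example rows reuse values from the sample `Q1` output in this notebook.

```python
from typing import Any, Dict, List, Sequence

COLUMNS = ("id", "method", "P@5", "R@10", "Faithfulness")

def to_markdown_table(rows: List[Dict[str, Any]],
                      columns: Sequence[str] = COLUMNS) -> str:
    """Render a list of result dicts as a Markdown table."""
    def fmt(value: Any) -> str:
        # Two decimals for floats; everything else is printed as-is.
        return f"{value:.2f}" if isinstance(value, float) else str(value)

    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(fmt(r.get(c, "")) for c in columns) + " |"
        for r in rows
    ]
    return "\n".join([header, divider, *body])

rows = [
    {"id": "Q1", "method": "sparse", "P@5": 0.4, "R@10": 1.0, "Faithfulness": 1.0},
    {"id": "Q1", "method": "hybrid", "P@5": 0.4, "R@10": 1.0, "Faithfulness": 1.0},
]
print(to_markdown_table(rows))
```

If your results are in a DataFrame, `df.to_dict("records")` gives rows in the shape this helper expects.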

### Quick ablation ideas
- Vary `TOP_K_TEXT`: 2, 5, 10  
- Vary `ALPHA`: 0.2, 0.5, 0.8  
- Compare page-chunking vs fixed-size (`CHUNK_SIZE` / `CHUNK_OVERLAP`)  
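
The parameter ideas above can be run as a small grid. The sketch below assumes you wrap your own evaluation in a function that takes `(top_k_text, alpha)` and returns a dict of metrics; `sweep` and `dummy_evaluate` are hypothetical names, and the dummy returns placeholder zeros rather than real results.

```python
from itertools import product
from typing import Callable, Dict, List

def sweep(evaluate: Callable[[int, float], Dict[str, float]],
          top_k_values=(2, 5, 10),
          alpha_values=(0.2, 0.5, 0.8)) -> List[Dict[str, float]]:
    """Call `evaluate` once per (top_k_text, alpha) pair and collect the rows."""
    rows = []
    for top_k, alpha in product(top_k_values, alpha_values):
        metrics = evaluate(top_k, alpha)
        rows.append({"top_k_text": top_k, "alpha": alpha, **metrics})
    return rows

# Stand-in evaluator so the sketch runs on its own. Replace it with a call
# into your own evaluation code; the zeros are dummies, not results.
def dummy_evaluate(top_k: int, alpha: float) -> Dict[str, float]:
    return {"P@5": 0.0, "R@10": 0.0}

rows = sweep(dummy_evaluate)
print(len(rows))  # 3 values of top_k_text x 3 values of alpha = 9 configurations
```

Keeping one row per configuration makes it easy to load the sweep into a table and see which setting changes which metric.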


In [12]:
def ablation_study():
    rows = []

    # 1. Retrieval Methods Comparison (Multimodal, Page Chunking)
    # -----------------------------------------------------------
    methods = ["sparse"]
    if dense_model:
        methods.append("dense")
        methods.append("hybrid")
        if 'reranker' in globals() and reranker:
            methods.append("hybrid_rerank")

    print(f"Running Method Ablation: {methods}")
    for q in QUERIES:
        for m in methods:
            row = eval_retrieval_for_query(q, method=m)
            row["experiment"] = "Method"
            rows.append(row)

    # 2. Chunking Strategy Comparison (Sparse, Fixed-Size vs Page-Based)
    # ------------------------------------------------------------------
    # Note: eval_retrieval_for_query uses global 'page_chunks' by default.
    # To test fixed chunking, we need to temporarily swap the global index.
    print("Running Chunking Ablation (Page vs Fixed)...")

    # Generate fixed chunks
    fixed_chunks = []
    for p in pdfs:
        fixed_chunks.extend(extract_pdf_chunks_fixed(p))

    # Temporarily build index for fixed chunks
    global page_chunks, text_vec, text_X # Access globals to swap
    original_chunks = page_chunks
    original_vec = text_vec
    original_X = text_X

    try:
        # Swap to fixed
        page_chunks = fixed_chunks
        text_vec, text_X = build_tfidf_index_text(page_chunks)

        for q in QUERIES:
            # Run Sparse on Fixed Chunks
            row = eval_retrieval_for_query(q, method="sparse")
            row["method"] = "sparse_fixed_chunk"
            row["experiment"] = "Chunking"
            rows.append(row)
    finally:
        # Restore the page-based index even if evaluation raised an error
        page_chunks = original_chunks
        text_vec = original_vec
        text_X = original_X
        print("Restored page-based index.")

    # 3. Text-Only vs Multimodal (Hybrid)
    # -----------------------------------
    # build_context has no switch to drop images, so text-only is approximated
    # by setting alpha=1.0 (text scores weighted 100%, image scores 0%).
    print("Running Modality Ablation (Text-Only vs Multimodal)...")

    def eval_text_only(qobj, top_k=10):
        # Caveat: alpha=1.0 is only a proxy for "text-focused"; images can
        # still appear in the context when text scores are low. Forcing
        # top_k_images=0 would be stricter, but the build_context signature
        # is fixed. eval_retrieval_for_query does not accept alpha, so
        # build_context is called directly and the scoring is repeated here.
        question = qobj["question"]
        rubric = qobj["rubric"]

        # Call build_context with alpha=1.0 (Text Only)
        res = build_context(question, method="hybrid", alpha=1.0, top_k_evidence=top_k)

        # Relevance parsing mirrors eval_retrieval_for_query
        ctx_lines = res["context"].split('\n')
        rels = []
        for line in ctx_lines:
            if line.lstrip().startswith("[TEXT"):
                # Split on the first "] " only, so chunk text is not truncated
                content = line.split("] ", 1)[-1]
                rels.append(is_relevant_text(content, rubric))

        total_rel = sum(is_relevant_text(ch.text, rubric) for ch in page_chunks)
        ans = simple_extractive_answer(question, res["context"])
        faith = faithfulness_proxy(ans, res["context"])

        return {
            "id": qobj["id"],
            "method": "text_only_hybrid",
            "P@5": precision_at_k(rels, 5),
            "R@10": recall_at_k(rels, 10, total_rel),
            "Faithfulness": faith,
            "total_relevant_chunks": total_rel,
            "experiment": "Modality"
        }

    for q in QUERIES:
        rows.append(eval_text_only(q))

    return pd.DataFrame(rows)

df_ablation = ablation_study()
df_ablation


Running Method Ablation: ['sparse', 'dense', 'hybrid', 'hybrid_rerank']
Running Chunking Ablation (Page vs Fixed)...
Restored page-based index.
Running Modality Ablation (Text-Only vs Multimodal)...


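| | id | method | P@5 | R@10 | Faithfulness | total_relevant_chunks | experiment |
|---|---|---|---|---|---|---|---|
| 0 | Q1 | sparse | 0.4 | 1.0 | 1.0 | 2 | Method |
| 1 | Q1 | dense | 0.4 | 1.0 | 1.0 | 2 | Method |
| 2 | Q1 | hybrid | 0.4 | 1.0 | 1.0 | 2 | Method |
| 3 | Q1 | hybrid_rerank | 0.4 | 1.0 | 1.0 | 2 | Method |
| 4 | Q2 | sparse | 0.2 | 1.0 | 1.0 | 1 | Method |
| 5 | Q2 | dense | 0.2 | 1.0 | 1.0 | 1 | Method |
| 6 | Q2 | hybrid | 0.2 | 1.0 | 1.0 | 1 | Method |
| 7 | Q2 | hybrid_rerank | 0.2 | 1.0 | 1.0 | 1 | Method |
| 8 | Q3 | sparse | 0.2 | 1.0 | 1.0 | 1 | Method |
| 9 | Q3 | dense | 0.2 | 1.0 | 1.0 | 1 | Method |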


### Cell Description
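**What:** Runs an ablation study in three parts across all queries: retrieval method (Sparse, Dense, Hybrid, Hybrid+Rerank), chunking (page-based vs fixed-size, sparse only), and modality (text-only via `alpha=1.0` vs multimodal).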
**Why:** Identifies which components contribute most to performance and helps diagnose system weaknesses.
**Assumptions/Tradeoffs:** We test a limited set of configurations. A full study would also vary chunk sizes and fusion weights (`ALPHA`) systematically.
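
To compare configurations at a glance, the per-query rows can be averaged per method. The sketch below is one way to do that with the standard library; `summarize` is a hypothetical helper, and the example rows reuse the `Q1`/`Q2` values from the sample output above.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List, Tuple

def summarize(rows: List[Dict[str, Any]],
              metrics: Tuple[str, ...] = ("P@5", "R@10", "Faithfulness")
              ) -> List[Dict[str, Any]]:
    """Average each metric over queries, grouped by (experiment, method)."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["experiment"], r["method"])].append(r)

    summary = []
    for (experiment, method), members in groups.items():
        row = {"experiment": experiment, "method": method, "n_queries": len(members)}
        for m in metrics:
            row[m] = mean(r[m] for r in members)
        summary.append(row)
    return summary

rows = [
    {"id": "Q1", "method": "sparse", "P@5": 0.4, "R@10": 1.0, "Faithfulness": 1.0, "experiment": "Method"},
    {"id": "Q2", "method": "sparse", "P@5": 0.2, "R@10": 1.0, "Faithfulness": 1.0, "experiment": "Method"},
    {"id": "Q1", "method": "dense", "P@5": 0.4, "R@10": 1.0, "Faithfulness": 1.0, "experiment": "Method"},
    {"id": "Q2", "method": "dense", "P@5": 0.2, "R@10": 1.0, "Faithfulness": 1.0, "experiment": "Method"},
]
for row in summarize(rows):
    print(row)
```

With pandas available, `df_ablation.groupby(["experiment", "method"])[["P@5", "R@10", "Faithfulness"]].mean()` should give an equivalent summary.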


## 10) What to submit
1) Your dataset (the updated sample package, or your own data)
2) This notebook (with your answers + screenshots/outputs)
3) A short write‑up: retrieval metrics + faithfulness discussion + ablation

**Tip:** If you switch to an LLM, keep the same `build_context()` so the evidence is always visible.
