# CS 5588 — Week 3 Hands-On  
## Building a Multimodal RAG Product Prototype (PDF + Images)

**Goal (today):** Build a *working product prototype* that answers user questions from real documents (PDFs + images) with **evidence citations**.

**What you’ll leave with:**
- A project-ready multimodal RAG pipeline (ingestion → indexing → retrieval → grounded answer)
- A short **Product Brief** inside the notebook (persona, problem, value, success metrics)
- A small **demo loop** you can show to stakeholders (prompt → answer + citations)

> This hands-on is application-first: prioritize a realistic use case and a clean demo.


## 0) Product Brief (Fill in — REQUIRED for Week 3)
- **Team / Name:**  
- **Project name (working title):**  

### 0.1 Target user persona
- Who will use this? (role, context, pain point)

### 0.2 Problem statement (1–2 sentences)
- What decision/task does your product support?

### 0.3 Value proposition (1 sentence)
- What improves (speed, accuracy, trust, cost, risk)?

### 0.4 Success metrics (pick 2–3)
- e.g., time-to-answer, citation coverage, % “not enough evidence” when missing, user satisfaction (1–5), precision@5


## 1) Setup (Colab)
Run installs, then imports.


In [None]:
# === Setup & Imports (Colab-friendly) ===
import os, re, glob, json, math
from dataclasses import dataclass
from typing import List, Dict, Any, Tuple, Optional

import numpy as np
import pandas as pd

# ---- Core deps ----
# PyMuPDF for PDF text extraction
!pip -q install pymupdf pillow pandas numpy scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

import fitz  # PyMuPDF
from PIL import Image

# ---- OCR deps ----
!pip -q install pytesseract
!sudo apt-get -qq update
!sudo apt-get -qq install -y tesseract-ocr

import pytesseract

# ---- Retrieval deps ----
!pip -q install faiss-cpu rank-bm25
import faiss
from rank_bm25 import BM25Okapi

# ---- Dense + rerank (optional) ----
# Some environments may have version conflicts, so we try the import and fall back gracefully if it fails.
USE_ST = True
USE_RERANK = True

try:
    from sentence_transformers import SentenceTransformer, CrossEncoder
except Exception as e:
    USE_ST = False
    USE_RERANK = False
    print("⚠️ sentence-transformers not available in this runtime. Falling back to TF-IDF for 'dense' retrieval.")
    print("   Error:", e)

# Optional captioning (bonus)
USE_CAPTIONING = False
try:
    from transformers import pipeline
    USE_CAPTIONING = True
except Exception:
    USE_CAPTIONING = False




### 1.1 System dependencies (Colab/Linux)
Tesseract is installed by the setup cell in Section 1. If OCR fails, re-run that cell; the cell below only confirms the step.


In [None]:
# (Handled in Setup & Imports above)
print('System dependencies installed in Section 1.')

System dependencies installed in Section 1.


### 1.2 Imports


> **Note:** Dependencies are installed and imported above. If you restart the runtime, re-run Sections 1–2.

## 2) Choose a project dataset (realistic, stakeholder-facing)
Create this structure (you can start small today):

```
project_data_mm/
  docs/
    doc1.pdf
    doc2.pdf
  figures/
    fig1.png
    fig2.jpg
  notes.txt (optional)
```

**Recommended today:** 2 PDFs + 3–5 images that matter to your use case.


In [None]:
import os, glob, zipfile, shutil

DATA_DIR = "project_data_mm"
DOC_DIR = os.path.join(DATA_DIR, "docs")
FIG_DIR = os.path.join(DATA_DIR, "figures")
TMP_DIR = os.path.join(DATA_DIR, "_tmp_unzip")

os.makedirs(DOC_DIR, exist_ok=True)
os.makedirs(FIG_DIR, exist_ok=True)
os.makedirs(TMP_DIR, exist_ok=True)

# ✅ Colab uploads are here
zip_files = sorted(glob.glob("/content/*.zip"))
print("Using ZIPs:", zip_files)

IMG_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tif", ".tiff"}
PDF_EXTS = {".pdf"}

# unzip
for z in zip_files:
    with zipfile.ZipFile(z, "r") as zip_ref:
        zip_ref.extractall(TMP_DIR)

# move into docs/figures
moved_pdf = moved_img = 0
for root, _, files in os.walk(TMP_DIR):
    for fn in files:
        src = os.path.join(root, fn)
        ext = os.path.splitext(fn.lower())[1]

        if ext in PDF_EXTS:
            shutil.move(src, os.path.join(DOC_DIR, fn))
            moved_pdf += 1
        elif ext in IMG_EXTS:
            shutil.move(src, os.path.join(FIG_DIR, fn))
            moved_img += 1

print(f"Moved PDFs: {moved_pdf}, Moved images: {moved_img}")

# verify
pdfs = sorted(glob.glob(os.path.join(DOC_DIR, "*.pdf")))
imgs = sorted(glob.glob(os.path.join(FIG_DIR, "*.*")))
print("PDFs:", len(pdfs), pdfs[:5])
print("Images:", len(imgs), imgs[:5])


Using ZIPs: ['/content/pictrues.zip', '/content/withour 2000.zip']
Moved PDFs: 14, Moved images: 10
PDFs: 7 ['project_data_mm/docs/2025ccfsrumkcpolice.pdf', 'project_data_mm/docs/2026-spring-shuttle-schedule.pdf', 'project_data_mm/docs/Student Permits - Parking Options - Parking _ University of Missouri-Kansas City.pdf', 'project_data_mm/docs/umkc-health-sciences-campus-map.pdf', 'project_data_mm/docs/umkc-student-handbook.pdf']
Images: 5 ['project_data_mm/figures/1.png', 'project_data_mm/figures/2.png', 'project_data_mm/figures/3.png', 'project_data_mm/figures/4.png', 'project_data_mm/figures/5.png']
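
The output above reports 14 moved PDFs but only 7 on disk (and 10 moved images versus 5). One plausible explanation, which is an assumption and not something the output confirms, is that both ZIP archives contain files with the same names, so later moves overwrite earlier ones. A minimal sketch for checking this before moving anything; `find_name_collisions` is a hypothetical helper, not part of the notebook:

```python
import os
from collections import Counter

def find_name_collisions(paths):
    """Return base filenames (lowercased) that occur more than once in paths."""
    counts = Counter(os.path.basename(p).lower() for p in paths)
    return {name: n for name, n in counts.items() if n > 1}

# Toy example: two archives that both ship "map.pdf"
extracted = [
    "_tmp_unzip/a/map.pdf",
    "_tmp_unzip/b/map.pdf",
    "_tmp_unzip/a/handbook.pdf",
]
collisions = find_name_collisions(extracted)
print(collisions)
```

If the colliding files are true duplicates, overwriting is harmless; if they differ, renaming one of them before the move keeps both in the corpus.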


## 3) Define 3 stakeholder questions (application-oriented)
- **Q1/Q2:** require both text + figure/table evidence  
- **Q3:** ambiguous/missing evidence → system should say **Not enough evidence in the retrieved context.**

Also add:
- Must-cite evidence (page or figure)
- Success criteria (what a good answer must include)


In [None]:
QUERIES = [
    {
        "id": "Q1",
        "question": (
            "Which shuttle bus station is closest to Plaster Hall "
            "according to the campus map and shuttle route information?"
        ),
        "must_cite": ["[map]", "[route]"],
        "success_criteria": [
            "Mentions a specific shuttle stop",
            "Uses campus map evidence",
            "Cites shuttle route or map figure"
        ],
        "keywords": ["Plaster Hall", "shuttle", "map", "route"]
    },
    {
        "id": "Q2",
        "question": (
            "Which parking lot is closest to Haag Hall, and how much does "
            "a student pay for a semester parking permit for that lot?"
        ),
        "must_cite": ["[parking map]", "[parking table]"],
        "success_criteria": [
            "Identifies the nearest parking lot",
            "States a numeric semester permit price",
            "Uses both text and table evidence"
        ],
        "keywords": ["Haag Hall", "parking", "permit", "price"]
    },
    {
        "id": "Q3",
        "question": (
            "Will UMKC’s student enrollment decrease in 2026 compared to previous years?"
        ),
        "must_cite": [],
        "success_criteria": [
            "Not enough evidence in the retrieved context."
        ],
        "keywords": ["enrollment", "2026", "trend"]
    },
]


## 4) Ingest PDFs (per-page text)


In [None]:
from dataclasses import dataclass
from typing import List
import os, re
import fitz  # PyMuPDF

@dataclass
class TextChunk:
    chunk_id: str
    doc_id: str
    page_num: int
    text: str
    source: str  # file path for citations

def extract_pdf_pages(pdf_path: str) -> List[TextChunk]:
    doc_id = os.path.basename(pdf_path)
    out: List[TextChunk] = []

    try:
        with fitz.open(pdf_path) as doc:
            for i in range(len(doc)):
                page = doc.load_page(i)
                text = page.get_text("text") or ""
                text = re.sub(r"\s+", " ", text).strip()

                if text:
                    out.append(
                        TextChunk(
                            chunk_id=f"{doc_id}::p{i+1}",
                            doc_id=doc_id,
                            page_num=i+1,
                            text=text,
                            source=pdf_path
                        )
                    )
    except Exception as e:
        print(f"[WARN] Failed to read {pdf_path}: {e}")

    return out

page_chunks = []
for p in pdfs:
    page_chunks.extend(extract_pdf_pages(p))

print("Total PDF page chunks:", len(page_chunks))
if page_chunks:
    print("Sample:", page_chunks[0].chunk_id, page_chunks[0].text[:250])


Total PDF page chunks: 256
Sample: 2025ccfsrumkcpolice.pdf::p1 1 UNIVERSITY OF MISSOURI – KANSAS CITY JEANNE CLERY ACT REPORT Annual Campus Security and Fire Safety Report for 2024 Reported September 2025 In accordance with the Jeanne Clery Disclosure of Campus Security Policy and Campus Crime Statistics Act of 


## 5) Ingest images (OCR first, optional captioning)


In [None]:
@dataclass
class EvidenceItem:
    evid_id: str
    source: str
    image_path: str
    ocr_text: str
    caption_text: str
    evidence_text: str

def run_ocr(image_path: str) -> str:
    img = Image.open(image_path).convert("RGB")
    text = pytesseract.image_to_string(img)
    return re.sub(r"\s+", " ", text).strip()

evidence_items = []
for ip in imgs:
    base = os.path.basename(ip)
    evid_id = os.path.splitext(base)[0]
    ocr = run_ocr(ip)
    evidence_items.append(EvidenceItem(evid_id, base, ip, ocr, "", ocr))

print("Evidence items:", len(evidence_items))
if evidence_items:
    print("Sample OCR:", evidence_items[0].source, evidence_items[0].ocr_text[:200])


Evidence items: 5
Sample OCR: 1.png Undergraduate Research and Creative Scholarship This year we celebrate 50 years of being at the forefront of Undergraduate Research and Creative Scholarship, which offers undergraduate students opport


### 5.1 Optional captioning (bonus)


In [None]:
### 5.1 Optional captioning (bonus)

USE_CAPTIONING = False  # set True to enable

if USE_CAPTIONING:
    from transformers import pipeline
    from PIL import Image
    import re

    import torch

    # Use the first GPU if available (Colab); otherwise fall back to CPU
    captioner = pipeline(
        "image-to-text",
        model="Salesforce/blip-image-captioning-base",
        device=0 if torch.cuda.is_available() else -1
    )

    for ei in evidence_items:
        try:
            img = Image.open(ei.image_path).convert("RGB")

            cap = captioner(
                img,
                max_new_tokens=40
            )[0]["generated_text"]

            cap = re.sub(r"\s+", " ", cap).strip()

            # Store caption separately
            ei.caption_text = cap

            # Merge OCR + caption safely
            if ei.ocr_text:
                ei.evidence_text = (ei.ocr_text + "\n" + cap).strip()
            else:
                ei.evidence_text = cap

        except Exception as e:
            print(f"[WARN] Captioning failed for {ei.image_path}: {e}")

    print("Captioning complete.")

else:
    print("Captioning skipped.")


Captioning skipped.


## 6) Chunking (page-based vs fixed-size)


In [None]:
@dataclass
class SubChunk:
    chunk_id: str
    doc_id: str
    page_num: int
    text: str

def fixed_size_chunk(text: str, words_per_chunk: int = 250, overlap: int = 40) -> List[str]:
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(len(words), start + words_per_chunk)
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = max(0, end - overlap)
    return chunks

sub_chunks = []
for pc in page_chunks:
    for j, t in enumerate(fixed_size_chunk(pc.text, 250, 40)):
        sub_chunks.append(SubChunk(f"{pc.doc_id}::p{pc.page_num}::c{j+1}", pc.doc_id, pc.page_num, t))

print("Page chunks:", len(page_chunks))
print("Fixed-size chunks:", len(sub_chunks))


Page chunks: 256
Fixed-size chunks: 464
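
With `words_per_chunk=250` and `overlap=40`, each new chunk starts 210 words after the previous one, so a page longer than 250 words yields `1 + ceil((n - 250) / 210)` chunks. The sketch below repeats the notebook's `fixed_size_chunk` so it runs on its own and checks it against that closed form; `expected_chunks` is a hypothetical helper added for illustration and assumes `overlap < words_per_chunk`.

```python
import math
from typing import List

def fixed_size_chunk(text: str, words_per_chunk: int = 250, overlap: int = 40) -> List[str]:
    # Same logic as the notebook cell above, repeated so this block is self-contained.
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(len(words), start + words_per_chunk)
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = max(0, end - overlap)
    return chunks

def expected_chunks(n_words: int, words_per_chunk: int = 250, overlap: int = 40) -> int:
    """Closed-form chunk count; assumes overlap < words_per_chunk."""
    if n_words == 0:
        return 0
    if n_words <= words_per_chunk:
        return 1
    stride = words_per_chunk - overlap
    return 1 + math.ceil((n_words - words_per_chunk) / stride)

page = " ".join(f"w{i}" for i in range(600))   # a synthetic 600-word page
chunks = fixed_size_chunk(page)
print(len(chunks), expected_chunks(600))
# The last 40 words of one chunk are the first 40 words of the next.
print(chunks[0].split()[-40:] == chunks[1].split()[:40])
```

Note that an `overlap` equal to or larger than `words_per_chunk` would keep `start` from advancing, so it is worth validating those arguments before chunking a large corpus.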


## 7) Indexing & retrieval (dense + sparse + rerank)


In [None]:
def tokenize(text: str) -> List[str]:
    return [t.lower() for t in re.findall(r"[a-zA-Z0-9]+", text)]

# --- Embeddings (dense retrieval) ---
# If SentenceTransformers is available, we use it. Otherwise, we fall back to TF-IDF vectors.
if USE_ST:
    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def embed_texts(texts: List[str], batch_size: int = 32) -> np.ndarray:
        return embedder.encode(
            texts, batch_size=batch_size, show_progress_bar=True,
            convert_to_numpy=True, normalize_embeddings=True
        )
else:
    # TF-IDF fallback (acts as a "dense-ish" baseline)
    tfidf_vec = TfidfVectorizer(max_features=50000, ngram_range=(1,2))
    _tfidf_fitted = False

    def embed_texts(texts: List[str], batch_size: int = 32) -> np.ndarray:
        global _tfidf_fitted
        X = tfidf_vec.fit_transform(texts) if not _tfidf_fitted else tfidf_vec.transform(texts)
        _tfidf_fitted = True
        X = normalize(X)
        return X.toarray().astype(np.float32)

def build_faiss_ip(vectors: np.ndarray):
    dim = vectors.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(vectors.astype(np.float32))
    return index

TEXT_CORPUS_A = page_chunks
TEXT_CORPUS_B = sub_chunks

texts_A = [c.text for c in TEXT_CORPUS_A]
vecs_A = embed_texts(texts_A) if texts_A else np.zeros((0,384), dtype=np.float32)
faiss_A = build_faiss_ip(vecs_A) if len(texts_A)>0 else None
bm25_A = BM25Okapi([tokenize(t) for t in texts_A]) if len(texts_A)>0 else None

texts_B = [c.text for c in TEXT_CORPUS_B]
vecs_B = embed_texts(texts_B) if texts_B else np.zeros((0,384), dtype=np.float32)
faiss_B = build_faiss_ip(vecs_B) if len(texts_B)>0 else None
bm25_B = BM25Okapi([tokenize(t) for t in texts_B]) if len(texts_B)>0 else None

evid_texts = [e.evidence_text for e in evidence_items]
evid_vecs = embed_texts(evid_texts) if evid_texts else np.zeros((0,384), dtype=np.float32)
faiss_E = build_faiss_ip(evid_vecs) if len(evid_texts)>0 else None

print("Indexes ready.")



Indexes ready.
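
The indexing cell normalizes embeddings (`normalize_embeddings=True`) and stores them in an inner-product index (`IndexFlatIP`). For unit-length vectors the inner product equals cosine similarity, which is why the two choices go together. A small NumPy-only sketch of that equivalence, using random toy vectors in place of real embeddings:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (guarding against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8)).astype(np.float32)    # toy stand-ins for chunk embeddings
query = rng.normal(size=(1, 8)).astype(np.float32)   # toy stand-in for a query embedding

# Cosine similarity straight from the definition
cosine = (docs @ query.T).ravel() / (
    np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
)

# Inner product after normalizing both sides, as a flat inner-product index would score
ip = (l2_normalize(docs) @ l2_normalize(query).T).ravel()

top_k = np.argsort(-ip)[:3]   # indices of the three best-scoring rows
print(np.allclose(cosine, ip, atol=1e-5), top_k)
```

If embeddings were not normalized, the inner product would also reward longer vectors, and the ranking could differ from a cosine ranking.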


In [None]:
def dense_search(query: str, index, corpus, top_k: int = 5):
    if index is None or len(corpus)==0:
        return []
    qv = embed_texts([query])
    scores, idxs = index.search(qv.astype(np.float32), top_k)
    out = []
    for s, i in zip(scores[0], idxs[0]):
        if int(i) >= 0:
            out.append((float(s), corpus[int(i)]))
    return out

def sparse_search(query: str, bm25, corpus, top_k: int = 5):
    if bm25 is None or len(corpus)==0:
        return []
    scores = bm25.get_scores(tokenize(query))
    top = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), corpus[int(i)]) for i in top]

def hybrid_fuse(dense_res, sparse_res, alpha: float = 0.5, top_k: int = 5):
    def k(item): return getattr(item, "chunk_id", getattr(item, "evid_id", str(item)))
    dense_rank = {k(it): r for r, (_, it) in enumerate(dense_res, start=1)}
    sparse_rank = {k(it): r for r, (_, it) in enumerate(sparse_res, start=1)}
    keys = set(dense_rank) | set(sparse_rank)
    fused = []
    for key in keys:
        dr = dense_rank.get(key, len(dense_res)+1)
        sr = sparse_rank.get(key, len(sparse_res)+1)
        score = alpha*(1.0/dr) + (1-alpha)*(1.0/sr)
        obj = next((it for _, it in dense_res if k(it)==key), None) or next((it for _, it in sparse_res if k(it)==key), None)
        fused.append((score, obj))
    fused.sort(key=lambda x: x[0], reverse=True)
    return fused[:top_k]

# --- Reranker (optional) ---
reranker = None
if USE_ST and USE_RERANK:
    try:
        reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    except Exception as e:
        reranker = None
        USE_RERANK = False
        print("⚠️ Reranker unavailable, continuing without reranking. Error:", e)


def rerank(query: str, items, get_text, top_k=5):
    if reranker is None:
        return list(items)[:top_k]

    if not items:
        return []
    scores = reranker.predict([(query, get_text(it)) for it in items])
    ranked = sorted(zip(scores, items), key=lambda x: x[0], reverse=True)
    return [it for _, it in ranked[:top_k]]

def retrieve_text(query: str, chunking: str = "page", method: str = "hybrid", top_k: int = 5, alpha: float = 0.5, use_rerank: bool = True):
    if chunking == "page":
        corpus, index, bm25 = TEXT_CORPUS_A, faiss_A, bm25_A
    else:
        corpus, index, bm25 = TEXT_CORPUS_B, faiss_B, bm25_B

    if method == "dense":
        res = dense_search(query, index, corpus, top_k=max(10, top_k))
        items = [it for _, it in res]
    elif method == "sparse":
        res = sparse_search(query, bm25, corpus, top_k=max(10, top_k))
        items = [it for _, it in res]
    else:
        d = dense_search(query, index, corpus, top_k=max(10, top_k))
        s = sparse_search(query, bm25, corpus, top_k=max(10, top_k))
        res = hybrid_fuse(d, s, alpha=alpha, top_k=max(10, top_k))
        items = [it for _, it in res]

    if use_rerank:
        return rerank(query, items, lambda it: it.text, top_k=top_k)
    return items[:top_k]

def retrieve_evidence(query: str, top_k: int = 3):
    res = dense_search(query, faiss_E, evidence_items, top_k=top_k)
    return [it for _, it in res]
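
`hybrid_fuse` above combines the two result lists by rank, not by raw score: each item gets `alpha * 1/dense_rank + (1 - alpha) * 1/sparse_rank`, and an item missing from one list is treated as ranked one past that list's end. The toy sketch below works through the arithmetic on plain string ids; `fuse_ranks` is a hypothetical stand-in, and its tie-break by id is an addition for determinism that the notebook function does not have.

```python
def fuse_ranks(dense_ids, sparse_ids, alpha=0.5):
    """Weighted reciprocal-rank fusion over two ranked id lists (best first)."""
    dense_rank = {d: r for r, d in enumerate(dense_ids, start=1)}
    sparse_rank = {s: r for r, s in enumerate(sparse_ids, start=1)}
    fused = {}
    for key in set(dense_rank) | set(sparse_rank):
        dr = dense_rank.get(key, len(dense_ids) + 1)     # missing: one past the end
        sr = sparse_rank.get(key, len(sparse_ids) + 1)
        fused[key] = alpha * (1.0 / dr) + (1 - alpha) * (1.0 / sr)
    # Highest fused score first; ties broken by id (an assumption of this sketch)
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))

dense = ["p1", "p2", "p3"]
sparse = ["p3", "p1", "p4"]
ranking = fuse_ranks(dense, sparse)
for key, score in ranking:
    print(key, round(score, 4))
# p1: 0.5 * 1/1 + 0.5 * 1/2 = 0.75
# p3: 0.5 * 1/3 + 0.5 * 1/1 ≈ 0.6667
```

Because only ranks enter the formula, BM25 scores and cosine scores never need to be put on a common scale. Setting `alpha=1.0` reproduces the dense order, with sparse-only items trailing.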




## 8) Evidence pack + citations (product output)


In [None]:
def cite_text(it): return f"[{it.doc_id} p{it.page_num}]"
def cite_fig(ei): return f"[{os.path.splitext(ei.source)[0]}]"

def build_evidence_pack(question: str, chunking="page", method="hybrid", top_k_text=4, top_k_fig=2):
    txt = retrieve_text(question, chunking=chunking, method=method, top_k=top_k_text, use_rerank=True)
    figs = retrieve_evidence(question, top_k=top_k_fig)
    pack = []
    for it in txt:
        pack.append({"type":"text", "cite": cite_text(it), "content": it.text[:800]})
    for ei in figs:
        pack.append({"type":"figure", "cite": cite_fig(ei), "content": (ei.evidence_text or "")[:800], "path": ei.image_path})
    return pack

ep = build_evidence_pack(QUERIES[0]["question"])
for e in ep:
    print(e["cite"], e["type"], e["content"][:120])



[umkc-volker-campus-map.pdf p1] text Brush Creek Brush Creek Kauffman Legacy Lake 14 5 24 32S 32E A44 1 37 96M A24 32W 59 4 37 37 11 38 46 40 4 38 43 16 96 3
[umkc-volker-campus-map.pdf p2] text UMKC Volker campus buildings and addresses ADMINISTRATIVE CENTER 5115 Oak St. AFRICAN AMERICAN HISTORY AND CULTURE HOUSE
[umkc-health-sciences-campus-map.pdf p2] text UMKC Health Sciences campus buildings and addresses CENTER FOR BEHAVIORAL MEDICINE 1000 E. 24th St. CHILDREN’S MERCY HOS
[umkc-health-sciences-campus-map.pdf p1] text 68M Reserved Dental Patient Parking 67 28 Faculty/Staff - Levels 2-3 Students - Levels 4-7 28W Reserved Dental Patient P
[4] figure Top Rankings Among the best in the Midwest The Princeton Review describes UMKC as "academically outstanding and well wor
[3] figure Professional Career Escalators Through Professional Career Escalators, students explore careers in four growing industri


## 9) Grounded response (LLM/VLM) — connect Gemini/HF if available


In [None]:
from typing import Optional
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # fast

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

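generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=False,  # greedy decoding; temperature does not apply here
    pad_token_id=tokenizer.pad_token_id
)

def generate_answer(prompt: str, image_paths: Optional[list] = None) -> str:
    # image_paths is accepted for interface compatibility; this text-only model ignores it.
    # return_full_text=False returns only the completion, so the prompt need not be stripped.
    out = generator(prompt, max_new_tokens=160, return_full_text=False)
    return out[0]["generated_text"].strip()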




## 10) Demo loop (stakeholder-facing)


In [None]:
import re

def rag_prompt(question: str, evidence_pack: list) -> str:
    evidence_lines = []
    for e in (evidence_pack or []):
        cite = e.get("cite", "[unknown]")
        content = re.sub(r"\s+", " ", str(e.get("content",""))).strip()
        if content:
            evidence_lines.append(f"{cite} {content}")

    evidence_block = "\n\n".join(evidence_lines)

    return f"""You are a grounded assistant. Use ONLY the evidence below.
Every key claim must cite evidence like [doc p#] or [fig1].
If the evidence is insufficient, respond exactly:
Not enough evidence in the retrieved context.

Evidence:
{evidence_block}

Question:
{question}

Answer (with citations):
"""


In [None]:
def demo_one(question: str, chunking="page", method="hybrid"):
    ep = build_evidence_pack(question, chunking=chunking, method=method) or []
    prompt = rag_prompt(question, ep)

    image_paths = []
    for e in ep:
        if e.get("type") == "figure" and e.get("path"):
            image_paths.append(e["path"])

    ans = generate_answer(prompt, image_paths=image_paths)
    return ep, ans

for q in QUERIES:
    ep, ans = demo_one(q["question"], chunking="page", method="hybrid")

    print("\n=== ", q["id"], " ===")
    print("Q:", q["question"])
    print("Top evidence citations:", [e.get("cite") for e in ep])
    print("Answer:", (ans or "")[:500])





===  Q1  ===
Q: Which shuttle bus station is closest to Plaster Hall according to the campus map and shuttle route information?
Top evidence citations: ['[umkc-volker-campus-map.pdf p1]', '[umkc-volker-campus-map.pdf p2]', '[umkc-health-sciences-campus-map.pdf p2]', '[umkc-health-sciences-campus-map.pdf p1]', '[4]', '[3]']
Answer: According to the provided campus maps and shuttle route information, the closest shuttle bus station to Plaster Hall is the one located at 5115 Oak St., which serves the Administrative Center building. This location is approximately 1 mile away from Plaster Hall based on the distance between these two points on the campus map. [umkc-volker-campus-map.pdf p2]

The shuttle route tracking page also mentions that the UMKC streetcar stop is just steps away from UMKC's Volker Campus, located at 51 Oak





===  Q2  ===
Q: Which parking lot is closest to Haag Hall, and how much does a student pay for a semester parking permit for that lot?
Top evidence citations: ['[Student Permits - Parking Options - Parking _ University of Missouri-Kansas City.pdf p2]', '[Student Permits - Parking Options - Parking _ University of Missouri-Kansas City.pdf p3]', '[Student Permits - Parking Options - Parking _ University of Missouri-Kansas City.pdf p1]', '[Student Permits - Parking Options - Parking _ University of Missouri-Kansas City.pdf p4]', '[5]', '[1]']
Answer: The closest parking lot to Haag Hall is Area 32S, located at 52nd and Rockhill Road. A single-semester all-day accessible permit for this lot costs $135. For multi-semester permits, it varies depending on the duration; for example, a Fall & Spring permit costs $270, while a Fall, Spring, & Summer permit costs $338. These prices are valid until May 31st, 2026. Notably, these parking options are subject to change due to the transition to virtu




===  Q3  ===
Q: Will UMKC’s student enrollment decrease in 2026 compared to previous years?
Top evidence citations: ['[umkc-student-handbook.pdf p46]', '[umkc-student-handbook.pdf p56]', '[Student Permits - Parking Options - Parking _ University of Missouri-Kansas City.pdf p3]', '[umkc-student-handbook.pdf p13]', '[5]', '[4]']
Answer: Not enough evidence in the retrieved context. 

The provided information does not contain any data on UMKC's projected enrollment numbers for 2026 or any historical trends regarding student enrollment changes. To answer this question accurately, we would need specific enrollment figures or projections for 2026 from UMKC's official sources or recent reports. [umkc-student-handbook.pdf p13] mentions that students who believe the conduct of others has crossed the narrow line separating their First 
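
The prompt asks the model to respond with the refusal sentence exactly, yet the Q3 output above appends an explanation and a citation after it. A small post-processing step can enforce the contract and also flag citations that are not in the evidence pack. This is a minimal sketch under stated assumptions: `postprocess_answer` is a hypothetical helper, and treating any answer that starts with the refusal sentence as a refusal is a design choice, not something the notebook specifies.

```python
import re

REFUSAL = "Not enough evidence in the retrieved context."

def postprocess_answer(answer: str, allowed_cites: list) -> dict:
    """Enforce the exact refusal string and flag citations not in the evidence pack."""
    text = (answer or "").strip()
    if text.startswith(REFUSAL):
        # Drop anything the model appended after the refusal sentence.
        return {"answer": REFUSAL, "refused": True, "unknown_cites": []}
    allowed = set(allowed_cites)
    cited = re.findall(r"\[[^\]]+\]", text)          # every [bracketed] citation
    unknown = [c for c in cited if c not in allowed]  # cited but never retrieved
    return {"answer": text, "refused": False, "unknown_cites": unknown}

# Toy inputs modeled on the demo output format
r_refuse = postprocess_answer(REFUSAL + " \n\nThe provided information does not ...", ["[doc p1]"])
r_answer = postprocess_answer("Lot A is closest [doc p2] and costs more [doc p9].", ["[doc p2]"])
print(r_refuse["answer"])
print(r_answer["unknown_cites"])
```

In the demo loop, `allowed_cites` could be the `cite` values from the evidence pack, so a non-empty `unknown_cites` list marks an answer for manual review.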


In [None]:
import pandas as pd
eval_df = pd.DataFrame([
    {"QID":"Q1","Precision@5":"(fill)","Recall@10":"(fill)","Faithfulness":"Yes/Partial/No"},
    {"QID":"Q2","Precision@5":"(fill)","Recall@10":"(fill)","Faithfulness":"Yes/Partial/No"},
    {"QID":"Q3","Precision@5":None,"Recall@10":None,"Faithfulness":"Yes (Refusal)"}
])
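from IPython.display import display
display(eval_df)  # a bare name only renders when it is the cell's last expression

rows = []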
for mode in ["dense","sparse","hybrid"]:
    for q in QUERIES:
        ep = build_evidence_pack(q["question"], chunking="page", method=mode) or []
        rows.append({"Retrieval":mode,"QID":q["id"],"Top-3":[e.get("cite") for e in ep[:3]]})
pd.DataFrame(rows)



|   | Retrieval | QID | Top-3 |
|---|-----------|-----|-------|
| 0 | dense | Q1 | `[[umkc-health-sciences-campus-map.pdf p2], [um...` |
| 1 | dense | Q2 | `[[Student Permits - Parking Options - Parking ...` |
| 2 | dense | Q3 | `[[umkc-student-handbook.pdf p56], [umkc-studen...` |
| 3 | sparse | Q1 | `[[umkc-volker-campus-map.pdf p1], [umkc-volker...` |
| 4 | sparse | Q2 | `[[Student Permits - Parking Options - Parking ...` |
| 5 | sparse | Q3 | `[[Student Permits - Parking Options - Parking ...` |
| 6 | hybrid | Q1 | `[[umkc-volker-campus-map.pdf p1], [umkc-volker...` |
| 7 | hybrid | Q2 | `[[Student Permits - Parking Options - Parking ...` |
| 8 | hybrid | Q3 | `[[umkc-student-handbook.pdf p46], [umkc-studen...` |


In [None]:
# =========================
# Citation-only view (per question)
# =========================

def show_citations_for_queries(chunking="page", method="hybrid", top_k=10):
    for q in QUERIES:
        ep = build_evidence_pack(q["question"], chunking=chunking, method=method) or []

        print("\n" + "="*70)
        print(f'{q["id"]}: {q["question"]}')
        print("- Required must-cite:", q.get("must_cite", []))

        if not ep:
            print("No evidence retrieved.")
            continue

        # Split by type for multimodal visibility
        figs = [e for e in ep if e.get("type") == "figure"]
        docs = [e for e in ep if e.get("type") != "figure"]

        print(f"\nTop evidence (up to {top_k}):")
        for i, e in enumerate(ep[:top_k], 1):
            cite = e.get("cite", "[unknown]")
            etype = e.get("type", "doc")
            path = e.get("path", "")
            preview = (e.get("content", "") or "")[:120].replace("\n", " ")
            print(f"{i:02d}. {cite} | type={etype} | path={path} | preview={preview}")

        # Quick summary line (helps screenshots)
        print("\nSummary:")
        print("Doc citations:", [e.get("cite") for e in docs[:top_k]])
        print("Figure citations:", [e.get("cite") for e in figs[:top_k]])

show_citations_for_queries(chunking="page", method="hybrid", top_k=10)




Q1: Which shuttle bus station is closest to Plaster Hall according to the campus map and shuttle route information?
- Required must-cite: ['[map]', '[route]']

Top evidence (up to 10):
01. [umkc-volker-campus-map.pdf p1] | type=text | path= | preview=Brush Creek Brush Creek Kauffman Legacy Lake 14 5 24 32S 32E A44 1 37 96M A24 32W 59 4 37 37 11 38 46 40 4 38 43 16 96 3
02. [umkc-volker-campus-map.pdf p2] | type=text | path= | preview=UMKC Volker campus buildings and addresses ADMINISTRATIVE CENTER 5115 Oak St. AFRICAN AMERICAN HISTORY AND CULTURE HOUSE
03. [umkc-health-sciences-campus-map.pdf p2] | type=text | path= | preview=UMKC Health Sciences campus buildings and addresses CENTER FOR BEHAVIORAL MEDICINE 1000 E. 24th St. CHILDREN’S MERCY HOS
04. [umkc-health-sciences-campus-map.pdf p1] | type=text | path= | preview=68M Reserved Dental Patient Parking 67 28 Faculty/Staff - Levels 2-3 Students - Levels 4-7 28W Reserved Dental Patient P
05. [4] | type=figure | path=project_data_mm/fig



Q2: Which parking lot is closest to Haag Hall, and how much does a student pay for a semester parking permit for that lot?
- Required must-cite: ['[parking map]', '[parking table]']

Top evidence (up to 10):
01. [Student Permits - Parking Options - Parking _ University of Missouri-Kansas City.pdf p2] | type=text | path= | preview=Single-semester all-day accessible permit (24 hours per day): $135 for Fall, valid through January 19th, 2026 Multi-seme
02. [Student Permits - Parking Options - Parking _ University of Missouri-Kansas City.pdf p3] | type=text | path= | preview=Permit sales for the Hospital Hill Apartments parking garage are limited, and one permit will be sold for every space. N
03. [Student Permits - Parking Options - Parking _ University of Missouri-Kansas City.pdf p1] | type=text | path= | preview=UNIVERSITY OF MISSOURI-KANSAS CITY Parking Home / Parking Options / Student Permits Section Navigation STUDENT PERMITS S
04. [Student Permits - Parking Options - Parking _ Unive



Q3: Will UMKC’s student enrollment decrease in 2026 compared to previous years?
- Required must-cite: []

Top evidence (up to 10):
01. [umkc-student-handbook.pdf p46] | type=text | path= | preview=Return to Table of Contents P a g e 46 The UMKC O365 email account is the used for University business and all official 
02. [umkc-student-handbook.pdf p56] | type=text | path= | preview=Return to Table of Contents P a g e 56 Connect with the UMKC Student HelpLine. Student Union Built by students, for stud
03. [Student Permits - Parking Options - Parking _ University of Missouri-Kansas City.pdf p3] | type=text | path= | preview=Permit sales for the Hospital Hill Apartments parking garage are limited, and one permit will be sold for every space. N
04. [umkc-student-handbook.pdf p13] | type=text | path= | preview=Return to Table of Contents P a g e 13 students inside the classroom and beyond, without fear that their exercise of suc
05. [5] | type=figure | path=project_data_mm/figures/5.png | p

In [None]:
# =========================
# ONE-CELL: Metrics + Rerank vs No-Rerank (actual numbers)
# =========================
import re
import pandas as pd

REFUSAL = "Not enough evidence in the retrieved context."

def extract_bracket_cites(text: str):
    return re.findall(r"\[[^\]]+\]", text or "")

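def precision_at_k(must_list, retrieved_cites, k=5):
    """Fraction of the top-k cites that contain at least one must-cite tag.

    Tags are matched as verbatim, case-insensitive substrings (brackets included).
    """
    if not must_list:
        return None  # no must-cite tags (e.g., a refusal task): metric is undefined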
    topk = retrieved_cites[:k]
    if not topk:
        return 0.0
    hits = sum(
        1 for c in topk
        if any(m.lower() in (c or "").lower() for m in must_list)
    )
    return hits / len(topk)

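def recall_at_k(must_list, retrieved_cites, k=10):
    """All-or-nothing recall: 1.0 only if every must-cite tag appears in the top-k cites."""
    if not must_list:
        return None  # no must-cite tags (e.g., a refusal task): metric is undefined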
    topk = retrieved_cites[:k]
    ok = all(
        any(m.lower() in (c or "").lower() for c in topk)
        for m in must_list
    )
    return 1.0 if ok else 0.0

def faithfulness_label(q, ep, ans):
    """Heuristic Yes/Partial/No label based on bracketed citations in the answer."""
    must = q.get("must_cite", []) or []
    retrieved_cites = [e.get("cite","") for e in (ep or [])]

    # Refusal task: must match exactly
    if not must:
        return "Yes" if (ans or "").strip() == REFUSAL else "No"

    ans_cites = extract_bracket_cites(ans)
    if not ans_cites:
        return "No"

    rc = " ".join(retrieved_cites).lower()
    cited_ok = all(c.lower() in rc for c in ans_cites)
    must_ok = all(m.lower() in rc for m in must)

    if cited_ok and must_ok:
        return "Yes"
    if cited_ok or must_ok:
        return "Partial"
    return "No"

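def build_ep_optional_rerank(question, chunking="page", method="hybrid", rerank=False):
    """Build an evidence pack, requesting reranking only if build_evidence_pack supports it.

    Caveat: if neither the `rerank` kwarg nor a rerank method name is accepted, this
    silently falls back to the baseline, so "rerank" rows can equal "no_rerank" rows.
    """
    # Try rerank kwarg if supported
    try:
        return build_evidence_pack(question, chunking=chunking, method=method, rerank=rerank) or []
    except TypeError:
        # Usually means `rerank` is not an accepted keyword argument
        pass
    # Try method variants for rerank
    if rerank:
        for m in [f"{method}_rerank", "hybrid_rerank", "rerank", "hybrid+rerank"]:
            try:
                ep = build_evidence_pack(question, chunking=chunking, method=m) or []
                if ep:
                    return ep
            except Exception:
                continue
    # Fallback baseline (no reranking applied)
    return build_evidence_pack(question, chunking=chunking, method=method) or []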

rows = []
for setting, rr in [("no_rerank", False), ("rerank", True)]:
    for q in QUERIES:
        ep = build_ep_optional_rerank(q["question"], chunking="page", method="hybrid", rerank=rr)
        prompt = rag_prompt(q["question"], ep)
        ans = generate_answer(prompt)

        retrieved_cites = [e.get("cite","") for e in ep]
        rows.append({
            "Setting": setting,
            "QID": q["id"],
            "P@5": precision_at_k(q.get("must_cite", []), retrieved_cites, k=5),
            "R@10": recall_at_k(q.get("must_cite", []), retrieved_cites, k=10),
            "Faithfulness": faithfulness_label(q, ep, ans),
            "Top-3 cites": retrieved_cites[:3],
            "Answer preview": (ans or "")[:140].replace("\n"," ")
        })

results_df = pd.DataFrame(rows)
results_df



Both `max_new_tokens` (=160) and `max_length`(=2048) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)




| | Setting | QID | P@5 | R@10 | Faithfulness | Top-3 cites | Answer preview |
|---|---|---|---|---|---|---|---|
| 0 | no_rerank | Q1 | 0.0 | 0.0 | Partial | [[umkc-volker-campus-map.pdf p1], [umkc-volker... | According to the provided campus maps and shut... |
| 1 | no_rerank | Q2 | 0.0 | 0.0 | Partial | [[Student Permits - Parking Options - Parking ... | The closest parking lot to Haag Hall is Area 3... |
| 2 | no_rerank | Q3 | | | No | [[umkc-student-handbook.pdf p46], [umkc-studen... | Not enough evidence in the retrieved context. ... |
| 3 | rerank | Q1 | 0.0 | 0.0 | Partial | [[umkc-volker-campus-map.pdf p1], [umkc-volker... | According to the provided campus maps and shut... |
| 4 | rerank | Q2 | 0.0 | 0.0 | Partial | [[Student Permits - Parking Options - Parking ... | The closest parking lot to Haag Hall is Area 3... |
| 5 | rerank | Q3 | | | No | [[umkc-student-handbook.pdf p46], [umkc-studen... | Not enough evidence in the retrieved context. ... |
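
One likely reason P@5 and R@10 read 0.0 for Q1 and Q2: the must-cite tags (`[map]`, `[route]`) carry their square brackets into the substring test, so the metric looks for the literal text `[map]` inside a cite such as `[umkc-volker-campus-map.pdf p1]` and never finds it. The sketch below shows a bracket-insensitive match. `normalize_tag`, `cite_matches`, and `precision_at_k_normalized` are hypothetical helper names, not part of the notebook, and treating a tag as a bare keyword is an assumption about how tags should map to file names.

```python
def normalize_tag(tag: str) -> str:
    """Lowercase a tag and drop surrounding brackets/whitespace: '[Map]' -> 'map'."""
    return (tag or "").strip().strip("[]").strip().lower()


def cite_matches(must_tag: str, cite: str) -> bool:
    """True if the normalized must-cite keyword appears inside the cite string."""
    key = normalize_tag(must_tag)
    return bool(key) and key in (cite or "").lower()


def precision_at_k_normalized(must_list, retrieved_cites, k=5):
    """Same idea as precision_at_k, but matching on bracket-free keywords."""
    if not must_list:
        return None
    topk = retrieved_cites[:k]
    if not topk:
        return 0.0
    hits = sum(1 for c in topk if any(cite_matches(m, c) for m in must_list))
    return hits / len(topk)


# Cites copied from the Q1 evidence listing above.
q1_cites = [
    "[umkc-volker-campus-map.pdf p1]",
    "[umkc-volker-campus-map.pdf p2]",
    "[umkc-health-sciences-campus-map.pdf p2]",
    "[umkc-health-sciences-campus-map.pdf p1]",
    "[4]",
]
print(precision_at_k_normalized(["[map]", "[route]"], q1_cites, k=5))
```

Keyword matching is deliberately loose: it counts the Health Sciences map as a hit for `[map]`, and a multi-word tag such as `[parking map]` still will not match a file name. An explicit tag-to-document mapping would probably be more reliable for a real evaluation.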


## 11) Week 3 acceptance tests (CS 5588)
Fill in after running your demo:
- Does the evidence pack include the must-cite items for Q1/Q2?
- Does Q3 properly refuse with “Not enough evidence…”?
- Is the output understandable to your target user?
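
The Q3 check can be automated. The metrics cell counts Q3 as faithful only when the whole answer equals the refusal string, and the recorded Q3 preview begins with that string but appears to continue past it, which is consistent with its `No` label. The sketch below separates a strict check from a lenient prefix check so you can see which one a run fails. `is_refusal` and its `strict` flag are hypothetical names; the refusal string is copied from the metrics cell.

```python
REFUSAL = "Not enough evidence in the retrieved context."


def is_refusal(answer: str, strict: bool = True) -> bool:
    """Check whether an answer is a refusal.

    strict=True  -> the answer must equal REFUSAL exactly (after trimming whitespace).
    strict=False -> the answer only has to start with REFUSAL.
    """
    text = (answer or "").strip()
    if strict:
        return text == REFUSAL
    return text.startswith(REFUSAL)


# An answer that refuses but keeps talking passes only the lenient check.
sample = REFUSAL + " Some extra text the model added."
print(is_refusal(sample, strict=True), is_refusal(sample, strict=False))
```

If the strict check fails while the lenient one passes, the fix is more likely in the prompt or in post-processing (for example, truncating after the refusal sentence) than in retrieval.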


In [None]:
ACCEPTANCE_CHECKLIST = [
    {
        "qid": "Q1",
        "must_cite_expected": "[map] + [route]",
        "pass_fail": "TODO",
        "notes": "PASS if answer names a stop + cites map/route."
    },
    {
        "qid": "Q2",
        "must_cite_expected": "[parking map] + [parking table] (with numeric price)",
        "pass_fail": "TODO",
        "notes": "PASS if nearest lot + semester price + table citation."
    },
    {
        "qid": "Q3",
        "must_cite_expected": "(none) — should refuse",
        "pass_fail": "TODO",
        "notes": "PASS only if exact refusal string is returned."
    },
]
ACCEPTANCE_CHECKLIST


[{'qid': 'Q1',
  'must_cite_expected': '[map] + [route]',
  'pass_fail': 'TODO',
  'notes': 'PASS if answer names a stop + cites map/route.'},
 {'qid': 'Q2',
  'must_cite_expected': '[parking map] + [parking table] (with numeric price)',
  'pass_fail': 'TODO',
  'notes': 'PASS if nearest lot + semester price + table citation.'},
 {'qid': 'Q3',
  'must_cite_expected': '(none) — should refuse',
  'pass_fail': 'TODO',
  'notes': 'PASS only if exact refusal string is returned.'}]

## 11.5) Team work items (project enhancement)

Use this hands-on to **advance your semester project**. Each team member should “own” at least one deliverable below.

**Product Lead (Applicability)**
- Update your project **persona + workflow** so the multimodal RAG module is a *core feature*, not an add-on.
- Write 3 stakeholder tasks that map to your product’s real decision points (2 require text+figure evidence, 1 must refuse).

**Systems Lead (Integration)**
- Replace the toy dataset with your **project-domain PDFs + figures**.
- Add **metadata fields** that matter to your domain (e.g., policy date, version, department, study cohort, device model).
- Implement a clean **`retrieve()` API** your final demo can reuse.
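
A minimal sketch of what a reusable `retrieve()` API with domain metadata could look like. Everything here is an assumption for illustration: the `Evidence` dataclass, the `backend` callable (which in this notebook might wrap `build_evidence_pack`), and the `text`, `score`, and `meta` keys on each evidence dict. Only the `cite` key is known from the cells above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Evidence:
    cite: str
    text: str = ""
    score: float = 0.0
    meta: Dict[str, Any] = field(default_factory=dict)


def retrieve(
    question: str,
    backend: Callable[[str], List[Dict[str, Any]]],
    top_k: int = 5,
    filters: Optional[Dict[str, Any]] = None,
) -> List[Evidence]:
    """Run the backend, keep items whose metadata matches `filters`, return the top_k."""
    results: List[Evidence] = []
    for item in backend(question) or []:
        meta = dict(item.get("meta") or {})
        # Drop items that disagree with any requested metadata value.
        if filters and any(meta.get(key) != value for key, value in filters.items()):
            continue
        results.append(
            Evidence(
                cite=item.get("cite", ""),
                text=item.get("text", ""),
                score=float(item.get("score", 0.0)),
                meta=meta,
            )
        )
    # Stable sort: items with equal scores keep the backend's original order.
    results.sort(key=lambda e: e.score, reverse=True)
    return results[:top_k]


# Hypothetical backend returning hand-made items, for demonstration only.
def fake_backend(question: str) -> List[Dict[str, Any]]:
    return [
        {"cite": "[policy.pdf p1]", "text": "old rule", "score": 0.4, "meta": {"version": "2024"}},
        {"cite": "[policy.pdf p2]", "text": "new rule", "score": 0.9, "meta": {"version": "2025"}},
    ]


hits = retrieve("What is the current rule?", fake_backend, top_k=5, filters={"version": "2025"})
print([h.cite for h in hits])
```

Passing the backend in as a callable keeps the API independent of a particular index, so the same function could serve both the notebook and a later UI.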

**Evaluation & Risk Lead (Shipping readiness)**
- Build a tiny evaluation table: *Task × Method × P@5 × R@10 × Faithfulness*.
- Add one real failure scenario + mitigation UX (warnings, “show evidence” first, or human-in-the-loop flag).
- Draft the “If we shipped this” plan: data refresh, monitoring, and a governance rule.

**Bonus (Optional)**
- Add a minimal UI (Gradio/Streamlit) that shows: question → evidence pack → answer with citations.


## 12) Week 3 deliverables (CS 5588)
- Product Brief completed (persona, problem, value, success metrics)
- Demo run for Q1–Q3 with citations (screenshots encouraged)
- 1 failure case + mitigation plan (risk + fix)
- Repo link submitted in the survey
