# CS 5542 - Lab 4 Notebook (Team Project)
## RAG Application Integration, Deployment, and Monitoring (Deadline: Feb. 12, 2026)

**Purpose:** This notebook is a **project-aligned template** for Lab 4. Your team should reuse your Lab-3 multimodal RAG pipeline and integrate it into a **deployable application** with **automatic logging** and **failure analysis**.

### Submission policy
- **Survey:** submitted **individually**
- **Deliverables (GitHub repo / notebook / report / deployment link):** submitted **as a team**

### Team-size requirement
- **1–2 students:** Base requirements + **1 extension**
- **3–4 students:** Base requirements + **2–3 extensions**

---

## What you will build (at minimum)
1. A **Streamlit app** that accepts a question and returns:
   - an **answer**
   - **retrieved evidence** with citations
   - **metrics panel** (latency, P@5, R@10 if applicable)
2. An **automatic logger** that appends to: `logs/query_metrics.csv`
3. A **mini gold set** of **5 project queries** (Q1–Q5) for evaluation
4. **Two failure cases** with root cause + proposed fix

> **Important:** Lab 4 focuses on **application integration and deployment**, not on redesigning retrieval. Prefer reusing your Lab-3 modules.
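
The logging and latency requirements above can be sketched as follows. This is a minimal illustration, not the required implementation: the column names in `LOG_COLUMNS` and the helper names `append_query_log` and `timed_call` are assumptions chosen for this example, and your team's rubric may call for different fields.

```python
import csv
import time
from pathlib import Path

# Hypothetical column set (an assumption); adjust to match your rubric.
LOG_COLUMNS = [
    "timestamp", "query_id", "question",
    "latency_ms", "p_at_5", "r_at_10", "top_evidence_ids",
]

def append_query_log(log_path, row):
    """Append one row to the CSV log, writing the header if the file is new or empty."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_COLUMNS)
        if write_header:
            writer.writeheader()
        # Keys missing from `row` are written as empty strings.
        writer.writerow({col: row.get(col, "") for col in LOG_COLUMNS})

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, latency in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return result, latency_ms
```

A typical use would wrap the retrieval call in `timed_call` and pass the measured latency, together with the query ID and metric values, to `append_query_log("logs/query_metrics.csv", row)`.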

---

## Recommended repository structure (for your team repo)
```
/app/              # Streamlit UI (required)
/rag/              # Retrieval + indexing modules (reuse from Lab 3)
/logs/             # query_metrics.csv (auto-created)
/data/             # your project-aligned PDFs/images (do NOT commit large/private data)
/api/              # optional FastAPI backend (extension)
/notebooks/        # this notebook
requirements.txt
README.md
```

---

## Contents of this notebook
1. Setup & environment checks  
2. Project dataset wiring (connect your Lab-3 ingestion)  
3. Mini gold set (Q1–Q5)  
4. Retrieval + answer function (reuse your Lab-3 pipeline)  
5. Evaluation + logging (required)  
6. Streamlit app skeleton (required)  
7. Optional extension: FastAPI backend  
8. Deployment checklist + failure analysis template


In [11]:
from google.colab import files
uploaded = files.upload()


Saving data.zip to data (1).zip


In [16]:
import shutil, os, zipfile
from pathlib import Path

zip_path = "data.zip"

# Remove old extracted folder if it exists (prevents stale files)
if Path("data").exists():
    shutil.rmtree("data")

# Extract fresh
with zipfile.ZipFile(zip_path, "r") as z:
    z.extractall(".")

print("✅ Extracted fresh")
print("Docs now:", sorted(os.listdir("data/docs")) if Path("data/docs").exists() else "NO data/docs folder")
print("Images now:", sorted(os.listdir("data/images")) if Path("data/images").exists() else "NO data/images folder")


✅ Extracted fresh
Docs now: ['doc1.pdf', 'doc2.pdf', 'doc3.pdf', 'doc4.pdf', 'doc5.pdf']
Images now: ['cyber_kill_chain.png', 'impact_likelihood_matrix.png', 'network_system_security.png', 'nist_framework.png', 'risk_management.png', 'risk_management_process.png', 'security_audit_process.png', 'soc2_requirements.png', 'website_security_audit.png', 'zero_trust.png']


In [19]:
from pathlib import Path

docs_dir = Path("data/docs")
docs_dir.mkdir(parents=True, exist_ok=True)

# Only create numeric demo if explicitly needed
create_demo_numeric = False   # 🔴 Change to True only if required by lab rubric

if create_demo_numeric:
    numeric_path = docs_dir / "numeric_demo.txt"
    numeric_path.write_text(
        "Fusion Hyperparameters (Table 1)\n"
        "alpha = 0.50\n"
        "top_k = 5\n"
        "missing_evidence_score_threshold = 0.05\n"
        "latency_alert_ms = 2000\n",
        encoding="utf-8"
    )
    print("✅ Created numeric demo file")
else:
    print("ℹ️ Skipping numeric demo file creation (not in dataset)")


ℹ️ Skipping numeric demo file creation (not in dataset)


In [21]:
# Sanity check: ensure PDF docs are loaded
import os, glob

doc_files = glob.glob('./data/docs/*.pdf')
print("Found PDF docs:", len(doc_files))

assert len(doc_files) > 0, "No PDF docs found. Ensure the ZIP was extracted correctly."

# Preview first PDF filename (we can't directly print PDF text yet)
print("First document:", os.path.basename(doc_files[0]))


Found PDF docs: 5
First document: doc1.pdf


In [36]:
!pip -q install PyPDF2


In [35]:
!pip -q install pycryptodome

In [37]:
!pip -q install pycryptodome cryptography


In [38]:
import Crypto
print("Crypto OK:", Crypto.__version__)


Crypto OK: 3.23.0


In [39]:
import glob, os, sys, subprocess

# Ensure dependencies
def pip_install(pkg):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

try:
    from PyPDF2 import PdfReader
except ModuleNotFoundError:
    pip_install("PyPDF2")
    from PyPDF2 import PdfReader

# Needed for AES-encrypted PDFs
try:
    import Crypto  # noqa: F401
except ModuleNotFoundError:
    pip_install("pycryptodome")

DOC_DIR = "./data/docs"
doc_files = sorted(glob.glob(os.path.join(DOC_DIR, "*.pdf")))

if not doc_files:
    raise RuntimeError("No .pdf documents found in ./data/docs. Ensure the ZIP was extracted.")

documents = []
skipped = []

for p in doc_files:
    try:
        reader = PdfReader(p)

        # If encrypted, try empty password
        if getattr(reader, "is_encrypted", False):
            try:
                reader.decrypt("")  # common case: encrypted but no password
            except Exception as e:
                skipped.append((os.path.basename(p), f"Encrypted (decrypt failed): {e}"))
                continue

        text_parts = []
        for page in reader.pages:
            extracted = page.extract_text()
            if extracted:
                text_parts.append(extracted)

        txt = "\n".join(text_parts).strip()
        if not txt:
            skipped.append((os.path.basename(p), "No extractable text"))
            continue

        documents.append({"doc_id": os.path.basename(p), "source": p, "text": txt})

    except Exception as e:
        skipped.append((os.path.basename(p), str(e)))

print("✅ Loaded documents:", len(documents))
print("⚠️ Skipped:", len(skipped))
for name, reason in skipped[:10]:
    print(" -", name, "->", reason)

if documents:
    print("\nExample doc_id:", documents[0]["doc_id"])
    print(documents[0]["text"][:300])


✅ Loaded documents: 4
⚠️ Skipped: 1
 - doc5.pdf -> PyCryptodome is required for AES algorithm

Example doc_id: doc1.pdf
State Data Breach Notification Laws
Prepared by Foley’s Cybersecurity & Data Privacy Team
FOR INFORMATIONAL PURPOSES ONLY 
4824-6127-3219.41 1 ■Exceptions based on compliance with other laws, such as 
the Health Insurance Portability and Accountability Act 
(HIPAA) or Gramm-Leach-Bliley Act (GLBA).



In [40]:
import sys, site

print("Python:", sys.version)
print("Executable:", sys.executable)
print("Site-packages:", site.getsitepackages())

try:
    import Crypto
    from Crypto.Cipher import AES
    print("✅ pycryptodome works. Crypto version:", Crypto.__version__)
except Exception as e:
    print("❌ AES crypto NOT available:", repr(e))


Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
Executable: /usr/bin/python3
Site-packages: ['/usr/local/lib/python3.12/dist-packages', '/usr/lib/python3/dist-packages', '/usr/lib/python3.12/dist-packages']
✅ pycryptodome works. Crypto version: 3.23.0


In [41]:
!python -m pip install -U pip
!python -m pip install -U pycryptodome


Collecting pip
  Downloading pip-26.0.1-py3-none-any.whl.metadata (4.7 kB)
Downloading pip-26.0.1-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 35.2 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-26.0.1


In [42]:
!python -m pip install -q pikepdf

import os
import pikepdf

src = "./data/docs/doc5.pdf"
dst = "./data/docs/doc5_decrypted.pdf"

# Try opening with empty password (common case)
with pikepdf.open(src, password="") as pdf:
    pdf.save(dst)

print("✅ Decrypted copy saved to:", dst)
print("Now you should load doc5_decrypted.pdf instead of doc5.pdf")


✅ Decrypted copy saved to: ./data/docs/doc5_decrypted.pdf
Now you should load doc5_decrypted.pdf instead of doc5.pdf


In [43]:
import glob, os
from PyPDF2 import PdfReader

DOC_DIR = "./data/docs"
doc_files = sorted(glob.glob(os.path.join(DOC_DIR, "*.pdf")))

# If decrypted version exists, prefer it over the original
if os.path.exists(os.path.join(DOC_DIR, "doc5_decrypted.pdf")):
    doc_files = [p for p in doc_files if not p.endswith("doc5.pdf")]

documents = []
skipped = []

for p in doc_files:
    try:
        reader = PdfReader(p)
        text_parts = []
        for page in reader.pages:
            t = page.extract_text()
            if t:
                text_parts.append(t)
        txt = "\n".join(text_parts).strip()
        if not txt:
            skipped.append((os.path.basename(p), "No extractable text"))
            continue
        documents.append({"doc_id": os.path.basename(p), "source": p, "text": txt})
    except Exception as e:
        skipped.append((os.path.basename(p), str(e)))

print("✅ Loaded documents:", len(documents))
print("⚠️ Skipped:", skipped)


✅ Loaded documents: 5
⚠️ Skipped: []


In [29]:
# Load demo images and create lightweight text surrogates (captions) for multimodal retrieval
import glob, os

IMG_DIR = './data/images'
img_files = sorted(glob.glob(os.path.join(IMG_DIR, '*.*')))
img_files = [p for p in img_files if p.lower().endswith(('.png','.jpg','.jpeg','.webp'))]

# Minimal captions so images participate in retrieval without requiring a vision encoder
IMAGE_CAPTIONS = {
    'rag_pipeline.png': 'RAG pipeline diagram: ingest, chunk, index, retrieve top-k evidence, build context, generate grounded answer, log metrics for monitoring.',
    'retrieval_modes.png': 'Retrieval modes diagram: BM25 keyword, vector semantic, hybrid fusion, multi-hop hop-1 to hop-2 refinement.',
}

images = []
for p in img_files:
    fid = os.path.basename(p)
    # Fall back to the filename stem (any extension) when no caption is defined
    cap = IMAGE_CAPTIONS.get(fid, os.path.splitext(fid)[0].replace('_', ' '))
    images.append({'img_id': fid, 'source': p, 'text': cap})

print('✅ Loaded images:', len(images))
if images:
    print('Example image:', images[0]['img_id'])
    print('Caption:', images[0]['text'])

# Unified evidence store used by retrieval (text + images)
items = []
for d in documents:
    items.append({
        'evidence_id': d.get('doc_id') or os.path.basename(d.get('source','')),
        'modality': 'text',
        'source': d.get('source'),
        'text': d.get('text','')
    })
for im in images:
    items.append({
        'evidence_id': f"img::{im['img_id']}",
        'modality': 'image',
        'source': im.get('source'),
        'text': im.get('text','')
    })

assert len(items) > 0, 'Evidence store is empty.'
print('✅ Unified evidence items:', len(items), '(text:', len(documents), ', images:', len(images), ')')


✅ Loaded images: 10
Example image: cyber_kill_chain.png
Caption: cyber kill chain
✅ Unified evidence items: 14 (text: 4 , images: 10 )


# 1) Setup & environment checks

This notebook includes **safe defaults** and **lightweight code examples**.  
Replace the placeholder pieces with your Lab-3 implementation (PDF parsing, OCR, multimodal evidence, hybrid retrieval, reranking).

### Install dependencies (edit as needed)
- Core: `streamlit`, `pandas`, `numpy`, `requests`
- Optional: `fastapi`, `uvicorn` (if you do the FastAPI extension)
- Retrieval examples: `scikit-learn` (TF-IDF baseline), optionally `sentence-transformers` (dense embeddings)

> In your team repo, always keep a clean `requirements.txt` for reproducibility.


In [44]:
# If running in Colab or fresh environment, uncomment installs:
# !pip -q install streamlit pandas numpy requests scikit-learn
# # Optional (FastAPI extension):
# !pip -q install fastapi uvicorn pydantic
# # Optional (dense retrieval):
# !pip -q install sentence-transformers

import os, json, time
from pathlib import Path
import pandas as pd
import numpy as np

print("Python OK. Working directory:", os.getcwd())


Python OK. Working directory: /content


# 2) Project paths + configuration

Set your project data paths and key parameters here.

- Do **not** hardcode secrets (API keys) in notebooks or repos.
- If you use a hosted LLM, read from environment variables locally.

**Tip:** Keep these settings mirrored in `rag/config.py` so your Streamlit app uses the same config.
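
As a minimal sketch of the no-hardcoded-secrets rule, a key can be read from the environment at runtime. The variable name `LLM_API_KEY` and the helper `get_api_key` are hypothetical names for this example; use whatever your hosted LLM provider and deployment platform expect.

```python
import os

def get_api_key(var_name="LLM_API_KEY", env=None):
    """Return the API key from the environment, or None if it is unset or blank.

    `var_name` is a hypothetical default; `env` can be injected for testing.
    """
    env = os.environ if env is None else env
    value = env.get(var_name, "").strip()
    return value or None
```

Returning `None` instead of raising lets the app fall back to an offline stub (for example, the template's non-LLM answer function) when no key is configured.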


In [45]:
from dataclasses import dataclass

@dataclass
class Lab4Config:
    project_name: str = "YOUR_PROJECT_NAME"
    data_dir: str = "./data"        # where your PDFs/images live locally
    logs_dir: str = "./logs"
    log_file: str = "./logs/query_metrics.csv"
    top_k_default: int = 10
    eval_p_at: int = 5
    eval_r_at: int = 10

cfg = Lab4Config()
Path(cfg.logs_dir).mkdir(parents=True, exist_ok=True)
print(cfg)


Lab4Config(project_name='YOUR_PROJECT_NAME', data_dir='./data', logs_dir='./logs', log_file='./logs/query_metrics.csv', top_k_default=10, eval_p_at=5, eval_r_at=10)


# 3) Dataset wiring (project-aligned)

For Lab 4, your **data, application UI, and models** must be aligned to your team project.

## Required (project-aligned)
- 2–6 PDFs
- 5–15 images/figures/tables (if your project is multimodal)

## In Lab 3 you likely had:
- PDF text extraction (PyMuPDF)
- OCR / captions for figures or scanned pages
- Chunking + indexing (dense/sparse/hybrid)
- Reranking (optional)
- Grounded answer generation with citations

### What to do here
1. Point this notebook to your dataset folder.
2. Load *already-prepared* chunks/evidence from Lab 3 (recommended), OR
3. Call your Lab-3 ingestion function to rebuild the index.

Below is a **minimal example** that loads the PDFs in `./data/docs` as “documents” (one record per file, extracted with PyPDF2) so the notebook runs end to end.
Replace it with your Lab-3 ingestion code.
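
For option 2 (loading already-prepared chunks), one possible shape is a JSON Lines file with one chunk per line. The sketch below assumes that format and the keys `doc_id`, `source`, and `text`; both the file layout and the helper name `load_chunks_jsonl` are assumptions for illustration, not something Lab 3 prescribes.

```python
import json
from pathlib import Path

# Assumed record schema; mirrors the keys used by the loader in this notebook.
REQUIRED_KEYS = ("doc_id", "source", "text")

def load_chunks_jsonl(path):
    """Load one JSON object per line.

    Blank lines are ignored. Records missing a required key, or with
    empty text, are counted in `skipped` rather than loaded.
    """
    chunks, skipped = [], 0
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if all(k in record for k in REQUIRED_KEYS) and str(record["text"]).strip():
                chunks.append(record)
            else:
                skipped += 1
    return chunks, skipped
```

Loading prepared chunks this way avoids re-running PDF extraction and OCR on every app start, which usually matters for deployment latency.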


In [47]:
# Minimal runnable loader for YOUR dataset (PDFs in ./data/docs)
from pathlib import Path
import sys, subprocess

docs_dir = Path("data") / "docs"
docs_dir.mkdir(parents=True, exist_ok=True)

# --- install deps if missing ---
def pip_install(pkg: str):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

try:
    from PyPDF2 import PdfReader
except ModuleNotFoundError:
    pip_install("PyPDF2")
    from PyPDF2 import PdfReader

# Optional: try to decrypt AES PDFs more reliably
# (If it still doesn't work, we'll skip encrypted PDFs)
try:
    import Crypto  # noqa: F401
except ModuleNotFoundError:
    try:
        pip_install("pycryptodome")
    except Exception:
        pass  # okay if install fails; we'll still load non-encrypted PDFs

def load_pdf_docs(docs_path: Path):
    items = []
    skipped = []

    for p in sorted(docs_path.glob("*.pdf")):
        try:
            reader = PdfReader(str(p))

            # Try empty password if encrypted (common case)
            if getattr(reader, "is_encrypted", False):
                try:
                    reader.decrypt("")
                except Exception as e:
                    skipped.append((p.name, f"Encrypted (decrypt failed): {e}"))
                    continue

            text_parts = []
            for page in reader.pages:
                t = page.extract_text()
                if t:
                    text_parts.append(t)

            text = "\n".join(text_parts).strip()
            if not text:
                skipped.append((p.name, "No extractable text"))
                continue

            items.append({
                "doc_id": p.stem,         # doc1 (no .pdf)
                "source": str(p),
                "text": text
            })

        except Exception as e:
            skipped.append((p.name, str(e)))

    return items, skipped

documents, skipped = load_pdf_docs(docs_dir)

print("✅ Loaded docs:", len(documents))
print("⚠️ Skipped:", len(skipped))
for name, reason in skipped[:10]:
    print(" -", name, "->", reason)

# show keys + one sample like the template does
if documents:
    print("\nKeys:", documents[0].keys())
    print("Example doc_id:", documents[0]["doc_id"])
    print(documents[0]["text"][:300])
else:
    raise RuntimeError("No PDFs were successfully loaded from ./data/docs")


✅ Loaded docs: 5
⚠️ Skipped: 1
 - doc5.pdf -> PyCryptodome is required for AES algorithm

Keys: dict_keys(['doc_id', 'source', 'text'])
Example doc_id: doc1
State Data Breach Notification Laws
Prepared by Foley’s Cybersecurity & Data Privacy Team
FOR INFORMATIONAL PURPOSES ONLY 
4824-6127-3219.41 1 ■Exceptions based on compliance with other laws, such as 
the Health Insurance Portability and Accountability Act 
(HIPAA) or Gramm-Leach-Bliley Act (GLBA).



In [48]:
!python -m pip -q install pikepdf
import pikepdf
from pathlib import Path

src = Path("data/docs/doc5.pdf")
dst = Path("data/docs/doc5_decrypted.pdf")

with pikepdf.open(src, password="") as pdf:
    pdf.save(dst)

print("✅ Saved:", dst)


✅ Saved: data/docs/doc5_decrypted.pdf


# **DATA INGESTION & CHUNKING**

In [49]:

# Imports
import os, re, glob, json, math
from dataclasses import dataclass
from typing import List, Dict, Any, Tuple, Optional

import numpy as np
import pandas as pd

!pip install PyMuPDF
import fitz  # PyMuPDF
from PIL import Image, ImageDraw, ImageFont

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

Collecting PyMuPDF
  Downloading pymupdf-1.27.1-cp310-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.27.1-cp310-abi3-manylinux_2_28_x86_64.whl (24.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.9/24.9 MB 37.9 MB/s  0:00:00
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.27.1


In [50]:
DATA_DIR = "data"
DOC_DIR = os.path.join(DATA_DIR, "docs")
FIG_DIR = os.path.join(DATA_DIR, "images")
os.makedirs(DOC_DIR, exist_ok=True)
os.makedirs(FIG_DIR, exist_ok=True)

pdfs = sorted(glob.glob(os.path.join(DOC_DIR, "*.pdf")))
imgs = sorted(glob.glob(os.path.join(FIG_DIR, "*.*")))

print("PDFs:", len(pdfs), pdfs)
print("Images:", len(imgs), imgs)


PDFs: 6 ['data/docs/doc1.pdf', 'data/docs/doc2.pdf', 'data/docs/doc3.pdf', 'data/docs/doc4.pdf', 'data/docs/doc5.pdf', 'data/docs/doc5_decrypted.pdf']
Images: 10 ['data/images/cyber_kill_chain.png', 'data/images/impact_likelihood_matrix.png', 'data/images/network_system_security.png', 'data/images/nist_framework.png', 'data/images/risk_management.png', 'data/images/risk_management_process.png', 'data/images/security_audit_process.png', 'data/images/soc2_requirements.png', 'data/images/website_security_audit.png', 'data/images/zero_trust.png']
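
The listing above contains both `doc5.pdf` and `doc5_decrypted.pdf`. If both are ingested, the same content may be indexed twice, or the encrypted original may fail to parse. A minimal sketch of a filter that keeps only the decrypted copy is shown below; the helper name `prefer_decrypted` and the `_decrypted` suffix convention are assumptions based on the filenames used in this notebook.

```python
import os

def prefer_decrypted(pdf_paths, suffix="_decrypted"):
    """Drop X.pdf when X_decrypted.pdf is also present, so each document is ingested once."""
    stems = {os.path.splitext(os.path.basename(p))[0] for p in pdf_paths}
    keep = []
    for p in pdf_paths:
        stem = os.path.splitext(os.path.basename(p))[0]
        if not stem.endswith(suffix) and (stem + suffix) in stems:
            continue  # a decrypted copy exists; skip the original
        keep.append(p)
    return keep
```

Applied as `pdfs = prefer_decrypted(pdfs)` before ingestion, this would leave five PDFs for the dataset shown here.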


In [51]:
!sudo apt-get install -y tesseract-ocr
!pip install -q pytesseract

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.


In [52]:
# Track B (Recommended)
import pytesseract
from PIL import Image

@dataclass
class TextChunk:
    chunk_id: str
    doc_id: str
    page_num: int
    text: str

@dataclass
class ImageItem:
    item_id: str
    path: str
    caption: str  # simple text to make image retrieval runnable

def clean_text(s: str) -> str:
    s = s or ""
    s = re.sub(r"\s+", " ", s).strip()
    return s

def extract_pdf_pages(pdf_path: str) -> List[TextChunk]:
    doc_id = os.path.basename(pdf_path)
    doc = fitz.open(pdf_path)
    out: List[TextChunk] = []
    for i in range(len(doc)):
        page = doc.load_page(i)
        text = clean_text(page.get_text("text"))
        if text:
            out.append(TextChunk(
                chunk_id=f"{doc_id}::p{i+1}",
                doc_id=doc_id,
                page_num=i+1,
                text=text
            ))
    return out

def load_images_track_b(fig_dir: str) -> List[ImageItem]:
    items: List[ImageItem] = []
    print(f"Scanning images in {fig_dir} with OCR...")

    for p in sorted(glob.glob(os.path.join(fig_dir, "*.*"))):
        base = os.path.basename(p)

        # 1. Generate Caption (Filename based)
        simple_caption = os.path.splitext(base)[0].replace("_", " ")

        # 2. Run OCR (Tesseract) to get text inside the image
        try:
            image = Image.open(p)
            ocr_text = pytesseract.image_to_string(image).strip()
            # Clean up OCR noise (optional)
            ocr_text = re.sub(r"\s+", " ", ocr_text)
        except Exception as e:
            print(f"OCR Failed for {base}: {e}")
            ocr_text = ""

        # 3. Combine for Evidence (Track B Requirement)
        # evidence_text = Caption + OCR
        final_text = f"Caption: {simple_caption}. Content: {ocr_text}"

        items.append(ImageItem(item_id=base, path=p, caption=final_text))

    return items

# Run ingestion
page_chunks: List[TextChunk] = []
for p in pdfs:
    page_chunks.extend(extract_pdf_pages(p))

image_items = load_images_track_b(FIG_DIR)

print("Total text chunks:", len(page_chunks))
print("Total images:", len(image_items))
print("Sample text chunk:", page_chunks[0].chunk_id, page_chunks[0].text[:180])
print("Sample image item:", image_items[0])

# --- Deliverable Output ---

print("\n=== Deliverable: Extracted PDF Chunk ===")
if page_chunks:
    chunk = page_chunks[0]
    print(f"Chunk ID:   {chunk.chunk_id}")
    print(f"Source Doc: {chunk.doc_id}")
    print(f"Page Num:   {chunk.page_num}")
    print(f"Text Content (First 300 chars):\n{chunk.text[:300]}...")
else:
    print("❌ No PDF chunks found.")

print("\n" + "="*60)

print("\n=== Deliverable: Extracted Image Evidence ===")
if image_items:
    item = image_items[0]
    print(f"Image ID: {item.item_id}")
    print(f"Path:     {item.path}")
    print("-" * 20)
    print(f"Full Evidence Text (Caption + OCR):\n{item.caption}")
    # Note: item.caption now holds "Caption: [filename]. Content: [OCR Text]"
else:
    print("❌ No images found.")


Scanning images in data/images with OCR...
OCR Failed for security_audit_process.png: Unsupported image format/type
Total text chunks: 265
Total images: 10
Sample text chunk: doc1.pdf::p1 State Data Breach Notification Laws Prepared by Foley’s Cybersecurity & Data Privacy Team
Sample image item: ImageItem(item_id='cyber_kill_chain.png', path='data/images/cyber_kill_chain.png', caption='Caption: cyber kill chain. Content: Harvesting email addresses, conference information, etc. Coupling exploit with backdoor into deliverable payload Delivering weaponized bundle to the victim via email, web, USB, etc. Exploiting a vulnerability to execute code on victim’s system Installing malware on the asset Command channel for remote manipulation of victim With ‘Hands on Keyboard’ access, intruders accomplish their original goals')

=== Deliverable: Extracted PDF Chunk ===
Chunk ID:   doc1.pdf::p1
Source Doc: doc1.pdf
Page Num:   1
Text Content (First 300 chars):
State Data Breach Notification

# 4) Mini Gold Set (Q1–Q5): Required

Create **5 project-relevant queries** and define a simple evidence rubric.

- **Q1–Q3:** typical project queries (answerable using evidence)
- **Q4:** multimodal evidence query (table/figure heavy, OCR/captions should help)
- **Q5:** missing-evidence or ambiguous query (must trigger safe behavior)

For each query, define:
- `gold_evidence_ids`: list of evidence identifiers that are relevant (doc_id/page/fig id)
- `answer_criteria`: 1–2 bullets
- `citation_format`: how you will cite (e.g., `[Doc1 p3]`, `[fig2]`)

This enables **consistent evaluation** and makes logging meaningful.
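
To make the gold evidence IDs usable for P@5 and R@10, retrieved IDs and gold IDs have to be compared in the same form. IDs in this notebook appear in more than one form (for example `doc1`, `doc1.pdf`, `doc1.pdf::p1`, and `img::cyber_kill_chain.png`), so the sketch below normalizes them first. It assumes document-level relevance (page and span suffixes are dropped); the function names are illustrative, not part of the lab template.

```python
def normalize_evidence_id(evidence_id):
    """Map 'img::x.png', 'doc1.pdf', and 'doc1.pdf::p3' style IDs to a bare lowercase stem."""
    eid = evidence_id.strip().lower()
    if eid.startswith("img::"):
        eid = eid[len("img::"):]
    eid = eid.split("::", 1)[0]  # drop a page/span suffix such as '::p3'
    for ext in (".pdf", ".png", ".jpg", ".jpeg", ".webp"):
        if eid.endswith(ext):
            eid = eid[: -len(ext)]
            break
    return eid

def precision_at_k(retrieved_ids, gold_ids, k):
    """Fraction of the top-k slots filled by relevant items; None if there is no gold evidence."""
    gold = {normalize_evidence_id(g) for g in gold_ids}
    if not gold or k <= 0:
        return None
    top = [normalize_evidence_id(r) for r in retrieved_ids[:k]]
    return sum(1 for r in top if r in gold) / k

def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of gold items found in the top-k; None if there is no gold evidence."""
    gold = {normalize_evidence_id(g) for g in gold_ids}
    if not gold or k <= 0:
        return None
    top = {normalize_evidence_id(r) for r in retrieved_ids[:k]}
    return len(gold & top) / len(gold)
```

Returning `None` for queries with an empty gold list (such as a missing-evidence query) keeps them out of averaged metrics; those queries are better judged by whether the safe answer was returned.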


In [55]:
from pathlib import Path

print("PDF docs:", [p.name for p in Path("data/docs").glob("*.pdf")])
print("Images:", [p.name for p in Path("data/images").glob("*.png")])


PDF docs: ['doc5_decrypted.pdf', 'doc1.pdf', 'doc2.pdf', 'doc4.pdf', 'doc5.pdf', 'doc3.pdf']
Images: ['cyber_kill_chain.png', 'nist_framework.png', 'risk_management.png', 'impact_likelihood_matrix.png', 'zero_trust.png', 'network_system_security.png', 'soc2_requirements.png', 'website_security_audit.png', 'risk_management_process.png', 'security_audit_process.png']


In [57]:
import pandas as pd

MISSING_EVIDENCE_PHRASE = "Not enough evidence in the retrieved context."

mini_gold = [
    {
        "query_id": "Q1",
        "question": "According to doc1.pdf, what are State Data Breach Notification Laws and what compliance-based exception(s) are mentioned?",
        "gold_evidence_ids": ["doc1.pdf"],
        "answer_criteria": [
            "Defines/describes state data breach notification laws",
            "Mentions an exception based on compliance with other laws (e.g., HIPAA/GLBA) if stated",
            "Includes a citation"
        ],
        "citation_format": "[doc_id]",
        "query_type": "answerable"
    },
    {
        "query_id": "Q2",
        "question": "From the cyber_kill_chain.png diagram, list the main stages shown in the Cyber Kill Chain in order.",
        "gold_evidence_ids": ["cyber_kill_chain.png"],
        "answer_criteria": [
            "Lists the stages shown in the diagram",
            "Keeps the correct order",
            "Includes a citation"
        ],
        "citation_format": "[evidence_id]",
        "query_type": "answerable"
    },
    {
        "query_id": "Q3",
        "question": "Based on soc2_requirements.png, what are the key SOC 2 Trust Service Criteria categories shown?",
        "gold_evidence_ids": ["soc2_requirements.png"],
        "answer_criteria": [
            "Names the categories visible in the image",
            "Does not invent categories not shown",
            "Includes a citation"
        ],
        "citation_format": "[evidence_id]",
        "query_type": "answerable"
    },
    {
        "query_id": "Q4",
        "question": "Using impact_likelihood_matrix.png, which area represents the highest risk and why?",
        "gold_evidence_ids": ["impact_likelihood_matrix.png"],
        "answer_criteria": [
            "Identifies highest risk as the region where both impact and likelihood are highest (or equivalent label)",
            "Explains using the matrix meaning (risk increases with likelihood and impact)",
            "Includes a citation"
        ],
        "citation_format": "[evidence_id]",
        "query_type": "multimodal"
    },
    {
        "query_id": "Q5",
        "question": "Who won the FIFA World Cup in 2050?",
        "gold_evidence_ids": [],  # ✅ keep as list (type-stable)
        "answer_criteria": [
            f'Returns exactly the missing-evidence phrase: "{MISSING_EVIDENCE_PHRASE}"',
            "Does not claim a winner",
            "Does not cite any evidence"
        ],
        "citation_format": "",
        "query_type": "missing_evidence",
        "expected_safe_answer": MISSING_EVIDENCE_PHRASE
    },
]

pd.DataFrame(mini_gold)[["query_id", "question", "gold_evidence_ids", "query_type"]]


| | query_id | question | gold_evidence_ids | query_type |
|---|---|---|---|---|
| 0 | Q1 | According to doc1.pdf, what are State Data Bre... | [doc1.pdf] | answerable |
| 1 | Q2 | From the cyber_kill_chain.png diagram, list th... | [cyber_kill_chain.png] | answerable |
| 2 | Q3 | Based on soc2_requirements.png, what are the k... | [soc2_requirements.png] | answerable |
| 3 | Q4 | Using impact_likelihood_matrix.png, which area... | [impact_likelihood_matrix.png] | multimodal |
| 4 | Q5 | Who won the FIFA World Cup in 2050? | [] | missing_evidence |


In [58]:
# Task: Mini gold set (evidence IDs) for evaluation
# Evidence IDs refer to files under ./data/docs or ./data/images
# Image evidence uses prefix img::

import pandas as pd

mini_gold = [
    {
        'query_id': 'Q1',
        'question': 'According to doc1.pdf, what are State Data Breach Notification Laws and what compliance exception is mentioned?',
        'gold_evidence_ids': ['doc1.pdf']
    },
    {
        'query_id': 'Q2',
        'question': 'What are the main stages shown in the Cyber Kill Chain diagram?',
        'gold_evidence_ids': ['img::cyber_kill_chain.png']
    },
    {
        'query_id': 'Q3',
        'question': 'What are the SOC 2 Trust Service Criteria categories shown in the SOC 2 diagram?',
        'gold_evidence_ids': ['img::soc2_requirements.png']
    },
    {
        'query_id': 'Q4',
        'question': 'From the impact-likelihood risk matrix, which area represents the highest risk?',
        'gold_evidence_ids': ['img::impact_likelihood_matrix.png']
    },
    {
        'query_id': 'Q5',
        'question': 'Who won the FIFA World Cup in 2050?',
        'gold_evidence_ids': []   # Missing-evidence case (must trigger safe behavior)
    },
    {
        'query_id': 'Q6',
        'question': 'What are the five core functions shown in the NIST Cybersecurity Framework diagram?',
        'gold_evidence_ids': ['img::nist_framework.png']
    },
]

pd.DataFrame(mini_gold)[['query_id','question','gold_evidence_ids']]



| | query_id | question | gold_evidence_ids |
|---|---|---|---|
| 0 | Q1 | According to doc1.pdf, what are State Data Bre... | [doc1.pdf] |
| 1 | Q2 | What are the main stages shown in the Cyber Ki... | [img::cyber_kill_chain.png] |
| 2 | Q3 | What are the SOC 2 Trust Service Criteria cate... | [img::soc2_requirements.png] |
| 3 | Q4 | From the impact-likelihood risk matrix, which ... | [img::impact_likelihood_matrix.png] |
| 4 | Q5 | Who won the FIFA World Cup in 2050? | [] |
| 5 | Q6 | What are the five core functions shown in the ... | [img::nist_framework.png] |


# 5) Retrieval + Answer Function (Reuse Lab 3)

Below is a **baseline TF-IDF retriever** so this notebook is runnable.
Replace with your Lab-3 retrieval stack:
- dense (SentenceTransformers + FAISS/Chroma)
- sparse (BM25)
- hybrid fusion
- optional reranking

### Required output contract (recommended)
Your retrieval function should return a list of evidence items:
- `chunk_id` or `doc_id`
- `source`
- `score`
- `citation_tag` (e.g., `[Doc1 p3]`, `[fig2]`)
- `text` (the evidence text shown to users)

Your answer function must enforce:
- **Citations for claims**
- If missing evidence: **return exactly**  
  `Not enough evidence in the retrieved context.`
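
The contract and the two answer rules above can be checked mechanically. The following is a sketch under stated assumptions: `validate_evidence` and `enforce_answer_policy` are hypothetical helper names, the `min_score` default of 0.05 mirrors the threshold used by the template's answer stub, and treating "the answer contains at least one retrieved citation tag" as the citation check is a simplification.

```python
MISSING_EVIDENCE_MSG = "Not enough evidence in the retrieved context."
REQUIRED_FIELDS = ("source", "score", "citation_tag", "text")

def validate_evidence(evidence):
    """Return a list of contract violations; an empty list means every item is well-formed."""
    problems = []
    for i, item in enumerate(evidence):
        if not (item.get("chunk_id") or item.get("doc_id")):
            problems.append(f"item {i}: missing chunk_id/doc_id")
        for field in REQUIRED_FIELDS:
            if field not in item:
                problems.append(f"item {i}: missing {field}")
    return problems

def enforce_answer_policy(answer, evidence, min_score=0.05):
    """Return the answer only if evidence is strong enough and the answer cites it.

    Otherwise return the exact missing-evidence message.
    """
    if not evidence or max(e.get("score", 0.0) for e in evidence) < min_score:
        return MISSING_EVIDENCE_MSG
    tags = [e["citation_tag"] for e in evidence if e.get("citation_tag")]
    if not any(tag in answer for tag in tags):
        return MISSING_EVIDENCE_MSG
    return answer
```

Running the generated answer through a gate like this, rather than trusting the generator to follow instructions, makes the exact-phrase requirement testable against the gold set.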


In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Build a simple TF-IDF index over documents (demo baseline)
corpus = [d["text"] for d in documents]
doc_ids = [d["doc_id"] for d in documents]
sources = [d["source"] for d in documents]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

def retrieve_tfidf(question: str, top_k: int = 5):
    q = vectorizer.transform([question])
    sims = cosine_similarity(q, X).ravel()
    idxs = np.argsort(-sims)[:top_k]
    evidence = []
    for rank, i in enumerate(idxs):
        evidence.append({
            "chunk_id": doc_ids[i],
            "source": sources[i],
            "score": float(sims[i]),
            "citation_tag": f"[{doc_ids[i]}]",
            "text": corpus[i][:800]  # truncate for UI
        })
    return evidence

MISSING_EVIDENCE_MSG = "Not enough evidence in the retrieved context."

def generate_answer_stub(question: str, evidence: list):
    """Replace with your LLM/VLM generation.
    For this template we produce a simple grounded response.
    """
    if not evidence or max(e.get("score", 0.0) for e in evidence) < 0.05:
        return MISSING_EVIDENCE_MSG

    # Minimal grounded "answer" example: summarize top evidence
    top = evidence[0]
    answer = (
        f"Based on the retrieved evidence {top['citation_tag']}, "
        f"the system should ground its response in retrieved context and cite sources. "
        f"If evidence is missing, it must respond with: '{MISSING_EVIDENCE_MSG}'. "
        f"{top['citation_tag']}"
    )
    return answer

# Quick test
test_q = mini_gold[0]["question"]
ev = retrieve_tfidf(test_q, top_k=3)
print("Top evidence:", ev[0]["chunk_id"], ev[0]["score"])
print("Answer:", generate_answer_stub(test_q, ev))


Top evidence: doc1 0.24315996181631225
Answer: Based on the retrieved evidence [doc1], the system should ground its response in retrieved context and cite sources. If evidence is missing, it must respond with: 'Not enough evidence in the retrieved context.'. [doc1]


# **Fixed-size Chunking Strategy (Sliding Window Chunk)**

In [60]:
# Chunking knobs (for fixed-size chunking ablation)
CHUNK_SIZE    = 900   # characters per chunk
CHUNK_OVERLAP = 150   # overlap characters

# Reproducibility
RANDOM_SEED = 0

In [62]:
def extract_fixed_size_chunks(pdf_path: str, chunk_size=CHUNK_SIZE, overlap=CHUNK_OVERLAP) -> List[TextChunk]:
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    doc_id = os.path.basename(pdf_path)
    # Concatenate the cleaned text of every page into one long string
    with fitz.open(pdf_path) as doc:
        full_text = "".join(clean_text(page.get_text("text")) + " " for page in doc)

    # Sliding window slicing: consecutive windows start (chunk_size - overlap) characters apart
    chunks = []
    for i in range(0, len(full_text), chunk_size - overlap):
        window = full_text[i : i + chunk_size]
        if len(window) > 50:  # Filter tiny chunks
            chunks.append(TextChunk(
                chunk_id=f"{doc_id}::span{i}-{i+len(window)}",
                doc_id=doc_id,
                page_num=0,  # Logical chunk, not page bound
                text=window
            ))
    return chunks
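
To make the window arithmetic concrete: consecutive chunks start `chunk_size - overlap` characters apart, so the knobs above (900 and 150) give a stride of 750 characters. The sketch below works on a character count rather than a PDF. `sliding_window_spans` is a hypothetical helper, not part of the notebook, and its `min_len` parameter mirrors the `len(window) > 50` filter.

```python
from typing import List, Tuple


def sliding_window_spans(n_chars: int, chunk_size: int = 900, overlap: int = 150,
                         min_len: int = 51) -> List[Tuple[int, int]]:
    """Return (start, end) character spans for fixed-size chunks with overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    spans = []
    for start in range(0, n_chars, stride):
        end = min(start + chunk_size, n_chars)
        if end - start >= min_len:  # drop tiny trailing windows
            spans.append((start, end))
    return spans


# A 2000-character document yields three overlapping spans
print(sliding_window_spans(2000))
```

Each span shares its last 150 characters with the start of the next one, which is what lets a sentence cut at a chunk boundary still appear whole in one of the two chunks.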

# **Retrieval (TF-IDF)**
We build two TF-IDF indexes:

- One over **PDF text chunks**
- One over **image captions**

Retrieval returns the top-k results with similarity scores.
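
Under the hood, TF-IDF retrieval scores each document by the cosine similarity between its weighted term vector and the query's. The sketch below is a from-scratch illustration in pure Python, not the scikit-learn code used in this notebook: it assumes whitespace tokenization, a smoothed IDF, and no stop-word removal, so its scores will not match `TfidfVectorizer` exactly. `tiny_tfidf_rank` is a hypothetical name and the three documents are made up.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def _unit(vec: Dict[str, float]) -> Dict[str, float]:
    """L2-normalize a sparse vector so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def tiny_tfidf_rank(query: str, docs: List[str], top_k: int = 5) -> List[Tuple[int, float]]:
    """Rank docs against the query; returns [(doc_index, cosine_score), ...]."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def embed(tokens: List[str]) -> Dict[str, float]:
        tf = Counter(t for t in tokens if t in idf)  # ignore out-of-vocabulary terms
        return _unit({t: c * idf[t] for t, c in tf.items()})

    q = embed(query.lower().split())
    scored = []
    for i, toks in enumerate(tokenized):
        d = embed(toks)
        scored.append((i, sum(w * d.get(t, 0.0) for t, w in q.items())))
    scored.sort(key=lambda pair: -pair[1])
    return scored[:top_k]


docs = [
    "breach notification laws by state",
    "cyber kill chain stages",
    "zero trust network identity",
]
print(tiny_tfidf_rank("kill chain", docs, top_k=2))
```

Only the second document shares terms with the query, so it is the only one with a non-zero score.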

In [65]:
def build_tfidf_index_text(chunks: List[TextChunk]):
    corpus = [c.text for c in chunks]
    vec = TfidfVectorizer(lowercase=True, stop_words="english")
    X = vec.fit_transform(corpus)
    X = normalize(X)
    return vec, X

def build_tfidf_index_images(items: List[ImageItem]):
    corpus = [it.caption for it in items]
    vec = TfidfVectorizer(lowercase=True, stop_words="english")
    X = vec.fit_transform(corpus)
    X = normalize(X)
    return vec, X

text_vec, text_X = build_tfidf_index_text(page_chunks)
img_vec, img_X = build_tfidf_index_images(image_items)

def tfidf_retrieve(query: str, vec: TfidfVectorizer, X, top_k: int = 5):
    q = vec.transform([query])
    q = normalize(q)
    scores = (X @ q.T).toarray().ravel()
    idx = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in idx]

print("✅ Indexes built.")

# Inspect built indexes by listing first 5 as a sample
print(f"--- Text Index ({len(page_chunks)} items) ---")
for i, chunk in enumerate(page_chunks[:5]):  # Print first 5 as a sample
    # Show a short single-line preview of each chunk's text
    preview = chunk.text[:50].replace("\n", " ") + "..."
    print(f"ID {i}: {preview}")

print(f"\n--- Image Index ({len(image_items)} items) ---")
for i, item in enumerate(image_items[:5]):
    print(f"ID {i}: {item.caption} (File: {item.item_id})")

✅ Indexes built.
--- Text Index (265 items) ---
ID 0: State Data Breach Notification Laws Prepared by Fo...
ID 1: FOR INFORMATIONAL PURPOSES ONLY 4824-6127-3219.41 ...
ID 2: FOR INFORMATIONAL PURPOSES ONLY 4824-6127-3219.41 ...
ID 3: 3 Return to Map FOR INFORMATIONAL PURPOSES ONLY 48...
ID 4: 4 Return to Map FOR INFORMATIONAL PURPOSES ONLY 48...

--- Image Index (10 items) ---
ID 0: Caption: cyber kill chain. Content: Harvesting email addresses, conference information, etc. Coupling exploit with backdoor into deliverable payload Delivering weaponized bundle to the victim via email, web, USB, etc. Exploiting a vulnerability to execute code on victim’s system Installing malware on the asset Command channel for remote manipulation of victim With ‘Hands on Keyboard’ access, intruders accomplish their original goals (File: cyber_kill_chain.png)
ID 1: Caption: impact likelihood matrix. Content: Catastrophic Significant Ss Moderate D ler a + low Negligable Catastrophic | Stop 1 2 3 4 

# **Build Dense Retrieval and Figure Index**

In [67]:
!pip install -q sentence-transformers faiss-cpu

In [68]:

# Retrieval knobs
TOP_K_TEXT     = 5    # candidate text chunks
TOP_K_IMAGES   = 3    # candidate images (based on captions/filenames)
TOP_K_EVIDENCE = 8    # final evidence items used in the context

In [69]:
import faiss
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings
corpus_text = [c.text for c in page_chunks]
# encode() returns a NumPy array by default, which is what FAISS expects
corpus_embeddings = model.encode(corpus_text)

# Build FAISS Index
d = corpus_embeddings.shape[1]  # Dimension of embeddings (e.g., 384)
index_dense = faiss.IndexFlatL2(d)  # Exact search; reports squared L2 distances (lower = closer)
index_dense.add(corpus_embeddings)

print(f"✅ Dense Index built with {index_dense.ntotal} vectors.")

# Embed the captions from your image_items list
corpus_caption = [item.caption for item in image_items]
caption_embeddings = model.encode(corpus_caption, convert_to_tensor=False)

# Build FAISS Index for Image Captions
d_cap = caption_embeddings.shape[1] # Dimension = 384
index_captions = faiss.IndexFlatL2(d_cap)
index_captions.add(caption_embeddings)

print(f"✅ Approach 1 (Captions): Indexed {index_captions.ntotal} images via text.")

def dense_retrieve(query, top_k=TOP_K_TEXT):
    # Encode query to numpy. Wrap in list [query] to ensure (1, d) shape.
    query_emb = model.encode([query])

    # Search FAISS
    distances, indices = index_dense.search(query_emb, top_k)

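    # Return (index, distance) pairs; a LOWER distance means a better match.
    # FAISS pads results with -1 when fewer than top_k vectors exist, so skip those.
    return [(int(idx), float(dist)) for idx, dist in zip(indices[0], distances[0]) if idx >= 0]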

def retrieve_images_by_caption(query: str, top_k=TOP_K_IMAGES):
    # Embed query using the SAME text model
    q_emb = model.encode([query])
    distances, indices = index_captions.search(q_emb, top_k)

    # Return matched ImageItems
    results = []
    for idx, dist in zip(indices[0], distances[0]):
        if idx < 0: continue # FAISS returns -1 if not found
        results.append((image_items[idx], float(dist)))
    return results

# Validation by checking vocabulary size
print(f"Text Dictionary Size: {len(text_vec.vocabulary_)}")
print(f"Image Dictionary Size: {len(img_vec.vocabulary_)}")

✅ Dense Index built with 265 vectors.
✅ Approach 1 (Captions): Indexed 10 images via text.
Text Dictionary Size: 4906
Image Dictionary Size: 201
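
One detail worth keeping in mind when mixing these indexes with TF-IDF: `faiss.IndexFlatL2` reports squared Euclidean distances, so smaller values mean closer matches, whereas the TF-IDF scores are similarities where larger is better. For unit-length vectors the two are related by `d^2 = 2 - 2*cos`. The sketch below shows that conversion in pure Python. `l2sq_to_cosine` and `to_similarity_hits` are hypothetical helpers, and the identity only holds if the embeddings are L2-normalized, which is assumed here rather than verified for the model above.

```python
import math
from typing import List, Sequence, Tuple


def l2sq_to_cosine(dist_sq: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - dist_sq / 2.0


def to_similarity_hits(hits: Sequence[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Map [(idx, squared_distance)] to [(idx, cosine)], best match first."""
    converted = [(idx, l2sq_to_cosine(d)) for idx, d in hits]
    return sorted(converted, key=lambda pair: pair[1], reverse=True)


# Sanity check on two hand-made unit vectors at a 60-degree angle
a = (1.0, 0.0)
b = (math.cos(math.pi / 3), math.sin(math.pi / 3))
dist_sq = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))
print(round(l2sq_to_cosine(dist_sq), 6), round(cosine, 6))
```

Converting distances to similarities before any score fusion keeps "higher is better" true for every retriever.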


# **Build evidence context**
We assemble a compact context string + list of image paths.

**Guidelines for good context:**

- Keep snippets short (100–300 chars)
- Always include chunk IDs so you can cite evidence
- Attach images that are likely relevant

In [71]:
# --- Fusion / retrieval hyperparameters ---
TOP_K_TEXT = 5
TOP_K_IMAGES = 3
TOP_K_EVIDENCE = 5
ALPHA = 0.6   # 0.5 = balanced, >0.5 favors text, <0.5 favors images


In [72]:
def _normalize_scores(pairs):
    """Min-max normalize a list of (idx, score) to [0,1].
    If all scores equal, returns 1.0 for each item (so ordering stays stable).
    """
    if not pairs:
        return []
    scores = [s for _, s in pairs]
    lo, hi = min(scores), max(scores)
    if abs(hi - lo) < 1e-12:
        return [(i, 1.0) for i, _ in pairs]
    return [(i, (s - lo) / (hi - lo)) for i, s in pairs]


def build_context(
    question: str,
    top_k_text: int = TOP_K_TEXT,
    top_k_images: int = TOP_K_IMAGES,
    top_k_evidence: int = TOP_K_EVIDENCE,
    alpha: float = ALPHA,
) -> Dict[str, Any]:
    """Build a multimodal context block for the question.

    Students:
    - `top_k_text` / `top_k_images` control *candidate retrieval* per modality.
    - `top_k_evidence` controls the *final context size*.
    - `alpha` controls fusion: higher = prefer text evidence, lower = prefer images.

    This function returns:
    - `context`: a text block with the selected evidence (what you pass to an LLM)
    - `image_paths`: paths of images selected as evidence
    - `evidence`: structured evidence list (recommended for your report)
    """
    # 1) Retrieve candidates from each modality
    text_hits = tfidf_retrieve(question, text_vec, text_X, top_k=top_k_text)   # [(idx, score), ...]
    img_hits  = tfidf_retrieve(question, img_vec,  img_X,  top_k=top_k_images)

    # 2) Normalize scores per modality and fuse with ALPHA
    text_norm = _normalize_scores(text_hits)
    img_norm  = _normalize_scores(img_hits)

    fused = []
    for idx, s in text_norm:
        ch = page_chunks[idx]
        fused.append({
            "modality": "text",
            "id": ch.chunk_id,
            "raw_score": float(dict(text_hits).get(idx, 0.0)),
            "fused_score": float(alpha * s),
            "text": ch.text,
            "path": None,
        })

    for idx, s in img_norm:
        it = image_items[idx]
        fused.append({
            "modality": "image",
            "id": it.item_id,
            "raw_score": float(dict(img_hits).get(idx, 0.0)),
            "fused_score": float((1.0 - alpha) * s),
            "text": it.caption,     # we retrieve on caption/filename text
            "path": it.path,
        })

    # 3) Pick top fused evidence
    fused = sorted(fused, key=lambda d: d["fused_score"], reverse=True)[:top_k_evidence]

    # 4) Build the context string (what you feed into a generator/LLM)
    ctx_lines = []
    image_paths = []
    for ev in fused:
        if ev["modality"] == "text":
            snippet = (ev["text"] or "")[:260].replace("\n", " ")
            ctx_lines.append(f"[TEXT | {ev['id']} | fused={ev['fused_score']:.3f}] {snippet}")
        else:
            ctx_lines.append(f"[IMAGE | {ev['id']} | fused={ev['fused_score']:.3f}] caption={ev['text']}")
            image_paths.append(ev["path"])

    return {
        "question": question,
        "context": "\n".join(ctx_lines),
        "image_paths": image_paths,
        "text_hits": text_hits,
        "img_hits": img_hits,
        "evidence": fused,
        "alpha": alpha,
        "top_k_text": top_k_text,
        "top_k_images": top_k_images,
        "top_k_evidence": top_k_evidence,
    }


# --- Demo: what retrieval returns for one query ---
ctx_demo = build_context(mini_gold[0]["question"])
print(ctx_demo["context"])
print("Images:", ctx_demo["image_paths"])
print("Fusion alpha:", ctx_demo["alpha"])

[TEXT | doc1.pdf::p1 | fused=0.600] State Data Breach Notification Laws Prepared by Foley’s Cybersecurity & Data Privacy Team
[IMAGE | zero_trust.png | fused=0.400] caption=Caption: zero trust. Content: Identity Network Endpoints Infrastructure Data Applications
[TEXT | doc1.pdf::p35 | fused=0.072] 34 Return to Map FOR INFORMATIONAL PURPOSES ONLY 4824-6127-3219.41 State of Residence Maine Statute 10 Me. Rev. Stat. § 1346 et seq. Definition of “Personal Information” (A) An individual’s first name or initial and last name in combination with any one or mor
[TEXT | doc1.pdf::p2 | fused=0.052] FOR INFORMATIONAL PURPOSES ONLY 4824-6127-3219.41 1 ■Exceptions based on compliance with other laws, such as the Health Insurance Portability and Accountability Act (HIPAA) or Gramm-Leach-Bliley Act (GLBA). ■Exceptions regarding good faith acquisition of perso
[TEXT | doc1.pdf::p27 | fused=0.030] 26 Return to Map FOR INFORMATIONAL PURPOSES ONLY 4824-6127-3219.41 State of Residence Illino
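
The `fused=0.600` and `fused=0.400` values in this output follow directly from the fusion rule: scores are min-max normalized within each modality, so the best text hit always normalizes to 1.0 and receives `alpha`, while the best image hit receives `1 - alpha`, regardless of how strong the raw match was. The sketch below replays that arithmetic on made-up raw scores (the numbers are illustrative, not taken from the run above). `minmax` and `fuse` are simplified stand-ins for `_normalize_scores` and the fusion loop in `build_context`.

```python
from typing import List, Tuple

Hit = Tuple[str, float]


def minmax(hits: List[Hit]) -> List[Hit]:
    """Min-max normalize scores to [0, 1]; all-equal scores map to 1.0."""
    if not hits:
        return []
    scores = [s for _, s in hits]
    lo, hi = min(scores), max(scores)
    if hi - lo < 1e-12:
        return [(i, 1.0) for i, _ in hits]
    return [(i, (s - lo) / (hi - lo)) for i, s in hits]


def fuse(text_hits: List[Hit], img_hits: List[Hit], alpha: float) -> List[Hit]:
    """Weight normalized text scores by alpha and image scores by (1 - alpha)."""
    fused = [(i, alpha * s) for i, s in minmax(text_hits)]
    fused += [(i, (1.0 - alpha) * s) for i, s in minmax(img_hits)]
    return sorted(fused, key=lambda pair: pair[1], reverse=True)


# Illustrative raw scores: the image match is very weak in absolute terms
text_hits = [("doc::p1", 0.30), ("doc::p2", 0.15), ("doc::p3", 0.10)]
img_hits = [("fig_a.png", 0.02), ("fig_b.png", 0.01)]
for item_id, score in fuse(text_hits, img_hits, alpha=0.6):
    print(f"{item_id:10s} fused={score:.3f}")
```

A consequence is that a weak image match can outrank stronger text evidence. One possible mitigation, not implemented in this notebook, is to drop hits below a raw-score threshold before normalizing.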

# **Reranking**

In [73]:
from sentence_transformers import CrossEncoder

# Load a standard reranking model (trained on MS MARCO)
# This model outputs a score (higher is better, usually unbounded but often -10 to 10)
print("Loading Reranker...")
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
print("✅ Reranker loaded.")

Loading Reranker...


✅ Reranker loaded.


In [74]:
def normalize_scores(hits):
    """Normalizes a list of (idx, score) to 0..1 range."""
    if not hits: return []
    scores = [s for _, s in hits]
    min_s, max_s = min(scores), max(scores)
    if max_s == min_s: return [(i, 1.0) for i, _ in hits]
    return [(i, (s - min_s) / (max_s - min_s)) for i, s in hits]

def get_retrieval_results(query: str, method: str, top_k: int = 5):
    """
    Retrieves candidate chunks based on the specified method.
    Returns a list of (chunk_index, score).
    """
    # 1. SPARSE ONLY
    if method == "Sparse Only":
        return tfidf_retrieve(query, text_vec, text_X, top_k=top_k)

    # 2. DENSE ONLY
    if method == "Dense Only":
        # Assumes dense_retrieve exists from previous step
        return dense_retrieve(query, top_k=top_k)

    # 3. HYBRID (Sparse + Dense)
    if method == "Hybrid" or method == "Hybrid + Rerank" or method == "Multimodal":
        # Retrieve more candidates (e.g., top_k * 2) from both to ensure overlap
        sparse_hits = tfidf_retrieve(query, text_vec, text_X, top_k=top_k*2)
        dense_hits = dense_retrieve(query, top_k=top_k*2)

        # Create a dict to fuse scores: {idx: fused_score}
        fusion_map = {}

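        # Normalize and weigh (Alpha=0.5 usually works well for Hybrid)
        for idx, score in normalize_scores(sparse_hits):
            fusion_map[idx] = fusion_map.get(idx, 0) + (0.5 * score)

        # Dense hits carry L2 distances (lower = better), so invert them after
        # normalization to get similarities (higher = better) before fusing
        for idx, score in normalize_scores(dense_hits):
            fusion_map[idx] = fusion_map.get(idx, 0) + (0.5 * (1.0 - score))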

        # Sort by fused score
        hybrid_results = sorted(fusion_map.items(), key=lambda x: x[1], reverse=True)

        # If just Hybrid, return top_k
        if method == "Hybrid":
            return hybrid_results[:top_k]

        # 4. RERANKING (Re-score the hybrid candidates)
        # We take the top 20 hybrid candidates and rerank them
        candidates = hybrid_results[:20]

        # Prepare pairs for CrossEncoder: [[query, doc_text], ...]
        pairs = []
        for idx, _ in candidates:
            pairs.append([query, page_chunks[idx].text])

        # Predict scores
        rerank_scores = reranker.predict(pairs)

        # Attach new scores to indices
        reranked_results = []
        for i, (idx, _) in enumerate(candidates):
            reranked_results.append((idx, float(rerank_scores[i])))

        # Sort by new reranker score
        final_ranked = sorted(reranked_results, key=lambda x: x[1], reverse=True)

        return final_ranked[:top_k]

    return []
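
Score-based fusion as above depends on both retrievers producing scores on a comparable scale and with the same direction. A rank-based alternative that sidesteps this is Reciprocal Rank Fusion (RRF), which is a different technique from the weighted min-max fusion used in this notebook. The sketch below is illustrative: `rrf_fuse` is a hypothetical helper, the chunk indices are made up, and `k=60` is a commonly used default constant rather than a value tuned for this dataset.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(rankings: Sequence[Sequence[int]], k: int = 60, top_k: int = 5) -> List[Tuple[int, float]]:
    """Fuse ranked lists of chunk indices: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, idx in enumerate(ranking, start=1):  # ranks are 1-based
            scores[idx] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return fused[:top_k]


# Each list holds chunk indices ordered best-first by one retriever
sparse_ranking = [12, 7, 3, 40]
dense_ranking = [7, 12, 99, 3]
print(rrf_fuse([sparse_ranking, dense_ranking], top_k=3))
```

In `get_retrieval_results`, the inputs would be the index lists from `tfidf_retrieve` and `dense_retrieve`, with the dense list ordered by ascending distance.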

# **“Generator” (simple, offline)**
To keep this notebook runnable anywhere, we first implement **a lightweight extractive generator** that returns the top evidence lines.

We then add two LLM-based alternatives:

- An LLM call through a hosted API
- An LLM call with a local Hugging Face model

In [75]:
# Fusion knob (text vs images)
ALPHA = 0.5  # 0.0 = images dominate, 1.0 = text dominates


In [76]:
#Method 1: Lightweight extractive generator

def simple_extractive_answer(question: str, context: str) -> str:
    lines = context.splitlines()
    if not lines:
        return "Not enough evidence in the retrieved context."
    # Return top 2 evidence lines as a "grounded" answer
    return (
        f"Question: {question}\n\n"
        "Grounded answer (extractive):\n"
        + "\n".join(lines[:2])
    )

def run_query(qobj, top_k_text=TOP_K_TEXT, top_k_images=TOP_K_IMAGES, top_k_evidence=TOP_K_EVIDENCE, alpha=ALPHA) -> Dict[str, Any]:
    question = qobj["question"]
    ctx = build_context(question, top_k_text=top_k_text, top_k_images=top_k_images, top_k_evidence=top_k_evidence, alpha=alpha)
    answer = simple_extractive_answer(question, ctx["context"])
    return {
        "id": qobj["query_id"],  # mini_gold entries store the ID under "query_id"
        "question": question,
        "answer": answer,
        "context": ctx["context"],
        "image_paths": ctx["image_paths"],
        "text_hits": ctx["text_hits"],
        "img_hits": ctx["img_hits"],
    }

results = [run_query(q) for q in mini_gold]
for r in results:
    print("\n" + "="*80)
    print(r["id"], r["question"])
    print(r["answer"][:500])
    print("Images:", [os.path.basename(p) for p in r["image_paths"]])


Q1 According to doc1.pdf, what are State Data Breach Notification Laws and what compliance exception is mentioned?
Question: According to doc1.pdf, what are State Data Breach Notification Laws and what compliance exception is mentioned?

Grounded answer (extractive):
[TEXT | doc1.pdf::p1 | fused=0.500] State Data Breach Notification Laws Prepared by Foley’s Cybersecurity & Data Privacy Team
[IMAGE | zero_trust.png | fused=0.500] caption=Caption: zero trust. Content: Identity Network Endpoints Infrastructure Data Applications
Images: ['zero_trust.png']

Q2 What are the main stages shown in the Cyber Kill Chain diagram?
Question: What are the main stages shown in the Cyber Kill Chain diagram?

Grounded answer (extractive):
[TEXT | doc4.pdf::p42 | fused=0.500] 42 will require changes to thresholds in risk acceptance, transference, and new mechanisms for mitigation. Supply Chain Security (New in v2) The security of a system is only as good as the supply chain that it relies on. Defendin

# **Generator using LLM (API Call) with model gemini-2.5-flash**

In [77]:
# Method 2: LLM extractive generator (API Call)

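import os
import google.generativeai as genai  # Deprecated package; `google.genai` is its successor

# --- SETUP LLM ---
# In Colab, store the key as a secret named GEMINI_API_KEY (left sidebar).
# Outside Colab, fall back to the GEMINI_API_KEY environment variable.
try:
    from google.colab import userdata
    api_key = userdata.get('GEMINI_API_KEY')
except Exception:
    api_key = os.environ.get("GEMINI_API_KEY", "PASTE_YOUR_KEY_HERE")

os.environ["GEMINI_API_KEY"] = api_key
genai.configure(api_key=api_key)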

def generate_llm_answer(question: str, context: str) -> str:
    """Generates an answer using an LLM (Gemini) based on the provided context."""

    # 1. Check for empty context
    if not context or not context.strip():
        return "Not enough evidence in the retrieved context."

    # 2. Define the model
    # Using gemini-2.5-flash as it is widely available and free-tier friendly
    model = genai.GenerativeModel('gemini-2.5-flash')

    # 3. Construct the prompt
    prompt = f"""
    You are a helpful assistant for a Multimodal RAG system.
    Use the following retrieved context (text chunks and image descriptions) to answer the user's question.

    RULES:
    1. Answer ONLY using the provided context. If the answer is not in the context, say "Not enough evidence in the retrieved context."
    2. Cite your sources! When you use information, append the source ID like [TEXT | doc1.pdf::p1] or [IMAGE | figure1.png].
    3. Be concise and direct.

    CONTEXT:
    {context}

    QUESTION:
    {question}

    ANSWER:
    """

    # 4. Call the API
    try:
        response = model.generate_content(prompt)
        return response.text
    except Exception as e:
        return f"LLM Generation Error: {str(e)} (Check your API Key)"

# --- UPDATED RUN_QUERY ---
def run_query(qobj, top_k_text=TOP_K_TEXT, top_k_images=TOP_K_IMAGES, top_k_evidence=TOP_K_EVIDENCE, alpha=ALPHA) -> Dict[str, Any]:
    question = qobj["question"]

    # 1. Retrieve and Build Context
    ctx = build_context(question, top_k_text=top_k_text, top_k_images=top_k_images, top_k_evidence=top_k_evidence, alpha=alpha)

    # 2. Generate Answer with LLM (Replaces simple_extractive_answer)
    answer = generate_llm_answer(question, ctx["context"])

    return {
        "id": qobj["query_id"],
        "question": question,
        "answer": answer,
        "context": ctx["context"],
        "image_paths": ctx["image_paths"],
        "text_hits": ctx["text_hits"],
        "img_hits": ctx["img_hits"],
    }

# --- EXECUTION ---
results = [run_query(q) for q in mini_gold]

for r in results:
    print("\n" + "="*80)
    print(f"[{r['id']}] Question: {r['question']}")
    print("-" * 80)
    print(f"LLM Answer:\n{r['answer']}")
    print("-" * 80)
    print("Context Images:", [os.path.basename(p) for p in r["image_paths"]])


All support for the `google.generativeai` package has ended. It will no longer be receiving 
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:

https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md

  loader.exec_module(module)



[Q1] Question: According to doc1.pdf, what are State Data Breach Notification Laws and what compliance exception is mentioned?
--------------------------------------------------------------------------------
LLM Answer:
LLM Generation Error: 400 POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint: API key not valid. Please pass a valid API key. (Check your API Key)
--------------------------------------------------------------------------------
Context Images: ['zero_trust.png']

[Q2] Question: What are the main stages shown in the Cyber Kill Chain diagram?
--------------------------------------------------------------------------------
LLM Answer:
LLM Generation Error: 400 POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint: API key not valid. Please pass a valid API key. (Check your API Key)
------------------------------------------
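
The run above reached the API with an invalid key, so every query returned the same error. A small pre-flight check can surface that before any request is sent. The sketch below is an assumption-laden illustration: `resolve_api_key` is a hypothetical helper, and it only detects a missing key or the `PASTE_YOUR_KEY_HERE` placeholder used in this notebook, not a key that the service rejects.

```python
import os
from typing import Mapping, Optional

PLACEHOLDER = "PASTE_YOUR_KEY_HERE"


def resolve_api_key(env: Optional[Mapping[str, str]] = None, name: str = "GEMINI_API_KEY") -> str:
    """Return the API key from the environment, or raise before any request is made."""
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"{name} is not set; add it as a secret or environment variable.")
    return key
```

Calling this once before the query loop turns a batch of identical per-query error strings into a single, early failure.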

# **Generator using HuggingFace LLM (local) with TinyLlama-1.1B-Chat**

In [78]:
! pip install -q transformers accelerate bitsandbytes

In [79]:
# Method 3: HuggingFace Local
import torch
from transformers import pipeline

# Load the local model (for extractive RAG)
print("Loading local model...")
llm_pipeline = pipeline(
    "text-generation",
    # model="google/flan-t5-large",
    model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)
print("✅ Model loaded.")


Loading local model...


`torch_dtype` is deprecated! Use `dtype` instead!


✅ Model loaded.


In [80]:
def llm_extractive_answer(question: str, context: str) -> str:
    """
    Replaces simple_extractive_answer with a local LLM generation.
    """
    if not context or not context.strip():
        return "Not enough evidence in the retrieved context."

    tokenizer = llm_pipeline.tokenizer # Access tokenizer from the pipeline
    max_context_tokens = 1500 # Safe limit (2048 total - 400 new - ~148 buffer)

    # Tokenize the context
    tokenized_context = tokenizer(context, truncation=False, return_tensors="pt")["input_ids"]

    # If context is too long, slice it
    if tokenized_context.shape[1] > max_context_tokens:
        # Keep the first 1500 tokens
        tokenized_context = tokenized_context[:, :max_context_tokens]
        # Decode back to string
        context = tokenizer.decode(tokenized_context[0], skip_special_tokens=True)

    # Prompt engineering
    # Note: For TinyLlama, a simple format works, but we add "Answer:" to trigger the generation.
    prompt = (
        f"Use the Context below to answer the Question. "
        f"If the answer is not in the Context, say 'Not enough evidence in the retrieved context.'.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
        f"\n\nAnswer:"
    )

    # Generate
    # max_new_tokens=400 leaves room for complete answers (avoids cut-offs)
    # do_sample=True avoids the "1.1.1.1" repetition loop seen with greedy decoding
    output = llm_pipeline(
        prompt,
        max_new_tokens=400,
        do_sample=True,
        temperature=0.7,
        return_full_text=False
    )
    generated_text = output[0]['generated_text'].strip()

    return (
        f"Question: {question}\n\n"
        f"LLM Answer:\n{generated_text}"
    )

def run_query(qobj, top_k_text=TOP_K_TEXT, top_k_images=TOP_K_IMAGES, top_k_evidence=TOP_K_EVIDENCE, alpha=ALPHA) -> Dict[str, Any]:
    question = qobj["question"]

    # 1. Build Context (Uses your existing function)
    ctx = build_context(question, top_k_text=top_k_text, top_k_images=top_k_images, top_k_evidence=top_k_evidence, alpha=alpha)

    # 2. Generate Answer
    answer = llm_extractive_answer(question, ctx["context"])

    # 3. Return exact same structure as your original code
    return {
        "id": qobj["query_id"],
        "question": question,
        "answer": answer,
        "context": ctx["context"],
        "image_paths": ctx["image_paths"],
        "text_hits": ctx["text_hits"], # Preserved
        "img_hits": ctx["img_hits"],   # Preserved
    }

# --- EXECUTION ---
print("Running local LLM queries...")
results = [run_query(q) for q in mini_gold]

for r in results:
    print("\n" + "="*80)
    print(r["id"], r["question"])
    print(r["answer"])
    print("Images:", [os.path.basename(p) for p in r["image_paths"]])

Passing `generation_config` together with generation-related arguments=({'max_new_tokens', 'temperature', 'do_sample'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.
Both `max_new_tokens` (=400) and `max_length`(=2048) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Running local LLM queries...



Q1 According to doc1.pdf, what are State Data Breach Notification Laws and what compliance exception is mentioned?
Question: According to doc1.pdf, what are State Data Breach Notification Laws and what compliance exception is mentioned?

LLM Answer:
State Data Breach Notification Laws and compliance exception are mentioned in doc1.pdf as State of Residence and Illinois Statute 815 Ill. Comp. Stat. 530/5 et seq., respectively.
Images: ['zero_trust.png']

Q2 What are the main stages shown in the Cyber Kill Chain diagram?
Question: What are the main stages shown in the Cyber Kill Chain diagram?

LLM Answer:
The main stages shown in the Cyber Kill Chain diagram are Harvesting email addresses, Coupling exploit with backdoor into deliverable payload, Installing malware on the asset, Command channel for remote manipulation of victim, and Executing code on the victim's system.
Images: ['cyber_kill_chain.png']

Q3 What are the SOC 2 Trust Service Criteria categories shown in the SOC 2 diagram?

## **Retrieval Evaluation (Precision@k / Recall@k)**

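We treat a retrieved evidence item as **relevant** for a query if its normalized ID (the source document filename for text chunks, the image filename for figures) appears in that query's `gold_evidence_ids`.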
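
As a reference point for the two metrics, here is a minimal, self-contained sketch on a ranked list of evidence IDs. It assumes relevance is a simple membership test against a set of relevant IDs, and it counts each relevant ID once for recall so the value stays within [0, 1] even when several retrieved chunks map to the same document. The function names and example IDs are illustrative.

```python
from typing import List, Set


def precision_at_k_ids(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(rid in relevant for rid in top) / len(top)


def recall_at_k_ids(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant IDs found in the top-k (each ID counted once)."""
    if not relevant:
        return 0.0  # undefined when there is no gold evidence; reported as 0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


retrieved = ["doc1.pdf", "fig_a.png", "doc1.pdf", "doc1.pdf", "doc2.pdf"]
relevant = {"doc1.pdf", "doc9.pdf"}
print(precision_at_k_ids(retrieved, relevant, k=5))  # 3 of 5 items are relevant
print(recall_at_k_ids(retrieved, relevant, k=10))    # 1 of 2 gold IDs was found
```

Because IDs are normalized to the document level, precision rewards every chunk from a gold document, while recall asks only whether each gold document was found at all.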


In [82]:
from typing import Dict, Any, List, Set
import pandas as pd

def _normalize_evidence_id(eid: str) -> str:
    """Normalize evidence IDs so comparisons work.
    Examples:
      'img::nist_framework.png' -> 'nist_framework.png'
      'doc1.pdf::p3' -> 'doc1.pdf'
      'doc1.pdf' -> 'doc1.pdf'
    """
    if not eid:
        return ""
    eid = eid.strip()
    if eid.startswith("img::"):
        eid = eid.split("img::", 1)[1]
    if "::" in eid:
        eid = eid.split("::", 1)[0]
    return eid

def _gold_set(qobj: Dict[str, Any]) -> Set[str]:
    return set(_normalize_evidence_id(x) for x in qobj.get("gold_evidence_ids", []) if x)

def _retrieved_set_from_hits(hits: List[Dict[str, Any]]) -> List[str]:
    """Given fused evidence dicts from build_context()['evidence'], produce normalized IDs (ranked order kept)."""
    # Text IDs (e.g., doc1.pdf::p3 -> doc1.pdf) and image IDs (e.g., nist_framework.png)
    # go through the same normalization
    return [_normalize_evidence_id(ev.get("id", "")) for ev in hits]

def precision_at_k(relevances: List[bool], k: int) -> float:
    k = min(k, len(relevances))
    if k == 0:
        return 0.0
    return sum(relevances[:k]) / k

def recall_at_k(relevances: List[bool], k: int, total_relevant: int) -> float:
    k = min(k, len(relevances))
    if total_relevant == 0:
        return 0.0
    return sum(relevances[:k]) / total_relevant

def eval_retrieval_for_query(qobj: Dict[str, Any], top_k_evidence: int = 10) -> Dict[str, Any]:
    gold = _gold_set(qobj)
    question = qobj["question"]

    # Q5-like missing-evidence case: gold is empty -> recall undefined, but we can report 0
    # Build multimodal context and use fused evidence list
    ctx = build_context(question, top_k_evidence=top_k_evidence)
    retrieved_ids = _retrieved_set_from_hits(ctx["evidence"])  # ordered list (ranked)

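    rels = [rid in gold for rid in retrieved_ids]
    # For recall, count each gold ID only once: several chunks from the same
    # gold document must not push recall above 1.0
    seen = set()
    rels_unique = []
    for rid in retrieved_ids:
        rels_unique.append(rid in gold and rid not in seen)
        seen.add(rid)
    total_rel = len(gold)

    return {
        "query_id": qobj.get("query_id", qobj.get("id", "")),
        "P@5": precision_at_k(rels, 5),
        "R@10": recall_at_k(rels_unique, 10, total_rel),
        "gold_count": total_rel,
        "retrieved_top": retrieved_ids[:10],
    }

eval_rows = [eval_retrieval_for_query(q) for q in mini_gold]
df_eval = pd.DataFrame(eval_rows)
df_eval


Unnamed: 0,query_id,P@5,R@10,gold_count,retrieved_top
0,Q1,0.8,1.0,1,"[doc1.pdf, zero_trust.png, doc1.pdf, doc1.pdf,..."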
1,Q2,0.2,1.0,1,"[doc4.pdf, doc4.pdf, cyber_kill_chain.png, doc..."
2,Q3,0.2,1.0,1,"[doc5_decrypted.pdf, doc5.pdf, soc2_requiremen..."
3,Q4,0.2,1.0,1,"[doc3.pdf, risk_management_process.png, doc3.p..."
4,Q5,0.0,0.0,0,"[doc3.pdf, cyber_kill_chain.png, impact_likeli..."
5,Q6,0.2,1.0,1,"[doc3.pdf, doc3.pdf, nist_framework.png, doc3...."


In [84]:
import pandas as pd

METHODS = ["Sparse Only", "Dense Only", "Hybrid", "Hybrid + Rerank", "Multimodal"]
eval_results = []

def normalize_evidence_id(eid: str) -> str:
    """Make evidence IDs comparable between gold and retrieved."""
    if not eid:
        return ""
    eid = eid.strip()
    if eid.startswith("img::"):
        eid = eid.split("img::", 1)[1]
    # If chunk ids look like "doc1.pdf::p3", normalize to "doc1.pdf"
    if "::" in eid:
        eid = eid.split("::", 1)[0]
    return eid

def gold_set_for_query(qobj):
    return set(normalize_evidence_id(x) for x in qobj.get("gold_evidence_ids", []) if x)

def precision_at_k_bool(rels, k):
    k = min(k, len(rels))
    if k == 0:
        return 0.0
    return sum(rels[:k]) / k

def recall_at_k_bool(rels, k, total_rel):
    k = min(k, len(rels))
    if total_rel == 0:
        return 0.0
    return sum(rels[:k]) / total_rel

print("Running evaluation across all methods...")

for qobj in mini_gold:
    qid = qobj["query_id"]
    question = qobj["question"]
    gold = gold_set_for_query(qobj)
    total_relevant = len(gold)   # ground-truth count for this query

    for method in METHODS:
        retrieved_ids = []

        if method == "Multimodal":
            # Text part: use your best text method as candidates
            text_hits = get_retrieval_results(question, "Hybrid + Rerank", top_k=10)
            for idx, _ in text_hits:
                # normalize chunk_id -> doc filename
                retrieved_ids.append(normalize_evidence_id(page_chunks[idx].chunk_id))

            # Image part
            img_hits = tfidf_retrieve(question, img_vec, img_X, top_k=5)
            for idx, _ in img_hits:
                # image id should be the filename
                retrieved_ids.append(normalize_evidence_id(image_items[idx].item_id))

        else:
            hits = get_retrieval_results(question, method, top_k=10)
            for idx, _ in hits:
                retrieved_ids.append(normalize_evidence_id(page_chunks[idx].chunk_id))

        # Convert to relevance booleans against gold evidence ids (ranked list)
        rels = [(rid in gold) for rid in retrieved_ids]

        p5 = precision_at_k_bool(rels, 5)
        r10 = recall_at_k_bool(rels, 10, total_relevant)

        eval_results.append({
            "Query": qid,
            "Method": method,
            "Precision@5": round(p5, 2),
            "Recall@10": round(r10, 2),
            "Gold_Evidence_Count": total_relevant,
            "Gold_Evidence": sorted(list(gold)),
            "Retrieved_Top": retrieved_ids[:10],
        })

df_results = pd.DataFrame(eval_results)

print("\n=== Final Deliverable Table (Query x Method x Metrics) ===")
display(df_results[["Query","Method","Precision@5","Recall@10","Gold_Evidence_Count"]])

print("\n=== Comparison View (Precision@5) ===")
display(df_results.pivot(index="Query", columns="Method", values="Precision@5"))

print("\n=== Comparison View (Recall@10) ===")
display(df_results.pivot(index="Query", columns="Method", values="Recall@10"))


Running evaluation across all methods...

=== Final Deliverable Table (Query x Method x Metrics) ===


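| | Query | Method | Precision@5 | Recall@10 | Gold_Evidence_Count |
|---|---|---|---|---|---|
| 0 | Q1 | Sparse Only | 1.0 | 9.0 | 1 |
| 1 | Q1 | Dense Only | 1.0 | 10.0 | 1 |
| 2 | Q1 | Hybrid | 1.0 | 10.0 | 1 |
| 3 | Q1 | Hybrid + Rerank | 1.0 | 10.0 | 1 |
| 4 | Q1 | Multimodal | 1.0 | 10.0 | 1 |
| 5 | Q2 | Sparse Only | 0.0 | 0.0 | 1 |
| 6 | Q2 | Dense Only | 0.0 | 0.0 | 1 |
| 7 | Q2 | Hybrid | 0.0 | 0.0 | 1 |
| 8 | Q2 | Hybrid + Rerank | 0.0 | 0.0 | 1 |
| 9 | Q2 | Multimodal | 0.0 | 0.0 | 1 |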



=== Comparison View (Precision@5) ===


| Query | Dense Only | Hybrid | Hybrid + Rerank | Multimodal | Sparse Only |
|---|---|---|---|---|---|
| Q1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Q2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Q3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Q4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Q5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Q6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |



=== Comparison View (Recall@10) ===


| Query | Dense Only | Hybrid | Hybrid + Rerank | Multimodal | Sparse Only |
|---|---|---|---|---|---|
| Q1 | 10.0 | 10.0 | 10.0 | 10.0 | 9.0 |
| Q2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Q3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Q4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Q5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Q6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


# 6) Evaluation + Logging (Required)

Every query must append to: `logs/query_metrics.csv`

Required columns (minimum):
- timestamp
- query_id
- retrieval_mode
- top_k
- latency_ms
- Precision@5
- Recall@10
- evidence_ids_returned
- faithfulness_pass
- missing_evidence_behavior

> If your gold set is incomplete (common for Q4/Q5), compute P/R only for labeled queries and still log latency/evidence IDs.
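
The logging requirement above can be sketched with the standard-library `csv` module. This is a minimal illustration, not the template's own logger: the names `LOG_COLUMNS` and `append_log_row` are hypothetical, and the column list is copied from the required minimum above. Metrics passed as `None` (for example P/R on an unlabeled query) are written as empty cells, so latency and evidence IDs are still recorded.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Column names copied from the required minimum listed above.
LOG_COLUMNS = [
    "timestamp", "query_id", "retrieval_mode", "top_k", "latency_ms",
    "Precision@5", "Recall@10", "evidence_ids_returned",
    "faithfulness_pass", "missing_evidence_behavior",
]

def append_log_row(path, row):
    """Append one query record; write the header only when the file is new or empty."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    is_new = not p.exists() or p.stat().st_size == 0
    # Fill in a UTC timestamp unless the caller supplied one.
    record = {"timestamp": datetime.now(timezone.utc).isoformat(), **row}
    with open(p, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_COLUMNS)
        if is_new:
            writer.writeheader()
        # Missing keys and None values become empty cells.
        writer.writerow({k: "" if record.get(k) is None else record[k] for k in LOG_COLUMNS})
```

A call such as `append_log_row("logs/query_metrics.csv", {"query_id": "Q5", "latency_ms": 3.28})` would then leave the `Precision@5` and `Recall@10` cells empty for that row.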

## How we define metrics (simple)
- `Precision@K`: (# retrieved evidence IDs in gold) / K
- `Recall@K`: (# retrieved evidence IDs in gold) / (size of gold set)
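
As a worked example of the two definitions above, the sketch below counts each distinct evidence ID once, so repeated chunks of the same document cannot push recall above 1.0. The ranked list and gold set are made-up illustrations (not results from this project), and returning `None` for an empty gold set is one possible convention for unlabeled queries.

```python
def precision_at_k(retrieved_ids, gold_ids, k):
    """(# distinct gold IDs among the top-k retrieved) / k."""
    if k <= 0:
        return None
    return len(set(retrieved_ids[:k]) & set(gold_ids)) / k

def recall_at_k(retrieved_ids, gold_ids, k):
    """(# distinct gold IDs among the top-k retrieved) / (size of gold set)."""
    gold = set(gold_ids)
    if not gold:
        return None  # undefined when the query has no labeled evidence
    return len(set(retrieved_ids[:k]) & gold) / len(gold)

# Hypothetical ranked output in which the same document appears three times.
ranked = ["doc1.pdf", "zero_trust.png", "doc1.pdf", "doc3.pdf", "doc1.pdf"]
gold = ["doc1.pdf"]

print(precision_at_k(ranked, gold, 5))  # 0.2 (one distinct gold ID in the top 5)
print(recall_at_k(ranked, gold, 10))    # 1.0 (the single gold ID was retrieved)
```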

**Faithfulness (Yes/No):**
- Yes if the answer **only** uses retrieved evidence and includes citations.
- For this template, we implement a simple heuristic. Replace with your rubric/judge if desired.


In [85]:
import os
import csv
import json
import time
from datetime import datetime, timezone
from pathlib import Path

import numpy as np
import pandas as pd



def _canon_evidence_id(x: str) -> str:
    x = str(x).strip()
    # keep img:: prefix intact
    if x.startswith('img::'):
        return x
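    # Normalize file IDs by dropping a trailing '.txt'.
    # Note: other extensions are kept, so a gold ID such as 'doc1.pdf' does not match a
    # retrieved ID 'doc1'; align your gold IDs with this convention or extend the rule.
    if x.endswith('.txt'):
        return x[:-4]
    return x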

def _normalize_retrieved_ids(retrieved):
    """Normalize retrieved outputs into a list of evidence IDs.
    Returns canonical IDs (doc_id without .txt, or img::filename).

    Supports: list[dict], list[(idx,score)], list[str].
    """
    if retrieved is None:
        return []
    if len(retrieved) == 0:
        return []
    # list[str]
    if isinstance(retrieved[0], str):
        return [_canon_evidence_id(r) for r in retrieved]
    # list[dict]
    if isinstance(retrieved[0], dict):
        out=[]
        for r in retrieved:
            if 'evidence_id' in r and r['evidence_id']:
                out.append(_canon_evidence_id(r['evidence_id']))
            elif 'doc_id' in r and r['doc_id']:
                out.append(_canon_evidence_id(r['doc_id']))
            elif 'source' in r and r['source']:
                out.append(_canon_evidence_id(os.path.basename(str(r['source']))))
        return out
    # list[(idx, score)]
    if isinstance(retrieved[0], (tuple, list)) and len(retrieved[0]) >= 1:
        out=[]
        for item in retrieved:
            idx = int(item[0])
            if 'items' in globals() and 0 <= idx < len(items):
                out.append(_canon_evidence_id(items[idx].get('evidence_id')))
            elif 'documents' in globals() and 0 <= idx < len(documents):
                out.append(_canon_evidence_id(documents[idx].get('doc_id') or os.path.basename(documents[idx].get('source',''))))
        return out
    return []

def _normalize_gold_ids(gold_ids):
    if not gold_ids or gold_ids == ['N/A']:
        return None
    return [_canon_evidence_id(g) for g in gold_ids]

def precision_at_k(retrieved, gold_ids, k):
    gold = _normalize_gold_ids(gold_ids)
    if gold is None:
        return None
    retrieved_ids = _normalize_retrieved_ids(retrieved)[:k]
    if k == 0:
        return None
    return len(set(retrieved_ids) & set(gold)) / float(k)

def recall_at_k(retrieved, gold_ids, k):
    gold = _normalize_gold_ids(gold_ids)
    if gold is None:
        return None
    retrieved_ids = _normalize_retrieved_ids(retrieved)[:k]
    denom = float(len(set(gold)))
    return (len(set(retrieved_ids) & set(gold)) / denom) if denom > 0 else None



def faithfulness_heuristic(answer: str, evidence: list):
    # Simple heuristic: answer includes at least one citation tag from evidence OR is missing-evidence msg
    if answer.strip() == MISSING_EVIDENCE_MSG:
        return True
    tags = [e["citation_tag"] for e in evidence[:5]]
    return any(tag in answer for tag in tags)

def missing_evidence_behavior(answer: str, evidence: list):
    # Pass if either: evidence present and answer not missing-evidence; or evidence absent and answer is missing-evidence msg
    has_ev = bool(evidence) and max(e.get("score", 0.0) for e in evidence) >= 0.05
    if not has_ev:
        return "Pass" if answer.strip() == MISSING_EVIDENCE_MSG else "Fail"
    else:
        return "Pass" if answer.strip() != MISSING_EVIDENCE_MSG else "Fail"

def ensure_logfile(path: str, header: list):
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    if not p.exists():
        with open(p, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)

LOG_HEADER = [
    "timestamp", "query_id", "retrieval_mode", "top_k", "latency_ms",
    "Precision@5", "Recall@10",
    "evidence_ids_returned", "gold_evidence_ids",
    "faithfulness_pass", "missing_evidence_behavior"
]
ensure_logfile(cfg.log_file, LOG_HEADER)

def run_query_and_log(query_item, retrieval_mode = 'hybrid', top_k=10):
    question = query_item["question"]
    gold_ids = query_item.get("gold_evidence_ids", [])

    t0 = time.time()
    evidence = retrieve_tfidf(question, top_k=top_k)  # replace with your pipeline + modes
    answer = generate_answer_stub(question, evidence) # replace with LLM/VLM
    latency_ms = (time.time() - t0) * 1000.0

    retrieved_ids = [e["chunk_id"] for e in evidence]
    p5 = precision_at_k(retrieved_ids, gold_ids, cfg.eval_p_at) if gold_ids else np.nan
    r10 = recall_at_k(retrieved_ids, gold_ids, cfg.eval_r_at) if gold_ids else np.nan

    faithful = faithfulness_heuristic(answer, evidence)
    meb = missing_evidence_behavior(answer, evidence)

    row = [
        datetime.now(timezone.utc).isoformat(),
        query_item["query_id"],
        retrieval_mode,
        top_k,
        round(latency_ms, 2),
        p5,
        r10,
        json.dumps(retrieved_ids),
        json.dumps(gold_ids),
        "Yes" if faithful else "No",
        meb
    ]
    with open(cfg.log_file, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(row)

    return {"answer": answer, "evidence": evidence, "p5": p5, "r10": r10, "latency_ms": latency_ms, "faithful": faithful, "meb": meb}

# Run every mini-gold query once (demo)
results = []
for qi in mini_gold:
    results.append(run_query_and_log(qi, retrieval_mode = 'hybrid', top_k=cfg.top_k_default))

pd.read_csv(cfg.log_file).tail(8)


| | timestamp | query_id | retrieval_mode | top_k | latency_ms | Precision@5 | Recall@10 | evidence_ids_returned | gold_evidence_ids | faithfulness_pass | missing_evidence_behavior |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 2026-02-12T00:28:10.492067+00:00 | Q5 | hybrid | 10 | 1.35 | | | ["doc4", "doc3", "doc2", "doc1", "doc5_decrypt... | [] | Yes | Pass |
| 11 | 2026-02-12T00:28:10.493481+00:00 | Q6 | hybrid | 10 | 1.27 | 0.0 | 0.0 | ["doc3", "doc2", "doc4", "doc5_decrypted", "do... | ["img::nist_framework.png"] | Yes | Pass |
| 12 | 2026-02-12T00:57:59.163049+00:00 | Q1 | hybrid | 10 | 5.53 | 0.0 | 0.0 | ["doc1", "doc2", "doc4", "doc5_decrypted", "do... | ["doc1.pdf"] | Yes | Pass |
| 13 | 2026-02-12T00:57:59.167601+00:00 | Q2 | hybrid | 10 | 3.45 | 0.0 | 0.0 | ["doc2", "doc4", "doc3", "doc1", "doc5_decrypt... | ["img::cyber_kill_chain.png"] | Yes | Pass |
| 14 | 2026-02-12T00:57:59.171494+00:00 | Q3 | hybrid | 10 | 3.6 | 0.0 | 0.0 | ["doc5_decrypted", "doc4", "doc3", "doc2", "do... | ["img::soc2_requirements.png"] | Yes | Pass |
| 15 | 2026-02-12T00:57:59.174511+00:00 | Q4 | hybrid | 10 | 2.75 | 0.0 | 0.0 | ["doc3", "doc4", "doc5_decrypted", "doc1", "do... | ["img::impact_likelihood_matrix.png"] | Yes | Pass |
| 16 | 2026-02-12T00:57:59.178038+00:00 | Q5 | hybrid | 10 | 3.28 | | | ["doc4", "doc3", "doc2", "doc1", "doc5_decrypt... | [] | Yes | Pass |
| 17 | 2026-02-12T00:57:59.181087+00:00 | Q6 | hybrid | 10 | 2.74 | 0.0 | 0.0 | ["doc3", "doc2", "doc4", "doc5_decrypted", "do... | ["img::nist_framework.png"] | Yes | Pass |


In [86]:
# Task: Run retrieval + answer generation for all mini-gold queries
# This cell is self-contained: if retrieval/indexing cells were skipped, it will bootstrap a TF-IDF retriever.
import os
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Build a local evidence list if not already present
if 'items' in globals():
    _evidence = items
elif 'documents' in globals():
    _evidence = []
    for d in documents:
        _evidence.append({
            'evidence_id': d.get('doc_id') or os.path.basename(d.get('source','')),
            'modality': 'text',
            'source': d.get('source'),
            'text': d.get('text','')
        })
else:
    raise NameError('Neither items nor documents are defined. Run the ZIP extraction + document loading cells first.')

assert len(_evidence) > 0, 'Evidence store is empty.'

# Canonicalize evidence ids for consistent evaluation
def _canon_evidence_id(x: str) -> str:
    x = str(x).strip()
    if x.startswith('img::'):
        return x
    return x[:-4] if x.endswith('.txt') else x

# Bootstrap TF-IDF retriever if no retriever exists
if 'retrieve_hybrid' not in globals() and 'retrieve_tfidf' not in globals() and 'retrieve' not in globals():
    _texts = [it.get('text','') for it in _evidence]
    _tfidf = TfidfVectorizer(stop_words=None, token_pattern=r'(?u)\b\w+\b')
    _tfidf_mat = _tfidf.fit_transform(_texts)

    def retrieve_tfidf(query, top_k=10):
        qv = _tfidf.transform([query])
        sims = cosine_similarity(qv, _tfidf_mat).ravel()
        idx = np.argsort(sims)[::-1][:top_k]
        return [(int(i), float(sims[i])) for i in idx]

# Define retrieve() wrapper if missing
if 'retrieve' not in globals():
    def retrieve(question, retrieval_mode='hybrid', top_k=10, alpha=0.6):
        # Prefer hybrid if available; otherwise TF-IDF
        if retrieval_mode == 'hybrid' and 'retrieve_hybrid' in globals():
            hits = retrieve_hybrid(question, top_k=top_k, alpha=alpha)
            return hits, {'mode':'hybrid'}
        if 'retrieve_tfidf' in globals():
            hits = retrieve_tfidf(question, top_k=top_k)
            return hits, {'mode':'tfidf'}
        raise NameError('No retriever available. Execute the retrieval/indexing section.')

# Ensure build_context exists
if 'build_context' not in globals():
    def build_context(hit_ids, max_chars=1400):
        parts=[]
        for i in hit_ids:
            parts.append(f"[{_evidence[i].get('evidence_id')}] {_evidence[i].get('text','')}")
        ctx='\n'.join(parts)
        return ctx[:max_chars]

# Ensure extractive_answer exists
if 'extractive_answer' not in globals():
    import re
    def extractive_answer(query, context):
        q=set(re.findall(r'[A-Za-z]+', query.lower()))
        sents=re.split(r'(?<=[.!?])\s+', (context or '').strip())
        scored=[]
        for s in sents:
            w=set(re.findall(r'[A-Za-z]+', s.lower()))
            scored.append((len(q & w), s.strip()))
        scored.sort(key=lambda x:x[0], reverse=True)
        best=[s for sc,s in scored[:3] if sc>0]
        # Use the same wording as MISSING_EVIDENCE_MSG so the missing-evidence check can match it
        return ' '.join(best) if best else 'Not enough evidence in the retrieved context.'

rows=[]
for ex in mini_gold:
    qid = ex.get('query_id')
    question = ex.get('question')
    gold = ex.get('gold_evidence_ids')

    if 'run_query_and_log' in globals():
        # Call run_query_and_log with the full query item dictionary 'ex'
        out = run_query_and_log(ex, retrieval_mode='hybrid', top_k=10)
        answer = out.get('answer')
        # The 'evidence' key from run_query_and_log output contains a list of dicts with 'chunk_id'
        evidence = [e['chunk_id'] for e in out.get('evidence', [])]
    else:
        hits, debug = retrieve(question, retrieval_mode='hybrid', top_k=10)
        hit_ids = [int(i) for i,_ in hits]
        context = build_context(hit_ids[:10])
        answer = extractive_answer(question, context)
        evidence = [_canon_evidence_id(_evidence[i].get('evidence_id')) for i in hit_ids[:10]]

    rows.append({
        'query_id': qid,
        'question': question,
        'answer': answer,
        'evidence_ids_returned(top10)': evidence,
        'gold_evidence_ids': gold,
    })

df_answers = pd.DataFrame(rows)
df_answers


| | query_id | question | answer | evidence_ids_returned(top10) | gold_evidence_ids |
|---|---|---|---|---|---|
| 0 | Q1 | According to doc1.pdf, what are State Data Bre... | Based on the retrieved evidence [doc1], the sy... | [doc1, doc2, doc4, doc5_decrypted, doc3] | [doc1.pdf] |
| 1 | Q2 | What are the main stages shown in the Cyber Ki... | Based on the retrieved evidence [doc2], the sy... | [doc2, doc4, doc3, doc1, doc5_decrypted] | [img::cyber_kill_chain.png] |
| 2 | Q3 | What are the SOC 2 Trust Service Criteria cate... | Based on the retrieved evidence [doc5_decrypte... | [doc5_decrypted, doc4, doc3, doc2, doc1] | [img::soc2_requirements.png] |
| 3 | Q4 | From the impact-likelihood risk matrix, which ... | Based on the retrieved evidence [doc3], the sy... | [doc3, doc4, doc5_decrypted, doc1, doc2] | [img::impact_likelihood_matrix.png] |
| 4 | Q5 | Who won the FIFA World Cup in 2050? | Not enough evidence in the retrieved context. | [doc4, doc3, doc2, doc1, doc5_decrypted] | [] |
| 5 | Q6 | What are the five core functions shown in the ... | Based on the retrieved evidence [doc3], the sy... | [doc3, doc2, doc4, doc5_decrypted, doc1] | [img::nist_framework.png] |


# 7) Streamlit App Skeleton (Required)

You will create a Streamlit app file in your repo, e.g.:

- `app/main.py`

This notebook can generate a starter `app/main.py` for your team.

### Required UI components
- Query input box
- Retrieval controls (mode, top_k, multimodal toggle if applicable)
- Answer panel
- Evidence panel (with citations)
- Metrics panel (latency, P@5, R@10 if available)
- Logging happens automatically on each query

> This skeleton calls functions in your Python modules. Prefer moving retrieval logic into `/rag/` and importing it.


In [2]:
from pathlib import Path


In [3]:
# Generate a starter Streamlit app file (edit paths as needed).
# In your repo: create /app/main.py and move shared logic into /rag/

streamlit_code = r'''
import json, time
from pathlib import Path
import streamlit as st
import pandas as pd

# --- Import your team pipeline here ---
# from rag.pipeline import retrieve, generate_answer, run_query_and_log

MISSING_EVIDENCE_MSG = "Not enough evidence in the retrieved context."

st.set_page_config(page_title="CS5542 Lab 4 - Project RAG App", layout="wide")
st.title("CS 5542 Lab 4 - Project RAG Application")
st.caption("Project-aligned Streamlit UI + automatic logging + failure monitoring")

# Sidebar controls
st.sidebar.header("Retrieval Settings")
retrieval_mode = st.sidebar.selectbox("retrieval_mode", ["tfidf", "dense", "sparse", "hybrid", "hybrid_rerank"])
top_k = st.sidebar.slider("top_k", min_value=1, max_value=30, value=10, step=1)
use_multimodal = st.sidebar.checkbox("use_multimodal", value=True)

st.sidebar.header("Logging")
log_path = st.sidebar.text_input("log file", value="logs/query_metrics.csv")

# --- Mini gold set (replace with your team's Q1–Q5) ---
# Tip: keep the same structure as in your Lab 4 notebook so IDs match logs.
MINI_GOLD = {
    "Q1": {"question": "Replace with your project Q1", "gold_evidence_ids": []},
    "Q2": {"question": "Replace with your project Q2", "gold_evidence_ids": []},
    "Q3": {"question": "Replace with your project Q3", "gold_evidence_ids": []},
    "Q4": {"question": "Replace with your project Q4 (multimodal/table/figure)", "gold_evidence_ids": []},
    "Q5": {"question": "Replace with your project Q5 (missing-evidence case)", "gold_evidence_ids": []},
}

st.sidebar.header("Evaluation")
query_id = st.sidebar.selectbox("query_id (for logging)", list(MINI_GOLD.keys()))
use_gold_question = st.sidebar.checkbox("Use the gold-set question text", value=True)

# Main query
default_q = MINI_GOLD[query_id]["question"] if use_gold_question else ""
question = st.text_area("Enter your question", value=default_q, height=120)
run_btn = st.button("Run Query")

colA, colB = st.columns([2, 1])

def ensure_logfile(path: str):
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    if not p.exists():
        df = pd.DataFrame(columns=[
            "timestamp","query_id","retrieval_mode","top_k","latency_ms",
            "Precision@5","Recall@10","evidence_ids_returned","gold_evidence_ids",
            "faithfulness_pass","missing_evidence_behavior"
        ])
        df.to_csv(p, index=False)

def precision_at_k(retrieved_ids, gold_ids, k=5):
    if not gold_ids:
        return None
    topk = retrieved_ids[:k]
    hits = sum(1 for x in topk if x in set(gold_ids))
    return hits / k

def recall_at_k(retrieved_ids, gold_ids, k=10):
    if not gold_ids:
        return None
    topk = retrieved_ids[:k]
    hits = sum(1 for x in topk if x in set(gold_ids))
    return hits / max(1, len(gold_ids))

# ---- Placeholder demo logic (replace with imports from your /rag module) ----
def retrieve_demo(q: str, top_k: int):
    return [{"chunk_id":"demo_doc","citation_tag":"[demo_doc]","score":0.9,"source":"data/docs/demo_doc.txt","text":"demo evidence..."}]

def answer_demo(q: str, evidence: list):
    if not evidence:
        return MISSING_EVIDENCE_MSG
    return f"Grounded answer using {evidence[0]['citation_tag']}"

def log_row(path: str, row: dict):
    ensure_logfile(path)
    df = pd.read_csv(path)
    df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
    df.to_csv(path, index=False)
# --------------------------------------------------------------------------

if run_btn and question.strip():
    t0 = time.time()
    evidence = retrieve_demo(question, top_k=top_k)
    answer = answer_demo(question, evidence)
    latency_ms = round((time.time() - t0)*1000, 2)

    retrieved_ids = [e["chunk_id"] for e in evidence]
    gold_ids = MINI_GOLD[query_id].get("gold_evidence_ids", [])

    p5 = precision_at_k(retrieved_ids, gold_ids, k=5)
    r10 = recall_at_k(retrieved_ids, gold_ids, k=10)

    with colA:
        st.subheader("Answer")
        st.write(answer)

        st.subheader("Evidence (Top-K)")
        st.json(evidence)

    with colB:
        st.subheader("Metrics")
        st.write({"latency_ms": latency_ms, "Precision@5": p5, "Recall@10": r10})

    # Log the query using the selected Q1–Q5 ID (not ad-hoc)
    # Faithfulness heuristic: the answer is the missing-evidence message or cites retrieved evidence.
    faithful = answer == MISSING_EVIDENCE_MSG or any(e["citation_tag"] in answer for e in evidence)
    row = {
        "timestamp": pd.Timestamp.utcnow().isoformat(),
        "query_id": query_id,
        "retrieval_mode": retrieval_mode,
        "top_k": top_k,
        "latency_ms": latency_ms,
        "Precision@5": p5,
        "Recall@10": r10,
        "evidence_ids_returned": json.dumps(retrieved_ids),
        "gold_evidence_ids": json.dumps(gold_ids),
        "faithfulness_pass": "Yes" if faithful else "No",
        "missing_evidence_behavior": "Pass"  # update with your rule if needed
    }
    log_row(log_path, row)
    st.success(f"Logged {query_id} to CSV.")
'''
app_dir = Path("app")
app_dir.mkdir(parents=True, exist_ok=True)
(app_dir / "main.py").write_text(streamlit_code, encoding="utf-8")
print("Wrote starter Streamlit app to:", app_dir / "main.py")


Wrote starter Streamlit app to: app/main.py


# 8) Optional Extension: FastAPI Backend (Recommended for larger teams)

If your team selects the **FastAPI extension**, create:
- `api/server.py` with `POST /query`
- Streamlit UI calls the API using `requests.post(...)`

This separation mirrors real production systems:
UI (Streamlit) → API (FastAPI) → Retrieval + LLM services

Below is a minimal FastAPI starter you can generate.


In [4]:
fastapi_code = r'''
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List, Dict, Any

app = FastAPI(title="CS5542 Lab 4 RAG Backend")

MISSING_EVIDENCE_MSG = "Not enough evidence in the retrieved context."

class QueryIn(BaseModel):
    question: str
    top_k: int = 10
    retrieval_mode: str = "hybrid"
    use_multimodal: bool = True

@app.post("/query")
def query(q: QueryIn) -> Dict[str, Any]:
    # TODO: import your real pipeline:
    # evidence = retrieve(q.question, top_k=q.top_k, mode=q.retrieval_mode, use_multimodal=q.use_multimodal)
    # answer = generate_answer(q.question, evidence)
    evidence = [{"chunk_id":"demo_doc","citation_tag":"[demo_doc]","score":0.9,"source":"data/docs/demo_doc.txt","text":"demo evidence..."}]
    answer = f"Grounded answer using {evidence[0]['citation_tag']}"
    return {
        "answer": answer,
        "evidence": evidence,
        "metrics": {"top_k": q.top_k, "retrieval_mode": q.retrieval_mode},
        "failure_flag": False
    }
'''
api_dir = Path("api")
api_dir.mkdir(parents=True, exist_ok=True)
(api_dir / "server.py").write_text(fastapi_code, encoding="utf-8")
print("Wrote starter FastAPI server to:", api_dir / "server.py")

print("\nRun locally (terminal):")
print("  uvicorn api.server:app --reload --port 8000")


Wrote starter FastAPI server to: api/server.py

Run locally (terminal):
  uvicorn api.server:app --reload --port 8000


In [5]:
# Temporary Build

import os
from pathlib import Path

# We embed the pipeline logic (loading, indexing, retrieval) directly into the server file
# so it runs independently of the notebook kernel.
fastapi_code = r'''
import os
import glob
import re
import numpy as np
from typing import List, Dict, Any, Tuple
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

app = FastAPI(title="CS5542 Lab 4 RAG Backend")

# --- 1. Global State & Data Loading ---
# We store the index and evidence globally so they load once on startup
global_state = {
    "vectorizer": None,
    "tfidf_matrix": None,
    "evidence": []
}

def load_data_and_index():
    """Loads text files from ./data/docs and builds a TF-IDF index."""
    print("Loading data from ./data/docs...")
    docs_dir = "./data/docs"

    # Simple loader matching your notebook's logic
    items = []
    if os.path.exists(docs_dir):
        files = glob.glob(os.path.join(docs_dir, "*.txt")) + glob.glob(os.path.join(docs_dir, "*.pdf"))
        for p in files:
            # For simplicity in this demo server, we read text files directly.
            # If using PDFs, you'd include PyMuPDF logic here or assume pre-converted .txts exist.
            try:
                # Fallback: try reading as text (works for the .txt demo files)
                with open(p, "r", encoding="utf-8", errors="ignore") as f:
                    text = f.read()

                items.append({
                    "evidence_id": os.path.basename(p),
                    "source": p,
                    "text": text,
                    "citation_tag": f"[{os.path.basename(p)}]"
                })
            except Exception as e:
                print(f"Skipping file {p}: {e}")

    if not items:
        print("WARNING: No documents found in ./data/docs. Server will have empty index.")
        return

    # Build TF-IDF Index
    texts = [it["text"] for it in items]
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(texts)

    global_state["evidence"] = items
    global_state["vectorizer"] = vectorizer
    global_state["tfidf_matrix"] = tfidf_matrix
    print(f"Server ready: Indexed {len(items)} documents.")

# Load on startup
@app.on_event("startup")
def startup_event():
    load_data_and_index()

# --- 2. Your Pipeline Functions (Ported from Notebook) ---

def retrieve_tfidf(question: str, top_k: int = 5) -> List[Tuple[int, float]]:
    vec = global_state["vectorizer"]
    mat = global_state["tfidf_matrix"]

    if vec is None or mat is None:
        return []

    q_vec = vec.transform([question])
    sims = cosine_similarity(q_vec, mat).ravel()
    # Get top_k indices
    idxs = np.argsort(-sims)[:top_k]
    return [(int(i), float(sims[i])) for i in idxs]

def build_context(hit_ids: List[int], max_chars=2000) -> str:
    parts = []
    remaining = max_chars
    for i in hit_ids:
        if remaining <= 0:
            break
        item = global_state["evidence"][i]
        # Truncate each entry to the remaining character budget, so one document
        # longer than max_chars cannot leave the context empty.
        entry = f"{item['citation_tag']} {item['text']}"[:remaining]
        parts.append(entry)
        remaining -= len(entry)
    return "\n\n".join(parts)

def extractive_answer(query: str, context: str) -> str:
    """Simple heuristic answer generator from your notebook."""
    if not context.strip():
        return "Not enough evidence in the retrieved context."

    # Heuristic: Find sentences with overlapping words
    q_words = set(re.findall(r'[A-Za-z]+', query.lower()))
    # Split by simple punctuation
    sents = re.split(r'(?<=[.!?])\s+', context)

    scored = []
    for s in sents:
        w = set(re.findall(r'[A-Za-z]+', s.lower()))
        score = len(q_words & w)
        if score > 0:
            scored.append((score, s.strip()))

    scored.sort(key=lambda x: x[0], reverse=True)

    # Return top 3 sentences or fallback
    best = [s for sc, s in scored[:3]]
    if best:
        return " ".join(best)
    return "Not enough evidence in the retrieved context."

# --- 3. API Endpoint ---

class QueryIn(BaseModel):
    question: str
    top_k: int = 5
    retrieval_mode: str = "hybrid" # We use tfidf fallback in this server

@app.post("/query")
def query(q: QueryIn) -> Dict[str, Any]:
    # 1. Retrieve
    # (Currently forcing TF-IDF pipeline for the server demo)
    hits = retrieve_tfidf(q.question, top_k=q.top_k)

    # 2. Format Evidence
    evidence_list = []
    hit_indices = []
    for idx, score in hits:
        item = global_state["evidence"][idx]
        evidence_list.append({
            "chunk_id": item["evidence_id"],
            "citation_tag": item["citation_tag"],
            "score": score,
            "source": item["source"],
            "text": item["text"][:500] + "..." # Truncate for API response payload
        })
        hit_indices.append(idx)

    # 3. Generate Answer
    context = build_context(hit_indices)
    answer = extractive_answer(q.question, context)

    return {
        "answer": answer,
        "evidence": evidence_list,
        "metrics": {
            "top_k": q.top_k,
            "retrieval_mode": "tfidf_server_baseline"
        },
        "failure_flag": answer == "Not enough evidence in the retrieved context."
    }

if __name__ == "__main__":
    import uvicorn
    # Allow running directly via python api/server.py
    uvicorn.run(app, host="0.0.0.0", port=8000)
'''

api_dir = Path("api")
api_dir.mkdir(parents=True, exist_ok=True)
(api_dir / "server.py").write_text(fastapi_code, encoding="utf-8")
print("✅ Wrote self-contained FastAPI server to:", api_dir / "server.py")
print("\nRun locally (terminal):")
print("  uvicorn api.server:app --reload --port 8000")

✅ Wrote self-contained FastAPI server to: api/server.py

Run locally (terminal):
  uvicorn api.server:app --reload --port 8000
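
Once the server is running, a client call to `POST /query` can be sketched as follows. The section above suggests `requests.post(...)` from the Streamlit UI; this sketch swaps in the standard-library `urllib` so it has no extra dependency. The helper names `build_query_request` and `call_query_api` are hypothetical, and the URL assumes the server was started locally on port 8000 with the uvicorn command shown above.

```python
import json
import urllib.request

# Assumption: the FastAPI server from this section is running locally on port 8000.
API_URL = "http://localhost:8000/query"

def build_query_request(question, top_k=5, retrieval_mode="hybrid", url=API_URL):
    """Build a JSON POST request whose fields mirror the QueryIn model."""
    payload = {"question": question, "top_k": top_k, "retrieval_mode": retrieval_mode}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def call_query_api(question, timeout=10.0, **kwargs):
    """Send the request and return the decoded JSON response as a dict."""
    req = build_query_request(question, **kwargs)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

In the Streamlit app, the result of `call_query_api(question, top_k=top_k)` would supply the `answer` and `evidence` fields that the demo functions currently produce locally.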


# 9) Deployment checklist (Required)

Choose **one** deployment route and publish the public link in your README:

- HuggingFace Spaces (Streamlit)
- Streamlit Cloud (GitHub-connected)
- Render / Railway (GitHub-connected)

## README must include
1. Public deployment link  
2. How to run locally:
   - `pip install -r requirements.txt`
   - `streamlit run app/main.py`
3. A screenshot of:
   - the UI
   - evidence panel
   - metrics panel
4. Results snapshot:
   - **5 queries × 2 retrieval modes**
5. Failure analysis:
   - 2 failure cases, root cause, proposed fix

---

# 10) Failure analysis template (Required)

Document:
1. **Retrieval failure** (wrong evidence or missed gold evidence)  
2. **Grounding / missing-evidence failure** (safe behavior or citation enforcement)

For each:
- What happened?
- Why did it happen (root cause)?
- What change will you implement next?

You can paste your analysis into your README under **Lab 4 Results**.
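
For the grounding / missing-evidence case, citation enforcement can be sketched as a post-check on the generated answer. The fallback string matches the one the FastAPI server compares against for `failure_flag`; the bracketed citation format (`[source]`) and the function name `enforce_citations` are assumptions for illustration, so adapt the pattern to however your pipeline writes citations.

```python
import re

# Same fallback message the server uses to set failure_flag.
FALLBACK = "Not enough evidence in the retrieved context."


def enforce_citations(answer, evidence_sources):
    """Return (answer, failure_flag).

    Assumes citations appear as [source] tags in the answer text.
    The answer is replaced with the fallback when it cites nothing,
    or cites a source that was not among the retrieved evidence.
    """
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    allowed = set(evidence_sources)
    if not cited or not cited <= allowed:
        return FALLBACK, True
    return answer, False
```

Returning the flag alongside the answer lets the logger record the failure without re-parsing the text.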


# 11) Team checklist (quick)

Before submission, verify:

- [ ] Dataset, UI, and models are **project-aligned**
- [ ] Streamlit app runs locally and shows: answer + evidence + metrics
- [ ] `logs/query_metrics.csv` is auto-created and appended per query
- [ ] Mini gold set Q1–Q5 exists, and P@5/R@10 are computed when possible
- [ ] Deployed link is public and listed in README
- [ ] Two failure cases documented with fixes
- [ ] `requirements.txt` and run instructions are correct
- [ ] Individual survey submitted by each teammate
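
As a reference for the P@5/R@10 item, here is a minimal sketch of the two metrics. It assumes retrieved results and gold labels are identified by comparable IDs (for example chunk IDs), that retrieved IDs are unique, and that precision is divided by `k` rather than by the number of results returned; the function names are illustrative. Both functions return `None` when a query has no gold labels, which corresponds to the "when possible" case.

```python
def precision_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of the top-k slots filled by gold items (denominator is k)."""
    gold = set(gold_ids)
    if not gold or k <= 0:
        return None  # metric is undefined without gold labels
    hits = sum(1 for r in retrieved_ids[:k] if r in gold)
    return hits / k


def recall_at_k(retrieved_ids, gold_ids, k=10):
    """Fraction of gold items that appear in the top-k retrieved results."""
    gold = set(gold_ids)
    if not gold or k <= 0:
        return None  # metric is undefined without gold labels
    hits = len(set(retrieved_ids[:k]) & gold)
    return hits / len(gold)
```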

---

To go beyond the base requirements, add an evaluation dashboard, reranking integration, or FastAPI separation (extensions).


In [35]:
# Verification: retrieval should return non-empty results for a project-relevant query

test_q = "State Data Breach Notification Laws HIPAA GLBA"

try:
    if 'retrieve_tfidf' in globals():
        hits = retrieve_tfidf(test_q, top_k=5)
    elif 'retrieve' in globals():
        hits = retrieve(test_q, top_k=5)
    else:
        hits = []

    if hits is None:
        hits = []

    n = len(hits) if hasattr(hits, '__len__') else 0
    print('Project retrieval hits:', n)
    assert n > 0, 'Retrieval returned empty results. Check corpus + indexing.'

    # Show top hit preview so you know it matches your dataset
    i0, s0 = hits[0]
    print("Top hit:", getattr(page_chunks[i0], "chunk_id", f"chunk_{i0}"), "score=", s0)
    print("Preview:", (page_chunks[i0].text or "")[:200].replace("\n", " "))

except Exception as e:
    print('⚠️ Retrieval verification could not run.')
    print('Reason:', type(e).__name__, str(e)[:180])


Project retrieval hits: 0
⚠️ Retrieval verification could not run.
Reason: AssertionError Retrieval returned empty results. Check corpus + indexing.
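
The zero-hit output above is ambiguous: the cell sets `hits = []` both when a retriever returns nothing and when neither `retrieve_tfidf` nor `retrieve` is defined in the session. A small diagnostic like the sketch below can tell these cases apart. The helper name `diagnose_retrieval` is hypothetical, and it assumes the retrievers accept `(query, top_k=...)` as in the cell above.

```python
def diagnose_retrieval(namespace, query, top_k=5):
    """Return a short status string explaining the retrieval outcome.

    `namespace` is a dict of names to objects, e.g. globals() in a notebook.
    """
    for name in ("retrieve_tfidf", "retrieve"):
        fn = namespace.get(name)
        if callable(fn):
            hits = fn(query, top_k=top_k)
            n = 0 if hits is None else len(hits)
            if n == 0:
                return f"{name} is defined but returned no hits"
            return f"{name} returned {n} hits"
    return "no retrieval function is defined in this session"
```

In the notebook this could be called as `print(diagnose_retrieval(globals(), test_q))`. If no function is defined, re-run the Lab-3 retrieval cells first; if a function is defined but returns nothing, check that the corpus was loaded and indexed.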



## GitHub Deployment Example

### Step 1: Push to GitHub
```bash
git init
git add .
git commit -m "Lab4 deployment"
git branch -M main
git remote add origin https://github.com/<username>/<repo>.git
git push -u origin main
```

### Step 2: Deploy using Streamlit Cloud
1. Visit https://share.streamlit.io
2. Click **New App**
3. Select your GitHub repository
4. Branch: `main`
5. App path: `app/main.py`
6. Click **Deploy**

### Step 3: Add deployment link
Include the deployed application URL in your README.md file.
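
Before pushing, it can help to confirm that the files the deployment steps rely on are actually present. The sketch below checks for `app/main.py`, `requirements.txt`, and `README.md`, all of which are named in this notebook; the helper name `predeploy_check` is hypothetical, and the check says nothing about whether the app itself runs.

```python
from pathlib import Path


def predeploy_check(repo_root=".", required=("app/main.py", "requirements.txt", "README.md")):
    """Return the list of required files that are missing under repo_root."""
    root = Path(repo_root)
    return [p for p in required if not (root / p).is_file()]


# Usage from the repository root:
#   missing = predeploy_check()
#   print("Missing files:", missing or "none")
```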
