# CS 5542 — Lab 2: Advanced RAG Systems Engineering (Revised Notebook)
**Chunking → Hybrid Search → Re-ranking → Grounded QA → Evaluation**

**Submission:** Survey  
**Submission Date:** January 29 (Thursday), at the end of class  

## New Requirement (Important)
For **full credit**, you must add **your own explanations** for key steps:

- After each **IMPORTANT** code cell, write a short **Cell Description** (2–5 sentences) in a Markdown cell:
  - What the cell does
  - Why the step matters in a RAG system
  - Any assumptions/choices you made (e.g., chunk size, α, embedding model)

> Tip: Treat your descriptions like “mini system documentation.” This is how engineers communicate system design.


## Project Dataset Guide (Required for Full Credit)

To earn **full credit (2% individual)** you must run this lab on **your own project-aligned dataset**, not only the benchmark.

### Minimum project dataset requirements
- **3–20 documents** (start small; you can scale later)
- Prefer **plain text** documents (`.txt`) for Lab 2
- Total size: **at least ~3–10 pages** of content across all files

### Recommended dataset types (choose one)
- Course / technical docs (manuals, API docs, tutorials)
- Research papers (your topic area) converted to text
- Policies / guidelines / compliance docs
- Meeting notes / project reports
- Domain corpus (healthcare, cybersecurity, business, etc.)

### Folder structure (required)
Create a folder named `project_data/` and put files inside:
- `project_data/doc1.txt`
- `project_data/doc2.txt`
- ...

> If you have PDFs, convert them to text first (instructions below).


In [27]:
# ✅ IMPORTANT: Create a project_data folder and add your files
import os, glob

PROJECT_FOLDER = "project_data"
os.makedirs(PROJECT_FOLDER, exist_ok=True)

print("✅ Folder ready:", PROJECT_FOLDER)
print("Put 3–20 .txt files into ./project_data/")
print("Currently found:", len(glob.glob(os.path.join(PROJECT_FOLDER, "*.txt"))), "txt files")


✅ Folder ready: project_data
Put 3–20 .txt files into ./project_data/
Currently found: 7 txt files


### If you are using Google Colab (Upload files)

**Option A — Upload manually**
1. Click the **Files** icon (left sidebar)
2. Click **Upload**
3. Upload your `.txt` files
4. Move them into `project_data/` (or upload directly into that folder)
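
Step 4 can also be done in code. This is a minimal sketch, assuming the uploaded files landed in the current working directory. `move_txt_files` is a hypothetical helper, not part of the lab template.

```python
import glob
import os
import shutil

def move_txt_files(src_dir: str, dest_dir: str) -> list:
    """Move every .txt file from src_dir into dest_dir; return the new paths."""
    os.makedirs(dest_dir, exist_ok=True)
    moved = []
    for path in sorted(glob.glob(os.path.join(src_dir, "*.txt"))):
        target = os.path.join(dest_dir, os.path.basename(path))
        shutil.move(path, target)
        moved.append(target)
    return moved

# Example (uncomment after uploading files):
# print(move_txt_files(".", "project_data"))
```

Only files directly inside `src_dir` are moved; other file types and subfolders are left in place.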

### Project Queries + Mini Rubric (Required)

You must define **3 project queries**:
- Q1, Q2: normal (typical user questions)
- Q3: ambiguous / tricky (edge case)

Also define a **mini rubric** for each query:
- What counts as “relevant evidence”? (keywords, entities, definitions, constraints)
- What would a correct answer look like? (1–2 bullet points)

This rubric makes your evaluation meaningful (Precision@K / Recall@K).


In [28]:
# ✅ REQUIRED: Define your project queries and mini rubric
project_queries = {
    "Q1": {
        "query": "Who is credited with discovering coffee in the Ethiopian origin legend?",
        "rubric_relevant_evidence": [
            "Must mention 'Kaldi' the goat herder",
            "Must mention Ethiopia or goats/berries",
            "Should mention the abbot/monastery",
        ],
        "rubric_correct_answer": [
            "Kaldi, an Ethiopian goat herder, noticed his goats became energetic after eating berries.",
            "He reported it to an abbot who made a drink from the berries.",
        ],
    },
    "Q2": {
        "query": "How did coffee cultivation spread from France to the Americas?",
        "rubric_relevant_evidence": [
            "Must mention Gabriel de Clieu",
            "Must mention Martinique",
            "Should mention the seedling from King Louis XIV/Paris",
        ],
        "rubric_correct_answer": [
            "Gabriel de Clieu brought a seedling from Paris to Martinique in 1723.",
            "This single plant is the ancestor of most coffee trees in the Caribbean and Americas.",
        ],
    },
    "Q3_ambiguous": {
        "query": "Was coffee ever illegal or banned in Europe?",
        "rubric_relevant_evidence": [
            "Mention King Charles II of England (1675)",
            "Mention the 'bitter invention of Satan' / Pope Clement VIII (though he approved it)",
            "Distinguish between 'attempted bans' and permanent illegality",
        ],
        "rubric_correct_answer": [
            "There were attempts to ban it (e.g., King Charles II in 1675), but they were short-lived due to public outcry.",
            "Some clergy initially opposed it, but the Pope eventually blessed it.",
        ],
    },
}

project_queries

{'Q1': {'query': 'Who is credited with discovering coffee in the Ethiopian origin legend?',
  'rubric_relevant_evidence': ["Must mention 'Kaldi' the goat herder",
   'Must mention Ethiopia or goats/berries',
   'Should mention the abbot/monastery'],
  'rubric_correct_answer': ['Kaldi, an Ethiopian goat herder, noticed his goats became energetic after eating berries.',
   'He reported it to an abbot who made a drink from the berries.']},
 'Q2': {'query': 'How did coffee cultivation spread from France to the Americas?',
  'rubric_relevant_evidence': ['Must mention Gabriel de Clieu',
   'Must mention Martinique',
   'Should mention the seedling from King Louis XIV/Paris'],
  'rubric_correct_answer': ['Gabriel de Clieu brought a seedling from Paris to Martinique in 1723.',
   'This single plant is the ancestor of most coffee trees in the Caribbean and Americas.']},
 'Q3_ambiguous': {'query': 'Was coffee ever illegal or banned in Europe?',
  'rubric_relevant_evidence': ['Mention King Charle

I chose a dataset on the history of coffee because it is a historical domain with clear narrative facts mixed with some myths. The files cover coffee's origins in Ethiopia, its spread to Europe and the rise of coffeehouse culture, and its eventual cultivation in the Americas. Each query tests a specific aspect: Q1 tests basic fact retrieval (Kaldi), Q2 tests tracing a specific journey (de Clieu), and Q3 tests an ambiguous question about legal status, since bans were often temporary or only attempted.


## 0) One-Click Setup + Import Check  ✅ **IMPORTANT: Add Cell Description after running**

In [29]:
# CS 5542 Lab 2 — One-Click Dependency Install
# If your imports fail after installing, restart the runtime/kernel and rerun this cell.

!pip install -q sentence-transformers faiss-cpu chromadb datasets transformers scikit-learn rank-bm25

import os, glob, re
import numpy as np
import pandas as pd
from typing import List, Tuple, Dict, Set

from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from rank_bm25 import BM25Okapi

from sentence_transformers import SentenceTransformer
import faiss

from transformers import pipeline

print("✅ Setup complete. If you see dependency warnings, ignore unless imports fail.")




This cell installs the Python libraries the RAG system needs, including `sentence-transformers` for embeddings, `faiss-cpu` for vector indexing and search, and `rank-bm25` for keyword retrieval, and then imports the required modules (such as `pandas` and `numpy`). A kernel restart is sometimes needed because a running runtime may not pick up newly installed packages until it is re-initialized.
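
To confirm the environment before running the heavier cells, the import check can be made explicit. This is a minimal sketch using only the standard library. `REQUIRED_MODULES` and `missing_modules` are hypothetical names, and the list covers only the modules the setup cell imports.

```python
import importlib.util

# Import names differ from pip package names for some libraries
# (pip "faiss-cpu" -> import "faiss", pip "scikit-learn" -> import "sklearn").
REQUIRED_MODULES = [
    "sentence_transformers", "faiss", "datasets",
    "transformers", "sklearn", "rank_bm25", "numpy", "pandas",
]

def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_modules(REQUIRED_MODULES)
print("Missing:", missing if missing else "none")
```

If the list is non-empty right after a `pip install`, restarting the runtime and rerunning the check is the first thing to try.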


## 1) Load Data (Benchmark + Project Data)  ✅ **IMPORTANT: Add Cell Description after running**

In [30]:
# Benchmark Loader (classroom-safe fallback; avoids script-based datasets)
def load_benchmark(n: int = 120) -> List[str]:
    # 1) Try a script-free SciFact source
    try:
        print("Trying allenai/scifact...")
        ds = load_dataset("allenai/scifact", split=f"train[:{n}]")
        sample = ds[0]
        if "claim" in sample:
            return [x["claim"] for x in ds]
        if "text" in sample:
            return [x["text"] for x in ds]
        raise RuntimeError("Unknown SciFact schema.")
    except Exception as e:
        print("⚠️ allenai/scifact failed:", str(e))

    # 2) Try multi_news
    try:
        print("Trying multi_news...")
        ds = load_dataset("multi_news", split=f"train[:{n}]")
        return [x["document"] for x in ds]
    except Exception as e:
        print("⚠️ multi_news failed:", str(e))

    # 3) Fallback: ag_news (very stable)
    print("Using ag_news fallback...")
    ds = load_dataset("ag_news", split=f"train[:{n}]")
    return [x["text"] for x in ds]

# Load benchmark docs
benchmark_docs = load_benchmark(n=120)
print(f"Loaded benchmark docs: {len(benchmark_docs)}")

# Load project-aligned docs from ./project_data/*.txt
PROJECT_FOLDER = "project_data"
project_files = sorted(glob.glob(os.path.join(PROJECT_FOLDER, "*.txt")))
project_docs = []
for fp in project_files:
    with open(fp, "r", encoding="utf-8", errors="ignore") as f:
        project_docs.append(f.read())

print(f"Loaded project docs: {len(project_docs)}")
if len(project_docs) == 0:
    print("⚠️ Add 3–20 .txt files under ./project_data/ to earn full credit.")


Trying allenai/scifact...
⚠️ allenai/scifact failed: Dataset scripts are no longer supported, but found scifact.py
Trying multi_news...
⚠️ multi_news failed: Dataset scripts are no longer supported, but found multi_news.py
Using ag_news fallback...
Loaded benchmark docs: 120
Loaded project docs: 7


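This cell loads two datasets: a benchmark dataset (SciFact first, then MultiNews, then AG News as fallbacks) to confirm baseline functionality, and the custom project-aligned dataset from the `project_data` folder. In this run both SciFact and MultiNews failed to load because script-based datasets are no longer supported, so the AG News fallback supplied the 120 benchmark documents; 7 project documents were loaded. Project-aligned data is required to show that the RAG system handles specific domain knowledge rather than only generic benchmarks.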


## 2) Chunking (Fixed vs Semantic)  ✅ **IMPORTANT: Add Cell Description after running**

In [31]:
# --- Chunking functions ---
def fixed_chunks(text: str, size: int = 1200, overlap: int = 200) -> List[str]:
    """Character-based fixed window chunking (fast and reliable in class)."""
    text = text.strip()
    if not text:
        return []
    chunks = []
    step = max(1, size - overlap)
    for i in range(0, len(text), step):
        c = text[i:i+size].strip()
        if len(c) > 50:
            chunks.append(c)
    return chunks

def semantic_chunks(text: str) -> List[str]:
    """Paragraph-based semantic chunking; merges short segments to keep context."""
    paras = [p.strip() for p in re.split(r"\n\s*\n+", text) if p.strip()]
    merged, buf = [], ""
    for p in paras:
        if len(buf) < 400:
            buf = (buf + "\n\n" + p).strip()
        else:
            merged.append(buf); buf = p
    if buf:
        merged.append(buf)
    return [m for m in merged if len(m) > 80]

def build_corpus(docs: List[str], mode: str) -> List[str]:
    all_chunks = []
    for d in docs:
        if mode == "fixed":
            all_chunks.extend(fixed_chunks(d))
        elif mode == "semantic":
            all_chunks.extend(semantic_chunks(d))
        else:
            raise ValueError("mode must be 'fixed' or 'semantic'")
    return all_chunks

# Build both corpora and choose one to use in retrieval
all_docs = benchmark_docs + project_docs
fixed_corpus = build_corpus(all_docs, mode="fixed")
semantic_corpus = build_corpus(all_docs, mode="semantic")

print("Fixed corpus chunks:", len(fixed_corpus))
print("Semantic corpus chunks:", len(semantic_corpus))

# Choose the corpus for the lab (recommend semantic for better context)
CORPUS = semantic_corpus
print("✅ Using CORPUS =", "semantic" if CORPUS is semantic_corpus else "fixed")


Fixed corpus chunks: 134
Semantic corpus chunks: 138
✅ Using CORPUS = semantic


This cell defines and applies two chunking strategies: 'fixed' (1,200-character windows with a 200-character overlap) and 'semantic' (splitting on blank lines between paragraphs and merging short paragraphs until the buffer reaches about 400 characters). Semantic chunking is generally preferred for RAG because it keeps complete thoughts together, whereas fixed chunking can split a sentence in half and blur what the embedding represents. Here the two strategies produce a similar number of chunks (134 fixed vs. 138 semantic), and we select `semantic_corpus` for the retrieval steps.
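
The window arithmetic behind fixed chunking is easy to check by hand. The sketch below lists the character offsets each window would cover for the default `size=1200` and `overlap=200`. `chunk_spans` is a hypothetical helper that mirrors the stepping logic of `fixed_chunks` but omits its stripping and minimum-length filter.

```python
def chunk_spans(n_chars: int, size: int = 1200, overlap: int = 200):
    """Return (start, end) character offsets of each fixed window."""
    step = max(1, size - overlap)  # each window starts `step` chars after the last
    return [(i, min(i + size, n_chars)) for i in range(0, n_chars, step)]

print(chunk_spans(3000))  # [(0, 1200), (1000, 2200), (2000, 3000)]
```

One consequence of this stepping is that the last window can be a short tail that the previous window already covers: for a 2,100-character document the final span is `(2000, 2100)`.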


## 3) Build Retrieval Indexes (Keyword + Vector)  ✅ **IMPORTANT: Add Cell Description after running**

In [32]:
# --- Keyword Retrieval (TF-IDF + BM25) ---
def tokenize(s: str) -> List[str]:
    return re.findall(r"[A-Za-z0-9]+", s.lower())

tfidf = TfidfVectorizer(stop_words="english", max_features=50000)
tfidf_matrix = tfidf.fit_transform(CORPUS)

def keyword_tfidf(query: str, k: int = 10) -> List[Tuple[int, float]]:
    q_vec = tfidf.transform([query])
    scores = (tfidf_matrix @ q_vec.T).toarray().squeeze()
    top = np.argsort(scores)[-k:][::-1]
    return [(int(i), float(scores[i])) for i in top]

bm25 = BM25Okapi([tokenize(x) for x in CORPUS])

def keyword_bm25(query: str, k: int = 10) -> List[Tuple[int, float]]:
    scores = bm25.get_scores(tokenize(query))
    top = np.argsort(scores)[-k:][::-1]
    return [(int(i), float(scores[i])) for i in top]

# --- Vector Retrieval (SentenceTransformer + FAISS) ---
embed_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformer(embed_model_name)

embeddings = embedder.encode(CORPUS, show_progress_bar=True, convert_to_numpy=True, normalize_embeddings=True)
dim = embeddings.shape[1]
faiss_index = faiss.IndexFlatIP(dim)  # cosine via normalized vectors + inner product
faiss_index.add(embeddings)

def vector_search(query: str, k: int = 10) -> List[Tuple[int, float]]:
    q = embedder.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    scores, idx = faiss_index.search(q, k)
    return [(int(i), float(s)) for i, s in zip(idx[0], scores[0])]

print("✅ Retrieval engines ready: TF-IDF, BM25, Vector(FAISS)")


✅ Retrieval engines ready: TF-IDF, BM25, Vector(FAISS)


This cell builds three retrievers over the same corpus: two sparse keyword retrievers (TF-IDF and BM25) and one dense vector index (FAISS over `all-MiniLM-L6-v2` SentenceTransformer embeddings). Because the embeddings are L2-normalized, the FAISS inner-product index returns cosine similarity. Keyword retrieval is best for exact matches and specific terminology, while vector retrieval captures semantic meaning and context even when the exact keywords are missing.
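
The comment in the code, "cosine via normalized vectors + inner product", rests on a small identity: once two vectors are scaled to unit length, their dot product equals their cosine similarity. This is a pure-Python sketch of that identity, assuming non-zero vectors. The helper names are illustrative and not part of the lab template.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
# Both expressions give the same value up to floating-point rounding.
print(cosine(a, b), dot(l2_normalize(a), l2_normalize(b)))
```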


## 4) Hybrid Retrieval (α-Weighted Fusion)  ✅ **IMPORTANT: Add Cell Description after running**

In [33]:
def normalize_scores(pairs: List[Tuple[int, float]]) -> Dict[int, float]:
    if not pairs:
        return {}
    vals = np.array([s for _, s in pairs], dtype=float)
    vmin, vmax = vals.min(), vals.max()
    if vmax - vmin < 1e-9:
        return {i: 1.0 for i, _ in pairs}
    return {i: (s - vmin) / (vmax - vmin) for i, s in pairs}

def hybrid_search(query: str, k_keyword: int = 10, k_vector: int = 10, alpha: float = 0.5,
                  top_k: int = 10, keyword_mode: str = "bm25") -> List[Tuple[int, float]]:
    kw = keyword_bm25(query, k=k_keyword) if keyword_mode == "bm25" else keyword_tfidf(query, k=k_keyword)
    vec = vector_search(query, k=k_vector)

    kw_n = normalize_scores(kw)
    vec_n = normalize_scores(vec)

    all_ids = set(kw_n) | set(vec_n)
    combined = []
    for i in all_ids:
        score = alpha * kw_n.get(i, 0.0) + (1 - alpha) * vec_n.get(i, 0.0)
        combined.append((i, float(score)))

    combined.sort(key=lambda x: x[1], reverse=True)
    return combined[:top_k]

print("✅ Hybrid retrieval ready. You'll sweep alpha ∈ {0.2, 0.5, 0.8}.")


✅ Hybrid retrieval ready. You'll sweep alpha ∈ {0.2, 0.5, 0.8}.


This cell implements hybrid retrieval by combining scores from the keyword retriever (BM25 by default) and the vector retriever (FAISS) as a weighted sum. The parameter `alpha` (α) controls the balance: α close to 1.0 favors keyword matches, while α close to 0.0 favors semantic vector matches. Each top-10 list is min-max normalized before fusion because BM25 and cosine similarity scores are on different scales; a chunk that appears in only one list contributes 0 for the other.
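
A small worked example makes the fusion arithmetic concrete. The sketch below re-implements the same min-max normalization and weighted sum in plain Python. The chunk ids and scores are made up for illustration, and `min_max` and `fuse` are hypothetical names.

```python
def min_max(pairs):
    """Map (id, score) pairs to {id: score scaled into [0, 1]}."""
    if not pairs:
        return {}
    scores = [s for _, s in pairs]
    lo, hi = min(scores), max(scores)
    if hi - lo < 1e-9:  # all scores equal: treat every item as a top hit
        return {i: 1.0 for i, _ in pairs}
    return {i: (s - lo) / (hi - lo) for i, s in pairs}

def fuse(kw, vec, alpha):
    """Weighted sum of normalized keyword and vector scores, best first."""
    kw_n, vec_n = min_max(kw), min_max(vec)
    fused = {i: alpha * kw_n.get(i, 0.0) + (1 - alpha) * vec_n.get(i, 0.0)
             for i in set(kw_n) | set(vec_n)}
    return sorted(fused.items(), key=lambda t: t[1], reverse=True)

kw = [(1, 12.0), (2, 6.0), (3, 0.0)]     # toy BM25-style scores
vec = [(2, 0.75), (4, 0.5), (1, 0.25)]   # toy cosine-style scores
print(fuse(kw, vec, alpha=0.5))  # [(2, 0.75), (1, 0.5), (4, 0.25), (3, 0.0)]
```

Note that chunk 1 is the lowest-scoring item in the vector list, so min-max scaling gives it a vector score of 0, the same contribution as a chunk that is missing from that list entirely.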


## 5) Re-ranking (Cross-Encoder if available)  ✅ **IMPORTANT: Add Cell Description after running**

In [34]:
USE_CROSS_ENCODER = True
reranker = None

if USE_CROSS_ENCODER:
    try:
        from sentence_transformers import CrossEncoder
        reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        print("✅ Cross-encoder reranker loaded.")
    except Exception as e:
        print("⚠️ Cross-encoder not available. Falling back to no reranking.")
        print("Error:", e)
        reranker = None

def rerank(query: str, candidates: List[Tuple[int, float]], top_k: int = 5) -> List[Tuple[int, float]]:
    ids = [i for i, _ in candidates]
    if reranker is None:
        return candidates[:top_k]
    pairs = [(query, CORPUS[i]) for i in ids]
    scores = reranker.predict(pairs)
    scored = list(zip(ids, scores))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [(int(i), float(s)) for i, s in scored[:top_k]]

print("✅ Reranking function ready.")


✅ Cross-encoder reranker loaded.
✅ Reranking function ready.


This cell uses a cross-encoder model (`cross-encoder/ms-marco-MiniLM-L-6-v2`) to re-score the candidates returned by hybrid search. Unlike the bi-encoder used for initial retrieval, which compares pre-computed vectors, the cross-encoder processes each query-chunk pair together, which typically yields a more accurate relevance score. This can improve Precision@K but is computationally expensive, so it is applied only to the candidate pool (up to 20 chunks here) rather than to the whole corpus. If the model cannot be loaded, `rerank` falls back to returning the hybrid order unchanged.
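
The re-ranking pattern itself does not depend on the model: score each (query, chunk) pair jointly, then sort. The sketch below shows that pattern with a toy word-overlap scorer standing in for the cross-encoder. `overlap_score` and `rerank_with` are hypothetical names, and the scorer is far weaker than a real cross-encoder.

```python
def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0

def rerank_with(score_fn, query, candidate_ids, corpus, top_k=5):
    """Re-score each (query, passage) pair jointly and keep the top_k ids."""
    scored = [(i, score_fn(query, corpus[i])) for i in candidate_ids]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:top_k]

toy_corpus = [
    "coffee houses spread across europe",
    "kaldi the goat herder noticed his goats",
    "the goat herder legend comes from ethiopia",
]
print(rerank_with(overlap_score, "goat herder legend", [0, 1, 2], toy_corpus, top_k=2))
```

Because the scorer is passed in as a function, the same wrapper could take a function built on `reranker.predict` once the model is loaded.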


## 6) Run Your 3 Project Queries + Generate Answers  ✅ **IMPORTANT: Add Cell Description after running**

In [35]:
# Generator (small + class-friendly)
gen = pipeline("text2text-generation", model="google/flan-t5-base")

def prompt_only_answer(query: str, max_new_tokens: int = 200) -> str:
    return gen(query, max_new_tokens=max_new_tokens)[0]["generated_text"]

def rag_answer(query: str, chunk_ids: List[int], max_new_tokens: int = 220) -> str:
    evidence = "\n\n".join([f"[Chunk {j+1}] {CORPUS[i]}" for j, i in enumerate(chunk_ids)])
    prompt = f"""Answer the question using ONLY the evidence below.

Evidence:
{evidence}

Question:
{query}

Rules:
- If evidence is insufficient, say: Not enough evidence.
- Cite evidence with [Chunk 1], [Chunk 2], etc.
"""
    return gen(prompt, max_new_tokens=max_new_tokens)[0]["generated_text"]

def show_top(pairs: List[Tuple[int, float]], title: str, k: int = 5):
    print(f"\n=== {title} (Top {k}) ===")
    for r, (i, s) in enumerate(pairs[:k], 1):
        snip = CORPUS[i].replace("\n", " ")
        snip = snip[:220] + ("..." if len(snip) > 220 else "")
        print(f"{r:>2}. id={i:<6} score={s:>8.4f} | {snip}")

# ✅ REQUIRED: Replace with your project queries
queries = [
    "Q1: " + project_queries["Q1"]["query"],
    "Q2: " + project_queries["Q2"]["query"],
    "Q3 (ambiguous): " + project_queries["Q3_ambiguous"]["query"],
]

alphas = [0.2, 0.5, 0.8]
results_summary = []

for q in queries:
    print("\n" + "="*90)
    print(q)

    kw = keyword_bm25(q, k=10)
    vec = vector_search(q, k=10)
    show_top(kw, "BM25 Keyword")
    show_top(vec, "Vector (FAISS)")

    hybrids = []
    for a in alphas:
        hyb = hybrid_search(q, alpha=a, top_k=10, keyword_mode="bm25")
        hybrids.append((a, hyb))
        show_top(hyb, f"Hybrid (alpha={a})")

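    # Heuristic: keep the alpha whose top-10 fused scores have the highest mean.
    # This compares normalized scores across alphas; it does not measure relevance.
    best_a, _ = max(hybrids, key=lambda t: np.mean([s for _, s in t[1]]) if t[1] else -1)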
    print(f"\nSelected hybrid alpha={best_a}")

    candidate_pool = hybrid_search(q, alpha=best_a, top_k=20, keyword_mode="bm25")
    reranked = rerank(q, candidate_pool, top_k=5)
    show_top(reranked, "Re-ranked")

    top3_ids = [i for i, _ in reranked[:3]]
    print("\nTop-3 evidence chunk IDs:", top3_ids)

    po = prompt_only_answer(q)
    ra = rag_answer(q, top3_ids)

    print("\n--- Prompt-only answer ---\n", po)
    print("\n--- RAG-grounded answer ---\n", ra)

    results_summary.append({
        "query": q,
        "best_alpha": best_a,
        "top3_chunk_ids": top3_ids,
        "prompt_only": po,
        "rag": ra,
    })

results_summary[:1]


Device set to use cpu



Q1: Who is credited with discovering coffee in the Ethiopian origin legend?

=== BM25 Keyword (Top 5) ===
 1. id=131    score= 16.8778 | The Legend of Kaldi The most popular origin story of coffee traces back to a goat herder named Kaldi, who lived in the Ethiopian plateau around the 9th century (though written accounts appear much later). Legend says Kal...
 2. id=128    score= 10.7598 | The Waves of Coffee The history of modern coffee consumption is often described in "waves."  First Wave (1800s - 1960s) The First Wave was characterized by the mass production and consumption of coffee. Brands like Folge...
 3. id=121    score=  9.8922 | Once planted, the seedling not only thrived, but it's credited with the spread of over 18 million coffee trees on the island of Martinique in the next 50 years. Even more incredible is that this seedling was the parent o...
 4. id=134    score=  8.7120 | Coffee in the Ottoman Empire By the 16th century, coffee had become an integral part of Ottoman c

Token indices sequence length is longer than the specified maximum sequence length for this model (520 > 512). Running this sequence through the model will result in indexing errors



--- Prompt-only answer ---
 banned in Europe

--- RAG-grounded answer ---
 Cite evidence with [Chunk 1], [Chunk 2], etc.


[{'query': 'Q1: Who is credited with discovering coffee in the Ethiopian origin legend?',
  'best_alpha': 0.2,
  'top3_chunk_ids': [131, 132, 121],
  'prompt_only': 'adolf hitler',
  'rag': 'Cite evidence with [Chunk 1], [Chunk 2], etc.'}]

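This cell runs the full RAG pipeline on the three project queries. It retrieves candidates with BM25 and vector search, fuses them at α ∈ {0.2, 0.5, 0.8}, keeps the α whose fused scores have the highest mean, re-ranks the candidate pool, and generates an answer with FLAN-T5 from the top 3 chunks. We compare the 'Prompt-only' answer (the model's internal knowledge) with the 'RAG-grounded' answer (using retrieved chunks) to check whether the response is actually grounded in the evidence. In this run it was not: the prompt-only answer for Q1 was wrong ('adolf hitler'), and the RAG answer only echoed the citation rule from the prompt. The tokenizer warning (520 > 512) shows that at least one prompt exceeded the model's maximum sequence length, which points to a generation-layer failure rather than a retrieval failure, since relevant chunks (131, 132) were in the top 3.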
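
One possible fix for the sequence-length warning in the output above is to cap the evidence before building the prompt. This is a minimal sketch, assuming whitespace-separated words are an acceptable rough proxy for tokens. The real tokenizer splits words into subword pieces, so the budget needs headroom below 512. `build_prompt` and the default of 350 words are assumptions, not part of the lab template.

```python
def build_prompt(query, chunks, max_words=350):
    """Build a grounded prompt whose total length fits a rough word budget."""
    header = (
        "Answer the question using ONLY the evidence below. "
        "If evidence is insufficient, say: Not enough evidence.\n\n"
        f"Question: {query}\n\nEvidence:\n"
    )
    remaining = max_words - len(header.split())
    parts = []
    for j, chunk in enumerate(chunks, 1):
        label = f"[Chunk {j}]"
        label_words = len(label.split())  # "[Chunk" and "1]" count as two words
        words = chunk.split()[: max(0, remaining - label_words)]
        if not words:
            break  # budget exhausted: drop the remaining chunks
        parts.append(label + " " + " ".join(words))
        remaining -= len(words) + label_words
    return header + "\n\n".join(parts)

demo = build_prompt("Who discovered coffee?", ["kaldi " * 400, "abbot " * 400], max_words=100)
print(len(demo.split()))
```

The sketch also places the question and rules before the evidence, a design choice meant to keep a long evidence block from crowding out the instructions; whether that helps FLAN-T5 in practice would need to be tested.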


## 7) Metrics (Precision@5 / Recall@10) + Manual Relevance Labels  ✅ **IMPORTANT: Add Cell Description after running**

In [36]:
def precision_at_k(retrieved: List[int], relevant: Set[int], k: int = 5) -> float:
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for i in top if i in relevant) / len(top)

def recall_at_k(retrieved: List[int], relevant: Set[int], k: int = 10) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# Helper to find chunk IDs containing specific phrases (Auto-labeling)
def find_ids_by_phrase(phrases: List[str]) -> Set[int]:
    ids = set()
    for i, chunk in enumerate(CORPUS):
        # Case-insensitive substring match on ANY phrase
        # (note: a short phrase like "ban" also matches inside "bank" or "urban")
        for p in phrases:
            if p.lower() in chunk.lower():
                ids.add(i)
    return ids

# ✅ REQUIRED: Label a small set of relevant chunk IDs for each query
relevance_labels = {
    queries[0]: find_ids_by_phrase(["Kaldi", "goat", "Ethiopia"]),         # Q1: Broader search for origin story
    queries[1]: find_ids_by_phrase(["Gabriel de Clieu", "Martinique", "seedling"]), # Q2: Broader search for Americas
    queries[2]: find_ids_by_phrase(["King Charles", "ban", "illegal", "Pope"]),     # Q3: Broader search for bans
}

# Print them to verify
for q, ids in relevance_labels.items():
    print(f"Query: {q[:50]}... | Relevant Chunk IDs: {ids}")


Query: Q1: Who is credited with discovering coffee in the... | Relevant Chunk IDs: {132, 123, 131}
Query: Q2: How did coffee cultivation spread from France ... | Relevant Chunk IDs: {120, 121}
Query: Q3 (ambiguous): Was coffee ever illegal or banned ... | Relevant Chunk IDs: {0, 64, 65, 135, 40, 9, 137, 44, 125, 23, 56, 25, 93, 62, 127}


### ✍️ Cell Description (Student)
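Precision@K measures the fraction of the top-K retrieved items that are relevant, while Recall@K measures the fraction of all relevant items that were retrieved. I defined relevance automatically by searching for key phrases (e.g., "Kaldi", "goat", "Gabriel de Clieu", "seedling", "ban") tied to each query's rubric, and I broadened the search terms to capture multiple relevant chunks per query. Because matching is by substring, the labels can be over-inclusive: the Q3 set contains 15 chunk IDs, many of them low IDs that appear to come from the benchmark corpus rather than the coffee documents, so these labels should be spot-checked manually before the metrics are trusted.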
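
Substring matching treats "ban" as present in "bank" and "urban". A stricter variant matches phrases only at word boundaries. This is a sketch of that idea using the standard `re` module; `find_ids_by_word` is a hypothetical alternative to `find_ids_by_phrase`, and the toy corpus is invented for illustration.

```python
import re

def find_ids_by_word(phrases, corpus):
    """Return ids of chunks containing any phrase as a whole word or phrase."""
    patterns = [re.compile(r"\b" + re.escape(p) + r"\b", re.IGNORECASE) for p in phrases]
    return {i for i, chunk in enumerate(corpus) if any(p.search(chunk) for p in patterns)}

toy_corpus = [
    "The bank raised interest rates in the urban district.",
    "King Charles II tried to ban coffee houses in 1675.",
]
print(find_ids_by_word(["ban"], toy_corpus))                        # {1}
print({i for i, c in enumerate(toy_corpus) if "ban" in c.lower()})  # {0, 1}
```

The trade-off is that whole-word matching misses inflections such as "banned" or "bans", so those forms would need to be listed explicitly.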

In [37]:
def evaluate_query(q: str, relevant: Set[int], alpha: float):
    kw_ids = [i for i, _ in keyword_bm25(q, k=10)]
    vec_ids = [i for i, _ in vector_search(q, k=10)]
    hyb_ids = [i for i, _ in hybrid_search(q, alpha=alpha, top_k=10, keyword_mode="bm25")]
    return {
        "P@5_keyword": precision_at_k(kw_ids, relevant, k=5),
        "R@10_keyword": recall_at_k(kw_ids, relevant, k=10),
        "P@5_vector": precision_at_k(vec_ids, relevant, k=5),
        "R@10_vector": recall_at_k(vec_ids, relevant, k=10),
        "P@5_hybrid": precision_at_k(hyb_ids, relevant, k=5),
        "R@10_hybrid": recall_at_k(hyb_ids, relevant, k=10),
    }

metrics_rows = []
for row in results_summary:
    q = row["query"]
    alpha = row["best_alpha"]
    rel = relevance_labels.get(q, set())
    m = evaluate_query(q, rel, alpha)
    m.update({"query": q, "alpha_used": alpha, "num_relevant_labeled": len(rel)})
    metrics_rows.append(m)

metrics_df = pd.DataFrame(metrics_rows)
metrics_df


| | P@5_keyword | R@10_keyword | P@5_vector | R@10_vector | P@5_hybrid | R@10_hybrid | query | alpha_used | num_relevant_labeled |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.2 | 1.0 | 0.4 | 1.0 | 0.2 | 1.0 | Q1: Who is credited with discovering coffee in... | 0.2 | 3 |
| 1 | 0.4 | 1.0 | 0.4 | 1.0 | 0.4 | 1.0 | Q2: How did coffee cultivation spread from Fra... | 0.8 | 2 |
| 2 | 0.6 | 0.2 | 0.8 | 0.266667 | 0.6 | 0.266667 | Q3 (ambiguous): Was coffee ever illegal or ban... | 0.8 | 15 |
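
When reading these numbers, it helps to remember how the number of labeled chunks bounds each metric. The sketch below repeats the two metric definitions in plain Python and works through two cases. The ranked list is hypothetical (only ids 120 and 121 come from the Q2 labels) and is not the system's actual output.

```python
def precision_at_k(retrieved, relevant, k=5):
    top = retrieved[:k]
    return sum(1 for i in top if i in relevant) / len(top) if top else 0.0

def recall_at_k(retrieved, relevant, k=10):
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

# Hypothetical ranking: both labeled chunks (120, 121) appear in the top 5.
retrieved = [120, 7, 121, 3, 9, 14, 22, 31, 45, 58]
relevant = {120, 121}
print(precision_at_k(retrieved, relevant))  # 0.4
print(recall_at_k(retrieved, relevant))     # 1.0

# With 15 labeled chunks and k=10, even a perfect top 10 cannot reach recall 1.0.
many_relevant = set(range(15))
perfect_top10 = list(range(10))
print(recall_at_k(perfect_top10, many_relevant))
```

With only two labeled chunks, Precision@5 tops out at 0.4; with 15 labeled chunks, Recall@10 tops out at 10/15. Both ceilings are worth keeping in mind when comparing rows with different label counts.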


## 8) README Checklist (Deliverables)

Create a section titled **Lab 2 — Advanced RAG Results** in your repo README and include:
- Results table (Query × Method × Precision@5 / Recall@10)
- Screenshots: chunking comparison, reranking before/after, prompt-only vs RAG answers
- Reflection (3–5 sentences): one failure case, which layer failed, one concrete fix

### Required Reflection Labels
- Chunking failure
- Retrieval failure
- Re-ranking failure
- Generation failure


## 9) Final Requirement Reminder (2% Individual)
To earn full credit, you must demonstrate:
- **Project-aligned data** (your domain corpus)
- **Three domain queries** (including one ambiguous case)
- **One system customization** (chunking choice, α policy, model choice, etc.)
- **One real failure case + fix**
