# ROBUST04 Ranking Pipeline

**Architecture (6-Way Hybrid + Neural + LLM Cascade):**
- **Run 1:** BM25(orig) + Dense(orig) + **SPLADE(orig)** + BM25(Q2D) + Dense(Q2D) + **SPLADE(Q2D)** → RRF
- **Run 2:** Run 1 candidates → Bi-Encoder filter → CE+MonoT5 → MaxP
- **Run 3:** Weighted RRF(Run1, Run2, LLM)
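
The RRF fusion used in Run 1 and Run 3 can be sketched as follows. This is a minimal illustration of weighted Reciprocal Rank Fusion, not the code in `src`; the function name `weighted_rrf` and the toy docids are assumptions made for the example, and `k=60` mirrors the `--rrf-k 60` flag used later in this notebook.

```python
from collections import defaultdict

def weighted_rrf(runs, weights=None, k=60):
    """Fuse ranked lists with weighted Reciprocal Rank Fusion.

    runs:    list of rankings, each a list of docids ordered best-first
    weights: one weight per ranking (defaults to 1.0 each)
    k:       smoothing constant (hypothetical default, matching --rrf-k 60)
    """
    if weights is None:
        weights = [1.0] * len(runs)
    fused = defaultdict(float)
    for ranking, w in zip(runs, weights):
        for rank, docid in enumerate(ranking, start=1):
            fused[docid] += w / (k + rank)
    # Best fused score first; ties broken by docid for determinism
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy rankings with made-up docids
bm25 = ["d1", "d2", "d3"]
dense = ["d2", "d3", "d1"]
splade = ["d2", "d1", "d4"]
fused = weighted_rrf([bm25, dense, splade])
print([docid for docid, _ in fused])
```

Because RRF only looks at ranks, it can combine BM25, dense, and SPLADE scores without normalizing their very different score scales.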

**Key Optimizations:**
- **SPLADE retrieval** (+24% recall on hard queries like "unexplained highway accidents")
- **Bi-encoder pre-filtering** (3x faster neural reranking)
- Query2Doc expansions (pseudo-documents written in 1990s news vocabulary to match the corpus)
- Dynamic few-shot selection for LLM
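
The idea behind bi-encoder pre-filtering can be sketched as follows: rank candidates by cheap embedding similarity and pass only the top few to the expensive cross-encoder. This is an illustration under assumptions, not the pipeline's code. The `prefilter` helper, the `keep` cutoff, and the 2-D toy vectors are hypothetical, and this notebook does not show how `--bi-encoder-ratio` maps to the cutoff.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def prefilter(query_vec, candidates, keep):
    """Return the `keep` candidate ids most similar to the query embedding."""
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, candidates[c]), reverse=True)
    return ranked[:keep]

# Toy 2-D embeddings (made up); real bi-encoder vectors have hundreds of dimensions
query_vec = [1.0, 0.0]
candidates = {"p1": [0.9, 0.1], "p2": [0.0, 1.0], "p3": [0.7, 0.7], "p4": [-1.0, 0.0]}
survivors = prefilter(query_vec, candidates, keep=2)
print(survivors)  # only these passages would go on to the cross-encoder
```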

**Models:** SPLADE++, BGE-small (dense index), BGE-large (bi-encoder filter), BGE-reranker-v2-m3 + MiniLM-L12 (cross-encoders), MonoT5-large, gpt-4o-mini + gpt-5

## Setup

In [None]:
# Install dependencies (restart the runtime afterwards if you see Pillow errors)
!apt-get update -qq && apt-get install -qq openjdk-21-jdk-headless > /dev/null 2>&1
%pip uninstall -q -y gradio gradio-client 2>/dev/null
%pip install -q pyserini faiss-cpu torch transformers sentence-transformers \
    pytrec_eval langchain-text-splitters tqdm accelerate openai 2>/dev/null
print("✓ Dependencies installed")

In [None]:
# Check GPU
!nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

In [None]:
# Clone repository
%cd /content
!rm -rf text-retrieval-and-search-engines
!git clone https://github.com/er1009/text-retrieval-and-search-engines.git
%cd text-retrieval-and-search-engines/final-project
print("\n✓ Repository cloned")

In [None]:
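# Set OpenAI API key (prompted at runtime so the key is never saved in the notebook)
import os
from getpass import getpass
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")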

In [None]:
# Mount Google Drive and setup dense index path
from google.colab import drive
drive.mount("/content/drive")
INDEX_PATH = "/content/drive/MyDrive/robust04_dense_index"

In [None]:
# Pre-download BM25 index (models downloaded on-demand during pipeline)
from pyserini.search.lucene import LuceneSearcher
searcher = LuceneSearcher.from_prebuilt_index("robust04")
print(f"✓ BM25 index cached: {searcher.num_docs:,} documents")
searcher.close()

In [None]:
# Build or load dense index (~45 min first time, instant after)
# BGE-small: good quality/speed tradeoff; 1500-char chunks typically stay within its 512-token input limit
!python -m src.dense_index \
    --index-path "/content/drive/MyDrive/robust04_dense_index" \
    --embedding-model "BAAI/bge-small-en-v1.5" \
    --chunk-size 1500 \
    --chunk-overlap 200 \
    --batch-size 1024
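
The `--chunk-size 1500` and `--chunk-overlap 200` flags describe overlapping passages. A minimal sketch of fixed-width character chunking is shown below. It is a simplification: the notebook installs `langchain-text-splitters`, so the repository's splitter probably respects sentence or paragraph boundaries, and the `chunk_text` helper here is hypothetical.

```python
def chunk_text(text, chunk_size=1500, overlap=200):
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

# A synthetic 4000-character document
doc = "".join(chr(97 + i % 26) for i in range(4000))
chunks = chunk_text(doc)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one passage, at the cost of indexing some text twice.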

### Optional: Build SPLADE Index (~1.5-2 hours on A100)

SPLADE (Sparse Lexical and Expansion model) learns per-term weights and expands queries and documents with related vocabulary.
This can **improve recall by 3-5%** over BM25 by finding documents that match
semantically but not lexically (e.g., "unexplained accidents" → "mysterious crash").

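**Only run this if you have the compute budget and want to try SPLADE retrieval. The sanity-check, training, and test cells below pass `--splade-index-path`, so remove that flag from them if you skip this step.**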
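
To make the "semantic but not lexical" match concrete, here is a toy sketch of sparse scoring as a dot product over term weights. All weights are invented for illustration; a real SPLADE encoder produces them from the text, and `sparse_dot` is a hypothetical helper rather than part of `src.splade_index`.

```python
def sparse_dot(query_weights, doc_weights):
    """Score a document as the dot product of two sparse term-weight maps."""
    return sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())

# Query "unexplained accidents" after expansion: the last two terms are
# expansion terms that do not occur in the literal query (made-up weights).
query = {"unexplained": 1.2, "accidents": 1.0, "mysterious": 0.6, "crash": 0.8}

doc_a = {"mysterious": 1.0, "crash": 1.5, "highway": 0.7}  # shares no literal query term
doc_b = {"accidents": 0.5, "insurance": 1.1}               # shares one literal term

score_a = sparse_dot(query, doc_a)
score_b = sparse_dot(query, doc_b)
print(f"doc_a={score_a:.2f}  doc_b={score_b:.2f}")
```

In this toy case `doc_a` outscores `doc_b` purely through expansion terms, which is the behavior a plain BM25 match on the original query words cannot reproduce.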

In [None]:
# Build SPLADE index (~1.5-2 hours on A100, ~45 min on H100)
# Uses learned sparse retrieval for better term expansion
# Skip this cell if you don't have compute budget

SPLADE_INDEX_PATH = "/content/drive/MyDrive/robust04_splade_index"

!python -m src.splade_index \
    --index-path "{SPLADE_INDEX_PATH}" \
    --model-name "naver/splade-cocondenser-ensembledistil" \
    --chunk-size 1500 \
    --chunk-overlap 200 \
    --batch-size 64

In [None]:
# Test SPLADE index (quick verification)
from src.splade_index import SpladeIndex

splade = SpladeIndex("/content/drive/MyDrive/robust04_splade_index")
splade.load()

# Test search
results = splade.search("unexplained highway accidents", top_k=10)
print("Top 10 results for 'unexplained highway accidents':")
for docid, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"  {docid}: {score:.4f}")

In [None]:
# Compare retrieval methods: BM25 vs Dense vs SPLADE (optional analysis)
from pyserini.search.lucene import LuceneSearcher
from src.dense_index import DensePassageIndex
from src.splade_index import SpladeIndex
import pytrec_eval

# Load qrels
qrels = {}
with open("Files-20260103/qrels_50_Queries") as f:
    for line in f:
        parts = line.strip().split()
        qid, docid, rel = parts[0], parts[2], int(parts[3])
        if qid not in qrels: qrels[qid] = {}
        qrels[qid][docid] = rel

# Test on one hard query
test_query = "unexplained highway accidents"
test_qid = "315"

# BM25
bm25 = LuceneSearcher.from_prebuilt_index("robust04")
bm25_results = {h.docid: h.score for h in bm25.search(test_query, k=1000)}

# Dense
dense = DensePassageIndex("/content/drive/MyDrive/robust04_dense_index")
dense.load()
dense_results = dense.search(test_query, top_k=1000)

# SPLADE
splade = SpladeIndex("/content/drive/MyDrive/robust04_splade_index")
splade.load()
splade_results = splade.search(test_query, top_k=1000)

# Evaluate
evaluator = pytrec_eval.RelevanceEvaluator({test_qid: qrels[test_qid]}, {"recall_1000", "map"})

print(f"Query {test_qid}: '{test_query}'")
print(f"  Relevant docs: {sum(1 for r in qrels[test_qid].values() if r > 0)}")
print()
for name, results in [("BM25", bm25_results), ("Dense", dense_results), ("SPLADE", splade_results)]:
    # pytrec_eval expects plain Python floats, so cast any NumPy scores first
    e = evaluator.evaluate({test_qid: {d: float(s) for d, s in results.items()}})
    print(f"  {name:8s}: R@1000={e[test_qid]['recall_1000']:.3f}  MAP={e[test_qid]['map']:.4f}")
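
For readers unfamiliar with the two metrics reported above, the sketch below computes recall@k and average precision by hand on a toy ranking. It is meant only to show what the numbers measure; the helper names and docids are made up, and `pytrec_eval` remains the reference implementation.

```python
def recall_at_k(ranked_docids, relevant, k):
    """Fraction of relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for d in ranked_docids[:k] if d in relevant)
    return hits / len(relevant)

def average_precision(ranked_docids, relevant):
    """Mean of precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked_docids, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero
    return total / len(relevant)

ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
r5 = recall_at_k(ranking, relevant, k=5)
ap = average_precision(ranking, relevant)
print(f"R@5={r5:.3f}  AP={ap:.4f}")
```

MAP is the mean of this average precision over all queries, so a retriever with high R@1000 but low MAP finds the relevant documents and ranks them poorly.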

---
## 1. Sanity Check (5 queries)

In [None]:
# Sanity check with SPLADE (~3-5 min with bi-encoder)
# Uses the 6-way hybrid: BM25(orig) + Dense(orig) + SPLADE(orig) + BM25(Q2D) + Dense(Q2D) + SPLADE(Q2D)
!python -m src.main train \
    --output-dir results_sanity \
    --dense-index-path "/content/drive/MyDrive/robust04_dense_index" \
    --splade-index-path "/content/drive/MyDrive/robust04_splade_index" \
    --limit-queries 5 \
    --retrieval-k 2000 \
    --rerank-depth 1000 \
    --chunk-size 512 \
    --chunk-overlap 64 \
    --use-bi-encoder \
    --bi-encoder-model "BAAI/bge-large-en-v1.5" \
    --bi-encoder-ratio 2.0 \
    --ce-model "BAAI/bge-reranker-v2-m3,cross-encoder/ms-marco-MiniLM-L-12-v2" \
    --monot5-model "castorini/monot5-large-msmarco" \
    --ce-batch-size 256 \
    --monot5-batch-size 64 \
    --ce-weight 0.5 \
    --neural-weight 0.8 \
    --rrf-k 60 \
    --rrf-weight-run1 1.0 \
    --rrf-weight-run2 1.0 \
    --rrf-weight-llm 1.0 \
    --llm-model "gpt-4o-mini" \
    --llm-top-k 50 \
    --llm-window-size 10 \
    --llm-step-size 5 \
    --llm-max-passage-length 1500 \
    --llm-concurrency 10 \
    --llm-strong-model "gpt-5" \
    --llm-strong-top-k 20 \
    --llm-dynamic-few-shot
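
The rerankers score passages (`--chunk-size 512`), while the run files rank documents, which is where the MaxP step named in the architecture overview comes in. A minimal sketch, assuming MaxP here means "document score = best passage score"; the `maxp` helper and the docids are hypothetical.

```python
def maxp(passage_scores):
    """Collapse passage-level scores to document level by taking the maximum.

    passage_scores: iterable of (docid, score) pairs, one pair per passage
    """
    doc_scores = {}
    for docid, score in passage_scores:
        if docid not in doc_scores or score > doc_scores[docid]:
            doc_scores[docid] = score
    return doc_scores

# Two made-up documents with two scored passages each
scores = [("DOC-A", 0.2), ("DOC-A", 0.9), ("DOC-B", 0.5), ("DOC-B", 0.4)]
doc_scores = maxp(scores)
print(sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True))
```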

In [None]:
# Per-query analysis (optional - helps debug)
import pytrec_eval

qrels = {}
with open("Files-20260103/qrels_50_Queries") as f:
    for line in f:
        parts = line.strip().split()
        qid, docid, rel = parts[0], parts[2], int(parts[3])
        if qid not in qrels: qrels[qid] = {}
        qrels[qid][docid] = rel

queries = {}
with open("Files-20260103/queriesROBUST.txt") as f:
    for line in f:
        if line.strip():
            parts = line.strip().split('\t')
            if len(parts) >= 2: queries[parts[0]] = parts[1]

def load_run(path):
    run = {}
    with open(path) as f:
        for line in f:
            parts = line.strip().split()
            qid, docid, score = parts[0], parts[2], float(parts[4])
            if qid not in run: run[qid] = {}
            run[qid][docid] = score
    return run

r1 = load_run("results_sanity/run_1.res")
r2 = load_run("results_sanity/run_2.res")
r3 = load_run("results_sanity/run_3.res")

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map', 'recall_1000'})
e1, e2, e3 = evaluator.evaluate(r1), evaluator.evaluate(r2), evaluator.evaluate(r3)

print("=" * 85)
print(f"{'QID':<6} {'Query':<35} {'Run1':>10} {'Run2':>10} {'Run3':>10} {'R@1000':>8}")
print("-" * 85)
for qid in sorted(e1.keys(), key=lambda x: int(x)):
    q = queries.get(qid, "N/A")[:32]
    print(f"{qid:<6} {q:<35} {e1[qid]['map']:>10.4f} {e2[qid]['map']:>10.4f} {e3[qid]['map']:>10.4f} {e1[qid]['recall_1000']:>8.3f}")

---
## 1b. Ablation Tests (Optional)

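Test whether Q2D and the LLM stage actually help. These ablation cells omit `--splade-index-path`, while the sanity run in Section 1 includes it, so differences against `results_sanity` also reflect the absence of SPLADE.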

In [None]:
# ABLATION 1: No Q2D (original queries only)
!python -m src.main train \
    --output-dir results_no_q2d \
    --dense-index-path "/content/drive/MyDrive/robust04_dense_index" \
    --limit-queries 5 \
    --disable-q2d \
    --retrieval-k 2000 \
    --rerank-depth 1000 \
    --chunk-size 512 \
    --chunk-overlap 64 \
    --use-bi-encoder \
    --bi-encoder-model "BAAI/bge-large-en-v1.5" \
    --bi-encoder-ratio 2.0 \
    --ce-model "BAAI/bge-reranker-v2-m3,cross-encoder/ms-marco-MiniLM-L-12-v2" \
    --monot5-model "castorini/monot5-large-msmarco" \
    --ce-batch-size 256 \
    --monot5-batch-size 64 \
    --ce-weight 0.5 \
    --neural-weight 0.8 \
    --rrf-k 60 \
    --rrf-weight-run1 1.0 \
    --rrf-weight-run2 1.0 \
    --rrf-weight-llm 1.0 \
    --llm-model "gpt-4o-mini" \
    --llm-top-k 50 \
    --llm-window-size 10 \
    --llm-step-size 5 \
    --llm-max-passage-length 1500 \
    --llm-concurrency 10 \
    --llm-strong-model "gpt-5" \
    --llm-strong-top-k 20 \
    --llm-dynamic-few-shot

In [None]:
# ABLATION 2: No LLM (neural only)
!python -m src.main train \
    --output-dir results_no_llm \
    --dense-index-path "/content/drive/MyDrive/robust04_dense_index" \
    --limit-queries 5 \
    --disable-llm \
    --retrieval-k 2000 \
    --rerank-depth 1000 \
    --chunk-size 512 \
    --chunk-overlap 64 \
    --use-bi-encoder \
    --bi-encoder-model "BAAI/bge-large-en-v1.5" \
    --bi-encoder-ratio 2.0 \
    --ce-model "BAAI/bge-reranker-v2-m3,cross-encoder/ms-marco-MiniLM-L-12-v2" \
    --monot5-model "castorini/monot5-large-msmarco" \
    --ce-batch-size 256 \
    --monot5-batch-size 64 \
    --ce-weight 0.5 \
    --neural-weight 0.8 \
    --rrf-k 60 \
    --rrf-weight-run1 1.0 \
    --rrf-weight-run2 1.0 \
    --rrf-weight-llm 1.0 \
    --llm-model "gpt-4o-mini" \
    --llm-top-k 50 \
    --llm-window-size 10 \
    --llm-step-size 5 \
    --llm-max-passage-length 1500 \
    --llm-concurrency 10 \
    --llm-strong-model "gpt-5" \
    --llm-strong-top-k 20

In [None]:
# ABLATION 3: No Q2D + No LLM (baseline neural)
!python -m src.main train \
    --output-dir results_baseline \
    --dense-index-path "/content/drive/MyDrive/robust04_dense_index" \
    --limit-queries 5 \
    --disable-q2d \
    --disable-llm \
    --retrieval-k 2000 \
    --rerank-depth 1000 \
    --chunk-size 512 \
    --chunk-overlap 64 \
    --use-bi-encoder \
    --bi-encoder-model "BAAI/bge-large-en-v1.5" \
    --bi-encoder-ratio 2.0 \
    --ce-model "BAAI/bge-reranker-v2-m3,cross-encoder/ms-marco-MiniLM-L-12-v2" \
    --monot5-model "castorini/monot5-large-msmarco" \
    --ce-batch-size 256 \
    --monot5-batch-size 64 \
    --ce-weight 0.5 \
    --neural-weight 0.8 \
    --rrf-k 60 \
    --rrf-weight-run1 1.0 \
    --rrf-weight-run2 1.0 \
    --rrf-weight-llm 1.0 \
    --llm-model "gpt-4o-mini" \
    --llm-top-k 50 \
    --llm-window-size 10 \
    --llm-step-size 5 \
    --llm-max-passage-length 1500 \
    --llm-concurrency 10 \
    --llm-strong-model "gpt-5" \
    --llm-strong-top-k 20

In [None]:
# Compare ablation results
import json
from pathlib import Path

configs = [
    ("Full (Q2D + LLM)", "results_sanity"),
    ("No Q2D", "results_no_q2d"),
    ("No LLM", "results_no_llm"),
    ("Baseline (no Q2D, no LLM)", "results_baseline"),
]

print("=" * 70)
print("ABLATION COMPARISON")
print("=" * 70)
print(f"{'Config':<30} {'Run1 MAP':>12} {'Run2 MAP':>12} {'Run3 MAP':>12}")
print("-" * 70)

for name, path in configs:
    metrics_file = Path(path) / "metrics.json"
    if metrics_file.exists():
        with open(metrics_file) as f:
            m = json.load(f)
        print(f"{name:<30} {m['run_1']['map']:>12.4f} {m['run_2']['map']:>12.4f} {m['run_3']['map']:>12.4f}")
    else:
        print(f"{name:<30} {'(not run)':>12} {'':>12} {'':>12}")

---
## 2. Training (50 queries)

In [None]:
# Full training with SPLADE (~20 min with bi-encoder, ~$1.00 LLM cost)
# 6-way hybrid retrieval: BM25 + Dense + SPLADE (orig + Q2D each)
!python -m src.main train \
    --output-dir results \
    --dense-index-path "/content/drive/MyDrive/robust04_dense_index" \
    --splade-index-path "/content/drive/MyDrive/robust04_splade_index" \
    --retrieval-k 2000 \
    --rerank-depth 1000 \
    --chunk-size 512 \
    --chunk-overlap 64 \
    --use-bi-encoder \
    --bi-encoder-model "BAAI/bge-large-en-v1.5" \
    --bi-encoder-ratio 2.0 \
    --ce-model "BAAI/bge-reranker-v2-m3,cross-encoder/ms-marco-MiniLM-L-12-v2" \
    --monot5-model "castorini/monot5-large-msmarco" \
    --ce-batch-size 256 \
    --monot5-batch-size 64 \
    --ce-weight 0.5 \
    --neural-weight 0.8 \
    --rrf-k 60 \
    --rrf-weight-run1 1.0 \
    --rrf-weight-run2 1.0 \
    --rrf-weight-llm 1.0 \
    --llm-model "gpt-4o-mini" \
    --llm-top-k 50 \
    --llm-window-size 10 \
    --llm-step-size 5 \
    --llm-max-passage-length 1500 \
    --llm-concurrency 10 \
    --llm-strong-model "gpt-5" \
    --llm-strong-top-k 20 \
    --llm-dynamic-few-shot
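
The `--llm-top-k 50`, `--llm-window-size 10`, and `--llm-step-size 5` flags suggest a sliding-window listwise reranker. The sketch below enumerates the windows under the assumption that the pass runs from the bottom of the list to the top, as in RankGPT-style reranking; this notebook does not confirm the order, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(n_candidates, window_size=10, step_size=5):
    """Return (start, end) index pairs, ordered from the bottom of the list to the top."""
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    if n_candidates <= 0:
        return []
    windows = []
    end = n_candidates
    while True:
        start = max(0, end - window_size)
        windows.append((start, end))
        if start == 0:
            break  # the window has reached the top of the ranking
        end -= step_size
    return windows

windows = sliding_windows(50, window_size=10, step_size=5)
print(len(windows), windows[0], windows[-1])
```

Under these assumptions each query needs one LLM call per window, which is a useful number to keep in mind when estimating API cost.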

In [None]:
# View training metrics
import json
with open("results/metrics.json") as f:
    metrics = json.load(f)
for run, m in metrics.items():
    print(f"{run}: MAP={m['map']:.4f}  NDCG@10={m['ndcg_10']:.4f}  P@10={m['p_10']:.4f}")

---
## 3. Test Submission (199 queries)

In [None]:
# Test submission with SPLADE (~50-70 min with bi-encoder, ~$4.00 LLM cost)
# 6-way hybrid retrieval: BM25 + Dense + SPLADE (orig + Q2D each)
!python -m src.main test \
    --output-dir Final_Project_Part_A \
    --dense-index-path "/content/drive/MyDrive/robust04_dense_index" \
    --splade-index-path "/content/drive/MyDrive/robust04_splade_index" \
    --retrieval-k 2000 \
    --rerank-depth 1000 \
    --chunk-size 512 \
    --chunk-overlap 64 \
    --use-bi-encoder \
    --bi-encoder-model "BAAI/bge-large-en-v1.5" \
    --bi-encoder-ratio 2.0 \
    --ce-model "BAAI/bge-reranker-v2-m3,cross-encoder/ms-marco-MiniLM-L-12-v2" \
    --monot5-model "castorini/monot5-large-msmarco" \
    --ce-batch-size 256 \
    --monot5-batch-size 64 \
    --ce-weight 0.5 \
    --neural-weight 0.8 \
    --rrf-k 60 \
    --rrf-weight-run1 1.0 \
    --rrf-weight-run2 1.0 \
    --rrf-weight-llm 1.0 \
    --llm-model "gpt-4o-mini" \
    --llm-top-k 50 \
    --llm-window-size 10 \
    --llm-step-size 5 \
    --llm-max-passage-length 1500 \
    --llm-concurrency 10 \
    --llm-strong-model "gpt-5" \
    --llm-strong-top-k 20 \
    --llm-dynamic-few-shot

In [None]:
# Create submission zip
!cd Final_Project_Part_A && zip -q ../Final_Project_Part_A.zip run_1.res run_2.res run_3.res
!ls -la Final_Project_Part_A.zip
print("\n✓ Download Final_Project_Part_A.zip")
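
Before uploading, it may be worth checking the run files for formatting slips. The sketch below assumes the standard six-column TREC run format (`qid Q0 docid rank score tag`), which is consistent with the columns `load_run` reads above, and a limit of 1000 documents per query; the limit and the `validate_run_lines` helper are assumptions, so adjust them to the actual submission rules.

```python
from collections import Counter

def validate_run_lines(lines, max_per_query=1000):
    """Return a list of problems found in TREC-style run lines (empty list = OK)."""
    problems = []
    per_query = Counter()
    seen = set()
    for lineno, line in enumerate(lines, start=1):
        parts = line.split()
        if len(parts) != 6:
            problems.append(f"line {lineno}: expected 6 columns, got {len(parts)}")
            continue
        qid, _, docid, _, score, _ = parts
        try:
            float(score)
        except ValueError:
            problems.append(f"line {lineno}: score {score!r} is not a number")
        if (qid, docid) in seen:
            problems.append(f"line {lineno}: duplicate document {docid} for query {qid}")
        seen.add((qid, docid))
        per_query[qid] += 1
    for qid, n in per_query.items():
        if n > max_per_query:
            problems.append(f"query {qid}: {n} documents exceeds {max_per_query}")
    return problems

# Made-up example lines
good = ["301 Q0 DOC-1 1 12.5 run_1", "301 Q0 DOC-2 2 11.0 run_1"]
bad = ["301 Q0 DOC-1 1 12.5", "301 Q0 DOC-1 1 abc run_1"]
print(validate_run_lines(good))
print(validate_run_lines(bad))
```

To check a real file, pass its lines directly, for example `validate_run_lines(open("Final_Project_Part_A/run_1.res"))`.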