# ROBUST04 Ranking Competition - THREE-STAGE NEURAL Pipeline

This notebook runs the complete pipeline for the ROBUST04 ranking competition.
**Optimized for A100 80GB GPU** - uses **THREE-STAGE neural reranking**!

**Neural Pipeline (Run 2):**
```
BM25 (1000) → Bi-Encoder (500) → Cross-Encoder + MonoT5-3B → Final
   Fast           Fast              Precise ensemble
```

**Three Runs:**
1. **Run 1**: BM25 + RM3 + Claude Query2Doc (lexical + PRF + semantic expansion)
2. **Run 2**: Three-Stage Neural Reranking (Bi-Encoder → BGE-v2-m3 → MonoT5-3B)
3. **Run 3**: Optimal Multi-Signal RRF Fusion

**Expected Performance:**
- Run 1 (Lexical): MAP ~0.33-0.36
- Run 2 (Neural): MAP ~0.44-0.50 (three-stage pipeline!)
- Run 3 (Fusion): MAP ~0.47-0.52

**Resource Usage (A100 80GB):**
- GPU Memory: ~17GB (BGE-large ~1.3GB + BGE-reranker ~2.3GB + MonoT5-3B ~12GB)
- System RAM: ~10GB
- Disk: ~12GB (index + models)


## Cell 1: Setup Environment


In [None]:
# Check GPU availability
!nvidia-smi

# Check Python version
import sys
print(f"Python version: {sys.version}")


In [None]:
# Clone repository (replace with your repo URL)
!git clone https://github.com/YOUR_USERNAME/text_retrieval.git
%cd text_retrieval/final-project


In [None]:
# Install Java (required by pyserini/Lucene)
!apt-get update -qq 2>/dev/null
!apt-get install -qq openjdk-21-jdk-headless > /dev/null 2>&1

# Verify Java installation
!java -version

# Install Python dependencies
# Note: Dependency conflicts with pre-installed Colab packages are harmless
%pip install -q pyserini faiss-cpu torch transformers sentence-transformers \
    pytrec_eval langchain langchain-text-splitters tqdm scikit-learn numpy accelerate 2>/dev/null

print("\n✓ All dependencies installed successfully!")


In [None]:
# Download Pyserini's prebuilt ROBUST04 index
print("=" * 60)
print("DOWNLOADING REQUIRED RESOURCES")
print("=" * 60)

print("\n[1/5] Downloading ROBUST04 index...")
from pyserini.search.lucene import LuceneSearcher
searcher = LuceneSearcher.from_prebuilt_index('robust04')
print(f"  ✓ Index loaded: {searcher.num_docs:,} documents")
searcher.close()

# Pre-download neural models to cache them
import torch
from sentence_transformers import CrossEncoder
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Download BGE-base Bi-Encoder (FAST!)
print("\n[2/5] Downloading BGE-base Bi-Encoder (BAAI/bge-base-en-v1.5)...")
from sentence_transformers import SentenceTransformer
bi_encoder = SentenceTransformer('BAAI/bge-base-en-v1.5', device='cpu')
params = sum(p.numel() for p in bi_encoder.parameters()) / 1e6
print(f"  ✓ BGE-base Bi-Encoder downloaded ({params:.0f}M params)")
del bi_encoder

# Download MiniLM-L6 Cross-Encoder (FAST 6-layer model!)
print("\n[3/5] Downloading MiniLM-L6 Cross-Encoder...")
ce_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', device='cpu')
params = sum(p.numel() for p in ce_model.model.parameters()) / 1e6
print(f"  ✓ MiniLM-L6 Cross-Encoder downloaded ({params:.0f}M params)")
del ce_model

# Download MonoT5-base (faster than 3B!)
print("\n[4/5] Downloading MonoT5-base (castorini/monot5-base-msmarco)...")
tokenizer = AutoTokenizer.from_pretrained('castorini/monot5-base-msmarco')
model = T5ForConditionalGeneration.from_pretrained(
    'castorini/monot5-base-msmarco',
    torch_dtype=torch.bfloat16,
)
params_m = sum(p.numel() for p in model.parameters()) / 1e6
print(f"  ✓ MonoT5-base downloaded ({params_m:.0f}M params)")
del model, tokenizer
torch.cuda.empty_cache()

# Check GPU memory
print("\n[5/5] Checking GPU resources...")
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"  ✓ GPU: {gpu_name} ({gpu_mem:.0f}GB)")
else:
    print("  ⚠ No GPU detected - neural reranking will be slow!")

print("\n" + "=" * 60)
print("✓ ALL RESOURCES READY!")
print("=" * 60)


## Validation: Test All Components

Run this cell to verify all libraries are working correctly before running the full pipeline.


In [None]:
# Validate all components work correctly
print("=" * 50)
print("VALIDATION: Testing all components")
print("=" * 50)

# Test 0: Data files exist and are correct
print("\n0. Checking data files...")
import os
assert os.path.exists('Files-20260103/queriesROBUST.txt'), "Queries file missing!"
assert os.path.exists('Files-20260103/qrels_50_Queries'), "Qrels file missing!"
assert os.path.exists('data/expanded_queries.csv'), "Expanded queries file missing!"

# Load and verify queries
from src.data_loader import load_queries, load_expanded_queries, load_qrels, get_train_qids, get_test_qids
queries = load_queries()
expanded = load_expanded_queries()
qrels = load_qrels()
train_qids = get_train_qids()
test_qids = get_test_qids()

print(f"   ✓ Loaded {len(queries)} queries")
print(f"   ✓ Loaded {len(expanded)} expanded queries")
print(f"   ✓ Loaded qrels for {len(qrels)} queries")
print(f"   ✓ Train queries: {len(train_qids)} (301-350)")
print(f"   ✓ Test queries: {len(test_qids)} (351-450, 601-700 minus 672)")
assert len(queries) == 249, f"Expected 249 queries, got {len(queries)}"
assert len(expanded) == 249, f"Expected 249 expanded queries, got {len(expanded)}"
assert len(qrels) == 50, f"Expected 50 qrels, got {len(qrels)}"
assert len(train_qids) == 50, f"Expected 50 train qids, got {len(train_qids)}"
assert len(test_qids) == 199, f"Expected 199 test qids, got {len(test_qids)}"
print(f"   ✓ All counts verified!")

# Test 1: BM25 Search
print("\n1. Testing BM25 search...")
from pyserini.search.lucene import LuceneSearcher
searcher = LuceneSearcher.from_prebuilt_index('robust04')
searcher.set_bm25(k1=0.9, b=0.4)
hits = searcher.search('international organized crime', k=10)
print(f"   ✓ BM25 returned {len(hits)} results")
print(f"   Top doc: {hits[0].docid} (score: {hits[0].score:.4f})")

# Test 2: BM25 + RM3 Search
print("\n2. Testing BM25 + RM3 search...")
searcher.set_rm3(fb_docs=10, fb_terms=10, original_query_weight=0.5)
hits_rm3 = searcher.search('international organized crime', k=10)
print(f"   ✓ BM25+RM3 returned {len(hits_rm3)} results")
searcher.close()

# Test 3: MiniLM-L6 Cross-Encoder (FAST 6-layer model)
print("\n3. Testing MiniLM-L6 Cross-Encoder...")
from sentence_transformers import CrossEncoder
import torch
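# Fall back to CPU so this validation cell still runs when no GPU is attached
ce = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', device='cuda' if torch.cuda.is_available() else 'cpu')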
scores = ce.predict([
    ('international crime', 'This document discusses transnational criminal organizations.'),
    ('international crime', 'The weather today is sunny and warm.')
])
print(f"   ✓ Relevant doc score: {scores[0]:.4f}")
print(f"   ✓ Irrelevant doc score: {scores[1]:.4f}")
assert scores[0] > scores[1], "Relevant doc should score higher!"
print(f"   ✓ Sanity check passed!")
del ce
torch.cuda.empty_cache()

# Test 4: LangChain chunking
print("\n4. Testing LangChain chunking...")
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=256, chunk_overlap=64,
    separators=["\n\n", "\n", ". ", "? ", "! ", "; ", ", ", " ", ""]
)
chunks = splitter.split_text("This is a test document. " * 50)
print(f"   ✓ Created {len(chunks)} chunks from test text")

# Test 5: pytrec_eval
print("\n5. Testing pytrec_eval...")
import pytrec_eval
# Toy data under separate names so the qrels loaded in Test 0 are not overwritten
toy_qrels = {'q1': {'d1': 1, 'd2': 0, 'd3': 1}}
toy_results = {'q1': {'d1': 0.9, 'd2': 0.5, 'd3': 0.8}}
evaluator = pytrec_eval.RelevanceEvaluator(toy_qrels, {'map', 'ndcg'})
metrics = evaluator.evaluate(toy_results)
print(f"   ✓ MAP: {metrics['q1']['map']:.4f}")
print(f"   ✓ NDCG: {metrics['q1']['ndcg']:.4f}")

print("\n" + "=" * 50)
print("✓ All components validated successfully!")
print("=" * 50)


## Cell 2: Parameter Tuning (Optional - Run Once)

This tunes BM25 and RM3 parameters using 5-fold cross-validation on the 50 training queries.

**Takes ~15-30 minutes.** Results are saved so you only need to run this once.
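
As a rough illustration of the 5-fold cross-validation described above, the sketch below splits the 50 training query IDs into five held-out folds. The helper name `kfold_query_splits` and the seed are hypothetical; how `src.main tune` actually builds its folds is not shown in this notebook.

```python
import random

def kfold_query_splits(qids, n_folds=5, seed=42):
    """Yield (fit_qids, heldout_qids) pairs for k-fold cross-validation.

    Hypothetical helper: the real tuner may shuffle or stratify differently.
    """
    shuffled = sorted(qids)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for repeatability
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i, heldout in enumerate(folds):
        fit = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield fit, heldout

train_qids = [str(q) for q in range(301, 351)]  # the 50 training queries
for fold_idx, (fit_qids, heldout_qids) in enumerate(kfold_query_splits(train_qids)):
    # A tuner would choose parameters on fit_qids and score them on heldout_qids.
    print(f"fold {fold_idx}: fit on {len(fit_qids)} queries, hold out {len(heldout_qids)}")
```

Each query is held out exactly once, so averaging the held-out MAP over the five folds gives an estimate that never scores a query with parameters chosen on that same query.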


In [None]:
# Run parameter tuning (optional - skip if using defaults)
!python -m src.main tune --output tuning_results/


## Cell 3: Generate Run 1 - BM25 + RM3 + Query2Doc

Uses Claude's pre-generated query expansions + tuned BM25/RM3.
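
One common way to combine a query with a generated pseudo-document for lexical retrieval is to repeat the original query a few times before appending the passage, so the short query is not swamped by the longer expansion. The sketch below assumes that scheme; the function name and repeat count are illustrative, and the notebook does not state how `src.main run1` actually merges the two.

```python
def build_query2doc_query(query, pseudo_doc, n_repeats=5):
    """Concatenate a repeated query with its generated pseudo-document.

    Assumption: repetition-based weighting; n_repeats is a placeholder value.
    """
    repeated = " ".join([query.strip()] * n_repeats)
    return f"{repeated} {pseudo_doc.strip()}"

# Hypothetical expansion text, for illustration only
expanded = build_query2doc_query(
    "international organized crime",
    "Transnational criminal groups engage in trafficking and money laundering.",
    n_repeats=2,
)
print(expanded)
```

The resulting string can be passed to a BM25 searcher like any other query.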


In [None]:
!python -m src.main run1 \
    --config tuning_results/best_config.json \
    --output results/run_1.res


## Cell 4: Generate Run 2 - Neural MaxP Reranking

Uses a cross-encoder to score passages, then aggregates passage scores into document scores with MaxP (each document takes the score of its best-scoring passage).

**Takes ~30-60 minutes** depending on GPU.
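
The MaxP aggregation mentioned above can be sketched in a few lines: every passage carries its parent document ID, and a document keeps the highest score among its passages. The function name and the document IDs below are made up for illustration, not taken from the project code.

```python
def maxp_aggregate(passage_scores):
    """Collapse (docid, passage_score) pairs to one score per document via max."""
    doc_scores = {}
    for docid, score in passage_scores:
        if docid not in doc_scores or score > doc_scores[docid]:
            doc_scores[docid] = score
    # Best document first
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical passage scores for two documents
passages = [("DOC-A", 0.2), ("DOC-A", 0.9), ("DOC-B", 0.5), ("DOC-B", 0.4)]
print(maxp_aggregate(passages))
```

Taking the maximum rewards a document for containing one highly relevant passage, even when the rest of a long document is off-topic.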


In [None]:
!python -m src.main run2 \
    --config tuning_results/best_config.json \
    --output results/run_2.res \
    --rerank-depth 200 \
    --gpu


## Cell 5: Generate Run 3 - Optimal RRF Fusion

Fuses Run 1 and Run 2 using Reciprocal Rank Fusion.
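
For reference, Reciprocal Rank Fusion gives each document a score of 1 / (k + rank) in every ranking that contains it and sums those contributions. The sketch below uses k=60, the value listed in the summary table; the function name is hypothetical and the real `run3` command may add weighting or other signals.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked docid lists with Reciprocal Rank Fusion (ranks start at 1)."""
    fused = {}
    for ranking in rankings:
        for rank, docid in enumerate(ranking, start=1):
            fused[docid] = fused.get(docid, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Toy rankings standing in for Run 1 and Run 2
run1 = ["d1", "d2", "d3"]
run2 = ["d2", "d3", "d4"]
for docid, score in rrf_fuse([run1, run2]):
    print(docid, round(score, 6))
```

Because only ranks are used, the lexical and neural scores never need to be put on a common scale before fusing.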


In [None]:
!python -m src.main run3 \
    --run1 results/run_1.res \
    --run2 results/run_2.res \
    --output results/run_3.res


## Cell 6: Evaluate on Training Queries

Check performance on the 50 training queries (301-350).
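
To make the reported metric concrete, here is a minimal pure-Python sketch of average precision and MAP. It is only an illustration of the definition; the pipeline itself uses `pytrec_eval`, and this sketch assumes binary relevance and counts a judged query missing from the run as 0.

```python
def average_precision(ranked_docids, relevant):
    """Mean of precision@rank taken at each rank where a relevant doc appears."""
    hits = 0
    precision_sum = 0.0
    for rank, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(run, qrels):
    """run: qid -> ranked docids; qrels: qid -> set of relevant docids."""
    aps = [average_precision(run.get(qid, []), rel) for qid, rel in qrels.items()]
    return sum(aps) / len(aps) if aps else 0.0

toy_run = {"q1": ["d2", "d1", "d3"]}
toy_rel = {"q1": {"d1", "d3"}}
print(f"MAP: {mean_average_precision(toy_run, toy_rel):.4f}")
```

In the toy run the relevant documents sit at ranks 2 and 3, so the average precision is (1/2 + 2/3) / 2.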


In [None]:
# Evaluate on the 50 TRAINING queries (301-350)
# The full run files contain ALL queries (train + test) for evaluation purposes
# The submission files (*_submission.res) contain only test queries

!python -m src.main evaluate \
    results/run_1.res \
    results/run_2.res \
    results/run_3.res

print("\n💡 Note: MAP shown above is on TRAINING data only (for tuning).")
print("   Competition score will be based on 199 TEST queries.")


## Cell 7: Verify Output Format


In [None]:
# Check file sizes and line counts
!echo "=== Run 1 ==="
!wc -l results/run_1.res
!head -5 results/run_1.res
!echo ""
!echo "=== Run 2 ==="
!wc -l results/run_2.res
!head -5 results/run_2.res
!echo ""
!echo "=== Run 3 ==="
!wc -l results/run_3.res
!head -5 results/run_3.res


In [None]:
# Verify query coverage
# FULL runs: should have 249 queries (50 train + 199 test)
# SUBMISSION runs: should have 199 queries (test only)

print("=== FULL RUN FILES (for evaluation) ===")
print("Expected: 249 queries (50 train + 199 test)\n")
!echo "run_1.res:" && cut -d' ' -f1 results/run_1.res | sort -u | wc -l
!echo "run_2.res:" && cut -d' ' -f1 results/run_2.res | sort -u | wc -l
!echo "run_3.res:" && cut -d' ' -f1 results/run_3.res | sort -u | wc -l

print("\n=== SUBMISSION FILES (for competition) ===")
print("Expected: 199 queries (test only)\n")
!echo "run_1_submission.res:" && cut -d' ' -f1 results/run_1_submission.res | sort -u | wc -l
!echo "run_2_submission.res:" && cut -d' ' -f1 results/run_2_submission.res | sort -u | wc -l
!echo "run_3_submission.res:" && cut -d' ' -f1 results/run_3_submission.res | sort -u | wc -l

# Verify line counts (should be queries × 1000 docs each)
print("\n=== LINE COUNTS ===")
print("Full runs: 249 × 1000 = 249,000 lines expected")
print("Submission runs: 199 × 1000 = 199,000 lines expected\n")
!wc -l results/run_*.res
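
The same coverage checks can also be done in Python. The sketch below assumes the `.res` files use the standard six-column TREC run layout (`qid Q0 docid rank score tag`), which this notebook does not spell out; `summarize_run` is a hypothetical helper.

```python
from collections import Counter

def summarize_run(lines):
    """Count ranked documents per query in a TREC-style run.

    Raises ValueError on a wrong field count or a non-numeric rank/score.
    """
    docs_per_query = Counter()
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 6:
            raise ValueError(f"line {line_no}: expected 6 fields, got {len(fields)}")
        qid, _q0, _docid, rank, score, _tag = fields
        int(rank)     # rank must parse as an integer
        float(score)  # score must parse as a number
        docs_per_query[qid] += 1
    return docs_per_query

# Made-up sample lines; for a real file use: summarize_run(open('results/run_1.res'))
sample = [
    "351 Q0 DOC-A 1 12.5 run1",
    "351 Q0 DOC-B 2 11.0 run1",
    "352 Q0 DOC-C 1 9.75 run1",
]
counts = summarize_run(sample)
print(f"{len(counts)} queries, {sum(counts.values())} lines")
```

For a full run file, `len(counts)` should match the expected query count and every value should be 1000.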


## Cell 8: Package for Submission


In [None]:
# ⚠️ IMPORTANT: Use SUBMISSION files (199 test queries only) for competition!
# NOT the full files (which include training queries)

import os
import shutil

# Create submission directory with correctly named files
os.makedirs('submission', exist_ok=True)
shutil.copy('results/run_1_submission.res', 'submission/run_1.res')
shutil.copy('results/run_2_submission.res', 'submission/run_2.res')
shutil.copy('results/run_3_submission.res', 'submission/run_3.res')

# Verify files
print("=== SUBMISSION FILES ===")
!ls -la submission/

# Verify format (first few lines)
print("\n=== Sample lines from run_1.res ===")
!head -3 submission/run_1.res

# Verify query IDs (should be 351-450 and 601-700, NOT 301-350)
print("\n=== First and last query IDs (should NOT include 301-350) ===")
!echo "First 5 queries:" && cut -d' ' -f1 submission/run_1.res | sort -u | head -5
!echo "Last 5 queries:" && cut -d' ' -f1 submission/run_1.res | sort -u | tail -5

# Create the final zip
!cd submission && zip -r ../Final_Project_Part_A.zip run_1.res run_2.res run_3.res
!ls -la Final_Project_Part_A.zip

print("\n✅ Submission zip created successfully!")
print("📋 Contains: run_1.res, run_2.res, run_3.res (199 test queries each)")


In [None]:
# Download the submission file (for Colab)
from google.colab import files
files.download('Final_Project_Part_A.zip')


---

## Summary of Methods (Optimized for A100 80GB)

| Run | Method | Key Techniques |
|-----|--------|----------------|
| 1 | BM25+RM3+Q2D | Claude Query2Doc expansions, tuned BM25/RM3, pseudo-relevance feedback |
| 2 | Three-Stage Neural | **Bi-Encoder → BGE-Reranker → MonoT5-3B** pipeline, contextual chunking, MaxP |
| 3 | RRF Fusion | Reciprocal Rank Fusion (k=60) of Run 1 and Run 2 |

**Three-Stage Neural Pipeline (Run 2):**
```
BM25 (1000 docs) → Bi-Encoder (top 500) → Cross-Encoder + MonoT5 (ensemble) → Final
```
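
The funnel above can be sketched as a generic cascade: each stage scores the survivors of the previous one, and the last two scorers are combined. Everything here is an assumption made for illustration, in particular the min-max normalization and equal weighting used for the ensemble; the notebook does not say how the project combines the Cross-Encoder and MonoT5 scores.

```python
def keep_top_k(scores, k):
    """Return the k best docids from a docid -> score dict, best first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [docid for docid, _ in ranked[:k]]

def min_max_normalize(scores):
    """Rescale scores to [0, 1] so different models can be averaged."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def cascade(bm25_scores, bi_score, ce_score, t5_score,
            first_depth=1000, second_depth=500, ce_weight=0.5):
    """Hypothetical three-stage rerank: BM25 -> bi-encoder -> ensemble."""
    candidates = keep_top_k(bm25_scores, first_depth)
    survivors = keep_top_k({d: bi_score(d) for d in candidates}, second_depth)
    ce = min_max_normalize({d: ce_score(d) for d in survivors})
    t5 = min_max_normalize({d: t5_score(d) for d in survivors})
    final = {d: ce_weight * ce[d] + (1 - ce_weight) * t5[d] for d in survivors}
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)

# Toy scorers standing in for the real models
bm25 = {"d1": 3.0, "d2": 2.0, "d3": 1.0}
ranked = cascade(
    bm25,
    bi_score={"d1": 0.1, "d2": 0.9, "d3": 0.5}.get,
    ce_score={"d2": 2.0, "d3": 4.0}.get,
    t5_score={"d2": 0.3, "d3": 0.9}.get,
    second_depth=2,
)
print(ranked)
```

The cheap stages prune the candidate list so the expensive scorers only see documents that already look promising.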

**GPU Utilization (MAXED OUT for A100 80GB):**
- **BGE-large-en-v1.5**: ~1.3GB VRAM, batch=**1024**
- **BGE-Reranker-v2-m3**: ~2.3GB VRAM, batch=**1024**
- **MonoT5-3B**: ~12GB VRAM, batch=**256**
- **Total**: ~17GB VRAM + large batches = **~40-50GB peak usage**
- Rerank depth: 200 docs → filter to 500 → final ensemble
