# ROBUST04 Ranking Competition - Ultimate Best Practices Pipeline

This notebook runs the complete pipeline for the ROBUST04 ranking competition.

**Three Methods:**
1. **Run 1**: BM25 + RM3 + Claude Query2Doc (lexical + PRF + semantic expansion)
2. **Run 2**: Neural MaxP Reranking (Cross-Encoder on passages)
3. **Run 3**: Reciprocal Rank Fusion (RRF) of Run 1 and Run 2

**Expected Performance:**
- Run 1 (Lexical): MAP ~0.33-0.36
- Run 2 (Neural): MAP ~0.37-0.40
- Run 3 (Fusion): MAP ~0.40-0.44


## Cell 1: Setup Environment


In [None]:
# Check GPU availability
!nvidia-smi

# Check Python version
import sys
print(f"Python version: {sys.version}")


In [None]:
# Clone repository (replace with your repo URL)
!git clone https://github.com/YOUR_USERNAME/text_retrieval.git
%cd text_retrieval/final-project


In [None]:
# Install Java (required by pyserini/Lucene)
!apt-get update -qq
!apt-get install -qq openjdk-21-jdk-headless > /dev/null

# Verify Java installation
!java -version

# Install Python dependencies
%pip install -q pyserini faiss-cpu torch transformers sentence-transformers \
    pytrec_eval langchain langchain-text-splitters tqdm scikit-learn numpy accelerate


In [None]:
# Download Pyserini's prebuilt ROBUST04 index
print("Downloading ROBUST04 index (this may take a few minutes on first run)...")
from pyserini.search.lucene import LuceneSearcher
searcher = LuceneSearcher.from_prebuilt_index('robust04')
print("✓ Index loaded successfully!")
print(f"  Total documents: {searcher.num_docs:,}")
searcher.close()

# Pre-download neural models to cache them
print("\nPre-downloading neural models...")
from sentence_transformers import CrossEncoder
from transformers import AutoTokenizer

# Download Cross-Encoder
print("  Downloading Cross-Encoder (ms-marco-MiniLM-L-12-v2)...")
ce_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', device='cpu')
del ce_model
print("  ✓ Cross-Encoder downloaded")

# Download MonoT5 tokenizer (model downloaded on-demand if used)
print("  Downloading T5 tokenizer...")
tokenizer = AutoTokenizer.from_pretrained('castorini/monot5-base-msmarco')
del tokenizer
print("  ✓ T5 tokenizer downloaded")

print("\n✓ All models and indexes ready!")


## Validation: Test All Components

Run this cell to verify all libraries are working correctly before running the full pipeline.


In [None]:
# Validate all components work correctly
print("=" * 50)
print("VALIDATION: Testing all components")
print("=" * 50)

# Test 1: BM25 Search
print("\n1. Testing BM25 search...")
from pyserini.search.lucene import LuceneSearcher
searcher = LuceneSearcher.from_prebuilt_index('robust04')
searcher.set_bm25(k1=0.9, b=0.4)
hits = searcher.search('international organized crime', k=10)
print(f"   ✓ BM25 returned {len(hits)} results")
print(f"   Top doc: {hits[0].docid} (score: {hits[0].score:.4f})")

# Test 2: BM25 + RM3 Search
print("\n2. Testing BM25 + RM3 search...")
searcher.set_rm3(fb_docs=10, fb_terms=10, original_query_weight=0.5)
hits_rm3 = searcher.search('international organized crime', k=10)
print(f"   ✓ BM25+RM3 returned {len(hits_rm3)} results")
searcher.close()

# Test 3: Cross-Encoder scoring
print("\n3. Testing Cross-Encoder...")
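import torch
from sentence_transformers import CrossEncoder
# Use the GPU when one is available; otherwise fall back to CPU so validation still runs
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ce = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', device=device)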
scores = ce.predict([
    ('international crime', 'This document discusses transnational criminal organizations.'),
    ('international crime', 'The weather today is sunny and warm.')
])
print(f"   ✓ Relevant doc score: {scores[0]:.4f}")
print(f"   ✓ Irrelevant doc score: {scores[1]:.4f}")
del ce

# Test 4: LangChain chunking
print("\n4. Testing LangChain chunking...")
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=256, chunk_overlap=64,
    separators=["\n\n", "\n", ". ", "? ", "! ", "; ", ", ", " ", ""]
)
chunks = splitter.split_text("This is a test document. " * 50)
print(f"   ✓ Created {len(chunks)} chunks from test text")

# Test 5: pytrec_eval
print("\n5. Testing pytrec_eval...")
import pytrec_eval
qrels = {'q1': {'d1': 1, 'd2': 0, 'd3': 1}}
results = {'q1': {'d1': 0.9, 'd2': 0.5, 'd3': 0.8}}
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map', 'ndcg'})
metrics = evaluator.evaluate(results)
print(f"   ✓ MAP: {metrics['q1']['map']:.4f}")
print(f"   ✓ NDCG: {metrics['q1']['ndcg']:.4f}")

print("\n" + "=" * 50)
print("✓ All components validated successfully!")
print("=" * 50)


## Cell 2: Parameter Tuning (Optional - Run Once)

This cell tunes the BM25 and RM3 parameters using 5-fold cross-validation on the 50 training queries.

**Takes ~15-30 minutes.** Results are saved so you only need to run this once.
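
The tuning code inside `src.main` is not shown in this notebook. As a rough illustration of how 5-fold cross-validation over the 50 training queries could be organized, here is a minimal sketch. The names `k_fold_splits`, `cross_validate`, the parameter grid, and the `toy_score` function are all hypothetical; a real scorer would run BM25/RM3 with the given parameters and return MAP over the given queries.

```python
from itertools import product
from statistics import mean

def k_fold_splits(qids, k=5):
    """Split query ids into k (train, held_out) pairs using round-robin assignment."""
    qids = sorted(qids)
    folds = [qids[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        held_out = folds[i]
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        splits.append((train, held_out))
    return splits

def cross_validate(qids, grid, score_fn, k=5):
    """Per fold: pick the best parameters on the train part, then score them on the held-out part.

    grid maps a parameter name to its candidate values.
    score_fn(params, qids) -> float is a hypothetical callable (e.g. MAP of a run).
    Returns the mean held-out score and the parameters chosen in each fold.
    """
    names = list(grid)
    candidates = [dict(zip(names, values)) for values in product(*grid.values())]
    held_out_scores, chosen = [], []
    for train, held_out in k_fold_splits(qids, k):
        best = max(candidates, key=lambda p: score_fn(p, train))
        chosen.append(best)
        held_out_scores.append(score_fn(best, held_out))
    return mean(held_out_scores), chosen

def toy_score(params, qids):
    """Stand-in scorer (assumption): peaks at k1=0.9, b=0.4 and ignores the queries."""
    return 1.0 - abs(params["k1"] - 0.9) - abs(params["b"] - 0.4)

training_qids = list(range(301, 351))  # the 50 training queries
grid = {"k1": [0.7, 0.9, 1.2], "b": [0.3, 0.4, 0.6]}  # illustrative grid only
cv_score, chosen = cross_validate(training_qids, grid, toy_score)
print(cv_score, chosen[0])
```

Selecting parameters on four folds and scoring on the fifth keeps the reported number from being inflated by tuning on the same queries it is measured on.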


In [None]:
# Run parameter tuning (optional - skip if using defaults)
!python -m src.main tune --output tuning_results/


## Cell 3: Generate Run 1 - BM25 + RM3 + Query2Doc

Uses pre-generated Claude Query2Doc expansions together with the tuned BM25/RM3 parameters.
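
How the expansions are merged with the original query is not shown in this notebook. One common approach for lexical retrieval, sketched below under that assumption, repeats the original query a few times before appending the LLM-written pseudo-document, so the short query is not drowned out by the longer expansion. The function name `build_expanded_query`, the repeat count, and the sample expansion text are illustrative, not taken from the project code.

```python
def build_expanded_query(query: str, pseudo_doc: str, repeats: int = 5) -> str:
    """Concatenate `repeats` copies of the query with an LLM-written pseudo-document."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    pseudo_doc = " ".join(pseudo_doc.split())  # collapse newlines and extra spaces
    return " ".join([query.strip()] * repeats + [pseudo_doc]).strip()

expanded = build_expanded_query(
    "international organized crime",
    # Made-up expansion text, for illustration only
    "Transnational criminal groups run smuggling and\nmoney laundering networks.",
    repeats=2,
)
print(expanded)
```

The resulting string can then be issued as an ordinary query, in the same way the validation cell calls `searcher.search(...)`.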


In [None]:
!python -m src.main run1 \
    --config tuning_results/best_config.json \
    --output results/run_1.res


## Cell 4: Generate Run 2 - Neural MaxP Reranking

Scores passages with a Cross-Encoder, then aggregates the passage scores into document scores with MaxP: each document takes the score of its best-scoring passage.

**Takes ~30-60 minutes** depending on GPU.
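
To make the MaxP step concrete, here is a minimal sketch. It assumes a scoring callable shaped like `CrossEncoder.predict` (a list of `(query, passage)` pairs in, one score per pair out) and uses fixed character windows as a simple stand-in for the `RecursiveCharacterTextSplitter` chunking tested in the validation cell. The helper names and the toy scorer are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Scorer = Callable[[List[Tuple[str, str]]], Sequence[float]]

def split_into_passages(text: str, size: int = 256, overlap: int = 64) -> List[str]:
    """Cut text into character windows of `size` that overlap by `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def maxp_score(query: str, doc_text: str, score_pairs: Scorer,
               size: int = 256, overlap: int = 64) -> float:
    """MaxP: the document score is the maximum of its passage scores."""
    passages = split_into_passages(doc_text, size, overlap)
    scores = score_pairs([(query, p) for p in passages])
    return max(float(s) for s in scores)

def rerank(query: str, docs: Dict[str, str], score_pairs: Scorer) -> List[Tuple[str, float]]:
    """Sort candidate documents (docid -> text) by MaxP score, best first."""
    scored = {docid: maxp_score(query, text, score_pairs) for docid, text in docs.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

def toy_scorer(pairs):
    """Stand-in for a cross-encoder (assumption): counts query terms found in the passage."""
    return [sum(term in passage.lower() for term in query.lower().split())
            for query, passage in pairs]

docs = {
    "relevant": "weather report " * 30
                + "organized crime syndicates operate internationally. "
                + "sports news " * 30,
    "offtopic": "sports news " * 40,
}
ranking = rerank("organized crime", docs, toy_scorer)
print(ranking)
```

Because only the best passage counts, a long document with one highly relevant section is not penalized for its unrelated remainder. With a real model, `toy_scorer` would be replaced by the `predict` method of a loaded `CrossEncoder`.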


In [None]:
!python -m src.main run2 \
    --config tuning_results/best_config.json \
    --output results/run_2.res \
    --rerank-depth 100 \
    --gpu


## Cell 5: Generate Run 3 - Optimal RRF Fusion

Fuses Run 1 and Run 2 using Reciprocal Rank Fusion.
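
Reciprocal Rank Fusion gives each document the score `sum(1 / (k + rank))` over the runs in which it appears, using ranks rather than raw scores, so runs with different score scales can be combined directly. The sketch below uses `k=60` as listed in the summary table; the function name `rrf_fuse`, the `depth` cutoff, and the tie-breaking by docid are assumptions, since the fusion code in `src.main` is not shown here.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60, depth=1000):
    """Fuse ranked docid lists (best first) with Reciprocal Rank Fusion.

    Returns up to `depth` (docid, fused_score) pairs, highest score first.
    Ties are broken by docid so the output is deterministic.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, docid in enumerate(ranking, start=1):
            fused[docid] += 1.0 / (k + rank)
    ordered = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:depth]

run1 = ["d1", "d2", "d3"]  # toy ranking from a lexical run
run2 = ["d3", "d1", "d4"]  # toy ranking from a neural run
fused = rrf_fuse([run1, run2])
for docid, score in fused:
    print(docid, round(score, 5))
```

In this toy example `d1` (ranks 1 and 2) ends up ahead of `d3` (ranks 3 and 1), and documents found by only one run fall below both.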


In [None]:
!python -m src.main run3 \
    --run1 results/run_1.res \
    --run2 results/run_2.res \
    --output results/run_3.res


## Cell 6: Evaluate on Training Queries

Check performance on the 50 training queries (301-350).
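
For readers who want to see what the MAP figure measures, here is a minimal pure-Python sketch of average precision and its mean over queries. It is an illustration only; the notebook's validation cell uses `pytrec_eval` for the actual computation, and the function names and toy data below are hypothetical.

```python
def average_precision(ranked_docids, relevant):
    """AP for one query: sum of precision at each relevant hit, divided by the number of relevant docs."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(run, qrels):
    """Average AP over the queries in qrels; queries missing from the run count as AP = 0."""
    return sum(average_precision(run.get(qid, []), rel) for qid, rel in qrels.items()) / len(qrels)

qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}             # toy relevance judgments
run = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}   # toy rankings, best first
ap_q1 = average_precision(run["q1"], qrels["q1"])      # (1/1 + 2/3) / 2
map_score = mean_average_precision(run, qrels)         # (5/6 + 1/2) / 2
print(round(ap_q1, 4), round(map_score, 4))
```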


In [None]:
!python -m src.main evaluate \
    results/run_1.res \
    results/run_2.res \
    results/run_3.res


## Cell 7: Verify Output Format
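
Beyond eyeballing the first few lines, a small script can catch malformed rows. The sketch below assumes the `.res` files use the standard six-column TREC run layout (`qid Q0 docid rank score tag`) and a limit of 1000 documents per query; neither is stated in this notebook, so adjust both to the competition rules. The function name `check_run_lines` and the sample rows are hypothetical.

```python
def check_run_lines(lines, max_per_query=1000):
    """Return a list of human-readable problems found in TREC-format run lines."""
    problems, seen, counts = [], set(), {}
    for n, line in enumerate(lines, start=1):
        parts = line.split()
        if len(parts) != 6:
            problems.append(f"line {n}: expected 6 columns, got {len(parts)}")
            continue
        qid, _, docid, rank, score, _ = parts
        try:
            int(rank)
            float(score)
        except ValueError:
            problems.append(f"line {n}: non-numeric rank or score")
            continue
        if (qid, docid) in seen:
            problems.append(f"line {n}: duplicate document {docid} for query {qid}")
        seen.add((qid, docid))
        counts[qid] = counts.get(qid, 0) + 1
    problems += [f"query {q}: {c} documents exceeds {max_per_query}"
                 for q, c in counts.items() if c > max_per_query]
    return problems

good = ["351 Q0 FBIS3-1 1 12.5 run1", "351 Q0 FBIS3-2 2 11.0 run1"]  # made-up rows
bad = ["351 Q0 FBIS3-1 1 12.5 run1", "351 Q0 FBIS3-1 2 11.0 run1", "351 Q0 FBIS3-3 3"]
print(check_run_lines(good))
print(check_run_lines(bad))
```

To check a real file, pass an open file handle, for example `check_run_lines(open("results/run_1.res"))`.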


In [None]:
# Check file sizes and line counts
!echo "=== Run 1 ==="
!wc -l results/run_1.res
!head -5 results/run_1.res
!echo ""
!echo "=== Run 2 ==="
!wc -l results/run_2.res
!head -5 results/run_2.res
!echo ""
!echo "=== Run 3 ==="
!wc -l results/run_3.res
!head -5 results/run_3.res


In [None]:
# Verify query coverage (should have 199 test queries)
!echo "Unique queries in run_1:"
!cut -d' ' -f1 results/run_1.res | sort -u | wc -l
!echo "Unique queries in run_2:"
!cut -d' ' -f1 results/run_2.res | sort -u | wc -l
!echo "Unique queries in run_3:"
!cut -d' ' -f1 results/run_3.res | sort -u | wc -l


## Cell 8: Package for Submission


In [None]:
# Create submission zip
!zip -r Final_Project_Part_A.zip results/run_1.res results/run_2.res results/run_3.res
!ls -la Final_Project_Part_A.zip


In [None]:
# Download the submission file (for Colab)
from google.colab import files
files.download('Final_Project_Part_A.zip')


---

## Summary of Methods

| Run | Method | Key Techniques |
|-----|--------|----------------|
| 1 | BM25+RM3+Q2D | Claude Query2Doc expansions, tuned BM25/RM3, pseudo-relevance feedback |
| 2 | Neural MaxP | Cross-Encoder (L-12), contextual chunking, MaxP aggregation |
| 3 | RRF Fusion | Reciprocal Rank Fusion (k=60) of Run 1 and Run 2 |
