# ROBUST04 Ranking Pipeline

**Commands:** `train` (50 queries + evaluate) | `test` (199 queries for submission)

**Runs:** 1. BM25+RM3+Q2D → 2. Neural Reranking → 3. RRF Fusion

## 1. Setup

In [None]:
!nvidia-smi --query-gpu=name,memory.total --format=csv

In [None]:
%cd /content
!rm -rf text-retrieval-and-search-engines
!git clone https://github.com/er1009/text-retrieval-and-search-engines.git
%cd text-retrieval-and-search-engines/final-project

In [None]:
!apt-get update -qq && apt-get install -qq openjdk-21-jdk-headless > /dev/null
%pip install -q pyserini faiss-cpu torch transformers sentence-transformers \
    pytrec_eval langchain-text-splitters tqdm accelerate
print("✓ Install step finished (check the output above for errors)")

In [None]:
from pyserini.search.lucene import LuceneSearcher
from sentence_transformers import SentenceTransformer, CrossEncoder
from transformers import AutoTokenizer, T5ForConditionalGeneration
import torch

# Load each resource once so it lands in the local cache; the objects themselves are discarded.
print("Downloading models...")

# Prebuilt Lucene index for ROBUST04
s = LuceneSearcher.from_prebuilt_index('robust04')
print(f"✓ Index ({s.num_docs:,} docs)")
s.close()

# Bi-encoder and both cross-encoders (loaded on CPU, only to trigger the download)
SentenceTransformer('BAAI/bge-base-en-v1.5', device='cpu')
print("✓ Bi-Encoder")
CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', device='cpu')
print("✓ Cross-Encoder L6")
CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', device='cpu')
print("✓ Cross-Encoder L12")

# MonoT5 tokenizer and weights
AutoTokenizer.from_pretrained('castorini/monot5-base-msmarco')
T5ForConditionalGeneration.from_pretrained('castorini/monot5-base-msmarco', torch_dtype=torch.bfloat16)
print("✓ MonoT5")

torch.cuda.empty_cache()
print("\n✓ All models cached!")

## 2. Train (Run + Evaluate)

Runs all three methods on the **50 training queries** and evaluates the results.

**Note:** If `--bi-encoder-top-k` is not specified, it is derived automatically from `--rerank-depth` (see the formula in the Parameters Reference).
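
The auto-scaling rule listed in the Parameters Reference, `min(depth×15, max(3000, depth×8))`, can be checked by hand. The snippet below is a minimal sketch of that formula only; the function name `auto_bi_encoder_top_k` is hypothetical and is not taken from the project's source.

```python
def auto_bi_encoder_top_k(rerank_depth: int) -> int:
    """Hypothetical helper mirroring the documented rule:
    min(depth * 15, max(3000, depth * 8))."""
    return min(rerank_depth * 15, max(3000, rerank_depth * 8))


# Evaluate the rule at a few depths, including the notebook's value of 1000.
for depth in (100, 300, 1000):
    print(depth, auto_bi_encoder_top_k(depth))
```

With `--rerank-depth 1000`, as used in this notebook, the rule gives 8000.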

In [None]:
!python -m src.main train \
    --output-dir results \
    --bm25-k 1000 \
    --rerank-depth 1000 \
    --chunk-size 256 \
    --chunk-overlap 64 \
    --bi-batch-size 2048 \
    --ce-batch-size 2048 \
    --monot5-batch-size 512 \
    --ce-weight 0.5 \
    --neural-weight 0.8 \
    --rrf-k 60
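
To illustrate what `--chunk-size 256` and `--chunk-overlap 64` mean, here is a rough sketch of a character-level sliding window. It is an assumption for illustration, not the project's splitter: the notebook installs `langchain-text-splitters`, which may break text on separators, so real chunk boundaries can differ. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 256, chunk_overlap: int = 64) -> list[str]:
    """Split text into windows of chunk_size characters, where each window
    starts (chunk_size - chunk_overlap) characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Hypothetical 600-character document, used only for the demonstration.
sample = "".join(chr(97 + i % 26) for i in range(600))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

Consecutive chunks share 64 characters, so a passage that straddles a boundary still appears whole in at least one chunk when it is short enough.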

## 3. Test (Submission)

Runs all three methods on the **199 test queries** to produce the competition submission.

In [None]:
!python -m src.main test \
    --output-dir submission \
    --bm25-k 1000 \
    --rerank-depth 1000 \
    --chunk-size 256 \
    --chunk-overlap 64 \
    --bi-batch-size 2048 \
    --ce-batch-size 2048 \
    --monot5-batch-size 512 \
    --ce-weight 0.5 \
    --neural-weight 0.8 \
    --rrf-k 60

## 4. Package & Download

In [None]:
!echo "Query counts (should be 199 each):"
!for f in submission/run_*.res; do echo -n "$f: "; cut -d' ' -f1 "$f" | sort -u | wc -l; done
!cd submission && zip -r ../Final_Project_Part_A.zip run_1.res run_2.res run_3.res
!ls -la Final_Project_Part_A.zip
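
The same query-count check can be done in Python, which also copes with tab-separated files. This is a sketch that assumes each line of a run file starts with the query ID as its first whitespace-separated field (the usual TREC run layout); the helper name `count_queries` is hypothetical.

```python
from pathlib import Path


def count_queries(run_path) -> int:
    """Count distinct query IDs (first whitespace-separated field) in a run file."""
    qids = set()
    with open(run_path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.split()
            if fields:  # skip blank lines
                qids.add(fields[0])
    return len(qids)


EXPECTED = 199  # number of test queries stated in this notebook
for path in sorted(Path("submission").glob("run_*.res")):
    n = count_queries(path)
    status = "OK" if n == EXPECTED else "MISMATCH"
    print(f"{path}: {n} queries [{status}]")
```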

In [None]:
from google.colab import files
files.download('Final_Project_Part_A.zip')

---

## Parameters Reference

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--config` | None | Path to a tuned config JSON file |
| `--output-dir` | results | Output directory |
| `--bm25-k` | 1000 | BM25 retrieval depth (docs per query) |
| `--rerank-depth` | 1000 | Number of docs to rerank |
| `--chunk-size` | 256 | Chunk size (chars) |
| `--chunk-overlap` | 64 | Chunk overlap (chars) |
| `--bi-encoder-top-k` | **auto** | If unset: min(depth × 15, max(3000, depth × 8)), where depth = `--rerank-depth` |
| `--bi-batch-size` | 2048 | Bi-encoder batch size |
| `--ce-batch-size` | 2048 | Cross-encoder batch size |
| `--monot5-batch-size` | 512 | MonoT5 batch size |
| `--ce-weight` | 0.5 | Cross-encoder (CE) weight in the ensemble |
| `--neural-weight` | 0.8 | Weight of neural vs. BM25 scores |
| `--rrf-k` | 60 | RRF k parameter |
| `--no-gpu` | off | Flag: disable GPU |
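
For readers unfamiliar with `--rrf-k`, the sketch below shows the standard Reciprocal Rank Fusion formula, where each document scores the sum of `1 / (k + rank)` over the input rankings, with 1-based ranks. It is an illustration under assumptions: this notebook does not show which runs the pipeline fuses or how it breaks ties, and the function name `rrf_fuse` and the document IDs are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k: int = 60):
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each document receives sum(1 / (k + rank)) over the lists it appears in.
    Returns (doc_id, score) pairs, best first; ties are broken by doc ID.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Two hypothetical rankings for one query.
bm25_run = ["d1", "d2", "d3"]
neural_run = ["d3", "d1", "d4"]
fused = rrf_fuse([bm25_run, neural_run], k=60)
print([doc_id for doc_id, _ in fused])
```

A larger `k` flattens the gap between top and lower ranks, so documents that appear in several lists gain more than documents ranked first in only one.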