# 🚀 AIML Hackathon 2026 - Starter Notebook

Welcome to the Passage Ranking Challenge! This notebook walks you through:
1. Setting up your environment.
2. Downloading the dataset.
3. Loading the test queries and the passage collection.
4. Generating a **Random Baseline** ranking.
5. Exporting the submission file.

**Goal**: Rank passages for each query in the test set.

## 1. Install Dependencies
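This notebook itself only needs `pandas` and `pyarrow` for data handling. The cell below also installs a broader set of retrieval and machine learning libraries that you are likely to use when building a real ranker.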

In [None]:
# Install all dependencies explicitly.
# Version specifiers are quoted so the shell does not treat ">" as output redirection.
!pip install "torch>=2.0.0" "transformers>=4.30.0" "sentence-transformers>=2.2.0" "pytorch-lightning>=2.0.0" \
    "scikit-learn>=1.3.0" "xgboost>=2.0.0" "optuna>=3.4.0" "gensim>=4.3.0" \
    "numpy>=1.24.0" "pandas>=2.0.0" "tqdm>=4.65.0" "rank_bm25>=0.2.2" \
    "nltk>=3.8.0" "matplotlib>=3.7.0" "requests>=2.31.0" "ir_datasets>=0.5.0" pyarrow
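
After the install finishes, it can help to confirm which versions actually ended up in the environment. The sketch below uses the standard-library `importlib.metadata`. The helper name `installed_versions` and the short package list are illustrative and not part of the competition code.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Illustrative subset of the packages installed above.
for name, ver in installed_versions(["pandas", "pyarrow", "rank_bm25"]).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

A `NOT INSTALLED` entry usually means the install cell failed or ran in a different environment than the notebook kernel.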

## 2. Download Dataset
Download the competition dataset from GitHub Releases and extract it into the `data/` folder.

In [None]:
import os
import pandas as pd
import random
import zipfile
import urllib.request

# Dataset URL (from GitHub Releases)
DATASET_URL = "https://github.com/fabsilvestri/aiml_hackathon_data/releases/download/v1.0/kaggle_data.zip"
DATA_DIR = "data"

# Always download a fresh copy of the dataset
zip_path = "kaggle_data.zip"

# Download
print(f"Downloading from {DATASET_URL}...")
urllib.request.urlretrieve(DATASET_URL, zip_path)
print("Download complete!")

# Extract INTO data/ folder
os.makedirs(DATA_DIR, exist_ok=True)
print("Extracting...")
with zipfile.ZipFile(zip_path, 'r') as zf:
    zf.extractall(DATA_DIR)
print(f"Extracted to {DATA_DIR}/")

# Cleanup zip
os.remove(zip_path)

# List files (sorted, because os.listdir returns entries in arbitrary order)
print("\nDataset files:")
for f in sorted(os.listdir(DATA_DIR)):
    print(f"  - {f}")
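
The cell above downloads the archive every time it runs. If you re-run the notebook often, one option is to skip the download when the files are already present. A minimal sketch, assuming the extracted archive places `test.csv` and `collection.parquet` directly in the data folder (these are the two files this notebook reads). The helper name `missing_files` is hypothetical.

```python
import os

# Files the rest of this notebook reads from the data folder.
REQUIRED_FILES = ("test.csv", "collection.parquet")


def missing_files(data_dir, required=REQUIRED_FILES):
    """Return the required files that are not yet present in data_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(data_dir, name))]


missing = missing_files("data")
if missing:
    print(f"Missing {missing}: run the download cell.")
else:
    print("Dataset already present, download can be skipped.")
```

Checking for specific files rather than for the folder alone avoids treating a half-extracted archive as complete.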

## 3. Load Data
Load the test queries and passage collection.

In [None]:
# Load Resources
print("Loading data...")

# Load test queries
test_queries = pd.read_csv(f"{DATA_DIR}/test.csv")
print(f"Loaded {len(test_queries)} test queries.")

# Load collection
collection = pd.read_parquet(f"{DATA_DIR}/collection.parquet")
all_pids = collection['pid'].astype(str).tolist()
print(f"Loaded {len(all_pids)} passages.")
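
Because the baseline samples from `all_pids`, it is worth checking that the list contains no repeated IDs. A repeated ID could otherwise appear twice in one ranking. A small sketch using only the standard library, where the helper name `duplicate_ids` is an illustrative assumption and the toy list stands in for `all_pids`:

```python
from collections import Counter


def duplicate_ids(ids):
    """Return the IDs that occur more than once, sorted for stable output."""
    counts = Counter(ids)
    return sorted(pid for pid, n in counts.items() if n > 1)


# Toy stand-in for all_pids; in the notebook, pass all_pids instead.
example_pids = ["101", "102", "103", "102"]
print(duplicate_ids(example_pids))  # ['102']
```

An empty result means every passage ID is unique.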

## 4. Generate Random Baseline
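For each query, we sample 10 passages from the collection at random, without replacement, and submit them in the sampled order. The query text is ignored entirely, so this baseline mainly serves as a format check and a score floor to beat.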

In [None]:
# Generate Random Rankings
print("Generating random rankings...")
results = []
for qid in test_queries['id']:
    # Randomly sample 10 PIDs from the collection
    ranked_pids = random.sample(all_pids, 10)
    results.append({
        'id': str(qid),
        'expected': " ".join(ranked_pids)
    })

print(f"Generated rankings for {len(results)} queries.")
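
The loop above draws from Python's global random state, so each run produces a different submission. If you want repeatable baselines, one option is a dedicated seeded generator. A minimal sketch, where the function name `random_rankings`, the seed value, and the toy IDs are assumptions for illustration:

```python
import random


def random_rankings(qids, pids, k=10, seed=42):
    """Sample k distinct passage IDs per query using a private, seeded RNG."""
    rng = random.Random(seed)  # independent of the global random state
    return [
        {"id": str(qid), "expected": " ".join(rng.sample(pids, k))}
        for qid in qids
    ]


# Toy data; in the notebook, pass test_queries['id'] and all_pids.
toy_pids = [str(i) for i in range(100)]
rows = random_rankings(["q1", "q2"], toy_pids, k=10, seed=42)
print(rows[0])
```

Calling the function twice with the same seed and inputs returns identical rankings, which makes leaderboard scores easier to compare across runs.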

## 5. Export Submission
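Save the results as a CSV file for Kaggle submission. The file has two columns: `id` (the query ID) and `expected` (the ranked passage IDs, separated by spaces).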

In [None]:
# Save submission
submission = pd.DataFrame(results)
submission.to_csv("submission.csv", index=False)
print(f"Created submission.csv with {len(submission)} rows.")
print("\n✅ You can now submit this file to the Kaggle Leaderboard!")
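
Before uploading, a quick format check can catch mistakes early. The sketch below re-reads a submission file with the standard-library `csv` module and reports rows that do not contain exactly 10 distinct passage IDs. The function name `check_submission` is hypothetical, and the rule of 10 IDs per row is taken from this baseline rather than from the official competition rules.

```python
import csv
import os


def check_submission(path, k=10):
    """Return a list of human-readable problems found in the submission file."""
    problems = []
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        if reader.fieldnames != ["id", "expected"]:
            return [f"unexpected columns: {reader.fieldnames}"]
        for row in reader:
            pids = row["expected"].split()
            if len(pids) != k:
                problems.append(
                    f"query {row['id']}: {len(pids)} passages, expected {k}")
            elif len(set(pids)) != k:
                problems.append(f"query {row['id']}: repeated passage IDs")
    return problems


# Only runs the check if the file from the previous cell exists.
if os.path.exists("submission.csv"):
    issues = check_submission("submission.csv")
    print("Submission looks well-formed." if not issues else "\n".join(issues))
```

An empty list means every row has the expected columns and 10 distinct passage IDs.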