# 🚀 AIML Hackathon 2026 - Starter Notebook

Welcome to the Passage Ranking Challenge! This notebook will guide you through:
1. Setting up your environment.
2. Loading the dataset.
3. Generating a **Random Baseline** submission.

**Goal**: Rank passages for each query in the test set.

## 1. Install Dependencies
We need `pandas` for data handling, plus a Parquet engine (`pyarrow` or `fastparquet`) so that pandas can read `collection.parquet`.

In [None]:
# Install dependencies from the repository's requirements.txt
!pip install -r requirements.txt

# If you are running this without the repo, use:
# !pip install pandas numpy pyarrow fastparquet
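
After installing, it can help to confirm that at least one Parquet engine is importable before loading any data. The following is a minimal sketch using only the standard library; the helper name `available_parquet_engines` is hypothetical and not part of the starter kit.

```python
import importlib.util

def available_parquet_engines():
    """Return the names of installed packages that pandas can use to read Parquet."""
    candidates = ("pyarrow", "fastparquet")
    # find_spec returns None when a top-level package is not installed
    return [name for name in candidates if importlib.util.find_spec(name) is not None]

engines = available_parquet_engines()
if engines:
    print("Parquet engine(s) found:", ", ".join(engines))
else:
    print("No Parquet engine found; install pyarrow or fastparquet.")
```

Either engine is enough for `pd.read_parquet`; having both installed is harmless.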

## 2. Setup Paths
This cell detects whether the notebook is running on Kaggle (where `/kaggle/input` exists) or locally, and sets the data paths accordingly.

In [None]:
import os
import pandas as pd
import random

# 1. Define Paths (Kaggle or Local)
if os.path.exists("/kaggle/input"):
    # NOTE: You may need to update this path based on your Kaggle Dataset name
    DATA_DIR = "/kaggle/input/aiml-hackathon-2526/msmarco_sampled"
    TEST_FILE = "/kaggle/input/aiml-hackathon-2526/test.csv"
else:
    # Local fallback (assuming you downloaded data)
    DATA_DIR = "msmarco_sampled"
    TEST_FILE = "test.csv"

print(f"Using Data Dir: {DATA_DIR}")
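
If the Kaggle dataset is mounted under a different name, one option is to search the input root for `collection.parquet` instead of hard-coding the path. This is a sketch only: the helper `find_collection_dir` is hypothetical, and it assumes the file name `collection.parquet` is unique under the root.

```python
import os

def find_collection_dir(root):
    """Return the first directory under root that contains collection.parquet, or None."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if "collection.parquet" in filenames:
            return dirpath
    return None

# On Kaggle, datasets are mounted under /kaggle/input.
# Off Kaggle this directory does not exist, so the search simply returns None.
found = find_collection_dir("/kaggle/input")
print("collection.parquet found in:", found)
```

When a directory is found, it could be assigned to `DATA_DIR` in place of the hard-coded value.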

## 3. Load Data
We check that the input files exist, then load the test queries and the passage collection.

In [None]:
# 2. Check Input
collection_path = os.path.join(DATA_DIR, "collection.parquet")
for path in (collection_path, TEST_FILE):
    if not os.path.exists(path):
        # Stop here: later cells need test_queries and all_pids to be defined
        raise FileNotFoundError(
            f"{path} not found. Please verify the 'Add Input' step in Instructions."
        )

# 3. Load Resources
print("Loading queries and collection...")
test_queries = pd.read_csv(TEST_FILE)
collection = pd.read_parquet(collection_path)
# PIDs are kept as strings so they can be joined into the submission format
all_pids = collection['pid'].astype(str).tolist()

print(f"Loaded {len(test_queries)} queries and {len(all_pids)} passages.")
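
Before ranking, a quick schema check can catch problems that would otherwise surface later as a `KeyError` or `ValueError`. The sketch below runs on toy frames; the helper `check_frames` and the `text` column are hypothetical, and only the `id` and `pid` column names come from this notebook.

```python
import pandas as pd

def check_frames(test_queries, collection, k=10):
    """Return a list of problems that would break the ranking cell (empty if none)."""
    problems = []
    if "id" not in test_queries.columns:
        problems.append("test_queries has no 'id' column")
    if "pid" not in collection.columns:
        problems.append("collection has no 'pid' column")
    else:
        if collection["pid"].duplicated().any():
            problems.append("collection contains duplicate pids")
        if len(collection) < k:
            # random.sample raises ValueError when asked for more items than exist
            problems.append(f"collection has fewer than {k} passages")
    return problems

# Toy frames standing in for the real test_queries and collection.
toy_queries = pd.DataFrame({"id": [1, 2]})
toy_collection = pd.DataFrame({"pid": range(20), "text": ["..."] * 20})
print(check_frames(toy_queries, toy_collection) or "inputs look fine")
```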

## 4. Generate Random Baseline
For each query, we sample 10 distinct passages at random from the collection and join their IDs into a space-separated ranking.

In [None]:
# 4. Generate Random Rankings
print("Generating random rankings...")
results = []
for qid in test_queries['id']:
    # Randomly sample 10 PIDs from the collection
    ranked_pids = random.sample(all_pids, 10)
    results.append({
        'id': str(qid),
        'expected': " ".join(ranked_pids)
    })
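
The cell above produces a different submission on every run. If a reproducible baseline is preferred, one option is a seeded, local random generator, sketched below on toy inputs. The function name `random_rankings` and the seed value are illustrative assumptions, not part of the starter kit.

```python
import random

def random_rankings(qids, pids, k=10, seed=42):
    """Sample k distinct passage IDs per query using a private, seeded generator."""
    rng = random.Random(seed)  # local generator; the global random state is left untouched
    return [
        {"id": str(qid), "expected": " ".join(rng.sample(pids, k))}
        for qid in qids
    ]

# Toy inputs standing in for test_queries['id'] and all_pids.
toy_qids = [101, 102]
toy_pids = [str(i) for i in range(50)]
rows = random_rankings(toy_qids, toy_pids, k=10, seed=42)
print(rows[0])
```

Calling the function twice with the same seed and inputs yields identical rankings, which makes it easier to compare later models against a fixed baseline.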

## 5. Export Submission
Save the results as a CSV file with two columns: `id` (the query ID) and `expected` (the space-separated passage IDs).

In [None]:
# 5. Save
os.makedirs("submission_files", exist_ok=True)
submission = pd.DataFrame(results)
submission.to_csv("submission_files/submission_random.csv", index=False)
print(f"Created submission_files/submission_random.csv with {len(submission)} rows.")
print("You can now submit this file to the Leaderboard!")
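
Before uploading, the file can be checked for basic format problems. The sketch below assumes the leaderboard expects the `id,expected` header written above and exactly 10 distinct passage IDs per row; the second point is an assumption based on this baseline, so check the competition rules. The helper `validate_submission` is hypothetical.

```python
import csv
import os
import tempfile

def validate_submission(path, k=10):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["id", "expected"]:
            return [f"unexpected header: {reader.fieldnames}"]
        seen_ids = set()
        for line_no, row in enumerate(reader, start=2):  # line 1 is the header
            pids = (row["expected"] or "").split()
            if len(pids) != k:
                problems.append(f"line {line_no}: expected {k} pids, found {len(pids)}")
            if len(set(pids)) != len(pids):
                problems.append(f"line {line_no}: repeated pid")
            if row["id"] in seen_ids:
                problems.append(f"line {line_no}: duplicate query id {row['id']}")
            seen_ids.add(row["id"])
    return problems

# Self-contained demo on a throwaway file; in the notebook, pass
# "submission_files/submission_random.csv" instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "expected"])
        writer.writerow(["q1", " ".join(str(i) for i in range(10))])
    demo_problems = validate_submission(demo_path)

print(demo_problems or "submission looks well-formed")
```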