# FinSurvival Competition: Starter Notebook (XGBoost Cox Model Prediction Submission)

**Objective:** This notebook provides a workflow for creating a valid prediction submission using the XGBoost Cox survival model. The competition requires you to submit a `.zip` file containing 16 separate prediction files in CSV format.

This notebook will guide you through:
1.  Loading the training and test sets for each of the 16 tasks from a single directory.
2.  Training a model (using an XGBoost Cox model as an example).
3.  Generating predictions on the test set in the required format.
4.  Saving each set of predictions to a correctly named CSV file.
5.  Zipping all 16 prediction files for submission.

## Step 1: Setup and Imports

In [15]:
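# Select the GPU before importing XGBoost. A `!export ...` line runs in a
# throwaway subshell and does not change the notebook's own environment.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# Install required packages (uncomment on first run)
# %pip install -q pandas xgboost scikit-learn numpy

# Import libraries
import shutil
from typing import Optional, Tuple

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor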

## Step 2: Define a Preprocessing Function

Even though you are not submitting this code, you will still need a preprocessing pipeline to train your models effectively. You can use the one below as a starting point.

In [16]:
def preprocess(
    train_df_with_labels: pd.DataFrame,
    test_features_df: Optional[pd.DataFrame] = None,
) -> Tuple[pd.DataFrame, pd.DataFrame, Optional[pd.DataFrame]]:
    """
    Preprocesses data for the competition.

    Returns the scaled training features, the training targets
    ("timeDiff", "status"), and the processed test features (or None).
    """
    # Split targets from features and drop identifier-like columns.
    train_targets = train_df_with_labels[["timeDiff", "status"]]
    train_features = train_df_with_labels.drop(columns=["timeDiff", "status"])
    cols_to_drop = ["id", "user", "pool", "Index Event", "Outcome Event", "type", "timestamp"]
    train_features = train_features.drop(columns=cols_to_drop, errors="ignore")

    # Keep the 10 most frequent levels of each categorical column and map the rest to "Other".
    # The levels are stored so the test set is mapped with exactly the same ones.
    categorical_cols = train_features.select_dtypes(include=["object", "category"]).columns
    top_categories = {}
    for col in categorical_cols:
        top_categories[col] = train_features[col].value_counts().nlargest(10).index
        train_features[col] = train_features[col].where(train_features[col].isin(top_categories[col]), "Other")

    # dtype=float keeps the dummy columns numeric. Recent pandas versions return bool
    # dummies by default, which select_dtypes(include=np.number) would silently skip.
    train_features_encoded = pd.get_dummies(train_features, columns=categorical_cols, dummy_na=True, drop_first=True, dtype=float)
    numerical_cols = train_features_encoded.select_dtypes(include=np.number).columns

    # Standardize, fill missing values with 0 (the mean after scaling), and drop constant columns.
    scaler = StandardScaler()
    train_features_scaled = scaler.fit_transform(train_features_encoded[numerical_cols])
    train_features_final = pd.DataFrame(train_features_scaled, index=train_features_encoded.index, columns=numerical_cols).fillna(0)
    cols_to_keep = train_features_final.columns[train_features_final.var() != 0]
    train_features_final = train_features_final[cols_to_keep]

    test_processed_features = None
    if test_features_df is not None:
        # Apply the same steps to the test set, reusing the training levels, columns, and scaler.
        test_features = test_features_df.drop(columns=cols_to_drop, errors="ignore")
        for col in categorical_cols:
            test_features[col] = test_features[col].where(test_features[col].isin(top_categories[col]), "Other")
        # No drop_first here: the reindex below keeps exactly the training columns, so the
        # reference level dropped in training is dropped again even if the test set lacks it.
        test_features_encoded = pd.get_dummies(test_features, columns=categorical_cols, dummy_na=True, dtype=float)
        train_cols = train_features_encoded.columns
        test_features_aligned = test_features_encoded.reindex(columns=train_cols, fill_value=0)
        test_features_scaled = scaler.transform(test_features_aligned[numerical_cols])
        test_features_final = pd.DataFrame(test_features_scaled, index=test_features_aligned.index, columns=numerical_cols).fillna(0)
        test_processed_features = test_features_final[cols_to_keep]
    return train_features_final, train_targets, test_processed_features
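The part of this function that most often goes wrong is keeping the test columns identical to the training columns after one-hot encoding. The sketch below isolates that step on toy data. The column names `coin` and `amount` and their values are hypothetical, and the snippet is an illustration of the idea rather than a replacement for `preprocess`.

```python
import pandas as pd

# Toy data: the test set lacks level "A" and contains an unseen level "Z".
train = pd.DataFrame({"coin": ["A", "B", "C", "B"], "amount": [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({"coin": ["B", "Z", "C"], "amount": [5.0, 6.0, 7.0]})

# Levels seen in training; anything else in the test set is mapped to "Other".
known_levels = train["coin"].unique()
test["coin"] = test["coin"].where(test["coin"].isin(known_levels), "Other")

# Encode both frames, then force the test frame onto the training columns.
train_encoded = pd.get_dummies(train, columns=["coin"], dtype=float)
test_encoded = pd.get_dummies(test, columns=["coin"], dtype=float)
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0.0)

print(list(test_aligned.columns))
print(test_aligned)
```

After the `reindex`, the test frame has a `coin_A` column filled with zeros and no `coin_Other` column, so a model trained on `train_encoded` sees the feature layout it expects.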

## Step 3: Loop, Train, and Save Predictions

This is the main part of the notebook. We will loop through all 16 tasks. For each task, we will:
1. Load the training data and the test features.
2. Preprocess both.
3. Train a model on the training data.
4. Generate predictions on the processed test features.
5. Save the predictions to a CSV file with the correct name.
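The training step depends on how the targets are encoded. XGBoost's `survival:cox` objective reads a single label per row: the duration, with right-censored rows given a negative sign. A minimal sketch of that encoding on hypothetical toy arrays follows. It assumes `status == 1` marks an observed event, which the dataset description should confirm.

```python
import numpy as np

# Hypothetical toy targets: durations and event indicators
# (assumed convention: 1 = event observed, 0 = right-censored).
durations = np.array([5.0, 12.0, 3.0, 8.0])
status = np.array([1, 0, 1, 0])

# survival:cox convention: a positive label is an observed event time,
# a negative label is a right-censored time.
cox_labels = np.where(status == 1, durations, -durations)
print(cox_labels)
```

The resulting array would be passed as the label, for example `model.fit(X_train, cox_labels)`, with no `sample_weight` needed to carry the durations.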

In [19]:
# Define path to the single participant data folder.
DATA_PATH = "./data/"
CACHE_DIR = "./cache/"
os.makedirs(CACHE_DIR, exist_ok=True)


def get_model_for_pair_and_date(
    index_event: str, outcome_event: str, model_date: Optional[int] = None
) -> XGBRegressor:
    # A .json extension selects XGBoost's JSON model format explicitly.
    model_filename = f"xgboost_cox_{index_event}_{outcome_event}_{model_date}.json"
    model_path = os.path.join(CACHE_DIR, model_filename)

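    if os.path.exists(model_path):
        # load_model fills the estimator in place and returns None,
        # so build the estimator first and return it afterwards.
        model = XGBRegressor()
        model.load_model(model_path)
        return model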

    dataset_path = os.path.join(index_event, outcome_event)

    # --- Load and Preprocess ---
    train_df = pd.read_csv(os.path.join(DATA_PATH, dataset_path, "data.csv"))
    train_df = (
        train_df[train_df["timestamp"] + train_df["timeDiff"] <= model_date]
        if model_date
        else train_df
    )

    X_train, y_train, _ = preprocess(train_df)

    # --- Train Model ---
    # Prepare target variables for Cox regression
    y_train_duration = y_train["timeDiff"].values
    y_train_event = y_train["status"].values

    # Create model with Cox objective
    model = XGBRegressor(
        objective="survival:cox",
        eval_metric="cox-nloglik",
        max_depth=6,
        learning_rate=0.1,
        n_estimators=100,
        random_state=42,
        verbosity=0,
    )

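    # Fit model: the survival:cox objective expects the label to be the duration,
    # with right-censored rows encoded as negative values.
    # Assumption: status == 1 means the outcome event was observed.
    y_train_cox = np.where(y_train_event == 1, y_train_duration, -y_train_duration)
    model.fit(X_train, y_train_cox)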

    # Save model: prefer the sklearn wrapper's save_model, fall back to Booster.save_model
    try:
        model.save_model(model_path)
    except Exception:
        model.get_booster().save_model(model_path)
    return model


# Define all 16 event pairs
index_events = ["Borrow", "Deposit", "Repay", "Withdraw"]
outcome_events = index_events + ["Liquidated"]
event_pairs = []
for index_event in index_events:
    for outcome_event in outcome_events:
        if index_event == outcome_event:
            continue
        event_pairs.append((index_event, outcome_event))

for index_event, outcome_event in event_pairs:
    print(f"\n{'='*50}")
    print(f"Training for: {index_event} -> {outcome_event}")
    print(f"{'='*50}")

    get_model_for_pair_and_date(index_event, outcome_event, 1751328000)

print("\n\nAll prediction files have been generated.")


Training for: Borrow -> Deposit
Training for: Borrow -> Repay
Training for: Borrow -> Withdraw
Training for: Borrow -> Liquidated
Training for: Deposit -> Borrow
Training for: Deposit -> Repay
Training for: Deposit -> Withdraw
Training for: Deposit -> Liquidated
Training for: Repay -> Borrow
Training for: Repay -> Deposit
Training for: Repay -> Withdraw
Training for: Repay -> Liquidated
Training for: Withdraw -> Borrow
Training for: Withdraw -> Deposit
Training for: Withdraw -> Repay
Training for: Withdraw -> Liquidated

All prediction files have been generated.
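The cell above trains and caches one model per event pair. Writing the 16 CSV files and zipping them could look like the sketch below. The file name pattern `<Index>_<Outcome>.csv`, the `prediction` column, and the archive name `submission.zip` are hypothetical placeholders, so substitute whatever the competition rules specify. Random numbers stand in for `model.predict(X_test)`, which for `survival:cox` returns one risk score per row on the hazard-ratio scale.

```python
import os
import shutil
import tempfile

import numpy as np
import pandas as pd

index_events = ["Borrow", "Deposit", "Repay", "Withdraw"]
outcome_events = index_events + ["Liquidated"]
event_pairs = [(i, o) for i in index_events for o in outcome_events if i != o]

# Write into a temporary folder so the sketch leaves the project directory untouched.
work_dir = tempfile.mkdtemp()
pred_dir = os.path.join(work_dir, "predictions")
os.makedirs(pred_dir)

rng = np.random.default_rng(0)
for index_event, outcome_event in event_pairs:
    # Stand-in for model.predict(X_test): one positive risk score per test row.
    risk_scores = rng.random(5)
    # Hypothetical file and column names; use the ones the competition requires.
    out_name = f"{index_event}_{outcome_event}.csv"
    pd.DataFrame({"prediction": risk_scores}).to_csv(os.path.join(pred_dir, out_name), index=False)

# Zip the folder's contents; this creates <work_dir>/submission.zip.
zip_path = shutil.make_archive(os.path.join(work_dir, "submission"), "zip", root_dir=pred_dir)
print(len(os.listdir(pred_dir)), "files ->", zip_path)
```

Passing `root_dir=pred_dir` places the CSV files at the top level of the archive rather than inside a nested folder, which is the layout most submission checkers expect.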
