# FinSurvival Competition: Starter Notebook (AFT Model Prediction Submission)

**Objective:** This notebook provides a workflow for creating a valid prediction submission using the `WeibullAFTFitter` model. The competition requires you to submit a `.zip` file containing 16 separate prediction files in CSV format.

This notebook will guide you through:
1.  Loading the training and test sets for each of the 16 tasks from a single directory.
2.  Training a model (`WeibullAFTFitter` is the nominal example; the training cell in Step 3 uses XGBoost with a Cox objective).
3.  Generating predictions on the test set in the required format.
4.  Saving each set of predictions to a correctly named CSV file.
5.  Zipping all 16 prediction files for submission.
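
Steps 4 and 5 can be sketched with the standard library alone. The following is a minimal sketch, not the competition's official tooling: the file-name pattern `{index_event}_{outcome_event}.csv`, the single `prediction` column, the archive name, and both helper names are assumptions made for illustration. Check the competition rules for the required names and layout.

```python
import csv
import zipfile
from pathlib import Path
from typing import Dict, Iterable, Tuple


def write_prediction_csv(path: Path, predictions: Iterable[float]) -> None:
    """Write one prediction per row under a hypothetical 'prediction' header."""
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prediction"])  # assumed column name
        for value in predictions:
            writer.writerow([value])


def build_submission(
    predictions_by_task: Dict[Tuple[str, str], Iterable[float]],
    out_dir: Path,
    zip_name: str = "submission.zip",  # assumed archive name
) -> Path:
    """Write one CSV per task, then bundle all CSVs into a single zip file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_paths = []
    for (index_event, outcome_event), preds in predictions_by_task.items():
        # Hypothetical naming scheme; replace it with the required one.
        csv_path = out_dir / f"{index_event}_{outcome_event}.csv"
        write_prediction_csv(csv_path, preds)
        csv_paths.append(csv_path)

    zip_path = out_dir / zip_name
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for csv_path in csv_paths:
            # arcname keeps each CSV at the top level of the archive.
            zf.write(csv_path, arcname=csv_path.name)
    return zip_path
```

Calling `build_submission({("Borrow", "Repay"): preds}, Path("submission"))` would write `submission/Borrow_Repay.csv` and return the path to `submission/submission.zip`. Collecting predictions in a dictionary keyed by task makes it easy to confirm that the expected number of files is present before zipping.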

## Step 1: Setup and Imports

In [1]:
# Install required packages
# pip install -q pandas lifelines==0.27.8 scikit-learn==1.2.2 scikit-survival==0.21.0 numpy "xgboost>=2.0"
# xgboost>=2.0 is needed for the "device" parameter used in the training cell.

# Import libraries
import pandas as pd
import numpy as np
import os
import shutil
from lifelines import WeibullAFTFitter
from lifelines.exceptions import ConvergenceError
from sklearn.preprocessing import StandardScaler
from typing import Tuple, Optional
from utils.constants import *  # project-local constants; presumably defines `seed`, which the training cell uses
import pickle as pkl
import xgboost as xgb
from sksurv.metrics import concordance_index_censored
from utils.model_training import preprocess, get_best_params

# Module-level cache for loaded/trained models to reuse across calls
MODELS_CACHE: dict = {}

# Module-level cache for preprocessing artifacts (scaler, columns, categories)
PREPROCESS_CACHE: dict = {}

def get_concordance_index(
    test_df: pd.DataFrame, 
    predictions: np.ndarray
) -> float:
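    """
    Calculates the concordance index for survival models using scikit-survival.

    `predictions` are treated as survival scores (higher means longer survival);
    they are negated below because `concordance_index_censored` expects risk scores.
    NaN predictions are replaced with -1.
    """
    # Replace NaN predictions with -1 so that failed predictions do not cause numerical errors.
    # Note that -1 is the shortest-survival score only when all valid predictions are >= -1.
    # np.where returns a new array, so the caller's array is not modified in place.
    predictions = np.where(np.isnan(predictions), -1, predictions)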
    
    event_indicator = test_df['status'].astype(bool)
    event_time = test_df['timeDiff']

    # Handle cases where all events are censored or all are non-censored in the test set
    if len(np.unique(event_indicator)) == 1:
        return 0.5  # Return a neutral score

    c_index, _, _, _, _ = concordance_index_censored(
        event_indicator, event_time, -predictions
    )
    
    return c_index
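
To make the metric and its sign convention concrete, here is a minimal pure-Python sketch of Harrell's concordance index for right-censored data. It is an illustration only, not scikit-survival's implementation: the function name `naive_concordance_index` is hypothetical, the toy data is made up, and tie handling is simplified (pairs with equal times are skipped).

```python
from typing import Sequence


def naive_concordance_index(
    event: Sequence[bool],
    time: Sequence[float],
    risk: Sequence[float],
) -> float:
    """Fraction of comparable pairs in which the higher-risk subject fails first.

    A pair (i, j) is comparable when subject i has an observed event and
    time[i] < time[j]. Tied risk scores count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        return 0.5  # neutral score when no pair can be compared
    return concordant / comparable


# Toy data: survival scores where a higher value means longer survival.
event = [True, True, False, True]
time = [2.0, 5.0, 6.0, 9.0]
survival_score = [0.1, 0.4, 0.7, 0.9]

# Negate survival scores to obtain risk scores, as get_concordance_index does.
risk = [-s for s in survival_score]
print(naive_concordance_index(event, time, risk))
```

Passing the survival scores without negating them would invert the ranking and push the score toward 0 instead of 1, which is why the sign of the predictions matters when calling `concordance_index_censored`.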

## Step 3: Loop, Train, and Save Predictions

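This is the main part of the notebook. We will loop through the index/outcome event pairs, skipping any pair whose data file is missing. For each remaining task, we will:
1. Load the task's `data.csv` and split it by timestamp into training data and test features.
2. Preprocess both.
3. Train a model on the training data.
4. Generate predictions on the processed test features.
5. Save the predictions to a CSV file with the correct name. Note that the cell below stops at reporting the concordance index on the held-out split and does not write any files.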
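
The training cell below builds its own train/test split by holding out the last 30 days of records before a fixed reference timestamp. The arithmetic can be sketched as follows. This is a minimal, standard-library illustration: the names `split_by_cutoff` and `rows` are hypothetical, the example rows are made up, and it assumes the `timestamp` column holds Unix time in seconds.

```python
from datetime import datetime, timezone
from typing import Dict, List, Tuple

REFERENCE_TIMESTAMP = 1722526142      # value used in the training cell
BUFFER_DURATION = 30 * 24 * 60 * 60   # 30 days expressed in seconds
TRAIN_CUTOFF = REFERENCE_TIMESTAMP - BUFFER_DURATION


def split_by_cutoff(
    rows: List[Dict[str, float]], cutoff: int
) -> Tuple[List[Dict[str, float]], List[Dict[str, float]]]:
    """Rows at or before the cutoff go to train; later rows go to test."""
    train = [r for r in rows if r["timestamp"] <= cutoff]
    test = [r for r in rows if r["timestamp"] > cutoff]
    return train, test


# Hypothetical rows for illustration only.
rows = [
    {"timestamp": TRAIN_CUTOFF - 10, "timeDiff": 120.0, "status": 1},
    {"timestamp": TRAIN_CUTOFF, "timeDiff": 45.0, "status": 0},
    {"timestamp": TRAIN_CUTOFF + 1, "timeDiff": 300.0, "status": 1},
]
train_rows, test_rows = split_by_cutoff(rows, TRAIN_CUTOFF)

print(TRAIN_CUTOFF)
print(datetime.fromtimestamp(TRAIN_CUTOFF, tz=timezone.utc).isoformat())
print(len(train_rows), len(test_rows))
```

Splitting on time rather than at random keeps every test record later than every training record, so the local score is closer to how the model would be used on future data.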

In [None]:
# Define path to the single participant data folder.
PARTICIPANT_DATA_PATH = "./data/"

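# Build every (index, outcome) combination: 5 x 5 = 25 candidate pairs.
# Pairs without a data file are skipped in the loop below.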
index_events = ["Borrow", "Deposit", "Repay", "Withdraw", "Liquidated"]
outcome_events = index_events
event_pairs = [
    (index_event, outcome_event)
    for index_event in index_events
    for outcome_event in outcome_events
]

for index_event, outcome_event in event_pairs:
    print(f"\n{'='*50}")
    print(f"Processing and Predicting for: {index_event} -> {outcome_event}")
    print(f"{'='*50}")

    dataset_path = os.path.join(index_event, outcome_event)

    # --- Load and Preprocess ---
    try:
        data_df = pd.read_csv(
            os.path.join(PARTICIPANT_DATA_PATH, dataset_path, "data.csv")
        )
    except FileNotFoundError as e:
        print(f"Data not found for {dataset_path}. Skipping.")
        continue

    if data_df.empty:
        print(f"No rows found for {dataset_path}. Skipping.")
        continue

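    # Hold out the final 30 days (expressed in seconds) before the reference
    # timestamp as a local test split; `timestamp` is assumed to be Unix time in seconds.
    buffer_duration = 30 * 60 * 60 * 24
    train_cutoff = 1722526142 - buffer_duration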

    train_df = data_df[data_df["timestamp"] <= train_cutoff]
    test_df = data_df[data_df["timestamp"] > train_cutoff]
    reference_cols = ["timeDiff", "status"]
    feature_cols = [col for col in train_df.columns if col not in reference_cols]
    test_features_df = test_df[feature_cols]
    test_references_df = test_df[reference_cols]

    X_train, y_train, X_test_processed, _ = preprocess(train_df, test_features_df)

    # --- Train Model ---
    try:
        params = {
            "objective": "survival:cox",
            "eval_metric": "cox-nloglik",
            "device": "cuda",
            "tree_method": "hist",
            "seed": seed,
            "verbosity": 1,
            "max_bin": 64,
            "learning_rate": 0.04,
            "max_depth": 5,
            "subsample": 0.85,
            "colsample_bytree": 0.8,
            "min_child_weight": 5,
            "reg_lambda": 1.0,
            "reg_alpha": 0.1,
        }
        model = xgb.train(
            params,
            X_train,
            num_boost_round=1000,
            evals=[(X_train, "train")],
            verbose_eval=100,
        )

        # --- Generate and Save Predictions ---
        print(f"Generating predictions for {dataset_path}...")
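        # With "survival:cox", XGBoost returns predictions on the hazard-ratio scale
        # (higher means higher risk), so negate them: higher now means longer survival.
        predictions = -model.predict(X_test_processed)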

        print("Calculating Concordance Index...")
        print(
            f"Concordance index: {get_concordance_index(test_references_df, predictions)}"
        )

    except (ConvergenceError, ValueError, xgb.core.XGBoostError) as e:
        print(
            f"\nERROR: The model for {dataset_path} failed to train or predict. Skipping this task."
        )
        print(f"Details: {e}")

print("\n\nAll tasks have been processed.")


Processing and Predicting for: Borrow -> Borrow


Parameters: { "predictor" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


[0]	train-cox-nloglik:13.02686
[100]	train-cox-nloglik:12.46353
[200]	train-cox-nloglik:12.43498
[300]	train-cox-nloglik:12.42362
[400]	train-cox-nloglik:12.41585
[500]	train-cox-nloglik:12.41030
[600]	train-cox-nloglik:12.40585
[700]	train-cox-nloglik:12.40168
[800]	train-cox-nloglik:12.39794
[900]	train-cox-nloglik:12.39456
[999]	train-cox-nloglik:12.39122
Generating predictions for Borrow/Borrow...
Calculating Concordance Index...
Concordance index: 0.7580991696462193

Processing and Predicting for: Borrow -> Deposit


Parameters: { "predictor" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


[0]	train-cox-nloglik:13.34857
[100]	train-cox-nloglik:12.62611
[200]	train-cox-nloglik:12.52044
[300]	train-cox-nloglik:12.45962
[400]	train-cox-nloglik:12.41861
[500]	train-cox-nloglik:12.38611
[600]	train-cox-nloglik:12.35869
[700]	train-cox-nloglik:12.33761
[800]	train-cox-nloglik:12.31973
[900]	train-cox-nloglik:12.30231
[999]	train-cox-nloglik:12.28820
Generating predictions for Borrow/Deposit...
Calculating Concordance Index...
Concordance index: 0.7976878511868747

Processing and Predicting for: Borrow -> Repay


Parameters: { "predictor" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


[0]	train-cox-nloglik:12.90863
[100]	train-cox-nloglik:12.09682
[200]	train-cox-nloglik:12.04188
[300]	train-cox-nloglik:12.02712
[400]	train-cox-nloglik:12.01904
[500]	train-cox-nloglik:12.01326
[600]	train-cox-nloglik:12.00856
[700]	train-cox-nloglik:12.00425
[800]	train-cox-nloglik:12.00021
[900]	train-cox-nloglik:11.99628
[999]	train-cox-nloglik:11.99286
Generating predictions for Borrow/Repay...
Calculating Concordance Index...
Concordance index: 0.8316366275053256

Processing and Predicting for: Borrow -> Withdraw


Parameters: { "predictor" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


[0]	train-cox-nloglik:13.36746
[100]	train-cox-nloglik:12.51423
[200]	train-cox-nloglik:12.42420
[300]	train-cox-nloglik:12.36258
[400]	train-cox-nloglik:12.31621
[500]	train-cox-nloglik:12.28470
[600]	train-cox-nloglik:12.25637
[700]	train-cox-nloglik:12.23250
[800]	train-cox-nloglik:12.21064
[900]	train-cox-nloglik:12.19166
[999]	train-cox-nloglik:12.17420
Generating predictions for Borrow/Withdraw...
Calculating Concordance Index...
Concordance index: 0.7893704199101154

Processing and Predicting for: Borrow -> Liquidated


Parameters: { "predictor" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


[0]	train-cox-nloglik:12.99286
[100]	train-cox-nloglik:11.48899
[200]	train-cox-nloglik:11.13069
[300]	train-cox-nloglik:10.89325
[400]	train-cox-nloglik:10.69311
[500]	train-cox-nloglik:10.52992
[600]	train-cox-nloglik:10.40198
[700]	train-cox-nloglik:10.28913
[800]	train-cox-nloglik:10.19518
[900]	train-cox-nloglik:10.11120
[999]	train-cox-nloglik:10.04227
Generating predictions for Borrow/Liquidated...
Calculating Concordance Index...
Concordance index: 0.7793637427524134

Processing and Predicting for: Deposit -> Borrow


Parameters: { "predictor" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


[0]	train-cox-nloglik:14.25664
[100]	train-cox-nloglik:13.09202
[200]	train-cox-nloglik:13.00774
[300]	train-cox-nloglik:12.95082
[400]	train-cox-nloglik:12.91094
[500]	train-cox-nloglik:12.88164
[600]	train-cox-nloglik:12.85799
[700]	train-cox-nloglik:12.83585
[800]	train-cox-nloglik:12.81678
[900]	train-cox-nloglik:12.79684
[999]	train-cox-nloglik:12.78206
Generating predictions for Deposit/Borrow...
Calculating Concordance Index...
Concordance index: 0.8977365094986042

Processing and Predicting for: Deposit -> Deposit


Parameters: { "predictor" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


[0]	train-cox-nloglik:13.87636
[100]	train-cox-nloglik:13.14358
[200]	train-cox-nloglik:13.10512
[300]	train-cox-nloglik:13.09325
[400]	train-cox-nloglik:13.08671
[500]	train-cox-nloglik:13.08099
[600]	train-cox-nloglik:13.07662
[700]	train-cox-nloglik:13.07295
[800]	train-cox-nloglik:13.06954
[900]	train-cox-nloglik:13.06669
[999]	train-cox-nloglik:13.06428
Generating predictions for Deposit/Deposit...
Calculating Concordance Index...


KeyboardInterrupt: 