# Class req



## Mid-term submission
1. Documentation
2. Cover letter
3. Abstract
4. Problem statement
5. Tools and Technologies to be used
6. Dataset
7. Team Name and responsibilities of each team member.
8. Presentation

## Final submission
1. Documentation
2. Briefly describe each topic mentioned in the documentation of mid-term
submission.
3. Methods (how you trained data, why you used particular tools and
technologies rather than other tools and technologies, etc.)
4. Result (Screenshots)
5. Conclusion
6. Presentation with Demo

# CIBMTR - Equity in post-HCT Survival Predictions
This notebook is our main workspace for the Kaggle competition [CIBMTR - Equity in post-HCT Survival Predictions](https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/team).

This is for UofMDearbornTeam1's submission.

## Introduction

This is where we can all work together on different portions of the code, which can be broken off into separate files, notebooks, or Python scripts if needed. Otherwise we will keep everything in a single Python notebook hosted here on Google Colab.


Make your code as self-describing as possible. Where needed, add a text description like the one here, or comments in the code itself. Using '#', '##', '###', etc. to mark different sections is much appreciated.

If you've never used Colab or Jupyter notebooks before, or you just forgot how to add text or code blocks, hover your mouse over the top or bottom center of an existing text or code block. You'll then be able to add a new text box or code block with a single click. You can also do this from the Insert menu in the toolbar, or with the buttons just below it.

A great place to start finding what we need to do is to look here:
https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/549968

You can see other people's submissions here:
https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/leaderboard

A guide on getting started with the Kaggle API can be found here:

https://www.kaggle.com/competitions/rsna-2024-lumbar-spine-degenerative-classification/code

The GitHub page for the Kaggle API (kagglehub) is here:

https://github.com/Kaggle/kagglehub?tab=readme-ov-file#authenticate

## Previous Works
This is where you can list previous work similar to our own that can be applied to this project.

### Example 1

## Prerequisite Installs and Imports
A set of scripts that might need to be run before your environment works properly. This was written with Colab in mind and might not be needed elsewhere. The competition does not include internet access for its analysis section, which begins after the competition is over, so any additional library you need has to be added by CIBMTR/Kaggle in order to run. If you regularly have to install certain libraries, please raise a request with CIBMTR/Kaggle to add them to the competition environment.

Start by adding your library to the Imports section. If it runs natively, you may not need an install. If it does not, or if you get errors about outdated versions, consider adding it to the Installs section.

### Installs

In [None]:
%pip install kagglehub
%pip install kaggle



### Imports

In [None]:
import kagglehub, shutil, numpy as np, os, pandas as pd
import shap, torch, torch.nn as nn
from google.colab import userdata, drive, files
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

## Kaggle API Access Codeblock
Several options are included here because both Google and Kaggle keep changing what they will and won't allow, forcing us to change how we log in. Option 5 works for me right now. This page lists other methods people have used:
https://www.kaggle.com/discussions/general/51898


### Option 1
Log in using the credentials generated from Kaggle. These are different from the username and password that you use to sign in to Kaggle.

Instead you will...
1. Go to kaggle.com
2. Select your profile
3. Go to Settings
4. Generate new API keys (or reuse existing ones if you have already generated them).
5. Kaggle will generate and download a 'kaggle.json' file that contains your username and key for the API.

In [None]:
#kagglehub.login()
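
The downloaded `kaggle.json` is a small JSON file with `username` and `key` fields. As a minimal sketch, assuming that layout, the values can be loaded into the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables. The helper name `load_kaggle_credentials` and its `path` argument are hypothetical and not part of any Kaggle library.

```python
import json
import os


def load_kaggle_credentials(path):
    """Read a kaggle.json file and export its values as environment variables.

    Hypothetical helper: assumes the file holds {"username": ..., "key": ...}.
    """
    with open(path, "r", encoding="utf-8") as handle:
        creds = json.load(handle)
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    # Return only the username so the key is never echoed into the notebook output
    return creds["username"]
```

For example, `load_kaggle_credentials('/content/drive/MyDrive/kaggle.json')` would work if the file were stored at that (assumed) Drive location.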


### Option 2
Hardcode your username and API key (see Option 1 for how to get these) and export them as environment variables.

In [None]:
#os.environ["KAGGLE_USERNAME"] = "your_un"
#os.environ["KAGGLE_KEY"] = "your_generated_api_key/token"

### Option 3
Upload your kaggle.json file to this path:

~/.kaggle/kaggle.json

By default this is where `!kaggle` commands look for your credentials.

If you're going to use this option, make sure the file is uploaded to your Google Drive and named 'kaggle.json'; this code will take care of the rest.

To generate this private key:
1. Log in to Kaggle
2. Click your profile in the upper right corner
3. Click Settings
4. Scroll down to API and create a new token

In [None]:
#drive.mount('/content/drive')
#SourceFile = '/content/drive/MyDrive/kaggle.json'
#!mkdir -p /root/.kaggle
#DestinationFile = '/root/.kaggle'
#shutil.copy(SourceFile,DestinationFile)

### Option 4
Update your Secrets page (the key symbol on the left) to include your Kaggle credentials:
 1. Click the little key on the left tab (default location and icon in Google Colab)
 2. Add a new secret (yes, use all capitals for the Name)
 3. Add KAGGLE_USERNAME
 4. Add KAGGLE_KEY
 5. Toggle Notebook access to on


In [None]:
#os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')
#os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')

### Option 5
Change the Kaggle CLI configuration directory:
1. Log in to Kaggle
2. Click your profile in the upper right corner
3. Click Settings
4. Scroll down to API and create a new token
5. Move that token to your Google Drive under the folder .kaggle

(If you don't have the folder on your Drive, create it, or run this script once and it will create it for you, even though you'll get an error.)

In [None]:
drive.mount('/content/drive')
!mkdir -p /content/drive/MyDrive/.kaggle
os.environ['KAGGLE_CONFIG_DIR'] = "/content/drive/MyDrive/.kaggle/"

Mounted at /content/drive


## Data
The following code blocks are dedicated to data loading, cleaning, and normalization.


### Data Loading
Use this code to load the data using the API.

#### Download the Data from Kaggle
You only need to run this once, but it is safe to run it every time (details below).

When you download, by default, the data is put on your Google Drive under the following file structure:

- zip file ---- MyDrive/equity-post-HCT-survival-predictions.zip
- unzipped ---- MyDrive/CIBMTR_Equity

With the following files in ../CIBMTR_Equity/:
- data_dictionary.csv
- sample_submission.csv
- test.csv
- train.csv


The script checks for updates and skips the download if you already have the most up-to-date zip file. Likewise, it compares the unzipped files to the files already in the directory and replaces any that are older than the archive's copy, so you may want to save your work in a different subdirectory or use a different file to manipulate your data as you go.

In [None]:
# Data load
# Use the Drive folders and create CIBMTR_Equity/dataset if it doesn't exist
%cd /content/drive/MyDrive
%mkdir -p CIBMTR_Equity/dataset
%cd /content/drive/MyDrive/CIBMTR_Equity/dataset

# Download and unzip the data from Kaggle
# (near-instant when the local copy is already current)
%cd /content/drive/MyDrive
!kaggle competitions download -c equity-post-HCT-survival-predictions
!unzip -u equity-post-HCT-survival-predictions.zip -d /content/drive/MyDrive/CIBMTR_Equity

%cd CIBMTR_Equity/

/content/drive/MyDrive
/content/drive/MyDrive/CIBMTR_Equity/dataset
/content/drive/MyDrive
equity-post-HCT-survival-predictions.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  equity-post-HCT-survival-predictions.zip
/content/drive/MyDrive/CIBMTR_Equity
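
After the unzip step it can be worth confirming that the four files listed above actually landed in the target folder. A minimal sketch, using only the standard library; the helper name `missing_files` and the `EXPECTED_FILES` constant are hypothetical.

```python
from pathlib import Path

# File names taken from the list above
EXPECTED_FILES = [
    "data_dictionary.csv",
    "sample_submission.csv",
    "test.csv",
    "train.csv",
]


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_files('/content/drive/MyDrive/CIBMTR_Equity')` should return an empty list when the download and unzip both succeeded.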


#### Load the data into a pandas dataframe
Now that we have the data, let's load it into pandas so we can look at it.

In [None]:
dfDataDict = pd.read_csv('data_dictionary.csv')
dfSample = pd.read_csv('sample_submission.csv')
dfTest = pd.read_csv('test.csv')
dfTrain = pd.read_csv('train.csv')

### Data Cleaning
This is the code block where we will clean the data. Here we may have to remove bad data, change formats, check for errors, separate different forms of data, etc.
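
A reasonable first check when cleaning is how much of each column is missing. A minimal sketch, assuming the data is held in a pandas DataFrame such as `dfTrain`; `missing_summary` is a hypothetical helper name. It counts only true NaN values, not placeholder strings such as 'Not done'.

```python
import pandas as pd


def missing_summary(df):
    """Return the fraction of missing values per column, largest first.

    Columns without any missing values are left out of the result.
    """
    fractions = df.isna().mean()
    return fractions[fractions > 0].sort_values(ascending=False)
```

For example, `missing_summary(dfTrain).head(10)` would list the ten columns with the largest share of missing entries.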

#### Data Dictionary

tbi = total body irradiation

In [None]:
# Data Cleaning
# First, let's show a sample of each file:
dfDataDict

Unnamed: 0,variable,description,type,values
0,dri_score,Refined disease risk index,Categorical,['Intermediate' 'High' 'N/A - non-malignant in...
1,psych_disturb,Psychiatric disturbance,Categorical,['Yes' 'No' nan 'Not done']
2,cyto_score,Cytogenetic score,Categorical,['Intermediate' 'Favorable' 'Poor' 'TBD' nan '...
3,diabetes,Diabetes,Categorical,['No' 'Yes' nan 'Not done']
4,hla_match_c_high,Recipient / 1st donor allele level (high resol...,Numerical,
5,hla_high_res_8,Recipient / 1st donor allele-level (high resol...,Numerical,
6,tbi_status,TBI,Categorical,"['No TBI' 'TBI + Cy +- Other' 'TBI +- Other, <..."
7,arrhythmia,Arrhythmia,Categorical,['No' nan 'Yes' 'Not done']
8,hla_low_res_6,Recipient / 1st donor antigen-level (low resol...,Numerical,
9,graft_type,Graft type,Categorical,['Peripheral blood' 'Bone marrow']


#### Modify Data Dictionary

In [None]:
# Dictionary with additional details for each variable
additional_details = {
    "dri_score": "Refined Disease Risk Index (DRI), categorizing the risk level of a patient's disease.",
    "psych_disturb": "Indicates whether the patient has a psychiatric condition (Yes/No).",
    "cyto_score": "Cytogenetic score assessing chromosomal abnormalities in diseases like leukemia.",
    "diabetes": "Indicates if the patient has been diagnosed with diabetes (Yes/No).",
    "arrhythmia": "Presence of cardiac arrhythmia (irregular heartbeats).",
    "vent_hist": "History of mechanical ventilation support.",
    "renal_issue": "Severity of renal (kidney) dysfunction.",
    "pulm_severe": "Indicates severe pulmonary (lung) disease.",
    "prim_disease_hct": "Primary disease leading to Hematopoietic Cell Transplantation (HCT).",
    "obesity": "Indicates if the patient is obese.",
    "hepatic_severe": "Presence of moderate/severe liver disease.",
    "prior_tumor": "History of prior solid tumors (non-blood cancers).",
    "peptic_ulcer": "History of peptic ulcer disease (PUD).",
    "age_at_hct": "Patient’s age at the time of HCT.",
    "rheum_issue": "Presence of rheumatologic conditions (e.g., lupus, rheumatoid arthritis).",
    "efs": "Event-Free Survival (EFS), indicating if the patient remained free of relapse or complications.",
    "efs_time": "Time to event-free survival in months.",
    "graft_type": "Type of graft used in transplantation (e.g., bone marrow, peripheral blood stem cells).",
    "conditioning_intensity": "Intensity of chemotherapy/radiation regimen before transplantation.",
    "tbi_status": "Whether Total Body Irradiation (TBI) was used in conditioning.",
    "rituximab": "Use of Rituximab, a monoclonal antibody in conditioning therapy.",
    "prod_type": "Type of cell product used (bone marrow, peripheral blood, or cord blood).",
    "melphalan_dose": "Dose of Melphalan chemotherapy drug (mg/m²).",
    "hla_nmdp_6": "HLA matching based on National Marrow Donor Program (NMDP) classification (A, B, DRB1).",
    "hla_match_a_high": "High-resolution HLA-A match between recipient and donor.",
    "hla_match_b_high": "High-resolution HLA-B match.",
    "hla_match_c_high": "High-resolution HLA-C match.",
    "hla_match_drb1_high": "High-resolution HLA-DRB1 match.",
    "hla_match_dqb1_high": "High-resolution HLA-DQB1 match.",
    "tce_match": "T-Cell Epitope (TCE) matching between donor and recipient.",
    "comorbidity_score": "Hematopoietic Cell Transplantation Comorbidity Index (HCT-CI) by Sorror, assessing additional health risks before transplant.",
    "karnofsky_score": "Karnofsky Performance Status (KPS) at HCT, measuring patient functional ability (scale 0-100).",
    "hepatic_mild": "Indicates mild liver disease.",
    "pulm_moderate": "Indicates moderate pulmonary disease.",
    "year_hct": "Year when the Hematopoietic Cell Transplantation (HCT) was performed.",
    "sex_match": "Sex matching between donor and recipient.",
    "race_group": "Race classification of the patient.",
    "ethnicity": "Ethnicity classification of the patient.",
    "cmv_status": "Cytomegalovirus (CMV) serostatus of donor and recipient.",
    "hla_low_res_6": "HLA Low-Resolution Matching at A, B, and DRB1.",
    "hla_low_res_8": "HLA Low-Resolution Matching at A, B, C, and DRB1.",
    "hla_low_res_10": "HLA Low-Resolution Matching at A, B, C, DRB1, and DQB1.",
    "hla_high_res_6": "HLA High-Resolution Matching at A, B, and DRB1.",
    "hla_high_res_8": "HLA High-Resolution Matching at A, B, C, and DRB1.",
    "hla_high_res_10": "HLA High-Resolution Matching at A, B, C, DRB1, and DQB1.",
    "tce_imm_match": "TCE Immunogenicity match assessment.",
    "tce_div_match": "TCE Diversity match assessment.",
    "cyto_score_detail": "Detailed cytogenetics classification for Disease Risk Index (DRI).",
    "donor_age": "Age of the donor at the time of transplant.",
    "donor_related": "Indicates whether the donor was related or unrelated to the patient.",
    "mrd_hct": "Minimal Residual Disease (MRD) status at HCT.",
    "gvhd_proph": "Graft-versus-Host Disease (GVHD) prophylaxis method used.",
    "in_vivo_tcd": "In-Vivo T-Cell Depletion (TCD) method used, such as ATG or alemtuzumab."
}

# Update the description column with additional details
dfDataDict["description"] = dfDataDict["variable"].map(additional_details).fillna(dfDataDict["description"])
dfDataDict.to_csv('DataDictionary.csv')
# Display updated dataframe
dfDataDict


Unnamed: 0,variable,description,type,values
0,dri_score,"Refined Disease Risk Index (DRI), categorizing...",Categorical,['Intermediate' 'High' 'N/A - non-malignant in...
1,psych_disturb,Indicates whether the patient has a psychiatri...,Categorical,['Yes' 'No' nan 'Not done']
2,cyto_score,Cytogenetic score assessing chromosomal abnorm...,Categorical,['Intermediate' 'Favorable' 'Poor' 'TBD' nan '...
3,diabetes,Indicates if the patient has been diagnosed wi...,Categorical,['No' 'Yes' nan 'Not done']
4,hla_match_c_high,High-resolution HLA-C match.,Numerical,
5,hla_high_res_8,"HLA High-Resolution Matching at A, B, C, and D...",Numerical,
6,tbi_status,Whether Total Body Irradiation (TBI) was used ...,Categorical,"['No TBI' 'TBI + Cy +- Other' 'TBI +- Other, <..."
7,arrhythmia,Presence of cardiac arrhythmia (irregular hear...,Categorical,['No' nan 'Yes' 'Not done']
8,hla_low_res_6,"HLA Low-Resolution Matching at A, B, and DRB1.",Numerical,
9,graft_type,"Type of graft used in transplantation (e.g., b...",Categorical,['Peripheral blood' 'Bone marrow']


#### Test Data

In [None]:
dfTest

Unnamed: 0,ID,dri_score,psych_disturb,cyto_score,diabetes,hla_match_c_high,hla_high_res_8,tbi_status,arrhythmia,hla_low_res_6,...,karnofsky_score,hepatic_mild,tce_div_match,donor_related,melphalan_dose,hla_low_res_8,cardiac,hla_match_drb1_high,pulm_moderate,hla_low_res_10
0,28800,N/A - non-malignant indication,No,,No,,,No TBI,No,6.0,...,90.0,No,,Unrelated,"N/A, Mel not given",8.0,No,2.0,No,10.0
1,28801,Intermediate,No,Intermediate,No,2.0,8.0,"TBI +- Other, >cGy",No,6.0,...,90.0,No,Permissive mismatched,Related,"N/A, Mel not given",8.0,No,2.0,Yes,10.0
2,28802,N/A - non-malignant indication,No,,No,2.0,8.0,No TBI,No,6.0,...,90.0,No,Permissive mismatched,Related,"N/A, Mel not given",8.0,No,2.0,No,10.0


#### Train Data

In [None]:
dfTrain

Unnamed: 0,ID,dri_score,psych_disturb,cyto_score,diabetes,hla_match_c_high,hla_high_res_8,tbi_status,arrhythmia,hla_low_res_6,...,tce_div_match,donor_related,melphalan_dose,hla_low_res_8,cardiac,hla_match_drb1_high,pulm_moderate,hla_low_res_10,efs,efs_time
0,0,N/A - non-malignant indication,No,,No,,,No TBI,No,6.0,...,,Unrelated,"N/A, Mel not given",8.0,No,2.0,No,10.0,0.0,42.356
1,1,Intermediate,No,Intermediate,No,2.0,8.0,"TBI +- Other, >cGy",No,6.0,...,Permissive mismatched,Related,"N/A, Mel not given",8.0,No,2.0,Yes,10.0,1.0,4.672
2,2,N/A - non-malignant indication,No,,No,2.0,8.0,No TBI,No,6.0,...,Permissive mismatched,Related,"N/A, Mel not given",8.0,No,2.0,No,10.0,0.0,19.793
3,3,High,No,Intermediate,No,2.0,8.0,No TBI,No,6.0,...,Permissive mismatched,Unrelated,"N/A, Mel not given",8.0,No,2.0,No,10.0,0.0,102.349
4,4,High,No,,No,2.0,8.0,No TBI,No,6.0,...,Permissive mismatched,Related,MEL,8.0,No,2.0,No,10.0,0.0,16.223
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28795,28795,Intermediate - TED AML case <missing cytogenetics,,Favorable,No,2.0,8.0,No TBI,No,6.0,...,Bi-directional non-permissive,,"N/A, Mel not given",8.0,,2.0,No,10.0,0.0,18.633
28796,28796,High,No,Poor,Yes,1.0,4.0,No TBI,No,5.0,...,GvH non-permissive,Related,"N/A, Mel not given",6.0,Yes,1.0,Yes,8.0,1.0,4.892
28797,28797,TBD cytogenetics,,Poor,,2.0,8.0,No TBI,,6.0,...,GvH non-permissive,Unrelated,"N/A, Mel not given",8.0,,2.0,No,10.0,0.0,23.157
28798,28798,N/A - non-malignant indication,No,Poor,No,1.0,4.0,No TBI,No,3.0,...,Permissive mismatched,Related,MEL,4.0,No,1.0,No,5.0,0.0,52.351


#### Sample Submission

In [None]:
dfSample

Unnamed: 0,ID,prediction
0,28800,0.5
1,28801,0.5
2,28802,0.5


### Data Normalization
Before loading data into a neural network we usually have to normalize it to some extent. For raw data this might mean converting categorical values to numeric ones (here, one-hot encoding) and rescaling numeric values, either to a fixed range (min-max scaling, as in the code for this section) or to mean 0 and standard deviation 1.
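
To make the transformations concrete, here is a toy sketch written with NumPy only. The function names `min_max_scale`, `standardize`, and `one_hot` are hypothetical illustrations; the notebook itself relies on scikit-learn's `MinMaxScaler` and `OneHotEncoder` rather than on these helpers.

```python
import numpy as np


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # A constant column carries no information; map it to zeros
        return np.zeros_like(values)
    return (values - values.min()) / span


def standardize(values):
    """Shift to mean 0 and scale to unit (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        return np.zeros_like(values)
    return (values - values.mean()) / std


def one_hot(labels):
    """Map each label to a 0/1 indicator row; returns (categories, matrix)."""
    categories = sorted(set(labels))
    matrix = np.array(
        [[float(label == cat) for cat in categories] for label in labels]
    )
    return categories, matrix
```

Min-max scaling keeps every feature inside [0, 1], which is convenient next to one-hot columns that are already 0 or 1. Standardization is the alternative when a few extreme values would otherwise squeeze the rest of a column into a narrow band.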

In [None]:
# Define columns
categorical_columns = dfTrain.select_dtypes(include=['object']).columns.tolist()
numerical_columns = dfTrain.select_dtypes(exclude=['object']).columns.tolist()

# Remove ID and target columns from features
columns_to_remove = ['ID', 'efs', 'efs_time']
for col in columns_to_remove:
    if col in numerical_columns:
        numerical_columns.remove(col)

# Define target column (focusing on efs_time prediction)
target_column = 'efs_time'

def convert_to_string(df, columns):
    """Convert specified columns to string type"""
    df = df.copy()
    for col in columns:
        df[col] = df[col].astype(str)
    return df

def no_nulls(df, numerical, categorical):
    """Fill nulls: 'Unknown' for categorical columns, the column mean for numerical ones"""
    df = df.copy()
    for column in categorical:
        df[column] = df[column].fillna('Unknown')
    for column in numerical:
        # Same result as df[numerical].mean()[column], without recomputing every mean
        df[column] = df[column].fillna(df[column].mean())
    return df

def preprocess_data(df, numerical_columns, categorical_columns, scaler=None, encoder=None, fit=False):
    """
    Preprocess the data with option to fit or use existing transformers
    """
    df = df.copy()

    # Convert categorical columns to string type
    # (NaN becomes the string 'nan', so the 'Unknown' fill below never applies to these columns)
    df = convert_to_string(df, categorical_columns)

    # Fill missing values
    df = no_nulls(df, numerical_columns, categorical_columns)

    # Scale numerical features
    if numerical_columns:
        if scaler is None:
            if not fit:
                raise ValueError("Pass a fitted scaler or call with fit=True")
            scaler = MinMaxScaler()
            scaled_features = scaler.fit_transform(df[numerical_columns])
        else:
            scaled_features = scaler.transform(df[numerical_columns])

        df[numerical_columns] = scaled_features

    # Encode categorical features
    if categorical_columns:
        if encoder is None:
            if not fit:
                raise ValueError("Pass a fitted encoder or call with fit=True")
            encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
            encoded_features = encoder.fit_transform(df[categorical_columns])
        else:
            encoded_features = encoder.transform(df[categorical_columns])

        encoded_df = pd.DataFrame(
            encoded_features,
            columns=encoder.get_feature_names_out(categorical_columns),
            index=df.index
        )

        # Drop original categorical columns and join encoded features
        df = pd.concat([df.drop(columns=categorical_columns), encoded_df], axis=1)

    return df, scaler, encoder

# Split the data
# NOTE: 'ID' is not dropped here, so it reaches the model as an unscaled feature
y = dfTrain[target_column]
X = dfTrain.drop(columns=[target_column, 'efs'])

X_train_raw, X_val_raw, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Preprocess training data
X_train, scaler, encoder = preprocess_data(
    X_train_raw,
    numerical_columns,
    categorical_columns,
    fit=True
)

# Preprocess validation data
X_val, _, _ = preprocess_data(
    X_val_raw,
    numerical_columns,
    categorical_columns,
    scaler=scaler,
    encoder=encoder
)

# Preprocess test data (reuse the transformers fitted on the training split)
X_test, _, _ = preprocess_data(
    dfTest,
    numerical_columns,
    categorical_columns,
    scaler=scaler,
    encoder=encoder
)

# Set series to dataframes
y_val = pd.DataFrame(y_val)
y_train = pd.DataFrame(y_train)

In [None]:
#Check for mismatched columns
namecompare = X_train.columns.difference(X_val.columns)
print(namecompare)

Index([], dtype='object')


In [None]:
# Ensure no NaN values remain in the data
print("NaN values in X_train:", X_train.isnull().sum().sum())
print("NaN values in y_train:", y_train.isnull().sum().sum())

# List the feature columns (iterating over a DataFrame yields its column names)
for x in X_train.columns:
  print(x)

X_val.info()

NaN values in X_train: 0
NaN values in y_train: 0
ID
hla_match_c_high
hla_high_res_8
hla_low_res_6
hla_high_res_6
hla_high_res_10
hla_match_dqb1_high
hla_nmdp_6
hla_match_c_low
hla_match_drb1_low
hla_match_dqb1_low
year_hct
hla_match_a_high
donor_age
hla_match_b_low
age_at_hct
hla_match_a_low
hla_match_b_high
comorbidity_score
karnofsky_score
hla_low_res_8
hla_match_drb1_high
hla_low_res_10
dri_score_High
dri_score_High - TED AML case <missing cytogenetics
dri_score_Intermediate
dri_score_Intermediate - TED AML case <missing cytogenetics
dri_score_Low
dri_score_Missing disease status
dri_score_N/A - disease not classifiable
dri_score_N/A - non-malignant indication
dri_score_N/A - pediatric
dri_score_TBD cytogenetics
dri_score_Very high
dri_score_nan
psych_disturb_No
psych_disturb_Not done
psych_disturb_Yes
psych_disturb_nan
cyto_score_Favorable
cyto_score_Intermediate
cyto_score_Normal
cyto_score_Not tested
cyto_score_Other
cyto_score_Poor
cyto_score_TBD
cyto_score_nan
diabetes_No
diab

#### Build Tensors

In [None]:
#Check the shape of each dataset
print(f"X_train shape: {X_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_val shape: {y_val.shape}")

X_train shape: (23040, 214)
X_val shape: (5760, 214)
y_train shape: (23040, 1)
y_val shape: (5760, 1)


In [None]:
# Convert to tensors
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).reshape(-1, 1)
X_val_tensor = torch.tensor(X_val.values, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val.values, dtype=torch.float32).reshape(-1, 1)
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)

## Model
Here we train and test the model

### Model Definition

In [None]:
class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.leaky_relu = nn.LeakyReLU()
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out = self.fc1(x)
        out = self.leaky_relu(out)
        out = self.dropout(out)
        out = self.fc2(out)
        return out

# Define the model

# Number of features
input_size = X_train.shape[1]

# Number of neurons in the hidden layer
# Chosen as the original number of training columns (59) minus 3 (efs, efs_time & ID)
hidden_size = 56

# Number of target variables (efs_time only)
output_size = 1

model = SimpleNN(input_size, hidden_size, output_size)

In [None]:
# Define loss function and optimizer
criterion = nn.MSELoss()  # Mean Squared Error for regression
#We may want to try different types of loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam optimizer

In [None]:
#CrossEntropyLoss with SGD
#criterion = nn.CrossEntropyLoss()
#optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

In [None]:
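# Training loop
num_epochs = 100000
best_val_loss = float('inf')  # start at infinity so the first epoch always counts as an improvement
patience = 100000  # equal to num_epochs, so early stopping never triggers; lower it to enable
patience_counter = 0
val_threshold = 524  # only used by the commented-out stopping rule below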

print("Starting training...")
print(f"Training data shape: {X_train_tensor.shape}")
print(f"Validation data shape: {X_val_tensor.shape}")

for epoch in range(num_epochs):
    # Training
    model.train()
    outputs = model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Validation
    model.eval()
    with torch.no_grad():
        val_outputs = model(X_val_tensor)
        val_loss = criterion(val_outputs, y_val_tensor)

        # Early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            # Save the best model
            torch.save(model.state_dict(), 'best_model.pth')
        else:
            patience_counter += 1

        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch+1}")
            break

        # elif val_loss.item() <= val_threshold:
        #     print(f"Early stopping at epoch {epoch+1}")
        #     break

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}]')
        print(f'Training Loss: {loss.item():.4f}')
        print(f'Validation Loss: {val_loss.item():.4f}')

Streaming output truncated to the last 5000 lines.
Validation Loss: 547.9101
Epoch [83340/100000]
Training Loss: 439.9840
Validation Loss: 550.4993
Epoch [83350/100000]
Training Loss: 445.3303
Validation Loss: 543.8710
Epoch [83360/100000]
Training Loss: 439.1339
Validation Loss: 549.9436
Epoch [83370/100000]
Training Loss: 444.2703
Validation Loss: 550.3005
Epoch [83380/100000]
Training Loss: 434.7476
Validation Loss: 550.7262
Epoch [83390/100000]
Training Loss: 440.2317
Validation Loss: 549.9515
Epoch [83400/100000]
Training Loss: 446.0052
Validation Loss: 547.3174
Epoch [83410/100000]
Training Loss: 438.1192
Validation Loss: 549.8209
Epoch [83420/100000]
Training Loss: 438.2850
Validation Loss: 549.8766
Epoch [83430/100000]
Training Loss: 442.4743
Validation Loss: 551.2457
Epoch [83440/100000]
Training Loss: 437.7892
Validation Loss: 548.7646
Epoch [83450/100000]
Training Loss: 442.7766
Validation Loss: 545.4547
Epoch [83460/100000]
Training Loss: 440.8141
Validation L

In [None]:
# DataFrames whose column names we want to compare
df1 = X_val

df2 = y_train

# Get column names of both DataFrames
columns_df1 = set(df1.columns)
columns_df2 = set(df2.columns)

# Find columns missing in df2 but present in df1
missing_in_df2 = columns_df1 - columns_df2

# Find columns missing in df1 but present in df2
missing_in_df1 = columns_df2 - columns_df1

# Output the results
print("Columns missing in df2:", missing_in_df2)
print("Columns missing in df1:", missing_in_df1)

Columns missing in df2: {'prim_disease_hct_Other leukemia', 'diabetes_Yes', 'tce_div_match_Permissive mismatched', 'dri_score_N/A - non-malignant indication', 'cyto_score_Favorable', 'vent_hist_No', 'cyto_score_detail_Intermediate', 'rituximab_nan', 'prim_disease_hct_HIS', 'rheum_issue_nan', 'mrd_hct_nan', 'hepatic_mild_Yes', 'cyto_score_Not tested', 'arrhythmia_Yes', 'hla_nmdp_6', 'prim_disease_hct_IMD', 'hepatic_mild_nan', 'cyto_score_detail_Favorable', 'comorbidity_score', 'rituximab_No', 'tce_imm_match_H/H', 'peptic_ulcer_Yes', 'hla_low_res_6', 'cyto_score_Other', 'psych_disturb_Yes', 'prim_disease_hct_HD', 'gvhd_proph_FKalone', 'hla_high_res_6', 'dri_score_Intermediate - TED AML case <missing cytogenetics', 'gvhd_proph_CSA +- others(not FK,MMF,MTX)', 'race_group_Asian', 'race_group_Native Hawaiian or other Pacific Islander', 'pulm_moderate_No', 'cardiac_nan', 'cyto_score_Poor', 'tce_match_nan', 'tce_div_match_nan', 'peptic_ulcer_nan', 'renal_issue_Yes', 'hla_match_a_high', 'tce_di

In [None]:
#We are no longer using this section because the test data is missing the observed values

# Evaluate the model on the test data
# with torch.no_grad():
#     test_outputs = model(X_test_tensor)
#     test_loss = criterion(test_outputs, y_test_tensor)
#     print(f'Test Loss: {test_loss.item():.4f}')

In [None]:
# Save the model
torch.save(model.state_dict(), 'model.pth')

## Output

In [None]:
# Generate predictions for the test set
model.eval()  # make sure dropout is disabled for inference
with torch.no_grad():
    test_outputs = model(X_test_tensor)

# Convert to a flat Python list for the DataFrame
test_outputs = test_outputs.numpy().flatten().tolist()

# Convert predictions to a DataFrame
# NOTE: sample_submission.csv names this column 'prediction' (no leading '# ')
submission = pd.DataFrame({
    'ID': dfTest['ID'],  # 'ID' is the identifier column
    #'efs': test_outputs[:, 0],  # Predicted efs
    '# prediction': test_outputs  # Predicted efs_time
})

In [None]:
print(submission)

      ID  # prediction
0  28800     33.041428
1  28801     32.950745
2  28802     25.749702
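
The sample submission shown earlier has exactly two columns, `ID` and `prediction`. As a minimal sketch, a frame can be compared against that layout before it is written to disk. `check_submission` is a hypothetical helper, and the checks it runs are assumptions about what is worth verifying, not the competition's official validation.

```python
import pandas as pd


def check_submission(submission, sample):
    """Return a list of problems found when comparing `submission` to `sample`."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append(
            f"columns {list(submission.columns)} != {list(sample.columns)}"
        )
    if len(submission) != len(sample):
        problems.append(f"row count {len(submission)} != {len(sample)}")
    elif "ID" in submission.columns and "ID" in sample.columns:
        # Only compare IDs when both frames have the same number of rows
        if not (submission["ID"].to_numpy() == sample["ID"].to_numpy()).all():
            problems.append("ID values differ from the sample")
    if submission.isna().any().any():
        problems.append("submission contains NaN values")
    return problems
```

If `check_submission(submission, dfSample)` returns an empty list, the frame can be saved with `submission.to_csv('submission.csv', index=False)`; the output file name here is an assumption.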


## Suggested Changes

### Cross Validation

In [None]:
from sklearn.model_selection import KFold
from torch.utils.data import DataLoader, TensorDataset

# Convert data to PyTorch tensors
X_tensor = torch.tensor(X_train.values, dtype=torch.float32)
y_tensor = torch.tensor(y_train.values, dtype=torch.float32)

# Define k-fold cross-validation
k_folds = 5
kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)

# Store results
results = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X_tensor)):
    print(f"Fold {fold + 1}")

    # Split data into training and validation sets
    X_train_fold, X_val_fold = X_tensor[train_idx], X_tensor[val_idx]
    y_train_fold, y_val_fold = y_tensor[train_idx], y_tensor[val_idx]

    # Create DataLoader for batching
    train_dataset = TensorDataset(X_train_fold, y_train_fold)
    val_dataset = TensorDataset(X_val_fold, y_val_fold)
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

    # Initialize model, loss, and optimizer
    model = SimpleNN(input_size, hidden_size, output_size)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    # Training loop
    for epoch in range(num_epochs):
        model.train()
        for batch_X, batch_y in train_loader:
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Validation
        model.eval()
        with torch.no_grad():
            val_loss = 0
            for batch_X, batch_y in val_loader:
                outputs = model(batch_X)
                val_loss += criterion(outputs, batch_y).item()
            val_loss /= len(val_loader)

        print(f"Epoch [{epoch+1}/{num_epochs}], Validation Loss: {val_loss:.4f}")

    results.append(val_loss)

# Print average validation loss across folds
print(f"Average Validation Loss: {np.mean(results):.4f}")

Streaming output truncated to the last 5000 lines.
Epoch [4449/100000], Validation Loss: 547.8719
Epoch [4450/100000], Validation Loss: 544.5656
Epoch [4451/100000], Validation Loss: 550.2896
Epoch [4452/100000], Validation Loss: 548.2507
Epoch [4453/100000], Validation Loss: 551.0410
Epoch [4454/100000], Validation Loss: 546.0304
Epoch [4455/100000], Validation Loss: 550.4890
Epoch [4456/100000], Validation Loss: 559.3771
Epoch [4457/100000], Validation Loss: 552.8243
Epoch [4458/100000], Validation Loss: 544.6044
Epoch [4459/100000], Validation Loss: 544.8429
Epoch [4460/100000], Validation Loss: 551.0281
Epoch [4461/100000], Validation Loss: 545.2144
Epoch [4462/100000], Validation Loss: 549.0888
Epoch [4463/100000], Validation Loss: 547.0138
Epoch [4464/100000], Validation Loss: 617.7866
Epoch [4465/100000], Validation Loss: 551.4327
Epoch [4466/100000], Validation Loss: 555.0651
Epoch [4467/100000], Validation Loss: 550.7375
Epoch [4468/100000], Validation Loss: 547.

In [None]:
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Early stopping parameters
patience = 5
best_loss = float('inf')
counter = 0

# Training loop with early stopping
for epoch in range(num_epochs):
    model.train()
    for batch_X, batch_y in train_loader:
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Validation
    model.eval()
    with torch.no_grad():
        val_loss = 0
        for batch_X, batch_y in val_loader:
            outputs = model(batch_X)
            val_loss += criterion(outputs, batch_y).item()
        val_loss /= len(val_loader)

    print(f"Epoch [{epoch+1}/{num_epochs}], Validation Loss: {val_loss:.4f}")

    # Early stopping logic
    if val_loss < best_loss:
        best_loss = val_loss
        counter = 0
    else:
        counter += 1
        if counter >= patience:
            print("Early stopping triggered.")
            break
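
The early stopping bookkeeping above can also be packaged into a small reusable object, so the same rule can be shared between the main training loop and each cross-validation fold. A minimal sketch in plain Python; the class name `EarlyStopper` and the `min_delta` parameter (minimum improvement that counts) are hypothetical additions that do not appear in the loop above.

```python
class EarlyStopper:
    """Track the best validation loss and signal when training should stop."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # required drop in loss to count as improvement
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when it is time to stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            return False
        self.counter += 1
        return self.counter >= self.patience
```

Inside a training loop this would be used as `if stopper.step(val_loss): break`, with a fresh `EarlyStopper` created for every fold.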

### Concordance (Rank) Loss Custom Template

In [None]:
import torch
import torch.nn as nn

class ConcordanceLoss(nn.Module):
    def __init__(self):
        super(ConcordanceLoss, self).__init__()

    def forward(self, pred_risk, time, event):
        """
        pred_risk: Predicted risk scores (higher scores mean higher risk of event).
        time: True event/censoring times.
        event: Event indicator (1 = event occurred, 0 = censored).
        """
        # Create pairs of patients
        n = len(time)
        concordance = 0
        permissible = 0

        for i in range(n):
            if event[i] == 1:  # Only consider pairs where the first patient had an event
                for j in range(n):
                    if time[j] > time[i]:  # Second patient must have a longer survival time
                        permissible += 1
                        if pred_risk[i] > pred_risk[j]:  # Correct ranking
                            concordance += 1
                        elif pred_risk[i] == pred_risk[j]:  # Tie
                            concordance += 0.5

        if permissible == 0:
            return torch.tensor(0.0, requires_grad=True)  # No permissible pairs

        c_index = concordance / permissible
        return 1 - c_index  # Minimize 1 - C-Index
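
The template above counts concordant pairs with Python comparisons, so its result carries no gradient and cannot drive `loss.backward()`. One common workaround is to replace the hard comparison with a sigmoid of the risk difference. The sketch below is such a surrogate under stated assumptions: it is not the competition's metric, and the class name `SoftConcordanceLoss` and the temperature parameter `sigma` are hypothetical.

```python
import torch
import torch.nn as nn


class SoftConcordanceLoss(nn.Module):
    """Differentiable stand-in for 1 - C-index over permissible pairs."""

    def __init__(self, sigma=1.0):
        super().__init__()
        self.sigma = sigma  # smaller values approximate the hard comparison more closely

    def forward(self, pred_risk, time, event):
        pred_risk = pred_risk.reshape(-1)
        time = time.reshape(-1)
        event = event.reshape(-1)

        # permissible[i, j]: patient i had an event and patient j survived longer
        permissible = (event.unsqueeze(1) == 1) & (
            time.unsqueeze(0) > time.unsqueeze(1)
        )
        if not permissible.any():
            # No usable pairs: return a zero that stays connected to the graph
            return pred_risk.sum() * 0.0

        # diff[i, j] = risk_i - risk_j; positive when the pair is ranked correctly
        diff = pred_risk.unsqueeze(1) - pred_risk.unsqueeze(0)
        soft_concordance = torch.sigmoid(diff / self.sigma)
        return 1.0 - soft_concordance[permissible].mean()
```

The pair matrix grows with the square of the batch size, so this is meant for mini-batches (for example the batches of 32 used in the cross-validation cell) rather than the full training tensor.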

### Plots for summary of specific features

In [None]:
import shap

# Explain the model's predictions using SHAP
explainer = shap.DeepExplainer(model, X_train_tensor[:100])  # Use a subset of data
shap_values = explainer.shap_values(X_train_tensor[:100])

# Plot feature importance
shap.summary_plot(shap_values, X_train[:100], feature_names=X_train.columns)