
# Introduction

We will test a range of models for predicting the days_alive variable. The procedure we will follow:
1. Choose a model
1. Choose an objective function
1. Choose a learning algorithm
1. Add regularization, tune hyperparameters, etc.

First we will test the models with all the features, and apply feature selection only where the statistical methods need it. For clinical relevance, we should keep all the samples and not drop any features of interest.

This document collects all the testing of learning models on the delfi_score data. We aim to predict the days_alive variable, with survival_status as the event-occurred indicator. Several ML methods in the literature report good results; we will use those and other methods to obtain the best results on our data.

Before applying any models, we can consider transforming the data. One option is to uncensor the data, provided the number of censored samples is not too large. This could introduce bias relative to traditional survival methods.
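As a rough way to judge whether the number of censored samples is "not too great", one could compute the censored fraction of the outcome array. The sketch below assumes that `survival_status == True` means the event was observed and uses a small hypothetical array (`y_toy`) rather than the real data; the helper name `censored_fraction` is also an illustration, not part of the notebook.

```python
import numpy as np

# Hypothetical toy outcomes in the same (event, time) layout used for Y below.
y_toy = np.array(
    [(True, 1059.0), (True, 101.0), (False, 2575.0), (True, 754.0)],
    dtype=[("survival_status", "bool"), ("days_alive", "float")],
)

def censored_fraction(y, event_field="survival_status"):
    """Fraction of samples whose event was NOT observed (i.e. censored)."""
    return float(np.mean(~y[event_field]))

frac = censored_fraction(y_toy)
print(f"censored fraction: {frac:.2f}")
```

What counts as "too great" is a judgment call that this sketch does not settle.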

The models that we will be testing have several categories:
1. Linear Models
1. Tree Ensemble Methods
1. Neural Networks
1. Deep Methods



We will follow this procedure to predict the 'time to death', i.e. the 'days_alive' variable: use each model to estimate the ISD (individual survival distribution) of every sample, then take the median of that distribution as the predicted survival time. K-fold cross-validation will be used to validate the models.
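The median-time step can be sketched as follows, assuming an ISD is available as a survival curve evaluated on a grid of time points. The function name `median_survival_time` and the toy curve are hypothetical; returning infinity when the curve never reaches 0.5 is one possible convention, not necessarily the one used later in this notebook.

```python
import numpy as np

def median_survival_time(times, surv_probs):
    """Return the first time at which S(t) falls to 0.5 or below.

    `times` must be sorted ascending and `surv_probs` must hold S(t) at
    those times. If the curve never reaches 0.5, return np.inf.
    """
    times = np.asarray(times, dtype=float)
    surv_probs = np.asarray(surv_probs, dtype=float)
    below = np.flatnonzero(surv_probs <= 0.5)
    if below.size == 0:
        return np.inf
    return times[below[0]]

# Hypothetical survival curve for a single sample (time in days).
times = np.array([100.0, 250.0, 400.0, 800.0, 1200.0])
surv = np.array([0.95, 0.80, 0.55, 0.40, 0.10])

predicted_days = median_survival_time(times, surv)
print(predicted_days)
```

For this toy curve the first point at or below 0.5 is at day 800, so that is the predicted survival time.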

In [1]:
!pip install scikit-survival

Collecting scikit-survival
  Downloading scikit_survival-0.22.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.7 MB)
Collecting scikit-learn<1.4,>=1.3.0 (from scikit-survival)
  Downloading scikit_learn-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.8 MB)
Installing collected packages: scikit-learn, scikit-survival
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.2.2
    Uninstalling scikit-learn-1.2.2:
      Successfully uninstalled scikit-learn-1.2.2
Successfully installed scikit-learn-1.3.2 scikit-survival-0.22.2


# Setting up the data

In [2]:
import pandas as pd
import numpy as np
import sklearn as skl


In [3]:
data = pd.read_csv('data_Lucas_encoded.csv', index_col=[0])
data.head()

Unnamed: 0,survival_status,days_alive,delfi_score,stage_I,stage_II,stage_III,stage_IV,treatment_Chemotherapy/Radiation with curative intent,treatment_No treatment,treatment_Palliative Chemotherapy/Radiation,treatment_Surgery,treatment_Surgery+adjuvant treatment
0,1,1059,0.099037,1,0,0,0,0,0,0,1,0
1,1,1640,0.533453,1,0,0,0,0,0,0,1,0
2,1,101,0.822662,0,0,0,1,0,0,1,0,0
3,1,1228,0.23895,0,0,0,1,0,0,1,0,0
4,1,754,0.19982,0,0,1,0,1,0,0,0,0


In [4]:
data['survival_status'] = data['survival_status'].astype('bool')
data.head()

Unnamed: 0,survival_status,days_alive,delfi_score,stage_I,stage_II,stage_III,stage_IV,treatment_Chemotherapy/Radiation with curative intent,treatment_No treatment,treatment_Palliative Chemotherapy/Radiation,treatment_Surgery,treatment_Surgery+adjuvant treatment
0,True,1059,0.099037,1,0,0,0,0,0,0,1,0
1,True,1640,0.533453,1,0,0,0,0,0,0,1,0
2,True,101,0.822662,0,0,0,1,0,0,1,0,0
3,True,1228,0.23895,0,0,0,1,0,0,1,0,0
4,True,754,0.19982,0,0,1,0,1,0,0,0,0


In [5]:
# Build the structured array of (event indicator, time) that scikit-survival expects
Y = np.array(list(zip(data['survival_status'], data['days_alive'])),
             dtype=[('survival_status', 'bool'), ('days_alive', 'float')])
# Features: every column except the two outcome columns
X = data.drop(['survival_status', 'days_alive'], axis=1)
print("X Matrix: ")
print(X.head())
print()

print("Y Matrix: ")
print(Y[1:5])


X Matrix: 
   delfi_score  stage_I  stage_II  stage_III  stage_IV  \
0     0.099037        1         0          0         0   
1     0.533453        1         0          0         0   
2     0.822662        0         0          0         1   
3     0.238950        0         0          0         1   
4     0.199820        0         0          1         0   

   treatment_Chemotherapy/Radiation with curative intent  \
0                                                  0       
1                                                  0       
2                                                  0       
3                                                  0       
4                                                  1       

   treatment_No treatment  treatment_Palliative Chemotherapy/Radiation  \
0                       0                                            0   
1                       0                                            0   
2                       0                                

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25, random_state = 42)

In [7]:
X_train.head()

Unnamed: 0,delfi_score,stage_I,stage_II,stage_III,stage_IV,treatment_Chemotherapy/Radiation with curative intent,treatment_No treatment,treatment_Palliative Chemotherapy/Radiation,treatment_Surgery,treatment_Surgery+adjuvant treatment
33,0.189878,0,1,0,0,0,0,0,1,0
11,0.158892,0,0,0,1,0,0,0,1,0
44,0.953926,0,0,0,1,0,1,0,0,0
49,0.965122,0,0,1,0,1,0,0,0,0
31,0.678103,0,0,1,0,1,0,0,0,0


In [8]:
X_test.head()

Unnamed: 0,delfi_score,stage_I,stage_II,stage_III,stage_IV,treatment_Chemotherapy/Radiation with curative intent,treatment_No treatment,treatment_Palliative Chemotherapy/Radiation,treatment_Surgery,treatment_Surgery+adjuvant treatment
68,0.999358,0,1,0,0,0,0,0,1,0
22,0.204565,0,0,0,1,0,0,1,0,0
72,1.0,0,0,1,0,0,0,1,0,0
73,0.992746,0,0,1,0,1,0,0,0,0
0,0.099037,1,0,0,0,0,0,0,1,0


In [9]:
y_train[0:5]

array([( True, 2254.), ( True,  245.), ( True,   50.), ( True,  364.),
       ( True,  868.)],
      dtype=[('survival_status', '?'), ('days_alive', '<f8')])

In [10]:
y_test[0:5]

array([(False, 2575.), ( True,  575.), ( True,  550.), (False, 2758.),
       ( True, 1059.)],
      dtype=[('survival_status', '?'), ('days_alive', '<f8')])

In [11]:
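from sksurv.metrics import concordance_index_censored
def score_survival_model(model, X, y):
    # Predicted risk scores: higher means shorter expected survival
    prediction = model.predict(X)
    # Field names must match the dtype of the structured array Y defined above
    result = concordance_index_censored(y["survival_status"], y["days_alive"], prediction)
    # result is a tuple; result[0] is the concordance index itself
    return result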



In [103]:
# train model function




Before we train any models, a major question in survival analysis is which metric to use to evaluate them. The most popular metrics include:
1. C-index (Harrell's concordance index) -> the ratio of concordant (correctly ordered) pairs to comparable pairs. A C-index close to 1 represents near-perfect ranking, while 0.5 corresponds to random ordering.
1. MAE (mean absolute error) -> the literature has introduced many adaptations of the MAE estimator to handle survival data, for example by estimating the true survival time of the censored samples. This reduces the survival problem to a typical regression/classification problem.
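To make the C-index definition concrete, here is a minimal from-scratch sketch. It is a simplified illustration and not the scikit-survival implementation: pairs with tied times are skipped, and ties in predicted risk count as half concordant. The function name `harrell_c_index` and the toy data are hypothetical.

```python
import numpy as np

def harrell_c_index(event, time, risk):
    """Concordant pairs divided by comparable pairs.

    A pair (i, j) is comparable when sample i had an observed event and a
    strictly shorter time than sample j. It is concordant when sample i
    also has the higher predicted risk.
    """
    event = np.asarray(event, dtype=bool)
    time = np.asarray(time, dtype=float)
    risk = np.asarray(risk, dtype=float)
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored sample cannot be the earlier member of a pair
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical toy data: the third sample is censored.
event = [True, True, False, True]
days = [100.0, 200.0, 300.0, 400.0]

perfect = harrell_c_index(event, days, risk=[4, 3, 2, 1])
reversed_ = harrell_c_index(event, days, risk=[1, 2, 3, 4])
mixed = harrell_c_index(event, days, risk=[4, 1, 2, 3])
print(perfect, reversed_, mixed)
```

The toy data has five comparable pairs. A risk ordering that exactly mirrors the survival times scores 1.0, the reverse ordering scores 0.0, and the mixed ordering gets three of five pairs right.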

# Models


## Linear Models