## Gradient boosting for survival datasets


Gradient boosting has revealed itself to be one of the most powerful modern machine learning algorithms as evidenced by its ubiquity in the leader boards of Kaggle competitions. Boosting is a type of **ensemble learning**, a machine learning paradigm in which the final prediction is determined by a combination of models fit to the data. Two of the most popular ensemble algorithms are Random Forests and Gradient Boosting. The former fits many CARTs to the data and then uses a fair-voting system during the inference stage. In contrast gradient boosting iteratively fits CARTs to the training data, where a given CART uses the cumulative residuals from all previous CARTs as its classification/regression task. Whereas Random Forests are very good at providing low variance estimates, boosting techniques specialize in fitting non-linear dependencies in the data. 

However naive implementations of tree methods can be computationally burdensome when many trees are fit and when there are many variables to consider splits on (due to the greedy nature of the CART algorithm.) The `xgboost` package has been specifically developed to computationally scale well as well as provide built-in regularization techniques to prevent overfitting. This post will show how `xgboost` can be fit to right-censored survival data with a Python implementation.

In [117]:
# --- CALL IN THE NECESSARY LIBRARIES --- #
import numpy as np
import pandas as pd
import xgboost as xgb
# Some custom utilities for data processing
import utils_boosting as uu

### Load in the data

In [118]:
# Load the lung dataset
from lifelines.datasets import load_lung
lung_dat = load_lung()
lung_dat.head()

Unnamed: 0,inst,time,status,age,sex,ph.ecog,ph.karno,pat.karno,meal.cal,wt.loss
0,3.0,306,2,74,1,1.0,90.0,100.0,1175.0,
1,3.0,455,2,68,1,0.0,90.0,90.0,1225.0,15.0
2,3.0,1010,1,56,1,0.0,90.0,90.0,,15.0
3,5.0,210,2,57,1,1.0,90.0,60.0,1150.0,11.0
4,1.0,883,2,60,1,0.0,100.0,90.0,,0.0


### Data cleaning

Fill in any missing values using the median.

Create dummies for the different features

In [119]:
# Mill missing with median
lung_dat = lung_dat.fillna(lung_dat.median())
# Extract the time and status then drop
y = lung_dat['time']
# event indicator (1==censored, 2==dead)
d = lung_dat['status']
# Drop
lung_dat.drop(columns=['inst','time','status'],inplace=True)

# Encode sex as 0/1
dict_sex = {1: 'male', 2:'female'}
lung_dat.replace({'sex': dict_sex},inplace=True)
dummies_sex = pd.get_dummies(lung_dat.sex,drop_first=True)
# Create dummy variables for ph.ecog (we'll keep original integer form too)
dict_phecog = {0: 'ph0', 1: 'ph1', 2: 'ph23', 3:'ph23'}
lung_dat.replace({'ph.ecog': dict_phecog}, inplace=True)
dummies_phecog = pd.get_dummies(lung_dat['ph.ecog'],drop_first=True)
# Drop the originals and concatenate
lung_dat = pd.concat([lung_dat.drop(columns=['sex','ph.ecog']), pd.concat([dummies_sex, dummies_phecog],axis=1)],axis=1)
# Re-check output
lung_dat.head()

# Convert to numeric versions
X = lung_dat.as_matrix()
y = np.array(y)
d = np.where(np.array(d)==2,1,0)

### Case study: xgboost vs ElasticNet

To see which model performs better, we'll look at the out-of-sample concordance scores (the *de rigueur* evaluation metric for survival data) for the `ElasticNet` and `xgboost` models. Using an 80/20 training/test split, both models will perform hyperparameter optimizing using 5-fold CV on the training set. This experiment will be repeated 100 times to generate a null distribution of effects and then the two distributions can be compared.

Note that stratified CV is used to ensure that our folds have *roughly* an equal number of events. Since an unweighted average scores of concordance scores is being compared, this will ensure that the comparisons are fair.

In [295]:
# Import out neccesary modules and set up parameter dictionaries
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import KFold
from sklearn.linear_model import ElasticNetCV

# Define a dictionary of hyperparameters for xgboost

# Number of simulations to run
nsim = 3

lst_score = [] # List to store accuracies in

stratfolds = RepeatedKFold(n_splits=2, n_repeats = nsim, random_state=1) #

for train_idx, test_index in stratfolds.split(X):
    print(np.mean(d[train_idx]))
    print(np.mean(d[test_idx]))

0.7543859649122807
0.543859649122807
0.6929824561403509
0.543859649122807
0.7543859649122807
0.543859649122807
0.6929824561403509
0.543859649122807
0.7105263157894737
0.543859649122807
0.7368421052631579
0.543859649122807


In [220]:
# Subset, X, y, d for the train/test indices
X_train, X_test = X[train_idx,:], X[test_idx,:]
y_train, y_test = y[train_idx], y[test_idx]
d_train, d_test = d[train_idx], d[test_idx]
# by giving y negative values, this indicates it is "censored" for xgboost
time_train, time_test = np.where(d_train == 0, -y_train, y_train), np.where(d_test == 0, -y_test, y_test)
# Convert training/test data to DMatrices for xgboost
dtrain = xgb.DMatrix(X_train, time_train)
dtest = xgb.DMatrix(X_test, time_test)

# # Fit Elastic net
# temp_elnet_cv = ElasticNetCV(l1_ratio=np.linspace(0.01,1), n_alphas=10, cv=5,
#                         fit_intercept=True,normalize=True,random_state=1).fit(X_train,y_train)


### Comparing the results

As the results below show, whether we use the parameter t-test or non-parametric Wilcoxon (rank-based) test, the `xgboost` model ....

In [274]:
# Function to calculate the concordance score

#preds=np.array([4,3,2,1,2]);z=np.array([1,-2,3,4,5])
def conc(preds,z):
    preds = np.array(preds)
    z = np.array(preds)
    # Get the event idx
    idx_event = np.where(z > 0)[0]
    ee, cc = 0, 0
    for ii in idx_event:
        idx_rskset = np.where(z[ii] < np.abs(z))[0]
        idx_scores = preds[ii] < preds[idx_rskset]
        cc += np.sum( idx_scores )
        ee += len(idx_scores)
    # Return concordance
    tmp_conc = cc/ee
    return 'concordance', tmp_conc

# Fit using the simple CoxPH model
np.mean(time_train <0)


# Use CV on the training set to do hyperparameter configuration
# params={'max_depth': 1,
#         'min_child_weight': 1,
#         'eta': 0.3,
#         'subsample': 1,
#         'colsample_bytree': 1,
#         'objective': 'survival:cox'}

# temp_xgb = xgb.train(params,dtrain,num_boost_round = 2)
# preds = temp_xgb.predict(dtest)

# conc(preds,time_test)
# temp_xgb_cv = xgb.cv(params,dtrain,num_boost_round=20,maximize=True,feval=conc,
#                     seed=42,nfold=2,early_stopping_rounds=10)
# temp_xgb_cv


0.27631578947368424

In [None]:
# --- SUPPORT FUNCTIONS --- #
n=100
beta=np.array([1,2])
cens=0.25

#def dgp_surv(n,beta):
# Error checking
if len(beta.shape)==1:
    p = beta.shape[0]
elif (len(beta.shape)==2) & (beta.shape[1]==1):
    p = beta.shape[0]
else:
    raise ValueError("Error! beta is not a (column) vector")
# Create the X values
np.random.seed(1)
X = np.random.randn(n,p)
Xbeta = np.dot(X,beta)
# Hazards
haz = np.exp(Xbeta)
# Generate time bazed on hazard

