# Building Well Calibrated Severity Scores
## Model Development
### C.V. Cosgriff, MIT Critical Data

Previous work by our group focused on applying a _sequential_ or _stepwise_ approach to mortality modeling to improve the calibration of illness severity scores. That is, mortality models were fit within a high-risk subset as defined by a previous model. It was hypothesized that the distribution of case severity in this subcohort would lead to models that are better calibrated for mortality prediction within it, as traditional models such as APACHE have been shown to prognosticate poorly in this cohort. However, given the volume and granularity of EHR databases, it may be possible to achieve well-calibrated models using the full cohort.

In previously unpublished work by the LCP, the model trained via the sequential approach had superior calibration by visual inspection of the graphs provided, but the model trained on the whole cohort had superior discriminatory capability, and comparable calibration.

The goal of this notebook is to reproduce and reframe that prior work. Using the project's previous code as a starting point, the current state of the art for predictive modeling with structured data will be employed in both the full cohort and the high-risk cohort, and the models' discriminative ability and calibration will be examined in the high-risk cohort. Thus, the goal of this study is to determine which strategy leads to models that can accurately forecast mortality in high-risk subsets where previous models have struggled.

With respect to the modeling approaches, we'll implement a generalized linear model similar to APACHE IV, but, given the extremely large number of features, we will impose regularization to reduce model complexity and prevent overfitting; we will employ $L^2$ penalization, also known as ridge regression. We will also implement a tree-based approach, as these have been exceedingly successful in recent work. Specifically, we will use a gradient boosted tree approach as implemented by the _extreme gradient boosting_ algorithm in `xgboost` _(XGBoost: A Scalable Tree Boosting System, arXiv:1603.02754 [cs.LG])_. More details on implementation will be discussed below.

__Notebook Outline:__
* Environment preparation
* Load full cohort extraction (features & labels)
* Train models on full cohort
    * Penalized linear model ($L^2$, ridge regression)
    * Extreme Gradient Boosting (with Isotonic Regression)
* Train models on high-risk subset of full cohort
    * Penalized linear model ($L^2$, ridge regression)
    * Extreme Gradient Boosting (with Isotonic Regression)
* Compare full-cohort and high-risk cohort models on high-risk subset
    * Discrimination
    * Calibration

## 0 - Environment Setup

Here we'll load the standard data science stack; preprocessing, linear model, and model evaluation tools from `scikit-learn`; and the gradient boosting classifier from `xgboost`. We are using version 0.19.1 of `scikit-learn` and version 0.71 of `xgboost`.

In [None]:
# Data science stack
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Machine learning tools
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, Imputer
from sklearn.linear_model import RidgeClassifierCV
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV, train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.calibration import calibration_curve, CalibratedClassifierCV

# Extreme gradient boosting model
from xgboost import XGBClassifier

# Graphing stuff
# "Tableau 20" colors as RGB for plotting
tableau20 = [(31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120),    
             (44, 160, 44), (152, 223, 138), (214, 39, 40), (255, 152, 150),    
             (148, 103, 189), (197, 176, 213), (140, 86, 75), (196, 156, 148),    
             (227, 119, 194), (247, 182, 210), (127, 127, 127), (199, 199, 199),    
             (188, 189, 34), (219, 219, 141), (23, 190, 207), (158, 218, 229)]  
  
# Scale the RGB values to the [0, 1] range, which is the format matplotlib accepts
for i in range(len(tableau20)):    
    r, g, b = tableau20[i]    
    tableau20[i] = (r / 255., g / 255., b / 255.)

%matplotlib inline

## 1 - Load Data

We now load the cohort data. The data are stored as separate CSV files and were previously extracted by LCP from the full eICU-CRD. The dataset includes all features used in determining the APACHE score as well as an expanded set of features engineered by the LCP. This portion of the code was adapted from `run_sequential_model.py`, which was written by Aaron Kaufman. Of note, his script contains code for loading 24 hour or 48 hour data based on a flag `TEST_ONLY_24`, and we'll adapt his code to use the 24 hour data only.

In [None]:
# read in dataset files

# I was working under the presumption that sequential_model_features_n contained the full cohort
# but in fact they contain only the high risk cohort and I do not have the full dataset.

# TODO: Once we have full dataset, replace these paths with those files

data_set_0 = pd.read_csv('./data/sequential_model_features0.csv')
data_set_1 = pd.read_csv('./data/sequential_model_features1.csv')
data_set_2 = pd.read_csv('./data/sequential_model_features2.csv')
data_set_3 = pd.read_csv('./data/sequential_model_features3.csv')
data_set_4 = pd.read_csv('./data/sequential_model_features4.csv')

# concatenate into a single dataset
data_set = pd.concat([data_set_0, data_set_1, data_set_2, data_set_3, data_set_4])

# remove data_set_n from memory
del data_set_0
del data_set_1
del data_set_2
del data_set_3
del data_set_4

# only include 24h data; adapted from old code
columns = data_set.columns.values.tolist()
col_24h = [] # collection of all the column names for 24h
for col in columns:
    if '48h' in col:
        continue
    if col == 'APACHE Predicted' or col == 'Death':
        continue
    else:
        col_24h.append(col)

data_set = data_set[col_24h + ['APACHE Predicted', 'Death']]

Next we'll form a train-test split for use throughout the modeling process. The previous script, `run_sequential_model.py`, used the first file as the held-out testing set, but I am unsure whether these files retain any inherent ordering from the extraction and feature engineering, so to avoid this problem we'll perform a random train/test split of the data. We'll hold out 25% of the data for final testing.

In [None]:
train, test = train_test_split(data_set, test_size = 0.25, random_state = 42)
del data_set # no longer needed in memory

Then we split up the dataset into features and labels. We'll also store the APACHE predicted mortalities in another array as these are not meant to be used as a feature but will be used later when splitting off the high-risk subset of the cohort and for examining the discriminative ability and calibration of the original APACHE prediction.

In [None]:
train_apache = train.loc[:, 'APACHE Predicted'] # APACHE probabilities
train_labels = train.loc[:, 'Death']  # Labels
train_features = train.iloc[:, :-2]  # Features only

We are now ready to implement the models.

## 2 - Full Cohort Models

__Ridge Logistic Regression__

Our first model is a linear model similar to that used in the development of APACHE IV. As we have many variables, we choose to constrain model complexity using $L^2$ regularization, and thus will train a ridge-penalized linear classifier. For selection of $\lambda$, which `scikit-learn` calls $\alpha$, we'll use cross validation; because this model's formulation admits a highly efficient implementation of leave-one-out cross validation (LOO-CV), we'll use the generalized CV as implemented in the `RidgeClassifierCV` class. Note that `RidgeClassifierCV` fits a least-squares ridge regression to the labels recoded as $\pm 1$ rather than a penalized logistic regression, so it returns decision scores and not probabilities.

Unlike tree-based approaches, ridge regression requires features to be on the same scale for proper performance. It is also not robust to missing data. As such, we'll employ mean imputation, followed by centering and scaling of the features, and then ridge regression with LOO-CV over a $\lambda$ range of 1 to 2,000 with a step size of 10.

In [None]:
ridge_classifier = Pipeline([('impute', Imputer()),
                       ('center_scale', StandardScaler()),
                       ('ridge', RidgeClassifierCV(alphas=np.arange(1., 2000., 10.)))])
ridge_classifier.fit(train_features, train_labels)
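
To make the first two pipeline steps concrete, here is a minimal pure-Python sketch of mean imputation followed by centering and scaling for a single feature column. The helper `impute_and_standardize` and the heart-rate values are made up for illustration; the notebook itself relies on `Imputer` and `StandardScaler`.

```python
import math

def impute_and_standardize(column):
    """Mean-impute NaN entries, then center and scale to unit variance."""
    observed = [x for x in column if not math.isnan(x)]
    mean = sum(observed) / len(observed)
    filled = [mean if math.isnan(x) else x for x in column]
    # Population standard deviation (ddof=0), the convention StandardScaler uses
    std = math.sqrt(sum((x - mean) ** 2 for x in filled) / len(filled))
    return [(x - mean) / std for x in filled]

# Made-up heart rates with one missing value
heart_rate = [60.0, float('nan'), 80.0, 100.0]
z = impute_and_standardize(heart_rate)
print([round(v, 3) for v in z])
```

An imputed entry lands exactly at zero after centering, so in a linear model it contributes nothing to that patient's score for this feature.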

__Extreme Gradient Boosting__  

As compared to logistic regression, which directly minimizes the log-loss via MLE, boosted trees are typically poorly calibrated classifiers. Training them to directly minimize log-loss may overcome some of this, but has been shown to produce subpar results. However, they can be greatly improved via approaches such as Platt Scaling and Isotonic Regression _(Obtaining Calibrated Probabilities from Boosting, arXiv:1207.1403 [cs.LG])_. We will therefore use isotonic regression to calibrate the gradient boosted tree model. This will be implemented in the following steps:
* Create a split in the data, saving 25% of the training data as a calibration set for isotonic regression
    * We do this because calibrating on the same data the model was trained on yields biased results (see the above paper)
* Using the training set (75% of original training data) determine hyperparameters by 5-fold cross validation
* Fit model with optimal hyperparameters on full training set (75% of original training data)
* Calibrate model with isotonic regression using calibration set (25% of original training data)
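
The nested splits in the steps above, combined with the 25% test hold-out from section 1, imply the following shares of the full dataset. This is only a worked arithmetic sketch; the variable names are illustrative.

```python
from fractions import Fraction

test_share = Fraction(1, 4)                    # held out by the first train/test split
train_share = 1 - test_share                   # original training data
calib_share = train_share * Fraction(1, 4)     # 25% of the training data, for isotonic regression
fit_share = train_share - calib_share          # used for cross validation and the final fit

for name, share in [('fit', fit_share), ('calibration', calib_share), ('test', test_share)]:
    print('{:<12s}{}  ({:.4f})'.format(name, share, float(share)))
```

So the boosted model is fit on 9/16 of all rows and calibrated on 3/16, which is worth keeping in mind when comparing it with the ridge model trained on the full 3/4.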

Of note, hyperparameters will be obtained by a random sampling of the hyperparameter space, as opposed to an exhaustive grid search, as Bergstra et al. showed this to be superior whilst remaining computationally cheaper _(Journal of Machine Learning Research 13 (2012) 281-30)_.

We begin by creating the calibration set.

In [None]:
train_X, calib_X, train_y, calib_y = train_test_split(train_features, train_labels, test_size=0.25, random_state=42)

We then perform a cross validated search of the hyperparameter space, randomly sampling the grid 100 times (`n_iter=100`).

In [None]:
params = {'objective':['binary:logistic'],
          'learning_rate': [0.01, 0.05, 0.10],
          'max_depth': [3, 6, 9, 12],
          'min_child_weight': [6, 8, 10, 12],
          'silent': [True],
          'subsample': [0.6, 0.8, 1],
          'colsample_bytree': [0.5, 0.75, 1],
          'n_estimators': [500, 1000]}
K = 5
xgb_model = XGBClassifier()
skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=42)
cv_grid_search = RandomizedSearchCV(xgb_model, param_distributions=params, n_iter=100, scoring='roc_auc',
                                    n_jobs=48, cv=skf.split(train_X, train_y), verbose=1, random_state=42 )
cv_grid_search.fit(train_X, train_y)
print(cv_grid_search.best_estimator_)
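
To give a sense of how much of the grid the search above visits, the sketch below enumerates the same grid in pure Python and draws 100 distinct configurations. It uses the standard library's `random` module, so it illustrates the idea and does not reproduce the exact draws made by `RandomizedSearchCV`; to my understanding, `scikit-learn` also samples without replacement when every parameter is given as a list.

```python
import itertools
import random

# Same grid as in the search cell
params = {'objective': ['binary:logistic'],
          'learning_rate': [0.01, 0.05, 0.10],
          'max_depth': [3, 6, 9, 12],
          'min_child_weight': [6, 8, 10, 12],
          'silent': [True],
          'subsample': [0.6, 0.8, 1],
          'colsample_bytree': [0.5, 0.75, 1],
          'n_estimators': [500, 1000]}

names = sorted(params)
# Every combination an exhaustive grid search would have to fit
grid = [dict(zip(names, combo))
        for combo in itertools.product(*(params[n] for n in names))]

rng = random.Random(42)
sampled = rng.sample(grid, 100)  # 100 distinct configurations

print(len(grid), len(sampled))
print(sampled[0])
```

The grid holds 864 combinations, so 100 draws cover roughly 12% of it; with 5 folds that is 500 model fits rather than 4,320.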

We then run this optimal estimator on the full training set.

In [None]:
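# With refit=True (the default), the search has already refit the best
# configuration on all of train_X, so we can take that estimator directly
xgb_classifier = cv_grid_search.best_estimator_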

Finally, we calibrate the model on the held out calibration set.

In [None]:
xgb_classifer_ir = CalibratedClassifierCV(base_estimator=xgb_classifier, method='isotonic', cv='prefit')
xgb_classifer_ir.fit(calib_X, calib_y)
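
For intuition about what the isotonic step does, here is a minimal pure-Python sketch of the pool-adjacent-violators algorithm, the standard way to compute an isotonic fit. The helper `isotonic_fit` and the toy scores are hypothetical and are not the `scikit-learn` implementation; in particular, tied scores are not pooled here.

```python
def isotonic_fit(scores, labels):
    """Return the sorted scores and their non-decreasing fitted values."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    xs = [scores[i] for i in order]
    ys = [float(labels[i]) for i in order]

    blocks = []  # each block is [sum of labels, number of points]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean
        while len(blocks) > 1 and \
                blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count

    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return xs, fitted

# Toy calibration set: raw model scores and observed outcomes
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 0, 1, 1]
xs, fitted = isotonic_fit(scores, labels)
print(list(zip(xs, [round(f, 3) for f in fitted])))
```

The fitted values are step-wise averages of the observed outcomes, which is why the calibration set must be separate from the data used to train the model: the averages would otherwise reflect outcomes the model has already fit.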

## 3 - High-risk Models

Again, we'll reproduce the approach to model training used in `run_sequential_model.py`, this time training the models on the high-risk subset. Here the high-risk subset consists of patients whose APACHE predicted mortality exceeds 0.10, and we train the same two models as for the full cohort: a ridge classifier and a gradient boosted tree model calibrated with isotonic regression.

In [None]:
high_risk_train = (train_apache > 0.10).values  # boolean mask for the high-risk subset
train_labels_HR = train.loc[high_risk_train, 'Death']
train_features_HR = train.iloc[high_risk_train, :-2]

We then produce the same models as above. Because we are only changing the training data, the process will be less verbose. Unless otherwise stated, everything is as above.

__Ridge Logistic Regression__

In [None]:
ridge_classifier_HR = Pipeline([('impute', Imputer()),
                       ('center_scale', StandardScaler()),
                       ('ridge', RidgeClassifierCV(alphas=np.arange(1., 2000., 10.)))])
ridge_classifier_HR.fit(train_features_HR, train_labels_HR)

__Extreme Gradient Boosting__

As above, split off a calibration set.

In [None]:
train_X_HR, calib_X_HR, train_y_HR, calib_y_HR = train_test_split(train_features_HR, train_labels_HR, test_size=0.25, random_state=42)

Then a 5-fold cross validated randomized search.

In [None]:
params = {'objective':['binary:logistic'],
          'learning_rate': [0.01, 0.05, 0.10],
          'max_depth': [3, 6, 9, 12],
          'min_child_weight': [6, 8, 10, 12],
          'silent': [True],
          'subsample': [0.6, 0.8, 1],
          'colsample_bytree': [0.5, 0.75, 1],
          'n_estimators': [500, 1000]}
K = 5
xgb_model = XGBClassifier()
skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=42)
cv_grid_search = RandomizedSearchCV(xgb_model, param_distributions=params, n_iter=100, scoring='roc_auc',
                                    n_jobs=48, cv=skf.split(train_X_HR, train_y_HR), verbose=1, random_state=42 )
cv_grid_search.fit(train_X_HR, train_y_HR)
print(cv_grid_search.best_estimator_)

We then run this optimal estimator on the full high-risk training set.

In [None]:
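# With refit=True (the default), the search has already refit the best
# configuration on all of train_X_HR, so we can take that estimator directly
xgb_classifier_HR = cv_grid_search.best_estimator_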

We conclude, again, by calibrating our model on the held out calibration set.

In [None]:
xgb_classifer_ir_HR = CalibratedClassifierCV(base_estimator=xgb_classifier_HR, method='isotonic', cv='prefit')
xgb_classifer_ir_HR.fit(calib_X_HR, calib_y_HR)

## 4 - Comparison of Approaches

We'll now evaluate these models on discriminatory capability and calibration. We first need to construct our testing data from the original train/test split.

In [None]:
test_apache = test.loc[:, 'APACHE Predicted']
high_risk_test = (test_apache > 0.10).values  # boolean mask for the high-risk subset
test_labels = test.loc[high_risk_test, 'Death']
test_features = test.iloc[high_risk_test, :-2]

__Discrimination, Full Cohort Models__

In [None]:
# RidgeClassifierCV has no predict_proba, so rank patients by its decision function
f_hat_ridge = ridge_classifier.decision_function(test_features)
roc_ridge = roc_curve(test_labels, f_hat_ridge)
auc_ridge = roc_auc_score(test_labels, f_hat_ridge)

# Calibrated gradient boosting model; column 1 holds the positive-class probability
f_hat_xgb = xgb_classifer_ir.predict_proba(test_features)
roc_xgb = roc_curve(test_labels, f_hat_xgb[:, 1])
auc_xgb = roc_auc_score(test_labels, f_hat_xgb[:, 1])

plt.plot(roc_ridge[0], roc_ridge[1], color = tableau20[7], label='Ridge\n(area = {:0.3f})'.format(auc_ridge))
plt.plot(roc_xgb[0], roc_xgb[1], color = tableau20[9], label='GBM\n(area = {:0.3f})'.format(auc_xgb))
plt.plot([0, 1], [0, 1], color= tableau20[0])

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC: Full Cohort Models')
plt.legend(loc="lower right")
plt.show()
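
As a sanity check on the areas reported in the legend, the AUC can also be read as the probability that a randomly chosen death receives a higher score than a randomly chosen survivor. The sketch below computes it that way in pure Python; `pairwise_auc` is a hypothetical helper written for illustration, it counts ties as half, and it is quadratic in the sample size, so it is only practical on small samples.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

Because only the ordering of scores matters, this reading also explains why the ridge model can be compared on AUC using decision scores rather than probabilities.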

__Calibration, Full Cohort Models__
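
A reliability curve for this comparison could be computed along the following lines. This is a pure-Python sketch in the spirit of `calibration_curve` with uniform bins, not the notebook's final analysis; the helper `reliability_bins` and the toy data are hypothetical, and bin-edge handling may differ slightly from `scikit-learn`.

```python
def reliability_bins(labels, probs, n_bins=10):
    """Return (mean predicted, observed rate, count) for each non-empty bin."""
    sums = [0.0] * n_bins
    events = [0] * n_bins
    counts = [0] * n_bins
    for y, p in zip(labels, probs):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[b] += p
        events[b] += y
        counts[b] += 1
    return [(sums[b] / counts[b], events[b] / counts[b], counts[b])
            for b in range(n_bins) if counts[b] > 0]

# Toy predictions and outcomes, split into two bins
toy_labels = [0, 0, 1, 0, 1, 1]
toy_probs = [0.05, 0.15, 0.2, 0.6, 0.7, 0.9]
bins = reliability_bins(toy_labels, toy_probs, n_bins=2)
for mean_pred, observed, count in bins:
    print('predicted {:.3f}  observed {:.3f}  n = {}'.format(mean_pred, observed, count))
```

A well-calibrated model has observed rates close to the mean predicted probability in every bin; plotting the first value against the second gives the usual reliability diagram.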

__Discrimination, High-risk Subset Models__

__Calibration, High-risk Subset Models__
