# Developing Well Calibrated Illness Severity Scores
## Model Building
### C.V. Cosgriff, MIT Critical Data

The goal of this notebook is to apply the current state of the art in predictive modeling with structured data to a full ICU cohort and a high-risk sub-cohort, in order to compare discriminative ability and calibration in the high-risk cohort. By examining the role of constraining the case-severity mix, we seek to determine which strategy leads to models that can accurately forecast mortality in high-risk subsets, where previous models have struggled.

With respect to the modeling approaches, we'll implement a generalized linear model similar to APACHE IV as a baseline comparison. Given the extremely large number of features, we will impose regularization to reduce model complexity and prevent overfitting; we will therefore employ $L2$ penalization, also known as ridge regression. We will also implement a tree-based approach, as these have been exceedingly successful in recent work and are considered the state of the art for predictive modeling with structured data. Specifically, we will use gradient boosted trees as implemented by the _extreme gradient boosting_ algorithm in `xgboost` _(XGBoost: A Scalable Tree Boosting System, arXiv:1603.02754 [cs.LG])_.

__Notebook Outline:__
* Environment preparation
* Load training data
* Train models on full cohort
    * Penalized generalized linear model ($L2$, _ridge regression_)
    * Gradient boosted tree (_xgBoost_)
* Train models on high-risk subset of full cohort
    * Penalized generalized linear model ($L2$, _ridge regression_)
    * Gradient boosted tree (_xgBoost_)
    
_All model (hyper-)parameters will be chosen by 5-fold cross validation._

## 0 - Environment Setup

Here we'll load the standard data science stack; the preprocessing, linear model, and model evaluation tools from `scikit-learn`; and the gradient boosting classifier from `xgboost`. We are using version 0.19.2 of `scikit-learn` and version 0.72 of `xgboost`.

In [1]:
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
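# Imputer is the 0.19.x API; scikit-learn 0.20 introduced sklearn.impute.SimpleImputer
# as its replacement, and Imputer was removed in 0.22.
from sklearn.preprocessing import Imputer, RobustScaler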
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV

from xgboost import XGBClassifier

import pickle

## 1 - Load Training Data

In [2]:
train_X_full = pd.read_csv('../extraction/data/train_X.csv').set_index('patientunitstayid').values
train_y_full = pd.read_csv('../extraction/data/train_y.csv').values.ravel()
train_apache_full = pd.read_csv('../extraction/data/train_apache.csv').values.ravel()

We separate out the high-risk sub-cohort, defined as patients with an APACHE IV predicted mortality of at least 0.10.

In [3]:
train_X_HR = train_X_full[(train_apache_full >= 0.10), :] 
train_y_HR = train_y_full[(train_apache_full >= 0.10)] 

Because the _full cohort_ will always have more data, we randomly sample from the full training set so that the sample is the same size as the _high-risk cohort_. These data define the RS cohort.

In [4]:
np.random.seed(seed=42)
sample_index = np.random.choice(np.arange(0, train_X_full.shape[0]), size=train_y_HR.shape[0])
train_X = train_X_full[sample_index, :]
train_y = train_y_full[sample_index]
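
One detail of the sampling step is worth illustrating: `np.random.choice` draws with replacement unless `replace=False` is passed, so the same row index can be selected more than once. The sketch below uses made-up cohort sizes (`n_full` and `n_hr` are hypothetical, not the sizes of the actual data) to show the difference between the two modes.

```python
import numpy as np

rng = np.random.RandomState(42)
n_full, n_hr = 1000, 200  # hypothetical cohort sizes, for illustration only

# Default behaviour: sampling WITH replacement, so indices may repeat.
idx_with = rng.choice(np.arange(n_full), size=n_hr)

# Alternative: sampling WITHOUT replacement, so every index is distinct.
idx_without = rng.choice(np.arange(n_full), size=n_hr, replace=False)

print('Unique rows (with replacement):   ', len(np.unique(idx_with)))
print('Unique rows (without replacement):', len(np.unique(idx_without)))
```

Whether duplicates matter depends on the analysis; with replacement, the effective number of distinct patients in the sample can be somewhat smaller than the nominal sample size.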

## 2 - RS Models

__Ridge Logistic Regression__

Our first model is a linear model similar to that used in the development of APACHE IV. As we have many variables, we choose to constrain model complexity using $L2$ regularization, and thus will train a ridge logistic regression model. The regularization strength $\lambda$, which `scikit-learn` parameterizes through its inverse $C = \frac{1}{\lambda}$, is selected by 5-fold cross validation.

Unlike tree-based approaches, ridge regression requires that features be on the same scale for proper performance. It is also not robust to missing data. As such, we'll employ median imputation, followed by centering and scaling of the features with `RobustScaler`, and then ridge regression with 5-fold CV searching for $\lambda$ in $[1, 500)$ with a step size of 1.
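
To make the preprocessing concrete, here is a rough NumPy sketch of what the imputation and scaling steps compute, assuming `RobustScaler`'s default settings (centering on the median and scaling by the interquartile range). The toy matrix `X` is made up, and this is only an illustration of the idea, not a substitute for the `scikit-learn` pipeline. The last two lines show how the $\lambda$ grid maps to the `Cs` argument.

```python
import numpy as np

# Toy feature matrix with one missing value (illustrative data only).
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [4.0, 40.0]])

# Median imputation: replace each NaN with its column median.
col_median = np.nanmedian(X, axis=0)
X_imputed = np.where(np.isnan(X), col_median, X)

# Robust scaling: subtract the median and divide by the interquartile range.
center = np.median(X_imputed, axis=0)
q25, q75 = np.percentile(X_imputed, [25, 75], axis=0)
X_scaled = (X_imputed - center) / (q75 - q25)

# Grid of lambda values and the corresponding inverse strengths C = 1 / lambda.
lam = np.arange(1, 500, 1)
Cs = 1 / lam

print(X_scaled)
print(len(Cs), Cs.max(), Cs.min())
```

Because the median and interquartile range are less sensitive to extreme values than the mean and standard deviation, this form of scaling is a common choice when features contain outliers.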

In [5]:
K = 5
lam = np.arange(1, 500, 1)
ridge_classifier = Pipeline([('impute', Imputer(strategy='median')),
                             ('center_scale', RobustScaler()),
                             ('ridge', LogisticRegressionCV(Cs=(1/lam), cv=K, 
                                                            scoring='roc_auc', 
                                                            n_jobs=4, refit=True, 
                                                            random_state=42))])
ridge_classifier.fit(train_X, train_y)

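# scores_[1] holds the per-fold AUCs for the positive class, shape (n_folds, n_Cs)
scores = ridge_classifier.named_steps['ridge'].scores_[1]
scores_fold_avg = np.mean(scores, axis=0)
# Note: scores.max() is the best single-fold AUC over all C values;
# scores_fold_avg.max() would give the fold-averaged estimate instead.
print('Best AUC estimated by 5-fold CV: {0:.3f}'.format(scores.max()))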
print('Optimal lambda: {0:.3f}'.format(1 / ridge_classifier.named_steps['ridge'].C_[0]))

pickle.dump(ridge_classifier, open('./models/ridge_full-cohort', 'wb'))

Best AUC estimated by 5-fold CV: 0.895
Optimal lambda: 257.000
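
When reading the AUC printed above, it helps to distinguish the best single-fold score from the best fold-averaged score. The sketch below uses a small made-up array in place of `scores_[1]` (3 folds by 4 candidate values of $C$; the numbers are hypothetical) to show that the two quantities can differ and can even point at different candidates.

```python
import numpy as np

# Hypothetical per-fold AUCs: rows are folds, columns are candidate C values.
scores = np.array([[0.80, 0.84, 0.83, 0.79],
                   [0.82, 0.85, 0.90, 0.80],
                   [0.78, 0.86, 0.79, 0.81]])

# Average across folds to get one CV estimate per candidate.
scores_fold_avg = scores.mean(axis=0)

best_single_fold = scores.max()                 # best score in any one fold
best_fold_avg = scores_fold_avg.max()           # best cross-validated estimate
best_C_index = int(scores_fold_avg.argmax())    # candidate chosen by averaged CV

print('Best single-fold AUC: {0:.3f}'.format(best_single_fold))
print('Best fold-averaged AUC: {0:.3f}'.format(best_fold_avg))
print('Index of best candidate: {0}'.format(best_C_index))
```

The single-fold maximum is an optimistic figure, since it picks the most favorable fold; the fold-averaged maximum is the usual cross-validation estimate.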


__Gradient Boosting__  

As compared to logistic regression, which directly minimizes the log-loss via maximum likelihood estimation, boosted trees are typically poorly calibrated classifiers _(Obtaining Calibrated Probabilities from Boosting, arXiv:1207.1403 [cs.LG])_. However, `xgboost` implements multiple objective functions and can minimize the log-loss just as logistic regression does (`binary:logistic`). With a properly chosen loss function, post-hoc recalibration by Platt scaling or isotonic regression should therefore be unnecessary to achieve a well-calibrated classifier.
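
As a minimal illustration of what "well calibrated" means here, the sketch below computes the log-loss and a simple binned reliability table with NumPy. The helper names `log_loss` and `reliability_table` are hypothetical (they are not part of this notebook), and the data are simulated so that the predictions are calibrated by construction.

```python
import numpy as np

def log_loss(y_true, p, eps=1e-15):
    """Mean negative Bernoulli log-likelihood of predictions p."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def reliability_table(y_true, p, n_bins=10):
    """Per-bin (mean predicted risk, observed event rate, count)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each prediction to a bin index in 0..n_bins-1.
    bin_idx = np.digitize(p, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            rows.append((p[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows

# Simulated data: outcomes are drawn with exactly the predicted probability.
rng = np.random.RandomState(0)
p_sim = rng.uniform(0, 1, size=20000)
y_sim = (rng.uniform(0, 1, size=20000) < p_sim).astype(int)

table = reliability_table(y_sim, p_sim)
for mean_pred, observed, n in table:
    print('{0:.3f}  {1:.3f}  {2}'.format(mean_pred, observed, n))
print('log-loss: {0:.3f}'.format(log_loss(y_sim, p_sim)))
```

For a calibrated model the first two columns track each other closely in every bin; systematic gaps, especially in the high-risk bins, indicate miscalibration.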

In contrast to ridge regression, the gradient boosting model has far more hyperparameters to choose, and thus an exhaustive grid search is very computationally expensive. Hyperparameters will therefore be obtained by randomly sampling the hyperparameter space. Bergstra and Bengio showed this to be superior while remaining computationally cheaper _(Journal of Machine Learning Research 13 (2012) 281-305)_. This is implemented below, with 100 candidates sampled from the grid. Our version of `xgboost` is compiled with GPU support, and thus we'll use the GPU version of the fast histogram algorithm; the GPU on this machine is a Titan Xp. Because of the implementation we can only run one job at a time, although this is still substantially faster than parallelizing over CPU cores.
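
To give a sense of scale, the following standard-library sketch enumerates the grid defined in the next cell and draws 100 distinct candidates from it. It only mimics the idea of sampling a list-valued grid without replacement; it is not how `RandomizedSearchCV` is implemented internally, and the particular candidates it draws will not match those of the actual search.

```python
import itertools
import random

params = {'objective': ['binary:logistic'],
          'learning_rate': [0.01, 0.05, 0.10],
          'max_depth': [3, 6, 9, 12],
          'min_child_weight': [6, 8, 10, 12],
          'silent': [True],
          'subsample': [0.6, 0.8, 1],
          'colsample_bytree': [0.5, 0.75, 1],
          'n_estimators': [500, 1000]}

# Enumerate every combination in the grid.
keys = sorted(params)
grid = [dict(zip(keys, combo))
        for combo in itertools.product(*(params[k] for k in keys))]

# Draw 100 distinct candidates, analogous to n_iter=100.
rng = random.Random(42)
candidates = rng.sample(grid, 100)

n_folds = 5
print('Grid size:', len(grid))
print('Sampled candidates:', len(candidates))
print('Model fits with 5-fold CV:', n_folds * len(candidates))
```

An exhaustive search over this grid would require 5 fits for every one of its combinations, whereas the random search fixes the budget at 500 fits.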

In [6]:
params = {'objective':['binary:logistic'],
          'learning_rate': [0.01, 0.05, 0.10],
          'max_depth': [3, 6, 9, 12],
          'min_child_weight': [6, 8, 10, 12],
          'silent': [True],
          'subsample': [0.6, 0.8, 1],
          'colsample_bytree': [0.5, 0.75, 1],
          'n_estimators': [500, 1000]}

xgb_model = XGBClassifier(tree_method='gpu_hist', predictor='gpu_predictor')
skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=42)
cv_grid_search = RandomizedSearchCV(xgb_model, param_distributions=params, n_iter=100, 
                                    scoring='roc_auc', n_jobs=1, cv=skf.split(train_X, train_y), 
                                    verbose=1, random_state=42)
cv_grid_search.fit(train_X, train_y)
print('Best AUC estimated by 5-fold CV: {0:.3f}'.format(cv_grid_search.best_score_))

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed: 35.8min finished


Best AUC estimated by 5-fold CV: 0.930


We then output the best model to examine the parameters chosen.

In [7]:
print(cv_grid_search.best_estimator_)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.75, gamma=0, learning_rate=0.01,
       max_delta_step=0, max_depth=12, min_child_weight=6, missing=None,
       n_estimators=1000, n_jobs=1, nthread=None,
       objective='binary:logistic', predictor='gpu_predictor',
       random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=0.6, tree_method='gpu_hist')


Finally, we fit this optimal estimator on all of the RS training data and save the result.

In [8]:
xgb_classifier = cv_grid_search.best_estimator_
xgb_classifier.fit(train_X, train_y)
pickle.dump(xgb_classifier, open('./models/xgb_full-cohort', 'wb'))
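
For completeness, here is a small sketch of the save-and-reload round trip using context managers, so that the file handles are closed deterministically. The dictionary `model_stub` is a hypothetical stand-in for a fitted estimator, and the temporary directory replaces the `./models/` path used in this notebook.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted estimator (hypothetical; any picklable object works).
model_stub = {'name': 'xgb_full-cohort', 'max_depth': 12}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_stub')

    # Save: the file is closed as soon as the with-block exits.
    with open(path, 'wb') as f:
        pickle.dump(model_stub, f)

    # Reload the object for later analysis.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored)
```

Pickled estimators are generally only reliable when reloaded with the same library versions used to create them, which is one reason the versions are recorded in the environment setup.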

## 3 - High-risk Models

We'll now repeat the above modeling steps in the high-risk cohort, which was selected above as patients with an APACHE IV predicted mortality $\geq 0.10$.

We will produce the same models as above. Because we are only changing the training data, the process will be less verbose, and unless otherwise stated everything is as above.

__Ridge Logistic Regression__

In [9]:
ridge_classifier_HR = Pipeline([('impute', Imputer(strategy='median')),
                                ('center_scale', RobustScaler()),
                                ('ridge', LogisticRegressionCV(Cs=(1/lam), cv=K, 
                                                               scoring='roc_auc', 
                                                               n_jobs=4, refit=True, 
                                                               random_state=42))])
ridge_classifier_HR.fit(train_X_HR, train_y_HR)

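# scores_[1] holds the per-fold AUCs for the positive class, shape (n_folds, n_Cs)
scores = ridge_classifier_HR.named_steps['ridge'].scores_[1]
scores_fold_avg = np.mean(scores, axis=0)
# Note: scores.max() is the best single-fold AUC over all C values;
# scores_fold_avg.max() would give the fold-averaged estimate instead.
print('Best AUC estimated by 5-fold CV: {0:.3f}'.format(scores.max()))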
print('Optimal lambda: {0:.3f}'.format(1 / ridge_classifier_HR.named_steps['ridge'].C_[0]))

pickle.dump(ridge_classifier_HR, open('./models/ridge_HR-cohort', 'wb'))

Best AUC estimated by 5-fold CV: 0.823
Optimal lambda: 425.000


__Gradient Boosting__

5-fold CV hyperparameter search.

In [10]:
params = {'objective':['binary:logistic'],
          'learning_rate': [0.01, 0.05, 0.10],
          'max_depth': [3, 6, 9, 12],
          'min_child_weight': [6, 8, 10, 12],
          'silent': [True],
          'subsample': [0.6, 0.8, 1],
          'colsample_bytree': [0.5, 0.75, 1],
          'n_estimators': [500, 1000]}

xgb_model = XGBClassifier(tree_method='gpu_hist', predictor='gpu_predictor')
skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=42)
cv_grid_search_HR = RandomizedSearchCV(xgb_model, param_distributions=params, n_iter=100, 
                                    scoring='roc_auc', n_jobs=1, cv=skf.split(train_X_HR, train_y_HR), 
                                    verbose=1, random_state=42)
cv_grid_search_HR.fit(train_X_HR, train_y_HR)
print('Best AUC estimated by 5-fold CV: {0:.3f}'.format(cv_grid_search_HR.best_score_))

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed: 43.8min finished


Best AUC estimated by 5-fold CV: 0.852


Output the optimal estimator.

In [11]:
print(cv_grid_search_HR.best_estimator_)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.5, gamma=0, learning_rate=0.01, max_delta_step=0,
       max_depth=9, min_child_weight=10, missing=None, n_estimators=1000,
       n_jobs=1, nthread=None, objective='binary:logistic',
       predictor='gpu_predictor', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=0.6, tree_method='gpu_hist')


Fit optimal model on all training data, and save the final model.

In [12]:
xgb_classifier_HR = cv_grid_search_HR.best_estimator_
xgb_classifier_HR.fit(train_X_HR, train_y_HR)
pickle.dump(xgb_classifier_HR, open('./models/xgb_HR-cohort', 'wb'))

We now turn to model analysis.