# Exercise 5: Cross-validation 

A common problem in model fitting is that a model may simply memorize the labels it has seen in the training dataset but fail to perform well on completely new data in the testing dataset. This is called overfitting.

K-fold cross validation is a technique commonly used to detect overfitting in clinical data analysis. As demonstrated previously, we split the data into a training set and a testing set for model fitting and evaluation. In K-fold cross validation, we instead split the dataset into K subsets, called folds. We then fit the model K times; each time, we use K-1 of the folds as training data and evaluate the model on the remaining fold (called the validation fold). For example, with K=5 we perform 5-fold cross validation and split the data into 5 subsets. In the first iteration, we train the model on the first four folds and evaluate its performance on the fifth fold. In the second iteration, we train on the first, second, third and fifth folds and evaluate on the fourth fold. We repeat this procedure through the fifth iteration, so that each fold serves as the validation fold exactly once, and then average the performance metrics over the iterations to obtain the final validation metrics for the model.
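
The procedure described above can be sketched as an explicit loop. This is a minimal illustration only: it assumes synthetic data from `make_classification` and a `LogisticRegression` model as stand-ins (neither is the dataset nor the model used in this exercise), and it uses scikit-learn's `KFold` to produce the fold indices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: not the clinical dataset from this exercise)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Split the row indices into 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Fit a fresh model on the K-1 training folds
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Evaluate accuracy on the held-out validation fold
    score = model.score(X[val_idx], y[val_idx])
    fold_scores.append(score)
    fold_sizes.append((len(train_idx), len(val_idx)))
    print("Fold {}: trained on {} rows, validated on {} rows, accuracy {:.2f}".format(
        fold, len(train_idx), len(val_idx), score))

# Average the per-fold scores to get the final validation metric
mean_score = np.mean(fold_scores)
print("Mean validation accuracy over 5 folds: {:.2f}".format(mean_score))
```

The order in which folds are held out does not matter; what matters is that every fold is used for validation exactly once.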

In this exercise, we will perform 5-fold cross validation with Python and the scikit-learn package.

In [2]:
# Repeat the data preparation from the previous exercises
from sklearn.model_selection import train_test_split
import pandas as pd

data = pd.read_csv("./imputed_data.csv", index_col=0)
# Select the feature columns and one-hot encode the categorical ones
features = data.iloc[:, 3:-1]
features = pd.get_dummies(features).values
# The mort_icu column holds the labels
labels = data.mort_icu.values
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=2018)

In [3]:
# Import cross validation functions 
from sklearn.model_selection import cross_validate

In [4]:
# Import the model we will use for each fold
from sklearn.ensemble import RandomForestClassifier

In [5]:
# Define the model
rf_model = RandomForestClassifier(n_estimators=1000)

To use the cross validation function in scikit-learn, we need to specify the following arguments:

+ estimator: the model to fit to the data
+ X: the features to feed to the model
+ y: the labels of the data
+ cv: the number of folds
+ scoring: the performance metrics used to evaluate the model
+ n_jobs: the number of CPUs to use for the computation
+ return_train_score: whether to include the training scores in the results

As in previous exercises, we use 'accuracy' and 'roc_auc' as our performance metrics (scores).

In [6]:
k_fold_cv = cross_validate(rf_model, features, labels, cv=5, 
                           scoring=('accuracy', 'roc_auc'), 
                           n_jobs=8, 
                           return_train_score=True)
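
`cross_validate` returns a dictionary of arrays with one entry per fold. The sketch below shows which keys to expect when two metrics are requested together with `return_train_score=True`. It assumes synthetic data and a deliberately small forest (`n_estimators=20`) so that it runs quickly; neither matches the settings used in this exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: not the clinical dataset from this exercise)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Small forest chosen only to keep the example fast
small_rf = RandomForestClassifier(n_estimators=20, random_state=0)

cv_results = cross_validate(small_rf, X, y, cv=5,
                            scoring=('accuracy', 'roc_auc'),
                            return_train_score=True)

# Each key maps to an array with one value per fold
for key in sorted(cv_results):
    print(key, cv_results[key].shape)
```

The `test_*` entries hold the scores on the held-out fold of each iteration and the `train_*` entries hold the scores on the folds used for fitting. When `cv` is an integer and the estimator is a classifier, scikit-learn stratifies the folds by label.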

In [7]:
print("The training accuracy of 5-fold cross validation is {:.2f}. ".format(k_fold_cv['train_accuracy'].mean()))
print("The training AUROC of 5-fold cross validation is {:.2f}. ".format(k_fold_cv['train_roc_auc'].mean()))

The training accuracy of 5-fold cross validation is 1.00. 
The training AUROC of 5-fold cross validation is 1.00. 


In [8]:
print("The testing accuracy of 5-fold cross validation is {:.2f}. ".format(k_fold_cv['test_accuracy'].mean()))
print("The testing AUROC of 5-fold cross validation is {:.2f}. ".format(k_fold_cv['test_roc_auc'].mean()))

The testing accuracy of 5-fold cross validation is 0.93. 
The testing AUROC of 5-fold cross validation is 0.88. 


We have now performed cross validation. Please note that cross validation does not "solve" the overfitting problem; instead, it helps you detect it. If performance on the testing folds is drastically worse than on the training folds (as a rough guide, about 20-30% worse), overfitting is likely occurring and you should consider adjusting your model parameters.

In the above example, the testing scores (accuracy 0.93, AUROC 0.88) are lower than the training scores (both 1.00), but the gap stays below that threshold. The overfitting problem is not serious in this case.
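
To make this comparison concrete, the gap can be expressed as a relative drop from the training score to the testing score. The helper `relative_drop` below is a hypothetical name introduced for illustration, and the inputs are the rounded scores printed above rather than the exact per-fold values.

```python
def relative_drop(train_score, test_score):
    """Fractional drop from the training score to the testing score."""
    return (train_score - test_score) / train_score

# Rounded mean scores reported above: (training, testing)
reported = {"accuracy": (1.00, 0.93), "roc_auc": (1.00, 0.88)}

drops = {}
for metric, (train_score, test_score) in reported.items():
    drops[metric] = relative_drop(train_score, test_score)
    print("{}: testing score is {:.0%} lower than training score".format(
        metric, drops[metric]))
```

Both drops, roughly 7% for accuracy and 12% for AUROC, fall below the 20-30% rule of thumb mentioned above.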