# Exercise 4: Model fitting and performance evaluation 


In [10]:
import pandas as pd

# import random forest model 
from sklearn.ensemble import RandomForestClassifier

In [11]:
# Perform the previous operations on data
from sklearn.model_selection import train_test_split
data = pd.read_csv("./imputed_data.csv", index_col=0)
features = data.iloc[:, 3:-1]
features = pd.get_dummies(features).values
labels = data.mort_icu.values
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=2018
)

## Random Forest model fitting 

As discussed in the background section, a random forest is a meta-model that fits a number of decision tree classifiers on sub-samples of the dataset. It then averages the predictions of the individual trees to improve accuracy and control overfitting.
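To make the averaging step concrete, here is a minimal sketch on synthetic data (an assumption; it does not use the ICU dataset). It compares the forest's predicted class probabilities with the mean of the probabilities from its individual trees. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the ICU dataset used in this exercise).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Each fitted tree is stored in forest.estimators_; collect their class probabilities.
tree_probs = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
mean_tree_prob = tree_probs.mean(axis=0)

# The forest's own probabilities should agree with the per-tree average.
forest_prob = forest.predict_proba(X)
print(np.allclose(mean_tree_prob, forest_prob))
```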

The model fitting workflow is as follows:

+ Define the model parameters
+ Feed in the training features and labels to fit the model 
+ Feed in the testing features to your fitted model and compute the predictions.
+ Evaluate the testing labels with the predictions on the proper metrics 
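The four steps above can be sketched end to end as follows. This is only an illustration on synthetic data (an assumption), with a small forest chosen so that it runs quickly; the cells later in this exercise apply the same steps to the ICU data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the ICU dataset).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 1. Define the model parameters.
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# 2. Fit the model on the training features and labels.
clf.fit(X_train, y_train)

# 3. Compute predictions for the testing features.
y_pred = clf.predict(X_test)

# 4. Compare the testing labels with the predictions.
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}".format(acc))
```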

#### Define the model

We use `RandomForestClassifier` from the [scikit-learn](https://scikit-learn.org/stable/index.html) Python package. Its detailed documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). For a clinical problem, we commonly adjust the following parameters to obtain a good model:

+ n_estimators: number of decision trees
+ max_features: max number of features considered for splitting a node
+ max_depth: max number of levels in each decision tree
+ min_samples_split: min number of data points placed in a node before the node is split
+ min_samples_leaf: min number of data points allowed in a leaf node
+ bootstrap: method for sampling data points (with or without replacement)

We may review the literature related to the clinical problem to choose these parameters, or use a grid search. In this example, we first try the default hyperparameters with 1000 decision trees to get a baseline performance.
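A grid search could look like the sketch below, which uses scikit-learn's `GridSearchCV`. The grid values, the synthetic data, and the choice of `roc_auc` as the scoring metric are all assumptions made for illustration, not recommendations for the ICU data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the ICU dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical, deliberately small grid so the search finishes quickly.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}

# Try every combination with 3-fold cross-validation, scored by AUROC.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print("Best cross-validated AUROC: {:.2f}".format(search.best_score_))
```

The number of model fits grows with the product of the grid sizes, so a real search on a large dataset is usually kept to a few values per parameter.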

In [24]:
# Instantiate the model with 1000 decision trees; all other hyperparameters keep their defaults
model = RandomForestClassifier(n_estimators=1000)

## Feed training data and fit the model

In [25]:
# Fit the model on the training set.
model.fit(train_features, train_labels)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

## Feed in testing data and compute predictions

In [26]:
test_pred = model.predict(test_features)

## Evaluate the performance

We use the following metrics to evaluate the performance of the fitted model:

+ Accuracy
+ Area under Receiver Operating Characteristic (AUROC)
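The two metrics can disagree when the classes are imbalanced, which is why both are reported. The toy example below uses made-up labels (an assumption, not data from this exercise): a classifier that always predicts the negative class scores high on accuracy but only at chance level on AUROC.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels: nine negatives and one positive (imbalanced on purpose).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# A trivial "classifier" that predicts the negative class for everyone.
always_negative = [0] * 10

toy_acc = accuracy_score(y_true, always_negative)
toy_auc = roc_auc_score(y_true, always_negative)

print("Accuracy: {:.2f}".format(toy_acc))  # prints "Accuracy: 0.90"
print("AUROC: {:.2f}".format(toy_auc))     # prints "AUROC: 0.50"
```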

In [27]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

In [28]:
# Accuracy (scikit-learn metrics take the true labels first, then the predictions)
acc = accuracy_score(test_labels, test_pred)
print("The testing accuracy is {:.2f} %".format(acc*100))

The testing accuracy is 93.40 %


In [29]:
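# Caution: roc_auc_score expects (y_true, y_score), ideally with predicted probabilities as y_score.
# This call passes the hard predictions in the y_true position, so the printed value
# is not the conventional AUROC.
auc = roc_auc_score(test_pred, test_labels)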
print("The testing AUROC is {:.2f}".format(auc))

The testing AUROC is 0.83
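`roc_auc_score` takes the true labels as its first argument and a continuous score as its second. A minimal sketch of the conventional computation, using the predicted probability of the positive class from `predict_proba`, is shown below. It runs on synthetic, imbalanced data (an assumption), so its result is not comparable to the numbers reported in this exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: not the ICU dataset).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Locate the probability column that corresponds to the positive class (label 1).
pos_col = list(rf.classes_).index(1)
y_score = rf.predict_proba(X_test)[:, pos_col]

# True labels first, continuous scores second.
auc_from_scores = roc_auc_score(y_test, y_score)
print("AUROC from probabilities: {:.2f}".format(auc_from_scores))
```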


# Support Vector Machine (SVM) Fitting

The workflow for fitting an SVM is the same as for fitting a random forest model. The only difference to consider is hyperparameter tuning. We commonly adjust the following hyperparameters for an SVM:
+ C: Penalty parameter C of the error term.
+ kernel : Kernel type of the model, one of 'linear', 'poly', 'rbf', 'sigmoid' or 'precomputed'. 
+ gamma : The coefficient for 'rbf', 'poly' and 'sigmoid' kernels. 

We use default hyperparameters first in this exercise. 
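For reference, setting these hyperparameters explicitly could look like the sketch below. The values of `C` and `gamma`, the synthetic data, and the `StandardScaler` step are assumptions for illustration; scaling is included because RBF kernels are sensitive to feature scale, not because the exercise prescribes it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the ICU dataset).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical hyperparameter values, chosen only to show where they go.
svm_clf = make_pipeline(
    StandardScaler(),                     # standardize features before the kernel
    SVC(C=1.0, kernel="rbf", gamma=0.1),  # penalty, kernel type, kernel coefficient
)
svm_clf.fit(X_train, y_train)

svm_acc = svm_clf.score(X_test, y_test)  # mean accuracy on the test split
print("Accuracy: {:.2f}".format(svm_acc))
```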

In [18]:
from sklearn.svm import SVC

In [19]:
# Use a distinct variable name so the model does not shadow the sklearn.svm module
svm_model = SVC()

In [20]:
# Model fitting. The fitting may take some time due to the large number of patients in the dataset. 
svm_model.fit(train_features, train_labels)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [21]:
test_pred = svm_model.predict(test_features)

In [22]:
# scikit-learn metrics take the true labels first, then the predictions
acc = accuracy_score(test_labels, test_pred)
print("The testing accuracy is {:.2f} %".format(acc*100))

The testing accuracy is 92.66 %


In [23]:
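# Caution: roc_auc_score expects (y_true, y_score), ideally with a continuous score as y_score.
# This call passes the hard predictions in the y_true position, so the printed value
# is not the conventional AUROC.
auc = roc_auc_score(test_pred, test_labels)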
print("The testing AUROC is {:.2f}".format(auc))

The testing AUROC is 0.96
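An `SVC` created with `probability=False` has no `predict_proba`, but its `decision_function` returns a signed distance to the decision boundary that can serve as the continuous score for AUROC. The sketch below illustrates this on synthetic data (an assumption), with a hypothetical `gamma` value; its result is not comparable to the numbers reported in this exercise.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (assumption: not the ICU dataset).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

svc = SVC(kernel="rbf", gamma=0.1).fit(X_train, y_train)

# For a binary problem, decision_function returns one score per sample;
# larger values mean the sample lies further on the positive-class side.
svm_scores = svc.decision_function(X_test)

# True labels first, continuous scores second.
svm_auc = roc_auc_score(y_test, svm_scores)
print("AUROC from decision scores: {:.2f}".format(svm_auc))
```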


# Further readings

To obtain a good model, the hyperparameters listed in the "Define the model" section must be tuned properly. To learn more about random forest hyperparameters, see the following online materials:

- [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)

- [Tuning the parameters of your Random Forest model](https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/)