![alt text](images/HDAT9500Banner.PNG)
<br>

# Chapter 4: Tree Based Methods

# Assessment: Random Forest


# 1. Introduction

In this exercise, we will work with tree-based methods. Starting from the basic decision tree, we will see how the performance of decision trees can be improved via ensemble methods, specifically random forests.


## 1.1. Aims of the Exercise:
 1. To build on the single Decision Tree by training and assessing a Random Forest.
 2. To explore parameters and determine appropriate choices.

 
It aligns with all the learning outcomes of our course: 

1.	Distinguish a range of task specific machine learning techniques appropriate for Health Data Science.
2.	Design machine learning tasks for Health Data Science scenarios.
3.	Construct appropriate training and test sets for health research data.


## 1.2. Jupyter Notebook Instructions
1. Read the content of each cell.
2. Where necessary, follow the instructions that are written in each cell.
3. Run/Execute all the cells that contain Python code sequentially (one at a time), using the "Run" button.
4. For those cells in which you are asked to write some code, please write the Python code first and then execute/run the cell.
 
## 1.3. Tips
 1. The square brackets on the left-hand side of each cell indicate whether the cell has been executed or not. Empty square brackets mean that the cell has not been executed, whereas square brackets that contain a number mean that the cell has been executed. Run all the cells in sequence, using the "Run" button.
 2. To edit this notebook, double-click on a cell. Each cell in the document can be a "Code" cell or a "text-Markdown" cell. To choose between these two options, use the combo-box above. 
 3. To save your notebook, press the "floppy disk" icon button above. 
 4. To clear the output of all cells and restart the notebook, go to Cell->All Output->Clear


# 2. Load the liver patient dataset

Data Set Information (from https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)):<p>

This data set contains 416 liver patient records and 167 non-liver patient records. The data set was collected from test samples in North East of Andhra Pradesh, India. 'is_patient' is the class label used to divide the records into two groups (liver patient or not); in the cleaned file loaded below, this label appears as the column 'liver_patient'. This data set contains 441 male patient records and 142 female patient records. Any patient whose age exceeded 89 is listed as being of age "90".<p>


Attribute Information:<p>
    1. Age: Age of the patient
    2. Gender: Gender of the patient
    3. Total_Bilirubin: Total Bilirubin
    4. Direct_Bilirubin: Direct Bilirubin
    5. Alkaline_Phosphatase: Alkaline Phosphatase
    6. Alanine_Aminotransferase: Alanine Aminotransferase
    7. Aspartate_Aminotransferase: Aspartate Aminotransferase
    8. Total_Proteins: Total Proteins
    9. Albumin: Albumin
    10. Albumin_and_Globulin_Ratio: Albumin and Globulin Ratio

In [8]:
import sys
print(sys.version)
#For this notebook to work, Python must be 3.6.4 or 3.6.5

import numpy as np
import pandas as pd
from IPython.display import display

from plotnine import *

3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)]


In [16]:
liver = pd.read_csv('data/liver-data/data_Clean.csv', sep=',')

In [17]:
# Sanity Check:
display(liver.head())  # first five rows
print(liver.shape)

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,liver_patient
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.0,1
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.4,1


(579, 11)


## 2.1. Split the data into features and response

In [24]:
X = liver.drop(['liver_patient'], axis = 1)
y = liver[['liver_patient']].values

In [25]:
display(X.head())  # first five rows of the features

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.0
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.4


In [21]:
display(y[0:5])

array([[1],
       [1],
       [1],
       [1],
       [1]], dtype=int64)

# 3. Predicting Liver Patient using Random Forests
In this section we will tune the parameters and eventually build a random forest to predict if a patient suffers from liver disease or not.

## 3.1. Parameters
Here are some of the more important parameters:

* n_estimators: The number of trees in the forest. Unlike Gradient Boosted Decision Trees, adding more estimators to a Random Forest does not by itself cause overfitting: performance generally improves and then plateaus, so there are diminishing returns. In practice, the number of estimators is limited by our time and computational resources. <p>
* max_features: This is the number of features to consider when looking for the best split. max_features determines how random each tree is. A large max_features means the trees will be more similar, possibly allowing for overfitting. On the other hand, a smaller max_features reduces overfitting, but may force each tree to be very deep. For classification, it is common to use the default of max_features = $\sqrt {number\:of\:features}$.<p>
    
* class_weight: Random Forest has a parameter to penalise incorrect class labels differently. This is very useful for our imbalanced data.<p>
    
* max_depth: This parameter controls the maximum depth of each tree. If not specified, then nodes are expanded until all leaves are pure, or until all leaves contain fewer than min_samples_split samples.
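
To make the parameters above concrete, the following is a minimal sketch of how they map onto scikit-learn's `RandomForestClassifier`. It assumes a synthetic, imbalanced data set from `make_classification` rather than the liver data, and the specific values (`max_depth=4`, `class_weight='balanced'`) are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in data (assumption: NOT the liver dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.3, 0.7], random_state=0
)

forest = RandomForestClassifier(
    n_estimators=100,         # number of trees in the forest
    max_features='sqrt',      # features considered at each split
    class_weight='balanced',  # reweight classes inversely to their frequency
    max_depth=4,              # cap on the depth of every tree (illustrative value)
    random_state=0,
)
forest.fit(X_demo, y_demo)

# Inspect the fitted forest: how many trees, and how deep the deepest one is.
n_trees = len(forest.estimators_)
deepest = max(tree.tree_.max_depth for tree in forest.estimators_)
print(n_trees, deepest)
```

Checking the depth of the fitted trees is a quick way to confirm that `max_depth` is being respected.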


## 3.2. Tuning our Model
We will use a cross-validated grid search, using GridSearchCV. The parameters we will tune are max_depth and class_weight. We will use the default max_features = $\sqrt {number\:of\:features}$. <p>

We will seek to maximise the unweighted average of the per-class F1 scores (f1_macro). While we are using grid search, we will keep n_estimators at a fairly low value (n_estimators = 100), so that the algorithm doesn't take too long. Once we have found the optimum parameters, we will use a larger number of estimators for the final model.
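
The tuning strategy described above could be sketched roughly as follows. This is only an illustration on synthetic data: the grid values, the 3-fold cross-validation, and the small `n_estimators=20` (chosen so the example runs quickly) are assumptions and are not the values expected for the assessment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (assumption: NOT the liver dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, weights=[0.3, 0.7], random_state=0
)

# Illustrative grid only; the real grid should be chosen by experimentation.
demo_grid = {'max_depth': [2, 4], 'class_weight': [None, 'balanced']}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),  # few trees: fast search
    param_grid=demo_grid,
    scoring='f1_macro',  # unweighted mean of the per-class F1 scores
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_)

# Refit with the selected parameters and a larger number of trees.
final_forest = RandomForestClassifier(
    n_estimators=200, random_state=0, **search.best_params_
)
final_forest.fit(X_demo, y_demo)
```

Keeping the search cheap and spending the extra trees only on the final model follows the reasoning given above: more estimators rarely change which parameter combination wins, but they do cost time.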

In [35]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier 
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter('ignore')  # suppress warnings

### 3.2.1. Split the data into training and test set

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0, test_size = 0.2)
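
The effect of `stratify=y` in the split above can be checked with a small sketch. The arrays here are made-up toy data (70 positives, 30 negatives), used only to show that a stratified split keeps roughly the same class proportion in the training and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (assumption: illustrative only): 70 positives and 30 negatives.
y_demo = np.array([1] * 70 + [0] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0, test_size=0.2
)

# Proportion of positives in each part; stratification keeps these close.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(len(y_tr), len(y_te), train_rate, test_rate)
```

With imbalanced classes, as in the liver data, this matters because an unstratified split can leave the test set with too few examples of the minority class.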

### Instructions

Build a Random Forest Classifier to predict if a patient is a liver patient or not. 

1. Use GridSearchCV to find the best 'max_depth' and 'class_weight'. 
2. Train a new classifier using that 'max_depth' and 'class_weight'. 
3. Assess the classifier on the test set: accuracy, f1 score, f1_macro, precision, recall, and AUC/ROC.
4. Display the feature importances.
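
As a rough guide to steps 3 and 4, the sketch below computes the listed metrics and the feature importances with scikit-learn. It assumes synthetic data and hypothetical feature names (`f0`, `f1`, ...), and an untuned classifier, so the numbers it prints say nothing about the liver data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.3, 0.7], random_state=0
)
feature_names = ['f%d' % i for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0, test_size=0.2
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)               # hard class labels
y_score = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

metrics = {
    'accuracy': accuracy_score(y_te, y_pred),
    'f1': f1_score(y_te, y_pred),
    'f1_macro': f1_score(y_te, y_pred, average='macro'),
    'precision': precision_score(y_te, y_pred),
    'recall': recall_score(y_te, y_pred),
    'roc_auc': roc_auc_score(y_te, y_score),  # needs scores, not labels
}
print(metrics)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(clf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Note that the ROC AUC is computed from predicted probabilities rather than predicted labels, whereas the other metrics use the labels.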

Tips:
1. Follow the template of the second part of Exercise 1. In Exercise 1, we searched the best parameters in two rounds. Do only one round here, but tune the grid as many times as you need. You can give an explanation of the tuning that you followed in the space provided below.
2. Pay attention to the categorical variables.
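
Regarding the second tip, scikit-learn's Random Forest expects numeric inputs, so a text column such as Gender needs to be encoded first. One possible approach, sketched here on a hypothetical four-row miniature of the data (only the column names are taken from the notebook), is one-hot encoding with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical miniature of the feature table (assumption: illustrative rows).
toy = pd.DataFrame({
    'Age': [65, 62, 58, 72],
    'Gender': ['Female', 'Male', 'Male', 'Male'],
    'Total_Bilirubin': [0.7, 10.9, 1.0, 3.9],
})

# One-hot encode Gender; drop_first=True keeps a single indicator column,
# which is enough for a two-level variable.
toy_encoded = pd.get_dummies(toy, columns=['Gender'], drop_first=True)
print(toy_encoded.columns.tolist())
```

Other encodings (for example, mapping the two levels to 0 and 1 directly) would work equally well for a binary variable; the important point is that the encoding is applied before the data are passed to the classifier.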

In [55]:
param_grid = {'max_depth': [values for max_depth],
              'class_weight': [values for class_weight]}  

<b> Parameters Tuning:</b>
#####################################################################################################################

(Double-click here)


#####################################################################################################################