![alt text](images/HDAT9500Banner.PNG)
<br>

# Chapter 4: Tree Based Methods

# Assessment: Random Forest

#####################################################################################

Double-click to write down your name and surname.

**Name:**


**Surname:**

**Honour Pledge** <p>
    
    
Declaration: <p>
    
    
I declare that this assessment item is my own work, except where acknowledged, and has not been submitted for academic credit elsewhere or previously, or produced independently of this course (e.g. for a third party such as your place of employment) and acknowledge that the assessor of this item may, for the purpose of assessing this item: 

    a. Reproduce this assessment item and provide a copy to another member of the University; and/or 
    b. Communicate a copy of this assessment item to a plagiarism checking service (which may then retain a copy of the assessment item on its database for the purpose of future plagiarism checking). 

#####################################################################################

# 1. Introduction

In this assessment, we will apply tree-based methods. The basic decision tree is the building block, and its performance can be improved via ensemble methods. Here we focus on one such ensemble method: the random forest.


## 1.1. Aims of the Exercise:
 1. To introduce Random Forest.
 2. To explore parameters and determine appropriate choices.

 
It aligns with all of the learning outcomes of our course: 

1.	Distinguish a range of task specific machine learning techniques appropriate for Health Data Science.
2.	Design machine learning tasks for Health Data Science scenarios.
3.	Construct appropriate training and test sets for health research data.


## 1.2. Jupyter Notebook Instructions
1. Read the content of each cell.
2. Where necessary, follow the instructions that are written in each cell.
3. Run/Execute all the cells that contain Python code sequentially (one at a time), using the "Run" button.
4. For those cells in which you are asked to write some code, please write the Python code first and then execute/run the cell.
 
## 1.3. Tips
 1. The square brackets on the left-hand side of each cell indicate whether the cell has been executed or not. Empty square brackets mean that the cell has not been executed, whereas square brackets that contain a number mean that the cell has been executed. Run all the cells in sequence, using the "Run" button.
 2. To edit this notebook, just double-click in each cell. In the document, each cell can be a "Code" cell or "text-Markdown" cell. To choose between these two options, go to the combo-box above. 
 3. If you want to save your notebook, please make sure you press the "floppy disk" icon button above. 
 4. To clear the output of all cells and restart the notebook, please go to Cell->All Output->Clear.

# 2. Load the liver patient dataset

The number of patients with liver disease has been increasing continuously because of excessive consumption of alcohol, inhalation of harmful gases, etc. <p>

This data set contains 416 liver patient records and 167 non-liver patient records collected from the north east of Andhra Pradesh, India. The class label (the "Dataset" column in the source data, used in the code below as `liver_patient`) divides the records into liver patient (liver disease = 1) or not (no disease = 0). This data set contains 441 male patient records and 142 female patient records.<p>

Any patient whose age exceeded 89 is listed as being of age "90".<p>

Use these patient records to determine which patients have liver disease and which ones do not.<p>

Attribute Information:<p>
    1. Age: Age of the patient
    2. Gender: Gender of the patient
    3. Total_Bilirubin: Total Bilirubin
    4. Direct_Bilirubin: Direct Bilirubin
    5. Alkaline_Phosphatase: Alkaline Phosphatase
    6. Alanine_Aminotransferase: Alanine Aminotransferase
    7. Aspartate_Aminotransferase: Aspartate Aminotransferase
    8. Total_Proteins: Total Proteins
    9. Albumin: Albumin
    10. Albumin_and_Globulin_Ratio: Albumin and Globulin Ratio
    
Source: https://www.kaggle.com/arslanengr/liver-patient-classification-data-analysis/data

Journal Article: https://pdfs.semanticscholar.org/c92d/38a7a76c20a317de63fb9278bb10102c758b.pdf

### <font color='blue'> Question 1: What is the research question of this problem? (5 marks)</font>
<p><font color='green'> Tip: Read the journal article cited above </font></p>

<b> Write the answer here:</b>
#####################################################################################################################

(Double-click here)


#####################################################################################################################

In [None]:
import sys
print(sys.version)

import numpy as np
import pandas as pd
from IPython.display import display

from plotnine import *

In [None]:
liver = pd.read_csv('data/liver-data/data_Clean.csv', sep=',')

In [None]:
# Sanity Check:
#display(liver[:][:5])
#print(liver.shape)

## 2.1. Split the data into features and response

In [None]:
X = liver.drop(['liver_patient'], axis = 1)
y = liver['liver_patient'].values  # 1-D array of labels, the shape scikit-learn expects

In [None]:
# Sanity Check
# display(X[:][:5])

In [None]:
# Sanity Check
# display(y[0:5])

![alt text](images/ML-work-flow.PNG)

### <font color='blue'> Question 2 - Step 2 - Preparing Data: Visualise, explore and clean the data (if necessary). (20 marks)</font>

In [None]:
# Write Python code here:
# Minimum of 3 plots and 1 descriptive table












# 3. Predicting Liver Patient using Random Forests
In this section we will tune the parameters and eventually build a random forest to predict if a patient suffers from liver disease or not.

## 3.1. Parameters
Here are some of the more important parameters:

* n_estimators: The number of trees in the forest. Unlike Gradient Boosted Decision Trees, adding more trees to a Random Forest does not make it overfit. Performance generally improves and then plateaus, so there are diminishing returns. In practice, the number of estimators is limited by our time and computational resources. <p>
* max_features: The number of features to consider when looking for the best split. max_features determines how random each tree is. A large max_features makes the trees more similar to one another, which can allow overfitting. A smaller max_features reduces overfitting, but the trees may need to be very deep to fit the data. For classification, it is common to use the default of max_features = $\sqrt {number\:of\:features}$.<p>
    
* class_weight: This parameter weights the classes so that errors on one class are penalised more heavily than errors on another. This is very useful for imbalanced data such as ours.<p>
    
* max_depth: This parameter controls the maximum depth of each tree. If not specified, the nodes are expanded until all leaves are pure, or until all leaves contain fewer than min_samples_split samples.
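
As a rough illustration of how the parameters above are passed to scikit-learn, here is a minimal sketch. It assumes synthetic, imbalanced data from `make_classification` rather than the liver dataset, and the names `X_demo`, `y_demo` and `rf_demo`, as well as the parameter values, are hypothetical choices for illustration only, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in data (assumption: NOT the liver dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.3, 0.7], random_state=0
)

rf_demo = RandomForestClassifier(
    n_estimators=100,         # number of trees in the forest
    max_features="sqrt",      # features considered at each split
    max_depth=4,              # upper limit on the depth of each tree
    class_weight="balanced",  # weight classes inversely to their frequency
    random_state=0,           # make the sketch reproducible
)
rf_demo.fit(X_demo, y_demo)

# Inspect the fitted forest: number of trees and the deepest tree
print(len(rf_demo.estimators_))
print(max(tree.get_depth() for tree in rf_demo.estimators_))
```

The printed depth should never exceed `max_depth`, which is one quick way to check that the setting took effect.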


## 3.2. Tuning our model
We will use a cross-validated grid search, using GridSearchCV.  
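
To show the general shape of such a search, here is a minimal sketch on synthetic data. The grid values, the `f1_macro` scoring choice, the 3-fold setting and all `*_demo` names are assumptions made for illustration; they are not the grid expected for this assessment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: NOT the liver dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hold out a test set before tuning, so it is never seen by the search
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Toy grid: every combination (2 x 2 = 4) is evaluated by cross-validation
param_grid_demo = {"n_estimators": [10, 30], "max_depth": [2, 4]}

search_demo = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid_demo,
    cv=3,                # 3-fold cross-validation on the training set
    scoring="f1_macro",  # metric used to rank the combinations
)
search_demo.fit(X_tr, y_tr)

print(search_demo.best_params_)
# By default the best combination is refitted on the whole training set,
# so the search object can be scored directly on the held-out test set
test_score_demo = search_demo.score(X_te, y_te)
print(test_score_demo)
```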

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier 
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter('ignore')  # suppress all warnings

In principle, the Random Forest algorithm can use categorical data without creating dummy variables. However, the scikit-learn library in Python requires all the variables to be numeric. Therefore, our categorical variables must be converted to dummy variables.
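
One common way to create dummy variables is `pandas.get_dummies`. The sketch below uses a small hypothetical table (`toy`, with made-up values) rather than the liver data, and `drop_first=True` is an illustrative choice that keeps a single indicator column for a two-level variable.

```python
import pandas as pd

# Hypothetical toy table with one numeric and one categorical column
toy = pd.DataFrame(
    {"Age": [65, 62, 58], "Gender": ["Female", "Male", "Male"]}
)

# Replace the categorical column with indicator (dummy) columns;
# drop_first=True drops the first level ("Female"), leaving "Gender_Male"
toy_encoded = pd.get_dummies(toy, columns=["Gender"], drop_first=True)

print(toy_encoded.columns.tolist())
```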


Some readings:
1. https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
2. https://datascience.stackexchange.com/questions/26283/how-can-i-fit-categorical-data-types-for-random-forest-classification

![alt text](images/ML-work-flow.PNG)

**Build a Random Forest Classifier to predict if a patient is a liver patient or not. (Step 3: Choosing a model)**

<p><font color='green'>Tips:
1. Follow the template of the second part of Exercise 1. In Exercise 1, we searched for the best parameters in two rounds. Do only one round here, but tune the grid as many times as you need. You can give an explanation of the tuning that you followed in the space provided below.
2. Pay attention to the categorical variables.</font></p>

### <font color='blue'> Question 3 - Step 4 + Step 6 Training/Fit the model + Hyperparameter Tuning: Define and run the GridSearchCV to find the best 'max_features', 'max_depth' and 'n_estimators'. Train the model. (25 marks)</font>

In [None]:
# Write Python code here:


### <font color='blue'> Question 4 - Step 5 Evaluation: Assess the classifier on the test set: accuracy, f1 score, f1_macro, precision, recall, and AUC/ROC. (20 marks)</font>

In [None]:
# Write Python code here:


### <font color='blue'> Question 5 - Step 8 Interpretation: Display feature importance. (10 marks)</font>


In [None]:
# Write Python code here:


### <font color='blue'> Question 6 - Step 9 Deployment: Would you use this classifier? (20 marks)</font>


<b> Parameters Tuning:</b>
#####################################################################################################################

(Double-click here)


#####################################################################################################################