# 05 - Ensemble Learning

The goal of this exercise is to develop an understanding of how to implement a random forest classifier.

<div class="alert alert-block alert-info">
To solve this notebook, you need the knowledge from the previous notebook. If you have problems solving it, take another look at last week's notebooks.
    
It is also recommended to read chapter 7 of the book in advance.
</div>

**Task**: In this exercise, the same dataset as last week is used to predict whether a patient has heart disease, based on a set of medical measurements.

In [1]:
# Run this cell to import the following modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

<h2 style="color:blue" align="left">Load and preprocess the data</h2>

First of all, we need to load the dataset.

In [2]:
# sep=r'\s+' splits on any whitespace (delim_whitespace is deprecated in recent pandas versions)
dataset = pd.read_csv('dataset/heart.dat', sep=r'\s+')
dataset.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,70.0,1.0,4.0,130.0,322.0,0.0,2.0,109.0,0.0,2.4,2.0,3.0,3.0,2
1,67.0,0.0,3.0,115.0,564.0,0.0,2.0,160.0,0.0,1.6,2.0,0.0,7.0,1
2,57.0,1.0,2.0,124.0,261.0,0.0,0.0,141.0,0.0,0.3,1.0,0.0,7.0,2
3,64.0,1.0,4.0,128.0,263.0,0.0,0.0,105.0,1.0,0.2,2.0,1.0,7.0,1
4,74.0,0.0,2.0,120.0,269.0,0.0,2.0,121.0,1.0,0.2,1.0,1.0,3.0,1


After we've loaded the dataset, we perform a train-test split so that we can validate the performance of the model later on.

In [None]:
from sklearn.model_selection import train_test_split
X = dataset.drop('target', axis=1)
y = dataset['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=dataset['target'])
X_train.shape, X_test.shape, y_train.shape, y_test.shape

We can see that we have 216 samples in the training set and 54 samples in the test set.
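To illustrate what the `stratify` argument does, here is a minimal sketch on synthetic data. It assumes a made-up label vector of 270 samples with a 150/120 class balance, which is not taken from the heart dataset; the names `X_demo`, `y_demo`, `X_tr`, and `X_te` are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (ASSUMPTION: 270 samples, 150 of class 1 and 120 of class 2)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(270, 5))
y_demo = np.array([1] * 150 + [2] * 120)

# Same split settings as in the cell above, applied to the synthetic data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, the share of class 2 is (almost) identical in all three sets
share_full = np.mean(y_demo == 2)
share_train = np.mean(y_tr == 2)
share_test = np.mean(y_te == 2)
print(X_tr.shape, X_te.shape)
print(f"class-2 share: full={share_full:.3f}, train={share_train:.3f}, test={share_test:.3f}")
```

Without `stratify`, the class shares in a small test set can drift noticeably from those of the full dataset, which makes accuracy values harder to compare.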

<h2 style="color:blue" align="left">Train and evaluate the model</h2>

Scikit-learn has a built-in random forest model called RandomForestClassifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

<div class="alert alert-block alert-success"><b>Task</b><br> 
Use the confusion matrix and the accuracy score to evaluate the performance of the random forest model with default hyperparameters. Evaluate the model on both the training set and the test set. How do you assess the results? Compare them with the performance of last week's decision tree.
</div>
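As a hedged illustration of the two metrics, the following sketch evaluates a random forest on synthetic data from `make_classification` rather than on the heart dataset. All names (`X_demo`, `demo_model`, `demo_acc`, and so on) are hypothetical, and the numbers it prints say nothing about the results to expect in the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (ASSUMPTION: stand-in for the real data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

demo_model = RandomForestClassifier(n_estimators=100, random_state=0)
demo_model.fit(X_tr, y_tr)

# Evaluate on both sets; a large gap between them hints at overfitting
demo_acc = {}
demo_cm = {}
for name, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    pred = demo_model.predict(X_part)
    demo_acc[name] = accuracy_score(y_part, pred)
    demo_cm[name] = confusion_matrix(y_part, pred)  # rows: true class, columns: predicted class
    print(f"{name} accuracy: {demo_acc[name]:.3f}")
    print(demo_cm[name])
```

The diagonal of each confusion matrix counts the correct predictions, so its sum divided by the total number of samples equals the accuracy.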

In [None]:
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Write Your Code Here


<h2 style="color:blue" align="left">Hyperparameter tuning</h2>

Hyperparameter tuning is about optimizing the performance of the model. In this task, we first examine the influence of individual hyperparameters on the accuracy. Then we run an automatic search over the whole parameter space to find the best combination. We use cross-validation to evaluate the performance.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

<div class="alert alert-block alert-success"><b>Task</b><br> 
In the following cells, you test the influence of individual hyperparameters. Each cell starts with the list of values for one hyperparameter. Use a for-loop to iterate over the values and create a random forest for each one. Evaluate each forest with 10-fold cross-validation and append the resulting scores to the variable cv_scores. Then use the plot_validations function to visualize the results.
</div>
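The loop pattern described in the task can be sketched as follows. This is only an illustration on synthetic data: the hyperparameter values, the small forest size, and all names (`depth_values`, `demo_scores`, `demo_cv`) are assumptions made to keep the example fast and separate from the notebook's own variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (ASSUMPTION: not the heart dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)

depth_values = [1, 3, 5]  # hypothetical values for one hyperparameter
demo_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

demo_scores = []
for depth in depth_values:
    forest = RandomForestClassifier(n_estimators=25, max_depth=depth, random_state=42)
    # cross_val_score returns one accuracy value per fold (10 values here)
    fold_scores = cross_val_score(forest, X_demo, y_demo, cv=demo_cv, scoring="accuracy")
    demo_scores.append(fold_scores)

for depth, scores in zip(depth_values, demo_scores):
    print(f"max_depth={depth}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

A list of per-fold score arrays like `demo_scores` has the shape that a box plot per hyperparameter value needs: one box for each entry, drawn from its 10 fold scores.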

In [None]:
def plot_validations(cv_scores, x_label, x_ticklabels):
    plt.figure(figsize=(len(x_ticklabels),4))
    ax = sns.boxplot(data=cv_scores)
    ax.set_xticklabels(x_ticklabels)
    ax.set_ylabel('accuracy')
    ax.set_xlabel(x_label);

### n_estimators

In [None]:
n_trees = [10, 50, 100, 500, 1000]
cv_scores = []
for n_estimator in n_trees:
    # Write Your Code Here
    
plot_validations(cv_scores, 'n_estimators', n_trees)

### max_depth

In [None]:
max_depths = range(1,8)
cv_scores = []
# Write Your Code Here

plot_validations(cv_scores, 'max_depth', max_depths)

### max_features

In [None]:
max_features = range(1,X.shape[1])
cv_scores = []
# Write Your Code Here

plot_validations(cv_scores, 'max_features', max_features)

## GridSearch

With GridSearch you can search over specified parameter values for an estimator. Every combination of parameter values is tested and evaluated with cross-validation. At the end, you get the set of hyperparameters with the best performance for the given metric.
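To make the "every combination" idea concrete, here is a minimal sketch with a deliberately tiny grid on synthetic data. The grid values and the names `demo_grid` and `demo_search` are assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (ASSUMPTION: not the heart dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# 2 values x 3 values = 6 combinations, each evaluated with 3-fold cross-validation
demo_grid = {"n_estimators": [10, 30], "max_depth": [2, 4, None]}
demo_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    demo_grid,
    cv=KFold(3, shuffle=True, random_state=42),
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

n_candidates = len(demo_search.cv_results_["params"])
print("candidates evaluated:", n_candidates)
print("best parameters:", demo_search.best_params_)
print(f"best mean CV accuracy: {demo_search.best_score_:.3f}")
```

The number of model fits grows with the product of the list lengths times the number of folds, which is why a full grid quickly becomes expensive.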

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
import time

<div class="alert alert-block alert-success"><b>Task</b><br> 
Perform a grid search over the hyperparameter values used above. The grid search is already set up with all required parameters; you only have to fit it to the training set. grid_search.best_params_ shows the parameter set with the best cross-validation accuracy. To evaluate on the test set, run the subsequent cell.
</div>

<div class="alert alert-block alert-warning"> <b>Warning</b><br> 
This evaluation can take up to several minutes. So maybe get yourself a coffee while running this cell. ☕️
</div>

In [None]:
params_grid = {'n_estimators': n_trees,
          'max_features': max_features,
          'max_depth': max_depths}
grid_search = GridSearchCV(model, params_grid, cv=KFold(3, random_state=42, shuffle=True), verbose=3, n_jobs=-1,
                           scoring='accuracy')
start_time = time.time()
# Write Your Code Here

# computation time
comp_time_gs = time.time() - start_time
print("--- Computation time for grid search: %s seconds ---" % comp_time_gs)

In [None]:
predictions_gs = grid_search.best_estimator_.predict(X_test)
test_accuracy_gs = accuracy_score(y_test, predictions_gs)
print(f'Test set accuracy: {test_accuracy_gs:.2%}.')

## RandomSearch

In [None]:
from sklearn.model_selection import RandomizedSearchCV

<div class="alert alert-block alert-success"><b>Task</b><br> 
Perform a random search over the hyperparameter values used above. The random search is already set up with all required parameters; you only have to fit it to the training set. random_search.best_params_ shows the parameter set with the best cross-validation accuracy. To evaluate on the test set, run the subsequent cell.
</div>
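In contrast to a grid search, a random search evaluates only `n_iter` sampled combinations. The sketch below shows this on synthetic data with a hypothetical grid of 18 combinations, of which 6 are sampled; `demo_grid`, `demo_random`, and `n_total` are illustrative names and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, ParameterGrid, RandomizedSearchCV

# Synthetic stand-in data (ASSUMPTION: not the heart dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

demo_grid = {
    "n_estimators": [10, 20, 40],
    "max_depth": [2, 4, 6],
    "max_features": [2, 4],
}
n_total = len(ParameterGrid(demo_grid))  # 3 * 3 * 2 = 18 possible combinations

demo_random = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    demo_grid,
    n_iter=6,  # evaluate only 6 randomly sampled combinations
    cv=KFold(3, shuffle=True, random_state=42),
    scoring="accuracy",
    random_state=42,
)
demo_random.fit(X_demo, y_demo)

sampled = demo_random.cv_results_["params"]
print(f"evaluated {len(sampled)} of {n_total} combinations")
print("best parameters:", demo_random.best_params_)
```

The trade-off is that the best combination of the full grid may not be among the sampled ones, so the gain in speed can cost some accuracy.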

In [None]:
random_search = RandomizedSearchCV(model, params_grid, n_iter=100, cv=KFold(3, random_state=42, shuffle=True), verbose=3,\
                                   n_jobs=-1, scoring='accuracy', random_state=42)
start_time = time.time()
# Write Your Code Here

# computation time
comp_time_rs = time.time() - start_time
print("--- Computation time for random search: %s seconds ---" % comp_time_rs)

In [None]:
predictions_rs = random_search.best_estimator_.predict(X_test)
test_accuracy_rs = accuracy_score(y_test, predictions_rs)
print(f'Test set accuracy: {test_accuracy_rs:.2%}.')

## Comparison GridSearch and RandomSearch

Run the following cell to print a short comparison of grid search and random search in terms of computation time and accuracy.

In [None]:
print(f'Speed: RandomSearch is {comp_time_gs / comp_time_rs:.1f}x faster.')
# A negative percentage means that RandomSearch is less accurate than GridSearch
print(f'Accuracy: RandomSearch is {test_accuracy_rs / test_accuracy_gs - 1:.2%} more accurate.')

## Optimizing the recall

<div class="alert alert-block alert-success"><b>Task</b><br> 
For a heart disease classifier, it is a good idea to optimize the recall instead of the accuracy. A high recall means that few persons who have the disease are misclassified as "no disease". 
Find an appropriate model using RandomSearch.
</div>
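One detail is worth checking before optimizing recall: the scoring string `'recall'` computes binary recall for the label `1` by default, while the target column shown above contains the values 1 and 2. The sketch below assumes, without confirmation from this notebook, that label 2 encodes "disease" and builds a scorer for that label with `make_scorer`. It runs on synthetic data, and `recall_disease` and `demo_search` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in data, relabeled from {0, 1} to {1, 2}
X_demo, y01 = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
y_demo = y01 + 1  # ASSUMPTION: 2 plays the role of the "disease" class

# Recall of the class labeled 2 instead of the default positive label 1
recall_disease = make_scorer(recall_score, pos_label=2)

demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 30], "max_depth": [2, 4, 6]},
    n_iter=4,
    cv=KFold(3, shuffle=True, random_state=42),
    scoring=recall_disease,
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

print("best parameters:", demo_search.best_params_)
print(f"best mean CV recall for label 2: {demo_search.best_score_:.3f}")
```

If label 1 turns out to be the disease class in your data, the plain string `scoring='recall'` already measures the intended quantity.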

In [None]:
# Write Your Code Here
# RandomizedSearchCV(model, ... scoring='recall')

## Gradient Boosting Classifier

In the book, Gradient Boosting is applied to regression. Here, it is used for classification.

<div class="alert alert-block alert-success"><b>Task</b><br> 
Build a model with the Gradient Boosting classifier and measure the time the training takes. Compare it to a Random Forest model. 
Use the commands from above to compute the accuracy and the confusion matrix. 
</div>
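A possible way to time and compare two classifiers is sketched below on synthetic data. The helper dictionary `demo_results` and the other names are hypothetical, and the timings depend on the machine, so no particular outcome is implied.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (ASSUMPTION: not the heart dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

candidates = [
    ("random forest", RandomForestClassifier(random_state=0)),
    ("gradient boosting", GradientBoostingClassifier(random_state=0)),
]

demo_results = {}
for name, clf in candidates:
    start = time.perf_counter()  # high-resolution timer for short durations
    clf.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start

    pred = clf.predict(X_te)
    demo_results[name] = (fit_seconds, accuracy_score(y_te, pred))
    print(f"{name}: fit time {fit_seconds:.3f} s, test accuracy {demo_results[name][1]:.3f}")
    print(confusion_matrix(y_te, pred))
```

Only the call to `fit` is timed here; on small datasets a single measurement is noisy, so repeating it a few times gives a more reliable comparison.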

In [None]:
# Write Your Code Here

## Hist Gradient Boosting

Scikit-learn also provides a newer algorithm called HistGradientBoostingClassifier, which was inspired by the successful LightGBM library (see https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting). On big datasets it is much faster, with nearly the same accuracy.

<div class="alert alert-block alert-success"><b>Task</b><br> 
Build a model with the Hist Gradient Boosting classifier and measure the time the training takes. Compare it to a Random Forest model. 
Use the commands from above to compute the accuracy and the confusion matrix. 
</div>

In [None]:
# Write Your Code Here

<div class="alert alert-block alert-success"><b>Task</b><br> 
Optimize the model with hyperparameter tuning. Use GridSearchCV to try different values for learning_rate and max_depth. 
</div>

In [None]:
# Write Your Code Here