## Phase 3.30
# Ensemble Methods

# Introduction
- An *ensemble* refers to an algorithm that uses more than one model to make a prediction.

> **You are looking for investment advice. Instead of asking a single person, you ask three specialists.**
>   - **Stock Broker** who is correct 80% of the time.
>   - **Finance Professor** who is correct 65% of the time.
>   - **Investment Expert** who is correct 85% of the time.
>
> *If all three experts predict that a given investment is good, what is the probability that all three are wrong?*
> 
> . . .

In [None]:
# If all three experts predict that a given investment is good, 
# what is the probability that all three are wrong?
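
One way to work this out, as a minimal sketch: it assumes the three experts make their mistakes independently of one another, which the example above does not state. Under that assumption, the probability that all three are wrong is the product of their individual error rates. The variable names are illustrative.

```python
# Accuracy of each expert, taken from the example above.
accuracies = {
    "Stock Broker": 0.80,
    "Finance Professor": 0.65,
    "Investment Expert": 0.85,
}

# Assumption: the experts' errors are independent, so the probability
# that all of them are wrong is the product of their error rates.
p_all_wrong = 1.0
for accuracy in accuracies.values():
    p_all_wrong *= 1 - accuracy

print(f"P(all three wrong) = {p_all_wrong:.4f}")  # 0.20 * 0.35 * 0.15 = 0.0105
```

If the experts tend to make the same mistakes (correlated errors), the true probability would be higher than this figure.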


# Bagging

*Bootstrap Aggregation*

<img src='./images/bagging.png' width='800'>

Training a *bagging classifier*:
- Draw a given number of *bags* from the training data, sampling rows **with replacement** (each bag is a bootstrap sample).
- Train a classifier on each subset of data.

Predicting with a *bagging classifier*:
- Each classifier makes a prediction.
- All predictions are aggregated into a single prediction.
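
The training and prediction steps above can be sketched with scikit-learn's `BaggingClassifier`. This is a minimal illustration, not the notebook's own code: the synthetic dataset, the choice of decision trees as the base classifier, and the value `n_estimators=10` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; any labelled dataset would do).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Ten decision trees, each fit on its own bootstrap sample of the rows.
# The base classifier is passed positionally because its keyword name
# differs between scikit-learn versions.
bagger = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=10,
    bootstrap=True,
    random_state=42,
)
bagger.fit(X_train, y_train)

# predict() and score() aggregate the individual trees' predictions.
print(f"Test accuracy: {bagger.score(X_test, y_test):.3f}")
```

After fitting, the individual classifiers are available in `bagger.estimators_`, which makes it possible to compare single-tree predictions with the aggregated one.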

---

## Random Forest
- A ***Random Forest*** is an ensemble algorithm that uses $n$ *Decision Trees* as its internal classifiers.
- Each *Decision Tree* is trained on **a subset of the data**: a bootstrap sample of the rows, with a random subset of the features considered at each split.

### Pros and Cons
#### Pros
- Interpretability.
    - Accessible feature importances.
- Less data preprocessing required.
- Resistant to overfitting: in theory, adding more trees does not increase overfitting.
- Good performance/accuracy.
- Robust to noise.

#### Cons
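- For regression, predictions are averages of training targets, so the output is piecewise constant rather than a smooth continuous function.
- Cannot extrapolate: predictions never fall outside the range of the response values in the training data.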
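
The range limitation can be seen in a small experiment. The sketch below uses toy data (an assumption, chosen only for illustration): a forest is fit on $y = 2x$ for $x \in [0, 10]$ and then asked to predict at $x = 20$, where the true value would be 40.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data (an assumption for illustration): y = 2x on the interval [0, 10].
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train.ravel()

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# x = 20 lies far outside the training range; the true value would be 40.
prediction = forest.predict(np.array([[20.0]]))[0]

print(f"Largest training target: {y_train.max():.1f}")
print(f"Prediction at x = 20:    {prediction:.1f}")
```

Because every tree predicts the mean of the training targets in one of its leaves, the forest's prediction cannot exceed the largest target seen during training.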

### Some hyperparameters to tune:

- **n_estimators:**
It defines the number of decision trees to be created in a random forest.
Generally, a higher number makes the predictions stronger and more stable, but a very large number can result in higher training time.


- **criterion:**
It defines the function used to measure the quality of a split (for example, `gini` or `entropy` for classification).
At each node, the candidate split with the best score is chosen.


- **max_features:**
It defines the maximum number of features considered when searching for the best split at each node.
Increasing `max_features` usually improves performance, but a very high value reduces the diversity among the trees.


- **max_depth:**
Random forest has multiple decision trees. This parameter defines the maximum depth of the trees.


- **min_samples_split:**
It defines the minimum number of samples a node must contain before a split is attempted.
If the node holds fewer samples than this, it is not split.


- **min_samples_leaf:** 
This defines the minimum number of samples required to be at a leaf node.
A smaller leaf size makes the model more prone to capturing noise in the training data.


- **max_leaf_nodes:** 
This parameter specifies the maximum number of leaf nodes for each tree.
The tree stops splitting once the number of leaf nodes reaches this maximum.
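
The hyperparameters listed above map directly onto keyword arguments of scikit-learn's `RandomForestClassifier`. In the sketch below the dataset is synthetic and the values are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Illustrative values only; in practice these would be tuned.
forest = RandomForestClassifier(
    n_estimators=100,       # number of trees in the forest
    criterion="gini",       # split-quality measure
    max_features="sqrt",    # features considered at each split
    max_depth=5,            # maximum depth of each tree
    min_samples_split=10,   # samples a node needs before it may be split
    min_samples_leaf=5,     # samples required in every leaf
    max_leaf_nodes=20,      # cap on leaves per tree
    random_state=42,
)
forest.fit(X_train, y_train)

print(f"Train accuracy: {forest.score(X_train, y_train):.3f}")
print(f"Test accuracy:  {forest.score(X_test, y_test):.3f}")
```

Comparing the train and test accuracy while loosening or tightening these limits is a quick way to see how each one trades fit against generalization.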

# Preparing Some Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it.
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.ensemble import RandomForestClassifier

In [2]:
df = pd.read_csv('./data/diabetes.csv')
df

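| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |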


## Coding! Random Forest
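
A possible starting point for this section, as a sketch rather than a reference solution. The helper name `fit_random_forest`, the 25% test split, and the default forest settings are assumptions; the only thing taken from the data above is that `Outcome` is the target column.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def fit_random_forest(df: pd.DataFrame, target: str = "Outcome"):
    """Fit a random forest on `df`; return the model and a test-set report.

    Hypothetical helper for illustration; not part of any library.
    """
    X = df.drop(columns=target)
    y = df[target]

    # Stratify so both splits keep the original class balance.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42
    )

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    report = classification_report(y_test, model.predict(X_test))
    return model, report
```

With the DataFrame loaded above, `model, report = fit_random_forest(df)` followed by `print(report)` would show precision, recall, and F1 per class, and `model.feature_importances_` would give one importance score per feature column.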

---

# Boosting

1. Train a single **weak learner**.
    - ***Weak Learner:*** *A simple model that does only slightly better than random guessing.*
2. Figure out **which examples** the weak learner got wrong.
3. Build another weak learner that **focuses on the areas the first weak learner got wrong**.
4. **Continue this process** until a predetermined stopping condition is met, for example when a set number of weak learners has been created or the model's performance has plateaued.

<img src='./images/new_gradient-boosting.png'>
    
*The weak learners are trained sequentially on the **residuals** of the prior weak learner.*
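
The residual-fitting loop can be sketched by hand for a regression problem with squared-error loss. This is a simplified illustration of the idea, not how any particular library implements boosting. The toy data, the use of depth-1 trees (stumps) as weak learners, the learning rate, and the fixed number of learners are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (an assumption): a noisy sine wave.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5  # shrinks each weak learner's contribution
n_learners = 50      # stopping condition: a fixed number of learners

# Start from a constant prediction, then repeatedly fit a stump to the
# current residuals and add its shrunken prediction to the ensemble.
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

learners = []
for _ in range(n_learners):
    residuals = y - prediction               # what is still unexplained
    stump = DecisionTreeRegressor(max_depth=1)
    stump.fit(X, residuals)                  # next weak learner targets the residuals
    prediction += learning_rate * stump.predict(X)
    learners.append(stump)

final_mse = np.mean((y - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.4f}")
print(f"Training MSE after boosting:  {final_mse:.4f}")
```

Each stump on its own is a poor model, but because every new stump is fit to what the ensemble still gets wrong, the training error shrinks as learners are added.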