## Phase 3.30
# Ensemble Methods
## Objectives
- Introduce the <a href='#backbone'>backbone of ensemble methods</a>.
- Learn about <a href='#bagging'>Bagging</a> and <a href='#boosting'>Boosting</a> algorithms and some popular models.
- <a href='#coding'>Code</a> through an example!
- Learn to <a href='#make-your-own'>Make Your Own</a> Ensemble Classifiers!

<a id='backbone'></a>
# Introduction
- An *ensemble* refers to an algorithm that uses more than one model to make a prediction.

> **You are looking for investment advice. Instead of asking a single person, you ask three specialists.**
>   - **Stock Broker** who is correct 80% of the time.
>   - **Finance Professor** who is correct 65% of the time.
>   - **Investment Expert** who is correct 85% of the time.
>
> *If all three experts predict that a given investment is good, what are the odds that all three are wrong?*
> 
> . . .

In [None]:
# If all three experts predict that a given investment is good, 
# what is the probability that all three are wrong?
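One way to answer the question is sketched below. It assumes the three experts make their mistakes independently of one another, which the scenario does not state; under that assumption the probability that all three are wrong is the product of their individual error rates.

```python
# Accuracy of each expert, taken from the scenario above.
accuracies = {
    "stock_broker": 0.80,
    "finance_professor": 0.65,
    "investment_expert": 0.85,
}

# Assumption: the experts' mistakes are independent of one another.
p_all_wrong = 1.0
for accuracy in accuracies.values():
    p_all_wrong *= 1 - accuracy  # multiply by this expert's error rate

print(f"P(all three wrong) = {p_all_wrong:.4f}")
```

Under the independence assumption the result is about 1%, far lower than the error rate of even the best single expert. If the experts tend to make the same mistakes, the real figure would be higher.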


<a id='bagging'></a>
# Bagging

*Bootstrap Aggregation*

<img src='./images/bagging.png' width='800'>

**Training a *bagging classifier*:**
- Draw a given number of *bags* (bootstrap samples) from the training data by sampling rows **with replacement**, so the bags can overlap.
- Train a separate classifier on each bag.

**Predicting with a *bagging classifier*:**
- Each classifier makes a prediction.
- All predictions are aggregated into a single prediction, typically by majority vote (classification) or by averaging (regression).
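
The training and prediction steps above can be sketched by hand. This is a minimal illustration rather than a reference implementation: the synthetic dataset, the number of bags, and the choice of decision trees as the internal classifier are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (labels 0/1), used only for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_bags = 25  # hypothetical choice; odd, so a majority vote cannot tie

# Training: fit one classifier per bootstrap sample.
models = []
for _ in range(n_bags):
    idx = rng.integers(0, len(X), size=len(X))  # row indices drawn with replacement
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Predicting: every classifier votes, and the majority label wins.
votes = np.stack([m.predict(X) for m in models])       # shape (n_bags, n_samples)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote for 0/1 labels

print("training accuracy:", (bagged_pred == y).mean())
```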

---

## Random Forest
- A ***Random Forest*** is an ensemble algorithm which uses $n$ *Decision Trees* as its internal classifiers.
- Each *Decision Tree* is trained on **a subset of the data**: a bootstrap sample of the rows, with only a random subset of the features considered at each split.

### Pros and Cons
#### Pros
- Interpretability.
    - Accessible feature importances.
- Less data preprocessing required.
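- Resistant to overfitting: adding more trees does not make the forest overfit, although very deep individual trees can still fit noise.
- Good performance/accuracy.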
- Robust to noise.

#### Cons
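- For regression, predictions are averages of training targets, so the output is piecewise constant rather than a smooth function of the inputs.
- Cannot extrapolate: predictions never fall outside the range of the response values in the training data.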

### Hyperparameters

- **n_estimators:**

    - It defines the number of decision trees to be created in a random forest.
    - Generally, a higher number makes the predictions stronger and more stable, but a very large number can result in higher training time.


- **criterion:**

    - It defines the function that is to be used for splitting.
    - The function measures the quality of a split for each feature and chooses the best split.


- **max_features:**

    - It defines the maximum number of features considered when searching for the best split at each node.
    - Increasing max_features usually improves the performance of individual trees, but a very high number decreases the diversity among trees.


- **max_depth:**

    - Random forest has multiple decision trees. This parameter defines the maximum depth of the trees.


- **min_samples_split:**

    - Used to define the minimum number of samples a node must contain before a split is attempted.
    - If the number of samples is less than the required number, the node is not split and becomes a leaf.


- **min_samples_leaf:** 

    - This defines the minimum number of samples required to be at a leaf node.
    - A smaller leaf size makes the model more prone to capturing noise in the training data.


- **max_leaf_nodes:** 

    - This parameter specifies the maximum number of leaf nodes for each tree.
    - A tree stops splitting when its number of leaf nodes reaches max_leaf_nodes.
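
As a rough illustration, the hyperparameters above map directly onto keyword arguments of scikit-learn's `RandomForestClassifier`. The values below are arbitrary starting points on synthetic data, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,       # number of trees in the forest
    criterion="gini",       # function that measures split quality
    max_features="sqrt",    # features considered at each split
    max_depth=6,            # maximum depth of each tree
    min_samples_split=10,   # samples a node needs before it may be split
    min_samples_leaf=3,     # samples every leaf must keep
    max_leaf_nodes=30,      # cap on the number of leaves per tree
    random_state=0,
)
forest.fit(X, y)

# Feature importances are normalized so that they sum to 1.
for i, importance in enumerate(forest.feature_importances_):
    print(f"feature {i}: {importance:.3f}")
```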

---
<a id='boosting'></a>
# Boosting

1. Train a single **weak learner**.
    - ***Weak Learner:*** *A simple model that does only slightly better than random guessing.*
2. Figure out **which examples** the weak learner got wrong.
3. Build another weak learner that **focuses on the areas the first weak learner got wrong**.
4. **Continue this process** until a predetermined stopping condition is met, such as reaching a set number of weak learners or a plateau in the model's performance.

<img src='./images/new_gradient-boosting.png'>

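- *In gradient boosting, each weak learner is trained sequentially on the **residuals** of the learners before it; AdaBoost instead increases the weights of the examples that earlier learners misclassified.*
- *The final prediction combines the predictions of all internal learners, each given a **weight of importance**.*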
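
The residual-fitting idea can be sketched in a few lines. This is a simplified illustration of gradient boosting for regression with squared error, where the residuals are exactly what each new learner should fit. The data, the learning rate, and the number of learners are assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy sine wave, used only for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1   # weight given to each weak learner's contribution
n_learners = 100      # stopping condition: a fixed number of weak learners

prediction = np.full_like(y, y.mean())      # start from a constant guess
mse_start = np.mean((y - prediction) ** 2)

learners = []
for _ in range(n_learners):
    residuals = y - prediction                       # what is still unexplained
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    prediction += learning_rate * stump.predict(X)   # add the weighted correction
    learners.append(stump)

mse_end = np.mean((y - prediction) ** 2)
print(f"MSE before boosting: {mse_start:.3f}, after: {mse_end:.3f}")
```

Each stump is a weak learner on its own; the ensemble improves because every new stump is fit to the errors the current ensemble still makes.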

## AdaBoost

### Pros and Cons
#### Pros
- Doesn't overfit easily.
- Few parameters to tune.

#### Cons
- Can be sensitive to outliers.

### Hyperparameters
- **estimator:**
    - Specifies the base estimator, that is, the machine learning algorithm to be used as the base learner. This parameter was named `base_estimator` before scikit-learn 1.2.
    
- **n_estimators:**
    - It defines the number of base estimators.
    - The default value is 50; a higher value can improve performance at the cost of longer training.
    
- **learning_rate:** 
    - This parameter controls the contribution of the estimators in the final combination.
    - There is a trade-off between learning_rate and n_estimators.
    
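- **max_depth:**
    - A parameter of the base decision tree rather than of AdaBoost itself; it defines the maximum depth of each individual estimator.
    - The default base estimator is a depth-1 tree (a *stump*). Tune this parameter for best performance.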
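
A minimal sketch of how these hyperparameters fit together in scikit-learn, using synthetic data and arbitrary values. The base estimator is passed positionally here because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth is set on the base estimator, not on AdaBoost itself.
stump = DecisionTreeClassifier(max_depth=1)

ada = AdaBoostClassifier(
    stump,               # base estimator (first positional argument)
    n_estimators=100,    # upper limit on the number of weak learners
    learning_rate=0.5,   # shrinks each learner's contribution
    random_state=0,
)
ada.fit(X_train, y_train)

print("test accuracy:", ada.score(X_test, y_test))
```

A lower `learning_rate` generally needs a higher `n_estimators` to reach the same fit, which is the trade-off mentioned above.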

## XGBoost
> The boosting algorithm with the highest performance right now is **XGBoost**, which is short for eXtreme Gradient Boosting.
> 
> XGBoost is a stand-alone library that implements popular gradient boosting algorithms in the fastest, most performant way possible. There are many under-the-hood optimizations that allow XGBoost to train more quickly than any other library implementations of gradient boosting algorithms. 
> For instance, XGBoost is configured in such a way that it parallelizes the construction of trees across all your computer's CPU cores during the training phase. It also allows for more advanced use cases, such as distributing training across a cluster of computers, which is often a technique used to speed up computation. The algorithm even automatically handles missing values!

### Hyperparameters

- **eta** (`learning_rate`):

    - Analogous to learning rate in GBM (*Gradient-Boosted Machine*).
    - Makes the model more robust by shrinking the weights on each step.


- **nthread**:

    - Used for parallel processing; set it to the number of CPU cores to use.
    - If you wish to run on all cores, do not set this value. The algorithm will detect the number of cores automatically.


- **min_child_weight**:

    - Defines the minimum sum of weights of all observations required in a child.
    - Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.


- **max_depth**:

    - It is used to define the maximum depth.
    - Higher depth will allow the model to learn relations very specific to a particular sample.


- **max_leaf_nodes** (`max_leaves` in XGBoost):

    - The maximum number of terminal nodes or leaves in a tree.
    - Can be defined in place of max_depth. Since binary trees are created, a depth of $n$ would produce a maximum of $2^{n}$ leaves.
    - If this is defined, GBM will ignore max_depth.


- **gamma**:

    - A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
    - Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.


- **subsample**:

    - Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
    - Lower values make the algorithm more conservative and prevent overfitting but values that are too small might lead to under-fitting.


- **colsample_bytree**:

    - It is similar to max_features in GBM.
    - Denotes the fraction of columns to be randomly sampled for each tree.

<a id='coding'></a>
# Preparing Some Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.ensemble import RandomForestClassifier

In [2]:
df = pd.read_csv('./data/diabetes.csv')
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


# Modeling
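
A sketch of what the modeling step could look like with a random forest. To keep the example self-contained it uses synthetic stand-in data with eight features. With the real table, the features and target would presumably be `df.drop(columns='Outcome')` and `df['Outcome']`. The split size and hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data with 8 features and a binary target (assumption).
# With the real file: X = df.drop(columns="Outcome"); y = df["Outcome"]
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=42)

# Hold out a stratified test set so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluate on data the model has not seen.
y_pred = rf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```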

<a id='make-your-own'></a>
# Make-Your-Own Ensemble
## BaggingClassifier
- Uses the **Bagging** process with any classifier you choose!
    - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier
    
```python
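>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> bagging = BaggingClassifier(
...     KNeighborsClassifier(), max_samples=0.5, max_features=0.5)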
```

## VotingClassifier
- Uses voting / majority-rule for classifiers. 
    - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
    
```python
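>>> from sklearn.ensemble import RandomForestClassifier, VotingClassifier
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.naive_bayes import GaussianNB

>>> clf1 = LogisticRegression()
>>> clf2 = RandomForestClassifier(n_estimators=50)
>>> clf3 = GaussianNB()

>>> eclf = VotingClassifier(
...     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)])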
```
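
To make the voting rule concrete, the sketch below fits the same three classifiers with both `voting='hard'` (majority rule on predicted labels) and `voting='soft'` (averaged predicted probabilities). The synthetic dataset and the `max_iter` setting are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gnb", GaussianNB()),
]

scores = {}
for voting in ("hard", "soft"):
    # "hard" counts predicted labels; "soft" averages predicted probabilities.
    eclf = VotingClassifier(estimators=estimators, voting=voting)
    eclf.fit(X_train, y_train)
    scores[voting] = eclf.score(X_test, y_test)

print(scores)
```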

## StackingClassifier
- Trains a **final estimator** on outputs of the given estimators.
    - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html#sklearn.ensemble.StackingClassifier
    
```python
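>>> from sklearn.ensemble import RandomForestClassifier, StackingClassifier
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.neighbors import KNeighborsClassifier

>>> clf1 = LogisticRegression()
>>> clf2 = RandomForestClassifier(n_estimators=50)
>>> clf3 = GaussianNB()
>>> clf_final = KNeighborsClassifier()

>>> # StackingClassifier expects (name, estimator) pairs.
>>> sclf = StackingClassifier(
...     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
...     final_estimator=clf_final)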
```
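
A short end-to-end sketch of stacking on synthetic data. The base estimators produce cross-validated predictions on the training set, and the final estimator is trained on those predictions instead of the raw features. The dataset, `cv=5`, and `max_iter` are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gnb", GaussianNB()),
    ],
    final_estimator=KNeighborsClassifier(),
    cv=5,  # folds used to build the final estimator's training inputs
)
stack.fit(X_train, y_train)

# transform() returns the base-model outputs that the final estimator sees.
meta_features = stack.transform(X_test)
print("meta-feature shape:", meta_features.shape)
print("test accuracy:", stack.score(X_test, y_test))
```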

# Scikit-Learn Ensembles Documentation 
- https://scikit-learn.org/stable/modules/ensemble.html