# Ensemble Methods: Simple Averaging and Bootstrap Aggregating (aka Bagging)

# Objectives

- Use `sklearn` to build voting models
- Describe the algorithm of bagging
- Describe the differences among simple bagging, random forest, and extra trees algorithms
- Implement bagging models in `sklearn`

# Ensemble Methods

Because many heads are better than one!

<img width=50% src='images/captain_planet.jpg'/>

> "With our powers combined..."

Ensemble Methods take advantage of the "wisdom of crowds" where the average of multiple independent estimates is usually more consistently accurate than the individual estimates.
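To make the "wisdom of crowds" idea concrete, here is a minimal simulation sketch. The true value, noise level, and crowd size are made-up numbers chosen only for illustration, and the estimates are assumed to be independent and unbiased (real models rarely are, which is why the later sections work to make them *different*).

```python
import numpy as np

rng = np.random.default_rng(42)

true_value = 10.0  # hypothetical quantity everyone is trying to guess
# 1,000 trials, each with 25 independent noisy estimates of the true value
estimates = true_value + rng.normal(loc=0.0, scale=2.0, size=(1000, 25))

# Typical error of one individual vs. the error of the crowd's average
single_error = np.abs(estimates[:, 0] - true_value).mean()
crowd_error = np.abs(estimates.mean(axis=1) - true_value).mean()

print(f"Average error of a single estimate: {single_error:.3f}")
print(f"Average error of the crowd's mean:  {crowd_error:.3f}")
```

Under these assumptions the crowd's mean lands much closer to the truth than a typical individual does.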

## Three Varieties, Three Levels of Randomization

We'll talk about two kinds of ensemble methods today:

1. **Simple Averaging**: Train multiple models, then average their predictions
2. **Bagging**: aka *B*ootstrap *AGG*regation - letting each model only see part of the data to train, then aggregating the results
    - For trees, we'll specifically focus on two bagging techniques:
        1. **Random Forest**: Choose a random set of features at each decision point
        2. **Extra Trees**: Choose a path at random!

## Data Preparation for Examples

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

In [None]:
df = pd.read_csv('data/cars.csv', na_values = ' ')
df.head()

In [None]:
df.info()

In [None]:
df.isna().sum().sum()

### Defining Our Problem

Let's see if we can predict whether a car is American or not.

In [None]:
df[' brand'].value_counts()

In [None]:
df['target'] = df[' brand'] == ' US.'

In [None]:
df.head()

### Fix Columns with Missing Values

In [None]:
X = df.drop(['target', ' brand'], axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
si = SimpleImputer()

si.fit(X_train)

X_tr_im = si.transform(X_train)
X_te_im = si.transform(X_test)

## Version 1: Simple Averaging

> Each model uses the same data to train and then we "vote" to make a prediction

### Simple Ensemble Techniques - How do we use the wisdom of the crowd? 

1. **Max Voting** - The max voting method is generally used for classification problems. In this technique, multiple models are used to make predictions for each data point. The predictions by each model are considered as a ‘vote’. The predictions which we get from the majority of the models are used as the final prediction.

> For example, suppose you ask 5 of your colleagues to rate your movie (out of 5), and three of them rate it a 4 while two of them give it a 5. Since the majority gave a rating of 4, the final rating is 4. You can think of this as taking the mode of all the predictions.

2. **Averaging** - Similar to the max voting technique, multiple predictions are made for each data point in averaging. In this method, we take an average of predictions from all the models and use it to make the final prediction. Averaging can be used for making predictions in regression problems or while calculating probabilities for classification problems.

3. **Weighted Averaging** - This is an extension of the averaging method. Each model is assigned a weight that reflects its importance for the prediction. For instance, if two of your colleagues are movie critics while the others have no prior experience in this field, then the ratings from those two critics are given more weight than the ratings from the others.
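The three techniques above can be sketched with the movie-rating example. The ratings come from the example itself; which raters are the critics and how much extra weight they get are assumptions made up for this illustration.

```python
from collections import Counter

import numpy as np

# Five colleagues rate the movie; assume the two critics are the ones who gave a 5
ratings = [5, 5, 4, 4, 4]

# Max voting: take the most common rating (the mode)
max_vote = Counter(ratings).most_common(1)[0][0]

# Averaging: take the plain mean of all ratings
average = np.mean(ratings)

# Weighted averaging: count each critic twice as much (hypothetical weights)
weights = [2, 2, 1, 1, 1]
weighted_average = np.average(ratings, weights=weights)

print(f"Max vote: {max_vote}")
print(f"Average: {average:.2f}")
print(f"Weighted average: {weighted_average:.2f}")
```

Notice that the weighted average is pulled toward the critics' rating, while the max vote ignores how close the minority came.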

[User Guide!](https://scikit-learn.org/stable/modules/ensemble.html)


### Model 1 - Logistic Regression

In [None]:
# Instantiate and fit our logreg
lr = LogisticRegression(max_iter=1000, random_state=42)

lr.fit(X_tr_im, y_train)

In [None]:
# Check our scores
scores = cross_val_score(estimator=lr, X=X_tr_im,
                         y=y_train, cv=5)
print(f"Median score: {np.median(scores):.4f} (+/- {np.std(scores):.4f})")

In [None]:
# Test score
lr.score(X_te_im, y_test)

### Model 2 - KNN

In [None]:
# Instantiate and fit a knn with k=3
knn = KNeighborsClassifier(3)

knn.fit(X_tr_im, y_train)

In [None]:
# Check our scores
scores = cross_val_score(estimator=knn, X=X_tr_im,
                y=y_train, cv=5)
print(f"Median score: {np.median(scores):.4f} (+/- {np.std(scores):.4f})")

In [None]:
# Test score
knn.score(X_te_im, y_test)

### Model 3 - Decision Tree

In [None]:
# Instantiate and fit an untuned decision tree
dt = DecisionTreeClassifier(random_state=42)

dt.fit(X_tr_im, y_train)

In [None]:
# Check our scores
scores = cross_val_score(estimator=dt, X=X_tr_im,
                         y=y_train, cv=5)
print(f"Median score: {np.median(scores):.4f} (+/- {np.std(scores):.4f})")

In [None]:
# Test score
dt.score(X_te_im, y_test)

### Averaging the Models

#### Building a `VotingClassifier`

Of course there's an SKLearn class for that!

[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html)

In [None]:
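# Need to import!
from sklearn.ensemble import VotingClassifier

In [None]:
# Instantiate and fit our VotingClassifier
avg = VotingClassifier(estimators=[
    ('lr', lr),
    ('knn', knn),
    ('dt', dt)
])

avg.fit(X_tr_im, y_train)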

In [None]:
# Check our scores
scores = cross_val_score(estimator=avg, X=X_tr_im,
                         y=y_train, cv=5)
print(f"Median score: {np.median(scores):.4f} (+/- {np.std(scores):.4f})")

In [None]:
# Test score
avg.score(X_te_im, y_test)

#### Weighted Averaging with the `VotingClassifier`

> Even if the vote is 50-50, you'd probably side with the "smart" ones more

This meta-estimator is not as good as one of our base estimators, so in this case the averaging did not work very well. Realizing that the logistic regression is performing better than the decision tree and the k-nearest-neighbors model, however, we might decide to build a meta-estimator by calculating a **weighted average** of the base estimators' predictions, weighting (or biasing) this estimator in favor of the best-performing base estimator. Suppose we weight the logistic regression 50%, the knn model 25%, and the decision tree 25%:

In [None]:
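# Instantiate and fit, this time with the weights outlined above
w_avg = VotingClassifier(estimators=[
    ('lr', lr),
    ('knn', knn),
    ('dt', dt)
], weights=[0.5, 0.25, 0.25])

w_avg.fit(X_tr_im, y_train)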

In [None]:
# Check our scores
scores = cross_val_score(estimator=w_avg, X=X_tr_im,
                         y=y_train, cv=5)
print(f"Median score: {np.median(scores):.4f} (+/- {np.std(scores):.4f})")

In [None]:
# Test score
w_avg.score(X_te_im, y_test)

## Version 2: Bagging

A single decision tree will often overfit your training data. Let's see if we have evidence of that in the current case:

In [None]:
# Scoring our earlier dt on train
dt.score(X_tr_im, y_train)

#### 🧠 Knowledge Check: What is this score? And why is it equal to 1?

- 


In [None]:
scores = cross_val_score(estimator=dt, X=X_tr_im,
                         y=y_train, cv=5)
print(f"Median score: {np.median(scores):.4f} (+/- {np.std(scores):.4f})")

In [None]:
dt.score(X_te_im, y_test)

We could try to rein in this one tree by tuning it, but it's often better to do something else: Plant another tree!

Of course, if a second tree is going to be of any value, it has to be *different* from the first. Here's a good algorithm for achieving that:

## Bootstrap Aggregation

The idea behind **bagging** is to combine the results of multiple models (for instance, all decision trees) to get a more generalized result. Here’s a question: If you train all the models on the same set of data and combine them, will that be useful? There is a high chance that these models will give the same result, since they are getting the same input. So how can we solve this problem? One technique is bootstrapping.

**Bootstrapping** is a sampling technique in which we create subsets of observations from the original dataset, with replacement. The size of the subsets is the same as the size of the original set.
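As a quick sketch of what sampling with replacement does, the snippet below draws one bootstrap sample from a toy dataset of row indices. The dataset size and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1000
data = np.arange(n)  # stand-in for the row indices of a dataset

# One bootstrap sample: same size as the original, drawn with replacement
sample = rng.choice(data, size=n, replace=True)

# Because we sample with replacement, some rows repeat and others are left out
unique_fraction = np.unique(sample).size / n
print(f"Fraction of original rows that appear in the sample: {unique_fraction:.3f}")
```

For large datasets this fraction tends toward about 63%, so each bootstrap sample leaves out roughly a third of the rows.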

The **bagging (or bootstrap aggregating)** technique uses these subsets (bags) to get a fair idea of the distribution (complete set). The size of subsets created for bagging may be less than the original set.
![](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/05/image20-768x289.png)


1. Multiple subsets are created from the original dataset, selecting observations with replacement.
2. A base model (weak model) is created on each of these subsets.
3. The models run in parallel and are independent of each other.
4. The final predictions are determined by combining the predictions from all the models.

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/05/Screenshot-from-2018-05-08-13-11-49-768x580.png)
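The bagging procedure described above can be sketched by hand in a few lines. This is only an illustration on synthetic data from `make_classification`; the `_demo` variable names, the number of trees, and the seeds are assumptions, and `sklearn`'s own implementation handles many details this sketch ignores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (labels are 0 and 1)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     random_state=42)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, random_state=42)

rng = np.random.default_rng(42)
trees = []
for _ in range(50):
    # Bootstrap: draw row indices with replacement, same size as the training set
    idx = rng.integers(0, len(X_tr_demo), size=len(X_tr_demo))
    tree = DecisionTreeClassifier(random_state=0)
    trees.append(tree.fit(X_tr_demo[idx], y_tr_demo[idx]))

# Aggregate: each tree votes, and the majority class wins
all_preds = np.array([tree.predict(X_te_demo) for tree in trees])
bagged_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)
bagged_acc = (bagged_pred == y_te_demo).mean()

print(f"Accuracy of the hand-rolled bagged trees: {bagged_acc:.3f}")
```

Every tree uses the same settings here; the only thing that makes the trees differ is the bootstrap sample each one was trained on.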

### Bagging with `sklearn`

[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)

In [None]:
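# Import!
from sklearn.ensemble import BaggingClassifier

In [None]:
# Instantiate and fit a BaggingClassifier with n_estimators=100
# Note the base estimator is by default a decision tree
bag = BaggingClassifier(n_estimators=100, random_state=42)

bag.fit(X_tr_im, y_train)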

In [None]:
# Check our scores
scores = cross_val_score(estimator=bag, X=X_tr_im,
                         y=y_train, cv=5)
print(f"Median score: {np.median(scores):.4f} (+/- {np.std(scores):.4f})")

In [None]:
# Test score
bag.score(X_te_im, y_test)

### Fitting a Random Forest

Let's add an extra layer of randomization: Instead of using *all* the features of my model to optimize a branch at each node, I'll just choose a subset of my features.

That's the essence of a random forest model. Note that there are now **two** levels of random sampling happening: To build a new tree, I'll be taking only some of my data points; and at any branching point in a tree, I'll be using only some of my features to determine the split.

#### Steps:

1. Draw a bootstrap sample of the data for training (**bag**); the rows that were not drawn (**out-of-bag**) are saved for validation
2. At each split in the tree, randomly select a subset of the predictors
3. Grow/train your tree on the training data (**bag**), choosing each split from just those features
4. Use the tree to predict on its validation rows (**out-of-bag**)
5. Compare those predictions with the true values to see how well the tree did (*out-of-bag error*)
6. Repeat to make new trees and use the results to "vote" for the final decision
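As a rough sketch of the out-of-bag idea, `RandomForestClassifier` can report an out-of-bag score when `oob_score=True` is passed. The synthetic data and the `_demo` names below are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for this illustration
X_oob_demo, y_oob_demo = make_classification(n_samples=500, n_features=10,
                                             random_state=42)

# max_features='sqrt' considers a random subset of features at each split;
# oob_score=True scores each row using only the trees that never saw it
rf_oob_demo = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                     oob_score=True, random_state=42)
rf_oob_demo.fit(X_oob_demo, y_oob_demo)

print(f"Out-of-bag accuracy: {rf_oob_demo.oob_score_:.3f}")
```

The out-of-bag score gives an estimate of generalization without setting aside a separate validation set, since every row is out-of-bag for some of the trees.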

### Random Forest with `sklearn`

[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [None]:
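# Import!
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Instantiate and fit a RandomForestClassifier

rfc = RandomForestClassifier(random_state=42)

rfc.fit(X_tr_im, y_train)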


In [None]:
# Check our scores
scores = cross_val_score(estimator=rfc, X=X_tr_im,
                         y=y_train, cv=5)
print(f"Median score: {np.median(scores):.4f} (+/- {np.std(scores):.4f})")

In [None]:
# Test score
rfc.score(X_te_im, y_test)

### Cool Features of Random Forests

#### Investigate Your Forest 🌲🌲👀🌲🌲

We can check out our trained estimators after training the ensemble. This isn't necessarily unique to random forests, but since the base model is always a decision tree we can really investigate how the model is working!

In [None]:
model_estimators = rfc.estimators_ 
print(len(model_estimators))
model_estimators

In [None]:
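# Score the full forest on the test set so we can compare it with individual trees
score = rfc.score(X_te_im, y_test)
print(f"Overall model's score was {score:.3f}")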
print('='*70)

for model in model_estimators[-5:]:
    display(model)
    model_score = model.score(X_te_im, y_test)
    print(f'\tModel gave score of {model_score:.3f}')

#### Feature Importances

We can use the [`.feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.feature_importances_) property of the trained model to get an idea of which features mattered the most.

In [None]:
feat_import = {name: imp for name, imp in zip(X_train.columns, rfc.feature_importances_)}
feat_import

### Extremely Randomized Trees (Extra Trees)

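Sometimes we might want even one more bit of randomization. Instead of always searching for the *optimal* split point for each candidate feature, we might just draw the split points at random and keep the best of those random candidates. If we're doing that, then we've got extremely randomized trees.

There are now **three** levels of randomization: sampling of data, sampling of features, and random selection of split points. (Note that `sklearn`'s `ExtraTreesClassifier` uses the whole training set for every tree by default; pass `bootstrap=True` to sample the data as well.)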
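To see where these levels of randomization show up in `sklearn`, the sketch below fits a small random forest and a small extra-trees ensemble on synthetic data and inspects their settings. The data and the `_demo` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic data, used only for this illustration
X_et_demo, y_et_demo = make_classification(n_samples=300, n_features=10,
                                           random_state=42)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
et_demo = ExtraTreesClassifier(n_estimators=50, random_state=42)
rf_demo.fit(X_et_demo, y_et_demo)
et_demo.fit(X_et_demo, y_et_demo)

# Does each ensemble bootstrap its rows by default?
print(f"Random forest bootstrap: {rf_demo.bootstrap}")
print(f"Extra trees bootstrap:   {et_demo.bootstrap}")

# How does each individual tree choose its split points?
print(f"Random forest splitter: {rf_demo.estimators_[0].splitter}")
print(f"Extra trees splitter:   {et_demo.estimators_[0].splitter}")
```

The random forest's trees use the `'best'` splitter on bootstrapped rows, while the extra trees use the `'random'` splitter and, by default, see every row.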

[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)

In [None]:
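# Import!
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
# Instantiate and fit an ExtraTreesClassifier
etc = ExtraTreesClassifier(random_state=42)

etc.fit(X_tr_im, y_train)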

In [None]:
# Check our scores
scores = cross_val_score(estimator=etc, X=X_tr_im,
                         y=y_train, cv=5)
print(f"Median score: {np.median(scores):.4f} (+/- {np.std(scores):.4f})")

In [None]:
# Test score
etc.score(X_te_im, y_test)

## Pros and Cons of Random Forests 

(FYI - Random Forests are the most common of the techniques we've explored today, hence the focus here! Many of these pros/cons would also apply to other ensemble or bagging techniques)

**Pros:**
* Strong performance -- because this is an ensemble algorithm, the model is naturally resistant to noise and variance in the data, and generally tends to perform quite well.

* Interpretability: each tree in the random forest is a Glass-Box Model (meaning that the model is interpretable, allowing us to see how it arrived at a certain decision)

**Cons:**
* Computational complexity: On large datasets, the runtime can be quite slow compared to other algorithms.

* Memory usage: Random forests tend to have a larger memory footprint than other models. It's not uncommon to see random forests that were trained on large datasets have memory footprints in the tens, or even hundreds, of MB.

* Interpretability: although each tree is a Glass-Box Model and quite interpretable, it can be harder to grasp exactly what's happening in aggregate without some extra work (and the `feature_importances_` given by random forest models are notoriously bad/unreliable!)

    - Additional details about why we don't trust random forest feature importances: https://explained.ai/rf-importance/
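One commonly used alternative to the built-in importances is permutation importance, which `sklearn` provides as `sklearn.inspection.permutation_importance`. The sketch below is a minimal illustration on synthetic data; the dataset, the `_demo` names, and the number of repeats are assumptions, not a recommendation for any particular problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 8 features, only some of which carry signal
X_pi_demo, y_pi_demo = make_classification(n_samples=500, n_features=8,
                                           n_informative=3, random_state=42)
X_tr_pi, X_te_pi, y_tr_pi, y_te_pi = train_test_split(
    X_pi_demo, y_pi_demo, random_state=42)

rf_pi_demo = RandomForestClassifier(random_state=42).fit(X_tr_pi, y_tr_pi)

# Shuffle one column at a time on held-out data and measure the drop in score
perm = permutation_importance(rf_pi_demo, X_te_pi, y_te_pi,
                              n_repeats=10, random_state=42)

# Print features from most to least important
for i in perm.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {perm.importances_mean[i]:.3f} "
          f"(+/- {perm.importances_std[i]:.3f})")
```

Because the importances are measured on held-out data, a feature only looks important if shuffling it actually hurts predictions the model has not memorized.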

-----

# Level Up: Stacking

#### Meta-Classifier/Meta-Regressor

- First, we ask several different models to make predictions about the target
- Rather than taking a simple average or vote to determine the outcome, feed these results into a final model that makes the prediction based on the other models’ predictions
- If this sounds like we are approaching a neural network... you are correct!

Remember weighted averaging? Stacking is about using DS models to estimate those weights for us. This means we'll have one layer of base estimators and another layer that is "**trained to optimally combine the model predictions to form a new set of predictions**". See [this short blog post](https://blogs.sas.com/content/subconsciousmusings/2017/05/18/stacked-ensemble-models-win-data-science-competitions/) for more.
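The idea of learning the weights can be sketched by hand: collect out-of-fold predictions from each base model, then fit a final model on those predictions. This is a simplified illustration on synthetic data with made-up `_demo` names; it uses a plain `LinearRegression` as the final model for readability, which is not necessarily what `sklearn`'s `StackingRegressor` does internally.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only for this illustration
X_st_demo, y_st_demo = make_regression(n_samples=300, n_features=5,
                                       noise=10.0, random_state=42)

base_models = [
    LinearRegression(),
    KNeighborsRegressor(),
    DecisionTreeRegressor(random_state=42),
]

# Layer 1: out-of-fold predictions, so no model predicts rows it was trained on
meta_features = np.column_stack([
    cross_val_predict(model, X_st_demo, y_st_demo, cv=5)
    for model in base_models
])

# Layer 2: a final model learns how much to trust each base model
meta_model = LinearRegression().fit(meta_features, y_st_demo)

for model, weight in zip(base_models, meta_model.coef_):
    print(f"{type(model).__name__}: learned weight {weight:.3f}")
```

The coefficients of the final model play the role of the weights we chose by hand in weighted averaging.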

## Initial Data Prep

In [None]:
import xlrd
import os

wb = xlrd.open_workbook('data/Sales Report.xls',
                        logfile=open(os.devnull, 'w'))

sales = pd.read_excel(wb)
sales = sales.dropna()

In [None]:
sales.head()

In [None]:
sales.dtypes

In [None]:
sales['Category'].value_counts()

In [None]:
sales['Sub-Category'].value_counts()

In [None]:
X_num = sales[['Discount', 'Profit']].columns
X_cat = sales[['Category', 'Sub-Category']].columns

In [None]:
X = sales[['Discount', 'Profit', 'Category', 'Sub-Category']]
y = sales['Sales']

## Splitting

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

## Setting Up a Pipeline

In [None]:
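from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numTrans = Pipeline(steps=[
    ('scaler', StandardScaler())
])
# `sparse_output` replaced `sparse` in scikit-learn 1.2;
# on older versions, pass sparse=False instead
catTrans = Pipeline(steps=[
    ('ohe', OneHotEncoder(drop='first',
                          sparse_output=False))
])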

In [None]:
from sklearn.compose import ColumnTransformer

pp = ColumnTransformer(transformers=[
    ('num', numTrans, X_num),
    ('cat', catTrans, X_cat)
])

In [None]:
pp.fit(X_train)

In [None]:
X_tr_pp = pp.transform(X_train)

## Setting Up a Stack

In [None]:
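from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

estimators = [
    ('lr', LinearRegression()),
    ('knn', KNeighborsRegressor()),
    ('rt', DecisionTreeRegressor())
]

sr = StackingRegressor(estimators)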

In [None]:
sr.fit(X_tr_pp, y_train)

In [None]:
X_test_pp = pp.transform(X_test)

In [None]:
print(f"Train Score: {sr.score(X_tr_pp, y_train)}")
print(f"Test Score: {sr.score(X_test_pp, y_test)}")

## Comparison with Base Estimators

In [None]:
lr = LinearRegression().fit(X_tr_pp, y_train)
print(f"Train Score: {lr.score(X_tr_pp, y_train)}")
print(f"Test Score: {lr.score(X_test_pp, y_test)}")

In [None]:
knn = KNeighborsRegressor().fit(X_tr_pp, y_train)
print(f"Train Score: {knn.score(X_tr_pp, y_train)}")
print(f"Test Score: {knn.score(X_test_pp, y_test)}")

In [None]:
rt = DecisionTreeRegressor().fit(X_tr_pp, y_train)
print(f"Train Score: {rt.score(X_tr_pp, y_train)}")
print(f"Test Score: {rt.score(X_test_pp, y_test)}")