<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
 
# Ensembles and Random Forests
 
_Author: Joseph Nelson (DC)_

*Adapted from Chapter 8 of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/)*

---

In [None]:
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
%matplotlib inline

<a id="introduction"></a>
## Introduction

### What is Ensembling?

**Ensemble learning (or "ensembling")** is the process of combining several predictive models in order to produce a combined model that is more accurate than any individual model. For example, given predictions from several models we could:

- **Regression:** Take the average of the predictions.
- **Classification:** Take a vote and use the most common prediction.

For ensembling to work well, the models must be:

- **Accurate:** They outperform the null model.
- **Independent:** Their predictions are generated using different processes.

**The big idea:** If you have a collection of individually imperfect (and independent) models, the "one-off" mistakes made by each model are probably not going to be made by the rest of the models, and thus the mistakes will be discarded when you average the models.
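As a rough illustration of the big idea, the sketch below simulates five hypothetical classifiers that are each right 70% of the time and make their mistakes independently. The classifiers, their accuracy, and the sample size are all assumptions chosen for the demo; real models are rarely this independent, so the gain is usually smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs = 10_000      # number of observations to classify
n_models = 5        # number of hypothetical, independent classifiers
p_correct = 0.70    # assumed accuracy of each individual classifier

# True binary labels.
y_true = rng.integers(0, 2, size=n_obs)

# Each model is right with probability p_correct, independently of the others.
is_right = rng.random((n_models, n_obs)) < p_correct
preds = np.where(is_right, y_true, 1 - y_true)

# Majority vote across the five models (an odd count avoids ties).
vote = (preds.sum(axis=0) > n_models / 2).astype(int)

individual_accuracy = (preds == y_true).mean(axis=1)
ensemble_accuracy = (vote == y_true).mean()

print('Individual accuracies:', individual_accuracy.round(3))
print('Majority-vote accuracy:', round(ensemble_accuracy, 3))
```

Under these assumptions, the vote is right whenever at least three of the five models are right, which works out to a probability of about 0.84.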

There are two basic **methods for ensembling:**

- Manually ensembling your individual models.
- Using a standard "meta-model" that does ensembling internally.

<a id="part-one"></a>
## Part 1: Manual Ensembling

What makes an effective manual ensemble?

- Different types of **models**.
- Different combinations of **features**.
- Different **tuning parameters**.

![Machine learning flowchart](../assets/images/crowdflower_ensembling.jpg)

*Machine learning flowchart created by the [winner](https://github.com/ChenglongChen/Kaggle_CrowdFlower) of Kaggle's [CrowdFlower competition](https://www.kaggle.com/c/crowdflower-search-relevance)*.

### Comparing Ensembling With a Single-Model Approach

**Advantage of ensembling:** it can increase predictive accuracy.

**Disadvantages of ensembling:**

- It decreases interpretability.
- It takes longer to train.
- It takes longer to predict.
- It is more complex to automate and maintain, particularly with manual ensembling.

Outside of machine learning competitions, you have to weigh gains in accuracy against added complexity.

<a id="part-two"></a>
## Part 2: Bagging

The primary weakness of **decision trees** is that they tend not to have the best predictive accuracy. This is partially because of **high variance**, meaning that different samples of the training data can lead to very different trees.

**Bagging** is a general-purpose procedure for reducing the variance of a machine learning method, and it is particularly useful for decision trees. Bagging is short for **bootstrap aggregation**: fit the same model to many bootstrap samples of the training data, then aggregate the resulting predictions.

A **bootstrap sample** is a random sample with replacement. So, it has the same size as the original sample but might duplicate some of the original observations.
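A minimal sketch of drawing one bootstrap sample with NumPy. It uses `np.random.default_rng` rather than the `np.random.seed` call used elsewhere in this lesson, so the exact numbers drawn will differ from a seeded legacy run.

```python
import numpy as np

rng = np.random.default_rng(1)

# The "original sample": the integers 1 through 20.
nums = np.arange(1, 21)

# Draw a bootstrap sample: same size as the original, with replacement.
boot = rng.choice(nums, size=nums.size, replace=True)

print(boot)
print('Distinct values drawn:', np.unique(boot).size, 'of', nums.size)
```

Because sampling is done with replacement, some values typically appear more than once and others not at all.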

In [None]:
# Set a seed for reproducibility.
np.random.seed(1)

In [None]:
# Create an array of 1 through 20.


In [None]:
# Sample that array 20 times with replacement.


**How does bagging work (for decision trees)?**

1. Draw B bootstrap samples from the training data.
2. Grow one tree on each bootstrap sample and use it to make predictions.
3. Combine the predictions:
    - Average the predictions for **regression trees**.
    - Take a vote for **classification trees**.

**Notes:**

- **Each bootstrap sample** is typically the same size as the original training set. (It may contain repeated rows.)
- **B** should be a large enough value that the error seems to have "stabilized".
- The trees are **grown deep** so that they have low bias/high variance.

Training multiple trees through bagging can give more consistent results than training a single tree, thereby reducing variance.
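The three steps above can be sketched as follows. This is an illustration on synthetic data (an assumption, so that the block runs without the vehicle CSV files), with B=10 deep regression trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(123)

# Synthetic stand-in data (an assumption; not the vehicle data set).
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.3, size=200)
X_test = rng.uniform(0, 10, size=(100, 1))
y_test = np.sin(X_test[:, 0]) + rng.normal(0, 0.3, size=100)

B = 10  # number of bootstrap samples / trees
n = len(X_train)

predictions = []
for _ in range(B):
    # 1. Draw a bootstrap sample of row indices.
    rows = rng.choice(n, size=n, replace=True)
    # 2. Grow a deep (unpruned) tree on that sample and predict.
    tree = DecisionTreeRegressor(max_depth=None, random_state=1)
    tree.fit(X_train[rows], y_train[rows])
    predictions.append(tree.predict(X_test))

# 3. Average the B predictions for each test observation.
predictions = np.array(predictions)  # shape (B, n_test)
bagged_pred = predictions.mean(axis=0)


def rmse(y, y_hat):
    """Root mean squared error."""
    return np.sqrt(np.mean((y - y_hat) ** 2))


tree_rmses = [rmse(y_test, p) for p in predictions]
bagged_rmse = rmse(y_test, bagged_pred)
print('Mean single-tree RMSE:', round(float(np.mean(tree_rmses)), 3))
print('Bagged RMSE:', round(float(bagged_rmse), 3))
```

The RMSE of the averaged prediction can never exceed the average of the individual trees' RMSEs, which is one way to see why averaging helps.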

<a id="manual-bagged"></a>
## Manually Implementing Bagged Decision Trees (with B=10)

In [None]:
# Read in and prepare the vehicle training data.
path = Path('..', 'assets', 'data', 'vehicles_train.csv')
train = pd.read_csv(path)

In [None]:
# Transform "vtype" to "is_truck"


In [None]:
# Set a seed for reproducibility.
np.random.seed(123)

In [None]:
# Create ten bootstrap samples (which will be used to select rows from the DataFrame).


In [None]:
# Show the rows for the first decision tree.


In [None]:
# Read in and prepare the vehicle testing data.


In [None]:
# Import decision tree regressor


In [None]:
# Grow each tree deep.


In [None]:
# Define testing data.


In [None]:
# Grow one tree for each bootstrap sample and make predictions on testing data.


In [None]:
# Convert predictions from list to NumPy array.


In [None]:
# Average predictions.


In [None]:
# Calculate RMSE.


<a id="manual-sklearn"></a>
## Bagged Decision Trees in `scikit-learn` (with B=500)

In [None]:
# Define the training and testing sets.


In [None]:
# Instruct BaggingRegressor to use DecisionTreeRegressor as the "base estimator."


In [None]:
# Fit and predict.


In [None]:
# Calculate RMSE.


<a id="oos-error"></a>
## Estimating Out-of-Sample Error

For bagged models, out-of-sample error can be estimated without using **train/test split** or **cross-validation**!

For each tree, the training observations that were **not drawn** into its bootstrap sample are called "out-of-bag" observations.
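A small sketch of how in-bag and out-of-bag rows can be identified from bootstrap indices. The row count `n = 14` and the number of samples are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)

n = 14   # hypothetical number of training rows
B = 10   # number of bootstrap samples

all_rows = np.arange(n)
samples = [rng.choice(n, size=n, replace=True) for _ in range(B)]

for i, sample in enumerate(samples[:3]):
    in_bag = np.unique(sample)                   # rows this tree would see
    out_of_bag = np.setdiff1d(all_rows, in_bag)  # rows this tree never sees
    print(f'Sample {i}: in-bag={in_bag.tolist()} out-of-bag={out_of_bag.tolist()}')

# Average share of rows left out of a bootstrap sample.
oob_share = np.mean([np.setdiff1d(all_rows, s).size / n for s in samples])
print('Mean out-of-bag share:', round(float(oob_share), 2))
```

For large training sets, the share of rows left out of a bootstrap sample approaches 1/e, or about 0.37.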

In [None]:
# Show the first bootstrap sample.


In [None]:
# Show the "in-bag" observations for each sample.


In [None]:
# Show the "out-of-bag" observations for each sample.


**Calculating "out-of-bag error":**

1. For each observation in the training data, predict its response value using **only** the trees in which that observation was out-of-bag. Average those predictions (for regression) or take a vote (for classification).
2. Compare all predictions to the actual response values in order to compute the out-of-bag error.

When B is sufficiently large, the **out-of-bag error** is an accurate estimate of **out-of-sample error**.
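A sketch of out-of-bag estimation with scikit-learn's `BaggingRegressor`, using synthetic data (an assumption) and B=500. The fitted model's `oob_score_` is an R-squared value; an out-of-bag RMSE can be derived from `oob_prediction_`, which holds the out-of-bag prediction for each training row.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic stand-in data (an assumption; not the lesson's data).
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.3, size=300)

# The base estimator is passed positionally because its keyword name
# has changed between scikit-learn releases.
bagreg = BaggingRegressor(
    DecisionTreeRegressor(),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,
    random_state=1,
)
bagreg.fit(X, y)

# R-squared computed from out-of-bag predictions only.
print('OOB R-squared:', round(bagreg.oob_score_, 3))

# Out-of-bag prediction for each training row, and the implied RMSE.
oob_rmse = np.sqrt(np.mean((y - bagreg.oob_prediction_) ** 2))
print('OOB RMSE:', round(float(oob_rmse), 3))
```

With a small number of trees, some rows may never be out-of-bag, in which case their out-of-bag prediction is undefined; using a large B avoids this.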

In [None]:
# Compute the out-of-bag R-squared score (not MSE, unfortunately) for B=500.


### Estimating Feature Importance

Bagging increases **predictive accuracy** but decreases **model interpretability** because it's no longer possible to visualize the tree to understand the importance of each feature.

However, we can still obtain an overall summary of **feature importance** from bagged models:

- **Bagged regression trees:** Calculate the total amount that **MSE** decreases due to splits over a given feature, averaged over all trees.
- **Bagged classification trees:** Calculate the total amount that **Gini index** decreases due to splits over a given feature, averaged over all trees.
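One way to sketch this for bagged regression trees is to average each fitted tree's `feature_importances_` over the ensemble. The data below is synthetic (an assumption): `x0` drives the response, `x1` contributes a little, and `x2` is pure noise.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic data (an assumption): x0 is strong, x1 is weak, x2 is noise.
X = rng.normal(size=(300, 3))
y = 4 * X[:, 0] + 1 * X[:, 1] + rng.normal(0, 0.5, size=300)

bagreg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, random_state=1)
bagreg.fit(X, y)

# Each fitted tree exposes its normalized impurity (MSE) decrease per
# feature; averaging over trees gives an ensemble-level summary.
importances = np.mean(
    [tree.feature_importances_ for tree in bagreg.estimators_], axis=0
)

for name, value in zip(['x0', 'x1', 'x2'], importances):
    print(f'{name}: {value:.3f}')
```

The averaged importances sum to 1, so each value can be read as a feature's share of the total impurity reduction.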

**Exercise (6 mins.)**

In your own words...

- What is ensembling?

- How do bagged classification trees work?

- What is out-of-bag error?

- Compare and contrast ensembling and K-fold cross validation in terms of both process and aims.

$\blacksquare$

<a id="part-three"></a>
## Part 3: Random Forests

Random forests offer a **slight variation on bagged trees** that usually gives better performance:

- Exactly like bagging, we create an ensemble of decision trees using bootstrapped samples of the training set.
- However, when building each tree, each time a split is considered, a **random sample of m features** is chosen as split candidates from the **full set of p features**. The split is only allowed to use **one of those m features**.
    - A new random sample of features is chosen for **every single tree at every single split**.
    - For **classification**, m is typically chosen to be the square root of p.
    - For **regression**, m is typically chosen to be somewhere between p/3 and p.

What's the point?

- Suppose there is **one very strong feature** in the data set. When using bagged trees, most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are **highly correlated**.
- Averaging highly correlated quantities does not significantly reduce variance (which is the entire goal of bagging).
- By randomly leaving out candidate features from each split, **random forests "decorrelate" the trees** to the extent that the averaging process can reduce the variance of the resulting model.
- Another way of looking at it is that sometimes one or two strong features dominate every tree in bagging, resulting in essentially the same tree as every predictor. (This is what was meant when saying the trees could be highly correlated.) By using a subset of features to generate each tree, we get a wider variety of predictive trees that do not all use the same dominant features.
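The decorrelation effect can be sketched by measuring how similar the individual trees' predictions are. The example below uses synthetic data with one dominant feature (an assumption) and compares `max_features=None`, which considers all p features at every split and so behaves like bagging, with `max_features=3`, which is m = p/3 here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic data (an assumption): feature 0 is far stronger than the rest.
p = 9
X = rng.normal(size=(400, p))
y = 5 * X[:, 0] + 0.5 * X[:, 1:].sum(axis=1) + rng.normal(0, 1, size=400)
X_train, y_train, X_test = X[:300], y[:300], X[300:]


def mean_tree_correlation(forest, X_new):
    """Average pairwise correlation between the trees' predictions."""
    preds = np.array([tree.predict(X_new) for tree in forest.estimators_])
    corr = np.corrcoef(preds)
    off_diagonal = corr[~np.eye(len(corr), dtype=bool)]
    return off_diagonal.mean()


# max_features=None considers all p features at every split (bagging).
bagged = RandomForestRegressor(n_estimators=50, max_features=None, random_state=1)
# max_features=3 considers a random sample of m = p/3 features per split.
forest = RandomForestRegressor(n_estimators=50, max_features=3, random_state=1)

bagged_corr = mean_tree_correlation(bagged.fit(X_train, y_train), X_test)
forest_corr = mean_tree_correlation(forest.fit(X_train, y_train), X_test)
print('Bagged trees, mean correlation:', round(float(bagged_corr), 3))
print('Random forest, mean correlation:', round(float(forest_corr), 3))
```

Lower correlation between trees means the averaging step has more independent errors to cancel out.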

<a id="part-four"></a>
## Part 4: Building and Tuning Decision Trees and Random Forests

In this section, we will use the sklearn implementation of random forests.

- Major League Baseball player data from 1986-87: [data](https://github.com/justmarkham/DAT8/blob/master/data/hitters.csv), [data dictionary](https://cran.r-project.org/web/packages/ISLR/ISLR.pdf) (page 7)
- Each observation represents a player.
- **Goal:** Predict player salary.

### Preparing the Data

In [None]:
# Read in the data.
path = Path('..', 'assets', 'data', 'hitters.csv')
hitters = pd.read_csv(path)

In [None]:
# Remove rows with missing values.


In [None]:
# Encode categorical variables as integers.


In [None]:
# Create a scatter plot of hits vs years, colored by salary


In [None]:
# Define features: Exclude career statistics (which start with "C") and the response (salary).


In [None]:
# Define X and y.


<a id="decision-tree"></a>
## Predicting Salary With a Decision Tree

Let's first recall how we might predict salary using a single decision tree.

We'll first find the best **max_depth** for a decision tree using cross-validation:
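For reference, here is a sketch of that search pattern on synthetic data (an assumption, so the block runs without `hitters.csv`). The range of depths tried is also an illustrative choice.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic stand-in data (an assumption; not the hitters data).
X = rng.uniform(0, 10, size=(250, 3))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + rng.normal(0, 0.5, size=250)

max_depth_range = list(range(1, 11))
rmse_scores = []

for depth in max_depth_range:
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=1)
    # scikit-learn reports negated MSE so that higher is always better.
    mse_scores = -cross_val_score(
        treereg, X, y, cv=10, scoring='neg_mean_squared_error'
    )
    rmse_scores.append(np.mean(np.sqrt(mse_scores)))

# Pair each RMSE with its depth and keep the smallest RMSE.
best_rmse, best_depth = min(zip(rmse_scores, max_depth_range))
print('Best max_depth:', best_depth, 'with RMSE', round(float(best_rmse), 3))
```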

In [None]:
# List of values to try for max_depth:


In [None]:
# Use 10-fold cross-validation with each value of max_depth.


In [None]:
# Plot max_depth (x-axis) versus RMSE (y-axis).


In [None]:
# Show the best RMSE and the corresponding max_depth.


In [None]:
# max_depth=2 was best, so fit a tree using that parameter.


In [None]:
# Compute feature importances.


<a id="random-forest-demo"></a>
## Predicting Salary With a Random Forest

### Fitting a Random Forest With the Best Parameters

In [None]:
# Import random forest regressor


In [None]:
# max_features=5 is best and n_estimators=150 is sufficiently large.


In [None]:
# Compute feature importances.


In [None]:
# Compute the out-of-bag R-squared score.


In [None]:
# Find the average RMSE.


<a id="comparing"></a>
## Comparing Random Forests With Decision Trees

**Advantages of random forests:**

- Their performance is often competitive with the best supervised learning methods, unlike that of decision trees.
- They provide a more reliable estimate of feature importance.
- They allow you to estimate out-of-sample error without using train/test split or cross-validation.
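The last two advantages can be sketched with `RandomForestRegressor` on synthetic data (an assumption). The parameter values are illustrative, not tuned.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic stand-in data (an assumption; not the hitters data).
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.5, size=300)

rfreg = RandomForestRegressor(
    n_estimators=150, max_features=2, oob_score=True, random_state=1
)
rfreg.fit(X, y)

# Importances are averaged over all trees and sum to 1.
print('Feature importances:', rfreg.feature_importances_.round(3))

# R-squared from out-of-bag predictions; no separate test set was held out.
print('OOB R-squared:', round(rfreg.oob_score_, 3))
```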

**Disadvantages of random forests:**

- They are less interpretable.
- They are slower to train.
- They are slower to predict.

**Exercise (12 mins., pair programming)**

- How does a random forest differ from an ordinary bagged decision tree?

- Use a `RandomForestClassifier` estimator to predict who survives on the Titanic. Use five-fold cross-validation to evaluate its accuracy.

*Tip:* For your first model, just use the numeric columns without missing values as-is as your features. You can mess with missing data, dummy coding, and additional feature engineering later.

In [None]:
path = Path('..', 'assets', 'data', 'titanic.csv')
titanic = pd.read_csv(path)

- **BONUS:** Find a way to change your model that improves its five-fold cross-validation accuracy.

*Tip:* If you have a small gap between training-set performance and test-set performance, then focus on decreasing bias. Otherwise, focus on decreasing variance.

$\blacksquare$

<a id="tuning"></a>
## Optional: Hyperparameter Tuning

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score  # used by the tuning loops below

rfreg = RandomForestRegressor()
rfreg

### Tuning n_estimators

One important tuning parameter is **n_estimators**, which represents the number of trees that should be grown. This should be a large enough value that the error seems to have "stabilized."

In [None]:
# List of values to try for n_estimators:
estimator_range = list(range(10, 310, 10))

# List to store the average RMSE for each value of n_estimators:
RMSE_scores = []

# Use five-fold cross-validation with each value of n_estimators (Warning: Slow!).
for estimator in estimator_range:
    rfreg = RandomForestRegressor(n_estimators=estimator, random_state=1)
    MSE_scores = cross_val_score(rfreg, X, y, cv=5, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

In [None]:
# Plot RMSE (y-axis) versus n_estimators (x-axis).

fig, ax = plt.subplots()
ax.plot(estimator_range, RMSE_scores);
ax.set_xlabel('n_estimators');
ax.set_ylabel('RMSE (lower is better)');

**Adding more trees generally does not hurt average performance, but the gains diminish while training and prediction time keep growing.**

### Tuning max_features

The other important tuning parameter is **max_features**, which represents the number of features that should be considered at each split.

In [None]:
# List of values to try for max_features:
feature_range = list(range(1, len(feature_cols)+1))

# List to store the average RMSE for each value of max_features:
RMSE_scores = []

# Use 10-fold cross-validation with each value of max_features (Warning: Super slow!).
for feature in feature_range:
    rfreg = RandomForestRegressor(n_estimators=150, max_features=feature, random_state=1)
    MSE_scores = cross_val_score(rfreg, X, y, cv=10, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

In [None]:
# Plot max_features (x-axis) versus RMSE (y-axis).
fig, ax = plt.subplots()
ax.plot(feature_range, RMSE_scores);
ax.set_xlabel('max_features');
ax.set_ylabel('RMSE (lower is better)');

In [None]:
# Show the best RMSE and the corresponding max_features.
sorted(zip(RMSE_scores, feature_range))[0]

<a id="summary"></a>
## Summary

**Which model is best?** It depends on the task. In many business cases, interpretability matters more than accuracy, so a single decision tree may be preferred. In other cases, accuracy on unseen data is paramount, and a random forest is usually the better choice because averaging many decorrelated trees reduces variance, so it typically overfits less than a single deep tree.

---

**In this lesson:**

- We looked at ensemble models.
- We saw how decision trees could be extended using two ensemble techniques -- bagging and random forests.
- We looked at methods of evaluating feature importance and tuning parameters.