## Chapter 7 - Ensemble Learning and Random Forests

### Bagging and Pasting

One way to get diverse classifiers is to use different models. Another way is to use the same model but train it on different subsets of the training set. 

Decision trees generally suffer from high variance. If we train the same model on different subsets of the training set, we could get vastly different results.

<b>Bagging</b>, or bootstrap aggregation, is a general-purpose procedure for reducing the variance of a statistical learning method. Recall that given a set of $n$ independent observations, $Z_1, \cdots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar Z$ of the observations is $\frac{\sigma^2}{n}$. In other words, averaging a set of observations reduces variance.
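The $\frac{\sigma^2}{n}$ result can be checked numerically. The snippet below is a minimal sketch, assuming normally distributed observations; the values of `sigma`, `n` and `n_repeats` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0          # standard deviation of each observation (assumed)
n = 25               # observations per sample (assumed)
n_repeats = 100_000  # number of independent samples to simulate

# Each row is one sample of n independent observations Z_1, ..., Z_n
samples = rng.normal(loc=0.0, scale=sigma, size=(n_repeats, n))
sample_means = samples.mean(axis=1)

empirical_var = sample_means.var()   # variance of the mean across samples
theoretical_var = sigma**2 / n       # sigma^2 / n

print(f"empirical variance of the mean: {empirical_var:.4f}")
print(f"sigma^2 / n:                    {theoretical_var:.4f}")
```

The two printed values should agree closely, and increasing `n` shrinks both.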

Bagging involves generating $B$ different bootstrapped training datasets. We then train a model on each bootstrapped training set in order to get $\hat f ^{*b}(x)$ and average all the predictions to get 
$$\hat f _{\text{bag}}(x) = \frac 1B \sum _{b=1}^B \hat f ^{*b}(x)$$
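The formula can be sketched by hand before turning to Scikit-Learn's ensemble classes. The toy regression data, the variable names and the choice of `B = 50` below are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Toy regression data (assumed for illustration): a noisy sine curve
X_demo = rng.uniform(0, 6, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=200)

B = 50                # number of bootstrapped training sets
n = len(X_demo)
trees = []
for b in range(B):
    # Bootstrap: draw n row indices with replacement
    idx = rng.integers(0, n, size=n)
    trees.append(DecisionTreeRegressor(random_state=b).fit(X_demo[idx], y_demo[idx]))

# f_bag(x) = (1/B) * sum over b of f^{*b}(x)
X_new = np.linspace(0, 6, 5).reshape(-1, 1)
per_tree_preds = np.stack([t.predict(X_new) for t in trees])  # shape (B, 5)
f_bag = per_tree_preds.mean(axis=0)
print(f_bag)
```

Each row of `per_tree_preds` is one $\hat f^{*b}$ evaluated at the new points, and `f_bag` is their average.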

Bagging is particularly useful for decision trees. To apply bagging to regression trees, construct $B$ regression trees using $B$ bootstrapped training sets and average the resulting predictions. Each tree has high variance but low bias, and averaging the predictions of $B$ trees reduces the variance. For classification there is no direct average of class labels, so we use either hard voting (the ensemble predicts the class that receives the most votes from the learners) or soft voting (the ensemble predicts the class with the highest class probability, averaged across all the learners).
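Hard and soft voting can disagree. The sketch below uses made-up class probabilities for three hypothetical learners on a single instance. The numbers are assumptions chosen to show the difference.

```python
import numpy as np

# Hypothetical class probabilities from three learners for one instance
# (columns: class 0, class 1); the values are made up for illustration.
probas = np.array([
    [0.55, 0.45],   # learner 1 leans towards class 0
    [0.60, 0.40],   # learner 2 leans towards class 0
    [0.05, 0.95],   # learner 3 is confident in class 1
])

# Hard voting: each learner casts one vote for its most likely class
votes = probas.argmax(axis=1)
hard_vote = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the largest
mean_proba = probas.mean(axis=0)
soft_vote = mean_proba.argmax()

print(hard_vote, soft_vote)
```

Here hard voting picks class 0 (two votes to one), while soft voting picks class 1 because the averaged probabilities are 0.4 and 0.6. Soft voting gives more weight to confident learners.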

The predictors are trained independently of one another, so they can be trained in parallel (for example, across CPU cores). This is one reason bagging and pasting are popular: they scale very well.

If each training dataset is created by drawing instances with replacement, the method is called bagging (the sampling step itself is bootstrapping). If the instances are drawn without replacement, it is called pasting.

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier

from sklearn.datasets import make_moons

In [2]:
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [3]:
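# Train a bagging ensemble of 500 decision trees; each tree sees 100 instances
# drawn with replacement (bootstrap=True), and n_jobs=-1 uses all CPU cores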
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, 
                            max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
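For comparison, pasting uses the same class with `bootstrap=False`. The sketch below regenerates the moons data under separate names (`X_demo`, `X_tr`, and so on, chosen here so the notebook's variables are not overwritten) and checks that the first tree's subset contains no repeated instances.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# bootstrap=False draws each 100-instance subset without replacement (pasting)
paste_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                              max_samples=100, bootstrap=False, random_state=42)
paste_clf.fit(X_tr, y_tr)

# estimators_samples_ holds the row indices each tree was trained on
first_subset = paste_clf.estimators_samples_[0]
print(len(first_subset), len(set(first_subset.tolist())))
print(paste_clf.score(X_te, y_te))
```

With pasting, the two printed counts are equal because no instance is drawn twice for the same tree.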

### Out of Bag Error Estimation

There is a very straightforward way to estimate the test error of a bagged model. In bagging, some instances may be sampled several times for a given predictor while others are not sampled at all. When each bootstrap sample is as large as the training set, roughly two-thirds (about 63%) of the training instances are used to train a given predictor, and the remaining third (about 37%) are not. The latter are referred to as that predictor's <u>out-of-bag</u> observations.
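The out-of-bag fraction follows from the chance that one instance is missed by all $n$ draws, $(1 - \frac 1n)^n$, which tends to $e^{-1} \approx 0.368$. The simulation below is a small sketch of this, with `n` and `n_trials` chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000          # training set size (assumed)
n_trials = 2000   # number of bootstrap samples to simulate

oob_fractions = []
for _ in range(n_trials):
    idx = rng.integers(0, n, size=n)          # bootstrap sample of size n
    n_unique = np.unique(idx).size            # distinct instances drawn
    oob_fractions.append(1 - n_unique / n)    # share of instances never drawn

simulated = np.mean(oob_fractions)
print(simulated)            # simulated out-of-bag fraction
print((1 - 1 / n) ** n)     # exact probability that one instance is missed
print(np.exp(-1))           # limit as n grows
```

If `max_samples` is smaller than the training set, as in the classifiers in this notebook, each tree leaves out a larger share of the instances.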

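We can use these out-of-bag samples to evaluate the ensemble without setting aside a separate validation set: each instance is predicted using only the predictors that did not see it during training. This gives an out-of-bag (OOB) MSE or accuracy score. In Scikit-Learn, set `oob_score=True`: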

In [4]:
# Train for bagging classifier, factoring in the OOB option
bag_clf2 = BaggingClassifier(DecisionTreeClassifier(), 
                            n_estimators=500, max_samples=100, bootstrap=True, 
                            n_jobs=-1, oob_score=True)
bag_clf2.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='gini',
                                                        max_depth=None,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort='deprecated',
                                                        random_state=None,


In [5]:
# Obtain the oob score
print(bag_clf2.oob_score_)
# Obtain the decision function for each OOB sample
print(bag_clf2.oob_decision_function_[:3])

0.9253333333333333
[[0.31315789 0.68684211]
 [0.38258575 0.61741425]
 [1.         0.        ]]


In [6]:
# Accuracy score for the test set
y_pred2 = bag_clf2.predict(X_test)
print(accuracy_score(y_test, y_pred2))

0.92


### Feature Importance

To understand the model better, we can look at the feature importances of each tree. For a single decision tree, simply use `tree.feature_importances_`. A Bagging Classifier is an ensemble of decision trees, so we can read `feature_importances_` from each tree in `estimators_` and aggregate them.

In [7]:
for i in bag_clf2.estimators_[:3]:
    print(i.feature_importances_)

[0.52125406 0.47874594]
[0.32160243 0.67839757]
[0.45760872 0.54239128]
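The loop above prints only the first three trees. One simple way to aggregate is to average the per-tree importances, as sketched below. The ensemble is rebuilt under the assumed name `demo_bag`, so the snippet runs on its own. The averaging works only because every tree sees all features (the default `max_features=1.0`). With feature sampling, each tree's importances would first have to be mapped back through `estimators_features_`.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)

demo_bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                             max_samples=100, bootstrap=True, random_state=42)
demo_bag.fit(X_demo, y_demo)

# One row of importances per tree, shape (n_estimators, n_features)
per_tree = np.array([tree.feature_importances_ for tree in demo_bag.estimators_])

# Simple aggregate: mean importance per feature, plus the spread across trees
mean_importance = per_tree.mean(axis=0)
std_importance = per_tree.std(axis=0)
print(mean_importance, std_importance)
```

The standard deviation gives a rough sense of how much the importances vary from tree to tree.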


In [8]:
# Feature importance for Decision Tree example
# Ingest (iris-specific names avoid overwriting the moons X and y)
iris_dataset = datasets.load_iris()
X_iris = pd.DataFrame(iris_dataset['data'], columns=iris_dataset['feature_names'])
y_iris = pd.Series(iris_dataset['target'])

# Train
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X_iris[['petal length (cm)','petal width (cm)']], y_iris)

# Feature importances for Decision trees
print(tree_clf.feature_importances_)

[0.56199095 0.43800905]


With this in mind, we can combine Bagging classifiers with the regularisation parameters of ordinary decision trees. The Bagging Classifier can also sample features: its `max_features` parameter trains each tree on a randomly drawn subset of at most that many features from the original feature space.

- Performing sampling on both features and instances during training is called <b>Random Patches</b>
- Performing sampling only on the features is <b>Random Subspaces</b>
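Both variants can be set up with `BaggingClassifier`. The sketch below uses the iris data because it has four features. The parameter values (`max_features=2`, `max_samples=0.5`, 50 trees) are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_iris4, y_iris4 = load_iris(return_X_y=True)   # 150 instances, 4 features

# Random Subspaces: keep every instance, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=False, max_samples=1.0,            # all instances, no resampling
    bootstrap_features=False, max_features=2,    # 2 distinct features per tree
    random_state=42)
subspaces_clf.fit(X_iris4, y_iris4)

# Random Patches: sample both instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=True, max_samples=0.5,             # half the instances, with replacement
    bootstrap_features=False, max_features=2,
    random_state=42)
patches_clf.fit(X_iris4, y_iris4)

# estimators_features_ lists the feature indices each tree was given
print(subspaces_clf.estimators_features_[:3])
print(patches_clf.estimators_features_[:3])
```

Setting `bootstrap_features=True` would instead draw the features with replacement, so a tree could receive the same feature twice.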

### Random Forests

Random Forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. As in bagging, Random Forests build a number of decision trees on bootstrapped training samples. But each time a split is considered while building a tree, a random sample of $m$ predictors is chosen as split candidates from all $p$ features, and the split may use only one of those $m$ predictors. A fresh sample of $m$ predictors is taken at each split.

This decorrelates the trees. Consider a dataset with one strong predictor, $p_1$. If bagging is used, most if not all of the trees will look quite similar to each other (with $p_1$ at the top split). Many of the bagged trees will therefore be highly correlated, and averaging many highly correlated quantities <u>does not reduce variance as much as averaging many uncorrelated quantities</u>. So in this setting bagging does not lead to a substantial reduction in variance over a single tree.
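The effect of correlation can be made concrete. For $B$ quantities with common variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$, so the first term does not shrink however many trees are added. The simulation below is a sketch of this under assumed values of `B`, `sigma` and `rho`. The construction of correlated variables from a shared component is an illustrative device and not a model of real trees.

```python
import numpy as np

rng = np.random.default_rng(0)
B, sigma, n_repeats = 100, 1.0, 20_000   # assumed values for illustration

def variance_of_average(rho):
    """Empirical variance of the mean of B equicorrelated variables."""
    # Z_b = sigma * (sqrt(rho) * shared + sqrt(1 - rho) * own) has variance
    # sigma^2 and pairwise correlation rho between any two Z_a, Z_b.
    shared = rng.normal(size=(n_repeats, 1))
    own = rng.normal(size=(n_repeats, B))
    z = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * own)
    return z.mean(axis=1).var()

for rho in (0.0, 0.5, 0.9):
    theory = rho * sigma**2 + (1 - rho) * sigma**2 / B
    print(f"rho={rho}: simulated={variance_of_average(rho):.3f}, theory={theory:.3f}")
```

With `rho = 0` the variance falls to about $\sigma^2 / B$, while with `rho = 0.9` it stays close to $0.9\,\sigma^2$ despite averaging 100 quantities.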

Random forests overcome this by forcing each split to consider only a subset of the features. Instead of searching for the best feature among all features when splitting a node, the algorithm searches for the best feature among a random subset of $m$ features. On average, $\frac{p-m}{p}$ of the splits will not even consider the strong predictor, so other predictors have a better chance of being chosen, including at the top split. This process <u>decorrelates the trees, so the averaged prediction has less variance</u>. Some textbooks describe this as greater diversity among the trees. Of course, if $m=p$ then this process is equivalent to bagging.
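The $\frac{p-m}{p}$ figure can be checked with a short simulation. The values of `p` and `m` and the index of the strong predictor below are assumptions for illustration. In Scikit-Learn's `RandomForestClassifier`, $m$ corresponds to the `max_features` parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 9, 3          # total features and candidates per split (assumed)
strong = 0           # index of the strong predictor (assumed)
n_splits = 20_000    # number of simulated splits

excluded = 0
for _ in range(n_splits):
    # Draw m distinct candidate features for this split
    candidates = rng.choice(p, size=m, replace=False)
    if strong not in candidates:
        excluded += 1

simulated_fraction = excluded / n_splits
print(simulated_fraction, (p - m) / p)
```

The simulated share of splits that never see the strong predictor should be close to $(p-m)/p$.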

In [9]:
forest_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
forest_clf.fit(X_train, y_train)
y_pred3 = forest_clf.predict(X_test)

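Important features are likely to appear closer to the root of the tree, so one way to gauge importance is to compute the average depth at which a feature appears across all the trees. Scikit-Learn's `feature_importances_` attribute instead reports an impurity-based score: how much the nodes that use a feature reduce impurity on average across the trees, normalised so that the importances sum to 1.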

In [10]:
forest_clf.feature_importances_

array([0.41475403, 0.58524597])

### Extra Trees 

Extra Trees push the randomness of a Random Forest one step further: instead of searching for the best possible threshold for each candidate feature, they use random thresholds. An ensemble built this way is an Extremely Randomised Trees ensemble (or Extra Trees ensemble).

Extra Trees are also much faster to train than random forests, since finding the best threshold for each feature at every node is one of the most time-consuming steps of growing a tree.
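The cost difference can be seen on a single feature. The sketch below is a simplified illustration and not Scikit-Learn's implementation: it compares an exhaustive threshold search with one threshold drawn uniformly from the feature's range. The data and the helper `gini_after_split` are hypothetical.

```python
import numpy as np

def gini_after_split(x, y, threshold):
    """Weighted Gini impurity of the two children for a binary target."""
    total = 0.0
    for mask in (x <= threshold, x > threshold):
        if mask.any():
            p1 = y[mask].mean()                  # share of class 1 in this child
            total += mask.mean() * 2 * p1 * (1 - p1)
    return total

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)                 # one made-up feature
y = (x + rng.normal(scale=1.5, size=200) > 5).astype(int)

# Exhaustive search: evaluate every midpoint between consecutive sorted values
xs = np.sort(x)
midpoints = (xs[:-1] + xs[1:]) / 2
best_t = min(midpoints, key=lambda t: gini_after_split(x, y, t))

# Extra Trees style: a single threshold drawn uniformly from the feature's range
random_t = rng.uniform(x.min(), x.max())

print(len(midpoints), "thresholds evaluated to find the best split")
print(gini_after_split(x, y, best_t), gini_after_split(x, y, random_t))
```

The random threshold needs one impurity evaluation instead of 199, at the price of a split that is usually somewhat worse. The ensemble compensates by averaging over many such trees.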

In [11]:
xtrees_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
xtrees_clf.fit(X_train, y_train)
y_pred4 = xtrees_clf.predict(X_test)