## Chapter 7 - Ensemble Learning and Random Forests

### Bagging and Pasting

One way to get diverse classifiers is to use different training algorithms. Another way is to use the same algorithm for every predictor but train each one on a different random subset of the training set.

Each predictor is trained on a randomly drawn subset of the training set. When the instances are drawn with replacement, the method is called bagging (short for bootstrap aggregating). When they are drawn without replacement, it is called pasting.
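The difference between the two sampling schemes can be sketched with NumPy. This is a minimal illustration, not how scikit-learn implements it; the sizes `n_instances` and `subset_size` are arbitrary values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances = 10   # size of a toy training set (illustrative)
subset_size = 6    # instances drawn for one predictor (illustrative)

# Bagging: draw indices with replacement, so an index may repeat.
bagging_idx = rng.choice(n_instances, size=subset_size, replace=True)

# Pasting: draw indices without replacement, so every index is unique.
pasting_idx = rng.choice(n_instances, size=subset_size, replace=False)

print("bagging:", np.sort(bagging_idx))
print("pasting:", np.sort(pasting_idx))
```

In `BaggingClassifier`, the `bootstrap` parameter selects between the two: `bootstrap=True` draws with replacement (bagging) and `bootstrap=False` draws without replacement (pasting).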

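Once all predictors are trained, the ensemble predicts a new instance by aggregating the predictions of all predictors, for example by taking the most frequent class (hard voting). Each individual predictor has a higher bias than one trained on the full training set, but aggregation reduces both bias and variance. The ensemble typically ends up with a similar bias and a lower variance than a single predictor trained on the full training set. The predictors can also be trained in parallel, which is one reason bagging and pasting are popular: they scale very well.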
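As a rough sketch of the aggregation step, hard voting picks the most frequent class among the predictors for each instance. The `predictions` array and the `hard_vote` helper below are hypothetical and exist only for this illustration; ties go to the lowest class label because of how `argmax` works.

```python
import numpy as np

# Hypothetical class predictions: one row per predictor, one column per instance.
predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])

def hard_vote(preds):
    """Return the most frequent class label in each column."""
    n_classes = preds.max() + 1
    # Count the votes for each class per instance; shape (n_classes, n_instances).
    votes = np.stack([(preds == c).sum(axis=0) for c in range(n_classes)])
    return votes.argmax(axis=0)

ensemble_pred = hard_vote(predictions)
print(ensemble_pred)  # → [0 1 0 0]
```

Note that `BaggingClassifier` averages class probabilities instead (soft voting) when the base estimator provides `predict_proba`, as decision trees do.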

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier

from sklearn.datasets import make_moons

In [2]:
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [3]:
# Train a bagging ensemble of 500 decision trees, each on 100 instances drawn with replacement
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, 
                            max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
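
To see what the ensemble buys in practice, one option is to compare it with a single decision tree on the same split. The sketch below rebuilds the data so that it runs on its own; the fixed `random_state` values are assumptions added for reproducibility, and the exact accuracies will vary with the seed.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: one unconstrained decision tree trained on the full training set.
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
tree_acc = accuracy_score(y_test, tree_clf.predict(X_test))

# Bagging ensemble with the same settings as the cell above, plus a fixed seed.
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, bootstrap=True, random_state=42)
bag_clf.fit(X_train, y_train)
bag_acc = accuracy_score(y_test, bag_clf.predict(X_test))

print(f"single tree: {tree_acc:.3f}, bagging: {bag_acc:.3f}")
```

On noisy data like this, the ensemble would be expected to generalise at least as well as the single tree, but that is something to check rather than assume.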

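With bagging, each predictor is trained on a sample drawn with replacement, so some training instances are never drawn for that predictor. These are its out-of-bag (OOB) instances. Because the predictor never saw them during training, they can be used to evaluate it without a separate validation set. Setting `oob_score=True` requests this evaluation automatically after training.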
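
How many instances end up out-of-bag can be estimated with a short simulation. This is a sketch under the assumption that a predictor draws as many instances as the training set contains (`m` draws from `m` instances); `m` itself is an arbitrary value.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000  # hypothetical training-set size

# One bootstrap sample of size m: draw m indices with replacement.
sample_idx = rng.choice(m, size=m, replace=True)

# Instances that were never drawn are out-of-bag for this predictor.
oob_mask = np.ones(m, dtype=bool)
oob_mask[sample_idx] = False

oob_fraction = oob_mask.mean()
expected = (1 - 1 / m) ** m  # approaches exp(-1), about 0.368, as m grows
print(f"observed OOB fraction: {oob_fraction:.3f}, expected: {expected:.3f}")
```

When fewer instances are drawn, the OOB share is larger. With `max_samples=100` drawn from the 375 training instances used in these cells, the same formula gives (1 - 1/375)^100, or roughly 77% out-of-bag per predictor.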

In [4]:
# Same bagging ensemble, with out-of-bag evaluation enabled
bag_clf2 = BaggingClassifier(DecisionTreeClassifier(),
                             n_estimators=500, max_samples=100, bootstrap=True,
                             n_jobs=-1, oob_score=True)
bag_clf2.fit(X_train, y_train)
# Accuracy estimated on the out-of-bag instances
print(bag_clf2.oob_score_)
# OOB class probabilities for the first three training instances
print(bag_clf2.oob_decision_function_[:3])

0.92
[[0.35449735 0.64550265]
 [0.43684211 0.56315789]
 [1.         0.        ]]


In [5]:
# Accuracy score for the test set
y_pred2 = bag_clf2.predict(X_test)
print(accuracy_score(y_test, y_pred2))

0.928


`BaggingClassifier` can also sample features. Each predictor is then trained on a random subset of the input features, controlled by the `max_features` and `bootstrap_features` parameters (the feature counterparts of `max_samples` and `bootstrap`).

Sampling both training instances and features is called the Random Patches method. Keeping all training instances and sampling only the features is called the Random Subspaces method.
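
Both methods can be expressed through `BaggingClassifier` parameters. The sketch below uses a synthetic 10-feature dataset from `make_classification`, because the moons data has only two features; the dataset, the sampling ratios and the ensemble size are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 10-feature dataset, used here only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Random Subspaces: keep every instance, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    random_state=42)
subspaces_clf.fit(X, y)

# Random Patches: sample both instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5,
    random_state=42)
patches_clf.fit(X, y)

# Feature indices seen by the first predictor of each ensemble.
print(subspaces_clf.estimators_features_[0])
print(patches_clf.estimators_features_[0])
```

Each predictor sees 5 of the 10 features here. Because `bootstrap_features=True` draws features with replacement, an index can appear more than once in a predictor's feature list.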

### Random Forests

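A random forest is an ensemble of decision trees, generally trained with bagging. `RandomForestClassifier` is more convenient than passing a `DecisionTreeClassifier` to `BaggingClassifier`, and it is optimised for decision trees. It also adds extra randomness when growing trees: instead of searching for the best feature over all features when splitting a node, it searches for the best feature among a random subset of features. This gives greater tree diversity, which trades a higher bias for a lower variance and generally yields a better model overall.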

In [6]:
forest_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
forest_clf.fit(X_train, y_train)
y_pred3 = forest_clf.predict(X_test)
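
A random forest behaves much like a bagging ensemble whose trees consider a random subset of features at each split. The sketch below builds both and measures how often they agree on the test set. The two are not guaranteed to be identical, and the ensemble size and `random_state` values are assumptions chosen to keep the example quick and reproducible.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging over trees that consider a random subset of features at each split.
bag_of_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=200, random_state=42)
bag_of_trees.fit(X_train, y_train)

# Random forest with the same tree constraints.
forest = RandomForestClassifier(
    n_estimators=200, max_leaf_nodes=16, random_state=42)
forest.fit(X_train, y_train)

# Fraction of test instances on which the two ensembles agree.
agreement = (bag_of_trees.predict(X_test) == forest.predict(X_test)).mean()
print(f"agreement: {agreement:.3f}")
```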

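Random forests also make it easy to measure the relative importance of each feature. scikit-learn scores a feature by how much the tree nodes that use it reduce impurity, weighted by the number of training samples reaching each node and averaged across all trees in the forest. The scores are normalised to sum to 1 and are available in `feature_importances_` after training. Important features also tend to appear closer to the root of a tree.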

In [7]:
forest_clf.feature_importances_

array([0.41746298, 0.58253702])
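
The forest-level scores can be related to the individual trees. As a sketch, and to the best of my understanding of scikit-learn's behaviour, averaging the per-tree impurity-based importances reproduces `feature_importances_`. The model below is refit on the full moons dataset with a fixed seed so that it runs on its own; those choices are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
forest = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16,
                                random_state=42)
forest.fit(X, y)

# Each fitted tree exposes its own normalised impurity-based importances.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Averaging over the trees should reproduce the forest-level scores.
mean_importances = per_tree.mean(axis=0)
print(mean_importances)
print(forest.feature_importances_)
print(forest.feature_importances_.sum())
```

Impurity-based scores are computed from the training data only, so they are best read as a quick first look rather than a definitive ranking.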

Extremely Randomised Trees (Extra-Trees for short) push the randomness of a random forest one step further: rather than searching for the best threshold for each candidate feature, as regular decision trees do, they use random thresholds. This again trades more bias for a lower variance. Extra-Trees are also much faster to train than regular random forests, because finding the best threshold for each feature at every node is one of the most time-consuming parts of growing a tree.

In [8]:
xtrees_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
xtrees_clf.fit(X_train, y_train)
y_pred4 = xtrees_clf.predict(X_test)
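
It is hard to tell in advance which of the two ensembles will do better on a given dataset, so a reasonable approach is to try both and compare them with cross-validation. The sketch below does that on the moons data; the ensemble size, `cv=5` and the seeds are assumptions for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

models = {
    "random forest": RandomForestClassifier(
        n_estimators=100, max_leaf_nodes=16, random_state=42),
    "extra trees": ExtraTreesClassifier(
        n_estimators=100, max_leaf_nodes=16, random_state=42),
}

# Mean 5-fold cross-validated accuracy for each ensemble.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```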