# Ensemble Learning and Random Forests

## Voting Classifiers

Suppose you have a few classifiers, each achieving about 80% accuracy. A simple way of creating a better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This is called a *hard voting* classifier.

Often achieves a higher accuracy score than the best classifier in the ensemble. Even if each classifier is a *weak learner* (scores slightly better than random guessing), the ensemble can still be a *strong learner*, provided there are a sufficient number of weak learners and they are sufficiently diverse. 
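To see why many weak learners can add up to a strong one, the sketch below computes the probability that a strict majority of voters is correct. It assumes the classifiers make fully independent errors, which real classifiers trained on the same data do not, so the numbers are an optimistic upper bound. The function name `majority_vote_accuracy` is made up for this illustration.

```python
import math

def majority_vote_accuracy(n_classifiers, p_correct):
    """Probability that a strict majority of independent voters is correct.

    Assumes 0 < p_correct < 1 and independent errors (a strong assumption).
    Terms are computed in log space to avoid overflow for large ensembles.
    """
    total = 0.0
    for k in range(n_classifiers // 2 + 1, n_classifiers + 1):
        log_term = (
            math.lgamma(n_classifiers + 1)
            - math.lgamma(k + 1)
            - math.lgamma(n_classifiers - k + 1)
            + k * math.log(p_correct)
            + (n_classifiers - k) * math.log(1.0 - p_correct)
        )
        total += math.exp(log_term)
    return total

# Each voter is right only 51% of the time; odd sizes avoid ties.
for n in (1, 11, 101, 1001):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

The accuracy of the majority grows with the number of voters, but only because the errors are assumed to be uncorrelated. This is why diversity among the classifiers matters so much.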

Creates 3 different classifiers and trains a voting classifier in Scikit-Learn

In [11]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=1000, noise=0.15)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

voting_clf.fit(X_train, y_train)

If all classifiers are able to estimate class probabilities (i.e., they all have a `predict_proba()` method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called *soft voting*. It often performs better than hard voting because it gives more weight to highly confident votes.
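A small plain-Python sketch of the difference between the two voting schemes. The probability values and the helper names `hard_vote` and `soft_vote` are invented for illustration and are not part of Scikit-Learn.

```python
from collections import Counter

# Hypothetical class-probability estimates from three classifiers for one instance.
probas = [
    [0.55, 0.45],  # classifier A leans slightly towards class 0
    [0.60, 0.40],  # classifier B leans slightly towards class 0
    [0.05, 0.95],  # classifier C is very confident in class 1
]

def hard_vote(probas):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the class probabilities, then pick the most likely class."""
    n_classes = len(probas[0])
    mean = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)

print("hard voting:", hard_vote(probas))
print("soft voting:", soft_vote(probas))
```

Here the two schemes disagree: two lukewarm votes for class 0 win the hard vote, while the single confident vote for class 1 dominates the averaged probabilities. In Scikit-Learn this corresponds to `voting='soft'`; note that `SVC` only provides `predict_proba()` when it is created with `probability=True`.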

## Bagging and Pasting

Another approach is to use the same training algorithm for every predictor and train them on different random subsets of the training data. When sampling is performed *with* replacement, this is called *bagging*. When performed *without* it is called *pasting*. 

Makes a prediction by aggregating the predictions of all predictors. The aggregation function is typically the *statistical mode* (i.e., the most frequent prediction, like a hard voting classifier) for classification, or the average for regression. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.
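The two sampling schemes and the two aggregation functions can be sketched with the standard library alone. The index list and the prediction values below are made up for illustration.

```python
import random
from statistics import mean, mode

rng = random.Random(42)
training_indices = list(range(10))

# Bagging: draw with replacement, so an index can appear more than once.
bagging_sample = rng.choices(training_indices, k=6)
# Pasting: draw without replacement, so every index is distinct.
pasting_sample = rng.sample(training_indices, k=6)

# Hypothetical predictions from five predictors for a single instance.
class_predictions = [1, 0, 1, 1, 0]
value_predictions = [2.0, 2.5, 1.5, 2.2, 1.8]

ensemble_class = mode(class_predictions)   # most frequent class (classification)
ensemble_value = mean(value_predictions)   # average (regression)

print(sorted(bagging_sample), sorted(pasting_sample))
print(ensemble_class, ensemble_value)
```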

Predictors can be trained in parallel, via different CPU cores or servers. Similarly predictions can also be made in parallel. Makes this method very popular because it scales well. 

## Bagging and Pasting in Scikit-Learn

Following code trains an ensemble of 500 Decision Tree classifiers: each is trained on 100 training instances randomly sampled from the training set with replacement (this is bagging; to use pasting instead, set `bootstrap=False`). The `n_jobs` parameter tells Scikit-Learn the number of CPU cores to use for training and predictions (-1 means use all available cores).

In [12]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, 
    max_samples=100, bootstrap=True, n_jobs=-1
)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

## Out-of-Bag Evaluation

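With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. By default a `BaggingClassifier` samples *m* training instances with replacement (*m* being the size of the training set), so on average only about 63% of the training instances are sampled for each predictor. The remaining 37% that are not sampled are called *out-of-bag* (oob) instances. Note they are not the same 37% for all predictors.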
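The 37% figure follows from the chance that a given instance is never picked in *m* draws with replacement, which is (1 - 1/m)^m and approaches exp(-1) as *m* grows. A quick check, with function names chosen only for this sketch:

```python
import math
import random

def expected_oob_fraction(m):
    """Chance that one instance is never drawn in m draws with replacement."""
    return (1.0 - 1.0 / m) ** m

def simulated_oob_fraction(m, seed=0):
    """Draw one bootstrap sample of size m and measure the share left out."""
    rng = random.Random(seed)
    sampled = {rng.randrange(m) for _ in range(m)}
    return 1.0 - len(sampled) / m

print(expected_oob_fraction(1000))    # close to exp(-1)
print(math.exp(-1))
print(simulated_oob_fraction(1000))   # one random bootstrap sample
```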

A bagging ensemble can be evaluated using oob instances without the need for a separate validation set. In Scikit-Learn set `oob_score=True` when creating a `BaggingClassifier` to request an automatic oob evaluation after training. 

In [13]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True)

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.99

According to this oob evaluation, the classifier is likely to achieve about 99% accuracy on the test set

In [14]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.97

Close enough

`oob_decision_function_` variable returns the class probabilities for each training instance

In [15]:
bag_clf.oob_decision_function_

array([[0.        , 1.        ],
       [0.        , 1.        ],
       [1.        , 0.        ],
       ...,
       [0.98159509, 0.01840491],
       [0.        , 1.        ],
       [1.        , 0.        ]])

## Random Patches and Random Subspaces

`BaggingClassifier` supports sampling the features controlled by two hyperparameters: `max_features` and `bootstrap_features`. Work the same way as `max_samples` and `bootstrap` but for feature sampling instead of instance sampling. Thus, each predictor will be trained on a random subset of the input features. 

Useful when dealing with high-dimensional inputs (such as images). Sampling both training instances and features is called the *Random Patches* method. Keeping all training instances (setting `bootstrap=False` and `max_samples=1.0`) but sampling features (by setting `bootstrap_features=True` and/or `max_features` to a smaller value than 1.0) is called the *Random Subspaces* method.

Sampling features results in even more predictor diversity: more bias for lower variance.

# Random Forests

In [16]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

Random Forest introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. Results in greater tree diversity, trading higher bias for lower variance. 

Following `BaggingClassifier` is roughly equivalent to the previous `RandomForestClassifier`

In [17]:
bag_clf = BaggingClassifier(
    # 'sqrt' is what 'auto' meant for classifiers; 'auto' was removed in newer Scikit-Learn versions
    DecisionTreeClassifier(max_features='sqrt', max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

## Extra-Trees

When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting (as discussed earlier). It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds (like Decision Trees do). 

A forest of such extremely random trees is called an *Extremely Randomized Trees* ensemble (or *Extra-Trees*). Once again, this trades more bias for a lower variance. It also makes Extra-Trees much faster to train than regular Random Forests, because finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.
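The difference in cost can be sketched by comparing an exhaustive threshold search with a single random draw. This is a simplified illustration for one feature: the helper names are made up, and drawing the threshold uniformly between the feature's minimum and maximum is an assumption about how the random threshold is chosen.

```python
import random

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(xs, ys, threshold):
    """Weighted Gini impurity of the two sides of a split."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    return sum(len(part) / n * gini(part) for part in (left, right) if part)

def best_threshold(xs, ys):
    """Exhaustive search over midpoints, as a regular Decision Tree would do."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: split_impurity(xs, ys, t))

def random_threshold(xs, rng):
    """Extra-Trees style: a single uniform draw, no search at all."""
    return rng.uniform(min(xs), max(xs))

xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]
rng = random.Random(0)
t_best = best_threshold(xs, ys)
t_rand = random_threshold(xs, rng)
print(t_best, split_impurity(xs, ys, t_best))
print(t_rand, split_impurity(xs, ys, t_rand))
```

The random threshold is never better than the searched one on the training data, which is the extra bias; it costs a single draw instead of one impurity evaluation per candidate, which is the speed-up.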

`ExtraTreesClassifier` and `ExtraTreesRegressor` have the same API as `RandomForestClassifier` and `RandomForestRegressor` respectively. 

It's hard to tell in advance which one is better, so they must be compared using cross-validation.

## Feature Importance

Random Forests make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). It is a weighted average, where each node's weight is equal to the number of training samples that are associated with it. 
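As a worked example of that weighted average, the sketch below computes the importances for a single hypothetical tree with two split nodes. The node statistics are invented (but consistent Gini values), and a real forest would additionally average the result over all of its trees.

```python
from collections import defaultdict

# Hypothetical tree described by its split nodes:
# (feature, n_samples, impurity, (n_left, impurity_left), (n_right, impurity_right))
nodes = [
    ("petal length", 100, 0.48, (50, 0.00), (50, 0.32)),
    ("petal width",   50, 0.32, (40, 0.00), (10, 0.00)),
]
n_total = 100  # training samples reaching the root

raw = defaultdict(float)
for feature, n, imp, (n_l, imp_l), (n_r, imp_r) in nodes:
    # Impurity reduction achieved by this node's split
    decrease = imp - (n_l / n) * imp_l - (n_r / n) * imp_r
    # Weight the reduction by the share of training samples reaching the node
    raw[feature] += (n / n_total) * decrease

# Scale so that the importances sum to 1
total = sum(raw.values())
importances = {feature: value / total for feature, value in raw.items()}
print(importances)
```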

Can be accessed using the `feature_importances_` variable. Example using the iris dataset (the scores are scaled so that they sum to 1):

In [18]:
from sklearn.datasets import load_iris

iris = load_iris()

rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.10149877140751586
sepal width (cm) 0.021331743659590098
petal length (cm) 0.46033505589063345
petal width (cm) 0.41683442904226065


Most important are petal width and length

## Boosting

*Boosting* refers to any Ensemble method that can combine several weak learners into a strong learner. Train predictors sequentially, each trying to correct its predecessor. Most popular methods are *AdaBoost* (short for *Adaptive Boosting*) and *Gradient Boosting*. 

## AdaBoost

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases.

When training an AdaBoost classifier, the algorithm first trains a base classifier and uses it to make predictions on the training set. The algorithm then increases the relative weight of misclassified training instances. Then it trains a second classifier, using the updated weights, and again makes predictions on the training set, updates the instance weights, and so on.

Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that predictors have different weights depending on their overall accuracy on the weighted training set. 
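These notes do not spell out the update formulas, so the sketch below assumes the standard AdaBoost rule for binary classification: the predictor weight is `learning_rate * log((1 - r) / r)`, where `r` is the weighted error rate, and misclassified instances are boosted by the exponential of that weight before renormalizing. The function name `adaboost_round` is made up for this illustration.

```python
import math

def adaboost_round(weights, is_misclassified, learning_rate=1.0):
    """One AdaBoost update; returns (predictor_weight, new_instance_weights).

    Assumes the weighted error rate is strictly between 0 and 1.
    """
    total = sum(weights)
    error_rate = sum(w for w, bad in zip(weights, is_misclassified) if bad) / total
    predictor_weight = learning_rate * math.log((1.0 - error_rate) / error_rate)
    # Boost only the instances the predictor got wrong
    boosted = [
        w * math.exp(predictor_weight) if bad else w
        for w, bad in zip(weights, is_misclassified)
    ]
    norm = sum(boosted)
    return predictor_weight, [w / norm for w in boosted]

weights = [0.2] * 5                                   # start with equal weights
misclassified = [False, False, True, False, False]    # one mistake out of five
alpha, new_weights = adaboost_round(weights, misclassified)
print(alpha, new_weights)
```

After a single round, the one misclassified instance carries as much weight as the four correctly classified instances combined, so the next predictor is pushed to get it right.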

AdaBoost cannot be parallelized (because it is sequential). Each predictor can only be trained after the previous predictor has been trained and evaluated. Does not scale as well as bagging or pasting. 

Trains an AdaBoost classifier based on 200 *Decision Stumps*. A Decision Stump is a Decision Tree with `max_depth=1`: a tree composed of a single decision node plus two leaf nodes. 

In [19]:
from sklearn.ensemble import AdaBoostClassifier

# `algorithm` is left at its default: "SAMME.R" is deprecated in recent Scikit-Learn releases
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    learning_rate=0.5
)

ada_clf.fit(X_train, y_train)

If AdaBoost ensemble is overfitting training set, you can try reducing the number of estimators or more strongly regularizing the base estimator

## Gradient Boosting

Works similarly to AdaBoost (sequentially adds predictors to an ensemble, each one correcting its predecessor). But instead of tweaking the instance weights at every iteration (AdaBoost), it tries to fit the new predictor to the *residual errors* made by the previous predictor. 

Example using Decision Trees: *Gradient Tree Boosting* or *Gradient Boosted Regression Trees* (GBRT)

First fit a `DecisionTreeRegressor` to training set

In [21]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

Next, train a second `DecisionTreeRegressor` on the residual errors made by the first predictor

In [23]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

Then train a third regressor on the residual errors made by the second predictor

In [27]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

Now we have an ensemble of 3 trees that can make predictions on a new instance by adding up the predictions of all the trees

In [32]:
X_new = X # assume new instances
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

A simpler way to train GBRT ensembles is to use Scikit-Learn's `GradientBoostingRegressor`. Like the `RandomForestRegressor`, it has hyperparameters to control the growth of Decision Trees (e.g., `max_depth`, `min_samples_leaf`) and the ensemble training, such as the number of trees (`n_estimators`). Following code creates the same ensemble as the previous one:

In [33]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

`learning_rate` scales the contribution of each tree. If set to a low value, more trees are needed to fit the training set, but the predictions will usually generalize better. This regularization technique is called *shrinkage*. With a low learning rate, too few trees will underfit the training set, while too many trees can still overfit it.
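The effect of the learning rate on how quickly the ensemble fits the training set can be seen in a small from-scratch sketch. It is not Scikit-Learn's implementation: it uses hand-written depth-1 regression trees on a toy 1D dataset, and the names `fit_stump` and `boosted_training_mse` are made up for illustration.

```python
from statistics import mean

def fit_stump(xs, targets):
    """Least-squares depth-1 regression tree: (threshold, left_value, right_value)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = sum((t - left_value) ** 2 for t in left)
        sse += sum((t - right_value) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]

def boosted_training_mse(xs, ys, n_stages, learning_rate):
    """Training MSE after each boosting stage, each stump fitted to the residuals."""
    predictions = [mean(ys)] * len(xs)   # start from the mean target
    history = []
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        # Shrinkage: only add a fraction of the new stump's prediction
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
        history.append(mean((y - p) ** 2 for y, p in zip(ys, predictions)))
    return history

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [x * x for x in xs]
fast = boosted_training_mse(xs, ys, n_stages=5, learning_rate=1.0)
slow = boosted_training_mse(xs, ys, n_stages=5, learning_rate=0.1)
print([round(e, 2) for e in fast])
print([round(e, 2) for e in slow])
```

With the same five stages, the ensemble with the lower learning rate is still far from fitting the training set, which is why a low `learning_rate` has to be paired with more trees.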

To find the optimal number of trees, use early stopping. A simple way to implement it is the `staged_predict()` method: it returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.). Following code trains a GBRT ensemble with 120 trees, then measures the validation error at each stage of training to find the optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees:

In [35]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred)
         for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

Possible to implement early stopping by actually stopping training early (instead of training a large number of trees first and then looking back to find the optimal number). Set `warm_start=True`, which makes Scikit-Learn keep existing trees when the `fit()` method is called, allowing incremental learning. Following code stops training when the validation error does not improve for five iterations in a row:

In [36]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float('inf')
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break # early stopping

`GradientBoostingRegressor` supports a `subsample` hyperparameter which specifies the fraction of training instances to be used for training each tree. If `subsample=0.25` then each tree is trained on 25% of the training instances, selected randomly. Trades higher bias for lower variance. Speeds up training considerably. This is called *Stochastic Gradient Boosting*. 

Optimized implementation of Gradient Boosting is available in XGBoost library (Extreme Gradient Boosting). XGBoost API is similar to Scikit-Learn's:

In [38]:
import xgboost

xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)

Offers several nice features, like automatically taking care of early stopping

In [40]:
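# Note: newer XGBoost releases expect early_stopping_rounds to be passed to the
# XGBRegressor constructor rather than to fit()
xgb_reg.fit(X_train, y_train,
           eval_set=[(X_val, y_val)], early_stopping_rounds=2)
y_pred = xgb_reg.predict(X_val)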

[0]	validation_0-rmse:0.36056
[1]	validation_0-rmse:0.26662
[2]	validation_0-rmse:0.20457
[3]	validation_0-rmse:0.16400
[4]	validation_0-rmse:0.13751
[5]	validation_0-rmse:0.12609
[6]	validation_0-rmse:0.12056
[7]	validation_0-rmse:0.11795
[8]	validation_0-rmse:0.11622
[9]	validation_0-rmse:0.11596
[10]	validation_0-rmse:0.11455
[11]	validation_0-rmse:0.11458




## Stacking

Instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why not train a model to perform this aggregation? Given a new instance, each predictor makes its own prediction, then the final predictor (called a *blender* or *meta learner*) takes these predictions as inputs and makes the final prediction.

To train a blender, use a hold-out set. First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer. Next, the first layer's predictors are used to make predictions on the second (held-out) set. This ensures that the predictions are "clean", since the predictors never saw these instances during training. For each instance in the hold-out set, there are x predicted values (x being the number of predictors). We can create a new training set using these predicted values as input features and keeping the target values. The blender is trained on this new training set, so it learns to predict the target value given the first layer's predictions.

# Exercises

1. **If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?**
<br>
No; ensemble training requires that the training data be independent of one another, therefore training on the exact same data will not work (hence why we split the training randomly between all the models)
<br>
*It will still work fine as long as the models are all very different*

2. **What is the difference between hard and soft voting classifiers?**
<br>
Hard voting aggregates the classes predicted by the predictors and picks the most chosen. Soft voting averages the probability of each class chosen by the predictors and returns the highest probability (requires `predict_proba()` function in every predictor used)
<br>
*Soft voting gives high confidence votes more weight*

3. **Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, Random Forests, or stacking ensembles?**
<br>
Training can be done across multiple cores/servers when using bagging, pasting, and Random Forests (due to the nature of training several predictors independently and comparing them at the end). However, with boosting and stacking, training each predictor is not independent of one another. Each predictor is created with some output from a predecessor (whether it be error, layers, etc). This means training must be done sequentially and not in parallel.
<br>
*Each predictor in boosting ensemble is built based on the previous predictor (error), so training is necessarily sequential. For stacking, all the predictors in a given layer are independent of each other, so they can be trained in parallel on multiple servers. However, the predictors in one layer can only be trained after the predictors in the previous layer have all been trained.*

4. **What is a benefit of out-of-bag evaluation?**
<br>
It yields a reasonable estimation
<br>
*Evaluated using instances that were not trained on (held out). This makes it possible to have a fairly unbiased evaluation of the ensemble without the need for a validation set*

5. **What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?**
<br>
Only a random subset of features is considered for splitting. They can also use random thresholds per feature instead of finding the best threshold (like a Decision Tree does). This is the longest part of training a Decision Tree so it speeds up training as well. 
<br>
*Randomized thresholds act like an extra step of regularization; Extra-Trees may do better if a Random Forest is overfitting the data*

6. **If your AdaBoost ensemble underfits the training data, which hyperparameters should you tweak and how?**
<br>
Try increasing the number of estimators or reducing the regularization of the base estimator.
<br>
*Also slightly increase the learning rate*

7. **If your Gradient Boosting ensemble overfits the training data, should you increase or decrease the learning rate?** <br>
Decrease the learning rate (a lower rate shrinks each tree's contribution, which regularizes the ensemble).
<br>
*Use early stopping to find the right number of predictors (probably have too many)*

8. **Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g, use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM classifier. Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?**

Load data and split into training, validation, and testing sets

In [71]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

mnist = fetch_openml('mnist_784', version=1)
X, y = mnist["data"], mnist["target"]

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=10000, random_state=42)

# X_train, X_val, X_test = X[:50000], X[50000:60000], X[60000:]
# y_train, y_val, y_test = y[:50000], y[50000:60000], y[60000:]

Training various classifiers (Random Forest, Extra-Trees, SVM)

In [None]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
ext_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_clf = LinearSVC(random_state=42)
#mlp_clf = MLPClassifier(random_state=42)

In [None]:
estimators = [rnd_clf, ext_clf, svm_clf]
for estimator in estimators:
    print('Training', estimator)
    estimator.fit(X_train, y_train)

In [None]:
[estimator.score(X_val, y_val) for estimator in estimators]

**Voting classifier**

In [None]:
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
    estimators=[('rf', rnd_clf), ('et', ext_clf)],
    voting='soft'
)
voting_clf.fit(X_train, y_train)

In [None]:
voting_clf.score(X_val, y_val)

Accuracy on test set

In [None]:
[estimator.score(X_test, y_test) for estimator in estimators] 

In [None]:
voting_clf.score(X_test, y_test)

9. **Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. Train a classifier on this new training set. Congratulations, you have just trained a blender, and together with the classifiers it forms a stacking ensemble! Now evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. How does it compare to the voting classifier you trained earlier?**

Create new training set

In [None]:
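rnd_pred = rnd_clf.predict(X_val)
ext_pred = ext_clf.predict(X_val)
svm_pred = svm_clf.predict(X_val)
#mlp_pred = mlp_clf.predict(X_val)

In [None]:
# One row per validation image, one column per first-layer classifier
# (int() makes the features numeric even if the labels are digit strings)
X_new = []
for i, j, k in zip(rnd_pred, ext_pred, svm_pred):
    X_new.append([int(i), int(j), int(k)])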

Train new classifier on training set

In [None]:
blender = RandomForestClassifier()
blender.fit(X_new, y_val)

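Make predictions with stacking ensemble (first get each classifier's predictions on the test set)

In [None]:
# Test-set predictions from the three first-layer classifiers
y1_pred, y2_pred, y3_pred = (clf.predict(X_test) for clf in (rnd_clf, ext_clf, svm_clf))

blend_set = []
for i in range(len(X_test)):
    r, e, s = y1_pred[i], y2_pred[i], y3_pred[i]
    blend_set.append([int(r), int(e), int(s)])

blend_predict = blender.predict(blend_set)
accuracy_score(y_test, blend_predict)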