# Ensemble
[![Run in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bobmh43/handson_ml/blob/master/notebooks/ch7_ensemble.ipynb)

# Questions at the end of the chapter

*1. If you have trained five different models on the exact same data, and they all achieved 95% accuracy, can you combine them to get better results?*\
Yes. We can aggregate their predictions through hard or soft voting. Provided the models are sufficiently different from one another (different algorithms or training procedures), their errors are only weakly correlated, and the ensemble benefits from the law of large numbers: as the number of independent votes grows, the sample mean converges to its expectation, so the majority vote is right more often than any single model.
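
As a rough illustration of the argument above, here is a minimal simulation sketch. It assumes a two-class problem and five models whose errors are fully independent, which real models trained on the same data never are, so the gain it shows is an upper bound rather than what to expect in practice. All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

n_models = 5            # number of classifiers in the ensemble
p_correct = 0.95        # accuracy of each individual model
n_instances = 100_000   # number of simulated instances

# correct[i, j] is True when model j classifies instance i correctly.
# ASSUMPTION: the models err independently of one another.
correct = rng.random((n_instances, n_models)) < p_correct

# In a two-class problem, the majority vote is right exactly when
# more than half of the models are right.
majority_correct = correct.sum(axis=1) > n_models / 2

individual_accuracy = correct.mean(axis=0)
ensemble_accuracy = majority_correct.mean()

print("individual accuracies:", individual_accuracy.round(4))
print("majority-vote accuracy:", round(ensemble_accuracy, 4))
```

With correlated errors the improvement shrinks, which is why diversity among the models matters.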

*2. What is the difference between hard and soft voting classifiers?*\
Hard voting takes the mode of the classes predicted by the various classifiers. Soft voting averages the class probabilities predicted by the various classifiers and then takes the argmax. Soft voting is generally more accurate because it weights each vote by its confidence. However, it requires every classifier to estimate class probabilities (a `predict_proba` method), and it works best when those probabilities are well calibrated.
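
The difference is easiest to see on a single instance where the two rules disagree. The sketch below uses made-up probabilities from three hypothetical classifiers; it only illustrates the two aggregation rules and is not how `VotingClassifier` is implemented internally.

```python
import numpy as np

# Hypothetical class probabilities for ONE instance from three classifiers
# (rows = classifiers, columns = classes 0 and 1).
probas = np.array([
    [0.55, 0.45],   # weakly prefers class 0
    [0.60, 0.40],   # weakly prefers class 0
    [0.05, 0.95],   # strongly prefers class 1
])

# Hard voting: each classifier casts one vote; the most frequent class wins.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities first, then take the argmax.
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("votes:", votes)
print("hard voting predicts class", hard_prediction)
print("mean probabilities:", mean_proba)
print("soft voting predicts class", soft_prediction)
```

Two lukewarm votes for class 0 win the hard vote, while the single confident vote for class 1 wins the soft vote.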

*3. Is it possible to train these in parallel: bagging, pasting, boosting, random forests and stacking ensembles?*\
Bagging, pasting and random forests can be trained in parallel, as the estimators do not depend on each other. Boosting must be trained sequentially, as each estimator depends upon its predecessor. For a stacking ensemble, the estimators in the first layer can be trained in parallel, but the blender must be trained after.

*4. What is the benefit of out-of-bag evaluation?*\
Compared to holding out a separate validation set, it allows the ensemble to be trained on a larger dataset. Compared to cross-validation, it is much less computationally intensive, since the ensemble is trained only once, and it still gives a reasonable estimate of the generalization error.
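
A minimal sketch of out-of-bag evaluation follows, assuming a small synthetic dataset from `make_classification` as a stand-in for real data. The held-out test split is there only to compare against the OOB estimate; OOB evaluation itself needs no such split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption made for this illustration).
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# With bootstrap=True each tree sees only a bootstrap sample of the training
# set; oob_score=True evaluates every instance using only the trees that did
# not see it during training.
bag = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                        n_estimators=100, bootstrap=True,
                        oob_score=True, random_state=42)
bag.fit(X_tr, y_tr)

oob_accuracy = bag.oob_score_
test_accuracy = bag.score(X_te, y_te)
print("OOB accuracy: ", oob_accuracy)
print("test accuracy:", test_accuracy)
```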

*5. What makes Extra-Trees more random? How does this added randomness help? Are they faster to train?*\
During training, at each node of each tree, a random threshold is drawn for every candidate feature (the candidates being a random subset of all the features) instead of searching for the best threshold. The best of these random splits is still the one selected. Since finding the optimal threshold is the most expensive part of growing a tree, training is quicker. The added randomness acts as regularization: it increases the bias of each tree in exchange for a lower variance in the ensemble.

*6. If your AdaBoost ensemble underfits the training data, what hyperparameters should you tweak and how?*\
First, we note that each component estimator in a boosting ensemble is meant to be a weak, highly regularized learner; boosting makes the ensemble fit the data better by adding estimators. The more estimators there are in an AdaBoost ensemble, the better it is going to fit the data. Thus, we should increase `n_estimators` (it is an inverse-regularization hyperparameter). We can also reduce the regularization of the base estimator, or slightly increase the learning rate.

*7. If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?*\
Again, we think about the concept of boosting. The higher the learning rate, the more strongly each new estimator's correction is applied, and the more closely the ensemble fits the training data. So the learning rate should be decreased (it is also an inverse-regularization hyperparameter).

# Setup

In [None]:
import numpy as np
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import ExtraTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier, StackingClassifier
# note, a single extra tree can be imported from sklearn.tree.ExtraTreeClassifier (no s)

# Exercise VotingClassifier

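Note: the VotingClassifier clones each of the component classifiers provided to it and fits the clones, which it stores in `estimators_`. The clones are trained on the class indices (the labels encoded by the ensemble's `LabelEncoder`, available as `le_`) and not on the class labels themselves.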

A note on `set_params()`: For simple estimators, `estimator.set_params(param=sth)` is the same as `estimator.param = sth`. For nested objects (pipelines, meta-estimators, etc.), `set_params` also reaches the components and their parameters, which cannot be reached by plain attribute assignment because the components are not individual attributes of the nested object. For example, `obj.set_params(component="drop")` drops a component (in ensembles such as VotingClassifier), and `obj.set_params(component__param=sth)` sets one of its parameters.
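
The following sketch shows these forms of `set_params` on a small nested object. The step names `forest` and `svc`, and the hyperparameter values, are chosen only for this illustration; nothing is fitted.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# make_pipeline names each step after its lowercased class name,
# so the LinearSVC step is called "linearsvc".
svc_pipe = make_pipeline(StandardScaler(), LinearSVC())
svc_pipe.set_params(linearsvc__C=0.5)

# Illustrative names "forest" and "svc" for the two components.
demo_voter = VotingClassifier([("forest", RandomForestClassifier()),
                               ("svc", svc_pipe)])

# One level of nesting: <component>__<param>
demo_voter.set_params(forest__n_estimators=50)

# Two levels of nesting: <component>__<step>__<param>
demo_voter.set_params(svc__linearsvc__max_iter=500)

# Replace a whole component by "drop" so the ensemble ignores it when fitted.
demo_voter.set_params(svc="drop")

print(demo_voter.estimators)
```

Note that this changes the unfitted `estimators` parameter, so it takes effect at the next call to `fit`.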

In [None]:
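# load mnist and split train-val-test 50k-10k-10k
# (import the submodule explicitly: `import sklearn` alone does not guarantee sklearn.datasets is loaded)
from sklearn.datasets import fetch_openml
X, y = fetch_openml('mnist_784', return_X_y=True, as_frame=False,
                    parser='auto')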

In [None]:
X_train, y_train = X[:50000], y[:50000]
X_val, y_val = X[50000:60000], y[50000:60000]
X_test, y_test = X[60000:], y[60000:]

In [None]:
# train a random forest, extra-trees, an svm, and an MLP
clf_list = [RandomForestClassifier(n_estimators=100, random_state=42),
            ExtraTreesClassifier(n_estimators=100, random_state=42),
            make_pipeline(StandardScaler(),
                          LinearSVC(max_iter=100, tol=20, dual=True, random_state=42)),
            MLPClassifier(random_state=42)]
score_list = [clf.fit(X_train, y_train).score(X_val, y_val) for clf in clf_list]

In [None]:
# test their accuracy on the val set
for score, clf in zip(score_list, clf_list):
    print("val accuracy of " + clf.__class__.__name__ + ": ", score)

val accuracy of RandomForestClassifier:  0.9736
val accuracy of ExtraTreesClassifier:  0.9743
val accuracy of Pipeline:  0.8691
val accuracy of MLPClassifier:  0.9613


In [None]:
# combine them via voting
voter = VotingClassifier(list(zip(map(lambda c: c.__class__.__name__, clf_list), clf_list)), voting="hard", n_jobs=-1)
voter.fit(X_train, y_train)
voter.score(X_val, y_val)

0.9738

In [None]:
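# remove the svc (poor performance)
# this edits the fitted ensemble in place: the fitted clone is removed from both
# named_estimators_ and estimators_, so no retraining is needed
tmp = voter.named_estimators_.pop("Pipeline")
voter.estimators_.remove(tmp)
voter.score(X_val, y_val)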

0.9761

In [None]:
# test on the test set, compare to the individuals
# it is fine to compare the test errors as we have already made the decision to use the voting ensemble.
voter_score = voter.score(X_test, y_test)
# the fitted clones were trained on encoded labels, so encode y_test before scoring them
y_test_encoded = voter.le_.transform(y_test)
component_scores = [clf.score(X_test, y_test_encoded) for clf in voter.estimators_]
print("VotingClassifier's test accuracy:", voter_score)
for clf, score in zip(voter.estimators_, component_scores):
    print("test accuracy of " + clf.__class__.__name__ + ": ", score)
# relative reduction in error rate, compared with the mean error rate of the components
print("The error rate decreased by ", round(100 * (1 - (1 - voter_score) / (1 - np.mean(component_scores))), 1), "%")

VotingClassifier's test accuracy: 0.9733
test accuracy of RandomForestClassifier:  0.968
test accuracy of ExtraTreesClassifier:  0.9703
test accuracy of MLPClassifier:  0.9618
The error rate decreased by  19.8 %


# Continued Exercise: Stacking

Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. Train a classifier on this new training set.

In [None]:
# map the validation inputs
X_val_predictions = np.stack([clf.predict(X_val) for clf in clf_list], axis=1)

# create the blender
blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=43)
blender.fit(X_val_predictions, y_val)
blender.oob_score_

0.9729

Now let's evaluate the stacking ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. Unfortunately, it performs worse than the voting classifier.

In [None]:
blender.score(np.stack([clf.predict(X_test) for clf in clf_list], axis=1), y_test)

0.9677

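Now we try again using a StackingClassifier instead. We get better performance because it uses K-fold cross-validation, which lets both the blender and the component classifiers learn from all the data instead of separate held-out splits. It generates out-of-fold predictions for each training instance from each component classifier. With the default `stack_method="auto"`, these predictions are class probabilities (or decision-function scores) whenever the classifier provides them, so here X_predictions has shape (n_instances, n_estimators * n_classes) rather than (n_instances, n_estimators), and the blender is trained on it. The component predictors are also trained on the entire dataset. The total number of trainings is n_cv * n_estimators + n_estimators + 1.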
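
To make the procedure concrete, here is a hand-rolled sketch of the same idea using `cross_val_predict`. It is a simplified illustration under stated assumptions, not a reproduction of `StackingClassifier`'s internals: the small `load_digits` dataset stands in for MNIST, only two base classifiers are used, and a `LogisticRegression` blender is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Small stand-in dataset (assumption: 8x8 digits instead of MNIST).
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, random_state=42)

base_clfs = [RandomForestClassifier(n_estimators=50, random_state=42),
             ExtraTreesClassifier(n_estimators=50, random_state=42)]

# 1. Out-of-fold class probabilities: each row is predicted by a model that
#    never saw that row during training (5 fits per base classifier).
oof = np.hstack([cross_val_predict(clf, X_tr, y_tr, cv=5,
                                   method="predict_proba")
                 for clf in base_clfs])

# 2. Train the blender on the out-of-fold predictions (1 fit).
demo_blender = LogisticRegression(max_iter=1000).fit(oof, y_tr)

# 3. Refit every base classifier on the full training set (1 fit each).
for clf in base_clfs:
    clf.fit(X_tr, y_tr)

# 4. At prediction time, feed the base classifiers' outputs to the blender.
test_features = np.hstack([clf.predict_proba(X_te) for clf in base_clfs])
stack_accuracy = demo_blender.score(test_features, y_te)

print("blender input shape:", oof.shape)   # (n_instances, n_estimators * n_classes)
print("stacking test accuracy:", stack_accuracy)
```

The number of fits here is 5 * 2 + 1 + 2 = 13, matching the count n_cv * n_estimators + n_estimators + 1.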

In [None]:
stack_clf = StackingClassifier(list(zip(map(lambda c: c.__class__.__name__, clf_list), clf_list)),
                             final_estimator=blender,
                             cv=5,
                             n_jobs=-1)

In [None]:
X_big = np.vstack((X_train, X_val))
y_big = np.hstack((y_train, y_val))

In [None]:
stack_clf.fit(X_big, y_big)



In [None]:
stack_clf.score(X_test, y_test)

0.9783