In [1]:
# Ensemble Learning is where the aggregate of the predictions from an ensemble of predictors (such as classifiers or
# regressors) is used to get a better prediction than the output of the best individual predictor. An example is where a
# group of Decision Tree classifiers are each trained on a different random subset of the training set - predictions are
# made by collecting the predictions of all the individual trees, then predicting the class that gets the most votes.
# This ensemble is called a Random Forest, which, while simple, is one of the most powerful ML algorithms available.

# Ensemble methods are often used near the end of a project - once you have a few good predictors they can be combined into
# a superior one (used in the winning solution to the Netflix Prize).
# The most popular Ensemble methods include bagging, boosting, and stacking, as well as Random Forests.

In [2]:
# Voting Classifiers:
# Suppose there are a few trained classifiers (say a Logistic Regression classifier, SVM, Random Forest, and KNN) each with
# around 80% accuracy - a simple way to create a better classifier is to aggregate the predictions of each and predict the
# class that gets the most votes. This majority-vote classifier is called a hard voting classifier. This voting classifier
# often achieves higher accuracy than the best classifier in the ensemble. Even if each classifier is a weak learner (they
# only do slightly better than random guessing), the ensemble can be a strong learner provided there are a sufficient
# number of weak learners and they are sufficiently diverse.

# How is this possible? The following analogy can help shed some light on this mystery. Suppose you have a slightly biased
# coin that has a 51% chance of coming up heads and 49% chance of coming up tails. If you toss it 1,000 times, you will
# generally get more or less 510 heads and 490 tails, and hence a majority of heads. If you do the math, you will find that
# the probability of obtaining a majority of heads after 1,000 tosses is close to 75%. The more you toss the coin, the
# higher the probability (e.g., with 10,000 tosses, the probability climbs over 97%). This is due to the law of large
# numbers: as you keep tossing the coin, the ratio of heads gets closer and closer to the probability of heads (51%).
# Similarly, suppose you build an ensemble containing 1,000 classifiers that are individually correct only 51% of the time
# (barely better than random guessing). If you predict the majority voted class, you can hope for up to 75% accuracy!
# However, this is only true if all classifiers are perfectly independent, making uncorrelated errors, which is clearly
# not the case because they are trained on the same data. They are likely to make the same types of errors, so there
# will be many majority votes for the wrong class, reducing the ensemble’s accuracy.

# Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse
# classifiers is to train them using very different algorithms. This increases the chance that they will make very different
# types of errors, improving the ensemble’s accuracy.
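
# The coin-toss figures above can be checked numerically. The sketch below is only an illustration: the helper name
# prob_majority_heads is made up for these notes, and it assumes "majority" means strictly more than half of the tosses.
# It sums the binomial probabilities in log space to avoid overflow/underflow for large numbers of tosses.

```python
import math


def prob_majority_heads(n_tosses, p_heads=0.51):
    """Probability of strictly more than n_tosses / 2 heads (hypothetical helper)."""
    log_p, log_q = math.log(p_heads), math.log(1.0 - p_heads)
    total = 0.0
    for k in range(n_tosses // 2 + 1, n_tosses + 1):
        # log of the binomial pmf: log C(n, k) + k*log(p) + (n - k)*log(1 - p)
        log_pmf = (math.lgamma(n_tosses + 1) - math.lgamma(k + 1)
                   - math.lgamma(n_tosses - k + 1)
                   + k * log_p + (n_tosses - k) * log_q)
        total += math.exp(log_pmf)
    return total


p_1000 = prob_majority_heads(1_000)
p_10000 = prob_majority_heads(10_000)
print(p_1000, p_10000)
```

# The exact value for 1,000 tosses depends slightly on whether a 500/500 tie counts as a majority; either way it lands in
# the low-to-mid 70% range, and it keeps growing as the number of tosses increases.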

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_clf = LogisticRegression(solver='lbfgs', random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912


In [14]:
# If all classifiers are able to estimate class probabilities (i.e., they all have a predict_proba() method), then you can
# tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers.
# This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly
# confident votes. All you need to do is replace voting="hard" with voting="soft" and ensure that all classifiers can
# estimate class probabilities. This is not the case for the SVC class by default, so you need to set its probability
# hyperparameter to True (this will make the SVC class use cross-validation to estimate class probabilities, slowing down
# training, and it will add a predict_proba() method).

svm_clf = SVC(probability=True, random_state=42)
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.92


In [15]:
# As discussed, one way to get a diverse set of classifiers is to use very different training algorithms. Another approach
# is to use the same training algorithm for every predictor and train them on different random subsets of the training set.
# When this sampling is performed with replacement, the method is called bagging (short for bootstrap aggregating, where
# bootstrapping in statistics is just resampling with replacement). When sampling is performed without replacement, it is
# called pasting. Bagging and pasting both allow training instances to be sampled several times across multiple predictors,
# but only bagging allows training instances to be sampled several times for the same predictor. Note that both the training
# of the predictors and the predictions themselves can be performed in parallel, via different CPU cores or servers, so
# bagging and pasting scale very well.

# Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the
# predictions of all predictors. The aggregation function is typically the statistical mode (i.e. the most frequent
# prediction, just like a hard voting classifier) for classification, or the average for regression. Each individual
# predictor has a higher bias than if it were trained on the original training set (since training on a reduced dataset
# means the hypothesis set is likely to be further away from the optimal hypothesis), but aggregation reduces both bias
# and variance. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single
# predictor trained on the original training set.
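
# To make the procedure concrete, here is a minimal hand-rolled bagging sketch (not how Scikit-Learn implements it): each
# tree is trained on a bootstrap sample of the training set, and predictions are aggregated by majority vote. The variable
# names, the choice of 50 trees and the tie-breaking rule are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=42)

rng = np.random.default_rng(42)
trees = []
for _ in range(50):
    # Bagging: draw len(Xtr) indices WITH replacement (a bootstrap sample).
    # For pasting, use rng.choice(len(Xtr), size=..., replace=False) instead.
    idx = rng.integers(0, len(Xtr), size=len(Xtr))
    trees.append(DecisionTreeClassifier(random_state=0).fit(Xtr[idx], ytr[idx]))

# One row of predictions per tree, shape (n_trees, n_test_instances)
votes = np.stack([tree.predict(Xte) for tree in trees])

# Statistical mode for two classes: predict 1 when more than half the trees vote 1
# (an exact tie falls back to class 0 in this sketch)
majority = (votes.mean(axis=0) > 0.5).astype(int)
manual_bagging_accuracy = (majority == yte).mean()
print(manual_bagging_accuracy)
```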

In [23]:
# Scikit-Learn offers a simple API for both bagging and pasting with the BaggingClassifier class (or BaggingRegressor for
# regression). The following code trains an ensemble of 500 Decision Tree classifiers: each is trained on 100 training
# instances randomly sampled from the training set with replacement (for pasting, set bootstrap=False). The n_jobs parameter
# specifies the number of CPU cores to use for training and predictions (-1 means use all available cores).
# N.B. max_samples can alternatively be a float between 0.0 and 1.0, in which case the max number of training instances to
# sample is equal to the size of the training set * max_samples.
# Also, BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate
# probabilities (i.e. it has a predict_proba() method), which is the case with Decision Tree classifiers.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)

for clf in (tree_clf, bag_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

DecisionTreeClassifier 0.856
BaggingClassifier 0.904


In [17]:
# The decision boundary of the bagging ensemble of 500 trees is much smoother and more reasonable than that of the single
# Decision Tree and will likely generalize better: the ensemble has comparable bias but smaller variance (it makes around
# the same number of errors on the training set, but the decision boundary is less irregular).
# Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a
# slightly higher bias than pasting; but the extra diversity also means that the predictors end up being less correlated,
# so the ensemble’s variance is reduced. Overall, bagging often results in better models, which explains why it is generally
# preferred. However, if you have spare time and CPU power, you can use cross-validation to evaluate both bagging and
# pasting and select the one that works best.

In [36]:
# With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all.
# By default a BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the
# entire training set. This means that, due to the replacement, only about 63% of the training instances are sampled on
# average for each predictor (as m, the sample size, grows, the ratio approaches 1 - exp(-1) = 63.2%). The remaining 37% of
# the training instances that are not sampled are called out-of-bag (oob) instances. Note that they are not the same 37%
# for all predictors.
# Open question: how does oob evaluation work when max_samples, the number of training instances to draw, is less than the
# size of the dataset? Say max_samples=100: each predictor is trained on 100 instances picked from X with replacement. As
# far as I can tell, the oob instances for a predictor are then simply all the training instances that were not among its
# 100 draws, so each predictor just has a larger oob set.
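
# The 63% / 37% split is easy to check empirically. This is a small simulation sketch (the sample size of 10,000 and the
# variable names are arbitrary choices): draw m indices with replacement and count how many distinct instances were hit.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000                                   # size of the (simulated) training set
sample = rng.integers(0, m, size=m)          # bootstrap sample: m draws with replacement

frac_sampled = np.unique(sample).size / m    # fraction of instances drawn at least once
frac_oob = 1.0 - frac_sampled                # fraction never drawn (out-of-bag)
limit = 1.0 - np.exp(-1.0)                   # theoretical limit as m grows

print(frac_sampled, frac_oob, limit)
```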


# Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need
# for a separate validation set. You can evaluate the ensemble itself by averaging out the oob evaluations of each
# predictor. In Scikit-Learn, you can set oob_score=True when creating a BaggingClassifier to request an automatic oob
# evaluation after training.

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True, random_state=40)

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.9013333333333333

In [37]:
# So the oob evaluation suggests that the accuracy on the test set should be around 90.1%
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.912

In [42]:
# The oob_decision_function_ variable (since the base estimator has a predict_proba() method) gives the oob class
# probabilities for each training instance. The first training instance has an estimated 31.75% probability of being in the
# negative class and 68.25% of being in the positive class.
bag_clf.oob_decision_function_[:10], len(
    bag_clf.oob_decision_function_), len(X_train), len(X_test)

(array([[0.31746032, 0.68253968],
        [0.34117647, 0.65882353],
        [1.        , 0.        ],
        [0.        , 1.        ],
        [0.        , 1.        ],
        [0.08379888, 0.91620112],
        [0.31693989, 0.68306011],
        [0.02923977, 0.97076023],
        [0.97687861, 0.02312139],
        [0.97765363, 0.02234637]]),
 375,
 375,
 125)

In [43]:
# The BaggingClassifier also supports sampling of the features, controlled by the hyperparameters max_features and
# bootstrap_features (which work the same way as max_samples and bootstrap) - so each predictor will be trained on a random
# subset of the input features. This is particularly useful when dealing with high-dimensional inputs (such as images).
# Sampling the training instances and features is called the Random Patches method, while keeping all training instances
# (setting bootstrap=False and max_samples=1.0) but sampling features is called the Random Subspaces method.
# Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
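
# A sketch of both configurations follows. The synthetic 20-feature dataset, the 50% feature fraction, the 70% instance
# fraction and the variable names are assumptions chosen only to make the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_hd, y_hd = make_classification(n_samples=300, n_features=20,
                                 n_informative=5, random_state=42)

# Random Subspaces: keep every training instance, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5, random_state=42)
subspaces_clf.fit(X_hd, y_hd)

# Random Patches: sample both the training instances and the features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5, random_state=42)
patches_clf.fit(X_hd, y_hd)

# estimators_features_ holds the feature indices drawn for each predictor
n_features_per_tree = [len(f) for f in subspaces_clf.estimators_features_]
print(n_features_per_tree[:5])
```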

In [53]:
# Random Forests are simply ensembles of Decision Trees, generally trained via the bagging method (or sometimes pasting),
# typically with max_samples set to the size of the training set. Instead of using a BaggingClassifier and passing it a
# DecisionTreeClassifier, there is a convenient and optimized RandomForestClassifier (and RandomForestRegressor) - note
# that the BaggingClassifier is still useful if you want to bag something other than Decision Trees. With a few exceptions,
# RandomForestClassifier has all the hyperparameters of DecisionTreeClassifier (to control how the trees are grown) and of
# BaggingClassifier (to control the ensemble itself).

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500,
                                 max_leaf_nodes=16, n_jobs=-1,
                                 random_state=42)
rnd_clf.fit(X_train, y_train)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(
        splitter='random', max_leaf_nodes=16, random_state=42),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)

for clf in (rnd_clf, bag_clf):
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

RandomForestClassifier 0.912
BaggingClassifier 0.912


In [54]:
# When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for
# splitting (as discussed earlier). It is possible to make trees even more random by also using random thresholds for
# each feature rather than searching for the best possible thresholds (like regular Decision Trees do). A forest of such
# extremely random trees is called an Extremely Randomized Trees ensemble (or Extra-Trees for short). Once again, this
# technique trades more bias for a lower variance. It also makes Extra-Trees much faster to train than regular Random
# Forests, because finding the best possible threshold for each feature at every node is one of the most time-consuming
# tasks of growing a tree. You can create an Extra-Trees classifier using Scikit-Learn’s ExtraTreesClassifier class.
# Its API is identical to the RandomForestClassifier class. Similarly, the ExtraTreesRegressor class has the same API
# as the RandomForestRegressor class.
# It is hard to tell in advance whether a RandomForestClassifier will perform better or worse than an ExtraTreesClassifier.
# Generally, the only way to know is to try both and compare them using cross-validation (tuning the hyperparameters using
# grid search).
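
# A minimal sketch of such a comparison, using 5-fold cross-validation on the moons data with default hyperparameters
# (no grid search here; the dataset, the fold count and the variable names are illustrative assumptions):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_cmp, y_cmp = make_moons(n_samples=500, noise=0.30, random_state=42)

cv_scores = {}
for model in (RandomForestClassifier(n_estimators=100, random_state=42),
              ExtraTreesClassifier(n_estimators=100, random_state=42)):
    # Mean accuracy over the 5 validation folds
    scores = cross_val_score(model, X_cmp, y_cmp, cv=5)
    cv_scores[model.__class__.__name__] = scores.mean()

for name, score in cv_scores.items():
    print(name, score)
```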

In [60]:
# Another quality of Random Forests is that they make it easy to measure the relative importance of each feature.
# Scikit-Learn measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity
# on average (across all trees in the forest). More precisely, it is a weighted average, where each node’s weight is equal
# to the number of training samples that are associated with it (see Chapter 6). Scikit-Learn computes this score
# automatically for each feature after training, then it scales the results so that the sum of all importances is equal
# to 1. You can access the result using the feature_importances_ variable. For example, the following code trains a
# RandomForestClassifier on the iris dataset and outputs each feature’s importance. It seems that the most important 
# features are the petal length (44%) and width (42%), while sepal length and width are rather unimportant in comparison 
# (11% and 2%, respectively):

from sklearn.datasets import load_iris
iris = load_iris()

X, y = iris['data'], iris['target']

rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(X, y)

for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)
    
# Random Forests are very handy to get a quick understanding of what features actually matter, in particular if you need
# to perform feature selection.

sepal length (cm) 0.11249225099876375
sepal width (cm) 0.02311928828251033
petal length (cm) 0.4410304643639577
petal width (cm) 0.4233579963547682


In [61]:
# Boosting (or hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong one
# (high accuracy). The general idea is to train predictors sequentially, each trying to correct its predecessor. The most
# popular boosting methods are AdaBoost (Adaptive Boosting) and Gradient Boosting.

# AdaBoost forces the next predictor in the sequence to focus more on the training instances that the predecessor
# underfitted - this results in new predictors focusing more and more on the hard cases. For example, the algorithm first
# trains a base classifier (such as a Decision Tree) and uses it to make predictions on the training set. The algorithm
# then increases the relative weight of misclassified training instances. Then it trains a second classifier, using the
# updated weights, and makes predictions on the training set, updates the weights, and so on. Note that SVMs are generally
# not good base predictors for AdaBoost since they are slow and tend to be unstable with it.

# The base predictor generally gets many instances wrong, so their weights get boosted. The second classifier therefore
# does a better job on these instances, and so on. AdaBoost has an associated learning rate (if this is halved, the
# misclassified instance weights are boosted half as much at every iteration). This sequential learning technique has some
# similarities with Gradient Descent, except that instead of tweaking a single predictor’s parameters to minimize a cost
# function, AdaBoost adds predictors to the ensemble, gradually making it better. Once all the predictors are trained,
# the ensemble makes predictions similar to bagging/pasting, except that predictors have different weights depending on
# their overall accuracy on the weighted training set.

# Note that this sequential learning technique cannot be parallelized (fully), since each predictor can only be trained
# after the last one has been trained and evaluated. Hence, it does not scale as well as bagging/pasting.

# See book for AdaBoost algorithm.

# Scikit-Learn uses a multiclass version of AdaBoost called SAMME (Stagewise Additive Modeling using a Multiclass
# Exponential loss function). When there are just two classes, SAMME is equivalent to AdaBoost. If the predictors can
# estimate class probabilities (i.e., if they have a predict_proba() method), Scikit-Learn can use a variant of SAMME
# called SAMME.R (the R stands for “Real”), which relies on class probabilities rather than predictions and generally
# performs better.
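
# Since the algorithm itself is only referenced above, here is a rough sketch of the binary (discrete) AdaBoost loop as it
# is commonly presented: compute the weighted error rate r of each stump, give it the weight
# alpha = learning_rate * log((1 - r) / r), multiply the weights of misclassified instances by exp(alpha), then normalize.
# This is NOT Scikit-Learn's implementation; the variable names, the 50 stumps and the clipping of r are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_ada, y_ada = make_moons(n_samples=500, noise=0.30, random_state=42)
m = len(X_ada)
weights = np.full(m, 1.0 / m)      # every instance starts with the same weight
learning_rate = 0.5
stumps, alphas = [], []

for _ in range(50):
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_ada, y_ada, sample_weight=weights)
    wrong = stump.predict(X_ada) != y_ada

    # Weighted error rate of this predictor (clipped to avoid log(0) or division by zero)
    r = np.clip(weights[wrong].sum() / weights.sum(), 1e-10, 1 - 1e-10)
    alpha = learning_rate * np.log((1 - r) / r)   # predictor weight

    weights[wrong] *= np.exp(alpha)               # boost the misclassified instances
    weights /= weights.sum()                      # normalize

    stumps.append(stump)
    alphas.append(alpha)

# Weighted vote: each stump votes +alpha for class 1 and -alpha for class 0
score = sum(a * np.where(s.predict(X_ada) == 1, 1.0, -1.0)
            for s, a in zip(stumps, alphas))
ada_pred = (score > 0).astype(int)
ada_train_accuracy = (ada_pred == y_ada).mean()
print(ada_train_accuracy)
```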

In [68]:
# The following code trains an AdaBoost classifier based on 200 Decision Stumps using Scikit-Learn’s AdaBoostClassifier
# class (there is also an AdaBoostRegressor class). A Decision Stump is a Decision Tree with max_depth=1 - in other words,
# a tree composed of a single decision node plus two leaf nodes. This is the default base estimator for the
# AdaBoostClassifier class:

from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1, random_state=42), n_estimators=200,
    algorithm='SAMME.R', learning_rate=0.5)
ada_clf.fit(X_train, y_train)

# If the AdaBoost ensemble overfits the training set (fits the noise), try reducing the number of estimators or
# regularizing the base estimator more strongly.

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                         class_weight=None,
                                                         criterion='gini',
                                                         max_depth=1,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort='deprecated',
                          

In [94]:
# Gradient Boosting, like AdaBoost, sequentially adds predictors to an ensemble, each correcting its predecessor; however,
# instead of tweaking the instance weights at every iteration like AdaBoost, this method tries to fit the new predictor 
# to the residual errors made by the previous predictor.
# For an example, here is Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT) - simply a regression task
# using Decision Trees as the base predictors. 

from sklearn.tree import DecisionTreeRegressor
import numpy as np

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

# Now train a second DecisionTreeRegressor on the residual errors made by the first
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X, y2)

# and a third, on the residual errors made by the second...
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X, y3)

# Now we have an ensemble of three trees - it can make predictions on a new instance by summing the predictions of each tree
X_new = [[0.5]]
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
y_pred

array([0.75026781])

In [95]:
# A simpler way to train GBRT ensembles is to use Scikit-Learn’s GradientBoostingRegressor class. Much like the
# RandomForestRegressor class, it has hyperparameters to control the growth of Decision Trees (e.g., max_depth,
# min_samples_leaf), as well as hyperparameters to control the ensemble training, like the number of trees (n_estimators).

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(
    max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)
gbrt.fit(X, y)
gbrt.predict([[0.5]])

array([0.75026781])

In [105]:
# The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.1, you
# will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is
# a regularization technique called shrinkage. Note that with a small learning rate, if n_estimators is too low then
# the GBRT will underfit the training data, while if n_estimators is too high it will overfit.
# To find the optimal number of trees, you can use early stopping. A simple way to implement this is to use the
# staged_predict() method: it returns an iterator over the predictions made by the ensemble at each stage of training
# (with one tree, two trees, etc.).

# The following code trains a GBRT ensemble with 120 trees, then measures the validation error at each stage of training to
# find the optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=49)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, random_state=42)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1

print(bst_n_estimators)

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators, random_state=42)
gbrt_best.fit(X_train, y_train)

56


GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=2,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=56,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=42, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [108]:
# It is also possible to implement early stopping by actually stopping training early (instead of training a large number
# of trees first and then looking back to find the optimal number). You can do so by setting warm_start=True, which makes
# Scikit-Learn keep existing trees when the fit() method is called, allowing incremental training. The following code
# stops training when the validation error does not improve for five iterations in a row:

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break   # early stopping
gbrt.n_estimators_

61
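
# The GradientBoostingRegressor repr shown earlier also lists n_iter_no_change and validation_fraction hyperparameters.
# As a sketch (the dataset, the 20% validation fraction and the cap of 500 trees are assumptions), these let the estimator
# hold out part of the training data internally and stop once the validation score has not improved for a number of
# iterations, without writing the loop by hand:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(42)
X_es = rng.rand(200, 1) - 0.5
y_es = 3 * X_es[:, 0] ** 2 + 0.05 * rng.randn(200)

gbrt_es = GradientBoostingRegressor(
    max_depth=2, n_estimators=500,        # upper bound on the number of trees
    n_iter_no_change=5,                   # stop after 5 iterations without improvement
    validation_fraction=0.2,              # share of the training data held out internally
    random_state=42)
gbrt_es.fit(X_es, y_es)

n_trees_used = gbrt_es.n_estimators_      # number of trees actually fitted
print(n_trees_used)
```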

In [112]:
# The GradientBoostingRegressor class also supports a subsample hyperparameter, which specifies the fraction of training
# instances to be used for training each tree. For example, if subsample=0.25, then each tree is trained on 25% of the
# training instances, selected randomly. As you can probably guess by now, this technique trades a higher bias for a
# lower variance. It also speeds up training considerably. This is called Stochastic Gradient Boosting.
# It is also possible to use Gradient Boosting with other cost functions, controlled by the loss hyperparameter (default
# is least-squares, see docs for more info).
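
# A short sketch of Stochastic Gradient Boosting, comparing subsample=1.0 with subsample=0.25 on the same kind of noisy
# quadratic data used above. The dataset size, the number of trees and the variable names are illustrative assumptions,
# and which setting gives the lower validation error will depend on the data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X_sgb = rng.rand(200, 1) - 0.5
y_sgb = 3 * X_sgb[:, 0] ** 2 + 0.05 * rng.randn(200)
X_sgb_train, X_sgb_val, y_sgb_train, y_sgb_val = train_test_split(
    X_sgb, y_sgb, random_state=42)

mse_by_subsample = {}
for subsample in (1.0, 0.25):
    # With subsample=0.25 each tree is fitted on a random 25% of the training instances
    model = GradientBoostingRegressor(max_depth=2, n_estimators=100,
                                      subsample=subsample, random_state=42)
    model.fit(X_sgb_train, y_sgb_train)
    mse_by_subsample[subsample] = mean_squared_error(
        y_sgb_val, model.predict(X_sgb_val))

print(mse_by_subsample)
```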

# XGBoost (Extreme Gradient Boosting) is a Python library with an optimized implementation of Gradient Boosting, aiming to
# be extremely fast, scalable and portable - it is often used as a component in winning entries to ML competitions. It has
# an API quite similar to Scikit-Learn's:

import xgboost

xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)
print()
# XGBoost also offers features such as automatically taking care of early stopping:

xgb_reg.fit(X_train, y_train,
            eval_set=[(X_val, y_val)], early_stopping_rounds=2)
y_pred = xgb_reg.predict(X_val)


[0]	validation_0-rmse:0.286719
Will train until validation_0-rmse hasn't improved in 2 rounds.
[1]	validation_0-rmse:0.258221
[2]	validation_0-rmse:0.232634
[3]	validation_0-rmse:0.210526
[4]	validation_0-rmse:0.190232
[5]	validation_0-rmse:0.172196
[6]	validation_0-rmse:0.156394
[7]	validation_0-rmse:0.142241
[8]	validation_0-rmse:0.129789
[9]	validation_0-rmse:0.118752
[10]	validation_0-rmse:0.108388
[11]	validation_0-rmse:0.100155
[12]	validation_0-rmse:0.09208
[13]	validation_0-rmse:0.084791
[14]	validation_0-rmse:0.078699
[15]	validation_0-rmse:0.073248
[16]	validation_0-rmse:0.069391
[17]	validation_0-rmse:0.066277
[18]	validation_0-rmse:0.063458
[19]	validation_0-rmse:0.060326
[20]	validation_0-rmse:0.0578
[21]	validation_0-rmse:0.055643
[22]	validation_0-rmse:0.053943
[23]	validation_0-rmse:0.053138
[24]	validation_0-rmse:0.052415
[25]	validation_0-rmse:0.051821
[26]	validation_0-rmse:0.051226
[27]	validation_0-rmse:0.051135
[28]	validation_0-rmse:0.05091
[29]	validation_0-rms

In [None]:
# Stacking (short for stacked generalization) involves training a model to perform the aggregation itself, instead of using
# trivial functions (such as hard voting) to aggregate the predictions of all the predictors in the ensemble. This model
# is called a blender, or meta learner, and would take in the regression predictions from the individual models as an input
# before making the final prediction.
# Blenders are generally trained using a hold-out set (or by using out-of-fold predictions). (1) The training set is split
# into two subsets. The first is used to train the predictors in the first layer. (2) The first layer predictors are used
# to make predictions on the second (held-out) set. This ensures the predictions are 'clean' since the predictors never
# saw them during training. If there are N predictors in the first layer, then for each instance in the hold-out set
# there are N predicted values (so it effectively becomes a new training set which is N-dimensional with the same target
# values). (3) The blender is trained on this new training set, so it learns to predict the target value given the first
# layer's predictions.
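
# Steps (1) to (3) can be sketched by hand as follows. This is only an illustration under assumptions: the three
# first-layer regressors, the Linear Regression blender, the 50/50 split, the synthetic data and the helper name
# stacked_predict are all choices made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X_st = rng.rand(300, 1) - 0.5
y_st = 3 * X_st[:, 0] ** 2 + 0.05 * rng.randn(300)

# (1) Split the training set in two; the first subset trains the first layer
X_layer1, X_hold, y_layer1, y_hold = train_test_split(
    X_st, y_st, test_size=0.5, random_state=42)
first_layer = [DecisionTreeRegressor(max_depth=3, random_state=42),
               RandomForestRegressor(n_estimators=50, random_state=42),
               LinearRegression()]
for predictor in first_layer:
    predictor.fit(X_layer1, y_layer1)

# (2) Predict on the held-out subset: one column per first-layer predictor
X_blend = np.column_stack([p.predict(X_hold) for p in first_layer])

# (3) Train the blender on those predictions, with the original targets
blender = LinearRegression().fit(X_blend, y_hold)


def stacked_predict(X_new):
    """Run a new instance through the first layer, then through the blender."""
    layer1_out = np.column_stack([p.predict(X_new) for p in first_layer])
    return blender.predict(layer1_out)


stacked_prediction = stacked_predict([[0.5]])
print(stacked_prediction)
```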

# It is actually possible to train several different blenders this way (e.g., one using Linear Regression, another using
# Random Forest Regression), to get a whole layer of blenders. The trick is to split the training set into three subsets:
# the first one is used to train the first layer, the second one is used to create the training set used to train the
# second layer (using predictions made by the predictors of the first layer), and the third one is used to create the
# training set to train the third layer (using predictions made by the predictors of the second layer). Once this is done,
# we have the single final blender in its own third layer and we can make a prediction for a new instance by going through
# each layer sequentially.

# Scikit-Learn versions before 0.22 do not support stacking directly, but it is not too hard to roll out your own
# implementation (see the exercises); from version 0.22 onwards, the StackingClassifier and StackingRegressor classes are
# available. Alternatively, you can use an open source implementation such as DESlib.