# Stacked ensembles
This script shows how to build a stacked ensemble with H2O and how it can perform better than the individual machine learning models it combines.

## Introduction to stacked ensembles
Ensemble machine learning methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Many popular modern machine learning algorithms are actually ensembles. For example, Random Forest and Gradient Boosting Machine (GBM) are both ensemble learners. Both bagging (e.g. Random Forest) and boosting (e.g. GBM) are ensembling methods that take a collection of weak learners (e.g. decision trees) and form a single, strong learner.

The Stacked Ensemble method is a supervised ensemble machine learning algorithm that finds the optimal combination of a collection of prediction algorithms using a process called stacking.

Stacking, also called [Super Learning](https://www.degruyter.com/view/j/sagmb.2007.6.issue-1/sagmb.2007.6.1.1309/sagmb.2007.6.1.1309.xml), is a class of algorithms that trains a second-level “metalearner” to find the optimal combination of the base learners. Unlike bagging and boosting, the goal in stacking is to ensemble strong, diverse sets of learners together.

Several ensemble methods are broadly labeled as stacking. The Super Learner ensemble is distinguished by its use of cross-validation to form the “level-one” data, that is, the data on which the metalearning algorithm is trained.

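The "level-one" data consists of the cross-validated predictions of the base models: with L algorithms to be stacked and N training rows, it is an N × L matrix of cross-validation predicted values, paired with the original response. To produce consistent level-one data, every base model must use the same cross-validation setup (the same number of folds and the same fold assignment) and must keep its cross-validation predictions.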
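The construction of the level-one data can be sketched in plain Python. This is a minimal illustration and not H2O's implementation: the two toy base learners (`fit_mean`, `fit_step`), the helper names, and the ten-row dataset are all hypothetical, chosen only to show how N out-of-fold predictions from each of L learners form an N × L matrix when every learner shares the same fold assignment.

```python
def modulo_folds(n_rows, nfolds):
    # Row i goes to fold i % nfolds, so every learner sees identical folds.
    return [i % nfolds for i in range(n_rows)]

def fit_mean(xs, ys):
    # Toy base learner 1 (hypothetical): ignores x, predicts the training mean of y.
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_step(xs, ys):
    # Toy base learner 2 (hypothetical): cuts at the mean of x and predicts
    # the mean of y on each side of the cut.
    cut = sum(xs) / len(xs)
    overall = sum(ys) / len(ys)
    left = [y for x, y in zip(xs, ys) if x <= cut]
    right = [y for x, y in zip(xs, ys) if x > cut]
    left_mean = sum(left) / len(left) if left else overall
    right_mean = sum(right) / len(right) if right else overall
    return lambda x: left_mean if x <= cut else right_mean

def cv_predictions(fit, xs, ys, folds, nfolds):
    # For each fold, train on the other folds and predict the held-out rows.
    preds = [None] * len(xs)
    for k in range(nfolds):
        train_x = [x for x, f in zip(xs, folds) if f != k]
        train_y = [y for y, f in zip(ys, folds) if f != k]
        model = fit(train_x, train_y)
        for i, f in enumerate(folds):
            if f == k:
                preds[i] = model(xs[i])
    return preds

def level_one_data(learners, xs, ys, nfolds):
    # One column of out-of-fold predictions per learner, transposed to N rows x L columns.
    folds = modulo_folds(len(xs), nfolds)
    columns = [cv_predictions(fit, xs, ys, folds, nfolds) for fit in learners]
    return [list(row) for row in zip(*columns)]

# Hypothetical toy data: N = 10 rows, binary response.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.15, 0.85]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

Z = level_one_data([fit_mean, fit_step], xs, ys, nfolds=5)
print(len(Z), len(Z[0]))  # N rows, L columns
```

A metalearner would then be trained on `Z` as features with `ys` as the response. Because each prediction in `Z` comes from a model that never saw that row, the metalearner learns how much to trust each base learner on unseen data.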

## Import the libraries and get the data

In [1]:
import h2o
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O cluster uptime: 21 hours 12 mins
H2O cluster timezone: Europe/Paris
H2O data parsing timezone: UTC
H2O cluster version: 3.24.0.1
H2O cluster version age: 4 months and 6 days !!!
H2O cluster name: H2O_from_python_a_nogue_sanchez_vl8y04
H2O cluster total nodes: 1
H2O cluster free memory: 3.321 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4


In [2]:
data = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/airlines/allyears2k_headers.zip")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [3]:
train, valid, test = data.split_frame([0.8,0.1], seed = 69)

In [4]:
print("%d/%d/%d" %(train.nrows, valid.nrows, test.nrows))

35255/4272/4451
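The counts above are close to, but not exactly, an 80/10/10 split, and the third frame receives the remaining share because only two ratios were given. That pattern is what row-wise random assignment would produce. The sketch below illustrates the idea in plain Python; it assumes each row is assigned independently from a uniform draw, and the function `split_rows` is hypothetical, not H2O's implementation of `split_frame`.

```python
import random

def split_rows(n_rows, ratios, seed):
    # Cumulative boundaries, e.g. [0.8, 0.1] -> [0.8, 0.9]; the rest goes to a final part.
    bounds = []
    total = 0.0
    for r in ratios:
        total += r
        bounds.append(total)

    rng = random.Random(seed)
    parts = [[] for _ in range(len(ratios) + 1)]
    for i in range(n_rows):
        u = rng.random()
        for j, b in enumerate(bounds):
            if u < b:
                parts[j].append(i)
                break
        else:
            # No boundary exceeded u: the row lands in the leftover part.
            parts[-1].append(i)
    return parts

# 43978 is the total number of rows reported above (35255 + 4272 + 4451).
parts = split_rows(43978, [0.8, 0.1], seed=69)
print("/".join(str(len(p)) for p in parts))
```

The part sizes vary slightly with the seed, so the exact counts of this sketch will not match the H2O output.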


In [5]:
y = "IsArrDelayed"
ignoreFields = ["ArrDelay","DepDelay", "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay", 
                "LateAircraftDelay", "IsDepDelayed", "IsArrDelayed", "ActualElapsedTime", "ArrTime", "TailNum"]
x = [i for i in train.names if i not in ignoreFields]

In [6]:
nfolds = 5
train2 = train.rbind(valid)

In [7]:
train2.nrows

39527

## Set and train the 3 models 

In [8]:
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

In [9]:
m_GLM = H2OGeneralizedLinearEstimator(
    family = 'binomial',
    model_id = "glm_def",
    nfolds = nfolds,
    fold_assignment = "Modulo",  # use the same fold assignment in every base model, as stacking requires
    keep_cross_validation_predictions = True)
m_GLM.train(x, y, train2)

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [10]:
m_GBM = H2OGradientBoostingEstimator(
    model_id = "gbm_def",
    nfolds = nfolds,
    fold_assignment = "Modulo",
    keep_cross_validation_predictions = True)
m_GBM.train(x, y, train2)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [11]:
m_RF = H2ORandomForestEstimator(
    model_id = "rf_def",
    nfolds = nfolds,
    fold_assignment = "Modulo",
    keep_cross_validation_predictions = True)
m_RF.train(x, y, train2)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [12]:
models = [m_GLM.model_id, m_GBM.model_id, m_RF.model_id]

In [13]:
m_SE = H2OStackedEnsembleEstimator(model_id = "SE_glm_gbm_rf",
                                   base_models = models)
m_SE.train(x, y, train2)

stackedensemble Model Build progress: |███████████████████████████████████| 100%


## Evaluate the performance of the stacked ensemble

In [14]:
all_models = [m_GLM, m_GBM, m_RF, m_SE]

In [15]:
names = ["GLM", "GBM", "RF", "SE"]

In [16]:
pd.Series(map(lambda x: x.logloss(), all_models), names)

GLM    0.573282
GBM    0.508120
RF     0.513290
SE     0.232895
dtype: float64

In [17]:
pd.Series(map(lambda x: x.auc(), all_models), names)

GLM    0.768183
GBM    0.850507
RF     0.836385
SE     0.991652
dtype: float64

In [18]:
pd.Series(map(lambda x: x.auc(xval = True), all_models), names)

GLM    0.760942
GBM    0.805887
RF     0.840297
SE          NaN
dtype: float64

The first two tables report training metrics, and on those the stacked ensemble (SE) looks significantly better than the three other models. Training metrics tend to be optimistic, however, so they are not a fair basis for comparison. The cross-validation table would be fairer, but there SE returns NaN: the ensemble was trained on the cross-validated predictions from all of the training data and was not itself cross-validated, so there was no separate data set to evaluate it against. To evaluate performance, we must therefore use the test set.
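For reference, the two metrics compared here can be written out from their textbook definitions. This is a minimal sketch in plain Python, not the way H2O computes them internally; the function names and the four-row example are illustrative only.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    # Mean negative log-likelihood of the true labels; lower is better.
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def auc(y_true, p_pred):
    # Probability that a random positive is scored above a random negative
    # (ties count as one half); higher is better.
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Illustrative labels and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))       # 0.75
print(log_loss(labels, scores))
```

Both metrics only say something about generalization when `y_true` and `p_pred` come from rows the model did not train on, which is why the comparison that follows uses the test set.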

In [19]:
test_perf = list(map(lambda x: x.model_performance(test), all_models))

In [20]:
pd.Series(map(lambda p: p.logloss(), test_perf), names)

GLM    0.580694
GBM    0.544807
RF     0.481551
SE     0.475998
dtype: float64

In [21]:
pd.Series(map(lambda p: p.auc(), test_perf), names)

GLM    0.755183
GBM    0.801738
RF     0.850072
SE     0.850352
dtype: float64