## Stacking/Blending

Stacking (sometimes called stacked generalization or blending) trains a learning algorithm to combine the predictions of several other learning algorithms. First, the other algorithms are trained on the available data; then a combiner algorithm is trained to make a final prediction, using the predictions of the other algorithms as additional inputs. In practice, a single-layer logistic regression model is often used as the combiner, although stacking can in principle represent a variety of ensemble techniques, since any algorithm can serve as the combiner. Stacking/blending typically performs better than any single one of the trained models.
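
For orientation, recent versions of scikit-learn (0.22 or later, which is an assumption about the installed version) ship a built-in `StackingClassifier` that follows the same two-level idea. The sketch below is not the implementation developed in this notebook; the variable names, the fixed `random_state`, and the use of the bundled copy of the Iris data are choices made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Bundled copy of the Iris data, split 60/40 (seeded only for repeatability).
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0, stratify=y_demo)

# Base (level-0) models, given as (name, estimator) pairs.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
    ("et", ExtraTreesClassifier(n_estimators=10, random_state=0)),
]

# The combiner (level-1) model is fit on cross-validated class probabilities.
stack_demo = StackingClassifier(estimators=base_models,
                                final_estimator=LogisticRegression(max_iter=1000),
                                cv=5, stack_method="predict_proba")
stack_demo.fit(X_tr, y_tr)

demo_accuracy = stack_demo.score(X_te, y_te)
print("Stacked accuracy: %.3f" % demo_accuracy)
```

The class developed below exposes the same moving parts (base models, a cross-validation scheme, and a combiner) but makes each step explicit.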

In [1]:
# (c) 2014 Reid Johnson
#
# Modified from:
# Kemal Eren (https://github.com/kemaleren/scikit-learn/blob/stacking/sklearn/ensemble/stacking.py)
#
# Generates a stacking/blending of base models. Cross-validation is used to 
# generate predictions from base (level-0) models that are used as input to a 
# combiner (level-1) model.

import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.base import ClassifierMixin, RegressorMixin
# BaseEnsemble is exported from sklearn.ensemble; the sklearn.ensemble.base 
# module path was removed in later scikit-learn releases.
from sklearn.ensemble import BaseEnsemble
from sklearn.utils.validation import assert_all_finite

# TODO: Built-in nested cross validation, re-using base classifiers, to pick 
#       best stacking method.
# TODO: Access to best, vote, etc. after training.

__all__ = [
    "Stacking",
    "StackingFWL",
    'estimator_grid'
]


def estimator_grid(*args):
    """Generate candidate estimators from a list of parameter values on the 
    combination of the various parameter lists given.

    Parameters
    ----------
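    args : sequence
        Alternating estimator constructors and parameter grids (dicts mapping 
        parameter names to lists of values). An empty grid yields a single 
        estimator with default parameters.

    Returns
    -------
    result : list
        The generated estimators.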
    """
    result = []
    pairs = zip(args[::2], args[1::2])
    for estimator, params in pairs:
        if len(params) == 0:
            result.append(estimator())
        else:
            for p in ParameterGrid(params):
                result.append(estimator(**p))
    return result


class MRLR(ClassifierMixin):
    """Converts a multi-class classification task into a set of indicator 
    regression tasks.

    References
    ----------
    .. [1] K. M. Ting, I. H. Witten, "Issues in Stacked Generalization", 1999.

    """
    def __init__(self, regressor, stackingc, **kwargs):
        self.estimator_ = regressor
        self.estimator_args_ = kwargs
        self.stackingc_ = stackingc

    def _get_subdata(self, X):
        """Returns subsets of the data, one for each class. Assumes the 
        columns of X are striped in order.

        e.g. if n_classes_ == 3, then returns (X[:, 0::3], X[:, 1::3],
        X[:, 2::3])

        Parameters
        ----------
        X : np.ndarray, shape=(n, m)
            The feature data.

        Returns
        -------
        list of np.ndarray, length n_classes_
            The subsets of the data, one per class.
        """
        if not self.stackingc_:
            return [X, ] * self.n_classes_

        result = []
        for i in range(self.n_classes_):
            slc = (slice(None), slice(i, None, self.n_classes_))
            result.append(X[slc])
        return result

    def fit(self, X, y):
        """Fit the estimator given predictor(s) X and target y. Assumes the
        columns of X are predictions generated by each predictor on each 
        class. Fits one estimator for each class.

        Parameters
        ----------
        X : np.ndarray, shape=(n, m)
            The feature data for which to compute the predicted output.

        y : array of shape = [n_samples]
            The actual outputs (class data).
        """
        self.n_classes_ = len(set(y))
        self.estimators_ = []

        # Generate feature data subsets corresponding to each class.
        X_subs = self._get_subdata(X)

        # Fit an instance of the estimator to each data subset.
        for i in range(self.n_classes_):
            e = self.estimator_(**self.estimator_args_)
            # Indicator target for class i (assumes labels are 0..n_classes_-1).
            y_i = np.asarray(y) == i
            X_i = X_subs[i]
            e.fit(X_i, y_i)
            self.estimators_.append(e)

    def predict(self, X):
        """Predict label values with the fitted estimator on predictor(s) X.

        Returns
        -------
        array of shape = [n_samples]
            The predicted label values of the input samples.
        """
        proba = self.predict_proba(X)
        return np.argmax(proba, axis=1)

    def predict_proba(self, X):
        """Predict label probabilities with the fitted estimator on 
        predictor(s) X.

        Returns
        -------
        proba : array of shape = [n_samples, n_classes]
            The predicted label probabilities of the input samples.
        """
        proba = []

        X_subs = self._get_subdata(X)

        for i in range(self.n_classes_):
            e = self.estimators_[i]
            X_i = X_subs[i]
            pred = e.predict(X_i).reshape(-1, 1)
            proba.append(pred)
        proba = np.hstack(proba)

        normalizer = proba.sum(axis=1)[:, np.newaxis]
        normalizer[normalizer == 0.0] = 1.0
        proba /= normalizer

        assert_all_finite(proba)

        return proba


class Stacking(BaseEnsemble):
    """Implements stacking/blending.

    Parameters
    ----------
    meta_estimator : string or callable
        Any classifier or regressor constructor. The strings "best", "vote", 
        and "average" are recognized but not yet implemented.

    estimators : list
        A list of estimators; each must support predict_proba() when proba 
        is True, and predict() otherwise.

    cv : iterator
        A cross validation object. Base (level-0) estimators are trained on 
        the training folds, then the meta (level-1) estimator is trained on 
        the testing folds.

    stackingc : bool
        Whether to use StackingC or not. For more information, refer to the 
        following paper:

        Reference:
          A. K. Seewald, "How to Make Stacking Better and Faster While Also 
          Taking Care of an Unknown Weakness," 2002.

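    proba : bool
        Whether the base estimators pass predicted class probabilities 
        (predict_proba) or predicted labels (predict) to the meta estimator.

    kwargs :
        Arguments passed to instantiate meta_estimator.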

    References
    ----------
    .. [1] D. H. Wolpert, "Stacked Generalization", 1992.

    """

    # TODO: Support different features for each estimator.
    # TODO: Support "best", "vote", and "average" for already trained model.
    # TODO: Allow saving of estimators, so they need not be retrained when 
    #       trying new stacking methods.

    def __init__(self, meta_estimator, estimators,
                 cv, stackingc=True, proba=True,
                 **kwargs):
        self.estimators_ = estimators
        self.n_estimators_ = len(estimators)
        self.cv_ = cv
        self.stackingc_ = stackingc
        self.proba_ = proba

        if stackingc:
            if isinstance(meta_estimator, str) or not issubclass(meta_estimator, RegressorMixin):
                raise Exception('StackingC only works with a regressor.')

        if isinstance(meta_estimator, str):
            if meta_estimator not in ('best',
                                      'average',
                                      'vote'):
                raise Exception('Invalid meta estimator: {0}'.format(meta_estimator))
            raise Exception('"{0}" meta estimator not implemented'.format(meta_estimator))
        elif issubclass(meta_estimator, ClassifierMixin):
            self.meta_estimator_ = meta_estimator(**kwargs)
        elif issubclass(meta_estimator, RegressorMixin):
            self.meta_estimator_ = MRLR(meta_estimator, stackingc, **kwargs)
        else:
            raise Exception('Invalid meta estimator: {0}'.format(meta_estimator))

    def _base_estimator_predict(self, e, X):
        """Predict label values with the specified estimator on predictor(s) X.

        Parameters
        ----------
        e : estimator
            The fitted base estimator.

        X : np.ndarray, shape=(n, m)
            The feature data for which to compute the predicted outputs.

        Returns
        -------
        pred : np.ndarray, shape=(len(X),)
            The label values predicted by the specified estimator for each 
            instance in X.
        """
        # Predict labels and check that they contain no NaN or infinity.
        pred = e.predict(X)
        assert_all_finite(pred)
        return pred

    def _base_estimator_predict_proba(self, e, X):
        """Predict label probabilities with the specified estimator on 
        predictor(s) X.

        Parameters
        ----------
        e : estimator
            The fitted base estimator.

        X : np.ndarray, shape=(n, m)
            The feature data for which to compute the predicted outputs.

        Returns
        -------
        pred : np.ndarray, shape=(len(X), n_classes)
            The label probabilities predicted by the specified estimator for 
            each instance in X.
        """
        # Predict probabilities and check that they contain no NaN or infinity.
        pred = e.predict_proba(X)
        assert_all_finite(pred)
        return pred

    def _make_meta(self, X):
        """Make the feature set for the meta (level-1) estimator.

        Parameters
        ----------
        X : np.ndarray, shape=(n, m)
            The feature data.

        Returns
        -------
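        np.ndarray
            The meta-level features, one row per instance. Each estimator 
            contributes n_classes columns when predicting probabilities, or 
            one column when predicting labels.
        """
        rows = []
        for e in self.estimators_:
            if self.proba_:
                # Predict label probabilities
                pred = self._base_estimator_predict_proba(e, X)
            else:
                # Predict label values
                pred = self._base_estimator_predict(e, X)
            rows.append(pred)
        # column_stack turns 1-D label predictions into columns and joins 2-D 
        # probability arrays side by side, so both modes yield a 2-D array.
        return np.column_stack(rows)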

    def fit(self, X, y):
        """Fit the estimator given predictor(s) X and target y.

        Parameters
        ----------
        X : np.ndarray, shape=(n, m)
            The feature data on which to fit.

        y : array of shape = [n_samples]
            The actual outputs (class data).
        """
        # Build meta data.
        X_meta = [] # meta-level features
        y_meta = [] # meta-level labels

        print('Training and validating the base (level-0) estimator(s)...')
        print()
        for i, (a, b) in enumerate(self.cv_.split(X, y)):
            print('Fold [%s]' % (i))

            X_a, X_b = X[a], X[b] # training and validation features
            y_a, y_b = y[a], y[b] # training and validation labels

            # Fit each base estimator using the training set for the fold.
            for j, e in enumerate(self.estimators_):
                print('  Training base (level-0) estimator %d...' % (j))
                e.fit(X_a, y_a)
                print('done.')

            proba = self._make_meta(X_b)
            X_meta.append(proba)
            y_meta.append(y_b)

        X_meta = np.vstack(X_meta)
        if y_meta[0].ndim == 1:
            y_meta = np.hstack(y_meta)
        else:
            y_meta = np.vstack(y_meta)

        # Train meta estimator.
        print('Training meta (level-1) estimator...')
        self.meta_estimator_.fit(X_meta, y_meta)
        print('done.')

        # Re-train base estimators on full data.
        for j, e in enumerate(self.estimators_):
            print('Re-training base (level-0) estimator %d on full data...' % (j))
            e.fit(X, y)
            print('done.')

    def predict(self, X):
        """Predict label values with the fitted estimator on predictor(s) X.

        Parameters
        ----------
        X : np.ndarray, shape=(n, m)
            The feature data for which to compute the predicted output.

        Returns
        -------
        array of shape = [n_samples]
            The predicted label values of the input samples.
        """
        X_meta = self._make_meta(X)
        return self.meta_estimator_.predict(X_meta)

    def predict_proba(self, X):
        """Predict label probabilities with the fitted estimator on 
        predictor(s) X.

        Parameters
        ----------
        X : np.ndarray, shape=(n, m)
            The feature data for which to compute the predicted output.

        Returns
        -------
        array of shape = [n_samples, n_classes]
            The predicted label probabilities of the input samples.
        """
        X_meta = self._make_meta(X)
        return self.meta_estimator_.predict_proba(X_meta)


class StackingFWL(Stacking):
    """Implements Feature-Weighted Linear Stacking.

    References
    ----------
    .. [1] J. Sill, G. Takács, L. Mackey, D. Lin, "Feature-Weighted Linear 
           Stacking", 2009.

    """
    pass
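
To make the column "striping" assumed by `MRLR._get_subdata` concrete, here is a minimal NumPy sketch. The probability values are made up for illustration, and a plain per-class average stands in for the per-class regressors that `MRLR` actually fits; only the slicing and the row normalization mirror the code above.

```python
import numpy as np

n_classes = 3

# Hypothetical meta-features for 2 samples from 2 base models (A and B).
# Columns are laid out as [A_c0, A_c1, A_c2, B_c0, B_c1, B_c2].
X_meta_demo = np.array([[0.7, 0.2, 0.1, 0.6, 0.3, 0.1],
                        [0.1, 0.1, 0.8, 0.2, 0.2, 0.6]])

# Same strided slice as in _get_subdata: class i takes columns
# i, i + n_classes, i + 2 * n_classes, ...
per_class = [X_meta_demo[:, i::n_classes] for i in range(n_classes)]
print(per_class[0])  # class-0 probabilities from models A and B

# Stand-in for the per-class regressors: average each class's columns to get
# one raw score per class, then normalize rows as predict_proba does.
raw = np.hstack([sub.mean(axis=1).reshape(-1, 1) for sub in per_class])
normalizer = raw.sum(axis=1)[:, np.newaxis]
normalizer[normalizer == 0.0] = 1.0  # avoid dividing by zero
proba_demo = raw / normalizer

print(np.argmax(proba_demo, axis=1))  # predicted class per sample
```

With `stackingc=False`, every per-class estimator instead sees the full meta-feature matrix rather than only its own class's columns.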

Here, we will demonstrate stacking/blending by using the Iris flower dataset. Thus, we first load and perform some preprocessing on the data. The preprocessing involves altering the target or class variables, which in the Iris dataset are by default represented as strings (nominal values), but for compatibility reasons need to be represented as integers (numeric values). We perform this conversion using a label-encoding method available via scikit-learn.

In [2]:
import numpy as np
import pandas as pd
from sklearn import preprocessing

label_encode = True

# Load the Iris flower dataset
fileURL = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = pd.read_csv(fileURL, 
                   names=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species'],
                   header=None)
iris = iris.dropna()

X = iris[['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']] # features
labels = iris['Species'] # class

if label_encode:
    # Transform string (nominal) output to numeric
    y = preprocessing.LabelEncoder().fit_transform(labels)
else:
    y = labels
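
As a small illustration of what the label encoding does, the sketch below encodes a handful of species names. The four-item list is made up for the example; unlike the cell above, it keeps a reference to the encoder so that the integer codes can be mapped back to names.

```python
from sklearn.preprocessing import LabelEncoder

# A made-up sample of species names in arbitrary order.
species = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# Classes are stored in sorted order; each code is an index into classes_.
print(list(encoder.classes_))
print(codes.tolist())

# Keeping the encoder allows the original names to be recovered later.
restored = encoder.inverse_transform(codes)
print(list(restored))
```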

Next, we generate the base (level-0) models, which are the models whose predictions on the training data will be combined by a higher-level (level-1) model. Here, we use different variants of decision trees as our base models, specifically random forest and extra trees, both of which are decision tree ensembles (collections of individual decision trees). We use 10 trees for each ensemble. We collect these base models in a list, which will be used as input to our stacking algorithm.

In [3]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier

n_trees = 10

# Generate a list of base (level 0) classifiers.
clfs = [RandomForestClassifier(n_estimators=n_trees, n_jobs=-1, criterion='entropy'),
        ExtraTreesClassifier(n_estimators=n_trees, n_jobs=-1, criterion='entropy'),
        #GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=n_trees)
        ]
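
The `estimator_grid` helper defined earlier offers another way to build such a list, producing one estimator per parameter combination. The sketch below repeats a compact copy of the helper so that it runs on its own; the particular parameter grid is an assumption chosen only for illustration.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import ParameterGrid


def estimator_grid(*args):
    """Compact copy of the helper above: alternating constructors and
    parameter grids in, a list of instantiated estimators out."""
    result = []
    for estimator, params in zip(args[::2], args[1::2]):
        if len(params) == 0:
            result.append(estimator())  # default parameters
        else:
            result.extend(estimator(**p) for p in ParameterGrid(params))
    return result


# Two random forests (one per criterion) plus one default extra trees model.
grid_clfs = estimator_grid(
    RandomForestClassifier, {'n_estimators': [10], 'criterion': ['gini', 'entropy']},
    ExtraTreesClassifier, {},
)

for clf in grid_clfs:
    print(type(clf).__name__, clf.criterion)
```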

Next, we partition the dataset into non-overlapping training and testing sets, with 60% of the data allocated to the training set and 40% allocated to the testing set. All of the training for the stacking algorithm will be performed on the training set, while the testing set will be used solely to generate predictions, the accuracy of which we will later evaluate. 

In [4]:
from sklearn.model_selection import train_test_split

# The training sets will be used for all training and validation purposes.
# The testing sets will only be used for evaluating the final blended (level 1) classifier.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
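
A quick check of what a 60/40 split yields on 150 instances can be sketched with synthetic data. The toy arrays below are placeholders, and `random_state` and `stratify` are optional arguments that the cell above does not set; they are used here only to make the result repeatable and class-balanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the Iris data: 150 rows, 3 classes of 50 instances each.
X_toy = np.arange(300).reshape(150, 2)
y_toy = np.repeat([0, 1, 2], 50)

X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=0, stratify=y_toy)

print(X_tr_toy.shape, X_te_toy.shape)  # 90 training rows, 60 testing rows
print(np.bincount(y_te_toy))           # per-class counts in the testing set

# The two sets share no rows.
shared = set(X_tr_toy[:, 0]) & set(X_te_toy[:, 0])
print(len(shared))
```

Without `stratify`, the class proportions in each set can drift from those of the full dataset, which matters more for small or imbalanced data.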

The concept of stacking requires that the base models generate output that can be further processed by a higher-level model. To generate this output, the base models must produce predictions on some sort of testing data. The base models cannot use the original testing set, as this data should not be used to influence model training, while the predictions generated by the base models will be used to train the higher-level model. Thus, the training set must itself be divided into training and testing portions for the base models, which can be accomplished by cross-validation.

Here, we use 5-fold cross-validation to partition the training set into five non-overlapping sets or folds. Note that the folds we generate are stratified, which means that each fold contains roughly the same proportion of each class label. The output is a cross-validation object that, when applied to the training data, determines which instances belong to which folds. This will be used as an input to our stacking algorithm.

In [5]:
from sklearn.model_selection import StratifiedKFold

# Generate k stratified folds of the training data.
skf = StratifiedKFold(n_splits=5)
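
To see what stratification means in practice, the sketch below applies the same kind of splitter to a toy label vector (90 labels, 30 per class, chosen to mirror a balanced training set) and counts the classes in each validation fold. The feature matrix is a placeholder, since only the labels influence the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 3 classes with 30 instances each.
y_folds = np.repeat([0, 1, 2], 30)
X_folds = np.zeros((len(y_folds), 1))  # placeholder features

skf_demo = StratifiedKFold(n_splits=5)

fold_counts = []
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(skf_demo.split(X_folds, y_folds)):
    counts = np.bincount(y_folds[val_idx])
    fold_counts.append(counts.tolist())
    fold_sizes.append((len(train_idx), len(val_idx)))
    print('Fold %d: %d train, %d validation, class counts %s'
          % (fold, len(train_idx), len(val_idx), counts))
```

Each instance lands in exactly one validation fold, which is what lets the stacking procedure obtain a held-out prediction for every training instance.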

Our stacking algorithm iterates through the training-set folds defined above. At each iteration, the selected fold is used for validation, while the remaining folds are used for training. The predictions generated by the base models on the validation fold become the training features (predictor variables) for the higher-level model, while the predictions generated by the base models on the original testing set become its testing features. The labels (the class or target variable) remain the same. Thus, one may think of this process as replacing the original feature values for each instance with the predictions made by each model. By using cross-validation, we can use the original training portion of the dataset to both train and evaluate our base models, which allows us to obtain predictions over (and thus generate new feature values for) all of the training instances. Since our higher-level model is trained on these new feature values, we must also replace the original feature values for the testing data with the base model predictions on the testing data. In the implementation above, these testing predictions do not come from the per-fold models: once the higher-level model has been fit, each base model is re-trained on the entire training set, and its predictions on the testing set are used as testing input to the higher-level model.

Once the base models are trained, they are stacked/blended. This means that their outputs (predictions) are used as input to a higher-level (level-1) model. Here, we use logistic regression as the higher-level model. As a result, the output generated by the logistic regression model, which is trained or fit on the predictions of the lower-level models, is used to predict the target or class variable of interest.
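
The out-of-fold construction described above can be sketched compactly with scikit-learn's `cross_val_predict`. This is an illustrative equivalent under stated assumptions, not the `Stacking` class itself: it uses the bundled copy of the Iris data, fixed seeds, and its own variable names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     train_test_split)

X_all, y_all = load_iris(return_X_y=True)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_all, y_all, test_size=0.4, random_state=0, stratify=y_all)

bases = [RandomForestClassifier(n_estimators=10, random_state=0),
         ExtraTreesClassifier(n_estimators=10, random_state=0)]
cv = StratifiedKFold(n_splits=5)

# Level-1 training features: each row holds class probabilities predicted by
# models that never saw that row during fitting (out-of-fold predictions).
meta_train = np.hstack([
    cross_val_predict(m, X_fit, y_fit, cv=cv, method='predict_proba')
    for m in bases])

# Level-1 testing features: refit each base model on the full training set,
# then predict on the held-out testing set.
meta_test = np.hstack([m.fit(X_fit, y_fit).predict_proba(X_hold) for m in bases])

# The combiner is trained only on base-model predictions.
combiner = LogisticRegression(max_iter=1000).fit(meta_train, y_fit)
oof_accuracy = combiner.score(meta_test, y_hold)
print(meta_train.shape, meta_test.shape)
print('Combiner accuracy: %.3f' % oof_accuracy)
```

Each base model contributes three probability columns (one per class), so two base models give the combiner six input features.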

In [6]:
from sklearn.linear_model import LogisticRegression

stk = Stacking(LogisticRegression, clfs, skf, stackingc=False, proba=True)
stk.fit(X_train.values, y_train)

Training and validating the base (level-0) estimator(s)...

Fold [0]
  Training base (level-0) estimator 0...
done.
  Training base (level-0) estimator 1...
done.
Fold [1]
  Training base (level-0) estimator 0...
done.
  Training base (level-0) estimator 1...
done.
Fold [2]
  Training base (level-0) estimator 0...
done.
  Training base (level-0) estimator 1...
done.
Fold [3]
  Training base (level-0) estimator 0...
done.
  Training base (level-0) estimator 1...
done.
Fold [4]
  Training base (level-0) estimator 0...
done.
  Training base (level-0) estimator 1...
done.
Training meta (level-1) estimator...
done.
Re-training base (level-0) estimator 0 on full data...
done.
Re-training base (level-0) estimator 1 on full data...
done.


Now we can compare the performance of the ensemble of blended models to that of the individual (base) ones:

In [7]:
from sklearn import metrics

### Generate predictions with stacked/blended (level-1) classifier. ###

# Pass plain NumPy arrays, matching the arrays used to fit the models.
X_test_values = X_test.values

score = metrics.accuracy_score(y_test, stk.predict(X_test_values))
print()
print('Blended Classifier Accuracy = %s' % (score))
print()

### Generate predictions with base (level-0) classifiers. ###

# Random forest predictions.
score0 = metrics.accuracy_score(y_test, stk._base_estimator_predict(stk.estimators_[0], X_test_values))
print('Random Forest (10 trees) Accuracy = %s' % (score0))

# Extra trees predictions.
score1 = metrics.accuracy_score(y_test, stk._base_estimator_predict(stk.estimators_[1], X_test_values))
print('Extra Trees (10 trees) Accuracy = %s' % (score1))


Blended Classifier Accuracy = 0.933333333333

Random Forest (10 trees) Accuracy = 0.916666666667
Extra Trees (10 trees) Accuracy = 0.9


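In this run, the blended classifier (accuracy of about 0.933) outperforms both the random forest (about 0.917) and the extra trees ensemble (0.9). Because neither the train/test split nor the base models are seeded, the exact figures will vary from run to run, but the "meta" model produced by stacking/blending will often outperform each of the base models.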