# stackEnsemble

The `stackEnsemble` function implements stack ensembling. First-level models are fitted with K-fold cross-validation, and each one predicts on the held-out fold, so every trainset row receives an out-of-fold prediction. Each model also predicts on the entire testset. These predictions are the 'meta-features', which can be used as additional features for second-level models.
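
To make the out-of-fold idea concrete, here is a minimal sketch for a single model. It is not the function defined in this notebook: it uses scikit-learn's `cross_val_predict` on synthetic data, and the names `X_demo`, `y_demo`, `base_model` and `oof_preds` are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data (an assumption; not the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Each row is predicted by a model that did not see that row during fitting
base_model = LogisticRegression(max_iter=1000)
oof_preds = cross_val_predict(base_model, X_demo, y_demo, cv=KFold(n_splits=5))

# Append the out-of-fold predictions as one extra meta-feature column
X_demo_stacked = np.column_stack([X_demo, oof_preds])
print(X_demo_stacked.shape)  # (200, 6)
```

The extra column is what a second-level model would receive on top of the original features.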


In [9]:
# imports for predictive models and validation
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

# imports for stackEnsemble()
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

In [2]:
# Load the data
X = pd.read_csv('../../VIVAT2/data/X_categ.csv').reset_index()
y = pd.read_csv('../../VIVAT2/data/y_categ.csv').reset_index().drop('index', axis=1)
    
# Split the data in test and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

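# Reset the row index to 0..n-1 after the shuffled split: stackEnsemble()
# joins the meta-features by index, so both frames need a clean index.
# 'level_0' and 'index' are columns left over from the reset_index() calls;
# other datasets may not need these drops.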
X_train = X_train.reset_index().drop(['level_0', 'index'], axis=1)
y_train = y_train.reset_index().drop('index', axis=1)
X_test = X_test.reset_index().drop(['level_0', 'index'], axis=1)
y_test = y_test.reset_index().drop('index', axis=1)

In [18]:
def stackEnsemble(models, X, y, Xtest, splits, verbose):
    ''' Function to perform stack ensembling on arbitrary dataset with sklearn models.

    # Arguments:
        models:  list, a list of models to be used to create meta-features
        X:       dataframe, the trainset features
        y:       dataframe, the trainset labels
        Xtest:   dataframe, the testset features
        splits:  int, the number of splits for trainset CV
        verbose: bool, true when print outputs are desired

    # Returns:
        X:       dataframe, new trainset with meta-features
        Xtest:   dataframe, new testset with meta-features
    '''

    # assert correct data-types 
    assert type(models) == list
    assert type(splits) == int
    assert type(verbose) == bool

    # init variables
    kf = KFold(n_splits = splits)
    predsTR = {}
    predsTE = {}

    # iterate over all inserted models
    for n, model in enumerate(models):
        if verbose: print('Using model %d to make predictions..' % (n+1))

        # prepare split for predictions
        predsTR['model'+str(n+1)] = []
        for i, (train, test) in enumerate(kf.split(X)):
            if verbose: print('..on split %d' % (i+1))

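            # fit on split and predict on the held-out rows
            # (np.ravel turns the one-column label frame into the 1-D array sklearn expects)
            model.fit(X.iloc[train], np.ravel(y.iloc[train]))
            predsTR['model'+str(n+1)].append(list(model.predict(X.iloc[test])))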

        # predict on testset (at this point the model is as fitted on the last split)
        predsTE['model'+str(n+1)] = list(model.predict(Xtest))
    
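    # combine trainset predictions in dataframe, join with trainset
    # (KFold without shuffling yields the folds in row order, so the flattened
    #  predictions line up with X as long as X has a 0..n-1 index)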
    meta_feats = pd.DataFrame(columns = [col for col in predsTR.keys()])
    for model in predsTR.keys():
        meta_feats[model] = np.array([item for lst in predsTR[model] for item in lst])
    X = pd.concat([X, meta_feats], axis=1)

    # combine testset predictions in dataframe, join with testset
    meta_feats = pd.DataFrame(columns = [col for col in predsTE.keys()])
    for model in predsTE.keys():
        meta_feats[model] = np.array(predsTE[model])
    Xtest = pd.concat([Xtest, meta_feats], axis=1)

    # return trainset and testset with metafeatures
    return X, Xtest
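
In `stackEnsemble` the testset predictions come from each model as it was fitted on the last split. A common variant refits every model on the full trainset before predicting the testset. The sketch below shows that variant for a single model on synthetic data; `meta_feature_refit` and all variable names are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier


def meta_feature_refit(model, X, y, Xtest, splits=5):
    """Hypothetical helper: out-of-fold train predictions, full-refit test predictions."""
    oof = np.empty(len(X), dtype=y.dtype)
    for train_idx, val_idx in KFold(n_splits=splits).split(X):
        fold_model = clone(model)  # fresh, unfitted copy for every fold
        fold_model.fit(X.iloc[train_idx], y[train_idx])
        oof[val_idx] = fold_model.predict(X.iloc[val_idx])
    # Refit on the whole trainset before predicting the testset
    full_model = clone(model).fit(X, y)
    return oof, full_model.predict(Xtest)


# Synthetic stand-in data (an assumption; not the dataset used in this notebook)
features, labels = make_classification(n_samples=120, n_features=4, random_state=0)
X_tr, X_te = pd.DataFrame(features[:100]), pd.DataFrame(features[100:])
train_meta, test_meta = meta_feature_refit(
    DecisionTreeClassifier(max_depth=2, random_state=0), X_tr, labels[:100], X_te
)
print(train_meta.shape, test_meta.shape)  # (100,) (20,)
```

Whether the refit helps depends on the data; it mainly makes the testset meta-features independent of which split happened to come last.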

## Example

In the cells below I use the `stackEnsemble` function to create meta-features for the dataset loaded above. The meta-features come from three simple models: a Logistic Regression model, an AdaBoost ensemble of decision trees, and a Support Vector Classifier.

In [4]:
# define the models
model1 = LogisticRegression(C=1e5, class_weight='balanced')
model2 = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                            n_estimators=300,
                            learning_rate=1.5,
                            algorithm="SAMME")
model3 = SVC(decision_function_shape='ovo')

In [5]:
models = [model1, model2, model3]
X_train_stack, X_test_stack = stackEnsemble(models, X_train, y_train, X_test, 10, True)

Using model 1 to make predictions..
..on split 1
..on split 2
..on split 3
..on split 4
..on split 5
..on split 6
..on split 7
..on split 8
..on split 9
..on split 10
Using model 2 to make predictions..
..on split 1
..on split 2
..on split 3
..on split 4
..on split 5
..on split 6
..on split 7
..on split 8
..on split 9
..on split 10
Using model 3 to make predictions..
..on split 1
..on split 2
..on split 3
..on split 4
..on split 5
..on split 6
..on split 7
..on split 8
..on split 9
..on split 10


## Validation

Below I test whether stack ensembling improves the test accuracy of Logistic Regression. The first cell uses the raw dataset, without meta-features. The second cell adds the meta-features produced by the three models above.

In [16]:
print(accuracy_score(y_pred=model1.fit(X_train, y_train).predict(X_test), y_true=y_test))

0.453268641471


In [15]:
print(accuracy_score(y_pred=model1.fit(X_train_stack, y_train).predict(X_test_stack), y_true=y_test))

0.560393258427
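
For comparison, scikit-learn 0.22 and later ship a built-in `StackingClassifier`. The sketch below is a rough analogue of the experiment above, not a reproduction of it: it runs on synthetic data, uses different base models, and refits the base models on the full trainset before predicting. `stack_method="predict"` feeds class predictions to the final model, and `passthrough=True` keeps the original features next to them. All variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=2, random_state=0)),
        ("svc", SVC()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,                     # folds used for the out-of-fold predictions
    stack_method="predict",   # class predictions as meta-features
    passthrough=True,         # keep the original features as well
)
stack.fit(X_tr, y_tr)
stacked_acc = accuracy_score(y_te, stack.predict(X_te))
print(stacked_acc)
```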
