## Part 5: Model stacking/blending

In parts 1-4 of this binary classification series, I saved a total of 8 models predicting history of high blood pressure (4 that accounted for class imbalance, 4 that did not). One way to improve final predictions is to use ensembling techniques. In this post I'll explore a few stacking/blending approaches:

- Simple majority vote
- XGBoost using model predictions as features
- XGBoost using original training data + model predictions as features

First, I'll load the saved models and retrieve class predictions from all 8 of them, plus probability predictions from the 6 non-SVM models (the SVMs are left out of the probability set):

In [1]:
import pickle
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report
from xgboost import XGBClassifier

# Train/Val/vars
x_train = pickle.load( open( "x_train.pickle", "rb" ) )
x_val = pickle.load( open( "x_val.pickle", "rb" ) )
x_test = pickle.load( open( "x_test.pickle", "rb" ) )
y_train = pickle.load( open( "y_train.pickle", "rb" ) )
y_val = pickle.load( open( "y_val.pickle", "rb" ) )
y_test = pickle.load( open( "y_test.pickle", "rb" ) )
keep_vars10 = pickle.load( open( "keep_vars10.pickle", "rb" ) )


# Make dicts of models
models = {'xgb_bal':pickle.load(open('xgb_bal.pickle','rb')),
          'xgb_unbal':pickle.load(open('xgb_unbal.pickle','rb')),
          'svc_bal':pickle.load(open('svc_bal.pickle','rb')),
          'svc_unbal':pickle.load(open('svc_unbal.pickle','rb')),
          'lr_bal':pickle.load(open('lr_bal.pickle','rb')),
          'lr_unbal':pickle.load(open('lr_unbal.pickle','rb')),
          'ann_bal':pickle.load(open('ann_bal.pickle','rb')),
          'ann_unbal':pickle.load(open('ann_unbal.pickle','rb'))
         }
# Models used for probability predictions (SVMs excluded)
prob_models = {k: v for k, v in models.items() if 'svc' not in k}

# Retrieve Class Predictions
def pick_predict(key,m_dict,df):
    if 'ann' in key:
        return m_dict[key].predict_classes(df)
    if 'xgb' in key:
        bnl = m_dict[key].best_ntree_limit
        lpreds = m_dict[key].predict(df,ntree_limit=bnl)
        return lpreds
    else:
        return m_dict[key].predict(df)

train_preds = []
val_preds = []
test_preds = []
for i in models:
    train_preds.append(pick_predict(i,models,x_train[keep_vars10]))
    val_preds.append(pick_predict(i,models,x_val[keep_vars10]))
    test_preds.append(pick_predict(i,models,x_test[keep_vars10]))

# Make array of all predictions
t_preds_array = np.array(train_preds).transpose()
v_preds_array = np.array(val_preds).transpose()
test_preds_array = np.array(test_preds).transpose()

# Retrieve Prob Predictions
def pick_predict_prob(key,m_dict,df):
    if 'ann' in key:
        return np.array([i[1] for i in m_dict[key].predict(df)])
    if 'xgb' in key:
        bnl = m_dict[key].best_ntree_limit
        lpreds = m_dict[key].predict_proba(df,ntree_limit=bnl)
        return np.array([i[1] for i in lpreds])
    else:
        return np.array([i[1] for i in m_dict[key].predict_proba(df)])

train_preds_prob = []
val_preds_prob = []
test_preds_prob = []
for i in prob_models:
    train_preds_prob.append(pick_predict_prob(i,
                                              prob_models,
                                              x_train[keep_vars10]))
    val_preds_prob.append(pick_predict_prob(i,
                                            prob_models,
                                            x_val[keep_vars10]))
    test_preds_prob.append(pick_predict_prob(i,
                                            prob_models,
                                            x_test[keep_vars10]))

# Make array of all predictions
t_preds_array_prob = np.array(train_preds_prob).transpose()
v_preds_array_prob = np.array(val_preds_prob).transpose()
test_preds_array_prob = np.array(test_preds_prob).transpose()


These ensemble approaches work best when there's some heterogeneity in predictions across models. Let's take a look at how similar these models' predictions are:

In [2]:
unanimous = [True if sum(i)==0 or sum(i)==i.shape[0] else False \
            for i in t_preds_array]
m_agree = [i.sum()/i.shape[0] if i.sum()/i.shape[0] >.5 \
                   else 1-i.sum()/i.shape[0] for i in t_preds_array]
disc_agree = np.array(m_agree)[[False if i else True for i in unanimous]]

print('%.2f percent of training predictions were unanimous.' \
      %(np.mean(unanimous)*100))
print('Overall, there was %.2f percent agreement across models for' \
      %(np.mean(m_agree)*100) + \
      ' training data on average.')
print('Among discordant predictions, there was %.2f percent agreement' \
      %(np.mean(disc_agree)*100) + 
      ' across models for training data on average.')

# Same agreement metrics for the validation set
unanimous_v = [True if sum(i)==0 or sum(i)==i.shape[0] else False \
            for i in v_preds_array]
m_agree_v = [i.sum()/i.shape[0] if i.sum()/i.shape[0] >.5 \
                   else 1-i.sum()/i.shape[0] for i in v_preds_array]
disc_agree_v = np.array(m_agree_v)[[False if i else True for i in unanimous_v]]

print('\n\n%.2f percent of validation predictions were unanimous.' \
      %(np.mean(unanimous_v)*100))
print('Overall, there was %.2f percent agreement across models for' \
      %(np.mean(m_agree_v)*100) + \
      ' validation data on average.')
print('Among discordant predictions, there was %.2f percent agreement' \
      %(np.mean(disc_agree_v)*100) + 
      ' across models for validation data on average.')

74.16 percent of training predictions were unanimous.
Overall, there was 92.56 percent agreement across models for training data on average.
Among discordant predictions, there was 71.23 percent agreement across models for training data on average.


74.04 percent of validation predictions were unanimous.
Overall, there was 92.51 percent agreement across models for validation data on average.
Among discordant predictions, there was 71.13 percent agreement across models for validation data on average.
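
To make these agreement metrics concrete, here's a minimal vectorized sketch on a toy prediction matrix (rows are observations, columns are models). The toy array and the `agreement_stats` helper are hypothetical and not part of the original notebook; the sketch only assumes every prediction is coded 0/1.

```python
import numpy as np

def agreement_stats(preds):
    """Return (share unanimous, mean agreement, mean agreement among discordant rows)."""
    preds = np.asarray(preds, dtype=float)
    share_pos = preds.mean(axis=1)                 # fraction of models voting 1
    agree = np.maximum(share_pos, 1 - share_pos)   # share of models on the majority side
    unanimous = agree == 1.0                       # every model voted the same way
    disc = agree[~unanimous]                       # rows with at least one dissenter
    disc_mean = disc.mean() if disc.size else float('nan')
    return unanimous.mean(), agree.mean(), disc_mean

# Toy data: 4 observations x 4 models (hypothetical values)
toy = np.array([[1, 1, 1, 1],
                [0, 0, 0, 0],
                [1, 1, 1, 0],
                [1, 0, 1, 0]])
u, a, d = agreement_stats(toy)
print(u, a, d)  # 0.5 0.8125 0.625
```

Two of the four toy rows are unanimous, and the two discordant rows have 75% and 50% agreement, which is where the 0.625 comes from.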


Since the models largely agreed with each other and had similar performance, I don't expect to get much of a performance boost (if any), but I'll go ahead and do it anyway just to see what happens. 

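Next I'll build two stacked feature sets: one containing only the model predictions (probabilities and classes), and one that appends those predictions to the original training features. Each will be used to fit another XGBoost model below.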

In [3]:
# Make stacked array predictions
t_stack_p = np.concatenate([t_preds_array_prob,
                         t_preds_array],axis=1)
v_stack_p = np.concatenate([v_preds_array_prob,
                         v_preds_array],axis=1)
test_stack_p = np.concatenate([test_preds_array_prob,
                         test_preds_array],axis=1)


# Make stacked array of features and predictions
t_stack_fp = np.concatenate([x_train[keep_vars10],
                         t_preds_array_prob,
                         t_preds_array],axis=1)
v_stack_fp = np.concatenate([x_val[keep_vars10],
                         v_preds_array_prob,
                         v_preds_array],axis=1)
test_stack_fp = np.concatenate([x_test[keep_vars10],
                         test_preds_array_prob,
                         test_preds_array],axis=1)
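
As a quick sanity check on what the concatenation produces, here's a small sketch with random stand-in arrays. The sizes and variable names are hypothetical; in the post itself the prediction-only stack should hold 6 probability columns plus 8 class columns, assuming each model contributes one column.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features, n_models = 5, 3, 2   # hypothetical sizes

features = rng.normal(size=(n_rows, n_features))   # stand-in for x_train[keep_vars10]
probs = rng.uniform(size=(n_rows, n_models))       # stand-in for probability predictions
classes = (probs >= 0.5).astype(float)             # stand-in for class predictions

# Predictions only, then features + predictions, joined column-wise
stack_p = np.concatenate([probs, classes], axis=1)
stack_fp = np.concatenate([features, probs, classes], axis=1)

print(stack_p.shape, stack_fp.shape)  # (5, 4) (5, 7)
```

The row count never changes; only columns are added, so the stacked arrays stay aligned with `y_train`, `y_val`, and `y_test`.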

## Majority Vote

First I'll simply take the majority vote of all the models, then look at the classification report. I'm not tuning any parameters here, so there's no real need for the validation set (I'll compare all models using the test set in the next post), but I'll include it for comparison.

In [4]:
print('Train')
print(confusion_matrix(y_train,t_preds_array.mean(axis=1)>=.5))
print(classification_report(y_train,t_preds_array.mean(axis=1)>=.5))

print('Validation')
print(confusion_matrix(y_val,v_preds_array.mean(axis=1)>=.5))
print(classification_report(y_val,v_preds_array.mean(axis=1)>=.5))

Train
[[109322  48110]
 [ 28884  78929]]
              precision    recall  f1-score   support

         0.0       0.79      0.69      0.74    157432
         1.0       0.62      0.73      0.67    107813

   micro avg       0.71      0.71      0.71    265245
   macro avg       0.71      0.71      0.71    265245
weighted avg       0.72      0.71      0.71    265245

Validation
[[27092 12239]
 [ 7250 19731]]
              precision    recall  f1-score   support

         0.0       0.79      0.69      0.74     39331
         1.0       0.62      0.73      0.67     26981

   micro avg       0.71      0.71      0.71     66312
   macro avg       0.70      0.71      0.70     66312
weighted avg       0.72      0.71      0.71     66312
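
One detail worth noting: with 8 models, a 4-4 tie is possible, and the `>=.5` threshold above sends ties to the positive class. The toy votes below are hypothetical and just illustrate that rule next to a strict-majority alternative.

```python
import numpy as np

# Hypothetical class predictions: 3 observations x 8 models
votes = np.array([[1, 1, 1, 1, 1, 0, 0, 0],   # 5 of 8 vote positive
                  [1, 1, 1, 1, 0, 0, 0, 0],   # 4-4 tie
                  [1, 1, 1, 0, 0, 0, 0, 0]])  # 3 of 8 vote positive

share_pos = votes.mean(axis=1)
majority = share_pos >= 0.5   # same rule as the cell above: ties go to class 1
strict = share_pos > 0.5      # alternative: ties go to class 0

print(majority.astype(int))  # [1 1 0]
print(strict.astype(int))    # [1 0 0]
```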



## Predictions as features

I'll train a simple XGBoost model using predictions as features. Including both class and probability predictions is redundant, but that shouldn't hurt a tree-based model much, since it can simply skip splits on uninformative columns. I'll set scale_pos_weight to the ratio of negative to positive training cases and use early stopping (up to 1000 trees, stopping once validation log loss has gone 20 rounds without improving). Then I'll print the classification report.

In [5]:
pf = XGBClassifier(max_depth=3,n_estimators=1000,random_state=1234,
                  scale_pos_weight=sum(y_train==0)/sum(y_train==1))
pf.fit(t_stack_p,
              y_train,
        eval_metric=['error','logloss'],
        eval_set=[(t_stack_p,
                   y_train),
                  (v_stack_p,
                   y_val)],
       early_stopping_rounds=20,
             verbose=0)

print('Train')
print(confusion_matrix(y_train,
                       pf.predict(t_stack_p,
                                 ntree_limit=pf.best_ntree_limit)))
print(classification_report(y_train,
                            pf.predict(t_stack_p,
                                 ntree_limit=pf.best_ntree_limit)))

print('Validation')
print(confusion_matrix(y_val,
                       pf.predict(v_stack_p,
                                 ntree_limit=pf.best_ntree_limit)))
print(classification_report(y_val,
                            pf.predict(v_stack_p,
                                 ntree_limit=pf.best_ntree_limit)))

Train
[[104806  52626]
 [ 24871  82942]]
              precision    recall  f1-score   support

         0.0       0.81      0.67      0.73    157432
         1.0       0.61      0.77      0.68    107813

   micro avg       0.71      0.71      0.71    265245
   macro avg       0.71      0.72      0.71    265245
weighted avg       0.73      0.71      0.71    265245

Validation
[[25960 13371]
 [ 6320 20661]]
              precision    recall  f1-score   support

         0.0       0.80      0.66      0.73     39331
         1.0       0.61      0.77      0.68     26981

   micro avg       0.70      0.70      0.70     66312
   macro avg       0.71      0.71      0.70     66312
weighted avg       0.72      0.70      0.71     66312
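
For reference, the `scale_pos_weight` value used above is just the ratio of negative to positive training cases. The sketch below rebuilds it from the class counts shown in the training report; the synthetic label vector `y` is a stand-in for `y_train`.

```python
import numpy as np

# Class counts taken from the training classification report above
n_neg, n_pos = 157432, 107813
y = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])   # stand-in for y_train

# Same ratio as in the cell above, written with NumPy sums
scale_pos_weight = (y == 0).sum() / (y == 1).sum()
print(round(scale_pos_weight, 3))
```

That works out to roughly 1.46, so each positive case counts about one and a half times as much as a negative one during training. Also note that the fit-time `early_stopping_rounds` and `eval_metric` arguments and the `ntree_limit` argument used in this post reflect the XGBoost version I had installed; newer releases may not accept them, so the cells may need adjusting.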



## Training data + predictions as features

Finally, I'll train another XGBoost model using both the original training data and model predictions as features. As with the previous model, I'll specify scale_pos_weight and use early stopping (up to 1000 trees, stopping once validation log loss has gone 20 rounds without improving). Then I'll print the classification report.

In [6]:
fpf = XGBClassifier(max_depth=3,n_estimators=1000,random_state=1234,
                   scale_pos_weight=sum(y_train==0)/sum(y_train==1))
fpf.fit(t_stack_fp,
              y_train,
        eval_metric=['error','logloss'],
        eval_set=[(t_stack_fp,
                   y_train),
                  (v_stack_fp,
                   y_val)],
       early_stopping_rounds=20,
             verbose=0)

print('Train')
print(confusion_matrix(y_train,
                       fpf.predict(t_stack_fp,
                                 ntree_limit=fpf.best_ntree_limit)))
print(classification_report(y_train,
                            fpf.predict(t_stack_fp,
                                 ntree_limit=fpf.best_ntree_limit)))

print('Validation')
print(confusion_matrix(y_val,
                       fpf.predict(v_stack_fp,
                                 ntree_limit=fpf.best_ntree_limit)))
print(classification_report(y_val,
                            fpf.predict(v_stack_fp,
                                 ntree_limit=fpf.best_ntree_limit)))

Train
[[104810  52622]
 [ 24848  82965]]
              precision    recall  f1-score   support

         0.0       0.81      0.67      0.73    157432
         1.0       0.61      0.77      0.68    107813

   micro avg       0.71      0.71      0.71    265245
   macro avg       0.71      0.72      0.71    265245
weighted avg       0.73      0.71      0.71    265245

Validation
[[25960 13371]
 [ 6326 20655]]
              precision    recall  f1-score   support

         0.0       0.80      0.66      0.72     39331
         1.0       0.61      0.77      0.68     26981

   micro avg       0.70      0.70      0.70     66312
   macro avg       0.71      0.71      0.70     66312
weighted avg       0.72      0.70      0.71     66312



In the next and final post of this series, I'll compare the performance of all these models using the test data I set aside at the very beginning.

In [7]:
# Save Test Class Predictions
maj_vote = np.array(test_preds_array.mean(axis=1)>=.5).reshape(-1,1)
pf_predict = pf.predict(test_stack_p,
                        ntree_limit=pf.best_ntree_limit).reshape(-1,1)
fpf_predict = fpf.predict(test_stack_fp,
                          ntree_limit=fpf.best_ntree_limit).reshape(-1,1)

tc_array = np.concatenate([test_preds_array, maj_vote, 
                           pf_predict, fpf_predict],
              axis=1)

# Labels must follow the insertion order of `models` (balanced first)
test_class_pred = dict(zip(['XGB Balanced', 'XGB Unbalanced',
                           'SVM Balanced', 'SVM Unbalanced',
                           'Logistic Balanced', 'Logistic Unbalanced',
                           'ANN Balanced', 'ANN Unbalanced',
                           'Majority Vote', 'Prediction Stack',
                           'Features and Prediction Stack'],
                          tc_array.transpose()))

# Save Test Probability Predictions
maj_vote_proba = np.array(test_preds_array_prob.mean(axis=1)).reshape(-1,1)
pf_predict_proba = [i[1] for i in \
                    pf.predict_proba(test_stack_p, 
                                     ntree_limit=pf.best_ntree_limit)]
pf_predict_proba = np.array(pf_predict_proba).reshape(-1,1)

fpf_predict_proba = [i[1] for i in \
                     fpf.predict_proba(test_stack_fp,
                                       ntree_limit=fpf.best_ntree_limit)]
fpf_predict_proba = np.array(fpf_predict_proba).reshape(-1,1)

tp_array = np.concatenate([test_preds_array_prob, maj_vote_proba,
                           pf_predict_proba, fpf_predict_proba],
                          axis=1)

# Labels must follow the insertion order of `prob_models` (balanced first)
test_prob_pred = dict(zip(['XGB Balanced', 'XGB Unbalanced',
                           'Logistic Balanced', 'Logistic Unbalanced',
                           'ANN Balanced', 'ANN Unbalanced',
                           'Majority Vote', 'Prediction Stack',
                           'Features and Prediction Stack'],
                          tp_array.transpose()))

# Pickle for next post
pickle.dump(test_class_pred, open('test_class_pred.pickle', 'wb'))
pickle.dump(test_prob_pred, open('test_prob_pred.pickle', 'wb'))
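
Because the label lists have to line up with the column order of the prediction arrays, one option is to derive the labels from the model keys instead of typing them by hand. The sketch below does that on a toy array and round-trips the result through pickle; the `pretty` mapping, the toy data, and the temporary file name are all hypothetical.

```python
import os
import pickle
import tempfile
import numpy as np

# Hypothetical stand-ins: two "models" keyed like the dict in cell 1
keys = ['xgb_bal', 'xgb_unbal']
preds = np.array([[1, 0], [0, 0], [1, 1]])   # rows = observations, columns follow `keys`

# Build readable labels from the keys so names cannot drift from column order
pretty = {'xgb': 'XGB', 'svc': 'SVM', 'lr': 'Logistic', 'ann': 'ANN',
          'bal': 'Balanced', 'unbal': 'Unbalanced'}
labels = [' '.join(pretty[part] for part in k.split('_')) for k in keys]
pred_dict = dict(zip(labels, preds.transpose()))

# Round-trip through pickle, closing the file handles explicitly
path = os.path.join(tempfile.mkdtemp(), 'toy_pred.pickle')
with open(path, 'wb') as f:
    pickle.dump(pred_dict, f)
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(list(loaded))  # ['XGB Balanced', 'XGB Unbalanced']
```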