## Stacking Ensemble

Stacking combines multiple algorithms to make a prediction, much like bagging and boosting. The biggest difference is that stacking uses the predictions of one set of models as the input for the next prediction.  
The predictions of each algorithm are gathered into a final metadata set, and another ML algorithm (the meta model) is then trained on that metadata set to make the final prediction.  
A stacking model needs two types of models: individual base models and a final meta model.   
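
For comparison with the manual approach in this notebook, scikit-learn also provides `StackingClassifier`, which wires base models and a meta model together. The following is only a minimal sketch: the choice of two base models, `cv=5`, and `random_state=0` are assumptions made for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
# random_state=0 is an assumption so that the sketch is repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Individual base models: their predictions become the features of the meta model
base_models = [
    ("knn", KNeighborsClassifier(n_neighbors=4)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# Final meta model: trained on cross-validated predictions of the base models
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(C=10),
    cv=5,
)
stack.fit(X_tr, y_tr)

stack_acc = stack.score(X_te, y_te)
print("Accuracy of StackingClassifier: {:.4f}".format(stack_acc))
```

The two-level structure is the same as described above: the list passed as `estimators` holds the base models, and `final_estimator` is the meta model.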

### Basic Stacking Model

In [2]:
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

cancer_data = load_breast_cancer()

X_features = cancer_data.data
y_target = cancer_data.target

X_train, X_test, y_train, y_test = train_test_split(X_features, y_target, test_size=0.2)

In [13]:
X_train.shape

(455, 30)

In [12]:
y_train.shape

(455,)

In [14]:
# Individual Base Models
KNN = KNeighborsClassifier(n_neighbors=4)
RF = RandomForestClassifier(n_estimators=100)
DT = DecisionTreeClassifier()
AdaBoost = AdaBoostClassifier(n_estimators=100)

KNN.fit(X_train, y_train)
RF.fit(X_train, y_train)
DT.fit(X_train, y_train)
AdaBoost.fit(X_train, y_train)

KNN_pred = KNN.predict(X_test)
RF_pred = RF.predict(X_test)
DT_pred = DT.predict(X_test)
AdaBoost_pred = AdaBoost.predict(X_test)

KNN_acc = accuracy_score(y_test, KNN_pred)
RF_acc = accuracy_score(y_test, RF_pred)
DT_acc = accuracy_score(y_test, DT_pred)
AdaBoost_acc = accuracy_score(y_test, AdaBoost_pred)

print('Accuracy of KNN model: {}'.format(KNN_acc))
print('Accuracy of RandomForest model: {}'.format(RF_acc))
print('Accuracy of DecisionTree model: {}'.format(DT_acc))
print('Accuracy of AdaBoost model: {}'.format(AdaBoost_acc))


Accuracy of KNN model: 0.9210526315789473
Accuracy of RandomForest model: 0.9736842105263158
Accuracy of DecisionTree model: 0.9385964912280702
Accuracy of AdaBoost model: 0.9736842105263158


In [15]:
# Build the prediction matrix: one row per base model
pred_M = np.array([KNN_pred, RF_pred, DT_pred, AdaBoost_pred])
print(pred_M.shape)

# Transpose so that rows are samples and columns are base-model predictions
pred = np.transpose(pred_M)
print(pred.shape)

(4, 114)
(114, 4)
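
The stack-then-transpose step can also be written with `np.column_stack`, which builds the `(n_samples, n_models)` matrix directly. The sketch below uses small made-up prediction vectors (`pred_a`, `pred_b`, `pred_c` are hypothetical names, not outputs of the models above) to show that both routes give the same matrix.

```python
import numpy as np

# Hypothetical toy predictions from three base models on four samples
pred_a = np.array([0, 1, 1, 0])
pred_b = np.array([0, 1, 0, 0])
pred_c = np.array([1, 1, 1, 0])

# Route 1: stack as rows, then transpose
# shape goes from (n_models, n_samples) to (n_samples, n_models)
via_transpose = np.transpose(np.array([pred_a, pred_b, pred_c]))

# Route 2: np.column_stack places each 1-D array in its own column
via_column_stack = np.column_stack([pred_a, pred_b, pred_c])

print(via_column_stack.shape)                            # (4, 3)
print(np.array_equal(via_transpose, via_column_stack))   # True
```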


In [17]:
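# Meta Model using LogisticRegression
# Note: the meta model is fit on the test-set predictions and y_test and then scored
# on the same data, so this accuracy is optimistic; it only illustrates the mechanics.
LR_final = LogisticRegression(C=10)
LR_final.fit(pred, y_test)
final_pred = LR_final.predict(pred)
final_acc = accuracy_score(y_test, final_pred)

print('Accuracy of Final Meta Model: {:.4f}'.format(final_acc))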

Accuracy of Final Meta Model: 0.9649
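
Because the meta model above is trained and evaluated on the same test-set predictions, its score does not measure generalization. One simple way around this, sketched here under assumptions, is to hold out part of the training data for fitting the meta model. The names `X_fit`, `X_val`, `meta_train`, `meta_test`, the 25% validation share, and `random_state=0` are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Split the training data again: one part fits the base models,
# the other part (validation) fits the meta model
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

base_models = [
    KNeighborsClassifier(n_neighbors=4),
    RandomForestClassifier(n_estimators=100, random_state=0),
]
for model in base_models:
    model.fit(X_fit, y_fit)

# Meta features: one column of predictions per base model
meta_train = np.column_stack([m.predict(X_val) for m in base_models])
meta_test = np.column_stack([m.predict(X_test) for m in base_models])

# The meta model never sees y_test during training
meta_model = LogisticRegression(C=10)
meta_model.fit(meta_train, y_val)

holdout_acc = accuracy_score(y_test, meta_model.predict(meta_test))
print('Accuracy of hold-out meta model: {:.4f}'.format(holdout_acc))
```

The cost of this approach is that the base models see less training data, which is the motivation for the cross-validation variant in the next section.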


### CV-Based Stacking

In [22]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

# For one base model, build the train/test features that the meta model will use
def get_stacking_base_datasets(model, X_train, y_train, X_test, n_folds):
    KF = KFold(n_splits=n_folds, shuffle=False)

    # Initialize the arrays that will hold the meta-model data
    train_fold_pred = np.zeros((X_train.shape[0], 1))   # out-of-fold predictions on the training data
    test_pred = np.zeros((X_test.shape[0], n_folds))    # test predictions, one column per fold
    print(model.__class__.__name__, 'model starts...')

    # Start CV
    for fold_counter, (train_index, valid_index) in enumerate(KF.split(X_train)):
        print('#{} fold set'.format(fold_counter))
        X_tr = X_train[train_index]     # training features for this fold
        y_tr = y_train[train_index]     # training labels for this fold
        X_va = X_train[valid_index]     # validation features for this fold

        model.fit(X_tr, y_tr)           # fit on the training part of the fold
        train_fold_pred[valid_index, :] = model.predict(X_va).reshape(-1,1) # store predictions for the validation rows
        test_pred[:, fold_counter] = model.predict(X_test)                  # store this fold's test predictions

    test_pred_mean = np.mean(test_pred, axis=1).reshape(-1,1)               # average the test predictions over folds

    return train_fold_pred, test_pred_mean
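
The training half of what `get_stacking_base_datasets` returns (the out-of-fold predictions) can also be obtained with scikit-learn's `cross_val_predict`. This is a minimal sketch with a single decision tree; `oof_pred` and `oof_feature` are hypothetical names, and `random_state=0` is an assumption for repeatability. Note that `cross_val_predict` does not produce the fold-averaged test predictions, so it replaces only part of the function above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Same fold layout as in the function above: 7 folds, no shuffling
kf = KFold(n_splits=7, shuffle=False)

# Each sample is predicted by a model that did not see it during fitting
oof_pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=kf)

# Reshape to a single column so it can be concatenated with other base models
oof_feature = oof_pred.reshape(-1, 1)
print(oof_feature.shape)
```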

In [23]:
KNN = KNeighborsClassifier(n_neighbors=4)
RF = RandomForestClassifier(n_estimators=100)
DT = DecisionTreeClassifier()
AdaBoost = AdaBoostClassifier(n_estimators=100)

knn_train, knn_test = get_stacking_base_datasets(KNN, X_train, y_train, X_test, 7)
rf_train, rf_test = get_stacking_base_datasets(RF, X_train, y_train, X_test, 7)
dt_train, dt_test = get_stacking_base_datasets(DT, X_train, y_train, X_test, 7)
ada_train, ada_test = get_stacking_base_datasets(AdaBoost, X_train, y_train, X_test, 7)

KNeighborsClassifier model starts...
#0 fold set
#1 fold set
#2 fold set
#3 fold set
#4 fold set
#5 fold set
#6 fold set
RandomForestClassifier model starts...
#0 fold set
#1 fold set
#2 fold set
#3 fold set
#4 fold set
#5 fold set
#6 fold set
DecisionTreeClassifier model starts...
#0 fold set
#1 fold set
#2 fold set
#3 fold set
#4 fold set
#5 fold set
#6 fold set
AdaBoostClassifier model starts...
#0 fold set
#1 fold set
#2 fold set
#3 fold set
#4 fold set
#5 fold set
#6 fold set


In [24]:
# np.concatenate with axis=1 joins the per-model prediction columns side by side
Final_X_train = np.concatenate((knn_train, rf_train, dt_train, ada_train), axis=1)
Final_X_test = np.concatenate((knn_test, rf_test, dt_test, ada_test), axis=1)
print('Shape of original train data:',X_train.shape, 'Shape of original test data:',X_test.shape)
print('Shape of stacking train data:', Final_X_train.shape, 'Shape of stacking test data:',Final_X_test.shape)

Shape of original train data: (455, 30) Shape of original test data: (114, 30)
Shape of stacking train data: (455, 4) Shape of stacking test data: (114, 4)


In [27]:
Stacking_final_LR = LogisticRegression(C=10)
Stacking_final_LR.fit(Final_X_train, y_train)
final_pred = Stacking_final_LR.predict(Final_X_test)
final_pred_proba = Stacking_final_LR.predict_proba(Final_X_test)[:,1]

# get_clf_eval is a helper from a separate `evaluation` module (not shown here)
from evaluation import get_clf_eval
print('Final meta model evaluation')
get_clf_eval(y_test, final_pred, final_pred_proba)   # prints the metrics itself and returns None

Final meta model evaluation
Confusion Matrix:
[[43  1]
 [ 4 66]]
Accuracy: 0.9561, Precision: 0.9851, Recall: 0.9429, f1_score: 0.9635, AUC: 0.9945
