# Stacked ensemble attempt

In this notebook I use ML-Ensemble to create a stacked ensemble using classifiers from the work described in the "Final_Submission" notebook.  A requirements.txt file has also been added that should be compatible with this notebook.

Work towards the end of the notebook is unfinished.  At the last attempt I had successfully created a stacked ensemble and was still modifying its parameters to improve the results.

http://ml-ensemble.com/

## Setup

In [1]:
# Import libraries.
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer, LabelEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from xgboost.sklearn import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC

from mlens.ensemble import SuperLearner

import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning) 

[MLENS] backend: threading


In [2]:
# Import data files from Kaggle.
DATA_PATH = './data/extracted'
dfs = {}
for root, dirs, files in os.walk(DATA_PATH):
    for file in files:
        # Key each DataFrame by file name without extension, e.g. 'sessions'.
        dfs[file.split('.')[0]] = pd.read_csv(os.path.join(root, file))
        print(file)

age_gender_bkts.csv
countries.csv
sample_submission_NDF.csv
sessions.csv
test_users.csv
train_users_2.csv


## Feature Engineering

In [3]:
# utility function
def transformToDatetime(series_input):
    return pd.to_datetime(series_input,format='%Y%m%d%H%M%S', errors='coerce')

# Bucket ages into 5-year ranges prior to one-hot encoding.
def getAgeBucket(df_input):
    # Coerce non-numeric ages to NaN so they fall outside every bucket.
    z = pd.to_numeric(df_input.age, errors='coerce')
    return pd.cut(z,
                    [0, 4, 9, 14, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 69, 74, 79, 84, 89, 94, 99, 10000],
                    labels=['0-4', '5-9', '10-14','15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49',
                            '50-54', '55-59','60-64', '65-69','70-74','75-79','80-84','85-89','90-94','95-99','100+'],
                    include_lowest=True)

#Add month and year features
def getYearFirstActive(df_input):
    return pd.Series(transformToDatetime(df_input.timestamp_first_active).dt.year, index=df_input.index)

def getMonthFirstActive(df_input):
    return pd.Series(transformToDatetime(df_input.timestamp_first_active).dt.month, index=df_input.index)

def getSeason(df_input):
    month = transformToDatetime(df_input.timestamp_first_active).dt.month
    # Map month numbers to season names; unparseable timestamps stay NaN.
    month_to_season = {12: 'Winter', 1: 'Winter', 2: 'Winter',
                       3: 'Spring', 4: 'Spring', 5: 'Spring',
                       6: 'Summer', 7: 'Summer', 8: 'Summer',
                       9: 'Fall', 10: 'Fall', 11: 'Fall'}
    return pd.Series(month.map(month_to_season), index=df_input.index)

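def getSessionActivityCount(df_input):
    # Number of session rows per user, aligned to the user ids in df_input's index.
    # Users without any session rows get NaN.
    counts = dfs['sessions'].groupby('user_id').size()
    return counts.reindex(df_input.index)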


# Use this method to add features to datasets
def add_features(df_input):
    engineered_data = df_input
    engineered_data['age_bucket'] = getAgeBucket(engineered_data)
    engineered_data['first_active_year'] = getYearFirstActive(engineered_data)
    engineered_data['first_active_month'] = getMonthFirstActive(engineered_data)
    engineered_data['season'] = getSeason(engineered_data)
    engineered_data['activity_count'] = getSessionActivityCount(engineered_data)
    return engineered_data
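
To make the bucketing behaviour of `getAgeBucket` concrete, here is a minimal sketch of `pd.cut` on a few made-up ages. The sample values and the shortened bin list are illustrative assumptions, not taken from the dataset; the notebook itself uses 5-year buckets up to 99.

```python
import pandas as pd

# Made-up ages: boundary values, a non-numeric entry, and an implausible one.
ages = pd.Series([0, 4, 5, 37, "abc", 2014])

# Shortened bin list for illustration only.
bins = [0, 4, 9, 99, 10000]
labels = ["0-4", "5-9", "10-99", "100+"]

numeric_ages = pd.to_numeric(ages, errors="coerce")  # "abc" becomes NaN
buckets = pd.cut(numeric_ages, bins, labels=labels, include_lowest=True)

print(buckets.tolist())
```

Bins are closed on the right, so an age of 4 lands in `0-4` and an age of 5 in `5-9`; `include_lowest=True` keeps an age of 0 in the first bucket. Values that could not be parsed remain NaN and are later handled by the categorical imputer.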

## Preprocessing pipeline

- Data Imputation
  - Categorical: replace np.NaN with "unknown"
  - Numerical: replace np.NaN with mean
- Encoding
  - Categorical: OneHotEncoder - convert categorical to multiple binary columns
  - Numerical: StandardScaler - (value - mean) / standard deviation
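
The two branches listed above can be tried out on a toy frame. This is only a sketch: the two columns and their values are made up, and it uses the explicit `ColumnTransformer([(name, transformer, columns), ...])` form rather than the notebook's `make_column_transformer` call.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with one missing value in each column.
toy = pd.DataFrame({
    "gender": ["FEMALE", np.nan, "MALE", "FEMALE"],
    "activity_count": [10.0, np.nan, 30.0, 20.0],
})

preprocess = ColumnTransformer([
    # Categorical branch: NaN -> "unknown", then one binary column per category.
    ("cat", make_pipeline(
        SimpleImputer(strategy="constant", fill_value="unknown"),
        OneHotEncoder(handle_unknown="ignore"),
    ), ["gender"]),
    # Numerical branch: NaN -> column mean, then (value - mean) / std.
    ("num", make_pipeline(
        SimpleImputer(strategy="mean"),
        StandardScaler(),
    ), ["activity_count"]),
], remainder="drop")

encoded = preprocess.fit_transform(toy)
# The transformer may return a sparse matrix; densify for inspection.
encoded = encoded.toarray() if hasattr(encoded, "toarray") else encoded
print(encoded.shape)
```

The result has three one-hot columns (`FEMALE`, `MALE`, `unknown`) followed by one scaled numeric column. The imputed `activity_count` equals the column mean, so it scales to 0.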

In [4]:
# Create Preprocessor pipeline.

def create_preprocessor_pipeline():
    
    categorical_columns_to_process = [
        'gender',
        'signup_method',
        'signup_flow',
        'language',
        'affiliate_channel',
        'affiliate_provider',
        'first_affiliate_tracked',
        'signup_app',
        'first_device_type',
        'first_browser',
        'age_bucket',
        'season',
        'first_active_month',
        'first_active_year'
    ]
    
    numerical_columns_to_process = [
        'activity_count'
    ]
    
    # scikit-learn >= 0.20.1 expects (transformer, columns) tuples.
    return make_column_transformer(
        (make_pipeline(
             SimpleImputer(missing_values=np.nan, strategy='constant', fill_value="unknown"),
             OneHotEncoder(handle_unknown='ignore')
         ),
         categorical_columns_to_process
        ),
        (make_pipeline(
             SimpleImputer(missing_values=np.nan, strategy='mean'),
             StandardScaler()
         ),
         numerical_columns_to_process
        ),
        remainder='drop'
    )

preprocessor = create_preprocessor_pipeline()

## Prepare Data

- Apply feature engineering
- Split data into training and development sets
- Create a balanced training set to address imbalanced source data
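
The balancing step in the last bullet amounts to drawing the same number of rows from every class, with replacement. A minimal sketch on made-up labels follows; the toy frame, the `dest` column values, and the seeded generator are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the draw is repeatable

# Made-up imbalanced labels: class 0 dominates.
toy = pd.DataFrame({
    "feature": np.arange(10),
    "dest": [0, 0, 0, 0, 0, 0, 0, 1, 1, 2],
})

classes = np.unique(toy["dest"])
picks = round(len(toy) / len(classes))  # rows to draw per class

parts = []
for cls in classes:
    candidates = toy.index[toy["dest"] == cls]
    # Sampling with replacement lets rare classes reach `picks` rows.
    chosen = rng.choice(candidates, size=picks, replace=True)
    parts.append(toy.loc[chosen])

balanced = pd.concat(parts)
print(balanced["dest"].value_counts().to_dict())
```

Rare classes end up with duplicated rows, and the majority class is subsampled, so the balanced set no longer reflects the class frequencies seen in the development data.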

In [5]:
# add features to training and test data: stateful data mutations (ew)
# attempted to use pipelines, column_transformer, function_transformer, feature_union
# the features can't be arbitrarily included since function_transformer and column_transformer return ndarrays instead of DataFrames

all_train_data = add_features(dfs["train_users_2"].iloc[:, 0:-1].copy().set_index('id'))
all_train_labels = dfs["train_users_2"].iloc[:, -1:]
all_test_data = add_features(dfs["test_users"].copy().set_index('id'))

In [6]:
test_size = .5

le = LabelEncoder()
encoded_labels = le.fit_transform(all_train_labels.values.ravel())

# Use (train_test_split) to randomize train_users_2 before splitting into train/dev.
train_data, dev_data, train_labels, dev_labels = train_test_split(all_train_data, encoded_labels, test_size=test_size, random_state=42)

# Final test data for Kaggle submission.
test_data = all_test_data

In [7]:
def create_balanced_training_set():
    countries = np.unique(encoded_labels)
    z = train_data.copy()
    z['dest'] = pd.Series(train_labels, index=z.index)
    # Draw the same number of rows for every destination class.
    picks = round(len(train_data)/len(countries))
    rx = []
    for destination in countries:
        options = z[z['dest'] == destination].index
        # np.random.choice samples with replacement, so rare classes are oversampled.
        ff = np.random.choice(options, picks)
        rx.append(z.loc[ff])

    balanced_data = pd.concat(rx)
    return balanced_data.iloc[:, 0:-1], balanced_data.iloc[:, -1:].values.ravel()

balanced_train_data, balanced_train_labels = create_balanced_training_set()

## Plain XGBoost baseline

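Sets up a basic grid search over the XGBoost booster type. Note that the grid search object (`xgb_gs`) is defined but never fitted: the pipeline below trains a default `XGBClassifier`, so the reported results are for the default configuration.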
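
As a sketch of how a grid search could run as part of the pipeline, the search object can itself be the final pipeline step. The example uses synthetic data and `LogisticRegression` with a made-up `C` grid as a stand-in, so that it runs without xgboost. With xgboost installed, the estimator and grid from the cell below could be substituted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the preprocessed user features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Stand-in estimator and grid (assumption for illustration).
params = {"C": [0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), params,
                      cv=3, scoring="f1_weighted")

# The search object is the last step, so fitting the pipeline runs the
# grid search on the output of the preprocessing step.
pipeline = make_pipeline(StandardScaler(), search)
pipeline.fit(X, y)

print(search.best_params_)
predictions = pipeline.predict(X)  # uses the refitted best estimator
```

One design caveat: with this layout the preprocessing step is fitted once on all training rows before cross-validation. Putting the whole pipeline inside `GridSearchCV` instead would refit the preprocessing within each fold.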

In [8]:
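# Candidate boosters for a grid search. xgb_gs is defined but never fitted;
# the pipeline below uses the default XGBClassifier directly.
params = {'booster': ['gbtree', 'gblinear', 'dart']}
xgb = XGBClassifier(nthread=-1)
xgb_gs = GridSearchCV(xgb, params, cv=3, scoring='f1_weighted', n_jobs=-1)
pipeline = make_pipeline(preprocessor, xgb)
pipeline.fit(balanced_train_data, balanced_train_labels)
dev_pred = pipeline.predict(dev_data)
accuracy = accuracy_score(dev_pred, dev_labels)
print('Accuracy: ', accuracy)
# classification_report expects (y_true, y_pred). The arguments are reversed here,
# so in the output below the precision and recall columns are swapped and
# "support" counts predictions rather than true labels.
print(classification_report(dev_pred, dev_labels))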

Accuracy:  0.33545715195922265
              precision    recall  f1-score   support

           0       0.19      0.01      0.01     10629
           1       0.09      0.01      0.02      6198
           2       0.18      0.01      0.02      9790
           3       0.09      0.03      0.04      3941
           4       0.06      0.04      0.05      3755
           5       0.04      0.02      0.03      2307
           6       0.10      0.03      0.04      5943
           7       0.52      0.80      0.63     40541
           8       0.09      0.01      0.01      5305
           9       0.08      0.00      0.00      9003
          10       0.07      0.46      0.13      4939
          11       0.07      0.07      0.07      4375

   micro avg       0.34      0.34      0.34    106726
   macro avg       0.13      0.12      0.09    106726
weighted avg       0.27      0.34      0.26    106726
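
The report above comes from `classification_report(dev_pred, dev_labels)`, while scikit-learn's metric functions take their arguments as `(y_true, y_pred)`. Accuracy is unaffected by the order, but precision and recall are not. A small sketch with made-up binary labels shows the effect:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: one true positive, two false positives.
y_true = [0, 0, 0, 1]
y_pred = [0, 1, 1, 1]

# Documented order: (y_true, y_pred).
precision = precision_score(y_true, y_pred)  # 1 of 3 predicted positives is correct
recall = recall_score(y_true, y_pred)        # the single actual positive is found

# Reversed order: the two metrics trade places.
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)

print(precision, recall)
print(swapped_precision, swapped_recall)
```

Read with this in mind, the `precision` column in the report is the per-class recall, the `recall` column is the per-class precision, and `support` is the number of predictions per class.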



## Stacking Ensemble

Currently configured to use the balanced dataset.

- Layer 1: RandomForestClassifier, XGBClassifier
- Layer 2: LogisticRegression

- Ensemble #1: does not propagate the original features or the class probabilities from the first layer to the second. This is a proof of concept to get the stacking ensemble working. The logistic regression in the second layer only has the first layer's predicted classes to work with, so we don't expect it to do any better than a voting classifier.
- Ensemble #2: the same as the first, except that it propagates everything forward (original data features, predicted classes, and prediction probabilities).
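
The difference between the two configurations can be illustrated with scikit-learn's `StackingClassifier`. This is a swapped-in technique, not the ML-Ensemble `SuperLearner` used in this notebook. The synthetic data, the small estimator sizes, and `GradientBoostingClassifier` as a stand-in for XGBoost are all assumptions for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data with 8 features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

def build_stack(stack_method, passthrough):
    # Base learners stand in for the RandomForest + XGBoost first layer.
    base = [
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=20, random_state=0)),
    ]
    return StackingClassifier(
        estimators=base,
        final_estimator=LogisticRegression(max_iter=1000),
        stack_method=stack_method,  # what each base learner hands forward
        passthrough=passthrough,    # whether original features are appended
        cv=3,
    )

# Rough analogue of Ensemble #1: the meta learner sees predicted classes only.
stack_classes = build_stack("predict", passthrough=False).fit(X, y)
# Rough analogue of Ensemble #2: class probabilities plus original features.
stack_full = build_stack("predict_proba", passthrough=True).fit(X, y)

# transform() returns the matrix that the meta learner is trained on.
print(stack_classes.transform(X).shape)
print(stack_full.transform(X).shape)
```

In the first case the logistic regression gets one column per base learner. In the second it gets three probability columns per base learner plus the eight original features, which gives it far more to work with.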

In [9]:
def stacking(propagate=None, proba=None):
    seed = 142
    
    # --- Build ---
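    # Passing a scoring function will create cv scores during fitting.
    # The scorer should be a simple function accepting two vectors and returning a scalar.
    # With proba=True the layer outputs class probabilities, which accuracy_score
    # cannot score; that is the source of the ValueError messages in the second run.
    ensemble = SuperLearner(scorer=accuracy_score, random_state=seed, verbose=2)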
    pf = False
    if propagate:
        pf = list(range(preprocessor.fit_transform(balanced_train_data).shape[1]))
    # Build the first layer
    ensemble.add([
        RandomForestClassifier(random_state=seed,n_estimators=300, max_depth=10),
        XGBClassifier()
    ],
        #preprocessing=[preprocessor],
        propagate_features=pf,
        proba=proba)

    # Attach the final meta estimator
    ensemble.add_meta(LogisticRegression())

    # --- Use ---

    # Fit ensemble
    pipeline = make_pipeline(preprocessor, ensemble)
    pipeline.fit(balanced_train_data, balanced_train_labels)
    
    dev_pred = pipeline.predict(dev_data)
    accuracy = accuracy_score(dev_pred, dev_labels)

    #ensemble.fit(train_data, train_labels)

    # Predict
    #preds = ensemble.predict(dev_data)
    print('Accuracy: ',accuracy)
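    # classification_report expects (y_true, y_pred); with the arguments reversed,
    # the precision and recall columns in the printed report are swapped.
    print(classification_report(dev_pred, dev_labels))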
    print("Fit data:\n%r" % ensemble.data)
    #print("Prediction score: %.3f" % accuracy_score(preds, dev_labels))

stacking(False,False)
stacking(True,True)


Fitting 2 layers
Processing layer-1             done | 00:02:05
Processing layer-2             done | 00:00:01
Fit complete                        | 00:02:06

Predicting 2 layers




Processing layer-1             done | 00:00:07
Processing layer-2             done | 00:00:00
Predict complete                    | 00:00:07
Accuracy:  0.11075089481475929
              precision    recall  f1-score   support

         0.0       0.11      0.00      0.01      8830
         1.0       0.00      0.00      0.00         0
         2.0       0.06      0.00      0.01      6452
         3.0       0.21      0.01      0.01     40785
         4.0       0.00      0.00      0.00         0
         5.0       0.11      0.01      0.03      9463
         6.0       0.06      0.03      0.04      3586
         7.0       0.09      0.47      0.16     12326
         8.0       0.00      0.00      0.00         0
         9.0       0.11      0.00      0.00     11408
        10.0       0.18      0.40      0.24     13876
        11.0       0.00      0.00      0.00         0

   micro avg       0.11      0.11      0.11    106726
   macro avg       0.08      0.08      0.04    106726
weighted avg    

  'recall', 'true', average, warn_for)



Fitting 2 layers
ValueError("Classification metrics can't handle a mix of multiclass and continuous-multioutput targets")
ValueError("Classification metrics can't handle a mix of multiclass and continuous-multioutput targets")
ValueError("Classification metrics can't handle a mix of multiclass and continuous-multioutput targets")
ValueError("Classification metrics can't handle a mix of multiclass and continuous-multioutput targets")

Processing layer-1             done | 00:02:08
Processing layer-2             done | 00:00:45
Fit complete                        | 00:02:55

Predicting 2 layers




Processing layer-1             done | 00:00:07
Processing layer-2             done | 00:00:00
Predict complete                    | 00:00:09
Accuracy:  0.33034124768097745
              precision    recall  f1-score   support

         0.0       0.05      0.00      0.01      3425
         1.0       0.02      0.01      0.01      1561
         2.0       0.06      0.01      0.02      3168
         3.0       0.02      0.01      0.02      1207
         4.0       0.01      0.05      0.02       604
         5.0       0.02      0.02      0.02      1376
         6.0       0.12      0.03      0.04      6516
         7.0       0.49      0.80      0.61     38102
         8.0       0.21      0.01      0.01     10988
         9.0       0.28      0.00      0.00     25950
        10.0       0.13      0.47      0.20      8647
        11.0       0.09      0.08      0.09      5182

   micro avg       0.33      0.33      0.33    106726
   macro avg       0.13      0.12      0.09    106726
weighted avg    