# [Chapter 13] The final online shopping purchase intent predictions

## **[DSLC stages]**: Analysis


In this document, you will find the PCS workflow and code for identifying the "single best" fit and computing an ensemble fit for the online shopping prediction project.


The following code sets up the libraries and creates cleaned and pre-processed training, validation and test data that we will use in this document.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics 
from joblib import Parallel, delayed
from itertools import product
import matplotlib.pyplot as plt


# define all of the objects we need by running the preparation script
%run functions/prepare_shopping_data.py

pd.set_option('display.max_columns', None)
pd.options.display.max_colwidth = 500
pd.options.display.max_rows = 100

In [2]:
# look at all variables defined in our space
%who

LinearRegression	 LogisticRegression	 Parallel	 RandomForestClassifier	 delayed	 ff	 go	 metrics	 np	 
pd	 plt	 preprocess_shopping_data	 product	 px	 shopping_orig	 shopping_test	 shopping_test_preprocessed	 shopping_train	 
shopping_train_preprocessed	 shopping_train_preprocessed_nodummy	 shopping_val	 shopping_val_preprocessed	 



In this document we will demonstrate how to use the principles of PCS to choose the final prediction. We will demonstrate two different formats of the final prediction based on:

1. The **single "best"** predictive algorithm, in terms of validation set performance from among a range of different algorithms each trained on several different cleaning/pre-processing judgment call-perturbed versions of the training dataset.

1. An **ensemble** prediction, which combines the predictions from a range of predictive fits from across different algorithms and cleaning/pre-processing judgment call-perturbations that pass a predictability screening test.

Note that we do not compute PCS **prediction perturbation intervals** for binary response predictions, because although we can compute intervals for our class probability predictions, we cannot calibrate them (and so our interpretation will necessarily be heavily influenced by the number of perturbations that we consider).


## Computing the perturbed predictions

Since both approaches rely on the cleaning/pre-processing judgment call-perturbed versions of the training (and validation) datasets that we used in our stability analyses, we will create these perturbed datasets and fit the algorithms here.


### Create the perturbed datasets

First, let's create the DataFrame containing the cleaning/pre-processing judgment call options that define each perturbed dataset:

In [3]:
perturb_options = list(product([True, False], 
                               [True, False],
                               [True, False],
                               [True, False]))
perturb_options = pd.DataFrame(perturb_options, columns=('numeric_to_cat', 
                                                         'month_numeric',
                                                         'log_page',
                                                         'remove_extreme'))
perturb_options

Unnamed: 0,numeric_to_cat,month_numeric,log_page,remove_extreme
0,True,True,True,True
1,True,True,True,False
2,True,True,False,True
3,True,True,False,False
4,True,False,True,True
5,True,False,True,False
6,True,False,False,True
7,True,False,False,False
8,False,True,True,True
9,False,True,True,False
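
The output above lists only the first 10 rows. Four binary judgment calls give 2^4 = 16 combinations, and with three algorithms that makes 48 fits in total. A minimal check that rebuilds the same grid with `itertools.product` (the names `jc_names`, `grid`, `n_combinations` and `n_fits` are introduced here only for illustration):

```python
from itertools import product

import pandas as pd

# the four binary judgment calls used in this document
jc_names = ['numeric_to_cat', 'month_numeric', 'log_page', 'remove_extreme']

# every True/False combination of the four judgment calls
grid = pd.DataFrame(list(product([True, False], repeat=len(jc_names))),
                    columns=jc_names)

n_combinations = grid.shape[0]
n_fits = n_combinations * 3  # three algorithms: LS, logistic regression, RF
print(n_combinations, n_fits)
```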



Then we will create a version of the pre-processed dataset for each combination of judgment call options. Note that for the data perturbations (which were all based on the "default" pre-processed training dataset), we could use the "default" validation dataset. Here, by contrast, we need to explicitly create perturbed versions of the pre-processed validation (and test) data to match each perturbed version of the pre-processed training data.

In [4]:
# conduct judgment call perturbations of training data
shopping_jc_perturb = [preprocess_shopping_data(shopping_train,
                                                numeric_to_cat=perturb_options['numeric_to_cat'][i],
                                                month_numeric=perturb_options['month_numeric'][i],
                                                log_page=perturb_options['log_page'][i],
                                                remove_extreme=perturb_options['remove_extreme'][i])
                       for i in range(perturb_options.shape[0])]

# create a version of each perturbed dataset without dummy variables so that we can
# ensure that the same unique values of each categorical variable are present in the
# validation set
shopping_jc_perturb_nodummy = [preprocess_shopping_data(shopping_train,
                                                        dummy=False,
                                                        numeric_to_cat=perturb_options['numeric_to_cat'][i],
                                                        month_numeric=perturb_options['month_numeric'][i],
                                                        log_page=perturb_options['log_page'][i],
                                                        remove_extreme=perturb_options['remove_extreme'][i])
                               for i in range(perturb_options.shape[0])]

# conduct judgment call perturbations of validation data (we need to make sure each 
# validation set is compatible with the relevant training set)
shopping_val_jc_perturb = []
for i in range(perturb_options.shape[0]):
    
    # create preprocessed validation set
    shopping_val_jc_perturb.append(
        preprocess_shopping_data(shopping_val,
                                 numeric_to_cat=perturb_options['numeric_to_cat'][i],
                                 month_numeric=perturb_options['month_numeric'][i],
                                 log_page=perturb_options['log_page'][i],
                                 # note that since we want to ensure that all 
                                 # fits are compared on the same set of data 
                                 # points, we need to remove the extreme data 
                                 # points for all perturbations
                                 remove_extreme=True,
                                 # make sure val set matches training set
                                 column_selection=list(shopping_jc_perturb[i].columns),
                                 operating_systems_levels=shopping_jc_perturb_nodummy[i]['operating_systems'].unique(),
                                 browser_levels=shopping_jc_perturb_nodummy[i]['browser'].unique(),
                                 traffic_type_levels=shopping_jc_perturb_nodummy[i]['traffic_type'].unique())
    )



# conduct judgment call perturbations of test data (we need to make sure each test set is 
# compatible with the relevant training set)
shopping_test_jc_perturb = []
for i in range(perturb_options.shape[0]):
    
    # create preprocessed test set
    shopping_test_jc_perturb.append(
        preprocess_shopping_data(shopping_test,
                                 numeric_to_cat=perturb_options['numeric_to_cat'][i],
                                 month_numeric=perturb_options['month_numeric'][i],
                                 log_page=perturb_options['log_page'][i],
                                 # note that since we want to ensure that all 
                                 # fits are compared on the same set of data 
                                 # points, we need to remove the extreme data 
                                 # points for all perturbations
                                 remove_extreme=True,
                                 # make sure test set matches training set
                                 column_selection=list(shopping_jc_perturb[i].columns),
                                 operating_systems_levels=shopping_jc_perturb_nodummy[i]['operating_systems'].unique(),
                                 browser_levels=shopping_jc_perturb_nodummy[i]['browser'].unique(),
                                 traffic_type_levels=shopping_jc_perturb_nodummy[i]['traffic_type'].unique())
    )
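
The `column_selection` and `*_levels` arguments are there so that each validation and test set ends up with exactly the same columns as its matching training set. The internals of `preprocess_shopping_data()` are not shown in this document, so the following is only a sketch of the general idea using plain pandas on made-up data: one-hot encode both sets, then reindex the validation columns to the training columns.

```python
import pandas as pd

# toy data (hypothetical): the validation set is missing browser 'c'
# and contains a browser 'd' that never appears in the training set
train = pd.DataFrame({'browser': ['a', 'b', 'c'], 'duration': [1.0, 2.0, 3.0]})
val = pd.DataFrame({'browser': ['a', 'd'], 'duration': [4.0, 5.0]})

train_dummy = pd.get_dummies(train, columns=['browser'])
val_dummy = pd.get_dummies(val, columns=['browser'])

# force the validation columns to match the training columns:
# missing dummy columns are filled with 0, unseen ones are dropped
val_aligned = val_dummy.reindex(columns=train_dummy.columns, fill_value=0)
print(list(val_aligned.columns))
```

Without this kind of alignment, a fitted scikit-learn model would be given a feature matrix whose columns differ from the ones it was trained on.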



### Fitting the algorithms to each perturbed dataset



Let's fit the LS (applied to a binary response problem), logistic regression, and RF algorithms, each trained using each judgment call-perturbed version of the training data.

In [5]:
def fit_models(df, standardize=False):
    
    # if specified, standardize the predictive features
    df_x = df.drop(columns='purchase')
    if standardize:
        df_x = (df_x - df_x.mean()) / df_x.std()
        
    ls = LinearRegression().fit(X=df_x, y=df['purchase'])
    lr = LogisticRegression().fit(X=df_x, y=df['purchase'])
    rf = RandomForestClassifier().fit(X=df_x, y=df['purchase'])
    
    return (ls, lr, rf)
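
`LogisticRegression()` uses an iterative solver with a default limit of 100 iterations, which may not be enough when the features are on very different scales. One option (an alternative to the `standardize` argument above, and not what this document runs) is to wrap a scaler and the model in a scikit-learn pipeline, so that the scaling learned from the training data is re-applied automatically at prediction time. A sketch on synthetic data, where `max_iter=1000` is an arbitrary choice:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for a pre-processed training set
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X[:, 0] *= 1000  # put one feature on a much larger scale

# the scaler is fit on the training data and re-applied inside predict_proba
lr_pipeline = make_pipeline(StandardScaler(),
                            LogisticRegression(max_iter=1000))
lr_pipeline.fit(X, y)

# predicted probability of the positive class
pred = lr_pipeline.predict_proba(X)[:, 1]
print(pred[:5].round(3))
```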

In [6]:
results_jc_perturbed = Parallel(n_jobs=-1)(delayed(fit_models)(df) for df in shopping_jc_perturb)
ls_jc_perturbed, lr_jc_perturbed, rf_jc_perturbed = zip(*results_jc_perturbed)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

We can then generate purchase predictions for each session in the validation set (and the test set) using each perturbed LS, logistic regression, and RF fit.

In [7]:
# compute the predictions on the validation set for each perturbed LS, LR, and RF fit
ls_val_jc_pred_perturbed = [ls_jc_perturbed[i].predict(X=shopping_val_jc_perturb[i].drop(columns='purchase')) for i in range(len(ls_jc_perturbed))]
lr_val_jc_pred_perturbed = [lr_jc_perturbed[i].predict_proba(X=shopping_val_jc_perturb[i].drop(columns='purchase'))[:,1] for i in range(len(lr_jc_perturbed))]
rf_val_jc_pred_perturbed = [rf_jc_perturbed[i].predict_proba(X=shopping_val_jc_perturb[i].drop(columns='purchase'))[:,1] for i in range(len(rf_jc_perturbed))]

# compute the predictions on the test set for each perturbed LS, LR, and RF fit
ls_test_jc_pred_perturbed = [ls_jc_perturbed[i].predict(X=shopping_test_jc_perturb[i].drop(columns='purchase')) for i in range(len(ls_jc_perturbed))]
lr_test_jc_pred_perturbed = [lr_jc_perturbed[i].predict_proba(X=shopping_test_jc_perturb[i].drop(columns='purchase'))[:,1] for i in range(len(lr_jc_perturbed))]
rf_test_jc_pred_perturbed = [rf_jc_perturbed[i].predict_proba(X=shopping_test_jc_perturb[i].drop(columns='purchase'))[:,1] for i in range(len(rf_jc_perturbed))]

In [8]:
# LS model
ls_val_jc_perturbed_accuracy = [metrics.accuracy_score(shopping_val_jc_perturb[i]['purchase'], ls_val_jc_pred_perturbed[i] > 0.161) for i in range(len(shopping_val_jc_perturb))]
ls_val_jc_perturbed_tp_rate = [metrics.recall_score(shopping_val_jc_perturb[i]['purchase'], ls_val_jc_pred_perturbed[i] > 0.161) for i in range(len(shopping_val_jc_perturb))]
ls_val_jc_perturbed_tn_rate = [metrics.recall_score(shopping_val_jc_perturb[i]['purchase'], ls_val_jc_pred_perturbed[i] > 0.161, pos_label=0) for i in range(len(shopping_val_jc_perturb))]
ls_val_jc_perturbed_auc = [metrics.roc_auc_score(shopping_val_jc_perturb[i]['purchase'], ls_val_jc_pred_perturbed[i]) for i in range(len(shopping_val_jc_perturb))]

# LR model
lr_val_jc_perturbed_accuracy = [metrics.accuracy_score(shopping_val_jc_perturb[i]['purchase'], lr_val_jc_pred_perturbed[i] > 0.161) for i in range(len(shopping_val_jc_perturb))]
lr_val_jc_perturbed_tp_rate = [metrics.recall_score(shopping_val_jc_perturb[i]['purchase'], lr_val_jc_pred_perturbed[i] > 0.161) for i in range(len(shopping_val_jc_perturb))]
lr_val_jc_perturbed_tn_rate = [metrics.recall_score(shopping_val_jc_perturb[i]['purchase'], lr_val_jc_pred_perturbed[i] > 0.161, pos_label=0) for i in range(len(shopping_val_jc_perturb))]
lr_val_jc_perturbed_auc = [metrics.roc_auc_score(shopping_val_jc_perturb[i]['purchase'], lr_val_jc_pred_perturbed[i]) for i in range(len(shopping_val_jc_perturb))]

# RF model
rf_val_jc_perturbed_accuracy = [metrics.accuracy_score(shopping_val_jc_perturb[i]['purchase'], rf_val_jc_pred_perturbed[i] > 0.161) for i in range(len(shopping_val_jc_perturb))]
rf_val_jc_perturbed_tp_rate = [metrics.recall_score(shopping_val_jc_perturb[i]['purchase'], rf_val_jc_pred_perturbed[i] > 0.161) for i in range(len(shopping_val_jc_perturb))]
rf_val_jc_perturbed_tn_rate = [metrics.recall_score(shopping_val_jc_perturb[i]['purchase'], rf_val_jc_pred_perturbed[i] > 0.161, pos_label=0) for i in range(len(shopping_val_jc_perturb))]
rf_val_jc_perturbed_auc = [metrics.roc_auc_score(shopping_val_jc_perturb[i]['purchase'], rf_val_jc_pred_perturbed[i]) for i in range(len(shopping_val_jc_perturb))]
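
The twelve list comprehensions above all apply the same four measures. As a sketch of how this could be condensed, the helper below (the name `evaluate_binary` is hypothetical and not part of the original code) thresholds continuous predictions at 0.161 and returns all four measures at once. Note that the AUC is computed from the continuous predictions, whereas the other three use the thresholded ones.

```python
import numpy as np
from sklearn import metrics

def evaluate_binary(y_true, pred, threshold=0.161):
    """Return accuracy, TP rate, TN rate, and AUC for continuous predictions."""
    y_true = np.asarray(y_true)
    pred = np.asarray(pred)
    # convert the continuous predictions to 0/1 class predictions
    pred_binary = (pred > threshold).astype(int)
    return {
        'accuracy': metrics.accuracy_score(y_true, pred_binary),
        'tp_rate': metrics.recall_score(y_true, pred_binary),
        'tn_rate': metrics.recall_score(y_true, pred_binary, pos_label=0),
        # the AUC uses the continuous predictions, not the thresholded ones
        'auc': metrics.roc_auc_score(y_true, pred),
    }

# toy example with four sessions (made-up values)
toy_scores = evaluate_binary([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_scores)
```

In the toy example, three of the four sessions are predicted as purchases, so both true purchases are found (TP rate 1.0), while only one of the two non-purchases is correctly identified (TN rate 0.5).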


In [9]:
# place all of the accuracy performance lists in a data frame
perturbed_jc_accuracy = pd.DataFrame({
    'ls': ls_val_jc_perturbed_accuracy,
    'lr': lr_val_jc_perturbed_accuracy,
    'rf': rf_val_jc_perturbed_accuracy,
    'numeric_to_cat': perturb_options['numeric_to_cat'],
    'month_numeric': perturb_options['month_numeric'],
    'log_page': perturb_options['log_page'],
    'remove_extreme': perturb_options['remove_extreme'],
    }).melt(id_vars=['numeric_to_cat', 'month_numeric', 'log_page', 'remove_extreme'], 
            var_name='model', 
            value_name='accuracy')

# place all of the tp rate performance lists in a data frame
perturbed_jc_tp_rate = pd.DataFrame({
    'ls': ls_val_jc_perturbed_tp_rate,
    'lr': lr_val_jc_perturbed_tp_rate,
    'rf': rf_val_jc_perturbed_tp_rate,
    'numeric_to_cat': perturb_options['numeric_to_cat'],
    'month_numeric': perturb_options['month_numeric'],
    'log_page': perturb_options['log_page'],
    'remove_extreme': perturb_options['remove_extreme'],
    }).melt(id_vars=['numeric_to_cat', 'month_numeric', 'log_page', 'remove_extreme'], 
            var_name='model', 
            value_name='tp_rate')

# place all of the tn rate performance lists in a data frame
perturbed_jc_tn_rate = pd.DataFrame({
    'ls': ls_val_jc_perturbed_tn_rate,
    'lr': lr_val_jc_perturbed_tn_rate,
    'rf': rf_val_jc_perturbed_tn_rate,
    'numeric_to_cat': perturb_options['numeric_to_cat'],
    'month_numeric': perturb_options['month_numeric'],
    'log_page': perturb_options['log_page'],
    'remove_extreme': perturb_options['remove_extreme'],
    }).melt(id_vars=['numeric_to_cat', 'month_numeric', 'log_page', 'remove_extreme'], 
            var_name='model', 
            value_name='tn_rate')

# place all of the auc performance lists in a data frame
perturbed_jc_auc = pd.DataFrame({
    'ls': ls_val_jc_perturbed_auc,
    'lr': lr_val_jc_perturbed_auc,
    'rf': rf_val_jc_perturbed_auc,
    'numeric_to_cat': perturb_options['numeric_to_cat'],
    'month_numeric': perturb_options['month_numeric'],
    'log_page': perturb_options['log_page'],
    'remove_extreme': perturb_options['remove_extreme'],
    }).melt(id_vars=['numeric_to_cat', 'month_numeric', 'log_page', 'remove_extreme'], 
            var_name='model', 
            value_name='auc')
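
The four blocks above differ only in the performance lists and the `value_name`. A possible way to factor out the repetition is sketched below; the helper name `to_long` and the toy numbers are made up for illustration.

```python
import pandas as pd

def to_long(perf_by_model, options, value_name):
    """Combine per-model performance lists with the judgment call options
    and reshape to one row per (judgment call combination, model)."""
    wide = pd.concat([pd.DataFrame(perf_by_model),
                      options.reset_index(drop=True)], axis=1)
    return wide.melt(id_vars=list(options.columns),
                     var_name='model',
                     value_name=value_name)

# toy example (made-up numbers) with two judgment call combinations
toy_options = pd.DataFrame({'log_page': [True, False]})
toy_long = to_long({'ls': [0.90, 0.91], 'rf': [0.93, 0.92]}, toy_options, 'auc')
print(toy_long)
```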



## Approach 1: Choosing a single predictive fit using PCS

Having computed the performance of each of our judgment-call perturbed fits for each algorithm we considered in this book, we can then identify which fit yields the "best" performance.

The following code prints the details of the fits with the highest AUC performance:

In [10]:
perturbed_jc_auc.sort_values(by='auc', ascending=False).head(10)

Unnamed: 0,numeric_to_cat,month_numeric,log_page,remove_extreme,model,auc
39,True,False,False,False,rf,0.93389
36,True,False,True,True,rf,0.933636
45,False,False,True,False,rf,0.933221
38,True,False,False,True,rf,0.932207
43,False,True,False,False,rf,0.931875
46,False,False,False,True,rf,0.93167
44,False,False,True,True,rf,0.931539
37,True,False,True,False,rf,0.930366
34,True,True,False,True,rf,0.930075
32,True,True,True,True,rf,0.9296
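
If the top-ranked fit needs to be extracted programmatically rather than read off the table, `idxmax()` can be used. The sketch below works on a small DataFrame holding three rows copied from the output above; with the full data, the same two lines would be applied to `perturbed_jc_auc`.

```python
import pandas as pd

# three rows copied from the AUC table above
auc_subset = pd.DataFrame({
    'numeric_to_cat': [True, True, False],
    'month_numeric': [False, False, False],
    'log_page': [False, True, True],
    'remove_extreme': [False, True, False],
    'model': ['rf', 'rf', 'rf'],
    'auc': [0.93389, 0.933636, 0.933221],
}, index=[39, 36, 45])

# row with the highest AUC, and its judgment call options as a dict
best_row = auc_subset.loc[auc_subset['auc'].idxmax()]
best_options = best_row.drop(['model', 'auc']).to_dict()
print(best_row['model'], best_options)
```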


Then we can print the details of the fits with the highest true positive rate:

In [11]:
perturbed_jc_tp_rate.sort_values(by='tp_rate', ascending=False).head(5)

Unnamed: 0,numeric_to_cat,month_numeric,log_page,remove_extreme,model,tp_rate
0,True,True,True,True,ls,0.89779
1,True,True,True,False,ls,0.89779
3,True,True,False,False,ls,0.892265
11,False,True,False,False,ls,0.892265
2,True,True,False,True,ls,0.889503


and we can print the details of the fits with the highest true negative rate:

In [12]:
perturbed_jc_tn_rate.sort_values(by='tn_rate', ascending=False).head(5)

Unnamed: 0,numeric_to_cat,month_numeric,log_page,remove_extreme,model,tn_rate
22,True,False,False,True,lr,0.856187
31,False,False,False,False,lr,0.845198
38,True,False,False,True,rf,0.84472
41,False,True,True,False,rf,0.844243
40,False,True,True,True,rf,0.843287


and we can print the details of the fits with the highest accuracy:

In [13]:
perturbed_jc_accuracy.sort_values(by='accuracy', ascending=False).head(5)

Unnamed: 0,numeric_to_cat,month_numeric,log_page,remove_extreme,model,accuracy
40,False,True,True,True,rf,0.849287
41,False,True,True,False,rf,0.84888
38,True,False,False,True,rf,0.84888
36,True,False,True,True,rf,0.848065
47,False,False,False,False,rf,0.847658


When we did this in R, the "best" fit in terms of the highest validation set AUC is the RF fit with the following cleaning/pre-processing judgment call options:

- `numeric_to_cat=False`

- `month_numeric=False`

- `log_page=False`

- `remove_extreme=True`

This particular fit is usually in the top 10 (out of 48) in this Python implementation. The results are slightly different each time the code is run, and recall that the scikit-learn implementation of each algorithm is slightly different from the corresponding R version.

For consistency with the book, we will use the **RF algorithm trained on the training set with these particular cleaning/pre-processing judgment calls as our "final" algorithm.**

In [14]:
shopping_train_preprocessed_selected = preprocess_shopping_data(
    shopping_train,
    numeric_to_cat=False,
    month_numeric=False,
    log_page=False,
    remove_extreme=True
)

single_fit = RandomForestClassifier()
single_fit.fit(
    X=shopping_train_preprocessed_selected.drop(columns='purchase'), 
    y=shopping_train_preprocessed_selected['purchase']
)
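
A random forest is built from random bootstrap samples and random feature subsets, which is one reason the results change from run to run. If reproducibility is wanted, scikit-learn's `random_state` argument fixes the seed. The fit above does not set one, so the following is only a sketch on synthetic data, and the seed value 0 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for a pre-processed training set
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# two forests fit with the same seed produce the same predictions
rf_a = RandomForestClassifier(random_state=0).fit(X, y)
rf_b = RandomForestClassifier(random_state=0).fit(X, y)

same = np.allclose(rf_a.predict_proba(X), rf_b.predict_proba(X))
print(same)
```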


### Test set evaluation

Let's then evaluate this final fit using the test set (since our validation set was used to choose it, it can no longer provide an independent assessment of its performance).

First we must create the relevant pre-processed test set.

In [15]:
# create a version of the training data without dummy variables to ensure that the same
# unique values of each categorical variable are present in the test set
shopping_train_preprocessed_selected_nodummy = preprocess_shopping_data(
    shopping_train,
    numeric_to_cat=False,
    month_numeric=False,
    log_page=False,
    remove_extreme=True,
    dummy=False)

# create the test set
shopping_test_preprocessed_selected = preprocess_shopping_data(
    shopping_test,
    numeric_to_cat=False,
    month_numeric=False,
    log_page=False,
    remove_extreme=True,
    column_selection=list(shopping_train_preprocessed_selected.columns),
    operating_systems_levels=shopping_train_preprocessed_selected_nodummy['operating_systems'].unique(),
    browser_levels=shopping_train_preprocessed_selected_nodummy['browser'].unique(),
    traffic_type_levels=shopping_train_preprocessed_selected_nodummy['traffic_type'].unique()
)

And then we can compute the predictions for the test set and evaluate them.

In [16]:
shopping_test_pred = single_fit.predict_proba(X=shopping_test_preprocessed_selected.drop(columns='purchase'))[:,1]
shopping_test_pred

array([0.02, 0.59, 0.  , ..., 0.34, 0.17, 0.02])

In [17]:
# compute the auc, tp rate, tn rate, and accuracy
auc = metrics.roc_auc_score(shopping_test_preprocessed_selected['purchase'], shopping_test_pred)
tp_rate = metrics.recall_score(shopping_test_preprocessed_selected['purchase'], shopping_test_pred > 0.161)
tn_rate = metrics.recall_score(shopping_test_preprocessed_selected['purchase'], shopping_test_pred > 0.161, pos_label=0)
accuracy = metrics.accuracy_score(shopping_test_preprocessed_selected['purchase'], shopping_test_pred > 0.161)

# print out the results
print("AUC:", round(auc, 3))
print("True Positive Rate:", round(tp_rate, 3))
print("True Negative Rate:", round(tn_rate, 3))
print("Accuracy:", round(accuracy, 3))


AUC: 0.92
True Positive Rate: 0.84
True Negative Rate: 0.824
Accuracy: 0.826



These performance measures all indicate very good performance for this particular RF algorithm fit on the test set.


## Approach 2: PCS ensemble prediction 

In this approach, we take a look at all of the predictions that we computed above (across all algorithms and judgment call combinations), and we first conduct a predictability screening test to ensure that we are not using particularly poorly performing fits to create our ensemble.

Let's visualize the distribution of each performance measure across all of the algorithms and cleaning/pre-processing judgment calls (grouping by algorithm) using boxplots.

First we will consider the AUC:

In [18]:
px.box(perturbed_jc_auc, y='auc', x='model', color='model', title='AUC by Model')

And then the true positive rate (based on the 0.161 cutoff):

In [19]:
px.box(perturbed_jc_tp_rate, y='tp_rate', x='model', color='model', title='True pos rate by Model')


And then the true negative rate (based on the 0.161 cutoff):

In [20]:
px.box(perturbed_jc_tn_rate, y='tn_rate', x='model', color='model', title='True neg rate by Model')


and the accuracy (based on the 0.161 cutoff):


In [21]:
px.box(perturbed_jc_accuracy, y='accuracy', x='model', color='model', title='Accuracy rate by Model')


It is clear that across all measures, the RF algorithm stands out as a top performer. While we could choose to define a predictability screening test that only considers the RF fits, this feels unnecessarily limiting, so we will keep all fits in our ensemble (although we may not expect to see an improvement in performance on the previous "single best" fit).
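
For reference, a stricter predictability screening could be expressed as a simple filter on the validation performance table. The sketch below uses made-up AUC values and a hypothetical threshold of 0.85; neither comes from this analysis, which keeps all 48 fits.

```python
import pandas as pd

# made-up validation AUC values for a handful of fits
toy_auc = pd.DataFrame({
    'jc_id': [0, 1, 0, 1, 0, 1],
    'model': ['ls', 'ls', 'lr', 'lr', 'rf', 'rf'],
    'auc': [0.90, 0.89, 0.91, 0.72, 0.93, 0.92],
})

# hypothetical screening rule: keep fits whose validation AUC is at least 0.85
auc_threshold = 0.85
screened = toy_auc[toy_auc['auc'] >= auc_threshold]
print(screened[['jc_id', 'model']])
```

Only the fits that survive the filter would then contribute predictions to the ensemble.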



### Test set evaluation

To evaluate the ensemble, let's compute the ensemble predictions for each of the *test set* data points.


In [22]:
# re-format the predictions for each session and each judgment call-perturbed LS, LR, and RF fit into a DataFrame
ls_test_jc_pred_perturbed_df = pd.DataFrame(ls_test_jc_pred_perturbed)
ls_test_jc_pred_perturbed_df['jc_id'] = range(ls_test_jc_pred_perturbed_df.shape[0])
ls_test_jc_pred_perturbed_df = ls_test_jc_pred_perturbed_df.melt(id_vars='jc_id', var_name='id_test', value_name='pred')
ls_test_jc_pred_perturbed_df['algorithm'] = 'ls'

lr_test_jc_pred_perturbed_df = pd.DataFrame(lr_test_jc_pred_perturbed)
lr_test_jc_pred_perturbed_df['jc_id'] = range(lr_test_jc_pred_perturbed_df.shape[0])
lr_test_jc_pred_perturbed_df = lr_test_jc_pred_perturbed_df.melt(id_vars='jc_id', var_name='id_test', value_name='pred')
lr_test_jc_pred_perturbed_df['algorithm'] = 'lr'

rf_test_jc_pred_perturbed_df = pd.DataFrame(rf_test_jc_pred_perturbed)
rf_test_jc_pred_perturbed_df['jc_id'] = range(rf_test_jc_pred_perturbed_df.shape[0])
rf_test_jc_pred_perturbed_df = rf_test_jc_pred_perturbed_df.melt(id_vars='jc_id', var_name='id_test', value_name='pred')
rf_test_jc_pred_perturbed_df['algorithm'] = 'rf'


In [23]:
# combine the predictions from all models into a single dataframe
perturbed_jc_pred_test = pd.concat([ls_test_jc_pred_perturbed_df, lr_test_jc_pred_perturbed_df, rf_test_jc_pred_perturbed_df])
# create a binary prediction
perturbed_jc_pred_test['pred_binary'] = perturbed_jc_pred_test['pred'] > 0.161
perturbed_jc_pred_test



Unnamed: 0,jc_id,id_test,pred,algorithm,pred_binary
0,0,0,0.069796,ls,False
1,1,0,0.069483,ls,False
2,2,0,0.078850,ls,False
3,3,0,0.080171,ls,False
4,4,0,0.064278,ls,False
...,...,...,...,...,...
39259,11,2453,0.060000,rf,False
39260,12,2453,0.020000,rf,False
39261,13,2453,0.020000,rf,False
39262,14,2453,0.040000,rf,False


In [24]:
# identify for each test set session (id_test), whether the majority of the judgment call-perturbed models predict a purchase
ensemble_pred_test = perturbed_jc_pred_test.groupby('id_test')['pred_binary'].mean() >= 0.5
ensemble_pred_test

id_test
0       False
1        True
2       False
3       False
4       False
        ...  
2449     True
2450    False
2451     True
2452    False
2453    False
Name: pred_binary, Length: 2454, dtype: bool
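
To make the majority vote concrete, here is the same `groupby` logic applied to a toy table with two sessions and three fits each (all values made up). The mean of a boolean column is the share of fits that predict a purchase.

```python
import pandas as pd

# toy long-format predictions: two sessions, three fits each (made-up values)
toy_pred = pd.DataFrame({
    'id_test': [0, 0, 0, 1, 1, 1],
    'pred': [0.05, 0.20, 0.10, 0.40, 0.30, 0.12],
})
toy_pred['pred_binary'] = toy_pred['pred'] > 0.161

# share of fits voting "purchase" for each session, then the majority vote
vote_share = toy_pred.groupby('id_test')['pred_binary'].mean()
toy_ensemble = vote_share >= 0.5
print(vote_share.round(3).tolist(), toy_ensemble.tolist())
```

Session 0 receives one vote out of three and is predicted as a non-purchase, while session 1 receives two out of three and is predicted as a purchase. Because the comparison is `>= 0.5`, an exact tie counts as a purchase prediction.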


The performance of the ensemble predictions can then be computed using the usual metrics:

In [25]:
# compute the tp rate, tn rate, and accuracy for these (binary) ensemble predictions
tp_rate = metrics.recall_score(shopping_test_preprocessed['purchase'], ensemble_pred_test)
tn_rate = metrics.recall_score(shopping_test_preprocessed['purchase'], ensemble_pred_test, pos_label=0)
accuracy = metrics.accuracy_score(shopping_test_preprocessed['purchase'], ensemble_pred_test)

# print out the results
print("True Positive Rate:", round(tp_rate, 3))
print("True Negative Rate:", round(tn_rate, 3))
print("Accuracy:", round(accuracy, 3))

True Positive Rate: 0.838
True Negative Rate: 0.816
Accuracy: 0.819



Indeed, the test set performance of the ensemble is slightly worse (lower) than that of the "single best" fit across all three measures (true positive rate, true negative rate, and accuracy).
