# [Chapter 11] Predicting online shopping purchase intent using logistic regression

## **[DSLC stages]**: Analysis



In this document, you will find the PCS workflow and code for fitting logistic regression to the online shopping data.


The following code sets up the libraries and creates cleaned and pre-processed training, validation and test data that we will use in this document.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import metrics 
from joblib import Parallel, delayed
from itertools import product
import matplotlib.pyplot as plt


# define all of the objects we need by running the preparation script
%run functions/prepare_shopping_data.py

pd.set_option('display.max_columns', None)
pd.options.display.max_colwidth = 500
pd.options.display.max_rows = 100

In [2]:
# look at all variables defined in our space
%who

LinearRegression	 LogisticRegression	 Parallel	 delayed	 ff	 go	 metrics	 np	 pd	 
plt	 preprocess_shopping_data	 product	 px	 shopping_orig	 shopping_test	 shopping_test_preprocessed	 shopping_train	 shopping_train_preprocessed	 
shopping_train_preprocessed_nodummy	 shopping_val	 shopping_val_preprocessed	 



## LS and logistic regression for a small sample of 20 training points

Let's create the sample of 20 training sessions that we used in Chapter 11 to demonstrate the LS and logistic regression algorithms for binary responses.

In [3]:
# create a version of the training dataset with just the 20 training sessions
ids = [753, 6492, 6801, 1877, 
       3298, 3635,  603, 3632, 
       4258, 4783, 2758, 4615, 
       7313, 3012, 2109, 7138, 
       5740, 270, 2899,  425]
# since python is zero-indexed, we need to subtract 1 from each index location to get the correct row
ids = [id - 1 for id in ids]
shopping_train_sample = shopping_train_preprocessed.loc[ids,:]
# note that these won't appear in the same order as the version in R
shopping_train_sample.head()

Unnamed: 0,administrative,administrative_duration,informational,informational_duration,product_related,product_related_duration,bounce_rates,exit_rates,page_values,special_day,visitor_type,weekend,purchase,month_dec,month_feb,month_jul,month_june,month_mar,month_may,month_nov,month_oct,month_sep,operating_systems_2,operating_systems_3,operating_systems_4,operating_systems_8,operating_systems_other,browser_10,browser_2,browser_3,browser_4,browser_5,browser_6,browser_8,browser_other,region_2,region_3,region_4,region_5,region_6,region_7,region_8,region_9,traffic_type_10,traffic_type_11,traffic_type_13,traffic_type_2,traffic_type_20,traffic_type_3,traffic_type_4,traffic_type_5,traffic_type_6,traffic_type_8,traffic_type_other
752,1.0,0.458333,0.0,0.0,20.0,15.378472,0.0,0.009524,103.14029,0.0,0,0,True,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0
6491,0.0,0.0,1.0,0.466667,19.0,18.671111,0.021053,0.052632,11.386421,0.0,1,1,True,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
6800,2.0,0.3,0.0,0.0,15.0,12.605556,0.0125,0.0625,29.24325,0.0,1,0,True,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1876,6.0,0.788333,0.0,0.0,14.0,14.303333,0.0,0.010526,0.0,0.0,0,0,True,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3297,12.0,3.872341,1.0,0.32,22.0,8.400476,0.0,0.003763,0.0,0.0,0,0,False,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0


The code below computes a simple one-predictor LS fit using this sample of 20 training sessions:


In [4]:
ls_example = LinearRegression()
ls_example.fit(X=np.array(shopping_train_sample['product_related_duration']).reshape(-1, 1),
                y=shopping_train_sample['purchase'])
ls_example.intercept_, ls_example.coef_[0]

(-0.19406623093971098, 0.04761091598234869)


To use this fit to generate predictions, we can use the `predict()` method. For example, to predict the purchase response of a session that spent 20 minutes on product-related pages:

In [5]:
ls_example.predict(np.array(20).reshape(-1, 1))

array([0.75815209])


Note that this is neither 0 nor is it 1. What does it mean to have a predicted purchase response of 0.76? Since it is above 0.5^[Not that this is a good threshold to use here, but it suffices for this example. A better threshold would be the proportion of sessions that ended with a purchase in the training data, but the conclusion remains the same.], we could round it up to 1 and predict that the response of such a session is "purchase" (i.e., equal to 1).
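To make the origin of the 0.76 explicit, the short sketch below reproduces the prediction by hand. The intercept and coefficient are copied from the LS output printed above; the variable names (`b0`, `b1`, `minutes`) are ours and purely illustrative.

```python
# intercept and coefficient copied from the LS fit output above
b0 = -0.19406623093971098
b1 = 0.04761091598234869

# LS prediction for a session that spent 20 minutes on product-related pages
minutes = 20
ls_prediction = b0 + b1 * minutes
print(round(ls_prediction, 4))

# convert the continuous prediction to a binary one using the 0.5 threshold
ls_prediction_binary = ls_prediction > 0.5
print(ls_prediction_binary)
```

The LS prediction is just a point on a straight line, which is why nothing constrains it to be 0 or 1 (or even to lie between them).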

The corresponding logistic regression fit can be computed using the `LogisticRegression()` class:

In [6]:
logistic_example = LogisticRegression()
logistic_example.fit(X=np.array(shopping_train_sample['product_related_duration']).reshape(-1, 1),
                     y=shopping_train_sample['purchase'])
logistic_example.intercept_, logistic_example.coef_[0]

(array([-7.87498642]), array([0.55629184]))

In [7]:
# predict the probability of a session that spends 20 minutes on product-related 
# pages being False (no purchase) or True (purchase), respectively
logistic_example.predict_proba(np.array(20).reshape(-1, 1))

array([[0.03729634, 0.96270366]])
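The purchase probability above can also be reproduced by hand: logistic regression computes a linear combination of the predictors (the log-odds) and passes it through the logistic (sigmoid) function. The sketch below uses the intercept and coefficient copied from the output above; the variable names are illustrative.

```python
import numpy as np

# intercept and coefficient copied from the logistic regression output above
b0 = -7.87498642
b1 = 0.55629184

# linear combination (the log-odds) for a session with 20 minutes
# on product-related pages
log_odds = b0 + b1 * 20

# the logistic (sigmoid) function maps the log-odds to a probability
purchase_prob = 1 / (1 + np.exp(-log_odds))
print(round(purchase_prob, 4))
```

Unlike the LS prediction, this value is guaranteed to lie between 0 and 1, so it can be read as a predicted purchase probability.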

Note that the scikit-learn implementation of logistic regression differs slightly from the R implementation (`glm()`). For example, scikit-learn applies L2 regularization by default, whereas `glm()` does not, so the coefficient values are slightly different. The statsmodels implementation of logistic regression is closer to the R version, but since we tend to use scikit-learn for most of our work, we will stick with the scikit-learn version here.
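One likely contributor to the difference from R is that scikit-learn's `LogisticRegression` applies an L2 penalty by default (with `C=1.0`), whereas `glm()` computes an unpenalized fit. The sketch below illustrates the effect on synthetic data (not the shopping data); using a very large `C` to approximate an unpenalized fit is our assumption for illustration, not something the original analysis does.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic one-predictor data (not the shopping data)
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = rng.random(200) < 1 / (1 + np.exp(-(0.5 + 2 * x[:, 0])))

# default fit: L2 penalty with C=1.0
lr_default = LogisticRegression().fit(x, y)
# a very large C makes the penalty negligible, approximating an unpenalized fit
lr_weak_penalty = LogisticRegression(C=1e10, max_iter=1000).fit(x, y)

print(lr_default.coef_[0, 0], lr_weak_penalty.coef_[0, 0])
```

The penalized coefficient is shrunk toward zero relative to the (approximately) unpenalized one, which is the kind of small discrepancy you should expect when comparing against `glm()`.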


## Fitting logistic regression to the full training dataset


Next, we can move beyond this sample of just 20 training data points and compute a logistic regression fit for the entire training dataset:

In [8]:
lr_all = LogisticRegression()
lr_all.fit(X=shopping_train_preprocessed.drop(columns='purchase'),
           y=shopping_train_preprocessed['purchase'])
lr_all.intercept_, lr_all.coef_

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


(array([-1.90835125]),
 array([[ 0.02438068,  0.00379902, -0.0020568 ,  0.01856229,  0.00299081,
          0.00283155, -0.19344641, -0.27560552,  0.08571025, -0.26097382,
         -0.56096713,  0.17957566, -0.85734191, -0.20052377,  0.08861629,
         -0.26239924, -0.50818755, -0.54444198,  0.55221998, -0.0709152 ,
         -0.02790164,  0.05844832, -0.22287217, -0.26217731, -0.10272191,
         -0.02666585,  0.01065959, -0.00472242, -0.10997417, -0.01132641,
         -0.06829616, -0.13748229, -0.11193002, -0.10275242,  0.09820197,
          0.11374834, -0.20550533, -0.17648582, -0.12546271,  0.02591723,
         -0.17201337, -0.31172774,  0.13857169,  0.04549389, -0.38349138,
          0.02179054,  0.01025805, -0.61067567, -0.07014505, -0.14908573,
         -0.14570526,  0.1817329 , -0.03361922]]))
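The convergence warning above suggests two remedies: increasing `max_iter` or scaling the features. A minimal sketch of both is shown below using synthetic data with deliberately mismatched feature scales (not the shopping data). Note that the fits in this document do not use this approach, and applying it would change the coefficient values reported here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data with features on very different scales (not the shopping data)
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 1000, 500)])
y = rng.random(500) < 1 / (1 + np.exp(-(X[:, 0] + X[:, 1] / 1000)))

# standardize the features, then fit with a larger iteration budget
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X, y)

# n_iter_ reports how many iterations the solver actually used
n_iter_used = scaled_lr.named_steps['logisticregression'].n_iter_[0]
print(n_iter_used)
```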

In [9]:
# also fit a LS model to the entire training dataset
ls_all = LinearRegression()
ls_all.fit(X=shopping_train_preprocessed.drop(columns='purchase'),
           y=shopping_train_preprocessed['purchase'])



### Comparing the coefficients using bootstrap standardization

Just as was the case for the LS algorithm, the coefficients of the predictive features are not comparable unless either the features were standardized before fitting the logistic regression model, or the coefficients themselves have been standardized, e.g., by using the bootstrap to estimate the standard deviation of each coefficient and then dividing each coefficient by its estimated standard deviation.


The code below demonstrates the latter bootstrapping approach to computing comparable coefficients. 

First, we will create 1000 bootstrapped (sampled with replacement) versions of the training dataset, and we will compute a logistic regression fit to each bootstrapped training data sample.

In [10]:
# define the number of bootstrap samples
np.random.seed(3478)
n_bootstraps = 1000

# define the number of observations in the original dataset
n = shopping_train_preprocessed.shape[0]

# create an empty list to store the bootstrap samples
bootstrap_samples = []

# create an empty list to store the bootstrap coefficients
bootstrap_coefs = []

# create a for loop to generate the bootstrap samples
for i in range(n_bootstraps):
    bootstrap_sample = shopping_train_preprocessed.sample(n=n, replace=True)
    bootstrap_samples.append(bootstrap_sample)
    
    y_boot = bootstrap_sample['purchase']
    X_boot = bootstrap_sample.drop(columns='purchase')
    lr_boot = LogisticRegression().fit(X=X_boot, y=y_boot)
    bootstrap_coefs.append(lr_boot.coef_[0])


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [11]:
# convert the bootstrap coefficients into a data frame
bootstrap_coefs_df = pd.DataFrame(bootstrap_coefs)
# the rows correspond to the bootstrapped samples, and the columns correspond to the different variables/coefficients
bootstrap_coefs_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52
0,0.039595,0.022909,-0.036490,0.010321,0.001633,0.005611,-0.186564,-0.266345,0.091834,-0.240324,-0.815483,0.088864,-0.820957,-0.172312,-0.056967,-0.198964,-0.354682,-0.388691,0.717350,-0.238775,-0.313553,0.091072,-0.175236,-0.446762,-0.142064,0.040760,0.069196,0.134491,-0.142977,-0.122332,-0.142777,-0.149623,-0.142815,-0.098109,0.192214,0.169612,-0.138457,-0.045326,-0.086819,0.042512,-0.201997,-0.413235,-0.061104,0.010049,-0.467151,-0.104289,-0.016721,-0.599920,-0.023560,-0.098815,-0.076386,0.204311,-0.067778
1,0.039327,-0.012088,0.003081,0.011323,0.006325,-0.000253,-0.180076,-0.253151,0.075618,-0.291675,-0.515025,0.201793,-1.031648,-0.166737,0.141420,-0.287364,-0.189289,-0.693365,0.436036,-0.123024,0.067080,-0.013580,-0.365968,-0.317637,-0.146312,0.001867,0.099098,0.026002,-0.134060,-0.167825,-0.095959,-0.189797,-0.127778,-0.152175,0.015799,0.056762,-0.176962,-0.162411,-0.018239,0.057274,-0.277168,-0.391000,0.094985,0.148785,-0.283701,0.089877,-0.047778,-0.589068,0.033814,-0.212020,-0.137442,0.217483,-0.076851
2,0.014754,0.024108,-0.010860,0.025535,0.003509,0.003454,-0.155048,-0.222974,0.085804,-0.240804,-0.772011,0.061370,-0.866684,-0.168408,-0.018927,-0.062205,-0.452084,-0.520796,0.591020,-0.171677,-0.207373,0.131637,-0.260625,-0.249022,-0.065700,-0.010556,-0.019087,0.219747,-0.076029,-0.217592,-0.012335,-0.184227,-0.078754,-0.072153,0.162588,0.019217,-0.234450,-0.101768,-0.036504,0.151559,-0.180578,-0.073775,0.239215,0.020320,-0.244487,0.049474,-0.056791,-0.556398,-0.072267,-0.254280,-0.234135,0.060610,-0.054561
3,0.038787,0.000191,0.064972,0.007000,0.001843,0.003152,-0.250373,-0.353488,0.081360,-0.338351,-0.334756,0.284333,-0.956017,-0.264951,0.081359,-0.514741,-0.720073,-0.556787,0.482128,0.193691,0.144460,0.077398,-0.291340,-0.330653,-0.152913,-0.019628,-0.159597,-0.028770,-0.169701,0.197451,-0.322819,-0.170883,-0.158529,-0.012312,0.412322,-0.022500,-0.102079,-0.410715,0.004868,0.247531,-0.054958,-0.384492,0.222188,0.015081,-0.571588,0.144077,0.074971,-0.569771,0.312963,-0.332984,-0.314439,0.193478,-0.061040
4,0.023960,0.021719,0.033596,0.012079,0.003174,0.003111,-0.231806,-0.332345,0.078519,-0.324039,-0.521884,0.197282,-1.060550,-0.176105,-0.162540,-0.491441,-0.439184,-0.443291,0.494794,0.088251,0.101277,0.173728,-0.198154,-0.236132,-0.118787,-0.037663,0.145738,-0.073596,-0.126039,-0.054382,-0.009413,-0.101255,-0.125767,-0.268314,0.065670,0.077916,-0.184382,-0.201077,-0.205050,0.031809,-0.274993,-0.252597,0.118625,0.095473,-0.596079,0.245099,-0.084111,-0.556668,-0.061603,-0.082095,-0.057673,0.260525,-0.089184
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.018164,-0.003031,-0.008953,0.026692,0.002508,0.005909,-0.188902,-0.291495,0.089768,-0.283809,-0.513190,0.283401,-1.255178,-0.146414,0.359130,-0.175444,-0.529827,-0.429966,0.457696,-0.208519,0.060836,0.039449,-0.219487,-0.311788,-0.104649,-0.062530,-0.053993,0.008121,-0.135129,-0.038001,-0.038093,-0.061839,-0.115974,-0.153499,0.028354,0.089320,-0.118596,-0.155707,0.086513,-0.269311,-0.174960,-0.406851,0.060925,-0.007440,-0.253501,0.099816,0.085388,-0.660447,-0.236264,-0.127752,-0.089584,0.184075,0.015566
996,0.026742,0.015877,-0.057873,0.042369,0.003455,0.004414,-0.177710,-0.253685,0.083188,-0.249435,-0.595879,0.305846,-0.789375,-0.218448,0.123516,-0.299855,-0.418621,-0.607365,0.396254,-0.069888,-0.108056,-0.038169,-0.321221,-0.191720,-0.031954,0.040706,-0.126927,0.105053,-0.089247,-0.035138,0.010797,-0.198677,-0.199950,-0.099046,0.022858,0.084829,-0.132108,-0.104322,-0.114062,-0.128883,-0.259419,-0.235748,0.266778,-0.024135,-0.284213,-0.070439,0.002261,-0.680445,-0.320696,-0.076118,-0.066467,0.201703,-0.070450
997,0.048746,-0.000412,-0.017679,0.011748,0.003082,0.002320,-0.243508,-0.337207,0.085823,-0.200604,-0.695569,0.133424,-0.906403,-0.259888,0.328165,-0.309638,-0.478600,-0.439888,0.445379,-0.150443,0.092747,0.014938,-0.286450,-0.225286,-0.183104,0.029649,0.059089,0.057867,-0.135367,0.111997,-0.049132,-0.240914,-0.031072,-0.170017,0.181943,0.256407,-0.278771,-0.063449,-0.129215,-0.087045,-0.211190,-0.333866,0.272605,-0.081828,-0.281281,0.186467,0.023619,-0.418994,0.042474,-0.102471,-0.126534,0.203193,-0.007816
998,0.025838,-0.010399,-0.001736,0.017987,0.002827,0.004744,-0.124911,-0.182804,0.076601,-0.238886,-0.976853,0.045347,-0.676152,-0.098012,-0.030833,-0.187940,-0.358671,-0.619108,0.586001,-0.039592,-0.038315,-0.061723,-0.117116,-0.288960,-0.085770,0.005300,-0.014603,-0.050385,-0.051757,-0.061278,-0.005500,-0.103118,-0.085651,-0.074642,0.118774,0.061326,-0.095809,-0.171818,-0.144717,-0.063774,-0.151908,-0.184575,0.062886,0.003123,-0.236195,0.146998,0.014493,-0.524342,-0.104382,-0.103876,-0.104260,0.094374,-0.050052



Having computed 1000 bootstrapped values of each coefficient, we can compute their standard deviations, and use these values to standardize the original coefficient values.


In [12]:
coef_std = pd.DataFrame(
    dict(variable = shopping_train_preprocessed.drop(columns='purchase').columns,
         coefficient = lr_all.coef_[0],
         boot_std = bootstrap_coefs_df.std()
         )
    )
# compute the standardized coefficients
coef_std['standardized_coefficient'] = coef_std['coefficient'] / coef_std['boot_std']
# compute the absolute value of the standardized coefficient
coef_std['standardized_coefficient_abs'] = coef_std['standardized_coefficient'].abs()
# order the coefficients in order of absolute value of the standardized coefficient
coef_std.sort_values(by='standardized_coefficient_abs', ascending=False).head(10)

Unnamed: 0,variable,coefficient,boot_std,standardized_coefficient,standardized_coefficient_abs
8,page_values,0.08571,0.005948,14.410055,14.410055
7,exit_rates,-0.275606,0.044978,-6.127545,6.127545
12,month_dec,-0.857342,0.143209,-5.98663,5.98663
18,month_nov,0.55222,0.094336,5.853772,5.853772
6,bounce_rates,-0.193446,0.034512,-5.605126,5.605126
47,traffic_type_3,-0.610676,0.115374,-5.293,5.293
17,month_may,-0.544442,0.11685,-4.659322,4.659322
16,month_mar,-0.508188,0.110962,-4.579855,4.579855
13,month_feb,-0.200524,0.047842,-4.191357,4.191357
10,visitor_type,-0.560967,0.150106,-3.737151,3.737151


Note that this is a random process, and the results will be slightly different each time you run the code (although we have set a random seed, so it shouldn't change here unless you change the seed). However, you should generally see that the `page_values` variable has the largest standardized coefficient *by far*, indicating that it is the variable that is most predictive of the purchase response.
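For comparison, the other approach mentioned at the start of this section (standardizing the features before fitting) can be sketched as follows. This uses synthetic data rather than the shopping data, and `StandardScaler` is our choice of tool for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# synthetic data: two features on very different scales (not the shopping data)
rng = np.random.default_rng(0)
n = 2000
x_large_scale = rng.normal(0, 100, n)   # SD of 100
x_small_scale = rng.normal(0, 0.1, n)   # SD of 0.1
log_odds = 0.02 * x_large_scale + 2 * x_small_scale
y = rng.random(n) < 1 / (1 + np.exp(-log_odds))
X = np.column_stack([x_large_scale, x_small_scale])

# standardize each feature to mean 0 and SD 1 *before* fitting
X_std = StandardScaler().fit_transform(X)
lr_std = LogisticRegression().fit(X_std, y)

# the coefficients are now per-SD effects and can be compared directly
coef_large, coef_small = lr_std.coef_[0]
print(coef_large, coef_small)
```

On the raw scale the second feature has the larger coefficient (2 versus 0.02), but per standard deviation the first feature has the stronger effect, which is what the standardized coefficients reveal.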


## Evaluating binary predictions for a sample of 20 validation points

Let's start by evaluating our predictions using just a random sample of 20 validation set sessions. The code below creates the same 20-session sample that was used in the book for evaluation.

In [13]:
sample_index = [961, 1315, 408, 1678, 1810,
                  1566, 2036, 1005, 2198, 685, 
                  1680, 1347, 2265, 286, 1393,
                  2267, 2247, 1576, 217, 420]
sample_index = [id - 1 for id in sample_index]
shopping_val_sample = shopping_val_preprocessed.loc[sample_index,:]


Next, let's print out the observed purchase response for these 20 validation sessions alongside the predictions from the LS and logistic regression fits to the full training set.


In [14]:
pred_val_sample = pd.DataFrame(dict(
    purchase = shopping_val_sample['purchase'],
    ls_predict = ls_all.predict(shopping_val_sample.drop(columns='purchase')),
    ls_predict_binary = ls_all.predict(shopping_val_sample.drop(columns='purchase')) > 0.5,
    lr_predict = lr_all.predict_proba(shopping_val_sample.drop(columns='purchase'))[:,1],
    lr_predict_binary = lr_all.predict(shopping_val_sample.drop(columns='purchase'))))
pred_val_sample

Unnamed: 0,purchase,ls_predict,ls_predict_binary,lr_predict,lr_predict_binary
960,True,0.335059,False,0.294908,False
1314,True,0.328093,False,0.487193,False
407,True,0.618133,True,0.902232,True
1677,True,0.427632,False,0.605817,True
1809,True,0.416627,False,0.554909,True
1565,False,0.077843,False,0.051358,False
2035,False,0.34847,False,0.41475,False
1004,False,0.056462,False,0.048495,False
2197,False,0.054535,False,0.026333,False
684,False,0.08467,False,0.03952,False



### The confusion matrix

The confusion matrix for the LS (binary) fit, where the binary predictions are based (for now) on a threshold of 0.5, is



In [15]:
conf_ls = metrics.confusion_matrix(y_true=pred_val_sample['purchase'],
                                   y_pred=pred_val_sample['ls_predict_binary'])
conf_ls

array([[14,  1],
       [ 4,  1]])

and for the logistic regression fit, the confusion matrix (again, for now, based on a threshold of 0.5) is:


In [16]:
conf_lr = metrics.confusion_matrix(y_true=pred_val_sample['purchase'],
                                   y_pred=pred_val_sample['lr_predict_binary'])
conf_lr

array([[12,  3],
       [ 2,  3]])
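When reading these matrices, it helps to remember that scikit-learn places the observed classes in the rows and the predicted classes in the columns. The sketch below uses hypothetical label vectors, arranged so that they reproduce the logistic regression matrix above, to show how the four cells can be unpacked by name.

```python
import numpy as np
from sklearn import metrics

# hypothetical labels arranged to reproduce the logistic regression matrix above
y_true = np.array([False] * 15 + [True] * 5)
y_pred = np.array([False] * 12 + [True] * 3 + [False] * 2 + [True] * 3)

conf = metrics.confusion_matrix(y_true=y_true, y_pred=y_pred)
# rows are the observed classes, columns are the predicted classes,
# so flattening the 2x2 matrix gives (tn, fp, fn, tp)
tn, fp, fn, tp = conf.ravel()
print(tn, fp, fn, tp)
```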



### Prediction accuracy


There are several ways that you can compute the prediction accuracy, such as from the confusion matrix, by adding up the diagonal entries and dividing by the total:

In [17]:
# LS
(conf_ls[0, 0] + conf_ls[1, 1]) / conf_ls.sum()

0.75

In [18]:
# Logistic regression
(conf_lr[0, 0] + conf_lr[1, 1]) / conf_lr.sum()

0.75


or using the `accuracy_score()` function from the `sklearn.metrics` module:


In [19]:
# LS
metrics.accuracy_score(y_true=pred_val_sample['purchase'],
                       y_pred=pred_val_sample['ls_predict_binary'])

0.75

In [20]:
# Logistic regression
metrics.accuracy_score(y_true=pred_val_sample['purchase'],
                       y_pred=pred_val_sample['lr_predict_binary'])

0.75


### True positive and true negative rate


Similarly, the true positive rate can be computed from the confusion matrix:


In [21]:
# LS
(conf_ls[1, 1]) / conf_ls[1,:].sum()

0.2

In [22]:
# Logistic regression
(conf_lr[1, 1]) / conf_lr[1,:].sum()

0.6

The true positive rate (recall/sensitivity) can also be computed using the `recall_score()` function from the `sklearn.metrics` module:

In [23]:
metrics.recall_score(y_true=pred_val_sample['purchase'],
                     y_pred=pred_val_sample['ls_predict_binary'])

0.2

In [24]:
metrics.recall_score(y_true=pred_val_sample['purchase'],
                     y_pred=pred_val_sample['lr_predict_binary'])

0.6



Similarly, the true negative rate can be computed from the confusion matrix:


In [25]:
# LS
(conf_ls[0, 0]) / conf_ls[0,:].sum()

0.9333333333333333

In [26]:
# Logistic regression
(conf_lr[0, 0]) / conf_lr[0,:].sum()

0.8

scikit-learn has no dedicated function for the true negative rate, but you can compute it with the `recall_score()` function by passing in the `pos_label=0` argument, which treats the "no purchase" class as the positive class:

In [27]:
# LS 
metrics.recall_score(y_true=pred_val_sample['purchase'],
                     y_pred=pred_val_sample['ls_predict_binary'],
                     pos_label=0)

0.9333333333333333

In [28]:
# Logistic regression
metrics.recall_score(y_true=pred_val_sample['purchase'],
                     y_pred=pred_val_sample['lr_predict_binary'],
                     pos_label=0)

0.8


### Predicted probability densities



We can also plot the distribution of the predicted probabilities using density plots.



In [29]:
# create overlaid density plots comparing the predicted probabilities for those who did and did not make a purchase in the validation set
pred_val_sample['purchase'] = pred_val_sample['purchase'].astype('category')
fig = ff.create_distplot([pred_val_sample[pred_val_sample.purchase == True]['lr_predict'],
                    pred_val_sample[pred_val_sample.purchase == False]['lr_predict']], 
                   show_hist=False,
                   group_labels=['Purchase', 'No Purchase'])
# label the x-axis and restrict it to the [0, 1] probability range
fig.update_xaxes(title_text='Predicted purchase probability',
                 range=[0, 1])
fig.update_layout(title_text = "Logistic regression")



Note, however, that density plots can be misleading for so few samples (there are only 5 data points in the purchase class). A histogram is more appropriate here:

In [30]:
px.histogram(pred_val_sample, x='lr_predict', color='purchase', 
             nbins=20, opacity=0.5, 
             barmode='overlay')







### ROC curves


Computing ROC curves is easy with the `roc_curve()` function from the `sklearn.metrics` module. Let's plot the ROC curves for the LS and logistic regression predictions on the same plot.



In [31]:
# compute the ROC curve variables
lr_fpr_sample, lr_tpr_sample, lr_thresholds_sample = metrics.roc_curve(pred_val_sample['purchase'], pred_val_sample['lr_predict'])
ls_fpr_sample, ls_tpr_sample, ls_thresholds_sample = metrics.roc_curve(pred_val_sample['purchase'], pred_val_sample['ls_predict'])

In [32]:
roc_lr_sample = pd.DataFrame({
    'False Positive Rate': lr_fpr_sample,
    'True Positive Rate': lr_tpr_sample,
    'Model': 'Logistic Regression'
}, index=lr_thresholds_sample)

roc_ls_sample = pd.DataFrame({
    'False Positive Rate': ls_fpr_sample,
    'True Positive Rate': ls_tpr_sample,
    'Model': 'LS'
}, index=ls_thresholds_sample)

roc_sample_df = pd.concat([roc_lr_sample, roc_ls_sample])


px.line(roc_sample_df, y='True Positive Rate', x='False Positive Rate',
        color='Model',
        width=700, height=500
)


The AUC of each ROC curve can be computed using the `roc_auc_score()` function from the `sklearn.metrics` module:

In [33]:
# Logistic regression
lr_auc_sample = metrics.roc_auc_score(pred_val_sample['purchase'], pred_val_sample['lr_predict'])
print('Logistic regression AUC:', lr_auc_sample.round(3))

Logistic regression AUC: 0.84


In [34]:
# LS
ls_auc_sample = metrics.roc_auc_score(pred_val_sample['purchase'], pred_val_sample['ls_predict'])
print('LS AUC:', ls_auc_sample.round(3))

LS AUC: 0.813
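One way to interpret these AUC values: the AUC equals the proportion of (purchase, no-purchase) pairs in which the purchase session receives the higher predicted value, with ties counted as one half. The sketch below checks this on a small hypothetical example (the responses and scores are made up for illustration).

```python
import numpy as np
from sklearn import metrics

# small hypothetical example: observed responses and predicted scores
y_true = np.array([True, True, True, False, False, False, False])
scores = np.array([0.9, 0.6, 0.3, 0.5, 0.2, 0.1, 0.05])

pos = scores[y_true]    # scores of the sessions that ended in a purchase
neg = scores[~y_true]   # scores of the sessions that did not

# fraction of (positive, negative) pairs in which the positive session
# receives the higher score, counting ties as one half
wins = (pos[:, None] > neg[None, :]).mean()
ties = (pos[:, None] == neg[None, :]).mean()
pairwise = wins + 0.5 * ties

auc = metrics.roc_auc_score(y_true, scores)
print(pairwise, auc)
```

Because the AUC depends only on how the predictions are ranked, it can be computed for the LS predictions even though they are not probabilities.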




## PCS evaluations

### Predictability (evaluating binary predictions for the full validation set)

Our predictability evaluation involves much the same computations as the previous section, but this time evaluated using the *entire* validation set, rather than just the sample of 20 validation set data points.



Note that we will use a threshold of 0.161 (the proportion of training sessions that ended with a purchase, as computed in the code below) to convert the continuous response predictions (probabilities, in the case of logistic regression) to binary predictions.


In [35]:
# compute the proportion of sessions in the training set with a purchase
shopping_train_preprocessed['purchase'].mean()

0.16128594682582745


Let's evaluate the LS and logistic regression fits on the entire validation set. First, we need to compute the predictions for the validation set.



In [36]:
# compute predictions for each model using the entire validation set
pred_val = pd.DataFrame(dict(
    purchase = shopping_val_preprocessed['purchase'],
    ls_predict = ls_all.predict(shopping_val_preprocessed.drop(columns='purchase')),
    lr_predict = lr_all.predict_proba(shopping_val_preprocessed.drop(columns='purchase'))[:,1]))

In [37]:
pred_val.head()

Unnamed: 0,purchase,ls_predict,lr_predict
0,False,-0.028682,0.05461
1,False,0.00491,0.028358
2,False,0.060235,0.07531
3,False,-0.060511,0.03744
4,False,0.086493,0.062937



Then we can compute many performance metrics at once using the `classification_report()` function from the sklearn.metrics module:


In [38]:
# compute accuracy, tpr (recall), etc  for Logistic regression
print(metrics.classification_report(y_true=pred_val['purchase'],
                                    y_pred=pred_val['lr_predict'] > 0.161))

              precision    recall  f1-score   support

       False       0.96      0.84      0.90      2094
        True       0.46      0.78      0.58       362

    accuracy                           0.83      2456
   macro avg       0.71      0.81      0.74      2456
weighted avg       0.88      0.83      0.85      2456



In [39]:
# compute accuracy, tpr (recall), etc  for LS
print(metrics.classification_report(y_true=pred_val['purchase'],
                                    y_pred=pred_val['ls_predict'] > 0.161))

              precision    recall  f1-score   support

       False       0.97      0.75      0.85      2094
        True       0.38      0.89      0.53       362

    accuracy                           0.77      2456
   macro avg       0.68      0.82      0.69      2456
weighted avg       0.89      0.77      0.80      2456
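In these reports, the recall in the `True` row is the true positive rate and the recall in the `False` row is the true negative rate. If you need the values programmatically rather than as printed text, one option is the `output_dict=True` argument, sketched below on hypothetical labels (not the validation set).

```python
import numpy as np
from sklearn import metrics

# hypothetical observed and predicted responses
y_true = np.array([False] * 15 + [True] * 5)
y_pred = np.array([False] * 12 + [True] * 3 + [False] * 2 + [True] * 3)

# output_dict=True returns the report as a nested dictionary
report = metrics.classification_report(y_true=y_true, y_pred=y_pred,
                                       output_dict=True)

tpr = report['True']['recall']    # true positive rate
tnr = report['False']['recall']   # true negative rate
print(tpr, tnr, report['accuracy'])
```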



We can also compute the AUC for each model:

In [40]:
# Logistic regression
lr_auc = metrics.roc_auc_score(pred_val['purchase'], pred_val['lr_predict'])
print('Logistic regression AUC:', lr_auc.round(3))

Logistic regression AUC: 0.896


In [41]:
# LS
ls_auc = metrics.roc_auc_score(pred_val['purchase'], pred_val['ls_predict'])
print('LS AUC:', ls_auc.round(3))

LS AUC: 0.905


We can also plot the ROC curves:

In [42]:
# compute the ROC curve variables
lr_fpr, lr_tpr, lr_thresholds = metrics.roc_curve(pred_val['purchase'], pred_val['lr_predict'])
ls_fpr, ls_tpr, ls_thresholds = metrics.roc_curve(pred_val['purchase'], pred_val['ls_predict'])

roc_lr = pd.DataFrame({
    'False Positive Rate': lr_fpr,
    'True Positive Rate': lr_tpr,
    'Model': 'Logistic Regression'
}, index=lr_thresholds)

roc_ls = pd.DataFrame({
    'False Positive Rate': ls_fpr,
    'True Positive Rate': ls_tpr,
    'Model': 'LS'
}, index=ls_thresholds)

roc_df = pd.concat([roc_lr, roc_ls])


px.line(roc_df, y='True Positive Rate', x='False Positive Rate',
        color='Model',
        width=700, height=500
)


We can also create density plots of the predictions for both logistic regression and LS:


In [43]:
# create overlaid density plots comparing the predicted probabilities for those who did and did not make a purchase in the validation set
pred_val['purchase'] = pred_val['purchase'].astype('category')
fig = ff.create_distplot([pred_val[pred_val.purchase == True]['lr_predict'],
                    pred_val[pred_val.purchase == False]['lr_predict']], 
                         show_hist=False,
                   group_labels=['Purchase', 'No Purchase'])
# label the x-axis and restrict it to the [0, 1] probability range
fig.update_xaxes(title_text='Predicted purchase probability',
                 range=[0, 1])
fig.update_layout(title_text = "Logistic regression")

In [44]:
# create overlaid density plots comparing the predicted probabilities for those who did and did not make a purchase in the validation set
pred_val['purchase'] = pred_val['purchase'].astype('category')
fig = ff.create_distplot([pred_val[pred_val.purchase == True]['ls_predict'],
                    pred_val[pred_val.purchase == False]['ls_predict']], 
                         show_hist=False,
                   group_labels=['Purchase', 'No Purchase'])
# label the x-axis and restrict it to the [0, 1] range
fig.update_xaxes(title_text='Predicted purchase probability',
                 range=[0, 1])
fig.update_layout(title_text = "LS")


### Stability to data perturbations


To investigate the stability of each algorithm to data perturbations (specifically, bootstrap samples), we will first create 100 perturbed versions of the training dataset:


In [45]:
# create a list of 100 perturbed (bootstrapped) versions of shopping_train_preprocessed
np.random.seed(348)
perturbed_shopping = [shopping_train_preprocessed.sample(frac=1, replace=True) for i in range(100)]
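Each bootstrap sample has the same number of rows as the original data, drawn with replacement, so some rows appear more than once and others not at all. The cell above relies on the global NumPy seed for reproducibility. As a sketch of an alternative (our suggestion, not what the original analysis does), the `random_state` argument of `sample()` makes each individual draw reproducible on its own; the small data frame below is a hypothetical stand-in for the training data.

```python
import pandas as pd

# small hypothetical data frame standing in for the training data
df = pd.DataFrame({'x': range(10), 'purchase': [True, False] * 5})

# passing random_state makes each bootstrap sample reproducible on its own
boot_a = df.sample(frac=1, replace=True, random_state=348)
boot_b = df.sample(frac=1, replace=True, random_state=348)

print(len(boot_a) == len(df))    # same number of rows as the original
print(boot_a.index.is_unique)    # typically False, since rows can repeat
print(boot_a.equals(boot_b))     # identical, because the seed is the same
```

To generate many different samples this way, you could vary the seed across iterations, e.g., `random_state=348 + i`.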


Then we can compute an LS fit and a logistic regression fit for each perturbed dataset.

In [46]:
def fit_models(df, standardize=False):
    # separate the predictive features from the response
    df_x = df.drop(columns='purchase')
    # if specified, standardize the predictive features
    if standardize:
        df_x = (df_x - df_x.mean()) / df_x.std()

    # fit LS and logistic regression to the same (perturbed) data
    ls_fit = LinearRegression().fit(X=df_x, y=df['purchase'])
    lr_fit = LogisticRegression().fit(X=df_x, y=df['purchase'])

    return (ls_fit, lr_fit)

In [47]:
results = Parallel(n_jobs=-1)(delayed(fit_models)(df) for df in perturbed_shopping)
ls_perturbed, lr_perturbed = zip(*results)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

We can then generate purchase predictions for each session in the validation set using each of the perturbed LS and logistic regression fits.

In [48]:
# compute the predictions on the validation set for each perturbed LS and logistic regression fit
ls_val_pred_perturbed = [ls_perturbed[i].predict(X=shopping_val_preprocessed.drop(columns='purchase')) for i in range(100)]
lr_val_pred_perturbed = [lr_perturbed[i].predict_proba(X=shopping_val_preprocessed.drop(columns='purchase'))[:,1] for i in range(100)]

# compute binary predictions using the 0.161 threshold
ls_val_pred_binary_perturbed = [ls_val_pred_perturbed[i] > 0.161 for i in range(100)]
lr_val_pred_binary_perturbed = [lr_val_pred_perturbed[i] > 0.161 for i in range(100)]


Then we can look at the distributions of the performance metrics we computed for the 100 perturbed fits. First, let's look at the ROC curves for the perturbed fits.


In [49]:

# Compute the ROC curve variables for each perturbed prediction
roc_curves_ls = []
roc_curves_lr = []
for i in range(100):
    lr_fpr, lr_tpr, lr_thresholds = metrics.roc_curve(pred_val['purchase'], lr_val_pred_perturbed[i])
    ls_fpr, ls_tpr, ls_thresholds = metrics.roc_curve(pred_val['purchase'], ls_val_pred_perturbed[i])

    roc_lr = pd.DataFrame({
        'False Positive Rate': lr_fpr,
        'True Positive Rate': lr_tpr,
        'Model': f'Logistic Regression (Perturbed {i+1})'
    }, index=lr_thresholds)

    roc_ls = pd.DataFrame({
        'False Positive Rate': ls_fpr,
        'True Positive Rate': ls_tpr,
        'Model': f'LS (Perturbed {i+1})'
    }, index=ls_thresholds)

    roc_curves_ls.append(pd.concat([roc_ls]))
    roc_curves_lr.append(pd.concat([roc_lr]))
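
As a reminder of what `metrics.roc_curve` returns, here is a minimal sketch on four made-up observations (the labels and scores are illustrative and are not taken from the shopping data). Each returned threshold corresponds to one (false positive rate, true positive rate) point on the curve, and `metrics.roc_auc_score` summarizes the curve by the area underneath it.

```python
import numpy as np
from sklearn import metrics

# four toy observations: true labels and predicted scores (illustrative only)
y_toy = np.array([0, 0, 1, 1])
score_toy = np.array([0.1, 0.4, 0.35, 0.8])

# one (fpr, tpr) pair per candidate threshold
fpr_toy, tpr_toy, thresholds_toy = metrics.roc_curve(y_toy, score_toy)
auc_toy = metrics.roc_auc_score(y_toy, score_toy)

for f, t in zip(fpr_toy, tpr_toy):
    print(f"FPR = {f:.2f}, TPR = {t:.2f}")
print(f"AUC = {auc_toy:.2f}")
```

Because the ROC curve depends only on the ordering of the scores, it can be computed for the LS predictions even though they are not constrained to lie between 0 and 1.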

In [50]:
fig = go.Figure()

for i, roc_curve in enumerate(roc_curves_ls):
    fig.add_trace(go.Scatter(
        x=roc_curve['False Positive Rate'],
        y=roc_curve['True Positive Rate'],
        mode='lines',
        name=f'ROC Curve {i+1}',
        showlegend=False  # Remove the legend for each trace
    ))

fig.update_layout(
    title='(a) LS',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate'
)

fig.show()


In [51]:
fig = go.Figure()

for i, roc_curve in enumerate(roc_curves_lr):
    fig.add_trace(go.Scatter(
        x=roc_curve['False Positive Rate'],
        y=roc_curve['True Positive Rate'],
        mode='lines',
        name=f'ROC Curve {i+1}',
        showlegend=False  # Remove the legend for each trace
    ))

fig.update_layout(
    title='(b) Logistic regression',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate'
)

fig.show()



The ROC curves seem very stable.


Next, let's look at the distribution of the other performance measures using boxplots.


In [52]:
# LS 
ls_tp_rate_perturbed = [metrics.recall_score(y_true=shopping_val_preprocessed['purchase'], y_pred=pred) for pred in ls_val_pred_binary_perturbed]
ls_acc_perturbed = [metrics.accuracy_score(y_true=shopping_val_preprocessed['purchase'], y_pred=pred) for pred in ls_val_pred_binary_perturbed]
ls_tn_rate_perturbed = [metrics.recall_score(y_true=shopping_val_preprocessed['purchase'], y_pred=pred, pos_label=0) for pred in ls_val_pred_binary_perturbed]

# Logistic regression
lr_tp_rate_perturbed = [metrics.recall_score(y_true=shopping_val_preprocessed['purchase'], y_pred=pred) for pred in lr_val_pred_binary_perturbed]
lr_acc_perturbed = [metrics.accuracy_score(y_true=shopping_val_preprocessed['purchase'], y_pred=pred) for pred in lr_val_pred_binary_perturbed]
lr_tn_rate_perturbed = [metrics.recall_score(y_true=shopping_val_preprocessed['purchase'], y_pred=pred, pos_label=0) for pred in lr_val_pred_binary_perturbed]
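
The calls above obtain the true negative rate by asking `recall_score` for the recall of class 0 (`pos_label=0`). The sketch below, which uses made-up labels and probabilities, checks that this agrees with the definition TN / (TN + FP) after applying the 0.161 threshold.

```python
import numpy as np
from sklearn import metrics

# toy labels and predicted probabilities (illustrative only)
y_true_toy = np.array([1, 0, 0, 1, 0, 0, 1, 0])
prob_toy = np.array([0.30, 0.05, 0.20, 0.10, 0.12, 0.50, 0.90, 0.01])

# binary predictions using the same 0.161 threshold as above
pred_toy = (prob_toy > 0.161).astype(int)

# rates computed directly from the confusion matrix counts
tn, fp, fn, tp = metrics.confusion_matrix(y_true_toy, pred_toy).ravel()
tp_rate_manual = tp / (tp + fn)
tn_rate_manual = tn / (tn + fp)

# the same rates via recall_score
tp_rate = metrics.recall_score(y_true=y_true_toy, y_pred=pred_toy)
tn_rate = metrics.recall_score(y_true=y_true_toy, y_pred=pred_toy, pos_label=0)
print(tp_rate, tn_rate)
```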

In [53]:
ls_tp_df = pd.DataFrame(dict(tp_rate = ls_tp_rate_perturbed,
                             model = 'LS'))
lr_tp_df = pd.DataFrame(dict(tp_rate = lr_tp_rate_perturbed,
                             model = 'Logistic regression'))

px.box(pd.concat([ls_tp_df, lr_tp_df]), x='model', y='tp_rate', title='True positive rate')

In [54]:
ls_tn_df = pd.DataFrame(dict(tn_rate = ls_tn_rate_perturbed,
                             model = 'LS'))
lr_tn_df = pd.DataFrame(dict(tn_rate = lr_tn_rate_perturbed,
                             model = 'Logistic regression'))

px.box(pd.concat([ls_tn_df, lr_tn_df]), x='model', y='tn_rate', title='True negative rate')

In [55]:
ls_acc_df = pd.DataFrame(dict(acc = ls_acc_perturbed,
                              model = 'LS'))
lr_acc_df = pd.DataFrame(dict(acc = lr_acc_perturbed,
                              model = 'Logistic regression'))

px.box(pd.concat([ls_acc_df, lr_acc_df]), x='model', y='acc', title='Accuracy')


Lastly, let's look at the distributions of the coefficients themselves. To do that, we need to extract the coefficients from the perturbed LS and logistic regression fits. However, since the coefficients are not inherently on the same scale, we need to either standardize the coefficients themselves (e.g., using the bootstrap), or standardize the features before we generate the fits. Since it is computationally easier to take the latter approach, the code below does exactly that:


In [56]:
results_standardized = Parallel(n_jobs=-1)(delayed(fit_models)(df, standardize=True) for df in perturbed_shopping)
ls_perturbed_std, lr_perturbed_std = zip(*results_standardized)
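
To see why standardizing the features puts the LS coefficients on a common scale, note that for an LS fit with an intercept, dividing a feature by its standard deviation multiplies its coefficient by that standard deviation. Here is a minimal sketch on synthetic data (the feature names and data-generating values are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# two synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(1)
x_toy = pd.DataFrame({
    'small_scale': rng.normal(0, 1, size=200),
    'large_scale': rng.normal(0, 100, size=200),
})
y_toy = (0.5 * x_toy['small_scale']
         + 0.005 * x_toy['large_scale']
         + rng.normal(0, 0.1, size=200))

# same standardization as in fit_models(): subtract the mean, divide by the SD
x_toy_std = (x_toy - x_toy.mean()) / x_toy.std()

coef_raw = LinearRegression().fit(x_toy, y_toy).coef_
coef_std = LinearRegression().fit(x_toy_std, y_toy).coef_

print(dict(zip(x_toy.columns, coef_raw)))
print(dict(zip(x_toy.columns, coef_std)))
```

In this toy example the raw coefficients differ by roughly a factor of 100 purely because of the feature scales, whereas the standardized coefficients are of similar size, reflecting that both features contribute about equally to the response.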

In [57]:
perturbed_data_coefs_ls_std = pd.DataFrame([x.coef_ for x in ls_perturbed_std])
perturbed_data_coefs_ls_std.columns = shopping_train_preprocessed.drop(columns='purchase').columns
perturbed_data_coefs_ls_std.head()

Unnamed: 0,administrative,administrative_duration,informational,informational_duration,product_related,product_related_duration,bounce_rates,exit_rates,page_values,special_day,visitor_type,weekend,month_dec,month_feb,month_jul,month_june,month_mar,month_may,month_nov,month_oct,month_sep,operating_systems_2,operating_systems_3,operating_systems_4,operating_systems_8,operating_systems_other,browser_10,browser_2,browser_3,browser_4,browser_5,browser_6,browser_8,browser_other,region_2,region_3,region_4,region_5,region_6,region_7,region_8,region_9,traffic_type_10,traffic_type_11,traffic_type_13,traffic_type_2,traffic_type_20,traffic_type_3,traffic_type_4,traffic_type_5,traffic_type_6,traffic_type_8,traffic_type_other
0,-0.001092,0.009385,0.001072,0.007483,-0.010424,0.028358,0.027715,-0.055805,0.177112,0.000431,-0.009364,-0.00087,-0.009203,-0.006679,0.00975,-0.003203,-0.008843,-0.006513,0.036105,0.005156,0.007793,0.025002,0.006069,-0.001437,0.006444,-0.00028,-0.008523,-0.028268,-0.007784,-0.012151,-0.005112,-0.012328,-0.003667,-0.006289,0.004226,0.002704,-0.003823,-0.005391,-0.005212,0.001837,-0.001607,-0.008376,0.007121,0.002381,-0.006052,0.010424,0.007385,-0.005337,0.001287,0.002637,-0.004176,0.002882,0.002701
1,0.015183,-0.007669,-0.005418,0.010453,0.006681,0.020109,0.011949,-0.033442,0.166912,-0.005318,-0.00972,0.001578,-0.026738,-0.008139,0.002649,-0.006309,-0.016869,-0.018884,0.02177,-0.001141,-0.003436,0.005901,0.003199,-0.007284,0.001115,0.000703,0.0046,-0.003957,-0.004155,0.00343,-0.007357,-0.00483,-0.005326,-0.004147,0.008181,-0.001606,-0.000781,-0.009687,0.001772,-0.000796,-0.009496,-0.008267,0.008876,0.006702,-0.008347,0.013572,0.012427,-0.006797,-0.003738,-0.00133,0.00384,0.009269,0.001665
2,-0.002802,0.002882,0.006705,0.011998,-0.001156,0.02821,0.031756,-0.054056,0.163717,-0.006567,-0.008555,0.002266,-0.018334,-0.005738,0.002634,0.000506,-0.013975,-0.013271,0.030748,0.001337,0.002174,-0.001262,-0.004619,-0.011266,-0.010018,-0.00765,0.005104,0.002219,-0.005228,0.002537,0.008926,-0.004721,-0.001273,0.001913,-0.002679,-0.001262,-0.00407,-0.007533,-0.005139,-0.000609,-0.01165,0.000159,0.010835,0.009414,-0.008371,0.024745,0.009683,-0.00423,0.003479,0.008401,0.0026,0.013208,0.000846
3,0.010809,-0.003105,0.014532,-0.000668,0.00946,0.013371,0.026094,-0.0541,0.167164,-0.003278,-0.015528,0.001435,-0.030972,-0.012022,-0.0031,-0.011405,-0.025731,-0.024849,0.017907,-0.006474,-0.004198,-0.001836,-0.013983,-0.004587,-0.001699,-0.002812,0.004001,-0.00056,-0.004099,0.002381,0.001326,-0.004166,-0.008457,-0.003198,0.001574,-0.009866,-0.001085,-0.003995,0.003739,0.002657,-0.007149,-0.00956,0.003607,0.00942,-0.005704,0.010506,0.012857,-0.005469,-0.000741,0.000389,-0.002713,0.00916,0.005248
4,0.013931,-0.000264,1.3e-05,0.006731,3.7e-05,0.0281,0.017416,-0.039854,0.163964,-0.003758,-0.012305,0.009686,-0.030811,-0.011331,0.004381,-0.013135,-0.023559,-0.029197,0.010693,0.001865,0.005789,-0.006149,-0.015956,-0.004296,-0.004937,-0.002308,0.000134,0.006188,-0.002926,0.015735,2.6e-05,-0.003978,-0.004741,0.002931,0.003357,0.00416,-0.000143,-0.000261,-0.002749,0.003289,-0.004312,-0.003275,0.014305,0.002577,-0.009327,0.01463,0.012321,-0.004863,-0.006488,-0.005029,0.002502,0.009018,-0.001505


In [58]:
perturbed_data_coefs_lr_std = pd.DataFrame([x.coef_[0] for x in lr_perturbed_std])
perturbed_data_coefs_lr_std.columns = shopping_train_preprocessed.drop(columns='purchase').columns
perturbed_data_coefs_lr_std.head()

Unnamed: 0,administrative,administrative_duration,informational,informational_duration,product_related,product_related_duration,bounce_rates,exit_rates,page_values,special_day,visitor_type,weekend,month_dec,month_feb,month_jul,month_june,month_mar,month_may,month_nov,month_oct,month_sep,operating_systems_2,operating_systems_3,operating_systems_4,operating_systems_8,operating_systems_other,browser_10,browser_2,browser_3,browser_4,browser_5,browser_6,browser_8,browser_other,region_2,region_3,region_4,region_5,region_6,region_7,region_8,region_9,traffic_type_10,traffic_type_11,traffic_type_13,traffic_type_2,traffic_type_20,traffic_type_3,traffic_type_4,traffic_type_5,traffic_type_6,traffic_type_8,traffic_type_other
0,-0.026008,0.051145,0.032517,0.036298,-0.082435,0.207856,-0.394017,-0.896982,1.589791,0.020145,-0.037151,0.015485,-0.124426,-0.545972,0.11971,-0.063261,-0.126415,-0.10347,0.356885,0.065723,0.109035,0.18522,0.030906,-0.054561,0.061776,-0.033562,-0.074159,-0.25001,-0.112168,-0.103951,0.009762,-0.139857,-0.03004,-0.046545,0.043543,0.048915,-0.02455,-0.057568,-0.042981,0.037871,0.007842,-0.101399,0.066338,0.034344,-0.071887,0.04473,0.01807,-0.099177,-0.022046,0.005621,-0.078931,0.040913,0.024998
1,0.12865,-0.081066,-0.05396,0.084526,0.058073,0.121892,-0.25856,-0.4744,1.601345,-0.110693,-0.080362,0.016242,-0.322766,-0.639055,0.039656,-0.070147,-0.198912,-0.234254,0.194881,-0.007465,-0.013428,0.009546,0.043187,-0.124729,0.030117,-0.012411,0.061508,-0.002832,-0.119404,0.034937,-0.01575,-0.052411,-0.061752,-0.045905,0.079212,0.003795,-0.001096,-0.153128,0.032159,0.010635,-0.139358,-0.080113,0.075091,0.045041,-0.098846,0.09163,0.079504,-0.122389,-0.110405,-0.038676,0.038768,0.094966,0.016702
2,-0.042275,0.014217,0.02381,0.091543,-0.02811,0.222042,0.134658,-0.778052,1.489247,-0.116558,-0.059642,0.046303,-0.202566,-0.235119,0.04574,0.008191,-0.185425,-0.180495,0.256779,-0.004393,0.029177,-0.014248,-0.009963,-0.14434,-0.09103,-0.147402,0.043311,0.018915,-0.122609,0.025043,0.106566,-0.072922,0.000472,0.027413,-0.033608,0.003415,-0.048694,-0.081236,-0.055851,-0.002313,-0.113565,0.001562,0.097288,0.082643,-0.135923,0.191248,0.063112,-0.08674,0.037023,0.036209,0.026536,0.120777,0.009621
3,0.068668,-0.024523,0.090721,-0.00754,0.060507,0.083227,-0.172667,-0.750414,1.601164,-0.016318,-0.092272,0.005076,-0.307526,-0.616666,0.007956,-0.123501,-0.251901,-0.259019,0.187029,-0.044556,-0.018733,0.019649,-0.067879,-0.065823,0.004422,-0.041018,0.042431,-0.047312,-0.120417,0.011937,0.048261,-0.067886,-0.095769,-0.064814,0.004495,-0.08621,0.006834,-0.023769,0.046356,0.038338,-0.042531,-0.091465,0.0115,0.085429,-0.061393,0.03791,0.063515,-0.086738,-0.025264,-0.038206,-0.025598,0.115198,0.041811
4,0.080724,0.009274,0.020598,0.025523,0.034481,0.169277,-0.414118,-0.534458,1.580408,-0.08577,-0.081704,0.117414,-0.327192,-0.566173,0.049502,-0.121036,-0.263044,-0.380573,0.080664,-0.011626,0.0453,-0.091215,-0.180261,-0.086636,-0.035043,-0.043735,-0.018503,0.019604,-0.065338,0.135983,0.012782,-0.068158,-0.081311,0.017731,0.045728,0.083849,0.019309,0.005704,-0.04259,0.041369,-0.030561,-0.041135,0.117909,0.009052,-0.113849,0.097689,0.098427,-0.109769,-0.096673,-0.042403,0.056491,0.102743,-0.043353



Next, we can use boxplots to visualize the distributions of the largest (in absolute value) coefficients for each fit (note that in the book, we arrange the LS plot in the same order as the logistic regression plot, but we haven't done that here).

In [59]:
# Sort the columns based on the absolute values of the coefficients
sorted_columns = perturbed_data_coefs_ls_std.abs() \
    .mean() \
    .sort_values(ascending=False) \
    .head(20) \
    .index

# Create the boxplot with the sorted columns
fig = px.box(perturbed_data_coefs_ls_std[sorted_columns],
             title='LS')
fig.update_yaxes(title_text='Standardized coefficient')
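
The sorting step above ranks the features by the mean absolute value of their coefficient across the perturbed fits and keeps the top 20. Here is a toy sketch of the same chain of pandas calls, using made-up coefficients and keeping only the top 2:

```python
import pandas as pd

# made-up coefficients: one row per perturbed fit, one column per feature
coefs_toy = pd.DataFrame({
    'feature_a': [0.10, -0.10, 0.05],
    'feature_b': [-2.00, 2.10, 0.20],
    'feature_c': [0.50, 0.70, 0.60],
})

# mean absolute coefficient per feature, largest first, keep the top 2
top_columns = (coefs_toy.abs()
               .mean()
               .sort_values(ascending=False)
               .head(2)
               .index)
print(list(top_columns))
```

Taking the absolute value before averaging matters: `feature_b` changes sign across the toy fits, so its plain mean is much smaller than its typical magnitude.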


In [60]:
# Sort the columns based on the absolute values of the coefficients
sorted_columns = perturbed_data_coefs_lr_std.abs() \
    .mean() \
    .sort_values(ascending=False) \
    .head(20) \
    .index

# Create the boxplot with the sorted columns
fig = px.box(perturbed_data_coefs_lr_std[sorted_columns], 
             title='Logistic regression')
fig.update_yaxes(title_text='Standardized coefficient')




### Stability to cleaning/pre-processing judgment calls



The judgment calls that we will consider are:


1. Converting the numeric variables (such as `browser`, `region`, and `operating_systems`) to categorical variables, or leaving them in a numeric format (just in case there is some meaningful order to the levels that we don't know about).

2. Converting the categorical `month` variable to a numeric format (since there is a natural ordering to the months), or leaving it in a categorical format (which will be turned into one-hot encoded dummy variables during pre-processing).

3. Applying a log-transformation to the page visit and duration variables (because this makes the distributions look more symmetric, and may help improve predictive performance), or leaving them un-transformed.

4. Removing very extreme sessions (e.g., that visited over 400 product-related pages in a single session, or spent more than 12 hours on product-related pages in a single session) that may be bots, versus leaving them in the data. Note that we chose the thresholds that we use to define a potential "bot" session based on a visualization of the distributions of these variables.


Since each judgment call above has two options (`True` or `False`), there are a total of $2^4 = 16$ different judgment call combinations that we will consider in this section. 

The code below creates a list that contains each perturbed dataset.

First, we will create a grid of all of the judgment call options.


In [61]:
perturb_options = list(product([True, False], 
                               [True, False],
                               [True, False],
                               [True, False]))
perturb_options = pd.DataFrame(perturb_options, columns=('numeric_to_cat', 
                                                         'month_numeric',
                                                         'log_page',
                                                         'remove_extreme'))
perturb_options

Unnamed: 0,numeric_to_cat,month_numeric,log_page,remove_extreme
0,True,True,True,True
1,True,True,True,False
2,True,True,False,True
3,True,True,False,False
4,True,False,True,True
5,True,False,True,False
6,True,False,False,True
7,True,False,False,False
8,False,True,True,True
9,False,True,True,False
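
The count of 16 follows from taking every combination of four binary choices. Here is a quick sketch confirming this with `itertools.product` (the `repeat=4` shorthand is equivalent to passing `[True, False]` four times, as in the cell above):

```python
from itertools import product

# the four judgment calls, in the same order as the columns above
judgment_calls = ('numeric_to_cat', 'month_numeric', 'log_page', 'remove_extreme')

# every combination of True/False across the four judgment calls
combos = list(product([True, False], repeat=len(judgment_calls)))
print(len(combos))  # 2 ** 4 combinations
print(dict(zip(judgment_calls, combos[0])))
```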



Then we will create a version of the pre-processed dataset for each judgment call combination. Note that, unlike for the data perturbations (which were all based on the "default" pre-processed training dataset, so we could use the "default" validation dataset), we will need to explicitly create perturbed versions of the pre-processed validation data to match each perturbed version of the pre-processed training data.

In [62]:
# conduct judgment call perturbations of training data
shopping_jc_perturb = [preprocess_shopping_data(shopping_train,
                                                numeric_to_cat=perturb_options['numeric_to_cat'][i],
                                                month_numeric=perturb_options['month_numeric'][i],
                                                log_page=perturb_options['log_page'][i],
                                                remove_extreme=perturb_options['remove_extreme'][i])
                       for i in range(perturb_options.shape[0])]

# create a version of each perturbed dataset without dummy variables so that we can
# ensure that the same unique values of each categorical variable are present in the
# validation set
shopping_jc_perturb_nodummy = [preprocess_shopping_data(shopping_train,
                                                        dummy=False,
                                                        numeric_to_cat=perturb_options['numeric_to_cat'][i],
                                                        month_numeric=perturb_options['month_numeric'][i],
                                                        log_page=perturb_options['log_page'][i],
                                                        remove_extreme=perturb_options['remove_extreme'][i])
                               for i in range(perturb_options.shape[0])]

# conduct judgment call perturbations of validation data (we need to make sure each validation set is compatible with the relevant training set)
shopping_val_jc_perturb = []
for i in range(perturb_options.shape[0]):
    
    # create preprocessed validation set
    shopping_val_jc_perturb.append(
        preprocess_shopping_data(shopping_val,
                                 numeric_to_cat=perturb_options['numeric_to_cat'][i],
                                 month_numeric=perturb_options['month_numeric'][i],
                                 log_page=perturb_options['log_page'][i],
                                 remove_extreme=perturb_options['remove_extreme'][i],
                                 # make sure val set matches training set
                                 column_selection=list(shopping_jc_perturb[i].columns),
                                 operating_systems_levels=shopping_jc_perturb_nodummy[i]['operating_systems'].unique(),
                                 browser_levels=shopping_jc_perturb_nodummy[i]['browser'].unique(),
                                 traffic_type_levels=shopping_jc_perturb_nodummy[i]['traffic_type'].unique())
    )
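
The validation set needs this extra care because one-hot encoding creates one column per level that is *present* in the data, so a level that appears only in the training set would be missing from the validation columns. The `preprocess_shopping_data()` arguments above take care of this. As a generic illustration of the same idea (this is not the method used by the preparation script), the sketch below aligns toy dummy columns with `DataFrame.reindex()`. The data frames and level names are invented.

```python
import pandas as pd

# toy training and validation sets; level 'c' never appears in validation
train_toy = pd.DataFrame({'browser': ['a', 'b', 'c', 'a']})
val_toy = pd.DataFrame({'browser': ['a', 'b', 'b']})

train_dummies = pd.get_dummies(train_toy, columns=['browser'])
val_dummies = pd.get_dummies(val_toy, columns=['browser'])

# force the validation columns to match the training columns,
# filling the dummy columns of unseen levels with zeros
val_aligned = val_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(val_aligned.columns))
```

Without this alignment, calling `predict()` on the validation set would fail or silently mismatch features, because the fitted model expects exactly the columns it saw during training.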



Then we can generate an LS fit and a logistic regression fit for each perturbed dataset (and store these in lists too). For each LS and logistic regression fit, we can then compute the validation set predictions and the relevant performance measures.


In [63]:
results_jc_perturbed = Parallel(n_jobs=-1)(delayed(fit_models)(df) for df in shopping_jc_perturb)
ls_jc_perturbed, lr_jc_perturbed = zip(*results_jc_perturbed)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [64]:
# compute the predictions on the matching validation set for each judgment-call-perturbed LS and logistic regression fit
ls_val_pred_jc_perturbed = [ls_jc_perturbed[i].predict(X=shopping_val_jc_perturb[i].drop(columns='purchase')) for i in range(len(ls_jc_perturbed))]
lr_val_pred_jc_perturbed = [lr_jc_perturbed[i].predict_proba(X=shopping_val_jc_perturb[i].drop(columns='purchase'))[:,1] for i in range(len(lr_jc_perturbed))]

# compute binary predictions using the 0.161 threshold
ls_val_pred_binary_jc_perturbed = [ls_val_pred_jc_perturbed[i] > 0.161 for i in range(len(ls_jc_perturbed))]
lr_val_pred_binary_jc_perturbed = [lr_val_pred_jc_perturbed[i] > 0.161 for i in range(len(lr_jc_perturbed))]


Then we can look at the distributions of the performance metrics we computed for the 16 judgment-call-perturbed fits. First, let's look at the perturbed ROC curves. 


In [65]:

# Compute the ROC curve variables for each perturbed prediction
roc_curves_ls_jc = []
roc_curves_lr_jc = []
for i in range(len(shopping_val_jc_perturb)):
    lr_fpr, lr_tpr, lr_thresholds = metrics.roc_curve(shopping_val_jc_perturb[i]['purchase'], lr_val_pred_jc_perturbed[i])
    ls_fpr, ls_tpr, ls_thresholds = metrics.roc_curve(shopping_val_jc_perturb[i]['purchase'], ls_val_pred_jc_perturbed[i])

    roc_lr = pd.DataFrame({
        'False Positive Rate': lr_fpr,
        'True Positive Rate': lr_tpr,
        'Model': f'Logistic Regression (Perturbed {i+1})'
    }, index=lr_thresholds)

    roc_ls = pd.DataFrame({
        'False Positive Rate': ls_fpr,
        'True Positive Rate': ls_tpr,
        'Model': f'LS (Perturbed {i+1})'
    }, index=ls_thresholds)

    roc_curves_ls_jc.append(pd.concat([roc_ls]))
    roc_curves_lr_jc.append(pd.concat([roc_lr]))

In [66]:
fig = go.Figure()

for i, roc_curve in enumerate(roc_curves_ls_jc):
    fig.add_trace(go.Scatter(
        x=roc_curve['False Positive Rate'],
        y=roc_curve['True Positive Rate'],
        mode='lines',
        name=f'ROC Curve {i+1}',
        showlegend=False  # Remove the legend for each trace
    ))

fig.update_layout(
    title='(a) LS',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate'
)

fig.show()


In [67]:
fig = go.Figure()

for i, roc_curve in enumerate(roc_curves_lr_jc):
    fig.add_trace(go.Scatter(
        x=roc_curve['False Positive Rate'],
        y=roc_curve['True Positive Rate'],
        mode='lines',
        name=f'ROC Curve {i+1}',
        showlegend=False  # Remove the legend for each trace
    ))

fig.update_layout(
    title='(b) Logistic regression',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate'
)

fig.show()



The ROC curves seem very stable.


Next, let's look at the distribution of the other performance measures using boxplots.


In [68]:
# LS 
ls_tp_rate_jc_perturbed = [metrics.recall_score(y_true=shopping_val_jc_perturb[i]['purchase'], y_pred=ls_val_pred_binary_jc_perturbed[i]) for i in range(len(ls_val_pred_binary_jc_perturbed))]
ls_acc_jc_perturbed = [metrics.accuracy_score(y_true=shopping_val_jc_perturb[i]['purchase'], y_pred=ls_val_pred_binary_jc_perturbed[i]) for i in range(len(ls_val_pred_binary_jc_perturbed))]
ls_tn_rate_jc_perturbed = [metrics.recall_score(y_true=shopping_val_jc_perturb[i]['purchase'], y_pred=ls_val_pred_binary_jc_perturbed[i], pos_label=0) for i in range(len(ls_val_pred_binary_jc_perturbed))]

In [69]:
# Logistic regression
lr_tp_rate_jc_perturbed = [metrics.recall_score(y_true=shopping_val_jc_perturb[i]['purchase'], y_pred=lr_val_pred_binary_jc_perturbed[i]) for i in range(len(lr_val_pred_binary_jc_perturbed))]
lr_acc_jc_perturbed = [metrics.accuracy_score(y_true=shopping_val_jc_perturb[i]['purchase'], y_pred=lr_val_pred_binary_jc_perturbed[i]) for i in range(len(lr_val_pred_binary_jc_perturbed))]
lr_tn_rate_jc_perturbed = [metrics.recall_score(y_true=shopping_val_jc_perturb[i]['purchase'], y_pred=lr_val_pred_binary_jc_perturbed[i], pos_label=0) for i in range(len(lr_val_pred_binary_jc_perturbed))]

In [70]:
ls_tp_df_jc = pd.DataFrame(dict(tp_rate = ls_tp_rate_jc_perturbed,
                             model = 'LS'))
lr_tp_df_jc = pd.DataFrame(dict(tp_rate = lr_tp_rate_jc_perturbed,
                             model = 'Logistic regression'))

px.box(pd.concat([ls_tp_df_jc, lr_tp_df_jc]), x='model', y='tp_rate', title='True positive rate')

In [71]:
ls_tn_df_jc = pd.DataFrame(dict(tn_rate = ls_tn_rate_jc_perturbed,
                             model = 'LS'))
lr_tn_df_jc = pd.DataFrame(dict(tn_rate = lr_tn_rate_jc_perturbed,
                             model = 'Logistic regression'))

px.box(pd.concat([ls_tn_df_jc, lr_tn_df_jc]), x='model', y='tn_rate', title='True negative rate')

In [72]:
ls_acc_df_jc = pd.DataFrame(dict(acc = ls_acc_jc_perturbed,
                                 model = 'LS'))
lr_acc_df_jc = pd.DataFrame(dict(acc = lr_acc_jc_perturbed,
                                 model = 'Logistic regression'))

px.box(pd.concat([ls_acc_df_jc, lr_acc_df_jc]), x='model', y='acc', title='Accuracy')


We can also compare these performance measures across different judgment call options:

In [73]:
# look at the distribution of AUC across the different judgment call options
ls_auc_jc_perturbed = [metrics.roc_auc_score(y_true=shopping_val_jc_perturb[i]['purchase'], y_score=ls_val_pred_jc_perturbed[i]) for i in range(len(ls_val_pred_binary_jc_perturbed))]
lr_auc_jc_perturbed = [metrics.roc_auc_score(y_true=shopping_val_jc_perturb[i]['purchase'], y_score=lr_val_pred_jc_perturbed[i]) for i in range(len(lr_val_pred_binary_jc_perturbed))]



In [74]:
ls_jc_perturb_auc = pd.DataFrame(dict(auc = ls_auc_jc_perturbed,
                                      numeric_to_cat = perturb_options['numeric_to_cat'],
                                      month_numeric = perturb_options['month_numeric'],
                                      log_page = perturb_options['log_page'],
                                      remove_extreme = perturb_options['remove_extreme']))

px.box(ls_jc_perturb_auc.melt(id_vars=['auc'], 
                              value_vars=['numeric_to_cat', 'month_numeric', 'log_page', 'remove_extreme'], 
                              var_name='perturbation', 
                              value_name='value'),
       x='value', y='auc', facet_col='perturbation', facet_col_wrap=2,
       title='LS')

In [75]:
lr_jc_perturb_auc = pd.DataFrame(dict(auc = lr_auc_jc_perturbed,
                                      numeric_to_cat = perturb_options['numeric_to_cat'],
                                      month_numeric = perturb_options['month_numeric'],
                                      log_page = perturb_options['log_page'],
                                      remove_extreme = perturb_options['remove_extreme']))

px.box(lr_jc_perturb_auc.melt(id_vars=['auc'], 
                              value_vars=['numeric_to_cat', 'month_numeric', 'log_page', 'remove_extreme'], 
                              var_name='perturbation', 
                              value_name='value'),
       x='value', y='auc', facet_col='perturbation', facet_col_wrap=2,
       title='Logistic regression')
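
The `melt()` call above reshapes the table from one column per judgment call into a long format with one row per (fit, judgment call) pair, which is what lets `px.box` draw one facet per judgment call. Here is a toy sketch of the reshaping with two judgment calls and made-up AUC values:

```python
import pandas as pd

# made-up AUC values for four fits and two judgment calls (illustrative only)
auc_toy = pd.DataFrame({
    'auc': [0.88, 0.89, 0.90, 0.91],
    'log_page': [True, True, False, False],
    'remove_extreme': [True, False, True, False],
})

# long format: each fit contributes one row per judgment call
auc_long = auc_toy.melt(id_vars=['auc'],
                        value_vars=['log_page', 'remove_extreme'],
                        var_name='perturbation',
                        value_name='value')

# mean AUC for each option of each judgment call
auc_summary = auc_long.groupby(['perturbation', 'value'])['auc'].mean()
print(auc_summary)
```

Each AUC value appears once per judgment call in the long table, so every facet of the boxplot summarizes all of the fits, split by the option chosen for that one judgment call.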


Overall, the `log_page` judgment call is the only one that seems to make much of a difference: when we *do* log-transform the page variables, we tend to have fits with *lower* AUCs. However, the decrease in AUC is not too extreme.


Lastly, let's look at the distributions of the coefficients themselves. To do that, we need to extract the coefficients from the perturbed LS and logistic regression fits. However, since the coefficients are not inherently on the same scale, we need to either standardize the coefficients themselves (e.g., using the bootstrap), or standardize the features before we generate the fits. Since it is computationally easier to take the latter approach, the code below does exactly that:

In [76]:
results_jc_perturbed_standardized = Parallel(n_jobs=-1)(delayed(fit_models)(df, standardize=True) for df in shopping_jc_perturb)
ls_jc_perturbed_std, lr_jc_perturbed_std = zip(*results_jc_perturbed_standardized)

In [77]:
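# NOTE: these column labels come from the default pre-processed training data, so they
# only line up with fits whose judgment calls yield the same columns (shorter coefficient
# vectors are padded with NaN); In [79] rebuilds this table using the labels of each fit
perturbed_jc_data_coefs_ls_std = pd.DataFrame([x.coef_ for x in ls_jc_perturbed_std])
perturbed_jc_data_coefs_ls_std.columns = shopping_train_preprocessed.drop(columns='purchase').columns
perturbed_jc_data_coefs_ls_std.head()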

Unnamed: 0,administrative,administrative_duration,informational,informational_duration,product_related,product_related_duration,bounce_rates,exit_rates,page_values,special_day,visitor_type,weekend,month_dec,month_feb,month_jul,month_june,month_mar,month_may,month_nov,month_oct,month_sep,operating_systems_2,operating_systems_3,operating_systems_4,operating_systems_8,operating_systems_other,browser_10,browser_2,browser_3,browser_4,browser_5,browser_6,browser_8,browser_other,region_2,region_3,region_4,region_5,region_6,region_7,region_8,region_9,traffic_type_10,traffic_type_11,traffic_type_13,traffic_type_2,traffic_type_20,traffic_type_3,traffic_type_4,traffic_type_5,traffic_type_6,traffic_type_8,traffic_type_other
0,0.008933,0.001228,0.001638,0.012058,-0.001422,0.039691,-0.012388,-0.008738,0.161315,-0.006602,0.023457,-0.007701,0.003805,0.007227,-7.5e-05,-0.005079,-0.002725,-0.000298,0.000705,-0.007133,-0.006337,0.000768,-0.003814,-0.006343,-0.003876,-0.004174,0.003833,-0.001176,-0.004809,-0.005667,-0.002637,-0.001045,-0.007606,-0.007104,0.010394,0.009923,-0.005426,0.014624,0.011041,-0.003691,0.001214,0.00535,-0.001138,0.01508,0.001766,,,,,,,,
1,0.009699,0.000478,0.002194,0.011124,-0.001331,0.039618,-0.012283,-0.008901,0.161349,-0.006621,0.023375,-0.007668,0.003798,0.00701,-0.000126,-0.005182,-0.002749,-0.000304,0.000696,-0.007306,-0.006361,0.000744,-0.003854,-0.006343,-0.003932,-0.004206,0.003868,-0.001138,-0.004781,-0.005642,-0.002628,-0.001011,-0.007596,-0.007077,0.01039,0.009918,-0.005409,0.014611,0.01106,-0.003751,0.001174,0.005313,-0.001143,0.015071,0.001761,,,,,,,,
2,0.005267,0.001618,-0.001553,0.009265,0.004223,0.029751,0.026482,-0.04866,0.164363,-0.007656,0.021827,-0.008624,0.004425,0.008875,-0.000564,-0.003808,-0.003361,-0.000311,0.000457,-0.007343,-0.005965,0.001015,-0.003842,-0.006044,-0.003944,-0.003998,0.004347,-0.000217,-0.004161,-0.005701,-0.002296,-0.000637,-0.007017,-0.006483,0.010929,0.009844,-0.005764,0.016533,0.011438,-0.004138,0.002376,0.005499,-0.000703,0.015298,0.00175,,,,,,,,
3,0.007715,-0.001808,0.000438,0.006347,0.016344,0.015933,0.026779,-0.049526,0.164604,-0.007887,0.021731,-0.00814,0.004429,0.009215,-0.000435,-0.003978,-0.003185,-0.000126,0.000359,-0.007741,-0.005932,0.000904,-0.004073,-0.006011,-0.004028,-0.004127,0.004716,-0.000148,-0.004002,-0.005524,-0.002276,-0.000614,-0.006769,-0.00639,0.010915,0.009631,-0.005711,0.016178,0.011393,-0.004529,0.001941,0.005424,-0.000787,0.015447,0.001704,,,,,,,,
4,0.00916,0.001804,0.000813,0.012919,-0.005769,0.039911,-0.014794,-0.007276,0.161324,-0.00339,-0.009187,0.002034,-0.025612,-0.009403,0.003602,-0.00963,-0.022057,-0.022531,0.021552,-0.002574,-0.001808,0.007657,-0.000115,-0.004432,-0.00167,-0.00043,0.001002,-0.007324,-0.005852,0.000553,-0.003888,-0.006572,-0.003503,-0.002007,0.004506,-0.000185,-0.004366,-0.005796,-0.001704,0.000233,-0.006782,-0.007127,0.008776,0.006754,-0.008174,0.013228,0.010719,-0.006073,-0.000272,0.003104,-0.001702,0.013116,0.001665


In [78]:
perturbed_jc_data_coefs_lr_std = [pd.DataFrame(x.coef_, columns=shopping_jc_perturb[i].drop(columns='purchase').columns) for i, x in enumerate(lr_jc_perturbed_std)]
perturbed_jc_data_coefs_lr_std = pd.concat(perturbed_jc_data_coefs_lr_std)

In [79]:
perturbed_jc_data_coefs_ls_std = [pd.DataFrame([x.coef_], columns=shopping_jc_perturb[i].drop(columns='purchase').columns) for i, x in enumerate(ls_jc_perturbed_std)]
perturbed_jc_data_coefs_ls_std = pd.concat(perturbed_jc_data_coefs_ls_std)
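
Building one single-row data frame per fit and then concatenating matters because the judgment call combinations produce different sets of columns (for example, `month` is either a single numeric column or several dummy columns). `pd.concat` aligns the rows by column *name* and fills in `NaN` wherever a fit has no coefficient for a column. Here is a toy sketch with invented coefficient values and hypothetical column names:

```python
import numpy as np
import pandas as pd

# coefficients from two hypothetical fits with different feature sets
coef_fit_1 = pd.DataFrame([[0.16, 0.02]], columns=['page_values', 'month'])
coef_fit_2 = pd.DataFrame([[0.17, -0.03, 0.01]],
                          columns=['page_values', 'month_dec', 'month_nov'])

# rows are aligned by column name; missing coefficients become NaN
coefs_combined = pd.concat([coef_fit_1, coef_fit_2], ignore_index=True)
print(coefs_combined)

# summaries such as the mean skip the NaN entries by default
mean_abs_coef = coefs_combined.abs().mean()
print(mean_abs_coef)
```

As a consequence, the boxplot for a column that exists only under some judgment calls is based on fewer fits than the boxplot for a column that is always present.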


Next, we can use boxplots to visualize the distributions of the largest (in absolute value) coefficients for each fit (note that in the book, we arrange the LS plot in the same order as the logistic regression plot, but we haven't done that here).

In [80]:
# Sort the columns based on the absolute values of the coefficients
sorted_columns = perturbed_jc_data_coefs_ls_std.abs() \
    .mean() \
    .sort_values(ascending=False) \
    .head(20) \
    .index

# Create the boxplot with the sorted columns
fig = px.box(perturbed_jc_data_coefs_ls_std[sorted_columns],
             title='LS')
fig.update_yaxes(title_text='Standardized coefficient')


In [81]:
# Sort the columns based on the absolute values of the coefficients
sorted_columns = perturbed_jc_data_coefs_lr_std.abs() \
    .mean() \
    .sort_values(ascending=False) \
    .head(20) \
    .index

# Create the boxplot with the sorted columns
fig = px.box(perturbed_jc_data_coefs_lr_std[sorted_columns], 
             title='Logistic regression')
fig.update_yaxes(title_text='Standardized coefficient')



Looking at the coefficients, the `exit_rates` coefficient appears to be less stable to the judgment calls than the coefficients of the other variables.
