# Online Shoppers Purchasing Intention Prediction
Authors: Julian Daduica, Stephanie Ta, and Wai Ming Wong

In [16]:
from ucimlrepo import fetch_ucirepo # raw data is from this package
import pandas as pd
import altair as alt
import altair_ally as aly
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

## Summary

## Introduction

The growth of online shopping, or e-commerce, has fundamentally changed how people shop. Online shopping lets people browse many different stores from home, giving them more freedom over their time and choices. Retail e-commerce sales are estimated to exceed 4.1 trillion U.S. dollars worldwide in 2024, from roughly 2.7 billion online shoppers (Taheer, 2024; Commerce, 2024). In an increasingly consumer-driven society, it is important to understand consumers' behaviours as well as their intentions, so that businesses can optimize the online shopping experience and maximize revenue in such a massive industry. When shopping in person, a store employee can often gauge a person's purchasing intention through social cues. Online, businesses find it much harder to judge the intentions of their customers. They must instead rely on data about user interactions, such as page clicks, time spent on pages, and time of day or year. As website traffic continues to grow, businesses must distinguish between visitors with strong purchasing intentions and those who are simply browsing.

Machine learning is a powerful tool for analyzing and predicting online shoppers' purchasing intentions from behavioural and interaction data. Machine learning techniques can analyze features such as bounce rates, visitor type, and time of year to identify patterns that help predict purchasing intention. In this study, we aim to use a machine learning algorithm to predict online shoppers' purchasing intentions and to extract meaningful insights from user data. In such a lucrative field, determining purchasing intentions is vital for businesses seeking to increase revenue. It can help them find optimal sales and marketing techniques, or personalize each customer's experience on their website.

## Methods

### Data

The dataset used was sourced from the UCI Machine Learning Repository (Sakar, C.O., Polat, S.O., Katircioglu, M. et al. Neural Comput & Applic, 2018) and can be found [here](https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset). Each row in the dataset represents a web session on an e-commerce website, including details such as pages visited, time spent, and “Google Analytics” metrics for each page, such as “Bounce Rate”, “Exit Rate”, and “Page Value”. The “Special Day” variable highlights special events, while other web client attributes include OS, browser, region, traffic type, visitor type, and visit timing.

### Analysis

A logistic regression classification model was built to predict whether a customer would make a purchase on an e-commerce site based on their browsing behaviour during the session. All variables in the data set were used to fit the model. The data was split with 70% in the training set and 30% in the test set. The hyperparameter C was chosen using 5-fold cross-validation with accuracy as the classification metric. Numeric variables were standardized, categorical variables were one-hot encoded, and the month was ordinal encoded prior to model fitting. The Python programming language (Van Rossum and Drake, 2009) and the following Python packages were used to perform the analysis: numpy (Harris et al., 2020), Pandas (McKinney, 2010), altair (VanderPlas, 2018), and scikit-learn (Pedregosa et al., 2011). The data was fetched from the UCI Machine Learning Repository using the ucimlrepo package (Philip Truong et al.). The code used to perform the analysis and create this report can be found [here](https://github.com/UBC-MDS/Online-Shoppers-Purchasing-Intention-Predictionttimbefrs/breast_cancer_predictor_py).
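
As a rough illustration of the 5-fold cross-validation mentioned above, the sketch below partitions row indices into five folds and reports how many rows each fold would hold out for validation. It uses only the standard library, and `k_fold_indices` is a hypothetical helper. It also ignores the class stratification that scikit-learn applies by default for classifiers, so it is a simplified picture rather than the exact procedure used in the analysis.

```python
def k_fold_indices(n_rows, n_folds=5):
    """Yield (train_indices, validation_indices) for plain, unshuffled k-fold."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        # the first `extra` folds get one additional row each
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_rows))
        yield train, validation
        start = stop


n_train_rows = 8631  # number of training rows reported by train_df.info()
fold_sizes = [len(validation) for _, validation in k_fold_indices(n_train_rows)]
print(fold_sizes)
```

Each row is used for validation exactly once and for training in the other four folds, and the reported validation score is the mean over the five folds.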

## Results and Discussion

To investigate the features in our dataset, we first visualized the correlation between each pair of features using a heatmap. From this, we can see that features are not too correlated with each other, and strong correlations only appear when a feature is compared to itself.

We also plotted the distribution of each numeric feature using density plots and the distribution of each categorical feature using bar plots. These plots were coloured by the target (false: blue and true: orange). For the numeric features, the target class distributions overlap and have similar shapes, but we decided to keep these features in our model since they may be useful for prediction in combination with other features. For the categorical features, the target class distributions also appear similar, but again we kept these features since they may be useful for prediction in combination with other features. We also noticed an imbalance in our dataset: there are more observations with target = false and fewer observations with target = true. We did not account for this imbalance in our analysis since that would be out of scope for this project, which relies only on DSCI 571 knowledge.
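
To make the class imbalance concrete, here is a minimal sketch of how class proportions relate to a majority-class baseline. The labels are a made-up toy list (not the real data), and `majority_baseline_accuracy` is a hypothetical helper. The point is only that a classifier which always predicts the most frequent class scores an accuracy equal to that class's share of the observations.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# toy labels for illustration only: 17 sessions without a purchase, 3 with one
toy_labels = [False] * 17 + [True] * 3
label, accuracy = majority_baseline_accuracy(toy_labels)
print(label, accuracy)
```

On the real training set, the same proportions could be inspected with `y_train.value_counts(normalize=True)`.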

In [17]:
#Dataset importing script from UCI ML Repository
# fetch dataset 
online_shoppers_purchasing_intention_dataset = fetch_ucirepo(id=468) 

# data (as pandas dataframes) and save it as csv
X = online_shoppers_purchasing_intention_dataset.data.features 
y = online_shoppers_purchasing_intention_dataset.data.targets
df = pd.concat([X, y], axis=1)
df.to_csv("../data/raw/raw_df.csv")

# split the training set and testing set and save them as csv files
train_df, test_df = train_test_split(df, test_size=0.3, random_state=123)
train_df.to_csv("../data/processed/train_df.csv")
test_df.to_csv("../data/processed/test_df.csv")

# split X, y in the training set and testing set
X_train = train_df.drop(columns=["Revenue"])
X_test = test_df.drop(columns=["Revenue"])
y_train = train_df["Revenue"]
y_test = test_df["Revenue"]


In [18]:
# begin exploratory data analysis
train_df.describe()

,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,OperatingSystems,Browser,Region,TrafficType
count,8631.0,8631.0,8631.0,8631.0,8631.0,8631.0,8631.0,8631.0,8631.0,8631.0,8631.0,8631.0,8631.0,8631.0
mean,2.318851,80.035963,0.496582,33.735985,31.506546,1179.548652,0.022252,0.04318,5.765987,0.06333,2.129765,2.353261,3.150852,4.071371
std,3.326228,173.132521,1.244019,138.9954,44.119701,1895.590842,0.048634,0.048648,18.215382,0.202414,0.925164,1.727358,2.408261,4.011918
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,0.0,7.0,182.208333,0.0,0.014286,0.0,0.0,2.0,2.0,1.0,2.0
50%,1.0,7.0,0.0,0.0,18.0,593.70198,0.003077,0.025466,0.0,0.0,2.0,2.0,3.0,2.0
75%,4.0,93.115833,0.0,0.0,37.0,1439.177083,0.017124,0.05,0.0,0.0,3.0,2.0,4.0,4.0
max,27.0,3398.75,16.0,2549.375,584.0,63973.52223,0.2,0.2,361.763742,1.0,8.0,13.0,9.0,20.0


In [19]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8631 entries, 2476 to 3582
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           8631 non-null   int64  
 1   Administrative_Duration  8631 non-null   float64
 2   Informational            8631 non-null   int64  
 3   Informational_Duration   8631 non-null   float64
 4   ProductRelated           8631 non-null   int64  
 5   ProductRelated_Duration  8631 non-null   float64
 6   BounceRates              8631 non-null   float64
 7   ExitRates                8631 non-null   float64
 8   PageValues               8631 non-null   float64
 9   SpecialDay               8631 non-null   float64
 10  Month                    8631 non-null   object 
 11  OperatingSystems         8631 non-null   int64  
 12  Browser                  8631 non-null   int64  
 13  Region                   8631 non-null   int64  
 14  TrafficType              8

In [20]:
corr_df = (
    train_df
    .corr('spearman', numeric_only=True)
    .abs()                      
    .stack()                   
    .reset_index(name='corr')
    .query(('level_0 < level_1')))  
corr_df

plot1 = alt.Chart(corr_df).mark_rect().encode(
    x=alt.X('level_0:N', title='Feature 1'),
    y=alt.Y('level_1:N', title='Feature 2'),
    size=alt.Size('corr:Q', title='Correlation Strength'),
    color=alt.Color('corr:Q'),
    tooltip=['level_0', 'level_1', 'corr']
).properties(
    width=600,
    height=600,
    title="Correlation Heatmap Between all Features"
)

plot1

Figure 1. Correlation plot between all features in dataset

In [21]:
aly.alt.data_transformers.enable('vegafusion')

feature_density_plot = aly.dist(train_df, color='Revenue')

feature_density_plot

Figure 2. Density plots of each numeric feature, coloured by target class

In [22]:
# cast the boolean target to object so it can be used alongside the categorical features
feature_bar_plot = aly.dist(train_df.assign(Revenue=lambda df: df['Revenue'].astype(object)), dtype='object', color='Revenue')

feature_bar_plot

Figure 3. Bar plots of each categorical feature, coloured by target class

We chose a simple classification model based on the logistic regression algorithm. To find the model that best predicted whether a session ended in a purchase, we performed 5-fold cross-validation using accuracy as our metric of model prediction performance to select C (the inverse regularization strength). We observed that the optimal C was approximately 0.916.

In [23]:

# create baseline model to compare final model to
dummy_classifier = DummyClassifier()
dummy_classifier.fit(X_train, y_train)
dummy_cv_scores = pd.DataFrame(
    cross_validate(dummy_classifier, X_train, y_train, cv = 5, return_train_score = True))
mean_dummy_validation_accuracy = dummy_cv_scores['test_score'].mean()
mean_dummy_validation_accuracy

np.float64(0.8494960081213042)

In [24]:
# lists of each type of feature
numeric_cols = ['Administrative', 'Administrative_Duration',
                'Informational', 'Informational_Duration',
                'ProductRelated', 'ProductRelated_Duration',
                'BounceRates', 'ExitRates',
                'PageValues', 'SpecialDay']
categorical_cols = ['Weekend', 'OperatingSystems',
                    'Browser', 'Region',
                    'TrafficType', 'VisitorType']
ordinal_cols = ['Month']

In [25]:
# make preprocessor; note that imputation is not needed since there are no null values in the data set
month_levels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_cols),
    (OneHotEncoder(drop='if_binary', sparse_output=False, handle_unknown='ignore'), categorical_cols),
    (OrdinalEncoder(categories=[month_levels]), ordinal_cols)
)

In [26]:
# make pipeline including preprocessor and logistic regression model
log_reg_pipe = make_pipeline(
    preprocessor, LogisticRegression(max_iter=2000, random_state=123)
)

In [27]:
# tune hyperparameter C of the logistic regression model
param_grid = {
    "logisticregression__C": loguniform(1e-3, 1e3),
}

random_search = RandomizedSearchCV(
    log_reg_pipe,
    param_grid,
    n_iter=100,
    verbose=1,
    n_jobs=-1,
    random_state=123,
    return_train_score=True,
)

random_search.fit(X_train, y_train)

print("Best hyperparameter value: ", random_search.best_params_)
print("Best score: %0.3f" % (random_search.best_score_))

Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best hyperparameter value:  {'logisticregression__C': np.float64(0.916453820211066)}
Best score: 0.887


Cross-validation results:
- best C for logistic regression: 0.916
- validation accuracy of the best logistic regression model: 0.887
- validation accuracy of the dummy classifier: 0.849
- the best logistic regression model's validation accuracy is slightly (3.8 percentage points) higher than that of the dummy classifier
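
The search above draws candidate values of `C` from `loguniform(1e-3, 1e3)`. As a hedged illustration of what that means, the following standard-library sketch draws values whose base-10 logarithm is uniform on [-3, 3], which is the same distribution, and counts how many land in each decade. The helper name `sample_log_uniform`, the seed, and the sample size are illustrative assumptions, not part of the analysis.

```python
import math
import random


def sample_log_uniform(low, high, n, seed=0):
    """Draw n values whose base-10 logarithm is uniform between log10(low) and log10(high)."""
    rng = random.Random(seed)
    log_low, log_high = math.log10(low), math.log10(high)
    return [10 ** rng.uniform(log_low, log_high) for _ in range(n)]


samples = sample_log_uniform(1e-3, 1e3, n=6000)

# count samples per decade: [1e-3, 1e-2), [1e-2, 1e-1), ..., [1e2, 1e3]
decade_counts = [0] * 6
for value in samples:
    decade = int(math.floor(math.log10(value))) + 3
    decade = min(max(decade, 0), 5)  # guard against rounding at the boundaries
    decade_counts[decade] += 1

decade_shares = [count / len(samples) for count in decade_counts]
print([round(share, 2) for share in decade_shares])
```

Each decade receives roughly one sixth of the draws, so small and large values of `C` are explored about equally often. A plain uniform distribution over the same range would place almost all candidates above 100.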


In [28]:
# score dummy and best logistic regression model on test set
dummy_test_score = dummy_classifier.score(X_test, y_test)
log_reg_test_score = random_search.score(X_test, y_test)


print("Dummy classifier test score: %0.3f" % dummy_test_score)
print("Best logistic regression model test score: %0.3f" % log_reg_test_score)

Dummy classifier test score: 0.835
Best logistic regression model test score: 0.876




Test results:
- test accuracy of the best logistic regression model: 0.876
- test accuracy of the dummy classifier: 0.835
- the best logistic regression model's test accuracy is only slightly (about 4 percentage points) higher than that of the dummy classifier
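
The gaps quoted above are differences between two accuracies, so they are most naturally read as percentage points. The short sketch below, using the rounded scores printed by the notebook, computes both the absolute gap in percentage points and the relative improvement over the dummy classifier. `accuracy_gap` is a hypothetical helper introduced only for this illustration.

```python
def accuracy_gap(model_accuracy, baseline_accuracy):
    """Return (gap in percentage points, relative improvement in percent)."""
    difference = model_accuracy - baseline_accuracy
    points = difference * 100
    relative = difference / baseline_accuracy * 100
    return points, relative


# (logistic regression accuracy, dummy accuracy), rounded as printed in the notebook
scores = {
    "validation": (0.887, 0.849),
    "test": (0.876, 0.835),
}

gaps = {}
for split, (log_reg, dummy) in scores.items():
    gaps[split] = accuracy_gap(log_reg, dummy)
    points, relative = gaps[split]
    print(f"{split}: {points:.1f} percentage points, {relative:.1f}% relative")
```

Stating which of the two measures is meant avoids ambiguity, since the relative improvement is somewhat larger than the gap in percentage points.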

In [29]:
# find weights of each feature
best_estimator = random_search.best_estimator_
feature_names = best_estimator['columntransformer'].get_feature_names_out() # get feature names
weights = best_estimator["logisticregression"].coef_ # get feature coefficients

feat_weights = pd.DataFrame(weights, columns = feature_names)
feat_weights = feat_weights.T.reset_index()
feat_weights = feat_weights.rename(columns={'index': 'feature', 0: "weight"})
feat_weights['feature'] = feat_weights['feature'].str.split('__', expand = True)[1]
feat_weights.sort_values(by='weight', ascending=False)


,feature,weight
8,PageValues,1.433243
56,TrafficType_16,0.628456
30,Browser_12,0.620195
17,OperatingSystems_7,0.509110
51,TrafficType_11,0.491351
...,...,...
16,OperatingSystems_6,-0.475022
53,TrafficType_13,-0.606659
55,TrafficType_15,-0.625522
21,Browser_3,-0.805174


In [30]:
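# average the absolute weights of each original feature

# recover the original feature name by taking the text before the first underscore;
# note that this also groups names such as 'ProductRelated_Duration' under 'ProductRelated'
feat_weights['overall_feature'] = feat_weights['feature'].str.split('_', expand = True)[0]

# take absolute values so that positive and negative weights don't cancel each other out
feat_weights['absolute_value_weight'] = abs(feat_weights['weight'])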

pd.DataFrame(feat_weights.groupby('overall_feature'
    ).mean(numeric_only=True
    ).sort_values('absolute_value_weight', ascending = False
    )['absolute_value_weight'])

overall_feature,absolute_value_weight
PageValues,1.433243
ExitRates,0.833293
TrafficType,0.28751
Browser,0.238253
OperatingSystems,0.208122
VisitorType,0.16103
Weekend,0.159635
ProductRelated,0.145292
Region,0.080731
Month,0.077423


Based on the mean absolute weights, the `PageValues` and `ExitRates` features appear to be the most important for predicting the target.
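
One way to read these weights: in logistic regression, each coefficient is the change in the log-odds of the positive class per unit change in its feature, so exponentiating a coefficient gives an odds ratio. Because the numeric features were standardized, one unit corresponds to one standard deviation. The sketch below applies this to the `PageValues` weight reported above. `odds_ratio` is a hypothetical helper, and the interpretation assumes all other features are held fixed.

```python
import math


def odds_ratio(weight):
    """Convert a logistic regression coefficient (log-odds) into an odds ratio."""
    return math.exp(weight)


page_values_weight = 1.433243  # PageValues weight from the feature weight table
ratio = odds_ratio(page_values_weight)
print(f"Odds multiplier per 1 SD increase in PageValues: {ratio:.2f}")
```

Under these assumptions, a session whose `PageValues` is one standard deviation higher has roughly four times the odds of ending in a purchase.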

## References

Commerce, S. (2024, May 14). 43 ECommerce statistics in 2024 (global and U.S. data). SellersCommerce. https://www.sellerscommerce.com/blog/ecommerce-statistics/

Taheer, F. (2024, September 8). Online shopping statistics: How many people shop online in 2024? OptinMonster. https://optinmonster.com/online-shopping-statistics/

Sakar, C.O., Polat, S.O., Katircioglu, M. et al. Neural Comput & Applic (2018). “UCI Machine Learning Repository.” https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset

Van Rossum, Guido, and Fred L. Drake. (2009). Python 3 Reference Manual. Scotts Valley, CA: CreateSpace.

Harris, C.R. et al., (2020). Array programming with NumPy. Nature, 585, pp.357–362.

McKinney, Wes. (2010). “Data Structures for Statistical Computing in Python.” In Proceedings of the 9th Python in Science Conference, edited by Stéfan van der Walt and Jarrod Millman, 51–56.

VanderPlas, J. et al., (2018). Altair: Interactive statistical visualizations for python. Journal of open source software, 3(32), p.1057.

Pedregosa, F. et al., (2011). Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), pp.2825–2830.

Philip Truong et al. ucimlrepo package https://github.com/uci-ml-repo/ucimlrepo/tree/main

