# Feature Selection and Predictive Modeling of E-Commerce Purchase Intent

## Group 24

**Members:**  
Andrew Liu, Audra Cornick, Haoxi Jiang, Nazia Chowdhury

**Date:**  
December 1, 2025


# Introduction

In the e-commerce industry, predicting whether a customer is likely to make a purchase is essential for designing effective recommendation systems, targeted marketing strategies, and personalized user experiences (Ding et al., 2015; Rajamma et al., 2009). Unlike physical retail settings, where sales associates can draw on experience, intuition, and real-time interactions to guide customers, online platforms must rely entirely on algorithmic decision-making (Moe, 2003). This creates a strong need for data-driven models that can identify which aspects of a user’s browsing behaviour are most informative for predicting the likelihood of a purchase.

Because online shopping datasets often include many behavioural, technical, and session-related variables, traditional regression models can struggle with multicollinearity and with weak or redundant predictors. Regularization methods such as the Least Absolute Shrinkage and Selection Operator (LASSO) address this by shrinking coefficient estimates toward zero and setting some of them exactly to zero. This performs variable selection and reduces overfitting, which makes LASSO well suited to the many-predictor settings common in e-commerce analytics.
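To make the role of the penalty concrete, the LASSO estimate for a binary response can be written as a penalized logistic regression problem. The notation here is our own and assumes a logistic model for the purchase indicator: $y_i \in \{0, 1\}$ records whether session $i$ ends in a purchase, $x_i$ is its vector of $p$ features, and $\lambda \ge 0$ is the penalty value.

$$
(\hat{\beta}_0, \hat{\beta})(\lambda) = \arg\min_{\beta_0,\, \beta} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \left( \beta_0 + x_i^\top \beta \right) - \log\left( 1 + e^{\beta_0 + x_i^\top \beta} \right) \right] + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
$$

When $\lambda = 0$ this reduces to ordinary logistic regression. As $\lambda$ grows, more coefficients are set exactly to zero, so each penalty value corresponds to a different selected subset of features.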

**The guiding research question of this project is:**
 - Which combination of session-level and behavioural features provides the strongest predictive accuracy for determining whether an online shopper completes a purchase, and how does predictive performance change as we fit multiple LASSO models with varying penalty values?
 
Because the data come from naturally occurring online browsing sessions rather than a controlled experiment, it is not possible to draw causal conclusions about how specific variables influence purchasing behaviour. For this reason, our research question is intentionally focused on *prediction*, aiming to identify which combination of features yields the most accurate model rather than estimating the effect of any individual variable.

This predictive approach mirrors existing work in e-commerce analytics, where retailers commonly use machine-learning–based recommendation systems, clickstream models, and regularized regression techniques to forecast customer actions and personalize online experiences (Chen, 2025; Narvekar & Banu, 2015; Satu & Islam, 2023). By evaluating how different levels of LASSO regularization affect predictive performance, our project aligns with these established modeling strategies used across the industry.
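As a rough illustration of what fitting LASSO models across a range of penalty values can look like in R, the sketch below uses the `glmnet` package on simulated data. The feature names, sample size, and true coefficients are hypothetical and are not drawn from the project's dataset. The choice of 10-fold cross-validation with AUC as the performance measure is an assumption made for illustration, not the final analysis plan.

```r
# Illustrative sketch on simulated data; not the project's data or results.
library(glmnet)

set.seed(24)
n <- 500  # hypothetical number of sessions
p <- 10   # hypothetical number of features

# Hypothetical stand-ins for session-level features
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- paste0("feature_", seq_len(p))

# Simulated purchase indicator that depends on the first two features only
eta <- -1 + 1.5 * x[, 1] - 1 * x[, 2]
y <- rbinom(n, size = 1, prob = plogis(eta))

# Fit the LASSO path (alpha = 1) over glmnet's default lambda sequence,
# estimating out-of-sample AUC at each penalty value by 10-fold CV
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                    type.measure = "auc", nfolds = 10)

# One row per penalty value: cross-validated AUC and number of
# non-zero coefficients retained at that penalty
cv_summary <- data.frame(
  lambda    = cv_fit$lambda,
  cv_auc    = cv_fit$cvm,
  n_nonzero = cv_fit$nzero
)

# Coefficients at the penalty with the best cross-validated AUC
coef(cv_fit, s = "lambda.min")
```

Each row of `cv_summary` pairs a penalty value with its estimated predictive performance and model size, which is the kind of comparison the research question asks for. In a real analysis the final performance estimate should come from a held-out test set that was not used to choose the penalty.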


# Methods and Results

## Data

* read the data into R using reproducible code (i.e., from an open source and not a local directory in your server or computer)
* include a citation of its source
* include any information you have about data collection (e.g., observational vs experimental)
* describe the variables as done in your Stage 1 Report.
* if (absolutely) needed, indicate which variables will be pre-selected (or dropped) and provide a clear justification of your selection.
* If your goal is prediction, you should keep all variables in the analysis and perform variable selection based on model performance.


## EDA

- Clean and wrangle your data into a tidy format (review Tidyverse's style guide if needed)
- Include 2 effective and creative visualizations 
- explore the association of some potential explanatory variables with the response (use colours, point types, point size and/or faceting to include more variables)
- highlight potential problems (e.g., multicollinearity or outliers)
- You may utilize sub-plots as you did in Stage 1 Report.
- Use easily readable main/axis/legend titles, appropriately sized and without any underscores.
- Transform some variables if needed and include a clear explanation (e.g. log-transformation may be useful when outliers are present)
- Any summary tables that are relevant to your analysis (e.g., summarize number of observations in groups, indicate if NAs exist)
- Be sure not to print output that takes up a lot of screen space!
- Your EDA must be comprehensive, with high-quality plots.

## Methods

- Describe in written English the methods/models you used to perform your analysis from beginning to end.
- Provide a detailed justification of the method(s) used. The analysis must be based on methods learned in class.
- Make sure that the analysis responds to the question posed and that the proposed method is appropriate for the characteristics of the data.
- If a variable selection method is used, you need to describe and justify the method. Furthermore, explain what data will be used and how the final model will be chosen.
- Include a careful model assessment plan relevant to your goal (i.e. diagnostics and/or evaluation, however appropriate), with justifications.

## Results

- all the analysis code, from reading the data to visualizing results, must be based on clean, reproducible (e.g. read from an open source and not a local directory in your server or computer), and well-commented code.
- Include no more than 3 visualizations and/or tables to summarize and highlight your results.  Ensure your tables and/or figures are labelled with a figure/table number and readable fonts.
- You may utilize sub-plots as you did in Stage 1 Report.
- Use easily readable main/axis/legend titles, appropriately sized and without any underscores.
- Make sure to interpret/explain the results you obtain. It’s not enough to just say, “I fitted a linear model with these covariates, and my R-square is 0.87”.
- If inference is the aim of your project, a detailed interpretation of your fitted models will be required, as well as a discussion of relevant quantities. For example: which coefficients are statistically significant, which hypothesis tests are of interest, how the coefficients should be interpreted, and how well the model fits the data.
- Also explain briefly the key differences between your fitted models.
- If prediction is the aim, you must highlight the key outcomes from your model fitting/selection/prediction in written English.

# Discussion

In this section, you’ll interpret and reflect on the results you obtained in the previous section with respect to the main question/goal of your project.

- Summarize what you found and the implications/impact of your findings
- If relevant, discuss whether your results were what you expected to find
- Discuss how your model could be improved
- Discuss future questions/research this study could lead to


# References

Chen, X. (2025). Consumer online shopping behavior prediction based on machine learning algorithm. *Procedia Computer Science, 262*, 1395–1401. https://doi.org/10.1016/j.procs.2025.05.187

Ding, A. W., Li, S., & Chatterjee, P. (2015). Learning user real-time intent for optimal dynamic web page transformation. *Information Systems Research, 26*(2), 339–359. https://doi.org/10.1287/isre.2015.0568

Moe, W. W. (2003). Buying, searching, or browsing: Differentiating between online shoppers using in-store navigational clickstream. *Journal of Consumer Psychology, 13*(1–2), 29–39. https://doi.org/10.1207/S15327663JCP13-1&2_03

Narvekar, M., & Banu, S. S. (2015). Predicting user's web navigation behavior using hybrid approach. *Procedia Computer Science, 45*, 3–12. https://doi.org/10.1016/j.procs.2015.03.073

Rajamma, R. K., Paswan, A. K., & Hossain, M. M. (2009). Why do shoppers abandon shopping cart? Perceived waiting time, risk, and transaction inconvenience. *Journal of Product & Brand Management, 18*(3), 188–197. https://doi.org/10.1108/10610420910957816

Sakar, C. O., Polat, S., Katircioglu, M., & Kastro, Y. (2019). Real-time prediction of online shoppers’ purchasing intention using multilayer perceptron and LSTM recurrent neural networks. *Neural Computing & Applications*. https://doi.org/10.24432/C5F88Q

Satu, M. S., & Islam, S. F. (2023). Modeling online customer purchase intention behavior applying different feature engineering and classification techniques. *Discover Artificial Intelligence, 3*(36). https://doi.org/10.1007/s44163-023-00086-0


