# Classification Model of Purchase Intentions
### DSCI100 Project Proposal Group 13

## Introduction
With e-commerce becoming increasingly prevalent, businesses in this sector need to understand which factors turn a visitor into a purchaser. Consequently, much current research focuses on real-time revenue prediction for these businesses. In this project, we aim to build a classification model that uses the available website usage data to predict whether a prospective buyer will make a purchase.

This project uses the Online Shoppers Purchasing Intention Dataset from the UCI Machine Learning Repository, which contains information about different users, such as the time spent on each type of web page and whether the visit led to a purchase.

The dataset consists of 12,330 rows of observations associated with 10 numerical and 8 categorical attributes. The dataset was formed so that each observation (session) would belong to a different user in a 1-year period to avoid any tendency to a specific campaign, special day, user profile, or period.

The attributes in the dataset can be split into three categories: data related to the page that the user lands on, Google Analytics metrics, and user visit data. The 'Revenue' attribute indicates whether a session ended in a purchase, and will be used as the class label for our project.
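Since `Revenue` is read in as a logical column, a classification workflow would typically need it as a factor before modelling. The following is a minimal sketch on made-up data, not the project's actual wrangling code; the tibble `toy_sessions` and its values are hypothetical and only mimic a few of the dataset's columns.

```r
library(tidyverse)

# Hypothetical toy data: five sessions with a logical Revenue column.
toy_sessions <- tibble(
  PageValues = c(0, 0, 12.5, 0, 30.1),
  ExitRates  = c(0.20, 0.10, 0.02, 0.14, 0.01),
  Revenue    = c(FALSE, FALSE, TRUE, FALSE, TRUE)
)

# Convert the class label from logical to factor.
toy_labelled <- toy_sessions %>%
  mutate(Revenue = factor(Revenue, levels = c(FALSE, TRUE)))

# Count sessions per class to see how balanced the label is.
class_counts <- toy_labelled %>%
  count(Revenue)
class_counts
```

On the real data, the same `count(Revenue)` call would show how many sessions ended in a purchase, which is useful to know before choosing an evaluation metric.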

## Methods & Results:
- describe in written English the methods you used to perform your analysis from beginning to end, narrating the code that does the analysis.
- your report should include code which:
    - loads data from the original source on the web 
    - wrangles and cleans the data from its original (downloaded) format to the format necessary for the planned analysis
    - performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis 
    - creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
    - performs the data analysis
    - creates a visualization of the analysis 
    - note: all tables and figures should have a figure/table number and a legend
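The analysis step in the list above will need data held out for evaluating the classifier. One common way to do this in tidymodels is a stratified train/test split, sketched below. This is an assumption about how the analysis might proceed, not a method the proposal commits to; the toy tibble, the `session_id` column, and the 75/25 proportion are all hypothetical.

```r
library(tidyverse)
library(tidymodels)

set.seed(3456)  # fixed seed so the random split is reproducible

# Hypothetical toy data: 40 sessions, 10 of which end in a purchase.
toy_sessions <- tibble(
  session_id = 1:40,                               # made-up identifier
  PageValues = c(rep(0, 30), seq(5, 50, by = 5)),
  Revenue    = factor(c(rep("FALSE", 30), rep("TRUE", 10)))
)

# Stratify on the class label so both sets keep a similar purchase rate.
toy_split <- initial_split(toy_sessions, prop = 0.75, strata = Revenue)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

nrow(toy_train)
nrow(toy_test)
```

Stratifying on `Revenue` matters most when one class is much rarer than the other, since an unstratified split could leave very few purchases in the test set.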

In [3]:
#importing libraries
library(tidyverse)
library(digest)
library(repr)
library(tidymodels)
library(cowplot)
options(repr.matrix.max.rows = 6)

In [4]:
# loading data from the original source on the web
set.seed(3456)
shoppers <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00468/online_shoppers_intention.csv")
head(shoppers)

Parsed with column specification:
cols(
  Administrative = col_double(),
  Administrative_Duration = col_double(),
  Informational = col_double(),
  Informational_Duration = col_double(),
  ProductRelated = col_double(),
  ProductRelated_Duration = col_double(),
  BounceRates = col_double(),
  ExitRates = col_double(),
  PageValues = col_double(),
  SpecialDay = col_double(),
  Month = col_character(),
  OperatingSystems = col_double(),
  Browser = col_double(),
  Region = col_double(),
  TrafficType = col_double(),
  VisitorType = col_character(),
  Weekend = col_logical(),
  Revenue = col_logical()
)



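| Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<lgl>` | `<lgl>` |
| 0 | 0 | 0 | 0 | 1 | 0.0 | 0.2 | 0.2 | 0 | 0 | Feb | 1 | 1 | 1 | 1 | Returning_Visitor | False | False |
| 0 | 0 | 0 | 0 | 2 | 64.0 | 0.0 | 0.1 | 0 | 0 | Feb | 2 | 2 | 1 | 2 | Returning_Visitor | False | False |
| 0 | 0 | 0 | 0 | 1 | 0.0 | 0.2 | 0.2 | 0 | 0 | Feb | 4 | 1 | 9 | 3 | Returning_Visitor | False | False |
| 0 | 0 | 0 | 0 | 2 | 2.666667 | 0.05 | 0.14 | 0 | 0 | Feb | 3 | 2 | 2 | 4 | Returning_Visitor | False | False |
| 0 | 0 | 0 | 0 | 10 | 627.5 | 0.02 | 0.05 | 0 | 0 | Feb | 3 | 3 | 1 | 4 | Returning_Visitor | True | False |
| 0 | 0 | 0 | 0 | 19 | 154.216667 | 0.01578947 | 0.0245614 | 0 | 0 | Feb | 2 | 2 | 1 | 3 | Returning_Visitor | False | False |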


### Tidying the data
`shoppers` satisfies the three criteria of tidy data ([Wickham and others 2014](#references)): each row is a single observation, each column is a single variable, and each value has its own cell. Therefore, no further wrangling is needed to make it tidy.
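Tidy structure does not guarantee that every column has the type the analysis needs. The column specification printed by `read_csv` shows `OperatingSystems`, `Browser`, `Region`, and `TrafficType` parsed as doubles, although they look like category codes. The sketch below shows one way such columns could be converted to factors and checked for missing values. It runs on made-up rows, and the choice of which columns to treat as categorical is an assumption rather than something stated in the proposal.

```r
library(tidyverse)

# Hypothetical toy rows that mimic a subset of the shoppers columns.
toy_shoppers <- tibble(
  ProductRelated   = c(1, 2, 10),
  Month            = c("Feb", "Feb", "Mar"),
  OperatingSystems = c(1, 2, 3),
  Browser          = c(1, 2, 3),
  Region           = c(1, 1, 9),
  TrafficType      = c(1, 2, 4),
  VisitorType      = c("Returning_Visitor", "Returning_Visitor", "New_Visitor"),
  Weekend          = c(FALSE, FALSE, TRUE)
)

# Assumed categorical predictors (not confirmed by the source).
categorical_cols <- c("Month", "OperatingSystems", "Browser", "Region",
                      "TrafficType", "VisitorType", "Weekend")

# Convert the assumed categorical columns to factors; leave the rest numeric.
toy_typed <- toy_shoppers %>%
  mutate(across(all_of(categorical_cols), as.factor))

# Count missing values in every column.
missing_per_column <- toy_typed %>%
  summarize(across(everything(), ~ sum(is.na(.x))))
missing_per_column
```

Treating the integer codes as factors keeps a model from reading an ordering into them, for example assuming that region 9 is somehow "greater" than region 1.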

## Discussion:
- summarize what you found
- discuss whether this is what you expected to find?
- discuss what impact could such findings have?
- discuss what future questions could this lead to?

## <a id='references'></a>References
At least 2 citations of literature relevant to the project (format is your choice, just be consistent across the references).
Make sure to cite the source of your data as well.

- Wickham, Hadley, and others. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.
- Cue consistency and page value perception: Implications for web-based catalog design.
- Analysis of Queues with Impatient Clients: An Application to Online Shopping