# Data Information: **[Online Shoppers Purchasing Intention Data Set](https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset)**

## Source  
1. C. Okan Sakar  
Department of Computer Engineering  
Faculty of Engineering and Natural Sciences, Bahcesehir University  
34349 Besiktas, Istanbul, Turkey  

2. Yomi Kastro  
Inveon Information Technologies Consultancy and Trade  
34335 Istanbul, Turkey  

The dataset consists of feature vectors belonging to 12,330 sessions. The dataset was formed so that each session
would belong to a different user in a 1-year period, to avoid any bias toward a specific campaign, special day, user
profile, or period.



The dataset consists of 10 numerical and 8 categorical attributes.
The 'Revenue' attribute can be used as the class label.

## Numerical

"Administrative", "Administrative Duration", "Informational", "Informational Duration", "Product Related" and "Product Related Duration" represent the number of different types of pages visited by the visitor in that session and total time spent in each of these page categories. The values of these features are derived from the URL information of the pages visited by the user and updated in real time when a user takes an action, e.g. moving from one page to another. 

The **"Bounce Rate"**, **"Exit Rate"** and **"Page Value"** features represent the metrics measured by "Google Analytics" for each page in the e-commerce site.

The value of the **"Bounce Rate"** feature for a web page is the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session.  
The value of the **"Exit Rate"** feature for a specific web page is the percentage of all pageviews of that page that were the last in their session.  
The **"Page Value"** feature represents the average value of a web page that a user visited before completing an e-commerce transaction.  
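
To make the bounce and exit definitions concrete, here is a minimal sketch that computes both rates from a toy list of sessions. The session data and the `page_rates` helper are hypothetical, and the formulas are the simplified ones stated above, not Google Analytics' exact implementation.

```python
from collections import Counter

# Hypothetical toy sessions: each is the ordered list of pages viewed.
sessions = [
    ["home"],                     # single-page session: a bounce on "home"
    ["home", "product", "cart"],
    ["product"],                  # single-page session: a bounce on "product"
    ["home", "product"],
]

def page_rates(sessions):
    """Return {page: (bounce_rate, exit_rate)} using the simplified definitions."""
    entrances, bounces, views, exits = Counter(), Counter(), Counter(), Counter()
    for pages in sessions:
        if not pages:
            continue
        entrances[pages[0]] += 1          # session entered the site on this page
        if len(pages) == 1:
            bounces[pages[0]] += 1        # entered and left without another request
        exits[pages[-1]] += 1             # last pageview of the session
        views.update(pages)               # every pageview counts toward the exit rate
    rates = {}
    for page in views:
        bounce = bounces[page] / entrances[page] if entrances[page] else 0.0
        rates[page] = (bounce, exits[page] / views[page])
    return rates

rates = page_rates(sessions)
```

Note the different denominators: the bounce rate divides by sessions that *started* on the page, while the exit rate divides by all pageviews of the page.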

The **"Special Day"** feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother's Day, Valentine's Day) on which sessions are more likely to be finalized with a transaction. The value of this attribute is determined by considering the dynamics of e-commerce, such as the duration between the order date and the delivery date. For example, for Valentine's Day, this value is nonzero between February 2 and February 12, reaches its maximum of 1 on February 8, and is zero before and after this period unless the date is close to another special day.

## Categorical
The dataset also includes **"operatingsystems"**, **"browser"**, **"region"**, **"traffictype"**, **"visitortype"** (returning or new visitor), **"weekend"** (a Boolean value indicating whether the visit falls on a weekend), and **"month"** of the year. Together with the **"revenue"** class label, these make up the 8 categorical attributes.





# Data Download and Extraction to **'data'** folder

In [36]:
# !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00468/online_shoppers_intention.csv
# !mkdir data
# !mv online_shoppers_intention.csv data/online_shoppers_intention.csv
os.listdir('data')

['online_shoppers_intention.csv']
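
As a portable alternative to the commented-out shell commands, the same download can be sketched with the standard library alone. The `fetch_dataset` helper and its parameters are hypothetical names; the URL is the one from the cell above and is assumed to still be reachable.

```python
import os
import urllib.request

DATA_URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
            "00468/online_shoppers_intention.csv")

def fetch_dataset(url=DATA_URL, folder="data"):
    """Download `url` into `folder` unless the file already exists; return its path."""
    os.makedirs(folder, exist_ok=True)               # same effect as `mkdir data`
    dest = os.path.join(folder, os.path.basename(url))
    if not os.path.exists(dest):                     # skip the download on re-runs
        urllib.request.urlretrieve(url, dest)
    return dest
```

Calling `fetch_dataset()` once should leave `online_shoppers_intention.csv` in the `data` folder, and re-running the cell is harmless because an existing file is not fetched again.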

## Libraries etc

In [10]:
# import pandas as np #just joking
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

In [49]:
df = pd.read_csv(os.path.join('data', 'online_shoppers_intention.csv'))

In [50]:
print('# of records:', len(df))

# of records: 12330


## Cleaning 🧹

In [52]:
# Drop duplicate rows, reporting the record count before and after
if df.duplicated().any():
    print(f'There are {df.duplicated().sum()} duplicated values', '\n', '='*50)
    print(df[df.duplicated()], '\n', '='*50)
    print('records before duplicate removal:', len(df))
    df.drop_duplicates(inplace=True)
    print('records after duplicate removal:', len(df))

There are 125 duplicated values 
       Administrative  Administrative_Duration  Informational  \
158                 0                      0.0              0   
159                 0                      0.0              0   
178                 0                      0.0              0   
418                 0                      0.0              0   
456                 0                      0.0              0   
...               ...                      ...            ...   
11934               0                      0.0              0   
11938               0                      0.0              0   
12159               0                      0.0              0   
12180               0                      0.0              0   
12185               0                      0.0              0   

       Informational_Duration  ProductRelated  ProductRelated_Duration  \
158                       0.0               1                      0.0   
159                       0.0         
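
The behaviour of `duplicated()` and `drop_duplicates()` is easier to see on a small frame. The sketch below uses a hypothetical toy DataFrame: with the default `keep='first'`, the first occurrence of each row is kept and only the later copies are flagged.

```python
import pandas as pd

# Hypothetical toy data: rows 2 and 3 repeat row 0.
toy = pd.DataFrame({"pages": [0, 1, 0, 0], "duration": [0.0, 5.0, 0.0, 0.0]})

dup_mask = toy.duplicated()        # True only for the later copies (keep='first')
deduped = toy.drop_duplicates()    # keeps rows 0 and 1 with their original labels
```

Because the original index labels survive, the cleaned frame keeps gaps in its index; `reset_index(drop=True)` would renumber it if contiguous labels are wanted. Whether identical rows are true duplicates or distinct sessions that happen to share feature values cannot be told from the data alone, so dropping them is a judgment call.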

In [87]:
# lowercase all column names
df.columns = df.columns.str.lower()
for i, col in enumerate(df.columns):
    print(i, col)

0 administrative
1 administrative_duration
2 informational
3 informational_duration
4 productrelated
5 productrelated_duration
6 bouncerates
7 exitrates
8 pagevalues
9 specialday
10 month
11 operatingsystems
12 browser
13 region
14 traffictype
15 visitortype
16 weekend
17 revenue


In [57]:
# changing data types to category
cat_cols = ['month', 'operatingsystems', 'browser', 'region',
            'traffictype', 'visitortype', 'weekend', 'revenue']
for col in cat_cols:
    df[col] = df[col].astype('category')

In [72]:
if df.isna().sum().sum():
    print('Null Values')
else:
    print('No Nulls 👍')

No Nulls 👍


## Checking data quality and values

In [89]:
for i, col in enumerate(df.columns, start=1):
    pct_null = np.round(df[col].isnull().mean() * 100, 3)
    print(i, ':', col,
          '\n',
          '=' * 40, f'Number of unique values: **{df[col].nunique()}**',
          '\n',
          '=' * 40, f'Percentage of nulls: **%{pct_null}**',
          '\n',
          'Sorted unique values: \n',
          df[col].sort_values().unique(), '\n\n\n')

1 : administrative 
 Sorted unique values: 
 [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 26 27] 



2 : administrative_duration 
 Sorted unique values: 
 [0.00000000e+00 1.33333333e+00 2.00000000e+00 ... 2.65731806e+03
 2.72050000e+03 3.39875000e+03] 



3 : informational 
 Sorted unique values: 
 [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 16 24] 



4 : informational_duration 
 Sorted unique values: 
 [0.00000000e+00 1.00000000e+00 1.50000000e+00 ... 2.25203333e+03
 2.25691667e+03 2.54937500e+03] 



5 : productrelated 
 Sorted unique values: 
 [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 10
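
When the per-column printout gets long, the same checks can be collected into one small table. This is a sketch of an alternative, not the notebook's own method; the `summarize` helper and the toy frame are hypothetical.

```python
import pandas as pd

def summarize(frame):
    """One row per column: dtype, number of unique values, percentage of nulls."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(),                       # NaN is not counted
        "pct_null": (frame.isna().mean() * 100).round(3),
    })

toy = pd.DataFrame({"a": [1, 1, 2, None], "b": ["x", "y", "y", "y"]})
summary = summarize(toy)
```

Running `summarize(df)` on the cleaned data would give one compact row per feature, which is easier to scan than the full list of sorted unique values.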

In [90]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12205 entries, 0 to 12329
Data columns (total 18 columns):
administrative             12205 non-null int64
administrative_duration    12205 non-null float64
informational              12205 non-null int64
informational_duration     12205 non-null float64
productrelated             12205 non-null int64
productrelated_duration    12205 non-null float64
bouncerates                12205 non-null float64
exitrates                  12205 non-null float64
pagevalues                 12205 non-null float64
specialday                 12205 non-null float64
month                      12205 non-null category
operatingsystems           12205 non-null category
browser                    12205 non-null category
region                     12205 non-null category
traffictype                12205 non-null category
visitortype                12205 non-null category
weekend                    12205 non-null category
revenue                    12205 non-nul

In [92]:
df.to_pickle(os.path.join('data', 'cleaned_df.pkl'))
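
Pickle is a reasonable choice here because, unlike CSV, it preserves the `category` dtypes set during cleaning. A minimal round-trip sketch on a hypothetical toy frame, written to a temporary directory rather than the `data` folder:

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({
    "month": pd.Categorical(["Feb", "Mar"]),
    "revenue": [False, True],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)   # only unpickle files from a trusted source
```

The restored frame compares equal to the original, dtypes included, so a later notebook can load `cleaned_df.pkl` with `pd.read_pickle` and skip the cleaning steps.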