# Project 2: Clustering + SVM to Predict Online Purchases
# DAV 6150

- Name: Zhengnan Li
- Repository: [Project_2/Z_Li_Project2.ipynb](https://github.com/Zhengnan817/DAV-6150/blob/5b0700d14267b43c664aca6d699e92ac88a638bc/Project_2/Z_Li_Project2.ipynb)

In [2]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# 1. Introduction
For this project, I work with a data set comprising a variety of web site metrics. My objective is to use clustering algorithms to group similar observations in the data set, label each observation with its assigned group, and then, after completing the necessary EDA and data preparation, build an SVM model that predicts the most likely category of previously unseen observations. As the data science practitioner, I decide which features to include in the SVM models.  
In short, the goal is to help online retailers determine whether a given site visitor will actually make a purchase.

### 1.1 Approach:

- 1. [Introduction](#1-Introduction): Import the data set and introduce data variables.  
- 2. [Pre-Clustering EDA](#2-Pre-Clustering-EDA): Explore the raw dataset and do analysis based on domain knowledge.  
- 3. [Pre-Clustering Data Preparation](#3-Pre-Clustering-Data-Preparation): Perform data cleaning, imputation and transformation.  
- 4. [Cluster Modeling](#4-Cluster-Modeling): Explain and present hierarchical and K-means clustering work.
- 5. [Post-Clustering Exploratory Data Analysis](#5-Post-Clustering-Exploratory-Data-Analysis): Explain and present the post-clustering EDA work.
- 6. [Clustering Output vs. Actual Labels](#6-Clustering-Output-vs.-Actual-Labels): Compare the content of V_Revenue to the content of the Revenue column generated by the clustering algorithm.
- 7. [SVM Modeling](#7-SVM-Modeling): Present the SVM modeling work, including any feature selection methods and kernel functions used.  
- 8. [Select Models](#8-Select-Models): Explain how and why I chose the model selection criteria, then use the test data set to predict.
- 9. [Clustering + SVM Output vs. Actual Labels](#9-Clustering-+-SVM-Output-vs.-Actual-Labels): Compare the content of V_Revenue to the content of the Revenue column generated by the SVM algorithm.
- 10. [Conclusion](#10-Conclusion)
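
The cluster-then-classify idea outlined in the introduction can be sketched on synthetic data. This is only a minimal illustration, not the modeling done later in this notebook: the two-dimensional toy points, the choice of `KMeans` with two clusters, and the RBF kernel are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the session metrics: two loose groups of points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=6.0, scale=1.0, size=(100, 2)),
])

# Step 1: scale the features, then cluster to get a label for every row.
X_scaled = StandardScaler().fit_transform(X)
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# Step 2: treat the cluster labels as the target and train an SVM on them.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, cluster_labels, test_size=0.25, random_state=0, stratify=cluster_labels
)
svm = SVC(kernel="rbf").fit(X_train, y_train)

# Step 3: score the SVM on held-out rows it has not seen during training.
test_accuracy = svm.score(X_test, y_test)
print(f"Held-out accuracy against cluster labels: {test_accuracy:.2f}")
```

Note that the accuracy here only measures how well the SVM reproduces the cluster assignments; agreement with the true purchase outcome has to be checked separately, as in steps 6 and 9.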

### 1.2 Data Introduction

The data is sourced from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset. It comprises a variety of web site metrics. After loading the data set into the notebook, we can see that it has 12330 rows and 17 columns.


In [3]:
online_shop = pd.read_csv("https://raw.githubusercontent.com/Zhengnan817/DAV-6150/main/Project_2/src/Project2_Data.csv")
online_shop.head()

| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 8 | 222.0 | 0.0 | 0.028571 | 53.474571 | 0.0 | May | 1 | 1 | 1 | 2 | New_Visitor | True |
| 1 | 0 | 0.0 | 0 | 0.0 | 14 | 1037.5 | 0.014286 | 0.047619 | 0.0 | 0.0 | Mar | 2 | 2 | 4 | 2 | Returning_Visitor | False |
| 2 | 4 | 37.5 | 2 | 82.0 | 4 | 96.625 | 0.0 | 0.0175 | 0.0 | 0.0 | Nov | 2 | 2 | 9 | 2 | New_Visitor | False |
| 3 | 4 | 115.7 | 0 | 0.0 | 16 | 655.383333 | 0.0 | 0.012037 | 0.0 | 0.0 | Nov | 1 | 1 | 2 | 3 | Returning_Visitor | False |
| 4 | 1 | 60.0 | 1 | 15.0 | 26 | 670.166667 | 0.0 | 0.003846 | 0.0 | 0.0 | May | 2 | 2 | 3 | 4 | Returning_Visitor | False |


In [4]:
online_shop.shape

(12330, 17)

# 2. Pre-Clustering EDA

### 2.1 Statistical Summary

First, let's review a basic summary of the data set. It has no null or missing values, and its dtypes include int64, float64, object, and bool. In the visualization part, we will divide the columns into numerical and categorical variables.

In [5]:
online_shop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  object 
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  TrafficType           

In [7]:
# Count the unique values in each column
online_shop.nunique()

Administrative               27
Administrative_Duration    3335
Informational                17
Informational_Duration     1258
ProductRelated              311
ProductRelated_Duration    9551
BounceRates                1872
ExitRates                  4777
PageValues                 2704
SpecialDay                    6
Month                        10
OperatingSystems              8
Browser                      13
Region                        9
TrafficType                  20
VisitorType                   3
Weekend                       2
dtype: int64
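
The unique-value counts hint at how the columns might later be split into numerical and categorical variables: `OperatingSystems`, `Browser`, `Region`, and `TrafficType` are stored as int64 but take only a handful of values, which suggests they are codes rather than quantities. The sketch below shows one possible way to make that split on a toy frame. The frame and the `coded_as_int` list are hypothetical and chosen only for illustration; the actual split is made in the visualization part.

```python
import pandas as pd

# Toy frame mirroring a few of the column types in the data set.
toy = pd.DataFrame({
    "ProductRelated_Duration": [222.0, 1037.5, 96.625, 655.38],
    "Region": [1, 4, 9, 2],  # integer code, not a quantity
    "Month": ["May", "Mar", "Nov", "Nov"],
    "Weekend": [True, False, False, False],
})

# Hypothetical list of integer-coded columns, picked from domain knowledge.
coded_as_int = ["Region"]

# Non-numeric dtypes (strings, booleans) are categorical by construction;
# integer-coded columns are added by hand.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist() + coded_as_int
numerical_cols = [c for c in toy.columns if c not in categorical_cols]

print("categorical:", categorical_cols)
print("numerical:", numerical_cols)
```

Dtype alone is not enough here, which is why the integer-coded columns are listed explicitly instead of being inferred.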

In [11]:
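# Flag every row that exactly repeats an earlier row
duplicates = online_shop.duplicated(keep='first')
print(f"Number of duplicate rows: {duplicates.sum()}")

Number of duplicate rows: 125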
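
The check above finds 125 rows that repeat an earlier row. Whether such rows should be dropped is a judgment call, since two different sessions can legitimately share identical metrics. As a minimal sketch on a hypothetical toy frame, this is how `duplicated` and `drop_duplicates` relate when `keep='first'` is used:

```python
import pandas as pd

# Toy frame in which the third row repeats the first one exactly.
toy = pd.DataFrame({
    "ProductRelated": [8, 14, 8, 4],
    "Month": ["May", "Mar", "May", "Nov"],
})

# keep='first' flags only the second and later copies of a repeated row.
dup_mask = toy.duplicated(keep="first")
n_duplicates = int(dup_mask.sum())

# drop_duplicates removes exactly the rows flagged above.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates, len(deduped))
```

The first occurrence of each repeated row is kept, so the row count falls by exactly the number of flagged duplicates.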


In [9]:
# Basic statistical summary
online_shop.describe() 

| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | OperatingSystems | Browser | Region | TrafficType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 |
| mean | 2.315166 | 80.818611 | 0.503569 | 34.472398 | 31.731468 | 1194.74622 | 0.022191 | 0.043073 | 5.889258 | 0.061427 | 2.124006 | 2.357097 | 3.147364 | 4.069586 |
| std | 3.321784 | 176.779107 | 1.270156 | 140.749294 | 44.475503 | 1913.669288 | 0.048488 | 0.048597 | 18.568437 | 0.198917 | 0.911325 | 1.717277 | 2.401591 | 4.025169 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 184.1375 | 0.0 | 0.014286 | 0.0 | 0.0 | 2.0 | 2.0 | 1.0 | 2.0 |
| 50% | 1.0 | 7.5 | 0.0 | 0.0 | 18.0 | 598.936905 | 0.003112 | 0.025156 | 0.0 | 0.0 | 2.0 | 2.0 | 3.0 | 2.0 |
| 75% | 4.0 | 93.25625 | 0.0 | 0.0 | 38.0 | 1464.157214 | 0.016813 | 0.05 | 0.0 | 0.0 | 3.0 | 2.0 | 4.0 | 4.0 |
| max | 27.0 | 3398.75 | 24.0 | 2549.375 | 705.0 | 63973.52223 | 0.2 | 0.2 | 361.763742 | 1.0 | 8.0 | 13.0 | 9.0 | 20.0 |


### 2.2 Univariate distribution visualization

### 2.3 Multi-variable relationship visualization

# 3. Pre-Clustering Data Preparation

# 4. Cluster Modeling

# 5. Post-Clustering Exploratory Data Analysis

# 6. Clustering Output vs. Actual Labels

# 7. SVM Modeling

# 8. Select Models

# 9. Clustering + SVM Output vs. Actual Labels

# 10. Conclusion