# <b> Support Vector Machines (SVMs) </b>
___

<b> Table of Contents: </b>
<br> [Pipeline 1](#8071)
<br> [Pipeline 2](#8072)
<br> [Pipeline 3](#8073)

Loading Modules and Datasets

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import metrics

# Import `train_test_split` from `sklearn.model_selection`
from sklearn.model_selection import train_test_split

# Import the `svm` model from `sklearn`
from sklearn import svm

# import necessary packages  
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import confusion_matrix, classification_report 
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss



<a id = "8071"> <h2> Pipeline 1 </h2> </a>
___

In [2]:
# read csv file to a pandas dataframe
df_pipeline1 = pd.read_csv("pipeline_1.csv")

> Show all features and target in dataframe

In [3]:
# show all columns in dataset
list(df_pipeline1.columns)[:]

['Q4',
 'VisitorType_New_Visitor',
 'Q3',
 'TrafficType_2',
 'TrafficType_8',
 'TrafficType_3',
 'PageValues_iqr_yj_zscore',
 'Q1',
 'TrafficType_13',
 'ExitRates_iqr_yj_zscore',
 'OperatingSystems_3',
 'Administrative_Duration_iqr_yj_zscore',
 'TrafficType_1',
 'SpecialDay_0.8',
 'Month_Feb',
 'Browser_6',
 'SpecialDay_0.4',
 'TrafficType_20',
 'Informational_Duration_pp_iqr_yj_zscore',
 'Browser_12',
 'OperatingSystems_7',
 'TrafficType_16',
 'Revenue']

> Declare Features and Target

In [4]:
# Define Features and Target variables
X = df_pipeline1.iloc[:, :-1] # Features are all columns in the dataframe except the last column
Y = df_pipeline1.iloc[:, -1] # Target is the last column in the dataframe: 'Revenue'

In [5]:
# Split dataset into training set and test set 
# 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3,random_state=2019) 

In [6]:
# Create the SVC model by calling `svm.SVC()`
# the SVC model has a series of hyper-parameters
# for now, we use a basic set of hyperparameters:
# set `C` to 100 and `kernel` to 'linear'
# `gamma` is set to 0.001, but it only applies to the 'rbf', 'poly' and 'sigmoid' kernels, so the linear kernel ignores it
# name the model as `svc_model`
svc_model = svm.SVC(gamma=0.001, C=100, kernel='linear')

# Fit the data to the SVC model
# since this is supervised learning, you need to `fit` both `X_train` and `y_train` to it
svc_model.fit(X_train, y_train)

# Predict the label of `X_test` using `.predict()`
y_pred = svc_model.predict(X_test)

# Print the classification report using comparison of `y_test` and `y_pred`
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.89      0.92      3131
           1       0.56      0.75      0.65       568

    accuracy                           0.87      3699
   macro avg       0.76      0.82      0.78      3699
weighted avg       0.89      0.87      0.88      3699
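
The two average rows of the report can be reproduced from the per-class rows. The sketch below copies the rounded precision, recall, and support values printed above, so its results agree with the report only up to rounding. The helper names `macro_avg` and `weighted_avg` are illustrative and are not part of scikit-learn.

```python
# Per-class values copied from the classification report above (already rounded).
precision = {0: 0.95, 1: 0.56}
recall = {0: 0.89, 1: 0.75}
support = {0: 3131, 1: 568}

total = sum(support.values())  # number of rows in the test set


def macro_avg(metric):
    """Unweighted mean over classes: every class counts equally."""
    return sum(metric.values()) / len(metric)


def weighted_avg(metric):
    """Mean over classes weighted by support: large classes dominate."""
    return sum(metric[c] * support[c] for c in metric) / total


print(f"macro avg precision:    {macro_avg(precision):.3f}")
print(f"macro avg recall:       {macro_avg(recall):.3f}")
print(f"weighted avg precision: {weighted_avg(precision):.3f}")
print(f"weighted avg recall:    {weighted_avg(recall):.3f}")
```

The support-weighted recall is the same quantity as overall accuracy, which is why both rows show 0.87. The macro averages are lower because they give the weaker class 1 scores the same weight as the class 0 scores.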



<b> 1.1 Synthetic Minority Oversampling Technique (SMOTE) </b>

We oversample the training set because the classes of our target variable 'Revenue' are imbalanced:
* class 0: 84.53%
* class 1: 15.47%

For more information, see the detailed notebook by Prof. Jie Tao: [link](https://github.com/DrJieTao/ba545-docs/blob/master/competition2/handling_imbalanced_data_part2.ipynb)
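
As a quick check, the class shares quoted above can be recomputed from the per-class counts that this notebook reports elsewhere: 7291 and 1340 training rows before resampling, and 3131 and 568 test rows. This is a minimal sketch and the variable names are illustrative.

```python
# Per-class row counts taken from the notebook's own outputs.
train_counts = {0: 7291, 1: 1340}  # training split, before resampling
test_counts = {0: 3131, 1: 568}    # test split (the `support` column)

# Combine both splits to recover the full dataset.
total_counts = {c: train_counts[c] + test_counts[c] for c in train_counts}
n_rows = sum(total_counts.values())

# Share of each class as a percentage of all rows.
shares = {c: 100 * total_counts[c] / n_rows for c in total_counts}

for label, share in shares.items():
    print(f"class {label}: {share:.2f}%")
```
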

In [7]:
sm = SMOTE(random_state = 2019) 
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)  # `fit_sample` was removed in newer imbalanced-learn releases

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape)) 
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape)) 
  
print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1))) 
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0)))

After OverSampling, the shape of train_X: (14582, 22)
After OverSampling, the shape of train_y: (14582,) 

After OverSampling, counts of label '1': 7291
After OverSampling, counts of label '0': 7291




The SMOTE algorithm oversampled the minority class in the training set until it matched the majority class:
* Both classes (0 and 1) now have 7291 instances, so the training set is balanced.
* Class 1 grew from 1340 to 7291 instances, a gain of 5951 synthetic instances.
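
The core step behind those synthetic instances can be sketched in a few lines: SMOTE places each new point at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. The snippet below is a simplified illustration with made-up two-dimensional points, not the imbalanced-learn implementation. The function name `smote_like_sample` is hypothetical, and the neighbour is given directly instead of being found by a nearest-neighbour search.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Return a synthetic point on the segment between x and neighbor."""
    lam = rng.random()  # interpolation weight drawn from [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(2019)

# Made-up minority-class points (assumption: not taken from the dataset).
minority_point = [1.0, 2.0]
minority_neighbor = [3.0, 6.0]

synthetic = smote_like_sample(minority_point, minority_neighbor, rng)
print("synthetic point:", synthetic)

# Number of synthetic rows needed to balance the training set above.
n_synthetic_needed = 7291 - 1340
print("synthetic rows needed:", n_synthetic_needed)
```

Because every synthetic row is an interpolation of existing minority rows, SMOTE adds no new information about class 1. It only changes how much weight the classifier gives that class during training.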

In [8]:
svc_model1 = svm.SVC(gamma=0.001, C=100, kernel='linear')

# Fit the data to the SVC model
# since this is supervised learning, you need to `fit` both `X_train_res` and `y_train_res` to it
svc_model1.fit(X_train_res, y_train_res)

# Predict the label of `X_test` using `.predict()`
y_pred = svc_model1.predict(X_test)

# Print the classification report using comparison of `y_test` and `y_pred`
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      0.88      0.92      3131
           1       0.55      0.78      0.64       568

    accuracy                           0.87      3699
   macro avg       0.75      0.83      0.78      3699
weighted avg       0.89      0.87      0.88      3699



<b> 1.2 NearMiss Under-Sampling Technique </b>

In [9]:
# apply NearMiss (it is deterministic, so newer imbalanced-learn releases accept no `random_state`)
nr = NearMiss() 
  
X_train_miss, y_train_miss = nr.fit_resample(X_train, y_train) 
  
print('After Undersampling, the shape of train_X: {}'.format(X_train_miss.shape)) 
print('After Undersampling, the shape of train_y: {} \n'.format(y_train_miss.shape)) 
  
print("After Undersampling, counts of label '1': {}".format(sum(y_train_miss == 1))) 
print("After Undersampling, counts of label '0': {}".format(sum(y_train_miss == 0)))

After Undersampling, the shape of train_X: (2680, 22)
After Undersampling, the shape of train_y: (2680,) 

After Undersampling, counts of label '1': 1340
After Undersampling, counts of label '0': 1340




The NearMiss algorithm undersampled the majority class in the training set until it matched the minority class:
* Both classes (0 and 1) now have 1340 instances, so the training set is balanced.
* Class 0 shrank from 7291 to 1340 instances, a loss of 5951 instances.
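
How the surviving majority rows are chosen can be sketched as follows. Assuming the version-1 rule (the default in imbalanced-learn), NearMiss keeps the majority samples whose average distance to their k nearest minority samples is smallest. The code is a simplified pure-Python illustration with made-up points. The function name `nearmiss1_select` is hypothetical and this is not the library's implementation.

```python
import math


def nearmiss1_select(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points closest, on average, to their k nearest minority points."""

    def mean_knn_distance(point):
        distances = sorted(math.dist(point, m) for m in minority)
        nearest = distances[:k]
        return sum(nearest) / len(nearest)

    return sorted(majority, key=mean_knn_distance)[:n_keep]


# Made-up 2-D points (assumption: not taken from the dataset).
minority = [(0.0, 0.0), (1.0, 0.0)]
majority = [(0.5, 0.5), (5.0, 5.0), (0.0, 1.0), (9.0, 9.0)]

# Keep as many majority points as there are minority points.
kept = nearmiss1_select(majority, minority, n_keep=len(minority), k=2)
print(kept)
```

The rule keeps the majority rows that sit closest to the minority class and discards those far from it. The remaining class 0 rows are therefore not a representative sample of class 0.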

In [10]:
# train the model on the undersampled training set 
svc_model2 = svm.SVC(gamma=0.001, C=100, kernel='linear')
svc_model2.fit(X_train_miss, y_train_miss) 

# Predict the label of `X_test` using `.predict()`
y_pred = svc_model2.predict(X_test)

# print classification report 
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.76      0.85      3131
           1       0.39      0.85      0.53       568

    accuracy                           0.77      3699
   macro avg       0.68      0.81      0.69      3699
weighted avg       0.88      0.77      0.80      3699



<a id = "8072"> <h2> Pipeline 2 </h2> </a>
___

In [11]:
# read csv file to a pandas dataframe
df_pipeline2 = pd.read_csv("pipeline_2.csv")

> Show all features and target in dataframe

In [12]:
# show all columns in dataset
list(df_pipeline2.columns)[:]

['TrafficType_15',
 'Month_Nov',
 'Administrative_Duration_mm_yj_stdev',
 'VisitorType_New_Visitor',
 'Informational_mm_yj_stdev',
 'TrafficType_2',
 'TrafficType_3',
 'ProductRelated_mm_yj_stdev',
 'PageValues_mm_yj_stdev',
 'Month_May',
 'TrafficType_13',
 'OperatingSystems_3',
 'TrafficType_1',
 'add_exit_bounce_rates_mm_yj_stdev',
 'Month_Mar',
 'TrafficType_18',
 'TrafficType_8',
 'SpecialDay_0.8',
 'Month_Feb',
 'TrafficType_12',
 'Browser_12',
 'Revenue']

> Declare Features and Target

In [13]:
# Define Features and Target variables
X = df_pipeline2.iloc[:, :-1] # Features are all columns in the dataframe except the last column
Y = df_pipeline2.iloc[:, -1] # Target is the last column in the dataframe: 'Revenue'

In [14]:
# Split dataset into training set and test set 
# 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3,random_state=2019) 

In [15]:
# Create the SVC model by calling `svm.SVC()`
# the SVC model has a series of hyper-parameters
# for now, we use a basic set of hyperparameters:
# set `C` to 100 and `kernel` to 'linear'
# `gamma` is set to 0.001, but it only applies to the 'rbf', 'poly' and 'sigmoid' kernels, so the linear kernel ignores it
# name the model as `svc_model`
svc_model = svm.SVC(gamma=0.001, C=100, kernel='linear')

# Fit the data to the SVC model
# since this is supervised learning, you need to `fit` both `X_train` and `y_train` to it
svc_model.fit(X_train, y_train)

# Predict the label of `X_test` using `.predict()`
y_pred = svc_model.predict(X_test)

# Print the classification report using comparison of `y_test` and `y_pred`
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.92      0.93      3131
           1       0.60      0.68      0.64       568

    accuracy                           0.88      3699
   macro avg       0.77      0.80      0.78      3699
weighted avg       0.89      0.88      0.88      3699



<b> 2.1 Synthetic Minority Oversampling Technique (SMOTE) </b>

We oversample the training set because the classes of our target variable 'Revenue' are imbalanced:
* class 0: 84.53%
* class 1: 15.47%

For more information, see the detailed notebook by Prof. Jie Tao: [link](https://github.com/DrJieTao/ba545-docs/blob/master/competition2/handling_imbalanced_data_part2.ipynb)

In [16]:
sm = SMOTE(random_state = 2019) 
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)  # `fit_sample` was removed in newer imbalanced-learn releases

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape)) 
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape)) 
  
print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1))) 
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0)))

After OverSampling, the shape of train_X: (14582, 21)
After OverSampling, the shape of train_y: (14582,) 

After OverSampling, counts of label '1': 7291
After OverSampling, counts of label '0': 7291




The SMOTE algorithm oversampled the minority class in the training set until it matched the majority class:
* Both classes (0 and 1) now have 7291 instances, so the training set is balanced.
* Class 1 grew from 1340 to 7291 instances, a gain of 5951 synthetic instances.

In [17]:
svc_model1 = svm.SVC(gamma=0.001, C=100, kernel='linear')

# Fit the data to the SVC model
# since this is supervised learning, you need to `fit` both `X_train_res` and `y_train_res` to it
svc_model1.fit(X_train_res, y_train_res)

# Predict the label of `X_test` using `.predict()`
y_pred = svc_model1.predict(X_test)

# Print the classification report using comparison of `y_test` and `y_pred`
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.90      0.92      3131
           1       0.57      0.75      0.65       568

    accuracy                           0.87      3699
   macro avg       0.76      0.82      0.79      3699
weighted avg       0.89      0.87      0.88      3699



<b> 2.2 NearMiss Under-Sampling Technique </b>

In [18]:
# apply NearMiss (it is deterministic, so newer imbalanced-learn releases accept no `random_state`)
nr = NearMiss() 
  
X_train_miss, y_train_miss = nr.fit_resample(X_train, y_train) 
  
print('After Undersampling, the shape of train_X: {}'.format(X_train_miss.shape)) 
print('After Undersampling, the shape of train_y: {} \n'.format(y_train_miss.shape)) 
  
print("After Undersampling, counts of label '1': {}".format(sum(y_train_miss == 1))) 
print("After Undersampling, counts of label '0': {}".format(sum(y_train_miss == 0)))

After Undersampling, the shape of train_X: (2680, 21)
After Undersampling, the shape of train_y: (2680,) 

After Undersampling, counts of label '1': 1340
After Undersampling, counts of label '0': 1340




The NearMiss algorithm undersampled the majority class in the training set until it matched the minority class:
* Both classes (0 and 1) now have 1340 instances, so the training set is balanced.
* Class 0 shrank from 7291 to 1340 instances, a loss of 5951 instances.

In [19]:
# train the model on the undersampled training set 
svc_model2 = svm.SVC(gamma=0.001, C=100, kernel='linear')
svc_model2.fit(X_train_miss, y_train_miss) 

# Predict the label of `X_test` using `.predict()`
y_pred = svc_model2.predict(X_test)

# print classification report 
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.65      0.77      3131
           1       0.29      0.78      0.42       568

    accuracy                           0.67      3699
   macro avg       0.62      0.72      0.60      3699
weighted avg       0.84      0.67      0.72      3699



<a id = "8073"> <h2> Pipeline 3 </h2> </a>
___

In [20]:
# read csv file to a pandas dataframe
df_pipeline3 = pd.read_csv("pipeline_3.csv")

> Show all features and target in dataframe

In [21]:
# show all columns in dataset
list(df_pipeline3.columns)[:]

['Administrative_yj_stdev_zscore',
 'Month_Nov',
 'VisitorType_New_Visitor',
 'TrafficType_2',
 'Month_May',
 'TrafficType_3',
 'add_exit_bounce_rates_yj_stdev_zscore',
 'TrafficType_13',
 'PageValues_yj_stdev_zscore',
 'OperatingSystems_3',
 'TrafficType_1',
 'Month_Mar',
 'TrafficType_8',
 'SpecialDay_0.8',
 'Month_Feb',
 'Month_Dec',
 'SpecialDay_0.4',
 'TrafficType_20',
 'Month_Oct',
 'Region_1',
 'Browser_12',
 'OperatingSystems_7',
 'TrafficType_16',
 'Revenue']

> Declare Features and Target

In [22]:
# Define Features and Target variables
X = df_pipeline3.iloc[:, :-1] # Features are all columns in the dataframe except the last column
Y = df_pipeline3.iloc[:, -1] # Target is the last column in the dataframe: 'Revenue'

In [23]:
# Split dataset into training set and test set 
# 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3,random_state=2019) 

In [24]:
# Create the SVC model by calling `svm.SVC()`
# the SVC model has a series of hyper-parameters
# for now, we use a basic set of hyperparameters:
# set `C` to 100 and `kernel` to 'linear'
# `gamma` is set to 0.001, but it only applies to the 'rbf', 'poly' and 'sigmoid' kernels, so the linear kernel ignores it
# name the model as `svc_model`
svc_model = svm.SVC(gamma=0.001, C=100, kernel='linear')

# Fit the data to the SVC model
# since this is supervised learning, you need to `fit` both `X_train` and `y_train` to it
svc_model.fit(X_train, y_train)

# Predict the label of `X_test` using `.predict()`
y_pred = svc_model.predict(X_test)

# Print the classification report using comparison of `y_test` and `y_pred`
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.90      0.92      3131
           1       0.57      0.75      0.64       568

    accuracy                           0.87      3699
   macro avg       0.76      0.82      0.78      3699
weighted avg       0.89      0.87      0.88      3699



<b> 3.1 Synthetic Minority Oversampling Technique (SMOTE) </b>

We oversample the training set because the classes of our target variable 'Revenue' are imbalanced:
* class 0: 84.53%
* class 1: 15.47%

For more information, see the detailed notebook by Prof. Jie Tao: [link](https://github.com/DrJieTao/ba545-docs/blob/master/competition2/handling_imbalanced_data_part2.ipynb)

In [25]:
sm = SMOTE(random_state = 2019) 
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)  # `fit_sample` was removed in newer imbalanced-learn releases

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape)) 
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape)) 
  
print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1))) 
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0)))

After OverSampling, the shape of train_X: (14582, 23)
After OverSampling, the shape of train_y: (14582,) 

After OverSampling, counts of label '1': 7291
After OverSampling, counts of label '0': 7291




The SMOTE algorithm oversampled the minority class in the training set until it matched the majority class:
* Both classes (0 and 1) now have 7291 instances, so the training set is balanced.
* Class 1 grew from 1340 to 7291 instances, a gain of 5951 synthetic instances.

In [26]:
svc_model1 = svm.SVC(gamma=0.001, C=100, kernel='linear')

# Fit the data to the SVC model
# since this is supervised learning, you need to `fit` both `X_train_res` and `y_train_res` to it
svc_model1.fit(X_train_res, y_train_res)

# Predict the label of `X_test` using `.predict()`
y_pred = svc_model1.predict(X_test)

# Print the classification report using comparison of `y_test` and `y_pred`
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      0.88      0.92      3131
           1       0.55      0.78      0.64       568

    accuracy                           0.87      3699
   macro avg       0.75      0.83      0.78      3699
weighted avg       0.89      0.87      0.88      3699



<b> 3.2 NearMiss Under-Sampling Technique </b>

In [27]:
# apply NearMiss (it is deterministic, so newer imbalanced-learn releases accept no `random_state`)
nr = NearMiss() 
  
X_train_miss, y_train_miss = nr.fit_resample(X_train, y_train) 
  
print('After Undersampling, the shape of train_X: {}'.format(X_train_miss.shape)) 
print('After Undersampling, the shape of train_y: {} \n'.format(y_train_miss.shape)) 
  
print("After Undersampling, counts of label '1': {}".format(sum(y_train_miss == 1))) 
print("After Undersampling, counts of label '0': {}".format(sum(y_train_miss == 0)))

After Undersampling, the shape of train_X: (2680, 23)
After Undersampling, the shape of train_y: (2680,) 

After Undersampling, counts of label '1': 1340
After Undersampling, counts of label '0': 1340




The NearMiss algorithm undersampled the majority class in the training set until it matched the minority class:
* Both classes (0 and 1) now have 1340 instances, so the training set is balanced.
* Class 0 shrank from 7291 to 1340 instances, a loss of 5951 instances.

In [28]:
# train the model on the undersampled training set 
svc_model2 = svm.SVC(gamma=0.001, C=100, kernel='linear')
svc_model2.fit(X_train_miss, y_train_miss) 

# Predict the label of `X_test` using `.predict()`
y_pred = svc_model2.predict(X_test)

# print classification report 
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      0.87      0.91      3131
           1       0.52      0.78      0.63       568

    accuracy                           0.86      3699
   macro avg       0.74      0.83      0.77      3699
weighted avg       0.89      0.86      0.87      3699

