# Support Vector Machines (SVM) - Pipeline 2

SVM often achieves high accuracy relative to other classifiers such as logistic regression and decision trees, although the result depends on the dataset and on tuning. It is known for the kernel trick, which lets it handle **nonlinear** input spaces. It is used in a variety of applications such as face detection, intrusion detection, classification of emails, news articles and web pages, classification of genes, and handwriting recognition.

Support Vector Machines are generally considered a classification approach, but they can be applied to both classification and regression problems. They can handle multiple continuous and categorical variables, provided the categorical variables are encoded numerically. SVM constructs a hyperplane in multidimensional space to separate the classes. The optimal hyperplane is found iteratively by minimizing an error term. The core idea of SVM is to find the maximum marginal hyperplane (MMH) that best divides the dataset into classes.
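
To make the maximum-margin idea concrete, here is a minimal sketch on four hand-made points (the toy data and the very large `C` are assumptions for illustration, not part of this pipeline). For a linear kernel the margin width can be read off the fitted weights as `2 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two classes separated along the first axis
X = np.array([[-1.0, 0.0], [-1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                    # normal vector of the separating hyperplane
b = clf.intercept_[0]               # offset of the hyperplane
margin = 2.0 / np.linalg.norm(w)    # distance between the two margin boundaries

print("w:", w, "b:", b)
print("margin width:", margin)
print("support vectors:\n", clf.support_vectors_)
```

The points that end up in `support_vectors_` are the ones that pin the margin in place; moving any other point slightly would leave the hyperplane unchanged.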

In [126]:
#import pandas
import pandas as pd

# ignore warnings

import warnings
warnings.filterwarnings('ignore')

# load dataset
svmdata = pd.read_csv('./pipeline2.csv', header=0)
svmdata.head()

| Unnamed: 0 | AdministrativeSkew | Administrative_DurationSkew | InformationalSkew | Informational_DurationSkew | ProductRelatedSkew | ProductRelated_DurationSkew | BounceRatesSkew | ExitRatesSkew | PageValuesSkew | SpecialDay_0.0 | ... | VisitorType_New_Visitor | VisitorType_Other | VisitorType_Returning_Visitor | Weekend_False | Weekend_True | SeasonBins_1 | SeasonBins_2 | SeasonBins_3 | SeasonBins_4 | RevenueEnc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.0 | -0.0 | -0.0 | -0.0 | 0.001399 | -0.0 | 0.486759 | 0.19895 | -0.0 | 1 | ... | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | -0.0 | -0.0 | -0.0 | -0.0 | 0.002761 | 0.031306 | -0.0 | 0.177272 | -0.0 | 1 | ... | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | -0.0 | -0.0 | -0.0 | -0.0 | 0.001399 | -0.0 | 0.486759 | 0.19895 | -0.0 | 1 | ... | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | -0.0 | -0.0 | -0.0 | -0.0 | 0.002761 | 0.006454 | 0.462351 | 0.190387 | -0.0 | 1 | ... | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | -0.0 | -0.0 | -0.0 | -0.0 | 0.012429 | 0.089834 | 0.401902 | 0.136293 | -0.0 | 1 | ... | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |


## Select Features

Divide the columns into two groups: the dependent variable (the target) and the independent variables (the features).

In [127]:
svmdata.columns.values

array(['AdministrativeSkew', 'Administrative_DurationSkew',
       'InformationalSkew', 'Informational_DurationSkew',
       'ProductRelatedSkew', 'ProductRelated_DurationSkew',
       'BounceRatesSkew', 'ExitRatesSkew', 'PageValuesSkew',
       'SpecialDay_0.0', 'SpecialDay_0.2', 'SpecialDay_0.4',
       'SpecialDay_0.6', 'SpecialDay_0.8', 'SpecialDay_1.0',
       'OperatingSystems_1', 'OperatingSystems_2', 'OperatingSystems_3',
       'OperatingSystems_4', 'OperatingSystems_5', 'OperatingSystems_6',
       'OperatingSystems_7', 'OperatingSystems_8', 'Browser_1',
       'Browser_2', 'Browser_3', 'Browser_4', 'Browser_5', 'Browser_6',
       'Browser_7', 'Browser_8', 'Browser_9', 'Browser_10', 'Browser_11',
       'Browser_12', 'Browser_13', 'Region_1', 'Region_2', 'Region_3',
       'Region_4', 'Region_5', 'Region_6', 'Region_7', 'Region_8',
       'Region_9', 'TrafficType_1', 'TrafficType_2', 'TrafficType_3',
       'TrafficType_4', 'TrafficType_5', 'TrafficType_6', 'TrafficType_7',


In [128]:
feature_cols = ['AdministrativeSkew', 'Administrative_DurationSkew',
       'InformationalSkew', 'Informational_DurationSkew',
       'ProductRelatedSkew', 'ProductRelated_DurationSkew',
       'BounceRatesSkew', 'ExitRatesSkew', 'PageValuesSkew',
       'SpecialDay_0.0', 'SpecialDay_0.2', 'SpecialDay_0.4',
       'SpecialDay_0.6', 'SpecialDay_0.8', 'SpecialDay_1.0',
       'OperatingSystems_1', 'OperatingSystems_2', 'OperatingSystems_3',
       'OperatingSystems_4', 'OperatingSystems_5', 'OperatingSystems_6',
       'OperatingSystems_7', 'OperatingSystems_8', 'Browser_1',
       'Browser_2', 'Browser_3', 'Browser_4', 'Browser_5', 'Browser_6',
       'Browser_7', 'Browser_8', 'Browser_9', 'Browser_10', 'Browser_11',
       'Browser_12', 'Browser_13', 'Region_1', 'Region_2', 'Region_3',
       'Region_4', 'Region_5', 'Region_6', 'Region_7', 'Region_8',
       'Region_9', 'TrafficType_1', 'TrafficType_2', 'TrafficType_3',
       'TrafficType_4', 'TrafficType_5', 'TrafficType_6', 'TrafficType_7',
       'TrafficType_8', 'TrafficType_9', 'TrafficType_10',
       'TrafficType_11', 'TrafficType_12', 'TrafficType_13',
       'TrafficType_14', 'TrafficType_15', 'TrafficType_16',
       'TrafficType_17', 'TrafficType_18', 'TrafficType_19',
       'TrafficType_20', 'VisitorType_New_Visitor', 'VisitorType_Other',
       'VisitorType_Returning_Visitor', 'Weekend_False', 'Weekend_True',
       'SeasonBins_1', 'SeasonBins_2', 'SeasonBins_3', 'SeasonBins_4']
x = svmdata[feature_cols] # Features
y = svmdata.RevenueEnc # Target variable
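
When every column except the target is a feature, the long explicit list can be derived instead of typed out. The sketch below uses a small hypothetical frame (`toy`, with made-up values and only a few of the real column names) rather than `svmdata` itself.

```python
import pandas as pd

# Hypothetical stand-in for svmdata with a few of the notebook's column names
toy = pd.DataFrame({
    "PageValuesSkew": [0.0, 0.4, 0.1],
    "ExitRatesSkew": [0.19, 0.05, 0.13],
    "Weekend_True": [0, 1, 0],
    "RevenueEnc": [0, 1, 0],
})

target_col = "RevenueEnc"

# Every column except the target becomes a feature
x_toy = toy.drop(columns=[target_col])
y_toy = toy[target_col]

print(list(x_toy.columns))
print(y_toy.tolist())
```

An explicit list is still the safer choice when the CSV may carry extra columns (for example a saved index) that should not reach the model.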

In [135]:
feature_cols = ['PageValuesSkew','SeasonBins_4','VisitorType_New_Visitor','TrafficType_2','InformationalSkew','Informational_DurationSkew','SeasonBins_2',   
'TrafficType_3','TrafficType_13','OperatingSystems_3','Administrative_DurationSkew','TrafficType_1','BounceRatesSkew','SpecialDay_0.8','SeasonBins_1','ExitRatesSkew',
'OperatingSystems_2','VisitorType_Returning_Visitor','SpecialDay_0.4','AdministrativeSkew','TrafficType_20','SpecialDay_0.6','ProductRelated_DurationSkew', 'SpecialDay_1.0',  
'SpecialDay_0.0','Browser_3','Weekend_True','SpecialDay_0.2','TrafficType_5']
x = svmdata[feature_cols] # Features
y = svmdata.RevenueEnc # Target variable

## Split Data

Divide the dataset into a training set and a test set using `train_test_split()`.

Pass three arguments: the features, the target, and the test-set size. Setting `random_state` fixes the random shuffle, so the same split is produced on every run.

In [136]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3,random_state=2019)
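
Because SMOTE is applied later in this notebook, the target is presumably imbalanced. In that situation the `stratify` argument of `train_test_split` can keep the class ratio the same in both parts. The sketch below shows this on hypothetical data (`X_demo`, `y_demo`); it is an optional variation, not the split used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 negatives, 20 positives
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo keeps the positive rate roughly equal in both parts;
# random_state fixes the shuffle so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2019, stratify=y_demo
)

print("train size:", len(y_tr), "positive rate:", y_tr.mean())
print("test size: ", len(y_te), "positive rate:", y_te.mean())
```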


### Generating Model
Let's build the support vector machine model. First, import the SVM module and create a support vector classifier object by passing the RBF kernel (`kernel='rbf'`) to the `SVC()` function.

Then fit the model on the training set using `fit()` and predict on the test set using `predict()`.

In [137]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='rbf') # non-linear

#Train the model using the training sets
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)
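
To illustrate why a non-linear kernel can matter, the following sketch compares `kernel='linear'` and `kernel='rbf'` on synthetic concentric circles from `make_circles`. The data and variable names are assumptions for illustration only; nothing here says how the two kernels compare on `pipeline2.csv`.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: an inner circle of one class surrounded by a ring of the other
X_c, y_c = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_c, y_c, test_size=0.3, random_state=0)

scores = {}
for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel).fit(X_tr, y_tr)
    scores[kernel] = model.score(X_te, y_te)   # mean accuracy on held-out data

print(scores)
```

No straight line separates a circle from the ring around it, so the linear kernel struggles here while the RBF kernel can fit a closed boundary.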

### Evaluating the Model
Let's estimate how well the classifier predicts the target `RevenueEnc` on the test set.

The F1 score and AUC are computed by comparing the actual test-set values with the predicted values.

In [138]:
#Import scikit-learn metrics module for F1 and AUC calculation
from sklearn import metrics

# F1 Score
print("F1_Score:",metrics.f1_score(y_test, y_pred))

# AUC
print("AUC:",metrics.roc_auc_score(y_test, y_pred))

F1_Score: 0.6153846153846154
AUC: 0.7657522908129069
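
The cell above passes the hard 0/1 predictions to `roc_auc_score`. For binary labels that value equals the balanced accuracy (the mean of the true-positive and true-negative rates). A threshold-free AUC is usually computed from continuous scores such as `decision_function`. The sketch below shows both on synthetic data from `make_classification`; the data and variable names are assumptions, and the numbers it prints say nothing about this notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical imbalanced binary data standing in for the notebook's dataset
X_d, y_d = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d, y_d, test_size=0.3, random_state=0, stratify=y_d
)

model = SVC(kernel="rbf").fit(X_tr, y_tr)

labels = model.predict(X_te)             # hard 0/1 predictions
scores = model.decision_function(X_te)   # signed distance to the decision boundary

auc_from_labels = roc_auc_score(y_te, labels)
auc_from_scores = roc_auc_score(y_te, scores)

print("AUC from hard labels:", auc_from_labels)
print("Balanced accuracy:   ", balanced_accuracy_score(y_te, labels))
print("AUC from scores:     ", auc_from_scores)
```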


## With SMOTE
Now we will examine the results when applying SMOTE.

In [139]:
from imblearn.over_sampling import SMOTE

#create  oversampled data to train on
oversampler = SMOTE(random_state = 2019)
X_train_oversampled, y_train_oversampled = oversampler.fit_resample(X_train, y_train)
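
The core idea behind SMOTE is to create synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The function below (`smote_like`, a hypothetical name) is a simplified NumPy sketch of that idea on made-up data. It is not imblearn's implementation and omits its options.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating minority-class neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = X_min[rng.integers(len(X_min))]         # random minority sample
        dists = np.linalg.norm(X_min - base, axis=1)    # distances to all minority rows
        neighbours = np.argsort(dists)[1:k + 1]         # k nearest, skipping the sample itself
        other = X_min[rng.choice(neighbours)]           # one neighbour picked at random
        gap = rng.random()                              # interpolation factor in [0, 1)
        synthetic[i] = base + gap * (other - base)      # point on the segment base -> other
    return synthetic

# Hypothetical minority-class rows (two features)
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_rows = smote_like(X_minority, n_new=6)

print(new_rows.shape)
print(new_rows.round(3))
```

As in the cell above, oversampling belongs on the training split only, so the test set keeps the original class distribution.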

In [140]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='rbf') # non-linear

#Train the model using the training sets
clf.fit(X_train_oversampled,y_train_oversampled)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

In [141]:
#Import scikit-learn metrics module for F1 and AUC calculation
from sklearn import metrics

# F1 Score
print("F1_Score:",metrics.f1_score(y_test, y_pred))

# AUC
print("AUC:",metrics.roc_auc_score(y_test, y_pred))

F1_Score: 0.6282578875171467
AUC: 0.834181470168825


---

# Running SVM with Selected Features - Top 20 Highest Scores
Reducing the number of features to see whether the highest-scoring ones from [this feature selection process](2SVM_FeatSelect_Sean.ipynb) give better results.

In [38]:
# make copy of data

svmdata_FS = svmdata.copy()

In [157]:
# top 30
feature_cols_FS = ['PageValuesSkew','SeasonBins_4','VisitorType_New_Visitor','TrafficType_2','InformationalSkew','Informational_DurationSkew','SeasonBins_2',   
'TrafficType_3','TrafficType_13','OperatingSystems_3','Administrative_DurationSkew','TrafficType_1','BounceRatesSkew','SpecialDay_0.8','SeasonBins_1','ExitRatesSkew',
'OperatingSystems_2','VisitorType_Returning_Visitor','SpecialDay_0.4','AdministrativeSkew','TrafficType_20','SpecialDay_0.6','ProductRelated_DurationSkew', 'SpecialDay_1.0',  
'SpecialDay_0.0','Browser_3','Weekend_True','SpecialDay_0.2','TrafficType_5',
                  'TrafficType_10','TrafficType_15','TrafficType_7']

#top 15
#feature_cols_FS = ['PageValuesSkew','SeasonBins_4','VisitorType_New_Visitor','TrafficType_2','InformationalSkew','Informational_DurationSkew','SeasonBins_2',   
#'TrafficType_3','TrafficType_13','OperatingSystems_3','Administrative_DurationSkew','TrafficType_1','BounceRatesSkew','SpecialDay_0.8','SeasonBins_1']

# top 10
#feature_cols_FS = ['PageValuesSkew','SeasonBins_4','VisitorType_New_Visitor','TrafficType_2','InformationalSkew','Informational_DurationSkew','SeasonBins_2',   
#'TrafficType_3','TrafficType_13','OperatingSystems_3']


#top 20
#feature_cols_FS = ['PageValuesSkew','SeasonBins_4','VisitorType_New_Visitor','TrafficType_2','InformationalSkew','Informational_DurationSkew','SeasonBins_2',   
#'TrafficType_3','TrafficType_13','OperatingSystems_3','Administrative_DurationSkew','TrafficType_1','BounceRatesSkew','SpecialDay_0.8','SeasonBins_1','ExitRatesSkew',
#'OperatingSystems_2','VisitorType_Returning_Visitor','SpecialDay_0.4']



x2 = svmdata_FS[feature_cols_FS] # Features
y2 = svmdata_FS.RevenueEnc # Target variable
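
The commented-out lists above are prefixes of one another, which suggests the names are ordered by selection score. Under that assumption, a single ranked list plus a slice can replace the alternatives. The sketch below uses the first ten names from the list above; `top_k` is a hypothetical helper.

```python
# Ranked feature names copied from the notebook's list (first ten only);
# assumed to be ordered from highest to lowest selection score
ranked_features = [
    'PageValuesSkew', 'SeasonBins_4', 'VisitorType_New_Visitor', 'TrafficType_2',
    'InformationalSkew', 'Informational_DurationSkew', 'SeasonBins_2',
    'TrafficType_3', 'TrafficType_13', 'OperatingSystems_3',
]

def top_k(ranked, k):
    """Return the first k names, failing loudly if the list is too short."""
    if k > len(ranked):
        raise ValueError(f"asked for {k} features but only {len(ranked)} are ranked")
    return ranked[:k]

for k in (3, 5, 10):
    print(k, top_k(ranked_features, k))
```

Printing `len(...)` of the chosen list before fitting is a cheap check that the feature count matches the label in the comment.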

## Split Data

Divide the dataset into a training set and a test set using `train_test_split()`.

Pass three arguments: the features, the target, and the test-set size. Setting `random_state` fixes the random shuffle, so the same split is produced on every run.

In [158]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(x2, y2, test_size=0.3,random_state=2019)


### Generating Model
Let's build the support vector machine model. First, import the SVM module and create a support vector classifier object by passing the RBF kernel (`kernel='rbf'`) to the `SVC()` function.

Then fit the model on the training set using `fit()` and predict on the test set using `predict()`.

In [159]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='rbf') # non-linear

#Train the model using the training sets
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

### Evaluating the Model
Let's estimate how well the classifier predicts the target `RevenueEnc` on the test set.

The F1 score and AUC are computed by comparing the actual test-set values with the predicted values.
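
For reference, F1 is the harmonic mean of precision and recall for the positive class. The sketch below computes it from raw counts on hypothetical labels (`y_true_demo`, `y_pred_demo`); `f1_from_labels` is an illustrative helper, not a scikit-learn function.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0                      # no true positives, so F1 is zero
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 4 actual positives, 3 predicted positives, 2 of them correct
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print(f1_from_labels(y_true_demo, y_pred_demo))
```

Here precision is 2/3 and recall is 1/2, giving F1 = 4/7. Plain accuracy on the same labels would be 0.7, which shows how accuracy can look better than F1 when negatives dominate.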

In [160]:
#Import scikit-learn metrics module for F1 and AUC calculation
from sklearn import metrics

# F1 Score
print("F1_Score:",metrics.f1_score(y_test, y_pred))

# AUC
print("AUC:",metrics.roc_auc_score(y_test, y_pred))

F1_Score: 0.6191780821917809
AUC: 0.7683931358833294


## With SMOTE - BEST PERFORMER!
Now we will examine the results when applying SMOTE.

In [161]:
from imblearn.over_sampling import SMOTE

#create  oversampled data to train on
oversampler = SMOTE(random_state = 2019)
X_train_oversampled, y_train_oversampled = oversampler.fit_resample(X_train, y_train)

In [162]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='rbf') # non-linear

#Train the model using the training sets
clf.fit(X_train_oversampled,y_train_oversampled)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

In [163]:
#Import scikit-learn metrics module for F1 and AUC calculation
from sklearn import metrics


# F1 Score
print("F1_Score:",metrics.f1_score(y_test, y_pred))

# AUC
print("AUC:",metrics.roc_auc_score(y_test, y_pred))

F1_Score: 0.6274777853725222
AUC: 0.8344229783041912
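
To compare the four labelled runs side by side, the printed scores can be collected in one place. The sketch below copies the values from the outputs above; the run labels are my own shorthand, not names used elsewhere in the notebook.

```python
# Scores copied from the printed outputs above; run labels are illustrative
results = {
    "first feature list, no SMOTE": {"f1": 0.6153846153846154, "auc": 0.7657522908129069},
    "first feature list, SMOTE":    {"f1": 0.6282578875171467, "auc": 0.834181470168825},
    "selected features, no SMOTE":  {"f1": 0.6191780821917809, "auc": 0.7683931358833294},
    "selected features, SMOTE":     {"f1": 0.6274777853725222, "auc": 0.8344229783041912},
}

for name, s in results.items():
    print(f"{name:30s} F1={s['f1']:.4f}  AUC={s['auc']:.4f}")

# Pick the run with the highest value of each metric
best_auc = max(results, key=lambda n: results[n]["auc"])
best_f1 = max(results, key=lambda n: results[n]["f1"])
print("highest AUC:", best_auc)
print("highest F1: ", best_f1)
```

The two SMOTE runs differ only in the third or fourth decimal place, so the ranking between them may not survive a different `random_state`.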


In [149]:
F1_Score: 0.6308539944903582
AUC: 0.8351396305009874

In [None]:
F1_Score: 0.6322314049586776
AUC: 0.836179605579822