# Support Vector Machines (SVM) - Pipeline 1

SVM often achieves high accuracy compared to other classifiers such as logistic regression and decision trees, although results depend on the dataset and on tuning. It is known for its kernel trick, which handles **nonlinear** input spaces. It is used in a variety of applications such as face detection, intrusion detection, classification of emails, news articles and web pages, classification of genes, and handwriting recognition.

Generally, Support Vector Machines are considered a classification approach, but they can be employed in both classification and regression problems. They can handle multiple continuous and categorical variables, provided the categorical ones are encoded numerically. SVM constructs a hyperplane in multidimensional space to separate different classes. The optimal hyperplane is found iteratively by minimizing an error term. The core idea of SVM is to find a maximum marginal hyperplane (MMH) that best divides the dataset into classes.
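To make the maximum-margin idea concrete, here is a minimal sketch on four made-up points (not taken from `pipeline1.csv`). It assumes a linear kernel and a very large `C` to approximate a hard margin; for a linear SVM the margin width is `2 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: class 0 on the left, class 1 on the right
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM
toy_clf = SVC(kernel="linear", C=1e6)
toy_clf.fit(X_toy, y_toy)

# For a linear kernel the hyperplane is w . x + b = 0
w = toy_clf.coef_[0]
b = toy_clf.intercept_[0]

# Distance between the two margin boundaries
margin_width = 2.0 / np.linalg.norm(w)

print("w:", w, "b:", b)
print("margin width:", margin_width)
print("support vectors:\n", toy_clf.support_vectors_)
```

The support vectors are the training points that lie on the margin boundaries; only they determine where the hyperplane sits.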

Via https://docs.google.com/presentation/d/1clcb0XypiyFs77psLixKSaQM8_b_piwv3grOa8MNplE/edit#slide=id.g4ca7083693_0_199
- "In practice, a low degree (<5) polynomial kernel or RBF kernel with a reasonable width is a good initial try. Note that SVM with RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions automatically chosen for SVM. Usually done in trial-and-error." - we used an RBF kernel since our data isn't linear.
- "SVM are originally designed for binary classification" - this fits our problem, since our target (`RevenueEnc`) is binary. "Create “dummies” for multi-class target, then train multiple binary classifiers – then combine them" - this resembles the one-hot encoding we applied to our categorical features, although with a binary target we did not need multiple classifiers.
- "Usually the best classification performance" - this is a classification problem, so makes sense.
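The "trial-and-error" kernel choice mentioned in the notes can be made a little more systematic with cross-validation. The sketch below is illustrative only: it uses a synthetic `make_moons` dataset rather than our data, and the three kernel settings are assumptions chosen to mirror the quote (linear, low-degree polynomial, RBF).

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic, nonlinearly separable data (stand-in for a real dataset)
X_demo, y_demo = make_moons(n_samples=300, noise=0.2, random_state=2019)

# Candidate kernels to compare
candidates = {
    "linear": {"kernel": "linear"},
    "poly (degree 3)": {"kernel": "poly", "degree": 3},
    "rbf": {"kernel": "rbf"},
}

kernel_scores = {}
for name, params in candidates.items():
    # 5-fold cross-validated F1 score for each kernel
    scores = cross_val_score(SVC(**params), X_demo, y_demo, cv=5, scoring="f1")
    kernel_scores[name] = scores.mean()
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```

On real data the same loop would take `X_train` and `y_train`, so that the test set stays untouched while kernels are compared.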

In [1]:
#import pandas
import pandas as pd

# ignore warnings

import warnings
warnings.filterwarnings('ignore')

# load dataset
svmdata = pd.read_csv('./pipeline1.csv', header=0)
svmdata.head()

Unnamed: 0,AdministrativeSkew,Administrative_DurationSkew,InformationalSkew,Informational_DurationSkew,ProductRelatedSkew,ProductRelated_DurationSkew,BounceRatesSkew,ExitRatesSkew,PageValuesSkew,SpecialDay_0.0,...,VisitorType_New_Visitor,VisitorType_Other,VisitorType_Returning_Visitor,Weekend_False,Weekend_True,SeasonBins_1,SeasonBins_2,SeasonBins_3,SeasonBins_4,RevenueEnc
0,-0.990128,-0.996659,-0.520831,-0.492257,-1.922123,-2.096783,1.503281,1.982547,-0.533268,1,...,0,0,1,1,0,1,0,0,0,0
1,-0.990128,-0.996659,-0.520831,-0.492257,-1.574524,-1.074189,-1.036838,1.569866,-0.533268,1,...,0,0,1,1,0,1,0,0,0,0
2,-0.990128,-0.996659,-0.520831,-0.492257,-1.922123,-2.096783,1.503281,1.982547,-0.533268,1,...,0,0,1,1,0,1,0,0,0,0
3,-0.990128,-0.996659,-0.520831,-0.492257,-1.574524,-1.875436,1.354717,1.832073,-0.533268,1,...,0,0,1,1,0,1,0,0,0,0
4,-0.990128,-0.996659,-0.520831,-0.492257,-0.44261,0.057515,1.002737,0.72246,-0.533268,1,...,0,0,1,0,1,1,0,0,0,0


## Select Features

Divide the given columns into two types of variables: the dependent variable (the target) and the independent variables (the features).

In [2]:
svmdata.columns.values

array(['AdministrativeSkew', 'Administrative_DurationSkew',
       'InformationalSkew', 'Informational_DurationSkew',
       'ProductRelatedSkew', 'ProductRelated_DurationSkew',
       'BounceRatesSkew', 'ExitRatesSkew', 'PageValuesSkew',
       'SpecialDay_0.0', 'SpecialDay_0.2', 'SpecialDay_0.4',
       'SpecialDay_0.6', 'SpecialDay_0.8', 'SpecialDay_1.0',
       'OperatingSystems_1', 'OperatingSystems_2', 'OperatingSystems_3',
       'OperatingSystems_4', 'OperatingSystems_5', 'OperatingSystems_6',
       'OperatingSystems_7', 'OperatingSystems_8', 'Browser_1',
       'Browser_2', 'Browser_3', 'Browser_4', 'Browser_5', 'Browser_6',
       'Browser_7', 'Browser_8', 'Browser_9', 'Browser_10', 'Browser_11',
       'Browser_12', 'Browser_13', 'Region_1', 'Region_2', 'Region_3',
       'Region_4', 'Region_5', 'Region_6', 'Region_7', 'Region_8',
       'Region_9', 'TrafficType_1', 'TrafficType_2', 'TrafficType_3',
       'TrafficType_4', 'TrafficType_5', 'TrafficType_6', 'TrafficType_7',


In [3]:
feature_cols = ['AdministrativeSkew', 'Administrative_DurationSkew',
       'InformationalSkew', 'Informational_DurationSkew',
       'ProductRelatedSkew', 'ProductRelated_DurationSkew',
       'BounceRatesSkew', 'ExitRatesSkew', 'PageValuesSkew',
       'SpecialDay_0.0', 'SpecialDay_0.2', 'SpecialDay_0.4',
       'SpecialDay_0.6', 'SpecialDay_0.8', 'SpecialDay_1.0',
       'OperatingSystems_1', 'OperatingSystems_2', 'OperatingSystems_3',
       'OperatingSystems_4', 'OperatingSystems_5', 'OperatingSystems_6',
       'OperatingSystems_7', 'OperatingSystems_8', 'Browser_1',
       'Browser_2', 'Browser_3', 'Browser_4', 'Browser_5', 'Browser_6',
       'Browser_7', 'Browser_8', 'Browser_9', 'Browser_10', 'Browser_11',
       'Browser_12', 'Browser_13', 'Region_1', 'Region_2', 'Region_3',
       'Region_4', 'Region_5', 'Region_6', 'Region_7', 'Region_8',
       'Region_9', 'TrafficType_1', 'TrafficType_2', 'TrafficType_3',
       'TrafficType_4', 'TrafficType_5', 'TrafficType_6', 'TrafficType_7',
       'TrafficType_8', 'TrafficType_9', 'TrafficType_10',
       'TrafficType_11', 'TrafficType_12', 'TrafficType_13',
       'TrafficType_14', 'TrafficType_15', 'TrafficType_16',
       'TrafficType_17', 'TrafficType_18', 'TrafficType_19',
       'TrafficType_20', 'VisitorType_New_Visitor', 'VisitorType_Other',
       'VisitorType_Returning_Visitor', 'Weekend_False', 'Weekend_True',
       'SeasonBins_1', 'SeasonBins_2', 'SeasonBins_3', 'SeasonBins_4']
x = svmdata[feature_cols] # Features
y = svmdata.RevenueEnc # Target variable
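Typing out every column name works, but the same feature list can be derived by excluding the non-feature columns. This is a sketch on a hypothetical three-row frame; the column names are borrowed from the output above, and treating `Unnamed: 0` as a leftover index column is an assumption.

```python
import pandas as pd

# Hypothetical miniature frame standing in for svmdata
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "PageValuesSkew": [-0.53, 1.20, 0.40],
    "Weekend_True": [0, 1, 0],
    "RevenueEnc": [0, 1, 0],
})

target_col = "RevenueEnc"
# Columns that are neither features nor useful predictors (assumed here)
non_features = [target_col, "Unnamed: 0"]

# Everything else is treated as a feature
demo_feature_cols = [c for c in demo.columns if c not in non_features]
x_demo = demo[demo_feature_cols]   # Features
y_demo = demo[target_col]          # Target variable

print(demo_feature_cols)
```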

## Split Data

Divide the dataset into a training set and a test set.

Use the `train_test_split()` function.

Pass three arguments: the features, the target, and the test set size. Setting `random_state` fixes the random shuffle so that the split is reproducible.

In [4]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3,random_state=2019)
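Because the target is imbalanced, one option worth considering is a stratified split, which keeps the class ratio the same in both sets. The notebook itself does not use `stratify`; the sketch below runs on synthetic data with an assumed 15% positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 15% positives (assumed ratio)
rng = np.random.default_rng(2019)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 150 + [0] * 850)

# stratify=y_demo preserves the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2019, stratify=y_demo
)

print("positive rate (train):", y_tr.mean())
print("positive rate (test): ", y_te.mean())
```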


### Generating Model
Let's build a support vector machine model. First, import the SVM module and create a support vector classifier object by passing the RBF kernel as the `kernel` argument to `SVC()`.

Then, fit the model on the training set using `fit()` and predict on the test set using `predict()`.

In [5]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='rbf') # non-linear

#Train the model using the training sets
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

### Evaluating the Model
Let's estimate how well the classifier predicts the target, `RevenueEnc`.

We compare the actual test set values with the predicted values using the F1 score and AUC.

In [6]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# F1 Score
print("F1_Score:",metrics.f1_score(y_test, y_pred))

# AUC (computed from hard class labels here, which equals balanced accuracy)
print("AUC:",metrics.roc_auc_score(y_test, y_pred))

F1_Score: 0.6258503401360545
AUC: 0.7612533231969267
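A note on the AUC figure: `roc_auc_score` is usually given continuous scores, and when it receives 0/1 predictions it reduces to balanced accuracy. The sketch below shows the difference on synthetic data (not our dataset), using `decision_function()` to get the scores; the dataset parameters are assumptions.

```python
from sklearn import metrics, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data as a stand-in
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=2019
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2019, stratify=y_demo
)

demo_clf = svm.SVC(kernel="rbf").fit(X_tr, y_tr)

# AUC from hard labels: a single operating point on the ROC curve
auc_from_labels = metrics.roc_auc_score(y_te, demo_clf.predict(X_te))
# AUC from signed distances to the hyperplane: the full ROC curve
auc_from_scores = metrics.roc_auc_score(y_te, demo_clf.decision_function(X_te))
# Balanced accuracy, for comparison with the label-based AUC
balanced_acc = metrics.balanced_accuracy_score(y_te, demo_clf.predict(X_te))

print("AUC from labels:  ", auc_from_labels)
print("Balanced accuracy:", balanced_acc)
print("AUC from scores:  ", auc_from_scores)
```

The label-based figures reported in this notebook are still comparable with each other, since they are all computed the same way.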


## With SMOTE
Now we will examine the results when applying SMOTE.

In [9]:
from imblearn.over_sampling import SMOTE

#create  oversampled data to train on
oversampler = SMOTE(random_state = 2019)
X_train_oversampled, y_train_oversampled = oversampler.fit_resample(X_train, y_train)
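For intuition, SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The function below is a simplified NumPy sketch of that idea, not the `imblearn` implementation; `smote_like_samples` and its parameters are hypothetical names.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=2019):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified)."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of each sample
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new point
    base_idx = rng.integers(0, n, size=n_new)
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the connecting segment
    gaps = rng.random((n_new, 1))
    return X_minority[base_idx] + gaps * (X_minority[neigh_idx] - X_minority[base_idx])

# Four hypothetical minority points on the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6, k=2)
print(synthetic)
```

Note that interpolation produces fractional values for one-hot columns; `imblearn` also provides `SMOTENC` for mixed continuous and categorical data. As in the cell above, oversampling is applied to the training set only, so the test set keeps its original class balance.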

In [None]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='rbf') # non-linear

#Train the model using the training sets
clf.fit(X_train_oversampled,y_train_oversampled)

#Predict the response for test dataset
y_pred_smote = clf.predict(X_test)

In [None]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# F1 Score
print("F1_Score:",metrics.f1_score(y_test, y_pred_smote))

# AUC
print("AUC:",metrics.roc_auc_score(y_test, y_pred_smote))

---

# Running SVM with Selected Features - Top 15 Highest Scores
Reducing the number of features to see whether the highest-scoring ones from [this feature selection process](1SVM_FeatSelect_Sean.ipynb) give better results.

In [None]:
# make copy of data

svmdata_FS = svmdata.copy()

In [None]:
# top 16
#feature_cols_FS =     ['PageValuesSkew','ExitRatesSkew','ProductRelated_DurationSkew','ProductRelatedSkew','AdministrativeSkew','Administrative_DurationSkew',
#'SeasonBins_4','BounceRatesSkew','SeasonBins_2','TrafficType_2','InformationalSkew','Informational_DurationSkew','VisitorType_New_Visitor','VisitorType_Returning_Visitor',
#'SpecialDay_0.0','TrafficType_3']


# top 30
#feature_cols_FS =     ['PageValuesSkew','ExitRatesSkew','ProductRelated_DurationSkew','ProductRelatedSkew','AdministrativeSkew','Administrative_DurationSkew',
#'SeasonBins_4','BounceRatesSkew','SeasonBins_2','TrafficType_2','InformationalSkew','Informational_DurationSkew','VisitorType_New_Visitor','VisitorType_Returning_Visitor',
#'SpecialDay_0.0','TrafficType_3','OperatingSystems_3','TrafficType_13','TrafficType_1','OperatingSystems_2','TrafficType_8','SpecialDay_0.8','SeasonBins_1',
#'SpecialDay_0.4','TrafficType_20','SpecialDay_0.6','Weekend_True','Weekend_False','SpecialDay_1.0','Browser_3'] 

# top 15 - best performer (with SMOTE)
feature_cols_FS =     ['PageValuesSkew','ExitRatesSkew','ProductRelated_DurationSkew','ProductRelatedSkew','AdministrativeSkew','Administrative_DurationSkew',
'SeasonBins_4','BounceRatesSkew','SeasonBins_2','TrafficType_2','InformationalSkew','Informational_DurationSkew','VisitorType_New_Visitor','VisitorType_Returning_Visitor',
'SpecialDay_0.0']

# top 10
#feature_cols_FS =     ['PageValuesSkew','ExitRatesSkew','ProductRelated_DurationSkew','ProductRelatedSkew','AdministrativeSkew','Administrative_DurationSkew',
#'SeasonBins_4','BounceRatesSkew','SeasonBins_2','TrafficType_2']


x2 = svmdata_FS[feature_cols_FS] # Features
y2 = svmdata_FS.RevenueEnc # Target variable
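Instead of commenting feature lists in and out, the comparison across feature counts can be scripted. The sketch below is an assumption-laden illustration: it ranks features with `SelectKBest` and `f_classif`, which may differ from the method used in the linked feature selection notebook, and it runs on synthetic data with hypothetical column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data with 12 hypothetical feature columns
X_arr, y_arr = make_classification(
    n_samples=400, n_features=12, n_informative=4, random_state=2019
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(12)])

# Score every feature, then rank from highest to lowest score
selector = SelectKBest(score_func=f_classif, k="all").fit(X_demo, y_arr)
ranked = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)

# Evaluate an RBF SVM on the top-k features for several values of k
results = {}
for k in (3, 6, 12):
    cols = ranked.index[:k].tolist()
    f1 = cross_val_score(SVC(kernel="rbf"), X_demo[cols], y_arr, cv=5, scoring="f1").mean()
    results[k] = f1
    print(f"top {k}: mean F1 = {f1:.3f}")
```

For a stricter estimate, the ranking step would be fitted inside each cross-validation fold rather than on the full data.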

## Split Data

Divide the dataset into a training set and a test set.

Use the `train_test_split()` function.

Pass three arguments: the features, the target, and the test set size. Setting `random_state` fixes the random shuffle so that the split is reproducible.

In [None]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(x2, y2, test_size=0.3,random_state=2019)


### Generating Model
Let's build a support vector machine model. First, import the SVM module and create a support vector classifier object by passing the RBF kernel as the `kernel` argument to `SVC()`.

Then, fit the model on the training set using `fit()` and predict on the test set using `predict()`.

In [13]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='rbf') # non-linear

#Train the model using the training sets
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

### Evaluating the Model
Let's estimate how well the classifier predicts the target, `RevenueEnc`.

We compare the actual test set values with the predicted values using the F1 score and AUC.

In [14]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# F1 Score
print("F1_Score:",metrics.f1_score(y_test, y_pred))

# AUC
print("AUC:",metrics.roc_auc_score(y_test, y_pred))

F1_Score: 0.6288461538461537
AUC: 0.7646965713154686


## With SMOTE - BEST PERFORMER!
Now we will examine the results when applying SMOTE.

In [15]:
from imblearn.over_sampling import SMOTE

#create  oversampled data to train on
oversampler = SMOTE(random_state = 2019)
X_train_oversampled, y_train_oversampled = oversampler.fit_resample(X_train, y_train)

In [16]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='rbf') # non-linear

#Train the model using the training sets
clf.fit(X_train_oversampled,y_train_oversampled)

#Predict the response for test dataset
y_pred_smote = clf.predict(X_test)

In [17]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics


# F1 Score
print("F1_Score:",metrics.f1_score(y_test, y_pred_smote))

# AUC
print("AUC:",metrics.roc_auc_score(y_test, y_pred_smote))

F1_Score: 0.6445623342175065
AUC: 0.8553161029415073


In [19]:
from sklearn.metrics import classification_report, confusion_matrix
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_smote))
print("Precision:",metrics.precision_score(y_test, y_pred_smote))
print("Recall:",metrics.recall_score(y_test, y_pred_smote))
print(classification_report(y_test, y_pred_smote))

Accuracy: 0.8550959718842931
Precision: 0.5170212765957447
Recall: 0.8556338028169014
              precision    recall  f1-score   support

           0       0.97      0.85      0.91      3131
           1       0.52      0.86      0.64       568

    accuracy                           0.86      3699
   macro avg       0.74      0.86      0.78      3699
weighted avg       0.90      0.86      0.87      3699
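The figures in this report tie together through the confusion matrix. The counts below are not printed anywhere in the notebook; they are inferred from the reported recall, precision, and support values (568 positives, 3131 negatives), so treat them as a worked check rather than original output.

```python
# Confusion matrix counts inferred from the report above
tp, fn = 486, 82      # 568 actual positives (class 1)
fp, tn = 454, 2677    # 3131 actual negatives (class 0)

precision = tp / (tp + fp)                    # share of predicted positives that are correct
recall = tp / (tp + fn)                       # share of actual positives that are found
specificity = tn / (tn + fp)                  # share of actual negatives that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)
# With hard 0/1 predictions, roc_auc_score equals this balanced accuracy
balanced = (recall + specificity) / 2

print("Precision:", precision)
print("Recall:", recall)
print("Accuracy:", accuracy)
print("F1:", f1)
print("Balanced accuracy:", balanced)
```

Read this way, SMOTE buys high recall (486 of 568 positive cases are caught) at the cost of 454 false positives, which is why precision for class 1 sits near 0.52.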

