# Support Vector Machines (SVM)

SVM is one of the most powerful algorithms available, but it requires careful data preprocessing. In this algorithm, each data point is plotted in n-dimensional space, where `n` is the number of attributes in the dataset.

For linearly separable data, a linear SVM can be compared with logistic regression, since both learn a linear decision boundary. For non-linear data, SVM is typically used with a kernel.
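To illustrate the idea of points in n-dimensional space split by a boundary, here is a minimal sketch on made-up 2-dimensional data. The toy points and the choice of `SVC(kernel="linear")` are assumptions for illustration only; they are not part of the banknote analysis below.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (made up for illustration): two well-separated clusters in 2-D space
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0],
                  [4.0, 4.0], [5.0, 5.0], [4.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A linear-kernel SVM looks for the separating line with the widest margin
toy_svm = SVC(kernel="linear", C=1.0)
toy_svm.fit(X_toy, y_toy)

# The support vectors are the training points closest to the boundary
print("Support vectors:\n", toy_svm.support_vectors_)

# Classify two new points, one near each cluster
toy_pred = toy_svm.predict([[0.5, 0.5], [4.5, 4.5]])
print("Predictions:", toy_pred)
```

Only the points nearest the boundary end up as support vectors; the remaining points do not affect the fitted separator.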

## SVM (Classification Problem)

Data Source: [Banknote Authentication](https://archive.ics.uci.edu/ml/datasets/banknote+authentication)

The source is a plain-text file; we convert it to .csv format and rename the columns to match the data source.

**Attributes**
- variance of Wavelet Transformed image (continuous)
- skewness of Wavelet Transformed image (continuous)
- curtosis of Wavelet Transformed image (continuous)
- entropy of image (continuous)
- class (integer) 

In [1]:
# Import necessary packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

import warnings
warnings.filterwarnings("ignore")

In [2]:
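# Loading the dataset
# Note: the raw file has no header row, so read_csv uses the first record as
# column names here; passing header=None would keep that record as data.
banknote = pd.read_csv("./banknote/data_banknote_authentication.txt")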
banknote

Unnamed: 0,3.6216,8.6661,-2.8073,-0.44699,0
0,4.54590,8.16740,-2.4586,-1.46210,0
1,3.86600,-2.63830,1.9242,0.10645,0
2,3.45660,9.52280,-4.0112,-3.59440,0
3,0.32924,-4.45520,4.5718,-0.98880,0
4,4.36840,9.67180,-3.9606,-3.16250,0
...,...,...,...,...,...
1366,0.40614,1.34920,-1.4501,-0.55949,1
1367,-1.38870,-4.87730,6.4774,0.34179,1
1368,-3.75030,-13.45860,17.5932,-2.77710,1
1369,-3.56370,-8.38270,12.3930,-1.28230,1
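The header of the table above holds data values (3.6216, 8.6661, ...), which indicates that the first record was read as column names. A minimal sketch of one way to avoid this is to pass `header=None` together with `names`. The in-memory `raw` string below is a stand-in built from the first three records shown in this notebook, so the snippet runs without the file.

```python
import io
import pandas as pd

# Stand-in for the raw text file: the first three records shown in this notebook
raw = io.StringIO(
    "3.6216,8.6661,-2.8073,-0.44699,0\n"
    "4.5459,8.1674,-2.4586,-1.4621,0\n"
    "3.866,-2.6383,1.9242,0.10645,0\n"
)

columns = ["variance", "skewness", "curtosis", "entropy", "class"]

# header=None keeps the first record as data; names assigns the column labels
sample = pd.read_csv(raw, header=None, names=columns)
print(sample.shape)
print(sample.head())
```

With the real file, the same two arguments would name the columns at load time and keep every record, so the separate rename step would not be needed.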


In [3]:
# Convert this dataset into .csv format (saved in the folder the next cell reads from)
banknote.to_csv("./banknote/banknote.csv", index = False)

In [4]:
# Load the dataset
banknote = pd.read_csv("./banknote/banknote.csv")
banknote.head()

Unnamed: 0,3.6216,8.6661,-2.8073,-0.44699,0
0,4.5459,8.1674,-2.4586,-1.4621,0
1,3.866,-2.6383,1.9242,0.10645,0
2,3.4566,9.5228,-4.0112,-3.5944,0
3,0.32924,-4.4552,4.5718,-0.9888,0
4,4.3684,9.6718,-3.9606,-3.1625,0


In [5]:
# Column rename
banknote.columns = ["variance", "skewness", "curtosis", "entropy", "class"]
banknote.head()

Unnamed: 0,variance,skewness,curtosis,entropy,class
0,4.5459,8.1674,-2.4586,-1.4621,0
1,3.866,-2.6383,1.9242,0.10645,0
2,3.4566,9.5228,-4.0112,-3.5944,0
3,0.32924,-4.4552,4.5718,-0.9888,0
4,4.3684,9.6718,-3.9606,-3.1625,0


In [6]:
# Display the characteristics of the dataset
print("Dimension of dataset is: ", banknote.shape)
print("The names of the variables in the data are: \n", banknote.columns)

Dimension of dataset is:  (1371, 5)
The names of the variables in the data are: 
 Index(['variance', 'skewness', 'curtosis', 'entropy', 'class'], dtype='object')


In [7]:
# Seed NumPy's global random state so that the train-test split is reproducible
np.random.seed(3000)

In [8]:
# Train-Test Split for both independent and dependent features
training, test = train_test_split(banknote, test_size = 0.3)

x_trg = training.drop("class", axis = 1)
y_trg = training["class"]

x_test = test.drop("class", axis = 1)
y_test = test["class"]
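As an alternative to seeding NumPy globally, `train_test_split` accepts `random_state` and `stratify` arguments. The sketch below uses a small synthetic frame (`demo`, with made-up features `f1` and `f2`) as an assumption so that it runs on its own; it is not the banknote data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 60 of class 0 and 40 of class 1
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "f1": rng.normal(size=100),
    "f2": rng.normal(size=100),
    "class": [0] * 60 + [1] * 40,
})

X_demo = demo.drop("class", axis=1)
y_demo = demo["class"]

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3000, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("Share of class 1 in test set:", y_te.mean())
```

Splitting features and labels in one call also avoids having to drop the target column from each part afterwards.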

In [9]:
x_trg.shape

(959, 4)

In [10]:
y_trg.shape

(959,)

In [11]:
x_test.shape

(412, 4)

In [12]:
y_test.shape

(412,)

### Model Building - SVM

In [13]:
svm_banknote = LinearSVC(random_state = 0)

svm_banknote.fit(x_trg, y_trg)
print("Accuracy of SVM model on training set is: %0.3f" % svm_banknote.score(x_trg, y_trg))
print("Accuracy of SVM model on test set is: %0.3f" % svm_banknote.score(x_test, y_test))

svm_pred = svm_banknote.predict(x_test)

Accuracy of SVM model on training set is: 0.991
Accuracy of SVM model on test set is: 0.983


In [14]:
# Confusion Matrix - SVM model
svm_results = confusion_matrix(y_test, svm_pred)
print("The confusion matrix of SVM model: \n", svm_results)

The confusion matrix of SVM model: 
 [[218   6]
 [  1 187]]
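The test accuracy reported above can be recovered from this confusion matrix by hand. The sketch below does the arithmetic with the matrix values copied from the output; treating class 1 as the positive class is an assumption made only to name the cells.

```python
import numpy as np

# Confusion matrix copied from the SVM output (rows: true class, columns: predicted)
cm = np.array([[218, 6],
               [1, 187]])

# With class 1 treated as positive (an assumption), the cells unpack as:
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all test rows classified correctly
precision = tp / (tp + fp)        # of rows predicted as class 1, share that are class 1
recall = tp / (tp + fn)           # of true class 1 rows, share that were found

print(f"accuracy  = {accuracy:.3f}")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
```

The accuracy works out to 405 / 412, which rounds to the 0.983 printed by `score` on the test set.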


### Compare with kNN model

In [15]:
# Model Building - kNN
knn_accuracylist = []

for K in range(1, 22):
    # Model Building
    knn_banknote = KNeighborsClassifier(n_neighbors = K)
    
    # Fit the model
    knn_banknote.fit(x_trg, y_trg)
    
    # Predicting with model
    knn_pred = knn_banknote.predict(x_test)
    
    # Confusion matrix of kNN model
    knn_results = confusion_matrix(y_test, knn_pred)
    
    # Accuracy score of kNN model
    knn_acc_score = accuracy_score(y_test, knn_pred)
    
    print("The accuracy of kNN model for k = ", K," is: %0.3f" % knn_acc_score)
    print("The confusion matrix is: \n", knn_results)
    
    knn_accuracylist.append(knn_acc_score)
    
print("The maximum accuracy using kNN is: %0.3f" % max(knn_accuracylist))

The accuracy of kNN model for k =  1  is: 1.000
The confusion matrix is: 
 [[224   0]
 [  0 188]]
The accuracy of kNN model for k =  2  is: 1.000
The confusion matrix is: 
 [[224   0]
 [  0 188]]
The accuracy of kNN model for k =  3  is: 1.000
The confusion matrix is: 
 [[224   0]
 [  0 188]]
The accuracy of kNN model for k =  4  is: 1.000
The confusion matrix is: 
 [[224   0]
 [  0 188]]
The accuracy of kNN model for k =  5  is: 1.000
The confusion matrix is: 
 [[224   0]
 [  0 188]]
The accuracy of kNN model for k =  6  is: 1.000
The confusion matrix is: 
 [[224   0]
 [  0 188]]
The accuracy of kNN model for k =  7  is: 1.000
The confusion matrix is: 
 [[224   0]
 [  0 188]]
The accuracy of kNN model for k =  8  is: 1.000
The confusion matrix is: 
 [[224   0]
 [  0 188]]
The accuracy of kNN model for k =  9  is: 1.000
The confusion matrix is: 
 [[224   0]
 [  0 188]]
The accuracy of kNN model for k =  10  is: 1.000
The confusion matrix is: 
 [[224   0]
 [  0 188]]
... (remaining output for k = 11 to 21 truncated)
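The loop above chooses `k` by looking at test-set accuracy. A common alternative is to choose `k` by cross-validation on the training data only, so that the test set stays untouched until the end. The sketch below shows the idea on synthetic data from `make_classification`, which is an assumption used to keep the example self-contained; with the notebook's data, `x_trg` and `y_trg` would take the place of `X_syn` and `y_syn`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (4 features, 2 classes)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Mean 5-fold cross-validated accuracy for each k from 1 to 21
cv_means = {}
for k in range(1, 22):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_syn, y_syn, cv=5)
    cv_means[k] = scores.mean()

# Pick the k with the highest mean accuracy
best_k = max(cv_means, key=cv_means.get)
print("Best k:", best_k)
print("Mean CV accuracy: %0.3f" % cv_means[best_k])
```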

### Compare with Naive Bayes model

In [16]:
# Model Building - Naive Bayes
naive_banknote = GaussianNB()

naive_banknote.fit(x_trg, y_trg)
naive_pred = naive_banknote.predict(x_test)
naive_results = confusion_matrix(y_test, naive_pred)
naive_acc_score = accuracy_score(y_test, naive_pred)
print("The accuracy score of Naive Bayes model is: ", naive_acc_score)
print("The confusion matrix of Naive Bayes model is: \n", naive_results)

The accuracy score of Naive Bayes model is:  0.8398058252427184
The confusion matrix of Naive Bayes model is: 
 [[195  29]
 [ 37 151]]


### Compare with Logistic Regression model

In [17]:
# Model Building - Logistic Regression
log_banknote = LogisticRegression()

log_banknote.fit(x_trg, y_trg)
print("The accuracy of Log Reg model on training dataset is: %0.3f" % log_banknote.score(x_trg, y_trg))
print("The accuracy of Log Reg model on test dataset is: %0.3f" % log_banknote.score(x_test, y_test))

log_pred = log_banknote.predict(x_test)
log_results = confusion_matrix(y_test, log_pred)
print("The confusion matrix of Log Reg model is: \n", log_results)

The accuracy of Log Reg model on training dataset is: 0.993
The accuracy of Log Reg model on test dataset is: 0.988
The confusion matrix of Log Reg model is: 
 [[219   5]
 [  0 188]]
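`classification_report` is imported at the top of the notebook but not used. As a sketch of what it adds, the snippet below rebuilds label arrays that reproduce the Logistic Regression confusion matrix above and prints per-class precision and recall. The arrays are a reconstruction from the matrix counts, not the actual `y_test` and `log_pred` values.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Label arrays reconstructed from the confusion matrix [[219, 5], [0, 188]]
y_true_demo = [0] * 224 + [1] * 188
y_pred_demo = [0] * 219 + [1] * 5 + [1] * 188

# Check that the reconstruction matches the matrix printed above
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# Text report with precision, recall and F1 for each class
print(classification_report(y_true_demo, y_pred_demo))

# The same numbers as a dictionary, convenient for further processing
report = classification_report(y_true_demo, y_pred_demo, output_dict=True)
print("Accuracy: %0.3f" % report["accuracy"])
```

In the notebook itself, the call would be `classification_report(y_test, log_pred)`.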


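No feature scaling or encoding was applied here. Encoding is unnecessary because all four features are continuous, but SVM is sensitive to feature scale, so scaling is usually worth trying. The SVM accuracy is 0.991 on the training set and 0.983 on the test set; this small gap suggests the SVM model is, at most, slightly overfitting.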
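Since SVM is sensitive to feature scale, one way to add scaling is to chain `StandardScaler` and `LinearSVC` in a pipeline, so the scaler is fitted on the training data only. This is a minimal sketch on synthetic data from `make_classification` (an assumption for self-containment); it does not reproduce the accuracies reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data with 4 continuous features
X_syn, y_syn = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

# The pipeline standardises each feature, then fits the linear SVM
scaled_svm = make_pipeline(StandardScaler(), LinearSVC(random_state=0))
scaled_svm.fit(X_tr, y_tr)

train_acc = scaled_svm.score(X_tr, y_tr)
test_acc = scaled_svm.score(X_te, y_te)
print("Accuracy on training set: %0.3f" % train_acc)
print("Accuracy on test set: %0.3f" % test_acc)
```

With the notebook's variables, `scaled_svm.fit(x_trg, y_trg)` followed by `scaled_svm.score(x_test, y_test)` would allow a direct comparison with the unscaled model.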

- kNN: maximum test accuracy of 1.000
- Naive Bayes: test accuracy of 0.840
- Logistic Regression: accuracy of 0.993 on the training dataset and 0.988 on the test dataset

Thus, for this dataset, kNN is the best of the models compared.