### Introduction to Support Vector Machines 


Support Vector Machines (SVMs) are supervised machine learning algorithms used for classification, regression and outlier detection. An SVM classifier builds a model that assigns new data points to one of the given categories. In its basic form, it can be viewed as a non-probabilistic binary linear classifier.

The original SVM algorithm was developed by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. At that early stage, the algorithm could only draw hyperplanes, so it was limited to linear classification. In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create non-linear classifiers by applying the kernel trick to maximum-margin hyperplanes. The current standard (the soft-margin formulation) was proposed by Corinna Cortes and Vapnik in 1993 and published in 1995.

SVMs can perform linear classification. In addition, they can efficiently perform non-linear classification using the kernel trick, which enables us to implicitly map the inputs into high-dimensional feature spaces.

### Support Vector Machines intuition 

### Hyperplane
A hyperplane is a decision boundary that separates data points with different class labels. The SVM classifier separates the data points using the hyperplane with the largest margin. This hyperplane is known as the maximum margin hyperplane, and the linear classifier it defines is known as the maximum margin classifier.



### Support Vectors
Support vectors are the sample data points that lie closest to the hyperplane. These points determine the position of the separating line or hyperplane, because the margin is measured from them.



### Margin
The margin is the separation gap between the two lines drawn through the closest data points of each class. It is measured from the perpendicular distance between the hyperplane and the support vectors (the closest data points). In SVMs, we try to maximize this gap so that we get the maximum margin.
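
As a rough illustration of these three ideas, the sketch below fits a linear SVM on a tiny, made-up data set and reads off the hyperplane, the support vectors and the margin width. The data, the very large `C` (used here to approximate a hard margin) and the formula `2 / ||w||` for the full gap are assumptions of this example, not part of the notebook that follows.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (hypothetical): class 0 sits on x = 0, class 1 sits on x = 3.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin classifier on separable data.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                   # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)   # full width of the gap between the two classes

print("support vectors:\n", clf.support_vectors_)
print("w =", w, "b =", b)
print("margin width:", margin)
```

For this data the separating line should sit halfway between the two columns of points, so the gap is close to the distance between them.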



### Kernel trick

In practice, the SVM algorithm is implemented using a kernel, through a technique called the kernel trick. In simple words, a kernel is a function that computes how similar two data points are in a higher-dimensional space where the data is separable, without computing the mapping to that space explicitly. In effect, it transforms a low-dimensional input space into a higher-dimensional one, converting a non-linearly separable problem into a linearly separable one by adding more dimensions. The kernel trick therefore helps us build a more accurate classifier for non-linear separation problems.
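
The "implicit mapping" can be checked by hand on a small case. The sketch below uses a degree-2 kernel, `(a . b)^2`, and an explicit feature map `phi` for 2-D inputs; both are illustrative helpers chosen for this example. The two computations should agree, although the kernel never builds the 3-D vectors.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (illustrative helper)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def quadratic_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b)^2."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = np.dot(phi(a), phi(b))   # inner product in the 3-D feature space
implicit = quadratic_kernel(a, b)   # same quantity, computed in the 2-D input space

print(explicit, implicit)
```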

### Linear kernel
In the linear kernel, the kernel function takes the form of a linear function as follows:

$$K(x_i, x_j) = x_i^T x_j$$

The linear kernel is used when the data is linearly separable, meaning the classes can be separated by a single line (or hyperplane). It is one of the most commonly used kernels, mostly when a dataset has a large number of features, and it is often used for text classification.

Training with a linear kernel is usually faster, because we only need to optimize the C regularization parameter. When training with other kernels, we also need to optimize the γ parameter, so performing a grid search will usually take more time.

### Polynomial Kernel
The polynomial kernel represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables. It looks not only at the given features of the input samples to determine their similarity, but also at combinations of those features.

For degree-d polynomials, the polynomial kernel is defined as follows:

$$K(x_i, x_j) = (\gamma x_i^T x_j + r)^d, \quad \gamma > 0$$

The polynomial kernel is very popular in Natural Language Processing. The most common degree is d = 2 (quadratic), since larger degrees tend to overfit on NLP problems.

### Radial Basis Function Kernel
The radial basis function (RBF) kernel is a general-purpose kernel. It is used when we have no prior knowledge about the data.
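
No formula is given for this kernel in the text. For reference, the form in common use (and the one selected by `kernel='rbf'` in Scikit-Learn) is the Gaussian of the squared distance between the two points, with γ > 0 controlling how quickly similarity decays:

```latex
K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right), \quad \gamma > 0
```

A point compared with itself gives K = 1, and the value falls towards 0 as the points move apart.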

### Sigmoid kernel
The sigmoid kernel has its origin in neural networks, and we can use it as a proxy for them. It is given by the following equation:

$$K(x, y) = \tanh(\alpha x^T y + c)$$
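
To make the role of each parameter concrete, here is a minimal NumPy sketch of the four kernels. The linear, polynomial and sigmoid functions follow the equations above; the RBF form `exp(-gamma * ||x - y||^2)` is not stated in the text and is the commonly used definition, so treat it as an assumption. Function names, default values and the two sample vectors are made up for the example.

```python
import numpy as np

def linear_kernel(x, y):
    # Plain inner product of the two vectors.
    return x @ y

def polynomial_kernel(x, y, gamma=1.0, r=0.0, d=2):
    # (gamma * x.y + r) ** d, with gamma > 0.
    return (gamma * (x @ y) + r) ** d

def rbf_kernel(x, y, gamma=1.0):
    # Assumed common form: exp(-gamma * squared Euclidean distance).
    diff = x - y
    return np.exp(-gamma * (diff @ diff))

def sigmoid_kernel(x, y, alpha=1.0, c=0.0):
    # tanh(alpha * x.y + c)
    return np.tanh(alpha * (x @ y) + c)

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])

print("linear :", linear_kernel(x, y))
print("poly   :", polynomial_kernel(x, y, gamma=1.0, r=1.0, d=2))
print("rbf    :", rbf_kernel(x, y, gamma=0.5))
print("sigmoid:", sigmoid_kernel(x, y))
```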

### SVM Scikit-Learn libraries

Scikit-Learn provides several classes that implement the Support Vector Machine algorithm. We just need to instantiate the class with parameters that suit our needs. In this project, I am dealing with a classification task, so I will mention the Scikit-Learn classes for SVM classification.

First, there is the LinearSVC() classifier. As the name suggests, this classifier uses only a linear kernel, so we do not pass a kernel value to it.

Scikit-Learn provides two other classifiers for classification, SVC() and NuSVC(). They are mostly similar, with some differences in parameters: NuSVC() uses a parameter, nu, to control the number of support vectors. We pass the values of kernel, gamma and C (nu takes the place of C in NuSVC()) along with other parameters. By default the kernel parameter uses rbf as its value, but we can pass poly, linear, sigmoid or a callable function.
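
A minimal sketch of how the three classifiers are instantiated side by side. It uses Scikit-Learn's bundled iris data purely as a stand-in, and the parameter values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC, NuSVC, SVC

# Stand-in data set (assumption): iris, not the data used in this notebook.
X, y = load_iris(return_X_y=True)

models = {
    "LinearSVC": LinearSVC(C=1.0, max_iter=10000),        # linear only, no kernel argument
    "SVC": SVC(kernel="rbf", C=1.0, gamma="scale"),       # regularized through C
    "NuSVC": NuSVC(kernel="rbf", nu=0.5, gamma="scale"),  # regularized through nu
}

scores = {}
for name, model in models.items():
    model.fit(X, y)
    scores[name] = model.score(X, y)   # training accuracy, only a sanity check
    print(name, round(scores[name], 3))
```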

 

### Attribute Information:
Each candidate is described by 8 continuous variables and a single class variable. The first four are simple statistics obtained from the integrated pulse profile. The remaining four variables are similarly obtained from the DM-SNR curve. These are summarised below:

1. Mean of the integrated profile.
2. Standard deviation of the integrated profile.
3. Excess kurtosis of the integrated profile.
4. Skewness of the integrated profile.
5. Mean of the DM-SNR curve.
6. Standard deviation of the DM-SNR curve.
7. Excess kurtosis of the DM-SNR curve.
8. Skewness of the DM-SNR curve.
9. Class

### Import libraries

In [4]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)



## Support Vector Machines - Nonlinear Classification
### with GridSearch

### All needed imports for this notebook

In [5]:
import pandas as pd
pd.options.display.max_colwidth = 80

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

from sklearn.svm import SVC # SVM model with kernels
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import accuracy_score

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')

#### Fetching Data

In [8]:
data = (r"C:\Users\PRATHAMESH\Desktop\(CSD4207)Data Science Practical - III (Machine Learning using R)\ASSIGNMENTS\car_evaluation.csv")

header_list = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class value']

cars = pd.read_csv(data, names=header_list, index_col=None)

### Exploring Data

In [9]:
cars.head()

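| | buying | maint | doors | persons | lug_boot | safety | class value |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |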


In [10]:
cars.describe()

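| | buying | maint | doors | persons | lug_boot | safety | class value |
|---|---|---|---|---|---|---|---|
| count | 1728 | 1728 | 1728 | 1728 | 1728 | 1728 | 1728 |
| unique | 4 | 4 | 4 | 3 | 3 | 3 | 4 |
| top | vhigh | vhigh | 2 | 2 | small | low | unacc |
| freq | 432 | 432 | 432 | 576 | 576 | 576 | 1210 |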


In [11]:
cars.info(), cars.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   buying       1728 non-null   object
 1   maint        1728 non-null   object
 2   doors        1728 non-null   object
 3   persons      1728 non-null   object
 4   lug_boot     1728 non-null   object
 5   safety       1728 non-null   object
 6   class value  1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


(None, (1728, 7))

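#### **Distribution frequency of values in each variable.** Judging by the output, the six feature variables are evenly distributed, so no *stratified sampling* is applied in the split below. Note that the target *class value* is not balanced (1210 / 384 / 69 / 65), so stratifying on it would still be worth considering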

In [12]:
for column in cars.columns:
    print(cars[column].value_counts(), '\n') 

vhigh    432
high     432
med      432
low      432
Name: buying, dtype: int64 

vhigh    432
high     432
med      432
low      432
Name: maint, dtype: int64 

2        432
3        432
4        432
5more    432
Name: doors, dtype: int64 

2       576
4       576
more    576
Name: persons, dtype: int64 

small    576
med      576
big      576
Name: lug_boot, dtype: int64 

low     576
med     576
high    576
Name: safety, dtype: int64 

unacc    1210
acc       384
good       69
vgood      65
Name: class value, dtype: int64 



#### I had an idea that the number of doors could somehow correlate with luggage capacity, but it seems that the *lug_boot* value does not depend on it

#### Feature and Target vectors

In [13]:
X = cars.drop(['class value'], axis=1)
y = cars['class value']

X, y

(     buying  maint  doors persons lug_boot safety
 0     vhigh  vhigh      2       2    small    low
 1     vhigh  vhigh      2       2    small    med
 2     vhigh  vhigh      2       2    small   high
 3     vhigh  vhigh      2       2      med    low
 4     vhigh  vhigh      2       2      med    med
 ...     ...    ...    ...     ...      ...    ...
 1723    low    low  5more    more      med    med
 1724    low    low  5more    more      med   high
 1725    low    low  5more    more      big    low
 1726    low    low  5more    more      big    med
 1727    low    low  5more    more      big   high
 
 [1728 rows x 6 columns],
 0       unacc
 1       unacc
 2       unacc
 3       unacc
 4       unacc
         ...  
 1723     good
 1724    vgood
 1725    unacc
 1726     good
 1727    vgood
 Name: class value, Length: 1728, dtype: object)

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

In [15]:
X_train.shape, X_test.shape

((1036, 6), (692, 6))

In [16]:
y_train.shape, y_test.shape

((1036,), (692,))
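
The split above passes no `stratify` argument, while the value counts shown earlier put the four classes at 1210, 384, 69 and 65 rows. As a sketch of what stratifying on the target would look like, the example below rebuilds labels with those same counts (synthetic stand-ins, not the real rows) and checks that the test set keeps the class proportions. `X_demo` is a placeholder feature column invented for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the 'class value' column.
counts = {"unacc": 1210, "acc": 384, "good": 69, "vgood": 65}
y_demo = np.repeat(list(counts.keys()), list(counts.values()))
X_demo = np.arange(len(y_demo)).reshape(-1, 1)   # placeholder feature column

# stratify=y_demo keeps each class's share (almost) identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)

labels, test_counts = np.unique(y_te, return_counts=True)
test_share = {str(l): c / len(y_te) for l, c in zip(labels, test_counts)}

for label, total in counts.items():
    print(label, round(total / len(y_demo), 3), round(test_share[label], 3))
```

With the real data this would amount to adding `stratify=y` to the existing `train_test_split` call; whether it changes the scores reported below has not been checked here.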

### Encoding
#### Each variable takes a limited number of possible values, each of which represents a category, which means that all the variables in the dataset are of ordinal categorical data type. Most Machine Learning algorithms prefer to work with numbers, so let’s convert these categories from text to numbers. For this, we can use Scikit-Learn’s OrdinalEncoder class:

In [17]:
columns_encode = []
columns_encode.append(header_list)
columns_encode

[['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class value']]

In [18]:
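ordinal_encoder = OrdinalEncoder()

# Fit the encoder on the training features only, then reuse it on the test features.
# columns_encode is not passed: the second positional argument of fit_transform
# is y, which OrdinalEncoder ignores.
X_train = ordinal_encoder.fit_transform(X_train)
X_test = ordinal_encoder.transform(X_test)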

In [19]:
X_train, X_train.shape

(array([[0., 0., 2., 2., 1., 1.],
        [3., 2., 2., 2., 0., 1.],
        [0., 2., 2., 1., 2., 2.],
        ...,
        [0., 1., 3., 2., 1., 0.],
        [1., 0., 2., 0., 2., 2.],
        [2., 2., 1., 2., 2., 2.]]),
 (1036, 6))
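
One thing to keep in mind with this encoding: when no `categories` argument is given, `OrdinalEncoder` sorts the category labels, so for strings the codes follow alphabetical order rather than the natural ranking. The sketch below contrasts the default with an explicit ordering on two small columns. The sample rows and the chosen orderings (low < med < high < vhigh) are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Small made-up sample using two of the car-evaluation column names.
demo = pd.DataFrame({
    "buying": ["low", "med", "high", "vhigh"],
    "safety": ["low", "med", "high", "low"],
})

# Default: categories are sorted, i.e. alphabetical for strings.
default_enc = OrdinalEncoder()

# Explicit: one ordered list per column, in column order (assumed rankings).
ordered_enc = OrdinalEncoder(categories=[
    ["low", "med", "high", "vhigh"],   # buying
    ["low", "med", "high"],            # safety
])

default_codes = default_enc.fit_transform(demo)
ordered_codes = ordered_enc.fit_transform(demo)

print(default_enc.categories_)
print(default_codes)
print(ordered_codes)
```

In the default encoding `high` gets code 0 and `low` gets code 1, so the numeric order does not match the ranking. Whether that matters for a kernel SVM on this data set is an empirical question that this sketch does not answer.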

In [20]:
y_train, y_train.shape

(615     unacc
 294     unacc
 712     unacc
 1720      acc
 88      unacc
         ...  
 1130    vgood
 1294     good
 860       acc
 1459    unacc
 1126      acc
 Name: class value, Length: 1036, dtype: object,
 (1036,))

#### Using GridSearch to find the best hyperparameters

In [21]:
param_grid = [{'kernel': ['poly'], 'C' : [3, 5, 7, 9, 10]},
             {'kernel' : ['rbf'], 'C' : [3, 5, 7, 9, 10], 'gamma' : [2, 4, 6, 8]}]

svm = SVC()

In [22]:
grid_search = GridSearchCV(svm, param_grid, return_train_score=True)

grid_search.fit(X_train, y_train)

GridSearchCV(estimator=SVC(),
             param_grid=[{'C': [3, 5, 7, 9, 10], 'kernel': ['poly']},
                         {'C': [3, 5, 7, 9, 10], 'gamma': [2, 4, 6, 8],
                          'kernel': ['rbf']}],
             return_train_score=True)

#### Estimated best hyperparameters for SVM

In [23]:
grid_search.best_params_

{'C': 9, 'kernel': 'poly'}

#### GridSearch estimated the best model to be the one with a polynomial kernel and C = 9. The degree was not part of the grid, so it stays at SVC's default of 3

In [24]:
grid_search.best_estimator_

SVC(C=9, kernel='poly')
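
In the grid above, `C` is the regularization parameter and `degree` is never listed, so the polynomial kernel runs with `SVC`'s default degree. If the degree should be tuned as well, it can be added as its own key. The sketch below does that on Scikit-Learn's bundled iris data as a stand-in; the grid values are arbitrary and the names `param_grid_demo` and `search` are invented for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data set (assumption): iris, not the car-evaluation data.
X_demo, y_demo = load_iris(return_X_y=True)

param_grid_demo = [
    {"kernel": ["poly"], "C": [3, 9], "degree": [2, 3, 4]},   # degree searched explicitly
    {"kernel": ["rbf"], "C": [3, 9], "gamma": [0.1, 1]},
]

search = GridSearchCV(SVC(), param_grid_demo, cv=5)
search.fit(X_demo, y_demo)

print("default degree:", SVC().degree)
print("best params   :", search.best_params_)
print("best CV score :", round(search.best_score_, 3))
```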

In [25]:
svm_y_pred = grid_search.predict(X_test)

accuracy_score(y_test, svm_y_pred)

0.8815028901734104

In [26]:
svm_y_pred_train = grid_search.predict(X_train)

accuracy_score(y_train, svm_y_pred_train)

0.9054054054054054

### Accuracy on the training set is a little bit higher than on the test set (0.905 vs 0.882), so the model is clearly not overfit, and I guess it did very well

#### Confusion Matrix

In [27]:
confusion_matrix(y_test, svm_y_pred)

array([[107,   3,  39,   7],
       [  2,  22,   5,   0],
       [ 14,   0, 465,   1],
       [  6,   0,   5,  16]], dtype=int64)
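
The matrix can be read directly: rows are true classes and columns are predicted classes. The sketch below recomputes accuracy, per-class precision and per-class recall from the numbers printed above. The label order (`acc`, `good`, `unacc`, `vgood`) is assumed to be the sorted order Scikit-Learn uses when no `labels` argument is given.

```python
import numpy as np

# Assumed label order: sorted class names, rows = true class, columns = predicted.
labels = ["acc", "good", "unacc", "vgood"]
cm = np.array([
    [107,  3,  39,  7],
    [  2, 22,   5,  0],
    [ 14,  0, 465,  1],
    [  6,  0,   5, 16],
])

correct = np.diag(cm)                  # correctly classified samples per class
accuracy = correct.sum() / cm.sum()    # share of all test samples on the diagonal
recall = correct / cm.sum(axis=1)      # per true class (row totals)
precision = correct / cm.sum(axis=0)   # per predicted class (column totals)

for name, p, r in zip(labels, precision, recall):
    print(f"{name:>6}  precision={p:.2f}  recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Most of the errors sit in the first row: 39 cars that are actually `acc` were predicted as `unacc`, which is what pulls the recall of `acc` down.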

In [28]:
print(classification_report(y_test, svm_y_pred))

              precision    recall  f1-score   support

         acc       0.83      0.69      0.75       156
        good       0.88      0.76      0.81        29
       unacc       0.90      0.97      0.94       480
       vgood       0.67      0.59      0.63        27

    accuracy                           0.88       692
   macro avg       0.82      0.75      0.78       692
weighted avg       0.88      0.88      0.88       692

