In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

In [2]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

## Support Vector Machines (SVM)

## SVM Parameters & Model Selection

## Params explained
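| Param | Description | linear | rbf | poly | sigmoid |
|---|---|---|---|---|---|
| C | Regularization Param | ✓ | ✓ | ✓ | ✓ |
| degree | Polynomial Degree | | | ✓ | |
| coef0 | Independent Term | | | ✓ | ✓ |
| gamma | Kernel Coefficient | | ✓ | ✓ | ✓ |
| epsilon | Acceptable error margin | ✓ | ✓ | ✓ | ✓ |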


### `C`
> Regularization parameter. The strength of regularization is inversely proportional to `C`.
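
One way to see the effect of `C` is to count support vectors: a small `C` (strong regularization) tolerates more margin violations, which typically leaves more training points as support vectors. The sketch below is illustrative only; it assumes `sklearn.datasets.load_iris` as a stand-in for the seaborn dataset used later, and the names `X_demo`, `y_demo`, and `sv_counts` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Assumption: load_iris is used only to keep this sketch self-contained.
X_demo, y_demo = load_iris(return_X_y=True)

sv_counts = {}
for C in (0.01, 1, 100):
    model = SVC(C=C).fit(X_demo, y_demo)  # default rbf kernel
    # n_support_ holds the number of support vectors for each class
    sv_counts[C] = int(model.n_support_.sum())

print(sv_counts)
```

Comparing the counts for the smallest and largest `C` gives a rough feel for how much the margin tightens as regularization weakens.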

### `degree`
> Used only for `poly` kernel. 2 → quadratic, 3 → cubic, ..

### `coef0`
> Independent Term in polynomial and sigmoid kernels
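
To make the roles of `degree` and `coef0` concrete, the polynomial kernel can be written as $(\gamma \langle x, x' \rangle + \text{coef0})^{\text{degree}}$. The sketch below checks that formula against `sklearn.metrics.pairwise.polynomial_kernel` on random data; the matrix `A` and the parameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))  # 5 hypothetical samples with 4 features

gamma, coef0, degree = 0.5, 1.0, 2  # illustrative values only

# Manual polynomial kernel: (gamma * <x, x'> + coef0) ** degree
manual = (gamma * (A @ A.T) + coef0) ** degree
reference = polynomial_kernel(A, A, degree=degree, gamma=gamma, coef0=coef0)

kernels_match = bool(np.allclose(manual, reference))
print(kernels_match)
```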

### `gamma`
> Kernel Coefficient. `"scale"`: $ \frac{1}{\text{n\_features} \times \text{X.var()}} $, `"auto"`: $ \frac{1}{\text{n\_features}} $
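
The two presets can be reproduced by hand. This is a minimal sketch, assuming a NumPy feature matrix from `sklearn.datasets.load_iris` (the names `X_demo`, `gamma_scale`, and `gamma_auto` are hypothetical). Note that `X.var()` here means the variance over all entries of the array; on a pandas DataFrame, `.var()` returns per-column variances instead, so convert to NumPy before reproducing the formula.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)  # NumPy array with 4 features
n_features = X_demo.shape[1]

gamma_scale = 1.0 / (n_features * X_demo.var())  # variance over all entries
gamma_auto = 1.0 / n_features

# Passing the hand-computed value should behave like gamma="scale"
by_name = SVC(gamma="scale").fit(X_demo, y_demo)
by_value = SVC(gamma=gamma_scale).fit(X_demo, y_demo)
same_model = bool(np.allclose(by_name.decision_function(X_demo),
                              by_value.decision_function(X_demo)))

print(gamma_scale, gamma_auto, same_model)
```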

### `epsilon` (SVR-only)
> Width of the margin around the prediction within which errors incur no penalty
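
Since `epsilon` applies to regression, a small `SVR` sketch may help. It assumes synthetic noisy sine data (not part of this notebook) and compares a narrow and a wide tube; points that fall strictly inside the tube incur no penalty and do not become support vectors, so the wider tube is expected to keep fewer of them.

```python
import numpy as np
from sklearn.svm import SVR

# Assumption: synthetic 1-D regression data, for illustration only.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 5, size=80))
X_reg = x.reshape(-1, 1)
y_reg = np.sin(x) + rng.normal(scale=0.1, size=x.size)

tight = SVR(epsilon=0.01).fit(X_reg, y_reg)  # narrow no-penalty tube
wide = SVR(epsilon=0.5).fit(X_reg, y_reg)    # wide no-penalty tube

# support_ lists the indices of the training points kept as support vectors
n_sv_tight = tight.support_.size
n_sv_wide = wide.support_.size
print(n_sv_tight, n_sv_wide)
```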

## Dataset: iris

In [3]:
iris = sns.load_dataset('iris')

In [4]:
X = iris.iloc[:,:-1]
X

,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [5]:
y = iris.iloc[:, -1]
y

0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...    
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: species, Length: 150, dtype: object

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

## Exercise 1
Create your first SVC model with a C value of 100

Fit this model on the training data

Score this model with the test data. What accuracy does your first model have?

In [7]:
svc = SVC(C=100)
svc

In [8]:
# fit and score your model
svc.fit(X_train, y_train)
svc.score(X_test, y_test)

0.9777777777777777

## Model Selection with Grid Search

In [9]:
from sklearn.model_selection import GridSearchCV

## Exercise 2
`GridSearchCV` evaluates every combination of parameters in a grid using cross-validation and keeps the combination that produces the best-performing estimator.

In this exercise, we try finding the best rbf kernel.

Create a model with an rbf kernel

Create a GridSearchCV object with this model and the following grid of values to evaluate on:

| Param | Values |
|---|---|
| C | 0.1, 1, 10, 100 |
| gamma | auto, scale |

Fit the GridSearchCV object on the training data to get the best estimator

Capture the best estimator determined by the GridSearchCV object into `svc_rbf` and score it with the test data.

Is this model's performance better than the previous one's?

In [10]:
gs = GridSearchCV(SVC(kernel='rbf'), dict(C = [0.1, 1, 10, 100], gamma = ['auto', 'scale']))

In [11]:
# fit gs
gs.fit(X_train, y_train)

In [12]:
svc_rbf = gs.best_estimator_

# score svc_rbf
svc_rbf.score(X_test, y_test)

0.9777777777777777
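
Beyond `best_estimator_`, a fitted `GridSearchCV` exposes which combination won and how it scored in cross-validation. The sketch below repeats the rbf search in a self-contained form, assuming `sklearn.datasets.load_iris` in place of the seaborn DataFrame; the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123
)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["auto", "scale"]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid).fit(X_tr, y_tr)

best_params = search.best_params_      # winning parameter combination
best_cv_score = search.best_score_     # mean cross-validated accuracy of the winner
n_candidates = len(search.cv_results_["params"])  # combinations tried

print(best_params, round(best_cv_score, 3), n_candidates)
```

`best_score_` is computed on the training folds only, so it can differ from the test-set score reported above.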

## Exercise 3
Repeat the process above for a 2-degree polynomial kernel.

Create a model with a poly kernel and degree=2

Create a GridSearchCV object with this model and the following grid of values to evaluate on:

| Param | Values |
|---|---|
| C | 0.1, 1, 10, 100 |
| gamma | auto, scale |
| coef0 | 1, 2, ..., 9 |

Fit the GridSearchCV object on the training data to get the best estimator

Capture the best estimator determined by the GridSearchCV object into `svc_poly` and score it with the test data.

Notice that there is no guarantee that a given kernel will perform better than the previous one.

In [13]:
gs = GridSearchCV(SVC(kernel='poly', degree=2), dict(C = [0.1, 1, 10, 100], gamma = ['auto', 'scale'], coef0 = np.arange(1, 10, 1)))

In [14]:
# fit gs
gs.fit(X_train, y_train)


In [15]:
svc_poly = gs.best_estimator_

# score svc_poly
svc_poly.score(X_test, y_test)

0.9555555555555556
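
Adding `coef0` to the grid multiplies the amount of work. As a rough sketch, `sklearn.model_selection.ParameterGrid` can count the combinations before fitting; the fit count assumes the default `cv=5` of `GridSearchCV`, and the variable names are hypothetical.

```python
from sklearn.model_selection import ParameterGrid

poly_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["auto", "scale"],
    "coef0": list(range(1, 10)),
}

n_poly_candidates = len(ParameterGrid(poly_grid))  # 4 * 2 * 9 combinations
n_poly_fits = n_poly_candidates * 5  # assuming the default 5-fold cross-validation

print(n_poly_candidates, n_poly_fits)  # 72 360
```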

## Exercise 4
We finally try this exercise with a linear kernel and see how it performs.

Create a model with a linear kernel

Create a GridSearchCV object with this model and the following grid of values to evaluate on:

| Param | Values |
|---|---|
| C | 1, 2, ..., 19 |

Fit the GridSearchCV object on the training data to get the best estimator

Capture the best estimator determined by the GridSearchCV object into `svc_linear` and score it with the test data.

Have we found the best model yet?

In [16]:
gs = GridSearchCV(SVC(kernel='linear'), dict(C = range(1, 20)))

In [17]:
# fit gs
gs.fit(X_train, y_train)

In [18]:
svc_linear = gs.best_estimator_

# score svc_linear
svc_linear.score(X_test, y_test)

0.9777777777777777
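
A single test split of 45 rows separates the models by only one or two samples, so it can help to look at the cross-validated score next to the test score. The sketch below reruns the three searches side by side; it assumes `sklearn.datasets.load_iris` in place of the seaborn DataFrame, and `searches` and `summary` are hypothetical names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123
)

C_values = [0.1, 1, 10, 100]
gammas = ["auto", "scale"]
searches = {
    "rbf": GridSearchCV(SVC(kernel="rbf"), {"C": C_values, "gamma": gammas}),
    "poly": GridSearchCV(
        SVC(kernel="poly", degree=2),
        {"C": C_values, "gamma": gammas, "coef0": list(range(1, 10))},
    ),
    "linear": GridSearchCV(SVC(kernel="linear"), {"C": list(range(1, 20))}),
}

summary = {}
for name, search in searches.items():
    search.fit(X_tr, y_tr)
    summary[name] = {
        "cv": search.best_score_,          # mean cross-validated accuracy
        "test": search.score(X_te, y_te),  # accuracy of the refit best estimator
    }

for name, scores in summary.items():
    print(name, round(scores["cv"], 3), round(scores["test"], 3))
```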

## Exercise 5

The linear kernel is the simplest of the three and matches the best test accuracy seen on this dataset (about 0.978, the same as the rbf model), so we stick with the `svc_linear` model for predictions.

1. To analyze the classification performance of `svc_linear` in more depth, generate the classification report of the model.

Which of the three species has a perfect classification score?

- Setosa

In [19]:
cls_report = classification_report(y_test, svc_linear.predict(X_test))
print(cls_report)

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        18
  versicolor       0.91      1.00      0.95        10
   virginica       1.00      0.94      0.97        17

    accuracy                           0.98        45
   macro avg       0.97      0.98      0.97        45
weighted avg       0.98      0.98      0.98        45
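
The same question can be answered programmatically by asking `classification_report` for a dictionary (`output_dict=True`) instead of text. This is a minimal sketch, assuming `sklearn.datasets.load_iris` with string labels and a linear `SVC` with default `C`, rather than the tuned `svc_linear` above; the names `report` and `perfect` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris_bunch = load_iris()
X_demo = iris_bunch.data
y_demo = iris_bunch.target_names[iris_bunch.target]  # map integer codes to species names

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123
)

clf = SVC(kernel="linear").fit(X_tr, y_tr)  # default C, not the tuned model
report = classification_report(y_te, clf.predict(X_te), output_dict=True)

species = [str(name) for name in iris_bunch.target_names]
# A class is "perfect" when precision and recall are both 1, so its f1-score is 1
perfect = [s for s in species if report[s]["f1-score"] == 1.0]
print(perfect)
```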

