# Support Vector Machine

Let's draw two plots of the same data and try to answer this question:

- Which one is the line that separates the points better?

As a preview, this is how a Support Vector Machine (SVM) works: it looks for the line that separates the points while staying as far away from them as possible.

SVM extends the classification criterion a bit further: instead of settling for any line that separates the points, it looks for the one that is as far away from the points as possible. We do this by drawing two more lines, parallel to the main line and equidistant from it, and we try to **maximise** the distance between these two lines, which is called the **margin**.

![](../python-for-data-science/assets/img/svm-boundary.png)
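To make the margin concrete, here is a minimal sketch on a made-up, linearly separable toy dataset (the points, the variable names and the very large `C` are assumptions for illustration, not part of the original data). For a linear SVM with weight vector `w`, the two parallel lines sit at a distance of `2 / ||w||` from each other, so maximising the margin amounts to keeping `||w||` small.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data, invented for this sketch: two points at x1 = 0, two at x1 = 2.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard margin (no points allowed inside it).
toy_svm = SVC(kernel='linear', C=1e6)
toy_svm.fit(X_toy, y_toy)

w = toy_svm.coef_[0]        # normal vector of the separating line
b = toy_svm.intercept_[0]   # offset of the separating line

# Distance between the two parallel lines on either side of the boundary.
margin_width = 2 / np.linalg.norm(w)

print('w =', w, 'b =', b)
print('margin width =', margin_width)
```

For this toy layout the separating line should end up near `x1 = 1`, with a margin width close to 2, since the two classes are exactly 2 units apart along `x1`.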

The error function now becomes a bit more complex. It has two parts:
$$
\text{error} = \text{classification error} + \text{margin error}
$$
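The text above does not spell out the two terms, so the following is only a sketch under a common assumption: the soft-margin formulation, where the classification error is the sum of hinge losses (weighted by `C`) and the margin error is `0.5 * ||w||^2`. The function name `svm_error` and the demo values are hypothetical.

```python
import numpy as np

def svm_error(w, b, X, y, C=1.0):
    """Return (classification_error, margin_error, total) for labels y in {-1, +1}.

    Assumes the standard soft-margin objective:
    total = C * sum(hinge losses) + 0.5 * ||w||^2
    """
    scores = X @ w + b
    # Hinge loss is zero for points on the correct side of their margin line.
    hinge = np.maximum(0.0, 1.0 - y * scores)
    classification_error = C * hinge.sum()
    # A small ||w|| means a wide margin, so penalising ||w|| widens the margin.
    margin_error = 0.5 * np.dot(w, w)
    return classification_error, margin_error, classification_error + margin_error

# Made-up points: the second one lies inside the margin (score 0.5 < 1).
X_demo = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]])
y_demo = np.array([1, 1, -1])
w_demo = np.array([1.0, 0.0])
b_demo = 0.0

cls_err, mar_err, total = svm_error(w_demo, b_demo, X_demo, y_demo)
print(cls_err, mar_err, total)
```

Only the point inside the margin contributes to the classification error here, even though it is on the correct side of the main line.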

## SVM in Scikit-Learn

In [50]:
from sklearn.svm import SVC, LinearSVC
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

In [6]:
data = pd.read_csv("data.csv", names=['x1', 'x2', 'label'])

In [14]:
X = data[['x1', 'x2']].values
y = data['label'].values

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=111)

In [35]:
linear_svm = SVC(kernel='linear')
poly_svm = SVC(kernel='poly', degree=4)
rbf_svm = SVC(kernel='rbf')
# sigmoid_svm = SVC(kernel='sigmoid')
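To see why the choice of kernel can matter, here is a hedged sketch on synthetic data that is deliberately not linearly separable. It uses `make_circles` rather than the `data.csv` file above, and all names and parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate them.
X_c, y_c = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_c, y_c, test_size=0.25, random_state=0)

kernel_scores = {}
for kernel in ['linear', 'poly', 'rbf']:
    clf = SVC(kernel=kernel)
    clf.fit(Xc_train, yc_train)
    # score() returns the mean accuracy on the given data.
    kernel_scores[kernel] = clf.score(Xc_test, yc_test)
    print(kernel, kernel_scores[kernel])
```

On data like this the RBF kernel should clearly outperform the linear one, because it can draw a closed boundary around the inner ring.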


In [39]:
for model in [linear_svm, poly_svm, rbf_svm]:
    model.fit(X_train, y_train)
    # Training accuracy is measured on the data the model was fitted on,
    # test accuracy on the held-out split.
    train_acc_score = metrics.accuracy_score(y_train, model.predict(X_train))
    test_acc_score = metrics.accuracy_score(y_test, model.predict(X_test))
    print('Kernel:', model.kernel)
    print('Training Acc Score', train_acc_score)
    print('Test Acc Score', test_acc_score)

Kernel: linear
Training Acc Score 0.6527777777777778
Test Acc Score 0.7083333333333334
Kernel: poly
Training Acc Score 0.6527777777777778
Test Acc Score 0.7083333333333334
Kernel: rbf
Training Acc Score 0.6527777777777778
Test Acc Score 0.7083333333333334




In [43]:
model = SVC(kernel='poly', C=10**8, degree=4)
model.fit(X_train, y_train)
# y_pred = model.predict(X_test)
print ('Train Acc Score', metrics.accuracy_score(y_train, model.predict(X_train)))
print ('Test Acc Score', metrics.accuracy_score(y_test, model.predict(X_test)))



Train Acc Score 0.7222222222222222
Test Acc Score 0.7083333333333334
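The `C` parameter used above controls how strongly classification errors are penalised relative to the margin. The sketch below uses made-up overlapping blobs (not `data.csv`) and hypothetical variable names to show one visible effect: with a small `C` the margin is wide and many points become support vectors, while a large `C` narrows the margin and usually leaves fewer.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, invented for this sketch.
X_b, y_b = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                      cluster_std=1.0, random_state=0)

n_support = {}
for C in [0.01, 1, 100]:
    clf = SVC(kernel='linear', C=C)
    clf.fit(X_b, y_b)
    # n_support_ holds the number of support vectors per class.
    n_support[C] = int(clf.n_support_.sum())
    print('C =', C, 'support vectors:', n_support[C],
          'train accuracy:', clf.score(X_b, y_b))
```

A very large value such as `C=10**8` pushes the model towards fitting the training points as closely as it can, which may improve the training score but can also make fitting slow and the boundary sensitive to individual points.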


In [46]:
iris = load_iris()
iris.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [55]:
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1111)

In [56]:
for model in [linear_svm, poly_svm, rbf_svm]:
    model.fit(X_train, y_train)
    # Training accuracy is measured on the data the model was fitted on,
    # test accuracy on the held-out split.
    train_acc_score = metrics.accuracy_score(y_train, model.predict(X_train))
    test_acc_score = metrics.accuracy_score(y_test, model.predict(X_test))
    print('Kernel:', model.kernel)
    print('Training Acc Score', train_acc_score)
    print('Test Acc Score', test_acc_score)

Kernel: linear
Training Acc Score 0.9910714285714286
Test Acc Score 0.9736842105263158
Kernel: poly
Training Acc Score 0.9910714285714286
Test Acc Score 0.9473684210526315
Kernel: rbf
Training Acc Score 0.9821428571428571
Test Acc Score 0.9736842105263158




In [57]:
linSVC = LinearSVC(random_state=1111)
linSVC.fit(X_train, y_train)
# Score linSVC itself: `model` still refers to rbf_svm from the loop above,
# so predicting with it would just repeat the RBF results.
print('Train', metrics.accuracy_score(y_train, linSVC.predict(X_train)))
print('Test', metrics.accuracy_score(y_test, linSVC.predict(X_test)))


