A support vector machine (SVM) is a binary linear classifier whose decision boundary is explicitly constructed to minimize generalization error.

Recall:

Binary classifier – solves a two-class problem
Linear classifier – creates a linear decision boundary (a straight line in 2D, a hyperplane in higher dimensions)

The decision boundary is derived using geometric reasoning (as opposed to the algebraic reasoning we’ve used to derive other classifiers). Generalization error is tied to the geometric concept of margin, the region along the decision boundary that is free of data points: the wider the margin, the better the classifier is expected to generalize.



The goal of an SVM is to find the linear decision boundary with the largest margin. This boundary is commonly called the maximum margin hyperplane (MMH).
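
To make the margin concrete, here is a minimal sketch that fits a linear SVM on two linearly separable iris classes and measures the margin. It relies on the standard result that, when the decision function f(x) = w·x + b is scaled so the closest points satisfy |f(x)| = 1, the margin width is 2/||w||. The large C value (to approximate a hard margin) and the choice of the setosa/versicolor subset are assumptions made for illustration, not part of the lesson.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
# Keep only classes 0 (setosa) and 1 (versicolor), which are linearly separable
mask = iris.target < 2
X2, y2 = iris.data[mask], iris.target[mask]

# A large C approximates a hard margin (assumption chosen for illustration)
svm = SVC(kernel='linear', C=1000.0).fit(X2, y2)

w = svm.coef_[0]                         # normal vector of the separating hyperplane
margin_width = 2.0 / np.linalg.norm(w)   # geometric width of the margin

# Functional margin y * f(x) with labels mapped to -1/+1;
# it should be close to 1 for the points nearest the boundary
signed = np.where(y2 == 1, 1.0, -1.0) * svm.decision_function(X2)

print("margin width: {:.3f}".format(margin_width))
print("smallest y * f(x): {:.3f}".format(signed.min()))
print("support vectors: {}".format(len(svm.support_vectors_)))
```

The points with the smallest y * f(x) are the support vectors: they sit on the edge of the margin and are the only points that determine the boundary.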

Nonlinear applications of SVM rely on an implicit (nonlinear) mapping that sends vectors from the original feature space K into a higher-dimensional feature space K’. Nonlinear classification in K is then obtained by creating a linear decision boundary in K’. In practice, this involves no computations in the higher dimensional space, thanks to what is called the kernel trick.
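
The kernel trick can be checked numerically on a small case. The sketch below uses the degree-2 polynomial kernel k(x, z) = (x·z)^2 on 2-D inputs, for which an explicit 3-D feature map is known; the function names `phi` and `poly2_kernel` are hypothetical helpers introduced here for illustration.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the kernel k(x, z) = (x . z)**2 on 2-D inputs."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """Same inner product, computed without leaving the original 2-D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # inner product in the 3-D space K'
implicit = poly2_kernel(x, z)       # kernel evaluated in the 2-D space K

print(explicit, implicit)
```

Both values agree, so an algorithm that only needs inner products can work in K’ while computing everything in K.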

In [1]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data
y = iris.target

Scikit-learn implements support vector machine models in the sklearn.svm module.

In [2]:
model = SVC(kernel='linear')
model.fit(X, y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Notice that the SVC class has several parameters. In particular we are concerned with two:

C: penalty parameter of the error term (regularization); smaller values mean stronger regularization
kernel: the type of kernel used ('linear', 'poly', 'rbf', 'sigmoid', 'precomputed', or a callable)
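
One rough way to see what C does is to count support vectors at different settings. The sketch below is illustrative only: the three C values are arbitrary choices, and the usual pattern (a smaller C tolerates more margin violations and so tends to produce more support vectors) is a tendency rather than a guarantee.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

n_support = {}
for C in [0.01, 1, 100]:   # arbitrary values spanning weak to strong penalties
    svm = SVC(kernel='linear', C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class
    n_support[C] = int(svm.n_support_.sum())
    print("C={:>6}: {} support vectors".format(C, n_support[C]))
```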

Notes from the documentation:

In the current implementation the fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.
The multi-class support is handled according to a one-vs-one scheme.
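
The one-vs-one scheme trains one binary classifier per pair of classes, so k classes give k(k-1)/2 classifiers. The sketch below makes this visible by asking SVC for the raw pairwise scores; the 4-class synthetic data from make_blobs is an assumption chosen because with the 3 iris classes the pair count also happens to be 3.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic 4-class data, used only for illustration
Xb, yb = make_blobs(n_samples=200, centers=4, random_state=0)

# decision_function_shape='ovo' exposes one score column per pair of classes
ovo = SVC(kernel='linear', decision_function_shape='ovo').fit(Xb, yb)
scores = ovo.decision_function(Xb)

n_classes = len(ovo.classes_)
n_pairs = n_classes * (n_classes - 1) // 2
print("classes: {}, pairwise classifiers: {}".format(n_classes, n_pairs))
print("decision_function shape: {}".format(scores.shape))
```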

As usual we can calculate the cross validated score to judge the quality of the model.

In [4]:
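# cross_val_score lives in sklearn.model_selection in current scikit-learn
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation, using all available cores
cvscores = cross_val_score(model, X, y, cv=5, n_jobs=-1)
print("CV score: {:.3} +/- {:.3}".format(cvscores.mean(), cvscores.std()))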

CV score: 0.98 +/- 0.0163


## Guided Practice: Tuning an SVM 

An SVM rarely performs well without tuning its parameters.

Check: Try performing a grid search over kernel type and regularization strength to find the optimal score for the above data.

In [7]:
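# GridSearchCV lives in sklearn.model_selection in current scikit-learn
from sklearn.model_selection import GridSearchCV

# Try every combination of kernel type and regularization strength
parameters = {'kernel': ('linear', 'rbf'), 'C': [0.1, 1, 3, 10]}
clf = GridSearchCV(model, parameters, n_jobs=-1)
clf.fit(X, y)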

GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'kernel': ('linear', 'rbf'), 'C': [0.1, 1, 3, 10]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [8]:
clf.best_estimator_

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [9]:
clf.best_score_

0.97999999999999998

Check: Can you think of pros and cons for Support Vector Machines?

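Pros:

Very powerful, good performance
Can be used for anomaly detection (one-class SVM)

Cons:

Training becomes slow on large datasets (fit time grows more than quadratically with the number of samples)
Prone to overfit (need regularization)
Black box: the fitted model is hard to interpret, especially with nonlinear kernels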
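
As a small illustration of the anomaly-detection use mentioned under the pros, the sketch below fits scikit-learn's OneClassSVM on the iris measurements and then scores a deliberately implausible point. The values nu=0.05 and gamma='scale' and the made-up far-away measurement are assumptions for demonstration, not tuned settings.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import OneClassSVM

X, _ = load_iris(return_X_y=True)

# nu is an upper bound on the fraction of training points treated as outliers;
# 0.05 is an arbitrary choice for illustration
detector = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05).fit(X)

far_point = np.array([[100.0, 100.0, 100.0, 100.0]])  # made-up measurement
labels = detector.predict(np.vstack([X[:3], far_point]))

# predict returns +1 for points judged normal and -1 for points judged anomalous
print(labels)
```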

In this class we have learned about Support Vector Machines. We've seen how they are powerful in many situations and what some of their limitations are.

Can you think of a way to apply them in business?