# Chapter 5: Support Vector Machines







## SVM Classification
SVMs are good tools for classification on small and medium-sized datasets. The method focuses on creating a large margin between training samples from different classes. That is, it forms a decision boundary that stays as far as possible from the nearest training instances while still classifying accurately.

The method is very sensitive to feature scaling.

Hard margin classification, which requires the data to be linearly separable, is also very sensitive to outliers.

Soft margin classification balances keeping the margins large while tolerating a certain number of margin violations (i.e. letting data fall within the margins defined by the support vectors).

Hyperparameter $C$ in scikit-learn controls this tradeoff. High values of $C$ allow few margin violations while low values of $C$ allow many. If the model overfits, reducing $C$ regularizes it.
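The effect of $C$ can be checked numerically. The sketch below is an illustration only: it swaps in `SVC(kernel="linear")` rather than `LinearSVC`, the helper `margin_report` is a hypothetical name, and the two values of $C$ are arbitrary. It reports the margin width $2/\|w\|$ and the total slack $\sum_i \max(0, 1 - t_i(w^Tx_i + b))$.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_iris = StandardScaler().fit_transform(iris["data"][:, (2, 3)])  # petal length, petal width
t_iris = np.where(iris["target"] == 2, 1, -1)  # +1 for Iris virginica, -1 otherwise


def margin_report(C):
    """Fit a linear soft-margin SVM; return (margin width, total slack, violation count)."""
    clf = SVC(kernel="linear", C=C).fit(X_iris, t_iris)
    scores = clf.decision_function(X_iris)       # w^T x + b for each instance
    slack = np.maximum(0, 1 - t_iris * scores)   # zero for points outside the margin
    width = 2 / np.linalg.norm(clf.coef_)        # distance between the two margins
    return width, slack.sum(), int((slack > 1e-6).sum())


for C in (0.01, 100):
    width, total_slack, n_violations = margin_report(C)
    print(f"C={C}: width={width:.3f}, total slack={total_slack:.3f}, violations={n_violations}")
```

A smaller $C$ should give a wider margin at the cost of more total slack, which is the tradeoff described above.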

In [2]:
import numpy as np 
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

In [4]:
iris = datasets.load_iris()
X = iris['data'][:, (2,3)]
y = (iris['target'] == 2).astype(np.float64)

In [5]:
svm_clf = Pipeline([
    ('scaler', StandardScaler()), 
    ('linear_svc', LinearSVC(C=1, loss = 'hinge'))
])

svm_clf.fit(X,y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('linear_svc', LinearSVC(C=1, loss='hinge'))])

In [7]:
svm_clf.predict([[5.5,1.7], [2.5,6.2]])

array([1., 1.])

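When the data is not linearly separable, we can map it to a higher-dimensional space with a feature map (here, polynomial features) and fit a linear SVM there.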


In [8]:
from sklearn.datasets import make_moons 
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import PolynomialFeatures

In [14]:
X, y = make_moons(n_samples = 100, noise = 0.15)

In [15]:
poly_svm_clf = Pipeline([
    ('poly_features', PolynomialFeatures(degree = 3)), 
    ('scaler', StandardScaler()), 
    ('svm_clf', LinearSVC(C = 10, loss = 'hinge'))
])

poly_svm_clf.fit(X,y)


Pipeline(steps=[('poly_features', PolynomialFeatures(degree=3)),
                ('scaler', StandardScaler()),
                ('svm_clf', LinearSVC(C=10, loss='hinge'))])
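To see what the `poly_features` step feeds into the linear SVM, the following sketch expands a single made-up two-feature point with `PolynomialFeatures(degree=3)`. The input values are arbitrary; the `powers_` attribute lists the exponent of each input feature in each output column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=3)
x = np.array([[2.0, 3.0]])        # one sample with features (x1, x2)
expanded = poly.fit_transform(x)

print(poly.powers_)               # one row of exponents (of x1, x2) per output feature
print(expanded)                   # the bias term plus all monomials up to degree 3
print(expanded.shape)
```

Two input features become ten, and the count grows quickly with the degree and the number of inputs, which is the cost the kernel trick avoids.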

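We can use the kernel trick very effectively within the SVM framework: it gives the same result as adding polynomial features without actually computing them. We just pass the kernel to the SVC constructor (LinearSVC does not support kernels). _degree_ is the degree of the polynomial kernel and _coef0_ controls how much the model is influenced by high-degree terms versus low-degree terms.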

In [20]:
from sklearn.svm import SVC 
poly_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()), 
    ('svm_clf', SVC(kernel = 'poly', degree = 3, coef0 = 1, C = 5))
])

poly_kernel_svm_clf.fit(X,y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('svm_clf', SVC(C=5, coef0=1, kernel='poly'))])
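For reference, scikit-learn's polynomial kernel has the form $K(a, b) = (\gamma\, a^Tb + r)^d$, with $d$ = `degree` and $r$ = `coef0`. The sketch below checks that formula against `sklearn.metrics.pairwise.polynomial_kernel` on random data. The value of `gamma` is fixed by hand here as an assumption for the illustration; `SVC` picks its own default.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))   # four made-up points
B = rng.normal(size=(3, 2))   # three made-up points

gamma, coef0, degree = 0.5, 1.0, 3
K_sklearn = polynomial_kernel(A, B, degree=degree, gamma=gamma, coef0=coef0)
K_manual = (gamma * A @ B.T + coef0) ** degree   # (gamma * a.b + r)^d for every pair

print(np.allclose(K_sklearn, K_manual))
```

Expanding $(\gamma\, a^Tb + r)^d$ shows why _coef0_ matters: a larger $r$ puts more weight on the lower-order terms of the expansion.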

You can also create similarity features using a Gaussian RBF kernel and expand dimensions this way. _gamma_ is the scaling parameter in the Gaussian kernel $K(a,b) = \exp(-\gamma\|a-b\|^2)$, so $\gamma$ controls how quickly similarity decays with distance. If it is set very small, all points look almost equally similar, and the dimension expansion does very little. (Overfitting: reduce $\gamma$. Underfitting: increase $\gamma$).

In [21]:
rbf_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()), 
    # coef0 only affects the 'poly' and 'sigmoid' kernels; it is ignored for 'rbf'
    ('svm_clf', SVC(kernel = 'rbf', gamma = 5, coef0 = 1, C = 0.001))
])

rbf_kernel_svm_clf.fit(X,y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('svm_clf', SVC(C=0.001, coef0=1, gamma=5))])
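The role of $\gamma$ is easiest to see on the kernel matrix itself. This is a small sketch using `sklearn.metrics.pairwise.rbf_kernel` on three made-up points; the $\gamma$ values are arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # three arbitrary points

for gamma in (0.001, 1.0, 10.0):
    K = rbf_kernel(pts, gamma=gamma)   # K[i, j] = exp(-gamma * ||pts[i] - pts[j]||^2)
    print(f"gamma={gamma}")
    print(np.round(K, 3))
```

With a tiny $\gamma$ every entry is close to 1 (all points look alike), while a large $\gamma$ drives the off-diagonal entries toward 0, so each point is only similar to itself.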

## Computational Complexity


Let $m$ be the number of training instances and $n$ the number of features.

LinearSVC - scales like $O(mn)$, takes longer if higher precision is required, does not support kernels.

SVC - scales between $O(m^2n)$ and $O(m^3n)$, supports kernels.

This is why kernelized SVMs are not recommended for large numbers of training instances, even though they scale well with the number of features.




## SVM Regression

The objective is reversed: fit as many points as possible between the margins while limiting the number of margin violations (here, points that fall outside the margins).

The width of the margin is controlled by a hyperparameter $\epsilon$. Adding training instances that fall inside the margin does not change the model's predictions, so the model is said to be $\epsilon$-insensitive.

In [23]:
from sklearn.svm import LinearSVR 
svm_reg = LinearSVR(epsilon = 1.5)
svm_reg.fit(X, y)

LinearSVR(epsilon=1.5)

In [25]:
from sklearn.svm import SVR
svm_poly_reg = SVR(kernel ='poly', degree= 3, C = 5, epsilon = 0.1)
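The $\epsilon$-insensitive behaviour comes from the loss, which is zero for any residual smaller than $\epsilon$. Below is a minimal NumPy sketch of that loss, $\max(0, |y - \hat{y}| - \epsilon)$. The function name and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)


y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 2.0, 4.0, 1.0])   # residuals: 0.2, 0.0, 1.0, 3.0

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5)
print(losses)
```

The first two points lie inside the tube and contribute nothing, so moving them around within the tube would not change the fit.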

## Mathematical Details


Linear SVM: $\hat{y} = \mathbf{1}(w^Tx + b \geq 0)$

The decision boundary is where the decision function $w^Tx + b = 0$, and we define the margins as where $w^Tx + b = \pm 1$. We look to change $(w,b)$ to maximize the distance between the decision boundary and the margins.

Note that the slope of the decision function is $\|w\|$, so each margin sits at distance $1/\|w\|$ from the decision boundary. By reducing $\|w\|$ we therefore _increase_ the distance between the margins. We also want to make few or no classification errors while maintaining this large margin. Let $t_i = -1$ if $y_i = 0$ and $t_i = 1$ if $y_i = 1$.

Then we can write the **hard margin** linear svm objective as 

$$(\hat{w}, \hat{b}) = \arg\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{subject to}\quad t_i(w^Tx_i + b)\geq 1, \quad i = 1, \dots, m$$
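As a sanity check on the hard margin objective, here is a sketch on a tiny, made-up separable dataset. It approximates the hard margin by using `SVC(kernel="linear")` with a very large $C$ (an assumption of this illustration, since scikit-learn has no dedicated hard-margin estimator), then inspects the constraint values $t_i(w^Tx_i + b)$ and the margin width $2/\|w\|$.

```python
import numpy as np
from sklearn.svm import SVC

# Two classes separated along the first axis: x1 = 0 versus x1 = 2.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
t_toy = np.array([-1, -1, 1, 1])

# A very large C leaves almost no room for margin violations.
clf = SVC(kernel="linear", C=1e3).fit(X_toy, t_toy)
w, b = clf.coef_[0], clf.intercept_[0]

constraint_values = t_toy * clf.decision_function(X_toy)   # t_i (w^T x_i + b)
margin_width = 2 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("t_i (w^T x_i + b) =", constraint_values)
print("margin width =", margin_width)
```

Every constraint value should be at least 1 up to solver tolerance, and the margin width should be close to the gap of 2 between the two classes.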


To achieve a soft margin, we need to let some instances violate the constraint $t_i(w^Tx_i + b) \geq 1$, i.e. fall inside the margin. We do so by introducing a set of slack variables $\{\xi_i\}_{i=1}^m$ and write the **soft margin** linear SVM objective as

$$(\hat{w}, \hat{b}) = \arg\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m\xi_i \quad \text{subject to}\quad t_i(w^Tx_i + b)\geq 1 - \xi_i,\quad \xi_i \geq 0$$
 

Both of these objectives can be solved using convex quadratic programming (QP). For kernel problems you consider the dual of the QP. This dual relies only on inner products of the observations, so we can use Mercer's theorem to avoid full dimension expansion.

Mercer's theorem: Under suitable conditions on $K$, there exists a feature map $\phi(\cdot)$ such that $K(x,y) = \phi(x)^T\phi(y)$.

Therefore, instead of applying the potentially infinite-dimensional feature map $\phi$, it's enough to pass the kernel values to the QP, which takes at most $O(m^2)$ kernel evaluations.
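A concrete instance of $K(x,y) = \phi(x)^T\phi(y)$ is the second-degree polynomial kernel in two dimensions, $K(a,b) = (a^Tb)^2$, whose feature map can be written explicitly as $\phi(x) = (x_1^2, \sqrt{2}\,x_1x_2, x_2^2)$. The sketch below checks the identity on two arbitrary vectors; the function name `phi` and the sample values are chosen only for this illustration.

```python
import numpy as np


def phi(x):
    """Explicit feature map for the kernel K(a, b) = (a^T b)^2 in two dimensions."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])


a = np.array([1.0, 2.0])
b_vec = np.array([3.0, -1.0])

explicit = phi(a) @ phi(b_vec)   # inner product in the expanded 3-D space
kernel = (a @ b_vec) ** 2        # same value from the original 2-D vectors

print(explicit, kernel)
```

The kernel side never builds the expanded vectors, which is what makes the dual formulation practical for high-degree or infinite-dimensional maps.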



Online versions of this problem instead use gradient descent (GD) on the hinge-loss cost function $$J(w,b) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \max(0, 1 - t_i (w^Tx_i + b))$$
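A minimal sketch of this approach follows, using batch subgradient descent on $J(w,b)$ with a fixed learning rate. The function names, the toy dataset, and the hyperparameter values are assumptions made for this illustration and are not tuned. Since the hinge term is not differentiable at the margin, the update uses a subgradient: only instances with $t_i(w^Tx_i + b) < 1$ contribute.

```python
import numpy as np


def hinge_cost(w, b, X, t, C):
    """J(w, b) = 0.5 * ||w||^2 + C * sum_i max(0, 1 - t_i (w^T x_i + b))."""
    margins = t * (X @ w + b)
    return 0.5 * w @ w + C * np.maximum(0, 1 - margins).sum()


def fit_linear_svm_gd(X, t, C=10.0, eta=0.001, n_epochs=5000):
    """Batch subgradient descent on the hinge-loss cost, starting from w = 0, b = 0."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        violators = t * (X @ w + b) < 1                   # instances with non-zero hinge loss
        grad_w = w - C * (t[violators] @ X[violators])    # subgradient with respect to w
        grad_b = -C * t[violators].sum()                  # subgradient with respect to b
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b


# Tiny made-up separable dataset with targets in {-1, +1}.
X_gd = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
t_gd = np.array([-1, -1, 1, 1])

w_gd, b_gd = fit_linear_svm_gd(X_gd, t_gd)
pred_gd = np.where(X_gd @ w_gd + b_gd >= 0, 1, -1)

print("w =", w_gd, "b =", b_gd)
print("cost =", hinge_cost(w_gd, b_gd, X_gd, t_gd, C=10.0))
print("predictions =", pred_gd)
```

With a constant step size the iterates hover near the minimizer rather than settling exactly on it; a decaying learning rate is a common remedy.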