## Chapter 5 - Support Vector Machines

### Nonlinear SVC

The support vector classifier is appropriate when the boundary between the two classes is linear. In practice, however, we sometimes face non-linear boundaries. In that case we can enlarge the feature space with higher-order features. For example, rather than fitting a support vector classifier on $p$ features $\begin{pmatrix} X_1, \cdots, X_p\end{pmatrix}$, we add a squared term for each feature and fit the support vector classifier on $2p$ features $\begin{pmatrix} X_1, X_1^2, \cdots, X_p, X_p^2\end{pmatrix}$. The optimisation problem then becomes:

$$\underset{\beta_0, \beta_{11}, \beta_{12}, \cdots, \beta_{p1},\beta_{p2},\epsilon_1, \cdots, \epsilon_n}{\text{Maximise }}M \text{ s. t. }$$
$$\sum_{j=1}^p\sum_{k=1}^2 \beta_{jk}^2=1$$
$$y_i\begin{pmatrix}\beta_0 + \sum_{j=1}^p\beta_{j1}x_{ij} + \sum_{j=1}^p\beta_{j2}x^2_{ij} \end{pmatrix}\geq M(1-\epsilon_i)\,\,\forall i \in \{1,\cdots,n\}$$
$$\epsilon_i \geq 0\,\,\forall i \in \{1,\cdots,n\}\,\,, \sum_{i=1}^n \epsilon_i \leq C$$

In this enlarged feature space, the decision boundary is linear. In the original feature space, however, the decision boundary has the form $q(x)=0$ where $q$ is a quadratic polynomial, and its solutions are generally non-linear. By extension, we can enlarge the feature space with higher-order polynomial terms or interaction terms.

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import (make_moons, load_iris)
from sklearn.preprocessing import (PolynomialFeatures, StandardScaler)
from sklearn.svm import LinearSVC, SVC, LinearSVR, SVR
from sklearn.model_selection import train_test_split

To do this in scikit-learn, transform the inputs with `PolynomialFeatures` before training.

In [2]:
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [3]:
# Transform to 3rd degree polynomial features
polyfeatures1 = PolynomialFeatures(degree=3)
scaler1 = StandardScaler()
X_expt1 = polyfeatures1.fit_transform(X_train)
X_expt1 = scaler1.fit_transform(X_expt1)

# Train on polynomial features
clf_expt1 = LinearSVC(C=10, loss='hinge', max_iter=1000000)
clf_expt1.fit(X_expt1, y_train)

LinearSVC(C=10, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='hinge', max_iter=1000000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
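The cell above fits the transformers on the training split only. To score on held-out data, the same fitted transformers must be applied with `transform` rather than `fit_transform`. One way to keep this straight is a scikit-learn pipeline. The following is a minimal sketch under assumptions: the name `poly_svc`, the explicit `dual=True`, and the fixed `random_state` are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Same data and split as in the notebook
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Chain the feature expansion, the scaling and the linear classifier.
# dual=True is set explicitly because the hinge loss is only solved in the dual.
poly_svc = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LinearSVC(C=10, loss="hinge", dual=True, max_iter=1_000_000, random_state=42),
)
poly_svc.fit(X_train, y_train)

# score() applies the already-fitted transformers to the test split
test_accuracy = poly_svc.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

The pipeline refits nothing at prediction time, so statistics from the test split never leak into the polynomial expansion or the scaler.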

It is not hard to see that there are endless ways to enlarge the feature space, and that we can easily end up with a very large number of features. Computations then become unmanageable. The support vector machine allows us to enlarge the feature space used by the support vector classifier in a way that leads to efficient computations.
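To get a feel for how quickly an explicit expansion grows, the sketch below counts the columns that `PolynomialFeatures` would produce for a hypothetical input with $p=10$ features. The choice of `p` and of the degrees is purely illustrative. With the bias column included, the count equals $\binom{p+d}{d}$.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

p = 10                      # illustrative number of original features
dummy = np.zeros((1, p))    # the values are irrelevant, only the shape matters

counts = {}
for degree in (1, 2, 3, 5):
    n_out = PolynomialFeatures(degree=degree).fit(dummy).n_output_features_
    counts[degree] = n_out
    # Closed form: all monomials of total degree <= d in p variables
    print(f"degree={degree}: {n_out} features (comb = {comb(p + degree, degree)})")
```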

### Nonlinear SVM Classification - Using Kernels

The support vector machine (SVM) is an extension of the support vector classifier that results from <u>enlarging the feature space using kernels</u>. This is computationally more efficient than constructing the extra features explicitly.

The following from SKLearn implements this using a 3rd degree polynomial kernel.

In [4]:
# Train an SVC with a 3rd degree polynomial kernel directly on the original features
clf_expt12 = SVC(kernel='poly', degree=3, coef0=1, C=10)
clf_expt12.fit(X_train, y_train)

SVC(C=10, break_ties=False, cache_size=200, class_weight=None, coef0=1,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='poly',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

The solution to the support vector classifier problem involves only the <u>inner products</u> of the observations (as opposed to the observations themselves). The inner product of two vectors $\vec{a}$ and $\vec{b}$ of length $r$, written $\langle\,\vec{a},\vec{b}\rangle$, is $\sum_{i=1}^ra_ib_i$.

In [5]:
a, b = np.array([3,4]), np.array([2,7])
print(np.dot(a,b))

34


The inner product of two observations $x_i,x_{i'}$, denoted by $\langle x_{i},x_{i'}\rangle$ is hence
$$\langle x_{i},x_{i'}\rangle = \sum_{j=1}^p x_{ij}x_{i'j}$$

The linear SVC solution can then be represented as:

$$f(x) = \beta_0 + \sum_{i=1}^n \alpha_i \langle x,x_{i}\rangle$$

where there are $n$ parameters $\alpha_i$, $i \in \{1,\cdots,n\}$, one per training observation.

To estimate the parameters $\alpha_1, \cdots, \alpha_n$ and $\beta_0$, all we need are the $\binom{n}{2}$ inner products $\langle x_i,x_{i'}\rangle$ between all pairs of training observations.

It turns out that $\alpha_i$ is nonzero only for the support vectors in the solution: if a training observation is not a support vector, then its $\alpha_i=0$. So if $S$ is the collection of indices of the support vectors, we can rewrite the solution function in the form:

$$f(x) = \beta_0 + \sum_{i\in S} \alpha_i \langle x,x_{i}\rangle$$

which typically involves far fewer terms.

To expand the solution, instead of using the inner product $\langle x,x_{i}\rangle$ we use a generalisation of the inner product in the form: $$K(x_i,x_{i'})$$

where $K$ is some function we refer to as a <u>kernel</u>. A kernel is a function that quantifies the similarity of two observations. If $K_{\text{linear}}(x_i,x_{i'}) = \sum_{j=1}^p x_{ij}x_{i'j}$, then $K_{\text{linear}}$ is the linear kernel and we recover the support vector classifier.

If instead the kernel is $K_{\text{polynomial}}(x_i,x_{i'}) = (1 + \sum_{j=1}^p x_{ij}x_{i'j})^d$, where $d$ is a positive integer, then $K_{\text{polynomial}}$ is the polynomial kernel of degree $d$.
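As a small check that the polynomial kernel really is an inner product in an enlarged space, the sketch below takes $p=2$ and $d=2$. The explicit map `phi` is a standard construction written out here for illustration (it is not defined in the text above); it sends $(x_1, x_2)$ to $(1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, \sqrt{2}x_1x_2, x_2^2)$.

```python
import numpy as np


def phi(x):
    """Explicit degree-2 feature map for a 2-dimensional input (illustrative)."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2, x1 ** 2, r2 * x1 * x2, x2 ** 2])


def poly_kernel(a, b, d=2):
    """Polynomial kernel (1 + <a, b>)^d computed in the original space."""
    return (1.0 + np.dot(a, b)) ** d


a, b = np.array([3.0, 4.0]), np.array([2.0, 7.0])

k_value = poly_kernel(a, b)               # two multiplications and a square
explicit_value = np.dot(phi(a), phi(b))   # inner product in six dimensions

print(k_value, explicit_value)            # both equal 1225 up to rounding
```

The kernel evaluation never builds the six-dimensional vectors, which is the whole point: the cost depends on $p$, not on the size of the enlarged space.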

Combining the support vector classifier with a nonlinear kernel gives the support vector machine. With the polynomial kernel, the solution has the form:

$$\begin{align}f(x) &= \beta_0 + \sum_{i\in S} \alpha_i K_{\text{polynomial}}(x,x_i)\\&=\beta_0 + \sum_{i\in S} \alpha_i \left(1 + \sum_{j=1}^p x_{j}x_{ij}\right)^d\end{align}$$
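The support-vector form of the solution can be checked against a fitted scikit-learn model. The sketch below is only illustrative: it assumes `gamma=1.0` and `coef0=1.0` so that scikit-learn's polynomial kernel $(\gamma\langle x, x'\rangle + c_0)^d$ matches the kernel defined above, and it relies on the fitted attributes `support_vectors_`, `dual_coef_` and `intercept_`. Note that `dual_coef_` stores the coefficients already multiplied by the class sign, which is what plays the role of $\alpha_i$ in the formula here.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import polynomial_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.30, random_state=42)

# gamma=1 and coef0=1 give the kernel (1 + <x, x'>)^3
clf = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=10).fit(X, y)

X_new = X[:5]  # a few points to evaluate f(x) on

# Kernel between each new point and each support vector: shape (5, n_support)
K = polynomial_kernel(X_new, clf.support_vectors_, degree=3, gamma=1.0, coef0=1.0)

# f(x) = beta_0 + sum over support vectors of coefficient * K(x, x_i)
manual = K @ clf.dual_coef_.ravel() + clf.intercept_[0]
builtin = clf.decision_function(X_new)

print(np.allclose(manual, builtin, atol=1e-6))
print(f"{len(clf.support_)} support vectors out of {len(X)} training points")
```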


Another popular choice is the radial kernel, $K_{\text{radial}}(x_i,x_{i'}) = \exp(-\gamma \sum_{j=1}^p (x_{ij}-x_{i'j})^2)$ where $\gamma$ is a positive constant. With the radial kernel, the solution has the form:

$$\begin{align}f(x) &= \beta_0 + \sum_{i\in S} \alpha_i K_{\text{radial}}(x,x_i)\\&=\beta_0 + \sum_{i\in S} \alpha_i \exp\left(-\gamma \sum_{j=1}^p (x_{j}-x_{ij})^2\right)\end{align}$$

In [6]:
# Feature scaling (use the new scaler2; scaler1 was fitted on the polynomial features)
scaler2 = StandardScaler()
X_expt3 = scaler2.fit_transform(X_train)

# Train on RBF Kernel
clf_expt13 = SVC(kernel='rbf', gamma=5, C=0.001)
clf_expt13.fit(X_expt3, y_train)

SVC(C=0.001, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=5, kernel='rbf', max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False)
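The cell above uses one particular pair of hyperparameters (`gamma=5`, `C=0.001`). To see how much the choice matters, the sketch below refits the RBF classifier for a small, arbitrarily chosen grid and reports test accuracy. The grid values and the use of a pipeline are assumptions made for illustration; the printed accuracies depend on the data split, so none are quoted here.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scores = {}
for gamma in (0.1, 5):
    for C in (0.001, 1000):
        # Scale, then fit an RBF-kernel SVC with this (gamma, C) pair
        model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=C))
        model.fit(X_train, y_train)
        scores[(gamma, C)] = model.score(X_test, y_test)
        print(f"gamma={gamma}, C={C}: test accuracy {scores[(gamma, C)]:.3f}")
```

Roughly speaking, a larger `gamma` makes each training observation's influence more local, and a larger `C` penalises margin violations more heavily, so both push towards a more flexible boundary.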

How does the radial kernel work? If an unseen test observation $x^*$ is far from a training observation $x_i$ in terms of Euclidean distance, then $\sum_{j=1}^p (x^*_{j}-x_{ij})^2$ is large and hence $K_{\text{radial}}(x^*,x_{i})$ is very small. Then $x_i$ plays virtually no role in $f(x^*)$. In other words, training observations far from $x^*$ play essentially no role in the predicted class label for $x^*$.

This means the radial kernel has very local behaviour, where only nearby training observations have an effect on the class label of the test observation.
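The local behaviour is easy to see numerically. The sketch below evaluates the radial kernel between a test point and two hypothetical training points, one near and one far. The points and the value `gamma=5.0` are made up for illustration.

```python
import numpy as np


def rbf_kernel_value(x, z, gamma):
    """Radial kernel exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))


gamma = 5.0
x_star = np.array([0.0, 0.0])   # test observation
x_near = np.array([0.1, 0.0])   # training point at distance 0.1
x_far = np.array([3.0, 0.0])    # training point at distance 3

k_near = rbf_kernel_value(x_star, x_near, gamma)  # exp(-0.05), close to 1
k_far = rbf_kernel_value(x_star, x_far, gamma)    # exp(-45), effectively 0

print(f"near: {k_near:.4f}, far: {k_far:.2e}")
```

Whatever coefficient the far point receives, its term in $f(x^*)$ is multiplied by a number that is negligible next to the near point's contribution.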

The advantage of using kernels is computational. With a kernel, we only need to compute $K(x_i,x_{i'})$ for the $\binom{n}{2}$ distinct pairs of observations, and this can be done without explicitly working in the enlarged feature space. This matters because the enlarged feature space can be so large that computations in it are intractable. For the radial kernel, the feature space is implicit and infinite-dimensional, so we could never carry out the computations there explicitly.

### SVM Regression (SVR)

The SVM algorithm is versatile: it also supports linear and nonlinear regression. The idea is to reverse the objective: instead of fitting the widest possible street between two classes while limiting margin violations, SVM regression tries to fit as many instances as possible on the street while limiting the instances that fall off it. The width of the street is controlled by the hyperparameter $\epsilon$.

In [7]:
# Linear SVR, no polynomial features
svr1 = LinearSVR(epsilon=1.5)
svr1.fit(X_train, y_train)

LinearSVR(C=1.0, dual=True, epsilon=1.5, fit_intercept=True,
          intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
          random_state=None, tol=0.0001, verbose=0)
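The cell above reuses the moons data, whose targets are class labels. To see the role of $\epsilon$ with a continuous target, the sketch below generates a synthetic noisy line (an assumption made for illustration, not data from this notebook) and measures the fraction of training points that fall inside the street for two values of `epsilon`. It uses `SVR(kernel="linear")` rather than `LinearSVR` purely as a convenient stand-in for a linear model.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic regression data: y = 2x + 1 plus unit-variance Gaussian noise
rng = np.random.default_rng(42)
X_reg = rng.uniform(0, 10, size=(200, 1))
y_reg = 2.0 * X_reg.ravel() + 1.0 + rng.normal(scale=1.0, size=200)

fractions = {}
for eps in (0.5, 1.5):
    model = SVR(kernel="linear", C=1.0, epsilon=eps).fit(X_reg, y_reg)
    residuals = np.abs(y_reg - model.predict(X_reg))
    # Points with |residual| <= epsilon lie on the street and incur no loss
    fractions[eps] = np.mean(residuals <= eps)
    print(f"epsilon={eps}: {fractions[eps]:.0%} on the street, "
          f"{len(model.support_)} support vectors")
```

A wider street absorbs more points, so fewer observations end up as support vectors and the fit becomes less sensitive to individual instances.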

Similarly, use a polynomial kernel if we want an SVM regressor with a polynomial (rather than linear) solution.

In [8]:
# SVR with a 2nd degree polynomial kernel (no explicit polynomial features)
svr2 = SVR(kernel='poly', degree=2, C=100, epsilon=0.1)
svr2.fit(X_train, y_train)

SVR(C=100, cache_size=200, coef0=0.0, degree=2, epsilon=0.1, gamma='scale',
    kernel='poly', max_iter=-1, shrinking=True, tol=0.001, verbose=False)