# Support Vector Machine

What is a Hyperplane?
* in a k-dimensional space, a hyperplane is a flat affine subspace of dimension k-1
* in two dimensions, a hyperplane is a flat one-dimensional subspace, i.e. a line
* in three dimensions, a hyperplane is a flat two-dimensional subspace, i.e. a plane
* for k > 3 dimensions a hyperplane is hard to visualize, but the notion of a (k-1)-dimensional flat subspace still applies

mathematical definition of a hyperplane in two dimensions:

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0$$

* when saying the above equation "defines" the hyperplane, we're saying that any $X = (X_1,X_2)'$ for which the equation holds is a point on the hyperplane

in the k-dimensional case one has

$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k = 0$

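If

$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k > 0$

then $X$ lies on one side of the hyperplane. On the other hand, if

$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k < 0$

then $X$ lies on the other side of the hyperplane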

* think of the hyperplane as dividing k-dimensional space into two halves
* one can easily determine on which side of the hyperplane a point lies by calculating the sign of the left-hand side of the hyperplane equation

![](img\hyperplane_1.png)

* hyperplane $1 + 2X_1 + 3X_2 = 0$ is shown above. Blue region is the set of points for which $1 + 2X_1 + 3X_2 > 0$, and the purple region is the set of points for which $1 + 2X_1 + 3X_2 < 0$
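
The sign rule can be checked numerically. Below is a minimal sketch using the hyperplane $1 + 2X_1 + 3X_2 = 0$ from the figure; the helper `side` and the three test points are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Hyperplane from the figure: 1 + 2*X1 + 3*X2 = 0
beta0 = 1.0
beta = np.array([2.0, 3.0])

def side(x):
    """Return +1, -1 or 0 according to the sign of beta0 + beta'x."""
    value = beta0 + beta @ np.asarray(x, dtype=float)
    return int(np.sign(value))

# Illustrative points (hypothetical, chosen so the arithmetic is easy to follow)
print(side([1.0, 1.0]))    # 1 + 2 + 3 = 6  > 0 -> 1
print(side([-1.0, -1.0]))  # 1 - 2 - 3 = -4 < 0 -> -1
print(side([1.0, -1.0]))   # 1 + 2 - 3 = 0      -> 0 (on the hyperplane)
```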

## Classification using a Separating Hyperplane

![](img\hyperplane_2.png)

* suppose we have observations $\{(y_1, x_{1,1}, x_{1,2}), (y_2, x_{2,1}, x_{2,2}), \dots, (y_n, x_{n,1}, x_{n,2})\}$
* we know these n observations fall into two classes: $y_1, y_2, \dots, y_n \in \{-1,1\}$, where $-1$ represents one class and $1$ represents the other class
* the left graph is showing three out of many possible separating hyperplanes
* the right graph is showing the decision rule made by a classifier based on this particular hyperplane (black line); if a test observation falls in the blue portion of the grid, it will be assigned to the blue class, and vice versa
* a separating hyperplane for any k has the property that $\beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_k x_{i,k}~>0~\text{if}~y_i = 1$ and $\beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_k x_{i,k}~<0~\text{if}~y_i = -1$
* equivalently, a separating hyperplane has the property that $y_i(\beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_k x_{i,k}) > 0$ for all $i = 1,\dots,n$ since $y_i \in \{-1,1\}$

EX. Imagine we're given a test observation $\textbf{x}^* = [x_1^*, \dots, x_k^*]'$, then we 'assign' it to a class based on the sign of $f(\textbf{x}^*) = \beta_0 + \beta_1 x_1^* + \dots + \beta_k x_k^*$
* if $f(\textbf{x}^*) > 0$, then we assign this test observation to class 1, and if $f(\textbf{x}^*)<0$, then we assign it to class -1

The magnitude of $f(\textbf{x}^*)$ is also useful
* $f(\textbf{x}^*)$ being far from zero makes us confident about its classification
* when $f(\textbf{x}^*)$ is close to zero, then $\textbf{x}^*$ is located near the hyperplane, therefore we're less confident about the class assignment for it
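
A minimal sketch of this decision rule, assuming a hypothetical hyperplane and four made-up training points (none of them come from the notebook's data). It checks the separating property $y_i f(\textbf{x}_i) > 0$ and then compares a test point far from the hyperplane with one close to it.

```python
import numpy as np

def f(x, beta0, beta):
    """Left-hand side of the hyperplane equation evaluated at x."""
    return float(beta0 + np.asarray(x, dtype=float) @ beta)

def classify(x, beta0, beta):
    """Assign class 1 if f(x) > 0 and class -1 otherwise (ties go to -1 here)."""
    return 1 if f(x, beta0, beta) > 0 else -1

# Hypothetical hyperplane and toy training data
beta0, beta = 1.0, np.array([2.0, 3.0])
X = np.array([[1.0, 1.0], [0.5, 0.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])

# Separating property: y_i * f(x_i) > 0 for every training observation
scores = X @ beta + beta0
separates = bool(np.all(y * scores > 0))
print("separating hyperplane:", separates)

# The magnitude of f(x*) indicates how far the test point is from the hyperplane
x_far, x_near = [3.0, 3.0], [-0.4, -0.1]
for x_star in (x_far, x_near):
    print(x_star, "class:", classify(x_star, beta0, beta),
          "f(x*):", round(f(x_star, beta0, beta), 3))
```

The second test point gets a class label just like the first, but its value of $f$ is close to zero, so the assignment is much less certain.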

## The Maximal Margin Classifier

![](img\hyperplane_3.png)

* can compute the perpendicular distance from each training observation to a given separating hyperplane
* the smallest such distance is the minimal distance from the observations to the hyperplane, also known as the $\textit{margin}$
* the maximal margin hyperplane is the separating hyperplane for which the margin is largest; it's the hyperplane that has the farthest minimum distance to the training observations
* we can then classify a test observation based on which side of the maximal margin hyperplane it lies; this is known as the $\textit{maximal margin classifier}$

EX. we see that three training observations are equidistant from the maximal margin hyperplane and lie along the dashed lines indicating the width of the margin
* these three observations are known as $\textit{support vectors}$ since they are vectors in k-dimensional space (here, k=2)
* they support the maximal margin hyperplane in the sense that if these points were moved slightly then the maximal margin hyperplane would move as well
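
To make the margin concrete, here is a small sketch that computes it for two candidate separating hyperplanes on a hypothetical four-point data set. The function `margin` and both candidates are assumptions made for illustration; the sketch only compares given hyperplanes and does not solve the maximal margin optimization itself.

```python
import numpy as np

def margin(beta0, beta, X, y):
    """Smallest signed perpendicular distance y_i * f(x_i) / ||beta||.

    The value is positive only if the hyperplane separates the two classes.
    """
    beta = np.asarray(beta, dtype=float)
    signed_distances = y * (X @ beta + beta0) / np.linalg.norm(beta)
    return float(np.min(signed_distances))

# Hypothetical toy data: two points per class
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

# Candidate 1: the vertical line x1 = 1
m_vertical = margin(-1.0, [1.0, 0.0], X, y)
# Candidate 2: a tilted line that still separates the classes
m_tilted = margin(-1.5, [1.0, 0.5], X, y)

print("vertical:", m_vertical)
print("tilted:  ", round(m_tilted, 3))
```

Both lines separate the classes, but the vertical one keeps every observation at least one unit away, so it has the larger margin of the two.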

the following code cells
* use a simulated data set of 10,000 observations
* aim to build a classifier that can predict loan default (no or yes) based on the balance and income of the customers


In [7]:
import pandas as pd

df = pd.read_csv('/Users/beliciarodriguez/Documents/GitHub/ECON485-Material-Review/data/hyperplane_data.csv')

In [9]:
df.head()

Unnamed: 0,default,student,balance,income
0,No,No,729.526495,44361.625074
1,No,Yes,817.180407,12106.1347
2,No,No,1073.549164,31767.138947
3,No,No,529.250605,35704.493935
4,No,No,785.655883,38463.495879


In [10]:
df.dtypes

default     object
student     object
balance    float64
income     float64
dtype: object

In [11]:
# map the 'no' and 'yes' strings into numerical values for numerical calculations
df['default'] = df['default'].map({'Yes': 1, 'No': 0})
df['student'] = df['student'].map({'Yes': 1, 'No': 0})
df.head()

Unnamed: 0,default,student,balance,income
0,0,0,729.526495,44361.625074
1,0,1,817.180407,12106.1347
2,0,0,1073.549164,31767.138947
3,0,0,529.250605,35704.493935
4,0,0,785.655883,38463.495879


In [12]:
df.dtypes

default      int64
student      int64
balance    float64
income     float64
dtype: object

a separating hyperplane might not exist, in which case there is no maximal margin classifier

In [13]:
import patsy
y, X = patsy.dmatrices('default ~ -1 + balance + income', data=df, return_type='dataframe')

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

## Support Vector Classifier

* the SVC classifies a test observation depending on which side of the hyperplane it lies
* the hyperplane is chosen to correctly separate most of the training observations into two classes, but may misclassify a few observations
* it is the solution to the optimization problem

\begin{align*}
    &\underset{\beta_{0}, \beta_{1}, \ldots, \beta_{k}, \epsilon_{1}, \ldots, \epsilon_{n}}{\operatorname{maximize}} \quad M \\
    &\text{subject to } \sum_{j=1}^{k} \beta_{j}^{2}=1, \\
    &y_{i}\left(\beta_{0}+\beta_{1} x_{i,1}+\beta_{2} x_{i,2}+\ldots+\beta_{k} x_{i,k}\right) \geq M\left(1-\epsilon_{i}\right), \\
    &\epsilon_{i} \geq 0, \quad \sum_{i=1}^{n} \epsilon_{i} \leq C
\end{align*}

* where $C$ is a nonnegative tuning parameter and $M$ is the width of the margin (which we want to make as large as possible)
* $\epsilon_1, \dots, \epsilon_n$ are slack variables that allow observations to be on the wrong side of the margin

(1) the slack variable $\epsilon_i$ tells us where the ith observation is located relative to the hyperplane and the margin i.e

1. $\epsilon_i = 0$: the ith observation is on the correct side of the margin
2. $\epsilon_i > 0$: the ith observation is on the wrong side of the margin
3. $\epsilon_i > 1$: the ith observation is on the wrong side of the hyperplane

(2) the tuning parameter $C$ bounds the sum of the $\epsilon_i$'s and can be considered a tolerance parameter

1. If $C=0$ then we are not allowing for violations; must be the case that $\epsilon_1 = \dots = \epsilon_n = 0$, in which case we have the maximal margin classifier (if it exists)
2. For $C > 0$, no more than $C$ observations can be on the wrong side of the hyperplane
* an observation on the wrong side of the hyperplane has $\epsilon_i > 1$
* since $\sum_{i=1}^n \epsilon_i \leq C$, at most $C$ observations can have $\epsilon_i > 1$
* as $C$ increases, we become more tolerant of violations to the margin; the margin will widen
* as $C$ decreases, we become less tolerant of violations to the margin; the margin narrows
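
The role of the slack variables can be illustrated with a short sketch. For a fixed hyperplane and margin $M$, the smallest slack satisfying the constraint is $\epsilon_i = \max(0, 1 - y_i f(\textbf{x}_i)/M)$ when $\sum_j \beta_j^2 = 1$. The hyperplane, the margin and the four points below are hypothetical, and the code only evaluates slacks for given values; it does not solve the optimization problem.

```python
import numpy as np

def slacks(beta0, beta, M, X, y):
    """Smallest epsilon_i with y_i * f(x_i) >= M * (1 - epsilon_i), after
    rescaling so that the coefficient vector has unit length."""
    beta = np.asarray(beta, dtype=float)
    signed_dist = y * (X @ beta + beta0) / np.linalg.norm(beta)
    return np.maximum(0.0, 1.0 - signed_dist / M)

def describe(eps):
    if eps == 0:
        return "correct side of the margin"
    if eps <= 1:
        return "wrong side of the margin"
    return "wrong side of the hyperplane"

# Hypothetical setting: hyperplane x1 = 0 with margin M = 1
X = np.array([[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0], [-2.0, 0.0]])
y = np.array([1, 1, 1, -1])

eps = slacks(0.0, [1.0, 0.0], 1.0, X, y)
for x_i, e_i in zip(X, eps):
    print(x_i, e_i, describe(e_i))

# This configuration is feasible only for a budget C >= sum of the slacks
total_slack = float(eps.sum())
print("required budget:", total_slack)
```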

In [14]:
from sklearn import preprocessing
X_train_scaled = preprocessing.scale(X_train)
X_test_scaled = preprocessing.scale(X_test)
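
Note that `preprocessing.scale` standardizes each split with its own mean and standard deviation. A common alternative, sketched below under the assumption that the test set should be transformed with statistics learned from the training set only, is `StandardScaler`. The arrays here are synthetic stand-ins (names ending in `_demo` are hypothetical) and switching the notebook to this approach could change the scores reported below.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the balance/income columns (not the real data)
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=[800.0, 35000.0], scale=[400.0, 12000.0], size=(200, 2))
X_test_demo = rng.normal(loc=[800.0, 35000.0], scale=[400.0, 12000.0], size=(50, 2))

# Learn the mean and standard deviation on the training split only
scaler = StandardScaler().fit(X_train_demo)

# Apply the same transformation to both splits
X_train_demo_scaled = scaler.transform(X_train_demo)
X_test_demo_scaled = scaler.transform(X_test_demo)

print(X_train_demo_scaled.mean(axis=0).round(6))  # approximately zero
print(X_test_demo_scaled.mean(axis=0).round(3))   # close to, but not exactly, zero
```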

In [15]:
from sklearn.svm import SVC

# Build your classifier
clf = SVC(kernel='linear', C=1)

# Train it on the entire training data set
clf.fit(X_train_scaled, y_train.values.ravel())

# Get predictions on the test set
y_pred = clf.predict(X_test_scaled)

# Assessing the fit
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.9676


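* now choose the hyperparameter $C$ by cross-validation over a grid of potential values; `GridSearchCV` is called with its default `cv=None`, which means 5-fold CV in the scikit-learn version whose output is shown below (older releases defaulted to 3-fold)
* note that scikit-learn's `C` is a penalty on margin violations, so a larger `C` means less tolerance, the reverse of the budget $C$ in the optimization problem above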

In [16]:
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
svc = SVC(kernel='linear')

Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),n_jobs=-1)
clf.fit(X_train_scaled, y_train.values.ravel())

GridSearchCV(cv=None, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='linear', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'C': array([1.00000000e-06, 3.59381366e-06, 1.29154967e-05, 4.64158883e-05,
       1.66810054e-04, 5.99484250e-04, 2.15443469e-03, 7.74263683e-03,
       2.78255940e-02, 1.00000000e-01])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [17]:
clf.best_score_

0.9663999999999999

In [18]:
clf.best_estimator_.C

1e-06

In [19]:
# Test-set accuracy is slightly higher than the cross-validated score on the training set
clf.score(X_test_scaled, y_test)

0.9676
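
The selected value `1e-06` sits at the lower edge of the grid, so it can help to look at the whole cross-validation curve rather than only `best_score_`. The sketch below shows one way to do that with `cv_results_`; it runs on synthetic data from `make_classification` (a stand-in, not the loan data), and the grid `Cs_demo` is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical; not the loan data set)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Explicit grid and explicit number of folds
Cs_demo = np.logspace(-3, 2, 6)
search = GridSearchCV(SVC(kernel="linear"), param_grid={"C": Cs_demo}, cv=5)
search.fit(X_demo, y_demo)

# Mean cross-validated accuracy for every value of C in the grid
for C_value, score in zip(search.cv_results_["param_C"],
                          search.cv_results_["mean_test_score"]):
    print(f"C={float(C_value):g}  mean CV accuracy={score:.3f}")

# Flag the case where the winner lies on the boundary of the grid
best_C = search.best_params_["C"]
at_edge = bool(best_C in (Cs_demo[0], Cs_demo[-1]))
print("best C:", best_C, "| on grid boundary:", at_edge)
```

If the best value is on the boundary, extending the grid in that direction is a reasonable next step.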

## Support Vector Machine
* SVC is a natural approach for classification in the two class setting if the boundary between the two classes is linear
* in practice, however, we are sometimes faced with non-linear class boundaries

![](img\hyperplane_4.png)

* left graph: the observations fall into two classes with a non-linear boundary between them
* right graph: the support vector classifier seeks a linear boundary, and consequently performs very poorly
* it turns out that the solution to the SVC problem involves only the inner products between the point $\textbf{x}$ and the support vectors
* if $\mathcal{S}$ is the collection of indices of these support points, we can rewrite any solution function as $f(\textbf{x}) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i \langle \textbf{x}, \textbf{x}_i \rangle$ where $\langle a,b \rangle = \sum_{j=1}^k a_j b_j$
* therefore we can generalize this solution to $f(\textbf{x}) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i K(\textbf{x}, \textbf{x}_i)$ where $K$ is a kernel function
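
A sketch of the kernelized decision function may help. The kernels below follow the usual linear, polynomial and radial forms, with parameter names `degree`, `gamma` and `coef0` chosen to mirror scikit-learn's; the default values, support vectors, coefficients $\alpha_i$ and intercept are hypothetical and not fitted to any data.

```python
import numpy as np

def linear_kernel(a, b):
    """Plain inner product <a, b>."""
    return float(np.dot(a, b))

def polynomial_kernel(a, b, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <a, b> + coef0) ** degree."""
    return float((gamma * np.dot(a, b) + coef0) ** degree)

def rbf_kernel(a, b, gamma=0.01):
    """exp(-gamma * ||a - b||^2)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def decision_function(x, support_vectors, alphas, beta0, kernel):
    """f(x) = beta0 + sum over support vectors of alpha_i * K(x, x_i)."""
    return beta0 + sum(a_i * kernel(x, x_i)
                       for a_i, x_i in zip(alphas, support_vectors))

# Hypothetical support vectors and coefficients
support_vectors = np.array([[1.0, 0.0], [-1.0, 0.0]])
alphas = [0.5, -0.5]
beta0 = 0.0
x_new = np.array([2.0, 0.0])

f_linear = decision_function(x_new, support_vectors, alphas, beta0, linear_kernel)
f_rbf = decision_function(x_new, support_vectors, alphas, beta0, rbf_kernel)
print("linear kernel:", f_linear)  # 0.5*2 - 0.5*(-2) = 2.0
print("rbf kernel:   ", round(f_rbf, 4))
```

Swapping the kernel changes the shape of the decision boundary while the form of $f$ stays the same, which is what the `kernel` argument of `SVC` controls in the cells that follow.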

In [20]:
svcPoly = SVC(kernel='poly',degree=3)

Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svcPoly, param_grid=dict(C=Cs),n_jobs=-1)
clf.fit(X_train_scaled, y_train.values.ravel())

print(clf.best_score_ )
print(clf.best_estimator_.C  )

# Test-set accuracy is not better than the cross-validated score on the training set
clf.score(X_test_scaled, y_test)

0.9724
0.1


0.9712

In [21]:
svcRadial = SVC(kernel='rbf',gamma=.01)

Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svcRadial, param_grid=dict(C=Cs),n_jobs=-1)
clf.fit(X_train_scaled, y_train.values.ravel())

print(clf.best_score_ )
print(clf.best_estimator_.C  )

# Test-set accuracy is slightly higher than the cross-validated score on the training set
clf.score(X_test_scaled, y_test)

0.9663999999999999
1e-06


0.9676