# CS-5600/6600 Lecture 16 - Support Vector Machines (SVM)

**Instructor: Dylan Zwick**

*Weber State University*

References:
* [Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781098125967/) by Aurélien Géron - [Support Vector Machines](https://github.com/ageron/handson-ml3/blob/main/05_support_vector_machines.ipynb)
* [An Introduction to Statistical Learning](https://www.statlearning.com/) by James, Witten, Hastie, Tibshirani, and Taylor

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.datasets import load_iris
from sklearn.datasets import make_moons

from sklearn.svm import SVC
from sklearn.svm import LinearSVC

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

Today we're going to talk about something that's very important in machine learning and, really, in life itself - knowing where to draw the line.

But first, what is a line?

<center>
  <img src="https://drive.google.com/uc?export=view&id=1oXXGBPcJjtJD7fn154t6czxA_HdqS2mB" alt="A Line Is A Curve">
</center>

###How To Draw Lines###

The basic formula for a line, which is the formula we used in linear regression, is:

<center>
  $y = mx + b$
</center>

This is called *slope-intercept* form, and it's the form you want to use if you're representing a function. You plug in the value of $x$, and it gives you the value of $y$.

Now, there's another way you can express a line algebraically, and it's called *standard form*. In this form you write the line as:

<center>
  $ax + by = c$
</center>

The line is defined as all the values of $x$ and $y$ that satisfy this relationship. It's usually straightforward to convert from standard form to slope-intercept form:

<center>
  $\displaystyle y = -\left(\frac{a}{b}\right)x + \frac{c}{b}$
</center>

However, this only works if $b \neq 0$. If $b = 0$ then the standard form equation reduces to:

<center>
  $\displaystyle x = \frac{c}{a}$
</center>

In other words, it's a vertical line. (Note we need to assume that at least one of $a$ and $b$ is non-zero.) While this would be madness for a function, it's perfectly acceptable for a line, and it's an option we'll want. So, for this lecture we'll use the standard form.
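
As a quick illustration of the conversion, here's a minimal sketch. The helper name `standard_to_slope_intercept` is made up for this example; it returns the slope and intercept when $b \neq 0$, and reports a vertical line otherwise.

```python
def standard_to_slope_intercept(a, b, c):
    """Convert a*x + b*y = c to slope-intercept form (hypothetical helper).

    Returns ("function", m, k) for y = m*x + k when b != 0,
    or ("vertical", x_value) for the vertical line x = c / a when b == 0.
    """
    if a == 0 and b == 0:
        raise ValueError("a and b cannot both be zero")
    if b == 0:
        return ("vertical", c / a)
    return ("function", -a / b, c / b)


print(standard_to_slope_intercept(2, 4, 8))   # slope -0.5, intercept 2.0
print(standard_to_slope_intercept(3, 0, 6))   # vertical line at x = 2.0
```

The second call is exactly the case slope-intercept form can't handle, which is why we stick with standard form.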

###Picking Sides###

Let's bring back our old friend, the Iris dataset.

In [None]:
iris = datasets.load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = iris.target

setosa_or_versicolor = (y == 0) | (y == 1)
X = X[setosa_or_versicolor]
y = y[setosa_or_versicolor]

And take a look at its values in a chart:

In [None]:
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", label="Iris versicolor")
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", label="Iris setosa")
plt.xlabel("Petal length")
plt.ylabel("Petal width")
plt.legend(loc="upper left")

plt.show()

Suppose we wanted to build a model that predicts - based on petal width and petal length - whether a new observation is a versicolor or a setosa. One way to do this would be to just draw a line and say anything to one side is versicolor, and anything to the other side is setosa.

How would we formulate this mathematically? Well, our equation for a line is:

<center>
  $\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} = 0$
</center>

*Note* - We moved the constant to the same side as the variables, and switched $x$ and $y$ for $X_{1}$ and $X_{2}$, and $-c$, $a$, and $b$ for $\beta_{0}$, $\beta_{1}$, and $\beta_{2}$, because that will make it easier to generalize to more dimensions later.

Suppose I get a new observation and its petal length and width values are $(x_{1},x_{2})$. If that observation happens to be directly on my line, then we'll have:

<center>
  $\beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} = 0$.
</center>

If it's not on the line, then we will either have:

<center>
  $\beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} > 0$,
</center>

or

<center>
  $\beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} < 0$.
</center>

These possibilities correspond to the *side* of the line where we find $(x_{1},x_{2})$. So, our model would take the values of our observation, namely $(x_{1},x_{2})$, plug them into the formula for our line, namely $\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2}$, and check its sign.

<center>
  <img src="https://drive.google.com/uc?export=view&id=1mIlRJaL07QJI56tHn4R1o4hTc-xCJ-7C" alt="What's Your Sign?">
</center>

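In fact, if we've scaled the coefficients so that $\beta_{1}^{2} + \beta_{2}^{2} = 1$, the value of $\beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2}$ will be the distance from the line to the point, where a positive value is distance on one side, and a negative value is distance on the other. Without that scaling the value is still proportional to the distance, and its sign still tells us the side.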
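
To make the sign test concrete, here's a small sketch. The coefficients are made up for illustration (they are not a fitted model), and `line_value` and `signed_distance` are hypothetical helper names. Dividing by $\sqrt{\beta_{1}^{2} + \beta_{2}^{2}}$ turns the raw value into an actual distance.

```python
import numpy as np

# Made-up coefficients for the line beta0 + beta1*X1 + beta2*X2 = 0
beta0, beta1, beta2 = -5.0, 3.0, 4.0


def line_value(x1, x2):
    """Plug a point into the left-hand side of the line equation."""
    return beta0 + beta1 * x1 + beta2 * x2


def signed_distance(x1, x2):
    """Signed perpendicular distance from the point to the line."""
    return line_value(x1, x2) / np.hypot(beta1, beta2)


# One point on each side of the line, and one exactly on it
for point in [(3.0, 4.0), (0.0, 0.0), (1.0, 0.5)]:
    value = line_value(*point)
    print(point, "sign:", np.sign(value), "distance:", signed_distance(*point))
```

A classifier built on this line would only look at the sign; the distance tells us how far the point is from the boundary.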

###Drawing The Line###

Now we know how we can create a predictive model from a simple line. But of course the big question now is - which line? There are many lines that we could draw on our iris chart. Which line is best?

Let's take a look at a few:

In [None]:
# Bad models
x0 = np.linspace(0, 5.5, 200)
pred_1 = 5 * x0 - 20
pred_2 = x0 - 1.8
pred_3 = 0.1 * x0 + 0.5

plt.plot(x0, pred_1, "g--", linewidth=2)
plt.plot(x0, pred_2, "m-", linewidth=2)
plt.plot(x0, pred_3, "r-", linewidth=2)
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", label="Iris versicolor")
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", label="Iris setosa")
plt.xlabel("Petal length")
plt.ylabel("Petal width")
plt.legend(loc="upper left")
plt.axis([0, 5.5, 0, 2])
plt.show()

I think we can probably toss out that dashed green line right away, as it fails to separate the data. But what about the other two? Both the magenta and red lines separate the data - but do they do a good job? If a new observation were just to the upper-left of the magenta line, would you agree it should be a setosa? What about one just below the right end of the red line being a setosa?

I think we'd agree that neither the red nor the magenta line is a good model. But what would be a good model?

This is where we introduce the idea of a **maximal margin classifier**. The idea is that the maximal margin classifier is the line that separates the two classes while staying as far as possible from the closest observations on either side. For the iris data this line would look like:

In [None]:
# Don't worry about understanding this code. For right now it's just for plotting the picture.

# SVM Classifier model
svm_clf = SVC(kernel="linear", C=1e100)
svm_clf.fit(X, y)

def plot_svc_decision_boundary(svm_clf, xmin, xmax):
    w = svm_clf.coef_[0]
    b = svm_clf.intercept_[0]

    # At the decision boundary, w0*x0 + w1*x1 + b = 0
    # => x1 = -w0/w1 * x0 - b/w1
    x0 = np.linspace(xmin, xmax, 200)
    decision_boundary = -w[0] / w[1] * x0 - b / w[1]

    margin = 1/w[1]
    gutter_up = decision_boundary + margin
    gutter_down = decision_boundary - margin
    svs = svm_clf.support_vectors_

    plt.plot(x0, decision_boundary, "k-", linewidth=2, zorder=-2)
    plt.plot(x0, gutter_up, "k--", linewidth=2, zorder=-2)
    plt.plot(x0, gutter_down, "k--", linewidth=2, zorder=-2)
    plt.scatter(svs[:, 0], svs[:, 1], s=180, facecolors='#AAA',
                zorder=-1)

plot_svc_decision_boundary(svm_clf, 0, 5.5)
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs")
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo")
plt.xlabel("Petal length")
plt.axis([0, 5.5, 0, 2])
plt.show()

The line in the center is called the *maximal margin classifier*. The distance from the center line to either dotted line is the *margin*, and the two observations on the boundary of the margin are called the *support vectors*. Thus the name!

###To Higher Dimensions###

Lines in planes are simple and (relatively) easy to visualize. But most of the time our data is higher dimensional. In higher dimensions, we can generalize our equation for a line into an equation for a *hyperplane*:

<center>
  $\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \cdots + \beta_{n}X_{n} = 0$
</center>

If $n = 3$ then this is just the equation for a two-dimensional flat surface in three-dimensional space. In other words, a plane.

The exact same ideas, mutatis mutandis, around margin, maximal margin classifiers, and support vectors extend to hyperplanes.

We note one thing here, and that is that we can have more than one set of coefficients generating the same hyperplane. For example, if we multiplied all the coefficients by the same number, say $2$, then we'd have:

<center>
  $2\beta_{0} + 2\beta_{1}X_{1} + 2\beta_{2}X_{2} + \cdots + 2\beta_{n}X_{n} = 0$
</center>

This would be true if and only if

<center>
  $\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \cdots + \beta_{n}X_{n} = 0$
</center>

were true. So, we can scale our coefficients by a constant amount, and we generally choose to do so such that

<center>
  $\displaystyle \sum_{j = 1}^{n} \beta_{j}^{2} = 1$
</center>

You can view this as the coefficients $\beta_{1}, \beta_{2}, \ldots, \beta_{n}$ representing the components of a unit vector perpendicular to the plane. This sets the orientation of the plane. The constant term $\beta_{0}$ just translates the plane away from the origin.
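
The rescaling argument can be checked numerically. In this sketch the coefficients and points are arbitrary made-up numbers; the only point is that dividing every coefficient by the same positive constant (here, the length of the vector of $\beta_{1}, \ldots, \beta_{n}$) leaves each point on the same side of the hyperplane.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary hyperplane in 4 dimensions: beta0 + beta . x = 0
beta0 = 1.5
beta = np.array([2.0, -1.0, 0.5, 3.0])

# Rescale so the coefficients beta_1..beta_n form a unit vector
norm = np.linalg.norm(beta)
beta0_unit, beta_unit = beta0 / norm, beta / norm

# Check which side of the hyperplane some random points fall on
points = rng.normal(size=(10, 4))
signs_before = np.sign(beta0 + points @ beta)
signs_after = np.sign(beta0_unit + points @ beta_unit)

print("sum of squared coefficients:", np.sum(beta_unit ** 2))
print("same side for every point:", np.array_equal(signs_before, signs_after))
```

Note that a negative constant would also give the same hyperplane, but it would swap which side counts as positive.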

###Constructing the Maximal Margin Classifier###

Suppose we have a set of $n$ training observations $x_{1}, x_{2}, \ldots, x_{n}$, and each training observation has $p$ measured features, so for example $x_{1} = (x_{11}, x_{12}, \ldots, x_{1p})$. Each observation also has a corresponding class label $y_{1}, y_{2}, \ldots, y_{n}$ which is either $-1$ or $1$. Finding the maximal margin classifier means solving the following optimization problem:

<center>
  $\displaystyle \underset{\beta_{0},\beta_{1},\ldots,\beta_{p},M}{\text{maximize}} \; M$

  subject to $\displaystyle \sum_{j = 1}^{p}\beta_{j}^{2} = 1$,

  $\displaystyle y_{i}(\beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{p}x_{ip}) \geq M$ for all $i = 1, \ldots, n$
</center>

We've discussed the first constraint. The second constraint just says that every observation has to lie on the correct side of the hyperplane $\beta_{0} + \beta_{1}X_{1} + \cdots + \beta_{p}X_{p} = 0$, at a distance greater than or equal to $M$ from it.

This is a problem in [convex optimization](https://en.wikipedia.org/wiki/Convex_optimization). The idea is that you're trying to optimize a value subject to being constrained within a convex space. It's a very interesting area of mathematics, which we don't have time to study. So, we won't get into how we solve this problem, but just understand it's the problem that needs to be solved to find the maximal margin classifier.
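
Even without solving the problem ourselves, we can check that a fitted model satisfies the constraints. The sketch below is an illustration under assumptions, not a solver: it leans on scikit-learn's `SVC` with a large `C` (an arbitrary choice of `1e6`) as a stand-in for the maximal margin classifier, recodes the labels as $-1$ and $1$, and rescales the fitted coefficients to unit length.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
mask = iris.target < 2                        # setosa (0) vs. versicolor (1)
X = iris.data[mask][:, 2:4]                   # petal length, petal width
y = np.where(iris.target[mask] == 1, 1, -1)   # labels recoded as -1 / +1

# A very large C approximates the hard (maximal) margin on separable data
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Rescale so that the sum of squared beta_j equals 1
w, b = clf.coef_[0], clf.intercept_[0]
norm = np.linalg.norm(w)
beta, beta0 = w / norm, b / norm

# y_i * (beta0 + beta . x_i) is each observation's distance from the hyperplane
distances = y * (beta0 + X @ beta)
M = distances.min()

print("margin M:", M)
print("1 / ||w||:", 1 / norm)
print("all constraints satisfied:", np.all(distances >= M))
```

The margin $M$ should agree with `1 / ||w||` up to solver tolerance, because scikit-learn scales its coefficients so that the support vectors sit where the decision function equals $\pm 1$.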

##Support Vector Classifier##

OK. So now we know how to draw the line when we've got linearly separable data. Are we done? Well, unfortunately for us, this method has some problems.

First, it's very dependent on the scale of the data. Fortunately, this can be fixed by the appropriate scaling of our features.

In [None]:
Xs = np.array([[1, 50], [5, 20], [3, 80], [5, 60]]).astype(np.float64)
ys = np.array([0, 0, 1, 1])
svm_clf = SVC(kernel="linear", C=100).fit(Xs, ys)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(Xs)
svm_clf_scaled = SVC(kernel="linear", C=100).fit(X_scaled, ys)

plt.figure(figsize=(9, 2.7))
plt.subplot(121)
plt.plot(Xs[:, 0][ys==1], Xs[:, 1][ys==1], "bo")
plt.plot(Xs[:, 0][ys==0], Xs[:, 1][ys==0], "ms")
plot_svc_decision_boundary(svm_clf, 0, 6)
plt.xlabel("$x_0$")
plt.ylabel("$x_1$    ", rotation=0)
plt.title("Unscaled")
plt.axis([0, 6, 0, 90])
plt.grid()

plt.subplot(122)
plt.plot(X_scaled[:, 0][ys==1], X_scaled[:, 1][ys==1], "bo")
plt.plot(X_scaled[:, 0][ys==0], X_scaled[:, 1][ys==0], "ms")
plot_svc_decision_boundary(svm_clf_scaled, -2, 2)
plt.xlabel("$x'_0$")
plt.ylabel("$x'_1$  ", rotation=0)
plt.title("Scaled")
plt.axis([-2, 2, -2, 2])
plt.grid()

plt.show()

Second, it can be significantly influenced by just a single outlier datapoint. For example:

In [None]:
X_outlier = np.array([3.2, 0.8])
y_outlier = 0

Xo = np.concatenate([X, X_outlier.reshape(1, -1)], axis=0)
yo = np.concatenate([y, [y_outlier]], axis=0)

svm_clf2 = SVC(kernel="linear", C=10**9)
svm_clf2.fit(Xo, yo)

plt.plot(Xo[:, 0][yo==1], Xo[:, 1][yo==1], "bs")
plt.plot(Xo[:, 0][yo==0], Xo[:, 1][yo==0], "yo")
plot_svc_decision_boundary(svm_clf2, 0, 5.5)
plt.ylabel("Petal width")
plt.xlabel("Petal length")
plt.annotate(
    "Outlier",
    xy=(X_outlier[0], X_outlier[1]),
    xytext=(3.2, 0.08),
    ha="center",
    arrowprops=dict(facecolor='black', shrink=0.1),
)
plt.axis([0, 5.5, 0, 2])
plt.show()

Note how the one outlier datapoint completely transforms our maximal margin classifier line and its margin.

Third, and worst of all: I know it's hard to believe, but sometimes it's not possible to separate classes with a line!

In [None]:
X_outlier = np.array([3.4, 1.3])
y_outlier = 0

Xo = np.concatenate([X, X_outlier.reshape(1, -1)], axis=0)
yo = np.concatenate([y, [y_outlier]], axis=0)

plt.plot(Xo[:, 0][yo==1], Xo[:, 1][yo==1], "bs")
plt.plot(Xo[:, 0][yo==0], Xo[:, 1][yo==0], "yo")
plt.text(0.3, 1.0, "Impossible!", color="red", fontsize=18)
plt.ylabel("Petal width")
plt.xlabel("Petal length")
plt.annotate(
    "Outlier",
    xy=(X_outlier[0], X_outlier[1]),
    xytext=(2.5, 1.7),
    ha="center",
    arrowprops=dict(facecolor='black', shrink=0.1),
)
plt.axis([0, 5.5, 0, 2])
plt.show()

What do we do?!? Well, we learn how to accept some imperfection in our model. More specifically, in the interest of

* Greater robustness to individual observations, and
* Better classification of *most* of the training observations

we allow our model to have a few observations on the wrong side of the margin, or even to misclassify a few training observations. We call this a *soft margin classifier*, also known as a *support vector classifier*.

###Constructing The Support Vector Classifier###

The support vector classifier classifies a test observation depending on which side of the hyperplane it's on. The hyperplane is chosen to correctly separate most of the training observations into two classes, but may misclassify a few. Precisely, it's the solution to the optimization problem:

<center>
  $\displaystyle \underset{\beta_{0},\beta_{1},\ldots,\beta_{p},\epsilon_{1},\ldots,\epsilon_{n},M}{\text{maximize}} \; M$

  subject to $\displaystyle \sum_{j = 1}^{p}\beta_{j}^{2} = 1$,

  $\displaystyle y_{i}(\beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{p}x_{ip}) \geq M(1-\epsilon_{i})$ for all $i = 1, \ldots, n$,

  $\displaystyle \epsilon_{i} \geq 0$, $\displaystyle \sum_{i = 1}^{n} \epsilon_{i} \leq C$
</center>

Here $C$ is a non-negative hyperparameter that dictates how much slack we allow within our model. It's our budget for slack, and in the maximal margin classifier we have $C = 0$. The variables $\epsilon_{i}$ are called *slack variables*. They tell us where observation $x_{i}$ sits relative to the margin. If the observation is on the correct side of the margin, then $\epsilon_{i} = 0$. If $0 < \epsilon_{i} < 1$ then $x_{i}$ is on the wrong side of the margin, but still on the correct side of the hyperplane. If $\epsilon_{i} > 1$ then not only is $x_{i}$ on the wrong side of the margin, but it's classified incorrectly. Since the $\epsilon_{i}$ must sum to at most $C$, at most $C$ observations can be classified incorrectly.

**Note** - This is how the term $C$ is defined in *An Introduction to Statistical Learning*. In sklearn, the parameter `C` works in the opposite direction: it's a penalty on margin violations, so it behaves roughly like the inverse of this $C$. So, in sklearn a low value of `C` means a high tolerance for error, while a high value of `C` means a low tolerance for error. Sorry that it's confusing. I didn't choose the terms different sources use.
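
To see the slack variables in action, we can recover them from a fitted model. This is a sketch under assumptions: it uses scikit-learn's `SVC` (with scikit-learn's convention for `C`) and relies on scikit-learn scaling its coefficients so that the margin boundaries sit where the decision function $f$ equals $\pm 1$, which gives $\epsilon_{i} = \max(0, 1 - y_{i}f(x_{i}))$.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, 2:4]                       # petal length, petal width
y = np.where(iris.target == 2, 1, -1)       # virginica (+1) vs. the rest (-1)

model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))
model.fit(X, y)

# Slack: how far each observation falls short of its margin boundary
f = model.decision_function(X)
slack = np.maximum(0, 1 - y * f)

# Points sitting right on the margin may show a tiny slack from solver tolerance
violates_margin = slack > 0                 # on the wrong side of the margin
misclassified = slack > 1                   # on the wrong side of the hyperplane

print("observations violating the margin:", violates_margin.sum())
print("observations misclassified:", misclassified.sum())
print("total slack used:", slack.sum())
```

Rerunning this with a smaller `C` should show more total slack being used, which matches the higher tolerance for error described in the note.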

We can take a look at the models we get for different values of $C$:

In [None]:
iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2)  # Iris virginica

scaler = StandardScaler()
svm_clf1 = LinearSVC(C=1, max_iter=10_000, dual=True, random_state=42)
svm_clf2 = LinearSVC(C=100, max_iter=10_000, dual=True, random_state=42)

scaled_svm_clf1 = make_pipeline(scaler, svm_clf1)
scaled_svm_clf2 = make_pipeline(scaler, svm_clf2)

scaled_svm_clf1.fit(X, y)
scaled_svm_clf2.fit(X, y)

# Convert to unscaled parameters
b1 = svm_clf1.decision_function([-scaler.mean_ / scaler.scale_])
b2 = svm_clf2.decision_function([-scaler.mean_ / scaler.scale_])
w1 = svm_clf1.coef_[0] / scaler.scale_
w2 = svm_clf2.coef_[0] / scaler.scale_
svm_clf1.intercept_ = np.array([b1])
svm_clf2.intercept_ = np.array([b2])
svm_clf1.coef_ = np.array([w1])
svm_clf2.coef_ = np.array([w2])

# Find support vectors (LinearSVC does not do this automatically)
t = y * 2 - 1
support_vectors_idx1 = (t * (X.dot(w1) + b1) < 1).to_numpy()
support_vectors_idx2 = (t * (X.dot(w2) + b2) < 1).to_numpy()
svm_clf1.support_vectors_ = X[support_vectors_idx1]
svm_clf2.support_vectors_ = X[support_vectors_idx2]

fig, axes = plt.subplots(ncols=2, figsize=(10, 2.7), sharey=True)

plt.sca(axes[0])
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^", label="Iris virginica")
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs", label="Iris versicolor")
plot_svc_decision_boundary(svm_clf1, 4, 5.9)
plt.xlabel("Petal length")
plt.ylabel("Petal width")
plt.legend(loc="upper left")
plt.title(f"$C = {svm_clf1.C}$")
plt.axis([4, 5.9, 0.8, 2.8])
plt.grid()

plt.sca(axes[1])
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^")
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs")
plot_svc_decision_boundary(svm_clf2, 4, 5.99)
plt.xlabel("Petal length")
plt.title(f"$C = {svm_clf2.C}$")
plt.axis([4, 5.9, 0.8, 2.8])
plt.grid()

plt.show()

We can see that the classifier on the left is more tolerant of errors than the one on the right.

The optimization problem above has a very interesting property: it turns out that only observations that either lie on the margin or that violate the margin will affect the hyperplane, and hence the classifier. In other words, an observation that lies strictly on the correct side of the margin does not affect the support vector classifier! Those observations that do affect the classifier are known as the *support vectors*.
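
One way to convince yourself of this is to throw away everything except the support vectors and refit. The sketch below assumes scikit-learn's `SVC`, whose `support_` attribute holds the indices of the support vectors; the tight `tol` is just a choice made here so the two fits can be compared closely. The refit should land on essentially the same coefficients, up to solver tolerance.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X = StandardScaler().fit_transform(iris.data[:, 2:4])  # scaled petal features
y = (iris.target == 2).astype(int)                     # virginica vs. the rest

full = SVC(kernel="linear", C=1, tol=1e-6).fit(X, y)

# Keep only the support vectors and fit again
sv = full.support_
reduced = SVC(kernel="linear", C=1, tol=1e-6).fit(X[sv], y[sv])

print("training observations:", len(X), "support vectors:", len(sv))
print("full fit:   ", full.coef_[0], full.intercept_[0])
print("reduced fit:", reduced.coef_[0], reduced.intercept_[0])
```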

##Support Vector Machines##

I've got some more bad news. It turns out that for many types of classification problems, linear boundaries just don't make a lot of sense. Take for example our moons dataset:

In [None]:
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs")
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$", rotation=0)
plt.show()

If we try to build a support vector classifier for this, we're going to have a bad time (or, at least, not build a very good model):

In [None]:
svm_clf3 = SVC(kernel="linear", C=1)
svm_clf3.fit(X, y)

plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs")
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^")
plot_svc_decision_boundary(svm_clf3, -1.5, 2.5)
plt.xlabel("$x_1$")
plt.ylabel("$x_2$", rotation=0)
plt.axis([-1.5, 2.5, -.75, 1.5])
plt.show()

How can we do better?

Well, in the same way that linear regression does poorly when there is a non-linear relationship between the predictors and the outcome, a support vector classifier does poorly when there is a non-linear boundary between our classes. With regression, we enlarged our feature space by using nonlinear terms. Well, we can do the same for support vector classifiers.

So, for example, instead of fitting a line, we could fit a quadratic defined by:

<center>
  $\displaystyle \beta_{0} + \sum_{j = 1}^{p}\beta_{j1}X_{j} + \sum_{j = 1}^{p}\beta_{j2}X_{j}^{2} = 0$
</center>

This would mean solving the optimization problem (assuming soft margins):

<center>
  $\displaystyle \underset{\beta_{0},\beta_{11},\beta_{12},\ldots,\beta_{p1},\beta_{p2},\epsilon_{1},\ldots,\epsilon_{n},M}{\text{maximize}} \; M$

  subject to

  $\displaystyle y_{i}\left(\beta_{0} + \sum_{j = 1}^{p}\beta_{j1}x_{ij} + \sum_{j = 1}^{p}\beta_{j2}x_{ij}^{2}\right) \geq M(1-\epsilon_{i})$ for all $i = 1, \ldots, n$,

  $\displaystyle \epsilon_{i} \geq 0$, $\displaystyle \sum_{i = 1}^{n} \epsilon_{i} \leq C$, $\displaystyle \sum_{j = 1}^{p}\sum_{k = 1}^{2}\beta_{jk}^{2} = 1$
</center>

<center>
  <img src="https://drive.google.com/uc?export=view&id=1Fs_4m8TuYorGXS7VlUrs6HJtyHOs974s" alt="ALF">
</center>
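
Adding polynomial terms by hand is easy enough to try in scikit-learn. The sketch below is one way to do it, assuming a `PolynomialFeatures`, `StandardScaler`, `LinearSVC` pipeline; the degree of 3 and `C=10` are arbitrary choices for illustration, not tuned values.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# Plain linear support vector classifier on the original two features
linear_clf = make_pipeline(
    StandardScaler(),
    LinearSVC(C=10, max_iter=10_000, dual=True, random_state=42),
)

# Same classifier, but on all polynomial terms up to degree 3
poly_clf = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LinearSVC(C=10, max_iter=10_000, dual=True, random_state=42),
)

linear_acc = linear_clf.fit(X, y).score(X, y)
poly_acc = poly_clf.fit(X, y).score(X, y)

n_features = poly_clf.named_steps["polynomialfeatures"].n_output_features_
print("training accuracy, linear features:", linear_acc)
print("training accuracy, degree-3 features:", poly_acc)
print("features after expansion:", n_features)
```

Two features turn into ten at degree 3, and that count grows quickly with more features or a higher degree.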

When we go down this path, it can get complicated quickly. A support vector machine lets us enlarge the feature space used by the support vector classifier in a way that keeps computations manageable.

This is honestly the part of the course where I most wish we could dive into the math, because the math behind the support vector machine is actually pretty slick. It involves the use of something called a *kernel*, which is a way of measuring similarity, and a very clever insight called the "kernel trick". However, that would take us too far afield. So, just understand that a support vector machine is still optimizing a soft margin, just like we did above, but for a nonlinear, curved boundary.
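
For the curious, here's a tiny taste of the idea, offered as an illustrative sketch rather than a full explanation. For two features, the degree-2 polynomial kernel $K(a, b) = (1 + a \cdot b)^{2}$ gives exactly the same number as a dot product in a six-dimensional enlarged feature space, without ever building that space. The feature map `phi` below is one standard choice, written out by hand for this example.

```python
import numpy as np


def phi(x):
    """Explicit degree-2 feature map for a 2-D point (written out by hand)."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, s * x1 * x2, x2 ** 2])


def poly_kernel(a, b):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return (1.0 + np.dot(a, b)) ** 2


a = np.array([0.5, -1.0])
b = np.array([2.0, 3.0])

print("dot product in enlarged space:", phi(a) @ phi(b))
print("kernel in original space:     ", poly_kernel(a, b))
```

Both lines print the same value. The kernel gets there using only the original two features, and that shortcut is what keeps the computation manageable.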

We can see how one of these would work on our moons dataset, using a degree-3 polynomial kernel and $C = 5$.

In [None]:
poly_kernel_svm_clf = make_pipeline(StandardScaler(),
                                    SVC(kernel="poly", degree=3, coef0=1, C=5))
poly_kernel_svm_clf.fit(X, y)

In [None]:
def plot_dataset(X, y, axes):
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs")
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^")
    plt.axis(axes)
    plt.grid(True)
    plt.xlabel("$x_1$")
    plt.ylabel("$x_2$", rotation=0)

def plot_predictions(clf, axes):
    x0s = np.linspace(axes[0], axes[1], 100)
    x1s = np.linspace(axes[2], axes[3], 100)
    x0, x1 = np.meshgrid(x0s, x1s)
    X = np.c_[x0.ravel(), x1.ravel()]
    y_pred = clf.predict(X).reshape(x0.shape)
    y_decision = clf.decision_function(X).reshape(x0.shape)
    plt.contourf(x0, x1, y_pred, cmap=plt.cm.brg, alpha=0.2)
    plt.contourf(x0, x1, y_decision, cmap=plt.cm.brg, alpha=0.1)

plot_predictions(poly_kernel_svm_clf, [-1.5, 2.45, -1, 1.5])
plot_dataset(X, y, [-1.5, 2.4, -1, 1.5])
plt.title("degree=3, C=5")
plt.show()

Not bad!

##References

* [Support Vector Machines](https://youtu.be/efR1C6CvhmE?si=ZBO8fNh3vBhfr0cz) from StatQuest