# Contents

1. [Maximal Margin Classifier](#Maximal-Margin-Classifier)
2. [Support Vector Classifiers](#Support-Vector-Classifiers)
3. [Support Vector Machines](#Support-Vector-Machines)
4. [SVMs with More than Two Classes](#SVMs-with-More-than-Two-Classes)
5. [Relationship to Logistic Regression](#Relationship-to-Logistic-Regression)

---

## Introduction

The support vector machine is a generalization of a simple and intuitive classifier called the *maximal margin classifier*, which we introduce in [Section 9.1](#Maximal-Margin-Classifier). Though it is elegant and simple, we will see that this classifier unfortunately cannot be applied to most data sets, since **it requires that the classes be separable by a linear boundary**.

In [Section 9.2](#Support-Vector-Classifiers), we introduce the *support vector classifier*, an extension of the maximal margin classifier that can be applied in a broader range of cases.

[Section 9.3](#Support-Vector-Machines) introduces the support vector machine, which is a further extension of the support vector classifier in order to accommodate non-linear class boundaries. Support vector machines are intended for the binary classification setting in which there are two classes; in [Section 9.4](#SVMs-with-More-than-Two-Classes) we discuss extensions of support vector machines to **the case of more than two classes**.

In [Section 9.5](#Relationship-to-Logistic-Regression) we discuss the close connections between support vector machines and other statistical methods such as logistic regression.

---

# Maximal Margin Classifier
In this section, we define a hyperplane and introduce the concept of an optimal separating hyperplane.

## What Is a Hyperplane?
In a p-dimensional space, a hyperplane is a flat affine subspace of dimension $p − 1$. For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace—in other words, a line. In three dimensions, a hyperplane is a flat two-dimensional subspace—that is, a plane, etc.

In two dimensions, a hyperplane is defined by the equation
\begin{equation}\label{9.1}
    \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0
    \tag{9.1}
\end{equation}

for parameters $\beta_0, \beta_1$ and $\beta_2$. Note that (\ref{9.1}) is simply the equation of a line, since indeed in two dimensions a hyperplane is a line.

Equation 9.1 can be easily extended to the p-dimensional setting:

\begin{equation}\label{9.2}
    \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p = 0
    \tag{9.2}
\end{equation}

Equation (\ref{9.2}) defines a hyperplane in p-dimensional space, again in the sense that if a point $X = (X_1, X_2, \ldots , X_p)^T$ in p-dimensional space (i.e. a vector of length $p$) satisfies (\ref{9.2}), then $X$ lies on the hyperplane.

Now, suppose that $X$ does not satisfy (\ref{9.2}); rather,

\begin{equation}\label{9.3}
    \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p > 0
    \tag{9.3}
\end{equation}

Then this tells us that $X$ lies to one side of the hyperplane. On the other hand, if

\begin{equation}\label{9.4}
    \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p < 0
    \tag{9.4}
\end{equation}

then $X$ lies on the other side of the hyperplane. So we can think of the hyperplane as dividing p-dimensional space into two halves. One can easily determine on which side of the hyperplane a point lies by *simply calculating the sign* of the left-hand side of (\ref{9.2}).
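The sign rule can be sketched in a few lines of Python. This is only an illustration: the function name `hyperplane_side` and the coefficients are made up for the example and are not part of the text.

```python
def hyperplane_side(beta0, beta, x):
    """Return +1, -1, or 0 according to the sign of beta0 + sum_j beta_j * x_j."""
    value = beta0 + sum(b * xj for b, xj in zip(beta, x))
    if value > 0:
        return 1
    if value < 0:
        return -1
    return 0  # the point lies exactly on the hyperplane


# Hypothetical hyperplane in two dimensions: 1 + 2*X1 + 3*X2 = 0
beta0, beta = 1.0, [2.0, 3.0]

print(hyperplane_side(beta0, beta, [1.0, 1.0]))    # 1 + 2 + 3 = 6 > 0
print(hyperplane_side(beta0, beta, [-1.0, -1.0]))  # 1 - 2 - 3 = -4 < 0
print(hyperplane_side(beta0, beta, [1.0, -1.0]))   # 1 + 2 - 3 = 0, on the line
```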

## Classification Using a Separating Hyperplane
Now suppose that we have an $n \times p$ data matrix $X$ that consists of $n$ training observations in p-dimensional space,

\begin{equation}\label{9.5}
    x_1 =  \begin{pmatrix} x_{11} \\ \vdots \\ x_{1p} \end{pmatrix}, \; \ldots \;, \; x_n = \begin{pmatrix} x_{n1} \\ \vdots \\ x_{np} \end{pmatrix},
    \tag{9.5}
\end{equation}

and that these observations fall into two classes—that is, $y_1, \ldots, y_n \in \{−1, 1\}$ where $−1$ represents one class and $1$ the other class. We also have a test observation, a p-vector of observed features $x^* = (x^*_1 \ldots x^*_p)^T$.

Suppose that it is possible to construct a hyperplane that separates the training observations perfectly according to their class labels. Examples of three such separating hyperplanes are shown in the left-hand panel of Figure 9.2.

![Separating Hyperplanes](./figures/9.2.png)
>**Figure 9.2.** *Left*: There are two classes of observations, shown in blue and
in purple, each of which has measurements on two variables. Three separating
hyperplanes, out of many possible, are shown in black.  
*Right*: A separating hyperplane is shown in black. The blue and purple grid indicates the decision rule
made by a classifier based on this separating hyperplane: a test observation that
falls in the blue portion of the grid will be assigned to the blue class, and a test
observation that falls into the purple portion of the grid will be assigned to the
purple class.

We can label the observations from the blue class as $y_i = 1$ and those from the purple class as $y_i = −1$. Then a separating hyperplane has the property that

\begin{equation}\label{9.6}
    \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} > 0 \text{ if } y_i = 1,
    \tag{9.6}
\end{equation}

and

\begin{equation}\label{9.7}
    \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} < 0 \text{ if } y_i = -1,
    \tag{9.7}
\end{equation}

Equivalently, a separating hyperplane has the property that

\begin{equation}\label{9.8}
    y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}) > 0
    \tag{9.8}
\end{equation}

for all $i = 1, \ldots, n$.

If a separating hyperplane exists, we can use it to construct a very natural classifier: **a test observation is assigned a class depending on which side of the hyperplane it is located**. The right-hand panel of Figure 9.2 shows an example of such a classifier. That is, we classify the test observation $x^*$ **based on the sign** of $f(x^*) = \beta_0 + \beta_1 x^*_1 + \beta_2 x^*_2 + \ldots + \beta_p x^*_p$. If $f(x^∗)$ is *positive*, then we assign the test observation to class 1, and if $f(x^∗)$ is *negative*, then we assign it to class −1.

We can also make use of the magnitude of $f(x^*)$. If $f(x^*)$ is far from zero, then $x^*$ lies far from the hyperplane, and so we can be confident about our class assignment for $x^*$. Conversely, if $f(x^*)$ is close to zero, then $x^*$ lies near the hyperplane, and we are less certain about its class assignment. Not surprisingly, and as we see in Figure 9.2, a classifier that is based on a separating hyperplane leads to a linear decision boundary.
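As a rough sketch, property (\ref{9.8}) and the sign-based classification rule might be coded as follows. The data, the coefficients, and the helper names `is_separating` and `classify` are invented for illustration. Sending the tie $f(x^*) = 0$ to class 1 is an arbitrary choice that the text does not specify.

```python
def f(beta0, beta, x):
    """Evaluate f(x) = beta0 + beta_1 * x_1 + ... + beta_p * x_p."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def is_separating(beta0, beta, X, y):
    """Check property (9.8): y_i * f(x_i) > 0 for every training observation."""
    return all(yi * f(beta0, beta, xi) > 0 for xi, yi in zip(X, y))


def classify(beta0, beta, x_star):
    """Return (predicted class, |f(x*)|); a tie at f(x*) = 0 goes to class 1 here."""
    value = f(beta0, beta, x_star)
    return (1 if value >= 0 else -1), abs(value)


# Made-up training data: class 1 lies above the line X1 + X2 = 3.
X = [[3.0, 3.0], [4.0, 2.0], [1.0, 1.0], [0.0, 2.0]]
y = [1, 1, -1, -1]
beta0, beta = -3.0, [1.0, 1.0]

print(is_separating(beta0, beta, X, y))   # True
print(classify(beta0, beta, [5.0, 5.0]))  # (1, 7.0): far from the boundary
print(classify(beta0, beta, [1.4, 1.5]))  # class -1, but very close to the boundary
```

The second element of the returned pair is the magnitude of $f(x^*)$, which serves as an informal measure of confidence in the assignment.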

## The Maximal Margin Classifier
In general, if our data can be *perfectly separated* using a hyperplane, then there will in fact exist an infinite number of such hyperplanes. This is because a given separating hyperplane can usually be shifted a tiny bit up or down, or rotated, without coming into contact with any of the observations.

In order to construct a classifier based upon a separating hyperplane, we must have a reasonable way to decide which of the infinite possible separating hyperplanes to use. A natural choice is the *maximal margin hyperplane* (also known as the *optimal separating hyperplane*), which is the separating hyperplane that is farthest from the training observations. That is, we can compute the (perpendicular) distance from each training observation to a given separating hyperplane; the smallest such distance is the minimal distance from the observations to the hyperplane, and is known as the *margin*. **The maximal margin hyperplane is the separating hyperplane for which the margin is largest—that is, it is the hyperplane that has the farthest minimum distance to the training observations**.

We can then classify a test observation based on which side of the maximal margin hyperplane it lies. This is known as the maximal margin classifier. *We hope that a classifier that has a large margin on the training data will also have a large margin on the test data, and hence will classify the test observations correctly*. Although the maximal margin classifier is often successful, it can also lead to overfitting when $p$ is large.

If $\beta_0, \beta_1, \ldots , \beta_p$ are the coefficients of the maximal margin hyperplane, then the maximal margin classifier classifies the test observation $x^∗$ based on the sign of $f(x^∗) = \beta_0 + \beta_1 x^*_1 + \beta_2x^*_2 + \ldots + \beta_p x^*_p$.

Figure 9.3 shows the maximal margin hyperplane on the data set of Figure 9.2. Comparing the right-hand panel of Figure 9.2 to Figure 9.3, we see that the maximal margin hyperplane shown in Figure 9.3 does indeed result in a greater minimal distance between the observations and the separating hyperplane—that is, a larger margin. In a sense, the maximal margin hyperplane represents the mid-line of the widest “slab” that we can insert between the two classes.

![The maximal margin hyperplane](./figures/9.3.png)
>**Figure 9.3.** There are two classes of observations, shown in blue and in purple.
The maximal margin hyperplane is shown as a solid line. The margin is the
distance from the solid line to either of the dashed lines. The two blue points and
the purple point that lie on the dashed lines are the support vectors, and the distance
from those points to the hyperplane is indicated by arrows. The purple and
blue grid indicates the decision rule made by a classifier based on this separating
hyperplane.

Examining Figure 9.3, we see that three training observations are equidistant from the maximal margin hyperplane and lie along the dashed lines indicating the width of the margin. These three observations are known as **support vectors**, since they are vectors in p-dimensional space (in Figure 9.3, $p = 2$) and they “support” the maximal margin hyperplane in the sense that if these points were moved slightly then the maximal margin hyperplane would move as well. Interestingly, **the maximal margin hyperplane depends directly on the support vectors, but not on the other observations**: a movement to any of the other observations would not affect the separating hyperplane, provided that the observation’s movement does not cause it to cross the boundary set by the margin.

## Construction of the Maximal Margin Classifier
We now consider the task of constructing the maximal margin hyperplane based on a set of n training observations $x_1, \ldots, x_n \in \mathbb{R}^p$ and associated class labels $y_1, \ldots, y_n \in \{−1, 1\}$. Briefly, the maximal margin hyperplane is the solution to the optimization problem

\begin{align*}\label{9.9}
    \underset{\beta_0, \beta_1, \ldots, \beta_p, M}{\text{maximize}} \; M \tag{9.9} \\
    \text{subject to } \sum^p_{j=1} \beta^2_j &= 1, \tag{9.10} \\
    y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}) &\ge M \; \forall \; i=1, \ldots, n. \tag{9.11}
\end{align*}

This optimization problem (\ref{9.9})–(9.11) is actually simpler than it looks. First of all, the constraint in (9.11) guarantees that each observation will be on the correct side of the hyperplane, provided that $M$ is positive. (Actually, for each observation to be on the correct side of the hyperplane we would simply need $y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}) > 0$, **so the constraint in (9.11) in fact requires that each observation be on the correct side of the hyperplane, with some cushion, provided that $M$ is positive**.)

Second, note that (9.10) is not really a constraint on the hyperplane, since if $\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} = 0$ defines a hyperplane, then so does $k(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}) = 0$ for any $k \ne 0$. However, (9.10) adds meaning to (9.11); one can show that with this constraint the perpendicular distance from the $i$th observation to the hyperplane is given by

\begin{align*}
    y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}).
\end{align*}

**Therefore, the constraints (9.10) and (9.11) ensure that each observation is on the correct side of the hyperplane and at least a distance $M$ from the hyperplane.** Hence, $M$ represents the margin of our hyperplane, and the optimization problem chooses $\beta_0, \beta_1 , \ldots, \beta_p$ to maximize $M$.
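To make the role of (9.10) concrete, here is a small sketch that computes the margin of a given separating hyperplane. It rescales the coefficients so that $\sum_j \beta_j^2 = 1$ and then takes the smallest value of $y_i f(x_i)$. The data and the two candidate hyperplanes are made up. This code only evaluates candidates and does not solve the optimization problem (9.9)–(9.11).

```python
import math


def margin(beta0, beta, X, y):
    """Smallest value of y_i * f(x_i) / ||beta|| over the training set.

    Dividing by ||beta|| rescales the coefficients so that constraint (9.10)
    holds, which makes y_i * f(x_i) a perpendicular distance.
    """
    norm = math.sqrt(sum(b * b for b in beta))
    distances = [
        yi * (beta0 + sum(b * xj for b, xj in zip(beta, xi))) / norm
        for xi, yi in zip(X, y)
    ]
    return min(distances)


# Made-up separable data in two dimensions.
X = [[3.0, 3.0], [4.0, 2.0], [1.0, 1.0], [0.0, 2.0]]
y = [1, 1, -1, -1]

# Two candidate separating hyperplanes (hypothetical coefficients).
m_a = margin(-4.0, [1.0, 1.0], X, y)  # the line X1 + X2 = 4
m_b = margin(-3.0, [1.0, 1.0], X, y)  # the line X1 + X2 = 3

print(m_a, m_b)  # the first candidate has the larger margin
```

Multiplying all coefficients by a nonzero constant leaves the result unchanged, which is why (9.10) does not restrict the set of hyperplanes.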

## The Non-separable Case
The maximal margin classifier is a very natural way to perform classification, **if a separating hyperplane exists**. However, as we have hinted, in many cases no separating hyperplane exists, and so there is no maximal margin classifier. In this case, the optimization problem (\ref{9.9})–(9.11) has no solution with $M > 0$. An example is shown in Figure 9.4.

However, as we will see in the next section, we can extend the concept of a separating hyperplane in order to develop a hyperplane that almost separates the classes, using a so-called soft margin. The generalization of the maximal margin classifier to the non-separable case is known as the **support vector classifier**.

![Non-separable data](./figures/9.4.png)
>**Figure 9.4.** There are two classes of observations, shown in blue and in purple.
In this case, the two classes are not separable by a hyperplane, and so the
maximal margin classifier cannot be used.

---

# Support Vector Classifiers
## Overview of the Support Vector Classifier
In Figure 9.4, we see that observations that belong to two classes are not necessarily separable by a hyperplane. In fact, even if a separating hyperplane does exist, there are instances in which a classifier based on a separating hyperplane might not be desirable. **A classifier based on a separating hyperplane will necessarily perfectly classify all of the training observations; this can lead to sensitivity to individual observations**. An example is shown in Figure 9.5.

The addition of a single observation in the right-hand panel of Figure 9.5 leads to a dramatic change in the maximal margin hyperplane. The resulting maximal margin hyperplane is not satisfactory—for one thing, it has only a tiny margin. This is problematic because as discussed previously, the distance of an observation from the hyperplane can be seen as a measure of our confidence that the observation was correctly classified. Moreover, **the fact that the maximal margin hyperplane is extremely sensitive to a change in a single observation suggests that it may have overfit the training data**.

![One point, two hyperplanes](./figures/9.5.png)
>**Figure 9.5.** *Left*: Two classes of observations are shown in blue and in
purple, along with the maximal margin hyperplane.  
*Right*: An additional blue observation has been added, leading to a dramatic shift
in the maximal margin hyperplane shown as a solid line. The dashed line indicates
the maximal margin hyperplane that was obtained in the absence of this additional point.

In this case, we might be willing to consider a classifier based on a hyperplane that does not perfectly separate the two classes, in the interest of
- Greater robustness to individual observations, and
- Better classification of most of the training observations.

That is, *it could be worthwhile to misclassify a few training observations in order to do a better job in classifying the remaining observations*.

The **support vector classifier**, sometimes called a *soft margin classifier*, does exactly this. Rather than seeking the largest possible margin so that every observation is not only on the correct side of the hyperplane but also on the correct side of the margin, **we instead allow some observations to be on the incorrect side of the margin, or even the incorrect side of the hyperplane**.

An example is shown in the left-hand panel of Figure 9.6.

![Support vector classifier](./figures/9.6.png)
>**Figure 9.6.** *Left*: A support vector classifier was fit to a small data set. The
hyperplane is shown as a solid line and the margins are shown as dashed lines.
Purple observations: Observations 3, 4, 5, and 6 are on the correct side of the
margin, observation 2 is on the margin, and observation 1 is on the wrong side of
the margin. Blue observations: Observations 7 and 10 are on the correct side of
the margin, observation 9 is on the margin, and observation 8 is on the wrong side
of the margin. No observations are on the wrong side of the hyperplane.  
*Right*: Same as left panel with two additional points, 11 and 12. These two observations
are on the wrong side of the hyperplane and the wrong side of the margin.

## Details of the Support Vector Classifier
The support vector classifier classifies a test observation depending on which side of a hyperplane it lies. The hyperplane is chosen to correctly separate most of the training observations into the two classes, but may misclassify a few observations. It is the solution to the optimization problem

\begin{align*}\label{9.12}
    \underset{\beta_0, \beta_1, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n, M}{\text{maximize}} \; M \tag{9.12}\\
    \text{subject to } \sum^p_{j=1} \beta^2_j &= 1, \tag{9.13} \\
    y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}) &\ge M(1 - \epsilon_i) \tag{9.14} \\
    \epsilon_i \ge 0, \; \sum^n_{i=1} \epsilon_i &\le C \tag{9.15}
\end{align*}

where $C$ is a nonnegative tuning parameter. As in (9.11), $M$ is the width of the margin; we seek to make this quantity as large as possible. In (9.14), $\epsilon_1, \ldots , \epsilon_n$ are *slack variables that allow individual observations to be on the wrong side of the margin or the hyperplane*. Once we have solved (9.12)–(9.15), we classify a test observation $x^∗$ as before, by simply determining on which side of the hyperplane it lies. That is, we classify the test observation based on the sign of $f(x^*) = \beta_0 + \beta_1 x_1^* + \ldots + \beta_p x_p^*$.

The problem (9.12)–(9.15) seems complex, but insight into its behavior can be made through a series of simple observations presented below.

First of all, the slack variable $\epsilon_i$ tells us where the $i$th observation is located, relative to the hyperplane and relative to the margin. If $\epsilon_i = 0$ then the $i$th observation is on the correct side of the margin. If $\epsilon_i > 0$ then the $i$th observation is on the wrong side of the margin, and we say that the $i$th observation has violated the margin. If $\epsilon_i > 1$ then it is on the wrong side of the hyperplane.

We now consider the role of the tuning parameter $C$. We can think of $C$ as a budget for the amount that the margin can be violated by the $n$ observations. If $C = 0$ then there is no budget for violations to the margin, and it must be the case that $\epsilon_1 = \ldots = \epsilon_n = 0$, in which case the equations above simply amount to the maximal margin hyperplane optimization problem from earlier. For $C > 0$ **no more than $C$ observations can be on the wrong side of the hyperplane**, because if an observation is on the wrong side of the hyperplane then $\epsilon_i > 1$, and (9.15) requires that $\sum_{i=1}^n \epsilon_i \le C$. As the budget C increases, we become more tolerant of violations to the margin, and so the margin will widen.
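The interplay between the slack variables and the budget can be illustrated with a short sketch. It assumes a hyperplane and margin $M$ are already given (the values below are invented, not fitted). It then computes, for each observation, the smallest $\epsilon_i \ge 0$ that satisfies (9.14), namely $\max(0, 1 - y_i f(x_i)/M)$. The helper names `slack` and `describe` are hypothetical.

```python
def slack(beta0, beta, M, x, y):
    """Smallest epsilon >= 0 satisfying y * f(x) >= M * (1 - epsilon), as in (9.14)."""
    value = y * (beta0 + sum(b * xj for b, xj in zip(beta, x)))
    return max(0.0, 1.0 - value / M)


def describe(eps):
    """Translate a slack value into the position described in the text."""
    if eps == 0:
        return "correct side of the margin"
    if eps > 1:
        return "wrong side of the hyperplane"
    return "violates the margin"


# Made-up observations; the hyperplane X1 = 0 satisfies (9.13) since beta = (1, 0).
X = [[3.0, 0.0], [1.0, 1.0], [-1.0, 0.0], [-4.0, 1.0]]
y = [1, 1, 1, -1]
beta0, beta, M = 0.0, [1.0, 0.0], 2.0

eps = [slack(beta0, beta, M, xi, yi) for xi, yi in zip(X, y)]
for e in eps:
    print(e, describe(e))

total = sum(eps)
n_wrong_side = sum(1 for e in eps if e > 1)
print(total, n_wrong_side)  # total slack 2.0, one observation misclassified
```

With a budget of $C = 2$ these slacks are feasible under (9.15); with $C = 1$ they are not, so a different hyperplane or a narrower margin would be needed.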

In practice, $C$ is treated as a tuning parameter that is generally *chosen via cross-validation*. As with the tuning parameters that we have seen throughout this book, $C$ *controls the bias-variance trade-off of the statistical learning technique. When $C$ is larger, the margin is wider and we allow more violations to it; this amounts to fitting the data less hard and obtaining a classifier that is potentially more biased but may have lower variance*.

The optimization problem (9.12)–(9.15) has a very interesting property: it turns out that only observations that either lie on the margin or that violate the margin will affect the hyperplane, and hence the classifier obtained. In other words, an observation that lies strictly on the correct side of the margin *does not* affect the support vector classifier! Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as **support vectors**. **Only these observations do affect the support vector classifier.**

The fact that only support vectors affect the classifier is in line with our previous assertion that $C$ controls the bias-variance trade-off of the support vector classifier. When the tuning parameter $C$ is large, then the margin is wide, many observations violate the margin, and so there are many support vectors. In this case, many observations are involved in determining the hyperplane. The top left panel in Figure 9.7 illustrates this setting: this classifier has low variance but potentially high bias.

![Support vector classifiers](./figures/9.7.png)
>**Figure 9.7.** A support vector classifier was fit using four different values of the
tuning parameter $C$ in (9.12)–(9.15). The largest value of $C$ was used in the top
left panel, and smaller values were used in the top right, bottom left, and bottom
right panels. When $C$ is large, then there is a high tolerance for observations being
on the wrong side of the margin, and so the margin will be large. As $C$ decreases,
the tolerance for observations being on the wrong side of the margin decreases,
and the margin narrows.

The fact that the support vector classifier’s decision rule is based only on a potentially small subset of the training observations (the support vectors) means that it is quite robust to the behavior of observations that are far away from the hyperplane. This property is distinct from some of the other classification methods that we have seen in preceding chapters, such as linear discriminant analysis.  Recall that the LDA classification rule depends on the mean of all of the observations within each class, as well as the within-class covariance matrix computed using all of the observations. In contrast, logistic regression, unlike LDA, has very low sensitivity to observations far from the decision boundary.

---

# Support Vector Machines

We first discuss a general mechanism for converting a linear classifier into one that produces non-linear decision boundaries. We then introduce the support vector machine, which does this in an automatic way.

## Classification with Non-linear Decision Boundaries
In practice we are sometimes faced with non-linear class boundaries. For instance, consider the data in the left-hand panel of Figure 9.8. It is clear that a support vector classifier or any linear classifier will perform poorly here. Indeed, the support vector classifier shown in the right-hand panel of Figure 9.8 is useless here.

![Non-linear boundaries](./figures/9.8.png)
>**Figure 9.8.** *Left*: The observations fall into two classes,
with a non-linear boundary between them.  
*Right*: The support vector classifier seeks a linear boundary,
and consequently performs very poorly.

In Chapter 7, we are faced with an analogous situation. In that case, we consider enlarging the feature space using functions of the predictors, such as quadratic and cubic terms, in order to address this non-linearity. In the case of the support vector classifier, **we could address the problem of possibly non-linear boundaries between classes in a similar way, by enlarging the feature space using quadratic, cubic, and even higher-order polynomial functions of the predictors**. For instance, rather than fitting a support vector classifier using p features

\begin{align*}
    X_1, X_2, \ldots, X_p
\end{align*}

we could instead fit a support vector classifier using $2p$ features

\begin{align*}
    X_1, X_1^2, X_2, X_2^2, \ldots, X_p, X_p^2
\end{align*}

Then (\ref{9.12})-(9.15) would become

\begin{align*}\label{9.16}
    \underset{\beta_0, \beta_{11}, \beta_{12}, \ldots, \beta_{p1}, \beta_{p2}, \epsilon_1, \ldots, \epsilon_n, M}{\text{maximize}} \; M \tag{9.16} \\
    \text{subject to } y_i \left( \beta_0 + \sum^p_{j=1} \beta_{j1}x_{ij} + \sum^p_{j=1} \beta_{j2}x^2_{ij} \right) &\ge M(1- \epsilon_i), \\
    \sum^n_{i=1} \epsilon_i \le C, \; \epsilon_i \ge 0, \; \sum^p_{j=1} \sum^2_{k=1} \beta^2_{jk} &= 1.
\end{align*}

Why does this lead to a non-linear decision boundary? In the enlarged feature space, the decision boundary that results from (9.16) is in fact linear. But in the original feature space, the decision boundary is of the form $q(x) = 0$, where $q$ is a quadratic polynomial, and its solutions are generally non-linear. One might additionally want to enlarge the feature space with higher-order polynomial terms, or with interaction terms of the form $X_j X_{j^\prime}$ for $j \ne j^\prime$.
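A toy example with a single predictor may help. The data below are made up: class 1 sits at both ends of the line and class −1 in the middle, so no single cutoff on $X$ separates them. After adding $X^2$ as a second feature, the hyperplane $-1 + 0 \cdot X + 1 \cdot X^2 = 0$ separates the classes. In the original space this corresponds to the non-linear boundary $x^2 = 1$.

```python
def f(beta0, beta, features):
    """Linear function beta0 + sum_j beta_j * feature_j."""
    return beta0 + sum(b * v for b, v in zip(beta, features))


def is_separating(beta0, beta, rows, y):
    """True when y_i * f(row_i) > 0 for every observation."""
    return all(yi * f(beta0, beta, r) > 0 for r, yi in zip(rows, y))


# Made-up one-dimensional data.
x = [-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0]
y = [1, 1, -1, -1, -1, 1, 1]

# Original feature space: try every cutoff between neighbouring points,
# with both orientations of the boundary s * (x - c) = 0.
xs = sorted(x)
cuts = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
linear_ok = any(
    is_separating(-s * c, [s], [[xi] for xi in x], y)
    for c in cuts
    for s in (1.0, -1.0)
)

# Enlarged feature space (X, X^2): a linear boundary now works.
rows = [[xi, xi ** 2] for xi in x]
quadratic_ok = is_separating(-1.0, [0.0, 1.0], rows, y)

print(linear_ok, quadratic_ok)  # False True
```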

Alternatively, other functions of the predictors could be considered rather than polynomials. It is not hard to see that there are many possible ways to enlarge the feature space, and that unless we are careful, we could end up with a huge number of features. Then computations would become unmanageable. The support vector machine, which we present next, allows us to enlarge the feature space used by the support vector classifier in a way that leads to efficient computations.

## The Support Vector Machine
The *support vector machine* (SVM) is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using **kernels**. We will now discuss this extension, the details of which are beyond the scope of this book. However, the main idea is described in [Section 9.3.1](#Classification-with-Non-linear-Decision-Boundaries): we may want to enlarge our feature space in order to accommodate a non-linear boundary between the classes. The kernel approach that we describe here is simply an efficient computational approach for enacting this idea.

It turns out that the solution to the support vector classifier problem (\ref{9.12})–(9.15) involves only the inner products of the observations (as opposed to the observations themselves). The inner product of two $r$-vectors $a$ and $b$ is defined as $\langle a, b\rangle = \sum^r_{i=1} a_i b_i$. Thus the inner product of two observations $x_i, x_{i^\prime}$ is given by

\begin{equation}\label{9.17}
    \langle x_i, x_{i^\prime} \rangle = \sum^p_{j=1} x_{ij}x_{i'j}
    \tag{9.17}
\end{equation}

It can be shown that
- The linear support vector classifier can be represented as
\begin{equation}\label{9.18}
    f(x) = \beta_0 + \sum^n_{i=1} \alpha_i \langle x, x_i \rangle,
    \tag{9.18}
\end{equation}
where there are $n$ parameters $\alpha_i$, $i=1, \ldots, n,$ one per training observation.
-  To estimate the parameters $\alpha_1, \ldots, \alpha_n$ and $\beta_0$, all we need are the $n \choose 2$ inner products $\langle x_i, x_{i'} \rangle$ between all pairs of training observations.

Notice that in (\ref{9.18}), in order to evaluate the function $f(x)$, we need to compute the inner product between the new point $x$ and each of the training points $x_i$. However, **it turns out that $\alpha_i$ is nonzero only for the support vectors in the solution—that is, if a training observation is not a support vector, then its $\alpha_i$ equals zero**. So if $S$ is the collection of indices of these support points, we can rewrite any solution function of the form (\ref{9.18}) as

\begin{equation}\label{9.19}
    f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle,
    \tag{9.19}
\end{equation}

which typically involves far fewer terms than in (\ref{9.18}).

To summarize, **in representing the linear classifier $f(x)$, and in computing its coefficients, all we need are inner products**.
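The equivalence between the inner-product form and the familiar linear form can be checked numerically. In the sketch below the $\alpha_i$ and $\beta_0$ are invented values rather than a fitted solution. The identity $\beta_j = \sum_i \alpha_i x_{ij}$ follows from the linearity of the inner product, since $\sum_i \alpha_i \langle x, x_i \rangle = \langle x, \sum_i \alpha_i x_i \rangle$.

```python
def inner(a, b):
    """Inner product <a, b> of two equal-length vectors, as in (9.17)."""
    return sum(ai * bi for ai, bi in zip(a, b))


def f_inner(beta0, alpha, X, x):
    """Evaluate (9.18): f(x) = beta0 + sum_i alpha_i * <x, x_i>."""
    return beta0 + sum(a * inner(x, xi) for a, xi in zip(alpha, X))


def f_support(beta0, alpha, X, x):
    """Evaluate (9.19): the same sum restricted to indices with alpha_i != 0."""
    S = [i for i, a in enumerate(alpha) if a != 0]
    return beta0 + sum(alpha[i] * inner(x, X[i]) for i in S)


# Made-up training observations and coefficients (not a fitted solution).
X = [[3.0, 3.0], [4.0, 2.0], [1.0, 1.0], [0.0, 2.0]]
alpha = [0.5, 0.0, -0.5, 0.0]
beta0 = -4.0

# Coefficients of the equivalent linear form: beta_j = sum_i alpha_i * x_ij.
beta = [sum(a * xi[j] for a, xi in zip(alpha, X)) for j in range(2)]

x_new = [2.0, 5.0]
print(beta)                               # [1.0, 1.0]
print(f_inner(beta0, alpha, X, x_new))    # 3.0
print(f_support(beta0, alpha, X, x_new))  # 3.0, using only two terms
print(beta0 + inner(beta, x_new))         # 3.0
```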

Now suppose that every time the inner product (\ref{9.17}) appears in the representation (\ref{9.18}), or in a calculation of the solution for the support vector classifier, we replace it with a generalization of the inner product of the form

\begin{equation}\label{9.20}
    K(x_i, x_{i'}) \tag{9.20}
\end{equation}

where $K$ is  some function that we will refer to as a *kernel*. **A kernel is a function that quantifies the *similarity* of two observations**. For instance, we could simply take the kernel to be the inner product

\begin{equation}\label{9.21}
    K(x_i, x_{i'}) = \sum^p_{j=1} x_{ij}x_{i'j}
    \tag{9.21}
\end{equation}

which would just give us back the support vector classifier. Equation \ref{9.21} is known as a linear kernel because the support vector classifier is linear in the features; the linear kernel essentially quantifies the similarity of a pair of observations using Pearson (standard) correlation. But one could instead choose another form for (\ref{9.20}). For instance, one could replace every instance of $\sum^p_{j=1} x_{ij} x_{i'j}$ with the  quantity

\begin{equation}\label{9.22}
    K(x_i, x_{i'}) = (1 + \sum^p_{j=1} x_{ij} x_{i'j})^d
    \tag{9.22}
\end{equation}

This is known as a *polynomial kernel* of degree $d$, where $d$ is a positive integer. **Using such a kernel with $d > 1$, instead of the standard linear kernel (\ref{9.21}), in the support vector classifier algorithm leads to a much more flexible decision boundary**. It essentially amounts to fitting a support vector classifier in a higher-dimensional space involving polynomials of degree $d$, rather than in the original feature space.
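For a small case the higher-dimensional space can be written out explicitly. The sketch below takes $p = 2$ and $d = 2$ and uses one standard choice of feature map, $\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1 x_2)$. This map is an illustration and is not given in the text. Expanding $(1 + x_1 z_1 + x_2 z_2)^2$ shows that it equals $\langle \phi(x), \phi(z) \rangle$, and the code checks this numerically for one arbitrary pair of points.

```python
import math


def poly_kernel(x, z, d):
    """Polynomial kernel (9.22): (1 + <x, z>)^d."""
    return (1.0 + sum(xj * zj for xj, zj in zip(x, z))) ** d


def phi(x):
    """Explicit feature map for p = 2, d = 2 (one standard choice)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]


x, z = [1.0, 2.0], [3.0, -1.0]

k_direct = poly_kernel(x, z, 2)                        # works in 2 dimensions
k_mapped = sum(a * b for a, b in zip(phi(x), phi(z)))  # works in 6 dimensions

print(k_direct, k_mapped)  # equal up to floating-point rounding
```

The kernel reaches the same value while touching only the two original coordinates, which is the computational saving the kernel approach relies on.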

When the support vector classifier is combined with a non-linear kernel such as (\ref{9.22}), the resulting classifier is known as a **support vector machine**. Note that in this case the (non-linear) function has the form

\begin{equation}\label{9.23}
    f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i)
    \tag{9.23}
\end{equation}

The left-hand panel of Figure 9.9 shows an example of an SVM with a polynomial kernel applied to the non-linear data from Figure 9.8. When $d = 1$, then the SVM reduces to the support vector classifier seen earlier in this chapter.

The polynomial kernel shown in (\ref{9.22}) is one example of a possible non-linear kernel, **but alternatives abound**. Another popular choice is the radial kernel, which takes the form

\begin{equation}\label{9.24}
    K(x_i, x_{i'}) = \exp \left(-\gamma \sum^p_{j=1} (x_{ij} -x_{i'j})^2 \right)
    \tag{9.24}
\end{equation}

In (\ref{9.24}), $\gamma$ is a positive constant. The right-hand panel of Figure 9.9 shows an example of an SVM with a radial kernel on this non-linear data; it also does a good job in separating the two classes.

![Support Vector Machines](./figures/9.9.png)
>**Figure 9.9.** *Left*: An SVM with a polynomial kernel of degree 3 is applied to
the non-linear data from Figure 9.8, resulting in a far more appropriate decision
rule.  
*Right*: An SVM with a radial kernel is applied. In this example, either kernel
is capable of capturing the decision boundary.

How does the radial kernel (\ref{9.24}) actually work? If a given test observation $x^* = (x^*_1, \ldots, x^*_p)^T$ is far from a training observation $x_i$ in terms of Euclidean distance, then $\sum^p_{j=1} (x^*_j - x_{ij})^2$ will be large, and so $K(x^*, x_i) = \exp(-\gamma \sum^p_{j=1} (x^*_j - x_{ij})^2)$ will be tiny.

This means that in (\ref{9.23}), $x_i$ will play virtually no role in $f(x^*)$. Recall that the predicted class label for the test observation $x^*$ is based on the sign of $f(x^*)$. In other words, training observations that are far from $x^*$ will play essentially no role in the predicted class label for $x^*$. **This means that the radial kernel has very *local* behavior**, in the sense that only nearby training observations have an effect on the class label of a test observation.
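The local behavior is easy to see numerically. The sketch below evaluates (\ref{9.24}) for one nearby and one distant point and then plugs both into (\ref{9.23}). The support vectors, the $\alpha_i$, $\beta_0$, and $\gamma = 1$ are all made-up values chosen for illustration and do not come from a fitted model.

```python
import math


def radial_kernel(x, z, gamma):
    """Radial kernel (9.24): exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((xj - zj) ** 2 for xj, zj in zip(x, z)))


def f_svm(beta0, alpha, support, x, gamma):
    """Evaluate (9.23) with the radial kernel over the support vectors."""
    return beta0 + sum(
        a * radial_kernel(x, xi, gamma) for a, xi in zip(alpha, support)
    )


# Made-up support vectors and coefficients (not a fitted model).
support = [[0.0, 0.0], [10.0, 10.0]]
alpha = [1.0, -1.0]
beta0, gamma = 0.0, 1.0

x_star = [0.5, 0.0]
near = radial_kernel(x_star, support[0], gamma)  # exp(-0.25)
far = radial_kernel(x_star, support[1], gamma)   # exp(-190.25), effectively zero

print(near, far)
print(f_svm(beta0, alpha, support, x_star, gamma))  # dominated by the nearby point
```

A larger $\gamma$ makes the kernel decay faster with distance, so the fit becomes more local still.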

What is the advantage of using a kernel rather than simply enlarging the feature space using functions of the original features, as in (\ref{9.16})? One advantage is computational, and it amounts to the fact that using kernels, one need only compute $K(x_i, x_{i'})$ for all $n \choose 2$ distinct pairs $i, i'$.

This can be done without explicitly working in the enlarged feature space. This is important because in many applications of SVMs, the enlarged feature space is so large that computations are intractable. For some kernels, such as the radial kernel (\ref{9.24}), the feature space is implicit and infinite-dimensional, so we could never do the computations there anyway!

## An Application to the Heart Disease Data
In Chapter 8 we apply decision trees and related methods to the Heart data. The aim is to use 13 predictors such as `Age`, `Sex`, and `Chol` in order to predict whether an individual has heart disease. We now investigate how an SVM compares to LDA on this data. After removing 6 missing observations, the data consist of 297 subjects, which we randomly split into 207 training and 90 test observations.

We first fit LDA and the support vector classifier to the training data. Note that the support vector classifier is equivalent to an SVM using a polynomial kernel of degree $d = 1$. The left-hand panel of Figure 9.10 displays ROC curves for the training set predictions for both LDA and the support vector classifier. Both classifiers compute scores of the form $\hat{f}(X) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \ldots + \hat{\beta}_p X_p$ for each observation.

For any given cutoff $t$, we classify observations into the heart disease or no heart disease categories depending on whether $\hat{f}(X) < t$ or $\hat{f}(X) \ge t$. The ROC curve is obtained by forming these predictions and computing the false positive and true positive rates for a range of values of $t$. In this instance LDA and the support vector classifier both perform well, though there is a suggestion that the support vector classifier may be slightly superior.
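The construction of an ROC curve from scores can be sketched as follows. This is an illustration under assumptions, not the code used to produce the figures: the scores and labels are made up, and we assume that a score at or above the cutoff is classified as the positive class (negate the scores if the opposite convention applies).

```python
def roc_points(scores, labels, thresholds):
    """Return a (false positive rate, true positive rate) pair per cutoff.

    labels use 1 for the positive class and 0 for the negative class.
    Assumption: an observation is predicted positive when score >= t.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical scores and true labels, for illustration only.
scores = [-1.2, -0.4, 0.3, 0.9, 1.5]
labels = [0, 0, 1, 0, 1]
thresholds = [2.0, 1.0, 0.0, -2.0]

points = roc_points(scores, labels, thresholds)
for t, (fpr, tpr) in zip(thresholds, points):
    print(f"t = {t:5.1f}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

Lowering the cutoff moves the curve from the point $(0, 0)$, where nothing is classified as positive, to $(1, 1)$, where everything is.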

The right-hand panel of Figure 9.10 displays ROC curves for SVMs using a radial kernel, with various values of $\gamma$. As $\gamma$ increases and the fit becomes more non-linear, the ROC curves improve. Using $\gamma = 10^{−1}$ appears to give an almost perfect ROC curve. However, **these curves represent training error rates, which can be misleading in terms of performance on new test data**.

![](./figures/9.10.png)
>**Figure 9.10.** ROC curves for the Heart data **training set**.  
*Left*: The support vector classifier and LDA are compared.  
*Right*: The support vector classifier is
compared to an SVM using a radial basis kernel with $\gamma = 10^{−3} , 10^{−2}$, and $10^{−1}$.

Figure 9.11 displays ROC curves computed on the 90 test observations. We observe some differences from the training ROC curves. In the left-hand panel of Figure 9.11, the support vector classifier appears to have a small advantage over LDA (although these differences are not statistically significant). In the right-hand panel, the SVM using $\gamma = 10^{−1}$, which showed the best results on the training data, produces the worst estimates
on the test data.

![](./figures/9.11.png)
>**Figure 9.11.** ROC curves for the **test set** of the Heart data.  
*Left*: The support vector classifier and LDA are compared.  
*Right*: The support vector classifier is compared to an SVM
using a radial basis kernel with $\gamma = 10^{−3} , 10^{−2}$, and $10^{−1}$.

This is once again evidence that while a more flexible method will often produce lower training error rates, this does not necessarily lead to improved performance on test data. The SVMs with $\gamma = 10^{−2}$ and $\gamma = 10^{−3}$ perform comparably to the support vector classifier, and all three outperform the SVM with $\gamma = 10^{−1}$.

---

# SVMs with More than Two Classes
So far, our discussion has been limited to the case of binary classification: that is, classification in the two-class setting. How can we extend SVMs to the more general case where we have some arbitrary number of classes?

It turns out that the concept of separating hyperplanes upon which SVMs are based does not lend itself naturally to more than two classes. Though a number of proposals for extending SVMs to the $K$-class case have been made, the two most popular are the **one-versus-one** and **one-versus-all** approaches. We briefly discuss those two approaches here.

## One-Versus-One Classification
Suppose that we would like to perform classification using SVMs, and there are $K > 2$ classes. A one-versus-one or **all-pairs** approach constructs $K \choose 2$ SVMs, each of which compares a pair of classes. For example, one such SVM might compare the $k$th class, coded as $+1$, to the $k'$th class, coded as $−1$. **We classify a test observation using each of the $K \choose 2$ classifiers, and we tally the number of times that the test observation is assigned to each of the $K$ classes**. The final classification is performed by assigning the test
observation to the class to which it was most frequently assigned in these $K \choose 2$ pairwise classifications.
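The voting step can be sketched as below. The pairwise classifiers here are hypothetical one-dimensional score functions standing in for fitted SVMs; a positive score is taken as a vote for the first class of the pair. Ties are not handled specially in this sketch.

```python
from collections import Counter
from itertools import combinations

def one_versus_one_predict(x_star, classes, pairwise_classifiers):
    """Tally the votes of all K-choose-2 pairwise classifiers.

    pairwise_classifiers maps a pair (k, k_prime) to a function f, where
    f(x) > 0 votes for k (coded +1) and otherwise for k_prime (coded -1).
    """
    votes = Counter()
    for k, k_prime in combinations(classes, 2):
        f = pairwise_classifiers[(k, k_prime)]
        winner = k if f(x_star) > 0 else k_prime
        votes[winner] += 1
    predicted = votes.most_common(1)[0][0]
    return predicted, votes

# Hypothetical stand-ins for three fitted pairwise SVMs on one feature.
classes = ["a", "b", "c"]
pairwise_classifiers = {
    ("a", "b"): lambda x: 1.0 - x,
    ("a", "c"): lambda x: 3.0 - x,
    ("b", "c"): lambda x: 2.5 - x,
}

predicted, votes = one_versus_one_predict(2.0, classes, pairwise_classifiers)
print(predicted, dict(votes))
```

With $K = 3$ there are three pairwise classifiers; the test observation receives two votes for class `b` and one for class `a`, so it is assigned to `b`.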

## One-Versus-All Classification
The one-versus-all approach is an alternative procedure for applying SVMs in the case of $K > 2$ classes. We fit $K$ SVMs, each time comparing one of the $K$ classes to the remaining $K − 1$ classes. Let $\beta_{0k}, \beta_{1k}, \ldots, \beta_{pk}$ denote the parameters that result from fitting an SVM comparing the $k$th class (coded as $+1$) to the others (coded as $−1$). Let $x^∗$ denote a test observation. **We assign the observation to the class for which $\beta_{0k} + \beta_{1k} x^∗_1 + \beta_{2k} x^∗_2 + \ldots + \beta_{pk} x^∗_p$ is largest**, as this amounts to a high level of confidence that the test observation belongs to the $k$th class rather than to any of the other classes.
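A minimal sketch of this assignment rule, assuming the $K$ linear classifiers have already been fit. The coefficient values and class names are hypothetical.

```python
def one_versus_all_predict(x_star, coefficients):
    """Assign x_star to the class with the largest linear score.

    coefficients maps each class k to (beta_0k, [beta_1k, ..., beta_pk]).
    """
    scores = {
        k: beta_0 + sum(b * x for b, x in zip(betas, x_star))
        for k, (beta_0, betas) in coefficients.items()
    }
    predicted = max(scores, key=scores.get)
    return predicted, scores

# Hypothetical fitted parameters for K = 3 classes and p = 2 features.
coefficients = {
    "a": (0.5, [1.0, -1.0]),
    "b": (-1.0, [0.5, 1.0]),
    "c": (0.0, [-1.0, 0.5]),
}

predicted, scores = one_versus_all_predict([1.0, 2.0], coefficients)
print(predicted, scores)
```

Here the scores are $-0.5$, $1.5$, and $0.0$ for classes `a`, `b`, and `c`, so the observation is assigned to `b`.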

---

# Relationship to Logistic Regression
It turns out that one can rewrite the criterion (\ref{9.12})–(9.15) for fitting the support vector classifier $f(X) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$ as

\begin{equation}\label{9.25}
    {\text{minimize } \atop \beta_0, \beta_1, \ldots, \beta_p} \left\{ \sum^n_{i=1} \max[0, 1-y_i f(x_i)] + \lambda \sum^p_{j=1} \beta_j^2 \right\}
    \tag{9.25}
\end{equation}

where $\lambda$ is a nonnegative tuning parameter. When $\lambda$ is large then $\beta_1, \ldots, \beta_p$ are small, more violations to the margin are tolerated, and a low-variance but high-bias classifier will result. When $\lambda$ is small then few violations to the margin will occur; this amounts to a high-variance but low-bias
classifier. Thus, a small value of $\lambda$ in (\ref{9.25}) amounts to a small value of $C$ in (9.15). Note that the $\lambda \sum^p_{j=1} \beta^2_j$ term in (\ref{9.25}) is the ridge penalty term from Section 6.2.1 (Ridge Regression), and plays a similar role in controlling the bias-variance trade-off for the support vector classifier.
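The criterion (9.25) can be evaluated directly for any candidate coefficients. The sketch below does so on a tiny made-up data set; it only computes the objective value and does not perform the minimization.

```python
def svc_objective(beta_0, betas, X, y, lam):
    """Value of the hinge-loss-plus-ridge-penalty criterion.

    X is a list of feature vectors and y a list of labels in {-1, +1}.
    The intercept beta_0 is not penalized.
    """
    hinge = 0.0
    for x_i, y_i in zip(X, y):
        f_i = beta_0 + sum(b * x for b, x in zip(betas, x_i))
        hinge += max(0.0, 1.0 - y_i * f_i)
    penalty = lam * sum(b ** 2 for b in betas)
    return hinge + penalty

# Hypothetical data and coefficients, for illustration only.
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
y = [1, 1, -1]

value = svc_objective(beta_0=0.0, betas=[1.0, 0.0], X=X, y=y, lam=0.5)
print(value)
```

Only the second observation, with $y_i f(x_i) = 0.5 < 1$, contributes to the loss term; the other two satisfy $y_i f(x_i) \ge 1$ and contribute nothing.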

Now (\ref{9.25}) takes the “Loss + Penalty” form that we have seen repeatedly throughout this book:

\begin{equation}\label{9.26}
    {\text{minimize } \atop \beta_0, \beta_1, \ldots, \beta_p} \left\{ L(\boldsymbol{X},\boldsymbol{y},\beta) + \lambda P(\beta) \right\}
    \tag{9.26}
\end{equation}

In (\ref{9.26}), $L(X, y, \beta)$ is some loss function quantifying the extent to which the model, parametrized by $\beta$, fits the data $(X, y)$, and $P(\beta)$ is a penalty function on the parameter vector $\beta$ whose effect is controlled by a nonnegative tuning parameter $\lambda$. For instance, ridge regression and the lasso both
take this form with

\begin{align*}
    L(\boldsymbol{X},\boldsymbol{y},\beta) = \sum^n_{i=1} \left( y_i - \beta_0 - \sum^p_{j=1} x_{ij} \beta_j \right)^2
\end{align*}

and with $P(\beta) = \sum^p_{j=1} \beta_j^2$ for ridge regression and $P(\beta) = \sum^p_{j=1} |\beta_j|$ for the lasso. In the case of (\ref{9.25}) the loss function instead takes the form

\begin{align*}
    L(\boldsymbol{X},\boldsymbol{y},\beta) = \sum^n_{i=1} \max [ 0, 1-y_i(\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}) ]
\end{align*}

This is known as *hinge loss*, and is depicted in Figure 9.12. However, it turns out that the hinge loss function is closely related to the loss function used in logistic regression, also shown in Figure 9.12.

![SVM Loss and Logistic Regression Loss](./figures/9.12.png)
>**Figure 9.12.** The SVM and logistic regression loss functions are compared,
as a function of $y_i(\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip})$.
When $y_i(\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip})$ is greater than 1,
then the SVM loss is zero, since this corresponds to an observation
that is on the correct side of the margin. Overall, the two loss functions
have quite similar behavior.

An interesting characteristic of the support vector classifier is that only support vectors play a role in the classifier obtained; observations on the correct side of the margin do not affect it. **This is due to the fact that the loss function shown in Figure 9.12 is exactly zero for observations for which $y_i(\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} ) \ge 1$; these correspond to observations that are on the correct side of the margin**.

In contrast, the loss function for logistic regression shown in Figure 9.12 is not exactly zero anywhere. But it is very small for observations that are far from the decision boundary. Due to the similarities between their loss functions, logistic regression and the support vector classifier often give very similar results. **When the classes are well separated, SVMs tend to behave better than logistic regression; in more overlapping regimes, logistic regression is often preferred**.
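The two loss functions can be compared numerically as a function of the margin quantity $m = y_i f(x_i)$. In this sketch we assume labels coded as $\pm 1$ and write the logistic regression loss as the negative log-likelihood $\log(1 + e^{-m})$; the grid of $m$ values is arbitrary.

```python
import math

def hinge_loss(m):
    """SVM (hinge) loss as a function of m = y * f(x)."""
    return max(0.0, 1.0 - m)

def logistic_loss(m):
    """Logistic regression loss (negative log-likelihood) for y in {-1, +1}."""
    return math.log1p(math.exp(-m))

for m in [-2.0, 0.0, 1.0, 3.0, 6.0]:
    print(f"m = {m:4.1f}: hinge = {hinge_loss(m):.4f}, "
          f"logistic = {logistic_loss(m):.4f}")
```

The hinge loss is exactly zero once $m \ge 1$, whereas the logistic loss keeps decreasing toward zero without ever reaching it.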

When the support vector classifier and SVM were first introduced, it was thought that the tuning parameter $C$ in (9.15) was an unimportant “nuisance” parameter that could be set to some default value, like 1. However, the “Loss + Penalty” formulation (\ref{9.25}) for the support vector classifier indicates that this is not the case. The choice of tuning parameter is very important and determines the extent to which the model underfits or overfits the data, as illustrated, for example, in Figure 9.7.

We have established that the support vector classifier is closely related to logistic regression and other preexisting statistical methods. Is the SVM unique in its use of kernels to enlarge the feature space to accommodate non-linear class boundaries? The answer to this question is “no”. We could just as well perform logistic regression or many of the other classification methods seen in this book using non-linear kernels; this is closely related to some of the non-linear approaches seen in Chapter 7. However, for historical reasons, the use of non-linear kernels is much more widespread in the context of SVMs than in the context of logistic regression or other methods.

## Support Vector Regression
Though we have not addressed it here, there is in fact an extension of the SVM for regression (i.e. for a quantitative rather than a qualitative response), called **support vector regression**.

In Chapter 3, we saw that least squares regression seeks coefficients $\beta_0 , \beta_1, \ldots, \beta_p$ such that the sum of squared residuals is as small as possible. (Recall from Chapter 3 that residuals are defined as $y_i − \beta_0 − \beta_1 x_{i1} − \ldots − \beta_p x_{ip}$.)

**Support vector regression instead seeks coefficients that minimize a different type of loss, where only residuals larger in absolute value than some positive constant contribute to the loss function**. This is an extension of the margin used in support vector classifiers to the regression setting.
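One common way to write such a loss is the $\epsilon$-insensitive form $\max(0, |r_i| − \epsilon)$, where $r_i$ is the $i$th residual. The text above does not specify the exact form, so the sketch below is an assumption for illustration, with made-up residuals and a hypothetical threshold $\epsilon = 0.1$.

```python
def epsilon_insensitive_loss(residuals, epsilon):
    """Sum of max(0, |r| - epsilon): residuals within epsilon cost nothing."""
    return sum(max(0.0, abs(r) - epsilon) for r in residuals)

def squared_error_loss(residuals):
    """Least squares loss, where every nonzero residual contributes."""
    return sum(r ** 2 for r in residuals)

# Hypothetical residuals y_i - f(x_i), for illustration only.
residuals = [0.05, -0.3, 1.2, -0.08]

svr_loss = epsilon_insensitive_loss(residuals, epsilon=0.1)
ls_loss = squared_error_loss(residuals)
print(svr_loss, ls_loss)
```

The residuals $0.05$ and $-0.08$ fall inside the threshold and contribute nothing to the first loss, while all four residuals contribute to the sum of squared residuals.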

---
# End Chapter