## $\S$ 4.5.2. Optimal Separating Hyperplanes

The *optimal separating hyperplane* separates the two classes and maximizes the distance to the closest point from either class (Vapnik, 1996). Not only does this provide a unique solution to the separating hyperplane problem, but by maximizing the margin between the two classes on the training data, it also leads to better classification performance on test data.

### Formulation

We need to generalize the perceptron criterion

\begin{equation}
D(\beta,\beta_0) = -\sum_{i\in\mathcal{M}} y_i(x_i^T\beta + \beta_0).
\end{equation}

Consider the optimization problem

\begin{equation}
\max_{\beta,\beta_0,\|\beta\|=1} M \\
\text{subject to } y_i(x_i^T\beta + \beta_0) \ge M, \text{ for } i = 1,\cdots,N.
\end{equation}

The set of conditions ensures that all the points are at least a signed distance $M$ from the decision boundary defined by $\beta$ and $\beta_0$, and we seek the largest such $M$ and associated parameters.

We can get rid of the $\|\beta\| = 1$ constraint by replacing the conditions with

\begin{equation}
\frac{1}{\|\beta\|} y_i(x_i^T\beta + \beta_0) \ge M, \\
\text{or equivalently} \\
y_i(x_i^T\beta + \beta_0) \ge M\|\beta\|,
\end{equation}

which redefines $\beta_0$.

Since for any $\beta$ and $\beta_0$ satisfying these inequalities, any positively scaled multiple satisfies them too, we can arbitrarily set

\begin{equation}
\|\beta\| = \frac{1}{M},
\end{equation}

which leads to the equivalent formulation

\begin{equation}
\min_{\beta,\beta_0} \frac{1}{2}\|\beta\|^2 \\
\text{subject to } y_i(x_i^T\beta + \beta_0) \ge 1, \text{ for }i=1,\cdots,N.
\end{equation}

In light of $(4.40)$, the constraints define an empty slab or margin around the linear decision boundary of thickness $1/\|\beta\|$. Hence we choose $\beta$ and $\beta_0$ to maximize its thickness.
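The rescaling argument can be checked numerically. The sketch below uses a hypothetical four-point toy data set (not the data of FIGURE 4.16) and the helper names `margin` and `rescale`, which are introduced here for illustration only. It takes an arbitrary separating $(\beta, \beta_0)$, rescales it so that the smallest value of $y_i(x_i^T\beta + \beta_0)$ equals $1$, and confirms that the margin $M$ equals $1/\|\beta\|$ after rescaling.

```python
import numpy as np

# Hypothetical toy data: two points per class, linearly separable.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])


def margin(beta, beta0):
    """Smallest signed distance y_i (x_i^T beta + beta0) / ||beta||."""
    return np.min(y * (X @ beta + beta0)) / np.linalg.norm(beta)


def rescale(beta, beta0):
    """Scale (beta, beta0) so that min_i y_i (x_i^T beta + beta0) = 1."""
    s = np.min(y * (X @ beta + beta0))
    return beta / s, beta0 / s


# An arbitrary separating direction (any positive multiple gives the same boundary).
beta, beta0 = np.array([3.0, 3.0]), 0.0

M = margin(beta, beta0)
beta_s, beta0_s = rescale(beta, beta0)

print("margin M            :", M)
print("1 / ||rescaled beta||:", 1.0 / np.linalg.norm(beta_s))
```

The margin is invariant to positive scaling, which is why fixing $\|\beta\| = 1/M$ loses no generality.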

### Convex optimization

This is a convex optimization problem (quadratic criterion with linear inequality constraints). The Lagrange (primal) function, to be minimized w.r.t. $\beta$ and $\beta_0$, is

\begin{equation}
L_P = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^N \alpha_i \left[ y_i(x_i^T\beta + \beta_0) -1 \right].
\end{equation}

Setting the derivatives to zero, we obtain:

\begin{align}
\beta &= \sum_{i=1}^N \alpha_i y_i x_i, \\
0 &= \sum_{i=1}^N \alpha_i y_i,
\end{align}

and substituting these in $L_P$ we obtain the so-called Wolfe dual

\begin{equation}
L_D = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{k=1}^N \alpha_i \alpha_k y_i y_k x_i^T x_k \\
\text{subject to } \alpha_i \ge 0 \text{ and } \sum_{i=1}^N \alpha_i y_i = 0.
\end{equation}
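The substitution behind $L_D$ can be spelled out as follows (a sketch of the algebra, using only the two stationarity conditions). Expanding $L_P$ and using $\sum_i \alpha_i y_i x_i^T\beta = \beta^T\beta$ together with $\sum_i \alpha_i y_i = 0$,

\begin{align}
L_P &= \frac{1}{2}\beta^T\beta - \sum_{i=1}^N \alpha_i y_i x_i^T\beta - \beta_0 \sum_{i=1}^N \alpha_i y_i + \sum_{i=1}^N \alpha_i \\
&= \frac{1}{2}\beta^T\beta - \beta^T\beta - 0 + \sum_{i=1}^N \alpha_i \\
&= \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{k=1}^N \alpha_i \alpha_k y_i y_k x_i^T x_k,
\end{align}

where the last line writes $\beta^T\beta$ as a double sum over $\beta = \sum_i \alpha_i y_i x_i$.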

The solution is obtained by maximizing $L_D$ in the positive orthant, a simpler convex optimization problem for which standard software can be used. In addition, the solution must satisfy the Karush-Kuhn-Tucker (KKT) conditions, which include the three conditions above and

\begin{equation}
\alpha_i \left[ y_i (x_i^T\beta + \beta_0) - 1 \right] = 0, \forall i.
\end{equation}
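As one possible illustration of handing the dual to standard software, the sketch below maximizes $L_D$ with `scipy.optimize.minimize` (method `SLSQP`) by minimizing $-L_D$ subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$. The toy data and the names `neg_dual` and `neg_dual_grad` are assumptions made for this example; a dedicated QP solver would be the more usual choice for larger problems.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: two points per class, linearly separable.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Q[i, k] = y_i y_k x_i^T x_k, the matrix of the quadratic term in L_D.
Z = y[:, None] * X
Q = Z @ Z.T


def neg_dual(alpha):
    """Negative Wolfe dual: 0.5 * alpha^T Q alpha - sum(alpha)."""
    return 0.5 * alpha @ Q @ alpha - alpha.sum()


def neg_dual_grad(alpha):
    """Gradient of the negative dual."""
    return Q @ alpha - np.ones(N)


res = minimize(
    neg_dual,
    x0=np.zeros(N),
    jac=neg_dual_grad,
    method="SLSQP",
    bounds=[(0.0, None)] * N,  # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}],
    options={"ftol": 1e-12, "maxiter": 200},
)

alpha = res.x
beta = (alpha * y) @ X  # beta = sum_i alpha_i y_i x_i

print("converged:", res.success)
print("alpha    :", np.round(alpha, 4))
print("beta     :", np.round(beta, 4))
```

Only $\beta$ is recovered here; $\beta_0$ follows from the complementary slackness condition applied to a point with $\alpha_i > 0$.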

### Implications of the algorithm

From these we can see that
* if $\alpha_i \gt 0$, then $y_i(x_i^T\beta + \beta_0) = 1$, or in other words, $x_i$ is on the boundary of the slab;
* if $y_i(x_i^T\beta + \beta_0) \gt 1$, $x_i$ is not on the boundary of the slab, and $\alpha_i = 0$.

From the stationarity condition obtained above for the primal Lagrange function

\begin{equation}
\beta = \sum_{i=1}^N \alpha_i y_i x_i,
\end{equation}

we see that the solution vector $\beta$ is defined in terms of a linear combination of the *support points* $x_i$ -- those points defined to be on the boundary of the slab via $\alpha_i \gt 0$.

FIGURE 4.16 shows the optimal separating hyperplane for our toy example; there are three support points. Likewise, $\beta_0$ is obtained by solving the last KKT condition

\begin{equation}
\alpha_i \left[ y_i (x_i^T\beta + \beta_0) - 1 \right] = 0,
\end{equation}

for any of the support points.
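The recovery of $\beta$ and $\beta_0$ from the dual variables can be sketched on a small example. The data and the vector `alpha` below are assumptions: `alpha` is the dual solution worked out by hand for this hypothetical four-point data set, not output from the figure's data. For a support point, $y_i(x_i^T\beta + \beta_0) = 1$ and $y_i^2 = 1$ give $\beta_0 = y_i - x_i^T\beta$.

```python
import numpy as np

# Hypothetical toy data and its hand-derived dual solution (assumed, see lead-in).
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])

# beta is a linear combination of the support points only (alpha_i > 0).
beta = (alpha * y) @ X
support = alpha > 0

# Each support point yields the same beta0; averaging is a common way to
# reduce round-off when alpha comes from a numerical solver.
beta0_each = y[support] - X[support] @ beta
beta0 = beta0_each.mean()

# Complementary slackness: alpha_i * [y_i (x_i^T beta + beta0) - 1] = 0 for all i.
slack = y * (X @ beta + beta0) - 1.0

print("support points:\n", X[support])
print("beta =", beta, " beta0 =", beta0)
print("alpha * slack =", alpha * slack)
```

Points with positive slack lie strictly outside the slab and carry $\alpha_i = 0$, so they do not enter $\beta$.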

In [2]:
"""FIGURE 4.16. Optimal separating hyperplane

The same data as in FIGURE 4.14. The shaded region delineates the maximum
margin separating the two classes. There are three support points
indicated, which lie on the boundary of the margin, and the optimal
separating hyperplane (blue line) bisects the slab. Included in the figure
is the boundary found using logistic regression (red line), which is very
close to the optimal separating hyperplane (see Section 12.3.3).

https://docs.scipy.org/doc/scipy/reference/optimize.html"""
print('Under construction (CVXOPT may be needed, priority low)...')

Under construction (CVXOPT may be needed, priority low)...


### Classification

The optimal separating hyperplane produces a function $\hat{f}(x) = x^T\hat\beta + \hat\beta_0$ for classifying new observations:

\begin{equation}
\hat{G}(x) = \text{sign } \hat{f}(x).
\end{equation}

Although none of the training observations fall in the margin (by construction), this will not necessarily be the case for test observations. The intuition is that a large margin on the training data will lead to good separation on the test data.
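A minimal sketch of the classification rule, assuming hypothetical fitted values `beta_hat` and `beta0_hat` and made-up test points (none of these come from the figure's data). It also flags test points that land inside the margin, that is, with $|\hat f(x)| \lt 1$ under the scaling used in the constraints.

```python
import numpy as np

# Hypothetical fitted parameters, scaled so that support points have |f(x)| = 1.
beta_hat = np.array([0.5, 0.5])
beta0_hat = 0.0


def f_hat(x):
    """Linear score x^T beta_hat + beta0_hat for one point or a batch of rows."""
    return np.asarray(x) @ beta_hat + beta0_hat


def G_hat(x):
    """Class label in {-1, +1} given by the sign of the score."""
    return np.sign(f_hat(x))


# Made-up test observations.
X_new = np.array([[3.0, 1.0], [0.5, 0.0], [-2.0, -2.0]])

scores = f_hat(X_new)              # [2.0, 0.25, -2.0]
labels = G_hat(X_new)
in_margin = np.abs(scores) < 1.0   # True for test points inside the slab

print("scores   :", scores)
print("labels   :", labels)
print("in margin:", in_margin)
```

The second test point is classified as $+1$ even though it falls inside the slab, which no training point does.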

### Dependency on model assumption

The description of the solution in terms of support points seems to suggest that the optimal hyperplane focuses more on the points that count, and is more robust to model misspecification.

The LDA solution, on the other hand, depends on all of the data, even points far away from the decision boundary. Note, however, that the identification of these support points required the use of all the data. Of course, if the classes are really Gaussian, then LDA is optimal, and separating hyperplanes will pay a price for focusing on the (noisier) data at the boundaries of the classes.

Included in FIGURE 4.16 is the logistic regression solution to this problem, fit by maximum likelihood. Both solutions are similar in this case. When a separating hyperplane exists, logistic regression will always find it, since the log-likelihood can be driven to $0$ in this case (Exercise 4.5).
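The claim about the log-likelihood can be illustrated numerically. The sketch below assumes a hypothetical separable toy data set with labels coded as $\pm 1$ and a helper `log_likelihood` introduced for this example. It evaluates the binomial log-likelihood at $(c\beta, c\beta_0)$ for a fixed separating $(\beta, \beta_0)$ and increasing $c$.

```python
import numpy as np

# Hypothetical toy data: two points per class, linearly separable.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Any separating hyperplane will do; this one is assumed for illustration.
beta, beta0 = np.array([0.5, 0.5]), 0.0


def log_likelihood(c):
    """Logistic log-likelihood at (c * beta, c * beta0) with y in {-1, +1}."""
    margins = c * y * (X @ beta + beta0)
    # log Pr(y_i | x_i) = -log(1 + exp(-margin_i)); logaddexp avoids overflow.
    return -np.sum(np.logaddexp(0.0, -margins))


scales = [1.0, 10.0, 100.0]
values = [log_likelihood(c) for c in scales]

for c, v in zip(scales, values):
    print(f"c = {c:6.1f}  log-likelihood = {v:.3e}")
```

Because every margin is positive, scaling the coefficients up keeps raising the log-likelihood toward $0$, which suggests the fitted coefficients grow without bound on separable data.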

*skipped*

### When the data are not separable

When the data are not separable, there will be no feasible solution to the optimization problem above, and an alternative formulation is needed.

Again one can enlarge the space using basis transformations, but this can lead to artificial separation through over-fitting. In Chapter 12 we discuss a more attractive alternative known as the *support vector machine*, which allows for overlap, but minimizes a measure of the extent of this overlap.