In [1]:
%load_ext lab_black

import altair as alt
import numpy as np
import pandas as pd

# 7. In Depth: Support Vector Machines

> Support vector machines (SVMs) are a particularly powerful and flexible class of supervised algorithms for both classification and regression.

## Motivating Support Vector Machines

> As part of our discussion of Bayesian classification, we learned a simple model describing the distribution of each underlying class, and used these generative models to probabilistically determine labels for new points. That was an example of *generative classification*; here we will consider instead *discriminative classification*: rather than modeling each class, we simply find a line or curve (in two dimensions) or manifold (in multiple dimensions) that divides the classes from each other.

> A linear discriminative classifier would attempt to draw a straight line separating the two sets of data, and thereby create a model for classification.

## Support Vector Machines: Maximizing the *Margin*

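> In support vector machines, each candidate separating line is surrounded by a margin of some width, extending up to the nearest training point, and the line that maximizes this margin is the one we choose as the optimal model. Support vector machines are an example of a *maximum margin* estimator.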
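
For reference, one standard way to write the hard-margin problem is sketched here. The notation is an assumption introduced for illustration and does not appear in the quoted text: labels $y_i \in \{-1, +1\}$ and a linear decision function $f(x) = w^\top x + b$.

$$
\min_{w,\, b} \; \tfrac{1}{2} \lVert w \rVert^2
\quad \text{subject to} \quad
y_i \,(w^\top x_i + b) \ge 1 \quad \text{for all } i
$$

Under this scaling the margin boundaries are the lines $f(x) = \pm 1$, which lie a distance $2 / \lVert w \rVert$ apart, so minimizing $\lVert w \rVert$ is the same as maximizing the margin.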

### Fitting a support vector machine

In [2]:
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs; a very large C makes the margin effectively hard
X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.60)
model = SVC(kernel="linear", C=1e10).fit(X, y)

pd.concat(
    [
        pd.DataFrame(X).add_prefix("x").assign(y=y.astype(str), opacity=1, size=0.1),
        (
            pd.DataFrame(model.support_vectors_)
            .add_prefix("x")
            .assign(y="Support vector", opacity=0.25, size=0.5)
        ),
    ]
).pipe(
    lambda df: alt.Chart(df)
    .mark_point(filled=True)
    .encode(
        x="x0",
        y="x1",
        color="y:N",
        opacity=alt.Opacity("opacity", legend=None),
        size=alt.Size("size", legend=None),
    )
) + pd.DataFrame(
    (x0, x1)
    for x0 in np.linspace(X[:, 0].min(), X[:, 0].max(), 30)
    for x1 in np.linspace(X[:, 1].min(), X[:, 1].max(), 30)
).add_prefix(
    "x"
).assign(
    y=model.decision_function
).pipe(
    lambda df: (
        alt.Chart(df, height=300, width=400)
        .mark_point(size=1)
        .encode(
            x="x0",
            y="x1",
            color=alt.Color(
                "y",
                scale=alt.Scale(scheme="blueorange", type="symlog"),
                title="Decision surface",
            ),
        )
    )
)
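
As a rough numeric check on the fit plotted above, the margin can be read off the fitted model. This is a minimal sketch that refits the same data so it runs on its own; `margin_width`, `sv_scores`, and `min_abs_score` are names chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Same data and (effectively hard-margin) model as the cell above
X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.60)
model = SVC(kernel="linear", C=1e10).fit(X, y)

# For a linear kernel, coef_ holds the normal vector w of the separating line
w = model.coef_[0]
margin_width = 2 / np.linalg.norm(w)

# Support vectors should sit on the margin, where |decision_function| is 1
sv_scores = model.decision_function(model.support_vectors_)

# No training point should fall strictly inside the margin
min_abs_score = np.abs(model.decision_function(X)).min()

print(f"margin width: {margin_width:.3f}")
print("support vector scores:", np.round(sv_scores, 3))
print(f"smallest |score| over training points: {min_abs_score:.3f}")
```

If the fit behaves as expected, the support vector scores should all be close to +1 or -1 (up to solver tolerance), and no training point should score noticeably below 1 in absolute value.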

> A key to this classifier's success is that for the fit, only the positions of the support vectors matter; any points farther from the margin that are on the correct side do not modify the fit. Technically, this is because these points do not contribute to the loss function used to fit the model, so their position and number do not matter so long as they do not cross the margin.

> This insensitivity to the exact behavior of distant points is one of the strengths of the SVM model.
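
One way to see this insensitivity is to add points that are guaranteed to lie far from the margin on the correct side and refit. The sketch below does this with synthetic extra data: the pushed copies (`X_extra`) and the 3-unit offset are hypothetical choices made for illustration, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.60)
base = SVC(kernel="linear", C=1e10).fit(X, y)

# Unit normal of the separating line, and the side each training point lies on
w = base.coef_[0]
unit_normal = w / np.linalg.norm(w)
side = np.sign(base.decision_function(X))

# Hypothetical extra data: a copy of every point, pushed 3 units farther
# from the boundary on its own (correct) side
X_extra = X + 3.0 * side[:, None] * unit_normal
X_big = np.vstack([X, X_extra])
y_big = np.concatenate([y, y])

bigger = SVC(kernel="linear", C=1e10).fit(X_big, y_big)

print("w before:", np.round(base.coef_[0], 3), "b:", np.round(base.intercept_, 3))
print("w after: ", np.round(bigger.coef_[0], 3), "b:", np.round(bigger.intercept_, 3))
```

Doubling the training set this way should leave the fitted line essentially unchanged, up to solver tolerance, because none of the added points reaches the margin.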

### Beyond linear boundaries: Kernel SVM

> Where SVM becomes extremely powerful is when it is combined with kernels.

> One strategy is to compute a basis function centered at every point in the dataset and let the SVM algorithm sift through the results. This type of basis function transformation is known as a *kernel transformation*, as it is based on a similarity relationship (or kernel) between each pair of points. A potential problem with this strategy, projecting $N$ points into $N$ dimensions, is that it might become very computationally intensive as $N$ grows large. However, because of a procedure known as the [*kernel trick*](https://en.wikipedia.org/wiki/Kernel_trick), a fit on kernel-transformed data can be done implicitly, that is, without ever building the full $N$-dimensional representation of the kernel projection. This kernel trick is built into the SVM, and is one of the reasons the method is so powerful.

> Using this kernelized support vector machine, we learn a suitable nonlinear decision boundary. This kernel transformation strategy is used often in machine learning to turn fast linear methods into fast nonlinear methods, especially for models in which the kernel trick can be used.

In [3]:
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(100, factor=0.1, noise=0.1, random_state=0)

pd.concat(
    [
        pd.DataFrame(np.c_[X, y], columns=["x0", "x1", "y"]).assign(
            opacity=1, size=0.1
        ),
        pd.DataFrame(SVC(kernel="rbf").fit(X, y).support_vectors_)
        .add_prefix("x")
        .assign(y="Support vector", opacity=0.25, size=0.5),
    ]
).pipe(
    lambda df: alt.Chart(df, height=400, width=600)
    .mark_point(filled=True)
    .encode(
        x="x0",
        y="x1",
        color="y:N",
        opacity=alt.Opacity("opacity", legend=None),
        size=alt.Size("size", legend=None),
    )
)
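
To make the "similarity between each pair of points" concrete, the sketch below builds the $N \times N$ RBF similarity matrix by hand and passes it to `SVC(kernel="precomputed")`, then compares the result with `SVC(kernel="rbf")`, which evaluates the same kernel internally. The value `gamma = 1.0` is an arbitrary illustrative choice, and the precomputed route is shown only for comparison; it is not what the cell above does.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_circles(100, factor=0.1, noise=0.1, random_state=0)
gamma = 1.0  # illustrative value, fixed so both models use the same kernel

# Explicit route: one similarity value for every pair of training points
K = rbf_kernel(X, X, gamma=gamma)
explicit = SVC(kernel="precomputed").fit(K, y)
pred_explicit = explicit.predict(K)

# Implicit route: SVC evaluates the kernel itself, only where it needs to
implicit = SVC(kernel="rbf", gamma=gamma).fit(X, y)
pred_implicit = implicit.predict(X)

print("kernel matrix shape:", K.shape)
print("predictions agree:", np.array_equal(pred_explicit, pred_implicit))
```

The two routes should give matching predictions on the training data. The practical difference is that the explicit matrix grows as $N^2$, which is the cost the implicit evaluation avoids materializing up front.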

### Tuning the SVM: Softening Margins

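> When the data overlap and no perfect decision boundary exists, the SVM implementation has a bit of a fudge-factor which "softens" the margin: that is, it allows some of the points to creep into the margin if that allows a better fit. The hardness of the margin is controlled by a tuning parameter, $C$: for very large $C$ the margin is hard and points cannot lie in it, while for smaller $C$ the margin is softer and can grow to encompass some points.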

In [4]:
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=0)

pd.concat(
    df
    for C in np.logspace(-1, 1, 5)
    for df in [
        pd.DataFrame(np.c_[X, y], columns=["x0", "x1", "y"]).assign(
            C=C, opacity=1, size=0.1
        ),
        pd.DataFrame(SVC(kernel="linear", C=C).fit(X, y).support_vectors_)
        .add_prefix("x")
        .assign(y="Support vector", C=C, opacity=0.25, size=0.5),
    ]
).pipe(
    lambda df: alt.Chart(df, height=200, width=300)
    .mark_point(filled=True)
    .encode(
        x="x0",
        y="x1",
        color="y:N",
        opacity=alt.Opacity("opacity", legend=None),
        size=alt.Size("size", legend=None),
    )
    .facet(facet="C", columns=3)
)
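
The effect of $C$ in the faceted chart above can also be summarized numerically. This is a minimal sketch over the same data and the same grid of $C$ values; the `summary` dictionary and its keys are names introduced here for illustration, and the margin width uses the usual $2 / \lVert w \rVert$ expression for a linear kernel.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=0)

summary = {}
for C in np.logspace(-1, 1, 5):
    model = SVC(kernel="linear", C=C).fit(X, y)
    summary[float(C)] = {
        # n_support_ counts support vectors per class
        "n_support": int(model.n_support_.sum()),
        # Distance between the two margin boundaries for a linear kernel
        "margin_width": 2 / np.linalg.norm(model.coef_[0]),
    }

for C, row in summary.items():
    print(
        f"C={C:6.2f}  support vectors={row['n_support']:3d}  "
        f"margin width={row['margin_width']:.3f}"
    )
```

Smaller values of $C$ should show a wider margin and typically more support vectors, since more points are allowed inside the margin.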

## Example: Face Recognition

In [5]:
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.svm import SVC

lfw = fetch_lfw_people(min_faces_per_person=60)