# Chapter 5: Support Vector Machines

**Reference:** Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Aurélien Géron)

---

## 1. Chapter Introduction

A Support Vector Machine (SVM) is a powerful and versatile Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection. It is one of the most popular models in Machine Learning, and anyone interested in Machine Learning should have it in their toolbox. SVMs are particularly well suited for classification of complex small- or medium-sized datasets.

This chapter will explain the core concepts of SVMs, how to use them, and how they work.

## 2. Linear SVM Classification

The fundamental idea behind SVMs is easiest to grasp geometrically. Imagine two classes that can be separated cleanly with a straight line (they are *linearly separable*).

You can think of an SVM classifier as fitting the widest possible street between the classes, with the decision boundary running down the middle and the two edges of the street parallel to it. This is called **Large Margin Classification**.

Notice that adding more training instances "off the street" will not affect the decision boundary at all: it is fully determined (or "supported") by the instances located on the edge of the street. These instances are called **Support Vectors**.
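
To make the role of support vectors concrete, here is a minimal sketch on a made-up six-point dataset. It uses `SVC(kernel="linear")` with a very large `C` as a stand-in for a hard margin classifier (an assumption for illustration; `C` is introduced further down), then adds two instances well off the street and refits.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy set (made up for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0],   # class 0
              [3.0, 3.0], [4.0, 3.0], [4.0, 4.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates hard margin classification.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
print("Support vectors:\n", clf.support_vectors_)

# Add two instances far from the street, each on the correct side.
X_more = np.vstack([X, [[-2.0, -1.0], [6.0, 5.0]]])
y_more = np.append(y, [0, 1])
clf_more = SVC(kernel="linear", C=1e6).fit(X_more, y_more)

print("w, b before:", clf.coef_[0], clf.intercept_[0])
print("w, b after: ", clf_more.coef_[0], clf_more.intercept_[0])
```

Up to solver tolerance, the weights and bias should be the same in both fits, since the new instances do not touch the edges of the street.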

**Feature Scaling is Crucial**
SVMs are sensitive to feature scales. If one feature spans a much larger range than the others, it dominates the distance computations, and the decision boundary ends up depending almost entirely on that feature. After feature scaling (e.g., using Scikit-Learn’s `StandardScaler`), every feature contributes comparably, and the resulting decision boundary is usually much better.
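
As a quick sketch of what `StandardScaler` does, the snippet below standardizes a made-up two-feature array and compares the result with the manual formula (subtract the column mean, divide by the column standard deviation). The data values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
X = np.array([[1.0, 50.0], [5.0, 20.0], [3.0, 80.0], [5.0, 60.0]])

X_scaled = StandardScaler().fit_transform(X)

# Same computation by hand: zero mean and unit variance per column.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("Column means after scaling:", X_scaled.mean(axis=0))
print("Column stds after scaling: ", X_scaled.std(axis=0))
```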

### Soft Margin Classification

If we strictly impose that all instances must be off the street and on the right side, this is called **Hard Margin Classification**. There are two main issues with hard margin classification:
1.  It only works if the data is linearly separable.
2.  It is quite sensitive to outliers.

To avoid these issues, it is preferable to use a more flexible model. The objective is to find a good balance between keeping the street as wide as possible and limiting the *margin violations* (i.e., instances that end up in the middle of the street or even on the wrong side). This is called **Soft Margin Classification**.

In Scikit-Learn’s SVM classes, you can control this balance using the `C` hyperparameter:
* **Low C:** A wider street, but more margin violations (high bias, low variance).
* **High C:** A narrower street, but fewer margin violations (low bias, high variance).
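
The trade-off controlled by `C` can be checked numerically. The sketch below is an illustration under stated assumptions: the data are two synthetic overlapping Gaussian blobs, `SVC(kernel="linear")` is used as the solver, and the helper `street_stats` is a hypothetical name. It measures the street width as $2 / \|\mathbf{w}\|$ and counts the instances that lie inside the street or on the wrong side.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs (synthetic, for illustration only).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-1.0, 1.5, size=(100, 2)),
               rng.normal(+1.0, 1.5, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
t = 2 * y - 1  # targets recoded as -1 / +1

def street_stats(C):
    """Return (street width, number of margin violations) for a given C."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2 / np.linalg.norm(clf.coef_)      # distance between the two street edges
    scores = clf.decision_function(X)
    violations = int(np.sum(t * scores < 1))   # inside the street or on the wrong side
    return width, violations

width_low, viol_low = street_stats(C=0.01)
width_high, viol_high = street_stats(C=100)

print(f"C=0.01: width={width_low:.2f}, violations={viol_low}")
print(f"C=100:  width={width_high:.2f}, violations={viol_high}")
```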

Let's load the Iris dataset, scale the features, and then train a linear SVM model (using the `LinearSVC` class with `C=1` and the *hinge loss* function) to detect Iris virginica flowers.

In [None]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# 1. Load the Iris dataset
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris virginica

# 2. Build the Pipeline
# Feature scaling (StandardScaler) is essential for SVMs to perform well.
# LinearSVC is optimized for linear SVMs and is faster than SVC(kernel="linear").
# loss="hinge": The standard SVM loss function.
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", random_state=42)),
])

# 3. Train the model
svm_clf.fit(X, y)

# 4. Make a prediction
print("Prediction for [5.5, 1.7]:", svm_clf.predict([[5.5, 1.7]]))

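Unlike Logistic Regression classifiers, SVM classifiers do not natively output probabilities for each class; they output a signed score from the decision function. `LinearSVC` has no `predict_proba()` method, and `SVC` provides one only if you set `probability=True`, which fits an extra calibration step using cross-validation.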
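
As a rough illustration of what a linear SVM classifier returns, the sketch below fits a `LinearSVC` on the scaled Iris petal features and prints the signed scores for the first five instances. It uses the default loss rather than the hinge loss, which is a simplification for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris["data"][:, [2, 3]])  # petal length, petal width
y = (iris["target"] == 2).astype(int)                        # Iris virginica or not

clf = LinearSVC(C=1, random_state=42).fit(X, y)

scores = clf.decision_function(X[:5])  # signed scores, not probabilities
preds = clf.predict(X[:5])             # class 1 where the score is positive

print("Scores:     ", scores)
print("Predictions:", preds)
print("Has predict_proba:", hasattr(clf, "predict_proba"))
```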

## 3. Nonlinear SVM Classification

Although linear SVM classifiers are efficient and work surprisingly well in many cases, many datasets are not even close to being linearly separable. One approach to handling nonlinear datasets is to add more features, such as polynomial features; in some cases this can result in a linearly separable dataset.

To implement this idea using Scikit-Learn, you can create a `Pipeline` containing a `PolynomialFeatures` transformer, followed by a `StandardScaler` and a `LinearSVC`. Let’s test this on the moons dataset: this is a toy dataset for binary classification in which the data points are shaped as two interleaving half circles.

In [None]:
from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures

# 1. Generate the Moons dataset
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# 2. Build Pipeline with Polynomial Features
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge", random_state=42))
])

polynomial_svm_clf.fit(X, y)
print("Training complete on Moons dataset.")

### Polynomial Kernel

Adding polynomial features is simple to implement and can work great with all sorts of Machine Learning algorithms (not just SVMs), but at a low polynomial degree it cannot deal with very complex datasets, and with a high polynomial degree it creates a huge number of features, making the model too slow.
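
To get a feel for how fast the feature count grows, note that with $n$ input features and degree $d$ there are $\binom{n+d}{d}$ monomials of degree at most $d$ (bias term included), which matches what `PolynomialFeatures` generates with its default settings. The helper name below is hypothetical.

```python
from math import comb

def n_poly_features(n_features, degree):
    """Number of monomials of total degree <= `degree`, bias term included."""
    return comb(n_features + degree, degree)

for n, d in [(2, 3), (2, 10), (100, 3), (100, 10)]:
    print(f"n={n:>3}, degree={d:>2}: {n_poly_features(n, d):,} features")
```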

Fortunately, when using SVMs you can apply an almost miraculous mathematical technique called the **kernel trick** (explained later). It makes it possible to get the same result as if you had added many polynomial features, even with very high-degree polynomials, without actually having to add them. So there is no combinatorial explosion of the number of features since you don’t actually add any features. This trick is implemented by the `SVC` class.

In [None]:
from sklearn.svm import SVC

# Polynomial Kernel SVM
# kernel="poly": Use the polynomial kernel.
# degree=3: A 3rd-degree polynomial kernel.
# coef0=1: Controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.
# C=5: Regularization parameter.
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
])
poly_kernel_svm_clf.fit(X, y)

### Gaussian RBF Kernel

Another technique to tackle nonlinear problems is to add features computed using a *similarity function* that measures how much each instance resembles a particular *landmark*.

The similarity function used is the **Gaussian Radial Basis Function (RBF)**:
$$ \phi_\gamma(\mathbf{x}, \ell) = \exp(-\gamma \|\mathbf{x} - \ell\|^2) $$

Where:
* $\ell$ is the landmark.
* $\gamma$ is a hyperparameter that controls the width of the Gaussian curve.
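
Here is a small sketch of the similarity features on a one-dimensional example. The instances, the two landmarks (at -2 and 1), and $\gamma = 0.3$ are values chosen for illustration, and `gaussian_rbf` is a hypothetical helper name.

```python
import numpy as np

def gaussian_rbf(X, landmark, gamma):
    """Gaussian RBF similarity between each row of X and one landmark."""
    return np.exp(-gamma * np.sum((X - landmark) ** 2, axis=-1))

X = np.array([[-1.0], [0.0], [2.0]])  # three 1D instances
gamma = 0.3

# One new feature per landmark.
features = np.column_stack([
    gaussian_rbf(X, np.array([-2.0]), gamma),
    gaussian_rbf(X, np.array([1.0]), gamma),
])
print(features.round(2))
```

For the instance $x = -1$, the two features come out at roughly 0.74 and 0.30: it is closer to the first landmark, so its similarity to that landmark is higher.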

Just like the polynomial features method, the similarity features method can be useful with any Machine Learning algorithm, but it may be computationally expensive to compute all the additional features (especially on large training sets). However, once again the kernel trick does its SVM magic: it makes it possible to obtain a similar result as if you had added many similarity features, without actually having to add them.

In [None]:
# Gaussian RBF Kernel SVM
# kernel="rbf": Use the Gaussian RBF kernel.
# gamma=5: Controls the width of the bell curve.
#   - High gamma: Narrow bell curve -> each instance's range of influence is small -> irregular decision boundary (overfitting risk).
#   - Low gamma: Wide bell curve -> each instance's range of influence is large -> smoother decision boundary (underfitting risk).
# C=0.001: Regularization parameter (very low C -> strong regularization).
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])
rbf_kernel_svm_clf.fit(X, y)

**How to choose the kernel?**
The rule of thumb is to always try the linear kernel first (remember that `LinearSVC` is much faster than `SVC(kernel="linear")`), especially if the training set is very large or if it has plenty of features. If the training set is not too large, you should try the Gaussian RBF kernel as well; it works well in most cases. Then if you have spare time and computing power, you can also experiment with a few other kernels using cross-validation and grid search.
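
One possible way to run such a comparison is sketched below with `GridSearchCV` on the moons dataset. The parameter grid is an arbitrary choice for illustration, not a recommended search space.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC()),
])

# Hypothetical grid: one sub-grid per kernel, each with its own hyperparameters.
param_grid = [
    {"svm_clf__kernel": ["linear"], "svm_clf__C": [0.1, 1, 10]},
    {"svm_clf__kernel": ["rbf"], "svm_clf__C": [0.1, 1, 10],
     "svm_clf__gamma": [0.1, 1, 5]},
    {"svm_clf__kernel": ["poly"], "svm_clf__C": [0.1, 1, 10],
     "svm_clf__degree": [2, 3], "svm_clf__coef0": [0, 1]},
]

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```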

## 4. SVM Regression

As we mentioned earlier, the SVM algorithm is quite versatile: not only does it support linear and nonlinear classification, but it also supports linear and nonlinear regression. The trick is to reverse the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible *on* the street while limiting margin violations (i.e., instances *off* the street).

The width of the street is controlled by a hyperparameter $\epsilon$ (epsilon).
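
One common way to formalize this idea is the $\epsilon$-insensitive loss, $\max(0, |y - \hat{y}| - \epsilon)$: instances within $\epsilon$ of the prediction cost nothing, and the others are penalized by how far they fall outside. The text above does not spell out this formula, so treat the sketch below as an illustration with made-up numbers.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the street, linear in the distance outside of it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([5.0, 7.0, 9.0, 4.0])
y_pred = np.array([5.5, 6.0, 11.0, 1.0])  # absolute errors: 0.5, 1.0, 2.0, 3.0

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=1.5)
print(losses)
```

Only the last two instances are off the street here, so only they contribute to the loss.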

In [None]:
from sklearn.svm import LinearSVR

# Generate simple linear data
np.random.seed(42)
m = 50
X = 2 * np.random.rand(m, 1)
y = (4 + 3 * X + np.random.randn(m, 1)).ravel()

# Linear SVM Regression
# epsilon=1.5: Defines the width of the street (margin).
# Adding more training instances within the margin does not affect the model’s predictions; thus, the model is said to be epsilon-insensitive.
svm_reg = LinearSVR(epsilon=1.5, random_state=42)
svm_reg.fit(X, y)

print("Prediction for X=1.0:", svm_reg.predict([[1.0]]))

To tackle nonlinear regression tasks, you can use a kernelized SVM model. Scikit-Learn’s `SVR` class (which supports the kernel trick) is the regression equivalent of the `SVC` class.

In [None]:
from sklearn.svm import SVR

# Generate quadratic data
m = 100
X = 2 * np.random.rand(m, 1) - 1
y = (0.2 + 0.1 * X + 0.5 * X**2 + np.random.randn(m, 1)/10).ravel()

# Nonlinear SVM Regression (Polynomial Kernel)
# kernel="poly": Use a polynomial kernel.
# degree=2: 2nd degree polynomial.
# C=100: Regularization (High C -> less regularization).
# epsilon=0.1: Margin width.
svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1, gamma="scale")
svm_poly_reg.fit(X, y)

print("Regression complete.")

## 5. Under the Hood

This section explains how SVMs make predictions and how their training algorithms work, starting with linear SVM classifiers.

### Decision Function and Predictions
The linear SVM classifier predicts the class of a new instance $\mathbf{x}$ by simply computing the decision function $\mathbf{w}^T \mathbf{x} + b = w_1 x_1 + \dots + w_n x_n + b$. If the result is negative, the predicted class $\hat{y}$ is 0; otherwise it is 1.

**Equation 5-2: Linear SVM classifier prediction**
$$ \hat{y} = \begin{cases} 0 & \text{if } \mathbf{w}^T \mathbf{x} + b < 0 \\ 1 & \text{if } \mathbf{w}^T \mathbf{x} + b \ge 0 \end{cases} $$
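
Equation 5-2 translates almost directly into NumPy. The weights, bias, and instances below are made-up values for illustration.

```python
import numpy as np

w = np.array([0.5, 0.5])  # hypothetical weight vector
b = -2.0                  # hypothetical bias term

X_new = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [5.0, 1.0]])

scores = X_new @ w + b              # decision function w^T x + b
y_pred = (scores >= 0).astype(int)  # Equation 5-2: class 1 if the score is >= 0

print("Scores:     ", scores)
print("Predictions:", y_pred)
```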

### Training Objective
Consider the slope of the decision function: it is equal to the norm of the weight vector, $\|\mathbf{w}\|_2$. If we divide this slope by 2, the points where the decision function is equal to $\pm 1$ will be twice as far away from the decision boundary. In other words, dividing the slope by 2 will multiply the margin by 2. The smaller the weight vector $\mathbf{w}$, the larger the margin.
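
The relationship between the weight vector and the margin can be checked with a few lines of NumPy: the points where the decision function equals $\pm 1$ lie at distance $1 / \|\mathbf{w}\|_2$ from the decision boundary. The weight vector below is a made-up value, and `margin_half_width` is a hypothetical helper name.

```python
import numpy as np

def margin_half_width(w):
    """Distance from the decision boundary to the points where w^T x + b = +/-1."""
    return 1.0 / np.linalg.norm(w)

w = np.array([3.0, 4.0])  # norm is 5

half_width = margin_half_width(w)
half_width_halved_w = margin_half_width(w / 2)

print("Half-width with w:  ", half_width)
print("Half-width with w/2:", half_width_halved_w)
```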

So we want to minimize $\|\mathbf{w}\|_2$ to get a large margin. However, if we also want to avoid any margin violation (hard margin), then we need the decision function to be at least 1 for all positive training instances, and at most -1 for all negative training instances. If we define $t^{(i)} = -1$ for negative instances (if $y^{(i)} = 0$) and $t^{(i)} = 1$ for positive instances (if $y^{(i)} = 1$), then we can express this constraint as $t^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b) \ge 1$ for all instances.

We can thus express the hard margin linear SVM classifier objective as the constrained optimization problem:

**Equation 5-3: Hard margin linear SVM classifier objective**
$$ \begin{aligned} & \underset{\mathbf{w}, b}{\text{minimize}} & & \frac{1}{2} \mathbf{w}^T \mathbf{w} \\ & \text{subject to} & & t^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b) \ge 1 \quad \text{for } i = 1, 2, \dots, m \end{aligned} $$

To get the soft margin objective, we introduce a *slack variable* $\zeta^{(i)} \ge 0$ for each instance: $\zeta^{(i)}$ measures how much the $i^{th}$ instance is allowed to violate the margin. We now have two conflicting objectives: make the slack variables as small as possible to reduce margin violations, and minimize $\frac{1}{2} \mathbf{w}^T \mathbf{w}$ to increase the margin. The hyperparameter $C$ allows us to define the trade-off between these two objectives.

**Equation 5-4: Soft margin linear SVM classifier objective**
$$ \begin{aligned} & \underset{\mathbf{w}, b, \mathbf{\zeta}}{\text{minimize}} & & \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^{m} \zeta^{(i)} \\ & \text{subject to} & & t^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b) \ge 1 - \zeta^{(i)} \quad \text{and} \quad \zeta^{(i)} \ge 0 \quad \text{for } i = 1, 2, \dots, m \end{aligned} $$
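
To see how the terms of Equation 5-4 combine, the sketch below evaluates the objective for a fixed, made-up $(\mathbf{w}, b)$. For given parameters, the smallest slack that satisfies both constraints is $\zeta^{(i)} = \max(0, 1 - t^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b))$, which is what the code computes. The function name and data are hypothetical.

```python
import numpy as np

def soft_margin_objective(w, b, X, t, C):
    """Value of Equation 5-4 at (w, b), using the smallest feasible slacks."""
    margins = t * (X @ w + b)
    zeta = np.maximum(0.0, 1.0 - margins)  # slack per instance
    return 0.5 * w @ w + C * zeta.sum(), zeta

X = np.array([[1.0, 1.0], [3.0, 3.0], [2.5, 2.5], [2.0, 2.0]])
t = np.array([-1, 1, -1, 1])
w = np.array([0.5, 0.5])
b = -2.0

value, zeta = soft_margin_objective(w, b, X, t, C=1.0)
print("Slack variables:", zeta)
print("Objective value:", value)
```

The first two instances sit exactly on the edges of the street (zero slack), the third is on the wrong side of the decision boundary, and the fourth lies on the boundary itself.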

### Quadratic Programming
The hard margin and soft margin problems are both convex quadratic optimization problems with linear constraints. Such problems are known as *Quadratic Programming* (QP) problems. Many off-the-shelf solvers are available to solve QP problems.
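
As a rough sketch of handing the hard margin problem (Equation 5-3) to a general-purpose solver, the code below uses SciPy's `minimize` with the SLSQP method on a made-up six-point dataset. SLSQP is a general constrained optimizer rather than a dedicated QP solver, so this is an illustration only and not how Scikit-Learn trains SVMs.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny, linearly separable toy set (made up for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [4.0, 4.0]])
t = np.array([-1, -1, -1, 1, 1, 1])

def objective(params):
    w = params[:-1]  # the last entry of params is the bias b
    return 0.5 * w @ w

def objective_grad(params):
    return np.append(params[:-1], 0.0)  # the objective does not depend on b

# SLSQP expects inequality constraints in the form fun(params) >= 0.
constraints = [{
    "type": "ineq",
    "fun": lambda params: t * (X @ params[:-1] + params[-1]) - 1,
}]

result = minimize(objective, x0=np.zeros(3), jac=objective_grad,
                  method="SLSQP", constraints=constraints)
w, b = result.x[:-1], result.x[-1]
print("w =", w.round(3), " b =", round(b, 3))
```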

### The Dual Problem
Given a constrained optimization problem, known as the *primal problem*, it is possible to express a different but closely related problem, called its *dual problem*. The solution to the dual problem typically gives a lower bound to the solution of the primal problem, but under some conditions (which apply here) it can have the same solution.

The SVM dual problem is faster to solve than the primal when the number of training instances is smaller than the number of features. More importantly, it makes the **kernel trick** possible.

**Equation 5-6: Dual form of the linear SVM objective**
$$ \begin{aligned} & \underset{\mathbf{\alpha}}{\text{minimize}} & & \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} t^{(i)} t^{(j)} \mathbf{x}^{(i)T} \mathbf{x}^{(j)} - \sum_{i=1}^{m} \alpha^{(i)} \\ & \text{subject to} & & \alpha^{(i)} \ge 0 \quad \text{for } i = 1, 2, \dots, m \quad \text{and} \quad \sum_{i=1}^{m} \alpha^{(i)} t^{(i)} = 0 \end{aligned} $$
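
A standard result that the text above does not state is that the primal weights follow from the dual solution as $\mathbf{w} = \sum_i \alpha^{(i)} t^{(i)} \mathbf{x}^{(i)}$, and that the optimal multipliers satisfy $\sum_i \alpha^{(i)} t^{(i)} = 0$ (this comes from setting the derivative with respect to $b$ to zero). The sketch below checks both on a made-up dataset, using `SVC`'s `dual_coef_` attribute, which stores $\alpha^{(i)} t^{(i)}$ for the support vectors only.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy set (made up for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [4.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C approximates a hard margin

alpha_t = clf.dual_coef_[0]              # alpha_i * t_i, support vectors only
w_from_dual = alpha_t @ clf.support_vectors_

print("w from the dual solution:", w_from_dual)
print("w reported by SVC:       ", clf.coef_[0])
print("Sum of alpha_i * t_i:    ", alpha_t.sum())
```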

### Kernelized SVM
Suppose you want to apply a 2nd-degree polynomial transformation to a two-dimensional training set. The mapping is $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. If you apply $\phi$ to two vectors $\mathbf{a}$ and $\mathbf{b}$, the dot product of the transformed vectors is:
$$ \phi(\mathbf{a})^T \phi(\mathbf{b}) = a_1^2 b_1^2 + 2 a_1 b_1 a_2 b_2 + a_2^2 b_2^2 = (a_1 b_1 + a_2 b_2)^2 = (\mathbf{a}^T \mathbf{b})^2 $$

Notice that the dot product of the transformed vectors is equal to the square of the dot product of the original vectors. This is the key insight: if you apply the transformation $\phi$ to all training instances, then the dual problem will contain the dot product $\phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}^{(j)})$. But if $\phi$ is the 2nd-degree polynomial transformation defined above, then you can replace this dot product of transformed vectors simply by $(\mathbf{x}^{(i)T} \mathbf{x}^{(j)})^2$. 

So you don’t actually need to transform the training instances at all: just replace the dot product by its square in Equation 5-6. The result will be strictly the same as if you had gone through the trouble of transforming the training set and then fitting a linear SVM, but the computation is much more efficient. This trick is called the **Kernel Trick**.

The function $K(\mathbf{a}, \mathbf{b}) = (\mathbf{a}^T \mathbf{b})^2$ is called a 2nd-degree polynomial kernel. In Machine Learning, a *kernel* is a function capable of computing the dot product $\phi(\mathbf{a})^T \phi(\mathbf{b})$ based only on the original vectors $\mathbf{a}$ and $\mathbf{b}$, without having to compute (or even know about) the transformation $\phi$.
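
The identity behind the 2nd-degree polynomial kernel is easy to verify numerically. The two vectors below are arbitrary, and `phi` and `poly2_kernel` are hypothetical helper names.

```python
import numpy as np

def phi(x):
    """Explicit 2nd-degree polynomial mapping for a 2D vector."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Same dot product, computed from the original vectors only."""
    return (a @ b) ** 2

a = np.array([2.0, 3.0])
b = np.array([1.0, 4.0])

explicit = phi(a) @ phi(b)      # transform first, then take the dot product
via_kernel = poly2_kernel(a, b) # no transformation needed

print(explicit, via_kernel)
```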

**Common Kernels:**
* **Linear:** $K(\mathbf{a}, \mathbf{b}) = \mathbf{a}^T \mathbf{b}$
* **Polynomial:** $K(\mathbf{a}, \mathbf{b}) = (\gamma \mathbf{a}^T \mathbf{b} + r)^d$
* **Gaussian RBF:** $K(\mathbf{a}, \mathbf{b}) = \exp(-\gamma \|\mathbf{a} - \mathbf{b}\|^2)$
* **Sigmoid:** $K(\mathbf{a}, \mathbf{b}) = \tanh(\gamma \mathbf{a}^T \mathbf{b} + r)$
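
The four kernels above can be written as short NumPy functions. This is a minimal sketch for a single pair of vectors, with function names and default values chosen for illustration. In Scikit-Learn's `SVC`, the symbols $\gamma$, $r$, and $d$ correspond to the `gamma`, `coef0`, and `degree` hyperparameters.

```python
import numpy as np

def linear_kernel(a, b):
    return a @ b

def polynomial_kernel(a, b, gamma=1.0, r=0.0, d=2):
    return (gamma * (a @ b) + r) ** d

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def sigmoid_kernel(a, b, gamma=1.0, r=0.0):
    return np.tanh(gamma * (a @ b) + r)

a = np.array([2.0, 3.0])
b = np.array([1.0, 4.0])

print("Linear:    ", linear_kernel(a, b))
print("Polynomial:", polynomial_kernel(a, b, gamma=1.0, r=1.0, d=3))
print("RBF:       ", rbf_kernel(a, b, gamma=0.5))
print("Sigmoid:   ", sigmoid_kernel(a, b, gamma=0.01))
```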