A maximum margin classifier is a linear classifier that chooses the separating line / hyperplane that is **as far as possible from all training points**, and it’s the core geometric idea behind SVMs. [bioinformatics-training.github](https://bioinformatics-training.github.io/intro-machine-learning-2017/svm.html)

### 1. Geometric picture: hyperplanes, margin, and support vectors

- In a binary classification problem, many different hyperplanes can separate the two classes perfectly if the data are linearly separable. [en.wikipedia](https://en.wikipedia.org/wiki/Margin_(machine_learning))

- Each separating hyperplane has a **margin**: the perpendicular distance from the hyperplane to the **closest** training points in either class. [courses.grainger.illinois](https://courses.grainger.illinois.edu/cs446/sp2015/Slides/Lecture10.pdf)

- The **maximum‑margin hyperplane** is the separating hyperplane whose margin is largest; the classifier it defines is the **maximum margin classifier**. [alan-turing-institute.github](https://alan-turing-institute.github.io/Intro-to-transparent-ML-course/08-glm-svm/support-vec-classifier.html)

Key terms:

- **Hyperplane**:  
  - Line in 2D, plane in 3D, or flat surface of dimension $p-1$ in $p$-dimensional space. [bioinformatics-training.github](https://bioinformatics-training.github.io/intro-machine-learning-2017/svm.html)

- **Support vectors**:  
  - The training points that lie exactly on the margin boundaries (the parallel hyperplanes closest to the decision boundary). [pages.hmc](https://pages.hmc.edu/ruye/MachineLearning/lectures/ch9/node6.html)
  - They alone “support” or determine the position of the maximum‑margin hyperplane.

- The **margin region** is the band between two parallel lines (or hyperplanes) that pass through the support vectors, one for each class. [courses.grainger.illinois](https://courses.grainger.illinois.edu/cs446/sp2015/Slides/Lecture10.pdf)
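
To make the definition of margin concrete, here is a minimal sketch that measures the margin of two candidate separating lines. The four data points and both lines are made up for illustration; `margin` is a hypothetical helper, not part of any library.

```python
import math

def margin(beta, beta0, points):
    """Smallest perpendicular distance from any point to the hyperplane beta.x + beta0 = 0."""
    norm = math.sqrt(sum(b * b for b in beta))
    return min(abs(sum(b * x for b, x in zip(beta, p)) + beta0) / norm for p in points)

# Toy 2D data (illustrative only): two linearly separable classes.
positives = [(3.0, 3.0), (4.0, 3.0)]
negatives = [(1.0, 1.0), (0.0, 1.0)]
points = positives + negatives

# Two candidate lines; both separate the classes perfectly.
m_diag = margin((1.0, 1.0), -4.0, points)   # line x1 + x2 = 4
m_vert = margin((1.0, 0.0), -2.5, points)   # line x1 = 2.5

print(m_diag, m_vert)
```

Both lines classify every point correctly, but the diagonal line keeps the closest points about 1.41 away while the vertical line comes within 0.5 of one of them, so the diagonal line is the better choice under the maximum margin criterion.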

### 2. Why maximum margin is a good idea

Intuition:

- Test points are likely to appear **near** the training data.

- A decision boundary that sits right next to points is fragile: small noise or slight shifts can make new points fall on the wrong side.

- A boundary that is as **far away as possible** from all training points (large margin) tends to be **more stable** and generalize better to unseen data. [alan-turing-institute.github](https://alan-turing-institute.github.io/Intro-to-transparent-ML-course/08-glm-svm/support-vec-classifier.html)

This leads to the principle:

> Among all linear separators that correctly classify the training data, choose the one with the **largest margin**.

That is exactly what the maximum margin classifier does. [en.wikipedia](https://en.wikipedia.org/wiki/Margin_(machine_learning))

### 3. Linear model in feature space and normal vector β

We work in a feature space defined by a (possibly nonlinear) feature map $\phi(x)$:

- Features: $\phi_1(x), \dots, \phi_{M-1}(x)$.  
- Linear model:  
  $$
  y(x) = \beta_0 + \beta_1 \phi_1(x) + \cdots + \beta_{M-1} \phi_{M-1}(x).
  $$

For convenience:

- Put the coefficients (except bias) into a vector  
  $\beta = [\beta_1, \dots, \beta_{M-1}]^\top$.  
- Put the features into a vector  
  $\phi(x) = [\phi_1(x), \dots, \phi_{M-1}(x)]^\top$.

Then:

$$
y(x) = \beta_0 + \beta^\top \phi(x).
$$

The **decision boundary** is the set of points where $y(x) = 0$, i.e.,

$$
\beta_0 + \beta^\top \phi(x) = 0.
$$
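
As a small sketch of this model, assume a scalar input and the hypothetical feature map $\phi(x) = (x, x^2)$. The coefficients below are chosen by hand for illustration and are not fitted to any data.

```python
def phi(x):
    """Hypothetical feature map for a scalar input: phi(x) = (x, x^2)."""
    return (x, x * x)

def y(x, beta, beta0):
    """Linear model in feature space: y(x) = beta0 + beta^T phi(x)."""
    return beta0 + sum(b * f for b, f in zip(beta, phi(x)))

def classify(x, beta, beta0):
    """Predict +1 or -1 from the sign of y(x); y(x) = 0 is treated as +1 here by convention."""
    return 1 if y(x, beta, beta0) >= 0 else -1

beta = (0.0, 1.0)   # illustrative coefficients, not fitted
beta0 = -4.0

outputs = [y(x, beta, beta0) for x in (1.0, 2.0, 3.0)]
preds = [classify(x, beta, beta0) for x in (1.0, 3.0)]
print(outputs, preds)
```

With these values the decision boundary $y(x) = 0$ sits at $x = \pm 2$: the model is linear in $\phi(x)$ but the boundary is nonlinear in the original input.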

Geometric fact:

- The vector $\beta$ is **perpendicular (normal)** to this decision boundary in feature space. [pages.hmc](https://pages.hmc.edu/ruye/MachineLearning/lectures/ch9/node6.html)
- This is why rewriting the model in terms of $\beta$ and $\phi(x)$ (with bias separated) is useful: the size and direction of $\beta$ directly control the orientation and margin.

A simple 2D illustration: for a line that crosses the $x_1$ and $x_2$ axes (at 1 and 2), the coefficient vector $\beta$ of its equation points perpendicular to the line.
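
The perpendicularity can be checked numerically. This sketch assumes the line crosses the $x_1$ axis at 1 and the $x_2$ axis at 2, so its equation is $2x_1 + x_2 - 2 = 0$ with $\beta = (2, 1)$ and $\beta_0 = -2$; that reading of the intercepts is an assumption.

```python
# Line through the intercepts (1, 0) and (0, 2): x1/1 + x2/2 = 1,
# i.e. 2*x1 + x2 - 2 = 0, so beta = (2, 1) and beta0 = -2.
beta = (2.0, 1.0)
beta0 = -2.0

p = (1.0, 0.0)   # intercept on the x1 axis
q = (0.0, 2.0)   # intercept on the x2 axis

def y(point):
    """Model output beta0 + beta^T x for a 2D point."""
    return beta0 + sum(b * x for b, x in zip(beta, point))

# Both intercepts lie on the boundary, so the output is zero at each.
on_boundary = (y(p), y(q))

# A direction along the boundary, and its dot product with beta.
direction = tuple(qi - pi for pi, qi in zip(p, q))
dot = sum(b * d for b, d in zip(beta, direction))
print(on_boundary, direction, dot)
```

The dot product is zero, which is what "$\beta$ is normal to the boundary" means: moving along the boundary does not change $y(x)$.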

### 4. 1D intuition: constraints and margin

To build intuition, consider a 1D feature $\phi_1(x)$ and two classes:

- Encode the positive (green) class as $+1$.  
- Encode the negative (orange) class as $-1$.  

The model is:

$$
y(x) = \beta_1 \phi_1(x) + \beta_0.
$$

- On a simple 1D plot, this is a straight line.
- The **decision boundary** is where $y(x) = 0$.  
- The **margin** is the horizontal distance from this decision point to the closest data point.

We want to find the line that **maximizes that distance**.

#### 4.1 Hard constraints for separable data

To **exclude bad solutions** that come too close to points, we introduce inequalities:

- For a green point $x_i$ with label $y_i = +1$:  
  $$
  \beta_1 \phi_1(x_i) + \beta_0 \ge 1.
  $$
- For an orange point $x_i$ with label $y_i = -1$:  
  $$
  \beta_1 \phi_1(x_i) + \beta_0 \le -1.
  $$

These say:

- Positives lie **above** the decision boundary by at least 1 in model output.
- Negatives lie **below** by at least 1.

You can combine both into one condition using the label $y_i$:

$$
y_i \big( \beta_1 \phi_1(x_i) + \beta_0 \big) \ge 1 \quad \forall i.
$$

- If $y_i = +1$ (green), this reduces to $\beta_1 \phi_1(x_i) + \beta_0 \ge 1$.
- If $y_i = -1$ (orange), the condition reads $-(\beta_1 \phi_1(x_i) + \beta_0) \ge 1$; multiplying both sides by $-1$ reverses the inequality and gives $\beta_1 \phi_1(x_i) + \beta_0 \le -1$.

These are the **hard‑margin constraints** in 1D.
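
A minimal sketch of checking the combined constraint, assuming the raw input is used as the single feature ($\phi_1(x) = x$). The data and both candidate lines are invented for illustration.

```python
def satisfies_hard_margin(beta1, beta0, xs, ys):
    """True if y_i * (beta1 * x_i + beta0) >= 1 holds for every training point."""
    return all(y * (beta1 * x + beta0) >= 1 for x, y in zip(xs, ys))

# Toy 1D data: orange (-1) on the left, green (+1) on the right.
xs = [1.0, 2.0, 4.0, 5.0]
ys = [-1, -1, +1, +1]

# Both candidates put the decision point at x = 3, but with different slopes.
steep_ok = satisfies_hard_margin(2.0, -6.0, xs, ys)
shallow_ok = satisfies_hard_margin(0.5, -1.5, xs, ys)
print(steep_ok, shallow_ok)
```

The second line separates the classes correctly but is too shallow: at $x = 2$ its output is only $-0.5$, so it violates the requirement of reaching $-1$.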

#### 4.2 Maximizing margin via slope

For any line that satisfies these constraints, the output changes by 1 over a horizontal distance of $1/|\beta_1|$, so the **margin** in 1D is inversely proportional to $|\beta_1|$:

- A **steeper** slope (large $|\beta_1|$) means a smaller margin.
- A **flatter** slope (small $|\beta_1|$) means a larger margin, while still respecting the constraints.

So the optimization becomes:

$$
\text{Minimize } |\beta_1|
\quad \text{subject to } y_i (\beta_1 \phi_1(x_i) + \beta_0) \ge 1 \quad \forall i.
$$

If positives are on the other side (green left, orange right), the sign of $\beta_1$ flips, but minimizing $|\beta_1|$ still means “find the **shallowest** separating line,” regardless of whether it slopes up or down.
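
In 1D the problem can be solved by hand: only the closest green point $x_+$ and the closest orange point $x_-$ can be binding, and subtracting their two constraints gives $\beta_1 (x_+ - x_-) \ge 2$. The sketch below uses that closed form, assuming $\phi_1(x) = x$ and that all positives lie to the right of all negatives. The function name and data are illustrative.

```python
def fit_1d_hard_margin(xs, ys):
    """Closed-form 1D hard-margin fit, assuming positives lie to the right of negatives."""
    x_pos = min(x for x, y in zip(xs, ys) if y == +1)   # closest positive point
    x_neg = max(x for x, y in zip(xs, ys) if y == -1)   # closest negative point
    if x_pos <= x_neg:
        raise ValueError("classes are not separable with positives on the right")
    beta1 = 2.0 / (x_pos - x_neg)                # smallest slope meeting both constraints
    beta0 = -(x_pos + x_neg) / (x_pos - x_neg)   # makes the two closest points hit +1 and -1
    return beta1, beta0

xs = [1.0, 2.0, 4.0, 5.0]
ys = [-1, -1, +1, +1]

beta1, beta0 = fit_1d_hard_margin(xs, ys)
boundary = -beta0 / beta1      # where y(x) = 0
margin = 1.0 / abs(beta1)      # distance from the boundary to the closest point
print(beta1, beta0, boundary, margin)
```

For this data the decision point lands midway between the closest pair (at $x = 3$) with margin 1, matching the intuition that the shallowest admissible line gives the widest margin.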

### 5. Full multidimensional formulation

In higher dimensions, $\beta$ is a vector and the margin is inversely proportional to $\|\beta\|$. [cs.cornell](https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote09.html)
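
A short sketch of where the $1/\|\beta\|$ comes from, using the standard point-to-hyperplane distance formula in the notation above. The perpendicular distance from a correctly classified point $\phi(x_i)$ to the hyperplane is

$$
d_i = \frac{|\beta^\top \phi(x_i) + \beta_0|}{\|\beta\|} = \frac{y_i \big(\beta^\top \phi(x_i) + \beta_0\big)}{\|\beta\|}.
$$

If the model is scaled so that the closest points satisfy $y_i (\beta^\top \phi(x_i) + \beta_0) = 1$ (the multidimensional analogue of the 1D constraints), then

$$
\text{margin} = \min_i d_i = \frac{1}{\|\beta\|},
\qquad
\text{width of the margin region} = \frac{2}{\|\beta\|}.
$$

So making the margin as large as possible is the same as making $\|\beta\|$ as small as possible.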

The general **hard‑margin maximum margin classifier** problem is:

$$
\text{Minimize } \|\beta\|^2
\quad\text{subject to } y_i (\beta^\top \phi(x_i) + \beta_0) \ge 1 \quad \forall i,
$$

where:

- $\phi(x_i)$ is the feature vector for point $x_i$.
- $y_i \in \{+1, -1\}$ is the class label.
- $\beta$ is the normal vector to the decision hyperplane.
- $\beta_0$ is the bias.

We minimize $\|\beta\|^2$ instead of $\|\beta\|$ because:

- They have the same minimizer (square is monotonic on nonnegative numbers).
- $\|\beta\|^2$ is differentiable everywhere (including at $\beta = 0$) and turns the problem into a standard quadratic program, which is easier to solve numerically.

This is a **convex optimization** problem (quadratic objective, linear constraints). That matters because any local optimum of a convex problem is also a global optimum, such problems can be solved efficiently, and here the strictly convex objective makes the optimal $\beta$ unique. [cs.cornell](https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote09.html)

### 6. Support vectors: only a few points matter

When we solve this optimization problem:

- Only some training points lie exactly on the margin boundaries (where $y_i (\beta^\top \phi(x_i) + \beta_0) = 1$).

- These points are the **support vectors**. [alan-turing-institute.github](https://alan-turing-institute.github.io/Intro-to-transparent-ML-course/08-glm-svm/support-vec-classifier.html)

- All other points satisfy the inequality strictly with some margin to spare.

Crucial property:

- The final decision boundary **depends only on the support vectors**. 

- Moving non‑support points slightly won’t change the boundary as long as they stay outside the margin region.
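
The sketch below illustrates both points on toy 2D data, using raw coordinates as features. The values $\beta = (0.5, 0.5)$, $\beta_0 = -2$ are the maximum-margin solution for this particular data (the margin $1/\|\beta\| = \sqrt{2}$ is half the distance between the closest opposite-class points). The helper names are hypothetical.

```python
def functional_margins(beta, beta0, points, labels):
    """y_i * (beta^T x_i + beta0) for every training point."""
    return [y * (sum(b * x for b, x in zip(beta, p)) + beta0)
            for p, y in zip(points, labels)]

def support_vectors(beta, beta0, points, labels, tol=1e-9):
    """Points whose constraint holds with equality (functional margin equal to 1)."""
    scores = functional_margins(beta, beta0, points, labels)
    return [p for p, s in zip(points, scores) if abs(s - 1.0) <= tol]

points = [(3.0, 3.0), (4.0, 3.0), (1.0, 1.0), (0.0, 1.0)]
labels = [+1, +1, -1, -1]
beta, beta0 = (0.5, 0.5), -2.0

sv_before = support_vectors(beta, beta0, points, labels)

# Move a non-support point, (4, 3), further away from the boundary.
moved = [(3.0, 3.0), (6.0, 5.0), (1.0, 1.0), (0.0, 1.0)]
still_feasible = all(s >= 1.0 for s in functional_margins(beta, beta0, moved, labels))
sv_after = support_vectors(beta, beta0, moved, labels)
print(sv_before, still_feasible, sv_after)
```

The same $(\beta, \beta_0)$ still satisfies every constraint after the move, with the same two points on the margin boundaries, so the boundary does not change.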

Contrast with:

- **Linear regression** and **logistic regression**: the solution typically depends on **all** training points simultaneously. [alan-turing-institute.github](https://alan-turing-institute.github.io/Intro-to-transparent-ML-course/08-glm-svm/support-vec-classifier.html)

- SVM / maximum margin classifier: the solution is **sparse** (determined by the support vectors alone) and unaffected by small movements of the other points, which is part of why it pairs well with the **kernel trick**. [cs.cornell](https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote09.html)

### 7. Non‑separable case: soft margin, slack variables, and C

The above derivation assumed the classes are perfectly separable by some hyperplane (no overlap).  
In practice, data are often noisy or overlapping, so strict constraints

$$
y_i (\beta^\top \phi(x_i) + \beta_0) \ge 1 \quad \forall i
$$

might be impossible to satisfy. [pub.aimind](https://pub.aimind.so/soft-margin-svm-exploring-slack-variables-the-c-parameter-and-flexibility-1555f4834ecc)

#### 7.1 Slack variables

To handle overlapping or noisy data, we introduce **slack variables** $\delta_i \ge 0$:

- They allow some constraints to be “relaxed”: each point $x_i$ gets its own allowance $\delta_i$ by which it may fall short of its margin boundary.
- The constraint becomes:

$$
y_i (\beta^\top \phi(x_i) + \beta_0) \ge 1 - \delta_i, \quad \delta_i \ge 0 \quad \forall i.
$$

Interpretation:

- $\delta_i = 0$: point is on or outside the correct margin boundary.
- $0 < \delta_i < 1$: point is inside the margin but still correctly classified.
- $\delta_i = 1$: point lies exactly on the decision boundary.
- $\delta_i > 1$: point is misclassified (on the wrong side of the decision boundary). [geeksforgeeks](https://www.geeksforgeeks.org/machine-learning/using-a-hard-margin-vs-soft-margin-in-svm/)
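
For a fixed $(\beta, \beta_0)$, the smallest slack that satisfies a point's constraint is $\delta_i = \max\big(0,\ 1 - y_i(\beta^\top \phi(x_i) + \beta_0)\big)$. A minimal sketch of that computation, assuming a 1D model with $\phi_1(x) = x$ and hand-picked illustrative values:

```python
def slack(beta, beta0, point, label):
    """Smallest delta >= 0 with label * (beta^T x + beta0) >= 1 - delta."""
    score = label * (sum(b * x for b, x in zip(beta, point)) + beta0)
    return max(0.0, 1.0 - score)

def describe(delta):
    """Map a slack value to the cases in the interpretation list."""
    if delta == 0.0:
        return "on or outside the margin"
    if delta < 1.0:
        return "inside the margin, correctly classified"
    if delta == 1.0:
        return "on the decision boundary"
    return "misclassified"

beta, beta0 = (1.0,), -3.0   # illustrative 1D model with decision point at x = 3

# Three points that are all labelled +1.
deltas = [slack(beta, beta0, (x,), +1) for x in (5.0, 3.5, 2.5)]
cases = [describe(d) for d in deltas]
print(deltas, cases)
```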

In the rod analogy from the transcript:

- You imagine rigid “rods” at margin level $+1$ for positive and $-1$ for negative.
- When data overlap, some rods must be shifted (by $\delta_i$) to allow **any** separating line to pass through.

#### 7.2 Soft‑margin objective with C

We now need to balance:

1. Having a **large margin** (small $\|\beta\|^2$), and  
2. Allowing **few and small violations** (small $\sum \delta_i$).

The **soft‑margin SVM** objective becomes: [people.eecs.berkeley](https://people.eecs.berkeley.edu/~jrs/189s20/lec/04.pdf)

$$
\text{Minimize } \|\beta\|^2 + C \sum_{i=1}^N \delta_i
$$

subject to:

$$
y_i (\beta^\top \phi(x_i) + \beta_0) \ge 1 - \delta_i, \quad \delta_i \ge 0 \quad \forall i.
$$

Here:

- $C > 0$ is a **hyperparameter** controlling the trade‑off between margin size and constraint violations. [people.eecs.berkeley](https://people.eecs.berkeley.edu/~jrs/189s20/lec/04.pdf)
- Large $C$:
  - Strongly penalizes violations.
  - Favors low training error but can reduce margin (more complex boundary; higher risk of overfitting). [geeksforgeeks](https://www.geeksforgeeks.org/machine-learning/using-a-hard-margin-vs-soft-margin-in-svm/)
- Small $C$:
  - Allows more violations.
  - Favors a larger margin and more regularization (simpler boundary; can underfit).  

This soft‑margin formulation is still **convex**, just with more variables ($\beta, \beta_0, \delta_i$), so it remains tractable. [people.eecs.berkeley](https://people.eecs.berkeley.edu/~jrs/189s20/lec/04.pdf)
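
To see the trade-off in numbers, this sketch evaluates the soft-margin objective for two hand-picked 1D candidates on toy data containing one outlier, assuming $\phi_1(x) = x$. The candidates are illustrative and are not claimed to be the optimizer's solution for either value of $C$.

```python
def soft_margin_objective(beta1, beta0, xs, ys, C):
    """||beta||^2 + C * sum of slacks, using the smallest feasible slack per point."""
    slacks = [max(0.0, 1.0 - y * (beta1 * x + beta0)) for x, y in zip(xs, ys)]
    return beta1 ** 2 + C * sum(slacks)

# Toy 1D data: the last point is a green (+1) outlier close to the orange class.
xs = [1.0, 2.0, 4.0, 5.0, 2.5]
ys = [-1, -1, +1, +1, +1]

wide = (1.0, -3.0)     # wide margin, misclassifies the outlier (slack 1.5)
narrow = (4.0, -9.0)   # narrow margin, satisfies every constraint (no slack)

for C in (1.0, 100.0):
    j_wide = soft_margin_objective(*wide, xs, ys, C)
    j_narrow = soft_margin_objective(*narrow, xs, ys, C)
    print(C, j_wide, j_narrow)
```

With $C = 1$ the wide-margin candidate has the lower objective (2.5 versus 16), while with $C = 100$ the ordering flips (151 versus 16): a large $C$ makes violations so expensive that the narrow margin is preferred.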

### 8. Relation to SVMs and the kernel trick

The **maximum margin classifier** described here is the geometric and optimization foundation of **support vector machines (SVMs)**: [en.wikipedia](https://en.wikipedia.org/wiki/Margin_(machine_learning))

- Hard‑margin SVM = maximum margin classifier when data is perfectly separable.
- Soft‑margin SVM = maximum margin classifier with slack variables and $C$ when data overlaps.

Because:

- Only **support vectors** matter, and  
- The solution has a dual form that uses only inner products $\phi(x_i)^\top \phi(x_j)$,  

we can apply the **kernel trick**:

- Replace $\phi(x_i)^\top \phi(x_j)$ with a **kernel function** $k(x_i, x_j)$ (e.g., polynomial, RBF).  
- This yields nonlinear decision boundaries in the original input space while still solving a convex optimization problem, now expressed in terms of the dual coefficients $\alpha_i$ (one per training point, nonzero only for support vectors) and kernel evaluations. [pages.hmc](https://pages.hmc.edu/ruye/MachineLearning/lectures/ch9/node6.html)
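
As a small check of the replacement step, the sketch below compares an explicit feature-space inner product with the corresponding kernel value. It uses the homogeneous degree-2 polynomial kernel $k(x, z) = (x^\top z)^2$ on 2D inputs, whose explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The input vectors are arbitrary.

```python
import math

def phi(x):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """k(x, z) = (x^T z)^2, computed without building phi(x) or phi(z)."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x = (1.0, 2.0)
z = (3.0, 4.0)

explicit = sum(a * b for a, b in zip(phi(x), phi(z)))   # phi(x)^T phi(z)
via_kernel = poly_kernel(x, z)
print(explicit, via_kernel)
```

Both routes give 121 (up to floating-point rounding), but the kernel never constructs the feature vectors, which is what makes very high-dimensional feature spaces practical.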