# Non-linear SVM

## Introduction

- For the final part of non-linear classifiers, we will see how we can turn SVMs into non-linear classifiers.
- Heavily depends on a field of research known as **kernel methods**.
- A field of its own with lots of use cases throughout machine learning.

## Remembering SVMs

- **Recall:**  From $L(\mathbf{w}, w_0)$: $$\mathbf{w} = \sum_{i \in SV} \lambda_i y_i \mathbf{x}_i$$

- **Dual:**  $$ \max_{\lambda \geq 0} \sum_{i=1}^N \lambda_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$

- Subject to: $$ \sum_i \lambda_i y_i = 0$$
- Note: $\langle \mathbf{x}_i, \mathbf{x} \rangle$ denotes the inner product

---

- **In testing:** $$ g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0 = \sum_{i \in SV} \lambda_i y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + w_0 $$

### Example with explicit mapping

- Explicitly map $\mathbf{x}$, then use linear SVM.

- Let $$ \mathbf{z} = \begin{bmatrix} x_1^2 \\ \sqrt{2} x_1 x_2 \\ x_2^2 \end{bmatrix} $$ where $$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

---

- **Need:** $$\mathbf{z}_i^T \mathbf{z}_j = x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{i2} x_{j1} x_{j2} + x_{i2}^2 x_{j2}^2 = \left(\mathbf{x}_i^T \mathbf{x}_j\right)^2$$
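The inner product in the mapped space equals $\left(\mathbf{x}_i^T \mathbf{x}_j\right)^2$, which can be checked numerically. The snippet below is a minimal sketch; the helper name `explicit_map` and the random test points are assumptions made for illustration.

```python
import numpy as np


def explicit_map(x):
    """Map a 2-D input x = [x1, x2] to z = [x1^2, sqrt(2) x1 x2, x2^2]."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])


# Two arbitrary 2-D points (hypothetical test data).
rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=2), rng.normal(size=2)

lhs = explicit_map(x_i) @ explicit_map(x_j)  # inner product in the mapped space
rhs = (x_i @ x_j) ** 2                       # same quantity from the original inputs

print(np.isclose(lhs, rhs))
```

The right-hand side never forms $\mathbf{z}$, which is the point of the kernel trick.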


### Example with explicit mapping - training and testing

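- **Training:** $$ \max_{\lambda \geq 0} \sum_{i=1}^N \lambda_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j \mathbf{z}_i^T \mathbf{z}_j $$

- Equivalently, with $K(\mathbf{x}_i,\mathbf{x}_j) = \mathbf{z}_i^T \mathbf{z}_j$: $$ \max_{\lambda \geq 0} \sum_{i=1}^N \lambda_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j K(\mathbf{x}_i,\mathbf{x}_j) $$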

- **Testing:**  $$ g(\mathbf{z}) = \sum_{\mathbf{z}_i \in SV} \lambda_i y_i \mathbf{z}_i^T \mathbf{z} + w_0$$

- $$ g(\mathbf{x}) = \sum_{\mathbf{x}_i \in SV} \lambda_i y_i K(\mathbf{x}_i, \mathbf{x}) + w_0$$

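- For an explicit mapping like this one, the inner product $\mathbf{z}_i^T \mathbf{z}_j$ can always be written as a kernel $K(\mathbf{x}_i, \mathbf{x}_j)$ evaluated on the original inputs!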
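Training on the mapped vectors and training on the kernel matrix should give the same classifier. The sketch below compares the two using scikit-learn's `SVC`; the dataset, `C=1.0`, and the variable names are assumptions chosen for illustration, not part of the lecture.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=0)

# Explicit mapping z = [x1^2, sqrt(2) x1 x2, x2^2] applied to every row of X.
Z = np.column_stack([X[:, 0] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1] ** 2])

gram_explicit = Z @ Z.T       # all inner products z_i^T z_j
gram_kernel = (X @ X.T) ** 2  # K(x_i, x_j) = (x_i^T x_j)^2, no mapping needed

# Linear SVM in the mapped space vs. an SVM that only sees the kernel matrix.
svm_explicit = SVC(kernel="linear", C=1.0).fit(Z, y)
svm_kernel = SVC(kernel="precomputed", C=1.0).fit(gram_kernel, y)

# Fraction of training points on which the two classifiers agree.
agreement = np.mean(svm_explicit.predict(Z) == svm_kernel.predict(gram_kernel))
print(np.allclose(gram_explicit, gram_kernel), agreement)
```

Both models solve the same dual problem because the optimizer only ever uses the matrix of inner products.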


## Mercer's theorem

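- If $K$ is symmetric and positive semi-definite, there exists a mapping $\phi$ such that $$K(\mathbf{x}_i,\mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$$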

- [Nice open access article on kernel methods for those who want to learn more.](https://arxiv.org/pdf/math/0701907)
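Informally, Mercer's condition requires the kernel matrix on any finite set of points to be symmetric and positive semi-definite. The sketch below checks this numerically for a Gaussian kernel on random points; the sample size, `sigma`, and the tolerance are assumptions for illustration, and a numerical check on one sample is evidence rather than a proof.

```python
import numpy as np

# Random 2-D points (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

# Pairwise squared Euclidean distances, shape (30, 30).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

sigma = 1.0
K = np.exp(-sq_dists / (2.0 * sigma**2))  # Gaussian (RBF) kernel matrix

# Eigenvalues of a symmetric matrix; all should be >= 0 up to round-off.
eigvals = np.linalg.eigvalsh(K)
print(np.allclose(K, K.T), eigvals.min() > -1e-8)
```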

## Kernels

- **Polynomials:**  $$ K(\mathbf{x}_i, \mathbf{x}_j) = \left(\mathbf{x}_i^T \mathbf{x}_j+1\right)^q \text{, where } q>0$$

- **RBF:**  $$ K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{1}{2\sigma^2} \|\mathbf{x}_i - \mathbf{x}_j\|^2\right)$$

- **Tanh:**  $$ K(\mathbf{x}_i, \mathbf{x}_j) = \tanh\left(\beta\, \mathbf{x}_i^T \mathbf{x}_j + \gamma\right)$$
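The three kernels translate directly into NumPy. This is a minimal sketch for single pairs of vectors; the function names and default parameter values are assumptions, while `q`, `sigma`, `beta`, and `gamma` follow the symbols in the formulas above.

```python
import numpy as np


def polynomial_kernel(x_i, x_j, q=2):
    """(x_i^T x_j + 1)^q with q > 0."""
    return (x_i @ x_j + 1.0) ** q


def rbf_kernel(x_i, x_j, sigma=1.0):
    """exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x_i - x_j) ** 2) / (2.0 * sigma**2))


def tanh_kernel(x_i, x_j, beta=1.0, gamma=0.0):
    """tanh(beta * x_i^T x_j + gamma)."""
    return np.tanh(beta * (x_i @ x_j) + gamma)


x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, 4.0])

print(polynomial_kernel(x_i, x_j, q=2))  # (11 + 1)^2 = 144
print(rbf_kernel(x_i, x_j, sigma=1.0))   # exp(-8 / 2) = exp(-4)
print(tanh_kernel(x_i, x_j, beta=0.1))   # tanh(1.1)
```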

## Non-linear SVM

- Choose $K(\mathbf{x}_i, \mathbf{x}_j)$

- **Training:**  $$\max_{\lambda \geq 0} \sum_{i=1}^N \lambda_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$

- **Test:**  $$g(\mathbf{x}) = \sum_{i \in SV} \lambda_i y_i K(\mathbf{x}_i, \mathbf{x}) +w_0$$
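The test-time expression can be reproduced from a fitted scikit-learn `SVC`, whose `dual_coef_` attribute stores the products $\lambda_i y_i$ for the support vectors and whose `intercept_` plays the role of $w_0$. The sketch below assumes a Gaussian kernel with `sigma = 0.5` and `C = 1.0`; the helper `rbf` is hypothetical, and scikit-learn parameterizes the width as `gamma = 1 / (2 sigma^2)`.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

sigma = 0.5
gamma = 1.0 / (2.0 * sigma**2)  # scikit-learn's RBF is exp(-gamma * ||x - x'||^2)
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)


def rbf(A, B, sigma):
    """Gaussian kernel between every row of A and every row of B."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))


# g(x) = sum over support vectors of lambda_i y_i K(x_i, x) + w_0
K = rbf(X, clf.support_vectors_, sigma)  # shape (n_samples, n_support_vectors)
g_manual = K @ clf.dual_coef_.ravel() + clf.intercept_[0]

print(np.allclose(g_manual, clf.decision_function(X)))
```

Only the support vectors enter the sum, so prediction cost scales with their number rather than with the size of the training set.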


### Non-linear SVM as a network



### Non-separable classes

- Remember: $$ \mathbf{w}^T \mathbf{x} + w_0  \geq 1-\gamma$$
- Both classes: $$ y_i \left(\mathbf{w}^T \mathbf{x}_i + w_0\right)  \geq 1-\gamma_i$$


### Non-separable classes - in practice

- Don't want too many $\gamma_i > 0$
- Minimize: $$J(\mathbf{w}, w_0, \gamma_i) = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^N \gamma_i$$
- Subject to: $$ y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) \geq 1 - \gamma_i$$

$$
\gamma_i \geq 0
$$

---

- **Dual:** $$ \max_{\lambda \geq 0} \sum_{i=1}^N \lambda_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle $$

- Subject to: $$ \sum_i \lambda_i y_i = 0$$

$$
0 \leq \lambda_i \leq C
$$
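The two dual constraints can be inspected on a fitted model. The sketch below assumes scikit-learn's `SVC`, where `dual_coef_` holds $\lambda_i y_i$ for the support vectors; the dataset, `C = 0.5`, and the tolerances are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Noisy two-class data that a linear boundary cannot separate perfectly.
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

C = 0.5
clf = SVC(kernel="linear", C=C).fit(X, y)

signed = clf.dual_coef_.ravel()  # lambda_i * y_i for each support vector
lambdas = np.abs(signed)         # lambda_i, since y_i is +1 or -1

box_ok = lambdas.max() <= C + 1e-8      # 0 <= lambda_i <= C
balance_ok = abs(signed.sum()) < 1e-6   # sum_i lambda_i y_i = 0
n_at_bound = int(np.sum(np.isclose(lambdas, C)))  # support vectors with lambda_i = C

print(box_ok, balance_ok, n_at_bound)
```

Support vectors with $\lambda_i = C$ are the points that lie inside the margin or on the wrong side of the boundary.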

## Practical considerations for non-linear SVMs

- Start simple -> linear kernel.
    - Only two hyperparameters to consider: the slack penalty $C$ and the tolerance of the stopping criterion.
- Then -> non-linear kernel. An RBF kernel is the standard choice.
    - Added complexity: the kernel width $\sigma$.
- Use validation data to select hyperparameters.
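Selecting hyperparameters on validation data can be sketched as a small grid search over the slack penalty and the kernel width. The candidate grids, the 70/30 split, and the variable names below are assumptions for illustration; scikit-learn's `SVC` takes the width as `gamma = 1 / (2 sigma^2)`.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# Hold out part of the data for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

best_acc, best_C, best_sigma = -1.0, None, None
for C in [0.1, 1.0, 10.0, 100.0]:
    for sigma in [0.1, 0.3, 1.0, 3.0]:
        clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma**2))
        clf.fit(X_train, y_train)
        acc = clf.score(X_val, y_val)  # accuracy on the validation split
        if acc > best_acc:
            best_acc, best_C, best_sigma = acc, C, sigma

print(best_C, best_sigma, best_acc)
```

A separate test set, untouched during this search, is still needed for an unbiased estimate of final performance.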

## Programming exercises

Below are programming exercises associated with this lecture. The code cells are starting points that load the data and prepare the problem so that you can get going with the implementation. There are also theoretical exercises, but due to copyright we cannot share them here. They will be made available in a private repository connected to the course.


### Non-linear classification with an SVM

The code below loads a classic synthetic machine learning dataset, the Two Moons dataset, that we have looked at before. Train an SVM with a non-linear kernel to tackle this non-linearly separable classification task.

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

plt.figure(1)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)
plt.show()

### Wine classification with SVM

This problem revisits the Wine classification problem from the linear SVM notebook. Now, turn the SVM into a non-linear classifier through a Gaussian kernel. How does this affect performance?


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

# fetch dataset
wine_data = load_wine()

# data (NumPy arrays); keep only the first two features
X = wine_data.data[:, :2]
feature_1_name = 'alcohol'
feature_2_name = 'malic acid'
y = wine_data.target
y_names = np.unique(y)  # class names
colors = ['black', 'blue', 'red']

plt.figure(1, figsize=(5, 5))
for class_value, color in zip(y_names, colors):
    plt.scatter(X[y == class_value, 0], X[y == class_value, 1], s=120, facecolors='none',
                edgecolors=color, linewidth=3.0, label=f'Class {class_value+1}')
plt.xlabel(feature_1_name)
plt.ylabel(feature_2_name)
plt.legend()
plt.show()
