Status: ✅ Done

## Exercise 15

---

First, let me start by saying that `SVM` is certainly not the easiest topic to grasp, especially within a short span of time. In this exercise, we focus on **hard margin** SVM, which assumes that the training data are linearly separable in the given feature space. In the next session, the focus will be on **soft margin** SVM, which simply drops the assumption of perfect separability. Either way, I can promise you that if you get hard margin SVM, then the soft margin version is fairly simple to understand.

This exercise therefore has two **main learning objectives**:
- theory behind hard margin SVM: translation of an abstract idea (maximize the margin) to the actual mathematical model (primal and dual form)
- implementation of hard margin SVM: translation of the mathematical model to code

This notebook is therefore split into two sections, one for each of these goals.

### Theory behind hard margin SVM

---

In this section, we go over the core concepts of `hard-margin SVM`. For a detailed explanation, you can refer to my [note on SVM](https://ludekcizinsky.notion.site/Support-Vector-Machines-2db1ab88770641bc9815a7a9db45f72c), where I explain all the nitty-gritty details. Alternatively, you can watch this great [MIT lecture](https://www.youtube.com/watch?v=_PwhiWxHK8o&t=35s) on the subject.

> Recap: Hyperplanes

We first refresh the concept of hyperplanes; for this I prepared a short [Geogebra note](https://www.geogebra.org/m/emrwvr3m).

> New concept: high level idea behind SVM

Now we are ready to jump into the high-level idea of hard margin SVM. I prepared a short [Geogebra note](https://www.geogebra.org/m/xyxnhqpy) on the subject, which should help you understand the big picture of SVM. In my [notes](https://ludekcizinsky.notion.site/Support-Vector-Machines-2db1ab88770641bc9815a7a9db45f72c), I explain how we translate this abstract concept into a mathematical model.

> Model analysis: what are we optimising in hard margin SVM

As explained in the high-level idea, our objective is to maximize the **width** of the margin subject to the correct classification of all training samples. Mathematically speaking (with $\text{width} = \frac{2}{||w||}$):

$
\underset{\mathrm w, b_0} {\arg\max} \frac{2}{||\mathrm w||}
\text{ subject to  } y_i(\mathrm{w} \cdot \mathrm{x}_i + b_0) -1 \ge 0 \text{ for every training sample } i
$

However, the function $f(w) = \frac{2}{||w||}$ is awkward to optimize directly. Maximizing $\frac{2}{||w||}$ is the same as minimizing $||w||$, which in turn has the same minimizer as $\frac{1}{2}||w||^2$, a smooth convex function with a simple gradient. We therefore solve the equivalent minimization problem:

$
\underset{\mathrm w, b_0} {\arg\min} \frac{1}{2} \lVert  \mathrm w \rVert^2
\text{ subject to  } y_i(\mathrm{w} \cdot \mathrm{x}_i + b_0) -1 \ge 0 \text{ for every training sample } i
$
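
To make the objective and the constraint concrete, here is a minimal sketch that checks the constraint and computes the margin width for two candidate hyperplanes. The toy data, the candidate parameters and the helper names (`is_feasible`, `margin_width`) are made up for illustration and are not taken from the exercise.

```python
import numpy as np

# Hypothetical toy data: two linearly separable classes in 2D.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])


def is_feasible(w, b0, X, y, tol=1e-12):
    """True if every sample satisfies y_i (w . x_i + b0) - 1 >= 0."""
    return bool(np.all(y * (X @ w + b0) - 1 >= -tol))


def margin_width(w):
    """Width of the margin, 2 / ||w||."""
    return 2.0 / np.linalg.norm(w)


# Two hypothetical candidates with the same direction but different norms.
w_a, b_a = np.array([0.5, 0.5]), -1.0
w_b, b_b = np.array([1.0, 1.0]), -2.0

for name, (w, b0) in {"A": (w_a, b_a), "B": (w_b, b_b)}.items():
    print(name, is_feasible(w, b0, X, y), margin_width(w), 0.5 * w @ w)
```

Both candidates satisfy the constraint, but candidate A has the smaller $\frac{1}{2}||w||^2$ and therefore the wider margin, which is exactly what the minimization is after.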

In words, we are trying to find the model parameters $w, b_0$ such that the above function is minimized subject to the given constraint. Just to be crystal clear, for instance in the case of 2 features, $x = (x_1, x_2)$, $w = (b_1, b_2)$ and our model looks as follows:

$
h(x) = b_1x_1 + b_2x_2 + b_0
$

and then, depending on the value of $h(x)$, we classify the given input $x$. If $h(x) > 0$, then we predict one class, otherwise the other.
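
As a small illustration of this decision rule, the sketch below evaluates $h(x)$ for two features and maps its sign to a class. The parameter values and the `+1`/`-1` label encoding are assumptions chosen for the example.

```python
import numpy as np


def h(x, w, b0):
    """Decision value h(x) = w . x + b0."""
    return float(np.dot(w, x) + b0)


def classify(x, w, b0):
    """Return +1 if h(x) > 0, otherwise -1 (assumed label encoding)."""
    return 1 if h(x, w, b0) > 0 else -1


# Hypothetical parameters for two features: w = (b_1, b_2) and intercept b_0.
w = np.array([0.5, 0.5])
b0 = -1.0

print(classify(np.array([3.0, 3.0]), w, b0))  # h = 2.0  -> 1
print(classify(np.array([0.0, 0.0]), w, b0))  # h = -1.0 -> -1
```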


Very importantly, we must first normalize the features, otherwise our model might overestimate the importance of a given feature. For instance, if we have one feature with range 0 - 100 and another with range 0 - 1, then a change of 1 in the **context of the first feature** (a tiny change) means something very different than in the context of the second feature (a huge change, going from min to max). In this case, our model would overestimate the importance of making a tiny step in the first feature.
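
The text does not say which normalization to use. One common choice is z-score standardization, sketched below under that assumption; the function names and the toy matrix are hypothetical. The statistics are computed on the training data only and then reused for any new data.

```python
import numpy as np


def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation on the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant features unscaled
    return mean, std


def standardize(X, mean, std):
    """Rescale each feature to zero mean and unit variance."""
    return (X - mean) / std


# Hypothetical data: first feature on a 0-100 scale, second on a 0-1 scale.
X_train = np.array([[10.0, 0.1], [50.0, 0.5], [90.0, 0.9]])
mean, std = fit_standardizer(X_train)
X_scaled = standardize(X_train, mean, std)
print(X_scaled)
```

After scaling, both columns carry the same values here, so a step of a given size means the same thing in either feature.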

> Section summary

This section tried to give you a high-level idea of how hard margin SVM works. You should be able to explain:

- what is the core idea behind hard margin SVM as well as its assumption about linear separability
- what we are optimizing and why we need the constraint

If you also read my detailed note, then you should also be able to explain how we actually derive the formula which we are trying to optimize.

### Implementation of hard margin SVM

---

In this section, we will focus on implementing the hard margin SVM from scratch. I know, you have to understand math and programming at the same time, but if you manage to go through this, I believe you will have a solid understanding of SVM.

We will break the implementation into several steps. Let me first give you a small overview. In the course, you learned to rewrite the hard-margin SVM objective function as its **dual representation** (using Lagrange multipliers), and maximize it:

$
\arg\max_{\alpha}  \tilde{L}(\alpha) = \sum_i^n \alpha_i - \frac{1}{2} \sum_i^n \sum_j^n \alpha_i \alpha_j y_i y_j (\mathbf{x_i} \cdot \mathbf{x_j})
$

subject to $\alpha_i \geq 0$ for all $i$ and $\sum_i^n \alpha_i y_i = 0$. We know that any machine learning problem can essentially be broken down into two stages:

- **training**: we find the optimal parameters $\alpha$ using the above function $\tilde{L}$
- **prediction**: we use the optimal parameters $\alpha$ to make predictions on the given data

In training, we need to:
- vectorize the loss function
- use `scipy's` optimization API to train our model

In prediction, we need to:
- implement the predict function
- make predictions on the given data

I will follow these steps in the subsequent sections.

> training: loss function vectorisation

We can vectorize the original objective function $\tilde{L}$ as:

$
\tilde{L}(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} A_{i,j}, \quad
\text{where $A_{i,j}$ is the value at the $i$th row and $j$th column of $\mathbf{A}$.}
$

But what is $\mathbf{A}$? We define it as:

$
\mathbf{A}=(\boldsymbol{\alpha}^{\top} \boldsymbol{\alpha}) \circ (\mathbf{y} \mathbf{y}^{\top}) \circ \boldsymbol{K}
$

where:
- $K$ is an $N \times N$ matrix obtained from the dot product between every pair of data points $\mathbf{x}$, that is $K_{i, j}=\mathbf{x}_{\mathbf{i}} \cdot \mathbf{x}_{\mathbf{j}}$
- $\boldsymbol{\alpha}$ is a $1 \times N$ matrix in this equation, so $\boldsymbol{\alpha}^{\top} \boldsymbol{\alpha}$ is an $N \times N$-matrix
- $\mathbf{y}$ is a $N \times 1$ matrix in this equation, so $\mathbf{y y}^{\top}$ is an $N \times N$-matrix
- The notation $\circ$ in the equation stands for the element-wise product (Hadamard product).
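
To see that the vectorized form matches the double sum, here is a minimal sketch that computes $\tilde{L}(\alpha)$ both ways on random data. The function names and the random inputs are illustrative assumptions and are not the code from the linked commits; `np.outer` plays the role of $\boldsymbol{\alpha}^{\top}\boldsymbol{\alpha}$ and $\mathbf{y}\mathbf{y}^{\top}$.

```python
import numpy as np


def dual_objective_loops(alpha, X, y):
    """Dual objective written exactly as the double sum."""
    n = len(alpha)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += alpha[i] * alpha[j] * y[i] * y[j] * np.dot(X[i], X[j])
    return alpha.sum() - 0.5 * total


def dual_objective_vectorized(alpha, X, y):
    """Same objective using A = (alpha^T alpha) o (y y^T) o K."""
    K = X @ X.T  # K[i, j] = x_i . x_j
    A = np.outer(alpha, alpha) * np.outer(y, y) * K  # element-wise products
    return alpha.sum() - 0.5 * A.sum()


# Hypothetical inputs, only used to compare the two versions.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
alpha = rng.uniform(size=5)

print(dual_objective_loops(alpha, X, y))
print(dual_objective_vectorized(alpha, X, y))
```

The two printed values should agree up to floating point error; the vectorized version avoids the Python-level double loop.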

Let's get these done:
- First things first, we define the **interface of the model** and computation of the matrix $K$. You can see both of these in this [commit](https://github.com/ludekcizinsky/nano-learn/commit/f269d1ded6bc121230957c9d101942f4cdecbb9a?diff=unified).

- Next, we vectorize the function using the above formula for $A$; see this [commit](https://github.com/ludekcizinsky/nano-learn/commit/0d22458b7d9df71a3f48636bc13cb76ec94a16c4?diff=unified#diff-b014f61961234340adc4a28a71457d9b4060bfb6263ef98deac297caac9ea503R22).

> training: use scipy to find parameters $\alpha$

To solve this quadratic program with scipy's general-purpose [optimize.minimize(..)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html) function, we need to pass several items to it. Some of them are:
* a loss function: since `minimize` minimizes, we pass the negated dual objective $-\tilde{L}$
* the **Jacobian of the loss function** (the vector of partial derivatives of the loss function with respect to each element of $\alpha$).
* **constraints**: the inequalities, their Jacobian, the equalities, their Jacobian
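
For reference, here is a short derivation sketch of these derivatives, obtained from the dual objective by using the symmetry of $K$ (worth double-checking against the linked commit). The partial derivative with respect to a single $\alpha_k$ is:

$
\frac{\partial \tilde{L}}{\partial \alpha_k} = 1 - y_k \sum_{j} \alpha_j y_j (\mathbf{x_k} \cdot \mathbf{x_j})
$

The equality constraint $\sum_i \alpha_i y_i = 0$ is linear in $\alpha$, so its Jacobian is simply $\mathbf{y}^{\top}$, and the inequality constraint $\alpha_i \geq 0$ has the identity matrix as its Jacobian.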

Once we have these, we can train the model; see the [commit](https://github.com/ludekcizinsky/nano-learn/commit/9f88cbc9d25eb374fb84ff4686e228182bc95847?diff=split).
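
The training step can be sketched as follows. This is a minimal example under stated assumptions, not the implementation from the commit: the toy data is made up, the `SLSQP` method is my choice (it accepts both equality and inequality constraints), and the negated dual is passed because `minimize` minimizes.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: two linearly separable classes in 2D.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

K = X @ X.T              # K[i, j] = x_i . x_j
Q = np.outer(y, y) * K   # Q[i, j] = y_i y_j (x_i . x_j)


def neg_dual(alpha):
    """Negated dual objective: 0.5 * alpha^T Q alpha - sum(alpha)."""
    return 0.5 * alpha @ Q @ alpha - alpha.sum()


def neg_dual_grad(alpha):
    """Gradient of the negated dual objective with respect to alpha."""
    return Q @ alpha - np.ones_like(alpha)


constraints = [
    # Equality: sum_i alpha_i y_i = 0, with Jacobian y.
    {"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y},
    # Inequality: alpha_i >= 0, with the identity matrix as Jacobian.
    {"type": "ineq", "fun": lambda a: a, "jac": lambda a: np.eye(len(a))},
]

result = minimize(
    neg_dual,
    x0=np.zeros(len(y)),
    jac=neg_dual_grad,
    constraints=constraints,
    method="SLSQP",
    options={"ftol": 1e-9, "maxiter": 200},
)
alpha = result.x
w = (alpha * y) @ X      # w = sum_i alpha_i y_i x_i
print(alpha, w)
```

For this toy data the optimum can be worked out by hand: only $(2, 2)$ and $(0, 0)$ lie on the margin, so the solver should return a $w$ close to $(0.5, 0.5)$ with the other two $\alpha_i$ close to zero.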

> predict: implement and test the predict function

Now that we trained the model and have the $\alpha$, we can use it to predict (classify) data points using:

$
h(x) = \sum_i^n \alpha_i y_i( x_i \cdot x )+ b 
$

where $b$ is:

$
b = \frac{1}{N_S}\sum_{i \in S}\Big(y_i - \sum_{j \in S} \alpha_j y_j (x_i \cdot x_j)\Big), \quad \text{where $S$ is the set of support vectors and $N_S$ is their number}
$

Note that $x_i$ are our training samples and $x$ is the input for which we are predicting. Clearly, to make the prediction, we first need to compute $b$, which is essentially a function of $\alpha$ and our training samples. So technically, finding $b$ should be part of our training process; see this [commit](https://github.com/ludekcizinsky/nano-learn/commit/1cbd554614f8ba849ba5d02dda7cbac76e19bdfc). Finally, let's implement the prediction function and test whether our implementation is working; here is the corresponding [commit](https://github.com/ludekcizinsky/nano-learn/commit/5167926837b08cf2f5c19620d6651f653c7124fb). You should get the following result:

![](figures/svmclf.png)
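
The two formulas above, for $b$ and for $h(x)$, can be sketched in a few lines of NumPy. This is an illustrative version under assumptions, not the code from the commits: the toy data is made up, the $\alpha$ values are the hand-derived dual solution for that toy data (hard-coded so the block stands alone), and the threshold `tol` for picking support vectors is an arbitrary choice.

```python
import numpy as np

# Hypothetical toy data and its hand-derived dual solution.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])


def compute_bias(alpha, X, y, tol=1e-6):
    """Average of y_i - sum_j alpha_j y_j (x_i . x_j) over support vectors."""
    S = alpha > tol                  # support vectors have alpha_i > 0
    K_SS = X[S] @ X[S].T             # dot products between support vectors
    return float(np.mean(y[S] - K_SS @ (alpha[S] * y[S])))


def predict(X_new, alpha, X, y, b):
    """Classify each row of X_new by the sign of h(x)."""
    scores = (X_new @ X.T) @ (alpha * y) + b  # h(x) = sum_i alpha_i y_i (x_i . x) + b
    return np.where(scores > 0, 1.0, -1.0)


b = compute_bias(alpha, X, y)
print(b)                             # -1.0
print(predict(X, alpha, X, y, b))    # reproduces the training labels
```

Only the support vectors contribute to $b$ and to $h(x)$, since every other $\alpha_i$ is zero.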

> Section summary

In this section, our focus was on implementing SVM from scratch using only numpy and scipy. This was certainly challenging for many reasons, but I believe that if you manage to finish it, it will give you a really in-depth understanding of hard margin SVM. Finally, you can find the final implementation [here](https://github.com/ludekcizinsky/nano-learn/blob/main/nnlearn/svm/_svm.py).

---
