# Classification
<hr>

## Support Vector Machines (SVM)
_Also known as Large Margin Classifiers_ 

**Theoretical Documentation**<br>

Find the maximum-margin hyperplane in a $d$-dimensional space that **divides** a set of points labeled $y_i = 1$ or $y_i = -1$, such that the distance between the hyperplane and the nearest point $x_i$ from either class is **maximized**.

<img alt="Support Vector Machine" src="assets/math-e04fb447.png" width="400" >

### The simple math behind it
Find a decision boundary (hyperplane) that is parallel to the two margin hyperplanes passing through the support vectors and lies halfway between them. More formally, it is defined by the following equation:<br><br>
$w^T X - b = 0$

where
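- $X$ is an observation in $d$ dimensions, i.e. one row $X_j$ of the $n \times d$ data matrix
- $w$ is the normal vector of the hyperplane, i.e. it is orthogonal to the hyperplane and points from it toward the support vectors of the $y = 1$ class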
- $b$ is the bias that translates the hyperplane away from the origin (i.e. the intercept)

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/72/SVM_margin.png/600px-SVM_margin.png" width="300">


Since the distance between the two margin hyperplanes $w^T X - b = 1$ and $w^T X - b = -1$ is equal to $\frac{2}{\lVert w \rVert}$, minimizing $\lVert w \rVert$ will **maximize** the margin between the classes.
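As a rough numeric check of the $\frac{2}{\lVert w \rVert}$ relationship, the sketch below picks one point on each margin hyperplane and measures the gap between them. The values of `w` and `b` are hypothetical and chosen only for illustration; `signed_distance` is a helper name introduced here, not something from the original notes.

```python
import numpy as np

# Hypothetical hyperplane parameters for w^T x - b = 0 in 2-D (illustration only)
w = np.array([3.0, 4.0])
b = 5.0

def signed_distance(x, w, b):
    """Signed distance from point x to the hyperplane w^T x - b = 0."""
    return (w @ x - b) / np.linalg.norm(w)

# Claimed width of the margin
margin_width = 2.0 / np.linalg.norm(w)

# One point on each margin hyperplane, taken along the direction of w
x_plus = (b + 1.0) * w / (w @ w)    # satisfies w^T x - b = +1
x_minus = (b - 1.0) * w / (w @ w)   # satisfies w^T x - b = -1

# The gap between the two margin hyperplanes, measured along w
gap = signed_distance(x_plus, w, b) - signed_distance(x_minus, w, b)
```

Halving `w` and `b` in this sketch leaves the decision boundary unchanged but doubles `margin_width`, which is the reason a smaller $\lVert w \rVert$ corresponds to a larger margin.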


### Evaluating margin classification error
_If a given observation is on the wrong side of the plane, then what is the error associated with it?_ <br><br>
**Error of a given observation**, *j*:<br>

$error_j = \max(0, 1 - y_j(w^T X_{j} - b))$<br>

**Total error**: $\sum_{j=1}^{n} error_j$

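where $y_j$ is the response variable (1 or -1). An observation has zero error when $y_j(w^T X_{j} - b) \geq 1$, i.e. when it lies on the correct side of its margin, and this condition is the same for both responses.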
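The per-observation error can be sketched in a few lines of NumPy. The data, `w`, and `b` below are made up for illustration, and `hinge_errors` is a hypothetical helper name; the three rows cover the three cases: outside the margin, inside the margin, and on the wrong side of the hyperplane.

```python
import numpy as np

def hinge_errors(X, y, w, b):
    """Per-observation error max(0, 1 - y_j (w^T X_j - b))."""
    scores = X @ w - b
    return np.maximum(0.0, 1.0 - y * scores)

# Toy observations, all with response y = +1 (illustration only)
X = np.array([
    [3.0, 0.0],   # correct side, outside the margin -> error 0
    [1.5, 0.0],   # correct side, but inside the margin -> error 0.5
    [0.0, 0.0],   # wrong side of the hyperplane -> error 2
])
y = np.array([1.0, 1.0, 1.0])

# Hypothetical hyperplane w^T x - b = 0
w = np.array([1.0, 0.0])
b = 1.0

errors = hinge_errors(X, y, w, b)
total_error = errors.sum()
```

Note that the second observation is classified correctly yet still contributes a non-zero error, because it falls inside the margin.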

### Cost function of a `soft-margin` SVM
Contains both margin classification error and margin from hyperplane.

$\displaystyle\arg \min_{\substack{w, b}} \sum_{j=1}^{n} error_j + \lambda \lVert w \rVert ^2$

where $\lambda$ determines the trade-off between increasing the margin size and ensuring that the observations lie on the correct side of the margin. For small values of $\lambda$, the second term becomes negligible and keeping the error small outweighs keeping the margin large. For large values of $\lambda$, a large margin is favored even at the cost of more margin errors.
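One way to minimize this cost is plain subgradient descent. The sketch below is an assumption-laden illustration rather than how any particular SVM library solves the problem: the function names, the constant step size `lr`, the iteration count, and the toy data are all chosen here for demonstration.

```python
import numpy as np

def soft_margin_cost(X, y, w, b, lam):
    """Sum of per-observation errors plus lam * ||w||^2."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w - b))
    return hinge.sum() + lam * (w @ w)

def fit_subgradient(X, y, lam=0.01, lr=0.01, n_iter=2000):
    """Minimize the soft-margin cost with constant-step subgradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        # Observations with non-zero error (inside the margin or misclassified)
        active = y * (X @ w - b) < 1.0
        # Subgradient of the error term is -y_j X_j w.r.t. w and +y_j w.r.t. b
        grad_w = -(X[active].T @ y[active]) + 2.0 * lam * w
        grad_b = y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Linearly separable toy data (illustration only)
X = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w_fit, b_fit = fit_subgradient(X, y)
predictions = np.sign(X @ w_fit - b_fit)
cost_start = soft_margin_cost(X, y, np.zeros(2), 0.0, lam=0.01)
cost_end = soft_margin_cost(X, y, w_fit, b_fit, lam=0.01)
```

With a constant step size the iterates hover near the minimum rather than settling exactly on it; a decaying step size or a dedicated quadratic-programming solver would be the usual choice in practice.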

### What exactly is a support vector?
> "If we connect the dots on the outside of the points (the convex hull), support vectors are the points parallel to the shape in a given direction"

<img alt="math-1436320c.png" src="assets/math-1436320c.png" width="500">
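In terms of the equations above, the support vectors of a separable problem are the observations that sit exactly on a margin hyperplane, i.e. those with $y_j(w^T X_j - b) = 1$. The sketch below assumes a toy dataset and a hand-picked `w` and `b` that happen to be the maximum-margin solution for it; none of these values come from the original notes.

```python
import numpy as np

# Toy data: two observations per class (illustration only)
X = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Hand-picked maximum-margin hyperplane for this data: x_1 = 1
w = np.array([1.0, 0.0])
b = 1.0

# y_j (w^T X_j - b) equals 1 exactly on the margin, and exceeds 1 beyond it
functional_margin = y * (X @ w - b)
is_support = np.isclose(functional_margin, 1.0)
support_vectors = X[is_support]
```

Here the two points closest to the boundary, one from each class, are flagged; removing either of the other two observations would leave the hyperplane unchanged.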

### Practical tips
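- It is possible to bias the decision boundary towards one response (reflecting its importance) by choosing $b$ accordingly, such that $b = w_1 (b_0 - 1) + w_2 (b_0 + 1)$, where $b_0$ is the intercept of the unbiased boundary and $w_1 + w_2 = 1$ are mixing weights (not the components of $w$)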
- Add a penalty multiplier $m_j$ when computing $error_j$ to emphasize the importance of margin errors for a given response
    - $\displaystyle\arg \min_{\substack{w, b}} \sum_{j=1}^{n} m_j \cdot error_j + \lambda \lVert w \rVert ^2$
    - where $m_j > 1$ for more-costly errors and $m_j < 1$ for less-costly errors
- Scaling / standardization is necessary prior to running SVM, because the margin is distance-based and features with large ranges would otherwise dominate it
- Near-zero $w_i$ coefficients suggest that the i-th dimension does not contribute to the classification
- The same formulation applies to multi-variate problems, i.e. with any number of features
- Kernel methods allow non-linear classifiers (the decision boundary doesn't always have to be a straight line or a flat hyperplane)
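Two of the tips above, standardizing the features and weighting errors by response, can be sketched together. The helper names `standardize` and `weighted_cost`, the toy data, and the particular weights (2 for the $y = 1$ response, 0.5 for $y = -1$) are assumptions made for illustration only.

```python
import numpy as np

def standardize(X):
    """Rescale each feature (column) to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def weighted_cost(X, y, w, b, lam, m):
    """Soft-margin cost where observation j's error is multiplied by m_j."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w - b))
    return (m * hinge).sum() + lam * (w @ w)

# Toy data whose second feature is on a much larger scale than the first
X = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0], [4.0, 4000.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

X_std = standardize(X)

# Errors on the y = +1 response are treated as more costly than on y = -1
m = np.where(y == 1.0, 2.0, 0.5)

# Hypothetical hyperplane parameters, used only to evaluate the cost
w = np.array([1.0, 0.0])
b = 0.0

cost_weighted = weighted_cost(X_std, y, w, b, lam=0.1, m=m)
cost_unweighted = weighted_cost(X_std, y, w, b, lam=0.1, m=np.ones_like(y))
```

In this example one observation from each response falls inside the margin with the same raw error, so the weighted cost differs from the unweighted one only through the multipliers $m_j$.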

```
Want to go deeper? For further reading, try http://pyml.sourceforge.net/doc/howto.pdf
```

# Basic code
A `minimal, reproducible example`

```python
# Some code here

import pandas as pd
print(pd.__version__)
```

```
0.25.1
```
