# Support Vector Machine

# [Math to visualize the decision boundary](https://www.youtube.com/watch?v=QKc3Tr7U4Xc&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=72)
Given the two half-spaces $w x - b \le -1$ and $w x - b \ge 1$, $w$ is the normal vector of the two parallel hyperplanes that bound them. The inner product is $wx = P_{xw}||w||$, where $P_{xw}$ is the signed length of the projection of $x$ onto $w$ and $||w||$ is the length of $w$ (in general, $u^Tv = P_{vu}||u||$). We can therefore rewrite the half-spaces as $P_{xw}||w|| - b \le -1$ and $P_{xw}||w|| - b \ge 1$. This lets us visualize where any given sample $x_i$ sits relative to the hyperplane through its projection length $P_{x_iw}$.
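The projection identity can be checked numerically. This is a minimal sketch in plain Python; the vectors `w`, `x` and the offset `b` are made-up values for illustration, and the helper names are not from any library.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length ||u||."""
    return math.sqrt(dot(u, u))

def projection_length(x, w):
    """Signed length P_xw of the projection of x onto the direction of w."""
    return dot(w, x) / norm(w)

w = [3.0, 4.0]  # hypothetical normal vector, ||w|| = 5
b = 1.0         # hypothetical offset
x = [2.0, 1.0]  # hypothetical sample

p = projection_length(x, w)        # P_xw
score = p * norm(w) - b            # P_xw * ||w|| - b, equal to w.x - b
signed_distance = score / norm(w)  # signed distance from x to the plane w.x - b = 0
```

Dividing the score by $||w||$ turns it into a geometric distance, which is why a small $||w||$ corresponds to a wide margin.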

If the decision boundary does not maximize the margin, the projection lengths $P_{x_iw}$ of the samples nearest the boundary are small, so we need a large $||w||$ to satisfy the $\le -1$ and $\ge 1$ conditions. In contrast, if the margin is maximized, i.e., the decision boundary separates the two classes well, then $||w||$ can be small.


Hard margin:

For linearly separable training data, select two parallel hyperplanes that separate the two classes with the largest possible distance between them. From [Wikipedia](https://en.wikipedia.org/wiki/Support_vector_machine), the two half-spaces that contain the samples of the two classes $y=-1$ and $y=1$ are $w x - b \le -1$ and $w x - b \ge 1$, respectively. Multiplying both sides of each inequality by its label $y$ (which flips the first inequality, since $y=-1$) gives the single constraint $y(wx-b) \ge 1$. The distance between the two hyperplanes is $\frac{2}{||w||}$, which we want to maximize, i.e., minimize $||w||$. The optimization problem becomes:

> minimize $||w||$ subject to $y(wx_i-b) \ge 1$ for all samples $i$.

The loss is just $||w||$, which is inversely proportional to the hyperplane distance $\frac{2}{||w||}$, so SGD updates $(w,b)$ along the negative gradient of the loss while keeping the constraints satisfied.
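To make the hard-margin problem concrete, the sketch below checks the constraint $y(wx_i-b) \ge 1$ and computes the margin width $\frac{2}{||w||}$ for a few hand-picked candidates. The toy data and the candidate $(w,b)$ pairs are assumptions chosen for illustration, not the output of a solver.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_feasible(w, b, X, y):
    """True if every sample satisfies y * (w.x - b) >= 1."""
    return all(yi * (dot(w, xi) - b) >= 1 for xi, yi in zip(X, y))

def margin_width(w):
    """Distance 2 / ||w|| between the planes w.x - b = -1 and w.x - b = 1."""
    return 2 / math.sqrt(dot(w, w))

# Hypothetical, linearly separable toy data.
X = [[1.0, 0.0], [2.0, 1.0], [4.0, 0.0], [5.0, 1.0]]
y = [-1, -1, 1, 1]

# Hypothetical candidates: name -> (w, b).
candidates = {
    "tight": ([1.0, 0.0], 3.0),   # feasible, smallest ||w|| of the three
    "scaled": ([2.0, 0.0], 6.0),  # same boundary x1 = 3, but larger ||w||
    "tilted": ([1.0, 1.0], 3.5),  # violates the constraint for sample [2, 1]
}

results = {
    name: (is_feasible(w, b, X, y), margin_width(w))
    for name, (w, b) in candidates.items()
}
```

Both `tight` and `scaled` describe the same boundary, but only the smaller $||w||$ gives the wider margin, which is what minimizing $||w||$ selects.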

Soft margin:

For linearly inseparable data, the optimization becomes:


> minimize $\left[\frac{1}{n}\sum_{i=1}^n \max(0, 1-y(wx_i-b))\right] + \lambda ||w||^2$.

The first term, the "hinge loss", is obtained by moving $y(wx_i-b)$ to the right-hand side of the inequality $y(wx_i-b) \ge 1$, which gives $1-y(wx_i-b) \le 0$. When a sample lies on the correct side of its margin, $1-y(wx_i-b) \le 0$ and so $\max(0, 1-y(wx_i-b)) = 0$. When a sample is misclassified or falls inside the margin, $1-y(wx_i-b) \gt 0$ and the hinge loss is positive. SGD finds the $(w,b)$ that reduces the positive hinge loss, while the $\lambda ||w||^2$ term maximizes the margin just like the hard margin does. Substituting the optimized $(w,b)$ into $wx-b=0$ then gives the best hyperplane.
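The soft-margin objective can be minimized with a few lines of plain Python. This is a minimal sketch under stated assumptions: it uses full-batch subgradient descent instead of SGD so the run is reproducible, the toy data and hyperparameters (`lam`, `lr`, `epochs`) are made up, and it keeps the best iterate because subgradient steps do not decrease the objective monotonically.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(w, b, X, y, lam):
    """Mean hinge loss plus lam * ||w||^2."""
    hinge = sum(max(0.0, 1.0 - yi * (dot(w, xi) - b)) for xi, yi in zip(X, y))
    return hinge / len(X) + lam * dot(w, w)

def train_soft_margin(X, y, lam=0.01, lr=0.01, epochs=5000):
    """Full-batch subgradient descent; returns (best objective, w, b)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    best = (objective(w, b, X, y, lam), list(w), b)
    for _ in range(epochs):
        # Gradient of the regularizer lam * ||w||^2.
        grad_w = [2.0 * lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Only samples with positive hinge loss contribute.
            if yi * (dot(w, xi) - b) < 1.0:
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b += yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
        value = objective(w, b, X, y, lam)
        if value < best[0]:
            best = (value, list(w), b)
    return best

# Hypothetical toy data with labels in {-1, +1}.
X = [[1.0, 0.0], [2.0, 1.0], [4.0, 0.0], [5.0, 1.0]]
y = [-1, -1, 1, 1]

best_value, w, b = train_soft_margin(X, y)
predictions = [1 if dot(w, xi) - b >= 0 else -1 for xi in X]
```

The sign of $wx-b$ is the predicted class, so `predictions` can be compared directly against `y`.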


# Where did that $\frac{2}{||w||}$ come from?
Ever wonder where that distance measure between the two hyperplanes, $\frac{2}{||w||}$, in [Wikipedia](https://en.wikipedia.org/wiki/Support_vector_machine) comes from? Here's a quick and dirty proof. 


Take an arbitrary point $\hat{x}$ on the hyperplane $w x - b = -1$. Moving a distance $d$ along the unit normal vector $\frac{w}{||w||}$ takes us to the point $\hat{x} + d\frac{w}{||w||}$ on the other plane $w x - b = 1$. Substituting this point into the equation of the second plane:

> $w(\hat{x} + d \frac{w}{||w||}) - b = 1$ 

> $(w\hat{x}-b) + d\frac{w \cdot w}{||w||} = 1$

> $-1 + d\frac{||w||^2}{||w||} = 1$

> $d||w|| = 2$

> $d = \frac{2}{||w||}$
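The derivation can be sanity-checked with numbers. In this small sketch, `w`, `b`, and `x_hat` are made-up values chosen so that `x_hat` lies on the plane $wx - b = -1$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

w = [3.0, 4.0]       # hypothetical normal vector, ||w|| = 5
b = 2.0              # hypothetical offset
x_hat = [1.0, -0.5]  # chosen so that w.x_hat - b = -1

w_norm = math.sqrt(dot(w, w))
d = 2 / w_norm  # claimed distance between the two planes

# Step a distance d from x_hat along the unit normal w / ||w||.
x_other = [xi + d * wi / w_norm for xi, wi in zip(x_hat, w)]

start_value = dot(w, x_hat) - b  # value of w.x - b at the starting point
end_value = dot(w, x_other) - b  # value of w.x - b after the step
```

Up to floating-point rounding, `start_value` is $-1$ and `end_value` is $1$, so a step of length $\frac{2}{||w||}$ along the normal lands exactly on the other plane.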

# RBF = kernel trick = similarity measure
The radial basis function (RBF) kernel is just a Gaussian function applied to the distance between two feature points to see how close they are: $K(x,x') = \exp(-\gamma||x-x'||^2)$. If they are very close, i.e., $||x-x'||$ is near zero, the Gaussian is at its peak, i.e., the two points are very "similar". The variance $\sigma^2$ of the Gaussian controls the width of the peak. The RBF parameter is $\gamma = \frac{1}{2\sigma^2}$, so a smaller $\gamma$ lets more distant points still count as similar. A larger $\gamma$ is stricter about similarity, i.e., the Gaussian peak drops off faster with distance.
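A short sketch of the effect of $\gamma$, assuming the common form $K(x,x') = \exp(-\gamma||x-x'||^2)$. The function name `rbf_kernel` and the sample points are illustrative, not taken from any library.

```python
import math

def rbf_kernel(x, x_prime, gamma):
    """Similarity exp(-gamma * ||x - x'||^2), a value in (0, 1]."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)

x = [0.0, 0.0]
x_far = [1.0, 1.0]  # squared distance to x is 2

same_point = rbf_kernel(x, x, gamma=1.0)         # identical points: maximum similarity
small_gamma = rbf_kernel(x, x_far, gamma=0.1)    # wide peak: still fairly similar
large_gamma = rbf_kernel(x, x_far, gamma=10.0)   # narrow peak: similarity near zero
```

With the same pair of points, only $\gamma$ changes how similar they are judged to be.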

# Reference
* [SVM on Wiki](https://en.wikipedia.org/wiki/Support_vector_machine)
* [where did that $\frac{2}{||w||}$ come from?](https://math.stackexchange.com/questions/1305925/why-does-the-svm-margin-is-frac2-mathbfw)