# Non-linear learning algorithms

## Support Vector Machines (SVM)

SVMs are kernel methods: they require only a user-specified kernel function $K(x_i, x_j)$, i.e., a **similarity function** over pairs of data points $(x_i, x_j)$. The kernel implicitly maps the data into a kernel (dual) space in which learning algorithms operate linearly, i.e., every operation on points is a linear combination of $K(x_i, x_j)$.

Outline of the algorithm:

1. Map point $x$ into kernel space using a kernel function: $x \rightarrow K(x, .)$. This step is theoretical.

![Map point $x$ into kernel space using a kernel function](images/svm_kernel_mapping.png)

2. Learning algorithms operate linearly through dot products in the high-dimensional kernel space: $K(., x_i) \cdot K(., x_j)$. Using the kernel trick (Mercer’s Theorem), this operation is replaced by a function that is easy to compute, such that $K(., x_i) \cdot K(., x_j) = K(x_i, x_j)$. This step is theoretical.

3. This similarity measure is computed for each pair of points and stored in an $N \times N$ Gram matrix. The **decision function** to be learned to classify a new point $x$ is a linear combination of kernel functions evaluated on the training samples:

$$
f(x) = \text{sign} \left(\sum_{i=1}^N \alpha_i~y_i~K(x_i, x)\right).
$$

So the learning process consists of estimating the $\alpha_i$ that minimise the hinge loss with a penalty.

4. Prediction of a new point $x$ is done by applying the decision function.
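
The outline above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not a trained SVM: the homogeneous quadratic kernel `quad_kernel`, its explicit feature map `phi`, the toy data, and the coefficients `alpha` are all hypothetical choices made for the example. In particular, `alpha` is fixed by hand instead of being estimated from the penalised hinge loss. The sketch checks that the dot product in the explicit feature space equals the kernel value (the kernel trick), builds the Gram matrix, and applies the decision function to new points.

```python
import math


def quad_kernel(x, z):
    """Homogeneous quadratic kernel K(x, z) = (x . z)^2 (illustrative choice)."""
    return sum(a * b for a, b in zip(x, z)) ** 2


def phi(x):
    """Explicit feature map matching quad_kernel for 2-D inputs."""
    x1, x2 = x
    return (x1 ** 2, math.sqrt(2) * x1 * x2, x2 ** 2)


def gram_matrix(X, kernel):
    """N x N matrix of pairwise similarities K(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in X] for xi in X]


def decision_function(x, X_train, y_train, alpha, kernel):
    """sign(sum_i alpha_i * y_i * K(x_i, x)); a zero score is mapped to +1."""
    score = sum(a * y * kernel(xi, x)
                for a, y, xi in zip(alpha, y_train, X_train))
    return 1 if score >= 0 else -1


# Toy data: class +1 lies on the horizontal axis, class -1 on the vertical
# axis, so no straight line through the input space separates them.
X_train = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
y_train = [1, 1, -1, -1]
alpha = [1.0, 1.0, 1.0, 1.0]  # assumed values, NOT learned from data

# Kernel trick: the dot product of explicit features equals the kernel value.
x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
trick_holds = math.isclose(explicit, quad_kernel(x, z))

# Gram matrix of the training set, then prediction of two new points.
G = gram_matrix(X_train, quad_kernel)
pred_a = decision_function((1.5, 0.2), X_train, y_train, alpha, quad_kernel)
pred_b = decision_function((0.3, -1.0), X_train, y_train, alpha, quad_kernel)
```

With these hand-picked coefficients the score reduces to $8x_1^2 - 8x_2^2$, so the boundary is non-linear in the input space even though the decision function is linear in the kernel evaluations.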


### Gaussian kernel (RBF, Radial Basis Function)

For a pair of points $x_i, x_j$ the RBF kernel is defined as:
$$
    K(x_i, x_j) = \exp\left(-\frac{||x_i - x_j||^2}{2\sigma^2}\right)
$$
where $\sigma$ is the kernel width parameter. Basically, we consider a Gaussian function centered on each training sample $x_i$. It has a ready interpretation as a similarity measure, as it decreases with the squared Euclidean distance between the two feature vectors.
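
As a minimal sketch of the formula above, the following plain-Python function evaluates the RBF kernel for one pair of points. The function name `rbf_kernel`, the default `sigma=1.0`, and the sample points are assumptions made for this illustration. It shows that the similarity is 1 for identical points, decays as the points move apart, and decays more slowly when $\sigma$ is larger.

```python
import math


def rbf_kernel(x_i, x_j, sigma=1.0):
    """Gaussian kernel exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


x_i = (0.0, 0.0)
near = (0.5, 0.0)   # squared distance 0.25 from x_i
far = (3.0, 4.0)    # squared distance 25 from x_i

k_same = rbf_kernel(x_i, x_i)                 # identical points
k_near = rbf_kernel(x_i, near)                # close points: high similarity
k_far = rbf_kernel(x_i, far)                  # distant points: near zero
k_far_wide = rbf_kernel(x_i, far, sigma=5.0)  # wider kernel: slower decay
```

Libraries often parameterise the same kernel with $\gamma = 1/(2\sigma^2)$ instead of $\sigma$, in which case a larger $\gamma$ corresponds to a narrower Gaussian.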

## Tree-based algorithms