# Introduction: Why Kernels?

* What if the input $x$ is not a binary or continuous value?
    * Binary trees, text: how do we use $wx + b$?
    * Kernels let us adapt such inputs so that linear (or similar) models can still be used

* Regression using Kernels:
    * In ridge regression, $ w = (\lambda I + X^T X)^{-1} X^T y $

    * The prediction for a new example $x^*$ can be written as $ y^{*} = ((\lambda I + X^T X)^{-1} X^T y)^T x^{*} $

    * Rearranging using Matrix Inversion Lemma: 
    $ y^{*} = y^T (\lambda I + X X^T)^{-1} Xx^{*} $

    * Why is this rearrangement helpful, and what is the advantage of the $X X^T$ and $X x^*$ terms?
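The equivalence of the two prediction formulas can be checked numerically. The following is a minimal sketch assuming NumPy and randomly generated data; the names `X`, `y`, `x_star`, and `lam` are illustrative and not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3                      # n examples, d features (arbitrary sizes)
X = rng.normal(size=(n, d))      # design matrix, one example per row
y = rng.normal(size=n)           # targets
x_star = rng.normal(size=d)      # new example to predict
lam = 0.1                        # ridge penalty lambda

# Primal form: w = (lam*I + X^T X)^{-1} X^T y, then y* = w^T x*
w = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
y_primal = w @ x_star

# Dual form: y* = y^T (lam*I + X X^T)^{-1} X x*
y_dual = y @ np.linalg.solve(lam * np.eye(n) + X @ X.T, X @ x_star)
```

Both routes give the same number, but the dual form touches the data only through `X @ X.T` and `X @ x_star`, that is, only through inner products between examples.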


* Assume $X$ is an $n \times d$ matrix, so $ X X^T $ has size $n \times n$.

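* Each element of $ X X^T $ is an inner product between two rows of $X$ (shown here for a $3 \times 3$ example)
$$
 X X^T = 
\begin{bmatrix} 
x_{11} & x_{12} & x_{13} \\
x_{21} & x_{22} & x_{23} \\
x_{31} & x_{32} & x_{33} \\
\end{bmatrix}
\begin{bmatrix} 
x_{11} & x_{21} & x_{31} \\
x_{12} & x_{22} & x_{32} \\
x_{13} & x_{23} & x_{33} \\
\end{bmatrix}
$$

$$
 X X^T = 
\begin{bmatrix} 
\langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & \langle x_1, x_3 \rangle \\
\langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & \langle x_2, x_3 \rangle \\
\langle x_3, x_1 \rangle & \langle x_3, x_2 \rangle & \langle x_3, x_3 \rangle \\
\end{bmatrix}
$$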
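A quick way to convince yourself of this is to build the matrix both ways. This is a small sketch assuming NumPy, with made-up numbers for the rows $x_1, x_2, x_3$.

```python
import numpy as np

# Made-up 3x3 data; rows are x_1, x_2, x_3
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

G = X @ X.T   # matrix product X X^T

# Same matrix built entry by entry from pairwise inner products <x_i, x_j>
G_pairwise = np.array([[np.dot(xi, xj) for xj in X] for xi in X])
```

Here `G[0, 1]` is $\langle x_1, x_2 \rangle = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32$, and the matrix is symmetric because the inner product is.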

* **Kernel Trick**

* Replace the dot product with a function $k(x_i, x_j)$

* Replace $X X^T$ with $K$, where $K[i][j] = k(x_i, x_j)$

* $K$ is the matrix of all pairwise values of $k$. $K$ is the Gram matrix, $k$ is the kernel function

* This changes the regression prediction to:
$ y^{*} = y^T (\lambda I + K)^{-1} k(X, x^{*}) $, where $k(X, x^{*})$ is the vector with entries $k(x_i, x^{*})$
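The kernelized prediction can be sketched in a few lines. This assumes NumPy; the helper names `linear_kernel`, `gram_matrix`, and `kernel_ridge_predict` are hypothetical and chosen only for illustration, and the data is random.

```python
import numpy as np

def linear_kernel(a, b):
    """k(a, b) = <a, b>, the plain dot product."""
    return float(np.dot(a, b))

def gram_matrix(X, kernel):
    """K[i][j] = kernel(x_i, x_j) for all pairs of training rows."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def kernel_ridge_predict(X, y, x_star, kernel, lam):
    """y* = y^T (lam*I + K)^{-1} k(X, x*)."""
    K = gram_matrix(X, kernel)
    k_star = np.array([kernel(xi, x_star) for xi in X])   # vector k(X, x*)
    return float(y @ np.linalg.solve(lam * np.eye(len(X)) + K, k_star))

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
y = rng.normal(size=6)
x_star = rng.normal(size=2)
lam = 0.5

y_kernel = kernel_ridge_predict(X, y, x_star, linear_kernel, lam)

# With the linear kernel this reduces to ordinary ridge regression
w = np.linalg.solve(lam * np.eye(2) + X.T @ X, X.T @ y)
y_ridge = float(w @ x_star)
```

Swapping `linear_kernel` for any other valid kernel changes the model without changing `kernel_ridge_predict`.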


* With kernels, you transform $x$ to $\phi(x)$ and can still compute these dot products as above through a kernel function!

* In the transformed space,     $ y^{*} = y^T (\lambda I + \Phi \Phi^T)^{-1} \Phi \phi (x^{*}) $

* where,
$$ \Phi = 
\begin{bmatrix} 
\phi_1(x_1) & \phi_2(x_1) & \phi_3(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & \phi_3(x_2) \\
\phi_1(x_3) & \phi_2(x_3) & \phi_3(x_3) \\
\end{bmatrix}
$$

* Replace the inner products with a kernel function and you still get the desired result with better efficiency, since $\phi(x)$ is never computed explicitly

* **By using kernels, you can generalize linear models to be used for non-vector data**

* The simplest kernel function is the dot product itself. No change to the original matrix formulation

* $k(x_i, x_j) = x_i \cdot x_j$

* For non-vector data, we can apply other kernels, such as RBF

* A kernel function must satisfy certain rules to be valid

* Mercer kernels: e.g. the RBF (Gaussian) kernel
    * Projects data into an infinite-dimensional space (data that is inseparable in 1D might be separable in 2D)

    * $k(x_i, x_j) = \exp \left[ -\frac{1}{2\sigma^2} \|x_i - x_j\|^2 \right]$

    * This acts like a weighted nearest neighbour. Points closer together in space ($i$ close to $j$) yield higher $k$; far-off points get lower weights. It behaves like a similarity measure

    * $\sigma$ is a hyperparameter representing the width of the curve
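The similarity behaviour of the RBF kernel is easy to see with a few points. This is a minimal sketch assuming NumPy; the function name `rbf_kernel` and the sample points are made up for illustration.

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    """k(xi, xj) = exp(-||xi - xj||^2 / (2 * sigma^2))."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

a = [0.0, 0.0]
near = [0.1, 0.0]
far = [3.0, 0.0]

k_self = rbf_kernel(a, a)       # identical points give exactly 1
k_near = rbf_kernel(a, near)    # close points: value just below 1
k_far = rbf_kernel(a, far)      # distant points: value near 0

# A larger sigma widens the curve, so distant points count as more similar
k_far_wide = rbf_kernel(a, far, sigma=5.0)
```

The values are ordered `k_self > k_near > k_far`, and increasing `sigma` raises the weight given to the far point.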


## Kernel Examples:

* 1D example: 
    * no linear separator
    * Map $x \to (x, x^2)$: separable in 2D space
    * the above mapping is $\phi$

* 2D example: 
    * no linear separator
    * Map $x \to (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$: separable in 3D space
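Both mappings can be tried out directly. The sketch below assumes NumPy and uses made-up points and labels. It also checks a standard fact not stated in the notes: the inner product under the 2D map equals $(\langle x, y \rangle)^2$ in the original space.

```python
import numpy as np

# 1D: the class depends on |x|, so no single threshold on x separates it
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
labels = np.array([1, 1, 0, 0, 0, 1, 1])       # made-up labels: 1 if |x| >= 2

phi_1d = np.column_stack([x, x ** 2])           # phi(x) = (x, x^2)
# In 2D the horizontal line x^2 = 2.5 is a linear separator
predicted = (phi_1d[:, 1] > 2.5).astype(int)

# 2D: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
def phi_2d(v):
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
lhs = phi_2d(u) @ phi_2d(v)     # inner product in the 3D feature space
rhs = (u @ v) ** 2              # same value computed in the original 2D space
```

The $\sqrt{2}$ factor on the cross term is what makes `lhs` and `rhs` agree.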



Briefly speaking, a kernel is a shortcut that helps us do certain calculation faster which otherwise would involve computations in higher dimensional space.

Mathematical definition: K(x, y) = <f(x), f(y)>. Here K is the kernel function, and x, y are n-dimensional inputs. f is a map from n-dimensional to m-dimensional space. <x, y> denotes the dot product. Usually m is much larger than n.

Intuition: normally, calculating <f(x), f(y)> requires us to calculate f(x) and f(y) first, and then take the dot product. These two computation steps can be quite expensive, as they involve manipulations in m-dimensional space, where m can be a large number. But after all the trouble of going to the high-dimensional space, the result of the dot product is really a scalar: we come back to one-dimensional space again! Now, the question is: do we really need to go through all the trouble to get this one number? Do we really have to go to the m-dimensional space? The answer is no, if you find a clever kernel.

Simple Example: x = (x1, x2, x3); y = (y1, y2, y3). Then for the function f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3), the kernel is K(x, y ) = (<x, y>)^2.

Let's plug in some numbers to make this more intuitive: suppose x = (1, 2, 3); y = (4, 5, 6). Then:
f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
<f(x), f(y)> = 16 + 40 + 72 + 40 + 100+ 180 + 72 + 180 + 324 = 1024

A lot of algebra, mainly because f is a mapping from 3-dimensional to 9-dimensional space.

Now let us use the kernel instead:
K(x, y) = (4 + 10 + 18 ) ^2 = 32^2 = 1024
Same result, but this calculation is so much easier.
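If you want to check the numbers yourself, here is a small sketch in plain Python. The helper name `dot` is mine; `f` and `K` follow the definitions in the example above.

```python
def f(v):
    """Explicit feature map: all pairwise products v_i * v_j (9 values for 3D input)."""
    return [a * b for a in v for b in v]

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def K(x, y):
    """Kernel shortcut: (<x, y>)^2, computed in the original 3D space."""
    return dot(x, y) ** 2

x = (1, 2, 3)
y = (4, 5, 6)

explicit = dot(f(x), f(y))   # goes through the 9-dimensional space
shortcut = K(x, y)           # stays in 3 dimensions
print(explicit, shortcut)    # 1024 1024
```

The explicit route needs 9 multiplications per feature vector plus a 9-term dot product; the kernel needs a 3-term dot product and one squaring.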

Additional beauty of kernels: kernels allow us to do stuff in infinite dimensions! Sometimes going to a higher dimension is not just computationally expensive, but also impossible. f(x) can be a mapping from n dimensions to infinite dimensions, which we may have little idea of how to deal with. Then the kernel gives us a wonderful shortcut.
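To make the infinite-dimensional case less abstract, here is a sketch for the Gaussian kernel on scalar inputs, which I am using as an illustration (the paragraph above does not name a specific kernel). Expanding exp(xy / sigma^2) as a power series gives an infinite feature map; truncating it after a few coordinates already approximates the kernel. The function names are hypothetical.

```python
import math

def rbf(x, y, sigma=1.0):
    """Gaussian kernel for scalar inputs."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def truncated_features(x, num_terms, sigma=1.0):
    """First num_terms coordinates of an infinite feature map for the 1D Gaussian kernel.

    Coordinate k is exp(-x^2 / (2 sigma^2)) * (x / sigma)^k / sqrt(k!).
    """
    scale = math.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return [scale * (x / sigma) ** k / math.sqrt(math.factorial(k))
            for k in range(num_terms)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = 0.7, -0.4
exact = rbf(x, y)                                                     # one cheap evaluation
approx_3 = dot(truncated_features(x, 3), truncated_features(y, 3))    # 3 coordinates
approx_20 = dot(truncated_features(x, 20), truncated_features(y, 20)) # 20 coordinates
```

The more coordinates we keep, the closer the explicit dot product gets to `exact`, but we would need infinitely many to match it in general. The kernel gets there in one step.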

Relation to SVM: now how is this related to SVM? The idea of SVM is that y = w phi(x) + b, where w is the weight, phi is the feature vector, and b is the bias. If y > 0, then we classify the datum as class 1, else as class 0. We want to find a set of weights and a bias such that the margin is maximized. Previous answers mention that the kernel makes data linearly separable for SVM. I think a more precise way to put this is: kernels do not make the data linearly separable. The feature vector phi(x) makes the data linearly separable. The kernel is there to make the calculation process faster and easier, especially when the feature vector phi is of very high dimension (for example, x1, x2, x3, ..., x_D^n, x1^2, x2^2, ...., x_D^2).
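Here is a sketch of why the kernel is enough to compute the SVM score. It assumes NumPy and relies on the standard dual form, which the paragraph above does not spell out: w is taken to be a weighted sum of training feature vectors, w = sum_i alpha_i phi(x_i). The points, the coefficients `alpha`, and the bias `b` are made up and not the result of any training.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the kernel (<x, y>)^2: all products v_i * v_j."""
    return np.outer(v, v).ravel()

def kernel(u, v):
    """Same inner product, computed without building phi."""
    return float(np.dot(u, v)) ** 2

# Made-up training points and coefficients (hypothetical, not trained)
X_train = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
alpha = np.array([0.5, -1.0, 0.25])   # hypothetical signed dual weights
b = 0.1
x_new = np.array([2.0, -1.0])

# Explicit route: build w in feature space, then w . phi(x) + b
w = sum(a * phi(xi) for a, xi in zip(alpha, X_train))
score_explicit = float(w @ phi(x_new)) + b

# Kernel route: never form phi, only evaluate k(x_i, x)
score_kernel = sum(a * kernel(xi, x_new) for a, xi in zip(alpha, X_train)) + b

predicted_class = 1 if score_kernel > 0 else 0
```

Both routes give the same score, so the classification rule "class 1 if y > 0" never needs phi explicitly.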

Why it can also be understood as a measure of similarity:
If we put the definition of the kernel above, <f(x), f(y)>, in the context of SVM and feature vectors, it becomes <phi(x), phi(y)>. The inner product means the projection of phi(x) onto phi(y), or colloquially, how much overlap x and y have in their feature space. In other words, how similar they are.