# Kernels

## Introduction

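So far, each object has been represented as a fixed-size feature vector $\mathbf{x}_i \in \mathbb{R}^D$. For some kinds of objects it is not clear how best to build such a vector. An alternative is to assume we have some way of measuring the similarity between objects that does not require preprocessing them into feature-vector format. Let $\kappa \left( \mathbf { x } , \mathbf { x } ^ { \prime } \right) \geq 0$ be some measure of similarity between objects $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$; we call $\kappa$ a kernel function.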

A kernel function is a real-valued function of two arguments, $\kappa \left( \mathbf { x } , \mathbf { x } ^ { \prime } \right) \in \mathbb { R }$. Typically it is symmetric, $\kappa \left( \mathbf { x } , \mathbf { x } ^ { \prime } \right) = \kappa \left( \mathbf { x } ^ { \prime } , \mathbf { x } \right)$, and non-negative, $\kappa \left( \mathbf { x } , \mathbf { x } ^ { \prime } \right) \geq 0$, so that it can be interpreted as a measure of similarity.

### RBF kernels (Gaussian Kernels)

For a general covariance matrix $\boldsymbol{\Sigma}$, the Gaussian kernel is:

$$\kappa \left( \mathbf { x } , \mathbf { x } ^ { \prime } \right) = \exp \left( - \frac { 1 } { 2 } \left( \mathbf { x } - \mathbf { x } ^ { \prime } \right) ^ { T } \mathbf { \Sigma } ^ { - 1 } \left( \mathbf { x } - \mathbf { x } ^ { \prime } \right) \right)$$

If $\boldsymbol{\Sigma}$ is diagonal with entries $\sigma_j^2$, this becomes:

$$\kappa \left( \mathbf { x } , \mathbf { x } ^ { \prime } \right) = \exp \left( - \frac { 1 } { 2 } \sum _ { j = 1 } ^ { D } \frac { 1 } { \sigma _ { j } ^ { 2 } } \left( x _ { j } - x _ { j } ^ { \prime } \right) ^ { 2 } \right)$$

If $\boldsymbol{\Sigma}$ is spherical, we get the isotropic kernel:
$$\kappa \left( \mathbf { x } , \mathbf { x } ^ { \prime } \right) = \exp \left( - \frac { \left\| \mathbf { x } - \mathbf { x } ^ { \prime } \right\| ^ { 2 } } { 2 \sigma ^ { 2 } } \right)$$

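Here $\sigma^2$ is known as the bandwidth. This is an example of a radial basis function (RBF) kernel, since it depends only on the distance $\left\| \mathbf { x } - \mathbf { x } ^ { \prime } \right\|$.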
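
The diagonal and isotropic forms above can be evaluated directly. The following is a minimal sketch in plain Python (standard library only); the function names and the example points are illustrative assumptions, not part of the original notes.

```python
import math


def rbf_kernel_diag(x, x_prime, sigmas):
    """Gaussian kernel with diagonal Sigma: one scale sigma_j per dimension."""
    scaled_sq_dist = sum(((a - b) / s) ** 2 for a, b, s in zip(x, x_prime, sigmas))
    return math.exp(-0.5 * scaled_sq_dist)


def rbf_kernel_iso(x, x_prime, sigma):
    """Isotropic Gaussian kernel: exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Hypothetical example points (not from the text).
x = [1.0, 2.0]
x_prime = [2.0, 0.0]

k_iso = rbf_kernel_iso(x, x_prime, sigma=1.0)
k_diag = rbf_kernel_diag(x, x_prime, sigmas=[1.0, 1.0])
k_self = rbf_kernel_iso(x, x, sigma=1.0)
print(k_iso, k_diag, k_self)
```

When all $\sigma_j$ are equal, the diagonal form reduces to the isotropic one, and $\kappa(\mathbf{x}, \mathbf{x}) = 1$ for any $\mathbf{x}$.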

### Kernel for comparing documents:
+ Cosine similarity, where $\mathbf{x}_i$ is the bag-of-words vector of document $i$ and $x_{ij}$ is the number of times word $j$ occurs in it:
$$\kappa \left( \mathbf { x } _ { i } , \mathbf { x } _ { i ^ { \prime } } \right) = \frac { \mathbf { x } _ { i } ^ { T } \mathbf { x } _ { i ^ { \prime } } } { \left\| \mathbf { x } _ { i } \right\| _ { 2 } \left\| \mathbf { x } _ { i ^ { \prime } } \right\| _ { 2 } }$$

+ TF-IDF kernel, where TF-IDF stands for term frequency-inverse document frequency. The term frequency is a log-transform of the raw count:
$$\operatorname { tf } \left( x _ { i j } \right) \triangleq \log \left( 1 + x _ { i j } \right)$$

The inverse document frequency down-weights terms that occur in many documents:

$$\operatorname { idf } ( j ) \triangleq \log \frac { N } { 1 + \sum _ { i = 1 } ^ { N } \mathbb { I } \left( x _ { i j } > 0 \right) }$$

where $N$ is the total number of documents and the sum in the denominator counts how many documents contain term $j$. Combining the two gives a feature vector over the $V$ terms of the vocabulary:

$$\operatorname { tf-idf } \left( \mathbf { x } _ { i } \right) \triangleq \left[ \operatorname { tf } \left( x _ { i j } \right) \times \operatorname { idf } ( j ) \right] _ { j = 1 } ^ { V }$$

The kernel is the cosine similarity of these feature vectors:

$$\kappa \left( \mathbf { x } _ { i } , \mathbf { x } _ { i ^ { \prime } } \right) = \frac { \phi \left( \mathbf { x } _ { i } \right) ^ { T } \phi \left( \mathbf { x } _ { i ^ { \prime } } \right) } { \left\| \phi \left( \mathbf { x } _ { i } \right) \right\| _ { 2 } \left\| \phi \left( \mathbf { x } _ { i ^ { \prime } } \right) \right\| _ { 2 } }$$

where $\phi ( \mathbf { x } ) = \operatorname { tf-idf } ( \mathbf { x } )$.
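
As a rough illustration of the TF-IDF kernel, the sketch below computes the tf, idf and cosine-similarity formulas above on a tiny count matrix. The corpus, the function names, and the choice to return 0 for an all-zero vector are assumptions made for this example only.

```python
import math


def tf_idf_features(counts):
    """Map an N x V table of term counts to TF-IDF feature vectors."""
    n_docs = len(counts)
    vocab_size = len(counts[0])
    idf = []
    for j in range(vocab_size):
        # Number of documents that contain term j at least once.
        doc_freq = sum(1 for doc in counts if doc[j] > 0)
        idf.append(math.log(n_docs / (1 + doc_freq)))
    return [
        [math.log(1 + doc[j]) * idf[j] for j in range(vocab_size)]
        for doc in counts
    ]


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical corpus: 4 documents, vocabulary of 3 terms (rows are word counts).
counts = [
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 0],
]
phi = tf_idf_features(counts)
k_01 = cosine_similarity(phi[0], phi[1])  # documents that share term 0
k_13 = cosine_similarity(phi[1], phi[3])  # documents with no term in common
print(k_01, k_13)
```

Documents with no term in common get a kernel value of 0, and any document with a non-zero feature vector has similarity 1 with itself.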

### Mercer (positive definite) kernels:
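A kernel is a Mercer (positive definite) kernel if, for any set of inputs $\{ \mathbf{x}_i \}_{i=1}^N$, the Gram matrix defined below is positive definite: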
$$\mathbf { K } = \left( \begin{array} { c c c } { \kappa \left( \mathbf { x } _ { 1 } , \mathbf { x } _ { 1 } \right) } & { \cdots } & { \kappa \left( \mathbf { x } _ { 1 } , \mathbf { x } _ { N } \right) } \\ { } & { \vdots } & { } \\ { \kappa \left( \mathbf { x } _ { N } , \mathbf { x } _ { 1 } \right) } & { \cdots } & { \kappa \left( \mathbf { x } _ { N } , \mathbf { x } _ { N } \right) } \end{array} \right)$$

If the kernel is Mercer, then there exists a function $\phi$ mapping $x \in X$ to $\mathbb{R}^D$ such that: 
$$\kappa \left( \mathbf { x } , \mathbf { x } ^ { \prime } \right) = \phi ( \mathbf { x } ) ^ { T } \phi \left( \mathbf { x } ^ { \prime } \right)$$

where $\phi$ depends on the eigenfunctions of $\kappa$, so $D$ is potentially infinite.

Examples:
+ Polynomial kernel (Mercer): $\kappa \left( \mathbf { x } , \mathbf { x } ^ { \prime } \right) = \left( \gamma \mathbf { x } ^ { T } \mathbf { x } ^ { \prime } + r \right) ^ { M }$ where $r > 0$ and $M$ is the degree.
+ Sigmoid kernel (widely used, but not a Mercer kernel in general): $\kappa \left( \mathbf { x } , \mathbf { x } ^ { \prime } \right) = \tanh \left( \gamma \mathbf { x } ^ { T } \mathbf { x } ^ { \prime } + r \right)$
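
To make the feature-map statement concrete, here is a small sketch for the polynomial kernel with $M = 2$, $\gamma = 1$, $r = 1$ on 2-D inputs. Expanding $(1 + \mathbf{x}^T\mathbf{x}')^2$ gives the explicit six-dimensional map used below. The data points, the test vector `z`, and the function names are assumptions chosen for illustration.

```python
import math


def poly_kernel(x, x_prime, gamma=1.0, r=1.0, degree=2):
    """Polynomial kernel (gamma * x^T x' + r)^M."""
    dot = sum(a * b for a, b in zip(x, x_prime))
    return (gamma * dot + r) ** degree


def poly2_features(x):
    """Explicit feature map for 2-D inputs with gamma = 1, r = 1, M = 2."""
    x1, x2 = x
    s = math.sqrt(2.0)
    return [1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2]


def gram_matrix(kernel, xs):
    """N x N matrix of pairwise kernel values."""
    return [[kernel(a, b) for b in xs] for a in xs]


# Hypothetical inputs (not from the text).
xs = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0]]
K = gram_matrix(poly_kernel, xs)

# The kernel value should equal the inner product of the explicit features.
phi = [poly2_features(x) for x in xs]
K_from_features = [[sum(a * b for a, b in zip(u, v)) for v in phi] for u in phi]

# For a Mercer kernel, z^T K z is never negative; check one arbitrary z.
z = [1.0, -2.0, 0.5]
quad_form = sum(z[i] * K[i][j] * z[j] for i in range(3) for j in range(3))
print(K, quad_form)
```

Checking a single `z` does not prove positive definiteness; it only illustrates the property that must hold for every `z` and every set of inputs.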

### Linear Kernels:
$$\kappa \left( \mathbf { x } , \mathbf { x } ^ { \prime } \right) = \mathbf { x } ^ { T } \mathbf { x } ^ { \prime }$$

A linear kernel is useful when the original data is already high dimensional and the original features are individually informative.

## Using Kernels

### Kernel machines:
A kernel machine is a generalized linear model whose input feature vector has the form:
$$\phi ( \mathbf { x } ) = \left[ \kappa \left( \mathbf { x } , \boldsymbol { \mu } _ { 1 } \right) , \ldots , \kappa \left( \mathbf { x } , \boldsymbol { \mu } _ { K } \right) \right]$$

where $\boldsymbol { \mu } _ { k } \in \mathcal { X }$ are a set of $K$ centroids (learnable parameters). The vector $\phi(\mathbf{x})$ is called a kernelized feature vector. If $\kappa$ is an RBF kernel, the model is called an RBF network.

+ Logistic regression: $p ( y | \mathbf { x } , \boldsymbol { \theta } ) = \operatorname { Ber } \left( y | \operatorname{sigm} \left( \mathbf { w } ^ { T } \boldsymbol { \phi } ( \mathbf { x } ) \right) \right)$, which gives a non-linear decision boundary in the original input space.
+ Linear regression: $p ( y | \mathbf { x } , \boldsymbol { \theta } ) = \mathcal { N } \left( \mathbf { w } ^ { T } \boldsymbol { \phi } ( \mathbf { x } ) , \sigma ^ { 2 } \right)$
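
The sketch below illustrates why a kernelized feature vector gives a non-linear decision boundary: XOR-style data that no linear classifier can separate in the input space becomes separable after mapping through two RBF features. The centroids, the bandwidth, and the weights are hand-picked assumptions (nothing is learned here), and a bias term `w0` is added for convenience.

```python
import math


def rbf_kernel(x, mu, sigma):
    """Isotropic RBF kernel between an input and a centroid."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mu))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_features(x, centroids, sigma):
    """Kernelized feature vector phi(x) = [k(x, mu_1), ..., k(x, mu_K)]."""
    return [rbf_kernel(x, mu, sigma) for mu in centroids]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def predict_proba(x, w, w0, centroids, sigma):
    """Logistic regression on top of the kernelized features."""
    phi = kernel_features(x, centroids, sigma)
    return sigmoid(w0 + sum(wk * pk for wk, pk in zip(w, phi)))


# Hypothetical XOR-style data: class 1 at (0,1) and (1,0), class 0 at (0,0) and (1,1).
inputs = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
labels = [0, 0, 1, 1]

# Hand-picked centroids and weights (assumptions, not fitted parameters).
centroids = [[0.0, 1.0], [1.0, 0.0]]
sigma = 0.5
w = [1.0, 1.0]
w0 = -0.6

probs = [predict_proba(x, w, w0, centroids, sigma) for x in inputs]
predicted = [1 if p > 0.5 else 0 for p in probs]
print(probs, predicted)
```

In practice the weights, and possibly the centroids, would be fitted to data; the point here is only that a model linear in $\phi(\mathbf{x})$ is non-linear in $\mathbf{x}$.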


## Support vector machines (SVMs)

### SVMs for regression
+ $\epsilon$-insensitive loss function, which ignores errors smaller than the margin $\epsilon$:
$$L _ { \epsilon } ( y , \hat { y } ) \triangleq \left\{ \begin{array} { c c } { 0 } & { \text { if } | y - \hat { y } | < \epsilon } \\ { | y - \hat { y } | - \epsilon } & { \text { otherwise } } \end{array} \right.$$

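+ Objective function, where $\hat{y}_i = f(\mathbf{x}_i) = \mathbf{w}^T \mathbf{x}_i + w_0$ and $C$ is a regularization constant: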
$$J = C \sum _ { i = 1 } ^ { N } L _ { \epsilon } \left( y _ { i } , \hat { y } _ { i } \right) + \frac { 1 } { 2 } \| \mathbf { w } \| ^ { 2 }$$

+ Introduce slack variables $\xi_i^+$ and $\xi_i^-$ that measure how far each point lies outside the $\epsilon$-tube:
$$\begin{aligned} y _ { i } & \leq f \left( \mathbf { x } _ { i } \right) + \epsilon + \xi _ { i } ^ { + } \\ y _ { i } & \geq f \left( \mathbf { x } _ { i } \right) - \epsilon - \xi _ { i } ^ { - } \end{aligned}$$

+ New objective function, minimized subject to the constraints above together with $\xi_i^+ \geq 0$ and $\xi_i^- \geq 0$:
$$J = C \sum _ { i = 1 } ^ { N } \left( \xi _ { i } ^ { + } + \xi _ { i } ^ { - } \right) + \frac { 1 } { 2 } \| \mathbf { w } \| ^ { 2 }$$

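+ The optimal solution has the form $$\hat { \mathbf { w } } = \sum _ { i } \alpha _ { i } \mathbf { x } _ { i }$$

    where the coefficients $\alpha_i$ are obtained from the Lagrange multipliers of the constraints and the vector $\boldsymbol{\alpha}$ is sparse. The $\mathbf{x}_i$ for which $\alpha_i \neq 0$ are called support vectors; these are the points whose errors lie on or outside the $\epsilon$-tube.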

+ Prediction for a new input:
$$\hat { y } ( \mathbf { x } ) = \hat { w } _ { 0 } + \hat { \mathbf { w } } ^ { T } \mathbf { x }$$

+ With a kernel, the inner product $\mathbf{x}_i^T \mathbf{x}$ is replaced by $\kappa(\mathbf{x}_i, \mathbf{x})$:
$$\hat { y } ( \mathbf { x } ) = \hat { w } _ { 0 } + \sum _ { i } \alpha _ { i } \kappa \left( \mathbf { x } _ { i } , \mathbf { x } \right)$$
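
The pieces of the regression formulation can be sketched in a few lines. The code below evaluates the $\epsilon$-insensitive loss and the objective $J$, and checks that the primal prediction $\hat{w}_0 + \hat{\mathbf{w}}^T\mathbf{x}$ agrees with the kernelized form when $\kappa$ is the linear kernel. The data, the coefficients `alphas`, and the offset `w0` are made-up illustrative values, not the solution of the optimization problem.

```python
def linear_kernel(x, x_prime):
    """Plain inner product x^T x'."""
    return sum(a * b for a, b in zip(x, x_prime))


def eps_insensitive_loss(y, y_hat, eps):
    """Zero inside the epsilon-tube, linear outside it."""
    return max(0.0, abs(y - y_hat) - eps)


def predict_primal(x, w, w0):
    """Prediction using the weight vector directly."""
    return w0 + linear_kernel(w, x)


def predict_dual(x, alphas, w0, train_xs, kernel):
    """Prediction as a weighted sum of kernel values with the training points."""
    return w0 + sum(a * kernel(x_i, x) for a, x_i in zip(alphas, train_xs))


def svr_objective(w, w0, xs, ys, eps, C):
    """J = C * sum of epsilon-insensitive losses + 0.5 * ||w||^2."""
    data_term = sum(
        eps_insensitive_loss(y, predict_primal(x, w, w0), eps)
        for x, y in zip(xs, ys)
    )
    return C * data_term + 0.5 * linear_kernel(w, w)


# Hypothetical training set and coefficients (illustrative values only).
xs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
ys = [1.05, 2.5, 2.5]
alphas = [0.5, 0.0, 1.5]  # the second coefficient is 0: not a support vector
w0 = -1.0

# w = sum_i alpha_i x_i
w = [sum(a * x[d] for a, x in zip(alphas, xs)) for d in range(2)]

x_new = [2.0, 1.0]
y_primal = predict_primal(x_new, w, w0)
y_dual = predict_dual(x_new, alphas, w0, xs, linear_kernel)
J = svr_objective(w, w0, xs, ys, eps=0.1, C=1.0)
print(y_primal, y_dual, J)
```

Only training points with non-zero coefficients contribute to `predict_dual`, which is why sparsity of the coefficients makes prediction cheap. Swapping `linear_kernel` for another kernel changes the model without touching the rest of the code.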

### SVMs for classification
+ Hinge loss, for labels $y \in \{ -1, +1 \}$: $$L _ { \text { hinge } } ( y , \eta ) = \max ( 0,1 - y \eta ) = ( 1 - y \eta ) _ { + }$$
where $\eta = f ( \mathbf { x } ) = \mathbf { w } ^ { T } \mathbf { x } + w _ { 0 }$ is the classifier's confidence in choosing label $y = 1$.

+ Objective function:
$$\min _ { \mathbf { w } , w _ { 0 } } \frac { 1 } { 2 } \| \mathbf { w } \| ^ { 2 } + C \sum _ { i = 1 } ^ { N } \left( 1 - y _ { i } f \left( \mathbf { x } _ { i } \right) \right) _ { + }$$

+ Introducing slack variables $\xi_i$:
$$\min _ { \mathbf { w } , w _ { 0 } , \xi } \frac { 1 } { 2 } \| \mathbf { w } \| ^ { 2 } + C \sum _ { i = 1 } ^ { N } \xi _ { i } \quad \text { s.t. } \quad \xi _ { i } \geq 0 , y _ { i } \left( \mathbf { x } _ { i } ^ { T } \mathbf { w } + w _ { 0 } \right) \geq 1 - \xi _ { i } , i = 1 : N$$

+ The optimal solution has the form $$\hat { \mathbf { w } } = \sum _ { i } \alpha _ { i } \mathbf { x } _ { i }$$
where $\alpha_i = \lambda_i y_i$, the $\lambda_i \geq 0$ are Lagrange multipliers, and $\boldsymbol{\alpha}$ is sparse. The $\mathbf{x}_i$ for which $\lambda_i > 0$ (equivalently $\alpha_i \neq 0$) are called support vectors; these are points that are either incorrectly classified, or classified correctly but on or inside the margin.

+ Prediction for a new input:
$$\hat { y } ( \mathbf { x } ) = \operatorname { sgn } ( f ( \mathbf { x } ) ) = \operatorname { sgn } \left( \hat { w } _ { 0 } + \hat { \mathbf { w } } ^ { T } \mathbf { x } \right)$$

+ With a kernel:
$$\hat { y } ( \mathbf { x } ) = \operatorname { sgn } \left( \hat { w } _ { 0 } + \sum _ { i = 1 } ^ { N } \alpha _ { i } \kappa \left( \mathbf { x } _ { i } , \mathbf { x } \right) \right)$$
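
As a sketch of the classification formulas, the code below implements the hinge loss and the kernelized decision rule $\operatorname{sgn}\left(\hat{w}_0 + \sum_i \alpha_i \kappa(\mathbf{x}_i, \mathbf{x})\right)$ with an RBF kernel. The XOR-style data, the multipliers $\lambda_i = 1$, the offset, and the bandwidth are hand-picked assumptions for illustration, not the output of a quadratic programming solver.

```python
import math


def hinge_loss(y, eta):
    """max(0, 1 - y * eta) for a label y in {-1, +1} and a score eta."""
    return max(0.0, 1.0 - y * eta)


def rbf_kernel(x, x_prime, sigma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def decision_value(x, alphas, w0, support_xs, kernel):
    """f(x) = w0 + sum_i alpha_i * k(x_i, x)."""
    return w0 + sum(a * kernel(x_i, x) for a, x_i in zip(alphas, support_xs))


def predict(x, alphas, w0, support_xs, kernel):
    # Assumption: a decision value of exactly 0 is mapped to +1.
    return 1 if decision_value(x, alphas, w0, support_xs, kernel) >= 0 else -1


# Hypothetical XOR-style training set with labels in {-1, +1}.
xs = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
ys = [-1, -1, 1, 1]

# Hand-picked multipliers lambda_i = 1 (illustrative only).
lambdas = [1.0, 1.0, 1.0, 1.0]
alphas = [lam * y for lam, y in zip(lambdas, ys)]  # alpha_i = lambda_i * y_i
w0 = 0.0


def kernel(a, b):
    return rbf_kernel(a, b, sigma=0.5)


predictions = [predict(x, alphas, w0, xs, kernel) for x in xs]
losses = [
    hinge_loss(y, decision_value(x, alphas, w0, xs, kernel))
    for x, y in zip(xs, ys)
]
print(predictions, losses)
```

Every point here is classified correctly but has a non-zero hinge loss, which means it lies inside the margin.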

### Summary:
An SVM combines three ingredients: the kernel trick, sparsity, and the large margin principle.
+ The kernel trick is necessary to prevent underfitting, i.e., to ensure that the feature vector is sufficiently rich that a linear classifier can separate the data.
+ The sparsity and large margin principles are necessary to prevent overfitting, i.e., to ensure that we do not use all the basis functions.