# 1. Support Vector Machines

## 1.1. Maximum Margin Classifier

* **Decision Border** and **Linear Separability**

>* Given a dataset $\mathcal{D} = \{(\mathbf{x}_n,t_n)\}^N_{n=1}$ where $t_n \in \{-1,1\}$, 
>* The goal is to train a classifier such that

>$$\begin{matrix} y(\mathbf{x}_n) \geq 0 & \text{if  } t_n = +1 \\ y(\mathbf{x}_n) < 0 & \text{if  } t_n = -1 \end{matrix}$$

>* **Decision Border:** $y(\mathbf{x}) = 0$  
>  * If the data is **linearly separable**, many possible borders have zero training error
>  * Linear classifier: $y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$

* **Maximum Margin Classifiers**

>* **Idea:** choose the plane whose distance (i.e. **margin**) to the closest points (i.e. **support vectors**) in each class is maximal

>* Choose the scale so that:

>\begin{align}
y(\mathbf{x}_+) &= \mathbf{w}^T \mathbf{x}_+ + b = +1 \\
y(\mathbf{x}_-) &= \mathbf{w}^T \mathbf{x}_- + b = -1 
\end{align}

>* **Magnitude of the margin:**

>$$\frac{\mathbf{w}^T (\mathbf{x}_+ - \mathbf{x}_-)}{2||\mathbf{w}||} = \frac{1}{||\mathbf{w}||}$$

>* Minimum $|y(\mathbf{x})|$ achieved by support vectors $\rightarrow$ $t_n y(\mathbf{x}_n) \geq 1 \;\forall\;n$
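
The canonical scaling above can be checked numerically. The following is a minimal sketch in plain Python; the toy data and the hyperplane `(w, b)` are hypothetical values chosen so that the closest point of each class satisfies $t_n y(\mathbf{x}_n) = 1$.

```python
import math

def decision(w, b, x):
    """Linear decision function y(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical toy data: two points per class in 2-D.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
t = [1, 1, -1, -1]

# Hypothetical hyperplane in canonical scale.
w, b = (0.5, 0.5), -1.0

# t_n * y(x_n) must be >= 1 for every point, with equality for support vectors.
functional_margins = [tn * decision(w, b, xn) for xn, tn in zip(X, t)]

# Geometric margin 1 / ||w||.
margin = 1.0 / math.sqrt(sum(wi * wi for wi in w))

# Indices of the points that sit exactly on the margin.
support = [n for n, m in enumerate(functional_margins) if abs(m - 1.0) < 1e-9]
```

Here the two support vectors are $(2,2)$ and $(0,0)$; they are $2\sqrt{2}$ apart, so the margin $1/||\mathbf{w}|| = \sqrt{2}$ is half their distance, as expected.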

## 1.2. Optimization with Inequality Constraints

* **Problem:** Maximize $f(x,y)$ subject to $g(x,y) \geq 0$

* **Equality Constraint $\rightarrow$ Lagrange Multipliers**

>$$\text{Maximize}\;\;\;\mathcal{L}(x,y,\lambda) = f(x,y) + \lambda g(x,y)$$

>$$\rightarrow \nabla_{x,y} f(x,y) = -\lambda \nabla_{x,y} g(x,y) \;\;\;,\;\;\; g(x,y)=0$$

>* $\lambda \neq 0$: **Lagrange multiplier** (positive or negative)

* **Karush-Kuhn-Tucker (KKT) Conditions**

>* Constraint is **not active** at solution: $g(x_\star, y_\star) > 0$

>>\begin{align}
\nabla_{x,y} \; f(x,y) &= 0 \\
\nabla_{x,y} \; \mathcal{L} (x,y,\lambda) &= 0 \;\;\; \text{if} \;\;\; \lambda=0
\end{align}

>* Constraint is **active** at solution: $g(x_\star, y_\star) = 0$

>>$$\nabla_{x,y} \; f(x,y) = -\lambda \nabla_{x,y} \; g(x,y)$$

>>* $\lambda$ must be positive / otherwise the constraint wouldn't be tight
>>* **Solution:**

>>$$\max_{x,y} \min_\lambda \mathcal{L}(x,y,\lambda)$$

>>* This solution satisfies the **KKT conditions**

>>$$g(x,y) \geq 0 \;\;\;,\;\;\; \lambda \geq 0 \;\;\;,\;\;\; \lambda g(x,y) = 0$$
>>$$\nabla_{x,y} \; f(x,y) = -\lambda \nabla_{x,y} \; g(x,y)$$
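
As an illustration of the KKT conditions, consider a hypothetical problem that is not part of the notes: maximize $f(x,y) = 1 - x^2 - y^2$ subject to $g(x,y) = x + y - 1 \geq 0$. The unconstrained maximum $(0,0)$ is infeasible, so the constraint is active, and solving by hand gives $(x_\star, y_\star) = (0.5, 0.5)$ with $\lambda = 1$. The sketch below only verifies these values; it does not solve the problem.

```python
def grad_f(x, y):
    """Gradient of f(x, y) = 1 - x^2 - y^2."""
    return (-2.0 * x, -2.0 * y)

def g(x, y):
    """Inequality constraint g(x, y) = x + y - 1 >= 0."""
    return x + y - 1.0

def grad_g(x, y):
    """Gradient of g, constant for a linear constraint."""
    return (1.0, 1.0)

# Candidate solution and multiplier derived by hand.
x_star, y_star, lam = 0.5, 0.5, 1.0

# Stationarity: grad f + lambda * grad g = 0.
stationarity = [
    gf + lam * gg
    for gf, gg in zip(grad_f(x_star, y_star), grad_g(x_star, y_star))
]

kkt_ok = (
    g(x_star, y_star) >= 0                       # primal feasibility
    and lam >= 0                                 # dual feasibility
    and abs(lam * g(x_star, y_star)) < 1e-12     # complementary slackness
    and all(abs(s) < 1e-12 for s in stationarity)
)
```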

## 1.3. Optimization in SVM

* **Optimization Problem**

>$$\text{Minimize} \;\;\; \frac{1}{2} ||\mathbf{w}||^2 \;\;\; \text{s.t.} \;\;\;
t_n(\mathbf{w}^T \mathbf{x}_n + b) \geq 1 \;\;\;,\;\;\; n=1,...,N$$

* **Objective fn.** (introduce Lagrange multipliers)

>$$\mathcal{L} (\mathbf{w},b,\mathbf{a}) = \frac{1}{2} ||\mathbf{w}||^2 
- \sum^N_{n=1} a_n \{ t_n (\mathbf{w}^T \mathbf{x}_n + b) - 1 \}$$

>* Negative sign in front of the $a_n$ term: we minimize w.r.t. $\mathbf{w}$ and $b$, but maximize w.r.t. $a_n \geq 0$
>* By setting gradients w.r.t. $\mathbf{w}$ and $b$ to zero:

>$$\mathbf{w}=\sum^N_{n=1} a_n t_n \mathbf{x}_n \;\;\;,\;\;\;
0 = \sum^N_{n=1} a_n t_n$$

>* Substitute these results to yield a **dual problem**:

>$$\max_\mathbf{a} \left[ \sum^N_{n=1} a_n - \frac{1}{2} \sum^N_{n=1} \sum^N_{m=1} a_n a_m t_n t_m \mathbf{x}_n^T \mathbf{x}_m \right]
\;\;\;\text{s.t.}\;\;\; \sum^N_{n=1} a_n t_n = 0 
\;\;\;,\;\;\; a_n \geq 0 \;\forall\; n$$

* **Solution** (no closed-form solution; the dual is a quadratic program, too expensive for $N>50,000$)

>* From KKT conditions, at convergence, it holds that:

>$$a_n(t_n y(\mathbf{x}_n) - 1)=0 \;\forall\;n$$

>$$a_n > 0 \;\;\;\rightarrow\;\;\; t_n y(\mathbf{x}_n) = 1 \;\;\;\text{(i.e. support vectors)}$$

>* **Prediction**

>$$y(\mathbf{x}) = \sum_{n \in \mathcal{S}} a_n t_n \mathbf{x}_n^T \mathbf{x} + b
\;\;\;,\;\;\; \mathcal{S} = \{n: a_n > 0\}$$

>* **Bias** (average to improve numerical stability)

>$$b = \frac{1}{|\mathcal{S}|} \sum_{n\in\mathcal{S}} \bigg\{
t_n - \sum_{m \in \mathcal{S}} a_m t_m \mathbf{x}_m^T \mathbf{x}_n
\bigg\}$$
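
The prediction and bias formulas can be traced on a two-point problem small enough to solve by hand. The data below are hypothetical: with $\mathbf{x}_+ = (1,1)$ and $\mathbf{x}_- = (-1,-1)$, the constraint $\sum_n a_n t_n = 0$ forces $a_1 = a_2 = a$, the dual objective becomes $2a - 4a^2$, and its maximum is at $a = 1/4$. The sketch recovers $\mathbf{w}$, $b$ and the predictions from these multipliers.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical two-point training set.
X = [(1.0, 1.0), (-1.0, -1.0)]
t = [1, -1]

# Dual solution derived by hand for this toy problem.
a = [0.25, 0.25]

# Support vectors: points with a_n > 0.
S = [n for n, an in enumerate(a) if an > 0]

# w = sum_n a_n t_n x_n
w = [sum(a[n] * t[n] * X[n][d] for n in S) for d in range(2)]

# Bias averaged over all support vectors.
b = sum(
    t[n] - sum(a[m] * t[m] * dot(X[m], X[n]) for m in S)
    for n in S
) / len(S)

def predict(x):
    """y(x) expressed through dot products with the support vectors only."""
    return sum(a[n] * t[n] * dot(X[n], x) for n in S) + b
```

Both training points end up with $t_n y(\mathbf{x}_n) = 1$, which is consistent with both being support vectors.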

## 1.4. Constraint Violation and Soft Margin

* **Slack Variables**

>$$t_n (\mathbf{w}^T \mathbf{x}_n + b) \geq 1 - \xi_n 
\;\;\;,\;\;\; \xi_n \geq 0 \;\;\;,\;\;\; n = 1,...,N$$

>* $\xi_n = 0$: correctly classified & on or outside the margin
>* $0 < \xi_n \leq 1$: correctly classified & inside the margin
>* $\xi_n > 1$: misclassified

* **Optimization for Soft Margin**

>$$\text{Minimize} \;\;\; \frac{1}{2} ||\mathbf{w}||^2 + C \sum^N_{n=1} \xi_n
\;\;\; \text{s.t.} \;\;\;
t_n(\mathbf{w}^T \mathbf{x}_n + b) \geq 1 - \xi_n 
\;\;\;,\;\;\; \xi_n \geq 0$$

>* $C>0$: controls the **trade-off**
>  * Small $C$: soft constraint / large margin
>  * Large $C$: hard constraint / narrow margin 
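
For a fixed hyperplane, the smallest slack that satisfies the constraints is $\xi_n = \max(0, 1 - t_n y(\mathbf{x}_n))$. The sketch below uses hypothetical 1-D-like data and a hypothetical `(w, b)` to compute the slacks, label each point according to the three cases above, and evaluate the soft-margin objective for a given $C$.

```python
def decision(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def slacks(w, b, X, t):
    """Smallest feasible slack per point: xi_n = max(0, 1 - t_n * y(x_n))."""
    return [max(0.0, 1.0 - tn * decision(w, b, xn)) for xn, tn in zip(X, t)]

def soft_margin_objective(w, b, X, t, C):
    """0.5 * ||w||^2 + C * sum_n xi_n for a fixed (w, b)."""
    return 0.5 * sum(wi * wi for wi in w) + C * sum(slacks(w, b, X, t))

def describe(xi):
    if xi == 0.0:
        return "on or outside the margin"
    if xi <= 1.0:
        return "inside the margin"
    return "misclassified"

# Hypothetical data: the second and third points violate the margin.
X = [(2.0, 0.0), (0.5, 0.0), (-0.5, 0.0), (-2.0, 0.0)]
t = [1, 1, 1, -1]
w, b = (1.0, 0.0), 0.0

xi = slacks(w, b, X, t)
labels = [describe(v) for v in xi]
```

With this hyperplane the total slack is 2, so the objective is $0.5 + 2C$: a larger $C$ makes the same violations more costly.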

* **Objective fn.** ($a_n \geq 0$ and $\mu_n \geq 0$: Lagrange multipliers)

>$$\mathcal{L} (\mathbf{w},b,\boldsymbol{\xi},\mathbf{a},\boldsymbol{\mu}) = \frac{1}{2} ||\mathbf{w}||^2
+ C \sum^N_{n=1} \xi_n
- \sum^N_{n=1} a_n \{ t_n (\mathbf{w}^T \mathbf{x}_n + b) - 1 + \xi_n\}
- \sum^N_{n=1} \mu_n \xi_n$$

>* By setting gradients w.r.t. $\mathbf{w}$, $b$ and $\xi_n$ to zero:

>$$\mathbf{w}=\sum^N_{n=1} a_n t_n \mathbf{x}_n \;\;\;,\;\;\;
0 = \sum^N_{n=1} a_n t_n \;\;\;,\;\;\;
\{a_n = C - \mu_n \}^N_{n=1}$$

>* Substitute these results to yield a **dual problem**:

>$$\max_\mathbf{a} \left[ \sum^N_{n=1} a_n - \frac{1}{2} \sum^N_{n=1} \sum^N_{m=1} a_n a_m t_n t_m \mathbf{x}_n^T \mathbf{x}_m \right]
\;\;\;\text{s.t.}\;\;\; \sum^N_{n=1} a_n t_n = 0 
\;\;\;,\;\;\; \{ 0 \leq a_n \leq C\}^N_{n=1}$$

* **Solution**

>* From KKT conditions, at convergence, it holds that:

>$$a_n(t_n y(\mathbf{x}_n) - 1 + \xi_n)=0 \;\forall\;n$$

>$$a_n > 0 \;\;\;\rightarrow\;\;\; t_n y(\mathbf{x}_n) = 1 - \xi_n \;\;\;\text{(i.e. support vectors)}$$

>* $0 < a_n < C \rightarrow \mu_n > 0$ and $\xi_n = 0$: lies exactly on the margin boundary
>* $a_n = C$: lies on or inside the margin
>  * Correctly classified if $\xi_n \leq 1$ / Misclassified if $\xi_n > 1$

>* **Prediction**

>$$y(\mathbf{x}) = \sum_{n \in \mathcal{S}} a_n t_n \mathbf{x}_n^T \mathbf{x} + b
\;\;\;,\;\;\; \mathcal{S} = \{n: a_n > 0\}$$

>* **Bias**

>$$b = \frac{1}{|\mathcal{M}|} \sum_{n\in\mathcal{M}} \bigg\{
t_n - \sum_{m \in \mathcal{S}} a_m t_m \mathbf{x}_m^T \mathbf{x}_n
\bigg\}
\;\;\;,\;\;\;
\mathcal{M} = \{ n: 0 < a_n < C \}$$

# 2. Advanced Topics

## 2.1. Non-linear Max-margin Classifiers

* **Non-linear Basis Function**

>$$y(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + b$$

>$$\boldsymbol{\phi}(\mathbf{x}_n) = (\phi_1 (\mathbf{x}_n) ,..., \phi_M (\mathbf{x}_n))^T$$

>* Example: **Gaussian Basis**

>$$\phi_m (\mathbf{x}) = \exp \left\{ 
-\frac{1}{2s} (\mathbf{x} - \mathbf{c}_m)^T(\mathbf{x} - \mathbf{c}_m)
\right\}$$

* **Optimization and Predictions**

>$$\max_\mathbf{a} \left[ \sum^N_{n=1} a_n - \frac{1}{2} \sum^N_{n=1} \sum^N_{m=1} a_n a_m t_n t_m \boldsymbol{\phi}(\mathbf{x}_n)^T \boldsymbol{\phi}(\mathbf{x}_m) \right]$$

>$$y(\mathbf{x}) = \sum_{n \in \mathcal{S}}
a_n t_n \boldsymbol{\phi}(\mathbf{x}_n)^T \boldsymbol{\phi}(\mathbf{x}) + b$$

* **Gram Matrix,** $\mathbf{K}$

>$$k_{n,m} = \boldsymbol{\phi}(\mathbf{x}_n)^T \boldsymbol{\phi}(\mathbf{x}_m)$$

>* Explicitly computing the mapping $\boldsymbol{\phi}(\cdot)$ and then the dot product $\rightarrow$ expensive

## 2.2. Kernel Functions

* **Kernel Functions**

>$$k_{n,m} = k(\mathbf{x}_n,\mathbf{x}_m) = \boldsymbol{\phi}(\mathbf{x}_n)^T \boldsymbol{\phi}(\mathbf{x}_m)$$

>* Example: 

>\begin{align}
\boldsymbol{\phi}(\mathbf{x}) &= \left[ 1, \sqrt{2}\;x_1, \sqrt{2}\;x_2, \sqrt{2}\;x_1x_2, x^2_1, x^2_2 \right]^T \\
\\
k(\mathbf{x}_n,\mathbf{x}_m) &= \boldsymbol{\phi}(\mathbf{x}_n)^T \boldsymbol{\phi}(\mathbf{x}_m) \\
&= 1 + 2 x_{n,1} x_{m,1} + 2 x_{n,2} x_{m,2} + 2 x_{n,1} x_{n,2} x_{m,1} x_{m,2}
+ x^2_{n,1} x^2_{m,1} + x^2_{n,2} x^2_{m,2} \\
&= (1 + x_{n,1}x_{m,1} + x_{n,2}x_{m,2})^2 \\
&= (1 + \mathbf{x}^T_n \mathbf{x}_m)^2
\end{align}

>* $k(\cdot,\cdot)$ is valid iff there is $\boldsymbol{\phi}(\cdot)$ s.t. $k(\mathbf{x}_n,\mathbf{x}_m) = \boldsymbol{\phi}(\mathbf{x}_n)^T \boldsymbol{\phi}(\mathbf{x}_m)$
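
The identity between the explicit six-dimensional feature map and the squared form $(1 + \mathbf{x}_n^T \mathbf{x}_m)^2$ can be confirmed numerically. This is a small sketch in plain Python; the two input points are arbitrary.

```python
import math

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, r2 * x1 * x2, x1 * x1, x2 * x2]

def poly_kernel(x, z):
    """Same quantity computed without ever building phi."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)

explicit = sum(p * q for p, q in zip(phi(x), phi(z)))  # dot product in feature space
implicit = poly_kernel(x, z)                           # kernel evaluation in input space
```

Both routes give the same value (4 for these inputs, since $\mathbf{x}^T\mathbf{z} = 1$), but the kernel form needs only one dot product in the input space.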

* **Other Examples**

>\begin{align}
\text{Linear: } \;\;\; k(\mathbf{x}_n,\mathbf{x}_m) &= \mathbf{x}_n^T \mathbf{x}_m \\
\text{Polynomial: } \;\;\; k(\mathbf{x}_n,\mathbf{x}_m) &= (1 + \mathbf{x}_n^T \mathbf{x}_m)^d \\
\text{Gaussian: } \;\;\; k(\mathbf{x}_n,\mathbf{x}_m) &= \exp \left( - \frac{1}{2s} ||\mathbf{x}_n - \mathbf{x}_m||^2 \right)
\end{align}

* **Mercer's Condition:** $k(\cdot,\cdot)$ is a valid kernel iff

>1. $k(\cdot,\cdot)$ is symmetric - i.e. $k(\mathbf{x}_n, \mathbf{x}_m) = k(\mathbf{x}_m, \mathbf{x}_n)$
>2. Any Gram matrix $\mathbf{K}$ is positive semi-definite

>$$\mathbf{g}^T \mathbf{Kg} = \sum_{n,m} g_n k_{n,m} g_m \geq 0
\;\;\;\forall\;\;\; \mathbf{g} \text{ and } \{\mathbf{x}_n\}^N_{n=1}$$

* **Example with Gaussian Kernel**

>$$k(\mathbf{x}_n,\mathbf{x}_m) = \exp \left( - \frac{1}{2s} ||\mathbf{x}_n - \mathbf{x}_m||^2 \right) \;\;\;\text{and}\;\;\; C = \infty$$

>* The rank of $\mathbf{K} = \boldsymbol{\Phi} \boldsymbol{\Phi}^T$ determines the **effective dimension** of the feature space
>  * $s \rightarrow 0$: becomes diagonal (i.e. rank=N) $\rightarrow$ wiggly
>  * $s \rightarrow \infty$: all entries in $\mathbf{K}$ are the same (i.e. rank=1) $\rightarrow$ smooth
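
The two limits of the width $s$ can be observed on a small Gram matrix. The sketch below assumes NumPy is available and uses four hypothetical 1-D points; the rank is computed with an explicit tolerance of `1e-3`, which is an arbitrary choice for "numerically zero" in this illustration. It also checks Mercer's positive semi-definiteness condition through the smallest eigenvalue.

```python
import numpy as np

def gaussian_gram(X, s):
    """Gram matrix with k(x_n, x_m) = exp(-||x_n - x_m||^2 / (2 s))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * s))

# Hypothetical inputs: four points on a line.
X = np.array([[0.0], [1.0], [2.0], [3.0]])

K_narrow = gaussian_gram(X, s=1e-3)  # off-diagonal entries vanish: close to identity
K_wide = gaussian_gram(X, s=1e6)     # all entries close to 1

rank_narrow = np.linalg.matrix_rank(K_narrow, tol=1e-3)
rank_wide = np.linalg.matrix_rank(K_wide, tol=1e-3)

# Positive semi-definiteness for an intermediate width.
min_eig = np.linalg.eigvalsh(gaussian_gram(X, s=1.0)).min()
```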

## 2.3. Kernel Trick

* **Kernel Trick**

>* Any algorithm that operates on the inputs $\mathbf{x}_1,...,\mathbf{x}_N$ by using only their **dot products** can be implemented using kernels

* **Kernel Least Squares**

>\begin{align}
C &= \frac{1}{2} (\boldsymbol{\Phi} \mathbf{w} - \mathbf{t})^T (\boldsymbol{\Phi} \mathbf{w} - \mathbf{t}) + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w} \\
&= \frac{1}{2} \mathbf{w}^T \boldsymbol{\Phi}^T \boldsymbol{\Phi} \mathbf{w}
+ \frac{1}{2} \mathbf{t}^T\mathbf{t}
- \mathbf{t}^T \boldsymbol{\Phi} \mathbf{w}
+ \frac{\lambda}{2} \mathbf{w}^T \mathbf{w}
\end{align}

>* Gradient is 0 if:

>$$\mathbf{w} = \boldsymbol{\Phi}^T \frac{(\mathbf{t}-\boldsymbol{\Phi} \mathbf{w})}{\lambda} \equiv \boldsymbol{\Phi}^T \mathbf{a}$$

>* Use this to replace $\mathbf{w}$:

>$$C = \frac{1}{2} \mathbf{a}^T \boldsymbol{\Phi} \boldsymbol{\Phi}^T \boldsymbol{\Phi} \boldsymbol{\Phi}^T \mathbf{a} + \frac{1}{2} \mathbf{t}^T \mathbf{t} - \mathbf{t}^T \boldsymbol{\Phi} \boldsymbol{\Phi}^T \mathbf{a} + \frac{\lambda}{2} \mathbf{a}^T \boldsymbol{\Phi} \boldsymbol{\Phi}^T \mathbf{a}$$

>* Gradient is 0 if:

>$$\mathbf{a} = (\boldsymbol{\Phi} \boldsymbol{\Phi}^T + \lambda \mathbf{I})^{-1} \mathbf{t} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{t}$$

>* **Prediction**

>$$y(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})
= \mathbf{a}^T \boldsymbol{\Phi} \;\boldsymbol{\phi}(\mathbf{x})
= \sum^N_{n=1} a_n k(\mathbf{x}_n,\mathbf{x})$$
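
The dual solution $\mathbf{a} = (\mathbf{K} + \lambda \mathbf{I})^{-1}\mathbf{t}$ should give the same predictions as the usual primal ridge solution $\mathbf{w} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \lambda \mathbf{I})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$. The sketch below checks this, assuming NumPy and a random, purely hypothetical design matrix with $N=5$ and $M=3$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix (N=5 points, M=3 features) and targets.
Phi = rng.normal(size=(5, 3))
t = rng.normal(size=5)
lam = 0.1

# Dual solution: a = (K + lam I)^{-1} t with K = Phi Phi^T.
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(5), t)

# Primal ridge solution for comparison.
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ t)

# Prediction for a new feature vector, both ways.
phi_new = rng.normal(size=3)
y_dual = a @ (Phi @ phi_new)   # sum_n a_n k(x_n, x)
y_primal = w @ phi_new         # w^T phi(x)
```

The dual form only touches the features through `Phi @ Phi.T` and `Phi @ phi_new`, i.e. through dot products, so both can be replaced by kernel evaluations.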

## 2.4. Multi-class Max-margin Classifiers

* **One-vs-all Classification** (one classifier per class)

>$$\hat{t}_\star = \underset{k \in \{1,...,K\}}{\text{argmax}} \left( \mathbf{w}_k^T \mathbf{x}_\star + b_k \right)$$

>* **Problems:** Different output scales / Different # training examples

* **Simultaneous Learning of Classifiers** (very expensive)

>* Force the following condition

>$$\mathbf{w}^T_{t_n} \mathbf{x}_n + b_{t_n} > \mathbf{w}^T_{j} \mathbf{x}_n + b_{j} \;\;\;\forall\;\;\; j \neq t_n$$

>* New objective fn.:

>$$\frac{1}{2} \sum^K_{k=1} ||\mathbf{w}_k||^2 + C \sum^N_{n=1} \sum_{j\neq t_n} \xi_{n,j}$$

>$$\text{s.t.} \;\;\;
\mathbf{w}^T_{t_n} \mathbf{x}_n + b_{t_n} \geq \mathbf{w}^T_{j} \mathbf{x}_n + b_{j} + 1 - \xi_{n,j} \;\;\;,\;\;\;
\xi_{n,j} \geq 0 \;\;\;
\forall \;\;\;
n, \; j\neq t_n
$$

>* Prediction implemented in the same way

# 3. Kernels for Structured Data

>1. **Extract features from input objects**
1. **Compute kernels (dot products) using those features**

## 3.1. String Kernels

* **String Kernel**

>1. Given a list of substrings $s_1,s_2,... \in \mathcal{A}^\star$, encode $\boldsymbol{\phi}(x) = (\phi_{s_1}(x), \phi_{s_2}(x), ...)^T$
>1. $\phi_{s}(x)$: indicate the occurrence of $s$ in $x$
>1. Kernel between two strings defined as dot product:

>$$k(x,x') = \boldsymbol{\phi}(x)^T \boldsymbol{\phi}(x') 
= \sum_{s\in\{s_1,s_2,...\}} \phi_s(x) \phi_s(x')$$

* **Example: Gap-weighted Kernel**

>* $\phi_{s}(x)$: # occurrences of $s$ in $x$, where an occurrence with $n$ gaps is weighted by $\lambda^n$

* **Example: $k$-spectrum kernel**

>* $\phi_{s}(x)$: # **exact** occurrences ($s$: any possible length-$k$ sequence)
>* Selection of $k$
>  * Large $k$: co-occurrence is more informative
>  * Small $k$: # co-occurrence increases
>  * $k=1$: **Bag-of-words kernel**
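
A minimal sketch of the $k$-spectrum kernel in plain Python follows. It treats strings as sequences of characters, which is an assumption of this example; the same code works on lists of word tokens if slices are converted to tuples.

```python
from collections import Counter

def spectrum_features(x, k):
    """Count every contiguous length-k substring of x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def spectrum_kernel(x, y, k):
    """Dot product of the two k-spectrum count vectors."""
    fx, fy = spectrum_features(x, k), spectrum_features(y, k)
    # Substrings absent from y contribute 0 because Counter returns 0 for missing keys.
    return sum(count * fy[s] for s, count in fx.items())

k1 = spectrum_kernel("abab", "baba", 1)
k2 = spectrum_kernel("abab", "baba", 2)
k3 = spectrum_kernel("abab", "baba", 3)
```

For these two strings the kernel value drops from 8 to 4 to 2 as $k$ grows from 1 to 3, which illustrates the trade-off above: longer substrings match less often but each match is more specific.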

## 3.2. Tree Kernels

* **Definitions**

>* **Subtree:** tree formed by selecting one node and all its descendants
>* **Subset Tree:** subtree including all children of a node or none of them

* **Tree Kernel**

>1. Given a list of all possible subset trees $t_1,t_2,...$, each tree $\mathcal{T}$ is encoded using the feature vector $\boldsymbol{\phi}(\mathcal{T}) = (\phi_{t_1}(\mathcal{T}), \phi_{t_2}(\mathcal{T}), ...)^T$
>2. Each $\phi_{t}(\mathcal{T})$ counts occurrences of $t$ in $\mathcal{T}$ ($\mathcal{V}(\mathcal{T})$: set of nodes in $\mathcal{T}$)
>$$$$
>$$\phi_t (\mathcal{T}) = \sum_{n \in \mathcal{V}(\mathcal{T})} I_t (n)
\;\;\;,\;\;\;
I_t(n) = \bigg\{ \begin{matrix} 1 & t \text{ found in } \mathcal{T} \text{ with root at node } n \\ 0 & \text{otherwise} \end{matrix} $$
>$$$$
>3. Kernel between two trees defined as dot product:

>$$k(\mathcal{T}_a,\mathcal{T}_b) = \boldsymbol{\phi}(\mathcal{T}_a)^T \boldsymbol{\phi}(\mathcal{T}_b) = \sum_{t \in \{t_1,t_2,...\}} \phi_t(\mathcal{T}_a) \phi_t(\mathcal{T}_b)$$

* **Efficient Computation of Tree Kernels and Example**

>\begin{align}
k(\mathcal{T}_a,\mathcal{T}_b) &= \sum_{n_a \in \mathcal{V}(\mathcal{T}_a)}
\sum_{n_b \in \mathcal{V}(\mathcal{T}_b)} f(n_a, n_b) \\
\\
f(n_a,n_b) &= \sum_{t \in \{t_1,t_2,...\}} I_t(n_a) I_t(n_b)
\end{align}

>* $f(n_a,n_b)$: # common subset trees at $n_a$ and $n_b$

>|Condition|Then|
|-|-|
|$n_a \neq n_b$ **or** $\text{ch}(n_a) \neq \text{ch}(n_b)$|$f(n_a,n_b)=0$|
|else if $n_a$ and $n_b$ are leaf nodes|$f(n_a,n_b)=1$|
|otherwise|$f(n_a,n_b)=\prod^{\lvert\text{ch}(n_a)\rvert}_{i=1} g(\text{ch}(n_a)_i,\text{ch}(n_b)_i)$<br/><br/>$g(n_1,n_2) = \bigg\{ \begin{matrix} 1 & \text{if } n_1 \text{ or } n_2 \text{ leaf} \\ 1+f(n_1,n_2) & \text{otherwise} \end{matrix}$|
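
The recursion in the table can be sketched directly in plain Python. The tree encoding is an assumption of this example: each node is a pair `(label, children)`, and $n_a \neq n_b$ is read as "the node labels differ" while $\text{ch}(n_a) \neq \text{ch}(n_b)$ is read as "the sequences of child labels differ".

```python
def children_labels(node):
    return [child[0] for child in node[1]]

def is_leaf(node):
    return not node[1]

def f(na, nb):
    """Number of common subset trees rooted at na and nb, following the table."""
    if na[0] != nb[0] or children_labels(na) != children_labels(nb):
        return 0
    if is_leaf(na) and is_leaf(nb):
        return 1
    result = 1
    for ca, cb in zip(na[1], nb[1]):
        result *= g(ca, cb)
    return result

def g(n1, n2):
    if is_leaf(n1) or is_leaf(n2):
        return 1
    return 1 + f(n1, n2)

def nodes(tree):
    """Yield every node of the tree, root first."""
    yield tree
    for child in tree[1]:
        yield from nodes(child)

def tree_kernel(ta, tb):
    """k(Ta, Tb) = sum over all node pairs of f(na, nb)."""
    return sum(f(na, nb) for na in nodes(ta) for nb in nodes(tb))

# Hypothetical trees that differ in a single leaf label ("b" vs "d").
tree_a = ("S", [("A", [("a", []), ("b", [])]), ("c", [])])
tree_b = ("S", [("A", [("a", []), ("d", [])]), ("c", [])])

k_aa = tree_kernel(tree_a, tree_a)
k_ab = tree_kernel(tree_a, tree_b)
```

The double loop over node pairs costs $|\mathcal{V}(\mathcal{T}_a)| \cdot |\mathcal{V}(\mathcal{T}_b)|$ evaluations of $f$, which avoids enumerating the list of all subset trees explicitly.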

## 3.3. Graph Kernels

* **Graphs and Graph Walks**

>* **Graph:** $\mathcal{G}=( \mathcal{V},\mathcal{E} )$ where $\mathcal{E} \subseteq \{ (i,j);i,j\in\mathcal{V} \}$
>* **Graph Walks:** $k$-length walk defined as $w=\{v_1,...,v_{k+1}\}$ where $(v_i,v_{i+1}) \in \mathcal{E}$
>* Number of $k$-length walks between nodes $i$ and $j$:

>$$[\mathbf{A}^k]_{i,j} = \sum^{|\mathcal{V}|}_{s_1 = 1} \cdots \sum^{|\mathcal{V}|}_{s_{k-1} = 1} a_{i,s_1} a_{s_1,s_2} \cdots a_{s_{k-1},j}$$
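
Walk counts via powers of the adjacency matrix can be checked on a triangle graph, where the walks are easy to enumerate by hand. This sketch assumes NumPy; the graph is a hypothetical example.

```python
import numpy as np

# Adjacency matrix of a triangle: every pair of distinct nodes is connected.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])

def count_walks(A, k):
    """Entry (i, j) is the number of k-length walks from node i to node j."""
    return np.linalg.matrix_power(A, k)

walks_2 = count_walks(A, 2)
walks_3 = count_walks(A, 3)
```

For example, there are two walks of length 2 from node 0 back to itself (through node 1 or node 2) and one walk of length 2 from node 0 to node 1 (through node 2).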

* **Random-walk Graph Kernel**

>* $k(\mathcal{G},\mathcal{G}')$: # common walks in the two graphs
>* Computed using **direct product graph**

>\begin{align}
\mathcal{G}_\times &= (\mathcal{V}_\times, \mathcal{E}_\times) \\
\mathcal{V}_\times &= \{ (a,a') ; a\in\mathcal{V} \text{ and } a'\in\mathcal{V}' \} \\
\mathcal{E}_\times &= \{ ((a,a'),(b,b')) ; (a,b)\in\mathcal{E} \text{ and } (a',b')\in\mathcal{E}' \}
\end{align}

>* **Kernel Definition** ($\mathbf{A}_\times = \mathbf{A} \otimes \mathbf{A}'$ where $\otimes$ is the **Kronecker product**)

>$$k(\mathcal{G},\mathcal{G}') = \sum^{|\mathcal{V}_\times|}_{i,j = 1}
\left[ \sum^\infty_{n=0} \lambda^n \mathbf{A}^n_\times \right]_{i,j}
= \mathbf{1}^T [\mathbf{I} - \lambda \mathbf{A} \otimes \mathbf{A}']^{-1} \mathbf{1}
= \mathbf{1}^T \mathbf{x} \;\;\;\text{where}\;\;\; \mathbf{x} = \mathbf{1} + \lambda (\mathbf{A} \otimes \mathbf{A}')\mathbf{x}$$

>* **Problems:** 
>  * A walk can visit the same cycle multiple times $\rightarrow$ small structural similarities can produce huge kernel values
>  * High cost, $\mathcal{O}(n^3)$
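
The closed form of the random-walk kernel can be compared against a truncated version of the infinite sum. The sketch assumes NumPy and two small hypothetical graphs (a triangle and a 3-node path). The geometric series only converges when $\lambda$ is smaller than the inverse of the spectral radius of $\mathbf{A}_\times$, so $\lambda = 0.1$ is chosen with that in mind.

```python
import numpy as np

def random_walk_kernel(A, A_prime, lam):
    """1^T (I - lam * A_x)^{-1} 1, computed by solving x = 1 + lam * A_x x."""
    A_x = np.kron(A, A_prime)   # adjacency matrix of the direct product graph
    ones = np.ones(A_x.shape[0])
    x = np.linalg.solve(np.eye(A_x.shape[0]) - lam * A_x, ones)
    return ones @ x

# Hypothetical graphs: a triangle and a path with three nodes.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
A_prime = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)

lam = 0.1
k_closed = random_walk_kernel(A, A_prime, lam)

# Truncated series: sum over walk lengths n of lam^n * (sum of all entries of A_x^n).
A_x = np.kron(A, A_prime)
k_series = sum(lam ** n * np.linalg.matrix_power(A_x, n).sum() for n in range(60))
```

Solving the linear system instead of forming the inverse is a common design choice; it still operates on a matrix with $|\mathcal{V}||\mathcal{V}'|$ rows, which is where the high cost comes from.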

* **Weisfeiler-Lehman Graph Kernel**

>1. Create a set with labels of adjacent vertices & sort
>1. Add vertex label as a prefix
>1. Compress resulting label sequence into a **unique value**
>1. Assign the unique value as new **vertex label**
>1. Apply the **bag-of-words** kernel to the vertex labels (include the initial labels)
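
The steps above can be sketched in plain Python. Two simplifications are assumptions of this example: a graph is a pair `(adjacency, labels)` of dictionaries, and the "compressed unique value" is the tuple `(own label, sorted neighbour labels)` itself, which is unique by construction, instead of a hashed integer.

```python
from collections import Counter

def wl_label_counts(adjacency, labels, iterations):
    """Count vertex labels over the initial labelling and each relabelling round."""
    counts = Counter(labels.values())
    current = dict(labels)
    for _ in range(iterations):
        # Steps 1-4: sorted neighbour labels, prefixed by the vertex's own label.
        current = {
            v: (current[v], tuple(sorted(current[u] for u in adjacency[v])))
            for v in adjacency
        }
        counts.update(current.values())
    return counts

def wl_kernel(graph_a, graph_b, iterations=1):
    """Step 5: bag-of-words kernel on the collected label counts."""
    ca = wl_label_counts(*graph_a, iterations)
    cb = wl_label_counts(*graph_b, iterations)
    return sum(n * cb[label] for label, n in ca.items())

# Hypothetical labelled paths: a - b - a and a - b - b.
path = {0: [1], 1: [0, 2], 2: [1]}
graph_a = (path, {0: "a", 1: "b", 2: "a"})
graph_b = (path, {0: "a", 1: "b", 2: "b"})

k_ab = wl_kernel(graph_a, graph_b, iterations=1)
k_aa = wl_kernel(graph_a, graph_a, iterations=1)
```

After one round, the only refined label the two graphs share is "an `a` vertex whose single neighbour is `b`", so most of the similarity between them still comes from the initial labels.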

## 3.4. Fisher Kernel

* **Idea**

>* Use a **probabilistic generative model** to obtain a **fixed-length vector representation** of complex structured data

* **Steps**

>1. Train $p(\mathbf{x}|\boldsymbol{\theta})$ on $\{\mathbf{x}_n\}^N_{n=1}$ (e.g. using MLE)
>1. Define the **Fisher score vector** as $\boldsymbol{\phi}(\mathbf{x}_n) = \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}_n|\boldsymbol{\theta}) |_{\boldsymbol{\theta}_{\text{MLE}}}$
>1. Define **naive Fisher kernel** as $k(\mathbf{x}_n,\mathbf{x}_m) = \boldsymbol{\phi}(\mathbf{x}_n)^T \boldsymbol{\phi}(\mathbf{x}_m)$

* **Example: Mixture of Gaussians**

>\begin{align}
p(\mathbf{x}|\boldsymbol{\theta}) &= \sum^K_{k=1} p(\mathbf{x}|\theta_k)\pi_k \\
[\boldsymbol{\phi}(\mathbf{x}_n)]_{\pi_k} &= \frac{\partial \log p(\mathbf{x}_n|\boldsymbol{\theta})}{\partial \pi_k} \bigg|_{\boldsymbol{\theta}=\boldsymbol{\theta}_{\text{MLE}}}
= \frac{p(\mathbf{x}_n|\theta^{\text{MLE}}_k)}{\sum_j p(\mathbf{x}_n|\theta^{\text{MLE}}_j) \pi^{\text{MLE}}_j}
\end{align}

>* $[\boldsymbol{\phi}(\mathbf{x}_n)]_{\pi_k}$: amount by which the $k$-th component contributes to generating $\mathbf{x}_n$
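
The mixing-weight part of the Fisher score can be sketched for a 1-D mixture of Gaussians in plain Python. The parameters below are hypothetical stand-ins for a fitted model, and only the derivatives w.r.t. $\pi_k$ are used, as in the example above; a full Fisher score would also include derivatives w.r.t. the component parameters.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fisher_score_pi(x, means, variances, weights):
    """d log p(x | theta) / d pi_k for each component k."""
    component = [gaussian_pdf(x, m, v) for m, v in zip(means, variances)]
    mixture = sum(w * c for w, c in zip(weights, component))
    return [c / mixture for c in component]

def naive_fisher_kernel(x, z, means, variances, weights):
    """Dot product of the two score vectors."""
    phi_x = fisher_score_pi(x, means, variances, weights)
    phi_z = fisher_score_pi(z, means, variances, weights)
    return sum(p * q for p, q in zip(phi_x, phi_z))

# Hypothetical "fitted" two-component mixture.
means, variances, weights = [-2.0, 2.0], [1.0, 1.0], [0.5, 0.5]

score_mid = fisher_score_pi(0.0, means, variances, weights)    # equidistant from both means
score_left = fisher_score_pi(-2.0, means, variances, weights)  # at the first mean
k_same_side = naive_fisher_kernel(-2.0, -1.5, means, variances, weights)
k_opposite = naive_fisher_kernel(-2.0, 2.0, means, variances, weights)
```

Points explained by the same component get similar score vectors and therefore a larger kernel value than points explained by different components.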