# 1. Variational Auto Encoder

## 1.1. Latent Variable Models

* **Discrete LV**

>$$p(\mathbf{x}) = \sum^M_{m=1} P(c_m) p(\mathbf{x}|c_m)$$
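
As a minimal sketch of the discrete case, the snippet below evaluates the mixture density for one-dimensional Gaussian components. The Gaussian component form, the example values, and the names `gaussian_pdf` and `mixture_density` are illustrative assumptions; the notes do not fix the form of $p(\mathbf{x}|c_m)$.

```python
import numpy as np


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def mixture_density(x, priors, means, variances):
    """p(x) = sum_m P(c_m) p(x | c_m), with Gaussian components (assumed)."""
    # Add a trailing axis so x broadcasts against the M components.
    x = np.asarray(x, dtype=float)[..., None]
    return np.sum(priors * gaussian_pdf(x, means, variances), axis=-1)


# Hypothetical two-component example.
priors = np.array([0.3, 0.7])
means = np.array([-1.0, 2.0])
variances = np.array([0.5, 1.5])
grid = np.linspace(-10.0, 12.0, 4401)
density = mixture_density(grid, priors, means, variances)
```

Because the priors sum to one, the density should integrate to roughly one over a wide enough grid.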

* **Continuous LV**

>$$p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})d\mathbf{z}$$

>* $p(\mathbf{z}) = \mathcal{N}(\mathbf{z};\mathbf{0},\mathbf{I})$
>* $p(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\mathbf{x};\mathbf{f}(\mathbf{z}),\sigma^2 \mathbf{I})$
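
The integral above can be approximated by sampling $\mathbf{z}$ from the prior and averaging $p(\mathbf{x}|\mathbf{z})$. The sketch below does this for an arbitrary mapping `f`; the function names, the example matrix `A`, and the sample count are assumptions made for illustration. A linear `f` is used in the example only because the marginal is then known in closed form, $\mathcal{N}(\mathbf{x};\mathbf{0},\mathbf{AA}^T + \sigma^2\mathbf{I})$, which gives something to compare against.

```python
import numpy as np


def log_gaussian_isotropic(x, mean, sigma2):
    """log N(x; mean, sigma2 * I), evaluated row-wise over `mean`."""
    d = x.shape[-1]
    sq_dist = np.sum((x - mean) ** 2, axis=-1)
    return -0.5 * (d * np.log(2.0 * np.pi * sigma2) + sq_dist / sigma2)


def mc_marginal_likelihood(x, f, sigma2, latent_dim, num_samples, rng):
    """Estimate p(x) = E_{p(z)}[p(x|z)] with samples z ~ N(0, I)."""
    z = rng.standard_normal((num_samples, latent_dim))
    return np.mean(np.exp(log_gaussian_isotropic(x, f(z), sigma2)))


# Hypothetical linear mapping f(z) = A z (rows of z are samples).
A = np.array([[1.0], [0.5]])
x = np.array([0.3, -0.2])
sigma2 = 0.5
rng = np.random.default_rng(0)
estimate = mc_marginal_likelihood(x, lambda z: z @ A.T, sigma2, 1, 200_000, rng)

# Closed-form marginal for the linear case, for comparison.
C = A @ A.T + sigma2 * np.eye(2)
exact = np.exp(-0.5 * x @ np.linalg.solve(C, x)) / np.sqrt(
    (2.0 * np.pi) ** 2 * np.linalg.det(C)
)
```

Sampling from the prior is only practical in low dimensions; for a high-dimensional $\mathbf{z}$ most prior samples contribute almost nothing to the average.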

* **Factor Analysis**, $\mathbf{f}(\mathbf{z}) = \mathbf{Az}$

>* **Auxiliary function:**

>$$\mathcal{Q}(\boldsymbol{\lambda}, \tilde{\boldsymbol{\lambda}}) = \int p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda}) \log (p(\mathbf{x},\mathbf{z};\tilde{\boldsymbol{\lambda}})) d\mathbf{z}$$

>* **General mapping using NN**

>$$p(\mathbf{x}|\mathbf{z};\boldsymbol{\lambda}) = \mathcal{N}(\mathbf{x};\mathbf{f}(\mathbf{z};\boldsymbol{\lambda}),\sigma^2 \mathbf{I})$$

>* No simple closed-form solution $\rightarrow$ adopt **variational** approaches

## 1.2. KL Divergence

* **Definition**

>$$\mathcal{KL}(p(\mathbf{x})||q(\mathbf{x})) = \int p(\mathbf{x}) \log \left( \frac{p(\mathbf{x})}{q(\mathbf{x})} \right) d\mathbf{x} = - \int p(\mathbf{x}) \log \left( \frac{q(\mathbf{x})}{p(\mathbf{x})} \right) d\mathbf{x}$$

* **Inequality** (use $\log y \leq y-1$), hence $\mathcal{KL}(p(\mathbf{x})||q(\mathbf{x})) \geq 0$

>$$\int p(\mathbf{x}) \log \left( \frac{q(\mathbf{x})}{p(\mathbf{x})} \right) d\mathbf{x} \leq
\int p(\mathbf{x}) \left( \frac{q(\mathbf{x})}{p(\mathbf{x})} - 1 \right) d\mathbf{x} = 0$$

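* **KL Divergence for Gaussians**, $p(\mathbf{x}) = \mathcal{N}(\mathbf{x};\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_1)$ and $q(\mathbf{x}) = \mathcal{N}(\mathbf{x};\boldsymbol{\mu}_2,\boldsymbol{\Sigma}_2)$

>$$\mathcal{KL}(p(\mathbf{x})||q(\mathbf{x})) = \frac{1}{2} \left( \text{tr} (\boldsymbol{\Sigma}^{-1}_2 \boldsymbol{\Sigma}_1 - \mathbf{I}) + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}^{-1}_2 (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) + \log \left( \frac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|} \right) \right)$$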
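
The closed form above translates directly into a few lines of NumPy. In this sketch, index 1 refers to the mean and covariance of $p$ and index 2 to those of $q$; the function name `gaussian_kl` and the example values are illustrative assumptions.

```python
import numpy as np


def gaussian_kl(mu1, cov1, mu2, cov2):
    """KL( N(mu1, cov1) || N(mu2, cov2) ) for full covariance matrices."""
    d = mu1.shape[0]
    cov2_inv = np.linalg.inv(cov2)
    diff = mu1 - mu2
    # slogdet is used instead of log(det(...)) for numerical stability.
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    trace_term = np.trace(cov2_inv @ cov1) - d
    quad_term = diff @ cov2_inv @ diff
    return 0.5 * (trace_term + quad_term + logdet2 - logdet1)


# Hypothetical example: unit-variance 1-D Gaussians one unit apart.
kl_shifted = gaussian_kl(np.zeros(1), np.eye(1), np.ones(1), np.eye(1))
```

With equal covariances only the quadratic term survives, so the example evaluates to half the squared distance between the means.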

## 1.3. Variational EM

* **EM: Auxiliary Function Maximization**

>$$\boldsymbol{\lambda}^{(k+1)} = \text{argmax}_{\boldsymbol{\lambda}} \left( \mathcal{Q}(\boldsymbol{\lambda}^{(k)},\boldsymbol{\lambda}) \right) = \text{argmax}_{\boldsymbol{\lambda}} \left( \int p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda}^{(k)}) \log (p(\mathbf{x},\mathbf{z};\boldsymbol{\lambda})) d\mathbf{z} \right)$$

>$=$ **expected value of the log joint distribution**

>\begin{align}
\log (p(\mathbf{x},\mathbf{z};\boldsymbol{\lambda})) &= \log (p(\mathbf{x}|\mathbf{z};\boldsymbol{\lambda})) + \log (p(\mathbf{z})) \\
&= \log (\mathcal{N} (\mathbf{x}; \mathbf{Az}, \boldsymbol{\Sigma}_{\text{diag}})) + \log (\mathcal{N}(\mathbf{z};\mathbf{0},\mathbf{I})) \\
&= -\frac{1}{2} (\mathbf{z}^T \mathbf{A}^T \boldsymbol{\Sigma}^{-1}_{\text{diag}} \mathbf{Az} - 2\mathbf{z}^T \mathbf{A}^T \boldsymbol{\Sigma}^{-1}_{\text{diag}} \mathbf{x} + \mathbf{x}^T \boldsymbol{\Sigma}^{-1}_{\text{diag}} \mathbf{x}) - \frac{1}{2} \log (|\boldsymbol{\Sigma}_{\text{diag}}|) + C
\end{align}


* **Estimate $\boldsymbol{\lambda}$** by maximizing the **Log-likelihood** $\mathcal{L}(\boldsymbol{\lambda})$

>\begin{align}
\mathcal{L}(\boldsymbol{\lambda}) &= \log (p(\mathbf{x};\boldsymbol{\lambda})) \\
&= \int p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda}) \log (p(\mathbf{x};\boldsymbol{\lambda})) d\mathbf{z} \\
&= \int p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda}) \log \left( \frac{p(\mathbf{x};\boldsymbol{\lambda})p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda})}{p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda})} \right) d\mathbf{z} \\
&= \bigg\langle \log \left( \frac{p(\mathbf{x},\mathbf{z};\boldsymbol{\lambda})}{p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda})} \right) \bigg\rangle_{p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda})}
\end{align}

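>* **Requires the posterior** $p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda})$, which is usually intractable $\rightarrow$ since $\log (p(\mathbf{x};\boldsymbol{\lambda}))$ does not depend on $\mathbf{z}$, average over any valid distribution $q(\mathbf{z};\tilde{\boldsymbol{\lambda}})$ instead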

>\begin{align}
\mathcal{L}(\boldsymbol{\lambda}) &= \bigg\langle \log \left( \frac{p(\mathbf{x},\mathbf{z};\boldsymbol{\lambda})}{p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda})} \right) \bigg\rangle_{q(\mathbf{z};\tilde{\boldsymbol{\lambda}})} \\
&= \bigg\langle \log \left( \frac{p(\mathbf{x},\mathbf{z};\boldsymbol{\lambda})}{q(\mathbf{z};\tilde{\boldsymbol{\lambda}})} \right) \bigg\rangle_{q(\mathbf{z};\tilde{\boldsymbol{\lambda}})}
+ \bigg\langle \log \left( \frac{q(\mathbf{z};\tilde{\boldsymbol{\lambda}})}{p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda})} \right) \bigg\rangle_{q(\mathbf{z};\tilde{\boldsymbol{\lambda}})} \\
&\geq \bigg\langle \log \left( \frac{p(\mathbf{x},\mathbf{z};\boldsymbol{\lambda})}{q(\mathbf{z};\tilde{\boldsymbol{\lambda}})} \right) \bigg\rangle_{q(\mathbf{z};\tilde{\boldsymbol{\lambda}})}
=\mathcal{F} (q(\mathbf{z};\tilde{\boldsymbol{\lambda}}),\boldsymbol{\lambda})
\end{align}

>* The dropped term is $\mathcal{KL}(q(\mathbf{z};\tilde{\boldsymbol{\lambda}})||p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda})) \geq 0$, so $\mathcal{F}$ is a lower bound on $\mathcal{L}(\boldsymbol{\lambda})$, with equality when $q$ equals the posterior

* **EM Revisited**

>* **Set** $q(\mathbf{z};\tilde{\boldsymbol{\lambda}}^{(k)}) = p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda}^{(k)})$

>$$\mathcal{L}(\boldsymbol{\lambda}^{(k)}) = \mathcal{F} (q(\mathbf{z};\tilde{\boldsymbol{\lambda}}^{(k)}),\boldsymbol{\lambda}^{(k)})$$

>* **Maximize $\mathcal{F} (q(\mathbf{z};\tilde{\boldsymbol{\lambda}}^{(k)}),\boldsymbol{\lambda})$ to find $\boldsymbol{\lambda}^{(k+1)}$**

>$$\boldsymbol{\lambda}^{(k+1)} = \text{argmax}_{\boldsymbol{\lambda}} \Bigg\{ \bigg\langle \log \left( \frac{p(\mathbf{x},\mathbf{z};\boldsymbol{\lambda})}{q(\mathbf{z};\tilde{\boldsymbol{\lambda}}^{(k)})} \right) \bigg\rangle_{q(\mathbf{z};\tilde{\boldsymbol{\lambda}}^{(k)})} \Bigg\}$$

>* **Minimize KL divergence for variational approximation**

>$$q(\mathbf{z};\tilde{\boldsymbol{\lambda}}^{(k+1)}) = \text{argmin}_{q(\mathbf{z};\tilde{\boldsymbol{\lambda}})} 
\Bigg\{ \bigg\langle
\log \left(
\frac{q(\mathbf{z};\tilde{\boldsymbol{\lambda}})}{p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda}^{(k+1)})}
\right)
\bigg\rangle_{q(\mathbf{z};\tilde{\boldsymbol{\lambda}})}\Bigg\}$$

>* **Which occurs at $q(\mathbf{z};\tilde{\boldsymbol{\lambda}}^{(k+1)}) = p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda}^{(k+1)})$**

* **EM Guarantees:**

>$$\mathcal{L}(\boldsymbol{\lambda}^{(k)}) = \mathcal{F} \left( q(\mathbf{z};\tilde{\boldsymbol{\lambda}}^{(k)}), \boldsymbol{\lambda}^{(k)} \right)
\leq \mathcal{F} \left( q(\mathbf{z};\tilde{\boldsymbol{\lambda}}^{(k)}), \boldsymbol{\lambda}^{(k+1)} \right)
\leq \mathcal{L}(\boldsymbol{\lambda}^{(k+1)})$$

>* provided $q(\mathbf{z};\tilde{\boldsymbol{\lambda}}^{(k)}) = p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda}^{(k)})$
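
The monotonicity guarantee can be checked numerically on the factor analysis model, where the posterior is Gaussian and both EM steps are available in closed form. The sketch below uses the standard factor analysis updates for zero-mean data; the update formulas are not derived in these notes, and the function names, data sizes, and initial values are assumptions chosen for illustration.

```python
import numpy as np


def fa_log_likelihood(X, A, psi):
    """sum_n log N(x_n; 0, A A^T + diag(psi))."""
    n, d = X.shape
    C = A @ A.T + np.diag(psi)
    _, logdet = np.linalg.slogdet(C)
    S = X.T @ X / n
    return -0.5 * n * (d * np.log(2.0 * np.pi) + logdet
                       + np.trace(np.linalg.solve(C, S)))


def fa_em_step(X, A, psi):
    """One EM iteration with q(z) set to the exact posterior p(z|x)."""
    n, _ = X.shape
    k = A.shape[1]
    # E-step: p(z|x) = N(G A^T Psi^{-1} x, G), G = (I + A^T Psi^{-1} A)^{-1}.
    psi_inv_A = A / psi[:, None]
    G = np.linalg.inv(np.eye(k) + A.T @ psi_inv_A)
    Ez = X @ psi_inv_A @ G                # posterior means, one row per x_n
    Ezz = n * G + Ez.T @ Ez               # sum_n E[z_n z_n^T]
    # M-step: maximize the expected log joint distribution.
    A_new = (X.T @ Ez) @ np.linalg.inv(Ezz)
    psi_new = np.diag(X.T @ X - A_new @ (Ez.T @ X)) / n
    return A_new, psi_new


# Synthetic data from a hypothetical 2-factor model in 5 dimensions.
rng = np.random.default_rng(0)
A_true = rng.standard_normal((5, 2))
X = rng.standard_normal((500, 2)) @ A_true.T
X = X + np.sqrt(0.3) * rng.standard_normal((500, 5))

A, psi = rng.standard_normal((5, 2)), np.ones(5)
log_liks = [fa_log_likelihood(X, A, psi)]
for _ in range(30):
    A, psi = fa_em_step(X, A, psi)
    log_liks.append(fa_log_likelihood(X, A, psi))
```

The entries of `log_liks` should never decrease, up to floating-point error, which is the chain of inequalities above applied once per iteration.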


* **Variational EM** (no guarantees)

>$$\mathcal{L}(\boldsymbol{\lambda}^{(k)}) \geq \mathcal{F} \left( q(\mathbf{z};\tilde{\boldsymbol{\lambda}}^{(k)}), \boldsymbol{\lambda}^{(k)} \right)
\leq \mathcal{F} \left( q(\mathbf{z};\tilde{\boldsymbol{\lambda}}^{(k)}), \boldsymbol{\lambda}^{(k+1)} \right)
\leq \mathcal{L}(\boldsymbol{\lambda}^{(k+1)})$$

>* Allows any form of $q(\mathbf{z};\tilde{\boldsymbol{\lambda}})$, e.g. the mean-field approximation $q(\mathbf{z};\tilde{\boldsymbol{\lambda}}) = \prod^n_{i=1} q_i (z_i ; \tilde{\boldsymbol{\lambda}})$

## 1.4. Variational Auto Encoder

* **Variational Auto Encoder**

>$$p(\mathbf{x};\boldsymbol{\lambda}) = \int p(\mathbf{x}|\mathbf{z};\boldsymbol{\lambda})p(\mathbf{z})d\mathbf{z}
= \int \mathcal{N} (\mathbf{x};\mathbf{f}(\mathbf{z};\boldsymbol{\lambda}), \sigma^2 \mathbf{I}) p(\mathbf{z}) 
d\mathbf{z}$$

>* $\mathbf{f}(\mathbf{z};\boldsymbol{\lambda})$ is a non-linear DNN $\rightarrow$ the integral over $\mathbf{z}$ has no closed form
>* Closed-form VEM updates are (normally) not possible for a DNN

* **(Stochastic) Gradient Descent** - approximate gradient by lower-bound gradient

>\begin{align}
\nabla \mathcal{L}(\boldsymbol{\lambda}) &\approx \nabla \left( \bigg\langle
\log \left( \frac{p(\mathbf{x},\mathbf{z};\boldsymbol{\lambda})}{q(\mathbf{z};\tilde{\boldsymbol{\lambda}})}
\right)
\bigg \rangle_{q(\mathbf{z};\tilde{\boldsymbol{\lambda}})} \right) \\
&= \nabla \left(
\langle \log (p(\mathbf{x}|\mathbf{z};\boldsymbol{\lambda})) \rangle_{q(\mathbf{z};\tilde{\boldsymbol{\lambda}})}
+ \bigg\langle \log \left( \frac{p(\mathbf{z})}{q(\mathbf{z};\tilde{\boldsymbol{\lambda}})} \right)
\bigg\rangle_{q(\mathbf{z};\tilde{\boldsymbol{\lambda}})}
\right)
\end{align}

>* $p(\mathbf{x}|\mathbf{z};\boldsymbol{\lambda}) = \mathcal{N}(\mathbf{x};\mathbf{f}(\mathbf{z};\boldsymbol{\lambda}),\sigma^2 \mathbf{I})$
>* $p(\mathbf{z})$: also Gaussian
>* **Iterative Optimization**
>  * Optimize $\boldsymbol{\lambda}$ (model parameters)
>  * Optimize $\tilde{\boldsymbol{\lambda}}$ (variational approximation)

* **Variational Form**

>* Introduce dependence on $\mathbf{x}$

>$$q(\mathbf{z};\tilde{\boldsymbol{\lambda}}) \rightarrow q(\mathbf{z}|\mathbf{x};\tilde{\boldsymbol{\lambda}}) = \mathcal{N}(\mathbf{z};\mathbf{f}_{\boldsymbol{\mu}} (\mathbf{x};\tilde{\boldsymbol{\lambda}}), \mathbf{f}_{\boldsymbol{\Sigma}} (\mathbf{x};\tilde{\boldsymbol{\lambda}}))$$

>* Everything is Gaussian $\rightarrow$ rewrite $\mathcal{L}(\boldsymbol{\lambda})$

>$$\mathcal{L}(\boldsymbol{\lambda}) = 
\Bigg\langle \log \left( 
\frac{q(\mathbf{z}|\mathbf{x};\tilde{\boldsymbol{\lambda}})}
{p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda})} \right)
+ \log (p(\mathbf{x}|\mathbf{z};\boldsymbol{\lambda})) 
+ \log \left( \frac{p(\mathbf{z})}
{q(\mathbf{z}|\mathbf{x};\tilde{\boldsymbol{\lambda}})} 
\right) \Bigg\rangle_{q(\mathbf{z}|\mathbf{x};\tilde{\boldsymbol{\lambda}})}$$

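>* **1st Term: approximation error**, $\mathcal{KL}(q(\mathbf{z}|\mathbf{x};\tilde{\boldsymbol{\lambda}})||p(\mathbf{z}|\mathbf{x};\boldsymbol{\lambda})) \geq 0$
>  * Dropped to yield the lower bound, whose gradient approximates $\nabla \mathcal{L}(\boldsymbol{\lambda})$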
>* **2nd Term: decoding** (compute probability given $\mathbf{z}$)
>  * Difficult if $\mathbf{f}(\mathbf{z})$ is non-linear
>* **3rd Term: encoding** (encode information about $\mathbf{x}$ into $\mathbf{z}$)
>  * (-) **KL Divergence between Gaussians** $\rightarrow$ closed-form solution
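
As a worked instance of the 3rd term, assume (beyond what the notes state) that the encoder covariance is diagonal, $\mathbf{f}_{\boldsymbol{\Sigma}} = \text{diag}(\sigma^2_1,...,\sigma^2_d)$, write $\boldsymbol{\mu} = \mathbf{f}_{\boldsymbol{\mu}}(\mathbf{x};\tilde{\boldsymbol{\lambda}})$, and take $p(\mathbf{z}) = \mathcal{N}(\mathbf{z};\mathbf{0},\mathbf{I})$. Substituting $\boldsymbol{\Sigma}_1 = \text{diag}(\sigma^2_j)$, $\boldsymbol{\mu}_1 = \boldsymbol{\mu}$, $\boldsymbol{\Sigma}_2 = \mathbf{I}$, $\boldsymbol{\mu}_2 = \mathbf{0}$ into the Gaussian KL formula of Section 1.2 gives:

>$$\Bigg\langle \log \left( \frac{p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x};\tilde{\boldsymbol{\lambda}})} \right) \Bigg\rangle_{q(\mathbf{z}|\mathbf{x};\tilde{\boldsymbol{\lambda}})}
= - \mathcal{KL}(q(\mathbf{z}|\mathbf{x};\tilde{\boldsymbol{\lambda}})||p(\mathbf{z}))
= \frac{1}{2} \sum^d_{j=1} \left( 1 + \log (\sigma^2_j) - \mu^2_j - \sigma^2_j \right)$$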

* **Monte-Carlo Approximation** and **Reparameterization Trick**

>\begin{align}
\langle \log (p(\mathbf{x}|\mathbf{z};\boldsymbol{\lambda})) \rangle_{q(\mathbf{z}|\mathbf{x};\tilde{\boldsymbol{\lambda}})}
&\approx \frac{1}{K} \sum^K_{i=1} \log (p(\mathbf{x}|\mathbf{z}^{(i)};\boldsymbol{\lambda})) \\
q(\mathbf{z}|\mathbf{x};\tilde{\boldsymbol{\lambda}}) &= 
\mathcal{N}(\mathbf{z};\mathbf{f}_{\boldsymbol{\mu}} (\mathbf{x};\tilde{\boldsymbol{\lambda}}), 
\mathbf{f}_{\boldsymbol{\Sigma}} (\mathbf{x};\tilde{\boldsymbol{\lambda}})) \\
\mathbf{z}^{(i)} &= \mathbf{f}_{\boldsymbol{\mu}} (\mathbf{x};\tilde{\boldsymbol{\lambda}})
+ \mathbf{f}_{\boldsymbol{\Sigma}} (\mathbf{x};\tilde{\boldsymbol{\lambda}})^{1/2} \boldsymbol{\epsilon}^{(i)}
\;\;\;,\;\;\; \boldsymbol{\epsilon}^{(i)} \sim \mathcal{N}(\mathbf{0},\mathbf{I})
\end{align}

>* **As a result,**

>$$\mathcal{L}(\boldsymbol{\lambda}) \geq
\langle \log (p(\mathbf{x}|
\mathbf{f}_{\boldsymbol{\mu}} (\mathbf{x};\tilde{\boldsymbol{\lambda}})
+ \mathbf{f}_{\boldsymbol{\Sigma}} (\mathbf{x};\tilde{\boldsymbol{\lambda}})^{1/2} \boldsymbol{\epsilon}
;\boldsymbol{\lambda})) \rangle_{\mathcal{N}(\mathbf{0},\mathbf{I})}
+ \Bigg\langle \log \left( \frac{p(\mathbf{z})}
{q(\mathbf{z}|\mathbf{x};\tilde{\boldsymbol{\lambda}})} 
\right) \Bigg\rangle_{q(\mathbf{z}|\mathbf{x};\tilde{\boldsymbol{\lambda}})}
$$
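
The lower bound above can be sketched in NumPy as a Monte-Carlo decoding term plus an analytic encoding term. Several things here are assumptions made for illustration: the encoder outputs are fixed vectors `mu` and `var` standing in for $\mathbf{f}_{\boldsymbol{\mu}}$ and a diagonal $\mathbf{f}_{\boldsymbol{\Sigma}}$, the prior is $\mathcal{N}(\mathbf{0},\mathbf{I})$, and the decoder in the example is linear so that the exact log-likelihood is available for comparison. No gradients are computed; a real VAE would use an automatic differentiation framework.

```python
import numpy as np


def reparameterize(mu, var, eps):
    """z = mu + var^{1/2} * eps, for a diagonal covariance (assumed)."""
    return mu + np.sqrt(var) * eps


def elbo_estimate(x, mu, var, decoder, sigma2, num_samples, rng):
    """Monte-Carlo decoding term plus closed-form encoding term."""
    eps = rng.standard_normal((num_samples, mu.shape[0]))
    z = reparameterize(mu, var, eps)                 # (K, latent_dim)
    sq_err = np.sum((x - decoder(z)) ** 2, axis=-1)  # (K,)
    log_lik = -0.5 * (x.shape[0] * np.log(2.0 * np.pi * sigma2)
                      + sq_err / sigma2)
    # -KL( N(mu, diag(var)) || N(0, I) ), available in closed form.
    neg_kl = 0.5 * np.sum(1.0 + np.log(var) - mu ** 2 - var)
    return np.mean(log_lik) + neg_kl


# Hypothetical 1-D linear decoder f(z) = a z, so p(x) = N(x; 0, a^2 + sigma2).
a, sigma2 = 2.0, 0.5
x = np.array([1.5])
exact = -0.5 * (np.log(2.0 * np.pi * (a ** 2 + sigma2))
                + x[0] ** 2 / (a ** 2 + sigma2))

# Exact posterior for the linear model, and a deliberately poor q (the prior).
post_var = 1.0 / (1.0 + a ** 2 / sigma2)
post_mu = post_var * a * x / sigma2
rng = np.random.default_rng(0)
elbo_posterior = elbo_estimate(x, post_mu, np.array([post_var]),
                               lambda z: a * z, sigma2, 200_000, rng)
elbo_prior = elbo_estimate(x, np.zeros(1), np.ones(1),
                           lambda z: a * z, sigma2, 200_000, rng)
```

When $q$ matches the exact posterior the bound should agree with the exact log-likelihood up to sampling noise, and any other $q$ should give a strictly smaller value.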

# 2. Ensemble Methods

## 2.1. Introduction

* **Majority Voting** ($N$ classifiers with independent errors, each with error rate $p_e$, $N$ odd)

>$$P(\text{error}) = \sum^N_{i=\lceil N/2 \rceil} \left( \begin{matrix} N \\ i \end{matrix} \right)
p^i_e (1-p_e)^{N-i}$$
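
A short sketch of this binomial sum follows. It assumes an odd number of classifiers with independent errors, so there are no ties and the sum starts at the smallest strict majority; the function name `majority_vote_error` is an illustrative choice.

```python
from math import comb


def majority_vote_error(n, p_e):
    """P(majority is wrong) for n independent classifiers (n odd, assumed)."""
    if n % 2 == 0:
        raise ValueError("this sketch assumes an odd number of classifiers")
    first_majority = n // 2 + 1  # smallest number of wrong votes that wins
    return sum(
        comb(n, i) * p_e ** i * (1.0 - p_e) ** (n - i)
        for i in range(first_majority, n + 1)
    )


# Hypothetical error rate of 0.3 per classifier.
error_3 = majority_vote_error(3, 0.3)
error_11 = majority_vote_error(11, 0.3)
```

For $p_e < 0.5$ the ensemble error falls as $N$ grows, but only under the independence assumption, which rarely holds exactly for models trained on the same data.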

* **Bayesian Approaches** and **Monte-Carlo Approximation**

>* Consider an ensemble of **discriminative classifiers**

>\begin{align}
\hat{\omega} &= \text{argmax}_{{\omega}} \left\{ 
\sum_{\mathcal{M}} \int P({\omega}|\mathbf{x}^\star;\boldsymbol{\theta},\mathcal{M})
p(\boldsymbol{\theta}|\mathcal{M},\mathcal{D})
P(\mathcal{M}|\mathcal{D})d\boldsymbol{\theta} \right\} \\
&\approx \text{argmax}_{{\omega}} \left\{ 
\frac{1}{N} \sum^N_{j=1} P({\omega}|\mathbf{x}^\star;\mathcal{M}^{(j)})
\right\} \;\;\;,\;\;\;
\mathcal{M}^{(j)} \sim p(\boldsymbol{\theta},\mathcal{M}|\mathcal{D})
\end{align}
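
The Monte-Carlo approximation amounts to averaging the class posteriors of the ensemble members and taking the argmax. In the sketch below, the array of member posteriors is made up for illustration and `ensemble_predict` is a hypothetical helper name.

```python
import numpy as np


def ensemble_predict(member_posteriors):
    """Average P(omega | x*; M^(j)) over members, then pick the argmax.

    member_posteriors has shape (N, num_classes), one row per member.
    """
    averaged = np.mean(member_posteriors, axis=0)
    return int(np.argmax(averaged)), averaged


# Hypothetical posteriors from three members over two classes.
member_posteriors = np.array([
    [0.60, 0.40],
    [0.10, 0.90],
    [0.55, 0.45],
])
predicted_class, averaged = ensemble_predict(member_posteriors)
```

In this example two of the three members prefer class 0, yet the averaged posterior selects class 1, because averaging takes each member's confidence into account while majority voting does not.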

## 2.2. Ensemble Generation

* **Random Network Initialization** (each $\mathcal{M}^{(i)}$ at local optimum)

>\begin{align}
\tilde{\mathcal{M}}^{(i)} &\sim p(\boldsymbol{\theta}|\mathcal{M}) 
\;\;\;\leftarrow\;\;\; \text{prior over parameters}\\
\mathcal{M}^{(i)} &= \text{argmax}_{\boldsymbol{\theta}} \left\{ 
\sum^n_{j=1} \log \left( P \left( y_j | \mathbf{x}_j ; \boldsymbol{\theta}, \tilde{\mathcal{M}}^{(i)} \right)
\right)
\right\}
\end{align}

* **Bagging** - train each model on a random subset of data $\tilde{\mathcal{D}}$
* **Monte Carlo Dropout** - deactivate a random subset of nodes for each ensemble member
* **Ad hoc Models** - vary the number/size of layers, activation function, cost function, ...

## 2.3. Model Compression: Teacher-Student Training

* **Cross Entropy Training Criterion** 

>$$\mathcal{D}=\{\mathbf{x}_{1:n}, y_{1:n}\} \;\;\;,\;\;\; y_i \in \{\omega_1,...,\omega_K\}$$

>$$\mathcal{F}_{ce} = - \sum^n_{i=1} \log P(y_i|\mathbf{x}_i;\mathcal{M}_S)
= - \sum^n_{i=1} \sum_{\omega} \delta(y_i,\omega) \log P(\omega|\mathbf{x}_i;\mathcal{M}_S)$$

>* $\delta(y_i,\omega)$: Kronecker delta ($1$ if $y_i = \omega$, $0$ otherwise); the inner sum runs over all classes

* **Modify Targets based on a Teacher Network** (soft targets)

>$$\mathcal{F}_{ts} = - \sum^n_{i=1} \sum_{\omega} 
P(\omega|\mathbf{x}_i;\mathcal{M}_T)
\log P(\omega|\mathbf{x}_i;\mathcal{M}_S)$$

* **Replace the Teacher by an Ensemble**

>$$\mathcal{F}_{ts} = - \sum^n_{i=1} \sum_{\omega} 
\left[
\frac{1}{N} \sum^N_{j=1} P \left( \omega|\mathbf{x}_i;\mathcal{M}^{(j)} \right)
\right]
\log P(\omega|\mathbf{x}_i;\mathcal{M}_S)$$
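
The three criteria differ only in the target distribution that weights the student's log-posteriors. A minimal sketch follows, with made-up probabilities; the function names are assumptions, and the student is represented only by its output log-probabilities rather than an actual network.

```python
import numpy as np


def cross_entropy_loss(student_log_probs, labels):
    """F_ce: negative log-probability of the reference label, summed over data."""
    n = len(labels)
    return -np.sum(student_log_probs[np.arange(n), labels])


def teacher_student_loss(student_log_probs, teacher_probs):
    """F_ts: cross entropy against soft targets of shape (n, K)."""
    return -np.sum(teacher_probs * student_log_probs)


def ensemble_targets(member_probs):
    """Average member posteriors of shape (N, n, K) into soft targets (n, K)."""
    return np.mean(member_probs, axis=0)


# Hypothetical student outputs for n = 2 samples and K = 3 classes.
student_log_probs = np.log(np.array([[0.7, 0.2, 0.1],
                                     [0.1, 0.8, 0.1]]))
labels = np.array([0, 1])

# Hypothetical posteriors from an ensemble of N = 2 teachers.
member_probs = np.array([
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]],
    [[0.6, 0.3, 0.1], [0.0, 0.9, 0.1]],
])
soft_targets = ensemble_targets(member_probs)
loss_ce = cross_entropy_loss(student_log_probs, labels)
loss_ts = teacher_student_loss(student_log_probs, soft_targets)
```

If the teacher targets are one-hot, `teacher_student_loss` reduces to `cross_entropy_loss`, which mirrors the Kronecker delta form of $\mathcal{F}_{ce}$.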