## The variational auto-encoder

## Intractability
Variational techniques are typically used to form an approximation for:

${\displaystyle P(\mathbf {Z} \mid \mathbf {X} )={\frac {P(\mathbf {X} \mid \mathbf {Z} )P(\mathbf {Z} )}{P(\mathbf {X} )}}={\frac {P(\mathbf {X} \mid \mathbf {Z} )P(\mathbf {Z} )}{\int _{\mathbf {Z} }P(\mathbf {X} ,\mathbf {Z} )\,d\mathbf {Z} }}}$


The marginalization over ${\mathbf  Z}$ to calculate ${\displaystyle P(\mathbf {X} )}$ in the denominator is typically intractable because, for example, the search space of ${\mathbf  Z}$ is combinatorially large. Therefore, we seek an approximation ${\displaystyle Q(\mathbf {Z} )\approx P(\mathbf {Z} \mid \mathbf {X} )}$.
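To make the cost of the denominator concrete, here is a minimal sketch that computes an exact posterior by brute-force enumeration. The model in `toy_log_joint` (binary latents, a Gaussian observation centered on their sum) is a hypothetical example chosen only for illustration; it is not from the references.

```python
import itertools
import math

def exact_posterior(x, n_latent, log_joint):
    """Return ({z: p(z | x)}, p(x)) by summing the joint over all binary z."""
    configs = list(itertools.product([0, 1], repeat=n_latent))
    joint = [math.exp(log_joint(x, z)) for z in configs]
    evidence = sum(joint)  # p(x): the sum that becomes intractable
    posterior = {z: j / evidence for z, j in zip(configs, joint)}
    return posterior, evidence

def toy_log_joint(x, z):
    # Hypothetical model: z_i ~ Bernoulli(0.5), x | z ~ Normal(sum(z), 1).
    log_prior = len(z) * math.log(0.5)
    log_lik = -0.5 * (x - sum(z)) ** 2 - 0.5 * math.log(2 * math.pi)
    return log_prior + log_lik

posterior, evidence = exact_posterior(x=2.0, n_latent=3, log_joint=toy_log_joint)
print(len(posterior), "terms in the sum for 3 binary latents")
```

The sum has $2^n$ terms for $n$ binary latents, so this approach stops being usable long before $n$ reaches the sizes seen in practice.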

## Intractable probability distribution

Refs: [1](https://stats.stackexchange.com/questions/4417/what-are-the-factors-that-cause-the-posterior-distributions-to-be-intractable), [2](https://arxiv.org/pdf/1601.00670.pdf), [3](https://stats.stackexchange.com/questions/208176/why-is-the-posterior-distribution-in-bayesian-inference-often-intractable)

## Variational Lower Bound

Assume that $X$ denotes the observations (data) and $Z$ the hidden variables. The hidden variables might include the "parameters". The relationship between these two variables can be represented using the following graphical model:

<img src='images/hidden_observed.jpg'>

Moreover, uppercase $P(X)$ denotes the probability distribution over that variable, and
lowercase $p(X)$ is the density function of the distribution of $X$.

The posterior distribution of the hidden variables can then be written as follows:
 

$p(Z|X)=\frac{p(X|Z)p(Z)}{p(X)}=\frac{p(X|Z)p(Z)}{\int_{Z} p(X,Z)\,dZ}$




### First derivation: Jensen's inequality

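Start from the marginal likelihood, then multiply and divide by $q(Z)$ inside the integral:

$p(X)=\int_{Z}p(X,Z)\,dZ$

$\log p(X)=\log\int_{Z}p(X,Z)\,dZ$

$=\log\int_{Z}p(X,Z)\frac{q(Z)}{q(Z)}\,dZ$

Recall the expected value of a function $h$ under a density $p$:

$\mathbb{E}[h(X)]=\int_x h(x) \cdot p(x) \ dx$

so the integral is an expectation under $q$:

$=\log E_{q}\left[\frac{p(X,Z)}{q(Z)}\right]$

Jensen's inequality for a concave function $f$, such as $\log$, states that:

$f(E[X])\geq E[f(X)]$

Therefore we have:


$\log p(X) \geq E_{q}\left[\log\frac{p(X,Z)}{q(Z)}\right]=E_{q}[\log p(X,Z)]-E_{q}[\log q(Z)]$


$L= E_{q}[\log p(X,Z)]-E_{q}[\log q(Z)]$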

So $L$ is a lower bound on the log probability of the observations.
As a result, when we want to maximize the marginal probability but cannot compute it, we can
maximize its variational lower bound $L$ instead.
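The direction of Jensen's inequality for $\log$ is easy to get wrong, so a quick numerical check can help. In this sketch the list `weights` stands in for samples of the ratio $p(X,Z)/q(Z)$; the values are arbitrary positive numbers, not the output of any real model.

```python
import math
import random

def jensen_gap(samples):
    """Return (log of the mean, mean of the logs) for positive samples."""
    log_of_mean = math.log(sum(samples) / len(samples))
    mean_of_log = sum(math.log(s) for s in samples) / len(samples)
    return log_of_mean, mean_of_log

rng = random.Random(0)
weights = [rng.uniform(0.1, 5.0) for _ in range(1000)]
log_of_mean, mean_of_log = jensen_gap(weights)

# log is concave, so the mean of the logs never exceeds the log of the mean.
print(mean_of_log <= log_of_mean)
```

The two quantities coincide only when all samples are equal, which corresponds to the ratio $p(X,Z)/q(Z)$ being constant in $Z$.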




### Second derivation: KL divergence

The main idea behind variational methods is to find an approximating distribution $q(Z)$ that is as close as possible to the true posterior distribution $p(Z|X)$. The
approximating distribution can have its own variational parameters, $q(Z|θ)$, and we
try to find the setting of those parameters that makes $q$ close to the posterior of interest.
For this to be useful, $q(Z)$ should be simpler and more tractable for inference than the true posterior.


To measure the closeness of the two distributions $q(Z)$ and $p(Z|X)$, a common choice
is the Kullback-Leibler (KL) divergence. It is not a true metric, since it is not symmetric in its arguments.

$KL[q(Z) \parallel p(Z|X)]= \int_{Z} q(Z)\log \frac{q(Z)}{p(Z|X)}\,dZ$

$= -\int_{Z} q(Z)\log \frac{p(Z|X)}{q(Z)}\,dZ$

$= -\int_{Z} q(Z)\log \frac{p(Z,X)}{p(X)q(Z)}\,dZ$

$= -\int_{Z} q(Z)\left( \log \frac{p(Z,X)}{q(Z)} -\log p(X)\right)dZ$

$= -\int_{Z} q(Z) \log \frac{p(Z,X)}{q(Z)}\,dZ +\int_{Z} q(Z)\log p(X)\,dZ$


Since $q(Z)$ is a probability density, $\int_{Z} q(Z)\,dZ = 1$, and $\log p(X)$ does not depend on $Z$:

$= -\int_{Z} q(Z) \log \frac{p(Z,X)}{q(Z)}\,dZ + \log p(X)$

$= -L + \log p(X)$

$L$ is the variational lower bound.

Rearranging will give us the following:

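$L = \log p(X) - KL[q(Z) \parallel p(Z|X)]$


Since the KL divergence is always $\geq 0$, we again get $L \leq \log p(X)$. Because $\log p(X)$ does not depend on $q$, maximizing $L$ is equivalent to minimizing the KL divergence, so our goal is to maximize $L$.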
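The identity $\log p(X) = L + KL[q(Z) \parallel p(Z|X)]$ can be checked directly on a small discrete model. This is a minimal sketch: the numbers in `joint` and `q` are made up for illustration, and `joint[k]` is assumed to hold $p(X, Z=k)$ for one fixed observation $X$.

```python
import math

def elbo_and_kl(joint, q):
    """joint[k] = p(X, Z=k) for the observed X; q[k] = q(Z=k)."""
    evidence = sum(joint)                          # p(X)
    posterior = [j / evidence for j in joint]      # p(Z | X)
    elbo = sum(qk * math.log(jk / qk) for qk, jk in zip(q, joint) if qk > 0)
    kl = sum(qk * math.log(qk / pk) for qk, pk in zip(q, posterior) if qk > 0)
    return elbo, kl, math.log(evidence)

joint = [0.10, 0.05, 0.15]   # hypothetical values of p(X, Z=k)
q = [0.5, 0.2, 0.3]          # an arbitrary approximating distribution
elbo, kl, log_evidence = elbo_and_kl(joint, q)

# The bound plus the gap recovers the log evidence.
print(abs(elbo + kl - log_evidence) < 1e-12)
```

Setting `q` equal to the true posterior drives the KL term to zero, at which point the bound is tight.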

### Example
We want to maximize the log likelihood of the class label, $\log p(y|I,W)$. Here $I$ is the image, $W$ are the model parameters and $y$ is the class label. The objective can be rewritten by
marginalizing over the locations $l$ (hidden variables):

$\log p(y|I,W)=\log \sum_{l} p(y,l|I,W)$

Refs: [1](http://legacydirs.umiacs.umd.edu/~xyang35/files/understanding-variational-lower.pdf)

 Refs: [1](https://www.youtube.com/watch?v=Tc-XfiDPLf4&ab_channel=MLExplained-AggregateIntellect-AI.SCIENCE)

## Marginal likelihood

A marginal likelihood function (or integrated likelihood) is a likelihood function in which some parameter variables have been marginalized out.

### In the context of Bayesian statistics
Given a set of independent identically distributed data points ${\displaystyle \mathbf {X} =(x_{1},\ldots ,x_{n})}$, where $x_{i}\sim p(x_{i}|\theta )$ according to some probability distribution parameterized by $\theta$, and where $\theta$ is itself a random variable described by a distribution, i.e. ${\displaystyle \theta \sim p(\theta \mid \alpha )}$, the marginal likelihood asks what the probability ${\displaystyle p(\mathbf {X} \mid \alpha )}$ is once $\theta$ has been marginalized out (integrated out):


${\displaystyle p(\mathbf {X} \mid \alpha )=\int _{\theta }p(\mathbf {X} \mid \theta )\,p(\theta \mid \alpha )\ \operatorname {d} \!\theta }$
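As a worked instance of this integral, consider coin flips with a Beta prior on the success probability $\theta$. This model is an assumption made for illustration (the hyperparameters `a` and `b` play the role of $\alpha$); it is convenient because the integral has a closed form, $B(a+k, b+n-k)/B(a,b)$ for $k$ successes in $n$ flips, which the sketch compares against a plain midpoint-rule integration.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def marginal_likelihood_closed_form(data, a, b):
    k, n = sum(data), len(data)
    return math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))

def marginal_likelihood_numeric(data, a, b, steps=20000):
    """Midpoint-rule approximation of the integral over theta in (0, 1)."""
    k, n = sum(data), len(data)
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) / steps
        likelihood = theta ** k * (1 - theta) ** (n - k)
        log_prior = ((a - 1) * math.log(theta)
                     + (b - 1) * math.log(1 - theta) - log_beta(a, b))
        total += likelihood * math.exp(log_prior) / steps
    return total

data = [1, 0, 1, 1, 0, 1]  # hypothetical coin flips
closed = marginal_likelihood_closed_form(data, a=2.0, b=2.0)
numeric = marginal_likelihood_numeric(data, a=2.0, b=2.0)
print(closed, numeric)
```

Most models of interest have no such closed form, which is why the variational bound above is needed.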

### In classical statistics
In classical statistics, the concept of marginal likelihood occurs instead in the context of a joint parameter ${\displaystyle \theta =(\psi ,\lambda )}$, where $\psi$ is the actual parameter of interest and $\lambda$ is a nuisance parameter of no direct interest.


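By the law of total probability, we know that:

$P(B|C)=\sum_{i} P(B|A_i,C)P(A_i|C)$

We also know that the likelihood of $\theta$ is the probability of the data given $\theta$:

${\mathcal {L}}(\theta|X)=p(X|\theta)=p_{\theta }(X)$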







Marginalizing out $\lambda$ therefore gives:

${\displaystyle {\mathcal {L}}(\psi ;\mathbf {X} )=p(\mathbf {X} \mid \psi )=\int _{\lambda }p(\mathbf {X} \mid \lambda ,\psi )\,p(\lambda \mid \psi )\ \operatorname {d} \!\lambda }$






## Bayesian model comparison

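In Bayesian model comparison, the marginalized variables are the parameters $\theta$ of a particular model $M$, and the marginal likelihood of the data under that model (the model evidence) is:

${\displaystyle p(\mathbf {X} \mid M)=\int p(\mathbf {X} \mid \theta ,M)\,p(\theta \mid M)\,\operatorname {d} \!\theta }$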
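A minimal sketch of how $p(\mathbf{X} \mid M)$ can be used to compare two models of the same coin-flip data. Both models are hypothetical choices for illustration: one fixes $\theta = 0.5$, the other places a uniform prior on $\theta$, for which the integral evaluates to $k!\,(n-k)!/(n+1)!$.

```python
import math

def evidence_fair_coin(data):
    # Model 1: theta is fixed at 0.5, so there is nothing to integrate.
    return 0.5 ** len(data)

def evidence_uniform_prior(data):
    # Model 2: theta ~ Uniform(0, 1); the integral is a Beta function.
    k, n = sum(data), len(data)
    return math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1)

data = [1] * 9 + [0]  # hypothetical data: 9 heads, 1 tail
ratio = evidence_uniform_prior(data) / evidence_fair_coin(data)
print(ratio)  # a ratio above 1 favors the uniform-prior model
```

The ratio of the two evidences is commonly called the Bayes factor.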



## VAE

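In a VAE, the approximate posterior is the encoder $q_\phi(z|x)$ with parameters $\phi$, and the generative model $p_\theta$ has parameters $\theta$. The same derivation as above gives:

$KL(q_\phi(z|x) || p_\theta(z|x))=\int q_\phi(z|x) \log \frac{q_\phi(z|x)}{p_\theta(z|x)}\,dz$

$=\int q_\phi(z|x) \log \frac{q_\phi(z|x)p_\theta(x)}{p_\theta(z,x)}\,dz$

$=\int q_\phi(z|x) \log \frac{q_\phi(z|x)}{p_\theta(z,x)}\,dz +\int q_\phi(z|x) \log p_\theta(x)\,dz$


$=\underbrace{ \int q_\phi(z|x) \log \frac{q_\phi(z|x)}{p_\theta(z,x)}\,dz}_{-\mathcal {L}}  +\log p_\theta(x)$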


$\mathcal {L}$ is the variational lower bound:

$\mathcal {L}= -\int q_\phi(z|x) \log \frac{q_\phi(z|x)}{p_\theta(z,x)}\,dz$

$\log p_\theta(x)=\mathcal {L}+KL(q_\phi(z|x) || p_\theta(z|x))$

$\log p_\theta(x) \geq \mathcal {L}$

The goal is to minimize $KL(q_\phi(z|x) || p_\theta(z|x))$ w.r.t. $\phi$. Since $\log p_\theta(x)$ does not depend on $\phi$, this is equivalent to maximizing $\mathcal {L}$.


$\mathcal {L}= -\int q_\phi(z|x) \log \frac{q_\phi(z|x)}{p_\theta(z,x)}\,dz$

$p_\theta(z,x)=p_\theta(x|z)p_\theta(z)$

$\mathcal {L}= E_q[\log p_\theta(x|z)]   -\int q_\phi(z|x) \log \frac{q_\phi(z|x)}{p_\theta(z)}\,dz$

This gives two equivalent forms of the lower bound:

$1) \mathcal {L}= E_q[\log p_\theta(x,z) -  \log q_\phi(z|x)]$


$2) \mathcal {L}= E_q[\log p_\theta(x|z)] - KL( q_\phi(z|x)||p_\theta(z)) $



For a single data point $x^{(i)}$, (2) can be written as:


$\mathcal{L}(\theta, \phi;x^{(i)}) = -KL(q_{\phi}(z|x^{(i)}) || p_{\theta}(z)) + \mathbb{E}_{z \sim q_{\phi}(z|x^{(i)})}[\log p_{\theta}(x^{(i)}|z)]$




### The Optimization Procedure

- We need to maximize the expected log-likelihood of reconstructing the data point from the latent vector, $E_q[\log p_\theta(x|z)]$. Maximizing this means that the decoder is getting better at reconstruction, so equivalently we minimize the reconstruction loss $\mathcal{L}_R = -E_q[\log p_\theta(x|z)]$.

- We need to minimize the divergence between the approximate posterior over the latent vector and its prior, $KL( q_\phi(z|x)||p_\theta(z))$. Let's call this loss $\mathcal{L}_{KL}$.


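For a Gaussian $q_\phi$ with diagonal covariance and a standard normal prior $p_\theta(z)=\mathcal{N}(0, I)$, the divergence has a closed form:

$KL(q_{\phi}(z|x^{(i)}) || p_{\theta}(z)) = -\frac{1}{2}\sum_{j=1}^{J}{(1+\log(\sigma_j^2)-\mu_j^2-\sigma_j^2)}$


Here, $\sigma_j$ is the standard deviation and $\mu_j$ is the mean of the $j$-th latent dimension. The divergence is zero when $\sigma_j = 1$ and $\mu_j = 0$, so minimizing it pushes $\sigma_j \to 1$ and $\mu_j \to 0$.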
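The closed-form KL term is short enough to write out directly. This sketch uses the non-negative form $\frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$ and assumes, as is common in VAE code, that the encoder outputs the log-variance $\log \sigma_j^2$ rather than $\sigma_j$; the function name is illustrative.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )

# Zero divergence exactly at the prior: mu = 0 and sigma = 1 (log_var = 0).
kl_at_prior = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Any shift in the mean or change in the variance makes it positive.
kl_shifted = gaussian_kl_to_standard_normal([1.0, -2.0], [0.0, math.log(4.0)])
print(kl_at_prior, kl_shifted)
```

Working with the log-variance keeps $\sigma_j^2$ positive without any constraint on the encoder output.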


To sum it up:

1) $\mathcal{L}_R = -E_q[\log p_\theta(x|z)]$

This is the reconstruction term (decoder), e.g. the squared pixel difference $|| x-f(x) ||^2$ between the input $x$ and its reconstruction $f(x)$.

2) $\mathcal{L}_{KL} = KL( q_\phi(z|x)||p_\theta(z)) = -\frac{1}{2}\sum_{j=1}^{J}{(1+\log(\sigma_j^2)-\mu_j^2-\sigma_j^2)}$


This is the regularization term (encoder): it keeps the approximate posterior close to the prior.

So, the final VAE loss that we need to minimize is:
$\mathcal{L}_{VAE} = \mathcal{L}_R + \mathcal{L}_{KL} = -\mathcal{L}$
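Putting the two terms together, a single-sample estimate of the loss for one data point might look like the sketch below. It assumes a unit-variance Gaussian decoder, so the reconstruction term reduces to a squared error up to an additive constant, and it uses the non-negative form of the KL term. `toy_decoder` is a hypothetical stand-in for a neural network.

```python
import math
import random

def vae_loss(x, mu, log_var, decode, rng):
    """Single-sample estimate of L_R + L_KL for one data point x."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
    x_hat = decode(z)
    # Reconstruction term: squared error, i.e. -log p(x|z) for a
    # unit-variance Gaussian decoder, dropping the additive constant.
    recon = 0.5 * sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    # KL term: closed form against a standard normal prior.
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))
    return recon + kl, recon, kl

def toy_decoder(z):
    # Hypothetical decoder: copies the 2-D latent into a 2-D "image".
    return [z[0], z[1]]

rng = random.Random(0)
loss, recon, kl = vae_loss([0.5, -0.5], [0.4, -0.6], [-2.0, -2.0], toy_decoder, rng)
print(loss, recon, kl)
```

In a real VAE, `mu` and `log_var` would come from the encoder network and the loss would be averaged over a minibatch.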





### Reparameterization trick

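$\phi^{*},  \theta^{*}=\underset{\phi, \theta}{\text{argmax}}\ \mathcal {L}(\phi, \theta;x) $

Finally, to compute gradients of $\mathcal{L}$ w.r.t. $\phi$, we need to sample $z$ from the latent distribution $q_\phi(z|x)$ in a differentiable way. The reparameterization trick draws noise $\epsilon \sim \mathcal{N}(0, I)$ and computes the latent sample as a deterministic function of $\mu$ and $\sigma$:

$z = \mu + \sigma \odot \epsilon$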
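The point of writing the sample as $\mu + \epsilon\sigma$ is that the randomness sits in $\epsilon$, so derivatives with respect to $\mu$ and $\sigma$ pass through the sample. A minimal one-dimensional sketch, using the test function $z^2$ (chosen only because $E[z^2] = \mu^2 + \sigma^2$ has the known derivative $2\mu$):

```python
import random

def reparameterize(mu, sigma, eps):
    return mu + sigma * eps

def grad_mu_of_expected_square(mu, sigma, n_samples, rng):
    """Monte Carlo estimate of d/dmu E[z**2] with z = mu + sigma * eps."""
    total = 0.0
    for _ in range(n_samples):
        z = reparameterize(mu, sigma, rng.gauss(0.0, 1.0))
        # Chain rule: d(z**2)/dmu = 2*z * dz/dmu, and dz/dmu = 1.
        total += 2.0 * z
    return total / n_samples

estimate = grad_mu_of_expected_square(
    mu=1.5, sigma=0.5, n_samples=20000, rng=random.Random(0)
)
print(estimate)  # should be close to the exact gradient 2 * mu
```

Automatic differentiation frameworks apply the same chain rule through the sampled $z$, which is what lets the encoder be trained by backpropagation.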


<img src='images/vae.jpg'>

Refs: [1](https://wiseodd.github.io/techblog/2016/12/10/variational-autoencoder/), [2](https://debuggercafe.com/getting-started-with-variational-autoencoder-using-pytorch/)
