# 1. Autoregressive Models
a. Write out the chain rule for a distribution over $x_1, x_2,...,x_n$.

$$ p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_n \mid x_{n-1}, x_{n-2}, \ldots, x_1) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}) $$

b. Draw an example of a Masked Autoencoder for Distribution Estimation (MADE) model and illustrate the key characteristic that qualifies it as a proper probability model.
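As a complement to the drawing, the masking idea can be sketched in NumPy. This is a minimal illustration, not a full MADE: the single hidden layer, the random degree assignment, and the names `made_masks` and `forward` are assumptions made for this example. The key characteristic it checks is that output $i$ depends only on inputs $x_1, \ldots, x_{i-1}$, so the outputs can parameterize the conditionals of the chain rule and their product is a valid joint distribution.

```python
import numpy as np

def made_masks(n_inputs, n_hidden, rng):
    """Build the two binary masks of a one-hidden-layer MADE (illustrative)."""
    # Input i (0-based) gets degree i + 1; hidden units get a degree in [1, n_inputs - 1].
    in_deg = np.arange(1, n_inputs + 1)
    hid_deg = rng.integers(1, n_inputs, size=n_hidden)
    # Hidden unit k may see input j only if hid_deg[k] >= in_deg[j].
    mask_in = (hid_deg[:, None] >= in_deg[None, :]).astype(float)
    # Output i may see hidden unit k only if in_deg[i] > hid_deg[k] (strict inequality).
    mask_out = (in_deg[:, None] > hid_deg[None, :]).astype(float)
    return mask_in, mask_out

rng = np.random.default_rng(0)
n, h = 5, 16
mask_in, mask_out = made_masks(n, h, rng)

# Entry (i, j) > 0 means there is a path from input j to output i.
connectivity = mask_out @ mask_in
is_autoregressive = np.allclose(np.triu(connectivity), 0.0)

# Functional check with random (hypothetical) weights.
W1 = rng.standard_normal((h, n))
W2 = rng.standard_normal((n, h))

def forward(x):
    hidden = np.tanh((W1 * mask_in) @ x)
    return (W2 * mask_out) @ hidden

x = rng.standard_normal(n)
x_perturbed = x.copy()
x_perturbed[3] += 10.0  # change only the fourth input
# Outputs 0..3 must not react, because they may only depend on earlier inputs.
unchanged = np.allclose(forward(x)[:4], forward(x_perturbed)[:4])
print(is_autoregressive, unchanged)
```

The connectivity matrix is strictly lower triangular, which is the masked-network counterpart of the factorization $\prod_i p(x_i \mid x_1, \ldots, x_{i-1})$.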

# 2. KL-Divergence
Let $\mu:\chi\rightarrow\mathbb{R}^k$, $\sigma:\chi\rightarrow\mathbb{R}^k$, and $q(z|x)=N(z;\mu(x),\text{diag}(\sigma^2(x)))$. Suppose that $p(z)=N(z;0,I)$. Show that

$$D(q(z|x) || p(z)) = \frac{1}{2}\sum_i \left[\sigma_i^2(x)+\mu_i^2(x)-\log\sigma_i^2(x)-1\right]$$

---

Kullback-Leibler Divergence is Cross Entropy minus Entropy:

$$D(q(z|x) || p(z)) = H(q,p) - H(q)$$

---

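Find the cross entropy. Work in one dimension first, writing $q(z)$ for $q(z|x)$ and $\mu$, $\sigma^2$ for $\mu(x)$, $\sigma^2(x)$: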

$$ H(q,p)=-\int q(z)\log p(z)dz $$

$$ p(z)=(2\pi)^{-\frac{1}{2}}\exp(-\frac{1}{2}z^2) $$

$$ H(q,p)=-\int q(z)\log [(2\pi)^{-\frac{1}{2}}\exp(-\frac{1}{2}z^2)]dz $$

$$ H(q,p)=\frac{1}{2}\log2\pi\int q(z)dz + \frac{1}{2}\int z^2q(z)dz$$

$$ H(q,p)=\frac{1}{2}[\log2\pi+\mu^2+\sigma^2] $$

where $ \int q(z)dz=1$ and $ \int z^2q(z)dz=\mu^2+\sigma^2 $ 

---

Find Entropy:

$$ H(q)=-\int q(z)\log q(z)dz $$

$$ H(q)=-\int q(z)\log [(2\pi\sigma^2)^{-\frac{1}{2}}\exp(-\frac{1}{2\sigma^2}(z-\mu)^2)]dz $$

$$ H(q)=\frac{1}{2}\log(2\pi\sigma^2)\int q(z)dz+\frac{1}{2}\int\left(\frac{z-\mu}{\sigma}\right)^2q(z)dz $$

$$ H(q)=\frac{1}{2}[\log 2\pi+\log \sigma^2 + 1]$$

where $ \int q(z)dz=1 $ and $ \int\left(\frac{z-\mu}{\sigma}\right)^2q(z)dz=1 $

---

Solve for KL Divergence for each dimension:

$$D(q(z|x) || p(z))=\frac{1}{2}[\log2\pi+\mu^2+\sigma^2]-\frac{1}{2}[\log 2\pi+\log \sigma^2 + 1] $$

$$ D(q(z|x) || p(z))=\frac{1}{2}[\sigma^2+\mu^2-\log\sigma^2-1] $$

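Generalize to $k$ dimensions. Because both $q(z|x)$ and $p(z)$ have diagonal covariance, they factorize over dimensions and the per-dimension divergences add: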

$$ D(q(z|x) || p(z))=\sum_{i=1}^{k}\frac{1}{2}[\sigma_i^2+\mu_i^2-\log\sigma_i^2-1] $$
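The closed form can be sanity-checked numerically. The sketch below compares it against a Monte Carlo estimate of $\mathbb{E}_{q}[\log q(z|x) - \log p(z)]$. The values of `mu` and `sigma2`, the sample count, and the function name are arbitrary choices for this illustration.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, sigma2):
    """Closed-form KL( N(mu, diag(sigma2)) || N(0, I) )."""
    return 0.5 * np.sum(sigma2 + mu**2 - np.log(sigma2) - 1.0)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.0])       # arbitrary example means
sigma2 = np.array([0.25, 1.0, 2.0])   # arbitrary example variances

closed_form = kl_diag_gaussian_to_standard_normal(mu, sigma2)

# Monte Carlo estimate: sample z ~ q, then average log q(z) - log p(z).
z = mu + np.sqrt(sigma2) * rng.standard_normal((200_000, mu.size))
log_q = -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (z - mu) ** 2 / sigma2, axis=1)
log_p = -0.5 * np.sum(np.log(2 * np.pi) + z**2, axis=1)
monte_carlo = np.mean(log_q - log_p)

print(closed_form, monte_carlo)
```

The two numbers should agree up to sampling noise, and the divergence is zero when `mu` is all zeros and `sigma2` is all ones.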



https://leenashekhar.github.io/2019-01-30-KL-Divergence/

# 3. Normalizing Flows
Let $q_0$ be a probability distribution on $z$, and define $\mathbf{z}_s=g_s(\mathbf{z}_{s-1})$ where $g_s:z\rightarrow z$ are invertible functions. Prove that the pushforward distribution on $\mathbf{z}_t=g_t \circ ... \circ g_1(\mathbf{z}_0)$ is given by $q_t$, where
$$ \log q_t(\mathbf{z}_t)=\log q_0(\mathbf{z}_0)-\sum_{s=1}^t\log\left|\det\frac{\partial g_s(\mathbf{z}_{s-1})}{\partial\mathbf{z}_{s-1}}\right| $$
Hint: consider using the inverse function theorem.
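Before writing the proof, the identity can be checked numerically on a case where the answer is known. The sketch below assumes affine maps $g_s(\mathbf{z}) = A_s\mathbf{z} + b_s$ and a standard normal $q_0$, so that $\mathbf{z}_t$ is Gaussian with a mean and covariance that can be computed directly. The matrices and offsets are arbitrary values chosen for illustration. One of them has a negative determinant, which shows why the change of variables uses the absolute value of the Jacobian determinant.

```python
import numpy as np

k = 2
# Hypothetical invertible affine layers g_s(z) = A_s z + b_s.
A = [
    np.array([[2.0, 0.5], [0.0, 1.0]]),
    np.array([[1.0, 0.0], [-0.3, 0.5]]),
    np.array([[0.0, 1.0], [1.5, 0.2]]),   # determinant is -1.5
]
b = [np.array([0.1, -0.2]), np.array([1.0, 0.0]), np.array([-0.5, 0.3])]

def log_standard_normal(z):
    return -0.5 * (z @ z + z.size * np.log(2 * np.pi))

rng = np.random.default_rng(0)
z0 = rng.standard_normal(k)

# Flow formula: log q_t(z_t) = log q_0(z_0) - sum_s log|det Jacobian of g_s|.
z = z0
log_q_flow = log_standard_normal(z0)
for A_s, b_s in zip(A, b):
    z = A_s @ z + b_s
    _, logabsdet = np.linalg.slogdet(A_s)  # Jacobian of an affine map is A_s
    log_q_flow -= logabsdet

# Reference: z_t = M z_0 + m with z_0 ~ N(0, I), so z_t ~ N(m, M M^T).
M, m = np.eye(k), np.zeros(k)
for A_s, b_s in zip(A, b):
    M = A_s @ M
    m = A_s @ m + b_s
C = M @ M.T
diff = z - m
_, logdet_C = np.linalg.slogdet(C)
log_q_reference = -0.5 * (diff @ np.linalg.solve(C, diff) + logdet_C + k * np.log(2 * np.pi))

print(log_q_flow, log_q_reference)
```

The two log densities agree to floating-point precision. The general proof follows the same pattern: apply the change of variables once per layer and sum the log determinants.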

# 4. Variational Autoencoders
In this question, you will train a VAE model on the MNIST dataset. This dataset consists of 28x28 grayscale images. Please implement a standard VAE with the following characteristics:

a. 16-dim latent variables $z$ with standard normal prior $p(z)=N(0,I)$

b. An approximate posterior $q_\theta(z|x)=N(z;\mu_\theta(x),\Sigma_\theta(x))$, where $\mu_\theta(x)$ is the mean vector, and $\Sigma_\theta(x)$ is a diagonal covariance matrix.

c. A decoder $p(x|z)=N(x;\mu_\phi(z),I)$, where $\mu_\phi(z)$ is the mean vector (We are not learning the covariance of the decoder).
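Sampling from the approximate posterior in (b) is usually done with the reparameterization $z = \mu + \sigma \odot \epsilon$, $\epsilon \sim N(0, I)$. The sketch below shows only the sampling step in NumPy. Having the encoder output the log variance, the function name `sample_posterior`, and the stand-in encoder outputs are assumptions for this illustration; in a trainable model the same expression would be written in an autodiff framework so gradients flow through `mu` and `log_var`.

```python
import numpy as np

latent_dim = 16

def sample_posterior(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
# Stand-in encoder outputs for a batch of 4 images (not from a trained model).
mu = rng.standard_normal((4, latent_dim))
log_var = -1.0 * np.ones((4, latent_dim))

z = sample_posterior(mu, log_var, rng)
print(z.shape)
```

Parameterizing the log variance keeps $\sigma^2$ positive without constraining the encoder output.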

## Requested deliverables
a. Record the average full negative ELBO, reconstruction loss, and KL term of the training data (per minibatch) and test data (for your entire test set). Code is provided that automatically plots the training curves.

b. Report the final test set performance of your final model.

c. 100 samples from your trained VAE (put all samples in one figure).

d. 50 real-image / reconstruction pairs (put all pairs in one figure).

e. Interpolations of length 10 between 10 pairs of test images from your VAE (100 images total).
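One way to produce the interpolations in (e) is to interpolate linearly between the latent codes of the two images and decode each intermediate point. The sketch below covers only the latent-space step. Using linear interpolation, including both endpoints in the 10 steps, and the random stand-in latent codes are assumptions for this illustration; the decoding step depends on your model and is not shown.

```python
import numpy as np

def interpolate_latents(z_start, z_end, n_steps=10):
    """Linear interpolation between paired latent codes.

    z_start, z_end: arrays of shape (n_pairs, latent_dim).
    Returns an array of shape (n_pairs, n_steps, latent_dim).
    """
    weights = np.linspace(0.0, 1.0, n_steps)[None, :, None]
    return (1.0 - weights) * z_start[:, None, :] + weights * z_end[:, None, :]

rng = np.random.default_rng(0)
# Stand-ins for the encoder means of 10 pairs of test images.
z_a = rng.standard_normal((10, 16))
z_b = rng.standard_normal((10, 16))

paths = interpolate_latents(z_a, z_b)
# Flatten to 100 latent codes, ready to be passed through the decoder.
codes = paths.reshape(-1, 16)
print(paths.shape, codes.shape)
```

Plotting the decoded images as a 10x10 grid, one pair per row, gives the 100-image figure.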

## Helpful Tips:
* When computing the reconstruction loss and KL loss, average over the batch dimension and sum over the feature dimension.
* When computing the reconstruction loss, it suffices to compute the MSE between the reconstructed and true images (you can compute the extra constants if you want).
* Use batch size $128$, learning rate $10^{-3}$, and an Adam optimizer.
* You can play around with different architectures and try for better results, but the encoder / decoder architecture below suffices.
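The first two tips can be sketched as follows. This is an illustrative NumPy version of the loss bookkeeping only, assuming flattened images of 784 features and an encoder that outputs the log variance. The function name `negative_elbo` is made up for this example. It uses the summed squared error as the reconstruction term; the exact unit-variance Gaussian negative log-likelihood would add a factor of $\frac{1}{2}$ and a constant.

```python
import numpy as np

def negative_elbo(x, x_recon, mu, log_var):
    """Return (negative ELBO, reconstruction loss, KL term).

    Each term is summed over the feature dimension and averaged over the batch.
    """
    recon = np.mean(np.sum((x_recon - x) ** 2, axis=1))
    kl = np.mean(0.5 * np.sum(np.exp(log_var) + mu**2 - log_var - 1.0, axis=1))
    return recon + kl, recon, kl

rng = np.random.default_rng(0)
batch_size, n_features, latent_dim = 128, 28 * 28, 16

# Stand-in data and model outputs (not from a trained model).
x = rng.random((batch_size, n_features))
x_recon = x + 0.1
mu = np.zeros((batch_size, latent_dim))
log_var = np.zeros((batch_size, latent_dim))

loss, recon, kl = negative_elbo(x, x_recon, mu, log_var)
print(loss, recon, kl)
```

Logging the three returned values separately per minibatch gives the curves requested in deliverable (a).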


import numpy as np