# Hands-on session 1- Variational Auto-Encoders
## Generative Modeling Summer School 2023

#### Instructions on how to use this notebook:

This notebook is hosted on ``Google Colab``. To be able to work on it, you have to create your own copy. Go to *File* and select *Save a copy in Drive*.

You can also avoid using ``Colab`` entirely, and download the notebook to run it on your own machine. If you choose this, go to *File* and select *Download .ipynb*.

The advantage of using **Colab** is that you can use a GPU. You can complete this assignment with a CPU, but it will take a bit longer. Furthermore, we encourage you to train using the GPU not only for faster training, but also to get experience with this setting. This includes moving models and tensors to the GPU and back. This experience is very valuable because for various models and large datasets (like large CNNs for ImageNet, or Transformer models trained on Wikipedia), training on GPU is the only feasible way.

The default ``Colab`` runtime does not have a GPU. To change this, go to *Runtime - Change runtime type*, and select *GPU* as the hardware accelerator. The GPU that you get changes according to what resources are available at the time, and its memory can go from a 5GB, to around 18GB if you are lucky. If you are curious, you can run the following in a code cell to check:

```sh
!nvidia-smi
```

Note that despite the name, ``Google Colab`` does  not support collaborative work without issues. When two or more people edit the notebook concurrently, only one version will be saved. You can choose to do group programming with one person sharing the screen with the others, or make multiple copies of the notebook to work concurrently.

**Submission:** Please bring your (partial) solution to the hands-on session. Then you can discuss it with intructors and your colleagues.

In [89]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


## Introduction

In this assignment, we are going to implement a Variational Auto-Encoder (VAE). A VAE is a likelihood-based deep generative model that consists of a stochastic encoder (a variational posterior over latent variables), a stochastic decoder, and a marginal distribution over latent variables (a.k.a. a prior). The model was originally proposed in two concurrent papers:
- [Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.](https://arxiv.org/abs/1312.6114)
- [Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. "Stochastic backpropagation and approximate inference in deep generative models." International conference on machine learning. PMLR, 2014.](https://proceedings.mlr.press/v32/rezende14.html)

You can read more about VAEs in Chapter 4 of the following book:
- [Tomczak, J.M., "Deep Generative Modeling", Springer, 2022](https://link.springer.com/book/10.1007/978-3-030-93158-2)

In particular, the goals of this assignment are the following:

- Understand how VAEs are formulated
- Implement components of VAEs using PyTorch
- Train and evaluate a model for image data

### Theory behind VAEs

VAEs are latent variable models trained with variational inference. In general, the latent variable models define the following generative process:
\begin{align}
1.\ & \mathbf{z} \sim p_{\lambda}(\mathbf{z}) \\
2.\ & \mathbf{x} \sim p_{\theta}(\mathbf{x}|\mathbf{z})
\end{align}

In plain words, we assume that for observable data $\mathbf{x}$, there are some latent (hidden) factors $\mathbf{z}$. Then, the training objective is the log-likelihood function of the following form:
$$
\log p_{\vartheta}(\mathbf{x})=\log \int p_\theta(\mathbf{x} \mid \mathbf{z}) p_\lambda(\mathbf{z}) \mathrm{d} \mathbf{z} .
$$

The problem here is the intractability of the integral if the dependencies between random variables $\mathbf{x}$ and $\mathbf{z}$ are non-linear and/or the distributions are non-Gaussian.

By introducing variational posteriors $q_{\phi}(\mathbf{z}|\mathbf{x})$, we get the following lower bound (the Evidence Lower Bound, ELBO):
$$
\log p_{\vartheta}(\mathbf{x}) \geq \mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right]-\mathrm{KL}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\lambda(\mathbf{z})\right) .
$$

## IMPORTS

In [90]:
# DO NOT REMOVE!
import os

import numpy as np
import matplotlib.pyplot as plt

import torch

from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F

import torchvision
from torchvision.datasets import MNIST

!pip install pytorch_model_summary
from pytorch_model_summary import summary

[0m

In [91]:
# Check if GPU is available and determine the device
if torch.cuda.is_available():
  device = 'cuda'
else:
  device = 'cpu'

device = 'cpu'
print(f'The available device is {device}')

The available device is cpu


In [92]:
# mount drive: WE NEED IT FOR SAVING IMAGES!
#from google.colab import drive
#drive.mount('/content/gdrive')

In [93]:
# PLEASE CHANGE IT TO YOUR OWN GOOGLE DRIVE!
images_dir = '/user/cringwal/home/Desktop/THESE_YEAR1/GEMSS2023/Assignements/'

## Auxiliary functions

Let us define some useful log-distributions:

In [94]:
# DO NOT REMOVE
PI = torch.from_numpy(np.asarray(np.pi))
EPS = 1.e-5


def log_categorical(x, p, num_classes=256, reduction=None, dim=None):
    x_one_hot = F.one_hot(x.long(), num_classes=num_classes)
    log_p = x_one_hot * torch.log(torch.clamp(p, EPS, 1. - EPS))
    if reduction == 'mean':
        return torch.mean(log_p, dim)
    elif reduction == 'sum':
        return torch.sum(log_p, dim)
    else:
        return log_p


def log_bernoulli(x, p, reduction=None, dim=None):
    pp = torch.clamp(p, EPS, 1. - EPS)
    log_p = x * torch.log(pp) + (1. - x) * torch.log(1. - pp)
    if reduction == 'mean':
        return torch.mean(log_p, dim)
    elif reduction == 'sum':
        return torch.sum(log_p, dim)
    else:
        return log_p


def log_normal_diag(x, mu, log_var, reduction=None, dim=None):
    D = x.shape[1]
    log_p = -0.5 * D * torch.log(2. * PI) - 0.5 * log_var - 0.5 * torch.exp(-log_var) * (x - mu)**2.
    if reduction == 'mean':
        return torch.mean(log_p, dim)
    elif reduction == 'sum':
        return torch.sum(log_p, dim)
    else:
        return log_p


def log_standard_normal(x, reduction=None, dim=None):
    D = x.shape[1]
    log_p = -0.5 * D * torch.log(2. * PI) - 0.5 * x**2.
    if reduction == 'mean':
        return torch.mean(log_p, dim)
    elif reduction == 'sum':
        return torch.sum(log_p, dim)
    else:
        return log_p

## Implementing VAEs

The goal of this assignment is to implement four classes:
- `Encoder`: this class implements the encoder (variational posterior), $q_{\phi}(\mathbf{z}|\mathbf{x})$.
- `Decoder`: this class implements the decoded (the conditional likelihood), $p_{\theta}(\mathbf{x}|\mathbf{z})$.
- `Prior`: this class implements the marginal over latents (the prior), $p_{\lambda}(\mathbf{z})$.
- `VAE`: this class combines all components.

### Encoder
We start with `Encoder`. Please remember that we assume the Gaussian variational posterior with a diagonal covariance matrix.

In [95]:
# YOUR CODE GOES HERE
# NOTE: The class must containt the following functions:
# (i) reparameterization
# (ii) sample
# (iii) log_prob
# Moreover, forward must return the log-probability of the variational posterior for given x, i.e., log q(z|x)

class Encoder(nn.Module):
    def __init__(self, encoder_net):  # ADD APPROPRIATE ATTRIBUTES
        super(Encoder, self).__init__()
        # Init of encoder network
        self.encoder = encoder_net
        
    @staticmethod
    # Reparametrization trick for Gaussians
    def reparameterization(mu, log_var):
        # std for log-variance
        std = torch.exp(0.5*log_var)
        #sample of epsilon from normal(0,1)
        eps = torch.randn_like(std)

        return mu+std*eps

    # Output of encoder network (parameter of gausssian)
    def encode(self, x):
        # encoder network of size 2M
        #print("ENCODE :)")
        h_e =  self.encoder(x)
        #print("now get mu and logvare")
        # divide layer into mean / std vectors
        mu_e, log_var_e= torch.chunk(h_e, 2,dim=1)

        return mu_e, log_var_e

    # Sampling process
    def sample(self, x=None, mu_e=None, log_var_e=None):
        if (mu_e is None) and (log_var_e is None):
            mu_e, log_var_e = self.encode(x)
        else:
            if (mu_e is None) or (log_var_e is None):
                raise ValueError('mu and log-scale can`t be None!')
            z = self.reparameterization(mu_e, log_var_e)
        return z

    # Log-proba used for ELBO
    def log_prob(self, x=None, mu_e=None, log_var_e=None, z=None):
        if x is not None:
            mu_e, log_var_e = self.encode(x)
            z = self.sample(mu_e=mu_e, log_var_e=log_var_e)
        else:
            if (mu_e is None) or (log_var_e is None) or (z is None):
                raise ValueError('mu, log-scale and z can`t be None!')

        return log_normal_diag(z, mu_e, log_var_e)

    def forward(self, x, type='log_prob'):
        assert type in ['encode', 'log_prob'], 'Type could be either encode or log_prob'
        if type == 'log_prob':
            return self.log_prob(x)
        else:
            return self.sample(x)

Please answer the following questions:

#### Question 1

Please explain the reparameterization trick and provide a mathematical formula.

**ANSWER**:
We use a sample operation from the variationnal posterior (MC approximation) for estimate ELBO. This operation make impossible the backprogation of the gradient.
For overpassing that problem we define a new independant and random variable ε ~ N(0,1). Then we use by decomposing the latent variable z, by considering it as a Gaussian variable, depending of  ε, a mean (μ) and a variance (σ²) as follow :
$$
\mathbf{z}=\mathbf{μ}+\mathbf{σ}.ε
$$

#### Question 2

Please write down mathematically the log-probability of the encoder (variational posterior).

**REMINDERS**
* As p(X) is intractable and the approximation by the Monte Carlo sampling expensive we prefer to modelise it via a **variationnal inference** :
\begin{align}
\ln p(\mathbf{x}) & = \ln \int p(\mathbf{x} | \mathbf{z}) p(\mathbf{z})\ \mathrm{d} \mathbf{z} \\
& = \ln \int \frac{q_{\phi}(\mathbf{z})}{q_{\phi}(\mathbf{z})} p(\mathbf{x} | \mathbf{z}) p(\mathbf{z})\ \mathrm{d} \mathbf{z} \\
& = \ln \mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z})} \left[ \frac{p(\mathbf{x} | \mathbf{z}) p(\mathbf{z})}{q_{\phi}(\mathbf{z}) } \right] \\
&\geq \mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z})} \ln \left[ \frac{p(\mathbf{x} | \mathbf{z}) p(\mathbf{z})}{q_{\phi}(\mathbf{z}) } \right] \\
&\geq \mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z})} \left[ \ln p(\mathbf{x} | \mathbf{z}) + \ln p(\mathbf{z}) - \ln q_{\phi}(\mathbf{z}) \right] \\
&\geq\mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z})} \left[ \ln p(\mathbf{x} | \mathbf{z}) \right] - \mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z})} \left[ \ln q_{\phi}(\mathbf{z}) -  \ln p(\mathbf{z}) \right]
\end{align}

*  The **amortized variational posterior** process allows us to learn a function $q_{\phi}(z|x)$ via a neural network on given inputs to returns an approximation of a chosen input distribution parameters:
\begin{align}
\ln p(\mathbf{x}) \geq \mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z|x})} \left[ \ln p(\mathbf{x} | \mathbf{z}) \right] - \mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z|x})} \left[ \ln q_{\phi}(\mathbf{z|x}) -  \ln p(\mathbf{z}) \right]
\end{align}
* This lower-bound is the famous **ELBO**, we already told about. The first part is the **reconstruction error** that we want to **maximize**, and the second is the **Kullback-Leibler divergence** (a kind of regulizer) that must be **minimized** :
\begin{align}
\ln p(\mathbf{x}) \geq \mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z|x})} \left[ \ln p(\mathbf{x} | \mathbf{z}) \right] - \mathbb{KL}[q_{\phi}({x|z}) || p(z)]
\end{align}
* Here we want optimize the last equation by finding the best $q_{\phi}({x|z})$

**ANSWER :**

The distribution of variationnal posterior is defined by $qϕ(x|z)$ as following a Gaussian distrib.

$qϕ(x|z) \sim N(z|\mu_ϕ(x),diag[σ²_ϕ(x)])$

where $ϕ(x), σ²_ϕ$ are two parameters learnt by a NN

* the log-proba computed as :

$ \ln qϕ(x|z) = -\frac{n}{2}\ln(2π) - \frac{n}{2}\ln(σ²)-\frac{1}{2σ²}\sum(x_i-\mu)²$

### Decoder

The decoder is the conditional likelihood, i.e., $p(x|z)$. Please remember that we must decide on the form of the distribution (e.g., Bernoulli, Gaussian, Categorical).

In [96]:
# YOUR CODE GOES HERE
# NOTE: The class must containt the following functions:
# (i) sample
# (ii) log_prob
# Moreover, forward must return the log-probability of the conditional likelihood function for given z, i.e., log p(x|z)
# Additionally, please specify the distribution class you want to use for the decode (i.e. the `distribution` attribute)

class Decoder(nn.Module):
    def __init__(self, decoder_net, distribution='categorical', num_vals=None): # ADD APPROPRIATE ATTRIBUTES
        super(Decoder, self).__init__()
        # Decoder init
        self.decoder = decoder_net
        # Distribution used for encoder
        self.distribution = distribution
        # nb of possibles values
        self.num_vals=num_vals


    # Get param of likelihood function p(x|y)
    def decode(self, z):
        # decoder network
        h_d = self.decoder(z)
        # Depends of distrib
        if( self.distribution=='categorical'):
            # batch size
            b = h_d.shape[0]
            # dim of x
            d = h_d.shape[1]//self.num_vals
            #reshape to batch size / dim . nb values
            h_d = h_d.view(b, d, self.num_vals)
            # get proba from softmax
            mu_d = torch.softmax(h_d, 2)
            return [mu_d]
          
        elif self.distribution == 'bernoulli':
            # return a single proba in Bernouilli case
            mu_d = torch.sigmoid(h_d)
            return [mu_d]
        else :
            raise ValueError("Only : 'categorial' and 'bernouilli' implemented distrib")

    # SAMPLING FROM DECODER
    def sample(self, z):
        outs = self.decode(z)
        if( self.distribution=='categorical'):
          # output of decoder
            mu_d = outs[0]
          # batch size of decoder
            b = mu_d.shape[0]
          # dim of of decoder
            m = mu_d.shape[1]
          #reshape to batch size / dim . nb values
            mu_d = mu_d.view(mu_d.shape[0], -1, self.num_vals)
          # sample the categorials
            p = mu_d.view(-1, self.num_vals)
            
            x_new = torch.multinomial(p, num_samples=1).view(b, m)
        elif self.distribution == 'bernoulli':
          # no reshaping
          mu_d = outs[0]
          # pytorch bernoulli sampler :)
          x_new = torch.bernoulli(mu_d)
        else :
              raise ValueError("Only : 'categorial' and 'bernouilli' implemented distrib")
        return x_new

    # Compute conditional log-likelihood function
    def log_prob(self, x, z):
        outs = self.decode(z)

        if self.distribution == 'categorical':
            mu_d = outs[0]
            log_p = log_categorical(x, mu_d, num_classes=self.num_vals, reduction='sum', dim=-1).sum(-1)

        elif self.distribution == 'bernoulli':
            mu_d = outs[0]
            log_p = log_bernoulli(x, mu_d, reduction='sum', dim=-1)

        else:
            raise ValueError('Either `categorical` or `bernoulli`')

        return log_p

    def forward(self, z, x=None, type='log_prob'):
        assert type in ['decoder', 'log_prob'], 'Type could be either decode or log_prob'
        if type == 'log_prob':
            return self.log_prob(x, z)
        else:
            return self.sample(x)

Please answer the following questions:

#### Question 3

Please explain your choice of distribution for image data used in this assignment. Additionally, please write it down mathematically (if you think that presenting it as the log-probability, then please do it).

ANSWER:
We are trying with the decoder to estimate the probability of a given image $x$ by knowing $z$, with have different option concerning the choise of the distribution of the images :

* The Bernoulli distribution :
$$p_θ(\mathbf{x}|z) \sim Bernouilli(x|θ(z)) $$ with
\begin{align}
\mathbf{x} \in \{0, 1\}\\
Θ \in \{0, 1\} \\
  Bernouilli(x,Θ) = Θ^x(1-Θ)^{1-x} \\
  \ln(Bernouilli(x,Θ)) = \ln(Θ^x(1-Θ)^{1-x})
\end{align}

* The  categorial distribution :
$$p_θ(\mathbf{x}|z) \sim Categorical(x|θ(z)) $$ with
\begin{align}   
  \mathbf{x} \in \{0, 1\} \\
  \mathbf{θ}^d \in \{0, 1\} \\
  d \in [0, 255] \\
  Categorical(x,Θ) = ∑^{255}_{d=0}Θ^{xd}_{d} \\
  \ln(Categorical(x,Θ)) = \ln (∑^{255}_{d=0}Θ^{xd}_{d})
\end{align}

> **As we are working with black and white MNIST images we are using the Bernoulli distribution**

#### Question 4

Please explain how one can sample from the distribution chosen by you. Please be specific and formal (i.e., provide mathematical formulae).

**ANSWER :**

For sampling a given $p_θ(\mathbf{x}|z)$ we need first to have on hand a latent vector $z$. ( from this one we will be able from the decoder to get $θ(z)$ that is a vector representing the probabilities to obtain a white pixel for each position of the image (in the case of ther Bernoulli distribution) or a 

### Prior

The prior is the marginal distribution over latent variables, i.e., $p(z)$. It plays a crucial role in the generative process and also in synthesizing images of better quality.

In this assignment, you are asked to implement a prior that is learnable (e.g., parameterized by neural networks). If you decide to implement the standard Gaussian prior only, then please be aware that you will not get any points.

For the learnable prior you can choose the **Mixture of Gaussians**.

In [97]:
#### STANDARD GAUSSIAN
class Prior(nn.Module):
    def __init__(self, L=2, num_components=1):
        super(Prior, self).__init__()

        self.L = L

        # params weights
        self.means = torch.zeros(1, L)
        self.logvars = torch.zeros(1, L)

    def get_params(self):
        return self.means, self.logvars
#
    def sample(self, batch_size):
        return torch.randn(batch_size, self.L)
#
    def log_prob(self, z):
        return log_standard_normal(z)

In [98]:
# YOUR CODE GOES HERE
# NOTES:
# (i) The function "sample" must be implemented.
# (ii) The function "forward" must return the log-probability, i.e., log p(z)

################ MIXTURE OF GAUSSIANS
class Prior(nn.Module):
    def __init__(self, L, num_components=2):
        super(Prior, self).__init__()

        self.L = L
        self.num_components = num_components

      # params
      # What is multiplier var ?
        multiplier=1
        self.means = nn.Parameter(torch.randn(num_components, self.L)*multiplier)
        self.logvars = nn.Parameter(torch.randn(num_components, self.L))

      # mixing weights
        self.w = nn.Parameter(torch.zeros(num_components, 1, 1))

    def get_params(self):
        return self.means, self.logvars

    def sample(self, batch_size):
        # mu, lof_var
        means, logvars = self.get_params()

        # mixing probabilities
        w = F.softmax(self.w, dim=0)
        w = w.squeeze()

        # pick components
        indexes = torch.multinomial(w, batch_size, replacement=True)

        # means and logvars
        eps = torch.randn(batch_size, self.L)
        for i in range(batch_size):
            indx = indexes[i]
            if i == 0:
                z = means[[indx]] + eps[[i]] * torch.exp(logvars[[indx]])
            else:
                z = torch.cat((z, means[[indx]] + eps[[i]] * torch.exp(logvars[[indx]])), 0)
        return z


    # Log-proba used for ELBO
    def log_prob(self, z):
        
        #print("HEY ICI")
        # mu, lof_var
        means, logvars = self.get_params()

        # mixing probabilities
        w = F.softmax(self.w, dim=0)

        # log-mixture-of-Gaussians
        z = z.unsqueeze(0) # 1 x B x L
        means = means.unsqueeze(1) # K x 1 x L
        logvars = logvars.unsqueeze(1) # K x 1 x L

        log_p = log_normal_diag(z, means, logvars) + torch.log(w) # K x B x L
        log_prob = torch.logsumexp(log_p, dim=0, keepdim=False) # B x L
        #print("supposed size")
        #print(log_prob.shape)
        return log_prob




#### Question 5 > ADD SAMPLING PROCESS AND LOG PROBA

**Option 1:  Standard Gaussian**

- Please explain the choice of your prior and write it down mathematically.

**Option 2: Mixture of Gaussians**

Please do the following:
- Please explain the choice of your prior and write it down mathematically.
- Please write down its sampling procedure (if necessary, please add a code snippet).
- Please write down its log-probability (a mathematical formula).

**ANSWER OPTION 1 :**

If i had to chose this option i would design it as follow :
$$p(z) \sim \mathcal{N}(z|0,I) $$
But suing a fixed distribution could led to :  
* posterior collapse : the regularisation term is minimized when $∀_xq_ϕ(z|x)=p(z)$ and sometimes the decoder could treets z as a noise.
* the hole problem : when a mismatch append between the aggregated posterior (could be for example the mean of var posteriors over all training data) and the prior. For example a high probability prior with a low postior proba and vise-versa. Sampling in theses kind of hole will produce unrealistic latent and impact the qualityof the outputs.

A more general problem, also more globally related to deep generative  models : the out-of-distribution. Because VAE poorly detect out of distrib examples that generally relate to other distribution.


**ANSWER OPTION 2 :**
A second option is to design it via a mixture of $\mathbf{K}$ Gaussians, patching the holes and fitting better to the aggregated posterior :

$$ p_λ(z) = ∑^K_{k=1}W_k\mathcal{N}(z|μ_k,\sigma²_k)$$
with the following trainable parameters : $$λ = \{\{w_k\},\{μ_k\},\{σ²_k\}\} $$



### Complete VAE

The last class is `VAE` tha combines all components. Please remember that this class must implement the **Negative ELBO** in `forward`, as well as `sample` (*hint*: it is a composition of `sample` functions from the prior and the decoder).

In [99]:
# YOUR CODE GOES HERE
# This class combines Encoder, Decoder and Prior.
# NOTES:
# (i) The function "sample" must be implemented.
# (ii) The function "forward" must return the negative ELBO. Please remember to add an argument "reduction", which is either "mean" or "sum".
class VAE(nn.Module):
    def __init__(self, encoder, decoder, prior, num_vals=256, L=16, likelihood_type='categorical'): # prior > nb of Gaussians
        # num_compo > nb dimension of latent space
        # num_vals > nb of pixel categories

        super(VAE, self).__init__()
        print('VAE by CR.')

        self.encoder = encoder
        self.decoder = decoder
        self.prior = prior
        self.num_vals = num_vals
        self.likelihood_type = likelihood_type

    def sample(self, batch_size=64):
        z = self.prior.sample(batch_size=batch_size)
        return self.decoder.sample(z)

    def forward(self, x, reduction='mean'):
        # output: the negative ELBO (NELBO) that is either averaged or summed (VERY IMPORTANT!)

        #print("in foward")

        mu_e, log_var_e = self.encoder.encode(x)

        #print("mu, var : ",mu_e,"-",log_var_e)
        z = self.encoder.sample(mu_e=mu_e, log_var_e=log_var_e)

        # ELBO
        RE = self.decoder.log_prob(x, z) # RECONSTRUCTION ERROR
        logprobz=self.prior.log_prob(z)
        #print("logprobz")
        #print(logprobz.shape)
        encologprob=self.encoder.log_prob(mu_e=mu_e, log_var_e=log_var_e, z=z)
        #print("encologprob")
        #print(encologprob.shape)
        
        KL = (logprobz - encologprob).sum(-1)# REGULIZER

        error = 0
        if np.isnan(RE.detach().numpy()).any():
            print('RE {}'.format(RE))
            error = 1
        if np.isnan(KL.detach().numpy()).any():
            print('RE {}'.format(KL))
            error = 1

        if error == 1:
            raise ValueError()

        if reduction == 'sum':
            return -(RE + KL).sum()
        else:
            return -(RE + KL).mean()


#### Question 6 > TO COMPLETE NEED MORE EXPLANATIONS

Please explain your choice of the distribution for the conditional likelihood function, and write down mathematically the log-probability of the decoder.

ANSWER: I'm not sure to understand the question but ...

The conditionnal likelihood is $p(z|x)$ also noted as $p_θ(x)$ :
$$ p_θ(\mathbf{x}) = \int p_θ(\mathbf{x} | \mathbf{z}) p(\mathbf{z})\ \mathrm{d} \mathbf{z} $$
this probability is lower

#### Question 7

Please write down mathematically the **Negative ELBO**.

**ANSWER:**

As a lot of package we generally minimize a training objective, we have to use the negative ELBO
$$ −ELBO(D; θ, ϕ) = \sum^N_{n=1}-\{\ \ln Categorial(x_n|θ(z_{ϕ,n})) + [\ln N(z_{ϕ,n}|μ_ϕ(X_n),σ²_ϕ(x_n)+\ln N(z_{ϕ,n|0,I}]\}$$

### Evaluation and training functions

**Please DO NOT remove or modify them.**

In [100]:
# ==========DO NOT REMOVE==========

def evaluation(test_loader, name=None, model_best=None, epoch=None):
    # EVALUATION
    if model_best is None:
        # load best performing model
        model_best = torch.load(name + '.model')

    model_best.eval()
    loss = 0.
    N = 0.
    for indx_batch, (test_batch, _) in enumerate(test_loader):
        test_batch = test_batch.to(device)
        loss_t = model_best.forward(test_batch, reduction='sum')
        loss = loss + loss_t.item()
        N = N + test_batch.shape[0]
    loss = loss / N

    if epoch is None:
        print(f'FINAL LOSS: nll={loss}')
    else:
        print(f'Epoch: {epoch}, val nll={loss}')

    return loss


def samples_real(name, test_loader, shape=(28,28)):
    # real images-------
    num_x = 4
    num_y = 4
    x, _ = next(iter(test_loader))
    x = x.to('cpu').detach().numpy()

    fig, ax = plt.subplots(num_x, num_y)
    for i, ax in enumerate(ax.flatten()):
        plottable_image = np.reshape(x[i], shape)
        ax.imshow(plottable_image, cmap='gray')
        ax.axis('off')

    plt.savefig(name+'_real_images.pdf', bbox_inches='tight')
    plt.close()


def samples_generated(name, data_loader, shape=(28,28), extra_name=''):
    x, _ = next(iter(data_loader))
    x = x.to('cpu').detach().numpy()

    # generations-------
    model_best = torch.load(name + '.model')
    model_best.eval()

    num_x = 4
    num_y = 4
    x = model_best.sample(num_x * num_y)
    x = x.to('cpu').detach().numpy()

    fig, ax = plt.subplots(num_x, num_y)
    for i, ax in enumerate(ax.flatten()):
        plottable_image = np.reshape(x[i], shape)
        ax.imshow(plottable_image, cmap='gray')
        ax.axis('off')

    plt.savefig(name + '_generated_images' + extra_name + '.pdf', bbox_inches='tight')
    plt.close()


def plot_curve(name, nll_val):
    plt.plot(np.arange(len(nll_val)), nll_val, linewidth='3')
    plt.xlabel('epochs')
    plt.ylabel('nll')
    plt.savefig(name + '_nll_val_curve.pdf', bbox_inches='tight')
    plt.close()

In [101]:
# ==========DO NOT REMOVE==========

def training(name, max_patience, num_epochs, model, optimizer, training_loader, val_loader, shape=(28,28)):
    nll_val = []
    best_nll = 1000.
    patience = 0
    # Main loop
    for e in range(num_epochs):
        
        print(">>>>>>>>>>>>> epoch ",e,"/",num_epochs)
        # TRAINING
        model.train()
        #print("aftertrain >ep ",num_epochs)
        for indx_batch, (batch, _) in enumerate(training_loader):
            model = model.to(device)
            batch = batch.to(device)
            #print("batch/")
            loss = model.forward(batch, reduction='mean')
            #print("/loss")

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            #print("end loop")

        # Validation
        loss_val = evaluation(val_loader, model_best=model, epoch=e)
        nll_val.append(loss_val)  # save for plotting

        if e == 0:
            print('saved!')
            torch.save(model, name + '.model')
            best_nll = loss_val
        else:
            if loss_val < best_nll:
                print('saved!')
                torch.save(model, name + '.model')
                best_nll = loss_val
                patience = 0
                
                samples_generated(name, val_loader, shape=shape, extra_name="_epoch_" + str(e))
            else:
                patience = patience + 1

        if patience > max_patience:
            break

    nll_val = np.asarray(nll_val)

    return nll_val

### Setup

**NOTE: *Please comment your code! Especially if you introduce any new variables (e.g., hyperparameters).***

In the following cells, we define `transforms` for the dataset. Next, we initialize the data, a directory for results and some fixed hyperparameters.

In [102]:
# PLEASE DEFINE APPROPRIATE TRANFORMS FOR THE DATASET
# (If you don't see any need to do that, then you can skip this cell)
# HINT: Please prepare your data accordingly to your chosen distribution in the decoder
from torchvision import transforms
transforms_train = torchvision.transforms.Compose( [
                                                    #transforms.Resize((28,28)),
                                                    transforms.ToTensor(),  transforms.Lambda(lambda x: torch.flatten(x))
] )
transforms_test = torchvision.transforms.Compose( [
                                                    #transforms.Resize((28,28)),
                                                    transforms.ToTensor(), transforms.Lambda(lambda x: torch.flatten(x))
] )

#transforms_train = None
#transforms_test = None

Please do not modify the code in the next cell.

In [103]:
# ==========DO NOT REMOVE==========
#-dataset
dataset = MNIST('/files/', train=True, download=True,
                      transform=transforms_train
                )

train_dataset, val_dataset = torch.utils.data.random_split(dataset, [50000, 10000], generator=torch.Generator().manual_seed(14))

test_dataset = MNIST('/files/', train=False, download=True,
                      transform=transforms_test
                     )
#-dataloaders
batch_size = 32

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

#-creating a dir for saving results
name = 'vae_bern_mg'
result_dir = images_dir + 'results/' + name + '/'
if not(os.path.exists(result_dir)):
    os.mkdir(result_dir)


In the next cell, please initialize the model. Please remember about commenting your code!

In [104]:
# BASIC HYPERPARAMETERS
#D = 784   # input dimension # 28X28
D = 784
L = 2  # number of latents

M = 256  # the number of neurons in scale (s) and translation (t) nets
num_epochs = 1000 # max. number of epochs
max_patience = 20 # an early stopping is used, if training doesn't improve for longer than 20 epochs, it is stopped

# For gaussian mixture
num_components = 4**2

# model definition
# FILL IN
likelihood_type = 'bernoulli'

if likelihood_type == 'categorical':
    num_vals = 17
elif likelihood_type == 'bernoulli':
    num_vals = 1
    
# YOUR CODE COMES HERE:
# FILL IN ANY OTHER HYPERPARAMS YOU WANT TO USE

In [105]:
# INIT YOUR VAE (PLEASE CALL IT model)
# AN EXAMPLE: model = VAE(encoder, decoder, likelihood_type=likelihood_type, ...)

print("input dim : ",D)
print("nb neurones : ",M)
print("nb latents: ",L)
print("2 * L : ",2 * L)
print("num_vals >",num_vals)
print("num_vals * D : ",num_vals * D)

encoder_net = nn.Sequential(nn.Linear(D, M), nn.LeakyReLU(),
                        nn.Linear(M, M), nn.LeakyReLU(),
                        nn.Linear(M, 2 * L))

encoder = Encoder(encoder_net=encoder_net)
decoder_net = nn.Sequential(nn.Linear(L, M), nn.LeakyReLU(),
                        nn.Linear(M, M), nn.LeakyReLU(),
                        nn.Linear(M, num_vals * D))

decoder = Decoder(distribution=likelihood_type, decoder_net=decoder_net, num_vals=num_vals)

#prior = torch.distributions.MultivariateNormal(torch.zeros(L), torch.eye(L))
prior =  Prior(L=L, num_components=num_components)
#model = VAE(encoder=encoder, decoder=decoder, num_vals=num_vals, prior=L,num_compo=num_components, likelihood_type=likelihood_type)
model = VAE(encoder, decoder, prior, num_vals=num_vals, L=L, likelihood_type=likelihood_type)

# Print the summary (like in Keras)
#print("ENCODER:\n", summary(encoder, torch.zeros(1, D), show_input=True, show_hierarchical=True))
#print("\nDECODER:\n",  summary(decoder, torch.zeros(1, L), show_input=True, show_hierarchical=True))

model.to(device)

input dim :  784
nb neurones :  256
nb latents:  2
2 * L :  4
num_vals > 1
num_vals * D :  784
VAE by CR.


VAE(
  (encoder): Encoder(
    (encoder): Sequential(
      (0): Linear(in_features=784, out_features=256, bias=True)
      (1): LeakyReLU(negative_slope=0.01)
      (2): Linear(in_features=256, out_features=256, bias=True)
      (3): LeakyReLU(negative_slope=0.01)
      (4): Linear(in_features=256, out_features=4, bias=True)
    )
  )
  (decoder): Decoder(
    (decoder): Sequential(
      (0): Linear(in_features=2, out_features=256, bias=True)
      (1): LeakyReLU(negative_slope=0.01)
      (2): Linear(in_features=256, out_features=256, bias=True)
      (3): LeakyReLU(negative_slope=0.01)
      (4): Linear(in_features=256, out_features=784, bias=True)
    )
  )
  (prior): Prior()
)

Please initialize the optimizer

In [106]:
# PLEASE DEFINE YOUR OPTIMIZER
lr = 1e-5 # learning rate (PLEASE CHANGE IT AS YOU WISH!)
#optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
optimizer = torch.optim.Adamax([p for p in model.parameters() if p.requires_grad == True], lr=lr)

#### Question 8

Please explain the choice of the optimizer, and comment on the choice of the hyperparameters (e.g., the learing reate value).

**ANSWER:**
I considered Adamax as in the Jacub example, that is not very adventurous but it led me to understand why this is a good optimizer

Adamax is a extension desifn for accelerating Adam, a gradient descent algorithm that update the objective function after a given batch size.

This one is able to deal with non stationnary objectives and sparse gradients as explained in the original paper : https://arxiv.org/pdf/1412.6980.pdf


### Training and final evaluation

In the following two cells, we run the training and the final evaluation.

In [None]:
# ==========DO NOT REMOVE OR MODIFY==========
# Training procedure
nll_val = training(name=result_dir + name, max_patience=max_patience,
                   num_epochs=num_epochs, model=model, optimizer=optimizer,
                   training_loader=train_loader, val_loader=val_loader,
                   shape=(28,28))

>>>>>>>>>>>>> epoch  0 / 1000
Epoch: 0, val nll=314.96735698242185
saved!
>>>>>>>>>>>>> epoch  1 / 1000
Epoch: 1, val nll=275.5360330810547
saved!
>>>>>>>>>>>>> epoch  2 / 1000
Epoch: 2, val nll=265.0988737548828
saved!
>>>>>>>>>>>>> epoch  3 / 1000
Epoch: 3, val nll=260.1923396484375
saved!
>>>>>>>>>>>>> epoch  4 / 1000
Epoch: 4, val nll=257.27139794921874
saved!
>>>>>>>>>>>>> epoch  5 / 1000
Epoch: 5, val nll=254.76557641601562
saved!
>>>>>>>>>>>>> epoch  6 / 1000
Epoch: 6, val nll=252.35672780761718
saved!
>>>>>>>>>>>>> epoch  7 / 1000
Epoch: 7, val nll=250.24391318359375
saved!
>>>>>>>>>>>>> epoch  8 / 1000
Epoch: 8, val nll=248.5674984375
saved!
>>>>>>>>>>>>> epoch  9 / 1000
Epoch: 9, val nll=247.3224782470703
saved!
>>>>>>>>>>>>> epoch  10 / 1000
Epoch: 10, val nll=246.27089418945312
saved!
>>>>>>>>>>>>> epoch  11 / 1000
Epoch: 11, val nll=245.45234008789063
saved!
>>>>>>>>>>>>> epoch  12 / 1000
Epoch: 12, val nll=244.86294567871093
saved!
>>>>>>>>>>>>> epoch  13 / 1000
Epoch: 13

>>>>>>>>>>>>> epoch  109 / 1000
Epoch: 109, val nll=217.8172957763672
saved!
>>>>>>>>>>>>> epoch  110 / 1000
Epoch: 110, val nll=217.72037841796876
saved!
>>>>>>>>>>>>> epoch  111 / 1000
Epoch: 111, val nll=217.63186086425782
saved!
>>>>>>>>>>>>> epoch  112 / 1000
Epoch: 112, val nll=217.59454716796876
saved!
>>>>>>>>>>>>> epoch  113 / 1000
Epoch: 113, val nll=217.46168818359374
saved!
>>>>>>>>>>>>> epoch  114 / 1000
Epoch: 114, val nll=217.38583745117188
saved!
>>>>>>>>>>>>> epoch  115 / 1000
Epoch: 115, val nll=217.29655539550782
saved!
>>>>>>>>>>>>> epoch  116 / 1000
Epoch: 116, val nll=217.25532238769532
saved!
>>>>>>>>>>>>> epoch  117 / 1000
Epoch: 117, val nll=217.17374897460937
saved!
>>>>>>>>>>>>> epoch  118 / 1000
Epoch: 118, val nll=217.0603259765625
saved!
>>>>>>>>>>>>> epoch  119 / 1000
Epoch: 119, val nll=217.0105115234375
saved!
>>>>>>>>>>>>> epoch  120 / 1000
Epoch: 120, val nll=216.92694841308594
saved!
>>>>>>>>>>>>> epoch  121 / 1000
Epoch: 121, val nll=216.89547097167

>>>>>>>>>>>>> epoch  215 / 1000
Epoch: 215, val nll=211.65697661132813
>>>>>>>>>>>>> epoch  216 / 1000
Epoch: 216, val nll=211.625004296875
saved!
>>>>>>>>>>>>> epoch  217 / 1000
Epoch: 217, val nll=211.56822380371094
saved!
>>>>>>>>>>>>> epoch  218 / 1000
Epoch: 218, val nll=211.57252846679688
>>>>>>>>>>>>> epoch  219 / 1000
Epoch: 219, val nll=211.53095236816407
saved!
>>>>>>>>>>>>> epoch  220 / 1000
Epoch: 220, val nll=211.45523315429688
saved!
>>>>>>>>>>>>> epoch  221 / 1000
Epoch: 221, val nll=211.4794091796875
>>>>>>>>>>>>> epoch  222 / 1000
Epoch: 222, val nll=211.36594926757812
saved!
>>>>>>>>>>>>> epoch  223 / 1000
Epoch: 223, val nll=211.3451244140625
saved!
>>>>>>>>>>>>> epoch  224 / 1000
Epoch: 224, val nll=211.33345251464843
saved!
>>>>>>>>>>>>> epoch  225 / 1000
Epoch: 225, val nll=211.30485505371092
saved!
>>>>>>>>>>>>> epoch  226 / 1000
Epoch: 226, val nll=211.22915139160156
saved!
>>>>>>>>>>>>> epoch  227 / 1000
Epoch: 227, val nll=211.23013896484375
>>>>>>>>>>>>> epoc

In [None]:
# ==========DO NOT REMOVE OR MODIFY==========
# Final evaluation
test_loss = evaluation(name=result_dir + name, test_loader=test_loader)
f = open(result_dir + name + '_test_loss.txt', "w")
f.write(str(test_loss))
f.close()
### JUST ADDED FINAL FOR BETTER COMPARISON
samples_real(result_dir + name+'_FINAL', test_loader)
samples_generated(result_dir + name, test_loader, extra_name='_FINAL')

plot_curve(result_dir + name, nll_val)

### Results and discussion

After a successful training of your model, we would like to ask you to present your data and analyze it. Please answer the following questions.

#### Question 9 TODO

Please select the real data, and the final generated data and include them in this report. Please comment on the following:
- Do you think the model was trained properly by looking at the generations? Please motivate your answer well.
- What are the potential problems with evaluating a generative model by looking at generated data? How can we evaluate generative models (NOTE: ELBO or NLL do not count as answers)?

ANSWER: [Please fill in]

#### Question 10

Please include the plot of the negative ELBO. Please comment on the following:
- Is the training of your VAE stable or unstable? Why?
- What is the influence of the optimizer on your model? Do the hyperparameter values of the optimizer important and how do they influence the training? Motivate well your answer (e.g., run the script with more than one learning rate and present two plots here).

ANSWER: [Please fill in]