# Lab 7a - Variational Auto-Encoder in 2D

The goal of a variational auto-encoder (VAE) is to learn a **generative model**, that is 
1. to learn a latent representation of the input data in a lower-dimensional space that is well structured, 
2. to be able to sample from this latent space to generate new data points that resemble the training data.



---
## In a nutshell : VAE vs AE
### Architecture
A VAE is a type of neural network that consists of two main components: an **encoder** and a **decoder** : 
- The encoder $E : x \in \mathbb R^{d} \to z = E(x) \in \mathbb R^{D} $ 
maps the input data to a latent space, 
- while the decoder $D : z \in \mathbb R^{D} \to y = D(z) \in \mathbb R^{d} $
 maps points in the latent space back to the original data space. 
The key idea is to impose a probabilistic structure on the latent space, typically by assuming that the latent variables follow a Gaussian distribution : $z \sim \mathcal N(\mu, \Sigma)$ as it is easy to sample from (as seen earlier with GMM).

In some sense, a VAE can be seen as a probabilistic extension of a standard auto-encoder (AE) combined with GMM, where the encoder outputs parameters of a probability distribution (mean and variance) instead of a single point in the latent space.

### Training
A VAE is trained by maximizing the **variational lower bound** on the log-likelihood of the data, which consists of two terms: the reconstruction loss (how well the decoder can reconstruct the input data from the latent representation) and a regularization term that encourages the latent space to follow a Gaussian distribution. This regularization is achieved by introducing a **Kullback-Leibler divergence** term that measures how much the learned latent distribution deviates from a prior distribution (usually a standard normal distribution).


Here we implement a simple variational auto-encoder based on MLP using PyTorch.

---

Reference: "Auto-encoding variational Bayes.", Kingma, Diederik P., and Max Welling., Int. Conf. on Learning Representations (ICLR), 2014.

julien rabin @ greyc.ensicaen.fr 2025

### useful imports, definitions and data setup

In [None]:
import torch 
from torch import nn, Tensor

import numpy as np

import matplotlib.pyplot as plt

from tqdm import tqdm

In [None]:
def sample_data(n: int, model = 'radial_gmm') -> Tensor:
    data_dim = 2  # Dimension of the data
    if model == 'moons':
        from sklearn.datasets import make_moons
        return Tensor(make_moons(n_samples=n, noise=0.05)[0])
    elif model == 'circles':
        from sklearn.datasets import make_circles
        return Tensor(make_circles(n_samples=n, noise=0.05, factor=0.5)[0])
    elif model == '2gmm':
        n_samples = n
        n = n//3
        X1 = torch.randn(n, data_dim) @ torch.tensor([[.05, -0.02],[-0.02, .4]]) + torch.tensor([[.0, 1.0]]).view(1, data_dim)
        X2 = torch.randn(n_samples - n, data_dim) @ torch.tensor([[.3, 0.05],[0.05, .05]]) + torch.tensor([[-1.0, 0.]]).view(1, data_dim)
        return torch.cat((X1,X2), dim=0)  # Concatenate the two sets of samples
    elif model == 'radial_gmm':
        K = 8
        n_samples = n
        samples_per_component = n_samples // K
        remainder = n_samples % K
        all_samples = []

        for k in range(K):
            radius = 3.
            # Angle for the mean on a circle
            theta = 2 * np.pi * k / K
            cs = np.cos(theta)
            sn = np.sin(theta)

            # Radial direction unit vector
            radial = torch.tensor([cs, sn], dtype=float)#.view(data_dim, 1)
            tangential = torch.tensor([-sn, cs], dtype=float)#.view(data_dim, 1)

            mean = radius * radial
            
            # Covariance matrix: elongated along radial direction
            cov = 0.3 * torch.outer(radial, radial).to(float) + 0.05 * torch.outer(tangential, tangential).to(float)

            # Generate samples
            # The last component absorbs the remainder so that exactly n_samples points are returned
            n = samples_per_component + (remainder if k == K - 1 else 0)
            # With cov = L L^T (L lower triangular), row samples eps @ L^T have covariance cov
            samples = torch.randn(n, data_dim, dtype=float) @ torch.linalg.cholesky(cov).T + mean
            all_samples.append(samples)

        return torch.cat(all_samples, dim=0).to(torch.float32)
    else:
        raise ValueError(f"Unknown model: {model}.")


In [None]:
def plot_data_comparison(x_data : np.ndarray, x_model : np.ndarray, colors = None):
    if colors is None:
        colors = 'C0' #np.random.rand(x_data.shape[0])
    fig, axes = plt.subplots(1, 2, figsize=(20, 10), sharex=True, sharey=True)

    # Original data
    axes[0].scatter(x_data[:, 0], x_data[:, 1], s=40, c=colors)
    axes[0].set_title('Data Samples')
    axes[0].set_xlim(-3.0, 3.0)
    axes[0].set_ylim(-3.0, 3.0)

    # VAE reconstruction
    axes[1].scatter(x_model[:, 0], x_model[:, 1], s=40, c=colors)
    axes[1].set_title('VAE Output Samples')
    axes[1].set_xlim(-3.0, 3.0)
    axes[1].set_ylim(-3.0, 3.0)

    # Global title
    fig.suptitle("Original vs Model samples", fontsize=24)
    plt.tight_layout()
    plt.subplots_adjust(top=0.88)

    return fig, axes


In [None]:
def define_colors(x_data: np.ndarray ) -> np.ndarray:
    cmap = plt.get_cmap("rainbow")
    values = x_data[:, 0]  # Use the first dimension for coloring
    normalized = (values - values.min()) / (values.max() - values.min())
    colors = cmap(normalized)
    return colors[:, :3]

### Exercise 1: Implementation of a PyTorch VAE class 

Complete the following class implementing a VAE with MLP encoder and decoder.
You can refer to the AE class implemented in the previous labs.

Recall that 
1. the encoder $E$ should output both the **mean** and the **standard deviation** of the latent Gaussian distribution, that is 
$$E(x) = (\mu(x), \sigma(x)).$$
Here, $\mu(x)$ and $\sigma(x)$ are both vectors of size equal to the latent dimension (`latent_dim=1` by default).

Note that in practice, for numerical stability (and to deal with positivity constraints), it is common to output the log-variance $\log \sigma^2(x)$ instead of the standard deviation $\sigma(x)$ directly.

2. the reparameterization trick should be used to sample from the latent distribution during training, that is
$$z = \mu(x) + \sigma(x) \odot \epsilon,$$
where $\odot$ is the element-wise product and $\epsilon \sim \mathcal N(0, I)$ is a standard normal random vector.
3. the loss function should include both the reconstruction loss and the KL divergence term (see details later).
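
The two ways of producing a positive standard deviation mentioned in the note above can be compared on a toy tensor. This is only a minimal sketch: `raw` is a hypothetical stand-in for the second half of an encoder output, not the output of the lab's `VAE` class.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
raw = torch.randn(4, 1)  # hypothetical raw encoder output (batch_size x latent_dim)

# Option 1: interpret the raw output as a pre-activation of the standard deviation
std_softplus = F.softplus(raw)           # softplus(t) = log(1 + exp(t)) > 0

# Option 2: interpret the raw output as a log-variance, so that std = exp(0.5 * log_var)
log_var = raw
std_from_logvar = torch.exp(0.5 * log_var)

print(std_softplus.flatten())
print(std_from_logvar.flatten())
```

Both options return strictly positive values; they only differ in how the unconstrained network output is mapped to $\sigma(x)$.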


In [None]:
class VAE(nn.Module):
    def __init__(self, dim: int = 2, h: int = 64, latent_dim: int = 1):
        super().__init__()
        
        self.lat_dim = latent_dim
        self.encoder = nn.Sequential( # output dimensions are batch_size x (2*latent_dim) for mean and logvar
            nn.Linear(dim, 2*latent_dim)) # Output mean and logvar
        
        self.decoder = nn.Sequential( # input dimensions are batch_size x latent_dim
            nn.Linear(latent_dim, dim))
    
    def forward(self, x: Tensor) -> Tensor: # AE mode : encode then decode, using the reparameterization trick
        # Note : x is of shape (batch_size, dim)
        
        # Extract mean and std from the output, both of shape (batch_size, latent_dim)
        mu, std = ... # use 'encode_mu_std' method to get mu and std
        
        # Decode the latent space representation to data space: Output is of shape (batch_size, dim)
        y = ... # use method 'decode_mu_std' 
        
        return y # Note: this is the *mean* of the generated distribution in the data space
    
    def encode_mu_std(self, x: Tensor) -> tuple[Tensor, Tensor]:
        # Encode input x to latent space
        out = self.encoder(x)
        
        mu  = out[:,:self.lat_dim]  # First half is the mean of the gaussian distribution
        if True : # try predicting the standard deviation directly
            std = nn.Softplus()(out[:,self.lat_dim:]) # Second half is the standard deviation  > 0
        else : # try predicting the log variance instead
            std = ...
        return mu, std
    
    def decode_mu_std(self, mu: Tensor, std : Tensor) -> Tensor:
        # Decode latent space representation z to data space
        
        # Reparameterization trick
        z = ... # sample from the gaussian latent space using predicted parameters mu and std
        
        return self.decoder(z)
    
    def sample(self, n: int, std : float = 0.) -> Tensor:
        # Sample from the latent space distribution N(0, I)
        z = ...
        if std is None : # prediction is the mean of the distribution
            return self.decoder(z)
        else : # prediction is a gaussian distribution with std
            y = self.decoder(z)
            return y + ... # add noise according to std
    

### Test the (untrained) VAE model : on data reconstruction

Check that, like an untrained AE, the untrained VAE model does not reconstruct the input data well.

In [None]:
dataset_name = 'radial_gmm' # moons, circles, 2gmm, radial_gmm
x = ...  # Generate synthetic data

# reconstruction of the training data
model = ... # use VAE class to create model with different parameters
y = model(x).detach()

# define colors for the data points
colors = define_colors(x.detach())

fig, axes = plot_data_comparison(x,y, colors=colors)

fig.suptitle("*Untrained* VAE reconstruction of data samples", fontsize=24)

### Test the (untrained) VAE model : generating samples by sampling the latent space

Recall that an ideal VAE has a well-structured latent space, so that sampling from the prior distribution in the latent space should produce meaningful samples in the data space after decoding.

In this lab we consider the usual prior distribution for the latent space, that is a standard normal distribution $\mathcal N(0, I)$. 
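
The generation principle (draw $z$ from the prior, then decode) can be sketched independently of the lab's class. In this illustration `toy_decoder` and `noise_std` are hypothetical stand-ins chosen for the example; the decoder is an untrained linear layer, so the outputs are not meaningful samples.

```python
import torch
from torch import nn

torch.manual_seed(0)
latent_dim, data_dim, n = 1, 2, 5

# Hypothetical stand-in for a decoder (untrained), not the lab's VAE class
toy_decoder = nn.Linear(latent_dim, data_dim)

z = torch.randn(n, latent_dim)                 # z ~ N(0, I): sample the prior
with torch.no_grad():
    y_mean = toy_decoder(z)                    # mean of the distribution in data space
    noise_std = 0.1                            # assumed noise level in data space
    y = y_mean + noise_std * torch.randn_like(y_mean)

print(y.shape)
```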

In [None]:
torch.manual_seed(42)
x = ...

# Generate samples from the untrained VAE
y = ... # use 'sample' method of the VAE model to generate samples

fig, axes = plot_data_comparison(x,y)

fig.suptitle("*UNTRAINED* VAE generation of random samples", fontsize=24)

# Principle of likelihood maximization for generative modeling with a decoder

As seen with GMM and AE, the generative model itself is composed of two main components:
- a latent space $z \in \mathcal Z$ that is (easily!) sampled from a prior distribution $p$ (usually a standard normal distribution in $\mathbb R^D$): from now on we assume $z \sim p(z) = \mathcal N(z|\mu, \Sigma)$
- a parametric (here a Neural Network) decoder $D : z \in \mathcal Z \to y = D(z) \in \mathcal X$ that maps points in the latent space to the data space $\mathcal X$.
This decoder can also be used to have a parametric model of the generated data $x'$ given $z$: again, a Gaussian distribution is often used, typically centered on the output of the decoder: $x'|z \sim p_\theta(x'|z) = \mathcal N(x' | D(z), \Sigma')$.

Like GMM, the goal is to maximize the log-likelihood of the data $x \sim p_{\text{data}} \in \mathcal X$ with the generative model parametrized by $\theta$.
This can be expressed as (for a single data point $x$):
$$
    \ell(\theta, x) = \log p_{\theta}(x) = \log \int p_{\theta}(x|z) p(z) dz
    = \log \mathbb E_{z \sim p(z)}[p_\theta(x|z)]
$$
where $\theta$ are the parameters of the decoder $D$ and the prior distribution $p(z)$.
To train the model we have to maximize the log-likelihood over training samples from the data distribution $p_{\text{data}}$:
$$
    \mathcal L(\theta) = \mathbb E_{x\sim p_{\text{data}}} \ell(\theta, x) = \mathbb E_{x\sim p_{\text{data}}} \log \mathbb E_{z \sim p(z)}[p_\theta(x|z)]
$$

However, this is **intractable**:
- the integral cannot be computed analytically because the conditional distribution $p_{\theta}(x|z)$, parametrized by the decoder, is typically complex and high-dimensional ! 

*Note: Some models we consider later address this specific problem by considering particular classes of neural networks (e.g. Normalizing Flows, discrete autoregressive models) that allow computing the likelihood of the data point $x$ given the latent variable $z$ in a tractable way, but this is not the case in general.*

- the latent variables $z$ corresponding to data points $x$ are not observed, so we cannot directly sample from them to compute the expectation,

*Note that the posterior distribution $p_{\theta}(z|x) = \frac{p_{\theta}(x|z)p(z)}{p_{\theta}(x)}$ (by Bayes' rule) is also complex and high-dimensional, since its denominator is precisely the intractable marginal likelihood.*
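
To make the quantity $\log \mathbb E_{z \sim p(z)}[p_\theta(x|z)]$ concrete, here is a minimal sketch of a naive Monte Carlo estimate in a toy setting where the exact answer is known. The model is an assumption made for the illustration only: a one-dimensional latent $z \sim \mathcal N(0,1)$ and a hypothetical linear "decoder" $D(z) = a z$ with noise level `s`, so that $p_\theta(x) = \mathcal N(x \,|\, 0, a^2 + s^2)$.

```python
import math
import random

def log_normal_pdf(x, mean, std):
    # log-density of a 1D Gaussian N(mean, std^2) evaluated at x
    return -0.5 * math.log(2 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def mc_log_likelihood(x, a, s, n_samples, rng):
    # log p(x) ~= log( (1/K) * sum_k p(x | z_k) ) with z_k ~ N(0, 1)
    log_terms = [log_normal_pdf(x, a * rng.gauss(0.0, 1.0), s) for _ in range(n_samples)]
    m = max(log_terms)  # log-sum-exp trick for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in log_terms)) - math.log(n_samples)

rng = random.Random(0)
x, a, s = 1.0, 2.0, 0.5          # toy values (assumptions for the illustration)
estimate = mc_log_likelihood(x, a, s, 100_000, rng)
exact = log_normal_pdf(x, 0.0, math.sqrt(a ** 2 + s ** 2))
print(f"Monte Carlo estimate: {estimate:.3f}, exact value: {exact:.3f}")
```

Sampling from the prior works here because the latent space is one-dimensional and many prior samples explain $x$ reasonably well; with a neural decoder and a larger latent space, most prior samples contribute almost nothing to the sum, which is what motivates the approximate posterior below.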

**Idea:** to make the problem tractable, try to **approximate** the posterior distribution $p_{\theta}(z|x)$ with a simpler distribution $q_\phi(z|x)$, where $\phi$ are the parameters of an encoder $E$ which is trained to infer the unobserved latent variable $z$ from a data point $x$.

# Principle of VAE for generative modeling

- **Key idea #1:** To address this issue, we use a **variational approach**: we introduce a variational distribution $q_\phi(z|x)$ (parametrized by an encoder) that approximates the true posterior distribution $p(z|x)$ of the latent variables given the data. The goal is to find the parameters $\phi$ of this variational distribution that minimize the Kullback-Leibler divergence between $q_\phi(z|x)$ and $p(z|x)$.

- **Key idea #2:** As stated before, even knowing the posterior distribution $p(z|x)$, the integral $\int p_{\theta}(x|z) p(z) dz$ is still intractable, so we cannot compute the log-likelihood of the data directly. Instead, we can use the **variational lower bound** on the log-likelihood of the data, which can be expressed as (*see the course for details*) for any distribution $q(z)$:
$$
    \log p_\theta(x) 
    = \mathbb E_{z \sim q(z)} \left[ \log p_{\theta}(x) \right] 
    \ge \mathbb E_{z \sim q(z)} \left[ \log \frac{p_{\theta}(x,z)}{q(z)} \right] 
$$

Using $q(z) = q_\phi(z|x)$ yields (*see the course for details*) the following loss function to maximize:
$$\begin{align*}
    \mathcal L(\theta, \phi) &= \mathbb E_{x\sim p_{\text{data}}} \left[ \mathbb E_{z \sim q_\phi(z|x)} \left[ \log p_{\theta}(x|z) \right] - \text{KL}(q_\phi(z|x) \,||\, p(z)) \right]
\end{align*}$$
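
The lower bound can be checked numerically in a toy case where every term has a closed form. The sketch below assumes a hypothetical one-dimensional linear-Gaussian model ($z \sim \mathcal N(0,1)$, $x|z \sim \mathcal N(a z, s^2)$) and a Gaussian $q = \mathcal N(m_q, v_q)$; none of these names come from the lab code. The bound is tight when $q$ is the exact posterior and strictly lower otherwise.

```python
import math

def log_normal_pdf(x, mean, var):
    # log-density of a 1D Gaussian N(mean, var) evaluated at x
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def elbo(x, a, s2, m_q, v_q):
    # E_{z~q}[log p(x|z)] for q = N(m_q, v_q) and p(x|z) = N(a*z, s2)
    expected_log_lik = (-0.5 * math.log(2 * math.pi * s2)
                        - ((x - a * m_q) ** 2 + a ** 2 * v_q) / (2 * s2))
    # KL( N(m_q, v_q) || N(0, 1) )
    kl = 0.5 * (m_q ** 2 + v_q - math.log(v_q) - 1.0)
    return expected_log_lik - kl

x, a, s2 = 1.0, 2.0, 0.25                 # toy values (assumptions)
log_px = log_normal_pdf(x, 0.0, a ** 2 + s2)  # exact marginal log-likelihood

# Exact posterior of the linear-Gaussian model
v_post = 1.0 / (1.0 + a ** 2 / s2)
m_post = v_post * a * x / s2

elbo_post = elbo(x, a, s2, m_post, v_post)    # q = exact posterior
elbo_prior = elbo(x, a, s2, 0.0, 1.0)         # q = prior (a poor approximation)
print(f"log p(x) = {log_px:.4f}")
print(f"ELBO with exact posterior = {elbo_post:.4f}")
print(f"ELBO with q = prior       = {elbo_prior:.4f}")
```

The gap between $\log p_\theta(x)$ and the bound is the KL divergence between $q$ and the true posterior, which is why training the encoder to tighten the bound also improves the posterior approximation.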

- **Key idea #3 (reparametrization trick):** to make the computation of each term tractable, we use a simple Gaussian parametrization of the conditional densities involved: as stated before, we can impose for instance $p_{\theta}(x|z) = \mathcal N(x | D(z), \Sigma)$ with $\Sigma= \sigma^2 I_d$ a constant covariance matrix.
The same goes for the approximate posterior distribution $q_\phi(z|x)$ (whose form we are free to choose): we can use a Gaussian distribution with mean $\mu_\phi(x)$ and **diagonal covariance** matrix $\Sigma_\phi(x) = \text{diag}(\sigma_\phi^2(x))$, where this time both $\mu_\phi$ and $\sigma_\phi$ are neural networks (encoders).

## Training and Optimization

**Sampling**: During training, one can use batch optimization techniques such as stochastic gradient descent (SGD) or Adam to optimize the parameters $\theta$ and $\phi$ of the decoder and encoder, respectively. 
To do so, we need to sample a batch $\{x_i\}$ from the data distribution $p_{\text{data}}$ but also sample $z_i$ from the variational distribution $q_\phi(z|x_i)$ for each data point $x_i$.

After sampling $x_i$, instead of sampling $z_i$ directly from $q_\phi(z|x)$, we sample random variables $\epsilon_i$ from a standard distribution $\mathcal N(0,I_D)$ and then apply a deterministic transformation to obtain $z$ that depends on the data point $x$ and the parameters $\phi$:
$$
    z_i = \mu_\phi(x_i) + \sigma_\phi(x_i) \odot \epsilon_i
$$

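This reparametrization lets us backpropagate through the sampling step, since the randomness is carried by $\epsilon_i$ alone and $z_i$ is a differentiable function of $\phi$; the parameters of the encoder and decoder can therefore be optimized jointly.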
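
The effect of this change of variables on gradients can be illustrated on a toy problem. In the sketch below, `mu` and `sigma` are free scalar parameters (hypothetical stand-ins for the encoder outputs) and the objective $\mathbb E[z^2] = \mu^2 + \sigma^2$ is chosen only because its gradients, $2\mu$ and $2\sigma$, are known exactly.

```python
import torch

torch.manual_seed(0)
mu = torch.tensor([0.5], requires_grad=True)     # stand-in for mu_phi(x)
sigma = torch.tensor([2.0], requires_grad=True)  # stand-in for sigma_phi(x)

eps = torch.randn(100_000, 1)    # noise drawn independently of mu and sigma
z = mu + sigma * eps             # z ~ N(mu, sigma^2), differentiable in (mu, sigma)

loss = (z ** 2).mean()           # Monte Carlo estimate of E[z^2] = mu^2 + sigma^2
loss.backward()

print(mu.grad, sigma.grad)       # Monte Carlo estimates of 2*mu and 2*sigma
```

Gradients reach `mu` and `sigma` because `z` is computed from them by ordinary tensor operations; only `eps` is random, and it does not depend on the parameters.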

**Loss evaluation**
Using samples $(z_i,x_i)$, the loss function used during training is a combination of two terms:

- **reconstruction loss** (MSE): measures how well the probabilistic model $p_{\theta}(x|z)$ can reconstruct the input data $x$ from the latent variable $z$ (predicted using the surrogate posterior $q_\phi(z|x)$), that is how close $y_i=D_{\theta}(z_i)$ is to the input data $x_i$, using the *negative* log-likelihood of a $d$-dimensional Gaussian distribution
$$
\begin{align*}
	\mathcal L_{\text{rec}} (\theta,\phi) 
	&= - \mathbb E_{x \sim p_{data}} \mathbb E_{z \sim q_{\phi}(z|x)} \log p_\theta(x|z) 
	\\
	&= \mathbb E_{x \sim p_{data}} \mathbb E_{z \sim q_{\phi}(z|x)} \tfrac{1}{2} \left(\tfrac{1}{\sigma^2}  \| D_{\theta}(z) - x \|^2 + d\log(2\pi) + d\log(\sigma^2) \right)
	\\
	&\approx \tfrac{1}{2N\sigma^2}   \sum_{i=1}^N \| D_{\theta}(z_i) - x_i \|^2 
	\;+\; \text{Cst}
\end{align*}
$$
where $\theta$ denotes the parameters of the decoder and $\sigma^2$ the (fixed) variance of the Gaussian distribution that models the reconstruction of $x$.
Remember that the encoder $E$ outputs the mean $\mu_\phi(x_i)$ and the standard deviation $\sigma_\phi(x_i)$ of the latent variable $z_i = \mu_\phi(x_i) + \sigma_\phi(x_i) \odot \epsilon_i$, so the reconstruction loss depends on the parameters of both the decoder $D$ and the encoder $E$!

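In practice, for a batch of $N$ samples $(x_i, \epsilon_i) \sim p_{\text{data}} \times \mathcal N(0,I_D)$ (ignoring the constant terms), we can write the reconstruction loss as an MSE loss, where the squared errors are summed over the $d$ coordinates and averaged over the batch:
$$
	\mathcal L_{\text{rec}} (\theta,\phi) 
	= \tfrac{\lambda}{2} \, \text{MSE}(x,D_\theta(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon))
	\quad \text{with} \quad \text{MSE}(x,y) = \tfrac{1}{N} \sum_{i=1}^N \| x_i - y_i \|^2
$$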
and the corresponding weight for the reconstruction loss is given by the inverse of the variance:
$$
	\lambda = \frac{1}{\sigma^2}
$$
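
The link between the Gaussian negative log-likelihood and the weighted squared error can be verified numerically. This is a minimal sketch on random tensors: `x` and `y` are arbitrary stand-ins for a data batch and the corresponding decoder outputs, and `sigma` is an assumed noise level.

```python
import math
import torch

torch.manual_seed(0)
N, d, sigma = 8, 2, 0.5
x = torch.randn(N, d)            # stand-in for a batch of data points
y = torch.randn(N, d)            # stand-in for decoder outputs D(z_i)

# Negative log-likelihood of x under N(y, sigma^2 I_d), averaged over the batch
nll = -torch.distributions.Normal(y, sigma).log_prob(x).sum(dim=1).mean()

# Same quantity written as (lambda / 2) * (batch-averaged sum of squared errors) + constant
lam = 1.0 / sigma ** 2
sse_per_sample = ((x - y) ** 2).sum(dim=1)
const = 0.5 * d * (math.log(2 * math.pi) + math.log(sigma ** 2))
weighted = 0.5 * lam * sse_per_sample.mean() + const

print(nll.item(), weighted.item())
```

A small `sigma` (large $\lambda$) therefore penalizes reconstruction errors heavily relative to the KL term, while a large `sigma` lets the regularization dominate.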

- **Kullback-Leibler (KL) divergence:** measures how well the latent space distribution matches a prior distribution (here the standard normal distribution in $D$ dimensions)

$$\text{KL} \left( {\mathcal N}(\mu,\text{diag}(\sigma^2)) \,||\, {\mathcal N}(0,I_D) \right) 
	= \tfrac{1}{2} \left(\|\mu\|^2 + \|\sigma\|^2  - \sum_{j=1}^D (\log (\sigma_j^2) +1) \right)
$$
so that
$$
\begin{align*}
	\mathcal L_{\text{KL}} (\phi) 
	&= \mathbb E_{x \sim p_{data}} \text{KL} \left( q_\phi(z|x) \,||\, p(z) \right)
	\\
	&\approx \tfrac{1}{2N} \sum_{i=1}^N \left( \|\mu_\phi(x_i)\|^2 + \|\sigma_\phi(x_i)\|^2 - 2 \langle  \log \sigma_{\phi}(x_i), 1_D\rangle - D \right)
\end{align*}
$$


Note that for the KL divergence to be well defined, the standard deviation $\sigma_\phi(x_i)$ must be positive, which can be achieved in different ways: for instance with a softplus activation (as in the `encode_mu_std` method above), or with an exponential activation $\sigma_\phi(x_i)=\exp(\text{MLP}(x_i))$, which is what predicting the log-variance amounts to, up to a factor $\tfrac{1}{2}$.
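
The closed-form expression of the KL divergence between a diagonal Gaussian and the standard normal can be cross-checked against `torch.distributions.kl_divergence`. In this sketch `mu` and `std` are random tensors standing in for encoder outputs; they are assumptions of the example, not values produced by the lab's model.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
N, D = 6, 3
mu = torch.randn(N, D)           # stand-in for mu_phi(x_i)
std = torch.rand(N, D) + 0.1     # stand-in for sigma_phi(x_i), strictly positive

# Closed form: 0.5 * sum_j ( mu_j^2 + sigma_j^2 - 2 log sigma_j - 1 ), one value per sample
kl_closed = 0.5 * (mu ** 2 + std ** 2 - 2 * torch.log(std) - 1).sum(dim=1)

# Reference: element-wise KL between 1D Gaussians, summed over the latent dimensions
prior = Normal(torch.zeros_like(mu), torch.ones_like(std))
kl_ref = kl_divergence(Normal(mu, std), prior).sum(dim=1)

print(kl_closed)
print(kl_ref)
```

Averaging `kl_closed` over the batch gives the Monte Carlo estimate of the expectation over $p_{\text{data}}$.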

### Exercise 2: Implement the VAE loss function and train a VAE model

As in previous labs, experiment with different values of the reconstruction weight $\lambda$ (`lambda_rec`) (related to the variance of the Gaussian reconstruction) and latent dimension `latent_dim`, starting from `latent_dim=1`.

In [None]:
# Initialize the VAE model and training parameters

torch.manual_seed(0) # For reproducibility

latent_dim = 1
model = ...

lambda_rec = ...  # Reconstruction loss weight (Note: related to the variance of the gaussian reconstruction)

learn_rate = ...
optimizer = ...

In [None]:
n_iter = 100
batch_size = 256

rec_losses = []
KL_losses = []

pbar = tqdm(range(n_iter), desc="Training VAE")
for _ in pbar :
    x = ...
    
    # Note : we use only once the encoder and the decoder for the same batch
    mu, std = ...
    y = ...
    var = std**2 
    
    optimizer.zero_grad()
    rec_loss = ... # Reconstruction loss : sum of squared errors (not mean !)
    KL_loss = ...  # KL divergence term between gaussians in latent space (see formula above !)
    loss = lambda_rec * rec_loss + KL_loss  # Total loss
    loss.backward()
    optimizer.step()
    
    rec_losses.append(rec_loss.item())
    KL_losses.append(KL_loss.item())
    
    pbar.set_postfix({
        'total loss': f"{loss.item():.4f}",
        'SSE loss'  : f"{rec_losses[-1]:.4f}",
        'KL loss'  : f"{KL_losses[-1]:.4f}"
    })

In [None]:
fig, ax = plt.subplots(1, figsize=(10, 5))
ax.semilogy(rec_losses, label='Reconstruction Loss', color='blue')
ax.semilogy(KL_losses, label='KL Divergence Loss', color='orange')
ax.semilogy([lambda_rec * x + y for x, y in zip(rec_losses, KL_losses)], label='Total Loss (weighted)', color='green')
ax.set_title('Losses during VAE Training')
ax.set_xlabel('Iteration')
ax.set_ylabel('Loss Value')
ax.legend()
plt.show()

### Reconstruction with a trained VAE model

Complete the following code to reconstruct training data with the trained VAE model.
Check that the reconstruction is better than with an untrained model.
Is it better than an AE model and why ?

In [None]:
# Generate samples from the training data
x = ...

# reconstruction of the training data : using only the mean of the learned distribution
y = ... # output of the model on input x is the mean of the generated distribution

colors = define_colors(x.detach())
fig, axes = plot_data_comparison(x,y, colors=colors)
fig.suptitle("VAE reconstruction of data samples", fontsize=24)

# reconstruction of the training data with noise
std = 1/np.sqrt(lambda_rec)
y_noised = ... # add noise with standard deviation 'std' to 'y'
fig, axes = plot_data_comparison(x,y_noised, colors=colors)
fig.suptitle("stochastic VAE reconstruction with gaussian noise", fontsize=24)

### Latent distribution of a trained VAE model

The purpose of the VAE is to learn a well-structured latent space, so that sampling from the prior distribution in the latent space should produce meaningful samples in the data space after decoding.

1. In the following cells, we visualize the learned **latent parameters** by plotting the mean and standard deviation of the latent variables for each data point in the training set and check their distribution.

What do we expect for these distributions ? what do you observe ?

2. We want to visualize the latent variables $z$ encoded by the VAE model.

What do we expect for this distribution ? what do you observe ?

In [None]:
# Generate samples from the training data
n = ...
x = ...

# reconstruction of the training data
mu, std = ...
eps = ... # random gaussian noise
z = ...  # corresponding random latent representation of x

eps = eps.detach().numpy()
z = z.detach().numpy()

In [None]:
# 1. Visualize the latent space parameters
fig, axes = plt.subplots(1, 2, figsize=(20, 10), sharex=False, sharey=False)

if mu.shape[1] == 1 :  # only if latent dimension is 1
    # Labels are required for legend() to have something to display
    axes[0].hist(mu.detach()[:, 0], bins=30, color='C0', label='latent mean')
    axes[0].set_title('VAE latent mean')
    axes[0].legend()
    axes[1].hist(std.detach()[:, 0], bins=30, color='C1', label='latent std')
    axes[1].set_title('VAE latent std')
    axes[1].legend()
else :
    axes[0].scatter(0, 0, s=100, color='red', label='mean target')
    axes[0].scatter(mu.detach()[:, 0], mu.detach()[:, 1], s=40, color='C0')
    axes[0].set_title(f'VAE latent mean (first 2 dimensions)')
    axes[0].legend()
    axes[1].scatter(1, 1, s=100, color='red', label='std target')
    axes[1].scatter(std.detach()[:, 0], std.detach()[:, 1], s=40, color='C1')
    axes[1].set_title(f'VAE latent std (first 2 dimensions)')
    axes[1].legend()

plt.suptitle('VAE latent mean and std variables (1st 2 dimensions)')
plt.tight_layout()
#plt.show()

In [None]:
# 2. Visualize the latent space distribution vs expected standard gaussian
fig, axes = plt.subplots(1, 2, figsize=(20, 10), sharex=True, sharey=True)

if mu.shape[1] == 1 :  # only if latent dimension is 1
    axes[0].hist(eps[:, 0], bins=30, color='red')
    axes[0].set_title(f'standard gaussian latent space')
    axes[0].set_xlim(-3.0, 3.0)

    axes[1].hist(z[:, 0], bins=30, color='blue')
    axes[1].set_title(f'VAE latent')
    axes[1].set_xlim(-3.0, 3.0)
    
else : # latent dimension is 2 or more
    axes[0].scatter(eps[:, 0], eps[:, 1], s=40, color='red')
    axes[0].set_title(f'standard gaussian latent space (first 2 dimensions)')
    axes[0].set_xlim(-3.0, 3.0)
    axes[0].set_ylim(-3.0, 3.0)

    axes[1].scatter(z[:, 0], z[:, 1], s=40)
    axes[1].set_title(f'VAE latent (first 2 dimensions)')
    axes[1].set_xlim(-3.0, 3.0)
    axes[1].set_ylim(-3.0, 3.0)

### Sampling random samples with a trained VAE model

In [None]:
torch.manual_seed(42)
n = 1_000

x = sample_data(n=n, model = dataset_name)

# Generate samples from the trained VAE
y = model.sample(n).detach()

fig, axes = plot_data_comparison(x,y)

fig.suptitle("VAE generation of random samples with std=0", fontsize=24)


In [None]:
# sample with variance using var = 1/lambda_rec
torch.manual_seed(42)
n = 1_000

x = ...

# Generate samples from the trained VAE
std = 1/np.sqrt(lambda_rec)
print(f"Sampling with std = {std:.2f}")
y = ...

fig, axes = plot_data_comparison(x,y)

fig.suptitle("VAE generation of random samples with var=1/lambda_rec", fontsize=24)

## Conclusion and Discussion

In this notebook, we have implemented a simple Variational Auto-Encoder (VAE) using PyTorch. 
The generative model is composed of a decoder that maps a small latent space with a simple prior distribution to the (high-dimensional) data space.
To train such a model, the VAE requires an encoder that implicitly maps the training data to the latent space: the encoder yields the parameters of a Gaussian distribution (mean and variance) that are then used to sample from the latent space ('reparametrization trick'). 
The auto-encoder is learnt using a combination of reconstruction loss and KL divergence to ensure that the latent space follows the desired distribution (here a standard Gaussian).

List the Advantages and Drawbacks of the VAE compared to a simple auto-encoder (AE) and GMM :
- ✅ ...
- ❌ ...


In the next notebooks, we will explore more advanced generative models that address some of these limitations, such as Generative Adversarial Networks (GANs) and Diffusion Models.