# Variational Autoencoders (VAEs)

## Overview

**Variational Autoencoders (VAEs)** are a type of generative model that combines ideas from autoencoders and variational inference. They learn a **probabilistic latent space** that can be sampled to generate new data similar to the training data. VAEs are commonly used for image generation, latent representation learning, and other unsupervised tasks.

VAEs provide a principled way to perform both dimensionality reduction and data generation: rather than mapping each input directly to a single latent code, as a standard autoencoder does, they learn a distribution over the latent variables.

---

## Architecture of VAEs

The basic architecture of a VAE consists of:

1. **Encoder**: Maps input data $ x $ to a probability distribution over the latent variable $ z $. This distribution is typically Gaussian.
2. **Latent Space**: Encodes the data in a low-dimensional, continuous latent variable $ z $, sampled from the distribution learned by the encoder.
3. **Decoder**: Maps latent variable $ z $ back to a distribution over the original data space $ x $.

In contrast to standard autoencoders, VAEs encode the input as a distribution, rather than a single point. They aim to minimize the reconstruction error **and** regularize the latent space using a term based on **Kullback-Leibler (KL) divergence**.

---

## Mathematical Foundations of VAEs

### 1. The Encoder Network

The encoder maps the input $ x $ to a distribution over the latent variable $ z $. The distribution is usually assumed to be a Gaussian with diagonal covariance, so for each input the encoder outputs a mean vector $ \mu(x) $ and a vector of per-dimension standard deviations $ \sigma(x) $:

$$
q(z|x) = \mathcal{N}(z; \mu(x), \sigma(x)^2)
$$

Here:

- $ \mu(x) $ is the mean of the Gaussian.
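- $ \sigma(x) $ is the standard deviation. In practice the network often outputs $ \log(\sigma(x)) $ or $ \log(\sigma(x)^2) $ instead, and $ \sigma(x) $ is recovered by exponentiating, which guarantees a positive value without constraining the network output.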
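
As a rough illustration of the encoder outputs described above, the sketch below uses a hypothetical single-layer linear encoder written in plain Python. The function names (`linear`, `encode`), the weights, and the choice to output the log-variance $ \log \sigma^2 $ are all assumptions made for this example; a real encoder would be a trained neural network.

```python
import math

def linear(x, weights, bias):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def encode(x, w_mu, b_mu, w_logvar, b_logvar):
    """Hypothetical linear encoder head returning (mu, log_var, sigma)."""
    mu = linear(x, w_mu, b_mu)
    # The log-variance is an unconstrained real number per latent dimension.
    log_var = linear(x, w_logvar, b_logvar)
    # sigma = exp(0.5 * log(sigma^2)) is positive for any real log_var.
    sigma = [math.exp(0.5 * lv) for lv in log_var]
    return mu, log_var, sigma

# Toy input with 3 features, mapped to a 2-dimensional latent space.
x = [1.0, -2.0, 0.5]
w_mu = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
b_mu = [0.0, 0.1]
w_logvar = [[-0.5, 0.1, 0.0], [0.2, 0.2, -0.3]]
b_logvar = [0.0, 0.0]

mu, log_var, sigma = encode(x, w_mu, b_mu, w_logvar, b_logvar)
```

Both log-variances are negative for this input, yet the recovered standard deviations are still positive, which is the reason for parameterizing on the log scale.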

### 2. Sampling Latent Variable $ z $

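Sampling $ z $ directly from $ q(z|x) $ is a stochastic operation through which gradients cannot flow back to the encoder parameters. The **reparameterization trick** works around this by expressing the sample as a deterministic function of the encoder outputs and an independent noise variable: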

$$
z = \mu(x) + \sigma(x) \odot \epsilon
$$

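where $ \epsilon $ is sampled from a standard normal distribution $ \mathcal{N}(0, I) $ and $ \odot $ denotes element-wise multiplication. Because the randomness is confined to $ \epsilon $, $ z $ is a differentiable function of $ \mu(x) $ and $ \sigma(x) $, so both can be learned by gradient descent.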
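
A minimal sketch of the reparameterized sampling step, using only the Python standard library. The helper name `reparameterize` and the example values of `mu` and `sigma` are assumptions for illustration.

```python
import random

def reparameterize(mu, sigma, rng):
    """Return z = mu + sigma * eps (element-wise) together with the noise eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]          # eps ~ N(0, I)
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps

rng = random.Random(0)       # fixed seed so the sketch is reproducible
mu = [0.5, -1.0]
sigma = [0.1, 2.0]

z, eps = reparameterize(mu, sigma, rng)

# Averaging many samples should recover mu, since E[eps] = 0.
n = 20000
draws = [reparameterize(mu, sigma, rng)[0] for _ in range(n)]
sample_mean = [sum(d[i] for d in draws) / n for i in range(len(mu))]
```

Given a fixed `eps`, the partial derivatives are $ \partial z_i / \partial \mu_i = 1 $ and $ \partial z_i / \partial \sigma_i = \epsilon_i $, which is what lets an automatic differentiation framework propagate gradients through the sample.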

### 3. The Decoder Network

The decoder reconstructs the data from the latent variable $ z $. It aims to maximize the likelihood of the data $ p(x|z) $, which can also be modeled as a Gaussian distribution:

$$
p(x|z) = \mathcal{N}(x; \hat{x}(z), \sigma^2)
$$

where $ \hat{x}(z) $ is the reconstruction produced by the decoder network and $ \sigma^2 $ is a fixed output variance (a hyperparameter, not the encoder's $ \sigma(x) $).
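
To make the link between this Gaussian likelihood and squared error concrete, the sketch below evaluates $ -\log p(x|z) $ for an isotropic Gaussian with fixed variance. The function name `gaussian_nll` and the default `sigma2=1.0` are assumptions for the example.

```python
import math

def gaussian_nll(x, x_hat, sigma2=1.0):
    """Negative log-likelihood of x under N(x_hat, sigma2 * I)."""
    d = len(x)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    # The first term depends on the reconstruction; the second is a constant.
    return 0.5 * sq_err / sigma2 + 0.5 * d * math.log(2.0 * math.pi * sigma2)

x = [0.2, 0.8, 0.5]
good = [0.25, 0.75, 0.5]   # close reconstruction
bad = [0.9, 0.1, 0.0]      # poor reconstruction

nll_good = gaussian_nll(x, good)
nll_bad = gaussian_nll(x, bad)
```

With `sigma2` held fixed, the two values differ only through the squared-error term, so maximizing the likelihood amounts to minimizing squared reconstruction error.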

### 4. The Loss Function

The loss function for a VAE consists of two terms:

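1. **Reconstruction Loss**: Measures how well the decoder reconstructs the input data $ x $. It is the expected negative log-likelihood under the decoder, which reduces to binary cross-entropy for a Bernoulli decoder (e.g. binary or $[0, 1]$-valued pixels) and to mean squared error, up to scale and an additive constant, for a Gaussian decoder with fixed variance.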
   
   $$
   \mathcal{L}_{\text{reconstruction}} = - \mathbb{E}_{q(z|x)}[\log p(x|z)]
   $$

2. **KL Divergence (Regularization Term)**: Encourages the distribution $ q(z|x) $ to be close to the prior distribution $ p(z) $, which is typically a standard Gaussian $ \mathcal{N}(0, I) $. This keeps the latent space smooth, so that samples drawn from the prior decode to meaningful outputs.

   $$
   \mathcal{L}_{\text{KL}} = D_{\text{KL}}(q(z|x) \parallel p(z))
   $$

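   The KL divergence between the approximate posterior $ q(z|x) = \mathcal{N}(z; \mu(x), \sigma(x)^2) $ and the prior $ p(z) = \mathcal{N}(0, I) $ has a closed form:

   $$
   D_{\text{KL}}(q(z|x) \parallel p(z)) = \frac{1}{2} \sum_{i=1}^{d} \left( \mu_i^2 + \sigma_i^2 - \log(\sigma_i^2) - 1 \right)
   $$

   where $ d $ is the dimensionality of the latent space and $ \mu_i $, $ \sigma_i $ are the $ i $-th components of $ \mu(x) $ and $ \sigma(x) $.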
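
The closed-form KL term can be computed directly from the encoder outputs. The sketch below assumes the encoder provides the log-variance $ \log \sigma_i^2 $ for each latent dimension; the function name is hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

# The posterior equals the prior, so the divergence is zero.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting the mean or changing the variance increases the divergence.
kl_shifted = kl_to_standard_normal([1.0], [0.0])
kl_wide = kl_to_standard_normal([0.0], [math.log(4.0)])
```

The term is zero only when every $ \mu_i = 0 $ and $ \sigma_i = 1 $, and grows as the posterior moves away from the prior.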

Thus, the total loss for the VAE is:

$$
\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{reconstruction}} + \mathcal{L}_{\text{KL}}
$$
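
For context, this total loss is the negative of the evidence lower bound (ELBO) on the log-likelihood of the data. Writing $ p(z|x) $ for the true posterior (a symbol not used elsewhere in this section), the standard decomposition is:

$$
\log p(x) = \underbrace{\mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{\text{KL}}(q(z|x) \parallel p(z))}_{-\mathcal{L}_{\text{VAE}}} + D_{\text{KL}}(q(z|x) \parallel p(z|x))
$$

Since the last term is non-negative, it follows that $ \log p(x) \geq -\mathcal{L}_{\text{VAE}} $, so minimizing $ \mathcal{L}_{\text{VAE}} $ maximizes a lower bound on $ \log p(x) $.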

---

## How VAEs Work in Practice

1. **Training**: The VAE is trained to minimize the total loss, which balances reconstruction accuracy against the KL regularization. The encoder learns to map the input data to a Gaussian distribution in the latent space, while the decoder learns to generate realistic outputs from sampled latent variables.

2. **Generation**: After training, the encoder is no longer needed for generation: we sample from the prior distribution $ p(z) = \mathcal{N}(0, I) $ in the latent space and feed these samples into the decoder.
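
The generation step can be sketched as follows. The decoder here is a hypothetical stand-in (one linear layer followed by a sigmoid, with made-up weights); in practice it would be the trained decoder network.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def decode(z, weights, bias):
    """Hypothetical decoder: linear layer plus sigmoid, outputs in (0, 1)."""
    return [sigmoid(sum(w * zi for w, zi in zip(row, z)) + b)
            for row, b in zip(weights, bias)]

def generate(n, latent_dim, weights, bias, rng):
    """Draw n latent vectors from the prior N(0, I) and decode each one."""
    samples = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        samples.append(decode(z, weights, bias))
    return samples

rng = random.Random(42)
# Made-up decoder parameters: 2 latent dimensions -> 3 output features.
weights = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
bias = [0.0, 0.1, -0.1]

samples = generate(5, 2, weights, bias, rng)
```

Note that the encoder does not appear anywhere in this step; the KL term in the training loss is what makes prior samples land in regions the decoder has learned to handle.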

---

## Common Use Cases of VAEs

1. **Data Generation**: VAEs can generate new data that resembles the training data by sampling from the latent space. For example, they are used for generating images, speech, and other forms of data.

2. **Dimensionality Reduction**: VAEs can be used for unsupervised learning tasks where the goal is to learn a low-dimensional representation of the data. The purpose is similar to that of PCA, but the mapping is nonlinear and has a probabilistic interpretation.

3. **Anomaly Detection**: Since VAEs model the distribution of the data, they can detect anomalies by observing how well the model reconstructs new data points. Poor reconstruction suggests that the data point is unusual or anomalous.

4. **Semi-supervised Learning**: VAEs can be used in semi-supervised learning settings, where only a small portion of the data is labeled. The VAE can help by learning meaningful latent representations from the unlabeled data.
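
The anomaly detection use case above can be sketched with a reconstruction-error threshold. Everything model-specific here is an assumption for illustration: `reconstruct` is a stand-in for a trained encoder and decoder pair (it simply projects 2-D points onto the line $ x_1 = x_2 $), and the threshold rule of mean plus three standard deviations is one common heuristic, not the only option.

```python
import statistics

def reconstruct(x):
    """Stand-in for decode(encode(x)): projects a 2-D point onto the line x1 = x2."""
    m = (x[0] + x[1]) / 2.0
    return [m, m]

def reconstruction_error(x):
    """Mean squared error between x and its reconstruction."""
    x_hat = reconstruct(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

# Errors on data assumed to be normal set the threshold.
normal_data = [[1.0, 1.2], [2.0, 1.9], [-0.5, -0.4], [0.3, 0.0]]
normal_errors = [reconstruction_error(x) for x in normal_data]
threshold = statistics.mean(normal_errors) + 3.0 * statistics.pstdev(normal_errors)

# New points are flagged when they reconstruct worse than the threshold.
new_points = [[1.0, 1.05], [3.0, -3.0]]
flags = [reconstruction_error(x) > threshold for x in new_points]
```

The first new point lies close to the pattern seen in the normal data and is not flagged, while the second lies far from it and reconstructs poorly.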

---

## Summary

Variational Autoencoders (VAEs) are a powerful class of generative models that learn to map data into a probabilistic latent space. They are trained by optimizing a loss function that balances the reconstruction error and the KL divergence between the learned latent distribution and a prior. VAEs are used for tasks like data generation, dimensionality reduction, and anomaly detection.
