# Variational Autoencoder
- A Variational Autoencoder (VAE) is a generative model: it learns to generate new data that resembles the training data.

### Encoder
- Instead of mapping the input to a single point (like a regular autoencoder), the encoder maps an input to a distribution over a latent space.
- For each input, it outputs:
  - A mean vector μ and
  - A log-variance vector log(σ²)


### Latent space sampling
- From the distribution N(μ, σ²) we sample a latent vector z.
- Reparameterization trick
  - To allow backpropagation through the sampling step, we sample like this:
    - z = μ + σ · ε
    - where ε ~ N(0, I) and σ = exp(½ · log σ²)
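
The sampling step above can be sketched in a few lines of plain Python. This is only an illustration: `sample_latent`, the example `mu` and `logvar` values, and the use of the standard-library `random` module are assumptions made here, not part of any particular VAE implementation.

```python
import math
import random


def sample_latent(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), one value per latent dimension."""
    z = []
    for m, lv in zip(mu, logvar):
        sigma = math.exp(0.5 * lv)   # log-variance -> standard deviation
        eps = rng.gauss(0.0, 1.0)    # noise that does not depend on mu or sigma
        z.append(m + sigma * eps)
    return z


rng = random.Random(0)
mu = [1.0, -2.0]                  # hypothetical encoder mean
logvar = [0.0, math.log(0.25)]    # hypothetical log-variances (variances 1.0 and 0.25)

samples = [sample_latent(mu, logvar, rng) for _ in range(20000)]
mean_0 = sum(s[0] for s in samples) / len(samples)
mean_1 = sum(s[1] for s in samples) / len(samples)
var_1 = sum((s[1] - mean_1) ** 2 for s in samples) / len(samples)
print(mean_0, mean_1, var_1)      # should land near 1.0, -2.0 and 0.25
```

Averaged over many draws, the samples recover the mean and variance that the encoder specified, which is a quick sanity check that the formula samples from N(μ, σ²).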


### Decoder
- The decoder takes z and tries to reconstruct the original input.

- It maps z back to the data space (image, text, etc.).

### Objective Function (Loss)
The VAE’s loss has two parts:
- Reconstruction Loss:
  - Measures how well the decoded output matches the input.
  - Typically Mean Squared Error (MSE) or Binary Cross Entropy.

- KL Divergence Loss:
  - A regularization term that pushes the learned distribution (from the encoder) closer to a standard normal distribution N(0, I).
  - This makes sure the latent space is smooth and allows meaningful interpolation.

- Total loss:
  - L = Reconstruction Loss + KL Divergence
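
As a rough sketch of how the two parts combine, the snippet below uses binary cross-entropy for the reconstruction term and the closed-form Gaussian KL term that appears later in these notes. The function names and the example numbers are hypothetical, and real implementations would normally work on tensors rather than Python lists.

```python
import math


def binary_cross_entropy(x, x_hat, tiny=1e-7):
    """Summed BCE between targets x in [0, 1] and predictions x_hat in (0, 1)."""
    total = 0.0
    for target, pred in zip(x, x_hat):
        pred = min(max(pred, tiny), 1.0 - tiny)  # keep log() finite
        total -= target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred)
    return total


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), with logvar = log(sigma^2)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def vae_loss(x, x_hat, mu, logvar):
    """Total loss L = reconstruction loss + KL divergence."""
    return binary_cross_entropy(x, x_hat) + kl_to_standard_normal(mu, logvar)


# Hypothetical values standing in for one input, its reconstruction,
# and the encoder outputs for that input.
x = [1.0, 0.0, 1.0]
x_hat = [0.9, 0.2, 0.7]
mu = [0.5, -0.3]
logvar = [-0.1, 0.2]
loss = vae_loss(x, x_hat, mu, logvar)
print(loss)
```

The KL part is zero exactly when μ = 0 and log σ² = 0, so any deviation of the encoder output from the standard normal adds to the loss.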


### KL Divergence

- KL stands for Kullback–Leibler.
- KL divergence measures how one probability distribution Q(x) differs from a second, **reference distribution** P(x).
- Intuitively: how much information is lost when P(x) is used to describe data that actually follows Q(x)?
- **Q(z|x)** is the learned (approximate) posterior from the encoder.

- The posterior is the probability distribution of an unknown variable (here, the latent variable z) after observing data x.
- It answers: given this data x, what is the likely distribution of the hidden variable z?
  - p(z|x) = posterior
  

- x = your data (like an image)
- z = latent variable (compressed, hidden representation)
- p(z) = prior distribution (often chosen to be a standard normal)
- p(x|z) = likelihood (decoder: how likely x is given z)
- p(z|x) = posterior (how likely z is given the observed x)

## ❌ Why can't we use the real posterior?


- The true posterior p(z|x) is usually intractable because it involves calculating:

  $p(z|x) = \dfrac{p(x|z)\,p(z)}{p(x)}$

  where $p(x) = \int p(x|z)\,p(z)\,dz$ is hard to compute, because the integral over all values of z usually has no closed-form solution.

- So what do we do? We approximate the posterior.
- We introduce a learned function called the encoder, which gives $Q(z|x)$.
- This is a simpler, approximate distribution (like a Gaussian) that tries to be close to the true posterior $p(z|x)$.
- The training goal is to make $Q(z|x)$ as close as possible to $p(z|x)$, as measured by their KL divergence. Because that KL itself contains the intractable $p(x)$, it is minimized indirectly by maximizing the evidence lower bound (ELBO).
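
To make the evidence integral concrete, here is a small sketch on a toy model chosen so that the exact answer is known: z ~ N(0, 1) and x | z ~ N(z, 1), which gives p(x) = N(x; 0, 2). The model, the function names, and the sample count are assumptions made for illustration only. The estimate simply averages p(x|z) over samples from the prior.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def evidence_monte_carlo(x, n_samples, rng):
    """Estimate p(x) = integral of p(x|z) p(z) dz by averaging p(x|z) over z ~ p(z)."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)           # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, z, 1.0)    # p(x|z) = N(x; z, 1)
    return total / n_samples


rng = random.Random(0)
x = 1.0
estimate = evidence_monte_carlo(x, 50000, rng)
exact = normal_pdf(x, 0.0, 2.0)           # known only because the toy model is Gaussian
print(estimate, exact)
```

In one dimension this brute-force average works well. With a neural-network decoder and a high-dimensional z, most prior samples explain x very poorly, so this kind of estimate needs far too many samples to be practical, which is one motivation for learning Q(z|x) instead.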

### Bayesian regression

____

$p(z|x) = \dfrac{p(x|z)\,p(z)}{p(x)}$

- p(x|z): likelihood (decoder)
- p(z): prior
- $p(x) = \int p(x|z)\,p(z)\,dz$: marginal likelihood (evidence)
- The problem is the denominator p(x), which requires integrating over all possible values of z.
- **In Bayesian regression, we can sometimes avoid or simplify this integration** because:
   - We assume conjugate priors → the posterior has a closed form.
   - We use sampling methods like MCMC to approximate it.
   - Or we just compute the MAP (maximum a posteriori) estimate instead of the full p(z|x).
- **So even in Bayesian regression, we often don't explicitly compute the full posterior; we approximate or optimize over it.**
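
The conjugate-prior case can be illustrated with the simplest example, which is not regression itself but shows the same idea: a Gaussian prior on an unknown mean z combined with Gaussian observations of known noise variance. The function name and the numbers below are hypothetical; the update formula is the standard Gaussian-Gaussian conjugate result.

```python
def gaussian_posterior(prior_mean, prior_var, observations, noise_var):
    """Closed-form posterior over z for z ~ N(prior_mean, prior_var), x_i | z ~ N(z, noise_var)."""
    precision = 1.0 / prior_var + len(observations) / noise_var   # precisions add
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + sum(observations) / noise_var)
    return post_mean, post_var


# One observation x = 1.0 under a standard normal prior and unit noise.
post_mean, post_var = gaussian_posterior(0.0, 1.0, [1.0], 1.0)
print(post_mean, post_var)   # 0.5 0.5
```

No integral is evaluated here: conjugacy gives the posterior parameters directly, which is exactly what a neural-network decoder does not allow.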



___

## 🧠 KL Divergence in Variational Autoencoders (VAEs)

**KL Divergence** (Kullback–Leibler Divergence) measures how one probability distribution diverges from a reference probability distribution. In VAEs, it regularizes the latent space by comparing the learned latent distribution $Q(z|x)$ with a prior $P(z)$.

---

### 📐 Definition

The KL divergence from distribution $Q(z)$ (approximate posterior) to $P(z)$ (prior) is defined as:

$$
\text{KL}(Q || P) = \int Q(z) \log \left( \frac{Q(z)}{P(z)} \right) \, dz
$$

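This measures how much information is lost when $P$ is used to describe samples that actually come from $Q$. KL divergence is always non-negative, equals zero only when $Q = P$, and is not symmetric: in general $\text{KL}(Q || P) \neq \text{KL}(P || Q)$.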
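
For discrete distributions the integral becomes a sum, which makes these properties easy to check numerically. The helper below is a minimal sketch; the two example distributions are arbitrary, and it assumes `p` is positive wherever `q` is.

```python
import math


def kl_divergence(q, p):
    """KL(Q || P) = sum_i q_i * log(q_i / p_i) for discrete distributions, in nats."""
    return sum(q_i * math.log(q_i / p_i) for q_i, p_i in zip(q, p) if q_i > 0.0)


q = [0.5, 0.4, 0.1]
p = [1 / 3, 1 / 3, 1 / 3]

kl_qp = kl_divergence(q, p)
kl_pq = kl_divergence(p, q)
kl_qq = kl_divergence(q, q)
print(kl_qp)   # non-negative
print(kl_pq)   # a different value: KL is not symmetric
print(kl_qq)   # zero when the two distributions match
```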

---

### 🧮 KL Divergence Between Two Gaussians

In VAEs:

- The encoder outputs a diagonal Gaussian: $Q(z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$
- The prior is a standard Gaussian: $P(z) = \mathcal{N}(0, I)$

The KL divergence between these two distributions has a closed-form solution:

$$
\text{KL}\left( \mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \, || \, \mathcal{N}(0, I) \right) = \frac{1}{2} \sum_{i=1}^{d} \left( \mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1 \right)
$$

Here:
- $\mu_i$ is the mean of the $i^{\text{th}}$ latent dimension  
- $\sigma_i^2$ is the variance of the $i^{\text{th}}$ latent dimension  
- $d$ is the dimensionality of the latent space
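
One way to gain confidence in the closed form is to compare it with a direct numerical evaluation of the KL integral in one dimension. The sketch below does that with a simple midpoint rule; the function names, integration range, and test values are choices made here for illustration.

```python
import math


def kl_closed_form(mu, logvar):
    """0.5 * sum(mu_i^2 + sigma_i^2 - log(sigma_i^2) - 1), with logvar = log(sigma^2)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def kl_numeric_1d(mu, sigma, lo=-12.0, hi=12.0, steps=20000):
    """Midpoint-rule approximation of the integral of q(z) * log(q(z) / p(z)) in 1-D."""
    dz = (hi - lo) / steps
    log_norm = 0.5 * math.log(2.0 * math.pi)
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * dz
        log_q = -0.5 * ((z - mu) / sigma) ** 2 - math.log(sigma) - log_norm  # N(mu, sigma^2)
        log_p = -0.5 * z * z - log_norm                                      # N(0, 1)
        total += math.exp(log_q) * (log_q - log_p) * dz
    return total


mu, sigma = 0.7, 0.5
closed = kl_closed_form([mu], [math.log(sigma ** 2)])
numeric = kl_numeric_1d(mu, sigma)
print(closed, numeric)   # the two values should agree to several decimal places
```

Because the latent dimensions are independent under both distributions, the d-dimensional KL is just the sum of such one-dimensional terms, which is where the sum over i comes from.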

---

### 🎯 Why This Matters in VAEs

- Encourages the encoder to produce a **latent distribution** close to a **standard normal**: $z \sim \mathcal{N}(0, I)$  
- Ensures the latent space is **continuous and smooth**  
- Allows us to **sample** and **interpolate** meaningfully in the latent space  
- Regularizes the encoder, which discourages it from memorizing individual training points
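
Sampling from the prior and interpolating between two latent vectors can be sketched as follows. The helper names are hypothetical and no decoder is included; in a trained VAE each point on the path would be passed through the decoder to produce an output.

```python
import random


def sample_prior(dim, rng):
    """Draw z ~ N(0, I) with the given number of latent dimensions."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def interpolate(z_start, z_end, steps):
    """Return `steps` evenly spaced points on the straight line from z_start to z_end."""
    path = []
    for k in range(steps):
        t = k / (steps - 1)   # goes from 0.0 to 1.0
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


rng = random.Random(0)
z_a = sample_prior(4, rng)
z_b = sample_prior(4, rng)
path = interpolate(z_a, z_b, 5)
for z in path:
    print(z)
```

Linear interpolation is the simplest choice and is used here as an assumption; the smoother the latent space, the more gradually the decoded outputs are expected to change along such a path.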

---

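✅ In practice, this KL term is combined with a reconstruction term to form the VAE objective, the evidence lower bound (ELBO), which training maximizes:

$$
\mathcal{L}_{\text{VAE}} = \mathbb{E}_{Q(z|x)} \left[ \log P(x|z) \right] - \text{KL}(Q(z|x) || P(z))
$$

Equivalently, the loss that is minimized is $-\mathcal{L}_{\text{VAE}}$: the reconstruction loss $-\mathbb{E}_{Q(z|x)}[\log P(x|z)]$ plus the KL term. This balances reconstruction accuracy and latent space regularization.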
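
The following decomposition is a standard identity, written here in the notation of these notes, and may help show why this objective is a sensible stand-in for the intractable posterior matching. It starts from Bayes' rule, $P(x) = P(x|z)\,P(z) / P(z|x)$, which holds for every $z$, and takes the expectation under $Q(z|x)$:

$$
\begin{aligned}
\log P(x) &= \mathbb{E}_{Q(z|x)} \left[ \log \frac{P(x|z)\,P(z)}{P(z|x)} \right] \\
&= \mathbb{E}_{Q(z|x)} \left[ \log P(x|z) \right] - \text{KL}(Q(z|x) || P(z)) + \text{KL}(Q(z|x) || P(z|x))
\end{aligned}
$$

The last term is non-negative, so the first two terms together are a lower bound on $\log P(x)$, commonly called the evidence lower bound (ELBO). Raising this bound for a fixed $x$ either increases $\log P(x)$ or shrinks the gap between $Q(z|x)$ and the true posterior $P(z|x)$, without ever computing $P(x)$ directly.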


___

## ✅ Overview of VAE Architecture

A VAE consists of:
- **Encoder**: Maps input $x$ to latent distribution parameters $\mu$ and $\log\sigma^2$
- **Reparameterization trick**: Samples latent variable $z$ from the latent distribution using $z = \mu + \sigma \cdot \epsilon$
- **Decoder**: Reconstructs $x$ from $z$
- **Loss**: Combines reconstruction loss + KL divergence



- **Sampling the latent variable** $z$ just means:
  - Randomly generating a point from that Gaussian distribution, using:
  - $z = \mu + \sigma \cdot \epsilon$
  - where $\epsilon \sim \mathcal{N}(0, I)$ is random noise from a standard normal distribution (mean 0, std 1).


## 🤯 Why Reparameterize?

* We need to backpropagate gradients during training.
* But a sampling operation (e.g., `torch.randn()`) has no gradient with respect to the parameters of the distribution, so drawing $z$ directly cuts the gradient path back to the encoder.
* To fix that, we use the reparameterization trick:
   - Instead of directly sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ (not differentiable),
   - We sample $\epsilon \sim \mathcal{N}(0, 1)$, whose distribution is fixed and has no learnable parameters,
   - And compute: $z = \mu + \sigma \cdot \epsilon$

✅ This formula is differentiable w.r.t. $\mu$ and $\sigma$.





## 🔍 Problem: Why Not Just Sample Normally?

- In a standard autoencoder, you encode input $x$ to a latent vector $z$ (just a point).
- In a VAE, you want to model uncertainty, so instead of mapping $x \rightarrow z$, you map $x \rightarrow \mu, \log\sigma^2$ and assume:
   - $z \sim \mathcal{N}(\mu, \sigma^2)$
- That is, $z$ is drawn from a Gaussian distribution for each input.


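- ❗ But here's the issue: sampling a random variable breaks backpropagation, because the sample is not a differentiable function of the distribution's parameters.
- Training needs the gradient of the expected reconstruction term with respect to the encoder parameters $\theta$ (here $q_\theta$ is the encoder $Q$ and $p_\phi$ is the decoder):

  $\nabla_\theta \, \mathbb{E}_{z \sim q_\theta(z|x)} \left[ \log p_\phi(x|z) \right]$

- But we can't differentiate through the stochastic node where $z$ is sampled.
- 💡 Solution: Reparameterization Trick
  - We make $z$ a deterministic function of the network outputs by moving the randomness outside the network.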

Instead of:

$$
z \sim \mathcal{N}(\mu, \sigma^2)
$$

we rewrite it as:

$$
z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)
$$

This works because:

- $\mu$ and $\sigma$ are outputs from the network (dependent on input $x$)

- $\epsilon$ is just noise (independent of the network)

- Now, $z$ is a deterministic function of $x$ and $\epsilon$

✅ This form is differentiable with respect to $\mu$ and $\sigma$, so we can use backpropagation!
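
A small numerical check can make this concrete. The sketch below uses a toy objective, $f(z) = z^2$, chosen only because its expectation under $\mathcal{N}(\mu, \sigma^2)$ is $\mu^2 + \sigma^2$, so the exact gradients $2\mu$ and $2\sigma$ are known. The function name and values are hypothetical, and the chain rule is written out by hand instead of relying on an autograd library.

```python
import random


def reparam_gradients(mu, sigma, n_samples, rng):
    """Estimate d/dmu and d/dsigma of E[z^2], with z = mu + sigma * eps, eps ~ N(0, 1)."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # randomness lives outside the parameters
        z = mu + sigma * eps        # deterministic in (mu, sigma) once eps is drawn
        df_dz = 2.0 * z             # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples


rng = random.Random(0)
mu, sigma = 1.5, 0.8
g_mu, g_sigma = reparam_gradients(mu, sigma, 100000, rng)
print(g_mu, 2 * mu)         # estimate vs exact gradient 2 * mu
print(g_sigma, 2 * sigma)   # estimate vs exact gradient 2 * sigma
```

An autograd framework performs the same chain-rule steps automatically once $z$ is written as $\mu + \sigma \cdot \epsilon$, which is why the trick lets ordinary backpropagation reach the encoder.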
