# Variational Inference (VI)

Variational Inference (VI) is a family of optimization-based methods for **approximating complex probability distributions**, most commonly used in **Bayesian inference** when exact posterior computation is intractable.

Instead of sampling from the posterior (as in MCMC), VI **turns inference into an optimization problem**.

---

## 1. The Core Problem in Bayesian Inference

In Bayesian modeling, we want the posterior distribution:

\[
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}
\]

where:
- \( \theta \) = latent variables / parameters  
- \( x \) = observed data  
- \( p(x) = \int p(x \mid \theta)p(\theta)\, d\theta \) (marginal likelihood)

### The bottleneck
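- The denominator \( p(x) \) is often **intractable**: outside conjugate models the integral rarely has a closed form
- Numerical integration scales exponentially with the dimension of \( \theta \), so exact inference is impractical for most models of interest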
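
To make the bottleneck concrete, here is a minimal sketch using a Beta-Bernoulli model, chosen only because its evidence happens to have a closed form to compare against. The function names and the grid size are illustrative assumptions, not part of any library.

```python
import math


def log_beta(u, v):
    """Log of the Beta function B(u, v)."""
    return math.lgamma(u) + math.lgamma(v) - math.lgamma(u + v)


def evidence_exact(k, n, a=1.0, b=1.0):
    """Closed-form p(x) for one specific sequence with k successes in n trials."""
    return math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))


def evidence_grid(k, n, a=1.0, b=1.0, grid=2000):
    """Midpoint-rule estimate of the same integral over theta in (0, 1)."""
    h = 1.0 / grid
    total = 0.0
    for j in range(grid):
        theta = (j + 0.5) * h
        log_lik = k * math.log(theta) + (n - k) * math.log(1.0 - theta)
        log_prior = ((a - 1) * math.log(theta)
                     + (b - 1) * math.log(1.0 - theta)
                     - log_beta(a, b))
        total += math.exp(log_lik + log_prior) * h
    return total


exact = evidence_exact(k=7, n=10)
approx = evidence_grid(k=7, n=10)
print(exact, approx)

# The same grid resolution in d dimensions needs grid**d evaluations.
for d in (1, 2, 5, 10):
    print(d, 2000 ** d)
```

In one dimension the grid is cheap and accurate. The second loop only prints how many evaluations the same resolution would require as the number of parameters grows, which is the reason grid-based integration is not an option for realistic models.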

---

## 2. Key Idea of Variational Inference

Instead of computing the true posterior \( p(\theta \mid x) \), VI:

1. Chooses a **simpler family of distributions** \( q(\theta) \)
2. Finds the member of this family **closest** to the true posterior

Closeness is measured using **KL divergence**:

\[
\text{KL}(q(\theta) \parallel p(\theta \mid x))
\]
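
As a small illustration of this measure, the sketch below evaluates the KL divergence between two univariate Gaussians, once with the standard closed-form expression and once by Monte Carlo sampling from \( q \). The Gaussian choice, the helper names, and the sample size are assumptions made for the example.

```python
import math
import random


def kl_gauss(m_q, s_q, m_p, s_p):
    """Closed-form KL(N(m_q, s_q^2) || N(m_p, s_p^2)); s_* are standard deviations."""
    return math.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2 * s_p ** 2) - 0.5


def log_normal_pdf(t, m, s):
    """Log density of N(m, s^2) at t."""
    return -0.5 * math.log(2 * math.pi * s ** 2) - (t - m) ** 2 / (2 * s ** 2)


def kl_monte_carlo(m_q, s_q, m_p, s_p, n=100_000, seed=0):
    """Estimate KL(q || p) as the average of log q - log p over samples from q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        t = rng.gauss(m_q, s_q)
        total += log_normal_pdf(t, m_q, s_q) - log_normal_pdf(t, m_p, s_p)
    return total / n


forward = kl_gauss(0.0, 1.0, 1.0, 2.0)   # KL(q || p)
reverse = kl_gauss(1.0, 2.0, 0.0, 1.0)   # KL(p || q)
estimate = kl_monte_carlo(0.0, 1.0, 1.0, 2.0)
print(forward, reverse, estimate)
```

The two directions give different values (about 0.44 and 1.31 here), a reminder that KL divergence is not symmetric. VI uses the direction with \( q \) as the first argument, so the expectation is taken under the distribution we can sample from.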

---

## 3. Optimization Objective: ELBO

Directly minimizing this KL divergence is not possible, because it involves the posterior \( p(\theta \mid x) \), whose normalizing constant \( p(x) \) is intractable.  
VI reformulates the problem using the **Evidence Lower Bound (ELBO)**.

### ELBO definition

\[
\log p(x) \ge
\mathbb{E}_{q(\theta)}[\log p(x,\theta)]
-
\mathbb{E}_{q(\theta)}[\log q(\theta)]
\]

This is equivalent to:

\[
\text{ELBO} =
\mathbb{E}_{q(\theta)}[\log p(x \mid \theta)]
-
\text{KL}(q(\theta)\,\|\,p(\theta))
\]
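
Both the bound and the second form follow from a short derivation that uses only the quantities already defined. Expanding the KL divergence to the posterior and applying \( p(\theta \mid x) = p(x,\theta) / p(x) \) gives:

\[
\begin{aligned}
\text{KL}(q(\theta) \parallel p(\theta \mid x))
&= \mathbb{E}_{q(\theta)}[\log q(\theta)] - \mathbb{E}_{q(\theta)}[\log p(\theta \mid x)] \\
&= \mathbb{E}_{q(\theta)}[\log q(\theta)] - \mathbb{E}_{q(\theta)}[\log p(x,\theta)] + \log p(x)
\end{aligned}
\]

Rearranging:

\[
\log p(x) =
\underbrace{\mathbb{E}_{q(\theta)}[\log p(x,\theta)] - \mathbb{E}_{q(\theta)}[\log q(\theta)]}_{\text{ELBO}}
+ \text{KL}(q(\theta) \parallel p(\theta \mid x))
\]

Since the KL divergence is non-negative, the ELBO is a lower bound on \( \log p(x) \). Substituting \( \log p(x,\theta) = \log p(x \mid \theta) + \log p(\theta) \) into the ELBO and grouping the \( \log p(\theta) \) and \( \log q(\theta) \) terms yields the second form:

\[
\mathbb{E}_{q(\theta)}[\log p(x \mid \theta)]
- \left( \mathbb{E}_{q(\theta)}[\log q(\theta)] - \mathbb{E}_{q(\theta)}[\log p(\theta)] \right)
=
\mathbb{E}_{q(\theta)}[\log p(x \mid \theta)]
- \text{KL}(q(\theta)\,\|\,p(\theta))
\]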

### Interpretation
- **First term**: data fit (expected log-likelihood)
- **Second term**: regularization toward the prior

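Since \( \log p(x) \) does not depend on \( q \) and equals the ELBO plus \( \text{KL}(q(\theta) \parallel p(\theta \mid x)) \), maximizing the ELBO ⇔ minimizing that KL divergence.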
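
The identity behind this equivalence, \( \log p(x) = \text{ELBO} + \text{KL}(q(\theta) \parallel p(\theta \mid x)) \), can be checked numerically on a model where every term is available in closed form. The sketch below assumes a toy model that is not in the text above: \( x_i \sim \mathcal{N}(\theta, 1) \) with prior \( \theta \sim \mathcal{N}(0, \sigma_0^2) \) and a Gaussian \( q \). All function names are illustrative.

```python
import math


def kl_normal(m_q, var_q, m_p, var_p):
    """KL(N(m_q, var_q) || N(m_p, var_p)), parameterized by variances."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (m_q - m_p) ** 2) / var_p - 1.0)


def elbo(x, m, var, prior_var):
    """Expected log-likelihood under q = N(m, var) minus KL(q || prior)."""
    exp_log_lik = sum(-0.5 * math.log(2 * math.pi) - 0.5 * ((xi - m) ** 2 + var)
                      for xi in x)
    return exp_log_lik - kl_normal(m, var, 0.0, prior_var)


def log_evidence(x, prior_var):
    """log p(x): x is jointly Gaussian with covariance I + prior_var * 11^T."""
    n, sx, sxx = len(x), sum(x), sum(xi * xi for xi in x)
    quad = sxx - prior_var * sx ** 2 / (1.0 + n * prior_var)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * math.log(1.0 + n * prior_var)
            - 0.5 * quad)


def posterior(x, prior_var):
    """Exact posterior mean and variance of theta."""
    precision = len(x) + 1.0 / prior_var
    return sum(x) / precision, 1.0 / precision


x = [0.3, 1.2, 0.8, 1.9]
prior_var = 4.0
post_m, post_var = posterior(x, prior_var)
target = log_evidence(x, prior_var)

for m, var in [(0.0, 1.0), (1.0, 0.5), (post_m, post_var)]:
    bound = elbo(x, m, var, prior_var)
    gap = kl_normal(m, var, post_m, post_var)
    print(f"ELBO={bound:.6f}  KL={gap:.6f}  sum={bound + gap:.6f}  log p(x)={target:.6f}")
```

For every choice of \( q \) the sum of the two terms matches \( \log p(x) \), and the gap closes only when \( q \) equals the exact posterior.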

---

## 4. Variational Family Choices

The approximation quality depends heavily on the choice of \( q(\theta) \).

### Mean-Field Variational Inference
Assumes full factorization:

\[
q(\theta) = \prod_{i} q_i(\theta_i)
\]

**Pros**
- Simple
- Fast
- Scales well

**Cons**
- Ignores posterior correlations
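- Tends to underestimate posterior variance, a known side effect of minimizing \( \text{KL}(q \parallel p) \) with a factorized \( q \)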
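
The two cons can be seen in a small example. For a Gaussian target with precision matrix \( \Lambda \), a standard textbook result is that the optimal fully factorized Gaussian under \( \text{KL}(q \parallel p) \) has variances \( 1 / \Lambda_{ii} \). The sketch below assumes a two-dimensional target with unit marginal variances and correlation `rho`; the helper names are illustrative.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mean_field_variances(cov):
    """Variances of the optimal mean-field Gaussian for a N(0, cov) target."""
    precision = inverse_2x2(cov)
    return [1.0 / precision[0][0], 1.0 / precision[1][1]]


for rho in (0.0, 0.5, 0.9, 0.99):
    cov = [[1.0, rho], [rho, 1.0]]
    true_var = cov[0][0]
    mf_var = mean_field_variances(cov)[0]
    print(f"rho={rho}: true marginal variance={true_var}, mean-field variance={mf_var:.4f}")
```

In this setup the mean-field variance equals \( 1 - \rho^2 \), so the approximation matches the true marginal only when the components are uncorrelated and shrinks toward zero as the correlation grows.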

---

## 5. Coordinate Ascent Variational Inference (CAVI)

In conditionally conjugate models, the optimal update for each factor has a closed form, so the ELBO can be maximized by coordinate ascent.

For each latent variable \( \theta_i \):

\[
\log q_i^*(\theta_i)
=
\mathbb{E}_{-i}[\log p(x,\theta)]
+ \text{const}
\]

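Here \( \mathbb{E}_{-i} \) denotes the expectation over all factors \( q_j \) with \( j \neq i \). Update each factor in turn, holding the others fixed, until the ELBO converges. No update can decrease the ELBO, but the result is in general only a local optimum.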
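
A minimal sketch of these coordinate updates, assuming a textbook conjugate model that is not spelled out above: \( x_i \sim \mathcal{N}(\mu, \tau^{-1}) \), \( \mu \mid \tau \sim \mathcal{N}(\mu_0, (\lambda_0 \tau)^{-1}) \), \( \tau \sim \text{Gamma}(a_0, b_0) \) in shape/rate form, with the factorization \( q(\mu, \tau) = q(\mu)\, q(\tau) \). The function name and the default hyperparameters are illustrative choices.

```python
import random


def cavi_gaussian_mean_precision(x, mu0=0.0, lam0=1e-3, a0=1e-3, b0=1e-3, iters=50):
    """CAVI with q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, b_n)."""
    n = len(x)
    xbar = sum(x) / n
    # These two quantities do not depend on the other factor, so they are fixed.
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    a_n = a0 + (n + 1) / 2.0
    e_tau = 1.0  # initial guess for E_q[tau]
    for _ in range(iters):
        # Update q(mu) given the current E_q[tau].
        lam_n = (lam0 + n) * e_tau
        # Update q(tau) given q(mu): E[(x_i - mu)^2] = (x_i - mu_n)^2 + 1/lam_n.
        e_data = sum((xi - mu_n) ** 2 for xi in x) + n / lam_n
        e_prior = lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n)
        b_n = b0 + 0.5 * (e_data + e_prior)
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n


rng = random.Random(0)
data = [rng.gauss(2.0, 0.5) for _ in range(500)]
mu_n, lam_n, a_n, b_n = cavi_gaussian_mean_precision(data)
print("E[mu] =", mu_n, " E[tau] =", a_n / b_n)
```

With the weak priors used here, the variational mean lands near the sample mean and \( \mathbb{E}_q[\tau] \) near the inverse sample variance. A fixed iteration count stands in for a proper convergence check on the ELBO, which a real implementation would normally monitor.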

---

## 6. Stochastic Variational Inference (SVI)

For large datasets, full-batch VI becomes impractical because every update requires a pass over all the data.

### SVI approach
- Use **mini-batches**
- Optimize ELBO using **stochastic gradients**
- Enables VI on massive datasets

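Key ingredients:
- Natural gradients (the classical choice for conditionally conjugate models)
- Reparameterization trick (for gradient-based VI in non-conjugate models)
- Decreasing learning rate schedules (for example, ones satisfying the Robbins-Monro conditions)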
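
As a rough sketch of how mini-batches, natural gradients, and a decaying step size fit together, the example below assumes a toy conjugate model: \( x_i \sim \mathcal{N}(\mu, 1) \) with prior \( \mu \sim \mathcal{N}(0, \sigma_0^2) \) and a Gaussian \( q(\mu) \) stored in natural parameters. In that setting a natural-gradient step reduces to a weighted average of the current parameters and an estimate computed as if the mini-batch were the whole dataset. The function name, batch size, and step-size exponent `kappa` are assumptions for illustration.

```python
import random


def svi_gaussian_mean(x, prior_var=100.0, batch_size=50, steps=2000, kappa=0.7, seed=0):
    """SVI for the mean of a unit-variance Gaussian; returns mean and variance of q."""
    rng = random.Random(seed)
    n = len(x)
    # Natural parameters of q: eta = mean * precision, and the precision itself.
    eta, prec = 0.0, 1.0 / prior_var  # start at the prior
    for t in range(steps):
        batch = rng.sample(x, batch_size)
        # Noisy estimate of the optimal parameters: rescale the batch to size n.
        eta_hat = n * sum(batch) / batch_size
        prec_hat = 1.0 / prior_var + n
        # Decaying step size; kappa in (0.5, 1] satisfies Robbins-Monro.
        rho = (t + 1.0) ** (-kappa)
        eta = (1.0 - rho) * eta + rho * eta_hat
        prec = (1.0 - rho) * prec + rho * prec_hat
    return eta / prec, 1.0 / prec


rng = random.Random(1)
data = [rng.gauss(3.0, 1.0) for _ in range(10_000)]
svi_mean, svi_var = svi_gaussian_mean(data)

# Exact posterior for comparison (available because the model is conjugate).
exact_prec = len(data) + 1.0 / 100.0
exact_mean, exact_var = sum(data) / exact_prec, 1.0 / exact_prec
print(svi_mean, exact_mean, svi_var, exact_var)
```

Each step touches only 50 of the 10,000 observations, yet the estimate settles close to the exact posterior mean as the step size decays. Non-conjugate models lose this closed-form update and usually fall back on stochastic gradients of the ELBO instead.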

---

## 7. Reparameterization Trick

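Used to obtain gradient estimates of the ELBO with respect to the variational parameters \( \phi \), typically with much lower variance than score-function (REINFORCE) estimators.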

Instead of sampling:
\[
\theta \sim q_\phi(\theta)
\]

Rewrite as:
\[
\theta = g(\epsilon, \phi), \quad \epsilon \sim p(\epsilon)
\]

Example (Gaussian):
\[
\theta = \mu + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0,1)
\]

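Because the randomness now enters only through \( \epsilon \), which does not depend on \( \phi \), gradients can be backpropagated through the sample \( \theta \).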
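
To illustrate the variance difference, the sketch below estimates \( \frac{d}{d\mu} \mathbb{E}_{\theta \sim \mathcal{N}(\mu, \sigma^2)}[\theta^2] = 2\mu \) in two ways: with the reparameterized (pathwise) estimator and with the score-function estimator. The integrand \( \theta^2 \) is an assumed stand-in for the ELBO integrand, and the function name is illustrative.

```python
import random
import statistics


def gradient_samples(mu, sigma, n=50_000, seed=0):
    """Per-sample gradient estimates of d/dmu E[theta^2] under N(mu, sigma^2)."""
    rng = random.Random(seed)
    reparam, score = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        theta = mu + sigma * eps  # theta = g(eps, phi)
        # Pathwise: differentiate f(mu + sigma * eps) with respect to mu.
        reparam.append(2.0 * theta)
        # Score function: f(theta) * d/dmu log q(theta).
        score.append(theta ** 2 * (theta - mu) / sigma ** 2)
    return reparam, score


reparam, score = gradient_samples(mu=1.0, sigma=1.0)
print("reparam: mean", statistics.fmean(reparam), "variance", statistics.variance(reparam))
print("score:   mean", statistics.fmean(score), "variance", statistics.variance(score))
```

Both estimators are unbiased and average to roughly 2, but for this toy integrand the per-sample variance works out to 4 for the reparameterized estimator versus 30 for the score-function estimator. The size of the gap depends on the integrand, so this is an illustration rather than a general guarantee.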

---

## 8. Variational Inference vs MCMC

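| Aspect | Variational Inference | MCMC |
|-----|----------------------|------|
| Speed | Typically fast | Typically slower |
| Scalability | Good (works with mini-batches) | Harder to scale to large datasets |
| Bias | Biased (limited by the variational family) | Asymptotically exact |
| Uncertainty | Often underestimated | Accurate once chains have converged |
| Parallelization | Straightforward | Harder within a chain; separate chains run in parallel |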

**Rule of thumb**
- Use **VI** when speed and scale matter
- Use **MCMC** when accuracy and uncertainty are critical

---

## 9. Common Failure Modes of VI

1. **Posterior collapse** (over-regularization)
2. **Variance underestimation**
3. **Bad variational family choice**
4. **Local optima**
5. **Poor initialization**

VI is not “wrong”; it is **biased by design**, because the optimum is only the best member of the chosen variational family.

---

## 10. Practical Applications

- Topic models (LDA)
- Bayesian neural networks
- Variational Autoencoders (VAE)
- Probabilistic graphical models
- Online Bayesian learning

---

## 11. When Should You Use VI?

Use Variational Inference when:
- Data is large
- Latent space is high-dimensional
- Real-time or near-real-time inference is required
- Approximate uncertainty is acceptable

Avoid VI when:
- Posterior correlations are critical and the variational family cannot represent them
- Exact uncertainty quantification is required
- The dataset is small enough that MCMC is feasible

---

## 12. Summary

- Variational Inference reframes Bayesian inference as optimization
- ELBO is the central objective
- Speed and scalability come at the cost of bias
- VI is a **tool**, not a replacement for exact inference

Understanding its assumptions is more important than memorizing formulas.