# Probability and Statistics for Generative AI

## Table of Contents
1. [Foundations of Probability Theory](#1-foundations-of-probability-theory)
2. [Random Variables and Distributions](#2-random-variables-and-distributions)
3. [Key Probability Distributions in Gen-AI](#3-key-probability-distributions-in-gen-ai)
4. [Expectation, Moments, and Covariance](#4-expectation-moments-and-covariance)
5. [Information Theory Foundations](#5-information-theory-foundations)
6. [Parameter Estimation Methods](#6-parameter-estimation-methods)
7. [Latent Variable Models](#7-latent-variable-models)
8. [Sampling Methods for Generative Models](#8-sampling-methods-for-generative-models)
9. [Probabilistic Framework of Gen-AI Architectures](#9-probabilistic-framework-of-gen-ai-architectures)
10. [Statistical Learning Theory](#10-statistical-learning-theory)

---

## 1. Foundations of Probability Theory

### 1.1 Definition of Probability

**Definition:** Probability is a mathematical framework for quantifying uncertainty, assigning a numerical value between 0 and 1 to events, representing the likelihood of their occurrence.

In Generative AI, probability provides the theoretical foundation for:
- Modeling data distributions
- Generating new samples
- Quantifying uncertainty in predictions
- Training models via likelihood maximization

### 1.2 Probability Axioms (Kolmogorov Axioms)

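For a sample space $\Omega$, a $\sigma$-algebra of events $\mathcal{F}$ on $\Omega$, and any event $A \in \mathcal{F}$, a probability measure is a function: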

$$P: \mathcal{F} \rightarrow [0,1]$$

**Axiom 1 (Non-negativity):**
$$P(A) \geq 0 \quad \forall A \in \mathcal{F}$$

**Axiom 2 (Normalization):**
$$P(\Omega) = 1$$

**Axiom 3 (Countable Additivity):**
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \quad \text{for mutually exclusive } A_i$$

### 1.3 Conditional Probability

**Definition:** The probability of event $A$ occurring given that event $B$ has occurred.

$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

**Relevance to Gen-AI:**
- Autoregressive models (GPT, LLaMA) model $P(x_t | x_1, x_2, ..., x_{t-1})$
- Conditional generation: $P(\text{output}|\text{prompt})$

### 1.4 Chain Rule of Probability

For a sequence of random variables $X_1, X_2, ..., X_n$:

$$P(X_1, X_2, ..., X_n) = P(X_1) \prod_{i=2}^{n} P(X_i | X_1, ..., X_{i-1})$$

**Application in Autoregressive Language Models:**
$$P(w_1, w_2, ..., w_T) = \prod_{t=1}^{T} P(w_t | w_1, ..., w_{t-1})$$
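As a minimal sketch of this factorization, the snippet below scores a short sentence by summing conditional log-probabilities. The table `COND_PROB`, its values, and the `<s>` start token are hypothetical, and the model is a bigram simplification that conditions only on the previous token rather than the full history $w_1, ..., w_{t-1}$.

```python
import math

# Hypothetical conditional tables for a toy bigram model (illustrative values only).
# A bigram model truncates the full history w_1..w_{t-1} to the previous token.
START = "<s>"
COND_PROB = {
    START: {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sleeps": 1.0},
    "dog": {"sleeps": 1.0},
}


def sequence_log_prob(tokens):
    """Sum log P(w_t | w_{t-1}) over the sequence (chain rule in log space)."""
    log_prob = 0.0
    prev = START
    for tok in tokens:
        log_prob += math.log(COND_PROB[prev][tok])
        prev = tok
    return log_prob


log_p = sequence_log_prob(["the", "cat", "sleeps"])
prob = math.exp(log_p)  # product of the three conditionals: 0.6 * 0.5 * 1.0
```

Working in log space avoids numerical underflow when the product runs over long sequences.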

### 1.5 Bayes' Theorem

**Definition:** Relates conditional probabilities between events.

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

**General Form:**
$$P(\theta | \mathcal{D}) = \frac{P(\mathcal{D} | \theta) \cdot P(\theta)}{P(\mathcal{D})}$$

| Component | Term | Role in Gen-AI |
|-----------|------|----------------|
| $P(\theta \mid \mathcal{D})$ | Posterior | Updated belief after seeing data |
| $P(\mathcal{D} \mid \theta)$ | Likelihood | How well model explains data |
| $P(\theta)$ | Prior | Initial belief about parameters |
| $P(\mathcal{D})$ | Evidence | Normalization constant |
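A small worked example may make the roles of prior, likelihood, and evidence concrete. The two coin hypotheses and their numbers below are invented for illustration; the function applies Bayes' theorem over a discrete set of hypotheses.

```python
def posterior(prior, likelihood):
    """Return P(theta | D) over discrete hypotheses via Bayes' theorem."""
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(D), the normalization constant
    return {h: v / evidence for h, v in unnormalized.items()}


# Hypothetical setup: a coin is either fair (p = 0.5) or biased (p = 0.9),
# and the observed data D is three heads in a row.
prior = {"fair": 0.5, "biased": 0.5}
likelihood = {"fair": 0.5 ** 3, "biased": 0.9 ** 3}
post = posterior(prior, likelihood)
```

Here the evidence is just the sum of the unnormalized posterior weights, which is why it drops out of any comparison between hypotheses.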

### 1.6 Independence and Conditional Independence

**Statistical Independence:**
$$P(A \cap B) = P(A) \cdot P(B)$$

**Conditional Independence:**
$$P(A, B | C) = P(A|C) \cdot P(B|C)$$

**Notation:** $A \perp B | C$

**Application:** Markov assumption in the forward (noising) process of diffusion models, where each step depends only on the previous state:
$$q(x_t | x_{t-1}, \ldots, x_0) = q(x_t | x_{t-1})$$

---

## 2. Random Variables and Distributions

### 2.1 Random Variable Definition

**Definition:** A random variable $X$ is a measurable function from the sample space $\Omega$ to the real numbers $\mathbb{R}$.

$$X: \Omega \rightarrow \mathbb{R}$$

### 2.2 Discrete Random Variables

**Probability Mass Function (PMF):**
$$P_X(x) = P(X = x)$$

**Properties:**
- $P_X(x) \geq 0 \quad \forall x$
- $\sum_x P_X(x) = 1$

**Application in Gen-AI:**
- Token probability distribution in language models
- Categorical output selection

### 2.3 Continuous Random Variables

**Probability Density Function (PDF):**
$$f_X(x) \geq 0, \quad \int_{-\infty}^{\infty} f_X(x) dx = 1$$

**Probability of interval:**
$$P(a \leq X \leq b) = \int_a^b f_X(x) dx$$

**Application:**
- Latent space representations in VAEs
- Noise modeling in diffusion models

### 2.4 Cumulative Distribution Function (CDF)

**Definition:**
$$F_X(x) = P(X \leq x) = \int_{-\infty}^{x} f_X(t) dt \quad \text{(integral form for continuous } X \text{ with density } f_X\text{)}$$

**Properties:**
- $F_X(-\infty) = 0$
- $F_X(\infty) = 1$
- Non-decreasing function

### 2.5 Joint, Marginal, and Conditional Distributions

**Joint Distribution:**
$$P(X, Y) \text{ or } f_{X,Y}(x, y)$$

**Marginal Distribution:**
$$P(X) = \sum_y P(X, Y=y) \quad \text{(discrete)}$$
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) dy \quad \text{(continuous)}$$

**Conditional Distribution:**
$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}$$

---

## 3. Key Probability Distributions in Gen-AI

### 3.1 Bernoulli Distribution

**Definition:** Distribution over binary outcomes.

$$P(X = x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}$$

**Parameters:** $p \in [0, 1]$

**Moments:**
$$\mathbb{E}[X] = p, \quad \text{Var}(X) = p(1-p)$$

### 3.2 Categorical Distribution

**Definition:** Generalization of Bernoulli to $K$ categories.

$$P(X = k) = \pi_k, \quad \sum_{k=1}^{K} \pi_k = 1$$

**One-hot encoding representation:**
$$P(\mathbf{x}) = \prod_{k=1}^{K} \pi_k^{x_k}$$

**Application:** Token prediction in language models (softmax output)

### 3.3 Multinomial Distribution

**Definition:** Distribution over counts from $n$ trials with $K$ categories.

$$P(X_1=x_1, ..., X_K=x_K) = \frac{n!}{\prod_{k=1}^{K} x_k!} \prod_{k=1}^{K} \pi_k^{x_k}$$

**Constraint:** $\sum_{k=1}^{K} x_k = n$

### 3.4 Gaussian (Normal) Distribution

**Definition:** Continuous distribution with bell-shaped curve.

**Univariate:**
$$\mathcal{N}(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

**Multivariate:**
$$\mathcal{N}(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

**Critical Importance in Gen-AI:**
| Application | Role |
|-------------|------|
| VAE latent space | Prior distribution $P(\mathbf{z}) = \mathcal{N}(0, I)$ |
| Diffusion models | Noise distribution at each step |
| Weight initialization | Xavier/He initialization |
| Dropout noise | Gaussian dropout variant |

### 3.5 Exponential Family

**Definition:** Unified framework for many distributions.

$$P(x | \boldsymbol{\eta}) = h(x) \exp\left(\boldsymbol{\eta}^T \mathbf{T}(x) - A(\boldsymbol{\eta})\right)$$

| Component | Description |
|-----------|-------------|
| $\boldsymbol{\eta}$ | Natural parameters |
| $\mathbf{T}(x)$ | Sufficient statistics |
| $A(\boldsymbol{\eta})$ | Log-partition function |
| $h(x)$ | Base measure |

**Members:** Gaussian, Bernoulli, Categorical, Poisson, Gamma, Beta

### 3.6 Mixture Models

**Definition:** Weighted combination of component distributions.

$$P(x) = \sum_{k=1}^{K} \pi_k \cdot P_k(x | \theta_k)$$

**Gaussian Mixture Model (GMM):**
$$P(x) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x | \mu_k, \Sigma_k)$$

**Latent variable formulation:**
$$P(x) = \sum_{z} P(z) \cdot P(x|z)$$
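The latent variable formulation suggests a two-stage sampler: first draw the component index $z$, then draw $x$ from that component. The sketch below assumes a one-dimensional two-component GMM with made-up parameters, and `sample_gmm` is an illustrative helper name.

```python
import random


def sample_gmm(weights, means, stds, rng):
    """Draw z ~ Categorical(weights), then x ~ N(means[z], stds[z]^2)."""
    z = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    x = rng.gauss(means[z], stds[z])
    return z, x


rng = random.Random(0)  # fixed seed so the run is reproducible
weights, means, stds = [0.3, 0.7], [-2.0, 3.0], [0.5, 1.0]
draws = [sample_gmm(weights, means, stds, rng) for _ in range(20000)]

sample_mean = sum(x for _, x in draws) / len(draws)
mixture_mean = sum(w * m for w, m in zip(weights, means))  # sum_k pi_k * mu_k
frac_second = sum(1 for z, _ in draws if z == 1) / len(draws)
```

The empirical mean should approach the mixture mean $\sum_k \pi_k \mu_k$, and the fraction of draws from each component should approach its weight.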

---

## 4. Expectation, Moments, and Covariance

### 4.1 Expected Value

**Definition:** The mean or average value of a random variable.

**Discrete:**
$$\mathbb{E}[X] = \sum_x x \cdot P(X=x)$$

**Continuous:**
$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) dx$$

**Properties:**
- Linearity: $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$
- For function $g(X)$: $\mathbb{E}[g(X)] = \int g(x) f_X(x) dx$

### 4.2 Variance and Standard Deviation

**Variance:**
$$\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

**Standard Deviation:**
$$\sigma_X = \sqrt{\text{Var}(X)}$$

**Properties:**
- $\text{Var}(aX + b) = a^2 \text{Var}(X)$
- For independent $X, Y$: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$

### 4.3 Covariance and Correlation

**Covariance:**
$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

**Covariance Matrix:**
$$\boldsymbol{\Sigma} = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]$$

$$\boldsymbol{\Sigma}_{ij} = \text{Cov}(X_i, X_j)$$

**Correlation Coefficient:**
$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}, \quad -1 \leq \rho \leq 1$$

### 4.4 Higher-Order Moments

**$n$-th Moment:**
$$\mu_n = \mathbb{E}[X^n]$$

**$n$-th Central Moment:**
$$\mu'_n = \mathbb{E}[(X - \mathbb{E}[X])^n]$$

**Skewness (3rd standardized moment):**
$$\gamma_1 = \frac{\mathbb{E}[(X-\mu)^3]}{\sigma^3}$$

**Kurtosis (4th standardized moment):**
$$\gamma_2 = \frac{\mathbb{E}[(X-\mu)^4]}{\sigma^4}$$

For any Gaussian this equals 3; the excess kurtosis is $\gamma_2 - 3$.

### 4.5 Moment Generating Function

**Definition:**
$$M_X(t) = \mathbb{E}[e^{tX}]$$

**Property:**
$$\mathbb{E}[X^n] = M_X^{(n)}(0) = \left.\frac{d^n M_X(t)}{dt^n}\right|_{t=0}$$

---

## 5. Information Theory Foundations

### 5.1 Entropy

**Definition:** Measure of uncertainty or information content in a distribution.

**Discrete Entropy:**
$$H(X) = -\sum_x P(x) \log P(x) = \mathbb{E}[-\log P(X)]$$

**Differential Entropy (Continuous):**
$$H(X) = -\int f(x) \log f(x) dx$$

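**Properties:**
- $H(X) \geq 0$ (discrete case; differential entropy can be negative)
- Over a finite set of $K$ outcomes, entropy is maximized by the uniform distribution, with $H(X) = \log K$
- $H(X, Y) \leq H(X) + H(Y)$ (equality iff independent)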

**Entropy of Common Distributions:**
| Distribution | Entropy |
|--------------|---------|
| Bernoulli($p$) | $-p\log p - (1-p)\log(1-p)$ |
| Categorical($K$) | $\leq \log K$ (max when uniform) |
| Gaussian($\mu, \sigma^2$) | $\frac{1}{2}\log(2\pi e \sigma^2)$ |

### 5.2 Cross-Entropy

**Definition:** Measures the average number of bits (for $\log_2$; nats for the natural logarithm) needed to encode data from distribution $P$ using a code optimized for distribution $Q$.

$$H(P, Q) = -\sum_x P(x) \log Q(x) = \mathbb{E}_{x \sim P}[-\log Q(x)]$$

**Continuous:**
$$H(P, Q) = -\int p(x) \log q(x) dx$$

**Application in Gen-AI - Cross-Entropy Loss:**
$$\mathcal{L}_{CE} = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}$$

Where:
- $y_{ik}$: true one-hot label
- $\hat{y}_{ik}$: predicted probability

### 5.3 Kullback-Leibler (KL) Divergence

**Definition:** Measures the difference between two probability distributions.

$$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$$

**Continuous:**
$$D_{KL}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} dx$$

**Properties:**
- $D_{KL}(P \| Q) \geq 0$ (Gibbs' inequality)
- $D_{KL}(P \| Q) = 0 \iff P = Q$
- **Asymmetric:** $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$

**Relationship:**
$$D_{KL}(P \| Q) = H(P, Q) - H(P)$$
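The relationship can be checked numerically on a small discrete example. The distributions `p` and `q` below are arbitrary illustrative values, and all quantities are in nats.

```python
import math


def entropy(p):
    """Shannon entropy H(P) in nats; zero-probability terms contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy H(P, Q) = -sum_x P(x) log Q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL divergence D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
gap = cross_entropy(p, q) - entropy(p)  # should match kl_divergence(p, q)
```

Since $H(P)$ does not depend on $Q$, minimizing cross-entropy over $Q$ is the same as minimizing $D_{KL}(P \| Q)$, which is why cross-entropy loss fits a model to the data distribution.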

**KL Divergence for Gaussians:**
$$D_{KL}(\mathcal{N}(\mu_1, \sigma_1^2) \| \mathcal{N}(\mu_2, \sigma_2^2)) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

**Multivariate Gaussian** (with $\mathcal{N}_i = \mathcal{N}(\mu_i, \Sigma_i)$ and $k$ the dimension):
$$D_{KL}(\mathcal{N}_1 \| \mathcal{N}_2) = \frac{1}{2}\left[\text{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2-\mu_1)^T\Sigma_2^{-1}(\mu_2-\mu_1) - k + \log\frac{|\Sigma_2|}{|\Sigma_1|}\right]$$

### 5.4 Jensen-Shannon Divergence

**Definition:** Symmetric measure of divergence.

$$D_{JS}(P \| Q) = \frac{1}{2}D_{KL}(P \| M) + \frac{1}{2}D_{KL}(Q \| M)$$

Where $M = \frac{1}{2}(P + Q)$

**Properties:**
- Symmetric: $D_{JS}(P \| Q) = D_{JS}(Q \| P)$
- Bounded: $0 \leq D_{JS} \leq \log 2$
- $\sqrt{D_{JS}}$ is a metric

**Application:** Original GAN training objective
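The symmetry and the $\log 2$ upper bound can be illustrated with a short sketch on discrete distributions. The example vectors are arbitrary; the bound is attained when the two distributions have disjoint support.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats; terms with P(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence using the mixture M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
js_pq = js_divergence(p, q)
js_qp = js_divergence(q, p)
js_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])  # attains the bound log 2
```

Unlike KL divergence, the JS divergence stays finite even when the supports do not overlap, because $M$ is positive wherever $P$ or $Q$ is.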

### 5.5 Mutual Information

**Definition:** Measures shared information between two random variables.

$$I(X; Y) = D_{KL}(P(X,Y) \| P(X)P(Y))$$

$$I(X; Y) = \sum_x \sum_y P(x, y) \log \frac{P(x, y)}{P(x)P(y)}$$

**Alternative Forms:**
$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y)$$

**Properties:**
- $I(X; Y) \geq 0$
- $I(X; Y) = 0 \iff X \perp Y$
- $I(X; X) = H(X)$

**Application in Gen-AI:**
- InfoGAN: Maximizing mutual information for disentanglement
- Contrastive learning objectives

---

## 6. Parameter Estimation Methods

### 6.1 Maximum Likelihood Estimation (MLE)

**Definition:** Find parameters that maximize the probability of observed data.

$$\hat{\theta}_{MLE} = \arg\max_\theta P(\mathcal{D} | \theta) = \arg\max_\theta \mathcal{L}(\theta; \mathcal{D})$$

**For i.i.d. data:**
$$\mathcal{L}(\theta; \mathcal{D}) = \prod_{i=1}^{N} P(x_i | \theta)$$

**Log-Likelihood (computationally preferred):**
$$\ell(\theta) = \log \mathcal{L}(\theta) = \sum_{i=1}^{N} \log P(x_i | \theta)$$

**MLE Equation:**
$$\nabla_\theta \ell(\theta) = 0$$

### 6.2 MLE for Common Distributions

**Bernoulli:**
$$\hat{p}_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

**Gaussian:**
$$\hat{\mu}_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
$$\hat{\sigma}^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu})^2$$

**Categorical:**
$$\hat{\pi}_k = \frac{N_k}{N}$$

Where $N_k$ is the number of observations in category $k$.
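These closed-form estimators translate directly into code. The data values below are made up for illustration; the sketch also evaluates the Gaussian log-likelihood so that the estimate can be compared against nearby parameter values.

```python
import math
from collections import Counter


def gaussian_mle(xs):
    """Closed-form MLE for a univariate Gaussian (variance divides by N)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var


def gaussian_log_likelihood(xs, mu, var):
    """Log-likelihood of i.i.d. data under N(mu, var)."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var) for x in xs
    )


def categorical_mle(labels):
    """MLE for a categorical distribution: relative frequencies N_k / N."""
    counts = Counter(labels)
    return {k: c / len(labels) for k, c in counts.items()}


xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, var_hat = gaussian_mle(xs)
pi_hat = categorical_mle(["a", "b", "a", "c", "a", "b"])
```

Note that the variance MLE divides by $N$ rather than $N-1$, so it is a biased estimator of the population variance.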

### 6.3 MLE in Neural Networks

**Negative Log-Likelihood (NLL) Loss:**
$$\mathcal{L}_{NLL} = -\frac{1}{N}\sum_{i=1}^{N} \log P_\theta(x_i)$$

**For classification (Cross-Entropy):**
$$\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N} \log P_\theta(y_i | x_i)$$

**For language models:**
$$\mathcal{L}_{LM} = -\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(w_t | w_{<t})$$

### 6.4 Maximum A Posteriori (MAP) Estimation

**Definition:** Incorporates prior knowledge through Bayesian framework.

$$\hat{\theta}_{MAP} = \arg\max_\theta P(\theta | \mathcal{D}) = \arg\max_\theta P(\mathcal{D} | \theta) P(\theta)$$

**Log form:**
$$\hat{\theta}_{MAP} = \arg\max_\theta \left[\log P(\mathcal{D} | \theta) + \log P(\theta)\right]$$

### 6.5 Connection to Regularization

**Gaussian Prior → L2 Regularization:**
$$P(\theta) = \mathcal{N}(0, \sigma^2 I) \implies \log P(\theta) = -\frac{\|\theta\|^2}{2\sigma^2} + \text{const}$$

$$\hat{\theta}_{MAP} = \arg\max_\theta \left[\log P(\mathcal{D}|\theta) - \frac{\lambda}{2}\|\theta\|^2\right], \quad \lambda = \frac{1}{\sigma^2}$$

**Laplace Prior → L1 Regularization:**
$$P(\theta) \propto \exp(-\lambda|\theta|) \implies \hat{\theta}_{MAP} = \arg\max_\theta \left[\log P(\mathcal{D}|\theta) - \lambda\|\theta\|_1\right]$$

### 6.6 Full Bayesian Inference

**Posterior Distribution:**
$$P(\theta | \mathcal{D}) = \frac{P(\mathcal{D} | \theta) P(\theta)}{P(\mathcal{D})}$$

**Predictive Distribution:**
$$P(x_{new} | \mathcal{D}) = \int P(x_{new} | \theta) P(\theta | \mathcal{D}) d\theta$$

**Challenge:** Integral often intractable → requires approximate inference

---

## 7. Latent Variable Models

### 7.1 Definition and Motivation

**Latent Variable Model:** A probabilistic model with observed variables $\mathbf{x}$ and unobserved (hidden) variables $\mathbf{z}$.

$$P_\theta(\mathbf{x}) = \int P_\theta(\mathbf{x}, \mathbf{z}) d\mathbf{z} = \int P_\theta(\mathbf{x} | \mathbf{z}) P(\mathbf{z}) d\mathbf{z}$$

**Purpose in Gen-AI:**
- Learn compressed representations
- Enable controllable generation
- Capture data structure

### 7.2 Marginal Likelihood (Evidence)

$$P_\theta(\mathbf{x}) = \int P_\theta(\mathbf{x} | \mathbf{z}) P(\mathbf{z}) d\mathbf{z}$$

**Problem:** This integral is often intractable because:
- High-dimensional integration
- Complex likelihood functions
- No closed-form solution

### 7.3 Evidence Lower Bound (ELBO)

**Derivation:**
Starting from log marginal likelihood:
$$\log P_\theta(\mathbf{x}) = \log \int P_\theta(\mathbf{x}, \mathbf{z}) d\mathbf{z}$$

Introducing variational distribution $Q_\phi(\mathbf{z}|\mathbf{x})$:
$$\log P_\theta(\mathbf{x}) = \log \int Q_\phi(\mathbf{z}|\mathbf{x}) \frac{P_\theta(\mathbf{x}, \mathbf{z})}{Q_\phi(\mathbf{z}|\mathbf{x})} d\mathbf{z}$$

Applying Jensen's inequality:
$$\log P_\theta(\mathbf{x}) \geq \int Q_\phi(\mathbf{z}|\mathbf{x}) \log \frac{P_\theta(\mathbf{x}, \mathbf{z})}{Q_\phi(\mathbf{z}|\mathbf{x})} d\mathbf{z}$$

**ELBO Definition:**
$$\mathcal{L}_{ELBO}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{Q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{P_\theta(\mathbf{x}, \mathbf{z})}{Q_\phi(\mathbf{z}|\mathbf{x})}\right]$$

**Decomposition:**
$$\mathcal{L}_{ELBO} = \underbrace{\mathbb{E}_{Q_\phi(\mathbf{z}|\mathbf{x})}[\log P_\theta(\mathbf{x}|\mathbf{z})]}_{\text{Reconstruction Term}} - \underbrace{D_{KL}(Q_\phi(\mathbf{z}|\mathbf{x}) \| P(\mathbf{z}))}_{\text{Regularization Term}}$$

**Relationship to Log-Likelihood:**
$$\log P_\theta(\mathbf{x}) = \mathcal{L}_{ELBO} + D_{KL}(Q_\phi(\mathbf{z}|\mathbf{x}) \| P_\theta(\mathbf{z}|\mathbf{x}))$$
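The step behind this identity is the factorization $P_\theta(\mathbf{x}, \mathbf{z}) = P_\theta(\mathbf{z}|\mathbf{x}) P_\theta(\mathbf{x})$ substituted into the ELBO definition. A short derivation, using only the symbols defined in this section:

$$
\begin{aligned}
\mathcal{L}_{ELBO} &= \mathbb{E}_{Q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{P_\theta(\mathbf{z}|\mathbf{x}) \, P_\theta(\mathbf{x})}{Q_\phi(\mathbf{z}|\mathbf{x})}\right] \\
&= \log P_\theta(\mathbf{x}) + \mathbb{E}_{Q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{P_\theta(\mathbf{z}|\mathbf{x})}{Q_\phi(\mathbf{z}|\mathbf{x})}\right] \\
&= \log P_\theta(\mathbf{x}) - D_{KL}(Q_\phi(\mathbf{z}|\mathbf{x}) \| P_\theta(\mathbf{z}|\mathbf{x}))
\end{aligned}
$$

Because the KL term is non-negative, the ELBO never exceeds $\log P_\theta(\mathbf{x})$, and the bound is tight exactly when $Q_\phi(\mathbf{z}|\mathbf{x})$ equals the true posterior.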

### 7.4 Expectation-Maximization (EM) Algorithm

**Objective:** Maximum likelihood for latent variable models.

**E-Step (Expectation):**
$$Q(\theta | \theta^{(t)}) = \mathbb{E}_{\mathbf{z} \sim P(\mathbf{z}|\mathbf{x}, \theta^{(t)})}[\log P(\mathbf{x}, \mathbf{z} | \theta)]$$

**M-Step (Maximization):**
$$\theta^{(t+1)} = \arg\max_\theta Q(\theta | \theta^{(t)})$$

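**Monotonicity Guarantee:** Each iteration never decreases the log-likelihood, so EM converges to a stationary point (typically a local optimum, not necessarily the global maximum):
$$\log P(\mathbf{x} | \theta^{(t+1)}) \geq \log P(\mathbf{x} | \theta^{(t)})$$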
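As an illustration of the E-step and M-step, here is a sketch of EM for a one-dimensional Gaussian mixture, a case where both steps have closed forms. The synthetic data, the initial parameters, and the helper names are assumptions made for this example; a production implementation would also work in log space and guard against collapsing components.

```python
import math
import random


def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def log_likelihood(xs, weights, means, variances):
    """Log marginal likelihood: sum_i log sum_k pi_k N(x_i | mu_k, var_k)."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in xs
    )


def em_step(xs, weights, means, variances):
    # E-step: responsibilities r[i][k] = P(z = k | x_i, theta^(t))
    resp = []
    for x in xs:
        joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: closed-form maximizers of Q(theta | theta^(t)) for a GMM
    new_w, new_m, new_v = [], [], []
    for k in range(len(weights)):
        nk = sum(r[k] for r in resp)  # effective count of component k
        mk = sum(r[k] * x for r, x in zip(resp, xs)) / nk
        vk = sum(r[k] * (x - mk) ** 2 for r, x in zip(resp, xs)) / nk
        new_w.append(nk / len(xs))
        new_m.append(mk)
        new_v.append(vk)
    return new_w, new_m, new_v


# Synthetic data from two well-separated clusters (illustrative only).
rng = random.Random(0)
xs = [rng.gauss(-2.0, 1.0) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]

params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])  # deliberately poor initial guess
history = [log_likelihood(xs, *params)]
for _ in range(30):
    params = em_step(xs, *params)
    history.append(log_likelihood(xs, *params))
```

The list `history` records the log-likelihood after each iteration and should be non-decreasing, matching the inequality above.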

### 7.5 Variational Inference

**Goal:** Approximate intractable posterior $P(\mathbf{z}|\mathbf{x})$ with tractable $Q_\phi(\mathbf{z}|\mathbf{x})$.

**Optimization:**
$$\phi^* = \arg\min_\phi D_{KL}(Q_\phi(\mathbf{z}|\mathbf{x}) \| P(\mathbf{z}|\mathbf{x}))$$

Equivalently:
$$\phi^* = \arg\max_\phi \mathcal{L}_{ELBO}(\phi)$$

**Mean-Field Approximation:**
$$Q(\mathbf{z}) = \prod_{i=1}^{M} Q_i(z_i)$$

---

## 8. Sampling Methods for Generative Models

### 8.1 Importance of Sampling in Gen-AI

Sampling is fundamental for:
- Generating new data points
- Computing expectations
- Training stochastic models
- Monte Carlo estimation

### 8.2 Ancestral Sampling

**Definition:** Sample from joint distribution using factorization.

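For a directed graphical model:
$$P(x_1, ..., x_n) = \prod_{i=1}^{n} P(x_i | \text{parents}(x_i))$$

**Procedure** (visit variables in topological order so that parents are sampled before children; shown here for the autoregressive case, where the parents of $x_i$ are all earlier variables):
1. Sample $x_1 \sim P(x_1)$
2. Sample $x_2 \sim P(x_2 | x_1)$
3. Continue: $x_i \sim P(x_i | x_1, ..., x_{i-1})$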

**Application:** Autoregressive generation in GPT

### 8.3 Inverse Transform Sampling

**Definition:** Generate samples from a distribution whose inverse CDF $F_X^{-1}$ can be evaluated, using a uniform random variable.

For CDF $F_X$:
1. Sample $u \sim \text{Uniform}(0, 1)$
2. Compute $x = F_X^{-1}(u)$

$$X = F_X^{-1}(U) \sim F_X$$
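A standard illustration is the exponential distribution, whose CDF $F_X(x) = 1 - e^{-\lambda x}$ inverts to $F_X^{-1}(u) = -\ln(1-u)/\lambda$. The sketch below assumes this target; the rate value and sample count are arbitrary.

```python
import math
import random


def sample_exponential(rate, rng):
    """Inverse transform sampling for Exponential(rate)."""
    u = rng.random()  # u in [0, 1), so 1 - u is in (0, 1] and the log is finite
    return -math.log(1.0 - u) / rate


rng = random.Random(0)
rate = 2.0
samples = [sample_exponential(rate, rng) for _ in range(50000)]
sample_mean = sum(samples) / len(samples)  # should be close to 1 / rate
```

The method needs only a uniform generator, but it is practical only when the inverse CDF is available in closed form or is cheap to approximate.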

### 8.4 Rejection Sampling

**Setup:** Target distribution $p(x)$, proposal distribution $q(x)$, constant $M$ where $Mq(x) \geq p(x)$ for all $x$.

**Algorithm:**
1. Sample $x \sim q(x)$
2. Sample $u \sim \text{Uniform}(0, 1)$
3. Accept $x$ if $u < \frac{p(x)}{Mq(x)}$; else reject and repeat
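The algorithm can be sketched for a simple bounded target. The example below assumes the density $p(x) = 6x(1-x)$ on $[0, 1]$ (a Beta(2, 2) distribution), a uniform proposal $q(x) = 1$, and $M = 1.5$, which is the maximum of $p$; these choices are illustrative.

```python
import random


def target_pdf(x):
    """Beta(2, 2) density on [0, 1]; its maximum is 1.5 at x = 0.5."""
    return 6.0 * x * (1.0 - x)


M = 1.5  # envelope constant: M * q(x) >= p(x) for the uniform proposal q(x) = 1


def rejection_sample(rng):
    """Repeat propose/accept until one sample is accepted."""
    while True:
        x = rng.random()  # step 1: x ~ q = Uniform(0, 1)
        u = rng.random()  # step 2: u ~ Uniform(0, 1)
        if u < target_pdf(x) / (M * 1.0):  # step 3: accept with prob p / (M q)
            return x


rng = random.Random(0)
samples = [rejection_sample(rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The expected acceptance rate is $1/M$, so a loose envelope (large $M$) wastes most proposals; this is the main reason rejection sampling scales poorly to high dimensions.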

### 8.5 Markov Chain Monte Carlo (MCMC)

**Definition:** Construct a Markov chain whose stationary distribution equals the target distribution.

**Metropolis-Hastings Algorithm:**
1. Initialize $x^{(0)}$
2. For $t = 1, 2, ...$:
   - Propose $x' \sim q(x' | x^{(t-1)})$
   - Compute acceptance ratio:
   $$\alpha = \min\left(1, \frac{p(x') q(x^{(t-1)} | x')}{p(x^{(t-1)}) q(x' | x^{(t-1)})}\right)$$
   - Accept with probability $\alpha$:
   $$x^{(t)} = \begin{cases} x' & \text{with prob } \alpha \\ x^{(t-1)} & \text{otherwise} \end{cases}$$

**Gibbs Sampling:**
For multivariate distribution, sample each variable conditioned on others:
$$x_i^{(t+1)} \sim P(x_i | x_1^{(t+1)}, ..., x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, ..., x_n^{(t)})$$
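A minimal sketch of Metropolis-Hastings, assuming a standard normal target known only up to its normalizing constant and a symmetric Gaussian random-walk proposal (so the $q$ terms in the acceptance ratio cancel). The step size, chain length, and burn-in length are arbitrary illustrative choices.

```python
import math
import random


def log_target(x):
    """Unnormalized log-density of a standard normal; the normalizer cancels."""
    return -0.5 * x * x


def metropolis_hastings(log_p, n_steps, step_size, rng, x0=0.0):
    x, samples = x0, []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)  # symmetric: q(x'|x) = q(x|x')
        log_alpha = min(0.0, log_p(proposal) - log_p(x))
        # Accept with probability alpha, compared in log space.
        if math.log(1.0 - rng.random()) < log_alpha:
            x = proposal
        samples.append(x)  # a rejected proposal repeats the current state
    return samples


rng = random.Random(0)
chain = metropolis_hastings(log_target, 50_000, 1.0, rng)[1_000:]  # drop burn-in
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
```

Because only the ratio $p(x')/p(x^{(t-1)})$ appears, the target needs to be known only up to a constant, which is what makes MCMC usable for intractable posteriors.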

### 8.6 Reparameterization Trick

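**Problem:** The sampling operation $\mathbf{z} \sim Q_\phi$ is not a differentiable function of the distribution parameters, so gradients cannot be backpropagated through it directly.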

**Solution:** Express random variable as deterministic function of parameters and noise.

**For Gaussian:**
$$\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$$
$$\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, I)$$

**Gradient Flow:**
$$\nabla_{\boldsymbol{\mu}, \boldsymbol{\sigma}} \mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[f(z)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}[\nabla_{\boldsymbol{\mu}, \boldsymbol{\sigma}} f(\mu + \sigma \epsilon)]$$

**Application:** Training VAEs with backpropagation
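The gradient identity can be checked by Monte Carlo on a case with a known answer. The sketch assumes the test function $f(z) = z^2$, for which $\mathbb{E}[f(z)] = \mu^2 + \sigma^2$ and the exact gradients are $2\mu$ and $2\sigma$; the derivative of $f$ is written by hand instead of using an autodiff library.

```python
import random


def reparam_gradients(mu, sigma, n_samples, rng):
    """Pathwise gradient estimate of E[z^2] w.r.t. mu and sigma, z = mu + sigma * eps."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # noise does not depend on the parameters
        z = mu + sigma * eps  # deterministic in (mu, sigma) given eps
        df_dz = 2.0 * z  # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0  # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps  # chain rule: dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples


rng = random.Random(0)
g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5, n_samples=50_000, rng=rng)
# Exact values for comparison: dE/dmu = 2 * mu = 3.0, dE/dsigma = 2 * sigma = 1.0
```

In a VAE the same pattern applies, with $f$ replaced by the decoder log-likelihood and the chain rule handled by automatic differentiation.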

### 8.7 Temperature Sampling

**Definition:** Control randomness in sampling from categorical distribution.

**Softmax with temperature:**
$$P(x_i) = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}$$

| Temperature $\tau$ | Effect |
|-------------------|--------|
| $\tau \rightarrow 0$ | Deterministic (argmax) |
| $\tau = 1$ | Standard softmax |
| $\tau \rightarrow \infty$ | Uniform distribution |
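The limiting behaviors in the table can be observed with a few lines of code. The logits below are arbitrary; subtracting the maximum before exponentiating is a standard numerical-stability step that leaves the result unchanged.

```python
import math


def softmax_with_temperature(logits, tau):
    """Softmax of logits / tau, computed stably by subtracting the max."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.01)  # nearly one-hot on the argmax
standard = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 1000.0)  # nearly uniform
```

Lower temperatures concentrate mass on the most likely token, while higher temperatures flatten the distribution and increase diversity.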

### 8.8 Top-k and Top-p (Nucleus) Sampling

**Top-k Sampling:**
1. Keep only top $k$ highest probability tokens
2. Renormalize probabilities
3. Sample from truncated distribution

**Top-p (Nucleus) Sampling:**
1. Sort tokens by probability
2. Keep smallest set where cumulative probability $\geq p$
3. Renormalize and sample

$$V^{(p)} = \arg\min_{V' \subseteq V} |V'| \quad \text{subject to} \quad \sum_{x \in V'} P(x) \geq p$$
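Both truncation schemes can be sketched as filters over a probability vector. The function names and the example distribution are illustrative; each function returns a dictionary mapping kept token indices to renormalized probabilities, from which a token can then be sampled.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest prefix of sorted tokens whose cumulative mass is >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


probs = [0.5, 0.3, 0.1, 0.07, 0.03]
top2 = top_k_filter(probs, k=2)
nucleus = top_p_filter(probs, p=0.85)
```

Top-k keeps a fixed number of candidates regardless of how peaked the distribution is, whereas top-p adapts the candidate set size to the shape of the distribution.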

---

## 9. Probabilistic Framework of Gen-AI Architectures

### 9.1 Taxonomy of Generative Models

```
Generative Models
├── Explicit Density
│   ├── Tractable Density
│   │   ├── Autoregressive Models (GPT)
│   │   └── Flow-based Models (Normalizing Flows)
│   └── Approximate Density
│       ├── Variational Autoencoders (VAE)
│       └── Diffusion Models
└── Implicit Density
    └── Generative Adversarial Networks (GAN)
```

### 9.2 Autoregressive Models

**Probabilistic Foundation:**
$$P_\theta(\mathbf{x}) = \prod_{t=1}^{T} P_\theta(x_t | x_1, ..., x_{t-1})$$

**Training Objective (NLL):**
$$\mathcal{L}_{AR} = -\mathbb{E}_{\mathbf{x} \sim p_{data}}\left[\sum_{t=1}^{T} \log P_\theta(x_t | x_{<t})\right]$$

**GPT-style Transformer:**
$$P_\theta(x_t | x_{<t}) = \text{softmax}(\mathbf{W}_v \cdot \mathbf{h}_t)$$

Where $\mathbf{h}_t$ is the hidden state from the transformer.

**Perplexity (Evaluation Metric):**
$$\text{PPL} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t | x_{<t})\right)$$
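Perplexity follows directly from the per-token probabilities that the model assigns to the observed sequence. The probabilities below are invented for illustration.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that assigns probability 1/4 to every observed token is, on average,
# as uncertain as a uniform choice among 4 options.
ppl_uniform = perplexity([0.25] * 10)
ppl_mixed = perplexity([0.9, 0.5, 0.1, 0.7])
```

Perplexity can be read as an effective branching factor: lower values mean the model concentrates more probability on the tokens that actually occur.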

### 9.3 Variational Autoencoders (VAE)

**Generative Model:**
$$P_\theta(\mathbf{x}) = \int P_\theta(\mathbf{x} | \mathbf{z}) P(\mathbf{z}) d\mathbf{z}$$

**Components:**
- Prior: $P(\mathbf{z}) = \mathcal{N}(0, I)$
- Decoder: $P_\theta(\mathbf{x} | \mathbf{z})$ (neural network)
- Encoder (approximate posterior): $Q_\phi(\mathbf{z} | \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x})))$

**VAE Loss (Negative ELBO):**
$$\mathcal{L}_{VAE}(\theta, \phi; \mathbf{x}) = -\mathbb{E}_{Q_\phi(\mathbf{z}|\mathbf{x})}[\log P_\theta(\mathbf{x}|\mathbf{z})] + D_{KL}(Q_\phi(\mathbf{z}|\mathbf{x}) \| P(\mathbf{z}))$$

**KL Term (Closed Form for Gaussians):**
$$D_{KL}(Q_\phi \| P) = -\frac{1}{2}\sum_{j=1}^{J}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
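The closed form can be cross-checked against the general univariate Gaussian KL formula from Section 5.3 by setting $\mu_2 = 0$ and $\sigma_2 = 1$. The sketch assumes the encoder outputs a mean and a log-variance per latent dimension, which is a common parameterization but an assumption here; the example values are arbitrary.

```python
import math


def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) for univariate Gaussians."""
    return (
        math.log(sigma2 / sigma1)
        + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
        - 0.5
    )


def vae_kl(mu, log_var):
    """D_KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, log_var))


mu = [0.5, -1.0, 0.0]
log_var = [0.0, -0.5, 0.3]
kl_closed = vae_kl(mu, log_var)
kl_per_dim = sum(
    kl_gaussians(m, math.exp(0.5 * lv), 0.0, 1.0) for m, lv in zip(mu, log_var)
)
```

The two values agree because a diagonal Gaussian factorizes across dimensions, so its KL divergence to $\mathcal{N}(0, I)$ is a sum of univariate terms.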

**Reconstruction Term:**
$$\mathbb{E}_{Q_\phi}[\log P_\theta(\mathbf{x}|\mathbf{z})] \approx \frac{1}{L}\sum_{l=1}^{L} \log P_\theta(\mathbf{x} | \mathbf{z}^{(l)})$$

Where $\mathbf{z}^{(l)} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}^{(l)}$, $\boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(0, I)$

### 9.4 Diffusion Models

**Forward Process (Fixed):**
$$q(\mathbf{x}_{1:T} | \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t | \mathbf{x}_{t-1})$$

$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t \mathbf{I})$$

**Closed-form for arbitrary timestep:**
$$q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$$

Where:
- $\alpha_t = 1 - \beta_t$
- $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$

**Reverse Process (Learned):**
$$P_\theta(\mathbf{x}_{0:T}) = P(\mathbf{x}_T) \prod_{t=1}^{T} P_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$$

$$P_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I})$$

**Training Objective (Simplified):**
$$\mathcal{L}_{simple} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2\right]$$

Where $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$
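The closed-form expression for $\mathbf{x}_t$ makes training samples cheap to generate at any timestep. The sketch below assumes a linear $\beta_t$ schedule with endpoints $10^{-4}$ and $0.02$ over $T = 1000$ steps, which is a commonly used choice but not something specified in this document; `q_sample` is an illustrative helper name.

```python
import math
import random


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Assumed schedule: beta_t increases linearly from beta_start to beta_end."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]


def alpha_bars(betas):
    """Cumulative products: alpha_bar_t = prod_{s <= t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed as in the text."""
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


abar = alpha_bars(linear_beta_schedule(1000))
x0 = [1.0, -0.5]
xt, eps = q_sample(x0, 500, abar, random.Random(0))
```

In training, the pair `(xt, eps)` would be fed to the noise-prediction network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$, with `eps` as the regression target in $\mathcal{L}_{simple}$.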

**Score Matching Perspective** (the noise predictor estimates the score of the noised marginal $q(\mathbf{x}_t)$):
$$\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t) \approx -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$

### 9.5 Normalizing Flows

**Definition:** Transform simple distribution to complex one through invertible functions.

$$\mathbf{x} = f_\theta(\mathbf{z}), \quad \mathbf{z} \sim P_Z(\mathbf{z})$$

**Change of Variables Formula:**
$$P_X(\mathbf{x}) = P_Z(f_\theta^{-1}(\mathbf{x})) \left|\det\left(\frac{\partial f_\theta^{-1}(\mathbf{x})}{\partial \mathbf{x}}\right)\right|$$

$$\log P_X(\mathbf{x}) = \log P_Z(\mathbf{z}) - \log\left|\det\left(\frac{\partial f_\theta(\mathbf{z})}{\partial \mathbf{z}}\right)\right|$$

**Composition of Flows:**
$$\mathbf{x} = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z})$$

$$\log P_X(\mathbf{x}) = \log P_Z(\mathbf{z}) - \sum_{k=1}^{K} \log\left|\det\left(\frac{\partial f_k}{\partial \mathbf{h}_{k-1}}\right)\right|$$

**Training:**
$$\mathcal{L}_{NF} = -\mathbb{E}_{\mathbf{x} \sim p_{data}}[\log P_X(\mathbf{x})]$$
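The change of variables formula can be verified on the simplest possible flow. The sketch assumes a one-dimensional affine map $x = a z + b$ with a standard normal base distribution, for which the exact density of $x$ is $\mathcal{N}(b, a^2)$; the parameter values are arbitrary.

```python
import math


def standard_normal_log_pdf(z):
    return -0.5 * math.log(2 * math.pi) - 0.5 * z * z


def affine_flow_log_prob(x, scale, shift):
    """log P_X(x) for x = scale * z + shift with z ~ N(0, 1)."""
    z = (x - shift) / scale  # z = f^{-1}(x)
    log_det = math.log(abs(scale))  # log |df/dz| for the affine map
    return standard_normal_log_pdf(z) - log_det


def normal_log_pdf(x, mu, var):
    """Reference log-density of N(mu, var), used only for comparison."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


scale, shift = 2.0, 1.0
points = [-1.0, 0.3, 2.5]
flow_vals = [affine_flow_log_prob(x, scale, shift) for x in points]
exact_vals = [normal_log_pdf(x, shift, scale ** 2) for x in points]
```

Practical flows stack many such invertible layers and choose architectures whose Jacobian determinants are cheap to compute, for example triangular Jacobians.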

### 9.6 Generative Adversarial Networks (GANs)

**Implicit Density:** No explicit density computation; generates samples directly.

**Min-Max Objective:**
$$\min_G \max_D V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{data}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D(G(\mathbf{z})))]$$

**Optimal Discriminator:**
$$D^*(\mathbf{x}) = \frac{p_{data}(\mathbf{x})}{p_{data}(\mathbf{x}) + p_g(\mathbf{x})}$$

**Generator Objective (at optimal D):**
$$V(D^*, G) = 2 \cdot D_{JS}(p_{data} \| p_g) - \log 4$$

This is minimized over $G$, with value $-\log 4$, exactly when $p_g = p_{data}$.

**Wasserstein GAN:**
$$\min_G \max_{D \in \mathcal{D}} \mathbb{E}_{\mathbf{x} \sim p_{data}}[D(\mathbf{x})] - \mathbb{E}_{\mathbf{z} \sim p_z}[D(G(\mathbf{z}))]$$

Where $\mathcal{D}$ is the set of 1-Lipschitz functions.

---

## 10. Statistical Learning Theory

### 10.1 Bias-Variance Tradeoff

**Definition:** Decomposition of expected prediction error.

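For an estimator $\hat{f}(\mathbf{x})$ of $y = f(\mathbf{x}) + \varepsilon$, where the noise $\varepsilon$ has zero mean and variance $\sigma^2$ and the expectation is taken over training sets and noise: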
$$\mathbb{E}[(y - \hat{f}(\mathbf{x}))^2] = \underbrace{\text{Bias}^2[\hat{f}(\mathbf{x})]}_{\text{Systematic Error}} + \underbrace{\text{Var}[\hat{f}(\mathbf{x})]}_{\text{Sensitivity to Data}} + \underbrace{\sigma^2}_{\text{Irreducible Error}}$$

Where:
$$\text{Bias}[\hat{f}(\mathbf{x})] = \mathbb{E}[\hat{f}(\mathbf{x})] - f(\mathbf{x})$$
$$\text{Var}[\hat{f}(\mathbf{x})] = \mathbb{E}[(\hat{f}(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})])^2]$$

**In Gen-AI Context:**

| Regime | Characteristic | Gen-AI Example |
|--------|---------------|----------------|
| High Bias | Underfitting | Too small model capacity |
| High Variance | Overfitting | Memorizing training data |
| Optimal | Balance | Proper regularization |

### 10.2 Generalization Theory

**Empirical Risk:**
$$\hat{R}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}(f_\theta(\mathbf{x}_i), y_i)$$

**True Risk:**
$$R(\theta) = \mathbb{E}_{(\mathbf{x}, y) \sim P_{data}}[\mathcal{L}(f_\theta(\mathbf{x}), y)]$$

**Generalization Gap:**
$$\epsilon_{gen} = R(\theta) - \hat{R}(\theta)$$

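**PAC Bound (finite hypothesis class):**
For a finite hypothesis class $\mathcal{H}$ and a loss bounded in $[0, 1]$, with probability at least $1 - \delta$, simultaneously for all $\theta \in \mathcal{H}$:
$$R(\theta) \leq \hat{R}(\theta) + \sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2N}}$$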

### 10.3 Regularization Techniques

**L2 Regularization (Weight Decay):**
$$\mathcal{L}_{reg} = \mathcal{L}_{data} + \lambda \|\theta\|_2^2$$

**L1 Regularization (Sparsity):**
$$\mathcal{L}_{reg} = \mathcal{L}_{data} + \lambda \|\theta\|_1$$

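**Dropout:**
$$\tilde{\mathbf{h}} = \mathbf{m} \odot \mathbf{h}, \quad m_i \sim \text{Bernoulli}(p)$$

Here $p$ is the probability of keeping a unit. Implementations that use inverted dropout additionally scale $\tilde{\mathbf{h}}$ by $1/p$ during training so that the expected activation is unchanged.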

**Variational Dropout (Bayesian Interpretation):**
$$P(\theta | \mathcal{D}) \approx \prod_i \mathcal{N}(\theta_i | \mu_i, \sigma_i^2)$$

### 10.4 Law of Large Numbers

**Weak LLN:**
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{P} \mathbb{E}[X] \text{ as } n \rightarrow \infty$$

**Application:** Justifies Monte Carlo estimation:
$$\mathbb{E}[f(X)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i)$$
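A short sketch of Monte Carlo estimation, assuming the test case $f(x) = x^2$ with $X \sim \mathcal{N}(0, 1)$, for which the exact expectation is 1. The sample sizes are arbitrary.

```python
import random


def monte_carlo_mean(f, sampler, n):
    """Estimate E[f(X)] by averaging f over n independent draws from sampler."""
    return sum(f(sampler()) for _ in range(n)) / n


rng = random.Random(0)
estimates = {
    n: monte_carlo_mean(lambda x: x * x, lambda: rng.gauss(0.0, 1.0), n)
    for n in (100, 10_000, 100_000)
}
```

The error of such an estimate shrinks at rate $O(1/\sqrt{N})$ when the variance is finite, so each extra digit of accuracy costs roughly 100 times more samples.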

### 10.5 Central Limit Theorem

**Statement:** For i.i.d. random variables with mean $\mu$ and variance $\sigma^2$:
$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1)$$

**Implication:** The normalized sum of many independent random effects with finite variance is approximately Gaussian

**Application in Gen-AI:**
- Justifies Gaussian assumptions in many models
- Explains convergence of gradient averaging

### 10.6 Concentration Inequalities

**Markov's Inequality:**
$$P(X \geq a) \leq \frac{\mathbb{E}[X]}{a}, \quad \text{for } X \geq 0, a > 0$$

**Chebyshev's Inequality:**
$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$$

**Hoeffding's Inequality:**
For independent bounded $X_i \in [a_i, b_i]$ with sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$:
$$P\left(\left|\bar{X}_n - \mathbb{E}[\bar{X}_n]\right| \geq \epsilon\right) \leq 2\exp\left(-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right)$$

**Application:** Bounds on generalization error, confidence intervals
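For a concrete use of Hoeffding's inequality, consider the special case where every $X_i$ lies in the same interval of width $b - a$, so the bound becomes $2\exp(-2n\epsilon^2/(b-a)^2)$. Solving for $n$ gives the number of samples needed for a target accuracy $\epsilon$ and failure probability $\delta$. The helper names below are illustrative.

```python
import math


def hoeffding_bound(n, eps, width=1.0):
    """Upper bound on P(|mean - E[mean]| >= eps) when all X_i share one range."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2 / width ** 2))


def samples_needed(eps, delta, width=1.0):
    """Smallest n for which the Hoeffding bound is at most delta."""
    return math.ceil(width ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))


# Example: estimate a mean of [0, 1]-valued variables to within 0.05
# with failure probability at most 0.05.
n_required = samples_needed(eps=0.05, delta=0.05)
bound_at_n = hoeffding_bound(n_required, 0.05)
```

The required sample size grows as $1/\epsilon^2$ but only logarithmically in $1/\delta$, which is why high confidence is cheap compared with high precision.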

---

## Summary Table: Probability Concepts in Gen-AI Architectures

| Gen-AI Model | Key Probabilistic Concepts |
|--------------|---------------------------|
| **GPT/LLaMA** | Chain rule, Conditional probability, Cross-entropy, Autoregressive factorization |
| **VAE** | Latent variables, ELBO, KL divergence, Reparameterization, Variational inference |
| **Diffusion** | Markov chains, Gaussian noise, Score matching, Reverse process |
| **GAN** | Implicit density, JS divergence, Adversarial training, Wasserstein distance |
| **Flow** | Change of variables, Jacobian determinant, Invertible transformations |
| **RLHF** | Reward modeling, KL constraints, Policy optimization |

---

## Key Mathematical Relationships

```
Cross-Entropy = Entropy + KL Divergence
       ↓              ↓           ↓
    H(P,Q)    =     H(P)   +   D_KL(P||Q)

Log-Likelihood = ELBO + KL(Posterior)
       ↓           ↓          ↓
    log P(x)   =  L_ELBO  +  D_KL(Q(z|x)||P(z|x))
```

# Foundations of Probability Theory

---

## 1. Introduction and Motivation

**Definition:** Probability theory is the mathematical framework for quantifying uncertainty, randomness, and stochastic phenomena. In Generative AI, probability theory provides the foundational language for modeling data distributions, learning latent representations, and generating new samples from learned distributions.

**Relevance to Generative AI:**
- Generative models learn probability distributions $p(x)$ over data $x$
- Sampling mechanisms rely on probabilistic foundations
- Loss functions derived from probabilistic principles (likelihood, KL divergence)
- Latent variable models (VAEs, diffusion models) require marginalization and conditioning

---

## 2. Axiomatic Foundations of Probability

### 2.1 Sample Space and Events

**Definition (Sample Space):** The sample space $\Omega$ is the set of all possible outcomes of a random experiment.

**Definition (Event):** An event $A$ is a subset of the sample space, $A \subseteq \Omega$.

**Definition (Sigma-Algebra):** A $\sigma$-algebra $\mathcal{F}$ on $\Omega$ is a collection of subsets satisfying:
1. $\Omega \in \mathcal{F}$
2. If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$ (closure under complementation)
3. If $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$ (closure under countable unions)

### 2.2 Kolmogorov Axioms

**Definition (Probability Measure):** A probability measure $P$ is a function $P: \mathcal{F} \rightarrow [0, 1]$ satisfying:

**Axiom 1 (Non-negativity):**
$$P(A) \geq 0 \quad \forall A \in \mathcal{F}$$

**Axiom 2 (Normalization):**
$$P(\Omega) = 1$$

**Axiom 3 (Countable Additivity):** For mutually disjoint events $A_1, A_2, \ldots$ (i.e., $A_i \cap A_j = \emptyset$ for $i \neq j$):
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$

**Probability Space:** The triplet $(\Omega, \mathcal{F}, P)$ constitutes a probability space.

### 2.3 Fundamental Properties Derived from Axioms

**Complement Rule:**
$$P(A^c) = 1 - P(A)$$

**Addition Rule:**
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

**Monotonicity:**
$$A \subseteq B \Rightarrow P(A) \leq P(B)$$

**Boole's Inequality (Union Bound):**
$$P\left(\bigcup_{i=1}^{n} A_i\right) \leq \sum_{i=1}^{n} P(A_i)$$

---

## 3. Conditional Probability and Independence

### 3.1 Conditional Probability

**Definition:** The conditional probability of event $A$ given event $B$ (where $P(B) > 0$) is:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

**Chain Rule (Product Rule):** For events $A_1, A_2, \ldots, A_n$:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \cdot P(A_2|A_1) \cdot P(A_3|A_1 \cap A_2) \cdots P(A_n|A_1 \cap \cdots \cap A_{n-1})$$

**Compact Form:**
$$P\left(\bigcap_{i=1}^{n} A_i\right) = \prod_{i=1}^{n} P\left(A_i \Big| \bigcap_{j=1}^{i-1} A_j\right)$$

**Gen-AI Application:** Autoregressive models (GPT, language models) factorize joint distributions using the chain rule:
$$p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(x_t | x_1, \ldots, x_{t-1})$$
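
As a rough illustration of this factorization, the sketch below scores a short token sequence under a toy bigram model, a special case in which each conditional depends only on the previous token. The vocabulary, the tables `initial` and `transition`, and the function name `sequence_log_prob` are hypothetical and chosen only for the example.

```python
import math

# Hypothetical toy model: p(x_1) and p(x_t | x_{t-1}) over a 3-token vocabulary.
initial = {"a": 0.5, "b": 0.3, "c": 0.2}
transition = {
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.4, "b": 0.2, "c": 0.4},
    "c": {"a": 0.5, "b": 0.25, "c": 0.25},
}

def sequence_log_prob(tokens):
    """Chain rule in log space: log p(x_1) + sum_t log p(x_t | x_{t-1})."""
    log_p = math.log(initial[tokens[0]])
    for prev, cur in zip(tokens, tokens[1:]):
        log_p += math.log(transition[prev][cur])
    return log_p

log_p = sequence_log_prob(["a", "b", "c"])
print(math.exp(log_p))  # product 0.5 * 0.6 * 0.4, up to floating-point rounding
```

Working with summed log-probabilities rather than multiplied probabilities avoids underflow on long sequences.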

### 3.2 Law of Total Probability

**Definition:** If $\{B_1, B_2, \ldots, B_n\}$ is a partition of $\Omega$ (mutually exclusive and exhaustive) with $P(B_i) > 0$ for all $i$, then:

$$P(A) = \sum_{i=1}^{n} P(A|B_i) P(B_i)$$

**Continuous Version:**
$$p(x) = \int p(x|z) p(z) \, dz$$

**Gen-AI Application:** This is the marginalization principle fundamental to latent variable models (VAEs):
$$p_\theta(x) = \int p_\theta(x|z) p(z) \, dz$$

### 3.3 Bayes' Theorem

**Definition:** Bayes' theorem relates conditional probabilities:

$$P(A|B) = \frac{P(B|A) P(A)}{P(B)}$$

**Extended Form with Partition:**
$$P(B_i|A) = \frac{P(A|B_i) P(B_i)}{\sum_{j=1}^{n} P(A|B_j) P(B_j)}$$

**Continuous Form:**
$$p(\theta|x) = \frac{p(x|\theta) p(\theta)}{p(x)} = \frac{p(x|\theta) p(\theta)}{\int p(x|\theta') p(\theta') \, d\theta'}$$

**Terminology:**
| Term | Symbol | Description |
|------|--------|-------------|
| Posterior | $p(\theta\|x)$ | Updated belief after observing data |
| Likelihood | $p(x\|\theta)$ | Probability of data given parameters |
| Prior | $p(\theta)$ | Initial belief before data |
| Evidence/Marginal | $p(x)$ | Normalizing constant |

**Gen-AI Application:**
- VAE inference: $p(z|x) = \frac{p(x|z)p(z)}{p(x)}$
- Diffusion models: posterior $q(x_{t-1}|x_t, x_0)$
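
The extended form of Bayes' theorem over a partition can be sketched in a few lines. The hypothesis names and all numbers below are made up for illustration; `posterior` is a hypothetical helper, not part of any library.

```python
def posterior(prior, likelihood):
    """Return P(B_i | A) for each hypothesis B_i, given P(B_i) and P(A | B_i)."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(joint.values())  # P(A), by the law of total probability
    return {h: joint[h] / evidence for h in joint}

# Hypothetical numbers: two sources B1, B2 that could have produced observation A.
prior = {"B1": 0.7, "B2": 0.3}
likelihood = {"B1": 0.1, "B2": 0.8}  # P(A | B_i)
post = posterior(prior, likelihood)
print(post)
```

Although `B1` is more likely a priori, the observation favors `B2` strongly enough that its posterior probability is the larger of the two.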

### 3.4 Independence

**Definition (Independence of Two Events):** Events $A$ and $B$ are independent if:
$$P(A \cap B) = P(A) P(B)$$

Equivalently, when $P(A) > 0$ and $P(B) > 0$:
$$P(A|B) = P(A) \quad \text{and} \quad P(B|A) = P(B)$$

**Definition (Mutual Independence):** Events $A_1, \ldots, A_n$ are mutually independent if for every subset $S \subseteq \{1, \ldots, n\}$ with $|S| \geq 2$:
$$P\left(\bigcap_{i \in S} A_i\right) = \prod_{i \in S} P(A_i)$$

**Definition (Conditional Independence):** $A$ and $B$ are conditionally independent given $C$:
$$P(A \cap B | C) = P(A|C) P(B|C)$$

**Notation:** $A \perp\!\!\!\perp B \mid C$

**Gen-AI Application:** Conditional independence assumptions are crucial in:
- Naive Bayes models
- Graphical models
- VAE assumption: $p(x_1, \ldots, x_n | z) = \prod_i p(x_i | z)$

---

## 4. Random Variables and Distributions

### 4.1 Random Variables

**Definition:** A random variable $X$ is a measurable function $X: \Omega \rightarrow \mathbb{R}$ mapping outcomes to real numbers.

**Types:**
- **Discrete Random Variable:** Takes countable values $\{x_1, x_2, \ldots\}$
- **Continuous Random Variable:** Takes uncountably infinite values in $\mathbb{R}$

### 4.2 Probability Mass Function (PMF)

**Definition:** For discrete random variable $X$, the PMF is:
$$p_X(x) = P(X = x)$$

**Properties:**
1. $p_X(x) \geq 0$ for all $x$
2. $\sum_{x} p_X(x) = 1$

### 4.3 Probability Density Function (PDF)

**Definition:** For continuous random variable $X$, the PDF $f_X(x)$ satisfies:
$$P(a \leq X \leq b) = \int_a^b f_X(x) \, dx$$

**Properties:**
1. $f_X(x) \geq 0$ for all $x$
2. $\int_{-\infty}^{\infty} f_X(x) \, dx = 1$

**Note:** $f_X(x)$ is NOT a probability; it can exceed 1.

### 4.4 Cumulative Distribution Function (CDF)

**Definition:** The CDF of random variable $X$ is:
$$F_X(x) = P(X \leq x)$$

**Properties:**
1. $\lim_{x \to -\infty} F_X(x) = 0$
2. $\lim_{x \to +\infty} F_X(x) = 1$
3. $F_X$ is non-decreasing
4. $F_X$ is right-continuous

**Relationship to PDF:**
$$f_X(x) = \frac{dF_X(x)}{dx}$$

$$F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt$$

---

## 5. Joint, Marginal, and Conditional Distributions

### 5.1 Joint Distribution

**Definition (Joint PMF):** For discrete random variables $X$ and $Y$:
$$p_{X,Y}(x, y) = P(X = x, Y = y)$$

**Definition (Joint PDF):** For continuous random variables:
$$P((X, Y) \in A) = \iint_A f_{X,Y}(x, y) \, dx \, dy$$

### 5.2 Marginal Distribution

**Definition:** Obtained by summing/integrating out other variables.

**Discrete:**
$$p_X(x) = \sum_{y} p_{X,Y}(x, y)$$

**Continuous:**
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy$$

**Gen-AI Application:** Computing marginal likelihood in VAEs:
$$p_\theta(x) = \int p_\theta(x, z) \, dz = \int p_\theta(x|z) p(z) \, dz$$

### 5.3 Conditional Distribution

**Definition:** For continuous variables with $f_Y(y) > 0$:
$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$

**Relationship:**
$$f_{X,Y}(x, y) = f_{X|Y}(x|y) \cdot f_Y(y) = f_{Y|X}(y|x) \cdot f_X(x)$$

### 5.4 Independence of Random Variables

**Definition:** $X$ and $Y$ are independent ($X \perp\!\!\!\perp Y$) if:
$$f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y) \quad \forall x, y$$

Equivalently:
$$f_{X|Y}(x|y) = f_X(x) \quad \forall x, y$$

---

## 6. Expectation, Variance, and Moments

### 6.1 Expectation (Mean)

**Definition (Discrete):**
$$\mathbb{E}[X] = \sum_{x} x \cdot p_X(x)$$

**Definition (Continuous):**
$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) \, dx$$

**Expectation of Function:**
$$\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) \cdot f_X(x) \, dx$$

**Properties:**
1. **Linearity:** $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$
2. **Monotonicity:** $X \leq Y \Rightarrow \mathbb{E}[X] \leq \mathbb{E}[Y]$
3. **Independence:** $X \perp\!\!\!\perp Y \Rightarrow \mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$

### 6.2 Variance

**Definition:**
$$\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

**Properties:**
1. $\text{Var}(X) \geq 0$
2. $\text{Var}(aX + b) = a^2 \text{Var}(X)$
3. If $X \perp\!\!\!\perp Y$: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$

**Standard Deviation:**
$$\sigma_X = \sqrt{\text{Var}(X)}$$

### 6.3 Covariance and Correlation

**Covariance:**
$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

**Correlation Coefficient:**
$$\rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \in [-1, 1]$$

**Properties:**
- $\text{Cov}(X, X) = \text{Var}(X)$
- $X \perp\!\!\!\perp Y \Rightarrow \text{Cov}(X, Y) = 0$ (converse not always true)

### 6.4 Moments and Moment Generating Functions

**Definition (k-th Moment):**
$$\mu_k = \mathbb{E}[X^k]$$

**Definition (k-th Central Moment):**
$$\mu'_k = \mathbb{E}[(X - \mathbb{E}[X])^k]$$

**Moment Generating Function (MGF):**
$$M_X(t) = \mathbb{E}[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f_X(x) \, dx$$

**Property:** When $M_X(t)$ is finite in a neighborhood of $t = 0$, the k-th moment is:
$$\mathbb{E}[X^k] = \left. \frac{d^k M_X(t)}{dt^k} \right|_{t=0}$$

### 6.5 Conditional Expectation

**Definition:**
$$\mathbb{E}[X|Y=y] = \int_{-\infty}^{\infty} x \cdot f_{X|Y}(x|y) \, dx$$

**Law of Total Expectation (Tower Property):**
$$\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X|Y]]$$

**Law of Total Variance:**
$$\text{Var}(X) = \mathbb{E}[\text{Var}(X|Y)] + \text{Var}(\mathbb{E}[X|Y])$$

**Gen-AI Application:** In VAEs, the reconstruction term involves:
$$\mathbb{E}_{z \sim q(z|x)}[\log p(x|z)]$$
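
The laws of total expectation and total variance stated above can be checked numerically on a small discrete joint distribution. This is only a sanity-check sketch; the joint PMF and the helper names are hypothetical.

```python
# Hypothetical joint PMF p(x, y) on a small grid.
joint = {
    (0, 0): 0.1, (1, 0): 0.3,
    (0, 1): 0.2, (2, 1): 0.4,
}

# Marginal p(y).
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p

def cond_mean(y):
    """E[X | Y = y]."""
    return sum(p * x for (x, yy), p in joint.items() if yy == y) / p_y[y]

def cond_var(y):
    """Var(X | Y = y)."""
    m = cond_mean(y)
    return sum(p * (x - m) ** 2 for (x, yy), p in joint.items() if yy == y) / p_y[y]

mean_x = sum(p * x for (x, _), p in joint.items())
var_x = sum(p * (x - mean_x) ** 2 for (x, _), p in joint.items())

tower = sum(p_y[y] * cond_mean(y) for y in p_y)                     # E[E[X|Y]]
within = sum(p_y[y] * cond_var(y) for y in p_y)                     # E[Var(X|Y)]
between = sum(p_y[y] * (cond_mean(y) - mean_x) ** 2 for y in p_y)   # Var(E[X|Y])

print(mean_x, tower)
print(var_x, within + between)
```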

---

## 7. Important Probability Distributions for Generative AI

### 7.1 Discrete Distributions

#### 7.1.1 Bernoulli Distribution

**Definition:** Models binary outcomes.
$$X \sim \text{Bernoulli}(p)$$

$$P(X = x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}$$

**Parameters:** $p \in [0, 1]$

**Moments:**
- $\mathbb{E}[X] = p$
- $\text{Var}(X) = p(1-p)$

#### 7.1.2 Categorical Distribution

**Definition:** Generalization of Bernoulli to $K$ categories.
$$X \sim \text{Categorical}(\pi_1, \ldots, \pi_K)$$

$$P(X = k) = \pi_k, \quad \sum_{k=1}^{K} \pi_k = 1$$

**Gen-AI Application:** Output distribution for classification, token prediction in language models.

#### 7.1.3 Multinomial Distribution

**Definition:** Multiple independent categorical trials.
$$(X_1, \ldots, X_K) \sim \text{Multinomial}(n, \pi_1, \ldots, \pi_K)$$

$$P(X_1 = x_1, \ldots, X_K = x_K) = \frac{n!}{x_1! \cdots x_K!} \prod_{k=1}^{K} \pi_k^{x_k}$$

where $\sum_{k=1}^{K} x_k = n$

### 7.2 Continuous Distributions

#### 7.2.1 Uniform Distribution

**Definition:**
$$X \sim \text{Uniform}(a, b)$$

$$f_X(x) = \frac{1}{b-a}, \quad x \in [a, b]$$

**Moments:**
- $\mathbb{E}[X] = \frac{a+b}{2}$
- $\text{Var}(X) = \frac{(b-a)^2}{12}$

#### 7.2.2 Gaussian (Normal) Distribution

**Definition (Univariate):**
$$X \sim \mathcal{N}(\mu, \sigma^2)$$

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

**Standard Normal:** $Z \sim \mathcal{N}(0, 1)$

$$\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$$

**Properties:**
- $\mathbb{E}[X] = \mu$
- $\text{Var}(X) = \sigma^2$
- Any linear combination of jointly Gaussian (for example, independent Gaussian) random variables is Gaussian

**Definition (Multivariate):**
$$\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

where:
- $\boldsymbol{\mu} \in \mathbb{R}^d$ is the mean vector
- $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ is the symmetric positive definite covariance matrix (positive definiteness is needed for $\boldsymbol{\Sigma}^{-1}$ and the density to exist)
- $|\boldsymbol{\Sigma}|$ is the determinant

**Gen-AI Application:**
- VAE latent prior: $p(z) = \mathcal{N}(0, I)$
- Diffusion noise: $q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)$
- Reparameterization trick

#### 7.2.3 Exponential Distribution

**Definition:**
$$X \sim \text{Exponential}(\lambda)$$

$$f_X(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$

**Moments:**
- $\mathbb{E}[X] = 1/\lambda$
- $\text{Var}(X) = 1/\lambda^2$

#### 7.2.4 Beta Distribution

**Definition:**
$$X \sim \text{Beta}(\alpha, \beta)$$

$$f_X(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}, \quad x \in [0, 1]$$

**Gen-AI Application:** Prior for probabilities, noise scheduling in diffusion.

#### 7.2.5 Dirichlet Distribution

**Definition:** Multivariate generalization of Beta.
$$\boldsymbol{\pi} \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_K)$$

$$f(\boldsymbol{\pi}) = \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}$$

where $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k \geq 0$

**Gen-AI Application:** Prior for topic distributions in LDA.

### 7.3 Distributions Crucial for Generative Models

#### 7.3.1 Mixture Models

**Gaussian Mixture Model (GMM):**
$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

where $\sum_{k=1}^{K} \pi_k = 1$

**Latent Variable Formulation:**
$$z \sim \text{Categorical}(\pi_1, \ldots, \pi_K)$$
$$\mathbf{x}|z=k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
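
The latent variable formulation translates directly into ancestral sampling: draw the component index first, then draw the observation from that component. The sketch below does this for a one-dimensional two-component mixture; the parameter values and the name `sample_gmm` are assumptions made for the example.

```python
import random

# Hypothetical 1-D mixture parameters.
weights = [0.3, 0.7]   # mixing proportions pi_k
means = [-2.0, 3.0]    # component means mu_k
stds = [0.5, 1.0]      # component standard deviations sigma_k

def sample_gmm(rng):
    """Ancestral sampling: z ~ Categorical(pi), then x | z=k ~ N(mu_k, sigma_k^2)."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return k, rng.gauss(means[k], stds[k])

rng = random.Random(0)
samples = [sample_gmm(rng) for _ in range(20000)]
frac_first = sum(1 for k, _ in samples if k == 0) / len(samples)
sample_mean = sum(x for _, x in samples) / len(samples)
print(frac_first, sample_mean)
```

With enough samples, the fraction assigned to the first component should approach $\pi_1 = 0.3$ and the sample mean should approach $\sum_k \pi_k \mu_k = 1.5$.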

#### 7.3.2 Exponential Family

**Definition:** A distribution belongs to the exponential family if:
$$p(x|\boldsymbol{\eta}) = h(x) \exp\left(\boldsymbol{\eta}^\top \mathbf{T}(x) - A(\boldsymbol{\eta})\right)$$

where:
- $\boldsymbol{\eta}$: natural parameters
- $\mathbf{T}(x)$: sufficient statistics
- $A(\boldsymbol{\eta})$: log-partition function (normalizer)
- $h(x)$: base measure

**Key Property:**
$$\nabla_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{T}(x)]$$
$$\nabla^2_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \text{Cov}[\mathbf{T}(x)]$$
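
As a concrete check of these identities, consider the Bernoulli distribution written in exponential family form with $T(x) = x$, $h(x) = 1$, natural parameter $\eta = \log\frac{p}{1-p}$, and $A(\eta) = \log(1 + e^{\eta})$. The sketch below differentiates $A$ numerically with central differences and compares against the Bernoulli mean and variance; the step size and function names are illustrative choices.

```python
import math

def log_partition(eta):
    """A(eta) = log(1 + exp(eta)) for the Bernoulli distribution."""
    return math.log1p(math.exp(eta))

def numeric_derivative(f, x, h=1e-4):
    """Central finite difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

eta = 0.8
p = 1 / (1 + math.exp(-eta))  # mean parameter recovered from eta

grad_A = numeric_derivative(log_partition, eta)
hess_A = numeric_derivative(lambda e: numeric_derivative(log_partition, e), eta)

print(grad_A, p)            # first derivative vs. E[T(x)] = p
print(hess_A, p * (1 - p))  # second derivative vs. Var[T(x)] = p(1 - p)
```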

---

## 8. Information Theory Foundations

### 8.1 Entropy

**Definition (Discrete):** Shannon entropy measures uncertainty:
$$H(X) = -\sum_{x} p(x) \log p(x) = -\mathbb{E}_{p}[\log p(X)]$$

**Definition (Continuous):** Differential entropy:
$$h(X) = -\int f(x) \log f(x) \, dx$$

**Properties:**
- $H(X) \geq 0$ for discrete distributions
- $H(X)$ is maximized by uniform distribution (discrete)
- For continuous: $h(X)$ can be negative

**Gaussian Entropy:**
$$h(X) = \frac{1}{2}\log(2\pi e \sigma^2) \quad \text{for } X \sim \mathcal{N}(\mu, \sigma^2)$$

**Multivariate Gaussian:**
$$h(\mathbf{X}) = \frac{d}{2}\log(2\pi e) + \frac{1}{2}\log|\boldsymbol{\Sigma}|$$

### 8.2 Cross-Entropy

**Definition:**
$$H(p, q) = -\sum_{x} p(x) \log q(x) = -\mathbb{E}_{p}[\log q(X)]$$

**Relationship:**
$$H(p, q) = H(p) + D_{KL}(p \| q)$$

**Gen-AI Application:** Cross-entropy loss for classification:
$$\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$

### 8.3 Kullback-Leibler (KL) Divergence

**Definition:**
$$D_{KL}(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_{p}\left[\log \frac{p(X)}{q(X)}\right]$$

**Continuous Version:**
$$D_{KL}(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$

**Properties:**
1. $D_{KL}(p \| q) \geq 0$ (Gibbs' inequality)
2. $D_{KL}(p \| q) = 0 \Leftrightarrow p = q$ (almost everywhere)
3. **Asymmetric:** in general, $D_{KL}(p \| q) \neq D_{KL}(q \| p)$
4. Not a metric (doesn't satisfy triangle inequality)
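
The relationship $H(p, q) = H(p) + D_{KL}(p \| q)$ and the asymmetry of KL divergence are easy to verify for discrete distributions. The following is a minimal sketch using natural logarithms; it assumes $q(x) > 0$ wherever $p(x) > 0$, and the two example distributions are arbitrary.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # Assumes q_i > 0 wherever p_i > 0; otherwise the value is infinite.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))  # equal up to rounding
print(kl_divergence(p, q), kl_divergence(q, p))               # differ: KL is asymmetric
```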

**KL Divergence Between Gaussians:**
$$D_{KL}(\mathcal{N}(\mu_1, \sigma_1^2) \| \mathcal{N}(\mu_2, \sigma_2^2)) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

**Multivariate Gaussians:**
$$D_{KL}(\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) \| \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)) = \frac{1}{2}\left[\log\frac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|} - d + \text{tr}(\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\Sigma}_1) + (\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^\top\boldsymbol{\Sigma}_2^{-1}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)\right]$$

**Special Case (VAE KL term):** When $q = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and $p = \mathcal{N}(0, I)$ in $d$ dimensions:
$$D_{KL}(q \| p) = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
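
A small sketch of this closed form, parameterized by the log-variance as is common in VAE code, is shown below. It is cross-checked against the univariate Gaussian KL formula applied per dimension with $\mu_2 = 0$, $\sigma_2 = 1$. The function names and test values are hypothetical.

```python
import math

def kl_diag_gaussian_to_standard(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def kl_univariate(mu1, s1, mu2, s2):
    """KL( N(mu1, s1^2) || N(mu2, s2^2) ) for scalar Gaussians."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

mu = [0.5, -1.0]
log_var = [0.0, math.log(0.25)]

kl_closed = kl_diag_gaussian_to_standard(mu, log_var)
kl_sum = sum(kl_univariate(m, math.exp(0.5 * lv), 0.0, 1.0) for m, lv in zip(mu, log_var))
print(kl_closed, kl_sum)
```

Because both distributions factorize across dimensions, the total KL is the sum of the per-dimension terms, which is what the comparison confirms.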

### 8.4 Mutual Information

**Definition:**
$$I(X; Y) = D_{KL}(p(x, y) \| p(x)p(y)) = \mathbb{E}_{p(x,y)}\left[\log \frac{p(x, y)}{p(x)p(y)}\right]$$

**Equivalent Forms:**
$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y)$$

**Properties:**
- $I(X; Y) \geq 0$
- $I(X; Y) = 0 \Leftrightarrow X \perp\!\!\!\perp Y$
- $I(X; Y) = I(Y; X)$ (symmetric)

**Gen-AI Application:** InfoGAN, contrastive learning objectives.

### 8.5 Jensen-Shannon Divergence

**Definition:**
$$D_{JS}(p \| q) = \frac{1}{2}D_{KL}(p \| m) + \frac{1}{2}D_{KL}(q \| m)$$

where $m = \frac{1}{2}(p + q)$

**Properties:**
- Symmetric: $D_{JS}(p \| q) = D_{JS}(q \| p)$
- Bounded: $0 \leq D_{JS}(p \| q) \leq \log 2$
- $\sqrt{D_{JS}}$ is a metric

**Gen-AI Application:** For the optimal discriminator, the original GAN generator objective equals the JS divergence up to scaling and an additive constant:
$$\mathcal{L}_{GAN} = 2 \cdot D_{JS}(p_{data} \| p_g) - \log 4$$
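
The symmetry and the $\log 2$ upper bound of the JS divergence can be checked on discrete distributions, including ones with partially disjoint support where KL divergence alone would be infinite. This is a sketch using natural logarithms; the example distributions are arbitrary.

```python
import math

def kl(p, q):
    # Terms with p_i = 0 contribute 0 by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence with mixture m = (p + q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.9, 0.1, 0.0]
q = [0.0, 0.2, 0.8]

print(js(p, q), js(q, p))                   # symmetric
disjoint = js([1.0, 0.0], [0.0, 1.0])
print(disjoint, math.log(2))                # disjoint supports attain the bound
```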

---

## 9. Maximum Likelihood Estimation (MLE)

### 9.1 Definition and Formulation

**Definition:** MLE finds parameters $\theta$ that maximize the likelihood of observed data:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} p(x_1, \ldots, x_n | \theta)$$

**For i.i.d. data:**
$$\hat{\theta}_{MLE} = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i | \theta)$$

### 9.2 Log-Likelihood

**Definition:** Taking logarithm for numerical stability and mathematical convenience:

$$\mathcal{L}(\theta) = \log p(\mathcal{D}|\theta) = \sum_{i=1}^{n} \log p(x_i | \theta)$$

**MLE Objective:**
$$\hat{\theta}_{MLE} = \arg\max_{\theta} \mathcal{L}(\theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i | \theta)$$
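
For a Bernoulli model the maximizer of the log-likelihood has a closed form, the sample mean. The sketch below compares it with a brute-force grid search over $p$; the data and the grid resolution are arbitrary choices for illustration.

```python
import math

def log_likelihood(p, data):
    """Sum of log p(x_i | p) for Bernoulli observations x_i in {0, 1}."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical coin flips
p_hat = sum(data) / len(data)    # closed-form MLE: the sample mean

# Brute-force check on a grid of candidate parameters in (0, 1).
grid = [i / 100 for i in range(1, 100)]
best_on_grid = max(grid, key=lambda p: log_likelihood(p, data))

print(p_hat, best_on_grid)
```

The grid maximizer lands next to the closed-form estimate, and no grid point achieves a higher log-likelihood than `p_hat`.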

### 9.3 MLE as KL Minimization

**Theorem:** MLE is equivalent to minimizing KL divergence between empirical distribution and model:

$$\hat{\theta}_{MLE} = \arg\min_{\theta} D_{KL}(\hat{p}_{data} \| p_\theta)$$

where $\hat{p}_{data}(x) = \frac{1}{n}\sum_{i=1}^{n} \delta(x - x_i)$

**Proof:**
$$D_{KL}(\hat{p}_{data} \| p_\theta) = -H(\hat{p}_{data}) - \mathbb{E}_{\hat{p}_{data}}[\log p_\theta(x)]$$

Since $H(\hat{p}_{data})$ is constant:
$$\arg\min_\theta D_{KL}(\hat{p}_{data} \| p_\theta) = \arg\max_\theta \mathbb{E}_{\hat{p}_{data}}[\log p_\theta(x)] = \arg\max_\theta \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i)$$

### 9.4 Properties of MLE

**Consistency:** As $n \rightarrow \infty$, $\hat{\theta}_{MLE} \xrightarrow{p} \theta_{true}$

**Asymptotic Normality:**
$$\sqrt{n}(\hat{\theta}_{MLE} - \theta_{true}) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1})$$

where $I(\theta)$ is the Fisher information matrix of a single observation, evaluated at $\theta_{true}$.

**Asymptotic Efficiency:** Achieves Cramér-Rao lower bound asymptotically.

### 9.5 Fisher Information

**Definition (Scalar):**
$$I(\theta) = \mathbb{E}\left[\left(\frac{\partial \log p(X|\theta)}{\partial \theta}\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2 \log p(X|\theta)}{\partial \theta^2}\right]$$

**Definition (Matrix Form):**
$$[I(\theta)]_{ij} = \mathbb{E}\left[\frac{\partial \log p(X|\theta)}{\partial \theta_i} \frac{\partial \log p(X|\theta)}{\partial \theta_j}\right]$$

**Cramér-Rao Bound:** For any unbiased estimator $\hat{\theta}$ based on $n$ i.i.d. observations:
$$\text{Var}(\hat{\theta}) \geq \frac{1}{n\, I(\theta)}$$
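
For a single Bernoulli observation, both expressions in the scalar definition of Fisher information can be evaluated exactly by summing over $x \in \{0, 1\}$. The sketch below does so and compares against the known value $1 / (p(1-p))$; the function name is hypothetical.

```python
def fisher_information_bernoulli(p):
    """Fisher information of one Bernoulli(p) observation, computed two ways."""
    prob = {1: p, 0: 1 - p}
    # Score: d/dp log p(x|p) = x/p - (1 - x)/(1 - p)
    score = {1: 1 / p, 0: -1 / (1 - p)}
    # Second derivative: -x/p^2 - (1 - x)/(1 - p)^2
    second = {1: -1 / p ** 2, 0: -1 / (1 - p) ** 2}
    via_score = sum(prob[x] * score[x] ** 2 for x in (0, 1))
    via_curvature = -sum(prob[x] * second[x] for x in (0, 1))
    return via_score, via_curvature

via_score, via_curvature = fisher_information_bernoulli(0.3)
print(via_score, via_curvature, 1 / (0.3 * 0.7))
```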

---

## 10. Bayesian Inference

### 10.1 Bayesian Framework

**Prior Distribution:** $p(\theta)$ - encodes prior belief about parameters

**Likelihood:** $p(\mathcal{D}|\theta)$ - probability of data given parameters

**Posterior Distribution:** Using Bayes' theorem:
$$p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta) p(\theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D}|\theta) p(\theta)}{\int p(\mathcal{D}|\theta') p(\theta') d\theta'}$$

### 10.2 Posterior Predictive Distribution

**Definition:** Distribution over new data point $x^*$:
$$p(x^*|\mathcal{D}) = \int p(x^*|\theta) p(\theta|\mathcal{D}) \, d\theta$$

### 10.3 Maximum A Posteriori (MAP) Estimation

**Definition:**
$$\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta|\mathcal{D}) = \arg\max_{\theta} [p(\mathcal{D}|\theta) p(\theta)]$$

**Log Form:**
$$\hat{\theta}_{MAP} = \arg\max_{\theta} [\log p(\mathcal{D}|\theta) + \log p(\theta)]$$

**Relationship to Regularization:**
- Gaussian prior $p(\theta) \propto e^{-\frac{\lambda}{2}\|\theta\|^2}$ → L2 regularization
- Laplace prior $p(\theta) \propto e^{-\lambda\|\theta\|_1}$ → L1 regularization

### 10.4 Conjugate Priors

**Definition:** A prior $p(\theta)$ is conjugate to likelihood $p(x|\theta)$ if the posterior $p(\theta|x)$ belongs to the same family as the prior.

| Likelihood | Conjugate Prior | Posterior |
|------------|-----------------|-----------|
| Bernoulli | Beta | Beta |
| Multinomial | Dirichlet | Dirichlet |
| Gaussian (known $\sigma$) | Gaussian | Gaussian |
| Gaussian (known $\mu$) | Inverse-Gamma | Inverse-Gamma |
| Poisson | Gamma | Gamma |
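
The first row of the table can be made concrete: with a Beta prior and Bernoulli observations, the posterior is obtained by adding the observed counts to the prior parameters. The sketch below also compares the resulting MAP estimate (the posterior mode) with the MLE. The prior parameters and data are hypothetical.

```python
def beta_bernoulli_update(alpha, beta, data):
    """Posterior Beta(alpha + #ones, beta + #zeros) after Bernoulli observations."""
    ones = sum(data)
    zeros = len(data) - ones
    return alpha + ones, beta + zeros

def beta_mode(alpha, beta):
    """Mode of Beta(alpha, beta); this formula is valid for alpha > 1 and beta > 1."""
    return (alpha - 1) / (alpha + beta - 2)

data = [1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical observations
a_post, b_post = beta_bernoulli_update(2.0, 2.0, data)
theta_map = beta_mode(a_post, b_post)
theta_mle = sum(data) / len(data)

print(a_post, b_post)        # posterior parameters
print(theta_map, theta_mle)  # the prior pulls the MAP estimate toward 0.5
```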

---

## 11. Latent Variable Models

### 11.1 Definition and Motivation

**Definition:** Latent variable models introduce unobserved variables $z$ to capture hidden structure:
$$p_\theta(x) = \int p_\theta(x, z) \, dz = \int p_\theta(x|z) p(z) \, dz$$

**Components:**
- $z$: latent variables (hidden factors)
- $p(z)$: prior over latent space
- $p_\theta(x|z)$: decoder/generator
- $p_\theta(x)$: marginal likelihood (evidence)

### 11.2 The Inference Problem

**Goal:** Compute posterior $p(z|x)$ for inference:
$$p(z|x) = \frac{p(x|z)p(z)}{p(x)} = \frac{p(x|z)p(z)}{\int p(x|z')p(z') \, dz'}$$

**Problem:** The integral $p(x) = \int p(x|z)p(z) \, dz$ is often intractable.

### 11.3 Evidence Lower Bound (ELBO)

**Derivation:** For any distribution $q(z|x)$:

$$\log p(x) = \log \int p(x, z) \, dz = \log \int \frac{p(x, z)}{q(z|x)} q(z|x) \, dz$$

Using Jensen's inequality:
$$\log p(x) \geq \int q(z|x) \log \frac{p(x, z)}{q(z|x)} \, dz = \mathcal{L}(x; \theta, \phi)$$

**ELBO Formulation:**
$$\mathcal{L}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z))$$

where:
- First term: Reconstruction term
- Second term: Regularization term (KL to prior)

**Alternative Form:**
$$\mathcal{L}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x, z) - \log q_\phi(z|x)]$$

**Gap to Evidence:**
$$\log p(x) = \mathcal{L}(x; \theta, \phi) + D_{KL}(q_\phi(z|x) \| p_\theta(z|x))$$

Since $D_{KL} \geq 0$, ELBO is a lower bound, with equality when $q_\phi(z|x) = p_\theta(z|x)$.
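
With a discrete latent variable every quantity in this decomposition can be computed exactly, which makes the bound easy to verify. The sketch below uses a binary $z$ and a single fixed observation $x$; all probabilities are made-up numbers, and `elbo` and `kl_to_posterior` are hypothetical helper names.

```python
import math

prior = [0.6, 0.4]  # p(z) for a binary latent z (hypothetical values)
lik = [0.2, 0.9]    # p(x | z) for one fixed observation x (hypothetical values)

evidence = sum(pz * px for pz, px in zip(prior, lik))           # p(x)
posterior = [pz * px / evidence for pz, px in zip(prior, lik)]  # p(z | x)

def elbo(q):
    """E_q[log p(x, z) - log q(z)] for a variational distribution q over z."""
    return sum(qz * (math.log(pz * px) - math.log(qz))
               for qz, pz, px in zip(q, prior, lik) if qz > 0)

def kl_to_posterior(q):
    """KL(q || p(z | x))."""
    return sum(qz * math.log(qz / pz_x) for qz, pz_x in zip(q, posterior) if qz > 0)

q = [0.3, 0.7]  # an arbitrary, suboptimal choice of q(z | x)
print(math.log(evidence), elbo(q) + kl_to_posterior(q))  # equal up to rounding
print(elbo(posterior))                                   # bound is tight at the true posterior
```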

### 11.4 Variational Inference

**Objective:** Approximate intractable posterior with tractable family:
$$q^*(z|x) = \arg\min_{q \in \mathcal{Q}} D_{KL}(q(z|x) \| p(z|x))$$

**Equivalently:** Maximize ELBO:
$$q^*(z|x) = \arg\max_{q \in \mathcal{Q}} \mathcal{L}(x; \theta, q)$$

---

## 12. Sampling and Monte Carlo Methods

### 12.1 Law of Large Numbers

**Theorem (Strong LLN):** For i.i.d. random variables $X_1, X_2, \ldots$ with $\mathbb{E}[X_i] = \mu$:
$$\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{a.s.} \mu \quad \text{as } n \rightarrow \infty$$

### 12.2 Central Limit Theorem

**Theorem:** For i.i.d. random variables with mean $\mu$ and variance $\sigma^2$:
$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1)$$

Equivalently:
$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$$

### 12.3 Monte Carlo Estimation

**Goal:** Estimate expectation $\mathbb{E}_{p(x)}[f(x)]$

**Monte Carlo Estimator:** Draw samples $x_1, \ldots, x_n \sim p(x)$:
$$\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^{n} f(x_i)$$

**Properties:**
- Unbiased: $\mathbb{E}[\hat{\mu}_n] = \mathbb{E}_{p}[f(x)]$
- Variance: $\text{Var}(\hat{\mu}_n) = \frac{\text{Var}(f(X))}{n}$
- Convergence rate: $O(1/\sqrt{n})$ regardless of dimension
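
A minimal sketch of the Monte Carlo estimator follows, estimating $\mathbb{E}[X^2]$ for $X \sim \mathcal{N}(0, 1)$, whose exact value is 1. The helper `mc_estimate`, the seed, and the sample size are illustrative assumptions.

```python
import random

def mc_estimate(f, sampler, n, rng):
    """Average f over n independent draws produced by sampler(rng)."""
    return sum(f(sampler(rng)) for _ in range(n)) / n

rng = random.Random(0)
estimate = mc_estimate(
    lambda x: x * x,             # f(x) = x^2
    lambda r: r.gauss(0.0, 1.0), # x ~ N(0, 1)
    100_000,
    rng,
)
print(estimate)  # close to 1, with standard error sqrt(Var(X^2) / n) = sqrt(2 / n)
```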

### 12.4 Importance Sampling

**Motivation:** Sample from proposal $q(x)$ instead of target $p(x)$.

**Identity:**
$$\mathbb{E}_{p}[f(x)] = \int f(x) p(x) \, dx = \int f(x) \frac{p(x)}{q(x)} q(x) \, dx = \mathbb{E}_{q}\left[f(x) \frac{p(x)}{q(x)}\right]$$

**Importance Weight:**
$$w(x) = \frac{p(x)}{q(x)}$$

**Importance Sampling Estimator:**
$$\hat{\mu}_{IS} = \frac{1}{n}\sum_{i=1}^{n} f(x_i) w(x_i), \quad x_i \sim q$$

**Self-Normalized Importance Sampling:** When $p(x)$ is known only up to a normalizing constant, the weights are computed from the unnormalized density and renormalized:
$$\hat{\mu}_{SNIS} = \frac{\sum_{i=1}^{n} f(x_i) w(x_i)}{\sum_{i=1}^{n} w(x_i)}$$
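
Both estimators can be sketched for a simple case: target $p = \mathcal{N}(0, 1)$, proposal $q = \mathcal{N}(0, 2^2)$, and $f(x) = x^2$, so the true value is 1. The wider proposal keeps the weights bounded. The self-normalized variant uses only the unnormalized target $e^{-x^2/2}$. All names, the seed, and the sample size are assumptions for the example.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def f(x):
    return x * x

rng = random.Random(0)
xs = [rng.gauss(0.0, 2.0) for _ in range(50_000)]  # x_i ~ q = N(0, 2^2)

# Plain importance sampling: requires the normalized target density p = N(0, 1).
weights = [normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0) for x in xs]
mu_is = sum(f(x) * w for x, w in zip(xs, weights)) / len(xs)

# Self-normalized version: only the unnormalized target exp(-x^2 / 2) is needed.
unnorm = [math.exp(-0.5 * x * x) / normal_pdf(x, 0.0, 2.0) for x in xs]
mu_snis = sum(f(x) * w for x, w in zip(xs, unnorm)) / sum(unnorm)

print(mu_is, mu_snis)  # both should be near E_p[x^2] = 1
```

A proposal with lighter tails than the target would make the weights unbounded and the estimator variance large, which is why the proposal here is chosen wider than the target.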

### 12.5 Reparameterization Trick

**Problem:** Need gradients through sampling operation for VAE training.

**Solution:** Express $z \sim q_\phi(z|x)$ as deterministic transformation of noise.

**For Gaussian $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))$:**
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

**Gradient Computation:**
$$\nabla_\phi \mathbb{E}_{q_\phi(z|x)}[f(z)] = \nabla_\phi \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}[f(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon)]$$
$$= \mathbb{E}_{\epsilon}[\nabla_\phi f(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon)]$$
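
To see the pathwise gradient at work without an autodiff library, take a scalar example with $f(z) = z^2$ and treat $\mu$ and $\sigma$ directly as the parameters. Then $\mathbb{E}[f(z)] = \mu^2 + \sigma^2$, so the exact gradients are $2\mu$ and $2\sigma$. The sketch below estimates them by differentiating through $z = \mu + \sigma\epsilon$ by hand; the function name and the values are hypothetical.

```python
import random

def reparam_gradients(mu, sigma, n, rng):
    """Pathwise Monte Carlo gradients of E[z^2] w.r.t. mu and sigma, z = mu + sigma * eps."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)  # noise independent of the parameters
        z = mu + sigma * eps       # deterministic transform of the noise
        df_dz = 2.0 * z            # derivative of f(z) = z^2
        grad_mu += df_dz           # chain rule with dz/dmu = 1
        grad_sigma += df_dz * eps  # chain rule with dz/dsigma = eps
    return grad_mu / n, grad_sigma / n

rng = random.Random(0)
g_mu, g_sigma = reparam_gradients(1.5, 0.5, 100_000, rng)
print(g_mu, g_sigma)  # near the exact values 2 * mu = 3.0 and 2 * sigma = 1.0
```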

---

## 13. Probability Theory in Major Generative Models

### 13.1 Variational Autoencoders (VAEs)

**Generative Model:**
$$p(z) = \mathcal{N}(0, I)$$
$$p_\theta(x|z) = \mathcal{N}(\mu_\theta(z), \sigma^2_\theta(z))$$

**Inference Model:**
$$q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))$$

**Loss Function:**
$$\mathcal{L}_{VAE} = -\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + D_{KL}(q_\phi(z|x) \| p(z))$$

### 13.2 Generative Adversarial Networks (GANs)

**Generator:** $G_\theta: z \rightarrow x$, where $z \sim p(z)$

**Discriminator Objective:**
$$\max_D \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$

**Generator Objective:**
$$\min_G \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$

**Optimal Discriminator:**
$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$$
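
On a finite support, the optimal discriminator and the value it attains can be computed exactly. The sketch below checks that $D^*$ maximizes the discriminator objective pointwise and that the attained value equals $2 \cdot D_{JS}(p_{data} \| p_g) - \log 4$. The two distributions are hypothetical.

```python
import math

p_data = [0.5, 0.3, 0.2]  # hypothetical discrete data distribution
p_g = [0.2, 0.2, 0.6]     # hypothetical generator distribution

def value(d):
    """V(D, G) = E_data[log D(x)] + E_g[log(1 - D(x))] on a finite support."""
    return sum(pd * math.log(dx) + pg * math.log(1 - dx)
               for pd, pg, dx in zip(p_data, p_g, d))

d_star = [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

m = [0.5 * (pd + pg) for pd, pg in zip(p_data, p_g)]
js = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

print(value(d_star), 2 * js - math.log(4))  # equal up to rounding
print(value([d + 0.05 for d in d_star]))    # any other discriminator scores lower
```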

### 13.3 Diffusion Models

**Forward Process (Noising):**
$$q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)$$

$$q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I)$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$
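
The closed-form marginal $q(x_t|x_0)$ follows from composing the single-step Gaussian kernels. The sketch below verifies this for a scalar $x$ by tracking the mean coefficient and variance through every step and comparing with $\sqrt{\bar{\alpha}_t}$ and $1 - \bar{\alpha}_t$. The linear $\beta_t$ schedule, the number of steps `T`, and the helper `q_sample` are assumptions; the text above does not fix a schedule.

```python
import math

T = 1000
# Assumed linear schedule from 1e-4 to 0.02 (illustrative values only).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar[t - 1] holds the product of alpha_s = 1 - beta_s for s = 1..t.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def q_sample(x0, t, eps):
    """Draw x_t given x_0 in one shot (t is 1-indexed, eps is a N(0, 1) draw)."""
    a = alpha_bar[t - 1]
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps

# Cross-check: compose x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) eps_t step by step.
mean_coef, var = 1.0, 0.0
for beta in betas:
    mean_coef *= math.sqrt(1.0 - beta)
    var = (1.0 - beta) * var + beta

print(mean_coef, math.sqrt(alpha_bar[-1]))
print(var, 1.0 - alpha_bar[-1])
```

Under this schedule $\bar{\alpha}_T$ is close to zero, so $x_T$ is nearly a standard Gaussian regardless of $x_0$.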

**Reverse Process (Denoising):**
$$p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

**Training Objective (Simplified):**
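$$\mathcal{L}_{simple} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

where $\epsilon \sim \mathcal{N}(0, I)$ and $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ is a sample from $q(x_t|x_0)$.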

### 13.4 Autoregressive Models

**Factorization:**
$$p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t | x_{<t})$$

**Language Model (Token Prediction):**
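$$p(x_t | x_{<t}) = \text{softmax}(W h_t)$$

where $h_t$ is the model's hidden representation of the prefix $x_{<t}$ and $W$ projects it onto vocabulary logits.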

**Training Objective (Cross-Entropy):**
$$\mathcal{L}_{AR} = -\sum_{t=1}^{T} \log p_\theta(x_t | x_{<t})$$
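
The training objective can be sketched as a sum of per-step negative log-probabilities computed from logits with a numerically stable log-softmax. The logits and target indices below are made up; in a real model the logits would come from $W h_t$.

```python
import math

def log_softmax(logits):
    """Stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sequence_nll(logits_per_step, targets):
    """Negative log-likelihood: -sum_t log p(x_t | x_<t)."""
    return -sum(log_softmax(logits)[t] for logits, t in zip(logits_per_step, targets))

# Hypothetical logits for three steps over a 3-token vocabulary.
logits_per_step = [
    [2.0, 0.5, -1.0],
    [0.1, 0.2, 3.0],
    [1000.0, 1000.0, 1000.0],  # large values would overflow a naive softmax
]
targets = [0, 2, 1]
nll = sequence_nll(logits_per_step, targets)
print(nll)
```

The third step has equal logits, so it contributes exactly $\log 3$ to the loss despite the large magnitudes.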

### 13.5 Normalizing Flows

**Transformation:** $z = f_\theta(x)$ where $f_\theta$ is invertible

**Change of Variables:**
$$p_X(x) = p_Z(f_\theta(x)) \left|\det\left(\frac{\partial f_\theta(x)}{\partial x}\right)\right|$$

**Log-Likelihood:**
$$\log p_X(x) = \log p_Z(f_\theta(x)) + \log\left|\det(J_{f_\theta}(x))\right|$$
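
The simplest invertible map is a scalar affine transformation, for which the change-of-variables result can be checked against a known density. With $z = f(x) = (x - b)/a$ and a standard normal base distribution, the flow density should equal that of $\mathcal{N}(b, a^2)$. The parameter values and function names are hypothetical.

```python
import math

def standard_normal_logpdf(z):
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

def flow_logpdf(x, a, b):
    """log p_X(x) for the affine flow z = f(x) = (x - b) / a with base N(0, 1)."""
    z = (x - b) / a
    log_abs_det = -math.log(abs(a))  # df/dx = 1 / a
    return standard_normal_logpdf(z) + log_abs_det

def normal_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

a, b = 2.5, -1.0
for x in (-3.0, 0.0, 4.2):
    print(flow_logpdf(x, a, b), normal_logpdf(x, b, abs(a)))
```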

---

## 14. Summary: Probability Foundations for Gen-AI

| Concept | Mathematical Form | Gen-AI Application |
|---------|-------------------|-------------------|
| Bayes' Theorem | $p(\theta\|x) \propto p(x\|\theta)p(\theta)$ | Posterior inference in VAEs |
| Chain Rule | $p(x_{1:T}) = \prod_t p(x_t\|x_{<t})$ | Autoregressive models |
| Marginalization | $p(x) = \int p(x\|z)p(z)dz$ | Latent variable models |
| KL Divergence | $D_{KL}(q\|\|p) = \mathbb{E}_q\left[\log \frac{q}{p}\right]$ | VAE loss, divergence minimization |
| ELBO | $\log p(x) \geq \mathbb{E}_{q}\left[\log \frac{p(x,z)}{q(z\|x)}\right]$ | VAE objective |
| MLE | $\arg\max_\theta \sum_i \log p_\theta(x_i)$ | Training objective |
| Reparameterization | $z = \mu + \sigma \odot \epsilon$ | Gradient through sampling |
| Change of Variables | $p_x(x) = p_z(f(x))\|\det J_f\|$ | Normalizing flows |

---

## 15. Key Takeaways

1. **Generative AI = Learning Probability Distributions:** All generative models aim to learn $p_{data}(x)$ either explicitly or implicitly.

2. **Latent Variables Enable Structure:** Introducing $z$ allows modeling complex distributions via simpler conditional distributions.

3. **Intractability Drives Methodology:** Intractable integrals/posteriors motivate variational inference, MCMC, and amortized inference.

4. **Information Theory Provides Objectives:** KL divergence, cross-entropy, and mutual information form the basis of training losses.

5. **Sampling is Fundamental:** Both training (Monte Carlo gradients) and inference (generation) rely on efficient sampling.

# Probability and Statistics for Generative AI

---

## Table of Contents
1. [Foundational Concepts](#1-foundational-concepts)
2. [Random Variables](#2-random-variables)
3. [Probability Distributions](#3-probability-distributions)
4. [Higher-Order Probability Distributions](#4-higher-order-probability-distributions)
5. [Stochastic Processes](#5-stochastic-processes)
6. [Non-Stochastic Processes](#6-non-stochastic-processes)
7. [Principles of Generative AI](#7-principles-of-generative-ai)

---

## 1. Foundational Concepts

### 1.1 Probability Space

**Definition:** A probability space is a mathematical construct $(\Omega, \mathcal{F}, P)$ that provides a formal model for randomness.

| Component | Symbol | Description |
|-----------|--------|-------------|
| Sample Space | $\Omega$ | Set of all possible outcomes |
| Event Space | $\mathcal{F}$ | $\sigma$-algebra of subsets of $\Omega$ |
| Probability Measure | $P$ | Function mapping events to $[0,1]$ |

**Kolmogorov Axioms:**

$$P(\Omega) = 1$$

$$P(A) \geq 0 \quad \forall A \in \mathcal{F}$$

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \quad \text{for mutually exclusive } A_i$$

### 1.2 Conditional Probability

**Definition:** The probability of event $A$ given event $B$ has occurred:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

### 1.3 Bayes' Theorem

**Definition:** Fundamental theorem for posterior inference in generative models:

$$P(\theta|X) = \frac{P(X|\theta)P(\theta)}{P(X)}$$

where:
- $P(\theta|X)$ = **Posterior** distribution
- $P(X|\theta)$ = **Likelihood** function
- $P(\theta)$ = **Prior** distribution
- $P(X)$ = **Evidence** (marginal likelihood)

**Marginal Likelihood Computation:**

$$P(X) = \int P(X|\theta)P(\theta)d\theta$$

### 1.4 Chain Rule of Probability

**Definition:** Decomposition of joint probability into conditional factors:

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | X_1, X_2, \ldots, X_{i-1})$$

**Gen-AI Application:** Autoregressive language models (GPT) directly model this factorization:

$$P(\text{sequence}) = \prod_{t=1}^{T} P(x_t | x_{<t})$$

---

## 2. Random Variables

### 2.1 Definition

**Definition:** A random variable is a measurable function $X: \Omega \rightarrow \mathbb{R}$ that maps outcomes from a sample space to real numbers.

$$X: \Omega \rightarrow \mathbb{R}$$

### 2.2 Types of Random Variables

#### 2.2.1 Discrete Random Variables

**Definition:** Takes countable values with associated probability mass function (PMF):

$$P(X = x_i) = p_i, \quad \sum_{i} p_i = 1$$

**Properties:**
- Cumulative Distribution Function (CDF):
$$F_X(x) = P(X \leq x) = \sum_{x_i \leq x} P(X = x_i)$$

#### 2.2.2 Continuous Random Variables

**Definition:** Takes uncountably infinite values with probability density function (PDF):

$$P(a \leq X \leq b) = \int_a^b f_X(x)dx$$

**Normalization Constraint:**

$$\int_{-\infty}^{\infty} f_X(x)dx = 1$$

**CDF-PDF Relationship:**

$$F_X(x) = \int_{-\infty}^{x} f_X(t)dt$$

$$f_X(x) = \frac{dF_X(x)}{dx}$$

### 2.3 Moments of Random Variables

#### 2.3.1 Expected Value (First Moment)

**Discrete:**
$$\mathbb{E}[X] = \sum_{i} x_i P(X = x_i)$$

**Continuous:**
$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x f_X(x)dx$$

#### 2.3.2 Variance (Second Central Moment)

$$\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

#### 2.3.3 Higher-Order Moments

**$n$-th Moment:**
$$\mu_n = \mathbb{E}[X^n]$$

**$n$-th Central Moment:**
$$\mu'_n = \mathbb{E}[(X - \mu)^n]$$

**Skewness (3rd standardized moment):**
$$\gamma_1 = \frac{\mathbb{E}[(X-\mu)^3]}{\sigma^3}$$

**Kurtosis (4th standardized moment):**
$$\text{Kurt}[X] = \frac{\mathbb{E}[(X-\mu)^4]}{\sigma^4}$$

A Gaussian has $\text{Kurt}[X] = 3$; the excess kurtosis is $\gamma_2 = \text{Kurt}[X] - 3$.

### 2.4 Moment Generating Function (MGF)

**Definition:**
$$M_X(t) = \mathbb{E}[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f_X(x)dx$$

**Moment Extraction:**
$$\mathbb{E}[X^n] = \frac{d^n M_X(t)}{dt^n}\bigg|_{t=0}$$

### 2.5 Characteristic Function

**Definition:**
$$\phi_X(t) = \mathbb{E}[e^{itX}] = \int_{-\infty}^{\infty} e^{itx} f_X(x)dx$$

**Properties:**
- Always exists (unlike MGF)
- $\phi_X(0) = 1$
- $|\phi_X(t)| \leq 1$
- Uniquely determines distribution

---

## 3. Probability Distributions

### 3.1 Discrete Distributions

#### 3.1.1 Bernoulli Distribution

**Definition:** Models single binary trial.

$$X \sim \text{Bernoulli}(p)$$

$$P(X = k) = p^k(1-p)^{1-k}, \quad k \in \{0, 1\}$$

| Statistic | Value |
|-----------|-------|
| Mean | $\mu = p$ |
| Variance | $\sigma^2 = p(1-p)$ |

#### 3.1.2 Binomial Distribution

**Definition:** Number of successes in $n$ independent Bernoulli trials.

$$X \sim \text{Binomial}(n, p)$$

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

| Statistic | Value |
|-----------|-------|
| Mean | $\mu = np$ |
| Variance | $\sigma^2 = np(1-p)$ |

#### 3.1.3 Categorical Distribution

**Definition:** Generalization of Bernoulli to $K$ categories.

$$X \sim \text{Categorical}(\pi_1, \pi_2, \ldots, \pi_K)$$

$$P(X = k) = \pi_k, \quad \sum_{k=1}^{K} \pi_k = 1$$

**Gen-AI Application:** Output layer of classification/language models uses softmax to parameterize categorical distribution:

$$\pi_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}$$

#### 3.1.4 Multinomial Distribution

**Definition:** Multiple draws from categorical distribution.

$$\mathbf{X} \sim \text{Multinomial}(n, \boldsymbol{\pi})$$

$$P(X_1 = x_1, \ldots, X_K = x_K) = \frac{n!}{\prod_{k=1}^{K} x_k!} \prod_{k=1}^{K} \pi_k^{x_k}$$

#### 3.1.5 Poisson Distribution

**Definition:** Models count of events in fixed interval.

$$X \sim \text{Poisson}(\lambda)$$

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

| Statistic | Value |
|-----------|-------|
| Mean | $\mu = \lambda$ |
| Variance | $\sigma^2 = \lambda$ |

#### 3.1.6 Geometric Distribution

**Definition:** Number of trials until first success.

$$P(X = k) = (1-p)^{k-1}p, \quad k \in \{1, 2, 3, \ldots\}$$

### 3.2 Continuous Distributions

#### 3.2.1 Uniform Distribution

**Definition:**
$$X \sim \text{Uniform}(a, b)$$

$$f_X(x) = \frac{1}{b-a}, \quad x \in [a, b]$$

| Statistic | Value |
|-----------|-------|
| Mean | $\mu = \frac{a+b}{2}$ |
| Variance | $\sigma^2 = \frac{(b-a)^2}{12}$ |

#### 3.2.2 Gaussian (Normal) Distribution

**Definition:** The most fundamental distribution in generative AI.

$$X \sim \mathcal{N}(\mu, \sigma^2)$$

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

**Standard Normal ($\mu=0$, $\sigma=1$):**

$$\phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

**Properties:**
- Maximum entropy distribution for fixed mean and variance
- Closed under linear transformations
- Central Limit Theorem convergence target

**Moment Generating Function:**
$$M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$$

#### 3.2.3 Exponential Distribution

**Definition:**
$$X \sim \text{Exponential}(\lambda)$$

$$f_X(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$

**Memoryless Property:**
$$P(X > s + t | X > s) = P(X > t)$$

#### 3.2.4 Gamma Distribution

**Definition:**
$$X \sim \text{Gamma}(\alpha, \beta)$$

$$f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0$$

where $\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1}e^{-t}dt$

#### 3.2.5 Beta Distribution

**Definition:** Distribution over $[0,1]$, conjugate prior for Bernoulli.

$$X \sim \text{Beta}(\alpha, \beta)$$

$$f_X(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}$$

| Statistic | Value |
|-----------|-------|
| Mean | $\mu = \frac{\alpha}{\alpha + \beta}$ |
| Variance | $\sigma^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ |

#### 3.2.6 Dirichlet Distribution

**Definition:** Multivariate generalization of the Beta distribution; a distribution over the probability simplex.

$$\boldsymbol{\pi} \sim \text{Dirichlet}(\boldsymbol{\alpha})$$

$$f(\pi_1, \ldots, \pi_K) = \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}$$

**Constraint:** $\sum_{k=1}^{K} \pi_k = 1$

**Gen-AI Application:** Prior for topic models (LDA).

---

## 4. Higher-Order Probability Distributions

### 4.1 Joint Distributions

#### 4.1.1 Bivariate Joint Distribution

**Definition:** Distribution over two random variables simultaneously.

**Discrete:**
$$P(X = x_i, Y = y_j) = p_{ij}$$

**Continuous:**
$$P(a \leq X \leq b, c \leq Y \leq d) = \int_a^b \int_c^d f_{X,Y}(x,y) \, dy \, dx$$

#### 4.1.2 Marginal Distribution

**Definition:** Distribution of single variable from joint.

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dy$$

$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dx$$

#### 4.1.3 Conditional Distribution

$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}$$

### 4.2 Multivariate Gaussian Distribution

**Definition:** The cornerstone distribution for continuous latent space models.

$$\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

where:
- $\boldsymbol{\mu} \in \mathbb{R}^d$ = Mean vector
- $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ = Covariance matrix (symmetric positive definite)
- $|\boldsymbol{\Sigma}|$ = Determinant of covariance matrix

#### 4.2.1 Covariance Matrix Structure

$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{pmatrix}$$

**Correlation Coefficient:**
$$\rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}$$

#### 4.2.2 Mahalanobis Distance

**Definition:** Generalized distance accounting for correlations.

$$D_M(\mathbf{x}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}$$

#### 4.2.3 Conditional Multivariate Gaussian

For partition $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2]^T$:

$$\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \quad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}$$

**Conditional Distribution:**
$$\mathbf{X}_1 | \mathbf{X}_2 = \mathbf{x}_2 \sim \mathcal{N}(\boldsymbol{\mu}_{1|2}, \boldsymbol{\Sigma}_{1|2})$$

where:
$$\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$$

$$\boldsymbol{\Sigma}_{1|2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}$$
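
As a worked instance of the conditioning formulas, the sketch below treats the bivariate case, where every block of $\boldsymbol{\Sigma}$ is a scalar and the matrix inverse reduces to a division. The function name and the numeric values are illustrative assumptions.

```python
def conditional_gaussian_2d(mu1, mu2, s11, s12, s22, x2):
    """Mean and variance of X1 | X2 = x2 for a bivariate Gaussian.

    s11, s12, s22 are the scalar entries of the covariance matrix.
    """
    mu_cond = mu1 + (s12 / s22) * (x2 - mu2)   # mu_1 + S12 S22^{-1} (x2 - mu_2)
    var_cond = s11 - s12 * s12 / s22           # S11 - S12 S22^{-1} S21
    return mu_cond, var_cond

# Hypothetical parameters: mu = (0, 1), Sigma = [[2.0, 0.8], [0.8, 1.0]], observe x2 = 2.
mu_c, var_c = conditional_gaussian_2d(0.0, 1.0, 2.0, 0.8, 1.0, 2.0)
```

Observing $x_2$ one unit above its mean shifts the conditional mean of $X_1$ by $\Sigma_{12}/\Sigma_{22} = 0.8$, and the conditional variance $1.36$ is smaller than the marginal variance $2.0$.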

#### 4.2.4 Precision Matrix (Inverse Covariance)

$$\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$$

**Gen-AI Application:** Sparse precision matrices encode conditional independence (Gaussian Graphical Models).

### 4.3 Mixture Models

#### 4.3.1 Gaussian Mixture Model (GMM)

**Definition:** Weighted combination of Gaussian components.

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

**Constraints:**
$$\sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \geq 0$$

**Latent Variable Formulation:**
$$p(\mathbf{x}, z) = p(z)p(\mathbf{x}|z) = \pi_z \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_z, \boldsymbol{\Sigma}_z)$$

#### 4.3.2 EM Algorithm for GMM

**E-Step (Responsibility Computation):**
$$\gamma_{nk} = \frac{\pi_k \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$

**M-Step (Parameter Update):**
$$N_k = \sum_{n=1}^{N} \gamma_{nk}$$

$$\boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \mathbf{x}_n$$

$$\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} (\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})^T$$

$$\pi_k^{\text{new}} = \frac{N_k}{N}$$
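
The E-step and M-step updates can be sketched for one-dimensional data, where each covariance matrix reduces to a scalar variance. The toy data, the initial parameters, the iteration count, and the variance floor are assumptions of this sketch, not part of the equations.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_step(data, pis, mus, variances):
    """One EM iteration for a 1-D Gaussian mixture."""
    K, N = len(pis), len(data)
    # E-step: responsibilities gamma_nk.
    gammas = []
    for x in data:
        weighted = [pis[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
        total = sum(weighted)
        gammas.append([w / total for w in weighted])
    # M-step: re-estimate weights, means, and variances.
    new_pis, new_mus, new_vars = [], [], []
    for k in range(K):
        n_k = sum(g[k] for g in gammas)
        mu_k = sum(g[k] * x for g, x in zip(gammas, data)) / n_k
        var_k = sum(g[k] * (x - mu_k) ** 2 for g, x in zip(gammas, data)) / n_k
        new_pis.append(n_k / N)
        new_mus.append(mu_k)
        # Variance floor: a practical guard against collapse, not in the equations.
        new_vars.append(max(var_k, 1e-6))
    return new_pis, new_mus, new_vars

# Hypothetical data with two well-separated clusters near -2 and 3.
data = [-2.1, -1.9, -2.0, -1.8, -2.2, 3.0, 3.1, 2.9, 3.2, 2.8]
pis, mus, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(50):
    pis, mus, variances = em_step(data, pis, mus, variances)
```

On this toy data the component means move toward the two cluster centres and the mixing weights toward one half each. In general EM only finds a local optimum, so results depend on initialization.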

### 4.4 Exponential Family Distributions

**Definition:** Unified form capturing most common distributions.

$$p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x}) \exp\left(\boldsymbol{\eta}^T \mathbf{T}(\mathbf{x}) - A(\boldsymbol{\eta})\right)$$

| Component | Description |
|-----------|-------------|
| $\boldsymbol{\eta}$ | Natural parameters |
| $\mathbf{T}(\mathbf{x})$ | Sufficient statistics |
| $A(\boldsymbol{\eta})$ | Log-partition function (normalizer) |
| $h(\mathbf{x})$ | Base measure |

**Properties of Log-Partition Function:**

$$\nabla_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{T}(\mathbf{x})]$$

$$\nabla^2_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \text{Cov}[\mathbf{T}(\mathbf{x})]$$
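
These two identities can be checked numerically for the Bernoulli distribution, which is an exponential family with $T(x) = x$, $\eta = \log\frac{p}{1-p}$, and $A(\eta) = \log(1 + e^{\eta})$. The sketch below uses finite differences; the helper names and step sizes are assumptions chosen for illustration.

```python
import math

def log_partition(eta):
    """A(eta) = log(1 + exp(eta)) for the Bernoulli family."""
    return math.log1p(math.exp(eta))

def first_derivative(f, x, h=1e-4):
    # Central difference approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_derivative(f, x, h=1e-3):
    # Central difference approximation of f''(x).
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

p = 0.3
eta = math.log(p / (1.0 - p))                         # natural parameter
mean_from_A = first_derivative(log_partition, eta)    # should approximate E[T(x)] = p
var_from_A = second_derivative(log_partition, eta)    # should approximate Var[T(x)] = p(1-p)
```

The first derivative recovers the mean $p$ and the second recovers the variance $p(1-p)$, up to finite-difference error.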

### 4.5 Copulas

**Definition:** Functions that couple marginal distributions to form joint distributions.

$$C: [0,1]^d \rightarrow [0,1]$$

**Sklar's Theorem:**
$$F(x_1, \ldots, x_d) = C(F_1(x_1), \ldots, F_d(x_d))$$

**Gaussian Copula:**
$$C_{\boldsymbol{\Sigma}}(\mathbf{u}) = \Phi_{\boldsymbol{\Sigma}}(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_d))$$

where $\Phi_{\boldsymbol{\Sigma}}$ is the CDF of a zero-mean multivariate normal with correlation matrix $\boldsymbol{\Sigma}$, and $\Phi^{-1}$ is the inverse CDF of the univariate standard normal.

---

## 5. Stochastic Processes

### 5.1 Definition

**Definition:** A stochastic process is a collection of random variables $\{X_t\}_{t \in T}$ indexed by a parameter set $T$ (typically time), defined on a probability space $(\Omega, \mathcal{F}, P)$.

$$\{X_t : t \in T\}$$

**Types by Index Set:**
- **Discrete-time:** $T = \{0, 1, 2, \ldots\}$
- **Continuous-time:** $T = [0, \infty)$

**Types by State Space:**
- **Discrete state:** $X_t \in \{s_1, s_2, \ldots\}$
- **Continuous state:** $X_t \in \mathbb{R}^d$

### 5.2 Markov Chains

#### 5.2.1 Definition

**Definition:** A stochastic process satisfying the Markov property (memoryless).

$$P(X_{t+1} | X_t, X_{t-1}, \ldots, X_0) = P(X_{t+1} | X_t)$$

#### 5.2.2 Transition Matrix

**Definition:** For discrete state space with $n$ states:

$$\mathbf{P} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{pmatrix}$$

where $p_{ij} = P(X_{t+1} = j | X_t = i)$

**Row Stochastic Property:**
$$\sum_{j=1}^{n} p_{ij} = 1 \quad \forall i$$

#### 5.2.3 Chapman-Kolmogorov Equation

$$P(X_{t+s} = j | X_0 = i) = \sum_{k} P(X_t = k | X_0 = i) P(X_{t+s} = j | X_t = k)$$

$$\mathbf{P}^{(t+s)} = \mathbf{P}^{(t)} \mathbf{P}^{(s)}$$

#### 5.2.4 Stationary Distribution

**Definition:** Distribution $\boldsymbol{\pi}$ satisfying:

$$\boldsymbol{\pi}^T \mathbf{P} = \boldsymbol{\pi}^T$$

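For an irreducible, aperiodic chain on a finite state space, $\boldsymbol{\pi}$ is also the limit reached from any initial distribution $\boldsymbol{\pi}_0$ (written as a row vector, consistent with the row-stochastic convention):

$$\boldsymbol{\pi}^T = \lim_{t \rightarrow \infty} \boldsymbol{\pi}_0^T \mathbf{P}^t$$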

**Detailed Balance (Reversibility):**
$$\pi_i p_{ij} = \pi_j p_{ji}$$
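
A stationary distribution can be approximated by repeatedly propagating a distribution through the chain. The sketch below assumes the row-stochastic convention used above, so the distribution is treated as a row vector multiplied on the left of $\mathbf{P}$. The two-state matrix is a made-up example.

```python
def step(dist, P):
    """One step dist^T P for a row-stochastic matrix P (given as a list of rows)."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical two-state chain.
P = [[0.9, 0.1],
     [0.5, 0.5]]

dist = [1.0, 0.0]          # start with all mass on state 0
for _ in range(200):
    dist = step(dist, P)   # converges because this chain is irreducible and aperiodic
```

For this matrix the limit is $(5/6, 1/6)$, which also satisfies detailed balance: $\pi_0 p_{01} = \pi_1 p_{10}$. Any two-state chain with a stationary distribution is reversible, so this check is not informative for larger chains.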

### 5.3 Markov Chain Monte Carlo (MCMC)

#### 5.3.1 Metropolis-Hastings Algorithm

**Objective:** Sample from target distribution $p(\mathbf{x})$.

**Acceptance Probability:**
$$\alpha(\mathbf{x}' | \mathbf{x}) = \min\left(1, \frac{p(\mathbf{x}')q(\mathbf{x}|\mathbf{x}')}{p(\mathbf{x})q(\mathbf{x}'|\mathbf{x})}\right)$$

where $q(\mathbf{x}'|\mathbf{x})$ is proposal distribution.

**Algorithm:**
1. Initialize $\mathbf{x}_0$
2. For $t = 0, 1, 2, \ldots$:
   - Sample $\mathbf{x}' \sim q(\mathbf{x}'|\mathbf{x}_t)$
   - Compute $\alpha = \alpha(\mathbf{x}' | \mathbf{x}_t)$
   - Sample $u \sim \text{Uniform}(0,1)$
   - If $u < \alpha$: $\mathbf{x}_{t+1} = \mathbf{x}'$, else: $\mathbf{x}_{t+1} = \mathbf{x}_t$
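
The steps above can be sketched as a random-walk sampler. This is a minimal illustration under stated assumptions: the target is a standard normal known only up to a constant, and the proposal is a symmetric Gaussian, so the $q$ terms in the acceptance ratio cancel. The function names, step size, and seed are hypothetical choices.

```python
import math
import random

def log_target(x):
    # Unnormalized log-density of N(0, 1); the normalizer cancels in the ratio.
    return -0.5 * x * x

def metropolis_hastings(n_samples, step_size=1.0, x0=0.0, seed=0):
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step_size)        # x' ~ q(x' | x_t)
        # Symmetric proposal: alpha = min(1, p(x') / p(x)), computed in log space.
        log_alpha = min(0.0, log_target(proposal) - log_target(x))
        if rng.random() < math.exp(log_alpha):          # u ~ Uniform(0, 1)
            x = proposal                                # accept
        samples.append(x)                               # on rejection, x_t is repeated
    return samples

samples = metropolis_hastings(20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The empirical mean and variance should land near 0 and 1. Successive samples are correlated, so the effective sample size is smaller than the chain length.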

#### 5.3.2 Gibbs Sampling

**Definition:** Special case of Metropolis-Hastings with component-wise updates.

For $\mathbf{x} = (x_1, x_2, \ldots, x_d)$:

$$x_i^{(t+1)} \sim p(x_i | x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_d^{(t)})$$

**Acceptance Probability:** Always 1.

### 5.4 Gaussian Processes

#### 5.4.1 Definition

**Definition:** A collection of random variables, any finite subset of which has a joint Gaussian distribution.

$$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$$

where:
- $m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]$ = Mean function
- $k(\mathbf{x}, \mathbf{x}') = \mathbb{E}[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))]$ = Covariance (kernel) function

#### 5.4.2 Common Kernel Functions

**Squared Exponential (RBF):**
$$k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2l^2}\right)$$

**Matérn Kernel:**
$$k(\mathbf{x}, \mathbf{x}') = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}r}{l}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}r}{l}\right)$$

where $r = \|\mathbf{x} - \mathbf{x}'\|$ and $K_\nu$ is modified Bessel function.

#### 5.4.3 GP Posterior Inference

Given observations $\mathbf{y} = f(\mathbf{X}) + \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma_n^2 \mathbf{I})$ and a zero prior mean $m(\mathbf{x}) = 0$:

**Posterior Mean:**
$$\bar{f}(\mathbf{x}_*) = \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}$$

**Posterior Variance:**
$$\text{Var}(f(\mathbf{x}_*)) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_*$$

where $\mathbf{K}$ is kernel matrix and $\mathbf{k}_* = [k(\mathbf{x}_*, \mathbf{x}_1), \ldots, k(\mathbf{x}_*, \mathbf{x}_n)]^T$.

### 5.5 Wiener Process (Brownian Motion)

#### 5.5.1 Definition

**Definition:** Continuous-time stochastic process $\{W_t\}_{t \geq 0}$ with properties:

1. $W_0 = 0$
2. Independent increments: $W_t - W_s$ is independent of $\{W_u : u \leq s\}$
3. Gaussian increments: $W_t - W_s \sim \mathcal{N}(0, t-s)$ for $t > s$
4. Continuous paths

**Properties:**
$$\mathbb{E}[W_t] = 0$$
$$\text{Cov}(W_s, W_t) = \min(s, t)$$

### 5.6 Stochastic Differential Equations (SDEs)

#### 5.6.1 Itô SDE

**Definition:**
$$d\mathbf{X}_t = \boldsymbol{\mu}(\mathbf{X}_t, t)dt + \boldsymbol{\sigma}(\mathbf{X}_t, t)d\mathbf{W}_t$$

where:
- $\boldsymbol{\mu}(\mathbf{X}_t, t)$ = Drift coefficient
- $\boldsymbol{\sigma}(\mathbf{X}_t, t)$ = Diffusion coefficient
- $d\mathbf{W}_t$ = Wiener process increment

#### 5.6.2 Itô's Lemma

For function $f(\mathbf{X}_t, t)$:

$$df = \frac{\partial f}{\partial t}dt + \nabla_{\mathbf{x}} f \cdot d\mathbf{X}_t + \frac{1}{2} \text{tr}\left(\boldsymbol{\sigma}\boldsymbol{\sigma}^T \nabla^2_{\mathbf{x}} f\right)dt$$

#### 5.6.3 Fokker-Planck Equation

**Definition:** Evolution equation for probability density $p(\mathbf{x}, t)$:

$$\frac{\partial p}{\partial t} = -\nabla \cdot (\boldsymbol{\mu} p) + \frac{1}{2} \nabla \cdot \nabla \cdot (\boldsymbol{\sigma}\boldsymbol{\sigma}^T p)$$

### 5.7 Ornstein-Uhlenbeck Process

**Definition:** Mean-reverting Gaussian process:

$$d\mathbf{X}_t = \theta(\boldsymbol{\mu} - \mathbf{X}_t)dt + \sigma d\mathbf{W}_t$$

**Solution:**
$$\mathbf{X}_t = \boldsymbol{\mu} + e^{-\theta t}(\mathbf{X}_0 - \boldsymbol{\mu}) + \sigma \int_0^t e^{-\theta(t-s)} d\mathbf{W}_s$$

**Stationary Distribution:**
$$\mathbf{X}_\infty \sim \mathcal{N}\left(\boldsymbol{\mu}, \frac{\sigma^2}{2\theta}\mathbf{I}\right)$$
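
The stationary distribution can be checked by simulation. The sketch below uses the Euler-Maruyama scheme, a standard discretization that is not derived in this section, on a scalar Ornstein-Uhlenbeck process. The parameter values, step size, burn-in length, and seed are assumptions.

```python
import math
import random

def simulate_ou(theta, mu, sigma, x0, dt, n_steps, rng):
    """Euler-Maruyama discretization of dX = theta (mu - X) dt + sigma dW."""
    x = x0
    path = [x]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))        # Wiener increment ~ N(0, dt)
        x = x + theta * (mu - x) * dt + sigma * dw
        path.append(x)
    return path

rng = random.Random(0)
theta, mu, sigma = 1.0, 2.0, 0.5                   # hypothetical parameters
path = simulate_ou(theta, mu, sigma, x0=0.0, dt=0.01, n_steps=200_000, rng=rng)

tail = path[1000:]                                 # discard the initial transient
emp_mean = sum(tail) / len(tail)
emp_var = sum((x - emp_mean) ** 2 for x in tail) / len(tail)
stationary_var = sigma ** 2 / (2.0 * theta)        # sigma^2 / (2 theta) = 0.125
```

The long-run empirical mean should be close to $\mu$ and the empirical variance close to $\sigma^2 / (2\theta)$, up to discretization and sampling error.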

### 5.8 Diffusion Processes (Critical for Diffusion Models)

#### 5.8.1 Forward Diffusion Process

**Definition:** Gradually adds noise to data:

$$d\mathbf{x}_t = -\frac{1}{2}\beta(t)\mathbf{x}_t dt + \sqrt{\beta(t)} d\mathbf{W}_t$$

**Marginal Distribution:**
$$q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$$

where $\bar{\alpha}_t = \exp\left(-\int_0^t \beta(s)ds\right)$

#### 5.8.2 Reverse Diffusion Process

**Definition:** Denoising process (Anderson, 1982):

$$d\mathbf{x}_t = \left[-\frac{1}{2}\beta(t)\mathbf{x}_t - \beta(t)\nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t)\right]dt + \sqrt{\beta(t)} d\bar{\mathbf{W}}_t$$

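**Score Function (conditional on $\mathbf{x}_0$):**
$$\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t | \mathbf{x}_0) = -\frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0}{1 - \bar{\alpha}_t}$$

The marginal score $\nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t)$ that appears in the reverse SDE has no closed form in general. It is the quantity a score network is trained to approximate, with the tractable conditional score serving as the regression target.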

### 5.9 Discrete-Time Diffusion (DDPM)

**Forward Process:**
$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t \mathbf{I})$$

**Closed-Form Marginal:**
$$q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$
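
The closed-form marginal can be verified by iterating the one-step forward kernel and tracking the coefficient of $\mathbf{x}_0$ and the noise variance. The linear $\beta_t$ schedule and its endpoints below are assumed for illustration; the section does not specify a schedule.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule beta_1, ..., beta_T."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

betas = linear_betas(1000)

alpha_bar = 1.0   # running product of alpha_s = 1 - beta_s
mean_coef = 1.0   # coefficient of x_0 after applying t one-step kernels
var = 0.0         # per-coordinate noise variance after t steps

for beta in betas:
    alpha_bar *= 1.0 - beta
    # x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) eps, with eps independent of x_{t-1}.
    mean_coef *= math.sqrt(1.0 - beta)
    var = (1.0 - beta) * var + beta
```

After the loop, `mean_coef` matches $\sqrt{\bar{\alpha}_T}$ and `var` matches $1 - \bar{\alpha}_T$. With this schedule $\bar{\alpha}_T$ is close to zero, so $\mathbf{x}_T$ is nearly standard Gaussian noise.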

**Reverse Process:**
$$p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I})$$

---

## 6. Non-Stochastic Processes

### 6.1 Definition

**Definition:** Deterministic processes where future states are completely determined by current state without randomness.

$$\mathbf{x}_{t+1} = f(\mathbf{x}_t)$$

### 6.2 Deterministic Dynamical Systems

#### 6.2.1 Continuous-Time Systems

$$\frac{d\mathbf{x}}{dt} = \mathbf{f}(\mathbf{x}, t)$$

**Solution:** Flow map $\phi_t: \mathbf{x}_0 \mapsto \mathbf{x}_t$

#### 6.2.2 Fixed Points and Stability

**Fixed Point:** $\mathbf{x}^*$ such that $\mathbf{f}(\mathbf{x}^*) = 0$

**Linearization:**
$$\frac{d\boldsymbol{\delta}}{dt} = \mathbf{J}|_{\mathbf{x}^*} \boldsymbol{\delta}$$

where $\mathbf{J} = \frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ is Jacobian matrix.

**Stability Criterion:** The fixed point is locally asymptotically stable if all eigenvalues of $\mathbf{J}|_{\mathbf{x}^*}$ have negative real parts.

### 6.3 Neural Network Forward Pass as Deterministic Process

**Layer-wise Transformation:**
$$\mathbf{h}^{(l+1)} = \sigma(\mathbf{W}^{(l)}\mathbf{h}^{(l)} + \mathbf{b}^{(l)})$$

**Neural ODE Interpretation:**
$$\frac{d\mathbf{h}(t)}{dt} = f_\theta(\mathbf{h}(t), t)$$

### 6.4 Gradient Descent as Deterministic Dynamics

**Continuous-Time Gradient Flow:**
$$\frac{d\boldsymbol{\theta}}{dt} = -\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$$

**Discrete Update:**
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t)$$

### 6.5 Normalizing Flows (Deterministic Bijections)

**Definition:** Deterministic, invertible transformations for density estimation.

$$\mathbf{z}_K = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z}_0)$$

**Change of Variables (with $\mathbf{x} = \mathbf{z}_K$):**
$$\log p(\mathbf{x}) = \log p(\mathbf{z}_0) - \sum_{k=1}^{K} \log\left|\det\frac{\partial f_k}{\partial \mathbf{z}_{k-1}}\right|$$

---

## 7. Principles of Generative AI

### 7.1 Fundamental Objective

**Definition:** Learn data distribution $p_{\text{data}}(\mathbf{x})$ and generate new samples $\mathbf{x} \sim p_\theta(\mathbf{x})$ that approximate the true data distribution.

$$\min_\theta D(p_{\text{data}} \| p_\theta)$$

where $D$ is some divergence measure.

### 7.2 Maximum Likelihood Estimation (MLE)

**Definition:** Find parameters that maximize probability of observed data.

$$\theta^* = \arg\max_\theta \prod_{i=1}^{N} p_\theta(\mathbf{x}_i) = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta(\mathbf{x}_i)$$

**Equivalence to KL Minimization:**
$$\max_\theta \mathbb{E}_{p_{\text{data}}}[\log p_\theta(\mathbf{x})] \equiv \min_\theta D_{KL}(p_{\text{data}} \| p_\theta)$$

### 7.3 Kullback-Leibler Divergence

**Definition:** Asymmetric measure of distribution difference.

**Forward KL:**
$$D_{KL}(p \| q) = \int p(\mathbf{x}) \log \frac{p(\mathbf{x})}{q(\mathbf{x})} d\mathbf{x} = \mathbb{E}_p\left[\log \frac{p(\mathbf{x})}{q(\mathbf{x})}\right]$$

**Reverse KL:**
$$D_{KL}(q \| p) = \int q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x})} d\mathbf{x}$$

**Properties:**
- $D_{KL}(p \| q) \geq 0$
- $D_{KL}(p \| q) = 0 \iff p = q$
- Asymmetric: in general, $D_{KL}(p \| q) \neq D_{KL}(q \| p)$

**Behavior:**
| Divergence | Property | Application |
|------------|----------|-------------|
| Forward KL | Mean-seeking, mode-covering (zero-avoiding) | MLE training |
| Reverse KL | Mode-seeking (zero-forcing) | Variational inference |
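
The asymmetry is easy to see on a small discrete example. The two distributions below are made up for illustration, and the helper assumes $q(x) > 0$ wherever $p(x) > 0$.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as lists of probabilities.

    Terms with p(x) = 0 contribute 0. Assumes q(x) > 0 wherever p(x) > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

forward = kl_divergence(p, q)   # D_KL(p || q)
reverse = kl_divergence(q, p)   # D_KL(q || p)
```

Both values are positive and they differ, so swapping the arguments changes the objective being optimized.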

### 7.4 Latent Variable Models

#### 7.4.1 Framework

**Generative Process:**
1. Sample latent: $\mathbf{z} \sim p_\theta(\mathbf{z})$
2. Generate data: $\mathbf{x} \sim p_\theta(\mathbf{x}|\mathbf{z})$

**Marginal Likelihood:**
$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z})p_\theta(\mathbf{z})d\mathbf{z}$$

**Intractability:** This integral is typically intractable.

#### 7.4.2 Evidence Lower Bound (ELBO)

**Derivation:**
$$\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x}, \mathbf{z}) d\mathbf{z}$$

$$= \log \int \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})} q_\phi(\mathbf{z}|\mathbf{x}) d\mathbf{z}$$

$$\geq \int q_\phi(\mathbf{z}|\mathbf{x}) \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})} d\mathbf{z} = \mathcal{L}(\theta, \phi; \mathbf{x})$$

**ELBO Decomposition:**
$$\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}))$$

| Term | Name | Interpretation |
|------|------|----------------|
| $\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x} \mid \mathbf{z})]$ | Reconstruction | Data fit |
| $D_{KL}(q_\phi \parallel p_\theta)$ | Regularization | Prior matching |

**Gap Analysis:**
$$\log p_\theta(\mathbf{x}) = \mathcal{L}(\theta, \phi; \mathbf{x}) + D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$$
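
The gap identity can be verified exactly when the latent variable is discrete, because the integral becomes a finite sum. The sketch uses a binary latent and a single fixed observation; the prior, likelihood values, and variational distribution are arbitrary illustrative numbers.

```python
import math

# Hypothetical model with a binary latent z and one fixed observation x.
prior = [0.6, 0.4]         # p(z)
likelihood = [0.2, 0.7]    # p(x | z) evaluated at the observed x
q = [0.5, 0.5]             # variational distribution q(z | x)

joint = [pz * px for pz, px in zip(prior, likelihood)]   # p(x, z)
evidence = sum(joint)                                    # p(x)
posterior = [j / evidence for j in joint]                # p(z | x)

# ELBO = E_q[log p(x, z) - log q(z | x)]
elbo = sum(qz * math.log(j / qz) for qz, j in zip(q, joint))
# Gap = D_KL(q(z | x) || p(z | x))
gap = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))
```

Here `elbo + gap` equals `math.log(evidence)` up to rounding. The gap vanishes only when `q` equals the true posterior `[0.3, 0.7]`.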

### 7.5 Variational Autoencoders (VAEs)

#### 7.5.1 Architecture

```
Encoder: q_φ(z|x) → μ_φ(x), σ_φ(x)
         ↓
Latent:  z = μ + σ ⊙ ε, ε ~ N(0,I)  [Reparameterization]
         ↓
Decoder: p_θ(x|z) → reconstruction
```

#### 7.5.2 Reparameterization Trick

**Problem:** Gradient cannot flow through sampling operation.

**Solution:** Transform random variable:
$$\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

**Gradient Computation:**
$$\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[f(\mathbf{z})] = \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})}[\nabla_\phi f(\boldsymbol{\mu}_\phi + \boldsymbol{\sigma}_\phi \odot \boldsymbol{\epsilon})]$$
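
A scalar example makes the gradient identity concrete. Take $f(z) = z^2$ with $z = \mu + \sigma\epsilon$, so that $\mathbb{E}[f(z)] = \mu^2 + \sigma^2$ and the exact gradients are $2\mu$ and $2\sigma$. The sketch below estimates them by Monte Carlo, differentiating through the reparameterized sample by hand. The function name, sample count, and seed are assumptions.

```python
import random

def reparam_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[z^2], z = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)      # noise does not depend on the parameters
        z = mu + sigma * eps           # reparameterized sample
        df_dz = 2.0 * z                # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0         # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps      # chain rule: dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples

g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.8)
```

The estimates should be close to the exact values $2\mu = 3.0$ and $2\sigma = 1.6$. In practice an automatic differentiation framework performs the chain-rule step.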

#### 7.5.3 VAE Loss Function

$$\mathcal{L}_{VAE} = -\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] + D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$$

**Analytical KL for Gaussians:**
$$D_{KL}(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2) \| \mathcal{N}(\mathbf{0}, \mathbf{I})) = \frac{1}{2}\sum_{j=1}^{d}(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$$
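
The closed-form expression translates directly into code. The function name is a hypothetical choice, and `sigma` is taken to hold standard deviations. Many implementations instead parameterize $\log \sigma^2$, which this sketch does not do.

```python
import math

def kl_diag_gaussian_to_standard(mu, sigma):
    """D_KL( N(mu, diag(sigma^2)) || N(0, I) ) for lists of means and std devs."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

kl_zero = kl_diag_gaussian_to_standard([0.0, 0.0], [1.0, 1.0])   # posterior equals prior
kl_shift = kl_diag_gaussian_to_standard([1.0], [1.0])            # unit mean shift
```

The divergence is zero when the approximate posterior equals the prior, and a unit shift in one mean contributes $\frac{1}{2}$.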

### 7.6 Generative Adversarial Networks (GANs)

#### 7.6.1 Framework

**Generator:** $G_\theta: \mathbf{z} \rightarrow \mathbf{x}$

**Discriminator:** $D_\phi: \mathbf{x} \rightarrow [0, 1]$

#### 7.6.2 Minimax Objective

$$\min_\theta \max_\phi V(G_\theta, D_\phi) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D_\phi(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}[\log(1 - D_\phi(G_\theta(\mathbf{z})))]$$

#### 7.6.3 Optimal Discriminator

$$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}$$

#### 7.6.4 Generator Objective at Optimal Discriminator

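$$V(G_\theta, D^*) = -\log 4 + 2 \cdot D_{JS}(p_{\text{data}} \| p_g)$$

where $D_{JS}$ is the Jensen-Shannon divergence:
$$D_{JS}(p \| q) = \frac{1}{2}D_{KL}(p \| m) + \frac{1}{2}D_{KL}(q \| m), \quad m = \frac{p+q}{2}$$

Since $D_{JS} \geq 0$ with equality iff $p = q$, the minimum over $\theta$ is $-\log 4$, attained when $p_g = p_{\text{data}}$.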
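
The relation between the value function at the optimal discriminator and the Jensen-Shannon divergence can be checked on discrete distributions, where the expectations are finite sums. The two distributions are made-up examples and the helper names are assumptions.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js(p, q):
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_discriminator(p_data, p_g):
    """V(G, D*) with D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    d_star = [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]
    real_term = sum(pd * math.log(d) for pd, d in zip(p_data, d_star))
    fake_term = sum(pg * math.log(1.0 - d) for pg, d in zip(p_g, d_star))
    return real_term + fake_term

# Hypothetical data and generator distributions over three outcomes.
p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.3, 0.5]

value = value_at_optimal_discriminator(p_data, p_g)
identity_rhs = -math.log(4.0) + 2.0 * js(p_data, p_g)
value_matched = value_at_optimal_discriminator(p_data, p_data)
```

`value` agrees with `identity_rhs`, and when the generator matches the data distribution the value is exactly $-\log 4$.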

#### 7.6.5 Alternative GAN Losses

**Wasserstein GAN (WGAN):**
$$W(p_{\text{data}}, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{p_{\text{data}}}[f(\mathbf{x})] - \mathbb{E}_{p_g}[f(\mathbf{x})]$$

**WGAN-GP Loss:**
$$\mathcal{L} = \mathbb{E}_{p_g}[D(\mathbf{x})] - \mathbb{E}_{p_{\text{data}}}[D(\mathbf{x})] + \lambda \mathbb{E}_{\hat{\mathbf{x}}}[(\|\nabla_{\hat{\mathbf{x}}} D(\hat{\mathbf{x}})\|_2 - 1)^2]$$

### 7.7 Diffusion Models

#### 7.7.1 Score Matching

**Definition:** Learn score function (gradient of log-density).

$$\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})$$

**Denoising Score Matching Objective:**
$$\mathcal{L}_{DSM} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\|\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \boldsymbol{\epsilon}\|^2\right]$$

where $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$

#### 7.7.2 Training Objective

**Simplified Loss (DDPM):**
$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t \sim U(1,T), \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}, t)\|^2\right]$$

#### 7.7.3 Sampling Algorithms

**DDPM Sampling:**
$$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t \mathbf{z}$$

**DDIM Sampling (Deterministic):**
$$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}_\theta}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-1}}\boldsymbol{\epsilon}_\theta$$
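
The deterministic DDIM update can be sanity-checked with an oracle noise prediction. If $\boldsymbol{\epsilon}_\theta$ returns exactly the noise used to build $\mathbf{x}_t$, one step lands on the point with the same $\mathbf{x}_0$ and $\boldsymbol{\epsilon}$ at noise level $\bar{\alpha}_{t-1}$. The function name and the numeric values are assumptions, and a real model's prediction would only approximate the true noise.

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update from noise level t to t-1."""
    # Predicted clean sample implied by the noise estimate.
    x0_pred = [(x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
               for x, e in zip(x_t, eps_pred)]
    # Re-noise the prediction to level t-1 using the same noise direction.
    return [math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]

# Hypothetical clean sample, noise vector, and noise levels.
x0 = [1.0, -2.0]
eps = [0.3, 0.5]
ab_t, ab_prev = 0.5, 0.8

x_t = [math.sqrt(ab_t) * a + math.sqrt(1.0 - ab_t) * e for a, e in zip(x0, eps)]
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)      # oracle: eps_pred equals the true noise
expected = [math.sqrt(ab_prev) * a + math.sqrt(1.0 - ab_prev) * e for a, e in zip(x0, eps)]
```

With the oracle prediction, `x_prev` coincides with `expected` up to rounding.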

### 7.8 Autoregressive Models

#### 7.8.1 Factorization

$$p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i | x_1, x_2, \ldots, x_{i-1})$$

#### 7.8.2 Transformer-based Language Models

**Attention Mechanism:**
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

**Causal Masking:**
$$M_{ij} = \begin{cases} 0 & \text{if } i \geq j \\ -\infty & \text{if } i < j \end{cases}$$

$$\text{CausalAttention} = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}} + \mathbf{M}\right)\mathbf{V}$$
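
A small pure-Python sketch of masked attention shows the effect of $\mathbf{M}$: each position attends only to itself and earlier positions. The matrices are tiny made-up examples stored as lists of rows, and the function names are assumptions. Real implementations use batched tensor operations.

```python
import math

def softmax(row):
    m = max(row)                                  # finite: the diagonal is never masked
    exps = [math.exp(v - m) for v in row]         # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k) + M) V with M_ij = 0 if i >= j, else -inf."""
    n, d_k = len(Q), len(K[0])
    weights = []
    for i in range(n):
        scores = []
        for j in range(n):
            score = sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
            mask = 0.0 if i >= j else -math.inf   # block attention to future positions
            scores.append(score + mask)
        weights.append(softmax(scores))
    out = [[sum(weights[i][j] * V[j][c] for j in range(n)) for c in range(len(V[0]))]
           for i in range(n)]
    return out, weights

# Hypothetical sequence of 3 tokens with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.5], [0.5, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = causal_attention(Q, K, V)
```

The weight matrix is lower triangular, so the first output row equals the first value row and no position receives information from later tokens.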

**Next-Token Prediction Loss:**
$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(x_t | x_{<t})$$

### 7.9 Normalizing Flows

#### 7.9.1 Principle

**Change of Variables Formula:**
$$p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x})) \left|\det \frac{\partial f^{-1}}{\partial \mathbf{x}}\right|$$

**Log-likelihood:**
$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}) + \log \left|\det \frac{\partial \mathbf{z}}{\partial \mathbf{x}}\right|$$

#### 7.9.2 Flow Architectures

**Coupling Layers (RealNVP):**
$$\mathbf{y}_{1:d} = \mathbf{x}_{1:d}$$
$$\mathbf{y}_{d+1:D} = \mathbf{x}_{d+1:D} \odot \exp(s(\mathbf{x}_{1:d})) + t(\mathbf{x}_{1:d})$$

**Jacobian (Triangular):**
$$\det \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \exp\left(\sum_j s_j(\mathbf{x}_{1:d})\right)$$
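
The coupling layer and its inverse can be sketched with fixed toy functions standing in for the learned networks $s$ and $t$. The functions `scale_fn` and `shift_fn` below are hypothetical placeholders. Any functions of $\mathbf{x}_{1:d}$ would work, since they are never inverted.

```python
import math

def scale_fn(x_id):
    # Hypothetical stand-in for the learned scale network s(x_{1:d}).
    return [math.tanh(x_id[0]), 0.5 * x_id[1]]

def shift_fn(x_id):
    # Hypothetical stand-in for the learned shift network t(x_{1:d}).
    return [x_id[0] + x_id[1], x_id[0] * x_id[1]]

def coupling_forward(x, d):
    x_id, x_tr = x[:d], x[d:]
    s, t = scale_fn(x_id), shift_fn(x_id)
    y_tr = [xi * math.exp(si) + ti for xi, si, ti in zip(x_tr, s, t)]
    log_det = sum(s)                       # log |det dy/dx| = sum_j s_j(x_{1:d})
    return x_id + y_tr, log_det

def coupling_inverse(y, d):
    y_id, y_tr = y[:d], y[d:]
    s, t = scale_fn(y_id), shift_fn(y_id)  # y_{1:d} = x_{1:d}, so s and t are recomputable
    x_tr = [(yi - ti) * math.exp(-si) for yi, si, ti in zip(y_tr, s, t)]
    return y_id + x_tr

x = [0.3, -1.2, 0.7, 2.0]                  # D = 4, d = 2
y, log_det = coupling_forward(x, d=2)
x_rec = coupling_inverse(y, d=2)
```

The inverse recovers `x` up to rounding, and the log-determinant costs only a sum because the Jacobian is triangular.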

### 7.10 Information-Theoretic Measures

#### 7.10.1 Entropy

**Discrete:**
$$H(X) = -\sum_{x} p(x) \log p(x)$$

**Continuous (Differential Entropy):**
$$h(X) = -\int p(x) \log p(x) dx$$

#### 7.10.2 Mutual Information

$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

$$I(X; Y) = D_{KL}(p(X, Y) \| p(X)p(Y))$$

**Application in VAEs (InfoVAE):**
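$$\mathcal{L}_{InfoVAE} = \mathcal{L}_{ELBO} + \lambda I_q(X; Z)$$

Both sides are objectives to be maximized, so $\lambda > 0$ rewards mutual information between data and latents under $q$. The form shown is schematic.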

#### 7.10.3 Cross-Entropy

$$H(p, q) = -\sum_{x} p(x) \log q(x) = H(p) + D_{KL}(p \| q)$$

**As Training Loss:**
$$\mathcal{L}_{CE} = -\sum_{i} y_i \log \hat{y}_i$$

### 7.11 Unified Framework: Score-Based Generative Models

**Score Function Definition:**
$$\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$$

**Langevin Dynamics Sampling:**
$$\mathbf{x}_{t+1} = \mathbf{x}_t + \frac{\epsilon}{2}\nabla_{\mathbf{x}} \log p(\mathbf{x}_t) + \sqrt{\epsilon}\mathbf{z}_t, \quad \mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
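
The Langevin update can be sketched for a target whose score is known in closed form. For a standard normal, $\nabla_x \log p(x) = -x$. The step size, chain length, burn-in, and seed below are assumptions. With a finite step size and no accept/reject correction, the chain samples from a slightly biased approximation of the target.

```python
import math
import random

def score_standard_normal(x):
    # Score of N(0, 1): d/dx log p(x) = -x.
    return -x

def langevin_chain(n_steps, step=0.05, x0=3.0, seed=0):
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # x_{t+1} = x_t + (eps / 2) * score(x_t) + sqrt(eps) * z_t
        x = x + 0.5 * step * score_standard_normal(x) + math.sqrt(step) * noise
        samples.append(x)
    return samples

samples = langevin_chain(100_000)
tail = samples[1000:]                               # discard burn-in from the offset start
mean = sum(tail) / len(tail)
var = sum((s - mean) ** 2 for s in tail) / len(tail)
```

The empirical mean and variance should be near 0 and 1. In a score-based generative model, the exact score is replaced by a learned network $\mathbf{s}_\theta$.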

**Connection to Diffusion:**
$$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) = -\sqrt{1-\bar{\alpha}_t} \cdot \mathbf{s}_\theta(\mathbf{x}_t, t)$$

---

## Summary: Probabilistic Foundations → Generative Models

```
┌─────────────────────────────────────────────────────────────────────┐
│                    PROBABILITY FOUNDATIONS                          │
├─────────────────────────────────────────────────────────────────────┤
│ Random Variables → Distributions → Joint/Conditional → Bayes       │
└────────────────────────────────────┬────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    STOCHASTIC PROCESSES                              │
├─────────────────────────────────────────────────────────────────────┤
│ Markov Chains → Gaussian Processes → SDEs → Diffusion Processes     │
└────────────────────────────────────┬────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    GENERATIVE AI PRINCIPLES                          │
├─────────────────────────────────────────────────────────────────────┤
│ MLE → ELBO → KL Divergence → Score Matching → Reparameterization    │
└────────────────────────────────────┬────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    GENERATIVE MODEL ARCHITECTURES                    │
├─────────────────────────────────────────────────────────────────────┤
│ VAE: Latent + ELBO                                                   │
│ GAN: Adversarial + JS/Wasserstein                                   │
│ Diffusion: Score + SDE                                              │
│ Flow: Bijection + Change of Variables                               │
│ Autoregressive: Chain Rule + Attention                              │
└─────────────────────────────────────────────────────────────────────┘
```

---

## Key Mathematical Relationships

| Concept | Equation | Gen-AI Application |
|---------|----------|-------------------|
| Bayes' Theorem | $p(\mathbf{z} \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathbf{z})p(\mathbf{z})}{p(\mathbf{x})}$ | Posterior in VAEs |
| Chain Rule | $p(\mathbf{x}) = \prod_i p(x_i \mid x_{<i})$ | Autoregressive LLMs |
| ELBO | $\log p(\mathbf{x}) \geq \mathbb{E}_q[\log p(\mathbf{x} \mid \mathbf{z})] - D_{KL}(q \parallel p)$ | VAE Training |
| Score | $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ | Diffusion Models |
| Change of Variables | $p_x = p_z \cdot \lvert \det J^{-1} \rvert$ | Normalizing Flows |
| Langevin Dynamics | $\mathbf{x}_{t+1} = \mathbf{x}_t + \frac{\epsilon}{2}\mathbf{s} + \sqrt{\epsilon}\mathbf{z}$ | MCMC Sampling |

# Probability and Statistics for Generative AI: Key Probability Distributions

---

## 1. Foundational Probability Theory for Generative AI

### 1.1 Definition of Generative Modeling from Probabilistic Perspective

Generative AI fundamentally aims to learn the underlying probability distribution $p_{data}(x)$ of a dataset and generate new samples from an approximated distribution $p_{model}(x)$. The core objective is:

$$p_{model}(x; \theta) \approx p_{data}(x)$$

where $\theta$ represents learnable parameters of the generative model.

### 1.2 Probability Axioms (Kolmogorov Axioms)

For a sample space $\Omega$ and event $A$:

1. **Non-negativity**: $P(A) \geq 0$
2. **Normalization**: $P(\Omega) = 1$
3. **Additivity**: For mutually exclusive events $A_1, A_2, ...$:

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$

### 1.3 Conditional Probability

$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

**Relevance to Gen-AI**: Autoregressive models (GPT, LLaMA) generate sequences by modeling:

$$P(x_1, x_2, ..., x_n) = \prod_{t=1}^{n} P(x_t | x_1, x_2, ..., x_{t-1})$$

### 1.4 Bayes' Theorem

$$P(\theta|X) = \frac{P(X|\theta) P(\theta)}{P(X)} = \frac{P(X|\theta) P(\theta)}{\int P(X|\theta) P(\theta) d\theta}$$

| Component | Term | Role in Gen-AI |
|-----------|------|----------------|
| $P(\theta\|X)$ | Posterior | Updated belief about model parameters |
| $P(X\|\theta)$ | Likelihood | How well model explains data |
| $P(\theta)$ | Prior | Initial belief about parameters |
| $P(X)$ | Evidence/Marginal Likelihood | Normalization constant |

### 1.5 Joint and Marginal Distributions

**Joint Distribution**:
$$P(X, Z) = P(X|Z)P(Z)$$

**Marginalization**:
$$P(X) = \int P(X, Z) dZ = \int P(X|Z)P(Z) dZ$$

**Critical for Gen-AI**: Latent variable models (VAEs) rely on marginalization over latent space $Z$.

### 1.6 Independence and Conditional Independence

**Independence**:
$$P(X, Y) = P(X)P(Y)$$

**Conditional Independence** ($X \perp Y | Z$):
$$P(X, Y | Z) = P(X|Z)P(Y|Z)$$

---

## 2. Statistical Estimators in Generative AI

### 2.1 Maximum Likelihood Estimation (MLE)

**Definition**: Find parameters $\theta$ that maximize the probability of observed data.

$$\theta_{MLE} = \arg\max_{\theta} \prod_{i=1}^{N} p(x_i | \theta) = \arg\max_{\theta} \sum_{i=1}^{N} \log p(x_i | \theta)$$

**Log-Likelihood Objective**:
$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim p_{data}}[\log p_{model}(x; \theta)]$$

### 2.2 Maximum A Posteriori (MAP) Estimation

$$\theta_{MAP} = \arg\max_{\theta} P(\theta|X) = \arg\max_{\theta} [\log P(X|\theta) + \log P(\theta)]$$

The term $\log P(\theta)$ acts as regularization.
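
The regularizing effect of the prior is visible in a Bernoulli example. With a $\text{Beta}(a, b)$ prior on the success probability (an assumption of this sketch, with $a, b > 1$), the MAP estimate is the mode of the $\text{Beta}(a + k, b + n - k)$ posterior, where $k$ is the number of successes in $n$ trials. The data below are made up.

```python
def bernoulli_mle(data):
    """MLE of the success probability: the sample mean."""
    return sum(data) / len(data)

def bernoulli_map(data, a, b):
    """MAP estimate under a Beta(a, b) prior, valid for a, b > 1.

    This is the mode of the Beta(a + k, b + n - k) posterior.
    """
    k, n = sum(data), len(data)
    return (k + a - 1.0) / (n + a + b - 2.0)

data = [1, 1, 1, 0]                               # hypothetical coin flips: 3 heads, 1 tail
theta_mle = bernoulli_mle(data)                   # 3 / 4
theta_map = bernoulli_map(data, a=2.0, b=2.0)     # (3 + 1) / (4 + 2)
```

The prior pulls the estimate from 0.75 toward 0.5. As the number of observations grows, the two estimates converge.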

### 2.3 Expectation and Variance

**Expectation**:
$$\mathbb{E}[X] = \int x \cdot p(x) dx \quad \text{(continuous)}$$
$$\mathbb{E}[X] = \sum_{x} x \cdot P(x) \quad \text{(discrete)}$$

**Variance**:
$$\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

**Covariance**:
$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$$

---

## 3. Information-Theoretic Foundations

### 3.1 Entropy

**Definition**: Measures uncertainty/randomness in a distribution.

**Discrete Entropy**:
$$H(X) = -\sum_{x} P(x) \log P(x)$$

**Differential Entropy (Continuous)**:
$$h(X) = -\int p(x) \log p(x) dx$$

**Properties**:
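- $H(X) \geq 0$ for discrete distributions (differential entropy $h(X)$ can be negative)
- Among distributions on a finite set of $K$ outcomes, the uniform distribution has maximum entropy, $H(X) = \log K$
- Among continuous distributions on $\mathbb{R}$ with a fixed variance, the Gaussian has maximum differential entropy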

### 3.2 Kullback-Leibler (KL) Divergence

**Definition**: Measures dissimilarity between two distributions.

$$D_{KL}(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} dx = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$$

**Properties**:
- $D_{KL}(p \| q) \geq 0$ (Gibbs' inequality)
- $D_{KL}(p \| q) = 0$ iff $p = q$
- **Asymmetric**: in general, $D_{KL}(p \| q) \neq D_{KL}(q \| p)$

**Decomposition**:
$$D_{KL}(p \| q) = H(p, q) - H(p)$$

where $H(p, q)$ is cross-entropy.

### 3.3 Cross-Entropy

$$H(p, q) = -\mathbb{E}_{x \sim p}[\log q(x)] = -\sum_{x} p(x) \log q(x)$$

**Connection to MLE**:
$$\min_{\theta} H(p_{data}, p_{model}) = \min_{\theta} -\mathbb{E}_{x \sim p_{data}}[\log p_{model}(x; \theta)]$$
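
The relationship between entropy, cross-entropy, and KL divergence can be checked numerically. The following is a small sketch for discrete distributions given as probability lists; the two example distributions are arbitrary.

```python
import math

def entropy(p):
    """H(p) = -sum p(x) log p(x), skipping zero-probability outcomes."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
decomposed = cross_entropy(p, q) - entropy(p)

print(kl_pq, kl_qp, decomposed)
```

Because $H(p)$ does not depend on the model, minimizing cross-entropy in $q$ is the same as minimizing $D_{KL}(p \| q)$, which is why the cross-entropy loss corresponds to maximum likelihood.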

### 3.4 Jensen-Shannon Divergence

$$D_{JS}(p \| q) = \frac{1}{2} D_{KL}(p \| m) + \frac{1}{2} D_{KL}(q \| m)$$

where $m = \frac{1}{2}(p + q)$

**Properties**:
- Symmetric: $D_{JS}(p \| q) = D_{JS}(q \| p)$
- Bounded: $0 \leq D_{JS} \leq \log 2$
- Used in original GAN formulation
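
A short sketch of the Jensen-Shannon divergence for discrete distributions, illustrating symmetry and the upper bound $\log 2$ (reached when the supports are disjoint). The example distributions are arbitrary.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions; terms with p(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """D_JS(p || q) using the mixture m = (p + q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.5, 0.3, 0.2]
q = [0.1, 0.1, 0.8]
js_pq = js_divergence(p, q)
js_qp = js_divergence(q, p)

# Disjoint supports give the maximum value log 2.
js_max = js_divergence([1.0, 0.0], [0.0, 1.0])

print(js_pq, js_qp, js_max, math.log(2))
```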

### 3.5 Mutual Information

$$I(X; Z) = D_{KL}(p(X, Z) \| p(X)p(Z)) = H(X) - H(X|Z)$$

**Application in Gen-AI**: InfoGAN maximizes mutual information between latent codes and generated samples.

---

## 4. Key Probability Distributions in Generative AI

---

### 4.1 Bernoulli Distribution

**Definition**: Models binary random variable $X \in \{0, 1\}$.

**Probability Mass Function (PMF)**:
$$P(X = x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}$$

**Parameters**: $p \in [0, 1]$ (success probability)

**Moments**:
$$\mathbb{E}[X] = p, \quad \text{Var}(X) = p(1-p)$$

**Applications in Gen-AI**:
- Binary image generation (binary cross-entropy loss)
- Dropout regularization during training
- Bernoulli VAE decoders for binary data

**Bernoulli VAE Reconstruction Loss**:
$$\mathcal{L}_{recon} = -\sum_{i=1}^{D} [x_i \log \hat{x}_i + (1-x_i) \log(1-\hat{x}_i)]$$

---

### 4.2 Categorical Distribution

**Definition**: Generalization of Bernoulli to $K$ discrete outcomes.

**PMF**:
$$P(X = k) = \pi_k, \quad k \in \{1, 2, ..., K\}$$

where $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k \geq 0$

**One-hot Encoding Formulation**:
$$P(x | \pi) = \prod_{k=1}^{K} \pi_k^{x_k}$$

where $x = [x_1, ..., x_K]$ is one-hot vector.

**Moments**:
$$\mathbb{E}[X_k] = \pi_k, \quad \text{Var}(X_k) = \pi_k(1-\pi_k)$$

**Applications in Gen-AI**:
- **Language Models**: Token prediction at each timestep
- **VQ-VAE**: Discrete codebook selection
- **Discrete Diffusion Models**: Categorical state transitions

**Softmax Parameterization**:
$$\pi_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}$$

where $z_k$ are logits from neural network.
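
A minimal softmax sketch follows. Subtracting the maximum logit before exponentiating is a common numerical-stability step and leaves the result unchanged, since softmax is invariant to adding a constant to all logits. The logits are arbitrary example values.

```python
import math

def softmax(logits):
    """Map logits z_k to probabilities pi_k = exp(z_k) / sum_j exp(z_j)."""
    m = max(logits)                          # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
shifted = softmax([1002.0, 1001.0, 1000.1])  # same logits plus 1000

print(probs)
print(shifted)
```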

---

### 4.3 Multinomial Distribution

**Definition**: Distribution over counts of $K$ categories in $n$ trials.

**PMF**:
$$P(x_1, ..., x_K) = \frac{n!}{x_1! \cdots x_K!} \prod_{k=1}^{K} \pi_k^{x_k}$$

where $\sum_{k=1}^{K} x_k = n$

**Moments**:
$$\mathbb{E}[X_k] = n\pi_k, \quad \text{Var}(X_k) = n\pi_k(1-\pi_k)$$
$$\text{Cov}(X_i, X_j) = -n\pi_i\pi_j \quad (i \neq j)$$

**Applications in Gen-AI**:
- Bag-of-words text generation
- Topic modeling (LDA)
- Multi-label classification in conditional generation

---

### 4.4 Poisson Distribution

**Definition**: Models count of events in fixed interval.

**PMF**:
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k \in \{0, 1, 2, ...\}$$

**Parameters**: $\lambda > 0$ (rate parameter)

**Moments**:
$$\mathbb{E}[X] = \lambda, \quad \text{Var}(X) = \lambda$$

**Applications in Gen-AI**:
- Modeling sequence lengths in text generation
- Point process generation
- Sparse representation learning

---

### 4.5 Gaussian (Normal) Distribution

**Definition**: Most fundamental continuous distribution in Gen-AI.

#### 4.5.1 Univariate Gaussian

**Probability Density Function (PDF)**:
$$p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

**Notation**: $X \sim \mathcal{N}(\mu, \sigma^2)$

**Parameters**:
- $\mu \in \mathbb{R}$: Mean
- $\sigma^2 > 0$: Variance

**Moments**:
$$\mathbb{E}[X] = \mu, \quad \text{Var}(X) = \sigma^2$$

**Standard Normal**: $Z \sim \mathcal{N}(0, 1)$

**Standardization**:
$$Z = \frac{X - \mu}{\sigma}$$

#### 4.5.2 Multivariate Gaussian

**PDF**:
$$p(x | \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

**Notation**: $X \sim \mathcal{N}(\mu, \Sigma)$

**Parameters**:
- $\mu \in \mathbb{R}^D$: Mean vector
- $\Sigma \in \mathbb{R}^{D \times D}$: Covariance matrix (symmetric positive definite)

**Mahalanobis Distance**:
$$d_M(x, \mu) = \sqrt{(x-\mu)^T \Sigma^{-1} (x-\mu)}$$

**Properties**:
1. **Marginals are Gaussian**: If $(X_1, X_2)^T \sim \mathcal{N}$, then $X_1 \sim \mathcal{N}$
2. **Conditionals are Gaussian**: $p(X_1 | X_2) = \mathcal{N}(\mu_{1|2}, \Sigma_{1|2})$
3. **Linear transformations preserve Gaussianity**: $AX + b \sim \mathcal{N}(A\mu + b, A\Sigma A^T)$

**Conditional Distribution Formulas**:

For $\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right)$:

$$\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$$
$$\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$
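
For a bivariate Gaussian every block is a scalar, so the conditioning formulas reduce to ordinary arithmetic. The sketch below assumes that scalar case; the function name and the numeric values are illustrative only.

```python
def conditional_gaussian_2d(mu1, mu2, s11, s12, s22, x2):
    """Mean and variance of X1 | X2 = x2 for a bivariate Gaussian.

    s11, s12, s22 are the entries of the 2x2 covariance matrix
    (s21 equals s12 by symmetry).
    """
    mean = mu1 + s12 / s22 * (x2 - mu2)
    var = s11 - s12 / s22 * s12
    return mean, var

mean, var = conditional_gaussian_2d(
    mu1=0.0, mu2=1.0, s11=2.0, s12=0.8, s22=1.0, x2=2.0
)
print(mean, var)
```

Observing $X_2$ above its mean shifts the conditional mean of $X_1$ in the direction of the covariance, and the conditional variance is never larger than the marginal variance $\Sigma_{11}$.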

#### 4.5.3 Applications in Gen-AI

| Application | Role of Gaussian |
|-------------|------------------|
| **VAE Prior** | $p(z) = \mathcal{N}(0, I)$ |
| **VAE Encoder** | $q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma_\phi^2(x)))$ |
| **Diffusion Forward Process** | $q(x_t \mid x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}x_{t-1}, \beta_t I)$ |
| **Reparameterization** | $z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$ |

#### 4.5.4 KL Divergence Between Two Gaussians

For $p = \mathcal{N}(\mu_1, \Sigma_1)$ and $q = \mathcal{N}(\mu_2, \Sigma_2)$:

$$D_{KL}(p \| q) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - D + \text{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2-\mu_1)^T\Sigma_2^{-1}(\mu_2-\mu_1)\right]$$

**Special Case (VAE)**: For $p = \mathcal{N}(\mu, \text{diag}(\sigma_1^2, ..., \sigma_D^2))$ and $q = \mathcal{N}(0, I)$:

$$D_{KL}(p \| q) = \frac{1}{2}\sum_{j=1}^{D}\left[\sigma_j^2 + \mu_j^2 - 1 - \log\sigma_j^2\right]$$
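
The closed-form VAE term can be cross-checked against the general Gaussian formula applied per dimension. This is a sketch assuming a diagonal covariance, so the multivariate expression reduces to a sum of univariate KL terms; the function names and input values are hypothetical.

```python
import math

def kl_diag_gaussian_to_standard(mu, sigma2):
    """KL( N(mu, diag(sigma2)) || N(0, I) ) using the special-case formula."""
    return 0.5 * sum(s + m * m - 1.0 - math.log(s) for m, s in zip(mu, sigma2))

def kl_univariate(mu1, var1, mu2, var2):
    """General Gaussian KL formula specialised to D = 1."""
    return 0.5 * (math.log(var2 / var1) - 1.0 + var1 / var2 + (mu2 - mu1) ** 2 / var2)

mu = [0.5, -1.0]
sigma2 = [0.25, 2.0]

kl_vae = kl_diag_gaussian_to_standard(mu, sigma2)
kl_sum = sum(kl_univariate(m, s, 0.0, 1.0) for m, s in zip(mu, sigma2))

print(kl_vae, kl_sum)
```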

---

### 4.6 Uniform Distribution

#### 4.6.1 Continuous Uniform

**PDF**:
$$p(x | a, b) = \frac{1}{b-a}, \quad x \in [a, b]$$

**Moments**:
$$\mathbb{E}[X] = \frac{a+b}{2}, \quad \text{Var}(X) = \frac{(b-a)^2}{12}$$

#### 4.6.2 Discrete Uniform

**PMF**:
$$P(X = k) = \frac{1}{n}, \quad k \in \{1, 2, ..., n\}$$

**Applications in Gen-AI**:
- **GAN Latent Space**: $z \sim \text{Uniform}(-1, 1)^D$
- **Random Sampling**: Index selection for mini-batches
- **Data Augmentation**: Random crop positions
- **Noise Injection**: Uniform noise in certain architectures

---

### 4.7 Exponential Distribution

**PDF**:
$$p(x | \lambda) = \lambda e^{-\lambda x}, \quad x \geq 0$$

**Parameters**: $\lambda > 0$ (rate parameter)

**Moments**:
$$\mathbb{E}[X] = \frac{1}{\lambda}, \quad \text{Var}(X) = \frac{1}{\lambda^2}$$

**Memoryless Property**:
$$P(X > s + t | X > s) = P(X > t)$$

**Applications in Gen-AI**:
- Modeling inter-arrival times in temporal generation
- Learning rate scheduling
- Exponential family connections

---

### 4.8 Beta Distribution

**Definition**: Distribution over probabilities $x \in [0, 1]$.

**PDF**:
$$p(x | \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$$

where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ is the Beta function.

**Parameters**: $\alpha > 0$, $\beta > 0$ (shape parameters)

**Moments**:
$$\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta}, \quad \text{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

**Special Cases**:
- $\alpha = \beta = 1$: Uniform distribution
- $\alpha = \beta$: Symmetric around 0.5
- $\alpha > \beta$: Mass concentrated toward 1 (left-skewed, negative skewness)
- $\alpha < \beta$: Mass concentrated toward 0 (right-skewed, positive skewness)

**Applications in Gen-AI**:
- **Prior for Bernoulli parameters** (conjugate prior)
- **Mixup data augmentation**: $\lambda \sim \text{Beta}(\alpha, \alpha)$
- **Dropout probability tuning**
- **Interpolation weights in latent space**

**Beta Distribution in Mixup**:
$$\tilde{x} = \lambda x_i + (1-\lambda) x_j, \quad \lambda \sim \text{Beta}(\alpha, \alpha)$$
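
A minimal Mixup sketch using the standard library's `random.betavariate` to draw $\lambda$. Inputs are plain lists standing in for feature vectors; the function name `mixup` and the value $\alpha = 0.4$ are illustrative assumptions.

```python
import random

def mixup(x_i, x_j, alpha, rng):
    """Convex combination of two inputs with lambda ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    return mixed, lam

rng = random.Random(0)          # seeded for reproducibility
x_i = [1.0, 0.0, 2.0]
x_j = [0.0, 1.0, 4.0]
mixed, lam = mixup(x_i, x_j, alpha=0.4, rng=rng)

print(lam, mixed)
```

With $\alpha < 1$ the Beta density is U-shaped, so most draws of $\lambda$ lie near 0 or 1 and the mixed sample stays close to one of the two inputs.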

---

### 4.9 Gamma Distribution

**PDF**:
$$p(x | \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0$$

**Parameters**:
- $\alpha > 0$: Shape parameter
- $\beta > 0$: Rate parameter

**Alternative Parameterization** (shape-scale):
$$p(x | k, \theta) = \frac{1}{\Gamma(k)\theta^k} x^{k-1} e^{-x/\theta}$$

**Moments**:
$$\mathbb{E}[X] = \frac{\alpha}{\beta}, \quad \text{Var}(X) = \frac{\alpha}{\beta^2}$$

**Relationships**:
- Exponential is Gamma with $\alpha = 1$
- Chi-squared is Gamma with $\alpha = k/2$, $\beta = 1/2$

**Applications in Gen-AI**:
- Prior for precision parameters
- Modeling positive-valued latent variables
- Bayesian neural network priors

---

### 4.10 Dirichlet Distribution

**Definition**: Multivariate generalization of Beta; a distribution over the probability simplex.

**PDF**:
$$p(x | \alpha) = \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} x_k^{\alpha_k - 1}$$

**Constraint**: $\sum_{k=1}^{K} x_k = 1$, $x_k \geq 0$

**Parameters**: $\alpha = (\alpha_1, ..., \alpha_K)$, $\alpha_k > 0$

**Moments**:
$$\mathbb{E}[X_k] = \frac{\alpha_k}{\alpha_0}, \quad \alpha_0 = \sum_{j=1}^{K} \alpha_j$$
$$\text{Var}(X_k) = \frac{\alpha_k(\alpha_0 - \alpha_k)}{\alpha_0^2(\alpha_0 + 1)}$$

**Applications in Gen-AI**:
- **Topic Modeling (LDA)**: Prior over topic distributions
- **Mixture Models**: Prior over mixture weights
- **Attention Mechanism Analysis**: Modeling attention weight distributions

---

### 4.11 Laplace Distribution

**PDF**:
$$p(x | \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right)$$

**Parameters**:
- $\mu$: Location (mean)
- $b > 0$: Scale

**Moments**:
$$\mathbb{E}[X] = \mu, \quad \text{Var}(X) = 2b^2$$

**Applications in Gen-AI**:
- **Sparse Latent Representations**: Encourages sparsity (L1 regularization connection)
- **Robust Generation**: Heavier tails than Gaussian
- **Flow-based Models**: Sometimes used as base distribution

**Connection to L1 Regularization**:
$$p(w) = \frac{\lambda}{2} \exp(-\lambda|w|) \Rightarrow \text{L1 penalty: } \lambda|w|$$

---

### 4.12 Student's t-Distribution

**PDF**:
$$p(x | \nu, \mu, \sigma) = \frac{\Gamma(\frac{\nu+1}{2})}{\Gamma(\frac{\nu}{2})\sqrt{\nu\pi}\sigma} \left(1 + \frac{1}{\nu}\left(\frac{x-\mu}{\sigma}\right)^2\right)^{-\frac{\nu+1}{2}}$$

**Parameters**:
- $\nu > 0$: Degrees of freedom
- $\mu$: Location
- $\sigma > 0$: Scale

**Moments** (mean defined for $\nu > 1$, variance for $\nu > 2$):
$$\mathbb{E}[X] = \mu, \quad \text{Var}(X) = \frac{\nu}{\nu-2}\sigma^2$$

**Properties**:
- Heavier tails than Gaussian
- As $\nu \to \infty$, converges to Gaussian
- Robust to outliers

**Applications in Gen-AI**:
- **Robust VAEs**: Replace Gaussian with t-distribution for robustness
- **Outlier-resistant generation**
- **Uncertainty quantification**

---

### 4.13 Log-Normal Distribution

**Definition**: If $Y = \log(X) \sim \mathcal{N}(\mu, \sigma^2)$, then $X \sim \text{LogNormal}(\mu, \sigma^2)$.

**PDF**:
$$p(x | \mu, \sigma) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right), \quad x > 0$$

**Moments**:
$$\mathbb{E}[X] = \exp\left(\mu + \frac{\sigma^2}{2}\right)$$
$$\text{Var}(X) = \left[\exp(\sigma^2) - 1\right] \exp(2\mu + \sigma^2)$$

**Applications in Gen-AI**:
- Modeling positive-valued data (image intensities, word frequencies)
- Variance modeling in heteroscedastic generation
- Scale parameters in hierarchical models

---

### 4.14 Von Mises Distribution (Circular/Angular)

**PDF**:
$$p(\theta | \mu, \kappa) = \frac{\exp(\kappa \cos(\theta - \mu))}{2\pi I_0(\kappa)}$$

where $I_0(\kappa)$ is the modified Bessel function of the first kind of order 0.

**Parameters**:
- $\mu \in [-\pi, \pi]$: Mean direction
- $\kappa \geq 0$: Concentration parameter

**Applications in Gen-AI**:
- **Pose generation**: Modeling joint angles
- **Directional data**: Wind direction, molecular conformations
- **3D human motion generation**

---

### 4.15 Wishart Distribution

**Definition**: Distribution over symmetric positive-definite matrices.

**PDF**:
$$p(W | V, n) = \frac{|W|^{(n-D-1)/2} \exp\left(-\frac{1}{2}\text{tr}(V^{-1}W)\right)}{2^{nD/2}|V|^{n/2}\Gamma_D(n/2)}$$

where $\Gamma_D$ is the multivariate gamma function.

**Parameters**:
- $V$: Scale matrix ($D \times D$ positive-definite)
- $n \geq D$: Degrees of freedom

**Mean**:
$$\mathbb{E}[W] = nV$$

**Applications in Gen-AI**:
- Prior for covariance matrices in Gaussian models
- Bayesian deep learning for weight uncertainty
- Structured covariance learning

---

## 5. Mixture Distributions in Generative AI

### 5.1 Gaussian Mixture Model (GMM)

**PDF**:
$$p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x | \mu_k, \Sigma_k)$$

**Constraints**:
$$\sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \geq 0$$

**Latent Variable Formulation**:
$$z \sim \text{Categorical}(\pi_1, ..., \pi_K)$$
$$x | z = k \sim \mathcal{N}(\mu_k, \Sigma_k)$$

**Log-Likelihood**:
$$\log p(X | \theta) = \sum_{n=1}^{N} \log \left[\sum_{k=1}^{K} \pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k)\right]$$

**EM Algorithm for GMM**:

**E-step** (Compute responsibilities):
$$\gamma_{nk} = \frac{\pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)}$$

**M-step** (Update parameters):
$$N_k = \sum_{n=1}^{N} \gamma_{nk}$$
$$\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} x_n$$
$$\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T$$
$$\pi_k^{new} = \frac{N_k}{N}$$
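
The E-step and M-step above can be sketched for a one-dimensional mixture, where each $\Sigma_k$ is a scalar variance. This is an illustrative implementation under that assumption, with synthetic data, arbitrary initial values, and a small variance floor added as a practical safeguard that is not part of the update equations.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, pis, mus, variances):
    return sum(
        math.log(sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)))
        for x in xs
    )

def em_step(xs, pis, mus, variances):
    n_points, n_comp = len(xs), len(pis)
    # E-step: responsibilities gamma[n][k]
    gamma = []
    for x in xs:
        weighted = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)]
        total = sum(weighted)
        gamma.append([w / total for w in weighted])
    # M-step: re-estimate weights, means, and variances
    new_pis, new_mus, new_vars = [], [], []
    for k in range(n_comp):
        n_k = sum(g[k] for g in gamma)
        mu_k = sum(g[k] * x for g, x in zip(gamma, xs)) / n_k
        var_k = sum(g[k] * (x - mu_k) ** 2 for g, x in zip(gamma, xs)) / n_k
        new_pis.append(n_k / n_points)
        new_mus.append(mu_k)
        new_vars.append(max(var_k, 1e-6))  # variance floor (safeguard, an assumption)
    return new_pis, new_mus, new_vars

rng = random.Random(0)
xs = [rng.gauss(-2.0, 0.5) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]

pis, mus, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
history = [log_likelihood(xs, pis, mus, variances)]
for _ in range(30):
    pis, mus, variances = em_step(xs, pis, mus, variances)
    history.append(log_likelihood(xs, pis, mus, variances))

print(pis, mus, variances)
```

Tracking the log-likelihood across iterations is a useful diagnostic, because each EM iteration should not decrease it.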

**Applications in Gen-AI**:
- VQ-VAE prior modeling
- Multi-modal generation
- Density estimation for likelihood-based models

### 5.2 Mixture of Experts (MoE)

$$p(y|x) = \sum_{k=1}^{K} g_k(x) \cdot p_k(y|x)$$

where $g_k(x)$ is gating network output (softmax over experts).

---

## 6. Probability Distributions in Specific Gen-AI Architectures

### 6.1 Variational Autoencoders (VAEs)

**Generative Model**:
$$p_\theta(x, z) = p_\theta(x|z) p(z)$$

**Prior**: $p(z) = \mathcal{N}(0, I)$

**Approximate Posterior**: $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma_\phi^2(x)))$

**Evidence Lower Bound (ELBO)**:
$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z))$$

**Reparameterization Trick**:
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

**Decoder Distribution Choices**:

| Data Type | Distribution | Loss |
|-----------|--------------|------|
| Real-valued | Gaussian $\mathcal{N}(\mu_\theta(z), \sigma^2 I)$ | MSE |
| Binary | Bernoulli | Binary Cross-Entropy |
| Categorical | Categorical | Cross-Entropy |

### 6.2 Generative Adversarial Networks (GANs)

**Implicit Distribution**: GANs define distributions implicitly through transformation.

$$z \sim p_z(z) = \mathcal{N}(0, I) \text{ or } \text{Uniform}(-1, 1)$$
$$x = G_\theta(z)$$

The generated distribution $p_g(x)$ is implicitly defined.

**Original GAN Objective**:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

**Optimal Discriminator**:
$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$$

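**At Optimality** (with $D = D^*$):
$$V(D^*, G) = 2 \cdot D_{JS}(p_{data} \| p_g) - \log 4$$

This is minimized over $G$ when $p_g = p_{data}$, where $D_{JS} = 0$ and the value equals $-\log 4$.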
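
The identity linking the value function at the optimal discriminator to the Jensen-Shannon divergence can be verified numerically. The sketch below assumes discrete distributions with full support, so expectations become finite sums; the two distributions are arbitrary.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_discriminator(p_data, p_g):
    """E_{p_data}[log D*(x)] + E_{p_g}[log(1 - D*(x))] with D* = p_data / (p_data + p_g)."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        d_star = pd / (pd + pg)
        total += pd * math.log(d_star) + pg * math.log(1.0 - d_star)
    return total

p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]

v = value_at_optimal_discriminator(p_data, p_g)
v_from_js = 2.0 * js(p_data, p_g) - math.log(4.0)
v_matched = value_at_optimal_discriminator(p_data, p_data)  # generator matches data

print(v, v_from_js, v_matched, -math.log(4.0))
```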

### 6.3 Diffusion Models

#### 6.3.1 Forward Process (Adding Noise)

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$

**Marginal at time $t$**:
$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) I)$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$

**Direct Sampling**:
$$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
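
A sketch of direct sampling from $q(x_t | x_0)$. The linear schedule from $10^{-4}$ to $0.02$ over $T = 1000$ steps is an assumed example setting, not something specified above. The final loop checks that iterating the one-step variance recursion $v_t = \alpha_t v_{t-1} + \beta_t$ reproduces the closed-form variance $1 - \bar{\alpha}_t$.

```python
import math
import random

T = 1000
# Linear beta schedule (assumed values for illustration).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# Cumulative products: alpha_bars[t - 1] holds alpha_bar_t.
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed as in the text."""
    a_bar = alpha_bars[t - 1]
    return [
        math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
        for x in x0
    ]

# Noise variance accumulated by applying q(x_t | x_{t-1}) step by step.
stepwise_var = 0.0
for beta in betas:
    stepwise_var = (1.0 - beta) * stepwise_var + beta

rng = random.Random(0)
x_t = q_sample([1.0, -0.5, 0.25], t=500, rng=rng)
print(alpha_bars[-1], stepwise_var, x_t)
```

Sampling $x_t$ directly avoids simulating all intermediate steps, which is what makes training on randomly chosen timesteps cheap.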

#### 6.3.2 Reverse Process (Denoising)

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

**True Posterior (Tractable)**:
$$q(x_{t-1} | x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)$$

where:
$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1-\bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} x_t$$
$$\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$$

#### 6.3.3 Training Objective

**Simplified Loss** (DDPM):
$$\mathcal{L}_{simple} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

### 6.4 Autoregressive Language Models

**Factorization**:
$$p(x_1, x_2, ..., x_T) = \prod_{t=1}^{T} p(x_t | x_{<t})$$

**Token Distribution**:
$$p(x_t | x_{<t}) = \text{Categorical}(\text{softmax}(W h_t + b))$$

where $h_t$ is hidden state from transformer.

**Temperature Scaling**:
$$p(x_t = k | x_{<t}) = \frac{\exp(z_k / \tau)}{\sum_j \exp(z_j / \tau)}$$

- $\tau < 1$: Sharper distribution (more deterministic)
- $\tau > 1$: Flatter distribution (more random)
- $\tau = 1$: Original distribution

**Top-k Sampling**:
$$p'(x_t) \propto \begin{cases} p(x_t) & \text{if } x_t \in \text{top-}k \\ 0 & \text{otherwise} \end{cases}$$

**Nucleus (Top-p) Sampling**:
$$p'(x_t) \propto \begin{cases} p(x_t) & \text{if } x_t \in V_p \\ 0 & \text{otherwise} \end{cases}$$

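where $V_p$ is the smallest set of highest-probability tokens whose cumulative probability reaches the threshold $p$: $V_p = \arg\min_{V'} \{|V'| : \sum_{x \in V'} p(x_t = x | x_{<t}) \geq p\}$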
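
Temperature scaling, top-k, and nucleus filtering can be sketched on a toy vocabulary of four tokens. The function names and logits are illustrative; production decoders operate on tensors, but the logic is the same.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest high-probability prefix whose mass reaches p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.5, -1.0]
base = softmax(logits)
sharp = softmax(logits, temperature=0.5)   # tau < 1: sharper
flat = softmax(logits, temperature=2.0)    # tau > 1: flatter
top2 = top_k_filter(base, k=2)
nucleus = top_p_filter(base, p=0.9)

print(base, sharp, flat, top2, nucleus, sep="\n")
```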

### 6.5 Normalizing Flows

**Definition**: Transform simple distribution to complex through invertible mappings.

$$x = f_\theta(z), \quad z \sim p_z(z)$$

**Change of Variables**:
$$p_x(x) = p_z(f_\theta^{-1}(x)) \left|\det \frac{\partial f_\theta^{-1}}{\partial x}\right|$$
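
The change-of-variables formula can be checked on the simplest invertible map, a one-dimensional affine flow $x = az + b$ with a standard normal base (an assumed toy case). Here $f^{-1}(x) = (x - b)/a$, so the Jacobian of the inverse is $1/a$, and the result should equal the density of $\mathcal{N}(b, a^2)$.

```python
import math

def standard_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def normal_logpdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def affine_flow_logpdf(x, a, b):
    """log p_x(x) for x = a * z + b with z ~ N(0, 1), via change of variables."""
    z = (x - b) / a                           # f^{-1}(x)
    log_abs_det_inverse = -math.log(abs(a))   # log |d f^{-1} / dx| = log |1 / a|
    return standard_normal_logpdf(z) + log_abs_det_inverse

a, b, x = 2.0, 1.0, 0.3
flow_value = affine_flow_logpdf(x, a, b)
reference = normal_logpdf(x, mean=b, std=abs(a))

print(flow_value, reference)
```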

**Log-likelihood**:
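$$\log p_x(x) = \log p_z(z) - \sum_{i=1}^{K} \log \left|\det \frac{\partial f_i}{\partial z_{i-1}}\right|$$

for the composition $f = f_K \circ ... \circ f_1$, with $z_0 = z$, $z_i = f_i(z_{i-1})$, and $x = z_K$. The sign is negative because these are Jacobians of the forward maps $f_i$, whereas the change-of-variables formula uses the Jacobian of the inverse.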

**Flow Types and Their Jacobians**:

| Flow Type | Jacobian Structure | Complexity |
|-----------|-------------------|------------|
| Planar | Rank-1 update | $O(D)$ |
| Radial | Radial | $O(D)$ |
| Coupling (RealNVP) | Triangular | $O(D)$ |
| Autoregressive (MAF) | Triangular | $O(D)$ |

---

## 7. Advanced Probabilistic Concepts for Gen-AI

### 7.1 Reparameterization Trick

**Problem**: Cannot backpropagate through stochastic sampling.

**Solution**: Express random variable as deterministic function of parameters and noise.

**General Form**:
$$z = g(\phi, \epsilon), \quad \epsilon \sim p(\epsilon)$$

**Examples**:

| Distribution | Reparameterization |
|--------------|-------------------|
| Gaussian | $z = \mu + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$ |
| Exponential | $z = -\frac{1}{\lambda} \log(1 - u), \quad u \sim \text{Uniform}(0, 1)$ |
| Gamma | Rejection sampling based |
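
The first two rows of the table can be sketched directly: each sampler draws parameter-free noise and then applies a deterministic transform, so the parameters enter only through differentiable arithmetic. Function names and parameter values are illustrative.

```python
import math
import random

def sample_gaussian(mu, sigma, rng):
    eps = rng.gauss(0.0, 1.0)          # noise independent of (mu, sigma)
    return mu + sigma * eps

def sample_exponential(lam, rng):
    u = rng.random()                   # u in [0, 1), so 1 - u lies in (0, 1]
    return -math.log(1.0 - u) / lam    # inverse-CDF transform

rng = random.Random(0)
n = 100_000
gauss_mean = sum(sample_gaussian(1.5, 0.5, rng) for _ in range(n)) / n
exp_mean = sum(sample_exponential(2.0, rng) for _ in range(n)) / n

print(gauss_mean, exp_mean)   # expected to be near 1.5 and 1 / 2.0
```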

### 7.2 Gumbel-Softmax for Discrete Distributions

**Problem**: Categorical sampling is non-differentiable.

**Gumbel-Max Trick**:
$$z = \arg\max_k [g_k + \log \pi_k]$$

where $g_k \sim \text{Gumbel}(0, 1)$

**Gumbel-Softmax (Continuous Relaxation)**:
$$y_k = \frac{\exp((\log \pi_k + g_k) / \tau)}{\sum_j \exp((\log \pi_j + g_j) / \tau)}$$

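As $\tau \to 0$, samples approach one-hot vectors distributed according to the categorical distribution; larger $\tau$ yields smoother vectors that are closer to uniform.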

**Gumbel Distribution**:
$$p(g) = \exp(-(g + \exp(-g)))$$
$$g = -\log(-\log(u)), \quad u \sim \text{Uniform}(0, 1)$$
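
A sketch of both tricks: Gumbel-max sampling should reproduce the categorical probabilities empirically, and the Gumbel-softmax relaxation returns a point on the simplex. The probabilities, temperature, and sample count are arbitrary choices.

```python
import math
import random

def sample_gumbel(rng):
    u = rng.random()
    while u == 0.0:                    # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_max(log_pi, rng):
    """Exact categorical sample: argmax_k (g_k + log pi_k)."""
    scores = [lp + sample_gumbel(rng) for lp in log_pi]
    return max(range(len(scores)), key=lambda k: scores[k])

def gumbel_softmax(log_pi, tau, rng):
    """Continuous relaxation: softmax of perturbed log-probabilities at temperature tau."""
    scores = [(lp + sample_gumbel(rng)) / tau for lp in log_pi]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
pi = [0.6, 0.3, 0.1]
log_pi = [math.log(p) for p in pi]

n = 20_000
counts = [0, 0, 0]
for _ in range(n):
    counts[gumbel_max(log_pi, rng)] += 1
freqs = [c / n for c in counts]

relaxed = gumbel_softmax(log_pi, tau=0.5, rng=rng)
print(freqs, relaxed)
```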

### 7.3 Score Function and Score Matching

**Score Function**:
$$s_\theta(x) = \nabla_x \log p_\theta(x)$$

**Score Matching Objective**:
$$\mathcal{L}_{SM} = \mathbb{E}_{p_{data}}\left[\frac{1}{2}\|s_\theta(x)\|^2 + \text{tr}(\nabla_x s_\theta(x))\right]$$

**Denoising Score Matching**:
$$\mathcal{L}_{DSM} = \mathbb{E}_{p_{data}(x), q_\sigma(\tilde{x}|x)}\left[\|s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x)\|^2\right]$$

### 7.4 Evidence Lower Bound (ELBO) Derivation

Starting from log-likelihood:
$$\log p(x) = \log \int p(x, z) dz$$

Introduce variational distribution $q(z|x)$:
$$\log p(x) = \log \int \frac{p(x, z)}{q(z|x)} q(z|x) dz$$

Apply Jensen's inequality:
$$\log p(x) \geq \int q(z|x) \log \frac{p(x, z)}{q(z|x)} dz = \mathcal{L}_{ELBO}$$

**Decomposition**:
$$\mathcal{L}_{ELBO} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))$$

**Gap Analysis**:
$$\log p(x) = \mathcal{L}_{ELBO} + D_{KL}(q(z|x) \| p(z|x))$$
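
The bound, its decomposition, and the gap can all be verified exactly when the latent variable is discrete, since the integrals become finite sums. The sketch assumes a binary latent $z$ and one fixed observation $x$; all probabilities are made-up illustrative values.

```python
import math

prior = [0.5, 0.5]          # p(z)
likelihood = [0.9, 0.2]     # p(x | z) for the fixed observation x
q = [0.7, 0.3]              # variational q(z | x), chosen arbitrarily

evidence = sum(pz * px for pz, px in zip(prior, likelihood))            # p(x)
posterior = [pz * px / evidence for pz, px in zip(prior, likelihood)]   # p(z | x)

# ELBO = E_q[log p(x, z) - log q(z | x)]
elbo = sum(qz * math.log(pz * px / qz) for qz, pz, px in zip(q, prior, likelihood))

# Decomposition: reconstruction term minus KL to the prior
recon = sum(qz * math.log(px) for qz, px in zip(q, likelihood))
kl_prior = sum(qz * math.log(qz / pz) for qz, pz in zip(q, prior))

# Gap: KL from q to the true posterior
gap = sum(qz * math.log(qz / post) for qz, post in zip(q, posterior))

print(math.log(evidence), elbo, gap)
```

The gap vanishes only when $q(z|x)$ equals the true posterior, which is why a more expressive variational family tightens the bound.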

---

## 8. Probability Distribution Selection Guide for Gen-AI

| Task | Recommended Distribution | Justification |
|------|-------------------------|---------------|
| **Continuous Latent Space** | Gaussian | Reparameterizable, closed-form KL |
| **Binary Data** | Bernoulli | Natural for binary outputs |
| **Token Generation** | Categorical | Discrete vocabulary |
| **Image Generation (pixels)** | Gaussian/Discretized Logistic | Continuous approximation |
| **Sparse Representations** | Laplace | Promotes sparsity |
| **Robust Generation** | Student's t | Heavy tails handle outliers |
| **Probability Parameters** | Beta/Dirichlet | Constrained to simplex |
| **Covariance Learning** | Wishart | Positive-definite constraint |
| **Mixture Modeling** | GMM | Multi-modal flexibility |
| **Discrete Latent** | Gumbel-Softmax | Differentiable relaxation |

---

## 9. Summary: Probabilistic Framework for Generative AI

```
┌─────────────────────────────────────────────────────────────────┐
│                    GENERATIVE AI PIPELINE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   DATA: x ~ p_data(x)                                          │
│                                                                 │
│           ↓                                                    │
│                                                                 │
│   LATENT SPACE: z ~ p(z)  [Prior: N(0,I), Uniform, etc.]       │
│                                                                 │
│           ↓                                                    │
│                                                                 │
│   TRANSFORMATION: x = f_θ(z) or p_θ(x|z)                       │
│                                                                 │
│           ↓                                                    │
│                                                                 │
│   OBJECTIVE: min D(p_data || p_model)                          │
│              [KL, JS, Wasserstein, Score Matching]              │
│                                                                 │
│           ↓                                                    │
│                                                                 │
│   SAMPLING: x_new ~ p_θ(x)                                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

**Core Distributions by Gen-AI Architecture**:

| Architecture | Key Distributions Used |
|--------------|----------------------|
| **VAE** | Gaussian (prior, posterior, likelihood) |
| **GAN** | Gaussian/Uniform (latent), Implicit (generated) |
| **Diffusion** | Gaussian (forward/reverse process) |
| **Autoregressive LM** | Categorical (token prediction) |
| **Flow** | Base distribution → transformed complex distribution |
| **VQ-VAE** | Categorical (codebook), Gaussian (encoder) |

# Expectation, Moments, Covariance, Correlation, and Convolution for Generative AI

---

## 1. Expectation (Expected Value)

### 1.1 Definition

**Expectation** is the probability-weighted average of all possible values of a random variable, representing the central tendency or "mean" of a distribution.

#### 1.1.1 Discrete Random Variable

For a discrete random variable $X$ with probability mass function $P(X = x_i)$:

$$\mathbb{E}[X] = \sum_{i} x_i \cdot P(X = x_i) = \sum_{i} x_i \cdot p(x_i)$$

#### 1.1.2 Continuous Random Variable

For a continuous random variable $X$ with probability density function $p(x)$:

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot p(x) \, dx$$

#### 1.1.3 General Form (Lebesgue Integration)

$$\mathbb{E}[X] = \int_{\Omega} X(\omega) \, dP(\omega)$$

### 1.2 Expectation of a Function

For a function $g(X)$:

**Discrete**:
$$\mathbb{E}[g(X)] = \sum_{i} g(x_i) \cdot p(x_i)$$

**Continuous**:
$$\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) \cdot p(x) \, dx$$

### 1.3 Properties of Expectation

| Property | Mathematical Expression |
|----------|------------------------|
| **Linearity** | $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$ |
| **Constant** | $\mathbb{E}[c] = c$ |
| **Non-negativity** | If $X \geq 0$, then $\mathbb{E}[X] \geq 0$ |
| **Independence** | If $X \perp Y$: $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$ |
| **Monotonicity** | If $X \leq Y$, then $\mathbb{E}[X] \leq \mathbb{E}[Y]$ |
| **Tower Property** | $\mathbb{E}[\mathbb{E}[X \mid Y]] = \mathbb{E}[X]$ |

### 1.4 Conditional Expectation

**Definition**: Expected value of $X$ given information about $Y$:

$$\mathbb{E}[X | Y = y] = \int x \cdot p(x|y) \, dx$$

**As a Random Variable**:
$$\mathbb{E}[X | Y] = g(Y)$$

where $g(y) = \mathbb{E}[X | Y = y]$

**Law of Total Expectation (Tower Property)**:
$$\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X | Y]] = \int \mathbb{E}[X | Y = y] \cdot p(y) \, dy$$

### 1.5 Multivariate Expectation

For a random vector $\mathbf{X} = (X_1, X_2, ..., X_n)^T$:

$$\mathbb{E}[\mathbf{X}] = \begin{pmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_n] \end{pmatrix} = \boldsymbol{\mu}$$

**Matrix Expectation**: For random matrix $\mathbf{A}$:
$$\mathbb{E}[\mathbf{A}]_{ij} = \mathbb{E}[A_{ij}]$$

### 1.6 Applications in Generative AI

#### 1.6.1 Training Objective (Empirical Expectation)

**True Expectation**:
$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim p_{data}}[\ell(x; \theta)]$$

**Monte Carlo Approximation**:
$$\hat{\mathcal{L}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(x_i; \theta)$$
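
A small sketch of the Monte Carlo approximation, using a case where the true expectation is known: for $X \sim \mathcal{N}(0, 1)$ and $\ell(x) = x^2$, the expectation is 1. The function name `mc_estimate` and the sample sizes are illustrative.

```python
import random

def mc_estimate(loss, sampler, n):
    """Average of loss(x_i) over n independent draws x_i from sampler()."""
    return sum(loss(sampler()) for _ in range(n)) / n

rng = random.Random(0)

def draw():
    return rng.gauss(0.0, 1.0)

def square(x):
    return x * x

estimate_small = mc_estimate(square, draw, n=100)
estimate_large = mc_estimate(square, draw, n=100_000)

print(estimate_small, estimate_large)   # the larger sample is typically closer to 1.0
```

The estimator is unbiased for any $N$, and its standard error shrinks at rate $1/\sqrt{N}$, which is the trade-off behind mini-batch size choices.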

#### 1.6.2 VAE ELBO

$$\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z))$$

#### 1.6.3 GAN Generator Objective

$$\mathcal{L}_G = \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

or non-saturating:
$$\mathcal{L}_G = -\mathbb{E}_{z \sim p_z}[\log D(G(z))]$$

#### 1.6.4 Diffusion Model Training

$$\mathcal{L}_{simple} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

---

## 2. Moments of a Distribution

### 2.1 Definition of Moments

Moments are quantitative measures that characterize the shape and properties of a probability distribution.

### 2.2 Raw Moments (Moments about Origin)

**Definition**: The $n$-th raw moment of random variable $X$:

$$\mu'_n = \mathbb{E}[X^n] = \int_{-\infty}^{\infty} x^n \cdot p(x) \, dx$$

| Order | Name | Formula | Interpretation |
|-------|------|---------|----------------|
| $n=1$ | Mean | $\mu'_1 = \mathbb{E}[X]$ | Central location |
| $n=2$ | Second raw moment | $\mu'_2 = \mathbb{E}[X^2]$ | Related to energy/power |
| $n=k$ | k-th raw moment | $\mu'_k = \mathbb{E}[X^k]$ | Higher-order statistics |

### 2.3 Central Moments (Moments about Mean)

**Definition**: The $n$-th central moment:

$$\mu_n = \mathbb{E}[(X - \mu)^n] = \mathbb{E}[(X - \mathbb{E}[X])^n]$$

| Order | Name | Formula | Interpretation |
|-------|------|---------|----------------|
| $n=1$ | First central moment | $\mu_1 = 0$ | Always zero |
| $n=2$ | Variance | $\mu_2 = \mathbb{E}[(X-\mu)^2] = \sigma^2$ | Spread/dispersion |
| $n=3$ | Third central moment | $\mu_3 = \mathbb{E}[(X-\mu)^3]$ | Asymmetry (unnormalized) |
| $n=4$ | Fourth central moment | $\mu_4 = \mathbb{E}[(X-\mu)^4]$ | Tail heaviness (unnormalized) |

### 2.4 Standardized Moments

**Definition**: Central moments normalized by appropriate power of standard deviation:

$$\tilde{\mu}_n = \frac{\mu_n}{\sigma^n} = \frac{\mathbb{E}[(X-\mu)^n]}{\sigma^n}$$

### 2.5 Key Standardized Moments

#### 2.5.1 Skewness (Third Standardized Moment)

$$\gamma_1 = \frac{\mu_3}{\sigma^3} = \frac{\mathbb{E}[(X-\mu)^3]}{(\mathbb{E}[(X-\mu)^2])^{3/2}}$$

**Interpretation**:
- $\gamma_1 > 0$: Right-skewed (positive skew), tail extends right
- $\gamma_1 < 0$: Left-skewed (negative skew), tail extends left
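- $\gamma_1 = 0$: No skew; this holds for every symmetric distribution with a finite third moment, although zero skewness alone does not guarantee symmetry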

```
Left-skewed (γ₁ < 0)     Symmetric (γ₁ = 0)     Right-skewed (γ₁ > 0)
           ___                 ___                   ___
        __/   \               /   \                 /   \__
     __/       \             /     \               /       \__
 ___/           \_         _/       \_           _/          \___
```

#### 2.5.2 Kurtosis (Fourth Standardized Moment)

**Kurtosis**:
$$\gamma_2 = \frac{\mu_4}{\sigma^4} = \frac{\mathbb{E}[(X-\mu)^4]}{(\mathbb{E}[(X-\mu)^2])^2}$$

**Excess Kurtosis** (relative to Gaussian):
$$\kappa = \gamma_2 - 3 = \frac{\mu_4}{\sigma^4} - 3$$

**Interpretation**:
- $\kappa > 0$: Leptokurtic (heavier tails than Gaussian)
- $\kappa < 0$: Platykurtic (lighter tails than Gaussian)
- $\kappa = 0$: Mesokurtic (Gaussian-like tails)

| Distribution | Excess Kurtosis |
|--------------|-----------------|
| Gaussian | $\kappa = 0$ |
| Uniform | $\kappa = -1.2$ |
| Laplace | $\kappa = 3$ |
| Student's t ($\nu > 4$) | $\kappa = \frac{6}{\nu - 4}$ |

### 2.6 Relationship Between Raw and Central Moments

$$\mu_n = \sum_{k=0}^{n} \binom{n}{k} (-1)^{k} (\mu'_1)^{k} \mu'_{n-k}, \quad \mu'_0 = 1$$

**Common Relationships**:
$$\mu_2 = \mu'_2 - (\mu'_1)^2 = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$
$$\mu_3 = \mu'_3 - 3\mu'_1\mu'_2 + 2(\mu'_1)^3$$
$$\mu_4 = \mu'_4 - 4\mu'_1\mu'_3 + 6(\mu'_1)^2\mu'_2 - 3(\mu'_1)^4$$
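
These identities hold exactly for empirical moments as well, which gives a quick numerical check. The data set below is arbitrary, and moments are computed with $1/N$ normalization.

```python
def raw_moment(xs, n):
    """n-th raw moment: average of x^n."""
    return sum(x ** n for x in xs) / len(xs)

def central_moment(xs, n):
    """n-th central moment: average of (x - mean)^n."""
    mu = raw_moment(xs, 1)
    return sum((x - mu) ** n for x in xs) / len(xs)

xs = [1.0, 2.0, 2.0, 3.0, 7.0]
m1, m2, m3, m4 = (raw_moment(xs, n) for n in (1, 2, 3, 4))

mu2_from_raw = m2 - m1 ** 2
mu3_from_raw = m3 - 3 * m1 * m2 + 2 * m1 ** 3
mu4_from_raw = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4

print(mu2_from_raw, central_moment(xs, 2))
print(mu3_from_raw, central_moment(xs, 3))
print(mu4_from_raw, central_moment(xs, 4))
```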

### 2.7 Moment Generating Function (MGF)

**Definition**:
$$M_X(t) = \mathbb{E}[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} p(x) \, dx$$

**Property**: The $n$-th derivative at $t=0$ gives the $n$-th raw moment:
$$\mu'_n = \frac{d^n M_X(t)}{dt^n}\bigg|_{t=0}$$

**Taylor Expansion**:
$$M_X(t) = \sum_{n=0}^{\infty} \frac{\mu'_n t^n}{n!} = 1 + \mu'_1 t + \frac{\mu'_2 t^2}{2!} + \frac{\mu'_3 t^3}{3!} + ...$$

### 2.8 Characteristic Function

**Definition** (Fourier transform of PDF):
$$\phi_X(t) = \mathbb{E}[e^{itX}] = \int_{-\infty}^{\infty} e^{itx} p(x) \, dx$$

**Property**: Always exists (unlike MGF), and whenever the $n$-th moment exists:
$$\mu'_n = \frac{1}{i^n} \frac{d^n \phi_X(t)}{dt^n}\bigg|_{t=0}$$

### 2.9 Cumulants

**Cumulant Generating Function**:
$$K_X(t) = \log M_X(t) = \sum_{n=1}^{\infty} \kappa_n \frac{t^n}{n!}$$

**Key Cumulants**:
| Cumulant | Expression | Name |
|----------|------------|------|
| $\kappa_1$ | $\mu$ | Mean |
| $\kappa_2$ | $\sigma^2$ | Variance |
| $\kappa_3$ | $\mu_3$ | Third cumulant |
| $\kappa_4$ | $\mu_4 - 3\sigma^4$ | Excess kurtosis × $\sigma^4$ |

**Property**: For independent $X, Y$:
$$K_{X+Y}(t) = K_X(t) + K_Y(t)$$

### 2.10 Applications of Moments in Gen-AI

#### 2.10.1 Batch Normalization

$$\hat{x}_i = \frac{x_i - \mathbb{E}[X]}{\sqrt{\text{Var}(X) + \epsilon}}$$

Uses the first moment (mean) and second central moment (variance), estimated per feature over the mini-batch during training.

#### 2.10.2 Layer Normalization

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad \mu = \frac{1}{D}\sum_{j=1}^{D} x_j, \quad \sigma^2 = \frac{1}{D}\sum_{j=1}^{D}(x_j - \mu)^2$$
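
A minimal layer-normalization sketch for a single feature vector, without the learnable scale and shift that practical layers usually add. The input values and the default `eps` are illustrative.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one vector using its own mean and (1/D) variance."""
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    return [(v - mu) / math.sqrt(var + eps) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
x_hat = layer_norm(x)

out_mean = sum(x_hat) / len(x_hat)
out_var = sum((v - out_mean) ** 2 for v in x_hat) / len(x_hat)
print(x_hat, out_mean, out_var)
```

Unlike batch normalization, the statistics here are computed across the $D$ features of one example, so the result does not depend on the other examples in the batch.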

#### 2.10.3 Feature Matching in GANs

Match statistics of real and generated distributions:
$$\mathcal{L}_{FM} = \|\mathbb{E}_{x \sim p_{data}}[f(x)] - \mathbb{E}_{z \sim p_z}[f(G(z))]\|^2$$

#### 2.10.4 Maximum Mean Discrepancy (MMD)

$$\text{MMD}^2(p, q) = \mathbb{E}_{x,x' \sim p}[k(x,x')] - 2\mathbb{E}_{x \sim p, y \sim q}[k(x,y)] + \mathbb{E}_{y,y' \sim q}[k(y,y')]$$

---

## 3. Variance and Standard Deviation

### 3.1 Variance

**Definition**: Second central moment measuring spread around the mean.

$$\text{Var}(X) = \sigma^2 = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

**Discrete**:
$$\text{Var}(X) = \sum_{i} (x_i - \mu)^2 \cdot p(x_i)$$

**Continuous**:
$$\text{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot p(x) \, dx$$

### 3.2 Standard Deviation

$$\sigma = \sqrt{\text{Var}(X)} = \sqrt{\mathbb{E}[(X - \mu)^2]}$$

### 3.3 Properties of Variance

| Property | Formula |
|----------|---------|
| Non-negativity | $\text{Var}(X) \geq 0$ |
| Constant | $\text{Var}(c) = 0$ |
| Scaling | $\text{Var}(aX) = a^2 \text{Var}(X)$ |
| Translation invariance | $\text{Var}(X + c) = \text{Var}(X)$ |
| Sum (independent) | $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$ if $X \perp Y$ |
| General sum | $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$ |

### 3.4 Law of Total Variance

$$\text{Var}(X) = \mathbb{E}[\text{Var}(X|Y)] + \text{Var}(\mathbb{E}[X|Y])$$

**Interpretation**:
- $\mathbb{E}[\text{Var}(X|Y)]$: Average variance within groups
- $\text{Var}(\mathbb{E}[X|Y])$: Variance between group means
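
The decomposition can be checked numerically on a small discrete joint distribution. The probabilities below are made up purely for illustration.

```python
# Joint pmf p(x, y); X is numeric, Y is a group label.
joint = {
    (0, "a"): 0.10, (1, "a"): 0.30,
    (0, "b"): 0.25, (2, "b"): 0.15, (4, "b"): 0.20,
}

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

# Marginals of X and Y.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Conditional pmf of X given each value of Y.
cond = {y: {} for y in p_y}
for (x, y), p in joint.items():
    cond[y][x] = p / p_y[y]

grand_mean = mean(p_x)
within = sum(p_y[y] * variance(cond[y]) for y in p_y)                    # E[Var(X|Y)]
between = sum(p_y[y] * (mean(cond[y]) - grand_mean) ** 2 for y in p_y)   # Var(E[X|Y])
total = variance(p_x)                                                    # Var(X)
```

Here `total` equals `within + between` up to floating-point rounding.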

---

## 4. Covariance

### 4.1 Definition

**Covariance** measures the joint variability of two random variables.

$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$$

**Alternative Formula**:
$$\text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

**Integral Form**:
$$\text{Cov}(X, Y) = \iint (x - \mu_X)(y - \mu_Y) \cdot p(x, y) \, dx \, dy$$

### 4.2 Properties of Covariance

| Property | Formula |
|----------|---------|
| Symmetry | $\text{Cov}(X, Y) = \text{Cov}(Y, X)$ |
| Self-covariance | $\text{Cov}(X, X) = \text{Var}(X)$ |
| With constant | $\text{Cov}(X, c) = 0$ |
| Linearity (first arg) | $\text{Cov}(aX + b, Y) = a \cdot \text{Cov}(X, Y)$ |
| Bilinearity | $\text{Cov}(aX + bY, cZ + dW) = ac\text{Cov}(X,Z) + ad\text{Cov}(X,W) + bc\text{Cov}(Y,Z) + bd\text{Cov}(Y,W)$ |
| Independence | If $X \perp Y$: $\text{Cov}(X, Y) = 0$ |

**Note**: $\text{Cov}(X, Y) = 0$ does NOT imply independence (only uncorrelated).
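
A standard counterexample, sketched with exact rational arithmetic: take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. The covariance vanishes even though $Y$ is a deterministic function of $X$.

```python
from fractions import Fraction

support = [-1, 0, 1]
p = Fraction(1, 3)                      # P(X = x) for each x in the support

e_x = sum(p * x for x in support)       # E[X] = 0
e_y = sum(p * x**2 for x in support)    # E[Y] = E[X^2] = 2/3
e_xy = sum(p * x * x**2 for x in support)   # E[XY] = E[X^3] = 0
cov_xy = e_xy - e_x * e_y

# Independence would require P(X=0, Y=0) == P(X=0) * P(Y=0).
p_joint_00 = p        # X = 0 forces Y = 0
p_x0 = p              # P(X = 0) = 1/3
p_y0 = p              # P(Y = 0) = P(X = 0) = 1/3
```

The covariance is exactly zero, while `p_joint_00` (1/3) differs from `p_x0 * p_y0` (1/9), so the variables are dependent.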

### 4.3 Covariance Matrix

For random vector $\mathbf{X} = (X_1, X_2, ..., X_n)^T$:

$$\boldsymbol{\Sigma} = \text{Cov}(\mathbf{X}) = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]$$

**Matrix Form**:
$$\boldsymbol{\Sigma} = \begin{pmatrix}
\text{Var}(X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_n) \\
\text{Cov}(X_2, X_1) & \text{Var}(X_2) & \cdots & \text{Cov}(X_2, X_n) \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}(X_n, X_1) & \text{Cov}(X_n, X_2) & \cdots & \text{Var}(X_n)
\end{pmatrix}$$

**Element-wise**:
$$\Sigma_{ij} = \text{Cov}(X_i, X_j) = \mathbb{E}[(X_i - \mu_i)(X_j - \mu_j)]$$

**Alternative Form**:
$$\boldsymbol{\Sigma} = \mathbb{E}[\mathbf{X}\mathbf{X}^T] - \boldsymbol{\mu}\boldsymbol{\mu}^T$$

### 4.4 Properties of Covariance Matrix

1. **Symmetric**: $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}^T$

2. **Positive Semi-Definite**: For any vector $\mathbf{a}$:
$$\mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a} = \text{Var}(\mathbf{a}^T \mathbf{X}) \geq 0$$

3. **Linear Transformation**: For $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$:
$$\text{Cov}(\mathbf{Y}) = \mathbf{A} \boldsymbol{\Sigma}_X \mathbf{A}^T$$

4. **Eigendecomposition**:
$$\boldsymbol{\Sigma} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^T$$
where $\boldsymbol{\Lambda} = \text{diag}(\lambda_1, ..., \lambda_n)$ with $\lambda_i \geq 0$
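
These properties also hold for the sample covariance, which makes them easy to verify numerically. The sketch below assumes samples are stored as rows; the mixing matrix, `a`, and `b` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.2, -0.3, 1.0]])
x = rng.normal(size=(500, 3)) @ mixing.T      # 500 correlated 3-D samples

sigma_x = np.cov(x, rowvar=False)             # columns are variables

# Eigendecomposition of a symmetric matrix: Sigma = U diag(lambda) U^T.
eigvals, eigvecs = np.linalg.eigh(sigma_x)
reconstructed = eigvecs @ np.diag(eigvals) @ eigvecs.T

# Linear transformation Y = A X + b maps the covariance to A Sigma A^T.
a = np.array([[2.0, -1.0, 0.0],
              [0.0, 1.0, 3.0]])
b = np.array([10.0, -5.0])
y = x @ a.T + b
sigma_y = np.cov(y, rowvar=False)
```

The offset `b` has no effect on `sigma_y`, matching the translation invariance of variance.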

### 4.5 Precision Matrix (Inverse Covariance)

$$\boldsymbol{\Omega} = \boldsymbol{\Sigma}^{-1}$$

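**Significance**: For a multivariate Gaussian, $\Omega_{ij} = 0$ is equivalent to conditional independence of $X_i$ and $X_j$ given all other variables (for general distributions it only implies zero partial correlation):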
$$X_i \perp X_j | \mathbf{X}_{\setminus\{i,j\}}$$

### 4.6 Applications of Covariance in Gen-AI

#### 4.6.1 Multivariate Gaussian in VAE

$$p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$$
$$q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x})))$$

#### 4.6.2 KL Divergence with Covariance

$$D_{KL}(\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) \| \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)) = \frac{1}{2}\left[\log\frac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|} - D + \text{tr}(\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\Sigma}_1) + (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}_2^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)\right]$$
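
For the VAE case of Section 4.6.1, with $\boldsymbol{\Sigma}_1 = \text{diag}(\boldsymbol{\sigma}^2)$, $\boldsymbol{\mu}_2 = \mathbf{0}$ and $\boldsymbol{\Sigma}_2 = \mathbf{I}$, the general formula reduces to $\frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$. The following sketch checks that the two expressions agree; the function names and the test values are illustrative.

```python
import numpy as np

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    # General closed form for KL(N(mu1, sigma1) || N(mu2, sigma2)).
    d = mu1.shape[0]
    sigma2_inv = np.linalg.inv(sigma2)
    diff = mu2 - mu1
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)
    return 0.5 * (logdet2 - logdet1 - d
                  + np.trace(sigma2_inv @ sigma1)
                  + diff @ sigma2_inv @ diff)

def kl_diag_to_standard_normal(mu, var):
    # Special case: q = N(mu, diag(var)), p = N(0, I).
    return 0.5 * np.sum(mu ** 2 + var - 1.0 - np.log(var))

mu = np.array([0.5, -1.0, 2.0])
var = np.array([0.8, 1.5, 0.3])
kl_general = kl_gaussians(mu, np.diag(var), np.zeros(3), np.eye(3))
kl_vae = kl_diag_to_standard_normal(mu, var)
```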

#### 4.6.3 Whitening Transformation

**Objective**: Transform data to have identity covariance.

$$\mathbf{X}_{white} = \boldsymbol{\Sigma}^{-1/2}(\mathbf{X} - \boldsymbol{\mu})$$

**Result**: $\text{Cov}(\mathbf{X}_{white}) = \mathbf{I}$
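
One way to form $\boldsymbol{\Sigma}^{-1/2}$ is through the eigendecomposition $\boldsymbol{\Sigma} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$, giving $\boldsymbol{\Sigma}^{-1/2} = \mathbf{U}\boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T$ (sometimes called ZCA whitening). The sketch assumes $\boldsymbol{\Sigma}$ is strictly positive definite and uses the sample mean and covariance in place of the true ones.

```python
import numpy as np

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0],
                   [1.5, 0.5]])
x = rng.normal(size=(1000, 2)) @ mixing.T + np.array([3.0, -1.0])

mu = x.mean(axis=0)
sigma = np.cov(x, rowvar=False)

# Sigma^{-1/2} from the eigendecomposition; requires all eigenvalues > 0.
eigvals, eigvecs = np.linalg.eigh(sigma)
sigma_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

x_white = (x - mu) @ sigma_inv_sqrt.T         # apply Sigma^{-1/2} to each row
cov_white = np.cov(x_white, rowvar=False)     # identity up to rounding
```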

---

## 5. Cross-Covariance

### 5.1 Definition

**Cross-covariance** measures the covariance between elements of two different random vectors.

For random vectors $\mathbf{X} \in \mathbb{R}^m$ and $\mathbf{Y} \in \mathbb{R}^n$:

$$\boldsymbol{\Sigma}_{XY} = \text{Cov}(\mathbf{X}, \mathbf{Y}) = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu}_X)(\mathbf{Y} - \boldsymbol{\mu}_Y)^T]$$

### 5.2 Matrix Form

$$\boldsymbol{\Sigma}_{XY} \in \mathbb{R}^{m \times n}$$

$$(\boldsymbol{\Sigma}_{XY})_{ij} = \text{Cov}(X_i, Y_j) = \mathbb{E}[(X_i - \mu_{X_i})(Y_j - \mu_{Y_j})]$$

### 5.3 Alternative Formula

$$\boldsymbol{\Sigma}_{XY} = \mathbb{E}[\mathbf{X}\mathbf{Y}^T] - \boldsymbol{\mu}_X \boldsymbol{\mu}_Y^T$$

### 5.4 Properties

1. **Relationship to Transpose**: $\boldsymbol{\Sigma}_{YX} = \boldsymbol{\Sigma}_{XY}^T$

2. **Joint Covariance Matrix**:
$$\text{Cov}\begin{pmatrix} \mathbf{X} \\ \mathbf{Y} \end{pmatrix} = \begin{pmatrix} \boldsymbol{\Sigma}_{XX} & \boldsymbol{\Sigma}_{XY} \\ \boldsymbol{\Sigma}_{YX} & \boldsymbol{\Sigma}_{YY} \end{pmatrix}$$

3. **Linear Transformation**:
$$\text{Cov}(\mathbf{A}\mathbf{X}, \mathbf{B}\mathbf{Y}) = \mathbf{A} \boldsymbol{\Sigma}_{XY} \mathbf{B}^T$$

### 5.5 Cross-Covariance Function (Stochastic Processes)

For stochastic processes $\{X_t\}$ and $\{Y_t\}$:

$$C_{XY}(t_1, t_2) = \text{Cov}(X_{t_1}, Y_{t_2}) = \mathbb{E}[(X_{t_1} - \mu_X(t_1))(Y_{t_2} - \mu_Y(t_2))]$$

**For Stationary Processes** (depends only on lag $\tau = t_2 - t_1$):
$$C_{XY}(\tau) = \mathbb{E}[(X_t - \mu_X)(Y_{t+\tau} - \mu_Y)]$$

### 5.6 Applications in Gen-AI

#### 5.6.1 Canonical Correlation Analysis (CCA)

Find projections maximizing correlation between two views:
$$\max_{\mathbf{w}_x, \mathbf{w}_y} \frac{\mathbf{w}_x^T \boldsymbol{\Sigma}_{XY} \mathbf{w}_y}{\sqrt{\mathbf{w}_x^T \boldsymbol{\Sigma}_{XX} \mathbf{w}_x} \sqrt{\mathbf{w}_y^T \boldsymbol{\Sigma}_{YY} \mathbf{w}_y}}$$

#### 5.6.2 Multi-Modal Generative Models

Cross-covariance captures relationships between different modalities (text-image, audio-video).

#### 5.6.3 Attention Mechanism Analysis

Cross-covariance between query and key representations:
$$\boldsymbol{\Sigma}_{QK} = \text{Cov}(\mathbf{Q}, \mathbf{K})$$

---

## 6. Correlation

### 6.1 Pearson Correlation Coefficient

**Definition**: Normalized covariance measuring linear relationship.

$$\rho_{XY} = \text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]}{\sqrt{\text{Var}(X)\text{Var}(Y)}}$$

**Range**: $-1 \leq \rho_{XY} \leq 1$

**Interpretation**:
- $\rho = 1$: Perfect positive linear relationship
- $\rho = -1$: Perfect negative linear relationship
- $\rho = 0$: No linear relationship (uncorrelated)

### 6.2 Correlation Matrix

$$\mathbf{R} = \mathbf{D}^{-1/2} \boldsymbol{\Sigma} \mathbf{D}^{-1/2}$$

where $\mathbf{D} = \text{diag}(\sigma_1^2, \sigma_2^2, ..., \sigma_n^2)$

**Element-wise**:
$$R_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii} \Sigma_{jj}}} = \frac{\text{Cov}(X_i, X_j)}{\sigma_{X_i} \sigma_{X_j}}$$

**Properties**:
- Diagonal elements: $R_{ii} = 1$
- Off-diagonal: $-1 \leq R_{ij} \leq 1$
- Symmetric: $\mathbf{R} = \mathbf{R}^T$
- Positive semi-definite
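
A short NumPy sketch of the rescaling $\mathbf{R} = \mathbf{D}^{-1/2}\boldsymbol{\Sigma}\mathbf{D}^{-1/2}$, compared against `np.corrcoef`. The data are synthetic and the column manipulations are only there to create correlation and a scale difference.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 3))
x[:, 1] += 0.8 * x[:, 0]     # induce correlation between columns 0 and 1
x[:, 2] *= 5.0               # rescaling a variable leaves correlations unchanged

sigma = np.cov(x, rowvar=False)
d_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(sigma)))   # D^{-1/2}
r = d_inv_sqrt @ sigma @ d_inv_sqrt                   # correlation matrix
```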

### 6.3 Cross-Correlation Matrix

For random vectors $\mathbf{X}$ and $\mathbf{Y}$:

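$$\mathbf{R}_{XY} = \mathbf{D}_X^{-1/2} \boldsymbol{\Sigma}_{XY} \mathbf{D}_Y^{-1/2}$$

where $\mathbf{D}_X$ and $\mathbf{D}_Y$ are the diagonal matrices holding the variances of the components of $\mathbf{X}$ and $\mathbf{Y}$.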

---

## 7. Auto-Correlation

### 7.1 Definition

**Auto-correlation** measures the correlation of a signal/process with a delayed version of itself.

### 7.2 For Stochastic Processes

#### 7.2.1 Auto-Correlation Function (ACF)

$$R_{XX}(t_1, t_2) = \mathbb{E}[X_{t_1} X_{t_2}]$$

#### 7.2.2 Auto-Covariance Function

$$C_{XX}(t_1, t_2) = \mathbb{E}[(X_{t_1} - \mu(t_1))(X_{t_2} - \mu(t_2))] = R_{XX}(t_1, t_2) - \mu(t_1)\mu(t_2)$$

### 7.3 For Stationary Processes

**Wide-Sense Stationary (WSS)**: The mean is constant and the auto-correlation depends only on the time difference $\tau$.

$$R_{XX}(\tau) = \mathbb{E}[X_t X_{t+\tau}]$$

$$C_{XX}(\tau) = \mathbb{E}[(X_t - \mu)(X_{t+\tau} - \mu)] = R_{XX}(\tau) - \mu^2$$

**Normalized Auto-Correlation**:
$$\rho_{XX}(\tau) = \frac{C_{XX}(\tau)}{C_{XX}(0)} = \frac{C_{XX}(\tau)}{\sigma^2}$$

### 7.4 Properties of Auto-Correlation

1. **Symmetry** (real-valued processes): $R_{XX}(\tau) = R_{XX}(-\tau)$

2. **Maximum at Zero**: $R_{XX}(0) \geq |R_{XX}(\tau)|$ for all $\tau$

3. **At Zero**: $R_{XX}(0) = \mathbb{E}[X_t^2]$ (mean squared value)

4. **Positive Semi-Definite**: The auto-correlation matrix is PSD

5. **Wiener-Khinchin Theorem**: Power spectral density is Fourier transform of ACF:
$$S_{XX}(f) = \mathcal{F}\{R_{XX}(\tau)\} = \int_{-\infty}^{\infty} R_{XX}(\tau) e^{-j2\pi f \tau} d\tau$$

### 7.5 Discrete Auto-Correlation

For discrete signal $x[n]$:

**Deterministic**:
$$R_{xx}[k] = \sum_{n=-\infty}^{\infty} x[n] x[n+k]$$

**Normalized (Biased Estimator)**:
$$\hat{R}_{xx}[k] = \frac{1}{N} \sum_{n=0}^{N-1-|k|} x[n] x[n+|k|]$$

**Unbiased Estimator**:
$$\hat{R}_{xx}[k] = \frac{1}{N-|k|} \sum_{n=0}^{N-1-|k|} x[n] x[n+|k|]$$
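
The two estimators share the same sum and differ only in the divisor. A plain-Python sketch on a period-4 test signal (the signal is chosen for illustration):

```python
def autocorr_biased(x, k):
    # (1/N) * sum_{n=0}^{N-1-|k|} x[n] * x[n+|k|]
    n_samples, lag = len(x), abs(k)
    return sum(x[n] * x[n + lag] for n in range(n_samples - lag)) / n_samples

def autocorr_unbiased(x, k):
    # Same sum, divided by the number of terms N - |k| instead of N.
    n_samples, lag = len(x), abs(k)
    return sum(x[n] * x[n + lag] for n in range(n_samples - lag)) / (n_samples - lag)

signal = [1.0, 0.0, -1.0, 0.0] * 8     # N = 32, period 4

r0 = autocorr_biased(signal, 0)             # 0.5 (mean squared value)
r2_unbiased = autocorr_unbiased(signal, 2)  # -0.5 (half a period: anti-correlated)
r4_biased = autocorr_biased(signal, 4)      # 0.4375 (shrunk toward zero by 1/N)
r4_unbiased = autocorr_unbiased(signal, 4)  # 0.5 (full period: matches lag 0)
```

The biased estimator shrinks large-lag values toward zero, while the unbiased one has higher variance at large lags because fewer terms enter the average.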

### 7.6 Applications in Gen-AI

#### 7.6.1 Temporal Modeling in Sequence Generation

Auto-correlation reveals temporal dependencies:
- High auto-correlation at lag $k$ → strong dependency between $x_t$ and $x_{t+k}$

#### 7.6.2 Audio Generation

$$R_{audio}(\tau) = \mathbb{E}[s(t) s(t+\tau)]$$

Captures periodic structure (pitch, rhythm).

#### 7.6.3 Positional Encoding Analysis

Analyze auto-correlation of positional embeddings:
$$R_{PE}(\Delta pos) = \mathbb{E}[PE(pos) \cdot PE(pos + \Delta pos)]$$

---

## 8. Cross-Correlation

### 8.1 Definition

**Cross-correlation** measures similarity between two signals as a function of displacement/lag.

### 8.2 Continuous Cross-Correlation

**For Functions**:
$$(f \star g)(\tau) = \int_{-\infty}^{\infty} \overline{f(t)} g(t + \tau) dt$$

where $\overline{f(t)}$ is complex conjugate (for real signals, just $f(t)$).

**For Stochastic Processes**:
$$R_{XY}(\tau) = \mathbb{E}[X_t Y_{t+\tau}]$$

### 8.3 Discrete Cross-Correlation

$$(x \star y)[k] = \sum_{n=-\infty}^{\infty} x[n] y[n+k]$$

**Finite Length Signals** (length $N$, with $y[n+k]$ taken as zero whenever $n+k$ falls outside $0, \dots, N-1$):
$$(x \star y)[k] = \sum_{n=0}^{N-1} x[n] y[n+k]$$

### 8.4 Properties of Cross-Correlation

1. **Relationship to Auto-Correlation**:
$$R_{XX}(\tau) = (x \star x)(\tau)$$

2. **Non-Commutative**:
$$R_{XY}(\tau) \neq R_{YX}(\tau) \quad \text{in general}$$
$$R_{XY}(\tau) = R_{YX}(-\tau)$$

3. **Relationship to Convolution**:
$$(f \star g)(\tau) = (\overline{f(-t)} * g)(\tau)$$

where $*$ denotes convolution.

4. **Fourier Domain**:
$$\mathcal{F}\{x \star y\} = \overline{\mathcal{F}\{x\}} \cdot \mathcal{F}\{y\}$$

### 8.5 Normalized Cross-Correlation (NCC)

$$\text{NCC}(\tau) = \frac{(f \star g)(\tau)}{\sqrt{\sum f^2 \cdot \sum g^2}}$$

**Range**: $-1 \leq \text{NCC} \leq 1$

### 8.6 Cross-Correlation Matrix (for Random Vectors)

$$\mathbf{R}_{XY} = \mathbb{E}[\mathbf{X}\mathbf{Y}^T]$$

**Relationship to Cross-Covariance**:
$$\boldsymbol{\Sigma}_{XY} = \mathbf{R}_{XY} - \boldsymbol{\mu}_X \boldsymbol{\mu}_Y^T$$

### 8.7 Applications in Gen-AI

#### 8.7.1 Template Matching in Vision

$$\text{Match}(i, j) = \sum_{m,n} T(m, n) \cdot I(i+m, j+n)$$

where $T$ is template and $I$ is image.

#### 8.7.2 Attention Mechanism

Cross-correlation between queries and keys:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

$QK^T$ represents cross-correlation/similarity.
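
A minimal single-head sketch of the formula above in NumPy, without masking or learned projections. The shapes are arbitrary and the function name is illustrative.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                        # query-key similarities
    scores = scores - scores.max(axis=-1, keepdims=True)   # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))    # 4 queries of dimension d_k = 8
k = rng.normal(size=(6, 8))    # 6 keys
v = rng.normal(size=(6, 5))    # 6 values of dimension 5
output, weights = scaled_dot_product_attention(q, k, v)
```

Each row of `weights` is a probability distribution over the keys, so each output row is a convex combination of the value vectors.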

#### 8.7.3 Audio-Visual Alignment

$$R_{AV}(\tau) = \mathbb{E}[A_t \cdot V_{t+\tau}]$$

Find temporal offset between audio and video.

#### 8.7.4 Text-Image Similarity (CLIP)

$$\text{similarity}(t, i) = \frac{\mathbf{e}_t^T \mathbf{e}_i}{\|\mathbf{e}_t\| \|\mathbf{e}_i\|}$$

This is the cosine similarity, i.e., the inner product of the $\ell_2$-normalized text and image embeddings (a normalized, uncentered cross-correlation).

---

## 9. Convolution

### 9.1 Definition

**Convolution** is a mathematical operation that combines two functions to produce a third function, expressing how the shape of one is modified by the other.

### 9.2 Continuous Convolution

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) g(t - \tau) d\tau$$

**Alternative Form**:
$$(f * g)(t) = \int_{-\infty}^{\infty} f(t - \tau) g(\tau) d\tau$$

### 9.3 Discrete Convolution

$$(f * g)[n] = \sum_{k=-\infty}^{\infty} f[k] g[n - k]$$

**Finite Signals**:
$$(f * g)[n] = \sum_{k=0}^{K-1} f[k] g[n - k]$$

where $K$ is kernel size.

### 9.4 2D Convolution (Images)

$$(I * K)(i, j) = \sum_{m} \sum_{n} I(i-m, j-n) K(m, n)$$

**Cross-Correlation Form** (used in deep learning):
$$(I \star K)(i, j) = \sum_{m} \sum_{n} I(i+m, j+n) K(m, n)$$

**Note**: Deep learning "convolution" is actually cross-correlation (no kernel flip).

### 9.5 Properties of Convolution

| Property | Mathematical Expression |
|----------|------------------------|
| **Commutativity** | $f * g = g * f$ |
| **Associativity** | $(f * g) * h = f * (g * h)$ |
| **Distributivity** | $f * (g + h) = f * g + f * h$ |
| **Scalar Multiplication** | $a(f * g) = (af) * g = f * (ag)$ |
| **Identity** | $f * \delta = f$ where $\delta$ is Dirac delta |
| **Differentiation** | $\frac{d}{dt}(f * g) = \frac{df}{dt} * g = f * \frac{dg}{dt}$ |

### 9.6 Convolution Theorem

**Statement**: Convolution in time/spatial domain equals multiplication in frequency domain.

$$\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}$$

$$f * g = \mathcal{F}^{-1}\{\mathcal{F}\{f\} \cdot \mathcal{F}\{g\}\}$$

**Dual**:
$$\mathcal{F}\{f \cdot g\} = \mathcal{F}\{f\} * \mathcal{F}\{g\}$$

### 9.7 Relationship: Convolution vs Cross-Correlation

For real-valued signals:
$$(f * g)(t) = (f(-t) \star g)(t) = (g(-t) \star f)(t)$$

**Implication**: Convolution = Cross-correlation with flipped kernel

```
Cross-Correlation:    Convolution:
K = [1 2 3]           K_flipped = [3 2 1]
    
   Signal: [a b c d e]
   
Cross-Corr: 1·a + 2·b + 3·c    Conv: 3·a + 2·b + 1·c
```
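
The same comparison in runnable form, as a plain-Python sketch with the numbers 1 to 5 standing in for `a` to `e`. Only positions where the kernel fully overlaps the signal are computed.

```python
def cross_correlate_valid(signal, kernel):
    # Slide the kernel as-is: out[i] = sum_m kernel[m] * signal[i + m].
    k = len(kernel)
    return [sum(kernel[m] * signal[i + m] for m in range(k))
            for i in range(len(signal) - k + 1)]

def convolve_valid(signal, kernel):
    # Convolution equals cross-correlation with the kernel reversed.
    return cross_correlate_valid(signal, kernel[::-1])

signal = [1, 2, 3, 4, 5]      # stands in for [a b c d e]
kernel = [1, 2, 3]

xcorr = cross_correlate_valid(signal, kernel)   # first entry: 1*1 + 2*2 + 3*3 = 14
conv = convolve_valid(signal, kernel)           # first entry: 3*1 + 2*2 + 1*3 = 10
```

For a symmetric kernel such as `[1, 2, 1]` the two operations coincide, which is one reason the distinction rarely matters for learned filters.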

### 9.8 Types of Convolution in Deep Learning

#### 9.8.1 Standard 2D Convolution

**Input**: $\mathbf{X} \in \mathbb{R}^{C_{in} \times H \times W}$
**Kernel**: $\mathbf{K} \in \mathbb{R}^{C_{out} \times C_{in} \times k_h \times k_w}$
**Output**: $\mathbf{Y} \in \mathbb{R}^{C_{out} \times H' \times W'}$

$$Y_{c_{out}, i, j} = \sum_{c_{in}=1}^{C_{in}} \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} K_{c_{out}, c_{in}, m, n} \cdot X_{c_{in}, i+m, j+n}$$

**Output Size**:
$$H' = \left\lfloor \frac{H + 2P - k_h}{S} \right\rfloor + 1$$
$$W' = \left\lfloor \frac{W + 2P - k_w}{S} \right\rfloor + 1$$

where $P$ = padding, $S$ = stride.

#### 9.8.2 Depthwise Separable Convolution

**Depthwise**: Apply separate kernel per channel
$$Y^{(c)}_{i,j} = \sum_{m,n} K^{(c)}_{m,n} \cdot X^{(c)}_{i+m, j+n}$$

**Pointwise**: 1×1 convolution to mix channels
$$Z_{c_{out}, i, j} = \sum_{c=1}^{C_{in}} W_{c_{out}, c} \cdot Y_{c, i, j}$$

**Parameter Reduction**:
- Standard: $C_{out} \times C_{in} \times k^2$
- Depthwise Separable: $C_{in} \times k^2 + C_{out} \times C_{in}$
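
A quick worked count, ignoring bias terms, for an illustrative layer with $C_{in} = 64$, $C_{out} = 128$ and $k = 3$:

```python
def standard_conv_params(c_in, c_out, k):
    return c_out * c_in * k * k

def depthwise_separable_params(c_in, c_out, k):
    depthwise = c_in * k * k     # one k x k kernel per input channel
    pointwise = c_out * c_in     # 1 x 1 convolution mixing channels
    return depthwise + pointwise

standard = standard_conv_params(64, 128, 3)          # 73728
separable = depthwise_separable_params(64, 128, 3)   # 8768
ratio = separable / standard                         # equals 1/C_out + 1/k^2
```

The ratio simplifies to $\frac{1}{C_{out}} + \frac{1}{k^2}$, so for $3 \times 3$ kernels the saving approaches a factor of 9 as $C_{out}$ grows.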

#### 9.8.3 Transposed Convolution (Deconvolution)

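Used in generators for upsampling. If the forward convolution is written as a matrix $\mathbf{K}$ acting on the flattened input, the transposed convolution applies its transpose: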

$$Y = \mathbf{K}^T \mathbf{X}$$

**Output Size**:
$$H' = (H - 1) \times S - 2P + k_h$$
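
The forward output-size formula from Section 9.8.1 and the transposed formula above can be compared directly. This sketch covers one spatial dimension and assumes no dilation and no extra output padding; the function names are illustrative.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    # Forward convolution: floor((size + 2P - k) / S) + 1
    return (size + 2 * padding - kernel) // stride + 1

def transposed_conv_output_size(size, kernel, padding=0, stride=1):
    # Transposed convolution: (size - 1) * S - 2P + k
    return (size - 1) * stride - 2 * padding + kernel

down_from_31 = conv_output_size(31, kernel=3, padding=1, stride=2)   # 16
down_from_32 = conv_output_size(32, kernel=3, padding=1, stride=2)   # 16
up = transposed_conv_output_size(16, kernel=3, padding=1, stride=2)  # 31
```

Because of the floor, inputs of size 31 and 32 both map to 16, so the transposed formula can only recover one of them (31 here). Deep learning frameworks commonly expose an additional output-padding option to reach the other size.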

#### 9.8.4 Dilated (Atrous) Convolution

Kernel with holes (dilation rate $d$):

$$Y_{i,j} = \sum_{m,n} K_{m,n} \cdot X_{i + d \cdot m, j + d \cdot n}$$

**Effective Kernel Size**: $k' = k + (k-1)(d-1)$

**Receptive Field**: Grows exponentially with depth when the dilation rate doubles at each layer (e.g., $d = 1, 2, 4, \dots$); with a fixed dilation it grows only linearly.

### 9.9 1D Convolution in Sequence Models

**Input**: $\mathbf{X} \in \mathbb{R}^{C_{in} \times L}$
**Kernel**: $\mathbf{K} \in \mathbb{R}^{C_{out} \times C_{in} \times k}$

$$Y_{c_{out}, t} = \sum_{c_{in}=1}^{C_{in}} \sum_{i=0}^{k-1} K_{c_{out}, c_{in}, i} \cdot X_{c_{in}, t+i}$$

**Causal Convolution** (for autoregressive models):
$$Y_t = \sum_{i=0}^{k-1} K_i \cdot X_{t-i}$$

Only uses past and current inputs (no future leakage).
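
A single-channel sketch of the causal formula, treating inputs before the start of the sequence as zero (an assumption equivalent to left-padding with $k - 1$ zeros). The second call checks that changing future inputs leaves earlier outputs untouched.

```python
def causal_conv1d(x, kernel):
    # y[t] = sum_i kernel[i] * x[t - i], with x[t - i] = 0 when t - i < 0.
    return [sum(kernel[i] * x[t - i] for i in range(len(kernel)) if t - i >= 0)
            for t in range(len(x))]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.5, 0.3, 0.2]
y = causal_conv1d(x, kernel)

# Replace the last two inputs; outputs at t = 0, 1, 2 must not change.
x_modified = x[:3] + [100.0, -100.0]
y_modified = causal_conv1d(x_modified, kernel)
```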

### 9.10 Convolution in Generative AI Architectures

#### 9.10.1 U-Net (Diffusion Models)

```
Encoder (Downsampling):
    Conv → Conv → Pool → Conv → Conv → Pool → ...

Decoder (Upsampling):  
    TransposedConv → Concat(skip) → Conv → Conv → ...
```

#### 9.10.2 StyleGAN Generator

Progressive growing with transposed convolutions:
$$\mathbf{Y} = \text{Conv}^T(\text{AdaIN}(\mathbf{X}, \mathbf{style}))$$

#### 9.10.3 WaveNet (Audio Generation)

Dilated causal convolutions:
$$Y_t = \sum_{k=0}^{K-1} W_k \cdot X_{t - d \cdot k}$$

with exponentially increasing dilation: $d \in \{1, 2, 4, 8, ..., 512\}$
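
The effect of the doubling schedule on the receptive field can be computed directly. The sketch assumes stride 1 and a kernel size of 2 for every layer (an assumption made here for illustration), so each layer with dilation $d$ adds $(k - 1) \cdot d$ time steps.

```python
def receptive_field(kernel_size, dilations):
    # Each layer with dilation d extends the receptive field by (kernel_size - 1) * d.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

dilations = [2 ** i for i in range(10)]         # 1, 2, 4, ..., 512
rf_dilated = receptive_field(2, dilations)      # 1 + (1 + 2 + ... + 512) = 1024
rf_undilated = receptive_field(2, [1] * 10)     # same depth, no dilation: 11
```

Ten layers cover 1024 time steps with doubling dilations, compared with 11 for the same stack without dilation.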

---

## 10. Relationship Summary

### 10.1 Mathematical Relationships

```
┌─────────────────────────────────────────────────────────────────┐
│                    RELATIONSHIP DIAGRAM                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   EXPECTATION: E[X]                                            │
│        │                                                       │
│        ▼                                                       │
│   RAW MOMENTS: μ'_n = E[X^n]                                   │
│        │                                                       │
│        ▼                                                       │
│   CENTRAL MOMENTS: μ_n = E[(X - E[X])^n]                       │
│        │                                                       │
│        ├──► VARIANCE: σ² = μ_2                                 │
│        │                                                       │
│        └──► SKEWNESS, KURTOSIS                                 │
│                                                                 │
│   COVARIANCE: Cov(X,Y) = E[(X-μ_X)(Y-μ_Y)]                     │
│        │                                                       │
│        ├──► CORRELATION: ρ = Cov(X,Y)/(σ_X σ_Y)                │
│        │                                                       │
│        └──► AUTO-CORRELATION: R_XX(τ) = E[X_t X_{t+τ}]         │
│             CROSS-CORRELATION: R_XY(τ) = E[X_t Y_{t+τ}]        │
│                                                                 │
│   CONVOLUTION: (f * g)(t) = ∫ f(τ)g(t-τ)dτ                     │
│        │                                                       │
│        └──► Cross-Corr = Conv with flipped kernel               │
│             (f ⋆ g)(t) = (f(-t) * g)(t)                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### 10.2 Fourier Domain Connections

| Operation | Time Domain | Frequency Domain |
|-----------|-------------|------------------|
| Convolution | $f * g$ | $F \cdot G$ |
| Cross-Correlation | $f \star g$ | $\bar{F} \cdot G$ |
| Auto-Correlation | $f \star f$ | $\lvert F \rvert^2$ (Power Spectrum) |
| Product | $f \cdot g$ | $F * G$ |

### 10.3 Summary Table for Gen-AI Applications

| Concept | Definition | Gen-AI Application |
|---------|------------|-------------------|
| **Expectation** | $\mathbb{E}[X] = \int x \cdot p(x) dx$ | Loss functions, training objectives |
| **Variance** | $\text{Var}(X) = \mathbb{E}[(X-\mu)^2]$ | Normalization, uncertainty quantification |
| **Covariance** | $\text{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)]$ | Multivariate Gaussians, VAE |
| **Cross-Covariance** | $\boldsymbol{\Sigma}_{XY} = \mathbb{E}[(\mathbf{X}-\boldsymbol{\mu}_X)(\mathbf{Y}-\boldsymbol{\mu}_Y)^T]$ | Multi-modal learning, CCA |
| **Auto-Correlation** | $R_{XX}(\tau) = \mathbb{E}[X_t X_{t+\tau}]$ | Temporal modeling, audio/speech |
| **Cross-Correlation** | $R_{XY}(\tau) = \mathbb{E}[X_t Y_{t+\tau}]$ | Attention, similarity matching |
| **Convolution** | $(f*g)(t) = \int f(\tau)g(t-\tau)d\tau$ | CNNs, U-Net, generators |

---

## 11. Computational Considerations

### 11.1 Efficient Convolution via FFT

**Direct Convolution**: $O(N \cdot K)$ for signal length $N$, kernel size $K$

**FFT-based Convolution**: $O(N \log N)$

$$f * g = \mathcal{F}^{-1}\{\mathcal{F}\{f\} \cdot \mathcal{F}\{g\}\}$$

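The FFT route typically pays off once $K$ is larger than roughly $\log N$ (constant factors and hardware matter); for small kernels, direct convolution is usually faster.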
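
A NumPy sketch of FFT-based linear convolution. The DFT computes a circular convolution, so both inputs are zero-padded to length $N + K - 1$ to avoid wrap-around; the function name `fft_convolve` is illustrative.

```python
import numpy as np

def fft_convolve(f, g):
    # Zero-pad to the full linear-convolution length N + K - 1
    # so the circular convolution implied by the DFT does not wrap around.
    n = len(f) + len(g) - 1
    spectrum = np.fft.rfft(f, n) * np.fft.rfft(g, n)
    return np.fft.irfft(spectrum, n)

rng = np.random.default_rng(0)
signal = rng.normal(size=1024)
kernel = rng.normal(size=64)

direct = np.convolve(signal, kernel)      # direct evaluation
via_fft = fft_convolve(signal, kernel)    # via the convolution theorem
```

The two results agree up to floating-point rounding.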

### 11.2 Batch Statistics Computation

**Batch Mean**:
$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$$

**Batch Variance**:
$$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$

**Welford's Online Algorithm** (numerically stable):
$$\mu_n = \mu_{n-1} + \frac{x_n - \mu_{n-1}}{n}$$
$$M_n = M_{n-1} + (x_n - \mu_{n-1})(x_n - \mu_n)$$
$$\sigma_n^2 = \frac{M_n}{n}$$
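
The recurrences translate directly into a single-pass loop. This sketch returns the population variance $M_n / n$ as in the formula above and assumes a non-empty input.

```python
def welford(stream):
    # Single pass: update the running mean and the sum of squared deviations M.
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean            # x_n - mu_{n-1}
        mean += delta / n           # mu_n
        m2 += delta * (x - mean)    # (x_n - mu_{n-1}) * (x_n - mu_n)
    return mean, m2 / n             # population variance; requires n >= 1

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, var = welford(data)           # mean 5.0, variance 4.0
```

Unlike the textbook shortcut $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$, this update avoids subtracting two large, nearly equal numbers, which matters when the data have a large offset relative to their spread.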

### 11.3 Covariance Matrix Estimation

**Sample Covariance**:
$$\hat{\boldsymbol{\Sigma}} = \frac{1}{N-1} \sum_{i=1}^{N} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$

**Matrix Form**:
$$\hat{\boldsymbol{\Sigma}} = \frac{1}{N-1} \mathbf{X}_c^T \mathbf{X}_c$$

where $\mathbf{X}_c$ is the centered data matrix whose $i$-th row is $(\mathbf{x}_i - \bar{\mathbf{x}})^T$.
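
A short NumPy check that the sum-of-outer-products form and the matrix form agree, assuming samples are stored as rows. `np.cov` with `rowvar=False` uses the same $N - 1$ normalization by default.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4))              # N = 50 samples (rows), 4 variables

x_centered = x - x.mean(axis=0)           # subtract the sample mean from each row
n = x.shape[0]

# Sum of outer products, written term by term.
sigma_sum = sum(np.outer(row, row) for row in x_centered) / (n - 1)

# Equivalent matrix form X_c^T X_c / (N - 1).
sigma_matrix = x_centered.T @ x_centered / (n - 1)
```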