# Generative Models

<hr>

**Gentle introduction to Generative Models**

The idea is to model the probability distribution of the data by choosing a family of distributions and estimating its parameters ($\theta$) by maximum likelihood estimation (MLE). Once the distribution and its estimated parameters are known, we can compute the probability of observing a new data point under this distribution.

Here we cover two types of generative models:

- Multinomial
- Gaussian Mixture Models

<hr>

**MLE: A recap using Multinomial**

The maximum likelihood estimate of $\theta$ is the value of $\theta$ that maximizes the likelihood function:

$P(D|\theta) = \prod_{w\in W} (\theta_w)^{\text{count}(w)}$

where $W$ is the vocabulary (the set of words) and $\text{count}(w)$ is the number of times word $w$ occurs in the document $D$.

Maximizing $P(D|\theta)$ is equivalent to maximizing $\log P(D|\theta)$ and therefore we can bring down the exponents on the RHS:

$\log P(D|\theta) = \sum_{w \in W} \text{count}(w) \cdot \log \theta_w$

To maximize $\log P(D|\theta)$, we use the *Lagrange multiplier* method:

*Problem*

Let $f$ be a function from $\mathbb{R}^N \rightarrow \mathbb{R}$. Find the local maxima/minima of $f$ subject to a given constraint, $g = 0$, where $g$ is a function $\mathbb{R}^N \rightarrow \mathbb{R}$

A two dimensional example is: Find the local extrema of $f(x, y) = x^2$ subject to the constraint, $x^2 + y^2 = 1$ i.e. optimize the function $f$ on the unit circle.

*Method of Lagrange multipliers*

Without the constraint, the optimization problem can be solved as usual by setting the gradient of $f$ to zero, i.e.

$\nabla f = 0$

With the constraint, we can solve the following equation instead:

$\nabla f = \lambda \nabla g$

where $\lambda$ is a constant scalar. Geometrically, for $\lambda \neq 0$, a solution to the equation above is a point in $\mathbb{R}^N$ where the gradient of $f$ is parallel to the gradient of $g$, or equivalently, where the gradient of $f$ is perpendicular to the tangent of the curve defined by $g = k$ for some $k$. In other words, at a solution point, the directional derivative of $f$ is zero along the direction tangent to the curve $g = k$ for some constant $k$, and hence $f$ is stationary as we travel along $g = k$

Finally, we impose the constraint, $g = 0$, to find the local extrema of $f$ on $g = 0$

Since the equation $\nabla f = \lambda \nabla g$ is equivalent to $\nabla L = 0$ where $L = f - \lambda g$, the problem of optimizing $f$ subject to $g = 0$ can be reformulated as finding the stationary points of the function $L$ together with the constraint $g = 0$. We call the function $L$ the *Lagrangian function*, and the scalar $\lambda$ the *Lagrange multiplier*.

Note that we can equally define $L = f + \lambda g$, since $\lambda$ is an unknown scalar we will solve.

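Applying this to the multinomial MLE, where the parameters must satisfy the constraint $\sum_{w \in W} \theta_w = 1$, the Lagrangian function is: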

$L = \log P(D | \theta) + \lambda \cdot (\sum_{w \in W} \theta_w - 1)$

Find the stationary points of $L$ by solving the equation $\nabla_{\theta} L = 0$. The components of this equation are:

$\frac{\partial }{\partial \theta _ w} \left(\log P(D | \theta ) + \lambda \left(\sum _{w \in W} \theta _ w - 1\right)\right) = 0 \qquad \text {for all } w \in W.$
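As a sketch of how this condition resolves: each component reads $\frac{\text{count}(w)}{\theta_w} + \lambda = 0$, so $\theta_w = -\text{count}(w) / \lambda$. Imposing $\sum_{w \in W} \theta_w = 1$ gives $\lambda = -\sum_{w' \in W} \text{count}(w')$, and therefore $\hat{\theta}_w = \text{count}(w) / \sum_{w' \in W} \text{count}(w')$. The Python snippet below checks this numerically on a made-up toy document. The function names `log_likelihood` and `mle` and the sample words are illustrative assumptions, not part of the original notes.

```python
import math
from collections import Counter


def log_likelihood(counts, theta):
    """log P(D | theta) = sum_w count(w) * log(theta_w)."""
    return sum(c * math.log(theta[w]) for w, c in counts.items())


def mle(counts):
    """Closed-form multinomial MLE: theta_w = count(w) / total count."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


# Toy document (made up for illustration).
counts = Counter("the cat sat on the mat the end".split())
theta_hat = mle(counts)

# Another valid distribution over the same words: move a little
# probability mass from "the" to "cat" (it still sums to 1).
theta_alt = dict(theta_hat)
theta_alt["the"] -= 0.05
theta_alt["cat"] += 0.05

print(theta_hat)
print(log_likelihood(counts, theta_hat), log_likelihood(counts, theta_alt))
```

The perturbed distribution should score a lower log-likelihood than `theta_hat`, which is what we expect if the relative word frequencies are indeed the maximizer.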


*Example*

Find the local extrema of $f(x,y) = x^2$ subject to the constraint $x^2 + y^2 = 1$. Geometrically, the graph of $f$ is a parabolic cylinder, i.e. $f$ is a parabola in the $x$ direction and constant in the $y$ direction. The constraint is the unit circle.

First, solve the equation:

$\nabla f = \lambda \nabla g$ where $g(x, y) = x^2 + y^2 - 1$

$\displaystyle \begin{bmatrix} 2x\\ 0\end{bmatrix} = \displaystyle  \lambda \begin{bmatrix}  2x\\ 2y\end{bmatrix}$

$\displaystyle \begin{bmatrix} (1-\lambda ) 2x\\ \lambda (2y)\end{bmatrix} = 0$

The solutions to the equation above are $(\lambda = 1, y = 0)$, or $(\lambda = 0, x = 0)$, or $(x = y = 0)$.

Finally, impose the constraint $x^2 + y^2 - 1 = 0$ to pin down the local extrema. The case $x = y = 0$ violates the constraint, so the candidates are $(x = 0, y = \pm 1)$ and $(y = 0, x = \pm 1)$. At $(x = 0, y = \pm 1)$, we have $\lambda = 0$ and $\nabla f = 0$. Since the unconstrained stationary points of $f$ are all minima, these two points remain local minima of $f$ on the unit circle. At $(y = 0, x = \pm 1)$, we have $\lambda = 1$ and hence $\nabla f = \nabla g$. Equivalently, the directional derivative of $f$ is zero along the tangent direction of the circle at these points. Visualizing or computing second derivatives shows that these two points are local maxima of $f$ along the unit circle.
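The classification of these four points can be checked numerically. The sketch below does not implement the Lagrange method itself; instead it parametrizes the unit circle by an angle $t$ (so the constraint holds automatically) and compares $f$ at each candidate with its immediate neighbours on the circle. The helper names `f_on_circle` and `classify` and the step size `eps` are assumptions made for this illustration.

```python
import math


def f(x, y):
    return x ** 2


def f_on_circle(t):
    """f restricted to the unit circle, parametrized by the angle t."""
    return f(math.cos(t), math.sin(t))


def classify(t, eps=1e-4):
    """Label a point by comparing f with its two neighbours on the circle."""
    here = f_on_circle(t)
    left = f_on_circle(t - eps)
    right = f_on_circle(t + eps)
    if here > left and here > right:
        return "local max"
    if here < left and here < right:
        return "local min"
    return "neither"


# Candidate points from the Lagrange conditions, as angles on the circle.
candidates = {
    "(1, 0)": 0.0,
    "(0, 1)": math.pi / 2,
    "(-1, 0)": math.pi,
    "(0, -1)": 3 * math.pi / 2,
}
for name, t in candidates.items():
    print(name, classify(t))
```

On the circle $f$ reduces to $\cos^2 t$, so the points with $y = 0$ come out as local maxima and the points with $x = 0$ as local minima.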

****

**Using a Generative Multinomial Model as a linear classifier**

Consider a multinomial generative model, $M$, for binary classification into positive and negative classes. Let the parameters of the two classes be denoted by $\theta^+$ and $\theta^-$. We classify a new document $D$ as positive if and only if:

$\log \frac{P(D|\theta^+)}{P(D|\theta^-)} \geq 0$

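Using the properties of the logarithm, the left-hand side expands into a sum that is linear in the word counts, with one weight $\log (\theta_w^{+} / \theta_w^{-})$ per word: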

$\log P (D | \theta ^{+}) - \log P (D | \theta ^{-})$

$= \log \prod _{w \in W} (\theta _ w^{+})^{\text {count}(w)} - \log \prod _{w \in W} (\theta _ w^{-})^{\text {count}(w)}$

$= \sum _{w \in W} \text {count}(w) \log \theta _ w^{+} - \sum _{w \in W} \text {count}(w) \log \theta _ w^{-}$

$= \sum _{w \in W} \text {count}(w) \log \frac{\theta _ w^{+}}{\theta _ w^{-}}$
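The final expression is a dot product between the word-count vector and a fixed weight vector, which is why the model acts as a linear classifier. A minimal sketch follows, assuming hypothetical parameter values over a three-word vocabulary; the names `linear_weights` and `classify` and the choice to skip out-of-vocabulary words are assumptions of this example.

```python
import math
from collections import Counter


def linear_weights(theta_pos, theta_neg):
    """Per-word weights log(theta_w^+ / theta_w^-)."""
    return {w: math.log(theta_pos[w] / theta_neg[w]) for w in theta_pos}


def classify(document, weights):
    """Return ('+', score) iff sum_w count(w) * weight_w >= 0, else ('-', score).

    Words outside the vocabulary are skipped (an assumption of this sketch).
    """
    counts = Counter(document.split())
    score = sum(c * weights[w] for w, c in counts.items() if w in weights)
    return ("+" if score >= 0 else "-"), score


# Hypothetical parameters; each distribution sums to 1.
theta_pos = {"good": 0.6, "bad": 0.1, "movie": 0.3}
theta_neg = {"good": 0.1, "bad": 0.6, "movie": 0.3}
weights = linear_weights(theta_pos, theta_neg)

print(classify("good good movie", weights))
print(classify("bad movie bad good", weights))
```

A word that is equally likely under both classes, such as "movie" here, gets weight zero and does not influence the decision.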

****

**Prior, Posterior and Likelihood: A recap**

Bayes Rule:

$P(A | B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

where, taking $A$ to be the parameters and $B$ to be the observed data:

- $P(B | A)$ is the likelihood of the data given the parameters
- $P(A)$ is the prior distribution
- $P(B)$ is the normalizing constant
- $P(A | B)$ is the posterior distribution
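For a finite set of hypotheses, the rule can be applied directly. The sketch below uses made-up prior and likelihood values for two class labels; the function name `posterior` and all numbers are assumptions chosen only to illustrate the formula.

```python
def posterior(prior, likelihood):
    """Bayes rule over a finite set of hypotheses.

    prior[h] = P(h), likelihood[h] = P(data | h).
    Returns P(h | data) for every h.
    """
    # Normalizing constant P(data), by the law of total probability.
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}


# Made-up numbers for two hypotheses about a document's class.
prior = {"+": 0.5, "-": 0.5}
likelihood = {"+": 0.08, "-": 0.02}
post = posterior(prior, likelihood)
print(post)
```

With equal priors, the posterior is simply the likelihoods rescaled to sum to one.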

****

**Gaussian Generative Models**

A random vector, $X = (X^{(1)}, \dots, X^{(d)})^T$ is a *Gaussian vector*, or multivariate Gaussian variable, if any linear combination of its components is a (univariate) Gaussian variate or a constant, i.e. if $\alpha^T X$ is (univariate) Gaussian or constant for any constant non-zero vector, $\alpha \in \mathbb{R}^d$

The distribution of $X$, the $d$-dimensional Gaussian, is completely specified by the mean vector $\mu = \mathbb{E}[X] = (\mathbb{E}[X^{(1)}], \dots, \mathbb{E}[X^{(d)}])^T$ and the $d \times d$ covariance matrix $\Sigma$.

If $\Sigma$ is invertible, then the PDF of $X$ is as follows:

$f_X (x) = \frac{1}{\sqrt {(2\pi)^d \det (\Sigma)}} \cdot e^{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)}$, $x \in \mathbb{R}^d$

where $\det (\Sigma)$ is the determinant of $\Sigma$, which is positive when $\Sigma$ is invertible.

If $\mu = 0$ and $\Sigma$ is the identity matrix, then $X$ is called a standard normal random vector.

Note that when the covariance matrix $\Sigma$ is diagonal, the components of the Gaussian vector are independent.
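One consequence of independence is that the joint density factorizes into a product of univariate Gaussian densities. The sketch below checks this for $d = 2$, using the normalizing constant $\sqrt{(2\pi)^d \det(\Sigma)}$ and an explicit $2 \times 2$ inverse. The function names and the numeric values of $\mu$, $\Sigma$ and $x$ are assumptions picked for illustration.

```python
import math


def gaussian_pdf_2d(x, mu, sigma):
    """Density of a 2-D Gaussian with mean mu and invertible 2x2 covariance sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Inverse of a 2x2 matrix, written out explicitly.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    norm = math.sqrt((2 * math.pi) ** 2 * det)
    return math.exp(-0.5 * quad) / norm


def gaussian_pdf_1d(x, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


mu = [1.0, -2.0]
sigma = [[4.0, 0.0], [0.0, 0.25]]  # diagonal covariance
x = [0.5, -1.5]

joint = gaussian_pdf_2d(x, mu, sigma)
product = gaussian_pdf_1d(x[0], mu[0], 4.0) * gaussian_pdf_1d(x[1], mu[1], 0.25)
print(joint, product)
```

The two printed values should agree up to floating-point error; with a non-zero off-diagonal entry in `sigma` they would generally differ.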

<hr>

# Basic code
A `minimal, reproducible example`