<h1>Expectation Maximization</h1>

# 0. Latent Random Variables Model

We use multiple random variables to model relationships in high-dimensional datasets.

+ Independence assumption

For example, logistic regression models the conditional distribution,

> $p(y=1|X; W) = \sigma(W^T X)$

+ Conditional independence assumption

For example, the naive Bayes classifier models the joint distribution,

> $p(y, X) = p(y) \prod_{i}p(X_i|y)$

+ Graphical models

Directed or undirected graphical models are used to model complex relationships among random variables.

Modeling the observed random variables directly is sometimes <b>time-consuming</b> and <b>meaningless</b>.

Models with latent random variables are introduced to address this.

## Single Gaussian Model without latent variables

For a dataset that follows a single Gaussian distribution, the parameters $(\mu, \sigma^2)$ are easy to calculate using point estimation.

By the strong law of large numbers, the sample moments converge to the true moments, so for large $N$,

> $\mu = \mathbb{E}[X] \approx \bar{X} = \frac{1}{N}\sum_{i=1}^N X_i$

> $\sigma^2 = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - \mathbb{E}^2[X]$

> $\approx \frac{1}{N} \sum_{i=1}^N X_i^2 - \mu^2$
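The two estimates above can be sketched in a few lines. This is a minimal illustration, assuming a 1-D sample stored in a Python list; the function name `fit_gaussian` and the synthetic data (true mean 2, true standard deviation 3) are chosen here for the example only.

```python
import random

def fit_gaussian(xs):
    """Moment-based estimates (mu, sigma^2) for a 1-D sample."""
    n = len(xs)
    mu = sum(xs) / n
    # Plug-in variance E[X^2] - E[X]^2, matching the formula above
    # (this is the biased estimator that divides by N).
    var = sum(x * x for x in xs) / n - mu * mu
    return mu, var

# Synthetic data, for illustration only.
rng = random.Random(0)
sample = [rng.gauss(2.0, 3.0) for _ in range(10_000)]
mu_hat, var_hat = fit_gaussian(sample)
print(mu_hat, var_hat)  # expected to land near 2.0 and 9.0
```

With more samples the estimates should move closer to the true parameters, which is what the law of large numbers suggests.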

## K-Means

For a dataset that contains multiple hidden clusters, K-means minimizes the objective

> $J = \sum_{i=1}^N \sum_{k=1}^K \mathbb{I}\left(k = \arg\min_j \| X_i - C_j \|\right) \| X_i - C_k \|^2$

where each center $C_k$ is the mean of the points assigned to it,

> $C_k = \frac{ \sum_{i=1}^N \mathbb{I} (k = \arg\min_j \| X_i - C_j \|) X_i}
{\sum_{i=1}^N \mathbb{I} (k = \arg\min_j \| X_i - C_j \|)}$ 
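The objective and the center update can be alternated until the assignments stop changing. The sketch below assumes scalar data and a hand-picked initialization; `assign`, `kmeans_1d` and `objective` are illustrative names, and the fixed iteration count stands in for a proper stopping rule.

```python
def assign(xs, centers):
    """Index of the nearest center for every point (the indicator in J)."""
    return [min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
            for x in xs]

def kmeans_1d(xs, init_centers, n_iter=20):
    """Alternate the assignment step and the center update on scalar data."""
    centers = list(init_centers)
    for _ in range(n_iter):
        labels = assign(xs, centers)
        for k in range(len(centers)):
            members = [x for x, lab in zip(xs, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = sum(members) / len(members)
    return centers, assign(xs, centers)

def objective(xs, centers, labels):
    """J: squared distance of every point to its assigned center."""
    return sum((x - centers[lab]) ** 2 for x, lab in zip(xs, labels))

# Toy data with two obvious groups, for illustration only.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, labels = kmeans_1d(points, init_centers=[0.0, 6.0])
J = objective(points, centers, labels)
print(centers, labels, J)
```

Note that every point is assigned to exactly one cluster; the Gaussian mixture model in the next section replaces this hard assignment with a probability.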


## Gaussian Mixture Model

> $p(X) = \sum_{k=1}^K \pi_k \mathcal{N}(X|\mu_k, \sigma^2_k)$

# 1. Expectation Maximization

## 1.0 Maximum Likelihood Estimation (MLE)

For models with observed random variables only, maximum likelihood estimation is a good method for estimating parameters.

For an i.i.d. dataset $X$, the likelihood function is, 

> $L(X; \theta) = \prod_{i=1}^N p(X_i; \theta)$

For numerical and mathematical convenience, we work with the log-likelihood function, 

> $l(X; \theta) = \log L(X; \theta) = \sum_{i=1}^N \log p(X_i; \theta)$

Thus, 

> $\theta^* = \arg\max_{\theta} l(X; \theta)$

Compute the gradient of $l$ with respect to $\theta$, 

> $\frac{\partial{l}}{\partial{\theta}} = \sum_{i=1}^N \nabla_{\theta} \log p(X_i; \theta) $

The gradient is a sum of per-sample terms, so it is tractable whenever $\nabla_{\theta} \log p(X_i; \theta)$ is.
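As a worked instance, assume the single Gaussian model from Section 0 with $\theta = (\mu, \sigma^2)$. Setting the gradient to zero recovers the same estimates as before:

> $l(X; \mu, \sigma^2) = -\frac{N}{2} \log (2 \pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^N (X_i - \mu)^2$

> $\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^N (X_i - \mu) = 0 \Rightarrow \mu^* = \frac{1}{N} \sum_{i=1}^N X_i$

> $\frac{\partial l}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^N (X_i - \mu)^2 = 0 \Rightarrow (\sigma^2)^* = \frac{1}{N} \sum_{i=1}^N (X_i - \mu^*)^2$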

## 1.1 Latent Random Variable

For a dataset $(X, Z)$ with observed variable $X$ and hidden variable $Z$, the log-likelihood function of the parameters $\theta$ is, 

> $l(X; \theta) = \sum_{i=1}^N \log p(X_i; \theta)$

> $= \sum_{i=1}^N \log \sum_{z} p(X_i, z; \theta)$

The gradient of $l(X; \theta)$ w.r.t. $\theta$ is, 

> $\nabla_{\theta} l(X; \theta) = \sum_{i=1}^N \nabla_{\theta} \log \sum_{z} p(X_i, z; \theta)$

If $z$ has a large state space, the inner sum is expensive to compute, and because the logarithm acts on a sum, the expression no longer splits into simple per-state terms. Thus, it is generally not tractable to apply maximum likelihood estimation directly to models with latent random variables.
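When the state space of $z$ is small enough to enumerate, the marginal log-likelihood can still be evaluated by summing over $z$ explicitly. The sketch below does this for a 1-D Gaussian mixture, assuming $p(X_i, z=k; \theta) = \pi_k \mathcal{N}(X_i|\mu_k, \sigma_k^2)$. The helper names are illustrative, and the log-sum-exp trick is an implementation choice made here for numerical stability, not something the derivation requires.

```python
import math

def log_normal_pdf(x, mu, var):
    """log N(x | mu, var) for scalar x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def logsumexp(vals):
    """log(sum(exp(v))) computed without overflow or underflow."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def gmm_log_likelihood(xs, pis, mus, variances):
    """Sum over i of log sum_z p(x_i, z; theta)."""
    total = 0.0
    for x in xs:
        # log p(x, z=k) = log pi_k + log N(x | mu_k, var_k)
        log_joint = [math.log(p) + log_normal_pdf(x, m, v)
                     for p, m, v in zip(pis, mus, variances)]
        total += logsumexp(log_joint)  # marginalize z inside the log
    return total

# Hypothetical parameters and data, for illustration only.
ll = gmm_log_likelihood([0.1, 2.9, 3.2], [0.4, 0.6], [0.0, 3.0], [1.0, 1.0])
print(ll)
```

The cost of the inner sum grows with the number of states of $z$, which is the difficulty the paragraph above describes.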

## 1.2 Expectation Maximization

It is easy to see that, if the latent variables were <b>observed</b>, the problem would be easy to solve using MLE.

### GMM

Using GMM as an example, 

> $p(X; \theta) = \sum_{k=1}^K \pi_k \mathcal{N} (X|\mu_k, \sigma_k^2)$

> $\sum_{k=1}^K \pi_k = 1, 0 \le \pi_k \le 1$

$Z$ is a one-of-$K$ discrete hidden random variable that indicates which cluster a data point belongs to; $\pi_k$ is the prior probability of cluster $k$.

> $p(Z_k = 1) = \pi_k$

> $p(Z) = \prod_{k=1}^K \left\{ \pi_k \right\} ^{Z_k}$

> $p(X|Z_k=1;\theta) = \mathcal{N}(X|\mu_k, \sigma^2_k)$

> $p(X|Z; \theta) = \prod_{k=1}^K \left\{\mathcal{N}(X|\mu_k, \sigma^2_k)\right\}^{Z_k}$

The GMM can also be written as,

> $p(X;\theta) = \sum_{z} p(X, z; \theta)$

> $= \sum_z p(z)p(X|z;\theta)$

For every observed data point $x_n$, there is a corresponding one-of-$K$ latent variable $z_{n}$.
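The factorization $p(X, Z) = p(Z) p(X|Z)$ suggests a two-stage way to generate data: first draw the cluster indicator, then draw the observation from that cluster's Gaussian. The sketch below assumes 1-D components and represents $z_n$ as a cluster index instead of a one-of-$K$ vector; `sample_gmm` and the parameter values are illustrative.

```python
import random

def sample_gmm(n, pis, mus, sigmas, seed=0):
    """Draw n pairs (z, x): z ~ Categorical(pi), then x ~ N(mu_z, sigma_z^2)."""
    rng = random.Random(seed)
    zs, xs = [], []
    for _ in range(n):
        z = rng.choices(range(len(pis)), weights=pis)[0]  # latent cluster index
        zs.append(z)
        xs.append(rng.gauss(mus[z], sigmas[z]))  # observation given the cluster
    return zs, xs

# Hypothetical mixture: 30% of the mass near -5, 70% near +5.
zs, xs = sample_gmm(5000, [0.3, 0.7], [-5.0, 5.0], [1.0, 1.0])
frac0 = zs.count(0) / len(zs)
print(frac0)  # expected to be close to pi_0 = 0.3
```

In a real dataset only `xs` would be available; `zs` is exactly the part that is hidden.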

### EM

#### Jensen's Inequality

If $f(x)$ is a convex function, then 

> $\mathbb{E}[f(x)] \ge f(\mathbb{E}[x])$

$f(x) = \log(x)$ is a concave function, since $f''(x) = -\frac{1}{x^2} < 0, \forall x \in \mathbb{R}^{+}$, so the inequality reverses: $\log \mathbb{E}[x] \ge \mathbb{E}[\log x]$.

<b>Equality</b> holds if $f(x)$ is affine or $x$ is constant.
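A quick numerical check of the inequality for $\log$, using an empirical average in place of the expectation. The sample range and variable names are arbitrary choices for this illustration.

```python
import math
import random

rng = random.Random(0)
xs = [rng.uniform(0.5, 4.0) for _ in range(1000)]  # positive values only

log_of_mean = math.log(sum(xs) / len(xs))
mean_of_log = sum(math.log(x) for x in xs) / len(xs)
gap = log_of_mean - mean_of_log  # Jensen: this is never negative

# When x is constant, the two sides coincide (the equality case).
const = [2.0] * 10
const_gap = (math.log(sum(const) / len(const))
             - sum(math.log(x) for x in const) / len(const))
print(gap, const_gap)
```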

Reconsider the log-likelihood function, 

> $l(X; \theta) = \sum_{i=1}^N \log \sum_{z} q(z) \frac{p(X_i, z; \theta)}{q(z)}$ (Construct an expectation)

> $\ge \sum_{i=1}^N \sum_{z} q(z) 
\log \frac{p(X_i, z; \theta)}{q(z)}$

$q(z)$ is a distribution, 

> $\sum_{z} q(z) = 1$

The $\log$ function is not affine, so for equality to hold, $\frac{p(X_i, z; \theta)}{q(z)}$ must be constant with respect to $z$.

> $q(z) \propto p(X_i, z; \theta)$

Since $\sum_{z} q(z) = 1$, the normalized <b>choice</b> of $q(z)$ is, 

> $q(z) = \frac{p(X_i, z; \theta)}{\sum_{z} p(X_i, z; \theta)}$

> $= \frac{p(X_i, z; \theta)}{p(X_i; \theta)}$

> $= p(z|X_i; \theta)$

Fixing $q(z) = p(z|X_i; \theta^{old})$ at the current parameters $\theta^{old}$, the bound becomes

> $l(X; \theta) \ge  \sum_{i=1}^N \sum_z p(z|X_i; \theta^{old}) 
\log \frac{p(X_i, z; \theta)}{p(z|X_i; \theta^{old})}$

> $= \sum_{i=1}^N \sum_z p(z|X_i; \theta^{old}) \log p(X_i, z; \theta)$

> $- \sum_{i=1}^N \sum_z p(z|X_i; \theta^{old}) \log p(z|X_i; \theta^{old})$ (Constant w.r.t. $\theta$)

To maximize the right-hand side of the inequality, we only need to maximize

> $\sum_{i=1}^N \sum_z p(z|X_i; \theta^{old}) \log p(X_i, z; \theta)$

which is the expectation of the log-joint distribution $\log p(X_i, z; \theta)$ under the posterior $p(z|X_i; \theta^{old})$, summed over the data points.
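The claim that the posterior makes the bound tight can be checked numerically. The sketch below assumes a two-component 1-D Gaussian mixture and a single data point; all parameter values and helper names are made up for the illustration.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def lower_bound(joint, q):
    """sum_z q(z) log(p(x, z) / q(z)), skipping states with q(z) = 0."""
    return sum(qz * math.log(pz / qz) for pz, qz in zip(joint, q) if qz > 0)

# Hypothetical mixture parameters and one observation.
pis, mus, variances = [0.4, 0.6], [0.0, 3.0], [1.0, 1.0]
x = 1.0

joint = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)]  # p(x, z)
log_px = math.log(sum(joint))                  # log p(x)
posterior = [j / sum(joint) for j in joint]    # p(z | x)

bound_posterior = lower_bound(joint, posterior)  # q = posterior: bound is tight
bound_uniform = lower_bound(joint, [0.5, 0.5])   # any other q: strictly below
print(log_px, bound_posterior, bound_uniform)
```

The gap between `log_px` and the bound is the KL divergence from $q(z)$ to the posterior, which is why it vanishes only when the two agree.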

#### Initialize 
+ Initialize parameters $\theta^{old}$.

#### E-step

+ Compute the posterior $q(z) = p(z|X_i; \theta^{old})$ for every data point $X_i$; this is the distribution under which the expectation is taken.
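For the GMM example, this posterior follows from Bayes' rule applied to $p(Z_k = 1) = \pi_k$ and $p(X|Z_k = 1; \theta) = \mathcal{N}(X|\mu_k, \sigma_k^2)$. The symbol $\gamma_{ik}$ (often called the responsibility) is notation introduced here for convenience, with all parameters on the right-hand side taken from $\theta^{old}$:

> $\gamma_{ik} = p(Z_k = 1|X_i; \theta^{old}) = \frac{\pi_k \mathcal{N}(X_i|\mu_k, \sigma_k^2)}{\sum_{j=1}^K \pi_j \mathcal{N}(X_i|\mu_j, \sigma_j^2)}$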

#### M-step

+ Maximize the Q function with respect to $\theta$ to obtain the new parameters.

> $Q(\theta, \theta^{old}) = \sum_{i=1}^N \sum_z p(z|X_i; \theta^{old}) \log p(X_i, z; \theta)$

> $\theta^{new} = \arg\max_{\theta} Q(\theta, \theta^{old})$
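For the GMM example the maximization has a closed form. Writing $\gamma_{ik} = p(Z_k = 1|X_i; \theta^{old})$ for the E-step posterior and $N_k$ for the effective cluster size (both are notation introduced here), setting the derivatives of $Q$ to zero under the constraint $\sum_k \pi_k = 1$ gives:

> $N_k = \sum_{i=1}^N \gamma_{ik}$

> $\pi_k^{new} = \frac{N_k}{N}$

> $\mu_k^{new} = \frac{1}{N_k} \sum_{i=1}^N \gamma_{ik} X_i$

> $(\sigma_k^2)^{new} = \frac{1}{N_k} \sum_{i=1}^N \gamma_{ik} (X_i - \mu_k^{new})^2$

These are weighted versions of the single Gaussian estimates from Section 0, with each point counted in proportion to its posterior probability of belonging to cluster $k$.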

#### Check Convergence

+ Calculate the log-likelihood and check convergence between iterations, for example by stopping when the increase falls below a threshold.

> $l(X; \theta) = \sum_{i=1}^N \log \left\{ \sum_{z} p(z; \theta) p(X_i| z; \theta) \right \}$
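Putting the four steps together for a 1-D Gaussian mixture gives the loop below. It is a minimal sketch under several assumptions: scalar data, hand-picked initial parameters, the closed-form GMM updates in the M-step, and a small variance floor as a guard. The function names, tolerance and synthetic data are all illustrative, and a more robust version would compute the densities in log space.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, pis, mus, variances):
    """sum_i log sum_z p(z) p(x_i | z)."""
    return sum(math.log(sum(p * normal_pdf(x, m, v)
                            for p, m, v in zip(pis, mus, variances)))
               for x in xs)

def em_gmm_1d(xs, pis, mus, variances, tol=1e-6, max_iter=200):
    n, K = len(xs), len(pis)
    history = [log_likelihood(xs, pis, mus, variances)]
    for _ in range(max_iter):
        # E-step: posterior p(z = k | x_i; theta_old) for every point.
        resp = []
        for x in xs:
            joint = [p * normal_pdf(x, m, v)
                     for p, m, v in zip(pis, mus, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: closed-form maximizer of Q for a Gaussian mixture.
        nk = [sum(r[k] for r in resp) for k in range(K)]
        pis = [nk[k] / n for k in range(K)]
        mus = [sum(r[k] * x for r, x in zip(resp, xs)) / nk[k]
               for k in range(K)]
        variances = [
            max(sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk[k],
                1e-6)  # floor guards against a collapsing component
            for k in range(K)
        ]
        # Convergence check on the log-likelihood.
        history.append(log_likelihood(xs, pis, mus, variances))
        if history[-1] - history[-2] < tol:
            break
    return pis, mus, variances, history

# Synthetic data: 300 points near -2 and 700 points near 3 (illustration only).
rng = random.Random(0)
data = ([rng.gauss(-2.0, 1.0) for _ in range(300)]
        + [rng.gauss(3.0, 1.0) for _ in range(700)])
pis, mus, variances, history = em_gmm_1d(
    data, pis=[0.5, 0.5], mus=[-1.0, 1.0], variances=[1.0, 1.0])
print(pis, mus, variances, len(history))
```

The recorded `history` should never decrease, since each iteration maximizes a lower bound that touches the log-likelihood at $\theta^{old}$. EM only finds a local optimum, so a different initialization may give a different answer.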