# The General EM Algorithm


###  The Kullback-Leibler Divergence

- In order to prove that the EM algorithm works, we will need [Gibbs inequality](https://en.wikipedia.org/wiki/Gibbs%27_inequality), which is a famous theorem in information theory.

- Definition: the **Kullback-Leibler divergence** (a.k.a. **relative entropy**) is a (non-symmetric) measure of dissimilarity between two distributions $q$ and $p$, defined as
$$
D_{\text{KL}}(q \parallel p) \triangleq \sum_z q(z) \log \frac{q(z)}{p(z)}
$$

-  Theorem: **Gibbs Inequality** ([proof](https://en.wikipedia.org/wiki/Gibbs%27_inequality#Proof) uses Jensen's inequality):    
$$\boxed{ D_{\text{KL}}(q \parallel p) \geq 0 }$$
with equality iff $p=q$.

- Note that the KL divergence is asymmetric and hence not a true distance, i.e., in general $D_{\text{KL}}(q \parallel p) \neq D_{\text{KL}}(p \parallel q)$.
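
As a quick numerical illustration of the definition, the Gibbs inequality and the asymmetry, here is a minimal sketch for discrete distributions. The helper name `kl_divergence` and the two example probability vectors are made up for illustration, and the code assumes $p(z) > 0$ wherever $q(z) > 0$.

```python
import numpy as np

def kl_divergence(q, p):
    """D_KL(q || p) for discrete distributions given as probability vectors.

    Assumes p[z] > 0 wherever q[z] > 0 (otherwise the divergence is infinite).
    """
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0  # terms with q(z) = 0 contribute 0 by convention
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Two arbitrary example distributions over three states (illustrative values)
q = np.array([0.5, 0.4, 0.1])
p = np.array([0.2, 0.3, 0.5])

kl_qp = kl_divergence(q, p)  # D_KL(q || p), non-negative
kl_pq = kl_divergence(p, q)  # D_KL(p || q), in general a different number
kl_qq = kl_divergence(q, q)  # zero, since the two arguments are equal
print(kl_qp, kl_pq, kl_qq)
```

Both divergences come out non-negative and differ from each other, while the divergence of a distribution from itself is zero.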

###  EM as Free Energy minimization

- Consider a model for observations $x$, hidden (unobserved) variables $z$ and tuning parameters $\theta$. Note that, for **any** distribution $q(z)$, we can expand the log-likelihood for $\theta$ as follows: 
$$\begin{align*}
\mathrm{L}(\theta) &\triangleq \log p(x|\theta)\\
  &= \sum_z q(z) \log p(x|\theta) \\
  &= \sum_z q(z) \left( \log p(x|\theta) - \log \frac{q(z)}{p(z|x,\theta)}\right) +  \sum_z q(z) \log \frac{q(z)}{p(z|x,\theta)} \\
  &= \sum_z q(z) \log \frac{p(x,z|\theta)}{q(z)} + \underbrace{D_{\text{KL}}\left( q(z) \parallel p(z|x,\theta) 
\right)}_{\text{Kullback-Leibler div.}} \\
  &\geq \sum_z q(z) \log \frac{p(x,z|\theta)}{q(z)} \quad \text{(use Gibbs inequality)} \\
  &= \underbrace{\sum_z q(z) \log p(x,z|\theta)}_{\text{expected complete-data log-likelihood}} + \underbrace{\mathcal{H}\left[ q\right]}_{\text{entropy of }q}\\
&\triangleq \mathrm{LB}(q,\theta)
\end{align*}$$

- Technically, the Expectation-Maximization (EM) algorithm is defined by coordinate ascent on the lower-bound $\mathrm{LB}(q,\theta)$:
$$\begin{align*}
&\text{Initialize }: \theta^{(0)}\\
&\text{for }k = 0,1,2,\ldots \text{ until convergence}\\
&\quad q^{(k+1)} = \arg\max_q \mathrm{LB}(q,\theta^{(k)}) \\
&\quad \theta^{(k+1)} = \arg\max_\theta \mathrm{LB}(q^{(k+1)},\theta)
 \end{align*}$$
where $k$ is the iteration counter. 

- Since $\mathrm{LB}(q,\theta) \leq \mathrm{L}(\theta)$ for any $q$, pushing up the lower bound $\mathrm{LB}$ also pushes up the log-likelihood: as worked out in the next section, the bound is tight after each update of $q$, so an EM iteration can never decrease $\mathrm{L}$. The _reason_ to maximize $\mathrm{LB}$ rather than $\mathrm{L}$ is that $\arg\max \mathrm{LB}$ often leads to easier expressions.

<img src="./figures/Bishop-Figure914.png" width=400px>

- The negative lower-bound $\mathrm{F}(q,\theta) \triangleq -\mathrm{LB}(q,\theta)$ has appeared in various scientific disciplines. In statistical physics and variational calculus, $\mathrm{F}$ is known as the **free energy** functional. Hence, the EM algorithm is a special case of **free energy minimization**.  
  - (Just as an aside, a very influential [neuroscientific theory](https://en.wikipedia.org/wiki/Free_energy_principle) claims that information processing in the brain is also an example of free-energy minimization, see [Friston, 2009](./files/Friston-2009-The-free-energy-principle-a-rough-guide-to-the-brain.pdf)) 


### Working out Free-energy minimization for EM
- Note that
$$
\mathrm{LB}(q,\theta) =  \mathrm{L}(\theta)  - D_{\text{KL}}\left( q(z) \parallel p(z|x,\theta) \right)
$$
and consequently, maximizing $\mathrm{LB}$ over $q$ amounts to minimizing the KL divergence. By the Gibbs inequality, the minimum (zero) is attained at
$$
 q^{(k+1)}(z) := p(z|x,\theta^{(k)})
$$
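
The decomposition $\mathrm{LB}(q,\theta) = \mathrm{L}(\theta) - D_{\text{KL}}(q \parallel p(z|x,\theta))$ can be checked numerically. The sketch below uses a hypothetical toy model (a latent variable with three states and one fixed observation; all numbers are made up) and verifies that the gap between $\mathrm{L}$ and $\mathrm{LB}$ equals the KL divergence, and that the bound is tight when $q$ is the posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: z takes 3 values, x is one fixed observation.
prior = np.array([0.5, 0.3, 0.2])      # p(z | theta)
lik = np.array([0.10, 0.40, 0.25])     # p(x | z, theta) at the observed x
joint = prior * lik                    # p(x, z | theta)
evidence = joint.sum()                 # p(x | theta)
posterior = joint / evidence           # p(z | x, theta)
log_lik = float(np.log(evidence))      # L(theta)

def lower_bound(q):
    """LB(q, theta) = sum_z q(z) log( p(x, z | theta) / q(z) )."""
    return float(np.sum(q * np.log(joint / q)))

def kl(q, p):
    """D_KL(q || p) for strictly positive probability vectors."""
    return float(np.sum(q * np.log(q / p)))

# An arbitrary distribution q(z) drawn at random (strictly positive entries)
q = rng.dirichlet(np.ones(3))

gap = log_lik - lower_bound(q)         # should equal D_KL(q || posterior)
tight_bound = lower_bound(posterior)   # should equal L(theta)
print(gap, kl(q, posterior), tight_bound, log_lik)
```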

- It also follows from (the last line of) the monster derivation above that maximizing $\mathrm{LB}$ w.r.t. $\theta$ amounts to maximization of the _expected complete-data log-likelihood_:
$$
\textbf{EM}:\, \theta^{(k+1)} := \underbrace{\arg\max_\theta}_{\text{M-step}} \underbrace{\sum_z  \overbrace{p(z|x,\theta^{(k)})}^{q^{(k+1)}(z)} \log p(x,z|\theta)}_{\text{E-step}}
$$




###  EM for GMM revisited

- (Total) log-likelihood: $\mathrm{L}(\theta) = \sum_n \log \sum_k \pi_k \mathcal{N}(x_n|\mu_k,\sigma_k)$ (here $k$ indexes the mixture components, not the EM iterations)

- **E-step**: compute responsibilities for latent classes
$$
\gamma_{nk} = p(z_{nk}=1|x_n,\hat{\theta}) = \frac{ \hat{\pi}_k \mathcal{N}(x_n|\hat{\theta}_k) }{ \sum_j \hat{\pi}_j \mathcal{N}(x_n|\hat{\theta}_j) }
$$

- **M-step**: maximize the expected complete-data log-likelihood
$$\begin{align*}
\mathrm{E}_\mathbf{Z}[ \log p(\mathbf{X},\mathbf{Z}|\mu,\sigma,\pi) ] &= \sum_n \sum_k \gamma_{nk} \log p(x_n,z_{nk}=1|\theta) \\
    &= \sum_{nk} \gamma_{nk} \log \mathcal{N}(x_n|\mu_k,\sigma_k) + \sum_{nk} \gamma_{nk} \log \pi_k
\end{align*}$$

We have maximized this expression before; see the section on density estimation (Gaussian, multinomial).
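
The E-step and M-step above can be sketched for a one-dimensional mixture as follows. This is an illustrative sketch, not a reference implementation: the function names, the synthetic data and the initial values are all made up, and $\sigma_k$ is assumed to be a standard deviation. The M-step uses the standard weighted maximum-likelihood estimates for a Gaussian and a categorical distribution.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density; sigma is assumed to be a standard deviation."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def em_gmm_1d(x, pi, mu, sigma, n_iter=50):
    """Run n_iter EM iterations for a 1-D Gaussian mixture (illustrative sketch)."""
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k] under the current parameters
        weighted = pi * normal_pdf(x[:, None], mu, sigma)      # shape (N, K)
        log_liks.append(float(np.sum(np.log(weighted.sum(axis=1)))))
        gamma = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: responsibility-weighted maximum-likelihood estimates
        n_k = gamma.sum(axis=0)
        pi = n_k / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / n_k
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
    return pi, mu, sigma, log_liks

# Synthetic data from two Gaussians (illustrative values only)
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 300)])

pi, mu, sigma, log_liks = em_gmm_1d(
    x,
    pi=np.array([0.5, 0.5]),
    mu=np.array([-1.0, 1.0]),
    sigma=np.array([1.0, 1.0]),
)
print(pi, mu, sigma)
```

The list `log_liks` holds the log-likelihood at the start of each iteration; it should never decrease, which is a handy sanity check when implementing EM.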


###  EM Example--Three Coins
-  You have three coins in your pocket
  -  Coin 0: $p(\mathrm{Head}) = \lambda$
  -  Coin 1: $p(\mathrm{Head}) = \rho$
  -  Coin 2: $p(\mathrm{Head}) = \theta$
    
-  (Scenario). Toss coin $0$. If Head comes up, toss coin 1 three times; otherwise, toss coin 2 three times.
-  The observed sequences **after** each toss with coin 0 were $\langle \mathrm{HHH}\rangle$, $\langle \mathrm{HTH}\rangle$, $\langle \mathrm{HHT}\rangle$, and $\langle\mathrm{HTT}\rangle$

-  [Q.] Estimate most likely values for $\lambda$, $\rho$ and $\theta$
-  [A.] homework. Use EM.
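
For checking a hand derivation, here is one possible sketch of the EM iterations for this problem. It assumes the hidden variable is the outcome of coin 0 for each triple; the function name, the initial values and the number of iterations are arbitrary choices for illustration. The estimates that EM returns depend on the initial values.

```python
import numpy as np

# Number of heads in each observed triple: HHH, HTH, HHT, HTT
heads = np.array([3, 2, 2, 1])
n_toss = 3

def em_three_coins(heads, lam, rho, theta, n_iter=30):
    """EM sketch for the three-coins model; the hidden variable is coin 0's outcome."""
    log_liks = []
    for _ in range(n_iter):
        # E-step: posterior probability that coin 0 showed Head for each triple
        a = lam * rho**heads * (1 - rho)**(n_toss - heads)            # via coin 1
        b = (1 - lam) * theta**heads * (1 - theta)**(n_toss - heads)  # via coin 2
        log_liks.append(float(np.sum(np.log(a + b))))
        gamma = a / (a + b)

        # M-step: responsibility-weighted fractions of heads
        lam = gamma.mean()
        rho = np.sum(gamma * heads) / (n_toss * np.sum(gamma))
        theta = np.sum((1 - gamma) * heads) / (n_toss * np.sum(1 - gamma))
    return lam, rho, theta, log_liks

# Arbitrary starting point (an assumption; other choices may give other answers)
lam, rho, theta, log_liks = em_three_coins(heads, lam=0.3, rho=0.3, theta=0.6)
print(lam, rho, theta)
```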




###  Some Properties of EM

-  EM is a general procedure for learning in the presence of unobserved variables.
-  In a sense, it is a **family of algorithms**. The update rules you will derive depend on the probability model assumed.
-  (Good!) **No tuning parameters** such as a learning rate, unlike gradient descent-type algorithms
-  (Bad!) EM is an iterative procedure that is very sensitive to initial conditions: it is only guaranteed to converge to a **local optimum**.
-  Start from trash $\rightarrow$ end up with trash. Hence, we need a good and fast initialization procedure (often used: K-Means)
-  Also used to train HMMs, etc.