# The General EM Algorithm


### Preliminaries

- Goal
  - A more formal treatment of the general EM algorithm
- Materials
  - Mandatory
    - These lecture notes
  - Optional 
    - Bishop pp. 55-57 for Jensen's inequality
    - Bishop pp. 439-443 for EM applied to GMM
    - Bishop pp. 450-455 for the general EM algorithm
    - Borman (2006) [The Expectation Maximization Algorithm - A short tutorial](./files/Borman-2006-The-Expectation-Maximization-Algorithm-A-short-tutorial.pdf)


###  The Kullback-Leibler Divergence

- To prove that the EM algorithm works, we need [Gibbs' inequality](https://en.wikipedia.org/wiki/Gibbs%27_inequality), a well-known result in information theory.

- Definition: the **Kullback-Leibler divergence** (a.k.a. **relative entropy**) measures the dissimilarity between two distributions $q$ and $p$ (it is not a true distance metric) and is defined as
$$
D_{\text{KL}}(q \parallel p) \triangleq \sum_z q(z) \log \frac{q(z)}{p(z)}
$$

-  Theorem: **Gibbs' inequality** ([proof](https://en.wikipedia.org/wiki/Gibbs%27_inequality#Proof) uses Jensen's inequality):    
$$\boxed{ D_{\text{KL}}(q \parallel p) \geq 0 }$$
with equality iff $p=q$.

- Note that the KL divergence is asymmetric, i.e., in general $D_{\text{KL}}(q \parallel p) \neq D_{\text{KL}}(p \parallel q)$
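
A small numerical check can make these properties concrete. The sketch below uses two made-up distributions `q` and `p` (illustrative values, not from the lecture) and a hypothetical helper `kl_divergence`. It shows that the divergence is non-negative, vanishes when both arguments coincide, and changes when the arguments are swapped.

```python
import math

def kl_divergence(q, p):
    """D_KL(q || p) for discrete distributions given as lists of probabilities.

    Terms with q(z) = 0 contribute 0 (convention 0 * log 0 = 0).
    Assumes p(z) > 0 wherever q(z) > 0.
    """
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

# Illustrative distributions over three states (made-up numbers).
q = [0.5, 0.4, 0.1]
p = [0.2, 0.3, 0.5]

kl_qp = kl_divergence(q, p)   # D_KL(q || p)
kl_pq = kl_divergence(p, q)   # D_KL(p || q), generally a different value
kl_qq = kl_divergence(q, q)   # zero, since both arguments are equal

print(f"D_KL(q||p) = {kl_qp:.4f}")
print(f"D_KL(p||q) = {kl_pq:.4f}")
print(f"D_KL(q||q) = {kl_qq:.4f}")
```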

###  EM as Free Energy minimization

- Consider a model for observations $x$, hidden variables $z$ and tuning parameters $\theta$. Note that, for **any** distribution $q(z)$, we can expand the log-likelihood as follows: 

$$\begin{align*}
\mathrm{L}(\theta) &\triangleq \log p(x|\theta)  \\
  &= \sum_z q(z) \log p(x|\theta) \\
  &= \sum_z q(z) \left( \log p(x|\theta) - \log \frac{q(z)}{p(z|x,\theta)}\right) +  \sum_z q(z) \log \frac{q(z)}{p(z|x,\theta)} \\
  &= \sum_z q(z) \log \frac{p(x,z|\theta)}{q(z)} + \underbrace{D_{\text{KL}}\left( q(z) \parallel p(z|x,\theta) 
\right)}_{\text{Kullback-Leibler div.}}  \\
  &\geq \sum_z q(z) \log \frac{p(x,z|\theta)}{q(z)} \quad \text{(use Gibbs inequality)}  \\
  &= \underbrace{\sum_z q(z) \log p(x,z|\theta)}_{\text{expected complete-data log-likelihood}} + \underbrace{\mathcal{H}\left[ q\right]}_{\text{entropy of }q}  \\
&\triangleq \mathrm{LB}(q,\theta) 
\end{align*}$$

- Technically, the Expectation-Maximization (EM) algorithm is defined by coordinate ascent on the lower-bound $\mathrm{LB}(q,\theta)$:
$$\begin{align*}
  &\text{Initialize }: \theta^{(0)}\\
  &\text{for }m = 0,1,2,\ldots \text{ until convergence}\\
    &\quad q^{(m+1)} = \arg\max_q \mathrm{LB}(q,\theta^{(m)}) \\
    &\quad \theta^{(m+1)} = \arg\max_\theta \mathrm{LB}(q^{(m+1)},\theta)
\end{align*}$$
where $m$ is the iteration counter. 

- Note that
$$
\mathrm{LB}(q,\theta) =  \mathrm{L}(\theta)  - D_{\text{KL}}\left( q(z) \parallel p(z|x,\theta) \right)
$$
and consequently, maximizing $\mathrm{LB}$ over $q$ (for fixed $\theta$) amounts to minimizing the KL divergence, since $\mathrm{L}(\theta)$ does not depend on $q$. By Gibbs' inequality, the minimum is attained at
$$
 q^{(m+1)}(z) := p(z|x,\theta^{(m)})
$$
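
The decomposition and the resulting choice of $q$ can be checked numerically. The sketch below assumes a hypothetical toy model (a binary hidden variable and a single binary observation, with made-up numbers). It verifies that $\mathrm{LB} + D_{\text{KL}} = \mathrm{L}$ for an arbitrary $q$, and that the bound becomes tight when $q$ equals the posterior.

```python
import math

# Hypothetical toy model (illustrative numbers): z in {0, 1}, one observation x = 1.
prior = [0.3, 0.7]     # p(z | theta)
lik_x1 = [0.9, 0.2]    # p(x = 1 | z, theta)

joint = [pz * lx for pz, lx in zip(prior, lik_x1)]   # p(x, z | theta)
evidence = sum(joint)                                 # p(x | theta)
posterior = [j / evidence for j in joint]             # p(z | x, theta)
log_lik = math.log(evidence)                          # L(theta)

def lower_bound(q):
    """LB(q, theta) = sum_z q(z) log( p(x, z | theta) / q(z) )."""
    return sum(qz * math.log(jz / qz) for qz, jz in zip(q, joint) if qz > 0)

def kl(q, p):
    """D_KL(q || p) for discrete distributions."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

q_arbitrary = [0.5, 0.5]
lb_arbitrary = lower_bound(q_arbitrary)
gap_arbitrary = kl(q_arbitrary, posterior)   # slack between LB and L
lb_posterior = lower_bound(posterior)        # bound after an exact E-step

print(f"L(theta)              = {log_lik:.6f}")
print(f"LB + KL (arbitrary q) = {lb_arbitrary + gap_arbitrary:.6f}")
print(f"LB (q = posterior)    = {lb_posterior:.6f}")
```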

- It also follows from (the last line of) the multi-line derivation above that maximizing $\mathrm{LB}$ w.r.t. $\theta$ amounts to maximization of the _expected complete-data log-likelihood_. Hence, the EM algorithm comprises iterations of
$$
\boxed{\textbf{EM}:\, \theta^{(m+1)} := \underbrace{\arg\max_\theta}_{\text{M-step}} \underbrace{\sum_z  \overbrace{p(z|x,\theta^{(m)})}^{q^{(m+1)}(z)} \log p(x,z|\theta)}_{\text{E-step}} }
$$

<!---
- Compare this to regular log-likelihood maximization:
$$
\boxed{\textbf{ML}:\, \hat \theta:= \arg\max_\theta \log p(x|\theta)}
$$
--->

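- Since $\mathrm{LB}(q,\theta) \leq \mathrm{L}(\theta)$ always holds, and the bound is tight after each E-step (where $q(z) = p(z|x,\theta^{(m)})$), any increase of $\mathrm{LB}$ in the M-step increases the log-likelihood $\mathrm{L}$ by at least as much. Hence, an EM iteration never decreases the log-likelihood. The _reason_ to maximize $\mathrm{LB}$ rather than the log-likelihood $\mathrm{L}$ directly is that $\arg\max \mathrm{LB}$ often leads to easier expressions. E.g., see this illustrative figure: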

<img src="./figures/Bishop-Figure914.png" width=400px>

- Just as an aside, the negative lower-bound $\mathrm{F}(q,\theta) \triangleq -\mathrm{LB}(q,\theta)$ appears in various scientific disciplines. In statistical physics and variational calculus, $\mathrm{F}$ is known as the **free energy** functional. Hence, the EM algorithm is a special case of **free energy minimization**.  
  - A very influential [neuroscientific theory](https://en.wikipedia.org/wiki/Free_energy_principle) claims that information processing in the brain is also an example of free-energy minimization, see [Friston, 2009](./files/Friston-2009-The-free-energy-principle-a-rough-guide-to-the-brain.pdf). 
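
To see where the name comes from, the expressions derived earlier in this section can be rearranged as follows. This is only a short restatement of those results; reading $-\log p(x,z|\theta)$ as an "energy" is the statistical-physics interpretation (with temperature set to 1).

$$\begin{align*}
\mathrm{F}(q,\theta) &= \underbrace{-\sum_z q(z) \log p(x,z|\theta)}_{\text{expected energy}} - \underbrace{\mathcal{H}\left[ q\right]}_{\text{entropy of }q} \\
  &= D_{\text{KL}}\left( q(z) \parallel p(z|x,\theta) \right) - \log p(x|\theta)
\end{align*}$$

The second line shows that $\mathrm{F}$ is an upper bound on the negative log-likelihood, with equality iff $q(z) = p(z|x,\theta)$.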



###  Exercise: EM for GMM revisited

##### E-step
- Write down the GMM generative model
- The complete-data set is $D_c=\{x_1,z_1,x_2,z_2,\ldots,x_n,z_n\}$. Write down the _complete-data_ likelihood $p(D_c|\theta)$
- Write down the complete-data _log_-likelihood $\log p(D_c|\theta)$
- Write down the _expected_ complete-data log-likelihood $\mathrm{E}_Z\left[ \log p(D_c|\theta) \right]$

##### M-step
- Maximize $\mathrm{E}_Z\left[ \log p(D_c|\theta) \right]$ w.r.t. $\theta=\{\pi,\mu,\Sigma\}$

- Verify that your solution is the same as the 'intuitive' solution of the previous lesson. 
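
As a way to check a derived solution numerically, here is a minimal sketch of EM for a one-dimensional, two-component GMM. Everything in it is an assumption made for illustration: the data are made up, the helper names (`normal_pdf`, `log_likelihood`, `em_step`) are hypothetical, and the M-step uses the standard closed-form updates for the scalar case. The script also records the log-likelihood per iteration, which should never decrease.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_likelihood(data, pi, mu, var):
    """log p(D | theta) for a 1-D Gaussian mixture."""
    K = len(pi)
    return sum(
        math.log(sum(pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(K)))
        for x in data
    )

def em_step(data, pi, mu, var):
    K, N = len(pi), len(data)
    # E-step: responsibilities gamma[n][k] = p(z_n = k | x_n, theta_old).
    gamma = []
    for x in data:
        w = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(K)]
        total = sum(w)
        gamma.append([wk / total for wk in w])
    # M-step: maximize the expected complete-data log-likelihood.
    n_k = [sum(g[k] for g in gamma) for k in range(K)]
    new_pi = [n_k[k] / N for k in range(K)]
    new_mu = [sum(g[k] * x for g, x in zip(gamma, data)) / n_k[k] for k in range(K)]
    new_var = [
        sum(g[k] * (x - new_mu[k]) ** 2 for g, x in zip(gamma, data)) / n_k[k]
        for k in range(K)
    ]
    return new_pi, new_mu, new_var

# Made-up data with two visible clusters (illustrative only).
data = [-2.1, -1.9, -2.4, -1.6, -2.0, 2.8, 3.1, 3.4, 2.9, 3.3, 2.6]

# Arbitrary initialization theta^(0).
pi, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

history = [log_likelihood(data, pi, mu, var)]
for _ in range(20):
    pi, mu, var = em_step(data, pi, mu, var)
    history.append(log_likelihood(data, pi, mu, var))

print("mixing weights:", [round(v, 3) for v in pi])
print("means:         ", [round(v, 3) for v in mu])
print("variances:     ", [round(v, 3) for v in var])
```

A full solution to the exercise would use multivariate Gaussians with covariance matrices $\Sigma_k$; the scalar case keeps the sketch free of linear-algebra dependencies.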

###  Exercise: EM for Three Coins problem

- You have three coins in your pocket. Each toss has an outcome in $\{\mathrm{H},\mathrm{T}\}$, and the probability of heads depends on the coin:
$$
p(\mathrm{H}) = \begin{cases} \lambda & \text{for coin }0 \\
 \rho & \text{for coin }1 \\
 \theta & \text{for coin }2 \end{cases}
$$

    
-  **Scenario**. Toss coin $0$. If it lands heads, toss coin $1$ three times; otherwise, toss coin $2$ three times.

- The observed sequences **after** each toss with coin $0$ were $\langle \mathrm{HHH}\rangle$, $\langle \mathrm{HTH}\rangle$, $\langle \mathrm{HHT}\rangle$, and $\langle\mathrm{HTT}\rangle$

- **Task**. Use EM to estimate the most likely values for $\lambda$, $\rho$ and $\theta$
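
One possible numerical approach to this task is sketched below. It assumes that the outcome of coin $0$ is hidden and only the three follow-up tosses are observed. The function names and the initial values are hypothetical choices for illustration. The E-step computes, for each observed triple, the posterior probability that coin $1$ was used; the M-step re-estimates the parameters from these soft assignments.

```python
import math

# Number of heads in each observed triple: HHH, HTH, HHT, HTT.
heads = [3, 2, 2, 1]
TOSSES = 3

def seq_prob(h, p):
    """Probability of one specific sequence with h heads in TOSSES tosses."""
    return p ** h * (1 - p) ** (TOSSES - h)

def log_likelihood(lam, rho, theta):
    """Log-likelihood of all observed triples, with coin 0 marginalized out."""
    return sum(
        math.log(lam * seq_prob(h, rho) + (1 - lam) * seq_prob(h, theta))
        for h in heads
    )

def em_step(lam, rho, theta):
    # E-step: gamma[n] = p(coin 0 showed heads | triple n, current parameters).
    gamma = []
    for h in heads:
        a = lam * seq_prob(h, rho)            # coin 1 was used
        b = (1 - lam) * seq_prob(h, theta)    # coin 2 was used
        gamma.append(a / (a + b))
    # M-step: weighted relative frequencies.
    n1 = sum(gamma)              # expected number of triples from coin 1
    n2 = len(heads) - n1         # expected number of triples from coin 2
    new_lam = n1 / len(heads)
    new_rho = sum(g * h for g, h in zip(gamma, heads)) / (TOSSES * n1)
    new_theta = sum((1 - g) * h for g, h in zip(gamma, heads)) / (TOSSES * n2)
    return new_lam, new_rho, new_theta

# Arbitrary initial guess (assumption); rho and theta must differ,
# otherwise both coins are indistinguishable and EM cannot separate them.
lam, rho, theta = 0.3, 0.3, 0.6

history = [log_likelihood(lam, rho, theta)]
for _ in range(50):
    lam, rho, theta = em_step(lam, rho, theta)
    history.append(log_likelihood(lam, rho, theta))

print(f"lambda = {lam:.4f}, rho = {rho:.4f}, theta = {theta:.4f}")
print(f"log-likelihood: {history[0]:.4f} -> {history[-1]:.4f}")
```

With only four observed triples, the result depends strongly on the initial guess, so it is worth rerunning the loop from several starting points.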





###  Some Properties of EM

-  EM is a general procedure for learning in the presence of unobserved variables.
-  In a sense, it is a **family of algorithms**. The update rules you will derive depend on the probability model assumed.
-  (Good!) **No tuning parameters** such as a learning rate, unlike gradient descent-type algorithms
-  (Bad). EM is an iterative procedure that is very sensitive to initial conditions! EM is only guaranteed to converge to a **local optimum**.
-  Start from trash $\rightarrow$ end up with trash. Hence, we need a good and fast initialization procedure (often used: K-Means)
-  Also used to train HMMs, etc.
