# The General EM Algorithm

ignore this cell (pre-amble)
$\newcommand{\llh}{{\mathrm{L}}}$
$\newcommand{\FE}{{\mathrm{F}}}$
$\newcommand{\KL}{{\mathrm{D}}}$


###  Intuition Behind the EM Algorithm

-  Let's denote all observed data by the $N\times D$ matrix $\mathbf{X}=\left( x_1,\ldots,x_N \right)^T$ and the set of all latent variables by the $N\times K$ matrix $\mathbf{Z}=\left( z_1,\ldots,z_N \right)^T$.
-  The ML optimization problem with observed data $\mathbf{X}$ is,
$$\hat \theta = \arg\max_\theta \log p(\mathbf{X}|\theta)$$
but this appears to be a very difficult optimization problem.
-  **Plan**: Introduce (extra) latent variables $\mathbf{Z}$ such that the **complete-data log-likelihood** $\log p(\mathbf{X},\mathbf{Z}|\theta)$ is easily maximized.
-  Alas, $\mathbf{Z}$ are latent variables, i.e. NOT observed, and we cannot evaluate $p(\mathbf{X},\mathbf{Z}|\theta)$ as a function of $\theta$ only.
-  **idea 1**: optimize instead the **expected complete-data log-likelihood**
$$ \sum_\mathbf{Z} p(\mathbf{Z}|\mathbf{X},\theta) \,\log p(\mathbf{X},\mathbf{Z}|\theta)$$
which is no longer a function of (the unobserved) $\mathbf{Z}$.
-  Note that the posterior $p(\mathbf{Z}|\mathbf{X},\theta)$ depends on $\theta$.
-  **idea 2**: Use **iterative optimization** so that we can use the estimate $\hat \theta$ from the previous optimization step in order to compute the posteriors $p(\mathbf{Z}|\mathbf{X},\hat \theta)$.
-  These ideas lead to the iterative **EM Algorithm**:

\begin{align}
    Q(\theta,\hat{\theta}) &= \sum_\mathbf{Z} p(\mathbf{Z}|\mathbf{X},\hat{\theta})\,\log p(\mathbf{X},\mathbf{Z}|\theta) \tag{E-step}\\
    \hat \theta_{\text{(new)}} &= \arg\max_{\theta} Q(\theta,\hat{\theta})
\tag{M-step} \end{align}




-  In the next few slides, we will show that the EM algorithm is guaranteed to converge to a local maximum, or saddle point, of the **observed-data** likelihood function $p(\mathbf{X}|\theta)$.





### Concavity and Jensen's Inequality

- $f(x)$ is **concave$^\frown$** over $(a,b)$ if, for all $x_1,x_2 \in (a,b)$ and $0 \leq \lambda \leq 1$,
$$
f\left( \lambda x_1 + \left( 1-\lambda \right) x_2\right) \geq 
\lambda f\left( x_1\right) + \left(1-\lambda \right) f\left( x_2\right)
$$
- **Jensen's Inequality**: If $f$ is concave$^\frown$ and $x$ is a random variable, then
$$
f\left( {\mathrm{E} \left[ x \right]} \right)  \geq \mathrm{E} \left[ {f(x)} \right]
$$
- Example (for $x>0$): $$\log\left( \mathrm{E}[ x ] \right)  \geq \mathrm{E} \left[ {\log(x)} \right]$$


(Figure: MacKay, center-of-gravity illustration; source file `./figures/fig-MacKay-centre-of-gravity`)

- **Physical interpretation**: Put masses $p_i$ at locations $(x_i,f(x_i))$. Then the **center of gravity** at $(\mathrm{E}[x],\mathrm{E}[f(x)])$ lies under the curve, below the point $\left( \mathrm{E}[x],f(\mathrm{E}[x]) \right)$.
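
Jensen's inequality for $f=\log$ is easy to check numerically. The sketch below uses a small discrete random variable whose values `xs` and probabilities `ps` are made up for illustration; any positive values with probabilities summing to one should behave the same way.

```python
import math

# Hypothetical discrete random variable: values x_i with probabilities p_i.
xs = [0.5, 1.0, 2.0, 4.0]
ps = [0.1, 0.4, 0.3, 0.2]

mean_x = sum(p * x for p, x in zip(ps, xs))                # E[x]
mean_log_x = sum(p * math.log(x) for p, x in zip(ps, xs))  # E[log x]

log_of_mean = math.log(mean_x)  # f(E[x]): height of the curve at E[x]
gap = log_of_mean - mean_log_x  # nonnegative by Jensen's inequality

print(f"log(E[x]) = {log_of_mean:.4f}, E[log x] = {mean_log_x:.4f}, gap = {gap:.4f}")
```

In the center-of-gravity picture, `gap` is the vertical distance between the curve and the center of gravity.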



###  The EM 'Trick'

-  The log-likelihood based on data set ${\mathbf{X}}$ is
$$
\llh(\theta) = \log p(\mathbf{X}|\theta) = \log \sum_\mathbf{Z} p(\mathbf{X},\mathbf{Z}|\theta)
$$

-  For **any** distribution $q(\mathbf{Z})$, we can write
\begin{align}
\llh(\theta) &= \log \sum_\mathbf{Z} q(\mathbf{Z}) \frac{p(\mathbf{X},\mathbf{Z}|\theta)}{q(\mathbf{Z})} \\
  &\stackrel{Jensen}{\geq} \sum_\mathbf{Z} q(\mathbf{Z}) \log \frac{p(\mathbf{X},\mathbf{Z}|\theta)}{q(\mathbf{Z})} \\
  &\equiv \FE(\theta,q) \qquad \text{(`free energy')}
\end{align}


-  Furthermore, if we choose $q(\mathbf{Z})=p(\mathbf{Z}|\mathbf{X},\theta)$, then $\FE(\theta,q)=\llh(\theta)$ (is maximal), since

\begin{align}
\FE(\theta,q)\rvert_{ q=p(\mathbf{Z}|\mathbf{X},\theta) } 
  &= \sum_\mathbf{Z} p(\mathbf{Z}|\mathbf{X},\theta) \log  \frac{ p(\mathbf{X},\mathbf{Z}|\theta) }{ p(\mathbf{Z}|\mathbf{X},\theta) } \\
    &= \sum_\mathbf{Z} p(\mathbf{Z}|\mathbf{X},\theta) \log \frac{ p(\mathbf{Z}|\mathbf{X},\theta) p(\mathbf{X}|\theta) }{ p(\mathbf{Z}|\mathbf{X},\theta) } \\
    &= \sum_\mathbf{Z} p(\mathbf{Z}|\mathbf{X},\theta) \log p(\mathbf{X}|\theta)  \\
    &= \log p(\mathbf{X}|\theta) \sum_\mathbf{Z} p(\mathbf{Z}|\mathbf{X},\theta) \\
    &= \log p(\mathbf{X}|\theta) = \llh(\theta)
\end{align}

$\Rightarrow$ We have just shown that the following procedure (coordinate ascent on $\FE$) never decreases the log-likelihood:
\begin{align}
q^{\text{new}} &:= p(\mathbf{Z}|\mathbf{X},\theta^{\text{old}}) = \arg\max_q \FE(\theta^{\text{old}},q) \tag{E-step}\\
\theta^{\text{new}} &:= \arg\max_{\theta} \FE(\theta,q^{\text{new}}) \tag{M-step}
\end{align}
Indeed, $\llh(\theta^{\text{new}}) \geq \FE(\theta^{\text{new}},q^{\text{new}}) \geq \FE(\theta^{\text{old}},q^{\text{new}}) = \llh(\theta^{\text{old}})$.

- Usually, optimizing $\log p(x,z|\theta)$ with both $x$ and $z$ observed is straightforward (e.g., class-conditional Gaussian classification, linear regression).

- Note: we also use the notation $\langle \log p(x,z|\theta) \rangle_q=\mathrm{E}_q[\log p(x,z|\theta)]$
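
The bound $\FE(\theta,q)\leq\llh(\theta)$ and its tightness at the posterior can be verified on a toy model. This is a minimal sketch, assuming a single observation with a binary latent variable; the joint values in `joint` are hypothetical numbers for one fixed $x$ and $\theta$.

```python
import math

def free_energy(q, joint):
    """F(theta, q) = sum_z q(z) * log( p(x, z | theta) / q(z) )."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)

# Hypothetical joint values p(x, z | theta) for one fixed x and theta, z in {0, 1}.
joint = [0.12, 0.28]
log_lik = math.log(sum(joint))                 # L(theta) = log sum_z p(x, z | theta)
posterior = [pz / sum(joint) for pz in joint]  # p(z | x, theta)

# Any q gives a lower bound on the log-likelihood.
for q0 in (0.1, 0.5, 0.9):
    q = [q0, 1.0 - q0]
    print(f"q = {q}: F = {free_energy(q, joint):.4f} <= L = {log_lik:.4f}")

# Choosing q equal to the posterior makes the bound tight.
f_at_posterior = free_energy(posterior, joint)
print(f"q = posterior: F = {f_at_posterior:.4f}")
```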



###  EM involves optimizing a lower bound
(Figure: source file `./figures/Bishop-Figure914`)



###  EM for GMM revisited

- (total) log-likelihood $\llh(\theta) = \sum_n \log \sum_k \pi_k \mathcal{N}(x_n|\mu_k,\sigma_k)$

- **E-step**: compute responsibilities for latent classes
$$
\gamma_{nk} = p(z_{nk}=1|x_n,\hat{\theta}) = \frac{ \hat{\pi}_k \mathcal{N}(x_n|\hat{\theta}_k) }{ \sum_j \hat{\pi}_j \mathcal{N}(x_n|\hat{\theta}_j) }
$$

- **M-step**: Maximize the expected complete-data log-likelihood
\begin{align}
\mathrm{E}_\mathbf{Z}[ \log p(\mathbf{X},\mathbf{Z}|\mu,\sigma,\pi) ] &= \sum_n \sum_k \gamma_{nk} \log p(x_n,z_{nk}=1|\theta) \\
    &= \sum_{nk} \gamma_{nk} \log \mathcal{N}(x_n|\mu_k,\sigma_k) + \sum_{nk} \gamma_{nk} \log \pi_k
\end{align}

We've maximized this before; see the section on Density estimation (Gaussian, multinomial).
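
The two steps can be sketched for a one-dimensional, two-component mixture as follows. This is an illustrative sketch rather than a reference implementation: the synthetic data, the initial guess and the helper names (`normal_pdf`, `log_likelihood`, `em_step`) are assumptions, and the code parameterizes each Gaussian by its variance.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, pis, mus, variances):
    """Observed-data log-likelihood: sum_n log sum_k pi_k N(x_n | mu_k, var_k)."""
    return sum(
        math.log(sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)))
        for x in xs
    )

def em_step(xs, pis, mus, variances):
    """One EM iteration; returns updated (pis, mus, variances)."""
    K = len(pis)
    # E-step: responsibilities gamma[n][k] = p(z_nk = 1 | x_n, current parameters).
    gammas = []
    for x in xs:
        weighted = [pis[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
        total = sum(weighted)
        gammas.append([w / total for w in weighted])
    # M-step: closed-form maximizers of the expected complete-data log-likelihood.
    new_pis, new_mus, new_vars = [], [], []
    for k in range(K):
        n_k = sum(g[k] for g in gammas)  # effective number of points in class k
        mu_k = sum(g[k] * x for g, x in zip(gammas, xs)) / n_k
        var_k = sum(g[k] * (x - mu_k) ** 2 for g, x in zip(gammas, xs)) / n_k
        new_pis.append(n_k / len(xs))
        new_mus.append(mu_k)
        new_vars.append(var_k)
    return new_pis, new_mus, new_vars

# Synthetic data (an assumption for illustration): two well-separated clusters.
random.seed(0)
xs = [random.gauss(-2.0, 1.0) for _ in range(100)] + [random.gauss(3.0, 1.0) for _ in range(100)]

pis, mus, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]  # arbitrary initial guess
log_liks = [log_likelihood(xs, pis, mus, variances)]
for _ in range(50):
    pis, mus, variances = em_step(xs, pis, mus, variances)
    log_liks.append(log_likelihood(xs, pis, mus, variances))

print("pi:", pis, "mu:", mus, "var:", variances)
print("log-likelihood: first", log_liks[0], "last", log_liks[-1])
```

Recording the log-likelihood after every iteration is a cheap sanity check: the sequence in `log_liks` should never decrease.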


###  EM Example: Three Coins
-  You have three coins in your pocket
  -  Coin 0: $p(\mathrm{Head}) = \lambda$
  -  Coin 1: $p(\mathrm{Head}) = \rho$
  -  Coin 2: $p(\mathrm{Head}) = \theta$
    
-  (Scenario). Toss coin $0$. If Head comes up, toss three times with coin 1; otherwise, toss three times with coin 2.
-  The observed sequences **after** each toss with coin 0 were $\langle \mathrm{HHH}\rangle$, $\langle \mathrm{HTH}\rangle$, $\langle \mathrm{HHT}\rangle$, and $\langle\mathrm{HTT}\rangle$

-  [Q.] Estimate most likely values for $\lambda$, $\rho$ and $\theta$
-  [A.] homework. Use EM.
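
As a starting point for the homework, one possible EM iteration for this model is sketched below. Since the order of tosses within a sequence does not affect the likelihood, each sequence is summarized by its number of heads. The initial values and helper names are assumptions; note that a symmetric start with $\rho=\theta$ is a fixed point of the updates, so the sketch starts asymmetrically.

```python
import math

# Number of heads in each observed 3-toss sequence: HHH, HTH, HHT, HTT.
heads = [3, 2, 2, 1]
TOSSES = 3

def seq_prob(h, p):
    """Probability of one specific sequence with h heads when p(Head) = p."""
    return p ** h * (1.0 - p) ** (TOSSES - h)

def log_likelihood(lam, rho, theta):
    """Observed-data log-likelihood, with the coin-0 outcome summed out."""
    return sum(
        math.log(lam * seq_prob(h, rho) + (1.0 - lam) * seq_prob(h, theta))
        for h in heads
    )

def em_step(lam, rho, theta):
    # E-step: gamma_n = p(coin 0 showed Head | sequence n, current parameters).
    gammas = []
    for h in heads:
        from_coin1 = lam * seq_prob(h, rho)
        from_coin2 = (1.0 - lam) * seq_prob(h, theta)
        gammas.append(from_coin1 / (from_coin1 + from_coin2))
    # M-step: responsibility-weighted relative frequencies.
    n1 = sum(gammas)
    n2 = len(heads) - n1
    new_lam = n1 / len(heads)
    new_rho = sum(g * h for g, h in zip(gammas, heads)) / (TOSSES * n1)
    new_theta = sum((1.0 - g) * h for g, h in zip(gammas, heads)) / (TOSSES * n2)
    return new_lam, new_rho, new_theta

lam, rho, theta = 0.5, 0.6, 0.4  # arbitrary asymmetric initial guess
log_liks = [log_likelihood(lam, rho, theta)]
for _ in range(100):
    lam, rho, theta = em_step(lam, rho, theta)
    log_liks.append(log_likelihood(lam, rho, theta))

print("lambda:", lam, "rho:", rho, "theta:", theta)
```

Different initial guesses may end in different fixed points, so it is worth rerunning the loop from several starting values and comparing the final log-likelihoods.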




###  Some Properties of EM

-  EM is a general procedure for learning in the presence of unobserved variables.
-  In a sense, it is a **family of algorithms**. The update rules you will derive depend on the probability model assumed.
-  (Good!) **No tuning parameters** such as a learning rate, unlike gradient descent-type algorithms
-  (Bad) EM is an iterative procedure that is very sensitive to initial conditions! EM converges to a **local optimum**.
-  Start from trash $\rightarrow$ end up with trash. Hence, we need a good and fast initialization procedure (often used: K-Means)
-  Also used to train HMMs, etc.


###  (OPTIONAL) The Kullback-Leibler Divergence

-  The **Kullback-Leibler Divergence** or **relative entropy** between two distributions $q$ and $p$ is defined as
$$
\KL(q\|p) \equiv \sum_z q(z) \log \frac{q(z)}{p(z)}
$$

-  In general $\KL(q\|p) \neq \KL(p\|q)$

-  Theorem: **Gibbs Inequality**:
    
$$\boxed{ \KL(q\|p)\geq 0 }$$
with equality iff $p=q$.

-  Proof (use Jensen Inequality):
Define $f(u)=-\log u$ ($f$ is convex$^\smile$) and let $u = p(z)/q(z)$.
\begin{align}
\KL(q\|p) &= \mathrm{E}_q [f(u)] \stackrel{Jensen}{\geq}  f\left( \mathrm{E}_q [u] \right) \\
  &= f\left( \sum_z q(z) \frac{p(z)}{q(z)} \right) \\
     &= -\log \left(\sum_z p(z) \right) =0
\end{align}
with equality iff $u$ is constant, i.e., $q(z)=p(z)$ (since both distributions sum to one, the constant must be $1$).
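
A quick numerical illustration of the asymmetry and of Gibbs' inequality, using two made-up distributions `q` and `p` over three states (the values are assumptions chosen only for illustration):

```python
import math

def kl_divergence(q, p):
    """D(q || p) = sum_z q(z) log( q(z) / p(z) ); terms with q(z) = 0 contribute 0."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

q = [0.7, 0.2, 0.1]
p = [0.4, 0.4, 0.2]

d_qp = kl_divergence(q, p)
d_pq = kl_divergence(p, q)
d_qq = kl_divergence(q, q)

print(f"D(q||p) = {d_qp:.4f}, D(p||q) = {d_pq:.4f}, D(q||q) = {d_qq:.4f}")
```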


###  (OPTIONAL) The Free Energy Functional

-   For **any** distribution $q(\mathbf{Z})$, the following holds:

\begin{align}
\llh(\theta) &= \log p(\mathbf{X}|\theta) = \sum_\mathbf{Z} q(\mathbf{Z}) \log p(\mathbf{X}|\theta) \\
    &= \sum_\mathbf{Z} q(\mathbf{Z}) \left[ \log p(\mathbf{X}|\theta) -\log \frac{ q(\mathbf{Z}) }{ p(\mathbf{Z}|\mathbf{X},\theta) } \right] + \sum_\mathbf{Z} q(\mathbf{Z}) \log \frac{ q(\mathbf{Z}) }{ p(\mathbf{Z}|\mathbf{X},\theta) } \\
    &= \sum_\mathbf{Z} q(\mathbf{Z}) \log \frac{ p(\mathbf{X},\mathbf{Z}|\theta) }{ q(\mathbf{Z}) } + \KL\left(q(\mathbf{Z})\|p(\mathbf{Z}|\mathbf{X},\theta) \right) \\
    &\stackrel{Gibbs}{\geq} \sum_\mathbf{Z} q(\mathbf{Z}) \log \frac{ p(\mathbf{X},\mathbf{Z}|\theta) }{ q(\mathbf{Z}) } \\
    &= \sum_\mathbf{Z} q(\mathbf{Z}) \log p(\mathbf{X},\mathbf{Z}|\theta) - \sum_\mathbf{Z} q(\mathbf{Z}) \log q(\mathbf{Z}) \\
    &= \left\langle\log p(\mathbf{X},\mathbf{Z}|\theta)\right\rangle_q + \mathcal{H}(q)\\
    &\triangleq \FE(\theta,q) \tag{free energy}
\end{align}
with equality $\llh(\theta)=\FE(\theta,q)$ if and only if $q(\mathbf{Z})=p(\mathbf{Z}|\mathbf{X},\theta)$.
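
The decomposition of the log-likelihood into free energy plus KL divergence can be confirmed numerically. The sketch below assumes a single observation and a latent variable with three states; the joint values and the distribution `q` are hypothetical.

```python
import math

# Hypothetical joint values p(x, z | theta) for one fixed x and theta, z in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                       # p(x | theta)
log_lik = math.log(evidence)                # L(theta)
posterior = [pz / evidence for pz in joint] # p(z | x, theta)

q = [0.5, 0.3, 0.2]  # an arbitrary distribution over z

expected_log_joint = sum(qz * math.log(pz) for qz, pz in zip(q, joint))  # <log p(x,z|theta)>_q
entropy = -sum(qz * math.log(qz) for qz in q)                            # H(q)
free_energy = expected_log_joint + entropy                               # F(theta, q)
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))         # D(q || posterior)

print(f"L = {log_lik:.6f}, F + D = {free_energy + kl:.6f}, D = {kl:.6f}")
```

Because the divergence term is nonnegative, `free_energy` can only reach `log_lik` when `q` coincides with `posterior`.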
