# Mixture models and the EM algorithm

## Latent variable models

Models with hidden variables $z_i$ are also known as latent variable models (LVMs).

+ Disadvantage: LVMs are harder to fit than models with no latent variables.
+ Advantage: LVMs often have fewer parameters than models that directly represent correlation in the visible space. The hidden variables can also serve as a bottleneck that computes a compressed representation of the data.

![](../images/11.Reduce.png)

## Mixture models
The simplest form of LVM is when $z _ { i } \in \{ 1 , \ldots , K \}$ represents a discrete latent state. We then have:
+ Prior: $p \left( z _ { i } \right) = \mathrm { Cat } ( \pi )$
+ Likelihood: $p \left( \mathbf { x } _ { i } | z _ { i } = k \right) = p _ { k } \left( \mathbf { x } _ { i } \right)$
    where $p_k$ is the $k$'th base distribution for the observations
+ Mixture model: Mixing together $K$ base distributions as follows:
    $$p \left( \mathbf { x } _ { i } | \boldsymbol { \theta } \right) = \sum _ { k = 1 } ^ { K } \pi _ { k } p _ { k } \left( \mathbf { x } _ { i } | \boldsymbol { \theta } \right)$$
    
    + $\pi_k$: mixing weights that satisfy: $0 \leq \pi _ { k } \leq 1 \text { and } \sum _ { k = 1 } ^ { K } \pi _ { k } = 1$
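
The generative reading of the mixture model (first draw $z_i \sim \mathrm{Cat}(\pi)$, then draw $\mathbf{x}_i$ from the chosen base distribution) can be sketched in NumPy. This is only an illustration: the function name `sample_mixture` and the two 1-D Gaussian base distributions are assumptions made for the example, not part of the notes.

```python
import numpy as np

def sample_mixture(pi, samplers, n, rng):
    """Ancestral sampling: z_i ~ Cat(pi), then x_i ~ p_{z_i}."""
    pi = np.asarray(pi, dtype=float)
    z = rng.choice(len(pi), size=n, p=pi)         # latent component labels
    x = np.array([samplers[k](rng) for k in z])   # one draw from the chosen base distribution
    return x, z

rng = np.random.default_rng(0)
pi = [0.3, 0.7]
# Two hypothetical 1-D Gaussian base distributions (illustrative values).
samplers = [lambda r: r.normal(-2.0, 0.5), lambda r: r.normal(3.0, 1.0)]
x, z = sample_mixture(pi, samplers, n=5000, rng=rng)
```

With enough samples, the fraction of points with `z == k` should be close to `pi[k]`.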
    
### Mixtures of Gaussians
+ Also known as a MOG or Gaussian mixture model (GMM).
+ Each base distribution is a multivariate Gaussian with mean $\mu_k$ and covariance matrix $\Sigma_k$:
    $$p \left( \mathbf { x } _ { i } | \boldsymbol { \theta } \right) = \sum _ { k = 1 } ^ { K } \pi _ { k } \mathcal { N } \left( \mathbf { x } _ { i } | \boldsymbol { \mu } _ { k } , \mathbf { \Sigma } _ { k } \right)$$
+ Given a sufficiently large number of mixture components, a GMM can be used to approximate any density defined on $\mathbb{R}^D$.
![](../images/11.GMM.png)  
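
As a rough sketch of how the GMM density above might be evaluated in practice, the snippet below computes $\log p(\mathbf{x}_i | \boldsymbol{\theta})$ with NumPy, working in log space to avoid underflow. The helper names `log_gaussian` and `gmm_log_density` are assumptions chosen for this example.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Row-wise log N(x | mu, Sigma) for X of shape (N, D)."""
    D = X.shape[1]
    L = np.linalg.cholesky(Sigma)                 # Sigma = L L^T
    y = np.linalg.solve(L, (X - mu).T)            # shape (D, N)
    maha = np.sum(y ** 2, axis=0)                 # squared Mahalanobis distances
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (D * np.log(2 * np.pi) + log_det + maha)

def gmm_log_density(X, pi, mus, Sigmas):
    """log p(x_i) = logsumexp_k [log pi_k + log N(x_i | mu_k, Sigma_k)]."""
    log_terms = np.stack(
        [np.log(pi[k]) + log_gaussian(X, mus[k], Sigmas[k]) for k in range(len(pi))],
        axis=1,
    )                                             # shape (N, K)
    m = log_terms.max(axis=1, keepdims=True)      # subtract the max for stability
    return (m + np.log(np.exp(log_terms - m).sum(axis=1, keepdims=True))).ravel()

# Two identical standard-normal components: the mixture is again a standard normal.
X = np.zeros((1, 2))
lp = gmm_log_density(X, [0.5, 0.5], [np.zeros(2), np.zeros(2)], [np.eye(2), np.eye(2)])
```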
  
### Mixture of multinoullis:
+ When the data consist of $D$-dimensional bit vectors, an appropriate class-conditional density is a product of Bernoullis:

    $$p \left( \mathbf { x } _ { i } | z _ { i } = k , \boldsymbol { \theta } \right) = \prod _ { j = 1 } ^ { D } \operatorname { Ber } \left( x _ { i j } | \mu _ { j k } \right) = \prod _ { j = 1 } ^ { D } \mu _ { j k } ^ { x _ { i j } } \left( 1 - \mu _ { j k } \right) ^ { 1 - x _ { i j } }$$

    where $\mu_{jk}$ is the probability that bit $j$ turns on in cluster $k$
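
The product of Bernoullis turns into a sum in log space, so the log-likelihood of every point under every cluster reduces to two matrix products. The sketch below assumes `X` is an `(N, D)` binary array and `mu` is a `(D, K)` array holding $\mu_{jk}$; the function name and the clipping constant `eps` are illustrative choices.

```python
import numpy as np

def bernoulli_log_lik(X, mu, eps=1e-12):
    """Return an (N, K) matrix of log p(x_i | z_i = k) for binary X (N, D) and mu (D, K)."""
    mu = np.clip(mu, eps, 1 - eps)   # avoid log(0) for saturated probabilities
    return X @ np.log(mu) + (1 - X) @ np.log(1 - mu)

X = np.array([[1, 0]])
mu = np.array([[0.9, 0.2],
               [0.1, 0.5]])
log_lik = bernoulli_log_lik(X, mu)
```

For the single point `[1, 0]`, cluster 0 gives $0.9 \times (1 - 0.1) = 0.81$ and cluster 1 gives $0.2 \times (1 - 0.5) = 0.1$, which is what `np.exp(log_lik)` should reproduce.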
    
### Applications of mixture models:
+ As a black-box density model $p(x_i)$, which is useful for data compression, outlier detection, and generative classifiers in which each class-conditional density $p(x|y=c)$ is modeled by a mixture distribution.

+ use for clustering:
    * $p \left( z _ { i } = k | \mathbf { x } _ { i } , \boldsymbol { \theta } \right)$: posterior probability that point $i$ belongs to cluster $k$.
    $$r _ { i k } \triangleq p \left( z _ { i } = k | \mathbf { x } _ { i } , \boldsymbol { \theta } \right) = \frac { p \left( z _ { i } = k | \boldsymbol { \theta } \right) p \left( \mathbf { x } _ { i } | z _ { i } = k , \boldsymbol { \theta } \right) } { \sum _ { k ^ { \prime } = 1 } ^ { K } p \left( z _ { i } = k ^ { \prime } | \boldsymbol { \theta } \right) p \left( \mathbf { x } _ { i } | z _ { i } = k ^ { \prime } , \boldsymbol { \theta } \right) }$$
    * This is called **soft clustering**. The computation is identical to the one performed by a generative classifier. The difference between the two models only arises at training time: in the mixture case we never observe $z_i$, whereas with a generative classifier we do observe $y_i$ (which plays the role of $z_i$).
    * **Hard clustering** uses the MAP estimate of the cluster assignment:
    $$z _ { i } ^ { * } = \arg \max _ { k } r _ { i k } = \arg \max _ { k } \log p \left( \mathbf { x } _ { i } | z _ { i } = k , \boldsymbol { \theta } \right) + \log p \left( z _ { i } = k | \boldsymbol { \theta } \right)$$
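
A minimal sketch of both clustering modes, assuming we already have an `(N, K)` matrix of log-likelihoods $\log p(\mathbf{x}_i | z_i = k, \boldsymbol{\theta})$ and the mixing weights. The function name `responsibilities` and the numbers are made up for illustration.

```python
import numpy as np

def responsibilities(log_lik, pi):
    """Soft assignments r_ik from an (N, K) log-likelihood matrix and weights pi (K,)."""
    log_joint = log_lik + np.log(pi)                       # log pi_k + log p(x_i | z_i = k)
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)  # stabilize before exp
    r = np.exp(log_joint)
    return r / r.sum(axis=1, keepdims=True)                # normalize over clusters

log_lik = np.log(np.array([[0.81, 0.10],
                           [0.02, 0.30]]))
pi = np.array([0.5, 0.5])
r = responsibilities(log_lik, pi)   # soft clustering: each row sums to 1
z_star = r.argmax(axis=1)           # hard clustering: most probable cluster per point
```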
    
### Mixture of experts:
In some cases, a good model consists of several different linear regression functions (three, say), each applying to a different part of the input space. We can model this by allowing the mixing weights and the mixture densities to be input-dependent:

$$\begin{aligned} p \left( y _ { i } | \mathbf { x } _ { i } , z _ { i } = k , \boldsymbol { \theta } \right) & = \mathcal { N } \left( y _ { i } | \mathbf { w } _ { k } ^ { T } \mathbf { x } _ { i } , \sigma _ { k } ^ { 2 } \right) \\ p \left( z _ { i } | \mathbf { x } _ { i } , \boldsymbol { \theta } \right) & = \operatorname { Cat } \left( z _ { i } | \mathcal { S } \left( \mathbf { V } ^ { T } \mathbf { x } _ { i } \right) \right) \end{aligned}$$

where $p \left( z _ { i } = k | \mathbf { x } _ { i } , \boldsymbol { \theta } \right)$ is the gating function, which decides which expert to use depending on the input values, $p \left( y _ { i } | \mathbf { x } _ { i } , z _ { i } = k , \boldsymbol { \theta } \right)$ is the $k$'th expert, and $\mathcal { S }$ is the softmax function.

Overall prediction model:
$$p \left( y _ { i } | \mathbf { x } _ { i } , \boldsymbol { \theta } \right) = \sum _ { k } p \left( z _ { i } = k | \mathbf { x } _ { i } , \boldsymbol { \theta } \right) p \left( y _ { i } | \mathbf { x } _ { i } , z _ { i } = k , \boldsymbol { \theta } \right)$$

+ We can use neural networks to represent both the gating functions and the experts; the result is called a mixture density network.
+ We can also make each expert itself a mixture of experts, which gives a hierarchical mixture of experts.
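
To make the overall prediction model concrete, here is a small sketch of the predictive mean of a mixture of linear experts with a softmax gate. Because the mean of a mixture is the weighted sum of the component means, $\mathbb{E}[y_i | \mathbf{x}_i] = \sum_k p(z_i = k | \mathbf{x}_i) \, \mathbf{w}_k^T \mathbf{x}_i$. The array layout (`W` and `V` both of shape `(D, K)`) and the function names are assumptions for this example.

```python
import numpy as np

def softmax(a):
    """Row-wise softmax with the usual max-subtraction for stability."""
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def moe_predict_mean(X, W, V):
    """E[y | x] for X (N, D), expert weights W (D, K), gating weights V (D, K)."""
    gate = softmax(X @ V)       # (N, K): p(z = k | x) = S(V^T x)_k
    expert_means = X @ W        # (N, K): w_k^T x for each expert
    return np.sum(gate * expert_means, axis=1)

X = np.array([[1.0, 2.0]])
W = np.eye(2)                   # expert 0 predicts x_1, expert 1 predicts x_2
V = np.zeros((2, 2))            # zero gating weights give a uniform gate
y_mean = moe_predict_mean(X, W, V)
```

With a uniform gate the prediction is simply the average of the two expert means, here $(1 + 2) / 2$.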

![](../images/11.Moe.png)

## The EM algorithm
One option is to use a generic gradient-based optimizer to find a local minimum of the negative log likelihood (NLL):

$$\mathrm { NLL } ( \boldsymbol { \theta } ) \triangleq - \frac { 1 } { N } \log p ( \mathcal { D } | \boldsymbol { \theta } )$$

Alternatively, we can use expectation maximization (EM): an iterative algorithm, often with closed-form updates at each step, that alternates between inferring the missing values given the parameters (E step) and optimizing the parameters given the filled-in data (M step).

Let $x_i$ be the visible or observed variables and $z_i$ the hidden or missing variables. The goal is to maximize the log likelihood of the observed data:

$$\ell ( \boldsymbol { \theta } ) = \sum _ { i = 1 } ^ { N } \log p \left( \mathbf { x } _ { i } | \boldsymbol { \theta } \right) = \sum _ { i = 1 } ^ { N } \log \left[ \sum _ { \mathbf { z } _ { i } } p \left( \mathbf { x } _ { i } , \mathbf { z } _ { i } | \boldsymbol { \theta } \right) \right]$$

This is hard to optimize because the log cannot be pushed inside the sum. EM gets around this problem by working with two related quantities:

+ Complete data log likelihood: $$\ell _ { c } ( \boldsymbol { \theta } ) \triangleq \sum _ { i = 1 } ^ { N } \log p \left( \mathbf { x } _ { i } , \mathbf { z } _ { i } | \boldsymbol { \theta } \right)$$

+ Expected complete data log likelihood:
$$Q \left( \boldsymbol { \theta } , \boldsymbol { \theta } ^ { t - 1 } \right) = \mathbb { E } \left[ \ell _ { c } ( \boldsymbol { \theta } ) | \mathcal { D } , \boldsymbol { \theta } ^ { t - 1 } \right]$$

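where $Q$ is called the auxiliary function. The expectation is taken over the hidden variables, given the observed data $\mathcal { D }$ and the old parameters $\boldsymbol { \theta } ^ { t - 1 }$.

**E step**: compute $Q \left( \boldsymbol { \theta } , \boldsymbol { \theta } ^ { t - 1 } \right)$, or rather the terms inside it on which the MLE depends; these are known as the expected sufficient statistics (ESS).

**M step**: optimize the $Q$ function with respect to $\boldsymbol { \theta }$: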
$$\boldsymbol { \theta } ^ { t } = \arg \max _ { \boldsymbol { \theta } } Q \left( \boldsymbol { \theta } , \boldsymbol { \theta } ^ { t - 1 } \right)$$

To perform MAP estimation, we modify the M step as follows:
$$\boldsymbol { \theta } ^ { t } = \underset { \boldsymbol { \theta } } { \operatorname { argmax } } Q \left( \boldsymbol { \theta } , \boldsymbol { \theta } ^ { t - 1 } \right) + \log p ( \boldsymbol { \theta } )$$

### EM for GMMs:
+ Auxiliary function:
$$Q \left( \boldsymbol { \theta } , \boldsymbol { \theta } ^ { ( t - 1 ) } \right) \triangleq \mathbb { E } \left[ \sum _ { i } \log p \left( \mathbf { x } _ { i } , z _ { i } | \boldsymbol { \theta } \right) \right] = \sum _ { i } \sum _ { k } r _ { i k } \log \pi _ { k } + \sum _ { i } \sum _ { k } r _ { i k } \log p \left( \mathbf { x } _ { i } | \boldsymbol { \theta } _ { k } \right)$$

    where $r _ { i k } \triangleq p \left( z _ { i } = k | \mathbf { x } _ { i } , \boldsymbol { \theta } ^ { ( t - 1 ) } \right)$ is the responsibility that cluster $k$ takes for data point $i$.

+ E step: (same for any mixture model)
$$r _ { i k } = \frac { \pi _ { k } p \left( \mathbf { x } _ { i } | \boldsymbol { \theta } _ { k } ^ { ( t - 1 ) } \right) } { \sum _ { k ^ { \prime } } \pi _ { k ^ { \prime } } p \left( \mathbf { x } _ { i } | \boldsymbol { \theta } _ { k ^ { \prime } } ^ { ( t - 1 ) } \right) }$$

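+ M step: optimize $Q$ with respect to $\pi$ and $\theta_k$, where $r _ { k } \triangleq \sum _ { i } r _ { i k }$ is the weighted number of points assigned to cluster $k$: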
    $$\pi _ { k } = \frac { 1 } { N } \sum _ { i } r _ { i k } = \frac { r _ { k } } { N }$$
    
    $$\begin{aligned} \boldsymbol { \mu } _ { k } & = \frac { \sum _ { i } r _ { i k } \mathbf { x } _ { i } } { r _ { k } } \\ \mathbf { \Sigma } _ { k } & = \frac { \sum _ { i } r _ { i k } \left( \mathbf { x } _ { i } - \boldsymbol { \mu } _ { k } \right) \left( \mathbf { x } _ { i } - \boldsymbol { \mu } _ { k } \right) ^ { T } } { r _ { k } } = \frac { \sum _ { i } r _ { i k } \mathbf { x } _ { i } \mathbf { x } _ { i } ^ { T } } { r _ { k } } - \boldsymbol { \mu } _ { k } \boldsymbol { \mu } _ { k } ^ { T } \end{aligned}$$

![](../images/11.EM.png)
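
The E and M steps above can be put together into a short NumPy loop. This is a sketch under stated assumptions rather than a reference implementation: the means are initialized at randomly chosen data points, every covariance starts at the overall data covariance, and a small ridge `reg` is added to each $\boldsymbol{\Sigma}_k$ to keep it invertible. None of these choices come from the notes.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Row-wise log N(x | mu, Sigma) for X of shape (N, D)."""
    D = X.shape[1]
    L = np.linalg.cholesky(Sigma)
    y = np.linalg.solve(L, (X - mu).T)
    return -0.5 * (D * np.log(2 * np.pi) + 2 * np.log(np.diag(L)).sum() + (y ** 2).sum(axis=0))

def em_gmm(X, K, n_iter=50, reg=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].copy()      # assumed initialization
    Sigma = np.stack([np.cov(X.T).reshape(D, D) + reg * np.eye(D)] * K)
    log_liks = []
    for _ in range(n_iter):
        # E step: r_ik is proportional to pi_k * N(x_i | mu_k, Sigma_k)
        log_joint = np.stack(
            [np.log(pi[k]) + log_gaussian(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        m = log_joint.max(axis=1, keepdims=True)
        log_px = m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))
        r = np.exp(log_joint - log_px)                       # (N, K), rows sum to 1
        log_liks.append(log_px.sum())                        # observed-data log likelihood
        # M step: closed-form updates for pi_k, mu_k, Sigma_k
        r_k = r.sum(axis=0)
        pi = r_k / N
        mu = (r.T @ X) / r_k[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (r[:, k, None] * Xc).T @ Xc / r_k[k] + reg * np.eye(D)
    return pi, mu, Sigma, np.array(log_liks)

# Synthetic data: two 2-D Gaussian blobs (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 1, size=(200, 2)), rng.normal(3, 1, size=(200, 2))])
pi, mu, Sigma, ll = em_gmm(X, K=2)
```

Tracking `ll` across iterations is a cheap sanity check, since the observed-data log likelihood should not decrease from one EM iteration to the next (up to the tiny effect of `reg`).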

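### K-means algorithm (hard EM):
K-means is a variant of EM for GMMs that makes a hard assignment of points to clusters. It assumes fixed $\boldsymbol { \Sigma } _ { k } = \sigma ^ { 2 } \mathbf { I }$ and $\pi _ { k } = 1 / K$, so only the cluster centers $\boldsymbol { \mu } _ { k }$ have to be estimated.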
![](../images/11.K_means.png)
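
A compact sketch of K-means as hard EM: the E step assigns each point to its nearest center and the M step recomputes each center as the mean of its points. Initializing the centers at random data points and keeping the old center when a cluster becomes empty are assumptions of this example.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()       # assumed initialization
    for _ in range(n_iter):
        # Hard E step: assign every point to its closest center.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        z = d2.argmin(axis=1)
        # M step: each center becomes the mean of the points assigned to it.
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                                # centers stopped moving
            break
        mu = new_mu
    return mu, z

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers, z = kmeans(X, K=2)
```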

### MAP estimation for EM GMM:
+ New auxiliary function: the expected complete data log-likelihood plus the log prior:

$$Q ^ { \prime } \left( \boldsymbol { \theta } , \boldsymbol { \theta } ^ { \text { old } } \right) = \left[ \sum _ { i } \sum _ { k } r _ { i k } \log \pi _ { k } + \sum _ { i } \sum _ { k } r _ { i k } \log p \left( \mathbf { x } _ { i } | \boldsymbol { \theta } _ { k } \right) \right] + \log p ( \boldsymbol { \pi } ) + \sum _ { k } \log p \left( \boldsymbol { \theta } _ { k } \right)$$

+ Priors:
    + Mixture weights: $\boldsymbol { \pi } \sim \operatorname { Dir } ( \boldsymbol { \alpha } )$
    + Component parameters: $p \left( \boldsymbol { \mu } _ { k } , \boldsymbol { \Sigma } _ { k } \right) = \mathrm { NIW } \left( \boldsymbol { \mu } _ { k } , \boldsymbol { \Sigma } _ { k } | \mathbf { m } _ { 0 } , \kappa _ { 0 } , \nu _ { 0 } , \mathbf { S } _ { 0 } \right)$,
    where NIW is the normal-inverse-Wishart distribution
   
+ MAP for:
    + Mixture weights: $$\pi _ { k } = \frac { r _ { k } + \alpha _ { k } - 1 } { N + \sum _ { k } \alpha _ { k } - K }$$
    + Other parameters: 
    $$\begin{aligned} \hat { \boldsymbol { \mu } } _ { k } & = \frac { r _ { k } \overline { \mathbf { x } } _ { k } + \kappa _ { 0 } \mathbf { m } _ { 0 } } { r _ { k } + \kappa _ { 0 } } \\ \overline { \mathbf { x } } _ { k } & \triangleq \frac { \sum _ { i } r _ { i k } \mathbf { x } _ { i } } { r _ { k } } \\ \hat { \boldsymbol { \Sigma } } _ { k } & = \frac { \mathbf { S } _ { 0 } + \mathbf { S } _ { k } + \frac { \kappa _ { 0 } r _ { k } } { \kappa _ { 0 } + r _ { k } } \left( \overline { \mathbf { x } } _ { k } - \mathbf { m } _ { 0 } \right) \left( \overline { \mathbf { x } } _ { k } - \mathbf { m } _ { 0 } \right) ^ { T } } { \nu _ { 0 } + r _ { k } + D + 2 } \\ \mathbf { S } _ { k } & \triangleq  \sum _ { i } r _ { i k } \left( \mathbf { x } _ { i } - \overline { \mathbf { x } } _ { k } \right) \left( \mathbf { x } _ { i } - \overline { \mathbf { x } } _ { k } \right) ^ { T } \end{aligned}$$
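
The MAP updates translate almost line by line into NumPy. The sketch below assumes the responsibilities `r` (shape `(N, K)`) have already been computed in the E step and that every cluster has nonzero total responsibility; the function name `map_m_step` and the hyperparameter values in the usage lines are illustrative assumptions.

```python
import numpy as np

def map_m_step(X, r, alpha, m0, kappa0, nu0, S0):
    """MAP M step for a GMM with Dir(alpha) and NIW(m0, kappa0, nu0, S0) priors."""
    N, D = X.shape
    K = r.shape[1]
    alpha = np.asarray(alpha, dtype=float)
    r_k = r.sum(axis=0)                                    # weighted counts per cluster
    pi = (r_k + alpha - 1) / (N + alpha.sum() - K)
    xbar = (r.T @ X) / r_k[:, None]                        # weighted cluster means
    mu = (r_k[:, None] * xbar + kappa0 * m0) / (r_k[:, None] + kappa0)
    Sigma = np.empty((K, D, D))
    for k in range(K):
        Xc = X - xbar[k]
        S_k = (r[:, k, None] * Xc).T @ Xc                  # weighted scatter matrix
        d = (xbar[k] - m0)[:, None]
        shrink = kappa0 * r_k[k] / (kappa0 + r_k[k])
        Sigma[k] = (S0 + S_k + shrink * (d @ d.T)) / (nu0 + r_k[k] + D + 2)
    return pi, mu, Sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
r = rng.dirichlet(np.ones(3), size=100)   # hypothetical responsibilities, rows sum to 1
pi, mu, Sigma = map_m_step(X, r, alpha=np.full(3, 2.0), m0=np.zeros(2),
                           kappa0=0.01, nu0=4.0, S0=np.eye(2))
```

Because $\mathbf{S}_0$ is added to the numerator, $\hat{\boldsymbol{\Sigma}}_k$ stays positive definite even when a cluster has very little data, which is the practical motivation for the MAP variant.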
    
### Online EM:
+ Let 
    * $\phi ( \mathbf { x } , \mathbf { z } )$ be a vector of sufficient statistics for a single data case
    * $\mathbf { s } _ { i } = \sum _ { \mathbf { z } } p ( \mathbf { z } | \mathbf { x } _ { i } , \boldsymbol { \theta } ) \phi \left( \mathbf { x } _ { i } , \mathbf { z } \right)$ be the expected sufficient statistics for case $i$
    * $\boldsymbol { \mu } = \sum _ { i = 1 } ^ { N } \mathbf { s } _ { i }$ be the sum of the ESS
    
+ Batch EM: 
![](../images/11.BatchEM.png)

+ Incremental EM:
Keep track of $\mu$ and of each $s_i$. When a new $s_i^{new}$ is computed, swap the old $s_i$ out of $\mu$ and replace it with $s_i^{new}$:
![](../images/11.IncrementalEM.png)

+ Stepwise EM (best):
Each time a new $s_i$ is computed, move $\mu$ toward it using the step size $\eta _ { k } = ( 2 + k ) ^ { - \kappa }$, with $0.5 < \kappa \leq 1$:
![](../images/11.StepwiseEM.png)
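
As an illustration of the stepwise update, the sketch below runs it on a 1-D GMM. It assumes the interpolation $\mu \leftarrow (1 - \eta_k) \mu + \eta_k s_i$, treats $\mu$ as a running average of the per-example statistics $(r_{ik},\ r_{ik} x_i,\ r_{ik} x_i^2)$, and recovers the parameters from it before each E step. The function name, the unit-variance initialization, and the variance floor are assumptions made for this example.

```python
import numpy as np

def stepwise_em_gmm_1d(x, init_means, kappa=0.7, n_epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    init_means = np.asarray(init_means, dtype=float)
    K = len(init_means)
    # Running average of the ESS, one column per component: [r_k, r_k * x, r_k * x^2].
    # Initialized as if each component had weight 1/K and unit variance.
    mu = np.stack([np.full(K, 1.0 / K), init_means / K, (1.0 + init_means ** 2) / K])

    def params(mu):
        """theta(mu): recover (pi, mean, var) from the averaged statistics."""
        pi = mu[0] / mu[0].sum()
        mean = mu[1] / mu[0]
        var = np.maximum(mu[2] / mu[0] - mean ** 2, 1e-6)   # floor keeps var positive
        return pi, mean, var

    k = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(x)):
            pi, mean, var = params(mu)
            # E step for a single example.
            log_r = np.log(pi) - 0.5 * np.log(2 * np.pi * var) - 0.5 * (x[i] - mean) ** 2 / var
            r = np.exp(log_r - log_r.max())
            r /= r.sum()
            s_i = np.stack([r, r * x[i], r * x[i] ** 2])
            # Move mu toward s_i with step size eta_k = (2 + k)^(-kappa).
            eta = (2 + k) ** (-kappa)
            mu = (1 - eta) * mu + eta * s_i
            k += 1
    return params(mu)

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, size=300), rng.normal(3, 1, size=300)])
pi, mean, var = stepwise_em_gmm_1d(x, init_means=[-1.0, 1.0])
```

Unlike batch EM, the parameters change after every example, so a pass over the data performs $N$ small updates instead of one large one.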