# Clustering with Gaussian Mixture Models


###  Limitations of Simple IID Gaussian Models

So far, model inference could be solved analytically, but only because we made strong assumptions:
- IID sampling, $p(D) = \prod_n p(x_n)$
- Simple Gaussian (or multinomial) PDFs, $p(x_n) = \mathcal{N}(x_n|\mu,\Sigma)$

Some limitations of simple Gaussian models with IID sampling:
  1. What if the PDF is **multi-modal** (or is not Gaussian in some other way)?
  2. The covariance matrix $\Sigma$ has $D(D+1)/2$ parameters.
    - This quickly becomes **a very large number** for increasing dimension $D$.
  3. Temporal signals are often **not IID**.
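
To make the parameter-count argument concrete, here is a minimal sketch that evaluates $D(D+1)/2$ for a few dimensions. The helper name `gaussian_param_count` is purely illustrative and not part of any library.

```python
def gaussian_param_count(D: int) -> tuple[int, int]:
    """Return (mean parameters, free covariance parameters) of a D-dimensional Gaussian."""
    n_mean = D
    # A symmetric D x D matrix is fixed by its upper triangle, diagonal included.
    n_cov = D * (D + 1) // 2
    return n_mean, n_cov


for D in (2, 10, 100, 1000):
    n_mean, n_cov = gaussian_param_count(D)
    print(f"D={D:5d}  mean: {n_mean:5d}  covariance: {n_cov:7d}")
```

The covariance count grows quadratically in $D$ while the mean grows only linearly, which is why full-covariance Gaussians become hard to estimate in high dimensions.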



###  Towards More Flexible Models

-  What if the PDF is multi-modal (or is not Gaussian in some other way)?
  -   **Discrete latent** variable models (a.k.a. **mixture** models).
    
-  Covariance matrix $\Sigma$ has $D(D+1)/2$ parameters. This quickly becomes very large for increasing dimension $D$.
  -  **Continuous latent** variable models (a.k.a. **dimensionality reduction** models).
    
-  Temporal signals are often not IID.
  -  Introduce **Markov dependencies** and **latent state** variable models.
    





###  What if the Data are Not like This ...
\begin{center}\includegraphics[height=8cm]{./figures/fig-2-class-data}\end{center}


###  ... but like This
\begin{center}\includegraphics[height=8cm]{./figures/fig-unlabeled-data}\end{center}

###  Unobserved Classes

Consider again a set of observed data $D=\{x_1,\dotsc,x_N\}$

- This time we suspect that there are unobserved class labels that would help explain (or predict) the data, e.g.,
  - the observed data are the color of living things; the unobserved classes are animals and plants.
  - observed are wheel sizes; unobserved categories are trucks and personal cars.
  - observed is an audio signal; unobserved classes include speech, music, traffic noise, etc.
    
Classification problems with unobserved classes are called **Clustering** problems. The learning algorithm needs to **discover the classes from the observed data**.


###  Latent Variable Model Specification
 
If the categories were observed as well, these data could be nicely modeled by the previously discussed generative classification framework.

-  Introduce the 1-of-$K$ variable $z_n = (z_{n1},\ldots,z_{nK})^T$ to represent the unobserved class of $x_n$.
  - NB: our notation is $y_{nk}$ for observed class labels and $z_{nk}$ for unobserved (latent) class labels.
-  Use **model assumptions that are completely equivalent to those of linear generative classification** (except that the class
    labels $z_{nk}$ are now not observed),
    
\begin{align}
p(x_n) &= \sum_{k=1}^K p(z_{nk}=1) \, p(x_n|z_{nk}=1)  \\
	&= \sum_k \pi_k \mathcal{N}\left(x_n|\mu_k,\Sigma_k \right)
\end{align}

This model is called a **Gaussian Mixture Model**.
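
The generative reading of this model (first draw a class, then draw an observation from that class's Gaussian) can be sketched with NumPy as follows. The function name `sample_gmm` and all parameter values are assumptions chosen for illustration only.

```python
import numpy as np


def sample_gmm(pi, mu, Sigma, N, rng):
    """Ancestral sampling: z_n ~ Categorical(pi), then x_n ~ N(mu_k, Sigma_k)."""
    K, D = mu.shape
    z = rng.choice(K, size=N, p=pi)  # latent class index for every sample
    X = np.empty((N, D))
    for k in range(K):
        idx = np.flatnonzero(z == k)
        X[idx] = rng.multivariate_normal(mu[k], Sigma[k], size=idx.size)
    return X, z


# Illustrative two-component mixture in two dimensions (values are made up).
rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])
mu = np.array([[-2.0, 0.0], [3.0, 1.0]])
Sigma = np.stack([np.eye(2), 0.5 * np.eye(2)])
X, z = sample_gmm(pi, mu, Sigma, N=500, rng=rng)
```

In a clustering problem only `X` would be available; the indices `z` play the role of the unobserved class labels.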


###  Gaussian Mixture Models
GMMs are **universal approximators of densities** (provided that the mixture contains enough Gaussian components).

\begin{figure}
\begin{center}
\includegraphics[width=10.5cm]{./figures/fig-ZoubinG-GMM-universal-approximation}
\end{center}
The red curves show the (weighted) Gaussians; the blue curve the resulting density.
\end{figure}


###  Inference: Log-Likelihood for GMM
The log-likelihood for observed data $D=\{x_1,\dotsc,x_N\}$ is

\begin{align}
\log p(D|\theta) &\stackrel{\text{IID}}{=} \sum_n \log p(x_n|\theta)\\
  &= \sum_n \log \sum_{z_n} p(x_n,z_{n}|\theta)\\
  &= \sum_n \log \sum_{z_n} p(z_{n}|\theta) p(x_n|z_n,\theta) \\
  &= \sum_n \log \sum_k p(z_{nk}=1|\theta)p(x_n|z_{nk}=1,\theta) \\
  &= \boxed{\sum_n \log \sum_k \pi_k\mathcal{N}(x_n|\mu_k,\Sigma_k)}
\end{align}

... and now the log-of-sum cannot be further simplified.

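Compare this with the log-likelihood for classification, where the observed labels $y_{nk}$ select a single term per data point so that the logarithm acts directly on each Gaussian: $$\sum_k N_k \log \pi_k + \sum_{n,k} y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma)$$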

- Fortunately GMMs can be trained by maximum likelihood using an efficient algorithm: Expectation-Maximization.
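
Although the log-of-sum has no closed-form simplification, the boxed expression is easy to evaluate numerically. The sketch below is one possible NumPy implementation. The function names are illustrative, and the log-sum-exp trick (subtracting the row maximum before exponentiating) is an added numerical safeguard, not something prescribed by the derivation.

```python
import numpy as np


def log_gaussian(X, mu, Sigma):
    """Return log N(x_n | mu, Sigma) for every row x_n of X (shape (N, D))."""
    D = X.shape[1]
    L = np.linalg.cholesky(Sigma)            # Sigma = L L^T
    diff = np.linalg.solve(L, (X - mu).T)    # columns hold L^{-1} (x_n - mu)
    maha = np.sum(diff**2, axis=0)           # squared Mahalanobis distances
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (D * np.log(2 * np.pi) + log_det + maha)


def gmm_log_likelihood(X, pi, mu, Sigma):
    """Evaluate sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    # log_joint[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_joint = np.stack(
        [np.log(pi[k]) + log_gaussian(X, mu[k], Sigma[k]) for k in range(len(pi))],
        axis=1,
    )
    m = log_joint.max(axis=1, keepdims=True)  # log-sum-exp for stability
    log_px = m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))
    return log_px.sum()


# Illustrative call with made-up parameters.
X = np.array([[0.0, 0.0], [1.0, -1.0], [4.0, 4.0]])
pi = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [4.0, 4.0]])
Sigma = np.stack([np.eye(2), np.eye(2)])
print(gmm_log_likelihood(X, pi, mu, Sigma))
```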


###  Posterior Responsibility is a Soft Class Indicator

Consider the posterior expectation of the (hidden) class labels:

\begin{align}
\gamma_{nk} &\triangleq \mathrm{E}\left[z_{nk}|x_n,\theta\right] = 0\times p(z_{nk}=0|x_n,\theta) + 1\times p(z_{nk}=1|x_n,\theta) \\
  &= p(z_{nk}=1|x_n,\theta) = \frac{p(x_n|z_{nk}=1,\theta)\,p(z_{nk}=1|\theta)}{\sum_j p(x_n|z_{nj}=1,\theta)\,p(z_{nj}=1|\theta)} \\
  &= \frac{\pi_k \mathcal{N}(x_n|\mu_k,\Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n|\mu_j,\Sigma_j)}
\end{align}
            
-  Note that $0 \leq \gamma_{nk} \leq 1$, that $\sum_k \gamma_{nk} = 1$, and that $\gamma_{nk}$ is available (i.e., can be evaluated for given parameters $\theta$).
-  The $\gamma_{nk}$ are called (soft) **responsibilities**.
-  PLAN: Let's use the responsibilities $\gamma_{nk}$ (rather than the binary class indicators $y_{nk}$) and apply the classification formulas.
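
As a small worked example of the responsibility formula, the following sketch evaluates $\gamma_{nk}$ for one-dimensional data. The one-dimensional setting, the function names and the numerical values are assumptions made to keep the example short.

```python
import numpy as np


def normal_pdf(x, mu, var):
    """Univariate Gaussian density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)


def responsibilities(x, pi, mu, var):
    """gamma[n, k] = pi_k N(x_n | mu_k, var_k) / sum_j pi_j N(x_n | mu_j, var_j)."""
    weighted = pi * normal_pdf(x[:, None], mu, var)  # shape (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)


# Two equally weighted unit-variance components centred at -2 and +2.
x = np.array([-2.0, 0.0, 2.0])
gamma = responsibilities(
    x,
    pi=np.array([0.5, 0.5]),
    mu=np.array([-2.0, 2.0]),
    var=np.array([1.0, 1.0]),
)
print(gamma)
```

The point $x=0$ lies exactly between the two components and is shared equally, whereas the points at $\pm 2$ are assigned almost entirely to the nearest component. Each row of `gamma` sums to one.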



###  ML Estimation for Clustering

-  Try parameter updates (like conditional Gaussian classification):

$$
\hat \pi_k = \frac{N_k}{N};\; \hat \mu_k = \frac{1}{N_k} \sum_n \gamma_{nk} x_n; \;  \hat \Sigma_k  = \frac{1}{N_k} \sum_{n} \gamma_{nk} (x_n-\hat \mu_k)(x_n-\hat \mu_k)^T
$$
where $N_k = \sum_n \gamma_{nk}$.

-  But wait, the responsibilities $\gamma_{nk}=\frac{\pi_k \mathcal{N}(x_n|\mu_k,\Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n|\mu_j,\Sigma_j)}$ are a function of the model parameters $\{\pi,\mu,\Sigma\}$ and the parameter updates depend on the responsibilities ...
-  **Solution(?)**: iterate between updating the responsibilities $\gamma_{nk}$ and the model parameters $\{\pi,\mu,\Sigma\}$.

-  This iteration works (!) and is called the **Expectation-Maximization (EM)** algorithm.
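
The alternation between responsibilities and parameter updates can be sketched in NumPy as shown below. This is a minimal illustration under several assumptions that are not part of the text above: the function names, the initialization (random data points as means, the overall data covariance for every component), the fixed number of iterations, and the small ridge term `reg` added to each covariance to keep it invertible.

```python
import numpy as np


def log_gaussian(X, mu, Sigma):
    """Return log N(x_n | mu, Sigma) for every row of X."""
    D = X.shape[1]
    L = np.linalg.cholesky(Sigma)
    diff = np.linalg.solve(L, (X - mu).T)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (D * np.log(2 * np.pi) + log_det + np.sum(diff**2, axis=0))


def e_step(X, pi, mu, Sigma):
    """Update the responsibilities; also return the current log-likelihood."""
    log_joint = np.stack(
        [np.log(pi[k]) + log_gaussian(X, mu[k], Sigma[k]) for k in range(len(pi))],
        axis=1,
    )
    m = log_joint.max(axis=1, keepdims=True)
    log_px = m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))
    return np.exp(log_joint - log_px), log_px.sum()


def m_step(X, gamma, reg=1e-6):
    """Update pi, mu, Sigma with the responsibility-weighted formulas."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)
    pi = Nk / N
    mu = (gamma.T @ X) / Nk[:, None]
    Sigma = np.empty((len(Nk), D, D))
    for k in range(len(Nk)):
        diff = X - mu[k]
        # reg * I is an assumed safeguard against singular covariances.
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(D)
    return pi, mu, Sigma


def fit_gmm(X, K, n_iter=50, seed=0):
    """Iterate E- and M-steps for a fixed number of iterations."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.atleast_2d(np.cov(X.T)) + 1e-6 * np.eye(D)] * K)
    log_liks = []
    for _ in range(n_iter):
        gamma, ll = e_step(X, pi, mu, Sigma)
        log_liks.append(ll)
        pi, mu, Sigma = m_step(X, gamma)
    return pi, mu, Sigma, np.array(log_liks)


# Synthetic data from two made-up clusters.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal([-3.0, 0.0], 1.0, size=(150, 2)),
    rng.normal([3.0, 0.0], 1.0, size=(150, 2)),
])
pi, mu, Sigma, log_liks = fit_gmm(X, K=2)
```

Tracking `log_liks` is a convenient sanity check: the log-likelihood should not decrease from one iteration to the next. In practice one would typically stop when the increase falls below a tolerance rather than after a fixed number of iterations.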




### Clustering vs. Classification

<table>
<tr> <td></td><td>**Classification**</td> <td>**Clustering**</td> </tr> 

<tr> <td>1</td><td>Class label $y_n$ is observed</td> <td>Class label $z_n$ is latent</td> </tr>

<tr> <td>2</td><td>log-likelihood **conditions** on observed class<br />$= \sum_{nk} y_{nk} \log \left( \pi_k \mathcal{N}(x_n|\mu_k,\Sigma_k) \right)$</td> <td> log-likelihood **marginalizes** over latent classes<br />$= \sum_{n}\log \sum_k \pi_k \mathcal{N}(x_n|\mu_k,\Sigma_k)$</td> </tr>

<tr> <td>3</td><td>'Hard' class selector<br />$y_{nk} = \mathrm{logical}(y_n \in \mathcal{C}_k)$</td> <td>'Soft' class responsibility<br />$\gamma_{nk} = p(z_{nk}=1|x_n,\theta)$</td> </tr>

<tr> <td>4</td>
<td>Estimation:<BR /> 
\begin{align}
\hat{\pi}_k &= \frac{1}{N}\sum_n y_{nk} \\
\hat{\mu}_k &= \frac{\sum_n y_{nk} x_n}{\sum_n y_{nk}} \\
\hat{\Sigma}_k &= \frac{\sum_n y_{nk} (x_n-\hat\mu_k)(x_n-\hat\mu_k)^T}{\sum_n y_{nk}}
\end{align}
</td> 
<td>Estimation (1 update step!)<BR />
\begin{align}
\hat{\pi}_k &= \frac{1}{N}\sum_n \gamma_{nk} \\
\hat{\mu}_k &= \frac{\sum_n \gamma_{nk} x_n}{\sum_n \gamma_{nk}} \\
\hat{\Sigma}_k &= \frac{\sum_n \gamma_{nk} (x_n-\hat\mu_k)(x_n-\hat\mu_k)^T}{\sum_n \gamma_{nk}}
\end{align}
</td>
</tr>

</table>
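
Row 4 of the table suggests that the two sets of estimators share the same form and differ only in the weights. The sketch below checks this numerically: when the weights are one-hot class indicators, the weighted formulas reduce to the per-class fraction, sample mean and (biased) sample covariance. The function name `weighted_estimates` and the toy data are illustrative assumptions.

```python
import numpy as np


def weighted_estimates(X, W):
    """Apply the row-4 formulas with weights W (shape (N, K)): y_nk or gamma_nk."""
    Nk = W.sum(axis=0)
    pi = Nk / X.shape[0]
    mu = (W.T @ X) / Nk[:, None]
    Sigma = np.stack([
        (W[:, k, None] * (X - mu[k])).T @ (X - mu[k]) / Nk[k]
        for k in range(W.shape[1])
    ])
    return pi, mu, Sigma


rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
labels = np.array([0, 0, 1, 1, 1, 0])
Y = np.eye(2)[labels]  # hard (one-hot) class selectors y_nk

pi, mu, Sigma = weighted_estimates(X, Y)

# With hard labels, the estimates coincide with per-class sample statistics.
print(np.allclose(mu[0], X[labels == 0].mean(axis=0)))
print(np.allclose(Sigma[1], np.cov(X[labels == 1].T, bias=True)))
```

Passing a matrix of responsibilities $\gamma_{nk}$ instead of `Y` to the same function gives the clustering update of the right-hand column.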