#Gaussian Mixture Models And Expectation Maximization #

##After this Lesson you will: 

1. Be able to describe in your own words the difference between GMMs and k-means and why we might choose one over the other. 
2. Be able to describe Expectation Maximization and understand the basis of the method. 
3. Give examples of where GMMs might be applied.

##Expectation Maximization: An Introduction 

A significant class of models uses Expectation Maximization (commonly called EM) as the basis for optimization. It is more important to understand EM on a relatively intuitive level than to have an in-depth understanding of every model that uses it, since the algorithm itself is easily the most important part. 

###EM: Motivation 

EM is a method that extends beyond the scope of a simple algorithm. The principal idea is one underlying many other models: find the most likely set of parameters for a model, under the assumption that the observations in the training data are a result of *one or more latent variables.* The latent variables are those data that are **presumed to be missing.** How is it different from other optimization methods? 

EM optimizes the parameters of the log-likelihood function of the model while simultaneously estimating the effect of the latent variables. This is done iteratively, in two steps.
1. Expectation: The expectation value of the log-likelihood is calculated over the latent variables, given the training data and the current estimate of the parameters. The result is a function of a dummy parameter.
2. Maximization: Maximize the expected log-likelihood from the above step with respect to the dummy parameter. The maximizer becomes the new estimate of the parameters.

The two steps continue until a convergence condition is reached. The choice of convergence condition is less important than these two steps and normally depends on the implementation of the model.

###EM: General Algorithm 
1. Start with an initial guess for the parameter set, $\theta \leftarrow \hat{\theta}_0$ (almost always randomized).
2. At the jth step, compute the expectation value of the log-likelihood function as a function of a dummy parameter $\theta'$: $Q(\theta'|\hat{\theta}_j) = E(l_0(\theta';(X+X'))|X, \hat{\theta}_j)$. (So we are computing the expectation value of the log-likelihood of the real data plus the "latent" (i.e. missing) data, $(X+X')$, given the current parameter set and the real data.)
3. At the jth step, estimate a new parameter set $\hat{\theta}_{j+1}$ by maximizing $Q(\theta'|\hat{\theta}_j)$ over $\theta'$.
4. Iterate until convergence.
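
The four steps above can be sketched as a generic driver loop. This is a minimal sketch, not a definitive implementation: the names `run_em`, `e_step`, `m_step` and `log_likelihood` are illustrative, the stopping rule (change in log-likelihood below `tol`) is one assumed choice of convergence condition, and the toy model (two unit-variance Gaussians with known means 0 and 4 and an unknown mixing proportion) and its data are made up for the demonstration.

```python
import math

def run_em(theta0, e_step, m_step, log_likelihood, tol=1e-8, max_iter=1000):
    """Generic EM loop; the three callables are supplied by the specific model."""
    theta = theta0                            # step 1: initial guess
    previous = log_likelihood(theta)
    for iteration in range(1, max_iter + 1):
        expectations = e_step(theta)          # step 2: expectations under the current theta
        theta = m_step(expectations)          # step 3: parameters that maximize Q
        current = log_likelihood(theta)
        if abs(current - previous) < tol:     # step 4: stop once the likelihood stalls
            break
        previous = current
    return theta, current, iteration

# Toy model (hypothetical): only the mixing proportion alpha is unknown.
def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

data = [-0.3, 0.1, 0.4, -0.2, 3.8, 4.1, 4.5, 3.9]

def mixture_pdf(x, alpha):
    return (1 - alpha) * normal_pdf(x, 0.0, 1.0) + alpha * normal_pdf(x, 4.0, 1.0)

def e_step(alpha):
    # Posterior probability that each point came from the component at 4.
    return [alpha * normal_pdf(x, 4.0, 1.0) / mixture_pdf(x, alpha) for x in data]

def m_step(weights):
    # The maximizing alpha is the average posterior probability.
    return sum(weights) / len(weights)

def log_likelihood(alpha):
    return sum(math.log(mixture_pdf(x, alpha)) for x in data)

alpha_hat, ll, n_iter = run_em(0.2, e_step, m_step, log_likelihood)
print(alpha_hat, ll, n_iter)
```

Passing the E-step, M-step and likelihood in as functions keeps the loop itself model-independent, which mirrors the point that the two steps matter more than any one model built on them.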

Why do we introduce a dummy variable? The dummy variable provides a coordinate that tells us where we are in the optimization process. In EM implementations, the dummy variable (it's a dummy because its units and particular meaning are not important) ends up being the argument of $Q(\theta'|\hat{\theta}_j)$ that we maximize over. 

One way to think of the EM algorithm is as a maximization of the likelihood in the coordinate space of latent variables vs. model parameters. The red curve in the figure below corresponds to the log-likelihood of the observed data. This red curve is obtained by maximizing the log-likelihood of the observed data for each value of $\theta'$ (the dummy variable). (This process simultaneously minimizes the effect of the unobserved latent variables.) The black level curves are the contours of the combined effect of the latent and observed variables.

The EM algorithm proceeds in two steps: one that travels down level curves in the model parameter space (Maximization), and another that travels along latent variable space (Expectation).

![em_opt](images/EM_Optimization.png)

The magic of the algorithm is that we spend our time optimizing $Q(\theta'|\hat{\theta}_j)$ rather than trying to optimize $\log(P(X|\hat{\theta}_j))$ directly.
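
One standard way to see why this works, stated here without proof and in the notation of this section (with $Q$ taken as the expected log-likelihood of the real plus latent data from step 2), is the inequality

$$\log P(X|\theta') - \log P(X|\hat{\theta}_j) \geq Q(\theta'|\hat{\theta}_j) - Q(\hat{\theta}_j|\hat{\theta}_j)$$

Any $\theta'$ that raises $Q$ above its value at $\hat{\theta}_j$ therefore raises the observed-data log-likelihood by at least as much, which is why an EM iteration never decreases $\log P(X|\theta)$.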


##QUIZ:

What is an expectation value? What is the log-likelihood function? Describe in your own words what the EM algorithm does. 

#Gaussian Mixture Models 

###GMM: Intuition

Suppose you have a set of data points that you *believe to represent i.i.d. samples* drawn from a mixture of one or more unknown distributions in the same data set. What would the *most likely* group of distributions be that describes this data?

###**Hypothesis:** 

The variables can be most accurately described as a mixture of Gaussian distributions (of an unknown number). We identify the underlying distribution of each of the K components of the mixture as a Gaussian distribution in $d$ dimensions:

$$p_k(x_i|\theta_{k}) = \frac{1}{(2\pi)^{d/2}|\Sigma_{k}|^{1/2}}e^{-\frac{1}{2}(x_i-\mu_k)^{T}\Sigma_{k}^{-1}(x_i-\mu_k)} = \phi(\theta_k, x_{i})$$ 

Each component has its own set of parameters $$\theta_{k} = (\mu_k, \Sigma_{k})$$ 
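
As a rough illustration of the component density, the sketch below evaluates it with NumPy. The function name `gaussian_density` and the example parameters are illustrative choices, not part of the lesson. It solves a linear system rather than forming $\Sigma_k^{-1}$ explicitly and works with the log-determinant, which is a common choice for numerical stability.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Density of N(mu, sigma) at each row of x; x has shape (N, d) or (d,)."""
    x = np.atleast_2d(x)
    d = mu.shape[0]
    diff = x - mu
    # Solve sigma @ y = diff^T instead of computing the explicit inverse.
    solved = np.linalg.solve(sigma, diff.T).T
    mahalanobis = np.sum(diff * solved, axis=1)       # (x - mu)^T sigma^{-1} (x - mu)
    _, logdet = np.linalg.slogdet(sigma)              # log |sigma|
    log_norm = -0.5 * (d * np.log(2 * np.pi) + logdet)
    return np.exp(log_norm - 0.5 * mahalanobis)

# Standard normal in two dimensions, evaluated at its mean.
mu = np.zeros(2)
sigma = np.eye(2)
peak = gaussian_density(np.zeros(2), mu, sigma)
print(peak)  # a length-1 array holding 1 / (2 * pi)
```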

###**Cost Function:** 

We compute the log-likelihood using an IID assumption:
$$\log(l(\theta)) = \sum_{i=1}^{N}\log(p(x_{i}|\theta)) = \sum_{i=1}^{N}\log\left(\sum_{k=1}^{K}\alpha_{k}p_{k}(x_{i}|z_{k},\theta_{k})\right)$$ 

Where the $p_{k}(x_{i}|z_{k},\theta_{k})$ are the Gaussian densities for the kth component and the $\alpha_{k}$ are the mixing weights, which are non-negative and sum to 1. The $z_i = (z_{i1}, z_{i2},..., z_{iK})$ are indicator variables that are exclusive and exhaustive. That is, for every $x_i$ exactly one of the K members of $z_i$ equals 1 and the rest are 0; $z_{ik} = 1$ flags the Gaussian component that generated $x_i$.
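
The log-likelihood above can be sketched in NumPy as follows. The function names and the synthetic data are illustrative assumptions. The inner sum over components is computed with the log-sum-exp trick, since multiplying many small densities directly can underflow. As a sanity check, a mixture of two identical components should give the same value as a single Gaussian.

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """Log density of N(mu, sigma) at each row of x, shape (N, d)."""
    d = mu.shape[0]
    diff = x - mu
    solved = np.linalg.solve(sigma, diff.T).T
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + np.sum(diff * solved, axis=1))

def mixture_log_likelihood(x, alphas, mus, sigmas):
    # log_terms[i, k] = log(alpha_k) + log p_k(x_i | theta_k)
    log_terms = np.column_stack([
        np.log(a) + log_gaussian(x, m, s) for a, m, s in zip(alphas, mus, sigmas)
    ])
    # log-sum-exp over k, then sum over the N data points
    top = log_terms.max(axis=1, keepdims=True)
    per_point = top[:, 0] + np.log(np.exp(log_terms - top).sum(axis=1))
    return per_point.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))                      # synthetic data, for illustration only
ll_two = mixture_log_likelihood(x, [0.3, 0.7], [np.zeros(2)] * 2, [np.eye(2)] * 2)
ll_one = log_gaussian(x, np.zeros(2), np.eye(2)).sum()
print(np.isclose(ll_two, ll_one))
```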

**Question:** How are the probabilities calculated in practice, do you think? We compute the "membership" or "responsibility" of each distribution using Bayes' Law. 

Recall: $$P(X|\theta) = \frac{P(\theta|X)P(X)}{P(\theta)}$$

Since we assume: $$P(X|\theta) = \sum_{k=1}^{K}\alpha_{k}p_{k}(x|z_{k}, \theta_{k})$$

Where $$\theta = (\theta_{1}, \theta_{2}, ... , \theta_{K})$$ 

We can write: $$w_{ik} = p(z_{ik}=1|x_{i}, \theta) = \frac{\alpha_{k} \cdot p_{k}(x_{i}|z_{k}, \theta_{k})}{\sum_{m=1}^{K}\alpha_{m}p_{m}(x_{i}|z_{m}, \theta_{m})}$$ 

for $k \in \{1, ..., K\}$ and $i \in \{1, ..., N\}$. These are the membership weights of each data point in each of the Gaussians.
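
A minimal sketch of this membership computation, assuming NumPy and illustrative names (`gaussian_density`, `responsibilities`) and parameters that are not from the lesson. Each row of the result holds the weights $w_{ik}$ of one data point across the K components.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Density of N(mu, sigma) at each row of x, shape (N, d)."""
    d = mu.shape[0]
    diff = x - mu
    maha = np.sum(diff * np.linalg.solve(sigma, diff.T).T, axis=1)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm

def responsibilities(x, alphas, mus, sigmas):
    """Return the N x K matrix w[i, k] = p(z_ik = 1 | x_i, theta)."""
    # Numerator of Bayes' law for every point and component.
    weighted = np.column_stack([
        a * gaussian_density(x, m, s) for a, m, s in zip(alphas, mus, sigmas)
    ])
    # Dividing by the row sums applies the denominator (the mixture density).
    return weighted / weighted.sum(axis=1, keepdims=True)

# Two well-separated components and three probe points.
x = np.array([[0.0, 0.0], [5.0, 5.0], [2.5, 2.5]])
mus = [np.zeros(2), np.full(2, 5.0)]
sigmas = [np.eye(2), np.eye(2)]
w = responsibilities(x, [0.5, 0.5], mus, sigmas)
print(w)
```

The first two points sit on a component mean and are claimed almost entirely by that component, while the third is equidistant from both and is split evenly. For points far from every component the raw densities can underflow, so a log-space version may be preferable in practice.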

###Optimization:

Initialize $\hat{\theta}$ to random values. The model is optimized in two steps: 

**The E-step:** Compute weights $w_{ik}$ for all data points and all mixture components. Here we get a membership matrix of $N \times K$ dimensions, where the rows sum to 1. 

**The M-step:** Use the set of weights to calculate the parameters for the Gaussian mixture. First compute the effective number of points assigned to each component: $N_k = \sum_{i=1}^{N}w_{ik}$

Now the mixing weights $\alpha$ are recalculated with the current memberships. 

$$\alpha_{k} = \frac{N_k}{N}$$ 

The means are calculated as a **weighted average of the data, using the membership weights:** $$\mu_{k} = \frac{1}{N_k}\sum_{i=1}^{N}w_{ik} \cdot x_{i}$$  

This is the most important part. **The parameters of the emerging Gaussian distribution are defined by the points that appear to belong to it.**  

Finally, we calculate the covariance matrix, $\Sigma_k$, using the weights and the new means: 

$$\Sigma_{k} = \frac{1}{N_k}\sum_{i=1}^{N}w_{ik} \cdot (x_i-\mu_{k})(x_{i}-\mu_{k})^{T}$$  This is a square $d \times d$ matrix. 
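
The M-step updates can be sketched in NumPy as below. The function name `m_step` and the tiny data set are illustrative. Note that this sketch divides the weighted sums by the effective count $N_k = \sum_i w_{ik}$, the usual normalization, so that each mean is a weighted average of the data and each covariance is a weighted average of outer products.

```python
import numpy as np

def m_step(x, w):
    """x: (N, d) data, w: (N, K) membership weights. Returns alphas, mus, sigmas."""
    n, d = x.shape
    k = w.shape[1]
    n_k = w.sum(axis=0)                        # effective points per component, shape (K,)
    alphas = n_k / n                           # mixing weights
    mus = (w.T @ x) / n_k[:, None]             # weighted means, shape (K, d)
    sigmas = np.empty((k, d, d))
    for j in range(k):
        diff = x - mus[j]
        # Weighted sum of outer products (x_i - mu_j)(x_i - mu_j)^T, divided by N_j.
        sigmas[j] = (w[:, j, None] * diff).T @ diff / n_k[j]
    return alphas, mus, sigmas

# With hard 0/1 weights the updates reduce to per-group means and covariances.
x = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
w = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
alphas, mus, sigmas = m_step(x, w)
print(alphas)   # [0.5 0.5]
print(mus)      # [[ 1.  0.] [11. 10.]]
```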

**Convergence:** We compute the log-likelihood for each E-M step until the change drops below some preset value (usually set by the author of the software). Once that happens, the algorithm exits.

##QUIZ: 
How do the E and M steps given here relate to the general E-M algorithm? More specifically, what are we maximizing in the M-step, and what other algorithm is it most like? 

###Reasoning:

Given relatively simple training data of relatively even density, we can produce a reasonable estimate of the actual underlying distributions under the assumption that the data are IID. The distributions can be determined iteratively, updating their parameters and recomputing the log-likelihood of the overall model at each step. 

Example 1: Suppose we have two simple 1-dimensional distributions (for example, tree heights by species). In this case, we can organize the data roughly into two classes. 

![em_example](images/EM_example_1.png)

**Question:** What model might we consider applying here right away, before considering other models?

Let's set up the two classes as Normal distributions: $$X_1 \sim N(\mu_{1}, \sigma_{1}^{2})$$ $$X_2 \sim N(\mu_{2}, \sigma_{2}^{2})$$ $$X = (1-\Delta) \cdot X_{1} + \Delta \cdot X_{2}$$ Where $\Delta \in \{0,1\}$ (is either 0 or 1) and the probability of it being 1 is fixed. 

$P(\Delta=1) = \alpha$  This is a way of classifying points into $X_{2}$. We count $\alpha$ as the proportion of the data belonging to class 2. To solve the problem, we compute the following:

1. Take initial guesses for the parameters, $\alpha, \sigma_{1}^{2}, \sigma_{2}^{2}, \mu_{1}, \mu_{2}$. We set $\alpha = 0.5$, both $\sigma^{2}$ to the overall sample variance, and the $\mu$ to random points.
2. E-step: Compute weights ("responsibilities") $$w_{i} = \frac{\alpha \cdot \phi(\theta_{2}, x_{i})}{(1-\alpha) \cdot \phi(\theta_1, x_{i}) + \alpha \cdot \phi(\theta_2, x_{i})}$$
3. M-step: Compute weighted means and variances: $$\mu_{1} = \frac{\sum_{i=1}^{N}(1-w_{i})x_{i}}{\sum_{i=1}^{N}(1-w_{i})}$$ $$\sigma_{1}^{2} = \frac{\sum_{i=1}^{N}(1-w_{i})(x_{i}-\mu_{1})^{2}}{\sum_{i=1}^{N}(1-w_{i})}$$ $$\mu_{2} = \frac{\sum_{i=1}^{N}w_{i}x_{i}}{\sum_{i=1}^{N}w_{i}}$$ $$\sigma_{2}^{2} = \frac{\sum_{i=1}^{N}w_{i}(x_{i}-\mu_{2})^{2}}{\sum_{i=1}^{N}w_{i}}$$
4. Compute mixing probability: $\alpha = \sum_{i=1}^{N}w_{i}/N$
5. Iterate until $\alpha$ stops changing. Solving by hand we get: 

| Iteration | alpha |   
|-----------|-------|
|     1     | 0.485 |
|     5     | 0.493 |
|    10     | 0.523 |
|    15     | 0.544 |
|    20     | 0.546 |

The final estimates of variables are: $\mu_{1} = 4.62$, $\mu_{2} = 1.06$, $\sigma_{1}^{2} = 0.87$, $\sigma_{2}^{2} = 0.77$, $\alpha = 0.546$ 
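
The five steps of Example 1 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a reproduction of the table: the data are synthetic (drawn from two Normals chosen only to resemble the figure), the function names are made up, and for reproducibility the means start at the smallest and largest observations rather than at random points.

```python
import numpy as np

def phi(x, mu, var):
    """Univariate Normal density."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def fit_two_gaussians(x, tol=1e-6, max_iter=500):
    # Step 1: initial guesses (means at the extremes are an assumption of this sketch).
    alpha = 0.5
    var1 = var2 = x.var()
    mu1, mu2 = x.min(), x.max()
    for _ in range(max_iter):
        # Step 2 (E-step): responsibility of component 2 for each point.
        p1 = (1 - alpha) * phi(x, mu1, var1)
        p2 = alpha * phi(x, mu2, var2)
        w = p2 / (p1 + p2)
        # Step 3 (M-step): weighted means and variances.
        mu1 = np.sum((1 - w) * x) / np.sum(1 - w)
        var1 = np.sum((1 - w) * (x - mu1) ** 2) / np.sum(1 - w)
        mu2 = np.sum(w * x) / np.sum(w)
        var2 = np.sum(w * (x - mu2) ** 2) / np.sum(w)
        # Step 4: mixing probability.
        new_alpha = w.mean()
        # Step 5: stop once alpha stops changing.
        converged = abs(new_alpha - alpha) < tol
        alpha = new_alpha
        if converged:
            break
    return alpha, (mu1, var1), (mu2, var2)

rng = np.random.default_rng(0)
# Synthetic stand-in data: 120 points around 1.0 and 100 points around 4.6.
x = np.concatenate([rng.normal(1.0, 0.9, 120), rng.normal(4.6, 0.9, 100)])
alpha, (mu1, var1), (mu2, var2) = fit_two_gaussians(x)
print(alpha, mu1, var1, mu2, var2)
```

Because component 2 starts at the largest observation, $\alpha$ here ends up estimating the share of points in the upper cluster. Which component takes which cluster depends on the initial guesses, so the labels 1 and 2 carry no meaning of their own.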

Example 2:

![gmm_classifier](images/plot_gmm_classifier.png)


##Where to Apply GMM: 
1. Low Density Samples, sparse 
2. Clearly gaussian processes (Gaussian Clusters)

##Where GMM Fails: 
1. Mixed Density clusters, highly overlapped clusters. (Mixed density problem) Unfortunately this is very common in practice.
2. High Dimensions (greater than 3)
3. High Density samples 
4. K-means fails here too. 

#GMM vs. K-Means

![gmm_data](images/GMM_data_Kmeans_example.png)
![kmeans_out](images/KMeans_output.png)


1. Both estimate clusters of uniform density.
2. Both partition a data set into a selection of K distributions. (There are self-optimizing versions of GMM.)

###Why choose GMM?
1. GMM does well with gaussian distributions
2. Resists breaking up obvious clusters in $\leq 3$ dimensions.

###Why choose KMeans? 
1. Faster. 
2. Often used for exploratory work to find K
3. Used to initialize GMM!
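
To illustrate the last point, the sketch below runs a plain k-means (Lloyd's algorithm) and converts its hard clustering into starting values $(\alpha_k, \mu_k, \Sigma_k)$ for a GMM. All names, the synthetic data, and the small ridge added to each covariance are assumptions made for the illustration. The sketch also assumes that no cluster ends up empty.

```python
import numpy as np

def assign(x, centroids):
    """Index of the nearest centroid for each row of x."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(x, k, rng, n_iter=50):
    """Plain Lloyd's algorithm: returns hard labels and centroids."""
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(x, centroids)
        for j in range(k):
            if np.any(labels == j):            # keep the old centroid if a cluster empties
                centroids[j] = x[labels == j].mean(axis=0)
    return assign(x, centroids), centroids

def gmm_init_from_kmeans(x, labels, centroids):
    """Turn a hard clustering into initial GMM parameters."""
    k, d = centroids.shape
    # Mixing weights: fraction of points in each cluster.
    alphas = np.array([(labels == j).mean() for j in range(k)])
    # Per-cluster covariance, plus a small ridge so each matrix stays invertible.
    sigmas = np.array([
        np.cov(x[labels == j].T, bias=True) + 1e-6 * np.eye(d) for j in range(k)
    ])
    return alphas, centroids.copy(), sigmas

rng = np.random.default_rng(0)
# Synthetic data: two blobs of 100 points each.
x = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(6.0, 1.0, (100, 2))])
labels, centroids = kmeans(x, 2, rng)
alphas, mus, sigmas = gmm_init_from_kmeans(x, labels, centroids)
print(alphas, mus, sigmas.shape)
```

Starting EM from these values instead of random ones gives it a sensible partition to refine, which is one reason the faster algorithm is often run first.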