# Gaussian Mixture Model Review
### Extending the Bayes classifier<hr>
- Single Gaussian model learns blurry images - why?
- We tried to force a single Gaussian to fit a multi-modal distribution
- Recall: a mode is a local maximum of the pdf
- Makes sense for images of digits
- Not everyone will write digits the same way
- But there should be a finite number of clusters
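
The blurriness can be illustrated with a minimal 1-D sketch. The data below is hypothetical (two clusters at -3 and +3 standing in for two ways of writing a digit), and a single Gaussian is fit by taking the sample mean and standard deviation. The fitted mean lands between the two modes, where almost no data actually lives:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D stand-in for "two ways of writing a digit":
# half the samples cluster around -3, half around +3.
x = np.concatenate([
    rng.normal(loc=-3.0, scale=0.5, size=500),
    rng.normal(loc=3.0, scale=0.5, size=500),
])

# Maximum-likelihood fit of a single Gaussian: sample mean and std.
mu = x.mean()
sigma = x.std()

# Fraction of the data that lies within 1 unit of the fitted mean.
near_mean = np.mean(np.abs(x - mu) < 1.0)

print(f"fitted mean = {mu:.2f}, fitted std = {sigma:.2f}")
print(f"fraction of samples within 1 unit of the mean = {near_mean:.3f}")
```

The most likely point under the fitted Gaussian is its mean, which here is an average of the two clusters rather than a member of either one. For images, that average is the blurry picture.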

## Multi-modal distributions
![multi_modal_distributions](../images/multi_modal_distributions.PNG)
- Suppose we collect 1000 images of written 2s
- 500 similar to the left, 500 similar to the right
- Then we'd call this a bi-modal distribution
- In reality, there will be more modes, but still a finite number
- There are only so many ways to write a 2 until it ceases to look like a 2

## How can we model a multi-modal distribution?
- We'll use a pre-built GMM (scikit-learn)
- What's important: it can fit multiple Gaussians in different proportions to approximate a multi-modal distribution
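
As a rough sketch of what "different proportions" means, the snippet below fits scikit-learn's `GaussianMixture` to hypothetical 1-D data drawn 70% from one cluster and 30% from another. The data, the choice of 2 components, and the seed are all assumptions made for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical bi-modal data in unequal proportions (70% / 30%).
X = np.concatenate([
    rng.normal(loc=-3.0, scale=0.5, size=700),
    rng.normal(loc=3.0, scale=0.5, size=300),
]).reshape(-1, 1)  # scikit-learn expects shape (n_samples, n_features)

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(X)

# Sort components by mean so the printout has a stable order.
order = np.argsort(gmm.means_.ravel())
means = gmm.means_.ravel()[order]
weights = gmm.weights_[order]

print("means:  ", np.round(means, 2))
print("weights:", np.round(weights, 2))
```

The learned `weights_` play the role of the proportions, and `means_` locate the individual Gaussians.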

## Interesting Facts about GMMs
- GMM is a latent variable model, and unsupervised learning is all about latent variables!
- We call the latent variable **'Z'** - it represents **"which cluster x belongs to"**
- The marginal of x looks a lot like our Bayes classifier
- 2 clusters: \\(p(x) = p(z=1)p(x|z=1) + p(z=2)p(x|z=2)\\)
- \\(p(z)\\): prior probability that any x belongs to a given cluster
- \\(p(z)\\): categorical / discrete distribution
- \\(p(x|z)\\): Gaussian
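
The 2-cluster marginal can be written out directly in NumPy. This is a minimal 1-D sketch with hypothetical parameter values; `gaussian_pdf` and `marginal` are helper names introduced here for illustration:

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical parameters for a 2-cluster mixture.
prior = np.array([0.7, 0.3])   # p(z=1), p(z=2); must sum to 1
means = np.array([-3.0, 3.0])  # mean of p(x|z) for each cluster
stds = np.array([0.5, 0.5])    # std of p(x|z) for each cluster

def marginal(x):
    """p(x) = sum over z of p(z) * p(x|z)."""
    x = np.asarray(x, dtype=float)
    # Broadcast x against the clusters: last axis indexes z.
    per_cluster = prior * gaussian_pdf(x[..., None], means, stds)
    return per_cluster.sum(axis=-1)

# Sanity check: a valid density integrates to (approximately) 1.
grid = np.linspace(-10.0, 10.0, 20001)
dx = grid[1] - grid[0]
area = marginal(grid).sum() * dx
print(f"p(x=-3) = {marginal(-3.0):.4f}, area under p(x) = {area:.4f}")
```

Because the priors sum to 1 and each Gaussian integrates to 1, the weighted sum is itself a valid density.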

### The prior
- \\(p(z)\\) tells us, without looking at any x, which cluster you're likely to belong to 
- Ex. ask 1000 people if they have a disease (yes = 1, no = 0)
- \\(p(z=1)\\) = # people who said yes / 1000

### Common Mistake
- \\(p(z)\\) is not the same as \\(p(z|x)\\):
- Analogy:
- \\(p(z)\\) - frequency of disease in population
- \\(p(z|x)\\) - patient goes to doctor's office and performs a test
- x is the test data (clearly, having this data would alter the probability of whether or not you have the disease)

## Assigning a cluster
- Given an x, how can we find which cluster z it belongs to?
- Use Bayes rule!
- \\(p(z|x) = p(x|z)p(z)/p(x)\\)
- Where \\(p(x)\\) is called the *"evidence"*, and is just the sum of \\(p(z)p(x|z)\\) over all z
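
Bayes rule for cluster assignment can be sketched in a few lines. The mixture parameters below are hypothetical, and `gaussian_pdf` and `posterior` are illustrative helper names, not library functions:

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical 2-cluster mixture.
prior = np.array([0.7, 0.3])   # p(z)
means = np.array([-3.0, 3.0])  # parameters of p(x|z)
stds = np.array([1.5, 1.5])

def posterior(x):
    """p(z|x) for each cluster z, via Bayes rule."""
    joint = prior * gaussian_pdf(x, means, stds)  # p(z) p(x|z), shape (2,)
    evidence = joint.sum()                        # p(x): sum over all z
    return joint / evidence

for x in (-3.0, 0.0, 3.0):
    p = posterior(x)
    print(f"x = {x:+.1f}  p(z|x) = {np.round(p, 3)}  -> cluster {p.argmax() + 1}")
```

At x = 0 the two Gaussians are equally likely (same distance from both means, same spread), so the data carries no information and the posterior equals the prior. Away from 0, the likelihood term dominates.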

## More interesting facts about GMMs
- GMM is trained using expectation-maximization (EM)
- Details not important, but **"Where it fits"** is interesting
- We use EM for latent variable models because we can't find a closed-form maximum-likelihood solution
- EM is iterative (the likelihood never decreases from one step to the next)
- But there is still no special objective: we are still just doing maximum likelihood
- Simple example: finding the mean height of students in your class assuming a Gaussian distribution
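- We find the sample mean/stddev by maximizing the likelihood L with respect to the parameters \\(\theta\\), i.e. solving dL/d\\(\theta\\)=0
- EM: with latent variables we cannot solve dL/d\\(\theta\\)=0 in closed form, so we iterate instead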
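
To make the contrast concrete, here is a minimal 1-D sketch on hypothetical data. The single Gaussian is solved in one line (sample mean and stddev), while the 2-component mixture is fit with a hand-written EM loop. This is an illustrative implementation with arbitrary initial guesses, not scikit-learn's:

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
x = np.concatenate([
    rng.normal(-3.0, 0.5, size=700),
    rng.normal(3.0, 0.5, size=300),
])

# Single Gaussian: closed-form maximum-likelihood solution.
mu_mle, sigma_mle = x.mean(), x.std()

# Mixture of 2 Gaussians: no closed form, so iterate with EM.
# Deliberately rough initial guesses (hypothetical values).
prior = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
stds = np.array([1.0, 1.0])

log_likelihoods = []
for _ in range(20):
    # E-step: responsibilities p(z|x) for every sample, shape (n, 2).
    joint = prior * gaussian_pdf(x[:, None], means, stds)
    evidence = joint.sum(axis=1, keepdims=True)
    resp = joint / evidence
    # Log-likelihood under the parameters used in this E-step.
    log_likelihoods.append(np.log(evidence).sum())

    # M-step: responsibility-weighted maximum-likelihood updates.
    n_k = resp.sum(axis=0)
    prior = n_k / len(x)
    means = (resp * x[:, None]).sum(axis=0) / n_k
    stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / n_k)

print(f"single Gaussian: mean = {mu_mle:.2f}, std = {sigma_mle:.2f}")
print("EM means:", np.round(means, 2), "EM priors:", np.round(prior, 2))
print(f"log-likelihood: first = {log_likelihoods[0]:.1f}, last = {log_likelihoods[-1]:.1f}")
```

Each M-step is itself a closed-form maximum-likelihood update, but only because the E-step has filled in the latent cluster memberships with their current posterior estimates.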

## Why is EM important?
- Variational inference is a key component of variational autoencoders
- Variational inference can be seen as a Bayesian extension of EM
![variational_inference](../images/variational_inference.PNG)

## An important application of variational inference
- Recall: one weakness of K-Means Clustering / GMMs is that we have to choose K, the number of clusters
- If you choose K badly, the model fits poorly
- The variational inference version of GMM allows, in theory, an infinite number of clusters
- Most remain empty, and so VI-GMM automatically finds the number of clusters for you
- We will assume it can do so effectively
- We will use scikit-learn's built-in VI-GMM (`BayesianGaussianMixture`), where `n_components` acts as an upper bound on the number of clusters
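
A minimal sketch of this behaviour, assuming hypothetical 1-D data with 2 true clusters and an arbitrary upper bound of 10 components. How many components end up with non-negligible weight depends on the data, the prior, and the seed, so the printed numbers should be inspected rather than taken for granted:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Hypothetical data with 2 true clusters.
X = np.concatenate([
    rng.normal(-3.0, 0.5, size=(500, 1)),
    rng.normal(3.0, 0.5, size=(500, 1)),
])

# n_components is only an upper bound here; 10 is an arbitrary choice.
vgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
)
vgmm.fit(X)

# Unused components are driven toward (near-)zero weight.
print("weights:", np.round(vgmm.weights_, 3))

# Number of components that at least one sample is actually assigned to.
n_used = len(np.unique(vgmm.predict(X)))
print("components in use:", n_used)
```

Components that the model does not need keep a slot in `weights_` but receive very little mass, which is the practical meaning of "most remain empty".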

## More graphical modeling
![More_graphical_modeling](../images/More_graphical_modeling.PNG)

## Summary
- Mostly new concepts and ideas
- Implementation should be very simple
- Just replace the single Gaussian (mean/cov) with a variational GMM:
 - `from sklearn.mixture import BayesianGaussianMixture`
- It already comes with `fit()` and `sample()` methods
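
One way the swap might look is sketched below. The `BayesClassifier` class, its method names, the `predict` method, and the toy 2-D data are all assumptions made for illustration; only `BayesianGaussianMixture` and its `fit()`, `sample()`, and `score_samples()` methods come from scikit-learn:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


class BayesClassifier:
    """Hypothetical generative classifier: one variational GMM per class."""

    def fit(self, X, y, max_components=5):
        self.classes_ = np.unique(y)
        self.models_ = []
        self.log_priors_ = []
        for c in self.classes_:
            Xc = X[y == c]
            # max_components is an upper bound, not the final cluster count.
            model = BayesianGaussianMixture(
                n_components=max_components, max_iter=500, random_state=0
            )
            model.fit(Xc)
            self.models_.append(model)
            self.log_priors_.append(np.log(len(Xc) / len(X)))
        return self

    def sample_given_y(self, class_index, n_samples=1):
        # sample() returns (samples, component_labels); keep the samples.
        samples, _ = self.models_[class_index].sample(n_samples)
        return samples

    def predict(self, X):
        # log p(x|y) + log p(y) for every class, then pick the largest.
        scores = np.column_stack([
            model.score_samples(X) + log_prior
            for model, log_prior in zip(self.models_, self.log_priors_)
        ])
        return self.classes_[scores.argmax(axis=1)]


# Toy 2-D data (hypothetical): class 0 is bi-modal, class 1 is uni-modal.
rng = np.random.default_rng(0)
X = np.concatenate([
    rng.normal([-3.0, 0.0], 0.5, size=(100, 2)),
    rng.normal([3.0, 0.0], 0.5, size=(100, 2)),
    rng.normal([0.0, 4.0], 0.5, size=(200, 2)),
])
y = np.array([0] * 200 + [1] * 200)

clf = BayesClassifier().fit(X, y)
samples = clf.sample_given_y(0, n_samples=5)
accuracy = np.mean(clf.predict(X) == y)
print("samples from class 0:\n", np.round(samples, 2))
print(f"training accuracy = {accuracy:.3f}")
```

Samples drawn for class 0 should come from one of its two modes rather than from the empty region between them, which is the improvement over the single-Gaussian version.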
