# Mixture Models

In this problem we will look at different ways to detect natural "groupings" or "clusters" in data. 

You are given biometric data (along with maternal statistics) for approximately 1,000 newborns in California between the years 2000 and 2001. The data has already been split into training and test sets. Previous studies have indicated that newborn biometric data "naturally" clusters into two groups, full term and preterm births (preterm here means a gestational age of less than 37 weeks). Your task is to investigate whether this assertion holds for the dataset at hand. That is, our hypothesis for this problem is:

> Newborn biometric data "naturally" clusters into exactly two groups. One group will encompass the full term births and the other will encompass the preterm births.

One way to investigate "natural" clustering in data is to run an unsupervised clustering algorithm and then examine the clusters obtained. If one cluster contains mostly full term data points and the other mostly preterm data points, then we might interpret this as support for our hypothesis. Another way to assess our data segmentation is to classify new data as full term or preterm by comparing it to our clusters.

The number of features in this dataset is small, so you are welcome to use visualization for exploration and sanity checks (both are highly recommended but not required).

***Grading Notes:*** 

You can keep your analysis short and to the point (of course without sacrificing correctness). 

Inspection of the traceplots and autocorrelation plots is sufficient for diagnosing convergence in your samplers.
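
As a rough illustration of what an autocorrelation check computes, the sketch below estimates the sample autocorrelation of a single chain with NumPy. The two chains are hypothetical stand-ins (white noise and a slowly mixing AR(1) process), not output from any sampler in this assignment, and the helper name `autocorrelation` is an assumption made for the example.

```python
import numpy as np

def autocorrelation(chain, max_lag=50):
    """Sample autocorrelation of a 1-D chain for lags 0..max_lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = np.dot(x, x) / n
    return np.array([np.dot(x[:n - lag], x[lag:]) / (n * var)
                     for lag in range(max_lag + 1)])

# Hypothetical chains for illustration only.
rng = np.random.default_rng(0)
white = rng.normal(size=5000)          # behaves like a well-mixed chain
sticky = np.zeros(5000)                # strongly correlated AR(1) chain
for t in range(1, 5000):
    sticky[t] = 0.95 * sticky[t - 1] + rng.normal()

acf_white = autocorrelation(white)
acf_sticky = autocorrelation(sticky)
print("lag-1 autocorrelation:", acf_white[1], acf_sticky[1])
```

An autocorrelation that drops to near zero within a few lags suggests the chain is mixing well; one that decays slowly suggests more samples or thinning may be needed.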


### Part A: K-Means
K-Means is a fast and simple unsupervised clustering algorithm that produces a hard clustering of the data: each data point is assigned to exactly one cluster. Fit a K-Means model from `sklearn` on your training data ([see example code half-way down this page](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)). Investigate the cluster membership in order to label one cluster "full term" and the other "preterm". Use your model to assign each data point to a cluster and classify the data point as "full term" or "preterm" based on the label of its assigned cluster. Compute the classification error on the training set and on the test set.
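
One possible shape for this workflow is sketched below. It runs on synthetic data, since the real column names and file layout are not given here: the three features, the group sizes, and the helper names `make_data` and `classification_error` are all assumptions for illustration. The cluster with the larger share of preterm births in the training set is the one called "preterm".

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def make_data(n_full, n_pre):
    """Synthetic stand-in for the birth data: 3 made-up features, label 1 = preterm."""
    X = np.vstack([rng.normal(8.0, 1.0, size=(n_full, 3)),
                   rng.normal(4.0, 1.0, size=(n_pre, 3))])
    y = np.concatenate([np.zeros(n_full, dtype=int), np.ones(n_pre, dtype=int)])
    return X, y

X_train, y_train = make_data(400, 100)
X_test, y_test = make_data(100, 25)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# Call the cluster with the larger share of preterm births "preterm".
preterm_rate = [y_train[km.labels_ == c].mean() for c in (0, 1)]
preterm_cluster = int(np.argmax(preterm_rate))

def classification_error(X, y):
    """Fraction of points whose cluster-derived label disagrees with the true label."""
    pred = (km.predict(X) == preterm_cluster).astype(int)
    return float(np.mean(pred != y))

train_error = classification_error(X_train, y_train)
test_error = classification_error(X_test, y_test)
print(f"train error = {train_error:.3f}, test error = {test_error:.3f}")
```

On the real data the two groups are unlikely to be this well separated, so the error rates themselves are part of the evidence for or against the hypothesis.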

Do your findings support our hypothesis?

### Part B: Mixture Model and MLE
While fast, clustering by K-Means has many drawbacks. A natural generalization of K-Means clustering is model-based clustering, in particular clustering based on Gaussian mixture models (K-Means can be interpreted as a particular form of Gaussian mixture).

Model our biometric data as a mixture of two Gaussian components. Use Expectation Maximization, initialized with K-Means estimates of the component means and covariances, to compute the MLE of the model parameters (sample code found in [lecture notes](https://am207.github.io/2017/wiki/typesoflearning.html)).
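
For orientation, here is a minimal NumPy sketch of EM for a Gaussian mixture. It is not the lecture-notes code: the function names, the small ridge `reg` added to each covariance for numerical stability, and the synthetic data are assumptions. The demo also uses a crude median split as a stand-in for the K-Means initialization that the assignment asks for.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Log density of N(mean, cov) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def e_step(X, weights, means, covs):
    """Responsibilities r[i, k] = p(z_i = k | x_i) and the data log-likelihood."""
    log_p = np.column_stack([np.log(w) + log_gaussian(X, m, c)
                             for w, m, c in zip(weights, means, covs)])
    log_norm = np.logaddexp.reduce(log_p, axis=1)
    return np.exp(log_p - log_norm[:, None]), log_norm.sum()

def m_step(X, resp, reg=1e-6):
    """Responsibility-weighted updates of mixing weights, means, and covariances."""
    nk = resp.sum(axis=0)
    weights = nk / len(X)
    means = (resp.T @ X) / nk[:, None]
    covs = []
    for k in range(resp.shape[1]):
        diff = X - means[k]
        cov = (resp[:, k, None] * diff).T @ diff / nk[k]
        covs.append(cov + reg * np.eye(X.shape[1]))  # ridge is an added assumption
    return weights, means, np.array(covs)

def fit_em(X, weights, means, covs, n_iter=200, tol=1e-8):
    """Alternate E and M steps until the log-likelihood stops improving."""
    prev = -np.inf
    for _ in range(n_iter):
        resp, ll = e_step(X, weights, means, covs)
        if ll - prev < tol:
            break
        weights, means, covs = m_step(X, resp)
        prev = ll
    resp, _ = e_step(X, weights, means, covs)
    return weights, means, covs, resp

# Synthetic stand-in data with two groups of unequal size.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(8.0, 1.0, size=(300, 3)),
               rng.normal(4.0, 1.5, size=(100, 3))])

# Crude initial split on the first feature (stand-in for K-Means estimates).
init = (X[:, 0] > np.median(X[:, 0])).astype(int)
weights0 = np.array([np.mean(init == k) for k in (0, 1)])
means0 = np.array([X[init == k].mean(axis=0) for k in (0, 1)])
covs0 = np.array([np.cov(X[init == k].T) for k in (0, 1)])

weights, means, covs, resp = fit_em(X, weights0, means0, covs0)
assignments = resp.argmax(axis=1)
print("weights:", weights)
print("means:\n", means)
```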

If you'd like to check your results, you can compare your results with `sklearn`'s `GaussianMixture` model.
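
A check of this kind might look like the sketch below, again on synthetic stand-in data. `means_init` seeds the component means with the K-Means centers, and `predict_proba` returns the responsibilities; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data with two groups of unequal size.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(8.0, 1.0, size=(300, 3)),
               rng.normal(4.0, 1.5, size=(100, 3))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      means_init=km.cluster_centers_, random_state=0).fit(X)

resp = gmm.predict_proba(X)          # responsibilities, one row per data point
assignments = resp.argmax(axis=1)    # hard assignment by largest responsibility

print("weights:", gmm.weights_)
print("means:\n", gmm.means_)
```

The fitted `weights_`, `means_`, and `covariances_` can then be compared against your own EM estimates, keeping in mind that the component order may be swapped.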

Use your MLE parameters to label each component as "full term" or "preterm". Then assign each data point to a component (by taking the argmax of the responsibilities of the components for each data point) and classify the data point as "full term" or "preterm" based on the label of its assigned component. Compute the classification error on the training set and on the test set.

How do your error rates compare with those from K-Means clustering?

Do your findings support our hypothesis?


### Part C: Bayesian Model and MAP
Overfitting is a primary concern when using MLE model parameters, and it is particularly problematic for Gaussian mixture models. One solution is to compute MAP estimates. Consider the following Bayesian model for our mixture of Gaussians:

\begin{align}
x_{i} | z_{ik} = 1 &\sim \mathcal{N}(\mu_{k}, \Sigma_{k}), \quad i = 1, \ldots, N\\
\mu_{k} &\sim \mathcal{N}\left([5, 5, 5], \left(\begin{array}{ccc}
100 & 0 &0\\
0 & 100 & 0 \\
0 & 0 &100 \\
\end{array}\right)\right), \quad k = 0, 1\\
\Sigma_{k} &= \left(\begin{array}{ccc}s_{0k} & 0 &0\\
0 & s_{1k} & 0 \\
0 & 0 &s_{2k} \\
\end{array}\right), \quad k = 0, 1\\
s_{jk} &\sim U(0, 20), \quad j = 0, 1, 2;\quad k = 0, 1\\
z_{i} &\sim \mathrm{Cat}(\pi)\\
\pi &\sim \mathrm{Dir}([100, 100])
\end{align}
where $x_i$ is the vector of biometric measurements for the $i$-th birth.
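
To make the model concrete, the sketch below evaluates its unnormalized log posterior with the indicators $z_i$ summed out. It is only an illustration of the density defined above, not the sampling code from the lecture notes. The function names and the array layout (`mu[k]` and `s[k]` hold the mean and diagonal variances of component `k`) are assumptions, and the data are synthetic.

```python
import numpy as np
from math import lgamma

def log_diag_gaussian(X, mean, var):
    """Log density of N(mean, diag(var)) at each row of X."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var, axis=1)

def log_posterior(X, mu, s, pi):
    """Unnormalized log posterior of (mu, s, pi) with z_i marginalized out.

    mu: (2, 3) component means, s: (2, 3) diagonal variances, pi: (2,) weights.
    """
    # Outside the support of the U(0, 20) or Dirichlet priors the density is zero.
    if np.any(s <= 0) or np.any(s >= 20) or np.any(pi <= 0) or not np.isclose(pi.sum(), 1.0):
        return -np.inf
    log_p = np.column_stack([np.log(pi[k]) + log_diag_gaussian(X, mu[k], s[k])
                             for k in range(2)])
    log_lik = np.logaddexp.reduce(log_p, axis=1).sum()
    # mu_k ~ N([5, 5, 5], 100 * I)
    log_prior_mu = log_diag_gaussian(mu, np.full(3, 5.0), np.full(3, 100.0)).sum()
    # s_jk ~ U(0, 20): constant density 1/20 for each of the 6 entries
    log_prior_s = -s.size * np.log(20.0)
    # pi ~ Dir([100, 100])
    a = np.array([100.0, 100.0])
    log_prior_pi = (lgamma(a.sum()) - sum(lgamma(v) for v in a)
                    + np.sum((a - 1.0) * np.log(pi)))
    return log_lik + log_prior_mu + log_prior_s + log_prior_pi

# Synthetic stand-in data: two groups of three made-up features.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(7.0, 1.0, size=(150, 3)),
               rng.normal(3.0, 1.0, size=(150, 3))])

half = np.array([0.5, 0.5])
good = log_posterior(X, np.array([[7.0] * 3, [3.0] * 3]), np.ones((2, 3)), half)
bad = log_posterior(X, np.array([[5.0] * 3, [5.0] * 3]), np.ones((2, 3)), half)
print(good, bad)
```

Note that the $\mathrm{Dir}([100, 100])$ prior concentrates $\pi$ near $[0.5, 0.5]$, which matters if the two groups in the data have very different sizes.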

Using the posterior mean estimates (code for sampling from the posterior found in [lecture notes](https://am207.github.io/2017/wiki/mixtures_and_mcmc.html)), compute the classification error on the training set and on the test set. 

How do your error rates compare with those from K-Means clustering?

Do your findings support our hypothesis?


### Part D: Comparison with A More Complex Model
A number of studies show that maternal statistics have a significant effect on whether or not a birth is premature. In particular, researchers have proposed the following mixture model for birth data:
\begin{align}
x_{i} | z_{ik} = 1 &\sim \mathcal{N}(\mu_{k}, \Sigma_{k}), \quad i = 1, \ldots, N\\
\mu_{k} &\sim \mathcal{N}\left([5, 5, 5], \left(\begin{array}{ccc}
100 & 0 &0\\
0 & 100 & 0 \\
0 & 0 &100 \\
\end{array}\right)\right), \quad k = 0, 1\\
\Sigma_{k} &= \left(\begin{array}{ccc}s_{0k} & 0 &0\\
0 & s_{1k} & 0 \\
0 & 0 &s_{2k} \\
\end{array}\right), \quad k = 0, 1\\
s_{jk} &\sim U(0, 20), \quad j = 0, 1, 2;\quad k = 0, 1\\
z_{i} &\sim \mathrm{Cat}([\pi_{i0}, \pi_{i1}])\\
\pi_{i1} &= \sigma(\beta^T c_{i} + \alpha)\\
\pi_{i0} &= 1 - \sigma(\beta^T c_{i} + \alpha)\\
\beta &\sim \mathcal{N}(0, 100)\\
\alpha &\sim \mathcal{N}(0, 100)
\end{align}
where $x_i$ is the vector of biometric measurements for the $i$-th birth, $c_i$ is the vector of covariates for the mother of the $i$-th birth, and $\sigma$ is the sigmoid function.
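
The only change from the Part C model is that each birth now has its own mixing weights. A small sketch of that piece is shown below; the covariate values, `beta`, and `alpha` are made-up numbers for illustration, and the helper name `mixing_weights` is an assumption.

```python
import numpy as np

def sigmoid(t):
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def mixing_weights(C, beta, alpha):
    """Return an (N, 2) array whose row i is [pi_i0, pi_i1]."""
    p1 = sigmoid(C @ beta + alpha)
    return np.column_stack([1.0 - p1, p1])

# Hypothetical standardized covariates, one row per mother.
C = np.array([[-1.0, 0.0, 0.5],
              [0.0, 0.0, 0.0],
              [2.0, -1.0, 1.0]])
beta = np.array([0.8, -0.3, -0.5])   # made-up coefficients
alpha = -1.0                         # made-up intercept

pi = mixing_weights(C, beta, alpha)
print(pi)
```

A positive entry of $\beta$ raises $\pi_{i1}$ as the corresponding covariate increases. Which component index corresponds to preterm births is not fixed by the model, so it has to be read off the fitted $\mu_k$ before interpreting the signs.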

Use posterior mean estimates of $\beta$ and $\alpha$ to hypothesize about the effect of maternal age, income and education on the probability of an infant being born premature.

Use model comparison criteria (such as WAIC or AIC) to compare the model that factors in maternal statistics with the model that does not. Does the result of this comparison support or contradict your hypothesis?
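
If you compute WAIC by hand, one common formulation works from a matrix of pointwise log-likelihoods, with one row per posterior draw and one column per data point. The sketch below uses that formulation on the deviance scale (lower is better). The toy one-parameter models and their "posterior draws" are fabricated stand-ins for illustration only; for the mixture models above, each entry would be the log-likelihood of $x_i$ with $z_i$ summed out.

```python
import numpy as np

def waic(log_lik):
    """WAIC (deviance scale) and effective parameter count from an (S, N) matrix.

    log_lik[s, i] = log p(x_i | theta_s) for posterior draw s and data point i.
    """
    S = log_lik.shape[0]
    # Log pointwise predictive density: log of the posterior-averaged likelihood.
    lppd = np.sum(np.logaddexp.reduce(log_lik, axis=0) - np.log(S))
    # Penalty: posterior variance of the log-likelihood, summed over points.
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic), p_waic

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=200)

def pointwise_log_lik(mu_draws):
    """log N(x_i | mu_s, 1) for every draw s and data point i."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x[None, :] - mu_draws[:, None]) ** 2

# Stand-in draws: model A is centered on the data, model B is badly off.
draws_a = rng.normal(x.mean(), 1 / np.sqrt(len(x)), size=2000)
draws_b = rng.normal(3.0, 1 / np.sqrt(len(x)), size=2000)

waic_a, p_a = waic(pointwise_log_lik(draws_a))
waic_b, p_b = waic(pointwise_log_lik(draws_b))
print(f"WAIC A = {waic_a:.1f} (p = {p_a:.2f}), WAIC B = {waic_b:.1f} (p = {p_b:.2f})")
```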

### Extra Credit: Appropriateness of the Choice of K

Use model comparison criteria to determine the optimal choice of $k$ for our mixture model. Does the optimal value of $k$ you find support our initial hypothesis?

> Newborn biometric data "naturally" clusters into exactly two groups. One group will encompass the full term births and the other will encompass the preterm births.
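
For the maximum-likelihood mixture, a quick way to scan candidate values of $k$ is sketched below using `sklearn`'s `GaussianMixture`, which exposes `aic` and `bic` methods. BIC is included here as an additional criterion that the assignment does not name, the data are synthetic, and the diagonal covariance choice simply mirrors the diagonal $\Sigma_k$ of the Bayesian model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data generated from two groups.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(8.0, 1.0, size=(300, 3)),
               rng.normal(4.0, 1.0, size=(100, 3))])

scores = {}
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, covariance_type="diag",
                          n_init=5, random_state=0).fit(X)
    scores[k] = {"aic": gmm.aic(X), "bic": gmm.bic(X)}

# Lower is better for both criteria.
best_k = min(scores, key=lambda k: scores[k]["bic"])
for k, sc in scores.items():
    print(k, round(sc["aic"], 1), round(sc["bic"], 1))
print("k with lowest BIC:", best_k)
```

For the Bayesian models, the analogous comparison would use WAIC computed from posterior samples for each value of $k$.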
