## APMTH 207: Stochastic Methods for Data Analysis, Inference and Optimization

### Final exam

**Harvard University**  
**Spring 2017**  
**Instructor: Rahul Dave**  
**Due Date: May 12th, 2017, 11:59pm** 

**Instructions:**

- Upload your final answers as well as your iPython notebook containing all work to Canvas.

- Structure your notebook and your work to maximize readability.

## Problem 1: Variational Inference for Hierarchical Feed Forward Neural Networks
Earlier in the course we built models to predict a label from a discrete, finite set for a given input; that is, we built classifiers. When building these classifiers, we learn a function that maps input data to a set of outputs, usually the classes into which the dataset naturally separates. In the case of an Artificial Neural Network (ANN) like the Multi-Layer Perceptron, learning is achieved by iteratively updating the weights that regulate the flow of information between the different layers of the ANN. We have also seen that it is possible to regularize the model we learn by penalizing the magnitude of some of the weights in order to avoid overfitting. This type of regularization can be accomplished by imposing priors on the weights when they are initialized.

This approach is inherently Bayesian. We can specify priors to inform and constrain our models and get uncertainty estimations in the form of a posterior distribution. Using MCMC sampling algorithms we can draw samples from this posterior to very flexibly estimate these models. This also means that we can obtain uncertainties on both the weights and the classes that we obtain from the output layer.

In this problem you are asked to build a hierarchical Bayesian ANN to classify the MNIST dataset (properly split into training and test sets) that we used before in Long Homework #1, and to infer uncertainties for the weights and the resulting classes. Variational inference can be very handy here: instead of drawing samples from the posterior, these algorithms fit a distribution (e.g. normal) to the posterior, turning a sampling problem into an optimization problem. ADVI (Automatic Differentiation Variational Inference) is implemented in PyMC3 and Stan.

Read this [blog post](http://twiecki.github.io/blog/2016/07/05/bayesian-deep-learning/) to get an idea about the problem, and to see how some of the ADVI related technology is set up.

### Part A: Bayesian ANN

In Long HW 1 you built a multi-layer perceptron to classify the MNIST dataset.

Here we will build a Bayesian Artificial Neural Network. 

Build an artificial neural network in pymc3 (or Stan, your choice) with two hidden layers and 25 neurons in each, and use it to classify the MNIST dataset. Use the $\tanh$ function as the activation function, and initialize ALL the weights using normal priors with $\mu=0$ and two different values for $\sigma$: $0.03$ and $0.1$.

1. You could use an MCMC sampler such as NUTS to train your model, but this could be extremely slow as you add more layers or more units. Instead, use the PyMC3/Stan ADVI implementation to approximate the posteriors of the weights, and obtain the posterior means, standard deviations, and the evidence lower bound (ELBO), which is your objective function. Use mini-batch gradient descent for the optimization (a minibatch size of 500 is appropriate), and a total of 50000 ADVI iterations.

2. Plot the objective function (ELBO) as a function of the iteration. Using the best fit from your training, predict on the test set by estimating posterior predictives for the classes in each case, and provide the accuracy of your classification. How does your accuracy compare for the two values of $\sigma$? *Hint: if you need traces to perform inference, you can sample directly from the normal distribution that ADVI outputs*.
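
The hint about sampling from the normal distribution that ADVI outputs can be sketched in plain NumPy. This is only an illustration under stated assumptions: the arrays `means` and `sds` stand in for the fitted mean-field parameters (here they are random placeholders, not ADVI output), biases are omitted, the softmax output layer is an assumed choice, and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, w1, w2, w_out):
    """Two tanh hidden layers followed by a softmax output (no biases in this sketch)."""
    h1 = np.tanh(x @ w1)
    h2 = np.tanh(h1 @ w2)
    return softmax(h2 @ w_out)

def posterior_predictive(x, means, sds, n_samples=100, seed=0):
    """Draw weights from independent normals and push x through the network."""
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_samples):
        weights = [rng.normal(m, s) for m, s in zip(means, sds)]
        probs.append(forward(x, *weights))
    probs = np.stack(probs)  # (n_samples, n_points, n_classes)
    # Averaged class probabilities, plus the class picked under each weight draw.
    return probs.mean(axis=0), probs.argmax(axis=-1)

# Placeholder "fitted" parameters for a toy 4-input network (assumed values).
rng = np.random.default_rng(1)
shapes = [(4, 25), (25, 25), (25, 10)]
means = [rng.normal(0, 0.1, s) for s in shapes]
sds = [np.full(s, 0.05) for s in shapes]
x_test = rng.normal(size=(5, 4))

mean_probs, class_draws = posterior_predictive(x_test, means, sds, n_samples=200)
y_hat = mean_probs.argmax(axis=1)  # point prediction per test example
```

Accuracy would then be the fraction of `y_hat` entries that match the test labels, and `class_draws` keeps the per-draw predictions that are useful later for uncertainty estimates.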


### Part B: Hierarchical ANN
The connection between the standard deviation of the weight prior and the strength of the L2 penalization term leads to an interesting idea. In Part A we fixed $\sigma$ at $0.03$ and $0.1$, but this is a somewhat arbitrary decision. Perhaps $\sigma$ should be different for each layer, and have a different magnitude. Rather than randomly trying different $\sigma$s, we can learn the best values directly from the data. This is when the ANN becomes hierarchical.
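
As a brief reminder of that connection (a standard result, written here with an assumed penalty coefficient $\lambda$ that does not appear elsewhere in this handout): for independent priors $w_j \sim \mathcal{N}(0, \sigma^2)$, the negative log prior is a quadratic penalty on the weights,

\begin{align}
-\log p(w) &= \sum_j \frac{w_j^2}{2\sigma^2} + \text{const} = \lambda \sum_j w_j^2 + \text{const}, \qquad \lambda = \frac{1}{2\sigma^2}
\end{align}

Under this correspondence, $\sigma = 0.1$ gives $\lambda = 50$ and $\sigma = 0.03$ gives $\lambda \approx 556$, so the smaller $\sigma$ is the stronger regularizer.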

1. Modify your algorithm to include different hyperpriors for the $\sigma$s for the bias and weights in each layer, and use ADVI again to optimize the model. Plot the posterior distributions of these hyperparameters.

2. Just as in part A, predict on the test data and report your accuracy.

3. You can now use the fact that we're in a Bayesian framework and explore uncertainty in your predictions. Compute the $\chi^2$ statistic for the predictions and see how uniform the samples are. The more uniform, the higher our uncertainty. Plot histograms of the $\chi^2$ statistic for those predictions that you got right with respect to the test set, and for those that you got wrong. What is your conclusion?
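
One possible reading of the $\chi^2$ statistic above is a goodness-of-fit test against a uniform distribution over classes, computed per test point from the classes predicted under repeated posterior draws. The sketch below assumes that reading; the function name and the toy `draws` array are hypothetical.

```python
import numpy as np

def chi2_uniformity(class_draws, n_classes):
    """Chi-square statistic of sampled class predictions against a uniform distribution.

    class_draws: integer array of shape (n_samples, n_points), one predicted
    class per posterior draw and test point.
    Returns one statistic per test point; small values mean near-uniform
    (uncertain) predictions.
    """
    n_samples, n_points = class_draws.shape
    expected = n_samples / n_classes
    counts = np.stack([
        np.bincount(class_draws[:, j], minlength=n_classes)
        for j in range(n_points)
    ])
    return ((counts - expected) ** 2 / expected).sum(axis=1)

# Toy check: one point always predicted as class 3, one spread evenly over 10 classes.
certain = np.full((100, 1), 3)
spread = (np.arange(100) % 10).reshape(-1, 1)
draws = np.hstack([certain, spread])
chi2 = chi2_uniformity(draws, n_classes=10)
```

Splitting the resulting statistics by whether the point prediction was correct gives the two histograms the question asks for.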

## Problem 2: Gaussian Mixture Models

In this problem we will look at different ways to detect natural "groupings" or "clusters" in data. 

You are given the biometric data (along with maternal statistics) for approximately 1,000 newborns in California between the years 2000 and 2001. The data has already been split into training and testing sets. Previous studies have indicated that newborn biometric data "naturally" clusters into two groups, full term and preterm births (preterm here means a gestational age of less than 37 weeks). Your task is to investigate whether or not this assertion holds for the dataset at hand. That is, our hypothesis for this problem is:

> Newborn biometric data "naturally" clusters into exactly two groups. One group will encompass the full term births and the other will encompass the preterm births.

One way to investigate "natural" clustering in data is to run an unsupervised clustering algorithm and then examine the clusters obtained. If one cluster contains largely full term data points and the other largely preterm data points, then we might interpret this as support for our hypothesis. Another way to assess our data segmentation is to classify new data as full term or preterm by comparing new data to our clusters.

The number of features (biometric measurements of newborns) for this dataset is small, so you are welcome to use visualization for exploration and sanity checks (both are highly recommended but not required).

***Grading Notes:*** 

You can keep your analysis short and to the point (of course without sacrificing correctness). 

Inspection of the traceplots and autocorrelation plots is sufficient for diagnosing convergence in your samplers.


### Part A: K-Means
K-Means is a fast and simple unsupervised clustering algorithm that produces a hard clustering of the data: each data point is assigned to exactly one cluster. Fit a K-Means model from `sklearn` on your training data ([see example code half-way down this page](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)). Investigate the cluster membership in order to label one cluster "full term" and the other "preterm". Use your model to assign each data point to a cluster and classify the data point as "full term" or "preterm", based on the label of its assigned cluster. Compute the classification error on the training set and on the test set.

Do your findings support our hypothesis?
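
The labeling and error computation described in this part can be sketched as follows. The cluster assignments would come from the fitted K-Means model; here they are a hand-written toy array. The rule used (call the cluster with the higher preterm fraction "preterm") is one assumed choice among several, and the function names are hypothetical.

```python
import numpy as np

def label_clusters(assignments, is_preterm):
    """Return a length-2 array mapping cluster id -> class (0 full term, 1 preterm).

    The cluster with the larger fraction of preterm births is labeled preterm,
    so exactly one cluster gets each label. Assumes both clusters are non-empty.
    """
    frac_preterm = [is_preterm[assignments == k].mean() for k in (0, 1)]
    mapping = np.zeros(2, dtype=int)
    mapping[int(np.argmax(frac_preterm))] = 1
    return mapping

def classification_error(assignments, is_preterm, mapping):
    """Fraction of points whose cluster label disagrees with the true label."""
    predicted = mapping[assignments]
    return float(np.mean(predicted != is_preterm))

# Toy example: cluster ids and true preterm indicators for six births.
train_assignments = np.array([0, 0, 0, 1, 1, 1])
train_is_preterm = np.array([1, 1, 0, 0, 0, 0])

mapping = label_clusters(train_assignments, train_is_preterm)
train_error = classification_error(train_assignments, train_is_preterm, mapping)
```

The mapping should be learned on the training set only and then reused, unchanged, to score the cluster assignments of the test set.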

### Part B: Mixture of Gaussians Likelihood and MLE
While fast, clustering by K-Means has many drawbacks. A natural generalization of K-means clustering is model-based clustering, in particular, clustering based on Gaussian mixture models (K-Means can be interpreted to be a particular form of mixture of Gaussians). 

Model our biometric data as a mixture of two Gaussian components. Use Expectation Maximization, initialized with K-Means estimates of component means and covariance, to compute the MLE of the model parameters (sample code found in [materials from week 12](https://am207.github.io/2017/lectures/lecture23.html)). 

If you'd like to check your results, you can compare your results with `sklearn`'s `GaussianMixture` model.

Use your MLE parameters to label each component as "full term" or "preterm". Then assign each data point to a component (by taking the argmax of the responsibilities of the components for each data point), and classify the data point as "full term" or "preterm" based on the label of its assigned component. Compute the classification error on the training set and on the test set.

How do your error rates compare with those from K-Means clustering?

Do your findings support our hypothesis?
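
For reference, the responsibilities and the argmax assignment mentioned in this part can be sketched in NumPy. This is a minimal illustration, not the course's sample code: the toy data, the initial values (which would come from K-Means in the actual problem) and the fixed iteration count are all assumptions.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log density of N(mean, cov) evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def e_step(x, pis, means, covs):
    """Responsibilities r[n, k] = P(z_n = k | x_n), computed in log space."""
    log_r = np.stack(
        [np.log(p) + log_gauss(x, m, c) for p, m, c in zip(pis, means, covs)],
        axis=1,
    )
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

def m_step(x, r):
    """Closed-form MLE updates of proportions, means and full covariances."""
    nk = r.sum(axis=0)
    pis = nk / len(x)
    means = (r.T @ x) / nk[:, None]
    covs = []
    for k in range(r.shape[1]):
        diff = x - means[k]
        covs.append((r[:, k, None] * diff).T @ diff / nk[k])
    return pis, means, np.array(covs)

# Toy data standing in for the three biometric features (two separated blobs).
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(10, 1, (50, 3))])

# Assumed starting values; in the problem these come from the K-Means fit.
pis = np.array([0.5, 0.5])
means = np.array([[1.0, 1.0, 1.0], [9.0, 9.0, 9.0]])
covs = np.array([np.eye(3), np.eye(3)])

for _ in range(20):
    r = e_step(x, pis, means, covs)
    pis, means, covs = m_step(x, r)

assignments = e_step(x, pis, means, covs).argmax(axis=1)
```

A real run would stop when the log-likelihood stops improving rather than after a fixed number of iterations.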


### Part C: Bayesian Model and MAP
Overfitting is a primary concern when using MLE model parameters, and it is particularly problematic for Gaussian Mixture Models. One solution is to compute MAP estimates. Consider the following Bayesian model for our mixture of Gaussians:

\begin{align}
x_{i} | z_{ik} = 1 &\sim \mathcal{N}(\mu_{k}, \Sigma_{k}), \quad i = 1, \ldots, N\\
\mu_{k} &\sim \mathcal{N}\left([5, 5, 5], \left(\begin{array}{ccc}
100 & 0 &0\\
0 & 100 & 0 \\
0 & 0 &100 \\
\end{array}\right)\right), \quad k = 0, 1\\
\Sigma_{k} &= \left(\begin{array}{ccc}s_{0k} & 0 &0\\
0 & s_{1k} & 0 \\
0 & 0 &s_{2k} \\
\end{array}\right), \quad k = 0, 1\\
s_{ik} &\sim U(0, 20), \quad i = 0, 1, 2;\quad k = 0, 1\\
z_{i} &\sim \mathrm{Cat}(\pi)\\
\pi &\sim \mathrm{Dir}([100, 100])
\end{align}
where $x_i$ is the vector of biometric measurements for the $i$-th birth; $z_{i}$ is the (latent) indicator vector specifying which component $x_i$ belongs to; $(\mu_k, \Sigma_k)$ are the parameters of the $k$-th Gaussian component; and $\pi$ is the vector of proportions of the two components in our mixture.

Using the posterior mean estimates (code for sampling from the posterior found in [materials from Week 11](https://am207.github.io/2017/lectures/lecture22.html)), compute the classification error on the training set and on the test set. 

**Hint:** You might wish to marginalize out the latent variable $z$ when setting up your model for sampling in pymc3.
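
To spell out what marginalizing $z$ means for the model above: summing over the two possible values of the latent indicator gives the mixture density for each observation,

\begin{align}
p(x_i \mid \pi, \mu, \Sigma) &= \sum_{k=0}^{1} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\\
\log p(x \mid \pi, \mu, \Sigma) &= \sum_{i=1}^{N} \log \sum_{k=0}^{1} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)
\end{align}

so the sampler only needs to explore the continuous parameters $\pi$, $\mu_k$ and $s_{jk}$. In practice the inner sum is usually evaluated with a log-sum-exp for numerical stability.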

How do your error rates compare with those from K-Means clustering?

Do your findings support our hypothesis?


### Part D: Comparison with A More Complex Mixture Model
A number of studies show that the stats of the birth mother have a significant effect on whether or not a birth is premature. In particular, researchers have proposed the following mixture model for birth data:
\begin{align}
x_{i} | z_{ik} = 1 &\sim \mathcal{N}(\mu_{k}, \Sigma_{k}), \quad i = 1, \ldots, N\\
\mu_{k} &\sim \mathcal{N}\left([5, 5, 5], \left(\begin{array}{ccc}
100 & 0 &0\\
0 & 100 & 0 \\
0 & 0 &100 \\
\end{array}\right)\right), \quad k = 0, 1\\
\Sigma_{k} &= \left(\begin{array}{ccc}s_{0k} & 0 &0\\
0 & s_{1k} & 0 \\
0 & 0 &s_{2k} \\
\end{array}\right), \quad k = 0, 1\\
s_{ik} &\sim U(0, 20), \quad i = 0, 1, 2;\quad k = 0, 1\\
z_{i} &\sim \mathrm{Cat}([\pi_{i0}, \pi_{i1}])\\
\pi_{i1} &= \sigma(\beta^T c_{i} + \alpha)\\
\pi_{i0} &= 1 - \sigma(\beta^T c_{i} + \alpha)\\
\beta &\sim \mathcal{N}(0, 100)\\
\alpha &\sim \mathcal{N}(0, 100)
\end{align}
where $x_i$ is the biometrics of the $i$-th birth, $c_i$ is the vector of covariates for the mother of the $i$-th birth, and $\sigma$ is the sigmoid function.
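
To make the role of $\beta$ and $\alpha$ concrete, the per-birth mixing proportions $[\pi_{i0}, \pi_{i1}]$ can be computed as below. The covariate values and coefficients are made-up numbers for illustration only, and which component corresponds to "preterm" depends on how the components end up being labeled.

```python
import numpy as np

def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-t))

def mixing_weights(c, beta, alpha):
    """Per-observation mixture proportions [pi_i0, pi_i1] from covariates c."""
    pi1 = sigmoid(c @ beta + alpha)
    return np.column_stack([1.0 - pi1, pi1])

# Hypothetical standardized covariates (age, income, education) for three mothers.
c = np.array([
    [0.0, 0.0, 0.0],
    [1.0, -0.5, 0.2],
    [-1.0, 0.5, -0.2],
])
beta = np.array([0.8, -0.3, 0.1])  # assumed coefficients, not estimates
alpha = 0.0

weights = mixing_weights(c, beta, alpha)
```

A positive entry of $\beta$ raises $\pi_{i1}$ as the corresponding covariate increases, and a negative entry lowers it, which is the basis for interpreting the posterior means.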

Use posterior mean estimates of $\beta$ and $\alpha$ to hypothesize on the effect of maternal age, income and education on the probability of an infant being born premature. Use the hierarchical model to support the correctness of your interpretations of significance of $\beta$ and $\alpha$.

**Hint:** You might wish to marginalize out the latent variable $z$ when setting up your model for sampling in pymc3.

Use model comparison criteria (like the WAIC or AIC) to compare your model that factors in maternal stats with your model that does not. Does the result of this comparison support or contradict our hypothesis that the stats of the birth mother have a significant effect on whether or not a birth is premature?
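
For orientation, WAIC can be computed from a matrix of pointwise log-likelihoods evaluated at posterior draws. The sketch below uses the common definition on the deviance scale, $-2(\text{lppd} - p_{\text{WAIC}})$ with $p_{\text{WAIC}}$ taken as the sum of posterior variances of the pointwise log-likelihood. The function name and the toy input are assumptions, and in practice the WAIC routine that ships with PyMC3 is the natural choice.

```python
import numpy as np

def waic(log_lik):
    """WAIC on the deviance scale (lower is better).

    log_lik: array of shape (n_draws, n_points) holding log p(x_i | theta_s)
    for each posterior draw s and observation i.
    """
    # Log pointwise predictive density, using a max shift for stability.
    shift = log_lik.max(axis=0)
    lppd_i = shift + np.log(np.exp(log_lik - shift).mean(axis=0))
    # Effective number of parameters: posterior variance of the log-likelihood.
    p_waic_i = log_lik.var(axis=0, ddof=1)
    return -2.0 * (lppd_i.sum() - p_waic_i.sum())

# Degenerate toy input: every draw gives log-likelihood -1 at each of 3 points.
toy_log_lik = np.full((4, 3), -1.0)
toy_waic = waic(toy_log_lik)
```

With $z$ marginalized out, the pointwise log-likelihood for each birth is the log of the mixture density at that observation.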

### Extra Credit: Appropriateness of the Choice of K

Use model comparison criteria to determine the optimal choice of $k$ for our mixture model. Does the optimal value of $k$ you find support our initial hypothesis (restated below)?

> Newborn biometric data "naturally" clusters into exactly two groups. One group will encompass the full term births and the other will encompass the preterm births.


### Extra Credit: Modeling Full Covariance Matrices

Notice that in our mixture model, we assumed that the covariance matrices for both Gaussian components are diagonal. This means that the features of our data cannot have any non-trivial correlation within a component. Ideally we'd like to model $\Sigma_k$ with a distribution that is supported over the entire feasible set of covariance matrices. Unfortunately, `pymc3` does not currently support sampling from distributions over covariance matrices. Rather, you must draw correlation matrices from an LKJ distribution and variances from another distribution, then put the two together to form a covariance matrix.

Modify your `pymc3` model in Part C to sample $\Sigma_k$ from a distribution supported over full covariance matrices, using LKJ distributions. A tutorial for how to do this can be found [here](http://austinrochford.com/posts/2015-09-16-mvn-pymc3-lkj.html) and in [these lecture notes](https://am207.github.io/2017/wiki/corr.html).
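
The step of putting a correlation matrix and a set of variances together is the identity $\Sigma = \operatorname{diag}(\sigma)\, R \,\operatorname{diag}(\sigma)$, where $\sigma$ holds the standard deviations. The NumPy sketch below illustrates only that step, with made-up values for the correlation matrix and variances; it does not use `pymc3` and the function name is hypothetical.

```python
import numpy as np

def covariance_from_corr(corr, variances):
    """Combine a correlation matrix and per-feature variances into a covariance matrix."""
    sd = np.sqrt(variances)
    # Elementwise product with the outer product equals diag(sd) @ corr @ diag(sd).
    return corr * np.outer(sd, sd)

# Made-up correlation matrix and variances for three features.
corr = np.array([
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.3],
    [0.2, 0.3, 1.0],
])
variances = np.array([4.0, 9.0, 16.0])

sigma = covariance_from_corr(corr, variances)
```

Inside a probabilistic model the same expression would be written with the library's tensor operations, so that the correlation draws and the variance draws both feed into the multivariate normal likelihood.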