# APMTH 207: Advanced Scientific Computing: 
## Stochastic Methods for Data Analysis, Inference and Optimization
## Homework #11
**Harvard University**<br>
**Spring 2017**<br>
**Instructors: Rahul Dave**<br>
**Due Date:** Wednesday, April 26th 2017 at 11:59pm

**Instructions:**

- Upload your final answers, as well as your IPython notebook containing all work, to Canvas.

- Structure your notebook and your work to maximize readability.

## Problem 1: Homework #8 Revisited

Recall the context of Homework #8: 

A plant nursery in Cambridge is experimentally cross-breeding two types of hibiscus flowers: blue and pink. The goal is to create an exotic flower whose petals are pink with a ring of blue on each. 

There are four types of child plant that can result from this cross-breeding: 

  - Type 1: blue petals
  - Type 2: pink petals 
  - Type 3: purple petals
  - Type 4: pink petals with a blue ring on each (the desired effect). 

Out of 197 initial cross-breedings, the nursery obtained the following distribution over the four types of child plants: 
$$Y = (y_1, y_2, y_3, y_4) = (125, 18, 20, 34)$$
where $y_i$ represents the number of child plants that are of type $i$.

They know that the probability of obtaining each type of child plant in any single breeding experiment is as follows:
$$ \frac{\theta+2}{4}, \frac{1-\theta}{4}, \frac{1-\theta}{4}, \frac{\theta}{4},$$
where $\theta$ is unknown.
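
As a quick sanity check on the parameterization, the sketch below evaluates the four probabilities for a few values of $\theta$ and confirms that they sum to one. The function name `cell_probs` and the trial values of $\theta$ are illustrative choices, not part of the assignment.

```python
# Sketch: cell probabilities of the four child-plant types as a function of theta.
# `cell_probs` is a hypothetical helper name introduced for illustration only.

def cell_probs(theta):
    """Return the probabilities of Types 1-4 for a given theta in [0, 1]."""
    return ((theta + 2) / 4, (1 - theta) / 4, (1 - theta) / 4, theta / 4)

y = (125, 18, 20, 34)
n = sum(y)  # total number of cross-breedings

for theta in (0.25, 0.5, 0.75):
    p = cell_probs(theta)
    # The four probabilities sum to 1 for every theta.
    print(theta, p, sum(p))

# Observed proportions, for comparison with the model probabilities.
observed = tuple(count / n for count in y)
print(observed)
```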

Sensibly, the nursery chose to model the observed data using a multinomial model; *they also imposed a prior on $\theta$, $\rm{Beta}(a, b)$*.

Recall that to simplify sampling from their Bayesian model, the nursery augmented the data with a new variable $z$ such that:
$$z + (y_1 - z) = y_1.$$
That is, using $z$, they are breaking $y_1$, the number of Type 1 child plants, into two subtypes with counts $z$ and $y_1 - z$. Let the probabilities of obtaining the two subtypes be $1/2$ and $\theta/4$, respectively.
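
Written out, and assuming the subtype with count $z$ is the one with probability $1/2$, the augmented data follow a five-category multinomial:
$$(z,\; y_1 - z,\; y_2,\; y_3,\; y_4) \sim \text{Multinomial}\left(197;\; \frac{1}{2},\; \frac{\theta}{4},\; \frac{1-\theta}{4},\; \frac{1-\theta}{4},\; \frac{\theta}{4}\right).$$
The first two probabilities add up to $\frac{\theta+2}{4}$, so summing over $z$ recovers the original four-category model.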

In Homework 8, you implemented a Gibbs sampler for this Bayesian model to compute the posterior mean estimate of $\theta$. 

In this homework we will investigate ways to compute the Maximum Likelihood Estimate (MLE) of $\theta$.

***Note:*** Expectation Maximization can also be applied to compute the posterior mode (MAP) estimates. We are choosing not to do that in this homework so that you are not just repeating the task from Homework #8.

Parts A and B involve algebraic manipulations and no programming.

### Part A:
Treat the augmented model as a latent variable model. Write down an expression (up to unimportant constants - you must decide what unimportant means) for each of the following:

1. the observed data log likelihood
2. the complete data log likelihood

**Hint:** You should already have the above from Homework #8.

3. the auxiliary function, $Q(\theta, \theta^{(t-1)})$, or the expected complete data log likelihood, defined by
$$Q(\theta, \theta^{(t-1)}) = \mathbb{E}_{Z | Y, \theta^{(t-1)}}[\text{the complete data log likelihood}]$$
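
One possible form of these expressions is sketched below. It assumes that "unimportant" means any term that does not depend on $\theta$ (multinomial coefficients, factors of $1/4$, and the $z \log \frac{1}{2}$ term), and that $z$ is the count of the probability-$1/2$ subtype.
$$\ell(\theta; Y) = y_1 \log(2 + \theta) + (y_2 + y_3)\log(1 - \theta) + y_4 \log\theta + \text{const}$$
$$\ell_c(\theta; Y, z) = (y_1 - z + y_4)\log\theta + (y_2 + y_3)\log(1 - \theta) + \text{const}$$
$$Q(\theta, \theta^{(t-1)}) = \left(\mathbb{E}\left[y_1 - z \mid Y, \theta^{(t-1)}\right] + y_4\right)\log\theta + (y_2 + y_3)\log(1 - \theta) + \text{const}$$
The complete data log likelihood is linear in $z$, which is why taking the expectation only requires $\mathbb{E}[z \mid Y, \theta^{(t-1)}]$.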

### Part B:
We will maximize the likelihood through Expectation Maximization (EM). To perform EM, we must iterate the following two steps until convergence:

- (Expectation) Compute the auxiliary function, $Q(\theta, \theta^{(t-1)})$
- (Maximization) Compute $\theta^{(t)} = \text{argmax}_\theta Q(\theta, \theta^{(t-1)})$

Thus, you must compute exact formulae for the following:
1. the auxiliary function, $Q(\theta, \theta^{(t-1)})$, for a given $\theta^{(t-1)}$. That is, compute the expectation of the complete data log likelihood.
2. $\theta^{(t)}$, by maximizing the auxiliary function $Q(\theta, \theta^{(t-1)})$.
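
A sketch of how the two formulae can come out, under the same assumption that $z$ counts the probability-$1/2$ subtype: conditional on $y_1$ and $\theta^{(t-1)}$, the count $y_1 - z$ is binomial with success probability $\frac{\theta^{(t-1)}/4}{1/2 + \theta^{(t-1)}/4}$, so
$$\mathbb{E}\left[y_1 - z \mid Y, \theta^{(t-1)}\right] = y_1 \, \frac{\theta^{(t-1)}}{2 + \theta^{(t-1)}}.$$
Writing $e^{(t-1)}$ for this expectation (a shorthand introduced here), setting $\partial Q / \partial \theta = 0$ gives
$$\theta^{(t)} = \frac{e^{(t-1)} + y_4}{e^{(t-1)} + y_2 + y_3 + y_4}.$$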

### Part C:
Estimate the MLE of $\theta$ using EM. Explain the advantage of treating this problem as a latent variable model and using EM to compute the MLE (i.e., why not compute the MLE directly by maximizing the likelihood?).

Compare this value with the posterior mean estimate of $\theta$ from Homework #8. In general, what is the difference between MLE and MAP or posterior mean estimates of model parameters? That is, name a couple of major pros and cons of each type of estimate.

In [11]:
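# EM iteration for the MLE of theta (Python 3).
conv_thresh = 1e-9   # stop once successive estimates differ by less than this
iters = 1000         # hard cap on the number of EM iterations
diff = 1.
y = (125, 18, 20, 34)
thetas = [1.]        # initial guess theta^(0)
i = 0

while i < iters and diff > conv_thresh:
    # E-step: expected count of the theta/4 subtype of Type 1 given the current theta
    e_count = y[0] * thetas[-1] / (2 + thetas[-1])
    # M-step: closed-form maximizer of Q
    theta = (e_count + y[3]) / (e_count + y[3] + y[2] + y[1])
    print(i, format(theta, '.12g'))  # 12 significant digits
    i += 1
    diff = abs(theta - thetas[-1])
    thetas.append(theta)

print(format(thetas[-1], '.12g'))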

0 0.66568914956
1 0.631838674952
2 0.627485219761
3 0.626909582965
4 0.626833192939
5 0.626823050714
6 0.626821704055
7 0.626821525248
8 0.626821501506
9 0.626821498354
10 0.626821497935
0.626821497935
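
As a cross-check on the EM result, note that for this particular model the score equation of the observed data log likelihood, $\frac{y_1}{2+\theta} - \frac{y_2+y_3}{1-\theta} + \frac{y_4}{\theta} = 0$, reduces to a quadratic in $\theta$, so the MLE can also be computed directly. The sketch below compares the two; the helper names `em_update` and `direct_mle`, the starting value, and the fixed iteration count are illustrative assumptions rather than part of the assignment.

```python
import math

y = (125, 18, 20, 34)

def em_update(theta, y):
    """One EM step: expected theta/4-subtype count, then the closed-form maximizer."""
    e_count = y[0] * theta / (2 + theta)
    return (e_count + y[3]) / (e_count + y[1] + y[2] + y[3])

def direct_mle(y):
    """Positive root of n*theta^2 - b*theta - 2*y4 = 0 (the score equation)."""
    n = sum(y)
    b = y[0] - 2 * (y[1] + y[2]) - y[3]
    return (b + math.sqrt(b * b + 8 * n * y[3])) / (2 * n)

theta_em = 0.5  # arbitrary starting value in (0, 1)
for _ in range(50):
    theta_em = em_update(theta_em, y)

theta_direct = direct_mle(y)
# Print both estimates and their absolute difference.
print(theta_em, theta_direct, abs(theta_em - theta_direct))
```

With the counts above the quadratic is $197\theta^2 - 15\theta - 68 = 0$, and its positive root agrees with the EM fixed point of about $0.6268$. A closed form like this is specific to this small example; EM remains useful when the observed data likelihood has no convenient direct maximizer.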
