# APMTH 207: Advanced Scientific Computing: 
## Stochastic Methods for Data Analysis, Inference and Optimization
## Homework #11
**Harvard University**<br>
**Spring 2017**<br>
**Instructors: Rahul Dave**<br>
**Due Date:** Wednesday, April 26th 2017 at 11:59pm

**Instructions:**

- Upload your final answers as well as your iPython notebook containing all work to Canvas.

- Structure your notebook and your work to maximize readability.

#### Problem 1: Homework #8 Revisited

Recall the context of Homework #8: 

A plant nursery in Cambridge is experimentally cross-breeding two types of hibiscus flowers: blue and pink. The goal is to create an exotic flower whose petals are pink with a ring of blue on each. 

There are four types of child plant that can result from this cross-breeding: 

  - Type 1: blue petals
  - Type 2: pink petals 
  - Type 3: purple petals
  - Type 4: pink petals with a blue ring on each (the desired effect). 

Out of 197 initial cross-breedings, the nursery obtained the following distribution over the four types of child plants: 
$$Y = (y_1, y_2, y_3, y_4) = (125, 18, 20, 34)$$
where $y_i$ represents the number of child plants that are of type $i$.

They know that the probability of obtaining each type of child plant in any single breeding experiment is as follows:
$$ \frac{\theta+2}{4}, \frac{1-\theta}{4}, \frac{1-\theta}{4}, \frac{\theta}{4},$$
where $\theta$ is unknown.

Sensibly, the nursery chose to model the observed data using a multinomial model; *they also imposed a prior on $\theta$, $\rm{Beta}(a, b)$*.

Recall that to simplify sampling from their Bayesian model, the nursery augmented the data with a new variable $z$ such that:
$$z + (y_1 - z) = y_1.$$
That is, using $z$, they split $y_1$, the number of type 1 child plants, into two subtypes: $y_1 - z$ plants of a subtype that occurs with probability $1/2$, and $z$ plants of a subtype that occurs with probability $\theta/4$.
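A minimal sketch of what the augmentation implies: given $y_1$, each type 1 plant falls in the $\theta/4$ subtype with probability $(\theta/4)/(1/2 + \theta/4) = \theta/(2+\theta)$, so $z$ is binomial with that success probability. The function name `expected_z` is illustrative and not part of the assignment.

```python
def expected_z(y1, theta):
    """Expected count of the theta/4 subtype among the y1 type 1 plants."""
    # Conditional probability that a type 1 plant belongs to the theta/4 subtype.
    p = (theta / 4) / (0.5 + theta / 4)  # simplifies to theta / (2 + theta)
    return y1 * p


# With y1 = 125 and theta = 1, about a third of the type 1 plants are expected
# to fall in the theta/4 subtype.
print(expected_z(125, 1.0))
```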

In Homework 8, you implemented a Gibbs sampler for this Bayesian model to compute the posterior mean estimate of $\theta$. 

In this homework we will investigate ways to compute the Maximum Likelihood Estimate (MLE) of $\theta$.

***Note:*** Expectation Maximization can also be applied to compute the posterior mode (MAP) estimates. We are choosing not to do that in this homework so that you are not just repeating the task from Homework #8.


### Part A:

Treat the augmented model as a latent variable model. Write down an expression (up to unimportant constants - you must decide what unimportant means) for each of the following:

(1) the observed data log likelihood

(2) the complete (full) data log likelihood

**Hint:** You should already have the observed data likelihood and the complete data likelihood from Homework #8; you just need to take their logs for this problem.

(3) the auxiliary function, $Q(\theta, \theta^{(t-1)})$, or the expected complete (full) data log likelihood, defined by
$$Q(\theta, \theta^{(t-1)}) = \mathbb{E}_{Z  \vert  Y=y, \Theta = \theta^{(t-1)}}[\text{the complete data log likelihood}]$$

In other words, the distribution of $Z  \vert  Y=y, \Theta = \theta^{(t-1)}$ is the $q(z, \theta_{old})$ obtained at the end of the E-step in the EM lecture. The auxiliary function $Q$ is the ELBO minus the entropy of $q$ (which, being evaluated at $\theta_{old}$, does not depend on $\theta$ and is thus irrelevant for maximization).

### Part B:

We will maximize the likelihood through Expectation Maximization (EM). In order to perform EM, we must iterate through the following steps:

- (Expectation) Compute the auxiliary function, $Q(\theta, \theta^{(t-1)})$ (the expectation of the complete data log likelihood)
- (Maximization) Compute $\theta^{t} = \text{argmax}_\theta Q(\theta, \theta^{(t-1)})$

Thus, you must compute exact formulae for the following:
1. the auxiliary function, $Q(\theta, \theta^{(t-1)})$, for a given $\theta^{(t-1)}$. That is, compute the expectation of the complete data log likelihood.
2. $\theta^{t}$, by maximizing the auxiliary function $Q(\theta, \theta^{(t-1)})$.

**Hint:** You don't actually need to do any difficult optimization for the M-step. After taking the expectation of the complete data log likelihood in the E-step, match your $Q(\theta, \theta^{(t-1)})$ to the log pdf of a familiar distribution, then use the known formula for the mode of this distribution to optimize $Q(\theta, \theta^{(t-1)})$.

Use these to **estimate the MLE** of $\theta$ using EM (choose your own reasonable criterion for convergence).

### Extra Credit:

Explain the advantage of treating this problem as a latent variable model and using EM to compute the MLE (i.e., why not compute the MLE directly by maximizing the likelihood?).

Compare this value with the posterior mean estimate of $\theta$ from Homework #8. In general, what is the difference between MLE and MAP or posterior mean estimates of model parameters? That is, name a couple of major pros and cons of each type of estimate.


**Solution:**
Model:
\begin{align}
z | y, \theta &\sim Bin\left(y_1, \frac{\theta}{2 + \theta}\right)\\
y | \theta & \sim Mult\left(197, \left[\frac{1}{2} + \frac{\theta}{4}, \frac{1 - \theta}{4}, \frac{1 - \theta}{4}, \frac{\theta}{4} \right]\right)
\end{align}

Observed:
$$
p(y | \theta) = Mult\left(y; 197, \left[\frac{1}{2} + \frac{\theta}{4}, \frac{1 - \theta}{4}, \frac{1 - \theta}{4}, \frac{\theta}{4} \right]\right)
$$
Complete:
$$
p(y, z | \theta) = Bin\left(z; y_1, \frac{\theta}{2 + \theta}\right)Mult\left(y; 197, \left[\frac{1}{2} + \frac{\theta}{4}, \frac{1 - \theta}{4}, \frac{1 - \theta}{4}, \frac{\theta}{4} \right]\right)
$$
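As a sanity check on the two likelihoods above, summing the complete data likelihood over all values of $z$ should recover the observed data likelihood. The sketch below does this numerically at one arbitrary value of $\theta$, dropping the multinomial coefficient since it is common to both. The function names and the test value $\theta = 0.6$ are illustrative assumptions.

```python
from math import comb, isclose

y = (125, 18, 20, 34)


def observed_kernel(theta, y):
    """Observed data likelihood without the multinomial coefficient."""
    y1, y2, y3, y4 = y
    return (0.5 + theta / 4) ** y1 * ((1 - theta) / 4) ** (y2 + y3) * (theta / 4) ** y4


def complete_kernel(theta, y, z):
    """Complete data likelihood without the multinomial coefficient."""
    y1, y2, y3, y4 = y
    split = comb(y1, z) * 0.5 ** (y1 - z) * (theta / 4) ** z
    return split * ((1 - theta) / 4) ** (y2 + y3) * (theta / 4) ** y4


theta = 0.6  # arbitrary test value in (0, 1)
marginal = sum(complete_kernel(theta, y, z) for z in range(y[0] + 1))
print(isclose(marginal, observed_kernel(theta, y), rel_tol=1e-9))
```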

Auxiliary function:
$$
Q(\theta, \theta^{t-1}) = \mathbb{E}_{z | y, \theta^{t-1}}\left[\log p(y, z | \theta) \right] 
$$

For the E-step we drop additive terms that involve neither $z$ nor $\theta$, since these are constant in both the E and the M steps. The expectation is taken over $z \mid y, \theta^{t-1}$, so $\mathbb{E}[z] = \frac{y_1\theta^{t-1}}{2 + \theta^{t-1}}$:
\begin{align}
Q(\theta, \theta^{t-1}) =& \mathbb{E}_{z | y, \theta^{t-1}}\left[\log p(y, z | \theta) \right] \\
=& \mathbb{E}_{z | y, \theta^{t-1}}\left[ \log \left\{\binom{y_1}{z} \left( \frac{1}{2}\right)^{y_1 - z} \left( \frac{\theta}{4}\right)^{z} (1 - \theta)^{y_2 + y_3} \theta^{y_4}\right\}\right] + \text{const}\\
=& \mathbb{E}_{z | y, \theta^{t-1}}\left[  \log\binom{y_1}{z} +  (y_1 - z)\log\frac{1}{2}  + z\log \frac{\theta}{4} +  (y_2 + y_3) \log (1 - \theta) + y_4 \log \theta \right] + \text{const}\\
=&  \mathbb{E}_{z | y, \theta^{t-1}}\left[\log\binom{y_1}{z}\right] +   \mathbb{E}_{z | y, \theta^{t-1}} \left[(y_1 - z)\log\frac{1}{2}\right]  + \mathbb{E}_{z | y, \theta^{t-1}} \left[z\log \theta\right] \\
& - \mathbb{E}_{z | y, \theta^{t-1}} \left[z\log 4\right] +  \mathbb{E}_{z | y, \theta^{t-1}} \left[(y_2 + y_3) \log (1 - \theta)\right] + \mathbb{E}_{z | y, \theta^{t-1}}\left[y_4 \log \theta\right] + \text{const}\\
=&  \mathbb{E}_{z | y, \theta^{t-1}}\left[\log\binom{y_1}{z}\right] +  \log\frac{1}{2}  \left[y_1 - \mathbb{E}_{z | y, \theta^{t-1}} z\right]  + \log \theta\,\mathbb{E}_{z | y, \theta^{t-1}} z\\
& - \log 4\,\mathbb{E}_{z | y, \theta^{t-1}} z +  (y_2 + y_3) \log (1 - \theta) + y_4 \log \theta + \text{const}\\
=&  F_1(\theta^{t-1}) +  \log\frac{1}{2} \left(y_1 - \frac{y_1\theta^{t-1}}{2 + \theta^{t-1}}\right)  + \log \theta\frac{y_1\theta^{t-1}}{2 + \theta^{t-1}} - \log 4\frac{y_1\theta^{t-1}}{2 + \theta^{t-1}} +  (y_2 + y_3) \log (1 - \theta) + y_4 \log \theta + \text{const}\\
=&  F_1(\theta^{t-1}) +  F_2(\theta^{t-1}) - \log 4\frac{y_1\theta^{t-1}}{2 + \theta^{t-1}}   +  (y_2 + y_3) \log (1 - \theta) + \left(y_4  + \frac{y_1\theta^{t-1}}{2 + \theta^{t-1}}\right) \log\theta + \text{const}\\
\end{align}
Here $F_1(\theta^{t-1}) = \mathbb{E}_{z | y, \theta^{t-1}}\left[\log\binom{y_1}{z}\right]$ and $F_2(\theta^{t-1}) = \log\frac{1}{2} \left(y_1 - \frac{y_1\theta^{t-1}}{2 + \theta^{t-1}}\right)$; neither depends on $\theta$.
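The $\theta$-dependent part of $Q$ is $(y_2 + y_3)\log(1-\theta) + \left(y_4 + \mathbb{E}[z]\right)\log\theta$. A small sketch, assuming a starting value $\theta^{t-1} = 1$ and a simple grid search in place of any analytic maximization, shows where this function peaks. The names `q_function` and `theta_grid` are illustrative.

```python
from math import log

y = (125, 18, 20, 34)


def q_function(theta, theta_old, y):
    """Q(theta, theta_old) up to additive terms that do not depend on theta."""
    y1, y2, y3, y4 = y
    ez = y1 * theta_old / (2 + theta_old)  # E[z | y, theta_old]
    return (y2 + y3) * log(1 - theta) + (y4 + ez) * log(theta)


theta_old = 1.0  # assumed starting value
# Evaluate Q on a grid over the open interval (0, 1) and keep the best point.
grid = [k / 10000 for k in range(1, 10000)]
theta_grid = max(grid, key=lambda t: q_function(t, theta_old, y))
print(theta_grid)
```

The grid maximizer is only accurate to the grid spacing, but it gives an independent check on whatever closed-form M-step update is derived from $Q$.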

For the M-step we maximize $Q$ with respect to $\theta$, so every term that does not involve $\theta$ can be dropped:
\begin{align}
\theta^{t} = \text{argmax}_\theta\, Q(\theta, \theta^{t-1}) = \text{argmax}_\theta\left[(y_2 + y_3) \log (1 - \theta) + \left(y_4  + \frac{y_1\theta^{t-1}}{2 + \theta^{t-1}}\right) \log\theta\right]
\end{align}
Up to an additive constant, the objective is the log likelihood of a binomial pmf, i.e. 
\begin{align}
(y_2 + y_3) \log (1 - \theta) + \left(y_4  + \frac{y_1\theta^{t-1}}{2 + \theta^{t-1}}\right) \log\theta =  \log \text{Binom}\left(B; A + B, \theta  \right)+ C
\end{align}
where $A = y_2 + y_3$, $B = y_4  + \frac{y_1\theta^{t-1}}{2 + \theta^{t-1}}$, and $C$ does not depend on $\theta$. ($B$ is generally not an integer, but the maximizer of $B\log\theta + A\log(1-\theta)$ is the same either way; equivalently, it is the mode of a $\text{Beta}(B+1, A+1)$ density.) So the $\text{argmax}_\theta$ of the above is just the MLE of a binomial, which is 
$$
\theta^{t} = \frac{B}{A + B} = \frac{ y_4  + \frac{y_1\theta^{t-1}}{2 + \theta^{t-1}}}{ y_2 + y_3 + y_4  + \frac{y_1\theta^{t-1}}{2 + \theta^{t-1}}}
$$
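As a quick check on the update formula, here is the first iteration worked by hand, assuming the starting value $\theta^{0} = 1$ (the initial value used in the code for this solution) and the observed counts $y = (125, 18, 20, 34)$:
\begin{align}
\mathbb{E}\left[z \mid y, \theta^{0}\right] &= \frac{y_1\theta^{0}}{2 + \theta^{0}} = \frac{125}{3} \approx 41.667\\
\theta^{1} &= \frac{34 + 41.667}{18 + 20 + 34 + 41.667} = \frac{75.667}{113.667} \approx 0.6657
\end{align}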

In [11]:
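conv_thresh = 1e-9   # stop once successive estimates differ by less than this
max_iters = 1000     # safety cap on the number of EM iterations
diff = 1.
y = (125, 18, 20, 34)
thetas = [1.]        # initial guess theta^0 = 1
i = 0

while i < max_iters and diff > conv_thresh:
    # E-step: expected count of the theta/4 subtype, E[z | y, theta^{t-1}]
    ez = (y[0] * thetas[-1]) / (2 + thetas[-1])
    # M-step: closed-form maximizer of Q
    theta = (ez + y[3]) / (ez + y[3] + y[2] + y[1])
    print(i, '{:.12g}'.format(theta))
    i += 1
    diff = abs(theta - thetas[-1])
    thetas.append(theta)

print('{:.12g}'.format(thetas[-1]))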

0 0.66568914956
1 0.631838674952
2 0.627485219761
3 0.626909582965
4 0.626833192939
5 0.626823050714
6 0.626821704055
7 0.626821525248
8 0.626821501506
9 0.626821498354
10 0.626821497935
0.626821497935
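A possible cross-check on the EM result: for this particular model the observed data log likelihood, $y_1\log(2+\theta) + (y_2+y_3)\log(1-\theta) + y_4\log\theta$ up to constants, can also be maximized directly. Setting its derivative to zero and clearing denominators gives a quadratic in $\theta$. The sketch below solves it; the variable names are illustrative.

```python
from math import sqrt

y1, y2, y3, y4 = 125, 18, 20, 34
n = y1 + y2 + y3 + y4

# Score equation: y1/(2 + theta) - (y2 + y3)/(1 - theta) + y4/theta = 0.
# Multiplying through by theta * (1 - theta) * (2 + theta) gives
#   n * theta**2 - b * theta - 2 * y4 = 0,  with b = y1 - 2*(y2 + y3) - y4.
b = y1 - 2 * (y2 + y3) - y4

# Take the positive root, which is the one lying in (0, 1) for these counts.
theta_direct = (b + sqrt(b * b + 8 * n * y4)) / (2 * n)
print(theta_direct)
```

The root agrees with the EM estimate of about 0.6268, as expected, since EM converges to a stationary point of the observed data likelihood.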
