# CS229: Problem Set 3
## Problem 2: Expectation-Maximization for Maximum a Posteriori


**C. Combier**

This IPython notebook provides solutions to problem set 3 of Stanford's graduate course CS229 (Machine Learning, Fall 2017), taught by Andrew Ng.

The problem set can be found here: [./ps3.pdf](ps3.pdf)

I chose to write the solutions to the coding questions in Python, whereas the Stanford class is taught with Matlab/Octave.

## Notation

- $x^i$ is the $i^{th}$ feature vector
- $y^i$ is the expected outcome for the $i^{th}$ training example
- $m$ is the number of training examples
- $n$ is the number of features

This problem is very similar to the derivation of the EM algorithm for MLE given in the lecture notes. The difference is that we are now in a Bayesian setting and impose a prior $p(\theta)$ on $\theta$, so the quantity to maximize is:

$$
MAP = \left( \prod_i^m \sum_{z^i} p(x^i, z^i | \theta) \right) p(\theta)
$$

Here, $z^i$ denotes the latent (hidden) random variables.

### Step 1: E-step

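1. We start by taking the log-MAP and, inside each sum over $z^i$, multiplying and dividing by an arbitrary distribution $Q_i$ over $z^i$: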

$$
\log MAP = \sum_i^m \log \sum_{z^i} Q_i(z^i) \frac{p(x^i, z^i | \theta)}{Q_i(z^i)} + \log p(\theta)
$$

2. We apply Jensen's inequality to the above formula, using the concavity of $\log$:

$$
\log MAP \geq \sum_i^m  \sum_{z^i} Q_i(z^i) \log \frac{p(x^i, z^i | \theta)}{Q_i(z^i)} + \log p(\theta)
$$

3. Next, we need to choose a distribution $Q_i$ for $z^i$. The above inequality becomes an equality if $\frac{p(x^i, z^i | \theta)}{Q_i(z^i)}$ is a constant $\lambda$ that does not depend on $z^i$, which makes the bound tight at the current value of $\theta$:

$$
\begin{align*}
\frac{p(x^i, z^i | \theta)}{Q_i(z^i)} = \lambda & \iff Q_i(z^i) = \frac{1}{\lambda} p(x^i, z^i | \theta) \\
& \iff Q_i(z^i) = \frac{p(x^i, z^i | \theta) }{\sum_{z^i}p(x^i, z^i | \theta)} \\
& \iff Q_i(z^i) = \frac{p(x^i, z^i | \theta) }{p(x^i | \theta)} \\
& \iff Q_i(z^i) = p(z^i | x^i, \theta)
\end{align*}
$$

This is obtained by using the fact that, since $Q_i$ is a distribution, $\sum_{z^i} Q_i(z^i) = 1$, which implies $\lambda = \sum_{z^i} p(x^i,z^i | \theta)$.

**This completes the E-step of the EM algorithm.**
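The E-step can be checked numerically on a toy model. The sketch below is an illustration only and is not part of the problem set: it assumes a mixture of two unit-variance 1-D Gaussians with known means, where $\theta$ is reduced to a single mixing weight `phi` $= p(z^i = 1)$. The helper names (`gaussian_pdf`, `joint`, `e_step`, `lower_bound`) and the data are made up for this example. It shows that the bound is tight when $Q_i$ is the posterior, and strictly looser for another choice of $Q_i$.

```python
import numpy as np

def gaussian_pdf(x, mean, std=1.0):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def joint(x, phi, means=(-2.0, 2.0)):
    """p(x^i, z^i = j | theta) for j in {0, 1}, as an array of shape (m, 2)."""
    mixing = np.array([1.0 - phi, phi])
    densities = np.stack([gaussian_pdf(x, mu) for mu in means], axis=1)
    return densities * mixing

def e_step(x, phi):
    """Q_i(z^i) = p(z^i | x^i, theta): normalize the joint over z^i."""
    p_xz = joint(x, phi)
    return p_xz / p_xz.sum(axis=1, keepdims=True)

def lower_bound(x, phi, Q):
    """Jensen lower bound on the log-likelihood.

    The log-prior term is omitted: for a fixed theta it is the same
    constant on both sides of the inequality.
    """
    return np.sum(Q * np.log(joint(x, phi) / Q))

x = np.array([-2.3, -1.7, 0.4, 1.9, 2.5])   # toy data (assumed)
phi = 0.3                                   # current parameter value (assumed)
log_lik = np.sum(np.log(joint(x, phi).sum(axis=1)))

Q_posterior = e_step(x, phi)                # the choice derived above
Q_uniform = np.full((len(x), 2), 0.5)       # an arbitrary alternative

print("log-likelihood      :", log_lik)
print("bound, Q = posterior:", lower_bound(x, phi, Q_posterior))
print("bound, Q = uniform  :", lower_bound(x, phi, Q_uniform))
```

The first two printed values should agree up to floating-point error, while the third should be smaller.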

### Step 2: M-step

For the M-step, we hold $Q_i$ fixed and maximize the lower bound obtained in step 2 of the E-step with respect to $\theta$:

$$
\theta := \text{arg}\max_{\theta} \left[ \sum_i^m  \sum_{z^i} Q_i(z^i) \log \frac{p(x^i, z^i | \theta)}{Q_i(z^i)} + \log p(\theta) \right]
$$

As usual, we do this by taking the gradient with respect to $\theta$ and setting it to $0$.
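As a concrete illustration (an assumed toy case, not taken from the problem set), suppose $z^i \in \{0, 1\}$, $\theta$ is the mixing weight $\phi = p(z^i = 1)$, and the prior is $\text{Beta}(a, b)$. Setting the derivative of the objective to $0$ then gives $\phi = \frac{\sum_i Q_i(1) + a - 1}{m + a + b - 2}$. The sketch below compares this closed form with a brute-force grid search. The names `m_step` and `m_step_objective`, the matrix `Q`, and the values of `a` and `b` are hypothetical.

```python
import numpy as np

def m_step_objective(phi, Q, a, b):
    """Terms of the M-step objective that depend on phi (constants dropped)."""
    n1, n0 = Q[:, 1].sum(), Q[:, 0].sum()
    return (n1 + a - 1.0) * np.log(phi) + (n0 + b - 1.0) * np.log(1.0 - phi)

def m_step(Q, a, b):
    """Closed-form maximizer obtained by setting the derivative in phi to zero."""
    m = Q.shape[0]
    return (Q[:, 1].sum() + a - 1.0) / (m + a + b - 2.0)

# Q[i, j] plays the role of Q_i(z^i = j); each row sums to 1 (assumed values).
Q = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.4, 0.6],
              [0.1, 0.9]])
a, b = 2.0, 2.0                      # Beta prior hyperparameters (assumed)

phi_map = m_step(Q, a, b)

# Brute-force check: the grid maximizer should sit next to the closed form.
grid = np.linspace(0.01, 0.99, 99)
phi_grid = grid[np.argmax(m_step_objective(grid, Q, a, b))]

print("closed form:", phi_map)
print("grid search:", phi_grid)
```

With $a = b = 1$ (a uniform prior) the same formula reduces to the usual maximum-likelihood update $\phi = \frac{1}{m} \sum_i Q_i(1)$.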

### Proof of Convergence

We consider two successive iterations $k$ and $k+1$ of EM, and we will prove that $\ell(\theta^{k+1}) \geq \ell(\theta^k)$, where $\ell(\theta) = \log MAP$ is the objective defined above, i.e. that $\ell$ never decreases from one iteration to the next.

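We refer the reader to the lecture notes, as the proof is the same. The only change is the additional term $\log p(\theta)$, which does not depend on $Q_i$ and is carried through each step unchanged. Writing $J(Q, \theta)$ for the lower bound of step 2 (including $\log p(\theta)$) and $Q^k$ for the distributions chosen in the E-step at $\theta^k$, the argument is:

$$
\ell(\theta^{k+1}) \geq J(Q^k, \theta^{k+1}) \geq J(Q^k, \theta^k) = \ell(\theta^k)
$$

The first inequality is Jensen's inequality, the second holds because the M-step maximizes $J(Q^k, \theta)$ over $\theta$, and the equality holds because the E-step makes the bound tight at $\theta^k$.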
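The monotonicity claim can also be checked numerically. The following sketch runs MAP-EM on an assumed toy model (two unit-variance Gaussians with known means $-2$ and $2$, unknown mixing weight `phi`, and a $\text{Beta}(a, b)$ prior) and records the log-posterior after every iteration. The model, the synthetic data, and the names `joint`, `log_posterior` and `em_map` are all illustrative choices, not part of the problem set.

```python
import numpy as np

def joint(x, phi, means=(-2.0, 2.0)):
    """p(x^i, z^i = j | phi) for unit-variance Gaussian components, shape (m, 2)."""
    densities = np.stack(
        [np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi) for mu in means],
        axis=1,
    )
    return densities * np.array([1.0 - phi, phi])

def log_posterior(x, phi, a, b):
    """ell(phi) = sum_i log p(x^i | phi) + log p(phi), up to the Beta normalizer."""
    log_lik = np.sum(np.log(joint(x, phi).sum(axis=1)))
    log_prior = (a - 1.0) * np.log(phi) + (b - 1.0) * np.log(1.0 - phi)
    return log_lik + log_prior

def em_map(x, phi0, a, b, n_iter=20):
    """Run MAP-EM and return the final phi and the objective after each step."""
    phi = phi0
    history = [log_posterior(x, phi, a, b)]
    for _ in range(n_iter):
        # E-step: Q_i(z^i) = p(z^i | x^i, phi)
        p_xz = joint(x, phi)
        Q = p_xz / p_xz.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizer for a Beta(a, b) prior on phi
        phi = (Q[:, 1].sum() + a - 1.0) / (len(x) + a + b - 2.0)
        history.append(log_posterior(x, phi, a, b))
    return phi, np.array(history)

# Synthetic data: 70 points from the first component, 30 from the second.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 70), rng.normal(2.0, 1.0, 30)])

phi_hat, history = em_map(x, phi0=0.9, a=2.0, b=2.0)
print("estimated phi:", phi_hat)
print("objective never decreases:", bool(np.all(np.diff(history) >= -1e-9)))
```

The small tolerance in the final check only absorbs floating-point rounding; the inequality itself holds exactly in the derivation.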