# Gaussian Mixture Model and Expectation Maximization Algorithm

> Weitong Zhang
> 2015011493
>
> <zwt15@mails.tsinghua.edu.cn>

## EM and Gradient Descent

Setting $\sigma_k^2 = \beta_k$, so that $\Sigma_k = \beta_k I$, the class-conditional density is

$$P(x_i|\mu_k,\beta_kI,x_i \in \omega_k) = \frac1{(2\pi\beta_k)^{\frac d2}}\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})$$
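As a quick illustration of the density above, here is a minimal NumPy sketch. The function name `spherical_gaussian_pdf` and the sample inputs are hypothetical and chosen only for this example.

```python
import numpy as np

def spherical_gaussian_pdf(x, mu, beta):
    """Density of N(mu, beta * I) at x; x and mu are 1-D arrays of length d."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    d = x.shape[0]
    # Squared Euclidean distance ||x - mu||_2^2.
    sq = np.sum((x - mu) ** 2)
    # (2 * pi * beta)^(-d/2) * exp(-0.5 * ||x - mu||^2 / beta)
    return (2 * np.pi * beta) ** (-d / 2) * np.exp(-0.5 * sq / beta)

# At the mean with d = 2 and beta = 1 the density equals 1 / (2 * pi).
p = spherical_gaussian_pdf([0.0, 0.0], [0.0, 0.0], 1.0)
```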

### Calc $\mu$

#### EM method on $\mu$

$$\begin{aligned}z_{ik} &= Prob(x_i \in \omega_k | x_i,\mu_k,\beta_k) = \frac{P(x_i|\mu_k,\beta_k,x_i \in \omega_k)P(\omega_k)}{\sum_jP(x_i|\mu_j,\beta_j,x_i \in \omega_j)P(\omega_j)} \\&= \frac{\beta_k^{-\frac d2}\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})\pi_k}{\sum_j\beta_j^{-\frac d2}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})\pi_j}\end{aligned}$$
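The posterior $z_{ik}$ can be computed for all $i$ and $k$ at once. The sketch below is one possible NumPy implementation, not the report's own code; the function name, the array shapes, and the log-domain stabilization are assumptions made for illustration.

```python
import numpy as np

def responsibilities(X, mu, beta, pi):
    """E-step for a spherical GMM: z[i, k] = Prob(x_i in omega_k | x_i).

    X: (n, d) data, mu: (K, d) means, beta: (K,) variances,
    pi: (K,) mixing weights pi_k (not the constant np.pi).
    """
    d = X.shape[1]
    # Squared distances ||x_i - mu_k||^2, shape (n, K).
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # log(pi_k * beta_k^(-d/2) * exp(-0.5 * sq / beta_k)); the (2*pi)^(-d/2)
    # factor is common to every component and cancels in the ratio.
    log_w = np.log(pi) - 0.5 * d * np.log(beta) - 0.5 * sq / beta
    # Subtract the row maximum before exponentiating (numerical stability).
    log_w = log_w - log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
mu = np.array([[0.0, 0.0], [1.0, 1.0]])
beta = np.array([1.0, 0.5])
pi = np.array([0.4, 0.6])
z = responsibilities(X, mu, beta, pi)  # each row sums to 1
```

Working in the log domain is a design choice to avoid underflow when $\|x_i-\mu_k\|_2^2/\beta_k$ is large; it gives the same $z_{ik}$ as the direct formula.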

We want to maximize the following function:

$$\begin{aligned}f(\mu) &= \sum_{i=1}^n\sum_{k=1}^Kz_{ik}[\ln [\frac1{(2\pi\beta_k)^{\frac d2}}\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})] + \ln \pi_k] \\&= \sum_{i=1}^n\sum_{k=1}^Kz_{ik}[-\frac d2\ln (2\pi\beta_k) - 0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k}  + \ln \pi_k]\end{aligned}$$

Note that $z_{ik}$ is computed in the previous step from the old parameters and is held constant in this step; therefore, $\forall i,j,k,\ \frac{\partial z_{ik}}{\partial\mu_j} = 0$.

Therefore, we get:

$$\begin{aligned}
\frac{\partial f(\mu)}{\partial \mu_k} &= \frac\partial{\partial \mu_k} \sum_{i=1}^n\sum_{j=1}^Kz_{ij}[-\frac d2\ln (2\pi\beta_j) - 0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j}  + \ln \pi_j] \\
&= \frac\partial{\partial \mu_k} \sum_{i=1}^nz_{ik}[-\frac d2\ln (2\pi\beta_k) - 0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k}  + \ln \pi_k] \\
&=\frac\partial{\partial \mu_k} \sum_{i=1}^nz_{ik}[- 0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k}] = 0 \Leftrightarrow \frac\partial{\partial \mu_k} \sum_{i=1}^nz_{ik}[- 0.5\|x_i-\mu_k\|_2^2] = 0 \\
&\Rightarrow \mu_k^{(t+1)} = \frac{\sum_{i=1}^nz_{ik}x_i}{\sum_{i=1}^nz_{ik}} = 
\frac{\sum_{i=1}^n\frac{x_i\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})\pi_k}{\sum_j\beta_j^{-\frac d2}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})\pi_j}}{\sum_{i=1}^n\frac{\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})\pi_k}{\sum_j\beta_j^{-\frac d2}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})\pi_j}}
\end{aligned}$$
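The closed-form update $\mu_k^{(t+1)} = \sum_i z_{ik}x_i / \sum_i z_{ik}$ is a weighted mean of the data. A minimal sketch, assuming the responsibilities are already stored in an `(n, K)` array `z` (the function name and toy data are hypothetical):

```python
import numpy as np

def update_mu(X, z):
    """M-step for the means: mu_k = sum_i z_ik x_i / sum_i z_ik."""
    # z.T @ X has shape (K, d); divide row k by sum_i z_ik.
    return (z.T @ X) / z.sum(axis=0)[:, None]

# Toy data: three points in 2-D, two components.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]])
z = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
mu_new = update_mu(X, z)
# Component 0: (1*[0,0] + 0.5*[2,0]) / 1.5 = [2/3, 0]
# Component 1: (0.5*[2,0] + 1*[0,4]) / 1.5 = [2/3, 8/3]
```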

#### Gradient Descent method on $\mu$

The log-likelihood, viewed as a function of $\mu$, is

$$l(\mu) = \sum_{i=1}^n\ln(\sum_{k=1}^K\frac{\pi_k}{(2\pi\beta_k)^{\frac d2}}\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k}))$$

$$\begin{aligned}
\frac{\partial l}{\partial \mu_k} &= \sum_{i=1}^n\frac{\partial}{\partial \mu_k}\ln(\sum_{j=1}^K\frac{\pi_j}{(2\pi\beta_j)^{\frac d2}}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})) \\
&=\sum_{i=1}^n\frac{\frac{\partial}{\partial \mu_k} \pi_k\beta_k^{-\frac d2}\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})}{\sum_{j=1}^K\pi_j\beta_j^{-\frac d2}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})} \\
&=\sum_{i=1}^n\frac{\pi_k\beta_k^{-\frac d2 -1}\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})(x_i-\mu_k)}{\sum_{j=1}^K\pi_j\beta_j^{-\frac d2}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})}\\
&=\sum_{i=1}^n\frac{\pi_k\beta_k^{-\frac d2 -1}\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})}{\sum_{j=1}^K\pi_j\beta_j^{-\frac d2}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})}x_i - \sum_{i=1}^n\frac{\pi_k\beta_k^{-\frac d2 -1}\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})}{\sum_{j=1}^K\pi_j\beta_j^{-\frac d2}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})}\mu_k
\end{aligned}$$

By setting

$$\eta_k = \frac1{\sum_{i=1}^n\frac{\pi_k\beta_k^{-\frac d2 -1}\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})}{\sum_{j=1}^K\pi_j\beta_j^{-\frac d2}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})}} > 0$$

we can conclude that

$$\mu_k^{(t+1)} = \mu_k + \eta_k\frac{\partial l}{\partial \mu_k} = \frac{\sum_{i=1}^n\frac{\pi_k\beta_k^{-\frac d2 -1}\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})}{\sum_{j=1}^K\pi_j\beta_j^{-\frac d2}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})}x_i}{\sum_{i=1}^n\frac{\pi_k\beta_k^{-\frac d2 -1}\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})}{\sum_{j=1}^K\pi_j\beta_j^{-\frac d2}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})}} = \frac{\sum_{i=1}^n\frac{x_i\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})\pi_k}{\sum_j\beta_j^{-\frac d2}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})\pi_j}}{\sum_{i=1}^n\frac{\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})\pi_k}{\sum_j\beta_j^{-\frac d2}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})\pi_j}}$$

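Therefore, with the step size $\eta_k$ chosen above, a single gradient step on the log-likelihood (ascent on $l$, equivalently descent on $-l$) yields exactly the EM update, so the Gradient Descent method and the EM algorithm coincide when calculating $\mu$.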
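The equivalence for $\mu$ can also be checked numerically. The sketch below uses synthetic data and arbitrary parameter values, all hypothetical. It relies on the fact that the gradient derived in this section equals $\sum_i z_{ik}(x_i-\mu_k)/\beta_k$ and that the step size simplifies to $\eta_k = \beta_k / \sum_i z_{ik}$; both follow from substituting the definition of $z_{ik}$.

```python
import numpy as np

def sq_dists(X, mu):
    # ||x_i - mu_k||^2 for all i, k; shape (n, K).
    return ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)

def log_likelihood(X, mu, beta, pi):
    d = X.shape[1]
    dens = pi * (2 * np.pi * beta) ** (-d / 2) * np.exp(-0.5 * sq_dists(X, mu) / beta)
    return np.log(dens.sum(axis=1)).sum()

def responsibilities(X, mu, beta, pi):
    d = X.shape[1]
    w = pi * beta ** (-d / 2) * np.exp(-0.5 * sq_dists(X, mu) / beta)
    return w / w.sum(axis=1, keepdims=True)

def numeric_grad_mu(X, mu, beta, pi, eps=1e-6):
    # Central finite differences of l with respect to every entry of mu.
    grad = np.zeros_like(mu)
    for idx in np.ndindex(*mu.shape):
        step = np.zeros_like(mu)
        step[idx] = eps
        grad[idx] = (log_likelihood(X, mu + step, beta, pi)
                     - log_likelihood(X, mu - step, beta, pi)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))               # synthetic data (assumption)
mu = np.array([[-1.0, 0.0], [1.0, 0.5]])   # current means
beta = np.array([1.0, 0.7])                # current variances
pi = np.array([0.3, 0.7])                  # mixing weights pi_k, not np.pi

z = responsibilities(X, mu, beta, pi)
nk = z.sum(axis=0)

# Analytic gradient: dl/dmu_k = sum_i z_ik (x_i - mu_k) / beta_k.
grad_mu = (z.T @ X - nk[:, None] * mu) / beta[:, None]
eta = beta / nk                            # step size eta_k
mu_gd = mu + eta[:, None] * grad_mu        # one gradient step on l
mu_em = (z.T @ X) / nk[:, None]            # EM update

print(np.allclose(mu_gd, mu_em))
```

Comparing `grad_mu` with `numeric_grad_mu(X, mu, beta, pi)` gives an independent check that the analytic gradient matches the log-likelihood.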

### Calc $\beta = \sigma^2$

To begin with, let's write down the $f$ and $l$ above:

$$\begin{cases}
l(\beta) = \sum_{i=1}^n\ln(\sum_{k=1}^K\frac{\pi_k}{(2\pi\beta_k)^{\frac d2}}\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k}))\\
f(\beta) = \sum_{i=1}^n\sum_{k=1}^Kz_{ik}[-\frac d2\ln (2\pi\beta_k) - 0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k}  + \ln \pi_k]
\end{cases}$$

#### EM method on $\beta = \sigma^2$

We maximize the function $f(\beta)$ with respect to each $\beta_k$:

$$\begin{aligned}
&\frac{\partial f}{\partial \beta_k} = \sum_{i=1}^nz_{ik}[-\frac d2\frac1{\beta_k} + 0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k^2}] = 0, \beta_k \ne 0 \\
&\Leftrightarrow \sum_{i=1}^nz_{ik}[-\frac d2\beta_k + 0.5\|x_i-\mu_k\|_2^2] = 0\\
&\Leftrightarrow \beta_k^{(t+1)} = \frac{\sum_{i=1}^nz_{ik}\|x_i-\mu_k\|_2^2}{d\sum_{i=1}^nz_{ik}}
\end{aligned}$$

Here $d$ is the dimension of the random variable $x$; the factor $d$ appears because $|\Sigma_k|=|\beta_k I| = \beta_k^d$ in the PDF of the Gaussian distribution.
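The update for $\beta_k$ is a responsibility-weighted average of squared distances, divided by the dimension $d$. A minimal sketch under the same array-shape assumptions as before (the function name and toy data are hypothetical):

```python
import numpy as np

def update_beta(X, z, mu):
    """M-step for spherical variances:
    beta_k = sum_i z_ik ||x_i - mu_k||^2 / (d * sum_i z_ik)."""
    d = X.shape[1]
    # Squared distances ||x_i - mu_k||^2, shape (n, K).
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return (z * sq).sum(axis=0) / (d * z.sum(axis=0))

# With a single component (K = 1) and all z_i1 = 1, the update reduces to
# the per-coordinate sample variance averaged over the d coordinates.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
z = np.ones((3, 1))
mu = X.mean(axis=0, keepdims=True)
beta_new = update_beta(X, z, mu)  # equals 8/9 for this toy data
```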

#### Gradient Descent method on $\beta$

$$\begin{aligned}
\frac{\partial l}{\partial \beta_k} &= \sum_{i=1}^n\frac\partial{\partial \beta_k}\ln(\sum_{j=1}^K\frac{\pi_j}{(2\pi\beta_j)^{\frac d2}}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j}))\\
&=\sum_{i=1}^n\frac{\frac\partial{\partial \beta_k} \frac{\pi_k}{\beta_k^{\frac d2}}\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})}{\sum_{j=1}^K\frac{\pi_j}{\beta_j^{\frac d2}}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})}\\
&=\sum_{i=1}^n\frac{\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})(-\frac d2\frac{\pi_k}{\beta_k^{\frac d2 +1}} + 0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k^2}\frac{\pi_k}{\beta_k^{\frac d2}})}{\sum_{j=1}^K\frac{\pi_j}{\beta_j^{\frac d2}}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})}\\
&=\sum_{i=1}^n\frac{\frac{\pi_k}{\beta_k^{\frac d2}}\exp(-0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k})(-\frac d2\frac{1}{\beta_k} + 0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k^2})}{\sum_{j=1}^K\frac{\pi_j}{\beta_j^{\frac d2}}\exp(-0.5\frac{\|x_i-\mu_j\|_2^2}{\beta_j})}\\
&=\sum_{i=1}^nz_{ik}(-\frac d2\frac{1}{\beta_k} + 0.5\frac{\|x_i-\mu_k\|_2^2}{\beta_k^2})
\end{aligned}$$

Let $s_k = \frac{\beta_k^2}{\frac d2\sum_{i=1}^nz_{ik}} > 0$; then

$$\beta_k^{(t+1)} = \beta_k + s_k\frac{\partial l}{\partial \beta_k} = \beta_k - \beta_k + \frac{\sum_{i=1}^nz_{ik}0.5\|x_i-\mu_k\|_2^2}{\sum_{i=1}^nz_{ik}\frac d2} = \frac{\sum_{i=1}^nz_{ik}\|x_i-\mu_k\|_2^2}{d\sum_{i=1}^nz_{ik}}$$

Therefore, we can conclude that the Gradient Descent method with step size $s_k$ and the EM algorithm are equivalent in this case as well.
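The same kind of numerical check works for $\beta$. The sketch below uses synthetic data and arbitrary parameter values, all hypothetical; it takes one gradient step with step size $s_k = \beta_k^2 / (\frac d2\sum_i z_{ik})$ and compares the result with the EM update.

```python
import numpy as np

def sq_dists(X, mu):
    # ||x_i - mu_k||^2 for all i, k; shape (n, K).
    return ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)

def log_likelihood(X, mu, beta, pi):
    d = X.shape[1]
    dens = pi * (2 * np.pi * beta) ** (-d / 2) * np.exp(-0.5 * sq_dists(X, mu) / beta)
    return np.log(dens.sum(axis=1)).sum()

def responsibilities(X, mu, beta, pi):
    d = X.shape[1]
    w = pi * beta ** (-d / 2) * np.exp(-0.5 * sq_dists(X, mu) / beta)
    return w / w.sum(axis=1, keepdims=True)

def numeric_grad_beta(X, mu, beta, pi, eps=1e-6):
    # Central finite differences of l with respect to each beta_k.
    grad = np.zeros_like(beta)
    for k in range(beta.shape[0]):
        step = np.zeros_like(beta)
        step[k] = eps
        grad[k] = (log_likelihood(X, mu, beta + step, pi)
                   - log_likelihood(X, mu, beta - step, pi)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                         # synthetic data (assumption)
mu = np.array([[-1.0, 0.0, 0.5], [1.0, 0.5, -0.5]])  # current means
beta = np.array([1.2, 0.8])                          # current variances
pi = np.array([0.5, 0.5])                            # mixing weights pi_k, not np.pi

d = X.shape[1]
z = responsibilities(X, mu, beta, pi)
nk = z.sum(axis=0)
sq = sq_dists(X, mu)

# Analytic gradient: dl/dbeta_k = sum_i z_ik (-d/(2 beta_k) + 0.5 sq_ik / beta_k^2).
grad_beta = (z * (-0.5 * d / beta + 0.5 * sq / beta ** 2)).sum(axis=0)
s = beta ** 2 / (0.5 * d * nk)                       # step size s_k
beta_gd = beta + s * grad_beta                       # one gradient step on l
beta_em = (z * sq).sum(axis=0) / (d * nk)            # EM update

print(np.allclose(beta_gd, beta_em))
```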

## EM for MAP Estimation

## Programming