## Proof of GMM

Assume $X = \{x^{(1)}, \cdots, x^{(m)}\}$ are drawn from an unknown distribution $p(x)$.  Our objective is to find a good approximation of this unknown distribution by means of a GMM with $K$ mixture components.  We exploit the i.i.d. (independent and identically distributed) assumption, which gives the log-likelihood

$$\log p(X | \theta) = \sum\limits_{i=1}^m \log p(x^{(i)} | \theta) = \sum\limits_{i=1}^m \log \sum\limits_{k=1}^K \pi_k \mathcal{N}(x^{(i)} | \mu_k, \Sigma_k)$$

Our objective is to find the parameters $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^K$ that maximize the log-likelihood $\mathcal{L}$:

$$
\max_\theta \sum\limits_{i=1}^m \log \sum\limits_{k=1}^K \pi_k \mathcal{N}(x^{(i)} | \mu_k, \Sigma_k)
$$
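To make the objective concrete, here is a minimal NumPy sketch that evaluates this log-likelihood for given parameters. The helper names `gaussian_pdf` and `gmm_log_likelihood` and the toy data are assumptions made for illustration; the density is evaluated directly (not in log space), so this is not numerically robust for high dimensions or far-away points.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Evaluate N(x | mu, Sigma) at each row of X, where X has shape (m, D)."""
    D = X.shape[1]
    diff = X - mu
    # Mahalanobis terms (x - mu)^T Sigma^{-1} (x - mu), via a linear solve
    # rather than an explicit inverse.
    maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * maha)

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """Compute sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)."""
    mixture = sum(pi * gaussian_pdf(X, mu, S)
                  for pi, mu, S in zip(pis, mus, Sigmas))
    return float(np.sum(np.log(mixture)))

# Toy 1-D example with two components (all values are made up).
X = np.array([[0.0], [1.0], [4.0]])
pis = np.array([0.5, 0.5])
mus = np.array([[0.0], [4.0]])
Sigmas = np.array([[[1.0]], [[1.0]]])
print(gmm_log_likelihood(X, pis, mus, Sigmas))
```

Note that the sum over components sits inside the logarithm, which is what prevents the log from acting directly on the Gaussian densities.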

Our "normal" procedure would be to compute the gradient $\frac{d\mathcal{L}}{d\theta}$ of the log-likelihood with respect to the model parameters $\theta$, set it to 0, and solve for $\theta$.  However, if you try this yourself at home, you will find that no closed-form solution exists, because the sum over components sits inside the logarithm.

One way forward turns out to be the EM algorithm, where the key idea is to update one set of model parameters at a time while keeping the others fixed.

Before we find the partial derivatives, let us introduce a quantity that will play a central role in this algorithm: **responsibilities**.

We define the quantity

$$ r^{(i)}_{k} = \frac{\pi_k\mathcal{N}(x^{(i)} \mid \mu_k, \Sigma_{k})}{\sum\limits_{j=1}^{K} \pi_j\mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)}$$

as the *responsibility* of the $k$th mixture component for the $i$th data point.  

$r^{(i)}_{k}$ basically gives us $$ \frac{\text{Probability of $x^{(i)}$ belonging to cluster k}}{\text{Probability of $x^{(i)}$ over all clusters}} $$

The responsibility $r^{(i)}_{k}$ of the $k$th mixture component for data point $x^{(i)}$ is proportional to the likelihood of the mixture component given the data point.

$$p(x^{(i)} | \pi_k, \mu_k, \Sigma_k) = \pi_k\mathcal{N}(x^{(i)}|\mu_k, \Sigma_k)$$

Therefore, mixture components have a high responsibility for a data point when the data point could be a **plausible sample** from that mixture component.  Note that 

$$r^{(i)} = [r^{(i)}_1, r^{(i)}_2, \cdots, r^{(i)}_K]^T \in \mathbb{R}^K$$

is a normalized probability vector, i.e., for each sample $i$

$$\sum\limits_{j=1}^{K}r^{(i)}_j = 1$$

$$ r^{(i)}_j \geq 0 $$

Thus this probability vector distributes probability mass among the $K$ mixture components, and we can think of $r^{(i)}_k$ as the probability that $x^{(i)}$ has been generated by the $k$th mixture component.

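Summing the responsibilities of the $k$th mixture component over all samples gives the total responsibility $N_k$.

$$N_k = \sum\limits_{i=1}^{m}r^{(i)}_k$$

Note that this value is not necessarily equal to 1.  Instead, because each sample's responsibilities sum to 1, the $N_k$ sum to $m$ over all components.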
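The responsibilities and $N_k$ can be sketched in a few lines of NumPy. The function names and the toy data below are illustrative assumptions, and the density is computed directly rather than in log space, so this is a sketch rather than a production implementation.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Evaluate N(x | mu, Sigma) at each row of X, where X has shape (m, D)."""
    D = X.shape[1]
    diff = X - mu
    maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * maha)

def responsibilities(X, pis, mus, Sigmas):
    """Return R with R[i, k] = r_k^{(i)}; each row sums to 1."""
    # Numerators pi_k * N(x_i | mu_k, Sigma_k), stacked into shape (m, K).
    weighted = np.stack([pi * gaussian_pdf(X, mu, S)
                         for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)
    # Divide each row by its sum over components (the denominator).
    return weighted / weighted.sum(axis=1, keepdims=True)

# Toy 1-D example with two components (all values are made up).
X = np.array([[0.0], [1.0], [4.0]])
pis = np.array([0.5, 0.5])
mus = np.array([[0.0], [4.0]])
Sigmas = np.array([[[1.0]], [[1.0]]])

R = responsibilities(X, pis, mus, Sigmas)
N = R.sum(axis=0)  # total responsibility N_k of each component
print(R)
print(N)
```

In this toy setup the point at 0 should be claimed mostly by the first component and the point at 4 mostly by the second, while the entries of `N` add up to the number of samples.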

**Updating the mean**

The update of the mean parameters $\mu_k, k=1,\cdots,K$ of the GMM is given by:

$$ \mu_k^{new} = \frac{\sum\limits_{i=1}^{m}r^{(i)}_{k}x^{(i)}}{\sum\limits_{i=1}^{m}r^{(i)}_{k}}$$

To prove this:

Any local optimum of a function has the property that its gradient with respect to the parameters vanishes, so we set the partial derivatives to zero.

We take the partial derivative of our objective function with respect to the mean parameters $\mu_k, k=1, \cdots, K$.  To simplify things, let's first take the partial derivative without the log and for a single sample.

$$
	\frac{\partial p(x^{(i)} | \theta)}{\partial \mu_k} = \sum\limits_{j=1}^K \pi_j \frac{\partial \mathcal{N}(x^{(i)} | \mu_j, \Sigma_j)}{\partial \mu_k} = \pi_k \frac{\partial \mathcal{N}(x^{(i)} | \mu_k, \Sigma_k)}{\partial \mu_k} = \pi_k(x^{(i)} - \mu_k)^T \Sigma_k^{-1}\mathcal{N}(x^{(i)} | \mu_k, \Sigma_k)
$$

Now we bring back the log and all samples.  Since the derivative of $\log u$ is $\frac{1}{u}$, the chain rule gives

$$
\frac{\partial \mathcal{L}}{\partial \mu_k} =\sum\limits_{i=1}^{m} \frac{\partial \log p(x^{(i)} | \theta)}{\partial \mu_k} = \sum\limits_{i=1}^{m} \frac{1}{p(x^{(i)} | \theta)} \frac{\partial p(x^{(i)} | \theta) }{\partial \mu_k} = \sum\limits_{i=1}^{m}(x^{(i)} - \mu_k)^T \Sigma_k^{-1}\frac{\pi_k\mathcal{N}(x^{(i)} \mid \mu_k, \Sigma_{k})}{\sum\limits_{j=1}^{K} \pi_j\mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)}
$$

To simplify, we can substitute $r^{(i)}_{k}$ into the equation, thus

$$= \sum\limits_{i=1}^{m} r^{(i)}_{k}(x^{(i)} - \mu_k)^T\Sigma_k^{-1}$$
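As a sanity check, the gradient expression above can be compared against a central finite difference of the log-likelihood. This is only a sketch: the helper names, the random data, and the parameter values are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Evaluate N(x | mu, Sigma) at each row of X, where X has shape (m, D)."""
    D = X.shape[1]
    diff = X - mu
    maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * maha)

def weighted_densities(X, pis, mus, Sigmas):
    """Matrix of pi_k * N(x_i | mu_k, Sigma_k), shape (m, K)."""
    return np.stack([pi * gaussian_pdf(X, mu, S)
                     for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)

def log_likelihood(X, pis, mus, Sigmas):
    return float(np.sum(np.log(weighted_densities(X, pis, mus, Sigmas).sum(axis=1))))

def grad_mu(X, pis, mus, Sigmas, k):
    """Analytic gradient: sum_i r_k^{(i)} Sigma_k^{-1} (x_i - mu_k)."""
    W = weighted_densities(X, pis, mus, Sigmas)
    r_k = W[:, k] / W.sum(axis=1)
    weighted_diff = (r_k[:, None] * (X - mus[k])).sum(axis=0)
    # Sigma_k is symmetric, so this matches (x - mu)^T Sigma^{-1} entry by entry.
    return np.linalg.solve(Sigmas[k], weighted_diff)

# Made-up data and parameters for the check.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
pis = np.array([0.3, 0.7])
mus = np.array([[-1.0, 0.0], [1.0, 0.5]])
Sigmas = np.array([[[1.0, 0.2], [0.2, 1.5]], [[2.0, -0.3], [-0.3, 1.0]]])

k, eps = 0, 1e-6
numeric = np.zeros(2)
for d in range(2):
    step = np.zeros_like(mus)
    step[k, d] = eps
    numeric[d] = (log_likelihood(X, pis, mus + step, Sigmas)
                  - log_likelihood(X, pis, mus - step, Sigmas)) / (2 * eps)

analytic = grad_mu(X, pis, mus, Sigmas, k)
print(numeric, analytic)
```

The two printed vectors should agree to several decimal places if the derivation is right.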

We can now solve for $\mu_k$ so that $\frac{\partial \mathcal{L}}{\partial \mu_k} = 0$ and obtain

$$\sum\limits_{i=1}^{m} r^{(i)}_{k}(x^{(i)} - \mu_k)^T\Sigma_k^{-1} = 0$$

Multiplying both sides on the right by $\Sigma_k$ cancels $\Sigma_k^{-1}$.  Transposing and moving the $\mu_k$ term to the other side gives

$$\sum\limits_{i=1}^{m} r^{(i)}_{k}x^{(i)}  = \sum\limits_{i=1}^{m} r^{(i)}_{k}\mu_k$$

$$\frac{\sum\limits_{i=1}^{m} r^{(i)}_{k}x^{(i)} }{\sum\limits_{i=1}^{m} r^{(i)}_{k}}  = \mu_k$$

We can further substitute $N_k$ so that

$$
\frac{1}{N_k}\sum\limits_{i=1}^{m} r^{(i)}_{k}x^{(i)} = \mu_k
$$

We can interpret this as $\mu_k$ being pulled toward each data point $x^{(i)}$ with strength given by $r^{(i)}_{k}$.  The means are pulled more strongly toward data points for which the corresponding mixture component has a high responsibility, i.e., a high likelihood.
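In matrix form, the mean update is a responsibility-weighted average of the data. The sketch below assumes the responsibilities are already available as an $m \times K$ array `R` (here filled with random normalized values purely for illustration) and checks that the stationarity condition $\sum_i r^{(i)}_k (x^{(i)} - \mu_k) = 0$ holds at the updated means.

```python
import numpy as np

def update_means(X, R):
    """mu_k = (1 / N_k) * sum_i r_k^{(i)} x^{(i)} for every component k."""
    N = R.sum(axis=0)              # total responsibilities, shape (K,)
    return (R.T @ X) / N[:, None]  # weighted sums divided by N_k, shape (K, D)

# Made-up data and responsibilities (rows of R normalized to sum to 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
R = rng.random(size=(6, 3))
R /= R.sum(axis=1, keepdims=True)

mus_new = update_means(X, R)

# Stationarity check: sum_i r_k^{(i)} (x^{(i)} - mu_k) should vanish for each k.
residual = np.stack([(R[:, k, None] * (X - mus_new[k])).sum(axis=0)
                     for k in range(R.shape[1])])
print(mus_new)
print(residual)
```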

**Updating the covariances**

The update of the covariance parameters $\Sigma_k, k=1,\cdots,K$ of the GMM is given by:

$$ \Sigma_k^{new} = \frac{1}{N_k} \sum\limits_{i=1}^{m}r^{(i)}_{k}(x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^T $$

To prove this:

We take the partial derivative of our objective function with respect to the covariance parameters $\Sigma_k, k=1, \cdots, K$.  Similarly, to simplify things, let's first take the partial derivative without the log and for a single sample.

$$
	\frac{\partial p(x^{(i)} | \theta)}{\partial \Sigma_k} = \frac{\partial}{\partial \Sigma_k} \big(\pi_k(2\pi)^{-\frac{D}{2}} \det(\Sigma_k)^{-\frac{1}{2}}\exp\big(-\frac{1}{2}(x^{(i)} - \mu_k)^T\Sigma^{-1}_k(x^{(i)} - \mu_k)\big)\big)
$$

Using the product rule, we get

$$
= \pi_k(2\pi)^{-\frac{D}{2}}\big[\big(\frac{\partial}{\partial \Sigma_k}\det(\Sigma_k)^{-\frac{1}{2}}\big)\exp\big(-\frac{1}{2}(x^{(i)} - \mu_k)^T\Sigma^{-1}_k(x^{(i)} - \mu_k)\big) + \det(\Sigma_k)^{-\frac{1}{2}}\frac{\partial}{\partial \Sigma_k}\exp\big(-\frac{1}{2}(x^{(i)} - \mu_k)^T\Sigma^{-1}_k(x^{(i)} - \mu_k)\big)\big]
$$

Using the following rule

$$
\frac{\partial}{\partial x}\det(f(x)) = \det(f(x))\,tr\big(f(x)^{-1}\frac{\partial f(x)}{\partial x}\big)
$$

We get that

$$
\frac{\partial}{\partial \Sigma_k}\det(\Sigma_k)^{-\frac{1}{2}} = -\frac{1}{2}\det(\Sigma_k)^{-\frac{1}{2}}\Sigma_k^{-1}
$$

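Using the following rules

$$
\frac{\partial a^TXb}{\partial X} = ab^T, \qquad \frac{\partial a^TX^{-1}b}{\partial X} = -X^{-T}ab^TX^{-T}
$$

together with the symmetry of $\Sigma_k$, we get that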

$$
\frac{\partial}{\partial \Sigma_k}(x^{(i)} - \mu_k)^T\Sigma^{-1}_k(x^{(i)} - \mu_k) = -\Sigma_k^{-1}(x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^T\Sigma_k^{-1}
$$
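Both matrix-derivative results above can be checked numerically with entrywise central differences. The sketch below treats every entry of $\Sigma_k$ as an independent variable and evaluates at a symmetric matrix; the helper `numerical_grad` and the specific values of `Sigma` and `d` are assumptions chosen for illustration.

```python
import numpy as np

def numerical_grad(f, S, eps=1e-6):
    """Entrywise central-difference gradient of a scalar function f at matrix S."""
    G = np.zeros_like(S)
    for a in range(S.shape[0]):
        for b in range(S.shape[1]):
            E = np.zeros_like(S)
            E[a, b] = eps
            G[a, b] = (f(S + E) - f(S - E)) / (2 * eps)
    return G

# A made-up symmetric positive definite covariance and a difference vector x - mu.
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
d = np.array([0.7, -1.2])
Sigma_inv = np.linalg.inv(Sigma)

# d/dSigma det(Sigma)^{-1/2} = -1/2 det(Sigma)^{-1/2} Sigma^{-1}
det_numeric = numerical_grad(lambda S: np.linalg.det(S) ** (-0.5), Sigma)
det_analytic = -0.5 * np.linalg.det(Sigma) ** (-0.5) * Sigma_inv

# d/dSigma (d^T Sigma^{-1} d) = -Sigma^{-1} d d^T Sigma^{-1}
quad_numeric = numerical_grad(lambda S: d @ np.linalg.solve(S, d), Sigma)
quad_analytic = -Sigma_inv @ np.outer(d, d) @ Sigma_inv

print(np.abs(det_numeric - det_analytic).max())
print(np.abs(quad_numeric - quad_analytic).max())
```

Both printed differences should be close to zero, up to finite-difference error.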

Putting them together, we get:

$$
\frac{\partial p(x^{(i)} | \theta)}{\partial \Sigma_k} = \pi_k\mathcal{N}(x^{(i)} | \mu_k, \Sigma_k) \cdot \big[-\frac{1}{2}\big(\Sigma_k^{-1}-\Sigma_k^{-1}(x^{(i)}-\mu_k)(x^{(i)} - \mu_k)^T\Sigma_k^{-1}\big)\big]
$$

Now considering all samples and the log as well, the partial derivative of the log-likelihood with respect to $\Sigma_k$ is given by

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \Sigma_k} &=  \sum\limits_{i=1}^{m}\frac{\partial \log p(x^{(i)} | \theta)}{\partial \Sigma_k}\\
&=\sum\limits_{i=1}^{m}\frac{1}{p(x^{(i)} | \theta)}\frac{\partial p(x^{(i)} | \theta)}{\partial \Sigma_k}\\
&=\sum\limits_{i=1}^{m}\frac{\pi_k\mathcal{N}(x^{(i)} \mid \mu_k, \Sigma_{k})}{\sum\limits_{j=1}^{K} \pi_j\mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)} \cdot \big[-\frac{1}{2}\big(\Sigma_k^{-1}-\Sigma_k^{-1}(x^{(i)}-\mu_k)(x^{(i)} - \mu_k)^T\Sigma_k^{-1}\big)\big]
\end{aligned}
$$

Substituting $r^{(i)}_{k}$, we get

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \Sigma_k} &= -\frac{1}{2}\sum\limits_{i=1}^{m}r^{(i)}_{k}\big(\Sigma_k^{-1}-\Sigma_k^{-1}(x^{(i)}-\mu_k)(x^{(i)} - \mu_k)^T\Sigma_k^{-1}\big)\\
&= -\frac{1}{2}\Sigma_k^{-1}\sum\limits_{i=1}^{m}r^{(i)}_{k} + \frac{1}{2}\Sigma_k^{-1}\big(\sum\limits_{i=1}^{m}r^{(i)}_{k}(x^{(i)}-\mu_k)(x^{(i)} - \mu_k)^T\big)\Sigma_k^{-1}
\end{aligned}
$$

Setting this to zero, we obtain:

$$
N_k\Sigma_k^{-1} = \Sigma_k^{-1}\big(\sum\limits_{i=1}^m r^{(i)}_{k}(x^{(i)}-\mu_k)(x^{(i)} - \mu_k)^T\big)\Sigma_k^{-1}
$$

Multiplying both sides by $\Sigma_k$ from the left and from the right, and then dividing by $N_k$, we get

$$
\Sigma_k = \frac{1}{N_k}\sum\limits_{i=1}^{m}r^{(i)}_{k}(x^{(i)}-\mu_k)(x^{(i)} - \mu_k)^T
$$
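The covariance update is a responsibility-weighted average of outer products. Below is a minimal sketch, again assuming the responsibilities are given as an $m \times K$ array `R` (random normalized values here, purely for illustration). With a single component and all responsibilities equal to 1, the update should reduce to the ordinary biased sample covariance, which the last lines compare against `np.cov`.

```python
import numpy as np

def update_covariances(X, R, mus):
    """Sigma_k = (1 / N_k) * sum_i r_k^{(i)} (x_i - mu_k)(x_i - mu_k)^T."""
    K, D = R.shape[1], X.shape[1]
    N = R.sum(axis=0)
    Sigmas = np.empty((K, D, D))
    for k in range(K):
        diff = X - mus[k]  # shape (m, D)
        # Weighted sum of outer products, written as a matrix product.
        Sigmas[k] = (R[:, k, None] * diff).T @ diff / N[k]
    return Sigmas

# Made-up data and responsibilities (rows of R normalized to sum to 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
R = rng.random(size=(50, 3))
R /= R.sum(axis=1, keepdims=True)

mus = (R.T @ X) / R.sum(axis=0)[:, None]  # weighted means of each component
Sigmas = update_covariances(X, R, mus)

# Single-component special case: should match the biased sample covariance.
R_one = np.ones((50, 1))
mu_one = X.mean(axis=0, keepdims=True)
Sigma_one = update_covariances(X, R_one, mu_one)[0]
print(Sigma_one)
print(np.cov(X.T, bias=True))
```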

**Updating the mixture weights**

The update of the mixture weights $\pi_k, k=1,\cdots,K$ of the GMM is given by:

$$ \pi_k^{new} = \frac{N_k}{m}$$

To prove this:

To find the partial derivative, we account for the equality constraint 

$$\sum\limits_{k=1}^K \pi_k=1$$

Introducing a Lagrange multiplier $\beta$ for this constraint, the Lagrangian $\mathscr{L}$ is

$$
\begin{aligned}
\mathscr{L} &= \mathcal{L} + \beta\big(\sum\limits_{k=1}^K \pi_k-1\big)\\
&= \sum\limits_{i=1}^m \log \sum\limits_{k=1}^K \pi_k \mathcal{N}(x^{(i)} | \mu_k, \Sigma_k) + \beta\big(\sum\limits_{k=1}^K \pi_k-1\big)
\end{aligned}
$$

Taking the partial derivative with respect to $\pi_k$ gives

$$
\begin{aligned}
\frac{\partial \mathscr{L}}{\partial \pi_k} &= \sum\limits_{i=1}^m
\frac{\mathcal{N}(x^{(i)} \mid \mu_k, \Sigma_{k})}{\sum\limits_{j=1}^{K} \pi_j\mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)} + \beta \\
&= \frac{1}{\pi_k}\sum\limits_{i=1}^m\frac{\pi_k\mathcal{N}(x^{(i)} \mid \mu_k, \Sigma_{k})}{\sum\limits_{j=1}^{K} \pi_j\mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)} + \beta\\
&= \frac{N_k}{\pi_k} + \beta
\end{aligned}
$$

Taking the partial derivative with respect to $\beta$ gives

$$
\frac{\partial \mathscr{L}}{\partial \beta} = \sum\limits_{k=1}^{K} \pi_k - 1
$$

Setting both partial derivatives to zero yields

$$
\pi_k = -\frac{N_k}{\beta}
$$

$$
1 = \sum\limits_{k=1}^K\pi_k
$$

Substituting the first equation into the second, and using $\sum_{k=1}^{K} N_k = m$ (each sample's responsibilities sum to 1):

$$
\begin{aligned}
-\sum\limits_{k=1}^{K}\frac{N_k}{\beta} &= 1\\
-\frac{m}{\beta} &= 1\\
\beta &= -m
\end{aligned}
$$

Substituting $-m$ for $\beta$ yields

$$
\pi_k = \frac{N_k}{m}
$$
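Putting the three updates together gives one EM iteration: compute the responsibilities with the current parameters, then update the means, covariances, and weights. The sketch below is one possible arrangement, not a definitive implementation. It assumes that the covariance update uses the freshly updated means (a common choice), uses made-up synthetic data with two well-separated clusters, and initializes the means at two data points; the function name `em_step` is hypothetical.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Evaluate N(x | mu, Sigma) at each row of X, where X has shape (m, D)."""
    D = X.shape[1]
    diff = X - mu
    maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * maha)

def em_step(X, pis, mus, Sigmas):
    """One EM iteration; also returns the log-likelihood of the incoming parameters."""
    m, K = X.shape[0], len(pis)
    # E-step: responsibilities r_k^{(i)}.
    weighted = np.stack([pi * gaussian_pdf(X, mu, S)
                         for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)
    totals = weighted.sum(axis=1, keepdims=True)
    R = weighted / totals
    log_lik = float(np.sum(np.log(totals)))
    # M-step: the three closed-form updates derived above.
    N = R.sum(axis=0)
    new_mus = (R.T @ X) / N[:, None]
    new_Sigmas = np.stack([
        (R[:, k, None] * (X - new_mus[k])).T @ (X - new_mus[k]) / N[k]
        for k in range(K)
    ])
    new_pis = N / m
    return new_pis, new_mus, new_Sigmas, log_lik

# Synthetic data: two 2-D clusters around (0, 0) and (5, 5) (made-up values).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

# Simple initialization: uniform weights, identity covariances, two data points as means.
pis = np.array([0.5, 0.5])
mus = X[[0, 100]].copy()
Sigmas = np.stack([np.eye(2), np.eye(2)])

history = []
for _ in range(20):
    pis, mus, Sigmas, log_lik = em_step(X, pis, mus, Sigmas)
    history.append(log_lik)

print(pis)
print(mus)
print(history[0], history[-1])
```

The recorded log-likelihood should not decrease from one iteration to the next, and the weights stay normalized because $\sum_k N_k = m$.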