# 4. Clustering

## 4.1. K-means Algorithm

* **Algorithm**

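>* **Step 1:** Initialise the centroids $\boldsymbol{m}_k \in \mathbb{R}^D$ for $k=1,\dots,K$
>* **Step 2:** Assign each point to its nearest centroid: $s_n=\underset{k}{\text{argmin}}\; ||\boldsymbol{x}_n - \boldsymbol{m}_k||$
>* **Step 3:** Update each centroid: $\boldsymbol{m}_k = \text{mean}(\boldsymbol{x}_n : s_n=k)$
>* Repeat Steps 2 and 3 until the assignments $s_n$ no longer change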
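The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the choice of initialising with `K` randomly picked data points, and the handling of empty clusters are all assumptions that the notes do not specify.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means on an (N, D) array X. Returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialise centroids as K distinct data points (assumed choice).
    m = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    s = np.full(len(X), -1)
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)  # (N, K)
        s_new = dists.argmin(axis=1)
        if np.array_equal(s_new, s):
            break  # assignments stopped changing
        s = s_new
        # Step 3: move each centroid to the mean of its assigned points.
        for k in range(K):
            if np.any(s == k):  # assumption: an empty cluster keeps its centroid
                m[k] = X[s == k].mean(axis=0)
    return s, m

# Toy data: two well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
s, m = kmeans(X, K=2)
print(s, m)
```

The result depends on the initialisation in general; on data this well separated the two groups should be recovered, but the cluster labels themselves are arbitrary.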

## 4.2. K-means as Optimisation

* **Cost fn.** (a Lyapunov function: no step can increase it, so the algorithm converges, but possibly only to a local optimum; the **global optimum** is hard to find)

>$$\mathcal{C}(\{s_{n,k}\},\{\boldsymbol{m}_k\})=\sum^N_{n=1} \sum^K_{k=1} s_{n,k} \;||\boldsymbol{x}_n-\boldsymbol{m}_k||^2$$

>$$\text{where}\;\;\;\sum^K_{k=1} s_{n,k} = 1 \;\;\;\text{and}\;\;\; s_{n,k}\in\{0,1\}$$

* **K-means as Optimisation**

>1. Minimise $\mathcal{C}$ w.r.t. $\{s_{n,k}\}$, holding $\{\boldsymbol{m}_k\}$ fixed
>2. Minimise $\mathcal{C}$ w.r.t. $\{\boldsymbol{m}_k\}$, holding $\{s_{n,k}\}$ fixed
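To make the coordinate-descent view concrete, the sketch below evaluates $\mathcal{C}$ after every half-step and records the values. The helper names (`kmeans_cost`, `assignment_step`, `mean_step`) and the random test data are illustrative assumptions; the assignments are stored as integer labels $s_n$ rather than one-hot $s_{n,k}$, which gives the same cost.

```python
import numpy as np

def kmeans_cost(X, s, m):
    """C = sum_n ||x_n - m_{s_n}||^2 (equivalent to the one-hot form)."""
    return float(np.sum((X - m[s]) ** 2))

def assignment_step(X, m):
    # Minimise C w.r.t. the assignments with centroids fixed.
    d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    return d2.argmin(axis=1)

def mean_step(X, s, m):
    # Minimise C w.r.t. the centroids with assignments fixed.
    m_new = m.copy()
    for k in range(len(m)):
        if np.any(s == k):  # assumption: empty clusters are left unchanged
            m_new[k] = X[s == k].mean(axis=0)
    return m_new

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
m = X[:3].copy()              # assumed initialisation: first three points
s = assignment_step(X, m)
costs = []
for _ in range(10):
    costs.append(kmeans_cost(X, s, m))
    m = mean_step(X, s, m)
    costs.append(kmeans_cost(X, s, m))
    s = assignment_step(X, m)

# Each half-step minimises C over one block of variables, so the
# recorded sequence should never increase.
print(costs[0], costs[-1])
```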

## 4.3. K-means++

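* **Algorithm to select initial centroids** (the set $M$ stores the centroids chosen so far)

>1. Randomly select a data point as $\mu_0$, and put it in $M$
>2. For each $x_i \notin M$, calculate $d(M, x_i)=\min_{\mu_k \in M} d(\mu_k, x_i)$, the distance to the nearest chosen centroid (the minimising $\mu_k$ is the current class of $x_i$)
>3. Select the next $\mu$ among the data points, using a pmf proportional to $d(M, x_i)^2$ (the standard k-means++ weighting)
>4. Repeat Steps 2 and 3 until $K$ centroids are selected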
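A possible NumPy sketch of this seeding procedure is given below. It weights each point by its squared distance to the nearest chosen centroid, which is the usual k-means++ choice; the function name `kmeans_pp_init` and the use of Euclidean distance for $d$ are assumptions. It also assumes the data contain at least $K$ distinct points, otherwise the weights sum to zero.

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """Pick K initial centroids from the rows of X, k-means++ style."""
    rng = np.random.default_rng(seed)
    N = len(X)
    # 1. First centroid: a uniformly random data point.
    centroids = [X[rng.integers(N)]]
    for _ in range(K - 1):
        M = np.array(centroids)                                   # (|M|, D)
        # 2. Squared distance from every point to its nearest chosen centroid.
        d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # 3. Sample the next centroid with probability proportional to d2.
        #    Points already in M have d2 = 0 and cannot be picked again.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(N, p=probs)])
    # 4. Stop once K centroids are selected.
    return np.array(centroids)

X = np.array([[0, 0], [0, 0], [5, 5], [5, 5], [9, 0], [9, 0]], dtype=float)
print(kmeans_pp_init(X, K=3))
```

The returned array can be used in place of a purely random initialisation in Step 1 of K-means.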


# 5. EM Algorithm

## 5.1. Mixture of Gaussians: Generative Model

* **Sample cluster membership $\rightarrow$ Sample data-value**

>$$p(s_n=k|\theta)=\pi_k \;\;\;\text{where}\;\;\; \sum^K_{k=1}\pi_k=1$$

>$$p(\mathbf{x}_n|s_n=k,\theta)=\mathcal{N}(\mathbf{x}_n;\mathbf{m}_k,\Sigma_k)$$
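The two-stage sampling can be sketched as below: first draw the membership $s_n$ from the categorical distribution $\pi$, then draw $\mathbf{x}_n$ from the Gaussian of that component. The function name `sample_gmm` and the example parameter values are illustrative assumptions.

```python
import numpy as np

def sample_gmm(pi, means, covs, N, seed=0):
    """Draw N samples (s_n, x_n) from a mixture of Gaussians."""
    rng = np.random.default_rng(seed)
    # Sample cluster membership: p(s_n = k) = pi_k.
    s = rng.choice(len(pi), size=N, p=pi)
    # Sample data value: x_n ~ N(m_k, Sigma_k) for k = s_n.
    X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in s])
    return s, X

# Hypothetical two-component example in 2D.
pi = np.array([0.3, 0.7])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
covs = np.array([np.eye(2), 0.5 * np.eye(2)])
s, X = sample_gmm(pi, means, covs, N=2000)
print(X.shape, np.bincount(s) / len(s))
```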

## 5.2. KL Divergence

>$$\mathcal{KL}(p_1(z)||p_2(z))=\sum_z p_1(z)\log{\frac{p_1(z)}{p_2(z)}}$$

>* **Gibbs' Inequality:** $\mathcal{KL}(p_1(z)||p_2(z))\geq0$, with equality if and only if $p_1(z)=p_2(z)$
>* **Non-Symmetric:** in general, $\mathcal{KL}(p_1(z)||p_2(z))\neq\mathcal{KL}(p_2(z)||p_1(z))$
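Both properties can be checked numerically for discrete distributions. The sketch below assumes $p_2(z)>0$ wherever $p_1(z)>0$ and uses the convention that terms with $p_1(z)=0$ contribute zero; the function name `kl_divergence` and the two example distributions are assumptions.

```python
import numpy as np

def kl_divergence(p1, p2):
    """KL(p1 || p2) for discrete distributions given as probability vectors."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    mask = p1 > 0  # convention: 0 * log(0 / q) = 0
    return float(np.sum(p1[mask] * np.log(p1[mask] / p2[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))  # zero when the distributions coincide
print(kl_divergence(p, q))  # non-negative
print(kl_divergence(q, p))  # differs from the previous line
```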

## 5.3. EM Algorithm

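* **Free Energy**: a lower bound on the log-likelihood (LL) $\log p(x|\theta)$, because the subtracted term is a KL divergence and therefore non-negative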

>$$\mathcal{F}(q(s),\theta)=\log p(x|\theta) - \sum_s q(s)\log{\frac{q(s)}{p(s|x,\theta)}}$$

* **E Step:** For fixed $\theta_{t-1}$, maximise lower bound $\mathcal{F}(q(s),\theta_{t-1})$ wrt $q(s)$

>$$q_t(s)=p(s|x,\theta_{t-1})$$

* **M Step:** For fixed $q_t(s)$, maximise lower bound $\mathcal{F}(q_t(s),\theta)$ wrt $\theta$

>$$\mathcal{F}(q(s),\theta)=\sum_s q(s)\log(p(x|s,\theta)p(s|\theta))-\sum_s q(s)\log q(s)$$

>$$\theta_t=\underset{\theta}{\text{argmax}}\sum_s q_t(s) \log (p(x|s,\theta)p(s|\theta)) $$

* **LL cannot decrease**

><img src="images\image04.png" width=550>
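The claim can be verified with a short chain of inequalities, sketched here using only the definitions in this section. Writing the free energy as the log-likelihood minus a KL divergence gives the bound:

>$$\mathcal{F}(q(s),\theta)=\log p(x|\theta)-\mathcal{KL}(q(s)||p(s|x,\theta))\leq\log p(x|\theta)$$

After the E step the KL term is zero, so the bound is tight at $\theta_{t-1}$; the M step can only raise $\mathcal{F}$; and the bound still holds at $\theta_t$:

>$$\log p(x|\theta_{t-1})=\mathcal{F}(q_t(s),\theta_{t-1})\leq\mathcal{F}(q_t(s),\theta_t)\leq\log p(x|\theta_t)$$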

## 5.4. EM - Application to 1D data

* **Probability of the observations given the latent variables and the parameters:**

>$$p(x_n|s_n=k,\theta)=\frac{1}{\sqrt{2\pi\sigma^2_k}}\exp\left(-\frac{1}{2\sigma_k^2}(x_n-\mu_k)^2\right)$$

* **Prior on latent variables**

>$$p(s_n=k)=\pi_k$$

* **E Step: fills in the values of the hidden variables** 

>\begin{align}
q(s_n=k)=p(s_n=k|x_n,\theta)&\propto p(x_n,s_n=k|\theta)\\
&=\frac{\pi_k}{\sqrt{2\pi\sigma^2_k}}\exp\left(-\frac{1}{2\sigma_k^2}(x_n-\mu_k)^2\right)=u_{nk}
\end{align}

>$$q(s_n=k)=r_{nk}=\frac{u_{nk}}{u_n} \;\;\;\text{where}\;\;\; u_n=\sum^K_{k=1}u_{nk}$$

>* $r_{nk}$: ***Responsibility*** that component $k$ takes for datapoint $n$
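The E step above amounts to evaluating $u_{nk}$ for every pair $(n,k)$ and normalising each row. A minimal NumPy sketch, with the function name `e_step` and the example parameters chosen for illustration:

```python
import numpy as np

def e_step(x, pi, mu, sigma2):
    """Responsibilities r[n, k] for 1D data x under a K-component mixture."""
    x = np.asarray(x, dtype=float)[:, None]                      # (N, 1)
    # u_nk = pi_k * N(x_n; mu_k, sigma2_k), broadcast to shape (N, K).
    u = pi / np.sqrt(2 * np.pi * sigma2) * np.exp(-0.5 * (x - mu) ** 2 / sigma2)
    # r_nk = u_nk / u_n with u_n = sum_k u_nk.
    return u / u.sum(axis=1, keepdims=True)

# Hypothetical parameters: two unit-variance components at 0 and 4.
pi = np.array([0.5, 0.5])
mu = np.array([0.0, 4.0])
sigma2 = np.array([1.0, 1.0])
r = e_step([0.0, 2.0, 4.0], pi, mu, sigma2)
print(r)
```

For points very far from every component, `u` can underflow to zero; working with log-densities and a log-sum-exp normalisation is a common way around that, at the cost of slightly longer code.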

* **M Step: performs supervised learning with known (soft) cluster assignments**

>$$\mathcal{F}(q(s),\theta)=\sum^N_{n=1} \sum^K_{k=1} q(s_n=k) \left[ \log(\pi_k)-\frac{1}{2\sigma_k^2}(x_n-\mu_k)^2-\frac{1}{2}\log(\sigma_k^2) \right] + \text{const.}$$

><img src="images\image05.png" width=550>
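As a sketch of the M step, the code below uses the standard closed-form updates obtained by setting the derivatives of $\mathcal{F}$ with respect to $\mu_k$, $\sigma_k^2$ and $\pi_k$ to zero (with a Lagrange multiplier enforcing $\sum_k \pi_k=1$). The notation may differ from the figure above; the function name `m_step` and the variable `Nk` (the effective number of points in each component) are assumptions.

```python
import numpy as np

def m_step(x, r):
    """Update (pi, mu, sigma2) from 1D data x and responsibilities r[n, k]."""
    x = np.asarray(x, dtype=float)
    r = np.asarray(r, dtype=float)
    Nk = r.sum(axis=0)                        # effective count per component
    pi = Nk / len(x)                          # pi_k = N_k / N
    mu = (r * x[:, None]).sum(axis=0) / Nk    # responsibility-weighted mean
    # Responsibility-weighted variance around the updated mean.
    sigma2 = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, sigma2

# With hard (one-hot) responsibilities the updates reduce to the
# per-cluster fraction, mean and variance.
x = np.array([0.0, 2.0, 10.0, 14.0])
r = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
pi, mu, sigma2 = m_step(x, r)
print(pi, mu, sigma2)
```

Alternating this update with the E step gives the full EM iteration for the 1D mixture. A component whose responsibilities sum to zero would cause a division by zero here, so a practical implementation needs a guard for that case.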