# Machine Learning Online Class Exercise 7 | Principal Component Analysis and K-Means Clustering

In [1]:
import numpy as np
import pandas as pd
from scipy.io import loadmat
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(font_scale=1.3)

## K-means clustering
The goal of k-means clustering is simple: given a set of points $x_i\in\mathbb{R}^d$, $i=1,\dots,m$, find the cluster centroids $\mu_k\in\mathbb{R}^d$, $k=1,\dots,K$, such that
$$J(\mu)=\frac{1}{m}\sum_{i=1}^m|x_i-\mu_{c(i)}|^2$$
is minimized, where $c(i)$ denotes the index of the cluster to which $x_i$ belongs.<br>
<br>
The following algorithm approximately achieves this goal:
1. Initialize the centroids of the $K$ clusters.
2. Set the assignment index $c(i)=k$ for every $x_i$ by assigning each point to the cluster $k$ whose centroid is closest to it, that is, the one that minimizes $|x_i-\mu_k|$.
3. Move each centroid $\mu_k$ to the mean position of all the $x_i$ such that $c(i)=k$.
4. If no centroid $\mu_k$ moved in step 3, stop. Otherwise go to step 2.
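
The four steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation built in the parts below: the name `kmeans_sketch`, the `max_iters` cap, the toy data, and the choice to leave an empty cluster's centroid where it is are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, init_centroids, max_iters=100):
    """Hypothetical helper: repeat steps 2-4 until the centroids stop moving."""
    centroids = np.asarray(init_centroids, dtype=float).copy()  # step 1
    for _ in range(max_iters):
        # Step 2: distances from every point to every centroid, shape (m, K)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (assumption: a centroid with no points keeps its position)
        new_centroids = centroids.copy()
        for k in range(len(centroids)):
            if np.any(c == k):
                new_centroids[k] = X[c == k].mean(axis=0)
        # Step 4: stop once no centroid moved
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, c

# Toy data, made up for illustration: two well-separated blobs
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                   rng.normal(5.0, 0.5, size=(50, 2))])
mu, c = kmeans_sketch(X_toy, init_centroids=[[1.0, 1.0], [4.0, 4.0]])
print(mu)
```

On data this well separated, the two centroids should settle near the blob centers after a couple of iterations.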

Note that this algorithm does not guarantee that the global minimum of $J$ is reached, as it can get stuck in a local minimum.
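
A common way to reduce the risk of a poor local minimum is to run the algorithm from several random initializations and keep the run with the lowest $J$. The sketch below illustrates that idea under stated assumptions: `run_kmeans`, `cost`, the fixed iteration count, the number of restarts, and the toy data are hypothetical choices for the example, not part of the exercise.

```python
import numpy as np

def assign(X, centroids):
    # Index of the nearest centroid for every point
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def run_kmeans(X, init_centroids, n_iters=20):
    # Hypothetical helper: a fixed number of assign/update rounds
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(n_iters):
        c = assign(X, centroids)
        for k in range(len(centroids)):
            if np.any(c == k):  # skip empty clusters
                centroids[k] = X[c == k].mean(axis=0)
    return centroids, assign(X, centroids)

def cost(X, centroids, c):
    # J = mean squared distance of each point to its assigned centroid
    return np.mean(np.sum((X - centroids[c]) ** 2, axis=1))

# Toy data, made up for illustration: three blobs along the diagonal
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(loc, 0.3, size=(40, 2)) for loc in (0.0, 3.0, 6.0)])

costs = []
best_J, best_mu = np.inf, None
for _ in range(10):
    # Initialize from 3 distinct, randomly chosen data points
    init = X_toy[rng.choice(len(X_toy), size=3, replace=False)]
    mu, c = run_kmeans(X_toy, init)
    J = cost(X_toy, mu, c)
    costs.append(J)
    if J < best_J:
        best_J, best_mu = J, mu

print(best_J)
```

Comparing the values in `costs` shows whether different initializations ended in different minima.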

## ================= Part 1: Find Closest Centroids ====================

In [3]:
data = loadmat("ex7data2.mat")
X = data['X']
X.shape

(300, 2)

In [4]:
# Select an initial set of centroids
ini_centroids = np.array([[3,3],[6,2],[8,5]])

In [5]:
# Find the closest centroid to each sample
def findClosestCentroids(X, centroids):
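    # idx[i] will hold the index of the centroid closest to X[i]
    idx = [-1 for i in range(X.shape[0])]
    for i in range(X.shape[0]):
        # Smallest distance found so far; np.inf is safe for data of any scale
        md = np.inf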
        for j in range(centroids.shape[0]):
            d = np.linalg.norm(X[i]-centroids[j])
            if d < md:
                md = d
                idx[i] = j
    return idx
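
The double loop above can also be written with NumPy broadcasting. The following is a sketch of that alternative, not a replacement for `findClosestCentroids`: the name `find_closest_centroids_vec` and the demo arrays are made up for illustration, and the function returns a NumPy array rather than a list.

```python
import numpy as np

def find_closest_centroids_vec(X, centroids):
    # (m, 1, d) - (1, K, d) broadcasts to (m, K, d)
    diffs = X[:, None, :] - centroids[None, :, :]
    # Squared distances are enough to find the nearest centroid
    sq_dists = np.sum(diffs ** 2, axis=2)  # shape (m, K)
    # argmin returns the lowest index on ties, like the strict `<` in the loop
    return np.argmin(sq_dists, axis=1)

# Made-up points placed near each of the three initial centroids
X_demo = np.array([[2.9, 3.1], [7.5, 4.8], [6.2, 1.9]])
centroids_demo = np.array([[3, 3], [6, 2], [8, 5]])
idx_demo = find_closest_centroids_vec(X_demo, centroids_demo)
print(idx_demo)  # [0 2 1]
```

The intermediate array has shape `(m, K, d)`, so this trades memory for speed. That is fine for data of this size but may not be for very large $m$ or $K$.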


In [7]:
idx = findClosestCentroids(X, ini_centroids)

In [9]:
print("Closest centroids for the first 3 examples: ")
print(idx[:3])
print("(the closest centroids should be 0, 2, 1 respectively)")

Closest centroids for the first 3 examples: 
[0, 2, 1]
(the closest centroids should be 0, 2, 1 respectively)


## ===================== Part 2: Compute Means =========================

In [29]:
# Compute means based on the closest centroids
def computeCentroids(X, idx, K):
    idx = np.asarray(idx)
    # One row per centroid, one column per feature (works for any dimension)
    centroids = np.zeros((K, X.shape[1]))
    for i in range(K):
        # Mean of all samples currently assigned to cluster i
        centroids[i] = X[idx == i].mean(axis=0)
    return centroids
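
If a cluster ends up with no assigned points, the mean of an empty selection is `nan`. One possible way to handle this, sketched below, is to keep the previous centroid for that cluster. The function name `compute_centroids_safe`, its `prev_centroids` parameter, and the demo data are hypothetical and not part of the exercise.

```python
import numpy as np

def compute_centroids_safe(X, idx, prev_centroids):
    idx = np.asarray(idx)
    # Start from the previous positions so empty clusters stay put
    centroids = np.array(prev_centroids, dtype=float)
    for k in range(len(centroids)):
        members = X[idx == k]
        if len(members) > 0:
            centroids[k] = members.mean(axis=0)
    return centroids

# Made-up example in which cluster 1 receives no points
X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
idx_demo = [0, 0, 2]
prev = np.array([[1.0, 1.0], [5.0, 5.0], [9.0, 9.0]])
new_centroids = compute_centroids_safe(X_demo, idx_demo, prev)
print(new_centroids)
```

Here clusters 0 and 2 move to the mean of their points, while cluster 1 keeps its previous position `[5., 5.]`. Re-initializing an empty cluster at a random data point is another option.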

In [30]:
centroids = computeCentroids(X, idx, ini_centroids.shape[0])
centroids

array([[2.42830111, 3.15792418],
       [5.81350331, 2.63365645],
       [7.11938687, 3.6166844 ]])

In [31]:
print("Centroids computed after initial finding of closest centroids: ")
print(centroids)
print("the centroids should be")
print("[ 2.428301 3.157924 ]\n[ 5.813503 2.633656 ]\n[ 7.119387 3.616684 ]\n")

Centroids computed after initial finding of closest centroids: 
[[2.42830111 3.15792418]
 [5.81350331 2.63365645]
 [7.11938687 3.6166844 ]]
the centroids should be
[ 2.428301 3.157924 ]
[ 5.813503 2.633656 ]
[ 7.119387 3.616684 ]



## =================== Part 3: K-Means Clustering ======================