#K Means

Today we'll be going over K Means. This is the first *unsupervised* algorithm we'll go through. That is, there will be no assumption of labels in our data. 

Credits: We used a guide written by Tony Yiu on https://towardsdatascience.com/k-means-clustering-from-scratch-6a9d19cafc25 to get an idea of the concept behind the algorithm. However, our implementation is a little different.


#Theory of K Means

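The name sounds a lot like KNN, but the two are different algorithms. Like KNN, K Means isn't trained by running gradient descent on a loss function (although each iteration never increases the total squared distance from the points to their assigned centers); unlike KNN, however, K Means is unsupervised.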


We'll go through the general steps of how to implement K Means:
1.   Randomly pick K sets of coordinates to serve as the initial cluster centers.
2.   Calculate the distance from each point to each cluster center.
3.   Place each point in the cluster whose center it's closest to.
4.   Calculate the new center of each cluster.
5.   Rinse and repeat until the centers stop changing.
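
The five steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation built later in this notebook: the function name `kmeans_sketch`, the `max_iter` and `seed` parameters, and the choice to start from K randomly chosen data points (rather than arbitrary coordinates) are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, K, max_iter=100, seed=0):
    """Illustrative K Means loop; the name and defaults are assumptions."""
    rng = np.random.default_rng(seed)
    # Step 1: use K distinct data points as the starting centers (an assumption)
    centers = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: distance from every point to every center, shape (N, K)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Step 3: index of the closest center for each point
        labels = dist.argmin(axis=1)
        # Step 4: mean of each cluster (an empty cluster keeps its old center)
        new_centers = centers.copy()
        for k in range(K):
            if np.any(labels == k):
                new_centers[k] = X[labels == k].mean(axis=0)
        # Step 5: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Made-up toy data: two well-separated groups of three points each
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
demo_centers, demo_labels = kmeans_sketch(X_demo, 2)
print(demo_centers)
print(demo_labels)
```

On this toy data the two centers should settle at the mean of each group, and the loop exits as soon as an iteration leaves the centers unchanged.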



#Load Libraries and Dataset

In [None]:
import numpy as np
import random
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

In [None]:
iris = datasets.load_iris()
Z = iris.data[:, :2]  # keep only the first two features
y = iris.target
# binarize the labels: class 0 stays 0, classes 1 and 2 become 1
z = (y != 0).astype(int).reshape(-1, 1)
X = np.hstack((Z, z))  # note: the binarized label is appended as a third column

scaler = StandardScaler()  # create the scaler object
scaler.fit(X)              # compute each column's mean and standard deviation
X = scaler.transform(X)    # standardize X (zero mean, unit variance per column)
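
To make the scaling step concrete, here is a small NumPy-only sketch of the same arithmetic: subtract each column's mean and divide by its population standard deviation, which is what `StandardScaler` does with its default settings. The matrix `A` holds made-up values, not the iris data.

```python
import numpy as np

# Toy matrix (made-up values): 3 samples, 2 features
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

mean = A.mean(axis=0)  # per-column mean
std = A.std(axis=0)    # per-column population standard deviation (ddof=0)
A_scaled = (A - mean) / std

print(A_scaled.mean(axis=0))  # approximately 0 for every column
print(A_scaled.std(axis=0))   # approximately 1 for every column
```

Standardizing matters here because K Means relies on Euclidean distance, so a feature measured on a larger scale would otherwise dominate the clustering.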

#Functions

Okay, so the first function computes the distance from every data point to every center. Each row of the result corresponds to a data point and each column to a cluster. For example, the $(10,1)$ entry is the distance from the 11th data point to the second center (since we index from 0).

In [None]:
def calc_dist_to_center(X, K, centers):
    # dist_to_center[n][k] = Euclidean distance from point n to center k
    dist_to_center = np.zeros((X.shape[0], K), dtype='float')
    for k in range(K):
        for n in range(X.shape[0]):
            dist_to_center[n][k] = np.linalg.norm(X[n, :] - centers[k, :])
    return dist_to_center
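
As an aside, the same distance matrix can be computed without explicit loops by using NumPy broadcasting. The function name `calc_dist_to_center_vec` and the demo arrays are hypothetical, introduced only for this sketch.

```python
import numpy as np

def calc_dist_to_center_vec(X, centers):
    """Hypothetical loop-free variant: (N, K) matrix of Euclidean distances."""
    # (N, 1, D) - (1, K, D) broadcasts to (N, K, D): every point minus every center
    diff = X[:, None, :] - centers[None, :, :]
    return np.linalg.norm(diff, axis=2)

# Made-up points and centers chosen so the distances are easy to check by hand
X_demo = np.array([[0.0, 0.0], [3.0, 4.0]])
centers_demo = np.array([[0.0, 0.0], [3.0, 0.0]])
dist_demo = calc_dist_to_center_vec(X_demo, centers_demo)
print(dist_demo)  # rows are points, columns are centers: [[0, 3], [5, 4]]
```

The result has the same layout as the looped version: one row per data point and one column per center.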

Once we've found our distances to the centers, we need to assign each data point to the cluster it is closest to. We'll do this by finding the argmin of each row of the distance matrix we generated above and setting the corresponding entry in our "cluster" matrix to $1$, while the other entries in that row stay zero.

In [None]:
def assign_center(dist_to_cluster, K):
    # size the result from the argument rather than from the global X
    cluster = np.zeros((dist_to_cluster.shape[0], K), dtype='int')
    for n in range(dist_to_cluster.shape[0]):
        arg = np.argmin(dist_to_cluster[n, :])  # index of the closest center
        cluster[n][arg] = 1
    return cluster
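
The one-hot "cluster" matrix can also be built in a single step by indexing an identity matrix with the row-wise argmin. This is a sketch only; `assign_center_vec` and the demo distances are hypothetical names and values.

```python
import numpy as np

def assign_center_vec(dist_to_cluster):
    """Hypothetical loop-free variant: one-hot cluster membership, shape (N, K)."""
    K = dist_to_cluster.shape[1]
    closest = np.argmin(dist_to_cluster, axis=1)  # closest center per point
    # Row i of the identity matrix is the one-hot vector for cluster i
    return np.eye(K, dtype=int)[closest]

# Made-up distances: 3 points, 2 centers (the last row is a tie)
dist_demo = np.array([[0.0, 3.0],
                      [5.0, 4.0],
                      [2.0, 2.0]])
membership_demo = assign_center_vec(dist_demo)
print(membership_demo)  # [[1, 0], [0, 1], [1, 0]]
```

When two centers are equally close, `np.argmin` returns the first of them, so the tied point lands in the lower-numbered cluster.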

We'll now calculate the new centers based on our cluster groupings from before by taking the average of the entries of the data points of $X$ in each cluster.

In [None]:
def calc_new_centers(X, K, in_which_cluster):
    new_centers = np.zeros((K, X.shape[1]), dtype='float')
    for k in range(K):
        coordinates_sum = np.zeros(X.shape[1], dtype='float')
        count = 0
        for n in range(X.shape[0]):
            if in_which_cluster[n][k] != 0:
                coordinates_sum = coordinates_sum + X[n, :]
                count = count + 1
        # an empty cluster keeps the all-zero row, i.e. its center moves to the origin
        if count != 0:
            new_centers[k, :] = coordinates_sum / count
    return new_centers
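
Because the membership matrix is one-hot, the per-cluster sums can be written as a single matrix product. The sketch below assumes the same empty-cluster behavior as the function above (an empty cluster's center ends up at the origin); `calc_new_centers_vec` and the demo data are hypothetical.

```python
import numpy as np

def calc_new_centers_vec(X, in_which_cluster):
    """Hypothetical loop-free variant: mean of the points in each cluster, shape (K, D)."""
    counts = in_which_cluster.sum(axis=0)  # number of points per cluster, shape (K,)
    sums = in_which_cluster.T @ X          # coordinate sums per cluster, shape (K, D)
    # Divide by at least 1 so an empty cluster (sum of zeros) stays at the origin
    return sums / np.maximum(counts, 1)[:, None]

# Made-up data: 3 points, 3 clusters, with the third cluster left empty
X_demo = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 0.0]])
membership_demo = np.array([[1, 0, 0],
                            [1, 0, 0],
                            [0, 1, 0]])
centers_demo = calc_new_centers_vec(X_demo, membership_demo)
print(centers_demo)  # [[1, 1], [10, 0], [0, 0]]
```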

Let's put this all together now and see what happens.

In [None]:
K = 2  # number of clusters
in_cluster = np.zeros((X.shape[0], K), dtype='int')  # 1 = in cluster, 0 otherwise
center_coords = np.random.rand(K, X.shape[1])  # random initial centers, uniform in [0, 1)
for j in range(1000):  # fixed number of K Means iterations (no early stopping)
    dist = calc_dist_to_center(X, K, center_coords)
    in_cluster = assign_center(dist, K)
    center_coords = calc_new_centers(X, K, in_cluster)
print(center_coords)

[[ 0.50728948 -0.42663134  0.70710678]
 [-1.01457897  0.85326268 -1.41421356]]


Let's see how this compares to sklearn:

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=42).fit(X)
kmeans.cluster_centers_

array([[-1.01457897,  0.85326268, -1.41421356],
       [ 0.50728948, -0.42663134,  0.70710678]])

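It looks like my algorithm matches sklearn: the centers agree to every digit shown, just in the opposite order, which is fine because cluster numbering is arbitrary. However, the sklearn algorithm is definitely much more efficient, since mine relies on nested Python loops and always runs all 1000 iterations. I don't even want to discuss the difference in runtime.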
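
Since the order of the clusters is arbitrary, one way to compare two sets of centers programmatically is to sort the rows of each before checking them. The sketch below copies the two printed outputs from above; the helper `sort_rows` is a hypothetical name.

```python
import numpy as np

# Centers copied from the two outputs printed above
ours = np.array([[ 0.50728948, -0.42663134,  0.70710678],
                 [-1.01457897,  0.85326268, -1.41421356]])
sk = np.array([[-1.01457897,  0.85326268, -1.41421356],
               [ 0.50728948, -0.42663134,  0.70710678]])

def sort_rows(c):
    # np.lexsort treats the LAST key as primary, so pass the columns reversed
    # to sort by the first coordinate, then the second, and so on
    return c[np.lexsort(c.T[::-1])]

same_centers = np.allclose(sort_rows(ours), sort_rows(sk))
print(same_centers)  # True
```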