### PROBLEM 2: KMeans on data

##### Use Euclidean distance or dot-product similarity (choose one per dataset; you can also try other similarity metrics).
##### Run KMeans on the FASHION dataset with K=10.

You may use a library for distance/similarity computations, but you must implement KMeans yourself (E and M steps, termination criteria, etc.).
For all three datasets, evaluate the KMeans objective for a higher K (for example, double) or a smaller K (for example, half).
For all three datasets, evaluate external clustering performance using the data labels and the Purity and Gini Index metrics (see [A] book section 6.9.2).
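
For reference, the three quantities named above are commonly written as follows. The notation is an assumption made here and may differ from the book's: $C_j$ is cluster $j$ with centroid $\mu_j$, $m_{ij}$ is the number of points of true class $i$ in cluster $j$, and $M_j = \sum_i m_{ij}$ is the size of cluster $j$.

$$
J = \sum_{j=1}^{K} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2
$$

$$
\text{Purity} = \frac{\sum_{j=1}^{K} \max_i m_{ij}}{\sum_{j=1}^{K} M_j}, \qquad
G_j = 1 - \sum_{i} \left( \frac{m_{ij}}{M_j} \right)^2, \qquad
G_{\text{avg}} = \frac{\sum_{j=1}^{K} G_j \, M_j}{\sum_{j=1}^{K} M_j}
$$

Under these definitions a lower objective $J$, a higher purity, and a lower average Gini index each indicate a better clustering.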

-----

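The code below imports the Fashion MNIST dataset, flattens each 28x28 image into a 784-dimensional vector, and scales every pixel feature to [0, 1] with a MinMaxScaler fitted on the training set. It then draws a random sample of 20,000 training images and splits that sample 90/10 into train and validation sets. The test set is used as provided by the dataset.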

In [1]:
import numpy as np
import random
from keras.datasets import fashion_mnist
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

(trainX, trainy), (testX, testy) = fashion_mnist.load_data()

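# Flatten each 28x28 image into a 784-dimensional vector
trainX = np.reshape(trainX, (-1, 784))
testX = np.reshape(testX, (-1, 784))

# Scale every pixel feature to [0, 1]; the scaler is fitted on the training set only
scaler = MinMaxScaler()
trainX = scaler.fit_transform(trainX)
testX = scaler.transform(testX)

# Draw a random subset of 20,000 training images (unseeded, so it changes between runs)
random_sample_indices = random.sample(range(trainX.shape[0]), 20000)
train_images_25 = trainX[random_sample_indices]
train_labels_25 = trainy[random_sample_indices]

# Split the subset 90/10 into train and validation sets
# (the _80/_10 suffixes are only names; test_size=0.1 sets the actual ratio)
(train_images_final_80, validation_images_final_10,
 train_labels_final_80, validation_labels_final_10) = train_test_split(
    train_images_25, train_labels_25, test_size=0.1, random_state=42)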

print("Final train dataset size: ", train_images_final_80.shape)
print("Final validation dataset size: ", validation_images_final_10.shape)

Final train dataset size:  (18000, 784)
Final validation dataset size:  (2000, 784)


KMeans implementation

In [2]:
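def kmeans(X, num_clusters, max_iterations=100):
    # Initialise the centroids with num_clusters distinct points drawn from X
    centroids = X[np.random.choice(X.shape[0], num_clusters, replace=False)]
    for _ in range(max_iterations):

        # E-step: assign every point to its nearest centroid (Euclidean distance).
        # Broadcasting builds an (n_samples, num_clusters, n_features) array,
        # which is memory-hungry for large inputs.
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        labels = np.argmin(distances, axis=1)

        # M-step: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid instead of becoming NaN.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(num_clusters)
        ])

        # Termination: stop as soon as no centroid moves
        if np.all(centroids == new_centroids):
            break

        centroids = new_centroids

    return labels, centroids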
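
The problem statement asks for the KMeans objective, which `kmeans` does not return. A minimal sketch of how it could be computed from the function's outputs is given below, assuming the usual sum-of-squared-distances objective. The helper name `kmeans_objective` and the demo arrays are hypothetical and not part of the original notebook.

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]  # row i is X[i] minus the centroid of its cluster
    return float(np.sum(diffs ** 2))

# Tiny hand-checkable example (made-up data, not Fashion MNIST)
X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels_demo = np.array([0, 0, 1, 1])
centroids_demo = np.array([[0.0, 1.0], [10.0, 2.0]])

sse_demo = kmeans_objective(X_demo, labels_demo, centroids_demo)
print(sse_demo)  # 1 + 1 + 4 + 4 = 10.0
```

With the notebook's variables, a call such as `kmeans_objective(train_images_final_80, labels, centroids)` after each `kmeans` run would give one objective value per K. The objective typically decreases as K grows, so values for different K are not directly comparable as a quality score.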

In [3]:
labels, centroids = kmeans(train_images_final_80, 10)

In [4]:
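def calculate_accuracy(true_labels, cluster_labels):
    # Despite the name, this computes purity: each cluster is mapped to its
    # majority class and the points matching that class are counted as correct
    accuracy = 0
    # Iterate over the cluster ids that actually occur, so a missing id
    # (an empty cluster) cannot produce an empty mask
    for i in np.unique(cluster_labels):
        mask = cluster_labels == i
        correct_label = np.argmax(np.bincount(true_labels[mask]))
        accuracy += np.sum(true_labels[mask] == correct_label)
    return accuracy / len(true_labels)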
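
The same purity value can be read off a class-by-cluster contingency table, which makes the majority vote per cluster visible. The sketch below is an illustration on made-up labels; the helper name `purity_from_contingency` is hypothetical.

```python
import numpy as np

def purity_from_contingency(true_labels, cluster_labels):
    """Build the class-by-cluster count table and derive purity from it."""
    classes, class_idx = np.unique(true_labels, return_inverse=True)
    clusters, cluster_idx = np.unique(cluster_labels, return_inverse=True)
    table = np.zeros((len(classes), len(clusters)), dtype=int)
    # table[i, j] counts the points of class i that fell into cluster j
    np.add.at(table, (class_idx, cluster_idx), 1)
    # Each column (cluster) contributes the count of its majority class
    return table, table.max(axis=0).sum() / table.sum()

true_demo = np.array([0, 0, 1, 1, 1, 1])
clusters_demo = np.array([0, 0, 0, 0, 1, 1])
table_demo, purity_demo = purity_from_contingency(true_demo, clusters_demo)
print(table_demo)   # rows are classes, columns are clusters
print(purity_demo)  # (2 + 2) / 6
```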

In [5]:
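def calculate_gini(labels):
    # Gini impurity of the distribution of `labels`: 1 - sum_k p_k^2,
    # where p_k is the fraction of points carrying label k.
    # When called on cluster assignments it measures how evenly the points
    # are spread over the clusters; it does not use the ground-truth classes.
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / len(labels)
    gini_index = 1 - np.sum(probabilities ** 2)
    return gini_index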
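
As an external measure, the Gini index is usually computed over the true classes inside each cluster and then averaged with cluster sizes as weights. The sketch below shows that variant under this assumption; the helper name `cluster_gini_index` and the demo labels are hypothetical, and the figures printed elsewhere in this notebook were not produced with it.

```python
import numpy as np

def cluster_gini_index(true_labels, cluster_labels):
    """Size-weighted average of the per-cluster Gini impurity over the true classes."""
    total = 0.0
    for j in np.unique(cluster_labels):
        members = true_labels[cluster_labels == j]
        _, counts = np.unique(members, return_counts=True)
        p = counts / len(members)                        # class proportions in cluster j
        total += len(members) * (1.0 - np.sum(p ** 2))   # weight by cluster size
    return total / len(true_labels)

true_demo = np.array([0, 0, 1, 1, 1, 1])
clusters_demo = np.array([0, 0, 0, 0, 1, 1])
gini_demo = cluster_gini_index(true_demo, clusters_demo)
print(gini_demo)  # (4 * 0.5 + 2 * 0.0) / 6
```

A clustering in which every cluster contains a single class scores 0 with this helper, and larger values indicate more class mixing within clusters.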

Calculating purity for the clustering run on the training data

In [6]:
accuracy = calculate_accuracy(train_labels_final_80, labels)
print("Purity:", accuracy)

Purity: 0.5227222222222222


Calculating the Gini index of the cluster assignments (how evenly points are spread across clusters)

In [7]:
gini1 = calculate_gini(labels)
print("Gini index:", gini1)

Gini index: 0.8869473148148148


Re-running KMeans on the test data and computing the same metrics

In [8]:
labels, centroids = kmeans(testX, 10)
accuracy = calculate_accuracy(testy, labels)
print("Purity:", accuracy)

gini2 = calculate_gini(labels)
print("Gini index:", gini2)

Purity: 0.5812
Gini index: 0.89075132
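
The cell above fits a fresh clustering on the test set. An alternative, sketched here as an assumption and not what the notebook does, is to keep the centroids learned on the training split and only assign the test points to them. The helper name `assign_to_centroids` is hypothetical; it uses the expansion ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, which avoids building a three-dimensional difference array.

```python
import numpy as np

def assign_to_centroids(X, centroids):
    """Return, for each row of X, the index of its nearest centroid (Euclidean)."""
    sq_dists = (
        (X ** 2).sum(axis=1)[:, None]            # ||x||^2, shape (n, 1)
        - 2.0 * X @ centroids.T                  # -2 x.c, shape (n, k)
        + (centroids ** 2).sum(axis=1)[None, :]  # ||c||^2, shape (1, k)
    )
    return np.argmin(sq_dists, axis=1)

# Made-up centroids and points for a quick check
centroids_demo = np.array([[0.0, 0.0], [10.0, 10.0]])
X_new_demo = np.array([[1.0, -1.0], [9.0, 12.0], [4.0, 4.0]])
assigned_demo = assign_to_centroids(X_new_demo, centroids_demo)
print(assigned_demo)  # [0 1 0]
```

In the notebook this would correspond to something like `assign_to_centroids(testX, centroids)` with the centroids from the training run, followed by the same purity and Gini calls.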


Evaluating a higher K (K=20). The value printed as "Accuracy" is the purity computed by calculate_accuracy:

In [9]:
labels, centroids = kmeans(train_images_final_80, 20)
accuracy = calculate_accuracy(train_labels_final_80, labels)
print("Accuracy with k=20: ", accuracy)

gini = calculate_gini(labels)
print("Gini index with k=20:", gini)

Accuracy with k=20:  0.6481666666666667
Gini index with k=20: 0.9430828641975308


Evaluating a lower K (K=5). The value printed as "Accuracy" is the purity computed by calculate_accuracy:

In [10]:
labels, centroids = kmeans(train_images_final_80, 5)
accuracy = calculate_accuracy(train_labels_final_80, labels)
print("Accuracy with k=5: ", accuracy)

gini = calculate_gini(labels)
print("Gini index with k=5:", gini)

Accuracy with k=5:  0.4132222222222222
Gini index with k=5: 0.7872556604938271
