### PROBLEM 2: KMeans on data

##### Use Euclidean distance or dot-product similarity (choose one per dataset; you may also try other similarity metrics).
##### Run KMeans on the 20NG Dataset, try K=20

You may use a library for distance/similarity computations, but you must implement KMeans yourself (E and M steps, termination criteria, etc.).
For all three datasets, evaluate the KMeans objective for a larger K (for example, double) or a smaller K (for example, half).
For all three datasets, evaluate external clustering performance against the data labels using the Purity and Gini Index metrics (see [A] book section 6.9.2).

-----

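The code below loads the 20NG train and test subsets, converts them to TF-IDF vectors (L2-normalized by default), draws a random sample of 3,000 training documents, and splits that sample 90/10 into training and validation sets. Despite the `_80`/`_10` suffixes in the variable names, `test_size=0.1` yields 2,700 training and 300 validation documents.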

In [1]:
import numpy as np
import random
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

vectorizer = TfidfVectorizer(stop_words='english')

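# Fit the vocabulary and IDF weights on the training documents only
train_data_vector = vectorizer.fit_transform(newsgroups_train.data).toarray()
# Reuse the fitted vocabulary so the test vectors share the training feature space
test_data_vector = vectorizer.transform(newsgroups_test.data).toarray()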

train_labels = newsgroups_train.target
test_labels = newsgroups_test.target

random_sample_indices = random.sample(range(train_data_vector.shape[0]), 3000)
train_data_20 = train_data_vector[random_sample_indices]
train_labels_20 = train_labels[random_sample_indices]

train_data_final_80, validation_data_final_10, train_labels_final_80, validation_labels_final_10 = train_test_split(
    train_data_20, train_labels_20, test_size=0.1, random_state=42
)

KMeans Implementation

In [2]:
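from sklearn.metrics.pairwise import euclidean_distances

def kmeans(X, num_clusters, max_iterations=100):
    # Initialize centroids from distinct, randomly chosen data points
    centroids = X[np.random.choice(X.shape[0], num_clusters, replace=False)]
    for _ in range(max_iterations):

        # E-step: assign each point to its nearest centroid.
        # euclidean_distances avoids materializing an (n, k, d) difference array.
        distances = euclidean_distances(X, centroids)
        labels = np.argmin(distances, axis=1)

        # M-step: move each centroid to the mean of its assigned points;
        # an empty cluster keeps its previous centroid instead of becoming NaN
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(num_clusters)
        ])

        # Terminate once the centroids stop moving (within floating-point tolerance)
        if np.allclose(centroids, new_centroids):
            break

        centroids = new_centroids

    return labels, centroids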
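
The assignment also asks for the KMeans objective at different values of K, which the cells here do not compute. A minimal sketch of one way to do it follows, assuming the objective is the usual sum of squared Euclidean distances from each point to its assigned centroid. The helper name `kmeans_objective` and the toy arrays are illustrative and not part of the original notebook.

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    # centroids[labels] picks, for every row of X, the centroid it was assigned to
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

# Toy check (hypothetical data): every point lies exactly 1 unit from its centroid
X_toy = np.array([[0.0], [2.0], [10.0], [12.0]])
labels_toy = np.array([0, 0, 1, 1])
centroids_toy = np.array([[1.0], [11.0]])

sse_toy = kmeans_objective(X_toy, labels_toy, centroids_toy)
print("Objective:", sse_toy)  # 4 points * 1^2 = 4.0
```

Called with the `labels` and `centroids` returned by `kmeans`, the same helper should give one number per run. The objective generally decreases as K grows, so values for different K describe fit, not cluster quality.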

In [3]:
# labels, centroids = kmeans(train_data_20, 10)
labels, centroids = kmeans(validation_data_final_10, 10)

Calculating Purity

In [4]:
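def calculate_accuracy(true_labels, cluster_labels):
    # Purity: each cluster is credited with its majority class; the score is
    # the fraction of all points that belong to their cluster's majority class
    accuracy = 0
    for i in np.unique(cluster_labels):  # iterate over cluster ids actually present
        mask = cluster_labels == i
        correct_label = np.argmax(np.bincount(true_labels[mask]))
        accuracy += np.sum(true_labels[mask] == correct_label)
    return accuracy / len(true_labels)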

accuracy = calculate_accuracy(validation_labels_final_10, labels)
print("Purity:", accuracy)

Purity: 0.24333333333333335
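
For reference, the function above matches the usual definition of purity. The notation is introduced here for illustration: $m_{ij}$ is the number of points of true class $i$ in cluster $j$, and $M_j = \sum_i m_{ij}$ is the size of cluster $j$.

$$
\text{Purity} = \frac{\sum_{j=1}^{k} \max_i m_{ij}}{\sum_{j=1}^{k} M_j}
$$

With 300 validation points, a purity of 0.2433 corresponds to 73 points falling in the majority class of their cluster.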


Calculating Gini index

In [5]:
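def calculate_gini(labels):
    # Gini impurity of whatever label vector is passed in: 1 - sum(p_i^2).
    # Note: when called on cluster assignments, this measures how evenly points
    # are spread across clusters; it does not use the true class labels.
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / len(labels)
    gini_index = 1 - np.sum(probabilities ** 2)
    return gini_index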

In [6]:
gini1 = calculate_gini(labels)
print("Gini index:", gini1)

Gini index: 0.8497111111111111
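
The value above comes from the cluster assignments alone, so it reflects how balanced the cluster sizes are. As an external measure, the Gini index is commonly defined per cluster over the true class distribution and then averaged with cluster-size weights. The sketch below follows that definition, assuming this is the form intended by the assignment. The name `class_gini_index` and the toy arrays are hypothetical, not part of the original notebook.

```python
import numpy as np

def class_gini_index(true_labels, cluster_labels):
    """Size-weighted average of per-cluster Gini impurity over the true classes."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    weighted_sum = 0.0
    for j in np.unique(cluster_labels):
        # True classes of the points assigned to cluster j
        members = true_labels[cluster_labels == j]
        _, counts = np.unique(members, return_counts=True)
        p = counts / members.size
        # Per-cluster impurity, weighted by the cluster size
        weighted_sum += members.size * (1.0 - np.sum(p ** 2))
    return weighted_sum / true_labels.size

# Toy check (hypothetical data): cluster 0 is a 50/50 mix, cluster 1 is pure
true_toy = np.array([0, 0, 1, 1, 1, 1])
clusters_toy = np.array([0, 0, 0, 0, 1, 1])

gini_toy = class_gini_index(true_toy, clusters_toy)
print("Class-based Gini index:", gini_toy)  # (4 * 0.5 + 2 * 0.0) / 6, about 0.333
```

Under this definition lower is better: a clustering in which every cluster contains a single class scores 0.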


In [7]:
labels, centroids = kmeans(validation_data_final_10, 20)
accuracy = calculate_accuracy(validation_labels_final_10, labels)
print("Purity:", accuracy)
gini1 = calculate_gini(labels)
print("Gini index:", gini1)

Purity: 0.30333333333333334
Gini index: 0.9248444444444445


In [8]:
labels, centroids = kmeans(validation_data_final_10, 5)
accuracy = calculate_accuracy(validation_labels_final_10, labels)
print("Purity:", accuracy)
gini1 = calculate_gini(labels)
print("Gini index:", gini1)

Purity: 0.17
Gini index: 0.7424222222222222
