####First, download a simulated dataset: kmeans_data.zip from Modules->Datasets. Then, implement the K-means algorithm from scratch. K-means algorithm computes the distance of a given data point pair. Replace the distance computation function with Euclidean distance, 1- Cosine similarity, and 1 – the Generalized Jarcard similarity (refer to: https://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/jaccard.htm).

---
### Q1: Run K-means clustering with Euclidean, Cosine and Jarcard similarity. Specify K= the number of categorical values of y (the number of classifications). Compare the SSEs of Euclidean-K-means, Cosine-K-means, Jarcard-K-means. Which method is better? (10 points)
---

In [1]:
import pandas as pd
import numpy as np
import random
import time

In [2]:
data = pd.read_csv("/content/data.csv",header=None).values

label = pd.read_csv("/content/label.csv",header=None).values.reshape(-1)


In [3]:
data

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [4]:
label

array([7, 2, 1, ..., 4, 5, 6])

Defining distances/similarities

In [5]:
def euclidean_dist(a, b):
    return np.linalg.norm(a - b)

def cosine_sim(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def jaccard_sim(a, b):
    return 1 - np.sum(np.minimum(a, b)) / np.sum(np.maximum(a, b))

**K-Means**

In [6]:
def k_means(data, k, sim, max_iterations=500):
    centroids = random.sample(list(data), k)
    iterations = 0
    old_sse = 0
    while True:
        # Assign each data point to its closest centroid
        clusters = [[] for i in range(k)]
        for point in data:
            distances = [sim(point, centroid) for centroid in centroids]
            cluster_index = np.argmin(distances)
            clusters[cluster_index].append(point)
        
        # Calculate the new centroid for each cluster
        new_centroids = []
        sse = 0
        for i in range(k):
            cluster = clusters[i]
            if len(cluster) == 0:
                new_centroids.append(centroids[i])
                continue
            centroid = np.mean(cluster, axis=0)
            new_centroids.append(centroid)
            sse += np.sum([sim(point, centroid) for point in cluster])
        
        # Check for convergence
        if np.allclose(new_centroids, centroids) or sse >= old_sse or iterations >= max_iterations:
            break
        
        # Update the centroids and SSE
        centroids = new_centroids
        old_sse = sse
        iterations += 1
    
    return clusters, centroids, sse

In [7]:
k = len(np.unique(label))
k

10

so k = 10

In [8]:
clusters_euclidean, centroids_euclidean, sse_euclidean = k_means(data, 10, euclidean_dist)

clusters_cosine, centroids_cosine, sse_cosine = k_means(data, 10, cosine_sim)

clusters_jaccard, centroids_jaccard, sse_jaccard = k_means(data, 10, jaccard_sim)

In [9]:
print("SSE with Euclidean distance: ", sse_euclidean)
print("SSE with Cosine similarity: ", sse_cosine)
print("SSE with Jaccard similarity: ", sse_jaccard)

SSE with Euclidean distance:  16474295.675340524
SSE with Cosine similarity:  2880.9134811848053
SSE with Jaccard similarity:  6532.516498391192


**Comment:**

Cosine Similarity seems to be the better method.

---
#### Compare the accuracies of Euclidean-K-means Cosine-K-means, Jarcard-K-means. First, label each cluster using the majority vote label of the data points in that cluster. Later, compute the predictive accuracy of Euclidean-K-means, Cosine-K-means, Jarcard-K-means. Which metric is better? (10 points)
---

In [10]:
class KMeans:
    def __init__(self, k=10, max_iter=100, distance=euclidean_dist):
        self.k = k
        self.max_iter = max_iter
        self.distance = distance
    
    def fit(self, X):
        self.centroids = X[np.random.choice(X.shape[0], self.k, replace=False)]
        for i in range(self.max_iter):
            clusters = [[] for _ in range(self.k)]
            for x in X:
                distances = [self.distance(x, c) for c in self.centroids]
                cluster = np.argmin(distances)
                clusters[cluster].append(x)
            new_centroids = []
            for cluster in clusters:
                if len(cluster) == 0:
                    new_centroids.append(np.zeros(X.shape[1]))
                else:
                    new_centroids.append(np.mean(cluster, axis=0))
            if np.allclose(self.centroids, new_centroids):
                break
            self.centroids = new_centroids
    
    def predict(self, X):
        distances = np.array([np.array([self.distance(x, c) for c in self.centroids]) for x in X])
        return np.argmin(distances, axis=1)


In [11]:
def Accuracy(predicted, actual):
    count = 0
    total = len(label)
    for i in range(total):
        if predicted[i] == actual[i]:
            count += 1
    return (count/total)*100

In [13]:
k_euclid = KMeans(k=10, max_iter=100, distance=euclidean_dist)
k_euclid.fit(data)
euclid_predictions = k_euclid.predict(data)

In [15]:
k_cosine = KMeans(k=10, max_iter=100, distance=cosine_sim)
k_cosine.fit(data)
cosine_predictions = k_cosine.predict(data)

In [16]:
k_jarc = KMeans(k=10, max_iter=100, distance=jaccard_sim)
k_jarc.fit(data)
jarc_predictions = k_jarc.predict(data)

In [17]:
print("Accuracy using Euclidean Distance:",Accuracy(euclid_predictions,label))
print("Accuracy using Cosine Similarity:",Accuracy(cosine_predictions,label))
print("Accuracy using Jaccard Similarity:",Accuracy(jarc_predictions,label))

Accuracy using Euclidean Distance: 1.78
Accuracy using Cosine Similarity: 11.09
Accuracy using Jaccard Similarity: 6.859999999999999


**Comment:**


Cosine similarity works better 
