# Worksheet 05

Name: Azad Ellafi
UID: U19478001

### Topics

- Cost Functions
- Kmeans

### Cost Function

Solving Data Science problems often starts by defining a metric for evaluating candidate solutions. This metric is called a cost function. We then work backward and look for a process / algorithm that finds solutions optimizing that cost function.

For example, suppose you are asked to cluster three points A, B, C into two non-empty clusters. If someone gave you the solution `{A, B}, {C}`, how would you evaluate whether this is a good solution?

Notice that because the clusters need to be non-empty and all points must be assigned to a cluster, it must be that two of the three points will be together in one cluster and the third will be alone in the other cluster.

In the above solution, if the distance between A and B is smaller than both the distance between A and C and the distance between B and C, then this is a good solution. The smaller the distance between the two points in the same cluster (here A and B), the better the solution. So we can define our cost function to be that distance (between A and B here)!

The algorithm / process would involve clustering together the two closest points and putting the third in its own cluster. This process optimizes for that cost function because no other pair of points could have a lower distance (although it could equal it).
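
As a minimal sketch of this process, the snippet below picks the closest pair out of three points and reports the resulting cost. The coordinates for A, B, and C and the helper name `best_two_clusters` are made up for illustration; they are not part of the worksheet.

```python
from itertools import combinations
from math import dist

def best_two_clusters(points):
    """Split exactly three named points into two non-empty clusters.

    `points` maps a name to its coordinates. The cost of a solution is the
    distance between the two points that share a cluster.
    """
    # Try every possible pair and keep the one with the smallest distance
    pair = min(combinations(points, 2),
               key=lambda p: dist(points[p[0]], points[p[1]]))
    cost = dist(points[pair[0]], points[pair[1]])
    # The remaining point forms the second cluster on its own
    single = [name for name in points if name not in pair]
    return set(pair), set(single), cost

# Hypothetical coordinates, chosen only for illustration
points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 5.0)}
pair, single, cost = best_two_clusters(points)
print(pair, single, cost)
```

With these coordinates A and B are the closest pair, so they end up together and C is left alone.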

### K means

a) (1-dimensional clustering) Walk through Lloyd's algorithm step by step on the following dataset:

`[0, .5, 1.5, 2, 6, 6.5, 7]` (note: each of these is a 1-dimensional data point)

Given the initial centroids:

`[0, 2]`


## Step 1: Assign Points to Nearest Centroid
In this step, we assign each data point to the nearest centroid. To calculate the distance between a data point and a centroid, you can simply use the absolute difference in 1D space.

Comparing each point's distance to both centroids (0 and 2):

    Data points closer to 0: [0, 0.5]
    Data points closer to 2: [1.5, 2, 6, 6.5, 7]

So, the initial cluster assignments are:

    Cluster 1: [0, 0.5]
    Cluster 2: [1.5, 2, 6, 6.5, 7]
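
The assignment step can be sketched in a few lines of plain Python. The helper name `assign` is an assumption for this example, and ties (which do not occur here) would go to the centroid with the lower index.

```python
def assign(points, centroids):
    """Return, for each 1-D point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            for p in points]

data = [0, 0.5, 1.5, 2, 6, 6.5, 7]
centroids = [0, 2]
labels = assign(data, centroids)

# Group the points by the centroid they were assigned to
clusters = [[p for p, label in zip(data, labels) if label == j]
            for j in range(len(centroids))]
print(clusters)  # [[0, 0.5], [1.5, 2, 6, 6.5, 7]]
```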

## Step 2: Recalculate Centroids
In this step, we recalculate the centroids for each cluster. For 1D data, the centroid is simply the mean of the data points within the cluster.

For Cluster 1:

    Centroid = (0 + 0.5) / 2 = 0.25

For Cluster 2:

    Centroid = (1.5 + 2 + 6 + 6.5 + 7) / 5 = 23 / 5 = 4.6

So, the updated centroids are:

    Cluster 1 centroid: 0.25
    Cluster 2 centroid: 4.6

## Step 3: Update Cluster Assignments
Now that the centroids have moved, we reassign each data point to the nearest of the new centroids.

For the updated Cluster 1 centroid (0.25):

    Data points closer to 0.25: [0, 0.5, 1.5, 2]

For the updated Cluster 2 centroid (4.6):

    Data points closer to 4.6: [6, 6.5, 7]

The points 1.5 and 2 switch clusters: 1.5 is 1.25 away from 0.25 but 3.1 away from 4.6, and 2 is 1.75 away from 0.25 but 2.6 away from 4.6. Because the assignments changed, the algorithm has not converged yet.

## Step 4: Repeat Steps 2 and 3
We repeat steps 2 and 3 until the cluster assignments (and therefore the centroids) stop changing.

Recalculating the centroids:

    Cluster 1 centroid = (0 + 0.5 + 1.5 + 2) / 4 = 1.0
    Cluster 2 centroid = (6 + 6.5 + 7) / 3 = 6.5

Reassigning the points:

    Data points closer to 1.0: [0, 0.5, 1.5, 2]
    Data points closer to 6.5: [6, 6.5, 7]

The assignments did not change this time, so the centroids will not change either.

So, the final clusters and centroids are:

    Cluster 1: [0, 0.5, 1.5, 2]
    Cluster 2: [6, 6.5, 7]

    Cluster 1 centroid: 1.0
    Cluster 2 centroid: 6.5

The algorithm has converged, and these are the final clusters for the given dataset with the initial centroids [0, 2].
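
The hand calculation can be checked mechanically. The sketch below is a plain-Python version of Lloyd's algorithm for 1-D data; the function name `lloyd_1d`, the `max_iter` cap, and the tie-breaking rule (lower centroid index wins) are assumptions made for this example.

```python
def lloyd_1d(points, centroids, max_iter=100):
    """Run Lloyd's algorithm on 1-D data; return (clusters, centroids)."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point
        new_labels = [min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
                      for p in points]
        # Converged: no point changed cluster since the previous round
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its points
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # leave the centroid of an empty cluster where it is
                centroids[j] = sum(members) / len(members)
        print("centroids after update:", centroids)
    clusters = [[p for p, label in zip(points, labels) if label == j]
                for j in range(len(centroids))]
    return clusters, centroids

clusters, centroids = lloyd_1d([0, 0.5, 1.5, 2, 6, 6.5, 7], [0, 2])
print(clusters, centroids)
```

Running it prints the centroids after every update step, which can be compared line by line against the walkthrough.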

b) Describe in plain English what the cost function for k means is.

The cost function for K-means measures how tightly the data points are grouped around their cluster centers. For every point, take the distance to the center of the cluster it is assigned to and square it; the cost is the sum of these squared distances over all points. A lower cost means more compact clusters, so K-means looks for the assignment of points and the positions of the centers that make this sum as small as possible.
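
To make this concrete, here is a small sketch that assumes the usual K-means objective, the sum of squared distances from each point to the centroid of its cluster. The function name `kmeans_cost` is hypothetical, and the two clusterings compared are just two ways of splitting the 1-D data from part a), each paired with the means of its clusters.

```python
def kmeans_cost(clusters, centroids):
    """Sum of squared distances from each 1-D point to its cluster's centroid."""
    return sum((p - c) ** 2
               for members, c in zip(clusters, centroids)
               for p in members)

# Two candidate clusterings of the data from part a), with their cluster means
tight = kmeans_cost([[0, 0.5, 1.5, 2], [6, 6.5, 7]], [1.0, 6.5])
loose = kmeans_cost([[0, 0.5], [1.5, 2, 6, 6.5, 7]], [0.25, 4.6])
print(tight, loose)
```

The first clustering has the lower cost because every point sits close to its own centroid, while the second forces 1.5 and 2 into a cluster whose centroid is far away from them.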

c) For the same number of clusters K, why could there be very different solutions to the K means algorithm on a given dataset?

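There can be very different solutions for the same K because the result of Lloyd's algorithm depends on where the centroids start. The initial centroids are usually chosen at random, and the algorithm only improves the solution it starts from until it reaches a local minimum of the cost function. Different starting centroids can therefore end in different local minima, and so in different final clusterings.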
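
The effect of initialization can be seen on a toy example. The sketch below runs a small plain-Python Lloyd's loop (`lloyd_1d`, a hypothetical helper written for this example) on a made-up 1-D dataset with three obvious groups, once from well-spread initial centroids and once from initial centroids bunched up on the left.

```python
def lloyd_1d(points, centroids, max_iter=100):
    """Run Lloyd's algorithm on 1-D data; return (clusters, centroids)."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid
        new_labels = [min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Move each non-empty cluster's centroid to the mean of its points
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = sum(members) / len(members)
    clusters = [[p for p, label in zip(points, labels) if label == j]
                for j in range(len(centroids))]
    return clusters, centroids

data = [0, 1, 10, 11, 20, 21]  # illustrative data, not from the worksheet
spread = lloyd_1d(data, [0, 10, 20])
bunched = lloyd_1d(data, [0, 1, 10])
print(spread)
print(bunched)
```

Both runs use K = 3 on the same data, yet the second one gets stuck with two centroids sharing the left group and one centroid covering the other two groups.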

d) Does Lloyd's Algorithm always converge? Why / why not?

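Yes, Lloyd's algorithm always converges. Neither the assignment step nor the centroid update step can increase the cost, and there are only finitely many ways to partition the data points into K clusters, so the algorithm must stop after a finite number of iterations. However, it is only guaranteed to reach a local minimum of the cost function, not the global one, so poor initial centroids can still lead to a poor final clustering.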
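
As a quick empirical check (not a proof), the sketch below records the K-means cost after each round of Lloyd's algorithm on the 1-D data from part a). The helper name `lloyd_cost_history` and the `max_iter` cap are assumptions made for this example.

```python
def lloyd_cost_history(points, centroids, max_iter=100):
    """Run 1-D Lloyd's algorithm and return the cost after every round."""
    centroids = list(centroids)
    labels, history = None, []
    for _ in range(max_iter):
        # Assign every point to its nearest centroid
        new_labels = [min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments repeat, so nothing can change any more
        labels = new_labels
        # Move each non-empty cluster's centroid to the mean of its points
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = sum(members) / len(members)
        # Cost: sum of squared distances to the assigned centroid
        history.append(sum((p - centroids[label]) ** 2
                           for p, label in zip(points, labels)))
    return history

history = lloyd_cost_history([0, 0.5, 1.5, 2, 6, 6.5, 7], [0, 2])
print(history)
```

On this input the recorded cost goes down from one round to the next, and the loop stops by itself once the assignments repeat, well before the `max_iter` cap.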

e) Follow along in class the implementation of Kmeans

import numpy as np
from PIL import Image as im
import matplotlib.pyplot as plt
import sklearn.datasets as datasets

centers = [[0, 0], [2, 2], [-3, 2], [2, -4]]
X, _ = datasets.make_blobs(n_samples=300, centers=centers, cluster_std=1, random_state=0)

class KMeans():

    def __init__(self, data, k):
        self.data = data
        self.k = k
        self.assignment = [-1 for _ in range(len(data))]
        self.snaps = []
    
    def snap(self, centers):
        TEMPFILE = "temp.png"

        fig, ax = plt.subplots()
        # Plot the instance's own data (not the global X), colored by cluster
        ax.scatter(self.data[:, 0], self.data[:, 1], c=self.assignment)
        ax.scatter(centers[:, 0], centers[:, 1], c='r')
        fig.savefig(TEMPFILE)
        plt.close(fig)
        self.snaps.append(im.fromarray(np.asarray(im.open(TEMPFILE))))


    def lloyds(self):
        # Initialize cluster centroids randomly from the data
        centers = self.data[np.random.choice(len(self.data), self.k, replace=False)]

        for _ in range(300):
            # Assign each data point to the nearest cluster
            for i, point in enumerate(self.data):
                self.assignment[i] = np.argmin(np.linalg.norm(centers - point, axis=1))

            # Save a snapshot of the current clustering
            self.snap(centers)

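            # Update cluster centroids as the mean of data points in each cluster
            new_centers = centers.copy()
            for cluster_idx in range(self.k):
                cluster_points = self.data[np.array(self.assignment) == cluster_idx]
                if len(cluster_points) > 0:
                    new_centers[cluster_idx] = np.mean(cluster_points, axis=0)

            # Stop once the centroids no longer move (the algorithm has converged)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers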

kmeans = KMeans(X, 6)
kmeans.lloyds()
images = kmeans.snaps

images[0].save(
    'kmeans.gif',
    optimize=False,
    save_all=True,
    append_images=images[1:],
    loop=0,
    duration=500
)