# Worksheet 05

Name:  Carlos Contreras
UID: U63425893

### Topics

- Cost Functions
- Kmeans

### Cost Function

Solving data science problems often starts by defining a metric for evaluating candidate solutions, before you know how to find any. This metric is called a cost function. You then work backward and look for a process / algorithm that finds solutions optimizing that cost function.

For example, suppose you are asked to cluster three points A, B, C into two non-empty clusters. If someone gave you the solution `{A, B}, {C}`, how would you evaluate whether this is a good solution?

Notice that because the clusters need to be non-empty and all points must be assigned to a cluster, it must be that two of the three points will be together in one cluster and the third will be alone in the other cluster.

In the solution above, if A and B are closer to each other than A is to C and than B is to C, then this is a good solution. The smaller the distance between the two points that share a cluster (here A and B), the better the solution. So we can define our cost function to be that distance (here, the distance between A and B)!

The algorithm / process would then be to cluster the two closest points together and put the third in its own cluster. This process optimizes the cost function because no other pair of points can have a smaller distance (although another pair could have an equal one).
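The three-point procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cluster_three_points`, the use of Euclidean distance, and the sample coordinates are hypothetical and not part of the worksheet.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def cluster_three_points(points):
    """Split three labeled points into the closest pair and a singleton.

    `points` maps a label to its coordinates. Returns (pair, single, cost),
    where cost is the distance between the two points that share a cluster.
    """
    # Evaluate the cost function for each of the three possible pairs
    # and keep the pair with the smallest distance.
    pair = min(combinations(points, 2),
               key=lambda p: dist(points[p[0]], points[p[1]]))
    single = next(label for label in points if label not in pair)
    cost = dist(points[pair[0]], points[pair[1]])
    return set(pair), {single}, cost


# Hypothetical coordinates chosen so that A and B are the closest pair.
example = {"A": (0, 0), "B": (1, 0), "C": (5, 5)}
pair, single, cost = cluster_three_points(example)
print(pair, single, cost)
```

With these coordinates the sketch pairs A with B and leaves C alone, matching the solution `{A, B}, {C}` discussed above.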

### K means

a) (1-dimensional clustering) Walk through Lloyd's algorithm step by step on the following dataset:

`[0, .5, 1.5, 2, 6, 6.5, 7]` (note: each of these is a 1-dimensional data point)

Given the initial centroids:

`[0, 2]`

Round 1 assignment:

Points closest to 0: 0, 0.5

Points closest to 2: 1.5, 2, 6, 6.5, 7

Round 1 cluster means:

Cluster near 0: 0.25

Cluster near 2: 4.6

Round 2 assignment:

Points closest to 0.25: 0, 0.5, 1.5, 2

Points closest to 4.6: 6, 6.5, 7

Round 2 cluster means:

Cluster near 0.25: 1

Cluster near 4.6: 6.5

Final clusters:

Cluster with mean 1: 0, 0.5, 1.5, 2

Cluster with mean 6.5: 6, 6.5, 7

Reassigning the points to the means 1 and 6.5 gives the same clusters, so nothing changes and the algorithm has converged.
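The hand computation in part (a) can be checked with a small pure-Python sketch of Lloyd's algorithm for 1-dimensional data. The function name `lloyd_1d` is an assumption made for this illustration, and the sketch assumes that no cluster ever becomes empty, which holds for this dataset and these initial centroids.

```python
def lloyd_1d(data, centroids):
    """Run Lloyd's algorithm on 1-D data, printing every round."""
    round_no = 0
    while True:
        round_no += 1
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Update step: each centroid moves to the mean of its cluster.
        # (Assumes no cluster ends up empty, which holds for this input.)
        new_centroids = [sum(c) / len(c) for c in clusters]
        print(f"round {round_no}: clusters={clusters} means={new_centroids}")
        # Stop once the means no longer move.
        if new_centroids == centroids:
            return clusters, new_centroids
        centroids = new_centroids


clusters, means = lloyd_1d([0, .5, 1.5, 2, 6, 6.5, 7], [0, 2])
```

The sketch prints three rounds. The third round only confirms that the means 1 and 6.5 no longer move, which is the stopping condition.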



b) Describe in plain English what the cost function for k means is.

It is essentially a metric to determine how good a set of clusters is. The smaller the cost, the better.

For each cluster, take the difference between each point and the cluster mean and square it, so that every term is non-negative. Summing these squared distances gives the cost of an individual cluster, and adding up the costs of all K clusters gives the cost of the whole clustering.
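As a rough illustration of this description, the cost can be computed directly for 1-dimensional clusters. The helper name `kmeans_cost` is an assumption. The two clusterings below are the first-round and final assignments from part (a).

```python
def kmeans_cost(clusters):
    """Sum of squared distances from each 1-D point to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        # Cost of one cluster: squared distance of every point to the mean.
        total += sum((x - mean) ** 2 for x in cluster)
    return total


final = [[0, .5, 1.5, 2], [6, 6.5, 7]]        # final clusters from part (a)
first_round = [[0, .5], [1.5, 2, 6, 6.5, 7]]  # clusters after the first assignment
print(kmeans_cost(final))        # 3.0
print(kmeans_cost(first_round))  # larger than the final cost
```

The final clustering costs 2.5 + 0.5 = 3.0, which is lower than the cost of the first-round clustering, as expected for a better solution.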

c) For the same number of clusters K, why could there be very different solutions to the K means algorithm on a given dataset?

Because the result depends on how the initial centers are chosen. For example, with two clusters, suppose the random initial centers are placed so that only one point is closest to the first center and every other point is closest to the second. The algorithm can then converge to different clusters than it would if each initial center started with several points nearby. Lloyd's algorithm only finds a local minimum of the cost, so different starting centers can lead to different final solutions.
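A small experiment makes this concrete. The sketch below reuses the dataset from part (a) with K = 3 and two hypothetical sets of initial centroids, chosen only for illustration. The function name `lloyd_1d` is an assumption, and the code assumes no cluster becomes empty, which is true for both starts.

```python
def lloyd_1d(data, centroids):
    """Return (clusters, cost) after running Lloyd's algorithm to convergence."""
    while True:
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Update step (assumes every cluster stays non-empty).
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:
            cost = sum((x - m) ** 2
                       for c, m in zip(clusters, centroids) for x in c)
            return clusters, cost
        centroids = new_centroids


data = [0, .5, 1.5, 2, 6, 6.5, 7]
good_clusters, good_cost = lloyd_1d(data, [0, 2, 6.5])   # hypothetical start 1
poor_clusters, poor_cost = lloyd_1d(data, [0, 6, 7.5])   # hypothetical start 2
print(good_clusters, good_cost)
print(poor_clusters, poor_cost)
```

Both runs converge, but to different clusterings: the first start ends with a cost of 0.75, while the second splits the right-hand group and ends with a cost of 2.625.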

d) Does Lloyd's Algorithm always converge? Why / why not?

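Yes, it will always converge. There are only finitely many ways to assign the points to K clusters, and neither the assignment step nor the update step can increase the cost. The algorithm therefore can't get stuck in a cycle, because revisiting an earlier clustering would require that clustering to have a cost lower than itself.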
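The argument can be observed numerically by recording the cost after every assignment step and every update step. This is a sketch on the 1-dimensional data from part (a). The names `cost` and `lloyd_cost_trace` are assumptions, and the code assumes no cluster becomes empty.

```python
def cost(data, centroids, labels):
    """K-means cost of a labeling with respect to the given centroids."""
    return sum((x - centroids[j]) ** 2 for x, j in zip(data, labels))


def lloyd_cost_trace(data, centroids):
    """Run Lloyd's algorithm and record the cost after every half-step."""
    trace, labels = [], None
    while True:
        # Assignment step: moving a point to a nearer centroid
        # cannot raise the cost.
        new_labels = [
            min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            for x in data
        ]
        if new_labels == labels:
            return trace  # assignments stopped changing: converged
        labels = new_labels
        trace.append(cost(data, centroids, labels))
        # Update step: the mean minimizes the squared distance within a
        # cluster (assumes no cluster is empty, true for this input).
        centroids = [
            sum(x for x, l in zip(data, labels) if l == j) / labels.count(j)
            for j in range(len(centroids))
        ]
        trace.append(cost(data, centroids, labels))


trace = lloyd_cost_trace([0, .5, 1.5, 2, 6, 6.5, 7], [0, 2])
print(trace)
```

The recorded costs never increase from one entry to the next, and the last entry is the cost of the final clustering from part (a).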

e) Follow along in class the implementation of Kmeans

In [None]:
import numpy as np
from PIL import Image as im
import matplotlib.pyplot as plt
import sklearn.datasets as datasets


centers = [[0, 0], [2, 2], [-3, 2], [2, -4]]
X, _ = datasets.make_blobs(n_samples=300, centers=centers, cluster_std=1, random_state=0)

class KMeans():

    def __init__(self, data, k):
        self.data = data
        self.k = k
        self.assignment = [-1 for _ in range(len(data))]
        self.snaps = []
    
    
    def snap(self, centers):
        """
        Saves a scatter plot of the current assignments and centers to a temp file
        and appends it to self.snaps as an image
        """
        TEMPFILE = "temp.png"

        fig, ax = plt.subplots()
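        # plot the data held by this instance rather than the global X
        ax.scatter(self.data[:, 0], self.data[:, 1], c=self.assignment)
        ax.scatter(centers[:,0], centers[:, 1], c='r')
        fig.savefig(TEMPFILE)
        # close this specific figure so figures do not accumulate in memory
        plt.close(fig)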
        self.snaps.append(im.fromarray(np.asarray(im.open(TEMPFILE))))

    def is_unassigned(self, i):
        ''' 
        Checks whether data point i is unassigned. Since assignments start with a default
        value of -1, we only need to check for -1 to know the point is unassigned.
        '''
        return self.assignment[i] == -1 
    

    def unassigned_all(self):
        """ 
        Unassigns values by converting all values in assignment back to -1
        """
        self.assignment = [-1 for _ in range(len(self.data))]

    def initialize(self): 
        ''' 
        Picks k distinct data points at random and returns them as an array of initial centers
        '''
        # replace=False prevents the same point from being chosen as two centers
        return self.data[np.random.choice(len(self.data), size=self.k, replace=False)]
    
    def assign(self,centers):
        ''' 
        Assigns points in self.data to the closest center in centers.
        modifies the self.assignment list to match points to a center 
        '''
        # for each element of our data
        for i in range(len(self.data)):

            # change the assignment at that position to 0. this means the data point i is currently associated with center 0. 
            self.assignment[i] = 0 

            #calculate the distance between our data point and the first center  
            temp_dist = self.dist(self.data[i], centers[0])

            # compare distances of a point to the centers to determine which center a point is closest to 
            for j in range(1,len(centers)):
                new_dist = self.dist(self.data[i], centers[j])
                # if the point is closer to this center than to the one it is currently assigned to,
                # reassign it and update the smallest distance used for the remaining comparisons.
                if new_dist < temp_dist:
                    self.assignment[i] = j
                    temp_dist = new_dist
 
    def dist(self,x, y):
        return sum((x-y)**2) ** (1/2)
    
    def are_centers_diff(self, c1,c2):
        ''' 
        Checks whether two arrays of centers differ. Returns True if they are different
        '''
        # compare the arrays as a whole; `c1[i] not in c2` on numpy arrays only tests
        # whether any single coordinate matches, which can stop the loop too early
        return not np.array_equal(c1, c2)


    def calculate_new_centers(self):
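        # We create an empty list to store our new centers
        centers = []
        assignment = np.array(self.assignment)
        # We want to calculate k new centers, so we use a for loop
        for j in range(self.k):
            # boolean mask selects the points currently assigned to cluster j
            cluster_j = self.data[assignment == j]
            if len(cluster_j) == 0:
                # Assumed heuristic (not from class): re-seed an empty cluster with a
                # random data point so the mean stays well defined
                cluster_j = self.data[[np.random.randint(len(self.data))]]
            centers.append(np.mean(cluster_j, axis=0))

        return np.array(centers)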
        
    def lloyds(self):

        # first step is initializing the centers and taking a snap of the current chart 
        centers = self.initialize()
        # snap() already appends the image to self.snaps and returns None
        self.snap(centers)

        # then we assign points to those centers  based on how close they are
        self.assign(centers)
        # Then calculate the new means (new centers)
        new_centers = self.calculate_new_centers()
        

        # repeat the process until the centers don't change
        while self.are_centers_diff(centers, new_centers):
            centers = new_centers
            self.unassigned_all()
            self.assign(centers)
            self.snap(centers)
            new_centers = self.calculate_new_centers()
        return
            

kmeans = KMeans(X, 6)
kmeans.lloyds()
images = kmeans.snaps


images[0].save(
    'kmeans.gif',
    optimize=False,
    save_all=True,
    append_images=images[1:],
    loop=0,
    duration=500
)