# **Homework 10**

Due:

# **Coding Assignment**

### Introduction

In this assignment, you will implement an iterative method for
clustering: k-means. Your implementation will be used in a k-means
classifier, which will be trained and tested on a handwritten digits
dataset to classify each image as a digit (0 to 9).

### Stencil Code & Data

You have been provided with the following stencil files:

-   `Models`: contains the K-means classifier class that you will
    need to fill in.

-   `Check Model`: contains a series of tests to ensure you are coding your 
    model properly.

-   `Main`: contains a main function to read data, run the classifier,
    and print/visualize results.

-   `kmeans`: contains helper functions for K-means clustering via
    iterative improvement that you will need to fill in.

### 8x8 Hand-written digits

In the `digits.csv` file, each row is an observation of an 8 x 8
hand-written digit (0 - 9), containing a label in the first column and 8
x 8 = 64 features (pixel values) in the remaining columns.
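
As a rough illustration of this row layout, the sketch below splits one row into its label and an 8 x 8 image. The row values are made up for the example (they are not taken from `digits.csv`), and `split_row` is a hypothetical helper, not part of the stencil.

```python
import numpy as np

def split_row(row):
    """Split one row into (label, 8x8 image). Hypothetical helper."""
    row = np.asarray(row, dtype=float)
    label = int(row[0])            # first column holds the digit label
    image = row[1:].reshape(8, 8)  # remaining 64 columns are pixel values
    return label, image

# Made-up row: label 3 followed by 64 pixel values (0, 1, ..., 63).
example_row = [3] + list(range(64))
label, image = split_row(example_row)
print(label, image.shape)
```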

### Data Format

We have written all the preprocessing code for you. The dataset is
represented by a `namedtuple` with two fields:

-   `data.inputs` is an $m \times p$ NumPy array that contains the
    features (pixel values) of the $m$ examples, where $p$ is the
    number of pixels in each example (64).

-   `data.labels` is an $m$-dimensional NumPy array that contains the
    labels of the $m$ examples.

You can find more information on `namedtuple`
[**here**](https://docs.python.org/3/library/collections.html#collections.namedtuple).
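
For orientation, here is a minimal sketch of what such a `namedtuple` looks like and how its two fields are accessed. The array contents are placeholders (all zeros), chosen only to show the shapes; the real preprocessing is already done for you.

```python
from collections import namedtuple
import numpy as np

Dataset = namedtuple("Dataset", ["inputs", "labels"])

# Toy stand-in for the real data: m = 4 examples, p = 64 pixels each.
toy = Dataset(inputs=np.zeros((4, 64)), labels=np.array([0, 1, 2, 3]))

m, p = toy.inputs.shape
print(m, p)               # number of examples, number of pixels
print(toy.labels.shape)   # one label per example
inputs, labels = toy      # a namedtuple also unpacks like a regular tuple
```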

## **The Assignment**

## **K-means Clustering**

K-means is a clustering algorithm most often used for unsupervised
machine learning. In unsupervised learning, the learner is given a
dataset with no labels and attempts to learn some useful representation
of the dataset. You may be wondering how K-means can be used for
classification in this assignment, as the training data is unlabeled. To
address this, given a dataset $S = \{{\bf x}_{1}, \dots, {\bf x}_{m} \}$:

-   You will run K-means clustering on the unlabeled training data and
    plot the pixel representations of different cluster centers
    (centroids): $M = \{\mu_1 \dots \mu_{10}\}$, for clusters
    $C=\{C_{1} \dots C_{10}\}$. With K = 10, these cluster centers
    should vaguely resemble the 10 digits (0-9).

-   Using the pixel plots of the centroids, you will manually assign
    the digit that each centroid represents: $A = \{a_1 \dots a_{10}\}$.

-   To predict the label $y_{m+1}$ of a new datapoint ${\bf x}_{m+1}$,
    find the cluster center $\mu_i$ nearest to ${\bf x}_{m+1}$ and
    predict using your assignment, $a_i$.

*Note:* You don't need to worry about changing your centroid assignments
in between runs, as we've set the random seed in the stencil.
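
The prediction rule in the last step can be sketched on toy data. This is only an illustration under simplifying assumptions: the centroids are 2-D rather than 64-D, the digit assignments are invented, and `predict_one` is a hypothetical helper rather than the stencil's `predict()`.

```python
import numpy as np

# Toy 2-D "centroids" standing in for the 64-pixel cluster centers.
centroids = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
# Invented manual assignment: centroid i is interpreted as digit a_i.
assignments = np.array([7, 3, 1])

def predict_one(x, centroids, assignments):
    """Return the digit assigned to the centroid nearest to x."""
    distances = np.linalg.norm(centroids - x, axis=1)  # Euclidean distance to each centroid
    nearest = np.argmin(distances)
    return assignments[nearest]

x_new = np.array([9.0, 8.5])
print(predict_one(x_new, centroids, assignments))  # prints 3
```

Here `x_new` is closest to the second centroid, so the prediction is that centroid's assigned digit.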

## Functions

1.  `Models`\
    In this file, you will implement two functions. They are:

    -   `KmeansClassifier`

        -   **train()**: learn K=10 cluster centroids (representatives)
            from the data. These are robust because each one is
            estimated from many data points. Store the cluster
            centroids as a NumPy array in the model attribute
            `cluster_centers_`.

        -   **predict()**: predict the label of each input using the
            digit assigned to its closest centroid.

2.  `kmeans`:\
    In this file, you will implement four functions:

    -   **init_centroids()**: pick **K** random data points as cluster
        centers called centroids.

    -   **assign_step()**: assign each data instance to its nearest
        cluster centroid using the Euclidean distance.

    -   **update_step()**: find the new cluster centroids by taking the
        average of each cluster's assigned data points.

    -   **kmeans()**: run the $K$-means algorithm: initialize centroids,
        then repeat the assignment step and update step until the
        relative change in the centroids between two iterations falls
        below a tolerance threshold or the maximum number of iterations
        is reached. The tolerance threshold is passed into `kmeans()`
        as `tol`. It is compared against the ratio of the norm of the
        difference between the new and old centroids to the norm of
        the old centroids.

    *Note:* You might also want to create a separate function that
    calculates the Euclidean distance between two data points in the
    `kmeans` file. Please feel free to do so.

3.  `main`:\
    You will not need to implement any functions in this file. However,
    you will need to do two things:

    -   Uncomment the call to `plot_Kmeans` in main. This function will
        allow you to see the centroids that your k-means model learns.

        **Please note:** to complete the report you will need access to
        graphics on the machine you are working on. If you are running
        locally or through FastX/XQuartz, you do not have to worry about
        this. If you have been working exclusively through ssh, please
        read about how to set up remote work that is compatible with
        this assignment
        **[here](https://cs.brown.edu/about/system/connecting/fastx/)**.
        If there are any limitations to you doing this (e.g. not having
        access to a personal computer), please email.

    -   Fill in the `centroid_assignments` array using the results of
        `plot_Kmeans` in your call to `test_Kmeans`.
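
The stopping rule described for `kmeans()` can be sketched on toy numbers. This sketch assumes the norm is taken over the whole centroid matrix (NumPy's default Frobenius norm for 2-D arrays); `relative_change` is a hypothetical helper name, not a stencil function.

```python
import numpy as np

def relative_change(old_centroids, new_centroids):
    """Ratio ||new - old|| / ||old|| over the full centroid matrix."""
    return np.linalg.norm(new_centroids - old_centroids) / np.linalg.norm(old_centroids)

old = np.array([[3.0, 4.0], [0.0, 0.0]])
new = np.array([[3.0, 4.0], [0.3, 0.4]])

ratio = relative_change(old, new)  # roughly 0.5 / 5.0 = 0.1
tol = 1e-6
print(ratio, ratio < tol)          # ratio is above tol, so keep iterating
```

Because the ratio is relative to the size of the old centroids, the same `tol` behaves consistently whether pixel values are small or large.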

## **Kmeans**

In [None]:
import numpy as np
import random

def init_centroids(k, inputs):
    """
    Selects k random rows from inputs and returns them as the chosen centroids
    Hint: use random.sample (it is already imported for you!)
    :param k: number of cluster centroids
    :param inputs: a 2D Numpy array, each row of which is one input
    :return: a Numpy array of k cluster centroids, one per row
    """
    row_indices = random.sample(range(inputs.shape[0]), k)
    return inputs[row_indices]


def assign_step(inputs, centroids):
    """
    Determines a centroid index for every row of the inputs using Euclidean Distance
    :param inputs: inputs of data, a 2D Numpy array
    :param centroids: a Numpy array of k current centroids
    :return: a Numpy array of centroid indices, one for each row of the inputs
    """
    # TODO
    distances = np.zeros([len(inputs),len(centroids)])
    for i in range(len(inputs)):
        for j in range(len(centroids)):
            distances[i,j] = np.linalg.norm(inputs[i] - centroids[j])
    return np.argmin(distances, axis=1)


def update_step(inputs, indices, k):
    """
    Computes the centroid for each cluster
    :param inputs: inputs of data, a 2D Numpy array
    :param indices: a Numpy array of centroid indices, one for each row of the inputs
    :param k: number of cluster centroids, an int
    :return: a Numpy array of k cluster centroids, one per row
    """
    # TODO
    centroids = np.zeros([k, inputs.shape[1]])
    for i in range(k):
        # Mean of the rows assigned to cluster i
        centroids[i] = np.sum(inputs[indices == i], axis=0) / np.sum(indices == i)
    return centroids


def kmeans(inputs, k, max_iter, tol):
    """
    Runs the K-means algorithm on n rows of inputs using k clusters via iterative improvement
    :param inputs: inputs of data, a 2D Numpy array
    :param k: number of cluster centroids, an int
    :param max_iter: the maximum number of times the algorithm can iterate trying to optimize the centroid values, an int
    :param tol: the convergence tolerance, compared against the ratio of the norm of the change in centroids to the norm of the previous centroids (as stated in the handout)
    :return: a Numpy array of k cluster centroids, one per row
    """
    # TODO
    centroids = init_centroids(k, inputs)
    # Run at most max_iter assign/update rounds
    for _ in range(max_iter):
        indices = assign_step(inputs, centroids)
        new_centroids = update_step(inputs, indices, k)
        # Relative change: ||new - old|| / ||old||
        ratio = np.linalg.norm(new_centroids - centroids) / np.linalg.norm(centroids)
        centroids = new_centroids
        if ratio < tol:
            break
    return centroids

## **Models**

In [None]:
class KmeansClassifier(object):
    """
    K-Means Classifier via Iterative Improvement
    @attrs:
        k: The number of clusters to form as well as the number of centroids to
           generate (default = 10), an int
        tol: Value specifying our convergence criterion. If the ratio of the
             distance each centroid moves to the previous position of the centroid
             is less than this value, then we declare convergence.
        max_iter: the maximum number of times the algorithm can iterate trying
                  to optimize the centroid values, an int,
                  the default value is set to 500 iterations
        cluster_centers_: a Numpy array where each element is one of the k cluster centers
    """

    def __init__(self, n_clusters = 10, max_iter = 500, threshold = 1e-6):
        """
        Initiate K-Means with some parameters
        """
        self.k = n_clusters
        self.tol = threshold
        self.max_iter = max_iter
        self.cluster_centers_ = np.array([])

    def train(self, X):
        """
        Compute K-Means clustering on the (unlabeled) inputs and store your result in self.cluster_centers_
        :param X: inputs of training data, a 2D Numpy array
        :return: None
        """
        # TODO (hint: use kmeans())
        self.cluster_centers_ = kmeans(X, self.k, self.max_iter, self.tol)

    def predict(self, X, centroid_assignments):
        """
        Predicts the label of each sample in X based on the assigned centroid_assignments.
        :param X: A dataset as a 2D Numpy array
        :param centroid_assignments: a Numpy array of 10 digits (0-9) representing the interpretations of the digits of the plotted centroids
        :return: A Numpy array of predicted labels
        """

        # TODO: complete this step only after having plotted the centroids!
        # Index of the nearest centroid for every row of X
        nearest = assign_step(X, self.cluster_centers_)
        # Map each centroid index to the digit assigned to that centroid
        return np.asarray(centroid_assignments)[nearest]

    def accuracy(self, data, centroid_assignments):
        """
        Compute accuracy of the model when applied to data
        :param data: a namedtuple including inputs and labels
        :param centroid_assignments: the digit (0-9) assigned to each centroid, in centroid order
        :return: a float number indicating accuracy
        """
        pred = self.predict(data.inputs, centroid_assignments)
        return np.mean(pred == data.labels)

## **Check Models**

In [None]:
import pytest
from collections import namedtuple
np.random.seed(0)
random.seed(0)

# Creates Test Model with 3 clusters
test_model1 = KmeansClassifier(3)
# Creates Test Model with 2 clusters
test_model2 = KmeansClassifier(2)

# Creates Test Data
x = np.array([[0,1,7], [1,1,9], [5,0,1], [4,1,1], [0,5,0], [1,9,0]])
y = np.array([2,2,0,0,1,1])
x2 = np.array([[3,1,7], [5,1,9], [2,8,1], [0,1,1], [0,5,0], [2,0,8]])
y2 = np.array([1,1,0,0,0,1])
data = namedtuple('Dataset', ['inputs', 'labels'])
test_data1 = data(x, y)
test_data2 = data(x2, y2)

# Test Train Model and Checks Cluster Centers
test_model1.train(x)
test_model2.train(x2)
test_model1_sorted_clusters = test_model1.cluster_centers_[test_model1.cluster_centers_[:, 0].argsort()]
test_model2_sorted_clusters = test_model2.cluster_centers_[test_model2.cluster_centers_[:, 0].argsort()]
assert (test_model1_sorted_clusters == np.array([[.5, 7, 0], [.5, 1, 8], [4.5, .5, 1]])).all()
assert (test_model2_sorted_clusters == np.array([[1.25, 1.75, 4], [3.5, 4.5, 5]])).all()

# Tests Model Predict
assert (np.sort(test_model1.predict(x, [0,1,2])) == [0, 0, 1, 1, 2, 2]).all()
assert (np.sort(test_model2.predict(x2, [0,1])) == [0, 0, 1, 1, 1, 1]).all()

# Tests Model Accuracy
assert test_model1.accuracy(test_data1, [0,1,2]) == 1.0
assert test_model2.accuracy(test_data2, [0,1]) == .5

# TODO: student should print their names and date
print('student name and date ')

## **Main**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split

## KMEANS HELPERS ##
def plot_Kmeans(model):
    """
        Takes in a pre-trained K-Means classifier model and plots the 10 centroids.
        Note: this function is designed only for the digits.csv data set.
    :param model: pre-trained K-Means classifier model object
    :return: None
    """
    if not isinstance(model, KmeansClassifier):
        print("Invalid input! Model must be a KmeansClassifier object.")
        return

    cluster_centers = model.cluster_centers_
    fig, ax = plt.subplots(1, len(cluster_centers), figsize=(3, 1))

    for i in range(len(cluster_centers)):
        axi = ax[i]
        center = cluster_centers[i]
        center = np.array(center).reshape(8,8)
        axi.set(xticks=[], yticks=[])
        axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
    plt.show()

def test_Kmeans(model, test_data, centroid_assignments):
    """
        Prints the accuracy of model on test_data, based on the centroid ordering provided by the student.
    :param model: pre-trained K-Means classifier model object
    :param test_data: a namedtuple including test inputs and test labels
    :param centroid_assignments: a python list of 10 digits (0-9) representing your interpretations of the digits of the plotted centroids from plot_Kmeans (in order from left to right).
    :return: None
    """
    if not isinstance(centroid_assignments, list):
        print("Invalid input! Centroid assignments must be a python list!")
        return
    elif not np.array_equal(np.array(list(range(10))), np.sort(np.array(centroid_assignments))):
        print("Invalid Input! Centroid assignments must contain all numbers in the range 0-9 (in the order displayed in your plot).")
        return
    elif not isinstance(model, KmeansClassifier):
        print("Invalid input! Model must be a KmeansClassifier object.")
        return
   
    accuracy = model.accuracy(test_data, centroid_assignments)
    print("Testing on K-Means Classifier (K = " + str(model.k) + "), the accuracy is {:.2f}%".format(accuracy * 100))
    return accuracy

def runKMeans():
    '''
        Trains, plots, and tests K-Means classifier on digits.csv dataset.
    '''
    NUM_CLUSTERS = 10 # Change only for Question 3 of the Project Report
    random.seed(1) # DO NOT CHANGE
    np.random.seed(1) # DO NOT CHANGE

    Dataset = namedtuple('Dataset', ['inputs', 'labels'])

    # Read data
    data = pd.read_csv("digits.csv", header = 0)

    # We assume labels are in the first column of the dataset
    labels = data.values[:, 0]

    # If labels are of type string, convert class names to numeric values
    if isinstance(labels[0], str):
        classes = np.unique(labels)
        class_mapping = dict(zip(classes, range(0, len(classes))))
        labels = np.vectorize(class_mapping.get)(labels)

    # Features columns are indexed from 1 to the end, make sure that dtype = float32
    inputs = data.values[:, 1:].astype("float32")

    # Split data into training set and test set with a ratio of 2:1
    train_inputs, test_inputs, train_labels, test_labels = train_test_split(inputs, labels, test_size = 0.33)

    all_data = Dataset(inputs, labels)
    train_data = Dataset(train_inputs, train_labels)
    test_data = Dataset(test_inputs, test_labels)
    print("Shape of training data inputs: ", train_data.inputs.shape)
    print("Shape of test data inputs:", test_data.inputs.shape)

    # Train K-Means Classifier
    kmeans_model = KmeansClassifier(NUM_CLUSTERS)
    kmeans_model.train(train_data.inputs)

    # DO NOT MODIFY ABOVE THIS LINE!

    # TODO: uncomment below to plot the centroids for the 10 digits (0-9).
    plot_Kmeans(kmeans_model)

    # TODO: fill out centroid_assignments below based on the visualization of plot_Kmeans (in order from left to right).
    #   In this step, you are assigning each centroid to its most resembling digit (0-9).
    #   DO NOT add print lines below test_Kmeans on final handin
    #   (Comment out this line when running Question 3 of the Project Report)
    test_Kmeans(kmeans_model, test_data, centroid_assignments=[9,2,1,4,3,0,8,6,5,7])

# DO NOT MODIFY BELOW
runKMeans()

## **Project Report**

### **Question 1**
Display your output of `plot_Kmeans()`. Does your plot match your expectations?

**Solution:**
Plot should look like the image below. Some possible
answers could include: all numbers 0-9 are present, numbers in
general are fuzzier or less fuzzy than expected, certain numbers
like 7 are fuzzier due to variation in how people write it, etc.

![image](kmeans_plot.png)

### **Question 2**
In this assignment, you implemented k-means through a Euclidean
distance metric. Describe other distance metrics that can be used
and how they cluster inputs.

**Solution:**
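Many answers are possible. Ex) Manhattan (L1) distance, which sums the
absolute per-pixel differences instead of squaring them, so a single
large pixel difference influences the distance less than it does under
Euclidean distance.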
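
To make the contrast concrete, the sketch below assigns the same point to its nearest centroid under Euclidean and Manhattan distance. The centroids and the point are toy values chosen so that the two metrics disagree, and `pairwise_distances` is a hypothetical helper, not part of the stencil.

```python
import numpy as np

def pairwise_distances(inputs, centroids, metric="euclidean"):
    """Distance from every input row to every centroid row (hypothetical helper)."""
    diff = inputs[:, None, :] - centroids[None, :, :]  # shape (m, k, p)
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    raise ValueError(f"unknown metric: {metric}")

centroids = np.array([[4.0, 0.0], [2.5, 2.5]])
point = np.array([[0.0, 0.0]])

euclid_idx = np.argmin(pairwise_distances(point, centroids, "euclidean"), axis=1)
manhattan_idx = np.argmin(pairwise_distances(point, centroids, "manhattan"), axis=1)
print(euclid_idx, manhattan_idx)  # the two metrics pick different centroids
```

Euclidean distance favors the diagonal centroid (about 3.54 versus 4), while Manhattan distance favors the axis-aligned one (4 versus 5). If the assignment step used Manhattan distance, a common companion change is to update each center with the coordinate-wise median rather than the mean (often called k-medians); that variant is not part of this assignment.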

### **Question 3**
In `runKMeans()` in the Main section, change the number of clusters (`NUM_CLUSTERS`) to 6, and display the digits with `plot_Kmeans(kmeans_model)`. Do this as well for 15 clusters. Describe what the clusters' centers (centroids) look like and why this is happening.

**Solution:**

If $K<10$, clusters would be more difficult to
read/interpret since some would be a blend of two or more
"similar" digits. (Ex: 1 and 7 might be the same cluster).

If $K>10$, our clusters will begin distinguishing between different
ways of writing the same digit. (Ex: 4 with the leftmost edge
slanted vs vertical).