In [0]:
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits, load_iris
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import itertools
import random
import matplotlib.colors as mcolors
%matplotlib inline

# Clustering Handwriting Samples

In this homework, you will be working with the k-Means clustering algorithm that we discussed last week.

To start, you will use k-means clustering as a classification approach for recognizing handwritten digits, like the ones below:
![MNist Handwriting Examples](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png)

The goal is to correctly group together all of the 0s, all of the 1s, etc. Some of these can be tricky! A 3 can look like an 8, or a 7 like a 9!

Glance through the code below and then run it -- it contains functions for creating training/testing splits from a dataset, running a classifier, and making plots of your predictions. (Don't worry too much about understanding the plotting code -- it's a little confusing.)

In [0]:
# yet another way of creating train/test splits!
def create_train_test_split(data,test_size=0.3):
  df = pd.DataFrame(data.data)
  train_data, test_data, train_labels, test_labels = train_test_split(df, data.target, test_size=test_size)
  return np.array(train_data), np.array(test_data), np.array(train_labels), np.array(test_labels)

def classify(predictor, train_data, test_data, train_labels=None):
  # fit the predictor to our training data (KMeans accepts labels here but ignores them)
  if train_labels is not None:
    predictor.fit(train_data, train_labels)
  else:
    predictor.fit(train_data)
  
  # make predictions on the testing data
  test_predictions = predictor.predict(test_data)

  # return our predicted cluster identities and cluster centers
  return test_predictions, predictor.cluster_centers_

def create_plot(predictor, train_data, test_data):
  Z, cluster_centers = classify(predictor, train_data, test_data)

  appended = np.vstack([test_data,cluster_centers])
  reduced_data = PCA(n_components=3).fit_transform(appended)

  reduced_test_data = reduced_data[:test_data.shape[0]]
  reduced_centers = reduced_data[test_data.shape[0]:]

  # pick 10 evenly spaced named colors, one per cluster
  possible_colors = np.array(list(mcolors.TABLEAU_COLORS.keys()))
  possible_colors = possible_colors[np.linspace(0, possible_colors.shape[0] - 1, 10).astype(int)]

  for pt, label in zip(reduced_test_data, Z):
    plt.scatter(pt[0],pt[1], color=possible_colors[label], marker='o',alpha=0.25, s=10, linewidths=2)

  # draw every cluster center, even if no test point was assigned to that cluster
  for label, pt in enumerate(reduced_centers):
    plt.scatter(pt[0],pt[1], color=possible_colors[label], marker='+', s=300, linewidths=15,alpha=1)

First things first, we need to load the data! Here, I'll load the handwriting data from sklearn using the load_digits() function, and then we'll split that into training and testing data using the function written above.

In [0]:
# load sklearn's handwritten digits dataset (small 8x8 images, similar in spirit to MNIST)
data = load_digits()
train_data, test_data, train_labels, test_labels = create_train_test_split(data)
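
Before clustering, it can help to look at what was just loaded. The snippet below is a small sketch for inspecting the dataset, not part of the assignment itself. The variable names `digits` and `first_image` are introduced here for illustration only. It prints the shape of the data matrix and shows the first sample reshaped back into its 8x8 pixel grid.

```python
import numpy as np
from sklearn.datasets import load_digits

# load the same dataset used above (separate variable so nothing is overwritten)
digits = load_digits()

# one row per image, one column per pixel
print(digits.data.shape)

# each row is a flattened 8x8 image; reshape the first row to see the grid
first_image = digits.data[0].reshape(8, 8)
print(first_image.astype(int))

# the label that goes with the first image
print(digits.target[0])
```

Larger numbers in the grid correspond to darker pixels, so the outline of the digit is often visible in the printed array.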

Below, we are going to create two predictors: one that uses a "random" initialization and one that uses an initialization called "k-means++".

***HOMEWORK QUESTION 1 (5 points):*** Visit the [sklearn documentation for the KMeans predictor](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans) that we're using. Read about the difference between these initialization methods. Describe in a short paragraph the difference between them.

In [0]:
n_clusters = 10 # we know that there are 10 digits, so we want to have 10 clusters
kmeans_random_initialization = KMeans(init='random', n_clusters=n_clusters)
kmeans_better_initialization = KMeans(init='k-means++', n_clusters=n_clusters)

In [5]:
random_predictions, random_cluster_centers = classify(kmeans_random_initialization, train_data, test_data, train_labels=train_labels)
print('Accuracy: {}'.format(accuracy_score(test_labels, random_predictions)))

Accuracy: 0.0962962962962963


In [6]:
better_predictions, better_cluster_centers = classify(kmeans_better_initialization, train_data, test_data, train_labels=train_labels)
print('Accuracy: {}'.format(accuracy_score(test_labels, better_predictions)))

Accuracy: 0.18888888888888888
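
A note on reading these numbers: k-means numbers its clusters arbitrarily, so the cluster with index 3 does not necessarily contain the 3s. `accuracy_score` compares cluster indices to digit labels directly, which means scores near 0.1 (chance level for 10 classes) can show up even when the clusters themselves are sensible. One possible way to account for this, sketched here as an assumption and not as part of the original assignment, is to map each cluster to the most common true label among its members before scoring. The helper names `map_clusters_to_labels` and `relabel` are hypothetical, and the toy arrays are made up for illustration.

```python
import numpy as np

def map_clusters_to_labels(cluster_ids, true_labels):
    """Map each cluster index to the most common true label among its members."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    mapping = {}
    for cluster in np.unique(cluster_ids):
        members = true_labels[cluster_ids == cluster]
        values, counts = np.unique(members, return_counts=True)
        mapping[int(cluster)] = int(values[np.argmax(counts)])
    return mapping

def relabel(cluster_ids, mapping):
    """Replace each cluster index with the label its cluster was mapped to."""
    return np.array([mapping[int(c)] for c in cluster_ids])

# toy example: the grouping is mostly right, but the cluster indices are shuffled
toy_clusters = np.array([2, 2, 0, 0, 1, 1, 1])
toy_labels = np.array([0, 0, 1, 1, 2, 2, 0])

toy_mapping = map_clusters_to_labels(toy_clusters, toy_labels)
toy_relabelled = relabel(toy_clusters, toy_mapping)

raw_accuracy = np.mean(toy_clusters == toy_labels)
mapped_accuracy = np.mean(toy_relabelled == toy_labels)
print(raw_accuracy, mapped_accuracy)
```

In this notebook, the mapping could be built from `predictor.predict(train_data)` together with `train_labels`, and then applied to the test predictions before calling `accuracy_score`.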


***HOMEWORK QUESTION 2 (10 points):*** Run the classification and accuracy computation for both the random initialization and the better initialization ~10 times each. Report the average accuracy for each. Does one classifier seem better than the other? If so, which, and why do you think that is?




## Visualization and Dimensionality Reduction

I provided you with a function, `create_plot`, which takes in your predictor, the training data, and the testing data, and creates a plot of the predicted clusters. The code for that plotting is complex, but the general idea is that in order for us to visualize what our predictor is doing, we first need to reduce the dimensionality of the input data. What does that mean?

Well, the handwritten digits that we've loaded are 8x8 images, each represented by 64 pixel values, or 64 dimensions. But we can't easily visualize 64 dimensions on a plot! Instead, we want to find a good representation of our data in 2 dimensions (so that we can plot it on an x-y scatter plot).

Principal Component Analysis is an approach that finds the most important ways in which your data varies and represents your data with just those dimensions -- so we can take our 64-dimensional images and instead represent them with 2-dimensional points. (Dimensionality reduction is a huge area of work and research in data analysis! Don't worry too much about the details here -- the major takeaway is that the create_plot function lets you make 2D scatter plots of our data, even if it had more dimensions than that to start with.)
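
To make the idea concrete, here is a minimal sketch of the projection step. It uses numpy's SVD directly instead of the sklearn `PCA` class that `create_plot` relies on, and it runs on synthetic data, not the digits. The function name `pca_project` and the array `fake_images` are made up for this illustration.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project data onto its top principal components using an SVD."""
    # center each feature so the components describe variation around the mean
    centered = data - data.mean(axis=0)
    # rows of vt are the principal directions, ordered from most to least variance
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    # fraction of the total variance captured by each kept component
    variance_share = singular_values**2 / np.sum(singular_values**2)
    return projected, variance_share[:n_components]

# synthetic stand-in for 200 images with 64 pixel values each
rng = np.random.default_rng(0)
fake_images = rng.normal(size=(200, 64))
fake_images[:, 0] *= 10  # make one "pixel" vary much more than the others

points_2d, variance_share = pca_project(fake_images)
print(points_2d.shape)
print(variance_share)
```

Each 64-dimensional row becomes a single (x, y) point, and the first component captures the direction in which the data varies the most.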

***HOMEWORK QUESTION 3 (5 points)***: Run the `create_plot` function for both the random initialization and the k-means++ initialization a few times each (the dimensionality reduction has some amount of randomization each run). Paste these visualizations into your homework, labeling them by which initialization was used. Do you notice any significant differences?

In [0]:
create_plot(kmeans_random_initialization, train_data, test_data)

In [0]:
create_plot(kmeans_better_initialization, train_data, test_data)

# HOMEWORK QUESTION 4 (20 points): Putting This Together on Your Own

Starting next week, you will be working on your final projects, which involve performing your own data analysis. Thus far, you have largely been running code I wrote for you and modifying parameters. But now it's time for you to put the pieces together on your own.

Below, I load a new dataset for you. It's in the same format as the handwriting digits dataset we loaded up above (except the labels are junk -- this is truly the unsupervised context; we don't know the "correct" labels for even our training data).

Your task is to:

1. Split the data into appropriate training/testing splits
2. Train a classifier on your training data, experimenting with different initializations and different numbers of clusters (for the handwriting digits it was obvious that we wanted 10 clusters; here we don't know the appropriate number of clusters).
3. Create the scatter plot visualization to help you qualitatively assess which combinations of initializations and numbers of clusters seem most appropriate for your data. Include two visualizations in your report: one showing what you think is the best combination of initialization and number of clusters, and one that you didn't think was as good. Explain why you think your selected parameters are better.

Finally, upload both your report and this iPython notebook to your Slack channel.

In [0]:
data = load_iris()

In [0]:
# YOUR CODE GOES HERE!