# ExoStat Lab 11:  Clustering and Classification

**Administrative details:**

- This Lab will be turned in for credit.

- Some questions in this lab are variations of lecture notes and demos found on the main [YData website](http://ydata123.org/sp19/).  

- Collaborating on the ExoStat Labs is encouraged. If you get stuck for a while on a question, feel free to ask a neighbor or come to the instructor's or TF's office hours for additional help. (Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it.) Please don't just share answers, though.

This term we will be using Piazza for class discussion. Find our class page [here](https://piazza.com/yale/spring2019/sds170/home).

You can read more about course policies on our [canvas site](https://canvas.yale.edu).

**Deadline:**

This assignment is due Monday, April 29th at 11:59 P.M. Late work will not be accepted as per the course policies (see the Syllabus and Course policies on [Canvas](https://canvas.yale.edu)).

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.

#### Today's ExoStat Lab

1. Clustering:  See lecture notes

2. Classification:
See lecture notes or textbook pages [here](https://www.inferentialthinking.com/chapters/17/Classification.html)

**Submission:**

Submit your assignment both as a .pdf and .ipynb (Jupyter notebook) in Canvas.  

To produce the .pdf, please do the following in order to preserve the cell structure of the notebook:  
1.  Go to "File" at the top-left of your Jupyter Notebook
2.  Under "Download as", select "HTML (.html)"
3.  After the .html has downloaded, open it and then select "File" and "Print" (note you will not actually be printing)
4.  From the print window, select the option to save as a .pdf

To produce the .ipynb, please do the following:  
1.  Go to "File" at the top-left of your Jupyter Notebook
2.  Under "Download as", select "Notebook (.ipynb)"

Let's begin by running the cell below.

In [None]:
# Run this cell, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

from sklearn.cluster import KMeans

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from matplotlib import patches
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

## 1.  Preparing the data

As we discussed and used during the lecture, we are going to use a simulated dataset to illustrate clustering and classification.  The cell below reads in the simulated dataset, `data.txt`.  This is the "unlabeled" data, which we will use for clustering.

In [None]:
sim = Table.read_table("data.txt", delimiter = " ", names = ["x","y"])
sim.scatter(0,1)

The labels/classes for `sim` can be loaded using the cell below.  Run the cell below to load the labels, `data_labels.txt`.

In [None]:
sim_labels = Table.read_table("data_labels.txt", delimiter = " ", names = ["labels"]).column(0)

The cell below creates a scatterplot of `sim`, but colors the points according to their label.  Notice that there are five different labels/classes.  We will use this dataset for classification.

In [None]:
sim.with_columns("Class", sim_labels).scatter(0,1,colors = "Class") 
plt.xlim(-4,4)
plt.ylim(-6,4)

## 2. Clustering

We discussed the k-means clustering method during lecture.  To implement it, we will use `KMeans` in the Python module `sklearn.cluster`. (This was loaded in the setup cell above.)  You can learn more about `KMeans` [here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).

`KMeans` requires the data to be in an array.  The code below takes the columns of our `sim` table and combines them into a 2D array, `sim_km`, with one row per observation.  Use `sim_km` in `KMeans`.

In [None]:
sim_km = np.column_stack(make_array(sim.column(0), sim.column(1)))
print(type(sim_km))
print(sim_km.shape)
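
To see the shape that results, here is a small sketch on made-up numbers.  The arrays `xs` and `ys` are hypothetical stand-ins for the two columns of `sim`, not values from `data.txt`.  `np.column_stack` places the two columns side by side, giving one row per observation.

```python
import numpy as np

# Hypothetical toy columns standing in for sim.column(0) and sim.column(1)
xs = np.array([0.5, 1.5, -2.0])
ys = np.array([1.0, -1.0, 3.0])

# Stack the columns side by side: one row per observation, one column per variable
toy = np.column_stack((xs, ys))
print(toy.shape)   # (3, 2)
print(toy[0])      # the first observation as an (x, y) pair
```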

Let's first consider running this with two clusters (k = 2).  The code below specifies `n_clusters=2` in `KMeans`, which sets k = 2.  The `n_init` argument specifies the number of different starting means to try.  Calling `.fit(sim_km)` returns the fitted model.

In [None]:
nclust2 = 2
sim_km2 = KMeans(n_clusters=nclust2, n_init = 10).fit(sim_km)
sim_km2
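
If you are curious what happens inside a single k-means run, the sketch below may help.  It is a rough, simplified version of the usual assign-then-update iteration and is not scikit-learn's actual implementation.  The function name `simple_kmeans`, its arguments, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=20, seed=0):
    """Toy k-means: alternate between assigning points and updating centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    start = rng.choice(len(points), size=k, replace=False)
    centers = points[start].astype(float)

    def assign(current_centers):
        # Squared distance from every point to every center, shape (n, k)
        d2 = ((points[:, None, :] - current_centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        # Move each center to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, assign(centers)

# Hypothetical toy data: two well-separated groups of three points each
toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, -0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centers, labels = simple_kmeans(toy, k=2)
print(labels)
print(centers)
```

Because the result can depend on the random starting centers, `KMeans` repeats the procedure `n_init` times and keeps the run with the smallest within-cluster variability.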

The `.cluster_centers_` attribute gives the centers of the two clusters.

In [None]:
sim_km2_centers = sim_km2.cluster_centers_
sim_km2_centers

And `.labels_` stores the cluster assignment labels.

In [None]:
sim_km2_labels = sim_km2.labels_
sim_km2_labels

The code below produces a plot of `sim` with the points colored according to the cluster assignments.  The centers of the clusters are indicated with an `x`.

In [None]:
sim.with_columns("Cluster", sim_km2_labels).scatter(0,1,colors = "Cluster") 
plt.scatter(sim_km2_centers[:,0],sim_km2_centers[:,1], marker = "x", s = 100)
plt.xlim(-4,4)
plt.ylim(-6,4)

To compare cluster models under different values of `n_clusters`, "elbow plots" are sometimes used.  An elbow plot displays the *within-cluster* variability against the number of clusters.  (See the lecture notes for additional discussion.)  In `KMeans`, the `.inertia_` attribute gives the within-cluster variability (the sum of squared distances from each point to its assigned cluster center) for the specified value of `n_clusters`.

In [None]:
sim_km2.inertia_
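
To make the within-cluster variability concrete, here is a small sketch that computes it by hand as the sum of squared distances from each point to its assigned center.  The helper `within_cluster_ss` and the toy points, centers, and labels are hypothetical and chosen so the arithmetic is easy to check.

```python
import numpy as np

def within_cluster_ss(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    diffs = points - centers[labels]
    return float((diffs ** 2).sum())

# Hypothetical toy data: four points on a line, in two obvious groups
toy = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])

# Two clusters: every point is exactly 1 unit from its center
two_centers = np.array([[1.0, 0.0], [11.0, 0.0]])
two_labels = np.array([0, 0, 1, 1])
ss_two = within_cluster_ss(toy, two_centers, two_labels)

# One cluster: the single center is the overall mean, (6, 0)
one_center = toy.mean(axis=0, keepdims=True)
one_labels = np.zeros(len(toy), dtype=int)
ss_one = within_cluster_ss(toy, one_center, one_labels)

print(ss_one, ss_two)  # 104.0 4.0
```

Going from one cluster to two drops the value sharply here, which is the kind of change an elbow plot is meant to reveal.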

Now that we have the basics of k-means clustering down, let's try this out!  If you are unsure about the questions below, refer back to the previous example.

**Question 2.1.**  Make a scatterplot of `sim` and color the points according to the k-means result with k = 5.  Be sure to plot the cluster centers as well!

In [None]:
...

**Question 2.2.**  In the question above with k = 5, does this look like the true class assignments?  If it does not match the `sim_labels`, is there any chance that k-means clustering would be able to find the `sim_labels`?  Explain.

[Add your response here]

**Question 2.3.**  Make an elbow plot and determine the "best" number of clusters.  Be sure to label your axes.  You will need to think about the appropriate range for the number of clusters...make sure to have at least 1!

Then, make a scatterplot of `sim` and color the points according to the k-means with your selected number of clusters.  Be sure to plot the cluster centers as well!

In [None]:
#Make Elbow plot
...

In [None]:
#Scatterplot with "best" k
...

Now let's try k-means clustering for exoplanets!

We'll use a subset of the confirmed exoplanet data, `confirmed_planets.csv`, which was collected from the [NASA Exoplanet Archive](https://exoplanetarchive.ipac.caltech.edu).  

Let's start with considering the planet mass and planet radius.  See [Chen and Kipping (2017)](https://iopscience.iop.org/article/10.3847/1538-4357/834/1/17/pdf) for a clustering approach using advanced statistical methods.  Figure 3 presents a possible interpretation of the clusters they found.


Below we read in the data, convert the mass and radius from Jupiter units to Earth units, and remove rows that contain `nan` values.

In [None]:
# load data
exoplanets = Table.read_table("confirmed_planets.csv", skiprows = 71)

In [None]:
# Function used to filter out nans:
# returns True for a row with no nan values (a row to keep)
def remove_nan(x):
    return not any(np.isnan(x))

In [None]:
mass = exoplanets.column("pl_bmassj")*317.8
radius = exoplanets.column("pl_radj")*11.2
data1 = Table().with_columns("mass", mass, "radius", radius)
nans1 = data1.apply(remove_nan)
data1 = data1.where(nans1)
data1
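
The same filtering idea can be sketched with a plain NumPy array and a Boolean mask.  The array `toy` below is made up for illustration and is not the exoplanet data.

```python
import numpy as np

# Hypothetical toy data with some missing values
toy = np.array([[1.0, 2.0],
                [np.nan, 3.0],
                [4.0, np.nan],
                [5.0, 6.0]])

# True for rows where no entry is nan, i.e. the rows we want to keep
keep = ~np.isnan(toy).any(axis=1)
clean = toy[keep]

print(clean.shape)  # (2, 2)
print(clean)
```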

**Question 2.3.**  Make a scatterplot of Radius vs. Mass on the log scale.  Be sure to label your axes!

In [None]:
...

As before, we need to turn the data into a 2D array to use in `KMeans`.  Run the cell below.

In [None]:
#Make a 2D array
data1_km = np.column_stack(make_array(data1.column(0), data1.column(1)))
data1_km.shape

**Question 2.4.**  Try out k-means clustering for the radius and mass data.  Make plots of several different values for `k` on the log scale, with the colors corresponding to the assigned clusters.  How do they look?  Do the cluster assignments look like what you would expect?  Why or why not?

In [None]:
...

**Question 2.5.**  Make an elbow plot and determine the "best" number of clusters.  Be sure to label your axes.  You will need to think about the appropriate range for the number of clusters.

Then, make a scatterplot of `data1` and color the points according to the k-means with your selected number of clusters.  Be sure to plot the cluster centers as well!

In [None]:
#Make Elbow plot
...

In [None]:
#Scatterplot with "best" k
...

**Question 2.6.**  Now you get to do some exploring!  Pick out two columns from the exoplanet data for which you would like to find clusters.  You should include the following:

(i) A scatterplot of the two variables you are considering (be sure to label your axes).
(ib) Explain why you selected these two variables

(ii) Make an elbow plot to decide on the number of clusters you would like to consider

(iii) Make a scatterplot of your data with the points colored according to the cluster assignments you found in part (ii).

In [None]:
# Define data table of your variables (you may need to remove nans)
...

[Explain why you selected the two variables here]

In [None]:
# Make a scatterplot of data
...

In [None]:
# Make an elbow plot and decide on the number of clusters
...

In [None]:
# Fit k-means model
# Make a scatterplot of data with points colored according 
##  to the cluster assignment
...

## 3.  Classification

Next we are going to move into classification.  While clustering is unsupervised (no labels or classes used), classification *does* use labels and is considered a supervised approach because of this.

We are going to focus on k-nearest neighbors (kNN) classification, which you can (and are encouraged to) read about in the [textbook](https://www.inferentialthinking.com/chapters/17/Classification.html).

First we will use our simulated dataset, `sim`, along with the assigned labels, `sim_labels`.  We are going to walk through the steps for building a classification model using the simulated data.  If anything is unclear, refer back to today's lecture notes or the textbook.

### Step 1:  the training dataset

Below is our simulated dataset along with the labels.  Classification is different from clustering in many ways, but one major difference is that there *is* a correct classification, so we will be able to assess the performance of our model.

In [None]:
sim2 = sim.with_columns("Class", sim_labels)

sim2.scatter("x","y", colors = "Class")

### Calculate distance function

Since we are using kNN classification, we will need to have a way to determine who the nearest neighbors are...this requires us to specify a distance function.  The function `distance` below calculates the distance between two points, and `row_distance` calculates the distance between two rows in a table.

In [None]:
#Run this
def distance(pt1, pt2):
    """Return the distance between two points, represented as arrays"""
    return np.sqrt(sum((pt1 - pt2)**2))

def row_distance(row1, row2):
    """Return the distance between two numerical rows of a table"""
    return distance(np.array(row1), np.array(row2))

Let's try this out to make sure we understand how it works.

In [None]:
#Run this to see the table format
sim2.show(3)

In [None]:
#Define the attributes - what we want to use for the classification 
sim_attributes = sim2.drop("Class")
sim_attributes.show(3)

In [None]:
# Calculate the distance between two rows
row_distance(sim_attributes.row(0), sim_attributes.row(1))

In [None]:
# This is what row_distance is doing
np.sqrt((-2.82951 - (-3.0712))**2 + (0.0903454 - 0.110908)**2)
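
As an extra sanity check with an answer we can verify by hand, the sketch below applies the same formula to a 3-4-5 right triangle.  The two points are made up for illustration.

```python
import numpy as np

def distance(pt1, pt2):
    """Return the Euclidean distance between two points, represented as arrays"""
    return np.sqrt(np.sum((pt1 - pt2) ** 2))

# Made-up points: the legs have length 3 and 4, so the distance should be 5
origin = np.array([0.0, 0.0])
corner = np.array([3.0, 4.0])
d = distance(origin, corner)
print(d)  # 5.0
```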

### Recipe for finding the nearest neighbors

Now that we have our distance defined, we want to use it for nearest neighbors classification.  This subsection defines a function for calculating the distance between an example point (of which we would like to know the class) and all the points in our training set (we will define the training and test set soon!).

In [None]:
# Nearest neighbors procedure
def distances(training, example):
    """Compute distance between example and every row in training.
    Return training augmented with Distance column"""
    distances = make_array()
    attributes = training.drop('Class')
    for row in attributes.rows:
        distances = np.append(distances, row_distance(row, example))
    return training.with_column('Distance', distances)
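
For comparison, here is a sketch of the same calculation done on a plain NumPy array without a loop.  The helper `all_distances` and the toy arrays are hypothetical.  They are not part of the lab's functions, which work on tables.

```python
import numpy as np

def all_distances(train_xy, example):
    """Euclidean distance from `example` to every row of `train_xy`."""
    return np.sqrt(((train_xy - example) ** 2).sum(axis=1))

# Hypothetical training points and example point
train_xy = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
example = np.array([0.0, 0.0])

d = all_distances(train_xy, example)
order = np.argsort(d)   # row indices from nearest to farthest
print(d)                # distances 0, 5 and 10
print(order)
```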

Let's consider the following example point and try to predict its label.

In [None]:
example = sim_attributes.row(12)
example

Below we get the distances between `example` and the points in the dataset in ascending order.  That is, the first row is `example`'s nearest neighbor.

In [None]:
distances(sim2.exclude(12), example).sort('Distance')

Now that we have the list of points and distances, we need to determine the `k` closest points to determine our classification.  The function below returns the `k` nearest neighbors of some example point.

In [None]:
def closest(training, example, k):
    """Return a table of the k closest neighbors to example"""
    return distances(training, example).sort('Distance').take(np.arange(k))

In [None]:
#This gives the 5 nearest neighbors to example
closest(sim2.exclude(12), example, 5)

The `k` nearest neighbors can now be used to define the majority class (that is, the most common class among the `k` nearest neighbors).  The two functions below let us classify an example point.

In [None]:
def majority_class(topk):
    """Return the class with the highest count"""
    return topk.group('Class').sort('count', descending=True).column(0).item(0)

def classify(training, example, k):
    "Return the majority class among the k nearest neighbors of example"
    return majority_class(closest(training, example, k))
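
The whole recipe (distances, `k` closest points, majority vote) can also be sketched in a few lines on plain arrays.  This is a simplified illustration rather than a replacement for the functions above.  The name `knn_classify` and the toy data are assumptions, and ties in the vote are broken by whichever class was counted first.

```python
import numpy as np
from collections import Counter

def knn_classify(train_xy, train_labels, example, k):
    """Return the majority class among the k training points nearest to example."""
    d = np.sqrt(((train_xy - example) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]                      # indices of the k closest points
    votes = Counter(train_labels[nearest].tolist())  # count the classes among them
    return votes.most_common(1)[0][0]

# Hypothetical training data: class 1 near the origin, class 2 near (5, 5)
train_xy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                     [5.0, 5.0], [5.1, 5.2]])
train_labels = np.array([1, 1, 1, 2, 2])

print(knn_classify(train_xy, train_labels, np.array([0.3, 0.3]), k=3))  # 1
print(knn_classify(train_xy, train_labels, np.array([4.8, 4.9]), k=1))  # 2
```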

In [None]:
# Classify our example point
classify(sim2.exclude(12), example, 5)

In [None]:
# The true class for the example point
sim2.take(12)

In [None]:
#Let's try another example
new_example = sim_attributes.row(100)
classify(sim2.exclude(100), new_example, 5)

In [None]:
sim2.take(100)

### Step 2:  Make a training and test set

We now have an approach for building our kNN classification model.  As discussed in the lecture and in the textbook, we should separate our data into a *training* dataset and a *testing* dataset.  The training dataset is used for building our model, and the testing dataset is how we can assess the performance of our classifier.  Recall that the testing data are not used in building the model so we get a better assessment of the prediction accuracy.  

Let's use half the observations for training and half for testing.

In [None]:
# Number of observations
sim2.num_rows

In [None]:
# Randomize the order of the observations, then divide into 
## the training and testing sets
shuffled = sim2.sample(with_replacement=False) # Randomly permute the rows
training_set = shuffled.take(np.arange(125))
test_set  = shuffled.take(np.arange(125,250))
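
The same shuffle-then-split idea can be sketched with row indices alone.  In this illustration the split point is computed from the number of rows instead of being typed in, and the value `n = 10` and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed so the split is reproducible
n = 10                           # hypothetical number of observations

order = rng.permutation(n)       # random ordering of the row indices 0..n-1
half = n // 2
train_idx = order[:half]         # first half of the shuffled indices
test_idx = order[half:]          # remaining indices

print(train_idx)
print(test_idx)
```

Every row lands in exactly one of the two sets, so no test observation is used for training.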

In [None]:
training_set.num_rows

In [None]:
test_set.num_rows

**Question 3.1.** Create two scatterplots:  one of the training dataset, and one of the testing dataset, both with points colored according to the true class.

In [None]:
...

In [None]:
...

### Step 3:  Build the model using the training set

Next we need to build our classification model using the training dataset.  We updated the functions used previously and added some new functions in the cell below so that they work for a table of values we would like to predict.  Read over and review the functions to make sure you understand what each one is doing. 

In [None]:
#Calculates the distance between two points
def distance(pt1, pt2):
    return np.sqrt(np.sum((pt1 - pt2)**2))

#Calculates distance between training set and a point p
def all_dists(training, p):
    attributes = training.drop('Class')
    def dist_point_row(row):
        return distance(np.array(row), p)
    return attributes.apply(dist_point_row)

#Creates a table with the distances between training and point p
def table_with_distances(training, p):
    return training.with_column('Distance', all_dists(training, p))

#Finds the k nearest neighbors to p from training
def closest(training, p, k):
    with_dists = table_with_distances(training, p)
    sorted_by_dist = with_dists.sort('Distance')
    topk = sorted_by_dist.take(np.arange(k))
    return topk

#Finds the majority class among k nearest neighbors
def majority_class(topk):
    """Return the class with the highest count"""
    return topk.group('Class').sort('count', descending=True).column(0).item(0)

#Classifies point p using k nearest neighbors of p
def classify(training, p, k):
    closestk = closest(training, p, k)
    topkclasses = closestk.select('Class')
    return majority_class(topkclasses)

First we are going to create a plot that helps us to define the decision boundaries of the classification model.  We can do this by defining a grid across `x` and `y` and then getting the predicted labels for each grid point.  The next few cells produce this...make sure you understand what each cell is doing!

In [None]:
# Classifies each point in the grid using kNN
def classify_grid(training, test, k):
    c = make_array()
    for i in range(test.num_rows):
        # Run the classifier on the ith point in the test grid
        c = np.append(c, classify(training, make_array(test.row(i)), k))   
    return c

In [None]:
# Create the grid
x_array = make_array()
y_array = make_array()
for x in np.arange(-4, 4, 0.2):
    for y in np.arange(-5, 4, 0.2):
        x_array = np.append(x_array, x)
        y_array = np.append(y_array, y)
        
test_grid = Table().with_columns(
    'x', x_array,
    'y', y_array
)
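
A grid like this can also be built without loops.  The sketch below uses `np.meshgrid` on a much smaller, made-up grid so that every point can be listed.  With `indexing="ij"` the points come out in the same order as the nested loops above (x in the outer loop, y in the inner loop).

```python
import numpy as np

# A tiny hypothetical grid: x in {0, 1, 2}, y in {0, 1}
xs = np.arange(0, 3)
ys = np.arange(0, 2)

# gx[i, j] = xs[i] and gy[i, j] = ys[j]
gx, gy = np.meshgrid(xs, ys, indexing="ij")

# Flatten into one (x, y) pair per row
grid = np.column_stack((gx.ravel(), gy.ravel()))
print(grid.shape)  # (6, 2)
print(grid)
```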

In [None]:
#Get the classification labels
c = classify_grid(training_set, test_grid, 1)

In [None]:
#Create a class label-color table 
## (defined to match the class label colors previously used)
color_table = Table().with_columns(
    'Class', np.arange(1,6),
    'Color', make_array('darkblue', 'gold', 'lightblue', 'green', 'red')
)
color_table

In [None]:
#Plot of the predicted class labels for the test grid 
## colored according to the class label
#The training points are added and colored to match the class labels

test_grid = test_grid.with_column('Class', c)
test_grid.scatter('x', 'y', colors='Class', alpha=0.4, s=30)

training2 = training_set.join("Class", color_table)
plt.scatter(training2.column('x'), training2.column('y'), c=training2.column('Color'), edgecolor='k')

**Question 3.2**  What does the decision boundary plot above tell us?

[Add response here]

### Step 4:  Evaluate the model using the test set

Now we can see how well our model performs by predicting the labels of the test dataset and comparing the predicted labels to the true labels.  The function `evaluate_accuracy` returns the proportion of test points that are classified correctly; the estimated *test error* is one minus this proportion.

In [None]:
def evaluate_accuracy(training, test, k):
    """Return the proportion of correctly classified examples 
    in the test set"""
    test_attributes = test.drop('Class')
    num_correct = 0
    for i in np.arange(test.num_rows):
        c = classify(training, test_attributes.row(i), k)
        num_correct = num_correct + (c == test.column('Class').item(i))
    return num_correct / test.num_rows
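
To keep accuracy and error rate straight, here is a tiny sketch with made-up true and predicted labels.  The accuracy is the proportion of matches, and the error rate is one minus the accuracy.

```python
import numpy as np

# Hypothetical true and predicted labels for five test points
true_labels = np.array([1, 2, 2, 3, 1])
predicted = np.array([1, 2, 3, 3, 2])

accuracy = np.mean(predicted == true_labels)   # 3 of the 5 predictions match
error_rate = 1 - accuracy

print(accuracy, error_rate)
```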

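Run the cells below to see the accuracy for different values of `k`.  Note that the last two cells pass `training_set` as both the training data and the data to be classified, so they report accuracy on the training set rather than on the test set.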

In [None]:
evaluate_accuracy(training_set, test_set, 5)

In [None]:
evaluate_accuracy(training_set, training_set, 1)

In [None]:
evaluate_accuracy(training_set, training_set, 20)

You have now seen how to fit a classification model.  In these next questions, you get to decide on your own classification model using the confirmed exoplanet data!

**Question 3.3.**  Select two attributes from the confirmed exoplanet data along with classes for which you would like to try to build a kNN classification model.  Make a scatterplot of the attributes and color the points according to the classes that you have selected. Be sure to label the axes! 

A couple ideas for possible classes are `pl_letter` (indicating the order in which planets are discovered in a system) or `pl_discmethod` (the discovery method).  You will need to decide which attributes might be useful for predicting the classes you select!

In [None]:
...

**Question 3.4.** Divide the data into a training set and a testing set and make a scatterplot of each with the points colored according to the labels.

In [None]:
...

**Question 3.5.** Build your classification model using your training set, and create a plot of the decision boundary as was done above using the grid.  Add the training data points to the grid as well (colored to match the appropriate grid label colors).  You will likely need to adjust the functions used in the previous example to work in your new setting.

In [None]:
...

**Question 3.6.**  Use your test dataset to estimate the test error rates for at least three different values for `k`.

In [None]:
...

In [None]:
...

In [None]:
...

**Submission**: Once you're finished, follow the instructions at the top of this notebook to save as a .pdf and .ipynb. Then submit the two files through Canvas.