# Lab 3 - Clustering


##Clustering

Now let's move to unsupervised learning. Here we aren't trying to predict an outcome. Instead we just want to understand relationships between our observations. Remember that clustering is a technique used for things like market segmentation. What if we repurpose our housing data set for clustering? Say now we are trying to find similar census tracts (i.e. our observations) so we can mail them some marketing information. 

Luckily all our data set features are numeric so we shouldn't have issues with the algorithm. 

##Shortcuts
*   Use the "+ Code" button in the top left corner to add a block for running **code**, or the "+ Text" button to add a block for writing **text**
*   I suggest looking at "Tools-->Keyboard Shortcuts..." for additional ways to run Colaboratory but here are a few useful ones:
> **Ctrl+F9** - Run all blocks   
> **Ctrl+Enter** - Run selected block   
> **Alt+Enter** - Run block and add a new block beneath   
> **Shift+Enter** - Run block and select next block   
> **Ctrl+F8** - Run all blocks before selected block   
> **Ctrl+F10** - Run selected block and all following blocks   
> **Ctrl+M+Y** - Convert selected block to a *code* block   
> **Ctrl+M+M** - Convert selected block to a *text* block

Also useful, Colaboratory supports code completion. Start typing code and press the **Tab** key. A drop down will appear with likely code based on what you typed. If only one possible command exists, it should complete it for you automatically.

Even better yet, if there is an error produced by your code, Colaboratory will provide a button at the bottom of the code output to search StackOverflow for an answer!

**Remember**: Code blocks need to be run in succession or they might produce errors!

In [0]:
#Load our packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import confusion_matrix, accuracy_score, silhouette_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn import datasets
from scipy.cluster.hierarchy import dendrogram, linkage
%matplotlib inline 

In [0]:
#Upload the 'HousingData.csv' file from your local computer 
files.upload()

In [0]:
#Assign data to object
raw = pd.read_csv('HousingData.csv')

In [0]:
#Repeat the data cleaning activities from Lab 1
raw[['CRIM','INDUS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']] = raw[['CRIM','INDUS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']].fillna(raw.mean())

for col in ['ZN','CHAS']:
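    #Assign the result back; calling fillna with inplace=True on a selected column is unreliable in newer pandas
    raw[col] = raw[col].fillna(raw[col].mode()[0])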

##K-means

Before we begin, let's take another look at our data and see if there are any variables that appear to make good clusters. We can do this using the **pairplot** graph we used in Lab 1. Visualizing multidimensional data is difficult so for the sake of demonstration let's choose just two variables from our data set that look like they might cluster well. In real life we wouldn't have this convenience.

In [0]:
#Plot pairwise relationships between features ('size' was renamed 'height' in newer seaborn)
sns.pairplot(raw, height=1.5)

I'm going to choose **B** and **NOX** as they appear to form two clusters naturally. Let's build a k-means model and specify these details in our code:

In [0]:
#Train a k-means cluster model with 2 clusters
kmeans = KMeans(n_clusters=2)

X = raw[['B','NOX']]
y_pred = kmeans.fit_predict(X)

cmap = 'tab10' #This just sets the color palette
plt.figure(figsize=(15,10))
plt.scatter(X.iloc[:,0], X.iloc[:,1], c=y_pred, cmap=cmap)
plt.xlabel('B')
plt.ylabel('NOX')
plt.show()

So it seems that k-means is finding the same two clusters we can see visually. Clearly, there is some imbalance between the two, but the objective of k-means is only to find groupings, whether or not they contain the same number of data points.

Let's try again with 3 clusters. Since it appears that 2 clusters is the *ideal* solution, let's see what we get when we force k-means to find an extra cluster:

In [0]:
#Train a k-means cluster model with 3 clusters
kmeans = KMeans(n_clusters=3)

y_pred = kmeans.fit_predict(X)

cmap = 'tab10'
plt.figure(figsize=(15,10))
plt.scatter(X.iloc[:,0], X.iloc[:,1], c=y_pred, cmap=cmap)
plt.xlabel('B')
plt.ylabel('NOX')
plt.show()

Now you can see how k-means forces every data point into one of the clusters. Since we have to specify the number of clusters a priori, we get arbitrary splits in the data as shown above. The extra cluster doesn't correspond to any meaningful group.
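To see why every point ends up in some cluster, it can help to write the core loop by hand. The sketch below is a simplified version of the standard k-means iteration (often called Lloyd's algorithm); it is not scikit-learn's implementation, which adds smarter initialization and multiple restarts. The function name `simple_kmeans` and the toy data are made up for illustration.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=20, seed=0):
    """Simplified k-means loop (illustrative only)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: every point goes to its nearest centroid, no exceptions
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two obvious groups, but we ask for three clusters anyway
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                 rng.normal(10, 0.5, size=(50, 2))])
labels, centroids = simple_kmeans(toy, k=3)
print(np.bincount(labels, minlength=3))  # how many points landed in each cluster
```

Nothing in the assignment step lets a point opt out, and nothing checks whether `k` matches the structure of the data, so asking for three clusters always returns three labels.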

Let's simulate some data where we can show a good clustering vs. a bad one using k-means:

In [0]:
#Create toy data sets
centers_neat = [(-10, 10), (0, -5), (10, 5)]
x_neat, _ = datasets.make_blobs(n_samples=5000,
                                centers=centers_neat,
                                cluster_std=2,
                                random_state=2)

x_messy, _ = datasets.make_classification(n_samples=5000,
                                          n_features=10,
                                          n_classes=3,
                                          n_clusters_per_class=1,
                                          class_sep=1.5,
                                          shuffle=False,
                                          random_state=301)
#Default plot params
sns.set_theme() #Applies seaborn styling; the 'seaborn' style name was removed from newer Matplotlib versions
cmap = 'tab10'

plt.figure(figsize=(15,8))
plt.subplot(121, title='"Neat" Clusters')
plt.scatter(x_neat[:,0], x_neat[:,1])
plt.subplot(122, title='"Messy" Clusters')
plt.scatter(x_messy[:,0], x_messy[:,1])

km_neat = KMeans(n_clusters=3, random_state=0).fit_predict(x_neat)
km_messy = KMeans(n_clusters=3, random_state=0).fit_predict(x_messy)

plt.figure(figsize=(15,8))
plt.subplot(121, title='"Neat" K-Means')
plt.scatter(x_neat[:,0], x_neat[:,1], c=km_neat, cmap=cmap)
plt.subplot(122, title='"Messy" K-Means')
plt.scatter(x_messy[:,0], x_messy[:,1], c=km_messy, cmap=cmap)

When our data is roughly spherical and separated, k-means works fine. When it's the opposite, not so much!

##Your Turn!


**Let's start by creating some artificial data clusters using the `sklearn.datasets` sample generators used above (learn more and discover other data generators [here](http://scikit-learn.org/stable/datasets/index.html#sample-generators)).**

**Using the data you generated, run a k-means clustering and explain the outcome. How many clusters did you choose? Are you satisfied with these clusters? Why?**

**One method for determining how well a k-means clustering worked is called the "silhouette score." For each data point, it compares the mean distance to the other points in its own cluster with the mean distance to the points in the nearest cluster it *doesn't* belong to. Formally:** 

`(mean nearest cluster distance - mean intra-cluster distance)/max(nearest, intra-cluster)` 

**We've already imported the `silhouette_score` function in the first code cell above. Pass it the data set, the labels assigned by the k-means algorithm, and the distance measure, and it returns a value between -1 (bad clustering) and 1 (good clustering). For example:**

`silhouette_score(data_set, labels, metric='euclidean')`

**Calculate the score and report it. Are you satisfied or could it be improved?**
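As a sanity check on the formula, here is a minimal sketch that computes the score by hand on a tiny made-up data set and compares it with scikit-learn's result. The names `points`, `labels` and `manual_silhouette` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Tiny made-up data set: two tight groups of three points each
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def manual_silhouette(X, labels):
    """Mean silhouette coefficient, following the formula in the text."""
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # convention for single-point clusters
            continue
        intra = dist[i, same].mean()
        # Mean distance to each other cluster; keep the smallest
        nearest = min(dist[i, labels == other].mean()
                      for other in np.unique(labels) if other != labels[i])
        scores.append((nearest - intra) / max(nearest, intra))
    return float(np.mean(scores))

print(manual_silhouette(points, labels))
print(silhouette_score(points, labels, metric='euclidean'))
```

The two printed values should agree. Because the groups are tight and far apart, both land close to 1.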

##Hierarchical Clustering

Let's try a second method for clustering: hierarchical (sometimes called agglomerative) clustering. Remember that with this type of clustering algorithm we are *not* selecting the number of clusters beforehand but rather choosing them after the algorithm is run. This removes some of the potential bias of the analyst and can be more helpful in cases where we don't know anything about the potential clusters existing within the data set.

We already generated a few sample data sets above for k-means, but with 5000 samples our dendrogram would be hard to read. Let's generate the "neat" data again with fewer samples and apply hierarchical clustering to it. We'll finish by generating a dendrogram to see our results.

**NOTE: Scikit-learn *can* perform hierarchical clustering, but plotting a dendrogram isn't a native feature of the function. Therefore, we will be using the package "Scipy" for this example.**
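For comparison, the scikit-learn route might look like the sketch below. It uses the `AgglomerativeClustering` class imported in the first code cell. The variable names and toy data are made up for illustration. Note that this class needs a stopping rule up front (here a fixed `n_clusters`), which is another reason the Scipy workflow suits this section better.

```python
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering

# Small toy data set (parameters chosen only for illustration)
demo_X, _ = datasets.make_blobs(n_samples=150,
                                centers=[(-10, 10), (0, -5), (10, 5)],
                                cluster_std=1,
                                random_state=2)

# Merge clusters bottom-up with single linkage until 3 clusters remain
agg = AgglomerativeClustering(n_clusters=3, linkage='single')
agg_labels = agg.fit_predict(demo_X)

print(sorted(set(agg_labels)))  # the distinct cluster labels that were assigned
```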

In [0]:
#Create toy data sets
centers_neat = [(-10, 10), (0, -5), (10, 5)]
hc_neat, _ = datasets.make_blobs(n_samples=150,
                                 centers=centers_neat,
                                 cluster_std=1,
                                 random_state=2)

#Build the hierarchical (agglomerative) merge tree using single linkage, Scipy's default method
clt = linkage(hc_neat, method='single')

#Plot the dendrogram
plt.figure(figsize=(31,10))
dendrogram(clt, leaf_font_size=10, color_threshold=5)
plt.title('Dendrogram of \'neat\' clusters')
plt.ylabel('Euclidean Distance')
plt.show()

Our x-axis is labeled with our data set's row index and our y-axis is the Euclidean distance between data points (this is the default distance measure when using Scipy's 'linkage' function). What can we tell about our data from the dendrogram? How many clusters did we generate with our sample data set? Does our dendrogram look like it can identify that number of clusters? Let's have a look at the data set we generated:

In [0]:
#Plot our sample data
plt.figure(figsize=(12,10))
plt.scatter(hc_neat[:,0], hc_neat[:,1])
plt.title('Sample data set')

It looks like our dendrogram is identifying three clusters, which agrees with what we see when we visualize our sample data set above. Not bad for only a few lines of code!
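Reading clusters off a dendrogram by eye works here, but we can also ask Scipy for the cluster labels directly. The sketch below assumes we cut the tree at a distance of 5, the same value passed as `color_threshold` above, using `fcluster` from `scipy.cluster.hierarchy`. It regenerates the same toy data so that it runs on its own; the `demo_` names are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn import datasets

# Same toy data and linkage settings as the dendrogram example
demo_data, _ = datasets.make_blobs(n_samples=150,
                                   centers=[(-10, 10), (0, -5), (10, 5)],
                                   cluster_std=1,
                                   random_state=2)
demo_tree = linkage(demo_data, method='single')

# Cut the tree: points whose merges all happen below distance 5 share a label
cut_labels = fcluster(demo_tree, t=5, criterion='distance')

# Count how many points fall in each resulting cluster
cluster_ids, cluster_sizes = np.unique(cut_labels, return_counts=True)
print(dict(zip(cluster_ids.tolist(), cluster_sizes.tolist())))
```

Changing `t` moves the cut up or down the dendrogram, which is how the number of clusters gets chosen *after* the algorithm has run.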

##Your Turn!

**Generate a new sample data set and change the parameters to make it a little "messier" than what we did above.**

**Create a dendrogram to visualize your clustering and compare it to the data set you generated. What do you see? How many clusters do you think would be appropriate in this situation?**