# Problem Set 3: HAC

This problem set departs a bit from the usual in that we're going to use `scipy`'s hierarchical clustering tools. Scikit-learn has `AgglomerativeClustering`, but the `scipy.cluster.hierarchy` tools give us dendrograms, which are nice for visualization.

In [None]:
# Boilerplate

import numpy as np
import pandas as pd

from sklearn.utils import check_random_state
# Some scaling functions
from sklearn.preprocessing import robust_scale, minmax_scale, maxabs_scale, scale

import matplotlib.pyplot as plt
import seaborn as sns

# Magic function to make matplotlib inline; other style specs must come AFTER
%matplotlib inline

# Enable high resolution PNGs
%config InlineBackend.figure_formats = {'png', 'retina'}

# Seaborn settings for notebooks
rc = {'lines.linewidth': 2, 
      'axes.labelsize': 18, 
      'axes.titlesize': 18, 
      'axes.facecolor': '#DFDFE5'}
sns.set(context='notebook', style='darkgrid', rc=rc)

# We're going to use scipy for visualizations and clustering
from scipy.cluster import hierarchy


## Loading the data

I've included an [HTML file](Data-Cleaning-Tutorial.html) that describes in detail how I loaded and cleaned the dataset. If you're interested, you should look this over to see what I did. It might also be helpful for your project.

To sum up, 
1. I load the dataset that I [downloaded from here](http://catalog.data.gov/dataset/ssa-disability-claim-data), using `pd.read_csv`, making sure to specify the thousands separator as a comma (see [the tutorial](Data-Cleaning-Tutorial.html) for why).
2. I drop any column with one unique value, because it will tell us nothing, and any column that has an "object" type.
3. I drop any row with `na` values.

Then I use `head` to look at the first few rows:

In [None]:
df = pd.read_csv('SSA-SA-FYWL.csv', thousands=',')
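# Drop object-typed columns and columns with a single unique value, then rows with NaN.
# (`df.items()` and the builtin `object` replace `iteritems` and `np.object`, which newer pandas/NumPy removed.)
drop_cols = [col for col, data in df.items() if data.dtype == object or data.nunique() == 1]
df_clean = df.drop(drop_cols, axis=1).dropna()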
df_clean.head()

## Problem 1

We're plotting some dendrograms in this assignment. Remember that a *dendrogram* is a visualization of the HAC process that tells us how the clusters are formed. It's a tree, and the leaves are the individual examples. The tree branches where two clusters are joined. The *height* where the branch happens is the distance between the two clusters, which depends on the metric we use and the method of joining clusters. 
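
To make the "height" idea concrete, here is a minimal sketch on four made-up one-dimensional points (not the SSA data). It prints the linkage matrix that `scipy` builds before anything is drawn: each row is one join, and the third column is the height at which that join appears in the dendrogram.

```python
import numpy as np
from scipy.cluster import hierarchy

# Four made-up 1-D points: two tight pairs that sit far apart
points = np.array([[0.0], [1.0], [10.0], [12.0]])

# Each row of Z is one join: [cluster_a, cluster_b, height, size_of_new_cluster]
Z = hierarchy.linkage(points, method='single')
print(Z)

# The third column holds the join heights, smallest first
heights = Z[:, 2]
print(heights)
```

With single linkage the two pairs join at heights 1 and 2, and the final join happens at 9, the gap between the closest points of the two pairs.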

Let's write a function that plots a nice big dendrogram so we can see what's going on. Then let's plot a dendrogram for each linkage method:

In [None]:
def plot_big_dendro(data, title=None, scale_fn=None, method='ward', n_clusters=1):
    plt.figure(figsize=(15, 15))
    plt.title(title)
    if scale_fn is None:
        Z = hierarchy.linkage(data, method=method)
    else:
        Z = hierarchy.linkage(scale_fn(data), method=method)
    # Color everything below the height of the join that reduces n_clusters clusters to n_clusters - 1
    dn = hierarchy.dendrogram(Z, color_threshold=Z[1-n_clusters,2], distance_sort=True, no_labels=True)
    plt.show()

plot_big_dendro(df_clean.values, title="Single linkage", method='single')
plot_big_dendro(df_clean.values, title="Complete linkage", method='complete')
plot_big_dendro(df_clean.values, title="Average linkage", method='average')
plot_big_dendro(df_clean.values, title="Ward linkage", method='ward')

Given what you know about linkages, compare the dendrograms above. Remember that height corresponds to distance between clusters. What does it mean for a cluster to be "high and narrow"? Or "low and wide"? Is complete linkage the best *for this data*? Why or why not? Is single the best *for this data*? Why or why not? Would you use single linkage for some applications and complete for others? What about average linkage? Or Ward linkage?

YOUR ANSWER HERE

## Problem 2

In addition to what we already see above, we can also preprocess the data to scale all of the features before we try to cluster the data. We might want to do this because certain features in the original dataset are naturally larger than others -- for example, any "population" feature will be naturally larger than a "percentage" feature. 

There are several ways to do this, of which these are only a few:
1. `scale`, which normalizes each feature by converting to its Z score (subtract mean and divide by standard deviation)
2. `robust_scale`, which does something similar, but uses median and interquartile range
3. `minmax_scale`, which contracts the values linearly to the range [0,1], so that the min is 0 and the max is 1
4. `maxabs_scale`, which simply divides by the maximum absolute value of the feature. This is good for preserving sparsity, because it doesn't affect zero values.
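
As a quick illustration of how the four functions differ, the sketch below applies each one to a single made-up feature that has a zero and one large value. The array `x` is invented for this example and is not part of the dataset.

```python
import numpy as np
from sklearn.preprocessing import scale, robust_scale, minmax_scale, maxabs_scale

# Made-up feature column with a negative value, a zero, and one large value
x = np.array([[-2.0], [0.0], [1.0], [2.0], [14.0]])

scaled = {
    'scale': scale(x),                # zero mean, unit standard deviation
    'robust_scale': robust_scale(x),  # zero median, unit interquartile range
    'minmax_scale': minmax_scale(x),  # min -> 0, max -> 1
    'maxabs_scale': maxabs_scale(x),  # divide by max |x|, so zeros stay zero
}

for name, values in scaled.items():
    print(f"{name:>13}: {np.round(values.ravel(), 2)}")
```

Note that only `maxabs_scale` leaves the zero entry at exactly zero; the others shift the data first.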

Let's compare them with each other and with the unscaled data, using Ward linkage throughout:

In [None]:
plot_big_dendro(df_clean.values, title="Just a dendrogram, Ma'am")
plot_big_dendro(df_clean.values, title="Scale max absolute value", scale_fn=maxabs_scale)
plot_big_dendro(df_clean.values, title="Scale min and max to range", scale_fn=minmax_scale)
plot_big_dendro(df_clean.values, title="Scale using Z score", scale_fn=scale)
plot_big_dendro(df_clean.values, title="Scale using robust scaling", scale_fn=robust_scale)

Compare the above dendrograms. In the spirit of Problem 1, what would you say about them? Is any one scaling method preferable? Would you use *any* of them?

YOUR ANSWER HERE

## Problem 3

We can also color the dendrogram with different colors for different clusters. The function above just chooses the right cluster join from the linkage information and colors everything below that distance. Every connected component below the threshold gets a different color.
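
Here is a small sketch of why the threshold `Z[1-n_clusters, 2]` gives the requested number of clusters. It uses made-up random data and assumes no two joins share the same height. Every join below the threshold merges two groups into one, so the number of groups left is the number of samples minus the number of joins below the threshold.

```python
import numpy as np
from scipy.cluster import hierarchy

# Made-up data: 20 random points in 3 dimensions
rng = np.random.RandomState(0)
toy = rng.normal(size=(20, 3))

Z = hierarchy.linkage(toy, method='ward')
n_samples = toy.shape[0]

components = {}
for k in (2, 4, 8):
    # Height of the join that takes k clusters down to k - 1
    threshold = Z[1 - k, 2]
    # Joins strictly below the threshold are the ones that get colored
    n_below = int(np.sum(Z[:, 2] < threshold))
    components[k] = n_samples - n_below
    print(f"k={k}: threshold={threshold:.3f}, groups below threshold={components[k]}")
```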

Run the code below a few different times and play with the number of clusters, the linkage method, and the scaling function:

In [None]:
n_clusters = 8
scale_fn = None # One of None, scale, robust_scale, minmax_scale, maxabs_scale
method = 'ward' # One of 'single', 'complete', 'average', and 'ward'

In [None]:
plot_big_dendro(df_clean.values, title="Dendrogram with colors", n_clusters=n_clusters, scale_fn=scale_fn, method=method)

What are the settings that you feel give the best clustering *for this data*? What makes these choices better than others?

YOUR ANSWER HERE

## Problem 4

Now we're going to extract a labeling given the same three parameters. Then we'll join that information with the *original* data, which still has state and region codes. We'll start with the same parameters that you selected before, but you can change this later. We also include a stacked bar plot, broken out by region. The height is the count in each category:

In [None]:
n_clusters = n_clusters
scale_fn = scale_fn # One of None, scale, robust_scale, minmax_scale, maxabs_scale
method = method # One of 'single', 'complete', 'average', and 'ward'

In [None]:
def get_hac_labels(data, scale_fn=None, method='ward', n_clusters=8):
    if scale_fn is None:
        Z = hierarchy.linkage(data, method=method)
    else:
        Z = hierarchy.linkage(scale_fn(data), method=method)
    return hierarchy.cut_tree(Z, n_clusters=n_clusters).squeeze()

labels = get_hac_labels(df_clean.values, scale_fn=scale_fn, method=method, n_clusters=n_clusters)
label_df = pd.DataFrame(labels, index=df_clean.index, columns=['label'])
label_df = df.join(label_df)

column = 'Region Code'
sns.set_palette('Set2', n_clusters)
label_df.loc[:,[column, 'label']].pivot_table(
    index=column, columns='label', aggfunc=len
).plot.bar(figsize=(15, 5), stacked=True, width=1.0)
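
The labeling and join steps can be sketched on a tiny made-up table (the column `x` and the region values are invented for this example). The labels are indexed like the cleaned rows, so the join lines them up with the original rows, and any row that was dropped during cleaning ends up with a missing label. For the counts, this sketch uses `pd.crosstab` rather than the `pivot_table` call above; it is just another way to count rows per region and label.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

# Made-up stand-in for the original data: one row has a missing value
raw = pd.DataFrame({
    'Region Code': ['A', 'A', 'B', 'B', 'B'],
    'x': [0.0, 0.5, 10.0, np.nan, 10.5],
})
clean = raw.drop(columns=['Region Code']).dropna()

# Cluster the cleaned rows and cut the tree into two clusters
Z = hierarchy.linkage(clean.values, method='ward')
labels = hierarchy.cut_tree(Z, n_clusters=2).squeeze()

# Join on the index: the dropped row gets NaN in the label column
labeled = raw.join(pd.DataFrame({'label': labels}, index=clean.index))
print(labeled)

# Count rows per (region, label); rows with a missing label are left out
counts = pd.crosstab(labeled['Region Code'], labeled['label'])
print(counts)
```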

Play with the parameters above. Do some parameter settings make more sense than others?

YOUR ANSWER HERE

## Problem 5

Now we'll do the same thing, except that we'll break it out by state. Again, play with the parameters:

In [None]:
n_clusters = n_clusters
scale_fn = scale_fn # One of None, scale, robust_scale, minmax_scale, maxabs_scale
method = method # One of 'single', 'complete', 'average', and 'ward'

In [None]:
column = 'State Code'
sns.set_palette('Set2', n_clusters)
label_df.loc[:,[column, 'label']].pivot_table(
    index=column, columns='label', aggfunc=len
).plot.bar(figsize=(15, 5), stacked=True, width=1.0)

Something surprising happens with the states. All of them should be nearly the same height, 15, except for PR, which is 13 (there were `nan` values in those data points). But there is nearly no variation within most of the states. Why is this? Also, when you play with the parameters, can you say anything about the clusters that occur? Why do some states land in the same colors as other states?

YOUR ANSWER HERE