# Clustering Algorithms

## Unsupervised Learning

Our previous models have all been various instances of **supervised learning**, in which we attempt to determine the class to which an observation belongs from among a *known group of classes*. **Unsupervised learning** is used where we are uncertain of exactly how our data should be divided. In the case of internet browsing, determining whether or not a viewer will click an online advertisement is a case of supervised learning. On the other hand, trying to group individuals who shop on a particular website in order to make recommendations of other items they might wish to buy is a case of unsupervised learning.

In unsupervised learning, we need to be aware that we **do not know** how many groups exist, and that there are no "correct" labels. Unsupervised learning is better thought of as part of the data exploration process than as an end result of data modeling. We can use unsupervised learning to better understand our customers (or other groups!), but the results of unsupervised learning are unlikely to have value outside of their contribution in helping us to categorize various groups within our data.

## Why Make Clusters?

As we seek to understand our data, finding data-driven groups may be valuable in aiding our efforts to understand the patterns that we are seeking. For example, clustering observations based on observable data may reveal that what we believed to be a single outcome was in fact several different outcomes that look similar in some ways but distinct in others. 

Imagine that you work for a website that sells both macOS and Windows computers. You know that individuals with higher income in your data tend to buy more expensive computers (think larger hard drives, larger screens, more RAM, etc.). Clustering algorithms may reveal the distinctions between individuals who choose macOS or Windows computers in a way that allows you to better understand the needs of each type of consumer. You might observe that individuals who play computer games tend to purchase Windows computers, and can make recommendations to those individuals accordingly, while macOS consumers turn out to prefer design software such as photo or video editing programs. It is even possible that there is a third group, split between macOS and Windows buyers, that focuses on web/software design. This third group might be a great opportunity to promote additional monitors to keep more windows open!

Clustering allows us to reveal these patterns within our data.

## Agglomerative Clustering

Agglomerative clustering is one of the most popular clustering algorithms (another is K-means clustering), and it lends itself particularly well to visual analysis. It is also easy to understand. The process for creating clusters is:

1. Measure the distance between all observations/groups
2. Combine the two observations or groups that are the smallest distance from one another
3. Record the new group
4. Repeat steps 1 to 3 until only one group remains
5. Map the process of combining (agglomerating) observations and groups to form a dendrogram
6. Choose the appropriate level of clustering, and note the groups to which each observation belongs
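
Steps 1 to 4 above can be sketched in a few lines of NumPy. This is only an illustration, not how `sklearn` implements the algorithm: the function name `naive_agglomerate`, the toy data, and the choice to measure the distance between two groups by their closest pair of members (single linkage) are all assumptions made for the example.

```python
import numpy as np

def naive_agglomerate(points):
    """Merge the two closest groups until one remains; return the merge history."""
    points = np.asarray(points, dtype=float)
    # Start with every observation in its own group (keyed by a group id)
    groups = {i: [i] for i in range(len(points))}
    next_id = len(points)
    merges = []  # each entry: (group_a, group_b, distance, size of new group)

    def group_distance(a, b):
        # Assumed rule: distance between the closest pair of members (single linkage)
        diffs = points[groups[a]][:, None, :] - points[groups[b]][None, :, :]
        return np.sqrt((diffs ** 2).sum(axis=2)).min()

    while len(groups) > 1:
        ids = sorted(groups)
        # Step 1: measure the distance between every pair of groups
        pairs = [(group_distance(a, b), a, b)
                 for i, a in enumerate(ids) for b in ids[i + 1:]]
        # Step 2: pick the pair with the smallest distance
        dist, a, b = min(pairs)
        # Step 3: record the new group
        groups[next_id] = groups.pop(a) + groups.pop(b)
        merges.append((a, b, dist, len(groups[next_id])))
        next_id += 1  # Step 4: loop until only one group remains
    return merges

# Hypothetical toy data: four observations on a line
toy = [[0.0], [1.0], [5.0], [6.5]]
merges = naive_agglomerate(toy)
for a, b, dist, size in merges:
    print(f"merge {a} + {b} at distance {dist:.1f} -> group of {size}")
```

The recorded merge history is exactly the information a dendrogram draws: which groups joined, and at what distance.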

When you are finished, you will have a dendrogram, which will look very similar to a decision tree:

![](https://f1000researchdata.s3.amazonaws.com/manuscripts/10884/292fd343-e909-466c-9f2d-4f6b9aa4c44e_figure1.gif)

The higher up the visual two observations or groups join together, the less similar they are. Conversely, like observations and groups will join together lower down the visual.
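
To see how merge height relates to similarity, here is a minimal sketch using `scipy.cluster.hierarchy`, assuming `scipy` is installed. The toy data and the cut height of 2 are hypothetical values chosen for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical toy data: two tight pairs that sit far apart from each other
toy = np.array([[0.0], [1.0], [5.0], [6.5]])

# Each row of Z is one merge: [group_a, group_b, merge height, new group size]
Z = linkage(toy, method="single")
heights = Z[:, 2]
print(heights)  # similar observations merge at low heights, dissimilar groups higher up

# Choosing a level of clustering: cut the tree at height 2,
# which keeps only the merges that happened below that height
groups = fcluster(Z, t=2, criterion="distance")
print(groups)
```

The two tight pairs join early, and the final merge between them happens much higher up, so cutting the tree between those heights leaves two groups.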

## Using `sklearn` to Cluster Digits

Now it's time to make our own clusters. Let's explore MNIST with clusters. First, we need to import the data. We will import it, and then create a SMALL sample in order to make our visuals easier to follow.

```python
import pandas as pd

# Read in data
data = pd.read_csv("https://github.com/dustywhite7/pythonMikkeli/blob/master/exampleData/mnistTrain.csv?raw=true")

# Grab 10 observations of each digit (0 to 9).
# GroupBy.sample keeps the 'Label' column and returns the rows grouped by label.
data = data.groupby('Label').sample(n=10).reset_index(drop=True)
```

Unnamed: 0,Label,0,1,2,3,4,5,6,7,8,...,774,775,776,777,778,779,780,781,782,783
0,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,9,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0
96,9,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0
97,9,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0
98,9,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0


Next, we can create our clusters using `sklearn`:

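```python
from sklearn.cluster import AgglomerativeClustering

# Separate the true labels from the pixel values
y = data['Label']
x = data.drop('Label', axis=1)

# Build the full tree without stopping at a fixed number of clusters
clusters = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(x)
```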

By setting `distance_threshold=0` and `n_clusters=None`, we ask `sklearn` to build the full tree: groups keep merging until only one remains, and the model does not stop at any particular number of groups. If we instead expect three distinct categories of observations, we would set `distance_threshold=None` and `n_clusters=3`. Exactly one of the two must be specified: if `distance_threshold` is `None`, then `n_clusters` must be set, and vice versa.
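
The two configurations can be compared on a small example. This is a sketch on hypothetical stand-in data (three well-separated blobs generated with NumPy), not on the MNIST sample; the variable names are made up for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical stand-in data: three well-separated blobs of 5 points each
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(loc=c, scale=0.1, size=(5, 2)) for c in (0, 5, 10)])

# Full tree: no merge falls below a threshold of 0, so every observation keeps its own label
full_tree = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(toy)
print(full_tree.n_clusters_)      # one cluster per observation
print(full_tree.children_.shape)  # one row per merge in the tree

# Fixed number of groups: the tree is cut where three clusters remain
three = AgglomerativeClustering(distance_threshold=None, n_clusters=3).fit(toy)
print(three.labels_)

# Setting both at once is rejected when the model is fit
try:
    AgglomerativeClustering(distance_threshold=1.0, n_clusters=3).fit(toy)
    both_rejected = False
except ValueError:
    both_rejected = True
print(both_rejected)
```

Note that the full-tree model is mainly useful for its merge history (`children_` and `distances_`), since its `labels_` place every observation in a separate cluster.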

We can also use the `fit_predict()` method to both fit the tree and return an array of the clusters to which each observation belongs, so that we can include the cluster variable in future analyses:

```python
# I specify 10 groups, because MNIST has 10 unique labels

AgglomerativeClustering(distance_threshold=None, n_clusters=10).fit_predict(x)
```

```
array([6, 4, 2, 4, 6, 4, 4, 6, 4, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 1, 1,
       1, 6, 6, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 9, 0,
       9, 3, 0, 7, 3, 7, 0, 0, 6, 0, 2, 2, 6, 0, 6, 1, 5, 5, 9, 5, 9, 5,
       5, 5, 9, 5, 8, 3, 3, 3, 3, 3, 3, 3, 3, 7, 2, 6, 2, 2, 6, 2, 2, 8,
       2, 2, 3, 7, 0, 0, 7, 7, 7, 3, 7, 0])
```
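
One way to carry the cluster variable into later analysis is to store it as a column and cross-tabulate it against the true labels. The sketch below is self-contained, so it makes an assumption: `sklearn`'s bundled 8x8 digits data (`load_digits`) stands in for the MNIST sample, and the column name `cluster` is made up for the example.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_digits

# Assumption: sklearn's bundled 8x8 digits stand in for the MNIST sample above
digits = load_digits(as_frame=True)
sample = (digits.frame
          .groupby("target")
          .sample(n=10, random_state=0)
          .reset_index(drop=True))

labels = sample["target"]
pixels = sample.drop(columns="target")

# Store the cluster assignment as a new column for later analysis
sample["cluster"] = AgglomerativeClustering(n_clusters=10).fit_predict(pixels)

# Rows are the true digits, columns are the cluster numbers
table = pd.crosstab(labels, sample["cluster"])
print(table)
```

Cluster numbers are arbitrary identifiers, so cluster 3 has no particular connection to the digit 3. What matters is whether the ten observations of a digit land in the same column.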

## Plotting the Dendrogram

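To make our lives easy, `plotly` includes a pre-built figure factory, `create_dendrogram`, that performs its own agglomerative clustering on the data we pass in (rather than reusing the `sklearn` model we fit above) and draws the resulting dendrogram: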

```python
# Import statements
import plotly.figure_factory as ff
import plotly.offline as py

py.init_notebook_mode(connected=True)

# Create the dendrogram
fig = ff.create_dendrogram(x, orientation='left')

# Make the figure look nice
fig.update_xaxes(title="Difference Level", tickangle=90)
fig.update_yaxes(title="Observation")
fig.update_layout(title="Dendrogram of MNIST", width=600, height=1200)

# Plot the figure
py.iplot(fig)
```

Our MNIST data contains a significant amount of nuance that clustering on raw pixel values is unable to capture. The clusters do not resemble groups of ten at any level of the dendrogram, and each group of ten observations (our small sample is ordered by label, so each digit occupies ten consecutive rows) is scattered across the dendrogram.
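
The scattering can also be checked without a plot by reading the leaf order straight from the tree. This is a hedged sketch: it assumes `scipy` is installed, uses `sklearn`'s bundled 8x8 digits as a stand-in for the MNIST sample, and picks complete linkage as an assumption, so the exact ordering will differ from the figure above.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import load_digits

# Assumption: sklearn's bundled 8x8 digits stand in for the MNIST sample
digits = load_digits(as_frame=True)
sample = digits.frame.groupby("target").sample(n=10, random_state=0)

true_labels = sample["target"].astype(str).tolist()
pixels = sample.drop(columns="target").to_numpy()

# Complete linkage is an assumption here; other linkage rules reorder the leaves
Z = linkage(pixels, method="complete")

# no_plot=True skips drawing and only returns the layout, including leaf order
layout = dendrogram(Z, no_plot=True, labels=true_labels)
leaf_order = layout["ivl"]

# Each entry is the true digit of one leaf, read along the dendrogram
print(" ".join(leaf_order))
```

If clustering recovered the digits perfectly, the printed sequence would consist of ten unbroken runs, one per digit. Interruptions in those runs are the scattering described above.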

**Describe it:**

Choose a question that you think would benefit from being described through a clustering algorithm. Find a data set that might be capable of answering your question, and explain why you believe that clustering would be beneficial to answering the question with that particular data set. Then, provide code to load the data of your choice and perform the clustering that you explained.

The first graded cell should contain your explanation, and the second graded cell should contain your code. Your work will be graded as follows:

- You state a specific research question **[1 point]**
- You find and indicate data that could answer that question **[1 point]**
- You explain how clustering would be beneficial **[1 point]**
- You load the data you found **[1 point]**
- You implement clustering on the data that you loaded **[1 point]**

Code goes here $\downarrow\downarrow\downarrow$