# Comparing clustering algorithm effectiveness

In this lab you'll apply the three main clustering algorithms you've learned so far to seven datasets designed to test how well clustering algorithms perform.

This lab is exploratory and visualization-heavy.

---

### Load packages

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

---

### Load the datasets

Each of the seven datasets has three columns:

    x
    y
    label
    
Since each dataset has only two variables (`x` and `y`), it is easy to examine visually. The `label` column holds the "true" cluster assignment, which you will compare against the clusters the algorithms find.

In [2]:
# Directory holding the shape_clusters CSV files; adjust to your local copy.
data_dir = '/Users/kiefer/github-repos/DSI-SF-2/datasets/shape_clusters/'

flame = pd.read_csv(data_dir + 'flame.csv')
agg = pd.read_csv(data_dir + 'aggregation.csv')
comp = pd.read_csv(data_dir + 'compound.csv')
jain = pd.read_csv(data_dir + 'jain.csv')
path = pd.read_csv(data_dir + 'pathbased.csv')
r15 = pd.read_csv(data_dir + 'r15.csv')
spiral = pd.read_csv(data_dir + 'spiral.csv')

---

### Plot each of the datasets with the true labels colored

The datasets have different numbers of unique labels, so you will need to figure out how many there are for each one and color the clusters accordingly (r15 has 15 different clusters).
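
One possible starting point is sketched below. The helper name `plot_true_labels` and the random `demo` DataFrame are assumptions made for illustration, not part of the lab materials. The sketch only relies on the `x`, `y`, and `label` columns described above, and it converts the labels to integer codes so that any number of clusters gets its own color.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_true_labels(df, title, ax=None):
    """Scatter x against y, coloring each point by its true label."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 5))
    # Map labels to consecutive integer codes so the colormap is spread
    # evenly across however many labels the dataset has.
    codes, uniques = pd.factorize(df['label'], sort=True)
    ax.scatter(df['x'], df['y'], c=codes, cmap='nipy_spectral', s=20)
    ax.set_title('{} ({} labels)'.format(title, len(uniques)))
    return ax


# Hypothetical stand-in data; in the lab, pass flame, agg, comp, and so on.
rng = np.random.RandomState(0)
demo = pd.DataFrame({
    'x': rng.normal(size=90),
    'y': rng.normal(size=90),
    'label': np.repeat([1, 2, 3], 30),
})
ax = plot_true_labels(demo, 'demo')
```

Passing an existing `ax` lets you place several datasets on one grid of subplots, for example one panel per dataset.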

---

### Write a plotting function or functions to compare the performance of three clustering algorithms

Below, the three clustering algorithms covered earlier in the class are imported:

    KMeans: k-means clustering
    AgglomerativeClustering: Hierarchical clustering (bottom-up)
    DBSCAN: density based clustering
    
Your function or functions should let you visually examine the effect of changing parameters in the clustering algorithms. At a minimum, explore these parameters:

    KMeans:
        n_clusters
    AgglomerativeClustering:
        n_clusters
    DBSCAN:
        eps
        min_samples
        
You are, of course, welcome to explore other parameters for these models.


In [3]:
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
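
A minimal sketch of one such comparison function follows. The name `compare_clusterers`, its default parameter values, and the `make_moons` stand-in data are assumptions for illustration. It fits all three algorithms on the same points and draws one panel per algorithm, so changing `n_clusters`, `eps`, or `min_samples` and re-running shows the effect directly.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons


def compare_clusterers(X, n_clusters=2, eps=0.3, min_samples=5):
    """Fit the three algorithms on X and plot their assignments side by side."""
    models = {
        'KMeans': KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
        'Agglomerative': AgglomerativeClustering(n_clusters=n_clusters),
        'DBSCAN': DBSCAN(eps=eps, min_samples=min_samples),
    }
    fig, axes = plt.subplots(1, 3, figsize=(15, 4.5))
    results = {}
    for ax, (name, model) in zip(axes, models.items()):
        labels = model.fit_predict(X)
        results[name] = labels
        ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='nipy_spectral', s=20)
        # DBSCAN marks noise points with -1; do not count that as a cluster.
        n_found = len(set(labels) - {-1})
        ax.set_title('{}: {} clusters'.format(name, n_found))
    return fig, results


# Hypothetical stand-in data; in the lab use e.g. flame[['x', 'y']].values.
X_demo, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
fig, results = compare_clusterers(X_demo, n_clusters=2, eps=0.3, min_samples=5)
```

Returning the label arrays alongside the figure keeps the door open for comparing them with the true `label` column later.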

## Tinkering with clustering parameters

In the next sections, play around with the parameters for the clustering algorithms to see their effect and try to get clusters that make sense. There is no right answer here, as these are unsupervised techniques.
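
Because these particular datasets ship with a `label` column, one optional way to supplement the visual check is to score each parameter setting against the true labels. The sketch below sweeps `eps` for DBSCAN and scores each result with `adjusted_rand_score` from `sklearn.metrics`. The candidate `eps` values and the `make_moons` stand-in data are assumptions, so sensible ranges for the lab datasets will differ.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Hypothetical stand-in data; in the lab use e.g.
# X = flame[['x', 'y']].values and y_true = flame['label'].values.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

scores = {}
for eps in [0.05, 0.1, 0.2, 0.3, 0.5]:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    # The adjusted Rand index is 1.0 for a perfect match with the true
    # labels and close to 0.0 for a random assignment.
    scores[eps] = adjusted_rand_score(y_true, labels)

best_eps = max(scores, key=scores.get)
for eps, score in scores.items():
    print('eps={:<5} ARI={:.3f}'.format(eps, score))
print('best eps:', best_eps)
```

A high score only says the clusters agree with the provided labels, so it is still worth looking at the plot before settling on a setting.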

---

### Find parameters for the `flame` dataset

Which algorithm performs best?

---

### Find parameters for the `agg` dataset

Which algorithm performs best?

---

### Find parameters for the `comp` dataset

Which algorithm performs best?

---

### Find parameters for the `jain` dataset

Which algorithm performs best?

---

### Find parameters for the `path` dataset

Which algorithm performs best?

---

### Find parameters for the `r15` dataset

Which algorithm performs best?

---

### Find parameters for the `spiral` dataset

Which algorithm performs best?

---

## [BONUS] Explore algorithms we have not covered

sklearn comes with a variety of unsupervised clustering algorithms, many of which we have not covered in class. Two algorithms in particular may be of interest to you:

1. [Affinity Propagation](http://scikit-learn.org/dev/modules/clustering.html#affinity-propagation) finds clusters by "sending messages" between points. A "damping factor" controls how strongly each round of messages updates the previous one. The main appeal of affinity propagation is that, as with DBSCAN, the user does not need to specify the number of clusters.
2. [Birch](http://scikit-learn.org/dev/modules/clustering.html#birch) finds clusters with a tree-based algorithm (somewhat) reminiscent of decision trees. It builds a tree whose nodes summarize groups of nearby points and derives the clusters from those summaries.
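
As a rough starting point, both estimators follow the same `fit_predict` pattern as the algorithms used earlier. In the sketch below, the parameter values (`damping=0.9`, `threshold=0.5`, `n_clusters=3`) and the `make_blobs` stand-in data are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, Birch
from sklearn.datasets import make_blobs

# Hypothetical stand-in data; in the lab use e.g. agg[['x', 'y']].values.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

# Affinity propagation picks the number of clusters itself; damping
# (between 0.5 and 1) slows the message updates to help convergence.
ap = AffinityPropagation(damping=0.9, random_state=0)
ap_labels = ap.fit_predict(X)

# Birch builds its tree of subclusters using threshold, then merges the
# subclusters into n_clusters final clusters.
birch = Birch(threshold=0.5, n_clusters=3)
birch_labels = birch.fit_predict(X)

print('AffinityPropagation clusters:', len(set(ap_labels)))
print('Birch clusters:', len(set(birch_labels)))
```

For affinity propagation, the `preference` parameter is another setting worth exploring, since it influences how many clusters are found.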