In [5]:
# Uncomment and run if scikit-learn is not installed or is outdated
#!pip install scikit-learn --upgrade

# Clustering Algorithms - Hierarchical Clustering

In [6]:
%matplotlib notebook
from sklearn import datasets
from sklearn.metrics import adjusted_mutual_info_score
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt
import seaborn as sns

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import warnings
warnings.filterwarnings('ignore')



We are going to use the functions that SciPy provides for hierarchical agglomerative clustering.
We will also continue working with the iris dataset.
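
Before applying it to iris, a toy example may help show what `linkage` returns. This is a minimal sketch on hypothetical one-dimensional points (the data and variable names are illustrative, not part of the original notebook). Each row of the returned matrix describes one merge.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: four 1-D points forming two obvious pairs
points = np.array([[0.0], [1.0], [5.0], [6.0]])

# Each row of Z records one merge:
# [index of cluster a, index of cluster b, merge distance, size of new cluster]
Z = linkage(points, method='single')
print(Z)

# n observations always produce n - 1 merges
print(Z.shape)
```

The last row is the final merge, which joins the two pairs at the single linkage distance between them (the gap between 1 and 5) and contains all four points.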

In [7]:
iris = datasets.load_iris()
plt.figure(figsize=(6,6))
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=iris['target'],s=100);


The plot shows that the labeled classes do not form well-separated clusters, so it will be difficult for hierarchical clustering to recover these three groups.

## Single Linkage

First we apply **single linkage** clustering to the iris dataset.

In [8]:
%time clust = linkage(iris['data'], method='single')
plt.figure(figsize=(8,8))
dendrogram(clust, distance_sort=True, orientation='right');

CPU times: user 1.71 ms, sys: 976 µs, total: 2.69 ms
Wall time: 952 µs



The dendrogram shows evidence of only two distinct partitions in the dataset. If we cut the dendrogram so that we have three clusters, we obtain the following:

In [9]:
plt.figure(figsize=(6,6))
clabels = fcluster(clust, 3, criterion='maxclust')

plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=clabels ,s=100);


We can compare the true labels with the ones obtained by this clustering algorithm using, for example, the adjusted mutual information (AMI) score.

In [10]:
print("AMI= ", adjusted_mutual_info_score(iris['target'], clabels))

AMI=  0.5820928222202184


An AMI of about 0.58 is not a very good result.
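
To help read this number, here is a small sketch on hypothetical label vectors (not taken from the notebook). It illustrates two properties of `adjusted_mutual_info_score`: the score does not depend on how the clusters are named, and a clustering that carries no information about the classes scores close to zero.

```python
from sklearn.metrics import adjusted_mutual_info_score

# Hypothetical ground truth: three classes of three items each
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same partition with different label names: this scores 1.0
renamed = [2, 2, 2, 0, 0, 0, 1, 1, 1]
ami_renamed = adjusted_mutual_info_score(true_labels, renamed)

# Everything in one cluster: no information about the classes,
# so the score is approximately 0
one_cluster = [0] * 9
ami_single = adjusted_mutual_info_score(true_labels, one_cluster)

print(ami_renamed, ami_single)
```

This is why the labels returned by `fcluster` can be compared directly with `iris['target']` even though the cluster numbers are arbitrary.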

## Complete linkage

Let's apply the **complete linkage** criterion to the data.

In [11]:
%time clust = linkage(iris['data'], method='complete')
plt.figure(figsize=(8,8))
dendrogram(clust, distance_sort=True, orientation='right');

CPU times: user 2.21 ms, sys: 46 µs, total: 2.26 ms
Wall time: 1.58 ms



Again there are two apparent clusters, but if we cut the dendrogram into three clusters we obtain something a little better.

In [12]:
plt.figure(figsize=(6,6))
clabels = fcluster(clust, 3, criterion='maxclust')
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=clabels,s=100);


If we compute the adjusted mutual information score, we get:

In [13]:
print(adjusted_mutual_info_score(iris['target'], clabels))

0.6963483696671463
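
The cuts above call `fcluster` with `criterion='maxclust'`, which asks for at most a given number of flat clusters. As a rough illustration, the sketch below compares it with `criterion='distance'`, which cuts the dendrogram at a fixed height instead. The toy points and the threshold of 2 are assumptions chosen for this example only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two tight pairs and one distant point
points = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])
Z = linkage(points, method='complete')

# Ask for at most three flat clusters
by_count = fcluster(Z, t=3, criterion='maxclust')

# Cut at a fixed height: merges above distance 2 are undone
by_height = fcluster(Z, t=2, criterion='distance')

print(by_count)
print(by_height)
```

On this data both calls produce the same partition: the two pairs and the isolated point. With real data the height is usually harder to choose than the number of clusters, which may be why `maxclust` is convenient here.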


## Average linkage

Now we apply the **average linkage** criterion to the data.

In [14]:
%time clust = linkage(iris['data'], method='average')
plt.figure(figsize=(8,8))
dendrogram(clust, distance_sort=True, orientation='right');

CPU times: user 151 µs, sys: 941 µs, total: 1.09 ms
Wall time: 825 µs



We cut again to obtain three clusters.

In [15]:
plt.figure(figsize=(6,6))
clabels = fcluster(clust, 3, criterion='maxclust')
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=clabels,s=100);


The adjusted mutual information score is higher for this criterion than for the previous two.

In [16]:
print(adjusted_mutual_info_score(iris['target'], clabels))

0.7934250515435666
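
The three criteria used so far differ only in how the distance between two clusters is derived from the pairwise distances between their members. A minimal sketch, assuming Euclidean distance and two hypothetical toy clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical toy clusters in 2-D
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between members of the two clusters
pairwise = cdist(cluster_a, cluster_b)

single_dist = pairwise.min()     # single linkage: closest pair
complete_dist = pairwise.max()   # complete linkage: farthest pair
average_dist = pairwise.mean()   # average linkage: mean over all pairs

print(single_dist, complete_dist, average_dist)  # 3.0 6.0 4.5
```

At each step the agglomerative algorithm merges the two clusters with the smallest such distance, so the choice of criterion changes which clusters are considered close.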


## Ward criterion

Now we apply the **Ward criterion**, which at each step merges the pair of clusters that gives the smallest increase in within-cluster variance.

In [17]:
%time clust = linkage(iris['data'], method='ward')
plt.figure(figsize=(8,8))
dendrogram(clust, distance_sort=True, orientation='right');

CPU times: user 474 µs, sys: 876 µs, total: 1.35 ms
Wall time: 1.03 ms
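
To make the Ward criterion more concrete, the sketch below computes the cost of one merge as the increase in the within-cluster sum of squared errors (SSE), and checks it against the well-known closed form based on cluster sizes and centroids. The toy clusters and the helper `sse` are hypothetical and only serve this illustration; this is not how SciPy implements the method internally.

```python
import numpy as np

def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    return ((points - points.mean(axis=0)) ** 2).sum()

# Hypothetical toy clusters in 2-D
cluster_a = np.array([[0.0, 0.0], [2.0, 0.0]])
cluster_b = np.array([[5.0, 1.0], [7.0, 1.0], [6.0, 4.0]])
merged = np.vstack([cluster_a, cluster_b])

# Cost of the merge: how much the total within-cluster SSE grows
increase = sse(merged) - sse(cluster_a) - sse(cluster_b)

# Equivalent closed form based on cluster sizes and centroids
n_a, n_b = len(cluster_a), len(cluster_b)
gap = cluster_a.mean(axis=0) - cluster_b.mean(axis=0)
closed_form = n_a * n_b / (n_a + n_b) * (gap ** 2).sum()

print(increase, closed_form)
```

Both values agree, which shows that the merge cost grows with the distance between centroids and with the sizes of the clusters being joined.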



In [18]:
plt.figure(figsize=(6,6))
clabels = fcluster(clust, 3, criterion='maxclust')
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=clabels,s=100);


This criterion scores a little lower than the previous one. Since real unsupervised datasets usually come without labels, we will have to use other quality criteria to decide which clustering method to use.

In [19]:
print(adjusted_mutual_info_score(iris['target'], clabels))

0.7578034225092115
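
As one possible example of a quality criterion that does not need labels, the sketch below computes the silhouette score for the four linkage methods. The choice of silhouette is an assumption made for illustration and is not taken from the original notebook; other internal indices could be used in the same way.

```python
from sklearn import datasets
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import linkage, fcluster

iris = datasets.load_iris()

scores = {}
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(iris['data'], method=method)
    labels = fcluster(Z, 3, criterion='maxclust')
    # Silhouette uses only the data and the cluster labels, not iris['target']
    scores[method] = silhouette_score(iris['data'], labels)

for method, score in scores.items():
    print(f"{method:>8}: {score:.3f}")
```

Silhouette values lie between -1 and 1, with higher values suggesting more compact and better separated clusters. A ranking obtained this way does not have to match the ranking given by the AMI against the true labels.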
