### Clustering Whiskies By Flavor Profile

We'll do this using a clustering method from the scikit-learn machine learning
module.
The specific method we'll be using is called `spectral co-clustering`.

One way to think about the spectral co-clustering method
is to consider a list of words and a list of documents,
which is the context in which the method was first introduced.
We can represent the problem as a graph, where on the left we have words
and on the right, we have documents.
Picture the words as circles on the left and the documents as squares on the right.
A line connects a word to a document whenever that word appears in that document.
So for example, four of the words might appear in one particular document,
and a different set of words might appear in a second document.
Of course, in practice, you have many more words and documents,
but this is just a schematic of the basic idea.
The goal is to find clusters that consist
of sets of words and sets of documents that often go together.
An extreme example would be books written in different languages.
We can also think about documents that deal with specific scientific fields,
such as biology or physics.
The term co-clustering refers to the idea
that we simultaneously find both clusters
of words and clusters of documents.

We can first represent this graph as what
is called an `adjacency matrix`, where the rows correspond to words
and the columns correspond to documents.
Any given element of this matrix represents
the number of times a given word appears in the given document.
We can then take this matrix, manipulate it in certain ways,
and find an approximate solution to the stated clustering problem,
in terms of `eigenvalues` and `eigenvectors` of this modified matrix.
We will not go into the details here, but the term `spectral`
in spectral co-clustering refers to this use of eigenvalues and eigenvectors of a matrix.
If you'd like to learn more about eigenvalues and eigenvectors,
you can consult any course or textbook on matrix or linear algebra.
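
To make the spectral step a little more concrete, here is a minimal sketch for the simplest case of two clusters. The word list, document names, and counts are made up for illustration, and splitting by sign is a simplification: scikit-learn's `SpectralCoclustering` also works with singular vectors of a normalized matrix but finishes with k-means, so this should not be read as its exact implementation.

```python
import numpy as np

# Hypothetical word-by-document count matrix (rows: words, columns: documents).
words = ["gene", "cell", "protein", "quark", "photon", "energy"]
docs = ["bio_1", "bio_2", "phys_1", "phys_2"]
counts = np.array([
    [5, 4, 0, 1],
    [6, 3, 1, 0],
    [4, 5, 0, 0],
    [0, 1, 6, 4],
    [1, 0, 5, 5],
    [0, 0, 4, 6],
], dtype=float)

# "Manipulate" the matrix: scale each entry by the inverse square roots
# of its row total and its column total.
row_scale = 1.0 / np.sqrt(counts.sum(axis=1))
col_scale = 1.0 / np.sqrt(counts.sum(axis=0))
normalized = row_scale[:, np.newaxis] * counts * col_scale[np.newaxis, :]

# Singular vectors play the role of eigenvectors for a rectangular matrix.
u, s, vt = np.linalg.svd(normalized, full_matrices=False)

# The first singular pair is uninformative (its singular value is 1);
# the sign pattern of the second pair separates the two groups.
word_side = row_scale * u[:, 1]
doc_side = col_scale * vt[1, :]
word_groups = (word_side > 0).astype(int)
doc_groups = (doc_side > 0).astype(int)

print(dict(zip(words, word_groups)))
print(dict(zip(docs, doc_groups)))
```

Words and documents that receive the same group number form one co-cluster. Which group is called 0 and which is called 1 is arbitrary, because the sign of a singular vector is arbitrary.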

We can think of the co-clustering technique as simultaneously clustering
the rows and columns of a matrix, whatever they happen to represent.
This method effectively re-orders the rows and columns of the correlation
matrix so that we can see blocks corresponding to groups of whiskies
more clearly.
Since the whiskies in the dataset come from six different regions,
we're going to ask the clustering algorithm to find six blocks.
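
As a rough illustration of this re-ordering, the sketch below builds a small synthetic similarity matrix with three hidden groups, shuffles it, and then sorts its rows and columns by the cluster labels that `SpectralCoclustering` assigns. The synthetic matrix and the names `hidden_groups`, `shuffled`, and `demo_model` are assumptions made for this example; they stand in for the whisky correlation matrix, which is not reproduced here.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Hypothetical data: 12 items in 3 hidden groups of 4.
# Items in the same group are similar (0.9), others are not (0.1).
hidden_groups = np.repeat([0, 1, 2], 4)
blocks = np.where(hidden_groups[:, np.newaxis] == hidden_groups[np.newaxis, :],
                  0.9, 0.1)

# Shuffle rows and columns together so the block structure is hidden.
rng = np.random.default_rng(0)
order = rng.permutation(len(hidden_groups))
shuffled = blocks[np.ix_(order, order)]

# Ask for as many clusters as there are hidden groups.
demo_model = SpectralCoclustering(n_clusters=3, random_state=0)
demo_model.fit(shuffled)

# Sort rows and columns by their cluster labels to bring blocks together.
row_order = np.argsort(demo_model.row_labels_)
col_order = np.argsort(demo_model.column_labels_)
reordered = shuffled[np.ix_(row_order, col_order)]

print(np.round(reordered, 1))
```

Plotting `shuffled` and `reordered` side by side, for example with `plt.pcolor`, should show the scattered high values gathering into blocks.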

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the whisky data and add the region column.
whisky = pd.read_csv('whiskies.txt')
whisky['Region'] = pd.read_csv('regions.txt')

# Columns 2 through 13 are used as the flavor attributes.
flavors = whisky.iloc[:, 2:14]

# Correlations between flavors, and between whiskies (the rows of `flavors`).
corr_flavors = flavors.corr()
corr_whisky = flavors.transpose().corr()

In [8]:
from sklearn.cluster import SpectralCoclustering

The first argument is `n_clusters`, which
is the number of clusters we'd like to find.
In this case, we'll set it equal to 6.
We'll also set `random_state` equal to 0.
This parameter seeds the random number generator so that the results are reproducible;
we don't have to care about it too much at this point.

In [9]:
model = SpectralCoclustering(n_clusters=6, random_state=0)

In [10]:
model.fit(corr_whisky)

SpectralCoclustering(init='k-means++', mini_batch=False, n_clusters=6,
                     n_init=10, n_jobs=None, n_svd_vecs=None, random_state=0,
                     svd_method='randomized')

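The `rows_` attribute of the fitted model is an array with the following dimensions:
number of row clusters by number of rows
in the data matrix (the correlation matrix, in this case).
The entries of the array are either `True` or `False`,
indicating whether a given row of the data matrix belongs to a given cluster.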

In [13]:
model.rows_

array([[False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False,  True, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False,  True, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False,  True, False, False, False, False, False, False, False,
        False, False, False,  True, False,  True, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False],
       [False, False, False, False, False,  True, False, False, False,
        False, False,  True, False,  True, False, False,  True, False,
         True, False, False, False, False, False, False, False, False,
        False, False, False, False, False,  True,  True,  True, False,
        False, False,  True, ...

If we sum across the columns of this array, that is, along each row,
we can find out how many observations belong to each cluster.
Let's try it out.
We will use NumPy's `sum` function.
We'll pass it `model.rows_`
and specify the axis.
Remember, `axis=0` refers to rows and `axis=1` refers to columns,
so `axis=1` adds up the entries across the columns of each row.
The output tells us how many whiskies belong to cluster 0,
cluster 1, cluster 2, and so on.
For example, here, 19 whiskies belong to cluster number 2.

In [14]:
np.sum(model.rows_, axis=1)

array([ 5, 20, 19,  6, 19, 17])
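
To see how the `axis` argument behaves, here is a small hand-made Boolean array in the same layout as `model.rows_`, with one row per cluster and one column per observation. The array `membership` is hypothetical and much smaller than the real output.

```python
import numpy as np

# Hypothetical membership array: 3 clusters (rows) by 5 observations (columns).
membership = np.array([
    [True, False, False, True, False],
    [False, True, False, False, False],
    [False, False, True, False, True],
])

# Summing with axis=1 collapses the columns: one total per cluster.
cluster_sizes = np.sum(membership, axis=1)

# Summing with axis=0 collapses the rows: one total per observation.
clusters_per_observation = np.sum(membership, axis=0)

print(cluster_sizes)             # [2 1 2]
print(clusters_per_observation)  # [1 1 1 1 1]
```

Because `True` counts as 1 and `False` as 0, summing a Boolean array is the same as counting its `True` entries.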

If instead we sum all of the rows, we can find out how many
clusters each observation belongs to.
Because each observation belongs to just one of the six clusters,
the answer should be 1 for all of them.
But let's check that just to be sure.

In [15]:
np.sum(model.rows_, axis=0)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Let's look at the row labels in our model.
How do we interpret the output here?
Observation number 0 belongs to cluster number 2,
observation number 1 belongs to cluster number 4, and so on.
All of the entries in the array have to be numbers between 0 and 5,
because we asked for six clusters.

In [16]:
model.row_labels_

array([2, 4, 5, 3, 2, 1, 5, 4, 2, 5, 4, 1, 2, 1, 2, 2, 1, 2, 1, 0, 5, 3,
       5, 3, 5, 5, 4, 4, 5, 4, 5, 2, 1, 1, 1, 2, 4, 5, 1, 0, 1, 5, 4, 4,
       4, 1, 2, 1, 1, 5, 5, 4, 4, 4, 1, 0, 2, 3, 3, 1, 5, 2, 4, 2, 2, 4,
       0, 2, 0, 1, 4, 2, 1, 2, 4, 2, 5, 3, 2, 5, 1, 5, 4, 1, 4, 1])
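
The outputs above suggest that `rows_` is simply the indicator (one-hot) form of `row_labels_`. The sketch below checks that relationship on a short, made-up label vector rather than on the fitted model, so `labels` and `indicator` are illustrative names, not attributes of scikit-learn.

```python
import numpy as np

# Hypothetical cluster labels for 5 observations and 3 clusters.
labels = np.array([2, 0, 1, 2, 0])
n_clusters = 3

# Build the Boolean indicator array: entry [c, i] is True
# when observation i carries label c.
indicator = labels[np.newaxis, :] == np.arange(n_clusters)[:, np.newaxis]

# Going back: the row holding True in each column is that observation's label.
recovered = np.argmax(indicator, axis=0)

# Cluster sizes computed two ways should agree.
sizes_from_indicator = indicator.sum(axis=1)
sizes_from_labels = np.bincount(labels, minlength=n_clusters)

print(indicator.shape)        # (3, 5)
print(recovered)              # [2 0 1 2 0]
print(sizes_from_indicator)   # [2 1 2]
```

With a fitted model, the analogous checks would be `np.array_equal(np.argmax(model.rows_, axis=0), model.row_labels_)` and comparing `np.bincount(model.row_labels_)` with `np.sum(model.rows_, axis=1)`.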