# Unsupervised clustering on rock properties

Sometimes we don't have labels, but would like to discover structure in a dataset. This is what clustering algorithms attempt to do. They don't require labels from us, which is why they are called 'unsupervised'.

We'll use a subset of the [Rock Property Catalog](http://subsurfwiki.org/wiki/Rock_Property_Catalog) data, licensed CC-BY Agile Scientific. Note that the data have been preprocessed, including the addition of noise. See the notebook [RPC_for_regression_and_classification.ipynb](RPC_for_regression_and_classification.ipynb). 

We'll use two unsupervised techniques:

- k-means clustering
- DBSCAN

We do have lithology labels for this dataset, so we can use those as a measure of how well we're doing with the clustering.

In [None]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
uid = "1TMqV0d6zEqhP-gK_jQlagTuPN7pFEI5rhkVN0xJIx4g"
url = f"https://docs.google.com/spreadsheets/d/{uid}/export?format=csv"

df = pd.read_csv(url)

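If you summarize the data, for example with `df.describe()`, notice that the count of `Rho` values is smaller than for the other properties: some rows have no density measurement.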
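To see where a count like that comes from, here is a minimal sketch on a made-up frame (the `toy` DataFrame and its values are hypothetical, not the real catalog). It assumes the smaller count reflects missing values, which is what `count()` and the `count` row of `describe()` report.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the real catalog: 'Rho' has two missing values.
toy = pd.DataFrame({
    "Vp": [3000.0, 3200.0, 4100.0, 5200.0],
    "Vs": [1500.0, 1600.0, 2300.0, 2900.0],
    "Rho": [2300.0, np.nan, 2550.0, np.nan],
})

# count() reports the non-null entries per column;
# describe() shows the same numbers in its 'count' row.
counts = toy.count()
print(counts)
print(toy.describe().loc["count"])

# dropna() removes every row that has at least one missing value.
complete = toy.dropna()
print(len(toy), "rows before,", len(complete), "rows after dropna()")
```

This is why calling `dropna()` before plotting or fitting can shrink the dataset noticeably when one column is sparsely populated.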

Pairplots are a good way to see how the various features are distributed with respect to each other:

In [None]:
cols = ['Vp', 'Vs', 'Rho_n']

sns.pairplot(df.dropna(), vars=cols, hue='Lithology', plot_kws={'edgecolor': None})

## Clustering with _k_-means

From [the Wikipedia article](https://en.wikipedia.org/wiki/K-means_clustering):

> k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
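To make the 'nearest mean' idea concrete, here is a rough NumPy sketch of the classic iterative procedure (often called Lloyd's algorithm). It is an illustration only, not what scikit-learn does internally: the function name `kmeans`, the random initialization, and the synthetic two-blob data are all assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids at k distinct, randomly chosen observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each observation goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the assignments are stable
        centroids = new_centroids
    return labels, centroids

# Synthetic data (hypothetical): two well-separated blobs in 2-D.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(10.0, 10.0), scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

Note that you have to choose `k` up front, and that the distances are Euclidean, so features with large numeric ranges dominate the assignment step.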

In [None]:
from sklearn.cluster import KMeans

In [None]:
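# Assumption: one cluster per lithology class; fit on the same features we predict on below.
clu = KMeans(n_clusters=df['Lithology'].nunique(), n_init=10, random_state=42).fit(df[cols].values)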

In [None]:
df['K means'] = clu.predict(df[cols].values)

In [None]:
for name, group in df.groupby('K means'):
    plt.scatter(group.Vp, group.Rho_n, label=name)
plt.legend()

We actually do have the labels, so let's compare...

In [None]:
for name, group in df.groupby('Lithology'):
    plt.scatter(group.Vp, group.Rho_n, label=name)
plt.legend()

## Measuring the accuracy

There are metrics for comparing clusterings. For example, `adjusted_rand_score`, described in the scikit-learn docs:

> The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusterings.
>
> The raw RI score is then “adjusted for chance” into the ARI score using the following scheme:
> 
> ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)
> 
> The adjusted Rand index is thus ensured to have a value close to 0.0 for random labeling independently of the number of clusters and samples and exactly 1.0 when the clusterings are identical (up to a permutation).
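To get a feel for the 'up to a permutation' part, here is a small pure-Python sketch of the index using the standard pair-counting (contingency table) form. The function name `adjusted_rand_index` and the six example labels are made up for illustration; in the notebook itself, `sklearn.metrics.adjusted_rand_score` is the function to use.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI computed from pair counts; a sketch, not the scikit-learn implementation."""
    n = len(labels_true)
    # Number of sample pairs placed together in both clusterings,
    # in the true clustering, and in the predicted clustering.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)
    max_index = (together_true + together_pred) / 2
    if max_index == expected:  # degenerate case, e.g. both clusterings are one big cluster
        return 1.0
    return (together_both - expected) / (max_index - expected)

# Hypothetical labels: cluster IDs need not match the class names, only the grouping.
truth = ["sandstone", "sandstone", "shale", "shale", "limestone", "limestone"]
same_up_to_names = [2, 2, 0, 0, 1, 1]
one_mistake = [2, 2, 0, 1, 1, 1]

perfect = adjusted_rand_index(truth, same_up_to_names)
imperfect = adjusted_rand_index(truth, one_mistake)
print(perfect, imperfect)
```

The first comparison scores 1.0 even though the cluster IDs are arbitrary integers, which is exactly why this metric suits clustering: only the grouping matters, not the names.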

In [None]:
from sklearn.metrics import adjusted_rand_score

adjusted_rand_score(df.Lithology, df['K means'])

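That is not a good score. For reference, 1.0 would mean the clusters match the lithology labels exactly, while values near 0.0 are what random labelling would give.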

## Clustering with DBSCAN

DBSCAN has nothing to do with databases. From [the Wikipedia article](https://en.wikipedia.org/wiki/DBSCAN):

> Density-based spatial clustering of applications with noise (DBSCAN) is [...] a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature.

In [None]:
from sklearn.cluster import DBSCAN

DBSCAN()

There are two important hyperparameters:

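- `eps`, the maximum distance between two samples for one to count as being in the neighbourhood of the other. It is not an upper bound on the distance between points in the same cluster.
- `min_samples`, the number of samples (including the point itself) that must lie in a point's neighbourhood for it to count as a 'core' point.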
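To see how the two hyperparameters interact, here is a rough NumPy sketch of the algorithm. It treats `eps` as a neighbourhood radius and `min_samples` as the neighbour count (the point itself included) needed to make a point a 'core' point, which is how scikit-learn documents them. The function name `dbscan` and the synthetic data are assumptions for this example, and the brute-force distance matrix only suits small datasets.

```python
import numpy as np

def dbscan(X, eps, min_samples):
    """Minimal DBSCAN: returns integer labels, with -1 marking noise."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighbourhood of each point: everything within eps, the point itself included.
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbours])

    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unassigned core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not extend it
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels

# Synthetic data (hypothetical): two tight groups and one isolated point.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal((0.0, 0.0), 0.1, size=(20, 2)),
    rng.normal((5.0, 5.0), 0.1, size=(20, 2)),
    [[20.0, 20.0]],
])
labels = dbscan(X, eps=1.0, min_samples=5)
print(np.unique(labels, return_counts=True))
```

Unlike k-means, you never say how many clusters to expect: the count falls out of the density settings, and points that fit nowhere get the label -1.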

In [None]:
clu = DBSCAN(eps=150, min_samples=10)

clu.fit(df[cols].values)

In [None]:
df['DBSCAN'] = clu.labels_

In [None]:
for name, group in df.groupby('DBSCAN'):
    plt.scatter(group.Vp, group.Rho_n, label=name)  # label -1 marks noise points
plt.legend()

It's a bit hard to juggle the two parameters... let's make an interactive widget with `interact` from `ipywidgets`, which re-runs a function whenever a slider moves. Applied to our problem:

In [None]:
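from ipywidgets import interact

# The tuple is (min, max, step) for the eps slider.
@interact(eps=(10, 250, 10))
def plot(eps):
    clu = DBSCAN(eps=eps)  # min_samples is left at its default here
    clu.fit(df[cols].values)
    df['DBSCAN'] = clu.labels_
    for name, group in df.groupby('DBSCAN'):
        plt.scatter(group.Vp, group.Rho_n, label=name)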

In [None]:
from sklearn.metrics import adjusted_rand_score

adjusted_rand_score(df.Lithology, df.DBSCAN)

### Exercises

- Can you make the interactive widget display the Rand score? Use `plt.text(x, y, "Text")`.
- Can you write a loop to find the value of `eps` giving the highest Rand score?
- Can you add the `min_samples` parameter to the widget?
- Explore some of [the other clustering algorithms](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster).
- Try some clustering on one of your own datasets (or use something from [sklearn](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets), e.g. `sklearn.datasets.load_iris`).