In [None]:
#| echo: false 

# We are suppressing warnings for now. Comment out the lines below if you want to see the warnings, as they tend to be informative.
import warnings
warnings.filterwarnings('ignore')

# Lab: Clustering and Ground Truth


We are going to use the Wine Quality Dataset from @cortezWineQuality2009, which you may be familiar with by now (if you are not, you can find more information about it here: <https://doi.org/10.24432/C56S3T>).

## Data Wrangling


In [None]:
import pandas as pd

df = pd.read_csv('data/wine.csv')

Look at our data.

In [None]:
df.head()

There is a column called Class label that gives us the ground truth. The wines come from three different cultivars. Knowing the actual grouping helps us to identify how well our methods can capture this ground truth.

Following our process above, we should first get a sense of our data.

In [None]:
df.describe()

No missing data. The scales of our features vary (e.g., Magnesium is in the 100s whereas Hue is in the low single digits).
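
The `count` row of `describe()` is one way to spot missing values. A more direct check is to count nulls per column and compare column ranges. The sketch below does this on a small made-up frame; the `toy` values are illustrative and are not taken from the wine data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the wine data (values are made up for illustration)
toy = pd.DataFrame({
    'Magnesium': [127.0, 100.0, np.nan, 113.0],
    'Hue': [1.04, 1.05, 1.03, 0.86],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Range (max - min) of each column, a rough indication of how much the scales differ
ranges = toy.max() - toy.min()
print(ranges)
```

On the real data, the same two calls would be applied to `df` instead of `toy`.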

How about our feature distributions?

In [None]:
df_long = df.melt(id_vars='Class label')

In [None]:
import seaborn as sns

sns.violinplot(data = df_long, x = 'variable', y = 'value')

Given the different scales, it makes sense to normalise our data.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# create a scaler object
scaler = MinMaxScaler()

# fit and transform the data
df_norm = pd.DataFrame(scaler.fit_transform(df), columns = df.columns)

df_long = df_norm.melt(id_vars='Class label')
df_long
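
Min-max scaling maps every column to the range 0 to 1 using `(x - min) / (max - min)`. The following is a minimal sketch on made-up values (the `toy` frame is an assumption for illustration) showing that the manual formula and `MinMaxScaler` agree.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame with two features on very different scales (illustrative values)
toy = pd.DataFrame({
    'Magnesium': [127.0, 100.0, 101.0, 113.0],
    'Hue': [1.04, 1.05, 1.03, 0.86],
})

# Min-max scaling by hand: (x - min) / (max - min), column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation with scikit-learn
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# Both approaches should give the same numbers
print(np.allclose(manual, scaled))
print(scaled)
```

After scaling, each column has a minimum of 0 and a maximum of 1, so no single feature dominates distance calculations purely because of its units.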

In [None]:
#create seaborn violin plot
my_plot = sns.violinplot(data = df_long, x = 'variable', y = 'value')

#rotate x-axis labels (tick_params avoids resetting the tick label text)
my_plot.tick_params(axis='x', rotation=90)

Are there any patterns?

How about a pairplot?

In [None]:
#| column: page
sns.pairplot(data = df_norm.iloc[:,1:])

Hmm, a few interesting correlations. Some of our variables are skewed. We could apply PCA here to look at fewer dimensions, or even log transform some of the skewed variables.
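
To illustrate the log transform idea, here is a hedged sketch on synthetic data (the `toy` frame and its column names are hypothetical, not wine features). It measures skewness with pandas' `skew()` and applies `np.log1p`, which is one common choice for right-skewed, non-negative values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: one right-skewed column and one roughly symmetric column
toy = pd.DataFrame({
    'skewed': rng.lognormal(mean=0.0, sigma=1.0, size=500),
    'symmetric': rng.normal(loc=0.0, scale=1.0, size=500),
})

# Sample skewness before the transform (values well above 0 mean a long right tail)
skew_before = toy['skewed'].skew()

# log1p computes log(1 + x), which is safe for values at or near zero
toy['skewed_log'] = np.log1p(toy['skewed'])
skew_after = toy['skewed_log'].skew()

print(round(skew_before, 2), round(skew_after, 2))
```

Whether a transform like this helps the clustering is an empirical question, so it is worth comparing results with and without it.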

## Cluster analysis

For now we will just run k-means clustering and then check our results against the ground truth.

### Determining the number of clusters

Let's decide how many clusters we need.

In [None]:
from sklearn.cluster import KMeans

ks = range(1, 10)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k, n_init = 10)
    
    # Fit model to samples
    model.fit(df.iloc[:,1:])
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

import matplotlib.pyplot as plt

plt.plot(ks, inertias, '-o', color='black')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

What happens if we use the normalised data instead?

In [None]:
from sklearn.cluster import KMeans

ks = range(1, 10)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k, n_init = 10)
    
    # Fit model to samples
    model.fit(df_norm.iloc[:,1:])
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

import matplotlib.pyplot as plt

plt.plot(ks, inertias, '-o', color='black')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

:::callout-warning
### Pause for thought

Both of the graphs are the same. Is that what you would expect?

:::

Three clusters seems about right (and matches the number of original labels).

In [None]:
df['Class label'].value_counts()

### Computing the clusters

In [None]:
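# Create a KMeans instance with three clusters (n_init set explicitly, as in the cells above)
k_means = KMeans(n_clusters=3, n_init=10)

# Fit the model to the 13 feature columns (column 0 is the class label)
df_k_means = k_means.fit(df.iloc[:,1:14])

# Assign each wine to a cluster; slicing 1:14 keeps this cell safe to rerun
df['Three clusters'] = pd.Series(df_k_means.predict(df.iloc[:,1:14]), index = df.index)
df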

## Clusters and Ground Truth
Now that we have created three clusters, we may ask ourselves: Do our cluster labels match our ground truth? Did our cluster model capture reality?

In [None]:
ct = pd.crosstab(df['Three clusters'], df['Class label'])
ct

It might be easier to see as a stacked plot (see [this post](https://stackoverflow.com/questions/43544694/using-pandas-crosstab-with-seaborn-stacked-barplots)).

In [None]:
import matplotlib.pyplot as plt
import numpy as np

ct.plot.bar(stacked=True)
plt.legend(title='Class label')

How has the kmeans model done compared to our ground truth?

::: callout-important

We need to be really careful here. The known class labels cannot be compared directly with the cluster labels, because the ids that the clustering algorithm assigns are arbitrary and do not follow any deterministic criterion. Each time you run the algorithm, the same group may receive a different id. **The label itself doesn't mean anything; what matters is which items end up in the same cluster and how they relate to each other.**

:::
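
Because cluster ids are arbitrary, any numerical comparison has to ignore them. The sketch below shows two options that are not part of the lab itself: the adjusted Rand index from `sklearn.metrics`, which is unchanged when ids are permuted, and a majority-vote relabelling built from a crosstab. The `truth`, `run_a` and `run_b` lists are hypothetical.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground truth and two runs that give the same grouping with different ids
truth = [1, 1, 1, 2, 2, 2, 3, 3, 3]
run_a = [0, 0, 0, 1, 1, 2, 2, 2, 2]
run_b = [2, 2, 2, 0, 0, 1, 1, 1, 1]  # same partition as run_a, ids permuted

# The adjusted Rand index only depends on which items are grouped together
ari_a = adjusted_rand_score(truth, run_a)
ari_b = adjusted_rand_score(truth, run_b)
print(ari_a == ari_b)

# Alternative: relabel each cluster with the most common true class it contains
toy_ct = pd.crosstab(pd.Series(run_a, name='cluster'), pd.Series(truth, name='label'))
cluster_to_label = toy_ct.idxmax(axis=1)
relabelled = pd.Series(run_a).map(cluster_to_label)

# Share of items whose relabelled cluster matches the true class
accuracy = (relabelled == pd.Series(truth)).mean()
print(cluster_to_label.to_dict(), accuracy)
```

The majority-vote mapping is easy to read but can assign the same class to two clusters, so it is best treated as a quick diagnostic rather than a formal score.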

### Principal Components Analysis

One way to overcome this ambiguity and evaluate the results is to visualise them and compare. This raises the question of which type of visualisation to use for looking at the clusters. An immediate option is a scatterplot, but with 13 features it is not clear which axes to plot. A common approach at this stage is to use PCA to obtain a 2D plane onto which we can project the data points and visualise them.

In [None]:
df.iloc[:,1:14]

In [None]:
from sklearn.decomposition import PCA

n_components = 2

pca = PCA(n_components=n_components)
df_pca = pca.fit(df.iloc[:,1:14])
df_pca_vals = df_pca.transform(df.iloc[:,1:14])

Grab our projections and plot along with our cluster names.

In [None]:
# First and second principal component scores for every wine
df['c1'] = df_pca_vals[:, 0]
df['c2'] = df_pca_vals[:, 1]

ax = sns.scatterplot(data = df, x = 'c1', y = 'c2', hue = 'Class label')
ax.set_title('Known labels visualised over PCs')

In the figure above, we coloured the points by their actual labels. Comparing it with the clustering results plotted below reveals several misclassifications in the algorithm's results. One may therefore choose to use an alternative algorithm or devise a better distance metric.

In [None]:
ax = sns.scatterplot(data = df, x = 'c1', y = 'c2', hue = 'Three clusters')
ax.set_title('Results of the algorithm visualised over PCs')

This shows the parallelism between the clustering algorithm and PCA. By looking at the PCA loadings, we can find out what the axes mean and try to interpret the clusters (we leave this as an additional exercise for those interested).
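
As a starting point for that exercise, here is a minimal sketch of how loadings can be read from a fitted `PCA` object via its `components_` attribute. It uses synthetic data (the `toy` frame and its columns `a`, `b`, `c` are assumptions) so that the expected pattern is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy data: 'a' and 'b' move together with high variance, 'c' is low-variance noise
base = rng.normal(size=200)
toy = pd.DataFrame({
    'a': 10 * base + rng.normal(scale=0.5, size=200),
    'b': 10 * base + rng.normal(scale=0.5, size=200),
    'c': rng.normal(scale=0.5, size=200),
})

toy_pca = PCA(n_components=2).fit(toy)

# Rows are principal components, columns are the original features
loadings = pd.DataFrame(toy_pca.components_, columns=toy.columns, index=['c1', 'c2'])
print(loadings.round(2))

# Share of the total variance captured by each component
print(toy_pca.explained_variance_ratio_.round(3))
```

Features with large absolute loadings on a component contribute most to that axis. Here `a` and `b` should dominate the first component, while `c` contributes very little.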

How might you interpret the above plots? Did the k-means model identify the ground truth?

How robust is our clustering? It may be that the k-means algorithm became stuck in a poor solution or that a few outliers have biased the clustering.

Two ways to check are:

* Running the model multiple times with different initial values.
* Removing some data and running the model multiple times.

Run the below cell a few times. What do you see?

In [None]:
# Create a KMeans instance with k clusters: model
k_means = KMeans(n_clusters=3, init='random', n_init = 10)

# Fit model to samples
df_k_means = k_means.fit(df.iloc[:,1:14])

df['Three clusters'] = pd.Series(df_k_means.predict(df.iloc[:,1:14].values), index = df.index)

ax = sns.scatterplot(data = df, x = 'c1', y = 'c2', hue = 'Three clusters')
ax.set_title('Results of the algorithm visualised over PCs')

How about with only 80% of the data?

In [None]:
df_sample = df.sample(frac=0.8, replace=False)

# Create a KMeans instance with k clusters: model
k_means = KMeans(n_clusters=3, init='random', n_init = 10)

# Fit model to samples
df_k_means = k_means.fit(df_sample.iloc[:,1:14])

df_sample['Three clusters'] = pd.Series(df_k_means.predict(df_sample.iloc[:,1:14].values), index = df_sample.index)

ax = sns.scatterplot(data = df_sample, x = 'c1', y = 'c2', hue = 'Three clusters')
ax.set_title('Results of the algorithm visualised over PCs')

We may want to automate the process of resampling the data or rerunning the model, and then plot the different inertia values or create different plots.
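
One possible way to automate this is sketched below. It runs on synthetic blobs from `make_blobs` rather than the wine data (an assumption, so that it runs without `data/wine.csv`), and it uses the adjusted Rand index to compare each run with a reference clustering. That score is a choice made for this sketch, not something the lab prescribes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the wine features
X, _ = make_blobs(n_samples=180, centers=3, n_features=5, cluster_std=1.0, random_state=0)

# Reference clustering fitted on all of the data
reference = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

rng = np.random.default_rng(0)
inertias, agreements = [], []
for seed in range(10):
    # Keep a random 80% of the rows, without replacement
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)

    # Refit with random initial centres and a different seed each time
    model = KMeans(n_clusters=3, init='random', n_init=10, random_state=seed).fit(X[idx])
    inertias.append(model.inertia_)

    # Agreement with the reference labels on the same rows, ignoring label ids
    agreements.append(adjusted_rand_score(reference.labels_[idx], model.labels_))

print(np.round(inertias, 1))
print(np.round(agreements, 3))
```

If the inertias are close to each other and the agreement scores stay near 1, the clustering is stable under resampling and re-initialisation. Stability alone does not show that the clusters match the ground truth.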

Do you think our clustering algorithm is stable and provides similar results even when some data is removed or the initial values are random?

If so, then is our algorithm capturing the ground truth?