**Section 8: Clusters**

Notebook for "Introduction to Data Science and Machine Learning"

version 1.0, June 21 2024



# 1. Preparations

`import` statements required for this notebook.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

And import some functions written for this notebook:

In [None]:
from modules.ClusterFunctions import * 

The function `plotClusters()` plots the clusters together with their centers.

In [None]:
help(plotClusters)

# 2. Introduction

In this notebook you will work with **k-means clustering**. In contrast to the algorithms discussed so far in the practical sessions (regression and several classification algorithms), **clustering** is an unsupervised learning method, i.e. our data is not labeled.

Therefore the main task of this assignment is to learn different clusters for different data sets and to observe the influence of $k$. We will also use the proposed graphical method to create an elbow plot and estimate $k$.

# 3. Cluster Analysis

First we load the data into a data frame using a `pandas` function:

In [None]:
df=pd.read_csv('data/example1_data.csv')

In order to get to know the data, we create a simple scatter plot.

In [None]:
# retrieve the column names
cols=df.columns
# plot the data
plt.plot(df[cols[0]],df[cols[1]],'.')
# label the x and y axis with the column names    
plt.xlabel(cols[0])
plt.ylabel(cols[1])

This data can obviously be separated into two clusters. To cluster the data, we first need to create an instance of the estimator `KMeans`. We specify the number of clusters, i.e. $k$, as well as a random state (so that the results stay reproducible when we repeat the algorithm with different values for $k$).

The result of the clustering depends on the initial centroids (cluster centers), which `KMeans` chooses at random. The clustering process is therefore repeated several times, and the best result in terms of *inertia* (sum of squared distances of the samples to their closest cluster center) is selected. The number of runs is defined by `n_init`. Its default value used to be 10; with `sklearn` version 1.4 the default was changed to `'auto'`. For compatibility reasons we use the old default value here.
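The idea behind `n_init` can be emulated by hand. The sketch below is not how scikit-learn implements it internally: it simply runs ten fits with a single random initialisation each and keeps the lowest inertia. The data is a synthetic stand-in created with `make_blobs`, and all parameter values are chosen for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption: not the notebook's CSV file)
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# One fit per seed, each with a single random initialisation
inertias = []
for seed in range(10):
    model = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed)
    model.fit(X)
    inertias.append(model.inertia_)

# Keeping the run with the lowest inertia mirrors the idea behind n_init=10
best_inertia = min(inertias)
print(f"worst run: {max(inertias):.2f}, best run: {best_inertia:.2f}")
```

If all ten runs report the same inertia, the initialisation had no influence on this particular data set.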

In [None]:
kmeans=KMeans(n_clusters=2,random_state=10,n_init=10)

And we apply the algorithm to our data:

In [None]:
kmeans.fit(df)
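After `fit()`, the estimator stores one cluster index per sample in `labels_`, and `predict()` assigns new points to the nearest learned centroid. A minimal sketch on synthetic data (the blob centers and the two query points are illustrative assumptions, not values from the notebook's data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with two well-separated groups (illustrative values)
X, _ = make_blobs(n_samples=100, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)

model = KMeans(n_clusters=2, random_state=10, n_init=10)
model.fit(X)

# One cluster index per sample, in the order of the rows of X
labels = model.labels_
# predict() assigns new points to the nearest learned centroid
new_labels = model.predict([[-4.0, -6.0], [6.0, 4.0]])
print(labels[:10], new_labels)
```

The cluster indices themselves are arbitrary: which group is called 0 and which is called 1 carries no meaning.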

Now we call the function `plotClusters()`:

In [None]:
plotClusters(df=df,model=kmeans)

Our function also allows plotting the cluster centers:

In [None]:
plotClusters(df=df,model=kmeans,draw_center=True)

And their annotation:

In [None]:
plotClusters(df=df,model=kmeans,draw_center=True,annotate_centers=True)

Using `save=True` and a filename you may also save the plot.

Now let's look at some information stored in the estimator. The coordinates of the cluster centers (the cluster means) are stored in `cluster_centers_`:

In [None]:
print(kmeans.cluster_centers_)

`inertia_` is the sum of squared distances of the samples to their nearest cluster center. We can try to reduce this sum. Please note that we speak of distances and not of squared errors, as this is an **unsupervised** learning method and we do not know the ground truth.

In [None]:
print(kmeans.inertia_)
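To make the definition concrete, the inertia can be recomputed by hand from the fitted centroids. The following sketch uses synthetic data from `make_blobs` (an assumption, chosen so that the example runs without the CSV file) and compares the manual value with `inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (illustrative parameters)
X, _ = make_blobs(n_samples=150, centers=2, random_state=1)
model = KMeans(n_clusters=2, random_state=10, n_init=10).fit(X)

# Squared Euclidean distance of every sample to every centroid, shape (150, 2)
squared_dist = ((X[:, None, :] - model.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
# Keep the distance to the nearest centroid and sum over all samples
manual_inertia = squared_dist.min(axis=1).sum()

print(manual_inertia, model.inertia_)
```

Both numbers should agree up to floating-point rounding.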

The overall silhouette coefficient is calculated using the `silhouette_score()` function.

In [None]:
silhouette=silhouette_score(df, kmeans.fit_predict(df))
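The overall score is the mean of the per-sample silhouette values $s(i) = (b - a) / \max(a, b)$, where $a$ is the mean distance of sample $i$ to the other members of its own cluster and $b$ is the mean distance to the members of the nearest other cluster. The sketch below checks this on synthetic data (an assumption, not the notebook's file) using `silhouette_samples()` from `sklearn.metrics`. It also reads the assignment from `labels_` instead of fitting a second time.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in data (illustrative parameters)
X, _ = make_blobs(n_samples=150, centers=2, random_state=1)
model = KMeans(n_clusters=2, random_state=10, n_init=10).fit(X)

# labels_ holds the assignment from fit(), so no second fit is needed
per_sample = silhouette_samples(X, model.labels_)
overall = silhouette_score(X, model.labels_)
print(f"min: {per_sample.min():.3f}, mean: {per_sample.mean():.3f}, overall: {overall:.3f}")
```

Values close to 1 indicate samples that sit well inside their cluster, while values near 0 or below point to samples lying between clusters.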

Now let's take a look at the result of learning three clusters:

In [None]:
kmeans2=KMeans(n_clusters=3,random_state=10,n_init=10)
kmeans2.fit(df)
plotClusters(df,kmeans2,legend=False,draw_center=True)
silhouette2=silhouette_score(df, kmeans2.fit_predict(df))

In [None]:
print('sum of squared distances (2 clusters):', kmeans.inertia_)
print('sum of squared distances (3 clusters):', kmeans2.inertia_)
print('overall silhouette score (2 clusters):',silhouette)
print('overall silhouette score (3 clusters):',silhouette2)

The sum of squared distances was reduced, but the silhouette score was reduced as well. We might also test four clusters:

In [None]:
kmeans3=KMeans(n_clusters=4,random_state=10,n_init=10)
kmeans3.fit(df)
plotClusters(df,kmeans3,legend=False,draw_center=True)
silhouette3=silhouette_score(df, kmeans3.fit_predict(df))
print('sum of squared distances (2 clusters):', kmeans.inertia_)
print('sum of squared distances (3 clusters):', kmeans2.inertia_)
print('sum of squared distances (4 clusters):', kmeans3.inertia_)
print('overall silhouette score (2 clusters):',silhouette)
print('overall silhouette score (3 clusters):',silhouette2)
print('overall silhouette score (4 clusters):',silhouette3)

The sum of squared distances should decrease again, as it generally does when $k$ grows, but the silhouette score was reduced further. We might now test all values from 1 to 6 clusters. When the number of clusters is set to 1, we get a warning, which we simply ignore. The silhouette coefficient is not defined for a single cluster, so in this case we set it to -1, the worst possible value:

In [None]:
kValues=list(range(1,7))
distances=[]
cummulatedDistancesReduction=[0]
silhouetteCoefficient=[]
for k in kValues:
    estimator=KMeans(n_clusters=k,random_state=10,n_init=10)
    estimator.fit(df)
    distances.append(estimator.inertia_)
    if k==1:
        silhouetteCoefficient.append(-1)
    else:
        silhouetteCoefficient.append(silhouette_score(df, estimator.fit_predict(df)))
for i in range(1,len(kValues)):
    cummulatedDistancesReduction.append(distances[0]-distances[i])

And now we plot the result:

In [None]:
plt.plot(kValues,cummulatedDistancesReduction,'.-')
plt.xlabel('k')
plt.ylabel('cumulative distance reduction')

This "elbow" plot shows nicely that 2 is a good value for $k$.
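The elbow can also be read from the numbers: it sits where the gain from adding one more cluster collapses. The sketch below computes the inertia for $k = 1, \dots, 6$ on synthetic two-blob data and reports the $k$ reached by the largest single step. Picking the largest step is a crude heuristic used here for illustration only, not the method of this notebook, and the blob parameters are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with two well-separated groups (illustrative values)
X, _ = make_blobs(n_samples=200, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)

k_values = list(range(1, 7))
inertias = np.array([
    KMeans(n_clusters=k, random_state=10, n_init=10).fit(X).inertia_
    for k in k_values
])

# Gain obtained by going from k to k+1 clusters
step_gain = inertias[:-1] - inertias[1:]
# Crude heuristic: report the k reached by the largest single step
k_after_largest_gain = k_values[int(np.argmax(step_gain)) + 1]
print(step_gain.round(1), k_after_largest_gain)
```

On less clearly separated data the gains shrink gradually, and the plot remains the more reliable tool.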

Now let's print the silhouette coefficients:

In [None]:
for i in range(1,7):
    print(f"{i} clusters: silhouette coefficient: {silhouetteCoefficient[i-1]:.3f}")

Based on the silhouette coefficient, 2 is also a good value for $k$, the number of clusters.
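Selecting $k$ by the silhouette coefficient amounts to taking the $k$ with the highest score. A minimal sketch on synthetic two-blob data (the data and the dictionary `scores` are illustrative assumptions), starting at $k = 2$ because the score needs at least two clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with two well-separated groups (illustrative values)
X, _ = make_blobs(n_samples=200, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least two clusters
    labels = KMeans(n_clusters=k, random_state=10, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest overall silhouette score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```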

Information: The data was artificially generated using `sklearn.datasets.make_blobs()` with two cluster centers.
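For readers who want to create similar test data themselves, a call might look like the sketch below. The sample count, the random seed and the column names `x1` and `x2` are assumptions for illustration; the parameters actually used for `example1_data.csv` are not given in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_blobs

# All parameter values and column names below are illustrative assumptions
X, y = make_blobs(n_samples=300, centers=2, n_features=2, random_state=42)

# y holds the generating blob index; a clustering exercise would discard it
df_demo = pd.DataFrame(X, columns=["x1", "x2"])
print(df_demo.shape)
```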

# 4. Exercise

Use the steps above to determine the best $k$ for k-means clustering for the data in the following files:
- example2_data.csv
- example3_data.csv
- example4_data.csv
- example5_data.csv

In [None]:
# Your code for example2_data.csv
df2=pd.read_csv('data/example2_data.csv')

In [None]:
# Your code for example3_data.csv
df3=pd.read_csv('data/example3_data.csv')

In [None]:
# Your code for example4_data.csv
df4=pd.read_csv('data/example4_data.csv')

In [None]:
# Your code for example5_data.csv
df5=pd.read_csv('data/example5_data.csv')

*End of the Notebook*

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This notebook was created by Christina B. Class for teaching at EAH Jena and is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.