# Clustering

Until now we have looked exclusively at **supervised** methods: to create a model we always had a dataset containing **features** and a **target** to predict. The goal of those methods was to make a **prediction**, i.e. given a set of new samples with the same kind of features, predict the value of another variable (either a *continuous* variable as in regression or a *categorical* variable as in classification).

Now we will look at **clustering**, which is an **unsupervised** learning task. In unsupervised learning, we do not have access to any targets. Instead, we try to learn patterns within the features that we do have access to. Samples with similar patterns can then be grouped together into clusters. The idea is that the samples within a cluster should be very similar to each other, and ideally very different from samples in other clusters.

Because we do not have targets anymore, the scikit-learn algorithms for unsupervised learning do not expect you to pass a `y` argument. Instead, we only pass the features `X` to the specific clustering model during fitting.

## Clustering methods

There are many different algorithms that can be used to find clusters, but generally the idea is to find sub-groups in our dataset where data points are close together according to some metric. We will first look at some artificial data to better understand clustering. For this we use a scikit-learn function that creates blobs of data:

In [None]:
from sklearn.datasets import make_blobs
import pandas as pd

In [None]:
blobs, labels = make_blobs(n_samples=30, n_features=2, centers=3, random_state=42)

blobs = pd.DataFrame(blobs, columns=['feature1', 'feature2'])
blobs['label'] = labels

blobs.head(5)

Above, we generated a DataFrame containing two features and a label. The label tells us which cluster a sample belongs to. This is just for demonstration purposes; in a real clustering problem we would not have access to such information in advance. The clustering algorithm would instead have to find this cluster membership for each sample in our dataset. Since the samples are 2D, let us plot them in a scatterplot:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

sns.set_theme(style="whitegrid", palette="Set2")

In [None]:
sns.scatterplot(data=blobs, x='feature1', y='feature2');

## *k*-Means Clustering

In this toy example, we can clearly see that there are three different groups of samples. Now let us try to discover these three clusters using a clustering algorithm. We will first look at *k*-Means, which is a rather simple clustering algorithm.

The algorithm works as follows:

#### 1. Decide in advance how many clusters you expect in the dataset. This is what the *k* in *k*-Means refers to. In our case, we set $k=3$.

#### 2. For each of these *k* clusters, we randomly initialize a cluster center. In this example, we choose the initial cluster centers manually.

In [None]:
cluster_centers = np.array([[-1, -10], [-4, 0], [-1.5,-2]])

sns.scatterplot(data=blobs, x='feature1', y='feature2')
plt.plot(cluster_centers[:,0], cluster_centers[:,1], 'rD');
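
The centers above were picked by hand. As a minimal sketch of what a random initialization could look like, we can draw *k* distinct samples and use them as starting centers. This is only one simple strategy and not necessarily what a particular library does. The array `points` is stand-in data, and the names `rng`, `init_idx` and `init_centers` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility

# Stand-in data: any array of shape (n_samples, n_features) works here,
# e.g. blobs[['feature1', 'feature2']].values from the cells above
points = rng.normal(size=(30, 2))

k = 3

# Pick k distinct row indices and use those samples as the initial centers
init_idx = rng.choice(len(points), size=k, replace=False)
init_centers = points[init_idx]

print(init_centers.shape)  # (3, 2): one row per cluster, one column per feature
```

Starting from actual samples guarantees that every initial center lies inside the range of the data.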

#### 3. For each sample in our dataset (green points), we compute the distance to the cluster centers (red diamonds) and assign each sample to the closest cluster.


Below you can see how you can manually compute this distance from each sample to all of the three cluster centers in our toy example.

In [None]:
distances = []
for center in cluster_centers:
    # Calculate the distance from each point to this specific cluster center
    distance = np.linalg.norm(blobs[['feature1', 'feature2']].values - center, axis=1)
    distances.append(distance)

# Convert the list of distances (list of numpy arrays) to a 2D numpy array
distances = np.stack(distances, axis=1)

# Show the array, each row corresponds to a point, and each column corresponds to the distance to a cluster center
distances

You can think of the clusters as being labeled with an id between $0$ and $k-1$. For each sample in our dataset, we want to find the id of the cluster center that is closest to that sample. We could apply the `np.min` function across the columns to find the minimum distance in each row. However, this would only give us the distance and not the id of the cluster. Fortunately, there is another function that we can use: `np.argmin`. It also searches for the minimum, but instead of returning the minimum value it returns the index at which the minimum was found. We use that function below to get the cluster assignments:

In [None]:
min_dist = np.argmin(distances, axis=1)  # axis = 1 because we want to find the minimum distance across the columns
min_dist
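
To make the difference between `np.min` and `np.argmin` concrete, here is a small sketch on a hand-made array. The values in `toy_distances` are hypothetical and unrelated to our blobs.

```python
import numpy as np

# Hypothetical distances: 3 samples (rows) x 3 cluster centers (columns)
toy_distances = np.array([
    [2.0, 0.5, 3.0],
    [0.1, 4.0, 1.0],
    [5.0, 2.5, 0.3],
])

row_min = np.min(toy_distances, axis=1)        # smallest distance in each row
row_argmin = np.argmin(toy_distances, axis=1)  # column index of that distance

print(row_min)     # [0.5 0.1 0.3]
print(row_argmin)  # [1 0 2]
```

The second result is what we need: it says that the first sample is closest to center 1, the second to center 0, and the third to center 2.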

Now we can assign these cluster labels to our DataFrame as a new column:

In [None]:
blobs['cluster_label'] = min_dist

Below, we plot the actual labels that we got from the `make_blobs` function on the left, and the labels that we just inferred from the distances to the cluster centers on the right.

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10,5))
sns.scatterplot(data=blobs, x='feature1', y='feature2', hue='label', palette='Set2', ax=ax[0])
sns.scatterplot(data=blobs, x='feature1', y='feature2', hue='cluster_label', palette='Set2', ax=ax[1]);

After the first iteration, we can see that in the cluster on the lower left there are some points that are assigned to the wrong cluster. Let us continue with the next step.

#### 4. Now we re-compute the cluster centers. To do so, we select the samples that are assigned to a specific cluster, and then we compute the average feature values. This will give us a new cluster center:

In [None]:
cluster_centers = []
for label in sorted(blobs.cluster_label.unique()):  # sorted so that row i of cluster_centers belongs to cluster id i
    # Calculate the mean of the points in this cluster
    center = blobs[blobs.cluster_label == label][['feature1', 'feature2']].mean().values
    cluster_centers.append(center)


cluster_centers = np.stack(cluster_centers, axis=0)  # axis = 0 because we want to stack along the first dimension (rows)

print(f"Shape of cluster_centers (num_clusters, num_features): {cluster_centers.shape}")

In [None]:
sns.scatterplot(data=blobs, x='feature1', y='feature2')
plt.plot(cluster_centers[:,0], cluster_centers[:,1], 'rD');

Because we still had some samples that were not assigned to their actual cluster, the new cluster centers that we computed are not entirely centered yet. But if we keep on performing the same steps as before, this will change.

#### 5. We can repeat the same operation as before: we assign points to the closest cluster center and update the position of the cluster centers.

In [None]:
# Here we do the same as before, but this time with list comprehensions
distances = np.stack([np.linalg.norm(blobs[['feature1', 'feature2']].values - center, axis=1) for center in cluster_centers], axis=1)
min_dist = np.argmin(distances, axis=1)
blobs['cluster_label'] = min_dist
# sorted() keeps row i of cluster_centers aligned with cluster id i
cluster_centers = np.stack([blobs[blobs.cluster_label == label][['feature1', 'feature2']].mean().values for label in sorted(blobs.cluster_label.unique())], axis=0)

fig, ax = plt.subplots(1,2, figsize=(10,5))
sns.scatterplot(data=blobs, x='feature1', y='feature2', hue='label', palette='Set2', ax=ax[0])
sns.scatterplot(data=blobs, x='feature1', y='feature2', hue='cluster_label', palette='Set2', ax=ax[1])
plt.plot(cluster_centers[:,0], cluster_centers[:,1], 'rD');

We successfully detected the clusters already in the second iteration! In practice, the algorithm usually needs more iterations, as the data is rarely as clearly separated as in this toy example. You can imagine that some things can go wrong. For example, if the initial (random) cluster centers are poorly placed, two cluster centers might end up sharing a single group of points while another center has to cover two groups. However, many libraries try to avoid that by choosing initial cluster centers that are spread out.
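
The manual steps above can be put together into a single loop. The following is a minimal sketch and not the implementation that any library uses. The function name `kmeans_sketch`, the stopping rule, the handling of empty clusters and the toy data `X_toy` are all assumptions made for illustration.

```python
import numpy as np

def kmeans_sketch(X, init_centers, max_iter=100):
    """Plain k-means: alternate assignment and update until the centers stop moving."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(max_iter):
        # Assignment step: distance of every sample to every center, shape (n_samples, k)
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)

        # Update step: each center moves to the mean of its assigned samples
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous center
                new_centers[j] = members.mean(axis=0)

        if np.allclose(new_centers, centers):  # converged: the centers did not move
            break
        centers = new_centers
    return labels, centers

# Hypothetical, clearly separated toy data: two groups of three points each
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = kmeans_sketch(X_toy, init_centers=[[0.0, 5.0], [5.0, 10.0]])

print(labels)  # [0 0 0 1 1 1]
print(centers)
```

On this toy data the loop stops after the second pass, because the assignments, and therefore the centers, no longer change.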

## Application in scikit-learn

Now that you understand how the algorithm works, let us have a look at how we can use it out-of-the-box with scikit-learn. We start by importing `KMeans` from `sklearn.cluster`:

In [None]:
from sklearn.cluster import KMeans

Then we instantiate our model with the required parameters. We need to specify the number of clusters that we expect:

In [None]:
kmeans_model = KMeans(n_clusters=3)

Then we fit our model. Since this is an unsupervised learning algorithm and we do not have targets, we only need to specify the features ```X```:

In [None]:
X = blobs[['feature1', 'feature2']]

In [None]:
kmeans_model.fit(X)

The cluster centers that the algorithm found are stored under the `cluster_centers_` attribute:

In [None]:
kmeans_model.cluster_centers_

In [None]:
# We can also get the cluster assignments for each sample in our dataset
kmeans_model.labels_

Let us see if they correspond to our expectations:

In [None]:
fig, ax = plt.subplots()
sns.scatterplot(data=blobs, x='feature1', y='feature2', hue=kmeans_model.labels_, palette='Set2', ax=ax)
plt.plot(kmeans_model.cluster_centers_[:,0], kmeans_model.cluster_centers_[:,1], 'rD');

### Comparing with Real Labels

In practice, clustering is used without labels. But when testing on labeled data, we can compare the results to the true classes.

Keep in mind that cluster labels are arbitrary: k-Means might label the "green" group as 2 instead of 0. So even with perfect clustering, the label numbers may not match.

To fairly evaluate, you can:
- Match clusters to true labels using majority vote, or  
- Use metrics like Adjusted Rand Index that ignore label order.

This gets harder when clusters overlap or contain outliers.
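
Both options can be sketched on a small example. The arrays `true_labels` and `cluster_ids` below are hypothetical: they describe exactly the same grouping, only with permuted ids. We use `adjusted_rand_score` from `sklearn.metrics` for the Adjusted Rand Index, and the majority vote is written by hand as one possible way of doing it.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical example: the grouping is identical, only the ids are permuted
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
cluster_ids = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])

# Naive accuracy is misleading because the ids do not line up
naive_accuracy = (true_labels == cluster_ids).mean()
print(naive_accuracy)  # 0.0

# The Adjusted Rand Index only looks at which samples are grouped together
ari = adjusted_rand_score(true_labels, cluster_ids)
print(ari)  # 1.0

# Majority vote: map each cluster id to the most frequent true label inside it
mapping = {}
for cluster in np.unique(cluster_ids):
    members = true_labels[cluster_ids == cluster]
    mapping[int(cluster)] = int(np.bincount(members).argmax())

remapped = np.array([mapping[int(c)] for c in cluster_ids])
mapped_accuracy = (true_labels == remapped).mean()
print(mapped_accuracy)  # 1.0
```

Note that the majority vote can map two clusters to the same true label when the clustering is poor, so the resulting accuracy should be read with care.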


## Real case

Let us now look at a real dataset. The following table contains information about the size (Lb - "cell **l**ength at **b**irth") and growth rate of bacteria growing in two different conditions:

In [None]:
bacteria = pd.read_csv('https://raw.githubusercontent.com/digital-sustainability/SAI3-2025/refs/heads/main/datasets/coli.csv')
bacteria.head()

In [None]:
sns.scatterplot(data=bacteria, x='Lb', y='growth_rate', hue='condition', palette='Set2');

Again, we happen to know the two conditions/labels in this example. But you might imagine an experiment where you have different populations of cells measured at the same time and wish to identify different groups. Let us initialize a clustering model with two clusters: 

In [None]:
X = bacteria[['Lb', 'growth_rate']]

kmeans_model = KMeans(n_clusters=2)
kmeans_model.fit(X=X)

Now we will check the predicted cluster assignments and compare them to the actual conditions:

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10,5))
sns.scatterplot(data=bacteria, x='Lb', y='growth_rate', hue='condition', palette='Set2', ax=ax[0])
sns.scatterplot(data=bacteria, x='Lb', y='growth_rate', hue=kmeans_model.labels_, palette='Set2', ax=ax[1])
plt.plot(kmeans_model.cluster_centers_[:,0], kmeans_model.cluster_centers_[:,1], 'rD');

The algorithm identified two clusters, but they do not match the actual conditions or what we would expect from the scatterplot. This mismatch is due to a familiar issue: the two features are on very different scales.

## Feature scaling

In the case of *k*-Means clustering the reason is particularly evident: since we measure distances between points, a feature with a much larger scale will dominate the clustering, i.e. the data will mainly be partitioned along one axis, as is the case here with the length ```Lb```. Let us check whether scaling our features helps to obtain a better clustering. Again we use the ```preprocessing``` module:

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
std_scaler = StandardScaler()
X_scaled = std_scaler.fit_transform(X)
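
To see numerically why the scale matters, here is a small sketch with hypothetical measurements in `X_demo` (not the bacteria data). We standardize by hand, subtracting the column mean and dividing by the column standard deviation, which mirrors what `StandardScaler` does with its default settings. The helper `dist` is only defined for this illustration.

```python
import numpy as np

# Hypothetical measurements: feature 0 spans thousands, feature 1 spans fractions
X_demo = np.array([
    [1000.0, 0.1],
    [1010.0, 0.9],   # close to sample 0 in feature 0, far away in feature 1
    [2000.0, 0.1],   # far from sample 0 in feature 0, identical in feature 1
    [2010.0, 0.9],
])

def dist(a, b):
    """Euclidean distance between two samples."""
    return np.linalg.norm(a - b)

# Raw distances from sample 0: feature 0 decides almost everything
raw_01 = dist(X_demo[0], X_demo[1])
raw_02 = dist(X_demo[0], X_demo[2])
print(raw_01, raw_02)

# Standardize: subtract the column mean and divide by the column standard deviation
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# After scaling, a difference in feature 1 counts about as much as one in feature 0
std_01 = dist(X_std[0], X_std[1])
std_02 = dist(X_std[0], X_std[2])
print(std_01, std_02)
```

In the raw space sample 1 looks roughly a hundred times closer to sample 0 than sample 2 does, while after standardization the two distances are nearly equal.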

In [None]:
kmeans_model = KMeans(n_clusters=2)
kmeans_model.fit(X=X_scaled)

Note that when we want to plot things in the original (unscaled) space, we need to *reverse* the scaling. You can use the `inverse_transform` method of the scaler for this. Below we show the cluster centers first in the scaled space and then transformed back to the original space:

In [None]:
kmeans_model.cluster_centers_

In [None]:
std_scaler.inverse_transform(kmeans_model.cluster_centers_)
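
As a quick sanity check of what `inverse_transform` does, the sketch below uses hypothetical data in `X_demo` and compares the result with undoing the scaling by hand via the fitted `mean_` and `scale_` attributes of the scaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data with two features on different scales
X_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])

scaler = StandardScaler()
X_demo_scaled = scaler.fit_transform(X_demo)

# Undo the scaling by hand: x = z * scale_ + mean_
manual = X_demo_scaled * scaler.scale_ + scaler.mean_
restored = scaler.inverse_transform(X_demo_scaled)

print(np.allclose(restored, X_demo))  # True
print(np.allclose(manual, restored))  # True
```

The same transformation applies to any point in the scaled space, which is why it can be used on the cluster centers.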

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10,5))
sns.scatterplot(data=bacteria, x='Lb', y='growth_rate', hue='condition', palette='Set2', ax=ax[0])
sns.scatterplot(data=bacteria, x='Lb', y='growth_rate', hue=kmeans_model.labels_, palette='Set2', ax=ax[1])
centers_unscaled = std_scaler.inverse_transform(kmeans_model.cluster_centers_)  # back to the original units
plt.plot(centers_unscaled[:,0], centers_unscaled[:,1], 'rD');

We see that the scaling fixes the problem! Some points near the boundary between the two groups are still assigned to the wrong cluster, which is to be expected when the groups overlap.

## Other methods

As mentioned above, there are many ways to detect clusters in a dataset. For illustration, we show two alternatives here, Mean Shift and DBSCAN, which are capable of finding clusters in smooth distributions and can determine the number of clusters on their own, i.e. we do not have to provide an ```n_clusters``` argument.

In [None]:
from sklearn.cluster import MeanShift, DBSCAN

In [None]:
ms_model = MeanShift()
ms_model.fit(X=X_scaled)

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10,5))
sns.scatterplot(data=bacteria, x='Lb', y='growth_rate', hue='condition', palette='Set2', ax=ax[0])
sns.scatterplot(data=bacteria, x='Lb', y='growth_rate', hue=ms_model.labels_, palette='Set2', ax=ax[1])
ms_centers_unscaled = std_scaler.inverse_transform(ms_model.cluster_centers_)  # back to the original units
plt.plot(ms_centers_unscaled[:,0], ms_centers_unscaled[:,1], 'rD');

The algorithm performs well overall but creates extra clusters for isolated points. These outliers can be removed in a post-processing step.
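
One possible way to do such a post-processing step is to count the samples per cluster and drop clusters that are too small. The sketch below uses a hypothetical label array `labels_demo` in place of `ms_model.labels_`, and the threshold `min_cluster_size` is an assumption that has to be chosen for the data at hand.

```python
import numpy as np

# Hypothetical cluster assignments, standing in for ms_model.labels_
labels_demo = np.array([0, 0, 0, 0, 1, 1, 1, 2, 0, 1, 3])

min_cluster_size = 2  # assumed threshold; choose it to suit your data

# Count how many samples each cluster id received
sizes = np.bincount(labels_demo)
print(sizes)  # [5 4 1 1]

# Find the clusters that are too small and keep only samples outside of them
small_clusters = np.flatnonzero(sizes < min_cluster_size)
keep = ~np.isin(labels_demo, small_clusters)

print(small_clusters)     # [2 3]
print(labels_demo[keep])  # [0 0 0 0 1 1 1 0 1]
```

The boolean mask `keep` can be used to filter the feature table in the same way, so that features and labels stay aligned.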

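Next, we try DBSCAN, a method that identifies dense regions and expands clusters from them. A key parameter is `eps`, which defines the maximum distance between two samples for them to be considered neighbors. Samples that do not belong to any dense region are treated as noise and receive the label `-1`. You can experiment with different values of `eps`: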

In [None]:
db_model = DBSCAN(eps=0.15)
db_model.fit(X=X_scaled)
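
To get a feeling for how DBSCAN labels its results, here is a small sketch on hypothetical toy data (`X_demo`) with two tight groups and one isolated point. The values of `eps` and `min_samples` are chosen to fit this toy data and are not recommendations for the bacteria dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups and one isolated point
group_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
group_b = group_a + 5.0
outlier = np.array([[10.0, -10.0]])
X_demo = np.vstack([group_a, group_b, outlier])

# eps: neighborhood radius
# min_samples: samples (including the point itself) needed within eps for a dense region
demo_model = DBSCAN(eps=0.5, min_samples=3).fit(X_demo)
labels_demo = demo_model.labels_
print(labels_demo.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, -1]

# Samples labeled -1 are noise and do not belong to any cluster
n_clusters = len(set(labels_demo.tolist())) - (1 if -1 in labels_demo else 0)
print(n_clusters)  # 2
```

Because `-1` is not a real cluster, it should be excluded when counting clusters or computing cluster statistics.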

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10,5))
sns.scatterplot(data=bacteria, x='Lb', y='growth_rate', hue='condition', palette='Set2', ax=ax[0])
sns.scatterplot(data=bacteria, x='Lb', y='growth_rate', hue=db_model.labels_, palette='Set2', ax=ax[1])
ax[1].legend(ncols=4);

We see that the method is also capable of finding relevant clusters, but with fairly different results from the previous solutions. It is difficult to tell in advance which clustering methods and hyperparameters are optimal for a problem, and often some trial and error is necessary.

## Exercise
1. Import the ```movement.csv``` dataset.
2. Visualize ```z_acc``` and ```y_acc``` in a scatter plot and color the points by ```move_type```.
3. Try to cluster the data with these features. Try KMeans and DBSCAN. Do you achieve a good clustering? Try out different `eps` values.

### Bonus
4. Use the following code snippet to load the moons dataset:
   
   ```python
    from sklearn.datasets import make_moons
    moons, labels = make_moons(n_samples=300, noise=0.1)
   ```
5. Cluster the moons dataset with DBSCAN and visualize it with two scatter plots in the same row: on the left with the true labels returned by `make_moons`, on the right with the learned labels.

    - Try varying `eps` and `min_samples` and see what happens
    - What happens when you use more noise?