# Lab 3: Hierarchical and density-based clustering

<a target="_blank" href="https://colab.research.google.com/github/drchadvidden/courseMaterials/blob/main/UnsupervisedLearning/Labs/Lab%202/Lab_2.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Lab Instructions

Run each of the coding cells. For tutorial example cells, understand the commands and check that the outputs make sense. For exercise cells, write your own code where indicated to generate the correct output. Give text explanations where indicated.

### Submission:
Complete the following notebook in order. Once done, save the notebook, print the file as a .pdf, and upload the resulting file to the Canvas course assignment.

### Rubric:
15 total points: 5 points for running the tutorial example cells and saving outputs, 10 points for completing the exercises.

### Deadline:
Tuesday at midnight after the lab is assigned.

# Tutorial: Hierarchical Clustering

## Hierarchical Clustering of Penguins

In this tutorial, we perform hierarchical clustering on the penguins dataset using several numeric measurements: bill length, bill depth, flipper length, and body mass.

Hierarchical clustering creates a tree-like structure (dendrogram) that shows how observations are merged into clusters step by step. The y-axis of the dendrogram represents the distance at which clusters are fused, and the leaves correspond to individual observations.

Before clustering, we standardize the features so that differences in scale (e.g., body mass vs. bill length) do not dominate the clustering. We then compute the linkage using the complete linkage method, which defines the distance between clusters as the maximum distance between points in the two clusters. Finally, we plot the dendrogram to visualize the hierarchical structure of the penguins data.
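
As a small illustration of the complete linkage definition, the sketch below computes the distance between two made-up clusters of 2D points. The arrays `cluster_a` and `cluster_b` are hypothetical toy values, not penguin data, and single linkage (the minimum) is shown only for contrast.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2D points (toy values, not penguin data)
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between points of the two clusters
pairwise = cdist(cluster_a, cluster_b)

complete_dist = pairwise.max()  # complete linkage: farthest pair
single_dist = pairwise.min()    # single linkage: closest pair

print(f"Complete linkage distance: {complete_dist}")
print(f"Single linkage distance:   {single_dist}")
```

The farthest pair is (0, 0) and (6, 0), so the complete linkage distance is 6, while the closest pair is (1, 0) and (4, 0), giving a single linkage distance of 3.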

Below is the Python code to perform these steps.

In [None]:
# --- Imports ---
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler

# --- Load dataset, more details here https://seaborn.pydata.org/generated/seaborn.load_dataset.html ---
penguins = sns.load_dataset('penguins')

# --- Select numeric features and drop missing values ---
numeric_cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
data = penguins[numeric_cols].dropna()

# --- Standardize features ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data)

# --- Perform hierarchical clustering ---
Z = linkage(X_scaled, method='complete')  # method can be 'single', 'complete', 'average', 'ward'

# --- Plot dendrogram ---
plt.figure(figsize=(10, 5))
# Pass labels as a plain list; some scipy versions reject a pandas Index here
dendrogram(Z, labels=data.index.tolist(), color_threshold=5)
plt.title("Hierarchical Clustering Dendrogram (Penguins)")
plt.xlabel("Observation Index")
plt.ylabel("Distance")
plt.show()


## How the Dendrogram Cut Works

The dendrogram plotted above shows the full hierarchical clustering, from individual observations at the leaves to a single cluster at the top.

Although we did not explicitly cut the tree, scipy's dendrogram() function colors branches to make clusters easier to see. The argument color_threshold=5 sets the height used for coloring; try changing it in the command above.

Each group of branches that merges below this threshold gets its own color, and links above the threshold share a single default color. In the penguins dendrogram, this produces three main colored branches, giving a visual suggestion of three clusters.

Important: This coloring is purely for visualization. The dendrogram itself still represents the full hierarchy; no cluster labels have been assigned yet.
To actually assign clusters, we can use scipy.cluster.hierarchy.fcluster() with either a distance threshold or a target number of clusters.
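
The two ways of calling fcluster() can be sketched on a toy example. The points below are made up (three tight groups placed far apart), and the threshold t=5 is an assumption chosen for this toy data only; for the penguins you would read a suitable height off the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: three tight groups of four points, far apart
square = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
X_toy = np.vstack([square, square + [10.0, 0.0], square + [0.0, 10.0]])

Z_toy = linkage(X_toy, method='complete')

# Option 1: ask for at most 3 clusters
labels_by_count = fcluster(Z_toy, t=3, criterion='maxclust')

# Option 2: cut the tree at height 5 (merges above 5 are not applied)
labels_by_height = fcluster(Z_toy, t=5, criterion='distance')

print(labels_by_count)
print(labels_by_height)
```

On this toy data both calls recover the same three groups, because every within-group merge happens below height 5 and every between-group merge happens above it.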

Next, choose the dendrogram cut location visually from the plot above and assign cluster memberships to the data. The result can be visualized in 2D for any pair of numeric features.

In [None]:
from scipy.cluster.hierarchy import fcluster
import matplotlib.pyplot as plt

# --- Assign clusters from linkage matrix Z ---
# Here we choose 3 clusters
clusters = fcluster(Z, 3, criterion='maxclust')

# Add cluster labels to the data
data_plot = data.copy()
data_plot['cluster'] = clusters

# --- Plot clusters: flipper_length_mm vs bill_length_mm ---
plt.figure(figsize=(8,6))
for c in sorted(data_plot['cluster'].unique()):
    subset = data_plot[data_plot['cluster'] == c]
    plt.scatter(subset['flipper_length_mm'], subset['bill_length_mm'], label=f'Cluster {c}', s=50)

plt.xlabel('Flipper Length (mm)')
plt.ylabel('Bill Length (mm)')
plt.title('Hierarchical Clustering of Penguins (3 Clusters)')
plt.legend()
plt.grid(True)
plt.show()


## Cluster assignment vs penguin species

Because the dataset includes each penguin's species, we can compare our clusters to a ground truth, as below. Why might some Chinstrap penguins have been clustered with Adelie penguins?

In [None]:
# --- Add cluster labels and species ---
data_plot = data.copy()
data_plot['cluster'] = clusters
data_plot['species'] = penguins.loc[data.index, 'species']

# --- Confusion matrix: cluster number vs species ---
confusion_matrix = pd.crosstab(data_plot['cluster'], data_plot['species'])
print(confusion_matrix)


## Evaluating the Dendrogram with Cophenetic Correlation

The cophenetic correlation coefficient (CCC) measures how well a dendrogram preserves the original pairwise distances between observations.

A higher CCC (closer to 1) indicates that the dendrogram accurately reflects the true distances in the data.

A lower CCC suggests that the hierarchical clustering distorts the distances more.

This metric is useful for assessing different linkage methods or confirming that the dendrogram is a faithful representation of the data before deciding where to cut it.
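
To make the idea concrete, here is a minimal sketch on three made-up points on a line (at 0, 1, and 5). The cophenetic distance between two points is the height at which they are first joined in the dendrogram, and the CCC is the correlation between these heights and the original distances. The toy points are an assumption for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Hypothetical toy data: three points on a line
X_toy = np.array([[0.0], [1.0], [5.0]])

original = pdist(X_toy)                  # pairs (0,1), (0,2), (1,2) -> 1, 5, 4
Z_toy = linkage(X_toy, method='complete')

# cophenet returns the CCC and the cophenetic distances (same pair order)
ccc, coph = cophenet(Z_toy, original)

print("Original distances:  ", original)
print("Cophenetic distances:", coph)     # heights where each pair first merges
print(f"CCC: {ccc:.3f}")
```

Here the points at 1 and 5 are 4 apart in the data but are only joined at height 5, so the dendrogram distorts that distance slightly and the CCC falls just below 1.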

Below is how to compute the CCC for our penguins hierarchical clustering.

In [None]:
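from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

# Compare the cophenetic distances implied by Z with the original pairwise distances
c, coph_dists = cophenet(Z, pdist(X_scaled))
print(f"Cophenetic Correlation Coefficient: {c:.3f}")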

# Tutorial: DBSCAN

## Density-Based Clustering (DBSCAN) with Penguins

In this tutorial, we will use DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to cluster penguin observations based on numeric features: bill length, bill depth, flipper length, and body mass.

DBSCAN is a density-based clustering algorithm. Its key properties:

- Groups together points that are densely packed.

- Labels points in sparse regions as noise (outliers).

- Can detect clusters of arbitrary shape, unlike k-means or hierarchical clustering with complete or Ward linkage, which favor compact clusters.
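
The last property can be illustrated with a short sketch on synthetic data. It uses scikit-learn's make_moons to generate two interleaved crescents, which are not part of the penguins lab; the values eps=0.2 and min_samples=5 are assumptions picked for this toy data.

```python
import pandas as pd
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Synthetic data: two interleaved crescent shapes (toy data, not penguins)
X_moons, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

# k-means with 2 clusters assigns each point to the nearest centroid
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# DBSCAN follows chains of densely packed points instead
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

# Rows are cluster labels, columns are the true crescent each point came from
print(pd.crosstab(km_labels, y_true, rownames=['k-means'], colnames=['true']))
print(pd.crosstab(db_labels, y_true, rownames=['DBSCAN'], colnames=['true']))
```

Comparing the two tables shows how each method divides the crescents; a DBSCAN row labeled -1, if present, counts noise points.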

We will standardize the features, fit DBSCAN, and visualize the clusters using two key numeric features: flipper length and bill length. Finally, we’ll compare the DBSCAN clusters to the actual penguin species.

To fit DBSCAN, we choose two key parameters:

- eps: maximum distance between two points to be considered neighbors

- min_samples: minimum number of points in a neighborhood for a point to be considered a core point
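
As a minimal sketch of how eps and min_samples interact, the toy example below runs DBSCAN on six made-up points on a line. The points and parameter values are assumptions chosen so that all three point types appear; note that scikit-learn counts a point itself toward min_samples.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 1D toy data: a dense group, one straggler, one far-away point
X_toy = np.array([[0.0], [0.1], [0.2], [0.3], [0.7], [3.0]])

db_toy = DBSCAN(eps=0.45, min_samples=4).fit(X_toy)
labels_toy = db_toy.labels_

# Core points: at least min_samples points (including itself) within eps
is_core = np.zeros(len(X_toy), dtype=bool)
is_core[db_toy.core_sample_indices_] = True

# Border points: in a cluster but not core; noise points: label -1
is_noise = labels_toy == -1
is_border = ~is_core & ~is_noise

for x, lab, core, border in zip(X_toy[:, 0], labels_toy, is_core, is_border):
    kind = 'core' if core else ('border' if border else 'noise')
    print(f"x = {x:.1f}  label = {lab:2d}  {kind}")
```

The four points from 0.0 to 0.3 each have four points within eps and are core points. The point at 0.7 has only two (itself and 0.3), so it is not core, but it lies within eps of a core point and joins the cluster as a border point. The point at 3.0 has no neighbors and is labeled noise.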

In [None]:
# --- Imports ---
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# --- Load dataset ---
penguins = sns.load_dataset('penguins')

# --- Select numeric features and drop missing values ---
numeric_cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
data = penguins[numeric_cols].dropna()

# --- Standardize features ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data)

# --- Fit DBSCAN ---
db = DBSCAN(eps=0.8, min_samples=5)  # tweak eps for better clusters
db.fit(X_scaled)

# --- Cluster labels ---
labels = db.labels_
data_plot = data.copy()
data_plot['cluster'] = labels

# Number of clusters (excluding noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Estimated number of clusters: {n_clusters}")
print(f"Number of noise points: {n_noise}")


## Visualizing DBSCAN Clusters

We will plot flipper length (x-axis) vs bill length (y-axis) and color points by their DBSCAN cluster. Noise points (-1) will be colored gray.

In [None]:
plt.figure(figsize=(8,6))

# Assign colors: noise (-1) is gray; the other colors cycle if there are many clusters
colors = ['red', 'blue', 'green', 'orange', 'purple', 'cyan']
for cluster_id in sorted(set(labels)):
    subset = data_plot[data_plot['cluster'] == cluster_id]
    color = 'gray' if cluster_id == -1 else colors[cluster_id % len(colors)]
    label = 'Noise' if cluster_id == -1 else f'Cluster {cluster_id}'
    plt.scatter(subset['flipper_length_mm'], subset['bill_length_mm'],
                label=label, s=50, alpha=0.7, color=color)

plt.xlabel('Flipper Length (mm)')
plt.ylabel('Bill Length (mm)')
plt.title('DBSCAN Clustering of Penguins')
plt.legend()
plt.grid(True)
plt.show()


Again, we can compare the cluster assignments to the true species labels. You can tweak eps and min_samples to see how the cluster assignments change.

In [None]:
# --- Add species names ---
data_plot['species'] = penguins.loc[data.index, 'species']

# --- Confusion matrix: cluster vs species ---
confusion_matrix = pd.crosstab(data_plot['cluster'], data_plot['species'])
print(confusion_matrix)

## Choosing DBSCAN Parameters (eps and min_samples)

DBSCAN requires two parameters:

- eps — the maximum distance between points to be considered neighbors.

- min_samples — the minimum number of points in a neighborhood for a point to be considered a core point.

To select a good eps:

- Set min_samples based on the dataset size and dimensionality (a common rule of thumb is number of features + 1).

- Compute the distance to the k-th nearest neighbor for each point (k = min_samples).

- Plot these distances in ascending order (k-distance graph) and look for a “knee” or sharp bend — this indicates a natural choice for eps.

Once eps and min_samples are chosen, DBSCAN can automatically identify clusters and label outliers (noise).

In [None]:
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt

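k = 5  # same as min_samples
neighbors = NearestNeighbors(n_neighbors=k)
neighbors_fit = neighbors.fit(X_scaled)
# Querying the training data returns each point as its own nearest neighbor (distance 0),
# which matches DBSCAN's min_samples, since that count also includes the point itself
distances, indices = neighbors_fit.kneighbors(X_scaled)

# Column k-1 holds the distance to the k-th neighbor in this count (self included)
k_distances = np.sort(distances[:, k-1])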

plt.figure(figsize=(8,5))
plt.plot(k_distances)
plt.ylabel(f"{k}-th Nearest Neighbor Distance")
plt.xlabel("Points sorted by distance")
plt.title("k-distance Graph for DBSCAN")
plt.grid(True)
plt.show()


# Exercise(s): Hierarchical and density-based clustering





## Exercise 1: Exploring Different Linkage Methods

In this exercise, you will redo the hierarchical clustering on the penguins dataset using linkage methods other than complete linkage, and explore how the choice of linkage affects the clusters.

### Tasks:

1. Using the same numeric features (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g), standardize the data as in the tutorial.

2. Perform hierarchical clustering using the following linkage methods:
- Single linkage
- Average linkage
- Ward linkage

3. For each linkage method:
- Plot the dendrogram.
- Assign clusters using 3 clusters (as in the tutorial) with fcluster().
- Plot the clusters using flipper length (x-axis) and bill length (y-axis).

4. For each linkage method:

- Identify the height at which the final three clusters merge into two.
- Compare these heights across linkage methods.
- Briefly explain what this suggests about cluster separation under each method.

5. Compare the cluster assignments to penguin species:

- Create a confusion matrix (cluster vs species) for each linkage method.
- Briefly describe how the cluster assignments differ between linkage methods.
- Compute the cophenetic correlation coefficient (CCC) for all 4 linkage methods and explain which is the best fit. Does the linkage with the best CCC also give the best clustering relative to the species labels?

In [None]:
# Write your code for the exercise 1 here!

### Explain your findings here:




## Exercise 2: Analyzing DBSCAN Behavior on Penguins

In this exercise, you will investigate how DBSCAN behaves under different parameter choices and compare its structure to hierarchical clustering.

You may reuse the standardized dataset from the tutorial.

### Tasks:

Using min_samples = 5:

1. Run DBSCAN with three different values of eps:
- One clearly too small
- One near the knee from the k-distance plot
- One clearly too large
- Visualize your results and compare to the true species

2. For each choice of eps, report:
- Number of clusters found
- Number of noise points

3. Briefly explain:
- What happens when eps is too small?
- What happens when eps is too large?
- Why does this occur in terms of density connectivity?

4. For your best DBSCAN model:
- How many core points are there?
- How many border points?
- How many noise points?

5. Choose one cluster and:
- Describe the geometric difference between its core and border points.
- Explain why border points might represent biologically ambiguous penguins (for example, overlap between Adelie and Chinstrap).

6. Pairwise feature visualization
- Create a pairwise scatterplot matrix of the four numeric variables, coloring points by their DBSCAN cluster label.
- Use seaborn’s pairplot() function.
- Color points using the hue argument.
- Restrict the plot to the four numeric columns.
- Hint: sns.pairplot(data_plot, hue='cluster', vars=numeric_cols)
- Explain your findings.

7. Compute the mean of each numeric feature within each DBSCAN cluster and explain your findings.


In [None]:
# Write your code for the exercise 2 here!

### Explain your findings here:




## Exercise 3: Comparing Clustering Methods on Customer Data

You previously analyzed this dataset using k-means. Now apply:
- Hierarchical clustering
- DBSCAN

Your goal is to determine which method produces the most meaningful customer segmentation.

### Tasks:

1. Apply hierarchical clustering (choose an appropriate linkage method and number of clusters).

2. Apply DBSCAN (choose reasonable eps and min_samples).

3. Compare both results to your previous k-means solution from Lab 2.

### Analyze

How many clusters does each method produce?

Do the cluster sizes differ substantially?

Does DBSCAN identify noise or outliers?

Which method produces the most interpretable customer segments?

### Conclusion

In 4–6 sentences, argue which clustering method is “best” for this dataset and justify your reasoning.


In [None]:
# Write your code for the exercise 3 here!

import pandas as pd

# this file is also hosted on Kaggle: https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python
url = 'https://gist.githubusercontent.com/pravalliyaram/5c05f43d2351249927b8a3f3cc3e5ecf/raw/8bd6144a87988213693754baaa13fb204933282d/Mall_Customers.csv'
df = pd.read_csv(url)

print(df.head())
print(df.info())
print(df.describe())
print(df["Gender"].value_counts())

### Explain your findings here:


