# Lab 4: Assessing Cluster Quality

<a target="_blank" href="https://colab.research.google.com/github/drchadvidden/courseMaterials/blob/main/UnsupervisedLearning/Labs/Lab%204/Lab_4.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Lab Instructions

Run each of the coding cells. For tutorial example cells, understand the commands and check that the outputs make sense. For exercise cells, write your own code where indicated to generate the correct output. Give text explanations where indicated.

### Submission:
Complete the following notebook in order. Once done, save the notebook, print the file as a .pdf, and upload the resulting file to the Canvas course assignment.

### Rubric:
15 total points: 5 points for running the tutorial example cells and saving outputs, and 10 points for completing the exercises.

### Deadline:
Tuesday at midnight after the lab is assigned.

# Tutorial: Cluster fit metrics

## K-means clustering of mall customer data

Here we revisit the mall customer dataset from previous homework assignments.

In [None]:
import pandas as pd

# this file is also hosted on Kaggle: https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python
url = 'https://gist.githubusercontent.com/pravalliyaram/5c05f43d2351249927b8a3f3cc3e5ecf/raw/8bd6144a87988213693754baaa13fb204933282d/Mall_Customers.csv'
df = pd.read_csv(url)

print(df.head())
print(df.info())
print(df.describe())
print(df["Gender"].value_counts())

## Choosing the Number of Clusters (Model Selection)

In this lab, we will perform K-means clustering using only two variables:

- **Annual Income (k$)**
- **Spending Score (1–100)**

This 2D setting makes it easy to visualize customer segments and interpret the results.

Because K-means requires us to specify the number of clusters $ k $, we need a principled way to choose it. Rather than guessing, we will evaluate several values of $ k $ (from 2 to 10) using three common cluster selection criteria.

---

### 1️⃣ WCSS (Within-Cluster Sum of Squares)

Also called **inertia**, this measures how tightly grouped the points are within each cluster.

- Smaller values indicate tighter clusters, but WCSS always decreases as $ k $ increases, so we cannot simply minimize it.
- We look for an **“elbow”** in the curve where improvement begins to level off.
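
As a rough illustration of what inertia measures, the sketch below recomputes WCSS by hand on a small synthetic dataset and compares it with the `inertia_` attribute of a fitted `KMeans` model. The toy data (`X_toy`) and all variable names are assumptions made for this example and are not part of the lab data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (hypothetical): two well-separated blobs in 2D
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(30, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(30, 2)),
])

km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X_toy)

# WCSS = sum over all points of the squared distance to the assigned centroid
assigned_centers = km.cluster_centers_[km.labels_]
wcss_manual = np.sum((X_toy - assigned_centers) ** 2)

print(f"manual WCSS: {wcss_manual:.4f}")
print(f"inertia_:    {km.inertia_:.4f}")
```

The two printed values should agree up to floating-point rounding.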

---

### 2️⃣ Silhouette Score

Measures how well each point fits within its own cluster compared to the nearest other cluster.

- Ranges from -1 to 1.
- Larger values indicate better separation.
- We typically choose the $ k $ that **maximizes** this score.
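
To make the definition concrete, here is a minimal sketch that computes the silhouette value of a single point by hand, using $s = (b - a) / \max(a, b)$, where $a$ is the mean distance to the other points in the same cluster and $b$ is the mean distance to the points in the nearest other cluster. The 1D toy data and labels are assumptions chosen for easy arithmetic.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy 1D data (hypothetical) with a fixed labeling into two clusters
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 0, 1, 1])

# Silhouette for point i = 0
i = 0
dists = np.abs(X_toy[:, 0] - X_toy[i, 0])

same = (labels_toy == labels_toy[i])
same[i] = False                       # exclude the point itself
a = dists[same].mean()                # (1 + 2) / 2 = 1.5
b = dists[labels_toy == 1].mean()     # (10 + 11) / 2 = 10.5 (only one other cluster)
s_manual = (b - a) / max(a, b)        # 9 / 10.5

s_sklearn = silhouette_samples(X_toy, labels_toy)[i]
print(f"manual: {s_manual:.4f}, sklearn: {s_sklearn:.4f}")
```

The overall silhouette score reported by `silhouette_score` is the mean of these per-point values.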

---

### 3️⃣ Calinski–Harabasz (CH) Index

Measures the ratio of between-cluster dispersion to within-cluster dispersion, each adjusted for its degrees of freedom.

- Larger values indicate better-defined clusters.
- We choose the $ k $ that **maximizes** this index.
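
The ratio can be written as $\text{CH} = \dfrac{B / (k - 1)}{W / (n - k)}$, where $B$ is the between-cluster dispersion, $W$ is the within-cluster dispersion (the same quantity as WCSS), $n$ is the number of points, and $k$ is the number of clusters. The sketch below checks this against `calinski_harabasz_score` on a small hand-labeled dataset; the toy data and labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

# Toy data (hypothetical): three small groups in 2D with fixed labels
X_toy = np.array([[0, 0], [0, 1], [1, 0],
                  [5, 5], [5, 6], [6, 5],
                  [10, 0], [10, 1], [11, 0]], dtype=float)
labels_toy = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

n, k = len(X_toy), len(np.unique(labels_toy))
overall_mean = X_toy.mean(axis=0)

between = 0.0  # between-cluster dispersion B
within = 0.0   # within-cluster dispersion W
for c in np.unique(labels_toy):
    pts = X_toy[labels_toy == c]
    center = pts.mean(axis=0)
    between += len(pts) * np.sum((center - overall_mean) ** 2)
    within += np.sum((pts - center) ** 2)

# CH = [B / (k - 1)] / [W / (n - k)]
ch_manual = (between / (k - 1)) / (within / (n - k))
ch_sklearn = calinski_harabasz_score(X_toy, labels_toy)
print(f"manual CH: {ch_manual:.4f}, sklearn CH: {ch_sklearn:.4f}")
```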

---

Before running K-means, we standardize the variables. Since K-means is distance-based, differences in scale would otherwise distort the clustering.
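
A small sketch of why scale matters: with income measured in dollars and spending on a 1 to 100 scale, Euclidean distance is driven almost entirely by income. The three customers below are hypothetical values invented for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: income in dollars, spending score on a 1-100 scale
X_toy = np.array([[15000.0, 90.0],
                  [16000.0, 10.0],
                  [90000.0, 85.0]])

def dist(a, b):
    """Euclidean distance between two points."""
    return np.linalg.norm(a - b)

# Raw units: income dominates, so customers 0 and 1 look nearly identical
# even though their spending scores differ by 80 points
print("raw    d(0,1):", round(dist(X_toy[0], X_toy[1]), 1))
print("raw    d(0,2):", round(dist(X_toy[0], X_toy[2]), 1))

X_std = StandardScaler().fit_transform(X_toy)

# Standardized: each column has mean 0 and standard deviation 1,
# so the spending gap between customers 0 and 1 now contributes to the distance
print("scaled d(0,1):", round(dist(X_std[0], X_std[1]), 3))
print("scaled d(0,2):", round(dist(X_std[0], X_std[2]), 3))
```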

The following code evaluates all three criteria for $ k = 2, \dots, 10 $ and produces comparison plots.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# ---------------------------
# 1. Select 2D Features
# ---------------------------

X = df[["Annual Income (k$)", "Spending Score (1-100)"]]

# Standardize (important for distance-based clustering)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ---------------------------
# 2. Evaluate k = 2 to 10
# ---------------------------

k_range = range(2, 11)

wcss = []
silhouette_scores = []
ch_scores = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
    labels = kmeans.fit_predict(X_scaled)

    wcss.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, labels))
    ch_scores.append(calinski_harabasz_score(X_scaled, labels))

# ---------------------------
# 3. Plot Selection Criteria
# ---------------------------

plt.figure()
plt.plot(k_range, wcss, marker='o')
plt.xlabel("k")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()

plt.figure()
plt.plot(k_range, silhouette_scores, marker='o')
plt.xlabel("k")
plt.ylabel("Silhouette Score")
plt.title("Silhouette vs k")
plt.show()

plt.figure()
plt.plot(k_range, ch_scores, marker='o')
plt.xlabel("k")
plt.ylabel("Calinski-Harabasz Index")
plt.title("CH Index vs k")
plt.show()

## Fitting the Final K-Means Model

Based on the previous model selection plots (WCSS, Silhouette, and CH index), we now choose an appropriate number of clusters $k$.

Using this selected value, we:

1. Fit the K-means model.
2. Assign each customer to a cluster.
3. Add the cluster labels to the dataframe.
4. Visualize the resulting segmentation in the 2D feature space.

The scatterplot below shows how customers are grouped according to **Annual Income** and **Spending Score**, with colors indicating cluster membership.


In [None]:
k_optimal = 5

kmeans = KMeans(n_clusters=k_optimal, random_state=42, n_init=20)
labels = kmeans.fit_predict(X_scaled)

df["Cluster"] = labels

plt.figure()

plt.scatter(
    df["Annual Income (k$)"],
    df["Spending Score (1-100)"],
    c=labels
)

plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.title(f"K-Means Clustering (k={k_optimal})")

plt.show()


# Tutorial: Cluster interpretation

## Summarizing Clusters for Interpretability

Once we’ve assigned customers to clusters, the next step is to **summarize and interpret** those clusters. A key point in cluster-based segmentation is that the groups should not only be statistically distinct, they should also be **actionable and meaningful** to stakeholders — for example, marketing or product teams. Summary tables showing average characteristics per cluster help reveal whether the clusters differ in ways that make sense for targeting or strategy.

This aligns with best practices in customer segmentation, which emphasize that segmentation analysis should go beyond a “black box” to provide **explorative and interpretable results** rather than just algorithm outputs (<https://www.researchgate.net/publication/30385490_A_Review_of_Unquestioned_Standards_in_Using_Cluster_Analysis_for_Data-Driven_Market_Segmentation>).


In [None]:
# ---------------------------
# Summarize Cluster Profiles
# ---------------------------

# Select the variables we want to summarize
summary_vars = ["Annual Income (k$)", "Spending Score (1-100)", "Age"]

# Compute count, mean, and standard deviation per cluster
cluster_summary = df.groupby("Cluster")[summary_vars].agg(["count", "mean", "std"])
print(cluster_summary)

# Optionally show proportions by Gender
gender_dist = df.groupby("Cluster")["Gender"].value_counts(normalize=True).unstack()
print(gender_dist)


## Cluster Interpretation Table

| Cluster | Income (k$) | Spending Score | Age | Gender | Interpretation |
|---------|------------|----------------|-----|--------|----------------|
| 0       | ~55        | ~43            | 50  | ~59% F | Middle-income, moderate spenders, older adults |
| 1       | ~87        | ~82            | 32  | ~54% F | High-income, high spenders, younger adults |
| 2       | ~26        | ~25            | 25  | ~59% F | Low-income, low spenders, young adults |
| 3       | ~88        | ~41            | 41  | ~46% F | High-income, moderate spenders, mid-age adults |
| 4       | ~26        | ~45            | 45  | ~61% F | Low-income, moderate spenders, mid-age adults |

### Key Actionable Insights from Clusters

- **High-income clusters (1 & 3)** could be targeted for premium or higher-end offers.  
- **Low-income clusters (2 & 4)** may respond better to budget-friendly campaigns.  
- **Gender distributions** are fairly balanced but could inform marketing messaging.  
- **Age differences** suggest using different channels or messaging strategies for different segments.  


In [None]:
import matplotlib.pyplot as plt

# Select variables to visualize
cluster_means = df.groupby("Cluster")[["Annual Income (k$)", "Spending Score (1-100)", "Age"]].mean()

# Plot
cluster_means.plot(kind="bar", figsize=(10,6))
plt.title("Average Characteristics per Cluster")
plt.ylabel("Mean Value")
plt.xticks(rotation=0)
plt.show()


# Tutorial: Cluster stability

## Assessing Cluster Stability with Bootstrap

Cluster assignments can vary if the data changes slightly. To check **stability**, we can use **bootstrap resampling**:

1. **Bootstrap sampling**: Repeatedly draw random samples (with replacement) from the original data.
2. **Re-cluster** each sample using the same K-means parameters.
3. **Compare clusters**:
   - **Gap statistic**: Compares the within-cluster dispersion of the data to the dispersion expected under a random uniform reference distribution. Larger gap → stronger cluster structure than would arise by chance.
   - **Adjusted Rand Index (ARI)**: Compares cluster assignments between bootstrap samples and the original clustering. ARI close to 1 → highly stable clusters.
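
To make the ARI concrete, the short sketch below shows that it depends only on how points are grouped, not on which integer ID each cluster receives. This matters because K-means may number the same clusters differently from one run to the next. The three labelings are hypothetical.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical labelings of six points
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]   # same grouping, cluster IDs renamed
labels_c = [0, 0, 0, 1, 1, 1]   # a different grouping

ari_same = adjusted_rand_score(labels_a, labels_b)
ari_diff = adjusted_rand_score(labels_a, labels_c)

print("same grouping, renamed IDs:", ari_same)
print("different grouping:        ", round(ari_diff, 3))
```

The renamed labeling scores a perfect ARI, while the different grouping scores well below 1.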

This approach helps validate that the chosen number of clusters $k$ produces **robust, reliable segments**, not just artifacts of a particular sample.
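
The gap statistic mentioned in step 3 is not computed in the tutorial code that follows. As a rough sketch only, it can be approximated by comparing the log of the K-means inertia on the data with the average log inertia on uniform reference datasets drawn over the same bounding box. The helper `gap_statistic`, the toy data, and the parameter choices (`n_refs`, `n_init`) are all assumptions for illustration and not a reference implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=0):
    """Hypothetical helper: mean log-inertia on uniform references minus log-inertia on X."""
    rng = np.random.default_rng(seed)
    log_w = np.log(KMeans(n_clusters=k, n_init=5, random_state=seed).fit(X).inertia_)

    # Reference data: uniform over the bounding box of X
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_w = []
    for _ in range(n_refs):
        X_ref = rng.uniform(lo, hi, size=X.shape)
        km_ref = KMeans(n_clusters=k, n_init=5, random_state=seed).fit(X_ref)
        ref_log_w.append(np.log(km_ref.inertia_))

    return np.mean(ref_log_w) - log_w

# Toy data (hypothetical): three well-separated blobs in 2D
rng = np.random.default_rng(1)
X_toy = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(40, 2)),
    rng.normal(loc=[8, 0], scale=0.5, size=(40, 2)),
    rng.normal(loc=[4, 7], scale=0.5, size=(40, 2)),
])

gaps = {k: gap_statistic(X_toy, k) for k in (1, 2, 3, 4)}
for k, g in gaps.items():
    print(f"k={k}: gap={g:.3f}")
```

On data with three clear groups, the gap at $k=3$ should be much larger than the gap at $k=1$. To try this on the lab data, one could pass `X_scaled` in place of `X_toy`.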


In [None]:
from sklearn.utils import resample
from sklearn.metrics import adjusted_rand_score
from sklearn.cluster import KMeans
import numpy as np

k_optimal = 5
n_bootstrap = 50
ari_scores = []

# Fit original clustering
kmeans_orig = KMeans(n_clusters=k_optimal, random_state=42, n_init=50)
labels_orig = kmeans_orig.fit_predict(X_scaled)

# Bootstrap Loop
for i in range(n_bootstrap):
    # Resample the data (with replacement)
    X_resampled, indices = resample(X_scaled, np.arange(len(X_scaled)), random_state=i)

    # Fit KMeans on the bootstrap sample
    kmeans_boot = KMeans(n_clusters=k_optimal, random_state=i, n_init=10)
    labels_boot = kmeans_boot.fit_predict(X_resampled)

    # Compare original labels to bootstrap labels for the sampled indices
    ari = adjusted_rand_score(labels_orig[indices], labels_boot)
    ari_scores.append(ari)

# Quick stats
print(f"Mean ARI: {np.mean(ari_scores):.4f} (+/- {np.std(ari_scores):.4f})")

import matplotlib.pyplot as plt

# Simple line plot of all bootstrap scores
plt.plot(ari_scores, marker='o', linestyle='-')
plt.axhline(sum(ari_scores)/len(ari_scores), color='red', label='Mean')

plt.title('ARI Score per Bootstrap Iteration')
plt.xlabel('Iteration')
plt.ylabel('ARI Score')
plt.legend()
plt.show()



## Next Steps Toward a More Robust Customer Segmentation

The current clustering analysis provides an initial segmentation based on income and spending behavior. However, a robust customer segmentation strategy typically goes beyond a single K-means model on two variables.

Potential next steps include:

- **Incorporate additional features** (e.g., age, gender, purchase frequency, product categories) to capture richer behavioral patterns.
- **Compare alternative clustering methods** (e.g., hierarchical clustering or DBSCAN models) to assess whether segment structure is consistent.
- **Evaluate stability over time**, if longitudinal data are available.
- **Validate business usefulness**, ensuring segments are actionable for marketing, pricing, or personalization strategies.
- **Profile and label segments clearly**, translating statistical clusters into meaningful customer personas.

Effective segmentation balances statistical validity, stability, and managerial interpretability.


# Exercise(s): Cluster assessment and validity





## Exercise 1: Silhouette score exploration

Cluster evaluation need not rely only on a single overall silhouette score.  
A strong average value can sometimes hide poorly separated clusters or misclassified observations.  
In this exercise, you will examine silhouette scores at the **individual observation level** to better understand cluster structure and separation.


### Tasks:
1. **Compute silhouette values for each observation**  
   - Use `silhouette_samples()` from `sklearn.metrics`.  
   - Store the result in a new column in your dataframe.

2. **Compute the average silhouette score per cluster**  
   - Group by cluster label.  
   - Report the mean and standard deviation of the silhouette values for each cluster.  

3. **Create a visualization**  
   - Produce a plot that shows silhouette values grouped by cluster.  
   - A boxplot or bar chart of average silhouette scores is sufficient. Alternatively, you could color the points on the clustering scatterplot by silhouette value and use marker shapes to distinguish clusters.  
   - Clearly label axes and clusters.

4. **Interpretation Questions**
   - Which cluster has the **lowest average silhouette score**?
   - Are there any **negative silhouette values**?
   - What do negative silhouette values indicate?
   - Do any clusters appear poorly separated or internally inconsistent?
   - Would this change your confidence in the chosen value of $k$?

In [None]:
# Write your code for the exercise here!

### Explain your findings here:




## Exercise 2: Comparing $k=4$, $k=5$, and $k=6$ — Fit, Stability, and Interpretability

In the tutorial, we selected $k=5$ based on WCSS, silhouette, and CH index.  
However, cluster analysis rarely has a single “correct” solution. Different values of $k$ may provide different trade-offs between statistical fit, stability, and managerial usefulness.

In this exercise, you will critically evaluate whether $k=5$ is truly preferable by comparing it to $k=4$ and $k=6$.

---

## Part A — Model Fit

For each value of $k \in \{4,5,6\}$:

1. Fit a K-means model using the same scaled features.
2. Compute:
   - Overall silhouette score
   - Calinski–Harabasz (CH) index
3. Compare the three models:
   - Which $k$ has the highest silhouette?
   - Which has the highest CH index?
   - Are differences large or marginal?

### Considerations
- Does statistical fit clearly favor one model?
- If the metrics disagree, how would you decide?

---

## Part B — Stability (Bootstrap ARI)

For each $k$:

1. Perform bootstrap resampling.
2. Compute the Adjusted Rand Index (ARI) between the original clustering and bootstrap clusterings.
3. Report the **mean ARI** across bootstrap samples.

### Considerations
- Which $k$ appears most stable?
- Does the most stable solution match the best silhouette score?
- What does low stability imply about segmentation reliability?

---

## Part C — Segment Structure and Interpretability

For each $k$:

1. Summarize cluster sizes.
2. Profile clusters using:
   - Mean income
   - Mean spending score
   - Mean age
   - Gender proportions
3. Compare the segmentation structure:
   - Does $k=4$ merge meaningful groups?
   - Does $k=6$ create very small or fragmented clusters?
   - Does $k=5$ provide a clearer business narrative?

### Considerations

- Which solution produces the most **balanced cluster sizes**?
- Which produces the most **distinct behavioral profiles**?
- Are any clusters difficult to describe in marketing terms?
- Does increasing $k$ meaningfully improve insight, or just add complexity?

---

## Final Deliverable

Write a short paragraph (5–8 sentences) recommending one value of $k$.

Your justification must reference:
- Statistical fit (silhouette / CH)
- Stability (ARI)
- Cluster size balance
- Business interpretability


In [None]:
# Write your code for the exercise here!

### Explain your findings here:




## Exercise 3: Domain Expertise

The tutorial cited the following paper:

<https://www.researchgate.net/publication/30385490_A_Review_of_Unquestioned_Standards_in_Using_Cluster_Analysis_for_Data-Driven_Market_Segmentation>

This paper argues that cluster analysis in marketing is often applied mechanically, without sufficient attention to business relevance and validation.

---

### High-Level Takeaways

1. **Statistical fit is not enough**  
   Good silhouette or internal validity metrics do not guarantee meaningful or useful market segments.

2. **Segments must be actionable**  
   Clusters should lead to clear managerial decisions (targeting, positioning, pricing, communication).

3. **Stability and robustness matter**  
   Segments should not change dramatically with small data perturbations.

4. **Interpretability is critical**  
   If a cluster cannot be clearly described in business terms, it is unlikely to be useful.

---

### Short Reflection

In 4–6 sentences, discuss whether your chosen clustering solution satisfies these broader criteria beyond statistical metrics.


### Explain your findings here:




## Exercise 4: Wholesale Customer Segmentation

In this final exercise, you will apply the full clustering workflow to a new dataset:  
the **Wholesale Customers Dataset** from the UCI Machine Learning Repository.

Unlike the mall dataset, this dataset contains annual spending across multiple product categories. Your goal is to construct meaningful customer segments and evaluate their statistical and managerial quality.

---

### Tasks

1. **Load and Explore the Data**
   - Load the dataset from the provided online source.
   - Examine summary statistics.
   - Identify the feature columns to use for clustering.
   - Decide whether any variables should be excluded.

2. **Preprocess the Data**
   - Standardize the numeric features.
   - Briefly justify why scaling is necessary.

3. **Select the Number of Clusters**
   - Evaluate $k = 2$ to $k = 10$.
   - Compute:
     - WCSS (Elbow Method)
     - Silhouette Score
     - Calinski–Harabasz Index
   - Select an optimal $k$ and justify your choice.

4. **Assess Cluster Stability**
   - Perform bootstrap resampling.
   - Compute the mean Adjusted Rand Index (ARI).
   - Briefly interpret the stability of your solution.

5. **Profile and Interpret Clusters**
   - Summarize cluster means for each spending category.
   - Identify distinguishing characteristics of each segment.
   - Provide descriptive labels for each cluster.

6. **Business Interpretation**
   - Propose at least two actionable strategies based on your segments.
   - Discuss whether your segmentation is statistically strong, stable, and managerially useful.

---

### Deliverables

- One figure for model selection  
- One table summarizing cluster profiles  
- One stability metric (mean ARI)  
- A short written interpretation (8–12 sentences)



In [None]:
# Write your code for the exercise here!

import pandas as pd

df_wholesale = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv")
df_wholesale.head()

### Explain your findings here:




In [None]:
# code to export notebook as .html for Canvas upload

from google.colab import drive
from google.colab import files

drive.mount('/content/drive')

notebook_name = "Lab_4"  # must match the name of this notebook file in your Drive
!cp "/content/drive/MyDrive/Colab Notebooks/DSC 430/{notebook_name}.ipynb" /content/
!jupyter nbconvert --to html "/content/{notebook_name}.ipynb"
files.download(f"/content/{notebook_name}.html")

