# Chapter 31: Unsupervised Learning and Clustering

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/example-notebooks/31_clustering.ipynb)

This notebook contains all the executable code examples from Chapter 31 of the BANA 4080 textbook. You can run each code cell and experiment with the examples to deepen your understanding of clustering and unsupervised learning.

## Learning Objectives

By working through this notebook, you will be able to:

- Understand the difference between supervised and unsupervised learning
- Apply the K-Means clustering algorithm using scikit-learn
- Engineer behavioral features from transactional data
- Use the elbow method and silhouette scores to select the optimal number of clusters
- Apply proper feature scaling for clustering
- Interpret cluster profiles and translate them into business insights
- Explore alternative clustering methods (hierarchical, DBSCAN)
- Complete a real-world customer segmentation case study

## Setup: Import Required Libraries

First, let's import all the libraries we'll need throughout this notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting style
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")

---

## Part 1: Introduction to Clustering with Synthetic Data

We'll start with a simple 2D example to visualize how K-Means clustering works.

### Generate Synthetic Customer Data

Let's create three distinct groups of customers based on age and income to see how K-Means discovers these natural groupings.

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Generate three distinct customer groups
# Group 1: Young, lower income (students/entry-level)
group1_age = np.random.normal(25, 3, 50)
group1_income = np.random.normal(35000, 5000, 50)

# Group 2: Middle-aged, moderate income (professionals)
group2_age = np.random.normal(40, 4, 50)
group2_income = np.random.normal(65000, 8000, 50)

# Group 3: Older, higher income (executives/established)
group3_age = np.random.normal(55, 5, 50)
group3_income = np.random.normal(95000, 12000, 50)

# Combine into single dataset
age = np.concatenate([group1_age, group2_age, group3_age])
income = np.concatenate([group1_income, group2_income, group3_income])

# Create DataFrame
customer_data = pd.DataFrame({
    'age': age,
    'income': income
})

print(f"Created {len(customer_data)} customers")
customer_data.head()

### Apply K-Means Clustering

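Now let's use K-Means to discover the three customer segments and visualize the results. To keep this first example simple we cluster the raw values without scaling, so income (measured in tens of thousands of dollars) dominates the distance calculation; Part 2 shows how to scale features properly.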

In [None]:
# Fit K-Means with k=3
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
customer_data['cluster'] = kmeans.fit_predict(customer_data[['age', 'income']])

# Get cluster centers
centers = kmeans.cluster_centers_

# Create side-by-side plots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Original data (no clusters visible)
axes[0].scatter(customer_data['age'], customer_data['income'],
                alpha=0.6, s=50, color='gray')
axes[0].set_xlabel('Age (years)', fontsize=12)
axes[0].set_ylabel('Income ($)', fontsize=12)
axes[0].set_title('Before Clustering: Unlabeled Customer Data', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Right plot: After clustering with centroids
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']
for i in range(3):
    cluster_data = customer_data[customer_data['cluster'] == i]
    axes[1].scatter(cluster_data['age'], cluster_data['income'],
                   alpha=0.6, s=50, color=colors[i], label=f'Cluster {i+1}')

# Plot centroids
axes[1].scatter(centers[:, 0], centers[:, 1],
               marker='X', s=300, c='black', edgecolors='white', linewidths=2,
               label='Centroids', zorder=5)

axes[1].set_xlabel('Age (years)', fontsize=12)
axes[1].set_ylabel('Income ($)', fontsize=12)
axes[1].set_title('After K-Means: Discovered Customer Segments', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print cluster summaries
print("\nCluster Summaries:")
print("=" * 60)
for i in range(3):
    cluster_data = customer_data[customer_data['cluster'] == i]
    print(f"\nCluster {i+1}:")
    print(f"  Size: {len(cluster_data)} customers")
    print(f"  Average age: {cluster_data['age'].mean():.1f} years")
    print(f"  Average income: ${cluster_data['income'].mean():,.0f}")
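
To make the "assign points, then move centroids" idea concrete, here is a minimal NumPy sketch of the loop that K-Means is built on (often called Lloyd's algorithm). It is not scikit-learn's implementation, which adds smarter initialization, multiple restarts, and other optimizations. The helper name `kmeans_sketch`, its parameters, and the toy points are made up for illustration.

```python
import numpy as np

def kmeans_sketch(X, k, init=None, n_iter=100, seed=0):
    """Simplified K-Means loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Default: start from k distinct data points chosen at random
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centroids = np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so assignments are stable
        centroids = new_centroids
    return labels, centroids

# Six made-up points forming two obvious groups
X_demo = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Start with one centroid in each group so the run is deterministic
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2, init=X_demo[[0, 3]])
print("Labels:", labels_demo)
print("Centroids:\n", centroids_demo)
```

In practice you would keep using `KMeans` from scikit-learn; the sketch is only meant to show that the algorithm alternates between two simple steps until the centroids stop moving.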

---

## Part 2: K-Means Implementation with Scikit-Learn

Let's explore the complete K-Means workflow including feature scaling and accessing model results.

### Create Sample Customer Data

In [None]:
# Create sample customer data with multiple features
np.random.seed(42)
n_customers = 150

customer_data_multi = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'age': np.random.randint(20, 70, n_customers),
    'annual_income': np.random.randint(20000, 120000, n_customers),
    'purchase_frequency': np.random.randint(1, 50, n_customers)
})

print("Sample customer data:")
customer_data_multi.head(10)

### Complete K-Means Workflow with Feature Scaling

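Feature scaling is **critical** for K-Means because the algorithm assigns points using Euclidean distance. Without scaling, features with larger numeric ranges (such as income in dollars) will dominate the clustering, while features with smaller ranges (such as age in years) will barely matter.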

In [None]:
# Step 1: Prepare your data (features only, no target variable)
X = customer_data_multi[['age', 'annual_income', 'purchase_frequency']]

# Step 2: Scale your features (IMPORTANT!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Create and fit the K-Means model
kmeans = KMeans(
    n_clusters=3,        # Number of clusters
    random_state=42,     # For reproducibility
    n_init=10            # Number of different initializations (older scikit-learn defaulted to 10; 1.4+ defaults to 'auto')
)
kmeans.fit(X_scaled)

# Step 4: Get cluster assignments
customer_data_multi['cluster'] = kmeans.predict(X_scaled)

print("Clustering complete!")
print(f"\nCluster distribution:")
print(customer_data_multi['cluster'].value_counts().sort_index())
print("\nFirst 10 customers with cluster assignments:")
customer_data_multi.head(10)
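
The effect of scaling on distances can be seen with a few hand-picked numbers. The sketch below uses three made-up customers (not rows from the data above) and compares Euclidean distances before and after `StandardScaler`. The variable names are hypothetical, and fitting a scaler on three rows is only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three made-up customers: [age, annual_income, purchase_frequency]
X_toy = np.array([
    [25, 60000, 5],    # customer A
    [65, 61000, 45],   # customer B: very different age and frequency, similar income
    [26, 70000, 6],    # customer C: almost identical to A except for income
], dtype=float)

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Distances on the raw values: the income column dwarfs the others
raw_ab = euclidean(X_toy[0], X_toy[1])
raw_ac = euclidean(X_toy[0], X_toy[2])

# Distances after standardizing each column to mean 0 and standard deviation 1
X_toy_scaled = StandardScaler().fit_transform(X_toy)
scaled_ab = euclidean(X_toy_scaled[0], X_toy_scaled[1])
scaled_ac = euclidean(X_toy_scaled[0], X_toy_scaled[2])

print(f"Raw:    A-B = {raw_ab:,.1f}   A-C = {raw_ac:,.1f}")
print(f"Scaled: A-B = {scaled_ab:.2f}   A-C = {scaled_ac:.2f}")
```

On the raw values, B looks far closer to A than C does, purely because the income gap is measured in thousands. After scaling, all three features contribute on a comparable footing and the ordering flips.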

### Accessing K-Means Results

The K-Means model stores useful information that we can access after fitting.

In [None]:
# Get cluster centers (centroids)
centroids = kmeans.cluster_centers_
print("Cluster centroids shape:", centroids.shape)  # (n_clusters, n_features)
print("\nCentroid values (scaled):")
print(centroids)

# Get cluster labels for training data
labels = kmeans.labels_
print("\nFirst 10 cluster assignments:", labels[:10])

# Get WCSS (inertia)
wcss = kmeans.inertia_
print(f"\nWithin-Cluster Sum of Squares: {wcss:.2f}")

# Predict cluster for new data
new_customer = pd.DataFrame([[35, 60000, 12]], 
                           columns=['age', 'annual_income', 'purchase_frequency'])
new_customer_scaled = scaler.transform(new_customer)
predicted_cluster = kmeans.predict(new_customer_scaled)
print(f"\nNew customer (age=35, income=$60k, frequency=12) assigned to cluster: {predicted_cluster[0]}")
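
Because new customers must be transformed with the same scaler that was fitted on the training data, one option is to bundle the scaler and the model into a single scikit-learn `Pipeline`. The sketch below assumes a tiny made-up dataset; `toy_customers`, `segmenter`, and `feature_cols` are hypothetical names, not objects from the cells above.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

feature_cols = ['age', 'annual_income', 'purchase_frequency']

# Made-up customers: three younger/low-spend and three older/high-spend
toy_customers = pd.DataFrame({
    'age': [22, 25, 24, 58, 61, 60],
    'annual_income': [30000, 32000, 31000, 95000, 99000, 97000],
    'purchase_frequency': [5, 6, 4, 30, 28, 32],
})

# The pipeline scales first, then clusters, in one fit/predict call
segmenter = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, random_state=42, n_init=10),
)
toy_customers['cluster'] = segmenter.fit_predict(toy_customers[feature_cols])

# A new customer is scaled with the already-fitted scaler automatically
new_customer = pd.DataFrame([[59, 96000, 29]], columns=feature_cols)
new_cluster = segmenter.predict(new_customer)[0]

print(toy_customers)
print(f"New customer assigned to cluster: {new_cluster}")
```

This removes the risk of forgetting `scaler.transform` (or accidentally calling `fit_transform`) when scoring new data.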

---

## Part 3: Choosing the Optimal Number of Clusters

One of the biggest challenges in clustering is determining how many clusters (k) to use. We'll explore two popular methods.

### Elbow Method

The elbow method plots WCSS (Within-Cluster Sum of Squares) for different values of k. We look for the "elbow" where the rate of decrease sharply changes.

In [None]:
# Calculate WCSS for different values of k
wcss = []
k_range = range(1, 11)

for k in k_range:
    kmeans_temp = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans_temp.fit(customer_data[['age', 'income']])
    wcss.append(kmeans_temp.inertia_)  # inertia_ is scikit-learn's name for WCSS

# Plot elbow curve
plt.figure(figsize=(9, 5))
plt.plot(k_range, wcss, marker='o', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (k)', fontsize=12)
plt.ylabel('Within-Cluster Sum of Squares (WCSS)', fontsize=12)
plt.title('Elbow Method for Optimal k', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.xticks(k_range)

# Highlight the elbow at k=3
plt.axvline(x=3, color='red', linestyle='--', linewidth=2, label='Elbow at k=3')
plt.legend(fontsize=11)

plt.tight_layout()
plt.show()

print("WCSS values:")
for k, wcss_val in zip(k_range, wcss):
    print(f"  k={k}: WCSS = {wcss_val:,.0f}")
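
WCSS is simply the sum of squared distances from every point to the centroid of its assigned cluster. As a sanity check, the sketch below recomputes it by hand on synthetic blobs and compares the result with `inertia_`. The data and variable names here are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups (illustrative only)
X_blobs, _ = make_blobs(n_samples=150, centers=3, random_state=42)
km_check = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_blobs)

# Look up the centroid each point was assigned to
assigned_centroids = km_check.cluster_centers_[km_check.labels_]

# Sum the squared Euclidean distances to those centroids
manual_wcss = np.sum((X_blobs - assigned_centroids) ** 2)

print(f"Manual WCSS:       {manual_wcss:.4f}")
print(f"kmeans inertia_:   {km_check.inertia_:.4f}")
```

Since adding clusters can only keep this sum the same or shrink it, WCSS by itself always favors larger k, which is why the elbow plot looks for diminishing returns rather than a minimum.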

### Silhouette Analysis

Silhouette scores measure how similar each point is to its own cluster compared to other clusters. Scores range from -1 to +1:
- **+1**: Point is well-matched to its cluster
- **0**: Point is on the border between clusters
- **-1**: Point might be assigned to the wrong cluster

In [None]:
# Calculate silhouette scores for k=2 through k=10
silhouette_scores = []
k_range_sil = range(2, 11)  # Need at least 2 clusters for silhouette

for k in k_range_sil:
    kmeans_temp = KMeans(n_clusters=k, random_state=42, n_init=10)
    cluster_labels = kmeans_temp.fit_predict(customer_data[['age', 'income']])
    silhouette_avg = silhouette_score(customer_data[['age', 'income']], cluster_labels)
    silhouette_scores.append(silhouette_avg)

# Plot silhouette scores
plt.figure(figsize=(9, 5))
plt.plot(k_range_sil, silhouette_scores, marker='o', linewidth=2, markersize=8, color='green')
plt.xlabel('Number of Clusters (k)', fontsize=12)
plt.ylabel('Average Silhouette Score', fontsize=12)
plt.title('Silhouette Analysis for Optimal k', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.xticks(k_range_sil)

# Highlight the maximum
max_k = k_range_sil[silhouette_scores.index(max(silhouette_scores))]
plt.axvline(x=max_k, color='red', linestyle='--', linewidth=2,
            label=f'Maximum at k={max_k}')
plt.legend(fontsize=11)

plt.tight_layout()
plt.show()

print("Silhouette scores:")
for k, score in zip(k_range_sil, silhouette_scores):
    print(f"  k={k}: Silhouette = {score:.3f}")
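
For a single point, the silhouette value is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points in the nearest other cluster. The sketch below works this out by hand for one made-up point and checks it against scikit-learn's `silhouette_samples`. The toy coordinates and labels are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two tight, made-up groups with hand-assigned cluster labels
X_sil = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels_sil = np.array([0, 0, 0, 1, 1, 1])

i = 0  # index of the point to inspect
dists = np.linalg.norm(X_sil - X_sil[i], axis=1)

# a: mean distance to the other members of the point's own cluster
own_cluster = labels_sil == labels_sil[i]
own_cluster[i] = False  # exclude the point itself
a = dists[own_cluster].mean()

# b: smallest mean distance to the members of any other cluster
other_clusters = [c for c in np.unique(labels_sil) if c != labels_sil[i]]
b = min(dists[labels_sil == c].mean() for c in other_clusters)

s_manual = (b - a) / max(a, b)
s_sklearn = silhouette_samples(X_sil, labels_sil)[i]

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"Manual silhouette:  {s_manual:.4f}")
print(f"sklearn silhouette: {s_sklearn:.4f}")
```

The `silhouette_score` used in the cell above is the average of these per-point values across the whole dataset.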

---

## Part 4: Alternative Clustering Methods

K-Means works well for roughly spherical, similarly sized clusters. But what about other data structures?

### Hierarchical Clustering

Hierarchical clustering builds a tree-like structure (dendrogram) showing how observations group together at different levels.

In [None]:
# Use a subset of customers for clarity
sample_customers = customer_data.sample(30, random_state=42)

# Scale the data
scaler = StandardScaler()
sample_scaled = scaler.fit_transform(sample_customers[['age', 'income']])

# Perform hierarchical clustering
linkage_matrix = linkage(sample_scaled, method='ward')

# Plot dendrogram
plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix,
           labels=sample_customers.index.tolist(),
           leaf_font_size=8)
plt.xlabel('Customer Index', fontsize=12)
plt.ylabel('Distance (Ward Linkage)', fontsize=12)
plt.title('Hierarchical Clustering Dendrogram', fontsize=14, fontweight='bold')
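# Place the cut midway between the merge that leaves 3 clusters and the merge
# that leaves 2, so the line yields 3 clusters regardless of the sampled data
cut_height = (linkage_matrix[-3, 2] + linkage_matrix[-2, 2]) / 2
plt.axhline(y=cut_height, color='red', linestyle='--', linewidth=2, label='Cut at k=3')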
plt.legend()
plt.tight_layout()
plt.show()

print("Reading the dendrogram:")
print("- Each leaf (bottom) represents one customer")
print("- Branches merge at heights indicating dissimilarity")
print("- Cutting the tree horizontally (red line) gives k=3 clusters")
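
A dendrogram is a picture; to obtain actual cluster labels from a linkage matrix, SciPy provides `fcluster`. The sketch below builds its own small synthetic dataset so that it runs independently (`X_hier`, `Z`, and `flat_labels` are hypothetical names). With the notebook's data you would pass `linkage_matrix` instead.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(42)

# Three made-up, well-separated groups of 10 points each
X_hier = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(10, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(10, 2)),
    rng.normal(loc=[10, 0], scale=0.3, size=(10, 2)),
])

# Build the merge tree with Ward linkage, as in the dendrogram cell
Z = linkage(X_hier, method='ward')

# Cut the tree so that at most 3 flat clusters remain
flat_labels = fcluster(Z, t=3, criterion='maxclust')

print("Labels:", flat_labels)  # fcluster numbers clusters starting at 1
print("Cluster sizes:", np.bincount(flat_labels)[1:])
```

Changing `t` lets you extract a different number of clusters from the same tree without refitting, which is one practical advantage of hierarchical clustering over K-Means.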

### DBSCAN: Density-Based Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) can find clusters of arbitrary shapes and identify outliers.

In [None]:
# Generate crescent-shaped data where K-Means would struggle
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_moons)

# Plot DBSCAN results
plt.figure(figsize=(10, 6))
unique_labels = set(dbscan_labels)
colors = ['#FF6B6B', '#4ECDC4', '#FFA500']

for label in unique_labels:
    if label == -1:
        # Outliers
        color = 'yellow'
        marker = 'x'
        label_text = 'Outliers'
    else:
        color = colors[label % len(colors)]
        marker = 'o'
        label_text = f'Cluster {label+1}'
    
    mask = dbscan_labels == label
    plt.scatter(X_moons[mask, 0], X_moons[mask, 1],
               c=color, marker=marker, alpha=0.6, s=60,
               edgecolors='black', linewidths=0.5,
               label=label_text)

plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('DBSCAN: Handles Non-Spherical Clusters', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"DBSCAN found {len(set(dbscan_labels) - {-1})} clusters")
print(f"Number of outliers: {sum(dbscan_labels == -1)}")
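
To see why this data is hard for K-Means, the sketch below runs both algorithms on the same crescents and scores each against the true moon membership with the adjusted Rand index (1.0 means perfect agreement). True labels are available here only because the data is synthetic; real clustering problems usually have none. The variable names are assumptions for illustration.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Same settings as the cell above, but keep the true moon labels this time
X_cmp, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means is told there are 2 clusters; DBSCAN infers the count from density
kmeans_cmp_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_cmp)
dbscan_cmp_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_cmp)

# Agreement between each clustering and the true moon membership
ari_kmeans = adjusted_rand_score(y_true, kmeans_cmp_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_cmp_labels)

print(f"K-Means adjusted Rand index: {ari_kmeans:.3f}")
print(f"DBSCAN adjusted Rand index:  {ari_dbscan:.3f}")
```

K-Means splits the plane with a straight boundary, so each of its clusters ends up containing parts of both crescents, while DBSCAN follows the dense curved regions.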

---

## Part 5: Real-World Case Study - Complete Journey Customer Segmentation

Now let's apply everything we've learned to a real grocery store dataset. We'll segment customers based on their shopping behavior and demographics.

### Load the Complete Journey Data

In [None]:
# Install the package if running in Colab
try:
    from completejourney_py import get_data
except ImportError:
    print("Installing completejourney_py package...")
    !pip install completejourney-py
    from completejourney_py import get_data

# Load the Complete Journey datasets
print("Loading Complete Journey data...")
data = get_data()
transactions = data['transactions']
demographics = data["demographics"]

print(f"\nTransactions: {len(transactions):,} rows")
print(f"Households: {demographics['household_id'].nunique():,} unique households")

print("\nTransaction data sample:")
print(transactions.head())

print("\nDemographic data sample:")
print(demographics.head())

### Feature Engineering from Transactions

We'll transform raw transaction records into behavioral features that describe each customer's shopping patterns.

In [None]:
# Step 1: Create behavioral features from transactions

# Convert transaction_timestamp to datetime
transactions['transaction_timestamp'] = pd.to_datetime(transactions['transaction_timestamp'], format='mixed')

# Find the last date in the dataset for recency calculations
max_date = transactions['transaction_timestamp'].max()

# Aggregate transaction data by household
behavioral_features = transactions.groupby('household_id').agg({
    # Spending metrics
    'sales_value': ['sum', 'mean'],  # Total spending and average transaction value
    'basket_id': 'nunique',  # Number of unique shopping trips
    'product_id': 'nunique',  # Number of unique products purchased
    
    # Discount sensitivity
    'retail_disc': 'sum',  # Total retail discounts used
    'coupon_disc': 'sum',  # Total coupon discounts used
    
    # Temporal patterns
    'transaction_timestamp': ['min', 'max']  # First and last purchase dates
}).reset_index()

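# Flatten column names
behavioral_features.columns = ['household_id', 'total_spending', 'avg_basket_value',
                                'num_trips', 'num_unique_products',
                                'total_retail_disc', 'total_coupon_disc',
                                'first_purchase', 'last_purchase']

# Each transaction row is one product within a basket, so the 'mean' above is a
# per-item average. Recompute avg_basket_value as spending per shopping trip.
behavioral_features['avg_basket_value'] = (behavioral_features['total_spending'] /
                                           behavioral_features['num_trips'])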

# Create additional engineered features
behavioral_features['days_active'] = (behavioral_features['last_purchase'] - 
                                      behavioral_features['first_purchase']).dt.days + 1
behavioral_features['recency_days'] = (max_date - behavioral_features['last_purchase']).dt.days
behavioral_features['avg_days_between_trips'] = (behavioral_features['days_active'] / 
                                                  behavioral_features['num_trips'])

# Calculate discount usage rates
behavioral_features['total_discount'] = (behavioral_features['total_retail_disc'] + 
                                         behavioral_features['total_coupon_disc'])
behavioral_features['discount_rate'] = (behavioral_features['total_discount'] / 
                                        behavioral_features['total_spending'])
behavioral_features['coupon_usage_rate'] = (behavioral_features['total_coupon_disc'] / 
                                            behavioral_features['total_spending'])

# Drop temporary date columns
behavioral_features = behavioral_features.drop(['first_purchase', 'last_purchase'], axis=1)

print("Behavioral features created!")
print(f"\nFeatures per household: {len(behavioral_features.columns)-1}")
print("\nSample behavioral features:")
behavioral_features.head()
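
The same kind of household-level aggregation can be written with pandas named aggregation, which assigns the output column names directly and avoids the manual column-flattening step. The sketch below uses a tiny made-up transaction table (not the Complete Journey data) so the results can be checked by hand; `toy_transactions` and `toy_features` are hypothetical names.

```python
import pandas as pd

# Made-up transactions: one row per product within a basket
toy_transactions = pd.DataFrame({
    'household_id': [1, 1, 1, 2, 2],
    'basket_id':    [10, 10, 11, 20, 21],
    'product_id':   [100, 101, 100, 100, 102],
    'sales_value':  [2.0, 3.0, 5.0, 4.0, 8.0],
})

# Named aggregation: new_column=(source_column, aggregation)
toy_features = (
    toy_transactions
    .groupby('household_id')
    .agg(
        total_spending=('sales_value', 'sum'),
        num_trips=('basket_id', 'nunique'),
        num_unique_products=('product_id', 'nunique'),
    )
    .reset_index()
)

# Derived feature: average spend per shopping trip
toy_features['spend_per_trip'] = (
    toy_features['total_spending'] / toy_features['num_trips']
)

print(toy_features)
```

Household 1 spends $10 over 2 trips and household 2 spends $12 over 2 trips, so the per-trip averages are $5 and $6.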

### Merge with Demographics and Encode Categorical Features

In [None]:
# Step 2: Merge behavioral features with demographics
customer_data = behavioral_features.merge(demographics, on='household_id', how='inner')

print(f"Merged data: {len(customer_data)} households")

# Step 3: Encode demographic features
# Map column names (handling different possible column names)
col_mapping = {}
for col in demographics.columns:  # demographics only: 'coupon_usage_rate' also contains "age"
    lower_col = col.lower()
    if 'age' in lower_col and 'age_encoded' not in lower_col:
        col_mapping['age'] = col
    elif 'income' in lower_col and 'income_encoded' not in lower_col:
        col_mapping['income'] = col
    elif 'household_size' in lower_col or 'hh_size' in lower_col:
        col_mapping['household_size'] = col
    elif 'marital' in lower_col:
        col_mapping['marital_status'] = col
    elif 'homeowner' in lower_col or 'home_owner' in lower_col:
        col_mapping['homeowner'] = col
    elif 'kid' in lower_col or 'child' in lower_col:
        col_mapping['kids'] = col

# Convert age brackets to ordinal numbers
age_map = {
    '19-24': 1, '25-34': 2, '35-44': 3, '45-54': 4,
    '55-64': 5, '65+': 6
}
customer_data['age_encoded'] = customer_data[col_mapping.get('age', 'age')].map(age_map)

# Convert income brackets to ordinal numbers
income_map = {
    'Under 15K': 1, '15-24K': 2, '25-34K': 3, '35-49K': 4,
    '50-74K': 5, '75-99K': 6, '100-124K': 7, '125-149K': 8,
    '150-174K': 9, '175-199K': 10, '200-249K': 11, '250K+': 12
}
customer_data['income_encoded'] = customer_data[col_mapping.get('income', 'income')].map(income_map)

# Extract household size
hh_size_col = col_mapping.get('household_size', 'household_size')
if customer_data[hh_size_col].dtype == 'object':
    customer_data['household_size_num'] = customer_data[hh_size_col].str.extract(r'(\d+)').astype(float)
else:
    customer_data['household_size_num'] = customer_data[hh_size_col]

# Extract number of kids
if 'kids' in col_mapping:
    kids_col = col_mapping['kids']
    if customer_data[kids_col].dtype == 'object':
        customer_data['num_kids'] = customer_data[kids_col].replace('None/Unknown', '0')
        customer_data['num_kids'] = customer_data['num_kids'].str.extract(r'(\d+)').fillna(0).astype(int)
    else:
        customer_data['num_kids'] = customer_data[kids_col].fillna(0).astype(int)
else:
    customer_data['num_kids'] = 0

# Create binary features
marital_col = col_mapping.get('marital_status', 'marital_status')
customer_data['is_married'] = (customer_data[marital_col] == 'Married').astype(int)

homeowner_col = col_mapping.get('homeowner', 'homeowner')
customer_data['is_homeowner'] = (customer_data[homeowner_col] == 'Homeowner').astype(int)

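# Handle missing values (.copy() so adding the 'cluster' column later
# does not trigger pandas' SettingWithCopyWarning)
customer_data_clean = customer_data.dropna(subset=['age_encoded', 'income_encoded']).copy()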

print(f"\nCleaned data: {len(customer_data_clean)} households")
print("\nEncoded features sample:")
customer_data_clean[['household_id', 'age_encoded', 'income_encoded', 
                     'household_size_num', 'num_kids', 'is_married', 'is_homeowner']].head()

### Select Features for Clustering

In [None]:
# Step 4: Select features for clustering
cluster_features = [
    # Behavioral features
    'total_spending',
    'avg_basket_value',
    'num_trips',
    'num_unique_products',
    'avg_days_between_trips',
    'recency_days',
    'discount_rate',
    'coupon_usage_rate',
    
    # Demographic features
    'age_encoded',
    'income_encoded',
    'household_size_num',
    'num_kids',
    'is_married',
    'is_homeowner'
]

X_cluster = customer_data_clean[cluster_features]

print(f"Clustering features: {len(cluster_features)}")
print("\nFeature names:")
for i, feature in enumerate(cluster_features, 1):
    print(f"  {i:2d}. {feature}")

print("\nData shape:", X_cluster.shape)
X_cluster.head()

### Determine Optimal Number of Clusters

Let's use both the elbow method and silhouette analysis to choose k.

In [None]:
# Scale the features first (CRITICAL!)
scaler = StandardScaler()
X_cluster_scaled = scaler.fit_transform(X_cluster)

print("Testing different values of k...")

# Elbow method and silhouette scores (one fit per k serves both diagnostics)
wcss_values = []
sil_scores = []
k_range = range(2, 21)

for k in k_range:
    kmeans_temp = KMeans(n_clusters=k, random_state=42, n_init=20)
    labels_temp = kmeans_temp.fit_predict(X_cluster_scaled)
    wcss_values.append(kmeans_temp.inertia_)
    sil_scores.append(silhouette_score(X_cluster_scaled, labels_temp))

# Plot both methods
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Elbow plot
axes[0].plot(k_range, wcss_values, marker='o', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Clusters (k)', fontsize=12)
axes[0].set_ylabel('WCSS', fontsize=12)
axes[0].set_title('Elbow Method', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].set_xticks(k_range)

# Silhouette plot
axes[1].plot(k_range, sil_scores, marker='o', linewidth=2,
             markersize=8, color='green')
axes[1].set_xlabel('Number of Clusters (k)', fontsize=12)
axes[1].set_ylabel('Silhouette Score', fontsize=12)
axes[1].set_title('Silhouette Analysis', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
axes[1].set_xticks(k_range)

plt.tight_layout()
plt.show()

print("\nResults for different k values:")
print("k  | WCSS       | Silhouette")
print("-" * 35)
for k, wcss_val, sil_val in zip(k_range, wcss_values, sil_scores):
    print(f"{k:2d} | {wcss_val:10,.0f} | {sil_val:6.3f}")

### Fit Final K-Means Model

Based on the elbow and silhouette analysis, let's choose k=4 for our segmentation.

In [None]:
# Fit final K-Means model with k=4
optimal_k = 4
kmeans_final = KMeans(n_clusters=optimal_k, random_state=42, n_init=20)
customer_data_clean['cluster'] = kmeans_final.fit_predict(X_cluster_scaled)

print(f"K-Means clustering complete with k={optimal_k}")
print(f"\nCluster distribution:")
cluster_counts = customer_data_clean['cluster'].value_counts().sort_index()
for cluster_id, count in cluster_counts.items():
    pct = (count / len(customer_data_clean)) * 100
    print(f"  Cluster {cluster_id}: {count:4d} households ({pct:5.1f}%)")

### Analyze Cluster Profiles

Now for the most important part: understanding what each cluster represents!

In [None]:
# Create cluster profiles using original (unscaled) features
print("\n" + "=" * 80)
print("CLUSTER PROFILES")
print("=" * 80)

# Behavioral characteristics by cluster
behavioral_profiles = customer_data_clean.groupby('cluster').agg({
    'total_spending': 'mean',
    'avg_basket_value': 'mean',
    'num_trips': 'mean',
    'num_unique_products': 'mean',
    'avg_days_between_trips': 'mean',
    'recency_days': 'mean',
    'discount_rate': 'mean',
    'coupon_usage_rate': 'mean'
}).round(2)

behavioral_profiles['count'] = customer_data_clean['cluster'].value_counts().sort_index()

print("\nBehavioral Characteristics:")
print(behavioral_profiles)

# Demographic characteristics by cluster
demo_agg_dict = {
    'household_size_num': 'mean',
    'num_kids': 'mean',
    'is_married': lambda x: f"{x.mean():.1%}",
    'is_homeowner': lambda x: f"{x.mean():.1%}"
}

age_col = col_mapping.get('age', None)
income_col = col_mapping.get('income', None)

if age_col and age_col in customer_data_clean.columns:
    demo_agg_dict[age_col] = lambda x: x.mode()[0] if len(x.mode()) > 0 else 'Mixed'
if income_col and income_col in customer_data_clean.columns:
    demo_agg_dict[income_col] = lambda x: x.mode()[0] if len(x.mode()) > 0 else 'Mixed'

demographic_profiles = customer_data_clean.groupby('cluster').agg(demo_agg_dict).round(1)

print("\nDemographic Characteristics:")
print(demographic_profiles)

# Add more detailed analysis for each cluster
print("\n\nDetailed Segment Descriptions:")
print("=" * 80)

for cluster_id in range(optimal_k):
    cluster_data = customer_data_clean[customer_data_clean['cluster'] == cluster_id]
    
    print(f"\nCluster {cluster_id} (n={len(cluster_data)}):")
    print(f"  Total spending: ${cluster_data['total_spending'].mean():,.0f}")
    print(f"  Avg basket value: ${cluster_data['avg_basket_value'].mean():.2f}")
    print(f"  Shopping trips: {cluster_data['num_trips'].mean():.0f}")
    print(f"  Discount rate: {cluster_data['discount_rate'].mean():.1%}")
    print(f"  Coupon usage: {cluster_data['coupon_usage_rate'].mean():.1%}")
    
    if age_col and age_col in cluster_data.columns:
        print(f"  Dominant age: {cluster_data[age_col].mode()[0] if len(cluster_data[age_col].mode()) > 0 else 'Mixed'}")
    if income_col and income_col in cluster_data.columns:
        print(f"  Dominant income: {cluster_data[income_col].mode()[0] if len(cluster_data[income_col].mode()) > 0 else 'Mixed'}")
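
An alternative route to profiles in original units is to convert the model's centroids back with the scaler's `inverse_transform`. The sketch below, on made-up data with hypothetical names (`toy_households`, `scaler_toy`, `km_toy`), shows that once K-Means has converged this gives the same numbers as grouping the unscaled data by cluster and taking means.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up households: three light shoppers and three heavy shoppers
toy_households = pd.DataFrame({
    'total_spending': [500, 550, 520, 4000, 4200, 4100],
    'num_trips': [10, 12, 11, 80, 85, 82],
})

scaler_toy = StandardScaler()
toy_scaled = scaler_toy.fit_transform(toy_households)
km_toy = KMeans(n_clusters=2, random_state=42, n_init=10).fit(toy_scaled)

# Route 1: undo the scaling on the centroids
centroids_original_units = pd.DataFrame(
    scaler_toy.inverse_transform(km_toy.cluster_centers_),
    columns=toy_households.columns,
)

# Route 2: average the unscaled features within each cluster
group_means = toy_households.groupby(km_toy.labels_).mean()

print("Centroids converted back to original units:")
print(centroids_original_units.round(2))
print("\nPer-cluster means of the unscaled data:")
print(group_means.round(2))
```

The groupby route used in the cell above is more flexible because it can also summarize columns that were not used for clustering, such as the original age and income brackets.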

### Visualize Customer Segments

Let's create visualizations to help interpret the clusters.

In [None]:
# Create a 2D visualization showing behavioral patterns
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Total Spending vs Number of Trips
colors_segments = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4']
for i in range(optimal_k):
    cluster_data = customer_data_clean[customer_data_clean['cluster'] == i]
    axes[0].scatter(cluster_data['num_trips'], cluster_data['total_spending'],
                   alpha=0.6, s=50, color=colors_segments[i],
                   label=f'Cluster {i}', edgecolors='black', linewidths=0.3)

axes[0].set_xlabel('Number of Shopping Trips', fontsize=12)
axes[0].set_ylabel('Total Spending ($)', fontsize=12)
axes[0].set_title('Customer Segments: Spending vs. Frequency', fontsize=14, fontweight='bold')
axes[0].legend(loc='best', fontsize=10)
axes[0].grid(True, alpha=0.3)

# Plot 2: Discount Rate vs Coupon Usage
for i in range(optimal_k):
    cluster_data = customer_data_clean[customer_data_clean['cluster'] == i]
    axes[1].scatter(cluster_data['discount_rate'], cluster_data['coupon_usage_rate'],
                   alpha=0.6, s=50, color=colors_segments[i],
                   label=f'Cluster {i}', edgecolors='black', linewidths=0.3)

axes[1].set_xlabel('Discount Rate', fontsize=12)
axes[1].set_ylabel('Coupon Usage Rate', fontsize=12)
axes[1].set_title('Customer Segments: Discount Sensitivity', fontsize=14, fontweight='bold')
axes[1].legend(loc='best', fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Summary and Key Takeaways

Congratulations! You've completed the clustering chapter examples. Here's what you learned:

### Core Concepts
1. **Unsupervised Learning**: Discovering patterns without labeled outcomes
2. **K-Means Algorithm**: Iteratively assigns points to the nearest centroid, then recomputes each centroid as the mean of its assigned points
3. **Distance Metrics**: How to measure similarity between observations

### Practical Skills
1. **Feature Scaling**: Critical for distance-based algorithms
2. **Choosing k**: Elbow method and silhouette analysis
3. **Feature Engineering**: Creating behavioral features from transactions
4. **Cluster Interpretation**: Translating statistics into business insights

### Alternative Methods
1. **Hierarchical Clustering**: When you want to see nested groupings
2. **DBSCAN**: When clusters have irregular shapes or contain outliers

### Real-World Application
1. **Complete Journey Case Study**: End-to-end customer segmentation
2. **Profile Analysis**: Understanding what makes each segment unique
3. **Visualization**: Communicating findings effectively

---

## Next Steps

Now that you've mastered clustering, try:
1. Experimenting with different values of k
2. Adding or removing features to see how clusters change
3. Trying hierarchical clustering or DBSCAN on the Complete Journey data
4. Applying these techniques to your own datasets

Happy clustering!