# Lab 3: Clustering Analysis - Interactive Notebook
## Customer Segmentation for Marketing

### Learning Objectives

1. Implement K-Means clustering
2. Use hierarchical clustering with dendrograms
3. Apply DBSCAN for outlier detection
4. Evaluate clustering quality
5. Generate business insights

**Estimated Time:** 3-4 hours

---

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import cdist
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
print('✅ Libraries loaded successfully!')

## Part 1: Load and Explore Data

We'll create a synthetic customer dataset for this lab.

In [None]:
# Generate synthetic customer data
np.random.seed(42)

n_customers = 200

# Create different customer segments
# Segment 1: Young, low income, high spending (40 customers)
seg1_age = np.random.randint(18, 30, 40)
seg1_income = np.random.randint(15, 40, 40)
seg1_spending = np.random.randint(60, 100, 40)

# Segment 2: Middle-aged, high income, high spending (60 customers)
seg2_age = np.random.randint(35, 55, 60)
seg2_income = np.random.randint(70, 140, 60)
seg2_spending = np.random.randint(70, 100, 60)

# Segment 3: Older, moderate income, low spending (50 customers)
seg3_age = np.random.randint(50, 70, 50)
seg3_income = np.random.randint(40, 80, 50)
seg3_spending = np.random.randint(1, 40, 50)

# Segment 4: Young, high income, moderate spending (50 customers)
seg4_age = np.random.randint(25, 40, 50)
seg4_income = np.random.randint(80, 130, 50)
seg4_spending = np.random.randint(40, 70, 50)

# Combine all segments
df = pd.DataFrame({
    'CustomerID': range(1, n_customers + 1),
    'Gender': np.random.choice(['Male', 'Female'], n_customers),
    'Age': np.concatenate([seg1_age, seg2_age, seg3_age, seg4_age]),
    'Annual Income (k$)': np.concatenate([seg1_income, seg2_income, seg3_income, seg4_income]),
    'Spending Score (1-100)': np.concatenate([seg1_spending, seg2_spending, seg3_spending, seg4_spending])
})

# Shuffle the data
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(f'Dataset created with {len(df)} customers')
print(f'\nFirst 5 rows:')
df.head()

### 📝 Task 1: Exploratory Data Analysis

In [None]:
# TODO: Display dataset info
# YOUR CODE HERE


# TODO: Check for missing values
# YOUR CODE HERE


# TODO: Calculate descriptive statistics
# YOUR CODE HERE
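
One possible way to complete the three TODOs above, shown as a minimal sketch. The small `demo_df` frame is a hypothetical stand-in (same column names, made-up rows) so the snippet runs on its own; in the notebook, the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the lab's `df` (same column names, made-up rows)
demo_df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Age': [23, 45, 61, 30],
    'Annual Income (k$)': [20, 110, 55, 95],
    'Spending Score (1-100)': [85, 90, 15, 50],
})

# Dataset info: column dtypes and non-null counts
demo_df.info()

# Missing values per column
missing_counts = demo_df.isnull().sum()
print(missing_counts)

# Descriptive statistics for the numeric columns only
summary = demo_df.describe()
print(summary)
```

`describe()` skips the `Gender` column by default because it is not numeric; `demo_df['Gender'].value_counts()` is one way to summarize it separately.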


### 📊 Task 2: Visualize Feature Distributions

In [None]:
# TODO: Create scatter plot of Income vs Spending Score
plt.figure(figsize=(10, 6))
# YOUR CODE HERE
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Income vs Spending')
plt.show()

In [None]:
# Solution
plt.figure(figsize=(10, 6))
plt.scatter(df['Annual Income (k$)'], df['Spending Score (1-100)'], 
           alpha=0.6, s=50, edgecolors='black')
plt.xlabel('Annual Income (k$)', fontsize=12)
plt.ylabel('Spending Score (1-100)', fontsize=12)
plt.title('Customer Income vs Spending', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.show()

## Part 2: K-Means Clustering

### Step 1: Prepare Data for Clustering

In [None]:
# Select features for clustering
X = df[['Annual Income (k$)', 'Spending Score (1-100)']].values

# TODO: Scale the features using StandardScaler
scaler = StandardScaler()
X_scaled = # YOUR CODE HERE

print(f'Data scaled. Shape: {X_scaled.shape}')
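
A minimal sketch of the scaling step, assuming the usual `fit_transform` call on `StandardScaler`. `X_demo` and `scaler_demo` are hypothetical stand-ins so the snippet runs by itself; in the notebook the equivalent line would be `X_scaled = scaler.fit_transform(X)`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X: columns are income (k$) and spending score
X_demo = np.array([[20, 85], [110, 90], [55, 15], [95, 50]], dtype=float)

scaler_demo = StandardScaler()
# fit_transform learns each column's mean and standard deviation,
# then rescales the column to zero mean and unit variance
X_demo_scaled = scaler_demo.fit_transform(X_demo)

print(X_demo_scaled.mean(axis=0))  # approximately 0 for each column
print(X_demo_scaled.std(axis=0))   # approximately 1 for each column
```

Scaling matters here because K-Means relies on Euclidean distance, so a feature with a wider numeric range would otherwise dominate the clustering.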

### Step 2: Find Optimal Number of Clusters (Elbow Method)

In [None]:
# TODO: Calculate WCSS for k=1 to k=10
wcss = []
K_range = range(1, 11)

for k in K_range:
    # YOUR CODE HERE: Fit KMeans and append inertia to wcss
    pass

# Plot elbow curve
plt.figure(figsize=(10, 6))
plt.plot(K_range, wcss, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('WCSS')
plt.title('Elbow Method For Optimal k')
plt.grid(alpha=0.3)
plt.show()

In [None]:
# Solution
wcss = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(K_range, wcss, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (k)', fontsize=12)
plt.ylabel('Within-Cluster Sum of Squares (WCSS)', fontsize=12)
plt.title('Elbow Method For Optimal k', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.axvline(5, color='red', linestyle='--', label='k=5 (used in this lab)')
plt.legend()
plt.show()

print('This lab proceeds with k=5; check that choice against the bend in the curve above')
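
The bend in an elbow curve can be hard to judge by eye. As a hedged cross-check, the sketch below scores each candidate k with the silhouette coefficient and picks the highest. The data here are a hypothetical stand-in (four well-separated blobs), not the lab's dataset; in the notebook the loop would run over `X_scaled`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in data: four tight, well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[-5, -5], [-5, 5], [5, -5], [5, 5]], dtype=float)
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Silhouette needs at least 2 clusters, so start at k=2
sil_by_k = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    sil_by_k[k] = silhouette_score(X_demo, labels)

best_k = max(sil_by_k, key=sil_by_k.get)
for k, score in sil_by_k.items():
    print(f'k={k}: silhouette={score:.3f}')
print(f'Highest silhouette at k={best_k}')
```

If the elbow and the silhouette disagree on real data, that is a signal to inspect the cluster plots for both candidates rather than to trust either number alone.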

### Step 3: Apply K-Means with Optimal k

In [None]:
# TODO: Fit K-Means with k=5
optimal_k = 5
kmeans = # YOUR CODE HERE

# Get cluster labels
df['Cluster'] = kmeans.labels_

print(f'Clustering complete with {optimal_k} clusters')
print(f'\nCluster distribution:')
print(df['Cluster'].value_counts().sort_index())
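
A minimal sketch of the fitting step, reusing the `KMeans` settings from the elbow solution (`random_state=42`, `n_init=10`). `X_demo_scaled` and `kmeans_demo` are hypothetical stand-ins; in the notebook the model would be fitted on `X_scaled` and assigned to `kmeans`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the scaled feature matrix
rng = np.random.default_rng(0)
X_demo_scaled = rng.normal(size=(100, 2))

optimal_k = 5
# Create the model and fit it; labels_ then holds one cluster id per row
kmeans_demo = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
kmeans_demo.fit(X_demo_scaled)

print(kmeans_demo.labels_[:10])
print(kmeans_demo.cluster_centers_.shape)
```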

### Step 4: Visualize Clusters

In [None]:
# TODO: Create scatter plot with cluster colors
plt.figure(figsize=(12, 8))

# Plot each cluster with different color
for cluster in range(optimal_k):
    # YOUR CODE HERE
    pass

# Plot centroids
centroids = kmeans.cluster_centers_
# YOUR CODE HERE

plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segments - K-Means Clustering')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
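
One way the plotting TODOs might be filled in, as a sketch on hypothetical stand-in data. The point worth noting is that `cluster_centers_` are in scaled units because the model was fitted on scaled data, so the sketch maps them back with `inverse_transform` before drawing them on axes labelled in k$ and score points.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: income (k$) and spending score
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(15, 140, 100), rng.uniform(1, 100, 100)])

scaler_demo = StandardScaler()
X_demo_scaled = scaler_demo.fit_transform(X_demo)
k_demo = 5
kmeans_demo = KMeans(n_clusters=k_demo, random_state=42, n_init=10).fit(X_demo_scaled)
labels = kmeans_demo.labels_

fig, ax = plt.subplots(figsize=(12, 8))

# One scatter call per cluster so each gets its own color and legend entry
for cluster in range(k_demo):
    mask = labels == cluster
    ax.scatter(X_demo[mask, 0], X_demo[mask, 1],
               s=50, alpha=0.6, label=f'Cluster {cluster}')

# Centroids are in scaled space; convert them back to original units
centroids_orig = scaler_demo.inverse_transform(kmeans_demo.cluster_centers_)
ax.scatter(centroids_orig[:, 0], centroids_orig[:, 1],
           c='black', marker='X', s=200, label='Centroids')

ax.set_xlabel('Annual Income (k$)')
ax.set_ylabel('Spending Score (1-100)')
ax.set_title('Customer Segments - K-Means Clustering')
ax.legend()
ax.grid(alpha=0.3)
plt.show()
```

In the notebook, the unscaled points would come from `X` (or the two `df` columns) and the labels from `df['Cluster']`.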

### Step 5: Evaluate Clustering Quality

In [None]:
# TODO: Calculate evaluation metrics
silhouette = # YOUR CODE HERE
davies_bouldin = # YOUR CODE HERE
calinski_harabasz = # YOUR CODE HERE

print('Clustering Evaluation Metrics:')
print(f'Silhouette Score: {silhouette:.4f}')
print(f'Davies-Bouldin Index: {davies_bouldin:.4f}')
print(f'Calinski-Harabasz Index: {calinski_harabasz:.4f}')
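
A sketch of the three metric calls, each of which takes the feature matrix and the cluster labels. The blob data are a hypothetical stand-in; in the notebook the arguments would be `X_scaled` and `kmeans.labels_`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Hypothetical stand-in data: four tight, well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[-5, -5], [-5, 5], [5, -5], [5, 5]], dtype=float)
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X_demo)

# Silhouette: between -1 and 1, higher means tighter, better-separated clusters
silhouette = silhouette_score(X_demo, labels)
# Davies-Bouldin: 0 or greater, lower is better
davies_bouldin = davies_bouldin_score(X_demo, labels)
# Calinski-Harabasz: ratio of between- to within-cluster dispersion, higher is better
calinski_harabasz = calinski_harabasz_score(X_demo, labels)

print(f'Silhouette Score: {silhouette:.4f}')
print(f'Davies-Bouldin Index: {davies_bouldin:.4f}')
print(f'Calinski-Harabasz Index: {calinski_harabasz:.4f}')
```

The three metrics have different scales and directions, so they are most useful for comparing alternative clusterings of the same data rather than as absolute grades.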

## Part 3: Business Insights

### Analyze Each Customer Segment

In [None]:
# Segment profiling
print('CUSTOMER SEGMENT PROFILES')
print('=' * 80)

for cluster in range(optimal_k):
    cluster_data = df[df['Cluster'] == cluster]
    
    print(f'\nCluster {cluster}:')
    print(f'  Size: {len(cluster_data)} customers ({len(cluster_data)/len(df)*100:.1f}%)')
    print(f'  Avg Age: {cluster_data["Age"].mean():.1f} years')
    print(f'  Avg Income: ${cluster_data["Annual Income (k$)"].mean():.1f}k')
    print(f'  Avg Spending Score: {cluster_data["Spending Score (1-100)"].mean():.1f}/100')
    print(f'  Gender distribution:')
    print(f'    {cluster_data["Gender"].value_counts().to_dict()}')
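
As an alternative to the printed profiles, the same averages can be collected into a single table with `groupby`, which may make the clusters easier to compare side by side. This is a sketch on a hypothetical `demo_df`; the output column names (`size`, `avg_age`, and so on) are illustrative choices, not part of the lab.

```python
import pandas as pd

# Hypothetical stand-in for the labelled `df`
demo_df = pd.DataFrame({
    'Cluster': [0, 0, 1, 1],
    'Age': [20, 30, 50, 60],
    'Annual Income (k$)': [20, 30, 100, 120],
    'Spending Score (1-100)': [80, 90, 20, 10],
})

# Named aggregation: one row per cluster, one column per summary statistic
profile = demo_df.groupby('Cluster').agg(
    size=('Age', 'size'),
    avg_age=('Age', 'mean'),
    avg_income=('Annual Income (k$)', 'mean'),
    avg_spending=('Spending Score (1-100)', 'mean'),
)
# Share of all customers that falls in each cluster, in percent
profile['share_pct'] = 100 * profile['size'] / len(demo_df)

print(profile.round(1))
```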

---

# 🎯 Your Task: Name the Segments

Based on the characteristics, give each cluster a meaningful business name:

**Cluster 0:** _Your name here_

**Cluster 1:** _Your name here_

**Cluster 2:** _Your name here_

**Cluster 3:** _Your name here_

**Cluster 4:** _Your name here_

---

# 📝 Summary

## What You Learned

✅ Load and explore customer data
✅ Implement K-Means clustering
✅ Use elbow method to find optimal k
✅ Evaluate clustering quality
✅ Generate business insights
✅ Visualize customer segments

## Next Steps

1. Try hierarchical clustering
2. Experiment with DBSCAN
3. Add more features (e.g., Age)
4. Create marketing strategies for each segment
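
A starting point for the first two steps, as a hedged sketch on hypothetical stand-in data (three blobs plus one isolated point). The `eps` and `min_samples` values are assumptions chosen for this toy data and would need tuning on the scaled customer features.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import AgglomerativeClustering, DBSCAN

# Hypothetical stand-in data: three tight blobs and one far-away point
rng = np.random.default_rng(0)
centers = np.array([[-5, -5], [5, 5], [5, -5]], dtype=float)
blobs = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])
outlier = np.array([[0.0, 20.0]])
X_demo = np.vstack([blobs, outlier])

# Hierarchical clustering: merge points bottom-up using Ward linkage
agg_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(blobs)

# Linkage matrix for a dendrogram: one row per merge, n - 1 rows in total
Z = linkage(blobs, method='ward')

# DBSCAN: density-based; points that belong to no dense region get label -1
db_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X_demo)
n_clusters_db = len(set(db_labels.tolist()) - {-1})
n_noise = int(np.sum(db_labels == -1))

print(f'Agglomerative cluster sizes: {np.bincount(agg_labels)}')
print(f'Linkage matrix shape: {Z.shape}')
print(f'DBSCAN clusters: {n_clusters_db}, noise points: {n_noise}')
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the merge tree. Unlike K-Means, DBSCAN does not take the number of clusters as input, which is why it can flag the isolated point as noise instead of forcing it into a cluster.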

**Excellent work! 🎉**