# Customer Segmentation - FIXED VERSION - Part 4

## K-Means Clustering & Business Interpretation

This notebook covers optimal cluster selection, clustering, and detailed business insights.

<h2 style="color:darkmagenta;text-align: center; background-color: AliceBlue;padding: 20px;">8. Optimal Cluster Selection</h2><a id="8"></a>

### The Fundamental Question: How Many Clusters?

**The Challenge:**
K-Means requires us to specify K (number of clusters) beforehand.
But how do we know the "right" number?

**Three Methods We'll Use:**

1. **Elbow Method** - Find where improvement plateaus
   - Plot: Inertia (within-cluster sum of squares) vs K
   - Look for: "Elbow" where curve bends
   - Interpretation: After this point, adding clusters doesn't help much

2. **Silhouette Score** - Measure cluster quality
   - Range: -1 to +1
   - Near -1: points are probably assigned to the wrong cluster; near 0: clusters overlap; near +1: clusters are compact and well separated
   - Rule of thumb: >0.5 = good, 0.25-0.5 = weak, <0.25 = no substantial structure

3. **Dendrogram** (Hierarchical clustering preview)
   - Visual: Tree showing how customers group
   - Look for: Natural "cuts" in the tree
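
The first two methods can be sketched on toy data. The snippet below is only an illustration: it uses synthetic blobs from `make_blobs` (an assumption standing in for the real customer features) and scans K from 2 to 6, recording inertia and silhouette for each value.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the customer features: three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=42)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(X)
    inertias[k] = km.inertia_                      # elbow method input
    silhouettes[k] = silhouette_score(X, labels)   # cluster quality

# K with the highest silhouette score
best_k = max(silhouettes, key=silhouettes.get)

for k in inertias:
    print(f"K={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")
print("Best K by silhouette:", best_k)
```

On data this clean the inertia curve bends sharply at the true number of blobs and the silhouette peaks there too. Real customer data is rarely this clear-cut, which is why several methods are combined.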

### Important Note:
These are **guidelines**, not absolute answers!
Final decision should balance:
- Statistical metrics
- Business interpretability
- Actionability

In [None]:
# Configuration for cluster evaluation
K_RANGE = range(CONFIG['N_CLUSTERS_MIN'], CONFIG['N_CLUSTERS_MAX'] + 1)

print("🔍 Evaluating Different Numbers of Clusters")
print("=" * 80)
print(f"Testing K from {min(K_RANGE)} to {max(K_RANGE)}")
print(f"Dataset size: {len(df_final):,} customers")
print(f"Features: {df_final.shape[1]}")
print("\nThis may take a few minutes...")

In [None]:
# Evaluate multiple K values
# This is the most important analysis for choosing K!

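import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)
from tqdm import tqdm  # Progress bar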

# Storage for metrics
evaluation_results = {
    'k': [],
    'inertia': [],
    'silhouette': [],
    'calinski_harabasz': [],
    'davies_bouldin': []
}

# K-Means settings
kmeans_params = {
    'init': 'k-means++',      # Smart initialization (better than random)
    'n_init': 10,             # Run 10 times, pick best
    'max_iter': 300,          # Maximum iterations
    'random_state': CONFIG['RANDOM_STATE']
}

print("\nEvaluating each K value...")
print("=" * 80)

for k in tqdm(K_RANGE, desc="Testing K values"):
    # Fit K-Means
    kmeans = KMeans(n_clusters=k, **kmeans_params)
    labels = kmeans.fit_predict(df_final)
    
    # Calculate metrics
    evaluation_results['k'].append(k)
    evaluation_results['inertia'].append(kmeans.inertia_)
    evaluation_results['silhouette'].append(silhouette_score(df_final, labels))
    evaluation_results['calinski_harabasz'].append(calinski_harabasz_score(df_final, labels))
    evaluation_results['davies_bouldin'].append(davies_bouldin_score(df_final, labels))
    
    # tqdm.write prints without breaking the progress bar
    tqdm.write(f"K={k}: Silhouette={evaluation_results['silhouette'][-1]:.3f}, "
               f"Inertia={evaluation_results['inertia'][-1]:,.0f}")

# Convert to DataFrame
eval_df = pd.DataFrame(evaluation_results)
print("\n✓ Evaluation complete!")
print("\nResults:")
eval_df

### Metric Interpretation Guide

**1. Inertia** (Within-Cluster Sum of Squares)
- What: Sum of squared distances from points to cluster centers
- Lower is better
- Always decreases as K increases
- Look for: Elbow where decrease slows
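
As a quick sanity check of this definition, inertia can be recomputed by hand and compared with the value scikit-learn reports. This is a minimal sketch on synthetic data (the blobs are an assumption, not the notebook's `df_final`).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: three blobs in 2D
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Squared distance from every point to the centroid of its assigned cluster
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = float(((X - assigned_centers) ** 2).sum())

print(f"manual:  {manual_inertia:.4f}")
print(f"sklearn: {km.inertia_:.4f}")
```

The two numbers should agree up to floating-point error.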

**2. Silhouette Score**
- What: How close each point is to its own cluster compared with the nearest other cluster
- Range: -1 to +1
- Higher is better
- >0.5 = Good, 0.25-0.5 = Weak, <0.25 = No structure

**3. Calinski-Harabasz Score**
- What: Ratio of between-cluster to within-cluster variance
- Higher is better
- No fixed range

**4. Davies-Bouldin Score**
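- What: Average similarity between each cluster and the cluster most similar to it (within-cluster scatter relative to the distance between centroids)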
- Lower is better
- 0 = perfect separation
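
To see which direction each score moves, the sketch below compares compact clusters with heavily overlapping ones. The data is synthetic and the helper `score_blobs` is hypothetical, introduced only for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

CENTERS = [[0, 0], [10, 10], [-10, 10]]

def score_blobs(cluster_std):
    """Cluster synthetic blobs with K=3 and return the three quality scores."""
    X, _ = make_blobs(n_samples=300, centers=CENTERS,
                      cluster_std=cluster_std, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    return {
        "silhouette": silhouette_score(X, labels),
        "calinski_harabasz": calinski_harabasz_score(X, labels),
        "davies_bouldin": davies_bouldin_score(X, labels),
    }

tight = score_blobs(cluster_std=1.0)  # compact, well-separated clusters
loose = score_blobs(cluster_std=6.0)  # heavily overlapping clusters

for name in tight:
    print(f"{name:>18}: tight={tight[name]:.3f}  loose={loose[name]:.3f}")
```

Silhouette and Calinski-Harabasz should be higher for the compact case, while Davies-Bouldin should be lower.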

In [None]:
# Visualize evaluation metrics
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Elbow Method - Inertia
ax1 = axes[0, 0]
ax1.plot(eval_df['k'], eval_df['inertia'], 'o-', color='purple', linewidth=2, markersize=8)
ax1.set_xlabel('Number of Clusters (K)', fontsize=12)
ax1.set_ylabel('Inertia (Within-Cluster SS)', fontsize=12)
ax1.set_title('Elbow Method - Inertia', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.set_xticks(K_RANGE)

# Find elbow using KneeLocator
try:
    kl = KneeLocator(eval_df['k'], eval_df['inertia'], 
                     curve='convex', direction='decreasing')
    if kl.elbow:
        ax1.axvline(x=kl.elbow, color='red', linestyle='--', linewidth=2,
                   label=f'Elbow at K={kl.elbow}')
        ax1.legend(fontsize=11)
except Exception:  # Avoid a bare except, which would also swallow KeyboardInterrupt
    print("Could not detect elbow automatically")

# 2. Silhouette Score
ax2 = axes[0, 1]
ax2.plot(eval_df['k'], eval_df['silhouette'], 'o-', color='green', linewidth=2, markersize=8)
ax2.set_xlabel('Number of Clusters (K)', fontsize=12)
ax2.set_ylabel('Silhouette Score', fontsize=12)
ax2.set_title('Silhouette Analysis', fontsize=14, fontweight='bold')
ax2.axhline(y=0.5, color='green', linestyle=':', alpha=0.5, label='Good (>0.5)')
ax2.axhline(y=0.25, color='orange', linestyle=':', alpha=0.5, label='Weak (0.25-0.5)')
ax2.grid(True, alpha=0.3)
ax2.set_xticks(K_RANGE)
# Highlight best silhouette
best_sil_idx = eval_df['silhouette'].idxmax()
best_sil_k = eval_df.loc[best_sil_idx, 'k']
ax2.scatter(best_sil_k, eval_df.loc[best_sil_idx, 'silhouette'], 
           s=200, color='red', zorder=5, label=f'Best: K={best_sil_k}')
ax2.legend(fontsize=10)  # Build the legend last so the 'Best' marker is included

# 3. Calinski-Harabasz Score
ax3 = axes[1, 0]
ax3.plot(eval_df['k'], eval_df['calinski_harabasz'], 'o-', color='blue', linewidth=2, markersize=8)
ax3.set_xlabel('Number of Clusters (K)', fontsize=12)
ax3.set_ylabel('Calinski-Harabasz Score', fontsize=12)
ax3.set_title('Calinski-Harabasz Score (Higher = Better)', fontsize=14, fontweight='bold')
ax3.grid(True, alpha=0.3)
ax3.set_xticks(K_RANGE)

# Highlight best
best_ch_idx = eval_df['calinski_harabasz'].idxmax()
best_ch_k = eval_df.loc[best_ch_idx, 'k']
ax3.scatter(best_ch_k, eval_df.loc[best_ch_idx, 'calinski_harabasz'],
           s=200, color='red', zorder=5, label=f'Best: K={best_ch_k}')
ax3.legend(fontsize=11)

# 4. Davies-Bouldin Score
ax4 = axes[1, 1]
ax4.plot(eval_df['k'], eval_df['davies_bouldin'], 'o-', color='red', linewidth=2, markersize=8)
ax4.set_xlabel('Number of Clusters (K)', fontsize=12)
ax4.set_ylabel('Davies-Bouldin Score', fontsize=12)
ax4.set_title('Davies-Bouldin Score (Lower = Better)', fontsize=14, fontweight='bold')
ax4.grid(True, alpha=0.3)
ax4.set_xticks(K_RANGE)

# Highlight best
best_db_idx = eval_df['davies_bouldin'].idxmin()
best_db_k = eval_df.loc[best_db_idx, 'k']
ax4.scatter(best_db_k, eval_df.loc[best_db_idx, 'davies_bouldin'],
           s=200, color='red', zorder=5, label=f'Best: K={best_db_k}')
ax4.legend(fontsize=11)

plt.tight_layout()
plt.savefig('cluster_evaluation_metrics.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n📊 Metrics visualization saved as 'cluster_evaluation_metrics.png'")

In [None]:
# Summary of recommendations from each metric
print("\n📊 OPTIMAL K RECOMMENDATIONS")
print("=" * 80)

recommendations = {}

# Elbow method
try:
    kl = KneeLocator(eval_df['k'], eval_df['inertia'], 
                     curve='convex', direction='decreasing')
    if kl.elbow:
        recommendations['Elbow Method'] = kl.elbow
        print(f"1. Elbow Method suggests: K = {kl.elbow}")
    else:
        # KneeLocator ran but found no elbow
        print("1. Elbow Method: No clear elbow detected")
except Exception:
    print("1. Elbow Method: No clear elbow detected")

# Silhouette
best_sil_k = eval_df.loc[eval_df['silhouette'].idxmax(), 'k']
best_sil_score = eval_df['silhouette'].max()
recommendations['Silhouette'] = best_sil_k
print(f"2. Silhouette Score suggests: K = {best_sil_k} (score: {best_sil_score:.3f})")

# Calinski-Harabasz
best_ch_k = eval_df.loc[eval_df['calinski_harabasz'].idxmax(), 'k']
recommendations['Calinski-Harabasz'] = best_ch_k
print(f"3. Calinski-Harabasz suggests: K = {best_ch_k}")

# Davies-Bouldin
best_db_k = eval_df.loc[eval_df['davies_bouldin'].idxmin(), 'k']
recommendations['Davies-Bouldin'] = best_db_k
print(f"4. Davies-Bouldin suggests: K = {best_db_k}")

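# Consensus (most frequent recommendation)
# Note: on a tie, most_common returns the value that was added first,
# so check consensus_count before trusting the result
from collections import Counter
most_common = Counter(int(k) for k in recommendations.values()).most_common(1)[0]
consensus_k = most_common[0]
consensus_count = most_common[1]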

print("\n" + "=" * 80)
print(f"📌 CONSENSUS: K = {consensus_k}")
print(f"   ({consensus_count} out of {len(recommendations)} metrics agree)")
print("=" * 80)

# Quality assessment
consensus_sil = eval_df[eval_df['k'] == consensus_k]['silhouette'].values[0]
print(f"\n🎯 Quality Assessment for K = {consensus_k}:")
print(f"   Silhouette Score: {consensus_sil:.3f}")
if consensus_sil > 0.5:
    print("   ✅ GOOD cluster structure")
elif consensus_sil > 0.25:
    print("   ⚠️  WEAK cluster structure (consider if business value exists)")
else:
    print("   ❌ POOR cluster structure (data may not have natural clusters)")

# Set optimal K
OPTIMAL_K = consensus_k
print(f"\n✓ Using K = {OPTIMAL_K} for final clustering")

### Dendrogram Analysis

**What is a Dendrogram?**
- Visual representation of hierarchical clustering
- Shows how customers progressively merge into larger groups
- Height indicates distance between merges

**How to Read:**
- Bottom: Individual customers
- Moving up: Customers group together
- Vertical lines: Cluster merges
- Height: Dissimilarity (higher = more different)

**Finding K:**
- Look for long vertical lines
- Draw horizontal line to "cut" the tree
- Number of intersections = number of clusters
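
Cutting the tree can also be done programmatically. The sketch below uses SciPy's `fcluster` on synthetic blobs (an assumption, not the customer data) and shows two equivalent cuts: one by requested cluster count, one by height.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

# Synthetic stand-in data: three well-separated blobs
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

Z = sch.linkage(X, method="ward")

# Option 1: ask directly for at most 3 flat clusters
labels_k = sch.fcluster(Z, t=3, criterion="maxclust")

# Option 2: cut at a height. Z[:, 2] holds merge heights in increasing order;
# heights[-2] is the merge that would go from 3 clusters to 2, so cutting
# just below it (and above heights[-3]) leaves 3 clusters.
heights = Z[:, 2]
cut_height = (heights[-3] + heights[-2]) / 2
labels_h = sch.fcluster(Z, t=cut_height, criterion="distance")

print("Clusters (maxclust):", len(np.unique(labels_k)))
print("Clusters (height cut):", len(np.unique(labels_h)))
print(f"Last merge heights: {heights[-3]:.1f}, {heights[-2]:.1f}, {heights[-1]:.1f}")
```

A large jump between consecutive merge heights is the numeric counterpart of a long vertical line in the plot.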

In [None]:
# Dendrogram (on subset for visibility)
print("🌳 Hierarchical Clustering Dendrogram")
print("=" * 80)
print("Note: Using 500 customers for visualization clarity")

plt.figure(figsize=(20, 8))

# Use Ward linkage (minimizes within-cluster variance)
linkage_matrix = sch.linkage(
    df_final.sample(n=min(500, len(df_final)),
                    random_state=CONFIG['RANDOM_STATE']),  # Random subset (avoids row-order bias)
    method='ward'
)

# Create dendrogram
dendrogram = sch.dendrogram(
    linkage_matrix,
    truncate_mode='lastp',  # Show only last p merged clusters
    p=30,                   # Show last 30 merges
    leaf_rotation=90,
    leaf_font_size=10,
    show_contracted=True
)

plt.title('Hierarchical Clustering Dendrogram', fontsize=16, fontweight='bold')
plt.xlabel('Cluster Size (or Customer Index)', fontsize=12)
plt.ylabel('Distance (Ward Linkage)', fontsize=12)
plt.axhline(y=plt.ylim()[1] * 0.5, color='red', linestyle='--', 
            linewidth=2, label='Possible cut line')
plt.legend(fontsize=12)
plt.tight_layout()
plt.savefig('dendrogram.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n📊 Dendrogram saved as 'dendrogram.png'")
print("\n🔍 Interpretation:")
print("   - Look for long vertical lines (significant merges)")
print("   - Cutting the dendrogram at different heights gives different K values")
print(f"   - Check whether a natural cut agrees with K = {OPTIMAL_K} from the statistical methods")

### ✅ Optimal K Selection Complete!

**Methods Used:**
1. ✓ Elbow Method (Inertia)
2. ✓ Silhouette Score
3. ✓ Calinski-Harabasz Score
4. ✓ Davies-Bouldin Score
5. ✓ Dendrogram Analysis

**Selected:** K = `OPTIMAL_K` (the consensus value printed by the recommendation cell)

**Next:** Apply K-Means with optimal K and interpret business meaning