# Cluster Profiler Tutorial: Understanding What Makes Clusters Different

This notebook demonstrates how to use the **Cluster Profiler** package to statistically analyze what features characterize different clusters in your data.

## What is Cluster Profiling?

After clustering your data, you often ask: *"What makes each cluster unique?"* 

Cluster Profiler answers this by:
- Testing each feature for significant differences between each cluster and all other clusters combined (one-vs-rest)
- Calculating effect sizes to measure the magnitude of differences
- Correcting for multiple testing to avoid false discoveries
- Ranking features by an importance score that combines effect size with significance, rather than by p-values alone
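
As a rough sketch of the first step, a one-vs-rest comparison for continuous features could look like the following. It uses `scipy.stats.ks_2samp`; the helper name `one_vs_rest_ks` and the toy data are illustrative assumptions, not the package's internals.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def one_vs_rest_ks(df, labels):
    """Hypothetical helper: KS test of each numeric column, each cluster vs. all others."""
    labels = np.asarray(labels)
    rows = []
    for cluster in np.unique(labels):
        in_cluster = labels == cluster
        for feature in df.select_dtypes(include="number").columns:
            result = ks_2samp(df.loc[in_cluster, feature], df.loc[~in_cluster, feature])
            rows.append({"cluster": cluster, "feature": feature,
                         "ks_stat": result.statistic, "p_value": result.pvalue})
    return pd.DataFrame(rows)

# Toy data: one feature shifted between the two clusters, one pure noise
rng = np.random.default_rng(0)
toy_labels = np.repeat([0, 1], 100)
toy = pd.DataFrame({
    "shifted": np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)]),
    "noise": rng.normal(0, 1, 200),
})
ks_table = one_vs_rest_ks(toy, toy_labels)
print(ks_table)
```

Each row of the resulting table is one (cluster, feature) test, which is why a multiple testing correction is needed afterwards.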

Let's see it in action!

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from cluster_profiler import ClusterProfiler

# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")
np.random.seed(42)

## Step 1: Creating a Realistic Test Dataset

We'll create a dataset that mimics real-world scenarios where:
1. **Some features truly differentiate clusters** (meaningful)
2. **Some features are just noise** (random)
3. **Features have different distributions** (normal, skewed, categorical)

This lets us check whether the profiler separates the meaningful features from the random ones.

In [None]:
# Step 1a: Generate base clusters using make_blobs
# Why make_blobs? It creates well-separated clusters in continuous space
X, true_labels = make_blobs(n_samples=300, centers=3, n_features=4, 
                           cluster_std=1.5, random_state=42)

# Create DataFrame with illustrative names (values are blob coordinates, not real-world units)
data = pd.DataFrame(X, columns=['income', 'age', 'education_years', 'experience'])

print(f"Base dataset shape: {data.shape}")
print(f"Number of natural clusters: {len(np.unique(true_labels))}")

In [None]:
# Step 1b: Add a categorical feature that correlates with clusters
# This simulates real scenarios like "customer type" varying by cluster

job_categories = []
for label in true_labels:
    if label == 0:  # Cluster 0: Mostly "Tech" workers
        job_categories.append(np.random.choice(['Tech', 'Finance', 'Healthcare'], p=[0.7, 0.2, 0.1]))
    elif label == 1:  # Cluster 1: Mostly "Finance" workers  
        job_categories.append(np.random.choice(['Finance', 'Tech', 'Healthcare'], p=[0.6, 0.3, 0.1]))
    else:  # Cluster 2: Mostly "Healthcare" workers
        job_categories.append(np.random.choice(['Healthcare', 'Tech', 'Finance'], p=[0.8, 0.1, 0.1]))

data['job_category'] = job_categories

print("Job category distribution by true cluster:")
pd.crosstab(true_labels, data['job_category'], margins=True)

In [None]:
# Step 1c: Add a skewed feature that varies by cluster
# This simulates features like "spending", "response time", etc.

spending_amounts = []
for label in true_labels:
    if label == 0:  # Low spenders
        spending_amounts.append(np.random.exponential(500))  
    elif label == 1:  # Medium spenders
        spending_amounts.append(np.random.exponential(1500))  
    else:  # High spenders
        spending_amounts.append(np.random.exponential(3000))  

data['monthly_spending'] = spending_amounts

print("Spending by cluster - Mean (Std):")
for i in range(3):
    cluster_spending = data.loc[true_labels == i, 'monthly_spending']
    print(f"Cluster {i}: ${cluster_spending.mean():.0f} (${cluster_spending.std():.0f})")
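
The mean and standard deviation this cell prints should come out close to each other within each cluster. That is a property of the exponential distribution, whose mean and standard deviation both equal the `scale` argument. A quick check, using an arbitrarily large sample size:

```python
import numpy as np

rng = np.random.default_rng(42)
for scale in (500, 1500, 3000):
    sample = rng.exponential(scale, size=100_000)
    # Both statistics should land near the scale value
    print(f"scale={scale}: mean={sample.mean():.0f}, std={sample.std():.0f}")
```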

In [None]:
# Step 1d: Add random features (should NOT be significant)
# These test if our profiler correctly identifies noise

data['random_category'] = np.random.choice(['A', 'B', 'C'], size=300)
data['random_normal'] = np.random.normal(50, 10, 300)
data['random_uniform'] = np.random.uniform(0, 100, 300)

print(f"Final dataset shape: {data.shape}")
print("\nFeature types:")
print("- Continuous (meaningful): income, age, education_years, experience, monthly_spending")
print("- Categorical (meaningful): job_category")
print("- Random features: random_category, random_normal, random_uniform")

## Step 2: Visualizing the Data

Let's visualize our clusters and feature distributions to understand what we expect to find.

In [None]:
# Perform clustering (we'll use 3 clusters to match our true structure)
# Setting n_init explicitly keeps results consistent across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X)  # Use only the original blob features

# Add cluster labels to our dataframe
data['cluster'] = cluster_labels

print(f"Cluster sizes: {np.bincount(cluster_labels)}")
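
KMeans assigns cluster IDs arbitrarily, so KMeans cluster 0 is not necessarily the `make_blobs` cluster 0 that was given mostly Tech workers and low spending in Step 1. One way to check the correspondence is a cross-tabulation plus the adjusted Rand index, sketched here on blobs regenerated with the same settings as above (the `_demo` variable names are only for this example):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X_demo, true_demo = make_blobs(n_samples=300, centers=3, n_features=4,
                               cluster_std=1.5, random_state=42)
kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)

# Rows: true blob label, columns: KMeans label.
# When the clustering recovers the blobs, each row has one dominant column.
print(pd.crosstab(true_demo, kmeans_demo, rownames=["true"], colnames=["kmeans"]))

# 1.0 means identical partitions, regardless of how the IDs are numbered
ari = adjusted_rand_score(true_demo, kmeans_demo)
print(f"Adjusted Rand index: {ari:.3f}")
```

Reading the crosstab tells you which KMeans ID corresponds to which generated group when interpreting the profiles later on.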

In [None]:
# Visualize clusters in 2D space
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: True clusters
scatter1 = axes[0].scatter(data['income'], data['age'], c=true_labels, alpha=0.7)
axes[0].set_title('True Clusters (from make_blobs)')
axes[0].set_xlabel('Income')
axes[0].set_ylabel('Age')
plt.colorbar(scatter1, ax=axes[0])

# Plot 2: KMeans clusters
scatter2 = axes[1].scatter(data['income'], data['age'], c=cluster_labels, alpha=0.7)
axes[1].set_title('KMeans Clusters (what we analyze)')
axes[1].set_xlabel('Income')
axes[1].set_ylabel('Age')
plt.colorbar(scatter2, ax=axes[1])

plt.tight_layout()
plt.show()

In [None]:
# Visualize feature distributions by cluster
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Monthly spending (skewed, meaningful)
for cluster in range(3):
    cluster_data = data[data['cluster'] == cluster]['monthly_spending']
    axes[0,0].hist(cluster_data, alpha=0.6, label=f'Cluster {cluster}', bins=20)
axes[0,0].set_title('Monthly Spending Distribution (Meaningful + Skewed)')
axes[0,0].set_xlabel('Spending ($)')
axes[0,0].legend()

# Job category (categorical, meaningful)
job_counts = pd.crosstab(data['cluster'], data['job_category'])
job_counts.plot(kind='bar', ax=axes[0,1])
axes[0,1].set_title('Job Category by Cluster (Meaningful)')
axes[0,1].set_xlabel('Cluster')
axes[0,1].tick_params(axis='x', rotation=0)

# Random normal (should not differ by cluster)
for cluster in range(3):
    cluster_data = data[data['cluster'] == cluster]['random_normal']
    axes[1,0].hist(cluster_data, alpha=0.6, label=f'Cluster {cluster}', bins=20)
axes[1,0].set_title('Random Normal Distribution (Should NOT be meaningful)')
axes[1,0].set_xlabel('Random Value')
axes[1,0].legend()

# Random category (should not differ by cluster)
random_counts = pd.crosstab(data['cluster'], data['random_category'])
random_counts.plot(kind='bar', ax=axes[1,1])
axes[1,1].set_title('Random Category by Cluster (Should NOT be meaningful)')
axes[1,1].set_xlabel('Cluster')
axes[1,1].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

## Step 3: Assessing Feature Skewness

Before profiling, let's check which features are skewed and might benefit from preprocessing.
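
For reference, sample skewness can be computed directly with pandas. The sketch below is not the package's `assess_skewness` implementation; the `|skewness| > 1` cutoff for "highly skewed" is a common rule of thumb and an assumption here, as are the column names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "symmetric": rng.normal(50, 10, 1000),
    "right_skewed": rng.exponential(1500, 1000),
})

skew = demo.skew()  # pandas reports the bias-corrected sample skewness
report = pd.DataFrame({
    "skewness": skew,
    "highly_skewed": skew.abs() > 1.0,  # assumed rule-of-thumb threshold
})
print(report)
```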

In [None]:
# Initialize the profiler
profiler = ClusterProfiler(alpha=0.05)

# Assess skewness of continuous features
skewness_report = profiler.assess_skewness(data)
print("Skewness Assessment:")
print("=" * 60)
print(skewness_report[['feature', 'skewness', 'interpretation', 'recommend_preprocessing']])

In [None]:
# Visualize skewness
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()

continuous_features = ['income', 'age', 'education_years', 'experience', 'monthly_spending', 'random_normal']

for i, feature in enumerate(continuous_features):
    data[feature].hist(bins=30, ax=axes[i], alpha=0.7)
    skew_val = skewness_report[skewness_report['feature'] == feature]['skewness'].iloc[0]
    axes[i].set_title(f'{feature}\nSkewness: {skew_val:.2f}')
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("- monthly_spending is highly right-skewed (typical for spending data)")
print("- Other features are approximately symmetric")
print("- Preprocessing might help with monthly_spending")

## Step 4: Cluster Profiling - The Main Analysis

Now let's profile our clusters to see which features characterize each one.

In [None]:
# Profile clusters without preprocessing first
print("CLUSTER PROFILING WITHOUT PREPROCESSING")
print("=" * 60)

results_raw = profiler.profile_clusters(data.drop('cluster', axis=1), cluster_labels)
profiler.summary()

In [None]:
# Show top results by importance score
print("\nTop 10 Features by Statistical Importance:")
print("=" * 50)
top_results = results_raw.nlargest(10, 'importance_score')
display_cols = ['cluster', 'feature', 'effect_size', 'p_value_corrected', 'importance_score', 'significant_corrected']
print(top_results[display_cols].to_string(index=False))

In [None]:
# Now try with preprocessing for skewed features
print("\nCLUSTER PROFILING WITH YEO-JOHNSON PREPROCESSING")
print("=" * 60)

profiler_prep = ClusterProfiler(alpha=0.05, preprocessing='yeo-johnson')
results_prep = profiler_prep.profile_clusters(data.drop('cluster', axis=1), cluster_labels)
profiler_prep.summary()

In [None]:
# Compare preprocessing effects
print("\nComparison: Effect of Preprocessing on monthly_spending")
print("=" * 55)

cols = ['cluster', 'effect_size', 'p_value_corrected']
spending_raw = results_raw.loc[results_raw['feature'] == 'monthly_spending', cols]
spending_prep = results_prep.loc[results_prep['feature'] == 'monthly_spending', cols]

# Merge on cluster so rows stay aligned even if the two result tables are ordered differently
comparison = spending_raw.merge(spending_prep, on='cluster', suffixes=('_raw', '_preprocessed'))

print(comparison.to_string(index=False))
print("\nNote: Preprocessing can improve effect size detection for skewed features")
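
When reading this comparison, it helps to know that Yeo-Johnson is a strictly increasing transform, and the Kolmogorov-Smirnov statistic depends only on the ordering of the pooled values. If the transform is fitted once on the whole column (an assumption about how the preprocessing is applied), the KS result stays the same while Cohen's d can move. A small check with `scipy.stats.yeojohnson` on synthetic data, with spending expressed in thousands:

```python
import numpy as np
from scipy.stats import ks_2samp, yeojohnson

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
group_a = rng.exponential(0.5, 100)   # synthetic "cluster" values
group_b = rng.exponential(2.0, 200)   # synthetic "rest" values

# Fit one Yeo-Johnson lambda on the pooled column, then split back into groups
pooled_t, lam = yeojohnson(np.concatenate([group_a, group_b]))
a_t, b_t = pooled_t[:100], pooled_t[100:]

ks_raw = ks_2samp(group_a, group_b)
ks_t = ks_2samp(a_t, b_t)
print(f"KS statistic  raw={ks_raw.statistic:.4f}  transformed={ks_t.statistic:.4f}")
print(f"Cohen's d     raw={cohens_d(group_a, group_b):.3f}  transformed={cohens_d(a_t, b_t):.3f}")
```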

## Step 5: Interpreting Results by Cluster

Let's examine what characterizes each cluster specifically.

In [None]:
# Get characteristics for each cluster
print("CLUSTER CHARACTERISTICS (Ranked by Importance Score)")
print("=" * 60)

for cluster_id in range(3):
    print(f"\n🔍 CLUSTER {cluster_id} PROFILE:")
    print("-" * 40)
    
    cluster_chars = profiler.get_cluster_characteristics(
        cluster_id=cluster_id, top_n=8, rank_by='importance_score'
    )
    
    for _, row in cluster_chars.iterrows():
        # Significance markers
        if row['p_value_corrected'] < 0.001:
            sig = "***"
        elif row['p_value_corrected'] < 0.01:
            sig = "**"
        elif row['p_value_corrected'] < 0.05:
            sig = "*"
        else:
            sig = ""
        
        # Effect size interpretation (compare the magnitude: Cohen's d can be negative)
        effect_mag = abs(row['effect_size'])
        if row['effect_size_type'] == 'cohens_d':
            if effect_mag >= 0.8:
                effect_desc = "Large"
            elif effect_mag >= 0.5:
                effect_desc = "Medium"
            elif effect_mag >= 0.2:
                effect_desc = "Small"
            else:
                effect_desc = "Negligible"
        else:  # Cramér's V
            if effect_mag >= 0.5:
                effect_desc = "Large"
            elif effect_mag >= 0.3:
                effect_desc = "Medium"
            elif effect_mag >= 0.1:
                effect_desc = "Small"
            else:
                effect_desc = "Negligible"
        
        print(f"  {row['feature']:18} | Importance: {row['importance_score']:6.2f} | "
              f"Effect: {effect_desc:8} ({row['effect_size']:.3f}) | p={row['p_value_corrected']:.4f} {sig}")
    
    # Show actual values for top features
    top_feature = cluster_chars.iloc[0]['feature']
    cluster_data = data[data['cluster'] == cluster_id]
    other_data = data[data['cluster'] != cluster_id]
    
    if top_feature in data.select_dtypes(include=[np.number]).columns:
        cluster_mean = cluster_data[top_feature].mean()
        other_mean = other_data[top_feature].mean()
        print(f"  📊 {top_feature}: Cluster avg = {cluster_mean:.1f}, Others avg = {other_mean:.1f}")
    else:
        cluster_mode = cluster_data[top_feature].mode().iloc[0]
        cluster_pct = (cluster_data[top_feature] == cluster_mode).mean() * 100
        print(f"  📊 {top_feature}: {cluster_pct:.1f}% are '{cluster_mode}' in this cluster")
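
The thresholds used above follow the usual conventions for Cohen's d and Cramér's V. As a reference for what those numbers measure, here is a minimal sketch of both statistics using their textbook definitions. Whether the package applies bias corrections or a different sign convention is not shown in this notebook, so treat the function names and details as assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cohens_d(cluster_values, other_values):
    """Mean difference divided by the pooled standard deviation (sign shows direction)."""
    a = np.asarray(cluster_values, dtype=float)
    b = np.asarray(other_values, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def cramers_v(in_cluster, categories):
    """Cramér's V from the chi-square statistic of the membership-by-category table."""
    table = pd.crosstab(np.asarray(in_cluster), np.asarray(categories)).to_numpy()
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

# Cluster mean is lower than the rest, so d is negative
d_example = cohens_d([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
# Category is fully determined by cluster membership, so V is at its maximum
v_example = cramers_v([True] * 10 + [False] * 10, ["A"] * 10 + ["B"] * 10)
print(f"Cohen's d: {d_example:.3f}")
print(f"Cramér's V: {v_example:.3f}")
```

Because d carries a sign, it is its magnitude that should be compared against the small, medium, and large cutoffs.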

## Step 6: Understanding the Statistical Approach

Let's explain what the profiler is doing under the hood.

In [None]:
print("STATISTICAL METHODOLOGY EXPLAINED")
print("=" * 50)
print()
print("🔬 TESTS PERFORMED:")
print("  • Continuous features: Kolmogorov-Smirnov test")
print("    - Tests if distributions differ between cluster vs. rest")
print("    - Non-parametric (does not assume normality)")
print()
print("  • Categorical features: Chi-square or Fisher's exact test")
print("    - Tests if category proportions differ between cluster vs. rest")
print("    - Fisher's exact used for small expected frequencies")
print()
print("📏 EFFECT SIZES:")
print("  • Cohen's d (continuous): Standardized difference between means")
print("    - Small: 0.2, Medium: 0.5, Large: 0.8+")
print()
print("  • Cramér's V (categorical): Strength of association")
print("    - Small: 0.1, Medium: 0.3, Large: 0.5+")
print()
print("🔧 MULTIPLE TESTING CORRECTION:")
print("  • Benjamini-Hochberg FDR correction")
print("  • Controls the false discovery rate (less conservative than Bonferroni, so more power with many tests)")
print()
print("⭐ IMPORTANCE SCORE:")
print("  • Formula: effect_size × (-log10(corrected_p_value))")
print("  • Combines statistical significance with practical importance")
print("  • Higher scores = more important for cluster characterization")
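
To make the last two items concrete, here is a minimal NumPy sketch of the Benjamini-Hochberg adjustment followed by the importance score formula printed above. The function name `benjamini_hochberg` and the example numbers are made up for illustration, and taking the absolute value of the effect size is an assumption.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (hypothetical helper, textbook step-up procedure)."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downwards
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0, 1)
    return adjusted

raw_p = np.array([0.001, 0.008, 0.039, 0.041, 0.60])   # made-up raw p-values
effect = np.array([1.2, 0.7, 0.3, -0.25, 0.05])        # made-up effect sizes

p_adj = benjamini_hochberg(raw_p)
importance = np.abs(effect) * -np.log10(p_adj)

for p_raw, p_corr, score in zip(raw_p, p_adj, importance):
    print(f"raw p={p_raw:.3f}  corrected p={p_corr:.5f}  importance={score:.2f}")
```

In this made-up example the raw p-values 0.039 and 0.041 both adjust to 0.05125, so they would pass an uncorrected 0.05 cutoff but not the corrected one.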

## Step 7: Validation - Did We Get Expected Results?

Let's check if our profiler correctly identified meaningful vs. random features.

In [None]:
# Analyze results by feature type
print("VALIDATION: Expected vs. Actual Results")
print("=" * 50)

# Group features by expected significance
meaningful_features = ['income', 'age', 'education_years', 'experience', 'monthly_spending', 'job_category']
random_features = ['random_category', 'random_normal', 'random_uniform']

# Check how many meaningful features are significant
meaningful_results = results_raw[results_raw['feature'].isin(meaningful_features)]
meaningful_significant = meaningful_results['significant_corrected'].sum()
meaningful_total = len(meaningful_results)

# Check how many random features are significant (should be few)
random_results = results_raw[results_raw['feature'].isin(random_features)]
random_significant = random_results['significant_corrected'].sum()
random_total = len(random_results)

print(f"✅ MEANINGFUL FEATURES:")
print(f"   Significant: {meaningful_significant}/{meaningful_total} ({meaningful_significant/meaningful_total*100:.1f}%)")
print(f"   Expected: High percentage (these should differentiate clusters)")
print()
print(f"❌ RANDOM FEATURES:")
print(f"   Significant: {random_significant}/{random_total} ({random_significant/random_total*100:.1f}%)")
print("   Expected: few or none (BH limits false discoveries to roughly 5% of all significant results)")
print()

# Show top meaningful vs random features
print("Top Meaningful Features by Importance:")
top_meaningful = meaningful_results.nlargest(5, 'importance_score')
for _, row in top_meaningful.iterrows():
    print(f"  {row['feature']:18} | Importance: {row['importance_score']:6.2f} | Cluster: {row['cluster']}")

print("\nRandom Features (should have low importance):")
for _, row in random_results.nlargest(3, 'importance_score').iterrows():
    print(f"  {row['feature']:18} | Importance: {row['importance_score']:6.2f} | Cluster: {row['cluster']}")

## Key Takeaways

### ✅ What Cluster Profiler Does Well:
1. **Identifies meaningful features** that truly differentiate clusters
2. **Controls false discoveries** through multiple testing correction
3. **Measures effect sizes** to distinguish statistical from practical significance
4. **Handles different data types** (continuous, categorical, skewed)
5. **Ranks by importance** rather than just p-values

### 🎯 When to Use Cluster Profiler:
- After clustering, to understand what makes each cluster unique
- To check whether clusters also differ on features that were not used to form them (differences on the clustering features themselves are expected by construction)
- To select features for cluster interpretation or downstream analysis
- To compare different clustering approaches

### ⚠️ Important Considerations:
- **Effect size matters more than p-values** for practical importance
- **Preprocessing can help** with highly skewed features
- **Multiple testing correction is crucial** when testing many features
- **Sample size affects results** - larger samples detect smaller effects

### 🚀 Next Steps:
Try the profiler on your own data and see what insights you discover!