# Customer Segmentation - FIXED VERSION - Part 5

## Final Clustering, Business Interpretation & Production Deployment

This is the final notebook with cluster interpretation and actionable business insights.

<h2 style="color:darkmagenta;text-align: center; background-color: AliceBlue;padding: 20px;">9. Final K-Means Clustering</h2><a id="9"></a>

### Apply K-Means with Optimal K

Now that we've determined the optimal number of clusters, let's:
1. Train the final K-Means model
2. Assign cluster labels to customers
3. Validate the results
4. Interpret business meaning

In [None]:
# Train final K-Means model
print(f"🎯 Training Final K-Means Model with K = {OPTIMAL_K}")
print("=" * 80)

# Create and fit model
final_kmeans = KMeans(
    n_clusters=OPTIMAL_K,
    init='k-means++',
    n_init=20,  # More runs for final model
    max_iter=500,
    random_state=CONFIG['RANDOM_STATE'],
    verbose=0
)

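# Fit and predict on feature columns only, so the cell stays safe to re-run
# after 'Cluster' has been added to df_final
feature_cols = [c for c in df_final.columns if c != 'Cluster']
cluster_labels = final_kmeans.fit_predict(df_final[feature_cols])

# Add labels to dataframe
df_final['Cluster'] = cluster_labels

# Calculate final metrics
final_silhouette = silhouette_score(df_final[feature_cols], cluster_labels)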
final_inertia = final_kmeans.inertia_

print(f"\n✓ Model trained successfully!")
print(f"\nFinal Metrics:")
print(f"  Silhouette Score: {final_silhouette:.3f}")
print(f"  Inertia: {final_inertia:,.2f}")
print(f"  Iterations to converge: {final_kmeans.n_iter_}")

# Cluster size distribution
print(f"\nCluster Sizes:")
cluster_sizes = df_final['Cluster'].value_counts().sort_index()
for cluster, size in cluster_sizes.items():
    pct = (size / len(df_final)) * 100
    print(f"  Cluster {cluster}: {size:,} customers ({pct:.1f}%)")
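
The silhouette score above is one way to validate the result. A complementary check is whether the cluster assignment is stable when only the random seed changes. The sketch below is illustrative only: it runs on synthetic data from `make_blobs` rather than on `df_final`, and `seed_stability` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the scaled feature matrix (assumption: not the real data)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)


def seed_stability(X, k, seeds=(0, 1, 2, 3, 4)):
    """Return pairwise adjusted Rand scores between K-Means runs with different seeds."""
    labelings = [
        KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=s).fit_predict(X)
        for s in seeds
    ]
    scores = []
    for i in range(len(labelings)):
        for j in range(i + 1, len(labelings)):
            # ARI ignores label permutations, so cluster ids need not match
            scores.append(adjusted_rand_score(labelings[i], labelings[j]))
    return np.array(scores)


ari_scores = seed_stability(X_demo, k=4)
print(f"Mean ARI across seeds: {ari_scores.mean():.3f} (min {ari_scores.min():.3f})")
```

Values close to 1 suggest the segmentation does not depend much on initialization. Clearly lower values would be a reason to revisit K or the feature set before naming segments.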

In [None]:
# Visualize cluster sizes
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
colors = plt.cm.Set3(range(OPTIMAL_K))
cluster_sizes.plot(kind='bar', color=colors, edgecolor='black')
plt.title('Customer Distribution Across Clusters', fontsize=14, fontweight='bold')
plt.xlabel('Cluster', fontsize=12)
plt.ylabel('Number of Customers', fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(cluster_sizes):
    plt.text(i, v + len(df_final)*0.01, f'{v:,}\n({v/len(df_final)*100:.1f}%)', 
             ha='center', fontsize=10)

plt.subplot(1, 2, 2)
cluster_sizes.plot(kind='pie', autopct='%1.1f%%', colors=colors, startangle=90)
plt.title('Cluster Size Distribution', fontsize=14, fontweight='bold')
plt.ylabel('')

plt.tight_layout()
plt.savefig('cluster_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n💡 Insights:")
largest_cluster = cluster_sizes.idxmax()
smallest_cluster = cluster_sizes.idxmin()
print(f"  - Largest cluster: Cluster {largest_cluster} ({cluster_sizes[largest_cluster]:,} customers)")
print(f"  - Smallest cluster: Cluster {smallest_cluster} ({cluster_sizes[smallest_cluster]:,} customers)")

# Check for very unbalanced clusters
max_pct = (cluster_sizes.max() / len(df_final)) * 100
if max_pct > 50:
    print(f"\n  ⚠️  Warning: Largest cluster contains {max_pct:.1f}% of customers")
    print(f"      This may indicate a dominant customer type or a need for re-clustering")
else:
    print(f"\n  ✓ Clusters are reasonably balanced")
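
Cluster size alone does not say whether a cluster is cohesive. A per-cluster silhouette breakdown can show which clusters are weakly separated. This is a minimal sketch on synthetic data (`X_demo` and `labels_demo` are placeholders, not the notebook's variables), using scikit-learn's `silhouette_samples`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in data (assumption: replace with the scaled features and labels)
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)

# One silhouette value per sample, then averaged within each cluster
sample_scores = silhouette_samples(X_demo, labels_demo)
per_cluster = {
    int(c): float(sample_scores[labels_demo == c].mean())
    for c in np.unique(labels_demo)
}

for c, s in per_cluster.items():
    n = int((labels_demo == c).sum())
    print(f"Cluster {c}: n={n}, mean silhouette={s:.3f}")
```

A cluster whose mean silhouette sits well below the others is a candidate for closer inspection before it receives a business name.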

<h2 style="color:darkmagenta;text-align: center; background-color: AliceBlue;padding: 20px;">10. Cluster Interpretation & Business Insights</h2><a id="10"></a>

### 🎯 The Most Important Part!

**Technical clustering is only half the job.**
The real value comes from:
1. Understanding WHAT each cluster represents
2. Naming clusters with business meaning
3. Providing actionable recommendations

### Our Approach:
1. **Statistical Profile** - What are the cluster's characteristics?
2. **Business Interpretation** - What type of customers are these?
3. **Marketing Strategy** - How should the bank treat them?
4. **Value Assessment** - Which segments are most valuable?

In [None]:
# Create cluster profiles with ORIGINAL (unscaled) values
# This makes business interpretation easier!

print("📊 Creating Cluster Profiles with Original Values")
print("=" * 80)

# Add cluster labels to original customer data
customer_df_clustered = customer_df.iloc[df_final.index].copy()
customer_df_clustered['Cluster'] = cluster_labels

# Calculate cluster statistics
cluster_profiles = customer_df_clustered.groupby('Cluster').agg({
    'Recency': ['mean', 'median', 'std'],
    'Frequency': ['mean', 'median', 'std'],
    'MonetaryTotal': ['mean', 'median', 'std', 'sum'],
    'MonetaryAvg': ['mean', 'median'],
    'AccountBalance': ['mean', 'median'],
    'Age': ['mean', 'median'],
    'CustomerID': 'count'  # Cluster size
}).round(2)

# Rename count column
cluster_profiles.columns = ['_'.join(col).strip('_') for col in cluster_profiles.columns.values]
cluster_profiles.rename(columns={'CustomerID_count': 'Size'}, inplace=True)

print("\nCluster Statistical Profiles:")
cluster_profiles

In [None]:
# Create detailed profile for each cluster
print("\n" + "=" * 100)
print("📋 DETAILED CLUSTER PROFILES")
print("=" * 100)

for cluster_id in range(OPTIMAL_K):
    cluster_data = customer_df_clustered[customer_df_clustered['Cluster'] == cluster_id]
    size = len(cluster_data)
    pct = (size / len(customer_df_clustered)) * 100
    
    print(f"\n{'='*100}")
    print(f"CLUSTER {cluster_id}")
    print(f"{'='*100}")
    print(f"Size: {size:,} customers ({pct:.1f}% of total)")
    print(f"\nKey Characteristics:")
    
    # RFM Profile
    print(f"\n  💳 RFM Profile:")
    print(f"     Recency (avg): {cluster_data['Recency'].mean():.1f} days")
    print(f"     Frequency (avg): {cluster_data['Frequency'].mean():.1f} transactions")
    print(f"     Monetary Total (avg): ₹{cluster_data['MonetaryTotal'].mean():,.0f}")
    print(f"     Monetary Per Transaction: ₹{cluster_data['MonetaryAvg'].mean():,.0f}")
    
    # Financial Profile
    print(f"\n  💰 Financial Profile:")
    print(f"     Avg Account Balance: ₹{cluster_data['AccountBalance'].mean():,.0f}")
    print(f"     Total Revenue from Segment: ₹{cluster_data['MonetaryTotal'].sum():,.0f}")
    
    # Demographics
    print(f"\n  👥 Demographics:")
    print(f"     Average Age: {cluster_data['Age'].mean():.1f} years")
    gender_dist = cluster_data['Gender'].value_counts()
    for gender, count in gender_dist.items():
        print(f"     {gender}: {count:,} ({count/size*100:.1f}%)")
    
    # Comparison to overall average
    print(f"\n  📊 Relative to Overall Average:")
    
    avg_recency = customer_df_clustered['Recency'].mean()
    avg_frequency = customer_df_clustered['Frequency'].mean()
    avg_monetary = customer_df_clustered['MonetaryTotal'].mean()
    avg_balance = customer_df_clustered['AccountBalance'].mean()
    
    recency_diff = ((cluster_data['Recency'].mean() - avg_recency) / avg_recency) * 100
    frequency_diff = ((cluster_data['Frequency'].mean() - avg_frequency) / avg_frequency) * 100
    monetary_diff = ((cluster_data['MonetaryTotal'].mean() - avg_monetary) / avg_monetary) * 100
    balance_diff = ((cluster_data['AccountBalance'].mean() - avg_balance) / avg_balance) * 100
    
    print(f"     Recency: {recency_diff:+.1f}% ({'More dormant' if recency_diff > 0 else 'More active'})")
    print(f"     Frequency: {frequency_diff:+.1f}% ({'More frequent' if frequency_diff > 0 else 'Less frequent'})")
    print(f"     Monetary: {monetary_diff:+.1f}% ({'Higher value' if monetary_diff > 0 else 'Lower value'})")
    print(f"     Balance: {balance_diff:+.1f}% ({'Wealthier' if balance_diff > 0 else 'Less wealthy'})")

print(f"\n{'='*100}")

In [None]:
# Visualize cluster profiles - Radar Chart
print("\n📊 Creating Radar Chart for Cluster Comparison")

# Normalize features to 0-1 range for visualization
from sklearn.preprocessing import MinMaxScaler
viz_scaler = MinMaxScaler()

viz_features = ['Recency', 'Frequency', 'MonetaryTotal', 'AccountBalance', 'Age']
cluster_means = customer_df_clustered.groupby('Cluster')[viz_features].mean()
cluster_means_scaled = pd.DataFrame(
    viz_scaler.fit_transform(cluster_means),
    columns=viz_features,
    index=cluster_means.index
)

# Note: For Recency, lower is better, so we invert it for visualization
cluster_means_scaled['Recency'] = 1 - cluster_means_scaled['Recency']
cluster_means_scaled.rename(columns={'Recency': 'Recency (inverted)'}, inplace=True)

# Create radar chart using Plotly
fig = go.Figure()

colors = ['purple', 'green', 'blue', 'red', 'orange', 'pink', 'brown', 'gray']

for idx, cluster_id in enumerate(cluster_means_scaled.index):
    fig.add_trace(go.Scatterpolar(
        r=cluster_means_scaled.loc[cluster_id].values,
        theta=cluster_means_scaled.columns,
        fill='toself',
        name=f'Cluster {cluster_id}',
        line_color=colors[idx % len(colors)],
        opacity=0.6
    ))

fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, 1]
        )),
    showlegend=True,
    title="Cluster Profiles - Radar Chart (Normalized Values)",
    font=dict(size=12)
)

fig.write_html('cluster_radar_chart.html')
fig.show()

print("\n✓ Interactive radar chart saved as 'cluster_radar_chart.html'")
print("\n💡 Interpretation:")
print("   - Larger area = higher normalized values on most axes (Age is descriptive, not 'better')")
print("   - Different shapes = Different customer types")
print("   - Recency inverted: larger = more recent (better)")

In [None]:
# Heatmap of cluster characteristics
plt.figure(figsize=(12, 8))

# Use original values for heatmap
heatmap_data = customer_df_clustered.groupby('Cluster')[viz_features].mean()

# Normalize each column to 0-1 for color scale
heatmap_normalized = (heatmap_data - heatmap_data.min()) / (heatmap_data.max() - heatmap_data.min())

sns.heatmap(heatmap_normalized.T, 
            annot=heatmap_data.T.round(0),  # Show original values
            fmt='g',
            cmap='YlOrRd',
            cbar_kws={'label': 'Normalized Value (0-1)'},
            linewidths=1,
            linecolor='white')

plt.title('Cluster Profiles Heatmap\n(Colors = normalized, Numbers = actual values)', 
          fontsize=14, fontweight='bold', pad=20)
plt.xlabel('Cluster', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.xticks(rotation=0)
plt.yticks(rotation=0)
plt.tight_layout()
plt.savefig('cluster_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n📊 Heatmap interpretation:")
print("   - Dark red: High values for that feature")
print("   - Orange: Medium values")
print("   - Light yellow: Low values")
print("   - Quickly identify cluster specializations")

### 🎯 Business Segment Naming & Strategy

Based on the cluster profiles above, let's give each segment a **business name** and **marketing strategy**.

**Note:** The specific names will depend on your actual cluster characteristics.
Below is a template - adjust based on your results!

In [None]:
# Define business segments based on cluster analysis
# YOU SHOULD CUSTOMIZE THESE based on your actual cluster characteristics!

segment_definitions = {
    0: {
        'name': 'Champions',
        'description': 'High value, frequent, recent customers',
        'strategy': 'VIP treatment, exclusive offers, retention focus',
        'priority': 'HIGHEST',
        'actions': [
            'Assign dedicated relationship manager',
            'Offer premium products (wealth management, investment)',
            'Early access to new features',
            'Referral incentives'
        ]
    },
    1: {
        'name': 'Loyal Customers',
        'description': 'Regular customers with moderate spending',
        'strategy': 'Upsell opportunities, increase transaction value',
        'priority': 'HIGH',
        'actions': [
            'Targeted product recommendations',
            'Loyalty rewards program',
            'Cross-sell financial products',
            'Encourage higher-value transactions'
        ]
    },
    2: {
        'name': 'Potential Loyalists',
        'description': 'Recent customers with growth potential',
        'strategy': 'Nurture and develop relationship',
        'priority': 'MEDIUM',
        'actions': [
            'Onboarding campaigns',
            'Educational content about products',
            'Engagement incentives',
            'Build trust and increase frequency'
        ]
    },
    3: {
        'name': 'At Risk',
        'description': 'Previously active but now dormant',
        'strategy': 'Re-activation campaigns',
        'priority': 'MEDIUM',
        'actions': [
            'Win-back offers',
            'Survey to understand issues',
            'Special promotions',
            'Personalized communication'
        ]
    },
    4: {
        'name': 'Low Value',
        'description': 'Low frequency and monetary value',
        'strategy': 'Minimal investment, automation',
        'priority': 'LOW',
        'actions': [
            'Automated marketing only',
            'Self-service channels',
            'Low-cost products',
            'Monitor for upgrade potential'
        ]
    }
}

# YOU MUST ADJUST THE ABOVE based on your actual cluster analysis!
# Look at the cluster profiles and assign appropriate names.

print("\n" + "="*100)
print("🎯 BUSINESS SEGMENT DEFINITIONS & STRATEGIES")
print("="*100)

for cluster_id in range(OPTIMAL_K):
    if cluster_id in segment_definitions:
        seg = segment_definitions[cluster_id]
        cluster_size = len(customer_df_clustered[customer_df_clustered['Cluster'] == cluster_id])
        cluster_revenue = customer_df_clustered[customer_df_clustered['Cluster'] == cluster_id]['MonetaryTotal'].sum()
        
        print(f"\n{'='*100}")
        print(f"CLUSTER {cluster_id}: {seg['name'].upper()}")
        print(f"{'='*100}")
        print(f"Priority: {seg['priority']}")
        print(f"Size: {cluster_size:,} customers")
        print(f"Total Revenue: ₹{cluster_revenue:,.0f}")
        print(f"\nDescription: {seg['description']}")
        print(f"\nStrategy: {seg['strategy']}")
        print(f"\nRecommended Actions:")
        for i, action in enumerate(seg['actions'], 1):
            print(f"  {i}. {action}")

print(f"\n{'='*100}")
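
K-Means numbers its clusters arbitrarily, so cluster 0 is not guaranteed to be the "Champions" group. One way to draft the mapping before reviewing it by hand is a simple rule that compares each cluster's RFM means to the overall average, in the same spirit as the "Relative to Overall Average" output in the profile cell. The sketch below is a heuristic under stated assumptions: the profile numbers are invented for illustration, `suggest_segment_name` is a hypothetical helper, and the thresholds are plain above/below-average splits rather than anything derived from this dataset.

```python
import pandas as pd

# Hypothetical per-cluster means (illustrative numbers, not results from this notebook)
profile = pd.DataFrame(
    {
        "Recency": [12.0, 30.0, 20.0, 160.0, 120.0],
        "Frequency": [30.0, 14.0, 4.0, 16.0, 2.0],
        "MonetaryTotal": [90000.0, 20000.0, 8000.0, 25000.0, 3000.0],
    },
    index=pd.Index([3, 0, 4, 1, 2], name="Cluster"),
)

# Stand-in for the overall customer averages (here: the mean of the cluster means)
overall = profile.mean()


def suggest_segment_name(row, overall):
    """Map one cluster profile to a template segment name using above/below-average rules."""
    recent = row["Recency"] < overall["Recency"]          # lower recency = more recent
    frequent = row["Frequency"] > overall["Frequency"]
    valuable = row["MonetaryTotal"] > overall["MonetaryTotal"]
    if recent and frequent and valuable:
        return "Champions"
    if recent and (frequent or valuable):
        return "Loyal Customers"
    if recent:
        return "Potential Loyalists"
    if frequent or valuable:
        return "At Risk"
    return "Low Value"


suggested = {
    int(cid): suggest_segment_name(row, overall)
    for cid, row in profile.iterrows()
}
print(suggested)
```

The output is a starting point for the `segment_definitions` keys, not a replacement for reading the profiles. Two clusters can land on the same name, in which case the rules or the names need refining.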

In [None]:
# Calculate business value by segment
print("\n💰 SEGMENT VALUE ANALYSIS")
print("="*100)

value_analysis = customer_df_clustered.groupby('Cluster').agg({
    'CustomerID': 'count',
    'MonetaryTotal': ['sum', 'mean'],
    'Frequency': 'mean',
    'AccountBalance': 'mean',
    'MonetaryAvg': 'mean'
}).round(2)

value_analysis.columns = ['Customer_Count', 'Total_Revenue', 'Avg_Revenue', 
                          'Avg_Frequency', 'Avg_Balance', 'Avg_Transaction']

# Calculate percentages
value_analysis['Revenue_Pct'] = (value_analysis['Total_Revenue'] / 
                                  value_analysis['Total_Revenue'].sum() * 100).round(1)
value_analysis['Customer_Pct'] = (value_analysis['Customer_Count'] / 
                                   value_analysis['Customer_Count'].sum() * 100).round(1)

# Calculate Customer Lifetime Value (simplified)
# CLV = Avg Transaction × Avg Frequency × Estimated Years (assume 3 years)
# Uses MonetaryAvg (per-transaction value): multiplying MonetaryTotal by
# Frequency would count frequency twice.
# Assumption: Frequency approximates transactions per year; rescale it if the
# observation window is different.
value_analysis['Estimated_CLV'] = (value_analysis['Avg_Transaction'] * 
                                    value_analysis['Avg_Frequency'] * 3).round(0)

# Add segment names
value_analysis['Segment_Name'] = value_analysis.index.map(
    lambda x: segment_definitions.get(x, {}).get('name', f'Cluster {x}')
)

# Reorder columns
value_analysis = value_analysis[['Segment_Name', 'Customer_Count', 'Customer_Pct',
                                 'Total_Revenue', 'Revenue_Pct', 'Avg_Revenue',
                                 'Avg_Frequency', 'Avg_Balance', 'Estimated_CLV']]

# Sort by total revenue
value_analysis_sorted = value_analysis.sort_values('Total_Revenue', ascending=False)

print("\nSegment Value Ranking (by Total Revenue):\n")
print(value_analysis_sorted.to_string())

# Key insights
print("\n" + "="*100)
print("💡 KEY INSIGHTS:\n")

top_rev_pct = value_analysis_sorted.iloc[0]['Revenue_Pct']
top_cust_pct = value_analysis_sorted.iloc[0]['Customer_Pct']

print(f"1. Top Revenue Segment: {value_analysis_sorted.iloc[0]['Segment_Name']}")
print(f"   - Generates {top_rev_pct}% of revenue from {top_cust_pct}% of customers")
print(f"   - Revenue concentration: {top_rev_pct/top_cust_pct:.1f}x (revenue % / customer %)")

# 80/20 rule analysis: count segments, largest first, until cumulative revenue reaches 80%
cumulative_revenue_pct = value_analysis_sorted['Revenue_Pct'].cumsum()
segments_for_80pct = min(int((cumulative_revenue_pct < 80).sum()) + 1,
                         len(cumulative_revenue_pct))
covered_pct = cumulative_revenue_pct.iloc[segments_for_80pct - 1]

print(f"\n2. Pareto Principle (80/20 Rule):")
print(f"   - Top {segments_for_80pct} segment(s) generate {covered_pct:.1f}% of revenue")
print(f"   - Focus retention efforts on these segments!")

highest_clv = value_analysis_sorted['Estimated_CLV'].idxmax()
print(f"\n3. Highest Customer Lifetime Value: {value_analysis.loc[highest_clv, 'Segment_Name']}")
print(f"   - Estimated CLV: ₹{value_analysis.loc[highest_clv, 'Estimated_CLV']:,.0f}")
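
To make the two calculations in this cell concrete, here is a small worked example with made-up numbers. It assumes the simplified CLV formula stated in the comment (average transaction value × average frequency × 3 years) and counts how many segments, largest first, are needed to reach 80% of revenue. The segment labels and figures are hypothetical.

```python
import pandas as pd

# Toy segment table (hypothetical values, chosen so the arithmetic is easy to follow)
toy = pd.DataFrame(
    {
        "Avg_Transaction": [2000.0, 1500.0, 500.0],
        "Avg_Frequency": [25.0, 10.0, 4.0],
        "Total_Revenue": [600000.0, 300000.0, 100000.0],
    },
    index=["A", "B", "C"],
)
YEARS = 3  # same horizon as the simplified CLV in the cell above

# Simplified CLV: per-transaction value x transactions per period x number of periods
toy["Estimated_CLV"] = toy["Avg_Transaction"] * toy["Avg_Frequency"] * YEARS

# Pareto check: cumulative revenue share with the largest segments first
share = toy["Total_Revenue"].sort_values(ascending=False) / toy["Total_Revenue"].sum() * 100
cumulative = share.cumsum()
segments_needed = int((cumulative < 80).sum()) + 1

print(toy["Estimated_CLV"].to_dict())
print(f"Segments needed to reach 80% of revenue: {segments_needed}")
```

Segment A alone covers 60% of revenue and A plus B covers 90%, so two segments are needed. The CLV figure is only as good as the assumption that the frequency is measured per year.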

In [None]:
# Visualize segment value
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Revenue distribution
ax1 = axes[0, 0]
colors_val = plt.cm.Set3(range(len(value_analysis_sorted)))
value_analysis_sorted['Total_Revenue'].plot(kind='bar', ax=ax1, color=colors_val, edgecolor='black')
ax1.set_title('Total Revenue by Segment', fontsize=14, fontweight='bold')
ax1.set_xlabel('Segment', fontsize=12)
ax1.set_ylabel('Total Revenue (₹)', fontsize=12)
ax1.set_xticklabels(value_analysis_sorted['Segment_Name'], rotation=45, ha='right')
ax1.grid(axis='y', alpha=0.3)

# 2. Customer count vs Revenue %
ax2 = axes[0, 1]
x_pos = np.arange(len(value_analysis_sorted))
width = 0.35
ax2.bar(x_pos - width/2, value_analysis_sorted['Customer_Pct'], width, 
        label='% of Customers', color='skyblue', edgecolor='black')
ax2.bar(x_pos + width/2, value_analysis_sorted['Revenue_Pct'], width,
        label='% of Revenue', color='orange', edgecolor='black')
ax2.set_title('Customer % vs Revenue % (Value Efficiency)', fontsize=14, fontweight='bold')
ax2.set_xlabel('Segment', fontsize=12)
ax2.set_ylabel('Percentage', fontsize=12)
ax2.set_xticks(x_pos)
ax2.set_xticklabels(value_analysis_sorted['Segment_Name'], rotation=45, ha='right')
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

# 3. Estimated CLV
ax3 = axes[1, 0]
value_analysis_sorted['Estimated_CLV'].plot(kind='barh', ax=ax3, color=colors_val, edgecolor='black')
ax3.set_title('Estimated Customer Lifetime Value by Segment', fontsize=14, fontweight='bold')
ax3.set_xlabel('CLV (₹)', fontsize=12)
ax3.set_ylabel('Segment', fontsize=12)
ax3.set_yticklabels(value_analysis_sorted['Segment_Name'])
ax3.grid(axis='x', alpha=0.3)

# 4. Segment size pie chart (same segment order as the bar charts so colors match)
ax4 = axes[1, 1]
ax4.pie(value_analysis_sorted['Customer_Count'], 
        labels=value_analysis_sorted['Segment_Name'],
        autopct='%1.1f%%',
        colors=colors_val,
        startangle=90)
ax4.set_title('Customer Distribution Across Segments', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('segment_value_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Segment value analysis saved as 'segment_value_analysis.png'")

<h2 style="color:darkmagenta;text-align: center; background-color: AliceBlue;padding: 20px;">13. Model Persistence & Production Deployment</h2><a id="13"></a>

### Save Everything for Production Use

To use this model in production, we need to save:
1. The trained K-Means model
2. The StandardScaler (for preprocessing new data)
3. Feature names and categorical columns
4. Segment definitions
5. Model metadata

In [None]:
# Save all models and artifacts
import os

# Create models directory if it doesn't exist
os.makedirs('models', exist_ok=True)

print("💾 Saving Models and Artifacts")
print("="*80)

# 1. Save K-Means model
model_path = 'models/kmeans_customer_segmentation.pkl'
joblib.dump(final_kmeans, model_path)
print(f"✓ K-Means model saved: {model_path}")

# 2-3. StandardScaler and categorical columns were saved in an earlier part;
# confirm the files exist instead of assuming they do
for label, path in [('StandardScaler', 'models/standard_scaler.pkl'),
                    ('Categorical columns', 'models/categorical_columns.pkl')]:
    if os.path.exists(path):
        print(f"✓ {label} already saved: {path}")
    else:
        print(f"⚠️  {label} not found at {path} - re-run the earlier part that saves it")

# 4. Save feature names
feature_info = {
    'numerical_features': numerical_features,
    'categorical_features': categorical_features,
    'all_features': list(df_final.columns.drop('Cluster'))
}
joblib.dump(feature_info, 'models/feature_info.pkl')
print(f"✓ Feature information saved: models/feature_info.pkl")

# 5. Save segment definitions (integer keys become strings in JSON)
with open('models/segment_definitions.json', 'w') as f:
    json.dump(segment_definitions, f, indent=2)
print(f"✓ Segment definitions saved: models/segment_definitions.json")

# 6. Save cluster profiles
cluster_profiles.to_csv('models/cluster_profiles.csv')
print(f"✓ Cluster profiles saved: models/cluster_profiles.csv")

# 7. Save model metadata
metadata = {
    'model_type': 'KMeans',
    'n_clusters': int(OPTIMAL_K),  # plain int so json.dump also accepts NumPy integers
    'training_date': datetime.now().isoformat(),
    'training_size': len(df_final),
    'silhouette_score': float(final_silhouette),
    'inertia': float(final_inertia),
    'random_state': CONFIG['RANDOM_STATE'],
    'features': list(df_final.columns.drop('Cluster'))
}

with open('models/model_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
print(f"✓ Model metadata saved: models/model_metadata.json")

print("\n" + "="*80)
print("✅ All models and artifacts saved successfully!")
print("\nSaved files:")
for file in os.listdir('models'):
    filepath = os.path.join('models', file)
    size = os.path.getsize(filepath)
    print(f"  - {file} ({size:,} bytes)")
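
Before relying on saved artifacts, it is worth confirming that they round-trip. The sketch below is a self-contained illustration, not the notebook's actual files: it saves a small K-Means model and a segment dictionary to a temporary directory, reloads both, and checks that predictions are unchanged. It also shows that integer dictionary keys come back from JSON as strings, which is why the prediction function looks segments up by string key. The file names and toy data are placeholders.

```python
import json
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy model and segment mapping (placeholders for the notebook's artifacts)
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)
segments = {0: {"name": "A"}, 1: {"name": "B"}, 2: {"name": "C"}}

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "kmeans_demo.pkl")
    json_path = os.path.join(tmp, "segments_demo.json")

    # Save both artifacts
    joblib.dump(model, model_path)
    with open(json_path, "w") as f:
        json.dump(segments, f, indent=2)

    # Reload them as a production process would
    reloaded = joblib.load(model_path)
    with open(json_path) as f:
        segments_loaded = json.load(f)

same_predictions = bool(np.array_equal(model.predict(X_demo), reloaded.predict(X_demo)))
print("Predictions identical after reload:", same_predictions)
print("JSON keys after reload:", list(segments_loaded))  # keys are now strings
```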

### 🚀 Production Deployment Code

Here's how to use the saved model to predict segments for new customers:

In [None]:
# Example: How to use the model in production
def predict_customer_segment(customer_data):
    """
    Predict customer segment for new customer data
    
    Parameters:
    -----------
    customer_data : dict or DataFrame
        Customer features: Recency, Frequency, MonetaryTotal, MonetaryAvg, 
                          AccountBalance, Age, Gender
    
    Returns:
    --------
    dict: Cluster assignment and segment details
    """
    # Load saved artifacts
    kmeans = joblib.load('models/kmeans_customer_segmentation.pkl')
    scaler = joblib.load('models/standard_scaler.pkl')
    feature_info = joblib.load('models/feature_info.pkl')
    categorical_cols = joblib.load('models/categorical_columns.pkl')
    
    with open('models/segment_definitions.json', 'r') as f:
        segments = json.load(f)
    
    # Convert to DataFrame if dict
    if isinstance(customer_data, dict):
        customer_data = pd.DataFrame([customer_data])
    
    # Extract numerical features
    numerical_features = feature_info['numerical_features']
    X_numerical = customer_data[numerical_features]
    
    # Scale numerical features (keep the input index so the concat below aligns)
    X_numerical_scaled = scaler.transform(X_numerical)
    X_numerical_scaled = pd.DataFrame(
        X_numerical_scaled, columns=numerical_features, index=customer_data.index
    )
    
    # One-hot encode categorical features. Do not use drop_first here: with a
    # single row it would drop the only category present, so every customer
    # would be encoded as the reference category.
    X_categorical = pd.get_dummies(customer_data[['Gender']], prefix='Gender')
    
    # Align to the dummy columns seen during training (missing ones become 0)
    X_categorical = X_categorical.reindex(columns=categorical_cols, fill_value=0).astype(int)
    
    # Combine features
    X_final = pd.concat([X_numerical_scaled, X_categorical], axis=1)
    
    # Predict cluster
    cluster = kmeans.predict(X_final)[0]
    
    # Get segment details
    segment_info = segments.get(str(cluster), {'name': f'Cluster {cluster}'})
    
    return {
        'cluster_id': int(cluster),
        'segment_name': segment_info.get('name', 'Unknown'),
        'description': segment_info.get('description', ''),
        'strategy': segment_info.get('strategy', ''),
        'actions': segment_info.get('actions', [])
    }

# Test with example customer
example_customer = {
    'Recency': 15,
    'Frequency': 25,
    'MonetaryTotal': 50000,
    'MonetaryAvg': 2000,
    'AccountBalance': 75000,
    'Age': 35,
    'Gender': 'M'
}

print("\n🧪 Testing Production Prediction Function")
print("="*80)
print("\nExample Customer:")
for key, value in example_customer.items():
    print(f"  {key}: {value}")

result = predict_customer_segment(example_customer)

print("\nPrediction Result:")
print("="*80)
print(f"Cluster ID: {result['cluster_id']}")
print(f"Segment: {result['segment_name']}")
print(f"Description: {result['description']}")
print(f"Strategy: {result['strategy']}")
print(f"\nRecommended Actions:")
for i, action in enumerate(result['actions'], 1):
    print(f"  {i}. {action}")

print("\n✅ Prediction function ran end to end!")
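
One-hot encoding at prediction time is easy to get subtly wrong for single-row input. The sketch below isolates the issue with plain pandas: `drop_first=True` on a one-row frame removes the only category present, while encoding all categories and aligning with `reindex` keeps the information. The column list `expected_cols` is an assumed example of what training might have produced, not the contents of the notebook's saved file.

```python
import pandas as pd

# Assumed training-time dummy columns (hypothetical; the notebook stores its own
# list in models/categorical_columns.pkl)
expected_cols = ["Gender_M"]

single = pd.DataFrame({"Gender": ["M"]})

# Pitfall: with one row, drop_first=True removes the only category present
naive = pd.get_dummies(single[["Gender"]], prefix="Gender", drop_first=True)

# Safer: encode every category, then align to the training columns
aligned = (
    pd.get_dummies(single[["Gender"]], prefix="Gender")
    .reindex(columns=expected_cols, fill_value=0)
    .astype(int)
)

print("Columns with drop_first on one row:", list(naive.columns))
print("Aligned encoding:", aligned.to_dict("records"))
```

With the naive encoding, filling the missing `Gender_M` column with 0 would silently treat this customer as the reference category. The aligned version keeps `Gender_M` equal to 1.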

---

## 🎉 CONGRATULATIONS!

### You've successfully completed customer segmentation!

## ✅ What We Accomplished:

### 🔧 Technical Achievements:
1. ✅ **Fixed critical RFM bug** - Recency now correctly calculated
2. ✅ **Proper categorical handling** - One-hot encoding instead of fake ordering
3. ✅ **Robust data cleaning** - Analyzed before dropping, documented all changes
4. ✅ **Comprehensive evaluation** - Multiple metrics for optimal K selection
5. ✅ **Production-ready** - Saved models and prediction pipeline

### 📊 Business Achievements:
1. ✅ **Identified customer segments** with clear characteristics
2. ✅ **Named segments** with business meaning
3. ✅ **Created strategies** for each segment
4. ✅ **Calculated segment value** - revenue-to-customer ratio, CLV, revenue contribution
5. ✅ **Actionable recommendations** - Specific marketing actions

### 📈 Deliverables:
- ✅ Cleaned and validated dataset
- ✅ Trained K-Means model
- ✅ Segment profiles and characteristics
- ✅ Business strategy document
- ✅ Visualizations and reports
- ✅ Production deployment code

---

## 🚀 Next Steps:

### Immediate Actions:
1. **Share findings** with business stakeholders
2. **Implement strategies** for high-value segments
3. **Deploy model** to production for real-time segmentation

### Future Enhancements:
1. **Add temporal analysis** - How do segments evolve over time?
2. **Predictive modeling** - Predict which segment new customers will join
3. **Churn prediction** - Identify at-risk customers early
4. **A/B testing** - Measure effectiveness of segment-specific campaigns
5. **Geographic analysis** - Location-based insights
6. **Product affinity** - What products does each segment prefer?

---

## 📚 Key Learnings:

**1. Data Quality Matters:**
- Always analyze before cleaning
- Document all decisions
- Preserve valid "outliers"

**2. Feature Engineering is Critical:**
- RFM is powerful but must be calculated correctly
- Domain knowledge guides good features
- Proper handling of categorical vs numerical

**3. Interpretation > Technical Metrics:**
- Perfect silhouette score means nothing without business value
- Segment names and strategies make insights actionable
- Always tie back to business objectives

**4. Production Readiness:**
- Save everything needed to recreate results
- Create reusable prediction functions
- Document assumptions and limitations

---

## 🙏 Thank You!

This fixed version demonstrates best practices in:
- Data science methodology
- Customer analytics
- Production ML deployment
- Business communication

**Questions or feedback?** Feel free to reach out!

---

*Notebook created: 2025-12-18*
*Version: 1.0 (Fixed & Enhanced)*