# Tier 4: K-Means Clustering

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** 3f3d3bb9-6068-4d6e-82de-72776edb6955

---

## Citation
Brandon Deloatch, "Tier 4: K-Means Clustering," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** 3f3d3bb9-6068-4d6e-82de-72776edb6955
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.
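
The auto-tracking note above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not part of the original notebook: the function name `capture_execution_metadata` and the default package list are assumptions chosen to mirror the dependencies listed earlier.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def capture_execution_metadata(packages=("numpy", "pandas", "scikit-learn", "plotly")):
    """Collect interpreter, OS, and package versions for a run log.

    Hypothetical helper: the name and the default package list are
    illustrative assumptions, not something the notebook defines.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


run_info = capture_execution_metadata()
print(json.dumps(run_info, indent=2))
```

Writing the resulting dictionary next to the notebook output (for example as a JSON file) gives later readers a record of the environment that produced the results.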

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Commercial adaptation is permitted, provided the citation requirements are met
- Recommended for reproducible research and transparent analytics

---



In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import warnings
warnings.filterwarnings('ignore')

print(" Tier 4: K-Means Clustering - Libraries Loaded!")
print("=" * 50)
print("K-Means Techniques:")
print("• Standard K-Means with Lloyd's algorithm")
print("• K-Means++ smart initialization")
print("• Mini-Batch K-Means for large datasets")
print("• Elbow method for optimal K selection")
print("• Silhouette analysis for cluster validation")

 Tier 4: K-Means Clustering - Libraries Loaded!
K-Means Techniques:
• Standard K-Means with Lloyd's algorithm
• K-Means++ smart initialization
• Mini-Batch K-Means for large datasets
• Elbow method for optimal K selection
• Silhouette analysis for cluster validation
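
The first two techniques in the list above, Lloyd's algorithm and K-Means++ initialization, can be sketched directly in NumPy. This is a simplified teaching version under stated assumptions, not scikit-learn's implementation: the function names `kmeans_pp_init` and `lloyd_kmeans`, the toy data, and the stopping tolerance are all illustrative.

```python
import numpy as np


def kmeans_pp_init(X, k, rng):
    """K-Means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        chosen = np.array(centers)
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        total = d2.sum()
        # Fall back to a uniform draw if every point coincides with a center
        idx = rng.integers(n) if total == 0 else rng.choice(n, p=d2 / total)
        centers.append(X[idx])
    return np.array(centers)


def lloyd_kmeans(X, k, n_iter=100, tol=1e-8, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    centers = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.abs(new_centers - centers).max()
        centers = new_centers
        if shift < tol:
            break
    # Final assignment so labels and inertia match the returned centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return labels, centers, inertia


# Toy data: three 2-D Gaussian blobs (illustrative only)
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(50, 2))
    for c in ([0, 0], [5, 5], [0, 5])
])
labels, centers, inertia = lloyd_kmeans(X_demo, k=3)
print("Centers:\n", centers.round(2))
print("Inertia:", round(float(inertia), 2))
```

The returned inertia is the same quantity that `KMeans.inertia_` reports: the sum of squared distances from each point to its assigned center.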


In [None]:
# Generate synthetic customer data for K-Means
# (each feature is sampled independently, so no cluster structure is planted)
np.random.seed(42)

# Customer segmentation data
n_customers = 1000
customer_data = pd.DataFrame({
    'annual_spending': np.random.gamma(2, 15000, n_customers),
    'visit_frequency': np.random.poisson(8, n_customers),
    'avg_transaction': np.random.lognormal(4, 0.5, n_customers),
    'loyalty_years': np.random.exponential(2, n_customers)
})

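# Reference labels from a separate make_blobs dataset.
# Note: these labels describe the blob samples, not the rows of customer_data,
# so they cannot be used to score clusterings of customer_data.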
centers = [(30000, 12, 80, 3), (15000, 4, 40, 1), (50000, 20, 150, 5)]
true_clusters = make_blobs(n_samples=n_customers, centers=centers, n_features=4,
                          cluster_std=5000, random_state=42)[1]

print(" K-Means Datasets Created:")
print(f"Customer data: {len(customer_data)} samples with {customer_data.shape[1]} features")
print(f"Spending range: ${customer_data['annual_spending'].min():,.0f} - ${customer_data['annual_spending'].max():,.0f}")
print(f"Transaction range: ${customer_data['avg_transaction'].min():.0f} - ${customer_data['avg_transaction'].max():.0f}")

 K-Means Datasets Created:
Customer data: 1000 samples with 4 features
Spending range: $689 - $116,803
Transaction range: $11 - $321
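
Validation against known labels only works when the labels and the clustered features come from the same samples. The sketch below shows one way this could look, also comparing standard K-Means with Mini-Batch K-Means. The centers, `cluster_std`, and `batch_size` values are assumptions picked so that the blobs are well separated; they are not the notebook's values.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Hypothetical, well-separated centers chosen for illustration only
demo_centers = [(-5, -5, 0, 0), (0, 5, 5, 0), (5, 0, -5, 5)]

# Keep BOTH the features and the labels so they refer to the same samples
X_blobs, y_true = make_blobs(n_samples=1000, centers=demo_centers,
                             cluster_std=1.0, random_state=42)
X_blobs_scaled = StandardScaler().fit_transform(X_blobs)

# Standard K-Means versus the mini-batch variant on identical data
full = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_blobs_scaled)
mini = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=10,
                       random_state=42).fit(X_blobs_scaled)

# Adjusted Rand Index: 1.0 means perfect agreement with the true labels,
# values near 0 mean agreement no better than chance
ari_full = adjusted_rand_score(y_true, full.labels_)
ari_mini = adjusted_rand_score(y_true, mini.labels_)
print(f"KMeans ARI:          {ari_full:.3f}")
print(f"MiniBatchKMeans ARI: {ari_mini:.3f}")
```

The Adjusted Rand Index is invariant to label permutation, so it does not matter that K-Means numbers its clusters differently from `make_blobs`.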


In [None]:
# 1. K-MEANS CLUSTERING WITH VISUALIZATION
print(" 1. K-MEANS CLUSTERING ANALYSIS")
print("=" * 33)

# Standardize features
scaler = StandardScaler()
customer_scaled = scaler.fit_transform(customer_data)

# Apply K-Means with different K values
k_range = range(2, 11)
inertias = []
silhouette_scores = []
kmeans_models = {}

for k in k_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42, n_init=10)
    clusters = kmeans.fit_predict(customer_scaled)
    # Compute the silhouette once and reuse it for storage and logging
    sil = silhouette_score(customer_scaled, clusters)

    inertias.append(kmeans.inertia_)
    silhouette_scores.append(sil)
    kmeans_models[k] = kmeans

    print(f"K={k}: Inertia={kmeans.inertia_:.0f}, Silhouette={sil:.3f}")

# Choose K. The value is hard-coded from a visual read of the elbow plot,
# not computed from the inertias or silhouette scores above.
elbow_k = 3
optimal_kmeans = kmeans_models[elbow_k]
# Reuse the labels from the model fitted in the loop instead of refitting it
customer_clusters = optimal_kmeans.labels_

print(f"\nOptimal K selected: {elbow_k}")
print(f"Final silhouette score: {silhouette_score(customer_scaled, customer_clusters):.3f}")

# Cluster analysis
customer_data['cluster'] = customer_clusters
# Per-cluster mean and standard deviation of each feature
# (not printed here; evaluate `cluster_summary` in a cell to display it)
cluster_summary = customer_data.groupby('cluster').agg({
    'annual_spending': ['mean', 'std'],
    'visit_frequency': ['mean', 'std'],
    'avg_transaction': ['mean', 'std'],
    'loyalty_years': ['mean', 'std']
}).round(2)

print("\nCluster Summary:")
for cluster in range(elbow_k):
    in_cluster = customer_data['cluster'] == cluster
    cluster_size = int(in_cluster.sum())
    spending_avg = customer_data.loc[in_cluster, 'annual_spending'].mean()
    print(f"Cluster {cluster}: {cluster_size} customers, avg spending ${spending_avg:,.0f}")

 1. K-MEANS CLUSTERING ANALYSIS
K=2: Inertia=3321, Silhouette=0.241
K=3: Inertia=2732, Silhouette=0.239
K=4: Inertia=2257, Silhouette=0.249
K=5: Inertia=1862, Silhouette=0.243
K=6: Inertia=1724, Silhouette=0.198
K=7: Inertia=1610, Silhouette=0.196
K=8: Inertia=1511, Silhouette=0.197
K=9: Inertia=1409, Silhouette=0.194
K=10: Inertia=1345, Silhouette=0.198

Optimal K selected: 3
Final silhouette score: 0.239

Cluster Summary:
Cluster 0: 604 customers, avg spending $20,923
Cluster 1: 176 customers, avg spending $26,263
Cluster 2: 220 customers, avg spending $61,996


In [5]:
# 2. INTERACTIVE VISUALIZATIONS
print(" 2. INTERACTIVE K-MEANS VISUALIZATIONS")
print("=" * 39)

# Create comprehensive visualization dashboard
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[
        'Elbow Method for Optimal K',
        'Silhouette Analysis',
        'Customer Clusters (Spending vs Frequency)',
        'Cluster Centers Comparison'
    ],
    specs=[[{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": False}]]
)

# Elbow method plot
fig.add_trace(
    go.Scatter(x=list(k_range), y=inertias, mode='lines+markers',
               name='Inertia', line=dict(color='blue', width=3),
               marker=dict(size=8)),
    row=1, col=1
)

# Silhouette scores
fig.add_trace(
    go.Scatter(x=list(k_range), y=silhouette_scores, mode='lines+markers',
               name='Silhouette Score', line=dict(color='green', width=3),
               marker=dict(size=8)),
    row=1, col=2
)

# Customer clusters scatter plot
colors = ['red', 'blue', 'green', 'purple', 'orange']
for cluster in range(elbow_k):
    cluster_data = customer_data[customer_data['cluster'] == cluster]
    fig.add_trace(
        go.Scatter(x=cluster_data['annual_spending'],
                  y=cluster_data['visit_frequency'],
                  mode='markers',
                  name=f'Cluster {cluster}',
                  marker=dict(color=colors[cluster], size=6, opacity=0.7)),
        row=2, col=1
    )

# Add cluster centers
centers_original = scaler.inverse_transform(optimal_kmeans.cluster_centers_)
for i, center in enumerate(centers_original):
    fig.add_trace(
        go.Scatter(x=[center[0]], y=[center[1]], mode='markers',
                  marker=dict(color='black', size=15, symbol='x'),
                  name=f'Center {i}', showlegend=False),
        row=2, col=1
    )

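# Cluster center profiles (line plot over min-max normalized features, not a true radar chart)
features = ['Annual Spending', 'Visit Frequency', 'Avg Transaction', 'Loyalty Years']
feature_cols = ['annual_spending', 'visit_frequency', 'avg_transaction', 'loyalty_years']
# Compute the normalization constants once, selecting feature columns by name
feature_min = customer_data[feature_cols].min().to_numpy()
feature_range = (customer_data[feature_cols].max() - customer_data[feature_cols].min()).to_numpy()
for i, center in enumerate(centers_original):
    # Min-max normalize each center so all features share a 0-1 scale
    normalized_center = (center - feature_min) / feature_range
    fig.add_trace(
        go.Scatter(x=features, y=normalized_center, mode='lines+markers',
                  name=f'Cluster {i} Profile', line=dict(color=colors[i])),
        row=2, col=2
    )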

fig.update_layout(height=800, title="K-Means Clustering Analysis Dashboard", showlegend=True)
fig.update_xaxes(title_text="Number of Clusters (K)", row=1, col=1)
fig.update_xaxes(title_text="Number of Clusters (K)", row=1, col=2)
fig.update_xaxes(title_text="Annual Spending ($)", row=2, col=1)
fig.update_xaxes(title_text="Features", row=2, col=2)
fig.update_yaxes(title_text="Inertia", row=1, col=1)
fig.update_yaxes(title_text="Silhouette Score", row=1, col=2)
fig.update_yaxes(title_text="Visit Frequency", row=2, col=1)
fig.update_yaxes(title_text="Normalized Values", row=2, col=2)
fig.show()

# Business insights
print(f"\n BUSINESS INSIGHTS:")
for cluster in range(elbow_k):
    cluster_data = customer_data[customer_data['cluster'] == cluster]
    cluster_size = len(cluster_data)

    avg_spending = cluster_data['annual_spending'].mean()
    avg_frequency = cluster_data['visit_frequency'].mean()
    avg_transaction = cluster_data['avg_transaction'].mean()
    avg_loyalty = cluster_data['loyalty_years'].mean()

    # Determine customer segment type
    if avg_spending > 40000 and avg_frequency > 15:
        segment_type = "VIP Customers"
    elif avg_spending > 25000:
        segment_type = "High-Value Customers"
    elif avg_frequency > 10:
        segment_type = "Frequent Shoppers"
    else:
        segment_type = "Casual Customers"

    total_revenue = cluster_size * avg_spending

    print(f"\nCluster {cluster}: {segment_type}")
    print(f"• Size: {cluster_size} customers ({cluster_size/len(customer_data)*100:.1f}%)")
    print(f"• Annual spending: ${avg_spending:,.0f}")
    print(f"• Visit frequency: {avg_frequency:.1f} times/year")
    print(f"• Avg transaction: ${avg_transaction:.0f}")
    print(f"• Customer loyalty: {avg_loyalty:.1f} years")
    print(f"• Total cluster revenue: ${total_revenue:,.0f}")

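# ROI calculation (illustrative: the factors and cost below are assumed, not estimated from data)
total_revenue = customer_data['annual_spending'].sum()
targeting_efficiency = 0.25   # assumed 25% improvement in marketing efficiency
revenue_share = 0.1           # assumed 10% of revenue attributable to marketing
implementation_cost = 75000   # assumed one-time implementation cost in dollars
# Estimated annual gain in dollars (a dollar amount, not a ratio)
marketing_roi = total_revenue * targeting_efficiency * revenue_share

print(f"\n K-MEANS CLUSTERING ROI:")
print(f"• Total customer revenue: ${total_revenue:,.0f}")
print(f"• Marketing efficiency improvement: {targeting_efficiency*100:.0f}%")
print(f"• Estimated annual ROI: ${marketing_roi:,.0f}")
print(f"• Implementation cost: ${implementation_cost:,.0f}")
print(f"• Net ROI: {(marketing_roi - implementation_cost)/implementation_cost*100:.0f}%")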

 2. INTERACTIVE K-MEANS VISUALIZATIONS



 BUSINESS INSIGHTS:

Cluster 0: Casual Customers
• Size: 604 customers (60.4%)
• Annual spending: $20,923
• Visit frequency: 7.7 times/year
• Avg transaction: $62
• Customer loyalty: 1.2 years
• Total cluster revenue: $12,637,555

Cluster 1: High-Value Customers
• Size: 176 customers (17.6%)
• Annual spending: $26,263
• Visit frequency: 8.4 times/year
• Avg transaction: $57
• Customer loyalty: 5.4 years
• Total cluster revenue: $4,622,374

Cluster 2: High-Value Customers
• Size: 220 customers (22.0%)
• Annual spending: $61,996
• Visit frequency: 8.1 times/year
• Avg transaction: $60
• Customer loyalty: 1.4 years
• Total cluster revenue: $13,639,025

 K-MEANS CLUSTERING ROI:
• Total customer revenue: $30,898,954
• Marketing efficiency improvement: 25%
• Estimated annual ROI: $772,474
• Implementation cost: $75,000
• Net ROI: 930%
