<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/demos/week05_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 5 Demo: Customer Lifetime Value, RFM Analysis, and K-Means Clustering
**CAP4767 Data Mining with Python** | Miami Dade College, Kendall Campus

**Chapters 6 & 7** | Competencies: 2.1, 2.2, 2.3, 2.4, 6 (partial)

| Session | Content | Chapter |
|---------|---------|---------|
| **Session 1** | RFM Metrics → CLTV → Pareto Analysis → Manual Segmentation | Ch. 6 |
| **Session 2** | K-Means Clustering → Elbow Method → Cluster Profiling → Comparison | Ch. 7 |

**The business problem:** A UK online retailer spent £50,000 on a "Dear Valued Customer" email blast: the same message to all 4,300+ customers. Response rate: 2.1%. Most of that budget targeted customers who'd already left or who were going to buy anyway. Today we build the system that ensures that never happens again.

**Pipeline position:** Weeks 3–4 were **supervised learning** (predict a known target). This week we shift to **unsupervised learning**, where there's no target column. We're discovering structure the data didn't tell us it had.

---
## Setup

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Run this cell to load all libraries. Do not modify.
</div>

In [None]:
# ============================================================
# Setup: Run this cell. Do not modify.
# ============================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

np.random.seed(42)
plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["figure.dpi"] = 100
sns.set_style("whitegrid")

print("✅ All libraries loaded")

---
## Load the Online Retail Dataset

In [None]:
# Load from GitHub (Excel format; may take 10-15 seconds)
url = "https://github.com/fenago/datasets/raw/refs/heads/main/Online%20Retail%20(1).xlsx"
df = pd.read_excel(url, engine="openpyxl")

print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"\nColumn types:")
print(df.dtypes)
print(f"\nMissing values:")
print(df.isnull().sum())
df.head()

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHAT WE SEE</strong><br>
  541,909 transactions from a UK online retailer (Dec 2010–Dec 2011). ~135,000 rows have missing CustomerID (25%); these are guest checkouts we can't track. We also have returns (negative quantities) and zero-price items that need cleaning.
</div>

---
# SESSION 1 (Chapter 6): RFM Analysis and Customer Lifetime Value

---
# Example 1: Calculate RFM Metrics

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  RFM reduces a customer's entire transaction history into three numbers:
  <ul>
    <li><strong>Recency:</strong> How many days since their last purchase? (Lower = better)</li>
    <li><strong>Frequency:</strong> How many separate orders did they place? (Higher = better)</li>
    <li><strong>Monetary:</strong> How much total revenue did they generate? (Higher = better)</li>
  </ul>
  These three metrics capture most of what you need to know about a customer's value and risk.
</div>
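
As a minimal sketch of how those three numbers fall out of a transaction log, the block below builds a tiny hand-made table and aggregates it. The customers, dates, and amounts are invented for illustration (they are not from the dataset), and the `toy_` names are hypothetical; only the column names match the demo.

```python
import pandas as pd

# Hypothetical toy transactions (invented values, not from the real dataset)
toy = pd.DataFrame({
    "CustomerID":  [1, 1, 1, 2, 2],
    "InvoiceNo":   ["A1", "A1", "A2", "B1", "B2"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-05", "2011-03-01", "2011-02-10", "2011-02-20"]
    ),
    "TotalPrice":  [10.0, 5.0, 20.0, 7.5, 2.5],
})

# Snapshot = one day after the most recent transaction in the table
toy_snapshot = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda x: (toy_snapshot - x.max()).days),
    Frequency=("InvoiceNo", "nunique"),   # distinct invoices, not line items
    Monetary=("TotalPrice", "sum"),
)
print(toy_rfm)
```

Customer 1 has three line items but only two distinct invoices, which is why Frequency counts unique `InvoiceNo` values rather than rows.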

In [None]:
# Step 1: Clean the data
df_clean = df.dropna(subset=["CustomerID"]).copy()
df_clean = df_clean[df_clean["Quantity"] > 0]
df_clean = df_clean[df_clean["UnitPrice"] > 0]

# Step 2: Create revenue column
df_clean["TotalPrice"] = df_clean["Quantity"] * df_clean["UnitPrice"]

# Step 3: Set snapshot date (1 day after last transaction)
df_clean["InvoiceDate"] = pd.to_datetime(df_clean["InvoiceDate"])
snapshot_date = df_clean["InvoiceDate"].max() + pd.Timedelta(days=1)
print(f"Last transaction: {df_clean['InvoiceDate'].max()}")
print(f"Snapshot date:    {snapshot_date}")

# Step 4: Calculate RFM
rfm = df_clean.groupby("CustomerID").agg({
    "InvoiceDate": lambda x: (snapshot_date - x.max()).days,  # Recency
    "InvoiceNo":   "nunique",                                  # Frequency
    "TotalPrice":  "sum"                                       # Monetary
}).reset_index()

rfm.columns = ["CustomerID", "Recency", "Frequency", "Monetary"]

print(f"\nAfter cleaning: {df_clean.shape[0]:,} transactions → {rfm.shape[0]:,} unique customers")
print(f"\nRFM Summary Statistics:")
print(rfm[["Recency", "Frequency", "Monetary"]].describe().round(2))

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 READING THE SUMMARY</strong><br>
  <ul>
    <li><strong>Recency:</strong> Median ~50 days, so half our customers bought in the last 2 months</li>
    <li><strong>Frequency:</strong> Median ~2 orders, so most customers are one-time or two-time buyers</li>
    <li><strong>Monetary:</strong> Mean is ~3x the median, the first clue about revenue concentration</li>
  </ul>
  That mean/median gap is the Pareto Principle announcing itself. A small group of customers generates a disproportionate share of revenue.
</div>

### Your Turn #1

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ TRY THIS</strong><br>
  Pick CustomerID 17850. Filter the raw <code>df_clean</code> and trace their RFM values manually. Does your calculation match the <code>rfm</code> DataFrame?
</div>

In [None]:
# Your Turn #1: Trace CustomerID 17850


---
# Example 2: RFM Scoring and Manual Segmentation

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  Raw RFM values are hard to compare: a Recency of 30 days and a Monetary of £2,000 are on completely different scales. <strong>Quintile scoring</strong> converts each metric to a 1–5 scale, where 5 is always "best." Then we combine them into a composite score and assign business-friendly segment labels.
</div>
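
A small sketch of quintile scoring on ten invented recency values may make the idea concrete. The numbers and `toy_` names are hypothetical; the `pd.qcut` call has the same shape as the one used in the scoring cell.

```python
import pandas as pd

# Hypothetical recency values in days (invented for illustration)
toy_recency = pd.Series([3, 8, 15, 22, 40, 61, 90, 150, 240, 365])

# Five equal-sized bins; labels reversed so the most recent buyers score 5
toy_r_score = pd.qcut(toy_recency, q=5, labels=[5, 4, 3, 2, 1]).astype(int)

print(pd.DataFrame({"Recency": toy_recency, "R_Score": toy_r_score}))
```

With ten distinct values and five bins, each score is assigned to exactly two customers: the two most recent get 5 and the two most lapsed get 1.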

In [None]:
# Quintile scoring (1–5, where 5 = best)
# Recency: LOWER is better → reverse the labels
rfm["R_Score"] = pd.qcut(rfm["Recency"], q=5, labels=[5, 4, 3, 2, 1]).astype(int)

# Frequency: use rank to handle duplicate bin edges
rfm["F_Score"] = pd.qcut(rfm["Frequency"].rank(method="first"), q=5, labels=[1, 2, 3, 4, 5]).astype(int)

# Monetary: higher is better
rfm["M_Score"] = pd.qcut(rfm["Monetary"], q=5, labels=[1, 2, 3, 4, 5]).astype(int)

# Composite score (sum of R + F + M)
rfm["RFM_Score"] = rfm["R_Score"] + rfm["F_Score"] + rfm["M_Score"]

print("Score distributions:")
print(f"  R_Score range: {rfm['R_Score'].min()} – {rfm['R_Score'].max()}")
print(f"  F_Score range: {rfm['F_Score'].min()} – {rfm['F_Score'].max()}")
print(f"  M_Score range: {rfm['M_Score'].min()} – {rfm['M_Score'].max()}")
print(f"  Composite range: {rfm['RFM_Score'].min()} – {rfm['RFM_Score'].max()}")
rfm.head(10)

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #7D6608;">⚠️ COMMON MISTAKE: Recency Reversal</strong><br>
  Recency is the only metric where <strong>lower is better</strong> (more recent = more engaged). That's why the labels are <code>[5, 4, 3, 2, 1]</code>: the lowest-recency customers get score 5. If you forget to reverse this, your "Champions" will be customers who haven't bought in months.
</div>
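
The Frequency line in the scoring cell ranks the values before binning. The sketch below, on invented order counts where most customers ordered once, suggests why: with that many ties several quantile edges coincide, and `pd.qcut` raises a `ValueError` unless the ties are broken first. The data and `toy_` names are hypothetical.

```python
import pandas as pd

# Hypothetical order counts: heavy ties at 1, as with one-time buyers
toy_freq = pd.Series([1, 1, 1, 1, 1, 1, 2, 2, 3, 10])

try:
    pd.qcut(toy_freq, q=5, labels=[1, 2, 3, 4, 5])
    toy_qcut_failed = False
except ValueError as err:
    toy_qcut_failed = True      # several quantile edges are identical
    print("qcut on raw values failed:", err)

# rank(method="first") gives every row a distinct rank (ties broken by row order),
# so the quantile edges are unique and each bin receives two rows
toy_f_score = pd.qcut(toy_freq.rank(method="first"), q=5,
                      labels=[1, 2, 3, 4, 5]).astype(int)
print(toy_f_score.tolist())
```

One side effect to keep in mind: customers with the same order count can land in different score bins, because the tie is broken by row position rather than by anything about the customer.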

In [None]:
# Manual segmentation based on composite score
def segment_customer(score):
    if score >= 12:
        return "Champions"
    elif score >= 9:
        return "Loyal"
    elif score >= 7:
        return "Potential Loyalist"
    elif score >= 5:
        return "At Risk"
    elif score >= 4:
        return "Can't Lose"
    else:
        return "Lost"

rfm["Customer_Segment"] = rfm["RFM_Score"].apply(segment_customer)

print("Customer Segments:")
print(rfm["Customer_Segment"].value_counts())
print(f"\nTotal customers: {rfm.shape[0]:,}")

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 READING THE SEGMENTS</strong><br>
  <ul>
    <li><strong>Champions + Loyal:</strong> Your core, with high scores across all three dimensions</li>
    <li><strong>Potential Loyalist:</strong> Mid-range; one good campaign could push them up</li>
    <li><strong>At Risk + Can't Lose + Lost:</strong> Vulnerable segments, roughly 30% of the base</li>
  </ul>
  The key insight: "At Risk" customers aren't gone yet. They spent money but haven't been back recently. These are your highest-ROI retention targets.
</div>
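
To see which composite scores feed each label, one option is to run every possible score (3 through 15) through the rule. The thresholds below are copied from the segmentation cell; the function is renamed `toy_segment_customer` (a hypothetical name) so it does not overwrite the notebook's own version.

```python
import pandas as pd

def toy_segment_customer(score):
    # Same thresholds as the segmentation cell above
    if score >= 12:
        return "Champions"
    elif score >= 9:
        return "Loyal"
    elif score >= 7:
        return "Potential Loyalist"
    elif score >= 5:
        return "At Risk"
    elif score >= 4:
        return "Can't Lose"
    else:
        return "Lost"

# Three 1-5 scores summed: the composite can only take values 3..15
toy_map = pd.DataFrame({"RFM_Score": range(3, 16)})
toy_map["Segment"] = toy_map["RFM_Score"].apply(toy_segment_customer)

# Lowest and highest composite score that maps to each label
toy_ranges = toy_map.groupby("Segment")["RFM_Score"].agg(["min", "max"])
print(toy_ranges)
```

With these thresholds, "Can't Lose" covers only a composite of 4 and "Lost" only a composite of 3, so both segments are narrow by construction.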

### Your Turn #2

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ TRY THIS</strong><br>
  Raise the Champions threshold from ≥12 to ≥13. How many Champions remain? What segment absorbs the ones that dropped?
</div>

In [None]:
# Your Turn #2: Adjust threshold


---
# Example 3: Full Pipeline (CLTV + Pareto Analysis + Revenue at Risk)

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  Segments tell you <strong>who</strong> your customers are. CLTV tells you <strong>what they're worth</strong>. The Pareto curve shows how concentrated that value is. The revenue-at-risk table is what turns an analytical exercise into a board-level conversation.
</div>

In [None]:
# Part A: RFM Distribution Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Recency histogram
axes[0, 0].hist(rfm["Recency"], bins=50, color="steelblue", edgecolor="white")
axes[0, 0].set_title("Recency Distribution (days)")
axes[0, 0].set_xlabel("Days Since Last Purchase")
axes[0, 0].axvline(rfm["Recency"].median(), color="red", linestyle="--", label=f'Median: {rfm["Recency"].median():.0f}')
axes[0, 0].legend()

# Frequency histogram
axes[0, 1].hist(rfm["Frequency"], bins=50, color="coral", edgecolor="white")
axes[0, 1].set_title("Frequency Distribution (orders)")
axes[0, 1].set_xlabel("Number of Orders")
axes[0, 1].axvline(rfm["Frequency"].median(), color="red", linestyle="--", label=f'Median: {rfm["Frequency"].median():.0f}')
axes[0, 1].legend()

# Monetary histogram (log scale)
axes[1, 0].hist(rfm["Monetary"], bins=50, color="seagreen", edgecolor="white")
axes[1, 0].set_title("Monetary Distribution (£): Note the tail")
axes[1, 0].set_xlabel("Total Revenue (£)")
axes[1, 0].set_yscale("log")
axes[1, 0].axvline(rfm["Monetary"].median(), color="red", linestyle="--", label=f'Median: £{rfm["Monetary"].median():,.0f}')
axes[1, 0].legend()

# Segment bar chart
seg_order = ["Champions", "Loyal", "Potential Loyalist", "At Risk", "Can't Lose", "Lost"]
seg_counts = rfm["Customer_Segment"].value_counts().reindex(seg_order)
colors = ["#27AE60", "#2ECC71", "#F39C12", "#E67E22", "#E74C3C", "#C0392B"]
axes[1, 1].bar(seg_counts.index, seg_counts.values, color=colors)
axes[1, 1].set_title("Customer Segments")
axes[1, 1].set_ylabel("Count")
axes[1, 1].tick_params(axis="x", rotation=30)

plt.tight_layout()
plt.show()

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 READING THE FOUR PANELS</strong><br>
  <ul>
    <li><strong>Recency:</strong> Bimodal: a cluster of recent buyers and a long tail of lapsed customers</li>
    <li><strong>Frequency:</strong> Heavy right skew: most customers ordered once or twice</li>
    <li><strong>Monetary:</strong> Log scale needed to see the full range, with a massive tail of high-value customers</li>
    <li><strong>Segments:</strong> Champions leading, but the red/orange bars (vulnerable) add up to ~30% of the base</li>
  </ul>
</div>

In [None]:
# Part B: Customer Lifetime Value Calculation
# CLTV = AOV × Purchase Frequency × Customer Lifespan

cltv_data = df_clean.groupby("CustomerID").agg({
    "InvoiceNo":   "nunique",                                    # Number of purchases
    "TotalPrice":  "sum",                                        # Total revenue
    "InvoiceDate": lambda x: (x.max() - x.min()).days            # Customer lifespan in days
}).reset_index()

cltv_data.columns = ["CustomerID", "NumPurchases", "TotalRevenue", "CustomerLifespan"]

# Convert lifespan to years (minimum 1 day to avoid division by zero)
cltv_data["CustomerLifespan"] = cltv_data["CustomerLifespan"].apply(lambda x: max(x, 1)) / 365.25

# Calculate components
cltv_data["AOV"] = cltv_data["TotalRevenue"] / cltv_data["NumPurchases"]
cltv_data["PurchaseFrequency"] = cltv_data["NumPurchases"] / cltv_data["CustomerLifespan"]
cltv_data["CLTV"] = cltv_data["AOV"] * cltv_data["PurchaseFrequency"] * cltv_data["CustomerLifespan"]

print("CLTV Summary:")
print(f"  Mean CLTV:   £{cltv_data['CLTV'].mean():,.0f}")
print(f"  Median CLTV: £{cltv_data['CLTV'].median():,.0f}")
print(f"  Mean/Median: {cltv_data['CLTV'].mean() / cltv_data['CLTV'].median():.1f}x")
print(f"  Max CLTV:    £{cltv_data['CLTV'].max():,.0f}")
print(f"\n  Total customers: {cltv_data.shape[0]:,}")

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 THE MEAN/MEDIAN RATIO</strong><br>
  When the mean is 2x+ the median, the distribution is heavily right-skewed: a small number of customers are pulling the average way up. This is the Pareto Principle in action. Let's see exactly how concentrated the revenue is.
</div>
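
One property of the formula as coded in the CLTV cell is worth checking. AOV divides revenue by the number of purchases, and purchase frequency divides the number of purchases by lifespan, so the product AOV × frequency × lifespan cancels back to total revenue. The sketch below confirms this on invented numbers; the values and `toy_` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer aggregates (invented values)
toy_cltv = pd.DataFrame({
    "NumPurchases":     [1, 4, 12],
    "TotalRevenue":     [50.0, 800.0, 6000.0],
    "CustomerLifespan": [1 / 365.25, 0.5, 1.0],   # years; first row uses the 1-day floor
})

toy_cltv["AOV"] = toy_cltv["TotalRevenue"] / toy_cltv["NumPurchases"]
toy_cltv["PurchaseFrequency"] = toy_cltv["NumPurchases"] / toy_cltv["CustomerLifespan"]
toy_cltv["CLTV"] = (toy_cltv["AOV"]
                    * toy_cltv["PurchaseFrequency"]
                    * toy_cltv["CustomerLifespan"])

# (Revenue / N) * (N / L) * L == Revenue, up to floating-point rounding
toy_matches = np.allclose(toy_cltv["CLTV"], toy_cltv["TotalRevenue"])
print(toy_matches)
```

Read this way, the CLTV column in this demo tracks each customer's observed revenue, which would also explain why its mean/median ratio mirrors the one seen for Monetary.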

In [None]:
# Part C: Pareto Visualization (Revenue Concentration)
cltv_sorted = cltv_data["CLTV"].sort_values(ascending=False).reset_index(drop=True)
cumulative_pct = cltv_sorted.cumsum() / cltv_sorted.sum() * 100
customer_pct = np.arange(1, len(cltv_sorted) + 1) / len(cltv_sorted) * 100

fig, ax1 = plt.subplots(figsize=(12, 6))

# CLTV distribution
ax1.bar(range(len(cltv_sorted)), cltv_sorted.values, color="steelblue", alpha=0.4, width=1.0)
ax1.set_xlabel("Customers (ranked by CLTV)")
ax1.set_ylabel("Individual CLTV (£)", color="steelblue")
ax1.tick_params(axis="y", labelcolor="steelblue")

# Cumulative line
ax2 = ax1.twinx()
ax2.plot(range(len(cumulative_pct)), cumulative_pct.values, color="#E74C3C", linewidth=2)
ax2.set_ylabel("Cumulative % of Total CLTV", color="#E74C3C")
ax2.tick_params(axis="y", labelcolor="#E74C3C")

# 80/20 reference lines
twenty_pct_idx = int(len(cltv_sorted) * 0.2)
eighty_pct_revenue = cumulative_pct.iloc[twenty_pct_idx - 1]  # cumulative share of the first twenty_pct_idx customers
ax2.axhline(y=80, color="gray", linestyle="--", alpha=0.5, label="80% of revenue")
ax2.axvline(x=twenty_pct_idx, color="gray", linestyle="--", alpha=0.5, label="Top 20% of customers")

plt.title(f"Pareto Curve: Top 20% of Customers Contribute {eighty_pct_revenue:.0f}% of Revenue")
ax2.legend(loc="center right")
plt.tight_layout()
plt.show()

print(f"Top 20% of customers ({twenty_pct_idx:,}) contribute {eighty_pct_revenue:.0f}% of total CLTV")
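
The top-20% share can also be sketched on a handful of invented values, which makes the arithmetic easy to check by hand. The numbers and `toy_` names below are hypothetical; the sort-then-cumulative-sum pattern is the same one the Pareto cell uses.

```python
import pandas as pd

# Hypothetical CLTV values for 10 customers (invented for illustration; they sum to 1,000)
toy_values = pd.Series([500, 200, 100, 60, 40, 30, 25, 20, 15, 10])

toy_sorted = toy_values.sort_values(ascending=False).reset_index(drop=True)
toy_cum_pct = toy_sorted.cumsum() / toy_sorted.sum() * 100

# Top 20% of 10 customers = 2 customers, stored at positions 0 and 1
toy_n_top = int(len(toy_sorted) * 0.2)
toy_top_share = toy_cum_pct.iloc[toy_n_top - 1]
print(f"Top {toy_n_top} customers hold {toy_top_share:.0f}% of the total")
```

Because positions start at 0, the cumulative share of the first `n` customers sits at position `n - 1`.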

In [None]:
# Part D: Revenue at Risk by Segment
rfm_cltv = rfm.merge(cltv_data[["CustomerID", "CLTV"]], on="CustomerID", how="left")

segment_summary = rfm_cltv.groupby("Customer_Segment").agg(
    Customers=("CustomerID", "count"),
    Total_Revenue=("Monetary", "sum"),
    Avg_CLTV=("CLTV", "mean"),
    Total_CLTV=("CLTV", "sum")
).round(0)

segment_summary["Pct_of_Customers"] = (segment_summary["Customers"] / segment_summary["Customers"].sum() * 100).round(1)
segment_summary["Pct_of_Revenue"] = (segment_summary["Total_Revenue"] / segment_summary["Total_Revenue"].sum() * 100).round(1)

# Reorder
seg_order = ["Champions", "Loyal", "Potential Loyalist", "At Risk", "Can't Lose", "Lost"]
segment_summary = segment_summary.reindex(seg_order)

print("Revenue at Risk by Segment:")
print(segment_summary.to_string())

# Highlight vulnerable segments
vulnerable = segment_summary.loc[["At Risk", "Can't Lose", "Lost"]]
vuln_customers = vulnerable["Customers"].sum()
vuln_revenue = vulnerable["Total_Revenue"].sum()
print(f"\n⚠️  Vulnerable segments (At Risk + Can't Lose + Lost):")
print(f"   {vuln_customers:,.0f} customers = {vuln_customers/segment_summary['Customers'].sum()*100:.0f}% of base")
print(f"   £{vuln_revenue:,.0f} in revenue at risk")

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 THIS IS THE TABLE THAT MATTERS</strong><br>
  This revenue-at-risk summary is what turns an analytical exercise into a board-level conversation. "We have 1,700+ customers representing over £1.5 million in revenue who are drifting away, and we have the data to intervene before they're gone."
</div>

### Your Turn #3

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ TRY THIS</strong><br>
  Filter to UK customers only (<code>df_clean[df_clean["Country"] == "United Kingdom"]</code>) and rebuild the RFM table. Does the Pareto curve steepen or flatten compared to the full dataset?
</div>

In [None]:
# Your Turn #3: UK-only analysis


<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 STOP AND CHECK: End of Session 1</strong><br>
  You should have:
  <ul>
    <li>An <code>rfm</code> DataFrame with ~4,300 customers, each with R/F/M scores and a segment label</li>
    <li>A <code>cltv_data</code> DataFrame with CLTV calculated for each customer</li>
    <li>A Pareto curve showing top 20% contributing ~60% of revenue</li>
    <li>A revenue-at-risk table showing £1.5M+ in vulnerable segments</li>
  </ul>
  <strong>Session 2 picks up this exact <code>rfm</code> DataFrame and asks: can K-Means do better than our manual rules?</strong>
</div>

---
# SESSION 2 (Chapter 7): K-Means Clustering and Segment Discovery

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 THE QUESTION</strong><br>
  Manual RFM segmentation uses human-defined thresholds (score ≥ 12 = Champion). But what if the data has natural groupings we can't see? <strong>K-Means clustering</strong> finds those groups mathematically, with no rules needed. Then we compare what the algorithm finds with what our rules say.
</div>

In [None]:
# New imports for Session 2
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

print("✅ sklearn imported, ready for clustering")

---
# Example 1: First K-Means on RFM Scores

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY SCALING MATTERS</strong><br>
  K-Means uses <strong>distance</strong> to assign customers to clusters. If Monetary ranges from 0–200,000 and R_Score ranges from 1–5, Monetary will dominate every distance calculation. <code>StandardScaler</code> puts all features on the same scale so each dimension has equal influence. The cell below clusters on the three 1–5 scores, which already share a range, so scaling changes little here; it matters most when you cluster on raw values.
</div>
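
A rough sketch of the dominance effect, using three invented customers described by raw Recency (days) and raw Monetary (GBP). The values, the helper `toy_dist`, and the other `toy_` names are hypothetical; `StandardScaler` is used the same way as in the demo.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [Recency in days, Monetary in GBP] (invented values)
toy_X = np.array([
    [5.0,    200.0],
    [300.0,  250.0],   # very different recency, similar spend
    [10.0,  5000.0],   # similar recency, very different spend
])

def toy_dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw units: the GBP column dominates, so customer 0 looks far closer to 1 than to 2
toy_raw_01 = toy_dist(toy_X[0], toy_X[1])
toy_raw_02 = toy_dist(toy_X[0], toy_X[2])
print("raw:   ", round(toy_raw_01, 1), round(toy_raw_02, 1))

# Standardized: each column has mean 0 and unit variance
toy_scaled = StandardScaler().fit_transform(toy_X)
toy_scaled_01 = toy_dist(toy_scaled[0], toy_scaled[1])
toy_scaled_02 = toy_dist(toy_scaled[0], toy_scaled[2])
print("scaled:", round(toy_scaled_01, 2), round(toy_scaled_02, 2))
```

In raw units the 4,800 GBP gap swamps the 295-day gap; after scaling, the two differences carry comparable weight in the distance.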

In [None]:
# Scale the RFM scores
features = rfm[["R_Score", "F_Score", "M_Score"]]
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# First K-Means with k=4 (a guess; we'll improve this)
kmeans_4 = KMeans(n_clusters=4, random_state=42, n_init=10)
rfm["Cluster_k4"] = kmeans_4.fit_predict(features_scaled)

# Show centroids in original score space
centroids = scaler.inverse_transform(kmeans_4.cluster_centers_)
centroid_df = pd.DataFrame(centroids, columns=["R_Score", "F_Score", "M_Score"])
centroid_df.index.name = "Cluster"

print("Cluster centroids (original score space):")
print(centroid_df.round(2))
print(f"\nCluster sizes:")
print(rfm["Cluster_k4"].value_counts().sort_index())

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 READING CENTROIDS</strong><br>
  Even with k=4 (a guess), you can already see cluster profiles forming: one cluster with high scores across the board (Champions-like) and one with low scores (Lost-like). But we guessed k. Let's let the data tell us the right number.
</div>

### Your Turn #4

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ TRY THIS</strong><br>
  Change <code>n_clusters</code> to 3, then to 6. Which k produces the most distinct cluster profiles?
</div>

In [None]:
# Your Turn #4: Try different k values


---
# Example 2: Elbow Method and Cluster Profiling

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  The <strong>Elbow Method</strong> plots the "cost" (inertia) of each k value. As k increases, inertia always drops, but at some point the improvement slows dramatically. That bend in the curve is the "elbow": the point of diminishing returns.
</div>
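
To make "inertia" less abstract, the sketch below fits K-Means on synthetic data with three well-separated groups (generated with `make_blobs`, not the retail data) and recomputes inertia by hand as the sum of squared distances from each point to its assigned centroid. The `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (an assumption for illustration)
toy_X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

toy_inertias = {}
toy_manual = {}
for toy_k in range(1, 7):
    toy_km = KMeans(n_clusters=toy_k, random_state=42, n_init=10).fit(toy_X)
    toy_inertias[toy_k] = toy_km.inertia_

    # Inertia by hand: squared distance from each point to its assigned centroid
    toy_assigned = toy_km.cluster_centers_[toy_km.labels_]
    toy_manual[toy_k] = float(((toy_X - toy_assigned) ** 2).sum())

for toy_k in toy_inertias:
    print(f"k={toy_k}: inertia = {toy_inertias[toy_k]:,.0f}  (by hand: {toy_manual[toy_k]:,.0f})")
```

On data like this the drop in inertia is large up to k=3 and small afterwards, which is the bend the Elbow Method looks for. Real data rarely shows such a clean elbow, so treat the choice of k as a judgment call.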

In [None]:
# Part A: Elbow Method (find optimal k)
inertias = []
k_range = range(2, 11)

for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(features_scaled)
    inertias.append(km.inertia_)

plt.figure(figsize=(10, 5))
plt.plot(k_range, inertias, "bo-", linewidth=2, markersize=8)
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.title("Elbow Method: Finding the Optimal k")

# Annotate the elbow
plt.annotate("Elbow ≈ k=5", xy=(5, inertias[3]),
             xytext=(7, inertias[3] + 1000),
             arrowprops=dict(arrowstyle="->", color="red"),
             fontsize=12, color="red", fontweight="bold")

plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

for k, inertia in zip(k_range, inertias):
    drop = "" if k == 2 else f"  (Δ = {inertias[k_range.index(k)-1] - inertia:,.0f})"
    print(f"  k={k}: Inertia = {inertia:,.0f}{drop}")

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 READING THE ELBOW</strong><br>
  Substantial drop from k=4 to k=5, then the curve flattens. Adding a 6th cluster doesn't improve things much; it just splits an existing group. <strong>k=5 is our choice.</strong>
</div>

In [None]:
# Part B: Fit K-Means with optimal k=5
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
rfm["Cluster"] = kmeans.fit_predict(features_scaled)

print("Cluster sizes (k=5):")
print(rfm["Cluster"].value_counts().sort_index())

In [None]:
# Part C: Cluster profiling heatmap
cluster_profile = rfm.groupby("Cluster")[["R_Score", "F_Score", "M_Score"]].mean()

plt.figure(figsize=(8, 5))
sns.heatmap(cluster_profile, annot=True, fmt=".2f", cmap="YlOrRd",
            xticklabels=["Recency", "Frequency", "Monetary"],
            linewidths=0.5)
plt.title("Cluster Profiles: Mean RFM Scores")
plt.ylabel("Cluster")
plt.tight_layout()
plt.show()

print("Cluster profile (mean scores):")
print(cluster_profile.round(2))

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 READING THE HEATMAP: Name Your Clusters</strong><br>
  Each row is a cluster. Hot colors = high scores. Look for patterns:
  <ul>
    <li>High R + High F + High M → <strong>Champions</strong></li>
    <li>Low R + Low F + Low M → <strong>Lost Customers</strong></li>
    <li>Low R + High F + High M → <strong>Drifting Big Spenders</strong> (the discovery!)</li>
    <li>High R + Low F + Low M → <strong>New / Casual Buyers</strong></li>
    <li>Medium across the board → <strong>Mid-Tier Regulars</strong></li>
  </ul>
  That third profile (high value but drifting away) is the cluster manual rules often miss. The algorithm found it because it doesn't respect our threshold boundaries.
</div>

### Your Turn #5

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ TRY THIS</strong><br>
  Add Silhouette Score: <code>from sklearn.metrics import silhouette_score</code> → compute for k=4, 5, and 6. Which k scores highest?
</div>

In [None]:
# Your Turn #5: Silhouette Score comparison


---
# Full Pipeline: K-Means + Manual Segment Comparison

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  We now have two segmentation systems on the same customers: human rules (6 segments) and K-Means (5 clusters). The power isn't in choosing one; it's in comparing them. Where do they agree? Where does the algorithm see something our rules missed?
</div>

In [None]:
# Pre-filled: 2D Scatter Plots (colored by cluster)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

scatter_pairs = [
    ("Recency", "Monetary", axes[0]),
    ("Recency", "Frequency", axes[1]),
    ("Frequency", "Monetary", axes[2])
]

cluster_colors = {0: "#E74C3C", 1: "#3498DB", 2: "#2ECC71", 3: "#F39C12", 4: "#9B59B6"}

for x_col, y_col, ax in scatter_pairs:
    for cluster in sorted(rfm["Cluster"].unique()):
        mask = rfm["Cluster"] == cluster
        ax.scatter(rfm.loc[mask, x_col], rfm.loc[mask, y_col],
                   c=cluster_colors.get(cluster, "gray"), label=f"Cluster {cluster}",
                   alpha=0.5, s=15)
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_title(f"{x_col} vs {y_col}")
    ax.legend(fontsize=8)

plt.suptitle("K-Means Clusters in 2D Space", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 NOTICE THE BOUNDARIES</strong><br>
  K-Means creates <em>diagonal</em> boundaries, so it captures patterns that axis-aligned thresholds (like our manual rules) cannot. A customer with R=3, F=4, M=5 might land in a different cluster than R=4, F=3, M=5 even though they have the same composite score.
</div>
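
The R=3/F=4/M=5 versus R=4/F=3/M=5 case can be sketched with two invented centroids. K-Means assigns each point to its nearest centroid, so two customers with the same composite score can still be split. The centroid positions, their names, and the helper `toy_nearest_centroid` are hypothetical, not output from the fitted model.

```python
import numpy as np

# Hypothetical centroids in (R, F, M) score space, invented for illustration
toy_centroids = {
    "drifting big spenders": np.array([2.0, 5.0, 5.0]),
    "recent big spenders":   np.array([5.0, 2.0, 5.0]),
}

def toy_nearest_centroid(point):
    """Return the name of the closest centroid by Euclidean distance."""
    return min(toy_centroids,
               key=lambda name: np.linalg.norm(point - toy_centroids[name]))

toy_a = np.array([3, 4, 5])   # R=3, F=4, M=5 -> composite 12
toy_b = np.array([4, 3, 5])   # R=4, F=3, M=5 -> composite 12

print(toy_a.sum(), toy_nearest_centroid(toy_a))
print(toy_b.sum(), toy_nearest_centroid(toy_b))
```

Both customers would be "Champions" under the composite-score rule, yet each sits closer to a different centroid.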

In [None]:
# Pre-filled: 3D Scatter Plot
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection="3d")

for cluster in sorted(rfm["Cluster"].unique()):
    mask = rfm["Cluster"] == cluster
    ax.scatter(rfm.loc[mask, "R_Score"], rfm.loc[mask, "F_Score"], rfm.loc[mask, "M_Score"],
               c=cluster_colors.get(cluster, "gray"), label=f"Cluster {cluster}",
               alpha=0.5, s=15)

ax.set_xlabel("Recency Score")
ax.set_ylabel("Frequency Score")
ax.set_zlabel("Monetary Score")
ax.set_title("K-Means Clusters in 3D RFM Space")
ax.legend(fontsize=8)
plt.tight_layout()
plt.show()

In [None]:
# Pre-filled: Crosstab of Manual Segments vs K-Means Clusters
crosstab = pd.crosstab(rfm["Customer_Segment"], rfm["Cluster"], margins=True)

# Reorder rows
row_order = ["Champions", "Loyal", "Potential Loyalist", "At Risk", "Can't Lose", "Lost", "All"]
crosstab = crosstab.reindex(row_order)

print("Manual Segments (rows) vs K-Means Clusters (columns):")
print(crosstab)

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 READING THE CROSSTAB</strong><br>
  <ul>
    <li><strong>Champions ↔ one cluster:</strong> These should align tightly, since the best customers are obvious to both methods</li>
    <li><strong>At Risk, scattered:</strong> Look at the "At Risk" row: it's spread across multiple clusters. The manual label covers customers with very different underlying behaviors.</li>
    <li><strong>The key discovery:</strong> One cluster pulls customers from 3–4 different manual segments. These are the "Drifting Big Spenders," with high F and M but low R. Our rules split them; the algorithm grouped them.</li>
  </ul>
</div>
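
One way to put a number on "aligned" versus "scattered" is to convert each crosstab row to shares. The sketch below uses eight invented label pairs (not real results) and `pd.crosstab` with `normalize="index"`; the `toy_` names are hypothetical.

```python
import pandas as pd

# Hypothetical labels for 8 customers (invented, not real results)
toy_labels = pd.DataFrame({
    "Customer_Segment": ["Champions", "Champions", "Champions",
                         "At Risk", "At Risk", "At Risk", "At Risk", "Lost"],
    "Cluster":          [0, 0, 0, 1, 2, 2, 3, 3],
})

# normalize="index" turns each row into shares that sum to 1
toy_share = pd.crosstab(toy_labels["Customer_Segment"], toy_labels["Cluster"],
                        normalize="index")
print(toy_share.round(2))

# Largest share per row: 1.0 means the segment maps onto a single cluster
toy_purity = toy_share.max(axis=1)
print(toy_purity)
```

In this toy table "Champions" sits entirely in one cluster, while "At Risk" has at most half its customers in any single cluster, which is the scattered pattern described above.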

---
## Your Turn: Revenue by Cluster + Key Discovery

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS: Live Class Participation</strong><br>
  Build the cluster revenue summary table and identify the key discovery cluster.
</div>

In [None]:
# YOUR CODE HERE: Revenue summary by cluster
# Build the same revenue-at-risk table from Session 1, but by Cluster instead of Segment
# Columns: Customers, Total_Revenue, Avg_CLTV, Total_CLTV, Pct_of_Customers, Pct_of_Revenue
# Hint: merge rfm with cltv_data, then groupby "Cluster"



In [None]:
# YOUR CODE HERE: Key Discovery
# Identify the "Drifting Big Spenders" cluster (low R, high F, high M)
# Print: customer count, total revenue, avg CLTV
# How many different manual segments were these customers split across?



---
## Takeaway

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ WHAT WE BUILT TODAY</strong><br>
  A complete customer segmentation pipeline using two approaches: human rules (RFM scoring) and algorithmic discovery (K-Means clustering). The real power isn't in choosing one; it's in using both and comparing.
</div>

**Supervised vs Unsupervised comparison:**

| | Weeks 3–4 (Supervised) | Week 5 (Unsupervised) |
|---|---|---|
| Target | Known (churn yes/no) | None; we're discovering structure |
| Question | "Will this customer churn?" | "What groups exist in our customer base?" |
| Algorithm | Logistic Regression, Neural Networks | K-Means Clustering |
| Validation | Accuracy, AUC, confusion matrix | Elbow Method, Silhouette Score, business sense |
| Output | Predictions per customer | Cluster labels per customer |

**Next week preview:** Same dataset, new question: what do these customers buy *together*? That's Market Basket Analysis.

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Week 5 Demo: RFM + CLTV + K-Means Clustering
</p>