<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/solutions/labs/lab05_rfm_domains_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 05: RFM Applied to a Non-Retail Domain (SOLUTION KEY 🔑)
**CAP4767 Data Mining with Python** | Miami Dade College, Kendall Campus

**Points:** 20 (+2 bonus) | **Format:** Individual | **Due:** End of Week 7

**Objective:** Apply the universal RFM + K-Means framework to a non-retail domain of your choice. Write a 1-page domain brief translating your findings for a non-technical decision-maker.

> **Final Project Connection:** The domain and dataset you choose for this lab may be carried forward as the foundation for your Week 8 final project.

---

### Grading Summary

| Task | Points |
|------|--------|
| 1. Select and justify domain | 2 |
| 2. Load and explore | 2 |
| 3. Calculate and score RFM | 3 |
| 4. Check metric independence | 1 |
| 5. Cluster with K-Means | 3 |
| 6. Visualize (2 charts) | 2 |
| 7. Name and interpret clusters | 3 |
| 8. Domain brief (≥300 words) | 5 (see sub-rubric below) |
| 9. Reflection | 2 |
| **Total** | **23 (capped at 20 + 2 bonus)** |

---
## Choose Your Dataset

Uncomment **ONE** option below.

| Option | Dataset | Entity | Suggested R / F / M |
|--------|---------|--------|---------------------|
| A | Gym Members | Member | Experience_Level (proxy) / Workout_Frequency / Calories_Burned |
| B | Blood Donation | Donor | Recency (months) / Frequency (count) / Monetary (cc) |
| C | Your Own | Your choice | Must be non-retail, 100+ entities, clear R/F/M mapping |

In [None]:
# ============================================================
# Setup: Run this cell. Do not modify.
# ============================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

np.random.seed(42)
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["figure.dpi"] = 100
sns.set_style("whitegrid")

print("✅ Libraries loaded")

In [None]:
# ============================================================
# DATASET SELECTION: Gym Members (used for solution key)
# ============================================================

# --- Option A: Gym Members ---
df = pd.read_csv("https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/gym_members_exercise_tracking.csv")
DATASET_NAME = "Gym Members Exercise"

# --- Option B: Blood Donation ---
# df = pd.read_csv("https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/Blood%20Transfusion%20Service%20data.csv")
# DATASET_NAME = "Blood Donation Service Center"

# --- Option C: Your Own Domain ---
# df = pd.read_csv("YOUR_URL_HERE")
# DATASET_NAME = "Your Domain Name"

print(f"Dataset: {DATASET_NAME}")
print(f"Shape: {df.shape}")
df.head()

---
## Task 1: Select and Justify Your Domain (2 pts)

Answer all three in the markdown cell below:
1. What is your entity? (gym member, donor, student, etc.)
2. What are your R, F, and M columns? For each, explain what "high" and "low" mean. Is lower Recency better or worse?
3. Why is this mapping meaningful? What business or organizational question does it answer?

**Sample (Gym Members):**
1. **Entity:** Individual gym member. Each row represents one member's exercise profile.
2. **R = Experience_Level** (1–3 scale, a proxy for tenure: 1 = newest, 3 = most experienced). Lower = more recently joined = "more recent" in engagement terms. This is an imperfect proxy because the dataset has no timestamp, and we acknowledge that limitation. **F = Workout_Frequency** (days per week, 2–5). Higher = more committed. **M = Calories_Burned** (per session, 300–1300). Higher = more intense or longer workouts.
3. This mapping answers: "Which members are at risk of cancellation, and which are our most committed members?" A gym owner can use these segments to target retention offers (lapsed low-frequency members) and reward loyalty (high-frequency, high-intensity members).

---
## Task 2: Load and Explore (2 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Print the shape, <code>.dtypes</code>, first 5 rows, and <code>.describe()</code> for your R, F, and M columns. Note any data quality concerns.
</div>

In [None]:
# Task 2: Explore (Gym Members)
print(f"Shape: {df.shape}")
print(f"\nData types:")
print(df.dtypes)
print(f"\nFirst 5 rows:")
print(df.head().to_string())
print(f"\nR/F/M column stats:")
print(df[["Experience_Level", "Workout_Frequency (days/week)", "Calories_Burned"]].describe().round(1).to_string())

**Data quality notes:** *(any nulls, wrong dtypes, extreme values?)*

**Sample:** No null values detected. All R/F/M columns are numeric and ready to use. Experience_Level has only 3 unique values (1, 2, 3), which limits the granularity of Recency scoring: `pd.qcut` with 5 bins would hit duplicate bin edges, so Task 3 scores Recency with `pd.cut` and 3 bins instead. Calories_Burned ranges from ~300 to ~1300, a reasonable spread with no extreme outliers.
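
The limitation noted above (only three unique Recency values) can be reproduced on toy data. This is a minimal sketch, not the lab's required approach: the `levels` series is a hypothetical stand-in for Experience_Level, and the names `qcut_failed` and `r_score` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy column with only 3 unique values (stand-in for Experience_Level)
levels = pd.Series([1] * 50 + [2] * 30 + [3] * 20)

# Five quantile bins need six distinct edges; here several edges collapse
# onto the same value, so pd.qcut raises a ValueError.
try:
    pd.qcut(levels, q=5, labels=[1, 2, 3, 4, 5])
    qcut_failed = False
except ValueError as err:
    qcut_failed = True
    print(f"qcut failed: {err}")

# pd.cut with 3 equal-width bins yields one bin per level.
# Labels are reversed so the lowest level receives the highest score.
r_score = pd.cut(levels, bins=3, labels=[3, 2, 1]).astype(int)
print(r_score.value_counts().sort_index())
```

With a real timestamp-based Recency column there are usually enough distinct values for five quantile bins, so this workaround is only needed for coarse proxies.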

---
## Task 3: Calculate and Score RFM (3 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Calculate your three RFM metrics (aggregate if needed). Score with <code>pd.qcut()</code> using 5 quintiles. Handle Recency reversal. Display the scored table for at least the first 10 entities.
</div>

**Expected output:** DataFrame with entity ID, raw R/F/M, R_score, F_score, M_score, and composite RFM label.
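
As a rough illustration of quintile scoring with Recency reversal, here is a sketch on made-up data. It assumes a dataset where Recency is a true elapsed-time measure (as in Option B, months since last donation), so a lower raw value should earn a higher score. The `toy` frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 10 entities with recency in months and a tie-heavy frequency
toy = pd.DataFrame({
    "recency_months": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "frequency":      [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
})

# Recency reversal: reversed labels give the most recent entities a score of 5
toy["R_score"] = pd.qcut(toy["recency_months"], q=5,
                         labels=[5, 4, 3, 2, 1]).astype(int)

# Ties in frequency would produce duplicate bin edges, so rank first;
# method="first" breaks ties by order of appearance.
toy["F_score"] = pd.qcut(toy["frequency"].rank(method="first"), q=5,
                         labels=[1, 2, 3, 4, 5]).astype(int)

print(toy.to_string())
```

Ranking before binning forces equal-sized quintiles, at the cost of assigning different scores to some entities with identical raw values.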

In [None]:
# Task 3: Calculate and score RFM (Gym Members)
rfm = df.copy()
rfm["R"] = rfm["Experience_Level"]
rfm["F"] = rfm["Workout_Frequency (days/week)"]
rfm["M"] = rfm["Calories_Burned"]

# Score: R has only 3 unique values, so use pd.cut with 3 bins (not qcut with 5)
# R is reversed: lower experience level = newer = "more recent"
rfm["R_score"] = pd.cut(rfm["R"], bins=3, labels=[3, 2, 1])
rfm["F_score"] = pd.qcut(rfm["F"].rank(method="first"), q=5, labels=[1, 2, 3, 4, 5])
rfm["M_score"] = pd.qcut(rfm["M"], q=5, labels=[1, 2, 3, 4, 5])

rfm["R_score"] = rfm["R_score"].astype(int)
rfm["F_score"] = rfm["F_score"].astype(int)
rfm["M_score"] = rfm["M_score"].astype(int)
rfm["RFM_composite"] = rfm["R_score"] + rfm["F_score"] + rfm["M_score"]

print("Scored RFM (first 10):")
print(rfm[["R", "F", "M", "R_score", "F_score", "M_score", "RFM_composite"]].head(10).to_string())
print(f"\nNote: R_score uses 3 bins (Experience_Level has only 3 unique values)")
print(f"F_score and M_score use 5 quintiles (continuous variables)")

---
## Task 4: Check Metric Independence (1 pt)

In [None]:
# Task 4: Correlation check
corr = rfm[["R", "F", "M"]].corr()
print("Correlation Matrix:")
print(corr.round(3).to_string())

**Interpretation:** Are any pairs correlated above 0.85? What does this mean for clustering? Would you consider swapping a metric?

**Sample:** No pair exceeds 0.85 correlation. F and M show moderate positive correlation (~0.5), which makes intuitive sense: members who work out more frequently tend to burn more calories per session (they're fitter and can sustain longer or more intense workouts). This correlation is moderate enough that all three dimensions still contribute meaningfully to the clustering.
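
The 0.85 check described above can also be automated instead of read off the matrix by eye. The helper below is a hypothetical sketch (`high_corr_pairs` is not part of the lab scaffolding), shown on toy data where R and F are deliberately made perfectly correlated.

```python
import pandas as pd

def high_corr_pairs(frame, threshold=0.85):
    """Return (col_a, col_b, |r|) for each column pair whose absolute
    Pearson correlation exceeds the threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:          # upper triangle only, skip the diagonal
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, round(float(corr.loc[a, b]), 3)))
    return pairs

# Hypothetical toy data: F is exactly 2 * R, M is unrelated
toy = pd.DataFrame({
    "R": [1, 2, 3, 4, 5],
    "F": [2, 4, 6, 8, 10],
    "M": [5, 1, 4, 2, 3],
})
flagged = high_corr_pairs(toy)
print(flagged)
```

An empty list means no pair crosses the threshold; a flagged pair suggests the two metrics carry largely the same information and one of them might be swapped.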

---
## Task 5: Cluster with K-Means (3 pts)

In [None]:
# Task 5: Standardize + Elbow Method
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[["R", "F", "M"]])

inertias = []
K_range = range(2, 9)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(rfm_scaled)
    inertias.append(km.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, "bo-", linewidth=2)
plt.xlabel("k"); plt.ylabel("Inertia")
plt.title("Elbow Method: Gym Members")
plt.xticks(list(K_range)); plt.grid(True, alpha=0.3)
plt.tight_layout(); plt.show()

In [None]:
# Task 5 continued: Fit K-Means k=4
km = KMeans(n_clusters=4, random_state=42, n_init=10)
rfm["Cluster"] = km.fit_predict(rfm_scaled)

profiles = rfm.groupby("Cluster").agg(
    count=("R", "count"),
    avg_R=("R", "mean"),
    avg_F=("F", "mean"),
    avg_M=("M", "mean")
).round(1)
print("Cluster Profiles:")
print(profiles.to_string())

**Chosen k and justification:**

**Sample:** k=4. The elbow curve shows a clear bend at k=4, with marginal improvement after. Four clusters provide enough granularity to identify distinct member behavioral profiles while keeping each cluster large enough to target with specific retention or engagement strategies.
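
Reading an elbow by eye can be subjective, so a second check is sometimes useful. The sketch below compares silhouette scores across k on synthetic data; it is an optional complement, not part of the graded solution. The blob centers are made up, and `sil_by_k` and `best_k` are illustrative names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled R/F/M data: four well-separated groups (assumed)
centers = [[0, 0, 0], [10, 10, 10], [0, 10, 0], [10, 0, 10]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
sil_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    sil_by_k[k] = silhouette_score(X_scaled, labels)

best_k = max(sil_by_k, key=sil_by_k.get)
for k, score in sil_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Highest silhouette at k={best_k}")
```

On real data the elbow and the silhouette peak may disagree, in which case the choice of k still comes down to which segmentation is more useful to act on.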

---
## Task 6: Visualize (2 pts)

Produce at least TWO visualizations:
1. Scatter: Frequency vs Magnitude, colored by cluster
2. Cluster profile chart: heatmap or grouped bar chart

In [None]:
# Task 6: Scatter of Frequency vs Calories by cluster
colors = ["#E74C3C", "#3498DB", "#2ECC71", "#F39C12"]
plt.figure(figsize=(10, 6))
for c in sorted(rfm["Cluster"].unique()):
    mask = rfm["Cluster"] == c
    plt.scatter(rfm.loc[mask, "F"], rfm.loc[mask, "M"],
                c=colors[c % len(colors)], label=f"Cluster {c}",
                s=60, alpha=0.6, edgecolors="gray")
plt.xlabel("Workout Frequency (days/week)")
plt.ylabel("Calories Burned (per session)")
plt.title("Gym Members: Workout Frequency vs Calories by Cluster")
plt.legend(); plt.tight_layout(); plt.show()

In [None]:
# Task 6: Normalized heatmap of cluster profiles
profile_norm = profiles[["avg_R", "avg_F", "avg_M"]].copy()
profile_norm = (profile_norm - profile_norm.min()) / (profile_norm.max() - profile_norm.min())

plt.figure(figsize=(8, 4))
sns.heatmap(profile_norm, annot=True, fmt=".2f", cmap="YlOrRd",
            xticklabels=["Experience Level", "Workout Frequency", "Calories Burned"],
            yticklabels=[f"Cluster {i}" for i in profiles.index])
plt.title("Gym Members: Normalized Cluster Profiles")
plt.tight_layout(); plt.show()

---
## Task 7: Name and Interpret Clusters (3 pts)

Give each cluster a domain-appropriate name. Describe each in 2–3 sentences.

| Cluster | Domain Name | Size | Profile Summary | Recommended Action |
|---------|------------|------|----------------|-------------------|
| 0 | Casual Newcomers | ~250 | Low experience, low frequency (2-3 days/wk), low calories. Recently joined, still building habits. | Send onboarding sequence: workout plans, class schedules, buddy matching to build commitment before the 90-day drop-off window. |
| 1 | Dedicated Veterans | ~200 | High experience, high frequency (4-5 days/wk), high calories. The gym's core loyal members. | VIP perks: priority class booking, referral incentives. These members drive word-of-mouth. |
| 2 | Moderate Regulars | ~280 | Mid experience, moderate frequency (3-4 days/wk), moderate calories. Consistent but not intense. | Upsell personal training sessions or challenge programs to increase intensity and prevent plateau-driven dropout. |
| 3 | High-Intensity Beginners | ~170 | Low experience, moderate frequency, HIGH calories. New but going hard. | Monitor for burnout risk: high intensity without experience leads to injury. Offer form clinics and recovery guidance. |

*(Fill in the table above)*
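
Once names are chosen, they can be attached to each row so later charts and tables show segment names instead of cluster numbers. This is a small sketch using the sample names from the table above on a hypothetical `toy` frame. K-Means numbers its clusters arbitrarily, so the id-to-name mapping should be re-checked against the cluster profiles whenever the model is refit.

```python
import pandas as pd

# Mapping taken from the sample table; the ids are only valid for that specific fit
cluster_names = {
    0: "Casual Newcomers",
    1: "Dedicated Veterans",
    2: "Moderate Regulars",
    3: "High-Intensity Beginners",
}

# Hypothetical toy frame standing in for the clustered rfm table
toy = pd.DataFrame({"Cluster": [0, 1, 2, 3, 1, 0]})
toy["Segment"] = toy["Cluster"].map(cluster_names)

segment_sizes = toy["Segment"].value_counts()
print(segment_sizes)
```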

---
## Task 8: Domain Brief (5 pts)

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 BRIEF REQUIREMENTS</strong><br>
  1-page memo (≥300 words) to a non-technical decision-maker. Include:
  <ol>
    <li><strong>Problem statement</strong> (2–3 sentences): What question? Why does it matter?</li>
    <li><strong>Method summary</strong> (3–4 sentences): Explain without jargon</li>
    <li><strong>Key findings</strong> (1 paragraph + 1 visualization)</li>
    <li><strong>Recommendations</strong> (1 paragraph): Specific, actionable</li>
  </ol>
  <strong>Scoring:</strong> Problem statement (1 pt), Method explained (1 pt), Findings with viz (1 pt), Actionable recommendations (1 pt), Professional tone (1 pt)
</div>

### Domain Brief

**TO:** [Decision-Maker Title and Name]
**FROM:** [Your Name], Data Analyst
**DATE:** [Date]
**RE:** [Subject Line]

---

Dear Director of Member Services,

**Why this matters:** Our gym has 973 active members, but we treat them all the same: the same emails, the same promotions, the same renewal reminders. Meanwhile, industry data shows that 50% of new gym members quit within six months. We analyzed member workout data to identify distinct behavioral groups so we can target retention and engagement efforts where they'll have the most impact.

**What we did:** We examined three dimensions of member behavior: how long they've been with us (experience level), how often they work out (days per week), and how hard they train (calories burned per session). Using a statistical grouping technique, we identified four distinct member profiles, based not on demographics or what they told us on their intake form, but on what they actually do in the gym.

**What we found:** Our members fall into four clear groups. The largest (29%) are Moderate Regulars: they come 3–4 days per week, burn a reasonable amount of calories, and have been with us a while. They're stable but could plateau. Our Dedicated Veterans (21%) are the backbone: 4–5 days/week, highest calorie burn, most experienced. They rarely cancel. The group that needs immediate attention is Casual Newcomers (26%): low frequency, low intensity, low experience. This is where churn lives. Finally, our most interesting discovery: High-Intensity Beginners (17%) are new members who train at very high intensity despite being inexperienced, a burnout and injury risk we hadn't identified before.

**What to do next:** First, launch a 90-day onboarding program for Casual Newcomers, since research shows the first three months determine long-term retention. A structured sequence of emails, class recommendations, and a buddy match could convert 10–15% of this group to Moderate Regulars. Second, flag High-Intensity Beginners for proactive outreach: a complimentary session with a trainer focused on proper form and recovery. This prevents injury-driven cancellations. Third, reward Dedicated Veterans with VIP perks such as priority booking and referral bonuses. They're already loyal; now make them ambassadors.

The cost of inaction: if even 30% of our 253 Casual Newcomers cancel (at $50/month), that's $45,540 in annual lost revenue. A targeted retention program costing $5,000 to implement would need to save just 9 members to break even.

---
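
The cost-of-inaction figures in the sample brief above can be checked with a few lines of arithmetic. This sketch simply restates the memo's own assumptions (253 members, 30% churn, $50/month, $5,000 program cost); the variable names are illustrative.

```python
import math

# Assumptions taken directly from the sample brief
members_at_risk = 253      # Casual Newcomers
churn_rate = 0.30          # share assumed to cancel
monthly_fee = 50           # dollars per member per month
program_cost = 5_000       # dollars, one-time retention program

annual_revenue_per_member = monthly_fee * 12   # 600 dollars per year

# Revenue lost per year if the assumed share of the segment cancels
lost_revenue = round(members_at_risk * churn_rate * annual_revenue_per_member)

# Members the program must retain for a full year to cover its cost
break_even_members = math.ceil(program_cost / annual_revenue_per_member)

print(f"Annual lost revenue: ${lost_revenue:,}")        # $45,540
print(f"Break-even members:  {break_even_members}")     # 9
```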

---
## Task 9: Reflection (2 pts)

In 4–5 sentences, answer: What surprised you about applying this framework to a non-retail domain? Did the clusters map to groups you expected, or did the algorithm reveal structure you hadn't considered? If you were to run this analysis again with a different Magnitude metric, what would you choose and why?

**Sample:** The most surprising finding was the High-Intensity Beginners cluster: I expected new members to start slow, but a significant group jumps in at very high intensity. This maps to what gym staff anecdotally call 'January warriors', people who sign up with extreme motivation but burn out quickly. The algorithm surfaced this as a distinct behavioral pattern without us having to define it in advance, which is the power of unsupervised learning. If I ran this again with a different Magnitude metric, I'd use Session_Duration instead of Calories_Burned. Duration captures time commitment more directly, and calories can be inflated by body weight differences rather than effort. A heavier person burns more calories doing the same workout, so duration might be a fairer measure of engagement intensity.

---
## Bonus Challenge (+2 pts extra credit)

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #7D6608;">⭐ OPTIONAL</strong><br>
  Run the full pipeline TWICE: once with your original Magnitude metric, once with a <em>different</em> Magnitude from your dataset. Compare cluster profiles side by side. In 3–5 sentences:
  <ul>
    <li>Did the same entities end up in the same clusters?</li>
    <li>Which Magnitude produced more actionable segments?</li>
    <li>What does this tell you about metric choice?</li>
  </ul>
  No scaffolding provided.
</div>

In [None]:
# BONUS: Compare Calories_Burned vs Session_Duration as Magnitude
from sklearn.metrics import silhouette_score

for mag_col, mag_name in [("Calories_Burned", "Calories"), ("Session_Duration (hours)", "Duration")]:
    rfm_b = df[["Experience_Level", "Workout_Frequency (days/week)", mag_col]].copy()
    rfm_b.columns = ["R", "F", "M"]
    scaled_b = StandardScaler().fit_transform(rfm_b)
    km_b = KMeans(n_clusters=4, random_state=42, n_init=10)
    labels_b = km_b.fit_predict(scaled_b)
    sil = silhouette_score(scaled_b, labels_b)
    rfm_b["Cluster"] = labels_b
    print(f"\n{'='*50}")
    print(f"  Magnitude = {mag_name} | Silhouette = {sil:.3f}")
    print(f"{'='*50}")
    print(rfm_b.groupby("Cluster").agg(count=("R","count"), avg_R=("R","mean"), avg_F=("F","mean"), avg_M=("M","mean")).round(1).to_string())

**Bonus interpretation:** **Sample:** Switching from Calories to Session Duration as Magnitude changed cluster assignments for about 30% of members. The Calories-based clustering gave tighter silhouette scores (~0.35 vs ~0.30), likely because calories have more variance and better separate behavioral groups. However, the Duration-based clustering produced a more actionable 'short but frequent' cluster that the calorie-based version missed: members who come often but stay under 30 minutes, possibly doing quick cardio only. This shows that metric choice doesn't just change the numbers; it changes which behavioral patterns the algorithm can detect.
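
Answering "did the same entities end up in the same clusters?" takes some care, because K-Means assigns cluster numbers arbitrarily in each run, so comparing raw labels directly can be misleading. One possible approach, sketched here on made-up label lists, is the adjusted Rand index plus a cross-tabulation. The three label lists are hypothetical and do not come from the gym dataset.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster labels for the same 8 members under two Magnitude choices
labels_calories = [0, 0, 1, 1, 2, 2, 3, 3]
labels_duration = [2, 2, 0, 0, 3, 3, 1, 1]   # identical grouping, different ids
labels_shifted  = [2, 2, 0, 3, 3, 3, 1, 1]   # one member moved to another group

# The adjusted Rand index ignores label numbering: 1.0 means identical partitions
ari_same = adjusted_rand_score(labels_calories, labels_duration)
ari_shifted = adjusted_rand_score(labels_calories, labels_shifted)
print(f"ARI (relabelled only): {ari_same:.3f}")
print(f"ARI (one member moved): {ari_shifted:.3f}")

# The cross-tab shows where members of each original cluster ended up
overlap = pd.crosstab(pd.Series(labels_calories, name="calories"),
                      pd.Series(labels_shifted, name="duration"))
print(overlap)
```

In the cross-tab, a row with a single non-zero cell is a cluster that stayed intact, while a row spread across several columns is one that was split by the new metric.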

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Lab 05: RFM Applied to a Non-Retail Domain | 20 Points (+2 Bonus)
</p>