<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/solutions/exercises/week07_group_exercise_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 7 Group Exercise - SOLUTION KEY 🔑 - RFM + K-Means on a Non-Retail Domain
**CAP4767 Data Mining with Python** | Miami Dade College - Kendall Campus

**Points:** 10 | **Duration:** ~40 minutes | **Deliverable:** Completed notebook + 3-minute presentation

**Objective:** Apply the universal RFM + K-Means pipeline to a non-retail domain dataset. Define your own metric mapping, run the clustering pipeline, name your clusters, and present one intervention recommendation per cluster.

### Group Members & Roles

| Role | Name | Responsibility |
|------|------|----------------|
| 🖥️ **Lead Coder** | | Types code, runs cells |
| 📊 **Data Interpreter** | | Explains outputs, writes markdown cells |
| 🎤 **Presenter** | | Delivers 3-minute share-out |
| ✅ **QA Reviewer** | | Checks outputs against checkpoints |

*If 3 members, QA Reviewer + Presenter combine.*

### Dataset Selection

Each group selects ONE dataset. No two groups should use the same dataset.

| Dataset | Entity | Suggested R / F / M |
|---------|--------|---------------------|
| **Gym Members** | Member | Experience_Level (proxy) / Workout_Frequency / Calories_Burned |
| **Blood Donation** | Donor | Recency (months) / Frequency (donations) / Monetary (cc blood) |
| **UFO Sightings (by city)** | City | Days since last report / Report count / Median duration |

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Run the setup cell below. Do not modify.
</div>

In [None]:
# ============================================================
# Setup - Run this cell. Do not modify.
# ============================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

np.random.seed(42)
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["figure.dpi"] = 100
sns.set_style("whitegrid")

print("✅ All libraries loaded")

In [None]:
# ============================================================
# DATASET SELECTION - Uncomment ONE option
# ============================================================

# --- Option A: Gym Members ---
# df = pd.read_csv("https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/gym_members_exercise_tracking.csv")
# DATASET_NAME = "Gym Members Exercise"

# --- Option B: Blood Donation ---
df = pd.read_csv("https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/Blood%20Transfusion%20Service%20data.csv")
DATASET_NAME = "Blood Donation Service Center"

# --- Option C: UFO Sightings (by city) ---
# ufo_cols = ["datetime","city","state","country","shape","duration_seconds","duration_hours_min","comments","date_posted","latitude","longitude"]
# df = pd.read_csv("https://raw.githubusercontent.com/planetsig/ufo-reports/refs/heads/master/csv-data/ufo-scrubbed-geocoded-time-standardized.csv", names=ufo_cols, low_memory=False)
# DATASET_NAME = "UFO Sightings by City"

print(f"Dataset: {DATASET_NAME}")
print(f"Shape: {df.shape}")
df.head()

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 GROUP DISCUSSION (before coding, 3 minutes)</strong><br>
  <ol>
    <li>What does "Recency" mean in your chosen domain? Is low Recency good or bad?</li>
    <li>What does "Magnitude" mean? Could Frequency and Magnitude be correlated?</li>
    <li>If clustering produces an outlier cluster, what would it most likely represent: a data quality issue or a genuine extreme case?</li>
  </ol>
</div>

**Our group's answers (2–3 sentences each):**

**Sample (Blood Donation):** (1) Recency = months since last donation. Low Recency is good: the donor gave blood recently and is still active. High Recency means they have not donated in a long time and may have churned. (2) Magnitude = total blood donated in cc. It is directly proportional to Frequency (each donation is ~250cc), so F and M will be highly correlated. We should note this limitation: effectively we are clustering on two dimensions, not three. (3) An outlier cluster would most likely be a small group of extremely frequent donors (50+ donations over many years). They are the blood bank's most valuable donors: genuine extreme cases, not data errors.

---
## Task 1 - Load and Explore (1 pt)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Print the shape, column names, first 5 rows, and <code>.describe()</code> for numeric columns. Confirm your planned R/F/M columns are present and numeric.
</div>

In [None]:
# Task 1: Explore (Blood Donation)
print(f"Shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst 5 rows:")
print(df.head().to_string())
print(f"\nDescriptive stats:")
print(df.describe().round(1).to_string())

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 CHECKPOINT 1</strong><br>
  Confirm your R/F/M columns are numeric. If any need conversion (dates, text), handle it before proceeding.
</div>
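
One way to run this check is sketched below. The small frame and the `check_numeric` helper are illustrative assumptions, not part of the exercise data; in the notebook you would pass `df` and your own column names.

```python
import pandas as pd

def check_numeric(frame, cols):
    """Return a dict mapping each column name to True if its dtype is numeric."""
    return {c: pd.api.types.is_numeric_dtype(frame[c]) for c in cols}

# Hypothetical example: Frequency was read in as text
demo = pd.DataFrame({
    "Recency": [2, 0, 1],
    "Frequency": ["50", "13", "n/a"],
    "Monetary": [12500, 3250, 4000],
})
rfm_cols = ["Recency", "Frequency", "Monetary"]
print(check_numeric(demo, rfm_cols))

# Convert text to numbers; unparseable entries become NaN instead of raising
demo["Frequency"] = pd.to_numeric(demo["Frequency"], errors="coerce")
print(check_numeric(demo, rfm_cols))
print(demo["Frequency"].isna().sum(), "value(s) could not be converted")
```

Rows that come back as NaN need a decision (drop or impute) before scaling, since K-Means cannot handle missing values.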

---
## Task 2 - Define and Calculate RFM Metrics (2 pts)

State your metric mapping, then calculate.

**Our RFM Mapping:**
- R = Recency: months since last donation (lower = more recent = better)
- F = Frequency: total number of donations (higher = more committed)
- M = Monetary: total blood donated in cc (higher = more contributed; note that it is almost perfectly correlated with F at ~250cc per donation)

In [None]:
# Task 2: RFM is already in the dataset
rfm = df[["id", "Recency", "Frequency", "Monetary"]].copy()
rfm.columns = ["DonorID", "R", "F", "M"]
print(f"RFM table: {rfm.shape[0]} donors")
print(rfm.head(10).to_string(index=False))

---
## Task 3 ‚Äî Check Correlations (1 pt)

In [None]:
# Task 3: Correlation check
corr = rfm[["R", "F", "M"]].corr()
print("Correlation Matrix:")
print(corr.round(3).to_string())
print(f"\nF-M correlation: {corr.loc['F','M']:.3f} (as expected, nearly perfect)")

**Interpretation:** Are any pairs correlated above 0.85? What does that mean?

**Sample:** F and M are correlated at ~0.99, essentially a perfect linear relationship (each donation = ~250cc). This means K-Means is effectively clustering on two dimensions (R and F), not three; the M column adds no independent information. For a more informative analysis, we could swap M for the Time column (months since first donation) to add a genuinely independent dimension.
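
The redundancy can be checked numerically. The sketch below uses made-up donation counts and assumes Monetary is exactly 250 cc per donation; the `zscore` helper is illustrative.

```python
import numpy as np

# Made-up donation counts; Monetary is assumed to be exactly 250 cc per donation
F = np.array([1, 2, 4, 7, 12, 20], dtype=float)
M = 250 * F

def zscore(x):
    """Standardize to mean 0 and unit (population) standard deviation."""
    return (x - x.mean()) / x.std()

r = np.corrcoef(F, M)[0, 1]
print(f"Pearson r(F, M) = {r:.3f}")

# After standardization the two columns are numerically identical
same = np.allclose(zscore(F), zscore(M))
print("z(F) equals z(M):", same)
```

Because the standardized columns coincide, keeping both does more than add nothing: the squared Euclidean distance on (R, F, M) becomes dR² + 2·dF², so Frequency counts twice as much as Recency.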

---
## Task 4 - Standardize and Elbow Method (2 pts)
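
Standardizing matters because K-Means relies on Euclidean distance, and M (cc) is on a far larger scale than R (months). The following sketch, built on made-up rows, compares how much of the distance between two donors comes from M before and after a column-wise z-score (mean 0, population standard deviation 1). The `m_share` helper and the values are assumptions for illustration.

```python
import numpy as np

# Made-up rows: [R in months, F in donations, M in cc]
X = np.array([
    [2.0, 50.0, 12500.0],
    [0.0, 13.0, 3250.0],
    [14.0, 1.0, 250.0],
    [4.0, 4.0, 1000.0],
])

def m_share(a, b):
    """Fraction of the squared Euclidean distance between a and b due to column M."""
    diff_sq = (a - b) ** 2
    return diff_sq[2] / diff_sq.sum()

# Column-wise z-score: subtract the mean, divide by the population std (ddof=0)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

raw_share = m_share(X[0], X[2])
scaled_share = m_share(X_scaled[0], X_scaled[2])
print(f"Share of distance from M, raw:    {raw_share:.4f}")
print(f"Share of distance from M, scaled: {scaled_share:.4f}")
```

On raw values, M dominates the distance almost entirely; after scaling, each column contributes on a comparable footing.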

In [None]:
# Task 4: Standardize + Elbow
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[["R", "F", "M"]])

inertias = []
K_range = range(2, 9)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(rfm_scaled)
    inertias.append(km.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, "bo-", linewidth=2)
plt.xlabel("k"); plt.ylabel("Inertia")
plt.title("Elbow Method - Blood Donation")
plt.xticks(list(K_range)); plt.grid(True, alpha=0.3)
plt.tight_layout(); plt.show()

**Our chosen k:** *(number)* - *(justification)*

**Sample:** k=4. The elbow curve shows a clear bend at k=4, with diminishing returns after that. Four clusters give us enough granularity to separate high-frequency from low-frequency donors while keeping each cluster large enough to be actionable.
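
"Diminishing returns" can be made a little more concrete by printing the percentage drop in inertia at each step. The inertia values below are invented for illustration; in the notebook you would pass the `inertias` list from the elbow cell. The `pct_drops` helper is hypothetical, and the result is a heuristic aid, not a rule.

```python
# Invented inertia values for k = 2..8, for illustration only
ks = list(range(2, 9))
example_inertias = [1500.0, 1000.0, 650.0, 560.0, 490.0, 440.0, 400.0]

def pct_drops(values):
    """Percentage decrease in inertia from each k to the next."""
    return [100 * (a - b) / a for a, b in zip(values, values[1:])]

drops = pct_drops(example_inertias)
for k, d in zip(ks[1:], drops):
    print(f"k={k - 1} -> k={k}: inertia falls {d:.1f}%")
```

With these invented numbers the drop is large up to k=4 and much smaller afterward, which is the pattern an elbow at k=4 would produce.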

---
## Task 5 - Fit K-Means and Profile Clusters (2 pts)

In [None]:
# Task 5: Fit K-Means k=4
km = KMeans(n_clusters=4, random_state=42, n_init=10)
rfm["Cluster"] = km.fit_predict(rfm_scaled)

profiles = rfm.groupby("Cluster").agg(
    count=("DonorID", "count"),
    avg_R=("R", "mean"),
    avg_F=("F", "mean"),
    avg_M=("M", "mean")
).round(1)
print("Cluster Profiles:")
print(profiles.to_string())

**Interpretation:**
- Best cluster: *(which and why)*
- Most concerning: *(which and why)*
- Outlier cluster? *(if any: what might it represent)*

**Sample:**
- Best cluster: the cluster with the lowest avg R and highest avg F. These are the recently active, high-frequency donors, the blood bank's most reliable source.
- Most concerning: the cluster with the highest avg R and lowest avg F. These donors gave once or twice long ago and have not returned. Reactivation campaigns should target them.
- Outlier cluster: the cluster with avg_F > 30 is a small group of super-donors who have given 40–50+ times over decades. These are genuine extreme cases (not data errors) and deserve VIP recognition.
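
K-Means label numbers are arbitrary, so it is safer to look clusters up by their profile than by their id. The sketch below uses an invented profile table with the same column names as `profiles`; the rank-sum scoring rule is an assumption chosen for simplicity, not the only reasonable one.

```python
import pandas as pd

# Invented profile table with the same columns as `profiles` above
demo_profiles = pd.DataFrame(
    {
        "count": [200, 250, 30, 268],
        "avg_R": [20.5, 5.1, 3.2, 4.0],
        "avg_F": [1.8, 8.4, 35.0, 2.1],
    },
    index=pd.Index([0, 1, 2, 3], name="Cluster"),
)

# Assumed scoring rule: rank recency ascending (recent is good) and
# frequency descending (frequent is good), then add the two ranks.
score = (
    demo_profiles["avg_R"].rank(ascending=True)
    + demo_profiles["avg_F"].rank(ascending=False)
)
best = score.idxmin()
most_concerning = score.idxmax()
print(f"Best cluster: {best}, most concerning: {most_concerning}")
```

A very small `count` next to an extreme `avg_F` is a quick flag for a possible outlier cluster worth inspecting by hand.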

---
## Task 6 - Visualize (1 pt)

In [None]:
# Task 6: Scatter plot
colors = ["#E74C3C", "#3498DB", "#2ECC71", "#F39C12"]
plt.figure(figsize=(10, 6))
for c in sorted(rfm["Cluster"].unique()):
    mask = rfm["Cluster"] == c
    plt.scatter(rfm.loc[mask, "F"], rfm.loc[mask, "M"],
                c=colors[c % len(colors)], label=f"Cluster {c}",
                s=60, alpha=0.6, edgecolors="gray")
plt.xlabel("Frequency (total donations)")
plt.ylabel("Monetary (total cc donated)")
plt.title("Blood Donors - Frequency vs Monetary by Cluster")
plt.legend(); plt.tight_layout(); plt.show()
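
Since M is proportional to F in this dataset, a Frequency vs Monetary scatter falls along a straight line. An alternative view that plots Recency against Frequency may separate the clusters more visibly. The helper name `plot_recency_vs_frequency` and the demo values are assumptions; the function expects the `R`, `F`, and `Cluster` columns used in `rfm`.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_recency_vs_frequency(frame):
    """Scatter R against F, one color per cluster; returns the Axes."""
    fig, ax = plt.subplots(figsize=(10, 6))
    for cluster_id, group in frame.groupby("Cluster"):
        ax.scatter(group["R"], group["F"], label=f"Cluster {cluster_id}",
                   s=60, alpha=0.6, edgecolors="gray")
    ax.set_xlabel("Recency (months since last donation)")
    ax.set_ylabel("Frequency (total donations)")
    ax.set_title("Blood Donors: Recency vs Frequency by Cluster")
    ax.legend()
    fig.tight_layout()
    return ax

# Tiny made-up frame so the sketch runs on its own; in the notebook, pass `rfm`
demo = pd.DataFrame({
    "R": [2, 4, 21, 16, 1, 3],
    "F": [6, 8, 1, 2, 40, 2],
    "Cluster": [1, 1, 0, 0, 2, 3],
})
ax = plot_recency_vs_frequency(demo)
```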

---
## Task 7 - Name and Recommend (1 pt)

| Cluster | Domain Name | Size | Recommendation |
|---------|------------|------|----------------|
| 0 | Lapsed One-Timers | ~200 | Send reactivation reminder with impact story ("your donation saved X lives") |
| 1 | Reliable Regulars | ~250 | Maintain engagement: thank-you communications, appointment reminders |
| 2 | Super Donors (VIP) | ~30 | VIP recognition program, invite to donor ambassador program |
| 3 | Recent First-Timers | ~270 | Critical window: send follow-up within 30 days to convert to repeat donor |

*(Fill in the table above)*
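
Once the names are chosen, they can be attached to each row with a dictionary lookup. The id-to-name mapping below mirrors the sample table and should be re-checked against your own cluster profiles, since label numbers are arbitrary. The `labels` series is made up and stands in for `rfm["Cluster"]`.

```python
import pandas as pd

# Mapping taken from the sample table; verify it against your own profiles
cluster_names = {
    0: "Lapsed One-Timers",
    1: "Reliable Regulars",
    2: "Super Donors (VIP)",
    3: "Recent First-Timers",
}

# Made-up cluster labels standing in for rfm["Cluster"]
labels = pd.Series([0, 3, 1, 1, 2, 3, 0, 3], name="Cluster")
segments = labels.map(cluster_names)

# Segment sizes for the "Size" column of the table
sizes = segments.value_counts()
print(sizes.to_string())
```

Any label missing from the dictionary maps to NaN, so a quick `segments.isna().sum()` check catches a mismatch between k and the number of names.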

---
## Share-Out (3 minutes)

1. What domain and dataset?
2. How did you define R, F, and M, and why?
3. How many clusters? Name your most interesting one.
4. Single most actionable recommendation?
5. Did the algorithm find anything surprising?

## Troubleshooting

| Problem | Fix |
|---------|-----|
| UFO city aggregation crashes | Filter to US + post-2000 first, then aggregate. Memory issue if unfiltered. |
| Blood donation F and M perfectly correlated | Expected: Monetary = Frequency × 250cc. Note it in the interpretation and consider swapping M for the Time column. |
| Gym data has no date column for Recency | Use Experience_Level (1–3) as a proxy. Lower = newer member. Note the limitation. |
| Elbow curve has no clear bend | Choose k=3 or k=4 as a default and note the ambiguity |
| pd.qcut fails with too few unique values | Use pd.cut() instead, pass duplicates="drop" to pd.qcut, or reduce the number of bins |
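
For the UFO row, the filter-then-aggregate order can be sketched as follows. The rows are made up and use the column names from the Option C loader; the date format string, the `"us"` country code, and the choice of the latest filtered report as the snapshot date are assumptions to verify against the real file.

```python
import pandas as pd

# Made-up rows using the column names from the Option C loader
ufo = pd.DataFrame({
    "datetime": ["10/10/1999 20:30", "06/01/2005 21:00", "07/04/2010 22:15",
                 "01/15/2012 19:00", "03/03/2008 24:00", "08/08/2011 23:00"],
    "city": ["miami", "miami", "miami", "austin", "austin", "london"],
    "country": ["us", "us", "us", "us", "us", "gb"],
    "duration_seconds": ["120", "300", "60", "600", "30", "45"],
})

# Parse first; unparseable values (e.g. hour 24) become NaT/NaN instead of raising
ufo["datetime"] = pd.to_datetime(ufo["datetime"], format="%m/%d/%Y %H:%M",
                                 errors="coerce")
ufo["duration_seconds"] = pd.to_numeric(ufo["duration_seconds"], errors="coerce")

# Filter BEFORE aggregating to keep the groupby small
cutoff = pd.Timestamp("2000-01-01")
recent_us = ufo[(ufo["country"] == "us") & (ufo["datetime"] >= cutoff)]

# Snapshot date: the latest report in the filtered data (an assumption)
snapshot = recent_us["datetime"].max()
city_rfm = recent_us.groupby("city").agg(
    R=("datetime", lambda s: (snapshot - s.max()).days),  # days since last report
    F=("datetime", "count"),                              # report count
    M=("duration_seconds", "median"),                     # median duration
)
print(city_rfm)
```

Rows whose timestamp fails to parse drop out at the filter step, because comparisons against NaT evaluate to False.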

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Week 7 Group Exercise - RFM + K-Means on Non-Retail Domain | 10 Points
</p>