<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/labs/lab05_rfm_domains.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 05: RFM Applied to a Non-Retail Domain
**CAP4767 Data Mining with Python** | Miami Dade College, Kendall Campus

**Points:** 20 (+2 bonus) | **Format:** Individual | **Due:** End of Week 7

**Objective:** Apply the universal RFM + K-Means framework to a non-retail domain of your choice. Write a 1-page domain brief translating your findings for a non-technical decision-maker.

> **Final Project Connection:** The domain and dataset you choose for this lab may be carried forward as the foundation for your Week 8 final project.

---

### Grading Summary

| Task | Points |
|------|--------|
| 1. Select and justify domain | 2 |
| 2. Load and explore | 2 |
| 3. Calculate and score RFM | 3 |
| 4. Check metric independence | 1 |
| 5. Cluster with K-Means | 3 |
| 6. Visualize (2 charts) | 2 |
| 7. Name and interpret clusters | 3 |
| 8. Domain brief (≥300 words) | 5 (see sub-rubric below) |
| 9. Reflection | 2 |
| **Total** | **23 (capped at 20 + 2 bonus)** |

---
## Choose Your Dataset

Uncomment **ONE** option below.

| Option | Dataset | Entity | Suggested R / F / M |
|--------|---------|--------|---------------------|
| A | Gym Members | Member | Experience_Level (proxy) / Workout_Frequency / Calories_Burned |
| B | Blood Donation | Donor | Recency (months) / Frequency (count) / Monetary (cc) |
| C | Your Own | Your choice | Must be non-retail, 100+ entities, clear R/F/M mapping |

In [None]:
# ============================================================
# Setup: Run this cell. Do not modify.
# ============================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

np.random.seed(42)
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["figure.dpi"] = 100
sns.set_style("whitegrid")

print("✅ Libraries loaded")

In [None]:
# ============================================================
# DATASET SELECTION: Uncomment ONE option
# ============================================================

# --- Option A: Gym Members ---
# df = pd.read_csv("https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/gym_members_exercise_tracking.csv")
# DATASET_NAME = "Gym Members Exercise"

# --- Option B: Blood Donation ---
# df = pd.read_csv("https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/Blood%20Transfusion%20Service%20data.csv")
# DATASET_NAME = "Blood Donation Service Center"

# --- Option C: Your Own Domain ---
# df = pd.read_csv("YOUR_URL_HERE")
# DATASET_NAME = "Your Domain Name"

print(f"Dataset: {DATASET_NAME}")
print(f"Shape: {df.shape}")
df.head()

---
## Task 1: Select and Justify Your Domain (2 pts)

Answer all three in the markdown cell below:
1. What is your entity? (gym member, donor, student, etc.)
2. What are your R, F, and M columns? For each, explain what "high" and "low" mean. Is lower Recency better or worse?
3. Why is this mapping meaningful? What business or organizational question does it answer?

*(Write here)*

---
## Task 2: Load and Explore (2 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Print the shape, <code>.dtypes</code>, first 5 rows, and <code>.describe()</code> for your R, F, and M columns. Note any data quality concerns.
</div>
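
The exploration steps listed above might look like the following on a small synthetic table. This is only a sketch: the DataFrame `toy` and its column names (`entity_id`, `recency`, `frequency`, `magnitude`) are hypothetical stand-ins, not columns from any of the lab datasets, so substitute your own R, F, and M columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for your dataset; column names are illustrative only.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "entity_id": np.arange(1, 201),
    "recency": rng.integers(1, 60, size=200),
    "frequency": rng.integers(1, 30, size=200),
    "magnitude": rng.gamma(shape=2.0, scale=150.0, size=200).round(1),
})
toy.loc[5, "magnitude"] = np.nan  # inject one missing value so the null check finds something

rfm_cols = ["recency", "frequency", "magnitude"]

# Basic structure: size, column types, and a first look at the rows
print("Shape:", toy.shape)
print(toy.dtypes)
print(toy.head())

# Summary statistics for the three RFM columns only
print(toy[rfm_cols].describe())

# Quick data quality checks: missing values and duplicated entity IDs
null_counts = toy[rfm_cols].isna().sum()
duplicate_ids = toy["entity_id"].duplicated().sum()
print(null_counts)
print("Duplicate IDs:", duplicate_ids)
```

Comparing the `min`, `max`, and quartiles from `.describe()` is usually a quick way to spot extreme values worth mentioning in your data quality notes.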

In [None]:
# Task 2: Explore your dataset
# Print shape, dtypes, head, describe for R/F/M columns
# YOUR CODE HERE


**Data quality notes:** *(any nulls, wrong dtypes, extreme values?)*

*(Write here)*

---
## Task 3: Calculate and Score RFM (3 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Calculate your three RFM metrics (aggregate if needed). Score each metric with <code>pd.qcut()</code>, using 5 quantile bins (quintiles) where the data allow. If a column has too few unique values for 5 bins (e.g., Experience_Level has only 3), use <code>pd.cut()</code> with fewer bins instead. Reverse the Recency scale when a lower raw value is better, so that a higher R_score always means "more recent." Display the scored table for at least the first 10 entities.
</div>

**Expected output:** DataFrame with entity ID, raw R/F/M, R_score, F_score, M_score, and composite RFM label.
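
As a rough illustration of the scoring step, the sketch below scores a synthetic table. It assumes that lower raw Recency is better (so the label order is reversed) and that Frequency has only three unique values (so `pd.cut()` with 3 bins replaces `pd.qcut()`). The table `rfm` and all of its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; replace with your own entity-level R/F/M table.
rng = np.random.default_rng(0)
n = 100
rfm = pd.DataFrame({
    "entity_id": np.arange(n),
    "recency": rng.permutation(n) + 1,         # 1..100; lower = more recent (assumption)
    "frequency": rng.integers(1, 4, size=n),   # values 1..3 only: too few for 5 bins
    "magnitude": rng.normal(500, 100, size=n),
})

# Recency: reversed labels, so the most recent fifth of entities gets a score of 5
rfm["R_score"] = pd.qcut(rfm["recency"], q=5, labels=[5, 4, 3, 2, 1]).astype(int)

# Frequency: equal-width bins with pd.cut, because qcut cannot form 5 distinct bins here
rfm["F_score"] = pd.cut(rfm["frequency"], bins=3, labels=[1, 2, 3]).astype(int)

# Magnitude: ordinary quintiles, higher raw value = higher score
rfm["M_score"] = pd.qcut(rfm["magnitude"], q=5, labels=[1, 2, 3, 4, 5]).astype(int)

# Composite label such as "535" (R, then F, then M)
rfm["RFM"] = (
    rfm["R_score"].astype(str) + rfm["F_score"].astype(str) + rfm["M_score"].astype(str)
)

print(rfm.head(10))
```

If `pd.qcut()` raises an error about duplicate bin edges on your data, that is usually a sign the column has too many tied values for the requested number of bins.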

In [None]:
# Task 3: Calculate and score RFM
# YOUR CODE HERE


---
## Task 4: Check Metric Independence (1 pt)
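
One possible way to run this check is sketched below on synthetic data in which `magnitude` is deliberately built to track `frequency`, so one pair exceeds the 0.85 threshold. The data, column names, and the `flagged` helper variable are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: magnitude is constructed from frequency, so the two are redundant.
rng = np.random.default_rng(1)
n = 300
frequency = rng.normal(10, 3, size=n)
toy = pd.DataFrame({
    "recency": rng.normal(30, 10, size=n),
    "frequency": frequency,
    "magnitude": frequency * 50 + rng.normal(0, 20, size=n),
})

# Pearson correlation matrix for the three metrics
corr = toy.corr()
print(corr.round(2))

threshold = 0.85
# Keep only the upper triangle (above the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()
flagged = pairs[pairs.abs() > threshold]
print(flagged)
```

A flagged pair suggests the two metrics carry nearly the same information, so K-Means would effectively weight that signal twice.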

In [None]:
# Task 4: Correlation matrix for R, F, M
# YOUR CODE HERE


**Interpretation:** Is any pair of metrics correlated above 0.85 in absolute value? What would that mean for clustering? Would you consider swapping a metric?

*(Write here)*

---
## Task 5: Cluster with K-Means (3 pts)
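
The scale, elbow, and fit sequence could be sketched as follows. The data are synthetic, with three well-separated groups planted on purpose, so the choice of `k=3` here is an artifact of the toy data and not a recommendation for your dataset. All variable and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three planted groups of 50 entities each.
rng = np.random.default_rng(7)
centers = np.array([[5, 20, 800], [30, 8, 300], [60, 2, 100]])
toy = pd.DataFrame(
    np.vstack([rng.normal(c, [2, 2, 40], size=(50, 3)) for c in centers]),
    columns=["recency", "frequency", "magnitude"],
)

# Standardize so that no metric dominates the distance calculation because of its units
X = StandardScaler().fit_transform(toy)

# Elbow method: record inertia (within-cluster sum of squares) for k = 2..8
inertias = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")

# Fit the final model with the chosen k and attach labels to the unscaled table
final = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
toy["cluster"] = final.labels_

# Cluster profiles: mean of each raw metric plus cluster size
profile = toy.groupby("cluster")[["recency", "frequency", "magnitude"]].mean().round(1)
profile["size"] = toy["cluster"].value_counts().sort_index()
print(profile)
```

Profiling on the raw (unscaled) values keeps the means interpretable in domain units, even though the clustering itself ran on standardized values.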

In [None]:
# Task 5: StandardScaler + Elbow Method (k=2 to k=8)
# YOUR CODE HERE


In [None]:
# Task 5 continued: Fit K-Means with chosen k, print profiles
# YOUR CODE HERE


**Chosen k and justification:**

*(Write here)*

---
## Task 6: Visualize (2 pts)

Produce at least TWO visualizations:
1. Scatter: Frequency vs Magnitude, colored by cluster
2. Cluster profile chart: heatmap or grouped bar chart
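
A minimal sketch of both chart types is shown below, using matplotlib only and the grouped bar option for the profile chart. The data and the `cluster` labels are made up for illustration; in your notebook the labels would come from your fitted K-Means model.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data with two pre-assigned clusters (illustration only).
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "recency": np.concatenate([rng.normal(5, 2, 40), rng.normal(40, 5, 40)]),
    "frequency": np.concatenate([rng.normal(20, 2, 40), rng.normal(5, 2, 40)]),
    "magnitude": np.concatenate([rng.normal(800, 50, 40), rng.normal(200, 50, 40)]),
    "cluster": np.repeat([0, 1], 40),
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Chart 1: Frequency vs Magnitude, one scatter series per cluster
for label, group in toy.groupby("cluster"):
    ax1.scatter(group["frequency"], group["magnitude"], label=f"Cluster {label}", alpha=0.7)
ax1.set_xlabel("Frequency")
ax1.set_ylabel("Magnitude")
ax1.set_title("Frequency vs Magnitude by Cluster")
ax1.legend()

# Chart 2: grouped bars of standardized cluster means, so all metrics share one scale
metrics = ["recency", "frequency", "magnitude"]
z = (toy[metrics] - toy[metrics].mean()) / toy[metrics].std()
profile = z.groupby(toy["cluster"]).mean()
profile.plot(kind="bar", ax=ax2)
ax2.axhline(0, color="black", linewidth=0.8)
ax2.set_xlabel("Cluster")
ax2.set_ylabel("Mean z-score")
ax2.set_title("Cluster Profiles (standardized)")

fig.tight_layout()
```

Standardizing before plotting the profile is a design choice: without it, the metric with the largest raw units would dwarf the other bars.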

In [None]:
# Task 6: Visualization 1 (Scatter)
# YOUR CODE HERE


In [None]:
# Task 6: Visualization 2 (Profile chart)
# YOUR CODE HERE


---
## Task 7: Name and Interpret Clusters (3 pts)

Give each cluster a domain-appropriate name. Describe each in 2–3 sentences.

| Cluster | Domain Name | Size | Profile Summary | Recommended Action |
|---------|------------|------|----------------|-------------------|
| 0 | | | | |
| 1 | | | | |
| 2 | | | | |
| ... | | | | |

*(Fill in the table above)*

---
## Task 8: Domain Brief (5 pts)

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 BRIEF REQUIREMENTS</strong><br>
  1-page memo (≥300 words) to a non-technical decision-maker. Include:
  <ol>
    <li><strong>Problem statement</strong> (2–3 sentences): What question? Why does it matter?</li>
    <li><strong>Method summary</strong> (3–4 sentences): Explain without jargon</li>
    <li><strong>Key findings</strong> (1 paragraph + 1 visualization)</li>
    <li><strong>Recommendations</strong> (1 paragraph): Specific, actionable</li>
  </ol>
  <strong>Scoring:</strong> Problem statement (1 pt), Method explained (1 pt), Findings with viz (1 pt), Actionable recommendations (1 pt), Professional tone (1 pt)
</div>

### Domain Brief

**TO:** [Decision-Maker Title and Name]<br>
**FROM:** [Your Name], Data Analyst<br>
**DATE:** [Date]<br>
**RE:** [Subject Line]

---

*(Write your brief here; minimum 300 words)*

---

---
## Task 9: Reflection (2 pts)

In 4–5 sentences, answer: What surprised you about applying this framework to a non-retail domain? Did the clusters map to groups you expected, or did the algorithm reveal structure you hadn't considered? If you were to run this analysis again with a different Magnitude metric, what would you choose and why?

*(Write here: 4–5 sentences)*

---
## Bonus Challenge (+2 pts extra credit)

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #7D6608;">⭐ OPTIONAL</strong><br>
  Run the full pipeline TWICE: once with your original Magnitude metric, once with a <em>different</em> Magnitude from your dataset. Compare cluster profiles side by side. In 3–5 sentences:
  <ul>
    <li>Did the same entities end up in the same clusters?</li>
    <li>Which Magnitude produced more actionable segments?</li>
    <li>What does this tell you about metric choice?</li>
  </ul>
  No scaffolding provided.
</div>

In [None]:
# BONUS: Alternative Magnitude comparison
# YOUR CODE HERE


**Bonus interpretation:** *(Write here if attempting)*

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Lab 05: RFM Applied to a Non-Retail Domain | 20 Points (+2 Bonus)
</p>