<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/exercises/week06_group_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 6 Group Exercise: Market Basket Analysis
**CAP4767 Data Mining with Python** | Miami Dade College, Kendall Campus

**Points:** 10 | **Duration:** ~45 minutes | **Deliverable:** Completed notebook + 3-minute presentation

**Objective:** Run a complete Market Basket Analysis pipeline on the Online Retail II dataset. Identify the top 10 rules by lift, produce one visualization, and present one product placement and one cross-sell recommendation.

### Group Members & Roles

| Role | Name | Responsibility |
|------|------|----------------|
| 🖥️ **Lead Coder** | | Writes code, drives the notebook |
| 📊 **Data Interpreter** | | Explains outputs, writes interpretation cells |
| 🎤 **Presenter** | | Delivers the 3-minute share-out |
| ✅ **QA Reviewer** | | Checks outputs against checkpoints |

*If 3 members, Data Interpreter also handles QA.*

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 GROUP DISCUSSION (before coding, 3 minutes)</strong><br>
  You've already analyzed this dataset for customer segmentation in Week 5. Before writing any code:
  <ol>
    <li>What new question does Market Basket Analysis answer that RFM and K-Means could not?</li>
    <li>If you were the marketing director of this online retailer, name one specific business decision you'd want MBA to inform. Why?</li>
  </ol>
</div>

**Our group's answers (2–3 sentences each):**

*(Write here)*

---

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Run the setup cell below. It loads the dataset, cleans it, and delivers a ready-to-use DataFrame. <strong>Do not modify.</strong>
</div>

```python
# ============================================================
# Setup: run this cell. Do not modify.
# ============================================================
!pip install mlxtend -q

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

from mlxtend.frequent_patterns import apriori, association_rules

np.random.seed(42)
plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["figure.dpi"] = 100
sns.set_style("whitegrid")

# Load and clean Online Retail II
url = "https://github.com/fenago/datasets/raw/refs/heads/main/Online%20Retail%20(1).xlsx"
df = pd.read_excel(url, engine="openpyxl")

df["InvoiceNo"] = df["InvoiceNo"].astype(str)
df = df[~df["InvoiceNo"].str.startswith("C")]            # drop invoices whose number starts with "C"
df = df.dropna(subset=["CustomerID"])                    # drop rows with no customer
df = df[df["Quantity"] > 0]                              # keep positive quantities only
df = df[df["UnitPrice"] > 0]                             # keep positive prices only
df = df[df["StockCode"].astype(str).str.match(r"^\d")]   # keep StockCodes that start with a digit
df["Description"] = df["Description"].str.strip()
df = df.dropna(subset=["Description"])

print("✅ Dataset loaded and cleaned")
print(f"   {df.shape[0]:,} rows | {df['InvoiceNo'].nunique():,} invoices | {df['Description'].nunique():,} products")
```

---
## Task 1: Transform to Basket Format (1 pt)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Convert the cleaned DataFrame into a one-hot encoded basket matrix: one row per invoice (<code>InvoiceNo</code>), one column per product (<code>Description</code>), Boolean values (True/False). Print the shape.
</div>

```python
# Task 1: Convert to basket format
# Group by InvoiceNo × Description, sum Quantity, unstack, fill NaN with 0, convert to Boolean
# YOUR CODE HERE
```


<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 CHECKPOINT 1</strong><br>
  Shape should be approximately <strong>18,000 rows × 3,800 columns</strong>. If you have far fewer columns, check that you used Description (not StockCode). If you have far fewer rows, check your cleaning filters.
</div>
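
To make the target shape concrete, here is a minimal sketch of the same transformation on a tiny made-up table. The `toy` DataFrame and its values are invented for illustration; only the column names (`InvoiceNo`, `Description`, `Quantity`) come from the dataset above.

```python
import pandas as pd

# Made-up line items: one row per product on an invoice
toy = pd.DataFrame({
    "InvoiceNo":   ["1001", "1001", "1002", "1002", "1003"],
    "Description": ["TEA SET", "CUPS", "TEA SET", "CUPS", "BREAD BIN"],
    "Quantity":    [2, 6, 1, 4, 1],
})

# Sum quantities per invoice/product pair, pivot products into columns,
# and treat a missing pair as quantity 0
toy_counts = (
    toy.groupby(["InvoiceNo", "Description"])["Quantity"]
       .sum()
       .unstack()
       .fillna(0)
)

# Any positive quantity means "this product was in the basket"
toy_basket = toy_counts > 0

print(toy_basket.shape)
print(toy_basket)
```

The result has one row per invoice and one Boolean column per product, which is the layout `apriori()` expects.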

## Task 2: Run Apriori (1 pt)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Run <code>apriori()</code> with <code>min_support=0.02</code> and <code>use_colnames=True</code>. Print the count of frequent itemsets and display the top 10 by support.
</div>

```python
# Task 2: Run Apriori
# YOUR CODE HERE
```


<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 CHECKPOINT 2</strong><br>
  You should find approximately <strong>60–100 frequent itemsets</strong>. If fewer than 20, your basket transformation may have an issue. If more than 500, check that non-product StockCodes were filtered.
</div>

---
## Task 3: Generate and Filter Rules (2 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Generate rules with <code>metric="lift"</code> and <code>min_threshold=2.0</code>. Add a second filter: <code>confidence >= 0.3</code>. Sort by lift descending. Display the top 15 rules with all 5 metrics.
</div>

```python
# Task 3: Generate rules, filter, sort, display top 15
# YOUR CODE HERE
```


---
## Task 4: Interpret the Top 3 Rules (2 pts)

Write a plain-language interpretation of the top 3 rules by lift. For each rule:
- What products are associated?
- How strong is the association (lift value)?
- Why might these products be bought together?

**Rule 1:** *(antecedent)* → *(consequent)*
- Lift:
- Interpretation:

**Rule 2:**
- Lift:
- Interpretation:

**Rule 3:**
- Lift:
- Interpretation:

*(Fill in above: 2–3 sentences minimum per rule)*
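
When interpreting lift, it can help to recompute the metrics by hand for one rule. The sketch below does this on a made-up Boolean basket; `rule_metrics` is a hypothetical helper written for this illustration, not an mlxtend function.

```python
import pandas as pd

# Made-up Boolean basket: ten invoices, four products
toy_basket = pd.DataFrame({
    "tea":   [True, True, True, True, False, False, False, False, False, False],
    "cups":  [True, True, True, True, False, False, False, False, False, False],
    "bread": [True, False, False, True, True, True, True, True, False, True],
    "milk":  [False, False, True, False, True, True, False, True, True, True],
})

def rule_metrics(basket, antecedent, consequent):
    """Hypothetical helper: support, confidence, and lift for antecedent -> consequent."""
    support_a = basket[antecedent].mean()                        # P(A)
    support_c = basket[consequent].mean()                        # P(C)
    support_ac = (basket[antecedent] & basket[consequent]).mean()  # P(A and C)
    confidence = support_ac / support_a                          # P(C | A)
    lift = confidence / support_c                                # P(C | A) / P(C)
    return support_ac, confidence, lift

tea_cups = rule_metrics(toy_basket, "tea", "cups")
bread_milk = rule_metrics(toy_basket, "bread", "milk")

print("tea -> cups   (support, confidence, lift):", tea_cups)
print("bread -> milk (support, confidence, lift):", bread_milk)
```

In this toy data, baskets containing tea are 2.5 times as likely to contain cups as a randomly chosen basket, so the lift is 2.5. The bread and milk rule has a lift just under 1, meaning the two appear together slightly less often than independence would predict even though both are popular.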

---
## Task 5: Produce One Visualization (2 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Choose <strong>ONE</strong> of the following:
  <ul>
    <li><strong>Option A:</strong> Network graph: top 15 rules, nodes = products, edges = rules, thickness = lift</li>
    <li><strong>Option B:</strong> Support vs. confidence scatter: all rules, bubble size = lift, colorbar for lift</li>
    <li><strong>Option C:</strong> Lift heatmap: top 20 rules, rows = antecedents, columns = consequents, color = lift</li>
  </ul>
</div>

```python
# Task 5: Produce ONE visualization
# YOUR CODE HERE
```


<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 CHECKPOINT 3</strong><br>
  Your visualization should clearly show which product pairs have the strongest associations. If labels overlap or colors look flat, reduce the number of rules displayed.
</div>
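
For reference, one possible layout for Option B is sketched below. The metric values in `toy_rules` are made up purely to drive the plot, and the size multiplier, colormap, and figure size are arbitrary choices rather than requirements.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rule metrics, for illustration only
toy_rules = pd.DataFrame({
    "support":    [0.021, 0.025, 0.030, 0.040],
    "confidence": [0.35, 0.50, 0.62, 0.45],
    "lift":       [2.1, 4.8, 9.5, 3.2],
})

fig, ax = plt.subplots(figsize=(8, 5))

# Bubble size and color both encode lift
sc = ax.scatter(
    toy_rules["support"],
    toy_rules["confidence"],
    s=toy_rules["lift"] * 40,   # scale lift up so bubbles are visible
    c=toy_rules["lift"],
    cmap="viridis",
    alpha=0.7,
    edgecolors="black",
)
fig.colorbar(sc, ax=ax, label="Lift")

ax.set_xlabel("Support")
ax.set_ylabel("Confidence")
ax.set_title("Association rules: support vs. confidence (bubble size = lift)")
fig.tight_layout()
# In a notebook, the figure renders when the cell finishes running
```

Encoding lift twice (size and color) is redundant by design: it keeps the strongest rules easy to spot even when bubbles overlap.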

---
## Task 6: Business Recommendations (2 pts)

Write two specific recommendations based on your analysis:

**Product Placement Recommendation:**
Which products should be displayed together (on the website or in a physical store)? Cite the specific rule and its lift.

*(Write here: minimum 3 sentences)*

**Cross-Sell Recommendation:**
Which product should be suggested as "Frequently Bought Together" when a customer adds a specific item to their cart? Cite the specific rule and its confidence.

*(Write here: minimum 3 sentences)*

---

## Share-Out (3 minutes)

Present to the class:
1. **One surprising finding:** a rule you didn't expect. What was the lift? Why do you think these products are associated?
2. **The visualization:** show the chart. Point out the strongest cluster, hottest cell, or largest bubble.
3. **One business recommendation:** product placement OR cross-sell. Be specific: which products, what action, why the metric justifies it.

## Troubleshooting

| Problem | Fix |
|---------|-----|
| `apriori()` takes forever or crashes | `min_support` is probably too low. Try 0.03 first to confirm the pipeline runs, then return to the assigned 0.02. |
| Zero frequent itemsets | Basket matrix may not be Boolean; check `basket.dtypes`. Values must be True/False. |
| `association_rules()` gives empty DataFrame | Lower `min_threshold` (try 1.5 instead of 2.0) or lower the confidence filter. |
| Network graph labels overlap | Increase `k` in `nx.spring_layout(G, k=3)` or reduce to top 10 rules |
| Heatmap is one solid color | You may have too few rules, or all lifts are similar. Try top 20 instead of all rules. |

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Week 6 Group Exercise: Market Basket Analysis | 10 Points
</p>