<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/labs/lab04_market_basket_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 04: Market Basket Analysis
**CAP4767 Data Mining with Python** | Miami Dade College, Kendall Campus

**Points:** 20 (+3 bonus) | **Format:** Individual | **Due:** End of Week 6

**Objective:** Run a complete Market Basket Analysis pipeline on a dataset of your choice. Identify the top 10 association rules by lift, produce at least two visualizations, and write a 1-page business memo to a non-technical store owner explaining three actionable rules.

---

### Grading Summary

| Part | Points |
|------|--------|
| Part 1: Data Loading & Cleaning | 3 |
| Part 2: Basket Transformation | 2 |
| Part 3: Apriori & Rule Generation | 4 |
| Part 4: Visualizations (2 of 3) | 4 |
| Part 5: Business Memo (≥250 words) | 5 |
| Part 6: Reflection | 2 |
| **Total** | **20** |
| Bonus: Temporal Comparison | +3 |

---
## Choose Your Dataset

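Run the setup cell first. Then, in the dataset selection cell, leave exactly **ONE** option uncommented and run it. Option C is active by default, so comment it out if you choose A or B.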

| Option | Dataset | Difficulty | Notes |
|--------|---------|------------|-------|
| A | **Instacart** | Expert | Download from Kaggle. Multiple CSVs requiring joins. 3.4M+ orders. |
| B | **Brazilian E-Commerce (Olist)** | Intermediate | One merge required. Use product_category_name as item. |
| C | **Restaurant Orders** | Beginner | Single CSV. Fewer products = higher min_support. |

In [None]:
# ============================================================
# Setup: Run this cell first. Do not modify.
# ============================================================
!pip install mlxtend -q

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import warnings
warnings.filterwarnings("ignore")

from mlxtend.frequent_patterns import apriori, association_rules

np.random.seed(42)
plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["figure.dpi"] = 100
sns.set_style("whitegrid")

print("✅ Libraries loaded (including mlxtend and networkx)")

In [None]:
# ============================================================
# DATASET SELECTION: Uncomment ONE option and run
# ============================================================

# --- OPTION A: Instacart (Expert) ---
# Download from: https://www.kaggle.com/datasets/yasserh/instacart-online-grocery-basket-analysis-dataset
# Upload the CSV files to Colab, then:
# orders = pd.read_csv("order_products__prior.csv")
# products = pd.read_csv("products.csv")
# aisles = pd.read_csv("aisles.csv")
# df = orders.merge(products, on="product_id").merge(aisles, on="aisle_id")
# # Use 'product_name' or 'aisle' as item identifier
# # Consider subsetting to first 100K orders for performance
# DATASET_NAME = "Instacart Online Grocery"

# --- OPTION B: Brazilian E-Commerce / Olist (Intermediate) ---
# base = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data"
# items = pd.read_csv(f"{base}/olist_order_items_dataset.csv")
# products = pd.read_csv(f"{base}/olist_products_dataset.csv")
# df = items.merge(products, on="product_id")
# # Use 'product_category_name' as item identifier
# # Transaction ID = 'order_id'
# DATASET_NAME = "Brazilian E-Commerce (Olist)"

# --- OPTION C: Restaurant Orders (Beginner) ---
url = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/restaurant-1-orders.csv"
df = pd.read_csv(url)
DATASET_NAME = "Restaurant Orders"

print(f"Dataset: {DATASET_NAME}")
print(f"Shape: {df.shape}")
df.head()

---
## Part 1: Data Loading and Cleaning (3 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ Task 1.1</strong><br>
  Clean your dataset: handle nulls, remove invalid transactions (cancellations, returns, zero-quantity rows), verify no duplicates within a single transaction. Print the clean shape.
</div>
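
The cleaning steps in Task 1.1 can be sketched on a toy table. This is a minimal illustration, not the expected solution: the column names `order_id`, `item`, and `quantity` are hypothetical placeholders, and your dataset will likely use different names and may need extra rules (for example, a cancellation flag).

```python
import pandas as pd

# Toy transaction log; the column names are hypothetical placeholders.
raw = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 2, 3, 3, None],
    "item":     ["naan", "naan", "rice", "rice", None, "korma", "naan", "rice"],
    "quantity": [1, 1, 2, 1, 1, 0, 1, 1],
})

clean = (
    raw.dropna(subset=["order_id", "item"])           # drop rows missing a transaction ID or item
       .query("quantity > 0")                         # drop zero or negative quantities
       .drop_duplicates(subset=["order_id", "item"])  # keep each item once per transaction
       .reset_index(drop=True)
)

print(f"Raw shape:   {raw.shape}")
print(f"Clean shape: {clean.shape}")
```

Dropping duplicate `(order_id, item)` pairs is reasonable here because the basket format only records whether an item appeared in a transaction, not how many times.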

In [None]:
# Task 1.1: Clean the data
# YOUR CODE HERE


<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ Task 1.2</strong><br>
  Print a summary: number of unique transactions, number of unique items, and the top 10 most frequent items.
</div>
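
One possible shape for the Task 1.2 summary, shown on toy data. The column names `order_id` and `item` are assumptions for illustration; substitute the transaction and item columns of your chosen dataset.

```python
import pandas as pd

# Toy cleaned data; column names are hypothetical placeholders.
clean = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 3, 3, 3],
    "item": ["naan", "rice", "rice", "korma", "naan", "rice", "korma"],
})

n_transactions = clean["order_id"].nunique()       # distinct transactions
n_items = clean["item"].nunique()                  # distinct items
top_items = clean["item"].value_counts().head(10)  # most frequent items, descending

print(f"Unique transactions: {n_transactions}")
print(f"Unique items:        {n_items}")
print(top_items)
```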

In [None]:
# Task 1.2: Summary statistics
# YOUR CODE HERE


**Interpretation:** In 2–3 sentences, describe your dataset. What does each transaction represent? Any data quality issues?

*(Write here)*

---
## Part 2: Basket Transformation (2 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ Task 2.1</strong><br>
  Transform your data into one-hot encoded basket format: one row per transaction, one column per item, Boolean values. Print the shape and density (percentage of True values).
</div>
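
The basket format described above can be sketched with `pd.crosstab`, which is one of several ways to build it (a `groupby` followed by `unstack` also works). The toy data and the column names `order_id` and `item` are assumptions for illustration.

```python
import pandas as pd

# Toy cleaned data; column names are hypothetical placeholders.
clean = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 3, 3, 3],
    "item": ["naan", "rice", "rice", "korma", "naan", "rice", "korma"],
})

# One row per transaction, one column per item; counts become True/False.
basket = pd.crosstab(clean["order_id"], clean["item"]) > 0

# Density = share of cells that are True.
density = basket.to_numpy().mean()

print(f"Basket shape: {basket.shape}")
print(f"Density: {density:.1%}")
```

This toy matrix is unusually dense because it has only three items. Real baskets with hundreds or thousands of items are typically far sparser.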

In [None]:
# Task 2.1: Basket transformation
# YOUR CODE HERE


**Interpretation:** Is the basket matrix sparse or dense? Why does sparsity matter for MBA?

*(Write here)*

---
## Part 3: Apriori and Rule Generation (4 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ Task 3.1</strong><br>
  Run <code>apriori()</code> with a min_support threshold of your choice. <strong>Justify your choice</strong> in the markdown cell below.
</div>
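
To help justify a threshold, it can be useful to see how many items survive at several candidate values before running `apriori()`. The sketch below is a hand-rolled check of single-item support on a toy basket, not mlxtend's implementation. Apriori applies the same threshold to larger itemsets as well.

```python
import pandas as pd

# Toy Boolean basket: 5 transactions, 4 items (illustrative data only).
basket = pd.DataFrame({
    "naan":  [True, True, True, False, True],
    "rice":  [True, True, False, True, True],
    "korma": [True, False, False, False, True],
    "lassi": [False, False, False, True, False],
})

# Support of a single item = fraction of transactions that contain it.
item_support = basket.mean().sort_values(ascending=False)

for threshold in (0.1, 0.3, 0.5):
    survivors = item_support[item_support >= threshold]
    print(f"min_support={threshold}: {len(survivors)} items -> {list(survivors.index)}")
```

A threshold set too high leaves too few items to form rules, while one set too low admits rare items and can make the itemset search slow on large datasets.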

In [None]:
# Task 3.1: Run Apriori
# YOUR CODE HERE


**min_support justification:** Why did you choose this threshold? (1–2 sentences)

*(Write here)*

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ Task 3.2</strong><br>
  Generate association rules with <code>metric="lift"</code> and <code>min_threshold=1.5</code>. Apply an additional confidence filter of your choice (justify it). Display the top 15 rules by lift with all 5 metrics.
</div>
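
The task does not name the five metrics. Assuming they are support, confidence, lift, leverage, and conviction (an assumption, not confirmed by this lab), the sketch below computes each by hand for a single rule on a toy basket. It is meant as a way to sanity-check the numbers that `association_rules()` reports, not as a replacement for it. The helper `rule_metrics` is hypothetical.

```python
import pandas as pd

# Toy Boolean basket: 8 transactions (illustrative data only).
basket = pd.DataFrame({
    "bread":  [True, True, True, True, False, False, False, False],
    "butter": [True, True, True, False, False, False, False, False],
    "jam":    [False, True, False, False, True, True, False, False],
})

def rule_metrics(basket, antecedent, consequent):
    """Hypothetical helper: metrics for the rule antecedent -> consequent."""
    a = basket[antecedent]
    c = basket[consequent]
    support_a = a.mean()                       # P(A)
    support_c = c.mean()                       # P(C)
    support = (a & c).mean()                   # P(A and C)
    confidence = support / support_a           # P(C | A)
    lift = confidence / support_c              # > 1 means positive association
    leverage = support - support_a * support_c # excess over independence
    # Conviction is infinite when the rule never fails (confidence == 1).
    conviction = float("inf") if confidence == 1 else (1 - support_c) / (1 - confidence)
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

m = rule_metrics(basket, "bread", "butter")
for name, value in m.items():
    print(f"{name:>10}: {value:.4f}")
```

With a lift of 2.0, this toy rule would pass the `min_threshold=1.5` filter; a confidence filter set above 0.75 would remove it.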

In [None]:
# Task 3.2: Generate and filter rules
# YOUR CODE HERE


**Interpretation:** How many rules before and after filtering? What does that tell you about the density of associations?

*(Write here)*

---
## Part 4: Visualizations (4 pts)

Produce **at least two** of the following three visualizations.

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 VISUALIZATION OPTIONS</strong><br>
  <ul>
    <li><strong>Option A:</strong> Network graph: top 15 rules, nodes = items, edges = rules, thickness = lift</li>
    <li><strong>Option B:</strong> Support vs. confidence scatter: each dot = one rule, bubble size = lift, colorbar</li>
    <li><strong>Option C:</strong> Lift heatmap: top 20 antecedent-consequent pairs</li>
  </ul>
</div>
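
For Option C, the rules table has to be reshaped into an antecedent-by-consequent matrix before plotting. The sketch below shows only that reshaping step, on a hand-built rules table that mimics the `antecedents`, `consequents`, and `lift` columns returned by mlxtend (where itemsets are stored as `frozenset` objects). The lift values and the `label` helper are made up for illustration.

```python
import pandas as pd

# Hand-built stand-in for a rules table; the lift values are invented.
rules = pd.DataFrame({
    "antecedents": [frozenset({"naan"}), frozenset({"rice"}), frozenset({"korma", "rice"})],
    "consequents": [frozenset({"korma"}), frozenset({"korma"}), frozenset({"naan"})],
    "lift": [1.8, 1.6, 2.4],
})

def label(itemset):
    """Hypothetical helper: turn a frozenset into a stable, readable label."""
    return ", ".join(sorted(itemset))

# Keep the strongest rules and convert itemsets to plain strings.
top = rules.nlargest(20, "lift").assign(
    antecedent=lambda d: d["antecedents"].map(label),
    consequent=lambda d: d["consequents"].map(label),
)

# Rows = antecedents, columns = consequents, cells = lift (NaN where no rule exists).
lift_matrix = top.pivot(index="antecedent", columns="consequent", values="lift")
print(lift_matrix)
```

The resulting matrix can then be passed to `sns.heatmap(lift_matrix, annot=True)`. Empty cells mean no rule exists for that pair, not a lift of zero.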

In [None]:
# Task 4.1: Visualization 1
# YOUR CODE HERE


In [None]:
# Task 4.2: Visualization 2
# YOUR CODE HERE


**Interpretation:** What patterns do you see? Do any product clusters emerge? Any surprises?

*(Write here)*

---
## Part 5: Business Memo (5 pts)

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 MEMO REQUIREMENTS</strong><br>
  Write a 1-page memo addressed to a non-technical store owner. Include:
  <ol>
    <li><strong>What you did</strong>: 2–3 sentences explaining MBA in plain language</li>
    <li><strong>Three actionable rules</strong>: for each, what products, how strong, one specific action</li>
    <li><strong>One recommendation to skip</strong>: a rule that appeared but isn't worth acting on. Explain why.</li>
  </ol>
  <strong>Minimum 250 words.</strong> No code, no jargon. Write as if presenting to a client.
</div>

### Business Memo

**TO:** [Store Owner Name]
**FROM:** [Your Name], Data Analyst
**DATE:** [Date]
**RE:** Product Association Analysis: Recommendations for Store Layout and Promotions

---

*(Write your memo here; minimum 250 words)*

---

---
## Part 6: Reflection (2 pts)

In 3–4 sentences, answer: What surprised you most about the association rules your data revealed? Was there a rule that didn't make intuitive sense at first, and if so, what explanation did you come up with? If you were to run this analysis again in three months, what might change and why?

*(Write here; minimum 3–4 sentences)*

---
---
## Bonus Challenge (Up to 3 Extra Points)

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #7D6608;">⭐ OPTIONAL</strong><br>
  Filter your dataset to <strong>two different time periods</strong> (Q1 vs Q4, weekdays vs weekends, morning vs evening). Run the full MBA pipeline on each subset separately.
  <ol>
    <li>Show the top 5 rules by lift for each time period</li>
    <li>Identify at least one rule that appears in one period but not the other</li>
    <li>Explain what this tells you about temporal purchasing behavior (3–4 sentences)</li>
  </ol>
  No scaffolding: apply the pipeline independently to each subset.
</div>

In [None]:
# BONUS: Temporal comparison
# YOUR CODE HERE


**Bonus interpretation:** *(Write here if attempting)*

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Lab 04: Market Basket Analysis | 20 Points (+3 Bonus)
</p>