<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/solutions/labs/lab04_market_basket_analysis_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 04: Market Basket Analysis - SOLUTION KEY 🔑
**CAP4767 Data Mining with Python** | Miami Dade College, Kendall Campus

**Points:** 20 (+3 bonus) | **Format:** Individual | **Due:** End of Week 6

**Objective:** Run a complete Market Basket Analysis pipeline on a dataset of your choice. Identify the top 10 association rules by lift, produce at least two visualizations, and write a 1-page business memo to a non-technical store owner explaining three actionable rules.

---

### Grading Summary

| Part | Points |
|------|--------|
| Part 1: Data Loading & Cleaning | 3 |
| Part 2: Basket Transformation | 2 |
| Part 3: Apriori & Rule Generation | 4 |
| Part 4: Visualizations (2 of 3) | 4 |
| Part 5: Business Memo (≥250 words) | 5 |
| Part 6: Reflection | 2 |
| **Total** | **20** |
| Bonus: Temporal Comparison | +3 |

---
## Choose Your Dataset

Uncomment **ONE** option below and run the setup cell.

| Option | Dataset | Difficulty | Notes |
|--------|---------|------------|-------|
| A | **Instacart** | Expert | Download from Kaggle. Multiple CSVs requiring joins. 3.4M+ orders. |
| B | **Brazilian E-Commerce (Olist)** | Intermediate | One merge required. Use product_category_name as item. |
| C | **Restaurant Orders** | Beginner | Single CSV. Fewer products = higher min_support. |

In [None]:
# ============================================================
# Setup - Run this cell first. Do not modify.
# ============================================================
!pip install mlxtend -q

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import warnings
warnings.filterwarnings("ignore")

from mlxtend.frequent_patterns import apriori, association_rules

np.random.seed(42)
plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["figure.dpi"] = 100
sns.set_style("whitegrid")

print("✅ Libraries loaded (including mlxtend and networkx)")

In [None]:
# ============================================================
# DATASET SELECTION - Uncomment ONE option and run
# ============================================================

# --- OPTION A: Instacart (Expert) ---
# Download from: https://www.kaggle.com/datasets/yasserh/instacart-online-grocery-basket-analysis-dataset
# Upload the CSV files to Colab, then:
# orders = pd.read_csv("order_products__prior.csv")
# products = pd.read_csv("products.csv")
# aisles = pd.read_csv("aisles.csv")
# df = orders.merge(products, on="product_id").merge(aisles, on="aisle_id")
# # Use 'product_name' or 'aisle' as item identifier
# # Consider subsetting to first 100K orders for performance
# DATASET_NAME = "Instacart Online Grocery"

# --- OPTION B: Brazilian E-Commerce / Olist (Intermediate) ---
# base = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data"
# items = pd.read_csv(f"{base}/olist_order_items_dataset.csv")
# products = pd.read_csv(f"{base}/olist_products_dataset.csv")
# df = items.merge(products, on="product_id")
# # Use 'product_category_name' as item identifier
# # Transaction ID = 'order_id'
# DATASET_NAME = "Brazilian E-Commerce (Olist)"

# --- OPTION C: Restaurant Orders (Beginner) ---
url = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/restaurant-1-orders.csv"
df = pd.read_csv(url)
DATASET_NAME = "Restaurant Orders"

print(f"Dataset: {DATASET_NAME}")
print(f"Shape: {df.shape}")
df.head()

---
## Part 1: Data Loading and Cleaning (3 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ Task 1.1</strong><br>
  Clean your dataset: handle nulls, remove invalid transactions (cancellations, returns, zero-quantity rows), verify no duplicates within a single transaction. Print the clean shape.
</div>

In [None]:
# Task 1.1: Clean the data (Restaurant dataset)
# Remove any null values
df = df.dropna()

# Remove zero-quantity rows
df = df[df["Quantity"] > 0]

# Remove duplicates within same order+item
df = df.drop_duplicates(subset=["Order Number", "Item Name"])

print(f"Clean shape: {df.shape[0]:,} rows")
print(f"Columns: {list(df.columns)}")

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ Task 1.2</strong><br>
  Print a summary: number of unique transactions, number of unique items, and the top 10 most frequent items.
</div>

In [None]:
# Task 1.2: Summary statistics
print(f"Unique transactions (orders): {df['Order Number'].nunique():,}")
print(f"Unique items: {df['Item Name'].nunique():,}")
print(f"\nTop 10 most frequent items:")
print(df["Item Name"].value_counts().head(10))

**Interpretation:** In 2–3 sentences, describe your dataset. What does each transaction represent? Any data quality issues?

**Sample:** The Restaurant Orders dataset contains individual item-level records from a single Indian restaurant. Each row represents one item within an order, with Order Number grouping items into baskets. The data is relatively clean, with no cancellations or returns to filter, but some items appear only once across all orders, so they will fall below any practical min_support threshold.

---
## Part 2: Basket Transformation (2 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ Task 2.1</strong><br>
  Transform your data into one-hot encoded basket format: one row per transaction, one column per item, Boolean values. Print the shape and density (percentage of True values).
</div>

In [None]:
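# Task 2.1: Basket transformation
# Pivot to one row per order and one column per item; missing pairs become 0
basket = df.groupby(["Order Number", "Item Name"])["Quantity"].sum().unstack().fillna(0)
# Convert quantities to Booleans (DataFrame.applymap is deprecated since pandas 2.1)
basket = basket > 0

# Density = share of True cells in the one-hot matrix
density = basket.sum().sum() / (basket.shape[0] * basket.shape[1]) * 100
print(f"Basket shape: {basket.shape[0]:,} transactions × {basket.shape[1]:,} items")
print(f"Density: {density:.2f}%")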

**Interpretation:** Is the basket matrix sparse or dense? Why does sparsity matter for MBA?

**Sample:** The basket matrix is very sparse: density is well under 5%. This means most customers order only a handful of the available menu items. Sparsity matters for MBA because the number of possible item combinations grows combinatorially with the number of columns, while most of those combinations never occur. Apriori uses min_support to prune rare items and itemsets early, so it never has to test every possible item pair.
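
To make the density and pruning argument concrete, here is a minimal sketch on a hypothetical five-order toy basket. The item names and orders are invented for illustration and are not taken from the lab dataset. It computes density as the share of filled cells, then counts how many item pairs remain once items below a chosen minimum support are dropped, which is the pruning idea Apriori relies on.

```python
from itertools import combinations
from math import comb

# Hypothetical toy orders (invented for illustration only)
orders = [
    {"naan", "curry"},
    {"naan", "curry", "rice"},
    {"rice", "chutney"},
    {"naan", "curry"},
    {"pakora"},
]

items = sorted(set().union(*orders))
n_orders, n_items = len(orders), len(items)

# Density: filled cells divided by all cells in the one-hot matrix
filled = sum(len(order) for order in orders)
density = filled / (n_orders * n_items) * 100

# Support of each single item: share of orders containing it
support = {item: sum(item in order for order in orders) / n_orders for item in items}

# Apriori pruning: an itemset can only be frequent if all of its subsets are,
# so candidate pairs are built only from items that meet min_support on their own
min_support = 0.4
frequent_single_items = [item for item in items if support[item] >= min_support]
all_pairs = comb(n_items, 2)
candidate_pairs = list(combinations(frequent_single_items, 2))

print(f"Density: {density:.1f}%")
print(f"All possible pairs: {all_pairs}")
print(f"Candidate pairs after pruning: {len(candidate_pairs)}")
```

On a real menu with dozens of items the gap between all possible pairs and the pruned candidates is far larger, which is why the threshold choice in Part 3 matters.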

---
## Part 3: Apriori and Rule Generation (4 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ Task 3.1</strong><br>
  Run <code>apriori()</code> with a min_support threshold of your choice. <strong>Justify your choice</strong> in the markdown cell below.
</div>

In [None]:
# Task 3.1: Run Apriori
# Restaurant dataset is small with fewer items, so we can use higher min_support
frequent_items = apriori(basket, min_support=0.05, use_colnames=True)
print(f"Frequent itemsets found: {len(frequent_items)}")
print(f"\nTop 10 by support:")
print(frequent_items.sort_values("support", ascending=False).head(10).to_string(index=False))

**min_support justification:** Why did you choose this threshold? (1–2 sentences)

**Sample:** We chose min_support = 0.05 (5%) because the restaurant dataset is small (~660 orders) with fewer unique items (~70) than the Online Retail II dataset. At this threshold an itemset must appear in at least ~33 orders (0.05 × 660), enough to be statistically meaningful. Lower thresholds would return one-off combinations.
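
The threshold arithmetic in the sample answer can be checked with a short sketch. The order count of 660 is the approximate figure quoted above, and the helper name `min_order_count` is hypothetical (it is not part of mlxtend).

```python
import math


def min_order_count(min_support: float, n_orders: int) -> int:
    """Smallest number of orders an itemset needs in order to reach min_support."""
    # Round before ceil so floating-point noise does not add a spurious extra order
    return math.ceil(round(min_support * n_orders, 9))


n_orders = 660  # approximate order count quoted in the sample answer
for threshold in (0.01, 0.05, 0.10):
    needed = min_order_count(threshold, n_orders)
    print(f"min_support={threshold:.2f} -> at least {needed} orders")
```

Running the same calculation for a few candidate thresholds before calling `apriori()` is one way to justify the choice in terms a reader can picture: a number of orders rather than a fraction.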

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ Task 3.2</strong><br>
  Generate association rules with <code>metric="lift"</code> and <code>min_threshold=1.5</code>. Apply an additional confidence filter of your choice (justify it). Display the top 15 rules by lift with all 5 metrics.
</div>

In [None]:
# Task 3.2: Generate and filter rules
rules_all = association_rules(frequent_items, metric="lift", min_threshold=1.5)
print(f"Rules before confidence filter: {len(rules_all)}")

rules = rules_all[rules_all["confidence"] >= 0.25]
rules = rules.sort_values("lift", ascending=False)
print(f"Rules after confidence ≥ 0.25 filter: {len(rules)}")

print(f"\nTop 15 by Lift:")
print(rules[["antecedents", "consequents", "support", "confidence",
             "lift", "leverage", "conviction"]].head(15).to_string(index=False))

**Interpretation:** How many rules before and after filtering? What does that tell you about the density of associations?

**Sample:** We generated approximately 40 rules before the confidence filter and ~25 after filtering at confidence ≥ 0.25. The relatively low number tells us that the restaurant menu doesn't have as many strong cross-item associations as a general retailer. Most customers order fairly independently, though a few menu pairings (like rice dishes with naan bread) do show consistent co-purchasing.
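
For readers checking what the columns in the rule table mean, the sketch below computes support, confidence, and lift for a single rule from raw counts, using the standard definitions. The toy orders and the helper `rule_metrics` are hypothetical; they are not part of the lab dataset or of mlxtend.

```python
# Hypothetical toy orders (invented for illustration only)
orders = [
    {"curry", "naan"},
    {"curry", "naan", "rice"},
    {"curry", "rice"},
    {"naan"},
    {"rice", "chutney"},
    {"curry", "naan"},
    {"chutney"},
    {"rice"},
]


def rule_metrics(antecedent: set, consequent: set, orders: list) -> dict:
    """Support, confidence, and lift for the rule antecedent -> consequent.

    Assumes both the antecedent and the consequent occur in at least one order.
    """
    n = len(orders)
    n_ant = sum(antecedent <= order for order in orders)   # orders containing the antecedent
    n_con = sum(consequent <= order for order in orders)   # orders containing the consequent
    n_both = sum((antecedent | consequent) <= order for order in orders)

    support = n_both / n              # share of all orders containing both sides
    confidence = n_both / n_ant       # share of antecedent orders that also contain the consequent
    lift = confidence / (n_con / n)   # confidence relative to the consequent's baseline rate
    return {"support": support, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"curry"}, {"naan"}, orders)
print(metrics)
```

A lift above 1 means the consequent shows up more often alongside the antecedent than its overall frequency would predict, which is what the `min_threshold=1.5` filter in Task 3.2 selects for.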

---
## Part 4: Visualizations (4 pts)

Produce **at least two** of the following three visualizations.

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 VISUALIZATION OPTIONS</strong><br>
  <ul>
    <li><strong>Option A:</strong> Network graph: top 15 rules, nodes = items, edges = rules, thickness = lift</li>
    <li><strong>Option B:</strong> Support vs confidence scatter: each dot = one rule, bubble size = lift, colorbar</li>
    <li><strong>Option C:</strong> Lift heatmap: top 20 antecedent-consequent pairs</li>
  </ul>
</div>
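
The solution cells in this part cover Options A and B. For Option C, a minimal sketch of one possible approach is shown here, assuming a rules table shaped like the output of `association_rules()`. The `toy_rules` DataFrame, its item names, and its lift values are invented for illustration; with real data, `rules` would take its place.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the `rules` DataFrame returned by association_rules();
# item names and lift values are invented for illustration only
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"curry"}), frozenset({"curry"}),
                    frozenset({"rice"}), frozenset({"pakora"})],
    "consequents": [frozenset({"naan"}), frozenset({"rice"}),
                    frozenset({"chutney"}), frozenset({"chutney"})],
    "lift": [3.5, 2.1, 2.2, 1.8],
})

# Keep the strongest pairs, then turn frozensets into readable labels
top = toy_rules.nlargest(20, "lift").copy()
top["antecedent"] = top["antecedents"].apply(lambda s: ", ".join(sorted(s)))
top["consequent"] = top["consequents"].apply(lambda s: ", ".join(sorted(s)))

# Rows = antecedents, columns = consequents, cells = lift (NaN where no rule exists)
lift_matrix = top.pivot(index="antecedent", columns="consequent", values="lift")

fig, ax = plt.subplots(figsize=(8, 5))
image = ax.imshow(lift_matrix.values, cmap="YlOrRd")
ax.set_xticks(range(lift_matrix.shape[1]))
ax.set_xticklabels(lift_matrix.columns, rotation=45, ha="right")
ax.set_yticks(range(lift_matrix.shape[0]))
ax.set_yticklabels(lift_matrix.index)
fig.colorbar(image, ax=ax, label="Lift")
ax.set_title("Lift Heatmap (toy rules)")
fig.tight_layout()
plt.show()
```

Empty cells mean no rule was found for that pair, not that lift is zero. Passing `lift_matrix` to seaborn's `heatmap` with `annot=True` is a common alternative to the plain matplotlib rendering used here.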

In [None]:
# Task 4.1: Visualization 1 - Network Graph
top_rules = rules.head(15).copy()
G = nx.DiGraph()

for _, row in top_rules.iterrows():
    ant = ", ".join(list(row["antecedents"]))
    con = ", ".join(list(row["consequents"]))
    G.add_edge(ant, con, weight=row["lift"])

plt.figure(figsize=(14, 10))
pos = nx.spring_layout(G, k=2.5, seed=42)
edge_widths = [G[u][v]["weight"] / 2 for u, v in G.edges()]

nx.draw_networkx_edges(G, pos, width=edge_widths, alpha=0.6,
                       edge_color="steelblue", arrows=True, arrowsize=20)
nx.draw_networkx_nodes(G, pos, node_size=800, node_color="#F39C12", alpha=0.8)
nx.draw_networkx_labels(G, pos, font_size=7, font_weight="bold")

edge_labels = {(u, v): f'{G[u][v]["weight"]:.1f}' for u, v in G.edges()}
nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=7, font_color="red")

plt.title("Market Basket Network: Top 15 Rules by Lift", fontsize=13)
plt.axis("off")
plt.tight_layout()
plt.show()

In [None]:
# Task 4.2: Visualization 2 - Support vs Confidence Scatter
plt.figure(figsize=(12, 7))

scatter = plt.scatter(rules["support"], rules["confidence"],
                      s=rules["lift"] * 80, alpha=0.5,
                      c=rules["lift"], cmap="YlOrRd", edgecolors="gray", linewidth=0.5)

plt.colorbar(scatter, label="Lift")
plt.axhline(y=0.4, color="gray", linestyle="--", alpha=0.4, label="Confidence = 0.4")
plt.axvline(x=0.05, color="gray", linestyle="--", alpha=0.4, label="Support = 0.05")

plt.xlabel("Support", fontsize=12)
plt.ylabel("Confidence", fontsize=12)
plt.title("MBA Rules: Support vs Confidence (Bubble size & color = Lift)", fontsize=13)
plt.legend(fontsize=10)
plt.tight_layout()
plt.show()

**Interpretation:** What patterns do you see? Do any product clusters emerge? Any surprises?

**Sample:** The network graph reveals two main product clusters: one around naan/rice/curry combinations (the core meal components) and another around appetizer pairings (pakora, poppadom, chutney). The scatter plot shows most rules concentrated in the low-support, moderate-confidence zone, consistent with a restaurant where no single combination dominates but several reliable pairings exist. The surprise was seeing chutney appear in so many rules: it's an inexpensive add-on that gets bundled into many different orders.

---
## Part 5: Business Memo (5 pts)

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 MEMO REQUIREMENTS</strong><br>
  Write a 1-page memo addressed to a non-technical store owner. Include:
  <ol>
    <li><strong>What you did:</strong> 2–3 sentences explaining MBA in plain language</li>
    <li><strong>Three actionable rules:</strong> For each: what products, how strong, one specific action</li>
    <li><strong>One recommendation to skip:</strong> A rule that appeared but isn't worth acting on. Explain why.</li>
  </ol>
  <strong>Minimum 250 words.</strong> No code, no jargon. Write as if presenting to a client.
</div>

### Business Memo

**TO:** [Store Owner Name]
**FROM:** [Your Name], Data Analyst
**DATE:** [Date]
**RE:** Product Association Analysis: Recommendations for Store Layout and Promotions

---

Dear Restaurant Owner,

We analyzed your order history to find out which menu items your customers tend to order together. This analysis, sometimes called "market basket analysis," looks at thousands of past orders to identify patterns that aren't obvious from day-to-day observation. Think of it as discovering the hidden combos your customers are already creating on their own.

**Three patterns worth acting on:**

**1. Naan bread and curry dishes (strength: 3.5x more likely than chance).** Customers who order any curry are 3.5 times more likely to add naan than the average customer. This is intuitive, but the strength of the connection suggests an opportunity: offer a "Curry + Naan Combo" at a slight discount (£1 off when bundled). The data shows over 40% of curry orders already include naan; a combo deal could push that to 60%+ while increasing the average order value.

**2. Pakora and main dishes (strength: 2.8x more likely).** Customers ordering main courses frequently add pakora as a starter. Consider moving pakora to a prominent "Add a Starter?" prompt during online ordering. At your price point, even a 10% increase in pakora add-ons across 600+ monthly orders adds meaningful revenue.

**3. Rice dishes and condiments (strength: 2.2x more likely).** Mango chutney and raita appear frequently alongside rice-based dishes. These are low-cost, high-margin items. Train staff to suggest them at the counter: "Would you like chutney or raita with your biryani?" The data justifies this: it's not upselling, it's anticipating what customers already want.

**One pattern to ignore:** We found a weak association between dessert items and appetizers (lift = 1.2). While technically above random, this likely reflects a small number of large group orders rather than a genuine pairing. Promoting desserts alongside appetizers is unlikely to drive incremental sales.

**The risk of inaction:** Your top 20% of order combinations generate over 50% of your revenue. Understanding and promoting these patterns isn't optional; it's the difference between a menu that sells and a menu that just lists.

---

---
## Part 6: Reflection (2 pts)

In 3–4 sentences, answer: What surprised you most about the association rules your data revealed? Was there a rule that didn't make intuitive sense at first, and if so, what explanation did you come up with? If you were to run this analysis again in three months, what might change and why?

**Sample:** The most surprising finding was how strongly condiments (chutney, raita) associated with specific main dishes rather than appearing randomly across all orders. I initially expected them to be 'universal add-ons' with roughly equal support across all mains, but the data showed clear preferences: certain curries drive chutney orders while rice dishes drive raita. This makes sense when you think about it culturally, but I wouldn't have predicted the strength of the signal. If I ran this again in three months, summer menu changes and seasonal ingredients could shift the patterns: lighter dishes might replace heavy curries, changing which combinations dominate.

---
---
## Bonus Challenge (Up to 3 Extra Points)

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #7D6608;">⭐ OPTIONAL</strong><br>
  Filter your dataset to <strong>two different time periods</strong> (Q1 vs Q4, weekdays vs weekends, morning vs evening). Run the full MBA pipeline on each subset separately.
  <ol>
    <li>Show the top 5 rules by lift for each time period</li>
    <li>Identify at least one rule that appears in one period but not the other</li>
    <li>Explain what this tells you about temporal purchasing behavior (3–4 sentences)</li>
  </ol>
  No scaffolding: apply the pipeline independently to each subset.
</div>

In [None]:
# BONUS: Temporal comparison (Restaurant: weekday vs weekend)
df["OrderDate"] = pd.to_datetime(df["Order Date"], dayfirst=True)
df["DayOfWeek"] = df["OrderDate"].dt.dayofweek
df["Period"] = df["DayOfWeek"].apply(lambda x: "Weekend" if x >= 5 else "Weekday")

for period in ["Weekday", "Weekend"]:
    subset = df[df["Period"] == period]
    b = subset.groupby(["Order Number", "Item Name"])["Quantity"].sum().unstack().fillna(0)
    # Boolean basket (DataFrame.applymap is deprecated since pandas 2.1)
    b = b > 0
    fi = apriori(b, min_support=0.05, use_colnames=True)
    if len(fi) > 0:
        r = association_rules(fi, metric="lift", min_threshold=1.5)
        r = r[r["confidence"] >= 0.25].sort_values("lift", ascending=False)
        print(f"\n{'='*60}")
        print(f"  {period}: {len(r)} rules from {b.shape[0]} orders")
        print(f"{'='*60}")
        print(r[["antecedents", "consequents", "lift", "confidence"]].head(5).to_string(index=False))
    else:
        print(f"\n{period}: Not enough frequent itemsets at min_support=0.05")

**Bonus interpretation:** **Sample:** Weekday orders show stronger associations between quick lunch items (single curry + naan + drink), while weekend orders show more diverse baskets with appetizers, shared dishes, and desserts. The 'Pakora + Main' rule appears strongly on weekends (lift ~3.2) but weakly on weekdays (lift ~1.5), suggesting weekend customers are more likely to order multi-course meals. This tells the restaurant to focus appetizer promotions on weekend marketing and combo deals on weekday lunch specials.
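
The second bonus requirement (a rule that appears in one period but not the other) can be checked mechanically rather than by eye. The sketch below is one possible approach using set operations. The rule sets are hypothetical placeholders; in the lab they would be built from each period's rules DataFrame.

```python
# Hypothetical rules per period, written as (antecedents, consequents) pairs;
# the item names are invented for illustration only
weekday_rules = {
    (frozenset({"curry"}), frozenset({"naan"})),
    (frozenset({"rice"}), frozenset({"chutney"})),
}
weekend_rules = {
    (frozenset({"curry"}), frozenset({"naan"})),
    (frozenset({"pakora"}), frozenset({"curry"})),
}


def format_rule(rule: tuple) -> str:
    """Render a rule as 'a, b -> c' with items sorted for stable output."""
    antecedents, consequents = rule
    return f"{', '.join(sorted(antecedents))} -> {', '.join(sorted(consequents))}"


# Set difference and intersection answer "which rules are unique to a period"
weekday_only = sorted(format_rule(r) for r in weekday_rules - weekend_rules)
weekend_only = sorted(format_rule(r) for r in weekend_rules - weekday_rules)
shared = sorted(format_rule(r) for r in weekday_rules & weekend_rules)

print("Weekday only:", weekday_only)
print("Weekend only:", weekend_only)
print("Shared:", shared)
```

With the real output, each set could be built inside the loop as `set(zip(r["antecedents"], r["consequents"]))` and stored per period, since mlxtend already returns both columns as frozensets.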

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Lab 04: Market Basket Analysis | 20 Points (+3 Bonus)
</p>