<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/demos/week06_demo_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 6 Demo - Market Basket Analysis: What Do Customers Buy Together?
**CAP4767 Data Mining with Python** | Miami Dade College, Kendall Campus

**Chapter 8** | Competencies: 4.1–4.6, 2.1–2.2

| Session | Content |
|---------|---------|
| **Session 1** | MBA metrics by hand (toy dataset) → Apriori with mlxtend → transition to real data |
| **Session 2** | Full pipeline on Online Retail II → visualizations → group exercise breakout |

**The business problem:** An online retailer runs "Frequently Bought Together" suggestions, but they're based on gut feeling, not data. Some recommendations boost sales; others confuse customers. Today we build the system that replaces intuition with evidence.

**Pipeline position:** Week 5 answered "who are our customers?" (segments). This week answers "what do they buy *together*?", a fundamentally different question that uses the same data but a completely different algorithm.

**Dataset continuity:** This is the same Online Retail II dataset from Week 5. Same data, different lens.

---
## Setup

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Run this cell to install mlxtend and load all libraries. Do not modify.
</div>

In [None]:
# ============================================================
# Setup - Run this cell. Do not modify.
# ============================================================
!pip install mlxtend -q

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
from collections import Counter
from itertools import combinations

np.random.seed(42)
plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["figure.dpi"] = 100
sns.set_style("whitegrid")

print("✅ All libraries loaded (including mlxtend)")

# NOTE: You may see DeprecationWarning messages about datetime.utcnow()
# These come from Colab's internal libraries, not your code. Ignore them.

---
# SESSION 1 - Understanding MBA Metrics

---
# Example 1 - Calculate Support, Confidence, and Lift by Hand

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  Before handing data to a library, you need to understand what the library calculates. We'll compute MBA metrics manually on a tiny dataset so you can verify the library's output later. Three metrics:
  <ul>
    <li><strong>Support:</strong> How often does this combination appear? (frequency)</li>
    <li><strong>Confidence:</strong> If someone buys A, how likely are they to also buy B? (conditional probability)</li>
    <li><strong>Lift:</strong> How much more likely is B when A is present than B is across all baskets? (signal strength: lift > 1 means A and B co-occur more often than independence would predict)</li>
  </ul>
</div>
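
Written as formulas, the three metrics look like this. The notation is introduced here for illustration: $N$ is the total number of transactions and $\text{count}(X)$ is the number of transactions that contain every item in $X$.

```latex
\begin{aligned}
\text{support}(A \to B)    &= \frac{\text{count}(A \cup B)}{N} \\
\text{confidence}(A \to B) &= \frac{\text{count}(A \cup B)}{\text{count}(A)} = P(B \mid A) \\
\text{lift}(A \to B)       &= \frac{\text{confidence}(A \to B)}{\text{support}(B)}
                            = \frac{P(A \cap B)}{P(A)\,P(B)}
\end{aligned}
```

The last form shows why lift = 1 corresponds to independence: the joint probability equals the product of the individual probabilities.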

In [None]:
# Toy dataset - 12 grocery transactions with known associations
transactions = [
    ['Diapers', 'Beer'],
    ['Diapers', 'Beer', 'Milk'],
    ['Diapers', 'Beer', 'Bread'],
    ['Diapers', 'Beer', 'Eggs'],
    ['Diapers', 'Cookies'],
    ['Beer', 'Chips'],
    ['Cheese', 'Wine', 'Grapes'],
    ['Cheese', 'Wine'],
    ['Cheese', 'Wine', 'Crackers'],
    ['Cheese', 'Bread'],
    ['Wine'],
    ['Bread', 'Eggs']
]

N = len(transactions)
print(f"Total transactions: {N}")
print(f"Unique items: {len(set(item for txn in transactions for item in txn))}")
for i, txn in enumerate(transactions):
    print(f"  T{i+1}: {txn}")

In [None]:
# Manual calculation: Diapers → Beer
# Step 1: Count occurrences
diapers_count = sum(1 for txn in transactions if 'Diapers' in set(txn))
beer_count = sum(1 for txn in transactions if 'Beer' in set(txn))
both_count = sum(1 for txn in transactions if {'Diapers', 'Beer'}.issubset(set(txn)))

print(f"Transactions with Diapers: {diapers_count}/{N}")
print(f"Transactions with Beer:    {beer_count}/{N}")
print(f"Transactions with BOTH:    {both_count}/{N}")

# Step 2: Calculate metrics
support = both_count / N
confidence = both_count / diapers_count  # P(Beer | Diapers)
lift = confidence / (beer_count / N)     # Confidence / P(Beer)

print(f"\n--- Diapers → Beer ---")
print(f"Support:    {support:.3f}  ({both_count} out of {N} transactions)")
print(f"Confidence: {confidence:.3f}  ({both_count} out of {diapers_count} Diapers transactions)")
print(f"Lift:       {lift:.2f}   (>1 means a positive association)")

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 READING THE NUMBERS</strong><br>
  <ul>
    <li><strong>Support = 0.333:</strong> This pair appears in 33% of all baskets, which is common enough to matter</li>
    <li><strong>Confidence = 0.80:</strong> When someone buys Diapers, there's an 80% chance they also buy Beer</li>
    <li><strong>Lift = 1.92:</strong> Customers who buy Diapers are ~2x more likely to buy Beer than the average customer. Lift > 1 = real signal.</li>
  </ul>
</div>

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #7D6608;">⚠️ COMMON MISTAKE: Direction Matters</strong><br>
  Confidence is <strong>not symmetric</strong>. Diapers → Beer and Beer → Diapers are different rules: the support is the same, but confidence divides by a different count (Diapers baskets in one direction, Beer baskets in the other). The two values match only when both items appear in the same number of baskets. This is why we generate rules in both directions.
</div>
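
A minimal sketch of the asymmetry, using a pair whose items appear in different numbers of baskets (Diapers and Bread). The `confidence` helper is hypothetical and written only for this illustration; the transactions are copied from the toy dataset above.

```python
# Toy transactions copied from the cell above
transactions = [
    ['Diapers', 'Beer'],
    ['Diapers', 'Beer', 'Milk'],
    ['Diapers', 'Beer', 'Bread'],
    ['Diapers', 'Beer', 'Eggs'],
    ['Diapers', 'Cookies'],
    ['Beer', 'Chips'],
    ['Cheese', 'Wine', 'Grapes'],
    ['Cheese', 'Wine'],
    ['Cheese', 'Wine', 'Crackers'],
    ['Cheese', 'Bread'],
    ['Wine'],
    ['Bread', 'Eggs'],
]

def confidence(antecedent, consequent, txns):
    """Share of baskets containing the antecedent that also contain the consequent."""
    with_antecedent = [t for t in txns if antecedent in t]
    with_both = [t for t in with_antecedent if consequent in t]
    return len(with_both) / len(with_antecedent)

d_to_b = confidence('Diapers', 'Bread', transactions)  # 1 of 5 Diapers baskets
b_to_d = confidence('Bread', 'Diapers', transactions)  # 1 of 3 Bread baskets

print(f"Diapers -> Bread: {d_to_b:.3f}")
print(f"Bread -> Diapers: {b_to_d:.3f}")
```

Both rules share one co-occurring basket, so their support is identical; only the denominator differs.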

### Your Turn #1

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ TRY THIS</strong><br>
  Calculate Support, Confidence, and Lift for <strong>Cheese → Wine</strong>. Then calculate <strong>Beer → Diapers</strong>. Is the confidence the same as Diapers → Beer?
</div>

In [None]:
# Your Turn #1 - Cheese → Wine and Beer → Diapers

# Cheese → Wine
cheese_count = sum(1 for txn in transactions if 'Cheese' in set(txn))
wine_count = sum(1 for txn in transactions if 'Wine' in set(txn))
both_cw = sum(1 for txn in transactions if {'Cheese', 'Wine'}.issubset(set(txn)))

support_cw = both_cw / N
confidence_cw = both_cw / cheese_count
lift_cw = confidence_cw / (wine_count / N)

print(f"--- Cheese → Wine ---")
print(f"Support:    {support_cw:.3f}")
print(f"Confidence: {confidence_cw:.3f}")
print(f"Lift:       {lift_cw:.2f}")

# Beer → Diapers
both_bd = sum(1 for txn in transactions if {'Beer', 'Diapers'}.issubset(set(txn)))

support_bd = both_bd / N
confidence_bd = both_bd / beer_count
lift_bd = confidence_bd / (diapers_count / N)

print(f"\n--- Beer → Diapers ---")
print(f"Support:    {support_bd:.3f}")
print(f"Confidence: {confidence_bd:.3f}")
print(f"Lift:       {lift_bd:.2f}")

---
# Example 2 - Apriori with mlxtend on the Toy Dataset

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  Manual calculation works for 12 transactions and 11 items. Real datasets have millions of transactions and thousands of items. The <strong>Apriori algorithm</strong> finds frequent itemsets without checking every possible combination: once an itemset falls below the minimum support threshold, every larger itemset that contains it is pruned without being counted.
</div>
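
The pruning idea can be sketched in plain Python. This is a simplified, level-wise illustration and not mlxtend's implementation; the function name `apriori_sketch` and the four-basket `toy` list are assumptions made up for this example.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Level-wise frequent-itemset search (simplified illustration)."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: single items that meet the threshold
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: drop any candidate with an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Count support only for the candidates that survived pruning
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

toy = [['Diapers', 'Beer'], ['Diapers', 'Beer', 'Milk'],
       ['Beer', 'Chips'], ['Cheese', 'Wine']]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

Items such as Milk and Chips fail at level 1, so no pair containing them is ever counted. That skipped work is the saving Apriori provides.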

In [None]:
# Step 1: Transform transactions to one-hot encoding
te = TransactionEncoder()
te_array = te.fit_transform(transactions)
basket = pd.DataFrame(te_array, columns=te.columns_)

print(f"Basket matrix shape: {basket.shape}")
print(f"Columns: {list(basket.columns)}")
basket.head()

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHAT JUST HAPPENED</strong><br>
  <code>TransactionEncoder</code> converts a list of lists into a Boolean matrix: one row per transaction, one column per item. True = item was in the basket. This is the format Apriori needs.
</div>
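
Conceptually, the encoding step amounts to the following. This is a plain-Python sketch of the idea and not how mlxtend implements it; `one_hot` is a hypothetical helper name.

```python
def one_hot(transactions):
    """Return (sorted item list, Boolean rows), one row per transaction."""
    columns = sorted({item for txn in transactions for item in txn})
    rows = [[item in set(txn) for item in columns] for txn in transactions]
    return columns, rows

columns, rows = one_hot([['Diapers', 'Beer'], ['Beer', 'Chips']])
print(columns)  # ['Beer', 'Chips', 'Diapers']
print(rows)     # [[True, False, True], [True, True, False]]
```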

In [None]:
# Step 2: Run Apriori - find frequent itemsets
frequent_items = apriori(basket, min_support=0.15, use_colnames=True)
print(f"Frequent itemsets found: {len(frequent_items)}")
frequent_items.sort_values("support", ascending=False)

In [None]:
# Step 3: Generate association rules
rules = association_rules(frequent_items, metric="lift", min_threshold=1.0)
rules = rules.sort_values("lift", ascending=False)

print(f"Association rules generated: {len(rules)}")
print(f"\nAll 5 metrics:")
rules[["antecedents", "consequents", "support", "confidence", "lift", "leverage", "conviction"]].head(10)

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 STOP AND VERIFY</strong><br>
  Check: Does the Diapers → Beer rule match our manual calculation?
  <ul>
    <li>Support ≈ 0.333 ✓</li>
    <li>Confidence ≈ 0.80 ✓</li>
    <li>Lift ≈ 1.92 ✓</li>
  </ul>
  The library gives us two bonus metrics:
  <ul>
    <li><strong>Leverage:</strong> How much more often A and B appear together than if they were independent (0 = independent)</li>
    <li><strong>Conviction:</strong> How strongly A implies B (∞ = perfect implication, 1 = independent)</li>
  </ul>
</div>
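
The two bonus metrics can also be checked by hand. The sketch below uses the standard definitions, leverage = support(A∪B) − support(A)·support(B) and conviction = (1 − support(B)) / (1 − confidence), with the Diapers → Beer counts from the toy dataset hard-coded so that the cell runs on its own.

```python
N = 12
diapers_count, beer_count, both_count = 5, 5, 4  # counts from the toy dataset

support_a = diapers_count / N
support_b = beer_count / N
support_ab = both_count / N
confidence = both_count / diapers_count

# Leverage: observed co-occurrence minus what independence would predict
leverage = support_ab - support_a * support_b

# Conviction: grows without bound as confidence approaches 1
# (division by zero when confidence == 1, i.e. "perfect implication")
conviction = (1 - support_b) / (1 - confidence)

print(f"Leverage:   {leverage:.3f}")
print(f"Conviction: {conviction:.3f}")
```

Compare these two values against the `leverage` and `conviction` columns in the rules table above.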

### Your Turn #2

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ TRY THIS</strong><br>
  Lower <code>min_support</code> to 0.05. How many more itemsets appear? Raise <code>min_threshold</code> (lift) to 2.0. Which rules survive?
</div>

In [None]:
# Your Turn #2 - Experiment with thresholds

# Try stricter support
strict = apriori(basket, min_support=0.25, use_colnames=True)
print(f"min_support=0.25: {len(strict)} itemsets")

# Try looser support (the value suggested in the prompt)
loose = apriori(basket, min_support=0.05, use_colnames=True)
print(f"min_support=0.05: {len(loose)} itemsets")

# Generate rules at different lift thresholds
rules_loose = association_rules(loose, metric="lift", min_threshold=1.0)
print(f"\nRules at lift ≥ 1.0: {len(rules_loose)}")
rules_strict = association_rules(loose, metric="lift", min_threshold=2.0)
print(f"Rules at lift ≥ 2.0: {len(rules_strict)}")

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 END OF SESSION 1 CHECKPOINT</strong><br>
  You now understand:
  <ul>
    <li><strong>Support</strong> = how common is this combination?</li>
    <li><strong>Confidence</strong> = if A is in the basket, what's the probability of B?</li>
    <li><strong>Lift</strong> = how much more likely is B given A than B is overall? (the signal)</li>
    <li><strong>Leverage / Conviction</strong> = additional strength measures</li>
  </ul>
  <strong>Session 2:</strong> We apply this to 400,000+ real transactions and build visualizations.
</div>

---
# SESSION 2 - Full Pipeline: MBA on Online Retail II

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 THE SHIFT</strong><br>
  Toy dataset: 12 transactions, 11 items. Now: 400,000+ transactions, 3,800+ products. Same algorithm, same metrics, but the scale reveals patterns no human could spot manually.
</div>

---
# Example 3 - Complete MBA Pipeline

### Part 1: Load and Clean

In [None]:
# Load Online Retail II (same dataset from Week 5)
url = "https://github.com/fenago/datasets/raw/refs/heads/main/Online%20Retail%20(1).xlsx"
df = pd.read_excel(url, engine="openpyxl")
print(f"Raw shape: {df.shape[0]:,} rows × {df.shape[1]} columns")

# Clean for MBA
df["InvoiceNo"] = df["InvoiceNo"].astype(str)
df = df[~df["InvoiceNo"].str.startswith("C")]          # Remove cancellations
df = df.dropna(subset=["CustomerID"])                # Remove guest checkouts
df = df[df["Quantity"] > 0]                           # Remove returns/zero
df = df[df["UnitPrice"] > 0]                              # Remove free items
df = df[df["StockCode"].astype(str).str.match(r"^\d")]  # Keep only product codes
df["Description"] = df["Description"].str.strip()     # Clean whitespace
df = df.dropna(subset=["Description"])                # Remove null descriptions

print(f"Clean shape: {df.shape[0]:,} rows")
print(f"Unique invoices: {df['InvoiceNo'].nunique():,}")
print(f"Unique products: {df['Description'].nunique():,}")

# Top 10 most purchased products
print(f"\nTop 10 Products:")
print(df["Description"].value_counts().head(10))

### Part 2: Transform to Basket Matrix

In [None]:
# Create basket matrix: 1 row per invoice, 1 column per product, Boolean values
basket = df.groupby(["InvoiceNo", "Description"])["Quantity"].sum().unstack().fillna(0)
basket = basket > 0  # Convert to True/False (avoids applymap, deprecated in recent pandas)

print(f"Basket matrix shape: {basket.shape[0]:,} invoices × {basket.shape[1]:,} products")
density = basket.sum().sum() / (basket.shape[0] * basket.shape[1]) * 100
print(f"Matrix density: {density:.2f}% (very sparse: most cells are False)")

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY IT'S SPARSE</strong><br>
  Each invoice contains maybe 5–20 products out of 3,800+. That means 99%+ of cells are False. This sparsity is normal for MBA, and it's why Apriori's pruning strategy is essential. Without it, we'd be testing millions of combinations.
</div>
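
A quick back-of-the-envelope check of "millions of combinations", assuming roughly 3,800 products as quoted above (the exact count depends on the cleaning step):

```python
from math import comb

n_products = 3800  # approximate product count quoted above

pairs = comb(n_products, 2)    # every possible 2-item combination
triples = comb(n_products, 3)  # every possible 3-item combination

print(f"Candidate pairs:   {pairs:,}")
print(f"Candidate triples: {triples:,}")
```

Pairs alone run into the millions and triples into the billions, which is why counting only the candidates that survive pruning matters.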

### Part 3: Run Apriori and Generate Rules

In [None]:
# Run Apriori - find frequent itemsets (min 1% of all baskets)
frequent_items = apriori(basket, min_support=0.01, use_colnames=True)
print(f"Frequent itemsets found: {len(frequent_items)}")
print(f"\nTop 10 by support:")
print(frequent_items.sort_values("support", ascending=False).head(10).to_string(index=False))

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #7D6608;">⚠️ WHY min_support = 0.01?</strong><br>
  With 3,800 products, most individual items appear in less than 1% of baskets. Setting min_support = 0.01 (1%) keeps the computation manageable while still catching meaningful patterns. Going higher (0.03) runs faster but leaves fewer itemsets to build rules from. Going to 0.001 will likely crash your session, because the number of candidate itemsets explodes combinatorially.
</div>

In [None]:
# Generate association rules
rules = association_rules(frequent_items, metric="lift", min_threshold=1.5)
rules = rules[rules["confidence"] >= 0.2]  # Second filter: reasonable confidence
rules = rules.sort_values("lift", ascending=False)

print(f"Rules after filtering (lift ≥ 1.5, confidence ≥ 0.2): {len(rules)}")
print(f"\nTop 15 Rules by Lift:")
print(rules[["antecedents", "consequents", "support", "confidence",
             "lift", "leverage", "conviction"]].head(15).to_string(index=False))

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 READING THE RULES</strong><br>
  <ul>
    <li><strong>High lift (>3):</strong> Strong signal: these products are bought together far more often than chance would predict</li>
    <li><strong>High confidence (>0.5):</strong> Reliable: when A is in the basket, B usually follows</li>
    <li><strong>Support matters too:</strong> A rule with lift=10 but support=0.001 might only apply to 20 transactions, which is not worth a store-wide promotion</li>
  </ul>
  The best rules balance all three: high lift (real signal), high confidence (reliable), and decent support (enough transactions to matter).
</div>
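
One way to express "balance all three" is a combined filter. The sketch below uses made-up rules (not results from the dataset) and a hypothetical `is_actionable` helper; the lift and confidence cut-offs come from the guidance above, while the support floor of 0.01 is an assumption.

```python
# Hypothetical rules for illustration only; not results from the dataset
example_rules = [
    {"rule": "A -> B", "support": 0.030, "confidence": 0.62, "lift": 8.5},
    {"rule": "C -> D", "support": 0.001, "confidence": 0.90, "lift": 10.0},
    {"rule": "E -> F", "support": 0.045, "confidence": 0.30, "lift": 4.0},
]

def is_actionable(rule, min_support=0.01, min_confidence=0.5, min_lift=3.0):
    """True only when a rule clears all three thresholds at once."""
    return (rule["support"] >= min_support
            and rule["confidence"] > min_confidence
            and rule["lift"] > min_lift)

actionable = [r["rule"] for r in example_rules if is_actionable(r)]
print(actionable)  # ['A -> B']
```

Rule C → D has the highest lift but fails on support, and E → F fails on confidence; only A → B clears all three.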

---
### Part 4: Visualizations

#### Visualization 1 - Network Graph of Top Rules by Lift

In [None]:
import networkx as nx

# Use more rules to get variety, then deduplicate
top_rules = rules.head(30).copy()
G = nx.DiGraph()

for _, row in top_rules.iterrows():
    ant = ", ".join([x[:25] for x in sorted(row["antecedents"])])
    con = ", ".join([x[:25] for x in sorted(row["consequents"])])
    # Only add if reverse edge doesn't already exist with higher lift
    if G.has_edge(con, ant) and G[con][ant]["weight"] >= row["lift"]:
        continue
    G.add_edge(ant, con, weight=row["lift"], confidence=row["confidence"])

plt.figure(figsize=(16, 12))
pos = nx.spring_layout(G, k=2.0, seed=42)

# Cap edge width so high-lift rules don't dominate visually
edge_widths = [min(G[u][v]["weight"] / 10, 5) for u, v in G.edges()]
nx.draw_networkx_edges(G, pos, width=edge_widths, alpha=0.5,
                       edge_color="steelblue", arrows=True, arrowsize=15)
nx.draw_networkx_nodes(G, pos, node_size=900, node_color="#F39C12", alpha=0.8)
nx.draw_networkx_labels(G, pos, font_size=6, font_weight="bold")

edge_labels = {(u, v): f'{G[u][v]["weight"]:.1f}' for u, v in G.edges()}
nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=6, font_color="red")

plt.title("Market Basket Network - Top Rules by Lift\n(Edge thickness ∝ Lift, capped for readability)", fontsize=13)
plt.axis("off")
plt.tight_layout()
plt.show()

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 READING THE NETWORK</strong><br>
  Thicker arrows = higher lift. Clusters of connected nodes = product families that are frequently bought together. Isolated pairs = niche associations. The arrows show direction: which product triggers the association.
</div>

#### Visualization 2 - Support vs Confidence Scatter

In [None]:
# Pre-filled - Support vs Confidence scatter (bubble size = lift)
plt.figure(figsize=(12, 7))

scatter = plt.scatter(rules["support"], rules["confidence"],
                      s=rules["lift"] * 50, alpha=0.5,
                      c=rules["lift"], cmap="YlOrRd", edgecolors="gray", linewidth=0.5)

plt.colorbar(scatter, label="Lift")
plt.axhline(y=0.5, color="gray", linestyle="--", alpha=0.4, label="Confidence = 0.5")
plt.axvline(x=0.05, color="gray", linestyle="--", alpha=0.4, label="Support = 0.05")

plt.xlabel("Support (how common)", fontsize=12)
plt.ylabel("Confidence (how reliable)", fontsize=12)
plt.title("MBA Rules ‚Äî Support vs Confidence\n(Bubble size & color = Lift)", fontsize=13)
plt.legend(fontsize=10)
plt.tight_layout()
plt.show()

# Annotate the sweet spot
high_quality = rules[(rules["support"] >= 0.05) & (rules["confidence"] >= 0.5)]
print(f"Rules in the 'sweet spot' (support ≥ 0.05, confidence ≥ 0.5): {len(high_quality)}")

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 READING THE SCATTER</strong><br>
  <ul>
    <li><strong>Top-right quadrant</strong> (high support + high confidence) = the gold: common AND reliable</li>
    <li><strong>Top-left</strong> (low support + high confidence) = niche but strong; works for targeted campaigns</li>
    <li><strong>Bottom-right</strong> (high support + low confidence) = common but unreliable; many buy A without B</li>
    <li><strong>Large, hot-colored bubbles</strong> = highest lift, the strongest associations regardless of quadrant</li>
  </ul>
</div>
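
The quadrant reading can be turned into a small labelling helper. This is a sketch: `quadrant` is a hypothetical function, the cut-offs mirror the dashed lines in the plot (support = 0.05, confidence = 0.5), and the bottom-left label is an addition not described in the list above.

```python
def quadrant(support, confidence, support_cut=0.05, confidence_cut=0.5):
    """Name the scatter-plot quadrant a rule falls into."""
    high_support = support >= support_cut
    high_confidence = confidence >= confidence_cut
    if high_support and high_confidence:
        return "top-right: common and reliable"
    if high_confidence:
        return "top-left: niche but strong"
    if high_support:
        return "bottom-right: common but unreliable"
    return "bottom-left: rare and unreliable"  # label added for completeness

# Example (support, confidence) pairs chosen for illustration
print(quadrant(0.08, 0.70))
print(quadrant(0.02, 0.65))
print(quadrant(0.09, 0.25))
```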

---
### Your Turn - Build the Heatmap and Top 10 Table

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS: Live Class Participation</strong><br>
  Build: (1) A lift heatmap of the top 20 rules, and (2) a clean Top 10 Actionable Rules table.
</div>

In [None]:
# YOUR CODE HERE - Visualization 3: Lift Heatmap
# Create a pivot-like structure: rows = antecedents, columns = consequents, values = lift
# Use rules.head(20) for readability
# Hint: Create antecedent/consequent string columns, then pivot
top20 = rules.head(20).copy()
# sorted() keeps label order stable, since frozensets have no fixed iteration order
top20["ant"] = top20["antecedents"].apply(lambda x: ", ".join(i[:20] for i in sorted(x)))
top20["con"] = top20["consequents"].apply(lambda x: ", ".join(i[:20] for i in sorted(x)))

heatmap_data = top20.pivot_table(index="ant", columns="con", values="lift", aggfunc="max")

plt.figure(figsize=(14, 8))
sns.heatmap(heatmap_data, annot=True, fmt=".1f", cmap="YlOrRd",
            linewidths=0.5, cbar_kws={"label": "Lift"})
plt.title("Lift Heatmap - Top 20 Association Rules", fontsize=13)
plt.xlabel("Consequent (then buy...)")
plt.ylabel("Antecedent (if buy...)")
plt.tight_layout()
plt.show()

In [None]:
# YOUR CODE HERE - Top 10 Actionable Rules Table
# Print a clean, formatted table of the top 10 rules by lift
# Include: antecedents, consequents, support, confidence, lift
# Format numbers to 3 decimal places
top10 = rules.head(10).copy()
top10["antecedents"] = top10["antecedents"].apply(lambda x: ", ".join(sorted(x)))
top10["consequents"] = top10["consequents"].apply(lambda x: ", ".join(sorted(x)))

display_cols = top10[["antecedents", "consequents", "support", "confidence", "lift"]].copy()
display_cols.columns = ["If customer buys...", "They also buy...", "Support", "Confidence", "Lift"]
display_cols = display_cols.reset_index(drop=True)
display_cols.index = display_cols.index + 1
display_cols.index.name = "Rank"

print(display_cols.to_string(float_format=lambda x: f"{x:.3f}"))


### Your Turn #3

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">‚úÖ TRY THIS</strong><br>
  Filter the data to France-only transactions (<code>df[df["Country"] == "France"]</code>) and rebuild the basket + run Apriori. Do the top rules change? Try December-only ‚Äî do seasonal patterns appear?
</div>

In [None]:
# Your Turn #3 - Country or seasonal filter (France-only MBA shown here)
df_france = df[df["Country"] == "France"]
print(f"France transactions: {df_france['InvoiceNo'].nunique()} invoices")

basket_fr = df_france.groupby(["InvoiceNo", "Description"])["Quantity"].sum().unstack().fillna(0)
basket_fr = basket_fr > 0  # Convert to True/False; works on any pandas version

freq_fr = apriori(basket_fr, min_support=0.05, use_colnames=True)
print(f"Frequent itemsets: {len(freq_fr)}")

if len(freq_fr) > 0:
    rules_fr = association_rules(freq_fr, metric="lift", min_threshold=1.5)
    rules_fr = rules_fr[rules_fr["confidence"] >= 0.2].sort_values("lift", ascending=False)
    print(f"Rules found: {len(rules_fr)}")
    if len(rules_fr) > 0:
        rules_fr["antecedents"] = rules_fr["antecedents"].apply(lambda x: ", ".join(x))
        rules_fr["consequents"] = rules_fr["consequents"].apply(lambda x: ", ".join(x))
        print(rules_fr[["antecedents", "consequents", "support", "confidence", "lift"]].head(10).to_string(index=False))
    else:
        print("No rules - try lowering min_support")
else:
    print("No frequent itemsets - try lowering min_support")

---
## Takeaway

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ WHAT WE BUILT TODAY</strong><br>
  A complete Market Basket Analysis pipeline: from raw transactions to actionable product association rules, with three visualization types to communicate findings.
</div>

**The MBA Workflow:**

| Step | What Happens | Key Decision |
|------|-------------|-------------|
| 1. Clean | Remove cancellations, nulls, non-products | What counts as a valid transaction? |
| 2. Transform | One-hot basket matrix | Which column is the "item"? |
| 3. Apriori | Find frequent itemsets | What min_support? (Too low = explosion, too high = nothing) |
| 4. Rules | Generate association rules | What metric and threshold? |
| 5. Filter | Confidence + lift thresholds | Balance signal strength vs coverage |
| 6. Visualize | Network, scatter, heatmap | Which chart for which audience? |
| 7. Recommend | Translate rules to business actions | Product placement, bundles, cross-sell |

**Same data, three lenses:**

| Week | Question | Method | Output |
|------|----------|--------|--------|
| 5 | Who are our customers? | RFM + K-Means | Customer segments |
| 6 | What do they buy together? | Apriori + MBA | Product association rules |
| 7–8 | What will happen next? | Decision Trees + Text Mining | Predictions + unstructured data |

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Week 6 Demo - Market Basket Analysis
</p>