# CSIS4260: Association Rules and Market Basket Analysis

In this lesson, we will learn how to find interesting patterns in customer purchases, like:
- People who buy **bread** also buy **milk**
- Customers who purchase **diapers** often buy **baby wipes** too

This type of analysis is called **Market Basket Analysis**. It is done using **Association Rules** and algorithms like **Apriori** and **FP-Growth**.

## Why Do We Use Association Rules?
Imagine you are running a grocery store. You want to know what items are frequently bought together so that you can:
- Place them near each other in the store
- Offer discounts to increase sales
- Recommend related products online

**Association rules** help discover these patterns automatically by analyzing transaction data.

## Important Definitions
- **Itemset**: A set of items that appear together in a transaction (like `{'bread', 'milk'}`)
- **Support**: The fraction of all transactions that contain the itemset (e.g., `3 out of 5 = 60%`)
- **Confidence**: Of the transactions that contain X, the fraction that also contain Y (e.g., `80% of bread buyers also buy milk`)
- **Lift**: How much more likely Y is bought when X is bought, compared with how often Y is bought overall. A lift above 1 means X and Y appear together more often than expected if they were independent, a lift of 1 means no association, and a lift below 1 means they appear together less often than expected.
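
The same three definitions can be written as formulas. The notation here is an assumption introduced for illustration: $N$ is the total number of transactions and $\text{count}(X)$ is the number of transactions that contain every item in $X$.

```latex
\begin{aligned}
\text{support}(X) &= \frac{\text{count}(X)}{N} \\
\text{confidence}(X \rightarrow Y) &= \frac{\text{support}(X \cup Y)}{\text{support}(X)} \\
\text{lift}(X \rightarrow Y) &= \frac{\text{confidence}(X \rightarrow Y)}{\text{support}(Y)}
\end{aligned}
```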

## Let's Learn: Support, Confidence, and Lift

Imagine you run a small grocery shop. You have a notebook where you write down what each customer buys.

Here are 5 transactions (like your 5 customers):

| Transaction | Items Bought                |
|-------------|-----------------------------|
| 1           | Milk, Bread, Apple          |
| 2           | Milk, Bread, Nuts           |
| 3           | Milk, Bread                 |
| 4           | Milk, Bread, Apple          |
| 5           | Milk, Bread, Apple          |

---

###  Example 1: `Milk and Bread → Apple`

This means: if someone buys milk and bread, do they also buy apple?

Let’s break it down:

#### 1. **Support**
Support = How many transactions have ALL 3 items: Milk, Bread, and Apple?

- Transactions 1, 4, and 5 have all three.
- So that's 3 out of 5 → `3 / 5 = 0.60 = 60%`

 **Support = 60%**

---

#### 2. **Confidence**
Confidence = Of all the people who bought milk and bread, how many also bought apple?

- Milk and Bread appear together in **all 5** transactions.
- Apple is also present in only 3 of those 5.

So → `3 / 5 = 60%`

 **Confidence = 60%**

---

#### 3. **Lift**
Lift = Confidence ÷ Support of just Apple

- Confidence = 60% (we just calculated that)
- Support of just Apple = Apple appears in Transactions 1, 4, 5 → 3 out of 5 = `60%`

So → `0.6 / 0.6 = 1.0`

 **Lift = 1.0**
> This means that knowing a customer bought Milk & Bread does not change how likely they are to buy Apple, so the rule is not informative.

---

###  Example 2: `Bread → Milk`

This means: if someone buys bread, do they also buy milk?

#### 1. **Support**
How many transactions have both Bread and Milk?

- All 5 have both → `5 / 5 = 100%`

 **Support = 100%**

---

#### 2. **Confidence**
How many of the people who bought Bread also bought Milk?

- Bread appears in all 5
- All of them also have Milk → `5 / 5 = 100%`

 **Confidence = 100%**

---

#### 3. **Lift**
What’s the support of just Milk?

- Milk is also in all 5 → `5 / 5 = 100%`

So → `1.0 / 1.0 = 1.0`

 **Lift = 1.0**
> Again, the rule is not informative: Milk is in every basket, so knowing that a customer bought Bread tells us nothing new.
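
The two worked examples can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper names `support`, `confidence`, and `lift` are chosen here for illustration and are not part of any library.

```python
# The five transactions from the table above, as sets of items.
transactions = [
    {"Milk", "Bread", "Apple"},
    {"Milk", "Bread", "Nuts"},
    {"Milk", "Bread"},
    {"Milk", "Bread", "Apple"},
    {"Milk", "Bread", "Apple"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(X and Y together) divided by support(X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """confidence(X -> Y) divided by support(Y)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

# Example 1 and Example 2 from the text
for lhs, rhs in [({"Milk", "Bread"}, {"Apple"}), ({"Bread"}, {"Milk"})]:
    print(sorted(lhs), "->", sorted(rhs),
          "support:", support(lhs | rhs, transactions),
          "confidence:", confidence(lhs, rhs, transactions),
          "lift:", lift(lhs, rhs, transactions))
```

Both rules come out with a lift of 1.0, matching the hand calculations.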

---

###  So, What Makes a Rule "Good"?

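- **High Support** = The pattern happens often
- **High Confidence** = When X is bought, Y is usually bought too
- **Lift > 1** = Buying X makes Y *more* likely

If Lift is **greater than 1**, X and Y are bought together more often than chance would predict, so the rule is worth a closer look. A Lift **below 1** means buying X makes Y *less* likely, and a rule with high Lift but very low Support may rest on only a handful of transactions.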
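
Both examples so far gave a lift of exactly 1.0, so it may help to see a case where lift moves away from 1. The baskets below are hypothetical, invented purely for illustration, and `rule_metrics` is a helper name assumed here.

```python
# Hypothetical baskets, invented for illustration only.
baskets = [
    {"diapers", "wipes"},
    {"diapers", "wipes", "milk"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"bread"},
]

def rule_metrics(antecedent, consequent, baskets):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(baskets)
    count_x = sum(antecedent <= b for b in baskets)                   # baskets with X
    count_y = sum(consequent <= b for b in baskets)                   # baskets with Y
    count_xy = sum((antecedent | consequent) <= b for b in baskets)   # baskets with both
    support = count_xy / n
    confidence = count_xy / count_x
    lift = confidence / (count_y / n)
    return support, confidence, lift

for x, y in [({"diapers"}, {"wipes"}), ({"milk"}, {"bread"})]:
    s, c, l = rule_metrics(x, y, baskets)
    print(f"{sorted(x)} -> {sorted(y)}: support={s:.2f}, confidence={c:.2f}, lift={l:.2f}")
```

In this toy data `diapers -> wipes` has a lift above 1 (a positive association), while `milk -> bread` has a lift below 1 (the two are bought together less often than chance would suggest).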


In [None]:
# First install the mlxtend package
!pip install mlxtend

In [1]:
# Import necessary libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth
from mlxtend.preprocessing import TransactionEncoder

## Let's Create a Simple Grocery Dataset
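We'll create a synthetic dataset of 100 transactions over 10 items. About 70% of the baskets contain milk, bread, and apple together, and every basket also gets 2 to 4 random extra items.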

In [2]:
import pandas as pd
import random
from mlxtend.preprocessing import TransactionEncoder

# Base items that should appear together frequently
base_items = ['milk', 'bread', 'apple']

# Extra noise items to reach a total of 10
extra_items = ['eggs', 'butter', 'juice', 'cheese', 'yogurt', 'banana', 'chicken']

# Create transactions
random.seed(42)
transactions = []
for _ in range(100):
    if random.random() < 0.7:
        basket = base_items.copy()
    else:
        basket = []
    basket += random.sample(extra_items, random.randint(2, 4))
    transactions.append(list(set(basket)))

# One-hot encode the transactions, then convert the boolean columns to 0/1 integers
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_).astype(int)


# Check support of base itemset
support = df[['milk', 'bread', 'apple']].all(axis=1).mean()
print("Support for {milk, bread, apple}:", support)
df


Support for {milk, bread, apple}: 0.61


Unnamed: 0,apple,banana,bread,butter,cheese,chicken,eggs,juice,milk,yogurt
0,1,1,1,0,0,0,0,1,1,0
1,1,1,1,0,0,0,1,0,1,0
2,1,0,1,0,1,1,1,0,1,1
3,1,0,1,1,0,0,0,0,1,1
4,1,1,1,1,1,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...
95,1,0,1,0,0,1,1,0,1,1
96,1,0,1,0,1,1,1,1,1,0
97,1,0,1,1,1,0,1,1,1,0
98,1,1,1,1,0,1,0,1,1,0



## Step-by-Step: Using Apriori to Find Patterns

In [6]:
# Step 1: Find frequent itemsets with at least 20% support
frequent_itemsets = apriori(df, min_support=0.20, use_colnames=True)
frequent_itemsets



Unnamed: 0,support,itemsets
0,0.61,(apple)
1,0.41,(banana)
2,0.61,(bread)
3,0.50,(butter)
4,0.35,(cheese)
...,...,...
61,0.24,"(cheese, apple, bread, milk)"
62,0.29,"(apple, chicken, bread, milk)"
63,0.31,"(apple, eggs, bread, milk)"
64,0.30,"(juice, apple, bread, milk)"


In [7]:
# Step 2: Generate rules from the frequent itemsets
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

Unnamed: 0,antecedents,consequents,support,confidence,lift
0,(banana),(apple),0.23,0.560976,0.919632
1,(apple),(bread),0.61,1.000000,1.639344
2,(bread),(apple),0.61,1.000000,1.639344
3,(butter),(apple),0.25,0.500000,0.819672
4,(cheese),(apple),0.24,0.685714,1.124122
...,...,...,...,...,...
161,"(yogurt, bread, milk)",(apple),0.25,1.000000,1.639344
162,"(apple, yogurt)","(bread, milk)",0.25,1.000000,1.639344
163,"(yogurt, bread)","(apple, milk)",0.25,1.000000,1.639344
164,"(yogurt, milk)","(apple, bread)",0.25,1.000000,1.639344


In [8]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(banana),(apple),0.41,0.61,0.23,0.560976,0.919632,1.0,-0.0201,0.888333,-0.129012,0.291139,-0.125704,0.469012
1,(apple),(bread),0.61,0.61,0.61,1.000000,1.639344,1.0,0.2379,inf,1.000000,1.000000,1.000000,1.000000
2,(bread),(apple),0.61,0.61,0.61,1.000000,1.639344,1.0,0.2379,inf,1.000000,1.000000,1.000000,1.000000
3,(butter),(apple),0.50,0.61,0.25,0.500000,0.819672,1.0,-0.0550,0.780000,-0.305556,0.290698,-0.282051,0.454918
4,(cheese),(apple),0.35,0.61,0.24,0.685714,1.124122,1.0,0.0265,1.240909,0.169872,0.333333,0.194139,0.539578
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161,"(yogurt, bread, milk)",(apple),0.25,0.61,0.25,1.000000,1.639344,1.0,0.0975,inf,0.520000,0.409836,1.000000,0.704918
162,"(apple, yogurt)","(bread, milk)",0.25,0.61,0.25,1.000000,1.639344,1.0,0.0975,inf,0.520000,0.409836,1.000000,0.704918
163,"(yogurt, bread)","(apple, milk)",0.25,0.61,0.25,1.000000,1.639344,1.0,0.0975,inf,0.520000,0.409836,1.000000,0.704918
164,"(yogurt, milk)","(apple, bread)",0.25,0.61,0.25,1.000000,1.639344,1.0,0.0975,inf,0.520000,0.409836,1.000000,0.704918


## Try Another Way: FP-Growth Algorithm

In [11]:
# Same process using FP-Growth instead of Apriori
fp_itemsets = fpgrowth(df, min_support=0.4, use_colnames=True)
rules_fp = association_rules(fp_itemsets, metric="confidence", min_threshold=0.6)
rules_fp[['antecedents', 'consequents', 'support', 'confidence', 'lift']]



Unnamed: 0,antecedents,consequents,support,confidence,lift
0,(bread),(milk),0.61,1.0,1.639344
1,(milk),(bread),0.61,1.0,1.639344
2,(apple),(bread),0.61,1.0,1.639344
3,(bread),(apple),0.61,1.0,1.639344
4,(apple),(milk),0.61,1.0,1.639344
5,(milk),(apple),0.61,1.0,1.639344
6,"(apple, bread)",(milk),0.61,1.0,1.639344
7,"(apple, milk)",(bread),0.61,1.0,1.639344
8,"(bread, milk)",(apple),0.61,1.0,1.639344
9,(apple),"(bread, milk)",0.61,1.0,1.639344


## How to Read These Rules
Let's say we found this rule:

`{'milk', 'bread'} → {'apple'}`

- This means: if someone buys milk and bread, they are likely to buy apples as well
- If **confidence = 80%**, then 80% of the baskets that contain milk and bread also contain apples
- If **lift = 1.2**, apples are 20% more likely to be in a basket that contains milk and bread than in an average basket, which makes it a helpful rule

You can now use this info to:
- Suggest apples when someone adds bread & milk to their cart
- Offer a discount on apples to boost sales
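
As a rough sketch of the cart-suggestion idea, the snippet below matches a cart against a small list of rules. The rule values, the `recommend` function, and its thresholds are all hypothetical and only meant to show the mechanics.

```python
# Hypothetical rules: (antecedent, consequent, confidence, lift).
rules = [
    (frozenset({"milk", "bread"}), frozenset({"apple"}), 0.80, 1.2),
    (frozenset({"butter"}), frozenset({"apple"}), 0.50, 0.8),
]

def recommend(cart, rules, min_confidence=0.6, min_lift=1.0):
    """Suggest items whose rule antecedent is fully contained in the cart."""
    cart = set(cart)
    suggestions = set()
    for antecedent, consequent, conf, lift in rules:
        # Only fire rules that match the cart and clear both thresholds
        if antecedent <= cart and conf >= min_confidence and lift > min_lift:
            suggestions |= consequent - cart  # skip items already in the cart
    return sorted(suggestions)

print(recommend({"milk", "bread", "eggs"}, rules))  # ['apple']
print(recommend({"butter"}, rules))                 # []
```

The second call returns nothing because the butter rule falls below both thresholds.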

## More Theory: Downward Closure Property
The **Downward Closure Property** (also called the Apriori Property) states:
> If an itemset is frequent, then all of its subsets must also be frequent.

**Example:**
If `{'milk', 'bread', 'apple'}` appears in 60% of transactions, then:
- `{'milk', 'bread'}` must also appear in at least 60%
- `{'milk', 'apple'}`, `{'bread', 'apple'}`, `{'milk'}`, etc., must also be frequent

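Apriori uses this property in reverse to **prune** the search early: if an itemset is *not* frequent, none of its supersets can be frequent, so they never need to be counted. This greatly reduces the number of combinations to check.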
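
To make the pruning idea concrete, here is a small level-wise search written in plain Python. It is a teaching sketch under simplifying assumptions (the function name `apriori_sketch` is made up here, and it recounts every candidate by scanning all transactions), not how mlxtend implements `apriori`.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for all frequent itemsets (level-wise search)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_among(candidates):
        # Count each candidate and keep those that reach min_support
        result = {}
        for c in candidates:
            s = sum(c <= t for t in transactions) / n
            if s >= min_support:
                result[c] = s
        return result

    # Level 1: single items
    items = {item for t in transactions for item in t}
    current = frequent_among({frozenset([i]) for i in items})
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (downward closure): every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        current = frequent_among(candidates)
        frequent.update(current)
        k += 1
    return frequent

# The 5-transaction table from earlier in the lesson
toy = [
    {"Milk", "Bread", "Apple"},
    {"Milk", "Bread", "Nuts"},
    {"Milk", "Bread"},
    {"Milk", "Bread", "Apple"},
    {"Milk", "Bread", "Apple"},
]
frequent = apriori_sketch(toy, min_support=0.6)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Because `Nuts` fails the threshold at level 1, no pair or triple containing `Nuts` is ever generated or counted.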


## Support Count Table Example
Here is how many times each itemset appears in the dataset:


In [None]:
# Calculate and print support counts for all combinations of 1 or 2 items
from collections import Counter
from itertools import combinations

# Count how many transactions contain each itemset of size 1..max_len
def get_support_counts(df, max_len=2):
    counts = Counter()
    # Iterate over rows (transactions); iterating over df directly yields column names
    for _, row in df.iterrows():
        # Items present in this transaction (columns whose value is 1/True)
        items = sorted(item for item in df.columns if row[item])
        for r in range(1, max_len + 1):
            for combo in combinations(items, r):
                counts[combo] += 1
    return pd.DataFrame(list(counts.items()), columns=['Itemset', 'Support Count'])

# Display support counts
get_support_counts(df)


## Manual Calculation of Confidence and Lift
Let's calculate the confidence and lift for the rule `{'milk', 'bread'} → {'apple'}` using the 5-transaction table from earlier.

We need these support counts:
- Count(`milk, bread, apple`) = 3
- Count(`milk, bread`) = 5
- Count(`apple`) = 3
- Total transactions = 5

**Support** = Count(A ∪ B) / Total = `3 / 5 = 0.6` (60%)  
**Confidence** = Count(A ∪ B) / Count(A) = `3 / 5 = 0.6` (60%)  
**Lift** = Confidence / Support(B) = `0.6 / (3/5) = 1.0`

Here A = `{milk, bread}` and B = `{apple}`. A lift of 1.0 tells us that buying milk and bread does not change how likely a customer is to buy apple. No strong association.


## Redundant Rules
Some rules may be **redundant** — they don’t give new information.

**Example:**
Both of these rules may appear:
- `{'bread'} → {'milk'}`
- `{'bread', 'apple'} → {'milk'}`

The second rule might not add new insight if the first one is already strong.  
A common approach is to drop the longer rule when a shorter rule with the same consequent, whose antecedent is a subset of the longer rule's antecedent, has equal or higher confidence.
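
One possible way to filter such rules is sketched below. The rule list is computed by hand from the 5-transaction table earlier in the lesson, and `drop_redundant` is a hypothetical helper name, not an mlxtend function.

```python
# Rules as (antecedent, consequent, confidence), taken from the 5-transaction table.
rules = [
    (frozenset({"bread"}), frozenset({"milk"}), 1.0),
    (frozenset({"bread", "apple"}), frozenset({"milk"}), 1.0),
    (frozenset({"milk", "bread"}), frozenset({"apple"}), 0.6),
]

def drop_redundant(rules):
    """Drop a rule if a more general rule predicts the same consequent at least as well."""
    kept = []
    for ant, cons, conf in rules:
        redundant = any(
            other_cons == cons        # same consequent
            and other_ant < ant       # strictly smaller antecedent (proper subset)
            and other_conf >= conf    # at least as confident
            for other_ant, other_cons, other_conf in rules
        )
        if not redundant:
            kept.append((ant, cons, conf))
    return kept

for ant, cons, conf in drop_redundant(rules):
    print(sorted(ant), "->", sorted(cons), conf)
```

Here `{'bread', 'apple'} → {'milk'}` is dropped because `{'bread'} → {'milk'}` already reaches the same confidence with a shorter antecedent.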


## Summary: Apriori vs FP-Growth
| Feature | Apriori | FP-Growth |
|---------|---------|-----------|
| Approach | Breadth-first | Depth-first using FP-tree |
| Needs Multiple Passes? | Yes (one pass per itemset size) | No (two passes to build the FP-tree) |
| Speed on Large Data | Slower | Faster |
| Memory Use | Low | Higher |
| Easy to Understand? | Yes | Medium |

**When to use FP-Growth?**
- When your dataset is large and sparse  
- When speed is critical
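
To give a feel for where FP-Growth's speed comes from, the sketch below builds only the FP-tree, the compressed structure that FP-Growth mines instead of rescanning the data. It covers tree construction alone, not the mining step, and the `FPNode`, `build_fp_tree`, and `count_nodes` names are assumptions made for this illustration.

```python
from collections import Counter

class FPNode:
    """One node of the prefix tree: an item, a count, and child nodes."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count how many transactions contain each item
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in item_counts.items() if c >= min_count}

    # Pass 2: insert each transaction, with items ordered by descending frequency
    root = FPNode()
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-item_counts[i], i))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1  # shared prefixes reuse the same node
    return root

def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())

# The 5-transaction table from earlier in the lesson
toy = [
    {"Milk", "Bread", "Apple"},
    {"Milk", "Bread", "Nuts"},
    {"Milk", "Bread"},
    {"Milk", "Bread", "Apple"},
    {"Milk", "Bread", "Apple"},
]
root = build_fp_tree(toy, min_count=3)
print("nodes in tree:", count_nodes(root))
```

All five transactions share the same prefix, so the whole table collapses into a single path of three nodes.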
