# Lab 08: Association Rule Mining

**Objective:** This lab aims to introduce the concept of association rule mining, a popular technique for discovering interesting relationships hidden in large datasets. We will explore how to use existing Python packages to perform association rule mining and interpret the results. Finally, you will apply these techniques to a new dataset.

## 1. Introduction to Association Rule Mining

Association rule mining is a rule-based machine learning method for discovering interesting relations between variables in large databases. It identifies strong rules by scoring candidate rules with measures of interestingness such as support, confidence, and lift.

The classic example is market basket analysis, in which a retailer tries to understand customers' purchasing behavior. For example, an association rule might be `{Diapers} -> {Beer}`, which suggests that customers who buy diapers also tend to buy beer.

Key concepts in association rule mining include:

* **Itemset:** A collection of one or more items. E.g., `{Milk, Bread, Diaper}`.
* **Support:** The fraction of transactions that contain an itemset. It indicates the popularity of an itemset. 
    $$Support(X) = \frac{\text{Number of transactions containing X}}{\text{Total number of transactions}}$$
* **Confidence:** Measures how often items in Y appear in transactions that contain X. It indicates the likelihood of item Y being purchased when item X is purchased.
    $$Confidence(X \rightarrow Y) = \frac{Support(X \cup Y)}{Support(X)}$$
* **Lift:** Measures how much more often X and Y occur together than expected if they were statistically independent. A lift greater than 1 suggests a positive association, a lift of 1 is consistent with independence, and a lift below 1 suggests a negative association.
    $$Lift(X \rightarrow Y) = \frac{Support(X \cup Y)}{Support(X) \times Support(Y)}$$
* **Antecedent (LHS):** The itemset on the left-hand side of the rule (e.g., `{Diapers}`).
* **Consequent (RHS):** The itemset on the right-hand side of the rule (e.g., `{Beer}`).
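
The three formulas above can be checked by hand on a handful of baskets. The sketch below uses a made-up five-transaction list (not one of the lab datasets) and hypothetical helper functions `count`, `support`, `confidence`, and `lift` written in plain Python; it is meant only to make the definitions concrete.

```python
# Hypothetical toy baskets, invented purely for illustration.
transactions = [
    {"Diapers", "Beer", "Milk"},
    {"Diapers", "Beer"},
    {"Diapers", "Bread"},
    {"Milk", "Bread"},
    {"Beer", "Bread"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset):
    """Fraction of transactions that contain `itemset`."""
    return count(itemset) / len(transactions)

def confidence(x, y):
    """Support(X ∪ Y) / Support(X), computed from raw counts."""
    return count(set(x) | set(y)) / count(x)

def lift(x, y):
    """Support(X ∪ Y) / (Support(X) * Support(Y)), computed from raw counts."""
    n = len(transactions)
    return count(set(x) | set(y)) * n / (count(x) * count(y))

x, y = {"Diapers"}, {"Beer"}
print("support:   ", support(x | y))
print("confidence:", confidence(x, y))
print("lift:      ", lift(x, y))
```

Here Diapers and Beer each appear in three baskets and together in two, so the support is 2/5, the confidence is 2/3, and the lift is 10/9 (slightly above 1).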

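A widely used algorithm for association rule mining is the **Apriori algorithm**. It builds frequent itemsets level by level, relying on the fact that every subset of a frequent itemset must itself be frequent.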
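
As a rough illustration of how Apriori works, the sketch below grows frequent itemsets one size at a time and discards any candidate that has an infrequent subset. It is a teaching sketch in plain Python, not the `mlxtend` implementation; the function name `apriori_sketch` and the toy baskets are assumptions made for this example.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: frequent single items
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items}
    current = {c for c in current if support(c) >= min_support}
    frequent = {c: support(c) for c in current}

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Keep only the candidates that reach the support threshold
        current = {c for c in candidates if support(c) >= min_support}
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

# Hypothetical toy baskets, invented for illustration
baskets = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
result = apriori_sketch(baskets, min_support=0.6)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With these baskets every single item and every pair is frequent, while `{a, b, c}` appears in only two of five baskets and is dropped, which ends the loop.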

## 2. Setup

First, let's install and import the necessary Python libraries. We'll primarily use `pandas` for data manipulation and `mlxtend` for association rule mining.

In [None]:
# Install mlxtend if you haven't already
!pip install mlxtend

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

## 3. Example 1: Basic Association Rule Mining

Let's start with a simple dataset of transactions.

In [None]:
# Sample transaction data
dataset1 = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
            ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
            ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
            ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
            ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

# Print the dataset
print("Raw Dataset 1:")
for transaction in dataset1:
    print(transaction)

### 3.1. Data Preprocessing

The `apriori` function in `mlxtend` expects a one-hot encoded DataFrame, where each row represents a transaction and each column represents an item. A cell is `True` (or `1`) if the item is in the transaction and `False` (or `0`) otherwise.

In [None]:
te = TransactionEncoder()
te_ary = te.fit(dataset1).transform(dataset1)
df1 = pd.DataFrame(te_ary, columns=te.columns_)

print("\nOne-Hot Encoded DataFrame 1:")
df1
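
To see what the encoding step amounts to, the same idea can be written without any library. This is a minimal sketch on a made-up three-transaction list, assuming alphabetically sorted columns; it mirrors the idea behind `TransactionEncoder` rather than its actual implementation.

```python
# Hypothetical mini dataset; note the repeated 'Onion' in the second transaction
dataset = [['Milk', 'Eggs'], ['Eggs', 'Onion', 'Onion'], ['Milk']]

# Column order: every distinct item, sorted alphabetically
columns = sorted({item for transaction in dataset for item in transaction})

# One row per transaction; True where the item is present
rows = [[item in set(transaction) for item in columns] for transaction in dataset]

print(columns)
for row in rows:
    print(row)
```

An item that is listed twice in a transaction still produces a single `True`, because the encoding records presence, not quantity.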

### 3.2. Apply Apriori Algorithm

Now, we apply the Apriori algorithm to find frequent itemsets. The `min_support` parameter specifies the minimum support threshold for an itemset to be considered frequent.

In [None]:
# Find frequent itemsets with min_support = 0.6 (i.e., itemset appears in at least 60% of transactions)
frequent_itemsets1 = apriori(df1, min_support=0.6, use_colnames=True)

print("Frequent Itemsets (min_support=0.6):")
frequent_itemsets1
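
A `min_support` fraction is often easier to reason about as a transaction count. The helper below is a hypothetical convenience function (not part of `mlxtend`), and it assumes the threshold is inclusive, that is, an itemset is kept when its support is greater than or equal to `min_support`.

```python
import math

def min_transactions(min_support, n_transactions):
    """Smallest transaction count whose support reaches `min_support`."""
    # Rounding guards against float noise such as 0.2 * 7 == 1.4000000000000001
    return math.ceil(round(min_support * n_transactions, 9))

print(min_transactions(0.6, 5))   # the 5-transaction example above
print(min_transactions(0.2, 7))   # a 7-transaction dataset
print(min_transactions(0.5, 7))
```

For the five transactions above, `min_support=0.6` therefore means an itemset must appear in at least three of them.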

### 3.3. Generate Association Rules

Once we have the frequent itemsets, we can generate association rules. We'll use `confidence` as the metric and set a minimum threshold (e.g., `min_threshold=0.7`).

In [None]:
# Generate association rules with min_confidence = 0.7
rules1 = association_rules(frequent_itemsets1, metric="confidence", min_threshold=0.7)

print("Association Rules (min_confidence=0.7):")
rules1[['antecedents', 'consequents', 'support', 'confidence', 'lift']]
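
Conceptually, rule generation splits each frequent itemset into an antecedent and a consequent in every possible way and keeps the splits whose confidence reaches the threshold. The sketch below shows that idea for a single itemset; `rules_from_itemset` and `support_of` are hypothetical names, and this is not how `association_rules` is implemented internally.

```python
from itertools import combinations

def rules_from_itemset(itemset, support_of, min_confidence):
    """Yield (antecedent, consequent, confidence) for every split of `itemset`."""
    itemset = frozenset(itemset)
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            # Confidence(X -> Y) = Support(X ∪ Y) / Support(X)
            conf = support_of[itemset] / support_of[antecedent]
            if conf >= min_confidence:
                yield antecedent, consequent, conf

# Supports of {Onion}, {Eggs}, and {Onion, Eggs} in the five transactions above
support_of = {
    frozenset({"Onion"}): 3 / 5,
    frozenset({"Eggs"}): 4 / 5,
    frozenset({"Onion", "Eggs"}): 3 / 5,
}

rules = list(rules_from_itemset({"Onion", "Eggs"}, support_of, min_confidence=0.7))
for antecedent, consequent, conf in rules:
    print(set(antecedent), "->", set(consequent), round(conf, 2))
```

Both directions pass the 0.7 threshold here, but with different confidence values: the same itemset can yield rules of different strength depending on which side is the antecedent.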

### 3.4. Interpret the Results

Let's look at one of the rules:
- **Rule:** `{Onion} -> {Eggs}`
- **Support:** This value (e.g., 0.6) means that 60% of all transactions contain both Onion and Eggs.
- **Confidence:** If the confidence is 1.0, it means that 100% of the transactions that contain Onion also contain Eggs.
- **Lift:** If the lift is, for example, 1.25, customers who buy Onion are 1.25 times as likely to buy Eggs as customers overall, which is the rate we would expect if the two purchases were independent. A lift > 1 indicates a positive association.
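
For this rule, the example values can be reproduced from the five transactions above. Onion appears in 3 of them, Eggs in 4, and both together in 3 (the repeated `Onion` in the last transaction counts once). Substituting into the definitions from Section 1:

$$Support(\{Onion\} \cup \{Eggs\}) = \frac{3}{5} = 0.6$$
$$Confidence(\{Onion\} \rightarrow \{Eggs\}) = \frac{0.6}{0.6} = 1.0$$
$$Lift(\{Onion\} \rightarrow \{Eggs\}) = \frac{0.6}{0.6 \times 0.8} = 1.25$$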

## 4. Example 2: Exploring Different Parameters

Let's use another small dataset and see how changing `min_support` and `min_threshold` for confidence affects the rules generated. This dataset is already in a list of lists format, suitable for `TransactionEncoder`.

In [None]:
dataset2 = [['bread', 'milk', 'butter'],
            ['bread', 'butter', 'cheese', 'jam'],
            ['milk', 'butter', 'cheese'],
            ['bread', 'milk', 'jam'],
            ['bread', 'milk', 'butter', 'cheese'],
            ['tea', 'milk'],
            ['bread', 'butter', 'jam']]

te2 = TransactionEncoder()
te_ary2 = te2.fit(dataset2).transform(dataset2)
df2 = pd.DataFrame(te_ary2, columns=te2.columns_)

print("One-Hot Encoded DataFrame 2:")
df2

### 4.1. Experimenting with `min_support`

Let's try a lower `min_support` first.

In [None]:
# Lower min_support
frequent_itemsets2_low_support = apriori(df2, min_support=0.2, use_colnames=True)  # with 7 transactions, an itemset must appear in at least 2 (2/7 ≈ 0.29)
print("Frequent Itemsets (min_support=0.2):")
print(frequent_itemsets2_low_support)

# Generate rules with min_confidence = 0.6
rules2_low_support = association_rules(frequent_itemsets2_low_support, metric="confidence", min_threshold=0.6)
print("\nAssociation Rules (min_support=0.2, min_confidence=0.6):")
rules2_low_support[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

Now, let's try a higher `min_support`.

In [None]:
# Higher min_support
frequent_itemsets2_high_support = apriori(df2, min_support=0.5, use_colnames=True)  # with 7 transactions, an itemset must appear in at least 4 (4/7 ≈ 0.57)
print("Frequent Itemsets (min_support=0.5):")
print(frequent_itemsets2_high_support)

# Generate rules with min_confidence = 0.6
rules2_high_support = association_rules(frequent_itemsets2_high_support, metric="confidence", min_threshold=0.6)
print("\nAssociation Rules (min_support=0.5, min_confidence=0.6):")
rules2_high_support[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

**Observation:**
You should notice that a lower `min_support` results in more frequent itemsets and potentially more rules. A higher `min_support` leads to fewer, more common itemsets and rules.
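
The effect can also be checked independently of `mlxtend` by counting frequent itemsets with brute force. The sketch below repeats the seven transactions of this example and uses a hypothetical helper `count_frequent`; enumerating every subset is only feasible because there are six distinct items.

```python
from itertools import combinations

dataset2 = [['bread', 'milk', 'butter'],
            ['bread', 'butter', 'cheese', 'jam'],
            ['milk', 'butter', 'cheese'],
            ['bread', 'milk', 'jam'],
            ['bread', 'milk', 'butter', 'cheese'],
            ['tea', 'milk'],
            ['bread', 'butter', 'jam']]

baskets = [set(t) for t in dataset2]
items = sorted(set().union(*baskets))

def count_frequent(min_support):
    """Count itemsets whose support is at least `min_support` (brute force)."""
    total = 0
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            hits = sum(1 for b in baskets if set(candidate) <= b)
            if hits / len(baskets) >= min_support:
                total += 1
    return total

low, high = count_frequent(0.2), count_frequent(0.5)
print("min_support=0.2:", low, "frequent itemsets")
print("min_support=0.5:", high, "frequent itemsets")
```

The first count should come out noticeably larger than the second, and the totals should match the number of rows in the two `apriori` results above.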

### 4.2. Experimenting with `min_threshold` for Confidence

Let's use the frequent itemsets from `min_support=0.2` and vary the `min_threshold` for confidence.

In [None]:
# Using frequent_itemsets2_low_support (min_support=0.2)

# Lower min_confidence
rules2_low_confidence = association_rules(frequent_itemsets2_low_support, metric="confidence", min_threshold=0.5)
print("Association Rules (min_support=0.2, min_confidence=0.5):")
print(rules2_low_confidence[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

# Higher min_confidence
rules2_high_confidence = association_rules(frequent_itemsets2_low_support, metric="confidence", min_threshold=0.8)
print("\nAssociation Rules (min_support=0.2, min_confidence=0.8):")
rules2_high_confidence[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

**Observation:**
A lower `min_threshold` for confidence will generate more rules, including those where the antecedent doesn't strongly imply the consequent. A higher `min_threshold` will yield fewer, but stronger, rules where the implication is more certain.

## 5. Task: Groceries Dataset

Now it's your turn! We have a small dataset of grocery items from a few transactions. Your task is to perform association rule mining on this dataset.

In [None]:
# Student dataset
student_dataset = [
    ['Apples', 'Bananas', 'Cereal'],
    ['Milk', 'Bread', 'Butter'],
    ['Apples', 'Bread', 'Eggs'],
    ['Bananas', 'Milk', 'Cereal', 'Sugar'],
    ['Apples', 'Milk', 'Bread', 'Butter'],
    ['Coffee', 'Sugar', 'Cookies'],
    ['Apples', 'Bananas', 'Bread'],
    ['Milk', 'Cereal', 'Sugar'],
    ['Apples', 'Bread', 'Butter', 'Cheese'],
    ['Bananas', 'Cereal', 'Yogurt']
]

# Print the student dataset
print("Student Dataset:")
for transaction in student_dataset:
    print(transaction)

### Your Tasks:

1.  **Load and Preprocess the Data:**
    * Use `TransactionEncoder` to transform `student_dataset` into a one-hot encoded pandas DataFrame.

In [None]:
# Your code here for Task 1


print("One-Hot Encoded Student DataFrame:")


2.  **Apply the Apriori Algorithm:**
    * Find frequent itemsets using the Apriori algorithm. Choose a `min_support` value that you think is reasonable for this dataset (e.g., an itemset should appear in at least 2 or 3 transactions). Justify your choice briefly.

In [None]:
# Your code here for Task 2
# Justification for min_support:
# The dataset has 10 transactions. A min_support of 0.2 means an itemset must appear in at least 10 * 0.2 = 2 transactions.
# A min_support of 0.3 means an itemset must appear in at least 10 * 0.3 = 3 transactions.
# Let's start with min_support = 0.2, as it's a small dataset and we want to see some initial patterns.

min_support = 0.2



print(f"Frequent Itemsets (min_support={min_support}):")


3.  **Generate Association Rules:**
    * Generate association rules from the frequent itemsets. Choose a `min_threshold` for confidence (e.g., 0.5 or 0.6). Justify your choice briefly.

In [None]:
# Your code here for Task 3
# Justification for min_threshold (confidence):
# A confidence of 0.5 means that in 50% of the cases where the antecedent is present, the consequent is also present.
# For a small dataset, this might reveal some initial interesting rules without being too restrictive.

min_confidence = 0.5

print(f"Association Rules (min_confidence={min_confidence}):")
rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

4.  **Identify and Interpret Interesting Rules:**
    * From the generated rules, select 2-3 rules that you find interesting.
    * For each selected rule, explain what it means in the context of the grocery data. Discuss its support, confidence, and lift.

*(Double-click here to edit and write your interpretation for Task 4)*

**Example Interpretation (you will pick your own rules from your results):**

**Rule 1: {Antecedent} -> {Consequent}**
* **Support:** [Value from your table] - This means that [Value]% of all transactions contain both {Antecedent} and {Consequent}.
* **Confidence:** [Value from your table] - This means that [Value]% of the transactions that contain {Antecedent} also contain {Consequent}.
* **Lift:** [Value from your table] - This means that customers who buy {Antecedent} are [Value] times as likely to buy {Consequent} as would be expected if the purchases were independent. (Interpret if >1, <1, or =1).
* **Interestingness:** Why do you find this rule interesting in the context of grocery shopping?

**Rule 2: ...**


5.  **Experiment (Optional but Recommended):**
    * Try changing your `min_support` and `min_threshold` (confidence) values. For example, make `min_support` lower or higher, and `min_threshold` for confidence lower or higher.
    * How does this affect the number and type of rules generated? Briefly describe your observations.

In [None]:
# Your code here for Task 5 (Optional)

# Example: Lowering min_support further and keeping confidence moderate
min_support_v2 = 0.1 # At least 1 transaction


print(f"\n--- Experiment: min_support={min_support_v2}, min_confidence=0.5 ---")
print("Frequent Itemsets:")
print(frequent_itemsets_v2)
print("\nAssociation Rules:")
print(rules_student_v2[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

# Example: Using original min_support (0.2) but increasing confidence
min_confidence_v3 = 0.8


print(f"\n--- Experiment: min_support={min_support}, min_confidence={min_confidence_v3} ---")
print("Frequent Itemsets (same as Task 2):")
print(frequent_itemsets)
print("\nAssociation Rules:")
print(rules_v3[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

*(Double-click here to edit and write your observations for Task 5)*

**Student's Observations for Task 5:**

* **Lowering `min_support` (e.g., to 0.1 from 0.2) while keeping `min_confidence` constant (e.g., at 0.5):**
    
    
* **Increasing `min_confidence` (e.g., to 0.8 from 0.5) while keeping `min_support` constant (e.g., at 0.2):**
    
    

## 6. Conclusion

In this lab, we explored association rule mining using the Apriori algorithm. We learned how to:
* Prepare data for association rule mining.
* Use the `mlxtend` library to find frequent itemsets and generate rules.
* Interpret key metrics like support, confidence, and lift.
* Understand how parameters like `min_support` and `min_threshold` (for confidence) influence the outcome.

Association rule mining is a powerful tool for uncovering hidden patterns in transactional data, with applications in retail, e-commerce, healthcare, and more.