In [51]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import OneHotEncoder

!pip install mlxtend==0.23.1



# Association Rule for Store Dataset

In this case study, we will explore how association rules can be used to analyze items that are usually purchased together.

You can refer to these articles to learn about Apriori and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [60]:
# load the data set and show the first five transactions
df = pd.read_csv("https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv")
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


# Get the set of products that have been purchased


In [30]:
print(df['1'].unique())

['Wine' 'Cheese' 'Meat' 'Pencil' 'Bread' 'Diaper' 'Eggs' nan 'Bagel'
 'Milk']
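
The cell above inspects only column `1`. A minimal sketch of collecting the distinct products across every column, using a small hypothetical stand-in for `df` (the toy rows are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the transactions DataFrame (NaN pads short baskets)
toy_df = pd.DataFrame(
    [["Bread", "Wine", "Eggs"],
     ["Cheese", "Milk", np.nan],
     ["Wine", np.nan, np.nan]]
)

# Flatten every cell into one Series, drop the NaN padding, then deduplicate
all_products = sorted(pd.Series(toy_df.to_numpy().ravel()).dropna().unique())
print(all_products)
```

On the real data the same expression could be applied to `df` instead of `toy_df`.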


## Preprocess Data

In this step, we will transform our dataset so that we will have a one hot encoding based on the purchased products.

In [64]:
#create an itemset based on the products
all_items = df.values.flatten()

unique_items = set(all_items)

presence_map = {item: 0 for item in unique_items}

first_row_items = df.iloc[0].values
for item in first_row_items:
    presence_map[item] = 1

# encoding the feature

# Convert the data to a list of items per customer (NaN padding is kept here and dropped later)
reshaped_data = df.values.tolist()

# Flatten the list of items
flat_items = [item for sublist in reshaped_data for item in sublist]

# Reshape the data to a 2D array where each item is a row
flat_items_array = np.array(flat_items).reshape(-1, 1)

# Initialize OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the data
encoded_data = encoder.fit_transform(flat_items_array)

# Convert the encoded data to integer (1 and 0)
encoded_data = encoded_data.astype(int)

# Create a DataFrame with item names as columns
encoded_df = pd.DataFrame(encoded_data, columns=encoder.categories_[0])

# Now, create a customer ID list for the rows to map the one-hot encoding back to the original customers
customer_ids = []
for i, row in enumerate(reshaped_data):
    customer_ids.extend([i] * len(row))

# Add the customer IDs to the DataFrame
encoded_df['customer_id'] = customer_ids

# Aggregate by customer to get one row of item counts per customer
final_df = encoded_df.groupby('customer_id').sum()

presence_map

{'Milk': 0,
 nan: 0,
 'Meat': 1,
 'Wine': 1,
 'Pencil': 1,
 'Diaper': 1,
 'Bagel': 0,
 'Cheese': 1,
 'Bread': 1,
 'Eggs': 1}

In [65]:
# create a new dataframe from the encoded features
transformed_df = final_df

# show the new dataframe
transformed_df.head()

customer_id,Bagel,Bread,Cheese,Diaper,Eggs,Meat,Milk,Pencil,Wine,nan
0,0,1,1,1,1,1,0,1,1,0
1,0,1,1,1,0,1,1,1,1,0
2,0,0,1,0,1,1,1,0,1,2
3,0,0,1,0,1,1,1,0,1,2
4,0,0,0,0,0,1,0,1,1,4


In [66]:
# The encoded dataframe contains a column for the empty (NaN) cells, so we drop the "nan" column (selecting columns by index also works).

transformed_df = transformed_df.drop(columns=["nan"])

transformed_df.head()

customer_id,Bagel,Bread,Cheese,Diaper,Eggs,Meat,Milk,Pencil,Wine
0,0,1,1,1,1,1,0,1,1
1,0,1,1,1,0,1,1,1,1
2,0,0,1,0,1,1,1,0,1
3,0,0,1,0,1,1,1,0,1
4,0,0,0,0,0,1,0,1,1


Because the encoded DataFrame contains a column for the empty (NaN) cells, we drop the `nan` column. Selecting every column except `nan` by position works as well.
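
The `nan` column can also be avoided by removing the missing values before encoding. The sketch below is one possible alternative, not the approach used in this notebook. It assumes a small hypothetical `toy_df` in the same wide layout and builds a boolean one-hot table with plain pandas; mlxtend's `apriori` accepts boolean DataFrames.

```python
import numpy as np
import pandas as pd

# Hypothetical transactions in the same wide layout as the store dataset
toy_df = pd.DataFrame(
    [["Bread", "Wine", "Eggs"],
     ["Cheese", "Wine", np.nan],
     ["Wine", np.nan, np.nan]]
)

# One set of purchased items per transaction, with the NaN padding removed
baskets = [set(row.dropna()) for _, row in toy_df.iterrows()]

# Sorted list of every distinct item, used as the column order
items = sorted(set().union(*baskets))

# True where the basket contains the item, False otherwise
onehot = pd.DataFrame(
    [[item in basket for item in items] for basket in baskets],
    columns=items,
)
print(onehot)
```

Because each basket is a set, an item repeated within one transaction still produces a single `True` rather than a count greater than 1.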

## Apriori Algorithm

We will use the Apriori algorithm to determine the frequently purchased products.
For this case study, we will use min_support=0.2.
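
To make the role of `min_support` concrete, here is a minimal sketch of computing the support of an itemset by hand. The `onehot` table, the `support` helper, and the threshold of 0.5 are hypothetical and chosen only for the toy data; this illustrates the support calculation, not mlxtend's internal implementation.

```python
import pandas as pd

# Hypothetical one-hot table: rows are transactions, columns are items
onehot = pd.DataFrame(
    {"Bagel": [1, 1, 0, 1, 0],
     "Bread": [1, 0, 1, 1, 0],
     "Milk":  [0, 1, 1, 1, 1]}
).astype(bool)

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return onehot[list(itemset)].all(axis=1).mean()

min_support = 0.5  # illustrative threshold for the toy data
for itemset in [("Bagel",), ("Bread",), ("Bagel", "Bread"), ("Bagel", "Milk")]:
    s = support(itemset)
    print(itemset, s, "frequent" if s >= min_support else "pruned")
```

Apriori relies on the fact that an itemset can never have higher support than any of its subsets, so once an itemset falls below the threshold none of its supersets need to be checked.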

In [70]:
# Set the threshold value used in the support calculation
from mlxtend.frequent_patterns import apriori, association_rules

apriori(transformed_df, min_support=0.2, use_colnames=True)


Unnamed: 0,support,itemsets
0,0.425397,(Bagel)
1,0.504762,(Bread)
2,0.501587,(Cheese)
3,0.406349,(Diaper)
4,0.438095,(Eggs)
5,0.47619,(Meat)
6,0.501587,(Milk)
7,0.361905,(Pencil)
8,0.438095,(Wine)
9,0.279365,"(Bagel, Bread)"


Then we will generate association rules from the frequent itemsets, using confidence as the metric with a threshold of 0.6.

In [74]:
frequent_itemsets = apriori(transformed_df, min_support=0.2, use_colnames=True)

confidence_threshold = 0.6
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=confidence_threshold)
rules.drop(columns=['zhangs_metric'], inplace=True)
rules


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265
1,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203
2,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754
3,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891
4,(Milk),(Cheese),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
5,(Cheese),(Milk),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
6,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754
7,(Eggs),(Meat),0.438095,0.47619,0.266667,0.608696,1.278261,0.05805,1.338624
8,"(Meat, Cheese)",(Eggs),0.32381,0.438095,0.215873,0.666667,1.521739,0.074014,1.685714
9,"(Meat, Eggs)",(Cheese),0.266667,0.501587,0.215873,0.809524,1.613924,0.082116,2.616667
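
A rules table like this is often easier to read once it is filtered and ranked. The sketch below rebuilds three rows of the table by hand (the `rules_sample` name and the lift cut-off of 1.3 are illustrative assumptions) and sorts them by lift; the same pandas calls could be applied to `rules` directly.

```python
import pandas as pd

# A few rows copied from the rules table above (values as displayed)
rules_sample = pd.DataFrame(
    {"antecedents": [frozenset({"Bagel"}), frozenset({"Milk"}),
                     frozenset({"Meat", "Eggs"})],
     "consequents": [frozenset({"Bread"}), frozenset({"Cheese"}),
                     frozenset({"Cheese"})],
     "confidence": [0.656716, 0.607595, 0.809524],
     "lift": [1.301042, 1.211344, 1.613924]}
)

# Keep rules above an illustrative lift cut-off, strongest first
strong = (rules_sample[rules_sample["lift"] >= 1.3]
          .sort_values("lift", ascending=False)
          .reset_index(drop=True))
print(strong)
```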


Explain __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__, and __conviction__, and interpret the results from the case above (please use a text section).


1. **Antecedent Support**:
Definition: This is the proportion of transactions in which the antecedent (left-hand side of the rule) occurs. It shows how often the items on the left side of the rule appear in the dataset.
Interpretation: For example, for the rule (Bagel) → (Bread), the antecedent support (Bagel) is 0.425397, meaning Bagels appear in about 42.5% of the transactions.
Kaggle Insight: A higher antecedent support suggests the antecedent item is commonly bought by customers, increasing the potential for finding strong association rules. In practice, items with low antecedent support may be overlooked in rule mining, as they appear infrequently.

2. **Consequent Support**:
Definition: This is the proportion of transactions in which the consequent (right-hand side of the rule) occurs. It shows how often the items on the right side of the rule appear in the dataset.
Interpretation: For the same rule (Bagel) → (Bread), the consequent support (Bread) is 0.504762, meaning Bread appears in about 50.5% of the transactions.
Kaggle Insight: A high consequent support implies that the consequent item is popular among customers. This could influence the relevance of rules, as items with a high consequent support might be involved in multiple association rules.

3. **Support**:
Definition: The support of an itemset (antecedent and consequent together) is the proportion of transactions that contain both the antecedent and consequent. It indicates the frequency of the occurrence of both items in the transactions.
Interpretation: In the rule (Bagel) → (Bread), the support is 0.279365, meaning that both Bagels and Bread appear together in 27.9% of the transactions.
Kaggle Insight: Higher support values mean the itemset is more frequent, not that the association is stronger. Support is also used for rule pruning: rules with low support rest on few transactions and may not be useful for decision-making, so they are typically discarded.

4. **Confidence**:
Definition: Confidence is the likelihood that the consequent occurs when the antecedent occurs. It's a measure of the reliability of the rule.
Interpretation: In the rule (Bagel) → (Bread), the confidence is 0.656716, which means that if a customer buys a Bagel, there's a 65.7% chance they will also buy Bread.
Kaggle Insight: Confidence helps assess the strength of an association. High confidence means the antecedent reliably predicts the consequent. However, confidence can be high simply because the consequent is popular on its own, which is where lift comes in.

5. **Lift**:
Definition: Lift is the ratio of the observed support to the expected support if the antecedent and consequent were independent. A lift value greater than 1 indicates that the antecedent and consequent are positively correlated (i.e., they occur together more often than expected by chance).
Interpretation: In the rule (Bagel) → (Bread), the lift is 1.301042. This means that the occurrence of Bagels increases the likelihood of purchasing Bread by about 30% compared to if they were independent.
Kaggle Insight: Lift is a key metric to identify interesting rules. A lift greater than 1 indicates a positive correlation, suggesting that the items are likely bought together more often than by random chance, while a lift of less than 1 suggests a negative association (i.e., the items are often bought separately).

6. **Leverage**:
Definition: Leverage is the difference between the observed support and the expected support under independence. It gives an idea of how much more frequently the items appear together than would be expected by chance.
Interpretation: For the rule (Bagel) → (Bread), the leverage is 0.064641. This indicates that Bagels and Bread appear together in about 6.5 percentage points more of the transactions than would be expected if they were independent.
Kaggle Insight: Leverage complements lift by quantifying the absolute difference in frequency between the observed and expected support. If the leverage is close to 0, it suggests a weak or no relationship between the items.

7. **Conviction**:
Definition: Conviction compares how often the rule would be wrong if the antecedent and consequent were independent with how often it is actually wrong: conviction = (1 − consequent support) / (1 − confidence). A value of 1 indicates independence, and higher values indicate a stronger rule.
Interpretation: In the rule (Bagel) → (Bread), the conviction is 1.442650, meaning that the rule would make incorrect predictions (Bagel bought without Bread) about 1.44 times as often if Bagels and Bread were independent.
Kaggle Insight: Conviction provides a different perspective on the strength of association, especially when dealing with imbalanced data where other metrics like confidence may be misleading. A high conviction score suggests a strong dependence between the antecedent and consequent.
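
As a check on the definitions above, the derived metrics for (Bagel) → (Bread) can be recomputed from the three support values in the rules table. This is a small arithmetic sketch using the standard formulas for these metrics; the variable names are illustrative.

```python
# Supports for (Bagel) -> (Bread), copied from the rules table
antecedent_support = 0.425397   # share of transactions containing Bagel
consequent_support = 0.504762   # share of transactions containing Bread
support = 0.279365              # share containing both Bagel and Bread

# Confidence: how often Bread appears among transactions with Bagel
confidence = support / antecedent_support

# Lift: confidence relative to the baseline popularity of Bread
lift = confidence / consequent_support

# Leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support

# Conviction: expected error rate under independence over the observed error rate
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.4f} "
      f"leverage={leverage:.4f} conviction={conviction:.4f}")
```

The results should agree with the first row of the rules table up to rounding of the inputs.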

**Reference**
- https://chatgpt.com/share/67482e52-c900-800f-93b9-aea60f45f945
- https://www.kaggle.com/code/evrenermis/association-rule-based-learning-explained
