In [31]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

! pip install mlxtend



# Association Rule for Store Dataset

In this case study, we will explore how association rules can be used to analyze which items are usually purchased together.

You can refer to these articles to learn more about Apriori and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [54]:
# load the dataset and show the first five transactions
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

url = "https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv"
df = pd.read_csv(url)

df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


# Get the set of products that have been purchased

Get the unique products that appear in the transactions. Empty slots in shorter transactions show up as `nan`.

In [55]:
unique_products = set(df.values.flatten())
print(unique_products)

{nan, 'Milk', 'Cheese', 'Diaper', 'Eggs', 'Bagel', 'Bread', 'Meat', 'Pencil', 'Wine'}


## Preprocess Data

In this step, we will transform our dataset so that we will have a one hot encoding based on the purchased products.
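
As a rough illustration of what one-hot encoding means here, the sketch below encodes a few made-up transactions in plain Python. The baskets and the names `transactions`, `products` and `one_hot` are assumptions for this example only and are not taken from the store dataset.

```python
# Toy transactions, invented for illustration (not from the store dataset).
transactions = [
    ["Bread", "Milk"],
    ["Bread", "Eggs", "Milk"],
    ["Eggs"],
]

# Collect every distinct product and fix a column order.
products = sorted({item for basket in transactions for item in basket})

# One row per transaction, one 0/1 flag per product:
# 1 if the product is in the basket, 0 otherwise.
one_hot = [
    {product: int(product in basket) for product in products}
    for basket in transactions
]

for row in one_hot:
    print(row)
```

Each row has the same columns regardless of how many items the basket held, which is the tabular shape that `apriori` expects as input.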

In [56]:
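# create an itemset based on the products
# axis=1 builds one list per transaction (row); the default axis=0 would build one list per column
te = TransactionEncoder()
itemset = te.fit_transform(df.apply(lambda row: row.dropna().tolist(), axis=1).tolist())
df_itemset = pd.DataFrame(itemset, columns=te.columns_)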

# encoding the feature
encoded_vals = []
for index, row in df.iterrows(): 
    labels = {}
    uncommons = list(set(df_itemset) - set(row))
    commons = list(set(df_itemset).intersection(row))
    for uc in uncommons:
        labels[uc] = 0
    for com in commons:
        labels[com] = 1
    encoded_vals.append(labels)

In [59]:
# create new dataframe from the encoded features
one_df = pd.DataFrame(encoded_vals)

# show the new dataframe
one_df.head()

Unnamed: 0,Milk,Bagel,Cheese,Diaper,Eggs,Bread,Meat,Pencil,Wine
0,0,0,1,1,1,1,1,1,1
1,1,0,1,1,0,1,1,1,1
2,1,0,1,0,1,0,1,0,1
3,1,0,1,0,1,0,1,0,1
4,0,0,0,0,0,0,1,1,1


If the encoded dataframe contained an empty (NaN) column, we would drop it or select all columns other than the first one. Here that is not needed, because the missing values were already removed when the itemset was created.

In [60]:
# No NaN column has to be dropped here: dropna() removed the missing values
# before TransactionEncoder was fitted, so the encoded columns contain only real products.

## Apriori Algorithm

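We will use the Apriori algorithm to find the frequently purchased product combinations (frequent itemsets).
For this case study, we will use min_support=0.2, so an itemset is kept only if it appears in at least 20% of the transactions.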
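
To show what a minimum support threshold does, here is a minimal sketch that counts support by brute-force enumeration on made-up baskets. It is not the Apriori algorithm itself (it skips Apriori's candidate pruning), and the helper names `support` and `frequent_itemsets`, the toy baskets and the threshold of 0.5 are assumptions for this example.

```python
from itertools import combinations

# Toy transactions, invented for illustration (not from the store dataset).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Eggs", "Milk"},
    {"Eggs", "Milk"},
    {"Bread", "Eggs"},
    {"Bread", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force search: keep each itemset whose support reaches min_support."""
    products = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(products, size):
            itemset = frozenset(combo)
            s = support(itemset, transactions)
            if s >= min_support:
                result[itemset] = s
    return result

frequent = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in frequent.items():
    print(sorted(itemset), s)
```

Apriori reaches the same result more cheaply by only extending itemsets that are already frequent, since any superset of an infrequent itemset must be infrequent too.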

In [63]:
freq_items = apriori(one_df, min_support = 0.2, use_colnames = True, verbose = 1)
freq_items.head()

Processing 144 combinations | Sampling itemset size 3




Unnamed: 0,support,itemsets
0,0.501587,(Milk)
1,0.425397,(Bagel)
2,0.501587,(Cheese)
3,0.406349,(Diaper)
4,0.438095,(Eggs)


Then, we will generate association rules from the frequent itemsets, keeping only the rules whose confidence is at least 0.6.
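
The idea behind rule generation can be sketched as follows: split each frequent itemset into an antecedent and a consequent, then keep the split if its confidence reaches the threshold. This is a simplified illustration, not the mlxtend implementation; the function `rules_by_confidence` and the support values are hypothetical.

```python
from itertools import combinations

# Hypothetical supports of frequent itemsets (toy numbers, not from the dataset).
supports = {
    frozenset({"Bread"}): 0.8,
    frozenset({"Milk"}): 0.8,
    frozenset({"Eggs"}): 0.6,
    frozenset({"Bread", "Milk"}): 0.6,
}

def rules_by_confidence(supports, min_confidence):
    """Return (antecedent, consequent, confidence) for rules above the threshold."""
    rules = []
    for itemset, itemset_support in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                consequent = itemset - antecedent
                # Every subset of a frequent itemset is frequent as well,
                # so the antecedent's support is expected to be in the table.
                confidence = itemset_support / supports[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules

rules = rules_by_confidence(supports, min_confidence=0.7)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 3))
```

Raising the threshold shrinks the rule list: with these toy numbers, a `min_confidence` of 0.8 would leave no rules at all.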

In [64]:
association_rules(freq_items, metric = "confidence", min_threshold = 0.6)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Milk),(Cheese),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148,0.350053
1,(Cheese),(Milk),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148,0.350053
2,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265,0.402687
3,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203,0.469167
4,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754,0.500891
5,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891,0.526414
6,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754,0.330409
7,(Eggs),(Meat),0.438095,0.47619,0.266667,0.608696,1.278261,0.05805,1.338624,0.387409
8,"(Meat, Milk)",(Cheese),0.244444,0.501587,0.203175,0.831169,1.657077,0.080564,2.952137,0.524816
9,"(Meat, Cheese)",(Milk),0.32381,0.501587,0.203175,0.627451,1.250931,0.040756,1.337845,0.296655


Provide an explanation of __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__ and __conviction__.


In [None]:
Antecedent Support (Antecedent_Sup): The support of the antecedent itemset. It represents the proportion of transactions in the dataset that contain the antecedent.

Consequent Support (Consequent_Sup): The support of the consequent itemset. It represents the proportion of transactions in the dataset that contain the consequent.

Support (Sup): The support of the rule, which is the proportion of transactions in the dataset that contain both the antecedent and the consequent.

Confidence (Conf): The confidence of the rule, which is the conditional probability of the consequent given the antecedent. It is calculated as the support of the rule divided by the support of the antecedent.

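Lift: The lift of the rule measures how much more likely the consequent is, given the antecedent, compared to if they were independent. It is calculated as the confidence divided by the consequent support. A lift greater than 1 indicates that the presence of the antecedent increases the likelihood of the consequent, a lift of 1 indicates independence, and a lift below 1 indicates that the antecedent makes the consequent less likely.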

Leverage: Leverage measures the difference between the observed frequency of the itemset and the frequency expected if the antecedent and consequent were independent. It is calculated as support minus the product of antecedent support and consequent support.

Conviction: Conviction is a measure of how much the consequent relies on the antecedent. It is the ratio of the expected frequency with which the antecedent occurs without the consequent (if the two were independent) to the observed frequency, calculated as (1 - consequent support) / (1 - confidence). A value of 1 indicates independence, and a high conviction value indicates that the consequent is highly dependent on the antecedent.
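
The definitions above can be checked against the rule table with a short sketch. The function `rule_metrics` is a hypothetical helper written for this example; the three input values are the rounded figures from the (Milk) -> (Cheese) row, so the results should agree with that row only up to rounding.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Compute the rule metrics from the three support values."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

# Rounded values from the (Milk) -> (Cheese) row of the rule table.
metrics = rule_metrics(
    antecedent_support=0.501587,
    consequent_support=0.501587,
    support=0.304762,
)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Because confidence, lift, leverage and conviction are all derived from the same three support values, the antecedent support, consequent support and rule support are enough to reproduce the other columns.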