In [23]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import OneHotEncoder

!pip install mlxtend==0.23.1



# Association Rule for Store Dataset

In this case study, we explore how association rules can be used to analyze which items are usually purchased together.

You can refer to these articles to learn about Apriori and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a retail store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [44]:
# load the dataset and show the first five transactions
url = r'https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


# Get the set of products that have been purchased


In [45]:
purchased_product = set(np.ravel(df))
print(purchased_product)

{'Eggs', 'Meat', 'Bread', 'Pencil', 'Cheese', 'Milk', 'Bagel', 'Wine', nan, 'Diaper'}
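The set above still contains `nan`, because shorter transactions are padded with empty cells. A minimal sketch of collecting only the real product names, assuming a small hypothetical DataFrame `toy` that stands in for `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: two transactions, one padded with NaN.
toy = pd.DataFrame([["Bread", "Wine", np.nan], ["Milk", "Bread", "Eggs"]])

# Flatten every cell into one array and keep only the non-missing values.
products = {p for p in toy.to_numpy().ravel() if pd.notna(p)}
print(products)
```

Filtering here means no NaN column has to be removed later in the pipeline.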


## Preprocess Data

In this step, we transform the dataset into a one-hot encoding of the purchased products: one row per transaction and one column per product.

In [46]:
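# build the itemset from the purchased products (note: it still contains NaN)
itemset = set(purchased_product)

# one-hot encode each transaction: 1 if the product is in the row, 0 otherwise
encode = []
for index, row in df.iterrows():
    rowset = set(row)
    labels = {}
    uncommons = list(itemset-rowset)  # products missing from this transaction
    commons = list(itemset.intersection(rowset))  # products present in this transaction
    for i in uncommons:
        labels[i] = 0
    for j in commons:
        labels[j] = 1
    encode.append(labels)

# show the encoding of the last transaction only
print(labels)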

{'Milk': 0, 'Cheese': 0, 'Diaper': 0, 'Pencil': 0, 'Eggs': 1, 'Meat': 1, 'Bread': 1, 'Bagel': 1, 'Wine': 1, nan: 1}


In [47]:
# create a new dataframe from the encoded features
encode_df = pd.DataFrame(encode)
# show the new dataframe
encode_df.head()

Unnamed: 0,Milk,Bagel,NaN,Eggs,Meat,Bread,Pencil,Cheese,Wine,Diaper
0,0,0,0,1,1,1,1,1,1,1
1,1,0,0,0,1,1,1,1,1,1
2,1,0,1,1,1,0,0,1,1,0
3,1,0,1,1,1,0,0,1,1,0
4,0,0,1,0,1,0,1,0,1,0
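The encoding step can also be written so that missing cells never become a column. The sketch below is one possible alternative, not the method used in this notebook; `toy`, `transactions`, `items` and `encoded` are hypothetical names, and the data is a two-row stand-in for `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: two transactions, one padded with NaN.
toy = pd.DataFrame([["Bread", "Wine", np.nan], ["Milk", "Bread", "Eggs"]])

# Turn each row into a set of products, dropping the NaN padding first.
transactions = [set(row.dropna()) for _, row in toy.iterrows()]

# Sorted list of all products gives a stable column order across runs.
items = sorted(set().union(*transactions))

# One boolean column per product: True if the transaction contains it.
encoded = pd.DataFrame(
    [{item: item in t for item in items} for t in transactions]
)
print(encoded)
```

Boolean columns are the input type that mlxtend's `apriori` expects, and sorting the product names keeps the column order the same on every run.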


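Because empty cells were read as NaN and encoded like a product, the encoded DataFrame contains a NaN column. We drop that column by position (index 2 in the output above). The position comes from an unordered set and can differ between runs, so check it before dropping.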

In [50]:
encode_df = encode_df.drop(encode_df.columns[2], axis=1)
encode_df

Unnamed: 0,Milk,Bagel,Eggs,Meat,Bread,Pencil,Cheese,Wine,Diaper
0,0,0,1,1,1,1,1,1,1
1,1,0,0,1,1,1,1,1,1
2,1,0,1,1,0,0,1,1,0
3,1,0,1,1,0,0,1,1,0
4,0,0,0,1,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...
310,0,0,1,0,1,0,1,0,0
311,1,0,0,1,0,1,0,0,0
312,0,0,1,1,1,1,1,1,1
313,0,0,0,1,0,0,1,0,0
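Dropping by position depends on where the NaN column happens to land. A minimal sketch of dropping it by label instead, assuming a hypothetical `toy_encoded` frame shaped like `encode_df`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for encode_df, with a NaN column in the middle.
toy_encoded = pd.DataFrame(
    [[1, 0, 1], [0, 1, 1]],
    columns=["Milk", np.nan, "Eggs"],
)

# Keep only the columns whose label is not missing.
cleaned = toy_encoded.loc[:, toy_encoded.columns.notna()]
print(cleaned)
```

This selects columns by whether their label is missing, so it gives the same result wherever the NaN column appears.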


## Apriori Algorithm

We will use the Apriori algorithm to find the frequently purchased products.
For this case study, we will use min_support=0.2, so an itemset is kept only if it appears in at least 20% of the transactions.

In [51]:
# Set the threshold value used for the support calculation
from mlxtend.frequent_patterns import apriori, association_rules
freqPurchasedProd = apriori(encode_df, min_support=0.2, use_colnames=True)
freqPurchasedProd.head(33)



Unnamed: 0,support,itemsets
0,0.501587,(Milk)
1,0.425397,(Bagel)
2,0.438095,(Eggs)
3,0.47619,(Meat)
4,0.504762,(Bread)
5,0.361905,(Pencil)
6,0.501587,(Cheese)
7,0.438095,(Wine)
8,0.406349,(Diaper)
9,0.225397,"(Milk, Bagel)"
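To make the idea behind the table concrete, the level-wise search can be sketched in plain Python. This is an illustrative sketch on hypothetical toy transactions, not mlxtend's implementation; the function name `frequent_itemsets` is made up for this example:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search: grow itemsets one item at a time, keeping frequent ones."""
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    singles = {frozenset([i]) for t in transactions for i in t}
    current = {s: support(s) for s in singles}
    current = {s: v for s, v in current.items() if v >= min_support}
    result = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        current = {c: support(c) for c in candidates}
        current = {c: v for c, v in current.items() if v >= min_support}
        result.update(current)
        k += 1
    return result

# Hypothetical toy data: four transactions.
toy_transactions = [
    {"Milk", "Cheese"},
    {"Milk", "Cheese", "Meat"},
    {"Milk"},
    {"Meat"},
]
found = frequent_itemsets(toy_transactions, min_support=0.5)
for itemset, sup in found.items():
    print(sorted(itemset), sup)
```

The pruning step is what lets the search skip larger itemsets once a smaller subset is known to be infrequent.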


Then we will generate association rules from the frequent itemsets, keeping only the rules whose confidence is at least 0.6 (min_threshold=0.6).

In [52]:
assRules = association_rules(freqPurchasedProd, metric="confidence", min_threshold=0.6)
assRules.head(14)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Milk),(Cheese),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148,0.350053
1,(Cheese),(Milk),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148,0.350053
2,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265,0.402687
3,(Eggs),(Meat),0.438095,0.47619,0.266667,0.608696,1.278261,0.05805,1.338624,0.387409
4,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203,0.469167
5,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754,0.500891
6,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891,0.526414
7,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754,0.330409
8,"(Milk, Meat)",(Cheese),0.244444,0.501587,0.203175,0.831169,1.657077,0.080564,2.952137,0.524816
9,"(Milk, Cheese)",(Meat),0.304762,0.47619,0.203175,0.666667,1.4,0.05805,1.571429,0.410959


Provide an explanation of __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__ and __conviction__, and interpret the results of the case above (please use a text section).

1. Antecedent support
- Refers to the support of the item or itemset that appears on the left side of the rule (the antecedent). It measures how frequently the antecedent occurs in the dataset.
- Formula : support(A)=proportion of transactions containing A

2. Consequent support
- Refers to the support of the item or itemset that appears on the right side of the rule (the consequent). It measures how frequently the consequent occurs in the dataset.
- Formula : support(C)=proportion of transactions containing C

3. Support
- Measures how frequently the antecedent and the consequent appear together, i.e. the proportion of transactions that contain every item of both A and C.
- Formula :support(A→C)=support(A∪C)

4. Confidence
- Measure of the likelihood that the consequent occurs given that the antecedent has occurred. It indicates the strength of the association rule.
- Formula : confidence(A→C)=support(A→C)/support(A)

5. Lift
- Measures how much more likely the consequent is to occur given that the antecedent has occurred, compared to the likelihood of the consequent occurring independently. A lift greater than 1 indicates a positive association.
- Formula : lift(A→C)=confidence(A→C)/support(C)

6. Leverage
- Measures the difference between the observed frequency of the itemset and the expected frequency if the two items were independent. It helps to understand the strength of the association.
- Formula : leverage(A→C)=support(A→C)−support(A)×support(C)

7. Conviction
- A measure of the degree of implication of the rule. It considers the confidence of the rule and the probability of the consequent occurring without the antecedent. A conviction of 1 means the antecedent and consequent are independent; larger values mean the consequent depends more strongly on the antecedent.
- Formula : conviction(A→C)=(1−support(C))/(1−confidence(A→C))
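As a worked check of the formulas above, the metrics for the rule Milk → Cheese can be recomputed from the three support values in the first row of the rules table. The variable names are chosen for this example only. The Zhang's metric line uses the definition from the mlxtend documentation and is included only because the table reports that column:

```python
# Values copied from row 0 of the rules table (Milk -> Cheese).
support_a = 0.501587   # antecedent support: Milk
support_c = 0.501587   # consequent support: Cheese
support_ac = 0.304762  # support of the rule: Milk and Cheese together

confidence = support_ac / support_a
lift = confidence / support_c
leverage = support_ac - support_a * support_c
conviction = (1 - support_c) / (1 - confidence)

# Zhang's metric: leverage scaled into the range [-1, 1].
zhang = leverage / max(
    support_ac * (1 - support_a),
    support_a * (support_c - support_ac),
)

print(f"confidence={confidence:.6f} lift={lift:.6f}")
print(f"leverage={leverage:.6f} conviction={conviction:.6f} zhang={zhang:.6f}")
```

Interpreting the result: about 61% of the transactions that contain Milk also contain Cheese, and a lift of roughly 1.21 means Milk buyers are about 21% more likely to buy Cheese than a random customer, a positive but moderate association.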
