In [2]:
pip install mlxtend==0.23.1

Defaulting to user installation because normal site-packages is not writeable
Collecting mlxtend==0.23.1
  Downloading mlxtend-0.23.1-py3-none-any.whl (1.4 MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.23.1
You should consider upgrading via the '/Applications/Xcode.app/Contents/Developer/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from mlxtend.preprocessing import TransactionEncoder

# Association Rule for Store Dataset

In this case study, we will explore how association rules can be used to analyze which items are usually purchased together.

You can refer to these articles to learn about Apriori and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [5]:
# load the dataset and show the first five transactions
df = pd.read_csv(r"https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv")
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


# Get the set of products that have been purchased


In [6]:
# collect every distinct cell value; empty cells in shorter transactions appear as NaN
uniqueitemsset = set(np.ravel(df))
for i in df:
  uniqueitemsset.update(df[i].unique())
print(uniqueitemsset)

{nan, 'Eggs', 'Bread', 'Diaper', 'Meat', 'Bagel', 'Milk', 'Cheese', 'Wine', 'Pencil'}
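
The printed set contains `nan` because shorter transactions are padded with empty cells. A minimal sketch of collecting only the real product names is shown below. It assumes a small stand-in frame called `sample` (a hypothetical name, laid out like `df`) rather than the store data itself.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: one row per transaction, short rows padded with NaN.
sample = pd.DataFrame([
    ["Bread", "Wine", "Eggs"],
    ["Cheese", "Meat", np.nan],
    ["Meat", np.nan, np.nan],
])

# Flatten all cells and keep only the values that are not missing.
unique_items = {item for item in sample.to_numpy().ravel() if pd.notna(item)}
print(sorted(unique_items))
```

Filtering at this stage means no NaN key is created later during encoding.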


## Preprocess Data

In this step, we will transform the dataset into a one-hot encoding of the purchased products: one column per product and one row per transaction.

In [7]:
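# create an itemset based on the products
itemset = set(uniqueitemsset)

# encode each transaction as a dict of product -> 0/1
encodedValue = []
for index, row in df.iterrows():
    rowset = set(row)
    items = {}
    uncommons = list(itemset - rowset)            # products not in this transaction
    commons = list(itemset.intersection(rowset))  # products in this transaction (NaN too, if the row is padded)
    for i in uncommons:
        items[i] = 0
    for j in commons:
        items[j] = 1
    encodedValue.append(items)

# display the encoding of the last transaction
items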

{'Pencil': 0,
 'Diaper': 0,
 'Cheese': 0,
 'Milk': 0,
 nan: 1,
 'Eggs': 1,
 'Bread': 1,
 'Meat': 1,
 'Bagel': 1,
 'Wine': 1}
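
The same encoding can be sketched more compactly by first turning each row into a list of products without the NaN padding. The frame `sample` and the names `transactions`, `products` and `encoded` are assumptions made for this illustration. mlxtend's `TransactionEncoder`, imported at the top of the notebook, is designed for this kind of list-of-lists input, but the sketch below uses pandas only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: one row per transaction, short rows padded with NaN.
sample = pd.DataFrame([
    ["Bread", "Wine", "Eggs"],
    ["Cheese", "Meat", np.nan],
    ["Meat", np.nan, np.nan],
])

# One list of products per transaction, with the NaN padding removed.
transactions = [[item for item in row if pd.notna(item)] for row in sample.to_numpy()]

# Sorted product names give a stable, reproducible column order.
products = sorted({item for t in transactions for item in t})

# True where the transaction contains the product, False otherwise.
encoded = pd.DataFrame(
    [[product in t for product in products] for t in transactions],
    columns=products,
)
print(encoded)
```

Because the padding is removed before encoding, the result has no NaN column to drop afterwards.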

In [8]:
# create new dataframe from the encoded features
newdf = pd.DataFrame(encodedValue)
# show the new dataframe
newdf.head()

Unnamed: 0,NaN,Bagel,Milk,Eggs,Bread,Diaper,Meat,Cheese,Wine,Pencil
0,0,0,0,1,1,1,1,1,1,1
1,0,0,1,0,1,1,1,1,1,1
2,1,0,1,1,0,0,1,1,1,0
3,1,0,1,1,0,0,1,1,1,0
4,1,0,0,0,0,0,1,0,1,1


In [9]:
# The encoded dataframe contains an empty (NaN) column; drop a column by position.
# Caution: position 1 is Bagel, the NaN column is at position 0.
newdf = newdf.drop(newdf.columns[1], axis=1)
newdf.head()

Unnamed: 0,NaN,Milk,Eggs,Bread,Diaper,Meat,Cheese,Wine,Pencil
0,0,0,1,1,1,1,1,1,1
1,0,1,0,1,1,1,1,1,1
2,1,1,1,0,0,1,1,1,0
3,1,1,1,0,0,1,1,1,0
4,1,0,0,0,0,1,0,1,1


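Because the encoded dataframe contains a column for the empty cells (NaN), that column should be dropped, either by its label or by selecting all columns other than the first. Note that the cell above drops `newdf.columns[1]`, which is `Bagel`. The NaN column is at position 0, so it is still present in the output and shows up as an item in the results below.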
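
One way to avoid depending on column positions is to select columns by whether their label is missing. The sketch below assumes a tiny hypothetical frame `encoded` whose leading columns mirror those of `newdf`; it is an illustration, not a rerun of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical encoded frame with the same leading columns as newdf.
encoded = pd.DataFrame(
    [[0, 0, 1], [1, 0, 1], [1, 1, 0]],
    columns=[np.nan, "Bagel", "Milk"],
)

# Keep every column whose label is not missing, wherever that column sits.
cleaned = encoded.loc[:, encoded.columns.notna()]
print(list(cleaned.columns))
```

Selecting by label keeps `Bagel` and removes only the NaN column, regardless of the order in which the columns were created.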

## Apriori Algorithm

We will use the Apriori algorithm to find the frequently purchased products.
For this case study, we will use min_support=0.2, meaning an itemset is kept only if it appears in at least 20% of the transactions.
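
To make the role of `min_support` concrete, here is a minimal pure-Python sketch of a level-wise Apriori search. It is an illustration under simplifying assumptions, not how mlxtend implements `apriori`. The function name `frequent_itemsets` and the toy transactions are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support.

    Hypothetical helper for illustration only.
    """
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items.
    current = {}
    for item in {item for basket in baskets for item in basket}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    result = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        result.update(current)
        k += 1
    return result


# Toy transactions (made up, not the store dataset).
toy = [
    ["Bread", "Milk"],
    ["Bread", "Meat"],
    ["Bread", "Milk", "Meat"],
    ["Milk"],
    ["Bread", "Milk"],
]
frequent = frequent_itemsets(toy, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what gives Apriori its efficiency: an itemset cannot be frequent if any of its subsets is infrequent, so `{Bread, Milk, Meat}` is discarded without counting because `{Milk, Meat}` falls below the threshold.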

In [12]:
# Set the threshold value used in the support calculation
from mlxtend.frequent_patterns import apriori, association_rules

freq = apriori(newdf, min_support=0.2, use_colnames=True)
freq.head()



Unnamed: 0,support,itemsets
0,0.869841,(nan)
1,0.501587,(Milk)
2,0.438095,(Eggs)
3,0.504762,(Bread)
4,0.406349,(Diaper)


Then we will generate association rules from the frequent itemsets, keeping only the rules whose confidence is at least 0.6.

In [13]:
associationrule = association_rules(freq, metric="confidence", min_threshold=0.6)
associationrule.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Milk),(nan),0.501587,0.869841,0.409524,0.816456,0.938626,-0.026778,0.709141,-0.115976
1,(Eggs),(nan),0.438095,0.869841,0.336508,0.768116,0.883053,-0.044565,0.56131,-0.190735
2,(Bread),(nan),0.504762,0.869841,0.396825,0.786164,0.903801,-0.042237,0.608683,-0.176903
3,(Diaper),(nan),0.406349,0.869841,0.31746,0.78125,0.898152,-0.035999,0.595011,-0.160381
4,(Meat),(nan),0.47619,0.869841,0.368254,0.773333,0.889051,-0.045956,0.57423,-0.192405
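
The rule generation step can be sketched in plain Python: each frequent itemset with at least two items is split into an antecedent and a consequent, and the rule is kept when its confidence reaches the threshold. The `supports` dictionary holds made-up toy values and `rules_by_confidence` is a hypothetical helper; neither reflects mlxtend's internals.

```python
from itertools import combinations

# Hypothetical supports of frequent itemsets (toy values, not from the store data).
supports = {
    frozenset(["Bread"]): 0.8,
    frozenset(["Milk"]): 0.8,
    frozenset(["Meat"]): 0.4,
    frozenset(["Bread", "Milk"]): 0.6,
    frozenset(["Bread", "Meat"]): 0.4,
}


def rules_by_confidence(supports, min_confidence):
    """Return (antecedent, consequent, confidence) for rules meeting the threshold."""
    rules = []
    for itemset, itemset_support in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                consequent = itemset - antecedent
                # confidence = support(antecedent and consequent) / support(antecedent)
                confidence = itemset_support / supports[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules


rules = rules_by_confidence(supports, min_confidence=0.6)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 3))
```

Confidence is not symmetric: in the toy data `Meat -> Bread` passes the threshold while `Bread -> Meat` does not, because the two rules divide the same joint support by different antecedent supports.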


Provide an explanation of __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__, __conviction__ and the interpretation of the case above (please use a text section)

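- Antecedent support: The fraction of transactions that contain the antecedent (the "if" part of the rule). Higher values mean the antecedent is a common item or itemset.

- Consequent support: The fraction of transactions that contain the consequent (the "then" part of the rule). Higher values mean the consequent is a common item or itemset.

- Support: The fraction of transactions that contain both the antecedent and the consequent. Higher support means the rule applies to more transactions; on its own it does not say how strongly the two sides are associated.

- Confidence: Support divided by antecedent support, an estimate of the probability of the consequent given the antecedent. Higher confidence means the consequent appears more reliably when the antecedent is present.

- Lift: Confidence divided by consequent support. It measures how much more likely the consequent is given the antecedent than it is overall. A lift of 1 means independence, a lift > 1 indicates a positive association, and a lift < 1 indicates a negative association.

- Leverage: Support minus the product of antecedent support and consequent support, that is, the observed co-occurrence minus the co-occurrence expected under independence. Zero means independence and positive values indicate a positive association.

- Conviction: (1 - consequent support) divided by (1 - confidence). It compares how often the rule would be wrong under independence with how often it is actually wrong. A value of 1 means independence and higher values indicate a stronger dependency of the consequent on the antecedent.

Interpretation: in the five rules shown above, every consequent is `nan`, which is the placeholder for empty cells and not a product. For example, Milk -> nan has a confidence of 0.816, but `nan` alone has a support of 0.870, so the lift is 0.939. All five rules have a lift below 1, negative leverage and conviction below 1, so none of them describes a useful purchase association.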
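
As a check on these definitions, the metrics of the first rule in the table (Milk -> nan) can be recomputed from its three support values. The helper `rule_metrics` is a hypothetical name used for this sketch. The formulas are the standard definitions; small differences from the table in the last decimal place come from using the rounded supports as input.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Compute rule metrics from the three support values (illustrative helper)."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is infinite when the rule is never wrong (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Support values copied from row 0 of the rules table above (Milk -> nan).
metrics = rule_metrics(
    antecedent_support=0.501587,
    consequent_support=0.869841,
    support=0.409524,
)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

The recomputed values agree with the table to within rounding, which confirms that lift, leverage and conviction are all derived from the three support columns.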