In [17]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

! pip install mlxtend



# Association Rule for Store Dataset

In this case study, we explore how association rules can be used to analyze which items are usually purchased together.

You can refer to these articles to learn about Apriori and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [18]:
# load the dataset and show the first five transactions
df = pd.read_csv('https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv')
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


## Get the Set of Purchased Products

Get the unique products that have been purchased.

In [19]:
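# get the unique products across all item columns, not only column '0'
products = pd.unique(df.to_numpy().ravel())
# drop missing values (shorter transactions are padded with NaN), then sort
products = np.sort(products[pd.notna(products)])
products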

array(['Bagel', 'Bread', 'Cheese', 'Diaper', 'Eggs', 'Meat', 'Milk',
       'Pencil', 'Wine'], dtype=object)

## Preprocess Data

In this step, we transform the dataset into a one-hot encoding of the purchased products.

In [25]:
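# create an itemset based on the products
itemset = set(products)

# encode the features
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
# Caution: fit/transform expect a list of transactions (each a list of items).
# Passing the set of product names makes each name a "transaction" of characters,
# which is why the encoded columns in the output are single letters.
te_ary = te.fit(itemset).transform(itemset)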



In [26]:
# create a new DataFrame from the encoded features
df_encode = pd.DataFrame(te_ary, columns=te.columns_)

# show the new DataFrame
df_encode.head()

Unnamed: 0,B,C,D,E,M,P,W,a,c,d,...,g,h,i,k,l,n,p,r,s,t
0,True,False,False,False,False,False,False,True,False,False,...,True,False,False,False,True,False,False,False,False,False
1,False,False,False,True,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,True,False
2,True,False,False,False,False,False,False,True,False,True,...,False,False,False,False,False,False,False,True,False,False
3,False,False,False,False,True,False,False,False,False,False,...,False,False,True,True,True,False,False,False,False,False
4,False,False,True,False,False,False,False,True,False,False,...,False,False,True,False,False,False,True,True,False,False


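Because the encoded DataFrame is expected to contain an empty (NaN) column, we drop it by selecting every column except the first. Note that in the output above the first column is `B`, not a NaN column, so this step removes an actual encoded column.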

In [22]:
# select all columns other than the first column
df_encode = df_encode.iloc[:,1:]
df_encode.head()

Unnamed: 0,C,D,E,M,P,W,a,c,d,e,g,h,i,k,l,n,p,r,s,t
0,False,False,False,False,False,False,True,False,False,True,True,False,False,False,True,False,False,False,False,False
1,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,True,False,True,True,False,False,False,False,False,False,False,True,False,False
3,False,False,False,True,False,False,False,False,False,False,False,False,True,True,True,False,False,False,False,False
4,False,True,False,False,False,False,True,False,False,True,False,False,True,False,False,False,True,True,False,False


## Apriori Algorithm

We will use the Apriori algorithm to determine the frequently purchased products.
For this case study, we will set min_support=0.2.

In [23]:
from mlxtend.frequent_patterns import apriori

apriori(df_encode, min_support=0.2, use_colnames=True)

Unnamed: 0,support,itemsets
0,0.222222,(M)
1,0.444444,(a)
2,0.777778,(e)
3,0.222222,(g)
4,0.444444,(i)
5,0.333333,(l)
6,0.222222,(n)
7,0.222222,(r)
8,0.222222,(s)
9,0.444444,"(e, a)"
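
To make the role of `min_support` concrete, here is a small pure-Python sketch of support counting. It is not mlxtend's implementation: it brute-forces all 1- and 2-item sets over the five transactions shown by `df.head()`, and the threshold of 0.8 is a hypothetical value chosen only because the sample is so small.

```python
from itertools import combinations

# The five transactions shown by df.head(), with missing values dropped.
transactions = [
    {"Bread", "Wine", "Eggs", "Meat", "Cheese", "Pencil", "Diaper"},
    {"Bread", "Cheese", "Meat", "Diaper", "Wine", "Milk", "Pencil"},
    {"Cheese", "Meat", "Eggs", "Milk", "Wine"},
    {"Cheese", "Meat", "Eggs", "Milk", "Wine"},
    {"Meat", "Pencil", "Wine"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


min_support = 0.8  # hypothetical threshold for this tiny sample
items = sorted(set().union(*transactions))

# Brute-force check of all 1- and 2-item sets. Apriori avoids this by
# pruning candidates that contain an infrequent subset.
frequent = {
    combo: support(combo, transactions)
    for size in (1, 2)
    for combo in combinations(items, size)
    if support(combo, transactions) >= min_support
}

for combo, s in frequent.items():
    print(combo, s)
```

Lowering `min_support` admits more itemsets, which is why the threshold is usually tuned to the size of the dataset.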


Then, we generate association rules from the frequent itemsets, using confidence as the metric with a minimum threshold of 0.6.

In [24]:
# generate association rules with confidence >= 0.6
from mlxtend.frequent_patterns import association_rules

rules = association_rules(apriori(df_encode, min_support=0.2), metric="confidence", min_threshold=0.6)
print(rules)

   antecedents consequents  antecedent support  consequent support   support  \
0          (6)         (9)            0.444444            0.777778  0.444444   
1         (17)         (6)            0.222222            0.444444  0.222222   
2         (12)         (9)            0.444444            0.777778  0.333333   
3         (14)         (9)            0.333333            0.777778  0.222222   
4         (15)         (9)            0.222222            0.777778  0.222222   
5         (17)         (9)            0.222222            0.777778  0.222222   
6         (14)        (12)            0.333333            0.444444  0.222222   
7         (15)        (12)            0.222222            0.444444  0.222222   
8      (9, 17)         (6)            0.222222            0.444444  0.222222   
9      (17, 6)         (9)            0.222222            0.777778  0.222222   
10        (17)      (9, 6)            0.222222            0.444444  0.222222   
11     (9, 12)        (15)            0.
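
The antecedents and consequents above are column indices rather than names, because `apriori` was called here without `use_colnames=True`. Passing that argument, as in the earlier `apriori` call, yields names directly. Alternatively, indices can be looked up by position, as in this sketch. The column order is copied from the encoded DataFrame shown after the first column was dropped, and the helper `to_names` is hypothetical.

```python
# Column order of df_encode after dropping the first column (copied from the table above).
columns = list("CDEMPWacdeghiklnprst")


def to_names(index_set, columns):
    """Map a set of column indices to the corresponding column names."""
    return frozenset(columns[i] for i in index_set)


# Rule 0 in the printed output: (6) -> (9)
antecedent, consequent = frozenset({6}), frozenset({9})
print(set(to_names(antecedent, columns)), "->", set(to_names(consequent, columns)))
```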

Provide explanation about __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__ and __conviction__

__antecedent support__
How often items on the left (antecedent) appear in the data.

__consequent support__
How often items on the right (consequent) appear in the data.

__support__
How often both antecedent and consequent appear together.

__confidence__
Given the antecedent is present, how likely the consequent will also be present.

__lift__
How much more often the antecedent and consequent occur together than expected if they were statistically independent. Lift > 1 means they appear together more than expected, Lift < 1 means less.

__leverage__
The difference between how often the items occur together and what would be expected if they were independent. Leverage of 0 means the items are independent.

__conviction__
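Compares how often the antecedent would appear without the consequent if the two were independent against how often that actually happens. High conviction means the consequent strongly depends on the antecedent. Conviction is 'inf' (infinity) when the confidence is exactly 1.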
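
The definitions above can be stated as formulas. For a rule $A \rightarrow C$, these are the standard definitions, which to the best of my knowledge match those in the mlxtend user guide linked at the top; $\mathrm{supp}(X)$ denotes the fraction of transactions that contain every item in $X$.

$$
\begin{aligned}
\mathrm{support}(A \rightarrow C) &= \mathrm{supp}(A \cup C) \\
\mathrm{confidence}(A \rightarrow C) &= \frac{\mathrm{supp}(A \cup C)}{\mathrm{supp}(A)} \\
\mathrm{lift}(A \rightarrow C) &= \frac{\mathrm{confidence}(A \rightarrow C)}{\mathrm{supp}(C)} \\
\mathrm{leverage}(A \rightarrow C) &= \mathrm{supp}(A \cup C) - \mathrm{supp}(A)\,\mathrm{supp}(C) \\
\mathrm{conviction}(A \rightarrow C) &= \frac{1 - \mathrm{supp}(C)}{1 - \mathrm{confidence}(A \rightarrow C)}
\end{aligned}
$$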

For example, suppose that out of 100 transactions, 30 include Milk, 20 include Bread, and 10 include both. For the rule Milk -> Bread, the antecedent support is 0.3 (30%), the consequent support is 0.2 (20%), and the support is 0.1 (10%). The confidence is then 0.1/0.3 ≈ 0.33, meaning that about 33% of transactions that include Milk also include Bread. Lift, leverage, and conviction are derived from these same values.
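
The remaining metrics for the Milk -> Bread example can be worked out in a few lines. This is a sketch using the counts from the example; the variable names are illustrative.

```python
# Counts taken from the Milk -> Bread example.
n_transactions = 100
n_milk = 30   # transactions containing Milk (antecedent)
n_bread = 20  # transactions containing Bread (consequent)
n_both = 10   # transactions containing both Milk and Bread

antecedent_support = n_milk / n_transactions
consequent_support = n_bread / n_transactions
support = n_both / n_transactions

confidence = support / antecedent_support
lift = confidence / consequent_support
leverage = support - antecedent_support * consequent_support
# Conviction is infinite when the rule never fails (confidence == 1).
conviction = (
    float("inf")
    if confidence == 1
    else (1 - consequent_support) / (1 - confidence)
)

print(f"confidence = {confidence:.3f}")
print(f"lift       = {lift:.3f}")
print(f"leverage   = {leverage:.3f}")
print(f"conviction = {conviction:.3f}")
```

Here the lift is above 1 and the leverage is positive, so in this example Milk and Bread appear together more often than independence would predict.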