# Data Mining Lab Week 7: Association Analysis

## Introduction

The purpose of this lab session is to provide you with an opportunity to gain experience in **association analysis** using typical Python libraries.

This session starts with a tutorial that uses examples to introduce you to the practical knowledge that you will need for the corresponding assignment. 

## 1. Frequent itemsets

To introduce association analysis in Python, we adapt an example from the ``mlxtend`` documentation.

Consider a dataset composed of five transactions. This dataset is represented by a list of five elements, each of which is a list of items bought during a trip to a supermarket.

In [3]:
import numpy as np

In [4]:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

The library ``mlxtend`` requires each transaction to be represented by a binary vector in which each element indicates the presence or absence of a specific item.

The method ``TransactionEncoder.fit_transform`` converts the dataset created above into this format. It returns a binary matrix (a NumPy array) in which each row corresponds to a transaction and each column corresponds to an item.

In [5]:
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit_transform(dataset)
print(te_ary)

[[False False False  True False  True  True  True  True False  True]
 [False False  True  True False  True False  True  True False  True]
 [ True False False  True False  True  True False False False False]
 [False  True False False False  True  True False False  True  True]
 [False  True False  True  True  True False False  True False False]]
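To make the encoding concrete, the following is a minimal pure-Python sketch of the same idea. The helper ``one_hot_encode`` and the toy transactions are hypothetical and are not part of ``mlxtend``; the sketch assumes that columns are ordered alphabetically, which matches the column order seen in the output above.

```python
def one_hot_encode(transactions):
    """Hypothetical helper: return (columns, rows) for a list of transactions."""
    # Sorted distinct items give a stable (alphabetical) column order.
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        present = set(transaction)  # duplicates within a transaction collapse here
        rows.append([item in present for item in columns])
    return columns, rows


toy = [['Milk', 'Eggs'], ['Eggs', 'Onion', 'Onion'], ['Milk']]
columns, rows = one_hot_encode(toy)
print(columns)  # ['Eggs', 'Milk', 'Onion']
for row in rows:
    print(row)
# [True, True, False]
# [True, False, True]
# [False, True, False]
```

Note that an item repeated within a transaction (such as ``'Onion'`` in the second toy transaction) still produces a single ``True`` entry.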


The item corresponding to each column is stored by the ``TransactionEncoder`` object in an attribute called ``columns_``. This attribute can be used to create a ``DataFrame`` that conveniently represents the transaction dataset.

In [6]:
import pandas as pd

df = pd.DataFrame(te_ary, columns=te.columns_)
display(df)
print(df.shape)

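|   | Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt |
|---|-------|------|------|------|-----------|--------------|------|--------|-------|---------|--------|
| 0 | False | False | False | True | False | True | True | True | True | False | True |
| 1 | False | False | True | True | False | True | False | True | True | False | True |
| 2 | True | False | False | True | False | True | True | False | False | False | False |
| 3 | False | True | False | False | False | True | True | False | False | True | True |
| 4 | False | True | False | True | True | True | False | False | True | False | False |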


(5, 11)


The ``mlxtend`` function ``apriori`` receives a ``DataFrame`` that represents a transaction dataset and a parameter, ``min_support``, that specifies the support threshold. It returns a ``DataFrame`` with one row per frequent itemset. Each row contains a Python ``frozenset`` that represents the itemset (by column indices, by default) and a number that gives the support of this itemset.

In [7]:
from mlxtend.frequent_patterns import apriori

frequent_itemsets = apriori(df, min_support=0.2)
display(frequent_itemsets)

itemset = frequent_itemsets.loc[5]
print('Itemset: {0}. Support: {1}.'.format(itemset['itemsets'], itemset['support']))

|     | support | itemsets |
|-----|---------|----------|
| 0   | 0.2 | (0) |
| 1   | 0.4 | (1) |
| 2   | 0.2 | (2) |
| 3   | 0.8 | (3) |
| 4   | 0.2 | (4) |
| ... | ... | ... |
| 144 | 0.4 | (3, 5, 7, 8, 10) |
| 145 | 0.2 | (3, 6, 7, 8, 10) |
| 146 | 0.2 | (5, 6, 7, 8, 10) |
| 147 | 0.2 | (2, 3, 5, 7, 8, 10) |


Itemset: frozenset({5}). Support: 1.0.
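The support values reported by ``apriori`` can be checked by hand: the support of an itemset is the fraction of transactions that contain all of its items. The sketch below counts this directly on the tutorial dataset; the helper name ``support`` is hypothetical and is not part of ``mlxtend``.

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]


def support(itemset, transactions):
    """Hypothetical helper: fraction of transactions containing every item of itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for transaction in transactions if itemset <= set(transaction))
    return hits / len(transactions)


print(support({'Kidney Beans'}, dataset))   # 1.0
print(support({'Onion', 'Eggs'}, dataset))  # 0.6
```

The first value agrees with the support printed above for column index 5, which corresponds to ``Kidney Beans``.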


Conveniently, if the parameter ``use_colnames`` is set to ``True``, the ``mlxtend`` function ``apriori`` instead returns a ``DataFrame`` that represents itemsets by ``frozenset``s of item names.

In [8]:
frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)
display(frequent_itemsets)

|     | support | itemsets |
|-----|---------|----------|
| 0   | 0.2 | (Apple) |
| 1   | 0.4 | (Corn) |
| 2   | 0.2 | (Dill) |
| 3   | 0.8 | (Eggs) |
| 4   | 0.2 | (Ice cream) |
| ... | ... | ... |
| 144 | 0.4 | (Kidney Beans, Yogurt, Nutmeg, Eggs, Onion) |
| 145 | 0.2 | (Yogurt, Nutmeg, Eggs, Onion, Milk) |
| 146 | 0.2 | (Kidney Beans, Yogurt, Nutmeg, Onion, Milk) |
| 147 | 0.2 | (Kidney Beans, Yogurt, Nutmeg, Eggs, Dill, Onion) |
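For intuition about what happens inside ``apriori``, here is a minimal pure-Python sketch of a level-wise search on a toy dataset. It is not the ``mlxtend`` implementation: the function ``apriori_sketch`` and the toy transactions are hypothetical, and the join step simply takes unions of frequent $(k-1)$-itemsets that have $k$ items, rather than the prefix-based join usually presented in textbooks.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Hypothetical helper: return {frozenset: support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = keep_frequent({frozenset([item]) for item in items})
    frequent = dict(current)
    k = 2
    while current:
        # Join (simplified): unions of two frequent (k-1)-itemsets with exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent


toy = [['A', 'B', 'C'], ['A', 'B'], ['A', 'C'], ['B', 'C'], ['A', 'B', 'C']]
frequent = apriori_sketch(toy, min_support=0.6)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

In this toy example, every single item and every pair is frequent, while the candidate ``{A, B, C}`` is generated but rejected because its support (0.4) is below the threshold.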


Using typical ``pandas`` functionalities, it is easy to include a column in such a ``DataFrame`` to register the number of items in each frequent itemset, which can be used to filter itemsets by length.

In [9]:
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)  # number of items in each frozenset
print('Frequent 3-itemsets:')
display(frequent_itemsets[frequent_itemsets['length'] == 3])

Frequent 3-itemsets:


|    | support | itemsets | length |
|----|---------|----------|--------|
| 47 | 0.2 | (Kidney Beans, Apple, Eggs) | 3 |
| 48 | 0.2 | (Apple, Eggs, Milk) | 3 |
| 49 | 0.2 | (Kidney Beans, Apple, Milk) | 3 |
| 50 | 0.2 | (Corn, Eggs, Ice cream) | 3 |
| 51 | 0.2 | (Corn, Eggs, Kidney Beans) | 3 |
| 52 | 0.2 | (Corn, Onion, Eggs) | 3 |
| 53 | 0.2 | (Corn, Kidney Beans, Ice cream) | 3 |
| 54 | 0.2 | (Corn, Onion, Ice cream) | 3 |
| 55 | 0.2 | (Corn, Kidney Beans, Milk) | 3 |
| 56 | 0.2 | (Corn, Onion, Kidney Beans) | 3 |


It is also easy to create a ``dict`` that maps any frequent itemset (represented by a ``frozenset``) to its support.

In [10]:
# Map each frequent itemset (a frozenset) to its support
support = dict(zip(frequent_itemsets['itemsets'], frequent_itemsets['support']))

itemset = frozenset(['Onion', 'Eggs'])
print('Itemset: {0}. Support: {1}.'.format(itemset, support[itemset]))

Itemset: frozenset({'Onion', 'Eggs'}). Support: 0.6.


In [11]:
frequent_itemsets

|     | support | itemsets | length |
|-----|---------|----------|--------|
| 0   | 0.2 | (Apple) | 1 |
| 1   | 0.4 | (Corn) | 1 |
| 2   | 0.2 | (Dill) | 1 |
| 3   | 0.8 | (Eggs) | 1 |
| 4   | 0.2 | (Ice cream) | 1 |
| ... | ... | ... | ... |
| 144 | 0.4 | (Kidney Beans, Yogurt, Nutmeg, Eggs, Onion) | 5 |
| 145 | 0.2 | (Yogurt, Nutmeg, Eggs, Onion, Milk) | 5 |
| 146 | 0.2 | (Kidney Beans, Yogurt, Nutmeg, Onion, Milk) | 5 |
| 147 | 0.2 | (Kidney Beans, Yogurt, Nutmeg, Eggs, Dill, Onion) | 6 |


## 2. Association rules

The ``mlxtend`` function ``association_rules`` receives a ``DataFrame`` that represents the set of frequent itemsets and returns a ``DataFrame`` that represents the strong association rules for a specified confidence threshold. Each row in the resulting ``DataFrame`` contains an association rule together with some potentially useful measures (we have not covered lift, leverage, or conviction).


In [13]:
from mlxtend.frequent_patterns import association_rules

strong_rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
display(strong_rules)

|     | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|-----|-------------|-------------|--------------------|--------------------|---------|------------|------|----------|------------|
| 0   | (Apple) | (Eggs) | 0.2 | 0.8 | 0.2 | 1.0 | 1.250000 | 0.04 | inf |
| 1   | (Apple) | (Kidney Beans) | 0.2 | 1.0 | 0.2 | 1.0 | 1.000000 | 0.00 | inf |
| 2   | (Apple) | (Milk) | 0.2 | 0.6 | 0.2 | 1.0 | 1.666667 | 0.08 | inf |
| 3   | (Ice cream) | (Corn) | 0.2 | 0.4 | 0.2 | 1.0 | 2.500000 | 0.12 | inf |
| 4   | (Corn) | (Kidney Beans) | 0.4 | 1.0 | 0.4 | 1.0 | 1.000000 | 0.00 | inf |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 628 | (Nutmeg, Eggs, Milk) | (Yogurt, Kidney Beans, Onion) | 0.2 | 0.4 | 0.2 | 1.0 | 2.500000 | 0.12 | inf |
| 629 | (Nutmeg, Milk, Onion) | (Yogurt, Kidney Beans, Eggs) | 0.2 | 0.4 | 0.2 | 1.0 | 2.500000 | 0.12 | inf |
| 630 | (Onion, Eggs, Milk) | (Yogurt, Kidney Beans, Nutmeg) | 0.2 | 0.4 | 0.2 | 1.0 | 2.500000 | 0.12 | inf |
| 631 | (Nutmeg, Milk) | (Yogurt, Kidney Beans, Onion, Eggs) | 0.2 | 0.4 | 0.2 | 1.0 | 2.500000 | 0.12 | inf |
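The confidence column can be reproduced from supports alone: the confidence of a rule is the support of the union of its antecedents and consequents divided by the support of its antecedents. The sketch below assumes a support dictionary keyed by ``frozenset``, filled here with three values read off the tutorial output; the helper ``confidence`` is hypothetical and is not part of ``mlxtend``.

```python
# Supports taken from the tutorial output above (keys are frozensets of item names).
support = {
    frozenset({'Apple'}): 0.2,
    frozenset({'Eggs'}): 0.8,
    frozenset({'Apple', 'Eggs'}): 0.2,
}


def confidence(antecedents, consequents, support):
    """Hypothetical helper: confidence of the rule antecedents => consequents."""
    antecedents = frozenset(antecedents)
    both = antecedents | frozenset(consequents)
    return support[both] / support[antecedents]


print(confidence({'Apple'}, {'Eggs'}, support))  # 1.0
print(confidence({'Eggs'}, {'Apple'}, support))  # 0.25
```

With a confidence threshold of 0.7, only the first of these two rules is strong, which is consistent with row 0 of the table above.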


## Assignment 3 [Part 1/2]


1. What is the advantage of using the Apriori algorithm in comparison with computing the support of every subset of an itemset in order to find the frequent itemsets in a transaction dataset? [0.5/5]
2. Let $\mathcal{L}_1$ denote the set of frequent $1$-itemsets. For $k \geq 2$, why must every frequent $k$-itemset be a superset of an itemset in $\mathcal{L}_1$? [0.5/5]
3. Let $\mathcal{L}_2 = \{ \{1,2\}, \{1,5\}, \{2, 3\}, \{3, 4\}, \{3, 5\}\}$. Compute the set of candidates $\mathcal{C}_3$ that is obtained by joining every pair of joinable itemsets from $\mathcal{L}_2$. [0.5/5]
4. Let $S_1$ denote the support of the association rule $\{ \text{popcorn, soda} \} \Rightarrow \{ \text{movie} \}$. Let $S_2$ denote the support of the association rule $\{ \text{popcorn} \} \Rightarrow \{ \text{movie} \}$. What is the relationship between $S_1$ and $S_2$? [0.5/5]
5. What is the support of the rule $\{  \} \Rightarrow \{ \text{Kidney Beans} \}$ in the transaction dataset used in the tutorial presented above? [0.5/5]
6. In the transaction dataset used in the tutorial presented above, what is the maximum length of a frequent itemset for a support threshold of 0.2? [0.5/5]
7. Implement a function that receives a ``DataFrame`` of frequent itemsets and a **strong** association rule (represented by a ``frozenset`` of antecedents and a ``frozenset`` of consequents). This function should return the corresponding Kulczynski measure. Include the code in your report. [1/5]
8. Implement a function that receives a ``DataFrame`` of frequent itemsets and a **strong** association rule (represented by a ``frozenset`` of antecedents and a ``frozenset`` of consequents). This function should return the corresponding imbalance ratio. Include the code in your report. [1/5]


## Q7


In [15]:
def kulczynski(df, antecedents, consequents):
    """Return the Kulczynski measure of the rule antecedents => consequents.

    df is a DataFrame of frequent itemsets (columns 'itemsets' and 'support').
    Returns None if any required itemset is not in df.
    """
    # Joint itemset: every item from the antecedents and the consequents
    all_items = list(antecedents) + list(consequents)

    # Map each frequent itemset to its (row index, support)
    lookup = {itemset: (idx, supp)
              for idx, itemset, supp in zip(df.index, df['itemsets'], df['support'])}
    try:
        ant_idx, supp_ant = lookup[frozenset(antecedents)]
        con_idx, supp_con = lookup[frozenset(consequents)]
        fullSet_idx, supp_fullSet = lookup[frozenset(all_items)]
    except KeyError:
        print("Some of the chosen itemsets don't exist in the dataset, or didn't make it past the threshold!")
        return None

    # Average of the two conditional probabilities P(B|A) and P(A|B)
    kulc_measure = 0.5 * ((supp_fullSet/supp_con) + (supp_fullSet/supp_ant))
    print('Antecedent(',str(antecedents),') --------> idx:', ant_idx, '----- Support:', supp_ant)
    print('Consequent(',str(consequents),') ---------> idx:', con_idx,'----- Support:', supp_con)
    print('Full set(',str(all_items),') -------------> idx:', fullSet_idx, '----- Support:', supp_fullSet,'\n')
    return kulc_measure

In [16]:
a = frozenset(['Eggs']) 
c = frozenset(['Eggs'])

print('Kulczynski measure: ', kulczynski(frequent_itemsets, a, c))

Antecedent( frozenset({'Eggs'}) ) --------> idx: 3 ----- Support: 0.8
Consequent( frozenset({'Eggs'}) ) ---------> idx: 3 ----- Support: 0.8
Full set( ['Eggs', 'Eggs'] ) -------------> idx: 3 ----- Support: 0.8 

Kulczynski measure:  1.0
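For reference, the quantity computed above can be written as follows, where $\mathrm{sup}(\cdot)$ denotes support; this is the standard definition of the Kulczynski measure, stated here in the notation of this tutorial. The second line substitutes the three supports printed in the output above.

$$
\begin{aligned}
\mathrm{Kulc}(A, B) &= \frac{1}{2} \left( \frac{\mathrm{sup}(A \cup B)}{\mathrm{sup}(A)} + \frac{\mathrm{sup}(A \cup B)}{\mathrm{sup}(B)} \right) \\
&= \frac{1}{2} \left( \frac{0.8}{0.8} + \frac{0.8}{0.8} \right) = 1.0
\end{aligned}
$$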


In [17]:
itemsets = set(frequent_itemsets['itemsets'])
print(itemsets)


{frozenset({'Corn', 'Eggs'}), frozenset({'Kidney Beans', 'Nutmeg', 'Eggs', 'Onion', 'Milk'}), frozenset({'Yogurt'}), frozenset({'Kidney Beans', 'Onion', 'Eggs'}), frozenset({'Corn', 'Milk'}), frozenset({'Kidney Beans'}), frozenset({'Yogurt', 'Onion', 'Eggs'}), frozenset({'Yogurt', 'Dill', 'Onion', 'Nutmeg'}), frozenset({'Kidney Beans', 'Yogurt', 'Nutmeg', 'Eggs', 'Dill', 'Onion'}), frozenset({'Kidney Beans', 'Onion', 'Eggs', 'Milk'}), frozenset({'Yogurt', 'Kidney Beans', 'Onion', 'Eggs'}), frozenset({'Onion', 'Eggs', 'Nutmeg'}), frozenset({'Corn', 'Eggs', 'Ice cream'}), frozenset({'Kidney Beans', 'Yogurt', 'Nutmeg', 'Eggs', 'Onion', 'Milk'}), frozenset({'Kidney Beans', 'Apple'}), frozenset({'Yogurt', 'Nutmeg', 'Eggs'}), frozenset({'Yogurt', 'Onion', 'Milk', 'Nutmeg'}), frozenset({'Yogurt', 'Nutmeg', 'Eggs', 'Dill', 'Onion'}), frozenset({'Yogurt', 'Kidney Beans'}), frozenset({'Yogurt', 'Dill', 'Eggs', 'Kidney Beans'}), frozenset({'Yogurt', 'Kidney Beans', 'Nutmeg', 'Eggs'}), frozenset({

## Q8

In [25]:
frequent_itemsets[0:59]

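|   | support | itemsets | length |
|---|---------|----------|--------|
| 0 | 0.2 | (Apple) | 1 |
| 1 | 0.4 | (Corn) | 1 |
| 2 | 0.2 | (Dill) | 1 |
| 3 | 0.8 | (Eggs) | 1 |
| 4 | 0.2 | (Ice cream) | 1 |
| 5 | 1.0 | (Kidney Beans) | 1 |
| 6 | 0.6 | (Milk) | 1 |
| 7 | 0.4 | (Nutmeg) | 1 |
| 8 | 0.6 | (Onion) | 1 |
| 9 | 0.2 | (Unicorn) | 1 |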


In [35]:
def imbalance_ratio(df, antecedents, consequents):
    """Return the imbalance ratio of the rule antecedents => consequents.

    df is a DataFrame of frequent itemsets (columns 'itemsets' and 'support').
    Returns None if any required itemset is not in df.
    """
    # Joint itemset: every item from the antecedents and the consequents
    all_items = list(antecedents) + list(consequents)

    # Map each frequent itemset to its (row index, support)
    lookup = {itemset: (idx, supp)
              for idx, itemset, supp in zip(df.index, df['itemsets'], df['support'])}
    try:
        ant_idx, supp_ant = lookup[frozenset(antecedents)]
        con_idx, supp_con = lookup[frozenset(consequents)]
        fullSet_idx, supp_fullSet = lookup[frozenset(all_items)]
    except KeyError:
        print("Some of the chosen itemsets don't exist in the dataset, or didn't make it past the threshold!")
        return None

    # Difference between the two supports, relative to the support of "A or B"
    IR = (abs(supp_ant - supp_con)) / (supp_ant + supp_con - supp_fullSet)

    print('Antecedent(',str(antecedents),') -------> idx:', ant_idx, '----- Support:', supp_ant)
    print('Consequent(',str(consequents),') ---------------> idx:', con_idx,'----- Support:', supp_con)
    print('Full set(',str(set(all_items)),') ------------> idx:', fullSet_idx, '----- Support:', supp_fullSet,'\n')
    return IR

In [36]:
a = frozenset(['Kidney Beans']) 
c = frozenset(['Eggs'])

print('Imbalance ratio: ', np.round(imbalance_ratio(frequent_itemsets, a, c),3))

Antecedent( frozenset({'Kidney Beans'}) ) -------> idx: 5 ----- Support: 1.0
Consequent( frozenset({'Eggs'}) ) ---------------> idx: 3 ----- Support: 0.8
Full set( {'Kidney Beans', 'Eggs'} ) ------------> idx: 27 ----- Support: 0.8 

Imbalance ratio:  0.2
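For reference, the quantity computed above can be written as follows, where $\mathrm{sup}(\cdot)$ denotes support; this is the standard definition of the imbalance ratio, stated here in the notation of this tutorial. The second line substitutes the three supports printed in the output above.

$$
\begin{aligned}
\mathrm{IR}(A, B) &= \frac{\left| \mathrm{sup}(A) - \mathrm{sup}(B) \right|}{\mathrm{sup}(A) + \mathrm{sup}(B) - \mathrm{sup}(A \cup B)} \\
&= \frac{\left| 1.0 - 0.8 \right|}{1.0 + 0.8 - 0.8} = 0.2
\end{aligned}
$$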
