# ECLAT works with a vertical data format.

In [1]:
!pip install pyECLAT

Collecting pyECLAT
  Downloading pyECLAT-1.0.2-py3-none-any.whl (6.3 kB)
Installing collected packages: pyECLAT
Successfully installed pyECLAT-1.0.2




In [2]:
# store the item sets as lists of strings in a list
transactions = [
    ['beer', 'wine', 'cheese'],
    ['beer', 'potato chips'],
    ['eggs', 'flower', 'butter', 'cheese'],
    ['eggs', 'flower', 'butter', 'beer', 'potato chips'],
    ['wine', 'cheese'],
    ['potato chips'],
    ['eggs', 'flower', 'butter', 'wine', 'cheese'],
    ['eggs', 'flower', 'butter', 'beer', 'potato chips'],
    ['wine', 'beer'],
    ['beer', 'potato chips'],
    ['butter', 'eggs'],
    ['beer', 'potato chips'],
    ['flower', 'eggs'],
    ['beer', 'potato chips'],
    ['eggs', 'flower', 'butter', 'wine', 'cheese'],
    ['beer', 'wine', 'potato chips', 'cheese'],
    ['wine', 'cheese'],
    ['beer', 'potato chips'],
    ['wine', 'cheese'],
    ['beer', 'potato chips']
]
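As a rough illustration of the vertical format named in the heading, the sketch below maps each item to the set of transaction indices that contain it (often called a "tidset") and derives an itemset's support by intersecting those sets. It is a simplified stand-in for the idea, not pyECLAT's internal implementation; `toy_transactions`, `build_tidsets`, and `support` are names made up for this example.

```python
from collections import defaultdict

# a small hypothetical basket list, used only for this illustration
toy_transactions = [
    ['beer', 'wine', 'cheese'],
    ['beer', 'potato chips'],
    ['wine', 'cheese'],
    ['potato chips'],
]

def build_tidsets(transactions):
    """Map each item to the set of transaction indices that contain it."""
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[item].add(tid)
    return dict(tidsets)

def support(itemset, tidsets, n_transactions):
    """Share of transactions that contain every item in the itemset."""
    common = set.intersection(*(tidsets[item] for item in itemset))
    return len(common) / n_transactions

tidsets = build_tidsets(toy_transactions)
print(tidsets['wine'])
print(support(['wine', 'cheese'], tidsets, len(toy_transactions)))
```

In this layout, the support of a larger itemset needs only a set intersection, without rescanning the transactions.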

In [3]:
import pandas as pd

# you simply convert the transaction list into a dataframe
data = pd.DataFrame(transactions)
data

Unnamed: 0,0,1,2,3,4
0,beer,wine,cheese,,
1,beer,potato chips,,,
2,eggs,flower,butter,cheese,
3,eggs,flower,butter,beer,potato chips
4,wine,cheese,,,
5,potato chips,,,,
6,eggs,flower,butter,wine,cheese
7,eggs,flower,butter,beer,potato chips
8,wine,beer,,,
9,beer,potato chips,,,


Now that we have the data, we need to set a few algorithm parameters. First, we specify the smallest itemset size we are interested in. We are looking for product associations, so we leave out individual (1-item) itemsets: the minimum size is 2.

We also need to express our minimum support as a fraction of all transactions rather than as a count, which is easy to do as shown in the code below.

Finally, the pyECLAT package wants us to specify a maximum size. We have no upper limit in mind for the itemsets (large product associations would be interesting as well), so we use the size of the longest transaction.

In [4]:
# we are looking for itemSETS
# we do not want to have any individual products returned
min_n_products = 2

# we want a minimum support of 7 transactions
# but we have to express it as a fraction of all transactions
min_support = 7/len(transactions)

# we have no limit on the size of the itemsets
# so we set it to the length of the longest transaction
max_length = max(len(x) for x in transactions)

In [5]:
from pyECLAT import ECLAT

# create an instance of eclat
my_eclat = ECLAT(data=data, verbose=True)

# fit the algorithm
rule_indices, rule_supports = my_eclat.fit(min_support=min_support,
                                           min_combination=min_n_products,
                                           max_combination=max_length)

100%|███████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 233.50it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<?, ?it/s]
100%|███████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 483.10it/s]


Combination 2 by 2


10it [00:00, 133.31it/s]


Combination 3 by 3


10it [00:00, 131.64it/s]


Combination 4 by 4


5it [00:00, 111.22it/s]


Combination 5 by 5


1it [00:00, 60.26it/s]


# The fit method returns two things: the association rule indices and the association rule supports. As mentioned before, ECLAT does not give us many metrics. The only interesting thing here is the rule supports, which we can print with the following code:

In [6]:
print(rule_supports)

{'wine & cheese': 0.35, 'potato chips & beer': 0.45}


# The interpretation is that, within the transactions of our night store, two product combinations occur frequently: wine and cheese appear together in 35% of the transactions, and potato chips and beer in 45%. It could be a good idea to place those products close together so that customers can easily pick up both. The shop owner could also consider bundling them in an attractive offer to boost sales of those products even more.
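As a cross-check on the supports reported above, a brute-force count over the same 20 transactions can be sketched in plain Python. This is only a sanity check and not how pyECLAT computes its result; `counts` and `itemset_supports` are names invented here.

```python
from collections import Counter
from itertools import combinations

# the same 20 transactions as in the example above
transactions = [
    ['beer', 'wine', 'cheese'],
    ['beer', 'potato chips'],
    ['eggs', 'flower', 'butter', 'cheese'],
    ['eggs', 'flower', 'butter', 'beer', 'potato chips'],
    ['wine', 'cheese'],
    ['potato chips'],
    ['eggs', 'flower', 'butter', 'wine', 'cheese'],
    ['eggs', 'flower', 'butter', 'beer', 'potato chips'],
    ['wine', 'beer'],
    ['beer', 'potato chips'],
    ['butter', 'eggs'],
    ['beer', 'potato chips'],
    ['flower', 'eggs'],
    ['beer', 'potato chips'],
    ['eggs', 'flower', 'butter', 'wine', 'cheese'],
    ['beer', 'wine', 'potato chips', 'cheese'],
    ['wine', 'cheese'],
    ['beer', 'potato chips'],
    ['wine', 'cheese'],
    ['beer', 'potato chips'],
]

min_support = 7 / len(transactions)
min_size = 2
max_size = max(len(t) for t in transactions)

# count every itemset of each allowed size that occurs in a transaction
counts = Counter()
for basket in transactions:
    unique_items = sorted(set(basket))
    for size in range(min_size, max_size + 1):
        counts.update(combinations(unique_items, size))

# keep the itemsets whose share of transactions reaches the threshold
itemset_supports = {
    ' & '.join(itemset): count / len(transactions)
    for itemset, count in counts.items()
    if count / len(transactions) >= min_support
}
print(itemset_supports)
```

For this data the check finds the same two pairs, with supports 0.35 and 0.45. The order of the items inside each key may differ from pyECLAT's output because the items are sorted alphabetically here.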

# Example 2: 

In [7]:
# importing the dataset (Example 1 and Example 2 are datasets in pyECLAT)
from pyECLAT import Example2

# storing the dataset in a variable
dataset = Example2().get()

# printing the dataset
dataset.head()

Unnamed: 0,0,1,2,3,4,5,6
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams
1,burgers,meatballs,eggs,,,,
2,chutney,,,,,,
3,turkey,avocado,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,


In [8]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3001 entries, 0 to 3000
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       3001 non-null   object
 1   1       2315 non-null   object
 2   2       1774 non-null   object
 3   3       1374 non-null   object
 4   4       1048 non-null   object
 5   5       775 non-null    object
 6   6       581 non-null    object
dtypes: object(7)
memory usage: 164.2+ KB


In [9]:
## Visualizing the frequent items
# importing the ECLAT module
from pyECLAT import ECLAT

# loading transactions DataFrame to ECLAT class
eclat = ECLAT(data=dataset)

# DataFrame of binary values
eclat.df_bin

Unnamed: 0,spaghetti,asparagus,shallot,sparkling water,energy bar,escalope,soda,frozen vegetables,yogurt cake,light cream,...,yams,butter,pickles,antioxydant juice,ketchup,strawberries,green beans,whole weat flour,gums,parmesan cheese
0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2999,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In this binary dataset, every row represents a transaction and every column represents a product that can appear in a transaction. Every cell contains one of two possible values:

0 — the product was not included in the transaction
1 — the transaction contains the product
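To make the binary layout concrete, here is a minimal plain-Python sketch of how such a 0/1 table could be built by hand. pyECLAT already provides `df_bin`, so this is only an illustration under assumed toy data; `baskets`, `binary_rows`, `item_totals`, and `items_per_basket` are hypothetical names.

```python
# hypothetical toy baskets, used only for this illustration
baskets = [
    ['burgers', 'meatballs', 'eggs'],
    ['chutney'],
    ['turkey', 'avocado'],
    ['eggs', 'avocado'],
]

# one column per distinct item, in alphabetical order
items = sorted({item for basket in baskets for item in basket})

# one row per transaction: 1 if the item is in the basket, else 0
binary_rows = [[int(item in basket) for item in items] for basket in baskets]

# column sums: in how many transactions each item appears
item_totals = {
    item: sum(row[i] for row in binary_rows)
    for i, item in enumerate(items)
}

# row sums: how many distinct items each transaction contains
items_per_basket = [sum(row) for row in binary_rows]

print(items)
for row in binary_rows:
    print(row)
print(item_totals)
print(items_per_basket)
```

The column sums and row sums here play the same role as summing `df_bin` along `axis=0` and `axis=1`.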

In [10]:
# count items in each column, i.e. in how many transactions each item was purchased
items_total = eclat.df_bin.astype(int).sum(axis=0)

items_total

spaghetti           549
asparagus             7
shallot              22
sparkling water      13
energy bar           80
                   ... 
strawberries         59
green beans          17
whole weat flour     26
gums                 27
parmesan cheese      58
Length: 119, dtype: int64

In [11]:
# count items in each row (Transaction)
items_per_transaction = eclat.df_bin.astype(int).sum(axis=1)

items_per_transaction

0       7
1       3
2       1
3       2
4       5
       ..
2996    1
2997    2
2998    3
2999    7
3000    5
Length: 3001, dtype: int64

In [12]:
items_total.index

Index(['spaghetti', 'asparagus', 'shallot', 'sparkling water', 'energy bar',
       'escalope', 'soda', 'frozen vegetables', 'yogurt cake', 'light cream',
       ...
       'yams', 'butter', 'pickles', 'antioxydant juice', 'ketchup',
       'strawberries', 'green beans', 'whole weat flour', 'gums',
       'parmesan cheese'],
      dtype='object', length=119)

In [13]:
items_total.values

array([549,   7,  22,  13,  80, 212,  18, 276,  72,  50, 144,  14,   3,
       236,  22,  10,  86,  18,  26,  71,   1, 463,   6, 198,  56,  36,
        25, 381,  25,  10, 281,  25,  83,  25,  22, 231,  77, 257, 136,
         8,  25,  12, 170,  20,  59,  32,  15,  65,  86,  52,  12,  34,
        50,  72, 711,  13,  69,  52,  31,  54,  29, 206, 163,  71,   8,
        77,  18,  48,  50,  34,  11,  26,  10,   7, 151, 166,  29,  11,
       107, 170,  65,  28, 232, 198, 340,  59, 117, 116,  39, 231,  13,
        91, 532,  17, 144,  43, 485,  31,  16,  70,   3,  10,  15,  50,
        10,  91,  10,  18, 128,  43,  89,  17,  18,  11,  59,  17,  26,
        27,  58], dtype=int64)

In [14]:
## Frequent ItemList
import pandas as pd

# Loading items per column stats to the DataFrame
df = pd.DataFrame({'items': items_total.index, 'transactions': items_total.values}) 

# cloning pandas DataFrame for visualization purpose  
df_table = df.sort_values("transactions", ascending=False)

#  Top 5 most popular products/items
df_table.head(5).style.background_gradient(cmap='Blues')

Unnamed: 0,items,transactions
54,mineral water,711
0,spaghetti,549
92,eggs,532
96,chocolate,485
21,french fries,463


To generate association rules, we need to define:

Minimum support: the minimum fraction of the transactions in the dataset that must contain the itemset
Minimum combinations: the minimum number of items in an itemset
Maximum combinations: the maximum number of items in an itemset

When we call the function in Python, we need to pass the minimum and maximum combination sizes that the algorithm will build.

The end result is the frequent itemsets with their support. If you were hoping for other measures such as lift or confidence, sorry: ECLAT only gives us the support.

In [15]:
# the itemset should appear in at least 5% of transactions
min_support = 5/100

# start from itemsets containing at least 2 items
min_combination = 2

# up to the maximum number of items per transaction
max_combination = max(items_per_transaction)

rule_indices, rule_supports = eclat.fit(min_support=min_support,
                                        min_combination=min_combination,
                                        max_combination=max_combination,
                                        separator=' & ',
                                        verbose=True)

Combination 2 by 2


253it [00:04, 63.02it/s]


Combination 3 by 3


1771it [00:29, 60.06it/s]


Combination 4 by 4


8855it [02:36, 56.65it/s]


Combination 5 by 5


33649it [13:40, 40.99it/s]


Combination 6 by 6


100947it [32:01, 52.52it/s]


Combination 7 by 7


245157it [1:19:45, 51.23it/s]
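The iteration counts in the progress output above (253, 1771, 8855, ...) match the number of ways to choose k items out of 23. In `items_total`, 23 of the 119 items reach 5% of the 3001 transactions (at least 151 occurrences), so it looks as if every combination of those individually frequent items is tested at each size. That reading is an inference from the output, not something pyECLAT reports; the short sketch below only reproduces the arithmetic.

```python
from math import comb

# number of items that individually meet the 5% support threshold
# (inferred from items_total, not reported by pyECLAT)
n_frequent_items = 23

# candidate itemsets per combination size, from 2 up to 7 items
candidates = {k: comb(n_frequent_items, k) for k in range(2, 8)}

for k, n_candidates in candidates.items():
    print(f"Combination {k} by {k}: {n_candidates} candidates")
```

The candidate count grows quickly with the combination size, which would explain why the last steps take most of the run time. If large itemsets are not needed, a smaller `max_combination` should shorten the run considerably.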


In [16]:
rule_supports.items()

dict_items([('spaghetti & mineral water', 0.06064645118293902)])

In [17]:
import pandas as pd

result = pd.DataFrame(rule_supports.items(),columns=['Item', 'Support'])
result.sort_values(by=['Support'], ascending=False)

Unnamed: 0,Item,Support
0,spaghetti & mineral water,0.060646


# We found that mineral water and spaghetti are commonly purchased together: the pair appears in about 6% of the transactions in our dataset, which meets the minimum support value we provided.

# ---------------------------------------------------------------------------------------------------------------

# FP Growth example: 

In [3]:

# store the item sets as lists of strings in a list
transactions = [
    ["beer", "wine", "cheese"],
    ["beer", "potato chips"],
    ["eggs", "flower", "butter", "cheese"],
    ["eggs", "flower", "butter", "beer", "potato chips"],
    ["wine", "cheese"],
    ["potato chips"],
    ["eggs", "flower", "butter", "wine", "cheese"],
    ["eggs", "flower", "butter", "beer", "potato chips"],
    ["wine", "beer"],
    ["beer", "potato chips"],
    ["butter", "eggs"],
    ["beer", "potato chips"],
    ["flower", "eggs"],
    ["beer", "potato chips"],
    ["eggs", "flower", "butter", "wine", "cheese"],
    ["beer", "wine", "potato chips", "cheese"],
    ["wine", "cheese"],
    ["beer", "potato chips"],
    ["wine", "cheese"],
    ["beer", "potato chips"],
]

In [20]:
# to install the package
!pip install mlxtend

# just in case you are running on Google Colab, you may run into a problem later on if you do not upgrade the package
%pip install mlxtend --upgrade






Note: you may need to restart the kernel to use updated packages.





In [1]:
# it is necessary for mlxtend to reorganise the data
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder


In [4]:
# instantiate a transaction encoder
my_transactionencoder = TransactionEncoder()

# fit the transaction encoder using the list of transactions
my_transactionencoder.fit(transactions)

# transform the list of transactions into an array of encoded transactions
encoded_transactions = my_transactionencoder.transform(transactions)

# convert the array of encoded transactions into a dataframe
encoded_transactions_df = pd.DataFrame(encoded_transactions, columns=my_transactionencoder.columns_)
encoded_transactions_df

Unnamed: 0,beer,butter,cheese,eggs,flower,potato chips,wine
0,True,False,True,False,False,False,True
1,True,False,False,False,False,True,False
2,False,True,True,True,True,False,False
3,True,True,False,True,True,True,False
4,False,False,True,False,False,False,True
5,False,False,False,False,False,True,False
6,False,True,True,True,True,False,True
7,True,True,False,True,True,True,False
8,True,False,False,False,False,False,True
9,True,False,False,False,False,True,False


In [5]:
# our min support is 7 transactions, but mlxtend expects it as a fraction of all transactions
min_support = 7/len(transactions)

# compute the frequent itemsets using fpgrowth from mlxtend
from mlxtend.frequent_patterns import fpgrowth
frequent_itemsets = fpgrowth(encoded_transactions_df, min_support=min_support, use_colnames=True)

# print the frequent itemsets
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.55,(beer)
1,0.4,(wine)
2,0.4,(cheese)
3,0.5,(potato chips)
4,0.35,(eggs)
5,0.35,"(cheese, wine)"
6,0.45,"(potato chips, beer)"


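You will see the support for each of the itemsets. The itemsets that are not listed here were filtered out because they do not reach the minimum support level. Note that the minimum support level here is expressed as a fraction of the transactions (7/20 = 0.35), not as a count. Unlike the ECLAT example above, single-item itemsets are included in the result.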

As the last step, we need to use the association_rules function to convert those frequent itemsets into association rules. 

In [6]:
# Compute the association rules based on the frequent itemsets
from mlxtend.frequent_patterns import association_rules

# compute and print the association rules
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(cheese),(wine),0.4,0.4,0.35,0.875,2.1875,0.19,4.8,0.904762
1,(wine),(cheese),0.4,0.4,0.35,0.875,2.1875,0.19,4.8,0.904762
2,(potato chips),(beer),0.5,0.55,0.45,0.9,1.636364,0.175,4.5,0.777778
3,(beer),(potato chips),0.55,0.5,0.45,0.818182,1.636364,0.175,2.75,0.864198


# First, we can conclude that there are two product combinations, and both associations hold in both directions. People who buy cheese also tend to buy wine, and people who buy wine also tend to buy cheese (confidence 0.875 each way). Separately, people who buy potato chips tend to buy beer (confidence 0.9), and people who buy beer tend to buy potato chips (confidence 0.82).
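To see where the numbers in the rules table come from, the metrics for the rule cheese → wine can be recomputed by hand from the supports shown earlier (cheese and wine each appear in 8 of the 20 transactions, and together in 7). This sketch uses the standard textbook definitions of confidence, lift, leverage, and conviction; the variable names are illustrative only.

```python
# counts taken from the supports in the frequent itemsets table
n_transactions = 20
n_cheese = 8   # support 0.40
n_wine = 8     # support 0.40
n_both = 7     # support 0.35

support_cheese = n_cheese / n_transactions
support_wine = n_wine / n_transactions
support_both = n_both / n_transactions

# confidence: share of cheese baskets that also contain wine
confidence = n_both / n_cheese

# lift: confidence relative to how common wine is overall
lift = confidence / support_wine

# leverage: observed joint support minus the support expected under independence
leverage = support_both - support_cheese * support_wine

# conviction: how much more often the rule would fail if the items were independent
conviction = (1 - support_wine) / (1 - confidence)

print(round(confidence, 4), round(lift, 4), round(leverage, 4), round(conviction, 4))
```

The rounded values agree with the first row of the table (0.875, 2.1875, 0.19, 4.8). A lift above 1 indicates that the two products occur together more often than would be expected if they were bought independently.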

# There is not necessarily one overall metric that we can use to decide which rules to ‘officially’ accept or discard. After all, the method is more of a tool for exploration than it is a tool for confirmation.