# Importing and exploring the dataset
We will use the built-in example dataset that ships with the pyECLAT module. Let us first import it from pyECLAT and preview the first rows.

In [3]:
from pyECLAT import Example1
dataset = Example1().get()
dataset.head()

Unnamed: 0,0,1,2,3
0,milk,beer,bread,butter
1,coffe,bread,butter,
2,coffe,bread,butter,
3,milk,coffe,bread,butter
4,beer,,,


Each row represents a customer’s purchase at a supermarket in this dataset. For example, in row 1, the customer purchased only coffe, bread, and butter (item names are spelled as they appear in the dataset). Rows with fewer than four items have empty trailing cells.
Let’s get more information about the dataset by printing its summary.

In [4]:
# printing the info
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       10 non-null     object
 1   1       5 non-null      object
 2   2       4 non-null      object
 3   3       2 non-null      object
dtypes: object(4)
memory usage: 448.0+ bytes


# Visualizing the frequent items
To visualize the frequent items, let’s load the dataset into the ECLAT class and generate a binary DataFrame:

In [5]:
# importing the ECLAT module
from pyECLAT import ECLAT

# loading transactions DataFrame to ECLAT class
eclat = ECLAT(data=dataset)

# DataFrame of binary values
eclat.df_bin

Unnamed: 0,bean,bread,butter,rice,beer,coffe,milk
0,0,1,1,0,1,0,1
1,0,1,1,0,0,1,0
2,0,1,1,0,0,1,0
3,0,1,1,0,0,1,1
4,0,0,0,0,1,0,0
5,0,0,1,0,0,0,0
6,0,1,0,0,0,0,0
7,1,0,0,0,0,0,0
8,1,0,0,1,0,0,0
9,0,0,0,1,0,0,0


In this binary dataset, every row represents a transaction. Columns are the products that can appear in a transaction. Every cell contains one of two possible values:

0 – the product was not included in the transaction

1 – the transaction contains the product
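
The encoding described above can be reproduced without pyECLAT. The following is a minimal sketch in plain Python, not pyECLAT's actual implementation. The `transactions` list is reconstructed from the binary table shown above, and the names `transactions`, `items`, and `binary_rows` are illustrative only.

```python
# Transactions reconstructed from the binary table above
# (item names are spelled as in the dataset, e.g. "coffe").
transactions = [
    ["milk", "beer", "bread", "butter"],
    ["coffe", "bread", "butter"],
    ["coffe", "bread", "butter"],
    ["milk", "coffe", "bread", "butter"],
    ["beer"],
    ["butter"],
    ["bread"],
    ["bean"],
    ["bean", "rice"],
    ["rice"],
]

# Every distinct item becomes one column. The columns are sorted here for a
# stable order; the column order in pyECLAT's df_bin may differ.
items = sorted({item for transaction in transactions for item in transaction})

# One row per transaction: 1 if the item is present, 0 otherwise.
binary_rows = [
    [1 if item in transaction else 0 for item in items]
    for transaction in transactions
]

for index, row in enumerate(binary_rows):
    print(index, dict(zip(items, row)))
```

Summing a column of `binary_rows` gives the number of transactions that contain that item, and summing a row gives the number of items in that transaction.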

Now, we need to count how many transactions contain each item, that is, sum every column in the DataFrame:

In [6]:
# count items in each column
items_total = eclat.df_bin.astype(int).sum(axis=0)

items_total

bean      2
bread     5
butter    5
rice      2
beer      2
coffe     3
milk      2
dtype: int64

In [16]:
# count items in each row
items_per_transaction = eclat.df_bin.astype(int).sum(axis=1)

items_per_transaction

0    4
1    3
2    3
3    4
4    1
5    1
6    1
7    1
8    2
9    1
dtype: int64

In [17]:
import pandas as pd

# loading the per-item counts into a DataFrame
df = pd.DataFrame({'items': items_total.index, 'transactions': items_total.values})

# sorted copy of the DataFrame for visualization purposes
df_table = df.sort_values("transactions", ascending=False)

# top 5 most popular products/items
df_table.head(5).style.background_gradient(cmap='Blues')

Unnamed: 0,items,transactions
1,bread,5
2,butter,5
5,coffe,3
0,bean,2
3,rice,2


In [18]:
# importing required module
import plotly.express as px

# constant column so that all items share the same root node
df_table["all"] = "Tree Map" 

# creating tree map using plotly
fig = px.treemap(df_table.head(50), path=['all', "items"], values='transactions',
                  color=df_table["transactions"].head(50), hover_data=['items'],
                  color_continuous_scale='Blues',
                )
# plotting the treemap
fig.show()

# Generating association rules
To generate association rules, we need to define:

Minimum support – the minimum fraction of transactions that must contain an item combination, given as a value between 0 and 1 (for example, 0.1 for 10%)

Minimum combinations – the minimum number of items in a combination

Maximum combinations – the maximum number of items in a combination

Note: the higher the value of the maximum combinations, the longer the calculation will take.
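
To make the three parameters concrete, here is a minimal sketch in plain Python that enumerates item combinations and keeps those that meet the minimum support. It is a brute-force illustration of the vertical (transaction-id set) idea behind ECLAT, not pyECLAT's actual implementation. The `transactions` list is reconstructed from the binary table shown earlier, and `tidsets` and `supports` are illustrative names.

```python
from itertools import combinations

# Transactions reconstructed from the binary table shown earlier.
transactions = [
    ["milk", "beer", "bread", "butter"],
    ["coffe", "bread", "butter"],
    ["coffe", "bread", "butter"],
    ["milk", "coffe", "bread", "butter"],
    ["beer"],
    ["butter"],
    ["bread"],
    ["bean"],
    ["bean", "rice"],
    ["rice"],
]

min_support = 20 / 100       # combination must occur in at least 20% of transactions
min_combination = 2          # smallest combination size to test
max_combination = max(len(t) for t in transactions)  # largest size to test

# Vertical layout: item -> set of ids of the transactions that contain it.
tidsets = {}
for tid, transaction in enumerate(transactions):
    for item in transaction:
        tidsets.setdefault(item, set()).add(tid)

supports = {}
for size in range(min_combination, max_combination + 1):
    for itemset in combinations(sorted(tidsets), size):
        # Transactions containing every item = intersection of the id sets.
        common = set.intersection(*(tidsets[item] for item in itemset))
        support = len(common) / len(transactions)
        if support >= min_support:
            supports[" & ".join(itemset)] = support

for itemset, support in sorted(supports.items(), key=lambda kv: -kv[1]):
    print(itemset, support)
```

With seven distinct items, this loop tests 21 pairs, 35 triples, and 35 quadruples, which is why a larger maximum combination size increases the running time.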

In [22]:
# a combination should appear in at least 10% of the transactions
min_support = 10/100

# start from combinations of at least 2 items
min_combination = 2

# up to the largest number of items in a single transaction
max_combination = max(items_per_transaction)

rule_indices, rule_supports = eclat.fit(min_support=min_support,
                                                 min_combination=min_combination,
                                                 max_combination=max_combination,
                                                 separator=' & ',
                                                 verbose=True)

Combination 2 by 2


21it [00:00, 138.35it/s]


Combination 3 by 3


35it [00:00, 223.09it/s]


Combination 4 by 4


35it [00:00, 159.73it/s]


In [23]:
import pandas as pd
result = pd.DataFrame(rule_supports.items(),columns=['Item', 'Support'])
result.sort_values(by=['Support'], ascending=False)

Unnamed: 0,Item,Support
1,bread & butter,0.4
3,bread & coffe,0.3
6,butter & coffe,0.3
11,bread & butter & coffe,0.3
4,bread & milk,0.2
7,butter & milk,0.2
12,bread & butter & milk,0.2
0,bean & rice,0.1
13,bread & beer & milk,0.1
17,bread & butter & beer & milk,0.1

Next, let’s raise the minimum support to 20% and run the algorithm again. Combinations that occur in only one of the ten transactions (support 0.1) no longer qualify:


In [24]:
# a combination should appear in at least 20% of the transactions
min_support = 20/100

# start from combinations of at least 2 items
min_combination = 2

# up to the largest number of items in a single transaction
max_combination = max(items_per_transaction)

rule_indices, rule_supports = eclat.fit(min_support=min_support,
                                                 min_combination=min_combination,
                                                 max_combination=max_combination,
                                                 separator=' & ',
                                                 verbose=True)

Combination 2 by 2


21it [00:00, 171.53it/s]


Combination 3 by 3


35it [00:00, 213.12it/s]


Combination 4 by 4


35it [00:00, 171.02it/s]


In [25]:
import pandas as pd
result = pd.DataFrame(rule_supports.items(),columns=['Item', 'Support'])
result.sort_values(by=['Support'], ascending=False)

Unnamed: 0,Item,Support
0,bread & butter,0.4
1,bread & coffe,0.3
3,butter & coffe,0.3
5,bread & butter & coffe,0.3
2,bread & milk,0.2
4,butter & milk,0.2
6,bread & butter & milk,0.2
