# Bayesian Association Rule Mining Algorithm


<h4>1- Importing the required libraries</h4>

In [1]:
!pip install pgmpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pgmpy
  Downloading pgmpy-0.1.21-py3-none-any.whl (1.9 MB)
Installing collected packages: pgmpy
Successfully installed pgmpy-0.1.21


In [2]:
import numpy as np
import pandas as pd
import os 
from tqdm import tqdm
from mlxtend.frequent_patterns import apriori, association_rules
from pgmpy.models import BayesianNetwork
import io

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


<h4>2- Loading and exploring the data</h4>

In [4]:
data = pd.read_csv("/content/drive/MyDrive/FDS_proj/Online_Retail_csv.csv",encoding= 'unicode_escape')
data

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01-12-2010 8.26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,01-12-2010 8.26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,01-12-2010 8.26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,01-12-2010 8.26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,01-12-2010 8.26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,09-12-2011 12.50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,09-12-2011 12.50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,09-12-2011 12.50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,09-12-2011 12.50,4.15,12680.0,France


We've successfully loaded the "Online Retail" dataset into a pandas DataFrame. Let's now analyse the dataset further:

In [5]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01-12-2010 8.26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,01-12-2010 8.26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,01-12-2010 8.26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,01-12-2010 8.26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,01-12-2010 8.26,3.39,17850.0,United Kingdom


In [6]:
data.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

The `Country` column of our dataset contains 38 distinct values:

In [7]:
data.Country.unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore',
       'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
       'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
       'European Community', 'Malta', 'RSA'], dtype=object)

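<h4>3- Cleaning the data</h4>

We clean the dataset by stripping surrounding whitespace from the item descriptions, dropping rows without an invoice number and removing credit transactions (invoice numbers containing 'C'):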

In [8]:

data['Description'] = data['Description'].str.strip()
  
# Removing rows with no invoice number
data.dropna(axis = 0, subset =['InvoiceNo'], inplace = True)
data['InvoiceNo'] = data['InvoiceNo'].astype('str')
  
# Removing credit transactions
data = data[~data['InvoiceNo'].str.contains('C')]

<h4>4- Choosing a specific region for our demonstration</h4>

We've selected France to demonstrate our Bayesian network. We build a basket table with one row per French invoice and one column per item, holding the total quantity bought:

In [9]:
basket_France = (data[data['Country'] =="France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
basket_France  

InvoiceNo,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
536370,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536852,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536974,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537463,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
580986,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
581001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
581171,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
581279,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<h4>5- Converting integer values to boolean</h4>

For association rules, only whether an item appears in a transaction matters, not how many units were bought: a quantity of 1000 and a quantity of 1 count the same.

Thus, we convert every quantity of 1 or more to 1 (bought) and every other value to 0 (not bought).

In [10]:
# Defining the one-hot encoding function to make the data suitable
# for the libraries used below
def int_to_bool(x):
    # Quantities of 1 or more count as bought (1); everything else,
    # including zero and negative totals, counts as not bought (0)
    return 1 if x >= 1 else 0
  
# Encoding the dataset
basket_encoded = basket_France.applymap(int_to_bool)
basket_France = basket_encoded




<h4>6- Building Apriori association rules</h4>

We now use the Apriori algorithm to find frequent itemsets (minimum support 0.05) and derive association rules from them (lift of at least 1):

In [11]:
# Building the model
frq_items = apriori(basket_France, min_support = 0.05, use_colnames = True)
  
# Collecting the inferred rules in a dataframe
rules = association_rules(frq_items, metric ="lift", min_threshold = 1)
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])

In [12]:
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
45,(JUMBO BAG WOODLAND ANIMALS),(POSTAGE),0.076531,0.765306,0.076531,1.0,1.306667,0.017961,inf
260,"(RED TOADSTOOL LED NIGHT LIGHT, PLASTERS IN TI...",(POSTAGE),0.05102,0.765306,0.05102,1.0,1.306667,0.011974,inf
272,"(RED TOADSTOOL LED NIGHT LIGHT, PLASTERS IN TI...",(POSTAGE),0.053571,0.765306,0.053571,1.0,1.306667,0.012573,inf
302,"(SET/6 RED SPOTTY PAPER CUPS, SET/20 RED RETRO...",(SET/6 RED SPOTTY PAPER PLATES),0.102041,0.127551,0.09949,0.975,7.644,0.086474,34.897959
301,"(SET/6 RED SPOTTY PAPER PLATES, SET/20 RED RET...",(SET/6 RED SPOTTY PAPER CUPS),0.102041,0.137755,0.09949,0.975,7.077778,0.085433,34.489796


We see that we have acquired many association rules. Consider rule 302 as an example: customers who buy the red spotty paper cups together with a second item (shown truncated as `SET/20 RED RETRO...`) also buy the matching paper plates, with a confidence of 0.975 and a lift of 7.644. This makes sense, since someone buying paper cups for a party is likely to buy paper plates as well.

The table also reports metrics such as support, confidence and lift for each rule. The table below shows only the rules themselves:

In [13]:
rules[['antecedents', 'consequents',]]

Unnamed: 0,antecedents,consequents
45,(JUMBO BAG WOODLAND ANIMALS),(POSTAGE)
260,"(RED TOADSTOOL LED NIGHT LIGHT, PLASTERS IN TI...",(POSTAGE)
272,"(RED TOADSTOOL LED NIGHT LIGHT, PLASTERS IN TI...",(POSTAGE)
302,"(SET/6 RED SPOTTY PAPER CUPS, SET/20 RED RETRO...",(SET/6 RED SPOTTY PAPER PLATES)
301,"(SET/6 RED SPOTTY PAPER PLATES, SET/20 RED RET...",(SET/6 RED SPOTTY PAPER CUPS)
...,...,...
36,(POSTAGE),(JAM MAKING SET PRINTED)
27,(POSTAGE),(CIRCUS PARADE CHILDRENS EGG CUP)
97,(POSTAGE),(PARTY BUNTING)
226,(POSTAGE),"(LUNCH BAG RED RETROSPOT, LUNCH BAG WOODLAND)"
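
The metric columns of the `rules` table can be checked by hand. The sketch below recomputes confidence, lift, leverage and conviction from the three support values of rule 45 as printed in the first rules table. The helper name `rule_metrics` is hypothetical (it is not part of mlxtend), and the formulas are the standard textbook definitions of these metrics.

```python
import math

def rule_metrics(support_ab, support_a, support_b):
    """Recompute rule metrics from supports (hypothetical helper, not an mlxtend API)."""
    confidence = support_ab / support_a            # P(B | A)
    lift = confidence / support_b                  # P(B | A) / P(B)
    leverage = support_ab - support_a * support_b  # P(A, B) - P(A) * P(B)
    # Conviction is infinite when the rule never fails (confidence == 1)
    conviction = math.inf if confidence >= 1 else (1 - support_b) / (1 - confidence)
    return {"confidence": confidence, "lift": lift,
            "leverage": leverage, "conviction": conviction}

# Supports of rule 45, (JUMBO BAG WOODLAND ANIMALS) -> (POSTAGE), copied from the table
metrics = rule_metrics(support_ab=0.076531, support_a=0.076531, support_b=0.765306)
print(metrics)
```

Up to rounding, the values agree with the confidence, lift, leverage and conviction columns printed for that rule.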


**Creating a Bayesian Network for every Apriori rule**

In [14]:
# Extracting the relevant columns as lists
ant = rules['antecedents'].tolist()
con = rules['consequents'].tolist()
ant_supports = rules['antecedent support'].tolist()
con_supports = rules['consequent support'].tolist()

# Storing each rule as an [antecedent items, consequent items] pair of lists
atob = [[list(a), list(b)] for a, b in zip(ant, con)]

In [15]:
#The list that stores all the bayesian networks
networks = []


for rule in atob:
  a = rule[0]
  b = rule[1]

  edges = []
  for itema in a:
    for itemb in b:
      edges.append((itema,itemb))
      #adding an edge from itema to itemb

  #creating the network for the given rule
  #and appending it to the 'networks' array
  networks.append(BayesianNetwork(edges))



#Fitting the CPDs of every node to the one-hot encoded basket data
i = 0
for n in networks:
  n.fit(basket_France)
  print("Network " + str(i) + " is being initialised")
  i+=1


Network 0 is being initialised
Network 1 is being initialised
Network 2 is being initialised
Network 3 is being initialised
Network 4 is being initialised
Network 5 is being initialised
Network 6 is being initialised
Network 7 is being initialised
Network 8 is being initialised
Network 9 is being initialised
Network 10 is being initialised
Network 11 is being initialised
Network 12 is being initialised
Network 13 is being initialised
Network 14 is being initialised
Network 15 is being initialised
Network 16 is being initialised
Network 17 is being initialised
Network 18 is being initialised
Network 19 is being initialised
Network 20 is being initialised
Network 21 is being initialised
Network 22 is being initialised
Network 23 is being initialised
Network 24 is being initialised
Network 25 is being initialised
Network 26 is being initialised
Network 27 is being initialised
Network 28 is being initialised
Network 29 is being initialised
Network 30 is being initialised
Network 31 is bein

  tabular_cpd.values = (cpd / cpd.sum(axis=0)).reshape(tabular_cpd.cardinality)


Network 90 is being initialised
Network 91 is being initialised
Network 92 is being initialised
Network 93 is being initialised
Network 94 is being initialised
Network 95 is being initialised
Network 96 is being initialised
Network 97 is being initialised
Network 98 is being initialised
Network 99 is being initialised
Network 100 is being initialised
Network 101 is being initialised
Network 102 is being initialised
Network 103 is being initialised
Network 104 is being initialised
Network 105 is being initialised
Network 106 is being initialised
Network 107 is being initialised
Network 108 is being initialised
Network 109 is being initialised
Network 110 is being initialised
Network 111 is being initialised
Network 112 is being initialised
Network 113 is being initialised
Network 114 is being initialised
Network 115 is being initialised
Network 116 is being initialised
Network 117 is being initialised
Network 118 is being initialised
Network 119 is being initialised
Network 120 is being
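
The edge construction in cell 15 connects every antecedent item to every consequent item, so each rule becomes a two-layer directed graph. Here is a minimal standard-library sketch of that step, using a hypothetical helper `rule_to_edges` and rule 226 from the rules table:

```python
from itertools import product

def rule_to_edges(antecedents, consequents):
    """Return one directed edge per (antecedent, consequent) pair (hypothetical helper)."""
    return list(product(antecedents, consequents))

# Rule 226: (POSTAGE) -> (LUNCH BAG RED RETROSPOT, LUNCH BAG WOODLAND)
edges = rule_to_edges(["POSTAGE"], ["LUNCH BAG RED RETROSPOT", "LUNCH BAG WOODLAND"])
print(edges)
```

The result has the same form as the `edges` list passed to `BayesianNetwork` in cell 15. Antecedent items have no parents, so their CPDs are plain marginals, while each consequent item is conditioned on all antecedent items.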

**Calculating Bayesian Lift and Bayesian Confidence**
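
Read off the two cells that follow, the quantities being computed can be written as below. The notation is an assumption introduced here: $X_1, \dots, X_L$ are the item nodes of a rule's network, $\mathrm{Par}(X_i)$ are the parents of $X_i$, and $\mathrm{supp}$ is the Apriori support. The value taken from each CPD is its last entry, which corresponds to the item and all of its parents being present, provided 1 is the last state of every node.

```latex
\mathrm{BC}(A \rightarrow B)
  = \left( \frac{\prod_{i=1}^{L} P\bigl(X_i = 1 \mid \mathrm{Par}(X_i) = 1\bigr)}{\mathrm{supp}(A)} \right)^{L},
\qquad
\mathrm{BL}(A \rightarrow B)
  = \frac{\mathrm{BC}(A \rightarrow B)}{\mathrm{supp}(B)}
```

For a rule with one antecedent item and one consequent item, the product is $P(A = 1)\,P(B = 1 \mid A = 1)$ and $\mathrm{supp}(A) = P(A = 1)$, so the Bayesian confidence reduces to the square of the Apriori confidence.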

In [16]:
#This function is used to retrieve P(Xi|Par(Xi))
#from the CPD table of Xi: the last entry of the table,
#i.e. the last state of Xi given the last state of every parent
def getval(tabcpd):
  # .flat[-1] picks the last entry for any number of parents
  return np.asarray(tabcpd.values).flat[-1]

In [17]:
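bc = [0]*(len(networks)) # list to store Bayesian confidence values
bl = [0]*(len(networks)) # list to store Bayesian lift values

for i in range(len(atob)):
  bay_net = networks[i]

  # L is the number of nodes (items) in the rule's network
  L = len(bay_net.nodes())
  all_cpds = bay_net.get_cpds()

  # Product of P(Xi|Par(Xi)) over all nodes of the network
  cpdprod = 1
  for cptab in all_cpds:
    cpdprod *= getval(cptab)

  # Bayesian confidence: normalise by the antecedent support, then raise to the power L
  bcval = pow(cpdprod/ant_supports[i], L)
  bc[i] = bcval
  # Bayesian lift: Bayesian confidence divided by the consequent support
  bl[i] = bcval/con_supports[i]

print(bc)
print(bl)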

[1.0, 0.21352988429499498, 0.19296714393566475, 0.005323128895677053, 0.004225671340494538, 0.0006734002771636011, 0.0004949693065966533, 0.9298469387755101, 0.9245562130177513, 0.9216, 0.9216, 0.9184027777777779, 0.9149338374291116, 0.9111570247933886, 0.6532806695693077, 0.9070294784580498, 0.8948137326515703, 0.87890625, 0.8751300728407907, 0.8573388203017831, 0.8573388203017831, 0.8573388203017831, 0.01887893223038244, 0.016068415308874626, 0.014608586440023327, 0.8444119795471148, 0.0001650333152749078, 0.8264462809917354, 0.8264462809917354, 0.8264462809917354, 0.024627648541428897, 0.0011413049267287034, 0.803804994054697, 0.001590226555839128, 0.7971938775510204, 0.7901234567901234, 0.7901234567901234, 0.7825443786982248, 0.7825443786982248, 0.7760770975056689, 0.7744, 2.754630588772672e-08, 0.7722681359044995, 0.7715485756026297, 0.765625, 0.765625, 0.7625471136679232, 0.47678197297539565, 0.7561436672967864, 0.7541551246537397, 0.0015314562740429882, 0.0014137056176027746, 0.
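
As a sanity check of the first value printed above, the sketch below repeats the calculation for a rule with a single antecedent and a single consequent, using plain numbers instead of fitted CPDs. The function name `bayesian_scores` is hypothetical. The inputs are the supports of rule 45 from the Apriori table, and P(POSTAGE = 1 | JUMBO BAG = 1) is taken to be that rule's confidence of 1.0, which assumes the fitted CPD matches the Apriori estimate.

```python
def bayesian_scores(cpd_values, antecedent_support, consequent_support):
    """Mirror the BC/BL arithmetic of cell 17 for one rule (hypothetical helper)."""
    n_nodes = len(cpd_values)  # one CPD value per node in the network
    cpd_product = 1.0
    for value in cpd_values:
        cpd_product *= value
    bc = (cpd_product / antecedent_support) ** n_nodes
    bl = bc / consequent_support
    return bc, bl

# Rule 45: (JUMBO BAG WOODLAND ANIMALS) -> (POSTAGE)
# CPD values: P(JUMBO BAG = 1) = 0.076531 and P(POSTAGE = 1 | JUMBO BAG = 1) = 1.0
bc_rule45, bl_rule45 = bayesian_scores(
    [0.076531, 1.0], antecedent_support=0.076531, consequent_support=0.765306
)
print(bc_rule45, bl_rule45)
```

Up to rounding, the two numbers agree with the Bayesian confidence and Bayesian lift reported for this rule in the sorted tables.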

**Sorting the rules based on highest BC and BL**

In [18]:
#First sorted by BC and then by BL:

answer = []
for i in range(len(atob)):
  answer.append([bc[i],bl[i],atob[i][0], atob[i][1]])

answer.sort(reverse = True)
ans_table = pd.DataFrame(answer,columns=["Bayesian Confidence","Bayesian Lift","A->","B"])
ans_table

Unnamed: 0,Bayesian Confidence,Bayesian Lift,A->,B
0,1.000000e+00,1.306667e+00,[JUMBO BAG WOODLAND ANIMALS],[POSTAGE]
1,9.298469e-01,1.215000e+00,[RED RETROSPOT PICNIC BAG],[POSTAGE]
2,9.245562e-01,1.208087e+00,[SET OF 9 BLACK SKULL BALLOONS],[POSTAGE]
3,9.216000e-01,6.690133e+00,[SET/6 RED SPOTTY PAPER PLATES],[SET/6 RED SPOTTY PAPER CUPS]
4,9.216000e-01,1.204224e+00,[PACK OF 6 SKULL PAPER CUPS],[POSTAGE]
...,...,...,...,...
343,1.365410e-07,1.784135e-07,"[SET/6 RED SPOTTY PAPER PLATES, SET/6 RED SPOT...",[POSTAGE]
344,2.754631e-08,3.599384e-08,"[ALARM CLOCK BAKELIKE PINK, ALARM CLOCK BAKELI...",[POSTAGE]
345,7.424753e-10,1.077964e-08,[POSTAGE],"[PLASTERS IN TIN WOODLAND ANIMALS, PLASTERS IN..."
346,8.962873e-11,9.008837e-10,[POSTAGE],"[SET/6 RED SPOTTY PAPER PLATES, SET/6 RED SPOT..."


In [19]:
answer2 = []
for i in range(len(atob)):
  answer2.append((bl[i],bc[i],atob[i][0],atob[i][1]))

answer2.sort(reverse = True)

ans_table2 = pd.DataFrame(answer2,columns=["Bayesian Lift","Bayesian Confidence","A  ->","B"])
ans_table2

Unnamed: 0,Bayesian Lift,Bayesian Confidence,A ->,B
0,1.295868e+01,8.264463e-01,[PACK OF 6 SKULL PAPER PLATES],[PACK OF 6 SKULL PAPER CUPS]
1,1.200274e+01,8.573388e-01,[CHILDRENS CUTLERY SPACEBOY],[CHILDRENS CUTLERY DOLLY GIRL]
2,1.157407e+01,7.971939e-01,[CHILDRENS CUTLERY DOLLY GIRL],[CHILDRENS CUTLERY SPACEBOY]
3,1.140364e+01,6.400000e-01,[PACK OF 6 SKULL PAPER CUPS],[PACK OF 6 SKULL PAPER PLATES]
4,7.241398e+00,7.019722e-01,[ALARM CLOCK BAKELIKE RED],[ALARM CLOCK BAKELIKE GREEN]
...,...,...,...,...
343,1.784135e-07,1.365410e-07,"[SET/6 RED SPOTTY PAPER PLATES, SET/6 RED SPOT...",[POSTAGE]
344,3.599384e-08,2.754631e-08,"[ALARM CLOCK BAKELIKE PINK, ALARM CLOCK BAKELI...",[POSTAGE]
345,1.077964e-08,7.424753e-10,[POSTAGE],"[PLASTERS IN TIN WOODLAND ANIMALS, PLASTERS IN..."
346,9.008837e-10,8.962873e-11,[POSTAGE],"[SET/6 RED SPOTTY PAPER PLATES, SET/6 RED SPOT..."
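
Both sorting cells rely on Python's element-wise comparison of lists and tuples, so ties in the first two scores fall through to comparing the item lists. Here is a sketch of the same ordering with an explicit sort key, using three rows copied from the first table (the names `rows`, `by_bc` and `by_bl` are illustrative):

```python
# Each row: (Bayesian confidence, Bayesian lift, antecedent items, consequent items)
rows = [
    (0.9216, 1.204224, ["PACK OF 6 SKULL PAPER CUPS"], ["POSTAGE"]),
    (1.0, 1.306667, ["JUMBO BAG WOODLAND ANIMALS"], ["POSTAGE"]),
    (0.9216, 6.690133, ["SET/6 RED SPOTTY PAPER PLATES"], ["SET/6 RED SPOTTY PAPER CUPS"]),
]

# Highest Bayesian confidence first, ties broken by Bayesian lift
by_bc = sorted(rows, key=lambda r: (r[0], r[1]), reverse=True)
# Highest Bayesian lift first, ties broken by Bayesian confidence
by_bl = sorted(rows, key=lambda r: (r[1], r[0]), reverse=True)

for row in by_bc:
    print(row)
```

With an explicit key, the item lists never take part in the comparison, which keeps the ordering independent of item names.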


**Observation**

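We see that many of the rules with high confidence, lift and support from the Apriori algorithm also have high Bayesian confidence and Bayesian lift. This holds mainly for rules with a single antecedent and a single consequent; rules with several items tend to score lower. For example, the second and third Apriori rules have a confidence of 1.0 but Bayesian confidences of only about 0.21 and 0.19.

The top rule in the second table (sorted by Bayesian lift) says that buying skull paper plates implies buying skull paper cups. This matches what we would expect in real life, and it emerges purely from the data through association rule mining. It suggests that Bayesian confidence and Bayesian lift, computed from the rule networks, give sensible rankings on this dataset.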

**Thank you!**