## Groceries Market Basket Dataset

The dataset contains supermarket transactions, where each row represents a single transaction and each column represents a specific item. A value of True indicates that the item was purchased in that transaction, while False indicates it was not purchased.

First, we import the libraries:

*   pandas - for data manipulation.
*   Matplotlib - for data visualization.
*   seaborn - for data visualization.
*   mlxtend - to apply the Apriori algorithm.

In [2]:
# importing the libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from matplotlib import pyplot as plt
import seaborn as sns

We load the dataset into a Pandas DataFrame and view the first few rows to inspect the structure.

In [6]:
# loading the dataset
df = pd.read_csv('input/sol1.csv', index_col=0)
df.head()

Unnamed: 0,Apple,Bread,Butter,Cheese,Corn,Dill,Eggs,Ice cream,Kidney Beans,Milk,Nutmeg,Onion,Sugar,Unicorn,Yogurt,chocolate
0,False,True,False,False,True,True,False,True,False,False,False,False,True,False,True,True
1,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False
2,True,False,True,False,False,True,False,True,False,True,False,False,False,False,True,True
3,False,False,True,True,False,True,False,False,False,True,True,True,False,False,False,False
4,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False


#### **Q1.** How many transactions and items are there in the data set?


We check the number of transactions and the number of items in the dataset. Rows correspond to transactions, and columns correspond to items.

In [7]:
# finding the dimensions of the dataframe
df.shape

(999, 16)

As we can see, there are 999 rows, meaning 999 transactions, and 16 columns, meaning 16 items in the dataset.

To prepare the data for the following questions, we apply the Apriori algorithm on the dataframe and set the minimum support parameter to 2%.

In [8]:
# applying the apriori algorithm
frequent_itemsets = apriori(df, min_support=0.02, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets

Unnamed: 0,support,itemsets,length
0,0.383383,(Apple),1
1,0.384384,(Bread),1
2,0.420420,(Butter),1
3,0.404404,(Cheese),1
4,0.407407,(Corn),1
...,...,...,...
5540,0.020020,"(Ice cream, Nutmeg, Cheese, Kidney Beans, Onio...",6
5541,0.021021,"(Unicorn, Cheese, chocolate, Kidney Beans, Oni...",6
5542,0.020020,"(Sugar, Ice cream, Nutmeg, Cheese, chocolate, ...",6
5543,0.020020,"(Yogurt, Nutmeg, Corn, Unicorn, chocolate, Onion)",6
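
To make the level-wise idea behind Apriori concrete, here is a minimal pure-Python sketch. It is not mlxtend's implementation. It uses only the first five transactions shown by `df.head()` above, rewritten as sets of item names. The helper names `support` and `apriori_sketch` are hypothetical, and the 40% threshold is an assumption chosen only so that this tiny sample yields some frequent itemsets.

```python
from itertools import combinations

# First five transactions from df.head(), rewritten as sets of item names.
transactions = [
    {"Bread", "Corn", "Dill", "Ice cream", "Sugar", "Yogurt", "chocolate"},
    {"Milk"},
    {"Apple", "Butter", "Dill", "Ice cream", "Milk", "Yogurt", "chocolate"},
    {"Butter", "Cheese", "Dill", "Milk", "Nutmeg", "Onion"},
    {"Apple", "Bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def apriori_sketch(transactions, min_support):
    """Level-wise search: grow candidates only from frequent itemsets."""
    items = sorted(set().union(*transactions))
    level = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while level:
        # Keep the candidates that meet the support threshold.
        kept = {}
        for candidate in level:
            s = support(candidate, transactions)
            if s >= min_support:
                kept[candidate] = s
        frequent.update(kept)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        level = [
            c for c in candidates
            if all(frozenset(s) in kept for s in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical threshold: an itemset must appear in at least 2 of 5 rows.
result = apriori_sketch(transactions, min_support=0.4)
```

The pruning step relies on the fact that every subset of a frequent itemset must itself be frequent, so a candidate with any infrequent subset can be discarded without counting it.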


#### **Q4.** Find top selling items with minimum support of 5%.

To answer Question 4, we first sort the dataframe by support in descending order, using the pandas sort_values() function with the by parameter set to support and the ascending parameter set to False.

In [9]:
# sorting the dataframe
frequent_itemsets = frequent_itemsets.sort_values(by='support', ascending=False)

Next, we filter the dataframe to keep itemsets of length 1 with support of at least 5%. We then slice the sorted, filtered dataframe to show only the top 5 entries.

In [12]:
# finding top 5 items with minimum support of 5%
frequent_itemsets[ (frequent_itemsets['length'] == 1) &
                   (frequent_itemsets['support'] >= 0.05) ][0:5]

Unnamed: 0,support,itemsets,length
15,0.421421,(chocolate),1
2,0.42042,(Butter),1
14,0.42042,(Yogurt),1
7,0.41041,(Ice cream),1
12,0.409409,(Sugar),1


As we can see, chocolate, butter, yogurt, ice cream and sugar are the top 5 selling items, with support of about 42.1%, 42.0%, 42.0%, 41.0%, and 40.9% respectively.
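
The support of a single item is simply the share of transactions that contain it (the mean of that item's boolean column). As a rough cross-check of the idea, the sketch below ranks items by support using only the first five transactions shown by `df.head()`. The variable names and the 40% cut-off are illustrative assumptions, so the ranking differs from the full-dataset result above.

```python
from collections import Counter

# First five transactions from df.head(), rewritten as sets of item names.
transactions = [
    {"Bread", "Corn", "Dill", "Ice cream", "Sugar", "Yogurt", "chocolate"},
    {"Milk"},
    {"Apple", "Butter", "Dill", "Ice cream", "Milk", "Yogurt", "chocolate"},
    {"Butter", "Cheese", "Dill", "Milk", "Nutmeg", "Onion"},
    {"Apple", "Bread"},
]

n = len(transactions)

# Count how many transactions contain each item.
counts = Counter(item for t in transactions for item in t)

# Sort by support (descending), breaking ties by item name.
ranked = sorted(
    ((item, count / n) for item, count in counts.items()),
    key=lambda pair: (-pair[1], pair[0]),
)

# Keep items that meet the (hypothetical) threshold, then take the top 5.
top_items = [(item, s) for item, s in ranked if s >= 0.4][:5]
print(top_items)
```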

#### **Q5.** Find all frequent itemsets with minimum support of 20%.

To answer Question 5, we filter the dataframe to find itemsets with length greater than 1 and support of at least 20%.

In [None]:
# finding itemsets having length more than 1 and minimum support of 20%
frequent_itemsets[(frequent_itemsets['length'] > 1) & 
                  (frequent_itemsets['support'] >= 0.20)]

Unnamed: 0,support,itemsets,length
120,0.211211,"(Milk, chocolate)",2
49,0.207207,"(Ice cream, Butter)",2
107,0.202202,"(chocolate, Ice cream)",2
50,0.202202,"(Butter, Kidney Beans)",2
57,0.202202,"(chocolate, Butter)",2
62,0.2002,"(Kidney Beans, Cheese)",2


As we can see, only 6 itemsets of length greater than 1 have support of at least 20%. All of them are pairs, with support between 20% and about 21%.

#### **Q6.**  Find all frequent itemsets of length 2 with minimum support of 19%.

To answer Question 6, we filter the dataframe to find itemsets with length exactly 2 and support of at least 19%.

In [None]:
# finding itemsets having length 2 and minimum support of 19%
frequent_itemsets[(frequent_itemsets['length'] == 2) & 
                  (frequent_itemsets['support'] >= 0.19)]

Unnamed: 0,support,itemsets,length
120,0.211211,"(Milk, chocolate)",2
49,0.207207,"(Ice cream, Butter)",2
107,0.202202,"(chocolate, Ice cream)",2
50,0.202202,"(Butter, Kidney Beans)",2
57,0.202202,"(chocolate, Butter)",2
62,0.2002,"(Kidney Beans, Cheese)",2
90,0.199199,"(chocolate, Dill)",2
108,0.199199,"(Milk, Kidney Beans)",2
52,0.198198,"(Butter, Nutmeg)",2
51,0.198198,"(Milk, Butter)",2


As we can see, the output above lists 10 itemsets of length 2 with support of at least 19%. Their support ranges from about 21.1% down to about 19.8%, with milk and chocolate having the highest support, and the pairs butter and nutmeg and milk and butter having the lowest.

#### **Q7.** Find the top 10 association rules with minimum support of 20%, sorted by confidence in descending order.


To answer Question 7, we first generate the association rules using the association_rules() function from the mlxtend library, setting the metric parameter to support and the min_threshold parameter to 0.20 (20%).

In [29]:
# generating association rules with minimum support of 20%
rules = association_rules(frequent_itemsets, metric='support', min_threshold=0.20)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(Milk),(chocolate),0.405405,0.421421,0.211211,0.520988,1.236263,1.0,0.040365,1.207857,0.321413,0.343089,0.172088,0.511088
1,(chocolate),(Milk),0.421421,0.405405,0.211211,0.501188,1.236263,1.0,0.040365,1.192021,0.33031,0.343089,0.161088,0.511088
2,(Ice cream),(Butter),0.41041,0.42042,0.207207,0.504878,1.200889,1.0,0.034662,1.170579,0.283728,0.332263,0.145722,0.498868
3,(Butter),(Ice cream),0.42042,0.41041,0.207207,0.492857,1.200889,1.0,0.034662,1.162571,0.288629,0.332263,0.139837,0.498868
4,(chocolate),(Ice cream),0.421421,0.41041,0.202202,0.47981,1.169098,1.0,0.029246,1.133412,0.249991,0.321145,0.117708,0.486246
5,(Ice cream),(chocolate),0.41041,0.421421,0.202202,0.492683,1.169098,1.0,0.029246,1.140467,0.245323,0.321145,0.123167,0.486246
6,(Butter),(Kidney Beans),0.42042,0.408408,0.202202,0.480952,1.177626,1.0,0.030499,1.139764,0.260247,0.322684,0.122625,0.488025
7,(Kidney Beans),(Butter),0.408408,0.42042,0.202202,0.495098,1.177626,1.0,0.030499,1.147905,0.254963,0.322684,0.128848,0.488025
8,(chocolate),(Butter),0.421421,0.42042,0.202202,0.47981,1.141262,1.0,0.025028,1.114169,0.213933,0.316119,0.10247,0.480381
9,(Butter),(chocolate),0.42042,0.421421,0.202202,0.480952,1.141262,1.0,0.025028,1.114693,0.213564,0.316119,0.102892,0.480381


Then we sort the generated association rules by confidence in descending order, using the pandas sort_values() function with the by parameter set to confidence and the ascending parameter set to False. Finally, we slice the sorted dataframe to show the top 10 rules.

In [30]:
# sorting the rules in the descending order by confidence
rules.sort_values(by='confidence', ascending=False)[0:10]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(Milk),(chocolate),0.405405,0.421421,0.211211,0.520988,1.236263,1.0,0.040365,1.207857,0.321413,0.343089,0.172088,0.511088
2,(Ice cream),(Butter),0.41041,0.42042,0.207207,0.504878,1.200889,1.0,0.034662,1.170579,0.283728,0.332263,0.145722,0.498868
1,(chocolate),(Milk),0.421421,0.405405,0.211211,0.501188,1.236263,1.0,0.040365,1.192021,0.33031,0.343089,0.161088,0.511088
7,(Kidney Beans),(Butter),0.408408,0.42042,0.202202,0.495098,1.177626,1.0,0.030499,1.147905,0.254963,0.322684,0.128848,0.488025
11,(Cheese),(Kidney Beans),0.404404,0.408408,0.2002,0.49505,1.212143,1.0,0.035038,1.171583,0.293849,0.326797,0.146454,0.492623
3,(Butter),(Ice cream),0.42042,0.41041,0.207207,0.492857,1.200889,1.0,0.034662,1.162571,0.288629,0.332263,0.139837,0.498868
5,(Ice cream),(chocolate),0.41041,0.421421,0.202202,0.492683,1.169098,1.0,0.029246,1.140467,0.245323,0.321145,0.123167,0.486246
10,(Kidney Beans),(Cheese),0.408408,0.404404,0.2002,0.490196,1.212143,1.0,0.035038,1.168284,0.295838,0.326797,0.144043,0.492623
6,(Butter),(Kidney Beans),0.42042,0.408408,0.202202,0.480952,1.177626,1.0,0.030499,1.139764,0.260247,0.322684,0.122625,0.488025
9,(Butter),(chocolate),0.42042,0.421421,0.202202,0.480952,1.141262,1.0,0.025028,1.114693,0.213564,0.316119,0.102892,0.480381
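
The confidence, lift and leverage columns can be reproduced by hand from the support values. The sketch below does this for the rule (Milk) -> (chocolate). The transaction counts 405, 421 and 211 are not stated in the notebook; they are an assumption obtained by multiplying the reported supports by the 999 transactions and rounding. The variable names are illustrative.

```python
# Assumed counts: reported support * 999 transactions, rounded.
n_transactions = 999   # from df.shape
n_milk = 405           # antecedent support 0.405405
n_chocolate = 421      # consequent support 0.421421
n_both = 211           # rule support 0.211211

support_milk = n_milk / n_transactions
support_chocolate = n_chocolate / n_transactions
support_both = n_both / n_transactions

# confidence(A -> C) = support(A and C) / support(A)
confidence = support_both / support_milk

# lift(A -> C) = confidence(A -> C) / support(C)
lift = confidence / support_chocolate

# leverage(A -> C) = support(A and C) - support(A) * support(C)
leverage = support_both - support_milk * support_chocolate

print(round(confidence, 6), round(lift, 6), round(leverage, 6))
```

Under these assumed counts, the values agree with row 0 of the rules table to six decimal places.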


#### **Q8.** Find association rules with minimum support of 20% and lift of more than 1.0.


To answer Question 8, we filter the rules to keep those with support of at least 20% and lift greater than 1.

In [None]:
# finding association rules with minimum support of 20% and having lift more than 1
rules[(rules['support'] >= 0.20) &
      (rules['lift'] > 1.0)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(Milk),(chocolate),0.405405,0.421421,0.211211,0.520988,1.236263,1.0,0.040365,1.207857,0.321413,0.343089,0.172088,0.511088
1,(chocolate),(Milk),0.421421,0.405405,0.211211,0.501188,1.236263,1.0,0.040365,1.192021,0.33031,0.343089,0.161088,0.511088
2,(Ice cream),(Butter),0.41041,0.42042,0.207207,0.504878,1.200889,1.0,0.034662,1.170579,0.283728,0.332263,0.145722,0.498868
3,(Butter),(Ice cream),0.42042,0.41041,0.207207,0.492857,1.200889,1.0,0.034662,1.162571,0.288629,0.332263,0.139837,0.498868
4,(chocolate),(Ice cream),0.421421,0.41041,0.202202,0.47981,1.169098,1.0,0.029246,1.133412,0.249991,0.321145,0.117708,0.486246
5,(Ice cream),(chocolate),0.41041,0.421421,0.202202,0.492683,1.169098,1.0,0.029246,1.140467,0.245323,0.321145,0.123167,0.486246
6,(Butter),(Kidney Beans),0.42042,0.408408,0.202202,0.480952,1.177626,1.0,0.030499,1.139764,0.260247,0.322684,0.122625,0.488025
7,(Kidney Beans),(Butter),0.408408,0.42042,0.202202,0.495098,1.177626,1.0,0.030499,1.147905,0.254963,0.322684,0.128848,0.488025
8,(chocolate),(Butter),0.421421,0.42042,0.202202,0.47981,1.141262,1.0,0.025028,1.114169,0.213933,0.316119,0.10247,0.480381
9,(Butter),(chocolate),0.42042,0.421421,0.202202,0.480952,1.141262,1.0,0.025028,1.114693,0.213564,0.316119,0.102892,0.480381
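
A lift above 1 means the two sides of a rule occur together more often than they would if they were independent. The sketch below illustrates this for milk and chocolate. As an assumption, the counts 405, 421 and 211 are reconstructed from the reported supports multiplied by 999 transactions; the variable names are illustrative.

```python
# Assumed counts: reported support * 999 transactions, rounded.
n_transactions = 999
n_milk = 405
n_chocolate = 421
n_both = 211

# Co-occurrences expected if milk and chocolate were independent.
expected_both = n_milk * n_chocolate / n_transactions

# Lift compares observed with expected co-occurrence.
lift = n_both / expected_both

# Lift is symmetric, but confidence depends on the rule direction.
confidence_milk_to_chocolate = n_both / n_milk
confidence_chocolate_to_milk = n_both / n_chocolate

print(round(expected_both, 1), round(lift, 6))
print(round(confidence_milk_to_chocolate, 6),
      round(confidence_chocolate_to_milk, 6))
```

This is why each pair of items appears twice in the table with the same lift but different confidence values.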
