### OBJECTIVE: Identify frequent itemsets using the Apriori algorithm.

The Apriori algorithm is a technique from Association Rule Learning that identifies frequent itemsets in a dataset. Large retailers have applied it widely to transaction datasets to find items that customers frequently buy together.
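To make the idea of a frequent-itemset search concrete, here is a minimal sketch of the level-wise Apriori procedure in plain Python. It is an illustration only, not the mlxtend implementation used later; the function name `apriori_sketch` and the toy baskets are assumptions made for this example.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset (illustrative only)."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: a candidate survives only if all its (k-1)-subsets are frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical toy baskets, not the dataset used in this notebook
toy = [["milk", "butter", "yogurt"], ["milk", "butter"], ["milk"], ["yogurt"]]
result = apriori_sketch(toy, min_support=0.5)
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before their support is ever counted.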

The Apriori workflow uses three metrics to find the best association rules in a dataset:
Support - The fraction of transactions in which a particular item or combination of items occurs.
Confidence - How likely a customer is to buy commodity x given that they have bought commodity y. It is the support of x and y together divided by the support of y.
Lift - A metric for the strength of association within a rule. It is obtained by dividing the rule's confidence by the support of the consequent. A lift above 1 means the items occur together more often than would be expected if they were independent.
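Written as formulas, the three metrics look as follows. The notation is an assumption introduced here for illustration: $T$ is the set of transactions, $A$ is the antecedent itemset and $C$ is the consequent itemset of a rule $A \to C$.

```latex
\operatorname{supp}(A) = \frac{|\{\, t \in T : A \subseteq t \,\}|}{|T|}
\qquad
\operatorname{conf}(A \to C) = \frac{\operatorname{supp}(A \cup C)}{\operatorname{supp}(A)}
\qquad
\operatorname{lift}(A \to C) = \frac{\operatorname{conf}(A \to C)}{\operatorname{supp}(C)}
  = \frac{\operatorname{supp}(A \cup C)}{\operatorname{supp}(A)\,\operatorname{supp}(C)}
```

Confidence depends on the direction of the rule, while lift is symmetric in $A$ and $C$.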

In [1]:
#Importing necessary libraries
from mlxtend.frequent_patterns import apriori,association_rules
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
import time

In [2]:
#Defining the dataset as a list of transactions
data = [['citrus fruit', 'semi-finished bread', 'margarine', 'ready soups'],
       ['tropical fruit', 'yogurt', 'coffee'], 
       ['whole milk'], 
       ['pip fruit', 'yogurt', 'cream cheese', 'meat spreads'], 
       ['other vegetables', 'whole milk', 'condensed milk', 'long life bakery product'],
       ['whole milk', 'butter', 'yogurt', 'rice', 'abrasive cleaner'],
       ['rolls/buns'], 
       ['other vegetables', 'UHT-milk', 'rolls/buns', 'bottled beer', 'liquor (appetizer)'],
       ['potted plants'], 
       ['whole milk', 'cereals'], 
       ['tropical fruit', 'other vegetables', 'white bread', 'bottled water', 'chocolate'],
       ['citrus fruit', 'tropical fruit', 'whole milk', 'butter', 'curd', 'yogurt', 'flour', 'bottled water', 'dishes'],
       ['beef'], ['frankfurter', 'rolls/buns', 'soda'], 
       ['chicken', 'tropical fruit'], 
       ['butter', 'sugar', 'fruit/vegetable juice', 'newspapers'], 
       ['fruit/vegetable juice'],
       ['packaged fruit/vegetables'], 
       ['chocolate'], 
       ['specialty bar'], 
       ['other vegetables'], 
       ['butter milk', 'pastry'],
       ['whole milk'], 
       ['tropical fruit', 'cream cheese', 'processed cheese', 'detergent', 'newspapers']]
data

[['citrus fruit', 'semi-finished bread', 'margarine', 'ready soups'],
 ['tropical fruit', 'yogurt', 'coffee'],
 ['whole milk'],
 ['pip fruit', 'yogurt', 'cream cheese', 'meat spreads'],
 ['other vegetables',
  'whole milk',
  'condensed milk',
  'long life bakery product'],
 ['whole milk', 'butter', 'yogurt', 'rice', 'abrasive cleaner'],
 ['rolls/buns'],
 ['other vegetables',
  'UHT-milk',
  'rolls/buns',
  'bottled beer',
  'liquor (appetizer)'],
 ['potted plants'],
 ['whole milk', 'cereals'],
 ['tropical fruit',
  'other vegetables',
  'white bread',
  'bottled water',
  'chocolate'],
 ['citrus fruit',
  'tropical fruit',
  'whole milk',
  'butter',
  'curd',
  'yogurt',
  'flour',
  'bottled water',
  'dishes'],
 ['beef'],
 ['frankfurter', 'rolls/buns', 'soda'],
 ['chicken', 'tropical fruit'],
 ['butter', 'sugar', 'fruit/vegetable juice', 'newspapers'],
 ['fruit/vegetable juice'],
 ['packaged fruit/vegetables'],
 ['chocolate'],
 ['specialty bar'],
 ['other vegetables'],
 ['butter mi

In [3]:
#Converting the dataset into pandas dataframe
data1 = pd.DataFrame(data)
data1

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,citrus fruit,semi-finished bread,margarine,ready soups,,,,,
1,tropical fruit,yogurt,coffee,,,,,,
2,whole milk,,,,,,,,
3,pip fruit,yogurt,cream cheese,meat spreads,,,,,
4,other vegetables,whole milk,condensed milk,long life bakery product,,,,,
5,whole milk,butter,yogurt,rice,abrasive cleaner,,,,
6,rolls/buns,,,,,,,,
7,other vegetables,UHT-milk,rolls/buns,bottled beer,liquor (appetizer),,,,
8,potted plants,,,,,,,,
9,whole milk,cereals,,,,,,,


In [5]:
#Counting how often each item appears first in a transaction (column 0 only, so this is not a full popularity count)
popular=data1[0].value_counts()
popular

whole milk                   4
tropical fruit               3
other vegetables             3
citrus fruit                 2
pip fruit                    1
rolls/buns                   1
potted plants                1
beef                         1
frankfurter                  1
chicken                      1
butter                       1
fruit/vegetable juice        1
packaged fruit/vegetables    1
chocolate                    1
specialty bar                1
butter milk                  1
Name: 0, dtype: int64

In [6]:
#Initializing the TransactionEncoder and one-hot encoding the transactions
te = TransactionEncoder()
te_ary = te.fit(data).transform(data)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

Unnamed: 0,UHT-milk,abrasive cleaner,beef,bottled beer,bottled water,butter,butter milk,cereals,chicken,chocolate,...,rice,rolls/buns,semi-finished bread,soda,specialty bar,sugar,tropical fruit,white bread,whole milk,yogurt
0,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,True
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
5,False,True,False,False,False,True,False,False,False,False,...,True,False,False,False,False,False,False,False,True,True
6,False,False,False,False,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
7,True,False,False,True,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,True,False
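The encoding step above turns each transaction into a row of True/False flags, one column per distinct item. A minimal pure-Python sketch of the same idea is shown below; `one_hot_encode` is a hypothetical helper written for this illustration, not part of mlxtend, and the three sample baskets are made up.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted item names and one boolean row per transaction."""
    # One column per distinct item, in sorted order
    columns = sorted({item for t in transactions for item in t})
    # A cell is True when the transaction contains that column's item
    rows = [[item in set(t) for item in columns] for t in transactions]
    return columns, rows

# Hypothetical sample baskets for illustration
columns, rows = one_hot_encode([
    ["whole milk", "cereals"],
    ["whole milk"],
    ["UHT-milk", "cereals"],
])
```

This boolean layout is what `apriori` expects as input: the support of an itemset is the share of rows in which all of its columns are True.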


In [7]:
#Picking frequent itemsets
apriori(df, min_support=0.05)

Unnamed: 0,support,itemsets
0,0.083333,(4)
1,0.125,(5)
2,0.083333,(9)
3,0.083333,(10)
4,0.083333,(13)
5,0.083333,(19)
6,0.083333,(24)
7,0.166667,(25)
8,0.125,(33)
9,0.208333,(38)


In [8]:
#Picking frequent itemsets now with names of items
apriori(df, min_support=0.05, use_colnames=True)

Unnamed: 0,support,itemsets
0,0.083333,(bottled water)
1,0.125,(butter)
2,0.083333,(chocolate)
3,0.083333,(citrus fruit)
4,0.083333,(cream cheese)
5,0.083333,(fruit/vegetable juice)
6,0.083333,(newspapers)
7,0.166667,(other vegetables)
8,0.125,(rolls/buns)
9,0.208333,(tropical fruit)


Next, we create the frequent itemsets via apriori and add a new column that stores the length of each itemset.

In [9]:
frequent_itemsets = apriori(df, min_support=0.05, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets

Unnamed: 0,support,itemsets,length
0,0.083333,(bottled water),1
1,0.125,(butter),1
2,0.083333,(chocolate),1
3,0.083333,(citrus fruit),1
4,0.083333,(cream cheese),1
5,0.083333,(fruit/vegetable juice),1
6,0.083333,(newspapers),1
7,0.166667,(other vegetables),1
8,0.125,(rolls/buns),1
9,0.208333,(tropical fruit),1


In [10]:
#Filtering the itemsets by specific criteria
frequent_itemsets[(frequent_itemsets['length'] == 3) &
                   (frequent_itemsets['support'] > 0.05)]

Unnamed: 0,support,itemsets,length
17,0.083333,"(butter, yogurt, whole milk)",3


Here, we see that the only itemset that meets our criteria (a length of 3 and a support greater than the minimum support of 0.05) is butter, whole milk and yogurt.

In [11]:
#Looking up a specific itemset to find its length and support
frequent_itemsets[frequent_itemsets['itemsets'] == {'yogurt', 'whole milk'} ]

Unnamed: 0,support,itemsets,length
16,0.083333,"(yogurt, whole milk)",2


Here, we see that for the itemset yogurt and whole milk, the length is 2 and the support is 0.083333, which is greater than the minimum support of 0.05.
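The lookup above works because mlxtend stores each itemset as a `frozenset`, and a `frozenset` compares equal to a plain `set` with the same elements regardless of order. A small sketch of that behaviour in plain Python (the variable names are illustrative):

```python
# A frozenset like the ones stored in the 'itemsets' column
itemset = frozenset({"yogurt", "whole milk"})

# Equality ignores order and works against an ordinary set
same_items = itemset == {"whole milk", "yogurt"}

# Subset tests are handy for finding every itemset that contains an item
contains_yogurt = {"yogurt"} <= itemset

# A tuple with the same items is a different type and does not compare equal
equals_tuple = itemset == ("yogurt", "whole milk")
```

Comparing against a list or tuple would therefore return no rows, which is why the query uses a set literal.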

**MINING ASSOCIATION RULES**

Association rules are simply if-then statements. The IF component of an association rule is known as the antecedent, and the THEN component is known as the consequent. The antecedent and the consequent are disjoint; they have no items in common.
So, here we generate rules with their antecedents and consequents.
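To illustrate how one frequent itemset gives rise to several candidate rules, the sketch below enumerates every way of splitting an itemset into a non-empty antecedent and a non-empty consequent. The helper `candidate_rules` is hypothetical and only mirrors the idea; it is not how mlxtend is implemented internally.

```python
from itertools import combinations

def candidate_rules(itemset):
    """List every (antecedent, consequent) split of an itemset into two non-empty parts."""
    items = sorted(itemset)
    splits = []
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            # The consequent is whatever the antecedent leaves out, so the two are disjoint
            consequent = [i for i in items if i not in antecedent]
            splits.append((frozenset(antecedent), frozenset(consequent)))
    return splits

splits = candidate_rules({"butter", "whole milk", "yogurt"})
```

An itemset with k items yields 2^k - 2 candidate rules, so the three-item set above produces six; each candidate is then kept or dropped according to the chosen metric and threshold.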

In [13]:
#We use "lift" as the metric to judge whether antecedents and consequents are dependent or not
#(a lift above 1 suggests a positive association, so a threshold of 0.05 keeps nearly every rule)
rules = association_rules(frequent_itemsets , metric="lift", min_threshold=0.05)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(tropical fruit),(bottled water),0.208333,0.083333,0.083333,0.4,4.8,0.065972,1.527778
1,(bottled water),(tropical fruit),0.083333,0.208333,0.083333,1.0,4.8,0.065972,inf
2,(butter),(whole milk),0.125,0.25,0.083333,0.666667,2.666667,0.052083,2.25
3,(whole milk),(butter),0.25,0.125,0.083333,0.333333,2.666667,0.052083,1.3125
4,(yogurt),(butter),0.166667,0.125,0.083333,0.5,4.0,0.0625,1.75
5,(butter),(yogurt),0.125,0.166667,0.083333,0.666667,4.0,0.0625,2.5
6,(tropical fruit),(yogurt),0.208333,0.166667,0.083333,0.4,2.4,0.048611,1.388889
7,(yogurt),(tropical fruit),0.166667,0.208333,0.083333,0.5,2.4,0.048611,1.583333
8,(yogurt),(whole milk),0.166667,0.25,0.083333,0.5,2.0,0.041667,1.5
9,(whole milk),(yogurt),0.25,0.166667,0.083333,0.333333,2.0,0.041667,1.25
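The leverage and conviction columns are not defined earlier in this notebook. The sketch below reproduces row 0 (tropical fruit → bottled water) using the usual definitions of these two metrics, which are stated here as an assumption rather than taken from the notebook. The counts 5, 2 and 2 out of 24 transactions are derived from the supports in the table.

```python
from fractions import Fraction

n = 24                       # number of transactions in the dataset
supp_a = Fraction(5, n)      # support of the antecedent (tropical fruit), 0.208333
supp_c = Fraction(2, n)      # support of the consequent (bottled water), 0.083333
supp_ac = Fraction(2, n)     # support of both together, 0.083333

confidence = supp_ac / supp_a
lift = confidence / supp_c
# Leverage: observed joint support minus the support expected under independence
leverage = supp_ac - supp_a * supp_c
# Conviction: undefined (reported as inf) when the rule never fails, i.e. confidence is 1
conviction = (1 - supp_c) / (1 - confidence) if confidence != 1 else float("inf")

print(float(confidence), float(lift), round(float(leverage), 6), round(float(conviction), 6))
```

Row 1 of the table reports a conviction of `inf` for exactly this reason: bottled water → tropical fruit has a confidence of 1.0.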


The output above lists each rule together with its antecedent support, consequent support, joint support, confidence, lift, leverage and conviction. To get more insight from the data, we sort the rules by their confidence value.

In [15]:
#Sorting values based on confidence
rules.sort_values("confidence",ascending=True)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
3,(whole milk),(butter),0.25,0.125,0.083333,0.333333,2.666667,0.052083,1.3125
9,(whole milk),(yogurt),0.25,0.166667,0.083333,0.333333,2.0,0.041667,1.25
15,(whole milk),"(yogurt, butter)",0.25,0.083333,0.083333,0.333333,4.0,0.0625,1.375
0,(tropical fruit),(bottled water),0.208333,0.083333,0.083333,0.4,4.8,0.065972,1.527778
6,(tropical fruit),(yogurt),0.208333,0.166667,0.083333,0.4,2.4,0.048611,1.388889
4,(yogurt),(butter),0.166667,0.125,0.083333,0.5,4.0,0.0625,1.75
7,(yogurt),(tropical fruit),0.166667,0.208333,0.083333,0.5,2.4,0.048611,1.583333
8,(yogurt),(whole milk),0.166667,0.25,0.083333,0.5,2.0,0.041667,1.5
14,(yogurt),"(whole milk, butter)",0.166667,0.083333,0.083333,0.5,6.0,0.069444,1.833333
2,(butter),(whole milk),0.125,0.25,0.083333,0.666667,2.666667,0.052083,2.25


The table above shows the relationships between different items and how likely customers are to buy those items together. For example, according to the table, customers who purchased whole milk also purchased butter in about 33% of cases (a confidence of 0.333333).
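The 33% figure can be checked by hand. The sketch below assumes the counts implied by the supports in the table: 24 transactions, of which 6 contain whole milk, 3 contain butter and 2 contain both. It also shows that confidence changes with the direction of the rule while lift does not.

```python
from fractions import Fraction

n_transactions = 24
count_milk, count_butter, count_both = 6, 3, 2   # counts derived from the table's supports

# Confidence is the joint count divided by the antecedent's count
conf_milk_to_butter = Fraction(count_both, count_milk)
conf_butter_to_milk = Fraction(count_both, count_butter)

# Lift is the same in both directions
lift = Fraction(count_both * n_transactions, count_milk * count_butter)

print(float(conf_milk_to_butter), float(conf_butter_to_milk), float(lift))
```

So a butter buyer is twice as likely to also buy whole milk as a whole milk buyer is to also buy butter, even though both rules share the same lift of about 2.67.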