# Association Rule Learning: Grocery Store

## Setup

While working as an intern at MLSolutions Inc., you are assigned your first big project: helping the local grocery store figure out which pairs of products sell best together. You are given a dataset of more than 9,000 transactions recorded over 3 days, each listing the products that were bought. What would you do?

## Theory First!

First, we need to decide which type of machine learning algorithm fits the problem. For this particular case, we need one of the association rule learning methods. In short, association rule learning is a rule-based approach for discovering interesting relations between variables in a dataset.

The two most common algorithms are Apriori and Eclat. Each has its own advantages and disadvantages; a common rule of thumb is that Eclat works better on smaller datasets whereas Apriori is better suited to large ones.
Since we have 9000+ observations, we are going to use Apriori.
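
To make the idea behind Apriori more concrete, here is a minimal from-scratch sketch of its level-wise search. It is for illustration only: the toy baskets, the function name `frequent_itemsets` and its parameters are made up for this example and are not part of any library. The key step is the pruning, where a candidate item set is only counted if all of its smaller subsets were already frequent.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_length=2):
    """Illustrative sketch: return {itemset: support} for itemsets meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items.
    items = {item for basket in baskets for item in basket}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {s: support(s) for s in current}

    k = 2
    while current and k <= max_length:
        # Join step: combine frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

# Made-up baskets, not taken from the grocery dataset.
toy_baskets = [
    ["milk", "bread"],
    ["milk", "bread", "butter"],
    ["milk", "butter"],
    ["bread"],
]
toy_frequent = frequent_itemsets(toy_baskets, min_support=0.5)
```

In this toy example the pair bread and butter is dropped because it appears in only one of the four baskets, while milk and bread survives with a support of 0.5.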

Before we start, let's review some common terms that will help us evaluate the results of our model.

1. Support. This indicates how frequently an item (or rule) appears in the dataset. We would expect some items, such as milk or sugar, to have high support values. Support ranges from 0 to 1, where 0 means the item does not appear in the dataset at all and 1 means it appears in every transaction.

2. Confidence. This indicates how often the rule holds. For instance, for the rule sugar -> flour (i.e. people who buy sugar also buy flour), the confidence is the number of transactions containing both sugar and flour divided by the number of transactions containing sugar. Confidence ranges from 0 to 1, where 0 means the rule never holds and 1 means it holds in every relevant transaction.

3. Lift (= confidence of the rule / support of the product on the right-hand side of the rule). This indicates how much more often the two sides of the rule occur together than we would expect if they were bought independently. Lift ranges from 0 to infinity. A lift of 1 means the products are independent, a lift below 1 means they appear together less often than expected, and a lift above 1 means they appear together more often than expected. So we are interested in rules with a lift greater than 1.
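
The three definitions above can be checked by hand on a tiny example. The sketch below uses five made-up transactions (not the grocery data) and three small helper functions whose names are chosen just for this illustration.

```python
# Five made-up transactions, used only to illustrate the definitions.
toy_transactions = [
    {"sugar", "flour", "milk"},
    {"sugar", "flour"},
    {"sugar", "milk"},
    {"milk"},
    {"flour", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(base, add, transactions):
    """Support of base and add together, divided by the support of base."""
    return support(set(base) | set(add), transactions) / support(base, transactions)

def lift(base, add, transactions):
    """Confidence of base -> add, divided by the support of add."""
    return confidence(base, add, transactions) / support(add, transactions)

# Rule: sugar -> flour
sugar_flour_support = support({"sugar", "flour"}, toy_transactions)
sugar_flour_confidence = confidence({"sugar"}, {"flour"}, toy_transactions)
sugar_flour_lift = lift({"sugar"}, {"flour"}, toy_transactions)
```

Here sugar and flour appear together in 2 of 5 transactions, sugar appears in 3 of 5, and flour appears in 3 of 5, so the rule has a support of 2/5, a confidence of 2/3 and a lift of (2/3) / (3/5), which is slightly above 1.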

## Import Libraries

In [1]:
import numpy as np
import pandas as pd

While the Apriori algorithm can be implemented from scratch, for this project we are going to use the apyori package, which I found on GitHub.

In [2]:
!pip install apyori
from apyori import apriori



## Data Preprocessing

As always, let's first read the dataset into a pandas DataFrame to have a closer look!

In [3]:
dataset = pd.read_csv('datasets/Grocery Products Purchase.csv')
dataset.head()

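|   | Product 1 | Product 2 | Product 3 | Product 4 | Product 5 | Product 6 | Product 7 | Product 8 | Product 9 | Product 10 | ... | Product 23 | Product 24 | Product 25 | Product 26 | Product 27 | Product 28 | Product 29 | Product 30 | Product 31 | Product 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | citrus fruit | semi-finished bread | margarine | ready soups | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | tropical fruit | yogurt | coffee | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | whole milk | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | pip fruit | yogurt | cream cheese | meat spreads | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | other vegetables | whole milk | condensed milk | long life bakery product | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |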


We can see that it has 32 columns, which means that the largest transaction contains 32 products; transactions with fewer products are padded with 'NaN' values.
The apriori function expects a list in which each transaction is itself a list of products, so we extract the transactions from the DataFrame and drop all 'NaN' values along the way.

In [4]:
# Build one list of product names per transaction, skipping the NaN padding.
transactions = [
    [str(item) for item in row if pd.notna(item)]
    for row in dataset.values
]

## Initial Apriori Algorithm

Now it is time to use the apriori function we imported above. To use it properly, it is important to understand how it works. The function takes a list of transactions, each of which is itself a list (that's what we built above), along with a minimum support (since we only want relevant rules), a minimum confidence and a minimum lift. It returns the rules in a specific format, which we will discuss later.

Now let's choose the parameter values, starting with min_support. Assume we want a rule to appear at least 10 times a day, or 30 times over 3 days. Dividing 30 by 9000 gives about 0.0033, or 0.33% of all transactions, which we round down to min_support = 0.003.
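
The same arithmetic as a short sketch. The variable names are made up for this illustration, and 9000 is the rough dataset size used in the text rather than the exact number of rows.

```python
expected_daily_sales = 10     # assumed threshold: the rule should appear 10 times a day
days_observed = 3
approx_transactions = 9000    # rough dataset size quoted in the text

min_count = expected_daily_sales * days_observed    # 30 occurrences in total
min_support = min_count / approx_transactions       # about 0.0033
min_support_rounded = int(min_support * 1000) / 1000  # truncate to three decimals

print(min_count, min_support, min_support_rounded)
```

Truncating rather than rounding up keeps the threshold slightly more permissive than the "30 occurrences" target, which is a reasonable choice when the exact number of transactions is only known approximately.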

Next, let's choose min_confidence. Recall that confidence is the proportion of relevant transactions in which the rule holds. Let's assume that we want a rule to hold in at least 50% of relevant transactions. If this turns out to be too strict, we will lower it later.

For the min_lift parameter, we have discussed that values > 1 suggest that the rule appears more often than expected. For the sake of this assignment, let's set min_lift = 2.

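Finally, since we are looking for pairs of products, let's choose min_length = 2 and max_length = 2. (As far as I can tell, apyori itself only reads max_length and silently ignores min_length, so the lower bound here mainly documents our intent.)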


In [6]:
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.5, min_lift=2, min_length=2, max_length=2)

As mentioned above, the apriori function returns the rules it found, which we convert to a list. Let's have a closer look at a single rule so that we know how to extract information from it.

In [7]:
results = list(rules)
result = results[0]
print(result)

RelationRecord(items=frozenset({'whole milk', 'baking powder'}), support=0.009252669039145907, ordered_statistics=[OrderedStatistic(items_base=frozenset({'baking powder'}), items_add=frozenset({'whole milk'}), confidence=0.5229885057471264, lift=2.0467934556398677)])


The rule here is baking powder -> whole milk. It has a support of 0.009 (both products appear together in 0.9% of all transactions), a confidence of 0.523 (about half of the purchases containing baking powder also contained whole milk) and a lift of about 2.05. Now we can write a function that extracts this information so we can put it in a DataFrame.
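
As a sanity check, the three numbers in this record can be related back to the definitions of confidence and lift. The sketch below only rearranges those two formulas; the variable names are made up, and the values are copied from the printed record above.

```python
# Values copied from the RelationRecord printed above.
pair_support = 0.009252669039145907    # support of {baking powder, whole milk}
rule_confidence = 0.5229885057471264   # confidence of baking powder -> whole milk
rule_lift = 2.0467934556398677

# confidence = support(pair) / support(base)  =>  support(base) = support(pair) / confidence
baking_powder_support = pair_support / rule_confidence

# lift = confidence / support(add)  =>  support(add) = confidence / lift
whole_milk_support = rule_confidence / rule_lift

print(f"implied support(baking powder): {baking_powder_support:.4f}")
print(f"implied support(whole milk):    {whole_milk_support:.4f}")
```

The implied support of whole milk comes out at roughly a quarter of all transactions, which explains why a confidence of about 52% translates into a lift of only about 2: whole milk is already in many baskets regardless of baking powder.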

In [8]:
def inspect_pair(results):
    """Turn apyori records into (rows, column names) for a DataFrame of product pairs."""
    rows = []
    for result in results:
        # Only the first rule stored in each record is read.
        rule = result.ordered_statistics[0]
        rows.append((
            tuple(rule.items_base)[0],   # Product 1: left-hand side of the rule
            tuple(rule.items_add)[0],    # Product 2: right-hand side of the rule
            result.support,
            rule.confidence,
            rule.lift,
        ))
    columns = ['Product 1', 'Product 2', 'Support', 'Confidence', 'Lift']
    return rows, columns

In [9]:
resultsinDataFrame = pd.DataFrame(inspect_pair(results)[0], columns = inspect_pair(results)[1])

In [10]:
resultsinDataFrame.sort_values(by = 'Lift', ascending = False)

|   | Product 1 | Product 2 | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 2 | rice | other vegetables | 0.003965 | 0.52 | 2.687441 |
| 3 | specialty cheese | other vegetables | 0.00427 | 0.5 | 2.584078 |
| 1 | cereals | whole milk | 0.00366 | 0.642857 | 2.515917 |
| 4 | rice | whole milk | 0.004677 | 0.613333 | 2.400371 |
| 0 | baking powder | whole milk | 0.009253 | 0.522989 | 2.046793 |


So there are only 5 rules with a minimum confidence of 50% and a minimum lift of 2. Let's lower the minimum confidence to 40% and see how many rules we get.

## Updated Apriori

In [11]:
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.4, min_lift=2, min_length=2, max_length=2)
results = list(rules)
resultsinDataFrame = pd.DataFrame(inspect_pair(results)[0], columns = inspect_pair(results)[1])
resultsinDataFrame.sort_values(by = 'Lift', ascending = False)

|   | Product 1 | Product 2 | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 2 | liquor | bottled beer | 0.004677 | 0.422018 | 5.240594 |
| 9 | herbs | root vegetables | 0.007016 | 0.43125 | 3.956477 |
| 19 | rice | root vegetables | 0.003152 | 0.413333 | 3.792102 |
| 11 | rice | other vegetables | 0.003965 | 0.52 | 2.687441 |
| 16 | specialty cheese | other vegetables | 0.00427 | 0.5 | 2.584078 |
| 17 | turkey | other vegetables | 0.003965 | 0.4875 | 2.519476 |
| 4 | cereals | whole milk | 0.00366 | 0.642857 | 2.515917 |
| 8 | herbs | other vegetables | 0.007728 | 0.475 | 2.454874 |
| 12 | roll products | other vegetables | 0.004779 | 0.465347 | 2.404983 |
| 20 | rice | whole milk | 0.004677 | 0.613333 | 2.400371 |


We can see that we now have roughly 20 rules at a minimum confidence of 40% (the table above shows only the ten with the highest lift). This is going to be much more useful for the store owner! Let's update this DataFrame with even more rules by lowering the minimum confidence to 30%.

In [12]:
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.3, min_lift=2, min_length=2, max_length=2)
results = list(rules)
resultsinDataFrame = pd.DataFrame(inspect_pair(results)[0], columns = inspect_pair(results)[1])
resultsinDataFrame.sort_values(by = 'Lift', ascending = False)

|   | Product 1 | Product 2 | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 0 | Instant food products | hamburger meat | 0.00305 | 0.379747 | 11.421438 |
| 5 | liquor | bottled beer | 0.004677 | 0.422018 | 5.240594 |
| 16 | herbs | root vegetables | 0.007016 | 0.43125 | 3.956477 |
| 28 | rice | root vegetables | 0.003152 | 0.413333 | 3.792102 |
| 3 | beef | root vegetables | 0.017387 | 0.331395 | 3.040367 |
| 30 | roll products | root vegetables | 0.003152 | 0.306931 | 2.815917 |
| 19 | onions | root vegetables | 0.009456 | 0.304918 | 2.797452 |
| 20 | rice | other vegetables | 0.003965 | 0.52 | 2.687441 |
| 25 | specialty cheese | other vegetables | 0.00427 | 0.5 | 2.584078 |
| 26 | turkey | other vegetables | 0.003965 | 0.4875 | 2.519476 |


With a minimum confidence of 30% we were able to identify 34 product pairs. While some of them are pretty obvious (cereals -> whole milk has an impressive 64% confidence), others are more subtle, for instance berries -> yogurt. I would have assumed that most yogurt products already contain fruit, unless it is Greek yogurt of course.

## Handling More Than a Pair

While the original assignment was to find pairs of products that go together, what if we could find 3 products that go together? For instance, maybe there is a rule that people who buy baking powder and milk also buy sugar. This might be useful for the owner, right?

In [13]:
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.3, min_lift=3, min_length=2, max_length=3)
results = list(rules)
result = results[-1]
result

RelationRecord(items=frozenset({'tropical fruit', 'whipped/sour cream', 'yogurt'}), support=0.0062023385866802234, ordered_statistics=[OrderedStatistic(items_base=frozenset({'tropical fruit', 'whipped/sour cream'}), items_add=frozenset({'yogurt'}), confidence=0.44852941176470584, lift=3.215223589435774)])

We can see that the base of the rule (items_base) now contains two products instead of one, which inspect_pair cannot handle. So let's modify the function so that it can handle bases of any size.

In [14]:
def inspect(results):
    """Turn apyori records into (rows, column names); the rule base may hold several products."""
    rows = []
    for result in results:
        # Only the first rule stored in each record is read.
        rule = result.ordered_statistics[0]
        rows.append((
            ','.join(rule.items_base),   # all base products in one comma-separated string
            tuple(rule.items_add)[0],    # the product added by the rule
            result.support,
            rule.confidence,
            rule.lift,
        ))
    columns = ['Base Products', 'Additional Product', 'Support', 'Confidence', 'Lift']
    return rows, columns
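
One limitation worth knowing: the function above reads only the first entry of each record's ordered_statistics list and only the first product of items_add, so a record that carries more than one rule contributes a single row. A possible variant that emits one row per rule is sketched below. `inspect_all_rules` is a hypothetical helper, the two namedtuples are stand-ins that copy the field names visible in the printed output so the sketch runs without apyori, and the example record uses made-up numbers.

```python
from collections import namedtuple

# Stand-ins mirroring the field names seen in the printed apyori output.
# They exist only so that this sketch runs without apyori installed.
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])

def inspect_all_rules(results):
    """Hypothetical helper: return one row per rule instead of one row per record."""
    rows = []
    for record in results:
        for stat in record.ordered_statistics:
            rows.append((
                ",".join(sorted(stat.items_base)),  # sorted for a stable order
                ",".join(sorted(stat.items_add)),
                record.support,
                stat.confidence,
                stat.lift,
            ))
    return rows

# Made-up record holding two rules for the same pair of products.
example_record = RelationRecord(
    items=frozenset({"herbs", "root vegetables"}),
    support=0.007,
    ordered_statistics=[
        OrderedStatistic(frozenset({"herbs"}), frozenset({"root vegetables"}), 0.43, 3.9),
        OrderedStatistic(frozenset({"root vegetables"}), frozenset({"herbs"}), 0.06, 3.9),
    ],
)
example_rows = inspect_all_rules([example_record])
```

The rows have the same five fields as before, so they could be passed to pd.DataFrame with the same column names. The row counts reported in this notebook were produced with the one-row-per-record functions, so they would not necessarily match the output of this variant.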

In [15]:
resultsinDataFrame = pd.DataFrame(inspect(results)[0], columns = inspect(results)[1])
resultsinDataFrame.sort_values(by = 'Lift', ascending = False)

|   | Base Products | Additional Product | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 0 | Instant food products | hamburger meat | 0.003050 | 0.379747 | 11.421438 |
| 2 | liquor | bottled beer | 0.004677 | 0.422018 | 5.240594 |
| 13 | berries,whole milk | whipped/sour cream | 0.004270 | 0.362069 | 5.050990 |
| 60 | herbs,whole milk | root vegetables | 0.004169 | 0.539474 | 4.949369 |
| 59 | herbs,other vegetables | root vegetables | 0.003864 | 0.500000 | 4.587220 |
| ... | ... | ... | ... | ... | ... |
| 26 | yogurt,chicken | other vegetables | 0.004881 | 0.585366 | 3.025262 |
| 82 | whipped/sour cream,pip fruit | root vegetables | 0.003050 | 0.329670 | 3.024541 |
| 48 | domestic eggs,other vegetables | root vegetables | 0.007321 | 0.328767 | 3.016254 |
| 36 | citrus fruit,whipped/sour cream | yogurt | 0.004575 | 0.420561 | 3.014734 |


In [16]:
len(resultsinDataFrame)

92

As we can see, we have 92 combinations of products with a confidence of at least 30% and a lift of at least 3. What happens if we allow combinations of any size but raise the minimum confidence to 50%?

In [17]:
results = list(apriori(transactions=transactions, min_support=0.003, min_confidence=0.5, min_lift=3, min_length=2, max_length=100))

In [18]:
resultsinDataFrame = pd.DataFrame(inspect(results)[0], columns = inspect(results)[1])
resultsinDataFrame.sort_values(by = 'Confidence', ascending = False)

|   | Base Products | Additional Product | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 30 | root vegetables,butter,yogurt | whole milk | 0.00305 | 0.789474 | 3.089723 |
| 32 | tropical fruit,citrus fruit,root vegetables | other vegetables | 0.004474 | 0.785714 | 4.060694 |
| 27 | brown bread,root vegetables,other vegetables | whole milk | 0.003152 | 0.775 | 3.033078 |
| 47 | tropical fruit,whipped/sour cream,root vegetables | other vegetables | 0.003355 | 0.733333 | 3.789981 |
| 42 | root vegetables,whole milk,onions | other vegetables | 0.003254 | 0.680851 | 3.518744 |
| 24 | sliced cheese,root vegetables | other vegetables | 0.003762 | 0.672727 | 3.476759 |
| 41 | margarine,whole milk,root vegetables | other vegetables | 0.003254 | 0.653061 | 3.375122 |
| 1 | brown bread,whipped/sour cream | other vegetables | 0.00305 | 0.652174 | 3.370536 |
| 20 | tropical fruit,onions | other vegetables | 0.00366 | 0.642857 | 3.322386 |
| 2 | pip fruit,butter milk | other vegetables | 0.003254 | 0.64 | 3.30762 |


In [19]:
len(resultsinDataFrame)

53

Wow! As we can see, around 79% of the transactions that contained yogurt, root vegetables and butter also contained whole milk. That is a very interesting insight!

Overall, we can see that, given a large amount of data, the Apriori algorithm can give us a lot of insight into consumer purchasing patterns. I am sure that the store owner would benefit from the information we have gathered, and hopefully this project will impress higher management at MLSolutions so that they consider offering you a full-time position!