# Association Rules Mining

Transaction histories contain many patterns that help explain customer behavior. One such insight is the association between purchased items. Using the data on all items bought, we can determine which items are frequently purchased together and build product association models from them. Any conclusion must be checked for both likelihood and reliability. For example, when a person goes to a supermarket to buy bread, a rule might state that there is an 80% chance that they also purchase jam or ham to eat with it, with 90% confidence.

## Apriori Algorithm

Apriori is an algorithm used to identify frequent item sets (in our case, item pairs). It uses a "bottom up" approach, first identifying the individual items that satisfy a minimum occurrence threshold. It then extends the item sets one item at a time, checking whether each resulting item set still satisfies the threshold. The algorithm stops when no item set can be extended while still meeting the minimum occurrence requirement. Here's an example of Apriori in action, assuming a minimum occurrence threshold of 3:

```text
order 1: apple, egg, milk  
order 2: carrot, milk  
order 3: apple, egg, carrot
order 4: apple, egg
order 5: apple, carrot


Iteration 1:  Count the number of times each item occurs   
item set      occurrence count    
{apple}              4   
{egg}                3   
{milk}               2   
{carrot}             3   

{milk} is eliminated because it does not meet the minimum occurrence threshold.


Iteration 2: Build item sets of size 2 using the remaining items from Iteration 1 
             (ie: apple, egg, carrot)  
item set           occurrence count  
{apple, egg}             3  
{apple, carrot}          2  
{egg, carrot}            1  

Only {apple, egg} meets the threshold. The algorithm stops because no item set of size 3 can be built from a single remaining pair.
```
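The level-wise procedure described above can be sketched in plain Python. This is a minimal illustration, not an optimized or library implementation; the function name `apriori_counts` and the parameter `min_count` are hypothetical names chosen for this sketch.

```python
from itertools import combinations


def apriori_counts(orders, min_count):
    """Return {item set: count} for every item set found in at least min_count orders."""
    orders = [frozenset(order) for order in orders]
    # Iteration 1: the candidates are the individual items.
    candidates = {frozenset([item]) for order in orders for item in order}
    frequent = {}
    size = 1
    while candidates:
        # Count how many orders contain each candidate item set.
        counts = {c: sum(c <= order for order in orders) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        # Build candidates one item larger by joining pairs of survivors,
        # keeping only those whose subsets of the current size all survived.
        size += 1
        joined = {a | b for a in survivors for b in survivors if len(a | b) == size}
        candidates = {
            c for c in joined
            if all(frozenset(sub) in survivors for sub in combinations(c, size - 1))
        }
    return frequent


orders = [
    ["apple", "egg", "milk"],
    ["carrot", "milk"],
    ["apple", "egg", "carrot"],
    ["apple", "egg"],
    ["apple", "carrot"],
]

result = apriori_counts(orders, min_count=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On the five example orders, the only item set of size 2 that this sketch returns is {apple, egg}, with a count of 3.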

## Association Rules Mining

Once the item sets have been generated using apriori, we can start mining association rules. Given that we are only looking at item sets of size 2, the association rules we will generate will be of the form {A} -> {B}. One common application of these rules is in the domain of recommender systems, where customers who purchased item A are recommended item B.

Here are 3 key metrics to consider when evaluating association rules:

1. support
This is the percentage of orders that contain the item set. In the example above, there are 5 orders in total and {apple,egg} occurs in 3 of them, so:

             support{apple,egg} = 3/5 or 60%

The minimum support threshold required by apriori can be set based on knowledge of your domain. In this grocery dataset for example, since there could be thousands of distinct items and an order can contain only a small fraction of these items, setting the support threshold to 0.01% may be reasonable.

2. confidence
Given two items, A and B, confidence measures the percentage of times that item B is purchased, given that item A was purchased. This is expressed as:

             confidence{A->B} = support{A,B} / support{A}   

Confidence values range from 0 to 1, where 0 indicates that B is never purchased when A is purchased, and 1 indicates that B is always purchased whenever A is purchased. Note that the confidence measure is directional. This means that we can also compute the percentage of times that item A is purchased, given that item B was purchased:

             confidence{B->A} = support{A,B} / support{B}    

In our example, the percentage of times that egg is purchased, given that apple was purchased is:

             confidence{apple->egg} = support{apple,egg} / support{apple}
                                    = (3/5) / (4/5)
                                    = 0.75 or 75%

A confidence value of 0.75 implies that out of all orders that contain apple, 75% of them also contain egg. Now, we look at the confidence measure in the opposite direction (ie: egg->apple):

             confidence{egg->apple} = support{apple,egg} / support{egg}
                                    = (3/5) / (3/5)
                                    = 1 or 100%  

Here we see that all of the orders that contain egg also contain apple. But, does this mean that there is a relationship between these two items, or are they occurring together in the same orders simply by chance? To answer this question, we look at another measure which takes into account the popularity of both items.

3. lift
Given two items, A and B, lift indicates whether there is a relationship between A and B, or whether the two items are occurring together in the same orders simply by chance (ie: at random). Unlike the confidence metric, whose value may vary depending on direction (eg: confidence{A->B} may be different from confidence{B->A}), lift has no direction. This means that lift{A,B} is always equal to lift{B,A}:

             lift{A,B} = lift{B,A} = support{A,B} / (support{A} * support{B})   

In our example, we compute lift as follows:

  lift{apple,egg} = lift{egg,apple} = support{apple,egg} / (support{apple} * support{egg})
                  = (3/5) / (4/5 * 3/5) 
                  = 1.25    

One way to understand lift is to think of the denominator as the likelihood that A and B would appear in the same order if there were no relationship between them. In the example above, apple occurs in 80% of the orders and egg occurs in 60% of the orders, so if there were no relationship between them, we would expect both to show up in the same order 48% of the time (ie: 80% * 60%). The numerator, on the other hand, represents how often apple and egg actually appear together in the same order. In this example, that is 60% of the time. Dividing the numerator by the denominator tells us how many times more often apple and egg appear in the same order than they would if there were no relationship between them (ie: if they occurred together simply at random).

In summary, lift can take on the following values:

 * lift = 1 implies no relationship between A and B. 
   (ie: A and B occur together only by chance)

 * lift > 1 implies that there is a positive relationship between A and B.
   (ie:  A and B occur together more often than random)

 * lift < 1 implies that there is a negative relationship between A and B.
   (ie:  A and B occur together less often than random)

In our example, apple and egg occur together 1.25 times as often as we would expect by chance, so we conclude that there is a positive relationship between them.
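The three metrics can be checked on the five example orders with a few lines of Python. This is a small sketch; the helper names `support`, `confidence`, and `lift` are chosen here for illustration, and `fractions.Fraction` is used so that the results are exact rather than subject to floating-point rounding.

```python
from fractions import Fraction

orders = [
    {"apple", "egg", "milk"},
    {"carrot", "milk"},
    {"apple", "egg", "carrot"},
    {"apple", "egg"},
    {"apple", "carrot"},
]


def support(items):
    """Fraction of orders that contain every item in `items`."""
    items = set(items)
    return Fraction(sum(items <= order for order in orders), len(orders))


def confidence(a, b):
    """confidence{a->b} = support{a,b} / support{a}."""
    return support([a, b]) / support([a])


def lift(a, b):
    """lift{a,b} = support{a,b} / (support{a} * support{b})."""
    return support([a, b]) / (support([a]) * support([b]))


print("support{apple,egg}     =", float(support(["apple", "egg"])))
print("confidence{apple->egg} =", float(confidence("apple", "egg")))
print("confidence{egg->apple} =", float(confidence("egg", "apple")))
print("lift{apple,egg}        =", float(lift("apple", "egg")))
```

Swapping the arguments of `lift` gives the same value, while swapping the arguments of `confidence` generally does not.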

```python
import pandas as pd
import numpy as np
import sys
from itertools import combinations, groupby
from collections import Counter
```

## Dataset

For this article, I will use a dataset from a bakery (License: CC0: Public Domain). This dataset includes 20507 entries, over 9000 transactions, and 5 columns. You can download the data from this [link](https://www.kaggle.com/datasets/mittalvasu95/the-bread-basket).

### Load order data

```python
orders = pd.read_csv('./bread basket.csv')
orders.head()
```

| | Transaction | Item | date_time | period_day | weekday_weekend |
|---|---|---|---|---|---|
| 0 | 1 | Bread | 30-10-2016 09:58 | morning | weekend |
| 1 | 2 | Scandinavian | 30-10-2016 10:05 | morning | weekend |
| 2 | 2 | Scandinavian | 30-10-2016 10:05 | morning | weekend |
| 3 | 3 | Hot chocolate | 30-10-2016 10:07 | morning | weekend |
| 4 | 3 | Jam | 30-10-2016 10:07 | morning | weekend |


```python
orders.tail()
```

| | Transaction | Item | date_time | period_day | weekday_weekend |
|---|---|---|---|---|---|
| 20502 | 9682 | Coffee | 09-04-2017 14:32 | afternoon | weekend |
| 20503 | 9682 | Tea | 09-04-2017 14:32 | afternoon | weekend |
| 20504 | 9683 | Coffee | 09-04-2017 14:57 | afternoon | weekend |
| 20505 | 9683 | Pastry | 09-04-2017 14:57 | afternoon | weekend |
| 20506 | 9684 | Smoothies | 09-04-2017 15:04 | afternoon | weekend |


### Transform the dataset

Because the items of a transaction are spread across several rows, I group them so that each transaction becomes a single list of its distinct items. The result is a list of lists:

```python
# Combine the items of each transaction into one list of distinct items.
# groupby(...).unique() yields one array of distinct items per transaction.
transactions = [
    list(items)
    for items in orders.groupby("Transaction")["Item"].unique()
]
```

The `apriori` function from the mlxtend library requires a data frame whose values are 0/1 or True/False. Therefore, I one-hot encode the transactions to meet this requirement.

You may need to install the `mlxtend` library first:

```bash
pip install mlxtend
```

```python
import mlxtend
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import association_rules, apriori

trans_encoding = TransactionEncoder()
df2 = trans_encoding.fit(transactions).transform(transactions)
df3 = pd.DataFrame(df2, columns=trans_encoding.columns_)
```
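To make the one-hot encoding step concrete, here is a minimal pure-Python sketch of the idea: one column per distinct item, one row per transaction, and a True/False flag for membership. It illustrates the shape of the encoded data only and is not mlxtend's implementation; the toy transactions are hypothetical, not the real bakery data.

```python
# Hypothetical toy transactions (not taken from the bakery dataset).
transactions = [
    ["Bread"],
    ["Scandinavian"],
    ["Hot chocolate", "Jam"],
    ["Bread", "Jam"],
]

# One column per distinct item, in sorted order.
columns = sorted({item for basket in transactions for item in basket})

# One row per transaction: True where the transaction contains the item.
encoded = [[item in basket for item in columns] for basket in transactions]

print(columns)
for row in encoded:
    print(row)
```

Each row of `encoded` corresponds to one transaction, which is the layout a frequent-item-set search needs in order to count how many transactions contain a given set of columns.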

Now we can apply the `apriori` function from the mlxtend library; the frequent item sets are found with a single line of code. In this example, I use `min_support=0.05`, the minimum support an item set needs in order to be kept. Meanwhile, `use_colnames=True` reports item sets by column name rather than column index, which makes them easier to read.

```python
frequent_set = apriori(df3, min_support=0.05, use_colnames=True)
print(frequent_set)
```

```text
     support         itemsets
0   0.327205          (Bread)
1   0.103856           (Cake)
2   0.478394         (Coffee)
3   0.054411        (Cookies)
4   0.058320  (Hot chocolate)
5   0.061807      (Medialuna)
6   0.086107         (Pastry)
7   0.071844       (Sandwich)
8   0.142631            (Tea)
9   0.090016  (Coffee, Bread)
10  0.054728   (Coffee, Cake)
```


From these frequent item sets, I then derive association rules of the form "if A is bought, then B is also bought". I set `metric='lift'` with `min_threshold=1`, so only rules with a lift of at least 1 are kept.

```python
rules = association_rules(frequent_set, metric='lift', min_threshold=1)
print(rules)
```

```text
  antecedents consequents  antecedent support  consequent support   support  \
0    (Coffee)      (Cake)            0.478394            0.103856  0.054728   
1      (Cake)    (Coffee)            0.103856            0.478394  0.054728   

   confidence      lift  leverage  conviction  
0    0.114399  1.101515  0.005044    1.011905  
1    0.526958  1.101515  0.005044    1.102664  
```


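As we can see, the result contains only two association rules, both involving Cake and Coffee. The pair is bought together slightly more often than chance would predict (lift ≈ 1.10). The rule Cake -> Coffee has 53% confidence, while the reverse rule Coffee -> Cake has only about 11% confidence. The other frequent pair, (Coffee, Bread), is filtered out because its lift, 0.090016 / (0.478394 * 0.327205) ≈ 0.58, is below the threshold of 1.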
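The columns of the rules table can be reproduced by hand from the three support values. The sketch below uses the standard definitions of leverage (observed joint support minus the support expected under independence) and conviction ((1 - support{B}) / (1 - confidence{A->B})), two measures the table reports but the text above does not define. The function name `rule_metrics` is a hypothetical helper, and the inputs are the rounded supports printed earlier, so the results match the table only up to rounding.

```python
def rule_metrics(support_a, support_b, support_ab):
    """Metrics for the rule A -> B, assuming confidence{A->B} < 1."""
    confidence = support_ab / support_a
    lift = support_ab / (support_a * support_b)
    # Leverage: observed joint support minus the support expected under independence.
    leverage = support_ab - support_a * support_b
    # Conviction: (1 - support{B}) / (1 - confidence{A->B}).
    conviction = (1 - support_b) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Rounded supports taken from the frequent item sets printed above.
coffee, cake, coffee_and_cake = 0.478394, 0.103856, 0.054728

coffee_to_cake = rule_metrics(coffee, cake, coffee_and_cake)
cake_to_coffee = rule_metrics(cake, coffee, coffee_and_cake)

for name, metrics in [("Coffee -> Cake", coffee_to_cake), ("Cake -> Coffee", cake_to_coffee)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Lift and leverage are identical for both directions, while confidence and conviction depend on which item is the antecedent.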

## References

- [Simple Market Basket Analysis with Association Rules Mining](https://towardsdatascience.com/introduction-to-simple-association-rules-mining-for-market-basket-analysis-ef8f2d613d87)

- [Association Rules Mining/Market Basket Analysis](https://www.kaggle.com/code/datatheque/association-rules-mining-market-basket-analysis/notebook)