# Apriori for Frequent Pattern Mining

## Introduction

### What Is Association Rule Mining?

Association rule mining is a technique for identifying frequent patterns and associations among a set of items. A common use is understanding customer buying habits: by finding correlations between the different items that customers place in their ‘shopping basket,’ recurring patterns can be derived.

Say Joshua goes to the supermarket to buy a bottle of wine and also grabs a couple of bags of chips. The manager notices that it is not only Joshua: people often buy wine and chips together. Having found the pattern, the manager places these items next to each other and notices an increase in sales.

This process of identifying an association between products/items is called association rule mining.

### What Is an Apriori Algorithm?

The Apriori algorithm relies on the principle that every subset of a frequent itemset must also be frequent. Any transaction containing {wine, chips, bread} also contains {wine, bread}, so {wine, bread} appears at least as often as {wine, chips, bread}. Therefore, if {wine, chips, bread} is frequent, then {wine, bread} must also be frequent.

### How Does the Apriori Algorithm Work?

The key concept in the Apriori algorithm is that all subsets of a frequent itemset are frequent. Equivalently, if an itemset is infrequent, all of its supersets must also be infrequent, so they can be discarded without being counted. Let us work through the algorithm using a well-known business scenario, market basket analysis.
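The subset property can be checked directly on a handful of baskets. The following is a minimal sketch rather than part of the original notebook: the six baskets are the ones used in the walkthrough dataset of this section, and the helper name `support_count` is chosen here for illustration.

```python
from itertools import combinations

# Six illustrative baskets (the same transactions as the walkthrough dataset).
transactions = [
    {"wine", "chips", "bread", "milk"},
    {"wine", "bread", "milk"},
    {"bread", "milk"},
    {"chips"},
    {"wine", "chips", "bread", "milk"},
    {"wine", "chips", "milk"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

itemset = {"wine", "bread", "milk"}
full_count = support_count(itemset, transactions)

# Every non-empty proper subset occurs at least as often as the full itemset.
for size in range(1, len(itemset)):
    for subset in combinations(sorted(itemset), size):
        subset_count = support_count(subset, transactions)
        assert subset_count >= full_count
        print(subset, subset_count, ">=", full_count)
```

Read in the other direction, the same check explains the pruning rule: once a small itemset falls below the threshold, no larger itemset containing it can reach it.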

Here is a dataset consisting of six transactions in an hour. Each transaction is a combination of 0s and 1s, where 0 represents the absence of an item and 1 represents the presence of it.

In [None]:
import pandas as pd  
  
# assign data of lists.  
data = {'Transaction Id': ['1', '2', '3', '4', '5', '6'], 'Wine': [1,1,0,0,1,1], 'Chips': [1,0,0,1,1,1],'Bread':[1,1,1,0,1,0], 'Milk':[1,1,1,0,1,1] }  
  
# Create DataFrame  
df = pd.DataFrame(data)  
  
# Print the output.  
print(df)  

  Transaction Id  Wine  Chips  Bread  Milk
0              1     1      1      1     1
1              2     1      0      1     1
2              3     0      0      1     1
3              4     0      1      0     0
4              5     1      1      1     1
5              6     1      1      0     1
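For the counting steps later in this section it can help to view each row as a set of items. The sketch below rebuilds the same table as plain Python data and derives one basket per transaction; the names `items` and `baskets` are chosen here for illustration and do not appear in the original notebook.

```python
# The same 0/1 table as above, as a plain dictionary of columns.
data = {
    "Transaction Id": ["1", "2", "3", "4", "5", "6"],
    "Wine":  [1, 1, 0, 0, 1, 1],
    "Chips": [1, 0, 0, 1, 1, 1],
    "Bread": [1, 1, 1, 0, 1, 0],
    "Milk":  [1, 1, 1, 0, 1, 1],
}

# Every column except the id column is an item.
items = [name for name in data if name != "Transaction Id"]

# One set of item names per row: keep the items whose flag is 1.
baskets = [
    {item for item in items if data[item][row] == 1}
    for row in range(len(data["Transaction Id"]))
]

for tid, basket in zip(data["Transaction Id"], baskets):
    print(tid, sorted(basket))
```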


We can find multiple rules from this scenario. For example, among transactions involving wine, chips, and bread, one candidate rule is that customers who buy wine and chips also buy bread: {wine, chips} => {bread}.
To select the interesting rules out of the many possible rules in this small business scenario, we will use the following measures:

- Support

- Confidence

- Lift

- Conviction

### Support:

The support of an item x is the ratio of the number of transactions in which x appears to the total number of transactions, i.e.,

Support(wine) = number of transactions in which wine appears / total number of transactions = 4/6
Support(wine) = 0.66667
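The same ratio can be computed in a few lines of Python. This is a minimal sketch that assumes the six transactions are stored as sets; the function name `support` is an illustrative choice, not a library call.

```python
# The six transactions from the table above, one set of items per transaction.
transactions = [
    {"wine", "chips", "bread", "milk"},
    {"wine", "bread", "milk"},
    {"bread", "milk"},
    {"chips"},
    {"wine", "chips", "bread", "milk"},
    {"wine", "chips", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(round(support({"wine"}, transactions), 5))  # 0.66667
```

The same function works for itemsets with more than one item, which is what the confidence and lift measures rely on.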

### Confidence:

Confidence (x => y) signifies the likelihood of item y being purchased when item x is purchased. This measure takes into account the popularity of item x, i.e.,

Conf({wine, chips} => {bread}) = support(wine, chips, bread) / support(wine, chips) = (2/6) / (3/6)
Conf({wine, chips} => {bread}) = 0.667
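As a quick check of the arithmetic, confidence can be expressed as a ratio of two supports. The sketch below assumes the six transactions of the example stored as sets; `support` and `confidence` are illustrative helper names.

```python
transactions = [
    {"wine", "chips", "bread", "milk"},
    {"wine", "bread", "milk"},
    {"bread", "milk"},
    {"chips"},
    {"wine", "chips", "bread", "milk"},
    {"wine", "chips", "milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

def confidence(x, y):
    """conf(x => y) = support(x and y together) / support(x)."""
    return support(set(x) | set(y)) / support(x)

print(round(confidence({"wine", "chips"}, {"bread"}), 3))  # 0.667
```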

### Lift:

Lift (x => y) measures the ‘interestingness’ of a rule: how much more likely item y is to be purchased when item x is purchased than it is in general. Unlike confidence (x => y), this measure takes into account the popularity of item y, i.e.,

lift ({wine, chips} => {bread}) = support(wine, chips, bread) / (support(wine, chips) * support(bread)) = (2/6) / (3/6 * 4/6)
lift ({wine, chips} => {bread}) = 1

-  Lift (x => y) = 1 means that there is no correlation within the itemset

-  Lift (x => y) > 1 means that there is a positive correlation within the itemset, i.e., products in the itemset, x and y, are more likely to be bought together.

-  Lift (x => y) < 1 means that there is a negative correlation within the itemset, i.e., products in itemset, x and y, are unlikely to be bought together.
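The three cases can be seen on the example data. This sketch computes lift as support(x and y together) divided by support(x) * support(y), using exact fractions to avoid rounding noise; the helper names are illustrative and the three rules were picked here only to show one value of each kind.

```python
from fractions import Fraction

transactions = [
    {"wine", "chips", "bread", "milk"},
    {"wine", "bread", "milk"},
    {"bread", "milk"},
    {"chips"},
    {"wine", "chips", "bread", "milk"},
    {"wine", "chips", "milk"},
]

def support(itemset):
    """Exact fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return Fraction(hits, len(transactions))

def lift(x, y):
    """lift(x => y) = support(x and y together) / (support(x) * support(y))."""
    return support(set(x) | set(y)) / (support(x) * support(y))

print(lift({"wine", "chips"}, {"bread"}))  # 1    (no correlation)
print(lift({"wine"}, {"milk"}))            # 6/5  (positive correlation)
print(lift({"chips"}, {"bread"}))          # 3/4  (negative correlation)
```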

### Conviction:

Conviction of a rule can be defined as follows:

conv(x => y) = (1 - support(y)) / (1 - conf(x => y)), i.e.,

conv({wine, chips} => {bread}) = (1 - support(bread)) / (1 - conf({wine, chips} => {bread})) = (1 - 4/6) / (1 - 2/3) = 1

Its value range is [0, +∞]; it grows without bound as the confidence approaches 1.

-  Conv(x => y) = 1 means that x has no relation with y.

-  The greater the conviction, the more interesting the rule.
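A small sketch of the conviction formula on the example data follows. It assumes the same six transactions, uses exact fractions, and treats a confidence of exactly 1 as infinite conviction; that convention and the helper names are choices made here for illustration.

```python
from fractions import Fraction
import math

transactions = [
    {"wine", "chips", "bread", "milk"},
    {"wine", "bread", "milk"},
    {"bread", "milk"},
    {"chips"},
    {"wine", "chips", "bread", "milk"},
    {"wine", "chips", "milk"},
]

def support(itemset):
    """Exact fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return Fraction(hits, len(transactions))

def confidence(x, y):
    return support(set(x) | set(y)) / support(x)

def conviction(x, y):
    """conv(x => y) = (1 - support(y)) / (1 - conf(x => y))."""
    conf = confidence(x, y)
    if conf == 1:
        return math.inf  # the rule is never violated in the data
    return (1 - support(y)) / (1 - conf)

print(conviction({"wine", "chips"}, {"bread"}))  # 1
print(conviction({"wine"}, {"milk"}))            # inf
```

In this data every basket with wine also contains milk, which is why the second rule has a confidence of 1 and an unbounded conviction.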

## A simple count

Now that we know the measures used to pick out interesting rules, let us go back to the example. Before we get started, let us fix the support threshold at 50%: Min_Support = 0.5, i.e., an itemset must appear in at least 3 of the 6 transactions.

Step 1: Create a frequency table of all the items that occur in all transactions

In [None]:
# assign data of lists.  
data = {'Item': ['Wine', 'Chips', 'Bread', 'Milk'], 'Frequency': [4,4,4,5] }  
  
# Create DataFrame  
df = pd.DataFrame(data)  
  
# Print the output.  
print(df)  

    Item  Frequency
0   Wine          4
1  Chips          4
2  Bread          4
3   Milk          5


Step 2: Find the frequent 1-itemsets based on the support threshold
Support threshold = 3 transactions. All four items meet it, so the table is unchanged.

In [None]:
# assign data of lists.  
data = {'Item': ['Wine', 'Chips', 'Bread', 'Milk'], 'Frequency': [4,4,4,5] }  
  
# Create DataFrame  
df = pd.DataFrame(data)  
  
# Print the output.  
print(df)  

    Item  Frequency
0   Wine          4
1  Chips          4
2  Bread          4
3   Milk          5


Step 3: From the frequent 1-itemset, make possible pairs irrespective of the order as 2-itemsets

In [None]:
# assign data of lists.  
data = {'Item': ['Wine, Chips', 'Wine, Bread', 'Wine, Milk', 'Chips, Bread', 'Chips, Milk','Bread, Milk'], 'Frequency': [3,3,4,2,3,4] }  
  
# Create DataFrame  
df = pd.DataFrame(data)  
  
# Print the output.  
print(df)  

           Item  Frequency
0   Wine, Chips          3
1   Wine, Bread          3
2    Wine, Milk          4
3  Chips, Bread          2
4   Chips, Milk          3
5   Bread, Milk          4
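The pair counts in the table above were entered by hand; they can also be derived from the transactions. The sketch below is one way to do so with `itertools.combinations`, and it also applies the 3-transaction threshold used in the next step. The variable names are illustrative.

```python
from itertools import combinations

# The six transactions from the dataset at the start of this section.
transactions = [
    {"Wine", "Chips", "Bread", "Milk"},
    {"Wine", "Bread", "Milk"},
    {"Bread", "Milk"},
    {"Chips"},
    {"Wine", "Chips", "Bread", "Milk"},
    {"Wine", "Chips", "Milk"},
]
items = ["Wine", "Chips", "Bread", "Milk"]  # the frequent 1-itemsets
min_count = 3                               # support threshold in transactions

# Count, for each unordered pair, the transactions that contain both items.
pair_counts = {
    pair: sum(1 for t in transactions if set(pair) <= t)
    for pair in combinations(items, 2)
}

# Keep only the pairs that meet the threshold.
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_count}

for pair, count in pair_counts.items():
    print(", ".join(pair), count)
```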


Step 4: Again, find the frequent 2-itemsets based on the support threshold

In [None]:
# assign data of lists.  
data = {'Item': ['Wine, Chips', 'Wine, Bread', 'Wine, Milk',  'Chips, Milk','Bread, Milk'], 'Frequency': [3,3,4,3,4] }  
  
# Create DataFrame  
df = pd.DataFrame(data)  
  
# Print the output.  
print(df)  

          Item  Frequency
0  Wine, Chips          3
1  Wine, Bread          3
2   Wine, Milk          4
3  Chips, Milk          3
4  Bread, Milk          4


Step 5: Now, make a set of 3-itemsets that are bought together based on the frequent 2-itemset from Step 4

In [None]:
# assign data of lists.  
data = {'Item': ['Wine, Bread, Milk', 'Wine, Chips, Milk'], 'Frequency': [3, 3] }  
  
# Create DataFrame  
df = pd.DataFrame(data)  
  
# Print the output.  
print(df)  

                Item  Frequency
0  Wine, Bread, Milk          3
1  Wine, Chips, Milk          3


Findings: {Wine, Bread, Milk} and {Wine, Chips, Milk} are the frequent 3-itemsets in the given data; each appears in 3 of the 6 transactions (transactions 1, 2, 5 and transactions 1, 5, 6 respectively). In real-world scenarios, we would have dozens of items to build rules from, so the search might continue to 4- or 5-itemsets.
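The five steps above can be sketched as one level-wise loop. This is a simplified illustration under stated assumptions, not the implementation used by any particular package: the function name `apriori_frequent_itemsets` is hypothetical, candidates are formed by joining pairs of frequent itemsets from the previous level, and counts are taken directly from the six transactions at the start of this section.

```python
from itertools import combinations

transactions = [
    frozenset({"Wine", "Chips", "Bread", "Milk"}),
    frozenset({"Wine", "Bread", "Milk"}),
    frozenset({"Bread", "Milk"}),
    frozenset({"Chips"}),
    frozenset({"Wine", "Chips", "Bread", "Milk"}),
    frozenset({"Wine", "Chips", "Milk"}),
]

def apriori_frequent_itemsets(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join: union pairs of frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Count: one pass over the transactions for the surviving candidates.
        current = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_count:
                current[c] = n
        frequent.update(current)
        k += 1
    return frequent

frequent = apriori_frequent_itemsets(transactions, min_count=3)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is where the Apriori principle pays off: a candidate is counted only if all of its smaller subsets were already found frequent.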

## Run with a package

### Apriori Algorithm in Python- Market Basket Analysis

Problem Statement:
The manager of a retail store is trying to find out an association rule between six items, to figure out which items are more often bought together so that he can keep the items together in order to increase sales.

Environment Setup:

In [None]:
pip install apyori

Collecting apyori
  Downloading apyori-1.1.2.tar.gz (8.6 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: apyori
  Building wheel for apyori (setup.py) ... done
  Created wheel for apyori: filename=apyori-1.1.2-py3-none-any.whl size=5974 sha256=e3bd070dce2ea54826fcf23aaa0348431c4a9ee9bde10facaa172a7bd6244ddd
  Stored in directory: /root/.cache/pip/wheels/32/2a/54/10c595515f385f3726642b10c60bf788029e8f3a1323e3913a
Successfully built apyori
Installing collected packages: apyori
Successfully installed apyori-1.1.2
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


Step 1: Import the libraries

In [None]:
import numpy as np
import pandas as pd
from apyori import apriori

Step 2: Load the dataset

In [None]:
# Install openpyxl to read data from xlsx file
!pip install openpyxl 

Collecting openpyxl
  Downloading openpyxl-3.0.10-py2.py3-none-any.whl (242 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.10
You should consider upgrading via the '/root/venv/bin/python -m pip install --upgrade pip' command.

In [None]:
basket_data = pd.read_excel('Day1.xlsx')
basket_data

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5
0,Wine,Chips,Bread,Butter,Milk,Apple
1,Wine,,Bread,Butter,Milk,
2,,,Bread,Butter,Milk,
3,,Chips,,,,Apple
4,Wine,Chips,Bread,Butter,Milk,Apple
5,Wine,Chips,,,Milk,
6,Wine,Chips,Bread,Butter,,Apple
7,Wine,Chips,,,Milk,
8,Wine,,Bread,,,Apple
9,Wine,,Bread,Butter,Milk,


In [None]:
basket_data.shape

(22, 6)

Step 5: Convert Pandas DataFrame into a list of lists

In [None]:
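# Build one list of item names per transaction (one per DataFrame row).
# Note: str() turns empty cells into the string 'nan'.
n_rows, n_cols = basket_data.shape
records = []

for i in range(n_rows):
    records.append([str(basket_data.values[i, j]) for j in range(n_cols)])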

In [None]:
records

[['Wine', 'Chips', 'Bread', 'Butter', 'Milk ', 'Apple'],
 ['Wine', 'nan', 'Bread', 'Butter', 'Milk ', 'nan'],
 ['nan', 'nan', 'Bread', 'Butter', 'Milk ', 'nan'],
 ['nan', 'Chips', 'nan', 'nan', 'nan', 'Apple'],
 ['Wine', 'Chips', 'Bread', 'Butter', 'Milk ', 'Apple'],
 ['Wine', 'Chips', 'nan', 'nan', 'Milk ', 'nan'],
 ['Wine', 'Chips', 'Bread', 'Butter', 'nan', 'Apple'],
 ['Wine', 'Chips', 'nan', 'nan', 'Milk ', 'nan'],
 ['Wine', 'nan', 'Bread', 'nan', 'nan', 'Apple'],
 ['Wine', 'nan', 'Bread', 'Butter', 'Milk ', 'nan'],
 ['nan', 'Chips', 'Bread', 'Butter', 'nan', 'Apple'],
 ['Wine', 'nan', 'nan', 'Butter', 'Milk ', 'Apple'],
 ['Wine', 'Chips', 'Bread', 'Butter', 'Milk ', 'nan'],
 ['Wine', 'nan', 'Bread', 'nan', 'Milk ', 'Apple'],
 ['Wine', 'nan', 'Bread', 'Butter', 'Milk ', 'Apple'],
 ['Wine', 'Chips', 'Bread', 'Butter', 'Milk ', 'Apple'],
 ['nan', 'Chips', 'Bread', 'Butter', 'Milk ', 'Apple'],
 ['nan', 'Chips', 'nan', 'Butter', 'Milk ', 'Apple'],
 ['Wine', 'Chips', 'Bread', 'Butter', 
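The records above keep the string 'nan' for empty cells and a trailing space in 'Milk ', so both are treated as ordinary item names. One possible cleanup is sketched below on a few rows copied from the output above. The helper name `clean_record` is hypothetical, and the results shown in the rest of this notebook correspond to the uncleaned records, so applying this step first could change the reported supports and rules.

```python
# A few rows copied from the printed records, including 'nan' and 'Milk '.
raw_records = [
    ['Wine', 'Chips', 'Bread', 'Butter', 'Milk ', 'Apple'],
    ['Wine', 'nan', 'Bread', 'Butter', 'Milk ', 'nan'],
    ['nan', 'nan', 'Bread', 'Butter', 'Milk ', 'nan'],
    ['nan', 'Chips', 'nan', 'nan', 'nan', 'Apple'],
]

def clean_record(record):
    """Strip surrounding whitespace and drop placeholder 'nan' entries."""
    stripped = [item.strip() for item in record]
    return [item for item in stripped if item and item.lower() != 'nan']

clean_records = [clean_record(r) for r in raw_records]

for row in clean_records:
    print(row)
```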

Step 6: Build the Apriori model

In [None]:
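# apyori documents min_support, min_confidence, min_lift and max_length as keyword
# arguments; min_length is not among them and appears to have no effect here.
association_rules = apriori(records, min_support=0.40, min_confidence=0.70, min_lift=1.2, min_length=2)
association_results = list(association_rules)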

Step 7: Print out the number of rules

In [None]:
print(len(association_results))

5


Step 8: Have a glance at the rule

In [None]:
print(association_results)

[RelationRecord(items=frozenset({'Wine', 'Bread', 'Apple'}), support=0.45454545454545453, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Wine', 'Apple'}), items_add=frozenset({'Bread'}), confidence=0.9090909090909091, lift=1.25)]), RelationRecord(items=frozenset({'Butter', 'Apple', 'Milk '}), support=0.4090909090909091, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Apple', 'Milk '}), items_add=frozenset({'Butter'}), confidence=0.8181818181818182, lift=1.2000000000000002)]), RelationRecord(items=frozenset({'Butter', 'Bread', 'Milk '}), support=0.5, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Butter'}), items_add=frozenset({'Bread', 'Milk '}), confidence=0.7333333333333334, lift=1.241025641025641), OrderedStatistic(items_base=frozenset({'Bread', 'Milk '}), items_add=frozenset({'Butter'}), confidence=0.8461538461538461, lift=1.241025641025641)]), RelationRecord(items=frozenset({'Butter', 'Bread', 'Wine'}), support=0.45454545454545453, ordered_sta

Step 9: Have a detailed look at the rule

In [None]:
for r in association_results:
    print("=====================================")
    print('Frequent itemset:{} with support'.format(list(r[0])), r[1])
    print('--Association Rules')
    for a in r[2]:
        print('----Rule: {} -> {}'.format(list(a[0]), list(a[1])))
        print('------Confidence: {}'.format(a[2]))
        print('------Lift: {}'.format(a[3]))

Frequent itemset:['Wine', 'Bread', 'Apple'] with support 0.45454545454545453
--Association Rules
----Rule: ['Wine', 'Apple'] -> ['Bread']
------Confidence: 0.9090909090909091
------Lift: 1.25
Frequent itemset:['Butter', 'Apple', 'Milk '] with support 0.4090909090909091
--Association Rules
----Rule: ['Apple', 'Milk '] -> ['Butter']
------Confidence: 0.8181818181818182
------Lift: 1.2000000000000002
Frequent itemset:['Butter', 'Bread', 'Milk '] with support 0.5
--Association Rules
----Rule: ['Butter'] -> ['Bread', 'Milk ']
------Confidence: 0.7333333333333334
------Lift: 1.241025641025641
----Rule: ['Bread', 'Milk '] -> ['Butter']
------Confidence: 0.8461538461538461
------Lift: 1.241025641025641
Frequent itemset:['Butter', 'Bread', 'Wine'] with support 0.45454545454545453
--Association Rules
----Rule: ['Butter', 'Wine'] -> ['Bread']
------Confidence: 0.9090909090909091
------Lift: 1.25
Frequent itemset:['Butter', 'Bread', 'Milk ', 'Wine'] with support 0.4090909090909091
--Association Ru

Take the itemset {‘Butter’, ‘Bread’, ‘Milk’} as an example. Its support value is 0.5. This number is calculated by dividing the number of transactions containing ‘Milk,’ ‘Bread,’ and ‘Butter’ by the total number of transactions.

The confidence for the rule {‘Bread’, ‘Milk’} => {‘Butter’} is 0.846, which shows that out of all the transactions that contain both ‘Milk’ and ‘Bread’, 84.6% contain ‘Butter’ too.

The lift of 1.241 tells us that ‘Butter’ is 1.241 times more likely to be bought by customers who buy both ‘Milk’ and ‘Bread’ than its baseline likelihood of being bought.
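These three numbers are consistent with one another, which can be verified from transaction counts. In the sketch below the total of 22 comes from `basket_data.shape`; the other counts are not read from the data file but inferred from the printed statistics (0.5 × 22 = 11, 11/13 ≈ 0.846, 11/15 ≈ 0.733), so treat them as assumptions.

```python
from fractions import Fraction

n_transactions = 22       # rows in basket_data
n_butter_bread_milk = 11  # inferred from support 0.5
n_bread_milk = 13         # inferred from confidence 0.846 of {Bread, Milk} => {Butter}
n_butter = 15             # inferred from confidence 0.733 of {Butter} => {Bread, Milk}

support_itemset = Fraction(n_butter_bread_milk, n_transactions)
confidence = Fraction(n_butter_bread_milk, n_bread_milk)

# lift = confidence of the rule divided by the support of its consequent.
lift = confidence / Fraction(n_butter, n_transactions)

print(float(support_itemset))       # 0.5
print(round(float(confidence), 3))  # 0.846
print(round(float(lift), 3))        # 1.241
```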

## Discussion

### Limitations of Apriori Algorithm

Despite its simplicity, the Apriori algorithm has some limitations:

-  It wastes time generating and handling a large number of candidate itemsets.

-  Its efficiency drops when a large number of transactions must be processed with limited memory.

-  It requires high computation power and must scan the entire database repeatedly.
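To give a rough sense of scale for the candidate problem, the sketch below prints how many non-empty itemsets and how many pairs exist for a given number of distinct items. These are worst-case upper bounds chosen here for illustration; pruning usually means Apriori examines far fewer candidates. The function name is hypothetical.

```python
from math import comb

def n_possible_itemsets(n_items):
    """Number of non-empty itemsets over n_items distinct items: 2^n - 1."""
    return 2 ** n_items - 1

# Columns: number of items, possible itemsets, possible pairs.
for n in (4, 6, 20, 50):
    print(n, n_possible_itemsets(n), comb(n, 2))
```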

### Applications of Apriori Algorithm

-  Forest departments use it to understand the intensity and probability of forest fires.

-  Healthcare departments use such algorithms to analyze patient databases and predict which patients might develop high blood pressure, diabetes, or another common disease.

-  E-commerce websites use it in their recommendation systems to provide a better user experience.

## Reference and Credit

### References

- https://intellipaat.com/blog/data-science-apriori-algorithm/#1

### Credit

Apriori Algorithm by Bhawneet Singh and Di Wu is licensed under [CC BY NC SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).