## Association Rules Learning - Apriori

Association rule mining finds interesting associations and relationships among large sets of data items. An association rule describes how often a set of items occurs together across transactions. A typical example is Market Basket Analysis.

Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It allows retailers to identify relationships between the items that people frequently buy together. Association rule mining is sometimes referred to as `Market Basket Analysis`, as it was the first application area of association mining.

**The Problem**
When we go grocery shopping, we often have a standard list of things to buy. Each shopper has a distinctive list, depending on one’s needs and preferences. A housewife might buy healthy ingredients for a family dinner, while a bachelor might buy beer and chips. Understanding these buying patterns can help to increase sales in several ways. If there is a pair of items, X and Y, that are frequently bought together:

- Both X and Y can be placed on the same shelf, so that buyers of one item would be prompted to buy the other.
- Promotional discounts could be applied to just one out of the two items.
- Advertisements on X could be targeted at buyers who purchase Y.
- X and Y could be combined into a new product, such as having Y in flavors of X.


Given a set of transactions, we can find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

|  TID   |          ITEMS             |
| ------ |--------------------------- |
|    1   |        Bread, Milk         |
|    2   | Bread, Diaper, Beer, Eggs  |
|    3   |  Milk, Diaper, Beer, Coke  |
|    4   |  Bread, Milk, Diaper, Beer |
|    5   |  Bread, Milk, Diaper, Coke |




Before we start defining the rule, let us first see the basic definitions.

**Support Count (sigma)**: The number of transactions in which an itemset occurs.

> Here sigma({Milk, Bread, Diaper})=2 

**Frequent Itemset**: An itemset whose support is greater than or equal to minsup threshold.
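The two definitions above can be checked directly against the five transactions in the table. The sketch below is a brute-force enumeration, not the Apriori algorithm itself, and it assumes the threshold is given as an absolute count. The helper names `support_count` and `frequent_itemsets` and the parameter `min_count` are illustrative, not part of any library.

```python
from itertools import combinations

# The five transactions from the table above (TID 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_count):
    """Brute force: test every possible itemset against `min_count`."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = support_count(candidate, transactions)
            if count >= min_count:
                frequent[candidate] = count
    return frequent


print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2

freq = frequent_itemsets(transactions, min_count=3)
for itemset, count in freq.items():
    print(itemset, count)
```

Enumerating every candidate is only practical for a handful of items, which is why algorithms such as Apriori prune the search instead.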

**Association Rule**: An implication expression of the form X -> Y, where X and Y are two disjoint itemsets.

> Example: {Milk, Diaper}->{Beer} 

**Rule Evaluation Metrics**:

<img src="https://miro.medium.com/max/994/1*9J50LPtmb0fcgR5FhnDljQ.png">

- **Support (s):** The fraction of all transactions that contain every item in both X and Y. It measures how frequently the items occur together.
- **Support(X=>Y) = sigma(X ∪ Y) / N**, where N is the total number of transactions. It is interpreted as the fraction of transactions that contain both X and Y.
- **Confidence (c):** The ratio of the number of transactions that contain all items in both X and Y to the number of transactions that contain all items in X.
- **Conf(X=>Y) = Supp(X ∪ Y) / Supp(X):** It measures how often the items in Y appear in transactions that also contain X.
- **Lift (l):** The lift of the rule X=>Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. Under independence, the expected confidence is simply the support of Y.
- **Lift(X=>Y) = Conf(X=>Y) / Supp(Y):** A lift near 1 indicates that X and Y appear together about as often as expected under independence, greater than 1 means they appear together more often than expected, and less than 1 means less often than expected. The further the lift is above 1, the stronger the association.
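As a minimal sketch of the three formulas, the snippet below evaluates the rule {Milk, Diaper} -> {Beer} on the five transactions from the earlier table. The function names `support`, `confidence`, and `lift` are chosen here for illustration and do not come from a library.

```python
# The five transactions from the table above (TID 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y, transactions):
    """Supp(X ∪ Y) / Supp(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x, y, transactions):
    """Conf(X => Y) / Supp(Y)."""
    return confidence(x, y, transactions) / support(y, transactions)


x, y = {"Milk", "Diaper"}, {"Beer"}
print(f"support    = {support(x | y, transactions):.3f}")   # 0.400
print(f"confidence = {confidence(x, y, transactions):.3f}")  # 0.667
print(f"lift       = {lift(x, y, transactions):.3f}")        # 1.111
```

Two of the three transactions that contain both Milk and Diaper also contain Beer, which gives the confidence of 2/3. Beer appears in 3 of 5 transactions overall, so the lift is (2/3) / (3/5) = 10/9.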



**Example** 

- Support: Suppose A, B, C, D, and E are items, and combinations such as ABC and ACD form itemsets.

<img src="https://miro.medium.com/max/1400/1*-hURyp3RPWQcen0Vp2KVuA.png">
<img src="https://miro.medium.com/max/1400/1*3dUDSVioOKfK7Z6FMs2ztQ.png">
<img src="https://miro.medium.com/max/1400/1*j77mw2rM4Rkjd_3iBgCjQQ.png">

- Confidence:

<img src="https://miro.medium.com/max/1400/1*hOKIpx-Zazt9JPPa-wcsig.png">
<img src="https://miro.medium.com/max/1400/1*w6jsKZhk35RqrT9OeP7JJg.png">

**Why do we need to calculate Support and Confidence?**

Support and confidence are calculated to guard against patterns that occur in the data purely by chance.
Support is an important measure because a rule with very low support may occur simply by chance. A low-support rule is also less interesting from a business perspective, because it may not be profitable to promote items that customers seldom buy together.
Confidence, on the other hand, measures the reliability of the inference made by a rule. A high confidence for a rule {A} → {B} indicates that B is likely to appear in transactions that contain A.

**The need for calculating Lift**

For rule {A} →{B}
- By calculating support, we know whether the rule occurs often enough to be significant.
- By calculating confidence, we know how likely B is to occur in a transaction that contains A.

But these measures do not tell us to what extent the occurrence of one item or itemset (say A) increases the likelihood of another item or itemset (say B).

> So we calculate lift to know how the antecedent and consequent are related to one another.
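The rule {Bread} → {Milk}, taken from the five-transaction table earlier, illustrates the gap: its confidence looks high, yet Milk is so common overall that knowing a basket contains Bread does not make Milk any more likely. The variable names below are illustrative only.

```python
# The five transactions from the table above (TID 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

n = len(transactions)
count_bread = sum(1 for t in transactions if "Bread" in t)
count_milk = sum(1 for t in transactions if "Milk" in t)
count_both = sum(1 for t in transactions if {"Bread", "Milk"} <= t)

conf = count_both / count_bread   # how often Milk appears given Bread
baseline = count_milk / n         # how often Milk appears in any basket
lift_value = conf / baseline

print(f"confidence({{Bread}} -> {{Milk}}) = {conf:.4f}")
print(f"support({{Milk}})               = {baseline:.4f}")
print(f"lift                          = {lift_value:.4f}")
```

Here the confidence is 3/4 while Milk alone appears in 4/5 of all baskets, so the lift falls slightly below 1 even though the confidence on its own looks strong.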

For a rule: {Antecedent} →{Consequent}
- If lift = 1, the occurrences of the antecedent and the consequent are independent of one another.
- If lift < 1, the occurrence of the antecedent has a negative effect on the occurrence of the consequent, and vice versa.
- If lift > 1, the two occurrences are positively dependent on one another, and such rules are useful for predicting the consequent in future transactions. The size of the lift also tells us how strongly the occurrences depend on one another.
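The interpretation of lift = 1 follows from substituting the confidence formula into the lift formula. The short derivation below uses the X, Y notation of the metric definitions, with X as the antecedent and Y as the consequent:

```latex
\begin{aligned}
\mathrm{Lift}(X \Rightarrow Y)
  &= \frac{\mathrm{Conf}(X \Rightarrow Y)}{\mathrm{Supp}(Y)}
   = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)\,\mathrm{Supp}(Y)} \\[4pt]
\mathrm{Supp}(X \cup Y) = \mathrm{Supp}(X)\,\mathrm{Supp}(Y)
  &\;\Longrightarrow\; \mathrm{Lift}(X \Rightarrow Y) = 1
\end{aligned}
```

The last expression is symmetric in X and Y, so Lift(X => Y) = Lift(Y => X). This is why a negative or positive effect of the antecedent on the consequent also holds in the other direction.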

**Lift**

Lift(A → B) indicates how much the probability of B rises when A is known to have occurred.
In the above example:

<img src="https://miro.medium.com/max/1400/1*iPyvAKpu-v7Umklzs9MROw.png">
<img src="https://miro.medium.com/max/1400/1*oNCbKCTQujmWlNFXTLKHdg.png">


Association rules are very useful for analyzing datasets. In supermarkets, the data is collected using bar-code scanners. Such databases consist of a large number of transaction records, each listing all items bought by a customer in a single purchase. A manager can therefore see whether certain groups of items are consistently purchased together and use this information to adjust store layouts, cross-selling, and promotions.

**Applications**
The main applications of Association Rules are in:
- data analysis
- classification
- cross-marketing
- clustering
- catalogue design
- loss-leader analysis 


An implementation using another library is described by [geeksforgeeks](https://www.geeksforgeeks.org/implementing-apriori-algorithm-in-python/)

References and more [geeksforgeeks](https://www.geeksforgeeks.org/association-rule/) and [wikipedia](https://en.wikipedia.org/wiki/Association_rule_learning)

## Implementation

In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Data Preprocessing
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None)
dataset

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7496,butter,light mayo,fresh bread,,,,,,,,,,,,,,,,,
7497,burgers,frozen vegetables,eggs,french fries,magazines,green tea,,,,,,,,,,,,,,
7498,chicken,,,,,,,,,,,,,,,,,,,
7499,escalope,green tea,,,,,,,,,,,,,,,,,,


In [3]:
# Build one list of item names per transaction, skipping empty (NaN) cells
transactions = []
for i in range(len(dataset)):
    transactions.append([str(item) for item in dataset.values[i] if pd.notna(item)])

In [4]:
# Training Apriori on the dataset
from apyori import apriori
rules = apriori(transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2)

In [5]:
# Collecting the generated rules into a list so they can be inspected
results = list(rules)
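The nested records are easier to read once flattened into one row per rule. The sketch below assumes each record exposes the fields visible in the printed output (`items`, `support`, `ordered_statistics`, and per-rule `items_base`, `items_add`, `confidence`, `lift`). `SampleRecord`, `SampleStat`, and `flatten_rules` are hypothetical names introduced here so the snippet runs without apyori installed. The two sample values are copied from the output shown in this notebook.

```python
from collections import namedtuple

# Hypothetical stand-ins that mirror the field names seen in the printed output.
SampleRecord = namedtuple("SampleRecord", ["items", "support", "ordered_statistics"])
SampleStat = namedtuple("SampleStat", ["items_base", "items_add", "confidence", "lift"])


def flatten_rules(records):
    """Return one dict per rule, sorted by descending lift."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append({
                "antecedent": ", ".join(sorted(stat.items_base)),
                "consequent": ", ".join(sorted(stat.items_add)),
                "support": record.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return sorted(rows, key=lambda row: row["lift"], reverse=True)


sample = [
    SampleRecord(
        frozenset({"mushroom cream sauce", "escalope"}),
        0.005732568990801226,
        [SampleStat(frozenset({"mushroom cream sauce"}), frozenset({"escalope"}),
                    0.3006993006993007, 3.790832696715049)],
    ),
    SampleRecord(
        frozenset({"chicken", "light cream"}),
        0.004532728969470737,
        [SampleStat(frozenset({"light cream"}), frozenset({"chicken"}),
                    0.29059829059829057, 4.84395061728395)],
    ),
]

rows = flatten_rules(sample)
for row in rows:
    print(f'{row["antecedent"]} -> {row["consequent"]}: '
          f'support={row["support"]:.4f}, confidence={row["confidence"]:.4f}, '
          f'lift={row["lift"]:.2f}')
```

If the real `results` list has the same attribute names, it could be passed to `flatten_rules` in place of `sample`, and the returned rows could be loaded into a DataFrame with `pd.DataFrame(rows)`.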

In [6]:
results

[RelationRecord(items=frozenset({'chicken', 'light cream'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)]),
 RelationRecord(items=frozenset({'mushroom cream sauce', 'escalope'}), support=0.005732568990801226, ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=0.3006993006993007, lift=3.790832696715049)]),
 RelationRecord(items=frozenset({'pasta', 'escalope'}), support=0.005865884548726837, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'escalope'}), confidence=0.3728813559322034, lift=4.700811850163794)]),
 RelationRecord(items=frozenset({'honey', 'fromage blanc'}), support=0.003332888948140248, ordered_statistics=[OrderedStatistic(items_base=frozenset({'fromage blanc'}), items_add=frozenset({'honey'}), confidence=0