In [1]:
# code for loading the format for the notebook
import os

# path : store the current working directory so we can change back to it later
path = os.getcwd()
os.chdir('../notebook_format')
from formats import load_style
load_style()

In [2]:
os.chdir(path)
import numpy as np

# Association Analysis

Many businesses accumulate large amounts of market-basket transaction data. For example, a typical set of market-basket transactions may look like:

![](img/data.png)

Each row is a unique transaction (usually identified by a unique id) together with the set of items purchased in that transaction.

This information can be useful for many business-related applications, but here we'll focus on a specific one called **Association Analysis**. With **Association Analysis**, the outputs are association rules that describe the relationship (co-occurrence) between items. A classic example is the rule {Beer} $\rightarrow$ {Diapers}, which states that transactions that include {Beer} tend to also include {Diapers}. Intuitively, customers who buy beer also tend to buy diapers. Businesses can use such rules to identify new opportunities for cross-selling products to their customers.


## Some Terminology

**Binary Representation:** Market basket data can be represented in a binary format as shown below, where each row still corresponds to a transaction, but each column now corresponds to an item. The value in each cell is 1 if the item is present in the transaction and 0 otherwise. For example, transforming the original table above into its binary representation gives:

![](img/binary_data.png)
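
As a minimal sketch of this conversion, the snippet below builds the binary matrix with numpy. The `transactions` list is an assumption: it is the five-transaction toy data from the referenced textbook chapter, taken here to match the table in the figure above. The variable names are illustrative only.

```python
import numpy as np

# Assumed toy data (taken to match the transaction table shown in the figure)
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diapers', 'Beer', 'Eggs'},
    {'Milk', 'Diapers', 'Beer', 'Cola'},
    {'Bread', 'Milk', 'Diapers', 'Beer'},
    {'Bread', 'Milk', 'Diapers', 'Cola'},
]

# one column per distinct item; sorting makes the column order deterministic
items = sorted(set().union(*transactions))

# cell (i, j) is 1 if transaction i contains item j and 0 otherwise
binary = np.array([[int(item in t) for item in items] for t in transactions])

print(items)
print(binary)
```

Summing a column of `binary` gives the support count of the corresponding single item.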

**Itemset:** A collection of k items is termed a k-itemset. For instance, {Beer, Diapers, Milk} is an example of a 3-itemset. The null (or empty) set is an itemset that does not contain any items.

**Association Rule:** An association rule is an implication expression that takes the form $X \rightarrow Y $, where X and Y are disjoint itemsets, i.e., $X \cap Y = \emptyset$. The strength of an association rule can be measured in terms of its **support** and **confidence**.


**Support** measures how frequently the items of a rule appear together, i.e. the fraction of all transactions that contain every item in $X \cup Y$. Mathematically: $P( X \cup Y )$. e.g. Consider the rule {Milk, Diapers} $\rightarrow$ {Beer}. Since the itemset {Milk, Diapers, Beer} occurs in 2 of the transactions (in other words, the support count of the itemset {Milk, Diapers, Beer} is 2) and the total number of transactions is 5, the rule's support is 2/5 = 0.4.

Support is an important measure because a rule that has very low support may occur simply by chance. A low support rule is also likely to be uninteresting from a business perspective because it may not be profitable to promote items that customers seldom buy together.
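
The support calculation can be sketched in a few lines of Python. The `transactions` list is again an assumption (the toy data from the referenced textbook chapter, taken to match the earlier figure), and `support` is a hypothetical helper name, not part of any library.

```python
# Assumed toy data (taken to match the transaction table shown earlier)
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diapers', 'Beer', 'Eggs'},
    {'Milk', 'Diapers', 'Beer', 'Cola'},
    {'Bread', 'Milk', 'Diapers', 'Beer'},
    {'Bread', 'Milk', 'Diapers', 'Cola'},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    # `itemset <= t` is True when itemset is a subset of transaction t
    count = sum(itemset <= t for t in transactions)
    return count / len(transactions)

# support of the rule {Milk, Diapers} -> {Beer}
rule_support = support({'Milk', 'Diapers', 'Beer'}, transactions)
print(rule_support)  # 0.4
```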

**Confidence** measures the relative strength of the rule: how often the right-hand-side (RHS) itemset occurs in transactions that contain the left-hand-side (LHS) itemset. Mathematically: $\frac{ P( X \cup Y )}{P(X)}$. e.g. Consider the rule {Milk, Diapers} $\rightarrow$ {Beer} once again. The rule's confidence is obtained by dividing the support count for {Milk, Diapers, Beer} by the support count for {Milk, Diapers}. Since there are 2 transactions that contain milk, diapers and beer, and 3 transactions that contain milk and diapers, the confidence for this rule is 2/3 ≈ 0.67.

Confidence measures the reliability of the inference made by a rule. For a given rule $X \rightarrow Y$, the higher the confidence, the more likely it is for Y to be present in transactions that contain X. Confidence also provides an estimate of the conditional probability $P(Y|X)$, the probability of Y given X.
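
A small sketch of the confidence calculation follows. As before, the `transactions` list is assumed toy data (taken to match the earlier figure), and `support_count` and `confidence` are hypothetical helper names used only for illustration.

```python
# Assumed toy data (taken to match the transaction table shown earlier)
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diapers', 'Beer', 'Eggs'},
    {'Milk', 'Diapers', 'Beer', 'Cola'},
    {'Bread', 'Milk', 'Diapers', 'Beer'},
    {'Bread', 'Milk', 'Diapers', 'Cola'},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs -> rhs: count(lhs and rhs) / count(lhs)."""
    lhs, rhs = set(lhs), set(rhs)
    return support_count(lhs | rhs, transactions) / support_count(lhs, transactions)

# confidence of the rule {Milk, Diapers} -> {Beer}
rule_confidence = confidence({'Milk', 'Diapers'}, {'Beer'}, transactions)
print(round(rule_confidence, 2))  # 0.67
```

Note that confidence is not symmetric: swapping the two sides of the rule changes the denominator, so {Beer} $\rightarrow$ {Milk, Diapers} generally has a different confidence.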

> Support and confidence are the most common measures used to restrict the number of association rules to just the ones that have higher quality.

# Apriori Algorithm

A common strategy adopted by many association rule mining algorithms is to decompose the problem into two major subtasks:

1. Frequent Itemset Generation, whose objective is to find all the itemsets that satisfy the minsup (minimum support) threshold. These itemsets are called frequent itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence rules from the frequent itemsets found in the previous step. These rules are called strong rules.

Here we'll look at the widely known **Apriori Algorithm**. The figure below provides a high-level illustration of the frequent itemset generation part of the Apriori algorithm for the toy transaction data shown in the previous section. We assume that the support threshold is 60% (this is a hyperparameter that we have to specify), which, with 5 transactions, is equivalent to a minimum support count of 3.

<img src = img/apriori_stage1.png width = 800 height = 800>


Initially, every item is considered as a candidate 1-itemset. After counting their supports, the candidate itemsets {Cola} and {Eggs} are discarded because they appear in fewer than 3 transactions. In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets. Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets is $\binom{4}{2} = 6$. Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found to be infrequent after computing their support values. The remaining four candidates are frequent, and thus will be used to generate the candidate 3-itemsets. Since every subset of a frequent itemset must itself be frequent, we only need to keep candidate 3-itemsets whose 2-item subsets are all frequent. The only candidate that has this property is {Bread, Diapers, Milk}.
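
The level-wise procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the `transactions` list is assumed toy data (taken to match the earlier figure), `frequent_itemsets` is a hypothetical function name, and the candidate join is a simple pairwise union rather than the more efficient prefix-based join used in optimized implementations.

```python
from itertools import combinations

# Assumed toy data (taken to match the transaction table shown earlier)
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diapers', 'Beer', 'Eggs'},
    {'Milk', 'Diapers', 'Beer', 'Cola'},
    {'Bread', 'Milk', 'Diapers', 'Beer'},
    {'Bread', 'Milk', 'Diapers', 'Cola'},
]

def frequent_itemsets(transactions, min_count):
    """Return {itemset: support count} for all itemsets with count >= min_count."""
    # start with every single item as a candidate 1-itemset
    candidates = {frozenset([item]) for item in set().union(*transactions)}
    frequent = {}
    k = 1
    while candidates:
        # count how many transactions contain each candidate
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)

        # join: union pairs of frequent k-itemsets that form a (k+1)-itemset
        k += 1
        joined = {a | b for a in survivors for b in survivors if len(a | b) == k}
        # prune: keep a candidate only if all of its (k-1)-subsets are frequent
        candidates = {
            c for c in joined
            if all(frozenset(s) in survivors for s in combinations(c, k - 1))
        }
    return frequent

# support threshold of 60% on 5 transactions, i.e. a minimum support count of 3
frequent = frequent_itemsets(transactions, min_count=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The pruning step is where the savings come from: candidates with an infrequent subset are dropped before any transaction is scanned to count them.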

## Reference

- [Introduction to Data Mining Chapter 6: Association Analysis: Basic Concepts and Algorithms](http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf)

- [asaini/Apriori (GitHub)](https://github.com/asaini/Apriori)
- [timothyasp/apriori-python (GitHub)](https://github.com/timothyasp/apriori-python)