## Chapter 10 - Association Analysis with the Apriori Algorithm

### Machine Learning in Retail
In retail, we can see many applications of machine learning. Some examples are:
1. merchandising
2. loyalty programs
3. vouchers & promotions

These examples usually involve a large amount of data crunching to increase sales. In this chapter we look at how algorithms can discover items commonly purchased together, known as <b>association analysis</b> or <b>association rule mining</b>.

The first instinct when performing association analysis is to use brute force (e.g. counting every combination of items across the whole dataset), but this is expensive in terms of computing power. The time taken grows with the size of the dataset and, more severely, with the number of item combinations that must be checked. In this chapter we explore the Apriori algorithm as a way to reduce that cost.

In [1]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from mlxtend.frequent_patterns import apriori, association_rules

### Association Analysis
Association analysis is the task of finding interesting relationships in large datasets. These relationships take two forms:
- <b>Frequent itemsets</b> are collections of items that frequently occur together.
- <b>Association rules</b> suggest that a strong relationship exists between two items or groups of items.

The relationships are quantified by two values:
- <b>support</b> - the percentage of transactions that contain the itemset (an itemset can contain one or more items).
- <b>confidence</b> - for a rule $A \rightarrow C$ with antecedent itemset $A$ and consequent itemset $C$, the <b>confidence</b> is the support of $A \cup C$ divided by the support of $A$, i.e. $\text{confidence}(A \rightarrow C) = \text{support}(A \cup C) / \text{support}(A)$. The numerical example below illustrates both measures.

#### Example 1
Consider the following list of transactions:

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
```

In [2]:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

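# One-hot encoding: one column per item, one row per transaction
mlb = MultiLabelBinarizer()
res = mlb.fit_transform(dataset)
# Cast to bool, the column type mlxtend prefers for one-hot input
onehot = pd.DataFrame(res, columns=mlb.classes_).astype(bool)
# Frequent itemsets: keep itemsets found in at least 50% of transactions
frequent_itemsets = apriori(onehot, min_support=0.50, use_colnames=True)
display(frequent_itemsets)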

| | support | itemsets |
|---|---|---|
| 0 | 0.8 | (Eggs) |
| 1 | 1.0 | (Kidney Beans) |
| 2 | 0.6 | (Milk) |
| 3 | 0.6 | (Onion) |
| 4 | 0.6 | (Yogurt) |
| 5 | 0.8 | (Eggs, Kidney Beans) |
| 6 | 0.6 | (Eggs, Onion) |
| 7 | 0.6 | (Milk, Kidney Beans) |
| 8 | 0.6 | (Onion, Kidney Beans) |
| 9 | 0.6 | (Yogurt, Kidney Beans) |


The support of `'Kidney Beans'` is the largest at $100\%$, as it appears in all transactions, so it is the most commonly appearing item. The most commonly appearing <u>group of items</u> is `['Eggs', 'Kidney Beans']`, which appears in 4 out of 5 transactions (support of $80\%$). This is the solution to the <b>frequent itemsets</b> problem.

The confidence of `'Eggs' -> 'Kidney Beans'` is $100\%$, the highest possible value, because every transaction that contains Eggs also contains Kidney Beans. Finding rules with high confidence like this one solves the <b>association rules</b> problem.
<center>$\diamond$</center>
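
The figures quoted in Example 1 can also be checked by hand. The sketch below is only an illustration of the definitions, not how `mlxtend` computes them; the helper names `count`, `support` and `confidence` are introduced here and are not part of any library.

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

# Treat each transaction as a set, so the repeated 'Onion' counts once.
transactions = [set(t) for t in dataset]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions)

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """support(A ∪ C) / support(A); the transaction total cancels,
    so dividing the raw counts gives the same value."""
    return count(set(antecedent) | set(consequent)) / count(antecedent)

print(support({'Kidney Beans'}))                # 1.0
print(support({'Eggs', 'Kidney Beans'}))        # 0.8
print(confidence({'Eggs'}, {'Kidney Beans'}))   # 1.0
print(confidence({'Eggs'}, {'Onion'}))          # 0.75
```

Dividing counts rather than supports is a small design choice that keeps the arithmetic exact for this example.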

### The Apriori Principle for Frequent Itemsets
Instinctively, we will want to calculate the support of every itemset from the original dataset. However, the number of itemsets grows very quickly with the number of distinct products in the transactions.

#### Example 2
If a dataset of transactions contains 4 different products, $p_0, p_1, p_2, p_3$, there are 15 different itemsets to account for:
- ${4 \choose 1} = 4$ ways to choose 1 product
- ${4 \choose 2} = 6$ ways to choose 2 products
- ${4 \choose 3} = 4$ ways to choose 3 products
- ${4 \choose 4} = 1$ way to choose 4 products

Similarly, if there are 5 different products, the number of itemsets grows to 31. In general, $n$ products give $2^n - 1$ non-empty itemsets, so we need a more efficient method than enumerating all of them. $\diamond$
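
To make the growth in Example 2 concrete, the following sketch enumerates the itemsets directly with `itertools.combinations` and then prints the closed-form count $2^n - 1$ for a few larger values of $n$. The helper `all_itemsets` is a name chosen for this illustration.

```python
from itertools import combinations

def all_itemsets(products):
    """Yield every non-empty subset of `products` as a tuple."""
    for size in range(1, len(products) + 1):
        yield from combinations(products, size)

four = ['p0', 'p1', 'p2', 'p3']
print(len(list(all_itemsets(four))))   # 15

# Closed form: each product is either in or out of an itemset,
# minus the empty set. No subsets are materialised here.
for n in (4, 5, 10, 20):
    print(n, 2 ** n - 1)
```

With only 20 products there are already more than a million itemsets to score, which is why enumeration does not scale.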

To reduce the time needed for computation, the <b>Apriori principle</b> can help us reduce the number of candidate itemsets whose support we must compute. It says that if an itemset is frequent, then all of its subsets are frequent. Equivalently, <b>if an itemset is infrequent, its supersets are also infrequent</b>. Consider Example 2. If $\left\{p_2, p_3\right\}$ is infrequent, then its supersets $\left\{p_0, p_2, p_3\right\}$, $\left\{p_1, p_2, p_3\right\}$ and $\left\{p_0, p_1, p_2, p_3\right\}$ are all infrequent, so we do not need to compute their supports.
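
One way to turn the principle into a procedure is to build itemsets level by level and discard any candidate that has an infrequent subset. The sketch below is a minimal pure-Python illustration of that idea, not the `mlxtend` implementation; the function name `apriori_sketch` and the toy transactions are assumptions made for this example.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune step (Apriori principle): drop a candidate if any of its
        # (k-1)-subsets is infrequent, without scanning the transactions.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy transactions over the products of Example 2.
toy = [{'p0', 'p1'}, {'p0', 'p1', 'p2'}, {'p0', 'p1', 'p3'}, {'p0', 'p2'}]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

On the toy data, `p3` is infrequent, so no itemset containing it is ever generated. The pair `{p1, p2}` is also infrequent, so `{p0, p1, p2}` is pruned before its support is counted.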

### Extending the Apriori Principle for Mining Association Rules

Given the frequent itemsets and their associated supports, we now attempt to mine the association rules. For each frequent itemset, how many association rules can we mine?

#### Example 3
This is easier to see by decomposing an itemset of 4 or more items. Consider the itemset $\left\{p_1, p_2, p_5, p_7\right\}$. There are many candidate association rules, like $\left\{p_1, p_2, p_5\right\} \rightarrow \left\{p_7\right\}$, or $\left\{p_2, p_5, p_7\right\} \rightarrow \left\{p_1\right\}$. The LHS of each rule can be decomposed further by moving items to the RHS. For example, $\left\{p_1, p_2, p_5\right\} \rightarrow \left\{p_7\right\}$ can be decomposed into $\left\{p_1, p_2\right\} \rightarrow \left\{p_5, p_7\right\}$, $\left\{p_2, p_5\right\} \rightarrow \left\{p_1, p_7\right\}$, $\left\{p_1, p_5\right\} \rightarrow \left\{p_2, p_7\right\}$, $\left\{p_1\right\} \rightarrow \left\{p_2, p_5, p_7\right\}$, $\left\{p_2\right\} \rightarrow \left\{p_1, p_5, p_7\right\}$ or $\left\{p_5\right\} \rightarrow \left\{p_1, p_2, p_7\right\}$. Again, the number of rules grows quickly. $\diamond$
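
The candidate rules of Example 3 can be listed mechanically: every way of splitting the itemset into a non-empty antecedent and a non-empty consequent is one rule. The sketch below does this for the four-item set; `candidate_rules` is a helper name chosen for this illustration.

```python
from itertools import combinations

def candidate_rules(itemset):
    """Yield (antecedent, consequent) tuples that split `itemset`
    into two non-empty parts."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent

rules = list(candidate_rules({'p1', 'p2', 'p5', 'p7'}))
print(len(rules))   # 14, i.e. 2**4 - 2
for antecedent, consequent in rules[:4]:
    print(antecedent, '->', consequent)
```

An itemset of $k$ items therefore yields $2^k - 2$ candidate rules, since the empty antecedent and the empty consequent are excluded.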

However, the Apriori principle extends to rules: if a rule does not meet the minimum confidence requirement, neither does any rule obtained from it by moving items from the antecedent to the consequent. Shrinking the antecedent can only increase its support, which is the denominator of the confidence, while the numerator (the support of the whole itemset) stays fixed. So, in Example 3, if $\left\{p_1, p_2, p_5\right\} \rightarrow \left\{p_7\right\}$ does not meet the minimum confidence, every rule from the same itemset with $p_7$ in the consequent also fails to meet it, and we can omit those calculations.
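
The pruning described above can be sketched by growing consequents one item at a time and extending only those that passed the confidence test. This is a simplified illustration under stated assumptions, not the `mlxtend` implementation: `prune_rules` and `count` are names introduced here, and the itemset used, `{Eggs, Onion, Kidney Beans}`, occurs in three of the five transactions of Example 1.

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
transactions = [frozenset(t) for t in dataset]

def count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

def prune_rules(itemset, count, min_confidence):
    """Rules antecedent -> consequent from one frequent itemset,
    extending only consequents that met min_confidence."""
    itemset = frozenset(itemset)
    kept = []
    consequents = {frozenset([item]) for item in itemset}
    size = 1
    while consequents and size < len(itemset):
        survivors = set()
        for cons in consequents:
            ante = itemset - cons
            # Counts are used instead of supports; the total cancels.
            conf = count(itemset) / count(ante)
            if conf >= min_confidence:
                kept.append((ante, cons, conf))
                survivors.add(cons)
        size += 1
        # Larger consequents are built only from surviving ones. This is
        # looser than a full subset check, but still correct because the
        # confidence of every candidate is computed before it is kept.
        consequents = {a | b for a in survivors for b in survivors
                       if len(a | b) == size}
    return kept

for ante, cons, conf in prune_rules({'Eggs', 'Onion', 'Kidney Beans'},
                                    count, min_confidence=0.8):
    print(sorted(ante), '->', sorted(cons), conf)
```

With a threshold of 0.8, the rule with `Onion` alone in the consequent has confidence 0.75 and fails, so no larger consequent containing `Onion` is ever evaluated.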

In [3]:
# association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.50)
display(rules)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (Eggs) | (Kidney Beans) | 0.8 | 1.0 | 0.8 | 1.0 | 1.0 | 0.0 | inf |
| 1 | (Kidney Beans) | (Eggs) | 1.0 | 0.8 | 0.8 | 0.8 | 1.0 | 0.0 | 1.0 |
| 2 | (Eggs) | (Onion) | 0.8 | 0.6 | 0.6 | 0.75 | 1.25 | 0.12 | 1.6 |
| 3 | (Onion) | (Eggs) | 0.6 | 0.8 | 0.6 | 1.0 | 1.25 | 0.12 | inf |
| 4 | (Milk) | (Kidney Beans) | 0.6 | 1.0 | 0.6 | 1.0 | 1.0 | 0.0 | inf |
| 5 | (Kidney Beans) | (Milk) | 1.0 | 0.6 | 0.6 | 0.6 | 1.0 | 0.0 | 1.0 |
| 6 | (Onion) | (Kidney Beans) | 0.6 | 1.0 | 0.6 | 1.0 | 1.0 | 0.0 | inf |
| 7 | (Kidney Beans) | (Onion) | 1.0 | 0.6 | 0.6 | 0.6 | 1.0 | 0.0 | 1.0 |
| 8 | (Yogurt) | (Kidney Beans) | 0.6 | 1.0 | 0.6 | 1.0 | 1.0 | 0.0 | inf |
| 9 | (Kidney Beans) | (Yogurt) | 1.0 | 0.6 | 0.6 | 0.6 | 1.0 | 0.0 | 1.0 |
