# Association Rule Mining


## Support

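The support of $X -> Y$ equals the percentage of baskets that contain both X and Y. We count the baskets that contain both items and divide that count by the total number of baskets. In the formula below, $X \cup Y$ denotes the combined itemset, so $P(X \cup Y)$ is the probability that a basket contains every item in X and every item in Y.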

$support(X -> Y) = P(X \cup Y)$
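
As a minimal sketch of this definition, the helper below counts the baskets that contain every item of an itemset and divides by the number of baskets. The function name `itemset_support` and the four sample baskets are illustrative assumptions, not part of the dataset used later.

```python
def itemset_support(baskets, itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for basket in baskets if wanted <= set(basket))
    return hits / len(baskets)


# Hypothetical sample data, for illustration only.
sample_baskets = [['bread', 'milk'],
                  ['bread', 'jam'],
                  ['milk', 'jam'],
                  ['bread', 'milk', 'jam']]

# Two of the four baskets contain both bread and milk.
print(itemset_support(sample_baskets, ['bread', 'milk']))  # 0.5
```

Because support only looks at which items appear together, `bread -> milk` and `milk -> bread` always have the same support.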

## Confidence

The confidence of X leading to Y is the probability of Y given X. In other words, it is the percentage of baskets that contain both X and Y divided by the percentage of baskets that contain X.

$confidence(X -> Y) = P(Y|X)$
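
The conditional probability can be sketched directly from basket counts, since the total number of baskets cancels out of the ratio. The names `count_containing` and `rule_confidence` and the sample baskets are assumptions made for this illustration.

```python
def count_containing(baskets, itemset):
    """Number of baskets that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for basket in baskets if wanted <= set(basket))


def rule_confidence(baskets, antecedent, consequent):
    """confidence(X -> Y) = count(X and Y) / count(X)."""
    with_x = count_containing(baskets, antecedent)
    if with_x == 0:
        raise ValueError('antecedent never occurs; confidence is undefined')
    with_both = count_containing(baskets, list(antecedent) + list(consequent))
    return with_both / with_x


# Hypothetical sample data, for illustration only.
sample_baskets = [['bread', 'milk'],
                  ['bread', 'jam'],
                  ['bread'],
                  ['bread', 'milk', 'jam']]

print(rule_confidence(sample_baskets, ['bread'], ['milk']))  # 2 / 4 = 0.5
print(rule_confidence(sample_baskets, ['milk'], ['bread']))  # 2 / 2 = 1.0
```

Unlike support, confidence depends on the direction of the rule: every milk basket here contains bread, but only half of the bread baskets contain milk.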

## Association Rules

We read the following association rule as follows: 1% of all baskets contain the combination of vanilla wafers, bananas and whipped cream, and 40% of the customers who purchased vanilla wafers also purchased bananas and whipped cream.

```
vanilla wafers -> bananas, whipped cream
[support=1%, confidence=40%]
```

The left hand side is the determining item, called the __antecedent__.

The right hand side is the resulting item, called the __consequent__.

In [17]:
from collections import Counter

In [16]:
data = [['vanilla wafers', 'bananas', 'dog food'],
        ['bananas', 'bread','yogurt'],
        ['bananas', 'apples', 'yogurt'],
        ['vanilla wafers', 'bananas', 'whipped cream'],
        ['bread', 'vanilla wafers', 'yogurt'],
        ['milk', 'bread', 'bananas'],
        ['vanilla wafers', 'apples', 'bananas'],
        ['yogurt', 'apples', 'vanilla wafers'],
        ['vanilla wafers', 'bananas', 'milk'],
        ['bananas', 'bread', 'peanut butter']]
data

[['vanilla wafers', 'bananas', 'dog food'],
 ['bananas', 'bread', 'yogurt'],
 ['bananas', 'apples', 'yogurt'],
 ['vanilla wafers', 'bananas', 'whipped cream'],
 ['bread', 'vanilla wafers', 'yogurt'],
 ['milk', 'bread', 'bananas'],
 ['vanilla wafers', 'apples', 'bananas'],
 ['yogurt', 'apples', 'vanilla wafers'],
 ['vanilla wafers', 'bananas', 'milk'],
 ['bananas', 'bread', 'peanut butter']]

In [35]:
# Find the individual support for each item in the basket.
flattened = sorted([item 
                    for items in data 
                    for item in items])
supports = Counter(flattened)
supports

Counter({'apples': 3,
         'bananas': 8,
         'bread': 4,
         'dog food': 1,
         'milk': 2,
         'peanut butter': 1,
         'vanilla wafers': 6,
         'whipped cream': 1,
         'yogurt': 4})

In [40]:
# How to calculate the support for {vanilla wafers, bananas}?
# support(vanilla wafers -> bananas) is the percentage of baskets that contain both vanilla wafers and bananas.
vanilla_wafers_and_bananas = 0
for basket in data:
    if 'vanilla wafers' in basket and 'bananas' in basket:
        vanilla_wafers_and_bananas += 1
support = vanilla_wafers_and_bananas / len(data)
support

0.4

In [46]:
# How to calculate the confidence(vanilla wafers -> bananas)?
# The percentage of baskets containing both vanilla wafers and bananas over the baskets containing vanilla wafers.
confidence = vanilla_wafers_and_bananas / supports['vanilla wafers']
f'vanilla wafers -> bananas [support={support*100:.0f}%, confidence={confidence * 100:.0f}%]'

'vanilla wafers -> bananas [support=40%, confidence=67%]'

In [47]:
# The opposite rule might not produce the same results.
# confidence(bananas -> vanilla wafers): baskets with both items over baskets containing bananas.
confidence = vanilla_wafers_and_bananas / supports['bananas']
f'bananas -> vanilla wafers [support={support*100:.0f}%, confidence={confidence * 100:.0f}%]'

'bananas -> vanilla wafers [support=40%, confidence=50%]'

### Conclusion
The rule `vanilla wafers -> bananas` is stronger than `bananas -> vanilla wafers`: both have the same support (40%), but the first has the higher confidence (67% versus 50%).
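
The same comparison can be extended to every pair of items. The sketch below enumerates all single-item rules over the ten baskets from the example and keeps those above a minimum support and confidence. The thresholds `MIN_SUPPORT` and `MIN_CONFIDENCE` are hypothetical values picked for illustration, and the brute-force loop is only practical for small datasets.

```python
from itertools import permutations

baskets = [['vanilla wafers', 'bananas', 'dog food'],
           ['bananas', 'bread', 'yogurt'],
           ['bananas', 'apples', 'yogurt'],
           ['vanilla wafers', 'bananas', 'whipped cream'],
           ['bread', 'vanilla wafers', 'yogurt'],
           ['milk', 'bread', 'bananas'],
           ['vanilla wafers', 'apples', 'bananas'],
           ['yogurt', 'apples', 'vanilla wafers'],
           ['vanilla wafers', 'bananas', 'milk'],
           ['bananas', 'bread', 'peanut butter']]


def count_containing(itemset):
    """Number of baskets that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for basket in baskets if wanted <= set(basket))


items = sorted({item for basket in baskets for item in basket})

# One (antecedent, consequent, support, confidence) tuple per ordered pair
# of items that occurs together at least once.
rules = []
for x, y in permutations(items, 2):
    both = count_containing([x, y])
    if both:
        rules.append((x, y, both / len(baskets), both / count_containing([x])))

# Hypothetical thresholds, chosen only to keep the listing short.
MIN_SUPPORT = 0.3
MIN_CONFIDENCE = 0.5
strong = [r for r in rules if r[2] >= MIN_SUPPORT and r[3] >= MIN_CONFIDENCE]

for x, y, s, c in sorted(strong, key=lambda r: r[3], reverse=True):
    print(f'{x} -> {y} [support={s*100:.0f}%, confidence={c*100:.0f}%]')
```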

## Added Value

Added value measures whether pairing the items in the baskets is better than displaying the items individually. If the added value is positive, customers who buy the left-side item are more likely than the average customer to buy the right-side item, so pairing helps. If it is negative, the individual items fare better.
```
added value = confidence of rule - support of right side
```
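
As a sketch of how the formula applies to raw baskets rather than precomputed supports, the snippet below evaluates the added value of `vanilla wafers -> bananas` on the ten baskets used earlier. The helper names `fraction_containing` and `added_value` are assumptions introduced for this illustration.

```python
baskets = [['vanilla wafers', 'bananas', 'dog food'],
           ['bananas', 'bread', 'yogurt'],
           ['bananas', 'apples', 'yogurt'],
           ['vanilla wafers', 'bananas', 'whipped cream'],
           ['bread', 'vanilla wafers', 'yogurt'],
           ['milk', 'bread', 'bananas'],
           ['vanilla wafers', 'apples', 'bananas'],
           ['yogurt', 'apples', 'vanilla wafers'],
           ['vanilla wafers', 'bananas', 'milk'],
           ['bananas', 'bread', 'peanut butter']]


def fraction_containing(itemset):
    """Support: fraction of baskets that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for basket in baskets if wanted <= set(basket)) / len(baskets)


def added_value(antecedent, consequent):
    """confidence(antecedent -> consequent) - support(consequent)."""
    confidence = (fraction_containing([antecedent, consequent])
                  / fraction_containing([antecedent]))
    return confidence - fraction_containing([consequent])


av = added_value('vanilla wafers', 'bananas')
print(f'{av:+.3f}')  # -0.133
```

The result is negative because bananas appear in 80% of all baskets but in only about 67% of the baskets that contain vanilla wafers.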

In [59]:
# Each entry pairs an itemset with its support value.
data = [['vanilla wafers', 0.8],
        ['bananas', 0.3],
        ['vanilla wafers, bananas', 0.3]]

# Observation: The support of each individual item is at least as high as the support of the pair.

# What is the confidence(vanilla wafers -> bananas)?
# confidence(vanilla wafers -> bananas) = support(vanilla wafers U bananas) / support(vanilla wafers)
confidence = data[-1][-1] / data[0][1]
confidence

0.37499999999999994

In [60]:
# The right side of the rule is bananas, which is the second entry in data.
support_bananas = data[1][-1]
added_value = confidence - support_bananas
added_value

0.07499999999999996

In [63]:
# A different support value for vanilla wafers.
data = [['vanilla wafers', 0.3],
        ['bananas', 0.3],
        ['vanilla wafers, bananas', 0.3]]

confidence = data[-1][-1] / data[0][1]

# Our confidence is now 1.0, which means that everyone who buys vanilla wafers also buys bananas!
confidence

1.0

In [62]:
support_bananas = data[1][-1]  # bananas is the second entry in data
added_value = confidence - support_bananas
added_value

0.7