This example is based on the `mlxtend` library's documentation page ([link](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)). The notebook demonstrates some basic concepts of association analysis.

### Import Packages

In [None]:
import pandas as pd
from itertools import combinations
from mlxtend.preprocessing import TransactionEncoder

### Data

Let's create a toy data set for this exercise.

In [None]:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Garlic', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

### Prepare (transform) data

Convert the transactions into a binary representation: one row per transaction and one Boolean column per item.

In [None]:
te = TransactionEncoder()

te_ary = te.fit(dataset).transform(dataset)

te_ary

In [None]:
te.columns_

In [None]:
# let's create a dataframe from these results
df = pd.DataFrame(te_ary, columns=te.columns_)

df
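For intuition, the same binary representation can be built by hand. The block below is a minimal sketch in plain Python, not how `TransactionEncoder` is implemented; the three short `transactions` and the names `columns` and `rows` are made up for illustration.

```python
# Hypothetical mini data set, shortened so the result is easy to read
transactions = [['Milk', 'Eggs'],
                ['Eggs', 'Yogurt'],
                ['Milk', 'Eggs', 'Yogurt']]

# One column per unique item, in alphabetical order
columns = sorted({item for order in transactions for item in order})

# One row per transaction: True if the item occurs in that transaction
rows = [[item in order for item in columns] for order in transactions]

print(columns)
for row in rows:
    print(row)
```

Each row answers the question "does this transaction contain this item?", which is all the later support calculations need.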

### `Support`

Calculate `support` for each individual product.

In [None]:
df.sum()

In [None]:
item_supports = df.sum() / len(df)

item_supports

In [None]:
item_supports = item_supports.sort_values(ascending=False)

print(item_supports)

In [None]:
item_supports.plot.bar();
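The support of an item is the share of transactions that contain it. As a cross-check on the DataFrame calculation, here is a sketch that computes the same quantity directly from the raw transaction lists; the names `item_counts` and `manual_supports` are introduced here for illustration only.

```python
from collections import Counter

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Garlic', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

# Count each item at most once per transaction;
# set() drops the repeated 'Onion' in the last order
item_counts = Counter(item for order in dataset for item in set(order))

# support = number of transactions containing the item / number of transactions
manual_supports = {item: count / len(dataset)
                   for item, count in item_counts.items()}

print(manual_supports['Kidney Beans'])
print(manual_supports['Eggs'])
```

Without `set(order)`, 'Onion' would be counted four times across three transactions and its support would be overstated.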

Extract all two-way combinations.

In [None]:
# let's take the first row as an example first
dataset[0]

In [None]:
# extract all two-way combinations
list(combinations(dataset[0], 2))

We can use this approach to calculate `support` for each pair of items _from the entire dataset_.

In [None]:
# save each two-way item pair into a list
item_pairs = []

for order in dataset:
    # set() removes duplicate items within an order (e.g., the repeated 'Onion')
    pairs = combinations(set(order), 2)

    # for each product pair
    for item_pair in pairs:
        item_pairs.append(item_pair)

len(item_pairs)

In [None]:
item_pairs

Count how often each item pair occurs.

In [None]:
from collections import Counter

Counter(tuple(sorted(elem)) for elem in item_pairs)

We must apply `sorted()` here because tuples are order-sensitive: without it, `('Eggs', 'Kidney Beans')` would be counted separately from `('Kidney Beans', 'Eggs')`.
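The effect is easy to see on two tuples that hold the same items in a different order. The sketch below also shows `frozenset` as an alternative order-insensitive key; that is a swapped-in option for comparison, not what the rest of this notebook uses.

```python
from collections import Counter

# The same pair of items, written in two different orders
pairs = [('Eggs', 'Kidney Beans'), ('Kidney Beans', 'Eggs')]

# Raw tuples: the two orderings are different keys
unsorted_counts = Counter(pairs)

# Sorted tuples: both orderings collapse into one key
sorted_counts = Counter(tuple(sorted(pair)) for pair in pairs)

# frozenset keys ignore order as well
frozen_counts = Counter(frozenset(pair) for pair in pairs)

print(len(unsorted_counts), len(sorted_counts), len(frozen_counts))
```

Sorted tuples keep the keys readable when printed, which is convenient for the steps that follow.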

Let's save the results in a dictionary and sort it in descending order of frequency.

In [None]:
item_pair_ct = Counter(tuple(sorted(elem)) for elem in item_pairs)

item_pair_ct.items()

In [None]:
# sort the list of tuples from high to low frequency
sorted(item_pair_ct.items(), key=lambda x: x[1], reverse=True)

In [None]:
# store the sorted results
item_pair_ct_sorted = sorted(item_pair_ct.items(), key=lambda x: x[1], reverse=True)

# let's calculate the percentages (i.e., support) from these counts
item_pair_pct_sorted = {}

for item_pair, count in item_pair_ct_sorted:
    item_pair_pct_sorted[item_pair] = count / len(dataset)

print(item_pair_pct_sorted)

### `Support` Filter

In [None]:
# for this exercise, we will use a support threshold of 0.5
min_support = 0.5

# extract all items that satisfy the support criterion
item_supports[item_supports >= min_support]

In [None]:
ax = item_supports.plot.bar()
ax.axhline(min_support, c='r');

In [None]:
# print all item pairs that satisfy the support criterion
for key, value in item_pair_pct_sorted.items():
    if value >= min_support:
        print(key, value)

______

Instead of doing all these calculations manually, we can use `mlxtend`.

In [None]:
from mlxtend.frequent_patterns import apriori

frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

frequent_itemsets

In [None]:
from mlxtend.frequent_patterns import association_rules

association_rules(frequent_itemsets, metric='confidence', min_threshold=0.7)
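The confidence of a rule A → C is the support of A and C together divided by the support of A. The sketch below recomputes it from the raw transactions; `support` and `confidence` are helper functions written for this illustration and are not part of `mlxtend`.

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Garlic', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]


def support(itemset, transactions):
    """Share of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(order) for order in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(confidence({'Eggs'}, {'Kidney Beans'}, dataset))
print(confidence({'Kidney Beans'}, {'Eggs'}, dataset))
```

The two calls give different values because confidence is directional: swapping antecedent and consequent changes the denominator.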

To select rules by a different metric, adjust the `metric` and `min_threshold` arguments. For example, if you are only interested in rules with a `lift` of at least 1.2, you would do the following:

In [None]:
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1.2)

rules
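Lift compares how often the antecedent and consequent occur together with how often they would co-occur if they were independent: support(A and C) / (support(A) × support(C)). A value of 1 indicates independence; values above 1 indicate a positive association. The sketch below computes it by hand; `support` and `lift` are illustrative helpers, not `mlxtend` functions.

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Garlic', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]


def support(itemset):
    """Share of transactions in `dataset` that contain every item in `itemset`."""
    return sum(set(itemset) <= set(order) for order in dataset) / len(dataset)


def lift(antecedent, consequent):
    """Joint support divided by the product of the individual supports."""
    joint = support(set(antecedent) | set(consequent))
    return joint / (support(antecedent) * support(consequent))


print(lift({'Eggs'}, {'Onion'}))         # above 1: positive association
print(lift({'Eggs'}, {'Kidney Beans'}))  # Kidney Beans is in every transaction
```

Because Kidney Beans appears in every transaction, knowing that a basket contains Eggs says nothing new about it, so that rule cannot pass a lift threshold above 1.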

Pandas DataFrames make it easy to filter the results further. Let's say we are only interested in rules that satisfy the following criteria:

1. at least 2 antecedents
2. a confidence > 0.75
3. a lift score > 1.2

We could compute the antecedent length as follows:

In [None]:
rules['antecedent_len'] = rules['antecedents'].apply(len)

rules

Then, we can use pandas' selection syntax as shown below:

In [None]:
rules[ (rules['antecedent_len'] >= 2) &
       (rules['confidence'] > 0.75) &
       (rules['lift'] > 1.2) ]