# Import data
This is the code from the assignment to load the data.

In [11]:
import pandas as pd

dataset = pd.read_csv("dataset.csv")
baskets = dataset.groupby("user_id").product_id.apply(set).tolist()
baskets[:5]

[{5614842, 5766379},
 {5861791, 5894239},
 {5830270, 5830275},
 {5635117, 5751383, 5809910},
 {5767496, 5767497, 5891498}]

# Association rule mining algorithm
> Implement an association rule mining algorithm, or use an existing online implementation.
Show that you understand the method by describing its function (without using code) in your
report. Make sure you are able to get the confidence and support of any found association
rules.

> Run the association rule mining algorithm on the given dataset. At this point, use only the
user id and product id columns. What are the top 10 association rules in terms of support
your method finds? Also include the confidence of these rules. What can you say about the
number of items in these rules?

## Method 1
I started with the apriori algorithm from [apyori](https://pypi.org/project/apyori/), using code from [this article](https://www.section.io/engineering-education/apriori-algorithm-in-python/). However, apyori also allows empty baskets, which causes errors in the `inspect` function. You can skip ahead to my own implementation, which you can find under the "Method 4" subtitle.

In [64]:
from apyori import apriori

# support: the fraction of baskets in which a particular item or combination of items occurs
# confidence: how likely a customer is to consume item2 given that they have consumed item1
# lift: the strength of the association between item1 and item2, confidence / support(item2)
# TODO: define own min_support, min_confidence and min_lift
rule = apriori(transactions=baskets, min_support=0.003, min_confidence=0.2, min_lift=3, min_length=2, max_length=2)
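
To make the three metrics concrete, here is a minimal sketch that computes them by hand on a few toy baskets. The baskets and the helper names `support`, `confidence` and `lift` are made up for illustration; they are not part of apyori or of the dataset. Lift is taken here as confidence divided by the support of the right-hand side.

```python
# Toy baskets, invented for illustration (not taken from dataset.csv).
toy_baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"jam"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Estimate of P(rhs | lhs): support(lhs and rhs together) / support(lhs)."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

def lift(lhs, rhs, baskets):
    """Confidence divided by the support of the right-hand side."""
    return confidence(lhs, rhs, baskets) / support(rhs, baskets)

print(support({"bread", "butter"}, toy_baskets))       # 0.5
print(confidence({"bread"}, {"butter"}, toy_baskets))  # about 0.667
print(lift({"bread"}, {"butter"}, toy_baskets))        # about 1.333
```

A lift above 1 means the two sides occur together more often than they would if they were independent.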

In [65]:
results = list(rule)

# putting output into a pandas dataframe
def inspect(output):
    for result in output:
        try:
            tuple(result[2][0][0])[0]
        except Exception as e:
            print(result)
            raise e
    lhs = [tuple(result[2][0][0])[0] for result in output]
    rhs = [tuple(result[2][0][1])[0] for result in output]
    support = [result[1] for result in output]
    confidence = [result[2][0][2] for result in output]
    lift = [result[2][0][3] for result in output]
    return list(zip(lhs, rhs, support, confidence, lift))

output_DataFrame = pd.DataFrame(inspect(results),
                                columns=['Left_Hand_Side', 'Right_Hand_Side', 'Support', 'Confidence', 'Lift'])

output_DataFrame.nlargest(n=10, columns='Support')

|   | Left_Hand_Side | Right_Hand_Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 0 | 5677043 | 5697463 | 0.004187 | 0.329341 | 57.687425 |
| 2 | 5809912 | 5809910 | 0.00373 | 0.583333 | 25.459302 |
| 3 | 5814516 | 5814517 | 0.00373 | 0.875 | 201.664474 |
| 1 | 5809911 | 5809910 | 0.003578 | 0.746032 | 32.560196 |


## Method 2
Using code from [this site](https://towardsdatascience.com/apriori-association-rule-mining-explanation-and-python-implementation-290b42afdfc6), which uses the [apriori_python library](https://pypi.org/project/apriori-python/). This method is also very slow, so I try Eclat instead.

In [73]:
from apriori_python import apriori
freqItemSet, rules = apriori(baskets, minSup=0.01, minConf=0.001)
print(freqItemSet)
print(rules)

[[5614842, 5766379], [5894239, 5861791], [5830275, 5830270], [5635117, 5809910, 5751383], [5767496, 5767497, 5891498]]
{1: {frozenset({5809910}), frozenset({5649236}), frozenset({5677043}), frozenset({5790689})}}
[]


## Method 3
Let's try running eclat on it. To do this, I'll use code from [this site](https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17), which makes use of [pyECLAT](https://pypi.org/project/pyECLAT/). This method also gives an error.


In [74]:
data = pd.DataFrame(baskets)
data.head()

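|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5614842 | 5766379.0 | | | | | | | | |
| 1 | 5894239 | 5861791.0 | | | | | | | | |
| 2 | 5830275 | 5830270.0 | | | | | | | | |
| 3 | 5635117 | 5809910.0 | 5751383.0 | | | | | | | |
| 4 | 5767496 | 5767497.0 | 5891498.0 | | | | | | | |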


In [75]:
# we are looking for itemSETS
# we do not want to have any individual products returned
min_n_products = 2

# we want a minimum support of 7 baskets
# but we have to express it as a fraction of the total number of baskets
min_support = 7/len(baskets)

# we have no limit on the size of association rules
# so we set it to the longest transaction
max_length = max([len(x) for x in baskets])

In [76]:
from pyECLAT import ECLAT

# create an instance of eclat
my_eclat = ECLAT(data=data, verbose=True)

# fit the algorithm
rule_indices, rule_supports = my_eclat.fit(min_support=min_support,
                                           min_combination=min_n_products,
                                           max_combination=max_length)

100%|██████████| 11687/11687 [00:50<00:00, 232.95it/s]
100%|██████████| 11687/11687 [00:07<00:00, 1618.54it/s]
100%|██████████| 11687/11687 [00:07<00:00, 1587.45it/s]


ValueError: Cannot index with multidimensional key

## Method 4
None of the libraries I tried worked, so let's just implement Eclat ourselves.
I based this implementation on the explanation from the lecture and [this explanation](https://www.geeksforgeeks.org/ml-eclat-algorithm/).


In [12]:
def make_tid_dict(baskets):
    tid_dict = dict()
    for index, item_set in enumerate(baskets):
        for item in item_set:
            frozen_set = frozenset({item})
            if frozen_set not in tid_dict:
                tid_dict[frozen_set] = set()
            tid_dict[frozen_set].add(index)
    return tid_dict

tid_dict = make_tid_dict(baskets)
list(tid_dict.items())[:5]

[(frozenset({5614842}),
  {0,
   152,
   1026,
   1470,
   1576,
   1691,
   1799,
   1814,
   1910,
   2027,
   2322,
   2694,
   3259,
   4929,
   5229,
   5693,
   6024,
   6594,
   6647,
   6702,
   6755,
   6980,
   7351,
   8335,
   9987,
   10223,
   13101}),
 (frozenset({5766379}), {0, 85, 4706, 4933, 6206, 6473, 8476, 9647, 12363}),
 (frozenset({5894239}), {1, 7278, 10877}),
 (frozenset({5861791}), {1, 3982, 7420}),
 (frozenset({5830275}), {2, 2588, 2855, 12148})]
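
These tid-sets are the core of Eclat: the transactions that contain a larger itemset are exactly the intersection of the tid-sets of its items, so the support count is just the size of that intersection. A minimal sketch on toy data, assuming the same `itemset -> set of basket indices` layout as `tid_dict`; the item names, the indices and the helper `tids_of` are made up for illustration.

```python
# Toy tid-sets in the same shape as tid_dict: itemset -> set of basket indices.
# The item names and indices are invented for illustration.
toy_tids = {
    frozenset({"A"}): {0, 1, 2, 5},
    frozenset({"B"}): {1, 2, 3},
    frozenset({"C"}): {2, 5},
}

def tids_of(itemset, tid_dict):
    """Intersect the tid-sets of the single items that make up `itemset`."""
    tid_sets = [tid_dict[frozenset({item})] for item in itemset]
    return set.intersection(*tid_sets)

ab = tids_of({"A", "B"}, toy_tids)
abc = tids_of({"A", "B", "C"}, toy_tids)
print(ab, len(ab))    # baskets containing both A and B, and their count
print(abc, len(abc))  # baskets containing A, B and C, and their count
```

No second pass over the baskets is needed: once the tid-sets exist, every support count comes from set intersections alone.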

In [13]:
from tqdm import tqdm
from itertools import permutations

min_support = 2

def combine_sets(tid_dict, min_support):
    new_tid_dict = dict()
    combs = list(permutations(tid_dict.items(), r=2))
    for dict_item1, dict_item2 in tqdm(combs, total=len(combs)):
        item_set1, tid_set1 = dict_item1
        item_set2, tid_set2 = dict_item2

        new_set_items = tid_set1 & tid_set2 # Intersection of the tid-sets: the transactions that contain both itemsets
        if len(new_set_items) >= min_support:
            new_set = frozenset(item_set1 | item_set2) # Union of the two itemsets
            new_tid_dict[new_set] = new_set_items
    return new_tid_dict

def filter_min_support(tid_dict, min_support):
    items = dict()
    for item in tid_dict:
        if len(tid_dict[item]) >= min_support:
            items[item] = tid_dict[item]
    return items

items = filter_min_support(tid_dict, min_support)

def mine_freq_itemsets(items, min_support):
    new_results = items
    freq_itemsets = dict() # Note that frequent itemsets of length 1 will be skipped this way
    i = 1
    while new_results:
        i += 1
        print(f"Mining frequent itemsets of length {i}")
        new_results = combine_sets(new_results, min_support)
        freq_itemsets |= new_results # Note: the dict merge operator |= requires Python 3.9+
    return freq_itemsets

freq_itemsets = mine_freq_itemsets(items, min_support)
freq_itemsets

Mining frequent itemsets of length 2


100%|██████████| 29100630/29100630 [00:12<00:00, 2313691.33it/s]


Mining frequent itemsets of length 3


100%|██████████| 6592056/6592056 [00:02<00:00, 2750165.32it/s]


Mining frequent itemsets of length 4


100%|██████████| 159600/159600 [00:00<00:00, 2533334.79it/s]


Mining frequent itemsets of length 5


100%|██████████| 1332/1332 [00:00<00:00, 1351430.32it/s]


Mining frequent itemsets of length 6


100%|██████████| 6/6 [00:00<00:00, 5999.00it/s]


{frozenset({5614842, 5809910}): {1799, 6647},
 frozenset({4185, 5614842}): {152,
  1470,
  2322,
  4929,
  5229,
  5693,
  6024,
  7351,
  9987},
 frozenset({5614842, 5804261}): {6755, 6980},
 frozenset({3762, 5614842}): {1799, 1910, 2322, 6594, 6755, 7351},
 frozenset({3978, 5614842}): {1026, 6594, 13101},
 frozenset({5614842, 5764656}): {1910, 6755},
 frozenset({5766379, 5766390}): {4706, 6206, 8476},
 frozenset({5766377, 5766379}): {4706, 4933, 6206, 8476, 9647, 12363},
 frozenset({5751383, 5809910}): {3,
  7,
  969,
  7028,
  8649,
  8753,
  10659,
  10790,
  11375,
  13081,
  13124},
 frozenset({5809910, 5833334}): {7, 2755, 10576},
 frozenset({5692527, 5809910}): {4432, 6736},
 frozenset({5687151, 5809910}): {1901, 8438},
 frozenset({5763238, 5809910}): {9133, 10576},
 frozenset({5809910, 5849033}): {798, 969, 1554, 6720, 6970, 10659},
 frozenset({5809910, 5809912}): {43,
  285,
  673,
  722,
  798,
  956,
  1123,
  1348,
  1554,
  1756,
  1817,
  2054,
  3004,
  3895,
  4432,
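
`combine_sets` pairs the itemsets with `permutations`, so every unordered pair is visited twice. Since intersection and union are symmetric, `itertools.combinations` should give the same itemsets with half as many candidate pairs. A minimal sketch on toy data; the tid-sets and the helper `combine` are made up for illustration and are not the function used above.

```python
from itertools import combinations, permutations

# Toy tid-sets, invented for illustration.
toy_tids = {
    frozenset({"A"}): {0, 1, 2, 5},
    frozenset({"B"}): {1, 2, 3},
    frozenset({"C"}): {2, 5},
}

def combine(tid_dict, min_support, pair_fn):
    """Join every pair produced by `pair_fn`; return (frequent itemsets, pairs tried)."""
    result, tried = dict(), 0
    for (items1, tids1), (items2, tids2) in pair_fn(tid_dict.items(), 2):
        tried += 1
        shared = tids1 & tids2
        if len(shared) >= min_support:
            result[items1 | items2] = shared
    return result, tried

by_perm, n_perm = combine(toy_tids, 2, permutations)
by_comb, n_comb = combine(toy_tids, 2, combinations)
print(n_perm, n_comb)      # 6 3
print(by_perm == by_comb)  # True
```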
  

In [14]:
# Cool, we have quite some frequent itemsets now, let's create some rules based on this
from more_itertools import set_partitions

min_confidence = 0.5

class Rule:

    def __init__(self, x: frozenset, y: frozenset, confidence: float, sup: float):
        self.x = x
        self.y = y
        self.sup = sup
        self.confidence  = confidence

    def __str__(self):
        return f"{self.x} -> {self.y} ({self.confidence}, {self.sup})"

def get_rules(freq_itemsets, min_confidence):
    rules = list()

    for freq_itemset, tids in freq_itemsets.items():
        n_freq_itemset_transactions = len(tids)

        # Check all possible bi partitions and see if the generated rule would match the min confidence
        for x, y in set_partitions(freq_itemset, 2):
            x = frozenset(x)

            if x not in freq_itemsets:
                # freq_itemsets holds no itemsets of length 1 (see mine_freq_itemsets),
                # so every single-item x ends up here and its rule is skipped
                continue

            n_x_occurrences = len(freq_itemsets[x])
            confidence = n_freq_itemset_transactions / n_x_occurrences

            if confidence < min_confidence:
                continue

            y = frozenset(y)
            rules.append(Rule(x=x, y=y, confidence=confidence, sup=n_freq_itemset_transactions))
    return rules

rules = get_rules(freq_itemsets, min_confidence)
rules.sort(key=lambda x: (x.confidence, x.sup), reverse=True)
for rule in rules:
    print(rule)

frozenset({5814515, 5814517}) -> frozenset({5814518}) (1.0, 5)
frozenset({5814515, 5814516}) -> frozenset({5814517}) (1.0, 4)
frozenset({5814515, 5814516}) -> frozenset({5814518}) (1.0, 4)
frozenset({5814515, 5814516}) -> frozenset({5814517, 5814518}) (1.0, 4)
frozenset({5814515, 5814516, 5814517}) -> frozenset({5814518}) (1.0, 4)
frozenset({5846096, 5814516}) -> frozenset({5814517}) (1.0, 4)
frozenset({5880203, 5880204}) -> frozenset({5880205}) (1.0, 4)
frozenset({5809912, 5849033}) -> frozenset({5809910}) (1.0, 3)
frozenset({5707826, 5793261}) -> frozenset({5692527}) (1.0, 3)
frozenset({5853035, 5853036}) -> frozenset({5853038}) (1.0, 3)
frozenset({5755601, 5814516}) -> frozenset({5814517}) (1.0, 3)
frozenset({5776130, 5814516}) -> frozenset({5814517}) (1.0, 3)
frozenset({5759492, 5767494}) -> frozenset({5766980}) (1.0, 3)
frozenset({5831969, 5803691}) -> frozenset({5803692}) (1.0, 3)
frozenset({5858914, 5848309}) -> frozenset({5893870}) (1.0, 3)
frozenset({5745713, 5745714}) -> froz
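
Every rule printed above has at least two items on the left-hand side. That follows from `get_rules`: `freq_itemsets` contains no itemsets of length 1, so a single-item `x` is never found and its rule is skipped. As far as I can tell, `set_partitions` also yields each split only once, so only one direction of each split is tested. A minimal sketch of a variant that keeps the single-item tid-sets in the same dict and tries every non-empty proper subset as left-hand side; the toy tid-sets and the helper `rules_from` are made up for illustration.

```python
from itertools import combinations

# Toy tid-sets, invented for illustration. Singletons are kept in the same
# dict as the larger itemsets so that every left-hand side can be looked up.
tids = {
    frozenset({"A"}): {0, 1, 2, 5},
    frozenset({"B"}): {1, 2, 3},
    frozenset({"A", "B"}): {1, 2},
}

def rules_from(tids, min_confidence):
    """Return (lhs, rhs, confidence, support count) tuples.

    Assumes every subset of a stored itemset is also stored in `tids`.
    """
    found = []
    for itemset, itemset_tids in tids.items():
        # Every non-empty proper subset of the itemset is a possible left-hand side.
        for size in range(1, len(itemset)):
            for x in map(frozenset, combinations(sorted(itemset), size)):
                conf = len(itemset_tids) / len(tids[x])
                if conf >= min_confidence:
                    found.append((x, itemset - x, conf, len(itemset_tids)))
    return found

for x, y, conf, sup in rules_from(tids, 0.5):
    print(set(x), "->", set(y), round(conf, 2), sup)
```

With the real data this would mean merging `items` (the frequent single items) into `freq_itemsets` before generating rules.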

>If you were asked to give the 10 most interesting rules, which 10 would you give and why?

So far I have sorted on confidence first and support second. If we set the confidence threshold high enough, e.g. 70%, it might be more useful to rank mainly on support; the next cell ranks the remaining rules by confidence × support.
TODO: What makes a rule really interesting?

In [15]:
new_confidence = 0.7 # Note, new confidence must be higher than the previously defined confidence
new_rules = [rule for rule in rules if rule.confidence >= new_confidence]
new_rules.sort(key=lambda x: (x.confidence*x.sup, x.sup, x.confidence), reverse=True)
for rule in new_rules[:10]:
    print(rule)

frozenset({5814516, 5804820}) -> frozenset({5814517}) (0.8888888888888888, 8)
frozenset({5814515, 5814517}) -> frozenset({5814518}) (1.0, 5)
frozenset({5886282, 5892179, 5844300}) -> frozenset({5900651}) (0.75, 6)
frozenset({5814515, 5814516}) -> frozenset({5814517}) (1.0, 4)
frozenset({5814515, 5814516}) -> frozenset({5814518}) (1.0, 4)
frozenset({5814515, 5814516}) -> frozenset({5814517, 5814518}) (1.0, 4)
frozenset({5814515, 5814516, 5814517}) -> frozenset({5814518}) (1.0, 4)
frozenset({5846096, 5814516}) -> frozenset({5814517}) (1.0, 4)
frozenset({5880203, 5880204}) -> frozenset({5880205}) (1.0, 4)
frozenset({5880640, 5886282}) -> frozenset({5892179}) (0.7142857142857143, 5)


> A lot of information from the dataset was omitted in the current association rules, such as
the event types, which describe whether an item was viewed, purchased, added or removed
from the cart and the prices of items. Find a way to incorporate the additional information
provided into your association rules. Describe what you have added in your report.

We're only interested in the products people actually bought, so we keep only the events with event type `purchase`.


In [16]:
filtered_dataset = dataset[dataset["event_type"] == "purchase"]
filtered_baskets = filtered_dataset.groupby("user_id").product_id.apply(set).tolist()
filtered_baskets[:5]

[{5614842, 5766379},
 {5861791, 5894239},
 {5830270},
 {5751383},
 {5767496, 5767497, 5891498}]
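
Filtering on purchases throws the other event types away. One possible alternative, sketched here under assumptions and not used in the rest of this notebook, is to keep every event and make the event type part of the item, so that a rule can relate a viewed product to a purchased one. The toy event log, the `"view"` label and both helper functions are made up for illustration; only `"purchase"` is taken from the cell above.

```python
from collections import defaultdict

# Toy event log, invented for illustration: (user_id, event_type, product_id).
# Only "purchase" appears in the notebook; "view" is an assumed label.
events = [
    (1, "view", 10), (1, "purchase", 10), (1, "purchase", 11),
    (2, "view", 10), (2, "view", 12),
    (3, "view", 12), (3, "purchase", 11),
]

def purchase_baskets(events):
    """Keep purchases only, mirroring the filter in the cell above."""
    baskets = defaultdict(set)
    for user_id, event_type, product_id in events:
        if event_type == "purchase":
            baskets[user_id].add(product_id)
    return list(baskets.values())

def tagged_baskets(events):
    """Keep every event, but make the event type part of the item."""
    baskets = defaultdict(set)
    for user_id, event_type, product_id in events:
        baskets[user_id].add((event_type, product_id))
    return list(baskets.values())

print(purchase_baskets(events))  # user 2 has no purchases and disappears
print(tagged_baskets(events))    # one basket per user, items are (event, product) pairs
```

Because the tagged items are hashable tuples, baskets in this form could be passed to `make_tid_dict` unchanged.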

In [18]:
def baskets_to_rules(baskets, min_support, min_confidence):
    tid_dict = make_tid_dict(baskets)
    items = filter_min_support(tid_dict, min_support)
    freq_itemsets = mine_freq_itemsets(items, min_support)
    rules = get_rules(freq_itemsets, min_confidence)
    return rules

rules = baskets_to_rules(filtered_baskets, 2, 0.7)
rules.sort(key=lambda x: (x.confidence, x.sup), reverse=True)

for rule in rules[:10]:
    print(rule)

Mining frequent itemsets of length 2


100%|██████████| 14292180/14292180 [00:05<00:00, 2570074.78it/s]


Mining frequent itemsets of length 3


100%|██████████| 825372/825372 [00:00<00:00, 2555095.88it/s]


Mining frequent itemsets of length 4


100%|██████████| 6006/6006 [00:00<00:00, 2018023.70it/s]


Mining frequent itemsets of length 5


100%|██████████| 12/12 [00:00<?, ?it/s]

frozenset({5755601, 5814516}) -> frozenset({5814517}) (1.0, 3)
frozenset({5759492, 5767494}) -> frozenset({5766980}) (1.0, 3)
frozenset({5848309, 5893870}) -> frozenset({5883311}) (1.0, 3)
frozenset({5880203, 5880204}) -> frozenset({5880205}) (1.0, 3)
frozenset({5809912, 5849033}) -> frozenset({5809910}) (1.0, 2)
frozenset({5763379, 5814516}) -> frozenset({5814517}) (1.0, 2)
frozenset({5677448, 5677025}) -> frozenset({5676783}) (1.0, 2)
frozenset({5759491, 5753484}) -> frozenset({5651975}) (1.0, 2)
frozenset({5810672, 5736504}) -> frozenset({5759491}) (1.0, 2)
frozenset({5677043, 5649236}) -> frozenset({5790563}) (1.0, 2)





> After adding additional information, which rules would you deem most interesting now, and
why?