# Import data
This is the (slightly modified) code from the assignment to load the data.

In [4]:
import pandas as pd

dataset = pd.read_csv("Assignment_1/dataset.csv")
baskets = dataset.groupby("user_id").product_id.apply(set).tolist()
baskets[:5]

[{5614842, 5766379},
 {5861791, 5894239},
 {5830270, 5830275},
 {5635117, 5751383, 5809910},
 {5767496, 5767497, 5891498}]

# Association rule mining algorithm
> Implement an association rule mining algorithm, or use an existing online implementation.
Show that you understand the method by describing its function (without using code) in your
report. Make sure you are able to get the confidence and support of any found association
rules.

> Run the association rule mining algorithm on the given dataset. At this point, use only the
user id and product id columns. What are the top 10 association rules in terms of support
your method finds? Also include the confidence of these rules. What can you say about the
number of items in these rules?

## Method 1
I started with the apyori implementation, but it also allowed empty baskets, which caused errors in the `inspect` function. If you only want to see my own implementation, skip ahead to the "Method 4" section.
The code below is adapted from [this article](https://www.section.io/engineering-education/apriori-algorithm-in-python/) and uses the apriori algorithm from [apyori](https://pypi.org/project/apyori/).

In [64]:
from apyori import apriori

# support: the fraction of baskets that contain a particular item or combination of items
# confidence: how likely a customer is to consume item2 given that they have consumed item1
# lift: the strength of the association, confidence(item1 -> item2) / support(item2)
# TODO: define own min_support, min_confidence and min_lift
rule = apriori(transactions=baskets, min_support=0.003, min_confidence=0.2, min_lift=3, min_length=2, max_length=2)
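
To make the three metrics from the comments above concrete, here is a minimal sketch that computes them by brute force for a single rule. The function name `rule_metrics` and the toy baskets are hypothetical and only for illustration; this is not how apyori computes them internally.

```python
def rule_metrics(baskets, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs -> rhs.

    Hypothetical helper: counts baskets by brute force, so it is only
    meant for checking individual rules, not for mining.
    """
    lhs, rhs = set(lhs), set(rhs)
    n = len(baskets)
    count_lhs = sum(1 for b in baskets if lhs <= b)
    count_rhs = sum(1 for b in baskets if rhs <= b)
    count_both = sum(1 for b in baskets if (lhs | rhs) <= b)

    support = count_both / n                                  # P(lhs and rhs)
    confidence = count_both / count_lhs if count_lhs else 0.0  # P(rhs | lhs)
    lift = confidence / (count_rhs / n) if count_rhs else 0.0  # P(rhs | lhs) / P(rhs)
    return support, confidence, lift


# Toy data (made up), in the same list-of-sets shape as `baskets`
toy_baskets = [{1, 2}, {1, 2, 3}, {1, 3}, {2, 3}, {4}]
support, confidence, lift = rule_metrics(toy_baskets, {1}, {2})
print(support, confidence, lift)
```

A lift above 1 means the two sides occur together more often than they would if they were independent.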

In [65]:
results = list(rule)

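# put the output into a pandas DataFrame
def inspect(output):
    # sanity check: every result must have a non-empty left-hand side;
    # print the result that cannot be unpacked before re-raising the error
    for result in output:
        try:
            tuple(result[2][0][0])[0]
        except Exception:
            print(result)
            raise
    # result[1] is the support of the itemset; result[2][0] is its first rule,
    # stored as (left-hand side, right-hand side, confidence, lift)
    lhs = [tuple(result[2][0][0])[0] for result in output]
    rhs = [tuple(result[2][0][1])[0] for result in output]
    support = [result[1] for result in output]
    confidence = [result[2][0][2] for result in output]
    lift = [result[2][0][3] for result in output]
    return list(zip(lhs, rhs, support, confidence, lift))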

output_DataFrame = pd.DataFrame(inspect(results),
                                columns=['Left_Hand_Side', 'Right_Hand_Side', 'Support', 'Confidence', 'Lift'])

output_DataFrame.nlargest(n=10, columns='Support')

| | Left_Hand_Side | Right_Hand_Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 0 | 5677043 | 5697463 | 0.004187 | 0.329341 | 57.687425 |
| 2 | 5809912 | 5809910 | 0.00373 | 0.583333 | 25.459302 |
| 3 | 5814516 | 5814517 | 0.00373 | 0.875 | 201.664474 |
| 1 | 5809911 | 5809910 | 0.003578 | 0.746032 | 32.560196 |
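
Note that `inspect` reads only the first rule of each result (`result[2][0]`), so if a result carries several rules for the same itemset, the others are dropped, and an empty side makes `tuple(...)[0]` raise an `IndexError`. The sketch below shows one possible way to flatten every rule instead. To stay self-contained it uses stand-in named tuples that mirror the field layout of apyori's results (`items`, `support`, `ordered_statistics`, and `items_base`, `items_add`, `confidence`, `lift`); the function name `flatten_rules` and the sample values are made up.

```python
from collections import namedtuple

# Stand-ins mirroring the field layout of apyori's result tuples, so the
# sketch runs without the library; real results from apriori() are assumed
# to expose the same field names.
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])


def flatten_rules(records):
    """Return one (lhs, rhs, support, confidence, lift) row per rule."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            # Skip rules with an empty side instead of failing on them
            if not stat.items_base or not stat.items_add:
                continue
            rows.append((
                tuple(sorted(stat.items_base)),
                tuple(sorted(stat.items_add)),
                record.support,
                stat.confidence,
                stat.lift,
            ))
    return rows


# Made-up sample: one itemset with an empty-sided rule and both directions
sample = [
    RelationRecord(
        items=frozenset({"a", "b"}),
        support=0.5,
        ordered_statistics=[
            OrderedStatistic(frozenset(), frozenset({"a", "b"}), 0.5, 1.0),
            OrderedStatistic(frozenset({"a"}), frozenset({"b"}), 0.8, 1.0),
            OrderedStatistic(frozenset({"b"}), frozenset({"a"}), 0.625, 1.0),
        ],
    )
]
rows = flatten_rules(sample)
for row in rows:
    print(row)
```

The rows can be passed to `pd.DataFrame` with the same column names as before, since each row has the same five fields.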


## Method 2
Using code from [this site](https://towardsdatascience.com/apriori-association-rule-mining-explanation-and-python-implementation-290b42afdfc6), which uses the [apriori_python library](https://pypi.org/project/apriori-python/). This method is also very slow, so I tried eclat instead.

In [73]:
from apriori_python import apriori
freqItemSet, rules = apriori(baskets, minSup=0.01, minConf=0.001)
print(freqItemSet)
print(rules)

[[5614842, 5766379], [5894239, 5861791], [5830275, 5830270], [5635117, 5809910, 5751383], [5767496, 5767497, 5891498]]
{1: {frozenset({5809910}), frozenset({5649236}), frozenset({5677043}), frozenset({5790689})}}
[]


## Method 3
Let's try running eclat on it. To do this, I'll use code from [this site](https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17), which makes use of [pyECLAT](https://pypi.org/project/pyECLAT/). This method also gives an error.


In [74]:
data = pd.DataFrame(baskets)
data.head()

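| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5614842 | 5766379.0 | | | | | | | | |
| 1 | 5894239 | 5861791.0 | | | | | | | | |
| 2 | 5830275 | 5830270.0 | | | | | | | | |
| 3 | 5635117 | 5809910.0 | 5751383.0 | | | | | | | |
| 4 | 5767496 | 5767497.0 | 5891498.0 | | | | | | | |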


In [75]:
# we are looking for itemSETS
# we do not want to have any individual products returned
min_n_products = 2

# we want a minimum support of 7 baskets,
# but it has to be expressed as a fraction of all baskets
min_support = 7/len(baskets)

# we have no limit on the size of association rules
# so we set it to the size of the longest basket
max_length = max(len(x) for x in baskets)

In [76]:
from pyECLAT import ECLAT

# create an instance of eclat
my_eclat = ECLAT(data=data, verbose=True)

# fit the algorithm
rule_indices, rule_supports = my_eclat.fit(min_support=min_support,
                                           min_combination=min_n_products,
                                           max_combination=max_length)

100%|██████████| 11687/11687 [00:50<00:00, 232.95it/s]
100%|██████████| 11687/11687 [00:07<00:00, 1618.54it/s]
100%|██████████| 11687/11687 [00:07<00:00, 1587.45it/s]


ValueError: Cannot index with multidimensional key

## Method 4
None of the libraries tried so far worked, so let's just implement eclat ourselves.
TODO: Maybe this site can help: https://www.geeksforgeeks.org/ml-eclat-algorithm/
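
As a starting point, here is a minimal sketch of the eclat idea: store for every item the set of baskets it occurs in (its "tidset"), and grow itemsets depth-first by intersecting tidsets. The function names `eclat` and `rules_from_itemsets`, the `min_count` parameter and the toy baskets are all hypothetical, and the sketch has not been tuned or run on the assignment dataset.

```python
from collections import defaultdict
from itertools import combinations


def eclat(baskets, min_count):
    """Return {frozenset(itemset): number of baskets containing it}."""
    # Vertical layout: item -> set of basket indices (tidset)
    tidsets = defaultdict(set)
    for tid, basket in enumerate(baskets):
        for item in basket:
            tidsets[item].add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # Each candidate is (item, tidset of prefix + item), already frequent
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[frozenset(itemset)] = len(tids)
            # Intersect with the remaining candidates to grow the itemset
            narrowed = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_count:
                    narrowed.append((other, shared))
            if narrowed:
                extend(itemset, narrowed)

    singles = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_count),
        key=lambda pair: pair[0],
    )
    extend((), singles)
    return frequent


def rules_from_itemsets(frequent, n_baskets):
    """Return (lhs, rhs, support, confidence) for every split of every itemset."""
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is frequent, so the lookup succeeds
                rules.append((lhs, itemset - lhs, count / n_baskets, count / frequent[lhs]))
    return rules


# Toy data (made up), in the same list-of-sets shape as `baskets`
toy_baskets = [{1, 2}, {1, 2, 3}, {1, 3}, {2, 3}, {1, 2, 3}]
frequent = eclat(toy_baskets, min_count=2)
rules = rules_from_itemsets(frequent, len(toy_baskets))
top_rules = sorted(rules, key=lambda rule: rule[2], reverse=True)[:10]
```

Sorting the rules by their third field gives the top rules by support, with the confidence available in the fourth field. Working with absolute counts (`min_count`) avoids converting the threshold to a fraction first.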


> If you were asked to give the 10 most interesting rules, which 10 would you give and why?

> A lot of information from the dataset was omitted in the current association rules, such as
the event types, which describe whether an item was viewed, purchased, added or removed
from the cart and the prices of items. Find a way to incorporate the additional information
provided into your association rules. Describe what you have added in your report.

> After adding additional information, which rules would you deem most interesting now, and
why?