We need two libraries, mlxtend and pandas. To check whether they are already installed, run the following command:

```
$ pip freeze | egrep "(mlxtend|pandas)"
mlxtend==0.17.0
pandas==0.25.1
```



In [None]:
#importing the necessary libraries

import mlxtend
import pandas as pd



# EXERCISES 

### Exercise 2.1 Association rules from frequent itemsets

1. First, you need to load the dataset into memory, using the csv module. Make sure you identify all valid rows. Also consider that rows having an InvoiceNo that starts with C should be discarded, as they indicate that the invoice is about a cancelled purchase.
2. Now that you have a dataset of items, you should aggregate it at an “invoice” level. For each invoice (identified by InvoiceNo) there can be multiple items (from multiple rows) in the dataset. For each invoice, you should build a list of all items belonging to it. For the example invoice presented in 1.2.1, you want to build the following list:
```
[ "GARDENERS KNEELING PAD KEEP CALM",
"HOT WATER BOTTLE KEEP CALM",
"DOORMAT KEEP CALM AND COME IN" ]
```
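Steps 1 and 2 can be sketched as follows. This is only a minimal illustration under stated assumptions: the column name `Description`, the invoice numbers and the in-memory sample are hypothetical stand-ins for the real file (only `InvoiceNo` and the three item names come from the text above), and `load_invoices` is a helper name introduced here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for the real CSV file. The "Description"
# column name and the invoice numbers are assumptions made for illustration.
SAMPLE_CSV = """InvoiceNo,Description
574021,GARDENERS KNEELING PAD KEEP CALM
574021,HOT WATER BOTTLE KEEP CALM
574021,DOORMAT KEEP CALM AND COME IN
C574022,HOT WATER BOTTLE KEEP CALM
574023,
"""

def load_invoices(lines):
    """Group item descriptions by invoice, skipping cancelled and incomplete rows."""
    invoices = defaultdict(set)
    for row in csv.DictReader(lines):
        invoice_no = (row.get("InvoiceNo") or "").strip()
        item = (row.get("Description") or "").strip()
        if not invoice_no or not item:
            continue  # rows with a missing field are treated as invalid
        if invoice_no.startswith("C"):
            continue  # cancelled purchase
        invoices[invoice_no].add(item)
    # One sorted list of distinct items per invoice
    return {no: sorted(items) for no, items in invoices.items()}

invoices = load_invoices(io.StringIO(SAMPLE_CSV))
print(invoices)
```

With the real dataset, the same function could be called on a file opened with `open(path, newline="")`. Collecting items in a set drops duplicates within an invoice, which is one possible choice since the matrix built later only records presence or absence.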
3. You should now have a list (one for each invoice) of lists (each list containing the items bought for that invoice). Now, we need to convert this into a matrix form. Of the many possible formats, we will use the one expected by the Mlxtend library, which is as follows. Given an ordered list of M possible items (in this case, all possible products that can be bought), and given N itemsets (in this case, invoices), we should build a matrix of N rows and M columns. The element at the ith row and jth column should be 1 if the ith itemset (invoice) contains the jth item (product), 0 otherwise. For the following example:
```
a,b,c
b,c
a,c,d
a,b
```
The list of all possible items is ```[a, b, c, d]```. As such, the matrix that we will build is the following:
```
1 1 1 0
0 1 1 0
1 0 1 1
1 1 0 0
```
Once we have defined this matrix (as a list of lists), we can use Pandas to convert it to a DataFrame
(which is, essentially, a table) with the following code:
```
import pandas as pd
all_items = ['a', 'b', 'c', 'd'] # this is your list of items
pa_matrix = [
[1,1,1,0],
[0,1,1,0],
[1,0,1,1],
[1,1,0,0]
] # this is the matrix you built from the itemsets
df = pd.DataFrame(data=pa_matrix, columns=all_items)
```
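The matrix itself can be derived from the itemsets rather than typed by hand. The sketch below does this for the four-itemset example, using plain Python only; sorting the items alphabetically is an assumption (any fixed order works as long as the columns follow it).

```python
# The four example itemsets from the text
itemsets = [
    ["a", "b", "c"],
    ["b", "c"],
    ["a", "c", "d"],
    ["a", "b"],
]

# Ordered list of all distinct items (alphabetical order is an arbitrary choice)
all_items = sorted({item for itemset in itemsets for item in itemset})

# One row per itemset, one column per item: 1 if present, 0 otherwise
pa_matrix = []
for itemset in itemsets:
    present = set(itemset)  # set membership tests are fast
    pa_matrix.append([1 if item in present else 0 for item in all_items])

print(all_items)
for row in pa_matrix:
    print(row)
```

The resulting `all_items` and `pa_matrix` can then be passed to `pd.DataFrame` exactly as in the snippet above.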
4. With the df that you defined in the previous step, you can now use the ```fpgrowth``` function,
which is described in detail in the official documentation. The first argument required is
the previously built DataFrame, df. The second is the minimum support (minsup), i.e. the minimum
fraction of the entire dataset in which the itemset should show up for it to be considered “frequent”.
Try using different values of ```minsup```, such as 0.5, 0.1, 0.05, 0.02, 0.01. How many results do you
obtain as minsup varies? You can check the number of frequent itemsets identified and print them
all with the following code snippet:
```
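from mlxtend.frequent_patterns import fpgrowth

# Mine the itemsets whose support is at least 0.05
fi = fpgrowth(df, 0.05)
print(len(fi))         # number of frequent itemsets found
print(fi.to_string())  # full table, without truncation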
```
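To make the role of minsup concrete, the following sketch computes supports by brute force on the four-itemset example from step 3. It only illustrates the definition of support and is not how FP-Growth works internally; `support` and `frequent_itemsets` are helper names introduced here, not part of mlxtend.

```python
from itertools import combinations

# The four example itemsets from step 3, as sets
transactions = [{"a", "b", "c"}, {"b", "c"}, {"a", "c", "d"}, {"a", "b"}]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, minsup):
    """Enumerate every candidate itemset and keep those with support >= minsup."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= minsup:
                result[candidate] = s
    return result

for minsup in (0.75, 0.5, 0.25):
    print(minsup, len(frequent_itemsets(transactions, minsup)))
```

On this toy data the number of frequent itemsets grows from 3 to 6 to 11 as minsup drops from 0.75 to 0.5 to 0.25, the same trend you should observe on the real dataset. Exhaustive enumeration is only feasible for a handful of items, which is why dedicated algorithms are used in practice.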
5. Consider the itemsets extracted for ```minsup = 0.02```. How many items do they contain? Which ones
would you consider the most useful?
6. Use the value returned by fpgrowth to extract the relevant association rules.
7. Extract the association rules from the frequent itemsets obtained with ```minsup = 0.01```. The
```association_rules()``` function is described in the official documentation. You can use
confidence as the metric to select the rules, with a minimum threshold of 0.85 (feel free to vary
these values and observe how the results change).
8. (*) Rerun the experiments from point 4 with ```apriori()``` (documentation on the official website).
Do the results match with the ones found by FP-Growth? Is Apriori faster or slower than FP-Growth?
You can measure how long a function call takes with the following code snippet:
```
import timeit
from mlxtend.frequent_patterns import apriori

# number=1 means that it executes the function only once;
# the return value is the elapsed time in seconds
timeit.timeit(lambda: apriori(df, 0.01), number=1)
```
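For steps 6 and 7, it may help to see how the confidence of a rule is obtained from supports: the confidence of A → B is support(A ∪ B) / support(A). The sketch below applies this definition by hand to the four-itemset example from step 3. It is an illustration only, not the mlxtend implementation, and `rules_from_itemset` is a hypothetical helper name.

```python
from itertools import combinations

# The four example itemsets from step 3, as sets
transactions = [{"a", "b", "c"}, {"b", "c"}, {"a", "c", "d"}, {"a", "b"}]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """confidence(A -> B) = support(A union B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

def rules_from_itemset(itemset, min_confidence):
    """Split one itemset into every antecedent/consequent pair and filter by confidence."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            conf = confidence(antecedent, consequent)
            if conf >= min_confidence:
                rules.append((sorted(antecedent), sorted(consequent), conf))
    return rules

for antecedent, consequent, conf in rules_from_itemset({"a", "c", "d"}, 0.85):
    print(antecedent, "->", consequent, conf)
```

Here only the rules whose antecedent contains `d` pass the 0.85 threshold, because `d` appears in a single itemset and always together with `a` and `c`. Rules such as `a -> c, d` have confidence 1/3 and are filtered out.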

