# ID2222 Data Mining, Homework 2
# **Discovery of Frequent Itemsets and Association Rules**

Brando Chiminelli, Tommaso Praturlon

November 21st, 2022

## Goal:
The goal of this notebook is to find frequent itemsets, that is, sets of items that frequently occur together in baskets, e.g., products bought together by sufficiently many customers.
The second step is to find the so-called "association rules". These are if-then rules of the form X → Y (if X then Y), where X and Y are itemsets (Y can be a single item) such that X∩Y=∅. For example, if butter and bread are bought together, customers also buy milk: {butter, bread} → {milk}. The rule needs the support of many (several hundred) transactions to be statistically significant.

In our dataset we are dealing with sales transactions (baskets) of hashed items, so our association rules will look like, for example, {125, 987} → {122}. Moreover, we will display the relative confidence of each association rule.

## How to run

To run this notebook, download the dataset from this address (https://canvas.kth.se/courses/36211/files/5772174/download?wrap=1) and place it in a 'data' directory.
Then you can run the whole notebook, reading along with the descriptions of the different parts of the implementation and the comments in the code.

## Import libraries and read the dataset
In the following we import the few libraries needed for the project and we read the dataset.

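To limit memory use and running time, we keep only the first 100 baskets of the dataset (`df_market.iloc[0:100]`). Note that this is a fixed prefix rather than a random sample, so every run works on the same baskets; drawing a random sample instead would exercise the code on a different subset each time and give a better check of its robustness.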
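Drawing a random subset of baskets could be sketched as follows. This is a minimal illustration on an in-memory list of raw basket strings, not the notebook's own code: the helper `sample_baskets` and its `seed` parameter are hypothetical names introduced here, and the toy lines stand in for the real file.

```python
import random

def sample_baskets(lines, n, seed=None):
    """Return n baskets drawn at random, without replacement (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(lines, min(n, len(lines)))

# Toy stand-in for the lines of the dataset file (assumption, not real data loading)
raw_lines = ["25 52 164", "39 120 124", "35 249 674", "39 422 449", "15 229 262"]
subset = sample_baskets(raw_lines, n=3, seed=0)
print(len(subset))  # 3
```

Passing a seed keeps a given run repeatable, while leaving it as `None` yields a different subset on every execution.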

In [158]:
import pandas as pd
import time
import matplotlib.pyplot as plt

PATH_TO_DATA = "../data/T10I4D100K.dat"
df_market = pd.read_csv(PATH_TO_DATA, header=None)
print("Data read successfully!")
# Keep only the first 100 baskets to reduce the computational load (temporary)
df_market = df_market.iloc[0:100]
print(df_market.head())
print("Number of baskets: ", len(df_market))

Data read successfully!
                                                   0
0  25 52 164 240 274 328 368 448 538 561 630 687 ...
1            39 120 124 205 401 581 704 814 825 834 
2                    35 249 674 712 733 759 854 950 
3            39 422 449 704 825 857 895 937 954 964 
4  15 229 262 283 294 352 381 708 738 766 853 883...
Number of baskets:  100


## Data cleaning

Before proceeding, we need to clean the data so that the dataset becomes a list of lists of integers, which is easier to work with. The cleaning is implemented in the following block.
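The conversion of one raw basket string into a list of integers can also be written compactly. The sketch below assumes each basket is a whitespace-separated string of item IDs, as in the dataset preview above; `parse_basket` is a hypothetical helper name.

```python
def parse_basket(line):
    """Convert a whitespace-separated string of item IDs into a list of ints."""
    # str.split() with no argument also discards leading/trailing whitespace
    return [int(token) for token in line.split()]

raw = ["39 120 124 205 ", "227 390"]
parsed = [parse_basket(line) for line in raw]
print(parsed)  # [[39, 120, 124, 205], [227, 390]]
```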

In [159]:
# DATA CLEANING
# Turn the dataframe into a list of lists of integers
baskets_ls = []
# take all the baskets with their items, columns[0] is the only column we have in our dataframe
df_baskets = df_market[df_market.columns[0]]

for basket in df_baskets:
    basket = basket.split() # split the string of items
    basket_ls = [] # create the single basket as list
    for item in basket:
        item = int(item) # convert an item to int
        basket_ls.append(item) # add it to the basket
    baskets_ls.append(basket_ls) # add the basket to the list

print(baskets_ls)

[[25, 52, 164, 240, 274, 328, 368, 448, 538, 561, 630, 687, 730, 775, 825, 834], [39, 120, 124, 205, 401, 581, 704, 814, 825, 834], [35, 249, 674, 712, 733, 759, 854, 950], [39, 422, 449, 704, 825, 857, 895, 937, 954, 964], [15, 229, 262, 283, 294, 352, 381, 708, 738, 766, 853, 883, 966, 978], [26, 104, 143, 320, 569, 620, 798], [7, 185, 214, 350, 529, 658, 682, 782, 809, 849, 883, 947, 970, 979], [227, 390], [71, 192, 208, 272, 279, 280, 300, 333, 496, 529, 530, 597, 618, 674, 675, 720, 855, 914, 932], [183, 193, 217, 256, 276, 277, 374, 474, 483, 496, 512, 529, 626, 653, 706, 878, 939], [161, 175, 177, 424, 490, 571, 597, 623, 766, 795, 853, 910, 960], [125, 130, 327, 698, 699, 839], [392, 461, 569, 801, 862], [27, 78, 104, 177, 733, 775, 781, 845, 900, 921, 938], [101, 147, 229, 350, 411, 461, 572, 579, 657, 675, 778, 803, 842, 903], [71, 208, 217, 266, 279, 290, 458, 478, 523, 614, 766, 853, 888, 944, 969], [43, 70, 176, 204, 227, 334, 369, 480, 513, 703, 708, 835, 874, 895], [25, 

## A-Priori algorithm

To find frequent itemsets with support at least s in a dataset of sales transactions we are implementing the A-Priori algorithm.

First we need to define the support of an itemset as the number of transactions containing the itemset. We say a set I of items is frequent if its support is at least a threshold s. The threshold s should be set high enough that the number of frequent itemsets stays manageable. As a rule of thumb, s is 1% of the number of baskets.
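As a small illustration of this definition, the following sketch counts the support of an itemset over a toy list of baskets and compares it with a threshold. The function name `support_count`, the toy data, and the threshold value are assumptions made for the example only.

```python
def support_count(itemset, baskets):
    """Number of baskets that contain every item of itemset."""
    wanted = set(itemset)
    return sum(1 for basket in baskets if wanted <= set(basket))

toy_baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
s = 3  # illustrative support threshold

print(support_count({1, 2}, toy_baskets))       # 3
print(support_count({2, 3}, toy_baskets) >= s)  # False: {2, 3} appears in only 2 baskets
```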

## Finding frequent items

The first pass of the A-Priori algorithm determines which individual items (singletons) are frequent. This produces a list of frequent items (called "items" in the code) that is hopefully smaller than the list of all items.

C_1 is the candidate set for single items: a dictionary whose keys are the items and whose values are their supports.

Since every subset of a frequent itemset must itself be frequent, we added a loop that removes the non-frequent items from the candidate set; the remaining items are then used to find the frequent doubletons, tripletons, etc.

In order to get results in a short time we set the threshold at 3.
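The first pass can be sketched with `collections.Counter` as an alternative to a hand-written counting loop. This is an illustrative variant under stated assumptions, not the code used below: `frequent_singletons` is a hypothetical name, and each item is counted at most once per basket, which differs from the cell below only if a basket lists the same item twice.

```python
from collections import Counter

def frequent_singletons(baskets, threshold):
    """Map each item to its support, keeping only items with support >= threshold."""
    counts = Counter(item for basket in baskets for item in set(basket))
    return {item: n for item, n in counts.items() if n >= threshold}

toy_baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
frequent = frequent_singletons(toy_baskets, threshold=3)
print(sorted(frequent.items()))  # [(1, 3), (2, 4)]
```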

In [153]:
from itertools import combinations
import statistics

# rule of thumb: an item is frequent if its support is at least 1% of the total number of baskets
#S_THRESHOLD = 0.01*len(baskets_ls)
S_THRESHOLD = 3

# dictionary mapping each item to its support (non-frequent items are removed further below)
C_1 = dict()
# take all the baskets with their items
# for every basket take the item and if it already exists
# in the dictionary count +1
for basket in baskets_ls:
    for item in basket:
        C_1[item] = C_1.get(item,0) + 1 # get returns the current count, or 0 if the item is not yet present
        
# find frequency statistics among items
min_freq = min(C_1.values())
max_freq = max(C_1.values())
median = statistics.median(C_1.values())
print("Minimum frequency: ", min_freq)
print("Maximum frequency: ", max_freq)
print("Median: ", median)

# delete non-frequent items
for item in list(C_1): # iterate over a copy of the keys; C_1 maps each item (key) to its counter (value), e.g. 1:6
    if C_1[item]<S_THRESHOLD:
        del C_1[item]

items = list(C_1.keys()) # list of all different frequent items
support = [C_1] # list of dictionaries
#print("Support for C_1: \n", support)
#print("List of frequent items:\n", items)

Minimum frequency:  1
Maximum frequency:  10
Median:  2.0


## Find the support of all frequent itemsets in our dataset

The second step of the algorithm is to count all the pairs that consist of two frequent items. At the end of this step, we examine the counts to determine which pairs are frequent. The same steps are applied to find larger frequent itemsets.
By the monotonicity rule, if no frequent itemsets of a certain size are found, there cannot be any larger frequent itemset, so we can stop iterating.

In order to get results in a short time we only consider itemsets of size 1, 2, 3 and 4.
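The monotonicity rule can also be used to prune candidates before counting them: a size-k candidate is only worth counting if all of its (k-1)-subsets are frequent. The sketch below illustrates this pruning step on toy data; `generate_candidates` is a hypothetical helper and is not part of the cell that follows, which counts every combination of frequent items.

```python
from itertools import combinations

def generate_candidates(prev_frequent, k):
    """Return size-k candidates whose every (k-1)-subset is in prev_frequent."""
    prev = {frozenset(s) for s in prev_frequent}
    items = sorted({item for s in prev for item in s})
    candidates = []
    for combo in combinations(items, k):
        # keep the combination only if all of its (k-1)-subsets are frequent
        if all(frozenset(sub) in prev for sub in combinations(combo, k - 1)):
            candidates.append(combo)
    return candidates

frequent_pairs = [(1, 2), (1, 3), (2, 3), (2, 4)]
print(generate_candidates(frequent_pairs, 3))  # [(1, 2, 3)]
```

Here (1, 2, 4) is discarded without touching the baskets because the pair (1, 4) is not frequent.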

In [154]:
# for every possible bundle size: (a, b), (a, c, d), (e, f, g, w), ...
# ideally the bundle size would range up to the number of
# frequent singletons
# NOTE: itemsets of size >= 2 are filtered with the median singleton frequency
# (2.0 on this run), not with S_THRESHOLD
MIN_SUPPORT = median

#for i in range(2,len(items)):
for i in range(2, 5):
    s = dict() # new support, now for doubletons, tripletons, etc. 
    # for every combinations of i items
    # count frequency of every combination among all baskets
    for combo in combinations(items,i):
        # iterate over every basket of the cleaned dataset (lists of ints)
        for basket in baskets_ls:
            # if the combination of i items is found in the basket, count+1
            if set(combo).issubset(basket):
                s[combo] = s.get(combo,0) + 1
        # once all baskets are checked:
        # if the combination was counted and its support is below the threshold,
        # delete it -> keep only the actually frequent itemsets
        if s.get(combo) and s[combo]<MIN_SUPPORT:
            del s[combo]
    # if s is empty -> there is no frequent itemset of size i
    if not s:
        break # exit the for cycle (monotonicity rule)
    support.append(s) # add the support of multiple-tons

# Print list of all dictionaries for each combination with their frequencies
print(support)

[{52: 4, 274: 9, 368: 6, 538: 6, 561: 4, 775: 6, 825: 6, 834: 3, 39: 5, 401: 4, 581: 4, 704: 6, 814: 3, 674: 4, 712: 3, 733: 3, 950: 3, 449: 3, 895: 5, 937: 5, 964: 4, 229: 3, 283: 3, 381: 5, 708: 3, 738: 3, 766: 5, 853: 6, 883: 5, 966: 6, 143: 3, 569: 6, 350: 3, 529: 10, 782: 3, 809: 4, 849: 3, 947: 5, 227: 4, 390: 5, 71: 6, 192: 3, 208: 4, 279: 5, 496: 3, 675: 5, 720: 4, 855: 5, 914: 3, 932: 3, 183: 7, 217: 4, 276: 4, 706: 3, 878: 3, 161: 4, 175: 3, 177: 6, 571: 6, 623: 4, 795: 7, 960: 3, 392: 4, 921: 5, 147: 5, 411: 4, 778: 3, 478: 3, 614: 4, 888: 6, 43: 3, 70: 4, 176: 3, 204: 4, 334: 4, 874: 6, 419: 5, 484: 3, 722: 7, 844: 3, 846: 3, 967: 3, 774: 3, 789: 5, 116: 3, 201: 3, 541: 5, 701: 4, 946: 3, 487: 3, 631: 3, 735: 3, 935: 4, 17: 4, 242: 3, 758: 3, 956: 3, 145: 3, 385: 3, 676: 3, 522: 3, 617: 3, 12: 3, 296: 5, 354: 9, 684: 3, 740: 3, 829: 6, 234: 4, 460: 6, 517: 3, 736: 3, 919: 5, 489: 6, 494: 4, 723: 3, 764: 3, 168: 3, 213: 3, 580: 4, 871: 3, 72: 5, 172: 3, 21: 4, 32: 4, 136: 4,

## Generating association rules 
In this final part we generate association rules with confidence at least c = 50% from the itemsets found in the first step.

The confidence of a rule X → Y is the fraction of the transactions containing X that also contain X⋃Y, i.e. the conditional probability of Y given X. A confidence threshold of 50 means that the probability of X⋃Y, given X, has to be at least 50%.
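A worked example of this definition, using the butter/bread/milk rule from the introduction, might look like the sketch below. The baskets are made up for illustration and the function name `confidence` is an assumption; the result is expressed as a percentage to match the threshold used in this notebook.

```python
def confidence(antecedent, consequent, baskets):
    """Confidence of antecedent -> consequent, in percent."""
    x = set(antecedent)
    xy = x | set(consequent)
    supp_x = sum(1 for b in baskets if x <= set(b))    # baskets containing X
    supp_xy = sum(1 for b in baskets if xy <= set(b))  # baskets containing X and Y
    if supp_x == 0:
        return 0.0  # the rule never applies
    return 100.0 * supp_xy / supp_x

toy_baskets = [
    ["butter", "bread", "milk"],
    ["butter", "bread"],
    ["butter", "bread", "milk"],
    ["bread", "milk"],
]
conf = confidence({"butter", "bread"}, {"milk"}, toy_baskets)
print(round(conf, 2))  # 66.67: 2 of the 3 baskets with butter and bread also contain milk
```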

Moreover, association rules are not symmetric, but we generated each combination only once, e.g. as (a, b) and not (b, a), so for each frequent combination we need to test both the rule X → Y and the rule Y → X.
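To illustrate the asymmetry, the sketch below enumerates every split of one frequent itemset into an antecedent and a consequent and keeps the rules that reach a minimum confidence. It is a generalisation under stated assumptions, not the cell that follows: `rules_from_itemset` is a hypothetical helper, and the `supports` table keyed by `frozenset` is a made-up layout that differs from the `support` list used in this notebook.

```python
from itertools import combinations

def rules_from_itemset(itemset, supports, min_conf):
    """Return (antecedent, consequent, confidence %) for every qualifying split."""
    full = frozenset(itemset)
    rules = []
    for size in range(1, len(full)):
        for antecedent in combinations(sorted(full), size):
            x = frozenset(antecedent)
            conf = 100.0 * supports[full] / supports[x]
            if conf >= min_conf:
                rules.append((tuple(sorted(x)), tuple(sorted(full - x)), conf))
    return rules

# Toy support counts (assumed values, not taken from the dataset)
supports = {frozenset({1}): 4, frozenset({2}): 2, frozenset({1, 2}): 2}
rules = rules_from_itemset((1, 2), supports, min_conf=60.0)
print(rules)  # [((2,), (1,), 100.0)]
```

The rule 2 → 1 holds with 100% confidence, while 1 → 2 reaches only 50% and is dropped, even though both come from the same itemset.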


In [155]:
MIN_CONFIDENCE = 50.0 # minimum confidence in percent (the confidence of X -> Y is the conditional probability of Y given X)

rules = dict()
for combo in support[-1]: # use the last element of support, i.e. the itemsets of largest size
    for item in combo:
        c = list(combo) 
        c.remove(item) #we need to remove one item from the combo to test our rule
        len_c = len(c)
        c = c[0] if len_c == 1 else tuple(c) # keep a single item as a scalar key; use a tuple when more than one item remains
        
        # now we compute the confidence of both rules:
        # the support of the whole combo divided by the support of the rule's left-hand side
        rule_1 = support[-1][combo]/support[0][item]*100 # confidence of item -> c
        rule_2 = support[-1][combo]/support[len_c-1][c]*100 # confidence of c -> item (rules are not symmetric)
        
        if rule_1>=MIN_CONFIDENCE: rules[f"{item}->{c}"] = rule_1
        if rule_2>=MIN_CONFIDENCE: rules[f"{c}->{item}"] = rule_2

print(rules)

{'(855, 639, 521)->274': 100.0, '(274, 639, 521)->855': 100.0, '639->(274, 855, 521)': 66.66666666666666, '(274, 855, 521)->639': 100.0, '521->(274, 855, 639)': 66.66666666666666, '(274, 855, 639)->521': 100.0, '(411, 764, 213)->274': 100.0, '411->(274, 764, 213)': 50.0, '(274, 764, 213)->411': 100.0, '764->(274, 411, 213)': 66.66666666666666, '(274, 411, 213)->764': 100.0, '213->(274, 411, 764)': 66.66666666666666, '(274, 411, 764)->213': 100.0, '(411, 764, 21)->274': 100.0, '411->(274, 764, 21)': 50.0, '(274, 764, 21)->411': 100.0, '764->(274, 411, 21)': 66.66666666666666, '(274, 411, 21)->764': 100.0, '21->(274, 411, 764)': 50.0, '(274, 411, 764)->21': 100.0, '(411, 764, 32)->274': 100.0, '411->(274, 764, 32)': 50.0, '(274, 764, 32)->411': 100.0, '764->(274, 411, 32)': 66.66666666666666, '(274, 411, 32)->764': 100.0, '32->(274, 411, 764)': 50.0, '(274, 411, 764)->32': 100.0, '(411, 764, 136)->274': 100.0, '411->(274, 764, 136)': 50.0, '(274, 764, 136)->411': 100.0, '764->(274, 411, 