# ID2222 Data Mining, Homework 2
# **Discovery of Frequent Itemsets and Association Rules**

Brando Chiminelli, Tommaso Praturlon

November 21st, 2022

The goal of this notebook is to ...

## Import libraries and read the dataset

To run this notebook, download the dataset from this address (https://canvas.kth.se/courses/36211/files/5772174/download?wrap=1) and place it in a 'data' directory.

In [33]:
import pandas as pd
import numpy as np
import random
import time
import matplotlib.pyplot as plt

PATH_TO_DATA = "../data/T10I4D100K.dat"
df_market = pd.read_csv(PATH_TO_DATA, header=None)
print("Data read successfully!")
# Each row holds one basket as a single whitespace-separated string of item IDs

df_market.head()
print("Number of baskets: ", len(df_market))

Data read successfully!
Number of baskets:  100000


## Finding frequent itemsets with support at least s

Recall that an association rule is an implication X → Y, where X and Y are itemsets such that X∩Y=∅. The support of the rule X → Y is the number of transactions that contain X∪Y. The confidence of the rule X → Y is the fraction of the transactions containing X that also contain Y, that is, support(X∪Y) divided by support(X).
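
To make the two definitions concrete, here is a minimal sketch that computes support and confidence on a handful of made-up baskets. The data and the helper names `support_count` and `confidence` are illustrative assumptions and are not part of the assignment or of the dataset.

```python
# Toy baskets, invented purely for illustration (not taken from T10I4D100K).
toy_baskets = [
    {1, 2, 3},
    {1, 2},
    {1, 3},
    {2, 3},
    {1, 2, 3, 4},
]

def support_count(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(antecedent, consequent, baskets):
    """support(X ∪ Y) / support(X) for the rule X -> Y (X must occur at least once)."""
    x = set(antecedent)
    y = set(consequent)
    if x & y:
        raise ValueError("X and Y must be disjoint")
    return support_count(x | y, baskets) / support_count(x, baskets)

rule_support = support_count({1, 2}, toy_baskets)    # baskets holding both 1 and 2
rule_confidence = confidence({1}, {2}, toy_baskets)  # share of baskets with 1 that also hold 2
print(rule_support, rule_confidence)  # prints: 3 0.75
```

Here item 1 appears in four baskets and three of those also contain item 2, so the rule 1 → 2 has support 3 and confidence 3/4.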

TASK

You are to solve the first sub-problem: implement the A-Priori algorithm for finding frequent itemsets with support at least s in a dataset of sales transactions. Recall that the support of an itemset is the number of transactions containing the itemset. To test and evaluate your implementation, write a program that uses it to discover frequent itemsets with support at least s in a given dataset of sales transactions.

The A-Priori algorithm is based on the monotonicity of support: if a set I of items is frequent, then so is every subset of I.
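
Read in the other direction, monotonicity says that an itemset with an infrequent subset cannot be frequent, which is what allows candidates to be discarded before counting. The following is a small sketch of that pruning step on made-up data. The function name `generate_candidates` and the toy pairs are assumptions for illustration only.

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets.

    frequent_prev: set of frozensets, each of size k-1.
    A candidate is kept only if all of its (k-1)-subsets are frequent.
    """
    items = sorted({item for itemset in frequent_prev for item in itemset})
    candidates = set()
    for combo in combinations(items, k):
        if all(frozenset(sub) in frequent_prev
               for sub in combinations(combo, k - 1)):
            candidates.add(frozenset(combo))
    return candidates

# Hypothetical frequent pairs; note that (1, 4) and (3, 4) are missing.
frequent_pairs = {frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (2, 4)]}
candidate_triples = generate_candidates(frequent_pairs, 3)
print(candidate_triples)  # prints: {frozenset({1, 2, 3})}
```

Only {1, 2, 3} survives: every other triple contains the pair (1, 4) or (3, 4), neither of which is frequent.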

The support threshold s should be set high enough that only a manageable number of itemsets are frequent. As a rule of thumb, s is 1% of the number of baskets.

In [40]:
# DATA CLEANING
# Convert the dataframe into a list of baskets, each a list of integer item IDs
baskets_ls = []
# take all the baskets with their items
df_baskets = df_market[df_market.columns[0]]

for basket in df_baskets:
    basket = basket.split() # split the string of items
    basket_ls = [] # create the single basket as list
    for item in basket:
        item = int(item) # convert an item to int
        basket_ls.append(item) # add it to the basket
    baskets_ls.append(basket_ls) # add the basket to the list

print(baskets_ls)

[[25, 52, 164, 240, 274, 328, 368, 448, 538, 561, 630, 687, 730, 775, 825, 834], [39, 120, 124, 205, 401, 581, 704, 814, 825, 834], [35, 249, 674, 712, 733, 759, 854, 950], [39, 422, 449, 704, 825, 857, 895, 937, 954, 964], [15, 229, 262, 283, 294, 352, 381, 708, 738, 766, 853, 883, 966, 978], [26, 104, 143, 320, 569, 620, 798], [7, 185, 214, 350, 529, 658, 682, 782, 809, 849, 883, 947, 970, 979], [227, 390], [71, 192, 208, 272, 279, 280, 300, 333, 496, 529, 530, 597, 618, 674, 675, 720, 855, 914, 932], [183, 193, 217, 256, 276, 277, 374, 474, 483, 496, 512, 529, 626, 653, 706, 878, 939], [161, 175, 177, 424, 490, 571, 597, 623, 766, 795, 853, 910, 960], [125, 130, 327, 698, 699, 839], [392, 461, 569, 801, 862], [27, 78, 104, 177, 733, 775, 781, 845, 900, 921, 938], [101, 147, 229, 350, 411, 461, 572, 579, 657, 675, 778, 803, 842, 903], [71, 208, 217, 266, 279, 290, 458, 478, 523, 614, 766, 853, 888, 944, 969], [43, 70, 176, 204, 227, 334, 369, 480, 513, 703, 708, 835, 874, 895], [25, 

The first pass of the A-Priori algorithm determines which individual items (singletons) are frequent. This yields a _frequent-items table_ that is hopefully smaller than the table of all items.
C_k denotes the candidate set of itemsets of size k.
First iteration: find the frequent items.

In [48]:
from itertools import combinations
import statistics

# an item is frequent if it appears in at least 1% of all baskets
S_THRESHOLD = 0.01*len(baskets_ls)
# minimum confidence (in percent) required for an association rule
MIN_CONFIDENCE = 50.0

# dictionary mapping every item to its frequency
# (the non-frequent items are removed further below)
C_1 = dict()
# go through all the baskets and, for every item in a basket,
# increase its count by one
for basket in baskets_ls:
    for item in basket:
        C_1[item] = C_1.get(item,0) + 1 # get returns the current count, or 0 if the item is new

# find frequency statistics among items 
min_freq = min(C_1.values())
max_freq = max(C_1.values())
median = statistics.median(C_1.values())
print("Minimum frequency: ", min_freq)
print("Maximum frequency: ", max_freq)
print("Median: ", median)

# delete non-frequent items
for item in list(C_1): # C_1 is a dictionary (e.g. 1:6, where 1 is the item and 6 its count)
    if C_1[item]<S_THRESHOLD:
        del C_1[item]

items = list(C_1.keys()) # list of all different frequent items
support = [C_1] # list of dictionaries
#print("Support for C_1: \n", support)
#print("List of frequent items:\n", items)

Minimum frequency:  1
Maximum frequency:  7828
Median:  816.5


The second step of the algorithm counts all the pairs that consist of two frequent items. At the end of this step, we examine the counts to determine which pairs are frequent. The same steps are then applied to find larger frequent itemsets.
By the monotonicity rule, if no frequent itemsets of a certain size are found, there cannot be any larger frequent itemset, so we can stop iterating.
Second iteration: find frequent combinations of items among the frequent items.

In [None]:
# for every possible bundle size: (a, b), (a, c, d), (e, f, g, w), ...
# in principle the size can grow up to the number of frequent singletons
MIN_SUPPORT = median
# sets allow fast membership tests when checking a combination against a basket
basket_sets = [set(basket) for basket in baskets_ls]
# for i in range(2, len(items)):
for i in range(2, 4): # start from pairs so that support[1] holds the pairs
    s = dict() # new support, now for doubletons, tripletons, etc.
    # for every combination of i frequent items
    for combo in combinations(items,i):
        # monotonicity: skip a combination if one of its (i-1)-subsets is not frequent
        if i > 2 and any(sub not in support[-1] for sub in combinations(combo,i-1)):
            continue
        # count the baskets that contain all the items of the combination
        count = sum(1 for basket in basket_sets if basket.issuperset(combo))
        # keep only the combinations that are actually frequent
        if count >= MIN_SUPPORT:
            s[combo] = count
    # if s is empty -> there are no frequent itemsets of size i
    if not s:
        break # exit the for loop (monotonicity rule)
    support.append(s) # add the support of the itemsets of size i

# Print list of all dictionaries for each combination with their frequencies
print(support)
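
Testing every combination against every basket can be slow on 100,000 baskets. As a possible alternative, the sketch below counts pairs basket by basket, generating only the pairs that actually occur among the frequent items of each basket. It is an illustrative variant, not the notebook's implementation: the function name `count_frequent_pairs` and the toy data are assumptions, and pairs are returned as sorted tuples, which may differ from the order produced by `combinations(items,i)`.

```python
from collections import Counter
from itertools import combinations

def count_frequent_pairs(baskets, frequent_items, min_support):
    """Count pairs made only of frequent items and keep the frequent ones."""
    frequent_items = set(frequent_items)
    pair_counts = Counter()
    for basket in baskets:
        # keep only frequent items; sorting gives each pair one canonical order
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

# Made-up baskets for illustration only.
toy_baskets = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 3, 4]]
toy_frequent_items = {1, 2, 3}  # item 4 appears only once, so it is left out
toy_pairs = count_frequent_pairs(toy_baskets, toy_frequent_items, min_support=3)
print(toy_pairs)
```

The work per basket depends only on how many frequent items the basket holds, so the cost no longer grows with the total number of candidate pairs.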

The other crucial part of the algorithm is to find significant rules that connect different items to each other.

In [None]:
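rules = dict() # maps "antecedent->consequent" to the confidence in percent
# derive the rules from the largest frequent itemsets found
for combo in support[-1]:
    for item in combo:
        # split the itemset into one item and the remaining items
        c = list(combo)
        c.remove(item)
        len_c = len(c)
        # singletons are keyed by the bare item, larger itemsets by a tuple
        c = c[0] if len_c == 1 else tuple(c)
        # confidence(item -> c) = support(combo) / support(item)
        conf_item_to_c = support[-1][combo]/support[0][item]*100
        # confidence(c -> item) = support(combo) / support(c)
        conf_c_to_item = support[-1][combo]/support[len_c-1][c]*100
        if conf_item_to_c >= MIN_CONFIDENCE:
            rules[f"{item}->{c}"] = conf_item_to_c
        if conf_c_to_item >= MIN_CONFIDENCE:
            rules[f"{c}->{item}"] = conf_c_to_item

print(rules)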
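
The cell above only splits each of the largest itemsets into a single item and the rest. A more general sketch, under stated assumptions, considers every frequent itemset and every non-empty proper subset as antecedent. The function name `generate_rules`, the frozenset-keyed input, and the toy counts are hypothetical, and confidence is expressed here as a fraction rather than a percentage.

```python
from itertools import combinations

def generate_rules(support_by_itemset, min_confidence):
    """Return {(antecedent, consequent): confidence} for the confident rules.

    support_by_itemset: dict mapping frozenset -> support count. It is assumed
    to contain every non-empty subset of each itemset, which monotonicity
    guarantees when a single support threshold is used at every level.
    min_confidence: fraction in [0, 1].
    """
    rules = {}
    for itemset, itemset_support in support_by_itemset.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                conf = itemset_support / support_by_itemset[antecedent]
                if conf >= min_confidence:
                    rules[(antecedent, itemset - antecedent)] = conf
    return rules

# Made-up support counts for illustration only.
toy_support = {
    frozenset({1}): 4,
    frozenset({2}): 4,
    frozenset({1, 2}): 3,
}
toy_rules = generate_rules(toy_support, min_confidence=0.7)
print(toy_rules)
```

With these counts both 1 → 2 and 2 → 1 have confidence 3/4, so both pass the 0.7 threshold.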