Authors: Fernando Díaz González and Giorgio Ruffa
{fdiaz, ruffa}@kth.se

ID 2222 Data Mining. Assignment 2

In [9]:
import itertools
import random
from collections import defaultdict

To compute the frequent itemsets we use a support threshold of 1% (approximately 1000 transactions). For rule discovery, we use a confidence threshold of 60%. $k$ is the number of items in a generic itemset, and `k_max` is the maximum cardinality of the itemsets that we compute.
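For reference, the two thresholds can be read against the usual definitions of support and confidence. The symbols $T$ (the set of transactions), $\mathrm{supp}$ and $\mathrm{conf}$ are notation assumed here for illustration; they do not appear in the assignment text:

$$
\mathrm{supp}(X) = |\{\, t \in T : X \subseteq t \,\}|, \qquad
\mathrm{conf}(A \rightarrow B) = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)}
$$

Under these definitions, an itemset $X$ is frequent when $\mathrm{supp}(X) \ge 0.01 \cdot |T|$, and a rule $A \rightarrow B$ is kept when its confidence is at least the confidence threshold.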

In [2]:
support_threshold_percentage = 0.01
confidence_threshold = 0.6
k_max = 4  # itemsets of up to 4 items

## Frequent itemsets

For each $k$, we populate a default dictionary using the itemset as the key and a set of transactions ids in which the itemset appears as the value. The transaction id is the line number of the data file.

In the following cell we compute the dictionary for $k = 1$, that is, for the singletons.

In [3]:
data_file = "./data/T10I4D100K.dat"

k_list = [None] * k_max

tot_transactions = 0
with open(data_file, 'r') as f:  # use a separate name so the path in data_file is not overwritten
    singleton_to_transactions = defaultdict(set)
    for transaction_id, transaction in enumerate(f):
        for item in transaction.strip(" \n").split(" "):
            singleton_to_transactions[frozenset({item})].add(transaction_id)
        tot_transactions += 1
    k_list[0] = singleton_to_transactions


print("Total number of transactions: {}".format(tot_transactions))
print("Total number of distinct items: {}".format(len(k_list[0].keys())))
support_threshold = int(support_threshold_percentage * tot_transactions)
print("Support percentage thr {}, equivalent to at least {} transactions".format(
    support_threshold_percentage, support_threshold))
print("The item {} appears in the following transactions (showing 20): {}...".format(
    "25", list(k_list[0][frozenset({"25"})])[0:20]))

Total number of transactions: 100000
Total number of distinct items: 870
Support percentage thr 0.01, equivalent to at least 1000 transactions
The item 25 appears in the following transactions (showing 20): [0, 8204, 16397, 17, 57371, 32802, 73766, 81962, 98351, 98356, 73784, 49215, 81999, 32859, 73822, 57440, 73827, 32895, 90239, 16523]...


In this representation, the support is the length of the transaction set.
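As a small illustration of this representation, the sketch below builds the same kind of dictionary for a handful of made-up transactions and reads the support off the size of each transaction set. The names `toy_transactions`, `toy_index` and `toy_support` are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical toy data: each string is one transaction,
# and its position in the list plays the role of the transaction id.
toy_transactions = ["a b c", "a c", "b d", "a b c d"]

# Map each singleton itemset to the set of transaction ids containing it.
toy_index = defaultdict(set)
for tid, line in enumerate(toy_transactions):
    for item in line.split():
        toy_index[frozenset({item})].add(tid)

# The support of an itemset is the number of transactions it appears in.
toy_support = {itemset: len(tids) for itemset, tids in toy_index.items()}

for itemset, support in toy_support.items():
    print(sorted(itemset), "-> support", support)
```

Storing transaction ids rather than plain counts costs more memory, but it lets the support of a larger itemset be obtained later by intersecting the sets of its parts.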

In the following cell we filter the singletons with a support less than the threshold.

In [4]:
def filter_and_remove(set_to_transactions, support_threshold):
    items_below_threshold = [
        item for item, transactions in set_to_transactions.items() if len(transactions) < support_threshold
    ]
    for item in items_below_threshold:
        del(set_to_transactions[item])
        
filter_and_remove(k_list[0], support_threshold)
print("Remaining singletons: {}".format(len(k_list[0].keys())))

Remaining singletons: 375


Now we iteratively generate and filter the $k$-itemsets for $k = 2,...,k\_max$. To get the $k$-itemsets we combine the filtered singletons with the $(k - 1)$-itemsets. For example, to get pairs, we combine singletons with singletons; to get triplets, we combine singletons with pairs.

In [5]:
for k in range(2, k_max + 1):
    print("** Computing itemsets of size {} **".format(k))
    singletons = k_list[0]
    k_minus_one_itemsets = k_list[k - 2]
    k_item_set_to_transactions = defaultdict(set)
    for keyA, keyB in itertools.product(singletons.keys(), k_minus_one_itemsets.keys()):
        k_item_set = frozenset(keyA.union(keyB))
        if len(k_item_set) != k:
            continue
        common_txs = singletons[keyA].intersection(k_minus_one_itemsets[keyB])
        k_item_set_to_transactions[k_item_set] = common_txs
    filter_and_remove(k_item_set_to_transactions, support_threshold)
    k_list[k - 1] = k_item_set_to_transactions

** Computing itemsets of size 2 **
** Computing itemsets of size 3 **
** Computing itemsets of size 4 **
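The same combination step can be followed by hand on a toy index. The sketch below is an illustration under assumed data, not the notebook's exact code: `toy_extend`, `toy_singletons` and `toy_min_support` are hypothetical names, and the threshold of 2 is chosen only so that the small example keeps some itemsets.

```python
import itertools

# Hypothetical toy index: singleton itemset -> ids of the transactions containing it.
toy_singletons = {
    frozenset({"a"}): {0, 1, 3},
    frozenset({"b"}): {0, 2, 3},
    frozenset({"c"}): {0, 1, 3},
}
toy_min_support = 2  # assumed threshold for this toy example


def toy_extend(singletons, previous, k, min_support):
    """Combine singletons with (k - 1)-itemsets and keep the frequent k-itemsets."""
    result = {}
    for single, itemset in itertools.product(singletons, previous):
        candidate = single | itemset
        if len(candidate) != k:
            # The singleton was already part of the itemset, so nothing new is formed.
            continue
        # Transactions containing the candidate are those containing both parts.
        common = singletons[single] & previous[itemset]
        if len(common) >= min_support:
            result[candidate] = common
    return result


toy_pairs = toy_extend(toy_singletons, toy_singletons, 2, toy_min_support)
toy_triplets = toy_extend(toy_singletons, toy_pairs, 3, toy_min_support)

for itemset, tids in {**toy_pairs, **toy_triplets}.items():
    print(sorted(itemset), "-> support", len(tids))
```

Because every candidate is built only from itemsets that already passed the threshold, infrequent itemsets are never extended, which is what keeps the number of candidates manageable as $k$ grows.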


We print up to 11 itemsets for each $k$ (not ordered by support).

In [6]:
for idx, k_itemsets in enumerate(k_list):
    k = idx + 1
    n_itemsets = len(k_itemsets)
    print("Number of {}-itemsets with support {} = {}".format(
        k, support_threshold, n_itemsets))
    for shown, (itemset, transactions) in enumerate(k_itemsets.items()):
        print("Items: {!s:<32} -> Support: {}".format(itemset, len(transactions)))
        if shown == 10:
            break

Number of 1-itemsets with support 1000 = 375
Items: frozenset({'25'})                -> Support: 1395
Items: frozenset({'52'})                -> Support: 1983
Items: frozenset({'240'})               -> Support: 1399
Items: frozenset({'274'})               -> Support: 2628
Items: frozenset({'368'})               -> Support: 7828
Items: frozenset({'448'})               -> Support: 1370
Items: frozenset({'538'})               -> Support: 3982
Items: frozenset({'561'})               -> Support: 2783
Items: frozenset({'630'})               -> Support: 1523
Items: frozenset({'687'})               -> Support: 1762
Items: frozenset({'775'})               -> Support: 3771
Number of 2-itemsets with support 1000 = 9
Items: frozenset({'682', '368'})        -> Support: 1193
Items: frozenset({'829', '368'})        -> Support: 1194
Items: frozenset({'825', '39'})         -> Support: 1187
Items: frozenset({'825', '704'})        -> Support: 1102
Items: frozenset({'704', '39'})         -> Support: 1107


The obtained result agrees with what is reported by [Zhigang Wang et al.](http://ijssst.info/Vol-17/No-32/paper44.pdf)

![frequent-items](frequent-items.png)

## Association rules

To find the antecedent and consequent of the rules, we iterate over all the itemsets with $k > 1$ and split each one into every possible pair of non-empty antecedent and consequent. For example, given an itemset $S$ like `{"a", "b", "c"}`, we generate the following rules:
* `{"a"} -> {"b", "c"}`
* `{"b"} -> {"a", "c"}`
* `{"c"} -> {"a", "b"}`
* `{"b", "c"} -> {"a"}`
* `{"b", "a"} -> {"c"}`
* `{"a", "c"} -> {"b"}`

As you can see, the consequent is the result of the operation $S - \text{antecedent}$, where $S$ and $\text{antecedent}$ are both sets.
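The enumeration described above can be sketched on its own, independently of any support counts. The names `toy_itemset` and `toy_rules` are hypothetical; the sketch only shows how antecedents of every size from 1 to $k - 1$ are drawn and how the consequent follows by set difference.

```python
import itertools

# Hypothetical itemset used only for this illustration.
toy_itemset = frozenset({"a", "b", "c"})

toy_rules = []
# Antecedent sizes run from 1 to k - 1 so that the consequent is never empty.
for size in range(1, len(toy_itemset)):
    for combo in itertools.combinations(sorted(toy_itemset), size):
        antecedent = frozenset(combo)
        consequent = toy_itemset - antecedent
        toy_rules.append((antecedent, consequent))

for antecedent, consequent in toy_rules:
    print(sorted(antecedent), "->", sorted(consequent))
```

For a three-item set this gives six candidate rules: three with a single-item antecedent and three with a two-item antecedent.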

In [7]:
def find_rules(k_list, confidence_threshold=0.5):
    rules = []
    # Iterate over all itemsets with k > 1 (pairs, triplets ...)
    for k_itemsets in k_list[1:]:
        for itemset, transactions in k_itemsets.items():
            # Support of the whole itemset (antecedent and consequent together)
            itemset_support = len(transactions)
            for i in range(len(itemset) - 1):
                # Generate antecedents of different sizes (1,...,k - 1),
                # where k is the size of the itemset
                antecedents = list(itertools.combinations(itemset, i + 1))
                for antecedent in antecedents:
                    antecedent = frozenset(antecedent)
                    consequent = itemset - antecedent
                    k_antecedent = len(antecedent)
                    k_antecedent_itemsets = k_list[k_antecedent - 1]
                    # Find transaction list of antecedent, its length is the support
                    antecedent_support = len(k_antecedent_itemsets[antecedent])
                    confidence = itemset_support / antecedent_support
                    if confidence >= confidence_threshold:
                        rules.append((antecedent, consequent, confidence, itemset_support))
    return rules

In [8]:
SHOW = 15
rules = find_rules(k_list, confidence_threshold)
n_rules = len(rules)
print("Found {} rules, showing {}".format(n_rules, SHOW))
for i in range(min(SHOW, n_rules)):
    # random.choice samples with replacement, so the same rule can be printed more than once
    antecedent, consequent, confidence, _ = random.choice(rules)
    print("{} -> {} (confidence: {})".format(antecedent, consequent, confidence))

Found 5 rules, showing 15
frozenset({'704'}) -> frozenset({'825'}) (confidence: 0.6142697881828316)
frozenset({'704', '39'}) -> frozenset({'825'}) (confidence: 0.9349593495934959)
frozenset({'825', '39'}) -> frozenset({'704'}) (confidence: 0.8719460825610783)
frozenset({'704'}) -> frozenset({'825'}) (confidence: 0.6142697881828316)
frozenset({'825', '39'}) -> frozenset({'704'}) (confidence: 0.8719460825610783)
