# Discovery of Frequent Itemsets and Association Rules
This notebook aims to implement the Apriori algorithm to discover association rules between itemsets in a sales transaction database (i.e. a set of baskets). The task includes the following two sub-problems [[R. Agrawal and R. Srikant, VLDB '94](http://www.vldb.org/conf/1994/P487.PDF)]:

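1. Finding frequent itemsets with support at least s, that is, itemsets whose items appear together in at least *s * total_baskets* baskets. Note that the implementation below applies a strict comparison, so an itemset is kept only if it appears in more than *s * total_baskets* baskets.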
2. Generating association rules with confidence at least c from the itemsets found in the first step.
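
To make the support threshold concrete, here is a minimal sketch on a handful of hypothetical baskets (the basket contents and the helper name `support_count` are assumptions for illustration, not part of the notebook). It also shows how a non-strict (`>=`) and a strict (`>`) comparison differ when an itemset sits exactly on the threshold.

```python
from itertools import combinations

# Hypothetical toy baskets, not taken from the notebook's dataset.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def support_count(item_set, baskets):
    """Number of baskets that contain every item of item_set."""
    return sum(1 for basket in baskets if item_set <= basket)

s = 0.5                        # support threshold as a fraction of all baskets
min_count = s * len(baskets)   # 2.0 baskets

# Count the support of every pair of items.
items = sorted(set().union(*baskets))
pair_counts = {pair: support_count(set(pair), baskets)
               for pair in combinations(items, 2)}

# Every pair appears in exactly 2 baskets, i.e. exactly on the threshold.
frequent_at_least = [pair for pair, count in pair_counts.items() if count >= min_count]
frequent_strict = [pair for pair, count in pair_counts.items() if count > min_count]

print(pair_counts)
print(len(frequent_at_least), len(frequent_strict))  # 3 0
```

With a dataset of 100000 baskets such exact ties are rare, but the choice of comparison is worth keeping in mind when reproducing results.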

## Implementation
This project has been developed using only NumPy and Python's native dictionary. We decided not to use PySpark because the transaction file fits in memory and, without a distributed cluster, the speed would be equivalent. The implementation is also fast because a dictionary stores, for each singleton, the list of baskets in which it appears, so looking up the baskets of an item takes O(1) time on average.

The execution time for the implemented Apriori algorithm, with the given database, was around 13.5 seconds. The implemented Association Rules algorithm took around 8.27 ms to complete.

Additional libraries worth highlighting are urllib.request, to download the transaction file from remote storage; pandas, to display the resulting table and to run experiments comparing it with the Python dictionary; and functools and itertools, for helper functions that build combinations and reduce lists.

## Authors
- Serghei Socolovschi [serghei@kth.se](mailto:serghei@kth.se)
- Angel Igareta [alih2@kth.se](mailto:alih2@kth.se)

## Dependencies

In [1]:
import urllib.request
import pandas as pd
import numpy as np
from itertools import permutations, combinations
from functools import reduce

## Methods

### Extract singletons
Stores each singleton (unique item among all baskets) in a dictionary together with the list of baskets in which it appears, so that looking up the baskets of an item takes O(1) time.

In [2]:
def extract_singletons(dataset):
  singletons = {}

  for basket_index in range(len(dataset)):
    item_set = dataset[basket_index]
    for item in item_set:
      key = str(item)
      if key not in singletons.keys():
        singletons[key] = np.array([basket_index])
      else:
        singletons[key] = np.append(singletons[key], basket_index)

  return singletons
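
As a possible variation, the same inverted index can be built with plain Python lists, which avoids calling `np.append` (it returns a new array on every call). The sketch below is an assumption about how that could look, not the notebook's method; the function name `extract_singletons_sketch` and the toy dataset are hypothetical.

```python
from collections import defaultdict

def extract_singletons_sketch(dataset):
    """Map each item (as a string key) to the list of basket indices containing it."""
    baskets_per_item = defaultdict(list)
    for basket_index, item_set in enumerate(dataset):
        for item in item_set:
            # list.append grows the list in place instead of copying it.
            baskets_per_item[str(item)].append(basket_index)
    # The lists could be converted with np.array(...) afterwards if arrays are needed.
    return dict(baskets_per_item)

# Hypothetical toy dataset: three baskets of integer items.
toy_dataset = [[1, 2], [2, 3], [1, 2, 3]]
index = extract_singletons_sketch(toy_dataset)
print(index)  # {'1': [0, 2], '2': [0, 1, 2], '3': [1, 2]}
```

Because baskets are visited in order, each list of basket indices comes out sorted.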

### Generate combinations
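Generates candidate item sets of a given size by merging pairs of frequent item sets from the previous step. Each pair is merged into a sorted set of items; merged sets that do not have exactly the requested size are discarded, and collecting the results in a set guarantees that every candidate is unique regardless of the order of its items.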


In [3]:
separator = "-" # When the new combined key is composed of multiple items, they are aggregated in a string and separated by a '-'

In [4]:
def generate_combinations(frequent_item_sets, size):
  combinations = set()

  for i in range(len(frequent_item_sets)):
    for j in range(i, len(frequent_item_sets)):
      potential_combination = sorted(set(frequent_item_sets[i].split(separator) + frequent_item_sets[j].split(separator)))
      if len(potential_combination) == size:
        combinations.add(separator.join(potential_combination))

  return combinations

In [5]:
generate_combinations(['2-3', '3-2', '2-4', '6-5'], 3)

{'2-3-4'}

In [6]:
generate_combinations(['2-3-4', '3-4-5', '5-7-8'], 4)

{'2-3-4-5'}

In [7]:
set(combinations([1, 2, 3], 2))

{(1, 2), (1, 3), (2, 3)}
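
One detail worth illustrating: the items are strings, so `sorted` orders them lexicographically rather than numerically. The sketch below (the helper name `canonical_key` is hypothetical) suggests that this is harmless as long as every key is built the same way, since any ordering of the same items still maps to one key.

```python
separator = "-"

def canonical_key(items):
    """Build a key like the notebook does: de-duplicate, string-sort, then join."""
    return separator.join(sorted(set(items)))

key_a = canonical_key(["9", "10"])
key_b = canonical_key(["10", "9"])
print(key_a, key_b)  # 10-9 10-9

# A numeric ordering would need an explicit key function.
numeric_key = separator.join(sorted({"9", "10"}, key=int))
print(numeric_key)  # 9-10
```

Mixing the two orderings would produce different keys for the same item set, so a lookup such as `frequent_item_sets[key]` relies on one convention being used throughout.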

### Search frequent item sets
Recursive approach for finding frequent itemsets. At each step it generates the candidate combinations of size k (k starting from 2) and, for each candidate, computes its support: the number of baskets in which all of its items appear together. If the support is over the baskets threshold, the item set is stored in a dictionary (item set key mapped to its array of baskets) that is merged into the result. The recursion stops when no candidate of size k can be generated.

In [8]:
def search_frequent_itemsets(previous_item_sets, singletons, selected_item_sets, baskets_threshold, k = 2):
  # New candidates of size k are computed by merging pairs of the previous (size k - 1) item sets.
  combinations_size_k = generate_combinations(list(previous_item_sets.keys()), k)

  # If no more combinations can be generated with the previous item sets, return selected.
  if len(combinations_size_k) == 0:
    return selected_item_sets
  # Otherwise, append item sets that are over the similarity threshold.
  else:
    current_item_sets = {}
    for combination in combinations_size_k:      
      combination_array = combination.split(separator)
      are_subsets_frequent = False
      if k == 2:
        # No need to check as previous items were already pruned
        are_subsets_frequent = True
      else: 
        # First, check that every subset of size k - 1 of the current combination is in the previous item sets
        combination_subsets = [separator.join(subset) for subset in combinations(combination_array, k - 1)]
        subset_intersection = np.intersect1d(combination_subsets, list(previous_item_sets.keys()))
        are_subsets_frequent = len(subset_intersection) == len(combination_subsets)

      # If all the subsets are frequent
      if are_subsets_frequent:
        # For each item, get its array of baskets and intersect all of them.
        tuple_total_baskets = [singletons[item] for item in combination_array]
        tuple_total_baskets_intersection = reduce(np.intersect1d, (tuple_total_baskets))

        # If the intersection is larger than baskets_threshold, add it to the current item sets.
        if (len(tuple_total_baskets_intersection) > baskets_threshold):
          current_item_sets[combination] = tuple_total_baskets_intersection
    
    return search_frequent_itemsets(current_item_sets, singletons, dict(selected_item_sets, **current_item_sets), baskets_threshold, k + 1)
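
The subset check above relies on the monotonicity of support: a candidate of size k can only be frequent if all of its subsets of size k - 1 are frequent. As a rough sketch of that pruning step in isolation, the version below uses plain set membership instead of `np.intersect1d`; the helper name `all_subsets_frequent` is hypothetical, and the example keys are borrowed from the frequent pairs reported later in this notebook.

```python
from itertools import combinations

separator = "-"

def all_subsets_frequent(candidate_key, previous_keys):
    """True if every (k - 1)-subset of the candidate is among the previous frequent keys."""
    items = candidate_key.split(separator)
    return all(
        separator.join(subset) in previous_keys
        for subset in combinations(items, len(items) - 1)
    )

# Some frequent pairs, written as keys in the notebook's format.
previous = {"39-704", "39-825", "704-825", "227-390"}

print(all_subsets_frequent("39-704-825", previous))  # True
print(all_subsets_frequent("227-39-390", previous))  # False: "227-39" is missing
```

Candidates that fail this check can be skipped without intersecting any basket lists, which is where the pruning saves time.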

### Generate permutations for rules
For each size k from 1 to n - 1 (where n is the number of items in item_set), take every combination of k items as the left part of a candidate rule and the remaining n - k items as the right part.

In [9]:
def generate_rules(item_set):
  candidates = []
  for k in range(1, len(item_set)):
    for combination in list(combinations(item_set, k)):
      left_part = list(combination)
      right_part = [item for item in item_set if item not in left_part]
      candidates.append((left_part, right_part))
  
  return candidates

In [10]:
example_list = ['390', '722', '222', '444']

for i in range(0, len(example_list) - 1):
  example_sublist = example_list[:len(example_list) - i]
  print("\nRules generated for list: " + str(example_sublist))
  for candidate in generate_rules(example_sublist):
    print(str(candidate[0]) + " => " + str(candidate[1]))


Rules generated for list: ['390', '722', '222', '444']
['390'] => ['722', '222', '444']
['722'] => ['390', '222', '444']
['222'] => ['390', '722', '444']
['444'] => ['390', '722', '222']
['390', '722'] => ['222', '444']
['390', '222'] => ['722', '444']
['390', '444'] => ['722', '222']
['722', '222'] => ['390', '444']
['722', '444'] => ['390', '222']
['222', '444'] => ['390', '722']
['390', '722', '222'] => ['444']
['390', '722', '444'] => ['222']
['390', '222', '444'] => ['722']
['722', '222', '444'] => ['390']

Rules generated for list: ['390', '722', '222']
['390'] => ['722', '222']
['722'] => ['390', '222']
['222'] => ['390', '722']
['390', '722'] => ['222']
['390', '222'] => ['722']
['722', '222'] => ['390']

Rules generated for list: ['390', '722']
['390'] => ['722']
['722'] => ['390']
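
As a quick sanity check on the listings above: an item set of n items has 2^n non-empty-or-empty subsets, and excluding the empty and the full set as left part leaves 2^n - 2 candidate rules (14 for n = 4, 6 for n = 3, 2 for n = 2). The sketch below repeats the function so that it runs on its own; the item labels are arbitrary placeholders.

```python
from itertools import combinations

def generate_rules(item_set):
    # Same logic as the function defined above, repeated to keep this block self-contained.
    candidates = []
    for k in range(1, len(item_set)):
        for combination in combinations(item_set, k):
            left_part = list(combination)
            right_part = [item for item in item_set if item not in left_part]
            candidates.append((left_part, right_part))
    return candidates

# Compare the number of generated candidates with 2^n - 2 for small n.
rule_counts = {n: len(generate_rules([str(i) for i in range(n)])) for n in range(2, 6)}
print(rule_counts)  # {2: 2, 3: 6, 4: 14, 5: 30}
```

The exponential growth is manageable here because the frequent item sets found in this dataset contain at most three items.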


### Generate association rules
For each selected item set (frequent itemsets excluding singletons), generate every possible rule between the items within the set and return the ones whose confidence is higher than the threshold.
- Confidence: conf(I => j) = support(I ∪ j) / support(I)
- Interest: interest(I => j) = conf(I => j) - Pr[j], where Pr[j] is the fraction of baskets that contain j
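
The two measures can be worked through by hand on a small example. The sketch below uses hypothetical baskets and a hypothetical helper `support_count`; it is meant only to illustrate the formulas, not to reproduce the notebook's implementation.

```python
# Hypothetical toy baskets, not taken from the notebook's dataset.
baskets = [
    {"a", "b"},
    {"a", "b", "c"},
    {"a", "c"},
    {"b"},
    {"a", "b"},
]

def support_count(items, baskets):
    """Number of baskets that contain every item in items."""
    return sum(1 for basket in baskets if items <= basket)

left, right = {"a"}, {"b"}

# conf(I => j) = support(I ∪ j) / support(I) = 3 / 4
confidence = support_count(left | right, baskets) / support_count(left, baskets)

# interest(I => j) = conf(I => j) - Pr[j] = 0.75 - 4 / 5
interest = confidence - support_count(right, baskets) / len(baskets)

print(confidence)          # 0.75
print(round(interest, 2))  # -0.05
```

In this toy case the rule {a} => {b} passes a confidence threshold of 0.5, yet its interest is slightly negative because b is already present in 80% of all baskets.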

In [11]:
def generate_association_rules(selected_item_sets, frequent_item_sets, total_baskets, confidence_threshold = 0.5):
  selected_rules = []
  # Iterate over the frequent item sets and generate candidate association rules from each of them.
  for item_set_raw in selected_item_sets:
    item_set = item_set_raw.split(separator) # From {'A-B-C'} -> ['A', 'B', 'C']
    candidate_rules = generate_rules(item_set)

    # Extracting lists of baskets from left part and right part of candidate association rule.
    for left_part, right_part in candidate_rules:
      left_part_baskets = frequent_item_sets[separator.join(left_part)]
      right_part_baskets = frequent_item_sets[separator.join(right_part)]

      # Calculate confidence and interest.
      confidence = len(np.intersect1d(left_part_baskets, right_part_baskets)) / len(left_part_baskets)
      interest = confidence - (len(right_part_baskets) / total_baskets)

      if confidence > confidence_threshold:
        rule = "{" + ", ".join(left_part) + "} => {" + ", ".join(right_part) + "}"
        selected_rules.append((rule, confidence, interest))
          
  return selected_rules

## Main method

### Configuration

In [12]:
similarity_threshold = 0.01
confidence_threshold = 0.5

### Dataset

In [13]:
dataset_url = "https://drive.google.com/uc?export=download&id=1xDIIHRMZcnOcUDvp0R48z-YEiDhrZ9FT" # Add here path to T10I4D100K.dat
dataset_raw = urllib.request.urlopen(dataset_url)

In [14]:
def transform_item_function(item_set_string):
  item_set_raw = item_set_string.strip().split() # Split items in item set
  item_set = [int(item_raw) for item_raw in item_set_raw] # Convert items to int
  return np.array(item_set)

dataset = [transform_item_function(item_set_string) for item_set_string in dataset_raw]

In [15]:
dataset[:2]

[array([ 25,  52, 164, 240, 274, 328, 368, 448, 538, 561, 630, 687, 730,
        775, 825, 834]),
 array([ 39, 120, 124, 205, 401, 581, 704, 814, 825, 834])]

### Frequent Itemsets

In [16]:
total_baskets = len(dataset)
baskets_threshold = similarity_threshold * total_baskets
print("Total baskets: " + str(total_baskets))
print("Baskets threshold: " + str(baskets_threshold))

Total baskets: 100000
Baskets threshold: 1000.0


In [17]:
# Extract singletons from dataset
singletons = extract_singletons(dataset)
print("Total items: " + str(len(singletons.keys())))

Total items: 870


In [18]:
# Filter singletons
new_singletons = {}

for (key, value) in singletons.items():
  if len(value) > baskets_threshold:
    new_singletons[key] = value

singletons = new_singletons
print("Items over similarity threshold: " + str(len(singletons.keys())))

Items over similarity threshold: 375


In [19]:
%%timeit
search_frequent_itemsets(singletons, singletons, {}, baskets_threshold)

1 loop, best of 3: 13.3 s per loop


In [20]:
%%time
selected_item_sets = search_frequent_itemsets(singletons, singletons, {}, baskets_threshold)
print(selected_item_sets.keys())

dict_keys(['227-390', '217-346', '39-825', '368-682', '390-722', '39-704', '704-825', '789-829', '368-829', '39-704-825'])
CPU times: user 13.5 s, sys: 4.96 ms, total: 13.5 s
Wall time: 13.5 s


### Association rules

In [21]:
frequent_item_sets = dict(singletons, **selected_item_sets)

In [22]:
%%timeit # 8.39 ms per loop
generate_association_rules(selected_item_sets, frequent_item_sets, total_baskets, confidence_threshold)

100 loops, best of 3: 8.27 ms per loop


In [23]:
%time
association_rules = generate_association_rules(selected_item_sets, frequent_item_sets, total_baskets, confidence_threshold)
df = pd.DataFrame(association_rules, columns =['Rule', 'Confidence', 'Interest'])
df.head(20)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.91 µs


| | Rule | Confidence | Interest |
|---|---|---|---|
| 0 | {227} => {390} | 0.577008 | 0.550158 |
| 1 | {704} => {39} | 0.617057 | 0.574477 |
| 2 | {704} => {825} | 0.61427 | 0.58342 |
| 3 | {704} => {39, 825} | 0.576923 | 0.565053 |
| 4 | {39, 704} => {825} | 0.934959 | 0.904109 |
| 5 | {39, 825} => {704} | 0.871946 | 0.854006 |
| 6 | {704, 825} => {39} | 0.939201 | 0.896621 |


## Appendix

### Pandas version
Alternative implementation of the search for frequent itemsets in which the pandas library is used instead of a Python dictionary. However, the running time of the experiment turned out to be around 62 seconds (measured with %%timeit over 3 loops), which is more than 4 times slower than the roughly 13.5 seconds of the native implementation, so it was discarded.

In [24]:
def generate_combinations_pd(frequent_item_sets, singletons, size):
  combinations = set()

  for singleton in singletons:
    for item in frequent_item_sets:
      potential_combination = sorted(set([singleton] + item.split(separator)))
      if len(potential_combination) == size:
        combinations.add(separator.join(potential_combination))

  return combinations

In [25]:
generate_combinations_pd(['2-3', '3-2'], ['2', '6'], 3)

{'2-3-6'}

In [26]:
def extract_singletons_pd(dataset):
  singletons_dic = {}
  for basket_index in range(len(dataset)):
    item_set = dataset[basket_index]
    for item in item_set:
      key = str(item) # This key will be separated by - when representing more than 1 item
      if key not in singletons_dic.keys():
        singletons_dic[key] = (1, np.array([basket_index]))
      else:
        singletons_dic[key] = (singletons_dic[key][0] + 1,  np.append(singletons_dic[key][1], basket_index))

  singletons = pd.DataFrame(list(singletons_dic.items()), columns = ["key", "tuple"]).set_index('key')

  # Split tuple into count and baskets
  return pd.DataFrame(singletons['tuple'].tolist(), columns = ['count', 'baskets'], index=singletons.index)

In [27]:
def search_frequent_itemsets_pd(previous_item_sets, singletons, selected_item_sets, baskets_threshold, k = 2):
  combinations_size_k = generate_combinations_pd(previous_item_sets, singletons.index.values, k)

  # If no more combinations can be generated with the previous item sets, return selected
  if len(combinations_size_k) == 0:
    return selected_item_sets
  # Otherwise, append item sets over the similarity threshold
  else:
    current_item_sets = []
    for combination in combinations_size_k:
      # For each item, get baskets set and intersect it
      tuple_total_baskets = singletons.loc[combination.split(separator), "baskets"].to_numpy()
      tuple_total_baskets_intersection = reduce(np.intersect1d, (tuple_total_baskets))

      # If the intersection is larger than baskets_threshold, add the combination to the current item sets
      if (len(tuple_total_baskets_intersection) > baskets_threshold):
        current_item_sets.append(combination)
    
    return search_frequent_itemsets_pd(current_item_sets, singletons, selected_item_sets + current_item_sets, baskets_threshold, k + 1)

In [28]:
%%timeit # 1 min 2s per loop
singletons_pd = extract_singletons_pd(dataset)
singletons_pd = singletons_pd[singletons_pd['count'] > baskets_threshold]
selected_item_sets = search_frequent_itemsets_pd(singletons_pd.index.values, singletons_pd, [], baskets_threshold)
print(selected_item_sets)

['227-390', '217-346', '39-825', '368-682', '390-722', '39-704', '704-825', '789-829', '368-829', '39-704-825']
['227-390', '217-346', '39-825', '368-682', '390-722', '39-704', '704-825', '789-829', '368-829', '39-704-825']
['227-390', '217-346', '39-825', '368-682', '390-722', '39-704', '704-825', '789-829', '368-829', '39-704-825']
['227-390', '217-346', '39-825', '368-682', '390-722', '39-704', '704-825', '789-829', '368-829', '39-704-825']
1 loop, best of 3: 1min 1s per loop
