# Module 2 - Frequent pattern mining

## FPM Objectives:
1. Find combinations of attributes that are common to many objects
2. Find significant associations between these combinations
3. Find frequent sequences

Definitions:
- $I$ is the set of all items $i_1, ..., i_n$, where $n$ is the number of items
- $S \subseteq I$ is an itemset
- $D$ is the transactional data: a collection of transactions $T$, each with $T \subseteq I$
- The support of an itemset $S$ in $D$ is defined as: $sup_D(S) = \frac{\sum_{T \in D} \delta(S \subseteq T)}{|D|}$. It is the ratio between the number of transactions that contain the itemset and the total number of transactions. (The lecture slides do not use the ratio; they use only the numerator, i.e. the absolute count.)
- Here $\delta$ is the indicator function:
$$
\delta(x)=
    \begin{cases}
        1 & \text{if } x \text{ is true}\\
        0 & \text{if } x \text{ is false}
    \end{cases}
$$
- The total number of non-empty itemsets is $2^{|I|} - 1$
- **Frequent Itemset Mining (FIM)**: Given a set of items $I$, transactional data $D$ and a threshold value $\sigma$, FIM aims to find all itemsets $S \subseteq I$ whose support in $D$ is $\ge \sigma$. These are called "frequent itemsets".

Naively computing the frequent itemsets is infeasible, as the number of possible itemsets grows exponentially with the number of items in $I$. Fortunately, there are several ways to limit the search space.
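To make the blow-up concrete, here is a minimal brute-force sketch. It is not taken from the lecture: the toy transactions and the helper names `all_itemsets` and `brute_force_fim` are assumptions made for illustration. It enumerates every non-empty itemset and counts its absolute support, so it always evaluates $2^{|I|} - 1$ candidates.

```python
from itertools import chain, combinations

# Toy transactions (hypothetical data, for illustration only)
transactions = [
    frozenset({'a', 'b'}),
    frozenset({'a', 'c'}),
    frozenset({'a', 'b', 'c'}),
    frozenset({'b'}),
]
items = sorted(set().union(*transactions))

def all_itemsets(items):
    """Yield every non-empty subset of items as a tuple."""
    return chain.from_iterable(
        combinations(items, k) for k in range(1, len(items) + 1)
    )

def brute_force_fim(transactions, items, sigma):
    """Check the absolute support of every candidate itemset against sigma."""
    frequent = {}
    n_candidates = 0
    for candidate in map(frozenset, all_itemsets(items)):
        n_candidates += 1
        sup = sum(1 for t in transactions if candidate <= t)
        if sup >= sigma:
            frequent[candidate] = sup
    return frequent, n_candidates

frequent, n_candidates = brute_force_fim(transactions, items, sigma=2)
print(n_candidates)   # number of candidates examined: 2**3 - 1
print(len(frequent))  # number of itemsets with support >= 2
```

With only 3 items this is cheap, but each additional item doubles the number of candidates, which is why the algorithms below avoid enumerating the full lattice.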

## 2 Theorems of Monotonicity

1. $sup_D(S) \ge \sigma \Rightarrow \text{all subsets of } S \text{ are frequent too}$
2. $sup_D(S) \lt \sigma \Rightarrow \text{all supersets of } S \text{ are non-frequent too}$
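The first statement holds because every transaction that contains $S$ also contains each subset of $S$, so a subset can never have lower support than $S$. The second statement is the contrapositive of the first. The sketch below checks this property on toy data; the data and the helper names `support`, `proper_subsets` and `check_monotonicity` are assumptions for illustration, not part of the lecture.

```python
from itertools import combinations

# Toy transactions (hypothetical data, for illustration only)
transactions = [
    frozenset({'a', 'b'}),
    frozenset({'a', 'c'}),
    frozenset({'a', 'b', 'c'}),
    frozenset({'b'}),
]

def support(transactions, itemset):
    """Absolute support: number of transactions containing itemset."""
    return sum(1 for t in transactions if itemset <= t)

def proper_subsets(itemset):
    """Yield all non-empty proper subsets of itemset."""
    elems = list(itemset)
    for k in range(1, len(elems)):
        for combo in combinations(elems, k):
            yield frozenset(combo)

def check_monotonicity(transactions, itemset):
    """Return True if no subset of itemset has lower support than itemset."""
    sup = support(transactions, itemset)
    return all(
        support(transactions, sub) >= sup for sub in proper_subsets(itemset)
    )

S = frozenset({'a', 'b', 'c'})
print(check_monotonicity(transactions, S))
```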

Now we'll move on to some algorithms for frequent itemset mining which utilize these theorems.

## Apriori

In [1]:
from functools import partial
from itertools import product
from typing import Dict, FrozenSet, List, Set, Tuple

Itemset = FrozenSet[str]
Transactions = List[Tuple[str, Itemset]]

# Sample transaction data: (transaction ID, itemset) pairs
t_data: Transactions = [
    ('Andrew', frozenset(['Indian', 'Mediterranean'])),
    ('Bernhard', frozenset(['Indian', 'Oriental', 'Fast Food'])),
    ('Carolina', frozenset(['Indian', 'Mediterranean', 'Oriental'])),
    ('Dennis', frozenset(['Arabic', 'Mediterranean'])),
    ('Eve', frozenset(['Oriental'])),
    ('Fred', frozenset(['Indian', 'Mediterranean', 'Oriental'])),
    ('Gwyneth', frozenset(['Arabic', 'Mediterranean'])),
    ('Hayden', frozenset(['Indian', 'Oriental', 'Fast Food'])),
    ('Irene', frozenset(['Indian', 'Mediterranean', 'Oriental'])),
    ('James', frozenset(['Arabic', 'Mediterranean'])),
]

# Set of all items
all_items = {'Oriental', 'Indian', 'Mediterranean', 'Fast Food', 'Arabic'}

def support(t_data: Transactions, item_set: Itemset) -> int:
    """Absolute support: the number of transactions that contain item_set.
    Divide the result by len(t_data) if you want the normalized support."""
    return sum(1 for _, items in t_data if item_set <= items)

def generate_candidates(F: Set[Itemset], k: int) -> Set[Itemset]:
    """Create candidate k-itemsets from the frequent (k-1)-itemsets in F."""
    # Join step: only keep unions that have exactly k items
    C = {X | Y for X, Y in product(F, F) if len(X | Y) == k}
    # Prune step (monotonicity): every (k-1)-subset of a candidate must be frequent
    return {X for X in C if all(X - {i} in F for i in X)}

def apriori(t_data: Transactions, items: Set[str], sigma: float) -> Set[Itemset]:
    """Return frequent itemsets for a minimum (absolute) support threshold sigma"""
    supD = partial(support, t_data)
    k = 1
    # Create initial entry for F with the frequent single-item sets
    F: Dict[int, Set[Itemset]] = {
        k: {frozenset([i]) for i in items if supD(frozenset([i])) >= sigma}
    }
    while F[k]:
        candidates = generate_candidates(F[k], k + 1)
        F[k + 1] = {X for X in candidates if supD(X) >= sigma}
        k += 1
    # Merge the frequent itemsets of all levels into one set
    return set().union(*F.values())

apriori(t_data, all_items, 3) # Gives the same frequent itemsets as the lattice on page 10 of the lecture slides

{frozenset({'Mediterranean'}),
 frozenset({'Arabic'}),
 frozenset({'Oriental'}),
 frozenset({'Indian'}),
 frozenset({'Indian', 'Oriental'}),
 frozenset({'Mediterranean', 'Oriental'}),
 frozenset({'Indian', 'Mediterranean'}),
 frozenset({'Indian', 'Mediterranean', 'Oriental'}),
 frozenset({'Arabic', 'Mediterranean'})}

## Eclat

Instead of the horizontal format used above (one itemset per transaction), Eclat stores the data in a vertical format: each item is mapped to its TID-set, the set of IDs of the transactions that contain it. With this layout, checking the support of an itemset of, e.g., cardinality 2 simply becomes a matter of finding the cardinality of the intersection of the TID-sets of its two items.

| Item          | TID-set                              | Cardinality |
| ------------- | ------------------------------------ | ----------- |
| Arabic        | {Dennis, Gwyneth, James}              | 3           |
| Indian        | {Andrew, Bernhard, Carolina, Fred, Hayden, Irene} | 6           |
| Mediterranean | {Andrew, Carolina, Dennis, Fred, Gwyneth, Irene, James} | 7           |
| Oriental      | {Bernhard, Carolina, Eve, Fred, Hayden, Irene} | 6           |
| Fast Food     | {Bernhard, Hayden}                    | 2           |

So the support of $\{I, M\}$ (referring to items by their first letter) is $|\{Andrew, Bernhard, Carolina, Fred, Hayden, Irene\}\cap\{Andrew, Carolina, Dennis, Fred, Gwyneth, Irene, James\}|$, which is 4. Eclat works similarly to Apriori: it starts with cardinality $k = 1$, then iteratively increases $k$ and intersects TID-sets to compute the support of each candidate. A candidate whose support reaches the threshold is kept; all others are dropped.
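The procedure described above can be sketched as follows. This is a minimal level-wise version that follows the description in these notes, not a reference implementation; Eclat is often implemented as a depth-first search instead. The function names `to_vertical` and `eclat` are assumptions for illustration, and the data is the same sample used in the Apriori cell.

```python
from collections import defaultdict
from itertools import combinations

# Same sample data as in the Apriori cell: (TID, itemset) pairs
t_data = [
    ('Andrew', frozenset(['Indian', 'Mediterranean'])),
    ('Bernhard', frozenset(['Indian', 'Oriental', 'Fast Food'])),
    ('Carolina', frozenset(['Indian', 'Mediterranean', 'Oriental'])),
    ('Dennis', frozenset(['Arabic', 'Mediterranean'])),
    ('Eve', frozenset(['Oriental'])),
    ('Fred', frozenset(['Indian', 'Mediterranean', 'Oriental'])),
    ('Gwyneth', frozenset(['Arabic', 'Mediterranean'])),
    ('Hayden', frozenset(['Indian', 'Oriental', 'Fast Food'])),
    ('Irene', frozenset(['Indian', 'Mediterranean', 'Oriental'])),
    ('James', frozenset(['Arabic', 'Mediterranean'])),
]

def to_vertical(t_data):
    """Convert (TID, itemset) rows into the vertical layout: item -> TID-set."""
    tidsets = defaultdict(set)
    for tid, items in t_data:
        for item in items:
            tidsets[item].add(tid)
    return {item: frozenset(tids) for item, tids in tidsets.items()}

def eclat(t_data, sigma):
    """Return {itemset: absolute support} for all itemsets with support >= sigma."""
    vertical = to_vertical(t_data)
    # Level k = 1: frequent single items together with their TID-sets
    level = {
        frozenset([item]): tids
        for item, tids in vertical.items()
        if len(tids) >= sigma
    }
    frequent = {}
    k = 1
    while level:
        frequent.update({s: len(tids) for s, tids in level.items()})
        next_level = {}
        for (X, tids_x), (Y, tids_y) in combinations(level.items(), 2):
            Z = X | Y
            if len(Z) == k + 1 and Z not in next_level:
                # TID-set of the union is the intersection of the TID-sets
                tids = tids_x & tids_y
                if len(tids) >= sigma:
                    next_level[Z] = tids
        level = next_level
        k += 1
    return frequent

frequent = eclat(t_data, sigma=3)
print(frequent[frozenset({'Indian', 'Mediterranean'})])
```

Keeping the TID-set next to each itemset means the transactions are scanned only once, when the vertical layout is built; every later support check is a set intersection.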