# Frequent Itemset Mining

In many domains, from retail and web analytics to bioinformatics and cybersecurity, we are surrounded by discrete event data: logs of things that co-occur, purchases that happen together, or features that frequently appear together. The **goal of frequent pattern mining** (of which frequent itemset mining is one form) **is to uncover regularities, co-occurrences, and associations hidden in these large transactional datasets**.


**Market Basket Analysis**
- Discover that customers who buy milk and bread often also buy butter.
- This insight drives store layout, cross-selling, and recommendation systems.

**Web & Clickstream Mining**
- Identify frequent sequences of pages or actions, e.g., home → search → checkout.
- Helps optimize navigation design or ad placement.

**Healthcare & Bioinformatics**
- Find frequent combinations of symptoms or genes associated with conditions.
- Enables diagnostic rule discovery and biomarker identification.

**Network Security**
- Detect frequent combinations of log events or packet signatures indicative of an attack pattern.

In [None]:
from collections import defaultdict, Counter
from itertools import combinations, chain
import math
import random
import time
import matplotlib.pyplot as plt

def powerset(iterable):
    '''Return all non-empty proper subsets of an iterable as frozensets.'''
    s = list(iterable)
    for r in range(1, len(s)):
        for comb in combinations(s, r):
            yield frozenset(comb)

def format_itemset(iset):
    return "{" + ", ".join(sorted(map(str, iset))) + "}"

def print_itemsets(freq_dict, num_transactions, max_items=20):
    '''Pretty-print up to `max_items` itemsets with support as fraction and count.'''
    items = list(freq_dict.items())
    items.sort(key=lambda kv: (-kv[1], sorted(list(kv[0]))))
    for i, (iset, cnt) in enumerate(items[:max_items], start=1):
        sup = cnt / max(1, num_transactions)
        print(f"{i:>2}. {format_itemset(iset):<40} support={sup:.3f} (count={cnt})")

def without(iterable, item):
    '''Return iterable minus a single item.'''
    return [x for x in iterable if x != item]

## Core Concepts

### Items & Transactions
- **Item**: an atomic symbol (e.g., `"milk"`, `"bread"`, `"diapers"`).
- **Itemset** ($\mathcal{I}$): a set of items, e.g., `{milk, bread}`.
- **Transaction** ($\mathcal{T}$): a set (or list) of items, e.g., the items purchased or observed together.
- **Transaction database** ($\mathcal{D}$): a list of transactions.

For example:

| **Basket ID** | **Items**                    |
|:--------------:|:-----------------------------|
| 1 | {milk, bread, salami} |
| 2 | {beer, diapers} |
| 3 | {beer, wurst} |
| 4 | {beer, baby food, diapers} |
| 5 | {diapers, coke, bread} |

### Support
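- **Frequency:** $\sigma(\mathcal{I})=\{j \mid \mathcal{T}_j \supseteq \mathcal{I}\}$ is the set of basket ids $j$ of the transactions that contain all items in $\mathcal{I}$; its size $|\sigma(\mathcal{I})|$ is the frequency (support count) of $\mathcal{I}$.
- **Support:** $supp(\mathcal{I}) = \frac{|\sigma(\mathcal{I})|}{|\mathcal{D}|}$, the fraction of transactions in $\mathcal{D}$ that contain $\mathcal{I}$.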
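
To make these two definitions concrete, here is a minimal sketch that evaluates $\sigma$ and $supp$ on the five-basket example table. The names `baskets`, `cover`, and `supp` are illustrative only and are not part of the notebook's helper functions.

```python
# Example baskets from the table above, keyed by basket id.
baskets = {
    1: {"milk", "bread", "salami"},
    2: {"beer", "diapers"},
    3: {"beer", "wurst"},
    4: {"beer", "baby food", "diapers"},
    5: {"diapers", "coke", "bread"},
}

def cover(itemset, db):
    """sigma(I): ids of the baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return {j for j, t in db.items() if itemset <= t}

def supp(itemset, db):
    """supp(I) = |sigma(I)| / |D|."""
    return len(cover(itemset, db)) / len(db)

print(sorted(cover({"beer", "diapers"}, baskets)))  # [2, 4]
print(supp({"beer", "diapers"}, baskets))           # 0.4
print(supp({"bread"}, baskets))                     # 0.4
```

Baskets 2 and 4 are the only ones containing both beer and diapers, so the support is 2 out of 5 transactions.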


### Association Rules
A rule has the form `X → Y` where `X` and `Y` are disjoint itemsets.

- **Confidence**: $conf(X \rightarrow Y) = supp(X \cup Y) / supp(X)$  
- **Lift**: $lift(X \rightarrow Y) = conf(X \rightarrow Y) / supp(Y)$  
- **Interestingness**: $int(X \rightarrow Y) = conf(X \rightarrow Y) - supp(Y)$ 
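
As a worked example, the three metrics can be evaluated for the rule `{diapers} → {beer}` on the five-basket example table. This is a small sketch rather than the notebook's own implementation: the helper `supp` and the use of `Fraction` (to keep the arithmetic exact) are choices made for this illustration.

```python
from fractions import Fraction

# Example baskets from the five-row table in "Items & Transactions".
baskets = [
    {"milk", "bread", "salami"},
    {"beer", "diapers"},
    {"beer", "wurst"},
    {"beer", "baby food", "diapers"},
    {"diapers", "coke", "bread"},
]

def supp(itemset):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= t for t in baskets), len(baskets))

X, Y = {"diapers"}, {"beer"}
conf = supp(X | Y) / supp(X)   # supp(X ∪ Y) / supp(X)
lift = conf / supp(Y)          # conf / supp(Y)
interest = conf - supp(Y)      # conf - supp(Y)

print(f"supp(X ∪ Y) = {supp(X | Y)}")  # 2/5
print(f"conf = {conf}")                # 2/3
print(f"lift = {lift}")                # 10/9
print(f"int  = {interest}")            # 1/15
```

A lift slightly above 1 and a small positive interestingness both say the same thing: in this tiny sample, baskets with diapers contain beer a little more often than baskets in general do.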

> **Anti-monotonicity (Apriori principle):** If an itemset `X` is frequent, then **all** subsets of `X` are also frequent. Equivalently (by contraposition), if an itemset is infrequent, then every superset of it is infrequent as well.
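
The principle can be checked by brute force on the five-basket example table, and the same sketch shows why it is useful for pruning. The threshold `min_count = 2` and the helper `count` are assumptions made for this illustration only.

```python
from itertools import combinations

# Example baskets from the five-row table in "Items & Transactions".
baskets = [
    {"milk", "bread", "salami"},
    {"beer", "diapers"},
    {"beer", "wurst"},
    {"beer", "baby food", "diapers"},
    {"diapers", "coke", "bread"},
]

def count(itemset):
    """Number of baskets that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in baskets)

items = sorted(set().union(*baskets))

# Anti-monotonicity: dropping one item from an itemset never lowers its count.
violations = []
for k in range(2, len(items) + 1):
    for combo in combinations(items, k):
        for item in combo:
            subset = set(combo) - {item}
            if count(subset) < count(combo):
                violations.append((combo, subset))
print(violations)  # []

# Pruning: a pair can only be frequent if both of its items are frequent.
min_count = 2  # assumed threshold for this example
frequent_items = [i for i in items if count({i}) >= min_count]
print(frequent_items)  # ['beer', 'bread', 'diapers']

candidate_pairs = list(combinations(frequent_items, 2))  # 3 pairs instead of 28
frequent_pairs = [p for p in candidate_pairs if count(p) >= min_count]
print(frequent_pairs)  # [('beer', 'diapers')]
```

With eight distinct items there are 28 possible pairs, but only the three pairs built from frequent single items need to be counted.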

In [None]:
def to_frozenset(txn):
    '''Normalize a transaction/list/tuple to a frozenset for hashing.'''
    return frozenset(txn)

def support_counts(transactions):
    '''Return a Counter mapping frozenset(itemset)->count for all singletons.'''
    c = Counter()
    for t in transactions:
        fs = to_frozenset(t)
        for item in fs:
            c[frozenset([item])] += 1
    return c

def count_support_of_itemsets(transactions, candidates):
    '''Count support for each candidate itemset over transactions.'''
    counts = Counter()
    for t in map(to_frozenset, transactions):
        for c in candidates:
            if c.issubset(t):
                counts[c] += 1
    return counts

def support_fraction(count, num_transactions):
    return count / max(1, num_transactions)

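def rule_metrics(supp_xy, supp_x, supp_y, n):
    '''Compute standard rule metrics given supports as counts and DB size n.'''
    s_xy = support_fraction(supp_xy, n)      # supp(X ∪ Y)
    s_x  = support_fraction(supp_x, n)       # supp(X)
    s_y  = support_fraction(supp_y, n)       # supp(Y)
    conf = s_xy / s_x if s_x > 0 else 0.0    # supp(X ∪ Y) / supp(X); 0.0 is a convention when supp(X) = 0
    lift = conf / s_y if s_y > 0 else 0.0    # conf / supp(Y); 0.0 is a convention when supp(Y) = 0
    intr = conf - s_y                        # conf - supp(Y)
    return {"support": s_xy, "confidence": conf, "lift": lift, "interestingness": intr}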

### Sample Transaction Dataset

We'll use a tiny "market basket" sample to illustrate calculations.

In [None]:
# Toy Transactions
transactions = [
    {"milk","bread","eggs"},
    {"bread","butter"},
    {"milk","bread","butter","eggs"},
    {"beer","bread"},
    {"milk","diapers","beer","bread"},
    {"diapers","bread","butter"},
    {"milk","diapers","beer","cola"},
    {"bread","milk","diapers","beer"},
]
n_txn = len(transactions)
print(f"Loaded {n_txn} transactions.")
for i, t in enumerate(transactions):
    print(f"{i:>2}: {sorted(t)}")

In [None]:
# Compute Basic Supports and Example Rules
sing_counts = support_counts(transactions)
print("Singleton supports:")
print_itemsets(sing_counts, n_txn)

# Example: compute support and rule metrics for {milk} -> {bread}
X = frozenset(["milk"]); Y = frozenset(["bread"])
supp_x  = sing_counts[X]
supp_y  = sing_counts[Y]
supp_xy = count_support_of_itemsets(transactions, [X|Y])[X|Y]

metrics = rule_metrics(supp_xy, supp_x, supp_y, n_txn)
print(f"\nRule {format_itemset(X)} → {format_itemset(Y)} metrics:")
for k,v in metrics.items():
    if isinstance(v, float):
        print(f"  {k}: {v:.3f}")
    else:
        print(f"  {k}: {v}")

In [None]:
# Utilities: Deriving MFI, CI, CFI from all frequent itemsets
def maximal_frequent_itemsets(freq_counts):
    '''Return set of MFIs (as frozensets) given dict itemset->count for all frequent itemsets.'''
    fis = list(freq_counts.keys())
    fis_sorted = sorted(fis, key=lambda s: (-len(s), sorted(list(s))))
    maximal = set()
    kept = []
    for X in fis_sorted:
        if not any(X < Y for Y in kept):
            maximal.add(X)
            kept.append(X)
    return maximal

def closed_itemsets(all_counts):
    '''Return set of closed itemsets (not necessarily frequent) from dict itemset->count.'''
    items = list(all_counts.items())
    items.sort(key=lambda kv: (-len(kv[0]), sorted(list(kv[0]))))
    closed = set()
    for X, cnt in items:
        is_closed = True
        for Y, cntY in items:
            if X < Y and cntY == cnt:
                is_closed = False
                break
        if is_closed:
            closed.add(X)
    return closed

def closed_frequent_itemsets(freq_counts, all_counts=None):
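    '''Return CFI = { X in F | no proper superset Y of X has the same support }.
       If all_counts is not given, freq_counts alone is sufficient as long as it
       holds every frequent itemset: a superset with the same support as a
       frequent X is itself frequent.
    '''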
    base = all_counts if all_counts is not None else freq_counts
    cis = closed_itemsets(base)
    return {X for X in freq_counts.keys() if X in cis}

In [None]:
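# MFIs, CIs, CFIs on the toy dataset
# NOTE: freq_ap (dict itemset -> count of all frequent itemsets) is not defined
# in this section; it must be computed before this cell runs.
mfis = maximal_frequent_itemsets(freq_ap)
cfis = closed_frequent_itemsets(freq_ap)  # equal-support supersets of a frequent set are frequent, so freq_ap suffices
cis  = closed_itemsets(freq_ap)           # closed among the frequent itemsets only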

print("Maximal Frequent Itemsets (MFI):")
for i, s in enumerate(sorted(mfis, key=lambda s: (-len(s), sorted(list(s)))), 1):
    print(f"{i:>2}. {format_itemset(s)} (count={freq_ap[s]})")

print("\nClosed Frequent Itemsets (CFI):")
for i, s in enumerate(sorted(cfis, key=lambda s: (-len(s), sorted(list(s)))), 1):
    print(f"{i:>2}. {format_itemset(s)} (count={freq_ap[s]})")