# Frequent Itemset Mining

In many domains, from retail and web analytics to bioinformatics and cybersecurity, we are surrounded by discrete event data: logs of things that co-occur, purchases that happen together, or features that frequently appear in patterns. The **goal of frequent pattern mining** (also called frequent itemset mining) **is to uncover regularities, co-occurrences, and associations hidden in these large transactional datasets**.


**Market Basket Analysis**
- Discover that customers who buy milk and bread often also buy butter.
- This insight drives store layout, cross-selling, and recommendation systems.

**Web & Clickstream Mining**
- Identify frequent sequences of pages or actions, e.g., home → search → checkout.
- Helps optimize navigation design or ad placement.

**Healthcare & Bioinformatics**
- Find frequent combinations of symptoms or genes associated with conditions.
- Enables diagnostic rule discovery and biomarker identification.

**Network Security**
- Detect frequent combinations of log events or packet signatures indicative of an attack pattern.

In [1]:
from collections import defaultdict, Counter
from itertools import combinations, chain
import math
import random
import time
import matplotlib.pyplot as plt

def powerset(iterable):
    '''Return all non-empty proper subsets of an iterable as frozensets.'''
    s = list(iterable)
    for r in range(1, len(s)):
        for comb in combinations(s, r):
            yield frozenset(comb)

def format_itemset(iset):
    return "{" + ", ".join(sorted(map(str, iset))) + "}"

def print_itemsets(freq_dict, num_transactions, max_items=20):
    '''Pretty-print up to `max_items` itemsets with support as fraction and count.'''
    items = list(freq_dict.items())
    items.sort(key=lambda kv: (-kv[1], sorted(list(kv[0]))))
    for i, (iset, cnt) in enumerate(items[:max_items], start=1):
        sup = cnt / max(1, num_transactions)
        print(f"{i:>2}. {format_itemset(iset):<40} support={sup:.3f} (count={cnt})")

def without(iterable, item):
    '''Return iterable minus a single item.'''
    return [x for x in iterable if x != item]

## Core Concepts

### Items & Transactions
- **Item**: an atomic symbol (e.g., `"milk"`, `"bread"`, `"diapers"`).
- **Itemset** ($\mathcal{I}$): a set of items, e.g., `{milk, bread}`.
- **Transaction** ($\mathcal{T}$): a set (or list) of items, e.g., the items purchased or observed together.
- **Transaction database** ($\mathcal{D}$): a list of transactions.

For example:

| **Basket ID** | **Items**                    |
|:--------------:|:-----------------------------|
| 1 | {milk, bread, salami} |
| 2 | {beer, diapers} |
| 3 | {beer, wurst} |
| 4 | {beer, baby food, diapers} |
| 5 | {diapers, coke, bread} |

### Support
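- **Cover (occurrence set):** $\sigma(\mathcal{I})=\{j \mid \mathcal{T}_j \supseteq \mathcal{I}\}$, the set of basket ids $j$ of transactions that contain all items in $\mathcal{I}$. Its size $|\sigma(\mathcal{I})|$ is the **frequency** (support count) of $\mathcal{I}$.
- **Support:** $supp(\mathcal{I}) = \frac{|\sigma(\mathcal{I})|}{|\mathcal{D}|}$, the fraction of transactions that contain $\mathcal{I}$.

An itemset is called **frequent** when its support reaches a chosen threshold $minsup$.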
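
As a rough illustration, both definitions can be computed directly on the basket table above. The helper names `cover` and `support` are assumptions made for this sketch; they are not part of the notebook's own helpers.

```python
# Basket table from above, keyed by basket id.
baskets = {
    1: {"milk", "bread", "salami"},
    2: {"beer", "diapers"},
    3: {"beer", "wurst"},
    4: {"beer", "baby food", "diapers"},
    5: {"diapers", "coke", "bread"},
}

def cover(itemset, db):
    """Ids of the transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return {j for j, t in db.items() if itemset <= t}

def support(itemset, db):
    """Fraction of transactions that contain `itemset`."""
    return len(cover(itemset, db)) / len(db)

print(sorted(cover({"beer", "diapers"}, baskets)))  # baskets 2 and 4
print(support({"beer", "diapers"}, baskets))        # 2 / 5 = 0.4
```

Keeping the id set around (rather than only its size) makes it cheap to intersect covers later, which is one reason some miners store transactions in this "vertical" form.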

### Association Rules
A rule has the form `X → Y` where `X` and `Y` are disjoint itemsets.

- **Confidence**: $conf(X \rightarrow Y) = supp(X \cup Y) / supp(X)$. How reliable the rule is: given the left-hand side, how often does the right-hand side also happen?
> If 100 customers buy bread, and 60 of those also buy butter, then the confidence of the rule
bread → butter is 0.6 (60%).
- **Lift**: $lift(X \rightarrow Y) = conf(X \rightarrow Y) / supp(Y)$. How much more often the items occur together than we’d expect by chance.
> If 20% of all customers buy butter, but 60% of bread buyers buy butter, then lift = 0.6 / 0.2 = 3.0. This means: bread buyers are 3× more likely to buy butter than a random shopper.
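- **Interestingness**: $int(X \rightarrow Y) = conf(X \rightarrow Y) - supp(Y)$. How much knowing the left-hand side changes the probability of the right-hand side; values near 0 mean that $X$ tells us little about $Y$.
> A rule like milk → bread might have high support and confidence but low interestingness, because bread is bought by most shoppers regardless of milk. A rarer rule like wine → cheese might have lower support but higher interestingness if wine buyers purchase cheese far more often than shoppers in general.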


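**Anti-monotonicity (Apriori principle):** Every transaction that contains `X` also contains each subset of `X`, so support can only shrink as an itemset grows. Hence, if an itemset `X` is frequent, then **all** subsets of `X` are also frequent. Equivalently (the contrapositive), if an itemset is infrequent, every superset of it is infrequent.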
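
The property can be checked empirically. The sketch below repeats the notebook's toy market-basket transactions so that it runs on its own; `support_count` is a hypothetical helper introduced only for this check.

```python
from itertools import combinations

transactions = [
    {"milk", "bread", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "butter", "eggs"},
    {"beer", "bread"},
    {"milk", "diapers", "beer", "bread"},
    {"diapers", "bread", "butter"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
]

def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in db if itemset <= t)

X = {"milk", "bread", "eggs"}
count_x = support_count(X, transactions)

# Each non-empty proper subset of X must occur at least as often as X itself.
for r in range(1, len(X)):
    for subset in combinations(sorted(X), r):
        count_sub = support_count(subset, transactions)
        assert count_sub >= count_x
        print(list(subset), count_sub, ">=", count_x)
```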

In [45]:
def to_frozenset(txn):
    '''Normalize a transaction/list/tuple to a frozenset for hashing.'''
    return frozenset(txn)

def support_counts(transactions):
    '''Return a Counter mapping frozenset(itemset)->count for all singletons.'''
    c = Counter()
    for t in transactions:
        fs = to_frozenset(t)
        for item in fs:
            c[frozenset([item])] += 1
    return c

def count_support_of_itemsets(transactions, candidates):
    '''Count support for each candidate itemset over transactions.'''
    counts = Counter()
    for t in map(to_frozenset, transactions):
        for c in candidates:
            if c.issubset(t):
                counts[c] += 1
    return counts

def support_fraction(count, num_transactions):
    return count / max(1, num_transactions)

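def rule_metrics(supp_xy, supp_x, supp_y, n):
    '''Compute standard rule metrics given supports as counts and DB size n.'''
    s_xy = support_fraction(supp_xy, n)       # supp(X ∪ Y)
    s_x  = support_fraction(supp_x, n)        # supp(X)
    s_y  = support_fraction(supp_y, n)        # supp(Y)
    conf = s_xy / s_x if s_x > 0 else 0.0     # conf(X → Y) = supp(X ∪ Y) / supp(X)
    lift = conf / s_y if s_y > 0 else 0.0     # lift(X → Y) = conf(X → Y) / supp(Y); 0.0 when undefined
    intr = conf - s_y                         # int(X → Y)  = conf(X → Y) - supp(Y)
    return {"support": s_xy, "confidence": conf, "lift": lift, "interestingness": intr}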

### Sample Transaction Dataset

We'll use a tiny "market basket" sample to illustrate calculations.

In [36]:
# Toy Transactions
transactions = [
    {"milk","bread","eggs"},
    {"bread","butter"},
    {"milk","bread","butter","eggs"},
    {"beer","bread"},
    {"milk","diapers","beer","bread"},
    {"diapers","bread","butter"},
    {"milk","diapers","beer","cola"},
    {"bread","milk","diapers","beer"},
]
n_txn = len(transactions)
print(f"Loaded {n_txn} transactions.")
for i, t in enumerate(transactions):
    print(f"{i:>2}: {sorted(t)}")

Loaded 8 transactions.
 0: ['bread', 'eggs', 'milk']
 1: ['bread', 'butter']
 2: ['bread', 'butter', 'eggs', 'milk']
 3: ['beer', 'bread']
 4: ['beer', 'bread', 'diapers', 'milk']
 5: ['bread', 'butter', 'diapers']
 6: ['beer', 'cola', 'diapers', 'milk']
 7: ['beer', 'bread', 'diapers', 'milk']


In [46]:
# Compute Basic Supports and Example Rules
sing_counts = support_counts(transactions)
print("Singleton supports:")
print_itemsets(sing_counts, n_txn)

# Example: compute support and rule metrics for {milk} -> {bread}
X = frozenset(["milk"]); Y = frozenset(["bread"])
supp_x  = sing_counts[X]
supp_y  = sing_counts[Y]
supp_xy = count_support_of_itemsets(transactions, [X|Y])[X|Y]

metrics = rule_metrics(supp_xy, supp_x, supp_y, n_txn)
print(f"\nRule {format_itemset(X)} → {format_itemset(Y)} metrics:")
for k,v in metrics.items():
    if isinstance(v, float):
        print(f"  {k}: {v:.3f}")
    else:
        print(f"  {k}: {v}")

Singleton supports:
 1. {bread}                                  support=0.875 (count=7)
 2. {milk}                                   support=0.625 (count=5)
 3. {beer}                                   support=0.500 (count=4)
 4. {diapers}                                support=0.500 (count=4)
 5. {butter}                                 support=0.375 (count=3)
 6. {eggs}                                   support=0.250 (count=2)
 7. {cola}                                   support=0.125 (count=1)

Rule {milk} → {bread} metrics:
  support: 0.500
  confidence: 0.800
  lift: 0.914
  interestingness: -0.075


## Maximal, Closed, and Closed Frequent Itemsets

Let $\mathcal{F}$ be the set of all frequent itemsets for a given $minsup$ (minimum support threshold).

- **Maximal Frequent Itemset (MFI)**: a frequent itemset with **no frequent superset**.  
  Formally, $X \in \mathcal{F}$ is maximal if there is no $Y \in \mathcal{F}$ such that $X \subset Y$.

- **Closed Itemset (CI)**: an itemset $X$ is *closed* if **no proper superset** of $X$ has the **same support count** as $X$.

- **Closed Frequent Itemset (CFI)**: an itemset that is both **frequent** and **closed**.

**Why they matter**
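- MFIs compress the result set by keeping only the "largest" frequent patterns. Every frequent itemset is a subset of some MFI, but the supports of those subsets are lost (lossy compression).
- CFIs preserve the exact support of all frequent itemsets (lossless compression): the support of any frequent itemset equals the largest support among its closed supersets. This enables rule generation with fewer patterns.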
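
To make the three definitions concrete, here is a brute-force sketch on the notebook's toy transactions (repeated so the snippet is self-contained). The absolute threshold `min_count = 3` is an assumption chosen for illustration, and full enumeration is only feasible because the dataset has seven distinct items.

```python
from itertools import combinations

transactions = [
    {"milk", "bread", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "butter", "eggs"},
    {"beer", "bread"},
    {"milk", "diapers", "beer", "bread"},
    {"diapers", "bread", "butter"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
]
min_count = 3  # assumed absolute support threshold

# Enumerate every non-empty itemset and keep the frequent ones.
items = sorted(set().union(*transactions))
freq = {}
for r in range(1, len(items) + 1):
    for combo in combinations(items, r):
        c = frozenset(combo)
        cnt = sum(1 for t in transactions if c <= t)
        if cnt >= min_count:
            freq[c] = cnt

# Maximal: no frequent proper superset.
maximal = [x for x in freq if not any(x < y for y in freq)]

# Closed frequent: no proper superset with the same count. A superset with
# the same count would itself be frequent, so searching `freq` is enough.
closed = [x for x in freq
          if not any(x < y and freq[y] == freq[x] for y in freq)]

print(len(freq), "frequent,", len(closed), "closed,", len(maximal), "maximal")
for x in sorted(maximal, key=sorted):
    print("maximal:", sorted(x), freq[x])
```

On this data every maximal itemset is also closed, which holds in general: a frequent superset with equal support would contradict maximality.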

## Apriori Algorithm

**Idea:** Use the anti-monotone property of support to prune the search space.
If an itemset of size $k-1$ is not frequent, then any $k$-itemset (itemset with $k$ elements) containing it cannot be frequent.

### Pseudocode (high-level)
1. Count support of all singletons → $F_1 = \{ \text{frequent 1-itemsets} \}$.
2. For $k = 2, 3, \dots$
   1. **Join:** form candidate $k$-itemsets $C_k$ by joining pairs in $F_{k-1}$ that share $k-2$ items.  
   2. **Prune:** remove any $c \in C_k$ if it has a $(k-1)$-subset not in $F_{k-1}$.  
   3. Count supports of $C_k$ with one scan of DB → $F_k = \{ c \in C_k \mid supp(c) \geq minsup \}$.  
   4. Stop when $F_k$ is empty.
3. Return all frequent itemsets $\mathcal{F} = \bigcup_{k \geq 1} F_k$.

**Pros**: Simple, interpretable, good when minsup is not too low and dimensionality moderate.  
**Cons**: Many candidate generations + multiple scans; scales poorly for low minsup / high dimensionality.
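
The pseudocode above could be sketched as follows. This is a minimal, unoptimized version that assumes an absolute threshold `min_count` (a count rather than a fraction) and repeats the toy transactions so it can run standalone; the function name `apriori` is chosen here for illustration.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: count} for all itemsets with count >= min_count."""
    db = [frozenset(t) for t in transactions]

    # Step 1: count singletons to obtain F_1.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {c: n for c, n in counts.items() if n >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join: unite pairs of frequent (k-1)-itemsets that share k-2 items,
        # i.e. whose union has exactly k items.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune: every (k-1)-subset must itself be frequent.
            if all(frozenset(s) in current for s in combinations(union, k - 1)):
                candidates.add(union)

        # Count: one scan of the database for all candidates of size k.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent

transactions = [
    {"milk", "bread", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "butter", "eggs"},
    {"beer", "bread"},
    {"milk", "diapers", "beer", "bread"},
    {"diapers", "bread", "butter"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
]
result = apriori(transactions, min_count=3)
for itemset, count in sorted(result.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), count)
```

The join here compares all pairs of frequent $(k-1)$-itemsets, which is quadratic in $|F_{k-1}|$; production implementations usually join on a shared sorted prefix instead.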

## Park–Chen–Yu (PCY) Algorithm (Frequent Pairs)

**Goal:** Reduce memory for candidate **pairs** (`2`-itemsets) by hashing pairs during the first pass.

### Key idea (two-pass)
1. **Pass 1** (count singletons + hash pairs):
   - Count single items to find frequent 1-itemsets $L_1$.
   - For every transaction, hash each unordered pair $(i, j)$ into one of $B$ buckets and increment that bucket's count.
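   - After pass 1, replace the bucket counts with a bitmap: one bit per bucket, set when the bucket count is $\geq minsup$ (taken here as an absolute count), marking the bucket as **frequent**. A bucket count is an upper bound on the count of every pair hashed into it, so a pair in a non-frequent bucket cannot be frequent.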
2. **Pass 2** (count candidate pairs only):
   - Candidate pair $(i, j)$ is considered **only if**:  
     (a) $i$ and $j$ are in $L_1$, and  
     (b) $hash(i, j)$ hits a **frequent bucket**.
   - Count these candidates and keep those reaching $minsup$.
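
A compact sketch of the two passes is shown below. The hash function (CRC32 of the sorted pair), the bucket count `num_buckets=11`, and the names `bucket_of` and `pcy_frequent_pairs` are all assumptions made for illustration; a real deployment would size the bucket array to fill the memory left over after the item counts.

```python
import zlib
from collections import Counter
from itertools import combinations

def bucket_of(i, j, num_buckets):
    """Deterministic bucket index for the unordered pair (i, j)."""
    a, b = sorted((i, j))
    return zlib.crc32(f"{a}|{b}".encode("utf-8")) % num_buckets

def pcy_frequent_pairs(transactions, min_count, num_buckets=11):
    """Return {frozenset pair: count} for pairs with count >= min_count."""
    db = [sorted(set(t)) for t in transactions]

    # Pass 1: count single items and hash every pair into a bucket.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for t in db:
        item_counts.update(t)
        for i, j in combinations(t, 2):
            bucket_counts[bucket_of(i, j, num_buckets)] += 1

    frequent_items = {i for i, n in item_counts.items() if n >= min_count}
    bitmap = [n >= min_count for n in bucket_counts]  # one flag per bucket

    # Pass 2: count only pairs of frequent items that hit a frequent bucket.
    pair_counts = Counter()
    for t in db:
        kept = [i for i in t if i in frequent_items]
        for i, j in combinations(kept, 2):
            if bitmap[bucket_of(i, j, num_buckets)]:
                pair_counts[(i, j)] += 1

    return {frozenset(p): n for p, n in pair_counts.items() if n >= min_count}

transactions = [
    {"milk", "bread", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "butter", "eggs"},
    {"beer", "bread"},
    {"milk", "diapers", "beer", "bread"},
    {"diapers", "bread", "butter"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
]
pairs = pcy_frequent_pairs(transactions, min_count=3)
for pair, count in sorted(pairs.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(pair), count)
```

The number of buckets affects only how many candidates survive to pass 2, not the final answer: hash collisions can let infrequent pairs through as candidates, but the exact count in pass 2 filters them out.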

## FP–Growth: Frequent Pattern Growth

**Idea:** Avoid generating candidate sets explicitly. Build a compressed prefix-tree (**FP-tree**) of the database using **frequency-sorted** items, then mine frequent patterns via **conditional pattern bases** and **conditional FP-trees** recursively.

### Steps
1. **Single pass to count item supports**; discard infrequent items.
2. **Order items** in each transaction by descending global frequency (and break ties deterministically, e.g., lexicographically), then **insert** into an FP-tree (a compact prefix tree with counts).
3. **Mine recursively**:
   - For each item $i$ in the header table (from least frequent to most), build its **conditional pattern base** (multiset of prefix paths ending at $i$).
   - Build a **conditional FP-tree** from that base and mine recursively to get all frequent itemsets that include $i$.

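**Pros**: Only two passes over the DB (one to count items, one to build the tree), avoids explosive candidate generation, works well for dense datasets.  
**Cons**: The FP-tree can still grow large, and may not fit in memory, when transactions share few common prefixes (low overlap).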
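
The steps above might be implemented along the following lines. This is a simplified sketch, not a tuned implementation: it assumes an absolute threshold `min_count`, represents each (conditional) database as a list of `(items, weight)` pairs, and uses the hypothetical names `FPNode`, `build_tree`, and `fp_growth`.

```python
from collections import defaultdict

class FPNode:
    """One node of the prefix tree: an item, its count, and links."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_tree(weighted_txns, min_count):
    """Build an FP-tree; return (header table, counts of frequent items)."""
    counts = defaultdict(int)
    for items, weight in weighted_txns:
        for item in set(items):
            counts[item] += weight
    counts = {i: c for i, c in counts.items() if c >= min_count}

    root = FPNode(None, None)
    header = defaultdict(list)  # item -> all tree nodes carrying that item
    for items, weight in weighted_txns:
        # Keep frequent items, most frequent first, ties broken by name.
        ordered = sorted((i for i in set(items) if i in counts),
                         key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += weight
            node = child
    return header, counts

def fp_growth(weighted_txns, min_count, suffix=frozenset()):
    """Yield (itemset, count) for each frequent itemset extending `suffix`."""
    header, counts = build_tree(weighted_txns, min_count)
    for item in sorted(counts, key=lambda i: (counts[i], i)):  # least frequent first
        new_suffix = suffix | {item}
        yield new_suffix, counts[item]
        # Conditional pattern base: prefix paths leading to `item`, each
        # weighted by the count of the node where the path ends.
        base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        # Recursing rebuilds a conditional FP-tree from that base.
        yield from fp_growth(base, min_count, new_suffix)

transactions = [
    {"milk", "bread", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "butter", "eggs"},
    {"beer", "bread"},
    {"milk", "diapers", "beer", "bread"},
    {"diapers", "bread", "butter"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
]
patterns = list(fp_growth([(t, 1) for t in transactions], min_count=3))
for itemset, count in sorted(patterns, key=lambda p: (-p[1], sorted(p[0]))):
    print(sorted(itemset), count)
```

Passing weighted transactions lets the same `build_tree` serve both the original database (weight 1 per transaction) and every conditional pattern base, at the cost of rebuilding a tree per recursion level.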

# Key Takeaways

## When to Use What? (Apriori vs. PCY vs. FP–Growth)

- **Apriori**:
    - suits small/medium problems; needs one DB pass per itemset size
    - the number of candidates can explode at low support.
- **PCY**:
    - targeted optimization for **pair mining**
    - uses hashing to prune candidate pairs drastically.
- **FP–Growth**:
    - often fastest on dense datasets or low $minsup$
    - avoids candidate generation via compression & recursive mining.

**Heuristics**
- If you only need **frequent pairs** and memory is tight → **PCY**.
- If you need **all frequent itemsets** and dataset is **dense** → **FP–Growth**.
- If you want a simple, transparent baseline or **higher minsup** → **Apriori**.