# Overview

In [None]:
from typing import List, FrozenSet

In [None]:
Items = List[str]
Itemset = List[str]
Itemsets = List[Itemset]

```
LHS -> RHS
```

If `LHS` is bought in a transaction, then `RHS` is also likely to be bought in the same transaction.

- `A -> B`: `A` implies `B`
- `LHS` and `RHS` are sets of items, i.e. itemsets
- **Items are typically strings**

# Support and Confidence

In [None]:
Transaction = FrozenSet[str]
Transactions = List[Transaction]

- **Support for a set of items**: fraction of transactions (a value between 0 and 1) that contain all of these items
- **Confidence for a rule**: fraction of the transactions containing `LHS` that also contain `RHS`

In [None]:
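def get_support(itemset: Itemset, transactions: Transactions) -> float:
    """Return the fraction of transactions that contain every item in itemset."""
    if not transactions:
        return 0.0  # avoid dividing by zero when there are no transactions

    itemset_s = frozenset(itemset)
    count = 0

    for transaction in transactions:
        if itemset_s.issubset(transaction):
            count += 1

    return count / len(transactions)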
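
As a quick worked example, the cell below computes support and confidence on a small made-up basket dataset. The transactions and the helper names `toy_support` and `toy_confidence` are illustrative assumptions rather than part of these notes; the support helper mirrors `get_support` only so that the cell runs on its own.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy data: four shopping baskets.
toy_transactions: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def toy_support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted.issubset(t))
    return hits / len(transactions)


def toy_confidence(lhs: Iterable[str], rhs: Iterable[str],
                   transactions: List[FrozenSet[str]]) -> float:
    """Fraction of the transactions containing LHS that also contain RHS."""
    union = set(lhs) | set(rhs)
    return toy_support(union, transactions) / toy_support(lhs, transactions)


# {bread, milk} appears in 2 of the 4 baskets.
print(toy_support(["bread", "milk"], toy_transactions))
# Of the 3 baskets with bread, 2 also contain milk.
print(toy_confidence(["bread"], ["milk"], toy_transactions))
```

Here the rule `bread -> milk` has support 2/4 and confidence 2/3.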


# Finding All Rules With Minimum Support and Minimum Confidence

## Observation 1

It is enough to

- find all large (frequent) itemsets together with their support
- derive all association rules with sufficient confidence from each large itemset

To decide whether to output `LHS -> RHS`, we need to

1. Check that `LHS union RHS` is large (frequent), i.e. `support(LHS union RHS) >= min_sup`
2. Check that `confidence(LHS -> RHS) >= min_confidence`
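
One consequence of this observation is that, once the supports of the large itemsets are known, both checks can be answered by table lookups without rescanning the transactions. The sketch below assumes a hypothetical `support_table` dictionary and a hypothetical helper `should_output`; neither name comes from these notes.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical support table: every large itemset together with its support.
support_table: Dict[FrozenSet[str], float] = {
    frozenset({"bread"}): 0.75,
    frozenset({"milk"}): 0.75,
    frozenset({"bread", "milk"}): 0.5,
}


def should_output(lhs: Iterable[str], rhs: Iterable[str],
                  supports: Dict[FrozenSet[str], float],
                  min_sup: float, min_conf: float) -> bool:
    """Apply the two checks using only precomputed supports."""
    lhs_s = frozenset(lhs)
    union = lhs_s | frozenset(rhs)

    # Check 1: the union must be large. Itemsets missing from the
    # table are treated as not large.
    support_union = supports.get(union)
    if support_union is None or support_union < min_sup:
        return False

    # Check 2: confidence = support(LHS union RHS) / support(LHS).
    # LHS is a subset of a large itemset, so it is expected to be in the table.
    return support_union / supports[lhs_s] >= min_conf


print(should_output({"bread"}, {"milk"}, support_table, min_sup=0.5, min_conf=0.6))
print(should_output({"bread"}, {"milk"}, support_table, min_sup=0.5, min_conf=0.7))
```

The first call passes both checks (confidence is 0.5 / 0.75, about 0.67); the second fails the confidence check.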


- The number of scans over the transactions is k + 1, where k is the number of items in the largest frequent itemset


## Observation 2: A Priori Property

**A priori property**: every subset of a large (frequent) itemset is also large (frequent)
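
The property holds because any transaction that contains an itemset also contains each of its subsets, so a subset's support can never be lower than the itemset's. The sketch below checks this on made-up data; `baskets`, `basket_support` and `subsets_at_least_as_frequent` are hypothetical names introduced only for this illustration.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List

# Hypothetical toy data: four shopping baskets.
baskets: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def basket_support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    return sum(1 for t in transactions if wanted <= t) / len(transactions)


def subsets_at_least_as_frequent(itemset: Iterable[str],
                                 transactions: List[FrozenSet[str]]) -> bool:
    """Check support(subset) >= support(itemset) for every non-empty proper subset."""
    items = sorted(itemset)
    whole = basket_support(items, transactions)
    return all(
        basket_support(subset, transactions) >= whole
        for size in range(1, len(items))
        for subset in combinations(items, size)
    )


# {bread, milk, butter} appears in 1 of the 4 baskets.
print(basket_support({"bread", "milk", "butter"}, baskets))
print(subsets_at_least_as_frequent({"bread", "milk", "butter"}, baskets))
```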

- The Apriori algorithm uses this property to find the large itemsets for a given minimum support

```sql
INSERT INTO C_k+1

SELECT P.item1, P.item2, ... P.itemk, Q.itemk
FROM L_k P, L_k Q
WHERE P.item1 = Q.item1 
  AND P.item2 = Q.item2
  AND ...
  AND P.itemk-1 = Q.itemk-1
  AND P.itemk < Q.itemk
```

- `P` and `Q` agree on every item except the last
- `P.itemk < Q.itemk` keeps the items of each candidate ordered, so the same candidate is not generated twice
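
As a small worked example, the join can be mirrored in Python on a hypothetical `L_2`. The sketch also adds a prune step, which is not part of the SQL above: it applies the a priori property by discarding any candidate that has a subset missing from `L_k`. The names `join_step` and `prune_step` and the sample itemsets are assumptions made for this illustration.

```python
from itertools import combinations
from typing import List, Tuple

SortedItemset = Tuple[str, ...]

# Hypothetical L_2: large 2-itemsets, each stored as a sorted tuple.
L2: List[SortedItemset] = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]


def join_step(l_k: List[SortedItemset]) -> List[SortedItemset]:
    """Pair itemsets that agree on all items except the last (the SQL self-join)."""
    return [
        p + (q[-1],)
        for p in l_k
        for q in l_k
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    ]


def prune_step(candidates: List[SortedItemset],
               l_k: List[SortedItemset]) -> List[SortedItemset]:
    """Drop candidates that have a k-item subset missing from L_k."""
    large = set(l_k)
    return [
        c for c in candidates
        if all(subset in large for subset in combinations(c, len(c) - 1))
    ]


c3 = join_step(L2)
print(c3)                  # candidates after the join
print(prune_step(c3, L2))  # candidates after pruning
```

The join produces `("a", "b", "c")` and `("b", "c", "d")`; pruning removes the second because `("c", "d")` is not in `L_2`.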

## Algorithm

Technically `get_frequent_itemsets` **also returns an empty set**

In [None]:
def subsets_k_minus_1(itemset: Itemset) -> Itemsets:
    subsets = []

    for skip in range(len(itemset)):
        subset = itemset[0:skip] + itemset[skip + 1:]
        subsets.append(subset)

    return subsets


def apriori_gen(itemsets: Itemsets) -> Itemsets:
    """Given L_k-1, produce C_k

    Every itemset is expected to be non-empty with its items in ascending order.

    Args:
        itemsets (Itemsets): L_k-1, the large itemsets of size k - 1

    Returns:
        Itemsets: C_k, the candidate itemsets of size k
    """
    candidates = []  # type: Itemsets
    itemsets_s = frozenset(map(frozenset, itemsets))

    # Join step: pair itemsets that agree on every item except the last
    for p in itemsets:
        for q in itemsets:
            # p[-1] < q[-1] keeps candidates ordered and avoids duplicates
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.append(p + [q[-1]])

    def is_subsets_in_l(candidate: Itemset) -> bool:
        for subset in subsets_k_minus_1(candidate):
            if frozenset(subset) not in itemsets_s:
                return False

        return True

    candidates = list(filter(is_subsets_in_l, candidates))

    return candidates


def get_frequent_itemsets(
        items: Items,
        transactions: Transactions,
        min_sup: float) -> Itemsets:
    """Compute all large (frequent) itemsets with the Apriori algorithm

    Slides
    - Lecture 10

    Args:
        items (Items): all distinct items
        transactions (Transactions): transactions to scan
        min_sup (float): minimum support threshold, between 0 and 1

    Returns:
        Itemsets: every itemset whose support is at least min_sup
    """
    assert 0 <= min_sup <= 1

    def large_itemsets(itemsets: Itemsets) -> Itemsets:
        return list(filter(
            lambda candidate: get_support(candidate, transactions) >= min_sup,
            itemsets))

    # L_1: large itemsets of size 1
    candidates = large_itemsets(list(map(lambda i: [i], items)))
    all_itemsets = list(candidates)

    while True:
        # Build C_k from L_k-1, then keep only the large candidates (L_k)
        candidates = apriori_gen(candidates)
        candidates = large_itemsets(candidates)

        if len(candidates) == 0:
            break

        all_itemsets = all_itemsets + candidates

    return all_itemsets


# Finding Associations With Minimum Confidence

The association rules of an itemset are all the ways of splitting its elements between `LHS` and `RHS`

- Confidence: `confidence(LHS -> RHS) = support(LHS union RHS) / support(LHS)`
- Support: `support(LHS union RHS)`

In [None]:
def get_associations(
        transactions: Transactions,
        frequent_itemsets: Itemsets,
        min_conf: float) -> list:
    """Given frequent itemsets, compute the associations

    Only rules with a single item on the right-hand side are generated.

    Args:
        transactions (Transactions): transactions used to compute support
        frequent_itemsets (Itemsets): frequent itemsets generated by the Apriori algorithm
        min_conf (float): minimum confidence threshold
    Returns:
        list: one dict per rule with the following keys: left, right,
        confidence, support
    """
    assert 0 <= min_conf <= 1

    associations_list = []

    for itemset in frequent_itemsets:
        if len(itemset) < 2:
            continue

        # support(LHS union RHS) is the same for every rule from this itemset
        support_union = get_support(itemset, transactions)

        for rhs in itemset:
            # Keep the original item order on the left-hand side
            lhs = [item for item in itemset if item != rhs]
            supp_lhs = get_support(lhs, transactions)

            if supp_lhs == 0:
                continue  # LHS never occurs, so confidence is undefined

            conf = support_union / supp_lhs

            if conf >= min_conf:
                associations_list.append({
                    "left": lhs,
                    "right": [rhs],
                    "confidence": conf,
                    "support": support_union})

    return associations_list
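
`get_associations` only emits rules with a single item on the right-hand side. If every split of an itemset between `LHS` and `RHS` is wanted, one possible sketch is to enumerate each non-empty proper subset as `LHS` and use the remaining items as `RHS`. The helper name `rule_splits` is hypothetical and not part of these notes.

```python
from itertools import combinations
from typing import Iterator, List, Tuple


def rule_splits(itemset: List[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield every (LHS, RHS) pair where both sides are non-empty and disjoint."""
    items = sorted(itemset)
    for lhs_size in range(1, len(items)):
        for lhs in combinations(items, lhs_size):
            rhs = [item for item in items if item not in lhs]
            yield list(lhs), rhs


for lhs, rhs in rule_splits(["a", "b", "c"]):
    print(lhs, "->", rhs)
```

An itemset with n items yields 2^n - 2 candidate rules (6 for the three items here), so the confidence check has to be applied to each of them.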

# Quantitative Association Rules

When items are numbers rather than strings

- **Problem**: the algorithm above treats every distinct value as a separate item, so an individual numeric value rarely has enough support
- **Solution**: bucketizing, i.e. defining an item as a range of values
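
A minimal sketch of bucketizing, assuming half-open ranges `[lo, hi)` and made-up bucket edges; the helper name `bucketize`, the `age` attribute and the sample values are all hypothetical. Each numeric value is replaced by a string item naming its range, so the string-based algorithm above can be applied unchanged.

```python
from typing import List, Sequence


def bucketize(name: str, value: float, edges: Sequence[float]) -> str:
    """Map a numeric value to an item naming the half-open range [lo, hi) it falls in."""
    for lo, hi in zip(edges, edges[1:]):
        if lo <= value < hi:
            return f"{name}=[{lo},{hi})"
    raise ValueError(f"{value} is outside the bucket edges {list(edges)}")


# Hypothetical ages and bucket edges.
ages: List[int] = [23, 27, 35, 41, 38]
age_edges: List[int] = [20, 30, 40, 50]

age_items = [bucketize("age", age, age_edges) for age in ages]
print(age_items)
```

With these edges the five distinct ages collapse into three items, and `age=[20,30)` and `age=[30,40)` each occur twice, which gives them a better chance of reaching the minimum support.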