# **Association rules**
Given a set of commercial transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction (discovering co-occurrences).

Example: $\{ \text{Bread, Milk} \} \to \{ \text{Coke, Eggs} \}$

In this notation implication means co-occurrence, not causality.
It differs from boolean implication: a rule can hold with some degree of truth (*fuzzy logic*).

With $D$ distinct items it's possible to derive on the order of $3^D$ association rules.

## Support and confidence
*Our example:*

![](https://i.ibb.co/rssRpwZ/toy.png)

### Glossary
**Itemset**: a collection of one or more items, e.g. $\{ \text{Milk, Bread, Diaper} \}$;

**$k$-itemset**: an itemset with $k$ items;

**Support count ($\sigma$)**: frequency of occurrence of an itemset, e.g. $\sigma (\{ \text{Milk, Bread, Diaper} \}) = 2$;

**Support**: fraction of transactions that contain an itemset, e.g. $\text{sup} (\{ \text{Milk, Bread, Diaper} \}) = \frac{2}{5}$;

**Frequent itemset**: an itemset whose support is greater than or equal to a $\text{minsup}$ threshold;

**Association rule**: an expression of the form $A \to C$, where $A$ and $C$ are itemsets (antecedent and consequent).


### Rule evaluation metrics
We need more than one metric, since different rules can score the same on any single one.
* Support: fraction of the $N$ transactions that contain both $A$ and $C$. Adding items to a rule can decrease its support but never increase it.
$$\text{sup} = \frac{\sigma(A \cup C)}{N}$$

* Confidence: measures how often all the items in $C$ appear in transactions that contain $A$.
$$\text{conf} = \frac{\sigma(A \cup C)}{\sigma(A)}$$

Rules with low support can be generated by random associations. Rules with low confidence are not really reliable.

A rule with a relatively low support but high confidence can represent an uncommon but interesting phenomenon.
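The two metrics above can be sketched in a few lines of Python. The five transactions below are an assumption: a hypothetical dataset chosen so that $\sigma(\{ \text{Milk, Bread, Diaper} \}) = 2$ as in the glossary, which may not match the figure exactly. The function names are illustrative as well.

```python
# Hypothetical market-basket data (an assumption, not read from the figure).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions that contain every item."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """sigma(A u C) / sigma(A); assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support_count(both, transactions) / support_count(antecedent, transactions)

# Rule {Milk, Diaper} -> {Beer} on the hypothetical data.
rule_support = support({"Milk", "Diaper", "Beer"}, transactions)
rule_confidence = confidence({"Milk", "Diaper"}, {"Beer"}, transactions)
print(rule_support, rule_confidence)
```

On this data the rule $\{ \text{Milk, Diaper} \} \to \{ \text{Beer} \}$ is contained in 2 of the 5 transactions, and 2 of the 3 transactions containing the antecedent also contain the consequent.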

### Association rule mining
Given a set of $N$ transactions, the goal of association rule mining is to find all rules having:
* $\text{sup} \ge \text{minsup}$;
* $\text{conf} \ge \text{minconf}$.

1. **Brute-force** approach: list all possible association rules, compute support and confidence and prune the ones that fail the thresholding. Computationally prohibitive.

*Note*: since rules originating from the same itemset have identical support but can have different confidence it's possible to decouple the requirements.

2. **Two-step** approach:
   * Step 1 - **Frequent itemset generation**: generate all the itemsets whose support is greater than or equal to $\text{minsup}$. Computationally expensive;
   * Step 2 - **Rule generation**: generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.

## Frequent itemset generation
Given $D$ items there are $M=2^D$ possible candidate itemsets (including the empty set and the set of all items).

![](https://i.ibb.co/26CwZsS/photo-2021-01-06-09-52-31.jpg)

### Brute-force approach
1. Each itemset in the lattice is a candidate frequent itemset;
1. Count the support of each candidate by scanning the database;
1. Match each transaction against every candidate.

The complexity is $\mathcal{O}(NWM)$, where $W$ is the average transaction width.

Given $D$ unique items, the total number of itemsets is $2^D$ and the total number of association rules is:
$$R = \sum_{k = 1}^{D - 1} \Biggl( {{D}\choose{k}} \times \sum_{j = 1}^{D - k} {{D - k}\choose{j}} \Biggr) = 3^D - 2^{D + 1} + 1$$
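As a sanity check, the closed form can be compared with a direct enumeration of antecedent/consequent pairs for small $D$. This is only an illustrative sketch; the function names are made up for the example.

```python
from itertools import combinations

def count_rules_by_enumeration(d):
    """Count rules A -> C with A, C non-empty and disjoint over d items."""
    items = range(d)
    count = 0
    for k in range(1, d):  # size of the antecedent
        for antecedent in combinations(items, k):
            rest = [i for i in items if i not in antecedent]
            for j in range(1, len(rest) + 1):  # size of the consequent
                count += sum(1 for _ in combinations(rest, j))
    return count

def count_rules_closed_form(d):
    return 3**d - 2**(d + 1) + 1

for d in range(1, 8):
    assert count_rules_by_enumeration(d) == count_rules_closed_form(d)

print(count_rules_closed_form(6))  # 729 - 128 + 1 = 602
```

Already with 6 items there are 602 possible rules, which is why the brute-force approach does not scale.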

![](https://i.ibb.co/QcrVKDv/photo-2021-01-06-10-01-16.jpg)

### Pruning strategies
1. Reduce the number of candidates $M$ (apriori principle);
1. Reduce the number of comparisons $NM$, using efficient data structures to store the candidates or the transactions.


### Apriori principle
If an itemset is frequent, then all of its subsets must also be frequent. It holds due to the property of the support measure:
$$\forall X,Y \colon (X \subseteq Y) \to \text{sup}(X) \ge \text{sup}(Y)$$

The support of an itemset never exceeds the support of its subsets (anti-monotone property of support).

![](https://i.ibb.co/7vjJhkW/photo-2021-01-06-10-10-07.jpg)

#### Candidate generation
Let:
* $C_k$ be the set of candidate itemsets of size $k$;
* $L_k$ be the set of frequent itemsets of size $k$ (survivors);
* $\text{subset}_k(c)$ be the set of subsets of $c$ with $k$ elements.

**Join step**:
1. Let $L_k$ be represented as a table with $k$ columns where each row is a frequent itemset;
1. Let the items in each row of $L_k$ be in lexicographic order;
1. $C_{k+1}$ is generated by a self join of $L_k$: two rows that agree on their first $k-1$ items are merged, appending the last item of the second row to the first.

**Prune step**: each $(k+1)$-itemset which includes a $k$-itemset which is not in $L_k$ is deleted from $C_{k+1}$.
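A minimal sketch of the join and prune steps, assuming itemsets are stored as lexicographically sorted tuples. The name `apriori_gen` and the letters used as items are illustrative.

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Build C_{k+1} from L_k, given as an iterable of sorted tuples."""
    rows = sorted(frequent_k)
    frequent_set = set(rows)
    candidates = []
    for i, a in enumerate(rows):
        for b in rows[i + 1:]:
            # Join step: merge two rows that share the first k-1 items.
            if a[:-1] != b[:-1]:
                break  # rows are sorted, so no later row shares the prefix
            candidate = a + (b[-1],)
            # Prune step: every k-subset of the candidate must be in L_k.
            if all(s in frequent_set for s in combinations(candidate, len(a))):
                candidates.append(candidate)
    return candidates

L2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
print(apriori_gen(L2))
```

Here the join produces `("A", "B", "C")` and `("B", "C", "D")`, and the prune step drops the second one because `("C", "D")` is not in $L_2$.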

#### Frequent itemset generation
![](https://i.ibb.co/thHtgV6/Cattura.png)

#### Example
![](https://i.ibb.co/p2jzW5N/example.png)
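The level-wise loop can be sketched as follows. This is a simplified illustration, not a transcription of the pseudocode in the figure: candidates are built from plain unions of pairs of frequent $k$-itemsets instead of the prefix-based self join, and the transactions are a hypothetical dataset.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset as sorted tuple: support count} for every frequent itemset."""
    # Level 1: count single items in one pass over the data.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= minsup_count}
    frequent = dict(level)

    while level:
        k = len(next(iter(level)))
        # Candidates: unions of two frequent k-itemsets that have k+1 items,
        # kept only if every k-subset is frequent (apriori pruning).
        candidates = set()
        for a, b in combinations(level, 2):
            union = tuple(sorted(set(a) | set(b)))
            if len(union) == k + 1 and all(s in level for s in combinations(union, k)):
                candidates.add(union)
        # One more pass over the data to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent.update(level)
    return frequent

# Hypothetical data (an assumption, not read from the figures).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent = apriori(transactions, minsup_count=3)
for itemset, count in sorted(frequent.items()):
    print(itemset, count)
```

With a minimum support count of 3, this data yields four frequent items and four frequent pairs; the only 3-item candidate that survives pruning occurs twice and is discarded.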

#### Complexity
The computation is level-wise: the level is the cardinality of the itemsets under evaluation.
The evaluation at level $k$ uses the apriori knowledge acquired from the previous levels to reduce the search space.

* **Choice of the minimum support threshold**: lowering the threshold results in a greater number of frequent itemsets. This may reduce pruning and increase the maximum length of frequent itemsets. The number of complete reads of the dataset is given by the maximum length of frequent itemsets plus one.

* **Dimensionality of the dataset**: more space is needed to store the support count of each item. If the number of frequent items also increases, both computation and I/O costs can increase.

* **Size of database**: since the apriori algorithm makes multiple passes, its run time may increase with the number of transactions.

* **Average transaction width**: transaction width increases with denser datasets. This may increase the maximum length of frequent itemsets and the number of traversals of the data structures.

### Compact representation of frequent itemsets
Even after filtering on support, the number of frequent itemsets can be very large.
It can be useful to identify a small representative set of frequent itemsets.

* **Maximal frequent itemsets** - the smallest set of itemsets from which the frequent itemsets can be derived: their supersets are not frequent;

* **Closed itemsets** - minimal representation of itemsets without losing support information: some itemsets are redundant because they have the same support as their supersets.

#### Maximal frequent itemset
A **maximal frequent itemset** (MFI) doesn't have any frequent immediate supersets. MFIs lie near the border that separates frequent from infrequent itemsets in the lattice.

![](https://i.ibb.co/TLwKR8r/egvqaee.png)

There exist efficient algorithms to explicitly find them without the enumeration of their subsets.

#### Closed itemsets
An itemset is closed if none of its immediate supersets has the same support.

#### Closed frequent itemsets
$X \to Y$ is redundant if there exists $X' \to Y'$ such that the support and the confidence are the same, $X \subseteq X'$ and $Y \subseteq Y'$ (with at least one of the two inclusions strict).

There exist efficient algorithms for computing closed frequent itemsets.
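The definitions of maximal and closed frequent itemsets can be checked directly once all frequent itemsets and their support counts are known. The sketch below is a naive filter, not one of the efficient algorithms mentioned above, and the support counts are hypothetical.

```python
def maximal_and_closed(frequent):
    """frequent: dict mapping frozenset -> support count, for ALL frequent itemsets."""
    maximal, closed = [], []
    for itemset, count in frequent.items():
        # Frequent immediate supersets (exactly one more item).
        supersets = [s for s in frequent
                     if len(s) == len(itemset) + 1 and itemset < s]
        if not supersets:
            maximal.append(itemset)
        # Infrequent supersets have support below minsup, hence below `count`,
        # so only frequent supersets can have the same support.
        if all(frequent[s] < count for s in supersets):
            closed.append(itemset)
    return maximal, closed

# Hypothetical frequent itemsets with their support counts.
frequent = {
    frozenset({"Beer"}): 3,
    frozenset({"Bread"}): 4,
    frozenset({"Diaper"}): 4,
    frozenset({"Milk"}): 4,
    frozenset({"Beer", "Diaper"}): 3,
    frozenset({"Bread", "Diaper"}): 3,
    frozenset({"Bread", "Milk"}): 3,
    frozenset({"Diaper", "Milk"}): 3,
}
maximal, closed = maximal_and_closed(frequent)
print(len(maximal), len(closed))
```

In this example `{Beer}` is not closed because `{Beer, Diaper}` has the same support count, and every maximal itemset is also closed.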

## [FP-growth](https://www.softwaretestinghelp.com/fp-growth-algorithm-data-mining/)
The problem with the apriori algorithm is that it needs to generate the candidate itemsets (whose number can be very high) and needs multiple scans of the database to check the support of the candidates.

The **FP-growth** algorithm transforms the problem of *finding long frequent patterns* into *looking for shorter ones and then concatenating the suffix*.

It uses a compressed representation of the database, the so-called FP-tree. Once the tree is constructed, it uses a **recursive divide-and-conquer** approach to mine frequent itemsets.

**[Steps](https://www.geeksforgeeks.org/ml-frequent-pattern-growth-algorithm/)**
1. Scan the database to find the support of the 1-itemsets;
1. Create the root of the FP-tree;
1. Scan the database again and, for each transaction:
   * Reorder the items by descending support;
   * Start from the root node of the FP-tree;
   * For each item in the transaction:
     * If it matches one of the children of the current node, move to that node and increment its count;
     * Else create a new child node with count $1$ and move to it.
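The construction steps can be sketched as follows. Two details are assumptions not stated in the list: infrequent items are dropped before insertion (as is usual in FP-growth), and ties in support are broken alphabetically. The `FPNode` class, the function name and the transactions are illustrative, and header links between nodes of the same item are omitted.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode

def build_fp_tree(transactions, minsup_count):
    # Step 1: first scan, support count of every single item.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in support.items() if c >= minsup_count}
    # Step 2: create the root (it carries no item).
    root = FPNode(None)
    # Step 3: second scan, insert each reordered transaction as a path.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:  # no matching child: create one
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1   # a new node ends up with count 1
            node = child
    return root, frequent

# Hypothetical data (an assumption).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
root, frequent = build_fp_tree(transactions, minsup_count=3)
print(sorted(root.children), root.children["Bread"].count)
```

Transactions that share a prefix after reordering share a path in the tree, which is where the compression comes from: here four of the five transactions start with `Bread` and are counted on the same node.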

### Conditional pattern base
A *subdatabase* which consists of the set of frequent items co-occurring with the suffix pattern.
Given a transaction database DB and a support threshold, the complete set of frequent item projections of the transactions in the database can be derived from the respective FP-tree.

### Efficiency

![](https://i.ibb.co/7ns2XVH/eff1.png)

## Rule generation
The confidence of a rule can be computed from supports since it's just a ratio, so for confidence-based pruning it's sufficient to know the supports of the frequent itemsets: $$\text{conf}(A \to C) = \frac{\text{sup}(A \to C)}{\text{sup}(A)}$$

Given a frequent itemset $L$, find all the non-empty proper subsets $f \subset L$ such that the confidence of the rule $f \to (L - f)$ is not less than the minimum confidence set by the experiment designer.
If $|L| = k$ then there are $2^k-2$ candidate rules.

*Note*: $L \to \emptyset$ and $\emptyset \to L$ can be ignored.
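A minimal sketch of this step, assuming the support counts of the itemset and of all its subsets are already available in a dictionary. It enumerates all $2^k-2$ binary partitions exhaustively, without any pruning; the function name and the counts are hypothetical.

```python
from itertools import combinations

def generate_rules(itemset, support, minconf):
    """Return (antecedent, consequent, confidence) for every rule f -> (L - f)
    whose confidence is at least minconf."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):  # skips the empty set and L itself
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            conf = support[itemset] / support[antecedent]
            if conf >= minconf:
                rules.append((antecedent, itemset - antecedent, conf))
    return rules

# Hypothetical support counts for L = {Bread, Diaper, Milk} and its subsets.
support = {
    frozenset({"Bread"}): 4,
    frozenset({"Diaper"}): 4,
    frozenset({"Milk"}): 4,
    frozenset({"Bread", "Diaper"}): 3,
    frozenset({"Bread", "Milk"}): 3,
    frozenset({"Diaper", "Milk"}): 3,
    frozenset({"Bread", "Diaper", "Milk"}): 2,
}
L = {"Bread", "Diaper", "Milk"}
all_rules = generate_rules(L, support, minconf=0.0)
good_rules = generate_rules(L, support, minconf=0.6)
print(len(all_rules), len(good_rules))
```

With $k = 3$ there are $2^3 - 2 = 6$ candidate rules; at a minimum confidence of 0.6 only the three rules with a two-item antecedent survive (confidence $2/3$ versus $1/2$).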

In general confidence, unlike support, is not anti-monotone, e.g. $\text{conf}(ABC \to D)$ can be larger or smaller than $\text{conf}(AB \to D)$.

Nevertheless, the confidence of rules generated from the same itemset is anti-monotone with respect to the number of items on the RHS of the rule (it cannot increase when we move an item from the LHS to the RHS of the rule), e.g. $\text{conf}(ABC \to D) \ge \text{conf}(AB \to CD) \ge \text{conf}(A \to BCD)$.
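The chain of inequalities follows from the anti-monotonicity of support: all three rules share the numerator $\text{sup}(ABCD)$, while the denominators satisfy $\text{sup}(ABC) \le \text{sup}(AB) \le \text{sup}(A)$. Written out:

$$\text{conf}(ABC \to D) = \frac{\text{sup}(ABCD)}{\text{sup}(ABC)} \ge \frac{\text{sup}(ABCD)}{\text{sup}(AB)} = \text{conf}(AB \to CD) \ge \frac{\text{sup}(ABCD)}{\text{sup}(A)} = \text{conf}(A \to BCD)$$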

![](https://i.ibb.co/QKVBxsJ/pruning.png)

In the apriori algorithm a candidate rule is generated by merging two rules that share the same prefix in the rule consequent, e.g. joining $CD \to AB$ and $BD \to AC$ would produce the candidate rule $D \to ABC$.

The rule $D \to ABC$ would be pruned if its subset $AD \to BC$ doesn't have high confidence.

Many real datasets have skewed support distribution:
* If $\text{minsup}$ is set too high we could miss itemsets involving rare items (e.g. expensive products);
* If $\text{minsup}$ is set too low the number of itemsets could be too large and the computation too expensive.

Using a single minimum support threshold may not be effective.


### Multiple minimum support
Let $\text{MS}(i)$ be the minimum support for item $i$.
Now the property of being frequent is no longer anti-monotone (**bad**): a subset of a frequent itemset can be infrequent.

The algorithm is the same as apriori, with some changes.

1. Order the items according to their minimum support, in ascending order;

Let:
* $L_1$ be the set of frequent items;
* $F_1$ be the set of items whose support is $\ge \text{MS}(1)=\min_i \bigl(\text{MS}(i)\bigr)$;
* $C_2$ be the set of candidate itemsets of size $2$, generated from $F_1$ instead of $L_1$.

2. In traditional apriori a candidate $(k+1)$-itemset is generated by merging two frequent itemsets of size $k$, and the candidate is pruned if it contains any infrequent subset of size $k$.
In the modified apriori the candidate is pruned only if the infrequent subset contains the first item.

*Example*: the candidate is $\{ \text{Broccoli}, \text{Coke}, \text{Milk} \}$ (ordered according to their minimum support, in ascending order).
Suppose $\{ \text{Broccoli}, \text{Coke} \}$ and $\{ \text{Broccoli}, \text{Milk} \}$ are frequent, while $\{ \text{Coke}, \text{Milk} \}$ is infrequent. The candidate is still not pruned, since $\{ \text{Coke}, \text{Milk} \}$ doesn't contain the first element $\text{Broccoli}$.
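The difference between the two pruning rules can be sketched as a single predicate. This only illustrates the rule as stated above, not a full multiple-minimum-support algorithm; the function name, its `multiple_minsup` flag and the set of frequent pairs are hypothetical.

```python
from itertools import combinations

def keep_candidate(candidate, frequent_k, multiple_minsup=True):
    """candidate: tuple of items ordered by ascending MS(i).
    frequent_k: set of frequent itemsets one item shorter, same ordering."""
    first = candidate[0]
    for subset in combinations(candidate, len(candidate) - 1):
        if subset in frequent_k:
            continue
        # Traditional apriori: any infrequent subset prunes the candidate.
        # Modified rule: only an infrequent subset containing the first item does.
        if not multiple_minsup or first in subset:
            return False
    return True

candidate = ("Broccoli", "Coke", "Milk")
frequent_2 = {("Broccoli", "Coke"), ("Broccoli", "Milk")}  # {Coke, Milk} is infrequent

print(keep_candidate(candidate, frequent_2, multiple_minsup=False))  # traditional rule
print(keep_candidate(candidate, frequent_2, multiple_minsup=True))   # modified rule
```

The traditional rule discards the candidate because of $\{ \text{Coke}, \text{Milk} \}$, while the modified rule keeps it.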


## Pattern evaluation
Association rule algorithms tend to produce too many rules: many of them are uninteresting or redundant.

**Redundancy**: $\{ A,B,C \} \to \{ D \}$ is redundant with respect to $\{ A,B \} \to \{ D \}$ if the two rules have the same support and confidence.

Interestingness measures can be used to prune or rank the derived patterns.

Given a rule $A \to C$ the information needed to compute *rule interestingness* can be obtained from a contingency table. 
The elements of the contingency table are the basis for most interestingness measures.

![](https://i.ibb.co/XzBgmTx/photo-2021-01-06-15-57-56.jpg)

Confidence can be misleading. *Example*:
![](https://i.ibb.co/kym0X8M/photo-2021-01-06-16-00-27.jpg)

We have:
* $\text{conf}(\text{Tea} \to \text{Coffee}) = \frac{\text{sup}(\text{Tea, Coffee})}{\text{sup}(\text{Tea})} = \frac{15}{20} = 0.75$;
* $P(\text{Coffee}) = \frac{90}{100} = 0.9$;
* $P(\text{Coffee}|\overline{\text{Tea}}) = \frac{75}{80} = 0.9375$.

So, despite the high confidence of $\text{Tea} \to \text{Coffee}$, the absence of Tea increases the probability of Coffee.

### Statistical independence (no correlation)
* $P(S \land B)=P(S)*P(B) \to$ Statistical independence;
* $P(S \land B)>P(S)*P(B) \to$ Positively correlated;
* $P(S \land B)<P(S)*P(B) \to$ Negatively correlated.

### Statistical-based measures
These measures take into account the deviation from statistical independence.

**Lift** $$\text{lift}(A \to C) = \frac{\text{conf}(A \to C)}{\text{sup}(C)}=\frac{P(A,C)}{P(A)*P(C)}$$
* Evaluates $1$ for independence;
* Insensitive to rule direction;
* Ratio of true cases with respect to independence.

**Leverage** $$\text{leve}(A \to C)=P(A,C) - P(A)*P(C) = \text{sup}(A \cup C) - \text{sup}(A)*\text{sup}(C)$$
* Evaluates $0$ for independence;
* Insensitive to rule direction;
* Proportion of additional cases with respect to independence.

**Conviction** or novelty $$\text{conv}(A \to C)=\frac{1-\text{sup}(C)}{1-\text{conf}(A \to C)} = \frac{P(A)(1-P(C))}{P(A)-P(A,C)}$$
* It's infinite if the rule is always true;
* Sensitive to rule direction;
* It's the ratio of the expected frequency that $A$ occurs without $C$ if $A$ and $C$ were independent divided by the observed frequency of incorrect predictions.
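The three measures can be computed from the counts of a contingency table. The sketch below applies them to the Tea/Coffee example, using the counts implied by the fractions given earlier (15 transactions with both, 20 with Tea, 90 with Coffee, 100 in total); the function and parameter names are illustrative.

```python
def rule_measures(n_ac, n_a, n_c, n):
    """Measures for A -> C from counts: both, antecedent, consequent, total."""
    p_ac, p_a, p_c = n_ac / n, n_a / n, n_c / n
    conf = n_ac / n_a
    lift = conf / p_c
    leverage = p_ac - p_a * p_c
    # Conviction is infinite when the rule is never violated.
    conviction = float("inf") if conf == 1 else (1 - p_c) / (1 - conf)
    return {"conf": conf, "lift": lift, "leverage": leverage, "conviction": conviction}

m = rule_measures(n_ac=15, n_a=20, n_c=90, n=100)
for name, value in m.items():
    print(name, round(value, 3))
```

For $\text{Tea} \to \text{Coffee}$ the lift is about 0.83 (below 1), the leverage about $-0.03$ (below 0) and the conviction about 0.4 (below 1): all three signal a negative correlation that the confidence of 0.75 hides.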

### Intuition about measures
* Higher support $\to$ The rule applies to more records;
* Higher confidence $\to$ The chance that the rule is true for some record is higher;
* Higher lift $\to$ The chance that the rule is just a coincidence is lower;
* Higher conviction $\to$ The rule is violated less often than it would be if the antecedent and the consequent were independent.


### Properties of a good measure (Piatetsky-Shapiro)
1. $M(A,B)=0$ (or $1$) if $A$ and $B$ are statistically independent;
1. $M(A,B)$ increases monotonically with $P(A,B)$ with fixed $P(A)$ and $P(B)$;
1. $M(A,B)$ decreases monotonically with $P(A)$ (or $P(B)$) with fixed $P(A,B)$ and $P(B)$ (or $P(A)$).
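As a worked check, leverage satisfies all three properties. Taking $M(A,B) = P(A,B) - P(A)P(B)$:

$$\begin{aligned}
&\text{1. } P(A,B) = P(A)P(B) \implies M(A,B) = 0 \\
&\text{2. } \frac{\partial M}{\partial P(A,B)} = 1 > 0 \\
&\text{3. } \frac{\partial M}{\partial P(A)} = -P(B) \le 0, \qquad \frac{\partial M}{\partial P(B)} = -P(A) \le 0
\end{aligned}$$

The decrease in the third property is strict whenever the other marginal probability is non-zero.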

## Multidimensional association rules
**Mono-dimensional** or intra-attribute: items $A$, $B$ and $C$ are together in a transaction.

**Multi-dimensional** or inter-attribute: attribute $A$ has value $a$, attribute $B$ has value $b$ and attribute $C$ has value $c$ in a tuple.

Most software packages for association rules discovery do not deal with quantitative attributes.

**Discretization**: quantitative attributes are first mapped to intervals, possibly with *equal-frequency* binning or with mono-dimensional clustering, for optimal covering of the original value domains.
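A minimal sketch of equal-frequency discretization, which turns a quantitative attribute into items that an association rule miner can handle. The function name, the `age` attribute and its values are hypothetical; ties between equal values are not treated specially, so equal values may fall into different bins.

```python
def equal_frequency_bins(values, n_bins):
    """Return a bin index per value so that bins hold (almost) the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

ages = [23, 45, 31, 62, 38, 27, 51, 19, 44]  # hypothetical attribute values
bins = equal_frequency_bins(ages, n_bins=3)
items = [f"age_bin={b}" for b in bins]  # one item per tuple, usable in a transaction
print(bins)
print(items[:3])
```

Each of the three bins receives three of the nine values, regardless of how the values are spread over the domain.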

Association rules can involve items at different levels.


## Multilevel association rules
A real database can include tens of thousands of distinct items.
Frequently it's necessary to find a tradeoff between general and detailed reasoning (choose the right level of abstraction).

A common form of background knowledge is the organization of the items into a **hierarchy of concepts**: it can easily be encoded in the transactions and it helps in choosing the right level of abstraction.

![](https://i.ibb.co/zZMkxKb/hierachy.png)

* From specialized to general the support of rules increases (in general) and new rules can become interesting;
* From general to specialized the support of rules decreases (in general) and can go under the threshold.

A level change can influence the confidence in any direction. If a specialized rule has approximately the same confidence as the general one, then it's redundant.

**Mining**: look for frequent itemsets at each level of abstraction, top down. Each level requires a new run of the rule discovery algorithm, decreasing the support threshold in lower levels.
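The effect of moving up the hierarchy can be sketched by rewriting each transaction at a more general level and recomputing support. The hierarchy, the transactions and the function names below are hypothetical and are not taken from the figure.

```python
# Hypothetical concept hierarchy: item -> parent concept.
parent = {
    "skim milk": "milk", "whole milk": "milk",
    "white bread": "bread", "wheat bread": "bread",
    "milk": "food", "bread": "food",
}

def generalize(transaction, levels_up):
    """Replace each item by its ancestor `levels_up` steps above (stopping at the top)."""
    result = set()
    for item in transaction:
        for _ in range(levels_up):
            item = parent.get(item, item)
        result.add(item)
    return result

def support(itemset, transactions):
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

transactions = [
    {"skim milk", "white bread"},
    {"whole milk", "wheat bread"},
    {"skim milk", "wheat bread"},
    {"whole milk"},
]
general_transactions = [generalize(t, 1) for t in transactions]

specific = support({"skim milk", "wheat bread"}, transactions)
general = support({"milk", "bread"}, general_transactions)
print(specific, general)
```

The specialized itemset has support 0.25 while its generalization has support 0.75, which is why a lower support threshold is used at the lower levels.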