---
title: Frequent-Pattern Analysis
math: 
    '\abs': '\left\lvert #1 \right\rvert' 
    '\norm': '\left\lVert #1 \right\rVert' 
    '\Set': '\left\{ #1 \right\}'
    '\mc': '\mathcal{#1}'
    '\M': '\boldsymbol{#1}'
    '\R': '\mathsf{#1}'
    '\RM': '\M{\mathsf{#1}}'
    '\op': '\operatorname{#1}'
    '\E': '\op{E}'
    '\d': '\mathrm{\mathstrut d}'
---

In [None]:
import os
import logging
import numpy as np
import weka.core.jvm as jvm
from weka.associations import Associator
from weka.core.converters import Loader

jvm.start(logging_level=logging.ERROR)
if not os.getenv(
    "NBGRADER_EXECUTION"
):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o

## Association Rule Mining using Weka

We will conduct a market-basket analysis on the supermarket dataset in Weka.

### Transaction data

Each instance of the dataset is a transaction, i.e., a customer's purchase of items in a supermarket. The dataset can be represented as follows:

::::{prf:definition} 
:label: def:market-basket

For market-basket analysis, the dataset is

$$
\begin{align}
D &:= \Set{T_i}_{i=1}^{n} \quad \text{where}\\
T_i&\subseteq \mc{I},
\end{align}
$$

and $\mc{I}$ is the collection of all items. A transaction $T_i$ is simply a subset of items.

::::
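
As a minimal sketch of this representation, the following models a tiny dataset in plain Python. The item names and transactions are made up for illustration and are not taken from `supermarket.arff`. As an assumption of this sketch, $D$ is stored as a list so that identical baskets from different customers remain separate transactions.

```python
# Hypothetical toy data: item names are illustrative only.
items = {"bread", "milk", "biscuits", "fruit"}  # the item collection I

# Each transaction T_i is a subset of the items.
D = [
    {"bread", "milk"},
    {"bread", "biscuits", "fruit"},
    {"milk"},
    {"bread", "milk", "biscuits"},
]

# Check that every transaction is a subset of the item collection.
all_valid = all(T <= items for T in D)
n = len(D)  # number of transactions
print(all_valid, n)
```

Here `T <= items` is Python's subset test on sets, which mirrors the condition $T_i\subseteq \mc{I}$.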

Using the Explorer interface, load the `supermarket.arff` dataset in Weka.

![](images/supermarket_attribute.png)

Note that most attributes have only one possible value, namely `t`. Click the `Edit...` button to open the data editor. Observe that most attributes have missing values:

![](images/supermarket_data.png)

In `supermarket.arff`:
- Each attribute specified by `@attribute` can be a product category, a department, or a product with one possible value `t`:
```
...
@attribute 'grocery misc' { t}
@attribute 'department11' { t}
@attribute 'baby needs' { t}
@attribute 'bread and cake' { t}
...
```
- The last attribute `'total'` has two possible values `{low, high}`: 
```
@attribute 'total' { low, high} % low < 100
```

To understand the dataset further:
1. Select the `Associate` tab. By default, `Apriori` is chosen as the `Associator`.
1. Open the `GenericObjectEditor` and check for a parameter called `treatZeroAsMissing`. Hover the mouse pointer over the parameter to see more details. 
1. Run the Apriori algorithm with different choices of the parameter `treatZeroAsMissing`. Observe the difference in the generated rules.

::::{exercise}
:label: ex:1
Explain what `t` and `?` mean in the dataset when we set `treatZeroAsMissing` to `True` and `False`, respectively.

:::{hint}
:class: dropdown
See the [documentation](https://weka.sourceforge.io/doc.dev/weka/associations/Apriori.html) of the `Apriori` `Associator`.
:::
::::

YOUR ANSWER HERE

In [None]:
%%ai chatgpt -f text
What is the benefit of `treatZeroAsMissing` in Weka's Apriori Associator?

### Association rule

An association rule for market-basket analysis is defined as follows:

::::{prf:definition} Association rule
:label: def:AR

Given two itemsets (sets of items) $A$ and $B$, the association rule

$$
\begin{align}
A \implies B
\end{align}
$$ (association-rule)

means that a transaction contains all items in $B$ if it contains all items in $A$, i.e.,

$$
\begin{align}
\underbrace{A\subseteq T}_{\text{premise}} \implies \underbrace{B\subseteq T}_{\text{consequence}}
\end{align}
$$

for transaction $T\in D$.

::::
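
To make the implication concrete, the sketch below checks a rule against each transaction of a small made-up dataset. The helper `rule_holds` and the item names are hypothetical and only serve as illustration. The rule holds vacuously for any transaction that does not contain the premise.

```python
def rule_holds(A, B, T):
    """Return True if transaction T satisfies A => B.

    The implication is vacuously true when A is not contained in T.
    """
    return not (A <= T) or (B <= T)


# Hypothetical toy transactions (not from supermarket.arff).
D = [
    {"bread", "milk"},
    {"bread", "biscuits", "fruit"},
    {"milk"},
    {"bread", "milk", "biscuits"},
]

A, B = {"bread"}, {"milk"}
satisfied = [rule_holds(A, B, T) for T in D]
print(satisfied)  # [True, False, True, True]
```

The second transaction contains bread but no milk, so the rule fails there. In real data a rule rarely holds for every transaction, so rules are scored by how often they hold rather than checked exactly.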

We will use [`python-weka-wrapper3`](https://github.com/fracpete/python-weka-wrapper3-examples/blob/master/src/wekaexamples/associations/apriori_output.py) for illustration. To load the dataset:

In [None]:
loader = Loader(classname="weka.core.converters.ArffLoader")
weka_data_path = (
    "https://raw.githubusercontent.com/Waikato/weka-3.8/master/wekadocs/data/"
)
dataset = loader.load_url(
    weka_data_path + "supermarket.arff"
)  # use load_file to load from file instead

To apply the Apriori algorithm with the default settings:

```python
from weka.associations import Associator
```

In [None]:
apriori = Associator(classname="weka.associations.Apriori")
apriori.build_associations(dataset)
apriori

::::{exercise}
:label: ex:2
Explain what the first rule means according to the notation $A\implies B$.

:::{hint}
:class: dropdown
You may regard `biscuits=t` and `total=high` as items. In particular, since `total` has two possible values, it is associated with two items, the other being `total=low`. 
:::
::::

YOUR ANSWER HERE

To retrieve the rules as a list and print the first rule:

In [None]:
rules = list(apriori.association_rules())
rules[0]

To obtain the sets $A$ (the premise) and $B$ (the consequence):

In [None]:
rules[0].premise, rules[0].consequence

In [None]:
premise_support = rules[0].premise_support
total_support = rules[0].total_support

The Apriori algorithm returns rules with large enough support:

::::{prf:definition} support
:label: def:support

The support of an association rule $A \implies B$ is the fraction of transactions containing all items in both $A$ and $B$, i.e.,

$$
\begin{align}
\op{support}(A \implies B) &= \op{support}(A \cup B) :=
\frac{\op{count}(A \cup B)}{\abs{D}}\quad \text{where}\\
\op{count}(A \cup B) &:= \abs{\Set{T\in D \mid T\supseteq A\cup B}}.
\end{align}
$$

::::
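
As a rough illustration of the definition, the following computes the support count and the fractional support on a small made-up dataset. The function names `support_count` and `itemset_support` and the toy transactions are hypothetical and do not come from Weka.

```python
def support_count(itemset, D):
    """Number of transactions in D that contain every item of itemset."""
    return sum(1 for T in D if itemset <= T)


def itemset_support(itemset, D):
    """Fraction of transactions in D that contain every item of itemset."""
    return support_count(itemset, D) / len(D)


# Hypothetical toy transactions (not from supermarket.arff).
D = [
    {"bread", "milk"},
    {"bread", "biscuits", "fruit"},
    {"milk"},
    {"bread", "milk", "biscuits"},
]

A, B = {"bread"}, {"milk"}
rule_count = support_count(A | B, D)       # transactions containing A and B
rule_support = itemset_support(A | B, D)   # the same count divided by |D|
print(rule_count, rule_support)  # 2 0.5
```

Only the first and last transactions contain both bread and milk, so the count is 2 out of 4 transactions.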

For the first rule, the number 723 at the end of the rule corresponds to the total support count $\op{count}(A\cup B)$.

::::{exercise}
:label: ex:3
Assign to `support` the (fractional) support for the first rule (`rules[0]`). 

:::{hint}
:class: dropdown
In `python-weka-wrapper3`, you can use the properties `total_support` and `total_transactions` of `rules[0]`.
:::
::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError
support

In [None]:
# hidden tests

`<conf:(0.92)> lift:(1.27) lev:(0.03) conv:(3.35)` printed after the first rule indicates that 

- confidence is used for ranking the rules and 
- the rule has a confidence of 0.92.

By default, the rules are ranked by confidence, which is defined as follows:

::::{prf:definition} confidence
:label: def:confidence

Confidence of a rule is defined as

$$
\begin{align}
\op{confidence}(A\implies B) &:= \frac{\op{support}(A \cup B)}{\op{support}(A)},
\end{align}
$$ (confidence)

where the denominator $\op{support}(A)$ is the support of the premise. Confidence is the fraction of the transactions containing $A$ that also contain $B$.

::::
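
The definition can be sketched on a small made-up dataset as follows. The helper names and the toy transactions are hypothetical. Because both supports share the denominator $\abs{D}$, the confidence reduces to a ratio of counts.

```python
def support_count(itemset, D):
    """Number of transactions in D that contain every item of itemset."""
    return sum(1 for T in D if itemset <= T)


def confidence(A, B, D):
    """Confidence of A => B: count(A ∪ B) / count(A)."""
    return support_count(A | B, D) / support_count(A, D)


# Hypothetical toy transactions (not from supermarket.arff).
D = [
    {"bread", "milk"},
    {"bread", "biscuits", "fruit"},
    {"milk"},
    {"bread", "milk", "biscuits"},
]

conf_bread_milk = confidence({"bread"}, {"milk"}, D)
print(f"{conf_bread_milk:.3g}")  # 0.667
```

Three transactions contain bread and two of those also contain milk, giving a confidence of 2/3.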

In `python-weka-wrapper3`, we can print different metrics as follows:

In [None]:
for n, v in zip(rules[0].metric_names, rules[0].metric_values):
    print(f"{n}: {v:.3g}")

::::{exercise}
:label: ex:4
Assign to `premise_support` the support count $\op{count}(A)$ of the premise for the first rule.
::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError
premise_support

In [None]:
# hidden tests

Lift is another rule quality measure defined as follows:

::::{prf:definition} lift
:label: def:lift

The lift of a rule is

$$
\begin{align}
\op{lift}(A\implies B) &:= \frac{\op{confidence}(A\implies B)}{\op{support}(B)} = \frac{\op{support}(A \cup B)}{\op{support}(A)\op{support}(B)}\\
&= \frac{\op{confidence}(A\implies B)}{\op{confidence}(\emptyset \implies B)}.
\end{align}
$$ (lift)

where the last equality is obtained by rewriting $\op{support}(B)$ in the denominator of the first equality as 

$$
\begin{align}
\op{confidence}(\emptyset \implies B) &= \frac{\op{support}(B)}{\op{support}(\emptyset)} = \op{support}(B).
\end{align}
$$

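In other words, lift is the factor by which the confidence in $B$ changes when the premise $A$ is imposed. A lift above 1 means that transactions containing $A$ are more likely than average to contain $B$.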

::::
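
The three expressions for lift can be checked against each other on a small made-up dataset. This is only a sketch: the helper names and transactions are hypothetical, and exact fractions are used so that the equalities can be compared without rounding error.

```python
from fractions import Fraction


def support(itemset, D):
    """Exact fraction of transactions in D containing every item of itemset."""
    return Fraction(sum(1 for T in D if itemset <= T), len(D))


def confidence(A, B, D):
    """Confidence of A => B as support(A ∪ B) / support(A)."""
    return support(A | B, D) / support(A, D)


# Hypothetical toy transactions (not from supermarket.arff).
D = [
    {"bread", "milk"},
    {"bread", "biscuits", "fruit"},
    {"milk"},
    {"bread", "milk", "biscuits"},
]

A, B = {"bread"}, {"milk"}
lift_1 = confidence(A, B, D) / support(B, D)
lift_2 = support(A | B, D) / (support(A, D) * support(B, D))
lift_3 = confidence(A, B, D) / confidence(set(), B, D)  # empty premise
print(lift_1, lift_2, lift_3)  # 8/9 8/9 8/9
```

The empty set is contained in every transaction, so its support is 1 and the third form agrees with the first. In this toy data the lift is below 1: buying bread makes milk slightly less likely than it is overall.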

::::{exercise}
:label: ex:5
In Weka, we can change the parameter `metricType` to rank the rules according to `Lift` instead of `Confidence`: 
- Rerun the algorithm with `metricType = Lift`.
- Assign to the variable `lift` the maximum lift achieved.

For `python-weka-wrapper3`, you can specify the option as follows:

```Python
apriori_lift = Associator(classname="weka.associations.Apriori", options=['-T', '1'])
...
```
where the value `1` corresponds to `Lift`.

::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError
lift

In [None]:
# hidden tests

::::{exercise}
:label: ex:6
Explain the relationship between the first and second rules above generated by ranking the rules by lift instead of confidence.

::::

YOUR ANSWER HERE

::::{exercise}
:label: ex:7
Explain why the maximum lift obtained by ranking the rules using `Lift` is smaller than 1.27, which is the lift obtained before by ranking rules using `Confidence`.

:::{hint}
:class: dropdown
From the [documentation](https://weka.sourceforge.io/doc.dev/weka/associations/Apriori.html), the Apriori algorithm in Weka iteratively reduces the minimum support until it finds the required number of rules (default: 10) that meet the specified minimum value of the chosen metric type.
:::
::::

YOUR ANSWER HERE

In [None]:
%%ai chatgpt -f text
In association rule mining, what are the pros and cons of ranking the rules 
according to lift instead of confidence?