---
title: Rule-Based Classification
math: 
    '\abs': '\left\lvert #1 \right\rvert' 
    '\norm': '\left\lVert #1 \right\rVert'
    '\Set': '\left\{ #1 \right\}'
    '\mc': '\mathcal{#1}'
    '\M': '\boldsymbol{#1}'
    '\R': '\mathsf{#1}'
    '\RM': '\boldsymbol{\mathsf{#1}}'
    '\op': '\operatorname{#1}'
    '\E': '\op{E}'
    '\d': '\mathrm{\mathstrut d}'
    '\Gini': '\operatorname{Gini}'
    '\Info': '\operatorname{Info}'
    '\Gain': '\operatorname{Gain}'
    '\GainRatio': '\operatorname{GainRatio}'
---

#### Motivation

![](images/DT2RB.dio.svg)

- When is the decision equal to 1?
  1. If _____________________, then $\R{Y}=1$. 
  2. Else $\R{Y}=0$.

#### Knowledge representation

![](images/rule_set.dio.svg)

- Benefits of representing knowledge by rules (cf. decision trees or neural networks):
    - M____________________________ 
    - I_____________________________
- How to generate rules?

#### Generate rules from a decision tree

![](images/DT2RB_trace.dio.svg)

- Each path from root to leaf corresponds to a rule:
  1. $\R{X}_1 = \underline{\phantom{x}} \Rightarrow \R{Y} = 0$
  2. $\R{X}_1 = \underline{\phantom{x}}, \R{X}_2 = \underline{\phantom{x}} \Rightarrow \R{Y} = 0$
  3. $\R{X}_1 = \underline{\phantom{x}}, \R{X}_2 = \underline{\phantom{x}} \Rightarrow \R{Y} = 1$
- Does the ordering of these rules matter? <u>Yes/No</u> because__________________________________________________________________
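
The path-to-rule correspondence can be sketched in a few lines of Python. This is only an illustration: the nested-dict tree format, the feature names `A` and `B`, and the function name `tree_to_rules` are assumptions made for this sketch, not the tree in the figure.

```python
# Hypothetical tree format (an assumption for this sketch):
#   internal node = {"feature": name, "children": {value: subtree}}
#   leaf          = class label (int)
def tree_to_rules(tree, path=()):
    """Return one (antecedent, label) pair per root-to-leaf path."""
    if not isinstance(tree, dict):
        # Leaf: the conditions collected along the path form the antecedent.
        return [(list(path), tree)]
    rules = []
    for value, subtree in tree["children"].items():
        condition = (tree["feature"], value)
        rules += tree_to_rules(subtree, path + (condition,))
    return rules


# Made-up toy tree, unrelated to the figure above.
toy_tree = {
    "feature": "A",
    "children": {
        "no": 0,
        "yes": {"feature": "B", "children": {"low": 1, "high": 0}},
    },
}

rules = tree_to_rules(toy_tree)
for antecedent, label in rules:
    conditions = ", ".join(f"{f} = {v}" for f, v in antecedent)
    print(f"{conditions} => Y = {label}")
```

For this toy tree the loop prints three rules, one per leaf: `A = no => Y = 0`, `A = yes, B = low => Y = 1`, and `A = yes, B = high => Y = 0`.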

#### Sequential covering

- S________-and-c________ (cf. divide-and-conquer)
  1. Learn a good rule.
  2. Remove covered instances and repeat step 1 until all instances are covered.
- How to learn a good rule?
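
The covering loop above can be sketched as follows. The rule learner `best_single_test` is a hypothetical stand-in (it picks the one-condition rule with the highest confidence, breaking ties by coverage) and the four-row dataset is made up; neither is the method or the data used later in these notes.

```python
from collections import Counter


def covers(antecedent, x):
    """True if instance x satisfies every condition of the antecedent."""
    return all(x[f] == v for f, v in antecedent.items())


def best_single_test(data):
    """Hypothetical learner: best one-condition rule by (confidence, coverage)."""
    best = None
    for f in data[0][0]:
        for v in sorted({x[f] for x, _ in data}):
            labels = Counter(y for x, y in data if x[f] == v)
            label, correct = labels.most_common(1)[0]
            total = sum(labels.values())
            score = (correct / total, total)
            if best is None or score > best[0]:
                best = (score, {f: v}, label)
    return best[1], best[2]


def sequential_covering(data, learn_one_rule):
    """Learn a rule, remove the instances it covers, and repeat."""
    rules, remaining = [], list(data)
    while remaining:
        antecedent, label = learn_one_rule(remaining)
        if not any(covers(antecedent, x) for x, _ in remaining):
            break  # guard: a rule covering nothing would loop forever
        rules.append((antecedent, label))
        remaining = [(x, y) for x, y in remaining if not covers(antecedent, x)]
    return rules


# Made-up toy data: (attributes, class label).
toy_data = [
    ({"X1": 0, "X2": 0}, 0),
    ({"X1": 0, "X2": 1}, 0),
    ({"X1": 1, "X2": 0}, 0),
    ({"X1": 1, "X2": 1}, 1),
]
learned = sequential_covering(toy_data, best_single_test)
```

Passing the learner in as an argument keeps the loop independent of how a single rule is found, which is the question the following sections address.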

#### PART (partial tree) decision list

- PART (partial tree) decision list:
  1. Build a new decision tree (by C4.5) and extract the rule that maximizes coverage, i.e., the fraction of instances satisfying the antecedent.
  2. Remove covered instances and repeat step 1 until all instances are covered.
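
As a minimal sketch of the selection in step 1, coverage can be computed directly from its definition. The candidate rules and rows below are made up for illustration, and building the partial C4.5 tree itself is not shown.

```python
def coverage(antecedent, rows):
    """Fraction of rows that satisfy every condition of the antecedent."""
    hits = sum(all(row[f] == v for f, v in antecedent.items()) for row in rows)
    return hits / len(rows)


# Made-up attribute rows (class labels are not needed for coverage).
rows = [
    {"A": "no", "B": "low"},
    {"A": "no", "B": "high"},
    {"A": "yes", "B": "low"},
    {"A": "yes", "B": "high"},
    {"A": "no", "B": "low"},
]

# Hypothetical rules read off some tree: (antecedent, predicted label).
candidate_rules = [
    ({"A": "no"}, 0),
    ({"A": "yes", "B": "low"}, 1),
    ({"A": "yes", "B": "high"}, 0),
]

# Keep the rule whose antecedent covers the largest fraction of rows.
best_rule = max(candidate_rules, key=lambda rule: coverage(rule[0], rows))
```

Here the first rule covers 3 of the 5 rows, so it is the one extracted.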

#### Example

![](images/PART.dio.svg)

1. Rule 1: ________________
   1. $\R{X}_1 = 0 \Rightarrow \R{Y} = 0 \quad (\text{coverage:} \underline{\phantom{xxx}} \%)$
   2. $\R{X}_1 = 1, \R{X}_2 = 0 \Rightarrow \R{Y} = 0 \quad (\text{coverage:} \underline{\phantom{xxx}} \%)$
   3. $\R{X}_1 = 1, \R{X}_2 = 1 \Rightarrow \R{Y} = 1 \quad (\text{coverage:} \underline{\phantom{xxx}} \%)$

2. Rule 2: ________________
   1. $\R{X}_2 = 0 \Rightarrow \R{Y} = 0 \quad (\text{coverage:} \underline{\phantom{xxx}} \%)$
   2. $\R{X}_2 = 1 \Rightarrow \R{Y} = 1 \quad (\text{coverage:} \underline{\phantom{xxx}} \%)$

3. Default rule: $\R{Y} = \underline{\phantom{xxx}}$

4. Issue: [Time complexity] _______________________________________

#### Generating rules directly

![](images/direct_rule.dio.svg)

1. Start with ZeroR, then add conjuncts to improve confidence, i.e., the fraction of covered instances that are correctly classified.
   - Rule 1: $\R{Y} = 0$
     - Confidence: $\underline{\phantom{xxx}} \%$
   - Rule 1 (refined): $\R{X}_1 = 0 \Rightarrow \R{Y} = 0$
     - Confidence: $\underline{\phantom{xxx}} \%$

2. Repeatedly add new rules to cover the remaining tuples.
   - Rule 2: $\R{Y} = 0$
     - Confidence: $\underline{\phantom{xxx}} \%$
   - Rule 2 (refined): $\R{X}_2 = 0 \Rightarrow \R{Y} = 0$
     - Confidence: $\underline{\phantom{xxx}} \%$
   - Default rule: $\R{Y} = \underline{\phantom{xxx}}$

- Decision list
   1. Rule 1: $\R{X}_1 = 0 \Rightarrow \R{Y} = 0$
   1. Rule 2: $\R{X}_2 = 0 \Rightarrow \R{Y} = 0$
   1. Default rule: $\R{Y} = 1$
- Is this list the best possible? <u>Yes/No</u>
   1. Time to detect positive class: $\underline{\phantom{xxx}}$
   1. Length of the list: $\underline{\phantom{xxx}}$
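
A decision list is evaluated top to bottom, and the first rule whose antecedent matches decides the class. The sketch below encodes the list above and a confidence function that follows the definition in step 1. The five-row dataset is made up, so its confidence values are not those of the figure.

```python
def matches(antecedent, x):
    """True if instance x satisfies every condition of the antecedent."""
    return all(x[f] == v for f, v in antecedent.items())


def confidence(antecedent, label, data):
    """Among instances covered by the antecedent, the fraction with class `label`."""
    covered = [y for x, y in data if matches(antecedent, x)]
    if not covered:
        return 0.0  # convention assumed here for a rule that covers nothing
    return sum(y == label for y in covered) / len(covered)


def classify(decision_list, default, x):
    """Return the label of the first matching rule, else the default label."""
    for antecedent, label in decision_list:
        if matches(antecedent, x):
            return label
    return default


# The decision list from the notes: Rule 1, Rule 2, then the default Y = 1.
decision_list = [({"X1": 0}, 0), ({"X2": 0}, 0)]
default_label = 1

# Made-up data for illustration only (not the data in the figure).
toy_data = [
    ({"X1": 0, "X2": 0}, 0),
    ({"X1": 0, "X2": 1}, 0),
    ({"X1": 0, "X2": 1}, 1),
    ({"X1": 1, "X2": 0}, 0),
    ({"X1": 1, "X2": 1}, 1),
]

zero_r = confidence({}, 0, toy_data)          # empty antecedent covers everything
refined = confidence({"X1": 0}, 0, toy_data)  # after adding the conjunct X1 = 0
prediction = classify(decision_list, default_label, {"X1": 1, "X2": 1})
```

On this toy data, adding the conjunct raises the confidence from 3/5 to 2/3, and an instance with both attributes equal to 1 reaches the default rule only after both earlier rules have been checked.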

#### Class-based ordering

![](images/class_based.dio.svg)

- Learn rules for positive class first:
  1. Rule 1:
     1. $\R{Y} = 1 \quad (\text{confidence:} \underline{\phantom{xxx}} \%)$
     1. $\R{X}_1 = \underline{\phantom{xxx}} \Rightarrow \R{Y} = 1 \quad (\text{confidence:} \underline{\phantom{xxx}} \%)$
     1. $\R{X}_1 = \underline{\phantom{xxx}}, \R{X}_2 = \underline{\phantom{xxx}} \Rightarrow \R{Y} = 1 \quad (\text{confidence:} \underline{\phantom{xxx}} \%)$
  2. Default rule: $\R{Y} = \underline{\phantom{xxx}}$
- Will the above guarantee a short decision list in general? <u>Yes/No</u> because $\underline{\phantom{xxx}}$

#### First Order Inductive Learner (FOIL) gain

- Add the conjunct that maximizes
  \begin{align}
  \op{FOIL\_Gain}
  &= p' \left( \log \frac{p'}{p' + n'} - \log \frac{p}{p + n} \right)
  \end{align}
  - $p \rightarrow p'$: number of positives covered by the rule before and after adding the conjunct
  - $n \rightarrow n'$: number of negatives covered by the rule before and after adding the conjunct
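
The formula translates directly into a small function. Two assumptions are made in this sketch: the logarithm is taken to base 2 (the notes leave the base unspecified), and a refinement that covers no positives is assigned zero gain. The counts in the usage lines are made up and do not come from the figures.

```python
from math import log2


def foil_gain(p, n, p_new, n_new):
    """FOIL gain for refining a rule that covers (p, n) into one covering (p_new, n_new).

    Assumes p > 0, base-2 logarithms, and zero gain when p_new == 0.
    """
    if p_new == 0:
        return 0.0  # assumed convention: p' log(...) is taken as 0 when p' = 0
    return p_new * (log2(p_new / (p_new + n_new)) - log2(p / (p + n)))


# Made-up counts: the rule covers 4 positives and 4 negatives before refinement.
gain_pure = foil_gain(4, 4, 2, 0)       # keeps 2 positives, removes all negatives
gain_unchanged = foil_gain(4, 4, 4, 4)  # conjunct changes nothing
gain_mixed = foil_gain(4, 4, 3, 1)      # keeps 3 positives and 1 negative
```

With these counts, `gain_pure` is 2.0 and `gain_unchanged` is 0.0, while `gain_mixed` falls in between.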

![](images/FOIL1.dio.svg)

- $\R{Y} = 1 \rightarrow \R{X}_1 = 0 \Rightarrow \R{Y} = 1$:

  $\op{FOIL\_Gain} = \underline{\phantom{\kern11em}}$ 

- $\R{Y} = 1 \rightarrow \R{X}_1 = 1 \Rightarrow \R{Y} = 1$: 

  $\op{FOIL\_Gain} = \underline{\phantom{\kern11em}}$ 

- <u>First/Second</u> is better.

![](images/FOIL2.dio.svg)

- $\R{X}_1 = 1 \Rightarrow \R{Y} = 1 \rightarrow \R{X}_1 = 1, \R{X}_2 = 0 \Rightarrow \R{Y} = 1$:

  $\op{FOIL\_Gain} = \underline{\phantom{\kern11em}}$ 

- $\R{X}_1 = 1 \Rightarrow \R{Y} = 1 \rightarrow \R{X}_1 = 1, \R{X}_2 = 1 \Rightarrow \R{Y} = 1$:

  $\op{FOIL\_Gain} = \underline{\phantom{\kern11em}}$ 

- <u>First/Second</u> is better.

\begin{align}
\op{FOIL\_Gain}
&= p' \left( \log \frac{p'}{p' + n'} - \log \frac{p}{p + n} \right)\\
&= \underbrace{(p' + n')}_{\text{(1)}} \underbrace{\frac{p'}{p' + n'}}_{\text{(2)}} \underbrace{\left( \log \frac{p'}{p' + n'} - \log \frac{p}{p + n} \right)}_{\text{(3)}}
\end{align}

- Heuristics:
  - (1) favors rules with large <u>coverage/confidence</u>.
  - (2) $\times$ (3) favors rules with large <u>coverage/confidence</u> given the same <u>coverage/confidence</u>.
  - (3) ensures $\op{FOIL\_Gain}$ is positive if <u>coverage/confidence</u> increases.

- [Challenge] Why not use information gain or gain ratio?

#### How to avoid overfitting?

- RIPPER: Repeated Incremental Pruning to Produce Error Reduction
- After each new rule, eliminate a conjunct (starting with the most recently added one) if it improves the following on a v_________ set:
  $$\op{FOIL\_Prune} = \frac{p - n}{p + n}$$
  or equivalently reduces
  $$\op{error} = \frac{n}{p + n}.$$
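
The equivalence of the two criteria follows from the definitions alone, as this short derivation shows:

$$\op{FOIL\_Prune} = \frac{p - n}{p + n} = \frac{(p + n) - 2n}{p + n} = 1 - 2 \cdot \frac{n}{p + n} = 1 - 2 \op{error}.$$

Since $\op{FOIL\_Prune}$ is a decreasing linear function of $\op{error}$, a pruning step increases one exactly when it decreases the other.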

#### References

- 8.4 Rule-Based Classification
- (Optional) [Frank, Eibe, and Ian H. Witten. "Generating accurate rule sets without global optimization." Fifteenth International Conference on Machine Learning, 1998, pp. 144-151.](https://hdl.handle.net/10289/1047)
   - A partial tree is built with nodes (subsets of data) split (expanded) in the order of their entropy.
   - A node is considered for pruning by subtree replacement if all its children are leaf nodes.
- (Optional) [Cohen, William W. "Fast effective rule induction." Machine Learning Proceedings, 1995, pp. 115-123.](https://www.sciencedirect.com/science/article/abs/pii/B9781558603776500232?via%3Dihub) (See also [WEKA JRIP](https://weka.sourceforge.io/doc.dev/weka/classifiers/rules/JRip.html) or its [source code](https://git.cms.waikato.ac.nz/weka/weka/-/blob/stable-3-8/weka/src/main/java/weka/classifiers/rules/JRip.java).)
   - The algorithm stops adding rules to the rule set when the description length after adding the new rule is more than 64 bits above the smallest description length found so far.
   - After the algorithm stops adding rules, there is a rule optimization step that optimizes each rule one-by-one.