# Rule Search
A demonstration of the OPUS rule search library, first on a simple 15-example toy data set and then on the larger MLB hospital data set.

    Copyright (C) 2021 Geoffrey Guy Messier

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.

In [1]:
%load_ext autoreload
%autoreload 1

In [2]:
import numpy as np
import pandas as pd
import datetime, copy  # `imp` was removed in Python 3.12; `reload` comes from importlib below
import pickle
import time
import os
import re
from sklearn.model_selection import StratifiedKFold
from importlib import reload


from tqdm.auto import tqdm, trange  # tqdm.auto selects the notebook widget when available
tqdm.pandas()

import sys
sys.path.insert(0, '../util/')

import rules as rs

## Toy Data Set
- This data set has two attributes (A0 and A1) and 15 examples, each either positive (x) or negative (o).  For instance, `8x` is a positive example with an A0 value between 1 and 3 and an A1 value between 0 and 2.

```
A1
  |      |             0x|
  |      |          1o 2o|
  |      |          3o 4o| 5o 
2 -------------------------------
  |      |         6x 7x | 11o 12o
  |      |         8x 9x |  13x
  -      |           10o |  14o
  |      |               |
  |      |               |
  -------|-------|-------|------- A0
         1               3

```

- The coverage table dictionary created by the `gen_coverage_table()` routine has the following fields:
    + `AttributeIndices`: The attribute index number for each feature.
    + `FeatureOperations`: The operations used to create each feature.
    + `ThresholdValues`: The threshold values used to create each feature.
    + `FeatureStrings`: Strings that describe each feature test (used by output and debugging routines).
    + `CoverageTable`: Dimensions: (#examples) x (#features).  A 1 indicates the feature is satisfied by the example data.  Check the above diagram to make sure you understand how these entries are created.
    + `Labels`: Indicates whether each example is positive (1) or negative (0).  Used to evaluate rule performance.
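
To make the table layout concrete, here is a minimal sketch of how such a 0/1 matrix could be derived from raw attribute values. The helper `build_coverage_table` and the sample coordinates are illustrative assumptions, not part of the `rules` module, and the library's own `gen_coverage_table()` may work differently.

```python
import numpy as np

def build_coverage_table(X, attr_idx, ops, thresholds):
    """Return an (#examples) x (#features) 0/1 matrix of threshold tests.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    cols = []
    for a, op, t in zip(attr_idx, ops, thresholds):
        # Each feature is a single threshold test on one attribute.
        col = X[:, a] >= t if op == '>=' else X[:, a] < t
        cols.append(col.astype(int))
    return np.column_stack(cols)

# Made-up coordinates: one point for each occupied region of the diagram.
X = [[2.0, 3.0],   # 1 <= A0 < 3, A1 >= 2  (same region as examples 0-4)
     [3.5, 2.5],   # A0 >= 3,     A1 >= 2  (example 5)
     [2.0, 1.0],   # 1 <= A0 < 3, A1 < 2   (examples 6-10)
     [3.5, 1.0]]   # A0 >= 3,     A1 < 2   (examples 11-14)

cov = build_coverage_table(
    X,
    attr_idx=[0, 0, 0, 0, 1, 1],
    ops=['>=', '<', '>=', '<', '>=', '<'],
    thresholds=[1, 1, 3, 3, 2, 2],
)
print(cov)
```

Each printed row is the row pattern shared by every toy example that falls in the same region of the diagram.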
        

In [3]:
toyData = {
    'AttributeIndices': [ 0, 0, 0, 0, 1, 1 ],
    'FeatureOperations': [ '>=', '<', '>=', '<', '>=', '<' ],
    'ThresholdValues': [ 1, 1, 3, 3, 2, 2 ],
    'FeatureStrings': [ 'A0 >= 1', 'A0 < 1', 'A0 >= 3', 'A0 < 3', 'A1 >= 2', 'A1 < 2' ],
    'CoverageTable': [
        [ 1, 0, 0, 1, 1, 0 ], # 0x
        [ 1, 0, 0, 1, 1, 0 ], # 1o
        [ 1, 0, 0, 1, 1, 0 ], # 2o
        [ 1, 0, 0, 1, 1, 0 ], # 3o
        [ 1, 0, 0, 1, 1, 0 ], # 4o
        #
        [ 1, 0, 1, 0, 1, 0 ], # 5o
        #
        [ 1, 0, 0, 1, 0, 1 ], # 6x
        [ 1, 0, 0, 1, 0, 1 ], # 7x
        [ 1, 0, 0, 1, 0, 1 ], # 8x
        [ 1, 0, 0, 1, 0, 1 ], # 9x
        [ 1, 0, 0, 1, 0, 1 ], # 10o
        #
        [ 1, 0, 1, 0, 0, 1 ], # 11o
        [ 1, 0, 1, 0, 0, 1 ], # 12o
        [ 1, 0, 1, 0, 0, 1 ], # 13x
        [ 1, 0, 1, 0, 0, 1 ], # 14o        
    ],
    'Labels': [ 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0 ]
}

Data from the coverage table dictionary is copied into the `rs.Examples` class for use in the rule search routines.

In [4]:
help(rs.Examples)

Help on class Examples in module rules:

class Examples(builtins.object)
 |  Examples(data)
 |  
 |  Class for storing data examples used by the OPUS rule search class.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, data)
 |      Initialize member variables using coverage table dictionary.
 |      -- Parameters --
 |       data: Coverage table dictionary object generated by gen_coverage_table().
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)



In [5]:
exsToy = rs.Examples(toyData)

## Evaluating Rule Quality
- The base class `RuleQuality` is used to consolidate common rule quality calculations and present general rule performance summary information.
- The derived classes implement specific quality metric calculations and are used by OPUS to optimize the rule search.
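
As a rough illustration of what these classes compute, the sketch below evaluates a single rule against the toy coverage table with plain NumPy. It assumes that a rule is a conjunction of feature columns and that the coverage difference is true positives minus false positives. Both are assumptions about the library rather than documented behavior, and all variable names are hypothetical.

```python
import numpy as np

# Toy coverage table and labels, written compactly by region of the diagram.
cov_tbl = np.array([[1, 0, 0, 1, 1, 0]] * 5      # examples 0-4
                   + [[1, 0, 1, 0, 1, 0]]        # example 5
                   + [[1, 0, 0, 1, 0, 1]] * 5    # examples 6-10
                   + [[1, 0, 1, 0, 0, 1]] * 4,   # examples 11-14
                   dtype=bool)
labels = np.array([1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0], dtype=bool)

rule = [3, 5]  # feature indices for 'A0 < 3' AND 'A1 < 2'

# An example is covered when every feature test in the rule is satisfied.
covered = cov_tbl[:, rule].all(axis=1)

tp = int((covered & labels).sum())     # covered positives
fp = int((covered & ~labels).sum())    # covered negatives
fn = int((~covered & labels).sum())    # missed positives
tn = int((~covered & ~labels).sum())   # correctly excluded negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
coverage_diff = tp - fp  # assumed definition of the coverage difference

print(f'TP={tp} FP={fp} FN={fn} TN={tn}')
print(f'precision={precision:.4f} recall={recall:.4f} diff={coverage_diff}')
```

For this rule the counts are 4 true positives, 1 false positive, 2 false negatives and 8 true negatives.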

The help text for one of the derived classes is shown below.

In [6]:
help(rs.RuleQualCoverageDiff)

Help on class RuleQualCoverageDiff in module rules:

class RuleQualCoverageDiff(RuleQuality)
 |  RuleQualCoverageDiff(ftrStr)
 |  
 |  Calculate true positive/false positive coverage difference metric for rule set.
 |  
 |  Method resolution order:
 |      RuleQualCoverageDiff
 |      RuleQuality
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, ftrStr)
 |      Init member variables.
 |      -- Parameters --
 |       ftrStr: Coverage table feature string (from Examples).
 |  
 |  val(self, rules, covTbl, labels, noFPos=False)
 |      Calculate quality metric.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from RuleQuality:
 |  
 |  confusion_matrix(self, rules, covTbl, labels)
 |      Returns a 2x2 numpy confusion matrix for a rule set.
 |      -- Parameters --
 |       rules: 2D ruleset array containing feature index numbers.
 |       covTbl: numpy coverage table matrix.
 |       labels: numpy label ve

## OPUS Rule Search
- An implementation of the OPUS rule search algorithm as described in Webb, "OPUS: An Efficient Admissible Algorithm for Unordered Search", Journal of Artificial Intelligence Research, 1995.
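
The core idea of an admissible search can be sketched with a much simpler breadth-first branch and bound over the toy data. This is not the library's implementation and it omits the search-space reordering that OPUS performs. It assumes the quality metric is true positives minus false positives, for which the true positive count of a rule is an optimistic bound on any specialization of it. The function names are hypothetical.

```python
import numpy as np

cov_tbl = np.array([[1, 0, 0, 1, 1, 0]] * 5 + [[1, 0, 1, 0, 1, 0]]
                   + [[1, 0, 0, 1, 0, 1]] * 5 + [[1, 0, 1, 0, 0, 1]] * 4,
                   dtype=bool)
labels = np.array([1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0], dtype=bool)

def tp_fp(rule):
    """Count covered positives and negatives for a conjunction of features."""
    covered = cov_tbl[:, list(rule)].all(axis=1)
    return int((covered & labels).sum()), int((covered & ~labels).sum())

def branch_and_bound(max_len=3):
    n_ftrs = cov_tbl.shape[1]
    best_rule, best_q = None, float('-inf')
    frontier = [()]  # rules still worth specializing, as sorted index tuples
    for _ in range(max_len):
        children = []
        for rule in frontier:
            # Only append higher feature indices so each subset is visited once.
            start = rule[-1] + 1 if rule else 0
            for f in range(start, n_ftrs):
                cand = rule + (f,)
                tp, fp = tp_fp(cand)
                if tp - fp > best_q:
                    best_rule, best_q = cand, tp - fp
                children.append((cand, tp))
        # Adding a test can only shrink coverage, so no specialization of a
        # rule can score above its TP count.  Pruning on that bound is safe.
        frontier = [r for r, tp in children if tp > best_q]
    return best_rule, best_q

best_rule, best_q = branch_and_bound()
print(best_rule, best_q)
```

On the toy data this returns the rule `(3, 5)` with quality 3. The pruning step is what keeps the search admissible: a branch is dropped only when its optimistic bound cannot beat the best rule found so far.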

In [7]:
help(rs.OpusRuleSearch)

Help on class OpusRuleSearch in module rules:

class OpusRuleSearch(builtins.object)
 |  OpusRuleSearch(ruleQuality, maxRuleLen=None, debug=False)
 |  
 |  Implements the OPUS rule search algorithm.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, ruleQuality, maxRuleLen=None, debug=False)
 |      Initializes member variables.
 |      
 |      -- Parameters --
 |       ruleQuality: Reference to the RuleQuality derived class object that calculates
 |        the rule metric used by OPUS to find the best rules.
 |       mxRuleLen: Maximum rule length.  Default (None) corresponds to the maximum 
 |        possible.
 |       debug: Debug level. Values:  
 |          rules.OPUS_DEBUG_EXHAUSTIVE generates exhaustive debug information suitable 
 |             for debugging very small example sets.
 |          rules.OPUS_DEBUG_RULE_DEPTH generates debug information upon finishing each
 |             level of the search tree.  Suitable for estimating the run time and rule
 |             le

## Individual Rule Search
- Find the single best rule based on coverage difference.
- Look at the data diagram to confirm the best rule is found.
- Experiment with the `rs.OPUS_DEBUG_EXHAUSTIVE` debug setting to learn how OPUS works.

In [8]:
qual = rs.RuleQualCoverageDiff(exsToy.FtrStrs)
rSrch = rs.OpusRuleSearch(ruleQuality=qual,debug=rs.OPUS_DEBUG_RULE_DEPTH)
rBest = rSrch.find_rule(exsToy)

print('\n-- Best Rule --')
qual.print_summary([rBest],exsToy.CovTbl,exsToy.Labels)

Finished Depth: 1
 MaxPotQuality: 5.50 (evaluated), 4.67 (todo)
 Best Rule: [3, 5]:(['A0 < 3' 'A1 < 2']), Quality: 3
Finished Depth: 2
 MaxPotQuality: 5.14 (evaluated), 4.00 (todo)
 Best Rule: [3, 5]:(['A0 < 3' 'A1 < 2']), Quality: 3
Finished Depth: 3
 MaxPotQuality: 5.00 (evaluated), nan (todo)
 Best Rule: [3, 5]:(['A0 < 3' 'A1 < 2']), Quality: 3

-- Best Rule --
Rule: ['A0 < 3' 'A1 < 2']

 Precision: 0.8000
 Recall: 0.6667
 Confusion:
  True Pos: 4/6
  False Neg: 2/6
  False Pos: 1/9
  True Neg: 8/9



## Rule Set Search

In [9]:
help(rs.rule_set_search)

Help on function rule_set_search in module rules:

rule_set_search(ruleQual, ruleSearch, exs, maxSetSize=1e+300, coveredMultWeight=0, debug=False)
    Uses a coverage approach to generate a rule set where OPUS is used to determine 
    the individual rules in the set.  Unless the maximum allowed set size is reached first,
    the routine terminates when all examples have been covered.
    
    -- Parameters --
     ruleQual: Reference to the RuleQuality derived class object used to calculate rule quality.
     ruleSearch: Reference to the object used to find the best individual rules.
     exs: Reference to an Examples class object containing the examples.
     maxSetSize: Maximum number of rules allowed in the set (defaults to infinite).
     coveredMultWeight: Weight used to multiply the label values of covered examples. 
       Defaults to 0 which completely removes an example the first time it's covered.
     debug: If true, produce debug output.
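
The coverage approach described in the help text can be sketched as a greedy loop. The code below is a simplified reading of that description, not the library's code: a brute-force search stands in for OPUS, precision on the weighted labels is the assumed quality metric, and all function and parameter names are hypothetical.

```python
import numpy as np
from itertools import combinations

cov_tbl = np.array([[1, 0, 0, 1, 1, 0]] * 5 + [[1, 0, 1, 0, 1, 0]]
                   + [[1, 0, 0, 1, 0, 1]] * 5 + [[1, 0, 1, 0, 0, 1]] * 4,
                   dtype=bool)
labels = np.array([1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0], dtype=float)

def best_rule_by_precision(cov_tbl, weights, max_len=3):
    """Brute-force stand-in for OPUS: try every conjunction up to max_len."""
    best_rule, best_q = None, 0.0
    for k in range(1, max_len + 1):
        for rule in combinations(range(cov_tbl.shape[1]), k):
            covered = cov_tbl[:, list(rule)].all(axis=1)
            if covered.any():
                q = weights[covered].sum() / covered.sum()
                if q > best_q:  # strict test: shorter, earlier rules win ties
                    best_rule, best_q = rule, q
    return best_rule

def covering_search(cov_tbl, labels, max_set_size=10, covered_mult_weight=0.0):
    weights = labels.copy()
    rule_set = []
    # Stop when no positive label weight remains or the set is full.
    while weights.sum() > 0 and len(rule_set) < max_set_size:
        rule = best_rule_by_precision(cov_tbl, weights)
        if rule is None:
            break
        covered = cov_tbl[:, list(rule)].all(axis=1)
        # Down-weight (or, with 0, remove) the labels of covered examples.
        weights[covered] *= covered_mult_weight
        rule_set.append(rule)
    return rule_set

rule_set = covering_search(cov_tbl, labels)
print(rule_set)
```

On the toy data the first two rules found by this sketch are `(3, 5)` and `(2, 5)`, which agree with the first two rules in the debug output of the search in this section.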



In [10]:
qual = rs.RuleQualPrecision(exsToy.FtrStrs)
rSrch = rs.OpusRuleSearch(ruleQuality=qual,debug=rs.OPUS_DEBUG_RULE_DEPTH)
ruleSet = rs.rule_set_search(qual,rSrch,exsToy,debug=True)
print(f'\nFinal Rule Set: {qual.ruleset_str(ruleSet)}')

Searching for individual rule...
Finished Depth: 1
 MaxPotQuality: 1.00 (evaluated), 1.00 (todo)
 Best Rule: [3, 5]:(['A0 < 3' 'A1 < 2']), Quality: 0.8
Finished Depth: 2
 MaxPotQuality: 1.00 (evaluated), 1.00 (todo)
 Best Rule: [3, 5]:(['A0 < 3' 'A1 < 2']), Quality: 0.8
Finished Depth: 3
 MaxPotQuality: 1.00 (evaluated), nan (todo)
 Best Rule: [3, 5]:(['A0 < 3' 'A1 < 2']), Quality: 0.8

-- New Rule --
Rule: ['A0 < 3' 'A1 < 2']

 Precision: 0.8000
 Recall: 0.6667
 Confusion:
  True Pos: 4/6
  False Neg: 2/6
  False Pos: 1/9
  True Neg: 8/9

Total Label Weight: 6 (before), 2 (after)

Searching for individual rule...
Finished Depth: 1
 MaxPotQuality: 1.00 (evaluated), 1.00 (todo)
 Best Rule: [2, 5]:(['A0 >= 3' 'A1 < 2']), Quality: 0.25
Finished Depth: 2
 MaxPotQuality: 1.00 (evaluated), 1.00 (todo)
 Best Rule: [2, 5]:(['A0 >= 3' 'A1 < 2']), Quality: 0.25
Finished Depth: 3
 MaxPotQuality: 1.00 (evaluated), nan (todo)
 Best Rule: [2, 5]:(['A0 >= 3' 'A1 < 2']), Quality: 0.25

-- New Rule --


## MLB Hospital Data Set
Run a single-rule search and then a rule set search on the larger MLB hospital data set.

In [11]:
covFileStr = '../data/MLB-CoverageTable.pkl'

with open(covFileStr,'rb') as pklFile:
    mlbData = pickle.load(pklFile)

exsMlb = rs.Examples(mlbData)

In [12]:
qual = rs.RuleQualFScore(exsMlb.FtrStrs,betaSq=0.1)
rSrch = rs.OpusRuleSearch(ruleQuality=qual, maxRuleLen=3, debug=rs.OPUS_DEBUG_RULE_DEPTH)
rBest = rSrch.find_rule(exsMlb)

print('\n-- Best Rule --')
qual.print_summary([rBest],exsMlb.CovTbl,exsMlb.Labels)

Finished Depth: 1
 MaxPotQuality: 0.92 (evaluated), 0.88 (todo)
 Best Rule: [112, 55]:(['A2 >= 56.5' 'A0 < 544']), Quality: 0.7437722419928826
Finished Depth: 2
 MaxPotQuality: 0.89 (evaluated), 0.87 (todo)
 Best Rule: [112, 55, 28]:(['A2 >= 56.5' 'A0 < 544' 'A0 >= 260']), Quality: 0.7712177121771219
Finished Depth: 3
 MaxPotQuality: 0.88 (evaluated), nan (todo)
 Best Rule: [112, 55, 28]:(['A2 >= 56.5' 'A0 < 544' 'A0 >= 260']), Quality: 0.7712177121771219

-- Best Rule --
Rule: ['A2 >= 56.5' 'A0 < 544' 'A0 >= 260']

 Precision: 0.9048
 Recall: 0.3115
 Confusion:
  True Pos: 19/61
  False Neg: 42/61
  False Pos: 2/854
  True Neg: 852/854
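
The reported quality can be checked by hand from the confusion counts. The sketch below assumes that `betaSq` is the squared β of the standard F-beta measure, which is an assumption about `RuleQualFScore` rather than something its documentation states here. The function `f_beta` is a hypothetical helper.

```python
def f_beta(tp, fp, fn, beta_sq):
    """Standard F-beta score computed from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Counts taken from the best-rule summary: TP 19/61, FN 42/61, FP 2/854.
score = f_beta(tp=19, fp=2, fn=42, beta_sq=0.1)
print(f'{score:.4f}')
```

Under that assumption the score is about 0.7712, matching the Quality value printed for the best rule. A small β² weights precision more heavily than recall, which is consistent with the search settling on a rule with high precision and modest recall.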



In [13]:
qual = rs.RuleQualFScore(exsMlb.FtrStrs,betaSq=0.1)
rSrch = rs.OpusRuleSearch(ruleQuality=qual, maxRuleLen=3, debug=rs.OPUS_DEBUG_RULE_DEPTH)
ruleSet = rs.rule_set_search(qual,rSrch,exsMlb,maxSetSize=2,debug=True)
print(f'\nFinal Rule Set: {qual.ruleset_str(ruleSet)}')

Searching for individual rule...
Finished Depth: 1
 MaxPotQuality: 0.92 (evaluated), 0.88 (todo)
 Best Rule: [112, 55]:(['A2 >= 56.5' 'A0 < 544']), Quality: 0.7437722419928826
Finished Depth: 2
 MaxPotQuality: 0.89 (evaluated), 0.87 (todo)
 Best Rule: [112, 55, 28]:(['A2 >= 56.5' 'A0 < 544' 'A0 >= 260']), Quality: 0.7712177121771219
Finished Depth: 3
 MaxPotQuality: 0.88 (evaluated), nan (todo)
 Best Rule: [112, 55, 28]:(['A2 >= 56.5' 'A0 < 544' 'A0 >= 260']), Quality: 0.7712177121771219

-- New Rule --
Rule: ['A2 >= 56.5' 'A0 < 544' 'A0 >= 260']

 Precision: 0.9048
 Recall: 0.3115
 Confusion:
  True Pos: 19/61
  False Neg: 42/61
  False Pos: 2/854
  True Neg: 852/854

Total Label Weight: 61.0 (before), 42.0 (after)

Searching for individual rule...
Finished Depth: 1
 MaxPotQuality: 0.91 (evaluated), 0.85 (todo)
 Best Rule: [104, 109]:(['A2 >= 42.5' 'A2 < 52.5']), Quality: 0.5755813953488372
Finished Depth: 2
 MaxPotQuality: 0.86 (evaluated), 0.82 (todo)
 Best Rule: [104, 109, 22]:(['A