# Ruleset Coverage Table Generation
- Terminology follows "Foundations of Rule Learning" by Fürnkranz et al. (Springer).
- Uses the MLB hospital dataset to predict whether an individual experiences a vitals crash.  
- Data attributes are generated by summing the number of events that occur in a time window.  
- Attributes are used to create features that are summarized in a coverage table.

    Copyright (C) 2021 Geoffrey Guy Messier

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.

In [1]:
%load_ext autoreload
%autoreload 1

In [2]:
import numpy as np
import pandas as pd
import datetime, copy
import pickle
import time
import os
import re
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold

from tqdm.auto import trange
from tqdm.notebook import tqdm
tqdm.pandas()

import sys
sys.path.insert(0, '../util/')

from rules import gen_coverage_table

## Load Data

In [41]:
dataFileStr = '../data/MLBHospitalData.hd5'
dat = pd.read_hdf(dataFileStr)

In [42]:
labels = dat.groupby(level=0).progress_apply(lambda x: (x.Event == 'VitalsCrash').sum() > 0)


In [43]:
def trunc_at_first_outcome(tbl):
    truncDate = tbl[tbl.Event == 'VitalsCrash'].Date.min()
    return tbl.loc[~(tbl.Date > truncDate)].reset_index()[[ 'Date', 'Event' ]]

In [44]:
events = dat.groupby(level=0).progress_apply(trunc_at_first_outcome)
events = pd.get_dummies(events,prefix='',prefix_sep='')


## Window and Label Data
- Attributes are generated by summing the number of events that occur within an observation window.
- Labels indicate whether a vitals crash occurs within a follow-up window.
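
The windowing idea can be illustrated on a toy event log. This is a minimal sketch, not the notebook's pipeline: the `toy` table, its `Id` column, and the `in_window` helper are hypothetical names introduced only for illustration, and the window dates mirror the ones used in the cells that follow.

```python
import pandas as pd

# Hypothetical event log for two individuals, one row per event.
toy = pd.DataFrame({
    'Id': [1, 1, 1, 2, 2],
    'Date': pd.to_datetime(['2003-05-01', '2004-02-10', '2006-03-15',
                            '2003-07-20', '2009-01-05']),
    'Event': ['Stay', 'BadTestResult', 'VitalsCrash', 'Stay', 'VitalsCrash'],
})

# One-hot encode the event type so that summing a column counts events.
toy_events = pd.get_dummies(toy, columns=['Event'], prefix='', prefix_sep='')

def in_window(tbl, start, end):
    """Keep only the rows whose Date falls inside [start, end]."""
    return tbl[(tbl.Date >= start) & (tbl.Date <= end)]

# Attributes: event counts per individual inside the observation window.
obs = in_window(toy_events, pd.Timestamp('2002-01-01'), pd.Timestamp('2005-12-31'))
toy_attr = obs.groupby('Id')[['Stay', 'BadTestResult']].sum()

# Labels: True if any vitals crash falls inside the follow-up window.
follow = in_window(toy_events, pd.Timestamp('2005-01-01'), pd.Timestamp('2008-12-31'))
toy_labels = (follow.groupby('Id')['VitalsCrash'].sum() > 0)
# Individuals with no follow-up events are negative examples.
toy_labels = toy_labels.reindex(toy_attr.index, fill_value=False)

print(toy_attr)
print(toy_labels)
```

Individual 1 has a crash inside the follow-up window and is labelled positive; individual 2's crash falls after the window closes, so that individual is labelled negative.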

In [77]:
# Observation interval.
obsStart = pd.to_datetime('2002-01-01')
obsEnd = pd.to_datetime('2005-12-31')

# Follow up interval.
followStart = pd.to_datetime('2005-01-01')
followEnd = pd.to_datetime('2008-12-31')

In [78]:
# Count the data features for each individual.
def timeline_summary(tbl,startDate,endDate):
    
    cols = [ 'Stay', 'GoodTestResult', 'BadTestResult' ]
    
    return tbl.loc[ (tbl.Date >= startDate) & (tbl.Date <= endDate) ][cols].sum()
    

In [79]:
attr = events.groupby(level=0).progress_apply(timeline_summary,startDate=obsStart,endDate=obsEnd)


In [80]:
def assign_labels(tbl,startDate,endDate):
    return tbl.loc[ (tbl.Date >= startDate) & (tbl.Date <= endDate) ].VitalsCrash.sum() > 0

In [81]:
labels = events.groupby(level=0).progress_apply(assign_labels,startDate=followStart,endDate=followEnd)


In [82]:
print(f'Positive Cases: {labels.sum()}/{len(labels)} ({100*labels.sum()/len(labels):.2f}%)')

Positive Cases: 61/915 (6.67%)


## Coverage Table

In [37]:
help(gen_coverage_table)

Help on function gen_coverage_table in module rules:

gen_coverage_table(attr, lbl)
    Generates coverage table using the algorithm presented by Gamberger et al. in "Handling 
    unknown and imprecise attribute values in propositional rule learning: A feature-based approach".
    
    -- Parameters --
     attr: NxK attribute numpy array where N = number of examples and K = number of attributes.
     lbl: Nx1 numpy label vector where positive/negative examples are equal to 0/1.
     
    -- Returns --
     A 6-tuple consisting of the following:
      ftrStr: Array of human readable strings describing feature tests (i.e. "A0 <= 4.2").
      attrInds: Array of the attribute indices that correspond to each feature.
      vThrshs: Array of the threshold values used by each feature test.
      ops: Array of strings indicating the feature test operation (equal to '<', or '>=').
      covTbl: NxL numpy array where L is the number of features.
      labels: Nx1 numpy vector of label values.


In [84]:
attrNumpy = attr.to_numpy()
labelsNumpy = np.transpose( np.array([ labels.to_numpy()*1 ]) )

(ftrStr,attrInds,vs,ops,covTbl,labels) = gen_coverage_table(attrNumpy,labelsNumpy)

 Number of Examples: 915
  Attribute 0 Features: 62 (195 unique values)
  Attribute 1 Features: 8 (12 unique values)
  Attribute 2 Features: 55 (96 unique values)


In [86]:
covFileStr = '../data/MLB-CoverageTable.pkl'
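# The context manager closes the file even if pickling raises an exception.
with open(covFileStr, 'wb') as pklFile:
    pickle.dump({ 
        'FeatureStrings': ftrStr, 
        'AttributeIndices': attrInds, 
        'ThresholdValues': vs, 
        'FeatureOperations': ops, 
        'CoverageTable': covTbl, 
        'Lables': labels}, pklFile)  # Key spelling ('Lables') kept unchanged for any existing readers of this file.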
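
Reading the table back is the mirror image of the dump. The sketch below round-trips a placeholder dictionary through a temporary file so that it runs on its own; the payload values and the temporary path are made up for illustration, and only the dictionary keys are taken from the cell above (including the `'Lables'` spelling).

```python
import os
import pickle
import tempfile

import numpy as np

# Placeholder payload with the same keys as the notebook's dictionary.
payload = {
    'FeatureStrings': ['A0 < 1', 'A0 >= 1'],
    'AttributeIndices': [0, 0],
    'ThresholdValues': [1.0, 1.0],
    'FeatureOperations': ['<', '>='],
    'CoverageTable': np.array([[True, False], [False, True]]),
    'Lables': np.array([[0], [1]]),  # key spelled as in the notebook
}

# Hypothetical temporary location, used instead of ../data/ for this demo.
path = os.path.join(tempfile.mkdtemp(), 'demo-coverage.pkl')
with open(path, 'wb') as f:
    pickle.dump(payload, f)

# Load the dictionary and unpack the pieces a rule learner would need.
with open(path, 'rb') as f:
    loaded = pickle.load(f)

cov_tbl = loaded['CoverageTable']
lbls = loaded['Lables']
print(cov_tbl.shape, lbls.shape)
```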