In this notebook, we use the UCI Heart dataset to build a simple predictive checklist that minimizes the false positive rate (FPR) subject to a constraint on the false negative rate (FNR).

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import sys; sys.path.insert(0, '/home/alex/projects/predictive_checklists')
from IPChecklists.dataset import BinaryDataset

# if using CPLEX
from IPChecklists.model_cplex import ChecklistMIP
from IPChecklists.constraints_cplex import MaxNumFeatureConstraint, FNRConstraint, FPRConstraint

# if using Python-MIP
# from IPChecklists.model_pythonmip import ChecklistMIP
# from IPChecklists.constraints_pythonmip import MaxNumFeatureConstraint, FNRConstraint, FPRConstraint

Using CPLEX version 22.1.1.0


### 1. Load and process the dataset

In [15]:
#df = pd.read_csv('./data/heart.csv')
df = pd.read_csv('/home/alex/Downloads/processed.cleveland.data', names=['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'])
# raw label: 0 = no heart disease, 1-4 = disease; after this mapping, target == 1 means no disease
df['target'] = df['target'].apply(lambda x: 1 if x == 0 else 0)


# process feature columns
cont_cols = ['trestbps', 'chol', 'thalach', 'age', 'oldpeak']
for i in cont_cols:
    df[i] = df[i].astype(float)

cat_cols = ['cp', 'thal', 'ca', 'slope', 'restecg']
for i in cat_cols: # cast categorical columns as string for later type inference
    df[i] = df[i].astype(str)

df_train, df_test = train_test_split(df, test_size = 0.25, random_state = 42, stratify = df['target'])

In [16]:
df_train.head()

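|     | age  | sex | cp  | trestbps | chol  | fbs | restecg | thalach | exang | oldpeak | slope | ca  | thal | target |
|-----|------|-----|-----|----------|-------|-----|---------|---------|-------|---------|-------|-----|------|--------|
| 258 | 70.0 | 1.0 | 2.0 | 156.0    | 245.0 | 0.0 | 2.0     | 143.0   | 0.0   | 0.0     | 1.0   | 0.0 | 3.0  | 1      |
| 73  | 65.0 | 1.0 | 4.0 | 110.0    | 248.0 | 0.0 | 2.0     | 158.0   | 0.0   | 0.6     | 1.0   | 2.0 | 6.0  | 0      |
| 57  | 41.0 | 1.0 | 4.0 | 110.0    | 172.0 | 0.0 | 2.0     | 158.0   | 0.0   | 0.0     | 1.0   | 0.0 | 7.0  | 0      |
| 83  | 68.0 | 1.0 | 3.0 | 180.0    | 274.0 | 1.0 | 2.0     | 150.0   | 1.0   | 1.6     | 2.0   | 0.0 | 7.0  | 0      |
| 33  | 59.0 | 1.0 | 4.0 | 135.0    | 234.0 | 0.0 | 0.0     | 161.0   | 0.0   | 0.5     | 2.0   | 0.0 | 7.0  | 1      |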


### 2. Binarize the dataset

In [17]:
train_ds = BinaryDataset(df_train,
                         target_name = 'target',  # column name of the target variable
                         pos_label = 1,  # which target value counts as the positive class
                         col_subset = cont_cols + cat_cols  # use only these columns for modelling
                         )

INFO:root:Removed 2 non-informative columns: {'oldpeak<~0.0', 'oldpeak>=0.0'}
INFO:root:Binary dataframe: 66 binary features and 227 samples


In [18]:
# binarized features
train_ds.binarized_df.columns

Index(['trestbps>=120.0', 'trestbps<~120.0', 'trestbps>=130.0',
       'trestbps<~130.0', 'trestbps>=140.0', 'trestbps<~140.0', 'chol>=211.0',
       'chol<~211.0', 'chol>=240.0', 'chol<~240.0', 'chol>=272.0',
       'chol<~272.0', 'thalach>=136.0', 'thalach<~136.0', 'thalach>=152.0',
       'thalach<~152.0', 'thalach>=165.5', 'thalach<~165.5', 'age>=47.0',
       'age<~47.0', 'age>=55.0', 'age<~55.0', 'age>=61.0', 'age<~61.0',
       'oldpeak>=0.8', 'oldpeak<~0.8', 'oldpeak>=1.8', 'oldpeak<~1.8',
       'cp==2.0', 'cp!=2.0', 'cp==4.0', 'cp!=4.0', 'cp==3.0', 'cp!=3.0',
       'cp==1.0', 'cp!=1.0', 'thal==3.0', 'thal!=3.0', 'thal==6.0',
       'thal!=6.0', 'thal==7.0', 'thal!=7.0', 'thal==?', 'thal!=?', 'ca==0.0',
       'ca!=0.0', 'ca==2.0', 'ca!=2.0', 'ca==1.0', 'ca!=1.0', 'ca==3.0',
       'ca!=3.0', 'ca==?', 'ca!=?', 'slope==1.0', 'slope!=1.0', 'slope==2.0',
       'slope!=2.0', 'slope==3.0', 'slope!=3.0', 'restecg==2.0',
       'restecg!=2.0', 'restecg==0.0', 'restecg!=0.0', 'reste

In [19]:
test_ds = train_ds.apply_transform(df_test) # binarize the test set using the same thresholds
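
The idea behind reusing the training thresholds can be sketched with plain pandas. This is a minimal illustration only, not the `BinaryDataset` implementation: the helper names `fit_thresholds` and `binarize`, the choice of quartiles as cut points, and the toy values are all assumptions made for the example.

```python
import pandas as pd

def fit_thresholds(train_col, quantiles=(0.25, 0.5, 0.75)):
    """Pick cut points from the TRAINING column only (quartiles are an assumption)."""
    return sorted({float(t) for t in train_col.quantile(list(quantiles))})

def binarize(col, thresholds):
    """Turn one continuous column into paired >= / < indicator columns."""
    out = {}
    for t in thresholds:
        out[f"{col.name}>={t}"] = (col >= t).astype(int)
        out[f"{col.name}<{t}"] = (col < t).astype(int)
    return pd.DataFrame(out, index=col.index)

# Toy values, made up for illustration.
train_col = pd.Series([110.0, 120.0, 130.0, 140.0, 150.0], name="trestbps")
test_col = pd.Series([100.0, 135.0, 160.0], name="trestbps")

thresholds = fit_thresholds(train_col)       # learned once, on training data
train_bin = binarize(train_col, thresholds)
test_bin = binarize(test_col, thresholds)    # same cut points, so same columns

print(thresholds)
print(test_bin)
```

Because the cut points are fixed before the test data is seen, both frames share identical column names, and no information from the test set leaks into the features.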

### 3. Create a MIP

Here, we minimize the FPR subject to an FNR constraint. The FNR constraint is required because the model could otherwise achieve 0% FPR trivially by predicting negative for every sample.

Alternatively, we could have set cost_func = '01' (i.e. maximizing accuracy), in which case no performance constraint would be needed.
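
The degenerate case is easy to check numerically. The snippet below is a standalone sketch with made-up labels; `fpr_fnr` is a hypothetical helper and is not part of IPChecklists.

```python
import numpy as np

def fpr_fnr(y_true, y_pred):
    """Return (FPR, FNR) for binary labels in {0, 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tp = np.sum((y_pred == 1) & (y_true == 1))
    return fp / (fp + tn), fn / (fn + tp)

# Toy labels (made up for illustration): 4 positives, 4 negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
all_negative = np.zeros_like(y_true)  # a "model" that never predicts positive

trivial_fpr, trivial_fnr = fpr_fnr(y_true, all_negative)
print(trivial_fpr, trivial_fnr)  # FPR is 0, but every positive is missed
```

An always-negative predictor has no false positives, so an unconstrained FPR objective would reward it even though its FNR is 100%.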

In [20]:
model = ChecklistMIP(train_ds, cost_func = 'FPR') 

INFO:root:Before compression: 227 rows
INFO:root:After compression: 224 rows


### 4. Build the MIP and add constraints

In [21]:
model.add_constraint(FNRConstraint(0.1)) # FNR <= 10%
model.build_problem(N_constraint = MaxNumFeatureConstraint('<=', 5)) # use at most 5 features

### 5. Solve the MIP

In [22]:
stats = model.solve(max_seconds=60, display_progress=False) # a longer time limit may yield a better solution and a smaller optimality gap

Advanced basis not built.


Found solution with objective 1892.002211784141 and optimality gap 51.81%.


### 6. Create a "checklist" from the MIP

In [23]:
check = model.to_checklist()

In [24]:
check

oldpeak<~0.8
cp!=4.0
thal==3.0
ca==0.0
slope!=2.0

M = 3.0, N = 5.0
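
A checklist of this form is usually read as "predict positive when at least M of the N items are checked". The sketch below assumes that reading; `checklist_predict` and the toy rows are hypothetical and do not call the library's own prediction code.

```python
import pandas as pd

def checklist_predict(binary_df, items, M):
    """Predict 1 when at least M of the listed binary items equal 1."""
    checked = binary_df[items].sum(axis=1)  # number of items checked per row
    return (checked >= M).astype(int)

# The five items reported above, with three made-up patients.
items = ["oldpeak<~0.8", "cp!=4.0", "thal==3.0", "ca==0.0", "slope!=2.0"]
toy = pd.DataFrame(
    [
        [1, 1, 1, 0, 0],  # 3 items checked
        [1, 0, 0, 0, 1],  # 2 items checked
        [0, 0, 0, 0, 0],  # 0 items checked
    ],
    columns=items,
)

preds = checklist_predict(toy, items, M=3)
print(preds.tolist())
```

Under this assumption, only the first toy row reaches the threshold of 3 checked items.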

### 7. Examine various metrics

In [25]:
# training-set performance; note that FNR <= 10%, as required by the constraint
check.get_metrics(train_ds)

{'accuracy': 0.8722466960352423,
 'n_samples': 227,
 'TN': 86,
 'FN': 11,
 'TP': 112,
 'FP': 18,
 'error': 29,
 'TPR': 0.9105691056910569,
 'FNR': 0.08943089430894309,
 'FPR': 0.17307692307692307,
 'TNR': 0.8269230769230769,
 'precision': 0.8615384615384616,
 'pred_prevalence': 0.5726872246696035,
 'prevalence': 0.5418502202643172,
 'balanced_acc': 0.8687460913070668}

In [26]:
# test-set performance; the FNR constraint is enforced only on the training set, so it need not hold here
check.get_metrics(test_ds)

{'accuracy': 0.7894736842105263,
 'n_samples': 76,
 'TN': 26,
 'FN': 7,
 'TP': 34,
 'FP': 9,
 'error': 16,
 'TPR': 0.8292682926829268,
 'FNR': 0.17073170731707318,
 'FPR': 0.2571428571428571,
 'TNR': 0.7428571428571429,
 'precision': 0.7906976744186046,
 'pred_prevalence': 0.5657894736842105,
 'prevalence': 0.5394736842105263,
 'balanced_acc': 0.7860627177700348}
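
The reported rates follow directly from the confusion counts. The sketch below recomputes them from the TP/FP/TN/FN values printed above; `rates` is a hypothetical helper, not the library's `get_metrics`.

```python
def rates(TP, FP, TN, FN):
    """Derive standard rates from confusion-matrix counts."""
    n = TP + FP + TN + FN
    tpr = TP / (TP + FN)
    tnr = TN / (TN + FP)
    return {
        "accuracy": (TP + TN) / n,
        "TPR": tpr,
        "FNR": FN / (FN + TP),
        "FPR": FP / (FP + TN),
        "TNR": tnr,
        "precision": TP / (TP + FP),
        "balanced_acc": (tpr + tnr) / 2,
    }

# Counts copied from the two outputs above.
train = rates(TP=112, FP=18, TN=86, FN=11)
test = rates(TP=34, FP=9, TN=26, FN=7)

print(f"train FNR = {train['FNR']:.4f}, FPR = {train['FPR']:.4f}")
print(f"test  FNR = {test['FNR']:.4f}, FPR = {test['FPR']:.4f}")

# Compare each split against the 10% FNR limit used in the MIP.
fnr_limit = 0.1
print("train within limit:", train["FNR"] <= fnr_limit)
print("test within limit: ", test["FNR"] <= fnr_limit)
```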