# genrules for finding high risk maternal comorbidities

By [Andrew Wheeler, PhD](mailto:apwheele@gmail.com)

This notebook illustrates the python code base I have developed, `genrules`, for the *NICHD Decoding Maternal Morbidity Data Challenge*. It uses a genetic algorithm to identify categories of comorbidities that show large increases in the relative risk given an input maternal morbidity being examined (e.g. hypertension, sepsis, post-partum depression, etc.), as well as different risk factors.

This notebook illustrates the code using real-world examples. For more detailed documentation on the algorithm and its arguments, see the `tech_docs` folder.

For the project setup, this script assumes that you are running the python session from the root of the project, and that your python has several scientific libraries installed (most belong to the typical scientific stack, e.g. numpy and pandas; the `evol` library is the main unusual one). See the `requirements.txt` file for instructions on building an environment to replicate these results. Finally, the nuMoM2b data for the challenge, `nuMoM2b_Dataset_NICHD Data Challenge.csv`, needs to be saved in the data folder.

## Upfront loading of libraries

I have written several functions to prepare data for modelling. These create several variable sets for use and prepare the outcome variable. I have made the functions as general as possible, but the *outcome* variable being examined needs to be an integer 0/1 variable with no missing values. The covariates assessing comorbidities can be encoded however you want (including missing values as `np.NaN`), but it only makes sense to examine categorical data.

In [1]:
# I have my functions in the src folder
import pandas as pd
from src import genrules
from src import dataprep

# If this fails, either need to run from root, or add in something like
# import os
# os.chdir(r'C:\github\genrules') #replace with local directory on your machine
# before the above from src import lines

# Example 1 (infection)

This example shows the basic use of the library for the maternal morbidity of infection (variable `CBAB01`, recoded so Yes=1 and No=0, and other responses dropped), along with demographic variables of:

 - `AgeCat_V1`, with labels `{1: 13-17, 2: 18-34, 3: 35-39, 4: >=40}`
 - `CRace`, with labels `{1: Non-Hispanic White, 2: Non-Hispanic Black, 3: Hispanic, 4: Asian, 5: Other}`
 - `Education`, with labels `{1: Less HS, 2: HS or GED, 3: Some College, 4: Assoc Deg, 5: Comp College, 6: Degree Beyond Coll}`
 - `poverty`, with labels `{1: 200% of fed pov level, 2: 100-200% fed pov lev, 3: <100% fed pov lev}`
 - `ins_type` (combined yes from variables `Ins_Govt`, `Ins_Mil`, `Ins_Comm`, `Ins_Pers`, & `Ins_Othr`)
 - `BMI_Cat`, with labels `{1: < 18.5 (underweight), 2: [18.5-25) (normal weight),3: [25,30) (overweight), 4: [30,35) (obese), 5: >=35 (morbidly obese)}`
 
This script shows preparing the data, along with the overall counts of cases with and without infection (dropping cases where the outcome is missing or not recorded as 1/2).
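The recoding step described above can be sketched in plain Python. This is only an illustration, not the code inside `dataprep.prep_dat`: the helper name `recode_outcome` is hypothetical, and it assumes the raw variable is coded 1 = Yes and 2 = No, with every other response dropped.

```python
# Hypothetical helper, NOT part of dataprep: recode a raw 1/2 outcome to 1/0.
# Assumption: raw coding is 1 = Yes and 2 = No; anything else is dropped.
def recode_outcome(raw_values):
    mapping = {1: 1, 2: 0}
    kept = []
    for row_id, value in enumerate(raw_values):
        # Missing values and other response codes are skipped entirely
        if value in mapping:
            kept.append((row_id, mapping[value]))
    return kept

raw = [1, 2, None, 2, 3, 1]          # toy responses, not real data
recoded = recode_outcome(raw)
outcome = [y for _, y in recoded]    # integer 0/1 outcome with no missing values
```

Keeping the original row position alongside each recoded value makes it possible to drop the same rows from the covariates.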

In [2]:
infect = 'CBAB01' #infection

# Set of demographic variables
rhv = dataprep.demo

# Do some data analysis to prep
inf_dat = dataprep.prep_dat(infect,rhv)

inf_dat[infect].value_counts()

0    8042
1     512
Name: CBAB01, dtype: int64

So this shows that around 6% of the sample, `512/(512+8042)`, experiences an infection. We can use `genrules`, though, to identify categories that show elevated relative risks.

In [3]:
# Set up object with all defaults
ge = genrules.genrules(data=inf_dat,y_var=infect,x_vars=rhv)

# Evolve the pop 5 generations, see what additional rules we discover
ge.evolve(rep=5)

# We can check out the top rules in the current leaderboard
tb = ge.leaderboard
tb[['relrisk','pval','tot_n','out_n','label']].head(20)

Creating initial pop, starting at 2021-10-11 16:09:53.317786
Total N of initial population 436 (finished @ 2021-10-11 16:09:53.352789)

Creating initial leaderboard @ 2021-10-11 16:09:53.353788
Initial candidates added to leaderboard 60

Generation 1 starting @ 2021-10-11 16:09:54.431786
Total new cases added to leaderboard 36

Generation 2 starting @ 2021-10-11 16:09:56.857778
Total new cases added to leaderboard 21

Generation 3 starting @ 2021-10-11 16:09:59.732785
Total new cases added to leaderboard 12

Generation 4 starting @ 2021-10-11 16:10:02.860759
Total new cases added to leaderboard 2

Generation 5 starting @ 2021-10-11 16:10:06.246758
Total new cases added to leaderboard 1


|   | relrisk | pval | tot_n | out_n | label |
|---|---|---|---|---|---|
| 0 | 2.868188 | 0.00015 | 59 | 10 | `{'CRace': '3', 'BMI_Cat': '3', 'poverty': '2'}` |
| 1 | 2.838364 | 7e-06 | 90 | 15 | `{'BMI_Cat': '4', 'Education': '2', 'ins_type':...` |
| 2 | 2.838364 | 7e-06 | 90 | 15 | `{'AgeCat_V1': '2', 'BMI_Cat': '3', 'poverty': ...` |
| 3 | 2.768096 | 2.1e-05 | 86 | 14 | `{'AgeCat_V1': '2', 'BMI_Cat': '4', 'Education'...` |
| 4 | 2.715209 | 0.000601 | 56 | 9 | `{'AgeCat_V1': '2', 'CRace': '4', 'Education': ...` |
| 5 | 2.687387 | 2e-05 | 95 | 15 | `{'BMI_Cat': '3', 'poverty': '2', 'ins_type': '...` |
| 6 | 2.66726 | 0.000748 | 57 | 9 | `{'AgeCat_V1': '2', 'CRace': '2', 'Education': ...` |
| 7 | 2.646436 | 0.001481 | 51 | 8 | `{'AgeCat_V1': '2', 'CRace': '2', 'BMI_Cat': '3...` |
| 8 | 2.646436 | 0.001481 | 51 | 8 | `{'CRace': '3', 'BMI_Cat': '3', 'Education': '1'}` |
| 9 | 2.595238 | 0.001825 | 52 | 8 | `{'CRace': '2', 'BMI_Cat': '3', 'Education': '2...` |


We can see that the top rule found, `{'CRace': '3', 'BMI_Cat': '3', 'poverty': '2'}`, has a relative risk of nearly 3 compared to nulliparous mothers outside of this group. These are Hispanic mothers who are overweight, with income at 100-200% of the federal poverty level. This group has 10/59 with infections, so around 17% of the group.

The algorithm intentionally penalizes rules with a small number of observations (and thus a high variance for the relative risk). By default it only includes rules with at least 50 observations in the group (although this threshold can be raised or lowered).
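As a worked check, the `relrisk` value in the first row of the table above (10 infections among 59 mothers) can be reproduced by hand. The sketch below assumes relative risk is the risk inside the group divided by the risk among everyone outside it, which matches the reported value. The function is a hypothetical illustration rather than the library's code: it only mimics the minimum-size cutoff (the `min_samp` name is borrowed from the library's argument) and does not implement the small-sample penalty.

```python
# Hypothetical illustration, NOT the genrules implementation.
def relative_risk(out_n, tot_n, out_all, tot_all, min_samp=50):
    """Risk inside a group divided by the risk among everyone outside it."""
    if tot_n < min_samp:
        return None  # group too small to report
    risk_in = out_n / tot_n
    risk_out = (out_all - out_n) / (tot_all - tot_n)
    return risk_in / risk_out

# Overall infection rate from the value counts: 512 of 8554 mothers
overall = 512 / (512 + 8042)

# First leaderboard row: 10 infections among 59 mothers in the group
rr_top = relative_risk(out_n=10, tot_n=59, out_all=512, tot_all=8554)
print(round(overall, 3), round(rr_top, 3))  # 0.06 2.868
```

A group with fewer than `min_samp` observations returns `None` here, which is one simple way to keep very noisy ratios off a leaderboard.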

One can peruse the list to identify other rules. Row 8 (using the 0-based index in the table), for example, is overweight Hispanic mothers with less than a high school education, who have a relative risk of 2.6, `8/51 = 16%`. I have also provided functions to identify the attributes that appear most often in the rule list, using the `active_table()` method of the genrules object:

In [4]:
ge.active_table(type='att')

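|   | Variable | Attribute | TotActive |
|---|---|---|---|
| 0 | ins_type | Ins_Govt | 47 |
| 1 | CRace | 2 | 38 |
| 2 | AgeCat_V1 | 2 | 34 |
| 3 | CRace | 3 | 29 |
| 4 | Education | 2 | 25 |
| 5 | BMI_Cat | 3 | 23 |
| 6 | BMI_Cat | 4 | 15 |
| 7 | poverty | 2 | 14 |
| 8 | Education | 4 | 11 |
| 9 | BMI_Cat | 5 | 10 |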


This shows that among the (default) 100 rules identified, having government insurance is the most common characteristic associated with higher relative risks found. One can then do further exploratory data analysis of the rules that include government insurance types using typical pandas functions on the leaderboard dataframe (columns with `None` mean that the characteristic is not selected at all). 
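The idea behind this kind of attribute tally can be sketched with the standard library. This is a rough illustration under my own assumptions, not the code behind `active_table()`: the helper `count_active` is hypothetical, and the rules are toy examples loosely modelled on the leaderboard labels.

```python
from collections import Counter

# Hypothetical helper, NOT the active_table() implementation:
# count how often each (variable, attribute) pair appears across rules.
def count_active(rules):
    counts = Counter()
    for rule in rules:
        for variable, attribute in rule.items():
            counts[(variable, attribute)] += 1
    # Most frequent pairs first
    return counts.most_common()

# Toy rules, for illustration only
rules = [
    {'CRace': '3', 'BMI_Cat': '3', 'poverty': '2'},
    {'CRace': '3', 'BMI_Cat': '3', 'Education': '1'},
    {'CRace': '2', 'BMI_Cat': '3'},
]
table = count_active(rules)
print(table[0])  # (('BMI_Cat', '3'), 3)
```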

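In the filtered leaderboard below, obese mothers (`BMI_Cat` of 4) with a high school education and government insurance have the highest relative risk in this subset, tied with mothers aged 18-34 who are overweight, at 100-200% of the federal poverty level, and on government insurance.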

In [5]:
# One can then pull out Ins_Govt to check out additional co-morbidities
tb.loc[tb['ins_type'] == 'Ins_Govt',rhv + ['relrisk']]

|   | AgeCat_V1 | CRace | BMI_Cat | Education | poverty | ins_type | relrisk |
|---|---|---|---|---|---|---|---|
| 1 |  |  | 4.0 | 2.0 |  | Ins_Govt | 2.838364 |
| 2 | 2.0 |  | 3.0 |  | 2.0 | Ins_Govt | 2.838364 |
| 3 | 2.0 |  | 4.0 | 2.0 |  | Ins_Govt | 2.768096 |
| 5 |  |  | 3.0 |  | 2.0 | Ins_Govt | 2.687387 |
| 6 | 2.0 | 2.0 |  | 4.0 |  | Ins_Govt | 2.66726 |
| 7 | 2.0 | 2.0 | 3.0 | 2.0 |  | Ins_Govt | 2.646436 |
| 9 |  | 2.0 | 3.0 | 2.0 |  | Ins_Govt | 2.595238 |
| 10 |  | 2.0 |  | 4.0 |  | Ins_Govt | 2.576237 |
| 15 | 2.0 | 3.0 |  | 2.0 |  | Ins_Govt | 2.380658 |
| 18 |  | 3.0 |  | 2.0 |  | Ins_Govt | 2.281305 |


# Example 2 (post-partum depression)

Given that one is likely not interested in rules that are too complicated, if one is starting with a smaller set of variables, the library allows for explicit examination of all potential groups (which is easily feasible with data of this size). Here is an example of examining all potential groups of up to 3 characteristics in the same demographic data, but for post-partum depression, `CMAE04a1c`.
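To give a sense of what exhaustive enumeration involves, here is a minimal sketch that lists every rule of up to `k` characteristics from a dictionary of variable levels. The function `enumerate_rules` and the toy levels are my own assumptions for illustration; the library may build its initial population differently (for example, by keeping only combinations that actually occur in the data).

```python
from itertools import combinations, product

# Hypothetical sketch, NOT the genrules implementation:
# list every rule that uses between 1 and k variables.
def enumerate_rules(levels, k):
    rules = []
    for size in range(1, k + 1):
        # Choose which variables take part in the rule
        for variables in combinations(sorted(levels), size):
            # Then choose one level for each chosen variable
            for values in product(*(levels[v] for v in variables)):
                rules.append(dict(zip(variables, values)))
    return rules

# Toy levels, for illustration only
levels = {'AgeCat_V1': ['1', '2'], 'CRace': ['1', '2', '3']}
rules = enumerate_rules(levels, k=2)
print(len(rules))  # 11: 5 single-variable rules plus 6 pairs
```

The count grows quickly with the number of variables and levels, which is why full enumeration suits small variable sets and the genetic search suits larger ones.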

In [6]:
dep = 'CMAE04a1c'

rhv = dataprep.demo

# Do some data analysis to prep
dep_dat = dataprep.prep_dat(dep,rhv)

# Check all triplets of characteristics
dep_ge = genrules.genrules(data=dep_dat,y_var=dep,x_vars=rhv,k=3)

# Do not evolve, just see up to triples
dep_ge.evolve(rep=0)

tb = dep_ge.leaderboard
tb[['relrisk','pval','tot_n','out_n','label']].head(10)

Creating initial pop, starting at 2021-10-11 16:10:09.991759
Total N of initial population 2633 (finished @ 2021-10-11 16:10:10.083759)

Creating initial leaderboard @ 2021-10-11 16:10:10.083759
Initial candidates added to leaderboard 147
Finished Initial leaderboard @ 2021-10-11 16:10:17.707786


|   | relrisk | pval | tot_n | out_n | label |
|---|---|---|---|---|---|
| 0 | 5.155051 | 1.45345e-09 | 57 | 11 | `{'CRace': '1', 'Education': '1', 'ins_type': '...` |
| 1 | 4.511942 | 1.971812e-11 | 102 | 17 | `{'CRace': '1', 'BMI_Cat': '2', 'Education': '1'}` |
| 2 | 3.672808 | 1.72138e-05 | 65 | 9 | `{'CRace': '1', 'BMI_Cat': '5', 'Education': '2'}` |
| 3 | 3.458306 | 4.200219e-05 | 69 | 9 | `{'CRace': '1', 'BMI_Cat': '4', 'Education': '4'}` |
| 4 | 3.422066 | 0.0002815216 | 54 | 7 | `{'Education': '6', 'poverty': '2'}` |
| 5 | 3.422066 | 0.0002815216 | 54 | 7 | `{'BMI_Cat': '4', 'Education': '4', 'poverty': ...` |
| 6 | 3.408511 | 5.165685e-05 | 70 | 9 | `{'CRace': '1', 'BMI_Cat': '5', 'poverty': '2'}` |
| 7 | 3.299094 | 0.0004225815 | 56 | 7 | `{'AgeCat_V1': '3', 'BMI_Cat': '5', 'ins_type':...` |
| 8 | 3.156943 | 0.0003130916 | 67 | 8 | `{'BMI_Cat': '5', 'poverty': '2', 'ins_type': '...` |
| 9 | 3.159759 | 0.001465465 | 50 | 6 | `{'BMI_Cat': '5', 'Education': '2', 'ins_type':...` |


This shows that the top rules are often associated with white mothers. Note that this technique is explicitly exploratory data mining and does not guarantee any particular *causal* association. It may be that white mothers are more likely to be diagnosed with post-partum depression, and that drives the particular associations identified here.

Similarly, for the fifth rule (row 4 in the table), those with higher levels of education and poverty status have higher rates of post-partum depression. This may be a true effect, or it may be because those with more education are more likely to be familiar with depression and thus more likely to seek treatment.

But this library provides a convenient way to peruse large sets of data and identify potential co-morbidities for further study.

# Example 3 (hypertensive/eclampsia)

The examples so far have focused on just one set of variables to examine comorbidities (what I label as demographic, but which includes overall weight and insurance status). One can swap in any variables one wants, though. Here I show an example for the outcome of hypertensive disorder, variable `CBAC01`, along with a set of drug variables taken during the 2 months before pregnancy (coded from variables `DrugCode` to `DrugCode_27` and `VXXC01g` to `VXXC01g_27`).

These are all encoded as dummy variables, e.g. `Drug_102=1` means the mother took an NSAID sometime in the two months before the pregnancy. This allows one to find any particular combination of drugs without worrying about the original ordering of the `DrugCode` variables.
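The benefit of the dummy encoding can be shown with a small sketch. It is an illustration under my own assumptions, not the code in `dataprep`: the helper `drug_dummies` is hypothetical, and each toy row simply lists the drug codes recorded for one mother, with `None` for unused slots.

```python
# Hypothetical helper, NOT part of dataprep: turn lists of drug codes
# into one 0/1 dummy variable per code, ignoring the slot order.
def drug_dummies(code_rows):
    all_codes = sorted({c for row in code_rows for c in row if c is not None})
    dummies = []
    for row in code_rows:
        taken = {c for c in row if c is not None}
        dummies.append({f'Drug_{c}': int(c in taken) for c in all_codes})
    return dummies

# Toy rows: the first two mothers took the same drugs, recorded in a different order
rows = [[102, 341, None], [341, 102, None], [510, None, None]]
encoded = drug_dummies(rows)
print(encoded[0] == encoded[1])  # True
```

Because the first two rows encode identically, a rule such as `{'Drug_102': 1, 'Drug_341': 1}` matches both mothers regardless of which slot each code was recorded in.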

Because this is a more expansive variable set (87 drug codes in total), I illustrate starting the genrules algorithm with a smaller number of inputs but allowing the evolution to run for more iterations (although here it would take only a few minutes longer to proceed as I did in the prior examples). Then I run the evolutionary algorithm a second time, setting mutations to drop characteristics. One can view the progress to see if the algorithm is still identifying rules, or if it is stuck in a particular local maximum.
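To make the two phases concrete, here is a toy sketch of what "add" and "remove" mutations on a rule could look like. It is purely illustrative and is not how `genrules` or `evol` implement mutation: the functions `mutate_add` and `mutate_remove` are hypothetical, and scoring and selection are left out.

```python
import random

# Hypothetical mutation operators, NOT the genrules implementation.
def mutate_add(rule, levels, rng):
    """Return a copy of the rule with one extra, previously unused variable."""
    unused = [v for v in sorted(levels) if v not in rule]
    if not unused:
        return dict(rule)
    variable = rng.choice(unused)
    child = dict(rule)
    child[variable] = rng.choice(levels[variable])
    return child

def mutate_remove(rule, rng):
    """Return a copy of the rule with one variable dropped (never empties it)."""
    if len(rule) <= 1:
        return dict(rule)
    child = dict(rule)
    del child[rng.choice(sorted(child))]
    return child

# Toy drug dummies, for illustration only
levels = {'Drug_102': [0, 1], 'Drug_341': [0, 1], 'Drug_510': [0, 1]}
rng = random.Random(0)
parent = {'Drug_341': 1}
grown = mutate_add(parent, levels, rng)   # rule gains one attribute
shrunk = mutate_remove(grown, rng)        # rule loses one attribute
```

Growing rules first and pruning them afterwards is one way to check whether a simpler rule carries the same risk as a longer one.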

In [7]:
hyper = 'CBAC01'
# These are dummy variables representing
# drugs taken in the 2 months before pregnancy
rhv = dataprep.drug_dummyvars
hyper_dat = dataprep.prep_dat(hyper,rhv)

# Only do single variables to start, so generations are faster
# Lessen penalty for extra variables and smaller samples
hy_ge = genrules.genrules(data=hyper_dat,y_var=hyper,x_vars=rhv,k=1,pen_var=0,min_samp=30)

# Evolve adding in attributes 6 rounds
hy_ge.evolve(rep=6)
# set_mute sets mutations to remove attributes, remove attributes for 6 rounds
hy_ge.evolve(rep=6,set_mute='remove')

tb = hy_ge.leaderboard
tb[['relrisk','pval','tot_n','out_n','label']].head(20)

Creating initial pop, starting at 2021-10-11 16:10:17.828783
Total N of initial population 172 (finished @ 2021-10-11 16:10:17.828783)

Creating initial leaderboard @ 2021-10-11 16:10:17.828783
Initial candidates added to leaderboard 42

Generation 1 starting @ 2021-10-11 16:10:18.105759
Total new cases added to leaderboard 29

Generation 2 starting @ 2021-10-11 16:10:18.767759
Total new cases added to leaderboard 28

Generation 3 starting @ 2021-10-11 16:10:19.618759
Total new cases added to leaderboard 22

Generation 4 starting @ 2021-10-11 16:10:20.687788
Total new cases added to leaderboard 17

Generation 5 starting @ 2021-10-11 16:10:22.005785
Total new cases added to leaderboard 13

Generation 6 starting @ 2021-10-11 16:10:23.555759
Total new cases added to leaderboard 10

Generation 7 starting @ 2021-10-11 16:10:25.407760
Total new cases added to leaderboard 13

Generation 8 starting @ 2021-10-11 16:10:27.549785
Total new cases added to leaderboard 13

Generation 9 starting @ 20

|   | relrisk | pval | tot_n | out_n | label |
|---|---|---|---|---|---|
| 0 | 2.568663 | 2.331468e-15 | 55 | 31 | `{'Drug_660': 0, 'Drug_183': 1, 'Drug_273': 0}` |
| 1 | 2.533432 | 0.0 | 135 | 74 | `{'Drug_341': 1, 'Drug_640': 0}` |
| 2 | 2.533432 | 0.0 | 135 | 74 | `{'Drug_341': 1}` |
| 3 | 2.514843 | 8.21565e-15 | 58 | 32 | `{'Drug_164': 0, 'Drug_520': 0, 'Drug_183': 1, ...` |
| 4 | 2.471923 | 4.396483e-14 | 59 | 32 | `{'Drug_164': 0, 'Drug_520': 0, 'Drug_183': 1, ...` |
| 5 | 2.463804 | 4.130687e-09 | 35 | 19 | `{'Drug_271': 0, 'Drug_101': 0, 'Drug_610': 0, ...` |
| 6 | 2.393375 | 1.729838e-12 | 59 | 31 | `{'Drug_194': 0, 'Drug_183': 1, 'Drug_103': 0, ...` |
| 7 | 2.39235 | 0.0 | 88 | 46 | `{'Drug_300': 0, 'Drug_192': 1, 'Drug_195': 0}` |
| 8 | 2.390306 | 8.980594e-13 | 61 | 32 | `{'Drug_183': 1, 'Drug_103': 0, 'Drug_349': 0}` |
| 9 | 2.390306 | 8.980594e-13 | 61 | 32 | `{'Drug_164': 0, 'Drug_183': 1, 'Drug_273': 0, ...` |


When a drug variable equals zero in the above rules, it means the mother *did not take* that particular drug. While that could potentially be informative (drugs in the `5??` range are vitamins), we are likely more interested in drug combinations that were actively taken.

Again, the library's results are flexible enough to allow further exploration. Here I show the active table, filtered to keep only those drugs that were actively taken.

In [8]:
drug_act = hy_ge.active_table(type='att')
drug_act[drug_act['Attribute'] == 1]

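|   | Variable | Attribute | TotActive |
|---|---|---|---|
| 1 | Drug_510 | 1 | 18 |
| 7 | Drug_183 | 1 | 16 |
| 9 | Drug_230 | 1 | 15 |
| 18 | Drug_213 | 1 | 10 |
| 19 | Drug_176 | 1 | 10 |
| 20 | Drug_101 | 1 | 10 |
| 34 | Drug_192 | 1 | 6 |
| 36 | Drug_109 | 1 | 6 |
| 37 | Drug_341 | 1 | 5 |
| 41 | Drug_212 | 1 | 5 |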


One particular drug that has a high risk by itself is `Drug_341`, which is insulin. Because taking insulin is associated with diabetes, we may be interested in further exploring the relationship between diabetes and hypertensive disorder. Here I enumerate all combinations of up to four characteristics among race, BMI, age, and diabetes ever diagnosed (`CMAE03`, where 1 = before pregnancy, 2 = during, and 3 = no).

In [9]:
rhv = ['CRace','BMI_Cat','AgeCat_V1','CMAE03']
hyper_dat2 = dataprep.prep_dat(hyper,rhv)

hy_ge2 = genrules.genrules(data=hyper_dat2,y_var=hyper,x_vars=rhv,k=4)

# Enumerates all possible 4 categories, no need to evolve
hy_ge2.evolve(rep=0)

tb = hy_ge2.leaderboard
tb[['relrisk','pval','tot_n','out_n','label']].head(20)

Creating initial pop, starting at 2021-10-11 16:10:43.688026
Total N of initial population 624 (finished @ 2021-10-11 16:10:43.714022)

Creating initial leaderboard @ 2021-10-11 16:10:43.714022
Initial candidates added to leaderboard 56
Finished Initial leaderboard @ 2021-10-11 16:10:45.570024


|   | relrisk | pval | tot_n | out_n | label |
|---|---|---|---|---|---|
| 0 | 2.232373 | 3.64985e-09 | 55 | 27 | `{'CRace': '1', 'AgeCat_V1': '2', 'CMAE03': '1'}` |
| 1 | 2.205077 | 7.032165e-10 | 64 | 31 | `{'CRace': '1', 'CMAE03': '1'}` |
| 2 | 2.157233 | 2.220446e-16 | 132 | 62 | `{'CMAE03': '1'}` |
| 3 | 2.096073 | 2.392531e-13 | 118 | 54 | `{'AgeCat_V1': '2', 'CMAE03': '1'}` |
| 4 | 1.886679 | 9.657844e-06 | 65 | 27 | `{'BMI_Cat': '3', 'AgeCat_V1': '2', 'CMAE03': '2'}` |
| 5 | 1.752122 | 3.123432e-05 | 83 | 32 | `{'BMI_Cat': '3', 'CMAE03': '2'}` |
| 6 | 1.589956 | 0.0003834972 | 100 | 35 | `{'CRace': '3', 'BMI_Cat': '5', 'AgeCat_V1': '2...` |
| 7 | 1.590371 | 0.001509286 | 77 | 27 | `{'BMI_Cat': '5', 'AgeCat_V1': '3'}` |
| 8 | 1.569991 | 0.0003428514 | 110 | 38 | `{'CRace': '3', 'BMI_Cat': '5', 'CMAE03': '3'}` |
| 9 | 1.571831 | 0.002332883 | 75 | 26 | `{'CRace': '3', 'AgeCat_V1': '3'}` |


While it does appear that being diabetic *before* pregnancy increases the risk of hypertensive disorder, the magnitude of that association is a relative risk of slightly over 2 (from around 20% to nearly 50%). It does not appear that any other demographic characteristic increases that risk by a substantive amount over just being diabetic (the single rule of being diabetic before pregnancy is 3rd in the rankings out of all possible combinations of these four variables).
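The two percentages can be recovered from the `{'CMAE03': '1'}` row of the table (62 cases among 132 mothers, relative risk 2.157233). This short calculation assumes the relative risk is the risk inside the group divided by the risk among everyone else, so the comparison rate is implied rather than read from the data.

```python
# Values copied from the {'CMAE03': '1'} row of the leaderboard
out_n, tot_n, relrisk = 62, 132, 2.157233

# Risk of hypertensive disorder among mothers diabetic before pregnancy
risk_diabetic = out_n / tot_n

# Implied risk among all other mothers, assuming relrisk = risk_in / risk_out
risk_others = risk_diabetic / relrisk

print(round(risk_diabetic, 2), round(risk_others, 2))  # 0.47 0.22
```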

While leveraging this library still takes human input as to which outcomes to examine and which variables to check for co-morbidities, I hope these functions provide much easier tools to explore the very high-dimensional nuMoM2b data. It will still take informed researchers to peruse these rules and use personal knowledge of the data to judge which are likely spurious associations and which are potential co-morbidities that should be further explored.