# Severity Score Calibration
## Cohort Extraction
### C.V. Cosgriff, MIT Critical Data

Here we will extract the necessary variables for feature engineering and cohort construction to build a number of simple mortality models. These models will be trained on the entire cohort and on risk-stratified subcohorts to examine whether model calibration can be improved simply by training on populations with more homogeneous case severity.

The features we will extract are as follows:

* Age, gender, weight (`patient` table)
* Source of admission, unit type (`patient` table)
* Laboratory data  (`labsfirstday` materialized view)
    * Blood gases: PaO2, pH, base excess, bicarbonate
    * Hematology: hematocrit, hemoglobin, lymphocytes, neutrophils, platelets, white cell count
    * Electrolytes: calcium, chloride, ionized calcium, magnesium, phosphate, sodium
    * Biochemistry: albumin, amylase, bilirubin, blood urea nitrogen (BUN), B-natriuretic peptide, creatine phosphokinase (cpk), creatinine, lactate, lipase, troponin I/T, pH, bicarbonate, base excess, glucose
    * Coagulation: PT/INR, fibrinogen
* Vital signs (`vitalsfirstday` materialized view)
* Diagnoses: Modified Elixhauser (`diagnosis_groups.R`)
* Treatments (`treatmentfirstday` materialized view)
    * Antiarrhythmics
    * Antibiotics
    * Vasopressors
    * Sedatives
    * Diuretics
    * Blood products
* Ventilation status (`apachepredvar` table)
* APACHE IV Features (`apachepredvar` table)

The label for our classifier and each patient's baseline APACHE IV predicted mortality are located in `apachepatientresult`.

## 0 - Environment

In [1]:
import pandas as pd
import numpy as np
import psycopg2
from sklearn.model_selection import train_test_split

# postgres environment setup; placeholders here, substitute your own info
dbname = 'eicu'
schema_name = 'eicu_crd'
query_schema = 'SET search_path TO ' + schema_name + ';'

# connect to the database
con = psycopg2.connect(dbname = dbname)

## 1 - Materialized Views

We will generate the requisite materialized views to aid the extraction of the cohort features. We start by introducing helper functions for interacting with the eICU database.

In [2]:
def execute_query_safely(sql, con):
    cur = con.cursor()
    try:
        cur.execute(sql)
    except Exception:
        # on any error, roll back the transaction and re-raise;
        # the finally clause closes the cursor (not the connection)
        con.rollback()
        raise
    finally:
        cur.close()

def generate_materialized_view(query_file, query_schema):
    with open(query_file) as fp:
        query = ''.join(fp.readlines())
    print('Generating materialized view using {} ...'.format(query_file), end = ' ')
    execute_query_safely(query_schema + query, con)
    print('done.')
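
As an aside, the same rollback-on-error pattern can be written with the cursor as a context manager. This is a minimal sketch rather than the helper used in this notebook: `run_sql` and its `commit` flag are hypothetical names, and committing is an assumption that only matters if the views should persist after the connection is closed.

```python
def run_sql(con, sql, commit=True):
    """Execute `sql` on a DB-API connection, rolling back on failure.

    `commit=True` is an assumption: without a commit, objects created in
    the transaction are only visible to this connection.
    """
    try:
        # psycopg2 cursors are context managers; leaving the block closes the cursor
        with con.cursor() as cur:
            cur.execute(sql)
        if commit:
            con.commit()
    except Exception:
        # undo the failed transaction so the connection stays usable
        con.rollback()
        raise
```

It would be called as `run_sql(con, query_schema + query)` in place of `execute_query_safely`.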

Now we generate the views.

In [3]:
generate_materialized_view('./sql/vitalsfirstday.sql', query_schema)
generate_materialized_view('./sql/labsfirstday.sql', query_schema)
generate_materialized_view('./sql/treatmentfirstday.sql', query_schema)

Generating materialized view using ./sql/vitalsfirstday.sql ... done.
Generating materialized view using ./sql/labsfirstday.sql ... done.
Generating materialized view using ./sql/treatmentfirstday.sql ... done.


We will use an R script from a previous project to assign diagnosis groupings. These groupings were based on the ICD-9 codes from the Elixhauser system, but were modified and do not represent valid Elixhauser scoring groupings.

In [4]:
!Rscript ./R/diagnosis_groups.R

Loading required package: methods
Loading required package: DBI

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Attaching package: ‘dbplyr’

The following objects are masked from ‘package:dplyr’:

    ident, sql



## 2 - Load Features

We begin by loading a base cohort. As our most important exclusion criterion is whether or not a patient had an APACHE IVa score (so we can fit models in the subpopulations), we can join the `patient` table on the `apachepredvar` and `apachepatientresult` tables. There are 200,859 ICU stays, and so the number of rows returned will be the remainder after exclusion of patients for which APACHE data is not available.

In [5]:
base_cohort_query = query_schema + '''
WITH apacheIV AS
(
    SELECT patientunitstayid, apachescore
         , predictedhospitalmortality, actualhospitalmortality 
    FROM apachepatientresult
    WHERE apacheversion = 'IV'
    AND apachescore > 0
)

SELECT p.patientunitstayid, p.age, p.gender, p.ethnicity, p.admissionheight AS height
       , p.admissionweight AS weight , p.unittype AS unit_type, p.unitadmitsource
       , p.unitdischargeoffset AS unit_los, p.hospitaldischargeoffset AS hospital_los
       , a.day1meds AS gcs_meds, a.day1verbal AS gcs_verbal, a.day1motor AS gcs_motor
       , a.day1eyes AS gcs_eyes, a.admitDiagnosis AS admit_diagnosis
       , a.thrombolytics AS apache_thrombolytics, a.electivesurgery AS apache_elect_surg
       , activetx AS apache_active_tx, a.readmit AS apache_readmit, a.ima AS apache_ima
       , a.midur AS apache_midur, a.ventday1 AS apache_ventday1_worst
       , a.oOBVentDay1 AS apache_ventday1, a.oOBIntubDay1 AS apache_intubday1
       , a.day1fio2 AS apache_fio2, a.day1pao2 AS apache_pao2, (a.day1pao2 / a.day1fio2) AS apache_o2ratio
       , a.ejectfx AS apache_ejectfx, a.creatinine AS apache_creatinine
       , a.graftcount AS apache_graftcount, o.predictedhospitalmortality AS apache_prediction
       , o.apachescore AS apache_score, o.actualhospitalmortality AS hospital_expiration
FROM patient p
INNER JOIN apachepredvar a
ON p.patientunitstayid = a.patientunitstayid
INNER JOIN apacheIV o
ON p.patientunitstayid = o.patientunitstayid
ORDER BY patientunitstayid;
'''
base_cohort = pd.read_sql_query(base_cohort_query, con)
base_cohort.shape

(146689, 33)

Of the 200,859 ICU stays in the database, 146,689 have APACHE IV variables recorded as well as a prediction carried out. We'll then load the rest of the feature set into a dataframe.

In [6]:
feature_set_query = query_schema + '''
SELECT v.patientunitstayid, v.HR_Mean, v.SBP_periodic_Mean, v.DBP_periodic_Mean
    , v.MAP_periodic_Mean, v.SBP_aperiodic_Mean, v.DBP_aperiodic_Mean
    , v.MAP_aperiodic_Mean, v.RR_Mean, v.SpO2_Mean, v.TempC_Mean
    , ANIONGAP_min, ANIONGAP_max, ALBUMIN_min, ALBUMIN_max 
    , AMYLASE_min, AMYLASE_max, BASEEXCESS_min, BASEEXCESS_max
    , BICARBONATE_min, BICARBONATE_max, BUN_min, BUN_max, BNP_min
    , BNP_max, CPK_min, CPK_max, BILIRUBIN_min, BILIRUBIN_max
    , CALCIUM_min, CALCIUM_max, IONCALCIUM_min, IONCALCIUM_max
    , CREATININE_min, CREATININE_max, CHLORIDE_min, CHLORIDE_max
    , GLUCOSE_min, GLUCOSE_max, HEMATOCRIT_min, HEMATOCRIT_max
    , FIBRINOGEN_min, FIBRINOGEN_max, LIPASE_min, LIPASE_max
    , HEMOGLOBIN_min, HEMOGLOBIN_max, LACTATE_min, LACTATE_max
    , LYMPHS_min, LYMPHS_max, MAGNESIUM_min, MAGNESIUM_max
    , PAO2_min, PAO2_max, PH_min, PH_max, PLATELET_min
    , PLATELET_max, PMN_min, PMN_max, PHOSPHATE_min, PHOSPHATE_max
    , POTASSIUM_min, POTASSIUM_max, PTT_min, PTT_max, INR_min
    , INR_max, PT_min, PT_max, SODIUM_min, SODIUM_max
    , TROPI_min, TROPI_max, TROPT_min, TROPT_max, WBC_min
    , WBC_max, t.abx, t.pressor, t.antiarr, t.sedative
    , t.diuretic, t.blood_product
FROM vitalsfirstday v
INNER JOIN labsfirstday l
ON v.patientunitstayid = l.patientunitstayid
INNER JOIN treatmentfirstday t
ON v.patientunitstayid = t.patientunitstayid;
'''

feature_set = pd.read_sql_query(feature_set_query, con)
feature_set.shape

(145578, 85)

We can then merge the two dataframes.

In [7]:
cohort = pd.merge(base_cohort, feature_set, on='patientunitstayid')
cohort.shape

(119884, 117)

Here we lose 26,805 patients (146,689 down to 119,884) who did not have data in one of the feature tables. Finally, we load the diagnosis groupings CSV to assign this last set of features.

In [8]:
dx_groups = pd.read_csv('./dx-firstday_groupings.csv')
cohort = pd.merge(cohort, dx_groups, on='patientunitstayid')
cohort.shape

(119776, 147)
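
The row losses in the inner merges above can be audited before committing to them. The following is a small sketch on toy data, not part of the original pipeline; `merge_audit` is a hypothetical helper that relies on the `indicator` option of `DataFrame.merge`.

```python
import pandas as pd

def merge_audit(left, right, key):
    """Count rows of `left` with and without a match in `right` on `key`."""
    merged = left[[key]].merge(
        right[[key]].drop_duplicates(), on=key, how='left', indicator=True
    )
    return {
        'matched': int((merged['_merge'] == 'both').sum()),
        'unmatched': int((merged['_merge'] == 'left_only').sum()),
    }

# toy stand-ins for base_cohort and feature_set
toy_base = pd.DataFrame({'patientunitstayid': [1, 2, 3, 4]})
toy_features = pd.DataFrame({'patientunitstayid': [2, 3, 5]})
audit = merge_audit(toy_base, toy_features, 'patientunitstayid')
```

On the real data this would be called as `merge_audit(base_cohort, feature_set, 'patientunitstayid')`; the `unmatched` count corresponds to the stays dropped by the inner merge.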

## 3 - Inclusion / Exclusion

By nature of our SQL query, we have already excluded patients not eligible/capable of producing a valid score, and patients who lack all vital/lab/treatment data. We can then apply the following and keep only patients who are:

1. Not readmissions
2. Not admitted from another ICU
3. Admitted to ICU for $\geq4$ hours
4. Hospital admission $\lt365$ days
5. Not burn patients
6. Age $\geq16$, age $\leq89$, and age not missing

__1 - Not Readmitted__

In [9]:
cohort = cohort.loc[cohort.apache_readmit == 0, :]
cohort.shape

(113179, 147)

__2 - Not Admitted from Another ICU__

In [10]:
cohort = cohort.loc[cohort.unitadmitsource != 'Other ICU', :]
cohort.shape

(112696, 147)

__3 - ICU LoS $\geq$4h__

In [11]:
cohort = cohort.loc[cohort.unit_los >= 240, :]
cohort.shape

(112696, 147)

__4 - Hospital LoS $\lt$365 days__

In [12]:
cohort = cohort.loc[cohort.hospital_los < 525600, :]
cohort.shape

(112684, 147)

__5 - Not Burn Patients__ 

In [13]:
cohort = cohort.loc[cohort.admit_diagnosis != 'BURN', :]
cohort.shape

(112673, 147)

__6 - Age $\geq$16, $\leq$89, and Not Missing__

In [14]:
cohort = cohort.loc[cohort.age != '> 89', :]
cohort = cohort.loc[cohort.age != '', :]
cohort.age = cohort.age.astype('int64')
cohort = cohort.loc[cohort.age >= 16, :]
cohort.shape

(108547, 147)
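
The six filters above can also be expressed as one pass that records how many rows survive each step. This is a sketch on toy data, assuming the same column names as the cohort; `apply_exclusions` and the two constants are hypothetical names. The age rule uses `pd.to_numeric(..., errors='coerce')`, which turns both `'> 89'` and `''` into `NaN` so that they fail the comparison; it is a compact equivalent of the three string and integer steps, not the code used above, and it leaves the `age` column as strings.

```python
import pandas as pd

MIN_ICU_MINUTES = 4 * 60              # 4 hours expressed in minutes (240)
MAX_HOSPITAL_MINUTES = 365 * 24 * 60  # 365 days expressed in minutes (525600)

def apply_exclusions(df):
    """Apply the exclusion criteria in order and log the remaining row counts."""
    steps = [
        ('not readmitted', lambda d: d['apache_readmit'] == 0),
        ('not from other ICU', lambda d: d['unitadmitsource'] != 'Other ICU'),
        ('ICU stay >= 4h', lambda d: d['unit_los'] >= MIN_ICU_MINUTES),
        ('hospital stay < 365d', lambda d: d['hospital_los'] < MAX_HOSPITAL_MINUTES),
        ('not burn', lambda d: d['admit_diagnosis'] != 'BURN'),
        ('age 16-89, not missing',
         lambda d: pd.to_numeric(d['age'], errors='coerce') >= 16),
    ]
    rows = [('start', len(df))]
    for name, rule in steps:
        df = df.loc[rule(df)]
        rows.append((name, len(df)))
    return df, pd.DataFrame(rows, columns=['step', 'n_remaining'])

# toy data: the first row passes everything, each later row fails one criterion
toy = pd.DataFrame({
    'apache_readmit': [0, 1, 0, 0, 0, 0, 0],
    'unitadmitsource': ['ED', 'ED', 'Other ICU', 'ED', 'ED', 'ED', 'ED'],
    'unit_los': [500, 500, 500, 100, 500, 500, 500],
    'hospital_los': [9000, 9000, 9000, 9000, 600000, 9000, 9000],
    'admit_diagnosis': ['X', 'X', 'X', 'X', 'X', 'BURN', 'X'],
    'age': ['60', '60', '60', '60', '60', '60', '> 89'],
})
kept, attrition = apply_exclusions(toy)
```

The `attrition` table gives the same information as the sequence of `cohort.shape` outputs above, in a form that can be saved alongside the cohort.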

## 4 - Cleaning and Formatting

Before anything else, we can drop numerous variables from the pull as we won't need both min/max for various features. Instead we'll be going by the following principle: the most _abnormal_ laboratory value in the first 24 hours of ICU stay will be included. That is, we will use:

* The minimum value for: bicarbonate, chloride, calcium, magnesium, base excess (including negative values), platelets, hemoglobin, phosphate, fibrinogen, pH and hematocrit
* The maximum value for: creatinine, BUN, bilirubin, PT/INR, lactate, troponin I/T, amylase, lipase, B-natriuretic peptide and creatine phosphokinase
* For sodium, which aberrantly deviates bidirectionally, the most abnormal value is defined as the value with the greatest deviation from the normal range boundaries.
    * This can be applied to glucose and potassium as well.
* For white blood cell and neutrophil counts, if any measurements were lower than the lower limit of normal, the minimum value was used; if the minimum was within normal range then the maximum was used as the most abnormal value.

In [15]:
# for the unidirectional aberrations, just drop what isn't needed
lab_drop = ['bicarbonate_max', 'chloride_max', 'calcium_max', 'magnesium_max', 
            'baseexcess_max', 'platelet_max', 'hemoglobin_max', 'phosphate_max', 
            'fibrinogen_max', 'ph_max', 'hematocrit_max', 'creatinine_min',
            'bun_min', 'bilirubin_min', 'pt_min', 'inr_min', 'lactate_min', 
            'tropi_min', 'tropt_min', 'amylase_min', 'lipase_min', 'bnp_min',
            'cpk_min', 'albumin_max', 'ioncalcium_max', 'pao2_max',
            'ptt_min', 'aniongap_min']
cohort = cohort.drop(lab_drop, axis=1)

# sodium, deviates bidirectionally
sodium_check = abs(cohort.sodium_min - 135.) >= abs(cohort.sodium_max - 145.)
sodium = np.empty(len(cohort.index), dtype='float64')
sodium[sodium_check] = cohort.sodium_min[sodium_check]
sodium[~sodium_check] = cohort.sodium_max[~sodium_check]
cohort = cohort.assign(sodium=sodium)
cohort = cohort.drop(['sodium_min', 'sodium_max'], axis=1)

# potassium, deviates bidirectionally, same treatment then
potassium_check = abs(cohort.potassium_min - 3.5) >= abs(cohort.potassium_max - 5.0)
potassium = np.empty(len(cohort.index), dtype='float64')
potassium[potassium_check] = cohort.potassium_min[potassium_check]
potassium[~potassium_check] = cohort.potassium_max[~potassium_check]
cohort = cohort.assign(potassium=potassium)
cohort = cohort.drop(['potassium_min', 'potassium_max'], axis=1)

# similar treatment for glucose since hyperglycemia and hypoglycemia can both
# be important dependent on the clinical context.
glucose_check = abs(cohort.glucose_min - 70) >= abs(cohort.glucose_max - 130)
glucose = np.empty(len(cohort.index), dtype='float64')
glucose[glucose_check] = cohort.glucose_min[glucose_check]
glucose[~glucose_check] = cohort.glucose_max[~glucose_check]
cohort = cohort.assign(glucose=glucose)
cohort = cohort.drop(['glucose_min', 'glucose_max'], axis=1)

# wbc counts
wbc_check = cohort.wbc_min < 2
pmn_check = cohort.pmn_min < 45
lym_check = cohort.lymphs_min < 20

wbc = np.empty(len(cohort.index), dtype='float64')
wbc[wbc_check] = cohort.wbc_min[wbc_check]
wbc[~wbc_check] = cohort.wbc_max[~wbc_check]
cohort = cohort.assign(wbc=wbc)
cohort = cohort.drop(['wbc_min', 'wbc_max'], axis=1)

pmn = np.empty(len(cohort.index), dtype='float64')
pmn[pmn_check] = cohort.pmn_min[pmn_check]
pmn[~pmn_check] = cohort.pmn_max[~pmn_check]
cohort = cohort.assign(pmn=pmn)
cohort = cohort.drop(['pmn_min', 'pmn_max'], axis=1)

lym = np.empty(len(cohort.index), dtype='float64')
lym[lym_check] = cohort.lymphs_min[lym_check]
lym[~lym_check] = cohort.lymphs_max[~lym_check]
cohort = cohort.assign(lym=lym)
cohort = cohort.drop(['lymphs_min', 'lymphs_max'], axis=1)
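
The two selection rules repeated in the cell above can be sketched as small helpers. This is an illustrative refactoring under the same rules, not the code that produced the cohort; `most_abnormal_bidirectional` and `most_abnormal_low_else_max` are hypothetical names, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def most_abnormal_bidirectional(vmin, vmax, low, high):
    """Pick min if it is at least as far from `low` as max is from `high`, else max."""
    use_min = (vmin - low).abs() >= (vmax - high).abs()
    # comparisons involving NaN are False, so those rows fall back to vmax
    return pd.Series(np.where(use_min, vmin, vmax), index=vmin.index, dtype='float64')

def most_abnormal_low_else_max(vmin, vmax, low):
    """Pick min if it is below `low`, else max (rule used for the cell counts)."""
    use_min = vmin < low
    return pd.Series(np.where(use_min, vmin, vmax), index=vmin.index, dtype='float64')

toy = pd.DataFrame({
    'sodium_min': [128.0, 138.0, np.nan],
    'sodium_max': [146.0, 155.0, 150.0],
    'wbc_min': [1.0, 4.0, 5.0],
    'wbc_max': [3.0, 12.0, 9.0],
})
toy_sodium = most_abnormal_bidirectional(toy.sodium_min, toy.sodium_max, 135., 145.)
toy_wbc = most_abnormal_low_else_max(toy.wbc_min, toy.wbc_max, 2)
```

Note that the bidirectional rule compares absolute distances to each boundary, so a minimum that sits well inside the normal range (for example 140 against a maximum of 146) is still selected; whether that is desirable depends on the modeling goal.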

We expect there to be substantial missing data in the EHR. We also note on inspection that some of the missing data in eICU is simply coded as -1, and this is evident even in the APACHE predictions, which are required for inclusion. As such, we'll swap -1 with `np.nan`, and then remove patients missing their APACHE prediction. From there we'll inspect the missing data quantities.

In [16]:
cohort = cohort.replace(-1, np.nan)
cohort.apache_prediction = cohort.apache_prediction.astype('float64')
cohort = cohort.loc[cohort.apache_prediction >= 0., :]
with pd.option_context('display.max_rows', None):
    display(cohort.isna().sum())

patientunitstayid             0
age                           0
gender                        0
ethnicity                     0
height                     1493
weight                     1280
unit_type                     0
unitadmitsource               0
unit_los                      0
hospital_los                  0
gcs_meds                      0
gcs_verbal                 1178
gcs_motor                  1178
gcs_eyes                   1178
admit_diagnosis               0
apache_thrombolytics          0
apache_elect_surg         86437
apache_active_tx              0
apache_readmit                0
apache_ima                    0
apache_midur                  0
apache_ventday1_worst         0
apache_ventday1               0
apache_intubday1              0
apache_fio2               82300
apache_pao2               82300
apache_o2ratio                0
apache_ejectfx           105424
apache_creatinine         20486
apache_graftcount             0
apache_prediction             0
apache_s

* There are 1178 patients missing GCS terms, but this actually agrees with the GCS meds indicator.
    * Thus this serves as an indicator of why those are missing
* For unclear reasons, the elective surgery indicator is missing for many patients in the APACHE table. I had presumed this was just left blank and thus `nan` = `0`, but there are also 0s in the data. Given the admit diagnosis and other data, it is unlikely to be a very important variable and we'll drop it.
* The temperature variable comes from the periodic sensor table, which can be a rather dirty source of data. Considering how much data is missing for this variable and the data source, we'll drop it.
* We have PaO2 data from both the APACHE table and from our direct pull of the labs; yet there is much more missing data in the APACHE variable version. We'll drop it and just use the direct pull. This is similarly true for creatinine. 
* We'll use aperiodic recordings of the BP since the periodic recording is missing in so many patients
* An index column `Unnamed: 0` was picked up from the diagnoses table; we'll drop it
* Many of the labs are missing for many patients; as was shown recently by Kohane's group, the presence of a lab is often more informative than its value. That said, GBM as implemented in xgboost can actually learn from missingness and is often fairly robust to large amounts of missing data.
    * We could therefore just leave the missing data as is
    * We could perform mean imputation
    * We could impute _normal_ values
    * We could implement a more sophisticated imputation approach
* We don't want LoS variables in there since they let us _peek into the future_.
* We only needed unitadmitsource for the exclusion criteria and can drop it.
* We'll save off the label (`hospital_expiration`) and APACHE IV predicted probability later, but can drop the APACHE IV score now.

In [17]:
cohort = cohort.drop(['sbp_periodic_mean', 'dbp_periodic_mean', 'map_periodic_mean',
                      'Unnamed: 0', 'tempc_mean', 'apache_elect_surg', 'apache_creatinine', 
                      'apache_pao2', 'apache_fio2', 'apache_o2ratio', 'hospital_los', 'unit_los', 
                      'unitadmitsource', 'apache_score',], axis=1)
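
Of the imputation options listed above, mean imputation combined with explicit missingness indicators might look like the sketch below. This is an assumption about how it could be done at modeling time, not a step of this notebook; `add_missing_indicators` and `mean_impute` are hypothetical helpers and the lactate values are toy data. The means are taken from the training rows only so that no information from the test rows is used.

```python
import numpy as np
import pandas as pd

def add_missing_indicators(df, cols):
    """Append a 0/1 `<col>_missing` column for each column in `cols`."""
    out = df.copy()
    for c in cols:
        out[c + '_missing'] = out[c].isna().astype(int)
    return out

def mean_impute(train, test, cols):
    """Fill NaNs in `cols` of both frames with the training-set means."""
    means = train[cols].mean()  # Series indexed by column name
    return train.fillna(means), test.fillna(means)

toy_train = pd.DataFrame({'lactate_max': [1.0, np.nan, 3.0]})
toy_test = pd.DataFrame({'lactate_max': [np.nan, 4.0]})

cols = ['lactate_max']
toy_train_f, toy_test_f = mean_impute(
    add_missing_indicators(toy_train, cols),
    add_missing_indicators(toy_test, cols),
    cols,
)
```

The indicator columns preserve the "was this lab ordered" signal that plain imputation would otherwise erase.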

With these gone, we can decide on the remainder of missingness handling during modeling. We next turn to formatting the data so that it will be amenable to modeling. This entails converting categorical variables into indicators, and thus we must first convert the strings composing the categories into good variable names.

We'll start with admission diagnoses, which include '-' and '/' characters.

In [18]:
cohort.admit_diagnosis = cohort.admit_diagnosis.str.replace('-', '_')
cohort.admit_diagnosis = cohort.admit_diagnosis.str.replace('/', '_')
adx_dummies = pd.get_dummies(cohort.admit_diagnosis, prefix='adx', drop_first=True)
cohort = pd.concat([cohort, adx_dummies], axis=1)
cohort = cohort.drop('admit_diagnosis', axis=1)

Next we turn to gender and ethnicity.

In [19]:
male_gender = (cohort.gender == 'Male').astype('int')
cohort = cohort.assign(male_gender=male_gender)
cohort = cohort.drop('gender', axis=1)

eth_map = {'Caucasian' : 'caucasian', 'Other/Unknown' : 'other', 
           'Native American' : 'native_american', 'African American' : 'african_american',
           'Asian' : 'asian', 'Hispanic' : 'hispanic', '' : 'other'}
cohort.ethnicity = cohort.ethnicity.map(eth_map)
eth_dummies = pd.get_dummies(cohort.ethnicity, prefix='eth', drop_first=True)
cohort = pd.concat([cohort, eth_dummies], axis=1)
cohort = cohort.drop('ethnicity', axis=1)

This leaves unit type as a categorical variable.

In [20]:
cohort.unit_type = cohort.unit_type.str.replace('-', '_')
cohort.unit_type = cohort.unit_type.str.replace(' ', '_')
unit_dummies = pd.get_dummies(cohort.unit_type, prefix='unit', drop_first=True)
cohort = pd.concat([cohort, unit_dummies], axis=1)
cohort = cohort.drop('unit_type', axis=1)
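
The clean, encode, concatenate, and drop sequence used for the categorical variables above can be sketched as a single helper. `encode_categorical` is a hypothetical name and the unit labels are toy values; `dtype=int` is an added choice so that the indicators come out as 0/1 integers regardless of pandas version.

```python
import pandas as pd

def encode_categorical(df, col, prefix, replace_chars='-/ '):
    """Sanitize the labels in `col` and replace it with indicator columns."""
    cleaned = df[col]
    for ch in replace_chars:
        # literal replacement so the characters are never read as a regex
        cleaned = cleaned.str.replace(ch, '_', regex=False)
    dummies = pd.get_dummies(cleaned, prefix=prefix, drop_first=True, dtype=int)
    return pd.concat([df.drop(columns=col), dummies], axis=1)

toy = pd.DataFrame({
    'patientunitstayid': [1, 2, 3],
    'unit_type': ['Med-Surg ICU', 'CCU-CTICU', 'Med-Surg ICU'],
})
encoded = encode_categorical(toy, 'unit_type', 'unit')
```

As in the cells above, `drop_first=True` drops the alphabetically first category, which then serves as the reference level.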

## 5 - Save Train/Test Split of Features and Label

We need to save the label and remove it from the features. We'll also need to save the APACHE prediction.

In [21]:
label = (cohort.hospital_expiration == 'EXPIRED').astype('int')
apache_pred = cohort.apache_prediction
cohort = cohort.drop(['hospital_expiration', 'apache_prediction'], axis=1)

And now we can form a train/test split.

In [22]:
train_X, test_X, train_y, test_y, train_apache, test_apache = train_test_split(cohort, label, apache_pred, test_size=0.25, random_state=42)
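
The split above is a plain random split. If matching the mortality rate in the train and test sets matters, `train_test_split` accepts a `stratify` argument; the sketch below shows this on toy data and is an alternative under that assumption, not what was run here. The toy frame and the 10% event rate are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy data: 1000 rows with a 10% event rate
toy_X = pd.DataFrame({'x': range(1000)})
toy_y = pd.Series([1] * 100 + [0] * 900, name='label')

tr_X, te_X, tr_y, te_y = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

# event rates in the two partitions
train_rate = tr_y.mean()
test_rate = te_y.mean()
```

With a large cohort and a random split the rates are usually close anyway, so this mostly matters for small or heavily imbalanced subcohorts.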

With that, we can save the CSV files corresponding to the data frames we generated.

In [23]:
# train files
train_X.to_csv('./data/train_X.csv', index=False)
train_y.to_csv('./data/train_y.csv', index=False, header=True)
train_apache.to_csv('./data/train_apache.csv', index=False, header=True)

# test files
test_X.to_csv('./data/test_X.csv', index=False)
test_y.to_csv('./data/test_y.csv', index=False, header=True)
test_apache.to_csv('./data/test_apache.csv', index=False, header=True)
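
Because the files are written with `index=False`, features, labels, and APACHE predictions are linked only by row order once reloaded. A quick round-trip check on toy data illustrates this; the file names and values below are placeholders written to a temporary directory, not the `./data` files.

```python
import os
import tempfile

import pandas as pd

# toy frames with a shuffled index, as produced by a random split
toy_X = pd.DataFrame({'age': [60, 45, 71]}, index=[10, 3, 7])
toy_y = pd.Series([0, 1, 0], index=[10, 3, 7], name='hospital_expiration')

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, 'X.csv')
    y_path = os.path.join(tmp, 'y.csv')
    toy_X.to_csv(x_path, index=False)
    toy_y.to_csv(y_path, index=False, header=True)
    X_back = pd.read_csv(x_path)
    y_back = pd.read_csv(y_path)['hospital_expiration']

# the original index is gone; rows are matched by position only
same_length = len(X_back) == len(y_back)
```

Any later step that reorders or filters one of the files must therefore apply the same operation to the others.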

With this portion complete, we can move on to the construction of our mortality models.