# Cohort creation

We define the cohorts for mortality prediction and length-of-stay prediction, based on the first 24 hours of each patient's stay.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

Establish a connection to the DB and import the helper functions for running queries.

In [2]:
import pandas as pd
from proto.etl.config import SSHInfoEicu, DBInfoEicu
from proto.etl.utils import connect_to_db_via_ssh, run_eicu_query, get_column_completeness, load_schema_for_modelling

conn = connect_to_db_via_ssh(SSHInfoEicu, DBInfoEicu)
cursor = conn.cursor()
query_schema = 'set search_path to eicu_crd;'

### Select the patients for mortality prediction

1. Keep patients from the top 5 hospitals.
2. Keep visits that were at least 26 hours long.
3. Save the unit stay IDs as a new view.

In [3]:
query = """
select  patientunitstayid, hospitalid, hosp_mort, region from icustay_detail
where hospitalid in (
    select distinct hospitalid  from patient_top5hospitals
    )
    and
    icu_los_hours >= 26
"""
df_patients = run_eicu_query(query, conn)

- We got 16.5k admissions in the 5 largest hospitals that fulfill these criteria.
- The class imbalance (deaths to survivors) is roughly 1:10, ranging from about 1:7 to 1:13 across the hospitals.

In [4]:
df_patients.shape

(16567, 4)

In [5]:
df_patients.groupby(['hospitalid', 'hosp_mort'])['patientunitstayid'].count()

hospitalid  hosp_mort
73          0.0          4404
            1.0           373
167         0.0          2103
            1.0           243
176         0.0          1637
            1.0           122
264         0.0          3587
            1.0           358
420         0.0          3161
            1.0           469
Name: patientunitstayid, dtype: int64
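
To put a number on the class imbalance, the counts above can be turned into per-hospital mortality rates. The following is a minimal sketch that re-enters the counts from the output above by hand; the names `counts`, `rates` and `ratio` are illustrative and not part of the notebook.

```python
import pandas as pd

# Survivor / death counts copied from the groupby output above.
counts = pd.DataFrame(
    {
        "survived": [4404, 2103, 1637, 3587, 3161],
        "died": [373, 243, 122, 358, 469],
    },
    index=pd.Index([73, 167, 176, 264, 420], name="hospitalid"),
)

# Fraction of stays with a recorded outcome that ended in death.
rates = counts["died"] / (counts["survived"] + counts["died"])

# Survivors per death, i.e. the N in a "1:N" imbalance ratio.
ratio = counts["survived"] / counts["died"]

print(pd.DataFrame({"mortality_rate": rates.round(3),
                    "survivors_per_death": ratio.round(1)}))
```

With `df_patients` loaded, `df_patients.groupby('hospitalid')['hosp_mort'].mean()` should give the same rates, because `mean` skips stays with a missing outcome.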

The hospitals themselves are from different parts of the US, although no region is recorded for hospital 176.

In [47]:
df_patients.groupby('hospitalid').first()['region']

hospitalid
73       Midwest
167         West
176         None
264      Midwest
420    Northeast
Name: region, dtype: object

Save these patients as a new view by executing `patient_top5hospitals_mort_dataset.sql`. 

## Feature creation

We want to predict, after the first 24 hours of a stay and with a look-ahead of at least 4 hours, whether a patient dies in hospital.

1. Discard everything that happened after the first 24 hours for the cohort.
2. Create one-row features, i.e. features that can be represented as a single row per patient.
    - gender,
    - age,
    - ethnicity,
    - height,
    - weight on admission
    - hospital region - embedded - 2
    - hospital unittype - embed - 2
    - apache_groups - embedded - 3
    - apacheapsvars - [imputed](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html)
    - apachepredvars - imputed (same as above)
    - labsfirstday - imputed
    - diagnoses - embedded and averaged
3. Create time-bucketed features. Here we deal with missing values by carrying the previous measurement forward or, for buckets before the first measurement, by carrying that first measurement all the way back to the admission time.
    - pivoted_lab - bucketed to 24 hours
    - pivoted_bg - bucketed to 24 hours
    - pivoted_med - bucketed to 24 hours
    - pivoted_o2 - bucketed to 24 hours
    - pivoted_score - bucketed to 24 hours
    - pivoted_uo - bucketed to 24 hours
    - pivoted_vital - bucketed to 24 hours
    - pivoted_vital_other - bucketed to 24 hours
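
The carry-forward / carry-back strategy from step 3 can be sketched with pandas as follows. This is only an illustration on toy data: it assumes that "bucketed to 24 hours" means 24 hourly buckets, that `chartoffset` is in minutes since unit admission, and that measurements charted before admission are dropped. The column `heartrate` and all values are hypothetical.

```python
import pandas as pd

# Toy measurements: offsets are minutes since unit admission (hypothetical values).
raw = pd.DataFrame(
    {
        "patientunitstayid": [1, 1, 1, 2],
        "chartoffset": [130, 135, 600, 1300],
        "heartrate": [80.0, 90.0, 100.0, 70.0],
    }
)

# Keep the first 24 hours (dropping pre-admission rows is an assumption)
# and assign each row to an hourly bucket 0..23.
raw = raw[(raw["chartoffset"] >= 0) & (raw["chartoffset"] < 1440)].copy()
raw["hour"] = raw["chartoffset"] // 60

# Average measurements that fall into the same bucket.
hourly = raw.groupby(["patientunitstayid", "hour"])["heartrate"].mean()

# Reindex to a full patient x hour grid so that empty buckets become NaN.
grid = pd.MultiIndex.from_product(
    [raw["patientunitstayid"].unique(), range(24)],
    names=["patientunitstayid", "hour"],
)
hourly = hourly.reindex(grid)

# Carry the last value forward, then carry the first value back to admission.
filled = hourly.groupby(level="patientunitstayid").ffill()
filled = filled.groupby(level="patientunitstayid").bfill()

# One row per patient, one column per hourly bucket.
wide = filled.unstack("hour")
```

Filling within each patient group matters here: a plain `ffill` over the whole series would leak the last value of one patient into the first buckets of the next.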

In [37]:
def get_table_completeness(table):
    """
    Quick helper that returns column completeness (in percent) for a table.
    Note: it joins on the patients we're interested in for mortality
    prediction and restricts to the first 24 hours (chartoffset < 1440 minutes).
    """
    
    query = """
    select * from %s l 
    join patient_top5hospitals_mort_dataset p 
    on l.patientunitstayid=p.patientunitstayid 
    where chartoffset < 1440
    """ % table
    df = run_eicu_query(query, conn)
    return (1-df.isnull().sum(axis=0)/df.shape[0])*100

In [34]:
get_table_completeness('pivoted_o2')

patientunitstayid    100.000000
chartoffset          100.000000
entryoffset          100.000000
o2_flow               41.555433
o2_device             70.856865
etco2                  1.263903
patientunitstayid    100.000000
hosp_mort             99.665268
dtype: float64
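
The completeness figure is simply the percentage of non-null entries per column. Because the query uses `select *` over a join, `patientunitstayid` appears twice in the output above. A minimal sketch of an equivalent computation that also drops the duplicated column is shown below, on toy data; `column_completeness` is a hypothetical name and is not the `get_column_completeness` helper imported earlier, whose implementation is not shown here.

```python
import numpy as np
import pandas as pd


def column_completeness(df: pd.DataFrame) -> pd.Series:
    """Percentage of non-null values per column."""
    return df.notnull().mean() * 100


# Toy frame mimicking a `select *` over a join: the key column appears twice.
toy = pd.DataFrame(
    [[1, 2.0, 1], [2, np.nan, 2], [3, np.nan, 3], [4, 5.0, 4]],
    columns=["patientunitstayid", "o2_flow", "patientunitstayid"],
)

# Keep only the first occurrence of each column name.
toy = toy.loc[:, ~toy.columns.duplicated()]

completeness = column_completeness(toy)
print(completeness)
```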

### Create one row per patient features

Start with the basic ones

In [38]:
import numpy as np
from sklearn.impute import SimpleImputer

In [16]:
query = """
select
       i.patientunitstayid,
       gender,
       age,
       admissionheight,
       admissionweight,
       ethnicity,
       region,
       unittype,
       apachedxgroup
from (select patientunitstayid from patient_top5hospitals_mort_dataset) p
join icustay_detail i
on p.patientunitstayid=i.patientunitstayid
join apache_groups ag
on p.patientunitstayid=ag.patientunitstayid
"""
df_basic = run_eicu_query(query, conn)

In [24]:
# add vars to numerical and non-numerical lists
num_vars = [
    'gender',
    'age',
    'admissionheight',
    'admissionweight',
]
cat_vars = [
    'ethnicity',
    'region',
    'unittype',
    'apachedxgroup'
]

In [49]:
# check data completeness
(1-df_basic.isnull().sum(axis=0)/df_basic.shape[0])*100

patientunitstayid    100.000000
gender                99.975856
age                  100.000000
admissionheight       99.257560
admissionweight       92.358303
ethnicity            100.000000
region                89.261785
unittype             100.000000
apachedxgroup        100.000000
dtype: float64

In [47]:
imp

SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='most_frequent', verbose=0)

In [54]:
# Make age numerical: ages above 89 are stored as the string '> 89'; map them to 90
df_basic.loc[df_basic['age'] == '> 89', 'age'] = 90
df_basic['age'] = df_basic['age'].astype(int)

# Impute missing categorical data in the simplest way: with the most frequent value
df_basic_cat = df_basic.loc[:, cat_vars]
imp = SimpleImputer(strategy="most_frequent")
df_basic.loc[:, cat_vars] = imp.fit_transform(df_basic_cat.astype('category'))

In [55]:
(1-df_basic.isnull().sum(axis=0)/df_basic.shape[0])*100

patientunitstayid    100.000000
gender                99.975856
age                  100.000000
admissionheight       99.257560
admissionweight       92.358303
ethnicity            100.000000
region               100.000000
unittype             100.000000
apachedxgroup        100.000000
dtype: float64
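
After this step the categorical columns are complete, while `admissionheight` and `admissionweight` still contain gaps. The feature list above links scikit-learn's `IterativeImputer` for the APACHE variables; whether the same imputer is applied to height and weight is not stated, so the following is only a sketch of that API on toy data with hypothetical values.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numeric features with missing entries (hypothetical values).
toy = pd.DataFrame(
    {
        "age": [40, 55, 63, 71, 80, 90],
        "admissionheight": [170.0, 165.0, np.nan, 180.0, 175.0, 160.0],
        "admissionweight": [70.0, np.nan, 82.0, 95.0, np.nan, 55.0],
    }
)

# Each feature with missing values is modelled as a function of the others.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
```

In a modelling setting the imputer would be fitted on the training split only and then applied to validation and test data with `transform`, to avoid leaking information across splits.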

Now add the more numerous APACHE and lab variables.

In [7]:
query = """
select *
from (select patientunitstayid from patient_top5hospitals_mort_dataset) p
left join apacheapsvar a1
on p.patientunitstayid=a1.patientunitstayid
left join apachepredvar a2
on p.patientunitstayid=a2.patientunitstayid
left join labsfirstday l
on p.patientunitstayid=l.patientunitstayid
"""
df_apache_labs = run_eicu_query(query, conn)

In [None]:
df_apache_labs

In [20]:
(1-df_apache_labs.isnull().sum(axis=0)/df_apache_labs.shape[0])*100

patientunitstayid    100.000000
apacheapsvarid        92.328122
patientunitstayid     92.328122
intubated             92.328122
vent                  92.328122
dialysis              92.328122
eyes                  92.328122
motor                 92.328122
verbal                92.328122
meds                  92.328122
urine                 92.328122
wbc                   92.328122
temperature           92.328122
respiratoryrate       92.328122
sodium                92.328122
heartrate             92.328122
meanbp                92.328122
ph                    92.328122
hematocrit            92.328122
creatinine            92.328122
albumin               92.328122
pao2                  92.328122
pco2                  92.328122
bun                   92.328122
glucose               92.328122
                        ...    
chloride_max          96.191224
glucose_min           96.245548
glucose_max           96.245548
hematocrit_min        97.706284
hematocrit_max        97.706284
hemoglob