# Cohort creation

We define the cohorts for mortality prediction and length-of-stay prediction, based on the first 24 hours of each patient's stay.

In [4]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Establish a connection to the DB and import the helper functions for running queries.

In [5]:
import pandas as pd
from proto.etl.config import SSHInfoEicu, DBInfoEicu
from proto.etl.utils import connect_to_db_via_ssh, run_eicu_query, get_column_completeness, load_schema_for_modelling

conn = connect_to_db_via_ssh(SSHInfoEicu, DBInfoEicu)
cursor = conn.cursor()
query_schema = 'set search_path to eicu_crd;'

### Select the patients for mortality prediction

1. Keep patients from the top 5 hospitals.
2. Keep visits that were at least 26 hours long.
3. Save the unit stay ids as a new view.

In [3]:
query = """
select  patientunitstayid, hospitalid, hosp_mort, region from icustay_detail
where hospitalid in (
    select distinct hospitalid  from patient_top5hospitals
    )
    and
    icu_los_hours >= 26
"""
df_patients = run_eicu_query(query, conn)

- We got 16.5k admissions in the 5 largest hospitals that fulfill these criteria.
- The class imbalance is roughly 1:10 overall, ranging from about 1:7 to 1:13 across hospitals.

In [4]:
df_patients.shape

(16567, 4)

In [5]:
df_patients.groupby(['hospitalid', 'hosp_mort'])['patientunitstayid'].count()

hospitalid  hosp_mort
73          0.0          4404
            1.0           373
167         0.0          2103
            1.0           243
176         0.0          1637
            1.0           122
264         0.0          3587
            1.0           358
420         0.0          3161
            1.0           469
Name: patientunitstayid, dtype: int64
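As a quick check of the imbalance, the ratios can be recomputed from the counts printed above. This is only a sketch: the numbers are copied by hand from that output, and the column names `survived`, `died`, `mortality_rate` and `survivors_per_death` are made up for illustration.

```python
import pandas as pd

# Counts copied from the groupby output above (hosp_mort 0.0 / 1.0 per hospital).
counts = pd.DataFrame(
    {
        "hospitalid": [73, 167, 176, 264, 420],
        "survived": [4404, 2103, 1637, 3587, 3161],
        "died": [373, 243, 122, 358, 469],
    }
).set_index("hospitalid")

# Share of labelled stays that ended in death, and survivors per death.
counts["mortality_rate"] = counts["died"] / (counts["survived"] + counts["died"])
counts["survivors_per_death"] = counts["survived"] / counts["died"]

# Stays with a known label; the rest of the 16567 rows have no hosp_mort value.
n_labelled = int(counts[["survived", "died"]].to_numpy().sum())
n_unlabelled = 16567 - n_labelled

print(counts.round(3))
print(n_labelled, n_unlabelled)
```

The labelled counts do not add up to the 16567 rows of `df_patients`, because `count()` after the groupby skips rows where `hosp_mort` is missing.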

The hospitals are in different parts of the US (no region is recorded for hospital 176).

In [47]:
df_patients.groupby('hospitalid').first()['region']

hospitalid
73       Midwest
167         West
176         None
264      Midwest
420    Northeast
Name: region, dtype: object

Save these patients as a new view by executing `patient_top5hospitals_mort_dataset.sql`. 
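The contents of that SQL file are not shown in this notebook. As a rough idea of what such a view could look like, here is a minimal sketch that reuses the selection query from above in an in-memory SQLite database. The toy rows, the use of SQLite instead of the real Postgres DB, and the choice of view columns (`patientunitstayid` and `hosp_mort`) are assumptions made for illustration only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- Toy stand-ins for the real eICU tables (hypothetical rows).
    create table icustay_detail (
        patientunitstayid integer, hospitalid integer,
        hosp_mort integer, icu_los_hours real
    );
    create table patient_top5hospitals (patientunitstayid integer, hospitalid integer);

    insert into icustay_detail values
        (1, 73, 0, 30.0),   -- kept: top-5 hospital, stay >= 26 h
        (2, 73, 1, 12.0),   -- dropped: stay too short
        (3, 999, 0, 48.0),  -- dropped: hospital not in the top 5
        (4, 167, 1, 26.0);  -- kept: exactly 26 h
    insert into patient_top5hospitals values (1, 73), (2, 73), (4, 167);

    -- Assumed shape of the view: same filters as the selection query above.
    create view patient_top5hospitals_mort_dataset as
    select patientunitstayid, hosp_mort
    from icustay_detail
    where hospitalid in (select distinct hospitalid from patient_top5hospitals)
      and icu_los_hours >= 26;
    """
)

rows = con.execute(
    "select patientunitstayid, hosp_mort "
    "from patient_top5hospitals_mort_dataset order by patientunitstayid"
).fetchall()
print(rows)
```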

## Feature creation

We want to predict, after 24 hours of stay and with a look-ahead of at least 4 hours, whether a patient dies in hospital.

1. Discard everything that happened after the first 24 hours for the cohort.
2. Create one-row-per-patient features, i.e. features that can be represented as a single row per patient.
    - gender,
    - age,
    - ethnicity,
    - height,
    - weight on admission
    - hospital region - embedded - 2
    - hospital unittype - embed - 2
    - apache_groups - embedded - 3
    - apacheapsvars - [imputed](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html)
    - apachepredvars - imputed (same as above)
    - labsfirstday - imputed
    - diagnoses - embedded and averaged
3. Create time-bucketed features. Missing values are filled by carrying the previous measurement forward; buckets before a patient's first measurement are filled by carrying that first measurement back to the admission time.
    - pivoted_lab - bucketed to 24 hours
    - pivoted_bg - bucketed to 24 hours
    - pivoted_med - bucketed to 24 hours
    - pivoted_o2 - bucketed to 24 hours
    - pivoted_score - bucketed to 24 hours
    - pivoted_uo - bucketed to 24 hours
    - pivoted_vital - bucketed to 24 hours
    - pivoted_vital_other - bucketed to 24 hours
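The fill rule in step 3 can be sketched as below. This is one possible reading, not the code used later in the project: it assumes "bucketed to 24 hours" means 24 hourly buckets, keeps the last measurement in each bucket, and ignores measurements charted before admission. The function name `bucket_and_fill` and the `heartrate` column are hypothetical.

```python
import pandas as pd


def bucket_and_fill(events, value_cols, n_buckets=24, bucket_minutes=60):
    """Bucket events by chartoffset (minutes) and fill gaps per patient."""
    window = n_buckets * bucket_minutes
    # Assumption: only measurements between admission and the window end are used.
    events = events[(events["chartoffset"] >= 0) & (events["chartoffset"] < window)].copy()
    events["bucket"] = events["chartoffset"] // bucket_minutes

    # Last non-missing measurement in each (patient, bucket).
    last_per_bucket = (
        events.sort_values("chartoffset")
        .groupby(["patientunitstayid", "bucket"])[value_cols]
        .last()
    )

    # One row per patient and bucket, including buckets without any measurement.
    full_grid = pd.MultiIndex.from_product(
        [events["patientunitstayid"].unique(), range(n_buckets)],
        names=["patientunitstayid", "bucket"],
    )
    gridded = last_per_bucket.reindex(full_grid)

    # Carry the previous measurement forward, then carry the first one back.
    filled = gridded.groupby(level="patientunitstayid").ffill()
    filled = filled.groupby(level="patientunitstayid").bfill()
    return filled


events = pd.DataFrame(
    {
        "patientunitstayid": [1, 1, 2],
        "chartoffset": [130, 300, 0],
        "heartrate": [80.0, 90.0, 70.0],
    }
)
out = bucket_and_fill(events, ["heartrate"], n_buckets=6)
print(out)
```

Patients with no measurement at all inside the window do not appear in the result, so they would need to be added back (and imputed some other way) when joining to the cohort.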

In [3]:
def get_table_completeness(table):
    """
    Quick helper that returns the column completeness (in %) for a table.
    Note: it joins on the patients we're interested in for mortality
    prediction and restricts to the first 24 hours (chartoffset < 1440 minutes).
    """
    
    query = """
    select * from %s l 
    join patient_top5hospitals_mort_dataset p 
    on l.patientunitstayid=p.patientunitstayid 
    where chartoffset < 1440
    """ % table
    df = run_eicu_query(query, conn)
    return (1-df.isnull().sum(axis=0)/df.shape[0])*100

In [34]:
get_table_completeness('pivoted_o2')

patientunitstayid    100.000000
chartoffset          100.000000
entryoffset          100.000000
o2_flow               41.555433
o2_device             70.856865
etco2                  1.263903
patientunitstayid    100.000000
hosp_mort             99.665268
dtype: float64
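The completeness figure is simply the percentage of non-missing rows per column. A small sketch on made-up data (the values below are not from eICU) shows the formula used in `get_table_completeness` next to an equivalent shorter form:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, only to illustrate the computation.
toy = pd.DataFrame(
    {
        "o2_flow": [2.0, np.nan, np.nan, 4.0],
        "o2_device": ["nasal cannula", None, "ventilator", "ventilator"],
    }
)

# Formula used in the helper above.
completeness = (1 - toy.isnull().sum(axis=0) / toy.shape[0]) * 100

# Equivalent: mean of the "is present" indicator, as a percentage.
completeness_alt = toy.notna().mean() * 100

print(completeness)
```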

### Create one row per patient features

Start with the basic stay-level variables.

In [61]:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [62]:
query = """
select
       i.patientunitstayid,
       admissionheight,
       admissionweight,
       ethnicity,
       region as hospital_region,
       unittype,
       apachedxgroup
from (select patientunitstayid from patient_top5hospitals_mort_dataset) p
join icustay_detail i
on p.patientunitstayid=i.patientunitstayid
join apache_groups ag
on p.patientunitstayid=ag.patientunitstayid
"""
df_basic = run_eicu_query(query, conn)

In [63]:
# add vars to numerical and non-numerical lists
num_vars = [
    'admissionheight',
    'admissionweight',
]
cat_vars = [
    'ethnicity',
    'hospital_region',
    'unittype',
    'apachedxgroup'
]

Now add the more numerous APACHE and lab variables.

In [64]:
query = """
select *
from (select patientunitstayid from patient_top5hospitals_mort_dataset) p
left join apacheapsvar a1
on p.patientunitstayid=a1.patientunitstayid
left join apachepredvar a2
on p.patientunitstayid=a2.patientunitstayid
left join labsfirstday l
on p.patientunitstayid=l.patientunitstayid
"""
df_apache_labs = run_eicu_query(query, conn)
# get rid of duplicated unitstayid cols
df_apache_labs = df_apache_labs.loc[:,~df_apache_labs.columns.duplicated()]

In [65]:
# add numeric cols to the num vars
float_cols = df_apache_labs.dtypes == 'float64'
non_id_cols = ~df_apache_labs.columns.str.endswith('id')
num_vars += list(df_apache_labs.columns[(float_cols & non_id_cols).to_numpy()])
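To make the two column steps above concrete, here is a small sketch on made-up frames: dropping repeated id columns after a `select *` join, then keeping float columns whose name does not end in `id`. The frames `a` and `b` and their values are hypothetical.

```python
import pandas as pd

# Two toy tables that both carry the join key, as after "select *" over a join.
a = pd.DataFrame({"patientunitstayid": [1, 2], "ph": [7.1, 7.4]})
b = pd.DataFrame(
    {
        "patientunitstayid": [1, 2],
        "apachepredvarid": [10.0, 11.0],
        "admitdiagnosis": ["a", "b"],
    }
)
toy = pd.concat([a, b], axis=1)

# Keep only the first occurrence of each column name.
toy = toy.loc[:, ~toy.columns.duplicated()]

# Float columns that are not identifiers.
float_cols = toy.dtypes == "float64"
non_id_cols = ~toy.columns.str.endswith("id")
selected = list(toy.columns[(float_cols & non_id_cols).to_numpy()])
print(selected)
```

One caveat of the name-based filter: any genuine feature whose name happens to end in `id` would be excluded as well.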

Merge apache_lab and basic tables

In [66]:
df_basic_apache_labs = df_basic.set_index('patientunitstayid').join(df_apache_labs.set_index('patientunitstayid'))



Impute missing values for both categorical vars (simple mode imputation) and numerical vars (multivariate imputation).

In [75]:
df = df_basic_apache_labs

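# make age numerical: missing ages are set to 90 (the stand-in used here for "> 89")
df.loc[df['age'].isnull(), 'age'] = 90
df['age'] = df['age'].astype(int)  # np.int was removed from NumPy; use the builtin int

# Impute categorical missing data in the simplest way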
imp = SimpleImputer(strategy="most_frequent")
df[cat_vars] = imp.fit_transform(df[cat_vars].astype('category'))

In [76]:
# Impute missing values for numerical variables too using iterative imputer
imp = IterativeImputer()
df[num_vars] = imp.fit_transform(df[num_vars].values)
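For readers who have not used these imputers, the sketch below runs the same two steps on a tiny made-up frame. The values and the fixed `random_state` are assumptions for illustration; the notebook itself uses the default `IterativeImputer()` on the real columns.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer, SimpleImputer

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame(
    {
        "unittype": pd.Series(["MICU", np.nan, "MICU", "SICU", "MICU", "SICU"], dtype=object),
        "admissionheight": [170.0, 160.0, np.nan, 180.0, 175.0, 165.0],
        "admissionweight": [70.0, 60.0, 65.0, np.nan, 75.0, 66.0],
    }
)
toy_cat = ["unittype"]
toy_num = ["admissionheight", "admissionweight"]

# Categorical: replace missing entries with the most frequent category.
toy[toy_cat] = SimpleImputer(strategy="most_frequent").fit_transform(toy[toy_cat])

# Numerical: each column is modelled from the others, iteratively.
toy[toy_num] = IterativeImputer(random_state=0).fit_transform(toy[toy_num].values)

print(toy)
```

Note that both imputers are fitted here on the full data. If the features later feed a model that is evaluated on held-out patients, fitting the imputers on the training split only would avoid leaking information from the test set.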

In [84]:
df[cat_vars + num_vars].to_csv('one_row_per_patient_vars.csv')