# Create csv tables from the database

In this notebook, we extract the data by converting the SQL tables to CSV files for subsequent analysis. We also merge in the information we have on the ICU each patient is currently in, and extract the data for the various sensitivity analyses.

In our study, this part of the process was deliberately hidden from those developing the models, so that it acted as a "blind experiment". This is why, later in this notebook, we map the ids found in MIMIC to another set of ids to make reconstruction harder.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2

In [2]:
pd.set_option('display.max_columns', None)

### Set up data access

In [3]:
sqluser = 'mimicuser'
dbname = 'mimic'

In [4]:
schema_name = 'mimiciii'
query_schema = 'SET search_path to public,' + schema_name + ';'

In [5]:
import getpass
password = getpass.getpass() 



In [6]:
# Connect to local postgres version of mimic
con = psycopg2.connect(dbname=dbname, user=sqluser, host="/tmp/", password=password)

Set up some useful functions to aid working with SQL and to summarise our exclusions later.

In [7]:
def load_from_sql(script, con, schema_name):
    # Read the SQL query from the provided SQL script and run it
    with open(script, 'r') as f:
        sql_query = f.read()
    return execute_query(sql_query, con, schema_name)

def execute_query(query, con, schema_name):
    # Point the session at the requested schema, then run the query
    with con.cursor() as cur:
        cur.execute('SET search_path to ' + schema_name)
    return pd.read_sql_query(query, con)

In [8]:
def sepsis_summary(df_sepsis):

    times = ['t_suspicion', 't_sofa', 't_sepsis_min', 't_sepsis_max']
    
    print(f'There are {len(df_sepsis.icustay_id.unique())} patients after LOS exclusions')
    
    for time in times:
        print(f'---------------\n Looking at {time}')
        
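        earliest = df_sepsis.intime + pd.Timedelta(hours=4)

        # Stays whose event time is earlier than 4 hours after ICU admission
        len_remove = len(df_sepsis[df_sepsis[time] < earliest].icustay_id.unique())

        # Stays whose event time lies between intime + 4h and outtime (both ends inclusive).
        # Explicit comparisons replace between(..., inclusive=True), which newer pandas rejects.
        len_keep = len(df_sepsis[(df_sepsis[time] >= earliest)
                                 & (df_sepsis[time] <= df_sepsis.outtime)].icustay_id.unique())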

        print(f'We will be keeping {len_keep} patients and removing {len_remove}\n')
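To make the counting logic of `sepsis_summary` concrete, here is a minimal sketch on made-up data. The helper name `count_keep_remove` and the toy frame are our own illustration and are not part of the pipeline; only the column names follow the notebook.

```python
import pandas as pd

def count_keep_remove(df, time_col, min_hours=4):
    """Count unique stays whose event time is too early, or inside the allowed window."""
    earliest = df["intime"] + pd.Timedelta(hours=min_hours)
    too_early = df[time_col] < earliest
    in_window = (df[time_col] >= earliest) & (df[time_col] <= df["outtime"])
    n_remove = df.loc[too_early, "icustay_id"].nunique()
    n_keep = df.loc[in_window, "icustay_id"].nunique()
    return n_keep, n_remove

# Three hypothetical stays: one with early sepsis, one with later sepsis, one without sepsis
toy = pd.DataFrame({
    "icustay_id": [1, 2, 3],
    "intime": pd.to_datetime(["2100-01-01 00:00"] * 3),
    "outtime": pd.to_datetime(["2100-01-03 00:00"] * 3),
    "t_sepsis_min": pd.to_datetime(["2100-01-01 02:00", "2100-01-01 10:00", None]),
})

n_keep, n_remove = count_keep_remove(toy, "t_sepsis_min")
print(n_keep, n_remove)
```

A stay with a missing time (`NaT`) falls into neither group, which is why the kept and removed counts in the summaries do not add up to the total number of stays.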
 

### Create a current ICU column

We start by extracting the vital sign information for the training data and merging it with the ICU each patient is currently in. To do so, we need to make sure that each patient has a uniquely defined location at any one time. So let us check the information from our current_icu view (see the `current_icu.sql` file for how this view is defined).

In [9]:
Q = 'select icustay_id, charttime, curr_careunit from current_icu c order by icustay_id, charttime'
df_care = execute_query(Q, con, schema_name)

Remove those entries that are not linked to an icustay_id

In [10]:
df_care = df_care[~df_care.icustay_id.isnull()]

The following table shows that we have duplicated rows

In [11]:
df_care[df_care.duplicated(keep= False)].head()

Unnamed: 0,icustay_id,charttime,curr_careunit
93010,200817.0,2116-07-15 13:11:00,CSRU
93011,200817.0,2116-07-15 13:11:00,CSRU
93012,200817.0,2116-07-15 13:15:00,CSRU
93013,200817.0,2116-07-15 13:15:00,CSRU
93014,200817.0,2116-07-15 13:30:00,CSRU


We can filter these out

In [12]:
df_care = df_care.drop_duplicates()

In [13]:
df_care = df_care.sort_values(by=['icustay_id','charttime','curr_careunit'], ignore_index=True)

However, some patients still have more than one row at the same charttime.

In [14]:
df_repeats = df_care[df_care[['icustay_id', 'charttime']].duplicated(keep= False)]

df_repeats = df_repeats.sort_values(by=['icustay_id','charttime','curr_careunit'])

In [15]:
len(df_repeats)

636
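The difference between the two kinds of duplicates seen so far can be shown on a toy frame (the values below are made up). `drop_duplicates()` only removes rows that agree in every column, so rows that share `icustay_id` and `charttime` but disagree on `curr_careunit` survive and have to be found with a key-based check.

```python
import pandas as pd

toy = pd.DataFrame({
    "icustay_id": [1, 1, 1, 2],
    "charttime": pd.to_datetime(["2100-01-01 10:00", "2100-01-01 10:00",
                                 "2100-01-01 10:00", "2100-01-01 11:00"]),
    "curr_careunit": ["CCU", "CCU", "CSRU", "MICU"],
})

# Exact duplicates (all columns equal) are removed here
deduped = toy.drop_duplicates()

# Conflicting rows: same (icustay_id, charttime) but different care unit
conflicts = deduped[deduped[["icustay_id", "charttime"]].duplicated(keep=False)]
print(len(deduped), len(conflicts))
```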

This is actually caused by conflicting information for 8 patients in the original transfer data. We shall look at each of these patients closely to determine which location they are more likely to be in.

In [16]:
df_repeats.icustay_id.unique()

array([200182., 208656., 208996., 241787., 245640., 275357., 278029.,
       281227.])

We can look at the original information from the transfer table. Let's look at 275357 as an example

In [17]:
df_repeats[df_repeats.icustay_id == 275357].head()

Unnamed: 0,icustay_id,charttime,curr_careunit
8069086,275357.0,2184-03-20 01:00:00,NICU
8069087,275357.0,2184-03-20 01:00:00,NWARD
8069088,275357.0,2184-03-20 13:10:00,NICU
8069089,275357.0,2184-03-20 13:10:00,NWARD
8069090,275357.0,2184-03-21 06:00:00,NICU


In [18]:
Q = """select * from transfers where icustay_id = 275357"""
test = execute_query(Q, con, schema_name)

In [19]:
test

Unnamed: 0,row_id,subject_id,hadm_id,icustay_id,dbsource,eventtype,prev_careunit,curr_careunit,prev_wardid,curr_wardid,intime,outtime,los
0,43664,8017,131067,275357,carevue,admit,,NICU,,26,2184-03-18 01:20:00,2184-03-18 12:31:00,11.18
1,43665,8017,131067,275357,carevue,transfer,NICU,NICU,26.0,26,2184-03-18 12:31:00,2184-03-22 13:50:00,97.32


By looking at tables like the one above for all eight patients, we can make a judgement as to which duplicate to keep. After looking at the raw data, we will remove the CCU repeats for 281227, the CSRU repeats for 200182, the NWARD repeats for 275357, and the None repeats for the others. To do this, we compile a list of all the indices that we want to remove, keeping either the first or the last row of each duplicated pair as appropriate (the rows were sorted by curr_careunit above, so first and last are well defined).

In [20]:
index1 = (df_care[['icustay_id', 'charttime']].duplicated(keep= 'last'))\
        &(df_care.icustay_id==281227)
index2 = (df_care[['icustay_id', 'charttime']].duplicated(keep= 'last'))\
        &(df_care.icustay_id==200182)
index3 = (df_care[['icustay_id', 'charttime']].duplicated(keep= 'first'))\
        &(df_care.icustay_id==208656)
index4 = (df_care[['icustay_id', 'charttime']].duplicated(keep= 'first'))\
        &(df_care.icustay_id==245640)
index5 = (df_care[['icustay_id', 'charttime']].duplicated(keep= 'first'))\
        &(df_care.icustay_id==278029)
index6 = (df_care[['icustay_id', 'charttime']].duplicated(keep= 'first'))\
        &(df_care.icustay_id==208996)
index7 = (df_care[['icustay_id', 'charttime']].duplicated(keep= 'first'))\
        &(df_care.icustay_id==241787)
index8 = (df_care[['icustay_id', 'charttime']].duplicated(keep= 'first'))\
        &(df_care.icustay_id==275357)

In [21]:
drop1 = np.where(index1)[0]
drop2 = np.where(index2)[0]
drop3 = np.where(index3)[0]
drop4 = np.where(index4)[0]
drop5 = np.where(index5)[0]
drop6 = np.where(index6)[0]
drop7 = np.where(index7)[0]
drop8 = np.where(index8)[0]

In [22]:
all_dropped_index = np.concatenate((drop1, drop2, drop3, drop4, drop5, drop6, drop7, drop8))

In [23]:
len(all_dropped_index)

318
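The effect of `keep='first'` versus `keep='last'` in the masks above can be checked on a small example. The sketch below uses made-up stays and a hypothetical `keep_rule` dictionary in place of the eight separate index variables; it is one possible compact formulation, not the code that was run.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "icustay_id": [10, 10, 20, 20, 30],
    "charttime": pd.to_datetime(["2100-01-01 10:00"] * 2
                                + ["2100-01-02 09:00"] * 2
                                + ["2100-01-03 08:00"]),
    "curr_careunit": ["CCU", "CSRU", "NICU", "NWARD", "MICU"],
})
# Sorting by care unit makes "first" and "last" within a duplicated pair well defined
toy = toy.sort_values(["icustay_id", "charttime", "curr_careunit"], ignore_index=True)

# Hypothetical rule table: which row of each conflicting pair to keep, per stay
keep_rule = {10: "last", 20: "first"}

key = toy[["icustay_id", "charttime"]]
drop_mask = np.zeros(len(toy), dtype=bool)
for stay, keep in keep_rule.items():
    # duplicated(keep=...) flags every row of a duplicated group except the one kept
    drop_mask |= (key.duplicated(keep=keep) & (toy["icustay_id"] == stay)).to_numpy()

cleaned = toy[~drop_mask]
print(cleaned)
```

With `keep='last'` the alphabetically earlier unit (CCU) is flagged for removal, and with `keep='first'` the later one (NWARD) is flagged, matching the choices described in the text.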

Let us do a sanity check first by checking that this works for the "drop1" indices

In [24]:
drop1

array([8710601, 8710603, 8710605, 8710607, 8710609, 8710611, 8710613,
       8710615, 8710617, 8710619, 8710621, 8710623, 8710625, 8710627,
       8710629, 8710631, 8710633, 8710635, 8710637, 8710639, 8710641,
       8710643, 8710645, 8710647, 8710649, 8710651, 8710653, 8710655,
       8710657, 8710659, 8710661, 8710663, 8710665, 8710667, 8710669,
       8710671, 8710673, 8710675, 8710677, 8710679, 8710681, 8710683,
       8710685, 8710687, 8710689, 8710691, 8710693, 8710695, 8710697,
       8710699, 8710701, 8710703, 8710705, 8710707, 8710709, 8710711])

We can see that originally we have duplicate entries of CSRU and CCU.

In [25]:
df_care.iloc[8710600:8710610]

Unnamed: 0,icustay_id,charttime,curr_careunit
8710600,281227.0,2134-04-20 09:00:00,SICU
8710601,281227.0,2134-04-20 10:00:00,CCU
8710602,281227.0,2134-04-20 10:00:00,CSRU
8710603,281227.0,2134-04-20 11:00:00,CCU
8710604,281227.0,2134-04-20 11:00:00,CSRU
8710605,281227.0,2134-04-20 11:29:00,CCU
8710606,281227.0,2134-04-20 11:29:00,CSRU
8710607,281227.0,2134-04-20 11:40:00,CCU
8710608,281227.0,2134-04-20 11:40:00,CSRU
8710609,281227.0,2134-04-20 12:00:00,CCU


If we drop these indices, all the duplicated CCU entries are removed as desired

In [26]:
df_care.drop(df_care.index[all_dropped_index]).iloc[8710336:8710345]

Unnamed: 0,icustay_id,charttime,curr_careunit
8710598,281227.0,2134-04-20 07:00:00,SICU
8710599,281227.0,2134-04-20 08:00:00,SICU
8710600,281227.0,2134-04-20 09:00:00,SICU
8710602,281227.0,2134-04-20 10:00:00,CSRU
8710604,281227.0,2134-04-20 11:00:00,CSRU
8710606,281227.0,2134-04-20 11:29:00,CSRU
8710608,281227.0,2134-04-20 11:40:00,CSRU
8710610,281227.0,2134-04-20 12:00:00,CSRU
8710612,281227.0,2134-04-20 13:00:00,CSRU


In [27]:
df_care = df_care.drop(df_care.index[all_dropped_index])

Now we verify that there are no duplicates after dropping these indices.

In [28]:
df_care[df_care[['icustay_id', 'charttime']].duplicated()]

Unnamed: 0,icustay_id,charttime,curr_careunit


### Remapping ids

For our experiments, we used an additional one-to-one map from the subject_ids, hadm_ids and icustay_ids found in MIMIC to random keys (the mapping is stored in the three files that we load below). These keys were made by a random permutation of 1 to len(ids) for each of subject_id, hadm_id and icustay_id. This part of the pipeline obscures the original information from the team members who run the models, so that it is harder to reconstruct which patients will be in the test set, ensuring fairness in our set-up.

We load the mapping here.

In [29]:
df_subject_ids = pd.read_csv('subject_key_combined.csv')
df_hadm_ids = pd.read_csv('hadm_key_combined.csv')
df_icustay_ids = pd.read_csv('icustay_key_combined.csv')
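The key files themselves are not created in this notebook. As a rough sketch of how a key of the kind described above (a random permutation of 1 to len(ids)) could be produced, consider the following. The function name `make_random_key`, the seed and the example ids are assumptions for illustration; only the `<id>` / `<id>_random` column naming follows the files loaded above.

```python
import numpy as np
import pandas as pd

def make_random_key(ids, name, seed=None):
    """Map each unique id to a distinct random integer in 1..len(ids)."""
    rng = np.random.default_rng(seed)
    unique_ids = np.unique(np.asarray(ids))
    # permutation(n) shuffles 0..n-1, so add 1 to obtain 1..n
    random_ids = rng.permutation(len(unique_ids)) + 1
    return pd.DataFrame({name: unique_ids, f"{name}_random": random_ids})

# Hypothetical ids; repeated values are collapsed before the key is drawn
key = make_random_key([101, 205, 309, 205], "icustay_id", seed=0)
print(key)
```

Because the new ids are a permutation, the map is one-to-one and carries no information about the ordering of the original ids.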

### Sepsis - Blood cultures only

We extract the data of the patients in the training set for the sepsis time computed with a SOFA window from 48 hours before to 24 hours after the time of suspected infection.

Note that we decided later to merge the training and validation set as we switched to cross-validation, and therefore we extract all data pertaining to train and val here.

In [30]:
Q = """
with ids as (select * from train_ids union select * from val_ids)
select e.*, s.t_suspicion, s.t_sofa, s.t_sepsis_min, s.t_sepsis_max
from extracted_data e 
inner join ids on e.subject_id = ids.subject_id
  inner join sepsis_cohort_time_blood s 
  on e.icustay_id = s.icustay_id order by subject_id, charttime"""
df_sepsis = execute_query(Q, con, schema_name)

There appear to be some duplicated entries, which we will remove

In [31]:
df_sepsis[df_sepsis.duplicated()].head()

Unnamed: 0,subject_id,hadm_id,icustay_id,admission_type,admittime,dischtime,hospital_expire_flag,deathtime,intime,outtime,admit_diff,age,gender,ethnicity,insurance,initial_diagnosis,first_careunit,last_careunit,dbsource,los_hospital,hospstay_seq,los_icu,icustay_seq,charttime,heart_rate,heart_rhythm,nbp_sys,nbp_dias,nbp_mean,abp_sys,abp_dias,abp_mean,temp_site,temp_celcius,verbal_response,o2sat,resp_rate,resp_rate_spont,resp_rate_set,resp_rate_total,minute_vol,mean_airway_pressure,peak_insp_rate,plateau_pressure,o2flow_chart,o2_device,vent_mode,on_vent,tidal_vol_set,tidal_vol_obs,tidal_vol_spon,peep_set,fio2_chart,glucose_chart,specimen,baseexcess,bicarbonate_bg,totalco2,carboxyhemoglobin,chloride_bg,calcium_bg,glucose_bg,hematocrit_bg,hemoglobin_bg,intubated,lactate,methemoglobin,o2flow_lab,fio2,so2,pco2,peep,ph,po2,potassium_bg,requiredo2,sodium_bg,temperature_bg,tidalvolume,ventilator,alkalinephos,ast,bilirubin_direct,bilirubin_total,bun,creatinine,fibrinogen,magnesium,phosphate,platelets,ptt,tropinin_t,tropinin_i,wbc,bicarbonate,chloride,calcium,glucose,hematocrit,hemoglobin,potassium,sodium,rolling_los_hospital,rolling_los_icu,sofa_resp,sofa_coag,sofa_liver,sofa_circu,sofa_cns,sofa_renal,sofa_total,t_suspicion,t_sofa,t_sepsis_min,t_sepsis_max
1896,109,176760,256504,EMERGENCY,2142-01-21 11:23:00,2142-01-23 16:40:00,0,NaT,2142-01-21 11:24:37,2142-01-22 17:20:39,0.026944,24.4578,F,black,Medicaid,HYPERTENSIVE EMERGENCY,MICU,MICU,metavision,2.0,21,1.0,1,2142-01-22 05:44:00,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,,115.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,36.0,5.5,,1.5,5.1,134.0,41.8,,,4.7,24.0,102.0,8.5,115.0,20.0,6.5,5.6,136.0,0.764583,0.763461,0.0,1.0,0.0,1.0,0.0,4.0,6.0,NaT,NaT,NaT,NaT
2108,109,124398,249219,EMERGENCY,2142-03-24 16:34:00,2142-03-31 15:15:00,0,NaT,2142-03-24 16:35:16,2142-03-25 05:37:49,0.021111,24.6275,F,black,Medicaid,HYPERTENSION,MICU,MICU,metavision,7.0,25,1.0,1,2142-03-25 03:27:00,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,,74.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.453472,0.452593,0.0,1.0,0.0,0.0,0.0,4.0,5.0,NaT,NaT,NaT,NaT
2449,109,131376,299251,EMERGENCY,2142-07-07 10:01:00,2142-07-08 18:05:00,0,NaT,2142-07-07 10:02:10,2142-07-08 18:05:24,0.019444,24.915,F,black,Medicaid,HYPERTENSION,MICU,MICU,metavision,1.0,31,1.0,1,2142-07-08 06:14:00,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,,75.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.842361,0.841551,0.0,2.0,0.0,1.0,1.0,4.0,8.0,NaT,NaT,NaT,NaT
13388,1339,109444,214830,EMERGENCY,2138-09-22 08:36:00,2138-09-30 15:25:00,0,NaT,2138-09-22 08:37:25,2138-09-28 17:17:53,0.023611,67.5196,F,white,Medicare,PNEUMONIA,MICU,MICU,metavision,8.0,5,6.0,1,2138-09-22 09:05:00,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,,70.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.020139,0.019155,0.0,0.0,0.0,1.0,0.0,1.0,2.0,2138-09-22 09:05:00,2138-09-22 09:00:00,2138-09-22 09:00:00,2138-09-22 09:05:00
13421,1339,109444,214830,EMERGENCY,2138-09-22 08:36:00,2138-09-30 15:25:00,0,NaT,2138-09-22 08:37:25,2138-09-28 17:17:53,0.023611,67.5196,F,white,Medicare,PNEUMONIA,MICU,MICU,metavision,8.0,5,6.0,1,2138-09-23 00:45:00,,,,,,,,,,,,,,,,,,,,,,,,1,,,,,,168.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.672917,0.671933,0.0,0.0,0.0,1.0,0.0,1.0,2.0,2138-09-22 09:05:00,2138-09-22 09:00:00,2138-09-22 09:00:00,2138-09-22 09:05:00


In [32]:
df_sepsis = df_sepsis.drop_duplicates()

However, it appears that there are still duplicate times for a few patients.

In [33]:
len(df_sepsis[df_sepsis[['icustay_id', 'charttime']].duplicated(keep=False)])

22

We check for possible explanations by looking at the results returned by the following queries

In [34]:
Q = """select * from extracted_chart where icustay_id = 287417
order by charttime
"""
test = execute_query(Q, con, schema_name)

In [35]:
Q = """select * from chartevents where icustay_id = 287417
order by charttime
"""
test = execute_query(Q, con, schema_name)

In [36]:
test.hadm_id.unique()

array([193734, 157999])

In [37]:
Q = """select * from extracted_chart where hadm_id = 157999
order by charttime
"""
test = execute_query(Q, con, schema_name)

In [38]:
Q = """select * from extracted_data where hadm_id = 157999
order by charttime
"""
test = execute_query(Q, con, schema_name)

After some inspection, these problems appear to be caused by a spurious hadm_id being matched with the icustay_id for blood glucose readings, perhaps due to an administrative error. For the purposes of our research this has little impact, given that we take an hourly summary of the data. If our data format changes, this issue will have to be addressed.
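To see why hourly summarising makes the remaining duplicated times harmless, here is a small sketch on made-up glucose readings. The choice of the mean as the hourly summary is an assumption for illustration; the aggregation actually used is defined later in the pipeline.

```python
import pandas as pd

toy = pd.DataFrame({
    "icustay_id": [1, 1, 1],
    "charttime": pd.to_datetime(["2100-01-01 10:05", "2100-01-01 10:05",
                                 "2100-01-01 11:30"]),
    "glucose_chart": [100.0, 120.0, 90.0],
})

# Bin each reading into its hour, then summarise per stay and hour
hourly = (
    toy.assign(hour=toy["charttime"].dt.floor("h"))
       .groupby(["icustay_id", "hour"], as_index=False)["glucose_chart"]
       .mean()
)
print(hourly)
```

The two conflicting 10:05 readings collapse into a single row for the 10:00 bin, so each stay has at most one row per hour.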

In [39]:
df_sepsis_merge = df_sepsis.merge(df_care, on=['icustay_id', 'charttime'], how='inner')
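An inner merge silently drops any row without a matching care-unit entry. If we wanted to check how many rows that affects, one option (a sketch on toy data, not something run in this notebook) is to merge with `how='left'` and pandas' `indicator` and `validate` arguments. `validate='many_to_one'` additionally raises an error if the right-hand keys are not unique, which is the property we established for `df_care` above.

```python
import pandas as pd

left = pd.DataFrame({
    "icustay_id": [1, 1, 2],
    "charttime": pd.to_datetime(["2100-01-01 10:00", "2100-01-01 11:00",
                                 "2100-01-01 10:00"]),
    "heart_rate": [80.0, 82.0, 75.0],
})
care = pd.DataFrame({
    "icustay_id": [1, 1],
    "charttime": pd.to_datetime(["2100-01-01 10:00", "2100-01-01 11:00"]),
    "curr_careunit": ["MICU", "MICU"],
})

merged = left.merge(care, on=["icustay_id", "charttime"], how="left",
                    validate="many_to_one", indicator=True)

# Rows that an inner merge would have dropped
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```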

Next we remap the ids to our random mapping as explained above.

In [40]:
print('Merging new icustay_ids')

df_newid2 = pd.merge(df_icustay_ids, df_sepsis_merge, on='icustay_id')

print('Merging new hadm_ids')

df_newid2 = pd.merge(df_hadm_ids, df_newid2, on='hadm_id')

print('Merging new subject_ids')

df_newid2 = pd.merge(df_subject_ids, df_newid2, on='subject_id')

df_newid2.head()

Merging new icustay_ids
Merging new hadm_ids
Merging new subject_ids


Unnamed: 0,subject_id,subject_id_random,hadm_id,hadm_id_random,icustay_id,icustay_id_random,admission_type,admittime,dischtime,hospital_expire_flag,deathtime,intime,outtime,admit_diff,age,gender,ethnicity,insurance,initial_diagnosis,first_careunit,last_careunit,dbsource,los_hospital,hospstay_seq,los_icu,icustay_seq,charttime,heart_rate,heart_rhythm,nbp_sys,nbp_dias,nbp_mean,abp_sys,abp_dias,abp_mean,temp_site,temp_celcius,verbal_response,o2sat,resp_rate,resp_rate_spont,resp_rate_set,resp_rate_total,minute_vol,mean_airway_pressure,peak_insp_rate,plateau_pressure,o2flow_chart,o2_device,vent_mode,on_vent,tidal_vol_set,tidal_vol_obs,tidal_vol_spon,peep_set,fio2_chart,glucose_chart,specimen,baseexcess,bicarbonate_bg,totalco2,carboxyhemoglobin,chloride_bg,calcium_bg,glucose_bg,hematocrit_bg,hemoglobin_bg,intubated,lactate,methemoglobin,o2flow_lab,fio2,so2,pco2,peep,ph,po2,potassium_bg,requiredo2,sodium_bg,temperature_bg,tidalvolume,ventilator,alkalinephos,ast,bilirubin_direct,bilirubin_total,bun,creatinine,fibrinogen,magnesium,phosphate,platelets,ptt,tropinin_t,tropinin_i,wbc,bicarbonate,chloride,calcium,glucose,hematocrit,hemoglobin,potassium,sodium,rolling_los_hospital,rolling_los_icu,sofa_resp,sofa_coag,sofa_liver,sofa_circu,sofa_cns,sofa_renal,sofa_total,t_suspicion,t_sofa,t_sepsis_min,t_sepsis_max,curr_careunit
0,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 15:45:00,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,81.0,36.0,,0.4,21.0,7.0,,,,238.0,28.5,,0.13,6.1,22.0,93.0,,158.0,43.0,14.3,3.8,131.0,-0.198611,-0.199722,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,
1,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:00:00,,SR (Sinus Rhythm),,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.020139,0.019028,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
2,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:54:00,,,176.0,77.0,102.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.057639,0.056528,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
3,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:56:00,72.0,,,,,,,,,,,96.0,14.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.059028,0.057917,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
4,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 22:00:00,70.0,SR (Sinus Rhythm),169.0,79.0,102.0,,,,,36.388889,,98.0,12.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.061806,0.060694,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU


In [41]:
df_newid2.drop(['subject_id', 'hadm_id', 'icustay_id'], axis = 1, inplace=True)

df_newid2.rename(columns={"subject_id_random": "subject_id", "hadm_id_random":\
                          "hadm_id", "icustay_id_random":"icustay_id"}, inplace=True)
df_newid2.head()

Unnamed: 0,subject_id,hadm_id,icustay_id,admission_type,admittime,dischtime,hospital_expire_flag,deathtime,intime,outtime,admit_diff,age,gender,ethnicity,insurance,initial_diagnosis,first_careunit,last_careunit,dbsource,los_hospital,hospstay_seq,los_icu,icustay_seq,charttime,heart_rate,heart_rhythm,nbp_sys,nbp_dias,nbp_mean,abp_sys,abp_dias,abp_mean,temp_site,temp_celcius,verbal_response,o2sat,resp_rate,resp_rate_spont,resp_rate_set,resp_rate_total,minute_vol,mean_airway_pressure,peak_insp_rate,plateau_pressure,o2flow_chart,o2_device,vent_mode,on_vent,tidal_vol_set,tidal_vol_obs,tidal_vol_spon,peep_set,fio2_chart,glucose_chart,specimen,baseexcess,bicarbonate_bg,totalco2,carboxyhemoglobin,chloride_bg,calcium_bg,glucose_bg,hematocrit_bg,hemoglobin_bg,intubated,lactate,methemoglobin,o2flow_lab,fio2,so2,pco2,peep,ph,po2,potassium_bg,requiredo2,sodium_bg,temperature_bg,tidalvolume,ventilator,alkalinephos,ast,bilirubin_direct,bilirubin_total,bun,creatinine,fibrinogen,magnesium,phosphate,platelets,ptt,tropinin_t,tropinin_i,wbc,bicarbonate,chloride,calcium,glucose,hematocrit,hemoglobin,potassium,sodium,rolling_los_hospital,rolling_los_icu,sofa_resp,sofa_coag,sofa_liver,sofa_circu,sofa_cns,sofa_renal,sofa_total,t_suspicion,t_sofa,t_sepsis_min,t_sepsis_max,curr_careunit
0,25382,19144,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 15:45:00,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,81.0,36.0,,0.4,21.0,7.0,,,,238.0,28.5,,0.13,6.1,22.0,93.0,,158.0,43.0,14.3,3.8,131.0,-0.198611,-0.199722,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,
1,25382,19144,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:00:00,,SR (Sinus Rhythm),,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.020139,0.019028,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
2,25382,19144,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:54:00,,,176.0,77.0,102.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.057639,0.056528,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
3,25382,19144,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:56:00,72.0,,,,,,,,,,,96.0,14.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.059028,0.057917,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
4,25382,19144,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 22:00:00,70.0,SR (Sinus Rhythm),169.0,79.0,102.0,,,,,36.388889,,98.0,12.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.061806,0.060694,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
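Since the same merge, drop and rename steps are repeated for every SOFA window, they could be wrapped in a helper. The sketch below uses a hypothetical function `remap_ids` and made-up ids; it assumes each key table has the columns `<id>` and `<id>_random`, as in the output above. Unlike the cells above, it merges the key onto the data, so the row and column order may differ.

```python
import pandas as pd

def remap_ids(df, key_tables):
    """Replace each original id column by its random counterpart."""
    out = df
    for id_col, key in key_tables.items():
        random_col = f"{id_col}_random"
        # Inner merge: rows whose id is missing from the key are dropped
        out = out.merge(key[[id_col, random_col]], on=id_col, how="inner")
        out = out.drop(columns=id_col).rename(columns={random_col: id_col})
    return out

# Made-up data and keys
data = pd.DataFrame({"subject_id": [5, 5], "hadm_id": [11, 12],
                     "icustay_id": [101, 102], "heart_rate": [72.0, 70.0]})
keys = {
    "icustay_id": pd.DataFrame({"icustay_id": [101, 102], "icustay_id_random": [2, 1]}),
    "hadm_id": pd.DataFrame({"hadm_id": [11, 12], "hadm_id_random": [1, 2]}),
    "subject_id": pd.DataFrame({"subject_id": [5], "subject_id_random": [7]}),
}

remapped = remap_ids(data, keys)
print(remapped)
```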


Next we reorder the columns

In [42]:
df_newid2.columns

Index(['subject_id', 'hadm_id', 'icustay_id', 'admission_type', 'admittime',
       'dischtime', 'hospital_expire_flag', 'deathtime', 'intime', 'outtime',
       ...
       'sofa_liver', 'sofa_circu', 'sofa_cns', 'sofa_renal', 'sofa_total',
       't_suspicion', 't_sofa', 't_sepsis_min', 't_sepsis_max',
       'curr_careunit'],
      dtype='object', length=116)

In [43]:
static_info = ['subject_id', 'hadm_id', 'icustay_id', 'admission_type', 'admittime',
               'dischtime', 'hospital_expire_flag', 'deathtime', 'intime', 'outtime', 'admit_diff', 
               'age', 'gender', 'ethnicity', 'insurance', 'initial_diagnosis', 'first_careunit',
               'last_careunit', 'dbsource', 'los_hospital', 'hospstay_seq', 'los_icu',
               'icustay_seq', 't_suspicion', 't_sofa', 't_sepsis_min', 't_sepsis_max']

In [44]:
new_columns = static_info + (df_newid2.columns.drop(static_info).tolist())

df_newid2 = df_newid2[new_columns]

df_sepsis = df_newid2.sort_values(['subject_id', 'charttime'])

We see that almost half of the patients (counted by icustay_id) are classed as having sepsis at some point

In [45]:
len(df_sepsis[~df_sepsis.t_sepsis_min.isnull()].icustay_id.unique())

6329

In [46]:
len(df_sepsis.icustay_id.unique())

13971

Exclude patients whose stay was shorter than 4 hours or longer than 20 days

In [47]:
df_sepsis = df_sepsis[(df_sepsis['outtime'] <= df_sepsis['intime']+pd.Timedelta(days=20)) & \
             (df_sepsis['outtime'] >= df_sepsis['intime']+pd.Timedelta(hours=4))]
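The length-of-stay filter above keeps a stay only if it lasted between 4 hours and 20 days, with both bounds inclusive. A toy check with made-up stays follows; the helper name `filter_by_los` is our own and does not appear elsewhere in the pipeline.

```python
import pandas as pd

def filter_by_los(df, min_hours=4, max_days=20):
    """Keep rows whose ICU stay length lies in [min_hours, max_days]."""
    los = df["outtime"] - df["intime"]
    mask = (los >= pd.Timedelta(hours=min_hours)) & (los <= pd.Timedelta(days=max_days))
    return df[mask]

toy = pd.DataFrame({
    "icustay_id": [1, 2, 3],
    "intime": pd.to_datetime(["2100-01-01 00:00"] * 3),
    # Stay lengths: 3 hours (too short), 4 days (kept), 31 days (too long)
    "outtime": pd.to_datetime(["2100-01-01 03:00", "2100-01-05 00:00",
                               "2100-02-01 00:00"]),
})

kept = filter_by_los(toy)
print(kept["icustay_id"].tolist())
```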

Next we give a summary of what would happen in the next stage of the analysis, when we make further exclusions based on whether the patient's time of sepsis is within 4 hours of ICU admission.

In [48]:
sepsis_summary(df_sepsis)

There are 13704 patients after LOS exclusions
---------------
 Looking at t_suspicion
We will be keeping 2072 patients and removing 4024

---------------
 Looking at t_sofa
We will be keeping 1409 patients and removing 4685

---------------
 Looking at t_sepsis_min
We will be keeping 850 patients and removing 5246

---------------
 Looking at t_sepsis_max
We will be keeping 2631 patients and removing 3463



In [49]:
df_sepsis.to_csv('../data/raw/metavision_sepsis_blood_only_data.csv',\
                 index=False)

## Other SOFA windows

We process the data for the other SOFA windows, (24, 12), (12, 6) and (6, 3), in the same way
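Because only the name of the sepsis table changes between windows, the query could be generated from a template rather than copied. This is a sketch only: the function `build_sepsis_query` is hypothetical, and the dictionary lists just the two sensitivity tables whose names appear in the cells of this section.

```python
def build_sepsis_query(sepsis_table):
    """Return the extraction query for a given sepsis cohort table."""
    # The table name is interpolated directly, so it must come from a trusted list
    return f"""
with ids as (select * from train_ids union select * from val_ids)
select e.*, s.t_suspicion, s.t_sofa, s.t_sepsis_min, s.t_sepsis_max
from extracted_data e
inner join ids on e.subject_id = ids.subject_id
  inner join {sepsis_table} s
  on e.icustay_id = s.icustay_id order by subject_id, charttime"""

sensitivity_tables = {
    (24, 12): "sepsis_cohort_time_blood_sensitivity_2412",
    (12, 6): "sepsis_cohort_time_blood_sensitivity_126",
}

queries = {window: build_sepsis_query(table)
           for window, table in sensitivity_tables.items()}
print(queries[(24, 12)])
```

Each generated string can then be passed to `execute_query` in the same way as the hand-written queries.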

#### SOFA window: (24,12)

In [50]:
Q = """
with ids as (select * from train_ids union select * from val_ids)
select e.*, s.t_suspicion, s.t_sofa, s.t_sepsis_min, s.t_sepsis_max
from extracted_data e 
inner join ids on e.subject_id = ids.subject_id
  inner join sepsis_cohort_time_blood_sensitivity_2412 s 
  on e.icustay_id = s.icustay_id order by subject_id, charttime"""
df_sepsis = execute_query(Q, con, schema_name)

In [51]:
df_sepsis = df_sepsis.drop_duplicates()

Again, we have a few cases of duplicated times; these will be resolved later in our pipeline when we bin the data on an hourly basis.

In [52]:
len(df_sepsis[df_sepsis[['icustay_id', 'charttime']].duplicated()])

11

In [53]:
df_sepsis_merge = df_sepsis.merge(df_care, on=['icustay_id', 'charttime'], how='inner')

Next we again remap the ids

In [54]:
print('Merging new icustay_ids')

df_newid2 = pd.merge(df_icustay_ids, df_sepsis_merge, on='icustay_id')

print('Merging new hadm_ids')

df_newid2 = pd.merge(df_hadm_ids, df_newid2, on='hadm_id')

print('Merging new subject_ids')

df_newid2 = pd.merge(df_subject_ids, df_newid2, on='subject_id')

df_newid2.head()

Merging new icustay_ids
Merging new hadm_ids
Merging new subject_ids


Unnamed: 0,subject_id,subject_id_random,hadm_id,hadm_id_random,icustay_id,icustay_id_random,admission_type,admittime,dischtime,hospital_expire_flag,deathtime,intime,outtime,admit_diff,age,gender,ethnicity,insurance,initial_diagnosis,first_careunit,last_careunit,dbsource,los_hospital,hospstay_seq,los_icu,icustay_seq,charttime,heart_rate,heart_rhythm,nbp_sys,nbp_dias,nbp_mean,abp_sys,abp_dias,abp_mean,temp_site,temp_celcius,verbal_response,o2sat,resp_rate,resp_rate_spont,resp_rate_set,resp_rate_total,minute_vol,mean_airway_pressure,peak_insp_rate,plateau_pressure,o2flow_chart,o2_device,vent_mode,on_vent,tidal_vol_set,tidal_vol_obs,tidal_vol_spon,peep_set,fio2_chart,glucose_chart,specimen,baseexcess,bicarbonate_bg,totalco2,carboxyhemoglobin,chloride_bg,calcium_bg,glucose_bg,hematocrit_bg,hemoglobin_bg,intubated,lactate,methemoglobin,o2flow_lab,fio2,so2,pco2,peep,ph,po2,potassium_bg,requiredo2,sodium_bg,temperature_bg,tidalvolume,ventilator,alkalinephos,ast,bilirubin_direct,bilirubin_total,bun,creatinine,fibrinogen,magnesium,phosphate,platelets,ptt,tropinin_t,tropinin_i,wbc,bicarbonate,chloride,calcium,glucose,hematocrit,hemoglobin,potassium,sodium,rolling_los_hospital,rolling_los_icu,sofa_resp,sofa_coag,sofa_liver,sofa_circu,sofa_cns,sofa_renal,sofa_total,t_suspicion,t_sofa,t_sepsis_min,t_sepsis_max,curr_careunit
0,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 15:45:00,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,81.0,36.0,,0.4,21.0,7.0,,,,238.0,28.5,,0.13,6.1,22.0,93.0,,158.0,43.0,14.3,3.8,131.0,-0.198611,-0.199722,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,
1,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:00:00,,SR (Sinus Rhythm),,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.020139,0.019028,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
2,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:54:00,,,176.0,77.0,102.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.057639,0.056528,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
3,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:56:00,72.0,,,,,,,,,,,96.0,14.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.059028,0.057917,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
4,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 22:00:00,70.0,SR (Sinus Rhythm),169.0,79.0,102.0,,,,,36.388889,,98.0,12.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.061806,0.060694,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU


In [55]:
df_newid2.drop(['subject_id', 'hadm_id', 'icustay_id'], axis = 1, inplace=True)

df_newid2.rename(columns={"subject_id_random": "subject_id", "hadm_id_random": "hadm_id",\
                          "icustay_id_random":"icustay_id"}, inplace=True)

We move the static columns to the start of the dataframe

In [56]:
new_columns = static_info + (df_newid2.columns.drop(static_info).tolist())

df_newid2 = df_newid2[new_columns]

df_sepsis = df_newid2.sort_values(['subject_id', 'charttime'])

We see that fewer patients have been classed as having sepsis compared with previously

In [57]:
len(df_sepsis[~df_sepsis.t_sepsis_min.isnull()].icustay_id.unique())

5961

In [58]:
len(df_sepsis.icustay_id.unique())

13971

Exclude patients whose stay was shorter than 4 hours or longer than 20 days, then give a summary

In [59]:
df_sepsis = df_sepsis[(df_sepsis['outtime']<=df_sepsis['intime']+pd.Timedelta(days=20)) & \
             (df_sepsis['outtime']>=df_sepsis['intime']+pd.Timedelta(hours=4))]

In [60]:
sepsis_summary(df_sepsis)

There are 13704 patients after LOS exclusions
---------------
 Looking at t_suspicion
We will be keeping 1937 patients and removing 3800

---------------
 Looking at t_sofa
We will be keeping 1278 patients and removing 4459

---------------
 Looking at t_sepsis_min
We will be keeping 896 patients and removing 4841

---------------
 Looking at t_sepsis_max
We will be keeping 2319 patients and removing 3418



In [61]:
df_sepsis.to_csv('../data/raw/metavision_sepsis_blood_only_data_sensitivity_24_12.csv'\
                 , index=False)

#### SOFA window: (12,6)

In [62]:
Q = """
with ids as (select * from train_ids union select * from val_ids)
select e.*, s.t_suspicion, s.t_sofa, s.t_sepsis_min, s.t_sepsis_max
from extracted_data e 
inner join ids on e.subject_id = ids.subject_id
  inner join sepsis_cohort_time_blood_sensitivity_126 s 
  on e.icustay_id = s.icustay_id order by subject_id, charttime"""
df_sepsis = execute_query(Q, con, schema_name)

In [63]:
df_sepsis = df_sepsis.drop_duplicates()

Note that the same issue with a few duplicate charttime points remains

In [64]:
len(df_sepsis[df_sepsis[['icustay_id', 'charttime']].duplicated()])

11

In [65]:
df_sepsis_merge = df_sepsis.merge(df_care, on=['icustay_id', 'charttime'], how='inner')

Next we again remap the ids

In [66]:
print('Merging new icustay_ids')

df_newid2 = pd.merge(df_icustay_ids, df_sepsis_merge, on='icustay_id')

print('Merging new hadm_ids')

df_newid2 = pd.merge(df_hadm_ids, df_newid2, on='hadm_id')

print('Merging new subject_ids')

df_newid2 = pd.merge(df_subject_ids, df_newid2, on='subject_id')

df_newid2.head()

Merging new icustay_ids
Merging new hadm_ids
Merging new subject_ids


Unnamed: 0,subject_id,subject_id_random,hadm_id,hadm_id_random,icustay_id,icustay_id_random,admission_type,admittime,dischtime,hospital_expire_flag,deathtime,intime,outtime,admit_diff,age,gender,ethnicity,insurance,initial_diagnosis,first_careunit,last_careunit,dbsource,los_hospital,hospstay_seq,los_icu,icustay_seq,charttime,heart_rate,heart_rhythm,nbp_sys,nbp_dias,nbp_mean,abp_sys,abp_dias,abp_mean,temp_site,temp_celcius,verbal_response,o2sat,resp_rate,resp_rate_spont,resp_rate_set,resp_rate_total,minute_vol,mean_airway_pressure,peak_insp_rate,plateau_pressure,o2flow_chart,o2_device,vent_mode,on_vent,tidal_vol_set,tidal_vol_obs,tidal_vol_spon,peep_set,fio2_chart,glucose_chart,specimen,baseexcess,bicarbonate_bg,totalco2,carboxyhemoglobin,chloride_bg,calcium_bg,glucose_bg,hematocrit_bg,hemoglobin_bg,intubated,lactate,methemoglobin,o2flow_lab,fio2,so2,pco2,peep,ph,po2,potassium_bg,requiredo2,sodium_bg,temperature_bg,tidalvolume,ventilator,alkalinephos,ast,bilirubin_direct,bilirubin_total,bun,creatinine,fibrinogen,magnesium,phosphate,platelets,ptt,tropinin_t,tropinin_i,wbc,bicarbonate,chloride,calcium,glucose,hematocrit,hemoglobin,potassium,sodium,rolling_los_hospital,rolling_los_icu,sofa_resp,sofa_coag,sofa_liver,sofa_circu,sofa_cns,sofa_renal,sofa_total,t_suspicion,t_sofa,t_sepsis_min,t_sepsis_max,curr_careunit
0,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 15:45:00,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,81.0,36.0,,0.4,21.0,7.0,,,,238.0,28.5,,0.13,6.1,22.0,93.0,,158.0,43.0,14.3,3.8,131.0,-0.198611,-0.199722,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,
1,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:00:00,,SR (Sinus Rhythm),,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.020139,0.019028,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
2,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:54:00,,,176.0,77.0,102.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.057639,0.056528,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
3,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:56:00,72.0,,,,,,,,,,,96.0,14.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.059028,0.057917,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
4,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 22:00:00,70.0,SR (Sinus Rhythm),169.0,79.0,102.0,,,,,36.388889,,98.0,12.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.061806,0.060694,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU


In [67]:
df_newid2.drop(['subject_id', 'hadm_id', 'icustay_id'], axis = 1, inplace=True)

df_newid2.rename(columns={"subject_id_random": "subject_id", "hadm_id_random": "hadm_id",\
                          "icustay_id_random":"icustay_id"}, inplace=True)

We move the static columns to the start of the dataframe

In [68]:
new_columns = static_info + (df_newid2.columns.drop(static_info).tolist())

df_newid2 = df_newid2[new_columns]

df_sepsis = df_newid2.sort_values(['subject_id', 'charttime'])
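The column reordering above relies on `Index.drop` preserving the order of the remaining columns. A small sketch on toy data, where `static_cols` is a hypothetical stand-in for the notebook's `static_info` list:

```python
import pandas as pd

df = pd.DataFrame({'charttime': [1], 'heart_rate': [72.0],
                   'subject_id': [7], 'age': [50.9]})

static_cols = ['subject_id', 'age']  # hypothetical stand-in for static_info

# Static columns first, then every other column in its original order
ordered = static_cols + df.columns.drop(static_cols).tolist()
df = df[ordered]
print(df.columns.tolist())
```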

Again, we see that the number of patients classed as having sepsis decreases as the window size decreases (compared with 5961 for the (24,12) window)

In [69]:
len(df_sepsis[~df_sepsis.t_sepsis_min.isnull()].icustay_id.unique())

5277

In [70]:
len(df_sepsis.icustay_id.unique())

13971

Exclude patients based on length of stay and print summary for future exclusions

In [71]:
df_sepsis = df_sepsis[(df_sepsis['outtime']<=df_sepsis['intime']+pd.Timedelta(days=20)) & \
             (df_sepsis['outtime']>=df_sepsis['intime']+pd.Timedelta(hours=4))]

In [72]:
sepsis_summary(df_sepsis)

There are 13704 patients after LOS exclusions
---------------
 Looking at t_suspicion
We will be keeping 1705 patients and removing 3355

---------------
 Looking at t_sofa
We will be keeping 1065 patients and removing 3995

---------------
 Looking at t_sepsis_min
We will be keeping 918 patients and removing 4142

---------------
 Looking at t_sepsis_max
We will be keeping 1852 patients and removing 3208



In [73]:
df_sepsis.to_csv('../data/raw/metavision_sepsis_blood_only_data_sensitivity_12_6.csv',\
                 index=False)

#### SOFA window: (6,3)

In [74]:
Q = """
with ids as (select * from train_ids union select * from val_ids)
select e.*, s.t_suspicion, s.t_sofa, s.t_sepsis_min, s.t_sepsis_max
from extracted_data e 
inner join ids on e.subject_id = ids.subject_id
  inner join sepsis_cohort_time_blood_sensitivity_63 s 
  on e.icustay_id = s.icustay_id order by subject_id, charttime"""
df_sepsis = execute_query(Q, con, schema_name)

In [75]:
df_sepsis = df_sepsis.drop_duplicates()

In [76]:
len(df_sepsis[df_sepsis[['icustay_id', 'charttime']].duplicated()])

11

In [77]:
df_sepsis_merge = df_sepsis.merge(df_care, on=['icustay_id', 'charttime'], how='inner')

Next we again remap the ids

In [78]:
print('Merging new icustay_ids')

df_newid2 = pd.merge(df_icustay_ids, df_sepsis_merge, on='icustay_id')

print('Merging new hadm_ids')

df_newid2 = pd.merge(df_hadm_ids, df_newid2, on='hadm_id')

print('Merging new subject_ids')

df_newid2 = pd.merge(df_subject_ids, df_newid2, on='subject_id')

df_newid2.head()

Merging new icustay_ids
Merging new hadm_ids
Merging new subject_ids


Unnamed: 0,subject_id,subject_id_random,hadm_id,hadm_id_random,icustay_id,icustay_id_random,admission_type,admittime,dischtime,hospital_expire_flag,deathtime,intime,outtime,admit_diff,age,gender,ethnicity,insurance,initial_diagnosis,first_careunit,last_careunit,dbsource,los_hospital,hospstay_seq,los_icu,icustay_seq,charttime,heart_rate,heart_rhythm,nbp_sys,nbp_dias,nbp_mean,abp_sys,abp_dias,abp_mean,temp_site,temp_celcius,verbal_response,o2sat,resp_rate,resp_rate_spont,resp_rate_set,resp_rate_total,minute_vol,mean_airway_pressure,peak_insp_rate,plateau_pressure,o2flow_chart,o2_device,vent_mode,on_vent,tidal_vol_set,tidal_vol_obs,tidal_vol_spon,peep_set,fio2_chart,glucose_chart,specimen,baseexcess,bicarbonate_bg,totalco2,carboxyhemoglobin,chloride_bg,calcium_bg,glucose_bg,hematocrit_bg,hemoglobin_bg,intubated,lactate,methemoglobin,o2flow_lab,fio2,so2,pco2,peep,ph,po2,potassium_bg,requiredo2,sodium_bg,temperature_bg,tidalvolume,ventilator,alkalinephos,ast,bilirubin_direct,bilirubin_total,bun,creatinine,fibrinogen,magnesium,phosphate,platelets,ptt,tropinin_t,tropinin_i,wbc,bicarbonate,chloride,calcium,glucose,hematocrit,hemoglobin,potassium,sodium,rolling_los_hospital,rolling_los_icu,sofa_resp,sofa_coag,sofa_liver,sofa_circu,sofa_cns,sofa_renal,sofa_total,t_suspicion,t_sofa,t_sepsis_min,t_sepsis_max,curr_careunit
0,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 15:45:00,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,81.0,36.0,,0.4,21.0,7.0,,,,238.0,28.5,,0.13,6.1,22.0,93.0,,158.0,43.0,14.3,3.8,131.0,-0.198611,-0.199722,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,
1,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:00:00,,SR (Sinus Rhythm),,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.020139,0.019028,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
2,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:54:00,,,176.0,77.0,102.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.057639,0.056528,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
3,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:56:00,72.0,,,,,,,,,,,96.0,14.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.059028,0.057917,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
4,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 22:00:00,70.0,SR (Sinus Rhythm),169.0,79.0,102.0,,,,,36.388889,,98.0,12.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.061806,0.060694,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU


In [79]:
df_newid2.drop(['subject_id', 'hadm_id', 'icustay_id'], axis = 1, inplace=True)

df_newid2.rename(columns={"subject_id_random": "subject_id", "hadm_id_random": "hadm_id",\
                          "icustay_id_random":"icustay_id"}, inplace=True)

We move the static columns to the start of the dataframe

In [80]:
new_columns = static_info + (df_newid2.columns.drop(static_info).tolist())

df_newid2 = df_newid2[new_columns]

df_sepsis = df_newid2.sort_values(['subject_id', 'charttime'])

We note that reducing the SOFA window from (48,24) to (6,3) substantially reduces the number of patients identified with sepsis, from around 6000 to around 4000.

In [81]:
len(df_sepsis[~df_sepsis.t_sepsis_min.isnull()].icustay_id.unique())

4128

In [82]:
len(df_sepsis.icustay_id.unique())

13971

Exclude patients based on length of stay and print summary for future exclusions

In [83]:
df_sepsis = df_sepsis[(df_sepsis['outtime'] <= df_sepsis['intime']+pd.Timedelta(days=20)) & \
             (df_sepsis['outtime'] >= df_sepsis['intime']+pd.Timedelta(hours=4))]

In [84]:
sepsis_summary(df_sepsis)

There are 13704 patients after LOS exclusions
---------------
 Looking at t_suspicion
We will be keeping 1202 patients and removing 2731

---------------
 Looking at t_sofa
We will be keeping 887 patients and removing 3046

---------------
 Looking at t_sepsis_min
We will be keeping 830 patients and removing 3103

---------------
 Looking at t_sepsis_max
We will be keeping 1259 patients and removing 2674



In [85]:
df_sepsis.to_csv('../data/raw/metavision_sepsis_blood_only_data_sensitivity_6_3.csv',\
                 index=False)

## Other sensitivity analysis

We also check sensitivity to the choice of cultures (whether to use all cultures or only blood cultures), the SOFA criterion (whether to use the absolute value or the change in score) and the CNS component (whether it adds unnecessary noise). All of these additional sensitivity analyses use a SOFA window of (24,12).

### Sepsis - All cultures 

Now we extract the sepsis onset time obtained by using all cultures taken, rather than just blood cultures. We expect that this may increase sensitivity at the cost of much lower specificity.

In [86]:
Q = """
with ids as (select * from train_ids union select * from val_ids)
select e.*, s.t_suspicion, s.t_sofa, s.t_sepsis_min, s.t_sepsis_max
from extracted_data e 
inner join ids on e.subject_id = ids.subject_id
  inner join sepsis_cohort_time_sensitivity_2412 s 
  on e.icustay_id = s.icustay_id order by subject_id, charttime"""
df_sepsis = execute_query(Q, con, schema_name)

In [87]:
df_sepsis = df_sepsis.drop_duplicates()

In [88]:
len(df_sepsis[df_sepsis[['icustay_id', 'charttime']].duplicated()])

11

In [89]:
df_sepsis_merge = df_sepsis.merge(df_care, on=['icustay_id', 'charttime'], how='inner')

Next we again remap the ids

In [90]:
print('Merging new icustay_ids')

df_newid2 = pd.merge(df_icustay_ids, df_sepsis_merge, on='icustay_id')

print('Merging new hadm_ids')

df_newid2 = pd.merge(df_hadm_ids, df_newid2, on='hadm_id')

print('Merging new subject_ids')

df_newid2 = pd.merge(df_subject_ids, df_newid2, on='subject_id')

df_newid2.head()

Merging new icustay_ids
Merging new hadm_ids
Merging new subject_ids


Unnamed: 0,subject_id,subject_id_random,hadm_id,hadm_id_random,icustay_id,icustay_id_random,admission_type,admittime,dischtime,hospital_expire_flag,deathtime,intime,outtime,admit_diff,age,gender,ethnicity,insurance,initial_diagnosis,first_careunit,last_careunit,dbsource,los_hospital,hospstay_seq,los_icu,icustay_seq,charttime,heart_rate,heart_rhythm,nbp_sys,nbp_dias,nbp_mean,abp_sys,abp_dias,abp_mean,temp_site,temp_celcius,verbal_response,o2sat,resp_rate,resp_rate_spont,resp_rate_set,resp_rate_total,minute_vol,mean_airway_pressure,peak_insp_rate,plateau_pressure,o2flow_chart,o2_device,vent_mode,on_vent,tidal_vol_set,tidal_vol_obs,tidal_vol_spon,peep_set,fio2_chart,glucose_chart,specimen,baseexcess,bicarbonate_bg,totalco2,carboxyhemoglobin,chloride_bg,calcium_bg,glucose_bg,hematocrit_bg,hemoglobin_bg,intubated,lactate,methemoglobin,o2flow_lab,fio2,so2,pco2,peep,ph,po2,potassium_bg,requiredo2,sodium_bg,temperature_bg,tidalvolume,ventilator,alkalinephos,ast,bilirubin_direct,bilirubin_total,bun,creatinine,fibrinogen,magnesium,phosphate,platelets,ptt,tropinin_t,tropinin_i,wbc,bicarbonate,chloride,calcium,glucose,hematocrit,hemoglobin,potassium,sodium,rolling_los_hospital,rolling_los_icu,sofa_resp,sofa_coag,sofa_liver,sofa_circu,sofa_cns,sofa_renal,sofa_total,t_suspicion,t_sofa,t_sepsis_min,t_sepsis_max,curr_careunit
0,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 15:45:00,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,81.0,36.0,,0.4,21.0,7.0,,,,238.0,28.5,,0.13,6.1,22.0,93.0,,158.0,43.0,14.3,3.8,131.0,-0.198611,-0.199722,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,
1,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:00:00,,SR (Sinus Rhythm),,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.020139,0.019028,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
2,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:54:00,,,176.0,77.0,102.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.057639,0.056528,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
3,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:56:00,72.0,,,,,,,,,,,96.0,14.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.059028,0.057917,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
4,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 22:00:00,70.0,SR (Sinus Rhythm),169.0,79.0,102.0,,,,,36.388889,,98.0,12.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.061806,0.060694,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU


In [91]:
df_newid2.drop(['subject_id', 'hadm_id', 'icustay_id'], axis = 1, inplace=True)

df_newid2.rename(columns={"subject_id_random": "subject_id", "hadm_id_random": "hadm_id",\
                          "icustay_id_random":"icustay_id"}, inplace=True)

We move the static columns to the start of the dataframe

In [92]:
new_columns = static_info + (df_newid2.columns.drop(static_info).tolist())

df_newid2 = df_newid2[new_columns]

df_sepsis = df_newid2.sort_values(['subject_id', 'charttime'])

We see that with the inclusion of all cultures, a greater number of people have been flagged as having sepsis (compared with 5961 when using blood cultures only)

In [93]:
len(df_sepsis[~df_sepsis.t_sepsis_min.isnull()].icustay_id.unique())

6182

In [94]:
len(df_sepsis.icustay_id.unique())

13971

Exclude patients based on length of stay and print summary for future exclusions

In [95]:
df_sepsis = df_sepsis[(df_sepsis['outtime']<=df_sepsis['intime']+pd.Timedelta(days=20)) & \
             (df_sepsis['outtime']>=df_sepsis['intime']+pd.Timedelta(hours=4))]

In [96]:
sepsis_summary(df_sepsis)

There are 13704 patients after LOS exclusions
---------------
 Looking at t_suspicion
We will be keeping 964 patients and removing 4989

---------------
 Looking at t_sofa
We will be keeping 1238 patients and removing 4715

---------------
 Looking at t_sepsis_min
We will be keeping 602 patients and removing 5351

---------------
 Looking at t_sepsis_max
We will be keeping 1600 patients and removing 4353



In [97]:
df_sepsis.to_csv('../data/raw/all_cultures_sensitivity_24_12.csv',\
                 index=False)

### Sepsis - Absolute value exceeds 2 instead of change of 2

The original Sepsis-3 paper defined t_sofa as the time at which a patient had a SOFA score of 2 or above, unless they had underlying health issues. We therefore look at how performance varies if the sepsis time is defined using this absolute SOFA value instead of a change in SOFA score. (The change in SOFA score used above appears to be more popular in the early sepsis detection literature.)
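The difference between the two criteria can be illustrated on a single toy patient. The sketch below is a deliberate simplification and an assumption on our part: it measures the change against the running minimum of `sofa_total`, whereas the cohort tables in the database apply the criterion within a window around the suspicion time. The scores are invented.

```python
import pandas as pd

# Toy patient (invented) with a chronically elevated but stable SOFA score
sofa = pd.DataFrame({
    'charttime': pd.to_datetime(['2110-01-01 00:00', '2110-01-01 06:00',
                                 '2110-01-01 12:00']),
    'sofa_total': [4.0, 4.0, 5.0],
})

def first_time(mask, times):
    """Return the first time at which mask is True, or NaT if it never is."""
    hits = times[mask]
    return hits.iloc[0] if len(hits) else pd.NaT

# Absolute criterion: SOFA of 2 or above
abs_mask = sofa['sofa_total'] >= 2
# Change criterion (simplified): rise of 2 or more above the running minimum
delta_mask = (sofa['sofa_total'] - sofa['sofa_total'].cummin()) >= 2

t_abs = first_time(abs_mask, sofa['charttime'])
t_delta = first_time(delta_mask, sofa['charttime'])
print(t_abs, t_delta)
```

Here the absolute criterion fires at the first observation while the change criterion never does, which is consistent with the absolute definition flagging more patients.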

In [98]:
Q = """
with ids as (select * from train_ids union select * from val_ids)
select e.*, s.t_suspicion, s.t_sofa, s.t_sepsis_min, s.t_sepsis_max
from extracted_data e 
inner join ids on e.subject_id = ids.subject_id
  inner join sepsis_cohort_time_blood_abs_2412 s 
  on e.icustay_id = s.icustay_id order by subject_id, charttime"""
df_sepsis = execute_query(Q, con, schema_name)

In [99]:
df_sepsis = df_sepsis.drop_duplicates()

In [100]:
len(df_sepsis[df_sepsis[['icustay_id', 'charttime']].duplicated()])

11

In [101]:
df_sepsis_merge = df_sepsis.merge(df_care, on=['icustay_id', 'charttime'], how='inner')

Next we again remap the ids

In [102]:
print('Merging new icustay_ids')

df_newid2 = pd.merge(df_icustay_ids, df_sepsis_merge, on='icustay_id')

print('Merging new hadm_ids')

df_newid2 = pd.merge(df_hadm_ids, df_newid2, on='hadm_id')

print('Merging new subject_ids')

df_newid2 = pd.merge(df_subject_ids, df_newid2, on='subject_id')

df_newid2.head()

Merging new icustay_ids
Merging new hadm_ids
Merging new subject_ids


Unnamed: 0,subject_id,subject_id_random,hadm_id,hadm_id_random,icustay_id,icustay_id_random,admission_type,admittime,dischtime,hospital_expire_flag,deathtime,intime,outtime,admit_diff,age,gender,ethnicity,insurance,initial_diagnosis,first_careunit,last_careunit,dbsource,los_hospital,hospstay_seq,los_icu,icustay_seq,charttime,heart_rate,heart_rhythm,nbp_sys,nbp_dias,nbp_mean,abp_sys,abp_dias,abp_mean,temp_site,temp_celcius,verbal_response,o2sat,resp_rate,resp_rate_spont,resp_rate_set,resp_rate_total,minute_vol,mean_airway_pressure,peak_insp_rate,plateau_pressure,o2flow_chart,o2_device,vent_mode,on_vent,tidal_vol_set,tidal_vol_obs,tidal_vol_spon,peep_set,fio2_chart,glucose_chart,specimen,baseexcess,bicarbonate_bg,totalco2,carboxyhemoglobin,chloride_bg,calcium_bg,glucose_bg,hematocrit_bg,hemoglobin_bg,intubated,lactate,methemoglobin,o2flow_lab,fio2,so2,pco2,peep,ph,po2,potassium_bg,requiredo2,sodium_bg,temperature_bg,tidalvolume,ventilator,alkalinephos,ast,bilirubin_direct,bilirubin_total,bun,creatinine,fibrinogen,magnesium,phosphate,platelets,ptt,tropinin_t,tropinin_i,wbc,bicarbonate,chloride,calcium,glucose,hematocrit,hemoglobin,potassium,sodium,rolling_los_hospital,rolling_los_icu,sofa_resp,sofa_coag,sofa_liver,sofa_circu,sofa_cns,sofa_renal,sofa_total,t_suspicion,t_sofa,t_sepsis_min,t_sepsis_max,curr_careunit
0,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 15:45:00,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,81.0,36.0,,0.4,21.0,7.0,,,,238.0,28.5,,0.13,6.1,22.0,93.0,,158.0,43.0,14.3,3.8,131.0,-0.198611,-0.199722,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,
1,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:00:00,,SR (Sinus Rhythm),,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.020139,0.019028,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
2,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:54:00,,,176.0,77.0,102.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.057639,0.056528,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
3,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:56:00,72.0,,,,,,,,,,,96.0,14.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.059028,0.057917,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
4,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 22:00:00,70.0,SR (Sinus Rhythm),169.0,79.0,102.0,,,,,36.388889,,98.0,12.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.061806,0.060694,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU


In [103]:
df_newid2.drop(['subject_id', 'hadm_id', 'icustay_id'], axis = 1, inplace=True)

df_newid2.rename(columns={"subject_id_random": "subject_id", "hadm_id_random": "hadm_id",\
                          "icustay_id_random":"icustay_id"}, inplace=True)

We move the static columns to the start of the dataframe

In [104]:
new_columns = static_info + (df_newid2.columns.drop(static_info).tolist())

df_newid2 = df_newid2[new_columns]

df_sepsis = df_newid2.sort_values(['subject_id', 'charttime'])

We see that a greater number of people are marked as having sepsis when we look only at an absolute value of SOFA rather than a SOFA increase (compared with the 5961 for a SOFA change)

In [105]:
len(df_sepsis[~df_sepsis.t_sepsis_min.isnull()].icustay_id.unique())

6241

In [106]:
len(df_sepsis.icustay_id.unique())

13971

Exclude patients based on length of stay and print summary for future exclusions

In [107]:
df_sepsis = df_sepsis[(df_sepsis['outtime']<=df_sepsis['intime']+pd.Timedelta(days=20)) & \
             (df_sepsis['outtime']>=df_sepsis['intime']+pd.Timedelta(hours=4))]

In [108]:
sepsis_summary(df_sepsis)

There are 13704 patients after LOS exclusions
---------------
 Looking at t_suspicion
We will be keeping 2179 patients and removing 3825

---------------
 Looking at t_sofa
We will be keeping 1411 patients and removing 4593

---------------
 Looking at t_sepsis_min
We will be keeping 1038 patients and removing 4966

---------------
 Looking at t_sepsis_max
We will be keeping 2552 patients and removing 3452



In [109]:
df_sepsis.to_csv('../data/raw/absolute_values_sensitivity_24_12.csv',\
                 index=False)

### Sepsis - No CNS component used

In this final sensitivity analysis, we look at whether the CNS component of the SOFA score has a large impact. We are particularly interested in this aspect because there is a discrepancy between caregivers when assessing patients who are intubated. Sometimes these patients are assigned high scores for GCS, but at other times they are not. As a result, this may create a lot of noise in the data.
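To make the concern concrete, the sketch below shows how a transient jump in the CNS component alone can produce a SOFA change of 2 or more. The scores are invented, and summing the remaining components is only our assumption about what a no-CNS total looks like; how `extracted_data_nogcs` is actually built in the database is not shown in this notebook.

```python
import pandas as pd

components = ['sofa_resp', 'sofa_coag', 'sofa_liver',
              'sofa_circu', 'sofa_cns', 'sofa_renal']

# Toy scores (invented): only the CNS component changes, at the second time point
scores = pd.DataFrame({
    'sofa_resp': [0, 0, 0], 'sofa_coag': [0, 0, 0], 'sofa_liver': [0, 0, 0],
    'sofa_circu': [1, 1, 1], 'sofa_cns': [0, 3, 0], 'sofa_renal': [0, 0, 0],
})

scores['sofa_total'] = scores[components].sum(axis=1)
no_cns = [c for c in components if c != 'sofa_cns']
scores['sofa_total_nocns'] = scores[no_cns].sum(axis=1)

# Largest rise above the lowest score, with and without the CNS component
rise_with = scores['sofa_total'].max() - scores['sofa_total'].min()
rise_without = scores['sofa_total_nocns'].max() - scores['sofa_total_nocns'].min()
print(rise_with, rise_without)
```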

In [110]:
Q = """
with ids as (select * from train_ids union select * from val_ids)
select e.*, s.t_suspicion, s.t_sofa, s.t_sepsis_min, s.t_sepsis_max
from extracted_data_nogcs e 
inner join ids on e.subject_id = ids.subject_id
  inner join sepsis_cohort_time_blood_nogcs_2412 s 
  on e.icustay_id = s.icustay_id order by subject_id, charttime"""
df_sepsis = execute_query(Q, con, schema_name)

In [111]:
df_sepsis = df_sepsis.drop_duplicates()

In [112]:
len(df_sepsis[df_sepsis[['icustay_id', 'charttime']].duplicated()])

11

In [113]:
df_sepsis_merge = df_sepsis.merge(df_care, on=['icustay_id', 'charttime'], how='inner')

Next we again remap the ids

In [114]:
print('Merging new icustay_ids')

df_newid2 = pd.merge(df_icustay_ids, df_sepsis_merge, on='icustay_id')

print('Merging new hadm_ids')

df_newid2 = pd.merge(df_hadm_ids, df_newid2, on='hadm_id')

print('Merging new subject_ids')

df_newid2 = pd.merge(df_subject_ids, df_newid2, on='subject_id')

df_newid2.head()

Merging new icustay_ids
Merging new hadm_ids
Merging new subject_ids


Unnamed: 0,subject_id,subject_id_random,hadm_id,hadm_id_random,icustay_id,icustay_id_random,admission_type,admittime,dischtime,hospital_expire_flag,deathtime,intime,outtime,admit_diff,age,gender,ethnicity,insurance,initial_diagnosis,first_careunit,last_careunit,dbsource,los_hospital,hospstay_seq,los_icu,icustay_seq,charttime,heart_rate,heart_rhythm,nbp_sys,nbp_dias,nbp_mean,abp_sys,abp_dias,abp_mean,temp_site,temp_celcius,verbal_response,o2sat,resp_rate,resp_rate_spont,resp_rate_set,resp_rate_total,minute_vol,mean_airway_pressure,peak_insp_rate,plateau_pressure,o2flow_chart,o2_device,vent_mode,on_vent,tidal_vol_set,tidal_vol_obs,tidal_vol_spon,peep_set,fio2_chart,glucose_chart,specimen,baseexcess,bicarbonate_bg,totalco2,carboxyhemoglobin,chloride_bg,calcium_bg,glucose_bg,hematocrit_bg,hemoglobin_bg,intubated,lactate,methemoglobin,o2flow_lab,fio2,so2,pco2,peep,ph,po2,potassium_bg,requiredo2,sodium_bg,temperature_bg,tidalvolume,ventilator,alkalinephos,ast,bilirubin_direct,bilirubin_total,bun,creatinine,fibrinogen,magnesium,phosphate,platelets,ptt,tropinin_t,tropinin_i,wbc,bicarbonate,chloride,calcium,glucose,hematocrit,hemoglobin,potassium,sodium,rolling_los_hospital,rolling_los_icu,sofa_resp,sofa_coag,sofa_liver,sofa_circu,sofa_cns,sofa_renal,sofa_total,t_suspicion,t_sofa,t_sepsis_min,t_sepsis_max,curr_careunit
0,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 15:45:00,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,81.0,36.0,,0.4,21.0,7.0,,,,238.0,28.5,,0.13,6.1,22.0,93.0,,158.0,43.0,14.3,3.8,131.0,-0.198611,-0.199722,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,
1,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:00:00,,SR (Sinus Rhythm),,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.020139,0.019028,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
2,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:54:00,,,176.0,77.0,102.0,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.057639,0.056528,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
3,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 21:56:00,72.0,,,,,,,,,,,96.0,14.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.059028,0.057917,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU
4,28566,25382,190098,19144,204609,16732,EMERGENCY,2110-08-06 20:31:00,2110-08-08 16:25:00,0,NaT,2110-08-06 20:32:36,2110-08-07 18:32:14,0.026667,50.8759,M,black,Medicare,CONGESTIVE HEART FAILURE,MICU,MICU,metavision,2.0,2,1.0,1,2110-08-06 22:00:00,70.0,SR (Sinus Rhythm),169.0,79.0,102.0,,,,,36.388889,,98.0,12.0,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.061806,0.060694,0.0,0.0,0.0,0.0,0.0,4.0,4.0,NaT,NaT,NaT,NaT,MICU


In [115]:
df_newid2.drop(['subject_id', 'hadm_id', 'icustay_id'], axis = 1, inplace=True)

df_newid2.rename(columns={"subject_id_random": "subject_id", "hadm_id_random": "hadm_id",\
                          "icustay_id_random":"icustay_id"}, inplace=True)

We move the static columns to the start of the dataframe

In [116]:
new_columns = static_info + (df_newid2.columns.drop(static_info).tolist())

df_newid2 = df_newid2[new_columns]

df_sepsis = df_newid2.sort_values(['subject_id', 'charttime'])

For this sensitivity analysis, we see that fewer people have been flagged as having sepsis (compared with 5961 for the blood-culture cohort that includes the CNS component). This is not unexpected, as fluctuations in the CNS score alone can cause a SOFA score change of 2 or more.

In [117]:
len(df_sepsis[~df_sepsis.t_sepsis_min.isnull()].icustay_id.unique())

5463

In [118]:
len(df_sepsis.icustay_id.unique())

13971

Exclude patients based on length of stay and print summary for future exclusions

In [119]:
df_sepsis = df_sepsis[(df_sepsis['outtime']<=df_sepsis['intime']+pd.Timedelta(days=20)) & \
             (df_sepsis['outtime']>=df_sepsis['intime']+pd.Timedelta(hours=4))]

In [120]:
sepsis_summary(df_sepsis)

There are 13704 patients after LOS exclusions
---------------
 Looking at t_suspicion
We will be keeping 1746 patients and removing 3498

---------------
 Looking at t_sofa
We will be keeping 1159 patients and removing 4085

---------------
 Looking at t_sepsis_min
We will be keeping 816 patients and removing 4428

---------------
 Looking at t_sepsis_max
We will be keeping 2089 patients and removing 3155



In [121]:
df_sepsis.to_csv('../data/raw/no_gcs_sensitivity_24_12.csv',\
                 index=False)

In [122]:
con.close()