### Cohort selection process for MIMIC data.

* We combine MICU (medical) and SICU (surgical) as an approximation to the GICU in Bristol.
* We use only Metavision data because it is relatively well structured (compared with Carevue).

The conditions for cohort inclusion are ICUSTAYs that:

* Are the first (or only) ICU stay of a hospital admission.
* Have outcome information (survival/mortality or readmission to ICU).
* Have 'good' data availability (we test for this later, not in this notebook).

A good outcome is defined as:

* Patient survives the hospital stay.
* Patient is not readmitted to ICU during the same hospital admission.

Therefore, a bad outcome is (at least) one of:

* Death during hospital admission (on ICU or ward).
* Readmission to ICU during hospital stay (even if patient survives).
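
The outcome definition above can be written as a small predicate. This is a minimal sketch only: the function name `is_bad_outcome` and its two arguments are illustrative and do not appear elsewhere in this notebook.

```python
def is_bad_outcome(hospital_expire_flag, n_icu_stays):
    """Return True if a hospital admission counts as a bad outcome.

    hospital_expire_flag: 1 if the patient died during the admission, else 0.
    n_icu_stays: number of ICU stays recorded for the admission.
    """
    died = hospital_expire_flag == 1
    readmitted = n_icu_stays > 1  # a second ICU stay in the same admission
    return died or readmitted


print(is_bad_outcome(0, 1))  # survived, single ICU stay
print(is_bad_outcome(0, 2))  # survived, but readmitted to ICU
print(is_bad_outcome(1, 1))  # died during the admission
```

A good outcome is simply the negation: the patient survives and has exactly one ICU stay in the admission.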

In [1]:
import graphlab
import graphlab.aggregate as agg
import numpy as np
import pickle

In [2]:
%load_ext sql
%sql mysql://root:mysql2016@localhost/MIMIC?unix_socket=/run/mysqld/mysqld.sock
%sql USE MIMIC



0 rows affected.




[]

The following query extracts the required information about ICUSTAY, HADM (hospital admission), and outcomes:

In [3]:
data=%sql SELECT I.SUBJECT_ID, I.HADM_ID, I.ICUSTAY_ID, A.HOSPITAL_EXPIRE_FLAG, I.INTIME, I.OUTTIME, A.DEATHTIME \
FROM ICUSTAYS I \
INNER JOIN PATIENTS P \
ON I.SUBJECT_ID=P.SUBJECT_ID \
INNER JOIN ADMISSIONS A \
ON I.HADM_ID=A.HADM_ID \
WHERE (FIRST_CAREUNIT='MICU' or FIRST_CAREUNIT='SICU') AND DBSOURCE='METAVISION'

14595 rows affected.
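
To see what the filter keeps and drops, the query can be tried on a toy database. The sketch below uses an in-memory SQLite database from the standard library rather than MySQL, made-up rows, and only the columns the query touches; the join to PATIENTS is left out here because no PATIENTS column is selected. None of the toy values come from MIMIC.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy tables with only the columns used by the cohort query (hypothetical data).
cur.executescript("""
CREATE TABLE ICUSTAYS (SUBJECT_ID, HADM_ID, ICUSTAY_ID, DBSOURCE,
                       FIRST_CAREUNIT, INTIME, OUTTIME);
CREATE TABLE ADMISSIONS (HADM_ID, HOSPITAL_EXPIRE_FLAG, DEATHTIME);

INSERT INTO ICUSTAYS VALUES
  (1, 100, 1000, 'METAVISION', 'MICU', '2150-01-01 10:00:00', '2150-01-03 09:00:00'),
  (1, 100, 1001, 'METAVISION', 'SICU', '2150-01-05 08:00:00', '2150-01-06 12:00:00'),
  (2, 200, 2000, 'CAREVUE',    'MICU', '2150-02-01 10:00:00', '2150-02-02 10:00:00'),
  (3, 300, 3000, 'METAVISION', 'CCU',  '2150-03-01 10:00:00', '2150-03-02 10:00:00');

INSERT INTO ADMISSIONS VALUES
  (100, 0, NULL),
  (200, 1, '2150-02-02 09:00:00'),
  (300, 0, NULL);
""")

rows = cur.execute("""
SELECT I.SUBJECT_ID, I.HADM_ID, I.ICUSTAY_ID, A.HOSPITAL_EXPIRE_FLAG,
       I.INTIME, I.OUTTIME, A.DEATHTIME
FROM ICUSTAYS I
INNER JOIN ADMISSIONS A ON I.HADM_ID = A.HADM_ID
WHERE I.FIRST_CAREUNIT IN ('MICU', 'SICU') AND I.DBSOURCE = 'METAVISION'
ORDER BY I.ICUSTAY_ID
""").fetchall()

kept_icustay_ids = [row[2] for row in rows]
print(kept_icustay_ids)
conn.close()
```

Only the two MICU/SICU Metavision stays survive the filter; the Carevue stay and the CCU stay are dropped. Note that the result has one row per ICU stay, so admission-level columns such as HOSPITAL_EXPIRE_FLAG repeat for every stay in the same admission.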


We store the query output in a GraphLab SFrame (and save it to avoid re-running the query):

In [4]:
print len(data) 
gdata = graphlab.SFrame(data.DataFrame())
gdata.save('mimic_mortality')

14595


Counting unique individuals, hospital admissions, and ICU stays:

In [5]:
print len(gdata['SUBJECT_ID'].unique())
print len(gdata['HADM_ID'].unique())
print len(gdata['ICUSTAY_ID'].unique())

11007
13748
14595


#### For each HADM (hospital admission) we want to know how many ICU stays occur:

For each HADM we will include the first ICUSTAY in our cohort.

(The presence of a second ICUSTAY in the same HADM constitutes a negative outcome.)

In [6]:
hadm_ids = gdata['HADM_ID'].unique()

count = dict.fromkeys(hadm_ids, 0)        ## number of ICU stays per hospital admission
intimes = dict.fromkeys(hadm_ids, 0)      ## earliest INTIME seen so far (0 = none yet)
first_stays = dict.fromkeys(hadm_ids, 0)  ## ICUSTAY_ID of the earliest stay

for row in gdata:
    hadm = row['HADM_ID']
    count[hadm] += 1

    ## Keep the stay with the earliest INTIME for this admission.
    if intimes[hadm] == 0 or row['INTIME'] <= intimes[hadm]:
        intimes[hadm] = row['INTIME']
        first_stays[hadm] = row['ICUSTAY_ID']
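
The same selection rule can be checked on toy data without GraphLab. The sketch below is an illustration under stated assumptions: the helper name `first_stay_per_admission` and the toy rows are hypothetical, it tests dictionary membership instead of using 0 as a sentinel, and it uses a strict `<`, so on tied INTIMEs it keeps the first row seen.

```python
from datetime import datetime

# Hypothetical rows: admission 100 has two ICU stays, listed out of order.
toy_stays = [
    {"HADM_ID": 100, "ICUSTAY_ID": 1001, "INTIME": datetime(2150, 1, 5, 8, 0)},
    {"HADM_ID": 100, "ICUSTAY_ID": 1000, "INTIME": datetime(2150, 1, 1, 10, 0)},
    {"HADM_ID": 200, "ICUSTAY_ID": 2000, "INTIME": datetime(2150, 3, 2, 14, 30)},
]


def first_stay_per_admission(stays):
    """Return (first ICUSTAY_ID per HADM_ID, number of ICU stays per HADM_ID)."""
    earliest = {}  # HADM_ID -> row with the earliest INTIME seen so far
    counts = {}    # HADM_ID -> number of ICU stays
    for stay in stays:
        hadm = stay["HADM_ID"]
        counts[hadm] = counts.get(hadm, 0) + 1
        if hadm not in earliest or stay["INTIME"] < earliest[hadm]["INTIME"]:
            earliest[hadm] = stay
    first = {hadm: stay["ICUSTAY_ID"] for hadm, stay in earliest.items()}
    return first, counts


toy_first, toy_counts = first_stay_per_admission(toy_stays)
print(toy_first)
print(toy_counts)
```

Admission 100 maps to stay 1000 even though stay 1001 appears first in the input, which is the property the `intimes` dictionary is there to guarantee.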
        

Here we produce a list of all hospital admissions (HADM_ID) during which the patient dies:

In [7]:
test = gdata.groupby(key_columns='HADM_ID', operations={'mortality':agg.MAX('HOSPITAL_EXPIRE_FLAG')})
print sum(test['mortality']==1)
print len(test)

mort = list(test[test['mortality']==1]['HADM_ID'])

1681
13748
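
Because the query returns one row per ICU stay, the admission-level flag is repeated for every stay, and taking the maximum per HADM_ID collapses it back to one value per admission. A plain-Python sketch of that aggregation, on hypothetical rows and with illustrative variable names:

```python
from collections import defaultdict

# Hypothetical per-stay rows; the flag repeats within an admission.
toy_rows = [
    {"HADM_ID": 100, "HOSPITAL_EXPIRE_FLAG": 0},
    {"HADM_ID": 100, "HOSPITAL_EXPIRE_FLAG": 0},
    {"HADM_ID": 200, "HOSPITAL_EXPIRE_FLAG": 1},
    {"HADM_ID": 200, "HOSPITAL_EXPIRE_FLAG": 1},
    {"HADM_ID": 300, "HOSPITAL_EXPIRE_FLAG": 0},
]

# Group by HADM_ID and keep the maximum flag, mirroring the groupby with MAX.
toy_mortality = defaultdict(int)
for row in toy_rows:
    hadm = row["HADM_ID"]
    toy_mortality[hadm] = max(toy_mortality[hadm], row["HOSPITAL_EXPIRE_FLAG"])

toy_mort = sorted(h for h, flag in toy_mortality.items() if flag == 1)
print(len(toy_mortality), toy_mort)
```

Five stay-level rows reduce to three admissions, one of which ends in death.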


Here we produce a list of all ICUSTAYs during which the patient dies on ICU.

(This is determined from A.DEATHTIME <= I.OUTTIME. We first check that every in-hospital death has a recorded DEATHTIME.)

In [8]:
test = gdata[gdata['DEATHTIME']==None ]
print sum(test['HOSPITAL_EXPIRE_FLAG']==1)  
## All those who die in hospital have a DEATHTIME RECORDED

0


In [9]:
_sub = gdata[gdata['DEATHTIME']!=None]

print "There are %d recorded death times" %len(_sub)
ic_deaths = _sub.apply(lambda row: row['ICUSTAY_ID'] if row['DEATHTIME'] <= row['OUTTIME'] else -1)
ic_deaths = [i for i in ic_deaths if i!=-1]
print "Of these, %d occur on ICU." %len(ic_deaths)

There are 1922 recorded death times
Of these, 1279 occur on ICU.
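
The rule used above (a death is attributed to the ICU stay when DEATHTIME is not later than OUTTIME) can be isolated as a small helper. This is a sketch on hypothetical timestamps; `died_on_icu` is an illustrative name, not a function from this notebook.

```python
from datetime import datetime


def died_on_icu(deathtime, outtime):
    """True if a death time is recorded and is not later than ICU discharge."""
    return deathtime is not None and deathtime <= outtime


# Hypothetical stays: (ICUSTAY_ID, DEATHTIME, OUTTIME)
toy_deaths = [
    (1, datetime(2150, 1, 2, 9, 0), datetime(2150, 1, 2, 12, 0)),   # died before discharge
    (2, datetime(2150, 1, 9, 9, 0), datetime(2150, 1, 4, 12, 0)),   # died later, on the ward
    (3, None, datetime(2150, 2, 1, 12, 0)),                         # survived
]

toy_ic_deaths = [sid for sid, death, out in toy_deaths if died_on_icu(death, out)]
print(toy_ic_deaths)
```

For an admission with several ICU stays, DEATHTIME is compared against each stay's own OUTTIME, so a death during a second stay also satisfies the test for any later-ending stay but not for the first one.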


#### We save these cohort characteristics for use in extracting data from MIMIC (see 'data_pull_mimic.ipynb'):

In [10]:
with open('first_stays.pkl', 'wb') as f:
    pickle.dump(first_stays, f)

with open('moratlities.pkl', 'wb') as f:
    pickle.dump(mort, f)

with open('stay_counts.pkl', 'wb') as f:
    pickle.dump(count, f)

with open('icu_deaths.pkl', 'wb') as f:
    pickle.dump(ic_deaths, f)
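
A quick round trip confirms that a pickled object loads back unchanged. The sketch below writes a toy dictionary to a temporary directory, so it does not touch the real output files; how 'data_pull_mimic.ipynb' actually reads them is not shown here and is assumed to follow the same pattern.

```python
import os
import pickle
import tempfile

# Toy stand-in for first_stays: HADM_ID -> ICUSTAY_ID (hypothetical values).
toy_first_stays = {100: 1000, 200: 2000}

toy_path = os.path.join(tempfile.mkdtemp(), "first_stays_toy.pkl")

with open(toy_path, "wb") as f:
    pickle.dump(toy_first_stays, f)

with open(toy_path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == toy_first_stays)
```

Pickle files written under Python 2 may need an explicit protocol or encoding argument when read from Python 3, which is worth checking if the downstream notebook runs on a different interpreter.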


---------------------------------------------------------------------------
#### Other scripting (sanity checks and looking at some occurrence rates):

In [11]:
more_than_one_stay = sum(np.asarray(count.values())>1)
more_than_two_stay = sum(np.asarray(count.values())>2)

In [12]:
print more_than_one_stay
print more_than_two_stay

742
80


In [13]:
## Readmission rate:
more_than_one_stay/float(len(gdata['HADM_ID'].unique()))

0.0539714867617108

In [14]:
## Overall mortality
print len(mort)/float(len(gdata['HADM_ID'].unique()))

0.122272330521


In [15]:
## Readmission mortality:
death=0
for mor in mort:
    if count[mor]>1:
        death+=1
        
readmit_mort = death/float(sum(np.asarray(count.values())>1))
print readmit_mort

0.260107816712


In [16]:
## Single stay mortality:
death=0
for mor in mort:
    if count[mor]==1:
        death+=1
        
single_mort = death/float(sum(np.asarray(count.values())==1))
print single_mort

0.11440873443
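
The four rates above can be computed in one place, and the reported figures can be cross-checked: overall mortality should equal the average of the two stratified mortalities, weighted by the number of admissions in each stratum. The sketch below is illustrative; `stratified_mortality` and the toy inputs are hypothetical, while the figures in the final check are the ones printed in this notebook.

```python
def stratified_mortality(stay_counts, deaths):
    """Mortality split by whether the admission had more than one ICU stay.

    stay_counts: dict mapping HADM_ID -> number of ICU stays.
    deaths: iterable of HADM_IDs with an in-hospital death.
    Assumes both strata are non-empty (otherwise a division by zero occurs).
    """
    deaths = set(deaths) & set(stay_counts)
    readmitted = {h for h, n in stay_counts.items() if n > 1}
    single = {h for h, n in stay_counts.items() if n == 1}
    total = float(len(stay_counts))
    return {
        "readmission_rate": len(readmitted) / total,
        "overall_mortality": len(deaths) / total,
        "readmission_mortality": len(deaths & readmitted) / float(len(readmitted)),
        "single_stay_mortality": len(deaths & single) / float(len(single)),
    }


# Toy example: five admissions, one readmission, two deaths.
toy_rates = stratified_mortality({100: 2, 200: 1, 300: 1, 400: 1, 500: 1}, [100, 200])
print(toy_rates)

# Consistency check using the figures reported in this notebook.
n_adm, n_readmit, n_deaths = 13748, 742, 1681
readmit_mort, single_mort = 0.260107816712, 0.11440873443
n_single = n_adm - n_readmit
recombined = (readmit_mort * n_readmit + single_mort * n_single) / n_adm
print(abs(recombined - n_deaths / float(n_adm)) < 1e-9)
```

The check passes, so the stratified mortalities are consistent with the 1681 deaths among 13748 admissions.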
