### Cohort selection process for MIMIC data.

* We combine MICU (medical) and SICU (surgical) as an approximation to GICU in Bristol.
* We use only Metavision records because that data is relatively well structured (compared with CareVue).

The conditions for cohort inclusion are ICUSTAYs that:

* Are the first (or only) ICU stay of a hospital admission.
* Have outcome information (survival/mortality or readmission to ICU).
* Have 'good' data availability (we test for this later, not in this notebook).

A positive outcome requires both of the following:

* Patient survives the hospital stay.
* Patient is not readmitted to ICU during the same hospital admission.

Therefore, a negative outcome is (at least) one of:

* Death during hospital admission (on ICU or ward).
* Readmission to ICU during hospital stay (even if patient survives).
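
The outcome definition above can be sketched as a small predicate. This is a minimal illustration rather than code from the notebook: the function name `is_negative_outcome` and its two arguments are hypothetical, and the example values are toy data, not MIMIC records.

```python
def is_negative_outcome(died_in_hospital, icu_stay_count):
    """Return True if a hospital admission counts as a negative outcome.

    died_in_hospital: True if the patient died during the admission
                      (on ICU or on the ward).
    icu_stay_count:   number of ICU stays recorded for the admission.
    """
    # More than one ICU stay in the same admission means a readmission.
    readmitted = icu_stay_count > 1
    return bool(died_in_hospital) or readmitted


# Toy examples (hypothetical, not MIMIC data):
examples = [
    (False, 1),  # survived, single ICU stay     -> positive outcome
    (False, 2),  # survived, but readmitted      -> negative outcome
    (True, 1),   # died, single ICU stay         -> negative outcome
    (True, 3),   # died after readmissions       -> negative outcome
]
labels = [is_negative_outcome(died, count) for died, count in examples]
```

A positive outcome is simply the negation of this predicate.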

In [1]:
import graphlab
import graphlab.aggregate as agg
import numpy as np
import pickle

In [2]:
%load_ext sql
%sql mysql://root:mysql2016@localhost/MIMIC?unix_socket=/run/mysqld/mysqld.sock
%sql USE MIMIC

0 rows affected.



[]

The following query extracts the required information about ICUSTAY, HADM (hospital admission), and outcomes:

In [3]:
data=%sql SELECT I.SUBJECT_ID, I.HADM_ID, I.ICUSTAY_ID, A.HOSPITAL_EXPIRE_FLAG, I.INTIME, I.OUTTIME, A.DEATHTIME \
FROM ICUSTAYS I \
INNER JOIN PATIENTS P \
ON I.SUBJECT_ID=P.SUBJECT_ID \
INNER JOIN ADMISSIONS A \
ON I.HADM_ID=A.HADM_ID \
WHERE (FIRST_CAREUNIT='MICU' or FIRST_CAREUNIT='SICU') AND DBSOURCE='METAVISION'

14595 rows affected.


We store the query output in an SFrame (and save it to disk to avoid re-running the query later):

In [4]:
print len(data) 
gdata = graphlab.SFrame(data.DataFrame())
gdata.save('mimic_mortality')

14595


Counting unique individuals, stays, and hospital admissions:

In [5]:
print "There are: "
print str(len(gdata['SUBJECT_ID'].unique())) + " patients,"
print str(len(gdata['HADM_ID'].unique())) + " hospital admissions, and"
print str(len(gdata['ICUSTAY_ID'].unique())) + " intensive care stays."

There are: 
11007 patients,
13748 hospital admissions, and
14595 intensive care stays.


#### For each HADM (hospital admission) we want to know how many ICU stays occur:

For each HADM we will include the first ICUSTAY in our cohort.

(The presence of a second ICUSTAY in the same HADM constitutes a negative outcome.)

In [6]:
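# One row per hospital admission (HADM_ID), summarising its ICU stays.
gdata_grouped = gdata.groupby(key_columns='HADM_ID',
              operations={
                  'SUBJECT_ID' : agg.SELECT_ONE('SUBJECT_ID'),
                  'first intime': agg.MIN('INTIME'),                   # earliest ICU admission time
                  'first stay'  : agg.ARGMIN('INTIME', 'ICUSTAY_ID'),  # ICUSTAY_ID of the earliest stay
                  'count' : agg.COUNT('ICUSTAY_ID'),                   # number of ICU stays in this admission
                  'mortality':agg.MAX('HOSPITAL_EXPIRE_FLAG')})        # 1 if the patient died in hospital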
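
For readers without GraphLab, the aggregation above can be approximated in plain Python. This is a sketch under stated assumptions, not the notebook's method: the helper `summarise_admissions` is hypothetical, the rows are toy data, and ties on INTIME are resolved by keeping the first row seen, which may differ from GraphLab's behaviour.

```python
from datetime import datetime

# Toy rows (hypothetical): (HADM_ID, ICUSTAY_ID, INTIME, HOSPITAL_EXPIRE_FLAG)
rows = [
    (100, 9002, datetime(2150, 3, 5, 8, 0), 0),
    (100, 9001, datetime(2150, 3, 1, 14, 30), 0),
    (200, 9003, datetime(2151, 7, 9, 22, 15), 1),
]


def summarise_admissions(rows):
    """Group ICU stays by hospital admission and summarise each group."""
    summary = {}
    for hadm_id, icustay_id, intime, expire_flag in rows:
        entry = summary.get(hadm_id)
        if entry is None:
            # First stay seen for this admission.
            summary[hadm_id] = {
                'first intime': intime,
                'first stay': icustay_id,
                'count': 1,
                'mortality': expire_flag,
            }
            continue
        entry['count'] += 1
        entry['mortality'] = max(entry['mortality'], expire_flag)
        # Keep the stay with the earliest INTIME as the first stay.
        if intime < entry['first intime']:
            entry['first intime'] = intime
            entry['first stay'] = icustay_id
    return summary


summary = summarise_admissions(rows)
```

Here admission 100 has two stays, and the one with the earlier INTIME (9001) is selected as the first stay even though it appears second in the input.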

Convert aggregated data to dictionaries for portable pickle save:

In [7]:
ic_count = dict()
first_stays = dict()
intimes = dict()

for row in gdata_grouped:
    
    ic_count[row['HADM_ID']] = row['count']
    intimes[row['HADM_ID']] = row['first intime']
    first_stays[row['HADM_ID']] = row['first stay']

Here we produce a list of all hospital admissions (HADM_ID) during which the patient dies:

In [8]:
print len(gdata_grouped)
print sum(gdata_grouped['mortality']==1)
mortality_list = list(gdata_grouped[gdata_grouped['mortality']==1]['HADM_ID'])

13748
1681


Here we produce a list of all ICUSTAYs during which the patient dies on ICU:

(This is determined from A.DEATHTIME <= I.OUTTIME. Because gdata has one row per ICU stay, the counts below are per stay rather than per admission.)

In [9]:
print "%d mortalities have no DEATHTIME recorded." %sum(gdata[gdata['DEATHTIME']==None ]['HOSPITAL_EXPIRE_FLAG']==1)  
print " "

_mortalities = gdata[gdata['DEATHTIME']!=None]
ic_deaths = _mortalities.apply(lambda row: row['ICUSTAY_ID'] if row['DEATHTIME'] <= row['OUTTIME'] else -1)
ic_deaths = [i for i in ic_deaths if i!=-1]

print "There are %d recorded death times." %len(_mortalities)
print "Of these, %d occur on ICU." %len(ic_deaths)

0 mortalities have no DEATHTIME recorded.
 
There are 1922 recorded death times.
Of these, 1279 occur on ICU.
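
The ICU-death rule used above (death time no later than the ICU discharge time) can be illustrated without GraphLab. This is a minimal sketch: the helper `died_on_icu` is hypothetical and the stays are toy data.

```python
from datetime import datetime


def died_on_icu(deathtime, outtime):
    """Return True if a recorded death time falls at or before ICU discharge."""
    if deathtime is None:
        # No recorded death for this stay's admission.
        return False
    return deathtime <= outtime


# Toy stays (hypothetical, not MIMIC data):
stays = [
    # Death before ICU discharge: counts as an ICU death.
    {'ICUSTAY_ID': 1,
     'OUTTIME': datetime(2150, 1, 5, 12, 0),
     'DEATHTIME': datetime(2150, 1, 5, 9, 0)},
    # Death after ICU discharge (on the ward): not an ICU death.
    {'ICUSTAY_ID': 2,
     'OUTTIME': datetime(2150, 2, 1, 10, 0),
     'DEATHTIME': datetime(2150, 2, 4, 16, 0)},
    # Patient survived: no death time.
    {'ICUSTAY_ID': 3,
     'OUTTIME': datetime(2150, 3, 1, 10, 0),
     'DEATHTIME': None},
]

icu_death_ids = [s['ICUSTAY_ID'] for s in stays
                 if died_on_icu(s['DEATHTIME'], s['OUTTIME'])]
```

Filtering with a predicate avoids the -1 sentinel used in the cell above, but the selected stays are the same.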


#### We save these cohort characteristics for use in extracting data from MIMIC (see 'data_pull_mimic.ipynb'):

In [10]:
## These are dictionaries:
with open('first_stays.pkl', 'wb') as f:
    pickle.dump(first_stays, f)

with open('stay_counts.pkl', 'wb') as f:
    pickle.dump(ic_count, f)

## These are lists:
with open('moratlities.pkl', 'wb') as f:
    pickle.dump(mortality_list, f)

with open('icu_deaths.pkl', 'wb') as f:
    pickle.dump(ic_deaths, f)
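
A downstream notebook would presumably read these files back with `pickle.load`. The sketch below shows a save and load round trip on toy data in a temporary directory; the helper names and the demo filename are hypothetical, and it does not touch the real cohort files.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Write obj to path in binary pickle format."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Read a pickled object back from path."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Toy mapping HADM_ID -> first ICUSTAY_ID (hypothetical values).
toy_first_stays = {100: 9001, 200: 9003}

tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'first_stays_demo.pkl')

save_pickle(toy_first_stays, demo_path)
restored = load_pickle(demo_path)
```

Note that pickles written under Python 2 may need extra care (for example an `encoding` argument to `pickle.load`) when read under Python 3.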

---------------------------------------------------------------------------
#### Other scripting (sanity checks and looking at some occurrence rates):

In [13]:
more_than_one_stay = sum(np.asarray(ic_count.values())>1)
more_than_two_stay = sum(np.asarray(ic_count.values())>2)

In [14]:
print more_than_one_stay
print more_than_two_stay

742
80


In [15]:
## Readmission rate:
more_than_one_stay/float(len(gdata['HADM_ID'].unique()))

0.0539714867617108

In [16]:
## Overall mortality
print len(mortality_list)/float(len(gdata['HADM_ID'].unique()))

0.122272330521


In [19]:
## Readmission mortality:
death=0
for mortality in mortality_list:
    if ic_count[mortality]>1:
        death+=1
        
readmit_mort = death/float(sum(np.asarray(ic_count.values())>1))
print readmit_mort

0.260107816712


In [20]:
## Single stay mortality:
death=0
for mortality in mortality_list:
    if ic_count[mortality]==1:
        death+=1
        
single_mort = death/float(sum(np.asarray(ic_count.values())==1))
print single_mort

0.11440873443
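
The four rates computed above can be gathered into one helper. This is a sketch, not the notebook's code: `outcome_rates` is a hypothetical name, the dictionaries are toy data, and only the final line uses counts reported in this notebook (742 admissions with more than one ICU stay out of 13748).

```python
def outcome_rates(ic_count, mortality_list):
    """Compute readmission and mortality rates per hospital admission.

    ic_count:       dict mapping HADM_ID -> number of ICU stays.
    mortality_list: HADM_IDs of admissions in which the patient died.
    """
    died = set(mortality_list)
    multi = [h for h, n in ic_count.items() if n > 1]
    single = [h for h, n in ic_count.items() if n == 1]

    def mortality(hadm_ids):
        # Fraction of the given admissions that ended in death.
        if not hadm_ids:
            return float('nan')
        return sum(1 for h in hadm_ids if h in died) / float(len(hadm_ids))

    return {
        'readmission_rate': len(multi) / float(len(ic_count)),
        'overall_mortality': mortality(list(ic_count)),
        'readmission_mortality': mortality(multi),
        'single_stay_mortality': mortality(single),
    }


# Toy data (hypothetical): five admissions, two with readmissions, two deaths.
toy_ic_count = {1: 1, 2: 1, 3: 2, 4: 3, 5: 1}
toy_mortality_list = [2, 4]
rates = outcome_rates(toy_ic_count, toy_mortality_list)

# Readmission rate from the counts printed in this notebook.
reported_readmission_rate = 742 / float(13748)
```

Using a set for the mortality lookup keeps the computation linear in the number of admissions.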
