### Cohort selection process for MIMIC data.

* We combine MICU (medical) and SICU (surgical) as an approximation to GICU in Bristol.
* We use only Metavision records because the data are relatively well structured (compared with Carevue).

The conditions for cohort inclusion are ICUSTAYs that:

* Are the first (or only) ICU stay of a hospital admission.
* Have outcome information (survival/mortality or readmission to ICU).
* Have 'good' data availability (we test for this later, not in this notebook).

A positive outcome is defined as:

* Patient survives hospital stay
* Patient is not re-admitted to ICU during the same hospital admission

Therefore, a negative outcome is (at least) one of:

* Death during hospital admission (on ICU or ward).
* Readmission to ICU during hospital stay (even if patient survives).
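The outcome definition above can be sketched as a boolean label per hospital admission. This is a minimal illustration on made-up rows, not MIMIC data; the column names `mortality` and `count` are assumed to match the per-admission summary built later in this notebook, and `negative_outcome` is a hypothetical column name.

```python
import pandas as pd

# Toy per-admission summary (hypothetical values, not MIMIC data).
admissions = pd.DataFrame({
    "HADM_ID": [1, 2, 3, 4],
    "mortality": [0, 1, 0, 1],  # 1 = patient died during the hospital admission
    "count": [1, 1, 2, 3],      # number of ICU stays within the admission
})

# Negative outcome: in-hospital death OR at least one ICU readmission.
admissions["negative_outcome"] = (
    (admissions["mortality"] == 1) | (admissions["count"] > 1)
)
print(admissions)
```

Only the first toy admission (survived, single ICU stay) counts as a positive outcome under this rule.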

In [60]:

import numpy as np
import pickle

In [2]:
from google.cloud import bigquery
client = bigquery.Client()

In [None]:
%load_ext sql
%sql mysql://root:mysql2016@localhost/MIMIC?unix_socket=/run/mysqld/mysqld.sock
%sql USE MIMIC

The following query extracts the required information about ICUSTAY, HADM (hospital admission), and outcomes:

In [3]:
data=%sql SELECT I.SUBJECT_ID, I.HADM_ID, I.ICUSTAY_ID, A.HOSPITAL_EXPIRE_FLAG, I.INTIME, I.OUTTIME, A.DEATHTIME \
FROM ICUSTAYS I \
INNER JOIN PATIENTS P \
ON I.SUBJECT_ID=P.SUBJECT_ID \
INNER JOIN ADMISSIONS A \
ON I.HADM_ID=A.HADM_ID \
WHERE (FIRST_CAREUNIT='MICU' or FIRST_CAREUNIT='SICU') AND DBSOURCE='METAVISION'

14595 rows affected.


In [13]:
query = """SELECT I.SUBJECT_ID, I.HADM_ID, I.ICUSTAY_ID, A.HOSPITAL_EXPIRE_FLAG, I.INTIME, I.OUTTIME, A.DEATHTIME
FROM physionet-data.mimiciii_clinical.icustays I 
INNER JOIN physionet-data.mimiciii_clinical.patients P 
ON I.SUBJECT_ID=P.SUBJECT_ID 
INNER JOIN physionet-data.mimiciii_clinical.admissions A 
ON I.HADM_ID=A.HADM_ID 
WHERE (FIRST_CAREUNIT='MICU' or FIRST_CAREUNIT='SICU') AND DBSOURCE='metavision'"""
query_job = client.query(query)
gdata = query_job.to_dataframe()

We store the query output in a dataframe (and save it to disk to avoid re-running the query later):

In [4]:
print(len(data))
gdata = graphlab.SFrame(data.DataFrame())
gdata.save('mimic_mortality')

14595


In [31]:
gdata

| | SUBJECT_ID | HADM_ID | ICUSTAY_ID | HOSPITAL_EXPIRE_FLAG | INTIME | OUTTIME | DEATHTIME |
|---|---|---|---|---|---|---|---|
| 0 | 154 | 102354 | 201272 | 0 | 2127-12-23 22:02:59 | 2127-12-26 00:06:44 | NaT |
| 1 | 406 | 100765 | 231758 | 0 | 2126-03-11 23:07:15 | 2126-03-18 22:17:14 | NaT |
| 2 | 5199 | 115157 | 296867 | 0 | 2204-04-08 08:16:19 | 2204-04-12 10:51:07 | NaT |
| 3 | 5841 | 126020 | 243663 | 0 | 2131-06-19 13:43:26 | 2131-06-20 22:25:08 | NaT |
| 4 | 7533 | 166898 | 208809 | 0 | 2114-03-29 09:36:53 | 2114-04-08 18:39:51 | NaT |
| 5 | 11043 | 198451 | 287685 | 0 | 2160-04-22 20:13:18 | 2160-04-26 17:13:34 | NaT |
| 6 | 10820 | 125172 | 223012 | 0 | 2171-10-25 09:28:51 | 2171-11-02 13:38:29 | NaT |
| 7 | 15025 | 116670 | 250703 | 0 | 2207-06-16 22:12:35 | 2207-06-18 13:24:30 | NaT |
| 8 | 16975 | 100446 | 283082 | 0 | 2122-03-28 17:52:51 | 2122-03-30 15:58:53 | NaT |
| 9 | 18254 | 125836 | 202339 | 0 | 2196-12-17 08:14:30 | 2196-12-18 12:29:48 | NaT |


Counting unique individuals, stays, and hospital admissions:

In [14]:
print( "There are: ")
print( str(len(gdata['SUBJECT_ID'].unique())) + " patients,")
print( str(len(gdata['HADM_ID'].unique())) + " hospital admissions, and")
print( str(len(gdata['ICUSTAY_ID'].unique())) + " intensive care stays.")

There are: 
11007 patients,
13748 hospital admissions, and
14595 intensive care stays.


#### For each HADM (hospital admission) we want to know how many ICU stays occur:

For each HADM we will include the first ICUSTAY in our cohort.

(Presence of a second ICUSTAY in the same HADM constitutes a negative outcome.)

In [49]:
import pandas as pd
def agg_func(data, hadm_id):
    series = {}
    series['HADM_ID'] = hadm_id
    series['SUBJECT_ID'] = data['SUBJECT_ID'].min()
    series['count'] = data['ICUSTAY_ID'].count()
    series['first stay'] = data.loc[data['INTIME'].idxmin()]['ICUSTAY_ID']
    series['first intime'] = data['INTIME'].min()
    series['mortality'] = data['HOSPITAL_EXPIRE_FLAG'].max()
    return pd.Series(series)

gdata_grouped = gdata.groupby('HADM_ID')
agg_data = []
for hadm_id, group in gdata_grouped:
    agg_data.append(agg_func(group, hadm_id))

gdata_grouped = pd.DataFrame(agg_data)
gdata_grouped

| | HADM_ID | SUBJECT_ID | count | first stay | first intime | mortality |
|---|---|---|---|---|---|---|
| 0 | 100001 | 58526 | 1 | 275225 | 2117-09-11 11:47:35 | 0 |
| 1 | 100003 | 54610 | 1 | 209281 | 2150-04-17 15:35:42 | 0 |
| 2 | 100010 | 55853 | 1 | 271147 | 2109-12-10 21:58:01 | 0 |
| 3 | 100016 | 68591 | 1 | 217590 | 2188-05-24 13:07:20 | 0 |
| 4 | 100018 | 58128 | 1 | 269533 | 2176-08-29 16:56:37 | 0 |
| 5 | 100028 | 53456 | 1 | 297261 | 2142-12-23 18:07:12 | 0 |
| 6 | 100035 | 48539 | 1 | 245719 | 2115-02-22 06:52:06 | 0 |
| 7 | 100037 | 58947 | 2 | 270105 | 2183-03-23 18:22:04 | 0 |
| 8 | 100072 | 50379 | 1 | 294067 | 2145-11-11 22:10:58 | 0 |
| 9 | 100075 | 77988 | 1 | 278942 | 2186-02-01 17:02:13 | 0 |
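The per-admission summary above could also be built without an explicit Python loop. The following is a sketch on toy rows (not MIMIC data), assuming pandas 0.25 or later for named aggregation; it sorts by `INTIME` so that the `first` aggregation picks the earliest ICU stay in each admission. The names `stays`, `ordered`, and `summary` are illustrative only.

```python
import pandas as pd

# Toy ICU stays: admission 100 has two stays, listed out of time order.
stays = pd.DataFrame({
    "SUBJECT_ID": [10, 10, 11],
    "HADM_ID": [100, 100, 101],
    "ICUSTAY_ID": [2001, 2000, 2002],
    "HOSPITAL_EXPIRE_FLAG": [0, 0, 1],
    "INTIME": pd.to_datetime(["2150-01-05", "2150-01-01", "2151-03-02"]),
})

# Sort by INTIME so that "first" within each group is the earliest stay.
ordered = stays.sort_values("INTIME")

# Named aggregation; the dict unpacking allows column names with spaces.
summary = ordered.groupby("HADM_ID").agg(
    **{
        "SUBJECT_ID": ("SUBJECT_ID", "min"),
        "count": ("ICUSTAY_ID", "count"),
        "first stay": ("ICUSTAY_ID", "first"),
        "first intime": ("INTIME", "min"),
        "mortality": ("HOSPITAL_EXPIRE_FLAG", "max"),
    }
).reset_index()
print(summary)
```

This yields the same columns as the loop-based version, with one row per `HADM_ID`.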


Convert aggregated data to dictionaries for portable pickle save:

In [50]:
ic_count = dict()
first_stays = dict()
intimes = dict()

for idx, row in gdata_grouped.iterrows():
    
    ic_count[row['HADM_ID']] = row['count']
    intimes[row['HADM_ID']] = row['first intime']
    first_stays[row['HADM_ID']] = row['first stay']

Here we produce a list of all hospital admissions (HADM_ID) during which the patient dies:

In [51]:
print( len(gdata_grouped))
print( sum(gdata_grouped['mortality']==1))
mortality_list = list(gdata_grouped[gdata_grouped['mortality']==1]['HADM_ID'])

13748
1681


Here we produce a list of all ICUSTAYs during which the patient dies on ICU:

(This is determined from A.DEATHTIME <= I.OUTTIME)

In [57]:
print( "%d mortalities have no DEATHTIME recorded." %sum(gdata[gdata['DEATHTIME'].isnull()]['HOSPITAL_EXPIRE_FLAG']==1))
print( " ")

_mortalities = gdata[~gdata['DEATHTIME'].isnull()]
ic_deaths = _mortalities.apply(lambda row: row['ICUSTAY_ID'] if row['DEATHTIME'] <= row['OUTTIME'] else -1, axis=1)
ic_deaths = [i for i in ic_deaths if i!=-1]

print( "There are %d recorded death times." %len(_mortalities))
print( "Of these, %d occur on ICU." %len(ic_deaths))

0 mortalities have no DEATHTIME recorded.
 
There are 1922 recorded death times.
Of these, 1279 occur on ICU.
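The on-ICU death test can also be expressed as a vectorized comparison. This is a small sketch on invented rows (not MIMIC data); it relies on the pandas behaviour that comparisons involving `NaT` evaluate to `False`, so stays without a death time drop out without separate filtering. The names `died_on_icu` and `icu_death_ids` are illustrative.

```python
import pandas as pd

# Toy stays: death on ICU, death after ICU discharge, and a survivor.
stays = pd.DataFrame({
    "ICUSTAY_ID": [1, 2, 3],
    "OUTTIME": pd.to_datetime(["2150-01-03", "2150-02-10", "2150-03-05"]),
    "DEATHTIME": pd.to_datetime(["2150-01-02", "2150-02-15", None]),
})

# A missing DEATHTIME (NaT) compares as False, so survivors are excluded.
died_on_icu = stays["DEATHTIME"] <= stays["OUTTIME"]
icu_death_ids = stays.loc[died_on_icu, "ICUSTAY_ID"].tolist()
print(icu_death_ids)
```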


#### We save these cohort characteristics for use in extracting data from MIMIC (see 'data_pull_mimic.ipynb'):

In [61]:
## These are dictionaries:
with open('first_stays.pkl', 'wb') as f:
    pickle.dump(first_stays, f)

with open('stay_counts.pkl', 'wb') as f:
    pickle.dump(ic_count, f)

## These are lists:
with open('moratlities.pkl', 'wb') as f:
    pickle.dump(mortality_list, f)

with open('icu_deaths.pkl', 'wb') as f:
    pickle.dump(ic_deaths, f)
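A downstream notebook would read these files back with `pickle.load`. The sketch below shows the round trip in a temporary directory so that it does not touch the real files; `first_stays_demo` is a hypothetical two-entry stand-in for the `first_stays` dictionary.

```python
import os
import pickle
import tempfile

# Stand-in for the first_stays dictionary (HADM_ID -> first ICUSTAY_ID).
first_stays_demo = {100001: 275225, 100037: 270105}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "first_stays.pkl")

    # Write the dictionary; the context manager closes the file for us.
    with open(path, "wb") as f:
        pickle.dump(first_stays_demo, f)

    # Read it back, as a later notebook would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == first_stays_demo)
```

Note that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.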

---------------------------------------------------------------------------
#### Other scripting (sanity checks and looking at some occurrence rates):

In [13]:
more_than_one_stay = sum(np.asarray(list(ic_count.values()))>1)
more_than_two_stay = sum(np.asarray(list(ic_count.values()))>2)

In [14]:
print(more_than_one_stay)
print(more_than_two_stay)

742
80


In [15]:
## Readmission rate:
more_than_one_stay/float(len(gdata['HADM_ID'].unique()))

0.0539714867617108

In [16]:
## Overall mortality
print(len(mortality_list)/float(len(gdata['HADM_ID'].unique())))

0.122272330521


In [19]:
## Readmission mortality:
death=0
for mortality in mortality_list:
    if ic_count[mortality]>1:
        death+=1
        
readmit_mort = death/float(sum(np.asarray(list(ic_count.values()))>1))
print(readmit_mort)

0.260107816712


In [20]:
## Single stay mortality:
death=0
for mortality in mortality_list:
    if ic_count[mortality]==1:
        death+=1
        
single_mort = death/float(sum(np.asarray(list(ic_count.values()))==1))
print(single_mort)

0.11440873443
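The four rates above can also be computed directly from a per-admission table using boolean masks. This is a sketch on invented rows (not MIMIC data), assuming columns named `count` and `mortality` as in the aggregated dataframe; the `rates` dictionary and its keys are illustrative names.

```python
import pandas as pd

# Toy per-admission summary: three single-stay and two readmitted admissions.
summary = pd.DataFrame({
    "HADM_ID": [1, 2, 3, 4, 5],
    "count": [1, 1, 1, 2, 2],
    "mortality": [0, 1, 0, 1, 0],
})

# True where the admission contains more than one ICU stay.
readmitted = summary["count"] > 1

# The mean of a 0/1 (or boolean) column is the proportion of ones.
rates = {
    "readmission_rate": readmitted.mean(),
    "overall_mortality": summary["mortality"].mean(),
    "readmission_mortality": summary.loc[readmitted, "mortality"].mean(),
    "single_stay_mortality": summary.loc[~readmitted, "mortality"].mean(),
}
for name, value in rates.items():
    print(name, round(value, 3))
```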
