# Data Extraction from MIMIC.

### Contents:

* (A) Locating the required data
* (B) Extracting the data
* (C) Post-processing of data (involves combining with cohort information from 'cohort_selection_mimic.ipynb')

We use two MySQL libraries:

* 'sql' (the IPython extension loaded with `%load_ext sql`) is simple for scripting, with line/cell magic.
* 'MySQLdb' is required for more complex operations (e.g. parameter binding with Python lists).


In [1]:
import graphlab
import numpy as np
import pickle
import datetime
from collections import OrderedDict
import matplotlib.pyplot as plt
import matplotlib.dates as dates
%matplotlib inline



[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1540983808.log


In [2]:
%load_ext sql
%sql mysql://root:mysql2016@localhost/MIMIC?unix_socket=/run/mysqld/mysqld.sock
%sql USE MIMIC

  warn("IPython.utils.traitlets has moved to a top-level traitlets package.")


0 rows affected.


  cursor.execute('SELECT @@tx_isolation')


[]

### (A) Locating the required data.

A number of different variables are required to assess the nurse-led discharge (NLD) criteria. Here we locate the ITEMIDs corresponding to these variables, which will be used to extract patient data from CHARTEVENTS.

For simplicity we focus on 'Metavision' (DBSOURCE='metavision') only, since data storage within 'Carevue' is less coherent. Because the study compares with a general intensive care unit (GICU) from the UK, we take only patients with FIRST_CAREUNIT 'MICU' or 'SICU', since this patient subset approximately corresponds to the GICU population.

The required variables are:

#### To locate these variables we proceed as follows: 

We define these search terms:

And search for ITEMIDS with LABELS that match the search terms using the following query:

In [3]:
%sql SELECT * FROM D_ITEMS WHERE (LABEL LIKE '%fraction%' OR LABEL LIKE 'fio2') AND DBSOURCE='metavision'

3 rows affected.


ROW_ID,ITEMID,LABEL,ABBREVIATION,DBSOURCE,LINKSTO,CATEGORY,UNITNAME,PARAM_TYPE,CONCEPTID,MAPPING_DEEP
12804,223835,Inspired O2 Fraction,FiO2,metavision,chartevents,Respiratory,,Numeric,,26.0
14083,225628,CK-MB fraction (%),CK-MB fraction (%),metavision,chartevents,Labs,,Numeric,,
14685,227008,Ejection Fraction,Ejection Fraction,metavision,chartevents,Scores - APACHE IV (2),%,Numeric,,
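
The label search above can be scripted for several search terms at once. The following is a minimal sketch only: `build_label_query` is a hypothetical helper, not part of the original notebook. It builds one parameterised query using the `%s` placeholder style of MySQLdb, and it wraps every term in `%` wildcards (unlike the hand-written query above, where `'fio2'` is matched without wildcards).

```python
def build_label_query(terms, dbsource='metavision'):
    """Return (sql, params) matching any D_ITEMS label that contains one of `terms`.

    Hypothetical helper for illustration; table and column names follow the
    query used in this notebook.
    """
    if not terms:
        raise ValueError('at least one search term is required')
    like_clauses = ' OR '.join(['LABEL LIKE %s'] * len(terms))
    sql = 'SELECT * FROM D_ITEMS WHERE (' + like_clauses + ') AND DBSOURCE=%s'
    # The wildcards live in the bound values, so the SQL text holds only placeholders.
    params = tuple('%' + term + '%' for term in terms) + (dbsource,)
    return sql, params


sql, params = build_label_query(['fraction', 'fio2'])
```

With a MySQLdb cursor `cur`, the pair would be passed as `cur.execute(sql, params)`.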


For each variable we visually inspect the search output and select the ITEMS that appear relevant. 

*For example, for FiO2 we select 226754, 227009, 227010, 223835 as candidates.*

We then measure the approximate frequency of the candidate variables on a random subset of the data (i.e. how many ICUSTAYS have at least one recording of each ITEMID)...

In [4]:
sub_limit = 100  ## subset size: number of ICUSTAYS randomly selected  

In [5]:
%sql SELECT D.ITEMID, D.LABEL, COUNT(DISTINCT(II.ICUSTAY_ID)) AS STAY_COUNT, (COUNT(DISTINCT(II.ICUSTAY_ID))/(:sub_limit)) AS FREQ \
FROM CHARTEVENTS C \
INNER JOIN ( \
    SELECT * FROM ( \
        SELECT * \
        FROM ICUSTAYS I \
        WHERE I.DBSOURCE='metavision' AND (I.FIRST_CAREUNIT='MICU' or I.FIRST_CAREUNIT='SICU') \
        ORDER BY RAND() \
        LIMIT :sub_limit) AS II_sub \
    ORDER BY II_sub.SUBJECT_ID, II_sub.HADM_ID, II_sub.ICUSTAY_ID \
            ) AS II \
ON (C.SUBJECT_ID=II.SUBJECT_ID \
    AND C.HADM_ID=II.HADM_ID \
    AND C.ICUSTAY_ID=II.ICUSTAY_ID) \
INNER JOIN D_ITEMS D \
ON C.ITEMID=D.ITEMID \
WHERE D.ITEMID IN (226754, 227009, 227010, 223835) \
GROUP BY D.ITEMID

1 rows affected.


ITEMID,LABEL,STAY_COUNT,FREQ
223835,Inspired O2 Fraction,41,0.41
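
To make the FREQ column concrete: it is the number of distinct sampled ICU stays with at least one recording of an ITEMID, divided by the sample size `sub_limit`. The sketch below reproduces that arithmetic in plain Python on made-up rows; `stay_frequency` and the example data are hypothetical and not part of the notebook.

```python
from collections import defaultdict


def stay_frequency(events, sub_limit):
    """events: iterable of (icustay_id, itemid) pairs taken from the sampled stays."""
    stays_per_item = defaultdict(set)
    for icustay_id, itemid in events:
        stays_per_item[itemid].add(icustay_id)  # a set counts each stay only once
    return dict((itemid, len(stays) / float(sub_limit))
                for itemid, stays in stays_per_item.items())


# Made-up example: 4 sampled stays, one of which has no matching events.
events = [(1, 223835), (1, 223835), (2, 223835), (3, 220045)]
freq = stay_frequency(events, sub_limit=4)
```

As in the SQL version, an ITEMID that is never charted in the sample produces no entry at all, which is consistent with the query above returning a single row for four candidate ITEMIDs.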


...and manually inspect the first few instances of each ITEM to confirm it contains the required data (correct units and order of magnitude):

In [6]:
%sql SELECT * FROM CHARTEVENTS WHERE ITEMID=223835 LIMIT 5 

5 rows affected.


ROW_ID,SUBJECT_ID,HADM_ID,ICUSTAY_ID,ITEMID,CHARTTIME,STORETIME,CGID,VALUE,VALUENUM,VALUEUOM,WARNING,ERROR,RESULTSTATUS,STOPPED
335,34,144319,290505,223835,2191-02-23 07:31:00,2191-02-23 07:35:00,16924,60,60.0,,0,0,,
430,34,144319,290505,223835,2191-02-23 11:00:00,2191-02-23 11:04:00,14913,60,60.0,,0,0,,
706,36,165660,241249,223835,2134-05-12 07:09:00,2134-05-12 07:09:00,17525,100,100.0,,0,0,,
789,36,165660,241249,223835,2134-05-12 12:00:00,2134-05-12 13:56:00,17525,100,100.0,,0,0,,
877,36,165660,241249,223835,2134-05-12 16:13:00,2134-05-12 16:27:00,18428,100,100.0,,0,0,,


*For example, ITEMID 223835 (Inspired O2 Fraction) contains numeric fractional data, and is measured with a reasonably high frequency (~0.4 to 0.6). We conclude that this is a relevant ITEM.*

#### By the above procedure we identify the following ITEMIDS corresponding to the required variables:

* Some ITEMIDS may not be in (frequent) use in the database
* Urine output may be located in either CHARTEVENTS or OUTPUTEVENTS (we check both)

In [7]:
%sql SELECT * FROM D_ITEMS WHERE ITEMID=220224 LIMIT 5 

1 rows affected.


ROW_ID,ITEMID,LABEL,ABBREVIATION,DBSOURCE,LINKSTO,CATEGORY,UNITNAME,PARAM_TYPE,CONCEPTID,MAPPING_DEEP
12740,220224,Arterial O2 pressure,PO2 (Arterial),metavision,chartevents,Labs,mmHg,Numeric,,8


**As a single list this is** (220739, 223900, 223901, 226755, 226756, 226757, 226758, 227011, 227012, 227013, 227014,228112, 220615, 226752, 227005, 223791, 227881, 227519, 227059, 220640, 227464, 227442, 226772, 226535, 220645, 226534, 226776, 224826, 226759, 227443,220227, 220277, 226860,226861,226862,226863,226865,228232, 225624, 227000, 227001, 223838, 224832, 224391, 227810,223837, 224829, 226754, 227009, 227010,223835, 220210, 224688, 224689, 224690, 226770,227039,227516,220224,220235,226062,226063,227036, 223761, 223762, 224027, 220045, 220228, 220339, 224699, 224700, 220050, 220059, 220179, 224167, 225309, 227243, 226850, 226852, 228151) **for CHARTEVENTS.**

**And** (226566, 226627, 226631) **for OUTPUTEVENTS (urine output only).**

-------------------------------------
## (B) Extracting the data

Having located the relevant ITEMIDs we now:

* Calculate their usage frequency from a larger random sample of ICUSTAYS
* Define a cutoff frequency
* Extract and save all measurements of the remaining variables

In [8]:
sub_limit=1000  ## size of random sample

As before we calculate how many ICUSTAYS have at least one measurement of each ITEM, and convert this to an occurrence frequency (FREQ). This time we calculate for all ITEMIDs simultaneously, and sort the results by ascending occurrence frequency. For CHARTEVENTS:

In [9]:
%sql SELECT D.ITEMID, D.LABEL, COUNT(DISTINCT(II.ICUSTAY_ID)) AS STAY_COUNT, (COUNT(DISTINCT(II.ICUSTAY_ID))/(:sub_limit)) AS FREQ \
FROM CHARTEVENTS C \
INNER JOIN ( \
    SELECT * FROM ( \
        SELECT * \
        FROM ICUSTAYS I \
        WHERE I.DBSOURCE='metavision' AND (I.FIRST_CAREUNIT='MICU' or I.FIRST_CAREUNIT='SICU') \
        ORDER BY RAND() \
        LIMIT :sub_limit) AS II_sub \
    ORDER BY II_sub.SUBJECT_ID, II_sub.HADM_ID, II_sub.ICUSTAY_ID \
            ) AS II \
ON (C.SUBJECT_ID=II.SUBJECT_ID \
    AND C.HADM_ID=II.HADM_ID \
    AND C.ICUSTAY_ID=II.ICUSTAY_ID) \
INNER JOIN D_ITEMS D \
ON C.ITEMID=D.ITEMID \
WHERE D.ITEMID IN (220739, 223900, 223901, 226755, 226756, 226757, 226758, 227011, 227012, 227013, 227014,228112, 220615, 226752, 227005, 223791, 227881, 227519, 227059, 220640, 227464, 227442, 226772, 226535, 220645, 226534, 226776, 224826, 226759, 227443,220227, 220277, 226860,226861,226862,226863,226865,228232, 225624, 227000, 227001, 223838, 224832, 224391, 227810,223837, 224829, 226754, 227009, 227010,223835, 220210, 224688, 224689, 224690, 226770,227039,227516,220224,220235,226062,226063,227036, 223761, 223762, 224027, 220045, 220228, 220339, 224699, 224700, 220050, 220059, 220179, 224167, 225309, 227243, 226850, 226852, 228151) \
GROUP BY D.ITEMID \
ORDER BY FREQ

59 rows affected.


ITEMID,LABEL,STAY_COUNT,FREQ
226752,CreatinineApacheIIValue,1,0.001
226754,FiO2ApacheIIValue,1,0.001
226755,GcsApacheIIScore,1,0.001
226756,GCSEyeApacheIIValue,1,0.001
226757,GCSMotorApacheIIValue,1,0.001
226758,GCSVerbalApacheIIValue,1,0.001
226759,HCO3ApacheIIValue,1,0.001
226772,PotassiumApacheIIValue,1,0.001
226776,SodiumApacheIIValue,1,0.001
227000,BUN_ApacheIV,1,0.001


...and for OUTPUTEVENTS:

In [10]:
%sql SELECT D.ITEMID, D.LABEL, COUNT(DISTINCT(II.ICUSTAY_ID)) AS STAY_COUNT, (COUNT(DISTINCT(II.ICUSTAY_ID))/(:sub_limit)) AS FREQ \
FROM OUTPUTEVENTS C \
INNER JOIN ( \
    SELECT * FROM ( \
        SELECT * \
        FROM ICUSTAYS I \
        WHERE I.DBSOURCE='metavision' AND (I.FIRST_CAREUNIT='MICU' or I.FIRST_CAREUNIT='SICU') \
        ORDER BY RAND() \
        LIMIT :sub_limit) AS II_sub \
    ORDER BY II_sub.SUBJECT_ID, II_sub.HADM_ID, II_sub.ICUSTAY_ID \
            ) AS II \
ON (C.SUBJECT_ID=II.SUBJECT_ID \
    AND C.HADM_ID=II.HADM_ID \
    AND C.ICUSTAY_ID=II.ICUSTAY_ID) \
INNER JOIN D_ITEMS D \
ON C.ITEMID=D.ITEMID \
WHERE D.ITEMID IN (226566, 226627, 226631) \
GROUP BY D.ITEMID \
ORDER BY FREQ

2 rows affected.


ITEMID,LABEL,STAY_COUNT,FREQ
226631,PACU Urine,11,0.011
226627,OR Urine,127,0.127


**Note:** we find that the 'Urine Output' ITEMs that we identified are not in frequent use. The difficulty of identifying urine output in MIMIC has been previously noted by others. We accept that we cannot use 'Urine Output' and do not attempt further investigation.

We now store the CHARTEVENTS query results in a dataframe, and remove the ITEMs that are only recorded infrequently. This is done to simplify the subsequent analysis (and reduce the number of checks on data integrity), but we acknowledge that it will result in a small amount of missing data. We define a minimum usage frequency of 0.01 (1%).

In [11]:
freqs=%sql SELECT D.ITEMID, D.LABEL, COUNT(DISTINCT(II.ICUSTAY_ID)) AS STAY_COUNT, (COUNT(DISTINCT(II.ICUSTAY_ID))/(:sub_limit)) AS FREQ \
FROM CHARTEVENTS C \
INNER JOIN ( \
    SELECT * FROM ( \
        SELECT * \
        FROM ICUSTAYS I \
        WHERE I.DBSOURCE='metavision' AND (I.FIRST_CAREUNIT='MICU' or I.FIRST_CAREUNIT='SICU') \
        ORDER BY RAND() \
        LIMIT :sub_limit) AS II_sub \
    ORDER BY II_sub.SUBJECT_ID, II_sub.HADM_ID, II_sub.ICUSTAY_ID \
            ) AS II \
ON (C.SUBJECT_ID=II.SUBJECT_ID \
    AND C.HADM_ID=II.HADM_ID \
    AND C.ICUSTAY_ID=II.ICUSTAY_ID) \
INNER JOIN D_ITEMS D \
ON C.ITEMID=D.ITEMID \
WHERE D.ITEMID IN (220739, 223900, 223901, 226755, 226756, 226757, 226758, 227011, 227012, 227013, 227014,228112, 220615, 226752, 227005, 223791, 227881, 227519, 227059, 220640, 227464, 227442, 226772, 226535, 220645, 226534, 226776, 224826, 226759, 227443,220227, 220277, 226860,226861,226862,226863,226865,228232, 225624, 227000, 227001, 223838, 224832, 224391, 227810,223837, 224829, 226754, 227009, 227010,223835, 220210, 224688, 224689, 224690, 226770,227039,227516,220224,220235,226062,226063,227036, 223761, 223762, 224027, 220045, 220228, 220339, 224699, 224700, 220050, 220059, 220179, 224167, 225309, 227243, 226850, 226852, 228151) \
GROUP BY D.ITEMID \
ORDER BY FREQ

freqs=graphlab.SFrame(freqs.DataFrame())
freqs.print_rows(num_rows=4)
ITEMIDS = freqs[freqs['FREQ']>=0.01]['ITEMID']  ## cutoff = 0.01

ITEMIDS_TOREMOVE = freqs[freqs['FREQ']<0.01]['ITEMID']  ## remove these from variable mapping

53 rows affected.
+--------+-------------------------+------------+-------+
| ITEMID |          LABEL          | STAY_COUNT |  FREQ |
+--------+-------------------------+------------+-------+
| 226752 | CreatinineApacheIIValue |     1      | 0.001 |
| 226754 |    FiO2ApacheIIValue    |     1      | 0.001 |
| 226755 |     GcsApacheIIScore    |     1      | 0.001 |
| 226756 |   GCSEyeApacheIIValue   |     1      | 0.001 |
+--------+-------------------------+------------+-------+
[53 rows x 4 columns]



In [12]:
variable_mapping = dict()

variable_mapping['fio2'] = [226754, 227009, 227010,223835]
variable_mapping['resp'] = [220210, 224688, 224689, 224690]
variable_mapping['po2'] = [226770,227039,227516,220224]  ## 227516 is venous.
variable_mapping['pco2'] = [220235,226062,226063,227036]  ## 226062,226063 are venous.

variable_mapping['temp'] = [223761, 223762, 224027] 
variable_mapping['hr'] = [220045]
variable_mapping['bp'] = [220050, 220059, 220179, 224167, 225309, 227243, 226850, 226852, 228151]
variable_mapping['k'] = [220640, 227464, 227442, 226772, 226535]
variable_mapping['na'] = [220645, 226534, 226776]
variable_mapping['hco3'] = [224826, 226759, 227443]
variable_mapping['spo2'] = [220227, 220277, 226860,226861,226862,226863,226865,228232]
variable_mapping['bun'] = [225624, 227000, 227001]
variable_mapping['airway'] = [223838, 224832, 224391, 227810,223837, 224829]
variable_mapping['gcs'] = [220739, 223900, 223901, 226755, 226756, 226757, 226758, 227011, 227012, 227013, 227014,228112]
variable_mapping['creatinine'] = [220615, 226752, 227005]
variable_mapping['pain'] = [223791, 227881]
variable_mapping['urine'] = [227519, 227059]
variable_mapping['haemoglobin'] = [220228]
variable_mapping['peep'] = [220339, 224699, 224700]
                        
for var in variable_mapping:
        variable_mapping[var] = [itd for itd in variable_mapping[var] if itd not in ITEMIDS_TOREMOVE]
        
print variable_mapping

{'urine': [227519, 227059], 'pain': [223791], 'temp': [223761, 223762, 224027], 'hr': [220045], 'fio2': [227009, 227010, 223835], 'resp': [220210, 224688, 224689, 224690], 'airway': [223838, 224832, 224391, 227810, 223837, 224829], 'po2': [226770, 227039, 227516, 220224], 'hco3': [224826, 227443], 'peep': [220339, 224699, 224700], 'gcs': [220739, 223900, 223901, 227011, 227012, 227013, 227014, 228112], 'pco2': [220235, 226062, 226063, 227036], 'na': [220645, 226534], 'bun': [225624, 227000, 227001], 'bp': [220050, 220059, 220179, 225309, 226850, 226852, 228151], 'creatinine': [220615, 227005], 'k': [220640, 227464, 227442, 226535], 'spo2': [220227, 220277, 226860, 226861, 226862, 226863, 226865, 228232], 'haemoglobin': [220228]}
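
The cutoff and pruning step above can be summarised in a few lines of plain Python. This is an illustrative sketch on made-up frequencies, not the notebook's own code: `prune_mapping`, `toy_mapping` and `toy_freqs` are hypothetical names, and a plain dict stands in for the SFrame of frequencies.

```python
def prune_mapping(variable_mapping, freqs, cutoff=0.01):
    """Drop ITEMIDs whose observed usage frequency is below `cutoff`.

    freqs: dict mapping ITEMID -> FREQ, containing only ITEMIDs that were
    observed at least once in the sample (as in the query result above).
    """
    to_remove = set(itemid for itemid, f in freqs.items() if f < cutoff)
    return dict((var, [i for i in ids if i not in to_remove])
                for var, ids in variable_mapping.items())


# Made-up frequencies for illustration only.
toy_mapping = {'fio2': [226754, 227009, 223835], 'pain': [223791, 227881]}
toy_freqs = {226754: 0.001, 227009: 0.02, 223835: 0.41, 223791: 0.3}
pruned = prune_mapping(toy_mapping, toy_freqs)
```

In this toy example 227881 stays in the mapping because it has no entry in `toy_freqs`. The same applies to the loop above: an ITEMID that never appears in the frequency table is not in `ITEMIDS_TOREMOVE`, so whether such IDs should also be dropped is a separate decision.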


#### Finally we extract the data from CHARTEVENTS for all the selected ITEMIDS:

In [166]:
import MySQLdb
import MySQLdb.cursors
from contextlib import closing

db='MIMIC'
user='root'
password='mysql2016'

conn = MySQLdb.connect(host="localhost",
                     user=user, 
                     passwd=password, 
                     db=db,
                     unix_socket="/run/mysqld/mysqld.sock",
                     cursorclass = MySQLdb.cursors.SSCursor)  ## ensures correct behaviour for 'fetchone'

In [167]:
## test the connection (should only print one row if 'fetchone' is working correctly):
with closing(conn.cursor()) as cur:
    cur.execute('SELECT * FROM D_ITEMS WHERE LABEL LIKE "%urine%" AND DBSOURCE="metavision"')
    row = cur.fetchone()
    print row


(12710, 220799, 'ZSpecific Gravity (urine)', 'ZSpecific gravity (urine)', 'metavision', 'chartevents', 'Labs', 'None', 'Numeric', None, None)


In [69]:
#list_of_ids = [205254]
#list_of_ids.extend(ITEMIDS)
#format_strings = '%d,' + ','.join(['%s'] * len(list_of_ids))
list_of_ids = ITEMIDS
format_strings = ','.join(['%s'] * len(list_of_ids))
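
The placeholder string built above is used in two stages: Python string formatting first expands the single `%s` in the query template into one `%s` per ITEMID, and the database driver then binds the actual values when the query is executed. A minimal sketch of the first stage, using a shortened toy template (`expand_in_clause` and `template` are hypothetical names):

```python
def expand_in_clause(template, ids):
    """Fill the single %s in `template` with one %s placeholder per id."""
    placeholders = ','.join(['%s'] * len(ids))
    return template % placeholders, tuple(ids)


template = 'SELECT * FROM CHARTEVENTS WHERE ITEMID IN (%s)'
query, params = expand_in_clause(template, [223835, 220045, 220210])
```

The resulting pair corresponds to what is later passed as `cur.execute(query, params)`; the values themselves never become part of the SQL text.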

In [16]:
pull_query="""SELECT C.SUBJECT_ID, C.HADM_ID, C.ICUSTAY_ID, C.ITEMID, C.CHARTTIME, C.VALUE, C.VALUENUM, C.VALUEUOM, II.INTIME, II.OUTTIME, II.LOS, D.LABEL, D.UNITNAME 
FROM CHARTEVENTS C 
INNER JOIN ( 
    SELECT * 
    FROM ICUSTAYS I 
    WHERE I.DBSOURCE='metavision' AND (I.FIRST_CAREUNIT='MICU' or I.FIRST_CAREUNIT='SICU') 
    LIMIT 100000) AS II 
ON (C.SUBJECT_ID=II.SUBJECT_ID 
    AND C.HADM_ID=II.HADM_ID 
    AND C.ICUSTAY_ID=II.ICUSTAY_ID) 
INNER JOIN D_ITEMS D 
ON C.ITEMID=D.ITEMID 
WHERE D.ITEMID in (%s)"""

## Column headers for the dataframe in which we will store the query results:
cols = ['C.SUBJECT_ID', 'C.HADM_ID', 'C.ICUSTAY_ID', 'C.ITEMID', 'C.CHARTTIME', 'C.VALUE', 'C.VALUENUM', 'C.VALUEUOM', 'II.INTIME', 'II.OUTTIME', 'II.LOS', 'D.LABEL', 'D.UNITNAME']

We now extract the data. This is a big query. We use 'fetchone' to grab the results row by row, and store them in a dictionary (keys=column names, values=list of column values).

We use graphlab because it has out-of-memory storage for SFrames (no concern over the size of the result set).
We save the SFrame intermittently and clear the results dictionary.

In [25]:
if False:
    data = None
    table = None
    subjectids = []
    max_len = 10000  ## buffer dictionary size (to reduce memory usage)

    with closing(conn.cursor()) as cur:
        cur.execute(pull_query %format_strings,tuple(list_of_ids))

        row = cur.fetchone()
        while row is not None:

            if data is None:
                ## results buffer is clear, begin new one
                row = [[element] for element in row]
                data = OrderedDict(zip(cols,row))
            else:
                ## append to existing results buffer
                new_row = dict(zip(cols,row))
                for key in data.keys():
                    data[key].append(new_row[key])

                if len(data['C.SUBJECT_ID']) >= max_len:
                    ## buffer is full.. 
                    if table is None:
                        table = graphlab.SFrame(data) ## creates SFrame from dictionary of columns
                        table.save('mimic_all_data')            
                    else:
                        table = table.append(graphlab.SFrame(data))
                        table.save('mimic_all_data')            

                        print "Saving SFrame..."
                        print "Number of icustays = %d" %(len(table['C.SUBJECT_ID'].unique()))  ## NB: counts unique SUBJECT_IDs, not ICUSTAY_IDs

                    data = None ## clear buffer

            row = cur.fetchone() 

Saving SFrame...
Number of icustays = 114
Saving SFrame...
Number of icustays = 164
[... repeated progress output omitted ...]
Saving SFrame...
Number of icustays = 10962

OperationalError: (2013, 'Lost connection to MySQL server during query')
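
One detail of the buffered loop above is worth spelling out: rows still sitting in the buffer when the cursor is exhausted are only stored if the buffer is flushed once more after the loop. The sketch below shows the pattern with an explicit final flush. It is not the notebook's code: the standard-library `sqlite3` module stands in for the MySQL connection, a Python list stands in for the SFrame on disk, and `fetch_in_chunks` is a hypothetical helper.

```python
import sqlite3
from collections import OrderedDict


def fetch_in_chunks(cursor, cols, max_len, flush):
    """Stream rows with fetchone(), buffering them column-wise.

    flush(buffer) is called whenever the buffer holds max_len rows and once
    more at the end for any remainder. Returns the number of flushes.
    """
    data = OrderedDict((c, []) for c in cols)
    n_flushed = 0
    row = cursor.fetchone()
    while row is not None:
        for col, value in zip(cols, row):
            data[col].append(value)
        if len(data[cols[0]]) >= max_len:
            flush(data)
            n_flushed += 1
            data = OrderedDict((c, []) for c in cols)  # start a fresh buffer
        row = cursor.fetchone()
    if data[cols[0]]:
        flush(data)  # final partial buffer
        n_flushed += 1
    return n_flushed


# Toy demonstration: 25 rows, buffer of 10 rows.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE events (icustay_id INTEGER, valuenum REAL)')
conn.executemany('INSERT INTO events VALUES (?, ?)',
                 [(i, float(i)) for i in range(25)])
chunks = []
cur = conn.cursor()
cur.execute('SELECT icustay_id, valuenum FROM events ORDER BY icustay_id')
n = fetch_in_chunks(cur, ['icustay_id', 'valuenum'], 10,
                    lambda buf: chunks.append(len(buf['icustay_id'])))
cur.close()
conn.close()
```

In the real extraction the `flush` callback would append the buffer to the SFrame and save it; that part is left out here because it depends on graphlab.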

### Running the full extraction in one go kept crashing (MySQL timeout), so we now run it in stages and recombine all the data at the end:

In [165]:
partial_pull_query0 ="""SELECT C.SUBJECT_ID, C.HADM_ID, C.ICUSTAY_ID, C.ITEMID, C.CHARTTIME, C.VALUE, C.VALUENUM, C.VALUEUOM, II.INTIME, II.OUTTIME, II.LOS, D.LABEL, D.UNITNAME 
FROM CHARTEVENTS C 
INNER JOIN ( 
    SELECT * 
    FROM ICUSTAYS I 
    WHERE I.DBSOURCE='metavision' AND (I.FIRST_CAREUNIT='MICU' or I.FIRST_CAREUNIT='SICU') AND I.ICUSTAY_ID>=%d AND I.ICUSTAY_ID<=%d AND I.ICUSTAY_ID NOT IN (249026,254172)"""

partial_pull_query1="""
    LIMIT 100000) AS II 
ON (C.SUBJECT_ID=II.SUBJECT_ID 
    AND C.HADM_ID=II.HADM_ID 
    AND C.ICUSTAY_ID=II.ICUSTAY_ID) 
INNER JOIN D_ITEMS D 
ON C.ITEMID=D.ITEMID 
WHERE D.ITEMID in (%s)"""

## Column headers for the dataframe in which we will store the query results:
cols = ['C.SUBJECT_ID', 'C.HADM_ID', 'C.ICUSTAY_ID', 'C.ITEMID', 'C.CHARTTIME', 'C.VALUE', 'C.VALUENUM', 'C.VALUEUOM', 'II.INTIME', 'II.OUTTIME', 'II.LOS', 'D.LABEL', 'D.UNITNAME']

In [99]:
min_id = 200000
max_id = 300000
boundary_ids = []
start = min_id
end = min_id + 1000
while end<=max_id:
    boundary_ids.append((start,end))
    start+=1000
    end+=1000

In [162]:
#len(boundary_ids)
boundary_ids[49]

(249000, 250000)
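
A note on the chunk boundaries: consecutive chunks share an endpoint (e.g. 201000), and the filter in `partial_pull_query0` uses `>=` and `<=`, so a stay whose ID is exactly a shared endpoint would match two chunks. Whether such IDs exist in the data is not checked here. The sketch below shows the half-open alternative (`start <= id < end`), which sidesteps the question; `make_boundaries` and `chunk_index` are hypothetical helpers, and the query would need `<` instead of `<=` to match.

```python
def make_boundaries(min_id, max_id, step):
    """Split [min_id, max_id) into consecutive half-open (start, end) ranges."""
    return [(start, min(start + step, max_id))
            for start in range(min_id, max_id, step)]


def chunk_index(icustay_id, min_id=200000, step=1000):
    """Index of the half-open chunk that contains icustay_id."""
    return (icustay_id - min_id) // step


boundaries = make_boundaries(200000, 300000, 1000)
```

With half-open ranges every ID belongs to exactly one chunk, so `chunk_index` can be used to look up which chunk a given stay falls into.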

#### The piecewise code below crashes on the boundaries at index 49 and 54. Let's split these further:

In [140]:
min_id = 249000
max_id = 249100
_sub_boundary_ids = []
start = min_id
end = min_id + 10
while end<=max_id:
    _sub_boundary_ids.append((start,end))
    start+=10  ## step must match the sub-range width of 10
    end+=10

### Problematic ICUSTAY_IDs!
#### It is not clear why these fail. Too much data in a single stay?
#### If there are only a few, excluding them shouldn't affect the results.

In [None]:
problem_ids = [249026, 254172]
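
Narrowing a failing range down to individual stays, as done by hand above, can also be written as a bisection. This is a sketch under the assumption that failures are reproducible and caused by individual stays: `find_failing_ids` is hypothetical, and `range_fails` is a stand-in callback that would run the extraction for IDs in `[lo, hi)` and report whether it crashed.

```python
def find_failing_ids(lo, hi, range_fails):
    """Return all ids in [lo, hi) for which the extraction fails.

    range_fails(lo, hi) is a hypothetical callback: it runs the extraction
    for ids in [lo, hi) and returns True if that run crashed.
    """
    if not range_fails(lo, hi):
        return []
    if hi - lo == 1:
        return [lo]
    mid = (lo + hi) // 2
    return (find_failing_ids(lo, mid, range_fails) +
            find_failing_ids(mid, hi, range_fails))


# Stand-in for the real extraction: pretend these two stays crash the query.
bad = set([249026, 254172])


def fake_range_fails(lo, hi):
    return any(lo <= i < hi for i in bad)


found = find_failing_ids(249000, 255000, fake_range_fails)
```

Each level halves the range, so a chunk of 1000 IDs needs roughly ten extraction attempts per problematic stay rather than a manual search.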

In [168]:
for boundary in boundary_ids[54:]:
#for boundary in boundary_ids[49:]:
#for boundary in [(249027,249030)]:
#for boundary in _sub_boundary_ids:
    
    data = None
    table = None
    
    print "Beginning: " + str(boundary)
    
    with closing(conn.cursor()) as cur:
        cur.execute(partial_pull_query0 %(boundary[0],boundary[1]) + partial_pull_query1 %format_strings,tuple(list_of_ids))

        print "   intitial query complete"
        row = cur.fetchone()
        while row is not None:
            
            if data is None:
                ## results buffer is clear, begin new one
                row = [[element] for element in row]
                data = OrderedDict(zip(cols,row))
            else:
                ## append to existing results buffer
                new_row = dict(zip(cols,row))
                for key in data.keys():
                    data[key].append(new_row[key])

            STAY = data['C.ICUSTAY_ID']
            row = cur.fetchone() 
    print "   extracted all rows. Saving SFrame.."
    table = graphlab.SFrame(data) ## creates SFrame from dictionary of columns
    table.save('_temp_mimic_all_data_' + str(boundary))            
    print "Completed: " + str(boundary)

Beginning: (254000, 255000)
   intitial query complete
   extracted all rows. Saving SFrame..
Completed: (254000, 255000)
Beginning: (255000, 256000)
   intitial query complete
   extracted all rows. Saving SFrame..
Completed: (255000, 256000)
Beginning: (256000, 257000)
   intitial query complete
   extracted all rows. Saving SFrame..
Completed: (256000, 257000)
Beginning: (257000, 258000)
   intitial query complete
   extracted all rows. Saving SFrame..
Completed: (257000, 258000)
Beginning: (258000, 259000)
   intitial query complete
   extracted all rows. Saving SFrame..
Completed: (258000, 259000)
Beginning: (259000, 260000)
   intitial query complete
   extracted all rows. Saving SFrame..
Completed: (259000, 260000)
Beginning: (260000, 261000)
   intitial query complete
   extracted all rows. Saving SFrame..
Completed: (260000, 261000)
Beginning: (261000, 262000)
   intitial query complete
   extracted all rows. Saving SFrame..
Completed: (261000, 262000)
Beginning: (262000, 2630

#### We now combine these into a single data frame. Once this is processed and cleaned we will delete all temporary and intermediate data frames from the hard drive.

In [174]:
table = graphlab.SFrame('_temp_mimic_all_data_' + str(boundary_ids[0]))

for boundary in boundary_ids[1:]:
    print boundary
    new_table = graphlab.SFrame('_temp_mimic_all_data_' + str(boundary))
    table = table.append(new_table)
    
table.save('mimic_all_data')

(201000, 202000)
(202000, 203000)
(203000, 204000)
(204000, 205000)
(205000, 206000)
(206000, 207000)
(207000, 208000)
(208000, 209000)
(209000, 210000)
(210000, 211000)
(211000, 212000)
(212000, 213000)
(213000, 214000)
(214000, 215000)
(215000, 216000)
(216000, 217000)
(217000, 218000)
(218000, 219000)
(219000, 220000)
(220000, 221000)
(221000, 222000)
(222000, 223000)
(223000, 224000)
(224000, 225000)
(225000, 226000)
(226000, 227000)
(227000, 228000)
(228000, 229000)
(229000, 230000)
(230000, 231000)
(231000, 232000)
(232000, 233000)
(233000, 234000)
(234000, 235000)
(235000, 236000)
(236000, 237000)
(237000, 238000)
(238000, 239000)
(239000, 240000)
(240000, 241000)
(241000, 242000)
(242000, 243000)
(243000, 244000)
(244000, 245000)
(245000, 246000)
(246000, 247000)
(247000, 248000)
(248000, 249000)
(249000, 250000)
(250000, 251000)
(251000, 252000)
(252000, 253000)
(253000, 254000)
(254000, 255000)
(255000, 256000)
(256000, 257000)
(257000, 258000)
(258000, 259000)
(259000, 26000

In [175]:
table = graphlab.SFrame('mimic_all_data/')
old_table = graphlab.SFrame('mimic_all_data_CLEANED_RFD/')

IN_IDS = table['C.ICUSTAY_ID'].unique()
O_IN_IDS = old_table['C.ICUSTAY_ID'].unique()

In [176]:
remaining_ids = [iid for iid in O_IN_IDS if iid not in IN_IDS]

In [177]:
print len(IN_IDS)
print len(O_IN_IDS)

14430
11288
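
The cells shown here only compare the two counts. A set-based comparison makes it explicit which stays are missing from, or new in, the fresh extraction. The sketch below uses made-up IDs; `compare_id_sets` is a hypothetical helper and plain lists stand in for the SArrays of unique ICUSTAY_IDs.

```python
def compare_id_sets(new_ids, old_ids):
    """Summarise the overlap between two collections of ICUSTAY_IDs."""
    new_ids, old_ids = set(new_ids), set(old_ids)
    return {
        'missing_from_new': sorted(old_ids - new_ids),
        'only_in_new': sorted(new_ids - old_ids),
        'shared': len(new_ids & old_ids),
    }


# Made-up IDs for illustration.
report = compare_id_sets([200001, 200003, 200010], [200003, 200010, 200020])
```

An empty `missing_from_new` list would mean that every stay in the older cleaned table is also present in the new extraction.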


### The extraction appears to have worked, but we cannot confirm this until we see how many rows are in the cleaned dataset.

----------------------------------------------------------------------------------------------
## (C) Post-processing of data.

Having extracted the main data we perform some post-processing to facilitate analysis.

In [178]:
all_data = graphlab.SFrame('mimic_all_data') ## reload
print all_data.column_names()

number_of_stays = len(all_data['C.ICUSTAY_ID'].unique())
print len(all_data)
print number_of_stays

['C.CHARTTIME', 'C.HADM_ID', 'C.ICUSTAY_ID', 'C.ITEMID', 'C.SUBJECT_ID', 'C.VALUE', 'C.VALUENUM', 'C.VALUEUOM', 'D.LABEL', 'D.UNITNAME', 'II.INTIME', 'II.LOS', 'II.OUTTIME']
10841059
14430


*There are more than 10 million rows in the data (just over 1.5 GB on disk) and 14430 unique ICUSTAYS; not all of these will be in the cohort.*

We add the following columns to the data:

* ['final_4hr'] : 1 if the measurement is from the final 4 hours of the patient's stay
* ['final_24hr'] : 1 if the measurement is from the final 24 hours of the patient's stay
* ['hrs_bd'] : float giving the number of hours before discharge at which the measurement was taken (can filter on this column later)



In [179]:
all_data['hrs_bd'] = (all_data['II.OUTTIME'] - all_data['C.CHARTTIME'])/float(60**2)
all_data.save('mimic_all_data')  

In [180]:
HR = 4  ## number of hours before end of ICUSTAY 
all_data['final_%dhr' %HR] = all_data.apply(lambda row: 1 if (row['II.OUTTIME'] - row['C.CHARTTIME']).total_seconds()/(60.**2) <= HR else 0)
all_data.save('mimic_all_data')            

In [181]:
HR = 24  ## number of hours before end of ICUSTAY 
all_data['final_%dhr' %HR] = all_data.apply(lambda row: 1 if (row['II.OUTTIME'] - row['C.CHARTTIME']).total_seconds()/(60.**2) <= HR else 0)
all_data.save('mimic_all_data')            
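
The three new columns all derive from the same quantity: the time between a measurement (CHARTTIME) and the end of the ICU stay (OUTTIME), in hours. A plain-Python sketch of that arithmetic with made-up timestamps (`hours_before_discharge` and `final_window_flag` are hypothetical helpers mirroring the lambdas above):

```python
from datetime import datetime


def hours_before_discharge(outtime, charttime):
    """Hours between a measurement and the end of the ICU stay."""
    return (outtime - charttime).total_seconds() / 3600.0


def final_window_flag(outtime, charttime, hours):
    """1 if the measurement falls within the last `hours` of the stay, else 0."""
    return 1 if hours_before_discharge(outtime, charttime) <= hours else 0


# Made-up timestamps for illustration.
outtime = datetime(2191, 2, 23, 11, 31)
charttime = datetime(2191, 2, 23, 7, 31)
hrs_bd = hours_before_discharge(outtime, charttime)
```

Note that a measurement charted after OUTTIME gives a negative value and is therefore also flagged by the `<=` test; filtering on `hrs_bd >= 0` would exclude such rows if that is not wanted.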

#### Merging data extraction with cohort.

We now add columns for cohort identification, and filtering based on outcome. The main cohort consists of the first ICUSTAY of each hospital admission (should be >13000 stays).

From this cohort we are interested in ICUSTAYS that have good (1) and bad (0) outcomes.
There are two types of bad outcome: (A) in-hospital death, (B) readmission to ICU (during the same hospital admission).

The above abstraction is not fool-proof. For example:
* patients may die outside of hospital for related reasons
* readmission may be linked but occur during a different hospital admission
* in-hospital death may be unrelated to ICU discharge
* etc

We add the following columns to the data frame:

* ['cohort']: binary flag. 1 indicates that ICUSTAY is part of cohort (i.e. first or only stay of a hospital admission).
* ['outcome']: binary flag. 1 indicates good outcome (survival with no readmission). 0 indicates bad outcome (death or readmission).
* ['in_h_death']: binary flag. 1 for in-hospital death. 0 for survival.
* ['in_icu_death']: binary flag. 1 for death in the ICU. 0 for survival.
* ['readmit']: integer. Number of readmissions during same hospital admission.

*Note: there are fewer HADMs than in cohort_selection_mimic.ipynb, since not all patients have chartevents data.*

In [182]:
with open('first_stays.pkl', 'rb') as f:
    first_stays = pickle.load(f)

with open('moratlities.pkl', 'rb') as f:
    mortalities = pickle.load(f)

with open('stay_counts.pkl', 'rb') as f:
    stays_counts = pickle.load(f)

with open('icu_deaths.pkl', 'rb') as f:
    ic_deaths = pickle.load(f)

In [183]:
print "Adding cohort column..."
all_data['cohort'] = all_data.apply(lambda row: 1 if first_stays[row['C.HADM_ID']]==row['C.ICUSTAY_ID'] else 0)

print "Adding in hospital column..."
all_data['in_h_death'] = all_data['C.HADM_ID'].apply(lambda hadmid: 1 if hadmid in mortalities else 0 )

print "Adding in icu death column..."
all_data['in_icu_death'] = all_data['C.ICUSTAY_ID'].apply(lambda icustay: 1 if icustay in ic_deaths else 0 )

print "Adding readmission column..."
all_data['readmit'] = all_data['C.HADM_ID'].apply(lambda hadmid: stays_counts[hadmid] - 1)

print "Adding outcome column..."
all_data['outcome'] = all_data['C.HADM_ID'].apply(lambda hadmid: 1 if (hadmid not in mortalities and stays_counts[hadmid]==1) else 0)

print 'finish adding new columns, saving data frame..'
all_data.save('mimic_all_data')

Adding cohort column...
Adding in hospital column...
Adding in icu death column...
Adding readmission column...
Adding outcome column...
finish adding new columns, saving data frame..
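
The flag logic above can be checked on a toy example. The sketch assumes that `first_stays` and `stays_counts` are dicts keyed by HADM_ID and that `mortalities` is a collection of HADM_IDs; the contents of the pickled objects are not shown in this notebook, so these shapes are assumptions, and `stay_flags` and the toy data are hypothetical.

```python
def stay_flags(hadm_id, icustay_id, first_stays, mortalities, stays_counts):
    """Compute the cohort/outcome flags for one (HADM_ID, ICUSTAY_ID) pair."""
    readmit = stays_counts[hadm_id] - 1
    in_h_death = 1 if hadm_id in mortalities else 0
    return {
        'cohort': 1 if first_stays[hadm_id] == icustay_id else 0,
        'in_h_death': in_h_death,
        'readmit': readmit,
        # good outcome: survived the admission and was never readmitted to ICU
        'outcome': 1 if (in_h_death == 0 and readmit == 0) else 0,
    }


# Made-up admissions for illustration.
first_stays = {100: 200001, 101: 200005, 102: 200009}
mortalities = set([101])
stays_counts = {100: 2, 101: 1, 102: 1}

flags_readmitted = stay_flags(100, 200001, first_stays, mortalities, stays_counts)
flags_second_stay = stay_flags(100, 200002, first_stays, mortalities, stays_counts)
flags_died = stay_flags(101, 200005, first_stays, mortalities, stays_counts)
flags_good = stay_flags(102, 200009, first_stays, mortalities, stays_counts)
```

Because 'outcome', 'in_h_death' and 'readmit' are keyed on HADM_ID, every ICU stay of the same hospital admission carries the same values; only 'cohort' distinguishes the first stay from later ones.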


In [184]:
len(all_data['C.ICUSTAY_ID'].unique())

14430