# Extracting MIMIC Data

## Environment Setup

In [1]:
import time
import pymysql
import getpass
import pickle as pkl
import pandas as pd
from collections import OrderedDict

## Query List

#### Construct a list of queries that we will want to run.

In [2]:
queryList = OrderedDict([('PNA-POS', './queries/PNA-POS-Mimic2.sql'), ('PNA-NEG','./queries/PNA-NEG-Mimic2.sql')])

In [6]:
queryList = OrderedDict([('PNA-POS', './queries/PNA-POS-Mimic2-local.sql'), ('PNA-NEG','./queries/PNA-NEG-Mimic2-local.sql')])
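Because the query list only stores file paths, a mistyped path does not surface until the retrieval loop runs. A minimal sketch of loading the SQL text up front follows. The helper name `load_queries` is hypothetical and not part of the original notebook, and it assumes the files are UTF-8 text.

```python
from collections import OrderedDict
from pathlib import Path


def load_queries(query_files):
    """Return an OrderedDict mapping each label to the SQL text of its file.

    `query_files` maps a label (e.g. 'PNA-POS') to a path, like queryList.
    All paths are checked before any file is read, so one error message
    lists every missing file.
    """
    missing = [str(path) for path in query_files.values()
               if not Path(path).is_file()]
    if missing:
        raise FileNotFoundError("Missing query files: " + ", ".join(missing))

    loaded = OrderedDict()
    for label, path in query_files.items():
        # Assumption: the SQL files are plain UTF-8 text.
        loaded[label] = Path(path).read_text(encoding="utf-8")
    return loaded
```

Calling `load_queries(queryList)` would then give the SQL strings in the same order as the labels, which keeps the positive and negative queries in a predictable sequence.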

## MIMIC Database Connection

#### Make a connection to the MIMIC database and get a cursor for record processing.

In [3]:
conn = pymysql.connect(host="mysql", 
                       port = 3306, user="jovyan", 
                       passwd=getpass.getpass("Enter MySQL passwd for jovyan"), db='mimic2')
cur = conn.cursor()

Enter MySQL passwd for jovyan········


In [7]:
conn = pymysql.connect(host="localhost", 
                       port = 3306, user="jferraro", 
                       passwd=getpass.getpass("Enter MySQL passwd "), db='mimic')
cur = conn.cursor()

Enter MySQL passwd ········
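The two connection cells differ only in host, user, and database name, so one way to avoid keeping both is to store those settings as named profiles. The sketch below is an illustration rather than the notebook's own approach: the profile names `container` and `local` and the helper `connection_settings` are hypothetical, while the values are copied from the cells above.

```python
import getpass

# Values copied from the two connection cells; the profile names are made up.
PROFILES = {
    "container": {"host": "mysql", "port": 3306, "user": "jovyan", "db": "mimic2"},
    "local": {"host": "localhost", "port": 3306, "user": "jferraro", "db": "mimic"},
}


def connection_settings(profile, password=None):
    """Build keyword arguments for pymysql.connect from a named profile.

    If no password is supplied, prompt for one so it never appears in the
    notebook source or its saved output.
    """
    if profile not in PROFILES:
        raise ValueError("Unknown profile: " + repr(profile))
    settings = dict(PROFILES[profile])  # copy, so PROFILES is left untouched
    if password is None:
        password = getpass.getpass(
            "Enter MySQL passwd for " + settings["user"] + ": ")
    settings["passwd"] = password
    return settings
```

With this in place, a single cell such as `conn = pymysql.connect(**connection_settings("local"))` would cover either environment by changing one string.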


## Retrieve our Data

#### For each query we retrieve the data and concatenate the results into a single pandas DataFrame.

In [4]:
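queries = []

for key in queryList:
    # Read the SQL text; the context manager closes the file handle.
    with open(queryList[key], 'r') as query_file:
        query = query_file.read()
    print("execute query: " + key)
    # IPython magic: time one execution of the query on the cursor.
    %time cur.execute(query)
    queries.append(query)

# Run each query again through pandas and stack the results into one DataFrame.
corpus = pd.concat([pd.read_sql_query(q, conn) for q in queries])
print(corpus.head())

conn.close()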

execute query: PNA-POS
CPU times: user 53.8 ms, sys: 2.53 ms, total: 56.3 ms
Wall time: 1.27 s
execute query: PNA-NEG
CPU times: user 382 ms, sys: 199 ms, total: 581 ms
Wall time: 3.08 s
                                                text
0  \n\n\n     DATE: [**2996-12-2**] 10:25 AM\n   ...
1  \n\n\n     DATE: [**2850-2-14**] 10:22 PM\n   ...
2  \n\n\n     DATE: [**2631-10-3**] 9:52 AM\n    ...
3  \n\n\n     DATE: [**2584-11-21**] 11:17 AM\n  ...
4  \n\n\n     DATE: [**2584-11-21**] 11:17 AM\n  ...
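The cell above executes every query twice (once on the cursor for timing, once through `pd.read_sql_query`), and the concatenated result does not record which query produced each row. A minimal sketch of running each query once and tagging its rows is shown below. The helper `fetch_labeled_rows` and the `label` field are hypothetical, and an in-memory SQLite table stands in for the MIMIC MySQL database so the example runs without a server.

```python
import sqlite3
from collections import OrderedDict


def fetch_labeled_rows(conn, queries):
    """Run each query once and tag every returned row with its query label."""
    rows = []
    for label, sql in queries.items():
        cur = conn.cursor()
        try:
            cur.execute(sql)
            # DB-API cursors expose column names in cur.description.
            columns = [col[0] for col in cur.description]
            for record in cur.fetchall():
                row = dict(zip(columns, record))
                row["label"] = label
                rows.append(row)
        finally:
            cur.close()
    return rows


# Stand-in data: a tiny SQLite table, not the real MIMIC schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (text TEXT, pna INTEGER)")
conn.executemany("INSERT INTO notes VALUES (?, ?)",
                 [("note a", 1), ("note b", 0), ("note c", 0)])

queries = OrderedDict([
    ("PNA-POS", "SELECT text FROM notes WHERE pna = 1 ORDER BY text"),
    ("PNA-NEG", "SELECT text FROM notes WHERE pna = 0 ORDER BY text"),
])
rows = fetch_labeled_rows(conn, queries)
conn.close()
```

Passing `rows` to `pd.DataFrame(rows)` would give a frame with a `text` column and a `label` column, so the positive and negative notes stay distinguishable after they are combined.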


## Safestore the Data

#### We serialize the corpus DataFrame to disk so that we do not have to re-run the queries every time we want to try a new model.

In [5]:
# The context manager closes the file even if pickling fails.
with open('pna-corpus.pkl', 'wb') as file:
    pkl.dump(corpus, file)
print("Done!")

Done!
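Saving is only useful if the file can be read back later, so a round-trip check can be worth adding. The sketch below is a hedged illustration: `save_pickle` and `load_pickle` are hypothetical helpers, and a small list of dictionaries written to a temporary directory stands in for the real corpus.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Serialize obj to path; the file is closed even if dumping fails."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    """Load a pickled object. Only use this on files you created yourself,
    because unpickling untrusted data can execute arbitrary code."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for the corpus: two short records instead of MIMIC notes.
sample = [
    {"text": "note a", "label": "PNA-POS"},
    {"text": "note b", "label": "PNA-NEG"},
]
path = os.path.join(tempfile.mkdtemp(), "pna-corpus-demo.pkl")
save_pickle(sample, path)
restored = load_pickle(path)
```

For a DataFrame specifically, pandas also provides `DataFrame.to_pickle` and `pd.read_pickle`, which wrap the same steps.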


In [6]:
corpus.head(10)

|   | text |
|---|------|
| 0 | \n\n\n DATE: [**2996-12-2**] 10:25 AM\n ... |
| 1 | \n\n\n DATE: [**2850-2-14**] 10:22 PM\n ... |
| 2 | \n\n\n DATE: [**2631-10-3**] 9:52 AM\n ... |
| 3 | \n\n\n DATE: [**2584-11-21**] 11:17 AM\n ... |
| 4 | \n\n\n DATE: [**2584-11-21**] 11:17 AM\n ... |
| 5 | \n\n\n DATE: [**2584-11-21**] 11:17 AM\n ... |
| 6 | \n\n\n DATE: [**2996-12-3**] 4:26 PM\n ... |
| 7 | \n\n\n DATE: [**2996-12-3**] 4:26 PM\n ... |
| 8 | \n\n\n DATE: [**3327-5-12**] 2:57 AM\n ... |
| 9 | \n\n\n DATE: [**3327-5-12**] 2:57 AM\n ... |
