# Extracting MIMIC Data

## Environment Setup

In [1]:
import time
import pymysql
import getpass
import pickle as pkl
import pandas as pd
from collections import OrderedDict

## Query List

#### Construct a list of queries that we will want to run.

In [10]:
queryList = OrderedDict([('PNA', './queries/PNA-Mimic.sql'), ('CHF','./queries/CHF-Mimic.sql'), ('COPD','./queries/COPD-Mimic.sql')])

In [2]:
queryList = OrderedDict([('PNA', './queries/PNA-local.sql'), ('CHF','./queries/CHF-local.sql'), ('COPD','./queries/COPD-local.sql')])
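The mapping above pairs each label with the path of a SQL file, and the iteration order of the `OrderedDict` decides the order in which the queries run. As a minimal sketch (the helper name `load_queries` is hypothetical and not part of the original notebook), the files could be read and checked up front, so that a mistyped path is reported before any database connection is opened:

```python
from collections import OrderedDict
from pathlib import Path


def load_queries(query_files):
    """Read each SQL file and return an OrderedDict of label -> SQL text.

    Raises FileNotFoundError if any path is missing, listing all of them.
    """
    missing = [str(path) for path in query_files.values()
               if not Path(path).is_file()]
    if missing:
        raise FileNotFoundError("missing query files: " + ", ".join(missing))
    # Insertion order of query_files is preserved in the result.
    return OrderedDict(
        (label, Path(path).read_text()) for label, path in query_files.items()
    )
```

With the notebook's own names this would be called as `load_queries(queryList)`, giving the SQL text for each label in the same order as the mapping.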

## MIMIC Database Connection

#### Make a connection to the MIMIC database and get a cursor for record processing.

In [None]:
conn = pymysql.connect(host="mysql", 
                       port = 3306, user="jovyan", 
                       passwd=getpass.getpass("Enter MySQL passwd for jovyan"), db='mimic2')
cur = conn.cursor()

In [3]:
conn = pymysql.connect(host="localhost", 
                       port = 3306, user="jferraro", 
                       passwd=getpass.getpass("Enter MySQL passwd "), db='mimic')
cur = conn.cursor()

Enter MySQL passwd ········


## Retrieve our Data

#### For each query, we retrieve the data and combine the results into a single pandas DataFrame. We keep the queries in an ordered dictionary so that the Pneumonia, Congestive Heart Failure, and COPD cases stay grouped together, in that order.

In [4]:
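queries = []

for key in queryList:
    # Read the SQL text for this label; the with-block closes the file.
    with open(queryList[key], 'r') as file:
        query = file.read()
    print("execute query: " + key)
    # Time one execution of the query (IPython magic, notebook only).
    %time cur.execute(query)
    queries.append(query)

# Run each query again through pandas and stack the results in query order.
corpus = pd.concat([pd.read_sql_query(q, conn) for q in queries])
print(corpus.head())

conn.close()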

execute query: PNA
Wall time: 11.9 s
execute query: CHF
Wall time: 4.27 s
execute query: COPD
Wall time: 2.36 s
                                                text label
0  [**2996-12-2**] 10:25 AM\n CT CHEST W/O CONTRA...   PNA
1  [**3201-9-21**] 4:50 PM\n CT CHEST W/CONTRAST ...   PNA
2  [**3299-6-23**] 5:06 PM\n CT CHEST W/CONTRAST ...   PNA
3  [**3186-6-14**] 2:54 PM\n CT CHEST W/CONTRAST ...   PNA
4  [**2500-1-17**] 9:41 PM\n CT CHEST W/O CONTRAS...   PNA
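The cell above sends each query to the server twice: once through `cur.execute` for timing and again through `pd.read_sql_query`. The following is one possible sketch of running each query only once while still recording how long it took. It is not the notebook's method: the function name `run_queries_once`, the `notes` table, and the sample rows are all made up for illustration, and an in-memory SQLite database stands in for the MySQL connection so that the example runs without a server.

```python
import sqlite3
import time


def run_queries_once(conn, queries):
    """Run each query once and return (rows, timings).

    `queries` maps a label to SQL text. Any DB-API connection whose
    cursor supports execute() and fetchall() should work here.
    """
    rows, timings = [], {}
    cur = conn.cursor()
    for label, sql in queries.items():
        start = time.perf_counter()
        cur.execute(sql)
        fetched = cur.fetchall()
        # Elapsed wall time covers both execution and fetching.
        timings[label] = time.perf_counter() - start
        rows.extend(fetched)
    cur.close()
    return rows, timings


# Stand-in database (assumption): an in-memory SQLite table with made-up rows.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE notes (text TEXT, label TEXT)")
demo.executemany(
    "INSERT INTO notes VALUES (?, ?)",
    [("CT CHEST W/O CONTRAST", "PNA"), ("CHEST (PA & LAT)", "CHF")],
)
demo_queries = {
    "PNA": "SELECT text, label FROM notes WHERE label = 'PNA'",
    "CHF": "SELECT text, label FROM notes WHERE label = 'CHF'",
}
rows, timings = run_queries_once(demo, demo_queries)
demo.close()
```

Assuming the real queries return the `text` and `label` columns shown in the output above, the collected rows could then be turned into a DataFrame with `pd.DataFrame(rows, columns=['text', 'label'])`.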


## Safestore the Data

#### We serialize the corpus DataFrame to disk so that we do not have to re-run the queries and rebuild it each time we want to use the data.

In [5]:
# The with-block closes the file even if pickling fails.
with open('differential-corpus.pkl', 'wb') as file:
    pkl.dump(corpus, file)
print("Done!")

Done!
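To use the saved data later, the pickle file has to be read back. The sketch below shows the write and read steps as a round trip. A small list of made-up `(text, label)` pairs stands in for the corpus, and the file goes to a temporary directory (both are assumptions for illustration) so that the real `differential-corpus.pkl` is not overwritten.

```python
import os
import pickle
import tempfile

# Stand-in for the corpus (assumption): a small list of (text, label) pairs.
sample_corpus = [("CT CHEST W/O CONTRAST", "PNA"), ("CHEST (PA & LAT)", "CHF")]

# Hypothetical scratch path, kept away from the notebook's real output file.
path = os.path.join(tempfile.mkdtemp(), "sample-corpus.pkl")

# Write: binary mode is required for pickle.
with open(path, "wb") as fh:
    pickle.dump(sample_corpus, fh)

# Read: the restored object compares equal to what was written.
with open(path, "rb") as fh:
    restored = pickle.load(fh)
```

The same `pickle.load` call restores a pickled DataFrame as long as pandas is installed in the loading environment. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.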
