## Use a psycopg2 connection to extract chart events for patient data

This script connects to the patient database 'extumate' and extracts chart events for the labelled patients, identified by the hadm_id field in the table 'sample_vents'.

The script reads the query result in chunks, using the pandas chunksize argument, to limit memory use while loading.

Finally, the data is pickled so it can be stored for future processing.

In [1]:
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import psutil
import os

#### Set user-defined variables

In [23]:
export_name = "pulseox.pkl"
pickle_folder = "../data/pickled/"
export_path = pickle_folder+export_name
export_path

'../data/pickled/pulseox.pkl'

#### Write SQL query

In [None]:
sql_query = """
SELECT 
  chartevents.* 
FROM
  chartevents
  INNER JOIN sample_vents ON chartevents.hadm_id = sample_vents.hadm_id
WHERE
  (
    chartevents.itemid = 220277
  );
"""
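To illustrate what this query returns, here is a minimal sketch that runs the same join and filter against a toy in-memory SQLite database. The table contents and the reduced column set (hadm_id, itemid, valuenum) are made up for illustration; the real tables live in Postgres and have more columns.

```python
import sqlite3

# Toy stand-ins for the real tables (hypothetical rows, reduced columns)
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE chartevents (hadm_id INTEGER, itemid INTEGER, valuenum REAL);
CREATE TABLE sample_vents (hadm_id INTEGER);
INSERT INTO chartevents VALUES
  (1, 220277, 94), (1, 220045, 80), (2, 220277, 99), (3, 220277, 97);
INSERT INTO sample_vents VALUES (1), (3);
""")

# Same shape as the notebook query: join on hadm_id, then filter on itemid
rows = con.execute(
    """
    SELECT chartevents.hadm_id, chartevents.valuenum
    FROM chartevents
    INNER JOIN sample_vents ON chartevents.hadm_id = sample_vents.hadm_id
    WHERE chartevents.itemid = ?
    ORDER BY chartevents.hadm_id
    """,
    (220277,),
).fetchall()
con.close()

print(rows)
```

Only admissions present in both tables survive the join (hadm_id 2 is dropped here), and only rows with the requested itemid are kept. Note that if 'sample_vents' held the same hadm_id more than once, an inner join would repeat the matching chart events.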

#### Print available virtual memory

In [2]:
svmem = psutil.virtual_memory()
print(svmem.available)  # in bytes

3979010048


#### Print size (in bytes) of the raw chartevents CSV we're pulling from

In [3]:
os.path.getsize('../data/raw/chartevents.csv') 

29184776616
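The two byte counts printed above are easier to compare after a unit conversion. The short sketch below hard-codes those two numbers (they are copied from the cell outputs, so they describe that one machine and that one file) and shows why loading the data in one go is not an option.

```python
# Values copied from the two cell outputs above
available_bytes = 3979010048   # psutil.virtual_memory().available
csv_bytes = 29184776616        # os.path.getsize('../data/raw/chartevents.csv')

GIB = 1024 ** 3
ratio = csv_bytes / available_bytes

print(f"available memory: {available_bytes / GIB:.2f} GiB")
print(f"raw CSV size:     {csv_bytes / GIB:.2f} GiB")
print(f"CSV is {ratio:.1f}x larger than available memory")
```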

#### Work out chunk size for pandas dataframe reading

In [6]:
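# Sample 10 rows to estimate the in-memory size of the data
df_sample = pd.read_csv('../data/raw/chartevents.csv', nrows=10)
df_sample_size = df_sample.memory_usage(index=True).sum()  # bytes used by the 10 sampled rows
# Chunk size: a 2 GB budget divided by the sample size, then by a further factor of 10
my_chunk = (2000000000 / df_sample_size) / 10
my_chunk = int(my_chunk)  # keep the integer part
print(my_chunk)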

215517
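The same kind of estimate can be wrapped in a small helper. This is a sketch rather than the notebook's exact formula: `rows_per_chunk` and `budget_bytes` are hypothetical names, the helper divides the budget by the per-row size of the sample (the notebook divides by the size of the whole 10-row sample and then by 10 again, which is more conservative), and it leaves the index out of the measurement so the result depends only on the columns.

```python
import pandas as pd


def rows_per_chunk(sample: pd.DataFrame, budget_bytes: int) -> int:
    """Estimate how many rows shaped like `sample` fit in `budget_bytes`."""
    if sample.empty:
        raise ValueError("sample must contain at least one row")
    # deep=True also counts the contents of object (string) columns
    sample_bytes = sample.memory_usage(index=False, deep=True).sum()
    bytes_per_row = sample_bytes / len(sample)
    # Never return 0, since pandas needs a positive chunksize
    return max(1, int(budget_bytes // bytes_per_row))


# Toy sample: 10 rows of a single int64 column, i.e. 8 bytes per row
toy = pd.DataFrame({"valuenum": pd.Series(range(10), dtype="int64")})
chunk_rows = rows_per_chunk(toy, budget_bytes=800)
print(chunk_rows)
```

With the real data one would pass the 10-row `df_sample` and whatever memory budget is considered safe on the machine.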


In [7]:
# Set your postgres username/password and connection specifics
username = 'postgres'
password = 'password'    # change this
host     = 'localhost'
port     = '5432'        # default port that postgres listens on
db_name  = 'extumate'    # name of the patient database

In [8]:
## 'engine' is a connection to a database
## Here, we're using postgres, but sqlalchemy can connect to other things too.
engine = create_engine( 'postgresql://{}:{}@{}:{}/{}'.format(username, password, host, port, db_name) )
print(engine.url)

postgresql://postgres:password@localhost:5432/extumate
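Printing the URL as above also prints the password. As a hedged alternative, assuming SQLAlchemy 1.4 or later, the URL can be built with `URL.create`, which takes the parts separately (so special characters in the password need no manual escaping) and can be rendered with the password masked. The password below is a placeholder chosen for illustration.

```python
from sqlalchemy.engine import URL

url = URL.create(
    drivername="postgresql+psycopg2",
    username="postgres",
    password="p@ss/word",  # placeholder; contains characters that would break a hand-built string
    host="localhost",
    port=5432,
    database="extumate",
)

# Render for logging without exposing the password
safe_url = url.render_as_string(hide_password=True)
print(safe_url)
```

The resulting `url` object can be passed to `create_engine` in place of the formatted string.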


#### Check engine is working by checking for 'sample_vents' table

In [9]:
engine.has_table('sample_vents')

True
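`Engine.has_table` was deprecated in SQLAlchemy 1.4 and is no longer available in 2.0, so on newer versions the call above fails. A minimal sketch of the inspector-based equivalent follows; it uses an in-memory SQLite engine (named `demo_engine` here, an assumption for illustration) in place of the Postgres engine so that it runs without a database server.

```python
from sqlalchemy import create_engine, inspect, text

# In-memory SQLite stands in for the Postgres engine so the sketch runs anywhere
demo_engine = create_engine("sqlite://")
with demo_engine.begin() as conn:
    conn.execute(text("CREATE TABLE sample_vents (hadm_id INTEGER)"))

# Inspector.has_table is the replacement for the removed Engine.has_table
has_vents = inspect(demo_engine).has_table("sample_vents")
has_other = inspect(demo_engine).has_table("no_such_table")
print(has_vents, has_other)
```

Against the notebook's engine the equivalent check would be `inspect(engine).has_table('sample_vents')`.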

#### Connect using a psycopg2 connection and query the database

Joining 'chartevents' with the 'sample_vents' table on the field hadm_id restricts the query to patients who were ventilated. Doing this join before selecting the type of event with chartevents.itemid speeds up extraction of the data.

In [10]:
# Connect to make queries using psycopg2
con = psycopg2.connect(database=db_name, user=username, host=host, port=port, password=password)

df_result = pd.read_sql_query(sql_query,con,chunksize=my_chunk)
df_result

<generator object SQLiteDatabase._query_iterator at 0x7fd7d58b6510>

In [11]:
concat_df = pd.concat(
    [chunk
    for chunk in df_result])
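The chunked read and concatenation can be tried end to end on a toy table. The sketch below swaps Postgres for an in-memory SQLite database (so the `?` placeholder replaces psycopg2's `%s` style), uses made-up rows, and picks a chunk size of 2 purely to make the chunk boundaries visible.

```python
import sqlite3

import pandas as pd

# Toy in-memory table standing in for the Postgres query result
demo_con = sqlite3.connect(":memory:")
demo_con.executescript("""
CREATE TABLE chartevents (hadm_id INTEGER, itemid INTEGER, valuenum REAL);
INSERT INTO chartevents VALUES
  (1, 220277, 94), (1, 220277, 99), (2, 220277, 97),
  (2, 220045, 80), (3, 220277, 98), (3, 220277, 100);
""")

query = "SELECT * FROM chartevents WHERE itemid = ?"

# With chunksize set, read_sql_query returns an iterator of DataFrames
chunks = pd.read_sql_query(query, demo_con, params=(220277,), chunksize=2)

chunk_lengths = []
parts = []
for chunk in chunks:
    chunk_lengths.append(len(chunk))
    parts.append(chunk)

# ignore_index=True gives the combined frame a single 0..n-1 index
demo_df = pd.concat(parts, ignore_index=True)
demo_con.close()

print(chunk_lengths, len(demo_df))
```

Two caveats apply to the real query. The concatenated frame still has to fit in memory in full, so chunking only limits the size of each intermediate piece. And a default psycopg2 cursor is client-side, meaning the whole result set is transferred to the client when the query executes; a server-side (named) cursor would be needed to stream rows from Postgres itself.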

In [15]:
concat_df

Unnamed: 0,subject_id,hadm_id,stay_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
0,10004235,24181354,30276431,2196-02-24 18:00:00,2196-02-24 18:14:00,220277,94,94,%,0
1,10004235,24181354,30276431,2196-02-24 19:00:00,2196-02-24 19:43:00,220277,99,99,%,0
2,10004235,24181354,30276431,2196-02-24 20:00:00,2196-02-24 19:57:00,220277,97,97,%,0
3,10004235,24181354,30276431,2196-02-24 21:00:00,2196-02-24 21:04:00,220277,98,98,%,0
4,10004235,24181354,30276431,2196-02-24 22:00:00,2196-02-24 22:29:00,220277,100,100,%,0
...,...,...,...,...,...,...,...,...,...,...
204175,19999068,21606769,31096823,2161-08-28 08:20:00,2161-08-28 08:20:00,220277,100,100,%,0
204176,19999068,21606769,31096823,2161-08-28 09:00:00,2161-08-28 09:02:00,220277,99,99,%,0
204177,19999068,21606769,31096823,2161-08-28 10:00:00,2161-08-28 10:19:00,220277,100,100,%,0
204178,19999068,21606769,31096823,2161-08-28 11:00:00,2161-08-28 12:36:00,220277,100,100,%,0


In [13]:
# concat_df.to_sql('pulseox', engine, if_exists='replace', chunksize=my_chunk)  # very, very slow!

#### Pickle dataframe for future processing

In [22]:
concat_df.to_pickle(export_path)
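For the later processing step, the pickle can be loaded back with `pd.read_pickle`. The round trip is sketched below on a small made-up frame written to a temporary directory, so it does not touch '../data/pickled/'.

```python
import os
import tempfile

import pandas as pd

# Small made-up frame standing in for concat_df
toy_df = pd.DataFrame({"hadm_id": [24181354, 21606769], "valuenum": [94.0, 100.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "pulseox.pkl")
    toy_df.to_pickle(toy_path)           # write, as in the cell above
    restored = pd.read_pickle(toy_path)  # read back for future processing

round_trip_ok = restored.equals(toy_df)
print(round_trip_ok)
```

Pickle files can execute code when loaded, so only read pickles that come from a trusted source.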