## Raw Datasets From Excel to Parquet

The goal of this notebook is to create a Parquet file for each dataset to improve I/O speed and storage efficiency.

#### Importing Key Libraries

In [1]:
import numpy as np

import pandas as pd
from pandas import Series, DataFrame

import pyarrow as pa
import pyarrow.parquet as pq

#### Reading Raw Data
We are using version 4 of the raw dataset provided by the Fairfax Fire and Rescue Department on February 25, 2021.

In [2]:
# Read the workbook once; a list in sheet_name returns a dict of DataFrames keyed by sheet name.
sheets = pd.read_excel(io='../data/01_raw/20210225-ems-raw-v04.xlsx',
                       sheet_name=['Patients', 'Procedures', 'Medications'])
df_pat = sheets['Patients']
df_pro = sheets['Procedures']
df_med = sheets['Medications']

#### Adjusting Value Type for Each Series in the DataFrame
This section applies the following typing rules:
* all ID attributes are stored as strings
* all date attributes are converted to datetime64
* any other attribute is treated as categorical

Using categorical types wherever appropriate helps improve performance during analysis.
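
To illustrate why categorical types help, here is a minimal sketch that compares the memory footprint of the same column stored as `object` and as `category`. The data is synthetic (100,000 rows with three made-up shift labels), not the EMS dataset, so the exact numbers will differ from what you see on the real files.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption, not the EMS dataset): 100,000 rows, 3 distinct labels.
rng = np.random.default_rng(seed=0)
labels = rng.choice(['A - Shift', 'B - Shift', 'C - Shift'], size=100_000)
shift_obj = pd.Series(labels, dtype='object')
shift_cat = shift_obj.astype('category')

# deep=True counts the Python string objects, not just the pointers to them.
mem_obj = shift_obj.memory_usage(deep=True)
mem_cat = shift_cat.memory_usage(deep=True)

print(f'object:   {mem_obj:,} bytes')
print(f'category: {mem_cat:,} bytes')
```

A categorical column stores each distinct label once plus a small integer code per row, which is why the gain is largest for columns with few distinct values.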

In [3]:
df_pat = df_pat.astype(dtype = {'Shift': 'category',
                                'FireStation': 'category',
                                'Battalion': 'category',
                                'PatientOutcome': 'category',
                                'PatientGender': 'category',
                                'FRDPersonnelGender': 'category',
                                'UnitId': 'category',              
                                'CrewMemberRoles': 'category',     
                                'PatientId': 'string',
                                'FRDPersonnelID': 'string',
                                'DispatchTime': 'datetime64[ns]',
                                'FRDPersonnelStartDate': 'datetime64[ns]'})

df_pro = df_pro.astype(dtype = {'Procedure_Performed_Code': 'category',
                                'Procedure_Performed_Description': 'category',
                                'PatientId': 'string',
                                'FRDPersonnelID': 'string',
                                'Dim_Procedure_PK': 'string',
                                'Procedure_Performed_Date_Time': 'datetime64[ns]'})

df_med = df_med.astype(dtype = {'Medication_Given_RXCUI_Code': 'category',
                                'Medication_Given_Description': 'category',
                                'PatientId': 'string',
                                'FRDPersonnelID': 'string',
                                'Dim_Medication_PK': 'string',
                                'Medication_Administered_Date_Time': 'datetime64[ns]'})
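
If a date column ever contained values that cannot be parsed as dates, `astype` would raise an error. As a hedged alternative, the sketch below uses `pd.to_datetime` with `errors='coerce'` on a toy column. The values are hypothetical, and I am not claiming the EMS data has this problem.

```python
import pandas as pd

# Toy column (hypothetical values) with one entry that is not a valid date.
raw = pd.Series(['2021-02-25 08:15:00', 'not recorded', '2021-02-26 17:40:00'])

# errors='coerce' turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, errors='coerce')

print(parsed.dtype)
print(parsed.isna().sum(), 'value(s) could not be parsed')
```

Counting the `NaT` values afterwards tells you how many entries were silently dropped, which is worth checking before trusting the converted column.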

#### Writing Parquet Files into Local Directory

In [4]:
# Step 1: Convert DataFrame into PyArrow Table
pat_tab_w = pa.Table.from_pandas(df_pat)
pro_tab_w = pa.Table.from_pandas(df_pro)
med_tab_w = pa.Table.from_pandas(df_med)

# Step 2: Write the Table into parquet format and save it in the raw data folder.
pq.write_table(pat_tab_w, '../data/01_raw/20210225-ems-v04-patients-raw.parquet')
pq.write_table(pro_tab_w, '../data/01_raw/20210225-ems-v04-procedures-raw.parquet')
pq.write_table(med_tab_w, '../data/01_raw/20210225-ems-v04-medications-raw.parquet')

#### Creating New DataFrames from Parquet
The following script can be reused in any of your notebooks to load the DataFrames from Parquet instead of the Excel file. If, like me, you create a new notebook for each concern, you will enjoy faster loading times than with Excel files.

For illustration purposes, I have included the libraries you need to import for this script to work as intended on any notebook.

In [5]:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

pat_df_r = pq.read_table('../data/01_raw/20210225-ems-v04-patients-raw.parquet')\
             .to_pandas(categories=['FireStation','Battalion'])
pro_df_r = pq.read_table('../data/01_raw/20210225-ems-v04-procedures-raw.parquet')\
             .to_pandas(categories=['Procedure_Performed_Code'])
med_df_r = pq.read_table('../data/01_raw/20210225-ems-v04-medications-raw.parquet')\
             .to_pandas(categories=['Medication_Given_RXCUI_Code'])

#### Excel to Parquet DataFrames Validation
As a validation mechanism, I run a test for each dataset to confirm that both DataFrames have the same shape. This should give you confidence that no records were lost when converting the files to Parquet.

Keep in mind that we also adjusted the data types to ensure better performance in pandas and consistency across notebooks.

In [6]:
assert df_pat.shape == pat_df_r.shape, 'FAIL: DataFrame shapes are not the same for Patients dataset.'
print('PASS: DataFrame integrity maintained for the Patients dataset.')

PASS: DataFrame integrity maintained for the Patients dataset.


In [7]:
assert df_pro.shape == pro_df_r.shape, 'FAIL: DataFrame shapes are not the same for Procedures dataset.'
print('PASS: DataFrame integrity maintained for the Procedures dataset.')

PASS: DataFrame integrity maintained for the Procedures dataset.


In [8]:
assert df_med.shape == med_df_r.shape, 'FAIL: DataFrame shapes are not the same for Medications dataset.'
print('PASS: DataFrame integrity maintained for the Medications dataset.')

PASS: DataFrame integrity maintained for the Medications dataset.


#### Uploading New Dataset Objects to S3
Since this is an infrequent activity, I import the local library in this section rather than at the top of the notebook. You will likely not get the same results, either because the objects already exist in S3 or because you do not have access to the S3 bucket when you run this part of the script.

In [9]:
from src.d01_data.ingest import ProjectIngest

ProjectIngest('01_raw','20210225-ems-v04-patients-raw.parquet').remote_upload()
ProjectIngest('01_raw','20210225-ems-v04-procedures-raw.parquet').remote_upload()
ProjectIngest('01_raw','20210225-ems-v04-medications-raw.parquet').remote_upload()

The 20210225-ems-v04-patients-raw.parquet file already exist in the S3 bucket under the key 01_raw/20210225-ems-v04-patients-raw.parquet
The 20210225-ems-v04-procedures-raw.parquet file already exist in the S3 bucket under the key 01_raw/20210225-ems-v04-procedures-raw.parquet
The 20210225-ems-v04-medications-raw.parquet file already exist in the S3 bucket under the key 01_raw/20210225-ems-v04-medications-raw.parquet
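
`ProjectIngest` is a local library whose code is not shown in this notebook. Judging only from the output above, it skips the upload when the key already exists. The sketch below shows one way such a check could look. The function name `upload_if_missing` and its parameters are hypothetical, and `client` is assumed to behave like `boto3.client('s3')`; this is not the actual implementation of `remote_upload`.

```python
def upload_if_missing(client, bucket, local_path, key):
    """Upload local_path to bucket/key unless the key already exists.

    Hypothetical helper: `client` is assumed to behave like boto3.client('s3').
    Returns True if the file was uploaded, False if the upload was skipped.
    """
    # list_objects_v2 filters by prefix, so confirm an exact key match.
    response = client.list_objects_v2(Bucket=bucket, Prefix=key)
    exists = any(obj['Key'] == key for obj in response.get('Contents', []))
    if exists:
        print(f'{key} already exists in bucket {bucket}; skipping upload.')
        return False
    client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    return True
```

With boto3 installed and credentials configured, you would create the client with `boto3.client('s3')` and pass your own bucket name along with a key such as `01_raw/20210225-ems-v04-patients-raw.parquet`.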
