# Merge and Prepare Intermediate Datasets -- Patients

This notebook performs the data engineering that creates the intermediate "Patients" dataset. It reads the raw data provided by our project partner, drops records where FRDPersonnelStartDate is NULL, removes remaining duplicates, adds a tenure-in-months attribute, converts categorical data to numeric codes, adds new numeric columns using the "One Hot Encoding" method, and then writes the dataframe to a CSV file. I will document and comment the notebook later. Enjoy!

In [1]:
# Function to identify and print easy-to-understand variable types
def get_var_category(series):
    unique_count = series.nunique(dropna=False)
    total_count = len(series)
    if pd.api.types.is_numeric_dtype(series):
        return 'Numerical'
    elif pd.api.types.is_datetime64_dtype(series):
        return 'Date'
    elif unique_count == total_count:
        return 'Text (Unique)'
    else:
        return 'Categorical'

def print_categories(df):
    for column_name in df.columns:
        print(column_name, ": ", get_var_category(df[column_name]))
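
One caveat with the date check above: `pd.api.types.is_datetime64_dtype` matches only timezone-naive datetime columns. The sketch below uses made-up timestamps to show that a timezone-aware column would not be labelled 'Date' by that check, while `is_datetime64_any_dtype` covers both cases. The date columns in this dataset appear to be timezone-naive, so the helper behaves as intended here.

```python
import pandas as pd

# Illustrative timestamps only, not taken from the raw file
naive = pd.Series(pd.to_datetime(["2018-01-01 00:29:02", "2018-01-01 00:40:54"]))
aware = naive.dt.tz_localize("UTC")

naive_is_date = pd.api.types.is_datetime64_dtype(naive)
aware_is_date = pd.api.types.is_datetime64_dtype(aware)
aware_is_any_date = pd.api.types.is_datetime64_any_dtype(aware)

print(naive_is_date, aware_is_date, aware_is_any_date)
```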

In [2]:
import numpy as np
import pandas as pd

# Setup HTML display
from IPython.display import display, HTML
# Notebook cell width adjustment
display(HTML('<style>.container { width:80% !important; }</style>'))

In [3]:
dfPatients = pd.read_excel(r'./data/20210225-ems-raw-v04.xlsx', 
                        sheet_name='Patients', 
                        na_values=['NA'])
dfPatients.shape

(543774, 12)

In [4]:
print_categories(dfPatients)

PatientId :  Numerical
FRDPersonnelID :  Categorical
Shift :  Categorical
UnitId :  Categorical
FireStation :  Numerical
Battalion :  Numerical
PatientOutcome :  Categorical
PatientGender :  Categorical
CrewMemberRoles :  Categorical
DispatchTime :  Date
FRDPersonnelGender :  Categorical
FRDPersonnelStartDate :  Date


In [5]:
# Drop records with no personnel start date
dfPatients = dfPatients.dropna(subset=['FRDPersonnelStartDate'])
dfPatients.shape

(543300, 12)

In [6]:
# Drop duplicate records
dfPatients_dedup = dfPatients.drop_duplicates()
dfPatients_dedup.shape
# diff 543300 - 543298 = 2

(543298, 12)
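
To check how many rows the de-duplication step will remove before dropping them, the result of `DataFrame.duplicated` can be summed. A minimal sketch on a made-up frame (the values are illustrative, not taken from the raw file):

```python
import pandas as pd

toy = pd.DataFrame({
    "PatientId": [1, 1, 2, 2],
    "FRDPersonnelID": ["a", "a", "b", "c"],
})

# duplicated() flags every repeat of an earlier, fully identical row
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print(n_duplicates, len(toy) - len(deduped))
```

The two printed numbers should agree, which gives a quick sanity check on the shape difference.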

In [7]:
# Add TenureMonths, which is the count of months working at the time of the dispatch call
dfPatients_dedup['TenureMonths'] = ((dfPatients_dedup.loc[:, 'DispatchTime'].dt.date - \
                                     dfPatients_dedup.loc[:, 'FRDPersonnelStartDate'].dt.date) / \
                                    np.timedelta64(1, 'M')).astype(int)
dfPatients_dedup.shape
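# The message below is a SettingWithCopyWarning, not an error; the column is still added.
# Using .loc on the right-hand side does not help, because the warning concerns the frame being assigned to.
# Calling .copy() on the de-duplicated frame before adding columns should avoid it.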

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


(543298, 13)
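
Dividing by `np.timedelta64(1, 'M')` treats a month as an average-length interval, and recent pandas releases may reject the `'M'` unit altogether. One alternative, sketched here as an assumption rather than as this notebook's method, counts completed calendar months from the year, month, and day components. The helper name `tenure_months` is hypothetical; the sample dates are the five rows displayed by `head()` in this notebook.

```python
import pandas as pd

def tenure_months(dispatch: pd.Series, start: pd.Series) -> pd.Series:
    """Completed calendar months between start and dispatch (hypothetical helper)."""
    months = (dispatch.dt.year - start.dt.year) * 12 + (dispatch.dt.month - start.dt.month)
    # A month only counts once the start day-of-month has been reached
    not_yet_reached = (dispatch.dt.day < start.dt.day).astype(int)
    return months - not_yet_reached

sample = pd.DataFrame({
    "DispatchTime": pd.to_datetime([
        "2018-01-01 00:29:02", "2018-01-01 00:29:02", "2018-01-01 00:29:02",
        "2018-01-01 00:40:54", "2018-01-01 00:40:54",
    ]),
    "FRDPersonnelStartDate": pd.to_datetime([
        "1997-08-18", "2007-02-19", "2008-08-18", "2011-10-24", "2012-02-27",
    ]),
})

sample["TenureMonths"] = tenure_months(sample["DispatchTime"], sample["FRDPersonnelStartDate"])
print(sample["TenureMonths"].tolist())
```

For these five rows the result matches the `TenureMonths` values shown in this notebook (244, 130, 112, 74, 70). The two approaches can differ by a month near month boundaries, so the choice should be made deliberately.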

In [8]:
dfPatients_dedup.head()

Unnamed: 0,PatientId,FRDPersonnelID,Shift,UnitId,FireStation,Battalion,PatientOutcome,PatientGender,CrewMemberRoles,DispatchTime,FRDPersonnelGender,FRDPersonnelStartDate,TenureMonths
0,479838,6ED7C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,"Primary Patient Caregiver-At Scene,Primary Pat...",2018-01-01 00:29:02,Male,1997-08-18,244
1,479838,F39A55F0-C619-E511-80CA-001DD8B71D38,A - Shift,M404,4,401,Standby (No Services Performed),,"Driver-Transport,Other Patient Caregiver-At Scene",2018-01-01 00:29:02,Male,2007-02-19,130
2,479838,D6D3C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,"Intern/Student,Other Patient Caregiver-At Scen...",2018-01-01 00:29:02,Male,2008-08-18,112
3,479839,71D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,"Primary Patient Caregiver-At Scene,Primary Pat...",2018-01-01 00:40:54,Male,2011-10-24,74
4,479839,F9D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,"Driver-Response,Driver-Transport",2018-01-01 00:40:54,Male,2012-02-27,70


In [9]:
dfPatients_dedup['Shift'].value_counts()

C - Shift    183309
B - Shift    181976
A - Shift    178013
Name: Shift, dtype: int64

In [10]:
# First, add a factorized column whose codes start at 1 rather than the usual Python 0
dfPatients_dedup['ShiftCode'] = pd.factorize(dfPatients_dedup['Shift'])[0] + 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [11]:
dfPatients_dedup.head()

Unnamed: 0,PatientId,FRDPersonnelID,Shift,UnitId,FireStation,Battalion,PatientOutcome,PatientGender,CrewMemberRoles,DispatchTime,FRDPersonnelGender,FRDPersonnelStartDate,TenureMonths,ShiftCode
0,479838,6ED7C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,"Primary Patient Caregiver-At Scene,Primary Pat...",2018-01-01 00:29:02,Male,1997-08-18,244,1
1,479838,F39A55F0-C619-E511-80CA-001DD8B71D38,A - Shift,M404,4,401,Standby (No Services Performed),,"Driver-Transport,Other Patient Caregiver-At Scene",2018-01-01 00:29:02,Male,2007-02-19,130,1
2,479838,D6D3C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,"Intern/Student,Other Patient Caregiver-At Scen...",2018-01-01 00:29:02,Male,2008-08-18,112,1
3,479839,71D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,"Primary Patient Caregiver-At Scene,Primary Pat...",2018-01-01 00:40:54,Male,2011-10-24,74,1
4,479839,F9D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,"Driver-Response,Driver-Transport",2018-01-01 00:40:54,Male,2012-02-27,70,1


In [12]:
dfPatients_dedup['ShiftCode'].value_counts()

3    183309
2    181976
1    178013
Name: ShiftCode, dtype: int64
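
`pd.factorize` assigns codes in order of first appearance, so the A/B/C to 1/2/3 mapping seen here depends on the row order of the file. If stable codes matter downstream, an explicit mapping is one option. A sketch on toy data, where the dictionary name `SHIFT_CODES` is an assumption:

```python
import pandas as pd

# Hypothetical fixed mapping from shift label to code
SHIFT_CODES = {"A - Shift": 1, "B - Shift": 2, "C - Shift": 3}

shifts = pd.Series(["C - Shift", "A - Shift", "B - Shift", "C - Shift"])

# factorize numbers labels by first appearance, so C gets code 1 here
by_appearance = pd.factorize(shifts)[0] + 1
# map() always applies the same label-to-code pairs regardless of row order
by_mapping = shifts.map(SHIFT_CODES)

print(by_appearance.tolist(), by_mapping.tolist())
```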

In [13]:
# For fun and experimentation, use the "One Hot Encoding" method via get_dummies to create a 1/0 column for each of Shifts A, B, and C
dingdongs = pd.get_dummies(dfPatients_dedup['Shift'])

In [14]:
dingdongs.head()

Unnamed: 0,A - Shift,B - Shift,C - Shift
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0


In [15]:
dingdongs.value_counts()

A - Shift  B - Shift  C - Shift
0          0          1            183309
           1          0            181976
1          0          0            178013
dtype: int64

In [16]:
dingdongs.rename(columns={'A - Shift': 'Shift_A', 'B - Shift': 'Shift_B', 'C - Shift': 'Shift_C'}, inplace=True)

In [17]:
dingdongs.value_counts()

Shift_A  Shift_B  Shift_C
0        0        1          183309
         1        0          181976
1        0        0          178013
dtype: int64

In [18]:
dfPatients_dedup = pd.concat([dfPatients_dedup, dingdongs], axis=1)
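
The `get_dummies`, rename, and `concat` steps can also be written more compactly. The sketch below assumes every shift label ends in ` - Shift`; `dtype=int` requests 0/1 columns explicitly, since newer pandas versions return booleans by default.

```python
import pandas as pd

toy = pd.DataFrame({"Shift": ["A - Shift", "B - Shift", "C - Shift", "A - Shift"]})

# Strip the common suffix so the dummy columns come out as Shift_A, Shift_B, Shift_C
letters = toy["Shift"].str.replace(" - Shift", "", regex=False)
dummies = pd.get_dummies(letters, prefix="Shift", dtype=int)

toy = pd.concat([toy, dummies], axis=1)
print(toy)
```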

In [19]:
dfPatients_dedup.head()

Unnamed: 0,PatientId,FRDPersonnelID,Shift,UnitId,FireStation,Battalion,PatientOutcome,PatientGender,CrewMemberRoles,DispatchTime,FRDPersonnelGender,FRDPersonnelStartDate,TenureMonths,ShiftCode,Shift_A,Shift_B,Shift_C
0,479838,6ED7C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,"Primary Patient Caregiver-At Scene,Primary Pat...",2018-01-01 00:29:02,Male,1997-08-18,244,1,1,0,0
1,479838,F39A55F0-C619-E511-80CA-001DD8B71D38,A - Shift,M404,4,401,Standby (No Services Performed),,"Driver-Transport,Other Patient Caregiver-At Scene",2018-01-01 00:29:02,Male,2007-02-19,130,1,1,0,0
2,479838,D6D3C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,"Intern/Student,Other Patient Caregiver-At Scen...",2018-01-01 00:29:02,Male,2008-08-18,112,1,1,0,0
3,479839,71D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,"Primary Patient Caregiver-At Scene,Primary Pat...",2018-01-01 00:40:54,Male,2011-10-24,74,1,1,0,0
4,479839,F9D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,"Driver-Response,Driver-Transport",2018-01-01 00:40:54,Male,2012-02-27,70,1,1,0,0


In [20]:
dfPatients_dedup['UnitId'].value_counts()

M422     18653
M430     18102
M405     15942
M404     15897
M425     15383
         ...  
BC405        2
BC402        2
LA436        2
LA437        2
K441         1
Name: UnitId, Length: 167, dtype: int64

In [21]:
dfPatients_dedup['UnitIdCode'] = pd.factorize(dfPatients_dedup['UnitId'])[0] + 1

In [22]:
dfPatients_dedup['UnitIdCode'].value_counts()

26     18653
22     18102
24     15942
1      15897
37     15383
       ...  
148        2
145        2
162        2
143        2
146        1
Name: UnitIdCode, Length: 167, dtype: int64

In [23]:
dfPatients_dedup['PatientOutcome'].value_counts()

Treated & Transported                        356209
Patient Refusal  (AMA)                        46608
No Treatment/Transport Required               44876
Canceled (Prior to Arrival)                   38680
EMS Assist (Other Agency)                     13697
Canceled (On Scene, No Patient Contact)       12195
No Patient Found                              10500
Standby (No Services Performed)                8660
Patient Dead at Scene (No EMS CPR)             4860
Standby (Operational Support Provided)         3238
Treated, Transferred Care                      2471
Patient Dead at Scene (EMS CPR Attempted)      1302
Name: PatientOutcome, dtype: int64

In [24]:
dfPatients_dedup['PatientOutcomeCode'] = pd.factorize(dfPatients_dedup['PatientOutcome'])[0] + 1

In [25]:
dfPatients_dedup['PatientOutcomeCode'].value_counts()

4     356209
3      46608
2      44876
7      38680
11     13697
5      12195
6      10500
1       8660
9       4860
8       3238
10      2471
12      1302
0          2
Name: PatientOutcomeCode, dtype: int64
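
The code `0` with a count of 2 is not an extra outcome category. `pd.factorize` marks missing values with `-1`, which the `+ 1` offset turns into `0`, so two rows have no `PatientOutcome`. This agrees with the twelve outcome counts above summing to 543296 of the 543298 rows. A sketch on toy data:

```python
import numpy as np
import pandas as pd

outcomes = pd.Series(["Treated & Transported", np.nan, "No Patient Found"])

codes, uniques = pd.factorize(outcomes)
shifted = codes + 1  # the missing value's -1 becomes 0

print(codes.tolist(), shifted.tolist(), list(uniques))
```

Downstream code that treats the codes as categories may want to handle `0` as "missing" explicitly.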

In [26]:
dfPatients_dedup['PatientGender'].value_counts()

Female                           210378
Male                             201452
Unknown (Unable to Determine)       178
Name: PatientGender, dtype: int64

In [27]:
dfPatients_dedup['FRDPersonnelGender'].value_counts()

Male      455727
Female     87571
Name: FRDPersonnelGender, dtype: int64

In [28]:
dfPatients_dedup['ProviderGenderCode'] = pd.factorize(dfPatients_dedup['FRDPersonnelGender'])[0] + 1

In [29]:
dfPatients_dedup['ProviderGenderCode'].value_counts()

1    455727
2     87571
Name: ProviderGenderCode, dtype: int64

In [30]:
dfPatients_dedup.head()

Unnamed: 0,PatientId,FRDPersonnelID,Shift,UnitId,FireStation,Battalion,PatientOutcome,PatientGender,CrewMemberRoles,DispatchTime,FRDPersonnelGender,FRDPersonnelStartDate,TenureMonths,ShiftCode,Shift_A,Shift_B,Shift_C,UnitIdCode,PatientOutcomeCode,ProviderGenderCode
0,479838,6ED7C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,"Primary Patient Caregiver-At Scene,Primary Pat...",2018-01-01 00:29:02,Male,1997-08-18,244,1,1,0,0,1,1,1
1,479838,F39A55F0-C619-E511-80CA-001DD8B71D38,A - Shift,M404,4,401,Standby (No Services Performed),,"Driver-Transport,Other Patient Caregiver-At Scene",2018-01-01 00:29:02,Male,2007-02-19,130,1,1,0,0,1,1,1
2,479838,D6D3C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,"Intern/Student,Other Patient Caregiver-At Scen...",2018-01-01 00:29:02,Male,2008-08-18,112,1,1,0,0,1,1,1
3,479839,71D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,"Primary Patient Caregiver-At Scene,Primary Pat...",2018-01-01 00:40:54,Male,2011-10-24,74,1,1,0,0,2,2,1
4,479839,F9D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,"Driver-Response,Driver-Transport",2018-01-01 00:40:54,Male,2012-02-27,70,1,1,0,0,2,2,1


In [31]:
print_categories(dfPatients_dedup)

PatientId :  Numerical
FRDPersonnelID :  Categorical
Shift :  Categorical
UnitId :  Categorical
FireStation :  Numerical
Battalion :  Numerical
PatientOutcome :  Categorical
PatientGender :  Categorical
CrewMemberRoles :  Categorical
DispatchTime :  Date
FRDPersonnelGender :  Categorical
FRDPersonnelStartDate :  Date
TenureMonths :  Numerical
ShiftCode :  Numerical
Shift_A :  Numerical
Shift_B :  Numerical
Shift_C :  Numerical
UnitIdCode :  Numerical
PatientOutcomeCode :  Numerical
ProviderGenderCode :  Numerical


In [32]:
dfPatients_dedup.to_csv(r'./data/dfPatients_dedup.csv', index=True)
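
Because the file is written with `index=True`, the row index becomes an unnamed first column, and the date columns are stored as text. A sketch of reading such a file back, using an in-memory buffer and a two-row toy frame in place of the real path and data:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "PatientId": [479838, 479839],
    "DispatchTime": pd.to_datetime(["2018-01-01 00:29:02", "2018-01-01 00:40:54"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=True)
buffer.seek(0)

# index_col=0 restores the saved index; parse_dates restores the datetime dtype
restored = pd.read_csv(buffer, index_col=0, parse_dates=["DispatchTime"])
print(restored.dtypes)
```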