# Merge and Prepare Intermediate Datasets -- Patients

This notebook performs the data engineering that creates the intermediate "Patients" dataset. The script:
* reads in the raw data provided by our project partner, 
* drops records where `FRDPersonnelStartDate` is NULL, 
* removes remaining duplicates, merging the `CrewMemberRoles` of repeated patient/provider pairs, 
* adds a tenure-in-months attribute, 
* converts categorical data into numeric codes, 
* adds new numeric columns using the "One Hot Encoding" method, then 
* outputs the dataframe to a CSV file. 

_Author: Mark Lee_<br>
_Edited: Michael McRae_


In [1]:
# Function to identify and print easy-to-understand variable types
def get_var_category(series):
    unique_count = series.nunique(dropna=False)
    total_count = len(series)
    if pd.api.types.is_numeric_dtype(series):
        return 'Numerical'
    elif pd.api.types.is_datetime64_dtype(series):
        return 'Date'
    elif unique_count == total_count:
        return 'Text (Unique)'
    else:
        return 'Categorical'

def print_categories(df):
    for column_name in df.columns:
        print(column_name, ": ", get_var_category(df[column_name]))

In [2]:
# Import libraries
import numpy as np
import pandas as pd

# Setup HTML display
from IPython.display import display, HTML
# Notebook cell width adjustment
display(HTML('<style>.container { width:80% !important; }</style>'))

### Load data and calculate read-in time

In [3]:
# McRae addition to capture the time it takes to read the file.
# Also edited the location of the data file to be consistent with the Git file structure (data subdirectories).
import os, time
print(os.getcwd())
readStart=time.time()

# Edit the file name to make sure you are pointing to the right file.
dfPatients = pd.read_excel(r'../data/01_raw/20210225-ems-raw-v04.xlsx', 
                        sheet_name='Patients', 
                        na_values=['NA'])

# Stop the clock and calculate read time
readStop=time.time()
readTime=readStop-readStart
readMin=np.floor(readTime/60)
readSec=np.floor(readTime-(readMin*60))
print("The file was read in {0:.0f} minutes and {1:.0f} seconds.".format(readMin,readSec))

dfPatients.shape

The file was read in 1 minutes and 59 seconds.


(543774, 12)
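
The minutes/seconds arithmetic above can also be written with `divmod`. The following is a minimal sketch, not the notebook's own code; the helper name `format_elapsed` is hypothetical, and `time.perf_counter()` is used here as an assumption because it is intended for measuring elapsed intervals.

```python
import time

def format_elapsed(seconds):
    """Return elapsed time as 'M minutes and S seconds' (whole seconds, rounded down)."""
    # divmod splits the whole-second count into (minutes, remaining seconds)
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes} minutes and {secs} seconds"

start = time.perf_counter()
# ... the slow call, such as the pd.read_excel(...) step, would go here ...
elapsed = time.perf_counter() - start
print("The file was read in " + format_elapsed(elapsed) + ".")
```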

### Display data types

In [4]:
print_categories(dfPatients)

PatientId :  Numerical
FRDPersonnelID :  Categorical
Shift :  Categorical
UnitId :  Categorical
FireStation :  Numerical
Battalion :  Numerical
PatientOutcome :  Categorical
PatientGender :  Categorical
CrewMemberRoles :  Categorical
DispatchTime :  Date
FRDPersonnelGender :  Categorical
FRDPersonnelStartDate :  Date


### Remove rows where Provider Start Date is NULL. This also removes rows where Provider Gender is NULL.

In [5]:
dfPatients = dfPatients.drop(dfPatients[(dfPatients.FRDPersonnelStartDate.isnull())].index)
dfPatients.shape

(543300, 12)
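
As a side note, the same filter can be expressed with `dropna(subset=...)`. This is a small sketch on made-up rows (the `toy` dataframe is an assumption for illustration, not project data); it also checks that the result matches the `drop(...index)` form used above.

```python
import pandas as pd

# Toy data: the second row has a missing start date
toy = pd.DataFrame({
    "PatientId": [1, 2, 3],
    "FRDPersonnelStartDate": pd.to_datetime(["2011-10-24", None, "2012-02-27"]),
})

# Drop rows whose start date is missing
kept = toy.dropna(subset=["FRDPersonnelStartDate"])

# Equivalent to the drop-by-index form used in the notebook
same = toy.drop(toy[toy.FRDPersonnelStartDate.isnull()].index)
print(kept.shape, kept.equals(same))
```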

### Remove duplicate rows

In [6]:
dfPatients_dedup = dfPatients.drop_duplicates()
dfPatients_dedup.shape
# diff 543300 - 543298 = 2 duplicate rows removed

(543298, 12)

## Remove Duplicates and Merge CrewMemberRoles
### This next step resolves extraneous records in which the same PatientId/FRDPersonnelID pair appears more than once with different CrewMemberRoles.

In [7]:
# Create a separate dataframe with only PatientId, FRDPersonnelID, and CrewMemberRoles for simplicity.
# .copy() makes it an independent dataframe, so adding a column below does not raise SettingWithCopyWarning.
df_Patients_dedup_sub = dfPatients_dedup[["PatientId", "FRDPersonnelID", "CrewMemberRoles"]].copy()

# Create a composite key column to identify duplicate records
df_Patients_dedup_sub["key"] = df_Patients_dedup_sub.PatientId.astype(str) + '-' + df_Patients_dedup_sub.FRDPersonnelID

df_Patients_dedup_sub.head()


Unnamed: 0,PatientId,FRDPersonnelID,CrewMemberRoles,key
0,479838,6ED7C99E-9E01-E211-B5F5-78E7D18CFD3C,"Primary Patient Caregiver-At Scene,Primary Pat...",479838-6ED7C99E-9E01-E211-B5F5-78E7D18CFD3C
1,479838,F39A55F0-C619-E511-80CA-001DD8B71D38,"Driver-Transport,Other Patient Caregiver-At Scene",479838-F39A55F0-C619-E511-80CA-001DD8B71D38
2,479838,D6D3C99E-9E01-E211-B5F5-78E7D18CFD3C,"Intern/Student,Other Patient Caregiver-At Scen...",479838-D6D3C99E-9E01-E211-B5F5-78E7D18CFD3C
3,479839,71D4C99E-9E01-E211-B5F5-78E7D18CFD3C,"Primary Patient Caregiver-At Scene,Primary Pat...",479839-71D4C99E-9E01-E211-B5F5-78E7D18CFD3C
4,479839,F9D4C99E-9E01-E211-B5F5-78E7D18CFD3C,"Driver-Response,Driver-Transport",479839-F9D4C99E-9E01-E211-B5F5-78E7D18CFD3C


In [8]:
df_Patients_dedup_sub.shape

(543298, 4)

### This next section groups by the key and creates duplicate columns to capture the values in the duplicate rows, so these values can be concatenated.

In [9]:
s = df_Patients_dedup_sub.groupby('key').cumcount()
m = s.astype(bool) * ('('+s.astype(str)+')')
df_out = df_Patients_dedup_sub.set_index(['key', m]).unstack().sort_index(level=1, axis=1).reset_index()
df_out.columns = df_out.columns.map(lambda x: f'{x[0]}{x[1]}')

# df_out.head()

df_out.shape

(543258, 10)
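
For readers unfamiliar with `cumcount`, the sketch below shows on toy data (an assumption, not project data) how it numbers repeated keys. It then shows an alternative way to combine the roles, using `groupby` with a string join. This alternative is not the approach taken in this notebook; it is included only to illustrate the intended result of one row per patient/provider pair.

```python
import pandas as pd

# Toy data: patient 1 / provider "A" appears twice with different roles
roles = pd.DataFrame({
    "PatientId": [1, 1, 2],
    "FRDPersonnelID": ["A", "A", "B"],
    "CrewMemberRoles": ["Driver-Response",
                        "Primary Patient Caregiver-At Scene",
                        "Driver-Transport"],
})

# cumcount numbers each occurrence within its group: 0 for the first, 1 for the second, ...
occurrence = roles.groupby(["PatientId", "FRDPersonnelID"]).cumcount()

# Alternative: join all roles for each pair into a single string
merged = (
    roles.groupby(["PatientId", "FRDPersonnelID"], sort=False)["CrewMemberRoles"]
         .agg(lambda s: ", ".join(s))
         .reset_index()
)
print(occurrence.tolist())
print(merged)
```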

### Notice that the row count has decreased from 543298 to 543258. That is because rows sharing a key have been collapsed into a single row, with their values stored in the extra columns.

In [10]:
df_out.head()

Unnamed: 0,key,CrewMemberRoles,FRDPersonnelID,PatientId,CrewMemberRoles(1),FRDPersonnelID(1),PatientId(1),CrewMemberRoles(2),FRDPersonnelID(2),PatientId(2)
0,1000002-91F621B7-1D51-E311-B3F8-AE437D699481,"Primary Patient Caregiver-At Scene,Primary Pat...",91F621B7-1D51-E311-B3F8-AE437D699481,1000002.0,,,,,,
1,1000002-ADFF314F-8E6F-4B99-9C81-DBBA88336FFE,"Driver-Response,Driver-Transport,Other Patient...",ADFF314F-8E6F-4B99-9C81-DBBA88336FFE,1000002.0,,,,,,
2,1000009-1FB4635C-0444-4221-A903-FE298AF4A87A,"Driver-Response,Driver-Transport",1FB4635C-0444-4221-A903-FE298AF4A87A,1000009.0,,,,,,
3,1000009-DFD8C99E-9E01-E211-B5F5-78E7D18CFD3C,"Primary Patient Caregiver-At Scene,Primary Pat...",DFD8C99E-9E01-E211-B5F5-78E7D18CFD3C,1000009.0,,,,,,
4,1000024-0FC6D8A5-B755-49AF-B4FE-073AA2CEC1E8,"Primary Patient Caregiver-At Scene,Primary Pat...",0FC6D8A5-B755-49AF-B4FE-073AA2CEC1E8,1000024.0,,,,,,


In [11]:
#Remove unnecessary columns generated in last step
df_out2=df_out.drop(columns=['FRDPersonnelID(1)','PatientId(1)', 'FRDPersonnelID(2)', 'PatientId(2)'])

#Keep only the records that have a second or third role (the duplicated keys); records with a single role are dropped here
df_out1=df_out2.dropna(how='all',subset=['CrewMemberRoles(1)','CrewMemberRoles(2)'])

#create a dataset with only the rows that don't have duplicates for later merging
df_nodup=pd.merge(df_out2,df_out1,how='outer',indicator=True).query('_merge == "left_only"').drop(columns=['_merge'])


In [12]:
df_out1.head()

Unnamed: 0,key,CrewMemberRoles,FRDPersonnelID,PatientId,CrewMemberRoles(1),CrewMemberRoles(2)
69,1000136-9D35C1F0-3E4C-4E24-87C2-44B32F463C83,"Other Patient Caregiver-At Scene,Other Patient...",9D35C1F0-3E4C-4E24-87C2-44B32F463C83,1000136.0,"Driver-Response,Driver-Transport,Other Patient...",
797,1001604-8FD2C99E-9E01-E211-B5F5-78E7D18CFD3C,Driver-Response,8FD2C99E-9E01-E211-B5F5-78E7D18CFD3C,1001604.0,Primary Patient Caregiver-At Scene,
84347,1158779-8AD7C99E-9E01-E211-B5F5-78E7D18CFD3C,Driver-Response,8AD7C99E-9E01-E211-B5F5-78E7D18CFD3C,1158779.0,Primary Patient Caregiver-At Scene,
104033,1195070-86D6C99E-9E01-E211-B5F5-78E7D18CFD3C,Driver-Response,86D6C99E-9E01-E211-B5F5-78E7D18CFD3C,1195070.0,Primary Patient Caregiver-At Scene,
105268,1197358-86D6C99E-9E01-E211-B5F5-78E7D18CFD3C,"Other Patient Caregiver-At Scene,Primary Patie...",86D6C99E-9E01-E211-B5F5-78E7D18CFD3C,1197358.0,Driver-Response,


In [13]:
df_nodup.shape

(543219, 6)

In [14]:
df_out1.shape

(39, 6)
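
The two shapes above add up to the row count of `df_out` (543219 + 39 = 543258), which is what the "left_only" merge pattern is meant to guarantee. Here is a minimal sketch of that anti-join pattern on toy data; the dataframe and column names are assumptions for illustration only.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "val": [1, 2, 3]})
subset = left[left["key"] == "b"]  # rows we want to exclude

# An outer merge with indicator=True adds a _merge column saying where each row came from;
# keeping "left_only" leaves the rows of `left` that are not in `subset`
only_left = (
    left.merge(subset, how="outer", indicator=True)
        .query('_merge == "left_only"')
        .drop(columns="_merge")
)
print(only_left)
```

Because `subset` is taken from `left`, the two pieces partition the original rows, so their lengths sum to `len(left)`.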

In [15]:
# Work on independent copies so the column assignments below do not raise SettingWithCopyWarning
df_nodup = df_nodup.copy()
df_out1 = df_out1.copy()

# Move the CrewMemberRoles over to the new column
df_nodup['newCrewMemberRoles'] = df_nodup['CrewMemberRoles']

# Concatenate the duplicate CrewMemberRoles and place in new column
df_out1['newCrewMemberRoles'] = df_out1['CrewMemberRoles'] + ', ' + df_out1['CrewMemberRoles(1)']

# Create a new dataframe with only the triplicates in it
df_trips = df_out1.dropna(how='all', subset=['CrewMemberRoles(2)']).copy()

# Create a new dataframe with only the duplicates (not the triplicates)
df_dups = pd.merge(df_out1, df_trips, how='outer', indicator=True).query('_merge == "left_only"').drop(columns=['_merge'])

# Concatenate the triplicate CrewMemberRoles with data from prior steps into the new column
df_trips['newCrewMemberRoles'] = df_trips['newCrewMemberRoles'] + ', ' + df_trips['CrewMemberRoles(2)']

# Merge all three datasets (singles, duplicates, triplicates)
df_Roles_deduped = pd.concat([df_nodup, df_dups, df_trips])

# Drop extra columns that are no longer needed
df_Roles_deduped = df_Roles_deduped.drop(columns=['CrewMemberRoles', 'CrewMemberRoles(1)', 'CrewMemberRoles(2)'])

df_Roles_deduped.shape


(543258, 4)

In [16]:
df_Roles_deduped.head()

Unnamed: 0,key,FRDPersonnelID,PatientId,newCrewMemberRoles
0,1000002-91F621B7-1D51-E311-B3F8-AE437D699481,91F621B7-1D51-E311-B3F8-AE437D699481,1000002.0,"Primary Patient Caregiver-At Scene,Primary Pat..."
1,1000002-ADFF314F-8E6F-4B99-9C81-DBBA88336FFE,ADFF314F-8E6F-4B99-9C81-DBBA88336FFE,1000002.0,"Driver-Response,Driver-Transport,Other Patient..."
2,1000009-1FB4635C-0444-4221-A903-FE298AF4A87A,1FB4635C-0444-4221-A903-FE298AF4A87A,1000009.0,"Driver-Response,Driver-Transport"
3,1000009-DFD8C99E-9E01-E211-B5F5-78E7D18CFD3C,DFD8C99E-9E01-E211-B5F5-78E7D18CFD3C,1000009.0,"Primary Patient Caregiver-At Scene,Primary Pat..."
4,1000024-0FC6D8A5-B755-49AF-B4FE-073AA2CEC1E8,0FC6D8A5-B755-49AF-B4FE-073AA2CEC1E8,1000024.0,"Primary Patient Caregiver-At Scene,Primary Pat..."


In [17]:
# Merge the combined roles back into the patients dataframe
dfPatients_dedup2 = pd.merge(dfPatients_dedup, df_Roles_deduped, how='left', on=['FRDPersonnelID', 'PatientId'])

dfPatients_dedup2.head()

Unnamed: 0,PatientId,FRDPersonnelID,Shift,UnitId,FireStation,Battalion,PatientOutcome,PatientGender,CrewMemberRoles,DispatchTime,FRDPersonnelGender,FRDPersonnelStartDate,key,newCrewMemberRoles
0,479838,6ED7C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,"Primary Patient Caregiver-At Scene,Primary Pat...",2018-01-01 00:29:02,Male,1997-08-18,479838-6ED7C99E-9E01-E211-B5F5-78E7D18CFD3C,"Primary Patient Caregiver-At Scene,Primary Pat..."
1,479838,F39A55F0-C619-E511-80CA-001DD8B71D38,A - Shift,M404,4,401,Standby (No Services Performed),,"Driver-Transport,Other Patient Caregiver-At Scene",2018-01-01 00:29:02,Male,2007-02-19,479838-F39A55F0-C619-E511-80CA-001DD8B71D38,"Driver-Transport,Other Patient Caregiver-At Scene"
2,479838,D6D3C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,"Intern/Student,Other Patient Caregiver-At Scen...",2018-01-01 00:29:02,Male,2008-08-18,479838-D6D3C99E-9E01-E211-B5F5-78E7D18CFD3C,"Intern/Student,Other Patient Caregiver-At Scen..."
3,479839,71D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,"Primary Patient Caregiver-At Scene,Primary Pat...",2018-01-01 00:40:54,Male,2011-10-24,479839-71D4C99E-9E01-E211-B5F5-78E7D18CFD3C,"Primary Patient Caregiver-At Scene,Primary Pat..."
4,479839,F9D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,"Driver-Response,Driver-Transport",2018-01-01 00:40:54,Male,2012-02-27,479839-F9D4C99E-9E01-E211-B5F5-78E7D18CFD3C,"Driver-Response,Driver-Transport"


In [18]:
dfPatients_dedup2.shape

(543298, 14)

In [19]:
#replace CrewMemberRoles column with updated column
dfPatients_dedup2=dfPatients_dedup2.drop(columns=["key","CrewMemberRoles"])
dfPatients_dedup2=dfPatients_dedup2.rename(columns={"newCrewMemberRoles":"CrewMemberRoles"})
dfPatients_dedup2.head()

Unnamed: 0,PatientId,FRDPersonnelID,Shift,UnitId,FireStation,Battalion,PatientOutcome,PatientGender,DispatchTime,FRDPersonnelGender,FRDPersonnelStartDate,CrewMemberRoles
0,479838,6ED7C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,1997-08-18,"Primary Patient Caregiver-At Scene,Primary Pat..."
1,479838,F39A55F0-C619-E511-80CA-001DD8B71D38,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,2007-02-19,"Driver-Transport,Other Patient Caregiver-At Scene"
2,479838,D6D3C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,2008-08-18,"Intern/Student,Other Patient Caregiver-At Scen..."
3,479839,71D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,2018-01-01 00:40:54,Male,2011-10-24,"Primary Patient Caregiver-At Scene,Primary Pat..."
4,479839,F9D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,2018-01-01 00:40:54,Male,2012-02-27,"Driver-Response,Driver-Transport"


In [20]:
dfPatients_dedup2.shape

(543298, 12)

In [21]:
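# Drop duplicate records (rows for the same patient/provider pair are identical once their roles have been merged)
dfPatients_dedup2 = dfPatients_dedup2.drop_duplicates()
# .copy() gives an independent dataframe so later column assignments do not raise SettingWithCopyWarning
dfPatients_dedup = dfPatients_dedup2.copy()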
dfPatients_dedup.shape

(543258, 12)

### Add TenureMonths, which is the count of months working at the time of the dispatch call

In [22]:
dfPatients_dedup.loc[:,'TenureMonths'] = ((dfPatients_dedup['DispatchTime'].dt.date - \
                                     dfPatients_dedup['FRDPersonnelStartDate'].dt.date) / \
                                    np.timedelta64(1, 'M')).astype(int)
dfPatients_dedup.shape
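# If pandas emits a SettingWithCopyWarning here, it is only a warning, not an error.
# It most likely stems from dfPatients_dedup having been derived from another dataframe, not from the .loc call itself.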

(543258, 13)
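
Dividing by `np.timedelta64(1, 'M')` depends on how the installed pandas/NumPy versions interpret a "month", and some newer releases may not accept month-based units at all. The sketch below is an alternative under a stated assumption: a month is taken to be the average Gregorian month of 365.2425 / 12 days. The constant name `AVG_DAYS_PER_MONTH` is hypothetical. With the first two rows shown in the table below as inputs, it yields the same values as the notebook (244 and 130).

```python
import pandas as pd

toy = pd.DataFrame({
    "DispatchTime": pd.to_datetime(["2018-01-01 00:29:02", "2018-01-01 00:29:02"]),
    "FRDPersonnelStartDate": pd.to_datetime(["1997-08-18", "2007-02-19"]),
})

# Assumption: one month = average Gregorian month length in days
AVG_DAYS_PER_MONTH = 365.2425 / 12

# normalize() drops the time of day, mirroring the .dt.date step in the notebook
days = (toy["DispatchTime"].dt.normalize()
        - toy["FRDPersonnelStartDate"].dt.normalize()).dt.days

# Truncate to whole months
toy["TenureMonths"] = (days / AVG_DAYS_PER_MONTH).astype(int)
print(toy["TenureMonths"].tolist())
```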

In [23]:
dfPatients_dedup.head()

Unnamed: 0,PatientId,FRDPersonnelID,Shift,UnitId,FireStation,Battalion,PatientOutcome,PatientGender,DispatchTime,FRDPersonnelGender,FRDPersonnelStartDate,CrewMemberRoles,TenureMonths
0,479838,6ED7C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,1997-08-18,"Primary Patient Caregiver-At Scene,Primary Pat...",244
1,479838,F39A55F0-C619-E511-80CA-001DD8B71D38,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,2007-02-19,"Driver-Transport,Other Patient Caregiver-At Scene",130
2,479838,D6D3C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,2008-08-18,"Intern/Student,Other Patient Caregiver-At Scen...",112
3,479839,71D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,2018-01-01 00:40:54,Male,2011-10-24,"Primary Patient Caregiver-At Scene,Primary Pat...",74
4,479839,F9D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,2018-01-01 00:40:54,Male,2012-02-27,"Driver-Response,Driver-Transport",70


In [24]:
dfPatients_dedup['Shift'].value_counts()

C - Shift    183295
B - Shift    181956
A - Shift    178007
Name: Shift, dtype: int64

### Add a factorized column whose codes start at 1 rather than Python's usual 0

In [25]:
dfPatients_dedup.loc[:,'ShiftCode'] = pd.factorize(dfPatients_dedup['Shift'])[0] + 1

In [26]:
dfPatients_dedup.head()

Unnamed: 0,PatientId,FRDPersonnelID,Shift,UnitId,FireStation,Battalion,PatientOutcome,PatientGender,DispatchTime,FRDPersonnelGender,FRDPersonnelStartDate,CrewMemberRoles,TenureMonths,ShiftCode
0,479838,6ED7C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,1997-08-18,"Primary Patient Caregiver-At Scene,Primary Pat...",244,1
1,479838,F39A55F0-C619-E511-80CA-001DD8B71D38,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,2007-02-19,"Driver-Transport,Other Patient Caregiver-At Scene",130,1
2,479838,D6D3C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,2008-08-18,"Intern/Student,Other Patient Caregiver-At Scen...",112,1
3,479839,71D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,2018-01-01 00:40:54,Male,2011-10-24,"Primary Patient Caregiver-At Scene,Primary Pat...",74,1
4,479839,F9D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,2018-01-01 00:40:54,Male,2012-02-27,"Driver-Response,Driver-Transport",70,1


In [27]:
dfPatients_dedup['ShiftCode'].value_counts()

3    183295
2    181956
1    178007
Name: ShiftCode, dtype: int64
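
`pd.factorize` assigns codes in order of first appearance, so the mapping depends on row order unless `sort=True` is passed. A minimal sketch on toy values (an assumption, not project data) shows both behaviours and how to keep the code-to-label lookup:

```python
import pandas as pd

shifts = pd.Series(["A - Shift", "C - Shift", "A - Shift", "B - Shift"])

# Default: codes follow order of first appearance (A=0, C=1, B=2)
codes, labels = pd.factorize(shifts)
shift_codes = codes + 1  # shift to 1-based codes, as in the notebook

# sort=True: codes follow the sorted labels (A=0, B=1, C=2), independent of row order
sorted_codes, sorted_labels = pd.factorize(shifts, sort=True)

# The labels give the lookup from code back to the original value
lookup = {i + 1: label for i, label in enumerate(labels)}
print(shift_codes.tolist(), lookup)
```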

### Use the "One Hot Encoding" method via get_dummies to create a 1/0 indicator column for each of Shifts A, B, and C

In [28]:
dingdongs = pd.get_dummies(dfPatients_dedup['Shift'])

In [29]:
dingdongs.head()

Unnamed: 0,A - Shift,B - Shift,C - Shift
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0


In [30]:
dingdongs.value_counts()

A - Shift  B - Shift  C - Shift
0          0          1            183295
           1          0            181956
1          0          0            178007
dtype: int64

In [31]:
# Rename columns
dingdongs.rename(columns={'A - Shift': 'Shift_A', 'B - Shift': 'Shift_B', 'C - Shift': 'Shift_C'}, inplace=True)

In [32]:
dingdongs.value_counts()

Shift_A  Shift_B  Shift_C
0        0        1          183295
         1        0          181956
1        0        0          178007
dtype: int64
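
A compact version of the encode, rename, and join steps is sketched below on toy data (an assumption for illustration). Passing `dtype=int` is a precaution: newer pandas versions return boolean columns from `get_dummies` by default, whereas the output above shows 1/0 integers.

```python
import pandas as pd

toy = pd.DataFrame({"Shift": ["A - Shift", "B - Shift", "C - Shift", "A - Shift"]})

# One indicator column per distinct Shift value, stored as 1/0 integers
dummies = pd.get_dummies(toy["Shift"], dtype=int)

# "A - Shift" -> "Shift_A", and so on, using the first character of each label
dummies = dummies.rename(columns=lambda c: "Shift_" + c[0])

# Join side by side; rows are aligned on the shared index
encoded = pd.concat([toy, dummies], axis=1)
print(encoded)
```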

### Concatenate the deduplicated Patients and One Hot Encoding Shift dataframes together

In [33]:
dfPatients_dedup = pd.concat([dfPatients_dedup, dingdongs], axis=1)

In [34]:
dfPatients_dedup.head()

Unnamed: 0,PatientId,FRDPersonnelID,Shift,UnitId,FireStation,Battalion,PatientOutcome,PatientGender,DispatchTime,FRDPersonnelGender,FRDPersonnelStartDate,CrewMemberRoles,TenureMonths,ShiftCode,Shift_A,Shift_B,Shift_C
0,479838,6ED7C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,1997-08-18,"Primary Patient Caregiver-At Scene,Primary Pat...",244,1,1,0,0
1,479838,F39A55F0-C619-E511-80CA-001DD8B71D38,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,2007-02-19,"Driver-Transport,Other Patient Caregiver-At Scene",130,1,1,0,0
2,479838,D6D3C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,2008-08-18,"Intern/Student,Other Patient Caregiver-At Scen...",112,1,1,0,0
3,479839,71D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,2018-01-01 00:40:54,Male,2011-10-24,"Primary Patient Caregiver-At Scene,Primary Pat...",74,1,1,0,0
4,479839,F9D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,2018-01-01 00:40:54,Male,2012-02-27,"Driver-Response,Driver-Transport",70,1,1,0,0


In [35]:
dfPatients_dedup['UnitId'].value_counts()

M422     18653
M430     18102
M405     15942
M404     15896
M425     15382
         ...  
BC405        2
LA436        2
LA437        2
K441         1
BC402        1
Name: UnitId, Length: 167, dtype: int64

### Add a new factorized column called UnitIdCode made from UnitId, so as not to lose the original column

In [36]:
dfPatients_dedup['UnitIdCode'] = pd.factorize(dfPatients_dedup['UnitId'])[0] + 1

In [37]:
dfPatients_dedup['UnitIdCode'].value_counts()

26     18653
22     18102
24     15942
1      15896
37     15382
       ...  
148        2
145        2
143        2
146        1
162        1
Name: UnitIdCode, Length: 167, dtype: int64

In [38]:
# List out values and counts for Patient Outcome Codes
dfPatients_dedup['PatientOutcome'].value_counts()

Treated & Transported                        356196
Patient Refusal  (AMA)                        46598
No Treatment/Transport Required               44876
Canceled (Prior to Arrival)                   38674
EMS Assist (Other Agency)                     13693
Canceled (On Scene, No Patient Contact)       12194
No Patient Found                              10499
Standby (No Services Performed)                8659
Patient Dead at Scene (No EMS CPR)             4857
Standby (Operational Support Provided)         3238
Treated, Transferred Care                      2470
Patient Dead at Scene (EMS CPR Attempted)      1302
Name: PatientOutcome, dtype: int64

### Add a new factorized column called PatientOutcomeCode made from PatientOutcome, so as not to lose the original column data

In [39]:
dfPatients_dedup['PatientOutcomeCode'] = pd.factorize(dfPatients_dedup['PatientOutcome'])[0] + 1

In [40]:
dfPatients_dedup['PatientOutcomeCode'].value_counts()

4     356196
3      46598
2      44876
7      38674
11     13693
5      12194
6      10499
1       8659
9       4857
8       3238
10      2470
12      1302
0          2
Name: PatientOutcomeCode, dtype: int64

In [41]:
dfPatients_dedup['PatientGender'].value_counts()

Female                           210367
Male                             201437
Unknown (Unable to Determine)       177
Name: PatientGender, dtype: int64

In [42]:
## McRae added to include PatientGender encoding
dfPatients_dedup['PatientGenderCode'] = pd.factorize(dfPatients_dedup['PatientGender'])[0] + 1
dfPatients_dedup['PatientGenderCode'].value_counts()

2    210367
1    201437
0    131277
3       177
Name: PatientGenderCode, dtype: int64
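
The code 0 in the counts above (and in `PatientOutcomeCode`) is not a category of its own. `pd.factorize` marks missing values with -1 by default, so adding 1 turns every missing value into 0. A minimal sketch on toy values (an assumption, not project data):

```python
import numpy as np
import pandas as pd

gender = pd.Series(["Male", "Female", np.nan, "Unknown (Unable to Determine)", np.nan])

# Missing values receive the sentinel code -1 and do not appear in the labels
codes, labels = pd.factorize(gender)

# After the +1 shift, missing values become 0 and real categories start at 1
gender_code = codes + 1
print(codes.tolist(), gender_code.tolist())
```

Downstream analysis should therefore treat 0 as "missing" rather than as an additional gender or outcome value.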

In [43]:
dfPatients_dedup.head()

Unnamed: 0,PatientId,FRDPersonnelID,Shift,UnitId,FireStation,Battalion,PatientOutcome,PatientGender,DispatchTime,FRDPersonnelGender,FRDPersonnelStartDate,CrewMemberRoles,TenureMonths,ShiftCode,Shift_A,Shift_B,Shift_C,UnitIdCode,PatientOutcomeCode,PatientGenderCode
0,479838,6ED7C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,1997-08-18,"Primary Patient Caregiver-At Scene,Primary Pat...",244,1,1,0,0,1,1,0
1,479838,F39A55F0-C619-E511-80CA-001DD8B71D38,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,2007-02-19,"Driver-Transport,Other Patient Caregiver-At Scene",130,1,1,0,0,1,1,0
2,479838,D6D3C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,2008-08-18,"Intern/Student,Other Patient Caregiver-At Scen...",112,1,1,0,0,1,1,0
3,479839,71D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,2018-01-01 00:40:54,Male,2011-10-24,"Primary Patient Caregiver-At Scene,Primary Pat...",74,1,1,0,0,2,2,0
4,479839,F9D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,2018-01-01 00:40:54,Male,2012-02-27,"Driver-Response,Driver-Transport",70,1,1,0,0,2,2,0


In [44]:
dfPatients_dedup['FRDPersonnelGender'].value_counts()

Male      455693
Female     87565
Name: FRDPersonnelGender, dtype: int64

### Add a new factorized column called ProviderGenderCode made from FRDPersonnelGender, so as not to lose the original column

In [45]:
dfPatients_dedup['ProviderGenderCode'] = pd.factorize(dfPatients_dedup['FRDPersonnelGender'])[0] + 1

In [46]:
dfPatients_dedup['ProviderGenderCode'].value_counts()

1    455693
2     87565
Name: ProviderGenderCode, dtype: int64

In [47]:
dfPatients_dedup.head()

Unnamed: 0,PatientId,FRDPersonnelID,Shift,UnitId,FireStation,Battalion,PatientOutcome,PatientGender,DispatchTime,FRDPersonnelGender,...,CrewMemberRoles,TenureMonths,ShiftCode,Shift_A,Shift_B,Shift_C,UnitIdCode,PatientOutcomeCode,PatientGenderCode,ProviderGenderCode
0,479838,6ED7C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,...,"Primary Patient Caregiver-At Scene,Primary Pat...",244,1,1,0,0,1,1,0,1
1,479838,F39A55F0-C619-E511-80CA-001DD8B71D38,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,...,"Driver-Transport,Other Patient Caregiver-At Scene",130,1,1,0,0,1,1,0,1
2,479838,D6D3C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M404,4,401,Standby (No Services Performed),,2018-01-01 00:29:02,Male,...,"Intern/Student,Other Patient Caregiver-At Scen...",112,1,1,0,0,1,1,0,1
3,479839,71D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,2018-01-01 00:40:54,Male,...,"Primary Patient Caregiver-At Scene,Primary Pat...",74,1,1,0,0,2,2,0,1
4,479839,F9D4C99E-9E01-E211-B5F5-78E7D18CFD3C,A - Shift,M401,1,402,No Treatment/Transport Required,,2018-01-01 00:40:54,Male,...,"Driver-Response,Driver-Transport",70,1,1,0,0,2,2,0,1


### Display data types

In [48]:
print_categories(dfPatients_dedup)
dfPatients_dedup.shape

PatientId :  Numerical
FRDPersonnelID :  Categorical
Shift :  Categorical
UnitId :  Categorical
FireStation :  Numerical
Battalion :  Numerical
PatientOutcome :  Categorical
PatientGender :  Categorical
DispatchTime :  Date
FRDPersonnelGender :  Categorical
FRDPersonnelStartDate :  Date
CrewMemberRoles :  Categorical
TenureMonths :  Numerical
ShiftCode :  Numerical
Shift_A :  Numerical
Shift_B :  Numerical
Shift_C :  Numerical
UnitIdCode :  Numerical
PatientOutcomeCode :  Numerical
PatientGenderCode :  Numerical
ProviderGenderCode :  Numerical


(543258, 21)

### Output the intermediate Patients dataframe to a CSV file in the intermediate data directory

In [49]:
# Make sure to edit the filename to be sure you are writing to the right place
dfPatients_dedup.to_csv(r'../data/02_intermediate/dfPatients_dedup.csv', index = True)
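
Because the file is written with `index = True`, the index becomes an unnamed first column, and date columns are stored as text. A minimal sketch of reading such a file back is shown below; it uses an in-memory buffer and a one-row toy dataframe (both assumptions for illustration) instead of the real path.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "PatientId": [479838],
    "DispatchTime": pd.to_datetime(["2018-01-01 00:29:02"]),
})

# Write to an in-memory buffer instead of a file on disk
buf = io.StringIO()
df.to_csv(buf, index=True)
buf.seek(0)

# index_col=0 restores the saved index; parse_dates converts the text back to datetimes
back = pd.read_csv(buf, index_col=0, parse_dates=["DispatchTime"])
print(back.dtypes)
```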