# Merge and Prepare Intermediate Datasets -- Procedures and Patients

The script for Procedures:
- reads in the raw data provided by our project partner, and
- drops duplicate records for a particular FRDPersonnelID.

The script for Patients:
- drops unneeded columns,
- removes duplicate rows, and
- drops rows where Provider Gender and Start Date are NULL.

The script then merges the Procedures and Patients datasets. It:
- performs an inner join on PatientId and FRDPersonnelID,
- performs a left outer join from Procedures back to the Procedures+Patients dataframe to build the Orphan rows dataframe,
- builds a lookup reference dataframe of all Providers from Patients,
- inner joins the Provider lookup dataframe with the Orphan rows dataframe on FRDPersonnelID,
- builds a lookup reference dataframe of all Patients from Patients,
- merges the Patient lookup dataframe with the Orphan rows dataframe on PatientId only,
- appends the Orphan rows dataframe back to the Procedures+Patients dataframe,
- adds a tenure-by-month attribute,
- converts categorical data into numeric codes,
- adds new numeric columns using the "One Hot Encoding" method, then
- outputs the dataframe to a CSV file.

_Author: Mark Lee_


In [1]:
# Functions to identify and print easy-to-understand variable types
def get_var_category(series):
    unique_count = series.nunique(dropna=False)
    total_count = len(series)
    if pd.api.types.is_numeric_dtype(series):
        return 'Numerical'
    elif pd.api.types.is_datetime64_dtype(series):
        return 'Date'
    elif unique_count == total_count:
        return 'Text (Unique)'
    else:
        return 'Categorical'

def print_categories(df):
    for column_name in df.columns:
        print(column_name, ": ", get_var_category(df[column_name]))

In [2]:
# Import libraries
import numpy as np
import pandas as pd
import os, time

# Setup HTML display
from IPython.display import display, HTML
# Notebook cell width adjustment
display(HTML('<style>.container { width:80% !important; }</style>'))

## Read in Procedures dataset

In [3]:
# Read in Procedures dataset and capture how long it takes
print(os.getcwd())
readStart=time.time()

dfProcedures = pd.read_excel(r'../data/01_raw/20210225-ems-raw-v04.xlsx', 
                        sheet_name='Procedures',
                        na_values=['NA'])

# Stop the clock and calculate read time
readStop=time.time()
readTime=readStop-readStart
readMin=np.floor(readTime/60)
readSec=np.floor(readTime-(readMin*60))
print("The file was read in {0:.0f} minutes and {1:.0f} seconds.".format(readMin,readSec))

dfProcedures.shape

The file was read in 1 minutes and 36 seconds.


(170808, 6)

## Read in Patients dataset

In [4]:
# Read in Patients dataset and capture how long it takes
print(os.getcwd())
readStart=time.time()

dfPatients = pd.read_excel(r'../data/01_raw/20210225-ems-raw-v04.xlsx', 
                        sheet_name='Patients', 
                        na_values=['NA'])

# Stop the clock and calculate read time
readStop=time.time()
readTime=readStop-readStart
readMin=np.floor(readTime/60)
readSec=np.floor(readTime-(readMin*60))
print("The file was read in {0:.0f} minutes and {1:.0f} seconds.".format(readMin,readSec))

dfPatients.shape

The file was read in 1 minutes and 48 seconds.


(543774, 12)
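
The timing boilerplate is repeated for each sheet. A minimal sketch of how it could be factored out follows; `format_elapsed` and `timed_call` are hypothetical helper names that do not appear in the notebook, and `sum` stands in for `pd.read_excel` so the sketch runs without the raw workbook.

```python
import time

def format_elapsed(seconds):
    """Format a duration in seconds as whole minutes and seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return "{0} minutes and {1} seconds".format(minutes, secs)

def timed_call(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in call (assumption: in the notebook this would be pd.read_excel)
total, elapsed = timed_call(sum, range(1000))
print("The call took {0}.".format(format_elapsed(elapsed)))
```

In the notebook the call would look like `timed_call(pd.read_excel, path, sheet_name='Procedures', na_values=['NA'])`. `pd.read_excel` also accepts a list for `sheet_name` and then returns a dict of dataframes, which may avoid opening the same workbook twice.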

# Working with Procedures dataframe

### Capture Procedures record count in variable to track how many rows ultimately get removed

In [5]:
varProceduresCount = len(dfProcedures.index)
print('Procedures record count: ', varProceduresCount)

Procedures record count:  170808


### Getting a count of NULL values, but not dropping them yet

In [6]:
(" \nCount total NaN at each column in a Procedures dataFrame : \n\n", dfProcedures.isnull().sum())

(' \nCount total NaN at each column in a Procedures dataFrame : \n\n',
 Dim_Procedure_PK                    0
 PatientId                           0
 Procedure_Performed_Code            0
 Procedure_Performed_Description     0
 FRDPersonnelID                      0
 Procedure_Performed_Date_Time      31
 dtype: int64)

### Display data types

In [7]:
print_categories(dfProcedures)

Dim_Procedure_PK :  Numerical
PatientId :  Numerical
Procedure_Performed_Code :  Numerical
Procedure_Performed_Description :  Categorical
FRDPersonnelID :  Categorical
Procedure_Performed_Date_Time :  Date


### Drop the Procedure_Performed_Date_Time column, which is not needed for analytics; Dim_Procedure_PK acts as a record-level unique identifier

In [8]:
dfProcedures = dfProcedures.drop(['Procedure_Performed_Date_Time'],axis=1)
print("\nAfter modifying first column:\n", dfProcedures.columns)


After modifying first column:
 Index(['Dim_Procedure_PK', 'PatientId', 'Procedure_Performed_Code',
       'Procedure_Performed_Description', 'FRDPersonnelID'],
      dtype='object')


In [9]:
dfProcedures.head()

Unnamed: 0,Dim_Procedure_PK,PatientId,Procedure_Performed_Code,Procedure_Performed_Description,FRDPersonnelID
0,180498,479853,392230005,IV Start - Extremity Vein (arm or leg),58D5C99E-9E01-E211-B5F5-78E7D18CFD3C
1,180499,479853,268400002,CV - ECG - 12 Lead Obtained,58D5C99E-9E01-E211-B5F5-78E7D18CFD3C
2,180500,479853,268400002,CV - ECG - 12 Lead Obtained,58D5C99E-9E01-E211-B5F5-78E7D18CFD3C
3,180501,479853,268400002,CV - ECG - 12 Lead Obtained,58D5C99E-9E01-E211-B5F5-78E7D18CFD3C
4,180508,479862,268400002,CV - ECG - 12 Lead Obtained,1D18E8FC-EE92-E211-A596-78E7D18C3D20


### No need to drop PatientGender NULL rows for intermediate dataset, may drop in analytics design step

In [10]:
#dfProcedures = dfProcedures.drop(dfProcedures[(dfProcedures.PatientGender.isnull())].index)
#dfProcedures.shape

### Drop duplicate Procedures rows; these relate to FRDPersonnelID = 9E5D40DC-5DC6-E411-80C7-001DD8B71D38

In [11]:
dfProcedures = dfProcedures.drop_duplicates()
dfProcedures.shape

(170161, 5)

### Capture revised Procedures record count in variable

In [12]:
varProceduresCount2 = len(dfProcedures.index)
print('Procedures revised record count: ', varProceduresCount2)

Procedures revised record count:  170161


### Display duplicate records dropped count

In [13]:
varProceduresDiff = varProceduresCount - varProceduresCount2
print('Duplicate records count: ', varProceduresDiff)

Duplicate records count:  647
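
Before dropping duplicates, it can help to see which providers they belong to. The sketch below uses a small made-up dataframe (the values are illustrative, not taken from the raw data) and shows one way to list fully duplicated rows and count them per FRDPersonnelID.

```python
import pandas as pd

# Toy data (assumption): rows 0 and 1 are exact duplicates
toy = pd.DataFrame({
    'PatientId': [1, 1, 2, 2, 3],
    'FRDPersonnelID': ['A', 'A', 'B', 'B', 'C'],
    'Procedure_Performed_Code': [10, 10, 20, 21, 30],
})

# keep=False marks every member of a duplicate group, not just the repeats
dupes = toy[toy.duplicated(keep=False)]

# Count duplicated rows per provider to see where they concentrate
by_provider = dupes.groupby('FRDPersonnelID').size()

# Same result as the drop_duplicates call used in the notebook
deduped = toy.drop_duplicates()

print(by_provider)
print(len(toy) - len(deduped), "duplicate row(s) dropped")
```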


# Working with Patients dataframe

### Capture Patients record count in variable to track how many rows ultimately get removed

In [14]:
varPatientsCount = len(dfPatients.index)
print('Patients record count: ', varPatientsCount)

Patients record count:  543774


In [15]:
print(" \nCount total NaN at each column in a Patients dataFrame : \n\n", dfPatients.isnull().sum())

 
Count total NaN at each column in a Patients dataFrame : 

 PatientId                     0
FRDPersonnelID                0
Shift                         0
UnitId                        0
FireStation                   0
Battalion                     0
PatientOutcome                2
PatientGender            131378
CrewMemberRoles               0
DispatchTime                  0
FRDPersonnelGender          474
FRDPersonnelStartDate       474
dtype: int64


### Display data types

In [16]:
print_categories(dfPatients)

PatientId :  Numerical
FRDPersonnelID :  Categorical
Shift :  Categorical
UnitId :  Categorical
FireStation :  Numerical
Battalion :  Numerical
PatientOutcome :  Categorical
PatientGender :  Categorical
CrewMemberRoles :  Categorical
DispatchTime :  Date
FRDPersonnelGender :  Categorical
FRDPersonnelStartDate :  Date


## Keeping necessary columns for Focus Questions
```
PatientId
FRDPersonnelID
PatientOutcome
PatientGender
DispatchTime
FRDPersonnelGender
FRDPersonnelStartDate
```

### Create new dataframe with only columns needed for Focus Questions

In [17]:
dfPatientsSub = dfPatients.iloc[:, [0,1,6,7,9,10,11]]
dfPatientsSub.shape

(543774, 7)
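
Selecting by position works as long as the sheet's column order never changes. A minimal sketch of selecting the same seven columns by name is shown below; the one-row dataframe is a stand-in built from the column names listed earlier, and the `.copy()` is an optional choice made here to get an independent dataframe.

```python
import pandas as pd

# Toy frame (assumption) with the same column order as the Patients sheet
all_columns = ['PatientId', 'FRDPersonnelID', 'Shift', 'UnitId', 'FireStation',
               'Battalion', 'PatientOutcome', 'PatientGender', 'CrewMemberRoles',
               'DispatchTime', 'FRDPersonnelGender', 'FRDPersonnelStartDate']
toy = pd.DataFrame([list(range(12))], columns=all_columns)

# Columns needed for the Focus Questions, selected by name
keep = ['PatientId', 'FRDPersonnelID', 'PatientOutcome', 'PatientGender',
        'DispatchTime', 'FRDPersonnelGender', 'FRDPersonnelStartDate']
by_name = toy.loc[:, keep].copy()

# Positional selection as used in the notebook, for comparison
by_position = toy.iloc[:, [0, 1, 6, 7, 9, 10, 11]]

print(by_name.columns.tolist())
```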

In [18]:
dfPatientsSub.head()

Unnamed: 0,PatientId,FRDPersonnelID,PatientOutcome,PatientGender,DispatchTime,FRDPersonnelGender,FRDPersonnelStartDate
0,479838,6ED7C99E-9E01-E211-B5F5-78E7D18CFD3C,Standby (No Services Performed),,2018-01-01 00:29:02,Male,1997-08-18
1,479838,F39A55F0-C619-E511-80CA-001DD8B71D38,Standby (No Services Performed),,2018-01-01 00:29:02,Male,2007-02-19
2,479838,D6D3C99E-9E01-E211-B5F5-78E7D18CFD3C,Standby (No Services Performed),,2018-01-01 00:29:02,Male,2008-08-18
3,479839,71D4C99E-9E01-E211-B5F5-78E7D18CFD3C,No Treatment/Transport Required,,2018-01-01 00:40:54,Male,2011-10-24
4,479839,F9D4C99E-9E01-E211-B5F5-78E7D18CFD3C,No Treatment/Transport Required,,2018-01-01 00:40:54,Male,2012-02-27


### Remove duplicate records 

In [19]:
dfPatientsSub = dfPatientsSub.drop_duplicates()
dfPatientsSub.shape

(543718, 7)

### Capture revised Patients record count in variable

In [20]:
varPatientsSubCount = len(dfPatientsSub.index)
print('Patients record count: ', varPatientsSubCount)

Patients record count:  543718


### Display duplicate records dropped count

In [21]:
varPatientsSubDiff = varPatientsCount - varPatientsSubCount
print('Duplicate records count: ', varPatientsSubDiff)

Duplicate records count:  56


### Just getting a count of NULL values, but not dropping them yet

In [22]:
(" \nCount total NaN at each column in a Patients Sub dataFrame : \n\n", dfPatientsSub.isnull().sum())

(' \nCount total NaN at each column in a Patients Sub dataFrame : \n\n',
 PatientId                     0
 FRDPersonnelID                0
 PatientOutcome                2
 PatientGender            131365
 DispatchTime                  0
 FRDPersonnelGender          460
 FRDPersonnelStartDate       460
 dtype: int64)

### Drop NULL PatientGender records

In [23]:
# Don't run yet

#dfPatientsSub = dfPatientsSub.drop(dfPatientsSub[(dfPatientsSub.PatientGender.isnull())].index)
#dfPatientsSub.shape

### Drop NULL rows for Provider Gender and Start Date (FRDPersonnelGender and FRDPersonnelStartDate each have 460 NULLs)

In [24]:
dfPatientsSub = dfPatientsSub.drop(dfPatientsSub[(dfPatientsSub.FRDPersonnelGender.isnull())].index)
dfPatientsSub.shape

(543258, 7)
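
The drop above filters on FRDPersonnelGender alone, which relies on both columns being NULL on the same rows. The sketch below, on made-up data, shows one way to check that assumption and to drop on both columns explicitly with `dropna(subset=...)`.

```python
import pandas as pd
import numpy as np

# Toy data (assumption): providers B and D are missing both attributes
toy = pd.DataFrame({
    'FRDPersonnelID': ['A', 'B', 'C', 'D'],
    'FRDPersonnelGender': ['Male', np.nan, 'Female', np.nan],
    'FRDPersonnelStartDate': pd.to_datetime(['2004-07-12', None, '2011-10-24', None]),
})

provider_cols = ['FRDPersonnelGender', 'FRDPersonnelStartDate']

# Check that both columns are NULL on exactly the same rows
same_rows = toy['FRDPersonnelGender'].isnull().equals(
    toy['FRDPersonnelStartDate'].isnull())

# Drop any row with a NULL in either provider column
cleaned = toy.dropna(subset=provider_cols)

print(same_rows, cleaned.shape)
```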

### Capture revised Patients record count in variable

In [25]:
varPatientsSubCount2 = len(dfPatientsSub.index)
print('Patients record count: ', varPatientsSubCount2)

Patients record count:  543258


### Display NULL records dropped count

In [26]:
varPatientsSubDiff2 = varPatientsSubCount - varPatientsSubCount2
print('NULL records dropped count: ', varPatientsSubDiff2)

NULL records dropped count:  460


# Merge Procedures and Patients datasets

Start with an inner join between the Procedures and Patients dataframes on `PatientId` and `FRDPersonnelID`.

In [27]:
dfProcPat = dfPatientsSub.merge(dfProcedures, 
                        on=('PatientId','FRDPersonnelID'))
dfProcPat.shape

(147281, 10)

In [28]:
dfProcPat.head()

Unnamed: 0,PatientId,FRDPersonnelID,PatientOutcome,PatientGender,DispatchTime,FRDPersonnelGender,FRDPersonnelStartDate,Dim_Procedure_PK,Procedure_Performed_Code,Procedure_Performed_Description
0,479853,58D5C99E-9E01-E211-B5F5-78E7D18CFD3C,Treated & Transported,Male,2018-01-01 00:04:06,Male,2004-07-12,180498,392230005,IV Start - Extremity Vein (arm or leg)
1,479853,58D5C99E-9E01-E211-B5F5-78E7D18CFD3C,Treated & Transported,Male,2018-01-01 00:04:06,Male,2004-07-12,180499,268400002,CV - ECG - 12 Lead Obtained
2,479853,58D5C99E-9E01-E211-B5F5-78E7D18CFD3C,Treated & Transported,Male,2018-01-01 00:04:06,Male,2004-07-12,180500,268400002,CV - ECG - 12 Lead Obtained
3,479853,58D5C99E-9E01-E211-B5F5-78E7D18CFD3C,Treated & Transported,Male,2018-01-01 00:04:06,Male,2004-07-12,180501,268400002,CV - ECG - 12 Lead Obtained
4,479862,32D8C99E-9E01-E211-B5F5-78E7D18CFD3C,Treated & Transported,Female,2018-01-01 00:44:31,Male,2006-12-11,180510,392230005,IV Start - Extremity Vein (arm or leg)


### Display count of NULLs by column

In [29]:
(" \nCount total NaN at each column in a Procedures+Patients dataFrame : \n\n", dfProcPat.isnull().sum())

(' \nCount total NaN at each column in a Procedures+Patients dataFrame : \n\n',
 PatientId                            0
 FRDPersonnelID                       0
 PatientOutcome                       0
 PatientGender                      104
 DispatchTime                         0
 FRDPersonnelGender                   0
 FRDPersonnelStartDate                0
 Dim_Procedure_PK                     0
 Procedure_Performed_Code             0
 Procedure_Performed_Description      0
 dtype: int64)

## Perform a left outer join from Procedures back to Procedures+Patients and keep the unmatched (NULL) rows

### Add Composite Key using PatientId and FRDPersonnelID to both Procedures and Procedures+Patients datasets for left outer joins

In [30]:
dfProcedures['PK'] = dfProcedures['PatientId'].map(str) + '_' + dfProcedures['FRDPersonnelID']
dfProcPat['_PK'] = dfProcPat['PatientId'].map(str) + '_' + dfProcPat['FRDPersonnelID']

### Check columns

In [31]:
dfProcPat.columns

Index(['PatientId', 'FRDPersonnelID', 'PatientOutcome', 'PatientGender',
       'DispatchTime', 'FRDPersonnelGender', 'FRDPersonnelStartDate',
       'Dim_Procedure_PK', 'Procedure_Performed_Code',
       'Procedure_Performed_Description', '_PK'],
      dtype='object')

### Display count of NULLs by column

In [32]:
print(" \nCount total NaN at each column in a Procedures+Patients dataFrame : \n\n", dfProcPat.isnull().sum())

 
Count total NaN at each column in a Procedures+Patients dataFrame : 

 PatientId                            0
FRDPersonnelID                       0
PatientOutcome                       0
PatientGender                      104
DispatchTime                         0
FRDPersonnelGender                   0
FRDPersonnelStartDate                0
Dim_Procedure_PK                     0
Procedure_Performed_Code             0
Procedure_Performed_Description      0
_PK                                  0
dtype: int64


### Perform left outer join on Composite Keys (PK and _PK) and create a new dataframe to append to existing dfProcPat

In [33]:
procPKi = dfProcedures.set_index('PK')
procPatPKi = dfProcPat.set_index('_PK')

dfProcLeftJoin = dfProcedures.merge(procPatPKi, 
                             how = 'left', 
                             left_on = ('PK'),
                             right_on = ('_PK'), 
                             indicator = 'i').query('i == "left_only"')

In [34]:
dfProcLeftJoin.shape

(22880, 17)

In [35]:
dfProcLeftJoin.head()

Unnamed: 0,Dim_Procedure_PK_x,PatientId_x,Procedure_Performed_Code_x,Procedure_Performed_Description_x,FRDPersonnelID_x,PK,PatientId_y,FRDPersonnelID_y,PatientOutcome,PatientGender,DispatchTime,FRDPersonnelGender,FRDPersonnelStartDate,Dim_Procedure_PK_y,Procedure_Performed_Code_y,Procedure_Performed_Description_y,i
27,180516,479885,392230005,IV Start - Extremity Vein (arm or leg),17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,479885_17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,,,,,NaT,,NaT,,,,left_only
30,180519,479920,392230005,IV Start - Extremity Vein (arm or leg),C387923A-F613-E411-A585-F305C4522FCB,479920_C387923A-F613-E411-A585-F305C4522FCB,,,,,NaT,,NaT,,,,left_only
31,180532,480011,392230005,IV Start - Extremity Vein (arm or leg),9BD2C99E-9E01-E211-B5F5-78E7D18CFD3C,480011_9BD2C99E-9E01-E211-B5F5-78E7D18CFD3C,,,,,NaT,,NaT,,,,left_only
35,180549,480042,392230005,IV Start - Extremity Vein (arm or leg),F1D2C99E-9E01-E211-B5F5-78E7D18CFD3C,480042_F1D2C99E-9E01-E211-B5F5-78E7D18CFD3C,,,,,NaT,,NaT,,,,left_only
36,180557,480049,392230005,IV Start - Extremity Vein (arm or leg),72D2C99E-9E01-E211-B5F5-78E7D18CFD3C,480049_72D2C99E-9E01-E211-B5F5-78E7D18CFD3C,,,,,NaT,,NaT,,,,left_only
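
The left join with `indicator` followed by a `left_only` filter is an anti-join: it keeps the Procedures rows that have no match. A minimal sketch of the same idea is shown below on made-up data. It joins on the two key columns directly instead of a concatenated string key, which is a variation on the notebook's approach and not the code used above.

```python
import pandas as pd

# Toy data (assumption): procedures (1, B) and (3, C) have no patient row
procedures = pd.DataFrame({
    'PatientId': [1, 1, 2, 3],
    'FRDPersonnelID': ['A', 'B', 'A', 'C'],
    'Dim_Procedure_PK': [10, 11, 12, 13],
})
patients = pd.DataFrame({
    'PatientId': [1, 2, 2],
    'FRDPersonnelID': ['A', 'A', 'A'],
    'PatientOutcome': ['x', 'y', 'y'],
})

keys = ['PatientId', 'FRDPersonnelID']

# Distinct key pairs only, so the left join cannot multiply procedure rows
matched_keys = patients[keys].drop_duplicates()

# indicator=True adds a _merge column with left_only / both / right_only
flagged = procedures.merge(matched_keys, on=keys, how='left', indicator=True)

# Keep procedures that found no matching key pair
orphans = flagged.loc[flagged['_merge'] == 'left_only'].drop(columns='_merge')

print(orphans)
```

Because only the key columns of the right-hand frame are merged, no `_x`/`_y` suffixed columns are produced and nothing needs to be renamed or dropped afterwards.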


### Display count of NULLs by column

In [36]:
print(" \nCount total NaN at each column in a Procedures+Patients dataFrame : \n\n", dfProcLeftJoin.isnull().sum())

 
Count total NaN at each column in a Procedures+Patients dataFrame : 

 Dim_Procedure_PK_x                       0
PatientId_x                              0
Procedure_Performed_Code_x               0
Procedure_Performed_Description_x        0
FRDPersonnelID_x                         0
PK                                       0
PatientId_y                          22880
FRDPersonnelID_y                     22880
PatientOutcome                       22880
PatientGender                        22880
DispatchTime                         22880
FRDPersonnelGender                   22880
FRDPersonnelStartDate                22880
Dim_Procedure_PK_y                   22880
Procedure_Performed_Code_y           22880
Procedure_Performed_Description_y    22880
i                                        0
dtype: int64


### Rename columns with _x

In [37]:
dfProcLeftJoin.rename(columns = {'Dim_Procedure_PK_x':'Dim_Procedure_PK',
                                     'PatientId_x':'PatientId',
                                     'Procedure_Performed_Code_x':'Procedure_Performed_Code',
                                     'Procedure_Performed_Description_x':'Procedure_Performed_Description',
                                     'FRDPersonnelID_x':'FRDPersonnelID'}, inplace = True) 
print("\nAfter modifying first column:\n", dfProcLeftJoin.columns)


After modifying first column:
 Index(['Dim_Procedure_PK', 'PatientId', 'Procedure_Performed_Code',
       'Procedure_Performed_Description', 'FRDPersonnelID', 'PK',
       'PatientId_y', 'FRDPersonnelID_y', 'PatientOutcome', 'PatientGender',
       'DispatchTime', 'FRDPersonnelGender', 'FRDPersonnelStartDate',
       'Dim_Procedure_PK_y', 'Procedure_Performed_Code_y',
       'Procedure_Performed_Description_y', 'i'],
      dtype='object')


### Drop the all-NaN columns and the _y columns, which are no longer needed

In [38]:
dfProcLeftJoin = dfProcLeftJoin.drop(['PatientId_y', 'FRDPersonnelID_y', 'PatientOutcome', 'PatientGender',
                                      'DispatchTime', 'FRDPersonnelGender', 'FRDPersonnelStartDate',
                                      'Dim_Procedure_PK_y', 'Procedure_Performed_Code_y', 
                                      'Procedure_Performed_Description_y', 'i'],axis=1)
print("\nAfter modifying first column:\n", dfProcLeftJoin.columns)


After modifying first column:
 Index(['Dim_Procedure_PK', 'PatientId', 'Procedure_Performed_Code',
       'Procedure_Performed_Description', 'FRDPersonnelID', 'PK'],
      dtype='object')


In [39]:
dfProcLeftJoin.shape

(22880, 6)

### Display count of NULLs by column

In [40]:
print(" \nCount total NaN at each column in a Procedures+Patients dataFrame : \n\n", dfProcLeftJoin.isnull().sum())

 
Count total NaN at each column in a Procedures+Patients dataFrame : 

 Dim_Procedure_PK                   0
PatientId                          0
Procedure_Performed_Code           0
Procedure_Performed_Description    0
FRDPersonnelID                     0
PK                                 0
dtype: int64


## Build lookup reference dataframe from Patients dataframe containing a distinct list of 
```
FRDPersonnelID
FRDPersonnelGender
FRDPersonnelStartDate
```
## Then join lookup dataframe luProvider with dfProcLeftJoin on FRDPersonnelID only

In [41]:
# Create Lookup table for Providers
luProvider = dfPatients.groupby(['FRDPersonnelID','FRDPersonnelGender', 'FRDPersonnelStartDate']).size().reset_index(name='count')
luProvider.shape

(1830, 4)

In [42]:
luProvider.head(5)

Unnamed: 0,FRDPersonnelID,FRDPersonnelGender,FRDPersonnelStartDate,count
0,00D5C99E-9E01-E211-B5F5-78E7D18CFD3C,Male,2008-04-28,1200
1,00D6C99E-9E01-E211-B5F5-78E7D18CFD3C,Male,2007-09-17,388
2,00D7C99E-9E01-E211-B5F5-78E7D18CFD3C,Male,1993-09-20,21
3,00D8C99E-9E01-E211-B5F5-78E7D18CFD3C,Male,1991-06-03,5
4,00D9C99E-9E01-E211-B5F5-78E7D18CFD3C,Male,2003-04-19,249


In [43]:
# Drop count column, not needed
luProvider = luProvider.drop(['count'],axis=1)
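
The group-by-then-drop-count pattern builds a distinct list of providers. A minimal sketch of an alternative using `dropna` and `drop_duplicates` follows, on made-up data. It also checks that each FRDPersonnelID appears only once, since a provider listed with two different genders or start dates would multiply rows in the merge that follows.

```python
import pandas as pd
import numpy as np

# Toy data (assumption): provider A appears twice, provider C has NULL attributes
toy = pd.DataFrame({
    'FRDPersonnelID': ['B', 'A', 'A', 'C'],
    'FRDPersonnelGender': ['Male', 'Female', 'Female', np.nan],
    'FRDPersonnelStartDate': pd.to_datetime(
        ['2008-04-28', '2007-09-17', '2007-09-17', None]),
})

cols = ['FRDPersonnelID', 'FRDPersonnelGender', 'FRDPersonnelStartDate']

# Distinct, fully populated provider rows, sorted like a groupby result
lookup = (toy[cols].dropna()
                   .drop_duplicates()
                   .sort_values(cols)
                   .reset_index(drop=True))

# The notebook's pattern, for comparison (groupby skips NULL keys by default)
via_groupby = (toy.groupby(cols).size()
                  .reset_index(name='count')
                  .drop(['count'], axis=1))

# True when every provider maps to exactly one gender and start date
one_row_per_provider = lookup['FRDPersonnelID'].is_unique

print(lookup)
```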

# Join lookup dataframe luProvider with dfProcLeftJoin on FRDPersonnelID only

In [44]:
dfProcPatOrphans = dfProcLeftJoin.merge(luProvider, 
                        on=('FRDPersonnelID'))
dfProcPatOrphans.shape

(22869, 8)

In [45]:
dfProcPatOrphans.head()

Unnamed: 0,Dim_Procedure_PK,PatientId,Procedure_Performed_Code,Procedure_Performed_Description,FRDPersonnelID,PK,FRDPersonnelGender,FRDPersonnelStartDate
0,180516,479885,392230005,IV Start - Extremity Vein (arm or leg),17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,479885_17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,Male,2005-07-25
1,185426,486284,392230005,IV Start - Extremity Vein (arm or leg),17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,486284_17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,Male,2005-07-25
2,186800,488083,392230005,IV Start - Extremity Vein (arm or leg),17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,488083_17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,Male,2005-07-25
3,209301,514333,392230005,IV Start - Extremity Vein (arm or leg),17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,514333_17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,Male,2005-07-25
4,214483,522281,392230005,IV Start - Extremity Vein (arm or leg),17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,522281_17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,Male,2005-07-25


### dfProcLeftJoin has 22880 rows but dfProcPatOrphans has 22869, a difference of 11 rows, so run a quick analysis to see why

In [46]:
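# Note: this is an alias, not a copy, so the in-place rename below also renames PK to _PK in dfProcPatOrphans
dfProcPatOrphansAnalysis = dfProcPatOrphans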

In [47]:
dfProcPatOrphansAnalysis.rename(columns = {'PK':'_PK'}, inplace = True)

In [48]:
procPKi = dfProcLeftJoin.set_index('PK')
procPatPKi = dfProcPatOrphansAnalysis.set_index('_PK')

dfProcLeftJoinAnalysis = dfProcLeftJoin.merge(procPatPKi, 
                             how = 'left', 
                             left_on = ('PK'),
                             right_on = ('_PK'), 
                             indicator = 'i').query('i == "left_only"')

In [49]:
dfProcLeftJoinAnalysis.shape

(11, 14)

In [50]:
dfProcLeftJoinAnalysis

Unnamed: 0,Dim_Procedure_PK_x,PatientId_x,Procedure_Performed_Code_x,Procedure_Performed_Description_x,FRDPersonnelID_x,PK,Dim_Procedure_PK_y,PatientId_y,Procedure_Performed_Code_y,Procedure_Performed_Description_y,FRDPersonnelID_y,FRDPersonnelGender,FRDPersonnelStartDate,i
146,184565,485335,392230005,IV Start - Extremity Vein (arm or leg),4CD8C99E-9E01-E211-B5F5-78E7D18CFD3C,485335_4CD8C99E-9E01-E211-B5F5-78E7D18CFD3C,,,,,,,NaT,left_only
414,192135,494708,392230005,IV Start - Extremity Vein (arm or leg),AAD8C99E-9E01-E211-B5F5-78E7D18CFD3C,494708_AAD8C99E-9E01-E211-B5F5-78E7D18CFD3C,,,,,,,NaT,left_only
1509,223503,531908,392230005,IV Start - Extremity Vein (arm or leg),6AD7C99E-9E01-E211-B5F5-78E7D18CFD3C,531908_6AD7C99E-9E01-E211-B5F5-78E7D18CFD3C,,,,,,,NaT,left_only
1657,228522,538259,392230005,IV Start - Extremity Vein (arm or leg),AAD8C99E-9E01-E211-B5F5-78E7D18CFD3C,538259_AAD8C99E-9E01-E211-B5F5-78E7D18CFD3C,,,,,,,NaT,left_only
4512,956627,621689,392230005,IV Start - Extremity Vein (arm or leg),AAD8C99E-9E01-E211-B5F5-78E7D18CFD3C,621689_AAD8C99E-9E01-E211-B5F5-78E7D18CFD3C,,,,,,,NaT,left_only
8329,1068456,731691,392230005,IV Start - Extremity Vein (arm or leg),0707D573-B26A-4E75-99AC-5C135B8F3271,731691_0707D573-B26A-4E75-99AC-5C135B8F3271,,,,,,,NaT,left_only
8937,1087195,751013,392230005,IV Start - Extremity Vein (arm or leg),AAD8C99E-9E01-E211-B5F5-78E7D18CFD3C,751013_AAD8C99E-9E01-E211-B5F5-78E7D18CFD3C,,,,,,,NaT,left_only
13213,1156914,892544,268400002,CV - ECG - 12 Lead Obtained,11D6C99E-9E01-E211-B5F5-78E7D18CFD3C,892544_11D6C99E-9E01-E211-B5F5-78E7D18CFD3C,,,,,,,NaT,left_only
18642,1188677,1075825,392230005,IV Start - Extremity Vein (arm or leg),43D2C99E-9E01-E211-B5F5-78E7D18CFD3C,1075825_43D2C99E-9E01-E211-B5F5-78E7D18CFD3C,,,,,,,NaT,left_only
24339,1221439,1261788,268400002,CV - ECG - 12 Lead Obtained,1AC85E51-5ED2-468F-B317-41181CF6720C,1261788_1AC85E51-5ED2-468F-B317-41181CF6720C,,,,,,,NaT,left_only


### These 8 Provider IDs are not found at all in the Patients dataframe (or the Provider lookup dataframe)

In [51]:
dfProcLeftJoinAnalysis.groupby(['FRDPersonnelID_x']).size().reset_index(name='count')

Unnamed: 0,FRDPersonnelID_x,count
0,0707D573-B26A-4E75-99AC-5C135B8F3271,1
1,11D6C99E-9E01-E211-B5F5-78E7D18CFD3C,1
2,1AC85E51-5ED2-468F-B317-41181CF6720C,1
3,43D2C99E-9E01-E211-B5F5-78E7D18CFD3C,1
4,4CD8C99E-9E01-E211-B5F5-78E7D18CFD3C,1
5,6AD7C99E-9E01-E211-B5F5-78E7D18CFD3C,1
6,8F31A24D-3506-4978-9197-4B16E5AA82A5,1
7,AAD8C99E-9E01-E211-B5F5-78E7D18CFD3C,4


# Join dfProcPatOrphans dataframe back with dfPatients to bring back in
```
PatientOutcome
PatientGender
```

In [52]:
# A quick reminder of dfPatients schema
dfPatients.columns

Index(['PatientId', 'FRDPersonnelID', 'Shift', 'UnitId', 'FireStation',
       'Battalion', 'PatientOutcome', 'PatientGender', 'CrewMemberRoles',
       'DispatchTime', 'FRDPersonnelGender', 'FRDPersonnelStartDate'],
      dtype='object')

In [53]:
# A quick reminder of dfProcPatOrphans schema
dfProcPatOrphans.columns

Index(['Dim_Procedure_PK', 'PatientId', 'Procedure_Performed_Code',
       'Procedure_Performed_Description', 'FRDPersonnelID', '_PK',
       'FRDPersonnelGender', 'FRDPersonnelStartDate'],
      dtype='object')

## Build lookup reference dataframe from Patients dataframe containing a distinct list of 
```
PatientId
PatientOutcome
PatientGender
DispatchTime
```
## Then join/merge lookup dataframe luPatients with dfProcPatOrphans on PatientId only

In [54]:
luPatients = dfPatients.groupby(['PatientId', 'PatientOutcome', 'PatientGender','DispatchTime']).size().reset_index(name='count')
luPatients.shape

(189513, 5)

In [55]:
luPatients.dtypes

PatientId                  int64
PatientOutcome            object
PatientGender             object
DispatchTime      datetime64[ns]
count                      int64
dtype: object

In [56]:
luPatients.head()

Unnamed: 0,PatientId,PatientOutcome,PatientGender,DispatchTime,count
0,479851,Patient Refusal (AMA),Male,2018-01-01 00:20:49,2
1,479853,Treated & Transported,Male,2018-01-01 00:04:06,2
2,479862,Treated & Transported,Female,2018-01-01 00:44:31,4
3,479876,Treated & Transported,Male,2018-01-01 00:42:53,3
4,479879,Treated & Transported,Female,2018-01-01 01:17:54,2


In [57]:
# Drop count column, not needed
luPatients = luPatients.drop(['count'],axis=1)
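
One detail of this lookup is worth keeping in mind: by default `groupby` skips rows where any grouping key is NULL, so patients with a NULL PatientOutcome or PatientGender do not appear in a lookup built this way. The sketch below shows the effect on made-up data. The `dropna=False` option is available in pandas 1.1 and later, and it is shown only for comparison, not as what the notebook does.

```python
import pandas as pd
import numpy as np

# Toy data (assumption): patient 2 has no recorded gender
toy = pd.DataFrame({
    'PatientId': [1, 1, 2, 3],
    'PatientGender': ['Male', 'Male', np.nan, 'Female'],
})

keys = ['PatientId', 'PatientGender']

# Default behaviour: groups with a NULL key are silently left out
default_lookup = toy.groupby(keys).size().reset_index(name='count')

# dropna=False keeps the NULL-key group as its own row
keep_nan_lookup = toy.groupby(keys, dropna=False).size().reset_index(name='count')

print(len(default_lookup), len(keep_nan_lookup))
```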

### Merge lookup dataframe luPatients with dfProcPatOrphans on PatientId only

In [58]:
dfProcPatSub = dfProcPatOrphans.merge(luPatients,on=('PatientId'))
dfProcPatSub.shape

(22791, 11)

In [59]:
dfProcPatSub.head()

Unnamed: 0,Dim_Procedure_PK,PatientId,Procedure_Performed_Code,Procedure_Performed_Description,FRDPersonnelID,_PK,FRDPersonnelGender,FRDPersonnelStartDate,PatientOutcome,PatientGender,DispatchTime
0,180516,479885,392230005,IV Start - Extremity Vein (arm or leg),17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,479885_17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,Male,2005-07-25,Treated & Transported,Male,2018-01-01 01:08:08
1,185426,486284,392230005,IV Start - Extremity Vein (arm or leg),17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,486284_17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,Male,2005-07-25,Treated & Transported,Female,2018-01-07 14:42:01
2,186800,488083,392230005,IV Start - Extremity Vein (arm or leg),17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,488083_17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,Male,2005-07-25,Treated & Transported,Female,2018-01-09 06:51:21
3,209301,514333,392230005,IV Start - Extremity Vein (arm or leg),17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,514333_17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,Male,2005-07-25,Treated & Transported,Female,2018-02-07 19:32:10
4,214483,522281,392230005,IV Start - Extremity Vein (arm or leg),17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,522281_17D3C99E-9E01-E211-B5F5-78E7D18CFD3C,Male,2005-07-25,Treated & Transported,Female,2018-02-16 15:45:03


In [60]:
dfProcPatSub.columns

Index(['Dim_Procedure_PK', 'PatientId', 'Procedure_Performed_Code',
       'Procedure_Performed_Description', 'FRDPersonnelID', '_PK',
       'FRDPersonnelGender', 'FRDPersonnelStartDate', 'PatientOutcome',
       'PatientGender', 'DispatchTime'],
      dtype='object')

In [61]:
(" \nCount total NaN at each column in a Procedures dataFrame : \n\n", dfProcPatSub.isnull().sum())

(' \nCount total NaN at each column in a Procedures dataFrame : \n\n',
 Dim_Procedure_PK                   0
 PatientId                          0
 Procedure_Performed_Code           0
 Procedure_Performed_Description    0
 FRDPersonnelID                     0
 _PK                                0
 FRDPersonnelGender                 0
 FRDPersonnelStartDate              0
 PatientOutcome                     0
 PatientGender                      0
 DispatchTime                       0
 dtype: int64)

In [62]:
(" \nCount total NaN at each column in a Procedures dataFrame : \n\n", dfProcPat.isnull().sum())

(' \nCount total NaN at each column in a Procedures dataFrame : \n\n',
 PatientId                            0
 FRDPersonnelID                       0
 PatientOutcome                       0
 PatientGender                      104
 DispatchTime                         0
 FRDPersonnelGender                   0
 FRDPersonnelStartDate                0
 Dim_Procedure_PK                     0
 Procedure_Performed_Code             0
 Procedure_Performed_Description      0
 _PK                                  0
 dtype: int64)

# Append dfProcPatSub back to dfProcPat and reset the index

In [63]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
dfProcPatAppend = pd.concat([dfProcPat, dfProcPatSub], ignore_index=True)
dfProcPatAppend.shape

(170072, 11)
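
The row counts reported so far can be reconciled with simple arithmetic. The sketch below copies the counts from the cell outputs above (the variable names are illustrative) and works out how many Procedures rows did not make it into the appended dataframe. The second difference is a net change across the PatientId merge, so it does not by itself say why those rows dropped out.

```python
# Row counts copied from the cell outputs in this notebook
procedures_deduped = 170161     # dfProcedures after drop_duplicates
inner_join_rows = 147281        # dfProcPat
orphan_rows = 22880             # dfProcLeftJoin
orphans_with_provider = 22869   # dfProcPatOrphans
orphans_with_patient = 22791    # dfProcPatSub
final_rows = 170072             # dfProcPatAppend

# Every deduplicated procedure is either matched or an orphan
assert inner_join_rows + orphan_rows == procedures_deduped

# The appended dataframe is the matched rows plus the recovered orphans
assert inner_join_rows + orphans_with_patient == final_rows

# Orphans whose provider is missing from the Provider lookup
lost_no_provider = orphan_rows - orphans_with_provider

# Net change in rows across the merge with the Patient lookup
net_change_patient_merge = orphans_with_provider - orphans_with_patient

# Procedures absent from the final dataframe
lost_total = procedures_deduped - final_rows

print(lost_no_provider, net_change_patient_merge, lost_total)
```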

In [64]:
(" \nCount total NaN at each column in a Procedures dataFrame : \n\n", dfProcPatAppend.isnull().sum())

(' \nCount total NaN at each column in a Procedures dataFrame : \n\n',
 PatientId                            0
 FRDPersonnelID                       0
 PatientOutcome                       0
 PatientGender                      104
 DispatchTime                         0
 FRDPersonnelGender                   0
 FRDPersonnelStartDate                0
 Dim_Procedure_PK                     0
 Procedure_Performed_Code             0
 Procedure_Performed_Description      0
 _PK                                  0
 dtype: int64)

In [65]:
print_categories(dfProcPatAppend)

PatientId :  Numerical
FRDPersonnelID :  Categorical
PatientOutcome :  Categorical
PatientGender :  Categorical
DispatchTime :  Date
FRDPersonnelGender :  Categorical
FRDPersonnelStartDate :  Date
Dim_Procedure_PK :  Numerical
Procedure_Performed_Code :  Numerical
Procedure_Performed_Description :  Categorical
_PK :  Categorical


In [66]:
dfProcPatAppend['PatientOutcome'].value_counts()

Treated & Transported                        161083
Patient Refusal  (AMA)                         6672
Patient Dead at Scene (EMS CPR Attempted)      1591
Patient Dead at Scene (No EMS CPR)              371
Treated, Transferred Care                       243
EMS Assist (Other Agency)                       103
No Treatment/Transport Required                   7
Standby (Operational Support Provided)            1
Canceled (On Scene, No Patient Contact)           1
Name: PatientOutcome, dtype: int64

In [67]:
dfProcPatAppend['PatientGender'].value_counts()

Male                             89956
Female                           79942
Unknown (Unable to Determine)       70
Name: PatientGender, dtype: int64

In [68]:
dfProcPatAppend['FRDPersonnelGender'].value_counts()

Male      140225
Female     29847
Name: FRDPersonnelGender, dtype: int64

In [69]:
dfProcPatAppend['Procedure_Performed_Description'].value_counts()

CV - ECG - 12 Lead Obtained                   66977
IV Start - Extremity Vein (arm or leg)        59840
Electrocardiographic monitoring                4756
MS - Cervical Spinal Restriction of Motion     4699
Assess - Assessment of Patient                 4192
                                              ...  
Laryngoscopy - Indirect                           4
Cricothyrotomy - Needle                           3
Enviro - Hypothermia Induction Therapy            3
Intubation - Existing Tracheostomy Stoma          3
Infectious - Decontamination of Patient           3
Name: Procedure_Performed_Description, Length: 68, dtype: int64

# Add TenureMonths, the number of months the provider had worked at the time of the dispatch call

In [70]:
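# Tenure in whole months: elapsed time between the provider's start date and the
# dispatch date, divided by one month and truncated toward zero.
# Note: this cell emits a warning (not an error) even though .loc is used;
# the cause still needs to be tracked down.
dfProcPatAppend['TenureMonths'] = ((dfProcPatAppend.loc[:, 'DispatchTime'].dt.date -
                                     dfProcPatAppend.loc[:, 'FRDPersonnelStartDate'].dt.date) /
                                    np.timedelta64(1, 'M')).astype(int)
dfProcPatAppend.shape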

(170072, 12)
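
The tenure calculation can be reproduced on a small sample to check the arithmetic. The sketch below is not the notebook's own code: it assumes a month is the average Gregorian month of 365.2425 / 12 days, and `tenure_months`, `AVG_DAYS_PER_MONTH` and `demo` are hypothetical names. With that assumption, the two sample rows give the same 161 and 132 values that appear for these dates in the `head()` output of this notebook.

```python
import pandas as pd

# Assumption: one "month" is the average Gregorian month length in days.
AVG_DAYS_PER_MONTH = 365.2425 / 12

def tenure_months(dispatch, start):
    """Whole months between two datetime Series, truncated toward zero."""
    # normalize() drops the time of day so only calendar dates are compared
    days = (dispatch.dt.normalize() - start.dt.normalize()).dt.days
    return (days / AVG_DAYS_PER_MONTH).astype(int)

# Two date pairs copied from the notebook's head() output
demo = pd.DataFrame({
    'DispatchTime': pd.to_datetime(['2018-01-01 00:04:06', '2018-01-01 00:44:31']),
    'FRDPersonnelStartDate': pd.to_datetime(['2004-07-12', '2006-12-11']),
})
demo['TenureMonths'] = tenure_months(demo['DispatchTime'], demo['FRDPersonnelStartDate'])
print(demo)
```

Spelling out the month length makes the truncation explicit: 4921 elapsed days is about 161.7 average months, which is stored as 161.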

In [71]:
dfProcPatAppend['TenureMonths'].value_counts()

 51     1790
 56     1725
 49     1713
 48     1696
 50     1664
        ... 
 420       1
-27        1
 453       1
 373       1
 417       1
Name: TenureMonths, Length: 437, dtype: int64
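
The counts above include a negative tenure (-27), which means at least one dispatch is dated before the provider's recorded start date. A minimal sketch for isolating such rows follows; it uses made-up sample values and hypothetical provider IDs, and it does not say which of the two dates is wrong.

```python
import pandas as pd

# Hypothetical sample: one row has a dispatch dated before the start date
demo = pd.DataFrame({
    'FRDPersonnelID': ['A', 'B', 'C'],   # placeholder IDs, not real data
    'TenureMonths': [161, -27, 132],
})

# Boolean mask selecting the rows whose tenure is negative
negative = demo[demo['TenureMonths'] < 0]
print(negative)
```

On the real dataframe the same mask, `dfProcPatAppend['TenureMonths'] < 0`, would list the rows to review before the column is used for modeling.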

### Add a factorized column whose codes start at 1 rather than Python's usual 0

In [72]:
dfProcPatAppend['PatientOutcomeCode'] = pd.factorize(dfProcPatAppend['PatientOutcome'])[0] + 1

In [73]:
dfProcPatAppend['PatientOutcomeCode'].value_counts()

1    161083
2      6672
6      1591
5       371
3       243
4       103
8         7
9         1
7         1
Name: PatientOutcomeCode, dtype: int64

In [74]:
dfProcPatAppend['PatientGenderCode'] = pd.factorize(dfProcPatAppend['PatientGender'])[0] + 1

In [75]:
dfProcPatAppend['PatientGenderCode'].value_counts()

1    89956
2    79942
0      104
3       70
Name: PatientGenderCode, dtype: int64
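
The code 0 in the counts above is worth a note. `pd.factorize` marks missing values with -1, so adding 1 turns them into 0, and the 104 rows coded 0 are consistent with rows where PatientGender is missing. Codes are also assigned in order of first appearance, not by frequency. The sketch below illustrates both points on a made-up Series.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value
gender = pd.Series(['Male', 'Female', np.nan,
                    'Unknown (Unable to Determine)', 'Male'])

codes, uniques = pd.factorize(gender)   # missing values receive the code -1
shifted = codes + 1                     # shift so real categories start at 1

print(list(codes))
print(list(shifted))
print(list(uniques))
```

If a 0 code for missing values is not wanted downstream, the missing rows could be filled or dropped before factorizing; which option fits is a modeling decision this sketch does not make.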

### Use the "One Hot Encoding" method via get_dummies to create one 1/0 indicator column for each PatientGender value

In [76]:
dumDumPatient = pd.get_dummies(dfProcPatAppend['PatientGender'], prefix='PatientGender')

In [77]:
dumDumPatient.head()

Unnamed: 0,PatientGender_Female,PatientGender_Male,PatientGender_Unknown (Unable to Determine)
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,1,0,0


In [78]:
dumDumPatient.value_counts()

PatientGender_Female  PatientGender_Male  PatientGender_Unknown (Unable to Determine)
0                     1                   0                                              89956
1                     0                   0                                              79942
0                     0                   0                                                104
                                          1                                                 70
dtype: int64
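
The 104 rows that are 0 in every indicator column match the rows with no PatientGender value: by default `get_dummies` creates no column for missing values. The sketch below, on a made-up Series, shows the default behaviour next to the `dummy_na=True` option, which adds an explicit indicator for missing values. Whether the notebook should use that option is left open here.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value
gender = pd.Series(['Male', 'Female', np.nan], name='PatientGender')

# Default: the missing row is all zeros across the indicator columns
plain = pd.get_dummies(gender, prefix='PatientGender')

# dummy_na=True adds one extra indicator column for missing values
with_na = pd.get_dummies(gender, prefix='PatientGender', dummy_na=True)

print(plain.sum(axis=1).tolist())
print(with_na.sum(axis=1).tolist())
```

With the extra column, every row has exactly one indicator set, so missing values remain distinguishable after encoding.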

In [79]:
dfProcPatAppend = pd.concat([dfProcPatAppend, dumDumPatient], axis=1)

In [80]:
dfProcPatAppend.head()

Unnamed: 0,PatientId,FRDPersonnelID,PatientOutcome,PatientGender,DispatchTime,FRDPersonnelGender,FRDPersonnelStartDate,Dim_Procedure_PK,Procedure_Performed_Code,Procedure_Performed_Description,_PK,TenureMonths,PatientOutcomeCode,PatientGenderCode,PatientGender_Female,PatientGender_Male,PatientGender_Unknown (Unable to Determine)
0,479853,58D5C99E-9E01-E211-B5F5-78E7D18CFD3C,Treated & Transported,Male,2018-01-01 00:04:06,Male,2004-07-12,180498,392230005,IV Start - Extremity Vein (arm or leg),479853_58D5C99E-9E01-E211-B5F5-78E7D18CFD3C,161,1,1,0,1,0
1,479853,58D5C99E-9E01-E211-B5F5-78E7D18CFD3C,Treated & Transported,Male,2018-01-01 00:04:06,Male,2004-07-12,180499,268400002,CV - ECG - 12 Lead Obtained,479853_58D5C99E-9E01-E211-B5F5-78E7D18CFD3C,161,1,1,0,1,0
2,479853,58D5C99E-9E01-E211-B5F5-78E7D18CFD3C,Treated & Transported,Male,2018-01-01 00:04:06,Male,2004-07-12,180500,268400002,CV - ECG - 12 Lead Obtained,479853_58D5C99E-9E01-E211-B5F5-78E7D18CFD3C,161,1,1,0,1,0
3,479853,58D5C99E-9E01-E211-B5F5-78E7D18CFD3C,Treated & Transported,Male,2018-01-01 00:04:06,Male,2004-07-12,180501,268400002,CV - ECG - 12 Lead Obtained,479853_58D5C99E-9E01-E211-B5F5-78E7D18CFD3C,161,1,1,0,1,0
4,479862,32D8C99E-9E01-E211-B5F5-78E7D18CFD3C,Treated & Transported,Female,2018-01-01 00:44:31,Male,2006-12-11,180510,392230005,IV Start - Extremity Vein (arm or leg),479862_32D8C99E-9E01-E211-B5F5-78E7D18CFD3C,132,1,2,1,0,0


### Add a factorized code column for Provider Gender, again starting at 1

In [81]:
dfProcPatAppend['ProviderGenderCode'] = pd.factorize(dfProcPatAppend['FRDPersonnelGender'])[0] + 1

In [82]:
dfProcPatAppend['ProviderGenderCode'].value_counts()

1    140225
2     29847
Name: ProviderGenderCode, dtype: int64

In [83]:
print_categories(dfProcPatAppend)

PatientId :  Numerical
FRDPersonnelID :  Categorical
PatientOutcome :  Categorical
PatientGender :  Categorical
DispatchTime :  Date
FRDPersonnelGender :  Categorical
FRDPersonnelStartDate :  Date
Dim_Procedure_PK :  Numerical
Procedure_Performed_Code :  Numerical
Procedure_Performed_Description :  Categorical
_PK :  Categorical
TenureMonths :  Numerical
PatientOutcomeCode :  Numerical
PatientGenderCode :  Numerical
PatientGender_Female :  Numerical
PatientGender_Male :  Numerical
PatientGender_Unknown (Unable to Determine) :  Numerical
ProviderGenderCode :  Numerical


### Output the intermediate Procedures+Patients dataframe to a CSV file in the Intermediate data directory

In [85]:
dfProcPatAppend.to_csv(r'../data/02_intermediate/ProceduresPatients-Intermediate.csv', index=False)
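
CSV stores dates as plain text, so a later script that reads this file will get string columns unless it asks for dates to be parsed. The sketch below shows the round trip on a one-row sample, using an in-memory buffer as a stand-in for the real file; `demo`, `buffer` and `restored` are hypothetical names.

```python
import io
import pandas as pd

# One-row sample using column names from the notebook
demo = pd.DataFrame({
    'PatientId': [479853],
    'DispatchTime': pd.to_datetime(['2018-01-01 00:04:06']),
    'FRDPersonnelStartDate': pd.to_datetime(['2004-07-12']),
})

# Write to an in-memory buffer instead of the intermediate data directory
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype on the listed columns
restored = pd.read_csv(buffer, parse_dates=['DispatchTime', 'FRDPersonnelStartDate'])
print(restored.dtypes)
```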