# TQIP Data Preprocessing

This notebook illustrates how we preprocessed the data for the paper "Assessing the Utility of Deep Neural Networks in Dynamic Risk Prediction After Trauma".

The TQIP database can be requested through the American College of Surgeons website at
https://facs.org

## Notebook set-up

Import pre-installed packages: 

In [None]:
import sys
import fastai
import pandas as pd
from pathlib import Path
from fastai.tabular.all import *
from platform import python_version 

Package versions: 

In [None]:
print("Python version: " + python_version())
print("Pandas version: " + pd.__version__)
print("Pytorch version: " + torch.__version__)
print("Fastai version: " + fastai.__version__)

Python version: 3.8.3
Pandas version: 1.1.4
Pytorch version: 1.6.0
Fastai version: 2.0.11


Set a seed for reproducible results:

In [None]:
seed = 42

Disable warnings for chained assignments:

In [None]:
pd.options.mode.chained_assignment=None

Create a path to the TQIP data folder:

In [None]:
data = Path(r'E:\Data\TQIP')  # raw string, so the backslashes are not read as escape sequences

## Input variables

Load the 2017 TQIP database into a dataframe:

In [None]:
TQIP = pd.read_csv(data/'2017/PUF_TRAUMA.csv', low_memory=False)

### Ordinal columns

Preprocess columns that contain strings with a natural order:

In [None]:
ordinal_colums = ['EMSGCSEYE', 'EMSGCSVERBAL', 'GCSEYE', 'GCSVERBAL', 'GCSMOTOR', 'VERIFICATIONLEVEL', 
                  'PEDIATRICVERIFICATIONLEVEL', 'STATEDESIGNATION', 'STATEPEDIATRICDESIGNATION', 'BEDSIZE']

Define a function that converts a column to an ordered categorical:

In [None]:
def ordinal(column, sizes):
    # Cast to categorical, then impose the given order; values not listed in `sizes` become NaN
    TQIP[column] = TQIP[column].astype('category').cat.set_categories(sizes, ordered=True)

In [None]:
ordinal('EMSGCSEYE', ('Opens eyes spontaneously','Opens eyes in response to verbal stimulation',
                      'Opens eyes in response to painful stimulation', 'No eye movement when assessed'))

In [None]:
ordinal('EMSGCSVERBAL', ('Smiles, oriented to sounds, follows objects, interacts (P) | Oriented (A)',
                      'Cries but is consolable, inappropriate interactions (P) | Confused (A)',
                      'Inconsistently consolable, moaning (P) | Inappropriate words (A)',
                      'Inconsolable, agitated (P) | Incomprehensible sounds (A)', 
                      'No vocal response (P) | No verbal response (A)'))

In [None]:
ordinal('EMSGCSMOTOR', ('Appropriate response to stimulation (P) | Obeys commands (A)', 'Localizing pain', 
                        'Withdrawal from pain', 'Flexion to pain', 'Extension to pain', 'No motor response'))

In [None]:
ordinal('GCSEYE', ('Opens eyes spontaneously','Opens eyes in response to verbal stimulation',
                   'Opens eyes in response to painful stimulation', 'No eye movement when assessed'))

In [None]:
ordinal('GCSVERBAL', ('Smiles, oriented to sounds, follows objects, interacts (P) | Oriented (A)',
                      'Cries but is consolable, inappropriate interactions (P) | Confused (A)',
                      'Inconsistently consolable, moaning (P) | Inappropriate words (A)',
                      'Inconsolable, agitated (P) | Incomprehensible sounds (A)', 
                      'No vocal response (P) | No verbal response (A)'))

In [None]:
ordinal('GCSMOTOR', ('Appropriate response to stimulation (P) | Obeys commands (A)', 'Localizing pain', 'Withdrawal from pain',
                      'Flexion to pain', 'Extension to pain', 'No motor response'))

In [None]:
ordinal('VERIFICATIONLEVEL', ('I - Level I Trauma Center', 'II - Level II Trauma Center', 'III - Level III Trauma Center'))

In [None]:
ordinal('PEDIATRICVERIFICATIONLEVEL', ('I - Level I Pediatric Trauma Center', 'II - Level II Pediatric Trauma Center'))

In [None]:
ordinal('STATEDESIGNATION', ('I', 'II', 'III', 'IV', 'Other', 'Not applicable'))

In [None]:
ordinal('STATEPEDIATRICDESIGNATION', ('I', 'II', 'III', 'IV', 'Other', 'Not Applicable'))

In [None]:
ordinal('BEDSIZE', ('<= 200', '201-400','401-600', '> 600'))
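To see what the ordered categories above do to a column, here is a minimal sketch on made-up values (the toy dataframe is an illustration, not TQIP data). It shows that the integer codes follow the order we pass in, and that both missing values and labels outside the list end up with code -1.

```python
import pandas as pd

# Toy stand-in for a TQIP column (values are illustrative, not real data).
toy = pd.DataFrame({"BEDSIZE": ["<= 200", "> 600", "201-400", "unknown", None]})
levels = ["<= 200", "201-400", "401-600", "> 600"]

# Same idea as ordinal(): cast to category, then impose an explicit order.
toy["BEDSIZE"] = toy["BEDSIZE"].astype("category").cat.set_categories(levels, ordered=True)

# Labels outside `levels` and missing values both receive code -1.
codes = toy["BEDSIZE"].cat.codes.tolist()
print(codes)
```

Because unlisted labels are silently turned into missing values, the spelling and capitalisation of each category string matter.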

Check that all columns have been coded correctly: 

In [None]:
for i in ordinal_colums:
    # Compare the value counts of the labels with the value counts of the integer codes
    str_vc = TQIP[i].value_counts(dropna=False).to_list()
    cat_vc = TQIP[i].cat.codes.value_counts(dropna=False).tolist()
    print(i, 'string value counts == category value counts:', str_vc == cat_vc)

EMSGCSEYE string value counts == category value counts: True
EMSGCSVERBAL string value counts == category value counts: True
GCSEYE string value counts == category value counts: True
GCSVERBAL string value counts == category value counts: True
GCSMOTOR string value counts == category value counts: True
VERIFICATIONLEVEL string value counts == category value counts: True
PEDIATRICVERIFICATIONLEVEL string value counts == category value counts: True
STATEDESIGNATION string value counts == category value counts: True
STATEPEDIATRICDESIGNATION string value counts == category value counts: True
BEDSIZE string value counts == category value counts: True
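Because this check runs after the conversion, it probably cannot flag a label that was misspelled in the category list: such values have already become NaN and are counted identically on both sides. A complementary check, sketched below, compares the raw strings with the category list before converting. The helper name `unlisted_labels` and the example values are hypothetical.

```python
import pandas as pd

def unlisted_labels(raw, categories):
    """Return the non-missing raw values that are absent from `categories`."""
    observed = set(raw.dropna().unique())
    return sorted(observed - set(categories))

# Illustrative raw column; note the capital "A" in "Not Applicable".
raw = pd.Series(["I", "II", "Not Applicable", None, "II"])
missing = unlisted_labels(raw, ("I", "II", "III", "IV", "Other", "Not applicable"))
print(missing)
```

An empty list would mean every raw label is covered by the category list.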


### Filter input variables

Recode input variables that were **not** collected within the first day of hospitalization:

In [None]:
TQIP.loc[TQIP.VTEProphylaxisDays != 1, "VTEPROPHYLAXISTYPE"] = 'None'

In [None]:
TQIP.loc[TQIP.HMRRHGCTRLSURGDays != 1, "HMRRHGCTRLSURGTYPE"] = 'None'

In [None]:
TQIP.loc[TQIP.EDDAYS != 1, "EDDISCHARGEDISPOSITION"] = 'None'
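One detail of the `!= 1` filter is worth illustrating: a missing day count also compares as "not equal to 1", so those rows are recoded as well. A small sketch on invented rows (not TQIP data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "EDDAYS": [1.0, 2.0, np.nan],
    "EDDISCHARGEDISPOSITION": ["Home", "Deceased/expired", "Operating room"],
})

# NaN != 1 evaluates to True, so rows with a missing day count are recoded too.
toy.loc[toy.EDDAYS != 1, "EDDISCHARGEDISPOSITION"] = "None"
recoded = toy["EDDISCHARGEDISPOSITION"].tolist()
print(recoded)
```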

Save the dataframe in feather format: 

In [None]:
TQIP.to_feather(data/'feather/2017_TQIP')

### The severity of each injured body region

In this section, we derive the severity of each injured body region from the AIS code: the first digit identifies the body region and the last digit gives the severity.

Import AIS diagnoses:

In [None]:
ais_df = pd.read_csv(data/'2017/PUF_AISDIAGNOSIS.csv', low_memory=False)

Create a list with the AIS body-region digits 1-9 (body_reg_num) and a list with the corresponding body regions (body_reg):

In [None]:
body_reg_num =  [str(x) for x in range(1,10)]

In [None]:
body_reg = ['HEAD','FACE','NECK','THORAX','ABDOMEN','SPINE','UPPER_EXT','LOWER_EXT','UNSPEC']

Change AIS codes to strings:

In [None]:
ais_df['AISSeverity'] = ais_df['AISSeverity'].astype('Int32').astype(str)

In [None]:
ais_df['AISPREDOT'] = ais_df['AISPREDOT'].astype('Int32').astype(str)

Create a column for each injured body region, and fill in the severity of that injury:

In [None]:
for x, y in zip(body_reg_num, body_reg):
    # Rows whose pre-dot code starts with this body-region digit
    mask = ais_df['AISPREDOT'].str.startswith(x)
    ais_df.loc[mask, y] = ais_df.loc[mask, 'AISSeverity']
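The loop above can be tried on a few invented rows. In this sketch the pre-dot codes and severities are made up, and only three of the nine regions are used; the point is that each injury row gets its severity written into the column of its own body region, while the other region columns stay empty.

```python
import pandas as pd

toy = pd.DataFrame({
    "inc_key": [1, 1, 2],
    "AISPREDOT": ["140651", "450203", "854471"],  # made-up pre-dot codes
    "AISSeverity": ["3", "2", "9"],
})
regions = {"1": "HEAD", "4": "THORAX", "8": "LOWER_EXT"}

for digit, region in regions.items():
    # Rows whose code starts with this digit belong to this body region.
    mask = toy["AISPREDOT"].str.startswith(digit)
    toy.loc[mask, region] = toy.loc[mask, "AISSeverity"]

print(toy[["inc_key", "HEAD", "THORAX", "LOWER_EXT"]])
```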

Drop redundant columns:

In [None]:
ais_df=ais_df.drop(['AISPREDOT', 'AISPREDOT_BIU', 'AISSeverity', 'AISSeverity_BIU', 'AISVersion'], axis=1)

In the resulting dataframe, each row represents one injury, and each patient may have multiple injuries:

In [None]:
ais_df.shape

(3434829, 10)

Convert the 'long' dataframe to multiple 'wide' dataframes, one for each body region:

In [None]:
for i in body_reg:
    globals()[i + '_Inj'] = pd.crosstab(ais_df['inc_key'], ais_df[i]).add_prefix(i + '_')

We now have one dataframe for each body region. The columns represent the severity of the injury, and the values represent the number of injuries:

In [None]:
data_frames = HEAD_Inj, FACE_Inj, NECK_Inj, THORAX_Inj, ABDOMEN_Inj, SPINE_Inj, UPPER_EXT_Inj, LOWER_EXT_Inj, UNSPEC_Inj

Merge the 'wide' dataframes for each body region:

In [None]:
ais_df = reduce(lambda  left,right: pd.merge(left,right,on=['inc_key'], how='outer'), data_frames)

Fill missing values with 0:

In [None]:
ais_df = ais_df.fillna(0)
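The long-to-wide steps can be sketched end to end on a tiny example. The rows below are invented and only two regions are used; the column names follow the pattern produced above (region, underscore, severity).

```python
from functools import reduce
import pandas as pd

long_df = pd.DataFrame({
    "inc_key": [1, 1, 2, 3],
    "HEAD": ["3", "3", None, None],
    "THORAX": [None, None, "2", None],
})

# One wide table per region: rows are patients, columns are severities, values are counts.
wide = [pd.crosstab(long_df["inc_key"], long_df[r]).add_prefix(r + "_")
        for r in ["HEAD", "THORAX"]]

# Outer-merge the per-region tables and treat "no such injury" as a count of 0.
merged = reduce(lambda l, r: pd.merge(l, r, on="inc_key", how="outer"), wide)
merged = merged.fillna(0).reset_index()
print(merged)
```

Patient 3 has no coded injury in either region and therefore does not appear in the merged table at all.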

Reset the index and save the preprocessed injury characteristics in feather format:

In [None]:
ais_df.reset_index().to_feather(data/'feather/2017_ais')

### Merge injury characteristics with the other input variables

Load dataframes:

In [None]:
TQIP = pd.read_feather(data/'feather/2017_TQIP')

In [None]:
ais_df = pd.read_feather(data/'feather/2017_ais')

Merge dataframes on the patient identifier column (inc_key):

In [None]:
TQIP = TQIP.merge(ais_df, on = 'inc_key', how = 'left')
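A left merge keeps every patient, including those without any row in the injury table; their injury columns are then missing rather than 0. The sketch below uses invented values, and the `validate="one_to_one"` argument is our addition for illustration: it makes pandas raise an error if `inc_key` is duplicated on either side.

```python
import pandas as pd

patients = pd.DataFrame({"inc_key": [1, 2, 3], "AGEYEARS": [34, 71, 52]})
injuries = pd.DataFrame({"inc_key": [1, 2], "HEAD_3": [2.0, 0.0]})

merged = patients.merge(injuries, on="inc_key", how="left", validate="one_to_one")

# Patients without injury rows keep their place but have NaN in the injury columns.
n_without_injury_rows = int(merged["HEAD_3"].isna().sum())
print(merged)
```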

Save the merged dataframe:

In [None]:
TQIP.to_feather(data/'feather/2017_TQIP')

## Output variables

Load the 2017 TQIP dataframe:

In [None]:
TQIP = pd.read_feather(data/'feather/2017_TQIP') 

Delete cases where length of stay was not recorded:

In [None]:
l1 = len(TQIP)
l1

997970

In [None]:
TQIP = TQIP[~TQIP.LOSDays.isna()]

Deleted cases: 

In [None]:
l1 - len(TQIP)

18962

Statistics for length of stay:

In [None]:
TQIP['LOSDays'].quantile(q=(0.25, 0.5, 0.75))

0.25    2.0
0.50    3.0
0.75    6.0
Name: LOSDays, dtype: float64

### Early trauma deaths

Create a new column for early mortality:

In [None]:
TQIP['DECEASED_early'] = 0

Identify all patients who were coded as "Deceased/expired" on **ED discharge**, with a length of stay (ED or total) of 1 day or less:

In [None]:
TQIP.loc[(TQIP.EDDISCHARGEDISPOSITION == "Deceased/expired") & ((TQIP.EDDAYS <= 1.0)|(TQIP.LOSDays <= 1.0)), "DECEASED_early"] = 1

Identify all patients who were coded as "Deceased/expired" on **hospital discharge**, with a total length of stay of 1 day or less:

In [None]:
TQIP.loc[(TQIP.HOSPDISCHARGEDISPOSITION == "Deceased/Expired") & (TQIP.LOSDays <= 1.0), "DECEASED_early"] = 1

### Late trauma deaths

Create a new column for late mortality:

In [None]:
TQIP['DECEASED_late'] = 0

Identify all patients who were coded as "Deceased/expired" on ED discharge, with both an ED and a total length of stay of more than 1 day:

In [None]:
TQIP.loc[(TQIP.EDDISCHARGEDISPOSITION == "Deceased/expired") & ((TQIP.EDDAYS > 1.0)&(TQIP.LOSDays > 1.0)), "DECEASED_late"] = 1

Identify all patients who were coded as "Deceased/expired" on hospital discharge, with a total length of stay of more than 1 day:

In [None]:
TQIP.loc[(TQIP.HOSPDISCHARGEDISPOSITION == "Deceased/Expired") & (TQIP.LOSDays > 1.0), "DECEASED_late"] = 1
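The early/late logic for hospital discharge can be checked on a few invented rows. This sketch builds the two flags directly from boolean masks instead of assigning with `.loc`, which should give the same result; it also shows that a missing length of stay satisfies neither condition.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "HOSPDISCHARGEDISPOSITION": ["Deceased/Expired", "Deceased/Expired", "Home", "Deceased/Expired"],
    "LOSDays": [1.0, 5.0, 3.0, np.nan],
})

died = toy.HOSPDISCHARGEDISPOSITION == "Deceased/Expired"
toy["DECEASED_early"] = (died & (toy.LOSDays <= 1.0)).astype(int)
toy["DECEASED_late"] = (died & (toy.LOSDays > 1.0)).astype(int)
print(toy)
```

In the notebook itself, rows without a recorded length of stay are removed before this step, so the last toy row would not occur there.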

### Post-trauma complications 

Create a list of post-trauma complications (PTCs):

In [None]:
PTC_list = ['HC_CLABSI', 'HC_DEEPSSI', 'HC_DVTHROMBOSIS', 'HC_CARDARREST', 'HC_CAUTI', 'HC_EMBOLISM', 
           'HC_EXTREMITYCS', 'HC_INTUBATION', 'HC_KIDNEY', 'HC_MI', 'HC_ORGANSPACESSI', 
           'HC_RESPIRATORY', 'HC_SEPSIS', 'HC_STROKECVA', 'HC_SUPERFICIALINCISIONSSI', 'HC_PRESSUREULCER', 'HC_VAPNEUMONIA']

Convert yes/no answers to 1/0 values:

In [None]:
for i in PTC_list:
    TQIP[str(i)] = TQIP[str(i)].map({'No':0, 'Yes':1})
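Note that `map` with a dictionary turns every value that is not a key into NaN. The sketch below uses invented answers to show this for a differently capitalised label and for a missing value.

```python
import pandas as pd

answers = pd.Series(["Yes", "No", "yes", None])

# Only the exact keys "No" and "Yes" are translated; everything else becomes NaN.
encoded = answers.map({"No": 0, "Yes": 1})
print(encoded.tolist())
```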

Reset the index and save the dataframe in feather format:

In [None]:
TQIP.reset_index(drop = True).to_feather(data/'feather/2017_TQIP')

## Load and preprocess data

Load the 2017 TQIP dataframe:

In [None]:
TQIP = pd.read_feather(data/'feather/2017_TQIP') 

Select columns used for this project:

In [None]:
col_names_short = ['SEX', 'ASIAN', 'PACIFICISLANDER', 'RACEOTHER', 'AMERICANINDIAN', 'BLACK', 'WHITE', 'ETHNICITY', 
                   'TRANSPORTMODE', 'PREHOSPITALCARDIACARREST', 'TCCGCSLE13', 'TCCSBPLT30', 'TCC10RR29', 'TCCPEN', 'TCCCHEST', 
                   'TCCLONGBONE', 'TCCCRUSHED', 'TCCAMPUTATION', 'TCCPELVIC', 'TCCSKULLFRACTURE', 'TCCPARALYSIS', 
                   'TEACHINGSTATUS', 'BEDSIZE', 'HOSPITALTYPE', 'VERIFICATIONLEVEL', 'PEDIATRICVERIFICATIONLEVEL', 
                   'STATEDESIGNATION', 'STATEPEDIATRICDESIGNATION', 'EMSGCSEYE', 'EMSGCSVERBAL', 'EMSGCSMOTOR', 
                   'PRIMARYECODEICD10', 'AGEYEARS', 'WEIGHT', 'HEIGHT', 'EMSSBP', 'EMSPULSERATE', 'EMSRESPIRATORYRATE', 
                   'EMSPULSEOXIMETRY', 'EMSTOTALGCS', 'GCSEYE', 'GCSVERBAL', 'GCSMOTOR', 'GCSQ_SEDATEDPARALYZED', 
                   'GCSQ_EYEOBSTRUCTION', 'GCSQ_INTUBATED', 'GCSQ_VALID', 'CC_ADHD', 'CC_ALCOHOLISM', 'CC_ANGINAPECTORIS', 
                   'CC_ANTICOAGULANT', 'CC_BLEEDING', 'CC_CHEMO', 'CC_CIRRHOSIS', 'CC_COPD', 'CC_CVA', 'CC_DEMENTIA', 
                   'CC_DIABETES', 'CC_DISCANCER', 'CC_FUNCTIONAL', 'CC_CHF', 'CC_HYPERTENSION', 'CC_MI', 'CC_PAD', 'CC_OTHER', 
                   'CC_MENTALPERSONALITY', 'CC_RENAL', 'CC_SMOKING', 'CC_STEROID', 'CC_SUBSTANCEABUSE', 'HEAD_1', 'HEAD_2', 
                   'HEAD_3', 'HEAD_4', 'HEAD_5', 'HEAD_6', 'HEAD_9', 'FACE_1', 'FACE_2', 'FACE_3', 'FACE_4', 'FACE_5', 'FACE_9',
                   'NECK_1', 'NECK_2', 'NECK_3', 'NECK_4', 'NECK_5', 'NECK_6', 'NECK_9', 'THORAX_1', 'THORAX_2', 'THORAX_3', 
                   'THORAX_4', 'THORAX_5', 'THORAX_6', 'THORAX_9', 'ABDOMEN_1', 'ABDOMEN_2', 'ABDOMEN_3', 'ABDOMEN_4', 
                   'ABDOMEN_5', 'ABDOMEN_6', 'ABDOMEN_9', 'SPINE_1', 'SPINE_2', 'SPINE_3', 'SPINE_4', 'SPINE_5', 'SPINE_6', 
                   'SPINE_9', 'UPPER_EXT_1', 'UPPER_EXT_2', 'UPPER_EXT_3', 'UPPER_EXT_4', 'UPPER_EXT_5', 'UPPER_EXT_6', 
                   'UPPER_EXT_9', 'LOWER_EXT_1', 'LOWER_EXT_2', 'LOWER_EXT_3', 'LOWER_EXT_4', 'LOWER_EXT_5', 'LOWER_EXT_6', 
                   'LOWER_EXT_9', 'UNSPEC_1', 'UNSPEC_2', 'UNSPEC_3', 'UNSPEC_4', 'UNSPEC_5', 'UNSPEC_6', 'UNSPEC_9', 
                   'RESPIRATORYASSISTANCE', 'SUPPLEMENTALOXYGEN', 'EMSRESPONSEMINS', 'EMSSCENEMINS', 'EMSMINS', 'SBP', 
                   'PULSERATE', 'TEMPERATURE', 'RESPIRATORYRATE', 'PULSEOXIMETRY', 'ISS_05', 'VTEPROPHYLAXISTYPE', 
                   'HMRRHGCTRLSURGTYPE', 'EDDISCHARGEDISPOSITION', 'HC_CLABSI', 'HC_CAUTI', 'HC_SUPERFICIALINCISIONSSI', 
                   'HC_DEEPSSI', 'HC_ORGANSPACESSI', 'HC_VAPNEUMONIA', 'HC_SEPSIS', 'HC_DVTHROMBOSIS',  'HC_EMBOLISM', 
                   'HC_EXTREMITYCS', 'HC_PRESSUREULCER', 'HC_KIDNEY', 'HC_MI', 'HC_CARDARREST', 'HC_STROKECVA', 'HC_INTUBATION',
                   'HC_RESPIRATORY', 'DECEASED_early', 'DECEASED_late'] 

Rename columns:

In [None]:
col_names_long = ['Gender', 'Race Category: Asian', 'Race Category: Pacific Islander', 'Race Category: Other', 
                  'Race Category: American Indian', 'Race Category: Black', 'Race Category: White', 'Ethnicity', 
                  'Transport Mode', 'Pre-Hospital Cardiac Arrest', 'TCC: Glasgow Coma Scale of 13 or less', 
                  'TCC: Systolic Blood Pressure under 90', 'TCC: Respiratory rate less than 10 or more than 29', 
                  'TCC: Penetrating Injuries', 'TCC: Chest wall instability or deformity', 
                  'TCC: Two or more proximal long-bone fractures', 'TCC: Crushed, degloved, mangled, or pulseless extremity', 
                  'TCC: Amputation proximal to wrist or ankle', 'TCC: Pelvic fracture', 'TCC: Open or depressed skull fracture',
                  'TCC: Paralysis',  'Hospital Teaching Status', 'Bed Size', 'Hospital Type', 'ACS Verification Level',
                  'Pediatric Verification Level', 'State Designation', 'State Pediatric Designation', 'EMS GCS - Eye', 
                  'EMS GCS - Verbal', 'EMS GCS - Motor', 'Primary External Cause', 'Age (years)', 'Weight', 'Height', 
                  'Initial EMS Systolic Blood Pressure', 'Initial EMS Pulse Rate', 'Initial EMS Respiratory Rate', 
                  'Initial EMS Oxygen Saturation', 'Initial EMS Total GCS', 'GCS - Eye', 'GCS - Verbal', 'GCS -Motor', 
                  'GCS Assessment Qualifiers: Patient Chemically Sedated or Paralyzed', 
                  'GCS Assessment Qualifiers: Obstruction to the Patients Eye', 'GCS Assessment Qualifiers: Patient Intubated',
                  'GCS Assessment Qualifiers: Valid GCS', 
                  'Comorbid Condition: Attention Deficit Disorder/Attention Deficit Hyperactivity Disorder (ADD/ADHD)',
                  'Comorbid Condition: Alcohol Use Disorder', 'Comorbid Condition: Angina Pectoris', 
                  'Comorbid Condition: Anticoagulant Therapy', 'Comorbid Condition: Bleeding Disorder', 
                  'Comorbid Condition: Currently Receiving Chemotherapy for Cancer', 
                  'Comorbid Condition: Cirrhosis', 'Comorbid Condition: Chronic Obstructive Pulmonary Disease (COPD)', 
                  'Comorbid Condition: Cerebrovascular Accident (CVA)', 'Comorbid Condition: Dementia', 
                  'Comorbid Condition: Diabetes Mellitus', 'Comorbid Condition: Disseminated Cancer', 
                  'Comorbid Condition: Functionally Dependent Health Status', 'Comorbid Condition: Congestive Heart Failure', 
                  'Comorbid Condition: Hypertension', 'Comorbid Condition: Myocardial Infarction (MI)', 
                  'Comorbid Condition: Peripheral Arterial Disease (PAD)', 'Comorbid Condition: Other', 
                  'Comorbid Condition: Mental/Personality Disorder', 'Comorbid Condition: Chronic Renal Failure', 
                  'Comorbid Condition: Current Smoker', 'Comorbid Condition: Steroid Use', 
                  'Comorbid Condition: Substance Abuse Disorder', 'Minor Head Injury', 'Moderate Head Injury', 
                  'Serious Head Injury', 'Severe Head Injury', 'Critical Head Injury', 'Maximum Head Injury', 
                  'Head Injury (NFS)', 'Minor Face Injury', 'Moderate Face Injury', 'Serious Face Injury', 'Severe Face Injury',
                  'Critical Face Injury', 'Face Injury (NFS)', 'Minor Neck Injury', 'Moderate Neck Injury', 
                  'Serious Neck Injury', 'Severe Neck Injury', 'Critical Neck Injury', 'Maximum Neck Injury', 
                  'Neck Injury (NFS)' , 'Minor Thoracic Injury', 'Moderate Thoracic Injury', 'Serious Thoracic Injury', 
                  'Severe Thoracic Injury', 'Critical Thoracic Injury', 'Maximum Thoracic Injury', 'Thoracic Injury (NFS)', 
                  'Minor Abdominal Injury', 'Moderate Abdominal Injury', 'Serious Abdominal Injury', 'Severe Abdominal Injury', 
                  'Critical Abdominal Injury', 'Maximum Abdominal Injury', 'Abdominal Injury (NFS)', 'Minor Spine Injury', 
                  'Moderate Spine Injury', 'Serious Spine Injury', 'Severe Spine Injury', 'Critical Spine Injury', 
                  'Maximum Spine Injury', 'Spine Injury (NFS)', 'Minor Upper Extremity Injury', 
                  'Moderate Upper Extremity Injury', 'Serious Upper Extremity Injury', 'Severe Upper Extremity Injury', 
                  'Critical Upper Extremity Injury', 'Maximum Upper Extremity Injury', 'Upper Extremity Injury (NFS)', 
                  'Minor Lower Extremity Injury', 'Moderate Lower Extremity Injury', 'Serious Lower Extremity Injury', 
                  'Severe Lower Extremity Injury', 'Critical Lower Extremity Injury', 'Maximum Lower Extremity Injury', 
                  'Lower Extremity Injury (NFS)', 'Minor Unspecified Injury', 'Moderate Unspecified Injury', 
                  'Serious Unspecified Injury', 'Severe Unspecified Injury', 'Critical Unspecified Injury', 
                  'Maximum Unspecified Injury', 'Unspecified Injury (NFS)', 'Respiratory Assistance', 'Supplemental Oxygen', 
                  'Time to EMS Response (mins)', 'Time EMS spent at scene (mins)', 
                  'Time from dispatch to ED/hospital arrival (mins)', 'Initial ED Systolic Blood Pressure', 
                  'Initial ED Pulse Rate', 'Initial ED Temperature', 'Initial ED Respiratory Rate', 
                  'Initial ED Oxygen Saturation', 'AIS derived ISS', 'VTE Prophylaxis Type', 
                  'Type of Surgery for Hemorrhage Control', 'ED Discharge Disposition', 
                  'Central Line-Associated Bloodstream Infection', 'Catheter-Associated Urinary Tract Infection', 
                  'Superficial Incisional Surgical Site Infection', 'Deep Surgical Site Infection', 
                  'Organspace Surgical Site Infection', 'Ventilator-Associated Pneumonia', 'Severe Sepsis', 
                  'Deep Vein Thrombosis', 'Pulmonary Embolism', 'Extremity Compartment Syndrome', 'Pressure Ulcer', 
                  'Acute Kidney Injury', 'Myocardial Infarction', 'Cardiac Arrest', 'Stroke', 'Unplanned Intubation',
                  'Acute Respiratory Distress Syndrome', 'Early Mortality', 'Late Mortality']

In [None]:
TQIP.rename(columns=dict(zip(col_names_short, col_names_long)), inplace=True)
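The renaming relies on the two lists being aligned position by position. Since `zip` silently stops at the shorter list, a length check before renaming is a cheap safeguard. A minimal sketch with three of the column pairs from above (the extra `inc_key` column is included to show that unlisted columns are left alone):

```python
import pandas as pd

short = ["SEX", "AGEYEARS", "ISS_05"]
long = ["Gender", "Age (years)", "AIS derived ISS"]

# zip() truncates to the shorter list, so check the lengths first.
assert len(short) == len(long)

toy = pd.DataFrame(columns=short + ["inc_key"])
toy = toy.rename(columns=dict(zip(short, long)))
renamed = toy.columns.tolist()
print(renamed)
```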

In [None]:
TQIP.to_feather(data/'feather/2017_TQIP')

### Test data

In [None]:
TQIP = pd.read_feather(data/'feather/2017_TQIP') 

Create a test dataset by identifying all patients that were treated at medium-sized (bed size = 201-400) university hospitals.

In [None]:
TQIP_test = TQIP[(TQIP['Bed Size'] == '201-400') & (TQIP['Hospital Teaching Status'] == 'university')]

Reset index:

In [None]:
TQIP_test.reset_index(inplace=True, drop=True)

Save test dataframe:

In [None]:
TQIP_test.to_feather(data/'feather/2017_TQIP_test')

### Train and validation data

Create a dataframe for training and validation:

In [None]:
TQIP_train_val = TQIP[~((TQIP['Bed Size'] == '201-400') & (TQIP['Hospital Teaching Status'] == 'university'))]
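The test set and the train/validation set are defined by one boolean mask and its negation, so they cannot overlap and together cover every row. A sketch on invented hospitals:

```python
import pandas as pd

toy = pd.DataFrame({
    "Bed Size": ["201-400", "201-400", "> 600", None],
    "Hospital Teaching Status": ["university", "community", "university", "university"],
})

is_test = (toy["Bed Size"] == "201-400") & (toy["Hospital Teaching Status"] == "university")
test, train_val = toy[is_test], toy[~is_test]
print(len(test), len(train_val))
```

Rows with a missing bed size or teaching status never satisfy the mask and therefore fall into the train/validation set.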

Reset index:

In [None]:
TQIP_train_val.reset_index(inplace=True, drop=True)

Split the train_val data randomly into a training dataset (80%) and a validation dataset (20%):

In [None]:
splits = RandomSplitter(seed=seed)(range_of(TQIP_train_val))
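Conceptually, a seeded random split shuffles the row indices once and cuts the shuffled list at the validation share. The sketch below illustrates that idea with the standard library only. It is not fastai's implementation: the function name `random_split` is hypothetical, and the indices it returns will differ from those of `RandomSplitter`, which draws its permutation with PyTorch.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle the indices 0..n_items-1 and use the first `valid_pct` share for validation."""
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)
    cut = int(valid_pct * n_items)  # truncated, so validation never exceeds its share
    return idxs[cut:], idxs[:cut]

train_idx, valid_idx = random_split(10, seed=42)
print(len(train_idx), len(valid_idx))
```

With the same truncation rule, 20% of 897,968 rows gives 179,593, which matches the validation size reported later in this notebook.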

Define preprocessing steps:
* *FillMissing = Replace missing values with the median of the group while simultaneously creating a new binary column indicating whether a variable was missing or not*<br>
* *Categorify = Turn categorical variables into categories*<br>
* *Normalize = Normalize continuous data by subtraction of the mean and division by the standard deviation*

In [None]:
procs = [FillMissing, Categorify, Normalize]
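To make the first and last step concrete, here is a rough pandas sketch of what median imputation with a missing-value indicator and standardisation do to one continuous column. It mimics the idea only and is not fastai's code; the `_na` column name, the use of the population standard deviation (`ddof=0`) and the toy values are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"SBP": [120.0, np.nan, 90.0, 150.0]})

# FillMissing-like step: flag missing rows, then impute with the median.
median = train["SBP"].median()
train["SBP_na"] = train["SBP"].isna()
train["SBP"] = train["SBP"].fillna(median)

# Normalize-like step: subtract the mean and divide by the standard deviation.
mean, std = train["SBP"].mean(), train["SBP"].std(ddof=0)
train["SBP"] = (train["SBP"] - mean) / std
print(train)
```

After these two steps the column has mean 0 and standard deviation 1, and the indicator column records which value was imputed.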

Create the training dataset:

In [None]:
TQIP_train = TQIP_train_val.iloc[splits[0], :]

Reset index:

In [None]:
TQIP_train.reset_index(inplace=True, drop=True)

Save the training dataframe:

In [None]:
TQIP_train.to_feather(data/'feather/2017_TQIP_train')

Create the validation dataset:

In [None]:
TQIP_valid= TQIP_train_val.iloc[splits[1], :]

Reset index:

In [None]:
TQIP_valid.reset_index(inplace=True, drop=True)

Save the validation dataframe:

In [None]:
TQIP_valid.to_feather(data/'feather/2017_TQIP_valid')

### Preprocess data for the Pre-Hospital Model

Define categorical input variables for the Pre-Hospital Model:

In [None]:
cat_names_prehosp = col_names_long[0:32]

Define continuous input variables for the Pre-Hospital Model:

In [None]:
cont_names_prehosp = col_names_long[32:40]

Define output variables for the Pre-Hospital Model: 

In [None]:
y_names_prehosp = col_names_long[146:165]

Use TabularPandas from Fastai to write a tabular processor for the train_val dataframe:

In [None]:
to_prehosp = TabularPandas(TQIP_train_val, procs=procs, cat_names = cat_names_prehosp, cont_names = cont_names_prehosp, 
                           y_names=y_names_prehosp, y_block=MultiCategoryBlock(encoded=True, vocab=y_names_prehosp), 
                           inplace=True, splits=splits)

Save the processed dataframes:

In [None]:
with open('to_prehosp.pkl', 'wb') as f:
    pickle.dump(to_prehosp, f)

Use TabularPandas from Fastai to write a tabular processor for the test dataframe:

In [None]:
to_test_prehosp = TabularPandas(TQIP_test, procs=procs, cat_names = cat_names_prehosp, cont_names = cont_names_prehosp, 
                                y_names=y_names_prehosp, y_block=MultiCategoryBlock(encoded=True, vocab=y_names_prehosp),
                                inplace=True)

Save the processed test dataframe:

In [None]:
with open('to_prehosp_test.pkl', 'wb') as f:
    pickle.dump(to_test_prehosp, f)

### Preprocess data for the ED Model

Define categorical input features for the ED model:

In [None]:
cat_names_ed = col_names_long[0:32] + col_names_long[40:134]

Define continuous input features for the ED model:

In [None]:
cont_names_ed = col_names_long[32:40] + col_names_long[134:143]

Define output variables for the ED Model: 

In [None]:
y_names_ed = col_names_long[146:165]

Use TabularPandas from Fastai to write a tabular processor for the train_val dataframe:

In [None]:
to_ed = TabularPandas(TQIP_train_val, procs=procs, cat_names = cat_names_ed, cont_names = cont_names_ed, 
                   y_names=y_names_ed, y_block=MultiCategoryBlock(encoded=True, vocab=y_names_ed), splits=splits)

Save the processed dataframes:

In [None]:
with open('to_ED.pkl', 'wb') as f:
    pickle.dump(to_ed, f)

Use TabularPandas from Fastai to write a tabular processor for the test dataframe:

In [None]:
to_test_ed = TabularPandas(TQIP_test, procs=procs, cat_names = cat_names_ed, cont_names = cont_names_ed, 
                        y_names=y_names_ed, y_block=MultiCategoryBlock(encoded=True, vocab=y_names_ed))

Save the processed test dataframe:

In [None]:
with open('to_ED_test.pkl', 'wb') as f:
    pickle.dump(to_test_ed, f)

#### Preprocess data without detailed injury characteristics

Define categorical input features:

In [None]:
cat_names_ed = col_names_long[0:32] + col_names_long[40:70] + col_names_long[132:134]

Define continuous input features for the ED model:

In [None]:
cont_names_ed = col_names_long[32:40] + col_names_long[134:143]

Define output variables for the ED Model: 

In [None]:
y_names_ed = col_names_long[146:165]

Use TabularPandas from Fastai to write a tabular processor for the train_val dataframe:

In [None]:
to_ed = TabularPandas(TQIP_train_val, procs=procs, cat_names = cat_names_ed, cont_names = cont_names_ed, 
                   y_names=y_names_ed, y_block=MultiCategoryBlock(encoded=True, vocab=y_names_ed), splits=splits)

Save the processed dataframes:

In [None]:
with open('to_ED_noinj.pkl', 'wb') as f:
    pickle.dump(to_ed, f)

### Preprocess data for the In-Hospital Model

Define additional categorical input features for the In-Hospital model:

In [None]:
cat_names_inhosp = col_names_long[0:32] + col_names_long[40:134] + col_names_long[143:146] + col_names_long[163:164]

The In-Hospital Model has the same continuous input variables as the ED Model:

In [None]:
cont_names_inhosp = col_names_long[32:40] + col_names_long[134:143]

Define output variables for the In-Hospital Model: 

In [None]:
y_names_inhosp = col_names_long[146:163] + col_names_long[164:]
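Because the feature groups are defined by list slices, it is easy to check that no column is used twice, for example as both an input and an output. The sketch below uses placeholder names instead of `col_names_long`; the list length of 165 is our own count of the list above and should be treated as an assumption.

```python
# Placeholder for col_names_long (assumed to hold 165 names).
names = [f"col_{i}" for i in range(165)]

cat_names = names[0:32] + names[40:134] + names[143:146] + names[163:164]
cont_names = names[32:40] + names[134:143]
y_names = names[146:163] + names[164:]

# Any name that appears in more than one group would show up here.
overlap = ((set(cat_names) & set(cont_names))
           | (set(cat_names) & set(y_names))
           | (set(cont_names) & set(y_names)))
n_used = len(cat_names) + len(cont_names) + len(y_names)
print(len(overlap), n_used)
```

With these slices, every position is assigned to exactly one group: early mortality (index 163) moves to the inputs, and late mortality (index 164) stays with the outputs.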

Use TabularPandas from Fastai to write a tabular processor for the train_val dataframe:

In [None]:
to_inhosp = TabularPandas(TQIP_train_val, procs=procs, cat_names = cat_names_inhosp, cont_names = cont_names_inhosp, 
                   y_names=y_names_inhosp, y_block=MultiCategoryBlock(encoded=True, vocab=y_names_inhosp), splits=splits)

Save the processed dataframes:

In [None]:
with open('to_inhosp.pkl', 'wb') as f:
    pickle.dump(to_inhosp, f)

Use TabularPandas from Fastai to write a tabular processor for the test dataframe:

In [None]:
to_test_inhosp = TabularPandas(TQIP_test, procs=procs, cat_names = cat_names_inhosp, cont_names = cont_names_inhosp, 
                        y_names=y_names_inhosp, y_block=MultiCategoryBlock(encoded=True, vocab=y_names_inhosp))

Save the processed test dataframe:

In [None]:
with open('to_inhosp_test.pkl', 'wb') as f:
    pickle.dump(to_test_inhosp, f)

#### Preprocess data without early mortality and ED discharge disposition

Define categorical input features:

In [None]:
cat_names_inhosp = col_names_long[0:32] + col_names_long[40:134] + col_names_long[143:145]

Define continuous input features for the In-Hospital Model:

In [None]:
cont_names_inhosp = col_names_long[32:40] + col_names_long[134:143]

Define output variables for the In-Hospital Model:

In [None]:
y_names_inhosp = col_names_long[146:163] + col_names_long[164:]

Use TabularPandas from Fastai to write a tabular processor for the train_val dataframe:

In [None]:
to_inhosp = TabularPandas(TQIP_train_val, procs=procs, cat_names = cat_names_inhosp, cont_names = cont_names_inhosp, 
                   y_names=y_names_inhosp, y_block=MultiCategoryBlock(encoded=True, vocab=y_names_inhosp), splits=splits)

Save the processed dataframes:

In [None]:
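# Save the In-Hospital processor created above (the original cell dumped to_ed by mistake)
with open('to_inhosp_noearlymort.pkl', 'wb') as f:
    pickle.dump(to_inhosp, f)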

## Statistics across the entire dataset

Import the preprocessed in- and output data:

In [None]:
TQIP = pd.read_feather(data/'feather/2017_TQIP') 

### Gender and race statistics:

In [None]:
TQIP['Gender'].value_counts(dropna = False), TQIP['Gender'].value_counts(normalize=True, dropna=False)

(Male      587840
 Female    391028
 NaN          140
 Name: Gender, dtype: int64,
 Male      0.600445
 Female    0.399412
 NaN       0.000143
 Name: Gender, dtype: float64)

Race statistics:

In [None]:
race_cats = ['Race Category: Asian', 'Race Category: Pacific Islander', 'Race Category: Other', 
             'Race Category: American Indian', 'Race Category: Black', 'Race Category: White']

In [None]:
for i in race_cats: 
    counts = len(TQIP.loc[TQIP[i]=='Yes'])
    perc = counts/len(TQIP)
    print(i, '- Count =', counts, '- Percentage =', "{:.1%}".format(perc))

Race Category: Asian - Count = 18379 - Percentage = 1.9%
Race Category: Pacific Islander - Count = 2580 - Percentage = 0.3%
Race Category: Other - Count = 78819 - Percentage = 8.1%
Race Category: American Indian - Count = 8837 - Percentage = 0.9%
Race Category: Black - Count = 139747 - Percentage = 14.3%
Race Category: White - Count = 714967 - Percentage = 73.0%


The TQIP data from 2017 are predominantly white (73.0%) and male (60.0%).

### Additional statistics

In [None]:
TQIP['Primary External Cause'].value_counts()[:3]

W01.0XXA    133371
V43.52XA     41582
W19.XXXA     37256
Name: Primary External Cause, dtype: int64

Age:

In [None]:
TQIP['Age (years)'].quantile(q=(0.25, 0.5, 0.75))

0.25    26.0
0.50    49.0
0.75    69.0
Name: Age (years), dtype: float64

In [None]:
TQIP['AIS derived ISS'].quantile(q=(0.25, 0.5, 0.75))

0.25     4.0
0.50     8.0
0.75    10.0
Name: AIS derived ISS, dtype: float64

Statistics for early- and late mortality: 

In [None]:
TQIP['Early Mortality'].value_counts()

0    966357
1     12651
Name: Early Mortality, dtype: int64

In [None]:
TQIP['Late Mortality'].value_counts(normalize=True)

0    0.980006
1    0.019994
Name: Late Mortality, dtype: float64

In [None]:
TQIP['Late Mortality'].value_counts()

0    959434
1     19574
Name: Late Mortality, dtype: int64

In [None]:
TQIP['Early Mortality'].value_counts(normalize=True)

0    0.987078
1    0.012922
Name: Early Mortality, dtype: float64

Statistics for overall morbidity:

In [None]:
morbidity = col_names_long[146:163]

In [None]:
TQIP['Morbidity'] = 0
for i in morbidity:
    TQIP.loc[TQIP[i] == 1, 'Morbidity'] = 1
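The loop above sets the flag whenever at least one complication column equals 1. A vectorised formulation of the same rule is sketched below on invented values; a missing entry counts as "no complication", exactly as in the loop.

```python
import pandas as pd

toy = pd.DataFrame({"Severe Sepsis": [0, 1, 0], "Stroke": [0, 0, None]})

# True for rows where any complication column equals 1, then cast to 0/1.
toy["Morbidity"] = (toy[["Severe Sepsis", "Stroke"]] == 1).any(axis=1).astype(int)
print(toy["Morbidity"].tolist())
```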

In [None]:
TQIP['Morbidity'].value_counts(normalize = True)

0    0.967399
1    0.032601
Name: Morbidity, dtype: float64

Statistics for the primary external cause of injury:

In [None]:
cause = TQIP['Primary External Cause']
counts = cause.nunique()
count_na = cause.isna().sum()   # number of missing values
per_na = cause.isna().mean()    # share of missing values
print('TQIP')
print('\t' + ' Unique values:', f"{counts:,}")
print('\t' + ' Missing:', f"{count_na:,}", '('+"{:.1%}".format(per_na).replace('%', ' %')+')')

TQIP
	 Unique values: 2,042
	 Missing: 1,422 (0.1 %)


## Statistics across the train, validate and test datasets

Load datasets: 

In [None]:
datasets = ['train', 'valid', 'test']

In [None]:
for i in datasets:
    globals()['TQIP_' + i] = pd.read_feather(data/('feather/2017_TQIP_' + i))

The length of the three dataframes: 

In [None]:
len(TQIP_train), len(TQIP_valid), len(TQIP_test)

(718375, 179593, 81040)

### Categorical input features

Define categorical input variables: 

In [None]:
cat_names = col_names_long[0:32] + col_names_long[40:134] + col_names_long[143:146]  + col_names_long[163:164]

Remove primary external cause of injury:

In [None]:
cat_names.remove('Primary External Cause')

Remove the injuries: 

In [None]:
regions = ['Head', 'Face', 'Neck', 'Thoracic', 'Abdominal', 'Spine', 'Upper Extremity', 'Lower Extremity', 'Unspecified Injury']

In [None]:
cols = []
for i in regions: 
    cols = cols + ([col for col in TQIP.columns if i in col])
cat_names_no_inj = [x for x in cat_names if x not in cols]

Create an empty dataframe for the statistics of the categorical input variables:

In [None]:
cat_df = pd.DataFrame(columns=['Category', 'Train','Valid', 'Test'])

Fill in the dataframe: 

In [None]:
for c in cat_names_no_inj:
    Train_counts = TQIP_train[c].value_counts(dropna=False)
    Train_per = Train_counts / Train_counts.sum()
    Valid_counts = TQIP_valid[c].value_counts(dropna=False)
    Valid_per = Valid_counts / Valid_counts.sum()
    Test_counts = TQIP_test[c].value_counts(dropna=False)
    Test_per = Test_counts / Test_counts.sum()
    fmt_pct = '{:.1%}'.format
    fmt_ths = '{:,}'.format
    cat_df = cat_df.append(pd.DataFrame({'Category': c, 
                                         'Train': Train_counts.map(fmt_ths) + ' (' + Train_per.map(fmt_pct) + ')', 
                                         'Valid': Valid_counts.map(fmt_ths) + ' (' + Valid_per.map(fmt_pct) + ')', 
                                         'Test': Test_counts.map(fmt_ths) + ' (' + Test_per.map(fmt_pct) + ')'}))
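`DataFrame.append`, used above, is available in the pandas version this notebook was run with (1.1.4) but was removed in pandas 2.0. As a sketch only, the same "count (percent)" table could be built with `pd.concat`; the helper names `count_pct` and `summarise` and the toy data are assumptions made for illustration:

```python
import pandas as pd

def count_pct(series):
    """Format each category as 'count (percent)', counting missing values too."""
    counts = series.value_counts(dropna=False)
    pct = counts / counts.sum()
    return counts.map('{:,}'.format) + ' (' + pct.map('{:.1%}'.format) + ')'

def summarise(splits, columns):
    """Build one table with a formatted column per split."""
    frames = []
    for c in columns:
        frame = pd.DataFrame({name: count_pct(df[c]) for name, df in splits.items()})
        frame.insert(0, 'Category', c)
        frames.append(frame)
    return pd.concat(frames).rename_axis('variable').reset_index()

# Toy splits with a single hypothetical categorical column
splits = {
    'Train': pd.DataFrame({'Gender': ['Male', 'Male', 'Female', 'Male']}),
    'Valid': pd.DataFrame({'Gender': ['Male', 'Female']}),
    'Test': pd.DataFrame({'Gender': ['Female', 'Female', 'Male', 'Male']}),
}
table = summarise(splits, ['Gender'])
print(table)
```

A category that is absent from one of the splits shows up as a missing value in that split's column.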

Reset index and save the dataframe as a csv file:

In [None]:
cat_df.index = cat_df.index.set_names('variable')

In [None]:
cat_df.reset_index(inplace=True)

In [None]:
cat_df.to_csv('cat_df.csv')

#### Statistics for the primary external cause of injury

In [None]:
for i in datasets:
    df = globals()['TQIP_' +i]
    counts = df['Primary External Cause'].nunique()
    count_na = df['Primary External Cause'].isna().sum()
    per_na = df['Primary External Cause'].isna().mean()
    print('TQIP_' + i)
    print('\t' + ' Unique values:', f"{counts:,}")
    print('\t' + ' Missing:', f"{count_na:,}", '('+"{:.1%}".format(per_na).replace('%', ' %')+')')

TQIP_train
	 Unique values: 1,885
	 Missing: 1,044 (0.1 %)
TQIP_valid
	 Unique values: 1,318
	 Missing: 276 (0.2 %)
TQIP_test
	 Unique values: 1,061
	 Missing: 102 (0.1 %)


#### Statistics for the injury characteristics

In [None]:
for x in datasets: 
    for i in regions:
        var = globals()['TQIP_' + x]
        cols = [col for col in var.columns if i in col]
        var[i + '_inj'] = 0
        for y in cols:
            var.loc[(var[y] > 0), i + '_inj'] = 1
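The loop above sets a region flag to 1 whenever any column of that region is greater than zero. A vectorised sketch of the same idea on a toy frame (the column names and values are hypothetical):

```python
import pandas as pd

# Toy frame with hypothetical severity columns (values > 0 mean an injury is present)
df = pd.DataFrame({
    'Minor Head Injury':  [1, 0, 0],
    'Severe Head Injury': [0, 0, 1],
    'Minor Face Injury':  [0, 0, 1],
})

for region in ['Head', 'Face']:
    # All columns whose name contains the region label
    region_cols = [col for col in df.columns if region in col]
    # 1 if any of the region's columns is positive for the patient, else 0
    df[region + '_inj'] = (df[region_cols] > 0).any(axis=1).astype(int)

print(df[['Head_inj', 'Face_inj']])
```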

In [None]:
for x in datasets:
    print('TQIP_' + x + ': Statistics for injuries:')
    for i in regions:
        var = globals()['TQIP_' + x]
        counts = var[i + '_inj'].value_counts()
        per = var[i + '_inj'].value_counts(normalize = True)
        # First line per region: patients without an injury in that region (flag 0)
        print("\t", i + '_inj:' + f"{counts[0]:,}" ' ('+"{:.1%}".format(per[0]).replace('%', ' %')+')')
        # Second line per region: patients with at least one injury in that region (flag 1)
        print("\t", i + '_inj:' + f"{counts[1]:,}", ' ('+"{:.1%}".format(per[1]).replace('%', ' %')+')')

TQIP_train: Statistics for injuries:
	 Head_inj:464,723 (64.7 %)
	 Head_inj:253,652  (35.3 %)
	 Face_inj:533,559 (74.3 %)
	 Face_inj:184,816  (25.7 %)
	 Neck_inj:698,239 (97.2 %)
	 Neck_inj:20,136  (2.8 %)
	 Thoracic_inj:544,243 (75.8 %)
	 Thoracic_inj:174,132  (24.2 %)
	 Abdominal_inj:628,579 (87.5 %)
	 Abdominal_inj:89,796  (12.5 %)
	 Spine_inj:596,058 (83.0 %)
	 Spine_inj:122,317  (17.0 %)
	 Upper Extremity_inj:477,082 (66.4 %)
	 Upper Extremity_inj:241,293  (33.6 %)
	 Lower Extremity_inj:409,575 (57.0 %)
	 Lower Extremity_inj:308,800  (43.0 %)
	 Unspecified Injury_inj:673,385 (93.7 %)
	 Unspecified Injury_inj:44,990  (6.3 %)
TQIP_valid: Statistics for injuries:
	 Head_inj:116,268 (64.7 %)
	 Head_inj:63,325  (35.3 %)
	 Face_inj:133,540 (74.4 %)
	 Face_inj:46,053  (25.6 %)
	 Neck_inj:174,594 (97.2 %)
	 Neck_inj:4,999  (2.8 %)
	 Thoracic_inj:135,965 (75.7 %)
	 Thoracic_inj:43,628  (24.3 %)
	 Abdominal_inj:157,417 (87.7 %)
	 Abdominal_inj:22,176  (12.3 %)
	 Spine_inj:148,882 (82.9 %)
	

### Continuous input features

Define continuous input variables: 

In [None]:
cont_names_ = col_names_long[32:40] + col_names_long[134:143]

Print statistics for the train, validation and test datasets: 

In [None]:
for x in datasets:
    print('TQIP_' + x + ': Statistics for continuous variables:')
    for i in cont_names_:
        var = globals()['TQIP_'+x]
        l,m,h = var[i].quantile(q=(0.25, 0.5, 0.75))
        missing = var[i].isna().sum()
        perc = missing/(len(var))
        print("\t" + i + ':', m, '['+str(l)+', ' + str(h) +'] - ' + f"{missing:,}", '(' + "{:.1%}".format(perc) + ')')

TQIP_train: Statistics for continuous variables:
	Age (years): 50.0 [27.0, 70.0] - 43,990 (6.1%)
	Weight: 75.0 [61.2, 90.0] - 49,008 (6.8%)
	Height: 170.0 [160.02, 177.8] - 102,799 (14.3%)
	Initial EMS Systolic Blood Pressure: 138.0 [120.0, 156.0] - 324,404 (45.2%)
	Initial EMS Pulse Rate: 88.0 [76.0, 102.0] - 317,477 (44.2%)
	Initial EMS Respiratory Rate: 18.0 [16.0, 20.0] - 329,167 (45.8%)
	Initial EMS Oxygen Saturation: 97.0 [95.0, 99.0] - 386,044 (53.7%)
	Initial EMS Total GCS: 15.0 [14.0, 15.0] - 322,875 (44.9%)
	Time to EMS Response (mins): 8.0 [5.0, 14.0] - 255,163 (35.5%)
	Time EMS spent at scene (mins): 16.0 [11.0, 23.0] - 250,686 (34.9%)
	Time from dispatch to ED/hospital arrival (mins): 49.0 [35.0, 71.0] - 249,122 (34.7%)
	Initial ED Systolic Blood Pressure: 137.0 [121.0, 154.0] - 23,814 (3.3%)
	Initial ED Pulse Rate: 86.0 [74.0, 100.0] - 16,610 (2.3%)
	Initial ED Temperature: 36.7 [36.4, 36.9] - 73,549 (10.2%)
	Initial ED Respiratory Rate: 18.0 [16.0, 20.0] - 25,312 (3.5%)
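Each line of this output reads as: median [25th percentile, 75th percentile] - number missing (percent missing). As a sketch, the same summary could be collected in a dataframe rather than printed; the helper name `summarise_continuous` and the toy data are assumptions:

```python
import numpy as np
import pandas as pd

def summarise_continuous(df, columns):
    """Median, interquartile range and missingness for each continuous column."""
    rows = []
    for c in columns:
        # quantile() ignores missing values by default
        q25, median, q75 = df[c].quantile([0.25, 0.5, 0.75])
        missing = int(df[c].isna().sum())
        rows.append({'variable': c, 'median': median, 'q25': q25, 'q75': q75,
                     'missing': missing, 'missing_pct': missing / len(df)})
    return pd.DataFrame(rows)

# Toy data with one missing value
toy = pd.DataFrame({'Age (years)': [20.0, 40.0, 60.0, np.nan]})
summary = summarise_continuous(toy, ['Age (years)'])
print(summary)
```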
	

### Output variables

Define output variables: 

In [None]:
y_names = col_names_long[146:165]

Create an empty dataframe for the statistics of the output variables:

In [None]:
y_names_df = pd.DataFrame(columns=['Category', 'Train','Valid', 'Test'])

Fill in the dataframe: 

In [None]:
for c in y_names:
    Train_counts = TQIP_train[c].value_counts(dropna=False)
    Train_per = Train_counts / Train_counts.sum()
    Valid_counts = TQIP_valid[c].value_counts(dropna=False)
    Valid_per = Valid_counts / Valid_counts.sum()
    Test_counts = TQIP_test[c].value_counts(dropna=False)
    Test_per = Test_counts / Test_counts.sum()
    fmt_pct = '{:.1%}'.format
    fmt_ths = '{:,}'.format
    y_names_df = y_names_df.append(pd.DataFrame({'Category': c, 
                                         'Train': Train_counts.map(fmt_ths) + ' (' + Train_per.map(fmt_pct) + ')', 
                                         'Valid': Valid_counts.map(fmt_ths) + ' (' + Valid_per.map(fmt_pct) + ')', 
                                         'Test': Test_counts.map(fmt_ths) + ' (' + Test_per.map(fmt_pct) + ')'}))

Reset index and save the dataframe as a csv file:

In [None]:
y_names_df.index = y_names_df.index.set_names('variable')

In [None]:
y_names_df.reset_index(inplace=True)

In [None]:
y_names_df.to_csv('y_names_df_1.csv')

## Data leakage

#### Early mortality and prehospital cardiac arrest

In [None]:
TQIP.loc[TQIP['Early Mortality'] == 1, 'Pre-Hospital Cardiac Arrest'].value_counts(normalize=True)

Yes    0.520438
No     0.479562
Name: Pre-Hospital Cardiac Arrest, dtype: float64

In [None]:
TQIP.loc[TQIP['Pre-Hospital Cardiac Arrest'] == 'Yes', 'Early Mortality'].value_counts(normalize=True)

1    0.553715
0    0.446285
Name: Early Mortality, dtype: float64

Examining the data, we found that 52.0% of the patients who died within 24 hours had a prehospital cardiac arrest, and that 55.4% of the patients with a prehospital cardiac arrest died within 24 hours. 
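The two percentages are conditional proportions taken in opposite directions. One way to obtain both from a single cross-tabulation is sketched below on toy data (the values are made up; only the column names match the notebook):

```python
import pandas as pd

toy = pd.DataFrame({
    'Early Mortality':             [1, 1, 0, 0, 0, 1],
    'Pre-Hospital Cardiac Arrest': ['Yes', 'No', 'No', 'Yes', 'No', 'Yes'],
})

# Rows sum to 1: share of cardiac arrest within each mortality group
arrest_given_death = pd.crosstab(toy['Early Mortality'],
                                 toy['Pre-Hospital Cardiac Arrest'],
                                 normalize='index')

# Columns sum to 1: share of early mortality within each cardiac-arrest group
death_given_arrest = pd.crosstab(toy['Early Mortality'],
                                 toy['Pre-Hospital Cardiac Arrest'],
                                 normalize='columns')

print(arrest_given_death)
print(death_given_arrest)
```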

#### Early and late mortality

In [None]:
TQIP.loc[(TQIP['Early Mortality'] == 1) & (TQIP['Late Mortality'] == 1)]

Unnamed: 0,inc_key,Gender,SEX_BIU,Age (years),Race Category: Asian,Race Category: Pacific Islander,Race Category: Other,Race Category: American Indian,Race Category: Black,Race Category: White,...,Minor Unspecified Injury,Moderate Unspecified Injury,Serious Unspecified Injury,Severe Unspecified Injury,Critical Unspecified Injury,Maximum Unspecified Injury,Unspecified Injury (NFS),UNSPEC_<NA>,Early Mortality,Late Mortality


No patients were coded as having experienced both early and late death.
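The same check can be expressed as a count rather than by inspecting an empty dataframe. A small sketch on toy data (the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({'Early Mortality': [1, 0, 0],
                    'Late Mortality':  [0, 1, 0]})

# Boolean mask of patients coded with both outcomes
both = (toy['Early Mortality'] == 1) & (toy['Late Mortality'] == 1)
n_both = int(both.sum())
print('Patients with both early and late mortality:', n_both)
```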