# Eligibility for mobilization: Cohort ID and Discretizing script

This script identifies the cohort using CLIF 2.0 tables and discretizes the dataset at an hourly level

 
                        🚨Code will break if the following requirements are not satisfied🚨  
#### Requirements:
* Required table filenames should be `clif_patient`, `clif_hospitalization`, `clif_adt`, `clif_vitals`, `clif_labs`, `clif_medication_admin_continuous`, `clif_respiratory_support`
* Within each table, the following variables and categories are required.

| Table Name | Required Variables | Required Categories |
| --- | --- | --- |
| **patient** | `patient_id`, `race_category`, `ethnicity_category`, `sex_category` | - |
| **hospitalization** | `patient_id`, `hospitalization_id`, `admission_dttm`, `discharge_dttm`, `age_at_admission` | - |
| **vitals** | `hospitalization_id`, `recorded_dttm`, `vital_category`, `vital_value` | 'heart_rate', 'resp_rate', 'sbp', 'dbp', 'map', 'spo2', 'weight_kg', 'height_cm' |
| **labs** | `hospitalization_id`, `lab_result_dttm`, `lab_category`, `lab_value` | 'lactate' |
| **medication_admin_continuous** | `hospitalization_id`, `admin_dttm`, `med_name`, `med_category`, `med_dose`, `med_dose_unit` | "norepinephrine", "epinephrine", "phenylephrine", "vasopressin", "dopamine", "angiotensin"(optional), "nicardipine", "nitroprusside", "clevidipine", "cisatracurium", "vecuronium" |
| **respiratory_support** | `hospitalization_id`, `recorded_dttm`, `device_category`, `mode_category`, `tracheostomy`, `fio2_set`, `lpm_set`, `resp_rate_set`, `peep_set`, `resp_rate_obs` | - |

## Load Libraries

In [1]:
#! pip install pandas numpy duckdb seaborn matplotlib plotly
import pandas as pd
import numpy as np
import pyCLIF

  from pandas.core import (


Loaded configuration from config.json
{'site_name': 'Hopkins', 'tables_path': '/home/idies/workspace/Storage/chochbe1/JH_CCRD/CLIF/rclif/', 'file_type': 'parquet'}


## Required columns and categories

In [2]:
rst_required_columns = [
    'hospitalization_id',
    'recorded_dttm',
    'device_name',
    'device_category',
    'mode_name', 
    'mode_category',
    'tracheostomy',
    'fio2_set',
    'lpm_set',
    'resp_rate_set',
    'peep_set',
    'resp_rate_obs',
    'tidal_volume_set'
]

vitals_required_columns = [
    'hospitalization_id',
    'recorded_dttm',
    'vital_category',
    'vital_value'
]
vitals_of_interest = ['heart_rate', 'respiratory_rate', 'sbp', 'dbp', 'map', 'spo2', 'weight_kg', 'height_cm']

labs_required_columns = [
    'hospitalization_id',
    'lab_result_dttm',
    'lab_category',
    'lab_value',
    'lab_value_numeric'
]
labs_of_interest = ['lactate']

meds_required_columns = [
    'hospitalization_id',
    'admin_dttm',
    'med_name',
    'med_category',
    'med_dose',
    'med_dose_unit'
]
meds_of_interest = [
    'norepinephrine', 'epinephrine', 'phenylephrine', 'vasopressin',
    'dopamine', 'angiotensin', 'nicardipine', 'nitroprusside',
    'clevidipine', 'cisatracurium', 'vecuronium'
]

## Load data

In [3]:
patient = pyCLIF.load_data('clif_patient')
hospitalization = pyCLIF.load_data('clif_hospitalization')
hospitalization['hospitalization_id']= hospitalization['hospitalization_id'].astype(str)
patient['patient_id']= patient['patient_id'].astype(str)

Data loaded successfully from /home/idies/workspace/Storage/chochbe1/JH_CCRD/CLIF/rclif/clif_patient.parquet
Data loaded successfully from /home/idies/workspace/Storage/chochbe1/JH_CCRD/CLIF/rclif/clif_hospitalization.parquet


In [4]:
# Standardize all _dttm variables to the same format
patient = pyCLIF.standardize_datetime(patient)
hospitalization = pyCLIF.standardize_datetime(hospitalization)

In [5]:
patient = pyCLIF.remove_duplicates(patient, ['patient_id'], 'patient')
hospitalization = pyCLIF.remove_duplicates(hospitalization, ['hospitalization_id'], 'hospitalization')

Processing DataFrame: patient
Found 2 duplicate rows based on columns: ['patient_id']
Dropped 1 duplicate rows. New DataFrame has 135170 rows.
Processing DataFrame: hospitalization
No duplicates found based on columns: ['hospitalization_id'].


In [6]:
print(f"Total Number of unique encounters in the data: {pyCLIF.count_unique_encounters(hospitalization, 'hospitalization_id')}")

Total Number of unique encounters in the data: 454476


## Cohort Identification

### Inclusion Criteria:

* Filter Admissions for 2020-03-01 to 2022-03-31
* Encounters receiving invasive mechanical ventilation during this period
* A cool off period of 4 hours after first intubation for analysis

### Exclusion criteria:

1. Encounters that were on vent for less than 2 hours
2. Encounters that were on trach in the first 72 hours 
3. Encounters that received Cisatracurium or Vecuronium for 4 hours or more within the first 72 hours

In [7]:
cohort = hospitalization[
    (hospitalization['admission_dttm'] >= '2020-03-01') &
    (hospitalization['admission_dttm'] <= '2022-03-31') &
    (hospitalization['age_at_admission'] >= 18)
].reset_index(drop=True)[['hospitalization_id']].drop_duplicates()

cohort_ids = cohort['hospitalization_id'].unique().tolist()
print(f"Number of unique encounters after filtering by date and age:", cohort['hospitalization_id'].nunique())

Number of unique encounters after filtering by date and age: 126598


In [8]:
# Import clif respiratory table for this cohort
rst_filters = {
    'hospitalization_id': cohort_ids
}
resp_support_raw = pyCLIF.load_data('clif_respiratory_support', columns=rst_required_columns, filters=rst_filters)
resp_support_raw['hospitalization_id']= resp_support_raw['hospitalization_id'].astype(str)

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

Data loaded successfully from /home/idies/workspace/Storage/chochbe1/JH_CCRD/CLIF/rclif/clif_respiratory_support.parquet


In [9]:
resp_support = resp_support_raw.copy()
resp_support['recorded_dttm'] = pd.to_datetime(resp_support['recorded_dttm'])
resp_support['device_category'] = resp_support['device_category'].str.lower()
resp_support['mode_category'] = resp_support['mode_category'].str.lower()

In [10]:
# Apply Nick's Waterfall fill logic for respiratory support table
# This can take time: 2 - 12 mins depending on data size
processed_resp_support = pyCLIF.process_resp_support(resp_support)

Initiating waterfall processing...
Creating recorded_date and recorded_hour...
Sorting data by 'hospitalization_id' and 'recorded_dttm'...
Fixing missing 'device_category' and 'device_name' based on 'mode_category'...
Fixing 'device_category' and 'device_name' based on neighboring records...
Handling duplicates and removing rows with all key variables missing...
Filling forward 'device_category' within each hospitalization...
Creating 'device_cat_id' to track changes in 'device_category'...
Filling 'device_name' within each 'device_cat_id'...
Creating 'device_id' to track changes in 'device_name'...
Filling 'mode_category' within each 'device_id'...
Creating 'mode_cat_id' to track changes in 'mode_category'...
Filling 'mode_name' within each 'mode_cat_id'...
Creating 'mode_name_id' to track changes in 'mode_name'...
Adjusting 'fio2_set' for 'room air' device_category...
Adjusting 'mode_category' for 't-piece' devices...
Filling remaining variables within each 'mode_name_id'...
Filling 

In [11]:
# Identify the cohort on invasive mechanical ventilation 
columns_to_keep = [
    'hospitalization_id', 'recorded_dttm', 'device_name','device_category',
    'mode_name', 'mode_category' , 'tracheostomy',
    'fio2_set', 'lpm_set', 'peep_set', 
    'resp_rate_obs', 'resp_rate_set'
]

ventilator_usage = processed_resp_support[processed_resp_support['device_category'].str.contains("imv", case=False, na=False)]
cohort_on_vent = ventilator_usage.merge(cohort, on='hospitalization_id', how='left')
cohort_on_vent = cohort_on_vent[columns_to_keep]
cohort_on_vent['on_vent'] = cohort_on_vent['device_category'].str.contains("imv", case=False, na=False).astype(int)
cohort_on_vent.loc[:, 'recorded_dttm'] = pd.to_datetime(cohort_on_vent['recorded_dttm'])
cohort_on_vent = cohort_on_vent.sort_values(by=['hospitalization_id', 'recorded_dttm'])
cohort_on_vent = cohort_on_vent[cohort_on_vent['on_vent'] == 1]


# Apply thresholds and replace values outside these with NaN using .loc[]
# UPDATE THIS TO USE CSV / JSON FROM OUTLIER DIRECTORY
# cohort_on_vent.loc[:, 'fio2_set'] = cohort_on_vent['fio2_set'].where(cohort_on_vent['fio2_set'].between(0.21, 1, inclusive='both'), np.nan)
# Calculate the mean of 'fio2_set', excluding NaN values
fio2_mean = cohort_on_vent['fio2_set'].mean(skipna=True)
print("FIO2_SET MEAN", fio2_mean)
# If the mean is greater than 1, divide 'fio2_set' by 100
if fio2_mean > 1:
    # Only divide values greater than 1 to avoid re-dividing already correct values
    print("Updated fio2_set to be between 0.21 and 1")
    cohort_on_vent.loc[cohort_on_vent['fio2_set'] > 1, 'fio2_set'] = \
        cohort_on_vent.loc[cohort_on_vent['fio2_set'] > 1, 'fio2_set'] / 100

cohort_on_vent.loc[:, 'fio2_set'] = cohort_on_vent['fio2_set'].where(cohort_on_vent['fio2_set'].between(0.21, 1, inclusive='both'), np.nan)
cohort_on_vent.loc[:, 'resp_rate_set'] = cohort_on_vent['resp_rate_set'].where(cohort_on_vent['resp_rate_set'].between(0, 60, inclusive='both'), np.nan)
cohort_on_vent.loc[:, 'peep_set'] = cohort_on_vent['peep_set'].where(cohort_on_vent['peep_set'].between(0, 50, inclusive='both'), np.nan)
cohort_on_vent.loc[:, 'resp_rate_obs'] = cohort_on_vent['resp_rate_obs'].where(cohort_on_vent['resp_rate_obs'].between(0, 100, inclusive='both'), np.nan)
cohort_on_vent.loc[:, 'lpm_set'] = cohort_on_vent['lpm_set'].where(cohort_on_vent['lpm_set'].between(0, 60, inclusive='both'), np.nan)

cohort_on_vent['recorded_date'] = cohort_on_vent['recorded_dttm'].dt.date
cohort_on_vent['recorded_hour'] = cohort_on_vent['recorded_dttm'].dt.hour

print(f"Number of unique encounters after filtering for ventilator usage: {cohort_on_vent['hospitalization_id'].nunique()}")

FIO2_SET MEAN 0.4654953087272558
Number of unique encounters after filtering for ventilator usage: 10403


In [12]:
vent_start_end = cohort_on_vent.groupby('hospitalization_id').agg(
    vent_start_time=('recorded_dttm', 'min'),
    vent_end_time=('recorded_dttm', 'max')
).reset_index()

# Exclude encounters where vent start time and end time is the same 
vent_start_end = vent_start_end[vent_start_end['vent_start_time'] != vent_start_end['vent_end_time']]
print(f"Number of unique encounters after filtering for ventilator usage: {vent_start_end['hospitalization_id'].nunique()}")

Number of unique encounters after filtering for ventilator usage: 10088


In [13]:
# import required vitals
vitals_filters = {
    'hospitalization_id': cohort_ids,
    'vital_category': vitals_of_interest
}
vitals = pyCLIF.load_data('clif_vitals', columns=vitals_required_columns, filters=vitals_filters)
vitals['hospitalization_id']= vitals['hospitalization_id'].astype(str)

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

Data loaded successfully from /home/idies/workspace/Storage/chochbe1/JH_CCRD/CLIF/rclif/clif_vitals.parquet


In [14]:
vitals.value_counts('vital_category')
## if you don't have MAP, we can calculate here- TODO

vital_category
sbp                 9597959
dbp                 9597867
heart_rate          9476001
map                 9265964
spo2                9186579
respiratory_rate    8307875
weight_kg            471225
height_cm            157121
Name: count, dtype: int64

In [15]:
# Get first_vital_dttm and last_vital_dttm for each hospitalization_id 
# We use this as proxy for admission and discharge dttm to construct hourly sequence for each hospitalization
vital_dttm_bounds = vitals.groupby('hospitalization_id')['recorded_dttm'].agg(['min', 'max']).reset_index()
vital_dttm_bounds.columns = ['hospitalization_id', 'first_vital_dttm', 'last_vital_dttm']
print("unique encounters in vitals", pyCLIF.count_unique_encounters(vital_dttm_bounds))

unique encounters in vitals 126296


In [16]:
## get height , weight and bmi for 
# Filter vitals to include only height and weight
vitals_bmi = vitals[vitals['vital_category'].isin(['weight_kg', 'height_cm'])].copy()

# Remove outliers
vitals_bmi = vitals_bmi[
    ((vitals_bmi['vital_category'] == 'height_cm') & (vitals_bmi['vital_value'] >= 76.2) & (vitals_bmi['vital_value'] <= 244)) |
    ((vitals_bmi['vital_category'] == 'weight_kg') & (vitals_bmi['vital_value'] >= 20) & (vitals_bmi['vital_value'] <= 1100))
]

# Merge with vent_start_end to get ventilation start time
vitals_bmi = vitals_bmi.merge(vent_start_end[['hospitalization_id', 'vent_start_time']], on='hospitalization_id', how='left')

# Calculate time difference between recorded_dttm and vent_start_time
vitals_bmi['recorded_dttm'] = pd.to_datetime(vitals_bmi['recorded_dttm'])
vitals_bmi['vent_start_time'] = pd.to_datetime(vitals_bmi['vent_start_time'])
vitals_bmi['time_diff'] = (vitals_bmi['recorded_dttm'] - vitals_bmi['vent_start_time']).dt.total_seconds() / 3600  # in hours

# Define whether measurement is before or after vent_start_time
vitals_bmi['before_vent_start'] = (vitals_bmi['time_diff'] <= 0).astype(int)

# Calculate absolute time difference
vitals_bmi['abs_time_diff'] = vitals_bmi['time_diff'].abs()

# Sort data to prioritize measurements before vent start and closest in time
vitals_bmi = vitals_bmi.sort_values(['hospitalization_id', 'vital_category', 'before_vent_start', 'abs_time_diff'], ascending=[True, True, False, True])

# Drop duplicates to keep the closest measurement for each vital_category per hospitalization_id
vitals_bmi = vitals_bmi.drop_duplicates(subset=['hospitalization_id', 'vital_category'], keep='first')

# Pivot to get height and weight in columns
vitals_bmi_pivot = vitals_bmi.pivot(index='hospitalization_id', columns='vital_category', values='vital_value').reset_index()

# Calculate BMI
vitals_bmi_pivot['bmi'] = vitals_bmi_pivot['weight_kg'] / ((vitals_bmi_pivot['height_cm'] / 100) ** 2)

## Hourly sequence for the cohort

In [17]:
final_cohort = vent_start_end.merge(vital_dttm_bounds, on='hospitalization_id', how='inner')
print("unique encounters in resp filtered", pyCLIF.count_unique_encounters(final_cohort))

unique encounters in resp filtered 10088


In [18]:
# sanity check - last recorded vital shouldn't be less than vent start time
# if this happens, check your CLIF tables bro
cases_before_vent_start = final_cohort[final_cohort['last_vital_dttm'] < final_cohort['vent_start_time']]
print("Cases where last vital dttm is before vent_start time:", len(cases_before_vent_start))
cases_before_vent_start

Cases where last vital dttm is before vent_start time: 2


Unnamed: 0,hospitalization_id,vent_start_time,vent_end_time,first_vital_dttm,last_vital_dttm
1374,1567544868278289.0,2021-01-03 10:41:00-05:00,2021-01-03 12:50:00-05:00,2021-01-03 09:10:00-05:00,2021-01-03 10:40:55-05:00
5440,3205441582599331.0,2020-11-16 14:29:00-05:00,2020-11-16 17:14:00-05:00,2020-11-15 22:48:00-05:00,2020-11-16 14:06:00-05:00


In [19]:
## save IDs in this cohort to filter other tables
cohort_ids = final_cohort['hospitalization_id'].unique().tolist()
print("total number of unique hospitalizations in the identified cohort", len(cohort_ids))

total number of unique hospitalizations in the identified cohort 10088


In [20]:
# Function to generate hourly sequence for each group (hospitalization_id)
def generate_hourly_sequence(group):
    # Get the vent start time and discharge time
    start_time = group['vent_start_time'].iloc[0]
    end_time = group['last_vital_dttm'].iloc[0]
    
    # Generate the sequence of hourly timestamps
    hourly_timestamps = pd.date_range(start=start_time, end=end_time, freq='h')
    
    # Create a new DataFrame for this sequence
    return pd.DataFrame({
        'hospitalization_id': group['hospitalization_id'].iloc[0],
        'recorded_dttm': hourly_timestamps
    })

# Apply the function to each group and concatenate the results
hour_sequence = final_cohort.groupby('hospitalization_id')\
    .apply(generate_hourly_sequence)\
    .reset_index(drop=True)

# Add `recorded_date` and `recorded_hour` columns
# Convert recorded_dttm to datetime sanity check
hour_sequence['recorded_dttm'] = pd.to_datetime(hour_sequence['recorded_dttm'])
hour_sequence['recorded_date'] = hour_sequence['recorded_dttm'].dt.date
hour_sequence['recorded_hour'] = hour_sequence['recorded_dttm'].dt.hour
hour_sequence = hour_sequence.drop('recorded_dttm', axis=1)
hour_sequence = hour_sequence.drop_duplicates()
hour_sequence['time_from_vent'] = hour_sequence.groupby('hospitalization_id').cumcount()
## add a cool off period of 4 hours after first intubation
hour_sequence['time_from_vent_adjusted'] = hour_sequence['time_from_vent'].apply(lambda x: x - 4 if x >= 4 else -1).astype(int)


  hour_sequence = final_cohort.groupby('hospitalization_id')\


In [21]:
## SHOULDN'T HAVE ANY DUPLICATES
hour_sequence_check = pyCLIF.remove_duplicates(hour_sequence, ['hospitalization_id', 'recorded_date', 'recorded_hour'], 'hour_sequence_check')
del hour_sequence_check

Processing DataFrame: hour_sequence_check
No duplicates found based on columns: ['hospitalization_id', 'recorded_date', 'recorded_hour'].


## Hourly Respiratory support

In [22]:
hourly_vent_df = cohort_on_vent.groupby(['hospitalization_id', 'recorded_date', 'recorded_hour']).agg(
    min_resp_rate_obs=pd.NamedAgg(column='resp_rate_obs', aggfunc='min'),
    min_lpm_set=pd.NamedAgg(column='lpm_set', aggfunc='min'),
    min_fio2_set=pd.NamedAgg(column='fio2_set', aggfunc='min'),
    min_peep_set=pd.NamedAgg(column='peep_set', aggfunc='min'),
    max_resp_rate_obs=pd.NamedAgg(column='resp_rate_obs', aggfunc='max'),
    max_lpm_set=pd.NamedAgg(column='lpm_set', aggfunc='max'),
    max_fio2_set=pd.NamedAgg(column='fio2_set', aggfunc='max'),
    max_peep_set=pd.NamedAgg(column='peep_set', aggfunc='max'),
    hourly_trach=pd.NamedAgg(column='tracheostomy', aggfunc=lambda x: 1 if x.max() == 1 else 0),
    hourly_on_vent=pd.NamedAgg(column='on_vent', aggfunc=lambda x: 1 if x.max() == 1 else 0)
).reset_index()

In [23]:
# Merge hourly_vent_df with hour_sequence on hospitalization_id, recorded_date, and recorded_hour
final_df = pd.merge(hour_sequence, hourly_vent_df, on=['hospitalization_id', 'recorded_date', 'recorded_hour'], 
                     how='left')
print("unique encounters who were ever on vent (before applying exclusion criteria)", pyCLIF.count_unique_encounters(final_df))

unique encounters who were ever on vent (before applying exclusion criteria) 10086


In [24]:
# Calculate the total hours on vent for each encounter within the first 72 hours
first_72_hours = hour_sequence[(hour_sequence['time_from_vent_adjusted'] >= 0) & (hour_sequence['time_from_vent_adjusted'] < 72)]
final_df_72 = pd.merge(first_72_hours, hourly_vent_df, on=['hospitalization_id', 'recorded_date', 'recorded_hour'], 
                     how='left')
vent_hours_per_encounter = final_df_72.groupby('hospitalization_id')['hourly_on_vent'].sum()
# Identify encounters with less than 2 hours on vent
encounters_less_than_2_hours = vent_hours_per_encounter[vent_hours_per_encounter <= 2].index

In [25]:
# exclude those encounters that were on the vent for less than 2 hours in the first 72 hours
final_df = final_df[~final_df['hospitalization_id'].isin(encounters_less_than_2_hours)]
print("\n encounters that were on the vent for less than 2 hours in the first 72 hours", len(encounters_less_than_2_hours))
print("\n unique encounters after excluding encounters on vent for 2 hrs or less", pyCLIF.count_unique_encounters(final_df))


 encounters that were on the vent for less than 2 hours in the first 72 hours 1731

 unique encounters after excluding encounters on vent for 2 hrs or less 8355


In [26]:
# Exclude encounters with tracheostomy in the first 72 hours
# Identify encounters with trach in the first 72 hours
encounters_with_trach = final_df_72.groupby('hospitalization_id')['hourly_trach'].max()
# Identify encounters where trach is present
encounters_with_trach = encounters_with_trach[encounters_with_trach == 1].index

In [27]:
# Exclude encounters with trach in the first 72 hours
final_df = final_df[~final_df['hospitalization_id'].isin(encounters_with_trach)]
print("\n encounters with trach in the first 72 hours", len(encounters_with_trach))
print("\n unique encounters after excluding encounters on trach during the first 72 hours", pyCLIF.count_unique_encounters(final_df))


 encounters with trach in the first 72 hours 479

 unique encounters after excluding encounters on trach during the first 72 hours 7896


## Hourly Meds

* Exclude encounters that are on cisatracurium or vecuronium for more than 4 hours within the first 72 hours
* Calculate NE equivalent levels using "norepinephrine", "epinephrine", "phenylephrine", "vasopressin", "dopamine",  "angiotensin"
* Create flags for "nicardipine", "nitroprusside", "clevidipine" for the red criteria under consensus criteria


In [28]:
# Import clif continuous meds for the cohort on vent during the required time period
meds_filters = {
    'hospitalization_id': cohort_ids,
    'med_category': meds_of_interest
}
meds = pyCLIF.load_data('clif_medication_admin_continuous', columns=meds_required_columns, filters=meds_filters)
print("unique encounters in meds", pyCLIF.count_unique_encounters(meds))
meds['hospitalization_id']= meds['hospitalization_id'].astype(str)

Data loaded successfully from /home/idies/workspace/Storage/chochbe1/JH_CCRD/CLIF/rclif/clif_medication_admin_continuous.parquet
unique encounters in meds 4711


In [29]:
meds['admin_dttm'] = pd.to_datetime(meds['admin_dttm'], format='%Y-%m-%d %H:%M:%S')
meds['med_dose'] = pd.to_numeric(meds['med_dose'], errors='coerce')
# Create 'date' and 'hour_of_day' columns
meds['recorded_date'] = meds['admin_dttm'].dt.date
meds['recorded_hour'] = meds['admin_dttm'].dt.hour

Exclude encounters that are on cisatracurium or vecuronium for more than 4 hours within the first 72 hours

In [30]:
# Ensure 'admin_dttm' is in datetime format
cisatracurium_filtered = meds[meds['med_category'].str.contains("cisatracurium", case=False, na=False)].drop_duplicates()
# Sort by 'hospitalization_id' and 'admin_dttm'
cisatracurium_filtered = cisatracurium_filtered.sort_values(['hospitalization_id', 'recorded_date', 'recorded_hour'])

# Merge with vent_start_end to get vent_start_time
cisatracurium_filtered = cisatracurium_filtered.merge(
    final_df_72[['hospitalization_id', 'recorded_date', 'recorded_hour']], 
    on=['hospitalization_id', 'recorded_date', 'recorded_hour'], 
    how='left'
)

# Define the maximum allowed gap between doses (e.g., 1 hour)
max_gap = pd.Timedelta(hours=1)

# Function to identify continuous periods
def identify_continuous_periods(group):
    group = group.copy()
    group['time_diff'] = group['admin_dttm'].diff()
    group['new_period'] = (group['time_diff'] > max_gap) | (group['time_diff'].isna())
    group['period_id'] = group['new_period'].cumsum()
    return group

# Apply the function to each 'hospitalization_id'
cis_periods = cisatracurium_filtered.groupby('hospitalization_id').apply(identify_continuous_periods).reset_index(drop=True)

# Calculate the duration of each continuous period
period_durations = cis_periods.groupby(['hospitalization_id', 'period_id']).agg(
    period_start=('admin_dttm', 'min'),
    period_end=('admin_dttm', 'max')
).reset_index()

period_durations['period_duration'] = (
    period_durations['period_end'] - period_durations['period_start']
).dt.total_seconds() / 3600  # Convert to hours

# Identify patients with any continuous period >= 4 hours
cis_flag_df = period_durations.groupby('hospitalization_id').agg(
    max_period_duration=('period_duration', 'max')
).reset_index()

cis_flag_df['cis_flag'] = (cis_flag_df['max_period_duration'] >= 4).astype(int)

  cis_periods = cisatracurium_filtered.groupby('hospitalization_id').apply(identify_continuous_periods).reset_index(drop=True)


In [31]:
encounters_with_cis = cis_flag_df[cis_flag_df['cis_flag'] == 1]['hospitalization_id'].drop_duplicates()
final_df = final_df[~final_df['hospitalization_id'].isin(encounters_with_cis)]
print("\n encounters with cis for more than 4 hours  in the first 72 hours", len(encounters_with_cis))
print("\n unique encounters after excluding encounters on trach during the first 72 hours", pyCLIF.count_unique_encounters(final_df))


 encounters with cis for more than 4 hours  in the first 72 hours 63

 unique encounters after excluding encounters on trach during the first 72 hours 7834


In [32]:
# Ensure 'admin_dttm' is in datetime format
vecuronium_filtered = meds[meds['med_category'].str.contains("vecuronium", case=False, na=False)].drop_duplicates()
# Sort by 'hospitalization_id' and 'admin_dttm'
vecuronium_filtered = vecuronium_filtered.sort_values(['hospitalization_id', 'recorded_date', 'recorded_hour'])

# Merge with vent_start_end to get vent_start_time
vecuronium_filtered = vecuronium_filtered.merge(
    final_df_72[['hospitalization_id', 'recorded_date', 'recorded_hour']], 
    on=['hospitalization_id', 'recorded_date', 'recorded_hour'], 
    how='left'
)

# Define the maximum allowed gap between doses (e.g., 1 hour)
max_gap = pd.Timedelta(hours=1)

# Function to identify continuous periods
def identify_continuous_periods(group):
    group = group.copy()
    group['time_diff'] = group['admin_dttm'].diff()
    group['new_period'] = (group['time_diff'] > max_gap) | (group['time_diff'].isna())
    group['period_id'] = group['new_period'].cumsum()
    return group

# Apply the function to each 'hospitalization_id'
vec_periods = vecuronium_filtered.groupby('hospitalization_id').apply(identify_continuous_periods).reset_index(drop=True)

# Calculate the duration of each continuous period
period_durations = vec_periods.groupby(['hospitalization_id', 'period_id']).agg(
    period_start=('admin_dttm', 'min'),
    period_end=('admin_dttm', 'max')
).reset_index()

period_durations['period_duration'] = (
    period_durations['period_end'] - period_durations['period_start']
).dt.total_seconds() / 3600  # Convert to hours

# Identify patients with any continuous period >= 4 hours
vec_flag_df = period_durations.groupby('hospitalization_id').agg(
    max_period_duration=('period_duration', 'max')
).reset_index()

vec_flag_df['vec_flag'] = (vec_flag_df['max_period_duration'] >= 4).astype(int)

  vec_periods = vecuronium_filtered.groupby('hospitalization_id').apply(identify_continuous_periods).reset_index(drop=True)


In [33]:
encounters_with_vec = vec_flag_df[vec_flag_df['vec_flag'] == 1]['hospitalization_id'].drop_duplicates()
final_df = final_df[~final_df['hospitalization_id'].isin(encounters_with_vec)]
print("\n encounters with vec for more than 4 hours  in the first 72 hours", len(encounters_with_cis))
print("\n unique encounters after excluding encounters on trach during the first 72 hours", pyCLIF.count_unique_encounters(final_df))


 encounters with cis for more than 4 hours  in the first 72 hours 63

 unique encounters after excluding encounters on trach during the first 72 hours 7676


In [34]:
# ## Norepinephrine equivalent calculation
# Goradia S, Sardaneh AA, Narayan SW, Penm J, Patanwala AE. Vasopressor dose equivalence: 
# A scoping review and suggested formula. J Crit Care. 2021 Feb;61:233-240. doi: 10.1016/j.jcrc.2020.11.002. Epub 2020 Nov 14. PMID: 33220576.

meds_list = [
    "norepinephrine", "epinephrine", "phenylephrine", 
    "vasopressin", "dopamine",  
    "angiotensin"
]

# Function to check if 'med_dose_unit' contains '/hr' or '/min'
def has_per_hour_or_min(unit):
    if pd.isnull(unit):
        return False
    unit = unit.lower()
    return '/hr' in unit or '/min' in unit

# Filter meds to include only rows with '/hr' or '/min' in 'med_dose_unit'
meds_filtered = meds[meds['med_dose_unit'].apply(has_per_hour_or_min)].copy()

ne_df = meds_filtered[meds_filtered['med_category'].isin(meds_list)].copy()


In [35]:
# Create a summary table for each med_category
summary_table = ne_df.groupby('med_category').agg(
    total_N=('med_category', 'size'),
    min=('med_dose', 'min'),
    max=('med_dose', 'max'),
    first_quantile=('med_dose', lambda x: x.quantile(0.25)),
    second_quantile=('med_dose', lambda x: x.quantile(0.5)),
    third_quantile=('med_dose', lambda x: x.quantile(0.75)),
    missing_values=('med_dose', lambda x: x.isna().sum())
).reset_index()

## check the distrbituon of required continuous meds
summary_table

Unnamed: 0,med_category,total_N,min,max,first_quantile,second_quantile,third_quantile,missing_values
0,dopamine,2565,0.0,30.0,2.0,4.0,6.0,0
1,epinephrine,9074,0.0,100.0,0.03,0.06,0.15,0
2,norepinephrine,117777,0.0,600.0,0.05,0.15,0.54,0
3,phenylephrine,6301,0.0,300.0,0.3,0.95,10.0,0
4,vasopressin,28872,0.0,1000.0,0.03,0.04,0.04,0


In [36]:
# Check the med_dose_unit for each med_category in the meds table
med_dose_unit_check = meds.groupby(['med_category', 'med_dose_unit']).size().reset_index(name='count')
# Display the results
med_dose_unit_check

Unnamed: 0,med_category,med_dose_unit,count
0,cisatracurium,mcg/kg/min,6514
1,cisatracurium,mg,7
2,cisatracurium,mg/kg/hr,76
3,dopamine,mcg,4
4,dopamine,mcg/kg/min,2565
5,epinephrine,mcg/kg/min,8362
6,epinephrine,mcg/min,712
7,epinephrine,mg,2898
8,nicardipine,mcg/kg/min,414
9,nicardipine,mg,6


In [37]:

# **2. Convert Medication Doses to Required Units**

# Define medications and their unit conversion information
meds_list = [
    "norepinephrine", "epinephrine", "phenylephrine",
    "vasopressin", "dopamine", "angiotensin"
]

med_unit_info = {
    'norepinephrine': {
        'required_unit': 'mcg/kg/min',
        'acceptable_units': ['mcg/kg/min', 'mcg/kg/hr', 'mg/kg/hr', 'mcg/min', 'mg/hr'],
    },
    'epinephrine': {
        'required_unit': 'mcg/kg/min',
        'acceptable_units': ['mcg/kg/min', 'mcg/kg/hr', 'mg/kg/hr', 'mcg/min', 'mg/hr'],
    },
    'phenylephrine': {
        'required_unit': 'mcg/kg/min',
        'acceptable_units': ['mcg/kg/min', 'mcg/kg/hr', 'mg/kg/hr', 'mcg/min', 'mg/hr'],
    },
    'dopamine': {
        'required_unit': 'mcg/kg/min',
        'acceptable_units': ['mcg/kg/min', 'mcg/kg/hr', 'mg/kg/hr', 'mcg/min', 'mg/hr'],
    },
    'metaraminol': {
        'required_unit': 'mcg/kg/min',
        'acceptable_units': ['mg/hr', 'mcg/min'],
    },
    'angiotensin': {
        'required_unit': 'mcg/kg/min',
        'acceptable_units': ['ng/kg/min', 'ng/kg/hr'],
    },
    'vasopressin': {
        'required_unit': 'units/min',
        'acceptable_units': ['units/min', 'units/hr', 'milliunits/min', 'milliunits/hr'],
    },
}

# Function to get conversion factor for each medication
def get_conversion_factor(med_category, med_dose_unit, weight_kg):
    med_info = med_unit_info.get(med_category, None)
    if not med_info:
        # Medication not in the list
        return None
    required_unit = med_info['required_unit']
    acceptable_units = med_info['acceptable_units']
    med_dose_unit = med_dose_unit.lower()
    if med_category in ['norepinephrine', 'epinephrine', 'phenylephrine', 'dopamine', 'metaraminol']:
        # Required unit: mcg/kg/min
        if med_dose_unit == 'mcg/kg/min':
            factor = 1.0
        elif med_dose_unit == 'mcg/kg/hr':
            factor = 1 / 60
        elif med_dose_unit == 'mg/kg/hr':
            factor = 1000 / 60
        elif med_dose_unit == 'mcg/min':
            factor = 1 / weight_kg
        elif med_dose_unit == 'mg/hr':
            factor = 1000 / 60 / weight_kg
        else:
            return None
    elif med_category == 'angiotensin':
        # Required unit: mcg/kg/min
        if med_dose_unit == 'ng/kg/min':
            factor = 1 / 1000
        elif med_dose_unit == 'ng/kg/hr':
            factor = 1 / 1000 / 60
        else:
            return None
    elif med_category == 'vasopressin':
        # Required unit: units/min
        if med_dose_unit == 'units/min':
            factor = 1.0
        elif med_dose_unit == 'units/hr':
            factor = 1 / 60
        elif med_dose_unit == 'milliunits/min':
            factor = 1 / 1000
        elif med_dose_unit == 'milliunits/hr':
            factor = 1 / 1000 / 60
        else:
            return None
    else:
        return None
    return factor

# Merge weight_kg into meds_filtered (assuming 'vitals_bmi_pivot' is available)
meds_filtered = meds_filtered.merge(vitals_bmi_pivot[['hospitalization_id', 'weight_kg']], on='hospitalization_id', how='left')

# Remove rows with missing weight_kg
meds_filtered = meds_filtered[~meds_filtered['weight_kg'].isnull()].copy()

# Function to convert doses
def convert_dose(row):
    med_category = row['med_category']
    med_dose = row['med_dose']
    med_dose_unit = row['med_dose_unit']
    weight_kg = row['weight_kg']
    factor = get_conversion_factor(med_category, med_dose_unit, weight_kg)
    if factor is None:
        return np.nan
    return med_dose * factor

# Apply the conversion to get 'med_dose_converted'
meds_filtered['med_dose_converted'] = meds_filtered.apply(convert_dose, axis=1)

# Drop rows with NaN in 'med_dose_converted' (unrecognized units)
meds_filtered = meds_filtered[~meds_filtered['med_dose_converted'].isnull()].copy()

# Define acceptable dose ranges
med_dose_ranges = {
    'norepinephrine': (0.01, 3),
    'epinephrine': (0.01, 0.1),
    'phenylephrine': (0.1, 5),
    'dopamine': (2, 20),
    'metaraminol': (0.5, 10),  # Hypothetical range
    'angiotensin': (0.02, 0.2),  # After conversion to mcg/kg/min
    'vasopressin': (0.01, 0.1),  # Units/min acceptable range
}

# Function to check if dose is within range
def is_dose_within_range(row):
    med_category = row['med_category']
    med_dose_converted = row['med_dose_converted']
    dose_range = med_dose_ranges.get(med_category, None)
    if dose_range is None:
        return False
    min_dose, max_dose = dose_range
    return min_dose <= med_dose_converted <= max_dose

# Filter doses within acceptable ranges
meds_filtered = meds_filtered[meds_filtered.apply(is_dose_within_range, axis=1)].copy()

# **4. Flag Medications Not in the Dataset**

for med in meds_list:
    if med not in meds_filtered['med_category'].unique():
        print(f"{med} is not in the dataset.")

# Pivot and Aggregate the Data**

# Create 'recorded_date' and 'recorded_hour' columns
meds_filtered['admin_dttm'] = pd.to_datetime(meds_filtered['admin_dttm'])
meds_filtered['recorded_date'] = meds_filtered['admin_dttm'].dt.date
meds_filtered['recorded_hour'] = meds_filtered['admin_dttm'].dt.hour

# Group and aggregate doses
group_cols = ['hospitalization_id', 'recorded_date', 'recorded_hour', 'med_category']
dose_agg = meds_filtered.groupby(group_cols)['med_dose_converted'].agg(['min', 'max']).reset_index()

# Pivot to have medications as columns
dose_pivot_min = dose_agg.pivot_table(index=['hospitalization_id', 'recorded_date', 'recorded_hour'], columns='med_category', values='min').reset_index()
dose_pivot_max = dose_agg.pivot_table(index=['hospitalization_id', 'recorded_date', 'recorded_hour'], columns='med_category', values='max').reset_index()

# Rename columns to indicate min and max
dose_pivot_min.columns = ['hospitalization_id', 'recorded_date', 'recorded_hour'] + ['min_' + col for col in dose_pivot_min.columns if col not in ['hospitalization_id', 'recorded_date', 'recorded_hour']]
dose_pivot_max.columns = ['hospitalization_id', 'recorded_date', 'recorded_hour'] + ['max_' + col for col in dose_pivot_max.columns if col not in ['hospitalization_id', 'recorded_date', 'recorded_hour']]

# Merge min and max DataFrames
dose_pivot = pd.merge(dose_pivot_min, dose_pivot_max, on=['hospitalization_id', 'recorded_date', 'recorded_hour'], how='outer')

# **6. Calculate Norepinephrine Equivalents**

# Replace NaN with 0 for calculations
dose_pivot.fillna(0, inplace=True)

# Calculate NE min
dose_pivot['ne_calc_min'] = (
    dose_pivot.get('min_norepinephrine', 0) +
    dose_pivot.get('min_epinephrine', 0) +
    dose_pivot.get('min_phenylephrine', 0) / 10 +
    dose_pivot.get('min_dopamine', 0) / 100 +
    dose_pivot.get('min_metaraminol', 0) / 8 +
    dose_pivot.get('min_vasopressin', 0) * 2.5 +
    dose_pivot.get('min_angiotensin', 0) * 10
)

# Calculate NE max
dose_pivot['ne_calc_max'] = (
    dose_pivot.get('max_norepinephrine', 0) +
    dose_pivot.get('max_epinephrine', 0) +
    dose_pivot.get('max_phenylephrine', 0) / 10 +
    dose_pivot.get('max_dopamine', 0) / 100 +
    dose_pivot.get('max_metaraminol', 0) / 8 +
    dose_pivot.get('max_vasopressin', 0) * 2.5 +
    dose_pivot.get('max_angiotensin', 0) * 10
)

# **7. Prepare the Final Dataset**

# Keep only the required columns
ne_calc_df = dose_pivot[['hospitalization_id', 'recorded_date', 
                         'recorded_hour', 
                         'ne_calc_min', 'ne_calc_max']].drop_duplicates(subset=['hospitalization_id', 'recorded_date', 'recorded_hour'])

angiotensin is not in the dataset.


In [38]:
final_df = pd.merge(final_df, ne_calc_df, on=['hospitalization_id', 'recorded_date', 'recorded_hour'], how='left')

In [39]:
red_meds_list = [
    "nicardipine", "nitroprusside", "clevidipine"
]

# Filter meds_filtered for the medications in red_meds_list
red_meds_df = meds[meds['med_category'].isin(red_meds_list)].copy()

# Create a flag for each medication in red_meds_list
for med in red_meds_list:
    # Create a flag that is 1 if the medication was administered in that hour, 0 otherwise
    red_meds_df[med + '_flag'] = np.where(red_meds_df['med_category'] == med, 1, 0).astype(int)

# Aggregate to get the maximum value for each flag (per hospitalization_id, recorded_date, recorded_hour)
# This ensures that if the medication was administered even once in the hour, the flag is 1
red_meds_flags = red_meds_df.groupby(['hospitalization_id', 'recorded_date', 'recorded_hour']).agg(
    {med + '_flag': 'max' for med in red_meds_list}
).reset_index()

#  combine all flags into a single 'red_meds_flag', you can do so like this:
red_meds_flags['red_meds_flag'] = red_meds_flags[[med + '_flag' for med in red_meds_list]].max(axis=1)

# Select the relevant columns
red_meds_flags_final = red_meds_flags[[
    'hospitalization_id', 'recorded_date', 'recorded_hour',
    'nicardipine_flag', 'nitroprusside_flag',
    'clevidipine_flag', 'red_meds_flag'
]].drop_duplicates(subset=['hospitalization_id', 'recorded_date', 'recorded_hour'])

red_meds_flags_final['nicardipine_flag'] = red_meds_flags_final['nicardipine_flag'].astype(int)
red_meds_flags_final['nitroprusside_flag'] = red_meds_flags_final['nitroprusside_flag'].astype(int)
red_meds_flags_final['clevidipine_flag'] = red_meds_flags_final['clevidipine_flag'].astype(int)
red_meds_flags_final['red_meds_flag'] = red_meds_flags_final['red_meds_flag'].astype(int)

In [40]:
final_df = pd.merge(final_df, red_meds_flags_final, on=['hospitalization_id', 'recorded_date', 'recorded_hour'], how='left')

## Hourly Vitals

In [41]:
vitals.columns

Index(['hospitalization_id', 'recorded_dttm', 'vital_category', 'vital_value'], dtype='object')

In [42]:
# Ensure 'recorded_dttm' is datetime
vitals['recorded_dttm'] = pd.to_datetime(vitals['recorded_dttm'])
# Extract 'recorded_date' and 'recorded_hour'
vitals['recorded_hour'] = vitals['recorded_dttm'].dt.hour
vitals['recorded_date'] = vitals['recorded_dttm'].dt.date

# Check if 'map' exists in 'vital_category'
if 'map' not in vitals['vital_category'].unique():
    print("map is not present, so we'll calculate it")

    # Filter vitals to include only 'sbp' and 'dbp'
    sbp_dbp_vitals = vitals[vitals['vital_category'].isin(['sbp', 'dbp'])].copy()

    # Pivot to have 'sbp' and 'dbp' as columns
    sbp_dbp_pivot = sbp_dbp_vitals.pivot_table(
        index=['hospitalization_id', 'recorded_dttm'],
        columns='vital_category',
        values='vital_value'
    ).reset_index()

    # Drop rows where either 'sbp' or 'dbp' is missing
    sbp_dbp_pivot = sbp_dbp_pivot.dropna(subset=['sbp', 'dbp'])

    # Calculate 'map' using the formula
    sbp_dbp_pivot['map'] = (sbp_dbp_pivot['sbp'] + 2 * sbp_dbp_pivot['dbp']) / 3

    # Create a DataFrame for 'map' vitals
    map_vitals = sbp_dbp_pivot[['hospitalization_id', 'recorded_dttm', 'map']].copy()

    # Add 'vital_category' and 'vital_value' columns
    map_vitals['vital_category'] = 'map'
    map_vitals['vital_value'] = map_vitals['map']
    # Extract 'recorded_date' and 'recorded_hour' for map_vitals
    map_vitals['recorded_date'] = map_vitals['recorded_dttm'].dt.date
    map_vitals['recorded_hour'] = map_vitals['recorded_dttm'].dt.hour

    # Keep only the necessary columns
    map_vitals = map_vitals[['hospitalization_id', 'recorded_dttm', 'recorded_date', 
                             'recorded_hour', 'vital_category', 'vital_value']]


    # Append 'map' vitals back to the original 'vitals' DataFrame
    vitals = pd.concat([vitals, map_vitals], ignore_index=True)

# Proceed with grouping and pivoting
vitals_min_max = vitals.groupby(
    ['hospitalization_id', 'recorded_date', 'recorded_hour', 'vital_category']
).agg(
    min=pd.NamedAgg(column='vital_value', aggfunc='min'),
    max=pd.NamedAgg(column='vital_value', aggfunc='max')
).reset_index()

# Pivot the table to reshape it
vitals_pivot = vitals_min_max.pivot_table(
    index=['hospitalization_id', 'recorded_date', 'recorded_hour'],
    columns='vital_category',
    values=['min', 'max']
).reset_index()

# Flatten the column multi-index after pivot
vitals_pivot.columns = [
    '_'.join(col).strip() if isinstance(col, tuple) else col for col in vitals_pivot.columns
]

# Remove trailing underscores
vitals_pivot.columns = [col.rstrip('_') for col in vitals_pivot.columns]

In [43]:
# merge vitals with final_df
final_df = pd.merge(final_df, vitals_pivot, on=['hospitalization_id', 'recorded_date', 'recorded_hour'], 
                   how='left')

In [44]:
## confirm duplicates don't exist
checkpoint_vitals = pyCLIF.remove_duplicates(final_df, [
    'hospitalization_id','recorded_date', 'recorded_hour'
], 'final_df')
del checkpoint_vitals

Processing DataFrame: final_df
No duplicates found based on columns: ['hospitalization_id', 'recorded_date', 'recorded_hour'].


## Hourly Lab

Get most recent lactate defined as closest lab result time to the start of first intubation event

In [45]:
# Import clif continuous meds and clif labs table for the cohort on vent during the required time period
labs_filters = {
    'hospitalization_id': cohort_ids,
    'lab_category': labs_of_interest
}
labs = pyCLIF.load_data('clif_labs', columns=labs_required_columns, filters=labs_filters)
print("unique encounters in labs", pyCLIF.count_unique_encounters(labs))
labs['hospitalization_id']= labs['hospitalization_id'].astype(str)

Data loaded successfully from /home/idies/workspace/Storage/chochbe1/JH_CCRD/CLIF/rclif/clif_labs.parquet
unique encounters in labs 8256


In [46]:
labs['lab_result_dttm'] = pd.to_datetime(labs['lab_result_dttm'])
labs['recorded_hour'] = labs['lab_result_dttm'].dt.hour
labs['recorded_date'] = labs['lab_result_dttm'].dt.date

lactate_df = pd.merge(labs, vent_start_end, on='hospitalization_id', how='left')
lactate_df['time_since_vent_start_hours'] = (
    (lactate_df['lab_result_dttm'] - lactate_df['vent_start_time']).dt.total_seconds() / 3600
)

# Calculate the absolute time difference between lab_result_dttm and vent_start_time in hours
lactate_df['time_diff_hours'] = abs((lactate_df['lab_result_dttm'] - lactate_df['vent_start_time']).dt.total_seconds() / 3600)

# Filter for observations within the first 72 hours since vent_start_time
lactate_df = lactate_df[(lactate_df['time_since_vent_start_hours'] >= 0) & 
                        (lactate_df['time_since_vent_start_hours'] <= 72)]

# Sort by hospitalization_id, recorded_hour, and time_diff_hours to find the closest measurement to vent_start_time
lactate_df = lactate_df.sort_values(by=['hospitalization_id', 'recorded_date', 'recorded_hour', 'time_diff_hours'])

# Group by hospitalization_id and recorded_hour, and get the first row in each group (which is the closest measurement)
# closest lactate measurement is defined as closest to the vent_start_time in that hour. 
closest_lactate_df = lactate_df.groupby(['hospitalization_id', 'recorded_date','recorded_hour']).first().reset_index()

labs_final = closest_lactate_df[['hospitalization_id', 'recorded_date', 'recorded_hour', 'lab_value_numeric']].copy()

# Rename the 'lab_value_numeric' column to 'lactate'
labs_final = labs_final.rename(columns={'lab_value_numeric': 'lactate'})

final_df = pd.merge(final_df, labs_final, on=['hospitalization_id', 'recorded_date', 'recorded_hour'], 
                   how='left')

In [47]:
checkpoint_labs= pyCLIF.remove_duplicates(final_df, [
    'hospitalization_id', 'recorded_date', 'recorded_hour'
], 'final_df')
del checkpoint_labs

Processing DataFrame: final_df
No duplicates found based on columns: ['hospitalization_id', 'recorded_date', 'recorded_hour'].


In [48]:
final_df.columns

Index(['hospitalization_id', 'recorded_date', 'recorded_hour',
       'time_from_vent', 'time_from_vent_adjusted', 'min_resp_rate_obs',
       'min_lpm_set', 'min_fio2_set', 'min_peep_set', 'max_resp_rate_obs',
       'max_lpm_set', 'max_fio2_set', 'max_peep_set', 'hourly_trach',
       'hourly_on_vent', 'ne_calc_min', 'ne_calc_max', 'nicardipine_flag',
       'nitroprusside_flag', 'clevidipine_flag', 'red_meds_flag', 'max_dbp',
       'max_heart_rate', 'max_height_cm', 'max_map', 'max_respiratory_rate',
       'max_sbp', 'max_spo2', 'max_weight_kg', 'min_dbp', 'min_heart_rate',
       'min_height_cm', 'min_map', 'min_respiratory_rate', 'min_sbp',
       'min_spo2', 'min_weight_kg', 'lactate'],
      dtype='object')

## Write analysis dataset 

In [49]:
final_df.to_parquet('../output/intermediate/final_df.parquet')
vent_start_end.to_parquet('../output/intermediate/vent_start_end.parquet')
final_df['hospitalization_id'].to_csv('../output/intermediate/cohort_ids.csv', index=False)