# Perform Parameter-Specific Chunking on Vital Parameters
Based on the visual analysis (derive_chunking_rules.ipynb), we derived two possible chunking options:
* Chunk after 60 min timedelta to previous measurement
* Chunk after 120 min timedelta to previous measurement

After discussion with the teaching team, we decided to **chunk after a 65 min timedelta to the previous measurement**, with the option to adapt this threshold in the future.

## Expand Chartevents by Information about Timedelta to Previous Measurement

### Load and Prepare Data

In [28]:
import pandas as pd
import pyarrow as pa

# Read chartevents_clean from parquet file to pandas data frame
chartevents_subset = pd.read_parquet('./data/chartevents_clean.parquet', engine='pyarrow')
unique_icu_stays = pd.read_parquet('./data/unique_icustays_in_chartevents_subset.parquet', engine='pyarrow')


In [29]:
# Select relevant ICUSTAY_ID for analysis - only the ones appearing for the analyzed ITEMIDs
icustayid_filter = unique_icu_stays.ICUSTAY_ID

# Filter by ICUSTAY
chunk_analysis_data = chartevents_subset[chartevents_subset.ICUSTAY_ID.isin(icustayid_filter)].copy()

In [30]:
# Chunk Analysis is only being conducted on the vital parameters, not thresholds
# Filter for ITEMIDs that refer to vital parameters
# Heart Rate: 220045 | NBP: 220179 | O2: 220277
itemids_for_values_filter = [220045, 220179, 220277]
chunk_analysis_data = chunk_analysis_data[chunk_analysis_data.ITEMID.isin(itemids_for_values_filter)].copy()
len(chunk_analysis_data)

6719837

### Create New Data Frame Columns for Analysis

Example of relevant resulting data frame columns:

| ICUSTAY_ID | ITEMID | CHARTTIME | VALUENUM | **CHARTTIME_PREV** | **DIF_CHARTTIME_PREV_MIN** |
|---|---|---|---|---|---|
| 20221 | 220045 | 2181-11-25T19:06:00 | 115 | NaT | NaN |
| 20221 | 220045 | 2181-11-25T19:16:00 | 114 | 2181-11-25T19:06:00 | 10 |

Add Timestamp of Previous Measurement as Column to Row of Current Measurement - CHARTTIME_PREV


In [31]:
# Idea: keep chunk_analysis_data as is and only add a column that holds the previous timestamp; the difference is computed in a separate, vectorized step
# Prerequisite: data sorted by ICUSTAY_ID, ITEMID, CHARTTIME
chunk_analysis_data = chunk_analysis_data.sort_values(by=['ICUSTAY_ID','ITEMID','CHARTTIME'])
chunk_analysis_data['CHARTTIME_PREV'] = chunk_analysis_data.groupby(['ICUSTAY_ID','ITEMID'])['CHARTTIME'].shift(1)

Calculate Difference between Timestamps - DIF_CHARTTIME_PREV_MIN

In [32]:
chunk_analysis_data['DIF_CHARTTIME_PREV'] = chunk_analysis_data['CHARTTIME']-chunk_analysis_data['CHARTTIME_PREV']
chunk_analysis_data['DIF_CHARTTIME_PREV_S'] = chunk_analysis_data['DIF_CHARTTIME_PREV'].dt.total_seconds()
# Floor division yields whole minutes
chunk_analysis_data['DIF_CHARTTIME_PREV_MIN'] = chunk_analysis_data['DIF_CHARTTIME_PREV_S'] // 60

In [33]:
# Drop helper columns
chunk_analysis_data = chunk_analysis_data.drop(columns=['DIF_CHARTTIME_PREV', 'DIF_CHARTTIME_PREV_S'])

In [34]:
chunk_analysis_data.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ICUSTAY_ID,ITEMID,CHARTTIME,STORETIME,CGID,VALUE,VALUENUM,VALUEUOM,WARNING,ERROR,RESULTSTATUS,STOPPED,VALUENUM_CLEAN,CLEANING_FLAG,CHARTTIME_PREV,DIF_CHARTTIME_PREV_MIN
0,14005075.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 19:06:00,2181-11-25 19:17:00,20622.0,115,115.0,bpm,0.0,0.0,,,115.0,,NaT,
1,14005090.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 19:16:00,2181-11-25 19:16:00,20622.0,114,114.0,bpm,0.0,0.0,,,114.0,,2181-11-25 19:06:00,10.0
2,14005105.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 20:00:00,2181-11-25 22:02:00,21108.0,113,113.0,bpm,0.0,0.0,,,113.0,,2181-11-25 19:16:00,44.0
3,14005111.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 21:00:00,2181-11-25 22:02:00,21108.0,108,108.0,bpm,0.0,0.0,,,108.0,,2181-11-25 20:00:00,60.0
4,14005117.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 22:00:00,2181-11-25 22:02:00,21108.0,110,110.0,bpm,0.0,0.0,,,110.0,,2181-11-25 21:00:00,60.0
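
To sanity-check the timedelta logic on a small scale, the two steps above can be reproduced on a hand-made frame. This is only a sketch: the frame `toy` and its values are made up for illustration, and only the column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data: two ICU stays, one ITEMID, deliberately unsorted
toy = pd.DataFrame({
    "ICUSTAY_ID": [1, 1, 1, 2, 2],
    "ITEMID": [220045] * 5,
    "CHARTTIME": pd.to_datetime([
        "2181-11-25 19:16:00",
        "2181-11-25 19:06:00",
        "2181-11-25 20:00:00",
        "2181-11-26 08:00:00",
        "2181-11-26 10:30:00",
    ]),
})

# Sort first, otherwise shift() would pair up unrelated timestamps
toy = toy.sort_values(by=["ICUSTAY_ID", "ITEMID", "CHARTTIME"])

# Previous timestamp within each ICUSTAY_ID/ITEMID group
toy["CHARTTIME_PREV"] = toy.groupby(["ICUSTAY_ID", "ITEMID"])["CHARTTIME"].shift(1)

# Difference in whole minutes; the first row of each group stays NaN
delta = toy["CHARTTIME"] - toy["CHARTTIME_PREV"]
toy["DIF_CHARTTIME_PREV_MIN"] = delta.dt.total_seconds() // 60

print(toy[["ICUSTAY_ID", "CHARTTIME", "DIF_CHARTTIME_PREV_MIN"]])
```

Because the shift happens per group, the first measurement of the second stay is not compared with the last measurement of the first stay.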


## Apply Chunking Rule


In [35]:
# As discussed above, it was decided to chunk after 65 minutes Timedelta to Previous Measurement
chunking_dif = 65

Introduce a new CHUNK_ID for every row whose timedelta to the previous measurement exceeds the chunking difference

In [36]:
# Example of relevant resulting data frame columns (chunk_data_merged):

# ICUSTAY_ID  |  ITEMID  | CHARTTIME           | VALUENUM | CHARTTIME_PREV      | DIF_CHARTTIME_PREV_MIN    | CHUNK_ID
# 20221       |   220045 | 2181-11-25T19:06:00 | 115      | NaT                 | NaN                       | NaN
# 20221       |   220045 | 2181-11-25T19:16:00 | 114      | 2181-11-25T19:06:00 | 10                        | NaN
# 20221       |   220045 | 2181-11-25T21:00:00 | 113      | 2181-11-25T19:16:00 | 104                       | 20221_220045_2181-11-25T21:00:00

In [37]:
# Select all rows where Timedelta to Previous Measurement is larger than chunking dif
chunk_data = chunk_analysis_data[chunk_analysis_data["DIF_CHARTTIME_PREV_MIN"] > chunking_dif].copy()

In [38]:
# Assign a unique Chunking ID to these rows - consisting of ICUSTAY_ID, ITEMID and the CHARTTIME
chunk_data["CHUNK_ID"] = chunk_data.ICUSTAY_ID.map(str) + "_" + chunk_data.ITEMID.map(str) + "_" + chunk_data.CHARTTIME.map(str)

In [39]:
# Check uniqueness - can only be violated if multiple measurements for that ITEMID/ICUSTAY_ID occurred at the same CHARTTIME
print(len(chunk_data["CHUNK_ID"].value_counts()))
print(len(chunk_data))
# Uniqueness for this dataset is given

237226
237226


In [40]:
# Only keep CHUNK_ID; the index is carried along by the Series
chunk_data_subset = chunk_data["CHUNK_ID"]

In [41]:
# Merge back to all rows via index
# Now every measurement that follows a gap longer than the chunking rule allows carries a CHUNK_ID; all other rows are NaN
chunk_data_merged = pd.merge(chunk_analysis_data, chunk_data_subset,  how='left', left_index=True, right_index=True )

Introduce new Chunk Id for first row of each ICUSTAY_ID and ITEMID combination

In [42]:
# Example of relevant resulting data frame columns (chunk_data_merged_2):
# ICUSTAY_ID  |  ITEMID  | CHARTTIME           | VALUENUM | CHARTTIME_PREV      | DIF_CHARTTIME_PREV_MIN    | CHUNK_ID                        |CHUNK_ID_MIN
# 20221       |   220045 | 2181-11-25T19:06:00 | 115      | NaT                 | NaN                       | NaN                             |20221_220045_2181-11-25T19:06:00
# 20221       |   220045 | 2181-11-25T19:16:00 | 114      | 2181-11-25T19:06:00 | 10                        | NaN                             |NaN
# 20221       |   220045 | 2181-11-25T21:00:00 | 113      | 2181-11-25T19:16:00 | 104                       | 20221_220045_2181-11-25T21:00:00|NaN

In [43]:
# Assign Chunk ID to first measurement of ICUSTAY_ID - ITEMID combination 
# First, calculate min timestamp
chunk_data_min = chunk_data_merged.groupby(['ICUSTAY_ID','ITEMID'])['CHARTTIME'].min()
chunk_data_min_df = chunk_data_min.to_frame()
chunk_data_min_df.reset_index(inplace=True)

# Second, for each first Charttime (by ICUSTAY_ID/ITEMID) create a Chunk ID
chunk_data_min_df["CHUNK_ID_MIN"] = chunk_data_min_df.ICUSTAY_ID.map(str) + "_" + chunk_data_min_df.ITEMID.map(str) + "_" + chunk_data_min_df.CHARTTIME.map(str)

In [44]:
# Merge that back so we have a chunk ID for each first measurement (by ICUSTAY_ID/ITEMID)
chunk_data_merged_2 = pd.merge(chunk_data_merged, chunk_data_min_df,  how='left', on=['ICUSTAY_ID','ITEMID','CHARTTIME'])

Pass CHUNK_ID_MIN to CHUNK_ID

That way a single column holds a CHUNK_ID both for the first measurement of each ICUSTAY_ID/ITEMID combination and for every measurement that follows a gap longer than the chunking rule allows.

In [45]:
# ICUSTAY_ID  |  ITEMID  | CHARTTIME           | VALUENUM | CHARTTIME_PREV      | DIF_CHARTTIME_PREV_MIN    | CHUNK_ID                        |CHUNK_ID_MIN
# 20221       |   220045 | 2181-11-25T19:06:00 | 115      | NaT                 | NaN                       | 20221_220045_2181-11-25T19:06:00|20221_220045_2181-11-25T19:06:00
# 20221       |   220045 | 2181-11-25T19:16:00 | 114      | 2181-11-25T19:06:00 | 10                        | NaN                             |NaN
# 20221       |   220045 | 2181-11-25T21:00:00 | 113      | 2181-11-25T19:16:00 | 104                       | 20221_220045_2181-11-25T21:00:00|NaN

In [46]:
import numpy as np
# If CHUNK_ID_MIN is not NaN, write CHUNK_ID_MIN into CHUNK_ID
chunk_data_merged_2['CHUNK_ID'] = np.where(chunk_data_merged_2['CHUNK_ID_MIN'].notnull(), chunk_data_merged_2['CHUNK_ID_MIN'], chunk_data_merged_2['CHUNK_ID'])

Fill all cells with previous CHUNK_ID, until new CHUNK_ID occurs

Data must be sorted to make that work. By having a CHUNK_ID for each first occurrence of an ICUSTAY_ID/ITEMID combination, we make sure not to write CHUNK_IDs to other ICUSTAY_IDs or ITEMIDs.
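
The effect of the initial CHUNK_ID can be illustrated on a tiny frame. The sketch below uses made-up IDs (`1_a`, `1_b`, `2_a`) and a hypothetical frame `toy`; it only demonstrates why a plain forward fill needs a seed value on the first row of every group.

```python
import numpy as np
import pandas as pd

# Hypothetical, already sorted toy data: two ICU stays with sparse chunk IDs
toy = pd.DataFrame({
    "ICUSTAY_ID": [1, 1, 1, 2, 2],
    "CHUNK_ID": ["1_a", np.nan, "1_b", np.nan, np.nan],
})

# Without a seed on the first row of stay 2, the last ID of stay 1 leaks over
leaky = toy["CHUNK_ID"].ffill()

# Seed the first row of every stay that has no chunk ID yet
seeded = toy.copy()
first_rows = ~seeded["ICUSTAY_ID"].duplicated()
seeded.loc[first_rows & seeded["CHUNK_ID"].isna(), "CHUNK_ID"] = "2_a"  # made-up ID
safe = seeded["CHUNK_ID"].ffill()

print(leaky.tolist())
print(safe.tolist())
```

In the unseeded variant, the rows of stay 2 inherit `1_b`; with the seed in place they receive their own ID.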

In [47]:
# ICUSTAY_ID  |  ITEMID  | CHARTTIME           | VALUENUM | CHARTTIME_PREV      |  CHUNK_ID                       |CHUNK_ID_MIN                    | CHUNK_ID_FILLED
# 20221       |   220045 | 2181-11-25T19:06:00 | 115      | NaT                 | 20221_220045_2181-11-25T19:06:00|20221_220045_2181-11-25T19:06:00|20221_220045_2181-11-25T19:06:00
# 20221       |   220045 | 2181-11-25T19:16:00 | 114      | 2181-11-25T19:06:00 | NaN                             |NaN                             |20221_220045_2181-11-25T19:06:00
# 20221       |   220045 | 2181-11-25T21:00:00 | 113      | 2181-11-25T19:16:00 | 20221_220045_2181-11-25T21:00:00|NaN                             |20221_220045_2181-11-25T21:00:00

In [48]:
# Series.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions
chunk_data_merged_2['CHUNK_ID_FILLED'] = chunk_data_merged_2['CHUNK_ID'].ffill()

Remove Helper Columns

In [49]:
chunk_data_merged_2 = chunk_data_merged_2.drop(columns='CHUNK_ID_MIN')
# We could drop more helper columns, but CHUNK_ID is helpful in the visualization to see directly where a chunk ID was first introduced, so we keep it

In [50]:
chunk_data_merged_2.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ICUSTAY_ID,ITEMID,CHARTTIME,STORETIME,CGID,VALUE,VALUENUM,...,WARNING,ERROR,RESULTSTATUS,STOPPED,VALUENUM_CLEAN,CLEANING_FLAG,CHARTTIME_PREV,DIF_CHARTTIME_PREV_MIN,CHUNK_ID,CHUNK_ID_FILLED
0,14005075.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 19:06:00,2181-11-25 19:17:00,20622.0,115,115.0,...,0.0,0.0,,,115.0,,NaT,,200001.0_220045.0_2181-11-25 19:06:00,200001.0_220045.0_2181-11-25 19:06:00
1,14005090.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 19:16:00,2181-11-25 19:16:00,20622.0,114,114.0,...,0.0,0.0,,,114.0,,2181-11-25 19:06:00,10.0,,200001.0_220045.0_2181-11-25 19:06:00
2,14005105.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 20:00:00,2181-11-25 22:02:00,21108.0,113,113.0,...,0.0,0.0,,,113.0,,2181-11-25 19:16:00,44.0,,200001.0_220045.0_2181-11-25 19:06:00
3,14005111.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 21:00:00,2181-11-25 22:02:00,21108.0,108,108.0,...,0.0,0.0,,,108.0,,2181-11-25 20:00:00,60.0,,200001.0_220045.0_2181-11-25 19:06:00
4,14005117.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 22:00:00,2181-11-25 22:02:00,21108.0,110,110.0,...,0.0,0.0,,,110.0,,2181-11-25 21:00:00,60.0,,200001.0_220045.0_2181-11-25 19:06:00
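
As a cross-check, the same chunk assignment can be sketched in a more compact form that avoids the two merges. This is an alternative formulation under our own assumptions, not the procedure used above: the function name `assign_chunk_ids`, its parameter `max_gap_min`, and the toy frame are hypothetical.

```python
import pandas as pd

def assign_chunk_ids(df, max_gap_min=65):
    """Sketch: label each row with the ID of the chunk it belongs to."""
    keys = ["ICUSTAY_ID", "ITEMID"]
    out = df.sort_values(by=keys + ["CHARTTIME"]).copy()

    # Gap in whole minutes to the previous measurement of the same group
    prev = out.groupby(keys)["CHARTTIME"].shift(1)
    gap_min = (out["CHARTTIME"] - prev).dt.total_seconds() // 60

    # A chunk starts on the first row of a group (gap is NaN) or after a long gap
    starts_chunk = gap_min.isna() | (gap_min > max_gap_min)

    chunk_id = (
        out["ICUSTAY_ID"].map(str) + "_"
        + out["ITEMID"].map(str) + "_"
        + out["CHARTTIME"].map(str)
    )
    # Keep the ID only on chunk starts, then fill forward within each group
    out["CHUNK_ID"] = chunk_id.where(starts_chunk)
    out["CHUNK_ID_FILLED"] = out.groupby(keys)["CHUNK_ID"].ffill()
    return out

# Hypothetical toy data with gaps of 10, 104 and 60 minutes
toy = pd.DataFrame({
    "ICUSTAY_ID": [1, 1, 1, 1],
    "ITEMID": [220045] * 4,
    "CHARTTIME": pd.to_datetime([
        "2181-11-25 19:06:00",
        "2181-11-25 19:16:00",
        "2181-11-25 21:00:00",
        "2181-11-25 22:00:00",
    ]),
})
result = assign_chunk_ids(toy)
print(result[["CHARTTIME", "CHUNK_ID_FILLED"]])
```

Only the 104 minute gap exceeds the threshold, so the four rows fall into two chunks. Filling forward per group makes the result independent of seed rows, at the cost of a grouped operation on the full frame.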


In [51]:
# Save as parquet file
pd.DataFrame(chunk_data_merged_2).to_parquet('./data/chartevents_clean_values_with_chunkid_' + str(chunking_dif) + '.parquet', engine='pyarrow')

# Apply Chunk IDs to Thresholds

### Load and Prepare Data

In [52]:
import pandas as pd
import pyarrow as pa

chartevents_subset = pd.read_parquet('./data/chartevents_clean.parquet', engine='pyarrow')
chunk_data = pd.read_parquet('./data/chartevents_clean_values_with_chunkid_65.parquet', engine='pyarrow')
unique_icu_stays = pd.read_parquet('./data/unique_icustays_in_chartevents_subset.parquet', engine='pyarrow')

In [53]:
# Select relevant ICUSTAY_ID for analysis - only the ones appearing for the analyzed ITEMIDs
icustayid_filter = unique_icu_stays.ICUSTAY_ID

# Filter by ICUSTAY
chunk_analysis_data = chartevents_subset[chartevents_subset.ICUSTAY_ID.isin(icustayid_filter)].copy()

In [54]:
# Match columns for later union
import numpy as np
chunk_data = chunk_data.drop(columns=['CHARTTIME_PREV', 'DIF_CHARTTIME_PREV_MIN','CHUNK_ID'])
chartevents_subset.insert(loc=len(chartevents_subset.columns), column='CHUNK_ID_FILLED', value=np.nan)

In [55]:
chartevents_subset.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ICUSTAY_ID,ITEMID,CHARTTIME,STORETIME,CGID,VALUE,VALUENUM,VALUEUOM,WARNING,ERROR,RESULTSTATUS,STOPPED,VALUENUM_CLEAN,CLEANING_FLAG,CHUNK_ID_FILLED
0,14005075.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 19:06:00,2181-11-25 19:17:00,20622.0,115,115.0,bpm,0.0,0.0,,,115.0,,
1,14005090.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 19:16:00,2181-11-25 19:16:00,20622.0,114,114.0,bpm,0.0,0.0,,,114.0,,
2,14005105.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 20:00:00,2181-11-25 22:02:00,21108.0,113,113.0,bpm,0.0,0.0,,,113.0,,
3,14005111.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 21:00:00,2181-11-25 22:02:00,21108.0,108,108.0,bpm,0.0,0.0,,,108.0,,
4,14005117.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 22:00:00,2181-11-25 22:02:00,21108.0,110,110.0,bpm,0.0,0.0,,,110.0,,


In [56]:
chunk_data.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ICUSTAY_ID,ITEMID,CHARTTIME,STORETIME,CGID,VALUE,VALUENUM,VALUEUOM,WARNING,ERROR,RESULTSTATUS,STOPPED,VALUENUM_CLEAN,CLEANING_FLAG,CHUNK_ID_FILLED
0,14005075.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 19:06:00,2181-11-25 19:17:00,20622.0,115,115.0,bpm,0.0,0.0,,,115.0,,200001.0_220045.0_2181-11-25 19:06:00
1,14005090.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 19:16:00,2181-11-25 19:16:00,20622.0,114,114.0,bpm,0.0,0.0,,,114.0,,200001.0_220045.0_2181-11-25 19:06:00
2,14005105.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 20:00:00,2181-11-25 22:02:00,21108.0,113,113.0,bpm,0.0,0.0,,,113.0,,200001.0_220045.0_2181-11-25 19:06:00
3,14005111.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 21:00:00,2181-11-25 22:02:00,21108.0,108,108.0,bpm,0.0,0.0,,,108.0,,200001.0_220045.0_2181-11-25 19:06:00
4,14005117.0,55973.0,152234.0,200001.0,220045.0,2181-11-25 22:00:00,2181-11-25 22:02:00,21108.0,110,110.0,bpm,0.0,0.0,,,110.0,,200001.0_220045.0_2181-11-25 19:06:00


In [57]:
# Respective ITEMIDs
itemids_for_thresholds_HR = [220046, 220047]
itemids_for_thresholds_NBP = [223751, 223752]
itemids_for_thresholds_O2 = [223769, 223770]
itemids_for_value_HR = [220045]
itemids_for_value_NBP = [220179]
itemids_for_value_O2 = [220277]


In [58]:
# Threshold data
threshold_data_HR = chartevents_subset[chartevents_subset.ITEMID.isin(itemids_for_thresholds_HR)].copy()
threshold_data_NBP = chartevents_subset[chartevents_subset.ITEMID.isin(itemids_for_thresholds_NBP)].copy()
threshold_data_O2 = chartevents_subset[chartevents_subset.ITEMID.isin(itemids_for_thresholds_O2)].copy()


In [59]:
# Vital Parameter data
value_chunk_data_HR = chunk_data[chunk_data.ITEMID.isin(itemids_for_value_HR)].copy()
value_chunk_data_NBP = chunk_data[chunk_data.ITEMID.isin(itemids_for_value_NBP)].copy()
value_chunk_data_O2 = chunk_data[chunk_data.ITEMID.isin(itemids_for_value_O2)].copy()


In [60]:
# Union Threshold and Vital Parameter data
# pd.concat is used because DataFrame.append was removed in pandas 2.0
threshold_and_value_HR = pd.concat([threshold_data_HR, value_chunk_data_HR])
threshold_and_value_NBP = pd.concat([threshold_data_NBP, value_chunk_data_NBP])
threshold_and_value_O2 = pd.concat([threshold_data_O2, value_chunk_data_O2])

In [61]:
# Sort by ICUSTAY_ID & CHARTTIME - we do not have to look at ITEMID as we have separate data frames
threshold_and_value_HR = threshold_and_value_HR.sort_values(by=['ICUSTAY_ID','CHARTTIME'])
threshold_and_value_NBP = threshold_and_value_NBP.sort_values(by=['ICUSTAY_ID','CHARTTIME'])
threshold_and_value_O2 = threshold_and_value_O2.sort_values(by=['ICUSTAY_ID','CHARTTIME'])

In [None]:
# Example rows of resulting data frame - HR
# ICUSTAY_ID  |  ITEMID  | CHARTTIME           | VALUENUM | CHUNK_ID_FILLED
# 20221       |   220045 | 2181-11-25T19:06:00 | 115      | 20221_220045_2181-11-25T19:06:00
# 20221       |   220046 | 2181-11-25T19:07:00 | 130      | NaN
# 20221       |   220047 | 2181-11-25T21:07:00 | 60       | NaN

In [62]:
# Now we can fill in the corresponding chunk IDs; a separate column is used for easier validation
# Problem: a threshold may have been set before the first value appeared, so a plain ffill could write the CHUNK_ID of a different ICUSTAY_ID into that row
# Solution: group by ICUSTAY_ID, forward fill first (threshold set after a value), then backward fill (threshold set before the first value)
# For HR
threshold_and_value_HR["CHUNK_ID_FILLED_TH"] = threshold_and_value_HR.groupby('ICUSTAY_ID')['CHUNK_ID_FILLED'].transform(lambda x: x.ffill())
threshold_and_value_HR["CHUNK_ID_FILLED_TH"] = threshold_and_value_HR.groupby('ICUSTAY_ID')['CHUNK_ID_FILLED_TH'].transform(lambda x: x.bfill())

#For NBP
threshold_and_value_NBP["CHUNK_ID_FILLED_TH"] = threshold_and_value_NBP.groupby('ICUSTAY_ID')['CHUNK_ID_FILLED'].transform(lambda x: x.ffill())
threshold_and_value_NBP["CHUNK_ID_FILLED_TH"] = threshold_and_value_NBP.groupby('ICUSTAY_ID')['CHUNK_ID_FILLED_TH'].transform(lambda x: x.bfill())

#For O2
threshold_and_value_O2["CHUNK_ID_FILLED_TH"] = threshold_and_value_O2.groupby('ICUSTAY_ID')['CHUNK_ID_FILLED'].transform(lambda x: x.ffill())
threshold_and_value_O2["CHUNK_ID_FILLED_TH"] = threshold_and_value_O2.groupby('ICUSTAY_ID')['CHUNK_ID_FILLED_TH'].transform(lambda x: x.bfill())

In [None]:
# Example rows of resulting data frame - HR
# ICUSTAY_ID  |  ITEMID  | CHARTTIME           | VALUENUM | CHUNK_ID_FILLED                  | CHUNK_ID_FILLED_TH
# 20221       |   220045 | 2181-11-25T19:06:00 | 115      | 20221_220045_2181-11-25T19:06:00 | 20221_220045_2181-11-25T19:06:00
# 20221       |   220046 | 2181-11-25T19:07:00 | 130      | NaN                              | 20221_220045_2181-11-25T19:06:00
# 20221       |   220047 | 2181-11-25T21:07:00 | 60       | NaN                              | 20221_220045_2181-11-25T19:06:00
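
The grouped forward and backward fill can be checked on a small frame that contains the problematic case, a threshold charted before the first value of a stay. The frame `toy` and the IDs `1_a` and `2_a` are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, sorted by ICUSTAY_ID and CHARTTIME.
# Stay 2 starts with a threshold row (ITEMID 220046) that has no chunk ID yet.
toy = pd.DataFrame({
    "ICUSTAY_ID": [1, 1, 1, 2, 2],
    "ITEMID": [220045, 220046, 220047, 220046, 220045],
    "CHUNK_ID_FILLED": ["1_a", np.nan, np.nan, np.nan, "2_a"],
})

per_stay = toy.groupby("ICUSTAY_ID")["CHUNK_ID_FILLED"]

# Forward fill, then backward fill, both restricted to the same stay
toy["CHUNK_ID_FILLED_TH"] = per_stay.transform(lambda s: s.ffill().bfill())

# For comparison: an ungrouped forward fill assigns stay 1's ID to stay 2
toy["UNGROUPED"] = toy["CHUNK_ID_FILLED"].ffill()

print(toy)
```

The early threshold of stay 2 ends up in chunk `2_a` with the grouped fill, whereas the ungrouped fill would have assigned it to `1_a`.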

In [64]:
threshold_and_value_all = pd.concat([threshold_and_value_HR, threshold_and_value_NBP, threshold_and_value_O2])

In [65]:
# Sort the rows like they were in chartevents
threshold_and_value_all = threshold_and_value_all.sort_values(by=['ICUSTAY_ID', 'CHARTTIME','ITEMID'])

In [66]:
pd.DataFrame(threshold_and_value_all).to_parquet('./data/chartevents_clean_values_and_thresholds_with_chunkid_' + str(chunking_dif) + '.parquet', engine='pyarrow')