# Creation of Sampling Rate for Vital Parameters in Chunks

Aim: Compute the sampling rates per CHUNK_ID to generate an overview.

The sampling rate is the average number of measurements obtained per hour for a specific vital parameter. Threshold values are not analyzed.

Structure of the new data frame sampling_rates_for_chunkid.parquet:
* ICUSTAY_ID
* ITEMID
* CHUNK_ID
* CHARTTIME_MIN -> minimum timestamp for that CHUNK_ID. "When was the first measurement in this chunk conducted?"
* CHARTTIME_MAX -> maximum timestamp for that CHUNK_ID. "When was the last measurement in this chunk conducted?"
* CHUNKID_DURATION(h) -> timedelta between first and last timestamp for that CHUNK_ID, truncated to whole hours. "How many full hours have passed between the first and the last measurement?"
* VALUENUM_COUNT -> number of measurements for that CHUNK_ID. "How many measurements are available over the entire chunk?"
* SAMPLING_RATE -> VALUENUM_COUNT divided by CHUNKID_DURATION(h); if the duration is 0 hours, VALUENUM_COUNT is used directly. "How many measurements were obtained on average per hour in this chunk?"
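
As a worked example of the definitions above, the following minimal sketch computes the duration and the sampling rate for a single chunk in plain Python. The function name `sampling_rate_per_hour` is hypothetical and not part of this notebook; the truncation to whole hours and the zero-duration fallback mirror the rules applied further below.

```python
from datetime import datetime

def sampling_rate_per_hour(first, last, count):
    """Return (whole hours between first and last, measurements per hour)."""
    # Duration is truncated to whole hours, as in CHUNKID_DURATION(h)
    duration_h = (last - first).total_seconds() // 3600
    # A chunk shorter than one hour has duration 0; fall back to the raw count
    rate = count if duration_h == 0 else count / duration_h
    return duration_h, rate

# Timestamps and count taken from the first chunk in the output further below
first = datetime(2181, 11, 25, 19, 6)
last = datetime(2181, 11, 27, 5, 0)
duration_h, rate = sampling_rate_per_hour(first, last, 36)
print(duration_h, round(rate, 6))
```

The chunk spans 33 h 54 min, so the duration is truncated to 33 whole hours before the 36 measurements are divided by it.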

## Load and Prepare Data

The chunk data is loaded and then filtered for the relevant ITEMIDs.

In [13]:
import pandas as pd
import pyarrow as pa

# Read chunk data from parquet file to pandas data frame
sampling_rate_data = pd.read_parquet('./data/chartevent_subset_values_with_chunkid_65.parquet', engine='pyarrow')


In [14]:
# The sampling rate analysis covers only the vital parameters, not the thresholds
# Filter for ITEMIDs that refer to vital parameters
# Heart Rate: 220045 | NBP: 220179 | O2: 220277
itemids_for_values_filter = [220045, 220179, 220277]
sampling_rate_data = sampling_rate_data[sampling_rate_data.ITEMID.isin(itemids_for_values_filter)].copy()


In [15]:
# Rename CHUNK ID column
sampling_rate_data = sampling_rate_data.rename(columns={"CHUNK_ID_FILLED":"CHUNK_ID"})

## Fill Sampling Rate Data Frame

The relevant columns are calculated with groupby statements, as this turned out to be much faster than a for-loop.
One row is generated per CHUNK_ID.

In [16]:
# Calculate CHARTTIME_MIN for each CHUNK_ID
sampling_rate_data_min = sampling_rate_data.groupby(['CHUNK_ID'])['CHARTTIME'].min()
sampling_rate_data_min_df = sampling_rate_data_min.to_frame()
sampling_rate_data_min_df.reset_index(inplace=True)
sampling_rate_data_min_df = sampling_rate_data_min_df.rename(columns = {'CHARTTIME':'CHARTTIME_MIN'})
len(sampling_rate_data_min_df)

307241

In [17]:
# Calculate CHARTTIME_MAX for each CHUNK_ID
sampling_rate_data_max = sampling_rate_data.groupby(['CHUNK_ID'])['CHARTTIME'].max()
sampling_rate_data_max_df = sampling_rate_data_max.to_frame()
sampling_rate_data_max_df.reset_index(inplace=True)
sampling_rate_data_max_df = sampling_rate_data_max_df.rename(columns = {'CHARTTIME':'CHARTTIME_MAX'})
len(sampling_rate_data_max_df)

307241

In [18]:
# Calculate VALUENUM_COUNT for each CHUNK_ID
sampling_rate_data_count = sampling_rate_data[['CHUNK_ID','VALUENUM']].copy()
sampling_rate_data_count = sampling_rate_data_count.groupby(['CHUNK_ID']).count()
sampling_rate_data_count = sampling_rate_data_count.rename(columns = {'VALUENUM':'VALUENUM_COUNT'})
sampling_rate_data_count = sampling_rate_data_count.reset_index()



In [20]:
# Merge the three results on CHUNK_ID
# Resulting data frame columns: CHUNK_ID, CHARTTIME_MIN, CHARTTIME_MAX, VALUENUM_COUNT
sampling_rates_for_chunkid = pd.merge(sampling_rate_data_min_df, sampling_rate_data_max_df,  how='left', on=['CHUNK_ID'])
sampling_rates_for_chunkid = pd.merge(sampling_rates_for_chunkid,sampling_rate_data_count,how='left', on=['CHUNK_ID'])
len(sampling_rates_for_chunkid)
sampling_rates_for_chunkid.head()

Unnamed: 0,CHUNK_ID,CHARTTIME_MIN,CHARTTIME_MAX,VALUENUM_COUNT
0,200001.0_220045.0_2181-11-25 19:06:00,2181-11-25 19:06:00,2181-11-27 05:00:00,36
1,200001.0_220045.0_2181-11-27 07:00:00,2181-11-27 07:00:00,2181-11-27 16:00:00,10
2,200001.0_220045.0_2181-11-27 18:48:00,2181-11-27 18:48:00,2181-11-28 20:00:00,53
3,200001.0_220179.0_2181-11-25 19:08:00,2181-11-25 19:08:00,2181-11-26 15:00:00,21
4,200001.0_220179.0_2181-11-28 13:10:00,2181-11-28 13:10:00,2181-11-28 13:22:00,4
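
The three groupby results and the two merges above could probably also be produced in a single pass with pandas named aggregation. The following is a minimal sketch on made-up data, not the code used in this notebook; the frame `toy` and its values are assumptions for illustration, and only the column names are taken from `sampling_rate_data`.

```python
import pandas as pd

# Tiny made-up frame with the same column names as sampling_rate_data
toy = pd.DataFrame({
    "CHUNK_ID": ["a", "a", "a", "b", "b"],
    "CHARTTIME": pd.to_datetime([
        "2181-11-25 19:00", "2181-11-25 20:00", "2181-11-25 22:30",
        "2181-11-26 08:00", "2181-11-26 08:15",
    ]),
    "VALUENUM": [80.0, 82.0, None, 97.0, 98.0],
})

# One groupby pass; each keyword argument becomes an output column
summary = (
    toy.groupby("CHUNK_ID")
       .agg(
           CHARTTIME_MIN=("CHARTTIME", "min"),
           CHARTTIME_MAX=("CHARTTIME", "max"),
           VALUENUM_COUNT=("VALUENUM", "count"),  # count() skips missing values
       )
       .reset_index()
)
print(summary)
```

Note that `count` ignores missing VALUENUM entries, both here and in the cell above, so a row with a timestamp but no value extends the chunk duration without adding to VALUENUM_COUNT.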


In [21]:
# Calculate CHUNKID_DURATION(h) for each CHUNK_ID
sampling_rates_for_chunkid['CHUNKID_DURATION'] = sampling_rates_for_chunkid['CHARTTIME_MAX']-sampling_rates_for_chunkid['CHARTTIME_MIN']
sampling_rates_for_chunkid['CHUNKID_DURATION(s)'] = sampling_rates_for_chunkid['CHUNKID_DURATION'].dt.total_seconds()
# Floor division keeps only whole hours (e.g. 33 h 54 min -> 33.0)
sampling_rates_for_chunkid['CHUNKID_DURATION(h)'] = sampling_rates_for_chunkid['CHUNKID_DURATION(s)'] // 3600

# Drop helper columns
sampling_rates_for_chunkid = sampling_rates_for_chunkid.drop(columns=['CHUNKID_DURATION','CHUNKID_DURATION(s)'])

In [22]:
import numpy as np
# Calculate SAMPLING_RATE for each CHUNK_ID
# If CHUNKID_DURATION(h) is zero, take VALUENUM_COUNT as SAMPLING_RATE to avoid dividing by 0
sampling_rates_for_chunkid['SAMPLING_RATE'] = np.where(
    sampling_rates_for_chunkid['CHUNKID_DURATION(h)'] == 0,
    sampling_rates_for_chunkid['VALUENUM_COUNT'],
    sampling_rates_for_chunkid['VALUENUM_COUNT'] / sampling_rates_for_chunkid['CHUNKID_DURATION(h)'],
)
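
Because the duration is truncated to whole hours, a chunk spanning 1 h 59 min is treated as 1 h, and any chunk shorter than one hour falls back to its raw count. The sketch below illustrates this on made-up data and contrasts it with a rate based on fractional hours. The fractional variant and the column name `SAMPLING_RATE_FRACTIONAL` are assumptions for illustration only and are not what this notebook computes.

```python
import numpy as np
import pandas as pd

# Made-up chunks: 12 min, 1 h 59 min, and a single measurement (0 min)
toy = pd.DataFrame({
    "CHARTTIME_MIN": pd.to_datetime(["2181-11-28 13:10", "2181-11-27 07:00", "2181-11-27 10:00"]),
    "CHARTTIME_MAX": pd.to_datetime(["2181-11-28 13:22", "2181-11-27 08:59", "2181-11-27 10:00"]),
    "VALUENUM_COUNT": [4, 6, 1],
})
seconds = (toy["CHARTTIME_MAX"] - toy["CHARTTIME_MIN"]).dt.total_seconds()

# Rule used in this notebook: whole hours, raw count when the duration is 0 h
whole_hours = seconds // 3600
toy["SAMPLING_RATE"] = np.where(
    whole_hours == 0, toy["VALUENUM_COUNT"], toy["VALUENUM_COUNT"] / whole_hours
)

# Hypothetical alternative: fractional hours, NaN when the duration is exactly 0
hours = seconds / 3600
toy["SAMPLING_RATE_FRACTIONAL"] = toy["VALUENUM_COUNT"] / hours.where(hours > 0)

print(toy[["VALUENUM_COUNT", "SAMPLING_RATE", "SAMPLING_RATE_FRACTIONAL"]])
```

For the 12-minute chunk, the notebook rule reports 4 while the fractional variant reports about 20 measurements per hour, so rates of very short chunks should be interpreted with care.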
         

In [23]:
# Merge ICUSTAY_ID and ITEMID into sampling_rates_for_chunkid
icustay_and_itemid_for_chunk = sampling_rate_data[['ICUSTAY_ID','ITEMID','CHUNK_ID']]
icustay_and_itemid_for_chunk = icustay_and_itemid_for_chunk.drop_duplicates()
sampling_rates_for_chunkid = pd.merge(icustay_and_itemid_for_chunk,sampling_rates_for_chunkid,how='left', on=['CHUNK_ID'])
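
The merge above relies on every CHUNK_ID belonging to exactly one ICUSTAY_ID and ITEMID pair, which appears to hold because the CHUNK_ID string embeds both. A minimal sketch of how this assumption could be checked with the `validate` argument of `pd.merge`, using made-up frames (`keys`, `rates`, and `bad_keys` are hypothetical names):

```python
import pandas as pd

keys = pd.DataFrame({
    "ICUSTAY_ID": [200001.0, 200001.0],
    "ITEMID": [220045.0, 220179.0],
    "CHUNK_ID": ["c1", "c2"],
})
rates = pd.DataFrame({"CHUNK_ID": ["c1", "c2"], "SAMPLING_RATE": [1.09, 1.11]})

# validate="one_to_one" raises MergeError if CHUNK_ID is duplicated on either side
merged = pd.merge(keys, rates, how="left", on="CHUNK_ID", validate="one_to_one")

# A CHUNK_ID that maps to two different ITEMIDs is rejected
bad_keys = pd.concat([keys, keys.iloc[[0]].assign(ITEMID=220277.0)], ignore_index=True)
try:
    pd.merge(bad_keys, rates, how="left", on="CHUNK_ID", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(len(merged), duplicate_detected)
```

Without such a check, a CHUNK_ID that maps to several pairs would silently duplicate rows in the result.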

In [24]:
sampling_rates_for_chunkid.head()

Unnamed: 0,ICUSTAY_ID,ITEMID,CHUNK_ID,CHARTTIME_MIN,CHARTTIME_MAX,VALUENUM_COUNT,CHUNKID_DURATION(h),SAMPLING_RATE
0,200001.0,220045.0,200001.0_220045.0_2181-11-25 19:06:00,2181-11-25 19:06:00,2181-11-27 05:00:00,36,33.0,1.090909
1,200001.0,220045.0,200001.0_220045.0_2181-11-27 07:00:00,2181-11-27 07:00:00,2181-11-27 16:00:00,10,9.0,1.111111
2,200001.0,220045.0,200001.0_220045.0_2181-11-27 18:48:00,2181-11-27 18:48:00,2181-11-28 20:00:00,53,25.0,2.12
3,200001.0,220179.0,200001.0_220179.0_2181-11-25 19:08:00,2181-11-25 19:08:00,2181-11-26 15:00:00,21,19.0,1.105263
4,200001.0,220179.0,200001.0_220179.0_2181-11-28 13:10:00,2181-11-28 13:10:00,2181-11-28 13:22:00,4,0.0,4.0


## Save Data Frame to parquet File

In [None]:
# Save sampling_rates_for_chunkid as parquet file
sampling_rates_for_chunkid.to_parquet('./data/sampling_rates_for_chunkid.parquet', engine='pyarrow')