# **LSTM/TCN preprocessing for feature set 1 for dynamic data: Feature subset**

To build a decision-fusion classification system, and to see whether it captures the patterns in the features better, we need to preprocess our time-series data into feature subsets.

Each subset will be passed through its own LSTM model and the decisions fused.

**Subset splitting logic:**
- We will split the features into subsets by their average sampling frequency per patient
- Features that have more data points can be resampled more frequently
- Features that have fewer data points will be resampled less frequently

This avoids using a single resampling rate that suits some features but requires significant data synthesis for others.

Previously, all data was resampled to a 30 min interval and values interpolated where necessary. However, for features such as PaO2 and PaCO2, this would have required significant data creation and interpolation, which is not optimal for learning true data patterns.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np

### **Step 1: Load time series data**

In [None]:
# Load the train and test data
train_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/preprocessing_v2/03_train_data_standard_preprocess_done.parquet'
test_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/preprocessing_v2/03_test_data_standard_preprocess_done.parquet'

train_df = pd.read_parquet(train_path)
test_df = pd.read_parquet(test_path)

train_df.head()

Unnamed: 0,subject_id,itemid,valuenum,time_to_extubation_mins,time_from_window_start,label,extubation_failure
0,10001884,223835,40.0,160.0,200.0,Inspired O2 Fraction,1
1,10001884,224685,,160.0,200.0,Tidal Volume (observed),1
2,10001884,224686,,160.0,200.0,Tidal Volume (spontaneous),1
3,10001884,224687,6.1,160.0,200.0,Minute Volume,1
4,10001884,224695,17.0,160.0,200.0,Peak Insp. Pressure,1


In [None]:
# Rename time_from_window_start to time_from_window_start_mins and label to feature_label for both train and test
train_df = train_df.rename(columns={'time_from_window_start': 'time_from_window_start_mins', 'label': 'feature_label'})
test_df = test_df.rename(columns={'time_from_window_start': 'time_from_window_start_mins', 'label': 'feature_label'})

# Drop time to extubation mins column and item id
train_df = train_df.drop(columns=['time_to_extubation_mins', 'itemid'])
test_df = test_df.drop(columns=['time_to_extubation_mins', 'itemid'])

train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 90905 entries, 0 to 116638
Data columns (total 5 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   subject_id                   90905 non-null  int64  
 1   valuenum                     88094 non-null  float64
 2   time_from_window_start_mins  90905 non-null  float64
 3   feature_label                90905 non-null  object 
 4   extubation_failure           90905 non-null  int64  
dtypes: float64(2), int64(2), object(1)
memory usage: 4.2+ MB


In [None]:
train_df_copy = train_df.copy()
test_df_copy = test_df.copy()

### **Step 2: Determine the subset groups**

We will determine the subset groups based on the average sampling frequency per patient.

We will remove ventilator mode categorical data as recommended by Dr Mayur.

In [None]:
# Remove all data where the feature_label is Ventilator Mode
train_df = train_df[train_df['feature_label'] != 'Ventilator Mode']
test_df = test_df[test_df['feature_label'] != 'Ventilator Mode']

# Reset the index of the dataframes
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

# Highlight the unique feature labels
unique_features_train = train_df['feature_label'].unique()
unique_features_test = test_df['feature_label'].unique()

print("Unique features in train_df:", unique_features_train)
print("Unique features in test_df:", unique_features_test)

Unique features in train_df: ['Inspired O2 Fraction' 'Tidal Volume (observed)'
 'Tidal Volume (spontaneous)' 'Minute Volume' 'Peak Insp. Pressure'
 'Respiratory Rate' 'O2 saturation pulseoxymetry' 'Arterial O2 pressure'
 'Arterial CO2 Pressure' 'PH (Arterial)']
Unique features in test_df: ['O2 saturation pulseoxymetry' 'Inspired O2 Fraction'
 'Tidal Volume (observed)' 'Tidal Volume (spontaneous)' 'Minute Volume'
 'Peak Insp. Pressure' 'Respiratory Rate' 'Arterial O2 pressure'
 'Arterial CO2 Pressure' 'PH (Arterial)']


In [None]:
# Group by subject_id and feature_label to count the number of samples for each combination
per_patient_sampling_frequency = train_df.groupby(['subject_id', 'feature_label']).size().reset_index(name='count')

per_patient_sampling_frequency_pivot = per_patient_sampling_frequency.pivot(index='subject_id', columns='feature_label', values='count').fillna(0)

# Calculate the average sampling frequency per feature
average_sampling_frequency = per_patient_sampling_frequency_pivot.mean().sort_values(ascending=False)

# Create columns for the table
average_sampling_frequency_df = pd.DataFrame({'Feature': average_sampling_frequency.index, 'Average Train Set Sampling Frequency': average_sampling_frequency.values})

# Display the table
average_sampling_frequency_df

Unnamed: 0,Feature,Average Train Set Sampling Frequency
0,O2 saturation pulseoxymetry,6.597074
1,Respiratory Rate,6.578989
2,Inspired O2 Fraction,2.106383
3,Tidal Volume (observed),1.616755
4,Minute Volume,1.600798
5,Peak Insp. Pressure,1.528457
6,Tidal Volume (spontaneous),1.370745
7,PH (Arterial),0.532979
8,Arterial CO2 Pressure,0.521809
9,Arterial O2 pressure,0.521809


Sparsely observed features have already been removed, so we are left with 10 time-series features to group. **Note: the frequencies are based on the training data only.**

The features will be split as follows:

- Low Frequency Subset (frequency < 1 in 6 hours)
- Medium Frequency Subset (1 <= frequency < 3 in 6 hours)
- High Frequency Subset (frequency >= 3 in 6 hours)

This facilitates creating subsets where each set can be resampled at a bespoke rate that better considers the actual sampling frequency of the real data.
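
As a compact illustration of the threshold rule, the same grouping can be sketched with `pd.cut`. This is only a sketch: it uses three of the averages from the table above, and the group names `low`, `medium` and `high` are labels chosen for the example. The bins are left-closed so that a feature averaging exactly 1 or 3 observations falls into the higher group, matching the `>=` comparisons used in the cells that follow.

```python
import pandas as pd

# Three of the training-set averages reported in the table above
mean_counts = pd.Series({
    "O2 saturation pulseoxymetry": 6.597074,
    "Inspired O2 Fraction": 2.106383,
    "PH (Arterial)": 0.532979,
})

# right=False gives left-closed bins: [0, 1), [1, 3), [3, inf)
groups = pd.cut(
    mean_counts,
    bins=[0, 1, 3, float("inf")],
    labels=["low", "medium", "high"],
    right=False,
)

# Collect the feature names that fall into each group
subsets = {name: groups[groups == name].index.tolist() for name in ["low", "medium", "high"]}
print(subsets)
```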

This splits the data as follows:

In [None]:
low_frequency_features = average_sampling_frequency_df[average_sampling_frequency_df['Average Train Set Sampling Frequency'] < 1]['Feature'].tolist()
low_frequency_features

['PH (Arterial)', 'Arterial CO2 Pressure', 'Arterial O2 pressure']

In [None]:
medium_frequency_features = average_sampling_frequency_df[(average_sampling_frequency_df['Average Train Set Sampling Frequency'] >= 1) & (average_sampling_frequency_df['Average Train Set Sampling Frequency'] < 3)]['Feature'].tolist()
medium_frequency_features

['Inspired O2 Fraction',
 'Tidal Volume (observed)',
 'Minute Volume',
 'Peak Insp. Pressure',
 'Tidal Volume (spontaneous)']

In [None]:
high_frequency_features = average_sampling_frequency_df[average_sampling_frequency_df['Average Train Set Sampling Frequency'] >= 3]['Feature'].tolist()
high_frequency_features

['O2 saturation pulseoxymetry', 'Respiratory Rate']

### **Step 3: Split train and test data into the three subsets**

We will now split the train and test data into the three subsets based on our feature grouping.

In [None]:
def filter_features(df, feature_list):
    """
    Filters the dataframe to include only rows where the feature_label column matches one of the features in the feature_list.

    Parameters:
    df (pd.DataFrame): The input dataframe containing patient data.
    feature_list (list): A list of feature names to filter by.

    Returns:
    pd.DataFrame: A filtered dataframe containing only the specified features.
    """
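    # Return an explicit copy so that later column assignments do not raise SettingWithCopyWarning
    filtered_df = df[df['feature_label'].isin(feature_list)].copy()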
    return filtered_df

In [None]:
# Extract the low frequency data
low_frequency_train_df = filter_features(train_df, low_frequency_features)
low_frequency_test_df = filter_features(test_df, low_frequency_features)

# Check the unique feature_labels in the data
low_frequency_train_df['feature_label'].unique()

array(['Arterial O2 pressure', 'Arterial CO2 Pressure', 'PH (Arterial)'],
      dtype=object)

In [None]:
# Extract the medium frequency data
medium_frequency_train_df = filter_features(train_df, medium_frequency_features)
medium_frequency_test_df = filter_features(test_df, medium_frequency_features)

# Check the unique feature_labels in the data
medium_frequency_train_df['feature_label'].unique()

array(['Inspired O2 Fraction', 'Tidal Volume (observed)',
       'Tidal Volume (spontaneous)', 'Minute Volume',
       'Peak Insp. Pressure'], dtype=object)

In [None]:
# Extract the high frequency data
high_frequency_train_df = filter_features(train_df, high_frequency_features)
high_frequency_test_df = filter_features(test_df, high_frequency_features)

# Check the unique feature_labels in the data
high_frequency_train_df['feature_label'].unique()

array(['Respiratory Rate', 'O2 saturation pulseoxymetry'], dtype=object)

In [None]:
# Save the data
data_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/feature_subsets'
low_frequency_train_df.to_parquet(f'{data_path}/low_frequency_train_df.parquet', index=False)
low_frequency_test_df.to_parquet(f'{data_path}/low_frequency_test_df.parquet', index=False)
medium_frequency_train_df.to_parquet(f'{data_path}/medium_frequency_train_df.parquet', index=False)
medium_frequency_test_df.to_parquet(f'{data_path}/medium_frequency_test_df.parquet', index=False)
high_frequency_train_df.to_parquet(f'{data_path}/high_frequency_train_df.parquet', index=False)
high_frequency_test_df.to_parquet(f'{data_path}/high_frequency_test_df.parquet', index=False)

**Count the number of patients in each**

In [None]:
# Number of patients in the original data
print(f"Number of patients in the original train data: {train_df['subject_id'].nunique()}")
print(f"Number of patients in the original test data: {test_df['subject_id'].nunique()}")

Number of patients in the original train data: 3760
Number of patients in the original test data: 941


In [None]:
# Number of unique subject ids
print(f"Number of unique subject ids in low frequency train data: {low_frequency_train_df['subject_id'].nunique()}")
print(f"Number of unique subject ids in low frequency test data: {low_frequency_test_df['subject_id'].nunique()}")

Number of unique subject ids in low frequency train data: 1399
Number of unique subject ids in low frequency test data: 350


In [None]:
print(f"Number of unique subject ids in medium frequency train data: {medium_frequency_train_df['subject_id'].nunique()}")
print(f"Number of unique subject ids in medium frequency test data: {medium_frequency_test_df['subject_id'].nunique()}")

Number of unique subject ids in medium frequency train data: 3722
Number of unique subject ids in medium frequency test data: 930


In [None]:
print(f"Number of unique subject ids in high frequency train data: {high_frequency_train_df['subject_id'].nunique()}")
print(f"Number of unique subject ids in high frequency test data: {high_frequency_test_df['subject_id'].nunique()}")

Number of unique subject ids in high frequency train data: 3760
Number of unique subject ids in high frequency test data: 940


We will need to ensure that all patients are reflected across all data sets. The difference is explained by some patients not having recorded values for some features.

As expected, there are fewer patients in the low frequency subset.

As such, we will need to handle where certain patients are not represented in certain subsets by creating data for them using a relevant strategy.

### **Step 4: Resample and interpolate the data to bespoke frequencies**

We will need to choose a sequence length for each feature subset that balances data synthesis and data density.

If the models are given too few data points then they are limited in terms of the patterns they can learn. However, if the proportion of synthetic data is high then the models will be trained on primarily synthetic data patterns and not be able to predict on real data with any great accuracy.

The resampling rate for this study will be the following:
- Low frequency: every 2 hours (seq_length = 4)
- Medium frequency: every 1 hour (seq_length = 7)
- High frequency: every 30 mins (seq_length = 13)

Resampling the low frequency data any less often would leave too few time steps to predict on, which would interfere with the global prediction.
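
The sequence lengths listed above follow from the 6-hour (360-minute) window when both the 0-minute and 360-minute endpoints are kept. A minimal sketch of that arithmetic, where `sequence_length` is a hypothetical helper introduced only for this illustration:

```python
WINDOW_MINUTES = 360  # 6-hour observation window used throughout this notebook


def sequence_length(interval_minutes, window_minutes=WINDOW_MINUTES):
    """Number of time steps when both window endpoints are included."""
    return window_minutes // interval_minutes + 1


for name, interval in {"low": 120, "medium": 60, "high": 30}.items():
    print(f"{name}: every {interval} min -> seq_length = {sequence_length(interval)}")
```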

**Handle where there are missing patients compared to the original data**

Some patients do not have any data for select features and therefore do not appear in the derived subsets.

***Decision: Do we make each subset have the full patient dataset or just use the patients where there is data present?***

While the full patient population could be passed through each LSTM model, this would involve significant data synthesis, especially for the low frequency subset, where data would essentially need to be created from scratch for the patients who are not present. This would introduce noise and bias, forcing the models to learn from artificial patterns that may not exist in the real data and leading to overfitting.

**As such, we will only train the models on the patient data available and not synthetically add patients if they are not present.**

***Decision: How to handle where patients have data for one feature but no data for other features***

We want to avoid synthesising data as much as possible and therefore need to find a way to handle missing data for specific features.

As such, we will implement a masking layer system where the model is told to ignore a feature for patients in whom it is not present.

Where a patient has no data for a specific feature, we will not impute that data but will set all time steps for that feature as NaN. This NaN data will then be masked so that the model does not learn from it.
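
To make the masking idea concrete, the sketch below builds a boolean mask from NaNs in a toy array shaped `(patients, time steps, features)` and replaces the NaNs with a placeholder so the array can be fed to a model. The array contents, the `SENTINEL` value and the variable names are assumptions made for illustration; how the model actually consumes the mask (a masking layer, a mask-weighted loss, or the mask as an extra input) is not specified in this section.

```python
import numpy as np

# Toy batch: 2 patients, 4 time steps, 3 features (values are arbitrary)
x = np.arange(24, dtype=float).reshape(2, 4, 3)

# Patient 1 has no data at all for feature 2, so every time step is NaN
x[1, :, 2] = np.nan

# True where a real (or interpolated) value exists, False where it is missing
mask = ~np.isnan(x)

# Assumed placeholder value; it only has to be ignored downstream via the mask
SENTINEL = 0.0
x_filled = np.where(mask, x, SENTINEL)

print(mask[1, :, 2])           # all False: this feature is ignored for patient 1
print(np.isnan(x_filled).any())  # False: no NaNs remain in the model input
```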

**Count the number of patients who do not have data for all features in the data subsets**

In [None]:
def count_incomplete_patients(df, feature_labels):
    """
    Counts the number of patients in the dataframe who do not have data for all specified feature labels.

    Parameters:
    df (pd.DataFrame): The input dataframe containing patient data.
    feature_labels (list): A list of feature labels to check for completeness.

    Returns:
    int: The number of patients without complete data for all specified feature labels.
    """
    # Group by patient (assuming 'subject_id' identifies patients)
    patient_groups = df.groupby('subject_id')

    incomplete_count = 0

    # Iterate over each patient group
    for patient_id, group in patient_groups:
        # Check if the patient has data for all feature labels
        if not all(feature in group['feature_label'].values for feature in feature_labels):
            incomplete_count += 1

    return incomplete_count
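
For larger datasets, the same count can be obtained without a Python loop. The sketch below is one possible vectorised equivalent, not part of the original pipeline; `count_incomplete_patients_vectorised` and the toy dataframe are hypothetical names and data used only for illustration. Like the loop above, it treats a feature as present if the patient has at least one row with that label, regardless of whether `valuenum` is NaN.

```python
import pandas as pd


def count_incomplete_patients_vectorised(df, feature_labels):
    """Count patients missing at least one of the given feature labels."""
    # Rows: patients, columns: feature labels, cells: number of rows observed
    counts = pd.crosstab(df["subject_id"], df["feature_label"])
    # Make sure every requested feature has a column, even if never observed
    counts = counts.reindex(columns=feature_labels, fill_value=0)
    # A patient is incomplete if any requested feature has zero rows
    return int((counts == 0).any(axis=1).sum())


toy = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 3],
    "feature_label": ["a", "b", "a", "a", "b"],
})
n_incomplete = count_incomplete_patients_vectorised(toy, ["a", "b"])
print(n_incomplete)
```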

In [None]:
# Low frequency features
low_frequency_incomplete_count_train = count_incomplete_patients(low_frequency_train_df, low_frequency_features)
print(f"Number of patients with incomplete data for low frequency features: {low_frequency_incomplete_count_train}")
low_frequency_incomplete_count_test = count_incomplete_patients(low_frequency_test_df, low_frequency_features)
print(f"Number of patients with incomplete data for low frequency features: {low_frequency_incomplete_count_test}")

Number of patients with incomplete data for low frequency features: 12
Number of patients with incomplete data for low frequency features: 1


In [None]:
# Medium frequency features
medium_frequency_incomplete_count_train = count_incomplete_patients(medium_frequency_train_df, medium_frequency_features)
print(f"Number of patients with incomplete data for medium frequency features: {medium_frequency_incomplete_count_train}")
medium_frequency_incomplete_count_test = count_incomplete_patients(medium_frequency_test_df, medium_frequency_features)
print(f"Number of patients with incomplete data for medium frequency features: {medium_frequency_incomplete_count_test}")

Number of patients with incomplete data for medium frequency features: 603
Number of patients with incomplete data for medium frequency features: 162


In [None]:
# High frequency features
high_frequency_incomplete_count_train = count_incomplete_patients(high_frequency_train_df, high_frequency_features)
print(f"Number of patients with incomplete data for high frequency features: {high_frequency_incomplete_count_train}")
high_frequency_incomplete_count_test = count_incomplete_patients(high_frequency_test_df, high_frequency_features)
print(f"Number of patients with incomplete data for high frequency features: {high_frequency_incomplete_count_test}")

Number of patients with incomplete data for high frequency features: 14
Number of patients with incomplete data for high frequency features: 7


There are quite a few patients where data is missing, and so synthesising the data from scratch would bias the data towards synthetic patterns.

As such, we will set the missing data as NaNs and mask it when passing it through the model so that it does not learn from that data.

**Convert time to timedelta format**

In [29]:
# Convert the time_from_window_start_mins to timedelta in mins for all dataframes
low_frequency_train_df['time_from_window_start_mins'] = pd.to_timedelta(low_frequency_train_df['time_from_window_start_mins'], unit='m')
low_frequency_test_df['time_from_window_start_mins'] = pd.to_timedelta(low_frequency_test_df['time_from_window_start_mins'], unit='m')
medium_frequency_train_df['time_from_window_start_mins'] = pd.to_timedelta(medium_frequency_train_df['time_from_window_start_mins'], unit='m')
medium_frequency_test_df['time_from_window_start_mins'] = pd.to_timedelta(medium_frequency_test_df['time_from_window_start_mins'], unit='m')
high_frequency_train_df['time_from_window_start_mins'] = pd.to_timedelta(high_frequency_train_df['time_from_window_start_mins'], unit='m')
high_frequency_test_df['time_from_window_start_mins'] = pd.to_timedelta(high_frequency_test_df['time_from_window_start_mins'], unit='m')

low_frequency_train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5928 entries, 59 to 86374
Data columns (total 5 columns):
 #   Column                       Non-Null Count  Dtype          
---  ------                       --------------  -----          
 0   subject_id                   5928 non-null   int64          
 1   valuenum                     5847 non-null   float64        
 2   time_from_window_start_mins  5928 non-null   timedelta64[ns]
 3   feature_label                5928 non-null   object         
 4   extubation_failure           5928 non-null   int64          
dtypes: float64(1), int64(2), object(1), timedelta64[ns](1)
memory usage: 277.9+ KB



In [30]:
low_frequency_train_df.head()

Unnamed: 0,subject_id,valuenum,time_from_window_start_mins,feature_label,extubation_failure
59,10002428,127.0,0 days 05:42:00,Arterial O2 pressure,0
60,10002428,43.0,0 days 05:42:00,Arterial CO2 Pressure,0
61,10002428,7.43,0 days 05:42:00,PH (Arterial),0
79,10004235,97.0,0 days 00:22:00,Arterial O2 pressure,1
80,10004235,38.0,0 days 00:22:00,Arterial CO2 Pressure,1


In [133]:
low_frequency_train_copy = low_frequency_train_df.copy()
low_frequency_test_copy = low_frequency_test_df.copy()
medium_frequency_train_copy = medium_frequency_train_df.copy()
medium_frequency_test_copy = medium_frequency_test_df.copy()
high_frequency_train_copy = high_frequency_train_df.copy()
high_frequency_test_copy = high_frequency_test_df.copy()

### **Step 5: Applying resampling and interpolation logic**

**Data preparation**

- *Calculate start and end means:* compute the mean value of each feature in the training set within the first half-interval of the 6-hour window (0 to `target_interval / 2` minutes) and within the last half-interval (`360 - target_interval / 2` to 360 minutes).

**Fill start and end values** (for each patient and feature)

- *Start value:*
  - If a value exists at 0 minutes, use it.
  - Else, if there is a value within the first half-interval, use the earliest value within it.
  - Otherwise, use the mean start value calculated from the training data.
- *End value:*
  - If a value exists at 360 minutes, use it.
  - Else, if there is a value within the last half-interval, use the latest value within it.
  - Otherwise, use the mean end value calculated from the training data.
- *Add missing start and end values:* append new rows to the dataset for the missing start and end values.

**Resample and interpolate** (for each patient and feature)

- *Resample to initial interval:* resample the data to the initial interval (e.g., 1 minute) to ensure a uniform timeline.
- *Interpolate missing values:* apply linear interpolation to fill missing values if there is any initial data. If no data is present for the feature, fill the entire series with NaNs.
- *Resample to target interval:* resample the data to the target interval (e.g., 30 minutes) to obtain the desired frequency.
- *Align extubation failure labels:* forward and backward fill the `extubation_failure` labels to align with the resampled index.

**Mask creation**

- Create masks indicating where NaN values are present in the processed data for both training and test sets.

**Ensure consistency**

- Ensure that the mean values for filling start and end points are calculated from the training data only, to avoid data leakage.

**Output**

- Return the processed training and test datasets along with their respective masks indicating missing values.
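
The two-stage resampling described above (fine grid, linear interpolation, then coarse grid) can be illustrated on a single toy series before reading the full functions. This is a minimal sketch with made-up values; it uses the `'1min'` / `'120min'` frequency aliases, which are equivalent to the `'1T'` / `'120T'` aliases used in the functions below.

```python
import pandas as pd

# Toy series for one patient and one feature: observations at 0, 90 and 360 minutes
series = pd.Series(
    [10.0, 40.0, 70.0],
    index=pd.to_timedelta([0, 90, 360], unit="m"),
)

# Stage 1: put the data on a uniform 1-minute grid (new points are NaN)
per_minute = series.resample("1min").asfreq()

# Stage 2: linearly interpolate between the observed points
interpolated = per_minute.interpolate(method="linear")

# Stage 3: keep only the points that fall on the 120-minute target grid
resampled = interpolated.resample("120min").asfreq()

print(resampled)
```

Interpolating on the 1-minute grid first means the value kept at each target time step reflects the observations on either side of it, rather than only those that happen to fall exactly on the target grid.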

In [134]:
def fill_start_end_values(df, feature_labels, start_means, end_means, start_window, end_window_start):
    """
    Fill missing start and end values for features in a DataFrame with specified means or existing values.

    Parameters:
    df (pd.DataFrame): The input DataFrame containing time series data for various features.
    feature_labels (list): A list of feature labels to check for completeness.
    start_means (pd.Series): A Series containing mean values for each feature to be used if no start value is present.
    end_means (pd.Series): A Series containing mean values for each feature to be used if no end value is present.
    start_window (pd.Timedelta): The time window within which to look for a starting value.
    end_window_start (pd.Timedelta): The time window start within which to look for an ending value.

    Returns:
    pd.DataFrame: The DataFrame with filled start and end values.
    """
    new_rows = []

    for feature in feature_labels:
        for subject_id in df['subject_id'].unique():
            subject_df = df[df['subject_id'] == subject_id]
            extubation_failure_label = subject_df['extubation_failure'].iloc[0]

            # Check if the original data for this feature for this patient is entirely NaN
            if subject_df[subject_df['feature_label'] == feature]['valuenum'].isna().all():
                # If all values are NaN, ensure any filled values are also NaN
                new_rows.append({
                    'subject_id': subject_id,
                    'feature_label': feature,
                    'time_from_window_start_mins': pd.Timedelta(0),
                    'valuenum': np.nan,
                    'extubation_failure': extubation_failure_label
                })
                new_rows.append({
                    'subject_id': subject_id,
                    'feature_label': feature,
                    'time_from_window_start_mins': pd.Timedelta(minutes=360),
                    'valuenum': np.nan,
                    'extubation_failure': extubation_failure_label
                })
                continue

            # Handle start values
            start_check = (subject_df['feature_label'] == feature) & (subject_df['time_from_window_start_mins'] == pd.Timedelta(0))
            if not start_check.any():
                # Check if there's a value in the first half of the sampling window
                start_window_check = (subject_df['feature_label'] == feature) & (subject_df['time_from_window_start_mins'] <= start_window)
                if start_window_check.any():
                    # Use the earliest value within the start window
                    start_value = subject_df[start_window_check].sort_values('time_from_window_start_mins').iloc[0]['valuenum']
                else:
                    # Use the mean start value if no value is found within the start window
                    start_value = start_means[feature]
                new_rows.append({
                    'subject_id': subject_id,
                    'feature_label': feature,
                    'time_from_window_start_mins': pd.Timedelta(0),
                    'valuenum': start_value,
                    'extubation_failure': extubation_failure_label
                })

            # Handle end values
            end_check = (subject_df['feature_label'] == feature) & (subject_df['time_from_window_start_mins'] == pd.Timedelta(minutes=360))
            if not end_check.any():
                # Check if there's a value in the last half of the sampling window
                end_window_check = (subject_df['feature_label'] == feature) & (subject_df['time_from_window_start_mins'] >= end_window_start)
                if end_window_check.any():
                    # Use the latest value within the end window
                    end_value = subject_df[end_window_check].sort_values('time_from_window_start_mins', ascending=False).iloc[0]['valuenum']
                else:
                    # Use the mean end value if no value is found within the end window
                    end_value = end_means[feature]
                new_rows.append({
                    'subject_id': subject_id,
                    'feature_label': feature,
                    'time_from_window_start_mins': pd.Timedelta(minutes=360),
                    'valuenum': end_value,
                    'extubation_failure': extubation_failure_label
                })

    # Add new rows to the dataframe
    if new_rows:
        new_df = pd.DataFrame(new_rows)
        df = pd.concat([df, new_df], ignore_index=True)

    return df

In [135]:
def resample_and_interpolate(df, feature_labels, initial_interval='1T', target_interval='120T'):
    """
    Resample and interpolate missing values for numerical features in a DataFrame.

    Parameters:
    df (pd.DataFrame): The input DataFrame containing time series data for various features.
    feature_labels (list): A list of feature labels to check for completeness.
    initial_interval (str): The initial resampling interval to ensure all data points are included. Default is '1T' (1 minute).
    target_interval (str): The target resampling interval. Default is '120T' (120 minutes).

    Returns:
    pd.DataFrame: The DataFrame with resampled and interpolated values.
    """
    target_interval_minutes = int(target_interval.strip('T'))
    resampled_dfs = []

    for subject_id in df['subject_id'].unique():
        subject_df = df[df['subject_id'] == subject_id]

        for feature in feature_labels:
            feature_df = subject_df[subject_df['feature_label'] == feature].set_index('time_from_window_start_mins')

            # Convert index to timedelta for resampling
            feature_df.index = pd.to_timedelta(feature_df.index, unit='m')

            # Remove duplicates by taking the first value if duplicates exist
            feature_df = feature_df[~feature_df.index.duplicated(keep='first')]

            if feature_df.empty or feature_df['valuenum'].isna().all():
                # If the feature is completely absent or all values are NaN, create NaNs for the entire interval
                new_index = pd.timedelta_range(start='0 min', periods=int(360 / target_interval_minutes + 1), freq=f'{target_interval_minutes}T')
                feature_df = pd.DataFrame(index=new_index, columns=feature_df.columns)
                feature_df['valuenum'] = np.nan
                feature_df['extubation_failure'] = subject_df['extubation_failure'].iloc[0]
            else:
                # Step 1: Resample to every minute to ensure data points at every minute
                feature_df = feature_df.resample(initial_interval).asfreq()

                # Step 2: Interpolate missing values if the feature has some data
                feature_df['valuenum'] = feature_df['valuenum'].interpolate(method='linear')

                # Check for any remaining NaN values and fill with the mean of the feature for the specific patient
                if feature_df['valuenum'].isna().sum() > 0:
                    feature_mean = subject_df[subject_df['feature_label'] == feature]['valuenum'].mean()
                    # Assign the result back instead of using inplace=True on a column selection
                    feature_df['valuenum'] = feature_df['valuenum'].fillna(feature_mean)

                # Step 3: Resample to the target interval
                feature_df = feature_df.resample(target_interval).asfreq()

            # Align extubation_failure columns with the resampled index and forward/backward fill for each patient
            feature_df['extubation_failure'] = feature_df['extubation_failure'].ffill().bfill()

            # Restore subject_id and feature_label columns
            feature_df['subject_id'] = subject_id
            feature_df['feature_label'] = feature

            # Reset index to retain the time information correctly
            feature_df.reset_index(inplace=True)
            feature_df.rename(columns={'index': 'time_from_window_start_mins'}, inplace=True)

            # Convert minutes to timedelta
            feature_df['time_from_window_start_mins'] = pd.to_timedelta(feature_df['time_from_window_start_mins'], unit='m')

            resampled_dfs.append(feature_df)

    # Concatenate all resampled dataframes
    resampled_df = pd.concat(resampled_dfs).reset_index(drop=True)

    return resampled_df

In [136]:
def process_data(train_df, test_df, feature_labels, initial_interval='1T', target_interval='120T'):
    """
    Process the train and test datasets by filling start and end values, resampling, and interpolating missing values.

    Parameters:
    train_df (pd.DataFrame): The input training DataFrame containing time series data for various features.
    test_df (pd.DataFrame): The input test DataFrame containing time series data for various features.
    feature_labels (list): A list of feature labels to check for completeness.
    initial_interval (str): The initial resampling interval to ensure all data points are included. Default is '1T' (1 minute).
    target_interval (str): The target resampling interval. Default is '120T' (120 minutes).

    Returns:
    tuple: The processed training and test DataFrames along with their corresponding masks indicating where values were NaNs.
    """
    # Convert target interval to minutes for window calculations
    target_interval_minutes = int(target_interval.strip('T'))
    start_window = pd.to_timedelta(target_interval_minutes // 2, unit='m')
    end_window_start = pd.to_timedelta(360 - target_interval_minutes // 2, unit='m')

    # Calculate means from training data for start and end windows
    start_means = train_df[train_df['time_from_window_start_mins'] <= start_window].groupby('feature_label')['valuenum'].mean()
    end_means = train_df[train_df['time_from_window_start_mins'] >= end_window_start].groupby('feature_label')['valuenum'].mean()

    # Fill start and end values for train and test data
    train_df = fill_start_end_values(train_df, feature_labels, start_means, end_means, start_window, end_window_start)
    test_df = fill_start_end_values(test_df, feature_labels, start_means, end_means, start_window, end_window_start)

    # Resample and interpolate data for train and test sets
    train_df = resample_and_interpolate(train_df, feature_labels, initial_interval, target_interval)
    test_df = resample_and_interpolate(test_df, feature_labels, initial_interval, target_interval)

    # Create masks indicating where values are NaNs
    train_mask = train_df.isna()
    test_mask = test_df.isna()

    return train_df, test_df, train_mask, test_mask
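
The window arithmetic inside `process_data` can be checked in isolation. The sketch below is illustrative only: `interval_to_minutes` and `window_bounds` are hypothetical helpers, not part of the notebook, and the 360-minute window length is taken from the constant used in the function above. Newer pandas releases prefer the `'min'` alias over `'T'`, so the parser accepts both.

```python
import re
import pandas as pd

WINDOW_MINUTES = 360  # observation window length used in process_data


def interval_to_minutes(interval: str) -> int:
    """Parse strings such as '120T' or '120min' into minutes (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)\s*(T|min)", interval)
    if match is None:
        raise ValueError(f"Unsupported interval: {interval!r}")
    return int(match.group(1))


def window_bounds(interval: str):
    """Return (start_window, end_window_start), mirroring the logic in process_data."""
    half = interval_to_minutes(interval) // 2
    return pd.Timedelta(minutes=half), pd.Timedelta(minutes=WINDOW_MINUTES - half)


# Show the start and end windows for the three resampling intervals used here
for interval in ("120T", "60T", "30T"):
    start_window, end_window_start = window_bounds(interval)
    print(interval, start_window, end_window_start)
```

For the 2-hour grid, for example, the start mean is computed from observations in the first 60 minutes and the end mean from observations at or after minute 300.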

**Process data**

In [137]:
# Process the low frequency data to resample for every 2 hours
low_frequency_train_df_resampled, low_frequency_test_df_resampled, low_frequency_train_mask, low_frequency_test_mask = process_data(low_frequency_train_copy, low_frequency_test_copy, low_frequency_features, initial_interval='1T', target_interval='120T')

In [141]:
# Process the medium frequency data to resample for every 1 hour
medium_frequency_train_df_resampled, medium_frequency_test_df_resampled, medium_frequency_train_mask, medium_frequency_test_mask = process_data(medium_frequency_train_copy, medium_frequency_test_copy, medium_frequency_features, initial_interval='1T', target_interval='60T')

**Check processing was done correctly**

In [139]:
def check_processed_data(original_df, modified_df, feature_labels, target_interval='30T'):
    """
    Check if all patients have been processed correctly in the modified data.

    Parameters:
    original_df (pd.DataFrame): The original DataFrame containing time series data for various features.
    modified_df (pd.DataFrame): The modified DataFrame containing resampled and interpolated data.
    feature_labels (list): A list of feature labels to check for completeness.
    target_interval (str): The target resampling interval. Default is '30T' (30 minutes).

    Returns:
    bool: True if all patients have been processed correctly, False otherwise.
    """
    target_interval_minutes = int(target_interval.strip('T'))  # Convert target interval to minutes
    expected_intervals = [pd.Timedelta(minutes=m) for m in range(0, 361, target_interval_minutes)]  # Expected time intervals

    for subject_id in original_df['subject_id'].unique():
        for feature in feature_labels:
            original_feature_df = original_df[(original_df['subject_id'] == subject_id) & (original_df['feature_label'] == feature)]
            modified_feature_df = modified_df[(modified_df['subject_id'] == subject_id) & (modified_df['feature_label'] == feature)]

            # Check if the original data has any non-NaN values for this feature
            original_has_valid_data = original_feature_df['valuenum'].notna().any()

            modified_values = modified_feature_df['valuenum'].values
            modified_times = modified_feature_df['time_from_window_start_mins'].values

            # Check that all expected intervals are present in the modified data
            if not all(time in modified_times for time in expected_intervals):
                print(f"Missing intervals for subject {subject_id}, feature {feature}")
                return False

            if original_has_valid_data:
                # If the original data had valid values, check that the modified data has no NaNs
                if np.isnan(modified_values).any():
                    print(f"NaN values found in modified data for subject {subject_id}, feature {feature}")
                    return False
            else:
                # If the original data did not have any valid values, check that the modified data is all NaNs
                if not np.isnan(modified_values).all():
                    print(f"Non-NaN values found in modified data for subject {subject_id}, feature {feature} where original data had no values")
                    return False

    return True
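
The time-grid part of this check can also be written with a single `groupby`, which avoids filtering the full dataframe once per patient and feature. This is a sketch on toy data, not a replacement for the function above: `grid_is_complete` is a hypothetical name, the patient IDs and values are made up, and the test is slightly stricter than the original because it requires the grid to match exactly rather than merely contain the expected time steps.

```python
import pandas as pd


def grid_is_complete(resampled_df, target_interval_minutes, window_minutes=360):
    """Return True if every (patient, feature) group sits exactly on the expected time grid."""
    expected = set(
        pd.to_timedelta(list(range(0, window_minutes + 1, target_interval_minutes)), unit="m")
    )
    grouped = resampled_df.groupby(["subject_id", "feature_label"])["time_from_window_start_mins"]
    return bool(grouped.apply(lambda times: set(times) == expected).all())


# Toy data: two hypothetical patients, one feature, 2-hour grid
times = pd.to_timedelta([0, 120, 240, 360], unit="m")
toy = pd.DataFrame({
    "subject_id": [1] * 4 + [2] * 4,
    "feature_label": ["PH (Arterial)"] * 8,
    "time_from_window_start_mins": list(times) * 2,
    "valuenum": [7.40, 7.41, 7.42, 7.43, 7.38, 7.39, 7.40, 7.41],
})

print(grid_is_complete(toy, 120))            # full grid for both patients
print(grid_is_complete(toy.iloc[:-1], 120))  # last time step dropped for patient 2
```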

In [140]:
check_processed_data(low_frequency_train_copy, low_frequency_train_df_resampled, low_frequency_features, target_interval='120T')

True

In [142]:
check_processed_data(low_frequency_test_copy, low_frequency_test_df_resampled, low_frequency_features, target_interval='120T')

True

In [143]:
check_processed_data(medium_frequency_train_copy, medium_frequency_train_df_resampled, medium_frequency_features, target_interval='60T')

True

In [144]:
check_processed_data(medium_frequency_test_copy, medium_frequency_test_df_resampled, medium_frequency_features, target_interval='60T')

True

In [145]:
# Process high frequency data every 30 mins
high_frequency_train_df_resampled, high_frequency_test_df_resampled, high_frequency_train_mask, high_frequency_test_mask = process_data(high_frequency_train_copy, high_frequency_test_copy, high_frequency_features, initial_interval='1T', target_interval='30T')

# Check high frequency data processed correctly
# Note: a notebook cell only displays the value of its last expression,
# so the output below is the result for the test set
check_processed_data(high_frequency_train_copy, high_frequency_train_df_resampled, high_frequency_features, target_interval='30T')
check_processed_data(high_frequency_test_copy, high_frequency_test_df_resampled, high_frequency_features, target_interval='30T')

True

**Save progress**

In [146]:
# Save progress so far
low_frequency_train_df_resampled.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/feature_subsets/low_frequency_train_df_resampled.parquet')
low_frequency_test_df_resampled.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/feature_subsets/low_frequency_test_df_resampled.parquet')
medium_frequency_train_df_resampled.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/feature_subsets/medium_frequency_train_df_resampled.parquet')
medium_frequency_test_df_resampled.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/feature_subsets/medium_frequency_test_df_resampled.parquet')
high_frequency_train_df_resampled.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/feature_subsets/high_frequency_train_df_resampled.parquet')
high_frequency_test_df_resampled.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/feature_subsets/high_frequency_test_df_resampled.parquet')

### **Step 6: Feature scaling**

We now need to scale the numerical features using a MinMaxScaler.

For the LSTM, min-max scaling is used in preference to standard scaling. It maps each feature linearly onto the [0, 1] range of the training data, so the shape of each patient's trajectory over time is preserved and the scaled training values are non-negative.

The scaler for each feature is fitted on the train data only and then applied unchanged to the test data, so that no information from the test set leaks into the scaling.
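
A minimal sketch of the fit-on-train, apply-to-test pattern, using made-up pH values. It also shows how scikit-learn's `MinMaxScaler` treats missing values: NaNs are disregarded when fitting and passed through unchanged when transforming.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values for a single feature, shaped (n_samples, 1) as the scaler expects
train_values = np.array([[7.30], [7.40], [np.nan], [7.50]])
test_values = np.array([[7.35], [np.nan], [7.55]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_values)  # min and max learned from train only
test_scaled = scaler.transform(test_values)        # same min and max reused for test

print(scaler.data_min_, scaler.data_max_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

Test values that fall outside the training range are mapped outside [0, 1], as happens for 7.55 here. That is expected when the scaler is not refitted on the test data.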

In [147]:
low_frequency_train_df_resampled.head()

Unnamed: 0,time_from_window_start_mins,subject_id,valuenum,feature_label,extubation_failure
0,0 days 00:00:00,10002428,7.411848,PH (Arterial),0.0
1,0 days 02:00:00,10002428,7.418217,PH (Arterial),0.0
2,0 days 04:00:00,10002428,7.424586,PH (Arterial),0.0
3,0 days 06:00:00,10002428,7.43,PH (Arterial),0.0
4,0 days 00:00:00,10002428,40.862416,Arterial CO2 Pressure,0.0


In [151]:
medium_frequency_train_df_resampled.head(8)

Unnamed: 0,time_from_window_start_mins,subject_id,valuenum,feature_label,extubation_failure
0,0 days 00:00:00,10001884,41.527331,Inspired O2 Fraction,1.0
1,0 days 01:00:00,10001884,41.069132,Inspired O2 Fraction,1.0
2,0 days 02:00:00,10001884,40.610932,Inspired O2 Fraction,1.0
3,0 days 03:00:00,10001884,40.152733,Inspired O2 Fraction,1.0
4,0 days 04:00:00,10001884,41.504017,Inspired O2 Fraction,1.0
5,0 days 05:00:00,10001884,43.760043,Inspired O2 Fraction,1.0
6,0 days 06:00:00,10001884,46.016069,Inspired O2 Fraction,1.0
7,0 days 00:00:00,10001884,,Tidal Volume (observed),1.0


In [152]:
high_frequency_train_df_resampled.head(14)

Unnamed: 0,time_from_window_start_mins,subject_id,valuenum,feature_label,extubation_failure
0,0 days 00:00:00,10001884,97.718774,O2 saturation pulseoxymetry,1.0
1,0 days 00:30:00,10001884,98.785714,O2 saturation pulseoxymetry,1.0
2,0 days 01:00:00,10001884,97.714286,O2 saturation pulseoxymetry,1.0
3,0 days 01:30:00,10001884,97.0,O2 saturation pulseoxymetry,1.0
4,0 days 02:00:00,10001884,97.0,O2 saturation pulseoxymetry,1.0
5,0 days 02:30:00,10001884,97.0,O2 saturation pulseoxymetry,1.0
6,0 days 03:00:00,10001884,97.0,O2 saturation pulseoxymetry,1.0
7,0 days 03:30:00,10001884,98.0,O2 saturation pulseoxymetry,1.0
8,0 days 04:00:00,10001884,98.0,O2 saturation pulseoxymetry,1.0
9,0 days 04:30:00,10001884,97.892857,O2 saturation pulseoxymetry,1.0


**Function to apply a MinMaxScaler to each feature**

In [153]:
from sklearn.preprocessing import MinMaxScaler

In [154]:
def scale_features(train_df, test_df, numerical_features):
    """
    Scale numerical features in the train and test dataframes using MinMaxScaler.

    Parameters:
    train_df (pd.DataFrame): The training dataframe.
    test_df (pd.DataFrame): The testing dataframe.
    numerical_features (list): List of feature labels to be scaled.

    Returns:
    tuple: The scaled train and test dataframes.
    """
    scalers = {}

    for feature in numerical_features:
        # Initialize the MinMaxScaler for the feature
        scalers[feature] = MinMaxScaler()

        # Create masks for the feature in train and test dataframes
        feature_mask_train = train_df['feature_label'] == feature
        feature_mask_test = test_df['feature_label'] == feature

        # Extract the values as column vectors. NaNs are left in place: MinMaxScaler
        # ignores them when fitting and keeps them when transforming, whereas filling
        # them with 0 would pull the fitted minimum down to 0.
        train_values = train_df.loc[feature_mask_train, 'valuenum'].values.reshape(-1, 1)
        test_values = test_df.loc[feature_mask_test, 'valuenum'].values.reshape(-1, 1)

        # Fit on the train values only, then apply the same scaling to the test values
        train_scaled = scalers[feature].fit_transform(train_values)
        test_scaled = scalers[feature].transform(test_values)

        # Assign the scaled values back to the dataframes
        train_df.loc[feature_mask_train, 'valuenum'] = train_scaled.ravel()
        test_df.loc[feature_mask_test, 'valuenum'] = test_scaled.ravel()

        print(f'Feature {feature} has been normalized')

    # Ensure indices align if necessary
    train_df.reset_index(drop=True, inplace=True)
    test_df.reset_index(drop=True, inplace=True)

    # Display the sizes after normalization
    print(f"Number of rows in train dataframe after normalization: {len(train_df)}")
    print(f"Number of rows in test dataframe after normalization: {len(test_df)}")

    return train_df, test_df

In [155]:
low_frequency_train_df_resampled_copy = low_frequency_train_df_resampled.copy()
low_frequency_test_df_resampled_copy = low_frequency_test_df_resampled.copy()
medium_frequency_train_df_resampled_copy = medium_frequency_train_df_resampled.copy()
medium_frequency_test_df_resampled_copy = medium_frequency_test_df_resampled.copy()
high_frequency_train_df_resampled_copy = high_frequency_train_df_resampled.copy()
high_frequency_test_df_resampled_copy = high_frequency_test_df_resampled.copy()

In [156]:
# Scale all the data
low_frequency_train_scaled, low_frequency_test_scaled = scale_features(low_frequency_train_df_resampled_copy, low_frequency_test_df_resampled_copy, low_frequency_features)
medium_frequency_train_scaled, medium_frequency_test_scaled = scale_features(medium_frequency_train_df_resampled_copy, medium_frequency_test_df_resampled_copy, medium_frequency_features)
high_frequency_train_scaled, high_frequency_test_scaled = scale_features(high_frequency_train_df_resampled_copy, high_frequency_test_df_resampled_copy, high_frequency_features)

Feature PH (Arterial) has been normalized
Feature Arterial CO2 Pressure has been normalized
Feature Arterial O2 pressure has been normalized
Number of rows in train dataframe after normalization: 16788
Number of rows in test dataframe after normalization: 4200
Feature Inspired O2 Fraction has been normalized
Feature Tidal Volume (observed) has been normalized
Feature Minute Volume has been normalized
Feature Peak Insp. Pressure has been normalized
Feature Tidal Volume (spontaneous) has been normalized
Number of rows in train dataframe after normalization: 130270
Number of rows in test dataframe after normalization: 32550
Feature O2 saturation pulseoxymetry has been normalized
Feature Respiratory Rate has been normalized
Number of rows in train dataframe after normalization: 97760
Number of rows in test dataframe after normalization: 24440


In [158]:
# Save progress
low_frequency_train_scaled.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/feature_subsets/03_low_frequency_train_scaled.parquet')
low_frequency_test_scaled.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/feature_subsets/03_low_frequency_test_scaled.parquet')
medium_frequency_train_scaled.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/feature_subsets/03_medium_frequency_train_scaled.parquet')
medium_frequency_test_scaled.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/feature_subsets/03_medium_frequency_test_scaled.parquet')
high_frequency_train_scaled.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/feature_subsets/03_high_frequency_train_scaled.parquet')
high_frequency_test_scaled.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/feature_subsets/03_high_frequency_test_scaled.parquet')

### **Step 7: Create sequences for LSTM**

Finally, we need to create 3D arrays for LSTM input in the format (patients, time steps, features).
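
To make the target layout concrete, here is a small sketch that turns long-format rows into a (patients, time steps, features) array. The patient IDs and values are made up for illustration. Only the column names match the dataframes used in this notebook.

```python
import numpy as np
import pandas as pd

times = pd.to_timedelta([0, 120, 240, 360], unit="m")
features = ["PH (Arterial)", "Arterial CO2 Pressure"]

# Build a toy long-format dataframe: one row per patient, feature and time step
rows = []
for subject_id in (101, 102):  # hypothetical patient IDs
    for f_idx, feature in enumerate(features):
        for t_idx, t in enumerate(times):
            rows.append({
                "subject_id": subject_id,
                "feature_label": feature,
                "time_from_window_start_mins": t,
                "valuenum": float(subject_id + 10 * f_idx + t_idx),
            })
long_df = pd.DataFrame(rows)

# Pivot each patient to a (time steps, features) matrix, then stack the patients
per_patient = []
for subject_id, group in long_df.groupby("subject_id"):
    wide = group.pivot(index="time_from_window_start_mins",
                       columns="feature_label", values="valuenum")
    per_patient.append(wide[features].to_numpy())  # fix the feature order

X = np.stack(per_patient)
print(X.shape)  # (2, 4, 2): 2 patients, 4 time steps, 2 features
```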

In [161]:
# Convert extubation failure column to integer
low_frequency_train_scaled['extubation_failure'] = low_frequency_train_scaled['extubation_failure'].astype('int64')
low_frequency_test_scaled['extubation_failure'] = low_frequency_test_scaled['extubation_failure'].astype('int64')
medium_frequency_train_scaled['extubation_failure'] = medium_frequency_train_scaled['extubation_failure'].astype('int64')
medium_frequency_test_scaled['extubation_failure'] = medium_frequency_test_scaled['extubation_failure'].astype('int64')
high_frequency_train_scaled['extubation_failure'] = high_frequency_train_scaled['extubation_failure'].astype('int64')
high_frequency_test_scaled['extubation_failure'] = high_frequency_test_scaled['extubation_failure'].astype('int64')

In [162]:
low_frequency_train_scaled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16788 entries, 0 to 16787
Data columns (total 5 columns):
 #   Column                       Non-Null Count  Dtype          
---  ------                       --------------  -----          
 0   time_from_window_start_mins  16788 non-null  timedelta64[ns]
 1   subject_id                   16788 non-null  int64          
 2   valuenum                     16516 non-null  float64        
 3   feature_label                16788 non-null  object         
 4   extubation_failure           16788 non-null  int64          
dtypes: float64(1), int64(2), object(1), timedelta64[ns](1)
memory usage: 655.9+ KB


In [163]:
# Create copies of all datasets
low_frequency_train_scaled_copy = low_frequency_train_scaled.copy()
low_frequency_test_scaled_copy = low_frequency_test_scaled.copy()
medium_frequency_train_scaled_copy = medium_frequency_train_scaled.copy()
medium_frequency_test_scaled_copy = medium_frequency_test_scaled.copy()
high_frequency_train_scaled_copy = high_frequency_train_scaled.copy()
high_frequency_test_scaled_copy = high_frequency_test_scaled.copy()

**Count the number of patients in each feature subset**

In [164]:
# Count the number of patients in each dataset
print(f"Number of patients in low frequency train dataset: {low_frequency_train_scaled['subject_id'].nunique()}")
print(f"Number of patients in low frequency test dataset: {low_frequency_test_scaled['subject_id'].nunique()}")
print(f"Number of patients in medium frequency train dataset: {medium_frequency_train_scaled['subject_id'].nunique()}")
print(f"Number of patients in medium frequency test dataset: {medium_frequency_test_scaled['subject_id'].nunique()}")
print(f"Number of patients in high frequency train dataset: {high_frequency_train_scaled['subject_id'].nunique()}")
print(f"Number of patients in high frequency test dataset: {high_frequency_test_scaled['subject_id'].nunique()}")

Number of patients in low frequency train dataset: 1399
Number of patients in low frequency test dataset: 350
Number of patients in medium frequency train dataset: 3722
Number of patients in medium frequency test dataset: 930
Number of patients in high frequency train dataset: 3760
Number of patients in high frequency test dataset: 940
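
Because the subsets cover different numbers of patients, fusing the decisions of the three models requires rows that refer to the same patient. One possible way to line them up, sketched here with made-up IDs, is to intersect the subject ID arrays and look up each common patient's position in every subset. The `positions` helper is hypothetical, and whether to restrict fusion to common patients or to handle missing subsets some other way is a design choice this notebook does not settle.

```python
from functools import reduce

import numpy as np

# Made-up subject IDs for the three subsets (the low frequency subset is the smallest)
low_ids = np.array([3, 5, 9])
medium_ids = np.array([1, 3, 5, 7, 9])
high_ids = np.array([1, 3, 4, 5, 7, 9])

# Patients present in all three subsets (sorted, unique)
common_ids = reduce(np.intersect1d, (low_ids, medium_ids, high_ids))


def positions(subset_ids, wanted_ids):
    """Return the row position of each wanted ID within subset_ids (hypothetical helper)."""
    lookup = {sid: pos for pos, sid in enumerate(subset_ids)}
    return np.array([lookup[sid] for sid in wanted_ids])


low_idx = positions(low_ids, common_ids)
medium_idx = positions(medium_ids, common_ids)
high_idx = positions(high_ids, common_ids)

print(common_ids, low_idx, medium_idx, high_idx)
```

Indexing each subset's sequence array with its index vector would then give three arrays whose first axis refers to the same patients in the same order.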


**Function to create sequences**

In [166]:
def create_sequences(data, sequence_length):
    """
    Convert time series data into sequences for LSTM input, including subject IDs.

    Parameters:
    data (pd.DataFrame): The input DataFrame containing time series data.
    sequence_length (int): The length of each sequence.

    Returns:
    np.array: Numpy array of sequences.
    np.array: Numpy array of labels.
    list: List of subject IDs corresponding to each sequence.
    """
    sequences = []
    labels = []
    subject_ids = []

    # Extract unique feature labels
    feature_labels = data['feature_label'].unique()

    # Group data by patient
    grouped = data.groupby('subject_id')

    for subject_id, group in grouped:
        # Ensure data is sorted by time
        group = group.sort_values(by='time_from_window_start_mins')

        # Pivot the data to ensure all features are included
        pivoted_data = group.pivot(index='time_from_window_start_mins', columns='feature_label', values='valuenum')

        # Ensure the pivoted data has the correct order of columns
        pivoted_data = pivoted_data[feature_labels]

        # Create sequences
        for i in range(len(pivoted_data) - sequence_length + 1):
            sequence = pivoted_data.iloc[i:i + sequence_length].values
            sequences.append(sequence)
            labels.append(group['extubation_failure'].iloc[i + sequence_length - 1])
            subject_ids.append(subject_id)

    return np.array(sequences), np.array(labels), subject_ids

In [170]:
# Set the sequence length for each dataset
low_frequency_seq_length = 360 // 120 + 1
medium_frequency_seq_length = 360 // 60 + 1
high_frequency_seq_length = 360 // 30 + 1

print(f"Low frequency sequence length: {low_frequency_seq_length}")
print(f"Medium frequency sequence length: {medium_frequency_seq_length}")
print(f"High frequency sequence length: {high_frequency_seq_length}")

Low frequency sequence length: 4
Medium frequency sequence length: 7
High frequency sequence length: 13
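
These lengths equal the number of time steps on each resampling grid, so the sliding window in `create_sequences` yields exactly one sequence per patient. A patient with T time steps and a sequence length L produces T - L + 1 windows. The short sketch below illustrates this on a made-up 7-step grid, where `n_windows` is a hypothetical helper.

```python
import numpy as np


def n_windows(n_time_steps, sequence_length):
    """Number of sliding windows of the given length over n_time_steps."""
    return max(n_time_steps - sequence_length + 1, 0)


# One hypothetical patient on the hourly grid: 7 time steps, 5 features
grid = np.arange(7 * 5).reshape(7, 5)

full = [grid[i:i + 7] for i in range(n_windows(7, 7))]     # window spans the whole grid
shorter = [grid[i:i + 4] for i in range(n_windows(7, 4))]  # a shorter window, for comparison

print(len(full), full[0].shape)
print(len(shorter), shorter[0].shape)
```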


In [174]:
output_dir = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/feature_subsets/'

In [175]:
low_frequency_train_sequences, low_frequency_train_labels, low_frequency_train_subject_ids = create_sequences(low_frequency_train_scaled, low_frequency_seq_length)
medium_frequency_train_sequences, medium_frequency_train_labels, medium_frequency_train_subject_ids = create_sequences(medium_frequency_train_scaled, medium_frequency_seq_length)
high_frequency_train_sequences, high_frequency_train_labels, high_frequency_train_subject_ids = create_sequences(high_frequency_train_scaled, high_frequency_seq_length)

print(f"Low frequency train sequences shape: {low_frequency_train_sequences.shape}")
print(f"Low frequency train labels shape: {low_frequency_train_labels.shape}")
print(f"Low frequency train subject ids shape: {len(low_frequency_train_subject_ids)}")

print(f"Medium frequency train sequences shape: {medium_frequency_train_sequences.shape}")
print(f"Medium frequency train labels shape: {medium_frequency_train_labels.shape}")
print(f"Medium frequency train subject ids shape: {len(medium_frequency_train_subject_ids)}")

print(f"High frequency train sequences shape: {high_frequency_train_sequences.shape}")
print(f"High frequency train labels shape: {high_frequency_train_labels.shape}")
print(f"High frequency train subject ids shape: {len(high_frequency_train_subject_ids)}")

Low frequency train sequences shape: (1399, 4, 3)
Low frequency train labels shape: (1399,)
Low frequency train subject ids shape: 1399
Medium frequency train sequences shape: (3722, 7, 5)
Medium frequency train labels shape: (3722,)
Medium frequency train subject ids shape: 3722
High frequency train sequences shape: (3760, 13, 2)
High frequency train labels shape: (3760,)
High frequency train subject ids shape: 3760


In [176]:
# Save the sequences in the drive for model input
np.save(output_dir + 'low_frequency_train_sequences_v1.npy', low_frequency_train_sequences)
np.save(output_dir + 'low_frequency_train_labels_v1.npy', low_frequency_train_labels)
np.save(output_dir + 'low_frequency_train_subject_ids_v1.npy', low_frequency_train_subject_ids)

np.save(output_dir + 'medium_frequency_train_sequences_v1.npy', medium_frequency_train_sequences)
np.save(output_dir + 'medium_frequency_train_labels_v1.npy', medium_frequency_train_labels)
np.save(output_dir + 'medium_frequency_train_subject_ids_v1.npy', medium_frequency_train_subject_ids)

np.save(output_dir + 'high_frequency_train_sequences_v1.npy', high_frequency_train_sequences)
np.save(output_dir + 'high_frequency_train_labels_v1.npy', high_frequency_train_labels)
np.save(output_dir + 'high_frequency_train_subject_ids_v1.npy', high_frequency_train_subject_ids)

In [179]:
# Also save the feature names for feature ablation
low_frequency_feature_names = low_frequency_train_scaled['feature_label'].unique()
medium_frequency_feature_names = medium_frequency_train_scaled['feature_label'].unique()
high_frequency_feature_names = high_frequency_train_scaled['feature_label'].unique()

print(f"Low frequency feature names: {low_frequency_feature_names}")
print(f"Medium frequency feature names: {medium_frequency_feature_names}")
print(f"High frequency feature names: {high_frequency_feature_names}")

Low frequency feature names: ['PH (Arterial)' 'Arterial CO2 Pressure' 'Arterial O2 pressure']
Medium frequency feature names: ['Inspired O2 Fraction' 'Tidal Volume (observed)' 'Minute Volume'
 'Peak Insp. Pressure' 'Tidal Volume (spontaneous)']
High frequency feature names: ['O2 saturation pulseoxymetry' 'Respiratory Rate']


In [180]:
# Save the feature names
np.save(output_dir + 'low_frequency_feature_names_v1.npy', low_frequency_feature_names)
np.save(output_dir + 'medium_frequency_feature_names_v1.npy', medium_frequency_feature_names)
np.save(output_dir + 'high_frequency_feature_names_v1.npy', high_frequency_feature_names)
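
A note on reading these files back: when `feature_label` is an object column, `unique()` likely returns an object-dtype array, which `np.save` stores via pickle, and `np.load` then refuses to read it unless `allow_pickle=True` is passed. The sketch below demonstrates this with a temporary directory and made-up file names. Converting the names to a fixed-width string array before saving is one way to avoid the pickle dependency.

```python
import os
import tempfile

import numpy as np

# Object-dtype array, as returned by unique() on an object column
feature_names = np.array(["PH (Arterial)", "Arterial CO2 Pressure"], dtype=object)

with tempfile.TemporaryDirectory() as tmp_dir:
    object_path = os.path.join(tmp_dir, "feature_names_object.npy")  # illustrative file names
    string_path = os.path.join(tmp_dir, "feature_names_str.npy")

    np.save(object_path, feature_names)              # stored via pickle
    np.save(string_path, feature_names.astype(str))  # stored as a plain unicode array

    # Loading an object array with the default allow_pickle=False raises ValueError
    try:
        np.load(object_path)
        loaded_without_pickle = True
    except ValueError:
        loaded_without_pickle = False

    from_object = np.load(object_path, allow_pickle=True)
    from_string = np.load(string_path)

print(loaded_without_pickle)
print(from_object, from_string)
```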