# **LSTM preprocessing for feature set 1 dynamic data**

Given the peculiar performance of the LSTM models, we are rerunning the data preprocessing to ensure there is no data leakage and that the data is well distributed across the train and test sets.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load the train and test data
train_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/preprocessing_v2/03_train_data_standard_preprocess_done.parquet'
test_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/preprocessing_v2/03_test_data_standard_preprocess_done.parquet'

train_df = pd.read_parquet(train_path)
test_df = pd.read_parquet(test_path)

train_df.head()

Unnamed: 0,subject_id,itemid,valuenum,time_to_extubation_mins,time_from_window_start,label,extubation_failure
0,10001884,223835,40.0,160.0,200.0,Inspired O2 Fraction,1
1,10001884,224685,,160.0,200.0,Tidal Volume (observed),1
2,10001884,224686,,160.0,200.0,Tidal Volume (spontaneous),1
3,10001884,224687,6.1,160.0,200.0,Minute Volume,1
4,10001884,224695,17.0,160.0,200.0,Peak Insp. Pressure,1
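
Since the aim of this rerun is to rule out leakage, one quick check is that no patient appears in both splits. The following is a minimal sketch on toy frames; the helper name `shared_subjects` is hypothetical, and it assumes `subject_id` uniquely identifies a patient.

```python
import pandas as pd

def shared_subjects(train, test, key="subject_id"):
    """Return the sorted ids present in both splits (an empty list means no overlap)."""
    overlap = set(train[key].unique().tolist()) & set(test[key].unique().tolist())
    return sorted(overlap)

# Toy frames standing in for train_df / test_df (illustrative ids only)
toy_train = pd.DataFrame({"subject_id": [1, 1, 2, 3]})
toy_test = pd.DataFrame({"subject_id": [4, 5, 5]})
toy_leaky_test = pd.DataFrame({"subject_id": [3, 4]})

print(shared_subjects(toy_train, toy_test))        # []
print(shared_subjects(toy_train, toy_leaky_test))  # [3]
```

On the real data, the same call would be `shared_subjects(train_df, test_df)`, and any non-empty result would point to patient-level leakage.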


In [None]:
# Rename time_from_window_start to time_from_window_start_mins and label to feature_label for both train and test
train_df = train_df.rename(columns={'time_from_window_start': 'time_from_window_start_mins', 'label': 'feature_label'})
test_df = test_df.rename(columns={'time_from_window_start': 'time_from_window_start_mins', 'label': 'feature_label'})

# Drop time to extubation mins column and item id
train_df = train_df.drop(columns=['time_to_extubation_mins', 'itemid'])
test_df = test_df.drop(columns=['time_to_extubation_mins', 'itemid'])

train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 90905 entries, 0 to 116638
Data columns (total 5 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   subject_id                   90905 non-null  int64  
 1   valuenum                     88094 non-null  float64
 2   time_from_window_start_mins  90905 non-null  float64
 3   feature_label                90905 non-null  object 
 4   extubation_failure           90905 non-null  int64  
dtypes: float64(2), int64(2), object(1)
memory usage: 4.2+ MB


In [None]:
train_df_copy = train_df.copy()
test_df_copy = test_df.copy()

### **Step 1 - Resample and interpolation**

The train and test data need to be resampled to consistent sampling frequencies.

Each patient needs sequences of data for each feature, with regular sampling frequency.

We will need to impute and resample the data to achieve this given data collection seems to be sporadic in the original MIMIC data.

Time series models, including LSTMs, perform best when the input data has a consistent interval between observations.

In [None]:
# Group by subject_id and feature_label to count the number of samples for each combination
per_patient_sampling_frequency = train_df.groupby(['subject_id', 'feature_label']).size().reset_index(name='count')

per_patient_sampling_frequency_pivot = per_patient_sampling_frequency.pivot(index='subject_id', columns='feature_label', values='count').fillna(0)

# Calculate the average sampling frequency per feature
average_sampling_frequency = per_patient_sampling_frequency_pivot.mean().sort_values(ascending=False)

# Create columns for the table
average_sampling_frequency_df = pd.DataFrame({'Feature': average_sampling_frequency.index, 'Average Train Set Sampling Frequency': average_sampling_frequency.values})

# Display the table
average_sampling_frequency_df

Unnamed: 0,Feature,Average Train Set Sampling Frequency
0,O2 saturation pulseoxymetry,6.597074
1,Respiratory Rate,6.578989
2,Inspired O2 Fraction,2.106383
3,Tidal Volume (observed),1.616755
4,Minute Volume,1.600798
5,Peak Insp. Pressure,1.528457
6,Tidal Volume (spontaneous),1.370745
7,Ventilator Mode,1.201064
8,PH (Arterial),0.532979
9,Arterial CO2 Pressure,0.521809


As shown above, the average number of samples per patient in the 6-hour window varies widely by feature, from about 6.6 (O2 saturation, respiratory rate) down to about 0.52 (arterial CO2 pressure).

For our data, we need all features to be sampled uniformly over the same time intervals and need to consider the desired sampling frequency, uniformity and level of interpolation.

Excessive interpolation for features with low sampling rates can introduce noise and misleading trends, but sufficient temporal resolution is required for model training.

Zeng et al. use padding and packing to fill out the data, but this is not relevant to this study: that approach is designed to handle variable sequence lengths, whereas we focus on data in a fixed 6-hour window.



**Decision: Target sampling frequency**

As implemented in previous preprocessing steps, the desired sampling frequency is 30 min intervals.

This balances temporal detail and the sequence length favoured by LSTMs against the risk of excessive interpolation.

1. Increased Data Resolution:
   - Capture More Detail: By resampling to 30-minute intervals, you capture more detailed information about the patient's condition and changes over time. This higher resolution can help the model detect subtle patterns and trends that might be missed with 1-hour intervals.
   - Improved Responsiveness: Medical events can happen quickly, and more frequent sampling can improve the model's ability to respond to changes in a patient's state more promptly.
2. Better Representation of Dynamics:
   - Granular Changes: Many physiological parameters can change significantly within an hour. Capturing data every 30 minutes provides a more granular view of these changes, which can be critical for accurate predictions.
   - Short-term Variations: Some clinical interventions and patient responses occur within shorter time frames. A 30-minute interval can better capture these short-term variations and their impact on the overall trend.
3. Alignment with Clinical Practices:
   - Clinical Relevance: In many clinical settings, patient data is recorded at intervals shorter than an hour, often ranging from every few minutes to every 30 minutes. Resampling to 30 minutes can align more closely with how data is actually collected and reviewed in practice.
   - Actionable Insights: More frequent data points can provide healthcare professionals with actionable insights that are more timely, potentially leading to better patient outcomes.
4. Improved Model Performance:
   - More Data Points: Resampling to 30 minutes effectively doubles the number of data points compared to 1-hour intervals. This can provide the model with more data to learn from, potentially improving its performance and robustness.
   - Enhanced Temporal Patterns: LSTMs and other recurrent neural networks are designed to capture temporal dependencies. More frequent data points can enhance the model's ability to learn and leverage these temporal patterns.
5. Handling Missing Data:
   - Interpolation and Imputation: When dealing with missing data, having more frequent data points can improve the accuracy of interpolation and imputation methods. The assumption of linearity or other patterns in the data is more likely to hold true over shorter intervals.
6. Flexibility:
   - Aggregation: If needed, data resampled to 30 minutes can always be aggregated back to 1-hour intervals. However, the reverse is not possible without additional assumptions or loss of detail.

Considerations:

While there are clear advantages to using a 30-minute interval, there are also potential challenges, such as increased computational load and the risk of creating too much synthetic data if the original data is sparse. It's important to balance the need for detail with the quality of the data and the computational resources available.
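
The sequence length follows directly from the window and the interval: a 360-minute window sampled every 30 minutes, with both endpoints included, gives 13 time steps. A small sketch of that arithmetic (the variable names here are illustrative, not taken from the notebook):

```python
import pandas as pd

window_minutes = 360   # fixed 6-hour observation window
interval_minutes = 30  # target sampling interval

# Both endpoints (0 and 360 min) are on the grid, hence the +1
n_steps = window_minutes // interval_minutes + 1

# The same grid expressed as timedeltas from the window start
time_grid = pd.timedelta_range(
    start="0min", end=f"{window_minutes}min", freq=f"{interval_minutes}min"
)

print(n_steps)         # 13
print(len(time_grid))  # 13
```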

In [None]:
sequence_length = 13

**Decision - Data imputation strategy**

- Forward fill, backward fill: These are classic techniques but will not work for features in patients that have no measurements present.

- Linear / spline interpolation: These are slightly more sophisticated, not too computationally intensive, and preserve some temporal patterns, albeit rudimentary ones.

Other methods are possible, e.g. Kalman smoothing or KNN imputation, but those are computationally intensive and relatively complex for the purposes of this study.

Simple forward and backward fill would limit possible temporal patterns.

However, linear or spline interpolation relies on having boundary values to interpolate between.

Given the nature of the time series medical data and the features being used, a tailored imputation strategy is necessary.

Linear interpolation is useful for variables that change frequently or do not require a smooth curve.

Spline interpolation is useful for variables that change smoothly or continuously, but it requires at least k+1 data points, where k is the order of the spline function (e.g. 2). Hence it will not work where there are zero, one or two data points for a feature for a patient. Because it cannot be applied consistently across the data, it will not be used.
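
The dependence on boundary values can be seen on a toy series. In this sketch (made-up values, pandas defaults), linear interpolation fills the interior gap, carries the last observation forward over the trailing gap, and leaves the leading gap as NaN because there is no earlier value to interpolate from.

```python
import numpy as np
import pandas as pd

# Toy series on a regular grid with missing first, middle and last values
s = pd.Series([np.nan, 10.0, np.nan, 20.0, np.nan])

# Default settings: interior gaps are interpolated, trailing gaps take the
# last valid value, leading gaps are left untouched
filled = s.interpolate(method="linear")

print(filled.tolist())  # [nan, 10.0, 15.0, 20.0, 20.0]
```

This is one reason for creating explicit start and end values for every patient and feature before interpolating.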



**Decision - How to treat ventilator mode**

Data is inherently categorical with differences between different modes being qualitative.

By treating as a **categorical** value, the model can interpret the data correctly.

In [None]:
# Filter the DataFrame to get only the rows where feature_label is 'Ventilator Mode'
ventilator_mode_df = train_df[train_df['feature_label'] == 'Ventilator Mode']

# Get the unique values in the 'valuenum' column
unique_modes = ventilator_mode_df['valuenum'].unique()

# Print the unique values
print(unique_modes)

[11. 49.  2. 51. 30. 13. 10. 71. 53.  1. 47. 17. 12. 48. 50. 46.]


In [None]:
len(unique_modes)

16

There are 16 different ventilator modes present in the data. Since each mode has a specific meaning or function, and there is no quantitative relationship between modes, it would be incorrect to treat Ventilator Mode as numerical data.

The modes here are specific to the machines used in the MIMIC study (find out what they are called), so ventilator mode will be less generalisable to other machines in other ICUs.

A decision on whether to keep or remove Ventilator Mode will be made following clinical discussion with Dr Murali, so we will keep and process it for now.
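
As an illustration of treating the mode codes as labels rather than magnitudes, the sketch below casts a few of the codes to a categorical dtype and one-hot encodes them. One-hot encoding is an assumption here (the notebook has not yet fixed the encoding), and the `vent_mode` prefix is a hypothetical name.

```python
import pandas as pd

# A few mode codes as they appear in `valuenum` (stored as floats)
modes = pd.Series([11.0, 49.0, 11.0, 30.0], name="ventilator_mode")

# Cast to a categorical dtype so the codes are treated as labels
modes_cat = modes.astype(int).astype("category")

# One indicator column per mode; each row has exactly one active column
one_hot = pd.get_dummies(modes_cat, prefix="vent_mode")

print(one_hot.columns.tolist())  # ['vent_mode_11', 'vent_mode_30', 'vent_mode_49']
```

With 16 modes this would add 16 indicator columns per time step. Whichever encoding is chosen, the category list should be fixed from the training data so that the test set is encoded consistently.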

**Decision - Resampling logic**

With domain knowledge and conversations with clinicians, the following logic will be used for data sampling and interpolation.

**Create start and end values**
- If a feature for a patient has a value within the first or last 15 mins, that value is treated as the start (0 min) or end (360 min) value respectively.
- If no such value is present, the average value for that feature across all patients in the first / last 15 mins is used (these mean values must be computed from the training data, saved, and applied to the test data).
- Ventilator mode is categorical, so the same logic is applied using the mode rather than the mean (a mean is meaningless for ventilator mode). That is, if no value is present at the start, the most common ventilator mode in the first 15 mins across all patients is used, as this is likely the start mode; if no value is present at the end, the most common ventilator mode across all patients in the last 15 mins before extubation is used.


This creates start (0 min) and end values (360 min) for all features.
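
The rule for a single patient and feature can be sketched as a small helper. This is an illustration only: `boundary_value` is a hypothetical name, times are plain minutes rather than timedeltas, and the fallback numbers are placeholders for training-set means.

```python
import pandas as pd

def boundary_value(times_min, values, window_lo, window_hi, fallback, pick="first"):
    """Pick a boundary value for one patient-feature series.

    Uses the earliest ("first") or latest ("last") measurement whose time lies
    in [window_lo, window_hi]; if there is none, returns the fallback.
    """
    s = pd.Series(values, index=times_min).sort_index()
    in_window = s[(s.index >= window_lo) & (s.index <= window_hi)]
    if in_window.empty:
        return fallback
    return in_window.iloc[0] if pick == "first" else in_window.iloc[-1]

times = [10, 200, 350]
values = [40.0, 45.0, 50.0]

start = boundary_value(times, values, 0, 15, fallback=41.7, pick="first")
end = boundary_value(times, values, 345, 360, fallback=46.1, pick="last")
# A series with nothing in the first 15 minutes falls back to the supplied mean
no_start = boundary_value([200], [45.0], 0, 15, fallback=41.7, pick="first")

print(start, end, no_start)  # 40.0 50.0 41.7
```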




**Step 1.1 - Fill start and end values**

First we will make start and end values for each feature for each patient if they are not available

In [None]:
# Define the start and end time windows
start_window = pd.Timedelta(minutes=15)
end_window_start = pd.Timedelta(minutes=345)

The means for the start and the end will be calculated from the training data to prevent data leakage.

In [None]:
# Convert time_from_window_start_mins to time delta
train_df['time_from_window_start_mins'] = pd.to_timedelta(train_df['time_from_window_start_mins'], unit='m')
test_df['time_from_window_start_mins'] = pd.to_timedelta(test_df['time_from_window_start_mins'], unit='m')

In [None]:
# Calculate the mean values for each feature at the start and end windows from the training data
train_start_means = train_df[(train_df['time_from_window_start_mins'] <= start_window) &
                       (train_df['feature_label'] != 'Ventilator Mode')].groupby('feature_label')['valuenum'].mean()

train_end_means = train_df[(train_df['time_from_window_start_mins'] >= end_window_start) &
                     (train_df['feature_label'] != 'Ventilator Mode')].groupby('feature_label')['valuenum'].mean()

print("Start Means:")
print(train_start_means)

print("\nEnd Means:")
print(train_end_means)

Start Means:
feature_label
Arterial CO2 Pressure           41.354167
Arterial O2 pressure           107.958333
Inspired O2 Fraction            41.746398
Minute Volume                    8.190493
O2 saturation pulseoxymetry     97.718774
PH (Arterial)                    7.416979
Peak Insp. Pressure             16.117021
Respiratory Rate                19.184170
Tidal Volume (observed)        464.349023
Tidal Volume (spontaneous)     461.750000
Name: valuenum, dtype: float64

End Means:
feature_label
Arterial CO2 Pressure           38.222222
Arterial O2 pressure           111.229358
Inspired O2 Fraction            46.122958
Minute Volume                    7.598125
O2 saturation pulseoxymetry     97.491180
PH (Arterial)                    7.425000
Peak Insp. Pressure              7.830328
Respiratory Rate                20.597641
Tidal Volume (observed)        475.661538
Tidal Volume (spontaneous)     465.671875
Name: valuenum, dtype: float64


Next we calculate the start and end ventilator modes from the training data.

In [None]:
# Calculate the most common value for "Ventilator Mode" in the start and end windows
ventilator_start_mode = train_df[(train_df['time_from_window_start_mins'] <= start_window) &
                                 (train_df['feature_label'] == 'Ventilator Mode')]['valuenum'].mode()[0]

ventilator_end_mode = train_df[(train_df['time_from_window_start_mins'] >= end_window_start) &
                               (train_df['feature_label'] == 'Ventilator Mode')]['valuenum'].mode()[0]


print("Ventilator Start Mode:", ventilator_start_mode)
print("Ventilator End Mode:", ventilator_end_mode)

Ventilator Start Mode: 11.0
Ventilator End Mode: 30.0
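
One detail worth keeping in mind with `.mode()[0]`: `Series.mode()` returns all tied values in sorted order, so on a tie the smallest code is selected. A toy example with made-up values:

```python
import pandas as pd

# Two codes tied for most frequent
tied = pd.Series([30.0, 11.0, 30.0, 11.0])

print(tied.mode().tolist())  # [11.0, 30.0]
print(tied.mode()[0])        # 11.0
```

Inspecting `value_counts()` for the start and end windows would show whether the chosen modes are clear winners or near ties.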


Now we define a function to implement the start and end value imputation logic.

In [None]:
def fill_start_end_values(df, start_means, end_means, start_window, end_window_start):
    """
    Fill missing start and end values for features in a DataFrame with specified means or existing values.

    Parameters:
    df (pd.DataFrame): The input DataFrame containing time series data for various features.
    start_means (pd.Series): A Series containing mean values for each feature to be used if no start value is present.
    end_means (pd.Series): A Series containing mean values for each feature to be used if no end value is present.
    start_window (pd.Timedelta): The time window within which to look for a starting value.
    end_window_start (pd.Timedelta): The start time for the window within which to look for an ending value.

    Returns:
    pd.DataFrame: The DataFrame with filled start and end values.
    """
    new_rows = []

    for feature in start_means.index:
        for subject_id in df['subject_id'].unique():
            subject_df = df[df['subject_id'] == subject_id]
            extubation_failure_label = subject_df['extubation_failure'].iloc[0]

            # Handle start values
            start_check = (subject_df['feature_label'] == feature) & (subject_df['time_from_window_start_mins'] == pd.Timedelta(minutes=0))
            if not start_check.any():
                # Check if there's a value in the first 15 mins
                start_window_check = (subject_df['feature_label'] == feature) & (subject_df['time_from_window_start_mins'] <= start_window)
                if start_window_check.any():
                    # Use the earliest value within the start window
                    start_value = subject_df[start_window_check].sort_values('time_from_window_start_mins').iloc[0]['valuenum']
                    new_rows.append({
                        'subject_id': subject_id,
                        'feature_label': feature,
                        'time_from_window_start_mins': pd.Timedelta(minutes=0),
                        'valuenum': start_value,
                        'extubation_failure': extubation_failure_label
                    })
                else:
                    # Use the mean start value if no value is found within the start window
                    new_rows.append({
                        'subject_id': subject_id,
                        'feature_label': feature,
                        'time_from_window_start_mins': pd.Timedelta(minutes=0),
                        'valuenum': start_means[feature],
                        'extubation_failure': extubation_failure_label
                    })

            # Handle end values
            end_check = (subject_df['feature_label'] == feature) & (subject_df['time_from_window_start_mins'] == pd.Timedelta(minutes=360))
            if not end_check.any():
                # Check if there's a value in the last 15 mins
                end_window_check = (subject_df['feature_label'] == feature) & (subject_df['time_from_window_start_mins'] >= end_window_start)
                if end_window_check.any():
                    # Use the latest value within the end window
                    end_value = subject_df[end_window_check].sort_values('time_from_window_start_mins', ascending=False).iloc[0]['valuenum']
                    new_rows.append({
                        'subject_id': subject_id,
                        'feature_label': feature,
                        'time_from_window_start_mins': pd.Timedelta(minutes=360),
                        'valuenum': end_value,
                        'extubation_failure': extubation_failure_label
                    })
                else:
                    # Use the mean end value if no value is found within the end window
                    new_rows.append({
                        'subject_id': subject_id,
                        'feature_label': feature,
                        'time_from_window_start_mins': pd.Timedelta(minutes=360),
                        'valuenum': end_means[feature],
                        'extubation_failure': extubation_failure_label
                    })

    # Add new rows to the dataframe
    if new_rows:
        new_df = pd.DataFrame(new_rows)
        df = pd.concat([df, new_df], ignore_index=True)

    return df

In [None]:
# Count the number of rows before imputation
print("Number of rows in train_df before imputation:", train_df_copy.shape[0])
print("Number of rows in test_df before imputation:", test_df_copy.shape[0])

Number of rows in train_df before imputation: 90905
Number of rows in test_df before imputation: 22894


In [None]:
# Fill start and end values for numerical features
train_df = fill_start_end_values(train_df, train_start_means, train_end_means, start_window, end_window_start)
test_df = fill_start_end_values(test_df, train_start_means, train_end_means, start_window, end_window_start)

# Count the number of rows
print("Number of rows in train_df:", train_df.shape[0])
print("Number of rows in test_df:", test_df.shape[0])

Number of rows in train_df: 163394
Number of rows in test_df: 41037


Next we need to do the same but for ventilator mode.

In [None]:
def fill_ventilator_mode_start_end(train_df, start_window, end_window_start, ventilator_start_mode, ventilator_end_mode):
    """
    Fill missing start and end values for 'Ventilator Mode' in a DataFrame.

    Parameters:
    train_df (pd.DataFrame): The input DataFrame containing time series data for various features.
    start_window (pd.Timedelta): The time window within which to look for a starting value.
    end_window_start (pd.Timedelta): The start time for the window within which to look for an ending value.
    ventilator_start_mode (int): The default start mode for the ventilator if no value is found.
    ventilator_end_mode (int): The default end mode for the ventilator if no value is found.

    Returns:
    pd.DataFrame: The DataFrame with filled start and end values for 'Ventilator Mode'.
    """
    new_rows = []

    for subject_id in train_df['subject_id'].unique():
        subject_df = train_df[train_df['subject_id'] == subject_id]
        extubation_failure_label = subject_df['extubation_failure'].iloc[0]

        # Fill start value
        ventilator_start_check = (subject_df['feature_label'] == 'Ventilator Mode') & (subject_df['time_from_window_start_mins'] == pd.Timedelta(minutes=0))
        if not ventilator_start_check.any():
            ventilator_start_window_check = (subject_df['feature_label'] == 'Ventilator Mode') & (subject_df['time_from_window_start_mins'] <= start_window)
            if ventilator_start_window_check.any():
                ventilator_start_value = subject_df[ventilator_start_window_check].sort_values('time_from_window_start_mins').iloc[0]['valuenum']
                new_rows.append({
                    'subject_id': subject_id,
                    'feature_label': 'Ventilator Mode',
                    'time_from_window_start_mins': pd.Timedelta(minutes=0),
                    'valuenum': ventilator_start_value,
                    'extubation_failure': extubation_failure_label
                })
            else:
                new_rows.append({
                    'subject_id': subject_id,
                    'feature_label': 'Ventilator Mode',
                    'time_from_window_start_mins': pd.Timedelta(minutes=0),
                    'valuenum': ventilator_start_mode,
                    'extubation_failure': extubation_failure_label
                })

        # Fill end value
        ventilator_end_check = (subject_df['feature_label'] == 'Ventilator Mode') & (subject_df['time_from_window_start_mins'] == pd.Timedelta(minutes=360))
        if not ventilator_end_check.any():
            ventilator_end_window_check = (subject_df['feature_label'] == 'Ventilator Mode') & (subject_df['time_from_window_start_mins'] >= end_window_start)
            if ventilator_end_window_check.any():
                ventilator_end_value = subject_df[ventilator_end_window_check].sort_values('time_from_window_start_mins', ascending=False).iloc[0]['valuenum']
                new_rows.append({
                    'subject_id': subject_id,
                    'feature_label': 'Ventilator Mode',
                    'time_from_window_start_mins': pd.Timedelta(minutes=360),
                    'valuenum': ventilator_end_value,
                    'extubation_failure': extubation_failure_label
                })
            else:
                new_rows.append({
                    'subject_id': subject_id,
                    'feature_label': 'Ventilator Mode',
                    'time_from_window_start_mins': pd.Timedelta(minutes=360),
                    'valuenum': ventilator_end_mode,
                    'extubation_failure': extubation_failure_label
                })

    # Add new rows to the dataframe
    if new_rows:
        new_df = pd.DataFrame(new_rows)
        train_df = pd.concat([train_df, new_df], ignore_index=True)

    return train_df

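The row counts after the numerical fill make sense: the fill adds at most 3701 patients x 10 numerical features x 2 (start and end) = c. 74,000 rows, and the train set grew by 163,394 - 90,905 = 72,489 rows. The observed increase is slightly lower because no row is added where a measurement already exists at exactly 0 or 360 min.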

In [None]:
train_df = fill_ventilator_mode_start_end(train_df, start_window, end_window_start, ventilator_start_mode, ventilator_end_mode)
test_df = fill_ventilator_mode_start_end(test_df, start_window, end_window_start, ventilator_start_mode, ventilator_end_mode)

# Count the number of rows
print("Number of rows in train_df:", train_df.shape[0])
print("Number of rows in test_df:", test_df.shape[0])

Number of rows in train_df: 170693
Number of rows in test_df: 42877


In [None]:
train_df.head()

Unnamed: 0,subject_id,valuenum,time_from_window_start_mins,feature_label,extubation_failure
0,10001884,40.0,0 days 03:20:00,Inspired O2 Fraction,1
1,10001884,,0 days 03:20:00,Tidal Volume (observed),1
2,10001884,,0 days 03:20:00,Tidal Volume (spontaneous),1
3,10001884,6.1,0 days 03:20:00,Minute Volume,1
4,10001884,17.0,0 days 03:20:00,Peak Insp. Pressure,1


Now we check if all features have a start and end value.

In [None]:
# Check if every feature for every patient has a value at 0 mins and 360 mins
def check_start_end_values(df):
    start_check = df[df['time_from_window_start_mins'] == pd.Timedelta(minutes=0)]
    end_check = df[df['time_from_window_start_mins'] == pd.Timedelta(minutes=360)]

    missing_start = []
    missing_end = []

    for subject_id in df['subject_id'].unique():
        subject_df = df[df['subject_id'] == subject_id]
        for feature in df['feature_label'].unique():
            if not ((subject_df['feature_label'] == feature) & (subject_df['time_from_window_start_mins'] == pd.Timedelta(minutes=0))).any():
                missing_start.append((subject_id, feature))
            if not ((subject_df['feature_label'] == feature) & (subject_df['time_from_window_start_mins'] == pd.Timedelta(minutes=360))).any():
                missing_end.append((subject_id, feature))

    return missing_start, missing_end

In [None]:
missing_start_train, missing_end_train = check_start_end_values(train_df)
missing_start_test, missing_end_test = check_start_end_values(test_df)

print("Missing start values in train_df:", len(missing_start_train))
print("Missing end values in train_df:", len(missing_end_train))
print("Missing start values in test_df:", len(missing_start_test))
print("Missing end values in test_df:", len(missing_end_test))

Missing start values in train_df: 0
Missing end values in train_df: 0
Missing start values in test_df: 0
Missing end values in test_df: 0
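
The nested loops in the check above scan the full frame once per patient and feature. A vectorised alternative is sketched below under the same column names; `missing_boundary_pairs` is a hypothetical helper, shown on toy data rather than the real frames.

```python
import pandas as pd

def missing_boundary_pairs(df, minute):
    """Return (subject_id, feature_label) pairs with no row at the given minute."""
    all_pairs = pd.MultiIndex.from_product(
        [df["subject_id"].unique(), df["feature_label"].unique()],
        names=["subject_id", "feature_label"],
    )
    at_time = df[df["time_from_window_start_mins"] == pd.Timedelta(minutes=minute)]
    present = pd.MultiIndex.from_frame(
        at_time[["subject_id", "feature_label"]].drop_duplicates()
    )
    return all_pairs.difference(present).tolist()

toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "feature_label": ["Respiratory Rate", "Respiratory Rate", "Minute Volume",
                      "Respiratory Rate", "Minute Volume"],
    "time_from_window_start_mins": pd.to_timedelta([0, 360, 0, 0, 0], unit="min"),
})

print(len(missing_boundary_pairs(toy, 0)))    # 0
print(len(missing_boundary_pairs(toy, 360)))  # 3
```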


In [None]:
train_df_start_end_values_copy = train_df.copy()
test_df_start_end_values_copy = test_df.copy()

In [None]:
# Save for now
train_df.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/01_train_data_start_end_values_filled.parquet')
test_df.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/01_test_data_start_end_values_filled.parquet')

**Step 1.2 - Resample and interpolate data**

We need to resample the data to 30 min time intervals. The logic for this is as below:


Numerical features:
- The features are all continuous, so in an ideal scenario an interpolation technique such as spline would be more suitable, as it produces a smooth curve through all the data points and more naturally captures time series behaviour. However, spline requires at least k+1 data points. It also relies on data density: given the amount of synthetic data required here and our low data density, it may not be appropriate because it is more sensitive to noise.
- Standard linear interpolation on a resampled sequence will not take into account values that fall between intervals, so a more sophisticated logic is required.
- The data will first be resampled to 1 min intervals (which captures any existing data that does not fall on a 30 min boundary), then linear interpolation will be applied to fill the values, and finally the data will be resampled to the target 30 min intervals. This ensures that all data points are taken into consideration, no matter when they were recorded.
- Any values that cannot be interpolated will be filled with the mean value for that feature for that patient.
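
The effect of going through a 1-minute grid can be seen on a toy series with a measurement at 40 minutes, which is off the 30-minute grid. This is a sketch with made-up values: resampling straight to 30 minutes discards the 40-minute reading, whereas the two-stage route lets it shape the value at 30 minutes.

```python
import pandas as pd

# Toy series: measurements at 0, 40 and 60 minutes
s = pd.Series(
    [10.0, 50.0, 30.0],
    index=pd.to_timedelta([0, 40, 60], unit="min"),
)

# Direct route: resample to 30 min, then interpolate (the 40-min value is lost)
direct = s.resample("30min").asfreq().interpolate(method="linear")

# Two-stage route: 1-minute grid, interpolate, then keep every 30th minute
two_stage = (
    s.resample("1min").asfreq()
     .interpolate(method="linear")
     .resample("30min").asfreq()
)

print(direct.tolist())     # [10.0, 20.0, 30.0]
print(two_stage.tolist())  # [10.0, 40.0, 30.0]
```
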

Ventilator mode:
- Cannot be interpolated as this is meaningless
- Value will be mode imputed for the most common value in a 30 min window around that time point across all patients
- The modes used will need to be calculated from the training data
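
A simplified sketch of the ventilator mode rule is given below. It assumes the data is already on the 30-minute grid and uses the most common training-set mode at each exact time step, rather than a 30-minute window around it. The column name `minute` and both helper names are hypothetical.

```python
import numpy as np
import pandas as pd

def per_step_modes(train_vent):
    """Most common observed ventilator mode at each time step (training data only)."""
    observed = train_vent.dropna(subset=["valuenum"])
    return observed.groupby("minute")["valuenum"].agg(lambda v: v.mode().iloc[0])

def fill_with_step_modes(vent, step_modes):
    """Fill missing modes with the training-set mode for the same time step."""
    filled = vent.copy()
    filled["valuenum"] = filled["valuenum"].fillna(filled["minute"].map(step_modes))
    return filled

toy_train = pd.DataFrame({
    "minute": [0, 0, 0, 30, 30, 30],
    "valuenum": [11.0, 11.0, 49.0, 30.0, 30.0, np.nan],
})
toy_test = pd.DataFrame({"minute": [0, 30], "valuenum": [np.nan, 49.0]})

step_modes = per_step_modes(toy_train)  # computed from training data only
print(fill_with_step_modes(toy_test, step_modes)["valuenum"].tolist())  # [11.0, 49.0]
```

Observed values are left untouched; only missing entries take the training-set mode for their time step.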


In [None]:
def resample_and_interpolate(df, numerical_features, initial_interval='1min', target_interval='30min'):
    """
    Resample and interpolate missing values for numerical features in a DataFrame.

    Parameters:
    df (pd.DataFrame): The input DataFrame containing time series data for various features.
    numerical_features (list): A list of numerical features to be interpolated.
    initial_interval (str): The initial resampling interval to ensure all data points are included. Default is '1min' (1 minute; the older 'T' alias is deprecated in recent pandas).
    target_interval (str): The target resampling interval. Default is '30min' (30 minutes).

    Returns:
    pd.DataFrame: The DataFrame with resampled and interpolated values.
    """
    resampled_dfs = []

    for subject_id in df['subject_id'].unique():
        subject_df = df[df['subject_id'] == subject_id]

        for feature in df['feature_label'].unique():
            feature_df = subject_df[subject_df['feature_label'] == feature].set_index('time_from_window_start_mins')

            # Remove duplicates by taking the first value if duplicates exist
            feature_df = feature_df[~feature_df.index.duplicated(keep='first')]

            # Preserve additional columns before resampling
            additional_columns = feature_df[['extubation_failure']]

            # Step 1: Resample to every minute to ensure data points at every minute
            feature_df = feature_df.resample(initial_interval).asfreq()

            # Step 2: Interpolate missing values if the feature is numerical
            if feature in numerical_features:
                feature_df['valuenum'] = feature_df['valuenum'].interpolate(method='linear')

                # Check for any remaining NaN values and fill with the mean of the feature for the specific patient
                if feature_df['valuenum'].isna().any():
                    feature_mean = subject_df[subject_df['feature_label'] == feature]['valuenum'].mean()
                    # Assign the result back rather than calling fillna(inplace=True) on a column selection
                    feature_df['valuenum'] = feature_df['valuenum'].fillna(feature_mean)

            # Step 3: Resample to the target interval
            feature_df = feature_df.resample(target_interval).asfreq()

            # Align extubation_failure columns with the resampled index and forward/backward fill for each patient
            for col in additional_columns.columns:
                feature_df[col] = feature_df[col].ffill().bfill()

            # Restore subject_id and feature_label columns
            feature_df['subject_id'] = subject_id
            feature_df['feature_label'] = feature

            resampled_dfs.append(feature_df.reset_index())

    # Concatenate all resampled dataframes
    resampled_df = pd.concat(resampled_dfs).reset_index(drop=True)

    return resampled_df

In [None]:
numerical_features = train_df['feature_label'].unique()
numerical_features = numerical_features[numerical_features != 'Ventilator Mode']
numerical_features

array(['Inspired O2 Fraction', 'Tidal Volume (observed)',
       'Tidal Volume (spontaneous)', 'Minute Volume',
       'Peak Insp. Pressure', 'Respiratory Rate',
       'O2 saturation pulseoxymetry', 'Arterial O2 pressure',
       'Arterial CO2 Pressure', 'PH (Arterial)'], dtype=object)

In [None]:
train_df_copy_2 = train_df.copy()
test_df_copy_2 = test_df.copy()

In [None]:
# Resample and interpolate the data
train_df = resample_and_interpolate(train_df, numerical_features)
test_df = resample_and_interpolate(test_df, numerical_features)

In [None]:
# Count the number of rows
print("Number of rows in train_df:", train_df.shape[0])
print("Number of rows in test_df:", test_df.shape[0])

Number of rows in train_df: 537680
Number of rows in test_df: 134563


In [None]:
# Count the number of NaNs
print("Number of NaNs in train_df:", train_df.isna().sum().sum())
print("Number of NaNs in test_df:", test_df.isna().sum().sum())

Number of NaNs in train_df: 41921
Number of NaNs in test_df: 10479


In [None]:
train_df_copy_num_filled = train_df.copy()
test_df_copy_num_filled = test_df.copy()

In [None]:
# Group the DataFrame by 'feature_label' and count NaNs for each group
nan_counts_by_feature = train_df_copy_num_filled.groupby('feature_label').apply(lambda x: x.isna().sum().sum())

print(nan_counts_by_feature)

feature_label
Arterial CO2 Pressure             12
Arterial O2 pressure              12
Inspired O2 Fraction              82
Minute Volume                    400
O2 saturation pulseoxymetry       48
PH (Arterial)                     16
Peak Insp. Pressure                0
Respiratory Rate                   0
Tidal Volume (observed)          300
Tidal Volume (spontaneous)       350
Ventilator Mode                40701
dtype: int64


In [None]:
# Ignoring ventilator mode, what is the % of NaN data points
nan_counts_by_feature = train_df_copy_num_filled[train_df_copy_num_filled['feature_label'] != 'Ventilator Mode'].groupby('feature_label').apply(lambda x: x.isna().sum().sum())

# Calculate the total number
total_count = train_df_copy_num_filled[train_df_copy_num_filled['feature_label'] != 'Ventilator Mode'].shape[0]

# Calculate the percentage
percentage_nan = (nan_counts_by_feature / total_count) * 100

print(percentage_nan)

feature_label
Arterial CO2 Pressure          0.002455
Arterial O2 pressure           0.002455
Inspired O2 Fraction           0.016776
Minute Volume                  0.081833
O2 saturation pulseoxymetry    0.009820
PH (Arterial)                  0.003273
Peak Insp. Pressure            0.000000
Respiratory Rate               0.000000
Tidal Volume (observed)        0.061375
Tidal Volume (spontaneous)     0.071604
dtype: float64


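*Note: these counts are taken before the final forward fill or backward fill of any remaining NaNs.*

Only a small share of values is left as NaN: each feature's remaining NaNs make up less than 0.1% of all numerical rows. Assuming the ten numerical features have equal row counts after resampling, that is under 1% of any single feature's own rows. Filling them with a forward fill or backward fill should therefore not have a significantly negative impact on the data.

We will amend the resample and interpolate function to handle any leftover missing values.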
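
As a minimal sketch of the kind of final fill described above (not the notebook's actual `resample_and_interpolate` change), the snippet below forward fills and then backward fills within each patient and feature on a small made-up frame. The frame `toy_df` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format frame: two patients, one feature, with edge gaps.
toy_df = pd.DataFrame({
    'subject_id': [1, 1, 1, 2, 2, 2],
    'feature_label': ['Minute Volume'] * 6,
    'valuenum': [np.nan, 6.0, np.nan, 7.5, np.nan, np.nan],
})

# Fill within each (patient, feature) series only, so values never cross patients.
toy_df['valuenum'] = (
    toy_df.groupby(['subject_id', 'feature_label'])['valuenum']
          .transform(lambda s: s.ffill().bfill())
)

print(toy_df['valuenum'].tolist())
```

A series that is entirely NaN for a patient stays NaN under this approach, so a further fallback (such as the per-patient or global mean) would still be needed in that case.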

In [None]:
# Reset the training and test data to before values were interpolated
train_df = train_df_start_end_values_copy.copy()
test_df = test_df_start_end_values_copy.copy()

In [None]:
train_df = resample_and_interpolate(train_df, numerical_features)

nan_counts_by_feature = train_df[train_df['feature_label'] != 'Ventilator Mode'].groupby('feature_label').apply(lambda x: x.isna().sum().sum())
nan_counts_by_feature

feature_label
Arterial CO2 Pressure          0
Arterial O2 pressure           0
Inspired O2 Fraction           0
Minute Volume                  0
O2 saturation pulseoxymetry    0
PH (Arterial)                  0
Peak Insp. Pressure            0
Respiratory Rate               0
Tidal Volume (observed)        0
Tidal Volume (spontaneous)     0
dtype: int64

In [None]:
test_df = resample_and_interpolate(test_df, numerical_features)
# Check the test set (not the training set) for remaining NaNs
nan_counts_by_feature = test_df[test_df['feature_label'] != 'Ventilator Mode'].groupby('feature_label').apply(lambda x: x.isna().sum().sum())
nan_counts_by_feature

feature_label
Arterial CO2 Pressure          0
Arterial O2 pressure           0
Inspired O2 Fraction           0
Minute Volume                  0
O2 saturation pulseoxymetry    0
PH (Arterial)                  0
Peak Insp. Pressure            0
Respiratory Rate               0
Tidal Volume (observed)        0
Tidal Volume (spontaneous)     0
dtype: int64

In [None]:
train_df_copy_num_filled_no_nan = train_df.copy()
test_df_copy_num_filled_no_nan = test_df.copy()

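Now we need to handle the remaining missing values for ventilator mode, which is categorical and so is imputed with the mode rather than interpolated.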

In [None]:
def mode_impute_ventilator_mode(df, interval='30T'):
    """
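    Mode impute missing values for 'Ventilator Mode' based on the most common value in a 30-minute window
    (15 minutes either side of the missing time point).

    Note: the window is taken over every 'Ventilator Mode' row in df, so the mode is computed
    across all patients rather than per patient.

    Parameters:
    df (pd.DataFrame): The input DataFrame containing time series data.
    interval (str): The resampling interval. Default is '30T' (30 minutes). Currently unused;
        the window is hard-coded to +/- 15 minutes.

    Returns:
    pd.DataFrame: The DataFrame with mode-imputed values for 'Ventilator Mode'.
    """
    # Cast 'valuenum' to categorical (this applies to the whole column, not only 'Ventilator Mode' rows)
    df['valuenum'] = df['valuenum'].astype('category')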

    # Create a mask for 'Ventilator Mode'
    ventilator_mode_mask = (df['feature_label'] == 'Ventilator Mode')

    # Get the times that need imputation
    times_to_impute = df[ventilator_mode_mask & df['valuenum'].isna()]['time_from_window_start_mins']

    for time in times_to_impute:
        # Define the window
        window_start = time - pd.Timedelta(minutes=15)
        window_end = time + pd.Timedelta(minutes=15)

        # Filter the dataframe to find the mode within the window
        window_df = df[(df['time_from_window_start_mins'] >= window_start) &
                       (df['time_from_window_start_mins'] <= window_end) &
                       ventilator_mode_mask]

        if not window_df['valuenum'].mode().empty:
            mode_value = window_df['valuenum'].mode().iloc[0]
        else:
            # If no mode is found within the window, fall back to a default value
            mode_value = df[ventilator_mode_mask]['valuenum'].mode().iloc[0]

        # Impute the missing value with the mode
        df.loc[(df['time_from_window_start_mins'] == time) & ventilator_mode_mask, 'valuenum'] = mode_value

    return df

We will first apply this to the training data and then use the training modes to impute the test data.

In [None]:
# Mode impute ventilator mode for training data
train_df = mode_impute_ventilator_mode(train_df)

# Group the DataFrame by 'feature_label' and count NaNs for each group
nan_counts_by_feature = train_df.groupby('feature_label').apply(lambda x: x.isna().sum().sum())

print(nan_counts_by_feature)

feature_label
Arterial CO2 Pressure          0
Arterial O2 pressure           0
Inspired O2 Fraction           0
Minute Volume                  0
O2 saturation pulseoxymetry    0
PH (Arterial)                  0
Peak Insp. Pressure            0
Respiratory Rate               0
Tidal Volume (observed)        0
Tidal Volume (spontaneous)     0
Ventilator Mode                0
dtype: int64


In [None]:
test_df = test_df_copy_num_filled_no_nan.copy()

Now we need to fill in the test set's ventilator mode values using mode imputation. The modes are taken from the training data to prevent data leakage.
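
The idea can be sketched as a "fit on train, apply to test" lookup. The example below is an illustration on made-up frames (`train_toy`, `test_toy`) with only the columns needed; it is not the notebook's implementation, which works per 30-minute window on the full long-format data.

```python
import numpy as np
import pandas as pd

# Hypothetical ventilator mode codes at two time steps.
train_toy = pd.DataFrame({
    'time_from_window_start_mins': pd.to_timedelta([0, 0, 0, 30, 30, 30], unit='min'),
    'valuenum': [11.0, 11.0, 49.0, 30.0, 30.0, 11.0],
})
test_toy = pd.DataFrame({
    'time_from_window_start_mins': pd.to_timedelta([0, 30], unit='min'),
    'valuenum': [np.nan, np.nan],
})

# "Fit": most frequent mode per time step, computed from training rows only.
train_modes = (
    train_toy.groupby('time_from_window_start_mins')['valuenum']
             .agg(lambda s: s.mode().iloc[0])
)

# "Apply": look up the training mode for each test time step and use it to fill gaps.
lookup = test_toy['time_from_window_start_mins'].map(train_modes)
test_toy['valuenum'] = test_toy['valuenum'].fillna(lookup)

print(test_toy)
```

Because the lookup table is built before the test frame is touched, no statistic of the test set influences the imputed values.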

In [None]:
# Fill in ventilator mode values - mode values are derived from the training data
def calculate_window_modes(train_df, feature_label='Ventilator Mode', interval='30T', max_time='360T'):
    """
    Calculate the mode for each time window in the training data.

    Parameters:
    train_df (pd.DataFrame): The training DataFrame containing time series data.
    feature_label (str): The feature label to calculate modes for. Default is 'Ventilator Mode'.
    interval (str): The resampling interval. Default is '30T' (30 minutes).
    max_time (str): The maximum time to check for intervals. Default is '360T' (360 minutes).

    Returns:
    dict: A dictionary with time windows as keys and mode values as values.
    """
    interval = pd.Timedelta(interval)
    max_time = pd.Timedelta(max_time)

    # Initialize a dictionary to store modes for each 30-minute window
    window_mode_dict = {}

    # Calculate modes for each 30-minute window in the training data
    for start_time in pd.timedelta_range(start='0 minutes', end=max_time, freq=interval):
        end_time = start_time + interval
        window_df = train_df[
            (train_df['feature_label'] == feature_label) &
            (train_df['time_from_window_start_mins'] >= start_time) &
            (train_df['time_from_window_start_mins'] < end_time)
        ]
        if not window_df['valuenum'].mode().empty:
            mode_value = window_df['valuenum'].mode().iloc[0]
            window_mode_dict[start_time] = mode_value

    return window_mode_dict

In [None]:
def test_data_mode_impute_ventilator_mode(test_df, mode_dict, interval='30T'):
    """
    Mode impute missing values for 'Ventilator Mode' based on the most common value in a 30-minute window
    using modes derived from the training data.

    Parameters:
    test_df (pd.DataFrame): The test DataFrame containing time series data.
    mode_dict (dict): Dictionary containing mode values for 'Ventilator Mode' for each 30-minute window.
    interval (str): The resampling interval. Default is '30T' (30 minutes).

    Returns:
    pd.DataFrame: The DataFrame with mode-imputed values for 'Ventilator Mode'.
    """
    interval = pd.Timedelta(interval)

    # Ensure 'Ventilator Mode' is treated as a categorical feature
    test_df['valuenum'] = test_df['valuenum'].astype('category')

    # Create a mask for 'Ventilator Mode'
    ventilator_mode_mask = (test_df['feature_label'] == 'Ventilator Mode')

    # Get the times that need imputation
    times_to_impute = test_df[ventilator_mode_mask & test_df['valuenum'].isna()]['time_from_window_start_mins']

    for time in times_to_impute:
        # Floor the time to the start of its 30-minute window
        nearest_start_time = (time // interval) * interval

        # Use the mode from the training data for the corresponding 30-minute window
        if nearest_start_time in mode_dict:
            mode_value = mode_dict[nearest_start_time]
        else:
            # If no mode is found for the specific window, fall back to the most common
            # training-window mode so that no test-set statistics are used
            mode_value = pd.Series(list(mode_dict.values())).mode().iloc[0]

        # Impute the missing value with the mode
        test_df.loc[(test_df['time_from_window_start_mins'] == time) & ventilator_mode_mask, 'valuenum'] = mode_value

    return test_df

In [None]:
# Calculate the modes from the training data
train_ventilator_mode_dict = calculate_window_modes(train_df)

In [None]:
# Mode impute the test data
test_df = test_data_mode_impute_ventilator_mode(test_df, train_ventilator_mode_dict)

# Group the DataFrame by 'feature_label' and count NaNs for each group
nan_counts_by_feature = test_df.groupby('feature_label').apply(lambda x: x.isna().sum().sum())

print(nan_counts_by_feature)

# Save train and test data
train_df.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/02_train_data_num_and_ventilator_mode_filled_v2.parquet')
test_df.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/02_test_data_num_and_ventilator_mode_filled_v2.parquet')

feature_label
Arterial CO2 Pressure          0
Arterial O2 pressure           0
Inspired O2 Fraction           0
Minute Volume                  0
O2 saturation pulseoxymetry    0
PH (Arterial)                  0
Peak Insp. Pressure            0
Respiratory Rate               0
Tidal Volume (observed)        0
Tidal Volume (spontaneous)     0
Ventilator Mode                0
dtype: int64


### **Step 2 - Feature Engineering**

Create SpO2:FiO2 and P/F ratios as new features by dividing one existing feature by another at each time step for each patient.
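
As a rough sketch of the ratio calculation, the example below pivots a made-up long-format frame to wide form and divides the two columns, treating a zero denominator as missing. The frame `toy_long` and its values are hypothetical, and the pivot is an alternative illustration rather than the merge-based approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical readings for one patient at two time steps.
toy_long = pd.DataFrame({
    'subject_id': [1, 1, 1, 1],
    'time_from_window_start_mins': pd.to_timedelta([0, 0, 30, 30], unit='min'),
    'feature_label': ['O2 saturation pulseoxymetry', 'Inspired O2 Fraction'] * 2,
    'valuenum': [98.0, 40.0, 95.0, 50.0],
})

# One row per (patient, time step), one column per feature.
wide = toy_long.pivot_table(
    index=['subject_id', 'time_from_window_start_mins'],
    columns='feature_label',
    values='valuenum',
)

# Replace zero denominators with NaN so the division never produces inf.
denominator = wide['Inspired O2 Fraction'].replace(0, np.nan)
spo2_fio2 = wide['O2 saturation pulseoxymetry'] / denominator

print(spo2_fio2)
```

Any NaN left by a zero denominator would still need to be imputed afterwards, ideally with a statistic taken from the training data.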

In [None]:
# Make valuenum a float and extubation failure an int
train_df['valuenum'] = train_df['valuenum'].astype('float64')
train_df['extubation_failure'] = train_df['extubation_failure'].astype('int64')
test_df['valuenum'] = test_df['valuenum'].astype('float64')
test_df['extubation_failure'] = test_df['extubation_failure'].astype('int64')

train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 537680 entries, 0 to 537679
Data columns (total 5 columns):
 #   Column                       Non-Null Count   Dtype          
---  ------                       --------------   -----          
 0   time_from_window_start_mins  537680 non-null  timedelta64[ns]
 1   subject_id                   537680 non-null  int64          
 2   valuenum                     537680 non-null  float64        
 3   feature_label                537680 non-null  object         
 4   extubation_failure           537680 non-null  int64          
dtypes: float64(1), int64(2), object(1), timedelta64[ns](1)
memory usage: 20.5+ MB


In [None]:
train_df_copy_pre_feature_eng = train_df.copy()
test_df_copy_pre_feature_eng = test_df.copy()

In [None]:
def create_new_feature(df, numerator_label, denominator_label, output_label):
    """
    Create a new feature by dividing one existing feature by another at each time point for each patient.

    Parameters:
    df (pd.DataFrame): The original dataframe containing the data.
    numerator_label (str): The feature label for the numerator.
    denominator_label (str): The feature label for the denominator.
    output_label (str): The feature label for the new feature.

    Returns:
    pd.DataFrame: The original dataframe with the new feature's rows appended. NaN ratios (zero denominator)
    are filled with the mean of the new feature across all patients in df.
    """

    # Filter the dataframe to get rows corresponding to the numerator and denominator feature labels
    numerator_df = df[df['feature_label'] == numerator_label].copy()
    denominator_df = df[df['feature_label'] == denominator_label].copy()

    # Ensure they are sorted by subject_id and time_from_window_start_mins for proper alignment
    numerator_df.sort_values(by=['subject_id', 'time_from_window_start_mins'], inplace=True)
    denominator_df.sort_values(by=['subject_id', 'time_from_window_start_mins'], inplace=True)

    # Merge the two dataframes on subject_id and time_from_window_start_mins to align the numerator and denominator values
    merged_df = pd.merge(
        numerator_df[['subject_id', 'time_from_window_start_mins', 'valuenum']],
        denominator_df[['subject_id', 'time_from_window_start_mins', 'valuenum']],
        on=['subject_id', 'time_from_window_start_mins'],
        suffixes=('_numerator', '_denominator')
    )

    # Calculate the new feature, setting result to NaN where the denominator is zero to avoid division by zero
    merged_df[output_label] = np.where(
        merged_df['valuenum_denominator'] != 0,
        merged_df['valuenum_numerator'] / merged_df['valuenum_denominator'],
        np.nan
    )

    # Calculate the mean value of the new feature, ignoring NaN values
    mean_value = merged_df[output_label].mean()

    # Fill NaN values with the calculated mean value (assigned back rather than filled inplace on a column selection)
    merged_df[output_label] = merged_df[output_label].fillna(mean_value)

    # Create a dataframe for the new feature
    new_feature_df = merged_df[['subject_id', 'time_from_window_start_mins', output_label]].copy()
    new_feature_df['feature_label'] = output_label
    new_feature_df.rename(columns={output_label: 'valuenum'}, inplace=True)

    # Merge extubation_failure info
    extubation_failure_df = df[['subject_id', 'extubation_failure']].drop_duplicates().set_index('subject_id')
    new_feature_df = new_feature_df.join(extubation_failure_df, on='subject_id')

    # Append the new feature dataframe to the original dataframe
    df = pd.concat([df, new_feature_df], ignore_index=True)

    return df

In [None]:
# Create SpO2:FiO2 ratio feature
train_df = create_new_feature(train_df, 'O2 saturation pulseoxymetry', 'Inspired O2 Fraction', 'SpO2:FiO2')
test_df = create_new_feature(test_df, 'O2 saturation pulseoxymetry', 'Inspired O2 Fraction', 'SpO2:FiO2')

# Count NaNs
nan_count = train_df.isna().sum()

# Display the result
print(f"Number of NaN values: {nan_count}")

# Count test nans
nan_count = test_df.isna().sum()

# Display the result
print(f"Number of NaN values: {nan_count}")

Number of NaN values: time_from_window_start_mins    0
subject_id                     0
valuenum                       0
feature_label                  0
extubation_failure             0
dtype: int64
Number of NaN values: time_from_window_start_mins    0
subject_id                     0
valuenum                       0
feature_label                  0
extubation_failure             0
dtype: int64


In [None]:
# Create P:F ratio feature
train_df = create_new_feature(train_df, 'Arterial O2 pressure', 'Inspired O2 Fraction', 'P:F ratio')
test_df = create_new_feature(test_df, 'Arterial O2 pressure', 'Inspired O2 Fraction', 'P:F ratio')

# Count NaNs
nan_count = train_df.isna().sum()

# Display the result
print(f"Number of NaN values: {nan_count}")

# Count test nans
nan_count = test_df.isna().sum()

# Display the result
print(f"Number of NaN values: {nan_count}")

Number of NaN values: time_from_window_start_mins    0
subject_id                     0
valuenum                       0
feature_label                  0
extubation_failure             0
dtype: int64
Number of NaN values: time_from_window_start_mins    0
subject_id                     0
valuenum                       0
feature_label                  0
extubation_failure             0
dtype: int64


In [None]:
# List the features in train data
print(train_df['feature_label'].unique())
print(test_df['feature_label'].unique())

['Inspired O2 Fraction' 'Tidal Volume (observed)'
 'Tidal Volume (spontaneous)' 'Minute Volume' 'Peak Insp. Pressure'
 'Respiratory Rate' 'O2 saturation pulseoxymetry' 'Ventilator Mode'
 'Arterial O2 pressure' 'Arterial CO2 Pressure' 'PH (Arterial)'
 'SpO2:FiO2' 'P:F ratio']
['O2 saturation pulseoxymetry' 'Inspired O2 Fraction'
 'Tidal Volume (observed)' 'Tidal Volume (spontaneous)' 'Minute Volume'
 'Peak Insp. Pressure' 'Respiratory Rate' 'Ventilator Mode'
 'Arterial O2 pressure' 'Arterial CO2 Pressure' 'PH (Arterial)'
 'SpO2:FiO2' 'P:F ratio']


Now we need to check we have data points at consistent time intervals for each feature.
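
One way to express this check without nested loops is to build the full grid of expected (patient, feature, time) combinations and subtract the combinations that are present. The sketch below does this on a made-up frame; `toy` and the 60-minute horizon are assumptions for illustration only.

```python
import pandas as pd

# Expected time steps: 0, 30 and 60 minutes.
expected_times = pd.timedelta_range(start='0 minutes', end='60 minutes', freq='30min')

# Hypothetical data: patient 2 is missing the 30-minute step.
toy = pd.DataFrame({
    'subject_id': [1, 1, 1, 2, 2],
    'feature_label': ['Respiratory Rate'] * 5,
    'time_from_window_start_mins': pd.to_timedelta([0, 30, 60, 0, 60], unit='min'),
    'valuenum': [0.2, 0.3, 0.25, 0.4, 0.5],
})

key_cols = ['subject_id', 'feature_label', 'time_from_window_start_mins']

# Every combination that should exist.
full_index = pd.MultiIndex.from_product(
    [toy['subject_id'].unique(), toy['feature_label'].unique(), expected_times],
    names=key_cols,
)

# Every combination that does exist.
observed = pd.MultiIndex.from_frame(toy[key_cols])

# Combinations that are expected but absent.
missing = full_index.difference(observed).to_frame(index=False)
print(missing)
```

An empty `missing` frame corresponds to the "0 missing intervals" result expected from a complete dataset.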

In [None]:
# Check that every patient has a valid data point for every feature at each 30 min interval
def check_complete_intervals(df, interval='30T', max_time=pd.Timedelta(minutes=360)):
    """
    Check if every patient and every feature has a value for every interval up to max_time.

    Parameters:
    df (pd.DataFrame): The input DataFrame containing time series data.
    interval (str): The interval to check. Default is '30T' (30 minutes).
    max_time (pd.Timedelta): The maximum time to check for intervals. Default is 360 minutes.

    Returns:
    pd.DataFrame: DataFrame indicating any missing intervals.
    """
    # Create a time index for the intervals
    time_index = pd.timedelta_range(start='0 minutes', end=max_time, freq=interval)

    missing_intervals = []

    for subject_id in df['subject_id'].unique():
        subject_df = df[df['subject_id'] == subject_id]

        for feature in df['feature_label'].unique():
            feature_df = subject_df[subject_df['feature_label'] == feature]
            feature_times = feature_df['time_from_window_start_mins']

            for time in time_index:
                if time not in feature_times.values:
                    missing_intervals.append({
                        'subject_id': subject_id,
                        'feature_label': feature,
                        'missing_interval': time
                    })

    if missing_intervals:
        missing_df = pd.DataFrame(missing_intervals)
    else:
        missing_df = pd.DataFrame(columns=['subject_id', 'feature_label', 'missing_interval'])

    return missing_df

In [None]:
train_missing_intervals = check_complete_intervals(train_df)
test_missing_intervals = check_complete_intervals(test_df)

# Display the result
print(f"Number of missing intervals: {len(train_missing_intervals)}")
print(f"Number of missing intervals: {len(test_missing_intervals)}")

Number of missing intervals: 0
Number of missing intervals: 0


In [None]:
# Save progress
train_df.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/03_train_data_num_and_ventilator_mode_filled_and_feature_eng.parquet')
test_df.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/03_test_data_num_and_ventilator_mode_filled_and_feature_eng.parquet')

In [None]:
train_df_copy_post_feature_eng = train_df.copy()
test_df_copy_post_feature_eng = test_df.copy()

In [None]:
# Convert the array of numerical features to a list so the new features can be appended
# (going through set() does not preserve the original order)
numerical_features = list(set(numerical_features))

In [None]:
numerical_features.append('SpO2:FiO2')
numerical_features.append('P:F ratio')
numerical_features

['Minute Volume',
 'Respiratory Rate',
 'Inspired O2 Fraction',
 'O2 saturation pulseoxymetry',
 'PH (Arterial)',
 'Tidal Volume (observed)',
 'Arterial O2 pressure',
 'Arterial CO2 Pressure',
 'Peak Insp. Pressure',
 'Tidal Volume (spontaneous)',
 'SpO2:FiO2',
 'P:F ratio']

### **Step 3 - Scaling and One-hot encoding**

Scale numerical features with MinMaxScaler and one-hot encode ventilator mode, as it is categorical.

For the LSTM, min-max scaling is used in preference to standard scaling. Both are linear transforms, so the shape of each time series is preserved, but min-max scaling maps each feature onto a fixed [0, 1] range on the training data and so avoids negative values there.

The scaler must be fitted on the training data only and then applied unchanged to the test data.
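
To make the fit and apply split concrete, here is a plain NumPy sketch of the formula that `MinMaxScaler` uses with its default `feature_range=(0, 1)`, namely `(x - min) / (max - min)` with the minimum and maximum taken from the training data. The arrays and the helper `min_max_scale` are hypothetical.

```python
import numpy as np

# Hypothetical values of one feature in the train and test sets.
train_values = np.array([10.0, 20.0, 30.0])
test_values = np.array([5.0, 25.0, 40.0])

# "Fit": the minimum and maximum come from the training data only.
train_min = train_values.min()
train_max = train_values.max()


def min_max_scale(values, data_min, data_max):
    """Scale values with a previously fitted minimum and maximum."""
    return (values - data_min) / (data_max - data_min)


# "Transform": the same fitted parameters are applied to both sets.
train_scaled = min_max_scale(train_values, train_min, train_max)
test_scaled = min_max_scale(test_values, train_min, train_max)

print(train_scaled)
print(test_scaled)
```

Test values outside the training range land outside [0, 1], so scaled test data can still contain negative values or values above 1.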

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
def scale_features(train_df, test_df, numerical_features):
    """
    Scale numerical features in the train and test dataframes using MinMaxScaler.

    Parameters:
    train_df (pd.DataFrame): The training dataframe.
    test_df (pd.DataFrame): The testing dataframe.
    numerical_features (list): List of feature labels to be scaled.

    Returns:
    tuple: The scaled train and test dataframes.
    """
    scalers = {}

    for feature in numerical_features:
        # Initialize the MinMaxScaler for the feature
        scalers[feature] = MinMaxScaler()

        # Create masks for the feature in train and test dataframes
        feature_mask_train = train_df['feature_label'] == feature
        feature_mask_test = test_df['feature_label'] == feature

        # Fit and transform the train dataframe values (flattened to 1D to match the selected column)
        train_df.loc[feature_mask_train, 'valuenum'] = scalers[feature].fit_transform(train_df.loc[feature_mask_train, ['valuenum']]).ravel()

        # Transform the test dataframe values with the scaler fitted on the train data
        test_df.loc[feature_mask_test, 'valuenum'] = scalers[feature].transform(test_df.loc[feature_mask_test, ['valuenum']]).ravel()

        print(f'Feature {feature} has been normalized')

    # Ensure indices align if necessary
    train_df.reset_index(drop=True, inplace=True)
    test_df.reset_index(drop=True, inplace=True)

    # Display the sizes after normalization
    print(f"Number of rows in train dataframe after normalization: {len(train_df)}")
    print(f"Number of rows in test dataframe after normalization: {len(test_df)}")

    return train_df, test_df

In [None]:
# Scale the data using MinMaxScaler
train_df, test_df = scale_features(train_df, test_df, numerical_features)

Feature Minute Volume has been normalized
Feature Respiratory Rate has been normalized
Feature Inspired O2 Fraction has been normalized
Feature O2 saturation pulseoxymetry has been normalized
Feature PH (Arterial) has been normalized
Feature Tidal Volume (observed) has been normalized
Feature Arterial O2 pressure has been normalized
Feature Arterial CO2 Pressure has been normalized
Feature Peak Insp. Pressure has been normalized
Feature Tidal Volume (spontaneous) has been normalized
Feature SpO2:FiO2 has been normalized
Feature P:F ratio has been normalized
Number of rows in train dataframe after normalization: 635440
Number of rows in test dataframe after normalization: 159029


In [None]:
train_df_copy_post_scaling = train_df.copy()
test_df_copy_post_scaling = test_df.copy()

Now we need to one-hot encode ventilator mode, as it is a categorical variable.

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Check that the ventilator modes in the train set are the same as in the test set
train_ventilator_modes = train_df[train_df['feature_label'] == 'Ventilator Mode']['valuenum'].unique()
test_ventilator_modes = test_df[test_df['feature_label'] == 'Ventilator Mode']['valuenum'].unique()

# Check they are the same
if set(train_ventilator_modes) == set(test_ventilator_modes):
    print("The ventilator modes in the train and test sets are the same.")
else:
    print("The ventilator modes in the train and test sets are different.")

# Identify the differences
test_diff = set(test_ventilator_modes) - set(train_ventilator_modes)
train_diff = set(train_ventilator_modes) - set(test_ventilator_modes)

print(f"Ventilator modes in the test set but not in the train set: {test_diff}")
print(f"Ventilator modes in the train set but not in the test set: {train_diff}")

The ventilator modes in the train and test sets are different.
Ventilator modes in the test set but not in the train set: set()
Ventilator modes in the train set but not in the test set: {2.0, 71.0, 12.0, 13.0, 50.0, 53.0}


In [None]:
train_ventilator_modes

array([11., 30.,  2., 49., 51., 13., 10., 71., 12., 53., 50.])

In [None]:
test_ventilator_modes

array([11., 30., 49., 10., 51.])

There are no ventilator modes in the test set that are not in the training set, but six modes appear only in the training set.

We need to make sure that the encoded columns in the test set align with those in the training set: any mode that does not occur in the test set is still added as a column, with its values set to 0.
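
A compact way to get aligned columns is to fix the category list up front and let `pd.get_dummies` emit a column for every category, including ones that never occur. The sketch below is an alternative illustration on made-up mode codes, assuming the categories are taken from the training set; it is not the long-format encoder used in this notebook.

```python
import pandas as pd

# Hypothetical ventilator mode codes; mode 30.0 never occurs in the test set.
train_modes = pd.Series([11.0, 30.0, 49.0, 11.0])
test_modes = pd.Series([11.0, 49.0])

# Fix the category list from the training data so both sets share one layout.
categories = sorted(train_modes.unique())


def encode_modes(modes):
    """One-hot encode mode codes against the fixed category list."""
    categorical = pd.Series(pd.Categorical(modes, categories=categories))
    return pd.get_dummies(categorical, prefix='Ventilator_Mode', dtype=int)


train_ohe = encode_modes(train_modes)
test_ohe = encode_modes(test_modes)

print(test_ohe)
```

A mode code that is not in `categories` would become NaN in the categorical and produce an all-zero row, which is one reason to confirm beforehand that the test modes are a subset of the training modes.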

In [None]:
def one_hot_encode_ventilator_mode_with_dummies(df, all_modes, feature_col='feature_label', value_col='valuenum', label_col='extubation_failure'):
    """
    One-hot encode the 'Ventilator Mode' feature in the DataFrame by creating new features
    within the 'feature_label' column and setting 'valuenum' to 1 for the relevant mode at each time step.

    Parameters:
    df (pd.DataFrame): The DataFrame containing time series data.
    all_modes (list): A list of all possible ventilator modes.
    feature_col (str): The column name that contains feature labels.
    value_col (str): The column name that contains feature values.
    label_col (str): The column name that contains the label.

    Returns:
    pd.DataFrame: The DataFrame with one-hot encoded ventilator mode features.
    """
    # Filter out the rows corresponding to 'Ventilator Mode'
    ventilator_mode_mask = (df[feature_col] == 'Ventilator Mode')
    ventilator_mode_df = df[ventilator_mode_mask].copy()

    # Create a DataFrame with all combinations of subject_id, time_from_window_start_mins, and all ventilator modes
    unique_subjects_times = ventilator_mode_df[['subject_id', 'time_from_window_start_mins']].drop_duplicates()
    mode_columns = [f'Ventilator_Mode_{mode}' for mode in all_modes]
    dummy_df = pd.DataFrame(0, index=unique_subjects_times.index, columns=mode_columns)
    dummy_df = pd.concat([unique_subjects_times.reset_index(drop=True), dummy_df.reset_index(drop=True)], axis=1)

    # Populate the dummy DataFrame with 1s for the appropriate ventilator modes
    for mode in all_modes:
        mode_mask = (ventilator_mode_df[value_col] == mode)
        relevant_rows = ventilator_mode_df[mode_mask][['subject_id', 'time_from_window_start_mins']]
        dummy_df.loc[dummy_df.set_index(['subject_id', 'time_from_window_start_mins']).index.isin(
            relevant_rows.set_index(['subject_id', 'time_from_window_start_mins']).index), f'Ventilator_Mode_{mode}'] = 1

    # Melt the dummy DataFrame back to long format
    dummy_df_melted = dummy_df.melt(id_vars=['subject_id', 'time_from_window_start_mins'],
                                    var_name=feature_col, value_name=value_col)

    # Add the extubation_failure column back to the melted DataFrame
    dummy_df_melted = dummy_df_melted.merge(
        ventilator_mode_df[['subject_id', 'time_from_window_start_mins', label_col]].drop_duplicates(),
        on=['subject_id', 'time_from_window_start_mins'], how='left'
    )

    # Drop the original 'Ventilator Mode' rows from the original dataframe
    df = df[~ventilator_mode_mask]

    # Merge the encoded ventilator mode data back into the original dataframe
    df = pd.concat([df, dummy_df_melted])

    # Fill NaN values in valuenum columns with 0 for ventilator modes
    df[value_col] = df[value_col].fillna(0)

    return df

def one_hot_encode_ventilator_mode(train_df, test_df, feature_col='feature_label', value_col='valuenum', label_col='extubation_failure'):
    """
    One-hot encode the 'Ventilator Mode' feature in the training and test dataframes by creating new features
    within the 'feature_label' column and setting 'valuenum' to 1 for the relevant mode at each time step.

    Parameters:
    train_df (pd.DataFrame): The training DataFrame containing time series data.
    test_df (pd.DataFrame): The test DataFrame containing time series data.
    feature_col (str): The column name that contains feature labels.
    value_col (str): The column name that contains feature values.
    label_col (str): The column name that contains the label.

    Returns:
    pd.DataFrame, pd.DataFrame: The DataFrames with one-hot encoded ventilator mode features.
    """
    # Determine all possible ventilator modes from the training and test data
    all_modes = sorted(list(set(train_df[train_df[feature_col] == 'Ventilator Mode'][value_col].unique()).union(
                            set(test_df[test_df[feature_col] == 'Ventilator Mode'][value_col].unique()))))

    # Encode ventilator mode for train and test data
    train_df_encoded = one_hot_encode_ventilator_mode_with_dummies(train_df, all_modes, feature_col, value_col, label_col)
    test_df_encoded = one_hot_encode_ventilator_mode_with_dummies(test_df, all_modes, feature_col, value_col, label_col)

    return train_df_encoded, test_df_encoded

In [None]:
train_df_encoded, test_df_encoded = one_hot_encode_ventilator_mode(train_df, test_df)

train_df_encoded

Unnamed: 0,time_from_window_start_mins,subject_id,valuenum,feature_label,extubation_failure
0,0 days 00:00:00,10001884,0.384193,Inspired O2 Fraction,1
1,0 days 00:30:00,10001884,0.379341,Inspired O2 Fraction,1
2,0 days 01:00:00,10001884,0.374490,Inspired O2 Fraction,1
3,0 days 01:30:00,10001884,0.369639,Inspired O2 Fraction,1
4,0 days 02:00:00,10001884,0.364788,Inspired O2 Fraction,1
...,...,...,...,...,...
537675,0 days 04:00:00,17923146,0.000000,Ventilator_Mode_71.0,0
537676,0 days 04:30:00,17923146,0.000000,Ventilator_Mode_71.0,0
537677,0 days 05:00:00,17923146,0.000000,Ventilator_Mode_71.0,0
537678,0 days 05:30:00,17923146,0.000000,Ventilator_Mode_71.0,0


In [None]:
print(train_df_encoded['feature_label'].nunique())
print(test_df_encoded['feature_label'].nunique())

23
23


In [None]:
test_df_encoded

Unnamed: 0,time_from_window_start_mins,subject_id,valuenum,feature_label,extubation_failure
0,0 days 00:00:00,10004720,0.545455,O2 saturation pulseoxymetry,1
1,0 days 00:30:00,10004720,0.454545,O2 saturation pulseoxymetry,1
2,0 days 01:00:00,10004720,0.363636,O2 saturation pulseoxymetry,1
3,0 days 01:30:00,10004720,0.500000,O2 saturation pulseoxymetry,1
4,0 days 02:00:00,10004720,0.636364,O2 saturation pulseoxymetry,1
...,...,...,...,...,...
134558,0 days 04:00:00,17910586,0.000000,Ventilator_Mode_71.0,1
134559,0 days 04:30:00,17910586,0.000000,Ventilator_Mode_71.0,1
134560,0 days 05:00:00,17910586,0.000000,Ventilator_Mode_71.0,1
134561,0 days 05:30:00,17910586,0.000000,Ventilator_Mode_71.0,1


In [None]:
# Save progress so far
train_df_encoded.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/04_train_data_num_and_ventilator_mode_filled_and_feature_eng_and_scaled_and_ohe_v3.parquet')
test_df_encoded.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/04_test_data_num_and_ventilator_mode_filled_and_feature_eng_and_scaled_and_ohe_v3.parquet')

In [None]:
# Collect all data for subject id = 10001884
subject_10001884 = train_df_encoded[train_df_encoded['subject_id'] == 10001884]
subject_10001884.to_csv('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/04_train_data_num_and_ventilator_mode_filled_and_feature_eng_and_scaled_and_ohe_subject_10001884_v2.csv')

# **Step 4 - Create sequences for LSTM**

Create 3D arrays for LSTM input in the format (patients, time steps, features).
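
As a minimal sketch of the target layout, the toy example below pivots a small long-format frame into an array of shape (patients, time steps, features). The values and the `toy`, `feature_order` and `toy_array` names are made up for illustration; only the column names mirror the real data.

```python
import numpy as np
import pandas as pd

# Toy long-format frame: 2 patients x 3 time steps x 2 features (values are made up)
toy = pd.DataFrame({
    "subject_id": [1] * 6 + [2] * 6,
    "time_from_window_start_mins": pd.to_timedelta([0, 0, 30, 30, 60, 60] * 2, unit="min"),
    "feature_label": ["Respiratory Rate", "Minute Volume"] * 6,
    "valuenum": np.linspace(0.0, 1.0, 12),
})

# One fixed column order, reused for every patient
feature_order = ["Minute Volume", "Respiratory Rate"]

per_patient = []
for _, group in toy.groupby("subject_id"):
    # Long format -> (time steps x features) table for this patient
    wide = group.pivot(index="time_from_window_start_mins",
                       columns="feature_label", values="valuenum")
    per_patient.append(wide[feature_order].to_numpy())

# Stack the per-patient tables along a new leading axis
toy_array = np.stack(per_patient)
print(toy_array.shape)  # (2, 3, 2)
```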

In [None]:
train_df_encoded_copy = train_df_encoded.copy()
test_df_encoded_copy = test_df_encoded.copy()

In [None]:
train_df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1124240 entries, 0 to 537679
Data columns (total 5 columns):
 #   Column                       Non-Null Count    Dtype          
---  ------                       --------------    -----          
 0   time_from_window_start_mins  1124240 non-null  timedelta64[ns]
 1   subject_id                   1124240 non-null  int64          
 2   valuenum                     1124240 non-null  float64        
 3   feature_label                1124240 non-null  object         
 4   extubation_failure           1124240 non-null  int64          
dtypes: float64(1), int64(2), object(1), timedelta64[ns](1)
memory usage: 51.5+ MB


In [None]:
# For Ventilator_Mode features, remove the trailing .0 from the feature_label
# (anchored regex so only a '.0' at the very end of the label is stripped)
train_df_encoded['feature_label'] = train_df_encoded['feature_label'].str.replace(r'\.0$', '', regex=True)
test_df_encoded['feature_label'] = test_df_encoded['feature_label'].str.replace(r'\.0$', '', regex=True)

train_df_encoded

Unnamed: 0,time_from_window_start_mins,subject_id,valuenum,feature_label,extubation_failure
0,0 days 00:00:00,10001884,0.384193,Inspired O2 Fraction,1
1,0 days 00:30:00,10001884,0.379341,Inspired O2 Fraction,1
2,0 days 01:00:00,10001884,0.374490,Inspired O2 Fraction,1
3,0 days 01:30:00,10001884,0.369639,Inspired O2 Fraction,1
4,0 days 02:00:00,10001884,0.364788,Inspired O2 Fraction,1
...,...,...,...,...,...
537675,0 days 04:00:00,17923146,0.000000,Ventilator_Mode_71,0
537676,0 days 04:30:00,17923146,0.000000,Ventilator_Mode_71,0
537677,0 days 05:00:00,17923146,0.000000,Ventilator_Mode_71,0
537678,0 days 05:30:00,17923146,0.000000,Ventilator_Mode_71,0


In [None]:
train_df_encoded_copy_2 = train_df_encoded.copy()
test_df_encoded_copy_2 = test_df_encoded.copy()

In [None]:
feature_labels = train_df_encoded['feature_label'].unique()
len(feature_labels)

23

In [None]:
feature_labels_test = test_df_encoded['feature_label'].unique()
len(feature_labels_test)

23

In [15]:
# Define sequence creation function that maintains feature names
def create_sequences(data, sequence_length):
    """
    Convert long-format time series data into sequences for LSTM input, including subject IDs.

    Parameters:
    data (pd.DataFrame): Long-format data with columns subject_id,
        time_from_window_start_mins, feature_label, valuenum and extubation_failure.
    sequence_length (int): The number of time steps in each sequence.

    Returns:
    np.array: Sequences with shape (n_sequences, sequence_length, n_features).
    np.array: One label per sequence.
    list: Subject ID corresponding to each sequence.

    Note: feature columns follow the order in which labels first appear in `data`,
    so two DataFrames with the same set of labels can still yield different column
    orders. Align the feature axis before comparing arrays from different DataFrames.
    """
    sequences = []
    labels = []
    subject_ids = []

    # Extract unique feature labels (in order of first appearance in `data`)
    feature_labels = data['feature_label'].unique()

    # Group data by patient
    grouped = data.groupby('subject_id')

    for subject_id, group in grouped:
        # Ensure data is sorted by time
        group = group.sort_values(by='time_from_window_start_mins')

        # Pivot the long-format rows into a (time steps x features) table
        pivoted_data = group.pivot(index='time_from_window_start_mins', columns='feature_label', values='valuenum')

        # Ensure the pivoted data has the correct order of columns
        pivoted_data = pivoted_data[feature_labels]

        # extubation_failure is a patient-level outcome, so read it once per patient
        # (assumes the value is constant across all rows of a subject)
        label = group['extubation_failure'].iloc[0]

        # Create sequences with a sliding window over the time steps
        for i in range(len(pivoted_data) - sequence_length + 1):
            sequence = pivoted_data.iloc[i:i + sequence_length]
            sequences.append(sequence.values)
            labels.append(label)
            subject_ids.append(subject_id)

    return np.array(sequences), np.array(labels), subject_ids

In [12]:
sequence_length = 13

In [13]:
output_dir = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/final_sequences/'

In [None]:
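train_sequences, train_labels, train_subject_ids = create_sequences(train_df_encoded, sequence_length)
test_sequences, test_labels, test_subject_ids = create_sequences(test_df_encoded, sequence_length)

# create_sequences orders features by first appearance in each DataFrame, and that order
# differs between train and test, so permute the test feature axis to match training
test_col_idx = [list(feature_labels_test).index(name) for name in feature_labels]
test_sequences = test_sequences[:, :, test_col_idx]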

print("Shape of training sequences:" , train_sequences.shape)
print("Shape of training labels:" , train_labels.shape)
print("Shape of testing sequences:" , test_sequences.shape)
print("Shape of testing labels:" , test_labels.shape)

# Save the sequences to the drive for model input
np.save(output_dir + 'train_sequences_v2.npy', train_sequences)
np.save(output_dir + 'train_labels_v2.npy', train_labels)
np.save(output_dir + 'test_sequences_v2.npy', test_sequences)
np.save(output_dir + 'test_labels_v2.npy', test_labels)

Shape of training sequences: (3760, 13, 23)
Shape of training labels: (3760,)
Shape of testing sequences: (941, 13, 23)
Shape of testing labels: (941,)


In [None]:
# List the first few train_subject_ids
print(train_subject_ids[:5])

# List the first few test_subject_ids
print(test_subject_ids[:5])

[10001884, 10002428, 10004235, 10010867, 10011365]
[10004720, 10004733, 10005817, 10013643, 10022620]
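
Since the aim of this rerun is to rule out leakage, one quick check is that no patient appears in both splits. The sketch below is illustrative: `overlapping_subjects` is a hypothetical helper, and the demo lists reuse only the five IDs per split printed above. In the notebook it would be called on `train_subject_ids` and `test_subject_ids`.

```python
import numpy as np

def overlapping_subjects(train_ids, test_ids):
    """Return the sorted subject IDs that appear in both splits."""
    train_set = set(np.asarray(train_ids).tolist())
    test_set = set(np.asarray(test_ids).tolist())
    return sorted(train_set & test_set)

# Illustrative IDs only: the first five of each split
demo_train = [10001884, 10002428, 10004235, 10010867, 10011365]
demo_test = [10004720, 10004733, 10005817, 10013643, 10022620]

print(overlapping_subjects(demo_train, demo_test))               # []
# Deliberately add a training ID to the test list to show a detected overlap
print(overlapping_subjects(demo_train, demo_test + [10001884]))  # [10001884]
```

An empty list means the two splits share no patients.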


In [None]:
# Save train and test subject ids to align static data later
output_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/final_sequences/'
np.save(output_path + 'train_subject_ids_v2.npy', train_subject_ids)
np.save(output_path + 'test_subject_ids_v2.npy', test_subject_ids)
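
The saved subject IDs record the order in which sequences were emitted. A rough sketch of how a static table could later be put into that order follows. The `static_df` frame and its `static_feature` column are placeholders, not the project's actual static data.

```python
import numpy as np
import pandas as pd

# Hypothetical static table; every column except subject_id is a placeholder
static_df = pd.DataFrame({
    "subject_id": [10004235, 10001884, 10002428],
    "static_feature": [0.3, 0.1, 0.2],
})

# Order in which the sequences were emitted (in the notebook: the saved subject ID array)
sequence_subject_ids = np.array([10001884, 10002428, 10004235])

# Reindex so that row i of the static data describes the same patient as sequence i;
# .loc raises a KeyError if any subject ID is missing from the static table
aligned = static_df.set_index("subject_id").loc[sequence_subject_ids].reset_index()

# Confirm the row order now matches the sequence order exactly
assert (aligned["subject_id"].to_numpy() == sequence_subject_ids).all()
print(aligned)
```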

In [None]:
# Get the feature names from the training and test data
train_feature_names = train_df_encoded['feature_label'].unique()
test_feature_names = test_df_encoded['feature_label'].unique()

# Ensure both sets of feature names are the same
if set(train_feature_names) == set(test_feature_names):
    print("The feature names in the training and test data are the same.")
else:
    print("The feature names in the training and test data are different.")


The feature names in the training and test data are the same.


In [None]:
# Save the features
output_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/final_sequences/'
np.save(output_path + 'feature_names_vF.npy', train_feature_names)


This data can now be used as input into the LSTM model.

# **Addendum**

Having spoken to Dr Murali, we have decided to remove Ventilator Mode as a feature from the data.

Individual ventilation modes are likely not that helpful. Segmenting them into control vs support modes would be useful, but this is not easy without full domain knowledge of the MIMIC study and the ventilators used. Patients are mainly extubated once they have been trialled on a support mode, but it is not possible to discern this from the raw data.

As such we will remove all data relating to Ventilator Mode from the datasets.

In [5]:
# Load data prior to sequence creation
train_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/incremental_datasets/04_train_data_num_and_ventilator_mode_filled_and_feature_eng_and_scaled_and_ohe_v3.parquet'
test_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/incremental_datasets/04_test_data_num_and_ventilator_mode_filled_and_feature_eng_and_scaled_and_ohe_v3.parquet'

train_df = pd.read_parquet(train_path)
test_df = pd.read_parquet(test_path)

train_df

Unnamed: 0,time_from_window_start_mins,subject_id,valuenum,feature_label,extubation_failure
0,0 days 00:00:00,10001884,0.384193,Inspired O2 Fraction,1
1,0 days 00:30:00,10001884,0.379341,Inspired O2 Fraction,1
2,0 days 01:00:00,10001884,0.374490,Inspired O2 Fraction,1
3,0 days 01:30:00,10001884,0.369639,Inspired O2 Fraction,1
4,0 days 02:00:00,10001884,0.364788,Inspired O2 Fraction,1
...,...,...,...,...,...
537675,0 days 04:00:00,17923146,0.000000,Ventilator_Mode_71.0,0
537676,0 days 04:30:00,17923146,0.000000,Ventilator_Mode_71.0,0
537677,0 days 05:00:00,17923146,0.000000,Ventilator_Mode_71.0,0
537678,0 days 05:30:00,17923146,0.000000,Ventilator_Mode_71.0,0


In [6]:
# Remove all rows where feature_label starts with Ventilator_Mode...
train_df = train_df[~train_df['feature_label'].str.startswith('Ventilator_Mode')]
test_df = test_df[~test_df['feature_label'].str.startswith('Ventilator_Mode')]

train_df

Unnamed: 0,time_from_window_start_mins,subject_id,valuenum,feature_label,extubation_failure
0,0 days 00:00:00,10001884,0.384193,Inspired O2 Fraction,1
1,0 days 00:30:00,10001884,0.379341,Inspired O2 Fraction,1
2,0 days 01:00:00,10001884,0.374490,Inspired O2 Fraction,1
3,0 days 01:30:00,10001884,0.369639,Inspired O2 Fraction,1
4,0 days 02:00:00,10001884,0.364788,Inspired O2 Fraction,1
...,...,...,...,...,...
635435,0 days 04:00:00,17923146,0.292706,P:F ratio,0
635436,0 days 04:30:00,17923146,0.262863,P:F ratio,0
635437,0 days 05:00:00,17923146,0.237820,P:F ratio,0
635438,0 days 05:30:00,17923146,0.216505,P:F ratio,0


In [7]:
# Analyse the unique features in the data
train_df['feature_label'].unique()

array(['Inspired O2 Fraction', 'Tidal Volume (observed)',
       'Tidal Volume (spontaneous)', 'Minute Volume',
       'Peak Insp. Pressure', 'Respiratory Rate',
       'O2 saturation pulseoxymetry', 'Arterial O2 pressure',
       'Arterial CO2 Pressure', 'PH (Arterial)', 'SpO2:FiO2', 'P:F ratio'],
      dtype=object)

In [8]:
test_df['feature_label'].unique()

array(['O2 saturation pulseoxymetry', 'Inspired O2 Fraction',
       'Tidal Volume (observed)', 'Tidal Volume (spontaneous)',
       'Minute Volume', 'Peak Insp. Pressure', 'Respiratory Rate',
       'Arterial O2 pressure', 'Arterial CO2 Pressure', 'PH (Arterial)',
       'SpO2:FiO2', 'P:F ratio'], dtype=object)

In [9]:
# Save progress
train_df.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/incremental_datasets/05_train_data_no_ventilator_mode.parquet')
test_df.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/incremental_datasets/05_test_data_no_ventilator_mode.parquet')

Now that Ventilator Mode has been removed we can create our LSTM sequences.

In [11]:
# Get the feature names from the training and test data
train_feature_names = train_df['feature_label'].unique()
test_feature_names = test_df['feature_label'].unique()

# Ensure both sets of feature names are the same
if set(train_feature_names) == set(test_feature_names):
    print("The feature names in the training and test data are the same.")
else:
    print("The feature names in the training and test data are different.")

The feature names in the training and test data are the same.
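
The set comparison confirms that both splits contain the same labels, but not that the labels come in the same order. The two `unique()` outputs above start with different features. The sketch below uses toy label lists and a toy array (all names are illustrative) to show an order-sensitive check and one way to permute a feature axis into a reference order.

```python
import numpy as np

# Toy label orders that contain the same names in a different order
train_order = ["Inspired O2 Fraction", "Minute Volume", "Respiratory Rate"]
test_order = ["Respiratory Rate", "Inspired O2 Fraction", "Minute Volume"]

same_set = set(train_order) == set(test_order)
same_order = list(train_order) == list(test_order)
print(same_set, same_order)  # True False

# Toy test array with shape (patients, time steps, features), columns in test_order
toy_test = np.arange(6, dtype=float).reshape(1, 2, 3)

# Permute the feature axis so that it follows train_order
col_idx = [test_order.index(name) for name in train_order]
toy_test_aligned = toy_test[:, :, col_idx]
print(toy_test_aligned[0, 0])  # [1. 2. 0.]
```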


In [16]:
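train_sequences_2, train_labels_2, train_subject_ids_2 = create_sequences(train_df, sequence_length)
test_sequences_2, test_labels_2, test_subject_ids_2 = create_sequences(test_df, sequence_length)

# create_sequences orders features by first appearance in each DataFrame, and that order
# differs between train and test, so permute the test feature axis to match training
test_col_idx_2 = [list(test_feature_names).index(name) for name in train_feature_names]
test_sequences_2 = test_sequences_2[:, :, test_col_idx_2]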

print("Shape of training sequences:" , train_sequences_2.shape)
print("Shape of training labels:" , train_labels_2.shape)
print("Shape of testing sequences:" , test_sequences_2.shape)
print("Shape of testing labels:" , test_labels_2.shape)

Shape of training sequences: (3760, 13, 12)
Shape of training labels: (3760,)
Shape of testing sequences: (941, 13, 12)
Shape of testing labels: (941,)
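
To see whether the outcome is similarly distributed across the two splits, the class proportions of the label arrays can be compared. This is a small sketch with made-up labels. `class_balance` is a hypothetical helper that would be called on `train_labels_2` and `test_labels_2` in the notebook.

```python
import numpy as np

def class_balance(labels):
    """Return {class: fraction} for a 1-D array of integer class labels."""
    values, counts = np.unique(np.asarray(labels), return_counts=True)
    total = counts.sum()
    return {int(v): float(c / total) for v, c in zip(values, counts)}

# Made-up labels purely to show the call
demo_train_labels = np.array([0, 0, 0, 1])
demo_test_labels = np.array([0, 1])

print(class_balance(demo_train_labels))  # {0: 0.75, 1: 0.25}
print(class_balance(demo_test_labels))   # {0: 0.5, 1: 0.5}
```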


In [17]:
# Save the output sequences
output_dir = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/final_sequences/'

np.save(output_dir + 'train_sequences_v3.npy', train_sequences_2)
np.save(output_dir + 'train_labels_v3.npy', train_labels_2)
np.save(output_dir + 'test_sequences_v3.npy', test_sequences_2)
np.save(output_dir + 'test_labels_v3.npy', test_labels_2)

In [18]:
# Save the feature names
output_dir = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/final_sequences/'

np.save(output_dir + 'feature_names_v3.npy', train_feature_names)
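
One thing to keep in mind when reading the feature names back: the `unique()` outputs above are `dtype=object`, and NumPy stores object arrays via pickle, so `np.load` needs `allow_pickle=True`. The sketch below uses a temporary directory and toy names rather than the real paths. It shows this behaviour, along with the alternative of casting to a string dtype before saving.

```python
import os
import tempfile
import numpy as np

# Toy object-dtype array standing in for the saved feature names
feature_names = np.array(["Minute Volume", "Respiratory Rate"], dtype=object)

with tempfile.TemporaryDirectory() as tmp:
    obj_path = os.path.join(tmp, "feature_names_obj.npy")
    str_path = os.path.join(tmp, "feature_names_str.npy")

    np.save(obj_path, feature_names)              # object array, stored via pickle
    np.save(str_path, feature_names.astype(str))  # fixed-width unicode, no pickle

    try:
        np.load(obj_path)  # allow_pickle defaults to False
        loaded_without_pickle = True
    except ValueError:
        loaded_without_pickle = False

    from_obj = np.load(obj_path, allow_pickle=True)
    from_str = np.load(str_path)

print(loaded_without_pickle)  # False
print(from_str.tolist())      # ['Minute Volume', 'Respiratory Rate']
```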

In [19]:
# Save the train and test subject ids
output_dir = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/preprocessing_run_2/final_sequences/'

np.save(output_dir + 'train_subject_ids_v3.npy', train_subject_ids_2)
np.save(output_dir + 'test_subject_ids_v3.npy', test_subject_ids_2)