# **LSTM preprocessing for feature set 2 dynamic data**

We can now preprocess our train and test data in preparation for input into our LSTM architecture.

The pre-processing steps are identical to those applied to the data from feature set 1. The only difference is that we have three categorical features, which cannot be filled with linear interpolation. For these we apply a simple forward fill (followed by a backward fill) so that the values remain integers rather than floats.
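As a minimal sketch of why the categorical scores need different treatment, the toy series below (hypothetical values, not taken from the dataset) compares linear interpolation with a forward and backward fill on an integer-valued score.

```python
import numpy as np
import pandas as pd

# Hypothetical integer score observed only at the first and last time steps
score = pd.Series([4.0, np.nan, np.nan, 2.0])

# Linear interpolation produces fractional values that do not exist on the scale
interpolated = score.interpolate(method="linear")

# Forward fill (then backward fill for any leading gap) reuses observed values only
filled = score.ffill().bfill()

print(interpolated.round(2).tolist())
print(filled.tolist())
```

The filled series only ever contains values that were actually recorded, which is the property we want for GCS and RAS scores.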

We will be splitting our features into subsets based on average sampling frequency as carried out previously.


In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Load the train and test data
train_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/02_feature_set_2/03_train_data_f2_outliers_removed.parquet'
test_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/02_feature_set_2/03_test_data_f2_outliers_removed.parquet'
train_df = pd.read_parquet(train_path)
test_df = pd.read_parquet(test_path)

train_df.head()

Unnamed: 0,subject_id,valuenum,time_from_window_start_mins,feature_label,extubation_failure
0,10001884,40.0,200.0,Inspired O2 Fraction,1
1,10001884,,200.0,Tidal Volume (observed),1
2,10001884,,200.0,Tidal Volume (spontaneous),1
3,10001884,6.1,200.0,Minute Volume,1
4,10001884,17.0,200.0,Peak Insp. Pressure,1


In [4]:
train_df = train_df.rename(columns={'time_from_window_start': 'time_from_window_start_mins'})
test_df = test_df.rename(columns={'time_from_window_start': 'time_from_window_start_mins'})

train_df.head()

Unnamed: 0,subject_id,valuenum,time_from_window_start_mins,feature_label,extubation_failure
0,10001884,40.0,200.0,Inspired O2 Fraction,1
1,10001884,,200.0,Tidal Volume (observed),1
2,10001884,,200.0,Tidal Volume (spontaneous),1
3,10001884,6.1,200.0,Minute Volume,1
4,10001884,17.0,200.0,Peak Insp. Pressure,1


In [5]:
train_df_copy = train_df.copy()
test_df_copy = test_df.copy()

### **Step 1 - Determine the subset groups**

The train and test data need to be resampled to consistent time intervals.

We will apply a bespoke sampling frequency to each feature subset we extract.

As before, we will split the features into three subsets: low, medium and high frequency.




In [6]:
# Group by subject_id and feature_label to count the number of samples for each combination
per_patient_sampling_frequency = train_df.groupby(['subject_id', 'feature_label']).size().reset_index(name='count')

per_patient_sampling_frequency_pivot = per_patient_sampling_frequency.pivot(index='subject_id', columns='feature_label', values='count').fillna(0)

# Calculate the average sampling frequency per feature
average_sampling_frequency = per_patient_sampling_frequency_pivot.mean().sort_values(ascending=False)

# Create columns for the table
average_sampling_frequency_df = pd.DataFrame({'Feature': average_sampling_frequency.index, 'Average Train Set Sampling Frequency': average_sampling_frequency.values})

# Display the table
average_sampling_frequency_df

Unnamed: 0,Feature,Average Train Set Sampling Frequency
0,Heart Rate,6.610638
1,O2 saturation pulseoxymetry,6.597074
2,Respiratory Rate,6.578989
3,Arterial Blood Pressure mean,3.75133
4,Arterial Blood Pressure diastolic,3.746277
5,Arterial Blood Pressure systolic,3.745745
6,Inspired O2 Fraction,2.106383
7,GCS - Eye Opening,1.631383
8,GCS - Motor Response,1.627128
9,Tidal Volume (observed),1.616755


Rarely observed features have already been removed, leaving 19 time series features to group. **Note: the frequencies are based on the training data only.**
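One detail worth keeping in mind when reading these averages: because the pivot is filled with zeros, patients with no samples for a feature still contribute a count of 0 to its mean. The sketch below uses a hypothetical two-patient frame to show how much this can change the result compared with averaging over observed patients only.

```python
import pandas as pd

# Toy long-format data: patient 1 has two Heart Rate samples and one pH sample,
# patient 2 has one Heart Rate sample and no pH samples at all
toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 2],
    "feature_label": ["Heart Rate", "Heart Rate", "PH (Arterial)", "Heart Rate"],
})

counts = toy.groupby(["subject_id", "feature_label"]).size().reset_index(name="count")
pivot = counts.pivot(index="subject_id", columns="feature_label", values="count")

# Same approach as the notebook: absent patients count as 0 samples
mean_with_zeros = pivot.fillna(0).mean()

# Alternative for comparison: NaNs are skipped, so only observed patients count
mean_observed_only = pivot.mean()

print(mean_with_zeros)
print(mean_observed_only)
```

Counting absent patients as zero lowers the average for sparsely recorded features, which is what pushes the arterial blood gas features towards the low frequency subset.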

The features will be split as follows, **using the same thresholds as before**:

- Low Frequency Subset (frequency < 1 in 6 hours)
- Medium Frequency Subset (1 ≤ frequency < 3 in 6 hours)
- High Frequency Subset (frequency ≥ 3 in 6 hours)

This facilitates creating subsets where each set can be resampled at a bespoke rate that better considers the actual sampling frequency of the real data.
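The banding rule can also be sketched with `pd.cut`, which makes the handling of the boundary values explicit. The feature names and averages below are hypothetical placeholders chosen to sit on or near the thresholds.

```python
import numpy as np
import pandas as pd

# Hypothetical average sampling frequencies (samples per 6-hour window)
avg_freq = pd.Series({
    "feature_a": 0.4,
    "feature_b": 1.0,  # exactly on the low/medium boundary
    "feature_c": 2.9,
    "feature_d": 3.0,  # exactly on the medium/high boundary
})

# right=False gives left-closed bins: [0, 1), [1, 3), [3, inf)
bands = pd.cut(avg_freq, bins=[0, 1, 3, np.inf],
               labels=["low", "medium", "high"], right=False)

subsets = {band: bands[bands == band].index.tolist()
           for band in ["low", "medium", "high"]}
print(subsets)
```

With left-closed bins, a feature averaging exactly 1 falls in the medium subset and a feature averaging exactly 3 falls in the high subset.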

This splits the data as follows:

In [7]:
low_frequency_features = average_sampling_frequency_df[average_sampling_frequency_df['Average Train Set Sampling Frequency'] < 1]['Feature'].tolist()
medium_frequency_features = average_sampling_frequency_df[(average_sampling_frequency_df['Average Train Set Sampling Frequency'] >= 1) & (average_sampling_frequency_df['Average Train Set Sampling Frequency'] < 3)]['Feature'].tolist()
high_frequency_features = average_sampling_frequency_df[average_sampling_frequency_df['Average Train Set Sampling Frequency'] >= 3]['Feature'].tolist()

print(f"Low frequency features: {low_frequency_features}")
print(f"Medium frequency features: {medium_frequency_features}")
print(f"High frequency features: {high_frequency_features}")

Low frequency features: ['PH (Arterial)', 'Arterial O2 pressure', 'Arterial CO2 Pressure']
Medium frequency features: ['Inspired O2 Fraction', 'GCS - Eye Opening', 'GCS - Motor Response', 'Tidal Volume (observed)', 'Minute Volume', 'Mean Airway Pressure', 'Peak Insp. Pressure', 'Temperature Fahrenheit', 'Tidal Volume (spontaneous)', 'Richmond-RAS Scale']
High frequency features: ['Heart Rate', 'O2 saturation pulseoxymetry', 'Respiratory Rate', 'Arterial Blood Pressure mean', 'Arterial Blood Pressure diastolic', 'Arterial Blood Pressure systolic']


### **Step 2: Split the train and test data into three subsets**

In [8]:
def filter_features(df, feature_list):
    """
    Filters the dataframe to include only rows where the feature_label column matches one of the features in the feature_list.

    Parameters:
    df (pd.DataFrame): The input dataframe containing patient data.
    feature_list (list): A list of feature names to filter by.

    Returns:
    pd.DataFrame: A filtered dataframe containing only the specified features.
    """
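    # Return an explicit copy so that later column assignments on the subset
    # do not trigger pandas' SettingWithCopyWarning
    filtered_df = df[df['feature_label'].isin(feature_list)].copy()
    return filtered_df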

In [9]:
# Extract the low frequency data
low_frequency_train_df = filter_features(train_df, low_frequency_features)
low_frequency_test_df = filter_features(test_df, low_frequency_features)

# Check the unique feature_labels in the data
low_frequency_train_df['feature_label'].unique()

array(['Arterial O2 pressure', 'Arterial CO2 Pressure', 'PH (Arterial)'],
      dtype=object)

In [10]:
# Extract the medium frequency data
medium_frequency_train_df = filter_features(train_df, medium_frequency_features)
medium_frequency_test_df = filter_features(test_df, medium_frequency_features)

# Check the unique feature_labels in the data
medium_frequency_train_df['feature_label'].unique()

array(['Inspired O2 Fraction', 'Tidal Volume (observed)',
       'Tidal Volume (spontaneous)', 'Minute Volume',
       'Peak Insp. Pressure', 'Mean Airway Pressure', 'GCS - Eye Opening',
       'GCS - Motor Response', 'Richmond-RAS Scale',
       'Temperature Fahrenheit'], dtype=object)

In [11]:
# Extract the high frequency data
high_frequency_train_df = filter_features(train_df, high_frequency_features)
high_frequency_test_df = filter_features(test_df, high_frequency_features)

# Check the unique feature_labels in the data
high_frequency_train_df['feature_label'].unique()

array(['Heart Rate', 'Respiratory Rate', 'O2 saturation pulseoxymetry',
       'Arterial Blood Pressure systolic',
       'Arterial Blood Pressure diastolic',
       'Arterial Blood Pressure mean'], dtype=object)

In [12]:
# Save the data
data_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/02_feature_set_2/feature_subsets'
low_frequency_train_df.to_parquet(f'{data_path}/01_low_frequency_train_df.parquet', index=False)
low_frequency_test_df.to_parquet(f'{data_path}/01_low_frequency_test_df.parquet', index=False)
medium_frequency_train_df.to_parquet(f'{data_path}/02_medium_frequency_train_df.parquet', index=False)
medium_frequency_test_df.to_parquet(f'{data_path}/02_medium_frequency_test_df.parquet', index=False)
high_frequency_train_df.to_parquet(f'{data_path}/03_high_frequency_train_df.parquet', index=False)
high_frequency_test_df.to_parquet(f'{data_path}/03_high_frequency_test_df.parquet', index=False)

**Count the number of patients in each subset**

In [12]:
# Number of patients in the original data
print(f"Number of patients in the original train data: {train_df['subject_id'].nunique()}")
print(f"Number of patients in the original test data: {test_df['subject_id'].nunique()}")
# Number of unique subject ids
print(f"Number of unique subject ids in low frequency train data: {low_frequency_train_df['subject_id'].nunique()}")
print(f"Number of unique subject ids in low frequency test data: {low_frequency_test_df['subject_id'].nunique()}")
print(f"Number of unique subject ids in medium frequency train data: {medium_frequency_train_df['subject_id'].nunique()}")
print(f"Number of unique subject ids in medium frequency test data: {medium_frequency_test_df['subject_id'].nunique()}")
print(f"Number of unique subject ids in high frequency train data: {high_frequency_train_df['subject_id'].nunique()}")
print(f"Number of unique subject ids in high frequency test data: {high_frequency_test_df['subject_id'].nunique()}")

Number of patients in the original train data: 3760
Number of patients in the original test data: 941
Number of unique subject ids in low frequency train data: 1399
Number of unique subject ids in low frequency test data: 350
Number of unique subject ids in medium frequency train data: 3758
Number of unique subject ids in medium frequency test data: 940
Number of unique subject ids in high frequency train data: 3760
Number of unique subject ids in high frequency test data: 940


As before, we will need to ensure that all patients are reflected across all datasets.

We will need to handle where patients are not represented in certain feature subsets and where patients are represented but not across all features in that subset.
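A quick way to see which patients are missing from a subset is a set difference on the patient identifiers. The sketch below uses hypothetical IDs and a toy low frequency subset; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical patient IDs in the original data
original_ids = pd.Index([101, 102, 103, 104])

# Toy subset in which patients 102 and 104 have no rows at all
low_subset = pd.DataFrame({
    "subject_id": [101, 101, 103],
    "feature_label": ["PH (Arterial)", "Arterial O2 pressure", "PH (Arterial)"],
})

# Patients present in the original data but absent from the subset
missing_ids = original_ids.difference(low_subset["subject_id"].unique())
print(list(missing_ids))
```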

### **Step 3: Resample and Interpolate the data to bespoke frequencies**

The resampling rate for this study will be the following:
- Low frequency: every 2 hours (seq_length = 4)
- Medium frequency: every 1 hour (seq_length = 7)
- High frequency: every 30 mins (seq_length = 13)
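The sequence lengths follow from the six-hour (360 minute) window with both endpoints included. A minimal sketch of the arithmetic, where `sequence_length` is a hypothetical helper name:

```python
import pandas as pd

WINDOW_MINUTES = 360  # six-hour observation window

def sequence_length(interval_minutes, window_minutes=WINDOW_MINUTES):
    """Number of time steps when both window endpoints are included."""
    return window_minutes // interval_minutes + 1

grids = {}
for name, interval in {"low": 120, "medium": 60, "high": 30}.items():
    # Build the regular grid explicitly to confirm the count
    grids[name] = pd.timedelta_range(start="0min", end=f"{WINDOW_MINUTES}min",
                                     freq=f"{interval}min")
    print(name, sequence_length(interval), len(grids[name]))
```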

**Handle where there are missing patients compared to the original data**

Some patients do not have any data for select features and therefore do not appear in the derived subsets.

***Decision: Do we make each subset have the full patient dataset or just use the patients where there is data present?***

Passing the full patient population through each LSTM model would involve significant data synthesis, especially for the low frequency subset, where data would essentially need to be created from scratch for the patients who are not present. This would introduce noise and bias, forcing the models to learn from artificial patterns that may not exist in the real data, and could lead to overfitting.

**As such, we will only train the models on the patient data available and not synthetically add patients if they are not present.**

***Decision: How to handle where patients have data for one feature but no data for other features***

We want to avoid synthesising data as much as possible and therefore need to find a way to handle missing data for specific features.

As such, we will implement a masking layer system where the model is told to ignore a feature for any patient in whom it is not present.

Where a patient has no data for a specific feature, we will not impute that data but will set all time steps for that feature to NaN. This NaN data will then be masked during model training.
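How the mask is consumed by the model is not shown in this section, so the following is only a sketch of one common approach: record where values are observed, then replace the NaNs with a sentinel. The array shape, the values and `MASK_VALUE` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical batch: 2 patients, 4 time steps, 2 features.
# Patient 2 has no data at all for the first feature.
X = np.array([
    [[7.40, 95.0], [7.38, 97.0], [7.41, 96.0], [7.39, 98.0]],
    [[np.nan, 90.0], [np.nan, 91.0], [np.nan, 92.0], [np.nan, 93.0]],
])

# True where a real value exists, False where the feature should be ignored
observed = ~np.isnan(X)

# Assumed sentinel; it should lie outside the range of the scaled features
MASK_VALUE = -1.0
X_filled = np.where(observed, X, MASK_VALUE)

print(observed[1, :, 0])
print(X_filled[1, :, 0])
```

Keeping `observed` alongside `X_filled` means the original missingness pattern is never lost, whichever masking mechanism the model ends up using.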

***Decision: How to handle categorical features***

Categorical features refer to scores (GCS and RAS).

Where a patient has at least one data point for a categorical feature, we will forward fill and then backward fill so that each time step holds a value that was actually observed for that patient.

If there is no value for that patient, we will fill the data with NaNs as before and mask them later in model training.
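The two cases can be sketched on a toy frame with hypothetical values: one patient with a single recorded score and one patient with none. Filling within each patient group keeps values from leaking between patients.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2, 2],
    "step": [0, 1, 2, 0, 1, 2],
    # Patient 1 has one observed score; patient 2 has none
    "valuenum": [np.nan, 3.0, np.nan, np.nan, np.nan, np.nan],
})

# Forward fill then backward fill within each patient only
toy["filled"] = (
    toy.groupby("subject_id")["valuenum"]
       .transform(lambda s: s.ffill().bfill())
)
print(toy)
```

Patient 1 ends up with a constant score, while patient 2 stays all NaN and is left for masking.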

**Count the number of patients who do not have data for all features in data subsets**

In [13]:
def count_incomplete_patients(df, feature_labels):
    """
    Counts the number of patients in the dataframe who do not have data for all specified feature labels.

    Parameters:
    df (pd.DataFrame): The input dataframe containing patient data.
    feature_labels (list): A list of feature labels to check for completeness.

    Returns:
    int: The number of patients without complete data for all specified feature labels.
    """
    # Group by patient (assuming 'subject_id' identifies patients)
    patient_groups = df.groupby('subject_id')

    incomplete_count = 0

    # Iterate over each patient group
    for patient_id, group in patient_groups:
        # Check if the patient has data for all feature labels
        if not all(feature in group['feature_label'].values for feature in feature_labels):
            incomplete_count += 1

    return incomplete_count

In [14]:
# Low frequency features
low_frequency_incomplete_count_train = count_incomplete_patients(low_frequency_train_df, low_frequency_features)
print(f"Number of train patients with incomplete data for low frequency features: {low_frequency_incomplete_count_train}")
low_frequency_incomplete_count_test = count_incomplete_patients(low_frequency_test_df, low_frequency_features)
print(f"Number of test patients with incomplete data for low frequency features: {low_frequency_incomplete_count_test}")
# Medium frequency features
medium_frequency_incomplete_count_train = count_incomplete_patients(medium_frequency_train_df, medium_frequency_features)
print(f"Number of train patients with incomplete data for medium frequency features: {medium_frequency_incomplete_count_train}")
medium_frequency_incomplete_count_test = count_incomplete_patients(medium_frequency_test_df, medium_frequency_features)
print(f"Number of test patients with incomplete data for medium frequency features: {medium_frequency_incomplete_count_test}")
# High frequency features
high_frequency_incomplete_count_train = count_incomplete_patients(high_frequency_train_df, high_frequency_features)
print(f"Number of train patients with incomplete data for high frequency features: {high_frequency_incomplete_count_train}")
high_frequency_incomplete_count_test = count_incomplete_patients(high_frequency_test_df, high_frequency_features)
print(f"Number of test patients with incomplete data for high frequency features: {high_frequency_incomplete_count_test}")

Number of train patients with incomplete data for low frequency features: 12
Number of test patients with incomplete data for low frequency features: 1
Number of train patients with incomplete data for medium frequency features: 1638
Number of test patients with incomplete data for medium frequency features: 416
Number of train patients with incomplete data for high frequency features: 1629
Number of test patients with incomplete data for high frequency features: 414


A large number of patients have incomplete data, which supports the decision to mask data where it is not available.
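For larger datasets, the same count can be obtained without a Python loop over patients. The function below is a sketch of an equivalent vectorised check; its name and the toy data are hypothetical.

```python
import pandas as pd

def count_incomplete_patients_vectorised(df, feature_labels):
    """Count patients who lack rows for at least one of the listed features."""
    wanted = set(feature_labels)
    present = df[df["feature_label"].isin(wanted)]
    # Number of distinct listed features each patient has rows for
    per_patient = present.groupby("subject_id")["feature_label"].nunique()
    # Patients with no rows for any listed feature get a count of 0
    per_patient = per_patient.reindex(df["subject_id"].unique(), fill_value=0)
    return int((per_patient < len(wanted)).sum())

toy = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 3],
    "feature_label": ["A", "B", "A", "B", "B"],
})
n_incomplete = count_incomplete_patients_vectorised(toy, ["A", "B"])
print(n_incomplete)
```

As with the loop version, a patient counts as complete when a row exists for every feature, regardless of whether `valuenum` in that row is NaN.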

**Convert time to timedelta format**

In [15]:
# Convert the time_from_window_start_mins to timedelta in mins for all dataframes
low_frequency_train_df['time_from_window_start_mins'] = pd.to_timedelta(low_frequency_train_df['time_from_window_start_mins'], unit='m')
low_frequency_test_df['time_from_window_start_mins'] = pd.to_timedelta(low_frequency_test_df['time_from_window_start_mins'], unit='m')
medium_frequency_train_df['time_from_window_start_mins'] = pd.to_timedelta(medium_frequency_train_df['time_from_window_start_mins'], unit='m')
medium_frequency_test_df['time_from_window_start_mins'] = pd.to_timedelta(medium_frequency_test_df['time_from_window_start_mins'], unit='m')
high_frequency_train_df['time_from_window_start_mins'] = pd.to_timedelta(high_frequency_train_df['time_from_window_start_mins'], unit='m')
high_frequency_test_df['time_from_window_start_mins'] = pd.to_timedelta(high_frequency_test_df['time_from_window_start_mins'], unit='m')

low_frequency_train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5928 entries, 121 to 248375
Data columns (total 5 columns):
 #   Column                       Non-Null Count  Dtype          
---  ------                       --------------  -----          
 0   subject_id                   5928 non-null   int64          
 1   valuenum                     5847 non-null   float64        
 2   time_from_window_start_mins  5928 non-null   timedelta64[ns]
 3   feature_label                5928 non-null   object         
 4   extubation_failure           5928 non-null   int64          
dtypes: float64(1), int64(2), object(1), timedelta64[ns](1)
memory usage: 277.9+ KB



In [17]:
low_frequency_train_df.head()

Unnamed: 0,subject_id,valuenum,time_from_window_start_mins,feature_label,extubation_failure
121,10002428,127.0,0 days 05:42:00,Arterial O2 pressure,0
122,10002428,43.0,0 days 05:42:00,Arterial CO2 Pressure,0
123,10002428,7.43,0 days 05:42:00,PH (Arterial),0
151,10004235,97.0,0 days 00:22:00,Arterial O2 pressure,1
152,10004235,38.0,0 days 00:22:00,Arterial CO2 Pressure,1


In [16]:
low_frequency_train_copy = low_frequency_train_df.copy()
low_frequency_test_copy = low_frequency_test_df.copy()
medium_frequency_train_copy = medium_frequency_train_df.copy()
medium_frequency_test_copy = medium_frequency_test_df.copy()
high_frequency_train_copy = high_frequency_train_df.copy()
high_frequency_test_copy = high_frequency_test_df.copy()

### **Step 4: Applying resampling and interpolation logic**

We will apply the same resampling and interpolation logic to the numerical values as was applied to the feature set 1 data.

The categorical values are handled alongside the numerical ones but, as described above, are forward and backward filled where data is present and masked where it is not.
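The numerical path can be sketched for a single patient and feature: upsample to a one-minute grid, interpolate linearly, then keep only the target time steps. The observation times and values are hypothetical, and the `'min'` frequency alias is used here in place of `'T'`.

```python
import numpy as np
import pandas as pd

# Hypothetical observations at 0, 90 and 360 minutes into the window
obs = pd.Series(
    [10.0, 19.0, 46.0],
    index=pd.to_timedelta([0, 90, 360], unit="m"),
    name="valuenum",
)

# Step 1: one row per minute, NaN where nothing was observed
per_minute = obs.resample("1min").asfreq()

# Step 2: linear interpolation between the observed points
per_minute = per_minute.interpolate(method="linear")

# Step 3: keep only the values on the 2-hour grid
resampled = per_minute.resample("120min").asfreq()
print(resampled)
```

Interpolating on the fine grid first means the values at 120 and 240 minutes reflect the observation at 90 minutes, even though 90 is not itself a target time step.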

In [57]:
def fill_start_end_values(df, feature_labels, start_means, end_means, start_window, end_window_start):
    """
    Fill missing start and end values for features in a DataFrame with specified means or existing values.

    Parameters:
    df (pd.DataFrame): The input DataFrame containing time series data for various features.
    feature_labels (list): A list of feature labels to check for completeness.
    start_means (pd.Series): A Series containing mean values for each feature to be used if no start value is present.
    end_means (pd.Series): A Series containing mean values for each feature to be used if no end value is present.
    start_window (pd.Timedelta): The time window within which to look for a starting value.
    end_window_start (pd.Timedelta): The time window start within which to look for an ending value.

    Returns:
    pd.DataFrame: The DataFrame with filled start and end values.
    """
    new_rows = []

    for feature in feature_labels:
        for subject_id in df['subject_id'].unique():
            subject_df = df[df['subject_id'] == subject_id]
            extubation_failure_label = subject_df['extubation_failure'].iloc[0]

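            # Check if the original data for this feature for this patient is entirely NaN.
            # An empty selection also evaluates to True here, so patients with no rows
            # for this feature are handled by this branch as well.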
            if subject_df[subject_df['feature_label'] == feature]['valuenum'].isna().all():
                # If all values are NaN, ensure any filled values are also NaN
                new_rows.append({
                    'subject_id': subject_id,
                    'feature_label': feature,
                    'time_from_window_start_mins': pd.Timedelta(0),
                    'valuenum': np.nan,
                    'extubation_failure': extubation_failure_label
                })
                new_rows.append({
                    'subject_id': subject_id,
                    'feature_label': feature,
                    'time_from_window_start_mins': pd.Timedelta(minutes=360),
                    'valuenum': np.nan,
                    'extubation_failure': extubation_failure_label
                })
                continue

            # Handle start values
            start_check = (subject_df['feature_label'] == feature) & (subject_df['time_from_window_start_mins'] == pd.Timedelta(0))
            if not start_check.any():
                # Check if there's a value in the first half of the sampling window
                start_window_check = (subject_df['feature_label'] == feature) & (subject_df['time_from_window_start_mins'] <= start_window)
                if start_window_check.any():
                    # Use the earliest value within the start window
                    start_value = subject_df[start_window_check].sort_values('time_from_window_start_mins').iloc[0]['valuenum']
                else:
                    # Use the mean start value if no value is found within the start window
                    start_value = start_means[feature]
                new_rows.append({
                    'subject_id': subject_id,
                    'feature_label': feature,
                    'time_from_window_start_mins': pd.Timedelta(0),
                    'valuenum': start_value,
                    'extubation_failure': extubation_failure_label
                })

            # Handle end values
            end_check = (subject_df['feature_label'] == feature) & (subject_df['time_from_window_start_mins'] == pd.Timedelta(minutes=360))
            if not end_check.any():
                # Check if there's a value in the last half of the sampling window
                end_window_check = (subject_df['feature_label'] == feature) & (subject_df['time_from_window_start_mins'] >= end_window_start)
                if end_window_check.any():
                    # Use the latest value within the end window
                    end_value = subject_df[end_window_check].sort_values('time_from_window_start_mins', ascending=False).iloc[0]['valuenum']
                else:
                    # Use the mean end value if no value is found within the end window
                    end_value = end_means[feature]
                new_rows.append({
                    'subject_id': subject_id,
                    'feature_label': feature,
                    'time_from_window_start_mins': pd.Timedelta(minutes=360),
                    'valuenum': end_value,
                    'extubation_failure': extubation_failure_label
                })

    # Add new rows to the dataframe
    if new_rows:
        new_df = pd.DataFrame(new_rows)
        df = pd.concat([df, new_df], ignore_index=True)

    return df
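The rule for choosing a start value can be isolated for a single patient and feature. The helper below is a simplified sketch with a hypothetical name; it returns the chosen value rather than appending rows, but follows the same order of checks.

```python
import numpy as np
import pandas as pd

def choose_start_value(times_mins, values, start_window, fallback_mean):
    """Pick the value to place at time 0 for one patient and one feature."""
    obs = pd.Series(values, index=pd.to_timedelta(times_mins, unit="m"),
                    dtype=float).sort_index()
    if obs.isna().all():
        return np.nan                 # nothing observed: leave for masking
    at_start = obs[obs.index == pd.Timedelta(0)]
    if not at_start.empty:
        return at_start.iloc[0]       # a value already exists at time 0
    in_window = obs[obs.index <= start_window]
    if not in_window.empty:
        return in_window.iloc[0]      # earliest value inside the start window
    return fallback_mean              # otherwise use the training-set mean

window = pd.Timedelta(minutes=60)
print(choose_start_value([30, 200], [5.0, 7.0], window, 9.0))
print(choose_start_value([200], [7.0], window, 9.0))
```

The end value is chosen symmetrically, using the latest observation at or after `end_window_start`.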

In [59]:
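def resample_and_interpolate(df, feature_labels, categorical_features, initial_interval='1T', target_interval='120T'):
    """
    Resample each patient's features onto a regular time grid and fill the gaps.

    Numerical features are linearly interpolated on the initial (one-minute) grid;
    categorical features are forward filled then backward filled. Features with no
    valid values for a patient are returned as all NaN so they can be masked later.

    Parameters:
    df (pd.DataFrame): Long-format data with start and end values already filled.
    feature_labels (list): The feature labels to resample.
    categorical_features (list): The feature labels to fill rather than interpolate.
    initial_interval (str): The fine-grained resampling interval. Default is '1T' (1 minute).
    target_interval (str): The final resampling interval. Default is '120T' (120 minutes).

    Returns:
    pd.DataFrame: The resampled long-format DataFrame.
    """
    target_interval_minutes = int(target_interval.strip('T'))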
    resampled_dfs = []

    # Get all unique subject IDs
    all_subject_ids = df['subject_id'].unique()

    for subject_id in all_subject_ids:
        subject_df = df[df['subject_id'] == subject_id]

        for feature in feature_labels:
            feature_df = subject_df[subject_df['feature_label'] == feature].set_index('time_from_window_start_mins')

            # Convert index to timedelta for resampling
            feature_df.index = pd.to_timedelta(feature_df.index, unit='m')

            # Remove duplicates by taking the first value if duplicates exist
            feature_df = feature_df[~feature_df.index.duplicated(keep='first')]

            if feature in categorical_features:
                # Handle categorical features
                if feature_df.empty or feature_df['valuenum'].isna().all():
                    # If the feature is completely absent or all values are NaN, create NaNs for the entire interval
                    new_index = pd.timedelta_range(start='0 min', periods=int(360 / target_interval_minutes + 1), freq=f'{target_interval_minutes}T')
                    feature_df = pd.DataFrame(index=new_index, columns=feature_df.columns)
                    feature_df['valuenum'] = np.nan
                    feature_df['extubation_failure'] = subject_df['extubation_failure'].iloc[0] if not subject_df.empty else np.nan
                else:
                    # Step 1: Resample to every minute to ensure data points at every minute
                    feature_df = feature_df.resample(initial_interval).asfreq()

                    # Step 2: Apply forward fill followed by backward fill
                    feature_df['valuenum'] = feature_df['valuenum'].ffill().bfill()

                    # Step 3: Resample to the target interval
                    feature_df = feature_df.resample(target_interval).asfreq()
            else:
                # Handle numerical features
                if feature_df.empty or feature_df['valuenum'].isna().all():
                    # If the feature is completely absent or all values are NaN, create NaNs for the entire interval
                    new_index = pd.timedelta_range(start='0 min', periods=int(360 / target_interval_minutes + 1), freq=f'{target_interval_minutes}T')
                    feature_df = pd.DataFrame(index=new_index, columns=feature_df.columns)
                    feature_df['valuenum'] = np.nan
                    feature_df['extubation_failure'] = subject_df['extubation_failure'].iloc[0] if not subject_df.empty else np.nan
                else:
                    # Step 1: Resample to every minute to ensure data points at every minute
                    feature_df = feature_df.resample(initial_interval).asfreq()

                    # Step 2: Interpolate missing values if the feature has some data
                    feature_df['valuenum'] = feature_df['valuenum'].interpolate(method='linear')

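                    # Check for any remaining NaN values and fill with the mean of the feature for the specific patient
                    if feature_df['valuenum'].isna().sum() > 0:
                        feature_mean = subject_df[subject_df['feature_label'] == feature]['valuenum'].mean()
                        # Assign the result back rather than calling fillna in place on a column selection,
                        # which is unreliable in recent pandas versions
                        feature_df['valuenum'] = feature_df['valuenum'].fillna(feature_mean)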

                    # Step 3: Resample to the target interval
                    feature_df = feature_df.resample(target_interval).asfreq()

            # Align extubation_failure columns with the resampled index and forward/backward fill for each patient
            feature_df['extubation_failure'] = feature_df['extubation_failure'].ffill().bfill()

            # Restore subject_id and feature_label columns
            feature_df['subject_id'] = subject_id
            feature_df['feature_label'] = feature

            # Reset index to retain the time information correctly
            feature_df.reset_index(inplace=True)
            feature_df.rename(columns={'index': 'time_from_window_start_mins'}, inplace=True)

            # Ensure the time column has a timedelta dtype (a safeguard; it already does after resampling)
            feature_df['time_from_window_start_mins'] = pd.to_timedelta(feature_df['time_from_window_start_mins'], unit='m')

            resampled_dfs.append(feature_df)

    # Concatenate all resampled dataframes
    resampled_df = pd.concat(resampled_dfs).reset_index(drop=True)

    return resampled_df

In [60]:
def process_data(train_df, test_df, feature_labels, categorical_features, initial_interval='1T', target_interval='120T'):
    """
    Process the train and test datasets by filling start and end values, resampling, and interpolating missing values.

    Parameters:
    train_df (pd.DataFrame): The input training DataFrame containing time series data for various features.
    test_df (pd.DataFrame): The input test DataFrame containing time series data for various features.
    feature_labels (list): A list of feature labels to check for completeness.
    categorical_features (list): A list of feature labels that are categorical numerical features.
    initial_interval (str): The initial resampling interval to ensure all data points are included. Default is '1T' (1 minute).
    target_interval (str): The target resampling interval. Default is '120T' (120 minutes).

    Returns:
    tuple: The processed training and test DataFrames along with their corresponding masks indicating where values were NaNs.
    """
    # Convert target interval to minutes for window calculations
    target_interval_minutes = int(target_interval.strip('T'))
    start_window = pd.to_timedelta(target_interval_minutes // 2, unit='m')
    end_window_start = pd.to_timedelta(360 - target_interval_minutes // 2, unit='m')

    # Calculate means from training data for start and end windows
    start_means = train_df[train_df['time_from_window_start_mins'] <= start_window].groupby('feature_label')['valuenum'].mean()
    end_means = train_df[train_df['time_from_window_start_mins'] >= end_window_start].groupby('feature_label')['valuenum'].mean()

    # Fill start and end values for train and test data
    train_df = fill_start_end_values(train_df, feature_labels, start_means, end_means, start_window, end_window_start)
    test_df = fill_start_end_values(test_df, feature_labels, start_means, end_means, start_window, end_window_start)

    # Resample and interpolate data for train and test sets
    train_df = resample_and_interpolate(train_df, feature_labels, categorical_features, initial_interval, target_interval)
    test_df = resample_and_interpolate(test_df, feature_labels, categorical_features, initial_interval, target_interval)

    # Create masks indicating where values are NaNs
    train_mask = train_df.isna()
    test_mask = test_df.isna()

    return train_df, test_df, train_mask, test_mask
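The start and end windows used in `process_data` are each half a target interval wide. A small sketch of that arithmetic for the three resampling rates, where `window_bounds` is a hypothetical helper name:

```python
import pandas as pd

def window_bounds(target_interval_minutes, window_minutes=360):
    """Return (start_window, end_window_start) as used when filling endpoints."""
    half = target_interval_minutes // 2
    return (pd.Timedelta(minutes=half),
            pd.Timedelta(minutes=window_minutes - half))

for interval in (120, 60, 30):
    start_window, end_window_start = window_bounds(interval)
    print(interval, start_window, end_window_start)
```

For the 2-hour grid, a patient's own observation is used for the start value only if it falls within the first 60 minutes, and for the end value only if it falls at or after 300 minutes; otherwise the training-set mean for that window is used.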

In [61]:
def check_processed_data(original_df, modified_df, feature_labels, target_interval='30T'):
    """
    Check if all patients have been processed correctly in the modified data.

    Parameters:
    original_df (pd.DataFrame): The original DataFrame containing time series data for various features.
    modified_df (pd.DataFrame): The modified DataFrame containing resampled and interpolated data.
    feature_labels (list): A list of feature labels to check for completeness.
    target_interval (str): The target resampling interval. Default is '30T' (30 minutes).

    Returns:
    bool: True if all patients have been processed correctly, False otherwise.
    """
    target_interval_minutes = int(target_interval.strip('T'))  # Convert target interval to minutes
    expected_intervals = [pd.Timedelta(minutes=m) for m in range(0, 361, target_interval_minutes)]  # Expected time intervals

    for subject_id in original_df['subject_id'].unique():
        for feature in feature_labels:
            original_feature_df = original_df[(original_df['subject_id'] == subject_id) & (original_df['feature_label'] == feature)]
            modified_feature_df = modified_df[(modified_df['subject_id'] == subject_id) & (modified_df['feature_label'] == feature)]

            # Check if the original data has any non-NaN values for this feature
            original_has_valid_data = original_feature_df['valuenum'].notna().any()

            modified_values = modified_feature_df['valuenum'].values
            modified_times = modified_feature_df['time_from_window_start_mins'].values

            # Check that all expected intervals are present in the modified data
            if not all(time in modified_times for time in expected_intervals):
                print(f"Missing intervals for subject {subject_id}, feature {feature}")
                return False

            if original_has_valid_data:
                # If the original data had valid values, check that the modified data has no NaNs
                if np.isnan(modified_values).any():
                    print(f"NaN values found in modified data for subject {subject_id}, feature {feature}")
                    return False
            else:
                # If the original data did not have any valid values, check that the modified data is all NaNs
                if not np.isnan(modified_values).all():
                    print(f"Non-NaN values found in modified data for subject {subject_id}, feature {feature} where original data had no values")
                    return False

    return True
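
The check above relies on a fixed grid of expected time offsets across the 6-hour (360-minute) window. As a minimal sketch of how that grid depends on the target interval, the hypothetical helpers below (`interval_to_minutes` and `expected_grid`, neither of which is part of the original notebook) accept both the `'30T'` style used here and the `'30min'` style that newer pandas versions prefer.

```python
import pandas as pd

def interval_to_minutes(interval: str) -> int:
    """Hypothetical helper: convert an interval such as '30T' or '30min' to 30."""
    for suffix in ('min', 'T'):
        if interval.endswith(suffix):
            return int(interval[:-len(suffix)])
    raise ValueError(f"Unsupported interval: {interval!r}")

def expected_grid(interval: str, window_minutes: int = 360) -> list:
    """Time offsets expected per patient and feature, with both window ends included."""
    step = interval_to_minutes(interval)
    return [pd.Timedelta(minutes=m) for m in range(0, window_minutes + 1, step)]

# Number of time steps per patient and feature for each frequency subset
grid_sizes = {interval: len(expected_grid(interval)) for interval in ('120T', '60T', '30T')}
print(grid_sizes)
```

Under these assumptions the low, medium and high frequency subsets have 4, 7 and 13 time steps per patient and feature respectively.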

In [63]:
# Define categorical features
categorical_features = ['GCS - Eye Opening', 'GCS - Motor Response', 'Richmond-RAS Scale']

In [64]:
# Process the low frequency data to resample to every 2 hours
low_frequency_train_resampled, low_frequency_test_resampled, low_frequency_train_mask, low_frequency_test_mask = process_data(low_frequency_train_copy, low_frequency_test_copy, low_frequency_features, categorical_features, initial_interval='1T', target_interval='120T')

# Check low frequency data processed correctly
# check_processed_data(low_frequency_train_copy, low_frequency_train_resampled, low_frequency_features, target_interval='120T')

In [65]:
# Count the number of patients in the low frequency data
print(f"Number of patients in the low frequency train data: {low_frequency_train_resampled['subject_id'].nunique()}")

Number of patients in the low frequency train data: 1399


In [25]:
# Process the medium frequency data to resample every 1 hour
medium_frequency_train_resampled, medium_frequency_test_resampled, medium_frequency_train_mask, medium_frequency_test_mask = process_data(medium_frequency_train_copy, medium_frequency_test_copy, medium_frequency_features, categorical_features, initial_interval='1T', target_interval='60T')

# Check medium frequency data processed correctly
check_processed_data(medium_frequency_train_copy, medium_frequency_train_resampled, medium_frequency_features, target_interval='60T')

True

In [26]:
# Process the high frequency data to resample every 30 mins
high_frequency_train_resampled, high_frequency_test_resampled, high_frequency_train_mask, high_frequency_test_mask = process_data(high_frequency_train_copy, high_frequency_test_copy, high_frequency_features, categorical_features, initial_interval='1T', target_interval='30T')

# Check high frequency data processed correctly
check_processed_data(high_frequency_train_copy, high_frequency_train_resampled, high_frequency_features, target_interval='30T')

True

**Checking that categorical features were handled correctly**

In [31]:
medium_frequency_train_resampled[medium_frequency_train_resampled['feature_label'] == 'Richmond-RAS Scale'].head(20)

Unnamed: 0,time_from_window_start_mins,subject_id,valuenum,feature_label,extubation_failure
63,0 days 00:00:00,10001884,-1.305419,Richmond-RAS Scale,1.0
64,0 days 01:00:00,10001884,-1.305419,Richmond-RAS Scale,1.0
65,0 days 02:00:00,10001884,-1.305419,Richmond-RAS Scale,1.0
66,0 days 03:00:00,10001884,-1.0,Richmond-RAS Scale,1.0
67,0 days 04:00:00,10001884,-1.0,Richmond-RAS Scale,1.0
68,0 days 05:00:00,10001884,-1.0,Richmond-RAS Scale,1.0
69,0 days 06:00:00,10001884,-0.475104,Richmond-RAS Scale,1.0
133,0 days 00:00:00,10002428,-1.305419,Richmond-RAS Scale,0.0
134,0 days 01:00:00,10002428,-1.305419,Richmond-RAS Scale,0.0
135,0 days 02:00:00,10002428,0.0,Richmond-RAS Scale,0.0


The output above shows non-integer values (e.g. -1.305419) for the Richmond-RAS Scale, so we now round all categorical features to integer values.

In [33]:
medium_frequency_train_resampled_copy = medium_frequency_train_resampled.copy()
medium_frequency_test_resampled_copy = medium_frequency_test_resampled.copy()

In [32]:
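# Round all categorical values and clip them to [min_value, max_value].
# The bounds used below (-5 to +4) are the RAS scale range; note that the same
# bounds are applied to every feature in categorical_features.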
def round_and_clip_categorical(df, categorical_features, min_value, max_value):
    """
    Rounds and clips categorical feature values in the DataFrame.

    Parameters:
    df (pd.DataFrame): The input DataFrame containing time series data.
    categorical_features (list): A list of feature labels to be processed as categorical.
    min_value (int): The minimum value for clipping.
    max_value (int): The maximum value for clipping.

    Returns:
    pd.DataFrame: The processed DataFrame with categorical feature values rounded and clipped.
    """
    for feature in categorical_features:
        # Select the non-NaN rows belonging to this feature
        valid_rows = (df['feature_label'] == feature) & df['valuenum'].notna()
        df.loc[valid_rows, 'valuenum'] = df.loc[valid_rows, 'valuenum'].round().clip(min_value, max_value)
    return df
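
Because a single pair of bounds is shared by every categorical feature, clipping at +4 would also cap GCS - Motor Response, which runs from 1 to 6 when it holds raw scale scores. The sketch below is one possible variant that clips each feature to its own range. The `CATEGORICAL_RANGES` dictionary and the `round_and_clip_per_feature` helper are assumptions for illustration (the ranges are the standard clinical scale ranges, not values taken from this notebook), and the demo values are made up.

```python
import numpy as np
import pandas as pd

# Assumed valid ranges (standard clinical scales, not taken from this notebook)
CATEGORICAL_RANGES = {
    'GCS - Eye Opening': (1, 4),
    'GCS - Motor Response': (1, 6),
    'Richmond-RAS Scale': (-5, 4),
}

def round_and_clip_per_feature(df, ranges):
    """Hypothetical variant: round each categorical feature and clip it to its own range."""
    df = df.copy()
    for feature, (low, high) in ranges.items():
        rows = (df['feature_label'] == feature) & df['valuenum'].notna()
        df.loc[rows, 'valuenum'] = df.loc[rows, 'valuenum'].round().clip(low, high)
    return df

# Made-up values for illustration only
demo = pd.DataFrame({
    'feature_label': ['GCS - Motor Response', 'GCS - Motor Response',
                      'Richmond-RAS Scale', 'Richmond-RAS Scale', 'GCS - Eye Opening'],
    'valuenum': [5.6, np.nan, -1.305419, 4.7, 0.2],
})
clipped = round_and_clip_per_feature(demo, CATEGORICAL_RANGES)
print(clipped)
```

In this toy example the motor score of 5.6 is kept as 6 rather than being capped at 4, and NaN values are left untouched.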

In [34]:
# Apply function to categorical data
medium_frequency_train_resampled_copy = round_and_clip_categorical(medium_frequency_train_resampled_copy, categorical_features, -5, 4)

In [38]:
medium_frequency_train_resampled_copy[medium_frequency_train_resampled_copy['feature_label'] == 'Richmond-RAS Scale'].head(30)

Unnamed: 0,time_from_window_start_mins,subject_id,valuenum,feature_label,extubation_failure
63,0 days 00:00:00,10001884,-1.0,Richmond-RAS Scale,1.0
64,0 days 01:00:00,10001884,-1.0,Richmond-RAS Scale,1.0
65,0 days 02:00:00,10001884,-1.0,Richmond-RAS Scale,1.0
66,0 days 03:00:00,10001884,-1.0,Richmond-RAS Scale,1.0
67,0 days 04:00:00,10001884,-1.0,Richmond-RAS Scale,1.0
68,0 days 05:00:00,10001884,-1.0,Richmond-RAS Scale,1.0
69,0 days 06:00:00,10001884,-0.0,Richmond-RAS Scale,1.0
133,0 days 00:00:00,10002428,-1.0,Richmond-RAS Scale,0.0
134,0 days 01:00:00,10002428,-1.0,Richmond-RAS Scale,0.0
135,0 days 02:00:00,10002428,0.0,Richmond-RAS Scale,0.0


In [39]:
# Apply to test data
medium_frequency_test_resampled_copy = round_and_clip_categorical(medium_frequency_test_resampled_copy, categorical_features, -5, 4)
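
Rather than inspecting the first rows by eye, the rounding can also be verified programmatically. This is a minimal sketch: `non_integer_rows` is a hypothetical helper, and the demo DataFrame is made up and only mirrors the column names used in this notebook.

```python
import numpy as np
import pandas as pd

def non_integer_rows(df, categorical_features):
    """Return rows of categorical features whose non-NaN value is not a whole number."""
    is_categorical = df['feature_label'].isin(categorical_features)
    values = df['valuenum']
    not_whole = values.notna() & (values != values.round())
    return df[is_categorical & not_whole]

# Made-up values for illustration only
demo = pd.DataFrame({
    'feature_label': ['Richmond-RAS Scale', 'Richmond-RAS Scale',
                      'Richmond-RAS Scale', 'Heart Rate'],
    'valuenum': [-1.0, -1.305419, np.nan, 72.5],
})
offending = non_integer_rows(demo, ['Richmond-RAS Scale'])
print(len(offending))
```

An empty result for both the train and test DataFrames would indicate that every categorical value is either an integer or NaN.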

In [41]:
# Save progress so far
data_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/02_feature_set_2/feature_subsets'
low_frequency_train_resampled.to_parquet(f'{data_path}/04_low_frequency_train_resampled.parquet')
low_frequency_test_resampled.to_parquet(f'{data_path}/04_low_frequency_test_resampled.parquet')
medium_frequency_train_resampled_copy.to_parquet(f'{data_path}/04_medium_frequency_train_resampled.parquet')
medium_frequency_test_resampled_copy.to_parquet(f'{data_path}/04_medium_frequency_test_resampled.parquet')
high_frequency_train_resampled.to_parquet(f'{data_path}/04_high_frequency_train_resampled.parquet')
high_frequency_test_resampled.to_parquet(f'{data_path}/04_high_frequency_test_resampled.parquet')

In [44]:
low_frequency_train_resampled_copy = low_frequency_train_resampled.copy()
low_frequency_test_resampled_copy = low_frequency_test_resampled.copy()
high_frequency_train_resampled_copy = high_frequency_train_resampled.copy()
high_frequency_test_resampled_copy = high_frequency_test_resampled.copy()

### **Step 5: Feature scaling**

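We will now scale our features using min-max scaling. Each feature's scaler is fitted on the training data only and then applied to the test data, so that no information from the test set leaks into the scaling.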

In [45]:
from sklearn.preprocessing import MinMaxScaler

In [46]:
def scale_features(train_df, test_df, numerical_features):
    """
    Scale numerical features in the train and test dataframes using MinMaxScaler.

    Parameters:
    train_df (pd.DataFrame): The training dataframe.
    test_df (pd.DataFrame): The testing dataframe.
    numerical_features (list): List of feature labels to be scaled.

    Returns:
    tuple: The scaled train and test dataframes.
    """
    scalers = {}

    for feature in numerical_features:
        # Initialize the MinMaxScaler for the feature
        scalers[feature] = MinMaxScaler()

        # Create masks for the feature in train and test dataframes
        feature_mask_train = train_df['feature_label'] == feature
        feature_mask_test = test_df['feature_label'] == feature

        # Extract the raw values. MinMaxScaler disregards NaNs when fitting and
        # maintains them when transforming, so no placeholder fill is needed
        # (filling with 0 would distort the fitted minimum of the feature)
        train_values = train_df.loc[feature_mask_train, 'valuenum'].values.reshape(-1, 1)
        test_values = test_df.loc[feature_mask_test, 'valuenum'].values.reshape(-1, 1)

        # Fit and transform the train dataframe values
        train_scaled = scalers[feature].fit_transform(train_values)
        # Transform the test dataframe values
        test_scaled = scalers[feature].transform(test_values)

        # Assign the scaled values back to the dataframes
        train_df.loc[feature_mask_train, 'valuenum'] = train_scaled
        test_df.loc[feature_mask_test, 'valuenum'] = test_scaled

        print(f'Feature {feature} has been normalized')

    # Ensure indices align if necessary
    train_df.reset_index(drop=True, inplace=True)
    test_df.reset_index(drop=True, inplace=True)

    # Display the sizes after normalization
    print(f"Number of rows in train dataframe after normalization: {len(train_df)}")
    print(f"Number of rows in test dataframe after normalization: {len(test_df)}")

    return train_df, test_df
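
`MinMaxScaler` treats NaNs as missing values: they are disregarded in `fit` and maintained in `transform`. The sketch below uses made-up pH-like values to compare fitting with NaNs left in place against fitting after a placeholder fill of 0. It is an illustration under those assumptions, not a reproduction of the pipeline above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values for illustration only
train = np.array([7.30, np.nan, 7.40, 7.50]).reshape(-1, 1)
test = np.array([7.35, np.nan]).reshape(-1, 1)

# Fit directly on data that still contains NaNs
scaler = MinMaxScaler().fit(train)

# Fit after replacing NaNs with a placeholder of 0
scaler_filled = MinMaxScaler().fit(np.nan_to_num(train, nan=0.0))

print("min/max with NaNs kept:", scaler.data_min_, scaler.data_max_)
print("min with placeholder fill:", scaler_filled.data_min_)

# NaNs pass through the transform unchanged
test_scaled = scaler.transform(test).ravel()
print(test_scaled)
```

With the placeholder fill the fitted minimum becomes 0 rather than the smallest observed value, which compresses the real values into a narrow band near 1.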

In [47]:
# Scale all data
low_frequency_train_scaled, low_frequency_test_scaled = scale_features(low_frequency_train_resampled_copy, low_frequency_test_resampled_copy, low_frequency_features)
medium_frequency_train_scaled, medium_frequency_test_scaled = scale_features(medium_frequency_train_resampled_copy, medium_frequency_test_resampled_copy, medium_frequency_features)
high_frequency_train_scaled, high_frequency_test_scaled = scale_features(high_frequency_train_resampled_copy, high_frequency_test_resampled_copy, high_frequency_features)

Feature PH (Arterial) has been normalized
Feature Arterial O2 pressure has been normalized
Feature Arterial CO2 Pressure has been normalized
Number of rows in train dataframe after normalization: 16788
Number of rows in test dataframe after normalization: 4200
Feature Inspired O2 Fraction has been normalized
Feature GCS - Eye Opening has been normalized
Feature GCS - Motor Response has been normalized
Feature Tidal Volume (observed) has been normalized
Feature Minute Volume has been normalized
Feature Mean Airway Pressure has been normalized
Feature Peak Insp. Pressure has been normalized
Feature Temperature Fahrenheit has been normalized
Feature Tidal Volume (spontaneous) has been normalized
Feature Richmond-RAS Scale has been normalized
Number of rows in train dataframe after normalization: 263060
Number of rows in test dataframe after normalization: 65800
Feature Heart Rate has been normalized
Feature O2 saturation pulseoxymetry has been normalized
Feature Respiratory Rate has been 

In [48]:
# Save progress
data_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/02_feature_set_2/feature_subsets'
low_frequency_train_scaled.to_parquet(f'{data_path}/05_low_frequency_train_scaled.parquet')
low_frequency_test_scaled.to_parquet(f'{data_path}/05_low_frequency_test_scaled.parquet')
medium_frequency_train_scaled.to_parquet(f'{data_path}/05_medium_frequency_train_scaled.parquet')
medium_frequency_test_scaled.to_parquet(f'{data_path}/05_medium_frequency_test_scaled.parquet')
high_frequency_train_scaled.to_parquet(f'{data_path}/05_high_frequency_train_scaled.parquet')
high_frequency_test_scaled.to_parquet(f'{data_path}/05_high_frequency_test_scaled.parquet')


In [50]:
# Count the number of patients in each dataset
print(f"Number of patients in low frequency train dataset: {low_frequency_train_scaled['subject_id'].nunique()}")
print(f"Number of patients in low frequency test dataset: {low_frequency_test_scaled['subject_id'].nunique()}")
print(f"Number of patients in medium frequency train dataset: {medium_frequency_train_scaled['subject_id'].nunique()}")
print(f"Number of patients in medium frequency test dataset: {medium_frequency_test_scaled['subject_id'].nunique()}")
print(f"Number of patients in high frequency train dataset: {high_frequency_train_scaled['subject_id'].nunique()}")
print(f"Number of patients in high frequency test dataset: {high_frequency_test_scaled['subject_id'].nunique()}")

Number of patients in low frequency train dataset: 1399
Number of patients in low frequency test dataset: 350
Number of patients in medium frequency train dataset: 3758
Number of patients in medium frequency test dataset: 940
Number of patients in high frequency train dataset: 3760
Number of patients in high frequency test dataset: 940
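
As a further sanity check, the row counts printed after scaling should equal patients × features × time steps, since every patient has one row per feature and time step. The sketch below uses the counts reported above for the low and medium frequency subsets (3 and 10 features respectively, as listed in the scaling output). `expected_rows` is a hypothetical helper.

```python
def expected_rows(n_patients, n_features, interval_minutes, window_minutes=360):
    """Rows expected in long format: one per patient, feature and time step."""
    n_timesteps = window_minutes // interval_minutes + 1  # both window ends included
    return n_patients * n_features * n_timesteps

# (patients, features, interval in minutes, reported row count)
reported = {
    'low frequency train': (1399, 3, 120, 16788),
    'low frequency test': (350, 3, 120, 4200),
    'medium frequency train': (3758, 10, 60, 263060),
    'medium frequency test': (940, 10, 60, 65800),
}

for name, (patients, features, interval, rows) in reported.items():
    matches = expected_rows(patients, features, interval) == rows
    print(f"{name}: expected rows match reported rows = {matches}")
```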
