# **Feature Set 2 - Dynamic Data Preprocessing**

We will now preprocess the raw data extracted for the features in our second feature set, combined with those in our first feature set.

The logic and steps used to preprocess the additional features will be identical to those used for the feature set 1 dynamic features.

When the data is split into train and test sets, we will assign the same patients to each set as in all analysis with feature set 1, for consistency.

**Decision - use the exact same patient-wise train and test split**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### **Step 1: Load the time-series data**

In [None]:
full_data = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/06_extracted_data_analysis/02_feature_set_2_analysis/patient_f2_dynamic_data.parquet'

patient_data = pd.read_parquet(full_data)

patient_data.head()

Unnamed: 0,subject_id,itemid,valuenum,time_from_window_start_mins,feature_label
0,10001884,223835,40.0,200.0,Inspired O2 Fraction
1,10001884,224685,284.0,200.0,Tidal Volume (observed)
2,10001884,224686,284.0,200.0,Tidal Volume (spontaneous)
3,10001884,224687,6.1,200.0,Minute Volume
4,10001884,224695,17.0,200.0,Peak Insp. Pressure


In [None]:
# Drop the itemid column
patient_data = patient_data.drop(columns=['itemid'])

patient_data.head()

Unnamed: 0,subject_id,valuenum,time_from_window_start_mins,feature_label
0,10001884,40.0,200.0,Inspired O2 Fraction
1,10001884,284.0,200.0,Tidal Volume (observed)
2,10001884,284.0,200.0,Tidal Volume (spontaneous)
3,10001884,6.1,200.0,Minute Volume
4,10001884,17.0,200.0,Peak Insp. Pressure


In [None]:
# Identify all unique feature_labels
feature_labels = patient_data['feature_label'].unique()

print(len(feature_labels))

print(feature_labels)

34
['Inspired O2 Fraction' 'Tidal Volume (observed)'
 'Tidal Volume (spontaneous)' 'Minute Volume' 'Peak Insp. Pressure'
 'Mean Airway Pressure' 'EtCO2' 'Heart Rate' 'Respiratory Rate'
 'GCS - Eye Opening' 'GCS - Motor Response' 'O2 saturation pulseoxymetry'
 'Richmond-RAS Scale' 'Ventilator Mode' 'Arterial Blood Pressure systolic'
 'Arterial Blood Pressure diastolic' 'Arterial Blood Pressure mean'
 'Temperature Fahrenheit' 'Hematocrit (serum)' 'Sodium (serum)'
 'Potassium (serum)' 'Arterial O2 pressure' 'Arterial CO2 Pressure'
 'PH (Arterial)' 'Ionized Calcium' 'Lactic Acid' 'Hemoglobin' 'WBC'
 'Creatinine (serum)' 'Glucose (serum)' 'Platelet Count'
 'Plateau Pressure' 'Total Bilirubin' 'Negative Insp. Force']


We have all relevant features from both feature set 1 and feature set 2.

In [None]:
# Count the number of unique subject_ids
unique_subject_ids = patient_data['subject_id'].nunique()

print(unique_subject_ids)

4701


We also have the same number of patients in the global dataset as with feature set 1, so we can reuse the same 80/20 train/test split.

In [None]:
patient_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248413 entries, 0 to 248412
Data columns (total 4 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   subject_id                   248413 non-null  int64  
 1   valuenum                     247614 non-null  float64
 2   time_from_window_start_mins  248413 non-null  float64
 3   feature_label                248413 non-null  object 
dtypes: float64(2), int64(1), object(1)
memory usage: 7.6+ MB


### **Step 2: Remove low observed features**

As before, we will remove any feature that is observed, on average, fewer than 0.5 times per patient in the 6-hour window.

In [None]:
# Count the observations of each feature for each patient
per_patient_sampling_frequency = patient_data.groupby(['subject_id', 'feature_label']).size().reset_index(name='count')

per_patient_sampling_frequency_pivot = per_patient_sampling_frequency.pivot(index='subject_id', columns='feature_label', values='count').fillna(0)

# Calculate the average sampling frequency
average_sampling_frequency = per_patient_sampling_frequency_pivot.mean().sort_values(ascending=False)

# Create columns for the table
average_sampling_frequency_df = pd.DataFrame({'Feature': average_sampling_frequency.index, 'Average Sampling Frequency': average_sampling_frequency.values})

# Display the table
average_sampling_frequency_df

Unnamed: 0,Feature,Average Sampling Frequency
0,Heart Rate,6.647522
1,Respiratory Rate,6.614763
2,O2 saturation pulseoxymetry,6.610508
3,Arterial Blood Pressure mean,3.761115
4,Arterial Blood Pressure diastolic,3.754308
5,Arterial Blood Pressure systolic,3.753669
6,Inspired O2 Fraction,2.104446
7,GCS - Eye Opening,1.633057
8,GCS - Motor Response,1.628377
9,Tidal Volume (observed),1.600298


In [None]:
# Define the threshold
threshold = 0.5

In [None]:
# Identify all the features below the threshold
low_observed_features = average_sampling_frequency_df[average_sampling_frequency_df['Average Sampling Frequency'] < threshold]['Feature'].tolist()
low_observed_features

['Sodium (serum)',
 'Potassium (serum)',
 'Glucose (serum)',
 'Creatinine (serum)',
 'Hematocrit (serum)',
 'Ionized Calcium',
 'Hemoglobin',
 'Platelet Count',
 'WBC',
 'Lactic Acid',
 'EtCO2',
 'Plateau Pressure',
 'Total Bilirubin',
 'Negative Insp. Force']

There are 14 features that have observation frequencies below the threshold.

We will now remove all data points associated with these features to minimise the amount of data synthesis that will be required later.

In [None]:
patient_data.shape[0]

248413

In [None]:
# From the patient data remove all rows that correspond to the low frequency features
patient_data_filtered_df = patient_data[~patient_data['feature_label'].isin(low_observed_features)]
patient_data_filtered_df.head()

Unnamed: 0,subject_id,valuenum,time_from_window_start_mins,feature_label
0,10001884,40.0,200.0,Inspired O2 Fraction
1,10001884,284.0,200.0,Tidal Volume (observed)
2,10001884,284.0,200.0,Tidal Volume (spontaneous)
3,10001884,6.1,200.0,Minute Volume
4,10001884,17.0,200.0,Peak Insp. Pressure


In [None]:
patient_data_filtered_df.shape[0]

print("Number of rows removed:", patient_data.shape[0] - patient_data_filtered_df.shape[0])

Number of rows removed: 14076


In [None]:
# Check that there are no entries of these features remaining
print(patient_data_filtered_df[patient_data_filtered_df['feature_label'].isin(low_observed_features)])

Empty DataFrame
Columns: [subject_id, valuenum, time_from_window_start_mins, feature_label]
Index: []


In [None]:
# Save progress for now in drive
patient_data_filtered_df.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/02_feature_set_2/01_patient_data_f2_filtered')

### **Step 3: Split the data into train and test sets**

We will split the data using the same patients in the train and test sets used for feature set 1 for consistency.

In [None]:
# First we will attach extubation failure labels to our patients
annotations_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/03_annotated_set/annotation_v03.parquet'
annotations_df = pd.read_parquet(annotations_path)

# Use assign() so the label column is added to a new DataFrame rather than to a
# slice of patient_data, which avoids pandas' SettingWithCopyWarning
failure_labels = annotations_df.set_index('subject_id')['extubation_failure']
patient_data_filtered_df = patient_data_filtered_df.assign(
    extubation_failure=patient_data_filtered_df['subject_id'].map(failure_labels)
)
patient_data_filtered_df.head()


Unnamed: 0,subject_id,valuenum,time_from_window_start_mins,feature_label,extubation_failure
0,10001884,40.0,200.0,Inspired O2 Fraction,1
1,10001884,284.0,200.0,Tidal Volume (observed),1
2,10001884,284.0,200.0,Tidal Volume (spontaneous),1
3,10001884,6.1,200.0,Minute Volume,1
4,10001884,17.0,200.0,Peak Insp. Pressure,1
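
Because `Series.map` leaves a NaN wherever a `subject_id` has no match, it may be worth confirming that every patient received a label. The following is a minimal sketch on toy data rather than part of the original pipeline; the helper name `attach_labels` and the toy frames are assumptions made for illustration.

```python
import pandas as pd

def attach_labels(data, annotations, id_col="subject_id", label_col="extubation_failure"):
    """Attach one label per patient and fail loudly on ambiguous or missing labels."""
    # Assumption: annotations hold one row per patient; duplicates make the lookup ambiguous
    if annotations[id_col].duplicated().any():
        raise ValueError("annotations contain more than one row per patient")
    lookup = annotations.set_index(id_col)[label_col]
    labelled = data.assign(**{label_col: data[id_col].map(lookup)})
    # Patients absent from the annotations end up with NaN labels
    missing = labelled.loc[labelled[label_col].isna(), id_col].unique()
    if len(missing) > 0:
        raise ValueError(f"{len(missing)} patients have no label")
    return labelled

# Hypothetical toy data: two patients, three observations
toy_data = pd.DataFrame({"subject_id": [1, 1, 2], "valuenum": [40.0, 284.0, 6.1]})
toy_annotations = pd.DataFrame({"subject_id": [1, 2], "extubation_failure": [1, 0]})
toy_labelled = attach_labels(toy_data, toy_annotations)
print(toy_labelled)
```

In the notebook, the same check could be applied to `patient_data_filtered_df` and `annotations_df`.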


In [None]:
# Load the train and test sets from feature set 1 preprocessing
train_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/01_preprocessing_v2/03_train_data_standard_preprocess_done.parquet'
test_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/01_preprocessing_v2/03_test_data_standard_preprocess_done.parquet'

train_f1 = pd.read_parquet(train_path)
test_df = pd.read_parquet(test_path)

train_f1.head()

Unnamed: 0,subject_id,itemid,valuenum,time_to_extubation_mins,time_from_window_start,label,extubation_failure
0,10001884,223835,40.0,160.0,200.0,Inspired O2 Fraction,1
1,10001884,224685,,160.0,200.0,Tidal Volume (observed),1
2,10001884,224686,,160.0,200.0,Tidal Volume (spontaneous),1
3,10001884,224687,6.1,160.0,200.0,Minute Volume,1
4,10001884,224695,17.0,160.0,200.0,Peak Insp. Pressure,1


In [None]:
# Identify the patients in the train and test set
train_patients = train_f1['subject_id'].unique()
test_patients = test_df['subject_id'].unique()

print(len(train_patients))
print(len(test_patients))

3760
941


In [None]:
# Split the patient data frame into train and test data by the train and test patients extracted
train_data = patient_data_filtered_df[patient_data_filtered_df['subject_id'].isin(train_patients)]
test_data = patient_data_filtered_df[patient_data_filtered_df['subject_id'].isin(test_patients)]

print(train_data.shape[0])
print(test_data.shape[0])

187110
47227


In [None]:
print(train_data['subject_id'].nunique())
print(test_data['subject_id'].nunique())

3760
941
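
The patient counts match feature set 1, but counts alone do not show that the two sets are disjoint. A small check along the following lines could confirm that no patient appears in both sets and that none was dropped. This is a sketch under assumptions: `check_patient_split` is a hypothetical helper and the ids are toy values.

```python
def check_patient_split(train_ids, test_ids, all_ids):
    """Return True when train and test ids are disjoint and together cover all_ids."""
    train_set, test_set = set(train_ids), set(test_ids)
    disjoint = train_set.isdisjoint(test_set)
    complete = (train_set | test_set) == set(all_ids)
    return disjoint and complete

# Toy ids: a clean split, and a split where patient 2 leaks into both sets
split_ok = check_patient_split([1, 2, 3, 4], [5], [1, 2, 3, 4, 5])
leaky = check_patient_split([1, 2], [2, 3], [1, 2, 3])
print(split_ok, leaky)
```

In the notebook this would be called with the `subject_id` values of `train_data`, `test_data` and `patient_data_filtered_df`.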


In [None]:
train_data.head()

Unnamed: 0,subject_id,valuenum,time_from_window_start_mins,feature_label,extubation_failure
0,10001884,40.0,200.0,Inspired O2 Fraction,1
1,10001884,284.0,200.0,Tidal Volume (observed),1
2,10001884,284.0,200.0,Tidal Volume (spontaneous),1
3,10001884,6.1,200.0,Minute Volume,1
4,10001884,17.0,200.0,Peak Insp. Pressure,1


We have now split our data into the train and test sets.

In [None]:
# Save progress so far
train_data.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/02_feature_set_2/02_train_data_f2.parquet')
test_data.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/02_feature_set_2/02_test_data_f2.parquet')


Let's see how much synthetic data will be required for our train and test split.

**The estimate assumes the maximal resampling rate of one point every 30 mins (`target_frequency=13` in the functions below). This will not be the case for the medium and low frequency subsets, but it covers the worst-case scenario.**
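
The default `target_frequency=13` used below is consistent with a 6-hour window sampled every 30 minutes, if both ends of the window are included. The arithmetic is sketched here; the inclusive-endpoint reading is an assumption, as the notebook does not state how 13 was derived.

```python
window_mins = 6 * 60   # 6-hour observation window
interval_mins = 30     # maximal resampling rate: one point every 30 minutes

# Assumption: both window endpoints are sampled, which adds one extra point
points_per_feature = window_mins // interval_mins + 1
print(points_per_feature)
```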

In [None]:
def calculate_synthetic_data_proportion(df, patient_id_col, feature_col, target_frequency=13):
    synthetic_data_proportion = {}
    for patient_id in df[patient_id_col].unique():
        patient_data = df[df[patient_id_col] == patient_id]
        total_entries = len(patient_data)
        synthetic_entries = 0
        for feature in df[feature_col].unique():
            feature_data = patient_data[patient_data[feature_col] == feature]
            observed_count = len(feature_data)
            if observed_count < target_frequency:
                synthetic_entries += (target_frequency - observed_count)
        synthetic_data_proportion[patient_id] = synthetic_entries / total_entries
    return synthetic_data_proportion

In [None]:
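def calculate_patient_synthetic_data_proportion(patient_data, target_frequency=13):
    # Not called below; kept as an alternative per-patient measure.
    # Returns the share of each patient's resampled grid (the features observed
    # for that patient x target_frequency) that would have to be synthetic.
    patient_synthetic_data_proportion = {}
    for patient_id, group in patient_data.groupby('subject_id'):
        # The feature column in this dataset is 'feature_label', not 'label'
        feature_counts = group['feature_label'].value_counts()
        total_expected = len(feature_counts) * target_frequency
        # Cap each feature at target_frequency so surplus observations do not offset gaps
        observed_count = feature_counts.clip(upper=target_frequency).sum()
        synthetic_count = total_expected - observed_count
        synthetic_proportion = synthetic_count / total_expected
        patient_synthetic_data_proportion[patient_id] = synthetic_proportion
    return patient_synthetic_data_proportion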

In [None]:
def compare_synthetic_data_proportions(train_df, test_df, patient_id_col, feature_col, target_frequency=13):
    train_synthetic_data_proportion = calculate_synthetic_data_proportion(train_df, patient_id_col, feature_col, target_frequency)
    test_synthetic_data_proportion = calculate_synthetic_data_proportion(test_df, patient_id_col, feature_col, target_frequency)

    train_avg_synthetic_data_proportion = sum(train_synthetic_data_proportion.values()) / len(train_synthetic_data_proportion)
    test_avg_synthetic_data_proportion = sum(test_synthetic_data_proportion.values()) / len(test_synthetic_data_proportion)

    print(f"Train Average Synthetic Data Proportion: {train_avg_synthetic_data_proportion:.4f}")
    print(f"Test Average Synthetic Data Proportion: {test_avg_synthetic_data_proportion:.4f}")

def compare_class_distributions(train_df, test_df, target_col):
    train_class_distribution = train_df[target_col].value_counts(normalize=True)
    test_class_distribution = test_df[target_col].value_counts(normalize=True)

    print("Train Class Distribution:")
    print(train_class_distribution)
    print("\nTest Class Distribution:")
    print(test_class_distribution)

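# Note: this version expects one column per feature (wide format). It is redefined
# further down to pivot the long-format data first.
def compare_feature_distributions(train_df, test_df, feature_cols):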
    train_feature_stats = train_df[feature_cols].describe().transpose()
    test_feature_stats = test_df[feature_cols].describe().transpose()

    comparison_df = train_feature_stats[['mean', 'std']].merge(
        test_feature_stats[['mean', 'std']],
        left_index=True,
        right_index=True,
        suffixes=('_train', '_test')
    )

    print("\nFeature Statistics Comparison:")
    print(comparison_df)

In [None]:
compare_synthetic_data_proportions(train_data, test_data, 'subject_id', 'feature_label')

Train Average Synthetic Data Proportion: 4.8382
Test Average Synthetic Data Proportion: 4.8816


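The value reported is the ratio of synthetic to observed data points, which is why it exceeds 1: in the maximal case, about 4.8 synthetic points would be required for every observed point. The ratio is similar in the train and test sets, meaning the two sets are comparably sparse and the split is well balanced in this respect.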
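
The nested loops in `calculate_synthetic_data_proportion` can be slow on several thousand patients. A vectorised equivalent might look like the sketch below. The function name `synthetic_to_observed_ratio` and the toy data are assumptions; the logic mirrors the loop version by counting the shortfall against `target_frequency` for every feature in the frame, including features a patient never had.

```python
import pandas as pd

def synthetic_to_observed_ratio(df, patient_id_col, feature_col, target_frequency=13):
    """Per-patient ratio of points to synthesise to points observed."""
    # Rows: patients, columns: features, values: observed counts (0 when never observed)
    counts = df.groupby([patient_id_col, feature_col]).size().unstack(fill_value=0)
    # Points missing per feature, never negative when a feature is over-sampled
    shortfall = (target_frequency - counts).clip(lower=0).sum(axis=1)
    observed = counts.sum(axis=1)
    return shortfall / observed

# Toy data: patient 1 has three observations, patient 2 has one
toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 2],
    "feature_label": ["Heart Rate", "Heart Rate", "EtCO2", "Heart Rate"],
})
toy_ratio = synthetic_to_observed_ratio(toy, "subject_id", "feature_label", target_frequency=2)
print(toy_ratio)
```

With a target of 2 points per feature, patient 1 needs 1 synthetic point against 3 observed, and patient 2 needs 3 against 1 observed.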

We will now compare class distributions.

In [None]:
compare_class_distributions(train_data, test_data, 'extubation_failure')

Train Class Distribution:
extubation_failure
0    0.668094
1    0.331906
Name: proportion, dtype: float64

Test Class Distribution:
extubation_failure
0    0.666398
1    0.333602
Name: proportion, dtype: float64


The class distribution is nearly identical between the two sets and reflects the inherent class distribution of the dataset, which is ideal.
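
The proportions above are computed over rows of the long-format data, so patients with more observations carry more weight. A patient-level view, where each patient counts once, could be obtained as in the sketch below. The helper name `patient_level_class_distribution` and the toy frame are assumptions for illustration.

```python
import pandas as pd

def patient_level_class_distribution(df, patient_id_col="subject_id",
                                     target_col="extubation_failure"):
    """Class proportions with each patient counted once, regardless of row count."""
    per_patient = df.drop_duplicates(subset=patient_id_col)
    return per_patient[target_col].value_counts(normalize=True).sort_index()

# Toy data: patient 1 (failure) has three rows, patients 2 and 3 have one each
toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 3],
    "extubation_failure": [1, 1, 1, 0, 0],
})
row_level = toy["extubation_failure"].value_counts(normalize=True).sort_index()
patient_level = patient_level_class_distribution(toy)
print(row_level)
print(patient_level)
```

In the toy data the failure class makes up 60% of rows but only one in three patients, which shows how the two views can diverge.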

Finally, we will check the feature distributions between the train and test sets.

In [None]:
def pivot_features(data):
    """Pivot the DataFrame so that each feature in the 'feature_label' column becomes a column of per-patient means."""
    pivoted_df = data.pivot_table(index='subject_id', columns='feature_label', values='valuenum', aggfunc='mean').reset_index()
    return pivoted_df

def compare_feature_distributions(train_df, test_df, feature_labels):
    # Pivot the train and test dataframes
    train_pivoted = pivot_features(train_df)
    test_pivoted = pivot_features(test_df)

    # Calculate statistics for the pivoted features
    train_feature_stats = train_pivoted[feature_labels].describe().transpose()
    test_feature_stats = test_pivoted[feature_labels].describe().transpose()

    # Merge the statistics from train and test dataframes
    comparison_df = train_feature_stats[['mean', 'std']].merge(
        test_feature_stats[['mean', 'std']],
        left_index=True,
        right_index=True,
        suffixes=('_train', '_test')
    )

    # Print the comparison dataframe in a readable format
    print("\nFeature Statistics Comparison:")
    print(comparison_df)

In [None]:
feature_cols = average_sampling_frequency_df['Feature'].tolist()
# Remove low freq features from feature cols
feature_cols = [x for x in feature_cols if x not in low_observed_features]

compare_feature_distributions(train_data, test_data, feature_cols)


Feature Statistics Comparison:
                                   mean_train   std_train   mean_test  \
feature_label                                                           
Heart Rate                          84.674881   16.131241   84.940555   
Respiratory Rate                    19.094851    7.168183   18.855197   
O2 saturation pulseoxymetry         97.569262    2.406151   97.512783   
Arterial Blood Pressure mean        81.807956   15.146365   82.306634   
Arterial Blood Pressure diastolic   60.503410   18.254267   61.305475   
Arterial Blood Pressure systolic   123.695378   19.759156  123.007560   
Inspired O2 Fraction                42.971156    9.731068   42.922510   
GCS - Eye Opening                    3.191998    0.915553    3.152262   
GCS - Motor Response                 5.335014    1.246194    5.222507   
Tidal Volume (observed)            469.480313  146.227187  481.555807   
Minute Volume                        8.783212    6.849875    8.692771   
Mean Airway Pressur

Note that some of these features are not inherently numerical (e.g. Ventilator Mode, Richmond-RAS Scale and GCS), so their means and standard deviations should not be interpreted.

### **Step 4: Remove any features that clinically are not useful**

After consulting clinical practitioners, we identified features in this list that are not useful to look at.

These were identified to be:
- Ventilator Mode: as this is machine specific and therefore not generalisable to other machines

In [None]:
# Remove any rows where the feature_label is ventilator mode
train_data = train_data[train_data['feature_label'] != 'Ventilator Mode']
test_data = test_data[test_data['feature_label'] != 'Ventilator Mode']

print(train_data['feature_label'].unique())
print(test_data['feature_label'].unique())

['Inspired O2 Fraction' 'Tidal Volume (observed)'
 'Tidal Volume (spontaneous)' 'Minute Volume' 'Peak Insp. Pressure'
 'Mean Airway Pressure' 'Heart Rate' 'Respiratory Rate'
 'GCS - Eye Opening' 'GCS - Motor Response' 'O2 saturation pulseoxymetry'
 'Richmond-RAS Scale' 'Arterial Blood Pressure systolic'
 'Arterial Blood Pressure diastolic' 'Arterial Blood Pressure mean'
 'Temperature Fahrenheit' 'Arterial O2 pressure' 'Arterial CO2 Pressure'
 'PH (Arterial)']
['O2 saturation pulseoxymetry' 'Inspired O2 Fraction'
 'Tidal Volume (observed)' 'Tidal Volume (spontaneous)' 'Minute Volume'
 'Peak Insp. Pressure' 'Mean Airway Pressure' 'Temperature Fahrenheit'
 'Richmond-RAS Scale' 'GCS - Eye Opening' 'GCS - Motor Response'
 'Heart Rate' 'Respiratory Rate' 'Arterial Blood Pressure systolic'
 'Arterial Blood Pressure diastolic' 'Arterial Blood Pressure mean'
 'Arterial O2 pressure' 'Arterial CO2 Pressure' 'PH (Arterial)']


### **Step 5: Outlier Detection and Removal**

We can now remove outliers using the ranges provided by MIMIC or, where these are not available, the mean ± 3 std as the lower and upper bounds.

For features that are inherently categorical but represented numerically, we will not remove outliers, as each value has a specific meaning.

In [None]:
# Load the d_items table
items_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/mimic-iv-2.2-raw-data/icu/d_items.csv'
items_df = pd.read_csv(items_path)
items_df.head()

Unnamed: 0,itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
0,220001,Problem List,Problem List,chartevents,General,,Text,,
1,220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
2,220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
3,220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
4,220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,


In [None]:
# Rename label column to feature label
items_df = items_df.rename(columns={'label': 'feature_label'})
items_df.head()

Unnamed: 0,itemid,feature_label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
0,220001,Problem List,Problem List,chartevents,General,,Text,,
1,220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
2,220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
3,220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
4,220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,


In [None]:
# Look up the MIMIC lownormalvalue and highnormalvalue for each feature in the train data
mimic_ranges = pd.merge(train_data, items_df, on='feature_label', how='left')

# Keep one row per feature with its lownormalvalue and highnormalvalue
mimic_ranges = mimic_ranges[['feature_label', 'lownormalvalue', 'highnormalvalue']].drop_duplicates()

# Features with no range in MIMIC show NaN in both columns; these are filled in below
mimic_ranges

Unnamed: 0,feature_label,lownormalvalue,highnormalvalue
0,Inspired O2 Fraction,,
1,Tidal Volume (observed),299.0,750.0
2,Tidal Volume (spontaneous),299.0,750.0
3,Minute Volume,,12.1
4,Peak Insp. Pressure,,
5,Mean Airway Pressure,,
6,Heart Rate,,
7,Respiratory Rate,,
8,GCS - Eye Opening,,
9,GCS - Motor Response,,


In [None]:
# For each feature we can calculate statistics
train_feature_stats = train_data.groupby('feature_label').agg({
    'valuenum': ['min', 'max', 'mean', 'median', 'std']
}).reset_index()

train_feature_stats

Unnamed: 0_level_0,feature_label,valuenum,valuenum,valuenum,valuenum,valuenum
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean,median,std
0,Arterial Blood Pressure diastolic,0.0,6545.0,60.370794,58.0,56.363856
1,Arterial Blood Pressure mean,-19.0,353.0,81.459553,79.0,18.811019
2,Arterial Blood Pressure systolic,0.0,252.0,123.458819,121.0,24.151498
3,Arterial CO2 Pressure,16.0,89.0,41.04893,40.0,8.622495
4,Arterial O2 pressure,16.0,525.0,112.679409,106.0,38.317571
5,GCS - Eye Opening,1.0,4.0,3.148353,3.0,0.971382
6,GCS - Motor Response,1.0,6.0,5.366133,6.0,1.211888
7,Heart Rate,0.0,182.0,84.905516,83.0,18.082079
8,Inspired O2 Fraction,0.0,100.0,43.433712,40.0,11.918554
9,Mean Airway Pressure,0.0,28.0,7.500339,7.0,3.200701


In [None]:
# Flatten the two-level column index; the key column becomes 'feature_label_'
train_feature_stats.columns = ['_'.join(col).strip() for col in train_feature_stats.columns.values]
train_feature_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   feature_label_   19 non-null     object 
 1   valuenum_min     19 non-null     float64
 2   valuenum_max     19 non-null     float64
 3   valuenum_mean    19 non-null     float64
 4   valuenum_median  19 non-null     float64
 5   valuenum_std     19 non-null     float64
dtypes: float64(5), object(1)
memory usage: 1.0+ KB


In [None]:
merged_ranges = pd.merge(mimic_ranges, train_feature_stats, left_on='feature_label', right_on='feature_label_')
merged_ranges

Unnamed: 0,feature_label,lownormalvalue,highnormalvalue,feature_label_,valuenum_min,valuenum_max,valuenum_mean,valuenum_median,valuenum_std
0,Inspired O2 Fraction,,,Inspired O2 Fraction,0.0,100.0,43.433712,40.0,11.918554
1,Tidal Volume (observed),299.0,750.0,Tidal Volume (observed),0.0,1360.0,472.025328,452.0,163.021589
2,Tidal Volume (spontaneous),299.0,750.0,Tidal Volume (spontaneous),0.0,52540.0,505.749019,446.0,910.968934
3,Minute Volume,,12.1,Minute Volume,0.0,777.0,8.818543,8.3,10.29624
4,Peak Insp. Pressure,,,Peak Insp. Pressure,0.0,2523.0,14.192709,12.0,33.633063
5,Mean Airway Pressure,,,Mean Airway Pressure,0.0,28.0,7.500339,7.0,3.200701
6,Heart Rate,,,Heart Rate,0.0,182.0,84.905516,83.0,18.082079
7,Respiratory Rate,,,Respiratory Rate,0.0,2325.0,19.175607,19.0,15.833371
8,GCS - Eye Opening,,,GCS - Eye Opening,1.0,4.0,3.148353,3.0,0.971382
9,GCS - Motor Response,,,GCS - Motor Response,1.0,6.0,5.366133,6.0,1.211888


Mean ± 3 std calculation

In [None]:
# Create reasonable ranges
k = 3

# Update lownormalvalue and highnormalvalue
merged_ranges['lownormalvalue'] = merged_ranges.apply(
    lambda row: row['lownormalvalue'] if not pd.isna(row['lownormalvalue'])
    else max(row['valuenum_min'], row['valuenum_mean'] - k * row['valuenum_std']),
    axis=1
)

merged_ranges['highnormalvalue'] = merged_ranges.apply(
    lambda row: row['highnormalvalue'] if not pd.isna(row['highnormalvalue'])
    else min(row['valuenum_max'], row['valuenum_mean'] + k * row['valuenum_std']),
    axis=1
)

In [None]:
# Drop the extra columns
merged_train_ranges = merged_ranges[['feature_label','lownormalvalue', 'highnormalvalue']]

merged_train_ranges

Unnamed: 0,feature_label,lownormalvalue,highnormalvalue
0,Inspired O2 Fraction,7.678051,79.189374
1,Tidal Volume (observed),299.0,750.0
2,Tidal Volume (spontaneous),299.0,750.0
3,Minute Volume,0.0,12.1
4,Peak Insp. Pressure,0.0,115.091899
5,Mean Airway Pressure,0.0,17.102442
6,Heart Rate,30.659278,139.151754
7,Respiratory Rate,0.0,66.67572
8,GCS - Eye Opening,1.0,4.0
9,GCS - Motor Response,1.73047,6.0
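
As a worked check of the fallback rule, the Heart Rate bounds can be reproduced by hand from the training-set statistics shown earlier (min 0.0, max 182.0, mean 84.905516, std 18.082079). The snippet is only an illustration of the arithmetic; the variable names are not part of the notebook.

```python
# Heart Rate statistics copied from the training-set summary table
hr_min, hr_max = 0.0, 182.0
hr_mean, hr_std = 84.905516, 18.082079
k = 3

# Fallback rule: mean ± k * std, clipped to the observed minimum and maximum
hr_low = max(hr_min, hr_mean - k * hr_std)
hr_high = min(hr_max, hr_mean + k * hr_std)
print(round(hr_low, 4), round(hr_high, 4))
```

Both bounds fall inside the observed range, so the clipping has no effect for Heart Rate. For a feature such as Peak Insp. Pressure, the lower bound would otherwise be negative and is clipped to the observed minimum of 0.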


**Define categorical features**

From the features that we have left, the inherently categorical features are:
- GCS - Eye Opening
- GCS - Motor Response
- Richmond-RAS Scale

These will be excluded from outlier removal as the bounds are redundant.

These ranges were validated with a doctor and judged reasonable, so we can proceed to replace values outside them with NaNs, to be filled later.

In [None]:
categorical_features = ['GCS - Eye Opening', 'GCS - Motor Response', 'Richmond-RAS Scale']

In [None]:
# Function to set outliers as NaN (modifies df in place and returns it)
def set_outliers_to_nan(df, mimic_df, categorical_features):
    for i, row in mimic_df.iterrows():
        if row['feature_label'] not in categorical_features:
            feature_mask = df['feature_label'] == row['feature_label']
            df.loc[feature_mask & ((df['valuenum'] < row['lownormalvalue']) | (df['valuenum'] > row['highnormalvalue'])), 'valuenum'] = np.nan
    return df
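
For comparison, the same masking can be expressed without a Python loop by merging the bounds onto the data. The sketch below is an alternative under assumptions, not the notebook's method: `mask_outliers` is a hypothetical name, it returns a new frame instead of modifying its input, and it assumes one row of bounds per feature.

```python
import numpy as np
import pandas as pd

def mask_outliers(df, ranges, categorical_features):
    """Return a copy of df with out-of-range numeric values set to NaN."""
    # Keep bounds only for non-categorical features; other rows get NaN bounds
    bounds = ranges.loc[~ranges["feature_label"].isin(categorical_features),
                        ["feature_label", "lownormalvalue", "highnormalvalue"]]
    # validate raises if a feature has more than one row of bounds
    merged = df.merge(bounds, on="feature_label", how="left", validate="many_to_one")
    # Comparisons against NaN bounds are False, so unbounded features are untouched
    out_of_range = ((merged["valuenum"] < merged["lownormalvalue"])
                    | (merged["valuenum"] > merged["highnormalvalue"]))
    result = df.copy()
    # A left merge keeps the row order of df, so the mask lines up by position
    result.loc[out_of_range.to_numpy(), "valuenum"] = np.nan
    return result

# Toy data: one valid and one implausible heart rate, plus a categorical value
toy = pd.DataFrame({
    "feature_label": ["Heart Rate", "Heart Rate", "GCS - Eye Opening"],
    "valuenum": [80.0, 250.0, 9.0],
})
toy_ranges = pd.DataFrame({
    "feature_label": ["Heart Rate", "GCS - Eye Opening"],
    "lownormalvalue": [30.0, 1.0],
    "highnormalvalue": [140.0, 4.0],
})
toy_masked = mask_outliers(toy, toy_ranges, ["GCS - Eye Opening"])
print(toy_masked)
```

Only the implausible heart rate is masked; the categorical value is left alone even though it lies outside its listed range.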

In [None]:
train_data_copy = train_data.copy()
test_data_copy = test_data.copy()

In [None]:
train_data_outliers_removed = set_outliers_to_nan(train_data_copy, merged_train_ranges, categorical_features)
test_data_outliers_removed = set_outliers_to_nan(test_data_copy, merged_train_ranges, categorical_features)

We can now visualise how many data points were outliers.

In [None]:
# Function to calculate the percentage of NaN values per feature
def calculate_nan_stats(df, label_col='feature_label', value_col='valuenum'):
    stats = []
    for label in df[label_col].unique():
        feature_mask = df[label_col] == label
        total_points = feature_mask.sum()
        nan_points = df.loc[feature_mask, value_col].isna().sum()
        percentage_nan = (nan_points / total_points) * 100
        stats.append({
            'label': label,
            'total_data_points': total_points,
            'nan_data_points': nan_points,
            'percentage_nan': percentage_nan
        })
    return pd.DataFrame(stats)

In [None]:
train_nan_stats = calculate_nan_stats(train_data_outliers_removed)
test_nan_stats = calculate_nan_stats(test_data_outliers_removed)

print("Train NaN Stats:")
print(train_nan_stats)
print("\nTest NaN Stats:")
print(test_nan_stats)

Train NaN Stats:
                                label  total_data_points  nan_data_points  \
0                Inspired O2 Fraction               7920              214   
1             Tidal Volume (observed)               6079              843   
2          Tidal Volume (spontaneous)               5154              815   
3                       Minute Volume               6019              657   
4                 Peak Insp. Pressure               5747                1   
5                Mean Airway Pressure               5906               78   
6                          Heart Rate              24856              160   
7                    Respiratory Rate              24737                5   
8                   GCS - Eye Opening               6134                0   
9                GCS - Motor Response               6118                0   
10        O2 saturation pulseoxymetry              24805              195   
11                 Richmond-RAS Scale               4824   

It seems arterial blood pressure values have a lot of outliers. This may be because the bounds provided by MIMIC are too conservative, but we did not want to overrule them.

In [None]:
# Calculate the percentage of NaN values in the train set after outlier removal (includes values that were already missing)
total_data_points = len(train_data_copy)
nan_data_points = train_data_outliers_removed['valuenum'].isna().sum()
percentage_nan_overall = (nan_data_points / total_data_points) * 100

print(f"Percentage of NaN data in train set: {percentage_nan_overall:.2f}%")

Percentage of NaN data in train set: 8.32%


In [None]:
# Calculate the percentage of NaN values in the test set after outlier removal (includes values that were already missing)
total_data_points = len(test_data_copy)
nan_data_points = test_data_outliers_removed['valuenum'].isna().sum()
percentage_nan_overall = (nan_data_points / total_data_points) * 100

print(f"Percentage of NaN data in test set: {percentage_nan_overall:.2f}%")

Percentage of NaN data in test set: 8.31%
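
These percentages count every NaN, and the raw data already contained some missing `valuenum` entries (see the `info()` output in Step 1). To isolate the values masked as outliers, the frames before and after masking can be compared, as in this sketch. `outlier_percentage` is a hypothetical helper and the frames are toy data.

```python
import numpy as np
import pandas as pd

def outlier_percentage(before, after, value_col="valuenum"):
    """Percentage of rows that held a value before masking and are NaN after it."""
    newly_masked = before[value_col].notna() & after[value_col].isna()
    return 100 * newly_masked.sum() / len(before)

# Toy data: one value masked as an outlier, one value missing from the start
toy_before = pd.DataFrame({"valuenum": [80.0, 250.0, np.nan, 90.0]})
toy_after = pd.DataFrame({"valuenum": [80.0, np.nan, np.nan, 90.0]})
toy_pct = outlier_percentage(toy_before, toy_after)
print(f"{toy_pct:.2f}%")
```

In the notebook, `train_data` would serve as the "before" frame, since `set_outliers_to_nan` modifies `train_data_copy` in place.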


To do: give reasons why arterial blood pressure has the most outliers (e.g. physiological variability, conservative bounds).

In [None]:
# Save train and test data
train_data_outliers_removed.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/02_feature_set_2/03_train_data_f2_outliers_removed.parquet')
test_data_outliers_removed.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/02_feature_set_2/03_test_data_f2_outliers_removed.parquet')