# **LightGBM preprocessing**

We reviewed which gradient boosting architecture performs best when applied to predicting extubation failure.

The gradient boosting model serves as a baseline for the more sophisticated time-series models.

GBMs cannot inherently process sequences of data, so any input must be static. It is therefore accepted that the GBM results will not be meaningful with respect to this project's intention of making predictions from time-series data, but they serve as a useful baseline for comparing performance.

The GBM implementations typically used in the literature are XGBoost, LightGBM and CatBoost. Each has its own adaptations, but all are GBMs at heart. To select the baseline for this study, we analysed their use in the literature; LightGBM was the best performer on ROC AUC, the primary metric used in this study.

We will therefore process our patient data for use in a LightGBM classifier that predicts extubation failure.

LightGBM models do not require features to be scaled, because tree splits depend only on the ordering of feature values.
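As a rough illustration of why scaling is unnecessary for tree-based models, the sketch below finds the best single threshold split on one feature before and after standardisation and checks that both produce the same partition of the samples. It is a plain NumPy decision stump on made-up data, not LightGBM's histogram-based split finder, and `best_split_partition` is a hypothetical helper written for this example.

```python
import numpy as np

def best_split_partition(x, y):
    """Return a boolean mask marking the left side of the best threshold split.

    The split minimises the summed within-group variance of y, which only
    depends on which samples fall on each side, not on the values of x.
    """
    order = np.argsort(x, kind="stable")
    best_score, best_mask = np.inf, None
    for i in range(1, len(x)):
        # A threshold can only sit between two distinct feature values
        if x[order[i]] == x[order[i - 1]]:
            continue
        left, right = order[:i], order[i:]
        score = y[left].var() * len(left) + y[right].var() * len(right)
        if score < best_score:
            best_score = score
            best_mask = np.zeros(len(x), dtype=bool)
            best_mask[left] = True
    return best_mask

# Made-up data: the target depends on a threshold on x
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = (x > 0.3).astype(float)

mask_raw = best_split_partition(x, y)
mask_scaled = best_split_partition((x - x.mean()) / x.std(), y)

# Standardisation preserves the ordering of x, so the partition is unchanged
same_partition = bool(np.array_equal(mask_raw, mask_scaled))
print(same_partition)
```

Any strictly increasing transformation of a feature would give the same result, which is why no scaling step appears in this notebook.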

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Load the train and test data**

The following steps have already been applied to the data for the LSTM/TCN models:
- Removed low-observed features
- Split into train and test sets
- Removed outliers

We will use this data so that the train and test sets are identical and the results are as comparable as possible.

In [3]:
# Load the train and test data
train_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/02_feature_set_2/03_train_data_f2_outliers_removed.parquet'
test_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/02_feature_set_2/03_test_data_f2_outliers_removed.parquet'

train_df = pd.read_parquet(train_path)
test_df = pd.read_parquet(test_path)

train_df.head()

Unnamed: 0,subject_id,valuenum,time_from_window_start_mins,feature_label,extubation_failure
0,10001884,40.0,200.0,Inspired O2 Fraction,1
1,10001884,,200.0,Tidal Volume (observed),1
2,10001884,,200.0,Tidal Volume (spontaneous),1
3,10001884,6.1,200.0,Minute Volume,1
4,10001884,17.0,200.0,Peak Insp. Pressure,1


**Remove NaN values**

In the previous pre-processing, all outliers were set to NaN. We will simply remove those values to avoid further imputation. Furthermore, since we are using mean aggregation, removing them is unlikely to have any impact.
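The claim that dropping NaN rows has little effect under mean aggregation can be checked on a toy example. The sketch below uses made-up values in the same long format as the notebook's data and compares the per-patient mean with and without the NaN rows; pandas skips NaN values by default when computing a mean.

```python
import numpy as np
import pandas as pd

# Toy long-format data mirroring the notebook's columns (values are made up)
toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "feature_label": ["Heart Rate"] * 5,
    "valuenum": [80.0, np.nan, 90.0, 70.0, np.nan],
})

# Per-patient mean with the NaN rows still present (NaNs are skipped)
mean_with_nan = toy.groupby("subject_id")["valuenum"].mean()

# Per-patient mean after dropping the NaN rows first
mean_after_drop = toy.dropna().groupby("subject_id")["valuenum"].mean()

print(mean_with_nan.equals(mean_after_drop))
```

The two results only differ when every value for a patient and feature is NaN, in which case the aggregated value is missing either way.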

In [4]:
# Count the number of NaN data points
print('Number of NaN values in train set: ', train_df.isna().sum().sum())
print('Number of NaN values in test set: ', test_df.isna().sum().sum())

Number of NaN values in train set:  15142
Number of NaN values in test set:  3815


In [5]:
# Count the number of data points
print('Number of data points in train set: ', train_df.shape[0])
print('Number of data points in test set: ', test_df.shape[0])

Number of data points in train set:  181953
Number of data points in test set:  45925


In [6]:
# Remove any data point that is NaN
train_df = train_df.dropna()
test_df = test_df.dropna()

# Count the number of data points
print('Number of data points in train set: ', train_df.shape[0])
print('Number of data points in test set: ', test_df.shape[0])

Number of data points in train set:  166811
Number of data points in test set:  42110


**Data aggregation**

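For each patient we need a fixed set of features.

We use mean aggregation to represent the numerical features, averaging all values across the 6-hour window. For the categorical features (GCS - Eye Opening, GCS - Motor Response and Richmond-RAS Scale) we take the mode instead.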
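The reshaping from long format (one row per measurement) to wide format (one row per patient) can be sketched on a toy table. The values below are made up, and `most_frequent` is a hypothetical helper that uses `Series.mode` in place of `scipy.stats.mode`; on ties both return the smallest value.

```python
import pandas as pd

# Toy long-format data (made-up values): one row per measurement
toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2, 2],
    "feature_label": ["Heart Rate", "Heart Rate", "GCS - Eye Opening",
                      "Heart Rate", "GCS - Eye Opening", "GCS - Eye Opening"],
    "valuenum": [80.0, 90.0, 4.0, 70.0, 3.0, 3.0],
})

categorical = ["GCS - Eye Opening"]
is_cat = toy["feature_label"].isin(categorical)

def most_frequent(values: pd.Series) -> float:
    """Return the most frequent value (smallest one on ties)."""
    return values.mode().iloc[0]

# Numerical features: one mean per patient and feature
wide_mean = toy[~is_cat].pivot_table(index="subject_id",
                                     columns="feature_label",
                                     values="valuenum",
                                     aggfunc="mean").add_prefix("mean_")

# Categorical features: one mode per patient and feature
wide_mode = toy[is_cat].pivot_table(index="subject_id",
                                    columns="feature_label",
                                    values="valuenum",
                                    aggfunc=most_frequent).add_prefix("mode_")

# One row per patient, one column per aggregated feature
wide = wide_mode.join(wide_mean, how="outer").reset_index()
print(wide)
```

The outer join keeps patients that have only numerical or only categorical measurements, leaving NaN in the columns they lack.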

In [7]:
train_copy = train_df.copy()
test_copy = test_df.copy()

In [10]:
from scipy import stats

In [8]:
categorical_features = ['GCS - Eye Opening', 'GCS - Motor Response', 'Richmond-RAS Scale']

In [11]:
# Features treated as categorical: aggregated with the mode rather than the mean
categorical_features = ['GCS - Eye Opening', 'GCS - Motor Response', 'Richmond-RAS Scale']

# Filter categorical features from the DataFrame
cat_train = train_df[train_df['feature_label'].isin(categorical_features)]
cat_test = test_df[test_df['feature_label'].isin(categorical_features)]

# Calculate mode for categorical features
# keepdims=False makes stats.mode return the mode as a scalar; [0] selects the mode from the result
def mode_or_none(x):
    return stats.mode(x, keepdims=False)[0] if len(x) > 0 else None

cat_train_pivoted = cat_train.pivot_table(index='subject_id',
                                          columns='feature_label',
                                          values='valuenum',
                                          aggfunc=mode_or_none)
cat_test_pivoted = cat_test.pivot_table(index='subject_id',
                                        columns='feature_label',
                                        values='valuenum',
                                        aggfunc=mode_or_none)

# Rename columns to highlight the mode
cat_train_pivoted.columns = ['mode_' + str(col) for col in cat_train_pivoted.columns.values]
cat_test_pivoted.columns = ['mode_' + str(col) for col in cat_test_pivoted.columns.values]

# Reset index
cat_train_pivoted = cat_train_pivoted.reset_index()
cat_test_pivoted = cat_test_pivoted.reset_index()

# Display the result
cat_train_pivoted.head()
cat_test_pivoted.head()

Unnamed: 0,subject_id,mode_GCS - Eye Opening,mode_GCS - Motor Response,mode_Richmond-RAS Scale
0,10004720,4.0,2.0,-5.0
1,10004733,,,-4.0
2,10005817,4.0,4.0,-1.0
3,10022620,4.0,6.0,0.0
4,10037861,1.0,1.0,-5.0


In [12]:
# Filter numerical features from the DataFrame
num_train = train_df[~train_df['feature_label'].isin(categorical_features)]
num_test = test_df[~test_df['feature_label'].isin(categorical_features)]

# Calculate mean for numerical features
num_train_pivoted = num_train.pivot_table(index='subject_id',
                                          columns='feature_label',
                                          values='valuenum',
                                          aggfunc='mean')
num_test_pivoted = num_test.pivot_table(index='subject_id',
                                        columns='feature_label',
                                        values='valuenum',
                                        aggfunc='mean')

# Rename columns to highlight the mean
num_train_pivoted.columns = ['mean_' + str(col) for col in num_train_pivoted.columns.values]
num_test_pivoted.columns = ['mean_' + str(col) for col in num_test_pivoted.columns.values]

# Reset index
num_train_pivoted = num_train_pivoted.reset_index()
num_test_pivoted = num_test_pivoted.reset_index()

# Display the result
num_train_pivoted.head()
num_test_pivoted.head()

Unnamed: 0,subject_id,mean_Arterial Blood Pressure diastolic,mean_Arterial Blood Pressure mean,mean_Arterial Blood Pressure systolic,mean_Arterial CO2 Pressure,mean_Arterial O2 pressure,mean_Heart Rate,mean_Inspired O2 Fraction,mean_Mean Airway Pressure,mean_Minute Volume,mean_O2 saturation pulseoxymetry,mean_PH (Arterial),mean_Peak Insp. Pressure,mean_Respiratory Rate,mean_Temperature Fahrenheit,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous)
0,10004720,,,,,,86.428571,40.0,7.25,7.3,95.166667,,11.5,26.714286,99.2,340.0,340.0
1,10004733,,,,,,70.5,35.0,7.0,5.0,97.833333,,20.0,12.75,99.05,515.0,
2,10005817,62.0,79.875,131.0,35.5,108.5,84.75,40.0,9.233333,9.35,97.714286,7.555,18.0,18.875,98.4,523.0,507.0
3,10013643,66.8,76.125,105.8,37.5,129.5,91.0,40.0,5.666667,6.466667,98.375,7.34,14.666667,18.25,98.5,501.666667,550.5
4,10022620,,,,,,121.166667,40.0,4.333333,6.433333,98.833333,,9.333333,20.833333,98.75,577.5,468.75


In [13]:
# Merge the categorical and numerical pivoted DataFrames
train_combined = pd.merge(cat_train_pivoted, num_train_pivoted, on='subject_id', how='outer')
test_combined = pd.merge(cat_test_pivoted, num_test_pivoted, on='subject_id', how='outer')

# Display the combined result
train_combined.head()
test_combined.head()

Unnamed: 0,subject_id,mode_GCS - Eye Opening,mode_GCS - Motor Response,mode_Richmond-RAS Scale,mean_Arterial Blood Pressure diastolic,mean_Arterial Blood Pressure mean,mean_Arterial Blood Pressure systolic,mean_Arterial CO2 Pressure,mean_Arterial O2 pressure,mean_Heart Rate,mean_Inspired O2 Fraction,mean_Mean Airway Pressure,mean_Minute Volume,mean_O2 saturation pulseoxymetry,mean_PH (Arterial),mean_Peak Insp. Pressure,mean_Respiratory Rate,mean_Temperature Fahrenheit,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous)
0,10004720,4.0,2.0,-5.0,,,,,,86.428571,40.0,7.25,7.3,95.166667,,11.5,26.714286,99.2,340.0,340.0
1,10004733,,,-4.0,,,,,,70.5,35.0,7.0,5.0,97.833333,,20.0,12.75,99.05,515.0,
2,10005817,4.0,4.0,-1.0,62.0,79.875,131.0,35.5,108.5,84.75,40.0,9.233333,9.35,97.714286,7.555,18.0,18.875,98.4,523.0,507.0
3,10022620,4.0,6.0,0.0,,,,,,121.166667,40.0,4.333333,6.433333,98.833333,,9.333333,20.833333,98.75,577.5,468.75
4,10037861,1.0,1.0,-5.0,68.0,77.0,126.166667,,,99.333333,30.0,11.0,10.3,98.333333,,21.0,22.0,,486.0,486.0


**Add extubation failure label**

In [14]:
# Add extubation failure label
label_df = train_df[['subject_id', 'extubation_failure']].drop_duplicates(subset='subject_id')
train_aggregated = train_combined.merge(label_df, on='subject_id', how='left')

label_df = test_df[['subject_id', 'extubation_failure']].drop_duplicates(subset='subject_id')
test_aggregated = test_combined.merge(label_df, on='subject_id', how='left')

In [15]:
train_aggregated.head()

Unnamed: 0,subject_id,mode_GCS - Eye Opening,mode_GCS - Motor Response,mode_Richmond-RAS Scale,mean_Arterial Blood Pressure diastolic,mean_Arterial Blood Pressure mean,mean_Arterial Blood Pressure systolic,mean_Arterial CO2 Pressure,mean_Arterial O2 pressure,mean_Heart Rate,...,mean_Mean Airway Pressure,mean_Minute Volume,mean_O2 saturation pulseoxymetry,mean_PH (Arterial),mean_Peak Insp. Pressure,mean_Respiratory Rate,mean_Temperature Fahrenheit,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous),extubation_failure
0,10001884,3.0,6.0,-1.0,,,,,,74.75,...,7.6,6.1,97.666667,,17.0,20.0,,,,1
1,10002428,3.0,6.0,0.0,61.666667,81.166667,118.5,43.0,127.0,105.833333,...,11.5,9.0,99.833333,7.43,20.5,22.0,98.366667,380.0,355.25,0
2,10004235,4.0,6.0,,,,,37.5,100.5,104.833333,...,6.0,10.5,99.333333,7.325,11.0,13.666667,,,,1
3,10010867,2.0,4.0,-4.0,,,,,,99.0,...,8.9,5.6,97.666667,,16.0,15.333333,99.3,467.0,467.0,0
4,10011365,,,0.0,,,,,,88.166667,...,7.5,9.4,93.166667,,12.0,17.666667,98.8,344.0,344.0,1


**Handle NaN values**

We need to handle the cases where a patient has no values for a feature.

These features usually correspond to those with the fewest observations.

We will fill NaN values in the numerical features with the mean across the patient population, and NaN values in the categorical features with the population mode.

In [16]:
# See which columns have the most NaNs
print(train_aggregated.isna().sum().sort_values(ascending=False))

mean_Arterial CO2 Pressure                2390
mean_Arterial O2 pressure                 2383
mean_PH (Arterial)                        2378
mean_Arterial Blood Pressure diastolic    2204
mean_Arterial Blood Pressure systolic     1756
mean_Arterial Blood Pressure mean         1622
mean_Tidal Volume (spontaneous)            862
mode_Richmond-RAS Scale                    822
mean_Tidal Volume (observed)               529
mean_Temperature Fahrenheit                481
mean_Minute Volume                         381
mode_GCS - Motor Response                  269
mode_GCS - Eye Opening                     257
mean_Peak Insp. Pressure                   194
mean_Mean Airway Pressure                  158
mean_Inspired O2 Fraction                   77
mean_O2 saturation pulseoxymetry            13
mean_Respiratory Rate                        9
mean_Heart Rate                              1
subject_id                                   0
extubation_failure                           0
dtype: int64


In [17]:
print(test_aggregated.isna().sum().sort_values(ascending=False))

mean_Arterial CO2 Pressure                597
mean_PH (Arterial)                        596
mean_Arterial O2 pressure                 593
mean_Arterial Blood Pressure diastolic    538
mean_Arterial Blood Pressure systolic     458
mean_Arterial Blood Pressure mean         411
mean_Tidal Volume (spontaneous)           238
mode_Richmond-RAS Scale                   208
mean_Tidal Volume (observed)              147
mean_Temperature Fahrenheit               123
mean_Minute Volume                         79
mode_GCS - Motor Response                  62
mode_GCS - Eye Opening                     58
mean_Peak Insp. Pressure                   51
mean_Mean Airway Pressure                  39
mean_Inspired O2 Fraction                  17
mean_Respiratory Rate                       7
mean_O2 saturation pulseoxymetry            2
mean_Heart Rate                             2
subject_id                                  0
extubation_failure                          0
dtype: int64


In [18]:
# Express the NaN counts as a fraction of all patients
print("Train data:")
print(train_aggregated.isna().sum().sort_values(ascending=False)/len(train_aggregated))
print("Test data:")
print(test_aggregated.isna().sum().sort_values(ascending=False)/len(test_aggregated))

Train data:
mean_Arterial CO2 Pressure                0.635638
mean_Arterial O2 pressure                 0.633777
mean_PH (Arterial)                        0.632447
mean_Arterial Blood Pressure diastolic    0.586170
mean_Arterial Blood Pressure systolic     0.467021
mean_Arterial Blood Pressure mean         0.431383
mean_Tidal Volume (spontaneous)           0.229255
mode_Richmond-RAS Scale                   0.218617
mean_Tidal Volume (observed)              0.140691
mean_Temperature Fahrenheit               0.127926
mean_Minute Volume                        0.101330
mode_GCS - Motor Response                 0.071543
mode_GCS - Eye Opening                    0.068351
mean_Peak Insp. Pressure                  0.051596
mean_Mean Airway Pressure                 0.042021
mean_Inspired O2 Fraction                 0.020479
mean_O2 saturation pulseoxymetry          0.003457
mean_Respiratory Rate                     0.002394
mean_Heart Rate                           0.000266
subject_id         

For the low-observed features, a large share of patients had no values at all (over 60% of training patients for the arterial CO2, O2 and pH features).

This means that a significant amount of data will be imputed, but this is necessary to keep the features analogous to those used in the LSTM/TCN training.

**Note: To avoid data leakage, the means being used to fill the test set NaNs are calculated from the training set.**
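One way to make the leakage rule explicit is to learn the fill values once from the training set and then apply the same values to both sets. The sketch below does this on made-up data; `fit_fill_values` is a hypothetical helper, and it assumes the `mean_` / `mode_` column prefixes used in this notebook.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train: pd.DataFrame) -> pd.Series:
    """Learn one fill value per feature column from the training set only."""
    fills = {}
    for col in train.columns:
        if col.startswith("mode_"):
            # Categorical feature: most frequent training value
            fills[col] = train[col].mode().iloc[0]
        elif col.startswith("mean_"):
            # Numerical feature: training mean (NaNs are skipped)
            fills[col] = train[col].mean()
    return pd.Series(fills)

# Toy aggregated tables (made-up values)
train = pd.DataFrame({
    "mode_GCS - Eye Opening": [4.0, 4.0, np.nan],
    "mean_Heart Rate": [80.0, np.nan, 100.0],
})
test = pd.DataFrame({
    "mode_GCS - Eye Opening": [np.nan, 1.0],
    "mean_Heart Rate": [np.nan, 60.0],
})

fill_values = fit_fill_values(train)

# The same training-derived values fill both sets, so no test statistics are used
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
print(test_filled)
```

Computing the fill values before any filling also makes it clear that they do not depend on the order in which the two sets are processed.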

In [19]:
# Assuming train_combined and test_combined are the merged DataFrames from previous steps
categorical_features = ['GCS - Eye Opening', 'GCS - Motor Response', 'Richmond-RAS Scale']  # Add your categorical features here

# Separate categorical and numerical columns in train_combined
cat_columns = [col for col in train_combined.columns if 'mode_' in col]
num_columns = [col for col in train_combined.columns if 'mean_' in col]

# Fill NaN values with the mean for numerical features in train_combined
train_combined[num_columns] = train_combined[num_columns].fillna(train_combined[num_columns].mean())

# Fill NaN values with the mode for categorical features in train_combined
for col in cat_columns:
    mode_value = train_combined[col].mode()[0] if not train_combined[col].mode().empty else None
    train_combined[col] = train_combined[col].fillna(mode_value)

# For test_combined, fill NaNs with the mean of train_combined for numerical features
test_combined[num_columns] = test_combined[num_columns].fillna(train_combined[num_columns].mean())

# For test_combined, fill NaNs with the mode of train_combined for categorical features
for col in cat_columns:
    mode_value = train_combined[col].mode()[0] if not train_combined[col].mode().empty else None
    test_combined[col] = test_combined[col].fillna(mode_value)

# Display the final DataFrames
train_combined.head()
test_combined.head()

Unnamed: 0,subject_id,mode_GCS - Eye Opening,mode_GCS - Motor Response,mode_Richmond-RAS Scale,mean_Arterial Blood Pressure diastolic,mean_Arterial Blood Pressure mean,mean_Arterial Blood Pressure systolic,mean_Arterial CO2 Pressure,mean_Arterial O2 pressure,mean_Heart Rate,mean_Inspired O2 Fraction,mean_Mean Airway Pressure,mean_Minute Volume,mean_O2 saturation pulseoxymetry,mean_PH (Arterial),mean_Peak Insp. Pressure,mean_Respiratory Rate,mean_Temperature Fahrenheit,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous)
0,10004720,4.0,2.0,-5.0,68.416171,81.300472,117.652355,40.689586,110.926241,86.428571,40.0,7.25,7.3,95.166667,7.415708,11.5,26.714286,99.2,340.0,340.0
1,10004733,4.0,6.0,-4.0,68.416171,81.300472,117.652355,40.689586,110.926241,70.5,35.0,7.0,5.0,97.833333,7.415708,20.0,12.75,99.05,515.0,470.899154
2,10005817,4.0,4.0,-1.0,62.0,79.875,131.0,35.5,108.5,84.75,40.0,9.233333,9.35,97.714286,7.555,18.0,18.875,98.4,523.0,507.0
3,10022620,4.0,6.0,0.0,68.416171,81.300472,117.652355,40.689586,110.926241,121.166667,40.0,4.333333,6.433333,98.833333,7.415708,9.333333,20.833333,98.75,577.5,468.75
4,10037861,1.0,1.0,-5.0,68.0,77.0,126.166667,40.689586,110.926241,99.333333,30.0,11.0,10.3,98.333333,7.415708,21.0,22.0,98.862106,486.0,486.0


In [20]:
# Confirm that the fraction of NaN values per column is now zero
print("Train data:")
print(train_combined.isna().sum().sort_values(ascending=False)/len(train_combined))
print("Test data:")
print(test_combined.isna().sum().sort_values(ascending=False)/len(test_combined))

Train data:
subject_id                                0.0
mode_GCS - Eye Opening                    0.0
mean_Tidal Volume (observed)              0.0
mean_Temperature Fahrenheit               0.0
mean_Respiratory Rate                     0.0
mean_Peak Insp. Pressure                  0.0
mean_PH (Arterial)                        0.0
mean_O2 saturation pulseoxymetry          0.0
mean_Minute Volume                        0.0
mean_Mean Airway Pressure                 0.0
mean_Inspired O2 Fraction                 0.0
mean_Heart Rate                           0.0
mean_Arterial O2 pressure                 0.0
mean_Arterial CO2 Pressure                0.0
mean_Arterial Blood Pressure systolic     0.0
mean_Arterial Blood Pressure mean         0.0
mean_Arterial Blood Pressure diastolic    0.0
mode_Richmond-RAS Scale                   0.0
mode_GCS - Motor Response                 0.0
mean_Tidal Volume (spontaneous)           0.0
dtype: float64
Test data:
subject_id                                

We will not create new features, in keeping with what was done for the dynamic data. Because the features needed to derive new ones were sampled at different rates, deriving them would create more synthetic data, which we want to avoid.

There are no NaN values remaining, and every patient has an aggregated value (mean or mode) for every feature. We can now use this data to train a LightGBM model.
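Before saving, the properties stated above could be verified programmatically. The following is a minimal sketch on made-up data; `summarise_model_input` is a hypothetical helper, not part of the original notebook, and it assumes the patient identifier column is named `subject_id`.

```python
import pandas as pd

def summarise_model_input(df: pd.DataFrame, id_col: str = "subject_id") -> dict:
    """Report basic integrity checks on an aggregated, imputed table."""
    return {
        "n_rows": len(df),
        "no_missing_values": not df.isna().any().any(),
        "one_row_per_patient": df[id_col].is_unique,
    }

# Toy aggregated table (made-up values)
toy = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "mode_GCS - Eye Opening": [4.0, 3.0, 4.0],
    "mean_Heart Rate": [85.0, 70.0, 92.5],
    "extubation_failure": [1, 0, 0],
})

report = summarise_model_input(toy)
print(report)
```

Running the same check on both the train and test tables would catch a patient duplicated by a merge or a column that was missed during imputation.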

In [21]:
# Re-attach the extubation failure label to the imputed data
label_df = train_df[['subject_id', 'extubation_failure']].drop_duplicates(subset='subject_id')
train_aggregated = train_combined.merge(label_df, on='subject_id', how='left')

label_df = test_df[['subject_id', 'extubation_failure']].drop_duplicates(subset='subject_id')
test_aggregated = test_combined.merge(label_df, on='subject_id', how='left')

In [26]:
train_aggregated.head()

Unnamed: 0,subject_id,mode_GCS - Eye Opening,mode_GCS - Motor Response,mode_Richmond-RAS Scale,mean_Arterial Blood Pressure diastolic,mean_Arterial Blood Pressure mean,mean_Arterial Blood Pressure systolic,mean_Arterial CO2 Pressure,mean_Arterial O2 pressure,mean_Heart Rate,...,mean_Mean Airway Pressure,mean_Minute Volume,mean_O2 saturation pulseoxymetry,mean_PH (Arterial),mean_Peak Insp. Pressure,mean_Respiratory Rate,mean_Temperature Fahrenheit,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous),extubation_failure
0,10001884,3.0,6.0,-1.0,68.416171,81.300472,117.652355,40.689586,110.926241,74.75,...,7.6,6.1,97.666667,7.415708,17.0,20.0,98.862106,472.138379,470.899154,1
1,10002428,3.0,6.0,0.0,61.666667,81.166667,118.5,43.0,127.0,105.833333,...,11.5,9.0,99.833333,7.43,20.5,22.0,98.366667,380.0,355.25,0
2,10004235,4.0,6.0,-1.0,68.416171,81.300472,117.652355,37.5,100.5,104.833333,...,6.0,10.5,99.333333,7.325,11.0,13.666667,98.862106,472.138379,470.899154,1
3,10010867,2.0,4.0,-4.0,68.416171,81.300472,117.652355,40.689586,110.926241,99.0,...,8.9,5.6,97.666667,7.415708,16.0,15.333333,99.3,467.0,467.0,0
4,10011365,4.0,6.0,0.0,68.416171,81.300472,117.652355,40.689586,110.926241,88.166667,...,7.5,9.4,93.166667,7.415708,12.0,17.666667,98.8,344.0,344.0,1


In [22]:
train_aggregated.shape

(3760, 21)

In [23]:
test_aggregated.shape

(941, 21)

In [24]:
# Save the data
train_aggregated.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/02_feature_set_2/02_lgbm_data/train_aggregated_v2.parquet')
test_aggregated.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/02_feature_set_2/02_lgbm_data/test_aggregated_v2.parquet')