# **LightGBM preprocessing**

We reviewed which gradient boosting architecture performs best when applied to predicting extubation failure.

The gradient boosting model serves as a baseline for the more sophisticated time-series models.

GBMs inherently cannot process sequences of data, so any input must be static. As such, it is accepted that the GBM results will not be meaningful with respect to this project's intention of making predictions from time-series data, but they serve as a useful baseline for comparing performance.

The GBM implementations typically used in the literature are XGBoost, LightGBM and CatBoost. Each has its own adaptations, but all are GBMs at heart. To select the baseline for this study, we analysed their use in the literature; LightGBM was the best performer on ROC AUC, the primary metric used in this study.

As such, we will process our patient data for use in a LightGBM classifier that predicts extubation failure.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Load the train and test data**

The following steps have already been applied to the data for the LSTM/TCN models:
- Removed low-observed features
- Split into train and test sets
- Removed outliers

We will use this data so that the train and test sets are identical and the results are as comparable as possible.
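Because the comparison relies on reusing an existing patient-level split, it can be worth confirming that no patient appears in both sets. The following is a minimal sketch on toy data; the helper name `shared_subjects` and the toy frames are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd


def shared_subjects(train: pd.DataFrame, test: pd.DataFrame, id_col: str = "subject_id") -> set:
    """Return the patient identifiers that appear in both splits."""
    return set(train[id_col].tolist()) & set(test[id_col].tolist())


# Toy long-format frames standing in for the real train/test data (hypothetical values)
toy_train = pd.DataFrame({"subject_id": [1, 1, 2], "valuenum": [40.0, 50.0, 6.1]})
toy_test = pd.DataFrame({"subject_id": [3, 3], "valuenum": [17.0, 18.0]})

overlap = shared_subjects(toy_train, toy_test)
print(f"Patients present in both splits: {len(overlap)}")
```

On the real data the same call would be `shared_subjects(train_df, test_df)`; an empty set indicates the split is clean at the patient level.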

In [None]:
# Load the train and test data
train_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/01_preprocessing_v2/03_train_data_standard_preprocess_done.parquet'
test_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/01_feature_set_1/01_preprocessing_v2/03_test_data_standard_preprocess_done.parquet'

train_df = pd.read_parquet(train_path)
test_df = pd.read_parquet(test_path)

train_df.head()

Unnamed: 0,subject_id,itemid,valuenum,time_to_extubation_mins,time_from_window_start,label,extubation_failure
0,10001884,223835,40.0,160.0,200.0,Inspired O2 Fraction,1
1,10001884,224685,,160.0,200.0,Tidal Volume (observed),1
2,10001884,224686,,160.0,200.0,Tidal Volume (spontaneous),1
3,10001884,224687,6.1,160.0,200.0,Minute Volume,1
4,10001884,224695,17.0,160.0,200.0,Peak Insp. Pressure,1




In the previous pre-processing, all outliers were set to NaN. We will simply drop those rows to avoid further imputation. Since we are using mean aggregation, which skips missing values, dropping them is unlikely to have any impact on the aggregated features.
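To illustrate why dropping the NaN rows should not change the aggregated features, the sketch below compares per-patient means with and without the NaN rows. It uses toy values (an assumption for illustration only) and relies on pandas skipping NaN in `mean()` by default.

```python
import numpy as np
import pandas as pd

# Toy long-format data: patient 1 has one value that was set to NaN as an outlier
toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "label": ["Minute Volume"] * 5,
    "valuenum": [6.0, np.nan, 8.0, 10.0, 12.0],
})

# Per-patient mean with the NaN row still present (NaN is skipped by default)
mean_with_nan = toy.groupby("subject_id")["valuenum"].mean()

# Per-patient mean after explicitly dropping the NaN row
mean_after_drop = toy.dropna().groupby("subject_id")["valuenum"].mean()

print(mean_with_nan.equals(mean_after_drop))
```

The one case where the two approaches differ is a patient whose values for a feature are all NaN: that patient-feature pair has no mean either way and is handled later by imputation.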

In [None]:
# Count the number of NaN data points
print('Number of NaN values in train set: ', train_df.isna().sum().sum())
print('Number of NaN values in test set: ', test_df.isna().sum().sum())

Number of NaN values in train set:  2811
Number of NaN values in test set:  682


In [None]:
# Count the number of data points
print('Number of data points in train set: ', train_df.shape[0])
print('Number of data points in test set: ', test_df.shape[0])

Number of data points in train set:  90905
Number of data points in test set:  22894


In [None]:
# Remove any data point that is NaN
train_df = train_df.dropna()
test_df = test_df.dropna()

# Count the number of data points
print('Number of data points in train set: ', train_df.shape[0])
print('Number of data points in test set: ', test_df.shape[0])

Number of data points in train set:  88094
Number of data points in test set:  22212


**Data aggregation**

For each patient we will need a fixed set of features.

We will use mean aggregation to represent each feature, averaging all of its values across the 6-hour window.
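As a small worked example of this long-to-wide aggregation, the sketch below pivots a toy frame (hypothetical values) in the same way as the real data. It also shows that a patient with no observations for a feature ends up with NaN in that column.

```python
import pandas as pd

# Toy long-format data: patient 2 has no Minute Volume observations
toy_long = pd.DataFrame({
    "subject_id": [1, 1, 1, 2],
    "label": ["Respiratory Rate", "Respiratory Rate", "Minute Volume", "Respiratory Rate"],
    "valuenum": [18.0, 22.0, 6.0, 15.0],
})

# One row per patient, one column per feature, each cell the mean over the window
toy_wide = toy_long.pivot_table(
    index="subject_id", columns="label", values="valuenum", aggfunc="mean"
)
toy_wide.columns = ["mean_" + str(col) for col in toy_wide.columns]
toy_wide = toy_wide.reset_index()

print(toy_wide)
```

Patient 1's two respiratory rate readings collapse to a single mean, while patient 2's missing Minute Volume appears as NaN and must be imputed before modelling.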

In [None]:
# Drop itemid column
train_df = train_df.drop(['itemid'], axis=1)
test_df = test_df.drop(['itemid'], axis=1)

In [None]:
train_copy = train_df.copy()
test_copy = test_df.copy()

In [None]:
# Aggregating features for each patient (subject_id) using mean for each label
train_pivoted = train_df.pivot_table(index='subject_id', columns='label', values='valuenum', aggfunc='mean')
test_pivoted = test_df.pivot_table(index='subject_id', columns='label', values='valuenum', aggfunc='mean')

# Rename columns to highlight the mean
train_pivoted.columns = ['mean_' + str(col) for col in train_pivoted.columns.values]
test_pivoted.columns = ['mean_' + str(col) for col in test_pivoted.columns.values]

# Reset index
train_pivoted = train_pivoted.reset_index()
test_pivoted = test_pivoted.reset_index()

In [None]:
train_pivoted.head()

Unnamed: 0,subject_id,mean_Arterial CO2 Pressure,mean_Arterial O2 pressure,mean_Inspired O2 Fraction,mean_Minute Volume,mean_O2 saturation pulseoxymetry,mean_PH (Arterial),mean_Peak Insp. Pressure,mean_Respiratory Rate,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous),mean_Ventilator Mode
0,10001884,,,40.0,6.1,97.666667,,17.0,20.0,,,
1,10002428,43.0,127.0,42.5,9.0,99.833333,7.43,20.5,22.0,380.0,355.25,11.0
2,10004235,37.5,100.5,50.0,10.5,99.333333,7.325,11.0,13.666667,,,11.0
3,10010867,,,46.666667,5.6,97.666667,,16.0,15.333333,467.0,467.0,
4,10011365,,,45.0,9.4,93.166667,,12.0,17.666667,344.0,344.0,


**Add extubation failure label**

In [None]:
# Add extubation failure label
label_df = train_df[['subject_id', 'extubation_failure']].drop_duplicates(subset='subject_id')
train_aggregated = train_pivoted.merge(label_df, on='subject_id', how='left')

label_df = test_df[['subject_id', 'extubation_failure']].drop_duplicates(subset='subject_id')
test_aggregated = test_pivoted.merge(label_df, on='subject_id', how='left')

In [None]:
train_aggregated.head()

Unnamed: 0,subject_id,mean_Arterial CO2 Pressure,mean_Arterial O2 pressure,mean_Inspired O2 Fraction,mean_Minute Volume,mean_O2 saturation pulseoxymetry,mean_PH (Arterial),mean_Peak Insp. Pressure,mean_Respiratory Rate,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous),mean_Ventilator Mode,extubation_failure
0,10001884,,,40.0,6.1,97.666667,,17.0,20.0,,,,1
1,10002428,43.0,127.0,42.5,9.0,99.833333,7.43,20.5,22.0,380.0,355.25,11.0,0
2,10004235,37.5,100.5,50.0,10.5,99.333333,7.325,11.0,13.666667,,,11.0,1
3,10010867,,,46.666667,5.6,97.666667,,16.0,15.333333,467.0,467.0,,0
4,10011365,,,45.0,9.4,93.166667,,12.0,17.666667,344.0,344.0,,1
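The merge above assumes each patient carries a single `extubation_failure` value, since `drop_duplicates(subset='subject_id')` keeps only the first row per patient. A minimal sketch of a consistency check follows; the helper name and the toy data are illustrative assumptions.

```python
import pandas as pd


def patients_with_conflicting_labels(
    df: pd.DataFrame, id_col: str = "subject_id", target_col: str = "extubation_failure"
) -> list:
    """Return the patients whose rows carry more than one distinct label."""
    labels_per_patient = df.groupby(id_col)[target_col].nunique()
    return labels_per_patient[labels_per_patient > 1].index.tolist()


# Toy data in which patient 2 has contradictory labels (hypothetical)
toy = pd.DataFrame({
    "subject_id": [1, 1, 2, 2, 3],
    "extubation_failure": [1, 1, 0, 1, 0],
})

conflicts = patients_with_conflicting_labels(toy)
print(conflicts)
```

An empty list for `train_df` and `test_df` would confirm that keeping the first row per patient loses no information.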


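**Drop Ventilator Mode as categorical data is not useful**

As discussed with Dr Mayur, we will not be assessing Ventilator Mode, as looking at individual ventilation modes is not useful. In addition, the mean of a categorical code over the window does not correspond to a meaningful value.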

In [None]:
# Drop ventilation mode column
train_aggregated = train_aggregated.drop(['mean_Ventilator Mode'], axis=1)
test_aggregated = test_aggregated.drop(['mean_Ventilator Mode'], axis=1)

In [None]:
train_aggregated.head()

Unnamed: 0,subject_id,mean_Arterial CO2 Pressure,mean_Arterial O2 pressure,mean_Inspired O2 Fraction,mean_Minute Volume,mean_O2 saturation pulseoxymetry,mean_PH (Arterial),mean_Peak Insp. Pressure,mean_Respiratory Rate,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous),extubation_failure
0,10001884,,,40.0,6.1,97.666667,,17.0,20.0,,,1
1,10002428,43.0,127.0,42.5,9.0,99.833333,7.43,20.5,22.0,380.0,355.25,0
2,10004235,37.5,100.5,50.0,10.5,99.333333,7.325,11.0,13.666667,,,1
3,10010867,,,46.666667,5.6,97.666667,,16.0,15.333333,467.0,467.0,0
4,10011365,,,45.0,9.4,93.166667,,12.0,17.666667,344.0,344.0,1


**Handle NaN values**

Some patients have no values at all for a given feature. We will handle these NaN values by filling them with the mean of that feature across the patient population.

In [None]:
# See which columns have the most NaNs
print(train_aggregated.isna().sum().sort_values(ascending=False))

mean_Arterial CO2 Pressure          2390
mean_Arterial O2 pressure           2383
mean_PH (Arterial)                  2378
mean_Tidal Volume (spontaneous)      862
mean_Tidal Volume (observed)         529
mean_Minute Volume                   381
mean_Peak Insp. Pressure             194
mean_Inspired O2 Fraction             77
mean_O2 saturation pulseoxymetry      13
mean_Respiratory Rate                  9
subject_id                             0
extubation_failure                     0
dtype: int64


In [None]:
print(test_aggregated.isna().sum().sort_values(ascending=False))

mean_Arterial CO2 Pressure          597
mean_PH (Arterial)                  596
mean_Arterial O2 pressure           593
mean_Tidal Volume (spontaneous)     238
mean_Tidal Volume (observed)        147
mean_Minute Volume                   79
mean_Peak Insp. Pressure             51
mean_Inspired O2 Fraction            17
mean_Respiratory Rate                 7
mean_O2 saturation pulseoxymetry      2
subject_id                            0
extubation_failure                    0
dtype: int64


In [None]:
# Express the NaN counts as a fraction of all patients
print("Train data:")
print(train_aggregated.isna().sum().sort_values(ascending=False)/len(train_aggregated))
print("Test data:")
print(test_aggregated.isna().sum().sort_values(ascending=False)/len(test_aggregated))

Train data:
mean_Arterial CO2 Pressure          0.635638
mean_Arterial O2 pressure           0.633777
mean_PH (Arterial)                  0.632447
mean_Tidal Volume (spontaneous)     0.229255
mean_Tidal Volume (observed)        0.140691
mean_Minute Volume                  0.101330
mean_Peak Insp. Pressure            0.051596
mean_Inspired O2 Fraction           0.020479
mean_O2 saturation pulseoxymetry    0.003457
mean_Respiratory Rate               0.002394
subject_id                          0.000000
extubation_failure                  0.000000
dtype: float64
Test data:
mean_Arterial CO2 Pressure          0.634431
mean_PH (Arterial)                  0.633369
mean_Arterial O2 pressure           0.630181
mean_Tidal Volume (spontaneous)     0.252922
mean_Tidal Volume (observed)        0.156217
mean_Minute Volume                  0.083953
mean_Peak Insp. Pressure            0.054198
mean_Inspired O2 Fraction           0.018066
mean_Respiratory Rate               0.007439
mean_O2 saturatio

For the low-observed features (the arterial blood gas measurements in particular), a significant proportion of patients had no values at all.

This means that a significant amount of data will be imputed, but this is necessary to keep the features analogous to those used in the LSTM/TCN training.

**Note: To avoid data leakage, the means being used to fill the test set NaNs are calculated from the training set.**
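The pattern can be sketched on toy data as follows: the fill values are computed once from the training set and then applied to both sets. The values and the variable name `train_means` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy aggregated frames (hypothetical values)
toy_train = pd.DataFrame({"subject_id": [1, 2, 3], "mean_PH (Arterial)": [7.40, np.nan, 7.30]})
toy_test = pd.DataFrame({"subject_id": [4, 5], "mean_PH (Arterial)": [np.nan, 7.50]})

# Compute the fill values from the training features only
feature_cols = [col for col in toy_train.columns if col != "subject_id"]
train_means = toy_train[feature_cols].mean()

# Apply the same training-set means to both sets
toy_train_filled = toy_train.fillna(train_means)
toy_test_filled = toy_test.fillna(train_means)

print(toy_test_filled)
```

The missing test value is filled with 7.35, the training mean, rather than with anything derived from the test set itself.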

In [None]:
# Compute the per-feature means once, from the training set only
train_means = train_aggregated.mean()

# Fill the training set NaNs with the mean of all training patients
train_aggregated = train_aggregated.fillna(train_means)

# Fill the test set NaNs with the same training-set means to avoid data leakage
test_aggregated = test_aggregated.fillna(train_means)

In [None]:
# Confirm that the fraction of NaN values per column is now zero
print("Train data:")
print(train_aggregated.isna().sum().sort_values(ascending=False)/len(train_aggregated))
print("Test data:")
print(test_aggregated.isna().sum().sort_values(ascending=False)/len(test_aggregated))

Train data:
subject_id                          0.0
mean_Arterial CO2 Pressure          0.0
mean_Arterial O2 pressure           0.0
mean_Inspired O2 Fraction           0.0
mean_Minute Volume                  0.0
mean_O2 saturation pulseoxymetry    0.0
mean_PH (Arterial)                  0.0
mean_Peak Insp. Pressure            0.0
mean_Respiratory Rate               0.0
mean_Tidal Volume (observed)        0.0
mean_Tidal Volume (spontaneous)     0.0
extubation_failure                  0.0
dtype: float64
Test data:
subject_id                          0.0
mean_Arterial CO2 Pressure          0.0
mean_Arterial O2 pressure           0.0
mean_Inspired O2 Fraction           0.0
mean_Minute Volume                  0.0
mean_O2 saturation pulseoxymetry    0.0
mean_PH (Arterial)                  0.0
mean_Peak Insp. Pressure            0.0
mean_Respiratory Rate               0.0
mean_Tidal Volume (observed)        0.0
mean_Tidal Volume (spontaneous)     0.0
extubation_failure                  0.0
dt

In [None]:
train_aggregated.head()

Unnamed: 0,subject_id,mean_Arterial CO2 Pressure,mean_Arterial O2 pressure,mean_Inspired O2 Fraction,mean_Minute Volume,mean_O2 saturation pulseoxymetry,mean_PH (Arterial),mean_Peak Insp. Pressure,mean_Respiratory Rate,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous),extubation_failure
0,10001884,40.689586,110.926241,40.0,6.1,97.666667,7.415708,17.0,20.0,472.138379,470.899154,1
1,10002428,43.0,127.0,42.5,9.0,99.833333,7.43,20.5,22.0,380.0,355.25,0
2,10004235,37.5,100.5,50.0,10.5,99.333333,7.325,11.0,13.666667,472.138379,470.899154,1
3,10010867,40.689586,110.926241,46.666667,5.6,97.666667,7.415708,16.0,15.333333,467.0,467.0,0
4,10011365,40.689586,110.926241,45.0,9.4,93.166667,7.415708,12.0,17.666667,344.0,344.0,1


**Create new features**

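We can now create SpO2:FiO2 and P:F ratio features, which are clinically informative. Note that FiO2 is recorded here as a percentage (e.g. 40.0), so the resulting ratios are 100 times smaller than values calculated with FiO2 expressed as a fraction. Because this is a constant scaling, it does not affect a tree-based model.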
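A minimal sketch of the ratio calculation is given below. The helper `safe_ratio`, the third toy row, and the column `pf_ratio_conventional` are assumptions added for illustration; the notebook itself divides the columns directly after checking that no FiO2 value is 0.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide element-wise, returning NaN wherever the denominator is 0."""
    return numerator / denominator.replace(0, np.nan)


# Toy aggregated data; the last row has a hypothetical FiO2 of 0 to show the guard
toy = pd.DataFrame({
    "mean_Arterial O2 pressure": [127.0, 100.5, 90.0],
    "mean_Inspired O2 Fraction": [42.5, 50.0, 0.0],
})

# P:F ratio with FiO2 as a percentage, matching the scale used in this notebook
toy["mean_P:F ratio"] = safe_ratio(
    toy["mean_Arterial O2 pressure"], toy["mean_Inspired O2 Fraction"]
)

# The same ratio with FiO2 converted to a fraction (illustrative only)
toy["pf_ratio_conventional"] = safe_ratio(
    toy["mean_Arterial O2 pressure"], toy["mean_Inspired O2 Fraction"] / 100
)

print(toy)
```

The two columns differ only by a factor of 100, so either scale yields the same splits in a tree-based model.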

In [None]:
# Before creating the ratio features, check that no FiO2 values are 0 to avoid dividing by 0
print((train_aggregated['mean_Inspired O2 Fraction'] == 0).sum())
print((test_aggregated['mean_Inspired O2 Fraction'] == 0).sum())

0
0


In [None]:
train_aggregated_copy = train_aggregated.copy()
test_aggregated_copy = test_aggregated.copy()

In [None]:
# Create SpO2:FiO2
train_aggregated['mean_SpO2:FiO2'] = train_aggregated['mean_O2 saturation pulseoxymetry'] / train_aggregated['mean_Inspired O2 Fraction']
test_aggregated['mean_SpO2:FiO2'] = test_aggregated['mean_O2 saturation pulseoxymetry'] / test_aggregated['mean_Inspired O2 Fraction']

train_aggregated.head()

Unnamed: 0,subject_id,mean_Arterial CO2 Pressure,mean_Arterial O2 pressure,mean_Inspired O2 Fraction,mean_Minute Volume,mean_O2 saturation pulseoxymetry,mean_PH (Arterial),mean_Peak Insp. Pressure,mean_Respiratory Rate,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous),extubation_failure,mean_SpO2:FiO2
0,10001884,40.689586,110.926241,40.0,6.1,97.666667,7.415708,17.0,20.0,472.138379,470.899154,1,2.441667
1,10002428,43.0,127.0,42.5,9.0,99.833333,7.43,20.5,22.0,380.0,355.25,0,2.34902
2,10004235,37.5,100.5,50.0,10.5,99.333333,7.325,11.0,13.666667,472.138379,470.899154,1,1.986667
3,10010867,40.689586,110.926241,46.666667,5.6,97.666667,7.415708,16.0,15.333333,467.0,467.0,0,2.092857
4,10011365,40.689586,110.926241,45.0,9.4,93.166667,7.415708,12.0,17.666667,344.0,344.0,1,2.07037


In [None]:
# Create P:F ratio
train_aggregated['mean_P:F ratio'] = train_aggregated['mean_Arterial O2 pressure'] / train_aggregated['mean_Inspired O2 Fraction']
test_aggregated['mean_P:F ratio'] = test_aggregated['mean_Arterial O2 pressure'] / test_aggregated['mean_Inspired O2 Fraction']

train_aggregated.head()

Unnamed: 0,subject_id,mean_Arterial CO2 Pressure,mean_Arterial O2 pressure,mean_Inspired O2 Fraction,mean_Minute Volume,mean_O2 saturation pulseoxymetry,mean_PH (Arterial),mean_Peak Insp. Pressure,mean_Respiratory Rate,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous),extubation_failure,mean_SpO2:FiO2,mean_P:F ratio
0,10001884,40.689586,110.926241,40.0,6.1,97.666667,7.415708,17.0,20.0,472.138379,470.899154,1,2.441667,2.773156
1,10002428,43.0,127.0,42.5,9.0,99.833333,7.43,20.5,22.0,380.0,355.25,0,2.34902,2.988235
2,10004235,37.5,100.5,50.0,10.5,99.333333,7.325,11.0,13.666667,472.138379,470.899154,1,1.986667,2.01
3,10010867,40.689586,110.926241,46.666667,5.6,97.666667,7.415708,16.0,15.333333,467.0,467.0,0,2.092857,2.376991
4,10011365,40.689586,110.926241,45.0,9.4,93.166667,7.415708,12.0,17.666667,344.0,344.0,1,2.07037,2.465028


In [None]:
# Move extubation_failure column to the end
cols = train_aggregated.columns.tolist()
cols.remove('extubation_failure')
cols.append('extubation_failure')

train_aggregated = train_aggregated[cols]
test_aggregated = test_aggregated[cols]

train_aggregated.head()

Unnamed: 0,subject_id,mean_Arterial CO2 Pressure,mean_Arterial O2 pressure,mean_Inspired O2 Fraction,mean_Minute Volume,mean_O2 saturation pulseoxymetry,mean_PH (Arterial),mean_Peak Insp. Pressure,mean_Respiratory Rate,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous),mean_SpO2:FiO2,mean_P:F ratio,extubation_failure
0,10001884,40.689586,110.926241,40.0,6.1,97.666667,7.415708,17.0,20.0,472.138379,470.899154,2.441667,2.773156,1
1,10002428,43.0,127.0,42.5,9.0,99.833333,7.43,20.5,22.0,380.0,355.25,2.34902,2.988235,0
2,10004235,37.5,100.5,50.0,10.5,99.333333,7.325,11.0,13.666667,472.138379,470.899154,1.986667,2.01,1
3,10010867,40.689586,110.926241,46.666667,5.6,97.666667,7.415708,16.0,15.333333,467.0,467.0,2.092857,2.376991,0
4,10011365,40.689586,110.926241,45.0,9.4,93.166667,7.415708,12.0,17.666667,344.0,344.0,2.07037,2.465028,1


In [None]:
# Check for any NaNs
print(train_aggregated.isna().sum().sort_values(ascending=False))
print(test_aggregated.isna().sum().sort_values(ascending=False))

subject_id                          0
mean_Arterial CO2 Pressure          0
mean_Arterial O2 pressure           0
mean_Inspired O2 Fraction           0
mean_Minute Volume                  0
mean_O2 saturation pulseoxymetry    0
mean_PH (Arterial)                  0
mean_Peak Insp. Pressure            0
mean_Respiratory Rate               0
mean_Tidal Volume (observed)        0
mean_Tidal Volume (spontaneous)     0
mean_SpO2:FiO2                      0
mean_P:F ratio                      0
extubation_failure                  0
dtype: int64
subject_id                          0
mean_Arterial CO2 Pressure          0
mean_Arterial O2 pressure           0
mean_Inspired O2 Fraction           0
mean_Minute Volume                  0
mean_O2 saturation pulseoxymetry    0
mean_PH (Arterial)                  0
mean_Peak Insp. Pressure            0
mean_Respiratory Rate               0
mean_Tidal Volume (observed)        0
mean_Tidal Volume (spontaneous)     0
mean_SpO2:FiO2                      0

There are no NaN values and all patients have a mean value for all features. We can now use this data to train a LightGBM model.
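Before training, the identifier and the target need to be separated from the features. The sketch below also renames the feature columns, on the assumption that the LightGBM version in use rejects feature names containing special JSON characters such as the colon in `mean_SpO2:FiO2`; whether this applies depends on the installed version. The helper `sanitise_column` and the toy frame are illustrative, not part of the original notebook.

```python
import re

import pandas as pd


def sanitise_column(name: str) -> str:
    """Replace runs of characters other than letters, digits and '_' with '_'."""
    return re.sub(r"[^0-9A-Za-z_]+", "_", name).strip("_")


# Toy frame reusing a few column names from the aggregated data
toy = pd.DataFrame({
    "subject_id": [10001884, 10002428],
    "mean_SpO2:FiO2": [2.441667, 2.349020],
    "mean_P:F ratio": [2.773156, 2.988235],
    "mean_PH (Arterial)": [7.415708, 7.43],
    "extubation_failure": [1, 0],
})

# subject_id is an identifier, not a feature, so it is excluded from X
X = toy.drop(columns=["subject_id", "extubation_failure"])
X.columns = [sanitise_column(col) for col in X.columns]
y = toy["extubation_failure"]

# Renaming must not collapse two different features onto the same name
assert X.columns.is_unique
print(list(X.columns))
```

Applied to `train_aggregated` and `test_aggregated`, this would yield matching feature matrices and label vectors ready to pass to a classifier.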

In [None]:
train_aggregated.shape

(3760, 14)

In [None]:
test_aggregated.shape

(941, 14)

In [None]:
# Save the data
train_aggregated.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/02_lgbm_data/dynamic_data/train_aggregated.parquet')
test_aggregated.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/02_lgbm_data/dynamic_data/test_aggregated.parquet')