# **LightGBM preprocessing**

We looked at which Gradient Boosting architecture performed best when applied to predicting extubation failure.

The Gradient Boosting model serves as a baseline for the more sophisticated time-series models.

GBMs inherently cannot process sequences of data, so any input must be static. As such, it is accepted that the GBM results will not be meaningful with regard to this project's intention of making predictions from time-series data, but they serve as a useful baseline for comparing performance.

The GBM models typically used in the literature are XGBoost, LightGBM and CatBoost. Each has its own adaptations, but all are GBMs at heart. To select the baseline for this study, we analysed their use in the literature; LightGBM was the best performer on ROC AUC, the primary metric used in this study.

As such, we will process our patient data for use in a LightGBM for classification prediction of extubation failure.

LightGBM models do not require features to be scaled, because tree splits depend only on the ordering of a feature's values and not on their magnitude.
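To illustrate why scaling is unnecessary for tree-based models, the sketch below fits a toy single-split "stump" to a raw and a standardised copy of the same feature and compares the resulting partitions. This is a simplified stand-in, not LightGBM's histogram-based algorithm; the function `best_split_partition` and the synthetic heart-rate data are assumptions made for illustration only.

```python
import numpy as np

def best_split_partition(x, y):
    """Return the boolean partition (x <= threshold) of the best single split.

    Candidate thresholds are scored by the summed squared error of the two
    sides, as in a plain regression tree.
    """
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_score, best_threshold = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold fits between identical values
        left, right = ys[:i], ys[i:]
        score = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if score < best_score:
            best_score, best_threshold = score, (xs[i] + xs[i - 1]) / 2
    return x <= best_threshold

rng = np.random.default_rng(0)
heart_rate = rng.normal(85, 15, size=200)    # synthetic values, illustration only
target = (heart_rate > 100).astype(float)    # synthetic label

raw_partition = best_split_partition(heart_rate, target)
scaled = (heart_rate - heart_rate.mean()) / heart_rate.std()
scaled_partition = best_split_partition(scaled, target)

# True when both versions of the feature separate the samples identically
print(np.array_equal(raw_partition, scaled_partition))
```

Standardisation preserves the ordering of the values, so the same samples fall on each side of the split even though the threshold itself changes.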

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Load the train and test data**

The following has already been applied to the data for LSTM/TCN models:
- Removed low observed features
- Split into train and test sets
- Removed outliers

We will reuse this data so that we have the same train and test sets and the results are as comparable as possible.

In [4]:
# Load the train and test data
train_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/03_feature_set_3/pre_processing/03_train_data_f3_outliers_removed_v2.parquet'
test_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/07_data_preprocessing/03_feature_set_3/pre_processing/03_test_data_f3_outliers_removed_v2.parquet'

train_df = pd.read_parquet(train_path)
test_df = pd.read_parquet(test_path)

train_df.head()

Unnamed: 0,subject_id,valuenum,time_from_window_start_mins,feature_label,extubation_failure
0,10001884,40.0,200.0,Inspired O2 Fraction,1
1,10001884,,200.0,Tidal Volume (observed),1
2,10001884,,200.0,Tidal Volume (spontaneous),1
3,10001884,6.1,200.0,Minute Volume,1
4,10001884,17.0,200.0,Peak Insp. Pressure,1


In [6]:
# Count the number of unique feature labels
print('Number of unique feature labels in train set: ', train_df['feature_label'].nunique())
print('Number of unique feature labels in test set: ', test_df['feature_label'].nunique())

Number of unique feature labels in train set:  34
Number of unique feature labels in test set:  34


In [7]:
# List the features
print('Features in train set: ', train_df['feature_label'].unique())
print('Features in test set: ', test_df['feature_label'].unique())

Features in train set:  ['Inspired O2 Fraction' 'Tidal Volume (observed)'
 'Tidal Volume (spontaneous)' 'Minute Volume' 'Peak Insp. Pressure'
 'Mean Airway Pressure' 'EtCO2' 'Heart Rate' 'Respiratory Rate'
 'GCS - Eye Opening' 'GCS - Motor Response' 'O2 saturation pulseoxymetry'
 'Richmond-RAS Scale' 'Arterial Blood Pressure systolic'
 'Arterial Blood Pressure diastolic' 'Arterial Blood Pressure mean'
 'Temperature Fahrenheit' 'Hematocrit (serum)' 'Sodium (serum)'
 'Potassium (serum)' 'Arterial O2 pressure' 'Arterial CO2 Pressure'
 'PH (Arterial)' 'Arterial Base Excess' 'Arterial O2 Saturation'
 'Ionized Calcium' 'Lactic Acid' 'Hemoglobin' 'WBC' 'Creatinine (serum)'
 'Glucose (serum)' 'Platelet Count' 'Plateau Pressure'
 'Cardiac Output (CCO)']
Features in test set:  ['O2 saturation pulseoxymetry' 'Inspired O2 Fraction'
 'Tidal Volume (observed)' 'Tidal Volume (spontaneous)' 'Minute Volume'
 'Peak Insp. Pressure' 'Mean Airway Pressure' 'Temperature Fahrenheit'
 'Richmond-RAS Scale' 'GC

**Remove NaN values**

In the previous pre-processing step all outliers were set to NaN. We will simply remove those values rather than impute them. Since we are using mean aggregation, which ignores missing values, removing them is unlikely to have any impact.
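As a quick check of that reasoning, the sketch below compares per-patient means computed with and without the NaN rows. The toy DataFrame is made up; only the column names mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy long-format data (values are made up for illustration)
toy = pd.DataFrame({
    'subject_id': [1, 1, 1, 2, 2],
    'feature_label': ['Heart Rate'] * 5,
    'valuenum': [80.0, np.nan, 90.0, 70.0, np.nan],
})

# pandas skips NaN by default when averaging
mean_with_nan = toy.groupby('subject_id')['valuenum'].mean()
mean_after_drop = toy.dropna().groupby('subject_id')['valuenum'].mean()

print(mean_with_nan.equals(mean_after_drop))
```

The two results differ only when every value for a patient and feature is NaN: the group then yields NaN in the first case and is absent in the second. Either way the patient ends up without a value for that feature and is handled by the later imputation step.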

In [8]:
# Count the number of NaN data points
print('Number of NaN values in train set: ', train_df.isna().sum().sum())
print('Number of NaN values in test set: ', test_df.isna().sum().sum())

Number of NaN values in train set:  15610
Number of NaN values in test set:  3982


In [9]:
# Count the number of data points
print('Number of data points in train set: ', train_df.shape[0])
print('Number of data points in test set: ', test_df.shape[0])

Number of data points in train set:  196561
Number of data points in test set:  49576


In [22]:
# Remove any data point that is NaN
train_df = train_df.dropna()
test_df = test_df.dropna()

# Count the number of data points
print('Number of data points in train set: ', train_df.shape[0])
print('Number of data points in test set: ', test_df.shape[0])

Number of data points in train set:  180951
Number of data points in test set:  45594


**Data aggregation**

For each patient we will need a fixed set of features.

We will use mean aggregation to represent these features, averaging all values across the 6-hour window.

*Handling Categorical Features*

**Note that GCS and RAS scores are categorical values encoded as numbers. As such, their mean is not meaningful, so to keep these features as interpretable as possible we will take the mode instead of the mean.**
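A small illustration of the difference, using made-up GCS - Eye Opening scores for a single patient:

```python
import pandas as pd

# Made-up scores for one patient across the observation window
gcs_eye = pd.Series([4.0, 4.0, 1.0, 4.0])

print(gcs_eye.mean())          # 3.25, which is not a valid GCS category
print(gcs_eye.mode().iloc[0])  # 4.0, the most frequently observed category
```

`Series.mode()` returns all tied modes in sorted order, so taking `.iloc[0]` picks the smallest one when there is a tie.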

In [23]:
train_copy = train_df.copy()
test_copy = test_df.copy()

In [24]:
from scipy import stats

In [None]:
# # Aggregating features for each patient (subject_id) using mean for each label
# train_pivoted = train_df.pivot_table(index='subject_id', columns='feature_label', values='valuenum', aggfunc='mean')
# test_pivoted = test_df.pivot_table(index='subject_id', columns='feature_label', values='valuenum', aggfunc='mean')

# # Rename columns to highlight the mean
# train_pivoted.columns = ['mean_' + str(col) for col in train_pivoted.columns.values]
# test_pivoted.columns = ['mean_' + str(col) for col in test_pivoted.columns.values]

# # Reset index
# train_pivoted = train_pivoted.reset_index()
# test_pivoted = test_pivoted.reset_index()

In [29]:
def aggregate_features(df, categorical_features):
    # Aggregate each patient's values: mode for categorical features, mean otherwise.
    # pivot_table's dict-style aggfunc is keyed by value column, not by feature_label,
    # so the two groups are pivoted separately and then joined on subject_id.
    is_categorical = df['feature_label'].isin(categorical_features)

    def pivot(subset, aggfunc, prefix):
        pivoted = subset.pivot_table(index='subject_id',
                                     columns='feature_label',
                                     values='valuenum',
                                     aggfunc=aggfunc)
        # Prefix columns to highlight the aggregation type
        return pivoted.add_prefix(prefix)

    # keepdims=False makes stats.mode return a scalar (SciPy >= 1.9)
    def mode_agg(x):
        return stats.mode(x, keepdims=False)[0] if len(x) > 0 else None

    cat_pivoted = pivot(df[is_categorical], mode_agg, 'mode_')
    num_pivoted = pivot(df[~is_categorical], 'mean', 'mean_')

    # Outer join keeps patients that only have one kind of feature
    pivoted = cat_pivoted.join(num_pivoted, how='outer')

    # Reset index
    pivoted = pivoted.reset_index()

    return pivoted

In [31]:
categorical_features = ['GCS - Eye Opening', 'GCS - Motor Response', 'Richmond-RAS Scale']

In [33]:
# Features aggregated with the mode rather than the mean
categorical_features = ['GCS - Eye Opening', 'GCS - Motor Response', 'Richmond-RAS Scale']

# Filter categorical features from the DataFrame
cat_train = train_df[train_df['feature_label'].isin(categorical_features)]
cat_test = test_df[test_df['feature_label'].isin(categorical_features)]

# Calculate mode for categorical features
# keepdims=False makes stats.mode return a scalar rather than an array (SciPy >= 1.9)
def mode_agg(x):
    return stats.mode(x, keepdims=False)[0] if len(x) > 0 else None

cat_train_pivoted = cat_train.pivot_table(index='subject_id',
                                          columns='feature_label',
                                          values='valuenum',
                                          aggfunc=mode_agg)
cat_test_pivoted = cat_test.pivot_table(index='subject_id',
                                        columns='feature_label',
                                        values='valuenum',
                                        aggfunc=mode_agg)

# Rename columns to highlight the mode
cat_train_pivoted.columns = ['mode_' + str(col) for col in cat_train_pivoted.columns.values]
cat_test_pivoted.columns = ['mode_' + str(col) for col in cat_test_pivoted.columns.values]

# Reset index
cat_train_pivoted = cat_train_pivoted.reset_index()
cat_test_pivoted = cat_test_pivoted.reset_index()

# Display the result
cat_train_pivoted.head()
cat_test_pivoted.head()

Unnamed: 0,subject_id,mode_GCS - Eye Opening,mode_GCS - Motor Response,mode_Richmond-RAS Scale
0,10004720,4.0,2.0,-5.0
1,10004733,,,-4.0
2,10005817,4.0,4.0,-1.0
3,10022620,4.0,6.0,0.0
4,10037861,1.0,1.0,-5.0


In [34]:
# Filter numerical features from the DataFrame
num_train = train_df[~train_df['feature_label'].isin(categorical_features)]
num_test = test_df[~test_df['feature_label'].isin(categorical_features)]

# Calculate mean for numerical features
num_train_pivoted = num_train.pivot_table(index='subject_id',
                                          columns='feature_label',
                                          values='valuenum',
                                          aggfunc='mean')
num_test_pivoted = num_test.pivot_table(index='subject_id',
                                        columns='feature_label',
                                        values='valuenum',
                                        aggfunc='mean')

# Rename columns to highlight the mean
num_train_pivoted.columns = ['mean_' + str(col) for col in num_train_pivoted.columns.values]
num_test_pivoted.columns = ['mean_' + str(col) for col in num_test_pivoted.columns.values]

# Reset index
num_train_pivoted = num_train_pivoted.reset_index()
num_test_pivoted = num_test_pivoted.reset_index()

# Display the result
num_train_pivoted.head()
num_test_pivoted.head()

Unnamed: 0,subject_id,mean_Arterial Base Excess,mean_Arterial Blood Pressure diastolic,mean_Arterial Blood Pressure mean,mean_Arterial Blood Pressure systolic,mean_Arterial CO2 Pressure,mean_Arterial O2 Saturation,mean_Arterial O2 pressure,mean_Cardiac Output (CCO),mean_Creatinine (serum),...,mean_Peak Insp. Pressure,mean_Plateau Pressure,mean_Platelet Count,mean_Potassium (serum),mean_Respiratory Rate,mean_Sodium (serum),mean_Temperature Fahrenheit,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous),mean_WBC
0,10004720,,,,,,,,,,...,11.5,,,,26.714286,,99.2,340.0,340.0,
1,10004733,,,,,,,,,,...,20.0,,,,12.75,,99.05,515.0,,
2,10005817,9.0,62.0,79.875,131.0,35.5,,108.5,,2.7,...,18.0,,136.0,3.9,18.875,131.0,98.4,523.0,507.0,9.8
3,10013643,-4.5,66.8,76.125,105.8,37.5,,129.5,,0.9,...,14.666667,15.0,166.0,4.2,18.25,136.0,98.5,501.666667,550.5,23.4
4,10022620,,,,,,,,,,...,9.333333,,,,20.833333,,98.75,577.5,468.75,


In [35]:
# Merge the categorical and numerical pivoted DataFrames
train_combined = pd.merge(cat_train_pivoted, num_train_pivoted, on='subject_id', how='outer')
test_combined = pd.merge(cat_test_pivoted, num_test_pivoted, on='subject_id', how='outer')

# Display the combined result
train_combined.head()
test_combined.head()

Unnamed: 0,subject_id,mode_GCS - Eye Opening,mode_GCS - Motor Response,mode_Richmond-RAS Scale,mean_Arterial Base Excess,mean_Arterial Blood Pressure diastolic,mean_Arterial Blood Pressure mean,mean_Arterial Blood Pressure systolic,mean_Arterial CO2 Pressure,mean_Arterial O2 Saturation,...,mean_Peak Insp. Pressure,mean_Plateau Pressure,mean_Platelet Count,mean_Potassium (serum),mean_Respiratory Rate,mean_Sodium (serum),mean_Temperature Fahrenheit,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous),mean_WBC
0,10004720,4.0,2.0,-5.0,,,,,,,...,11.5,,,,26.714286,,99.2,340.0,340.0,
1,10004733,,,-4.0,,,,,,,...,20.0,,,,12.75,,99.05,515.0,,
2,10005817,4.0,4.0,-1.0,9.0,62.0,79.875,131.0,35.5,,...,18.0,,136.0,3.9,18.875,131.0,98.4,523.0,507.0,9.8
3,10022620,4.0,6.0,0.0,,,,,,,...,9.333333,,,,20.833333,,98.75,577.5,468.75,
4,10037861,1.0,1.0,-5.0,,68.0,77.0,126.166667,,,...,21.0,,,,22.0,,,486.0,486.0,


**Add extubation failure label**

In [37]:
# Add extubation failure label
label_df = train_df[['subject_id', 'extubation_failure']].drop_duplicates(subset='subject_id')
train_aggregated = train_combined.merge(label_df, on='subject_id', how='left')

label_df = test_df[['subject_id', 'extubation_failure']].drop_duplicates(subset='subject_id')
test_aggregated = test_combined.merge(label_df, on='subject_id', how='left')

In [38]:
train_aggregated.head()

Unnamed: 0,subject_id,mode_GCS - Eye Opening,mode_GCS - Motor Response,mode_Richmond-RAS Scale,mean_Arterial Base Excess,mean_Arterial Blood Pressure diastolic,mean_Arterial Blood Pressure mean,mean_Arterial Blood Pressure systolic,mean_Arterial CO2 Pressure,mean_Arterial O2 Saturation,...,mean_Plateau Pressure,mean_Platelet Count,mean_Potassium (serum),mean_Respiratory Rate,mean_Sodium (serum),mean_Temperature Fahrenheit,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous),mean_WBC,extubation_failure
0,10001884,3.0,6.0,-1.0,,,,,,,...,,,,20.0,,,,,,1
1,10002428,3.0,6.0,0.0,4.0,61.666667,81.166667,118.5,43.0,,...,,,4.0,22.0,143.0,98.366667,380.0,355.25,,0
2,10004235,4.0,6.0,,-5.0,,,,37.5,97.0,...,,27.0,4.3,13.666667,134.0,,,,15.3,1
3,10010867,2.0,4.0,-4.0,,,,,,,...,,,,15.333333,,99.3,467.0,467.0,,0
4,10011365,,,0.0,,,,,,,...,,,,17.666667,,98.8,344.0,344.0,,1


**Handle NaN values**

We will need to handle the cases where a patient has no values for a feature. These features usually correspond to those with the fewest observations.

We will fill numerical features with the mean across the patient population and categorical features with the mode across the patient population.

In [39]:
# See which columns have the most NaNs
print(train_aggregated.isna().sum().sort_values(ascending=False))

mean_Cardiac Output (CCO)                 3620
mean_Arterial O2 Saturation               3270
mean_Plateau Pressure                     3257
mean_EtCO2                                3198
mean_Lactic Acid                          3140
mean_WBC                                  2973
mean_Ionized Calcium                      2951
mean_Platelet Count                       2945
mean_Hemoglobin                           2893
mean_Hematocrit (serum)                   2817
mean_Creatinine (serum)                   2790
mean_Glucose (serum)                      2788
mean_Potassium (serum)                    2669
mean_Sodium (serum)                       2662
mean_Arterial Base Excess                 2446
mean_Arterial CO2 Pressure                2390
mean_Arterial O2 pressure                 2383
mean_PH (Arterial)                        2378
mean_Arterial Blood Pressure diastolic    2204
mean_Arterial Blood Pressure systolic     1756
mean_Arterial Blood Pressure mean         1622
mean_Tidal Vo

In [40]:
print(test_aggregated.isna().sum().sort_values(ascending=False))

mean_Cardiac Output (CCO)                 905
mean_Arterial O2 Saturation               838
mean_EtCO2                                820
mean_Plateau Pressure                     815
mean_Lactic Acid                          793
mean_Ionized Calcium                      743
mean_WBC                                  739
mean_Platelet Count                       734
mean_Hemoglobin                           723
mean_Hematocrit (serum)                   709
mean_Glucose (serum)                      703
mean_Creatinine (serum)                   695
mean_Potassium (serum)                    665
mean_Sodium (serum)                       662
mean_Arterial Base Excess                 610
mean_Arterial CO2 Pressure                597
mean_PH (Arterial)                        596
mean_Arterial O2 pressure                 593
mean_Arterial Blood Pressure diastolic    538
mean_Arterial Blood Pressure systolic     458
mean_Arterial Blood Pressure mean         411
mean_Tidal Volume (spontaneous)   

In [41]:
# Calculate this as a percentage of all data points
print("Train data:")
print(train_aggregated.isna().sum().sort_values(ascending=False)/len(train_aggregated))
print("Test data:")
print(test_aggregated.isna().sum().sort_values(ascending=False)/len(test_aggregated))

Train data:
mean_Cardiac Output (CCO)                 0.962766
mean_Arterial O2 Saturation               0.869681
mean_Plateau Pressure                     0.866223
mean_EtCO2                                0.850532
mean_Lactic Acid                          0.835106
mean_WBC                                  0.790691
mean_Ionized Calcium                      0.784840
mean_Platelet Count                       0.783245
mean_Hemoglobin                           0.769415
mean_Hematocrit (serum)                   0.749202
mean_Creatinine (serum)                   0.742021
mean_Glucose (serum)                      0.741489
mean_Potassium (serum)                    0.709840
mean_Sodium (serum)                       0.707979
mean_Arterial Base Excess                 0.650532
mean_Arterial CO2 Pressure                0.635638
mean_Arterial O2 pressure                 0.633777
mean_PH (Arterial)                        0.632447
mean_Arterial Blood Pressure diastolic    0.586170
mean_Arterial Blood

For the low observed features, a significant number of patients had no values at all.

This means that there will be a significant amount of data imputed but this is necessary in order to keep the features analogous to those used in the LSTM/TCN training.

**Note: To avoid data leakage, the means and modes used to fill the test set NaNs are calculated from the training set.**
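The pattern can be sketched as "learn the fill values on the training set, then apply them to both sets". The helper `fit_fill_values` and the toy DataFrames below are hypothetical and only illustrate the idea; the column names follow the `mean_` / `mode_` convention used in this notebook.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train, mean_cols, mode_cols):
    """Learn one fill value per column from the training set only."""
    fill = train[mean_cols].mean().to_dict()
    for col in mode_cols:
        modes = train[col].mode()
        fill[col] = modes.iloc[0] if not modes.empty else np.nan
    return fill

# Made-up aggregated data
train = pd.DataFrame({'mean_Heart Rate': [80.0, 100.0, np.nan],
                      'mode_GCS - Eye Opening': [4.0, 4.0, np.nan]})
test = pd.DataFrame({'mean_Heart Rate': [np.nan, 60.0],
                     'mode_GCS - Eye Opening': [np.nan, 1.0]})

fill_values = fit_fill_values(train, ['mean_Heart Rate'], ['mode_GCS - Eye Opening'])
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)  # the same training-set values are reused

print(test_filled)
```

Computing the fill values once and reusing them makes it explicit that no statistic is ever derived from the test set.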

In [42]:
# Separate categorical (mode_) and numerical (mean_) columns in train_combined by prefix
cat_columns = [col for col in train_combined.columns if col.startswith('mode_')]
num_columns = [col for col in train_combined.columns if col.startswith('mean_')]

# Fill NaN values with the mean for numerical features in train_combined
train_combined[num_columns] = train_combined[num_columns].fillna(train_combined[num_columns].mean())

# Fill NaN values with the mode for categorical features in train_combined
for col in cat_columns:
    mode_value = train_combined[col].mode()[0] if not train_combined[col].mode().empty else None
    train_combined[col] = train_combined[col].fillna(mode_value)

# For test_combined, fill NaNs with the mean of train_combined for numerical features
test_combined[num_columns] = test_combined[num_columns].fillna(train_combined[num_columns].mean())

# For test_combined, fill NaNs with the mode of train_combined for categorical features
for col in cat_columns:
    mode_value = train_combined[col].mode()[0] if not train_combined[col].mode().empty else None
    test_combined[col] = test_combined[col].fillna(mode_value)

# Display the final DataFrames
train_combined.head()
test_combined.head()

Unnamed: 0,subject_id,mode_GCS - Eye Opening,mode_GCS - Motor Response,mode_Richmond-RAS Scale,mean_Arterial Base Excess,mean_Arterial Blood Pressure diastolic,mean_Arterial Blood Pressure mean,mean_Arterial Blood Pressure systolic,mean_Arterial CO2 Pressure,mean_Arterial O2 Saturation,...,mean_Peak Insp. Pressure,mean_Plateau Pressure,mean_Platelet Count,mean_Potassium (serum),mean_Respiratory Rate,mean_Sodium (serum),mean_Temperature Fahrenheit,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous),mean_WBC
0,10004720,4.0,2.0,-5.0,1.391908,68.416171,81.300472,117.652355,40.689586,96.63619,...,11.5,18.102286,163.460736,4.046379,26.714286,140.103825,99.2,340.0,340.0,11.633799
1,10004733,4.0,6.0,-4.0,1.391908,68.416171,81.300472,117.652355,40.689586,96.63619,...,20.0,18.102286,163.460736,4.046379,12.75,140.103825,99.05,515.0,470.899154,11.633799
2,10005817,4.0,4.0,-1.0,9.0,62.0,79.875,131.0,35.5,96.63619,...,18.0,18.102286,136.0,3.9,18.875,131.0,98.4,523.0,507.0,9.8
3,10022620,4.0,6.0,0.0,1.391908,68.416171,81.300472,117.652355,40.689586,96.63619,...,9.333333,18.102286,163.460736,4.046379,20.833333,140.103825,98.75,577.5,468.75,11.633799
4,10037861,1.0,1.0,-5.0,1.391908,68.0,77.0,126.166667,40.689586,96.63619,...,21.0,18.102286,163.460736,4.046379,22.0,140.103825,98.862106,486.0,486.0,11.633799


In [45]:
# Calculate this as a percentage of all data points
print("Train data:")
print(train_combined.isna().sum().sort_values(ascending=False)/len(train_combined))
print("Test data:")
print(test_combined.isna().sum().sort_values(ascending=False)/len(test_combined))

Train data:
subject_id                                0.0
mean_Plateau Pressure                     0.0
mean_Lactic Acid                          0.0
mean_Mean Airway Pressure                 0.0
mean_Minute Volume                        0.0
mean_O2 saturation pulseoxymetry          0.0
mean_PH (Arterial)                        0.0
mean_Peak Insp. Pressure                  0.0
mean_Platelet Count                       0.0
mean_Inspired O2 Fraction                 0.0
mean_Potassium (serum)                    0.0
mean_Respiratory Rate                     0.0
mean_Sodium (serum)                       0.0
mean_Temperature Fahrenheit               0.0
mean_Tidal Volume (observed)              0.0
mean_Tidal Volume (spontaneous)           0.0
mean_Ionized Calcium                      0.0
mean_Hemoglobin                           0.0
mode_GCS - Eye Opening                    0.0
mean_Arterial CO2 Pressure                0.0
mode_GCS - Motor Response                 0.0
mode_Richmond-RAS Scal

In [46]:
train_combined.head()

Unnamed: 0,subject_id,mode_GCS - Eye Opening,mode_GCS - Motor Response,mode_Richmond-RAS Scale,mean_Arterial Base Excess,mean_Arterial Blood Pressure diastolic,mean_Arterial Blood Pressure mean,mean_Arterial Blood Pressure systolic,mean_Arterial CO2 Pressure,mean_Arterial O2 Saturation,...,mean_Peak Insp. Pressure,mean_Plateau Pressure,mean_Platelet Count,mean_Potassium (serum),mean_Respiratory Rate,mean_Sodium (serum),mean_Temperature Fahrenheit,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous),mean_WBC
0,10001884,3.0,6.0,-1.0,1.391908,68.416171,81.300472,117.652355,40.689586,96.63619,...,17.0,18.102286,163.460736,4.046379,20.0,140.103825,98.862106,472.138379,470.899154,11.633799
1,10002428,3.0,6.0,0.0,4.0,61.666667,81.166667,118.5,43.0,96.63619,...,20.5,18.102286,163.460736,4.0,22.0,143.0,98.366667,380.0,355.25,11.633799
2,10004235,4.0,6.0,-1.0,-5.0,68.416171,81.300472,117.652355,37.5,97.0,...,11.0,18.102286,27.0,4.3,13.666667,134.0,98.862106,472.138379,470.899154,15.3
3,10010867,2.0,4.0,-4.0,1.391908,68.416171,81.300472,117.652355,40.689586,96.63619,...,16.0,18.102286,163.460736,4.046379,15.333333,140.103825,99.3,467.0,467.0,11.633799
4,10011365,4.0,6.0,0.0,1.391908,68.416171,81.300472,117.652355,40.689586,96.63619,...,12.0,18.102286,163.460736,4.046379,17.666667,140.103825,98.8,344.0,344.0,11.633799


We will not create new features, in keeping with what was done for the dynamic data. Because the features required to derive new ones were sampled at different rates, deriving them would create more synthetic data, which we want to avoid.

There are no NaN values and all patients have a value (mean or mode) for every feature. We can now use this data to train a LightGBM model.
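These properties could also be asserted programmatically before saving. The sketch below uses a hypothetical helper, `check_model_input`, on made-up rows; the patient-disjointness check assumes the upstream train/test split was made by patient, which is not confirmed in this notebook.

```python
import pandas as pd

def check_model_input(train, test):
    """Raise AssertionError if the aggregated tables are not model-ready."""
    for name, frame in (('train', train), ('test', test)):
        assert not frame.isna().any().any(), f'{name} still contains NaN values'
        assert frame['subject_id'].is_unique, f'{name} has duplicate patients'
    assert list(train.columns) == list(test.columns), 'train/test columns differ'
    # Assumption: the split was made by patient, so no ID appears in both sets
    assert set(train['subject_id']).isdisjoint(test['subject_id']), 'patient in both sets'
    return True

# Made-up rows using the notebook's column naming convention
toy_train = pd.DataFrame({'subject_id': [1, 2], 'mean_Heart Rate': [80.0, 90.0]})
toy_test = pd.DataFrame({'subject_id': [3], 'mean_Heart Rate': [85.0]})

print(check_model_input(toy_train, toy_test))
```

In this notebook the same call would be made with `train_combined` and `test_combined`.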

In [50]:
# Re-attach the extubation failure label, this time to the imputed data
label_df = train_df[['subject_id', 'extubation_failure']].drop_duplicates(subset='subject_id')
train_aggregated = train_combined.merge(label_df, on='subject_id', how='left')

label_df = test_df[['subject_id', 'extubation_failure']].drop_duplicates(subset='subject_id')
test_aggregated = test_combined.merge(label_df, on='subject_id', how='left')

In [53]:
train_aggregated.head()

Unnamed: 0,subject_id,mode_GCS - Eye Opening,mode_GCS - Motor Response,mode_Richmond-RAS Scale,mean_Arterial Base Excess,mean_Arterial Blood Pressure diastolic,mean_Arterial Blood Pressure mean,mean_Arterial Blood Pressure systolic,mean_Arterial CO2 Pressure,mean_Arterial O2 Saturation,...,mean_Plateau Pressure,mean_Platelet Count,mean_Potassium (serum),mean_Respiratory Rate,mean_Sodium (serum),mean_Temperature Fahrenheit,mean_Tidal Volume (observed),mean_Tidal Volume (spontaneous),mean_WBC,extubation_failure
0,10001884,3.0,6.0,-1.0,1.391908,68.416171,81.300472,117.652355,40.689586,96.63619,...,18.102286,163.460736,4.046379,20.0,140.103825,98.862106,472.138379,470.899154,11.633799,1
1,10002428,3.0,6.0,0.0,4.0,61.666667,81.166667,118.5,43.0,96.63619,...,18.102286,163.460736,4.0,22.0,143.0,98.366667,380.0,355.25,11.633799,0
2,10004235,4.0,6.0,-1.0,-5.0,68.416171,81.300472,117.652355,37.5,97.0,...,18.102286,27.0,4.3,13.666667,134.0,98.862106,472.138379,470.899154,15.3,1
3,10010867,2.0,4.0,-4.0,1.391908,68.416171,81.300472,117.652355,40.689586,96.63619,...,18.102286,163.460736,4.046379,15.333333,140.103825,99.3,467.0,467.0,11.633799,0
4,10011365,4.0,6.0,0.0,1.391908,68.416171,81.300472,117.652355,40.689586,96.63619,...,18.102286,163.460736,4.046379,17.666667,140.103825,98.8,344.0,344.0,11.633799,1


In [54]:
train_aggregated.shape

(3760, 36)

In [55]:
test_aggregated.shape

(941, 36)

In [56]:
# Save the data
train_aggregated.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/03_feature_set_3/02_lgbm_data/train_aggregated_v2.parquet')
test_aggregated.to_parquet('/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/03_feature_set_3/02_lgbm_data/test_aggregated_v2.parquet')