# **Ensemble Methods - Feature Sert 1**

We have now trained 6 models for each of our sets of features. These models show some performance but still lack the ideal performance required in order for them to be deployed in clinical environments.

The purpose of creating ensemble predictions is to determine whether combining the outputs of multiple base models can achieve better performance than the individual models

As a final experiment, we will look at implementing a few ensemble methods to see if a combination of each of the models has better predictive performance than any of the models individually.

The potential benefits of this are:
- Reduction in overfitting
- More robust
- Combine diverse model strengths (LSTM, TCN look at time dependencies where as LGBM looks at aggregate values over time)
- Models handle different types of data (static and dynamic)
- Enhanced generalisation
- Mitigation of individual model limitations

We will need to create ensemble methods for each feature set.

We will investigate different revelant ensemble methods:
- Averaging
- Majority voting
- Stacking
- Potentially more complex methods e.g. Boosting, Bootstrapping

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import json
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Step 1 - Collate predictions from each model**

In [3]:
# Load model predictions
lstm_dynamic_predictions = np.load('/content/drive/MyDrive/MSc_Final_Project/03_model_development/01_feature_set_1/04_lstm_model_fused_decision/dynamic_data/best_models/experiment_1/predictions.npy')
lstm_static_and_dynamic_predictions = np.load('/content/drive/MyDrive/MSc_Final_Project/03_model_development/01_feature_set_1/04_lstm_model_fused_decision/static_and_dynamic_data/best_models/experiment_2/predictions.npy')
tcn_dynamic_predictions = np.load('/content/drive/MyDrive/MSc_Final_Project/03_model_development/01_feature_set_1/05_tcn_model_fused_decision/dynamic_data/best_models/experiment_1/final_model_predictions.npy')
tcn_static_and_dynamic_predictions = np.load('/content/drive/MyDrive/MSc_Final_Project/03_model_development/01_feature_set_1/05_tcn_model_fused_decision/static_and_dynamic_data/best_models/experiment_2/predictions.npy')
lgbm_dynamic_predictions = np.load('/content/drive/MyDrive/MSc_Final_Project/03_model_development/01_feature_set_1/03_lgbm_model/dynamic_data/best_models/experiment_4/predictions.npy')
lgbm_static_and_dynamic_predictions = np.load('/content/drive/MyDrive/MSc_Final_Project/03_model_development/01_feature_set_1/03_lgbm_model/static_and_dynamic_data/best_models/experiment_2/predictions_v2.npy')

# Print the shapes of all predictions
print(f"LSTM Dynamic Predictions Shape: {lstm_dynamic_predictions.shape}")
print(f"LSTM Static and Dynamic Predictions Shape: {lstm_static_and_dynamic_predictions.shape}")
print(f"TCN Dynamic Predictions Shape: {tcn_dynamic_predictions.shape}")
print(f"TCN Static and Dynamic Predictions Shape: {tcn_static_and_dynamic_predictions.shape}")
print(f"LGBM Dynamic Predictions Shape: {lgbm_dynamic_predictions.shape}")
print(f"LGBM Static and Dynamic Predictions Shape: {lgbm_static_and_dynamic_predictions.shape}")

LSTM Dynamic Predictions Shape: (941,)
LSTM Static and Dynamic Predictions Shape: (941,)
TCN Dynamic Predictions Shape: (941,)
TCN Static and Dynamic Predictions Shape: (941,)
LGBM Dynamic Predictions Shape: (941,)
LGBM Static and Dynamic Predictions Shape: (941,)


In [4]:
# Print a sample of each
print("Sample of LSTM Dynamic Predictions:")
print(lstm_dynamic_predictions[:5])
print("\nSample of LSTM Static and Dynamic Predictions:")
print(lstm_static_and_dynamic_predictions[:5])
print("\nSample of TCN Dynamic Predictions:")
print(tcn_dynamic_predictions[:5])
print("\nSample of TCN Static and Dynamic Predictions:")
print(tcn_static_and_dynamic_predictions[:5])
print("\nSample of LGBM Dynamic Predictions:")
print(lgbm_dynamic_predictions[:5])
print("\nSample of LGBM Static and Dynamic Predictions:")
print(lgbm_static_and_dynamic_predictions[:5])

Sample of LSTM Dynamic Predictions:
[0.39246023 0.3333789  0.29905996 0.2822049  0.287461  ]

Sample of LSTM Static and Dynamic Predictions:
[0.37723106 0.3520578  0.31333807 0.3159983  0.29638764]

Sample of TCN Dynamic Predictions:
[0.34264043 0.37695804 0.34792513 0.3183502  0.33554348]

Sample of TCN Static and Dynamic Predictions:
[0.44380704 0.4843297  0.4703515  0.5316279  0.3112441 ]

Sample of LGBM Dynamic Predictions:
[0.44732843 0.46014498 0.31843291 0.40291431 0.56119099]

Sample of LGBM Static and Dynamic Predictions:
[0.5012547  0.49871373 0.49972609 0.49972609 0.49972609]


In [5]:
# Combine all predictions into a single array
all_predictions = np.vstack([
    lstm_dynamic_predictions,
    lstm_static_and_dynamic_predictions,
    tcn_dynamic_predictions,
    tcn_static_and_dynamic_predictions,
    lgbm_dynamic_predictions,
    lgbm_static_and_dynamic_predictions
])

all_predictions.shape

(6, 941)

In [6]:
# Load the true labels for the test set
test_labels_path = '/content/drive/MyDrive/MSc_Final_Project/02_data_analysis/mimic/data_analysis/datasets/08_model_input_data/01_feature_set_1/01_lstm_data/dynamic_data/feature_subsets_run_2/low_df_test_labels_v1.npy'
test_labels = np.load(test_labels_path)

# Print the shape of the test labels
print(f"Test Labels Shape: {test_labels.shape}")

Test Labels Shape: (941,)


In [None]:
test_labels.astype(int)

array([1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1,

# **Step 2 - Apply Ensemble methods**

### **Average Predictions**

Average the probabilities for each patient and evaluate on the test labels.

In [7]:
# Compute the average prediction for each sample
ensemble_predictions = np.mean(all_predictions, axis=0)

In [8]:
# Define standard binary classification threshold of 0.5
threshold = 0.5

ensemble_predictions_binary = (ensemble_predictions >= threshold).astype(int)

In [9]:
ensemble_predictions_binary.shape

(941,)

In [10]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

In [11]:
# Evaluate ensemble predictions
def evaluate_ensemble_predictions(ensemble_predictions, test_labels):
    # Ensure predictions and labels are binary and of the same type
    ensemble_predictions = np.array(ensemble_predictions).astype(int)
    test_labels = np.array(test_labels).astype(int)

    # Evaluate the ensemble prediction against the test labels
    accuracy = accuracy_score(test_labels, ensemble_predictions)
    precision = precision_score(test_labels, ensemble_predictions)
    recall = recall_score(test_labels, ensemble_predictions)
    f1 = f1_score(test_labels, ensemble_predictions)
    roc_auc = roc_auc_score(test_labels, ensemble_predictions)

    # Compute confusion matrix to derive specificity
    tn, fp, fn, tp = confusion_matrix(test_labels, ensemble_predictions).ravel()
    specificity = tn / (tn + fp)

    # Print the metrics
    print(f'Accuracy: {accuracy:.4f}')
    print(f'Precision: {precision:.4f}')
    print(f'Recall (Sensitivity): {recall:.4f}')
    print(f'F1 Score: {f1:.4f}')
    print(f'ROC AUC: {roc_auc:.4f}')
    print(f'Specificity: {specificity:.4f}')

In [15]:
ensemble_predictions[:5]

array([0.41745365, 0.4175972 , 0.37480561, 0.39180361, 0.38192555])

In [16]:
# Save the ensemble_predictions
avg_predictions_df = pd.DataFrame(ensemble_predictions)

avg_predictions_df.to_parquet('/content/drive/MyDrive/MSc_Final_Project/04_ensemble_methods/results/fs1_avg.parquet')

In [None]:
# Evaluate averaging prediction
evaluate_ensemble_predictions(ensemble_predictions_binary, test_labels)

Accuracy: 0.6950
Precision: 0.6058
Recall (Sensitivity): 0.2039
F1 Score: 0.3051
ROC AUC: 0.5695
Specificity: 0.9351


### **Hard voting classifier**

Each model hard votes for a class.

In [None]:
# Ensure all predictions are binary
lstm_dynamic_predictions_binary = (lstm_dynamic_predictions >= threshold).astype(int)
lstm_static_and_dynamic_predictions_binary = (lstm_static_and_dynamic_predictions >= threshold).astype(int)
tcn_dynamic_predictions_binary = (tcn_dynamic_predictions >= threshold).astype(int)
tcn_static_and_dynamic_predictions_binary = (tcn_static_and_dynamic_predictions >= threshold).astype(int)
lgbm_dynamic_predictions_binary = (lgbm_dynamic_predictions >= threshold).astype(int)
lgbm_static_and_dynamic_predictions_binary = (lgbm_static_and_dynamic_predictions >= threshold).astype(int)

In [None]:
all_predictions_binary = np.vstack([
    lstm_dynamic_predictions_binary,
    lstm_static_and_dynamic_predictions_binary,
    tcn_dynamic_predictions_binary,
    tcn_static_and_dynamic_predictions_binary,
    lgbm_dynamic_predictions_binary,
    lgbm_static_and_dynamic_predictions_binary
]).astype(int)

In [None]:
# Apply hard voting
ensemble_predictions_hard = np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=all_predictions_binary)

# Evaluate the ensemble predictions
evaluate_ensemble_predictions(ensemble_predictions_hard, test_labels)

Accuracy: 0.6961
Precision: 0.6949
Recall (Sensitivity): 0.1327
F1 Score: 0.2228
ROC AUC: 0.5521
Specificity: 0.9715


### **Stacking**


Stacking uses a meta-model to learn how to best combine the predictions from two or more base mdoels. This allows it to harness the capabilities of a range of models.



Train Base Models: Multiple base models are trained on the training data. These models can be of different types (e.g., logistic regression, random forest, SVM). **- This has already been done with LSTM, TCN and LGBM**

Generate Meta-Features: The predictions from the base models are used to create a new dataset. Each base model's prediction for each sample becomes a feature in this new dataset. This dataset is called the meta-features.

Train Meta-Model: A meta-model (or level-2 model) is trained on the meta-features. The goal of the meta-model is to learn how to combine the base model predictions to make the final prediction.

Make Final Predictions: The meta-model uses the meta-features (predictions from the base models) to make the final predictions.

For this we will use the standard simple meta-model: Logistic Regression as linear models are often used as the meta-model.

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [18]:
# Combine predictions into a feature matrix for the meta-model
X_meta = np.vstack([
    lstm_dynamic_predictions,
    lstm_static_and_dynamic_predictions,
    tcn_dynamic_predictions,
    tcn_static_and_dynamic_predictions,
    lgbm_dynamic_predictions,
    lgbm_dynamic_predictions
]).T

In [19]:
# Split meta features into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_meta, test_labels, test_size=0.2, random_state=42)

In [20]:
# Define the meta-model
meta_model = LogisticRegression()

In [21]:
# Train the meta-model
meta_model.fit(X_train, y_train)

In [22]:
# Predict using the validation data
ensemble_predictions_stacking = meta_model.predict(X_val)

In [24]:
ensemble_predictions_stacking_prob = meta_model.predict_proba(X_val)[:, 1]

In [25]:
ensemble_predictions_stacking_prob[:5]

array([0.27972424, 0.39284481, 0.29969915, 0.33859392, 0.26473086])

In [26]:
ensemble_predictions_stacking_df = pd.DataFrame(ensemble_predictions_stacking_prob)

ensemble_predictions_stacking_df.to_parquet('/content/drive/MyDrive/MSc_Final_Project/04_ensemble_methods/results/fs1_stacking.parquet')

In [None]:
# Evaluate on the test labels
evaluate_ensemble_predictions(ensemble_predictions_stacking, y_val)

Accuracy: 0.6984
Precision: 0.6500
Recall (Sensitivity): 0.2063
F1 Score: 0.3133
ROC AUC: 0.5754
Specificity: 0.9444


### **AdaBoost**

Boosting framework to improve the performance of the ensemble model.

In [None]:
from sklearn.ensemble import AdaBoostClassifier

In [None]:
adaboost_model = AdaBoostClassifier(n_estimators=50, random_state=42)

In [None]:
# Train the AdaBoost model
adaboost_model.fit(X_train, y_train)

# Predict
ensemble_predictions_adaboost = adaboost_model.predict(X_val)

In [None]:
# Evaluate ensemble predictions
evaluate_ensemble_predictions(ensemble_predictions_adaboost, y_val)

Accuracy: 0.6455
Precision: 0.4545
Recall (Sensitivity): 0.3175
F1 Score: 0.3738
ROC AUC: 0.5635
Specificity: 0.8095


### **Bagging**

Reduces variance and helps avoid overfitting by averaging predictions over multiple models trained on different subsets.

In [None]:
from sklearn.ensemble import BaggingClassifier

In [None]:
bagging_model = BaggingClassifier(n_estimators=50, random_state=42)

In [None]:
# Train and predict
bagging_model.fit(X_train, y_train)
ensemble_predictions_bagging = bagging_model.predict(X_val)

In [None]:
# Evaluate
evaluate_ensemble_predictions(ensemble_predictions_bagging, y_val)

Accuracy: 0.6138
Precision: 0.3529
Recall (Sensitivity): 0.1905
F1 Score: 0.2474
ROC AUC: 0.5079
Specificity: 0.8254


### **Optimal weights for each model**

We can tune what weights applied to the predictions for each model provides the best performance on ROC AUC basis.

In [None]:
from scipy.optimize import differential_evolution
from sklearn.metrics import roc_auc_score

In [None]:
# Define the objective function to minimize (negative ROC AUC)
def objective(weights):
    # Ensure weights sum to 1
    weights = np.array(weights)
    if np.sum(weights) != 1:
        weights = weights / np.sum(weights)

    # Calculate the weighted average of predictions
    weighted_predictions = np.average(all_predictions, axis=0, weights=weights)

    # Calculate the negative ROC AUC
    return -roc_auc_score(test_labels, weighted_predictions)

# Bounds: Each weight is between 0 and 1
bounds = [(0, 1)] * all_predictions.shape[0]

# Initial guess (equal weights)
initial_guess = [1 / all_predictions.shape[0]] * all_predictions.shape[0]

# Optimize weights using differential evolution
result = differential_evolution(objective, bounds, strategy='best1bin', maxiter=1000, popsize=15, tol=1e-6, mutation=(0.5, 1), recombination=0.7, seed=42)

# Get the best weights
best_weights = result.x

# Calculate the final weighted predictions using the best weights
final_weighted_predictions = np.average(all_predictions, axis=0, weights=best_weights)

In [None]:
# Evaluate the final weighted predictions
final_roc_auc = roc_auc_score(test_labels, final_weighted_predictions)

In [None]:
# Convert weighted predictions to binary labels using a threshold of 0.5
final_binary_predictions = (final_weighted_predictions > 0.5).astype(int)

final_accuracy = accuracy_score(test_labels, final_binary_predictions)
final_precision = precision_score(test_labels, final_binary_predictions)
final_recall = recall_score(test_labels, final_binary_predictions)
final_f1 = f1_score(test_labels, final_binary_predictions)
tn, fp, fn, tp = confusion_matrix(test_labels, final_binary_predictions).ravel()
final_specificity = tn / (tn + fp)

# Print all evaluation metrics
print(f'Optimal Weights: {best_weights}')
print(f'Final ROC AUC: {final_roc_auc:.4f}')
print(f'Final Accuracy: {final_accuracy:.4f}')
print(f'Final Precision: {final_precision:.4f}')
print(f'Final Recall (Sensitivity): {final_recall:.4f}')
print(f'Final F1 Score: {final_f1:.4f}')
print(f'Final Specificity: {final_specificity:.4f}')

Optimal Weights: [0.08963606 0.60888677 0.0210073  0.12180916 0.11114151 0.8553699 ]
Final ROC AUC: 0.6678
Final Accuracy: 0.6908
Final Precision: 0.6364
Final Recall (Sensitivity): 0.1359
Final F1 Score: 0.2240
Final Specificity: 0.9620


In [None]:
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

In [None]:
best_weights = [0.08963606, 0.60888677, 0.0210073, 0.12180916, 0.11114151, 0.8553699]
best_weights = np.array(best_weights)
best_weights = softmax(best_weights)
best_weights

array([0.12788627, 0.21494713, 0.11940399, 0.13206767, 0.13066631,
       0.27502863])

In [None]:
best_weights = [0.08963606, 0.60888677, 0.0210073, 0.12180916, 0.11114151, 0.8553699]
best_weights = np.array(best_weights)
best_weights = best_weights / np.sum(best_weights)
best_weights

array([0.04958156, 0.33680147, 0.01162004, 0.06737789, 0.06147715,
       0.47314189])