Ensemble-Imbalance-based classification for amyotrophic lateral sclerosis prognostic prediction: identifying short-survival patients at diagnosis
This code uses ensemble and imbalance learning approaches to improve the identification of short-survival amyotrophic lateral sclerosis (ALS) patients at diagnosis time. We also used the SHAP framework to explain how the best model classified the patients.
The results of this work have been published in the research article "Ensemble-Imbalance-based classification for amyotrophic lateral sclerosis prognostic prediction: identifying short-survival patients at diagnosis" (Papaiz et al., 2023).
If you use this code for your research, please cite this paper:
Papaiz F, Dourado MET, Valentim RAdM, Pinto R, de Morais AHF, Arrais JP. Ensemble-Imbalance-based classification for amyotrophic lateral sclerosis prognostic prediction: identifying short-survival patients at diagnosis. 2023.
For those wanting to try it out, this is what you need:
- A working version of Python (version 3.9+) and jupyter-notebook.
- Install the following Python packages:
  - numpy (1.23.5)
  - pandas (1.5.3)
  - matplotlib (3.7.0)
  - seaborn (0.12.2)
  - scikit-learn (1.2.1)
  - imbalanced-learn (0.10.1)
  - shap (0.41.0)
- Download the patient data analyzed from the Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT) website (https://ncri1.partners.org/ProACT)
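
Before running the notebooks, it can help to confirm that the installed packages match the versions listed above. The snippet below is a minimal sketch using only the Python standard library; the `check_versions` helper is a hypothetical name introduced here for illustration and is not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions as listed in this README.
REQUIRED = {
    "numpy": "1.23.5",
    "pandas": "1.5.3",
    "matplotlib": "3.7.0",
    "seaborn": "0.12.2",
    "scikit-learn": "1.2.1",
    "imbalanced-learn": "0.10.1",
    "shap": "0.41.0",
}


def check_versions(required):
    """Return {package: (wanted, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in required.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


# Hypothetical helper: report anything that differs from the README list.
for name, (wanted, installed) in check_versions(REQUIRED).items():
    print(f"{name}: README lists {wanted}, found {installed or 'nothing installed'}")
```

A different version does not necessarily break the code; the check only flags differences from the environment described here.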
- Perform the Extract-Transform-Load (ETL) step:
  - Start the `jupyter-notebook` environment
  - Open and execute all code of the `02.01 - Preprocess raw data.ipynb` file, which is inside the `02_ETL` folder
  - After execution, the preprocessed data will be saved in the `03_preprocessed_data` and `04_data_to_analyze` folders
- Perform the Machine Learning (ML) pipeline:
  - Execute the Python program `exec_grid_search_both_scenarios.py` in the `05_Train_Validate_Models` folder
  - This program will:
    - Split the dataset into Training and Validation subsets
    - Train and validate the ML models for both scenarios (Single-Model and Ensemble-Imbalance)
      - NOTE: It can take a long time to complete (even days).
    - Save the performance results into CSV files in the `05_Train_Validate_Models/exec_results` folder
  - Pipeline Overview (figure not shown)
  - Validation performance obtained by each scenario and algorithm (figure not shown)
- Execute the SHAP explanations over the model that reached the best performance for the Ensemble-Imbalance scenario (i.e., BalancedBagging model using Neural Networks as a base estimator)
  - Create a SHAP Kernel Explainer instance using the best model and the Validation set:
    `explainer = shap.KernelExplainer(<<BEST_MODEL>>.predict, X_valid)`
  - Generate the SHAP values (note: this can take many hours):
    `shap_values = explainer.shap_values(X_valid)`
  - Analyze the SHAP results by plotting SHAP graphs (example plots not shown here).
- Additional Information:
  - Grid-Search hyperparameters used for each algorithm
  - Best models' hyperparameters
Finally, please let us know if you have any comments or suggestions, or if you have questions about the code or the procedure (correspondence e-mail: fabianopapaiz at gmail dot com).