fabianopapaiz/ensemble_imbalance_model_for_als_prognosis

Ensemble-Imbalance-based classification for amyotrophic lateral sclerosis prognostic prediction: identifying short-survival patients at diagnosis

This code uses ensemble and imbalance learning approaches to improve the identification of short-survival amyotrophic lateral sclerosis (ALS) patients at diagnosis time. We also used the SHAP framework to explain how the best model classified patients.
The results of this work have been published in the research article "Ensemble-Imbalance-based classification for amyotrophic lateral sclerosis prognostic prediction: identifying short-survival patients at diagnosis" (Papaiz et al., 2023).

If you use this code in your research, please cite this paper:

Papaiz F, Dourado MET, Valentim RAdM, Pinto R, de Morais AHF, Arrais JP. Ensemble-Imbalance-based classification for amyotrophic lateral sclerosis prognostic prediction: identifying short-survival patients at diagnosis. 2023.

LICENSE


For those wanting to try it out, this is what you need:

  1. A working installation of Python (version 3.9 or later) and Jupyter Notebook.

  1. Install the following Python packages:
    • numpy (1.23.5)
    • pandas (1.5.3)
    • matplotlib (3.7.0)
    • seaborn (0.12.2)
    • scikit-learn (1.2.1)
    • imbalanced-learn (0.10.1)
    • shap (0.41.0)
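
A quick way to compare an environment against the versions listed above is sketched below. It is only an illustration and not part of the repository: the `check_versions` helper and the `PINNED` dictionary are hypothetical names, and the script assumes the packages were installed under their PyPI distribution names.

```python
from importlib import metadata

# Versions copied from the package list above.
PINNED = {
    "numpy": "1.23.5",
    "pandas": "1.5.3",
    "matplotlib": "3.7.0",
    "seaborn": "0.12.2",
    "scikit-learn": "1.2.1",
    "imbalanced-learn": "0.10.1",
    "shap": "0.41.0",
}


def check_versions(pinned):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

A different installed version does not necessarily break the code; the check only flags where an environment deviates from the versions listed here.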

  1. Download the patient data from the Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT) website (https://ncri1.partners.org/ProACT):
    • Register and log in to the website

    • Access the Data menu and download the ALL FORMS dataset

    • Extract the zipped data file into the 01_raw_data folder

    • The 01_raw_data folder will then contain the CSV files shown in the figure:

      [Figure: raw_data_folder]
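
To confirm that the archive was extracted into the right place, the CSV files in the folder can be listed with a few lines of standard-library Python. This is a minimal sketch: `list_csv_files` is a hypothetical helper, and the relative path assumes the script is run from the repository root.

```python
from pathlib import Path


def list_csv_files(folder):
    """Return the sorted names of the CSV files directly inside `folder`."""
    path = Path(folder)
    if not path.is_dir():
        raise FileNotFoundError(f"Folder not found: {path}")
    return sorted(
        p.name for p in path.iterdir()
        if p.is_file() and p.suffix.lower() == ".csv"
    )


if __name__ == "__main__":
    raw_dir = Path("01_raw_data")  # assumed to be relative to the repository root
    if raw_dir.is_dir():
        for name in list_csv_files(raw_dir):
            print(name)
    else:
        print(f"{raw_dir} not found; run this from the repository root")
```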


  1. Perform the Extract-Transform-Load (ETL) step:

    • Start the jupyter-notebook environment
    • Open the 02.01 - Preprocess raw data.ipynb notebook, located in the 02_ETL folder, and run all of its cells
    • After execution, the preprocessed data will be saved in the 03_preprocessed_data and 04_data_to_analyze folders

    [Figure: preprocessed data]


  1. Perform the Machine Learning (ML) pipeline:
    • Run the Python program exec_grid_search_both_scenarios.py in the 05_Train_Validate_Models folder

    • This program will:

      • Split the dataset into Training and Validation subsets
      • Train and validate the ML models for both scenarios (Single-Model and Ensemble-Imbalance)
        • NOTE: This can take a long time to complete (possibly days).
      • Save the performance results into CSV files in the 05_Train_Validate_Models/exec_results folder
    • Pipeline Overview:

      [Figure: ml_pipeline]

    • Validation performance obtained by each scenario and algorithm:

      [Figure: performances_both_scenarios_barplot]


  1. Generate SHAP explanations for the model that reached the best performance in the Ensemble-Imbalance scenario (i.e., the BalancedBagging model using Neural Networks as the base estimator)
    • Create a SHAP Kernel Explainer instance using the best model and the Validation set:
      • explainer = shap.KernelExplainer(<<BEST_MODEL>>.predict, X_valid)
    • Generate the SHAP values (note: this can take many hours):
      • shap_values = explainer.shap_values(X_valid)
    • Analyze the SHAP results by plotting SHAP graphs. See the examples below:
      • Decision plot:

        [Figure: patient_B_decision_plot]

      • Summary plots (bar and dot plots):

        [Figure: SHAP_0_Feature_Importance_and_Beeswarm]


  1. Grid-Search hyperparameters used for each algorithm.

[Figure: grid-search-params]


  1. Best models' hyperparameters

[Figure: best-model-params]


  1. Additional Information:

    Exploratory Data Analysis


Finally, please let us know if you have any comments or suggestions, or if you have questions about the code or the procedure (correspondence e-mail: fabianopapaiz at gmail dot com).
