YingxiaLi2023/multi-omics-data

######################################################################################

Code to the Benchmark Study Based on Multi-Omics Data by Li et al. (2024)

This is the electronic appendix (R code) to the article "Does combining numerous data types in multi-omics data improve or hinder performance in survival prediction? Insights from a large-scale benchmark study" (2024) by Yingxia Li (1), Tobias Herold (2), Ulrich Mansmann (1), Roman Hornung (1,3).

  1. Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, 81377 Munich, Germany
  2. Laboratory for Leukemia Diagnostics, Department of Medicine III, LMU University Hospital, LMU Munich, Munich, Germany
  3. Munich Center for Machine Learning (MCML), Munich, Germany

######################################################################################

Program and Platform

  • Program: R, version 4.1.2 (2021-11-01)
  • Platform:
    • Linux (x86-64): For conducting the analyses
    • Windows 10 (64-bit): For evaluating the results
  • Session Information: The following output from sessionInfo() describes which R packages and versions were used:
  > sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: SUSE Linux Enterprise Server 15 SP1

Matrix products: default

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:                                                                                                                                
[1] grid      stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] mlr_2.19.2          tidyr_1.3.1         survcomp_1.44.1
 [4] stringr_1.5.1       snow_0.4-4          Rmisc_1.5.1
 [7] lattice_0.20-45     RColorBrewer_1.1-3  ranger_0.13.1
[10] prioritylasso_0.2.5 plyr_1.8.7          pec_2022.03.06
[13] prodlim_2019.11.13  patchwork_1.2.0     ParamHelpers_1.14.1
[16] OpenML_1.12         ipflasso_1.1        survival_3.3-1
[19] gridExtra_2.3       glmnet_4.1-3        Matrix_1.4-1
[22] ggplot2_3.5.1       farff_1.1.1         dplyr_1.1.4
[25] bootstrap_2019.6    blockForest_0.2.4

loaded via a namespace (and not attached):                                                                 
 [1] httr_1.4.7          jsonlite_1.8.8      splines_4.1.2
 [4] foreach_1.5.2       rmeta_3.0           globals_0.14.0
 [7] survivalROC_1.0.3   timereg_2.0.2       numDeriv_2016.8-1.1
[10] pillar_1.9.0        backports_1.4.1     glue_1.6.2
[13] digest_0.6.36       checkmate_2.0.0     colorspace_2.0-3
[16] XML_3.99-0.17       pkgconfig_2.0.3     listenv_0.8.0
[19] purrr_1.0.2         scales_1.3.0        parallelMap_1.5.1
[22] lava_1.6.10         tzdb_0.4.0          tibble_3.2.1
[25] generics_0.1.2      cachem_1.1.0        withr_2.5.0
[28] cli_3.6.3           magrittr_2.0.3      memoise_2.0.1
[31] future_1.24.0       fansi_1.0.3         parallelly_1.31.0
[34] SuppDists_1.1-9.7   tools_4.1.2         data.table_1.14.2
[37] hms_1.1.3           lifecycle_1.0.4     BBmisc_1.13
[40] munsell_0.5.0       compiler_4.1.2      rlang_1.1.4
[43] iterators_1.0.14    gtable_0.3.0        codetools_0.2-18
[46] curl_5.2.1          R6_2.5.1            fastmap_1.2.0
[49] future.apply_1.8.1  utf8_1.2.2          fastmatch_1.1-4
[52] KernSmooth_2.23-20  shape_1.4.6         readr_2.1.5
[55] stringi_1.7.6       parallel_4.1.2      Rcpp_1.0.8.3
[58] vctrs_0.6.5         tidyselect_1.2.1

######################################################################################

Repository Structure

1. Data Subfolder

  • Purpose: Contains scripts for downloading the OpenML data needed for reproducing the analyses.
  • Contents:
  • "down_data.R": Downloads data from OpenML.
  • "dataset_ids.RData": Includes OpenML IDs for the datasets to be downloaded.
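
The download step can be pictured roughly as follows. This is a minimal sketch and not the content of "down_data.R": the function name download_datasets, the output file naming, and the example IDs are hypothetical, and the name of the object stored in "dataset_ids.RData" is not assumed here. The only OpenML call used is getOMLDataSet() from the OpenML R package listed in the session information above. The fetcher is passed in as an argument so that the loop can be tried offline with a stub.

```r
# Download (or otherwise fetch) one table per ID and store each as an .rds file.
# 'fetch' defaults to the OpenML package; calling the default requires that
# package and internet access. Function name and file naming are hypothetical.
download_datasets <- function(ids,
                              fetch = function(id) OpenML::getOMLDataSet(data.id = id)$data,
                              out_dir = tempdir()) {
  paths <- character(length(ids))
  for (i in seq_along(ids)) {
    dat <- fetch(ids[i])                                   # data.frame for this ID
    paths[i] <- file.path(out_dir, paste0("openml_", ids[i], ".rds"))
    saveRDS(dat, paths[i])
  }
  stats::setNames(paths, ids)                              # named by dataset ID
}

# Offline dry run: a stub fetcher and made-up IDs stand in for OpenML.
stub_fetch <- function(id) data.frame(id = id, time = c(5, 8), status = c(1, 0))
paths <- download_datasets(c(101L, 102L), fetch = stub_fetch)
first <- readRDS(paths[["101"]])
```

For a real run, the IDs would come from the object loaded from "dataset_ids.RData" and the default fetcher would be used.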

2. JobScripts Subfolder

  • Purpose: Contains scripts for reproducing the benchmark study.
  • Contents:
  • "AnalysisCluster_1_4_5.R": For single blocks, combinations of 4 blocks, and combinations of 5 blocks.
  • "AnalysisCluster_2.R": For combinations of 2 blocks.
  • "AnalysisCluster_3.R": For combinations of 3 blocks.

3. Functions Subfolder

  • Purpose: Contains scripts with functions used in the benchmark study.
  • Contents:
  • Scripts whose names contain "Functions_AnalysisCluster": Functions for applying the different prediction methods to the different block combinations.

4. Evaluations Subfolder

  • Purpose: Contains scripts for evaluating results and reproducing figures and tables.
  • Contents:
  • "Evaluation_AnalysisCluster_fivemethods.R": Evaluates raw results.
  • "bootstrap analysis_ibrier.R" and "bootstrap analysis_cindex.R": Perform the bootstrap analysis.
  • Scripts labeled "figures": Reproduce the figures shown in the paper and supplement. This includes "figures_2_S12_S13_S14.R", which also produces the results of the analysis presented at the end of the section "Best-performing combinations of prediction methods and blocks per dataset".
  • "test_for_figure_2.R" and "tests_for_table_3.R": Perform the statistical tests for Figure 2 and Table 3.
  • "tables_S2_S3.R": Code for producing the information shown in Tables S2 and S3.

5. Results Subfolder

  • Purpose: Contains results and figures from the benchmark study.

  • Contents:

  • rda_files Subfolder:

    • "scenariogrid1.Rda", "scenariogrid2.Rda", "scenariogrid3.Rda": Generated by the R scripts in JobScripts.
    • "resultsumsum.RData", "resultsum.RData", "CI_cindex.xlsx", "CI_ibrier.xlsx", "table_S2_ibrier.docx", "table_S2_cindex.docx": Generated by R scripts in Evaluations.

     

    The file "resultsum.RData" is of particular importance. It contains a table (R data.frame) named resultsum, which holds the unaggregated results of the benchmark study. It has the following columns: comb, dat, cvind, ibrier_bf, cindex_bf, ibrier_rf, cindex_rf, ibrier_lasso, cindex_lasso, ibrier_ipflasso, cindex_ipflasso, ibrier_prioritylasso, cindex_prioritylasso. Here, comb gives the block combination, dat the dataset, and cvind the index of the cross-validation repetition (five repetitions were performed). The columns beginning with cindex_ and ibrier_ give the cross-validated C-index and integrated Brier score values, respectively, for the prediction method named by the suffix.
    This table may be used to extend the benchmark study to include further prediction methods.

  • Figures Subfolder:

    • Contains all figures from the paper and supplement. Most figures have two versions, one with the suffix "_raw" and one without. The versions with the suffix "_raw" were generated by the R code, and the versions without the suffix were subsequently edited for visual reasons.
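
As an illustration of how the resultsum table described under the rda_files subfolder could be extended and summarised, the following sketch builds a synthetic table with the same column layout, merges in the results of a further method, and averages over the cross-validation repetitions. All values are random, and the block names, dataset names, and the "newmethod" columns are hypothetical; with the real data one would start from load("resultsum.RData") instead.

```r
# Synthetic stand-in for resultsum (same column names, made-up values).
set.seed(1)
resultsum_demo <- expand.grid(comb = c("rna", "rna_mirna"),
                              dat = c("datasetA", "datasetB"),
                              cvind = 1:5,
                              stringsAsFactors = FALSE)
methods <- c("bf", "rf", "lasso", "ipflasso", "prioritylasso")
for (m in methods) {
  resultsum_demo[[paste0("ibrier_", m)]] <- runif(nrow(resultsum_demo), 0.10, 0.30)
  resultsum_demo[[paste0("cindex_", m)]] <- runif(nrow(resultsum_demo), 0.50, 0.80)
}

# Results of a hypothetical further method, keyed by the same three columns.
new_results <- resultsum_demo[c("comb", "dat", "cvind")]
new_results$ibrier_newmethod <- runif(nrow(new_results), 0.10, 0.30)
new_results$cindex_newmethod <- runif(nrow(new_results), 0.50, 0.80)

# Attach the new columns to the existing table.
extended <- merge(resultsum_demo, new_results, by = c("comb", "dat", "cvind"))

# Average every metric over the cross-validation repetitions.
metric_cols <- grep("^(ibrier|cindex)_", names(extended), value = TRUE)
agg <- aggregate(extended[metric_cols], by = extended[c("comb", "dat")], FUN = mean)

# Method with the highest mean C-index per dataset and block combination.
cindex_cols <- grep("^cindex_", names(agg), value = TRUE)
best <- max.col(as.matrix(agg[cindex_cols]), ties.method = "first")
agg$best_cindex <- sub("^cindex_", "", cindex_cols[best])
```

Merging on comb, dat, and cvind keeps each new value aligned with the same dataset, block combination, and repetition as the existing results.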

6. Full reproduction of the results:

  • An MPI environment is required.

  • The R scripts in the JobScripts subfolder whose names begin with "AnalysisCluster" require the RMPISNOW shell script from the R package snow. Therefore, before executing these scripts, you need to install the RMPISNOW shell script, taken either from the installed snow R package or from the inst directory of the snow package sources, in an appropriate location, preferably on your path. See http://homepage.divms.uiowa.edu/~luke/R/cluster/cluster.html for more details. Subsequently, you need to create one sh file for each of the above R scripts. The following is the content of an example sh file "simulation_clustdata.sh" for SLURM:

  #!/bin/bash
  #SBATCH -o /myoutfiledirectory/myjob.%j.%N.out
  #SBATCH -D /myhomedirectory
  #SBATCH -J LargeStudy
  #SBATCH --get-user-env 
  #SBATCH --clusters=myclustername
  #SBATCH --partition=mypartitionname
  #SBATCH --qos=mypartitionname
  #SBATCH --nodes=??
  #SBATCH --ntasks-per-node=??
  #SBATCH --mail-type=end
  #SBATCH --mail-user=my@mail.de
  #SBATCH --time=??:??:??

  mpirun RMPISNOW < ./multi-omics-data/Jobscripts/AnalysisCluster.R

The above sh file has to be adjusted before it can be used: the "?"s have to be replaced by actual numbers, the directories have to be adjusted, and you need to specify your e-mail address (an e-mail will be sent to this address once the job is finished).

Note that it is possible to use parallelization techniques other than RMPISNOW (e.g., the parallel R package) to reproduce the results. This is because we use a specific seed for each line in the scenariogrid data frames created by the "AnalysisCluster" scripts. Each line in these data frames corresponds to one iteration in the benchmark study (see the corresponding files for details). This makes the reproducibility independent of the specific type of parallelization. However, to use a type of parallelization other than RMPISNOW, it is necessary to modify the "AnalysisCluster" scripts accordingly.
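
The following toy example illustrates why per-line seeds make the results independent of the parallelization. It is a sketch under stated assumptions, not code from the "AnalysisCluster" scripts: the grid contents, the seed values, and the function run_iteration are hypothetical, and the random draw merely stands in for fitting and evaluating a prediction method. The same iterations are run once sequentially and once on a socket cluster from the parallel R package.

```r
# Hypothetical scenario grid: one row per benchmark iteration, each with its own seed.
scenariogrid_demo <- data.frame(
  dat  = rep(c("datasetA", "datasetB"), each = 3),
  comb = rep(c("block1", "block2", "block1_block2"), times = 2),
  seed = 1001:1006,
  stringsAsFactors = FALSE
)

# Stand-in for one iteration; the seed depends only on the row, not on the worker.
run_iteration <- function(i, grid) {
  set.seed(grid$seed[i])
  mean(rnorm(50))
}

rows <- seq_len(nrow(scenariogrid_demo))

# Sequential reference run.
res_seq <- vapply(rows, run_iteration, numeric(1), grid = scenariogrid_demo)

# Same iterations distributed over two workers.
cl <- parallel::makeCluster(2)
res_par <- unlist(parallel::parLapply(cl, rows, run_iteration, grid = scenariogrid_demo))
parallel::stopCluster(cl)

# Both runs should agree exactly, regardless of which worker handled which row.
same_results <- identical(res_seq, res_par)
```

Because the seed is set inside each iteration, the order in which rows are processed and the number of workers do not affect the values produced for any row.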
