The genedise project aims to find druggable genes for a specific disease based on previously tried targets. Whether these targets were successful is not the primary concern: the fact that there was enough evidence to try them is enough for us. In this way, we aim to mimic the time-consuming task of proposing new reasonable targets.
The files and directories of this project are prefixed with a number that indicates the chronological order of their execution.
Scripts are stored in
Their outputs are saved in folders sharing their prefix.
The most relevant prefixes are:
- 2X_: analysis on the STRING network
- 4X_: analysis on the OmniPath network
- 6X_: plots and models combining both networks (depends on the execution of the 2X and 4X scripts)
The output of sessionInfo() is always stored in the directory 00_metadata to keep track of the package versions.
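A minimal sketch of how a script might record its session information. The helper name save_session_info and the file naming scheme are assumptions made for illustration; only sessionInfo() and the 00_metadata directory come from the project itself.

```r
# Hypothetical helper: write the output of sessionInfo() to
# <dir>/<script_name>_sessionInfo.txt (the naming scheme is an assumption)
save_session_info <- function(script_name, dir = "00_metadata") {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  out_file <- file.path(dir, paste0(script_name, "_sessionInfo.txt"))
  writeLines(utils::capture.output(utils::sessionInfo()), out_file)
  invisible(out_file)
}

# Example call; it writes to a temporary directory so the project is untouched
out <- save_session_info(
  "21_example",
  dir = file.path(tempdir(), "00_metadata")
)
```

Calling such a helper at the end of each script keeps one record per script, so the package versions behind any given output can be traced.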
There are configuration files, such as 03_config.R, that contain a comprehensive set of parameters, paths and file names.
Generally, these parameters are sourced instead of being hardcoded in the scripts.
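The following sketch illustrates the sourcing pattern. The parameter names (dir_metadata, n_folds, n_repetitions) are hypothetical and the config file is created in a temporary directory so that the example runs on its own; the actual contents of 03_config.R may differ.

```r
# Write a stand-in config file (parameter names are assumptions)
config_file <- file.path(tempdir(), "03_config.R")
writeLines(c(
  'dir_metadata <- "00_metadata"',
  'n_folds <- 3',
  'n_repetitions <- 25'
), config_file)

# In an analysis script: source the parameters instead of hardcoding them.
# A dedicated environment keeps them separate from the script's own variables.
config <- new.env()
source(config_file, local = config)

n_total_folds <- config$n_folds * config$n_repetitions
```

Sourcing into an environment is a design choice of this sketch; a plain source() call into the global environment works in the same way.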
The project has package version control through packrat to ease portability between machines.
Almost all the files in the project are included in the git repository at the moment. Exceptions:
- STRING database files
- Network kernel(s)
The paths to these files (on Sergi's machines) can be found in the config files.
There are several set.seed calls throughout the code.
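As a brief illustration of what these calls do (presumably they are there to make stochastic steps such as fold assignment repeatable), fixing the seed before a random draw yields the same result on every run. The seed value below is arbitrary.

```r
# Same seed, same random permutation
set.seed(1)
x <- sample(10)

set.seed(1)
y <- sample(10)

same_draw <- identical(x, y)
```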
Intermediate results are saved when the space required is not prohibitive.
- Check OpenTargets data sanity
- Choose network: compromise between coverage and size
- Compute and store graph kernel on chosen network
- Save cleaned data, mapped to the network of choice
- Characterisation of disease genes in terms of network properties
- Within-disease study
- Between-disease study
Load configuration files
Load network data
Build CV folds
Define functions for prediction
Define performance metrics
For each disease, input_type, fold
- Define train and validation
- Predict for every method using train
- Compute performance metrics
- Write to disk
Build statistical models for comparing methods
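The loop in the outline above can be sketched in base R as follows. Everything here is a toy stand-in: the gene labels, the two placeholder prediction methods, the stratified fold assignment and the choice of AUROC as the metric are assumptions made for illustration, not the project's actual data or methods. The input_type dimension and the final statistical models are left out for brevity.

```r
set.seed(1)

# Toy labels: 30 genes per disease, 10 of them positive (hypothetical data)
genes <- paste0("gene", 1:30)
labels <- list(
  diseaseA = setNames(rep(c(1, 0, 0), times = 10), genes),
  diseaseB = setNames(rep(c(0, 1, 0), times = 10), genes)
)

# Stratified folds: every fold receives both positives and negatives
make_folds <- function(y, k) {
  fold <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    fold[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  fold
}

# Placeholder prediction methods: score validation genes using train labels
methods <- list(
  random   = function(train_y, val_genes) runif(length(val_genes)),
  constant = function(train_y, val_genes) rep(mean(train_y), length(val_genes))
)

# Performance metric: AUROC via the rank-sum (Mann-Whitney) formula
auroc <- function(scores, y) {
  r <- rank(scores)
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

n_rep <- 2  # repetitions of the CV scheme (kept small for the example)
k <- 3      # folds per repetition
results <- list()

for (disease in names(labels)) {
  y <- labels[[disease]]
  for (rep_id in seq_len(n_rep)) {
    fold <- make_folds(y, k)
    for (f in seq_len(k)) {
      train <- fold != f  # define train and validation
      val <- fold == f
      for (m in names(methods)) {
        scores <- methods[[m]](y[train], names(y)[val])
        results[[length(results) + 1]] <- data.frame(
          disease = disease, repetition = rep_id, fold = f, method = m,
          auroc = auroc(scores, y[val])
        )
      }
    }
  }
}

# Write the collected metrics to disk (temporary file in this sketch)
results <- do.call(rbind, results)
metrics_file <- file.path(tempdir(), "cv_metrics.csv")
write.csv(results, metrics_file, row.names = FALSE)
```

The resulting table has one row per disease, repetition, fold and method, which is a convenient shape for fitting models that compare methods afterwards.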
The runs have been executed on the following hardware from the UPC:
- 12 threads (Intel(R) Xeon(R) CPU E7310@1.60GHz)
- 32GB RAM
- 32 threads (Intel(R) Xeon(R) CPU E5email@example.comGHz)
- 32GB RAM
Running the script is barely possible with 16GB of RAM. We recommend using 32GB to avoid swapping during memory spikes.
For reference, executing all the diseases under a single repeated CV scheme (25 repetitions, 3 folds per repetition) on eko takes one week. By comparison, sun is twice as fast. The code is a mixture of serial and parallel execution because not all the methods run in parallel.
The computationally intensive code was also run on a torque-based cluster, but the parallel R package (part of base R) was unable to clean up the child processes.
This led to memory exhaustion and made the approach infeasible.
Alternatives to tackle this while keeping reproducibility might be added in the future.