Paper: https://doi.org/10.1038/s41467-021-26792-w
Data:
Code for reproducing the method described in:
Co-evolution based machine-learning for predicting functional interactions between human genes
Doron Stupp, Elad Sharon, Idit Bloch, Marinka Zitnik, Or Zuk, Yuval Tabach*
*. To whom correspondence should be addressed. YuvalTab@ekmd.huji.ac.il
Citation:
@article{stupp2021co,
title={Co-evolution based machine-learning for predicting functional interactions between human genes},
author={Stupp, Doron and Sharon, Elad and Bloch, Idit and Zitnik, Marinka and Zuk, Or and Tabach, Yuval},
journal={Nature communications},
volume={12},
number={1},
pages={1--14},
year={2021},
publisher={Nature Publishing Group}
}
MLPP - Machine Learning based Phylogenetic Profiling, the method described in the aforementioned paper, is a method for predicting functional interactions between human genes based on their phylogenetic profile. This method leverages machine learning based clade-wise co-evolution to improve upon existing phylogenetic profiling approaches.
The code in this repo enables one to reproduce the models and predictions presented in the paper. A small example for reproducing figure 1a and supp. figure 3 from the paper is included as well.
Installation and execution was tested on Ubuntu 20.04 running miniconda. To run the full pipeline it is advised to utilize a parallalized execution using snakemake due to the high workload of predicting all gene-pairs across all lables.
One can install the required software to reproduce our results by using conda. Clone this repo into a directory, and using the terminal from inside the cloned repository run:
conda env create -f enviroment.yaml
This will create a conda enviroment called MLPP
.
To run this example one needs to download the associated phylogenetic profiling matrices and other auxillary data which is available on zenodo at: 10.5281/zenodo.5111607.
To download the data, please execute the data script in a bash terminal inside the folder of the cloned repo. The scripts downloads the data to the current folder. Data total size approximately 1GB.
bash retrieve_data.sh
This example trains several machine learning models to predict functional interactions between genes. This examples reproduces figures 1a and supp. figure 3 from the paper.
For reproducing pathway types comparisons change the argument to --which inReactome_Path
Using the command line, run from inside the cloned repo:
conda activate MLPP
python comparing_classifiers.py --which in_Reactome
Expected output is available through the reports/figures/01_comparing_classifiers
folder.
The full pipeline can be run using snakemake. This takes considerable time due to the prediction of all gene-pairs across all different labels. For the results shown in the paper, the pipeline was ran on an HPC cluster using a slurm scheduler.
First fill config/config_smk.yaml
with the correct project dir:
#proj_dir
proj_dir: XXXXXXXXXXXXXX
Then, to run the pipeline locally run the script run_snakemake.sh
This first installs the src
directory as a python package and then runs the snakemake pipeline.
conda activate MLPP
bash run_snakemake.sh
When using the method or predictions described in the paper please cite
@article{stupp2021co,
title={Co-evolution based machine-learning for predicting functional interactions between human genes},
author={Stupp, Doron and Sharon, Elad and Bloch, Idit and Zitnik, Marinka and Zuk, Or and Tabach, Yuval},
journal={Nature communications},
volume={12},
number={1},
pages={1--14},
year={2021},
publisher={Nature Publishing Group}
}