Predicting protein functions using positive-unlabeled ranking with ontology-based priors

Abstract

Automated protein function prediction is a crucial and widely studied problem in bioinformatics. Computationally, protein function is a multilabel classification problem where only positive samples are defined and there is a large number of unlabeled annotations. Most existing methods rely on the assumption that the unlabeled set of protein function annotations are negatives, inducing the \emph{false negative} issue, where potential positive samples are trained as negatives. We introduce a novel approach named PU-GO, wherein we address function prediction as a positive-unlabeled classification problem. We apply empirical risk minimization, i.e., we minimize the classification risk of a classifier where class priors are obtained from the Gene Ontology hierarchical structure. We show that our approach is more robust than other state-of-the-art methods on similarity-based and time-based benchmark datasets

Dependencies

Python 3.10
PyTorch 2.1.0
FAIR-ESM (for predicting over FASTA files)
Other basic dependencies can be installed by running: conda env create -f environment.yml
Install diamond program on your system (diamond command should be available)

Data Availability

https://bio2vec.cbrc.kaust.edu.sa/data/pugo/pu-go-data.tar.gz - Here you can find the data used to train and evaluate our method.
Run:

cd PU-GO
wget https://bio2vec.cbrc.kaust.edu.sa/data/pugo/pu-go-data.tar.gz
tar -xzvf pu-go-data.tar.gz

For each subontology directory [mf, cc, bp] run the following command to extract the 10 trained models:

cd data/mf
tar -xzvf models.tar.gz

Scripts

pu_go.py script to train/test PU-GO on the similarity-based split. We show the commands with the selected hyperparameters for each subontology.
- MFO: python pu_go.py -dr data/ -ont mf --run 0 --batch_size 35 --loss_type pu_ranking_multi --margin_factor 0.01152 --max_lr 0.004573 --min_lr_factor 0.0008792 --prior 0.000143 -ld
- CCO: python pu_go.py -dr data/ -ont cc --run 0 --batch_size 30 --loss_type pu_ranking_multi --margin_factor 0.09812 --max_lr 0.0008681 --min_lr_factor 0.08364 --prior 0.0001106 -ld
- BPO: python pu_go.py -dr data/ -ont bp --run 0 --batch_size 39 --loss_type pu_ranking_multi --margin_factor 0.02457 --max_lr 0.0004305 --min_lr_factor 0.09056 --prior 0.0007845 -ld
pu_go_time.py script to train PU-GO on the time-based split.
predict.py script to predict functions given a FASTA file.
- Usage: python predict.py -if your_fasta_file.fa -d cuda

Note: We used the ESM2-15B model using a NVIDIA A100 GPU.

Name		Name	Last commit message	Last commit date
Latest commit History 77 Commits
baselines		baselines
scripts		scripts
.gitignore		.gitignore
README.md		README.md
environment.yml		environment.yml
evaluate.py		evaluate.py
extract_esm.py		extract_esm.py
gen_negatives.py		gen_negatives.py
nn.py		nn.py
predict.py		predict.py
pu_go.py		pu_go.py
pu_go_time.py		pu_go_time.py
pu_sweep.py		pu_sweep.py
run_diamond.sh		run_diamond.sh
uni2pandas.py		uni2pandas.py
upload_data.sh		upload_data.sh
utils.py		utils.py

bio-ontology-research-group/PU-GO

Folders and files

Latest commit

History

Repository files navigation

Predicting protein functions using positive-unlabeled ranking with ontology-based priors

Abstract

Dependencies

Data Availability

Scripts

Citation (TODO)

About

Topics

Resources

Stars

Watchers

Forks

Languages