Protein Function Prediction

About the Dataset

File Description

Gene Ontology: The ontology data is in the file go-basic.obo. This structure is the 2023-01-01 release of the GO graph. This file is in OBO format, for which there exist many parsing libraries.

subontology_roots = { 'BPO':'GO:0008150',
                      'CCO':'GO:0005575',
                      'MFO':'GO:0003674' }

Training Sequences: train_sequences.fasta contains the protein sequences for the training dataset. The train_sequences.fasta file will indicate from which database the sequence originate.

Labels: train_terms.tsv contains the list of annotated terms (ground truth) for the proteins in train_sequences.fasta. The first column indicates the protein's UniProt accession ID, the second is the GO term ID, and the third indicates in which ontology the term appears.

Taxonomy: train_taxonomy.tsv contains the list of proteins and the species to which they belong, represented by a "taxonomic identifier" (taxon ID) number. The first column is the protein UniProt accession ID and the second is the taxon ID.

Information Accretion: IA.txt contains the information accretion (weights) for each GO term. These weights are used to compute weighted precision and recall, as described in the Evaluation section.

Files

train_sequences.fasta - amino acid sequences for proteins in training set.
train_terms.tsv - the training set of proteins and corresponding annotated GO terms.
train_taxonomy.tsv - taxon ID for proteins in training set.
go-basic.obo - ontology graph structure.
testsuperset.fasta - amino acid sequences for proteins on which the predictions should be made.
testsuperset-taxon-list.tsv - taxon ID for proteins in test superset.
IA.txt - Information Accretion for each term. This is used to weight precision and recall.

Folder Structure

protein-function-prediction
	/data
	    /Train
	        /go-basic.obo
	        /train_sequences.fasta
	        /train_taxonomy.tsv
	        /train_terms.tsv
	    /Tests (Targets)
	        /testsuperset-taxon-list.tsv
	        /testsuperset.fasta
	    /IA.txt
	/t5_embeds
	    /test_embeds.npy
	    /test_ids.npy
	    /train_embeds.npy
	    /train_ids.npy
	/protein_function_prediction.ipynb

Steps to Run and Execute the Code:

Clone the repository.
Download the t5_embeds embeddings from here.
Check the folder structure is as mentioned above or make changes to the data import path in the ipynb file.
Execute the cells one by one.

Note

The dataset data is obtained from CAFA-5.
The final_output.tsv file generated is large in size therefore it is not commited. Execute the cells to obtain the test set prediction.

Citation

@inbook{
	doi:https://doi.org/10.1002/9781394209965.ch6,
	author = {Jakhmola-Mani, Ruchi and
		Sonali and
		Pandey, Aniket and
		Raturi, Dhananjay and
		Singh, Rishita and
		Vanam, Kusala and
		D, Manish and
		Chauhan, Ritu and
		Katare, Deepshikha Pande and
		Nongdam, Potshangbam and
		Potshangbam, Angamba Meetei},
	publisher = {John Wiley & Sons, Ltd},
	isbn = {9781394209965},
	title = {Exploring Machine Learning Algorithms for Gene Function Prediction in Crops},
	booktitle = {Bioinformatics for Plant Research and Crop Breeding},
	chapter = {6},
	pages = {159-183},
	doi = {https://doi.org/10.1002/9781394209965.ch6},
	url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/9781394209965.ch6},
	eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781394209965.ch6},
	year = {2024}
}

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
data		data
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
protein_function_prediction.ipynb		protein_function_prediction.ipynb
protein_function_prediction.pdf		protein_function_prediction.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Protein Function Prediction

About the Dataset

File Description

Files

Folder Structure

Steps to Run and Execute the Code:

Note

Citation

About

Uh oh!

Releases

Packages

Languages

aniket414/protein-function-prediction

Folders and files

Latest commit

History

Repository files navigation

Protein Function Prediction

About the Dataset

File Description

Files

Folder Structure

Steps to Run and Execute the Code:

Note

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages