OPI: An Open Instruction Dataset for Adapting Large Language Models to Protein-Related Tasks

Vision: Open Protein Instructions (OPI) is the first part of the Open Biology Instructions (OBI) project, to be followed by Open Molecule Instructions (OMI), Open DNA Instructions (ODI), Open RNA Instructions (ORI), and Open Single-cell Instructions (OSCI). OBI aims to fully leverage the potential of Large Language Models (LLMs), especially scientific LLMs such as Galactica, to facilitate research in the AI for Life Science community. While OBI is still at an early stage, we hope it provides a starting point for the community to bridge LLMs and biological domain knowledge.

Hugging Face links to OPI dataset and OPI-tuned models

OPI Dataset
OPI-Llama-3.1-8B-Instruct
OPI-Galactica-6.7B

Project Overview

This repo hosts the Open Protein Instructions (OPI) project, which aims to build and release a high-quality, comprehensive protein instruction dataset. With this dataset, LLMs can be adapted to protein-related tasks via instruction tuning and then evaluated on those tasks.

Usage and license notices: Galactica is intended and licensed for research use only. Llama-3 is licensed for researchers and commercial entities, upholding the principles of openness. The dataset is released under CC BY-NC 4.0 (non-commercial use only), and models trained on the dataset should be used for research purposes only. The weight diff for Stanford Alpaca is also released under CC BY-NC 4.0 (non-commercial use only).

OPI dataset construction pipeline

We curated the OPI dataset ourselves by extracting key information from the Swiss-Prot database. The following figure shows the overall construction process of OPI.

An example of OPI training data:

instruction: 
    What is the EC classification of the input protein sequence based on its biological function?
input:                         
    MGLVSSKKPDKEKPIKEKDKGQWSPLKVSAQDKDAPPLPPLVVFNHLTPPPPDEHLDEDKHFVVALYDYTAMNDRDLQMLKGEKLQVLKGTGDWWLARS
    LVTGREGYVPSNFVARVESLEMERWFFRSQGRKEAERQLLAPINKAGSFLIRESETNKGAFSLSVKDVTTQGELIKHYKIRCLDEGGYYISPRITFPSL
    QALVQHYSKKGDGLCQRLTLPCVRPAPQNPWAQDEWEIPRQSLRLVRKLGSGQFGEVWMGYYKNNMKVAIKTLKEGTMSPEAFLGEANVMKALQHERLV
    RLYAVVTKEPIYIVTEYMARGCLLDFLKTDEGSRLSLPRLIDMSAQIAEGMAYIERMNSIHRDLRAANILVSEALCCKIADFGLARIIDSEYTAQEGAK
    FPIKWTAPEAIHFGVFTIKADVWSFGVLLMEVVTYGRVPYPGMSNPEVIRNLERGYRMPRPDTCPPELYRGVIAECWRSRPEERPTFEFLQSVLEDFYT
    ATERQYELQP
output: 
    2.7.10.2
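A record like the one above has to be turned into a single prompt string before instruction tuning. The sketch below shows one possible way to do this. It assumes the Stanford Alpaca prompt wording, since the code in this repo is adapted from Stanford Alpaca, but the exact template used for OPI is not stated here, so treat `PROMPT_TEMPLATE` and `build_prompt` as hypothetical names and check the instruction_tuning guide for the real format. The sequence in the demo record is shortened for readability.

```python
# Assumption: an Alpaca-style template; the template actually used by OPI may differ.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_prompt(record):
    """Format one training record (instruction/input/output) as a prompt string."""
    return PROMPT_TEMPLATE.format(
        instruction=record["instruction"].strip(),
        input=record["input"].strip(),
    )


# Demo record: the fields follow the training example; the sequence is shortened.
record = {
    "instruction": "What is the EC classification of the input protein "
                   "sequence based on its biological function?",
    "input": "MGLVSSKKPDKEKPIKEKDKGQWSPLKVSAQDKDAPPLPP",
    "output": "2.7.10.2",
}

prompt = build_prompt(record)
target = record["output"]  # the text the model is trained to generate
```

During tuning, `prompt` would be the model input and `target` the expected completion.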

An example of OPI testing data:

{"id": "seed_task_0", "name": "EC number of price dataset from CLEAN", "instruction":
"Return the EC number of the protein sequence.", "instances": [{"input":
"MAIPPYPDFRSAAFLRQHLRATMAFYDPVATDASGGQFHFFLDDGTVYNTHTRHLVSATRFVVTHAMLYRTTGEARYQVGMRHALEFLRTAFLDPATGGY
AWLIDWQDGRATVQDTTRHCYGMAFVMLAYARAYEAGVPEARVWLAEAFDTAEQHFWQPAAGLYADEASPDWQLTSYRGQNANMHACEAMISAFRATGERR
YIERAEQLAQGICQRQAALSDRTHAPAAEGWVWEHFHADWSVDWDYNRHDRSNIFRPWGYQVGHQTEWAKLLLQLDALLPADWHLPCAQRLFDTAVERGWD
AEHGGLYYGMAPDGSICDDGKYHWVQAESMAAAAVLAVRTGDARYWQWYDRIWAYCWAHFVDHEHGAWFRILHRDNRNTTREKSNAGKVDYHNMGACYDVL
LWALDAPGFSKESRSAALGRP", "output": "5.3.1.7"}], "is_classification": false}
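Test records nest their input/output pairs under `instances`, unlike the flat training records. The following sketch flattens such records into (instruction, input, output) triples. It assumes each line of a `.jsonl` test file is one JSON object shaped like the example above; `iter_test_examples` is a hypothetical helper, not part of the repo, and the demo sequence is shortened.

```python
import json


def iter_test_examples(jsonl_text):
    """Yield (instruction, input, output) triples from JSONL test records."""
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for instance in record["instances"]:
            yield record["instruction"], instance["input"], instance["output"]


# One record in the shape of the testing example (sequence shortened).
sample_line = json.dumps({
    "id": "seed_task_0",
    "name": "EC number of price dataset from CLEAN",
    "instruction": "Return the EC number of the protein sequence.",
    "instances": [{"input": "MAIPPYPDFRSAAFLRQHLRATMAFYDPVATDASGG",
                   "output": "5.3.1.7"}],
    "is_classification": False,
})

examples = list(iter_test_examples(sample_line))
```

For a real file, the text could be read with `open(path, encoding="utf-8").read()` and passed to the same function.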

OPI dataset overview

We are excited to announce the release of the OPI dataset, a curated collection of instructions covering 9 tasks for adapting LLMs to protein biology. The dataset is designed to advance LLM-driven research in the field of protein biology. We welcome contributions and enhancements to this dataset from the community. The OPI dataset contains 1.64M samples, split into training (1,615,661) and testing (26,607) sets.

Accessing the OPI dataset: The complete OPI dataset can be accessed from Hugging Face. It is organized into three subfolders (AP, KM, and SU) in the OPI_DATA directory, plus the full training file OPI_full_1.61M_train.json. Once downloaded, place all subfolders and data files in the OPI_DATA folder within the repository. To merge all or several of the per-task training files into a single training file, run:

cd OPI_DATA
python merge_task_train_data.py --output OPI_merged_train.json
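For readers who want to see what such a merge involves, here is a minimal sketch. It is not the code of merge_task_train_data.py. It assumes each training `.json` file holds a JSON array of records with `instruction`, `input`, and `output` fields; `merge_train_files` and the file names in the demo are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def merge_train_files(paths, output_path):
    """Concatenate JSON-array training files into one JSON array on disk."""
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        if not isinstance(records, list):
            raise ValueError(f"{path} does not contain a JSON array")
        merged.extend(records)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False)
    return len(merged)


# Toy demonstration with made-up records in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    file_a = tmp / "task_a_train.json"
    file_b = tmp / "task_b_train.json"
    file_a.write_text(json.dumps(
        [{"instruction": "i1", "input": "x1", "output": "y1"}]), encoding="utf-8")
    file_b.write_text(json.dumps(
        [{"instruction": "i2", "input": "x2", "output": "y2"},
         {"instruction": "i3", "input": "x3", "output": "y3"}]), encoding="utf-8")

    out_file = tmp / "merged_train.json"
    merged_count = merge_train_files([file_a, file_b], out_file)
    merged_records = json.loads(out_file.read_text(encoding="utf-8"))
```

Records keep the order of the input files, so shuffling, if wanted, is left to the training pipeline.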

OPI Dataset folder structure:

./OPI_DATA/
├── SU
│   ├── EC_number
│   │   ├── test
│   │   │   ├── CLEAN_EC_number_new_test.jsonl
│   │   │   └── CLEAN_EC_number_price_test.jsonl
│   │   └── train
│   │       └── CLEAN_EC_number_train.json
│   ├── Fold_type
│   │   ├── test
│   │   │   └── fold_type_test.jsonl
│   │   └── train
│   │       └── fold_type_train.json
│   └── Subcellular_localization
│       ├── test
│       │   └── subcell_loc_test.jsonl
│       └── train
│           └── subcell_loc_train.json
├── AP
│   ├── Keywords
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_keywords_test.jsonl
│   │   │   ├── IDFilterSeq_keywords_test.jsonl
│   │   │   └── UniProtSeq_keywords_test.jsonl
│   │   └── train
│   │       └── keywords_train.json
│   ├── GO
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_go_terms_test.jsonl
│   │   │   ├── IDFilterSeq_go_terms_test.jsonl
│   │   │   └── UniProtSeq_go_terms_test.jsonl
│   │   └── train
│   │       └── go_terms_train.json
│   └── Function
│       ├── test
│       │   ├── CASPSimilarSeq_function_test.jsonl
│       │   ├── IDFilterSeq_function_test.jsonl
│       │   └── UniProtSeq_function_test.jsonl
│       └── train
│           └── function_train.json
└── KM
    ├── gSymbol2Tissue
    │   ├── test
    │   │   └── gene_symbol_to_tissue_test.jsonl
    │   └── train
    │       └── gene_symbol_to_tissue_train.json
    ├── gSymbol2Cancer
    │   ├── test
    │   │   └── gene_symbol_to_cancer_test.jsonl
    │   └── train
    │       └── gene_symbol_to_cancer_train.json
    └── gName2Cancer
        ├── test
        │   └── gene_name_to_cancer_test.jsonl
        └── train
            └── gene_name_to_cancer_train.json
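Because every task follows the same `<task type>/<task>/<train|test>/<file>` layout, the data files can be discovered programmatically. The sketch below builds an index keyed by (task type, task). `index_opi_data` is a hypothetical helper, and the demo runs on a small temporary copy of part of the tree, not on the real download.

```python
import tempfile
from pathlib import Path


def index_opi_data(root):
    """Map (task_type, task) to its train/test file names under root."""
    index = {}
    # Four levels below root: task type / task / split / data file.
    for path in sorted(Path(root).glob("*/*/*/*.json*")):
        split = path.parent.name
        if split not in ("train", "test"):
            continue
        task = path.parent.parent.name
        task_type = path.parent.parent.parent.name
        entry = index.setdefault((task_type, task), {"train": [], "test": []})
        entry[split].append(path.name)
    return index


# Toy demonstration: recreate the EC_number branch with empty files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "OPI_DATA"
    for rel in (
        "SU/EC_number/train/CLEAN_EC_number_train.json",
        "SU/EC_number/test/CLEAN_EC_number_new_test.jsonl",
        "SU/EC_number/test/CLEAN_EC_number_price_test.jsonl",
    ):
        file_path = root / rel
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.touch()

    opi_index = index_opi_data(root)
```

With the real dataset in place, `index_opi_data("OPI_DATA")` would return one entry per task in the tree.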

OPEval: Nine evaluation tasks using the OPI dataset

To assess the effectiveness of instruction tuning with the OPI dataset, we developed OPEval, which comprises three categories of evaluation tasks. Each category includes three specific tasks. The table below outlines the task types, names, and the corresponding sizes of the training and testing sets.

Task Type	Type Abbr.	Task Name	Task Abbr.	Training set size	Testing set size
Sequence Understanding	SU	EC Number Prediction	EC_number	74,487	392 (NEW-392), 149 (Price-149)
		Fold Type Prediction	Fold_type	12,312	718 (Fold), 1,254 (Superfamily), 1,272 (Family)
		Subcellular Localization Prediction	Subcellular_localization	11,230	2,772
Annotation Prediction	AP	Function Keywords Prediction	Keywords	451,618	184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq)
		Gene Ontology (GO) Terms Prediction	GO	451,618	184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq)
		Function Description Prediction	Function	451,618	184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq)
Knowledge Mining	KM	Tissue Location Prediction from Gene Symbol	gSymbol2Tissue	8,723	2,181
		Cancer Prediction from Gene Symbol	gSymbol2Cancer	590	148
		Cancer Prediction from Gene Name	gName2Cancer	590	148

Instruction tuning with OPI training data

Instruction tuning procedures are available in the instruction_tuning guide.

Accessing the OPI-Tuned Models: We have released the OPI-Llama-3.1-8B-Instruct and OPI-Galactica-6.7B models fine-tuned using OPI_full_1.61M_train.json, which can be accessed from Hugging Face.

Evaluating with OPI testing data

Evaluation procedures are outlined in the evaluation guide.

Evaluation results

Comprehensive evaluation results are detailed in the evaluation_results document.

Prediction comparison with SOTA models

Predictions from the OPI-tuned model, GPT-4o, Llama-3.1-8B-Instruct, and Claude 3.5 Sonnet are compared against the ground-truth answers in the model_compare document.

Demo

We use the FastChat platform to visually demonstrate the capabilities of the OPI-Galactica-6.7B model on various evaluation tasks.

Acknowledgement

The code is adapted from Stanford Alpaca.
Some code is adapted from Chinese-LLaMA-Alpaca.
Llama-3: Llama-3
Galactica: Galactica

Contact Information

For help or issues with using this repo, please submit a GitHub issue.
For other communications, please contact Qiwei Ye (qwye@baai.ac.cn).
