
A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models

This repository contains the supplementary material accompanying the paper "A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models." In this paper, we introduced AVIDa-SARS-CoV-2, a labeled dataset of SARS-CoV-2-VHH interactions, and VHHCorpus-2M, which contains over two million VHH sequences, providing novel datasets for the evaluation and pre-training of antibody language models. The datasets are available at https://datasets.cognanous.com under a CC BY-NC 4.0 license.

[Figure: Overview of the data generation process for AVIDa-SARS-CoV-2.]


Environment

To get started, clone this repository and run the following commands to create a virtual environment and install the dependencies.

python -m venv ./venv
source ./venv/bin/activate
pip install -r requirements.txt

Datasets

Links

| Dataset | Links |
| --- | --- |
| VHHCorpus-2M | Hugging Face Hub · Project Page |
| AVIDa-SARS-CoV-2 | Hugging Face Hub · Project Page |
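
Both datasets can also be loaded programmatically from the Hugging Face Hub. The following is a minimal sketch using the `datasets` library; the repository IDs are assumptions based on the dataset names, so check the links above for the exact identifiers.

```python
# Minimal sketch using the Hugging Face `datasets` library.
# The repository IDs below are assumptions based on the dataset names;
# check the links above for the exact identifiers.
from datasets import load_dataset

vhh_corpus = load_dataset("COGNANO/VHHCorpus-2M")      # assumed repo ID
avida_sars = load_dataset("COGNANO/AVIDa-SARS-CoV-2")  # assumed repo ID

print(vhh_corpus)
print(avida_sars)
```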

Data Processing

The code for converting the raw data (FASTQ files) obtained from next-generation sequencing (NGS) into the labeled dataset, AVIDa-SARS-CoV-2, can be found under ./dataset. We have released the FASTQ files for antigen type "OC43" here so that the data processing can be reproduced.
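
For orientation, each FASTQ record pairs a read identifier and nucleotide sequence with per-base quality scores. The following is an illustrative sketch of inspecting a raw read with Biopython; it is not the repository's processing code, and the file name is hypothetical.

```python
# Illustrative only: inspect raw NGS reads in a FASTQ file with Biopython.
# This is not the repository's processing code; the file name is hypothetical.
from Bio import SeqIO

for record in SeqIO.parse("dataset/raw/fastq/OC43_example.fastq", "fastq"):
    print(record.id)                                        # read identifier
    print(record.seq)                                       # nucleotide sequence
    print(record.letter_annotations["phred_quality"][:10])  # per-base quality scores
    break  # show the first read only
```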

First, build the Docker image.

docker build -t vhh_constructor:latest ./dataset/vhh_constructor

After placing the FASTQ files under dataset/raw/fastq, run the following command to generate the labeled CSV file.

bash ./dataset/preprocess.sh
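
Once the script finishes, the resulting CSV can be sanity-checked with pandas. The output path and column names below are assumptions; adjust them to the actual output of preprocess.sh.

```python
# Sanity-check the generated CSV with pandas. The output path and column
# names here are assumptions; adjust them to the actual preprocess.sh output.
import pandas as pd

df = pd.read_csv("dataset/labeled/OC43.csv")  # hypothetical output path
print(df.shape)
print(df.head())
print(df["label"].value_counts())  # assumed binder / non-binder label column
```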

Benchmarks

Pre-training

VHHBERT is a RoBERTa-based model pre-trained on the two million VHH sequences in VHHCorpus-2M. It can be pre-trained from scratch with the following command.

python benchmarks/pretrain.py --vocab-file "benchmarks/data/vocab_vhhbert.txt" \
  --epochs 20 \
  --batch-size 128 \
  --save-dir "outputs"

Arguments:

| Argument | Required | Default | Description |
| --- | --- | --- | --- |
| `--vocab-file` | Yes | - | Path of the vocabulary file |
| `--epochs` | No | 20 | Number of epochs |
| `--batch-size` | No | 128 | Size of the mini-batch |
| `--seed` | No | 123 | Random seed |
| `--save-dir` | No | ./saved | Path of the save directory |

The pre-trained VHHBERT, released under the MIT License, is available on the Hugging Face Hub.
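
Once downloaded, the checkpoint should load like any other RoBERTa-style model in Transformers. A minimal sketch follows, assuming the model ID matches the model name; check the Hub page for the exact identifier and tokenization conventions.

```python
# Minimal sketch of loading the released checkpoint with Transformers.
# The model ID is an assumption based on the model name; check the Hub
# page linked above for the exact identifier.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("COGNANO/VHHBERT")  # assumed model ID
model = AutoModel.from_pretrained("COGNANO/VHHBERT")

# Embed a truncated, illustrative VHH amino acid sequence. How residues
# must be spaced or tokenized depends on the released vocabulary.
inputs = tokenizer("QVQLVESGGGLVQPGGSLRLSCAAS", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_dim)
```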

Fine-tuning

To evaluate the performance of various pre-trained language models for antibody discovery, we defined a binary classification task on AVIDa-SARS-CoV-2: predicting whether unknown antibodies bind to each of 13 antigens. For more information on the benchmarking task, see the paper.

Fine-tuning of the language models can be performed using the following command.

python benchmarks/finetune.py --palm-type "VHHBERT" \
  --epochs 30 \
  --batch-size 32 \
  --save-dir "outputs"

--palm-type must be one of the following:

  • VHHBERT
  • VHHBERT-w/o-PT
  • AbLang
  • AntiBERTa2
  • AntiBERTa2-CSSP
  • IgBert
  • ProtBert
  • ESM-2

Arguments:

| Argument | Required | Default | Description |
| --- | --- | --- | --- |
| `--palm-type` | No | VHHBERT | Model name |
| `--embeddings-file` | No | ./benchmarks/data/antigen_embeddings.pkl | Path of the embeddings file for antigens |
| `--epochs` | No | 20 | Number of epochs |
| `--batch-size` | No | 128 | Size of the mini-batch |
| `--seed` | No | 123 | Random seed |
| `--save-dir` | No | ./saved | Path of the save directory |
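
As a rough illustration of the setup implied by --embeddings-file, a fine-tuning head can combine a VHH sequence embedding with a precomputed antigen embedding to predict a binding logit. The sketch below uses PyTorch; all dimensions and names are illustrative and do not reflect the repository's actual architecture.

```python
# Illustrative sketch only: combine a VHH embedding with a precomputed
# antigen embedding for binary binding prediction. Dimensions and names
# are assumptions, not the repository's actual implementation.
import torch
import torch.nn as nn

class BindingClassifier(nn.Module):
    def __init__(self, vhh_dim: int = 768, antigen_dim: int = 1024, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vhh_dim + antigen_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single logit: binding vs. non-binding
        )

    def forward(self, vhh_emb: torch.Tensor, antigen_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate the two representations and score the pair.
        return self.mlp(torch.cat([vhh_emb, antigen_emb], dim=-1))

# Example with random tensors standing in for model outputs.
clf = BindingClassifier()
logits = clf(torch.randn(4, 768), torch.randn(4, 1024))
probs = torch.sigmoid(logits)  # probability of binding
```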

Citation

If you use AVIDa-SARS-CoV-2, VHHCorpus-2M, or VHHBERT in your research, please cite the paper as follows.

@article{tsuruta2024sars,
  title={A {SARS}-{C}o{V}-2 Interaction Dataset and {VHH} Sequence Corpus for Antibody Language Models},
  author={Hirofumi Tsuruta and Hiroyuki Yamazaki and Ryota Maeda and Ryotaro Tamura and Akihiro Imura},
  journal={arXiv preprint arXiv:2405.18749},
  year={2024}
}
