
A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models

This repository contains the supplementary material accompanying the paper "A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models." In this paper, we introduced AVIDa-SARS-CoV-2, a labeled dataset of SARS-CoV-2-VHH interactions, and VHHCorpus-2M, which contains over two million VHH sequences, providing novel datasets for the evaluation and pre-training of antibody language models. The datasets are available at https://datasets.cognanous.com under a CC BY-NC 4.0 license.

[Figure: Overview of the data generation process for AVIDa-SARS-CoV-2.]


Environment

To get started, clone this repository and run the following commands to create a virtual environment and install the dependencies.

python -m venv ./venv
source ./venv/bin/activate
pip install -r requirements.txt

Datasets

Links

| Dataset | Links |
| --- | --- |
| VHHCorpus-2M | Hugging Face Hub · Project Page |
| AVIDa-SARS-CoV-2 | Hugging Face Hub · Project Page |
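
Both datasets can also be loaded programmatically from the Hugging Face Hub. The following is a minimal sketch using the `datasets` library; the repository IDs are assumptions based on the dataset names, so check the links above for the exact identifiers.

```python
# Minimal sketch using the Hugging Face `datasets` library.
# The repository IDs below are assumptions based on the dataset names;
# check the links above for the exact identifiers.
from datasets import load_dataset

vhh_corpus = load_dataset("COGNANO/VHHCorpus-2M")      # assumed repo ID
avida_sars = load_dataset("COGNANO/AVIDa-SARS-CoV-2")  # assumed repo ID

print(vhh_corpus)
print(avida_sars)
```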

Data Processing

The code for converting the raw data (FASTQ files) obtained from next-generation sequencing (NGS) into the labeled dataset, AVIDa-SARS-CoV-2, can be found under ./dataset. We have released the FASTQ files for antigen type "OC43" here so that the data processing can be reproduced.
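
For orientation, each FASTQ record pairs a read identifier and nucleotide sequence with per-base quality scores. The following is an illustrative sketch of inspecting a raw read with Biopython; it is not the repository's processing code, and the file name is hypothetical.

```python
# Illustrative only: inspect raw NGS reads in a FASTQ file with Biopython.
# This is not the repository's processing code; the file name is hypothetical.
from Bio import SeqIO

for record in SeqIO.parse("dataset/raw/fastq/OC43_example.fastq", "fastq"):
    print(record.id)                                        # read identifier
    print(record.seq)                                       # nucleotide sequence
    print(record.letter_annotations["phred_quality"][:10])  # per-base quality scores
    break  # show the first read only
```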

First, build the Docker image.

docker build -t vhh_constructor:latest ./dataset/vhh_constructor

After placing the FASTQ files under dataset/raw/fastq, run the following command to generate the labeled CSV file.

bash ./dataset/preprocess.sh
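
Once the script finishes, the resulting CSV can be sanity-checked with pandas. The output path and column names below are assumptions; adjust them to the actual output of preprocess.sh.

```python
# Sanity-check the generated CSV with pandas. The output path and column
# names here are assumptions; adjust them to the actual preprocess.sh output.
import pandas as pd

df = pd.read_csv("dataset/labeled/OC43.csv")  # hypothetical output path
print(df.shape)
print(df.head())
print(df["label"].value_counts())  # assumed binder / non-binder label column
```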

Benchmarks

Pre-training

VHHBERT is a RoBERTa-based model pre-trained on the two million VHH sequences in VHHCorpus-2M. It can be pre-trained from scratch with the following command.

python benchmarks/pretrain.py --vocab-file "benchmarks/data/vocab_vhhbert.txt" \
  --epochs 20 \
  --batch-size 128 \
  --save-dir "outputs"

Arguments:

| Argument | Required | Default | Description |
| --- | --- | --- | --- |
| `--vocab-file` | Yes | - | Path of the vocabulary file |
| `--epochs` | No | 20 | Number of epochs |
| `--batch-size` | No | 128 | Size of the mini-batch |
| `--seed` | No | 123 | Random seed |
| `--save-dir` | No | ./saved | Path of the save directory |

The pre-trained VHHBERT, released under the MIT License, is available on the Hugging Face Hub.
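
Once downloaded, the checkpoint should load like any other RoBERTa-style model in Transformers. A minimal sketch follows, assuming the model ID matches the model name; check the Hub page for the exact identifier and tokenization conventions.

```python
# Minimal sketch of loading the released checkpoint with Transformers.
# The model ID is an assumption based on the model name; check the Hub
# page linked above for the exact identifier.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("COGNANO/VHHBERT")  # assumed model ID
model = AutoModel.from_pretrained("COGNANO/VHHBERT")

# Embed a truncated, illustrative VHH amino acid sequence. How residues
# must be spaced or tokenized depends on the released vocabulary.
inputs = tokenizer("QVQLVESGGGLVQPGGSLRLSCAAS", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_dim)
```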

Fine-tuning

To evaluate the performance of various pre-trained language models for antibody discovery, we defined a binary classification task on AVIDa-SARS-CoV-2: predicting whether unknown antibodies bind to each of 13 antigens. For more information on the benchmarking task, see the paper.

Fine-tuning of the language models can be performed using the following command.

python benchmarks/finetune.py --palm-type "VHHBERT" \
  --epochs 30 \
  --batch-size 32 \
  --save-dir "outputs"

--palm-type must be one of the following:

  • VHHBERT
  • VHHBERT-w/o-PT
  • AbLang
  • AntiBERTa2
  • AntiBERTa2-CSSP
  • IgBert
  • ProtBert
  • ESM-2

Arguments:

| Argument | Required | Default | Description |
| --- | --- | --- | --- |
| `--palm-type` | No | VHHBERT | Model name |
| `--embeddings-file` | No | ./benchmarks/data/antigen_embeddings.pkl | Path of the embeddings file for antigens |
| `--epochs` | No | 20 | Number of epochs |
| `--batch-size` | No | 128 | Size of the mini-batch |
| `--seed` | No | 123 | Random seed |
| `--save-dir` | No | ./saved | Path of the save directory |
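
As a rough illustration of the setup implied by --embeddings-file, a fine-tuning head can combine a VHH sequence embedding with a precomputed antigen embedding to predict a binding logit. The sketch below uses PyTorch; all dimensions and names are illustrative and do not reflect the repository's actual architecture.

```python
# Illustrative sketch only: combine a VHH embedding with a precomputed
# antigen embedding for binary binding prediction. Dimensions and names
# are assumptions, not the repository's actual implementation.
import torch
import torch.nn as nn

class BindingClassifier(nn.Module):
    def __init__(self, vhh_dim: int = 768, antigen_dim: int = 1024, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vhh_dim + antigen_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single logit: binding vs. non-binding
        )

    def forward(self, vhh_emb: torch.Tensor, antigen_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate the two representations and score the pair.
        return self.mlp(torch.cat([vhh_emb, antigen_emb], dim=-1))

# Example with random tensors standing in for model outputs.
clf = BindingClassifier()
logits = clf(torch.randn(4, 768), torch.randn(4, 1024))
probs = torch.sigmoid(logits)  # probability of binding
```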

Citation

If you use AVIDa-SARS-CoV-2, VHHCorpus-2M, or VHHBERT in your research, please cite the paper as follows.

@article{tsuruta2024sars,
  title={A {SARS}-{C}o{V}-2 Interaction Dataset and {VHH} Sequence Corpus for Antibody Language Models},
  author={Hirofumi Tsuruta and Hiroyuki Yamazaki and Ryota Maeda and Ryotaro Tamura and Akihiro Imura},
  journal={arXiv preprint arXiv:2405.18749},
  year={2024}
}
