## DNABERT
This notebook fine-tunes DNABERT (the 6-mer checkpoint `zhihan1996/DNA_bert_6`) for promoter prediction, framed as a sequence classification task.
Promoter prediction means identifying promoter regions in DNA sequences, that is, regions where transcription of a gene is initiated.
DNABERT, a transformer-based model pre-trained on DNA k-mer sequences, is adapted here through supervised fine-tuning on labeled promoter data.
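
To make the k-mer representation concrete, here is a minimal sketch of how a DNA string can be split into overlapping k-mers with a sliding window of stride 1. The function name `to_kmers` and the toy sequence are illustrative assumptions and not part of the notebook.

```python
def to_kmers(sequence: str, k: int) -> str:
    """Split a DNA string into overlapping k-mers (stride 1), space-separated."""
    if k <= 0 or k > len(sequence):
        raise ValueError("k must be between 1 and len(sequence)")
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))


example = to_kmers("ATGCGTAC", k=6)
print(example)  # ATGCGT TGCGTA GCGTAC
```

A sequence of length L yields L - k + 1 k-mers, so a 300-base sequence would become 295 tokens at k = 6, before any special tokens are added.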


In [None]:
# Clone the DNABERT_2 repository which contains the necessary scripts and model configurations.
!git clone https://github.com/MAGICS-LAB/DNABERT_2

Cloning into 'DNABERT_2'...
remote: Enumerating objects: 123, done.
remote: Counting objects: 100% (31/31), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 123 (delta 22), reused 15 (delta 15), pack-reused 92 (from 2)
Receiving objects: 100% (123/123), 882.58 KiB | 2.67 MiB/s, done.
Resolving deltas: 100% (50/50), done.


In [None]:
# Install all required libraries including Transformers, PyTorch, PEFT (for LoRA), and evaluation tools.
!pip install transformers accelerate torch evaluate peft einops

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting

In [None]:
# Download a ZIP archive of the dataset from Google Drive and unzip it.

!gdown https://drive.google.com/uc?id=1GRtbzTe3UXYF1oW27ASNhYX3SZ16D7N2
!unzip -o ./GUE.zip

Downloading...
From (original): https://drive.google.com/uc?id=1GRtbzTe3UXYF1oW27ASNhYX3SZ16D7N2
From (redirected): https://drive.google.com/uc?id=1GRtbzTe3UXYF1oW27ASNhYX3SZ16D7N2&confirm=t&uuid=42ef4405-f644-404c-ad4e-75b7754b2a41
To: /content/GUE.zip
100% 82.3M/82.3M [00:02<00:00, 35.8MB/s]
Archive:  ./GUE.zip
   creating: GUE/
  inflating: __MACOSX/._GUE          
   creating: GUE/prom/
  inflating: GUE/.DS_Store           
  inflating: __MACOSX/GUE/._.DS_Store  
   creating: GUE/EMP/
   creating: GUE/mouse/
   creating: GUE/splice/
   creating: GUE/tf/
   creating: GUE/virus/
   creating: GUE/prom/prom_300_tata/
   creating: GUE/prom/prom_300_notata/
   creating: GUE/prom/prom_core_all/
   creating: GUE/prom/prom_core_notata/
   creating: GUE/prom/prom_core_tata/
   creating: GUE/prom/prom_300_all/
   creating: GUE/EMP/H3K14ac/
   creating: GUE/EMP/H3K4me2/
   creating: GUE/EMP/H3K9ac/
   creating: GUE/EMP/H3K4me3/
   creating: GUE/EMP/H4/
   creating: GUE/EMP/H3/
   creating: GUE

In [None]:
# Import standard libraries, PyTorch, Transformers, PEFT for parameter-efficient fine-tuning, and other utilities.
import os
import csv
import copy
import json
import logging
from dataclasses import dataclass, field
from typing import Any, Optional, Dict, Sequence, Tuple, List, Union

import torch
import transformers
import sklearn
import numpy as np
from torch.utils.data import Dataset

from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
)

# Model, Data, and Training Argument Classes
Define configuration structures for model, data, and training parameters using Python dataclasses.
These classes store hyperparameters and paths in a structured way to be used throughout the training pipeline.
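
As a small illustration of the pattern, the sketch below shows how a dataclass with `field(...)` defaults can be instantiated, overridden, and inspected. `DemoArguments` and its two fields are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field, fields, replace


@dataclass
class DemoArguments:
    # Hypothetical fields, chosen only to illustrate the pattern.
    learning_rate: float = field(default=1e-4, metadata={"help": "Peak learning rate."})
    num_epochs: int = field(default=1, metadata={"help": "Number of training epochs."})


defaults = DemoArguments()                  # every field takes its default
custom = DemoArguments(learning_rate=3e-5)  # override one value by keyword
longer = replace(custom, num_epochs=4)      # copy with one field changed

# The metadata dictionary can be read back, e.g. to build help text.
help_text = {f.name: f.metadata["help"] for f in fields(DemoArguments)}
print(longer)
print(help_text)
```

Keeping every setting in a typed field makes it easy to see which values a run changed relative to the defaults.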

In [None]:
@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default="facebook/opt-125m")
    use_lora: bool = field(default=False, metadata={"help": "whether to use LoRA"})
    lora_r: int = field(default=8, metadata={"help": "hidden dimension for LoRA"})
    lora_alpha: int = field(default=32, metadata={"help": "alpha for LoRA"})
    lora_dropout: float = field(default=0.05, metadata={"help": "dropout rate for LoRA"})
    lora_target_modules: str = field(default="query,value", metadata={"help": "where to perform LoRA"})


@dataclass
class DataArguments:
    data_path: str = field(default=None, metadata={"help": "Path to the training data."})
    kmer: int = field(default=-1, metadata={"help": "k-mer for input sequence. -1 means not using k-mer."})


@dataclass
class TrainingArguments(transformers.TrainingArguments):
    report_to: str = field(default="none")
    cache_dir: Optional[str] = field(default=None)
    run_name: str = field(default="run")
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(default=512, metadata={"help": "Maximum sequence length."})
    gradient_accumulation_steps: int = field(default=1)
    per_device_train_batch_size: int = field(default=1)
    per_device_eval_batch_size: int = field(default=1)
    num_train_epochs: int = field(default=1)
    fp16: bool = field(default=False)
    logging_steps: int = field(default=100)
    save_steps: int = field(default=100)
    eval_steps: int = field(default=100)
    eval_strategy: str = field(default="steps")  # no trailing comma: it would turn the default into a tuple
    save_strategy: str = field(default="steps")
    warmup_steps: int = field(default=50)
    weight_decay: float = field(default=0.01)
    learning_rate: float = field(default=1e-4)
    save_total_limit: int = field(default=3)
    load_best_model_at_end: bool = field(default=True)
    output_dir: str = field(default="output")
    find_unused_parameters: bool = field(default=False)
    checkpointing: bool = field(default=False)
    dataloader_pin_memory: bool = field(default=False)
    eval_and_save_results: bool = field(default=True)
    save_model: bool = field(default=False)
    seed: int = field(default=42)

# Safe Model Saving Utility
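Custom utility function to safely save the trained model by moving tensors to CPU before saving. The same cell also defines the k-mer helpers, the `SupervisedDataset` class, the batch collator, and the metric functions used later.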

In [None]:
def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
    """Collects the state dict and dump to disk."""
    state_dict = trainer.model.state_dict()
    if trainer.args.should_save:
        cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
        del state_dict
        trainer._save(output_dir, state_dict=cpu_state_dict)  # noqa


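def get_alter_of_dna_sequence(sequence: str):
    """Return the base-wise complement of a DNA sequence (A<->T, C<->G).

    The sequence is not reversed, so this is the complement rather than
    the reverse complement.
    """
    MAP = {"A": "T", "T": "A", "C": "G", "G": "C"}
    # return "".join([MAP[c] for c in reversed(sequence)])  # reverse-complement variant
    return "".join([MAP[c] for c in sequence])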

"""
Transform a dna sequence to k-mer string
"""
def generate_kmer_str(sequence: str, k: int) -> str:
    """Generate k-mer string from DNA sequence."""
    return " ".join([sequence[i:i+k] for i in range(len(sequence) - k + 1)])


"""
Load or generate k-mer string for each DNA sequence. The generated k-mer string will be saved to the same directory as the original data with the same name but with a suffix of "_{k}mer".
"""
def load_or_generate_kmer(data_path: str, texts: List[str], k: int) -> List[str]:
    """Load or generate k-mer string for each DNA sequence."""
    kmer_path = data_path.replace(".csv", f"_{k}mer.json")
    if os.path.exists(kmer_path):
        logging.warning(f"Loading k-mer from {kmer_path}...")
        with open(kmer_path, "r") as f:
            kmer = json.load(f)
    else:
        logging.warning("Generating k-mer...")
        kmer = [generate_kmer_str(text, k) for text in texts]
        with open(kmer_path, "w") as f:
            logging.warning(f"Saving k-mer to {kmer_path}...")
            json.dump(kmer, f)

    return kmer

class SupervisedDataset(Dataset):
    """Dataset for supervised fine-tuning."""

    def __init__(self,
                 data_path: str,
                 tokenizer: transformers.PreTrainedTokenizer,
                 kmer: int = -1):

        super(SupervisedDataset, self).__init__()

        # load data from the disk
        with open(data_path, "r") as f:
            data = list(csv.reader(f))[1:]
        if len(data[0]) == 2:
            # data is in the format of [text, label]
            logging.warning("Perform single sequence classification...")
            texts = [d[0] for d in data]
            labels = [int(d[1]) for d in data]
        elif len(data[0]) == 3:
            # data is in the format of [text1, text2, label]
            logging.warning("Perform sequence-pair classification...")
            texts = [[d[0], d[1]] for d in data]
            labels = [int(d[2]) for d in data]
        else:
            raise ValueError("Data format not supported.")

        if kmer != -1:
            logging.warning(f"Using {kmer}-mer as input...")
            texts = load_or_generate_kmer(data_path, texts, kmer)

        output = tokenizer(
            texts,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        )

        self.input_ids = output["input_ids"]
        self.attention_mask = output["attention_mask"]
        self.labels = labels
        self.num_labels = len(set(labels))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        return dict(input_ids=self.input_ids[i], labels=self.labels[i])


@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""

    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = torch.Tensor(labels).long()
        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )

"""
Manually calculate the accuracy, f1, matthews_correlation, precision, recall with sklearn.
"""
def calculate_metric_with_sklearn(predictions: np.ndarray, labels: np.ndarray):
    valid_mask = labels != -100  # Exclude positions labeled -100, the ignore index for losses (not a padding token ID)
    valid_predictions = predictions[valid_mask]
    valid_labels = labels[valid_mask]
    return {
        "accuracy": sklearn.metrics.accuracy_score(valid_labels, valid_predictions),
        "f1": sklearn.metrics.f1_score(
            valid_labels, valid_predictions, average="macro", zero_division=0
        ),
        "matthews_correlation": sklearn.metrics.matthews_corrcoef(
            valid_labels, valid_predictions
        ),
        "precision": sklearn.metrics.precision_score(
            valid_labels, valid_predictions, average="macro", zero_division=0
        ),
        "recall": sklearn.metrics.recall_score(
            valid_labels, valid_predictions, average="macro", zero_division=0
        ),
    }

# from: https://discuss.huggingface.co/t/cuda-out-of-memory-when-using-trainer-with-compute-metrics/2941/13
def preprocess_logits_for_metrics(logits:Union[torch.Tensor, Tuple[torch.Tensor, Any]], _):
    if isinstance(logits, tuple):  # Unpack logits if it's a tuple
        logits = logits[0]

    if logits.ndim == 3:
        # Reshape logits to 2D if needed
        logits = logits.reshape(-1, logits.shape[-1])

    return torch.argmax(logits, dim=-1)


"""
Compute metrics used for huggingface trainer.
"""
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return calculate_metric_with_sklearn(predictions, labels)

# Set hyperparameters and paths for model training:
- `kmer` is the k-mer size used to split DNA sequences into tokens.
- `seed` fixes the random seed so that results are reproducible.
- `data` selects which promoter dataset under `GUE/prom/` to use.
## Define model, data, and training configurations using previously defined dataclasses:
- `model_args` sets the DNABERT model checkpoint to use, specific to the chosen k-mer size.
- `data_args` provides the path to the dataset directory and the k-mer size (`-1` disables k-mer conversion).
- `training_args` includes training parameters such as batch size, learning rate, number of epochs,
  evaluation/saving steps, and other runtime settings. Some values depend on the dataset name
  (e.g. more epochs and more frequent evaluation for `prom_300_tata`).
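
As a rough check of how these settings interact, the sketch below estimates the number of optimizer steps and evaluation points for a run. The helper `count_steps` is hypothetical. It assumes a single device and the training-set size of 47,356 examples that the training log reports for `prom_300_all`.

```python
import math


def count_steps(num_examples: int, batch_size: int, epochs: int,
                grad_accum: int = 1, num_devices: int = 1) -> int:
    """Estimate total optimizer steps (assumes the last partial batch is kept)."""
    batches_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


total_steps = count_steps(num_examples=47_356, batch_size=32, epochs=4)
evaluations = total_steps // 400  # one evaluation every `eval_steps` = 400 steps
print(total_steps, evaluations)   # 5920 14
```

The estimate of 5,920 steps agrees with the "Total optimization steps" line in the training log.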

In [None]:
kmer = 6
seed = 172
data = "prom_300_all"

model_args = ModelArguments(
    model_name_or_path=f"zhihan1996/DNA_bert_{kmer}",
)
data_args = DataArguments(
    data_path=f"GUE/prom/{data}",
    kmer=kmer,
)
training_args = TrainingArguments(
    run_name=f"DNABERT1_{kmer}_{data}_seed{seed}",
    model_max_length=310,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=3e-5,
    num_train_epochs=10 if data == "prom_300_tata" else 4,
    fp16=True,
    save_steps=200 if data == "prom_300_tata" else 400,
    output_dir=f"./ft/{kmer}",
    eval_strategy="steps",
    save_strategy="steps",
    eval_steps=200 if data == "prom_300_tata" else 400,
    warmup_steps=50,
    logging_steps=100000,
    overwrite_output_dir=True,
    log_level="info",
    seed=seed,
    find_unused_parameters=False,
)

# Model Setup, Training, and Evaluation

## Tokenizer Loading:

  Loads the tokenizer from the pretrained DNABERT checkpoint, setting key parameters like:

  - `model_max_length` (to truncate or pad sequences),

  - `padding_side` (set to "right"),

  - `use_fast` (set to `True` to use the fast tokenizer implementation).

  Additionally, for checkpoints whose name contains `InstaDeepAI`, the end-of-sequence token (`eos_token`) is set to the padding token for compatibility.

## Dataset Preparation:

  Loads and tokenizes the training (`train.csv`), validation (`dev.csv`), and test (`test.csv`) datasets using the `SupervisedDataset` class.

  - Applies optional k-mer transformation (e.g. 6-mer if specified).

  - Uses a custom data collator to pad sequences and prepare batches.
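
The padding step can be pictured with plain Python lists. The following sketch is a simplified stand-in for the collator, not the notebook's implementation: `pad_batch`, the token ids, and `pad_id=0` are assumptions made for illustration.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad token-id lists to the longest one and build an attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding; this mirrors `input_ids != pad_id`.
    attention_mask = [[int(tok != pad_id) for tok in row] for row in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[2, 15, 27, 3], [2, 41, 3]], pad_id=0)
print(batch["input_ids"])       # [[2, 15, 27, 3], [2, 41, 3, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Deriving the mask by comparing against the padding id only works if that id never occurs as a real token, which is the usual situation for a dedicated padding token.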

## Model Loading:

  Loads the pretrained model from the specified DNABERT checkpoint with `AutoModelForSequenceClassification`.

  - The number of output labels is set to the number of distinct labels found in the training set.

## LoRA Configuration (Optional):

  If LoRA is enabled (`use_lora=True`), configures and applies Low-Rank Adaptation to reduce the number of trainable parameters.

  - LoRA parameters (e.g., rank, alpha, target modules) are defined in `LoraConfig`.
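
To give a feel for the reduction, the sketch below counts the adapter parameters LoRA would add for the default rank. The hidden size of 768 and the 12 layers are assumptions (BERT-base dimensions), and `lora_extra_params` is a hypothetical helper. The count covers the adapters only, not the classification head.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns an update B @ A with A of shape (r, d_in) and B of shape (d_out, r).
    """
    return r * d_in + d_out * r


hidden = 768      # assumption: BERT-base hidden size
num_layers = 12   # assumption: BERT-base depth
r, alpha = 8, 32  # defaults of `lora_r` and `lora_alpha` in `ModelArguments`

per_matrix = lora_extra_params(hidden, hidden, r)
total = per_matrix * 2 * num_layers  # query and value projections in every layer
scaling = alpha / r                  # factor applied to the low-rank update

print(per_matrix, total, scaling)  # 12288 294912 4.0
```

Under these assumptions the adapters hold roughly 0.3 million parameters, compared with the 89,192,450 trainable parameters reported for full fine-tuning in the training log.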

## Training:

  Initializes the HuggingFace `Trainer` with:

  - the model, tokenizer, datasets, training arguments,

  - evaluation metric functions,

  - and batch collator.

  Starts the fine-tuning process with `trainer.train()`.

## Model Saving (Optional):

  If enabled via `training_args.save_model`, saves the trainer state and the final model after training with `safe_save_model_for_hf_trainer`, which copies the weights to CPU before writing them.

## Evaluation:

  Evaluates the trained model on the test set using `trainer.evaluate()`.

  - Saves the evaluation results (e.g., accuracy, F1 score, MCC) to a JSON file in the specified output directory.
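
For intuition about the reported metrics, here is a minimal pure-Python sketch of accuracy and the Matthews correlation coefficient (MCC) for binary labels. The notebook itself relies on scikit-learn. `binary_metrics` and the toy labels below are assumptions for illustration.

```python
import math
from typing import Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict:
    """Accuracy and Matthews correlation coefficient for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0.0 by convention when undefined
    return {"accuracy": (tp + tn) / len(y_true), "matthews_correlation": mcc}


labels = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(labels, predictions))
# {'accuracy': 0.75, 'matthews_correlation': 0.5}
```

When the two classes are balanced and errors are roughly symmetric, MCC is close to 2 × accuracy − 1, which is consistent with the validation table in the training output (for example, accuracy 0.905 alongside MCC 0.810).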

In [None]:
# load tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    model_max_length=training_args.model_max_length,
    padding_side="right",
    use_fast=True,
    trust_remote_code=True,
)

if "InstaDeepAI" in model_args.model_name_or_path:
    tokenizer.eos_token = tokenizer.pad_token

# define datasets and data collator
train_dataset = SupervisedDataset(tokenizer=tokenizer,
                                    data_path=os.path.join(data_args.data_path, "train.csv"),
                                    kmer=data_args.kmer)
val_dataset = SupervisedDataset(tokenizer=tokenizer,
                                    data_path=os.path.join(data_args.data_path, "dev.csv"),
                                    kmer=data_args.kmer)
test_dataset = SupervisedDataset(tokenizer=tokenizer,
                                    data_path=os.path.join(data_args.data_path, "test.csv"),
                                    kmer=data_args.kmer)
data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)


# load model
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    num_labels=train_dataset.num_labels,
    trust_remote_code=True,
)

# configure LoRA
if model_args.use_lora:
    lora_config = LoraConfig(
        r=model_args.lora_r,
        lora_alpha=model_args.lora_alpha,
        target_modules=list(model_args.lora_target_modules.split(",")),
        lora_dropout=model_args.lora_dropout,
        bias="none",
        task_type="SEQ_CLS",
        inference_mode=False,
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

# define trainer
trainer = transformers.Trainer(model=model,
                                tokenizer=tokenizer,
                                args=training_args,
                                preprocess_logits_for_metrics=preprocess_logits_for_metrics,
                                compute_metrics=compute_metrics,
                                train_dataset=train_dataset,
                                eval_dataset=val_dataset,
                                data_collator=data_collator)
trainer.train()

if training_args.save_model:
    trainer.save_state()
    safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)

# get the evaluation results from trainer
if training_args.eval_and_save_results:
    results_path = os.path.join(training_args.output_dir, "results", training_args.run_name)
    results = trainer.evaluate(eval_dataset=test_dataset)
    os.makedirs(results_path, exist_ok=True)
    with open(os.path.join(results_path, "eval_results.json"), "w") as f:
        json.dump(results, f)



A new version of the following files was downloaded from https://huggingface.co/zhihan1996/DNA_bert_6:
- configuration_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



A new version of the following files was downloaded from https://huggingface.co/zhihan1996/DNA_bert_6:
- dnabert_layer.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Some weights of DNABertForSequenceClassification were not initialized from the model checkpoint at zhihan1996/DNA_bert_6 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Safetensors PR exists
Using auto half precision backend
***** Running training *****
  Num examples = 47,356
  Num Epochs = 4
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 5,920
  Number of trainable parameters = 89,192,450


| Step | Training Loss | Validation Loss | Accuracy | F1 | Matthews Correlation | Precision | Recall |
|---|---|---|---|---|---|---|---|
| 400 | No log | 0.235685 | 0.905236 | 0.905235 | 0.810473 | 0.905233 | 0.90524 |
| 800 | No log | 0.229247 | 0.911149 | 0.911014 | 0.824159 | 0.913176 | 0.910985 |
| 1200 | No log | 0.162466 | 0.9375 | 0.937492 | 0.875477 | 0.937902 | 0.937575 |
| 1600 | No log | 0.183977 | 0.930574 | 0.930491 | 0.863876 | 0.93312 | 0.930759 |
| 2000 | No log | 0.141157 | 0.948311 | 0.948311 | 0.896635 | 0.948313 | 0.948322 |
| 2400 | No log | 0.124853 | 0.954392 | 0.954381 | 0.908972 | 0.954631 | 0.954341 |
| 2800 | No log | 0.117642 | 0.957601 | 0.957601 | 0.915203 | 0.957598 | 0.957604 |
| 3200 | No log | 0.133399 | 0.957264 | 0.957245 | 0.914984 | 0.957797 | 0.957187 |
| 3600 | No log | 0.135282 | 0.955912 | 0.955911 | 0.911823 | 0.95591 | 0.955914 |
| 4000 | No log | 0.122567 | 0.959291 | 0.959285 | 0.918671 | 0.959415 | 0.959255 |



***** Running Evaluation *****
  Num examples = 5920
  Batch size = 32
Saving model checkpoint to ./ft/6/checkpoint-400
Configuration saved in ./ft/6/checkpoint-400/config.json
Model weights saved in ./ft/6/checkpoint-400/model.safetensors
tokenizer config file saved in ./ft/6/checkpoint-400/tokenizer_config.json
Special tokens file saved in ./ft/6/checkpoint-400/special_tokens_map.json

***** Running Evaluation *****
  Num examples = 5920
  Batch size = 32
Saving model checkpoint to ./ft/6/checkpoint-800
Configuration saved in ./ft/6/checkpoint-800/config.json
Model weights saved in ./ft/6/checkpoint-800/model.safetensors
tokenizer config file saved in ./ft/6/checkpoint-800/tokenizer_config.json
Special tokens file saved in ./ft/6/checkpoint-800/special_tokens_map.json

***** Running Evaluation *****
  Num examples = 5920
  Batch size = 32
Saving model checkpoint to ./ft/6/checkpoint-1200
Configuration saved in ./ft/6/checkpoint-1200/config.json
Model weights saved in ./ft/6/checkpoi