# Transformer Fine-Tuning for Regulatory DNA: Classifying Functional Elements and Variant Effects
Angel Morenu  
M.S. Applied Data Science, University of Florida  
Email: angel.morenu@ufl.edu

## Abstract
This project evaluates transformer-based language models for DNA sequence analysis on regulatory genomics tasks. The aims are: (i) classify functional elements (promoters, enhancers, DNase-accessible regions) directly from sequence and (ii) prioritize the regulatory impact of noncoding variants. We will fine-tune DNABERT-2 and Nucleotide Transformer and benchmark against CNN baselines (DeepSEA, Basset, Basenji) and linear probes on frozen embeddings. Public datasets from ENCODE, Roadmap Epigenomics, and the DeepSEA training bundle provide standardized annotations (DNase, histone marks, TF binding). Primary metrics: AUROC and average precision (PR-AUC); secondary: compute cost (runtime, GPU memory) and cross-cell-type generalization. For variant effect prediction, in-silico mutagenesis will be used, with score distributions validated against published DeepSEA benchmarks. The purpose is to test whether transformer models better capture higher-order dependencies in DNA sequences than conventional CNNs.

## Plan of Action

### A. What I Will Implement
1) Data preprocessing pipeline for hg19/hg38 sequences: extract ±1kb windows, build labeled train/validation/test splits from ENCODE/DeepSEA annotations.  
2) Baseline models:  
   - CNN baseline (Basset/Basenji minimal config)  
   - Linear probe on frozen transformer embeddings  
3) Transformer models: fine-tune DNABERT-2 and Nucleotide Transformer, comparing their tokenization schemes (BPE for DNABERT-2, k-mer for Nucleotide Transformer) and varying context length.  
4) Variant effect prediction: in-silico saturation mutagenesis on known regulatory loci.  
5) Evaluation pipeline: AUROC, PR-AUC, bootstrap confidence intervals, runtime profiling, cross-cell-type transfer experiments.
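
Steps 1 and 4 can be illustrated with a minimal sketch: extract a fixed window around a site, then score every single-nucleotide substitution against the reference. The helper names (`extract_window`, `saturation_mutagenesis`) and the GC-content scorer are hypothetical stand-ins introduced here for illustration; in the project, the scorer would be replaced by a fine-tuned model's predicted probability for one label.

```python
from typing import Callable, List, Tuple

BASES = "ACGT"


def extract_window(chrom_seq: str, center: int, flank: int = 1000) -> str:
    """Return the window [center - flank, center + flank), padded with N at chromosome ends."""
    start, end = center - flank, center + flank
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(chrom_seq))
    core = chrom_seq[max(0, start):min(len(chrom_seq), end)].upper()
    return left_pad + core + right_pad


def saturation_mutagenesis(
    ref_seq: str, score_fn: Callable[[str], float]
) -> List[Tuple[int, str, str, float]]:
    """Score every single-nucleotide substitution as score(alt) - score(ref).

    Returns (position, ref_base, alt_base, delta) tuples; positions holding
    a non-ACGT character (e.g. N) are skipped.
    """
    ref_score = score_fn(ref_seq)
    effects = []
    for pos, ref_base in enumerate(ref_seq):
        if ref_base not in BASES:
            continue
        for alt in BASES:
            if alt == ref_base:
                continue
            alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
            effects.append((pos, ref_base, alt, score_fn(alt_seq) - ref_score))
    return effects


def toy_gc_score(seq: str) -> float:
    """Hypothetical stand-in for a model: the GC fraction of the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


# Toy usage on a 10 bp "chromosome" with a 3 bp flank (6 bp window).
window = extract_window("ACGTACGTAC", center=5, flank=3)
effects = saturation_mutagenesis(window, toy_gc_score)
```

Each window of length L requires 3L + 1 scorer calls, so batching the alternate sequences matters once the scorer is a transformer rather than a toy function.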

### B. Methods to Compare and Sources
- DNABERT-2: https://huggingface.co/zhihan1996/DNABERT-2-117M  
- Nucleotide Transformer (v1/v2): https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-50m-multi-species  
- Basenji: https://github.com/calico/basenji  
- Basset: https://github.com/davek44/Basset  
- DeepSEA: https://deepsea.princeton.edu/help/ (portal: https://deepsea.princeton.edu/)  
Optional ready-to-use models (Kipoi):  
- https://kipoi.org/models/DeepSEA/predict/  
- https://kipoi.org/models/DeepSEA/variantEffects/

### C. Datasets and Sources
- DeepSEA training bundle: https://deepsea.princeton.edu/help/  
- ENCODE Project: https://www.encodeproject.org/  
- Roadmap Epigenomics Data: https://egg2.wustl.edu/roadmap/web_portal/ (also via AWS Open Data and GEO)

### D. Experiments and Measurements
- Multi-label classification (promoters, enhancers, DNase, TF binding, histone marks)  
- Metrics: AUROC, PR-AUC, runtime, GPU memory, cross-cell-type generalization  
- Variant effect prediction: mutagenesis scores compared with DeepSEA benchmarks (correlations, enrichment)
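
The per-label metrics and their bootstrap confidence intervals could be computed along the following lines. This is a standard-library sketch under stated assumptions, not the project's evaluation code: the function names are illustrative, the AUROC uses the quadratic-time pairwise definition (fine for small examples, too slow for full test sets), and average precision breaks score ties by input order, which can differ slightly from library implementations.

```python
import random
from typing import Callable, List, Sequence, Tuple


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


def bootstrap_ci(
    y_true: Sequence[int],
    y_score: Sequence[float],
    metric: Callable[[Sequence[int], Sequence[float]], float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval; resamples lacking one class are redrawn."""
    n = len(y_true)
    if not 0 < sum(y_true) < n:
        raise ValueError("need at least one positive and one negative example")
    rng = random.Random(seed)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [y_true[i] for i in idx]
        if 0 < sum(labels) < n:
            stats.append(metric(labels, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy labels and scores for a single label (illustrative values only).
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7]
roc = auroc(y_true, y_score)
ap = average_precision(y_true, y_score)
roc_low, roc_high = bootstrap_ci(y_true, y_score, auroc)
```

For the multi-label setting, the same functions would be applied per label and the intervals reported alongside the point estimates.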

### E. Feasibility Considerations
Training large transformers from scratch is computationally intensive, so we will start from pre-trained checkpoints and fine-tune on short windows (1 kb) using UF HPC or Colab Pro. The CNN baselines (Basset/Basenji) are lightweight, which keeps the comparison feasible.
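
As a back-of-the-envelope check on why shorter windows help, the sketch below converts a window length into an approximate token count and compares the quadratic self-attention cost of two window sizes. The `bp_per_token` value is an assumption: 6 corresponds to non-overlapping 6-mer tokenization, while the average for a BPE vocabulary depends on the sequence. Special tokens and the non-attention layers are ignored, so this is only a rough guide.

```python
import math


def approx_tokens(window_bp: int, bp_per_token: float) -> int:
    """Approximate token count for a window, ignoring special tokens."""
    return math.ceil(window_bp / bp_per_token)


def relative_attention_cost(tokens_a: int, tokens_b: int) -> float:
    """Ratio of pairwise self-attention cost, which grows with the square of token count."""
    return (tokens_a / tokens_b) ** 2


# Assumed 6 bp per token; compare a 1 kb window with a 2 kb window.
tokens_1kb = approx_tokens(1000, bp_per_token=6)
tokens_2kb = approx_tokens(2000, bp_per_token=6)
cost_ratio = relative_attention_cost(tokens_2kb, tokens_1kb)
```

Under these assumptions, doubling the window roughly quadruples the attention cost per sequence, which is the main reason to profile runtime and GPU memory at each context length.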

## Preliminary Reading List
1) Ji et al., DNABERT: Bioinformatics 2021.  
2) Zhou et al., DNABERT-2: arXiv:2306.15006, 2023.  
3) Dalla-Torre et al., Nucleotide Transformer: bioRxiv 2023.  
4) Zhou and Troyanskaya, DeepSEA: Nat Methods 2015.  
5) Kelley et al., Basset: Genome Research 2016.  
6) Avsec et al., Enformer (Effective gene expression prediction from sequence by integrating long-range interactions): Nat Methods 2021.

## Notebook Index
- 00 – Data Download: ./Data_download.ipynb  
- 01 – Data Preprocessing: ./01_data_preprocessing.ipynb  
- 02 – Baseline Models: ./02_baseline_models.ipynb  
- 03 – Transformer Fine-tuning: ./03_transformer_finetuning.ipynb  
- 04 – Variant Effects: ./04_variant_effects.ipynb  
- 05 – Results Visualization: ./05_results_visualization.ipynb