This repository contains the code used to generate results for the study "Evaluating the representational power of pre-trained DNA language models for regulatory genomics". (Pre-print link)
The data_generation folder contains scripts for pre-processing the datasets, and notebooks that use each gLM to extract layer embeddings. The figure folder contains the code and generated figures for the paper.
The rest of the code is organized by task and analysis:
- lentiMPRA for Task 1
- chip-clip-seq for Task 2 and 6
- CAGI for Task 3
- alternative-splicing for Task 4
- RNAenlong for Task 5
- motif_id for all saliency analysis
Within each folder, the code is organized by input type. Most folders contain scripts for training models on gLM representations (except NT), on NT representations, and on one-hot encoded sequences.
Since not all gLMs can be installed in the same environment, three different environments were used during this study: tf_requirments.yml, torch_requirments.yml, and gpn_requirements.yml.
- tf_requirments will be used most frequently. This environment should be used for all scripts based on Nucleotide Transformer (NT_*), and also for the one-hot and representation-based model training (lentiMPRA/representation_perf.ipynb, lentiMPRA/onehot_models.py, etc.).
- torch_requirments is used for HyenaDNA-based scripts, mostly in data_generation/embedding_generation/Heyna_embed.ipynb and CAGI/cagi_NT.ipynb.
- gpn_requirments is for all GPN-related inference, such as data_generation/embedding_generation/GPN_embed.ipynb and CAGI/cagi_gpn.ipynb.
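
The mapping above from scripts to environments can be sketched as a small lookup. This is only an illustration, not part of the repository: `env_for_script` and `create_command` are hypothetical helper names, the file-name heuristics for unlisted scripts are an assumption, and the use of conda is assumed from the `.yml` environment files.

```python
from pathlib import PurePosixPath

# Environment files as named in this README.
TF_ENV = "tf_requirments.yml"
TORCH_ENV = "torch_requirments.yml"
GPN_ENV = "gpn_requirements.yml"

# Notebooks the README lists explicitly for the non-default environments.
EXPLICIT = {
    "data_generation/embedding_generation/Heyna_embed.ipynb": TORCH_ENV,
    "CAGI/cagi_NT.ipynb": TORCH_ENV,
    "data_generation/embedding_generation/GPN_embed.ipynb": GPN_ENV,
    "CAGI/cagi_gpn.ipynb": GPN_ENV,
}


def env_for_script(path: str) -> str:
    """Guess which environment file a script needs (hypothetical helper)."""
    if path in EXPLICIT:
        return EXPLICIT[path]
    name = PurePosixPath(path).name.lower()
    # Assumption: other GPN or HyenaDNA scripts carry the model name in the file name.
    if "gpn" in name:
        return GPN_ENV
    if "hyena" in name:
        return TORCH_ENV
    # Everything else (NT_*, one-hot, representation-based training) uses the TF environment.
    return TF_ENV


def create_command(env_file: str) -> str:
    """Build a conda command that creates an environment from a YAML file."""
    return f"conda env create -f {env_file}"


if __name__ == "__main__":
    for script in ["lentiMPRA/onehot_models.py", "CAGI/cagi_gpn.ipynb"]:
        print(script, "->", create_command(env_for_script(script)))
```

Keeping the explicitly listed notebooks in a dictionary means the exceptions named in this README take priority over the file-name guess.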
The original datasets and the models trained for this study can be accessed from Zenodo; they should be decompressed into the base folder of this repo. No installation is required to run the analyses in this repository.
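
As a rough sketch of the decompression step, a downloaded archive could be unpacked into the base folder as follows. The archive name `zenodo_download.zip` is a placeholder, not the actual file name on Zenodo, and `unpack_into_repo` is a hypothetical helper; the archive format is assumed to be one that Python's `shutil` recognizes (for example zip or tar.gz).

```python
import shutil
from pathlib import Path


def unpack_into_repo(archive_path, repo_root):
    """Unpack a downloaded archive into the base folder of the repository."""
    repo_root = Path(repo_root)
    if not repo_root.is_dir():
        raise NotADirectoryError(f"{repo_root} is not a directory")
    # shutil infers the archive format from the file extension.
    shutil.unpack_archive(str(archive_path), extract_dir=str(repo_root))
    return repo_root


if __name__ == "__main__":
    # Placeholder archive name; replace with the file downloaded from Zenodo.
    archive = Path("zenodo_download.zip")
    if archive.exists():
        unpack_into_repo(archive, Path.cwd())
```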