## AIDO.DNA-300M

[AIDO.DNA-300M](https://huggingface.co/genbio-ai/AIDO.DNA-300M) is a DNA foundation model trained on 10.6 billion nucleotides from 796 species, enabling genome mining, in silico mutagenesis studies, gene expression prediction, and directed sequence generation.

By scaling model depth while maintaining a short context length of 4000 nucleotides, AIDO.DNA shows substantial improvements across a breadth of tasks in functional genomics using transfer learning, sequence generation, and unsupervised annotation of functional elements. Notably, AIDO.DNA outperforms prior encoder-only architectures without new data, suggesting that new scaling laws are needed to achieve compute-optimal DNA language models.

<img src="images/DNA_300M.png" alt="DNA_300M" width="50%" style="background-color:white;"/>

| Model Architecture Component | Value |
| ---------------------------- |:-----:|
| Num Attention Heads      | 32  |
| Num Hidden Layers      | 32       |
| Hidden Size | 4352       |
| Intermediate Size | 11584       |
| Vocab Size | 16      |
| Context Length | 4000      |
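
Because the context length is 4000 nucleotides, longer sequences have to be split before they are passed to the model. The sketch below shows one way to do that in plain Python. `chunk_sequence` and its `stride` parameter are hypothetical helpers for illustration, not part of ModelGenerator, and the code assumes one token per nucleotide with no allowance for any special tokens the tokenizer may add.

```python
from typing import List, Optional

CONTEXT_LENGTH = 4000  # taken from the architecture table above


def chunk_sequence(seq: str, window: int = CONTEXT_LENGTH,
                   stride: Optional[int] = None) -> List[str]:
    """Split `seq` into windows of at most `window` nucleotides.

    `stride` is the step between window starts. It defaults to `window`
    (non-overlapping); a smaller stride yields overlapping windows.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    stride = window if stride is None else stride
    if stride <= 0:
        raise ValueError("stride must be positive")
    chunks = []
    for start in range(0, len(seq), stride):
        chunks.append(seq[start:start + window])
        if start + window >= len(seq):
            break  # this window already reaches the end of the sequence
    return chunks


long_seq = "ACGT" * 2500  # toy 10,000 nt sequence
chunks = chunk_sequence(long_seq)
print([len(c) for c in chunks])
```

Overlapping windows (for example `stride=2000`) trade extra compute for context around chunk boundaries; whether that helps depends on the downstream task.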

Config file:
* [gue_core_promoter_all.yaml](../ModelGenerator/experiments/AIDO.DNA/sequence_classification/gue_core_promoter_all.yaml)

#### Add a logger to the YAML config

```yaml
trainer:
  logger:
  - class_path: lightning.pytorch.loggers.WandbLogger
    init_args:
      name: null
      project: null
      id: null
```

#### Training command

```bash
export HF_HOME=/tmp/hf_cache
mgen fit \
    --config experiments/AIDO.DNA/sequence_classification/gue_core_promoter_all.yaml \
    --trainer.num_nodes 1 \
    --trainer.logger.name gue_core_promoter_all \
    --trainer.logger.project AIDO_Demo \
    --data.batch_size 8
```
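
When launching from a script or notebook, the same command can be assembled as an argument list. This is a minimal sketch that uses only the flags shown above; `build_fit_command` is a hypothetical helper, not part of the `mgen` CLI, and it only builds the list rather than running it.

```python
import shlex
from typing import Dict, List


def build_fit_command(config: str, overrides: Dict[str, object]) -> List[str]:
    """Return the argv list for `mgen fit` with dotted-key overrides."""
    cmd = ["mgen", "fit", "--config", config]
    for key, value in overrides.items():
        cmd += [f"--{key}", str(value)]
    return cmd


cmd = build_fit_command(
    "experiments/AIDO.DNA/sequence_classification/gue_core_promoter_all.yaml",
    {
        "trainer.num_nodes": 1,
        "trainer.logger.name": "gue_core_promoter_all",
        "trainer.logger.project": "AIDO_Demo",
        "data.batch_size": 8,
    },
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The list can then be handed to `subprocess.run(cmd, check=True)`, with `HF_HOME` set in the environment as in the shell version.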

<img src="images/dna_curve.png" alt="dna_curve" width="90%" style="background-color:white;"/>

## Create objects step by step in a Jupyter Notebook

```yaml
data:
  class_path: modelgenerator.data.GUEClassification
  init_args:
    config_name: prom_core_all
    train_split_name: train
    test_split_name: test
    batch_size: 4
```
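
The `class_path` / `init_args` pair in this YAML corresponds to a plain constructor call: the class is imported from its dotted path and `init_args` become keyword arguments. The sketch below illustrates that pattern with `importlib`. `instantiate` is a hypothetical helper, not the code the CLI actually runs, and a standard-library class stands in for `GUEClassification` so the example runs without ModelGenerator installed.

```python
import importlib
from typing import Any, Dict


def instantiate(spec: Dict[str, Any]) -> Any:
    """Import the class named by `class_path` and call it with `init_args`."""
    module_name, _, class_name = spec["class_path"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**spec.get("init_args", {}))


# Stand-in spec with the same shape as the `data` section above
spec = {
    "class_path": "fractions.Fraction",
    "init_args": {"numerator": 3, "denominator": 4},
}
obj = instantiate(spec)
print(type(obj).__name__, obj)
```

With ModelGenerator installed, passing the `data` spec through the same pattern amounts to calling `GUEClassification(config_name='prom_core_all', ...)` directly, which is what the notebook cell does.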

```python
from modelgenerator.data import GUEClassification

# Same arguments as the YAML `data` section above
datamodule = GUEClassification(
    config_name='prom_core_all',
    train_split_name='train',
    test_split_name='test',
    batch_size=4,
)
datamodule.setup()

# Sizes of the train and test splits
print(len(datamodule.train_dataset))
print(len(datamodule.test_dataset))
```

```text
Randomly split 0.1 of train for validation. Random seed: 42
42620
5920
```

```python
# Inspect the first training example
sample = datamodule.train_dataset[0]
sample
```

```text
{'sequences': 'CGAGCCTCAGGAGGGAGGCTTCTCGTATAGTCTCCCCCTACTGGATCCGTTCGCTTCAGCGGGCGCCAGG',
 'labels': 0}
```
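
Each sample is a plain dictionary, so it can be inspected with ordinary Python. As a small illustration, the sketch below computes the length and GC fraction of the sequence shown above. `summarize_sample` is a hypothetical helper, and the sample is copied in as a literal so the snippet runs without loading the dataset.

```python
from typing import Any, Dict


def summarize_sample(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Return the length, GC fraction, and label of one dataset sample."""
    seq = str(sample["sequences"]).upper()
    gc_count = seq.count("G") + seq.count("C")
    return {
        "length": len(seq),
        "gc_fraction": gc_count / len(seq) if seq else 0.0,
        "label": sample["labels"],
    }


# Literal copy of the sample printed above
sample = {
    "sequences": "CGAGCCTCAGGAGGGAGGCTTCTCGTATAGTCTCCCCCTACTGGATCCGTTCGCTTCAGCGGGCGCCAGG",
    "labels": 0,
}
summary = summarize_sample(sample)
print(summary)
```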