## AIDO.DNA-300M

[AIDO.DNA-300M](https://huggingface.co/genbio-ai/AIDO.DNA-300M) is a DNA foundation model trained on 10.6 billion nucleotides from 796 species, enabling genome mining, in silico mutagenesis studies, gene expression prediction, and directed sequence generation.

By scaling model depth while maintaining a short context length of 4000 nucleotides, AIDO.DNA shows substantial improvements across a breadth of tasks in functional genomics using transfer learning, sequence generation, and unsupervised annotation of functional elements. Notably, AIDO.DNA outperforms prior encoder-only architectures without new data, suggesting that new scaling laws are needed to achieve compute-optimal DNA language models.

<img src="images/DNA_300M.png" alt="DNA_300M" width="80%" style="background-color:white;"/>

| Model Arch Component        | Value          |
| ------------- |:-------------:|
| Num Attention Heads      | 16  |
| Num Hidden Layers      | 24       |
| Hidden Size | 1024       |
| Intermediate Size | 2688       |
| Vocab Size | 16      |
| Context Length | 4000      |
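
Because the context length is 4000 nucleotides, longer inputs have to be split before they reach the model. The sketch below shows one possible way to do this with overlapping windows. `chunk_sequence` and its `overlap` parameter are hypothetical helpers for illustration, not part of ModelGenerator, and the window may need to be slightly smaller than 4000 if the tokenizer adds special tokens.

```python
def chunk_sequence(seq, window=4000, overlap=0):
    """Split a DNA string into windows of at most `window` nucleotides.

    Hypothetical helper, not part of ModelGenerator. Consecutive windows
    share `overlap` nucleotides so features at a boundary are not cut.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(seq), step):
        chunks.append(seq[start:start + window])
        if start + window >= len(seq):
            break  # this window already reaches the end of the sequence
    return chunks


genome_fragment = "ACGT" * 2500  # 10,000 nucleotides
chunks = chunk_sequence(genome_fragment, window=4000, overlap=200)
print([len(c) for c in chunks])  # [4000, 4000, 2400]
```

Each chunk can then be passed to the task classes as a separate entry in the `sequences` list. The overlap is a design choice: a larger value costs more compute but reduces the chance that a functional element is split across two windows.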

## ModelGenerator tasks

* **Get embeddings**: input a sequence, get per-token and per-sequence embeddings.
* **Sequence level classification**: input a sequence, get a classification label (e.g., promoter/non-promoter).
* **Token level classification**: input a sequence, get per-token labels (e.g., annotation of functional elements).
* **Sequence level regression**: input a sequence, get a real-valued output (e.g., gene expression level).

```python
from modelgenerator.tasks import Embed
from modelgenerator.tasks import SequenceClassification
from modelgenerator.tasks import TokenClassification
from modelgenerator.tasks import SequenceRegression
```

### How to implement these tasks using ModelGenerator?
* **Backbone**: use `genbio-ai/AIDO.DNA-300M` as the backbone model.
* **Adaptors**: different adaptors can be used for different tasks.
* **Dataset**: different datasets can be used for different tasks.
* **Loss functions**: different loss functions can be used for different tasks.
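
To illustrate why the loss function depends on the task, here are two textbook losses written in plain Python: cross-entropy, the usual choice for classification, and mean squared error, the usual choice for regression. This is a sketch of the general definitions only; which losses ModelGenerator applies by default is not stated here, and the function names are illustrative.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Classification: two classes, the true label is class 0.
print(cross_entropy([2.0, 0.0], target=0))
# Regression: real-valued predictions against real-valued targets.
print(mean_squared_error([1.0, 3.0], [1.5, 2.0]))  # 0.625
```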

The following sections show how to use the predefined task classes in ModelGenerator to load the model and run inference.

### Embedding

```python
import os

# Set the Hugging Face cache location before importing modelgenerator.
os.environ['HF_HOME'] = '/tmp/hf_cache'

from modelgenerator.tasks import Embed

# Load the backbone and switch to evaluation mode.
model = Embed.from_config({"model.backbone": "aido_dna_300m"}).eval()
# Convert the raw sequences into a model-ready batch.
transformed_batch = model.transform({"sequences": ["ACGT", "AGCT"]})
embedding = model(transformed_batch)
print(embedding.shape)
print(embedding)
```

Output:

```
You didn't set a max_length for the data in the downstream task
torch.Size([2, 6, 1024])
tensor([[[-0.1211, -1.1656,  0.1341,  ..., -0.3062,  0.7482, -0.5316],
         [-0.3436, -0.7332, -0.4796,  ..., -0.5184, -0.1882, -1.0675],
         [-0.1117, -1.3768,  0.3410,  ..., -0.2125,  0.2736, -0.7600],
         [ 0.2689, -1.0567,  0.2963,  ...,  0.3879,  0.5795, -0.5904],
         [-0.3094, -0.3091,  0.0409,  ...,  0.4478,  0.8100, -0.2500],
         [-0.0301, -0.4987,  0.6923,  ...,  0.1198,  0.3909, -0.7629]],

        [[-0.1009, -0.8940,  0.6659,  ..., -0.2676,  0.1204, -0.1840],
         [-0.0800, -0.5025, -0.2760,  ..., -0.6894, -0.2624, -0.6581],
         [ 0.4181, -0.4056,  0.2300,  ...,  0.1285,  0.6482, -0.2483],
         [ 0.2937, -0.8096,  0.5548,  ...,  0.2902, -0.1164, -0.1896],
         [-0.1779, -0.2603,  0.3952,  ...,  0.0972,  0.5229, -0.3522],
         [ 0.0921, -0.5491,  0.7479,  ...,  0.1172, -0.0194, -0.7005]]],
       grad_fn=<NativeLayerNormBackward0>)
```
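
The output has shape `[batch, tokens, hidden]`, that is, one vector per token. The 6 token positions for a 4-nucleotide input suggest that the tokenizer adds two special tokens, although this is only an inference from the printed shape. A per-sequence embedding is commonly obtained by averaging over the token axis. The sketch below shows the idea on nested lists with a toy hidden size; `mean_pool` is a hypothetical helper, not a ModelGenerator function.

```python
def mean_pool(token_embeddings):
    """Average per-token vectors into one vector per sequence.

    `token_embeddings` is a nested list shaped [batch][tokens][hidden].
    Hypothetical helper for illustration only.
    """
    pooled = []
    for sequence in token_embeddings:
        n_tokens = len(sequence)
        # zip(*sequence) groups the values of each hidden dimension across tokens.
        pooled.append([sum(dim) / n_tokens for dim in zip(*sequence)])
    return pooled


# Toy batch: 2 sequences, 3 tokens each, hidden size 2.
toy = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]],
]
print(mean_pool(toy))  # [[3.0, 4.0], [2.0, 2.0]]
```

On the tensor above, `embedding.mean(dim=1)` performs the same reduction and yields shape `[2, 1024]`. Whether special-token positions should be excluded from the average is a design choice this sketch does not address.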


### Sequence Level Classification

```python
import torch
from modelgenerator.tasks import SequenceClassification
model = SequenceClassification.from_config({"model.backbone": "aido_dna_300m", "model.n_classes": 2}).eval()
transformed_batch = model.transform({"sequences": ["ACGT", "AGCT"]})
logits = model(transformed_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```

Output:

```
You didn't set a max_length for the data in the downstream task
tensor([[0.4520, 0.2174],
        [0.2696, 0.2643]], grad_fn=<AddmmBackward0>)
tensor([0, 0])
```
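
The printed values are raw logits. To read them as class probabilities, a softmax over the last dimension is the usual step; the plain-Python sketch below applies it to the logits printed above. Note that the classification head in this example has not been fine-tuned, so the resulting probabilities presumably carry no biological meaning.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the output above.
batch_logits = [[0.4520, 0.2174], [0.2696, 0.2643]]
for logits in batch_logits:
    probs = softmax(logits)
    predicted = probs.index(max(probs))  # same result as argmax over the logits
    print([round(p, 3) for p in probs], "-> class", predicted)
```

On the tensor itself, `torch.softmax(logits, dim=-1)` computes the same quantity.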


### Token Level Classification

```python
import torch
from modelgenerator.tasks import TokenClassification
model = TokenClassification.from_config({"model.backbone": "aido_dna_300m", "model.n_classes": 3}).eval()
transformed_batch = model.transform({"sequences": ["ACGT", "AGCT"]})
logits = model(transformed_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```

Output:

```
You didn't set a max_length for the data in the downstream task
tensor([[[-0.2240, -0.2785, -0.2847],
         [-0.3276, -0.3235,  0.0136],
         [ 0.1009, -0.6764, -0.0263],
         [-0.2982, -0.2781, -0.1608],
         [-0.2490, -0.4951, -0.0656],
         [ 0.0326, -0.6318, -0.0705]],

        [[-0.0281, -0.1438, -0.2070],
         [-0.2562, -0.2852,  0.1643],
         [-0.0965, -0.0467, -0.2549],
         [ 0.1097, -0.6410,  0.1017],
         [-0.2554, -0.3513,  0.0117],
         [-0.0156, -0.4872, -0.1562]]], grad_fn=<ViewBackward0>)
tensor([[0, 2, 0, 2, 2, 0],
        [0, 2, 1, 0, 2, 0]])
```
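
Each 4-nucleotide input produces 6 labels, so the predictions need to be aligned with the nucleotides before they can be interpreted. The sketch below assumes one token per nucleotide plus one leading and one trailing special token; that layout is an assumption based on the output shape, not something the source confirms, and `labels_per_nucleotide` is a hypothetical helper.

```python
def labels_per_nucleotide(sequence, token_labels, n_prefix=1, n_suffix=1):
    """Pair each nucleotide with its predicted label.

    Assumes (not confirmed by the source) that the tokenizer emits one token
    per nucleotide plus `n_prefix` leading and `n_suffix` trailing special tokens.
    """
    core = token_labels[n_prefix:len(token_labels) - n_suffix]
    if len(core) != len(sequence):
        raise ValueError("label count does not match sequence length")
    return list(zip(sequence, core))


# Predicted labels copied from the argmax output above.
print(labels_per_nucleotide("ACGT", [0, 2, 0, 2, 2, 0]))
print(labels_per_nucleotide("AGCT", [0, 2, 1, 0, 2, 0]))
```

The length check fails loudly if the assumed token layout does not hold, which is safer than silently shifting labels by one position.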


### Sequence Level Regression

```python
from modelgenerator.tasks import SequenceRegression
model = SequenceRegression.from_config({"model.backbone": "aido_dna_300m"}).eval()
transformed_batch = model.transform({"sequences": ["ACGT", "AGCT"]})
# The regression head returns one real value per sequence, not class logits.
predictions = model(transformed_batch)
print(predictions)
```

Output:

```
You didn't set a max_length for the data in the downstream task
tensor([[1.0645],
        [1.0317]], grad_fn=<AddmmBackward0>)
```
