## AIDO.Protein-16B

[AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) is a protein language model, trained on 1.2 trillion amino acids sourced from UniRef90 and ColabFoldDB.

By leveraging mixture-of-experts (MoE) layers, AIDO.Protein scales efficiently to 16 billion parameters and performs strongly across a wide range of protein sequence understanding and sequence generation tasks, despite being trained solely on single protein sequences. Across more than 280 deep mutational scanning (DMS) protein fitness prediction tasks, our model outperforms previous state-of-the-art protein sequence models that do not use multiple sequence alignments (MSAs) and achieves 99% of the performance of models that do, highlighting the strength of its learned representations.
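
To make the MoE idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It assumes 8 experts with 2 selected per token, which is our reading of the MoE rows in the architecture table below. The toy experts, the router scores, and the renormalization of the selected weights are illustrative assumptions, not the model's confirmed router.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize so the kept weights sum to 1
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy setup (hypothetical): 8 experts, each scaling its input by a different factor.
experts = [lambda x, s=s: [s * xi for xi in x] for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # one score per expert

selected = route_top_k(router_logits, k=2)
output = moe_forward([1.0, -2.0], experts, router_logits, k=2)
print(selected)  # experts 1 and 4 carry all of the weight
print(output)
```

Only the two selected experts are evaluated for this token, which is why the compute per token stays well below what the total parameter count would suggest.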

Reference: [Mixture of Experts Enable Efficient and Effective Protein Understanding and Design](https://www.biorxiv.org/content/10.1101/2024.11.29.625425v1)

<img src="images/proteinmoe_architecture.png" alt="AIDO.Protein" width="600" style="background-color:white;"/>

| Model Arch Component    | Value |
| ----------------------- | :---: |
| Num Attention Head      |  36   |
| Num Hidden Layer        |  36   |
| Hidden Size             | 2304  |
| FFN Hidden Size         | 7680  |
| Experts per MoE Layer   |   8   |
| Experts per Token       |   2   |
| Vocab Size              |  44   |
| Context Length          | 2048  |
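
As a rough sanity check, the table values can be combined into a back-of-the-envelope parameter count. This sketch assumes that the two MoE rows mean 8 experts per layer with 2 used per token, that attention has four hidden-by-hidden projections, and that each expert is a gated feed-forward network with three weight matrices. None of these are stated in the table, and biases, norms, embeddings, and router weights are ignored.

```python
hidden_size = 2304
ffn_hidden_size = 7680
num_layers = 36
num_experts = 8        # assumed reading of the table's first MoE row
experts_per_token = 2  # assumed reading of the table's second MoE row

# Assumption: Q, K, V, and output projections, each hidden_size x hidden_size.
attention_per_layer = 4 * hidden_size * hidden_size
# Assumption: a gated FFN expert with three hidden_size x ffn_hidden_size matrices.
expert_params = 3 * hidden_size * ffn_hidden_size

total = num_layers * (attention_per_layer + num_experts * expert_params)
active = num_layers * (attention_per_layer + experts_per_token * expert_params)

print(f"total:  about {total / 1e9:.2f}B parameters")             # about 16.05B
print(f"active: about {active / 1e9:.2f}B parameters per token")  # about 4.59B
```

Under these assumptions the total lands close to the advertised 16 billion, while only a fraction of the weights is used for any single token.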


## Step-by-Step Example

In [1]:
import os
import pathlib

import torch

# Set the Hugging Face cache directory before importing modelgenerator.
os.environ['HF_HOME'] = '/tmp/hf_cache'

import modelgenerator

# Locate the installed package; the protein vocabulary file is loaded from it below.
modelgenerator_path = str(pathlib.Path(modelgenerator.__file__).parent)
print(f"modelgenerator_path: {modelgenerator_path}")

from modelgenerator.huggingface_models.fm4bio import FM4BioForMaskedLM, FM4BioTokenizer

modelgenerator_path: /jfs/pan-li/Demo/ModelGenerator/modelgenerator


In [2]:
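# Load the pretrained checkpoint from the Hugging Face Hub (or the local cache).
model = FM4BioForMaskedLM.from_pretrained("genbio-ai/AIDO.Protein-16B")
# Build the tokenizer from the protein vocabulary file inside the modelgenerator package.
vocab_file = os.path.join(modelgenerator_path, "huggingface_models/fm4bio/vocab_protein.txt")
tokenizer = FM4BioTokenizer(vocab_file=vocab_file)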



Loading checkpoint shards: 100%|██████████| 13/13 [00:08<00:00,  1.59it/s]


In [None]:
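# Tokenize one sequence; add_special_tokens=True also adds the tokenizer's special token(s).
input_ids = tokenizer.encode('HELLQWRLD', add_special_tokens=True)
print(f"input_ids: {input_ids}")
input_ids = torch.tensor([input_ids])  # Batch size 1

# Run inference without tracking gradients.
with torch.no_grad():
    lm_output = model(input_ids, output_hidden_states=True)
    logits = lm_output.logits  # per-token scores over the output vocabulary
    last_hidden_states = lm_output.hidden_states[-1]  # per-token embeddings from the final layer

print(f"logits: {logits.shape}")
print(f"last_hidden_states: {last_hidden_states.shape}")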

input_ids: [18, 6, 1, 1, 13, 19, 7, 1, 10, 34]


logits: torch.Size([1, 10, 128])
last_hidden_states: torch.Size([1, 10, 2304])


## ModelGenerator tasks

* **Get embeddings**: input sequence, get per-residue and per-sequence embeddings.
* **Sequence level classification**: input sequence, get classification label (e.g., enzyme/non-enzyme).
* **Token level classification**: input sequence, get per-residue labels (e.g., secondary structure).
* **Sequence level regression**: input sequence, get a real-valued output (e.g., stability).

```python
from modelgenerator.tasks import Embed
from modelgenerator.tasks import SequenceClassification
from modelgenerator.tasks import TokenClassification
from modelgenerator.tasks import SequenceRegression
```

### How are these tasks implemented in ModelGenerator?
* **Backbone**: use `genbio-ai/AIDO.Protein-16B` as the backbone model.
* **Adaptors**: different adaptors can be used for different tasks.
* **Dataset**: different datasets can be used for different tasks.
* **Loss functions**: different loss functions can be used for different tasks.

The following sections show how to use the predefined task classes in ModelGenerator to load the model and run it on example sequences.

### Get Embeddings

In [6]:
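import os
import torch

# Set the Hugging Face cache directory before importing modelgenerator.
os.environ['HF_HOME'] = '/tmp/hf_cache'

from modelgenerator.tasks import Embed

# Build the embedding task on the AIDO.Protein-16B backbone and switch to inference mode.
model = Embed.from_config({"model.backbone": "aido_protein_16b"}).eval()

# transform() tokenizes and pads the raw sequences into a model-ready batch.
transformed_batch = model.transform({"sequences": ["HELLQ", "WRLD"]})
for k, v in transformed_batch.items():
    if isinstance(v, torch.Tensor):
        print(f"{k}: {v.shape}")
    elif isinstance(v, list):
        print(f"{k}: list of length {len(v)}")

# Per-token embeddings with shape (batch, tokens, hidden_size).
embedding = model(transformed_batch)
print(embedding.shape)
print(embedding)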



Loading checkpoint shards: 100%|██████████| 13/13 [00:08<00:00,  1.51it/s]


sequences: list of length 2
input_ids: torch.Size([2, 6])
attention_mask: torch.Size([2, 6])
special_tokens_mask: list of length 2
torch.Size([2, 6, 2304])
tensor([[[ 0.3044,  0.0566,  0.0445,  ..., -0.1136,  0.1243,  0.3385],
         [ 0.1187,  0.1608, -0.1112,  ...,  0.0048,  0.0487,  0.2208],
         [ 0.2758,  0.1054, -0.0684,  ..., -0.0247, -0.1242,  0.0035],
         [ 0.2359,  0.0590,  0.0292,  ...,  0.0735, -0.0834,  0.1099],
         [ 0.2090, -0.0955,  0.1451,  ..., -0.0390,  0.1434,  0.0508],
         [ 0.1898, -0.0129, -0.0312,  ...,  0.0127, -0.0230,  0.1145]],

        [[ 0.3881,  0.0761,  0.0285,  ...,  0.1029,  0.1006,  0.3173],
         [ 0.2578, -0.1170,  0.0710,  ...,  0.1311,  0.1170,  0.2706],
         [ 0.2240,  0.1821,  0.0016,  ...,  0.0561,  0.1246,  0.1904],
         [ 0.2875,  0.1226, -0.0167,  ...,  0.0818, -0.0253,  0.2927],
         [ 0.1534,  0.0319,  0.0268,  ...,  0.0514, -0.0410,  0.1762],
         [ 0.1700,  0.0045,  0.0832,  ...,  0.0786, -0.0603,
         ... (output truncated)
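
The output above has one vector per token. One common way to turn this into a single per-sequence embedding is to average over the positions that the attention mask marks as real tokens. The sketch below shows that idea with plain Python lists standing in for tensors. `masked_mean_pool` is a hypothetical helper, and whether ModelGenerator pools in this way is not stated here.

```python
def masked_mean_pool(embeddings, attention_mask):
    """Average token vectors over positions where attention_mask is 1."""
    pooled = []
    for token_vectors, mask in zip(embeddings, attention_mask):
        kept = [vec for vec, m in zip(token_vectors, mask) if m == 1]
        dim = len(token_vectors[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled

# Toy batch: 2 sequences, 3 token positions, hidden size 2.
embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],  # last position is padding
]
attention_mask = [[1, 1, 1], [1, 1, 0]]

pooled = masked_mean_pool(embeddings, attention_mask)
print(pooled)  # [[3.0, 4.0], [3.0, 1.0]]
```

The padded position in the second sequence is excluded, so it does not distort the average.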

### Sequence Level Classification

In [8]:
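import torch
from modelgenerator.tasks import SequenceClassification

# Build a 3-class sequence classifier on the AIDO.Protein-16B backbone.
model = SequenceClassification.from_config({"model.backbone": "aido_protein_16b", "model.n_classes": 3}).eval()
# transform() tokenizes and pads the raw sequences into a model-ready batch.
transformed_batch = model.transform({"sequences": ["HELLQ", "WRLD"]})
logits = model(transformed_batch)  # shape: (batch, n_classes)
print(logits)
print(torch.argmax(logits, dim=-1))  # predicted class index per sequence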

Loading checkpoint shards: 100%|██████████| 13/13 [00:09<00:00,  1.34it/s]


tensor([[-0.0897,  0.0079, -0.2286],
        [ 0.0525, -0.0608, -0.1076]], grad_fn=<AddmmBackward0>)
tensor([1, 0])
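
To read these logits as class probabilities, apply a softmax to each row. The sketch below does this in plain Python on the values printed above. The probabilities come out close to uniform, which is what one would expect if the classification head has not been fine-tuned yet (an assumption on our part, since no training step is shown here).

```python
import math

# Logits copied from the output above: one row per sequence, one column per class.
logits = [
    [-0.0897, 0.0079, -0.2286],
    [0.0525, -0.0608, -0.1076],
]

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

probs = [softmax(row) for row in logits]
preds = [max(range(len(p)), key=p.__getitem__) for p in probs]

for p, c in zip(probs, preds):
    print([round(x, 3) for x in p], "->", c)
```

The predicted classes match the `argmax` output above (1 for the first sequence, 0 for the second).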


### Token Level Classification

In [3]:
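import torch
from modelgenerator.tasks import TokenClassification

# Build a 3-class per-token classifier on the AIDO.Protein-16B backbone.
model = TokenClassification.from_config({"model.backbone": "aido_protein_16b", "model.n_classes": 3}).eval()
# transform() tokenizes and pads the raw sequences into a model-ready batch.
transformed_batch = model.transform({"sequences": ["HELLQ", "WRLD"]})
logits = model(transformed_batch)  # shape: (batch, tokens, n_classes)
print(logits)
print(torch.argmax(logits, dim=-1))  # predicted class index per token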

Loading checkpoint shards: 100%|██████████| 13/13 [00:14<00:00,  1.09s/it]


tensor([[[ 0.0158, -0.0290, -0.0166],
         [ 0.0178, -0.0479, -0.0003],
         [-0.0464,  0.0480,  0.0859],
         [-0.1204,  0.0147,  0.0833],
         [ 0.0230, -0.0509, -0.0480],
         [-0.0650,  0.2052, -0.0993]],

        [[ 0.0295, -0.0584,  0.0766],
         [ 0.1101, -0.1542, -0.0080],
         [ 0.0885, -0.0574,  0.1072],
         [ 0.0963, -0.1165,  0.0631],
         [-0.0354,  0.1781, -0.0608],
         [ 0.0476, -0.0687,  0.0973]]], grad_fn=<ViewBackward0>)
tensor([[0, 0, 2, 2, 0, 1],
        [2, 0, 2, 0, 1, 2]])
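
Each row above has 6 predictions, although the inputs have 5 and 4 residues. The extra positions are presumably special or padding tokens. The sketch below keeps only the per-residue labels, assuming residues occupy the first `len(sequence)` positions of each row. A more robust version would use the `special_tokens_mask` from the transformed batch, whose layout is not shown here.

```python
# Predicted class indices copied from the argmax output above.
predictions = [[0, 0, 2, 2, 0, 1], [2, 0, 2, 0, 1, 2]]
sequences = ["HELLQ", "WRLD"]

# Assumption: residues come first; trailing positions are special or padding tokens.
per_residue = [row[:len(seq)] for row, seq in zip(predictions, sequences)]

for seq, labels in zip(sequences, per_residue):
    print(list(zip(seq, labels)))
```

Under this assumption, each residue is paired with exactly one label and the trailing positions are discarded.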


### Sequence Level Regression

In [4]:
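from modelgenerator.tasks import SequenceRegression

# Build a sequence-level regressor on the AIDO.Protein-16B backbone.
model = SequenceRegression.from_config({"model.backbone": "aido_protein_16b"}).eval()
# transform() tokenizes and pads the raw sequences into a model-ready batch.
transformed_batch = model.transform({"sequences": ["HELLQ", "WRLD"]})
predictions = model(transformed_batch)  # one real-valued output per sequence
print(predictions)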

Loading checkpoint shards: 100%|██████████| 13/13 [00:08<00:00,  1.55it/s]


tensor([[0.1339],
        [0.0609]], grad_fn=<AddmmBackward0>)


The examples above load and call the model manually. In practice, we want ModelGenerator to handle these steps for us: loading the model, loading the dataset, preprocessing the data, defining the loss function, training, and visualizing the loss curve.
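
To show what "define the loss function, train, and track the loss curve" means in its simplest form, here is a toy training loop in plain Python. It fits a one-dimensional linear head with mean squared error and gradient descent. The data, learning rate, and model are hypothetical stand-ins, and this is not how ModelGenerator implements training.

```python
# Toy data (hypothetical): 1-D "embeddings" xs and regression targets ys = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0   # parameters of the linear head
lr = 0.05         # learning rate
loss_curve = []

for step in range(200):
    preds = [w * x + b for x in xs]
    errors = [p - y for p, y in zip(preds, ys)]
    # Mean squared error, the usual loss for regression.
    loss = sum(e * e for e in errors) / len(xs)
    loss_curve.append(loss)
    # Gradients of the loss with respect to w and b.
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = 2 * sum(errors) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"first loss: {loss_curve[0]:.3f}, last loss: {loss_curve[-1]:.6f}")
print(f"w={w:.2f}, b={b:.2f}")
```

Plotting `loss_curve` against the step index gives the kind of loss curve mentioned above. A framework takes care of the same bookkeeping for real models and datasets.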