# Install Conda Environment and ModelGenerator

https://github.com/genbio-ai/AIDO-Foundations-Tutorials

![qrcode](images/qrcode.png)

1. Create an environment with Python 3.12
2. Install ModelGenerator: https://github.com/genbio-ai/ModelGenerator


```bash
# source /opt/miniconda/etc/profile.d/conda.sh
conda create -n genbio python=3.12 -y
conda activate genbio

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip install datasets==3.0.0
pip install hf_transfer tabulate
git clone https://github.com/genbio-ai/ModelGenerator.git
cd ModelGenerator
pip install -e .
```
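
After the install finishes, it can help to confirm that the environment matches what the commands above requested. The following is a minimal sketch using only the standard library; the distribution names passed to `installed_versions` are assumptions (in particular, the name under which `pip install -e .` registers ModelGenerator may differ), and `installed_versions` and `python_ok` are hypothetical helper names, not part of ModelGenerator.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def python_ok(required=(3, 12)):
    """True if the running interpreter has the given (major, minor) version."""
    return tuple(sys.version_info[:2]) == tuple(required)


if __name__ == "__main__":
    print(f"Python 3.12 interpreter: {python_ok()}")
    # Assumed distribution names; adjust if pip reports different ones.
    wanted = ["torch", "torchvision", "torchaudio", "datasets", "modelgenerator"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

A `None` entry means pip has no record of that distribution in the active environment, which usually indicates that `conda activate genbio` was skipped before installing.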

## ModelGenerator

* ModelGenerator loads models and tokenizers through the **Transformers** library.
* ModelGenerator fine-tunes models on downstream tasks with the **Lightning** library.


<img src="images/modelgenerator.png" alt="ModelGenerator" width="80%"/>

## Quick start: load model and tokenizer from ModelGenerator

In [2]:
import os
os.environ['HF_HOME'] = '/tmp/hf_cache' # Hugging Face cache directory; set it before importing modelgenerator or transformers
import modelgenerator, torch, pathlib
modelgenerator_path = str(pathlib.Path(modelgenerator.__file__).parent)
print(f"modelgenerator_path: {modelgenerator_path}")
from modelgenerator.huggingface_models.fm4bio import FM4BioForMaskedLM # ModelClass
from modelgenerator.huggingface_models.fm4bio import FM4BioTokenizer # TokenizerClass

modelgenerator_path: /jfs/pan-li/Demo/ModelGenerator/modelgenerator


In [3]:
from transformers import PreTrainedModel, PreTrainedTokenizer
print(issubclass(FM4BioForMaskedLM, PreTrainedModel))
print(issubclass(FM4BioTokenizer, PreTrainedTokenizer))

True
True


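The next cell loads the [genbio-ai/AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) checkpoint and builds its tokenizer from the protein vocabulary file bundled with ModelGenerator.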

In [5]:
model = FM4BioForMaskedLM.from_pretrained("genbio-ai/AIDO.Protein-16B")
vocab_file = os.path.join(modelgenerator_path, "huggingface_models/fm4bio/vocab_protein.txt")
tokenizer = FM4BioTokenizer(vocab_file=vocab_file)

Loading checkpoint shards: 100%|██████████| 13/13 [00:07<00:00,  1.80it/s]
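
To see which tokens the tokenizer knows, the vocabulary file can be read directly. This is a sketch under an assumption: it treats `vocab_protein.txt` as a plain-text file with one token per line, where the line index is the token id (the usual BERT-style layout). The file format is not confirmed by this tutorial, and `read_vocab` is a hypothetical helper, not a ModelGenerator function.

```python
from pathlib import Path


def read_vocab(vocab_file):
    """Map each token to its 0-based line index.

    Assumes one token per line; empty lines are skipped but still
    count toward the index, so ids stay aligned with line numbers.
    """
    lines = Path(vocab_file).read_text(encoding="utf-8").splitlines()
    return {token: idx for idx, token in enumerate(lines) if token}
```

In the notebook, calling `read_vocab(vocab_file)` with the `vocab_file` path built in the cell above returns a dictionary whose length can be compared against the embedding table size reported by `print(model)`.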


In [6]:
print(model)

FM4BioForMaskedLM(
  (bert): FM4BioModel(
    (embeddings): FM4BioEmbeddings(
      (word_embeddings): Embedding(128, 2304, padding_idx=0)
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): FM4BioEncoder(
      (layer): ModuleList(
        (0-35): 36 x FM4BioLayer(
          (attention): FM4BioAttention(
            (ln): RnaRMSNorm()
            (self): FM4BioSelfAttention(
              (query): Linear(in_features=2304, out_features=2304, bias=True)
              (key): Linear(in_features=2304, out_features=2304, bias=True)
              (value): Linear(in_features=2304, out_features=2304, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): FM4BioSelfOutput(
              (dense): Linear(in_features=2304, out_features=2304, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (ln): RnaRMSNorm()
          (mlp): FM4BioMLP(
            (router): Linear(in_features=2304, o
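
The shapes in the printed module tree are enough for a quick parameter count of the parts that are visible. The sketch below uses only the numbers shown above (hidden size 2304, 36 layers, a 128-row embedding table, four biased 2304-to-2304 linear layers per attention block). The printout is cut off inside the MLP, so the mixture-of-experts weights and the normalization layers are deliberately left out; `linear_params` is a hypothetical helper.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a dense layer: weight matrix plus optional bias."""
    return in_features * out_features + (out_features if bias else 0)


HIDDEN = 2304      # in_features / out_features of the attention projections
NUM_LAYERS = 36    # "(0-35): 36 x FM4BioLayer"
VOCAB_ROWS = 128   # rows of the word embedding table

embedding = VOCAB_ROWS * HIDDEN
# query, key, value, and the attention output dense layer
attention_per_layer = 4 * linear_params(HIDDEN, HIDDEN)
attention_total = NUM_LAYERS * attention_per_layer

print(f"embedding table:        {embedding:,}")
print(f"attention per layer:    {attention_per_layer:,}")
print(f"attention, all layers:  {attention_total:,}")
```

The attention projections account for well under one billion parameters, so most of the model's 16B parameters must sit in the parts of the network that the truncated printout does not show.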

## Models in this tutorial

### AIDO.Protein-16B

[AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) is a protein language model, trained on 1.2 trillion amino acids sourced from UniRef90 and ColabFoldDB.

<img src="images/proteinmoe_architecture.png" alt="AIDO.Protein" width="600" style="background-color:white;"/>

### AIDO.Protein-RAG-16B

[AIDO.Protein-RAG-16B](https://huggingface.co/genbio-ai/AIDO.Protein-RAG-16B) is a multimodal protein language model that integrates Multiple Sequence Alignment (MSA) and structural data, building upon the [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) foundation. 

<img src="images/rag_2.png" alt="AIDO.Protein-RAG" width="400" style="background-color:white;"/>

### AIDO.DNA-300M

[AIDO.DNA-300M](https://huggingface.co/genbio-ai/AIDO.DNA-300M) is a DNA foundation model trained on 10.6 billion nucleotides from 796 species, enabling genome mining, in silico mutagenesis studies, gene expression prediction, and directed sequence generation.

<img src="images/DNA_300M.png" alt="DNA_300M" width="50%" style="background-color:white;"/>

## More resources

* **Hugging Face**: https://huggingface.co/genbio-ai
  
  <img src="images/hf_collections.png" alt="hf_collections" width="50%" style="background-color:white;"/>

* **GitHub**: https://github.com/genbio-ai/ModelGenerator/
    
    <img src="images/github_modelgenerator.png" alt="github_modelgenerator" width="50%" style="background-color:white;"/>

* **Documentation**: https://genbio-ai.github.io/ModelGenerator/
    
    <img src="images/mg_documentation.png" alt="mg_documentation" width="50%" style="background-color:white;"/>
