# ESM-1nv Training with BioNeMo on Amazon SageMaker

Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0

---
## 1. Setup

### 1.1. Create clients

In [None]:
import boto3
import os
import sagemaker
from time import strftime

boto_session = boto3.session.Session()
sagemaker_session = sagemaker.session.Session(boto_session)
REGION_NAME = sagemaker_session.boto_region_name
S3_BUCKET = sagemaker_session.default_bucket()
S3_PREFIX = "bionemo-training"
S3_FOLDER = sagemaker.s3.s3_path_join("s3://", S3_BUCKET, S3_PREFIX)
print(f"S3 uri is {S3_FOLDER}")

EXPERIMENT_NAME = "bionemo-training-" + strftime("%Y-%m-%d")

SAGEMAKER_EXECUTION_ROLE = sagemaker.session.get_execution_role(sagemaker_session)
print(f"Assumed SageMaker role is {SAGEMAKER_EXECUTION_ROLE}")

### 1.2. Build BioNeMo-SageMaker Container Image

If you don't already have access to the BioNeMo-SageMaker container image, run the following cell to build and deploy it to your AWS account. Take note of the image URI - you'll use it for the processing and training steps below.

Here is an example shell script you can use in your environment (including SageMaker Notebook Instances) to build the container.

Once you have built and pushed the container, we strongly recommend using [ECR image scanning](https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-scanning.html) to ensure that it meets your security requirements.
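
As a minimal sketch of that recommendation, the helpers below summarize the severity counts returned by the ECR `DescribeImageScanFindings` API once the image has been pushed. The function names and the policy of blocking on CRITICAL or HIGH findings are illustrative assumptions, not part of the original workflow. The repository name and `latest` tag mirror the build script in this section.

```python
def summarize_scan(response):
    """Return (status, severity_counts) from a describe_image_scan_findings response."""
    status = response.get("imageScanStatus", {}).get("status", "UNKNOWN")
    counts = response.get("imageScanFindings", {}).get("findingSeverityCounts", {})
    return status, dict(counts)


def has_blocking_findings(counts, blocking=("CRITICAL", "HIGH")):
    """Assumed policy: treat any CRITICAL or HIGH finding as blocking."""
    return any(counts.get(level, 0) > 0 for level in blocking)


def fetch_scan(repository_name="bionemo-training", image_tag="latest"):
    """Query ECR for the scan findings of a pushed image (requires AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    ecr = boto3.client("ecr")
    return ecr.describe_image_scan_findings(
        repositoryName=repository_name, imageId={"imageTag": image_tag}
    )
```

With AWS credentials available, `status, counts = summarize_scan(fetch_scan())` gives a quick overview. If scan-on-push is not enabled for the repository, a scan may first need to be requested, for example with the ECR `start_image_scan` call.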

In [None]:
%%bash

# The name of our algorithm
algorithm_name=bionemo-training

pushd container/training

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

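# Authenticate Docker with ECR (`aws ecr get-login` was removed in AWS CLI v2)
aws ecr get-login-password --region "${region}" | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"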

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t "${algorithm_name}" .
docker tag "${algorithm_name}" "${fullname}"

docker push "${fullname}"

popd
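
The image URI that the script pushes can also be reconstructed in Python for use in later cells. This sketch simply mirrors the `fullname` variable from the shell script. The helper names `ecr_image_uri` and `current_account_id` are hypothetical, and the account ID in the example is a placeholder.

```python
def ecr_image_uri(account_id, region, repository="bionemo-training", tag="latest"):
    """Mirror the `fullname` variable built by the shell script above."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def current_account_id():
    """Look up the caller's AWS account ID (requires AWS credentials)."""
    import boto3  # lazy import; only needed when querying AWS

    return boto3.client("sts").get_caller_identity()["Account"]


# Example with a placeholder account ID:
print(ecr_image_uri("123456789012", "us-west-2"))
```

In the notebook, `ecr_image_uri(current_account_id(), REGION_NAME)` should match the pushed image, assuming the shell script resolved the same region as the SageMaker session.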

---
## 2. Data Processing

### 2.1. Query UniProt for human amino acid sequences between 100 and 1000 residues in length

In [None]:
from io import BytesIO
import pandas as pd
import requests

query_url = "https://rest.uniprot.org/uniprotkb/stream?query=organism_id:9606+AND+reviewed=True+AND+length=[100+TO+1000]&format=tsv&compressed=true&fields=accession,sequence"
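uniprot_request = requests.get(query_url, timeout=300)
uniprot_request.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
bio = BytesIO(uniprot_request.content)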

df = pd.read_csv(bio, compression="gzip", sep="\t")
display(df)
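
The query URL above packs several filters into one string. As a sketch, the same URL can be assembled from named parameters with the standard library, which makes the organism ID, length range, and returned fields easier to change. The function name `build_uniprot_url` and its parameters are illustrative; the query syntax is copied unchanged from the cell above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"


def build_uniprot_url(organism_id=9606, min_len=100, max_len=1000,
                      fields=("accession", "sequence")):
    """Assemble the streaming query URL from named parameters."""
    query = (
        f"organism_id:{organism_id} AND reviewed=True "
        f"AND length=[{min_len} TO {max_len}]"
    )
    params = {
        "query": query,
        "format": "tsv",
        "compressed": "true",
        "fields": ",".join(fields),
    }
    # urlencode percent-encodes reserved characters and joins the parameters
    return f"{UNIPROT_STREAM}?{urlencode(params)}"


url = build_uniprot_url()
# Decoding the URL recovers the individual parameters.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

The encoded URL looks different from the hand-written one because reserved characters are percent-encoded, but it decodes to the same parameters.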

### 2.2. Split Data and Upload to S3

In [None]:
train = df.sample(n=9600, random_state=42)
val_test = df.drop(train.index)
val = val_test.sample(n=960, random_state=42)
test = val_test.drop(val.index).sample(n=960, random_state=42)
del val_test

print(f"Training data size: {train.shape}")
print(f"Validation data size: {val.shape}")
print(f"Test data size: {test.shape}")

for split in ["train", "val", "test"]:
    os.makedirs(os.path.join("data", split), exist_ok=True)

train.to_csv(os.path.join("data", "train", "x000.csv"), index=False)
val.to_csv(os.path.join("data", "val", "x001.csv"), index=False)
test.to_csv(os.path.join("data", "test", "x002.csv"), index=False)

DATA_PREFIX = f"{S3_PREFIX}/data"  # S3 keys always use "/" separators
DATA_URI = sagemaker_session.upload_data(
    path="data", bucket=S3_BUCKET, key_prefix=DATA_PREFIX
)
print(f"Sequence data available at {DATA_URI}")
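
The split above relies on `sample` followed by `drop` to keep the three subsets disjoint, and it needs at least 9600 + 960 + 960 = 11,520 rows to succeed. A pure-Python sketch of the same idea makes both properties explicit. It uses a hypothetical `split_indices` helper rather than pandas, so it will not select the same rows as the cell above.

```python
import random


def split_indices(n_rows, n_train=9600, n_val=960, n_test=960, seed=42):
    """Partition row indices into disjoint train/val/test subsets."""
    needed = n_train + n_val + n_test
    if n_rows < needed:
        raise ValueError(f"Need at least {needed} rows, got {n_rows}")
    # Shuffle once, then slice: the slices cannot overlap.
    shuffled = random.Random(seed).sample(range(n_rows), n_rows)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:needed]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(12000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Checking the row count up front gives a clearer error than the one pandas raises when `n` exceeds the number of available rows.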

---
## 3. Configure NVIDIA NGC API Credentials

Before you create a BioNeMo training job, follow these steps to generate NGC API credentials and store them in AWS Secrets Manager.

1. Sign in or create a new account at NVIDIA [NGC](https://ngc.nvidia.com/signin).
2. Select your name in the top-right corner of the screen and then "Setup".

![Select Setup from the top-right menu](img/1-setup.png)

3. Select "Generate API Key".

![Select Generate API Key](img/2-api-key.png)

4. Select the green "+ Generate API Key" button and confirm.

![Select green Generate API Key button](img/3-generate.png)

5. Copy the API key - this is the last time you can retrieve it!

6. Before you leave the NVIDIA NGC site, also take note of your organization ID listed under your name in the top-right corner of the screen. You'll need this, plus your API key, to download BioNeMo artifacts.

7. Navigate to the AWS Console and then to AWS Secrets Manager.

![Navigate to AWS Secrets Manager](img/4-sm.png)

8. Select "Store a new secret".
9. Under "Secret type", select "Other type of secret".

![Select other type of secret](img/5-secret-type.png)

10. Under "Key/value pairs", add a key named "NGC_CLI_API_KEY" with your NGC API key as its value. Add another key named "NGC_CLI_ORG" with your NGC organization ID as its value. Select Next.

11. Under "Configure secret - Secret name and description", name your secret "NVIDIA_NGC_CREDS" and select Next. You'll use this secret name when submitting BioNeMo jobs to SageMaker.

12. Select the remaining default options to create your secret.
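
The console steps above can also be scripted. The sketch below builds the same key/value payload and stores it through the Secrets Manager `CreateSecret` API using boto3. The helper names are hypothetical; the key names and the secret name are the ones given in the steps. To keep the API key out of the notebook, the usage comment reads it with `getpass`.

```python
import json

REQUIRED_KEYS = ("NGC_CLI_API_KEY", "NGC_CLI_ORG")


def build_ngc_secret(api_key, org):
    """Serialize the two NGC values into the JSON string Secrets Manager stores."""
    if not api_key or not org:
        raise ValueError("Both the NGC API key and organization are required")
    return json.dumps({"NGC_CLI_API_KEY": api_key, "NGC_CLI_ORG": org})


def missing_keys(secret_string):
    """Return the required keys that are absent from a stored secret."""
    stored = json.loads(secret_string)
    return [key for key in REQUIRED_KEYS if key not in stored]


def store_ngc_secret(api_key, org, name="NVIDIA_NGC_CREDS"):
    """Create the secret in AWS Secrets Manager (requires AWS credentials)."""
    import boto3  # lazy import; only needed when calling AWS

    client = boto3.client("secretsmanager")
    return client.create_secret(Name=name, SecretString=build_ngc_secret(api_key, org))


# Usage (prompts for the key so it never appears in the notebook):
#   from getpass import getpass
#   store_ngc_secret(getpass("NGC API key: "), "<your NGC org>")
```

`missing_keys` can be applied to the `SecretString` of an existing secret to confirm that both keys are present before submitting a job.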


---
## 4. Submit ESM-1nv Training Job

In [None]:
import os
from sagemaker.experiments.run import Run
from sagemaker.pytorch import PyTorch

# Replace this with your ECR repository URI from above
BIONEMO_IMAGE_URI = (
    "<ACCOUNT ID>.dkr.ecr.<REGION>.amazonaws.com/bionemo-training:latest"
)

bionemo_estimator = PyTorch(
    base_job_name="bionemo-training",
    distribution={"torch_distributed": {"enabled": True}},
    entry_point="train.py",
    hyperparameters={
        "config-name": "esm1nv-training",  # This is the name of your config file, without the extension
        "model-name": "esm1nv",  # If you don't provide this as a hyperparameter, it will be inferred from the name field in the config file
        "download-pretrained-weights": True,  # Required to fine-tune from pretrained weights. Set to False for pretraining.
        "ngc-cli-secret-name": "NVIDIA_NGC_CREDS",  # Replace this if you used a different name above.
    },
    image_uri=BIONEMO_IMAGE_URI,
    instance_count=2,  # Update this value for multi-node training
    instance_type="ml.g5.2xlarge",  # Update this value for other instance types
    keep_alive_period_in_seconds=3600,
    output_path=f"{S3_FOLDER}/model",  # S3 URIs always use "/" separators
    role=SAGEMAKER_EXECUTION_ROLE,
    sagemaker_session=sagemaker_session,
    source_dir="src",
)

with Run(
    experiment_name=EXPERIMENT_NAME,
    sagemaker_session=sagemaker_session,
) as run:
    bionemo_estimator.fit(
        inputs={
            "train": f"{DATA_URI}/train",  # S3 URIs always use "/" separators
            "val": f"{DATA_URI}/val",
        },
        wait=False,
    )
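
Because `fit` is called with `wait=False`, the cell returns as soon as the job is submitted. One way to follow up is to poll the job description until it reaches a terminal state. The sketch below accepts any callable that returns a `DescribeTrainingJob`-style dictionary, so it can be tried without AWS access. `wait_for_job` and its parameters are hypothetical names, not part of the SageMaker SDK.

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_job(describe, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll `describe()` until the training job reaches a terminal state."""
    for _ in range(max_polls):
        description = describe()
        status = description["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            if status == "Failed":
                reason = description.get("FailureReason", "unknown")
                raise RuntimeError(f"Training job failed: {reason}")
            return status
        sleep(poll_seconds)
    raise TimeoutError("Training job did not finish within the polling budget")


# Dry run with canned responses instead of a real job:
responses = iter([{"TrainingJobStatus": "InProgress"}, {"TrainingJobStatus": "Completed"}])
print(wait_for_job(lambda: next(responses), sleep=lambda _: None))
```

In the notebook, `describe` could be something like `lambda: sagemaker_session.describe_training_job(bionemo_estimator.latest_training_job.name)`, assuming the estimator above has already been submitted.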