# Deploy Generative Virtual Screening NIM Blueprint on Amazon SageMaker

---
NVIDIA Inference Microservices (NIMs) simplify large-scale inference with Biological Foundation Models (BioFMs), particularly for drug discovery, by packaging the models in robust, streamlined, easy-to-deploy containers. The [BioNeMo NIMs](https://docs.nvidia.com/nim/#bionemo), including [MolMIM](https://docs.nvidia.com/nim/bionemo/molmim/latest/overview.html) for small-molecule generation and [DiffDock](https://docs.nvidia.com/nim/bionemo/diffdock/latest/overview.html) for protein-ligand binding pose prediction, can be deployed on Amazon SageMaker to build generative virtual screening workflows. Computational biologists, bioinformaticians, and data scientists can use these workflows to design, generate, and optimize novel drug candidates. The following example, derived from [this NIM blueprint notebook](https://github.com/NVIDIA/BioNeMo/blob/main/examples/blueprints/generative-virtual-screening/generative-virtual-screening.ipynb), shows how to deploy BioNeMo NIMs using the [NVIDIA NIM public ECR gallery on AWS](https://gallery.ecr.aws/nvidia/nim).

## Prerequisites
---

<div class="alert alert-block alert-info">
<b>NOTE:</b> To run NIM on SageMaker, you need an `NGC API KEY` to access NGC resources. The key is required because the NIM container in the public NIM ECR depends on artifacts in the NGC catalog. See <a href="https://build.nvidia.com/meta/llama3-70b?signin=true">this link</a> to learn how to get an NGC API key.
</div>

##### 1. Set up and retrieve your API key:

1. Sign in to [NGC](https://ngc.nvidia.com/signin) with your NVIDIA account and password.
2. Navigate to Setup.
3. Select “Get API Key”.
4. Generate your API key.
5. Keep your API key secret and in a safe place. Do not share it or store it where others can see or copy it.
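
One way to follow the last step is to avoid typing the key into a notebook cell at all. The sketch below is a minimal illustration, assuming the key has been exported as an environment variable named `NGC_API_KEY` (that variable name, and the helpers `load_ngc_api_key` and `mask_key`, are hypothetical and not part of NIM or SageMaker).

```python
import os
from getpass import getpass


def load_ngc_api_key(env=None, prompt=False):
    """Return the NGC API key from the environment, optionally prompting for it."""
    env = os.environ if env is None else env
    key = env.get("NGC_API_KEY", "").strip()  # assumed variable name
    if not key and prompt:
        # getpass hides the typed key so it is not echoed into the notebook output
        key = getpass("NGC API key: ").strip()
    if not key:
        raise ValueError("NGC_API_KEY is not set")
    return key


def mask_key(key, visible=4):
    """Mask a key for logging, keeping only the last few characters."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]
```

In the variable cells later in this notebook, `NGC_API_KEY = load_ngc_api_key(prompt=True)` could then replace the empty string literal, and `mask_key(NGC_API_KEY)` is safe to print when checking that a key was loaded.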

For more information on NIM, check out the [NIM LLM docs](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html).

##### 2. Your execution role must have the appropriate push permissions
- Copy and paste the following JSON inline policy into your `Amazon SageMaker Execution Role`:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "imagebuilder:GetComponent",
                "imagebuilder:GetContainerRecipe",
                "ecr:GetAuthorizationToken",
                "ecr:BatchGetImage",
                "ecr:InitiateLayerUpload",
                "ecr:UploadLayerPart",
                "ecr:CompleteLayerUpload",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:PutImage"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": "*",
            "Condition": {
                "ForAnyValue:StringEquals": {
                    "kms:EncryptionContextKeys": "aws:imagebuilder:arn",
                    "aws:CalledVia": [
                        "imagebuilder.amazonaws.com"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::ec2imagebuilder*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:CreateLogGroup",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:log-group:/aws/imagebuilder/*"
        }
    ]
}
```
- Or attach the `EC2InstanceProfileForImageBuilderECRContainerBuilds` permissions policy to your `SageMaker Execution Role`.
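
Before attaching a policy, it can help to confirm that it grants every ECR action a push needs. The following is a rough sketch only: `ECR_PUSH_ACTIONS`, `allowed_actions`, and `missing_push_actions` are hypothetical helpers, the check looks at `Allow` statements alone, and it ignores `Deny` statements, conditions, and resources, so it is not a substitute for IAM's own policy evaluation.

```python
import json
from fnmatch import fnmatchcase

# ECR actions from the policy above that are needed to push an image
ECR_PUSH_ACTIONS = [
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:InitiateLayerUpload",
    "ecr:UploadLayerPart",
    "ecr:CompleteLayerUpload",
    "ecr:PutImage",
]


def allowed_actions(policy):
    """Collect the action patterns of every Allow statement in a policy document."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be given as an object
        statements = [statements]
    patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        patterns.extend([actions] if isinstance(actions, str) else actions)
    return patterns


def missing_push_actions(policy, required=ECR_PUSH_ACTIONS):
    """Return the required actions that no Allow statement matches."""
    # Action names are compared case-insensitively; "*" wildcards are honored
    patterns = [p.lower() for p in allowed_actions(policy)]
    return [
        action for action in required
        if not any(fnmatchcase(action.lower(), pattern) for pattern in patterns)
    ]


# A deliberately incomplete policy, for illustration only
policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow",
     "Action": ["ecr:GetAuthorizationToken", "ecr:BatchCheckLayerAvailability",
                "ecr:InitiateLayerUpload", "ecr:UploadLayerPart", "ecr:PutImage"],
     "Resource": "*"}
  ]
}
""")
print(missing_push_actions(policy))  # ['ecr:CompleteLayerUpload']
```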

##### 3. The NIM public ECR image is currently available only in the `us-east-1` region

##### 4. This Jupyter notebook can be run on a t3.medium instance (ml.t3.medium). However, you may need to request a quota increase for the endpoint instance type.

To request a quota increase, follow these steps:

1. Navigate to the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).
2. Choose Amazon SageMaker.
3. Review your default quota for the following resources:
   - `ml.g5.xlarge` for endpoint usage
4. If needed, request a quota increase for these resources.
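
The same check can be scripted once the quota entries have been retrieved. The sketch below assumes a list of dictionaries shaped like the `Quotas` entries returned by the Service Quotas `list_service_quotas` call (with `QuotaName` and `Value` fields); the helper name `endpoint_quota` and the sample values are made up for illustration. Note that this notebook deploys two endpoints on `ml.g5.xlarge`, so a quota of at least 2 is likely needed if both are in service at the same time.

```python
def endpoint_quota(quotas, instance_type="ml.g5.xlarge"):
    """Return the endpoint-usage quota value for an instance type, or None if absent."""
    wanted = f"{instance_type} for endpoint usage"
    for quota in quotas:
        if quota.get("QuotaName") == wanted:
            return quota.get("Value")
    return None


# Made-up entries for illustration; real ones would come from the Service Quotas API
example_quotas = [
    {"QuotaName": "ml.t3.medium for notebook instance usage", "Value": 20.0},
    {"QuotaName": "ml.g5.xlarge for endpoint usage", "Value": 0.0},
]

needed = 2  # one instance each for the MolMIM and DiffDock endpoints
value = endpoint_quota(example_quotas)
if value is None or value < needed:
    print(f"Request a quota increase: have {value}, need {needed}")
```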



<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> To make sure that you have enough quotas to support your usage requirements, it's a best practice to monitor and manage your service quotas. Requests for Amazon EC2 service quota increases are subject to review by AWS engineering teams. Also, service quota increase requests aren't immediately processed when you submit a request. After your request is processed, you receive an email notification.
</div>

---

## Setup

Install the dependencies and set up the roles required to package the model and create the SageMaker endpoint.

In [None]:
import boto3, json, sagemaker
from sagemaker import get_execution_role
from pathlib import Path

sess = boto3.Session()
sm_client = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
bucket = sagemaker_session.default_bucket()
runtime_sm_client = boto3.client("sagemaker-runtime")
region = sess.region_name
sts_client = sess.client('sts')
account_id = sts_client.get_caller_identity()['Account']

## Create SageMaker Endpoint with MolMIM NIM Container

MolMIM explores “drug-like” chemical space to generate novel small molecules that are structurally similar to a query given as a SMILES string and have improved values of the desired properties. Deployed on SageMaker, this transformer-based model can also generate novel small molecules with desired properties or characteristics using [CMA-ES](https://en.wikipedia.org/wiki/CMA-ES)-guided sampling.


### Set Variables

In this example, the MolMIM container is deployed on an `ml.g5.xlarge` instance.

In [None]:
public_nim_image = "public.ecr.aws/nvidia/nim:molmim-1.0.0"
nim_model = "nim-molmim"
sm_model_name = "nim-molmim"
instance_type = "ml.g5.xlarge"
payload_model = "nvidia/nim-molmim"
NGC_API_KEY = ""

In [None]:
# Use store magic to save the global variables for running base nim notebook.
%store public_nim_image nim_model sm_model_name instance_type payload_model NGC_API_KEY

In [None]:
%run ../base_nim_NVIDIA.ipynb

Once the endpoint's status is `InService`, we can send a query small molecule as a [SMILES](https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Line_Entry_System) string to generate more drug-like candidates that are structurally similar to [Nirmatrelvir](https://pubchem.ncbi.nlm.nih.gov/compound/Nirmatrelvir).

First, we need to set up the environment for visualization.

In [None]:
# Please run the following 'pip install' command to install the required libraries
!pip install jupyter ipywidgets loguru py3dmol rdkit pandas

# Importing libraries for 3-D visualization
import py3Dmol
import ipywidgets as widgets
from IPython.display import display

# Importing libraries from RD-Kit for handling and manipulating chemical data
from rdkit import Chem
from rdkit.Chem import AllChem, Draw, QED
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem.Fingerprints import FingerprintMols
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')

Now let us visualize the molecular structure of Nirmatrelvir.

In [None]:
# Nirmatrelvir SMILES string
molecule = "CC1(C2C1C(N(C2)C(=O)C(C(C)(C)C)NC(=O)C(F)(F)F)C(=O)NC(CC3CCNC3=O)C#N)C"

# Visualizing the 2-D structure and QED score of the input compound
input_mol = Chem.MolFromSmiles(molecule)
input_qed = QED.qed(input_mol)
# MolToImage takes the image dimensions via `size`
Draw.MolToImage(input_mol, size=(350, 350), legend=f"Input_Nirmatrelvir \n Score: {input_qed:.3f}")

Next, we run MolMIM inference to generate compounds.

In [None]:
payload = {
    'smi': molecule,
    'num_molecules': 5,
    'algorithm': 'CMA-ES',
    'property_name': 'QED',
    'min_similarity': 0.6, # Ignored if algorithm is not "CMA-ES".
    'iterations': 10,
}


molmim_response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)

molmim_output = json.loads(molmim_response["Body"].read().decode("utf8"))
print(json.dumps(molmim_output, indent=2))

generated_ligands = '\n'.join([v['smiles'] for v in molmim_output['generated']])
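
To shortlist candidates before docking, the response can be ranked by score. This is a minimal sketch, assuming the response has the `generated` list with `smiles` and `score` fields used above and that a higher `score` is better (which should hold when the score is the QED value being optimized). The helper `top_candidates` is hypothetical, and duplicates are detected by exact string match rather than by canonical SMILES.

```python
def top_candidates(molmim_output, k=3):
    """Return up to k unique generated molecules, highest score first."""
    best = {}
    for item in molmim_output.get("generated", []):
        smiles, score = item["smiles"], item["score"]
        # Keep the best score seen for each SMILES string
        if smiles not in best or score > best[smiles]:
            best[smiles] = score
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up response for illustration only; real scores come from the endpoint
example_output = {
    "generated": [
        {"smiles": "CCO", "score": 0.41},
        {"smiles": "CCN", "score": 0.39},
        {"smiles": "CCO", "score": 0.41},
        {"smiles": "c1ccccc1O", "score": 0.52},
    ]
}
for smiles, score in top_candidates(example_output, k=2):
    print(f"{smiles}\t{score:.3f}")
```

Joining the shortlisted SMILES strings with newlines would give a smaller `generated_ligands` value for the docking step.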

Finally, visualize the generated small molecules.

In [None]:
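# Visualizing 2-D structures of MolMIM-generated compounds
mols_from_gen_smis = []
gen_legends = []
seen_smiles = set()
for index, i in enumerate(molmim_output['generated']):
    # Convert SMILES to a molecule; skip strings that RDKit cannot parse
    mol = Chem.MolFromSmiles(i['smiles'])
    if mol is None:
        continue
    # RDKit Mol objects compare by identity, so deduplicate on canonical SMILES instead
    canonical_smiles = Chem.MolToSmiles(mol)
    if canonical_smiles not in seen_smiles:
        seen_smiles.add(canonical_smiles)
        mols_from_gen_smis.append(mol)
        gen_legends.append(f"Mol_{index + 1} \n Score: {i['score']:.3f}")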

# Output the number of unique molecules
print("Total unique molecule designs obtained:", len(mols_from_gen_smis))

# Draw the molecules in a grid
Draw.MolsToGridImage(mols_from_gen_smis, legends=gen_legends, molsPerRow=5, subImgSize=(350, 350))

## Create SageMaker Endpoint with DiffDock NIM Container

To predict the bioactivity of the generated small molecules against a target protein, we can use DiffDock, a state-of-the-art generative model that predicts the 3D structure of a protein-ligand complex, to find the best (most probable) binding poses. A notable feature of DiffDock is that it does not need a presumed binding pocket, which can usually be characterized only from experimental 3D structures (this is known as blind docking). This is especially useful for AI-folded protein structures, because DiffDock can search the whole protein surface for regions where drug molecules may bind, providing insights for downstream investigation.

The optimized DiffDock also supports batch docking: multiple molecules can be concatenated into one docking request, and multiple poses are sampled for each of them (`num_poses=10` in this example). In the output, the predicted docking poses for each molecule are sorted by a confidence score inferred by a confidence model.

In [None]:
public_nim_image = "public.ecr.aws/nvidia/nim:diffdock-1.2.0"
nim_model = "nim-diffdock"
sm_model_name = "nim-diffdock"
instance_type = "ml.g5.xlarge"
payload_model = "nvidia/nim-diffdock"
NGC_API_KEY = ""

# Use store magic to save the global variables for running base nim notebook.
%store public_nim_image nim_model sm_model_name instance_type payload_model NGC_API_KEY

In [None]:
%run ../base_nim_NVIDIA.ipynb

In this example, we use an AlphaFold2-generated protein structure for the SARS-CoV-2 main protease sequence, the same one used in [this example](https://github.com/NVIDIA/BioNeMo/blob/main/examples/blueprints/generative-virtual-screening/generative-virtual-screening.ipynb).

In [None]:
# Read the folded protein structure as a single PDB string
with open('protein.pdb', 'r') as f:
    folded_protein = f.read()
    
payload = {
    'protein': folded_protein,
    'ligand': generated_ligands,
    'ligand_file_type': 'txt',
    'num_poses': 10,
    'time_divisions': 20,
    'num_steps': 18,
}

diffdock_response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)

diffdock_output = json.loads(diffdock_response["Body"].read().decode("utf8"))
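
The request body above can also be assembled by a small helper that guards against empty input. This is a sketch under the assumption that the field names and default values match the payload shown above; `build_diffdock_payload` is a hypothetical helper, and the one-line PDB record is a placeholder rather than a real structure.

```python
import json


def build_diffdock_payload(protein_pdb, smiles_list, num_poses=10,
                           time_divisions=20, num_steps=18):
    """Assemble a batch-docking request body from a PDB string and SMILES strings."""
    ligands = [s.strip() for s in smiles_list if s and s.strip()]
    if not protein_pdb.strip():
        raise ValueError("protein_pdb is empty")
    if not ligands:
        raise ValueError("no ligands to dock")
    return {
        "protein": protein_pdb,
        "ligand": "\n".join(ligands),  # one SMILES string per line
        "ligand_file_type": "txt",
        "num_poses": num_poses,
        "time_divisions": time_divisions,
        "num_steps": num_steps,
    }


# Placeholder PDB record for illustration; use the contents of protein.pdb in practice
payload = build_diffdock_payload("ATOM      1  N   MET A   1\n", ["CCO", " CCN ", ""])
body = json.dumps(payload)
print(payload["ligand"].splitlines())  # ['CCO', 'CCN']
```

Failing early on an empty ligand list avoids sending a request that cannot produce any poses, for example when every generated SMILES string was filtered out upstream.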

In [None]:
# Extract the top-ranked pose and its confidence score for each generated molecule
best_scores, best_scoring_poses = [], []

for i in range(len(diffdock_output['ligand_positions'])):
    best_scoring_poses.append(diffdock_output['ligand_positions'][i][0])
    best_scores.append(diffdock_output['position_confidence'][i][0])
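
Beyond picking the top pose per molecule, the molecules themselves can be ordered by how confident their best pose is. The sketch below assumes the `ligand_positions` and `position_confidence` fields used above, that each per-ligand list is already sorted with the most confident pose first (as described earlier), and that a higher confidence is better. `rank_by_top_confidence` is a hypothetical helper, and the sample data is made up.

```python
def rank_by_top_confidence(diffdock_output):
    """Return (ligand_index, top_confidence, top_pose) tuples, most confident first."""
    poses = diffdock_output["ligand_positions"]
    confidences = diffdock_output["position_confidence"]
    ranked = []
    for index, (ligand_poses, ligand_scores) in enumerate(zip(poses, confidences)):
        if not ligand_poses or not ligand_scores:
            continue  # skip ligands for which no pose was returned
        ranked.append((index, ligand_scores[0], ligand_poses[0]))
    return sorted(ranked, key=lambda entry: entry[1], reverse=True)


# Made-up output for illustration; real poses are structure strings from the endpoint
example_output = {
    "ligand_positions": [["pose_a1", "pose_a2"], [], ["pose_c1"]],
    "position_confidence": [[-0.8, -1.9], [], [-0.2]],
}
for index, confidence, _pose in rank_by_top_confidence(example_output):
    print(f"ligand {index}: confidence {confidence}")
```

Skipping ligands with empty pose lists also avoids the `IndexError` that indexing `[0]` would raise if docking failed for one molecule in the batch.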

In [None]:
# Helper that wraps text in ANSI color codes for console output
def ansi_color(text, color):
    """Color text for console output"""
    colors = {
        "red": "\033[31m",
        "green": "\033[32m",
        "yellow": "\033[33m",
        "blue": "\033[34m",
        "magenta": "\033[35m",
        "cyan": "\033[36m",
        "white": "\033[37m",
        "reset": "\033[0m"
    }
    return f"{colors[color]}{text}{colors['reset']}"

# `best_scoring_poses` holds the top-ranked pose for each compound, as returned by DiffDock,
# and `best_scores` holds the matching confidence scores

# Function to update the viewer based on the selected compound index
def update_viewer(compound_index):
    view = py3Dmol.view(width=1200, height=900)


    # Add the protein model
    view.addModel(folded_protein, 'pdb')
    view.setStyle({'model': 0}, {'cartoon': {'color': 'white', 'opacity': 0.7}})
    view.setViewStyle({'style':'outline','color':'black','width':0.03})
    Prot=view.getModel()
    Prot.setStyle({'cartoon':{'arrows':True, 'tubes':True, 'style':'oval', 'color':'white'}})
    view.setStyle({'cartoon': {'colorscheme': {'prop': 'b', 'gradient': 'roygb', 'min': 40, 'max': 100}}})
    view.addSurface(py3Dmol.VDW,{'opacity':0.4,'color':'white'})
    
    # Add the selected docking pose
    pose = best_scoring_poses[compound_index]
    view.addModel(pose, 'mol')
    view.addSurface(py3Dmol.VDW, {'opacity': 0.5, 'colorscheme': 'magentaCarbon'}, {'model': 1})
    view.setStyle({'model': 1}, {'stick': {'radius': 0.3, 'colorscheme': 'magentaCarbon'}})
    # Color the rounded confidence score: green (> -0.5), blue (>= -1.5), red otherwise
    score = round(best_scores[compound_index], 3)
    score_color = "green" if score > -0.5 else "blue" if score >= -1.5 else "red"
    print(f"Loaded compound {ansi_color(compound_index+1, 'magenta')} with confidence score: {ansi_color(score, score_color)}")

    # Zoom to fit the structure
    view.zoomTo()
    
    # Display the viewer
    return view.show()

# Create a slider widget to select compounds
compound_slider = widgets.IntSlider(
    value=0,
    min=0,
    max=len(best_scoring_poses) - 1,
    step=1,
    description='Compound:',
    continuous_update=False,
    orientation='horizontal',
    readout=False,
    readout_format='d'
)

# Link the slider to the viewer update function
widgets.interact(update_viewer, compound_index=compound_slider)

In this workflow, we illustrate how BioNeMo NIMs work in concert to generate meaningful predictions in a small virtual screening workflow. We hope this shows how easy the tools are to query and integrate, and how flexible a workflow of this sort can be.

## Delete endpoint and clean up artifacts

In [None]:
sm.delete_model(ModelName=sm_model_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_endpoint(EndpointName=endpoint_name)
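
Because the MolMIM and DiffDock deployments reuse the same variable names, the cell above appears to remove only the most recently created resources. That is an inference from the notebook flow, since the base notebook is not shown here. The sketch below shows one way to clean up both deployments. It assumes a client exposing the boto3 SageMaker methods `delete_endpoint`, `delete_endpoint_config`, and `delete_model`; the resource names are hypothetical, and `RecordingClient` is a stand-in used for a dry run so nothing is actually deleted.

```python
def delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete an endpoint, then its endpoint configuration, then its model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)


class RecordingClient:
    """Stand-in for a boto3 SageMaker client that only records the calls made."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any unknown attribute becomes a method that logs its keyword arguments
        def record(**kwargs):
            self.calls.append((name, kwargs))
        return record


# Hypothetical resource names; use the names created by the base notebook runs
deployments = [
    {"endpoint": "nim-molmim-endpoint", "config": "nim-molmim-config", "model": "nim-molmim"},
    {"endpoint": "nim-diffdock-endpoint", "config": "nim-diffdock-config", "model": "nim-diffdock"},
]

dry_run_client = RecordingClient()
for d in deployments:
    delete_endpoint_resources(dry_run_client, d["endpoint"], d["config"], d["model"])

for name, kwargs in dry_run_client.calls:
    print(name, kwargs)
```

To perform the real cleanup, pass the SageMaker client from the Setup cell instead of `dry_run_client`, after saving each deployment's endpoint, endpoint configuration, and model names right after its base notebook run.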

---
## Distributors
- Amazon Web Services
- NVIDIA
