## Training Amazon SageMaker models for binding affinity prediction by using DGL-LifeSci with the PyTorch backend
The **Amazon SageMaker Python SDK** makes it easy to train DGL-LifeSci models. In this example, you train an Atomic Convolutional Network (ACNN) [1] or a PotentialNet [2] model on the PDBBind dataset. For details on both models, see [the DGL-LifeSci example page](https://github.com/yoheigon/dgl-lifesci/tree/master/examples/binding_affinity_prediction).

[1] Gomes et al. (2017) Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity. *arXiv preprint arXiv:1703.10603*.

[2] Feinberg et al. (2018) PotentialNet for Molecular Property Prediction. *ACS Central Science* 4.11: 1520-1530.

### Setup
Define a few variables that are needed later in the example.

In [None]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session

# Setup session
sess = sagemaker.Session()

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here.
bucket = sess.default_bucket()

# IAM execution role that gives Amazon SageMaker access to resources in your AWS account.
# You can use the Amazon SageMaker Python SDK to get the role from the notebook environment.
role = get_execution_role()

### The training script
The main.py script provides all the code you need for training an Amazon SageMaker model. 

In [None]:
!cat ./code/main.py

### SageMaker's estimator class
The Amazon SageMaker Estimator lets you run single-machine training jobs in Amazon SageMaker on CPU or GPU-based instances.

When you create the estimator, pass in the filename of the training script and the name of the IAM execution role. You can also provide a few other parameters. instance_count and instance_type determine the number and type of Amazon SageMaker instances that are used for the training job. The hyperparameters parameter is a dictionary of values that is passed to your training script as command-line arguments, so you can use argparse to parse them. You can see how to access these values in the main.py script above.
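
As a rough illustration of the argparse step, the sketch below parses hyperparameters that arrive as `--key value` arguments. It is not the actual main.py: the function name `parse_hyperparameters` and all default values are hypothetical, and only the argument names are taken from the hyperparameters used in this notebook.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse hyperparameters that arrive as --key value command-line arguments."""
    parser = argparse.ArgumentParser()
    # Argument names mirror the hyperparameter keys used in this notebook.
    # The default values are hypothetical placeholders.
    parser.add_argument("--model", type=str, default="ACNN")
    parser.add_argument("--dataset_option", type=str, default="PDBBind_core_pocket_random")
    parser.add_argument("--version", type=str, default="v2015")
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--num_epochs", type=int, default=100)
    # parse_known_args ignores any extra arguments the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments a training job would receive for two hyperparameters.
args = parse_hyperparameters(["--model", "PotentialNet", "--version", "v2015"])
print(args.model, args.dataset_option, args.version, args.lr, args.num_epochs)
```

Values arrive as strings on the command line, so declaring `type=float` or `type=int` is what turns a tuned value such as lr back into a number.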

For this example, we use an ml.p3.2xlarge training instance, the PDBBind (v2015) core set as the dataset, and PotentialNet as the model.

In [None]:
from sagemaker.pytorch import PyTorch

metric_definitions = [
    {"Name": "mae", "Regex": "mae ([0-9.]+).*$"},
    {"Name": "r2", "Regex": "r2 ([0-9.]+).*$"},
]

# Create estimator
estimator = PyTorch(
    entry_point="main.py",
    source_dir="code",
    role=role,
    framework_version="1.6.0",
    py_version="py3",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    hyperparameters={
        "model": "PotentialNet",
        "dataset_option": "PDBBind_core_pocket_random",
        "version": "v2015",
    },
    metric_definitions=metric_definitions,
)
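
To see what the metric_definitions regular expressions capture, the sketch below applies them to a log line locally. The log line format is an assumption (the real output of main.py may differ), `extract_metrics` is a hypothetical helper, and Python's `re.search` is used only to approximate how the patterns are matched against the training logs.

```python
import re

# Same patterns as the metric_definitions passed to the estimator.
metric_definitions = [
    {"Name": "mae", "Regex": "mae ([0-9.]+).*$"},
    {"Name": "r2", "Regex": "r2 ([0-9.]+).*$"},
]


def extract_metrics(log_line, definitions):
    """Return {metric name: value} for every pattern that matches the line."""
    metrics = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            # The first capture group holds the numeric value.
            metrics[definition["Name"]] = float(match.group(1))
    return metrics


# Hypothetical log line; the actual training script may print a different format.
sample_line = "epoch 10 test mae 1.234 r2 0.561"
print(extract_metrics(sample_line, metric_definitions))
```

Note that the character class `[0-9.]` does not include a minus sign, so a negative r2 value would not be captured by the second pattern as written.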

### Running the Training Job
After you construct the Estimator object, fit it by using Amazon SageMaker. The dataset is automatically downloaded.

In [None]:
# Launch SageMaker training job
estimator.fit()

### (Optional) Hyperparameter tuning

SageMaker offers hyperparameter tuning, which launches multiple training jobs with different hyperparameter combinations to find the set that gives the best model performance. You can specify the hyperparameters you want to tune and their possible values.

In [None]:
import boto3
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.01),
    "num_epochs": IntegerParameter(100, 200),
}

Next, specify the objective metric that you want to tune and its definition. The definition includes the regular expression (regex) needed to extract that metric from the Amazon CloudWatch logs of the training job. Here, the objective is the mae entry of the metric_definitions list created earlier.

In [None]:
objective_metric_name = "mae"

Now, create a HyperparameterTuner object, which you pass:

 * The training estimator you created above
 * The hyperparameter ranges
 * Objective metric name and definition
 * Number of training jobs to run in total and how many training jobs should run simultaneously. More parallel jobs finish tuning sooner, but may sacrifice accuracy because the default Bayesian search has fewer completed results to learn from when it picks each new set of hyperparameters. We recommend you set the parallel jobs value to less than 10% of the total number of training jobs (we'll set it higher just for this example to keep it short).
 * Whether to maximize or minimize the objective metric. The default is 'Maximize', which suits metrics such as validation accuracy. Because the objective here is MAE, where lower is better, set objective_type to 'Minimize'.
 

In [None]:
task_tags = [{"Key": "ML Task", "Value": "DGL-Lifesci"}]
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    objective_type="Minimize",  # MAE is an error metric, so lower is better
    tags=task_tags,
    max_jobs=2,
    max_parallel_jobs=1,
)

And finally, start the tuning job by calling .fit().

In [None]:
tuner.fit(wait=False)

Let's run a quick check of the hyperparameter tuning job's status to make sure it started successfully and is InProgress.

In [None]:
boto3.client("sagemaker").describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)["HyperParameterTuningJobStatus"]
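
If you prefer to wait for the tuning job instead of checking once, the status check can be wrapped in a polling loop. The sketch below is one possible way to do that: `wait_for_tuning_job`, its parameters, and the timeout behavior are hypothetical, while the `HyperParameterTuningJobStatus` and `BestTrainingJob` keys are fields of the describe response.

```python
import time

# Statuses after which the tuning job no longer changes.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_tuning_job(client, job_name, poll_seconds=60, max_polls=120):
    """Poll a tuning job until it reaches a terminal status.

    `client` is expected to behave like boto3.client("sagemaker").
    Returns (status, best_training_job), where best_training_job may be None.
    """
    for _ in range(max_polls):
        response = client.describe_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=job_name
        )
        status = response["HyperParameterTuningJobStatus"]
        if status in TERMINAL_STATES:
            return status, response.get("BestTrainingJob")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Tuning job {job_name} did not finish after {max_polls} polls")


# Example usage in this notebook (requires AWS credentials, so it is left commented out):
# status, best = wait_for_tuning_job(
#     boto3.client("sagemaker"), tuner.latest_tuning_job.job_name
# )
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before pointing it at a real tuning job.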

### Output
You can get the model training output from the Amazon SageMaker console by searching for the training job and looking for the address listed under 'S3 model artifact'.
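
The same S3 address is also available in the notebook as estimator.model_data once training has finished. As a minimal sketch, the helpers below split such a URI into bucket and key and download the artifact with boto3; both function names and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_model_artifact(s3_uri, local_path="model.tar.gz"):
    """Download the model artifact to a local file (requires AWS credentials)."""
    import boto3  # imported here so split_s3_uri works without boto3 installed

    bucket_name, key = split_s3_uri(s3_uri)
    boto3.client("s3").download_file(bucket_name, key, local_path)
    return local_path


# Hypothetical URI; in this notebook you would pass estimator.model_data instead.
print(split_s3_uri("s3://my-bucket/my-training-job/output/model.tar.gz"))
```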