# Training MMDetection Mask-RCNN Model on Sagemaker Distributed Cluster

## Motivation
[MMDetection](https://github.com/open-mmlab/mmdetection) is a popular open-source Deep Learning framework focused on Computer Vision models and use cases. MMDetection provides to higher level APIs for model training and inference. It demonstrates [state-of-the-art benchmarks](https://github.com/open-mmlab/mmdetection#benchmark-and-model-zoo) for variety of model architecture and extensive Model Zoo.

In this notebook, we will build a custom training container with MMdetection library and then train Mask-RCNN model from scratch on [COCO2017 dataset](https://cocodataset.org/#home) using Sagemaker distributed [training feature](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html) in order to reduce training time.

### Preconditions
- To execute this notebook, you will need to have COCO 2017 training and validation datasets uploaded to S3 bucket available for Amazon Sagemaker service.


## Building Training Container

Amazon Sagemaker allows to BYO containers for training, data processing, and inference. In our case, we need to build custom training container which will be pushed to your AWS account [ECR service](https://aws.amazon.com/ecr/). 

For this, we need to login to public ECR with Sagemaker base images and private ECR reposity.

In [16]:
import sagemaker, boto3

session = sagemaker.Session()
region = session.boto_region_name
account = boto3.client('sts').get_caller_identity().get('Account')
bucket = session.default_bucket()

container = "mzanur-mmdetection-training" # your container name
tag = "latest"

In [17]:
bucket

'sagemaker-us-east-1-564829616587'

In [18]:
# login to Sagemaker ECR with Deep Learning Containers
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.{region}.amazonaws.com
# login to your private ECR
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


Now, let review training container:
- use Sagemaker PyTorch container as base image;
- install Pytorch libraries and MMdetection dependencies;
- build MMDetection from sources;
- configure Sagemaker env variables, specifically, what script to use at training time.

In [39]:
! pygmentize -l docker Dockerfile.training

[37m# Build an image of mmdetection that can do distributing training on Amazon Sagemaker [39;49;00m

[37m# using Sagemaker PyTorch container as base image[39;49;00m
[37m# from https://github.com/aws/sagemaker-pytorch-container[39;49;00m
[34mARG[39;49;00m [31mUBUNTU[39;49;00m=[33m"18.04"[39;49;00m
[34mARG[39;49;00m [31mPYTORCH[39;49;00m=[33m"1.7.1"[39;49;00m
[34mARG[39;49;00m [31mCUDA[39;49;00m=[33m"110"[39;49;00m
[34mARG[39;49;00m [31mREGION[39;49;00m=[33m"us-east-1"[39;49;00m
[34mFROM[39;49;00m [33m763104351884.dkr.ecr.${REGION}.amazonaws.com/pytorch-training:${PYTORCH}-gpu-py36-cu${CUDA}-ubuntu${UBUNTU}[39;49;00m

[37m############# BASIC SETUP ##############[39;49;00m
 RUN apt-get update
 RUN apt-get install -y curl git
 RUN apt-get update && apt-get install -y ffmpeg libsm6 libxext6 git ninja-build libglib2.0-0 libsm6 libxrender-dev libxext6 [33m\[39;49;00m
    && apt-get clean [33m\[39;49;00m
    && rm -rf /var/lib/apt/lists/*

[37m########

<br>
<br>
Next, we build and push custom training container to private ECR
<br>
<br>

In [75]:
! ./build_and_push.sh $container $tag Dockerfile.training

Working in region us-east-1
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Sending build context to Docker daemon  397.8kB
Step 1/26 : ARG UBUNTU="16.04"
Step 2/26 : ARG PYTORCH="1.6.0"
Step 3/26 : ARG CUDA="101"
Step 4/26 : ARG REGION="us-east-1"
Step 5/26 : FROM 763104351884.dkr.ecr.${REGION}.amazonaws.com/pytorch-training:${PYTORCH}-gpu-py36-cu${CUDA}-ubuntu${UBUNTU}
1.6.0-gpu-py36-cu101-ubuntu16.04: Pulling from pytorch-training

[1Ba89234b4: Pulling fs layer 
[1B26c6b9c9: Pulling fs layer 
[1Bbf18aa40: Pulling fs layer 
[1Bc688ebe3: Pulling fs layer 
[1Bd5861307: Pulling fs layer 
[1B27b8f0ff: Pulling fs layer 
[1B81630d15: Pulling fs layer 
[1Be18332c4: Pulling fs layer 
[1Bdfb2533b: Pulling fs layer 
[1B60a54609: Pulling fs layer 
[1Bc09e1537: Pulling fs layer 
[1B7b98fd72: Pulling fs layer 
[1B45e223e8: Pulling fs layer 
[1B55fe5c2c: Pulling fs layer 
[10B7b8f0ff: Waiting fs layer 
[1B0504e048: Pulling fs layer 
[

### Training script

At training time, Sagemaker executes training script defined in `SAGEMAKER_PROGRAM` variable. In our case, this script does following
- parses user parameters passed via Sagemaker Hyperparameter dictionary;
- based on parameters constructs launch command;
- uses `torch.distributed.launch` utility to launch distributed training;
- uses MMDetection `tools/train.py` to configure trianing process.


In [76]:
! pygmentize container_training/mmdetection_train.py

[34mfrom[39;49;00m [04m[36margparse[39;49;00m [34mimport[39;49;00m ArgumentParser
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mfrom[39;49;00m [04m[36mmmcv[39;49;00m [34mimport[39;49;00m Config
[34mimport[39;49;00m [04m[36mjson[39;49;00m
[34mimport[39;49;00m [04m[36msubprocess[39;49;00m
[34mimport[39;49;00m [04m[36msys[39;49;00m
[34mimport[39;49;00m [04m[36mshutil[39;49;00m


[34mdef[39;49;00m [32mget_training_world[39;49;00m():

    [33m"""[39;49;00m
[33m    Calculates number of devices in Sagemaker distributed cluster[39;49;00m
[33m    """[39;49;00m

    [37m# Get params of Sagemaker distributed cluster from predefined env variables[39;49;00m
    num_gpus = [36mint[39;49;00m(os.environ[[33m"[39;49;00m[33mSM_NUM_GPUS[39;49;00m[33m"[39;49;00m])
    num_cpus = [36mint[39;49;00m(os.environ[[33m"[39;49;00m[33mSM_NUM_CPUS[39;49;00m[33m"[39;49;00m])
    hosts = json.loads(os.environ[[33m"[39;49;00m[33mSM_HOSTS[39;49;00m

## Define training configuration

In [77]:
# Define IAM role
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()

In [78]:
from time import gmtime, strftime

prefix_input = 'mmdetection-input'
prefix_output = 'mmdetection-ouput'
image = '{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account, region, container, tag)

In [79]:
# algorithm parameters

hyperparameters = {
    "config-file" : "configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py", # config path is relative to MMDetection root directory
    "dataset" : "coco",
    "auto-scale" : "false", # whether to scale LR and Warm Up time
    "validate" : "true", # whether to run validation after training is done
    
    # 'options' allows to override individual config values
    "options" : "total_epochs=12; optimizer.lr=0.08; evaluation.gpu_collect=True",
}

In [80]:
# Sagemaker will parse metrics from STDOUT and store/visualize them as part of training job
metrics = [
    {
        "Name": "loss",
        "Regex": ".*loss:\s([0-9\\.]+)\s*"
    },
    {
        "Name": "loss_rpn_cls",
        "Regex": ".*loss_rpn_cls:\s([0-9\\.]+)\s*"
    },
    {
        "Name": "loss_rpn_bbox",
        "Regex": ".*loss_rpn_bbox:\s([0-9\\.]+)\s*"
    },
    {
        "Name": "loss_cls",
        "Regex": ".*loss_cls:\s([0-9\\.]+)\s*"
    },
    {
        "Name": "acc",
        "Regex": ".*acc:\s([0-9\\.]+)\s*"
    },
    {
        "Name": "loss_bbox",
        "Regex": ".*loss_bbox:\s([0-9\\.]+)\s*"
    },
    {
        "Name": "loss_mask",
        "Regex": ".*loss_mask:\s([0-9\\.]+)\s*"
    },
    {
        "Name": "lr",
        "Regex": "lr: (-?\d+.?\d*(?:[Ee]-\d+)?)"
    }
]

## Test training script and container locally


Amazon SageMaker support [local mode](https://sagemaker.readthedocs.io/en/stable/overview.html?highlight=local%20mode#local-mode) which allows you to deploy and run training job locally first, before deploying your training container to remote SageMaker Training cluster.

To use local mode, we first need to install some dependencies. Please note, you may or may not need to restart your kernel for this changes to be applied.

In [81]:
# Install all dependecies for local run. 
# Note you may need to restart your Sagemaker Notebook kernel to have changes applied.
! pip install 'sagemaker[local]' --upgrade



In [82]:
from sagemaker.local import LocalSession

# Configure our local training session
sagemaker_local_session = LocalSession()
sagemaker_local_session.config = {'local': {'local_code': True}}

In [None]:
# Copy training data locally
! mkdir ../coco2017
! aws s3 cp s3://mzanur-coco/complete/ ../coco2017 --recursive

Now, we are ready to run our training container locally. For this, we need to pass special type of instance `local_gpu`. In this case, SageMaker will run training container with access to CUDA devices. Note, if you don't need access to GPUs, you may choose `local` instance type.

Note, depending on configuration of your local host and available memory, you may run into memory issues when loading dataset. In this case, try reducing your batch size to bring down memory consumption. 

In [83]:
est = sagemaker.estimator.Estimator(image,
                                    role=role,
                                    instance_count=1,
                                    instance_type='local_gpu',
                                    output_path="s3://{}/{}".format(bucket, prefix_output),
                                    metric_definitions = metrics,
                                    hyperparameters = hyperparameters, 
                                    sagemaker_session=sagemaker_local_session
)

est.fit({"training" : "file:///home/ec2-user/SageMaker/coco2017"})
# may need to increase shm for local mode - https://github.com/aws/sagemaker-python-sdk/issues/937
# sudo service docker restart on AL1

Creating network "sagemaker-local" with the default driver
Creating 50h3v521d6-algo-1-nqse7 ... 
Creating 50h3v521d6-algo-1-nqse7 ... done
Attaching to 50h3v521d6-algo-1-nqse7
[36m50h3v521d6-algo-1-nqse7 |[0m 2021-04-09 00:17:45,395 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
[36m50h3v521d6-algo-1-nqse7 |[0m 2021-04-09 00:17:45,426 sagemaker-training-toolkit INFO     Failed to parse hyperparameter config-file value configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py to Json.
[36m50h3v521d6-algo-1-nqse7 |[0m Returning the value itself
[36m50h3v521d6-algo-1-nqse7 |[0m 2021-04-09 00:17:45,427 sagemaker-training-toolkit INFO     Failed to parse hyperparameter dataset value coco to Json.
[36m50h3v521d6-algo-1-nqse7 |[0m Returning the value itself
[36m50h3v521d6-algo-1-nqse7 |[0m 2021-04-09 00:17:45,427 sagemaker-training-toolkit INFO     Failed to parse hyperparameter options value total_epochs=12; optimizer.lr=0.08; evaluation.gpu_coll

Failed to delete: /tmp/tmp8dewqg0l/algo-1-nqse7 Please remove it manually.


KeyboardInterrupt: 

## Start Sagemaker Training 

Now that we tested our training scrip and container locally, we are ready to run training job on disrtibuted SageMaker training cluster. Execute cell below to start training on Sagemaker. Note, that you have available parameters such as `instance_count` and `instance_type` to manage your training cluster configuration.

In [73]:
instance_type = 'ml.p3.16xlarge'
instance_count = 2

est = sagemaker.estimator.Estimator(image,
                                          role=role,
                                          instance_count=instance_count,
                                          instance_type=instance_type,
                                          train_volume_size=100,
                                          output_path="s3://{}/{}".format(bucket, prefix_output),
                                          metric_definitions = metrics,
                                          hyperparameters = hyperparameters, 
                                          sagemaker_session=session
)

est.fit({"training" : "s3://mzanur-data/coco/"})

train_volume_size has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: No S3 objects found under S3 URL "s3://coco2017-34sb3-east1/coco/" given in input data source. Please ensure that the bucket exists in the selected region (us-east-1), that objects exist under that S3 prefix, and that the role "arn:aws:iam::564829616587:role/mzanur-sagemaker" has "s3:ListBucket" permissions on bucket "coco2017-34sb3-east1". Error message from S3: All access to this object has been disabled