# Training MMDetection Mask-RCNN Model on Sagemaker Distributed Cluster

## Motivation
[MMDetection](https://github.com/open-mmlab/mmdetection) is a popular open-source deep learning framework focused on computer vision models and use cases. It provides higher-level APIs for model training and inference, publishes [state-of-the-art benchmarks](https://github.com/open-mmlab/mmdetection#benchmark-and-model-zoo) for a variety of model architectures, and ships an extensive Model Zoo.

In this notebook, we build a custom training container with the MMDetection library and then train a Mask-RCNN model from scratch on the [COCO2017 dataset](https://cocodataset.org/#home) using the Sagemaker distributed [training feature](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html) to reduce training time.

### Preconditions
- To execute this notebook, you need the COCO 2017 training and validation datasets uploaded to an S3 bucket that the Amazon Sagemaker service can access.


## Building Training Container

Amazon Sagemaker lets you bring your own (BYO) containers for training, data processing, and inference. In our case, we need to build a custom training container and push it to the [ECR service](https://aws.amazon.com/ecr/) in your AWS account.

To do this, we need to log in to both the public ECR registry that hosts the Sagemaker base images and your private ECR repository.

In [1]:
# login to Sagemaker ECR with Deep Learning Containers
!aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-2.amazonaws.com
# login to your private ECR
!aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin <REPLACE_WITH_YOUR_ACCOUNT>.dkr.ecr.us-east-2.amazonaws.com

Login Succeeded
Login Succeeded


Now, let's review the training container:
- use the Sagemaker PyTorch 1.5.0 container as the base image;
- install the latest version of the PyTorch libraries and MMDetection dependencies;
- build MMDetection from source;
- configure Sagemaker environment variables, specifically which script to run at training time.

In [30]:
! pygmentize -l docker Dockerfile.training

# Use Sagemaker PyTorch container as base image
# https://github.com/aws/sagemaker-pytorch-container/blob/master/docker/1.5.0/py3/Dockerfile.gpu
FROM 763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04
LABEL author="vadimd@amazon.com"


############# Installing MMDetection from source ############

WORKDIR /opt/ml/code
RUN pip install --upgrade --force-reinstall  torch torchvision cython
RUN pip install mmcv-full==latest+torch1.5.0+cu101 -f https://openmmlab.oss-accelerate.aliyuncs.com/mmcv/dist/index.html

RUN git clone https://github.com/open-mmlab/mmdetection
RUN cd mmdetection/ && \
    pip install -e .

# to address https://github.com/pytorch/pytorch/issues/37377
ENV

<br>
<br>
Next, we build the custom training container and push it to the private ECR repository.
<br>
<br>

In [None]:
! ./build_and_push.sh mmdetection-training latest Dockerfile.training

### Training script

At training time, Sagemaker executes the training script defined in the `SAGEMAKER_PROGRAM` variable. In our case, this script does the following:
- parses user parameters passed via the Sagemaker hyperparameter dictionary;
- constructs the launch command based on those parameters;
- uses the `torch.distributed.launch` utility to launch distributed training;
- uses MMDetection's `tools/train.py` to configure the training process.
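
The steps listed above can be sketched as a small command builder. This is an illustration only, not the code of `mmdetection_train.py`: the function name `build_launch_command`, its parameters, and the default port are hypothetical. The `torch.distributed.launch` flags and the `--launcher` and `--no-validate` arguments of `tools/train.py` are the ones those tools accepted at the time of writing, so check them against the versions installed in your container.

```python
from typing import List


def build_launch_command(
    num_nodes: int,
    node_rank: int,
    gpus_per_node: int,
    master_addr: str,
    config_file: str,
    master_port: int = 55555,  # hypothetical port, not taken from the original script
    validate: bool = True,
) -> List[str]:
    """Assemble a torch.distributed.launch command line for tools/train.py."""
    command = [
        "python", "-m", "torch.distributed.launch",
        f"--nnodes={num_nodes}",              # number of machines in the cluster
        f"--node_rank={node_rank}",           # index of this machine, 0 for the master
        f"--nproc_per_node={gpus_per_node}",  # one training process per GPU
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        "tools/train.py",
        config_file,
        "--launcher", "pytorch",
    ]
    if not validate:
        command.append("--no-validate")
    return command


# Example: node 0 of a 4-node cluster with 8 GPUs per node (ml.p3.16xlarge)
example = build_launch_command(
    4, 0, 8, "algo-1", "configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py"
)
print(" ".join(example))
```

Returning the command as a list rather than a single string lets it be passed directly to `subprocess.run` without shell quoting concerns.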


In [34]:
! pygmentize container_training/mmdetection_train.py

from argparse import ArgumentParser
import os
from mmcv import Config
import json
import subprocess
import sys
import shutil


def get_training_world():

    """
    Calculates number of devices in Sagemaker distributed cluster
    """

    # Get params of Sagemaker distributed cluster from predefined env variables
    num_gpus = int(os.environ["SM_NUM_GPUS"])
    num_cpus = int(os.environ["SM_NUM_CPUS"])
    hosts = json.loads(os.environ["SM_HOSTS
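
The script listing above is cut off, so the rest of `get_training_world` is not shown here. As a rough, independent sketch (not a reconstruction of the author's function), the cluster layout could be derived from the Sagemaker environment variables `SM_HOSTS`, `SM_CURRENT_HOST`, and `SM_NUM_GPUS` as follows. The `TrainingWorld` type, its field names, and the choice of the first sorted host as master are assumptions made for illustration.

```python
import json
from typing import Mapping, NamedTuple


class TrainingWorld(NamedTuple):
    """Hypothetical summary of the cluster layout (names are illustrative)."""
    num_nodes: int
    node_rank: int
    gpus_per_node: int
    world_size: int
    master_addr: str


def describe_training_world(environ: Mapping[str, str]) -> TrainingWorld:
    """Derive the distributed layout from Sagemaker-provided env variables."""
    hosts = sorted(json.loads(environ["SM_HOSTS"]))  # e.g. '["algo-1", "algo-2"]'
    current_host = environ["SM_CURRENT_HOST"]
    gpus_per_node = int(environ["SM_NUM_GPUS"])
    return TrainingWorld(
        num_nodes=len(hosts),
        node_rank=hosts.index(current_host),
        gpus_per_node=gpus_per_node,
        world_size=gpus_per_node * len(hosts),  # one process per GPU
        master_addr=hosts[0],  # assumption: first host acts as master
    )


# Example with a fake environment for a 2-node, 8-GPU-per-node cluster
fake_env = {
    "SM_HOSTS": '["algo-1", "algo-2"]',
    "SM_CURRENT_HOST": "algo-2",
    "SM_NUM_GPUS": "8",
}
print(describe_training_world(fake_env))
```

Passing the environment in as a mapping (rather than reading `os.environ` directly) makes the function easy to try outside a training job; inside the container you would call it with `os.environ`.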

## Start Sagemaker Training 

In [4]:
# Define IAM role
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()

In [5]:

import sagemaker
from time import gmtime, strftime

sess = sagemaker.Session()
bucket = sess.default_bucket()
region = "us-east-2"
account = sess.boto_session.client('sts').get_caller_identity()['Account']
prefix_input = 'mmdetection-input'
prefix_output = 'mmdetection-output'

In [6]:
container = "mmdetection-training" # your container name
tag = "latest"
image = '{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account, region, container, tag)

In [12]:
# algorithm parameters

hyperparameters = {
    "config-file" : "configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py", # config path is relative to MMDetection root directory
    "dataset" : "coco",
    "auto-scale" : "false", # whether to scale LR and Warm Up time
    "validate" : "true", # whether to run validation after training is done
    
    # 'options' overrides individual config values
    "options" : "total_epochs=1; optimizer.lr=0.08; evaluation.gpu_collect=True",
}
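
To make the `options` format concrete, here is a minimal sketch of how such a semicolon-separated string could be turned into a dictionary of overrides. The helper `parse_options` is hypothetical; how the actual training script parses this value is not shown in this notebook.

```python
import ast
from typing import Any, Dict


def parse_options(options: str) -> Dict[str, Any]:
    """Split 'key=value; key=value' into a dict, converting Python literals."""
    parsed: Dict[str, Any] = {}
    for item in options.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, raw_value = item.partition("=")
        raw_value = raw_value.strip()
        try:
            # Converts numbers, booleans, lists, etc. written as Python literals
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            value = raw_value  # keep anything else as a plain string
        parsed[key.strip()] = value
    return parsed


overrides = parse_options(
    "total_epochs=1; optimizer.lr=0.08; evaluation.gpu_collect=True"
)
print(overrides)
```

A dictionary with dotted keys like this is the shape that `mmcv.Config.merge_from_dict` accepts for overriding nested config values, which is one plausible way to apply the overrides.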

In [13]:
# Sagemaker will parse metrics from STDOUT and store/visualize them as part of training job.
# Raw strings keep the regex escapes intact; the value is taken from the capture group.
metrics = [
    {
        "Name": "loss",
        "Regex": r".*loss:\s([0-9.]+)\s*"
    },
    {
        "Name": "loss_rpn_cls",
        "Regex": r".*loss_rpn_cls:\s([0-9.]+)\s*"
    },
    {
        "Name": "loss_rpn_bbox",
        "Regex": r".*loss_rpn_bbox:\s([0-9.]+)\s*"
    },
    {
        "Name": "loss_cls",
        "Regex": r".*loss_cls:\s([0-9.]+)\s*"
    },
    {
        "Name": "acc",
        "Regex": r".*acc:\s([0-9.]+)\s*"
    },
    {
        "Name": "loss_bbox",
        "Regex": r".*loss_bbox:\s([0-9.]+)\s*"
    },
    {
        "Name": "loss_mask",
        "Regex": r".*loss_mask:\s([0-9.]+)\s*"
    },
    {
        "Name": "lr",
        # the decimal point is escaped so it does not match arbitrary characters
        "Regex": r"lr: (-?\d+\.?\d*(?:[Ee]-\d+)?)"
    }
]
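
Before launching a long job, it can help to check metric patterns locally against a log line. The sketch below uses patterns of the same shape as the metric definitions, written as raw strings. The sample line is made up to resemble MMDetection's training log format and is not taken from a real run.

```python
import re

# Hypothetical log line in the style of MMDetection training output
sample_line = (
    "Epoch [1][50/7330]\tlr: 1.978e-03, loss_rpn_cls: 0.6431, "
    "loss_rpn_bbox: 0.1023, loss_cls: 1.2054, acc: 88.4521, "
    "loss_bbox: 0.0312, loss_mask: 0.7210, loss: 2.7030"
)

patterns = {
    "loss": r".*loss:\s([0-9.]+)\s*",
    "loss_mask": r".*loss_mask:\s([0-9.]+)\s*",
    "acc": r".*acc:\s([0-9.]+)\s*",
    "lr": r"lr: (-?\d+\.?\d*(?:[Ee]-\d+)?)",
}

# Apply each pattern and keep the first capture group, if the pattern matches
extracted = {}
for name, pattern in patterns.items():
    match = re.search(pattern, sample_line)
    extracted[name] = match.group(1) if match else None

print(extracted)
```

A pattern that yields `None` here would most likely produce no data points for that metric in the training job either.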

<br>
<br>

Execute the cell below to start training on Sagemaker.
<br>
<br>

In [None]:
est = sagemaker.estimator.Estimator(
    image,
    role=role,
    train_instance_count=4,
    train_instance_type='ml.p3.16xlarge',
    train_volume_size=100,
    output_path="s3://{}/{}".format(sess.default_bucket(), prefix_output),
    metric_definitions=metrics,
    hyperparameters=hyperparameters,
    sagemaker_session=sess,
)

est.fit({"training" : <ADD_S3_BUCKET_WITH_COCO2017>})

Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


2020-08-03 16:29:15 Starting - Starting the training job...
2020-08-03 16:29:17 Starting - Launching requested ML instances.........
2020-08-03 16:30:52 Starting - Preparing the instances for training.........
2020-08-03 16:32:28 Downloading - Downloading input data............................................................................................................
2020-08-03 16:50:33 Training - Downloading the training image.................
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-08-03 16:53:36,548 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-08-03 16:53:36,550 sagemaker-containers INFO     Failed to parse hyperparameter config-file value configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py to Json.
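
The deprecation warnings in the output above indicate that this notebook was run with SageMaker Python SDK v1. If you run it with SDK v2, several `Estimator` arguments have new names. The mapping below covers only the arguments used in this notebook; the helper `to_sdk_v2_kwargs` is a hypothetical convenience for illustration, not part of the SDK.

```python
from typing import Any, Dict

# SDK v1 argument name -> SDK v2 argument name (only those used in this notebook)
V1_TO_V2_ARGS = {
    "image_name": "image_uri",
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "train_volume_size": "volume_size",
}


def to_sdk_v2_kwargs(v1_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Rename v1-style Estimator keyword arguments to their v2 equivalents."""
    return {V1_TO_V2_ARGS.get(name, name): value for name, value in v1_kwargs.items()}


v1_kwargs = {
    "train_instance_count": 4,
    "train_instance_type": "ml.p3.16xlarge",
    "train_volume_size": 100,
}
print(to_sdk_v2_kwargs(v1_kwargs))
```

As the second warning notes, the `s3_input` class is likewise renamed to `TrainingInput` in SDK v2.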


### Known issues:
- The training job fails if validation is performed on a multi-node training cluster. It appears that the default validation hook cannot handle validation results across multiple nodes; refer to [this issue](https://github.com/open-mmlab/mmdetection/issues/3424) for details. The current workaround is to train the model without validation on the multi-node cluster and then run validation as a separate task on a single node.
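
One way to apply this workaround is to derive the `validate` hyperparameter from the instance count, so multi-node jobs never request validation. This is a minimal sketch; the helper name `with_safe_validation` is hypothetical, and it assumes the training script honors the `validate` hyperparameter as described earlier in this notebook.

```python
from typing import Dict


def with_safe_validation(hyperparameters: Dict[str, str], instance_count: int) -> Dict[str, str]:
    """Return a copy of the hyperparameters with validation disabled on multi-node jobs."""
    adjusted = dict(hyperparameters)  # do not mutate the caller's dictionary
    if instance_count > 1:
        adjusted["validate"] = "false"
    return adjusted


base = {"dataset": "coco", "validate": "true"}
print(with_safe_validation(base, instance_count=4))  # multi-node: validation off
print(with_safe_validation(base, instance_count=1))  # single node: unchanged
```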