# Reduce Training Time by over 90% and Costs by over 75% while maintaining accuracy on SageMaker with MXNet.
Amazon SageMaker is a fully managed machine learning service. With Amazon SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment. It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don't have to manage servers. Amazon SageMaker supports the leading deep learning frameworks. In this blog post we demonstrate SageMaker's built-in Horovod support for MXNet, which allows customers to reduce training time by over 90% and costs by over 70% with little effort and, in some cases, no change in accuracy.

## Summary
In the following blog we will demonstrate SageMaker's capability to reduce training time and costs with Horovod. We will begin by introducing Horovod and demonstrate how to use it. Finally, we will end with an advanced section to help users reduce costs and optimize even more.

## Distributed Training Intro
Deep Neural Networks (“DNN”) are very successful in solving many machine learning tasks due to their ability to combine and tune thousands of parameters (layer weights). State-of-the-art models may have millions of trainable parameters and use thousands of input examples during training. Thus, training DNN models can be computationally expensive and lengthy. For certain models, such as object detection or NLP models, training can take days or weeks on a single GPU.

Distributed training lets you leverage multiple compute resources at training time, which speeds up the training process and uses those resources more efficiently. The goal is to split the training task into independent subtasks and execute them across multiple devices and compute nodes. There are two approaches to parallelizing a training task:

- `data parallelism`: distribute chunks of training data across devices/nodes, train independently on each chunk, and then update the shared model;
- `model parallelism`: each device learns a separate part of the model.

For the purposes of this blog post, we review only the `data parallelism` approach.
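
To make the data-parallel idea concrete, here is a minimal sketch in plain Python with no deep learning framework. It splits a batch across hypothetical workers, computes a gradient on each shard, and averages the results. The one-parameter linear model, the squared-error loss, and the function names are all assumptions made for illustration; they are not part of the training scripts used in this post.

```python
def gradient(w, samples):
    """Mean gradient of the squared error 0.5 * (w * x - y)**2 with respect to w."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)


def data_parallel_gradient(w, samples, num_workers):
    """Shard the batch evenly, compute one gradient per worker, then average."""
    shards = [samples[i::num_workers] for i in range(num_workers)]
    local_grads = [gradient(w, shard) for shard in shards]
    return sum(local_grads) / num_workers


# Toy batch of (x, y) pairs; the values are illustrative only.
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
full = gradient(0.5, batch)
averaged = data_parallel_gradient(0.5, batch, num_workers=2)
print(full, averaged)
```

When every shard has the same number of samples, the averaged per-worker gradient matches the gradient computed on the full batch, which is why each worker can train on its own chunk and still update a shared model.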

## Horovod Overview

[Horovod](http://horovod.ai/) is an open-source framework that implements distributed communication between the individual training nodes of a distributed cluster at training time. Horovod supports the main deep learning frameworks, such as MXNet, PyTorch, and TensorFlow. Horovod requires minimal code changes to make a training script _distributable_ and considerably increases training performance.

Horovod is built on top of the **ring-allreduce** communication protocol. With this approach, each training process (i.e. a process running on a single GPU device) talks to its peers, exchanging gradients and averaging ("reducing") a subset of them. This communication continues until all nodes have the latest updated gradients. The diagram below illustrates how ring-allreduce works.

<center><img src='images/peer_to_peer.png'  style="width: 900px;"><br>
    Ring-allreduce (<a href=https://arxiv.org/pdf/1802.05799.pdf>source</a>)
    <br><br>
</center> 
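
The data movement in ring-allreduce can also be sketched in a few lines of plain Python. The simulation below runs in a single process and only mimics the two phases of the protocol (scatter-reduce, then allgather) on lists of numbers. It is an illustrative assumption about the general algorithm, not Horovod's actual implementation, and the function name `ring_allreduce` is hypothetical.

```python
def ring_allreduce(worker_grads):
    """Simulate ring-allreduce and return the averaged gradient held by each worker."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Split every gradient vector into n contiguous chunks.
    bounds = [(c * length // n, (c + 1) * length // n) for c in range(n)]
    data = [list(g) for g in worker_grads]

    # Phase 1, scatter-reduce: after n - 1 steps, worker i holds the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            chunk = (i - step) % n
            lo, hi = bounds[chunk]
            sends.append((chunk, data[i][lo:hi]))  # snapshot before anyone writes
        for i, (chunk, payload) in enumerate(sends):
            dst = (i + 1) % n  # each worker only talks to its right-hand neighbour
            lo, _ = bounds[chunk]
            for k, value in enumerate(payload):
                data[dst][lo + k] += value

    # Phase 2, allgather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            chunk = (i + 1 - step) % n
            lo, hi = bounds[chunk]
            sends.append((chunk, data[i][lo:hi]))
        for i, (chunk, payload) in enumerate(sends):
            dst = (i + 1) % n
            lo, hi = bounds[chunk]
            data[dst][lo:hi] = payload

    # Turn the sums into averages.
    return [[value / n for value in vector] for vector in data]


grads = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]
reduced = ring_allreduce(grads)
print(reduced[0])
```

Each worker sends and receives only one chunk per step, so the amount of data a worker transfers stays roughly constant as more workers join the ring.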

For a great discussion of Horovod, check out https://eng.uber.com/horovod/. The post discusses in depth how Horovod works and how it differs from other distributed training methods.

## Test Problem and Dataset

For this blog post, we chose to train two notoriously resource-intensive model architectures: **Mask-RCNN** and **Faster-RCNN**. These architectures were first introduced in 2017 and 2015 respectively, and are currently considered the baseline architectures for two popular computer vision tasks: segmentation (Mask-RCNN) and object detection (Faster-RCNN). Mask-RCNN builds upon Faster-RCNN by adding a mask for segmentation. Apache MXNet provides pre-built Mask-RCNN and Faster-RCNN models as part of the [GluonCV Model Zoo](https://gluon-cv.mxnet.io/model_zoo/index.html).

To train our object detection/instance segmentation models, we use the popular [COCO2017 dataset](https://cocodataset.org/). This dataset provides more than 200,000 images with bounding boxes and segmentation masks for 80 different object categories. COCO2017 is considered an industry-standard dataset for benchmarking CV models.

MXNet GluonCV is a wonderful resource, with rich content and a model zoo which we will be leveraging. It also has an excellent tutorial on how to get the COCO2017 dataset (https://gluon-cv.mxnet.io/build/examples_datasets/mscoco.html). 

To make this process as reproducible as possible for SageMaker users, we show the entire process. To begin, open a SageMaker notebook and select the conda_mxnet_p36 kernel.

In [None]:
!pip install gluoncv==0.8.0b20200723 -q
!pip install pycocotools -q

In [1]:
import mxnet as mx
import gluoncv as gcv  # needed for gcv.utils.download below
import os
import sagemaker
import subprocess
from sagemaker.mxnet.estimator import MXNet

sagemaker_session = sagemaker.Session() # can use LocalSession() to run container locally
bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()

In [None]:
#We will use GluonCV's tool to download our data
gcv.utils.download('https://gluon-cv.mxnet.io/_downloads/b6ade342998e03f5eaa0f129ad5eee80/mscoco.py',path='./')

In [None]:
#Now to install the dataset. Warning, this may take a while
!python mscoco.py --download-dir data

In [2]:
bucket_name = 'corey-demo' #INSERT BUCKET NAME

In [None]:
#Upload the dataset to your s3 bucket, under the data/ prefix that the training jobs read from
!aws s3 cp './data/' s3://{bucket_name}/data/ --recursive --quiet

So why should you want to use Horovod?
We will let the data speak for itself, but imagine being able to finish training faster, on cheaper instances, with no loss of accuracy and only minimal changes to the code base. For this post, we took the training scripts from GluonCV, modified a few lines so that they work with SageMaker, and began training. The alternative is to train without Horovod, for example on a single GPU; the first job below runs with Horovod disabled and serves as our baseline.

In [7]:
# Define the basic configuration of your SageMaker Parameter Server/Horovod cluster.
num_instances = 1  # How many nodes you want to use
instance_family = 'ml.p3dn.24xlarge'  # Which instance type you want to use
gpu_per_instance = 8  # How many GPUs are on this instance
bs = 1  # Batch size per GPU
# Parameter Server variation (Horovod disabled)
hyperparameters = {
    'epochs':12, 'batch-size': bs, 'horovod':'false','lr':.01,'amp':'true',
    'val-interval':6,'num-workers':16}

for instance_family in ['ml.p3dn.24xlarge']:
    for s in ['train_mask_rcnn.py']:
        estimator = MXNet(
            entry_point=s,
            source_dir='./source',
            role=role,
            train_max_run=72*60*60,
            train_instance_type=instance_family,
            train_instance_count=num_instances,
            framework_version='1.6.0',
            train_volume_size=100,
            base_job_name =s.split('_')[1] + 'rcnn-' + str(num_instances)+ '-' + '-'.join(instance_family.split('.')[1:]),
            py_version='py3',
            hyperparameters=hyperparameters
        )

        estimator.fit(
            {'data':'s3://' + bucket_name + '/data'},
            wait=False
        )

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


Horovod requires a little additional work. The code changes follow the Horovod MXNet documentation (https://horovod.readthedocs.io/en/stable/mxnet.html); in addition, you may have to modify your data loader to shard the dataset. These changes are minimal but important. Feel free to view the code at (aws-labs/<INSERT LINK HERE>) to see how it was done for this code base. One of the biggest benefits of using GluonCV is that its pre-written scripts typically already support Horovod, so in our case no additional changes were required!

First, we need to enable MPI with the `enabled` flag. Then we can set `processes_per_host` and, optionally, `custom_mpi_options`:

- `processes_per_host (int)`: Number of processes MPI should launch on each host. Set this flag for multi-GPU training.
- `custom_mpi_options (str)`: Any mpirun flags passed in this field are added to the mpirun command and executed by Amazon SageMaker for Horovod training.

As seen below.

In [12]:
# Define the basic configuration of your SageMaker Parameter Server/Horovod cluster.
num_instances = 3  # How many nodes you want to use (overridden by the loop below)
instance_family = 'ml.p3.16xlarge'  # Which instance type you want to use (overridden by the loop below)
gpu_per_instance = 8  # How many GPUs are on this instance
bs = 1  # Batch size per GPU
# Horovod variation: MPI launches one process per GPU
distributions = {'mpi': {
                    'enabled': True,
                    'processes_per_host': gpu_per_instance,
                        }
                }

hyperparameters = {
    'epochs':14, 'batch-size':bs, 'horovod':'true','lr':.01,'amp':'true',
    'val-interval':6,'num-workers':15}

for num_instances in [1]:
    for instance_family in ['ml.p3dn.24xlarge']:
        for s in ['train_mask_rcnn.py']:
            estimator = MXNet(
                entry_point=s,
                source_dir='./source',
                role=role,
                train_max_run=72*60*60,
                train_instance_type=instance_family,
                train_instance_count=num_instances,
                framework_version='1.6.0',
                train_volume_size=100,
                base_job_name =s.split('_')[1] + 'rcnn-noopts-ee-hvd-' + str(num_instances)+ '-' + '-'.join(instance_family.split('.')[1:]),
                py_version='py3',
                hyperparameters=hyperparameters,
                distributions=distributions
            )

            estimator.fit(
                {'data':'s3://' + bucket_name + '/data'},
                wait=False
            )

Parameter distributions will be renamed to distribution in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
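
As a quick sanity check on the MPI settings, the sketch below computes how many Horovod processes a given cluster launches and what the resulting global batch size is. It assumes one MPI process per GPU (as configured through `processes_per_host`) and a per-GPU batch size; the helper name `horovod_cluster_summary` is hypothetical and not part of the SageMaker SDK.

```python
def horovod_cluster_summary(num_instances, gpus_per_instance, per_gpu_batch_size):
    """Return the MPI distribution dict, the world size, and the global batch size."""
    distributions = {
        'mpi': {
            'enabled': True,
            'processes_per_host': gpus_per_instance,  # one MPI process per GPU
        }
    }
    world_size = num_instances * gpus_per_instance  # total Horovod processes
    global_batch_size = world_size * per_gpu_batch_size
    return distributions, world_size, global_batch_size


# Three 8-GPU instances with a batch size of 1 per GPU.
distributions, world_size, global_batch_size = horovod_cluster_summary(3, 8, 1)
print(world_size, global_batch_size)
```

The global batch size grows with the number of processes, which is why the learning rate usually needs to be scaled when you add instances.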


Training is normally one of the most expensive parts of building a model, not only in terms of cost but also in terms of time. Data scientists often want to train, modify, and then retrain models in order to maximize performance. On large datasets with state-of-the-art networks this can be extremely costly, as a single training run may take more than 24 hours.

While we wait for these instances to train, let's talk through what is needed to get Horovod to work with MXNet.

In [None]:
# First we need to initialize Horovod. This has to run before the rest of the code.
# The code below comes directly from https://horovod.readthedocs.io/en/stable/mxnet.html,
# but we will walk through it.
import mxnet as mx
import gluoncv as gcv
import horovod.mxnet as hvd
from mxnet import autograd

# Initialize Horovod. This has to be done first, as it activates Horovod.
hvd.init()

# Pin the GPU used by this process to its local rank
ctx = [mx.gpu(hvd.local_rank())]  # local_rank is the index of the GPU on this instance
num_gpus = hvd.size()  # Total number of Horovod processes (one per GPU) across all instances

# Typically you will want to shard your dataset in your data loader.
# In the train_mask_rcnn example it looks like this:
train_sampler = \
        gcv.nn.sampler.SplitSortedBucketSampler(...,  # ... stands for whatever other arguments you need
                                                num_parts=hvd.size() if args.horovod else 1,
                                                part_index=hvd.rank() if args.horovod else 0)

# Normally we would shard the dataset first for Horovod.
val_loader = mx.gluon.data.DataLoader(dataset, len(ctx), ...)  # ... stands for your other arguments

    
#Next you build and initialize your model just like you normally would
model = ...

# Fetch and broadcast parameters
params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)

# Create DistributedTrainer, a subclass of gluon.Trainer.
# `opt` is the optimizer you would normally pass to gluon.Trainer.
trainer = hvd.DistributedTrainer(params, opt)

# Create loss function like usual

# Train model  just like you normally would
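
The idea behind `num_parts` and `part_index` in the sampler can be illustrated without Horovod or GluonCV. The sketch below uses a simple strided split, which is an assumption made for clarity: `SplitSortedBucketSampler` applies its own bucketing logic, and the helper `shard_indices` is hypothetical.

```python
def shard_indices(num_samples, num_parts, part_index):
    """Return the sample indices handled by one worker, using a strided split."""
    if not 0 <= part_index < num_parts:
        raise ValueError('part_index must be in [0, num_parts)')
    return list(range(part_index, num_samples, num_parts))


# Simulate 4 Horovod processes. In a real training script these values
# would come from hvd.size() and hvd.rank().
world_size = 4
shards = [shard_indices(10, world_size, rank) for rank in range(world_size)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Every sample lands in exactly one shard, so each process sees a different part of the dataset during an epoch and no work is duplicated.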

To finish, let's talk about training time. We ran a few benchmarks.

Results:

**Faster RCNN - with Horovod**

- 1 p3.16xlarge: approximately 6 hours 32 mins, mAP 37
- 3 p3.16xlarge: approximately 3 hours 30 mins, mAP 36.9
- 1 p3dn.24xlarge: approximately 5 hours 19 mins, mAP 37
- 3 p3dn.24xlarge: approximately 2 hours 13 mins, mAP 36.9

**Faster RCNN - without Horovod**

- 1 p3.16xlarge: approximately 24 hours 18 mins, mAP 37
- 1 p3dn.24xlarge: approximately 24 hours 50 mins, mAP 37

**Mask RCNN - with Horovod**

- 1 p3.16xlarge: approximately 9 hours 25 mins, bbox mAP 34.2, segm mAP 30.9
- 3 p3.16xlarge: approximately 4 hours 9 mins, bbox mAP 36.6, segm mAP 33.1
- 1 p3dn.24xlarge: approximately 7 hours 7 mins, bbox mAP 34.2, segm mAP 30.9
- 3 p3dn.24xlarge: approximately 2 hours 54 mins, bbox mAP 36.4, segm mAP 33

**Mask RCNN - without Horovod**

- 1 p3.16xlarge: approximately  hours  mins, bbox mAP 38.3, segm mAP 34.9
- 1 p3dn.24xlarge: approximately 29 hours 29 mins, bbox mAP 37.7, segm mAP 34.1
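
To translate these timings into savings, the sketch below computes the time reduction and an approximate cost reduction for the Faster RCNN runs on p3dn.24xlarge. It assumes that cost is proportional to instance-minutes of the same instance type, and it ignores dataset download time and billing granularity; the helper names are hypothetical.

```python
def minutes(hours, mins):
    """Convert a duration given as hours and minutes into minutes."""
    return hours * 60 + mins


def savings(baseline_min, baseline_instances, run_min, run_instances):
    """Return (time_reduction, cost_reduction) as fractions of the baseline."""
    time_reduction = 1 - run_min / baseline_min
    baseline_cost = baseline_min * baseline_instances  # instance-minutes
    run_cost = run_min * run_instances
    cost_reduction = 1 - run_cost / baseline_cost
    return time_reduction, cost_reduction


# Faster RCNN on p3dn.24xlarge, using the timings listed in the results.
baseline = minutes(24, 50)    # 1 instance, without Horovod
one_node = minutes(5, 19)     # 1 instance, with Horovod
three_nodes = minutes(2, 13)  # 3 instances, with Horovod

for label, run, count in [('1 instance', one_node, 1), ('3 instances', three_nodes, 3)]:
    time_cut, cost_cut = savings(baseline, 1, run, count)
    print(f'{label}: time reduced by {time_cut:.1%}, cost reduced by {cost_cut:.1%}')
```

Under these assumptions, three instances cut training time by roughly 91% and cost by roughly 73%, while a single instance with Horovod cuts both by roughly 79%.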
    


Now you can see how the results vary. Note that we used the parameter scaling approach from the ["Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour"](https://arxiv.org/abs/1706.02677) paper. Also note that these times cover only the time spent training and do not include downloading the dataset, which can take another 20-30 minutes depending on the instance type. For simplicity, we did not modify the learning rates beyond scaling the batch size and learning rate as described in that paper.

With this much time saved, scientists can focus more on improving their algorithms instead of waiting for jobs to finish training. Using multiple instances, scientists can complete training with a 90% time reduction and over 50% cost savings, with very little effect on mAP.
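
The batch-size and learning-rate scaling mentioned earlier follows the linear scaling rule from the large-minibatch SGD paper: when the global batch size grows by a factor of k, the learning rate is multiplied by k, usually combined with a warmup phase. The sketch below is a simplified illustration. The reference batch size, the example values, and the function names are assumptions, not the settings used for the benchmarks in this post.

```python
def scaled_learning_rate(base_lr, base_batch_size, per_gpu_batch_size, num_gpus):
    """Linear scaling rule: scale the learning rate with the global batch size."""
    global_batch_size = per_gpu_batch_size * num_gpus
    return base_lr * global_batch_size / base_batch_size


def warmup_learning_rate(target_lr, step, warmup_steps):
    """Simplified warmup: ramp linearly up to target_lr over warmup_steps."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * (step + 1) / warmup_steps


# Hypothetical example: a learning rate tuned for a global batch of 8,
# reused on 24 GPUs with a batch size of 1 per GPU.
target = scaled_learning_rate(base_lr=0.01, base_batch_size=8,
                              per_gpu_batch_size=1, num_gpus=24)
schedule = [warmup_learning_rate(target, step, warmup_steps=4) for step in range(6)]
print(target, schedule)
```

The warmup keeps the first updates small, which helps when the scaled learning rate is much larger than the one the model was originally tuned with.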

## Optimizing Horovod Training

Horovod provides several utilities that let you analyze and optimize training performance.

### Horovod Autotune
Horovod has multiple configuration settings which may improve your training performance. Finding the optimal combinations of parameters for a given combination of model and cluster size may require several iterations of trial-and-error. 

The **Autotune** feature automates this trial-and-error within a single training job and uses Bayesian optimization to search the parameter space for the most performant combination of parameters. Note that Horovod searches for the best combination during the first cycles of the training job; once the best combination is found, Horovod writes it to the Autotune log and uses it for the rest of the training. See more details [here](https://horovod.readthedocs.io/en/stable/autotune.html).

To enable Autotune and capture the search log, pass the following parameters in your MPI configuration:

```
{
    'mpi':
    {
        'enabled': True,
        'custom_mpi_options': '-x HOROVOD_AUTOTUNE=1 -x HOROVOD_AUTOTUNE_LOG=/opt/ml/output/autotune_log.csv'
    }
}
```


### Horovod Timeline

Horovod Timeline is a report, available after training completes, that captures all activities in the Horovod ring at training time. This is useful for understanding which operations and activities take the most time and for identifying optimization opportunities. Refer to [this article](https://horovod.readthedocs.io/en/stable/timeline.html) for more details.

To generate a Timeline file, add the following parameters to your MPI configuration:

```
{
    'mpi':
    {
        'enabled': True,
        'custom_mpi_options': '-x HOROVOD_TIMELINE=/opt/ml/output/timeline.json'
    }
}
```

Note that `/opt/ml/output` is a directory with a specific purpose. After the training job completes, Amazon SageMaker automatically archives all files in this directory and uploads them to the S3 location defined by the user. That's where your Timeline report will be available for further analysis.

### Tensor Fusion

The Tensor Fusion feature batches **allreduce** operations at training time. This typically results in better overall performance. See more details [here](https://horovod.readthedocs.io/en/stable/tensor-fusion.html). By default, Tensor Fusion is enabled and has a buffer size of 64MB. You can modify the buffer size using a custom MPI flag as follows (in this case we override the default 64MB buffer value with 32MB):

```
{
    'mpi':
    {
        'enabled': True,
        'custom_mpi_options': '-x HOROVOD_FUSION_THRESHOLD=33554432'
    }
}
```

You can also tweak batch cycles using the `HOROVOD_CYCLE_TIME` parameter. Note that cycle time is defined in milliseconds:


```
{
    'mpi':
    {
        'enabled': True,
        'custom_mpi_options': '-x HOROVOD_CYCLE_TIME=5'
    }
}
```
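
Because `custom_mpi_options` is a single string, several of the settings in this section have to be combined into one value if you want to use them together. The sketch below shows one possible way to build that string from a dictionary. The helpers `build_mpi_options` and `megabytes` are hypothetical conveniences, not part of Horovod or the SageMaker SDK.

```python
def build_mpi_options(env):
    """Render a dict of environment variables as mpirun '-x NAME=VALUE' flags."""
    return ' '.join(f'-x {name}={value}' for name, value in env.items())


def megabytes(n):
    """Convert megabytes to bytes (1 MB = 1024 * 1024 bytes)."""
    return n * 1024 * 1024


horovod_env = {
    'HOROVOD_TIMELINE': '/opt/ml/output/timeline.json',
    'HOROVOD_FUSION_THRESHOLD': megabytes(32),  # 32MB fusion buffer
    'HOROVOD_CYCLE_TIME': 5,                    # milliseconds
}

distributions = {
    'mpi': {
        'enabled': True,
        'custom_mpi_options': build_mpi_options(horovod_env),
    }
}
print(distributions['mpi']['custom_mpi_options'])
```

Keeping the settings in a dictionary makes it easier to switch individual options on and off between experiments than editing one long string by hand.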



## Optimizing MXNet Model

Another set of optimization techniques relates to optimizing the MXNet model itself. We recommend that you first run the code with `os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '1'`. Then you can reuse the best-performing environment variables for future training. In our testing, the following settings gave the best results:

```
os.environ['MXNET_GPU_MEM_POOL_TYPE'] = 'Round'
os.environ['MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF'] = '26'
os.environ['MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD'] = '999'
os.environ['MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_BWD'] = '25'
os.environ['MXNET_GPU_COPY_NTHREADS'] = '1'
os.environ['MXNET_OPTIMIZER_AGGREGATION_SIZE'] = '54'
```
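
One way to keep these settings manageable is to store them in a dictionary and export them at the top of the training script. The sketch below assumes that it is safest to set the variables before `import mxnet`; this is a cautious convention rather than a documented requirement for every variable, and the helper `apply_env` is hypothetical.

```python
import os

# The values found to work best in our testing (see above).
MXNET_TUNING = {
    'MXNET_GPU_MEM_POOL_TYPE': 'Round',
    'MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF': '26',
    'MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD': '999',
    'MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_BWD': '25',
    'MXNET_GPU_COPY_NTHREADS': '1',
    'MXNET_OPTIMIZER_AGGREGATION_SIZE': '54',
}


def apply_env(settings, overwrite=False):
    """Export settings to os.environ, keeping existing values unless overwrite=True."""
    applied = {}
    for name, value in settings.items():
        if overwrite or name not in os.environ:
            os.environ[name] = str(value)
        applied[name] = os.environ[name]
    return applied


applied = apply_env(MXNET_TUNING)
# import mxnet as mx  # import MXNet only after the variables are set
```

Leaving `overwrite=False` lets you override a single value from outside the script, for example to test a different setting in one training job.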

## In Conclusion
Reducing training time has many more benefits than just getting a model to production faster. It allows customers to innovate faster, improve model performance, and increase utilization of compute, thereby reducing costs. In this blog post we discussed what Horovod is, how to use it, and how you can optimize it in order to reduce your costs and training time.
