### Setup

We'll begin with some necessary imports, and get an Amazon SageMaker session to help perform certain tasks, as well as an IAM role with the necessary permissions.

In [20]:
import numpy as np
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
# role = get_execution_role()
role = "arn:aws:iam::367158743199:role/service-role/AmazonSageMaker-ExecutionRole-20200811T104146"
bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/DEMO-tf-distribution-options'
print('Bucket:\n{}'.format(bucket))

Bucket:
sagemaker-us-east-1-367158743199


CIFAR-10 is one of the ready-to-use datasets bundled with `tensorflow.keras.datasets`, which is where we download it from.

The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images; the code below uses the test images as the validation set.

In [21]:
sagemaker.__version__

'2.24.4'

In [None]:
from tensorflow.keras import datasets

(train_images, train_labels), (validation_images, validation_labels) = datasets.cifar10.load_data()
print('Data type:\n{}'.format(type(train_images)))
print('Shapes:\n{}\n{}\n{}\n{}'\
      .format(train_images.shape, train_labels.shape, validation_images.shape, validation_labels.shape))

The next step is to normalize pixel values to be between 0 and 1:

In [None]:
train_images, validation_images = train_images / 255.0, validation_images / 255.0
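
As a quick sanity check on the normalization step, the sketch below applies the same division to a small synthetic batch instead of the real dataset. The array `fake_images` is a made-up stand-in with the same layout as CIFAR-10 images (32x32 RGB, `uint8`).

```python
import numpy as np

# Synthetic stand-in for CIFAR-10 images: 4 RGB images of 32x32 uint8 pixels.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)

# Same operation as in the notebook: scale pixel values into [0, 1].
normalized = fake_images / 255.0

print(normalized.dtype)
print(normalized.min() >= 0.0, normalized.max() <= 1.0)
```

Dividing a `uint8` array by a Python float yields `float64`, so the normalized arrays take eight times the memory of the originals. Casting with `.astype(np.float32)` is a common way to halve that if file size matters.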

Now that we have normalized the data, we will save it locally.

In [None]:
import numpy as np
import os

# Keep everything under ./data, the directory that is uploaded to S3 later
data_dir = os.path.join(os.getcwd(), 'data')
os.makedirs(data_dir, exist_ok=True)

train_dir = os.path.join(data_dir, 'train')
os.makedirs(train_dir, exist_ok=True)

validation_dir = os.path.join(data_dir, 'validation')
os.makedirs(validation_dir, exist_ok=True)

np.save(os.path.join(train_dir, 'train_images.npy'), train_images)
np.save(os.path.join(train_dir, 'train_labels.npy'), train_labels)
np.save(os.path.join(validation_dir, 'validation_images.npy'), validation_images)
np.save(os.path.join(validation_dir, 'validation_labels.npy'), validation_labels)
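
To illustrate what a training script can expect when it reads these files back, here is a minimal round-trip sketch. It writes a small made-up label array to a temporary directory rather than touching the notebook's `data` folder.

```python
import os
import tempfile

import numpy as np

# Made-up labels with the same (N, 1) uint8 layout as the CIFAR-10 labels.
labels = np.arange(10, dtype=np.uint8).reshape(-1, 1)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_labels.npy")
    np.save(path, labels)
    restored = np.load(path)

# .npy files preserve both shape and dtype.
round_trip_ok = np.array_equal(labels, restored) and labels.dtype == restored.dtype
print(round_trip_ok)
```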

The dataset is now saved locally.

For Amazon SageMaker hosted training on a cluster separate from the hardware serving this notebook, training data must be stored in Amazon S3, Amazon EFS, or Amazon FSx for Lustre. We'll upload the data to S3 now.

In [None]:
inputs = sagemaker_session.upload_data(path='data', key_prefix='data/tf-2-distribution-options')
display(inputs)
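
When a directory is uploaded, `upload_data` returns the S3 URI of the prefix, and each local subfolder ends up beneath it. The sketch below shows one way to derive per-channel URIs from that return value instead of hard-coding them. `channel_uris` is a hypothetical helper and the bucket name is a placeholder.

```python
def channel_uris(base_uri, channels):
    """Map channel names to S3 URIs under a common prefix.

    `channels` maps a channel name to the subfolder that holds its data.
    """
    base = base_uri.rstrip("/")
    return {name: f"{base}/{subdir}" for name, subdir in channels.items()}


# Placeholder URI standing in for the value returned by upload_data().
example_inputs = "s3://example-bucket/data/tf-2-distribution-options"
remote_inputs_example = channel_uris(
    example_inputs, {"train": "train", "validation": "validation"}
)
print(remote_inputs_example)
```

A dictionary of this shape can be passed to an estimator's `fit` method, provided the channel names match what the training script expects.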

# A Primer on Data Parallelism

If you’re training a model on a single GPU, its full internal state is available locally: model parameters, optimizer parameters, gradients (parameter updates computed by backpropagation), and so on. However, things are different when you distribute a training job to a cluster of GPUs.

Using a technique named “data parallelism,” the training set is split into mini-batches that are evenly distributed across GPUs. Thus, each GPU only trains the model on a fraction of the total data set. This means that the model state will be slightly different on each GPU, as each one processes different batches. To ensure training convergence, the model state needs to be regularly synchronized across all nodes. This can be done synchronously or asynchronously:

* Synchronous training: all GPUs report their gradient updates either to all other GPUs (many-to-many communication), or to a central parameter server that redistributes them (many-to-one, followed by one-to-many). As all updates are applied simultaneously, the model state is in sync on all GPUs, and the next mini-batch can be processed.

* Asynchronous training: gradient updates are sent to all other nodes, or to a central server. However, they are applied immediately, meaning that model state will differ from one GPU to the next.
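
The synchronous case can be simulated on a single machine with NumPy. This is a minimal sketch under simplifying assumptions: a linear model, equal-sized shards, and simulated workers in place of real GPUs. It shows that averaging the per-worker gradients reproduces the gradient of the full mini-batch, which is why every worker can apply the same update and stay in sync.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5])


def mse_gradient(w, X_part, y_part):
    """Gradient of the mean squared error of a linear model on one shard."""
    residual = X_part @ w - y_part
    return 2.0 * X_part.T @ residual / len(y_part)


w = np.zeros(3)
num_workers = 4

# Each simulated worker receives an equal slice of the mini-batch.
X_shards = np.split(X, num_workers)
y_shards = np.split(y, num_workers)
worker_grads = [mse_gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]

# Synchronous step: average all gradients before anyone updates the model.
averaged_grad = np.mean(worker_grads, axis=0)
full_batch_grad = mse_gradient(w, X, y)

grads_match = np.allclose(averaged_grad, full_batch_grad)
print(grads_match)
```

In the asynchronous case each worker would apply its own gradient as soon as it is ready, so the copies of `w` would drift apart between updates.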

Unfortunately, these techniques don’t scale very well. As the number of GPUs increases, a parameter server will inevitably become a bottleneck. Even without a parameter server, network congestion soon becomes a problem, as n GPUs need to exchange `n*(n-1)` messages after each iteration, for a total amount of `n*(n-1)*model size` bytes. For example, ResNet-50 is a popular model used in computer vision applications. With its 26 million parameters, each 32-bit gradient update takes about 100 megabytes. With 8 GPUs, each iteration requires sending and receiving 56 updates, for a total of 5.6 gigabytes. Even with a fast network, this will cause some overhead, and slow down training.
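
The arithmetic above can be reproduced in a few lines. `all_to_all_traffic` is a hypothetical helper, not part of any library. It uses the exact update size of 104 MB rather than the rounded 100 MB, so the total comes out at roughly 5.8 GB instead of 5.6 GB.

```python
def all_to_all_traffic(num_gpus, num_parameters, bytes_per_value=4):
    """Return (bytes per update, messages per iteration, total bytes per iteration)
    when every GPU sends its full gradient update to every other GPU."""
    update_bytes = num_parameters * bytes_per_value
    messages = num_gpus * (num_gpus - 1)
    return update_bytes, messages, messages * update_bytes


# ResNet-50 with roughly 26 million 32-bit parameters on 8 GPUs.
update_bytes, messages, total_bytes = all_to_all_traffic(8, 26_000_000)
print(update_bytes / 1e6, "MB per update")
print(messages, "messages per iteration")
print(total_bytes / 1e9, "GB per iteration")
```

Because the message count grows as `n*(n-1)`, doubling the cluster to 16 GPUs raises it from 56 to 240.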

As datasets and clusters keep growing, this network bottleneck becomes harder to avoid. Enter SageMaker and its new AllReduce algorithm.
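
This notebook does not describe how SageMaker's AllReduce works internally. To give a feel for the general idea, here is a single-process simulation of the generic ring all-reduce pattern, which is a different and well-known technique, not SageMaker's implementation. Each simulated worker only ever talks to its neighbour in the ring, and sends one chunk of the vector per step.

```python
import numpy as np


def ring_allreduce(worker_vectors):
    """Simulate ring all-reduce (sum) over equally shaped 1-D vectors."""
    n = len(worker_vectors)
    # chunks[i][c] is worker i's local copy of chunk c.
    chunks = [
        [part.copy() for part in np.array_split(np.asarray(v, dtype=float), n)]
        for v in worker_vectors
    ]

    # Phase 1, reduce-scatter: after n-1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n) for i in range(n)]
        payloads = [chunks[i][c].copy() for i, c in sends]
        for (i, c), payload in zip(sends, payloads):
            chunks[(i + 1) % n][c] += payload

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n) for i in range(n)]
        payloads = [chunks[i][c].copy() for i, c in sends]
        for (i, c), payload in zip(sends, payloads):
            chunks[(i + 1) % n][c] = payload

    return [np.concatenate(worker) for worker in chunks]


gradients = [np.arange(8) * (rank + 1) for rank in range(4)]
reduced = ring_allreduce(gradients)
expected_sum = np.sum(gradients, axis=0)
print(all(np.allclose(r, expected_sum) for r in reduced))
```

In this pattern each worker sends `2*(n-1)` messages of roughly `1/n` of the vector, so the traffic per worker stays close to twice the model size regardless of how many workers join the ring.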

# Distributed Training with SageMaker Data Parallelism

[Amazon SageMaker](https://aws.amazon.com/sagemaker/) now supports a new data parallelism library that makes it easier to train models on datasets that may be as large as hundreds or thousands of gigabytes.

The library helps ML teams reduce distributed training time and cost. Available for TensorFlow and PyTorch, it implements a more efficient distribution of computation, optimizes network communication, and fully utilizes the fastest [p3](https://aws.amazon.com/ec2/instance-types/p3/) and [p4](https://aws.amazon.com/ec2/instance-types/p4/) GPU instances.

![title](https://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/data-parallel/sdp-pytorch.png)


In [None]:
# !pip install --upgrade sagemaker

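The SageMaker data parallelism API is designed for ease of use. On the estimator side, enabling it only requires passing a `distribution` dictionary, as the cells below show.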

In [1]:
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import ProfilerConfig, FrameworkProfile, CollectionConfig

In [2]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

In [18]:
resource_config = {'volume_size_in_gb': 1024}
train_instance_type = 'ml.p3.16xlarge'
hyperparameters = {
    'epochs': 2,
    'batch_size': 2048,
    'learning_rate': 0.1,
}
distributions = {
    'smdistributed':{
        'dataparallel':{
            'enabled': True
        }
    }
}

estimator_dp = TensorFlow(
    base_job_name='tf2-resnet-dist',
    source_dir='../src',
    entry_point='train_resnet_sdp.py',
    role=role,
    py_version='py37',
    framework_version='2.3.1',
    # Number of instances; use 2 or more for multi-node distributed training
    instance_count=2,
    # To train on p3dn instances, set train_instance_type to 'ml.p3dn.24xlarge'
    instance_type=train_instance_type,
    sagemaker_session=sagemaker_session,
    resource_config=resource_config,
    hyperparameters=hyperparameters,
    # Train with the SageMaker data parallelism library
    distribution=distributions,
    debugger_hook_config=False
)
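
To put the configuration above in perspective, the sketch below works out how many GPU workers the job spans. The training script is not shown here, so treating `batch_size` as a global value that is divided across workers is purely an assumption; `world_size` and `per_worker_batch` are hypothetical helpers.

```python
# GPUs per instance for the instance types mentioned in this notebook.
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}


def world_size(instance_type, instance_count):
    """Total number of GPU workers across all instances."""
    return GPUS_PER_INSTANCE[instance_type] * instance_count


def per_worker_batch(global_batch_size, workers):
    """Batch size per worker, ASSUMING batch_size is a global value."""
    if global_batch_size % workers != 0:
        raise ValueError("global batch size must divide evenly across workers")
    return global_batch_size // workers


workers = world_size("ml.p3.16xlarge", 2)
print(workers)
print(per_worker_batch(2048, workers))
```

If the script instead treats `batch_size` as a per-GPU value, the effective global batch would be `2048 * 16`, which changes how the learning rate should be chosen.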

In [19]:
inputs = 's3://sagemaker-us-east-1-367158743199/data/tf-2-distribution-options'
remote_inputs = {'train': inputs + '/train',
                 'test': inputs + '/test'
                 }
estimator_dp.fit(remote_inputs)

2021-03-09 00:40:30 Starting - Starting the training job...
2021-03-09 00:40:54 Starting - Launching requested ML instancesProfilerReport-1615250430: InProgress
.........
2021-03-09 00:42:17 Starting - Preparing the instances for training.........
2021-03-09 00:43:56 Downloading - Downloading input data...
2021-03-09 00:44:27 Training - Downloading the training image..............
2021-03-09 00:46:59 Training - Training image download completed. Training in progress.
2021-03-09 00:46:46.075366: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-03-09 00:46:46.079164: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-03-09 00:46:46.271216: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0

# Distributed Training with Horovod

In [15]:
from sagemaker.tensorflow import TensorFlow


hvd_instance_type = 'ml.p3.16xlarge'
hvd_processes_per_host = 1
hvd_instance_count = 2

distributions = {'mpi': {
                    'enabled': True,
                    'processes_per_host': hvd_processes_per_host,
                    'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none'
                        }
                }

hyperparameters = {
    'epochs': 2,
    'batch_size': 2048,
    'learning_rate': 0.1,
}

# Profiler settings are an assumption: sample system metrics every 500 ms.
from sagemaker.debugger import ProfilerConfig
profiler_config = ProfilerConfig(system_monitor_interval_millis=500)

estimator_hvd = TensorFlow(
    source_dir='../src',
    entry_point='train_tf2_hvd.py',
    base_job_name='hvd-cifar10-tf',
    role=role,
    py_version='py37',
    framework_version='2.3.1',
    hyperparameters=hyperparameters,
    instance_count=hvd_instance_count,
    instance_type=hvd_instance_type,
    distribution=distributions,
    profiler_config=profiler_config,
    metric_definitions=[
        {'Name': 'train:error', 'Regex': 'loss: (.*?);'},
        {'Name': 'train:accuracy', 'Regex': 'accuracy: (.*?);'}
    ]
)
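
The `metric_definitions` in the Horovod estimator rely on regular expressions whose first capture group is taken as the metric value. The sketch below applies the same two patterns locally with Python's `re` module. The log line is invented for illustration; the real format depends on what `train_tf2_hvd.py` prints, which is not shown here.

```python
import re

metric_definitions = [
    {"Name": "train:error", "Regex": "loss: (.*?);"},
    {"Name": "train:accuracy", "Regex": "accuracy: (.*?);"},
]


def extract_metrics(log_line, definitions):
    """Return {metric name: value} for every pattern that matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


# Hypothetical log line; note the trailing semicolons the patterns require.
sample_line = "epoch: 1; loss: 0.5231; accuracy: 0.8125;"
metrics = extract_metrics(sample_line, metric_definitions)
print(metrics)
```

Two things are worth checking against the real logs. The patterns only match values followed by a semicolon, and `loss: ` would also match inside a name such as `val_loss: `, so a stricter pattern may be needed if the script logs both.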

In [None]:
inputs = 's3://sagemaker-us-east-1-367158743199/data/tf-2-distribution-options'
remote_inputs = {'train': inputs + '/train',
                 'test': inputs + '/test'
                 }

estimator_hvd.fit(remote_inputs, wait=True)

# Simple Training

In [None]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow

sagemaker_session = sagemaker.Session()
role = get_execution_role()

In [None]:
model_dir = '/opt/ml/model'
train_instance_type = 'ml.p3.16xlarge'
hyperparameters = {
    'epochs': 2,
    'batch_size': 64,
    'learning_rate': 0.1,
}

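# Profiler settings are an assumption: sample system metrics every 500 ms.
from sagemaker.debugger import ProfilerConfig
profiler_config = ProfilerConfig(system_monitor_interval_millis=500)

estimator_single = TensorFlow(
    base_job_name='tf-2-resnet',
    source_dir='../src',
    entry_point='train_resnet_simple.py',
    role=role,
    py_version='py37',
    framework_version='2.3.1',
    model_dir=model_dir,
    instance_type=train_instance_type,
    instance_count=1,
    hyperparameters=hyperparameters,
    profiler_config=profiler_config,
    script_mode=True
)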

Now we can call the fit method of the Estimator object to start training. 

In [None]:
inputs = 's3://sagemaker-us-east-1-367158743199/data/tf-2-distribution-options'
remote_inputs = {'train': inputs + '/train',
                 'test': inputs + '/test'
                 }
estimator_single.fit(remote_inputs)