# Distributed data parallel MNIST training with TensorFlow2 and SMDataParallel

SMDataParallel is a new capability in Amazon SageMaker for training deep learning models faster and at lower cost. It is a distributed data parallel training framework for TensorFlow2, PyTorch, and MXNet.

This notebook example shows how to use SMDataParallel with TensorFlow2 in SageMaker using the MNIST dataset.

For more information:
1. [TensorFlow in SageMaker](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html)
2. [SMDataParallelTensorFlow API Specification] < LINK TO BE ADDED >
3. [Getting started with SMDataParallel on SageMaker] < LINK TO BE ADDED >

**NOTE:** This example requires SageMaker Python SDK v2.X.

### Dataset
This example uses the MNIST dataset. MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of handwritten digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits).

### SageMaker execution roles

The IAM role ARN is used to give training and hosting access to your data. See [Amazon SageMaker Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for how to create one. If more than one role is required for notebook instances, training, and/or hosting, replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).
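
If you do replace `sagemaker.get_execution_role()` with a hard-coded string, a quick shape check can catch a role name pasted in place of a full ARN. The sketch below is only an illustration: `looks_like_role_arn` is a hypothetical helper (not part of the SageMaker SDK), the pattern is a simplified assumption about the ARN layout, and the account ID and role name are placeholders.

```python
import re

# Hypothetical helper, not part of the SageMaker SDK.
# Simplified assumption: arn:<partition>:iam::<12-digit account>:role/<name>
_ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/\S+$")


def looks_like_role_arn(value: str) -> bool:
    """Return True if `value` has the general shape of a full IAM role ARN."""
    return bool(_ROLE_ARN_PATTERN.match(value))


# Placeholder account ID and role name, for illustration only.
example_role = "arn:aws:iam::123456789012:role/MySageMakerTrainingRole"

print(looks_like_role_arn(example_role))               # full ARN
print(looks_like_role_arn("MySageMakerTrainingRole"))  # bare role name
```

The check only validates the format of the string; it does not confirm that the role exists or has the permissions SageMaker needs.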


In [None]:
import IPython

install_needed = True  # should only be True once; set to False after the kernel restarts
if install_needed:
    print("installing deps and restarting kernel")
    !python3 -m pip install -U sagemaker
    !python3 -m pip install -U smdebug
    # Restart the kernel so the upgraded packages are picked up
    IPython.Application.instance().kernel.do_shutdown(True)

In [1]:
import sagemaker

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

In [2]:
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import (ProfilerConfig, 
                                FrameworkProfile, 
                                DetailedProfilingConfig, 
                                DataloaderProfilingConfig, 
                                PythonProfilingConfig)

## Model training with SMDataParallel

### Training script

The `train_tensorflow_smdataparallel_mnist.py` script provides the code you need for training a SageMaker model using SMDataParallel's `DistributedGradientTape`. The training script is very similar to a TensorFlow2 training script you might run outside of SageMaker, but modified to run with SMDataParallel. SMDataParallel's TensorFlow client provides an alternative to native `DistributedGradientTape`. For details about how to use SMDataParallel in your native TF2 script, see the Getting started with SMDataParallel tutorial.
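
To build intuition for what a distributed gradient tape does in data parallel training, the following pure-Python sketch simulates the core idea: each worker computes gradients on its own mini-batch, the gradients are averaged across workers, and every worker applies the same update. This is a conceptual illustration under that assumption, not SMDataParallel's implementation; `allreduce_average` and `sgd_step` are hypothetical names, and the gradient values are made up.

```python
from typing import List


def allreduce_average(per_worker_grads: List[List[float]]) -> List[float]:
    """Average each gradient component across workers (an AllReduce with mean)."""
    num_workers = len(per_worker_grads)
    return [sum(component) / num_workers for component in zip(*per_worker_grads)]


def sgd_step(weights: List[float], grads: List[float], lr: float) -> List[float]:
    """Apply one plain SGD update: w <- w - lr * g."""
    return [w - lr * g for w, g in zip(weights, grads)]


# Two simulated workers, each holding gradients from its own mini-batch.
worker_grads = [
    [0.25, -0.5],  # worker 0
    [0.75, 0.0],   # worker 1
]

avg = allreduce_average(worker_grads)
new_weights = sgd_step([1.0, 1.0], avg, lr=0.5)

print(avg)          # [0.5, -0.25]
print(new_weights)  # [0.75, 1.125]
```

Because every worker applies the same averaged gradient, the model replicas stay in sync after each step.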


In [3]:
!pygmentize code/train_tensorflow_smdataparallel_mnist.py

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limit

### SageMaker TensorFlow Estimator function options


In the following code block, you can update the estimator function to use a different instance type, instance count, and distribution strategy. You also pass in the training script you reviewed in the previous cell.

**Instance types**

SMDataParallel supports model training on SageMaker with the following instance types only:
1. ml.p3.16xlarge
1. ml.p3dn.24xlarge [Recommended]
1. ml.p4d.24xlarge [Recommended]
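
A small guard can reject an unsupported instance type before a job is launched. The sketch below is an illustration only: `total_gpus` is a hypothetical helper, and the figure of 8 GPUs per instance is an assumption stated here for the three listed types (it matches the 16 ranks, `[1,0]` through `[1,15]`, that appear in this notebook's two-instance training log).

```python
# Assumption for illustration: each supported instance type exposes 8 GPUs.
SUPPORTED_INSTANCE_GPUS = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}


def total_gpus(instance_type: str, instance_count: int) -> int:
    """Return the number of GPU workers, or raise if the setup is unsupported."""
    if instance_type not in SUPPORTED_INSTANCE_GPUS:
        supported = ", ".join(sorted(SUPPORTED_INSTANCE_GPUS))
        raise ValueError(
            f"{instance_type!r} is not in the supported list: {supported}"
        )
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return SUPPORTED_INSTANCE_GPUS[instance_type] * instance_count


print(total_gpus("ml.p3.16xlarge", 2))  # 16
```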

**Instance count**

To get the best performance and the most out of SMDataParallel, you should use at least 2 instances, but you can also use 1 for testing this example.
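
To see how instance count changes the work done per GPU, the sketch below computes the share of the 60,000 MNIST training images each worker would process and the resulting steps per epoch. It assumes the data is split evenly across workers, 8 GPUs per instance, and a per-GPU batch size of 128; the batch size and the helper `steps_per_epoch` are hypothetical, and the actual training script may feed data differently.

```python
import math


def steps_per_epoch(num_examples: int, num_workers: int, per_worker_batch: int) -> int:
    """Steps needed for one pass when each worker gets an equal shard of the data."""
    shard_size = num_examples // num_workers
    return math.ceil(shard_size / per_worker_batch)


NUM_TRAIN_IMAGES = 60_000  # MNIST training split
GPUS_PER_INSTANCE = 8      # assumption for the supported instance types
PER_GPU_BATCH = 128        # hypothetical batch size

for instance_count in (1, 2):
    workers = instance_count * GPUS_PER_INSTANCE
    shard = NUM_TRAIN_IMAGES // workers
    steps = steps_per_epoch(NUM_TRAIN_IMAGES, workers, PER_GPU_BATCH)
    print(f"instances={instance_count} workers={workers} shard={shard} steps={steps}")
```

Under these assumptions, doubling the instance count halves each worker's shard (7,500 to 3,750 images) and roughly halves the steps per epoch (59 to 30).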

**Distribution strategy**

Note that to use DDP mode, you update the `distribution` strategy and set it to use `smdistributed dataparallel`.

In [4]:
profiler_config=ProfilerConfig(
    system_monitor_interval_millis=100,
    framework_profile_params=FrameworkProfile(
        local_path="/opt/ml/output/profiler/",
        detailed_profiling_config=DetailedProfilingConfig(
            start_step=1, 
            num_steps=1
        ),
        dataloader_profiling_config=DataloaderProfilingConfig(
            start_step=1, 
            num_steps=1
        ),
        python_profiling_config=PythonProfilingConfig(
            start_step=1, 
            num_steps=1
        )
    )
)

In [5]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
                        base_job_name='tensorflow2-smdataparallel-mnist',
                        source_dir='code',
                        entry_point='train_tensorflow_smdataparallel_mnist.py',
                        role=role,
                        py_version='py37',
                        framework_version='2.3.1',
                        # For training with multinode distributed training, set this count. Example: 2
                        instance_count=2,
                        # For p3dn instances use ml.p3dn.24xlarge; for p4d instances use ml.p4d.24xlarge
                        instance_type= 'ml.p3.16xlarge',
                        sagemaker_session=sagemaker_session,
                        # Training using SMDataParallel Distributed Training Framework
                        distribution={'smdistributed':{
                                            'dataparallel':{
                                                    'enabled': True
                                             }
                                      }}
                        )


# estimator = TensorFlow(
#                         base_job_name=training_job_name,
#                         source_dir='code',
#                         entry_point='train_tensorflow_smdataparallel_mnist.py',
#                         role=role,
#                         py_version='py37',
#                         framework_version='2.3.1',
#                         # For training with multinode distributed training, set this count. Example: 2
# #                         instance_count=2,
#                         instance_count=1,
#                         # For training with p3dn instance use - ml.p3dn.24xlarge
#                         instance_type= 'ml.p3.16xlarge',
#                         sagemaker_session=sagemaker_session,
#                         profiler_config=profiler_config,  
#                         # Training using SMDataParallel Distributed Training Framework
#                         # missing code !!!
#                         )


In [6]:
estimator.fit()

2021-01-17 13:45:34 Starting - Starting the training job...
2021-01-17 13:45:57 Starting - Launching requested ML instances
ProfilerReport-1610891134: InProgress
.........
2021-01-17 13:47:18 Starting - Preparing the instances for training.........
2021-01-17 13:49:00 Downloading - Downloading input data
2021-01-17 13:49:00 Training - Downloading the training image...............
2021-01-17 13:51:21 Training - Training image download completed. Training in progress.
2021-01-17 13:51:22.806904: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-01-17 13:51:22.812111: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-01-17 13:51:22.992710: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2021-01-1

[1,0]<stdout>:algo-1:167:167 [0] NCCL INFO Bootstrap : Using [0]eth0:10.0.73.168<0>
[1,0]<stdout>:algo-1:167:167 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[1,0]<stdout>:
[1,0]<stdout>:algo-1:167:167 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
[1,0]<stdout>:algo-1:167:167 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.73.168<0>
[1,0]<stdout>:algo-1:167:167 [0] NCCL INFO Using network Socket
[1,0]<stdout>:NCCL version 2.7.8+cuda11.0
[1,8]<stdout>:algo-2:176:176 [0] NCCL INFO Bootstrap : Using [0]eth0:10.0.81.42<0>
[1,8]<stdout>:algo-2:176:176 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[1,8]<stdout>:
[1,8]<stdout>:algo-2:176:176 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
[1,8]<stdout>:algo-2:176:176 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.81.

[1,0]<stdout>:algo-1:167:1400 [0] NCCL INFO Launch mode Parallel
[1,0]<stdout>:Step #0#011Loss: 2.306592
[1,0]<stdout>:algo-1:167:1392 [0] NCCL INFO Launch mode Parallel
[1,8]<stdout>:algo-2:176:1389 [0] NCCL INFO Launch mode Parallel
[1,0]<stdout>:Step #50#011Loss: 0.266403
[1,0]<stdout>:Step #100#011Loss: 0.241800
[1,0]<stdout>:Step #150#011Loss: 0.137374
[1,0]<stdout>:Step #200#011Loss: 0.213814
[1,0]<stdout>:Step #250#011Loss: 0.096107
[1,0]<stdout>:Step #300#011Loss: 0.091707
[1,0]<stdout>:Step #350#011Loss: 0.053055
[1,0]<stdout>:Step #400#011Loss: 0.186329
[1,0]<stdout>:Step #450#011Loss: 0.094664
[1,0]<stdout>:Step #500#011Loss: 0.189872
[1,0]<stdout>:Step #550#011Loss: 0.062819
[1,0]<stdout>:Step #600#011Loss: 0.084561
[1,8]<stderr>:2021-01-17 13:51:30.313433: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initial

[1,1]<stderr>:[algo-1:00170] Read -1, expected 589824, errno = 1
[1,0]<stderr>:[algo-1:00167] Read -1, expected 589824, errno = 1
[1,3]<stderr>:[algo-1:00173] Read -1, expected 589824, errno = 1
[1,2]<stderr>:[algo-1:00172] Read -1, expected 589824, errno = 1
[1,11]<stderr>:[algo-2:00182] Read -1, expected 32768, errno = 1
[1,9]<stderr>:[algo-2:00179] Read -1, expected 32768, errno = 1
[1,8]<stderr>:[algo-2:00176] Read -1, expected 32768, errno = 1
[1,8]<stderr>:[algo-2:00176] Read -1, expected 589824, errno = 1
[1,10]<stderr>:[algo-2:00180] Read -1, expected 32768, errno = 1
[1,10]<stderr>:[algo-2:00180] Read -1, expected 589824, errno = 1
[1,11]<stderr>:[algo-2:00182] Read -1, expected 589824, errno = 1
[1,9]<stderr>:[algo-2:00179] Read -1, expected 589824, errno = 1
[1,6]<stderr>:[algo-1:00177] Read -1, expected 622592, errno = 1
[1,15]<stderr>:[algo-2:00188] Rea

[1,15]<stderr>:[algo-2:00188] Read -1, expected 622592, errno = 1
[1,14]<stderr>:[algo-2:00187] Read -1, expected 622592, errno = 1
[1,5]<stderr>:[algo-1:00178] Read -1, expected 622592, errno = 1
[1,4]<stderr>:[algo-1:00176] Read -1, expected 622592, errno = 1
[1,1]<stderr>:[algo-1:00170] Read -1, expected 32768, errno = 1
[1,3]<stderr>:[algo-1:00173] Read -1, expected 32768, errno = 1
[1,0]<stderr>:[algo-1:00167] Read -1, expected 32768, errno = 1
[1,2]<stderr>:[algo-1:00172] Read -1, expected 32768, errno = 1
[1,1]<stderr>:[algo-1:00170] Read -1, expected 589824, errno = 1
[1,0]<stderr>:[algo-1:00167] Read -1, expected 589824, errno = 1
[1,2]<stderr>:[algo-1:00172] Read -1, expected 589824, errno = 1
[1,3]<stderr>:[algo-1:00173] Read -1, expected 589824, errno = 1
[1,8]<stderr>:[algo-2:00176] Read -1, expected 32768, errno = 1
[1,10]<stderr>:[algo-2:00180] Read -

[1,9]<stderr>:[algo-2:00179] Read -1, expected 32768, errno = 1
[1,9]<stderr>:[algo-2:00179] Read -1, expected 589824, errno = 1
[1,6]<stderr>:[algo-1:00177] Read -1, expected 622592, errno = 1
[1,7]<stderr>:[algo-1:00179] Read -1, expected 622592, errno = 1
[1,4]<stderr>:[algo-1:00176] Read -1, expected 622592, errno = 1
[1,13]<stderr>:[algo-2:00186] Read -1, expected 622592, errno = 1
[1,5]<stderr>:[algo-1:00178] Read -1, expected 622592, errno = 1
[1,14]<stderr>:[algo-2:00187] Read -1, expected 622592, errno = 1
[1,12]<stderr>:[algo-2:00184] Read -1, expected 622592, errno = 1
[1,15]<stderr>:[algo-2:00188] Read -1, expected 622592, errno = 1
[1,3]<stderr>:[algo-1:00173] Read -1, expected 32768, errno = 1
[1,2]<stderr>:[algo-1:00172] Read -1, expected 32768, errno = 1
[1,2]<stderr>:[algo-1:00172] Read -1, expected 589824, errno = 1
[1,1]<stderr>:[algo-1:00170] Rea

2021-01-17 13:52:02,814 sagemaker-training-toolkit INFO     Orted process exited
[1,9]<stderr>:[algo-2:00179] Read -1, expected 589824, errno = 1
[1,11]<stderr>:[algo-2:00182] Read -1, expected 589824, errno = 1
[1,6]<stderr>:[algo-1:00177] Read -1, expected 622592, errno = 1
[1,4]<stderr>:[algo-1:00176] Read -1, expected 622592, errno = 1
[1,7]<stderr>:[algo-1:00179] Read -1, expected 622592, errno = 1
[1,13]<stderr>:[algo-2:00186] Read -1, expected 622592, errno = 1
[1,14]<stderr>:[algo-2:00187] Read -1, expected 622592, errno = 1
[1,5]<stderr>:[algo-1:00178] Read -1, expected 622592, errno = 1
[1,15]<stderr>:[algo-2:00188] Read -1, expected 622592, errno = 1
[1,12]<stderr>:[algo-2:00184] Read -1, expected 622592, errno = 1
[1,11]<stderr>:[algo-2:00182] Read -1, expected 32768, errno = 1
[1,8]<stderr>:[algo-2:00176] Read -1, expected 32768, errno = 1
[1,10]<stderr

[1,10]<stderr>:[algo-2:00180] Read -1, expected 589824, errno = 1
[1,8]<stderr>:[algo-2:00176] Read -1, expected 32768, errno = 1
[1,8]<stderr>:[algo-2:00176] Read -1, expected 589824, errno = 1
[1,11]<stderr>:[algo-2:00182] Read -1, expected 589824, errno = 1
[1,9]<stderr>:[algo-2:00179] Read -1, expected 589824, errno = 1
[1,6]<stderr>:[algo-1:00177] Read -1, expected 622592, errno = 1
[1,7]<stderr>:[algo-1:00179] Read -1, expected 622592, errno = 1
[1,4]<stderr>:[algo-1:00176] Read -1, expected 622592, errno = 1
[1,5]<stderr>:[algo-1:00178] Read -1, expected 622592, errno = 1
[1,13]<stderr>:[algo-2:00186] Read -1, expected 622592, errno = 1
[1,12]<stderr>:[algo-2:00184] Read -1, expected 622592, errno = 1
[1,15]<stderr>:[algo-2:00188] Read -1, expected 622592, errno = 1
[1,14]<stderr>:[algo-2:00187] Read -1, expected 622592, errno = 1
[1,3]<stderr>:[algo-1:00173]


2021-01-17 13:52:34 Uploading - Uploading generated training model
2021-01-17 13:52:32,844 sagemaker-training-toolkit INFO     MPI process finished.
For details of how to construct your training script see:
https://sagemaker.readthedocs.io/en/stable/using_tf.html#adapting-your-local-tensorflow-script
2021-01-17 13:52:32,844 sagemaker-training-toolkit INFO     Reporting training SUCCESS

2021-01-17 13:53:03 Completed - Training job completed
ProfilerReport-1610891134: NoIssuesFound
Training seconds: 482
Billable seconds: 482
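
The training log interleaves loss reports with NCCL and MPI messages, so it can help to pull out just the `Step #N ... Loss: x` entries. The sketch below is a minimal example that assumes the log format shown in this notebook, where the tab between the step and the loss appears as the literal text `#011`; `parse_step_losses` is a hypothetical helper, and the sample lines are copied from the output above.

```python
import re

# Matches "Step #<n>", then either the literal "#011" or whitespace, then "Loss: <x>".
STEP_LOSS = re.compile(r"Step #(\d+)(?:#011|\s+)Loss:\s*([0-9.]+)")


def parse_step_losses(log_text: str):
    """Return a list of (step, loss) tuples found in the training log text."""
    return [(int(m.group(1)), float(m.group(2))) for m in STEP_LOSS.finditer(log_text)]


sample_log = """\
[1,0]<stdout>:Step #0#011Loss: 2.306592
[1,0]<stdout>:algo-1:167:1392 [0] NCCL INFO Launch mode Parallel
[1,0]<stdout>:Step #50#011Loss: 0.266403
[1,0]<stdout>:Step #600#011Loss: 0.084561
"""

losses = parse_step_losses(sample_log)
print(losses)  # [(0, 2.306592), (50, 0.266403), (600, 0.084561)]
```

In this run the reported loss falls from about 2.31 at step 0 to below 0.1 by step 600, with some step-to-step noise.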


## Next steps

Now that you have a trained model, you can deploy an endpoint to host the model. After you deploy the endpoint, you can test it with inference requests. The cells below first inspect the profiler output of the training job; the final cell stores the `model_data` variable for use with the inference notebook.

In [7]:
import boto3
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from smdebug.profiler.analysis.utils.profiler_data_to_pandas import PandasFrame

# Look up the training job that just finished and wait for its system profiling data
training_job_name = estimator.latest_training_job.job_name
session = boto3.session.Session()
region = session.region_name
tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()



[2021-01-17 13:54:52.909 ip-172-16-12-104:31515 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None


ProfilerConfig:{'S3OutputPath': 's3://sagemaker-us-west-2-230755935769/', 'ProfilingIntervalInMilliseconds': 500}
s3 path:s3://sagemaker-us-west-2-230755935769/tensorflow2-smdataparallel-mnist-2021-01-17-13-45-34-320/profiler-output


Profiler data from system is available


In [8]:
pf=PandasFrame(tj.profiler_s3_output_path)
system_metrics_df=pf.get_all_system_metrics()
framework_metrics_df=pf.get_all_framework_metrics()

[2021-01-17 13:54:59.988 ip-172-16-12-104:31515 INFO algorithm_metrics_reader.py:192] S3AlgorithmMetricsReader created with bucket:sagemaker-us-west-2-230755935769 and prefix:tensorflow2-smdataparallel-mnist-2021-01-17-13-45-34-320/profiler-output/framework/
[2021-01-17 13:55:00.038 ip-172-16-12-104:31515 INFO metrics_reader_base.py:134] Getting 10 event files


In [9]:
import os
from IPython.display import IFrame, display, Markdown
from smdebug.profiler.python_profile_utils import StepPhase, PythonProfileModes
from smdebug.profiler.analysis.utils.python_profile_analysis_utils import Metrics
from smdebug.profiler.analysis.python_profile_analysis import PyinstrumentAnalysis, cProfileAnalysis
from smdebug.profiler.analysis.utils.pandas_data_analysis import PandasFrameAnalysis, StatsBy, Resource

In [10]:
pf_analysis=PandasFrameAnalysis(system_metrics_df, framework_metrics_df)

In [12]:
from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram

system_metrics_reader=tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

metrics_histogram=MetricsHistogram(system_metrics_reader)
metrics_histogram.plot(starttime=0, 
                       endtime=system_metrics_reader.get_timestamp_of_latest_available_file(), 
                       select_dimensions=["CPU", "GPU"],
                       select_events=["total"])

[2021-01-17 13:55:27.511 ip-172-16-12-104:31515 INFO metrics_reader_base.py:134] Getting 10 event files
Found 121310 system metrics events from timestamp_in_us:0 to timestamp_in_us:1610891520000000
select events:['total']
select dimensions:['CPU', 'GPU']
filtered_events:{'total'}
filtered_dimensions:{'GPUUtilization-nodeid:algo-2', 'CPUUtilization-nodeid:algo-2', 'GPUUtilization-nodeid:algo-1', 'GPUMemoryUtilization-nodeid:algo-2', 'CPUUtilization-nodeid:algo-1', 'GPUMemoryUtilization-nodeid:algo-1'}




In [13]:
from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap

system_metrics_reader.refresh_event_file_list()
view_heatmap=Heatmap(system_metrics_reader, plot_height=900)

[2021-01-17 13:55:29.589 ip-172-16-12-104:31515 INFO metrics_reader_base.py:134] Getting 10 event files
[2021-01-17 13:55:31.204 ip-172-16-12-104:31515 INFO metrics_reader_base.py:134] Getting 10 event files
select events:['.*']
select dimensions:['.*CPU', '.*GPU', '.*Memory']
filtered_events:{'cpu32', 'gpu1', 'cpu60', 'TransmitBytesPerSecond', 'cpu27', 'cpu34', 'IOPS', 'cpu49', 'ReadThroughputInBytesPerSecond', 'cpu17', 'gpu0', 'WriteThroughputInBytesPerSecond', 'cpu54', 'cpu9', 'cpu26', 'cpu37', 'cpu50', 'cpu41', 'cpu25', 'cpu12', 'cpu42', 'cpu29', 'cpu59', 'cpu61', 'cpu14', 'gpu4', 'cpu45', 'gpu3', 'cpu19', 'cpu35', 'MemoryUsedPercent', 'cpu8', 'cpu24', 'cpu31', 'cpu57', 'cpu38', 'cpu11', 'cpu10', 'cpu44', 'cpu52', 'cpu1', 'cpu16', 'ReceiveBytesPerSecond', 'cpu40', 'cpu0', 'cpu55', 'cpu15', 'cpu36', 'gpu2', 'gpu6', 'cpu21', 'cpu47', 'cpu58', 'cpu53', 'cpu46', 'cpu33', 'cpu62', 'cpu43', 'cpu13', 'cpu28', 'cpu30', 'cpu23', 'cpu39', 'cpu4', 'cpu18', 'cpu5', 'cpu56', 'cpu2', 'cpu22', 'c

  tmp = np.array(tmp)


In [14]:
from smdebug.profiler.analysis.utils.profiler_data_to_pandas import PandasFrame

pf=PandasFrame(tj.profiler_s3_output_path)

[2021-01-17 13:55:53.370 ip-172-16-12-104:31515 INFO algorithm_metrics_reader.py:192] S3AlgorithmMetricsReader created with bucket:sagemaker-us-west-2-230755935769 and prefix:tensorflow2-smdataparallel-mnist-2021-01-17-13-45-34-320/profiler-output/framework/


In [15]:
model_data = estimator.model_data
print("Storing {} as model_data".format(model_data))
%store model_data

Storing s3://sagemaker-us-west-2-230755935769/tensorflow2-smdataparallel-mnist-2021-01-17-13-45-34-320/output/model.tar.gz as model_data
Stored 'model_data' (str)
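
The stored `model_data` value is an S3 URI. If the inference notebook needs the bucket and object key separately (for example, to download the archive), the URI can be split with the standard library. This is a minimal sketch: `split_s3_uri` is a hypothetical helper, and the bucket and key below are placeholders rather than the values from this run.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI, for illustration only.
bucket, key = split_s3_uri("s3://example-bucket/example-job/output/model.tar.gz")
print(bucket)  # example-bucket
print(key)     # example-job/output/model.tar.gz
```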
