## Part 1: Distributed data parallel MNIST training with TensorFlow 2 and SageMaker distributed

[Amazon SageMaker's distributed library](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html) can be used to train deep learning models faster and at lower cost. The [data parallel](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html) feature in this library (`smdistributed.dataparallel`) is a distributed data parallel training framework for PyTorch, TensorFlow, and MXNet.

This notebook shows how to use `smdistributed.dataparallel` with TensorFlow 2 in SageMaker to train a model on the MNIST dataset.

For more information:

1. [TensorFlow in SageMaker](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html)
2. [SageMaker distributed data parallel API Specification](https://sagemaker.readthedocs.io/en/stable/api/training/smd_data_parallel_tensorflow.html)
3. [SageMaker's Distributed Data Parallel Library](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html)

**NOTE:** This example requires SageMaker Python SDK v2.X.

### Dataset
This example uses the MNIST dataset. MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of handwritten digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits).

In [1]:
pip install sagemaker --upgrade

Processing /home/ubuntu/.cache/pip/wheels/36/73/72/147e239a958fa69f277e3077dc08b53cbf466cf443463147bd/sagemaker-2.42.1-py2.py3-none-any.whl
Installing collected packages: sagemaker
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.42.0
    Uninstalling sagemaker-2.42.0:
      Successfully uninstalled sagemaker-2.42.0
Successfully installed sagemaker-2.42.1
Note: you may need to restart the kernel to use updated packages.


### SageMaker role

The following code cell defines `role`, the IAM role ARN used to create and run SageMaker training and hosting jobs. This is the same IAM role used to create this SageMaker Notebook instance.

`role` must have permission to create a SageMaker training job and launch an endpoint to host a model. For granular policies you can use to grant these permissions, see [Amazon SageMaker Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). If you do not require fine-grained permissions for this demo, you can use the IAM managed policy `AmazonSageMakerFullAccess`.

In [1]:
import sagemaker

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

To verify that the role above has the required permissions:

1. Go to the IAM console: https://console.aws.amazon.com/iam/home.
2. Select **Roles**.
3. Enter the role name in the search box to search for that role. 
4. Select the role.
5. Use the **Permissions** tab to verify that the required permissions are attached to this role.
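
The console check above can also be scripted. The following is a minimal sketch, not part of the original notebook. It assumes `boto3` is installed and that your credentials allow `iam:ListAttachedRolePolicies`. The helper names `role_name_from_arn` and `attached_policy_names` are hypothetical.

```python
def role_name_from_arn(role_arn):
    """Return the role name, i.e. the last path segment of an IAM role ARN."""
    # Role ARNs look like arn:aws:iam::<account-id>:role/<optional/path/>RoleName
    parts = role_arn.split(":", 5)
    if len(parts) != 6 or not parts[5].startswith("role/"):
        raise ValueError("not an IAM role ARN: {}".format(role_arn))
    return parts[5].rsplit("/", 1)[-1]


def attached_policy_names(role_arn):
    """List managed policies attached to the role (needs AWS credentials)."""
    import boto3  # imported lazily so the ARN helper works without boto3

    iam = boto3.client("iam")
    # Only the first page of results is read; inline policies are not included.
    response = iam.list_attached_role_policies(RoleName=role_name_from_arn(role_arn))
    return [policy["PolicyName"] for policy in response["AttachedPolicies"]]
```

In the notebook you could call `attached_policy_names(role)` and look for `AmazonSageMakerFullAccess` or an equivalent custom policy in the result.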

## Model training with SageMaker distributed data parallel

### Training script

The `train_tensorflow_smdataparallel_mnist.py` script provides the code you need to train a SageMaker model using `smdistributed.dataparallel`'s `DistributedGradientTape`. It is very similar to a TensorFlow 2 training script you might run outside of SageMaker, modified to run with this library. `smdistributed.dataparallel`'s TensorFlow client provides an alternative to native `DistributedGradientTape`.

For details about how to use `smdistributed.dataparallel`'s DDP in your native TensorFlow 2 script, see [Modify a TensorFlow 2.x Training Script Using SMD Data Parallel](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp.html#data-parallel-modify-sdp-tf2).

In [None]:
!pygmentize code/train_tensorflow_smdataparallel_mnist.py
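
As a rough guide to what to look for in the script, the sketch below shows the kind of changes the linked guide describes. It is an illustration, not the contents of `train_tensorflow_smdataparallel_mnist.py`, and the shipped script may differ. The function names `init_smdataparallel` and `make_training_step` are hypothetical, and the code only runs inside a SageMaker training container where `smdistributed.dataparallel` is installed.

```python
def init_smdataparallel():
    """Initialize the library and pin this process to a single GPU."""
    import tensorflow as tf
    import smdistributed.dataparallel.tensorflow as sdp

    sdp.init()
    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus:
        # One process per GPU: local_rank() selects this process's device.
        tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], "GPU")
    return tf, sdp


def make_training_step(tf, sdp, model, loss_fn, optimizer):
    """Build a training step whose gradients are combined across workers."""

    @tf.function
    def training_step(images, labels, first_batch):
        with tf.GradientTape() as tape:
            predictions = model(images, training=True)
            loss_value = loss_fn(labels, predictions)

        # Wrap the tape so gradient() all-reduces gradients across workers.
        tape = sdp.DistributedGradientTape(tape)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        if first_batch:
            # Start every worker from rank 0's weights and optimizer state.
            # optimizer.variables() is a method on TF 2.4 Keras optimizers.
            sdp.broadcast_variables(model.variables, root_rank=0)
            sdp.broadcast_variables(optimizer.variables(), root_rank=0)
        return loss_value

    return training_step
```

Scripts adapted this way commonly also scale the learning rate by `sdp.size()` and save checkpoints only on the worker where `sdp.rank() == 0`, so that workers do not overwrite each other's files.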

### SageMaker TensorFlow Estimator function options


In the following code block, you can update the estimator to use a different instance type, instance count, and distribution strategy. You also pass in the training script you reviewed in the previous cell.

**Instance types**

`smdistributed.dataparallel` supports model training on SageMaker with the following instance types only. For best performance, it is recommended you use an instance type that supports Amazon Elastic Fabric Adapter (ml.p3dn.24xlarge and ml.p4d.24xlarge).

1. ml.p3.16xlarge
1. ml.p3dn.24xlarge [Recommended]
1. ml.p4d.24xlarge [Recommended]


**Instance count**

To get the best performance and the most out of `smdistributed.dataparallel`, you should use at least 2 instances, but you can also use 1 for testing this example.
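
To see how instance count translates into parallelism, the sketch below computes the number of workers and the resulting global batch size. It assumes the library runs one worker process per GPU. Each instance type listed above has 8 GPUs. The helper names and the per-GPU batch size of 128 are hypothetical values chosen for illustration.

```python
# GPUs per instance for the instance types listed above.
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}


def world_size(instance_type, instance_count):
    """Total number of GPU workers, assuming one worker per GPU."""
    return GPUS_PER_INSTANCE[instance_type] * instance_count


def global_batch_size(per_gpu_batch_size, instance_type, instance_count):
    """Samples processed per optimizer step across all workers."""
    return per_gpu_batch_size * world_size(instance_type, instance_count)


workers = world_size("ml.p3.16xlarge", 2)
batch = global_batch_size(128, "ml.p3.16xlarge", 2)
print(workers, batch)
```

With two `ml.p3.16xlarge` instances this gives 16 workers, so a per-GPU batch size of 128 corresponds to a global batch size of 2048. Hyperparameters tuned for single-GPU training, such as the learning rate, may need adjusting for the larger global batch.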

**Distribution strategy**

To use DDP mode, set the `distribution` parameter to enable `smdistributed.dataparallel`.

In [2]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    base_job_name="tensorflow2-smdataparallel-mnist",
    source_dir="code",
    entry_point="train_tensorflow_smdataparallel_mnist.py",
    role=role,
    py_version="py37",
    framework_version="2.4.1",
    # For training with multinode distributed training, set this count. Example: 2
    instance_count=2,
    # For p3dn instances use ml.p3dn.24xlarge; for p4d instances use ml.p4d.24xlarge
    instance_type="ml.p3.16xlarge",
    sagemaker_session=sagemaker_session,
    # Training using SMDataParallel Distributed Training Framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

In [3]:
estimator.fit()

2021-05-28 23:54:33 Starting - Starting the training job...
2021-05-28 23:54:56 Starting - Launching requested ML instancesProfilerReport-1622246073: InProgress
.........
2021-05-28 23:56:33 Starting - Preparing the instances for training.........
2021-05-28 23:57:57 Downloading - Downloading input data...
2021-05-28 23:58:17 Training - Downloading the training image...............
2021-05-29 00:00:57 Training - Training image download completed. Training in progress.
2021-05-29 00:00:54.926771: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-05-29 00:00:54.932755: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-05-29 00:00:55.020912: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-0

Now that you have a trained model, you can deploy an endpoint to host the model. After you deploy the endpoint, you can test it with inference requests. The following cell stores the `model_data` variable for use in the inference notebook.

In [4]:
model_data = estimator.model_data
print("Storing {} as model_data".format(model_data))
%store model_data

Storing s3://sagemaker-us-west-2-688520471316/tensorflow2-smdataparallel-mnist-2021-05-28-23-54-33-230/output/model.tar.gz as model_data
Stored 'model_data' (str)
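
In another notebook, the IPython magic `%store -r model_data` restores the stored value. If you want to fetch the artifact yourself, you first need the bucket and key from the S3 URI. The sketch below shows one way to split it using only the standard library. The helper name `split_s3_uri` is hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://sagemaker-us-west-2-688520471316/"
    "tensorflow2-smdataparallel-mnist-2021-05-28-23-54-33-230/output/model.tar.gz"
)
print(bucket)
print(key)
```

The resulting pair can be passed to an S3 client, for example `boto3`'s `download_file(bucket, key, local_path)`.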


In [5]:
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

update_endpoint is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


----!

In [7]:
print(predictor.endpoint_name)

tensorflow2-smdataparallel-mnist-2021-05-29-00-02-50-266
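
For a quick smoke test of the endpoint, you can build a request in the TensorFlow Serving REST format, which wraps inputs in an `"instances"` list. The sketch below is an assumption-laden example: it assumes the trained model accepts a batch of 28x28x1 float images, which you should confirm against the training script. The helper name `make_blank_digit_batch` is hypothetical.

```python
import json


def make_blank_digit_batch(batch_size=1, height=28, width=28, channels=1):
    """Build a TensorFlow Serving style payload of all-zero images."""
    # Assumed input shape: (batch, 28, 28, 1). Adjust to match your model.
    def blank_image():
        return [[[0.0] * channels for _ in range(width)] for _ in range(height)]

    return {"instances": [blank_image() for _ in range(batch_size)]}


payload = make_blank_digit_batch()
body = json.dumps(payload)  # JSON request body for the endpoint

# With a live endpoint, the predictor serializes the dict to JSON for you:
# result = predictor.predict(payload)
```

A blank image is only useful for checking that the endpoint responds. For meaningful predictions, send real MNIST test images preprocessed the same way as during training.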


## Cleanup

If you don't intend to try out inference or to do anything else with the endpoint, you should delete the endpoint.

In [8]:
predictor.delete_endpoint()