# TensorFlow's tf.data.service with Amazon SageMaker Training Heterogeneous Clusters

---
### Introduction

Heterogeneous clusters let you launch training jobs that use multiple instance types in a single job. This capability can improve your training cost and speed by running different parts of the model training on the most suitable instance type. This use case typically arises in computer vision (CV) deep learning (DL) training, where training is bottlenecked on the CPU resources needed for data augmentation, leaving the expensive GPU underutilized. Heterogeneous clusters let you add more CPU resources to fully utilize the GPUs, increasing training speed and cost-efficiency. For more details, you can find the documentation of this feature [here](https://docs.aws.amazon.com/sagemaker/latest/dg/train-heterogeneous-cluster.html).

This notebook demonstrates how to use Heterogeneous Clusters with TensorFlow's [tf.data.service](https://www.TensorFlow.org/api_docs/python/tf/data/experimental/service). It trains a CPU-intensive DL CV workload and compares cost and performance between homogeneous and heterogeneous training configurations.  

💡To get started quickly with heterogeneous clusters, we suggest you reuse the provided code as a quick way to migrate your workload from a local tf.data pipeline to a distributed tf.data.service pipeline. You'll need to change [code/train_dnn.py](./code/train_dnn.py), while keeping [code/train_data.py](./code/train_data.py) and [code/launcher.py](code/launcher.py) intact. This is explained below in the Workload Details section.



<table>
  <tr>
    <td width=43%><b>Homogeneous Training Job</b><br/>
    In a Homogeneous training job the ml.p4d.24xlarge instance GPUs are under-utilized due to a CPU bottleneck.</td>
    <td width=57%><b>Heterogeneous Training Job</b><br/>
    In a Heterogeneous training job, we add two ml.c5.18xlarge instances with extra CPU cores to reduce the CPU bottleneck and drive up GPU usage, improving training speed and cost-efficiency.
    </td>
  </tr>
  <tr>
    <td> <img src=images/basic-homogeneous-job.png alt="homogeneous training job" /></td>
    <td><img src=images/basic-heterogeneous-job.png alt="Heterogeneous training job" /></td>
  </tr>
</table>



### Workload Details
Training data is an artificially generated dataset consisting of 32x32x3 images with random pixel values, and a corresponding random label representing 10 different classes. As this dataset is randomly generated, you should not expect the model to converge in a meaningful way. This doesn't matter, as our intent is only to measure data pipeline and neural network optimization throughput, expressed in epoch/step time.  
The model we used is [Resnet50](https://www.TensorFlow.org/api_docs/python/tf/keras/applications/ResNet50). The job runs on an 8-GPU instance, ml.p4d.24xlarge, and uses Horovod for data parallelism. 

### Setting up heterogeneous clusters training
To switch to heterogeneous clusters, we'll define two `instance_groups`:
- **data_group** - A group of CPU instances that will run data pre-processing code.
- **dnn_group** -  A group of GPU instances that will run Deep Neural Network training code.

In this example, the inter-node communication between the CPU and GPU instance groups is implemented using the [TensorFlow data service feature](https://www.TensorFlow.org/api_docs/python/tf/data/experimental/service). This feature allows you to offload a configurable amount of preprocessing work to worker machines. Note that SageMaker's Heterogeneous cluster feature does not provide out-of-the-box support for communication between instance groups; it is up to the user to implement it (we provide a reference implementation here).

This notebook refers to the following files and folders:
- [code/train_dnn.py](./code/train_dnn.py) - a standard TF training script with a single reference to tf.data.service when setting up the tf.data pipeline. This script is executed on the GPU instances belonging to the dnn_group.
- [code/train_data.py](./code/train_data.py) - this script starts the tf.data.service processes: a tf.data.service Dispatcher and tf.data.service Workers. You shouldn't need to edit this script when adapting to your workload.
- [code/launcher.py](./code/launcher.py) - the entry point training script. This is the script that SageMaker Training starts on all instances (all instance groups must share the same entry point script in heterogeneous clusters). `launcher.py` is responsible for detecting the instance group the instance belongs to, and starting `train_dnn.py` and `train_data.py` accordingly. It is also responsible for shutting down the tf.data.service processes once the training script (`train_dnn.py`) completes, so that all instances exit and the SageMaker training job can complete. 
On every instance, `launcher.py` uses `train_data.py` to start a tf.data.service worker server (as all instance types have CPUs that can be used for preprocessing). `launcher.py` starts a single tf.data.service dispatcher server (on the first instance of the `data_group`).  
`launcher.py` starts the `train_dnn.py` script on all GPU instances (`dnn_group` instances).
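
The division of work described above can be sketched in plain Python. This is a minimal illustration, not the actual `launcher.py`: the function name `plan_processes`, the role strings, and the host names are hypothetical. It only assumes what the text states: one dispatcher on the first instance of `data_group`, a worker on every instance, and `train_dnn.py` on `dnn_group` instances only.

```python
from typing import Dict, List


def plan_processes(current_host: str, groups: Dict[str, List[str]]) -> List[str]:
    """Return the processes one instance should start (hypothetical sketch).

    `groups` maps an instance group name to its ordered list of host names.
    """
    # Find which instance group the current host belongs to.
    current_group = next(name for name, hosts in groups.items() if current_host in hosts)

    processes = []
    # A single dispatcher, on the first instance of the data_group.
    if current_group == "data_group" and current_host == groups["data_group"][0]:
        processes.append("dispatcher")
    # Every instance has CPUs, so every instance runs a tf.data.service worker.
    processes.append("worker")
    # Only the GPU instances run the training script.
    if current_group == "dnn_group":
        processes.append("train_dnn.py")
    return processes


# Hypothetical host names for a cluster of two CPU instances and one GPU instance.
groups = {"data_group": ["algo-1", "algo-2"], "dnn_group": ["algo-3"]}
for host in ["algo-1", "algo-2", "algo-3"]:
    print(host, plan_processes(host, groups))
```

In this sketch only one host ever receives the dispatcher role, which mirrors the requirement that each training job has a single dispatcher.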

#### Learn more about tf.data.service processes
- `tf.data.service Dispatcher` - The dispatcher server acts as the control plane for tf.data.service. It is responsible for registering worker servers and assigning preprocessing tasks to them. Each training job has a single Dispatcher, which runs on the first instance of the `data_group` and listens on port 6000.
- `tf.data.service Workers` - Worker servers carry out the data processing. Each instance can have one or more workers (listening on ports 6001/6002/...).
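
As a small illustration of the port layout just described, the sketch below builds the dispatcher address and the list of worker ports for one instance. The helper names and the host name `algo-1` are hypothetical; the port numbers come from the text, and the `grpc://host:port` form is the address format tf.data.service expects for its `service` argument.

```python
DISPATCHER_PORT = 6000    # dispatcher port quoted in the text
FIRST_WORKER_PORT = 6001  # workers listen on 6001, 6002, ...


def dispatcher_target(dispatcher_host: str) -> str:
    """Build the gRPC address that training processes would connect to."""
    return f"grpc://{dispatcher_host}:{DISPATCHER_PORT}"


def worker_ports(num_workers: int) -> list:
    """Return one port per worker started on a single instance."""
    return [FIRST_WORKER_PORT + i for i in range(num_workers)]


print(dispatcher_target("algo-1"))  # grpc://algo-1:6000
print(worker_ports(2))              # [6001, 6002]
```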

#### Defining what part of your pipeline runs in which instance
By applying `tf.data.experimental.service.distribute` to your dataset, you tell tf.data to run all preprocessing operations defined up to that point on the workers. Operations added after that point run locally in the training process.  
As all instances run a tf.data.service Worker, all instances need access to the dataset you make available through a SageMaker training data channel. You do have the option of limiting which instance group sees which training data channel.

The figure below shows the sequence of events for setting up and running a tf.data.service based heterogeneous cluster training job.

<img src=images/tf.data.service-diagram.png width=600px>

---
**NOTE**

As an alternative to this notebook, you can follow [readme.md](./readme.md), which allows you to set up and launch the training job from an IDE or command line.  

---


At a high level, the notebook covers:
-  Set up Amazon SageMaker Studio Notebook 
-  Run homogeneous cluster training job 
   -  Setting up the Training environment
   -  Submitting the Training job
   -  Monitor and visualize the CloudWatch metrics
-  Run heterogeneous cluster training job 
   -  Setting up the Training environment
   -  Submitting the Training job
   -  Monitor and visualize the CloudWatch metrics
-  Compare time-to-train and cost-to-train
-  Conclusion

---

### A. Set up SageMaker Studio notebook
#### Before you start
Ensure you have selected the Python 3 (_TensorFlow 2.6 Python 3.8 CPU Optimized_) image for your SageMaker Studio Notebook instance, and that it is running on the _ml.t3.medium_ instance type.

#### Step 1 - Upgrade SageMaker SDK and dependent packages 
Heterogeneous Clusters for Amazon SageMaker model training was [announced](https://aws.amazon.com/about-aws/whats-new/2022/07/announcing-heterogeneous-clusters-amazon-sagemaker-model-training) on 07/08/2022. This feature release requires an up-to-date SageMaker SDK, TensorFlow, and Boto3 client.

In [3]:
%%bash
python3 -m pip install --upgrade boto3 botocore awscli sagemaker

Collecting boto3
  Downloading boto3-1.24.80-py3-none-any.whl (132 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 132.5/132.5 kB 925.9 kB/s eta 0:00:00
Collecting botocore
  Downloading botocore-1.27.80-py3-none-any.whl (9.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.1/9.1 MB 16.4 MB/s eta 0:00:00
Collecting awscli
  Downloading awscli-1.25.81-py3-none-any.whl (3.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.9/3.9 MB 38.4 MB/s eta 0:00:00
Installing collected packages: botocore, boto3, awscli
  Attempting uninstall: botocore
    Found existing installation: botocore 1.27.72
    Uninstalling botocore-1.27.72:
      Successfully uninstalled botocore-1.27.72
  Attempting uninstall: boto3
    Found existing installation: boto3 1.24.72
    Uninstalling boto3-1.24.72:
      Successfully uninstalled boto3-1.24.72
  Attempting uninstall: awscli
    Found existing installation: awscli 1.25.73
    Uninstalling awscli-1.25.73:
      Successfully uninstalled awscli-1.25.73


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sagemaker-training 4.2.2 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.1 which is incompatible.


Successfully installed awscli-1.25.81 boto3-1.24.80 botocore-1.27.80


#### Step 2 - Restart the notebook kernel 
From the Jupyter Lab menu bar, select **Kernel > Restart Kernel...**

#### Step 3 - Validate SageMaker Python SDK and TensorFlow versions
Ensure the output of the cell below reflects:

- SageMaker Python SDK version 2.98.0 or above, 
- boto3 1.24 or above 
- botocore 1.27 or above 
- TensorFlow 2.6 or above 

In [4]:
!pip show sagemaker boto3 botocore tensorflow protobuf |egrep 'Name|Version|---'

Name: sagemaker
Version: 2.109.0
---
Name: boto3
Version: 1.24.80
---
Name: botocore
Version: 1.27.80
---
Name: tensorflow
Version: 2.8.0
---
Name: protobuf
Version: 3.20.1


In [5]:
import os
import json
import datetime

import sagemaker
from sagemaker import get_execution_role
from sagemaker.instance_group import InstanceGroup

sess = sagemaker.Session()
role = get_execution_role()

### C. Run a homogeneous training job
#### Step 1: Set up the training environment
In this step, we define and submit a homogeneous training job. It uses a single instance type (p4d.24xlarge) with 8 GPUs. Analysis shows that the job is CPU-bound, which leaves its GPUs underutilized.

In [6]:
import datetime
from sagemaker.tensorflow import TensorFlow
from sagemaker.instance_group import InstanceGroup
import os

hyperparameters = {
    "epochs": 10,
    "steps_per_epoch": 500,
    "batch_size": 1024,
    "tf_data_mode": "local",  # We won't be using tf.data.service ('service') for this homogeneous job
    "num_of_data_workers": 0,  # We won't be using tf.data.service ('service') for this homogeneous job
}

estimator = TensorFlow(
    entry_point="launcher.py",
    source_dir="code",
    framework_version="2.9.1",
    py_version="py39",
    role=role,
    volume_size=10,
    max_run=1800,  # 30 minutes
    disable_profiler=True,
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    hyperparameters=hyperparameters,
    distribution={
        "mpi": {
            "enabled": True,
            "processes_per_host": 8,  # 8 GPUs per host
            "custom_mpi_options": "--NCCL_DEBUG WARN",
        },
    },
)

#### Step 2: Submit the training job

Note: For the logs, click on **View logs** from the **Training Jobs** node in **Amazon SageMaker Console**. 


In [17]:
from start_job_utils import fit_with_retries
fit_with_retries(5, estimator, 
    job_name="homogeneous-" + datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ"),
)

2022-09-24 11:28:23 Starting - Starting the training job......
2022-09-24 11:29:08 Starting - Preparing the instances for training........................
2022-09-24 11:33:34 Downloading - Downloading input data
2022-09-24 11:33:34 Training - Downloading the training image........
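
`fit_with_retries` is imported from the repository's `start_job_utils` module, whose source is not shown in this notebook. As a rough idea of what a helper of this kind might do, the sketch below retries a callable up to a fixed number of attempts. Everything here is an assumption for illustration: the name `call_with_retries`, its parameters, and the simulated failure are hypothetical and do not reflect the real helper's behavior.

```python
import time


def call_with_retries(max_attempts, fn, delay_seconds=0.0):
    """Call `fn` until it succeeds or `max_attempts` is exhausted (hypothetical sketch)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as err:  # a real helper would likely retry only on specific errors
            last_error = err
            print(f"attempt {attempt}/{max_attempts} failed: {err}")
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise last_error


# Simulated submission that fails twice before succeeding.
attempts = {"count": 0}


def flaky_submit():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("capacity not available (simulated)")
    return "job submitted"


result = call_with_retries(5, flaky_submit)
print(result, "after", attempts["count"], "attempts")
```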

#### Step 3: Analyzing the homogeneous training job throughput and resource usage
We'll examine CPU and GPU usage, and epoch time and step time.

**CPU and GPU usage analysis** 

In the screenshot below we observe that close to all of the instance's 96 vCPUs are utilized, while GPU utilization is only ~45%. Clearly, if we had more vCPUs we could increase GPU usage significantly and raise job throughput.

Note: To view metrics for your own job, click **View instance metrics** from the **Training jobs** node in the **Amazon SageMaker Console**. Then, to rescale the CloudWatch metrics to 100% CPU utilization for algo-1 and algo-2, use the CloudWatch "Add Math" feature and divide by the number of vCPUs/GPUs on those instance types. We captured the metrics definitions used to produce this graph [here](./cloudwatch-metric-definitions/homogenous-workload%20copy.json).  
<img src="images/metrics homogeneous cpu and gpu usage.png" width=75%/>
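
The rescaling mentioned in the note is a simple division. The sketch below shows the arithmetic, assuming the raw instance metric is reported as the sum across all vCPUs (or GPUs), so a 96 vCPU instance can report up to 9600%. The function name and the raw input values are illustrative only, not measurements from this job.

```python
def rescale_utilization(raw_percent: float, num_units: int) -> float:
    """Convert a summed utilization metric to a 0-100% average per vCPU or GPU."""
    return raw_percent / num_units


# Illustrative raw values, chosen only to demonstrate the arithmetic.
print(rescale_utilization(9120.0, 96))  # 95.0 (average per vCPU)
print(rescale_utilization(360.0, 8))    # 45.0 (average per GPU)
```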

**Epoch time and step time analysis**

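For the 2nd and 3rd epochs, the cell below should print timings along the lines of 105s/epoch - 209ms/step. Exact values vary from run to run; the sample output below shows about 92s/epoch - 184ms/step.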

In [10]:
%%capture homogeneous_logs
estimator.sagemaker_session.logs_for_job(estimator.latest_training_job.name)

In [12]:
print(f"Printing step time for epochs and steps for {estimator.latest_training_job.name}")
for line in homogeneous_logs.stdout.split("\n"):
    if "mpirank:0" in line and "/epoch" in line:
        print(line)

Printing step time for epochs and steps for homogeneous-20220923T231801Z
[1,mpirank:0,algo-1]<stdout>:500/500 - 117s - loss: 2.4153 - lr: 0.0033 - 117s/epoch - 234ms/step
[1,mpirank:0,algo-1]<stdout>:500/500 - 92s - loss: 2.3755 - lr: 0.0057 - 92s/epoch - 184ms/step
[1,mpirank:0,algo-1]<stdout>:500/500 - 92s - loss: 2.3472 - lr: 0.0080 - 92s/epoch - 184ms/step
[1,mpirank:0,algo-1]<stdout>:500/500 - 92s - loss: 2.3175 - lr: 0.0080 - 92s/epoch - 184ms/step
[1,mpirank:0,algo-1]<stdout>:500/500 - 92s - loss: 2.3066 - lr: 0.0080 - 92s/epoch - 183ms/step
[1,mpirank:0,algo-1]<stdout>:500/500 - 90s - loss: 2.3043 - lr: 0.0080 - 90s/epoch - 181ms/step
[1,mpirank:0,algo-1]<stdout>:500/500 - 94s - loss: 2.3028 - lr: 0.0080 - 94s/epoch - 189ms/step
[1,mpirank:0,algo-1]<stdout>:500/500 - 92s - loss: 2.3024 - lr: 0.0080 - 92s/epoch - 184ms/step
[1,mpirank:0,algo-1]<stdout>:500/500 - 93s - loss: 2.3021 - lr: 0.0080 - 93s/epoch - 185ms/step
[1,mpirank:0,algo-1]<stdout>:500/500 - 89s - loss: 2.3018 - l
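
To turn log lines like these into numbers, you can extract the epoch and step timings with a regular expression. The sketch below is one possible approach, using three lines copied from the sample output above; the function name `parse_timings` is hypothetical, and skipping the first epoch is a judgment call based on it being noticeably slower in this run.

```python
import re
from statistics import mean

# Three lines copied from the sample output above.
sample_log = """\
[1,mpirank:0,algo-1]<stdout>:500/500 - 117s - loss: 2.4153 - lr: 0.0033 - 117s/epoch - 234ms/step
[1,mpirank:0,algo-1]<stdout>:500/500 - 92s - loss: 2.3755 - lr: 0.0057 - 92s/epoch - 184ms/step
[1,mpirank:0,algo-1]<stdout>:500/500 - 92s - loss: 2.3472 - lr: 0.0080 - 92s/epoch - 184ms/step
"""

TIMING = re.compile(r"(\d+)s/epoch - (\d+)ms/step")


def parse_timings(log_text):
    """Return (seconds_per_epoch, ms_per_step) tuples from MPI rank 0 lines."""
    timings = []
    for line in log_text.splitlines():
        if "mpirank:0" not in line:
            continue
        match = TIMING.search(line)
        if match:
            timings.append((int(match.group(1)), int(match.group(2))))
    return timings


timings = parse_timings(sample_log)
# Skip the first epoch, which is slower than the rest in this run.
steady_state_ms = [ms for _, ms in timings[1:]]
print(timings)
print("mean ms/step after first epoch:", mean(steady_state_ms))
```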

### D. Run a heterogeneous cluster training job

#### Step 1: Set up training environment
We'll now run a training job in heterogeneous cluster mode.  
Note the changes from the homogeneous cluster job:  
- We define two new instance groups that are provided to the `estimator` as the `instance_groups` parameter that replaces the homogeneous parameters `instance_type` and `instance_count`.
- In the `distribution` parameter for Horovod we added a new parameter, `instance_groups`, that is used to limit the MPI cluster to run in the `dnn_group`. The MPI cluster should include only the GPU nodes that run Horovod (which needs MPI). The `data_group` instances should not be part of the MPI cluster, as they set up their own `tf.data.service` cluster.

More on the two instance groups config we use:
- `data_group` - two ml.c5.18xlarge instances, each with 72 vCPUs, to handle data preprocessing: reading data from S3, preprocessing it, and forwarding it to the `dnn_group`.
- `dnn_group` - a single p4d.24xlarge instance, with 8 GPUs and 96 vCPUs, to handle deep neural network optimization (forward and backward passes). To fully utilize the 96 vCPUs in the `dnn_group`, we'll start data workers on all the instances in both groups, so we have 240 vCPUs (96+72+72) in total available for preprocessing (minus the vCPUs used for the neural network optimization process).
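
The vCPU arithmetic above can be checked with a few lines of Python. The dictionary layout and variable names are only for illustration; the instance counts, vCPU counts, and GPU count are the ones quoted in the text.

```python
# (instance_count, vcpus_per_instance) for each group, as quoted above.
instance_groups = {"data_group": (2, 72), "dnn_group": (1, 96)}
num_gpus = 8  # GPUs on the single p4d.24xlarge instance

total_vcpus = sum(count * vcpus for count, vcpus in instance_groups.values())
homogeneous_vcpus = 96  # the p4d.24xlarge instance on its own

print("total vCPUs:", total_vcpus)                              # 240
print("vCPUs per GPU, homogeneous:", homogeneous_vcpus / num_gpus)  # 12.0
print("vCPUs per GPU, heterogeneous:", total_vcpus / num_gpus)      # 30.0
```

The ratio of vCPUs to GPUs is a convenient way to see how much extra preprocessing capacity the two CPU instances add.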

There are three Python scripts to know about:
The 1st is `train_dnn.py` - this is your training script for the neural network; you should edit it to match your own use case. Note that this script isn't aware of the Heterogeneous cluster setup, except when it initializes the tf.data dataset by calling this line: `ds = ds.apply(tf.data.experimental.service.distribute(...))`.  
The 2nd and 3rd scripts, which you're not supposed to edit when adapting to your own use case, do the heavy lifting required for using tf.data.service over the Heterogeneous cluster feature.  
`train_data.py` includes functions to start/stop tf.data.service processes such as a dispatcher and a WorkerServer. 
`launcher.py` has several responsibilities: 
- Serves as a single entry point script for all instances in all instance groups (SageMaker starts the same script on all instances).
- Identifies which instance group the node belongs to, and starts the relevant script accordingly (`train_dnn.py` or `train_data.py`, or sometimes both).
- Takes measures to ensure that the tf.data.service processes shut down when training completes, as the training job completes only when all instances exit.
- Allows starting more than one process (for example, on the dnn_group instances we'll run both `train_dnn.py` and a tf.data.service worker to utilize the instance CPUs).

In [14]:
import datetime
from sagemaker.tensorflow import TensorFlow
from sagemaker.instance_group import InstanceGroup
from sagemaker.inputs import TrainingInput
import os

hyperparameters = {
    "epochs": 10,
    "steps_per_epoch": 500,
    "batch_size": 1024,
    "tf_data_mode": "service",  # Using tf.data.service for this Heterogeneous cluster job
    "num_of_data_workers": 1,  # One tf.data.service worker per node
}

# Group for CPU instances to run tf.data.service dispatcher/workers processes.
data_group = InstanceGroup("data_group", "ml.c5.18xlarge", 2)
# Group for deep neural network (dnn) with accelerators (e.g., GPU, FPGA, etc.)
dnn_group = InstanceGroup("dnn_group", "ml.p4d.24xlarge", 1)

estimator2 = TensorFlow(
    entry_point="launcher.py",
    source_dir="code",
    framework_version="2.9.1",
    py_version="py39",
    role=role,
    volume_size=10,
    max_run=1800,  # 30 minutes
    disable_profiler=True,
    # instance_type='ml.p4d.24xlarge',
    # instance_count=1,
    instance_groups=[data_group, dnn_group],
    hyperparameters=hyperparameters,
    distribution={
        "mpi": {
            "enabled": True,
            "processes_per_host": 8,  # p4d.24xlarge has 8 GPUs per host
            "custom_mpi_options": "--NCCL_DEBUG WARN",
        },
        "instance_groups": [dnn_group],  # Apply distribution strategy to the dnn_group only
    },
)

#### Step 2: Submit the training job

Note 1: For the logs, click on **View logs** from the **Training Jobs** node in **Amazon SageMaker Console**.  
Note 2: Ignore the 0 billable seconds shown below. See the actual billable seconds in the AWS web console > SageMaker > Training Jobs > this job.

In [16]:
from start_job_utils import fit_with_retries
fit_with_retries(5, estimator2, 
    job_name="heterogeneous-" + datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ"),
)

2022-09-23 23:52:33 Starting - Starting the training job......
2022-09-23 23:53:17 Starting - Preparing the instances for training........................
2022-09-23 23:57:33 Downloading - Downloading input data...
2022-09-23 23:57:48 Training - Downloading the training image...........................
2022-09-24 00:02:25 Training - Training image download completed. Training in progress....................................................
2022-09-24 00:11:23 Uploading - Uploading generated training model...
2022-09-24 00:11:59 Completed - Training job completed


#### Step 3: Analyze the heterogeneous cluster training job's throughput and resource usage
We'll examine CPU and GPU usage, and epoch time and step time.

**CPU and GPU usage analysis** 

In the screenshot below we observe that GPU usage has increased to 74% (compared to ~45% in the homogeneous training run), which is what we were aiming for. CPU usage on all 3 instances is close to 80%.  
 
Note: To view metrics for your own job, click **View instance metrics** from the **Training jobs** node in the **Amazon SageMaker Console**. Then, to rescale the CloudWatch metrics to 100% CPU utilization for algo-1 and algo-2, use the CloudWatch "Add Math" feature and divide by the number of vCPUs/GPUs on those instance types. We captured the metrics definitions used to produce this graph [here](./cloudwatch-metric-definitions/heterogenenous-workload.json).  
<img src="images/metrics Heterogeneous cpu and gpu usage.png" width=75%/>

**Epoch time and step time analysis** 

From the 2nd epoch onwards, you should see this printed in the logs of the dnn_group instance (p4d.24xlarge): 43s/epoch - 86ms/step.
Note that the instances are named Algo1, Algo2, Algo3 randomly on each execution, so you'll need to open the logs of all instances to find the dnn_group instance.

## E. Comparing time-to-train and cost-to-train
The table below summarizes both jobs. We can see that:
- The Heterogeneous job is <b>2.2x faster to train</b> (86ms/step) than the homogeneous job (192ms/step).
- The Heterogeneous job is <b>roughly 45% cheaper to train</b> than the homogeneous job. This is despite the heterogeneous job costing more per hour ($45/hour vs $37.7/hour), due to the two extra c5.18xlarge instances included in the heterogeneous job `($45 = $37.7 + 2 * $3.67)`.

The cost-to-train formula we used: relative cost = change in hourly price `($45/$37.7)` times reduction in time-to-train `(86ms/192ms)` = `($45/$37.7) * (86ms/192ms)` ≈ 0.53. The saving is one minus this ratio, which lands in the mid-40% range with these rounded inputs. 
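
The comparison can be reproduced with a few lines of arithmetic. The sketch below uses only the hourly prices and step times quoted in this section; the variable names are illustrative. Note that the formula yields the relative cost of the heterogeneous job, and the saving is one minus that ratio.

```python
# Hourly prices and step times quoted in this section.
homogeneous_price = 37.7  # $/hour, one ml.p4d.24xlarge
data_instance_price = 3.67  # $/hour, one ml.c5.18xlarge
heterogeneous_price = homogeneous_price + 2 * data_instance_price

homogeneous_step_ms = 192
heterogeneous_step_ms = 86

# Speedup: how many times faster each step is on the heterogeneous cluster.
speedup = homogeneous_step_ms / heterogeneous_step_ms

# Relative cost: hourly price ratio times the time-to-train ratio.
relative_cost = (heterogeneous_price / homogeneous_price) * (
    heterogeneous_step_ms / homogeneous_step_ms
)
saving = 1 - relative_cost

print(f"heterogeneous price: ${heterogeneous_price:.2f}/hour")
print(f"speedup: {speedup:.1f}x")
print(f"relative cost: {relative_cost:.3f}")
print(f"saving: {saving:.1%}")
```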

<img src=images/homogeneous-vs-heterogeneous-results-table.png alt="results table" />

## F. Conclusion
In this notebook, we demonstrated how to use the Heterogeneous cluster feature of SageMaker Training with TensorFlow to achieve better price performance and increase training speed. To get started, you can copy this example project and change `train_dnn.py` to match your workload. To run the job, you can use this notebook or `start_job.py`.