# TensorFlow's tf.data.service with Amazon SageMaker Training Heterogeneous Clusters

---
### Intro

Heterogeneous clusters enable launching training jobs that use multiple instance types in a single job. This capability can improve training cost and speed by running different parts of model training on the most suitable instance type. A typical use case is computer vision DL training, where training is bottlenecked on the CPU resources needed for data augmentation, leaving the expensive GPUs underutilized. Heterogeneous clusters allow you to add more CPU resources to fully utilize the GPUs, increasing training speed and cost-efficiency. For more details, see the [documentation of this feature](https://docs.aws.amazon.com/sagemaker/latest/dg/train-heterogeneous-cluster.html).

This notebook demonstrates how to use the heterogeneous clusters feature of SageMaker Training with TensorFlow's [tf.data.service](https://www.tensorflow.org/api_docs/python/tf/data/experimental/service).

In this sample notebook, we'll train a CPU-intensive deep learning computer vision workload and compare a homogeneous and a heterogeneous training configuration. Both jobs train with the same data, pre-processing, and other relevant parameters:

<table>
  <tr>
    <td width=43%><b>Homogeneous Training Job</b><br/>
    In a Homogeneous training job the ml.p4d.24xlarge instance GPUs are under-utilized due to a CPU bottleneck.</td>
    <td width=57%><b>Heterogeneous Training Job</b><br/>
    In a Heterogeneous training job, we add two ml.c5.18xlarge instances with extra CPU cores to reduce the CPU bottleneck and drive up GPU usage, improving training speed and cost-efficiency.
    </td>
   </tr>
  <tr>
    <td> <img src=images/basic-homogeneous-job.png alt="homogeneous training job" /></td>
    <td><img src=images/basic-heterogeneous-job.png alt="Heterogeneous training job" /></td>
   </tr>
</table>



#### Workload Details
Training data is stored in TFRecord files in the `data` folder and is generated from the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset using the `generate_cifar10_tfrecords.py` script. The data pre-processing pipeline includes parsing images, dilation, blur filtering, and a number of [TensorFlow preprocessing layers](https://www.tensorflow.org/guide/keras/preprocessing_layers). We'll use a [ResNet50](https://www.tensorflow.org/api_docs/python/tf/keras/applications/ResNet50) architecture. The job runs on an 8-GPU instance, p4d.24xlarge, and uses Horovod for data parallelization.

### Setting up Heterogeneous clusters Training
We define `instance_groups` in the TensorFlow estimator to enable training jobs to use the heterogeneous clusters feature.
The data pre-processing code runs on the CPU nodes (referred to here as **data_group or data group**), whereas the deep neural network training code runs on the GPU nodes (referred to here as **dnn_group or dnn group**). In this example, the inter-node communication between the CPU and GPU instance groups is implemented using the [TensorFlow data service feature](https://www.tensorflow.org/api_docs/python/tf/data/experimental/service). This feature, introduced in TensorFlow 2.3, provides APIs for defining dedicated worker machines that perform data preprocessing. Note that SageMaker's heterogeneous clusters feature does not provide out-of-the-box support for communication between instance groups.


The training script (`launcher.py`) runs on all nodes regardless of which instance group they belong to. However, it has logic to detect (using SageMaker's `instance_group` environment variables) whether the node it is running on belongs to data_group or dnn_group.
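
The group detection described above can be sketched as follows. This is a minimal illustration, not the sample's actual `launcher.py`: it assumes the SageMaker training toolkit exposes the current group name in the `SM_CURRENT_INSTANCE_GROUP` environment variable, and the function name `script_for_node` is hypothetical.

```python
import os


def script_for_node(env=os.environ):
    """Return the script a node should launch, based on its instance group.

    Assumption: the group name is read from SM_CURRENT_INSTANCE_GROUP.
    The group names match the ones used in this notebook.
    """
    group = env.get("SM_CURRENT_INSTANCE_GROUP")
    if group == "data_group":
        return "train_data.py"  # dispatcher/worker servers
    if group == "dnn_group":
        return "train_dnn.py"  # neural network training
    raise ValueError(f"Unexpected instance group: {group!r}")


# Local usage example with a fake environment mapping
print(script_for_node({"SM_CURRENT_INSTANCE_GROUP": "data_group"}))
```

Passing the environment as a mapping keeps the function easy to exercise outside a training container.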

If the node belongs to data_group, `launcher.py` spawns a separate process by executing `train_data.py`. This script runs a dispatcher server on the first node of the instance group and a worker server on every node in the same instance group. The dispatcher server is responsible for distributing preprocessing tasks to one or more worker servers, each of which loads the raw data directly from storage and sends the processed data to the GPU device. The dispatcher server listens on port 6000, whereas each worker server listens on port 6001. By applying `tf.data.experimental.service.distribute` to your dataset, you can program the dataset to run all preprocessing operations, up to the point of application, on the workers. TFRecord files are copied over to instances in this group, as the workers load the raw data from those files. In this example, we use 2 ml.c5.18xlarge instances in the data_group. While the data is being dispatched to the dnn_group, the main entrypoint script `launcher.py` listens on port 16000 for a shutdown request coming from the dnn_group. Meanwhile, `train_data.py` waits for a shutdown action from its parent process.
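
A rough sketch of the dispatcher and worker setup described above is shown here. It is not the sample's `train_data.py`: the helper names `plan_data_services` and `start_data_services` are hypothetical, and treating the first host in sorted order as the "first node" is an assumption. The TensorFlow calls (`DispatchServer`, `DispatcherConfig`, `WorkerServer`, `WorkerConfig`) come from `tf.data.experimental.service`.

```python
def plan_data_services(current_host, group_hosts, dispatcher_port=6000, worker_port=6001):
    """Decide which tf.data.service servers this host should run.

    Assumption: the dispatcher runs on the first host in sorted order.
    """
    dispatcher_host = sorted(group_hosts)[0]
    return {
        "dispatcher_address": f"{dispatcher_host}:{dispatcher_port}",
        "worker_address": f"{current_host}:{worker_port}",
        "run_dispatcher": current_host == dispatcher_host,
    }


def start_data_services(plan, dispatcher_port=6000, worker_port=6001):
    """Start the servers in `plan`; both server types start on construction."""
    import tensorflow as tf  # imported lazily so the planning helper has no TF dependency

    svc = tf.data.experimental.service
    servers = []
    if plan["run_dispatcher"]:
        servers.append(svc.DispatchServer(svc.DispatcherConfig(port=dispatcher_port)))
    servers.append(
        svc.WorkerServer(
            svc.WorkerConfig(
                dispatcher_address=plan["dispatcher_address"],
                worker_address=plan["worker_address"],
                port=worker_port,
            )
        )
    )
    return servers  # keep references so the servers are not garbage collected


# Example: second host of a two-node data group
print(plan_data_services("algo-2", ["algo-2", "algo-1"]))
```

Separating the planning step from the TensorFlow calls makes it possible to check the address logic without starting any servers.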

If the node belongs to dnn_group, the main training script (`launcher.py`) spawns a separate set of processes by executing `train_dnn.py`. This script contains the deep neural network training code and, in some cases, can host additional data pre-processing components to maximize the use of CPUs on the dnn nodes. The processes running DNN training consume a stream of processed data from the dispatcher server (the first node in the data_group, at port 6000) and run model training. The dnn_group can also run distributed training on multiple nodes, as defined by the `instance_count` parameter (see details under the **Setting up the training environment** section of this notebook). Once the model is trained on the dataset, the dnn_group establishes a connection back to the dispatcher server on port 16000 to signal a shutdown request.
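
On the consuming side, connecting a dataset to the dispatcher might look like the sketch below. The helper names `dispatcher_target` and `distribute_via_service` are hypothetical, and the choice of `processing_mode="parallel_epochs"` is an assumption rather than something stated in this notebook; `tf.data.experimental.service.distribute` itself is the TensorFlow API referenced above.

```python
def dispatcher_target(dispatcher_host, port=6000):
    """Build the service address string expected by tf.data.service clients."""
    return f"grpc://{dispatcher_host}:{port}"


def distribute_via_service(dataset, dispatcher_host):
    """Run all preprocessing defined on `dataset` so far on the remote workers."""
    import tensorflow as tf  # imported lazily; only needed when actually distributing

    return dataset.apply(
        tf.data.experimental.service.distribute(
            processing_mode="parallel_epochs",  # assumption: each worker processes the full dataset
            service=dispatcher_target(dispatcher_host),
        )
    )


print(dispatcher_target("algo-1"))
```

Anything applied to the dataset after this call (for example, prefetching) would run on the GPU instance rather than on the data workers.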


The diagram below shows how data flows in heterogeneous cluster training with tf.data.service:

**NEED TO BE UPDATED**

<img src=images/tf.data.service-diagram.png width=600px>


This notebook refers to the following files and folders:

- Folders: 
  - `code`: contains the training scripts and the gRPC client-server code
  - `images`: contains images referred to in the notebook
- Files: 
  - `launcher.py`: entry point training script. This script is executed on all nodes irrespective of which group they belong to, as explained above.
  - `train_data.py`: this script runs on the data_group nodes and is responsible for setting up the dispatcher and worker servers
  - `train_dnn.py`: this script runs on the dnn_group nodes and is responsible for the DNN training code/algorithm
  - `requirements.txt`: lists the required packages (tensorflow-addons and protobuf)


---
**NOTE**

As an alternative to following this notebook, you can follow [readme.md](./readme.md), which allows you to set up and launch the training job from an IDE or the command line.  

---


At a high level, the notebook covers:
-  Setting up an Amazon SageMaker Studio notebook
-  Preparing the training dataset and uploading it to Amazon S3
-  Setting up the training environment
-  Submitting the training job
-  Monitoring and visualizing the CloudWatch metrics
-  Comparing time-to-train and cost-to-train
-  Conclusion

### A. Setting up SageMaker Studio notebook
#### Before you start
Ensure you have selected the Python 3 (_TensorFlow 2.6 Python 3.8 CPU Optimized_) image for your SageMaker Studio notebook, and that it is running on the _ml.t3.medium_ instance type.

#### Step 1 - Upgrade SageMaker SDK and dependent packages 
Heterogeneous Clusters for Amazon SageMaker model training was [announced](https://aws.amazon.com/about-aws/whats-new/2022/07/announcing-heterogeneous-clusters-amazon-sagemaker-model-training) on 07/08/2022. As a first step, ensure you have updated the SageMaker SDK, boto3, botocore, and the AWS CLI to versions that enable this feature.

In [None]:
%%bash
python3 -m pip install --upgrade boto3 botocore awscli sagemaker

#### Step 2 - Restart the notebook kernel 
From the JupyterLab menu bar, choose **Kernel > Restart Kernel...**

#### Step 3 - Validate SageMaker Python SDK and TensorFlow versions
Ensure the output of the cell below reflects:

- SageMaker Python SDK version 2.98.0 or above, 
- boto3 1.24 or above 
- botocore 1.27 or above 
- TensorFlow 2.6 or above 

In [2]:
!pip show sagemaker boto3 botocore tensorflow protobuf |egrep 'Name|Version|---'

Name: sagemaker
Version: 2.109.0
---
Name: boto3
Version: 1.24.72
---
Name: botocore
Version: 1.27.72
---
Name: tensorflow
Version: 2.8.0
---
Name: protobuf
Version: 3.20.1
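
If you prefer to check the minimum versions programmatically, a small sketch along these lines could be used. The helper names (`version_tuple`, `meets_minimum`, `check_installed`) are hypothetical, and the comparison only looks at the leading numeric part of the first three version components, which is a simplification.

```python
import re
from importlib import metadata

# Minimum versions listed above
MINIMUMS = {"sagemaker": "2.98.0", "boto3": "1.24", "botocore": "1.27", "tensorflow": "2.6"}


def version_tuple(version, width=3):
    """Convert '2.109.0' into (2, 109, 0), padding missing parts with zeros."""
    numbers = []
    for piece in version.split(".")[:width]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    return tuple(numbers + [0] * (width - len(numbers)))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_installed(minimums=MINIMUMS):
    """Map each package name to True/False; missing packages count as False."""
    results = {}
    for name, minimum in minimums.items():
        try:
            results[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            results[name] = False
    return results


print(meets_minimum("2.109.0", "2.98.0"))
```

Comparing integer tuples avoids the string-comparison pitfall where "2.109.0" would sort before "2.98.0".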


### B. Preparing Training Dataset

#### Download the CIFAR-10 dataset and convert it to TFRecord
The training dataset is stored in TFRecord files in the `data` folder and is generated from the CIFAR-10 dataset.

In [6]:
%%bash
python3 ./generate_cifar10_tfrecords.py --data-dir ./data
rm -rf /tmp/data.old && mv data data.old && mkdir data && cp data.old/train/train.tfrecords ./data/train.1.tfrecords && cp data.old/train/train.tfrecords ./data/train.2.tfrecords && mv data.old /tmp

### C. Setting up training environment
#### Step 1: Import SageMaker components and setup the IAM role and S3 bucket


In [10]:
import os
import json
import datetime

import sagemaker
from sagemaker import get_execution_role
from sagemaker.instance_group import InstanceGroup

sess = sagemaker.Session()
role = get_execution_role()
output_path = "s3://" + sess.default_bucket() + "/cifar10-tfrecord"
print(f"role={role}")
print(f"output_path={output_path}")

####  Step 2: Upload the tfrecord training data to S3 bucket and define training input

In [15]:
prefix = "cifar10-tfrecord"
bucket = sess.default_bucket()
print(f"Uploading data from ./data to s3://{bucket}/{prefix}/")
s3path = sess.upload_data(path="./data", bucket=bucket, key_prefix=prefix)

from sagemaker import TrainingInput
data_uri = TrainingInput(
    s3path,
    # instance_groups=['data_group'], # we don't need to restrict training channel to a specific group as we have data workers in both groups
    input_mode="FastFile",
)

#### Step 3 - Running a homogeneous training job
In this step we define and submit a homogeneous training job. It uses a single instance type (p4d.24xlarge) with 8 GPUs, and analysis will show that it is CPU-bound, leaving its GPUs underutilized.

In [16]:
import datetime
from sagemaker.tensorflow import TensorFlow
from sagemaker.instance_group import InstanceGroup
from sagemaker.inputs import TrainingInput
import os

hyperparameters = {
    "epochs": 10,
    "steps_per_epoch": 500,
    "batch_size": 1024,
    "tf_data_mode": "local",  # We won't be using tf.data.service ('service') for this homogeneous job
    "num_of_data_workers": 0,  # We won't be using tf.data.service ('service') for this homogeneous job
}

estimator = TensorFlow(
    entry_point="launcher.py",
    source_dir="code",
    framework_version="2.9.1",
    py_version="py39",
    role=role,
    volume_size=10,
    max_run=1800,  # 30 minutes
    disable_profiler=True,
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    hyperparameters=hyperparameters,
    distribution={
        "mpi": {
            "enabled": True,
            "processes_per_host": 8,  # 8 GPUs per host
            "custom_mpi_options": "--NCCL_DEBUG WARN",
        },
    },
)

### D. Submit the training job

Note: For the logs, click on **View logs** from the **Training Jobs** node in **Amazon SageMaker Console**. 


In [18]:
estimator.fit(
    inputs=data_uri,
    job_name="homogeneous-" + datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ"),
)

### E. Analyzing the homogeneous training job throughput and resource usage
We'll examine CPU and GPU usage, and epoch time and step time.

#### CPU and GPU usage analysis
In the screenshot below we observe that close to all 96 vCPUs of the instance are utilized, while GPU utilization is only ~40%. Clearly, if we had more vCPUs we could increase GPU usage significantly and so increase job throughput.

Note: To view your own job, click **View instance metrics** from the **Training jobs** node in the **Amazon SageMaker Console**. Then, to rescale the CloudWatch metrics to 100% on CPU utilization for algo-1 and algo-2, use the CloudWatch "Add Math" feature to divide each metric by the number of vCPUs/GPUs on those instance types.  
<img src="images/metrics homogeneous cpu and gpu usage.png" width=75%/>
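
The rescaling mentioned in the note is a simple division. The sketch below illustrates it under the assumption that the raw metric is reported as a sum across all vCPUs or GPUs of an instance (so it can exceed 100%); the function name and the sample values are hypothetical.

```python
def normalize_utilization(raw_percent, unit_count):
    """Rescale a summed utilization metric to a 0-100% range.

    raw_percent: metric value summed over all vCPUs (or GPUs) of the instance
    unit_count: number of vCPUs (or GPUs) on the instance type
    """
    if unit_count <= 0:
        raise ValueError("unit_count must be positive")
    return raw_percent / unit_count


# Illustrative values only, not measurements from this notebook
print(normalize_utilization(9000, 96))  # CPU on a 96-vCPU instance
print(normalize_utilization(320, 8))    # GPU on an 8-GPU instance
```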

#### Epoch time and step time analysis
For the 2nd and 3rd epochs, the output below should show approximately 105s/epoch - 209ms/step.

In [60]:
%%capture homogeneous_logs
estimator.sagemaker_session.logs_for_job(estimator.latest_training_job.name)

In [61]:
print(f"Printing step time for epochs and steps for {estimator.latest_training_job.name}")
for line in homogeneous_logs.stdout.split("\n"):
    if "mpirank:0" in line and "/epoch" in line:
        print(line)

Printing step time for epochs and steps for homogenous-20220909T140047Z
[1,mpirank:0,algo-1]<stdout>:500/500 - 132s - loss: 1.9972 - lr: 0.0033 - 132s/epoch - 263ms/step
[1,mpirank:0,algo-1]<stdout>:500/500 - 105s - loss: 1.8961 - lr: 0.0057 - 105s/epoch - 209ms/step
[1,mpirank:0,algo-1]<stdout>:500/500 - 105s - loss: 1.8536 - lr: 0.0080 - 105s/epoch - 211ms/step
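
To compare runs numerically rather than by eye, the epoch and step times can be extracted from these log lines. The sketch below assumes the Keras-style format shown in the output above; the function name `parse_epoch_times` is hypothetical.

```python
import re

# Matches fragments such as "105s/epoch - 209ms/step"
EPOCH_RE = re.compile(r"(\d+)s/epoch - (\d+)ms/step")


def parse_epoch_times(log_text):
    """Return a list of (seconds_per_epoch, ms_per_step) tuples found in the log."""
    return [(int(sec), int(ms)) for sec, ms in EPOCH_RE.findall(log_text)]


sample_log = """\
[1,mpirank:0,algo-1]<stdout>:500/500 - 132s - loss: 1.9972 - lr: 0.0033 - 132s/epoch - 263ms/step
[1,mpirank:0,algo-1]<stdout>:500/500 - 105s - loss: 1.8961 - lr: 0.0057 - 105s/epoch - 209ms/step
[1,mpirank:0,algo-1]<stdout>:500/500 - 105s - loss: 1.8536 - lr: 0.0080 - 105s/epoch - 211ms/step
"""

print(parse_epoch_times(sample_log))
# [(132, 263), (105, 209), (105, 211)]
```

The first epoch is typically slower than later ones, so it is usually excluded when averaging.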


#### Step 3 - Running a Heterogeneous Training Job
We'll now run a training job in heterogeneous clusters mode.  
Note the changes from the homogeneous cluster job:  
- We define two new instance groups that are provided to the `estimator` as the `instance_groups` parameter, which replaces the homogeneous parameters `instance_type` and `instance_count`.
- In the `distribution` parameter for Horovod we added a new parameter, `instance_groups`, that limits the MPI cluster to the `dnn_group`. The MPI cluster should include only the GPU nodes that run Horovod (which needs MPI). The `data_group` instances should not be part of the MPI cluster, as they set up their own `tf.data.service` cluster.

More on the two instance groups we use:
- `data_group` - two ml.c5.18xlarge instances, each with 72 vCPUs, to handle data preprocessing: reading data from S3, preprocessing it, and forwarding it to the `dnn_group`.
- `dnn_group` - a single p4d.24xlarge instance, with 8 GPUs and 96 vCPUs, to handle deep neural network optimization (forward and backward passes), relying on the 8 GPUs and some of the 96 vCPUs. To fully utilize the 96 vCPUs in the `dnn_group`, we'll start data workers on all instances in both groups, so we have 240 vCPUs (96+72+72) in total available for preprocessing (minus the vCPUs used for the neural network optimization process).

There are three Python scripts to know about:  
The 1st is `train_dnn.py`. This is your training script for the neural network; you should edit it to match your own use case. Note that this script isn't aware of the heterogeneous clusters setup, except when it initializes the tf.data dataset by calling this line: `ds = ds.apply(tf.data.experimental.service.distribute(...))`.  
The 2nd and 3rd scripts, which you're not supposed to edit when adapting to your own use case, do the heavy lifting required for using tf.data.service over the heterogeneous clusters feature.  
`train_data.py` includes functions to start and stop tf.data.service processes such as the dispatcher and worker servers.  
`launcher.py` has several responsibilities: 
- Serves as the single entrypoint script for all instances in all instance groups (SageMaker starts the same script on all instances).
- Identifies which instance group the node belongs to, and starts the relevant script accordingly (`train_dnn.py`, `train_data.py`, or sometimes both).
- Takes measures to ensure that tf.data.service processes shut down when training completes, as the training job completes only when all instances exit.
- Allows starting more than one process (for example, on the dnn_group instances we'll run both `train_dnn.py` and a tf.data.service worker to utilize the instance CPUs).
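
The shutdown handshake in the responsibilities above can be illustrated with a small stand-in. The sample project uses gRPC client-server code for this; the sketch below deliberately swaps in a plain TCP socket to stay self-contained, and all names (`wait_for_shutdown`, `send_shutdown`, `demo`) as well as the `b"shutdown"` message are hypothetical.

```python
import socket
import threading


def wait_for_shutdown(server_sock):
    """Block until one client connects, then report whether it sent b'shutdown'."""
    conn, _ = server_sock.accept()
    with conn:
        chunks = []
        while True:
            data = conn.recv(1024)
            if not data:  # client closed the connection
                break
            chunks.append(data)
    return b"".join(chunks) == b"shutdown"


def send_shutdown(host, port):
    """Connect to the listener and send the shutdown message."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"shutdown")


def demo():
    """Run listener and sender locally; the notebook's jobs use port 16000 across nodes."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # ephemeral port for the local demo
    server.listen(1)
    port = server.getsockname()[1]
    result = {}
    listener = threading.Thread(target=lambda: result.update(ok=wait_for_shutdown(server)))
    listener.start()
    send_shutdown("127.0.0.1", port)
    listener.join(timeout=5)
    server.close()
    return result.get("ok", False)


print(demo())
```

In the real job the listener role is played by the data group and the sender by the dnn group once training finishes, after which the data-side processes can exit and the job can complete.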

In [19]:
import datetime
from sagemaker.tensorflow import TensorFlow
from sagemaker.instance_group import InstanceGroup
from sagemaker.inputs import TrainingInput
import os

hyperparameters = {
    "epochs": 10,
    "steps_per_epoch": 500,
    "batch_size": 1024,
    "tf_data_mode": "service",  # We'll be using tf.data.service for this Heterogeneous clusters job
    "num_of_data_workers": 2,  # Data workers used with tf.data.service for this Heterogeneous clusters job
}

# Group for CPU instances to run tf.data.service dispatcher/workers processes.
# Two instances, matching the data_group description above.
data_group = InstanceGroup("data_group", "ml.c5.18xlarge", 2)
# Group for deep neural network (dnn) with accelerators (e.g., GPU, FPGA, etc.)
dnn_group = InstanceGroup("dnn_group", "ml.p4d.24xlarge", 1)

estimator2 = TensorFlow(
    entry_point="launcher.py",
    source_dir="code",
    framework_version="2.9.1",
    py_version="py39",
    role=role,
    volume_size=10,
    max_run=1800,  # 30 minutes
    disable_profiler=True,
    # instance_type='ml.p4d.24xlarge',
    # instance_count=1,
    instance_groups=[data_group, dnn_group],
    hyperparameters=hyperparameters,
    distribution={
        "mpi": {
            "enabled": True,
            "processes_per_host": 8,  # 8 GPUs per host
            "custom_mpi_options": "--NCCL_DEBUG WARN",
        },
        "instance_groups": [dnn_group],
    },
)

### D. Submit the training job

Note1: For the logs, click on **View logs** from the **Training Jobs** node in **Amazon SageMaker Console**. 
Note2: Ignore the 0 billable seconds shown below. See actual billable seconds in the AWS web console > SageMaker > Training Jobs > this job.

In [22]:
estimator2.fit(
    inputs=data_uri,
    job_name="heterogeneous-" + datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ"),
)

2022-09-13 21:29:25 Starting - Starting the training job.

### E. Analyzing the Heterogeneous training job throughput and resource usage
We'll examine CPU and GPU usage, and epoch time and step time.

#### CPU and GPU usage analysis
In the screenshot below we observe that GPU usage has increased to 73% (compared to ~40% in the homogeneous training run), which is what we were aiming for. CPU usage on all 3 instances is close to 90%.  

Note: To view your own job, click **View instance metrics** from the **Training jobs** node in the **Amazon SageMaker Console**. Then, to rescale the CloudWatch metrics to 100% on CPU utilization for algo-1 and algo-2, use the CloudWatch "Add Math" feature to divide each metric by the number of vCPUs/GPUs on those instance types.  
<img src="images/metrics Heterogeneous cpu and gpu usage.png" width=75%/>

#### Epoch time and step time analysis
From the 2nd epoch onwards you should see this printout in the logs of the dnn_group instance (p4d.24xlarge): 45s/epoch - 89ms/step.
Note that the instances are named algo-1, algo-2, and algo-3 randomly on each execution, so you'll need to open the logs of all instances to find the dnn_group instance.

## E. Comparing time-to-train and cost-to-train
The table below summarizes both jobs. We can see that:
- The Heterogeneous job is <b>2.4x faster to train</b> (86ms/step) than the homogeneous job (208ms/step).
- The Heterogeneous job is <b>50% cheaper to train</b> than the homogeneous job, even though the heterogeneous job costs more per hour ($45/hour vs $37.7/hour) because of the two extra c5.18xlarge instances it includes: `$45 = $37.7 + 2 * $3.67`.

The cost-to-train formula we used: change in hourly price `($45/$37.7)` times reduction in time-to-train `(86ms/208ms)` = `($45/$37.7) * (86ms/208ms)` ≈ 0.49, that is, roughly 50% of the homogeneous job's cost.
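
The same arithmetic can be reproduced in a few lines. This is only a sketch using the hourly prices and step times quoted above; the function name `relative_cost` is hypothetical, and prices vary by region and over time.

```python
def relative_cost(hetero_hourly, homo_hourly, hetero_step_ms, homo_step_ms):
    """Cost of the heterogeneous job relative to the homogeneous job (1.0 = same cost)."""
    price_ratio = hetero_hourly / homo_hourly  # change in hourly price
    time_ratio = hetero_step_ms / homo_step_ms  # reduction in time-to-train
    return price_ratio * time_ratio


ratio = relative_cost(45.0, 37.7, 86, 208)
speedup = 208 / 86
print(f"relative cost: {ratio:.2f}, speedup: {speedup:.1f}x")
# relative cost: 0.49, speedup: 2.4x
```

Plugging in your own measured step times and current instance prices gives a quick estimate of whether the extra CPU instances pay for themselves.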

<img src=images/homogeneous-vs-heterogeneous-results-table.png alt="results table" />

## F. Conclusion
In this notebook, we demonstrated how to use the heterogeneous clusters feature of SageMaker Training with TensorFlow to achieve better price performance and increase training speed. To get started, you can copy this example project and change only the `train_dnn.py` script. To run the job, you can use this notebook or `start_job.py`.