# Identify a Data Pipeline Bottleneck with Amazon SageMaker Debugger

In this notebook, we demonstrate how to identify a bottleneck in the `tf.data` pipeline of a ResNet50 training session. To simulate the bottleneck, we added a heavy data preprocessing task to the pipeline that modifies the CIFAR-10 dataset during training.

### TensorFlow Datasets package

First, set the notebook kernel to TensorFlow 2.x.

We use the CIFAR-10 dataset for this experiment. To download CIFAR-10 and convert it into TFRecord format, install the `tensorflow-datasets` package, run `demo/generate_cifar10_tfrecords.py`, and upload the TFRecord files to your S3 bucket.

In [None]:
!python demo/generate_cifar10_tfrecords.py --data-dir=./data

In [None]:
import sagemaker

s3_bucket = sagemaker.Session().default_bucket()

dataset_prefix='data/cifar10-tfrecords'
desired_s3_uri = f's3://{s3_bucket}/{dataset_prefix}'

dataset_location = sagemaker.s3.S3Uploader.upload(local_path='data', desired_s3_uri=desired_s3_uri)
print(f'Dataset uploaded to {dataset_location}')

## Step 1: Create a Training Job with Profiling Enabled<a class="anchor" id="option-1"></a>

We will use the standard [SageMaker Estimator API for TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) to create a training job. To enable profiling, we create a `ProfilerConfig` object and pass it to the `profiler_config` parameter of the `TensorFlow` estimator. For this demo, we set the profiler to collect system metrics every 500 milliseconds and to profile framework activity for 2 steps, starting at step 5.

### Set a profiler configuration

In [None]:
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
  system_monitor_interval_millis=500,
  framework_profile_params=FrameworkProfile(start_step=5, num_steps=2)  
)
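
To get a feel for what the monitoring interval means in practice, the following minimal sketch estimates how many system-metric samples are collected per metric over a job of a given length. The helper `samples_per_metric` is hypothetical and not part of the SageMaker SDK. The list of accepted intervals reflects the SageMaker Python SDK documentation as we understand it, so treat it as an assumption and check the current documentation.

```python
# Intervals (in milliseconds) that ProfilerConfig is documented to accept.
# Assumption: verify against the SDK version you use.
ALLOWED_INTERVALS_MS = (100, 200, 500, 1000, 5000, 60000)


def samples_per_metric(job_duration_seconds, interval_millis=500):
    """Estimate how many samples one system metric yields over a job (hypothetical helper)."""
    if interval_millis not in ALLOWED_INTERVALS_MS:
        raise ValueError(f"unsupported interval: {interval_millis} ms")
    return (job_duration_seconds * 1000) // interval_millis


# A 10-minute job sampled every 500 ms versus every 60 seconds.
print(samples_per_metric(600, 500))    # 1200
print(samples_per_metric(600, 60000))  # 10
```

A shorter interval gives a finer-grained utilization curve, which makes short stalls in the data pipeline easier to spot.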

### Define hyperparameters

The entry point script is [train_tf_bottleneck.py](./demo/train_tf_bottleneck.py). Define hyperparameters such as the number of epochs, the batch size, and data augmentation. The `dataset_bottleneck` hyperparameter turns the data augmentation on or off. To add the data preprocessing bottleneck, set `dataset_bottleneck` to `True`.

In [None]:
hyperparameters = {'epoch': 1, 
                   'batch_size': 1024,
                   'dataset_bottleneck': True
                  }
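
In script mode, SageMaker passes these hyperparameters to the entry point script as command-line arguments (for example, `--batch_size 1024`). The following is a minimal sketch of how a training script could read them. It is not the actual parsing code in `train_tf_bottleneck.py`, and the helper names `str_to_bool` and `parse_hyperparameters` are hypothetical.

```python
import argparse


def str_to_bool(value):
    """Convert a command-line string such as 'True' or 'false' to a bool."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_hyperparameters(argv=None):
    """Parse the three hyperparameters defined above (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epoch", type=int, default=1)
    parser.add_argument("--batch_size", type=int, default=1024)
    parser.add_argument("--dataset_bottleneck", type=str_to_bool, default=False)
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(
    ["--epoch", "1", "--batch_size", "1024", "--dataset_bottleneck", "True"]
)
print(args.epoch, args.batch_size, args.dataset_bottleneck)
```

Boolean values arrive as strings, so an explicit converter is used here; passing `type=bool` to `argparse` would treat any non-empty string, including `'False'`, as `True`.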

### Define SageMaker TensorFlow Estimator

In [None]:
import sagemaker
from sagemaker.tensorflow import TensorFlow

job_name = 'dataset-bottleneck'
instance_count = 1
instance_type = 'ml.p2.xlarge'
entry_script = 'train_tf_bottleneck.py'

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name=job_name,
    instance_type=instance_type,
    instance_count=instance_count,
    entry_point=entry_script,
    source_dir='demo',
    framework_version='2.3.1',
    py_version='py37',
    profiler_config=profiler_config,
    script_mode=True,
    hyperparameters=hyperparameters,
    input_mode='Pipe'
)

> If you see the error `TypeError: __init__() got an unexpected keyword argument 'instance_type'`, your SageMaker Python SDK is outdated. Update the SageMaker Python SDK to 2.x by running the command below, then restart this notebook.

```bash
pip install --upgrade sagemaker
```

### Start training job

The following `estimator.fit()` call with `wait=False` starts the training job in the background, so you can proceed to the dashboard or the analysis notebooks while the job runs.

In [None]:
remote_inputs = {'train' : dataset_location+'/train'}

estimator.fit(remote_inputs, wait=False)
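
Because `wait=False` returns immediately, you may want to check on the job from the notebook. The following is a minimal sketch of a polling loop, assuming `client` behaves like `boto3.client("sagemaker")`. The helper `wait_for_training_job` is hypothetical; `describe_training_job` and its `TrainingJobStatus` field are part of the SageMaker API.

```python
import time

# Statuses after which the job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll a training job until it reaches a terminal status, then return it."""
    while True:
        description = client.describe_training_job(TrainingJobName=job_name)
        status = description["TrainingJobStatus"]
        print(f"{job_name}: {status}")
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Example usage inside this notebook (requires AWS credentials):
#   import boto3
#   wait_for_training_job(boto3.client("sagemaker"),
#                         estimator.latest_training_job.name)
```

The `sleep` parameter exists only so the loop can be exercised without real delays; in normal use the default is fine.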

## Step 2: Monitor the system resource utilization using SageMaker Studio

While the training job is in progress or after it completes, go to `Debugger` in SageMaker Studio. You will see that GPU utilization stays low while CPU utilization is pinned at 100%. We know this is due to the change we made. But if you see this pattern in your own training job, you will want to know what is being executed on the CPU. The Python profiling functionality of SageMaker Debugger shows you what is happening or what happened.

![Debugger-in-Studio](./images/datapipeline-bottleneck.png)

## Step 3: Investigate the bottleneck interactively using Debugger analysis APIs

To analyze the Python profiling data gathered by SageMaker Debugger, open the [interactive_analysis.ipynb](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/profiling_analysis_tools/interactive_analysis.ipynb) notebook from the SageMaker Debugger examples repository. Then set your training job name and the AWS Region where the training job ran.

```python
training_job_name = '<PUT YOUR TRAINING JOB NAME>'
region = '<AWS REGION WHERE THE TRAINING JOB WAS EXECUTED>' 
```

Run the notebook cells one by one until you reach a plot similar to the following, which is a zoomed-in version of the plots above. Follow the guide in the notebook to choose a time interval for a deep-dive investigation.

![Debugger-in-Studio](./images/datapipeline-bottleneck-cpugpuutil.png)
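
If you prefer to pick the interval programmatically rather than by eye, the idea can be sketched as follows: scan the utilization samples and keep the longest stretch where the CPU is busy while the GPU is mostly idle. This is an illustration only. The function `find_bottleneck_interval`, the tuple layout of `samples`, and the thresholds are all hypothetical and not part of the Debugger analysis APIs.

```python
def find_bottleneck_interval(samples, cpu_min=90.0, gpu_max=30.0):
    """Return (start, end) of the longest run with high CPU and low GPU utilization.

    `samples` is an iterable of (timestamp, cpu_percent, gpu_percent) tuples
    sorted by timestamp. Returns None if no sample matches.
    """
    best = None       # longest matching run found so far
    run_start = None  # timestamp where the current run began
    for timestamp, cpu, gpu in samples:
        if cpu >= cpu_min and gpu <= gpu_max:
            if run_start is None:
                run_start = timestamp
            if best is None or timestamp - run_start > best[1] - best[0]:
                best = (run_start, timestamp)
        else:
            run_start = None  # the run is broken by a non-matching sample
    return best


# Made-up samples: (timestamp, CPU %, GPU %)
samples = [(0, 50, 80), (1, 99, 10), (2, 100, 5), (3, 98, 12), (4, 60, 70), (5, 99, 3)]
print(find_bottleneck_interval(samples))  # (1, 3)
```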

The functions executed during the selected period are listed below, along with the cumulative time spent in each. In our case, most of the time is spent in `GetNext` functions, which fetch the next element from the `tf.data` pipeline.

```python
view_timeline_charts.find_time_annotations([13527,13548])  
```

```
Selected timerange: 1606920919.15031 to 1606920929.67799 
Spent 0.079906123 ms (cumulative time) in Step:ModeKeys.TRAIN_79 
Spent 3.202e-06 ms (cumulative time) in PipeModeDatasetOp::Dataset::Iterator::GetNext 
Spent 3.5277999999999976e-05 ms (cumulative time) in tensorflow::data::(anonymous namespace)::ParallelMapIterator::GetNext 
Spent 9.340299999999999e-05 ms (cumulative time) in tensorflow::data::RepeatDatasetOp::Dataset::ForeverIterator::GetNext 
Spent 9.553100000000004e-05 ms (cumulative time) in tensorflow::data::ShuffleDatasetOp::ReshufflingDatasetV2::Iterator::GetNext 
Spent 0.079633313 ms (cumulative time) in tensorflow::data::experimental::MapAndBatchDatasetOp::Dataset::Iterator::GetNext 
Spent 0.078741146 ms (cumulative time) in tensorflow::data::PrefetchDatasetOp::Dataset::Iterator::GetNext 
Spent 0.078741162 ms (cumulative time) in tensorflow::data::(anonymous namespace)::ModelDatasetOp::Dataset::Iterator::GetNext 
Spent 0.078741179 ms (cumulative time) in IteratorResource::GetNext 
Spent 0.078741203 ms (cumulative time) in IteratorGetNextOp::DoCompute 
```
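
With many annotations, it can help to rank them by cumulative time. The following minimal sketch parses text in the format printed above and sorts the entries from slowest to fastest. The helper `rank_annotations` is hypothetical, and the sample text is a subset of the output shown above.

```python
import re

# Matches lines such as:
#   Spent 0.079633313 ms (cumulative time) in tensorflow::data::...::GetNext
LINE_PATTERN = re.compile(r"^Spent\s+(\S+)\s+ms \(cumulative time\) in (.+?)\s*$")


def rank_annotations(output_text):
    """Return (name, cumulative_time) pairs sorted from slowest to fastest."""
    entries = []
    for line in output_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:  # lines such as "Selected timerange: ..." are skipped
            entries.append((match.group(2), float(match.group(1))))
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


sample_output = """\
Selected timerange: 1606920919.15031 to 1606920929.67799
Spent 3.202e-06 ms (cumulative time) in PipeModeDatasetOp::Dataset::Iterator::GetNext
Spent 0.079633313 ms (cumulative time) in tensorflow::data::experimental::MapAndBatchDatasetOp::Dataset::Iterator::GetNext
Spent 0.078741146 ms (cumulative time) in tensorflow::data::PrefetchDatasetOp::Dataset::Iterator::GetNext
"""

for name, spent in rank_annotations(sample_output):
    print(f"{spent:.9f}  {name}")
```

In this subset, the `MapAndBatchDatasetOp` iterator ranks first, while reading from the Pipe mode input is negligible by comparison.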

You can also download the generated timeline file and open it in the Chrome tracing tool, which visualizes the timeline profiling data. Again, `GetNext` is the heaviest function call.

![Debugger-in-Studio](./images/datapipeline-bottleneck-timeline.png)

If you look at the data pipeline in the training code, the `data_augmentation` function applies a Gaussian blur filter to every image, which slows down the data pipeline that feeds the GPU.

```python
def data_augmentation(image, label): 
    import tensorflow_addons as tfa 
    for i in range(1): 
        image = tfa.image.gaussian_filter2d(image=image, filter_shape=(11, 11), sigma=0.8) 
    return image, label 

if dataset_bottleneck: 
    dataset = dataset.map(data_augmentation, num_parallel_calls=tf.data.experimental.AUTOTUNE) 
```

This is the bottleneck to resolve, either by removing the filter or by applying it to the dataset in advance, before training starts.