# Visualize Training Jobs and Performance of Your Model Using TensorBoard on SageMaker


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

---


## Background
[TensorBoard on Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/tensorboard-on-sagemaker.html) can be used to monitor and visualize metrics collected on a SageMaker training job. 
This notebook example shows how to use the `sagemaker.interactive_apps.tensorboard.TensorBoardApp` API, which launches the hosted TensorBoard application within SageMaker. The sample training script is prepared with PyTorch, the SageMaker data parallelism library `smdistributed.dataparallel` for distributed training, and the MNIST dataset.

### Dataset
This example uses the MNIST dataset. MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits).

**NOTE:** This example requires SageMaker Python SDK v2.X.

In [None]:
!pip install sagemaker --upgrade

### Initialize SageMaker

In [None]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-sagemaker-tensorboard-pytorch"
region = sagemaker_session.boto_region_name

role = sagemaker.get_execution_role()
role_name = role.split("/")[-1]  # the role name is the last path segment of the ARN
print(f"The Amazon Resource Name (ARN) of the role used for this demo is: {role}")
print(f"The name of the role used for this demo is: {role_name}")

To verify that the role above has required permissions:

1. Go to the IAM console: https://console.aws.amazon.com/iam/home.
2. Select **Roles**.
3. Enter the role name in the search box to find the role retrieved from the output of the previous code cell. 
4. Select the role.
5. Use the **Permissions** tab to verify that the role has all the required permissions attached.
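
The console steps above can also be approximated from code. The following is a minimal sketch, not part of the original notebook: `attached_policy_names` is a hypothetical helper that lists the managed policies attached to a role through the IAM `list_attached_role_policies` API. It assumes the caller is allowed to call `iam:ListAttachedRolePolicies`, and it does not report inline policies.

```python
def attached_policy_names(iam_client, role_name):
    """Return the sorted names of the managed policies attached to an IAM role.

    `iam_client` is expected to behave like `boto3.client("iam")`.
    Inline policies are not included in the result.
    """
    names = []
    paginator = iam_client.get_paginator("list_attached_role_policies")
    for page in paginator.paginate(RoleName=role_name):
        names.extend(policy["PolicyName"] for policy in page["AttachedPolicies"])
    return sorted(names)
```

In this notebook, one way to call it would be `attached_policy_names(boto3.client("iam"), role.split("/")[-1])` after `import boto3`. Which policies count as "required" depends on your account setup, so the helper only lists what is attached.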

## Model training with SageMaker distributed data parallel

### Training script background

The MNIST dataset is downloaded using the `torchvision.datasets` PyTorch module; you can see how this is implemented in the `train_pytorch_smdataparallel_mnist.py` training script that is printed out in the next cell.

The training script provides the code you need for distributed data parallel (DDP) training using SageMaker's distributed data parallel library (`smdistributed.dataparallel`). For details about how to use `smdistributed.dataparallel`'s DDP in your native PyTorch script, see [Modify a PyTorch Training Script Using SMD Data Parallel](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp.html#data-parallel-modify-sdp-pt).

### Customizing training script for TensorBoard data collection

To support TensorBoard data collection, you need to explicitly modify the training script to store the data using the `torch.utils.tensorboard.SummaryWriter` object. This object requires TensorBoard to be installed in the training container, so you need to list `tensorboard` as a dependency in `requirements.txt`.

In distributed training jobs, using the `SummaryWriter` across multiple ranks can cause duplicate data collection, so the `SummaryWriter` should only be created and written to on a specific rank:


```
from torch.utils.tensorboard import SummaryWriter

...
args.rank = rank = dist.get_rank()
args.local_rank = local_rank = int(os.getenv("LOCAL_RANK", -1))
...
writer = SummaryWriter('/opt/ml/output/tensorboard') if rank == 0 and local_rank == 0 else None
...

def train(args, model, device, train_loader, optimizer, epoch):
    ...
    # Only write on rank 0 to avoid duplicate entries
    if writer:
        writer.add_scalar("example_metric", example_metric)
        
# Close the writer at the end of training (writer is None on all other ranks)
if writer:
    writer.close()
```
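
As an alternative to checking `if writer:` before every call, a no-op stand-in can be returned on non-primary ranks. The sketch below is one possible pattern, not what the training script in this example does: `NullWriter` and `make_writer` are hypothetical names, and `writer_factory` is assumed to be a callable such as `SummaryWriter` that takes a log directory.

```python
class NullWriter:
    """Stand-in that accepts any SummaryWriter-style method call and does nothing."""

    def __getattr__(self, name):
        # Any attribute that is not defined resolves to a function that ignores its arguments.
        def _noop(*args, **kwargs):
            return None

        return _noop


def make_writer(rank, local_rank, writer_factory, log_dir="/opt/ml/output/tensorboard"):
    """Create a real writer on the primary rank and a NullWriter everywhere else."""
    if rank == 0 and local_rank == 0:
        return writer_factory(log_dir)
    return NullWriter()
```

With `writer = make_writer(rank, local_rank, SummaryWriter)`, calls such as `writer.add_scalar(...)` and `writer.close()` can then be made unconditionally on every rank, and only the primary rank writes event files.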
To see the full training script, run the cell below:

In [None]:
!pygmentize code/train_pytorch_smdataparallel_mnist.py

### Estimator function options

`TensorBoardOutputConfig` is used to inform the training environment where the local TensorBoard data is stored in the training container and which S3 location it is uploaded to.


In [None]:
from sagemaker.debugger import TensorBoardOutputConfig

tensorboard_s3_output_path = "s3://{}/{}".format(bucket, prefix)

# Create a TensorBoardOutputConfig object to configure automatic data upload from
# the training job's local data storage to S3
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_output_path,
    container_local_output_path="/opt/ml/output/tensorboard",
)

In [None]:
from sagemaker.pytorch import PyTorch

env = {
    "SAGEMAKER_REQUIREMENTS": "requirements.txt",  # tensorboard dependency required for PyTorch
}

estimator = PyTorch(
    base_job_name="pytorch-tensorboard-dataparallel-mnist",
    source_dir="code",
    entry_point="train_pytorch_smdataparallel_mnist.py",
    role=role,
    framework_version="1.11.0",
    py_version="py38",
    instance_count=1,
    env=env,
    instance_type="ml.p3.16xlarge",
    sagemaker_session=sagemaker_session,
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    # Configure TensorBoard output config in estimator to enable automatic sync in training job
    tensorboard_output_config=tensorboard_output_config,
)

In [None]:
estimator.fit(wait=False)
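
Because `fit(wait=False)` returns as soon as the job is submitted, the notebook does not show the job's progress by itself. The sketch below is one way to poll the job status through the SageMaker `DescribeTrainingJob` API. `wait_for_training_job` is a hypothetical helper, and the polling interval and limit are arbitrary assumptions.

```python
import time

# Statuses after which a training job no longer changes state.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll DescribeTrainingJob until a terminal status or until max_polls is reached.

    `sm_client` is expected to behave like `boto3.client("sagemaker")`.
    Returns the last status that was observed.
    """
    status = None
    for _ in range(max_polls):
        desc = sm_client.describe_training_job(TrainingJobName=job_name)
        status = desc["TrainingJobStatus"]
        print(f"{job_name}: {status} ({desc.get('SecondaryStatus', 'n/a')})")
        if status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
    return status
```

In this notebook, it could be called as `wait_for_training_job(sagemaker_session.sagemaker_client, estimator.latest_training_job.name)`. TensorBoard can be opened while the job is still `InProgress`, so waiting for completion is optional.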

## Track training job in TensorBoard

Once the job has started, you can use the `TensorBoardApp.get_app_url()` API from `sagemaker.interactive_apps.tensorboard` to generate a URL that opens the TensorBoard application with your training job already added.

If you are running the command from SageMaker Studio, the Python SDK will generate a direct URL. Otherwise, you will receive an AWS console URL, which you can access in order to select the application Domain and User Profile that you want the TensorBoard session to run under. 

For more information on how to configure TensorBoard on SageMaker, see [Use TensorBoard to Debug and Analyze Training Jobs in Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/tensorboard-on-sagemaker.html).

In [None]:
from sagemaker.interactive_apps.tensorboard import TensorBoardApp

app = TensorBoardApp(region)

print("Navigate to the following URL:")
# Use the public attribute for the job name instead of the private _current_job_name
print(app.get_app_url(training_job_name=estimator.latest_training_job.name))
print("Data may not appear until job is started and emitting data")
estimator.logs()
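
If TensorBoard shows no data, it can help to check whether any event files have reached S3 yet. The sketch below is a hypothetical helper, not part of the original notebook. It assumes the event files keep the standard TensorBoard naming (file names containing `tfevents`), and because the exact sub-path that SageMaker uses under the configured S3 output path is not stated here, it searches the whole prefix.

```python
def list_event_files(s3_client, bucket, prefix):
    """Return the S3 keys under bucket/prefix that look like TensorBoard event files.

    `s3_client` is expected to behave like `boto3.client("s3")`.
    """
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from the response when a page has no objects.
        for obj in page.get("Contents", []):
            if "tfevents" in obj["Key"]:
                keys.append(obj["Key"])
    return keys
```

In this notebook, it could be called as `list_event_files(boto3.client("s3"), bucket, prefix)` after `import boto3`. An empty list suggests that the job has not written or uploaded any TensorBoard data yet.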

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)
