# MNIST Training using PyTorch and TensorFlow

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Host](#Host)

---

## Background

MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of handwritten digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial shows how to train and test an MNIST model on SageMaker using PyTorch and TensorFlow.

For more information about PyTorch in SageMaker, please visit the [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers) and [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) GitHub repositories.

---

## Setup

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by creating a SageMaker session and specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, you should replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).
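
If you pass an explicit role string instead of calling `sagemaker.get_execution_role()`, a quick format check can catch copy-paste mistakes before a job is submitted. The sketch below is only an illustration: the helper name `looks_like_role_arn` and the example ARN are hypothetical, and the pattern checks the general `arn:<partition>:iam::<12-digit account>:role/<name>` shape, not whether the role exists or carries the right permissions.

```python
import re

# General shape of an IAM role ARN; the role name may include a path.
_ROLE_ARN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+$")


def looks_like_role_arn(value: str) -> bool:
    """Return True if `value` has the shape of an IAM role ARN."""
    return bool(_ROLE_ARN.match(value))


# Hypothetical ARN used purely for illustration.
example_role = "arn:aws:iam::123456789012:role/MySageMakerExecutionRole"
print(looks_like_role_arn(example_role))
```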


In [1]:
install_needed = True  # should only be True once
# install_needed = False

In [2]:
import sys
import IPython

if install_needed:
    print("installing deps and restarting kernel")
    !{sys.executable} -m pip install --upgrade pip 
    !{sys.executable} -m pip install -U sagemaker smdebug ipywidgets --user
    !{sys.executable} -m pip install -U torchvision
    !/bin/bash ./local/local_mode_setup.sh
    IPython.Application.instance().kernel.do_shutdown(True)

installing deps and restarting kernel
sudo: systemctl: command not found
Stopping docker: [  OK  ]
Starting docker: . [  OK  ]
SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


In [1]:
import sagemaker
from sagemaker.debugger import (
    Rule, ProfilerRule, rule_configs, ProfilerConfig, 
    FrameworkProfile, DetailedProfilingConfig, 
    DataloaderProfilingConfig, PythonProfilingConfig)
import time

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/DEMO-pytorch-mnist'

role = sagemaker.get_execution_role()

In [2]:
sagemaker.__version__

'2.72.3'

## Data
### Getting the data



In [3]:
from torchvision import datasets, transforms

datasets.MNIST('data', download=True, transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
]))

Dataset MNIST
    Number of datapoints: 60000
    Root location: data
    Split: Train
    StandardTransform
Transform: Compose(
               ToTensor()
               Normalize(mean=(0.1307,), std=(0.3081,))
           )

### Uploading the data to S3
We use the `sagemaker.Session.upload_data` function to upload the dataset to an S3 location. The return value, `inputs`, identifies that location; we will use it later when we start the training job.


In [4]:
inputs = sagemaker_session.upload_data(path='data', bucket=bucket, key_prefix=prefix)
print('input spec (in this case, just an S3 path): {}'.format(inputs))

input spec (in this case, just an S3 path): s3://sagemaker-us-east-1-322537213286/sagemaker/DEMO-pytorch-mnist
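
As the printed value suggests, the returned location is the bucket name and key prefix joined into an S3 URI. The following sketch reproduces that shape so you can predict where the data will land; the helper `s3_uri` and the bucket name are hypothetical, and this is not the SDK's implementation.

```python
def s3_uri(bucket: str, key_prefix: str) -> str:
    """Join a bucket name and a key prefix into an s3:// URI."""
    return "s3://{}/{}".format(bucket, key_prefix.strip("/"))


# Hypothetical bucket name used purely for illustration.
print(s3_uri("my-example-bucket", "sagemaker/DEMO-pytorch-mnist"))
```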


## Train
### Training script
The script provides all the code we need for training and hosting a SageMaker model (`model_fn` function to load a model).
The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:

* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to.
  These artifacts are uploaded to S3 for model hosting.

Supposing one input channel, 'training', was used in the call to the PyTorch estimator's `fit()` method, the following will be set, following the format `SM_CHANNEL_[channel_name]`:

* `SM_CHANNEL_TRAINING`: A string representing the path to the directory containing data in the 'training' channel.

For more information about training environment variables, please visit [SageMaker Containers](https://github.com/aws/sagemaker-containers).

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance.
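
A minimal sketch of that argument handling is shown here. It is not the script used by this notebook: the function name `parse_args`, the option names `--model-dir` and `--data-dir`, and the local fallback directories are assumptions made for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths."""
    parser = argparse.ArgumentParser()
    # Hyperparameter passed from the estimator, e.g. hyperparameters={'epochs': 5}.
    parser.add_argument("--epochs", type=int, default=1)
    # Paths default to the SageMaker environment variables; the fallbacks
    # are hypothetical local directories for running outside SageMaker.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--data-dir", default=os.environ.get("SM_CHANNEL_TRAINING", "./data"))
    return parser.parse_args(argv)


args = parse_args(["--epochs", "5"])
print(args.epochs, args.model_dir, args.data_dir)
```

Accepting an explicit `argv` list keeps the function easy to exercise from a notebook or a unit test, while `parse_args()` with no arguments still reads the real command line.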

Because SageMaker imports the training script, you should put your training code in a main guard (``if __name__=='__main__':``) if you are using the same script to host your model, as we do in this example, so that SageMaker does not inadvertently run your training code at the wrong point in execution.
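
The layout can be sketched as follows. The bodies are placeholders rather than the notebook's actual script: `train` and the `calls` list are hypothetical names used only to show which code runs on import and which runs when the file is executed.

```python
calls = []  # records which functions ran, for illustration only


def model_fn(model_dir):
    """Hosting entry point: load and return the model from `model_dir`."""
    calls.append("model_fn")
    return {"loaded_from": model_dir}  # placeholder standing in for a real model


def train():
    """Placeholder for the training loop."""
    calls.append("train")


if __name__ == "__main__":
    # Runs only when the file is executed as a script (training),
    # not when it is imported so that model_fn can be called (hosting).
    train()
```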

For example, the script run by this notebook:

In [5]:
# !pygmentize ./train_code/pytorch_mnist_smdp.py

### Configure rules
We specify the following rules:
- loss_not_decreasing: checks whether the loss is decreasing and triggers if the loss has not decreased by a certain percentage in the last few iterations
- LowGPUUtilization: checks whether the GPU is underutilized
- ProfilerReport: runs the entire set of performance rules and creates a final output report with further insights and recommendations.

In [6]:
rules=[ 
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]
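
To make the idea behind a "loss not decreasing" check concrete, here is a simplified sketch of the comparison such a rule might perform. It is an illustration under stated assumptions, not the built-in rule's implementation: the function, its parameter names, and its default thresholds are all hypothetical.

```python
def loss_not_decreasing(losses, num_steps=10, diff_percent=0.1):
    """Return True if the loss failed to drop by at least `diff_percent`
    percent over the last `num_steps` recorded values."""
    if len(losses) <= num_steps:
        return False  # not enough history to judge
    previous, current = losses[-num_steps - 1], losses[-1]
    if previous == 0:
        return False  # avoid division by zero; nothing left to improve
    drop_percent = (previous - current) / abs(previous) * 100
    return drop_percent < diff_percent


steady = [1.0 / (step + 1) for step in range(20)]  # keeps improving
stalled = [0.5] * 20                               # flat loss
print(loss_not_decreasing(steady), loss_not_decreasing(stalled))  # False True
```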

### Create a Training Job with Profiling Enabled<a class="anchor" id="option-1"></a>

You will use the standard [SageMaker Estimator API for PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html) to create training jobs. To enable profiling, create a `ProfilerConfig` object and pass it to the `profiler_config` parameter of the `PyTorch` estimator.

In [7]:
profiler_config=ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(
        start_step=5,num_steps=10,
        detailed_profiling_config=DetailedProfilingConfig(start_step=5, num_steps=10),
        dataloader_profiling_config=DataloaderProfilingConfig(start_step=5, num_steps=10),
        python_profiling_config=PythonProfilingConfig(start_step=5, num_steps=10), # cprofile / Pyinstrument
    )
)
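
With `start_step=5` and `num_steps=10`, framework profiling covers a window of ten training steps. The sketch below spells out that window, assuming it starts at `start_step` and runs for exactly `num_steps` consecutive steps; the helper `profiled_steps` is hypothetical and is not part of the SageMaker SDK.

```python
def profiled_steps(start_step: int, num_steps: int) -> range:
    """Steps covered by a window that starts at `start_step`
    and lasts for `num_steps` steps."""
    return range(start_step, start_step + num_steps)


window = profiled_steps(start_step=5, num_steps=10)
print(window[0], window[-1], len(window))  # 5 14 10
```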

In [8]:
metric_definitions=[
     {'Name': 'train:Loss', 'Regex': 'Loss: (.*?),'},
     {'Name': 'test:Accuracy', 'Regex': 'Accuracy: (.*?)%'},
]
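
Each `Regex` is applied to the training job's log output, and the first capture group becomes the metric value. You can test the patterns locally before launching a job, as in this sketch. The log lines are hypothetical (the real format depends on what the training script prints), and `extract_metrics` is an illustrative helper, not a SageMaker function.

```python
import re

metric_definitions = [
    {"Name": "train:Loss", "Regex": "Loss: (.*?),"},
    {"Name": "test:Accuracy", "Regex": "Accuracy: (.*?)%"},
]

# Hypothetical log lines; the real format depends on the training script.
log_lines = [
    "Train Epoch: 1 Loss: 0.2345, Step: 100",
    "Test set: Accuracy: 98.70%",
]


def extract_metrics(lines, definitions):
    """Return (name, captured text) pairs for every regex that matches a line."""
    found = []
    for line in lines:
        for definition in definitions:
            match = re.search(definition["Regex"], line)
            if match:
                found.append((definition["Name"], match.group(1)))
    return found


print(extract_metrics(log_lines, metric_definitions))
```

Note that the loss pattern only matches when a comma follows the value, so a log line ending in `Loss: 0.2345` with no trailing comma would not be captured.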

### Model training with SMDataParallel


The training script provides the code you need for distributed data parallel (DDP) training using SMDataParallel. The training script is very similar to a PyTorch training script you might run outside of SageMaker, but modified to run with SMDataParallel. SMDataParallel's PyTorch client provides an alternative to PyTorch's native DDP. For details about how to use SMDataParallel's DDP in your native PyTorch script, see the Getting Started with SMDataParallel tutorials.

In the following code block, you can update the estimator function to use a different instance type, instance count, and distribution strategy. You're also passing in the training script you reviewed in the previous cell.

**Instance types**

SMDataParallel supports model training on SageMaker with the following instance types only:
1. ml.p3.16xlarge
1. ml.p3dn.24xlarge [Recommended]
1. ml.p4d.24xlarge [Recommended]

**Instance count**

To get the best performance and the most out of SMDataParallel, you should use at least 2 instances, but you can also use 1 for testing this example.

**Distribution strategy**

Note that to use DDP mode, you update the `distribution` strategy and set it to use `smdistributed dataparallel`.

In [82]:
distribution = {"smdistributed": {
                    "dataparallel": {
                            "enabled": True
                    }
               }
             }

In [83]:
instance_type = 'ml.p3.16xlarge'
instance_count = 2
entry_point = 'pytorch_mnist_smdp.py'

In [88]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point=entry_point,
                    source_dir='train_code',
                    role=role,
                    framework_version='1.10',
                    py_version='py38',
                    instance_count=instance_count,
                    instance_type=instance_type,
                    distribution=distribution,
                    metric_definitions=metric_definitions,
                    profiler_config=profiler_config,
                    rules=rules,
#                     disable_profiler=True,
                    use_spot_instances=False,
#                     max_wait=3*60*60,
                    max_run=1*30*60,
                    hyperparameters={
                        'epochs': 5
                    }
                   )

After we've constructed our estimator object (`PyTorch` above, or the `TensorFlow` alternative below, which reuses the `estimator` variable), we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.


In [116]:
instance_type = 'ml.p3.16xlarge'
instance_count = 2
entry_point = 'tf_mnist_smdp.py'

In [130]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow( entry_point=entry_point,
                        source_dir='train_code',
                        role=role,
                        framework_version='2.6.2',
                        py_version='py38',
                        instance_count=instance_count,
                        instance_type=instance_type,
                        distribution=distribution,
                        metric_definitions=metric_definitions,
                        profiler_config=profiler_config,
                        rules=rules,
#                         disable_profiler=True,
                        use_spot_instances=False,
#                         max_wait=3*60*60,
                        max_run=3*60*60,
                        hyperparameters={
                            'epochs': 5
                        }
                   )


In [131]:
job_name = "training-job-{}".format(int(time.time()))

estimator.fit({'training': inputs},
              job_name=job_name,
              wait=False
            )

In [132]:
sagemaker_session.logs_for_job(job_name=job_name, wait=True)

2022-01-11 13:08:00 Starting - Starting the training job...
2022-01-11 13:08:11 Starting - Launching requested ML instancesLossNotDecreasing: InProgress
ProfilerReport: InProgress
.........
2022-01-11 13:09:57 Starting - Preparing the instances for training.........
2022-01-11 13:11:31 Downloading - Downloading input data...
2022-01-11 13:11:53 Training - Downloading the training image....................
2022-01-11 13:15:06.211997: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2022-01-11 13:15:06.218076: I tensorflow/core/profiler/internal/smprofiler_config_reader.cc:123] PID of the process that is writing to the timeline : 1
2022-01-11 13:15:06.218689: I tensorflow/core/profiler/internal/smprofiler_timeline.cc:121] SageMaker Profiler Timeline Writer read the following config parameters :
2022-01-11 13:15:06.218703: I tensorflow/core/profiler/internal/smprofiler_timeline.cc:122] Base Folder : /opt/ml

### Download Debugger Profiling Report
The profiling report rule creates an HTML report, `profiler-report.html`, with a summary of the built-in rules and recommendations for next steps. You can find this report in your S3 bucket. For more information about how to download and open the Debugger profiling report, see [SageMaker Debugger Profiling Report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html) in the SageMaker developer guide.

In [133]:
rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(f"You will find the profiler report in {rule_output_path}")

You will find the profiler report in s3://sagemaker-us-east-1-322537213286/training-job-1641906479/rule-output
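
The string concatenation above relies on `estimator.output_path` ending with a slash. If you set a custom output path without one, the components run together. A small helper that normalizes the separator avoids this; the sketch below is an assumption-labeled illustration, with a hypothetical function name, bucket, and job name.

```python
def rule_output_uri(output_path: str, job_name: str) -> str:
    """Build '<output_path>/<job_name>/rule-output' with exactly one
    slash between components, whether or not output_path ends in '/'."""
    return "/".join([output_path.rstrip("/"), job_name, "rule-output"])


# Hypothetical bucket and job name used purely for illustration.
print(rule_output_uri("s3://my-example-bucket/", "training-job-123"))
print(rule_output_uri("s3://my-example-bucket", "training-job-123"))
```

Both calls print the same URI, `s3://my-example-bucket/training-job-123/rule-output`.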


In [134]:
!aws s3 ls {rule_output_path}/ProfilerReport/profiler-output/

                           PRE profiler-reports/
2022-01-11 13:18:49     373924 profiler-report.html
2022-01-11 13:18:48     222364 profiler-report.ipynb


In [135]:
import os

output_dir = './output'
!rm -rf $output_dir

profile_output = output_dir+'/ProfilerReport'

# Create the directory (and any parents); no error if it already exists.
os.makedirs(profile_output, exist_ok=True)

In [136]:
!aws s3 cp {rule_output_path}/ProfilerReport/profiler-output/ {output_dir}/ProfilerReport/ --recursive

download: s3://sagemaker-us-east-1-322537213286/training-job-1641906479/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json to output/ProfilerReport/profiler-reports/BatchSize.json
download: s3://sagemaker-us-east-1-322537213286/training-job-1641906479/rule-output/ProfilerReport/profiler-output/profiler-reports/Dataloader.json to output/ProfilerReport/profiler-reports/Dataloader.json
download: s3://sagemaker-us-east-1-322537213286/training-job-1641906479/rule-output/ProfilerReport/profiler-output/profiler-reports/IOBottleneck.json to output/ProfilerReport/profiler-reports/IOBottleneck.json
download: s3://sagemaker-us-east-1-322537213286/training-job-1641906479/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json to output/ProfilerReport/profiler-reports/CPUBottleneck.json
download: s3://sagemaker-us-east-1-322537213286/training-job-1641906479/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb to output/ProfilerReport/profil

In [137]:
from IPython.display import display, HTML

display(HTML('<b>ProfilerReport : <a href="{}profiler-report.html">Profiler Report</a></b>'.format(output_dir+"/ProfilerReport/")))
