# SageMaker Distributed Training: Model-Parallel Training and Hosting of a PyTorch CNN on CPU Instances

*(This notebook was tested with the "Python 3 (PyTorch CPU Optimized)" kernel.)*

This notebook demonstrates how to use SageMaker distributed model parallelism to train a CNN-based image classification model on the CIFAR-10 dataset. It is intended for a live workshop on SageMaker distributed training, where time and resources are limited. The training job runs on a single node with multiple vCPUs, such as ml.c5.2xlarge. The code is based on SageMaker Python SDK 2.x, Python 3.8, and PyTorch 1.11.


Amazon SageMaker is a fully managed service that provides developers and data scientists with the ability to build, train, and deploy machine learning (ML) models quickly. Amazon SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high-quality, large models. The SageMaker Python SDK makes it easy to train and deploy models in Amazon SageMaker with several different machine learning and deep learning frameworks, including PyTorch.

## Setup

Let's start by specifying:

- An Amazon S3 bucket and prefix for training and model data. This should be in the same region used for SageMaker Studio, training, and hosting.
- An IAM role that allows SageMaker to access your training and model data. If you wish to use a different role than the one set up for SageMaker Studio, replace `sagemaker.get_execution_role()` with the appropriate IAM role or ARN. For more about using IAM roles with SageMaker, see [the AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).

In [None]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "pytorch-cnn-cifar10-example"

role = sagemaker.get_execution_role()
# To use a different role, replace the call above with that role's ARN.

## Prepare the training data

The [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) is a subset of the [80 million tiny images dataset](https://people.csail.mit.edu/torralba/tinyimages). It consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class.

### Download the data

First we download the dataset:

In [None]:
%%bash

wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar xfvz cifar-10-python.tar.gz

# -p keeps the cell re-runnable when the directory already exists
mkdir -p data
mv cifar-10-batches-py data/

rm cifar-10-python.tar.gz

After downloading the dataset, we use the [`torchvision.datasets` module](https://pytorch.org/docs/stable/torchvision/datasets.html) to load the CIFAR-10 dataset, utilizing the [`torchvision.transforms` module](https://pytorch.org/docs/stable/torchvision/transforms.html) to convert the data into normalized tensor images:

In [None]:
from cifar_utils import classes, show_img, train_data_loader, test_data_loader

train_loader = train_data_loader()
test_loader = test_data_loader()

### Preview the data

Now we can view some of the data we have prepared:

In [None]:
import numpy as np
import torchvision, torch

# get some random training images
dataiter = iter(train_loader)
images, labels = next(dataiter)  # built-in next() works across PyTorch versions

# show images
show_img(torchvision.utils.make_grid(images))

# print labels
print(" ".join("%9s" % classes[labels[j]] for j in range(4)))

### Upload the dataset to S3

We use `sagemaker.s3.S3Uploader` to upload our dataset to Amazon S3. The return value, `inputs`, identifies the S3 location; we use it later for the training job.

In [None]:
from sagemaker.s3 import S3Uploader

inputs = S3Uploader.upload("data", "s3://{}/{}/data".format(bucket, prefix))

## Prepare the entry-point script

When SageMaker trains and hosts our model, it runs a Python script that we provide. (This is run as the entry point of a Docker container.) For training, this script contains the PyTorch code needed for the model to learn from our dataset. For inference, the code is for loading the model and processing the prediction input. For convenience, we put both the training and inference code in the same file.

### Training

The training code is very similar to a training script we might run outside of Amazon SageMaker, but we can access useful properties about the training environment through various environment variables. For this notebook, our script retrieves the following environment variable values:

* `SM_HOSTS`: a list of hosts on the container network.
* `SM_CURRENT_HOST`: the name of the current container on the container network.
* `SM_MODEL_DIR`: the location for model artifacts. This directory is uploaded to Amazon S3 at the end of the training job.
* `SM_CHANNEL_TRAINING`: the location of our training data.
* `SM_NUM_GPUS`: the number of GPUs available to the current container.

We also use a main guard (`if __name__=='__main__':`) so that the training code runs only during training, because SageMaker also imports the entry-point script.

For more about writing a PyTorch training script with SageMaker, please see the [SageMaker documentation](https://sagemaker.readthedocs.io/en/stable/using_pytorch.html#prepare-a-pytorch-training-script).
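
As a rough illustration of the points above, the sketch below reads the listed `SM_*` variables and wraps the entry point in a main guard. The helper name `read_sagemaker_env` and the fallback values are assumptions made for this example (so that it also runs outside a training container); the actual `cifar10_torch.py` may organize this differently.

```python
import json
import os


def read_sagemaker_env(environ=os.environ):
    """Collect the SageMaker training environment settings used by the script.

    The fallback values are assumptions so the sketch runs locally as well.
    """
    return {
        # SM_HOSTS is a JSON-encoded list of host names
        "hosts": json.loads(environ.get("SM_HOSTS", '["localhost"]')),
        "current_host": environ.get("SM_CURRENT_HOST", "localhost"),
        "model_dir": environ.get("SM_MODEL_DIR", "./model"),
        "data_dir": environ.get("SM_CHANNEL_TRAINING", "./data"),
        "num_gpus": int(environ.get("SM_NUM_GPUS", "0")),
    }


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when it is imported.
    config = read_sagemaker_env()
    print("training on", config["current_host"], "with data in", config["data_dir"])
```

Passing the environment as an argument keeps the helper easy to exercise with a plain dictionary in a local test.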

### Inference

For inference, we need to implement a few specific functions to tell SageMaker how to load our model and handle prediction input.

* `model_fn(model_dir)`: loads the model from disk. This function must be implemented.
* `input_fn(serialized_input_data, content_type)`: deserializes the prediction input.
* `predict_fn(input_data, model)`: calls the model on the deserialized data.
* `output_fn(prediction_output, accept)`: serializes the prediction output.

The last three functions - `input_fn`, `predict_fn`, and `output_fn` - are optional because SageMaker has default implementations to handle common content types. However, there is no default implementation of `model_fn` for PyTorch models on SageMaker, so our script has to implement `model_fn`.

For more about PyTorch inference with SageMaker, please see the [SageMaker documentation](https://sagemaker.readthedocs.io/en/stable/using_pytorch.html#id3).
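
To make the division of labor between the three optional handlers concrete, here is a simplified sketch that handles JSON only and uses plain Python lists. It is an illustration under stated assumptions, not the script used in this notebook: `toy_model` is a hypothetical stand-in for the `torch.nn.Module` that `model_fn` would load, and a real script would convert inputs to tensors.

```python
import json

JSON_CONTENT_TYPE = "application/json"


def input_fn(serialized_input_data, content_type):
    """Deserialize the request body into nested Python lists."""
    if content_type != JSON_CONTENT_TYPE:
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(serialized_input_data)


def predict_fn(input_data, model):
    """Call the model on the deserialized input."""
    return model(input_data)


def output_fn(prediction_output, accept):
    """Serialize the prediction for the response."""
    if accept != JSON_CONTENT_TYPE:
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps(prediction_output)


def toy_model(batch):
    """Hypothetical stand-in for a loaded model: returns one score per row."""
    return [[sum(row)] for row in batch]


# The three handlers are applied in this order for each request.
request_body = "[[1, 2], [3, 4]]"
data = input_fn(request_body, JSON_CONTENT_TYPE)
prediction = predict_fn(data, toy_model)
response = output_fn(prediction, JSON_CONTENT_TYPE)
print(response)
```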

### Put it all together

Here is the full script for both training and hosting our convolutional neural network:

In [None]:
!pygmentize scripts/cifar10_torch.py

## Run a SageMaker training job

The SageMaker Python SDK makes it easy for us to interact with SageMaker. Here, we use the `PyTorch` estimator class to start a training job. We configure it with the following parameters:

* `entry_point`: our training script.
* `role`: an IAM role that SageMaker uses to access training and model data.
* `framework_version`: the PyTorch version we wish to use. For a list of supported versions, see [here](https://github.com/aws/sagemaker-python-sdk#pytorch-sagemaker-estimators).
* `instance_count`: the number of training instances.
* `instance_type`: the training instance type. For a list of supported instance types, see [the AWS Documentation](https://aws.amazon.com/sagemaker/pricing/instance-types/).

Once we have created our `PyTorch` estimator, we start a training job by calling `fit()` and passing the training data we uploaded to S3 earlier.

In [None]:
from sagemaker.pytorch import PyTorch

hyperparameters = {"epoch": 10, "lr": 0.0005, "batch_size": 8}
env = {"SAGEMAKER_REQUIREMENTS": "requirements.txt"}

kwargs = dict(
    source_dir="./scripts",
    entry_point="cifar10_torch.py",
    model_dir=False,
    env=env,
    instance_type="ml.c5.2xlarge",
    instance_count=1,
    framework_version="1.11.0",
    py_version='py38',
    debugger_hook_config=None,
    disable_profiler=True,
    max_run=60 * 60,  # 60 minutes
    role=role,
    metric_definitions=[
        {"Name": "training_loss", "Regex": "loss: ([0-9.]*?) "},
        {"Name": "training_accuracy", "Regex": "accuracy: ([0-9.]*?) "},
        {"Name": "training_latency_per_epoch", "Regex": "- ([0-9.]*?)s/epoch"},
        {"Name": "training_avg_latency_per_step", "Regex": "- ([0-9.]*?)ms/step"},
    ],
)
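
The regular expressions in `metric_definitions` are matched against the training log, and the first capture group becomes the metric value. The following sketch checks the four patterns locally with Python's `re` module. The log line is hypothetical: the real format depends on what the training script prints.

```python
import re

# Patterns copied from the estimator's metric_definitions.
metric_definitions = [
    {"Name": "training_loss", "Regex": "loss: ([0-9.]*?) "},
    {"Name": "training_accuracy", "Regex": "accuracy: ([0-9.]*?) "},
    {"Name": "training_latency_per_epoch", "Regex": "- ([0-9.]*?)s/epoch"},
    {"Name": "training_avg_latency_per_step", "Regex": "- ([0-9.]*?)ms/step"},
]

# Hypothetical log line, assumed only for this check.
log_line = "Epoch 1/10 - 42.5s/epoch - 27.2ms/step - loss: 1.2345 - accuracy: 0.5678 - done"


def extract_metrics(line, definitions):
    """Return {metric name: value} for every pattern that matches the line."""
    metrics = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match and match.group(1):
            metrics[definition["Name"]] = float(match.group(1))
    return metrics


metrics = extract_metrics(log_line, metric_definitions)
print(metrics)
```

Note that the loss and accuracy patterns end with a space, so they match only when the printed value is followed by a space.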

## Configure smdistributed parameters for distributed model parallelism

Model parallelism allows partitioning a large deep learning model across multiple devices, within or across instances during training. Increasing the size of deep learning models (layers and parameters) yields better accuracy for complex tasks such as computer vision and natural language processing. However, there is a limit to the maximum model size you can fit in the memory of a single GPU. When training DL models, GPU memory limitations can be bottlenecks in the following ways:

* They limit the size of the model you can train, since the memory footprint of a model scales proportionally to the number of parameters.

* They limit the per-GPU batch size during training, driving down GPU utilization and training efficiency.

To overcome the limitations of training a model on a single GPU, SageMaker provides the model parallel library, which helps distribute and train DL models efficiently across multiple compute nodes. The library can also take advantage of EFA-supported devices, which improve inter-node communication through low latency, high throughput, and OS bypass.


For a training job that uses AMP (FP16) and the Adam optimizer, the required GPU memory per parameter is about 20 bytes, which we can break down as follows:

* An FP16 parameter ~ 2 bytes
* An FP16 gradient ~ 2 bytes
* An FP32 optimizer state ~ 8 bytes for the Adam optimizer
* An FP32 copy of the parameter ~ 4 bytes (needed for the optimizer apply (OA) operation)
* An FP32 copy of the gradient ~ 4 bytes (needed for the OA operation)

Even a relatively small DL model with 10 billion parameters can require at least 200 GB of memory, which is much more than the memory available on a single GPU (for example, 40 GB or 80 GB on an NVIDIA A100 and 16 GB or 32 GB on a V100). Note that on top of the memory required for the model and optimizer states, there are other memory consumers, such as the activations generated in the forward pass, so the total can be much greater than 200 GB.
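
The arithmetic behind the 20 bytes per parameter and the 200 GB figure can be reproduced in a few lines. This is a back-of-the-envelope sketch that uses decimal gigabytes and ignores activations; the helper names are introduced here for illustration only.

```python
import math

# Per-parameter memory with AMP (FP16) and Adam, in bytes.
BYTES_PER_PARAMETER = {
    "fp16_parameter": 2,
    "fp16_gradient": 2,
    "fp32_optimizer_state": 8,
    "fp32_parameter_copy": 4,
    "fp32_gradient_copy": 4,
}


def training_memory_gb(num_parameters):
    """Memory for model and optimizer states, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * sum(BYTES_PER_PARAMETER.values()) / 1e9


def min_devices(num_parameters, device_memory_gb):
    """Lower bound on devices needed to hold those states (no activations)."""
    return math.ceil(training_memory_gb(num_parameters) / device_memory_gb)


ten_billion = 10_000_000_000
print(sum(BYTES_PER_PARAMETER.values()), "bytes per parameter")
print(training_memory_gb(ten_billion), "GB for 10 billion parameters")
for memory_gb in (16, 32, 40, 80):
    print(f"{memory_gb} GB devices needed: {min_devices(ten_billion, memory_gb)}")
```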

SageMaker distributed training libraries support pipeline parallelism, tensor parallelism (available for PyTorch), and optimizer state sharding (available for PyTorch).

* Pipeline parallelism partitions the set of layers or operations across the set of devices, leaving each operation intact.
* Tensor parallelism splits individual layers, or `nn.Module`s, across devices so that they run in parallel.
* Optimizer state sharding avoids replicating the optimizer state on every GPU: a single replica of the optimizer state is sharded across data-parallel ranks, with no redundancy across devices.
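
As a toy illustration of the pipeline-parallel idea (whole layers assigned to devices, each layer left intact), the sketch below splits a list of layers into contiguous groups. It balances by layer count only, which is an assumption made for simplicity: it is not the library's partitioning algorithm, and the layer names are hypothetical.

```python
def partition_layers(layers, num_partitions):
    """Split layers into contiguous, near-equal groups, one group per device."""
    if not 1 <= num_partitions <= len(layers):
        raise ValueError("num_partitions must be between 1 and the number of layers")
    base, extra = divmod(len(layers), num_partitions)
    partitions, start = [], 0
    for rank in range(num_partitions):
        # The first `extra` partitions take one additional layer.
        size = base + (1 if rank < extra else 0)
        partitions.append(layers[start:start + size])
        start += size
    return partitions


# Hypothetical layer names for a small CNN.
cnn_layers = ["conv1", "pool1", "conv2", "pool2", "fc1", "fc2", "fc3"]
for rank, group in enumerate(partition_layers(cnn_layers, 4)):
    print(f"partition {rank}: {group}")
```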


SageMaker's distributed model parallel library helps you train large deep learning (DL) models that are difficult to train due to GPU memory limitations. The library automatically and efficiently splits a model across multiple GPUs and instances. Using the library, you can achieve a target prediction accuracy faster by efficiently training larger DL models with billions or trillions of parameters.

You can use the library to automatically partition your own TensorFlow and PyTorch models across multiple GPUs and multiple nodes with minimal code changes. You can access the library's API through the SageMaker Python SDK. To track the latest updates of the library, see the [SageMaker Distributed Model Parallel Release Notes](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.html) in the SageMaker Python SDK documentation.

Amazon SageMaker’s TensorFlow and PyTorch estimator objects take a `distribution` parameter, which you can use to enable and configure SageMaker distributed training. The SageMaker model parallel library uses MPI internally, so to use model parallelism, both `smdistributed` and `mpi` must be enabled through the `distribution` parameter. The main parameters of the library are:

* `microbatches`: The number of microbatches to perform pipelining over. 1 means no pipelining. The batch size must be divisible by the number of microbatches.
* `pipeline`: The pipeline schedule, either `"interleaved"` or `"simple"`.
* `placement_strategy`: Determines the mapping of model partitions onto physical devices, either `"cluster"` or `"spread"`.
* `optimize`: Determines the distribution mechanism of transformer layers, either `"memory"` or `"speed"`.
* `auto_partition`: Enables auto-partitioning. If disabled, the `default_partition` parameter must be provided.
* `default_partition`: Required if `auto_partition` is false. The ID of the partition (for example, 0 or 1) that receives operations or modules not placed in any `smp.partition` context.

In addition, there are a few PyTorch-specific and TensorFlow-specific parameters. For details, please refer to the [SageMaker documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html).
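
The divisibility requirement on `microbatches` is easy to see in a small sketch. The helper below is hypothetical and only mimics the splitting step on a plain list; the library performs the actual split on tensors inside the training step.

```python
def split_into_microbatches(batch, num_microbatches):
    """Split a batch into equally sized microbatches for pipelining."""
    if len(batch) % num_microbatches != 0:
        raise ValueError(
            f"batch size {len(batch)} is not divisible by {num_microbatches} microbatches"
        )
    size = len(batch) // num_microbatches
    return [batch[i:i + size] for i in range(0, len(batch), size)]


# A batch of 8 samples and 4 microbatches, matching the values used in this notebook.
batch = list(range(8))
print(split_into_microbatches(batch, 4))
```

With a single microbatch the whole batch moves through the partitions as one unit, which corresponds to the "no pipelining" case.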

In [None]:
# configuration for running training on smdistributed Model Parallel
mpi_options = {
    "enabled": True,
    "processes_per_host": 8,  # (pipeline_parallel_degree) x (data_parallel_degree) = processes_per_host
    "custom_mpi_options": "-verbose",  # extra options passed through to mpirun
}

smp_options = {
    "enabled":True,
    "parameters": {
        "microbatches": 4,  # each mini-batch is split into micro-batches to increase parallelism
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "partitions": 4,  # split the model into 4 partitions
        "ddp": True,
    },
    "parameter_server": {
        "enabled": True
    }
}


distribution={
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options
}

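# Data-parallel configuration, shown for reference only; the estimator below
# uses the model-parallel `distribution` defined above.
data_distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}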
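
The comment on `processes_per_host` implies a simple consistency check between the MPI and model parallel settings. The sketch below assumes a single instance and that `partitions` plays the role of the pipeline parallel degree; the function name is hypothetical.

```python
def data_parallel_degree(processes_per_host, pipeline_parallel_degree):
    """Derive the data parallel degree implied by the process count.

    Relies on: pipeline_parallel_degree x data_parallel_degree = processes_per_host.
    """
    if processes_per_host % pipeline_parallel_degree != 0:
        raise ValueError("processes_per_host must be a multiple of the pipeline parallel degree")
    return processes_per_host // pipeline_parallel_degree


# Values from the configuration in this notebook: 8 processes, 4 partitions.
degree = data_parallel_degree(processes_per_host=8, pipeline_parallel_degree=4)
print(f"data parallel degree: {degree}")
```

With 8 processes and 4 partitions, the model is replicated twice, and each replica is split into 4 partitions.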

In [None]:
kwargs["instance_count"] = 1

# PyTorch was imported from sagemaker.pytorch in an earlier cell
def_dmp_estimator = PyTorch(
    hyperparameters=hyperparameters,
    **kwargs,
    distribution=distribution,
)

In [None]:
%%time
def_dmp_estimator.fit(inputs)

## Deploy the model for inference

After we train our model, we can deploy it to a SageMaker Endpoint, which serves prediction requests in real-time. To do so, we simply call `deploy()` on our estimator, passing in the desired number of instances and instance type for the endpoint:

In [None]:
%%time
def_dmp_predictor = def_dmp_estimator.deploy(initial_instance_count=1, instance_type="ml.c5.xlarge")

## Send a few samples for inference

In [None]:
# get some test images
dataiter = iter(test_loader)
images, labels = next(dataiter)  # built-in next() works across PyTorch versions


# print images, labels, and predictions
show_img(torchvision.utils.make_grid(images))
print("GroundTruth: ", " ".join("%4s" % classes[labels[j]] for j in range(4)))

outputs = def_dmp_predictor.predict(images.numpy())

_, predicted = torch.max(torch.from_numpy(np.array(outputs)), 1)

print("Predicted:   ", " ".join("%4s" % classes[predicted[j]] for j in range(4)))
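
In the cell above, `torch.max(..., 1)` returns both the largest score in each row and its index, and the index is used as the predicted class. A plain-Python equivalent of the index part, using hypothetical scores and class names, looks like this:

```python
def argmax_per_row(scores):
    """Return the index of the largest score in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Hypothetical class names and model outputs for two images.
example_classes = ["plane", "car", "bird"]
example_scores = [[0.1, 2.3, -0.5], [1.7, 0.2, 0.9]]

predicted_indices = argmax_per_row(example_scores)
print([example_classes[i] for i in predicted_indices])
```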

## Cleanup

Once finished, we delete our endpoint to release the instances (and avoid incurring extra costs).

In [None]:
def_dmp_predictor.delete_endpoint()