# PyTorch MNIST Lift and Shift Exercise

For this exercise notebook, you should be able to use the `Python 3 (Data Science)` kernel on SageMaker Studio, or `conda_python3` on classic SageMaker Notebook Instances.

---

## Introduction

Your new colleague in the data science team (who isn't very familiar with SageMaker) has written a nice notebook to tackle an image classification problem with PyTorch: [Local Notebook.ipynb](Local%20Notebook.ipynb).

It works OK with the simple MNIST data set they were working on before, but now they'd like to take advantage of some of the features of SageMaker to tackle bigger and harder challenges.

**Can you help refactor the Local Notebook code, to show them how to use SageMaker effectively?**


## Getting Started

First, check that you can **run the [Local Notebook.ipynb](Local%20Notebook.ipynb) notebook all the way through**, reviewing the steps it takes.

**This notebook** sets out a structure you can migrate code into, and lists at a high level some of the changes you'll need to make. You can either work directly in here, or duplicate this notebook so you still have an unchanged copy of the original.

Try to work through the sections first with an MVP goal in mind (fitting the model to data in S3 via a SageMaker Training Job, and deploying/using the model through a SageMaker Endpoint). At the end, there are extension exercises to bring in more advanced functionality.


## Dependencies

Listing all our imports at the start helps to keep the requirements to run any script/file transparent up-front, and is recommended by nearly every style guide, including Python's official [PEP 8](https://www.python.org/dev/peps/pep-0008/#imports).


In [None]:
# External Dependencies:
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import numpy as np

# Local Dependencies:
from util.nb import upload_in_background

# TODO: What else will you need?
# Have a look at the documentation: https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html
# to see which libraries need to be imported to use SageMaker and the PyTorch estimator


In [None]:
# TODO: Here might be a good place to init any SDKs you need...
# 1. Setup the SageMaker role
role = ?
# 2. Setup the SageMaker session
sess = ?
# 3. Setup the SageMaker default bucket
bucket_name = ?

# Have a look at the previous examples to find out how to do it

## Data Preparation

The primary data source for a SageMaker training job is (nearly) always S3 - so we should upload our training and test data there.

We'd like our training job to be reusable for other image classification projects, so we'll upload in the **folders-of-images format** rather than the straight pre-processed numpy arrays.

However, for this particular dataset (tens of thousands of tiny files), it's easy to accidentally write a poorly performing upload that **could take a long time**, so we've prepared the cell below to help you run the upload **in the background** using the [aws s3 sync](https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html) CLI command.

**Check you understand** what data it's going to upload from this notebook, and where it's going to store it in S3, then start the upload running while you work on the rest.


In [None]:
upload_in_background(local_path="data", s3_uri=f"s3://{bucket_name}/mnist")

You can carry on working on the other sections while your data uploads!


## Data Input ("Channels") Configuration

The draft code has **2 data sets**: one for training, and one for test/validation. (For classification, the folder location of each image is sufficient as a label.)
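To make the "folder as label" idea concrete, here is a minimal sketch. It assumes a layout of `<split>/<digit>/<file>` (for example `data/train/7/...`); the file name used is purely hypothetical, so check the actual contents of your `data` folder before relying on it.

```python
from pathlib import PurePosixPath


def label_from_path(image_path: str) -> int:
    """Return the class label encoded in the image's parent folder name.

    Assumes each image sits in a folder named after its digit class.
    """
    return int(PurePosixPath(image_path).parent.name)


# Hypothetical file name, for illustration only:
print(label_from_path("data/train/7/img_00042.png"))
```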

In SageMaker terminology, each input data set is a "channel" and we can name them however we like... Just make sure you're consistent about what you call each one!

For a simple input configuration, a channel spec might just be the S3 URI of the folder. For configuring more advanced options, there's the [s3_input](https://sagemaker.readthedocs.io/en/stable/inputs.html) class in the SageMaker SDK (renamed `TrainingInput` in version 2 of the SDK).
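As a rough illustration of the simple case, the sketch below builds a dictionary mapping channel names to S3 URIs, which is one of the forms `estimator.fit()` accepts. The helper name `build_channels`, the channel names `train` and `test`, and the bucket name are all assumptions made for this example.

```python
def build_channels(bucket_name: str, prefix: str = "mnist") -> dict:
    """Map channel names (our choice) to the S3 folders holding each data set."""
    base = f"s3://{bucket_name}/{prefix}"
    return {
        "train": f"{base}/train",
        "test": f"{base}/test",
    }


# "example-bucket" is a placeholder; use your own bucket_name here.
channels = build_channels("example-bucket")
for name, uri in channels.items():
    print(f"{name}: {uri}")
```

Whatever names you pick, use the same ones inside your training script when you look up where each channel's data was downloaded.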


In [None]:
# TODO: Define your 2 data channels
# The data can be found in: "s3://{bucket_name}/mnist/train" and "s3://{bucket_name}/mnist/test"

inputs = # Look at the previous example to see how the inputs were defined

## Algorithm ("Estimator") Configuration and Run

Instead of loading and fitting this data here in the notebook, we'll be creating a [PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator) through the SageMaker SDK, to run the code on a separate container that can be scaled as required.

The ["Using PyTorch with the SageMaker Python SDK"](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html) docs give a good overview of this process. You should run your estimator in **Python 3**.

**Use the [src/main.py](src/main.py) file** as your entry point to port code into - which has already been created for you with some basic hints.


In [None]:
# TODO: Create your PyTorch estimator

# Note the PyTorch class inherits from some cross-framework base classes with additional
# constructor options:
# https://sagemaker.readthedocs.io/en/stable/estimators.html
# https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#create-an-estimator

# We are using PyTorch 1.4 and Python 3
# You can reuse the metrics definition from the previous example
# (Optional) Look at the PyTorch script and try to pass new hyperparameters

estimator = ?

In [None]:
# TODO: Call estimator.fit


## Deploy and Use Your Model (Real-Time Inference)

If your training job has completed and saved the model in the correct PyTorch model format, it should now be pretty simple to deploy the model to a real-time endpoint.

You can achieve this with the [Estimator API](https://sagemaker.readthedocs.io/en/stable/estimators.html).

In [None]:
# TODO: Deploy a real-time endpoint

Recall from the example notebook that we set up the model to accept **batches** of **28x28** image tensors with **normalized 0-1 pixel values** and a **color channel dimension**.

Assuming you haven't added any custom pre-processing to our model source code (to accept e.g. encoded JPEGs/PNGs, or arbitrary shapes), we'll need to replicate that same format when we use our endpoint.
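The sketch below shows one way to put a single image into that format with NumPy. It assumes a channels-first layout of `(batch, channel, height, width)`, which is what the skeleton code further down produces, and that raw pixels may arrive in the 0-255 range; the helper name `to_request_batch` is made up for this example.

```python
import numpy as np


def to_request_batch(image) -> np.ndarray:
    """Convert one 28x28 grayscale image into a (1, 1, 28, 28) float32 batch."""
    img = np.asarray(image, dtype=np.float32)
    # Heuristic: values above 1 mean the image is still on the 0-255 scale.
    if img.max() > 1.0:
        img = img / 255.0
    # Add the batch and channel dimensions (assumed channels-first).
    return img.reshape(1, 1, 28, 28)


# Synthetic image with pixel values in 0-255, for illustration only:
fake_image = (np.arange(28 * 28) % 256).reshape(28, 28)
batch = to_request_batch(fake_image)
print(batch.shape, batch.dtype, batch.min(), batch.max())
```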

We've provided a nice **interactive widget** below (which doesn't work in JupyterLab, unfortunately - only plain Jupyter!) and some skeleton code to help you use your model... But you'll need to fill in some details!


### WARNING: The next two cells, which use the visualization widget, only work in classic Jupyter notebooks. Skip to the next section if you are using JupyterLab or SageMaker Studio.

In [None]:
# Display interactive widget:
# This widget updates variable "data" here in the Jupyter kernel when drawn on
HTML(open("util/input.html").read())

In [None]:
# Run a prediction:

# Squeeze out any unneeded dimensions from "data", then put back the batch and channel dimensions
# we want (batch dim first, channel dim second, giving shape (1, 1, 28, 28)):
print(f"Raw data shape {np.array(data).shape}")
img = np.squeeze(np.array(data)).astype(np.float32)
img = np.expand_dims(np.expand_dims(img, 0), 0)
print(f"Request data shape {img.shape}")

# TODO: Call the predictor with the request data ("img")

# TODO: What structure is the response? How do we interpret it?

### If you are on JupyterLab or SageMaker Studio (or just struggle to get the interactive widget working)

...don't worry: Try adapting the "Exploring Results" section from the Local Notebook to send in one of the test set images instead!


In [None]:
# TODO: import libraries

# TODO: Choose an image

# TODO: Load the image, as in the Local Notebook
img = 

# Send to the model:


# Plot the result:
plt.figure(figsize=(3, 3))
fig = plt.subplot(1, 1, 1)
ax = plt.imshow(img, cmap="gray")
fig.set_title(f"Predicted Number {np.argmax(result)}")
plt.show()

## Further Improvements

If you've got the basic train/deploy/call cycle working, congratulations! This core pattern of experimenting in the notebook but executing jobs on scalable hardware is at the heart of the SageMaker data science workflow.

There are still plenty of ways we can use the tools better though: Read on for the next challenges!


### 1. Cut training costs easily with SageMaker Managed Spot Mode

AWS Spot Instances let you take advantage of unused capacity in the AWS cloud, at up to a 90% discount versus standard on-demand pricing! For small jobs like this, taking advantage of this discount is as easy as adding a couple of parameters to the Estimator constructor:

https://sagemaker.readthedocs.io/en/stable/estimators.html

Note that in general, spot capacity is offered at a discounted rate because it's interruptible based on instantaneous demand... Longer-running training jobs should implement checkpoint saving and loading, so that they can efficiently resume if interrupted part way through. More information can be found on the [Managed Spot Training in Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html) page of the [SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/).
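As a hedged sketch of those "couple of parameters", the helper below assembles keyword arguments you could unpack into the Estimator constructor. The keyword names follow version 2 of the SageMaker Python SDK; version 1 uses `train_use_spot_instances`, `train_max_run` and `train_max_wait` instead, so check which version you have installed. The helper name and the S3 URI are hypothetical.

```python
def spot_training_kwargs(max_run_seconds, max_wait_seconds, checkpoint_s3_uri=None):
    """Build extra Estimator keyword arguments for Managed Spot Training."""
    # The total wait (including time spent waiting for spot capacity)
    # must be at least as long as the maximum training run time.
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait must be greater than or equal to max_run")
    kwargs = {
        "use_spot_instances": True,
        "max_run": max_run_seconds,
        "max_wait": max_wait_seconds,
    }
    if checkpoint_s3_uri is not None:
        # Where checkpoints are synced so an interrupted job can resume.
        kwargs["checkpoint_s3_uri"] = checkpoint_s3_uri
    return kwargs


# Placeholder values, for illustration only:
extra = spot_training_kwargs(600, 1200, "s3://example-bucket/mnist/checkpoints")
print(extra)
```

You would then pass these alongside your other constructor arguments, for example `PyTorch(..., **extra)`.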


### 2. Parameterize your algorithm

Being able to change the parameters of your algorithm at run-time (without modifying the `main.py` script each time) is helpful for making your code more re-usable... But even more so because it's a pre-requisite for automatic hyperparameter tuning!

Job parameter parsing should ideally be factored into a separate function, and as a best practice should accept setting values through **both** command line flags (as demonstrated in the [official MXNet MNIST example](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/mxnet_mnist/mnist.py)) **and** the [SageMaker Hyperparameter environment variable(s)](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-container-environmental-variables-user-scripts.html). Perhaps the official MXNet example could be improved by setting environment-variable-driven defaults to the algorithm hyperparameters, the same as it already does for channels?

Refactor your job to accept **epochs** and **batch size** as optional parameters, and show how you can set these before each training run through the [Estimator API](https://sagemaker.readthedocs.io/en/stable/estimators.html).
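One possible shape for such a parsing function is sketched below. It assumes hyperparameters named `epochs` and `batch_size`, channels named `train` and `test`, and default values of 10 and 128 chosen only for illustration; the environment variable names follow the `SM_HP_*`, `SM_CHANNEL_*` and `SM_MODEL_DIR` convention described in the linked documentation.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse job parameters from CLI flags, falling back to SageMaker env vars."""
    env = os.environ
    parser = argparse.ArgumentParser()
    # Hyperparameters: a command line flag wins, then the env var, then a default.
    parser.add_argument("--epochs", type=int,
                        default=int(env.get("SM_HP_EPOCHS", 10)))
    parser.add_argument("--batch-size", "--batch_size", dest="batch_size", type=int,
                        default=int(env.get("SM_HP_BATCH_SIZE", 128)))
    # Data channels and output location (None when run outside SageMaker).
    parser.add_argument("--train", default=env.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", default=env.get("SM_CHANNEL_TEST"))
    parser.add_argument("--model-dir", dest="model_dir",
                        default=env.get("SM_MODEL_DIR"))
    return parser.parse_args(argv)


# Passing an explicit list makes the function easy to try out in a notebook:
args = parse_args(["--epochs", "5"])
print(args.epochs, args.batch_size)
```

Reading the environment inside the function (rather than at import time) keeps it easy to test, and accepting `argv` means the same code works both as a script entry point and interactively.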


### 3. Tune your network hyperparameters

Re-use the same approach as before to parameterize some features in the structure of your network: Perhaps the sizes of the `Conv2d` kernels? The number, type, node count, or activation function of layers in the network? No need to stray too far away from the sample architecture!

Instead of manually (or programmatically) calling `estimator.fit()` with different hyperparameters each time, we can use SageMaker's Bayesian Hyperparameter Tuning functionality to explore the space more efficiently!

The SageMaker SDK Docs give a great [overview](https://sagemaker.readthedocs.io/en/stable/overview.html#sagemaker-automatic-model-tuning) of using the HyperparameterTuner, which you can refer to if you get stuck.

First, we'll need to define a specific **metric** to optimize for, which is really a specification of how to scrape metric values from the algorithm's console logs. 
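To see what "scraping" means in practice, the sketch below applies metric definitions (a list of `Name`/`Regex` pairs, where the first capture group is the value) to some sample log text using Python's `re` module. The log format and metric names are invented for this example; your regexes need to match whatever your own script actually prints.

```python
import re

# Hypothetical metric definitions: the first capture group is the metric value.
metric_definitions = [
    {"Name": "test:loss", "Regex": r"test_loss=([0-9.]+);"},
    {"Name": "test:accuracy", "Regex": r"test_accuracy=([0-9.]+);"},
]

# Hypothetical console output from a training script:
sample_logs = """\
epoch=1; test_loss=0.1023; test_accuracy=0.9689;
epoch=2; test_loss=0.0641; test_accuracy=0.9812;
"""


def scrape(logs, definitions):
    """Return every value each regex captures from the logs, keyed by metric name."""
    return {
        d["Name"]: [float(v) for v in re.findall(d["Regex"], logs)]
        for d in definitions
    }


print(scrape(sample_logs, metric_definitions))
```

Testing your regexes locally like this is much quicker than discovering a mismatch after a training job has finished with no metrics recorded.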

Next, use the [\*Parameter](https://sagemaker.readthedocs.io/en/stable/tuner.html) classes (`ContinuousParameter`, `IntegerParameter` and `CategoricalParameter`) to define appropriate ranges for the hyperparameters whose combination you want to optimize.

With the original estimator, target metric and parameter ranges defined, you'll be able to create a [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/tuner.html) and use that to start a hyperparameter tuning job instead of a single model training job.

Pay attention to likely run time and resource consumption when selecting the maximum total number of training jobs and maximum parallel jobs of your hyperparameter tuning run... You can always view and cancel ongoing hyperparameter tuning jobs through the SageMaker Console.
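A quick back-of-the-envelope calculation can help when choosing those two limits. The sketch below is a rough estimate only: it assumes every training job takes about the same time, uses a single instance, and ignores start-up overhead. The function name and the example numbers are made up.

```python
import math


def estimate_tuning_budget(max_jobs, max_parallel_jobs, minutes_per_job):
    """Roughly estimate wall-clock and billable time for a tuning run."""
    # Jobs run in "rounds" of up to max_parallel_jobs at a time.
    rounds = math.ceil(max_jobs / max_parallel_jobs)
    return {
        "wall_clock_minutes": rounds * minutes_per_job,
        "billable_instance_minutes": max_jobs * minutes_per_job,
    }


# Example: 12 jobs in total, 3 at a time, about 5 minutes each.
print(estimate_tuning_budget(max_jobs=12, max_parallel_jobs=3, minutes_per_job=5))
```

Raising the parallelism shortens the wait but not the bill, and gives the Bayesian search fewer completed results to learn from before it picks each new set of hyperparameters.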


### Additional Challenges

If you have time, the following challenges are trickier, and might stretch your SageMaker knowledge even further!

**Batch Transform / Additional Inference Formats**: As discussed in this notebook, the deployed endpoint expects a particular tensor data format for requests... This complicates the usually-simple task of re-purposing the same model for batch inference (since our data in S3 is in JPEG format). The SageMaker TensorFlow SDK docs provide guidance on accepting custom formats in the ["Create Python Scripts for Custom Input and Output Formats"](https://sagemaker.readthedocs.io/en/stable/using_tf.html#create-python-scripts-for-custom-input-and-output-formats) section; for PyTorch, the equivalent serving functions are described in the ["Using PyTorch with the SageMaker Python SDK"](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html) docs. If you can refactor your algorithm to accept JPEG requests when deployed as a real-time endpoint, you'll be able to run it as a batch [Transformer](https://sagemaker.readthedocs.io/en/stable/transformer.html) against images in S3 with a simple `estimator.transformer()` call.

**Optimized Training Formats**: A dataset like this (containing many tiny objects) may take much less time to load into the algorithm if we either convert it to the standard NumPy format that Keras distributes it in (just 4 files: X_train, Y_train, X_test, Y_test), or *stream* the data with [SageMaker Pipe Mode](https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/) instead of downloading it up-front.

**Experiment Tracking**: The new (December 2019) [SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) feature gives a more structured way to track trials across multiple related experiments (for example, different HPO runs, or between HPO and regular model training jobs). You can use the [official SageMaker Experiments Example](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-experiments) for guidance on how to track the experiments in this notebook. Note that the [SageMaker Experiments SDK Docs](https://sagemaker-experiments.readthedocs.io/en/latest/) are maintained separately, since it's a different Python module.


## Clean-Up

Remember to clean up any persistent resources that aren't needed anymore to save costs: The most significant of these are real-time prediction endpoints, and this SageMaker Notebook Instance.

The SageMaker SDK [Predictor](https://sagemaker.readthedocs.io/en/stable/predictors.html) class provides an interface to clean up real-time prediction endpoints, and SageMaker Notebook Instances can be stopped through the SageMaker Console when you're finished.

You might also like to clean up any S3 buckets / content we created, to prevent ongoing storage costs.


In [None]:
# TODO: Clean up any endpoints/etc to release resources