In [1]:
!yes | pip uninstall torchvision
!pip install -qU torchvision

yes: standard output: Broken pipe
yes: write error


# MNIST Training using PyTorch

This notebook is built from [SageMaker's PyTorch MNIST example](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/pytorch_mnist). The objective here is to show how one might use SageMaker's Python SDK to build models and iterate.

## Setup

These next cells are copied from the example notebook. If you want a more thorough explanation of what each one is doing, read through the original example.

In [2]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/DEMO-pytorch-mnist'

role = sagemaker.get_execution_role()

In [3]:
from torchvision.datasets import MNIST
from torchvision import transforms

MNIST.mirrors = ["https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/"]

MNIST(
    'data',
    download=True,
    transform=transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    )
)

Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-images-idx3-ubyte.gz
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-images-idx3-ubyte.gz to data/MNIST/raw/train-images-idx3-ubyte.gz


Extracting data/MNIST/raw/train-images-idx3-ubyte.gz to data/MNIST/raw

Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-labels-idx1-ubyte.gz
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/train-labels-idx1-ubyte.gz to data/MNIST/raw/train-labels-idx1-ubyte.gz


Extracting data/MNIST/raw/train-labels-idx1-ubyte.gz to data/MNIST/raw

Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-images-idx3-ubyte.gz
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-images-idx3-ubyte.gz to data/MNIST/raw/t10k-images-idx3-ubyte.gz


Extracting data/MNIST/raw/t10k-images-idx3-ubyte.gz to data/MNIST/raw

Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-labels-idx1-ubyte.gz
Downloading https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/t10k-labels-idx1-ubyte.gz to data/MNIST/raw/t10k-labels-idx1-ubyte.gz


Extracting data/MNIST/raw/t10k-labels-idx1-ubyte.gz to data/MNIST/raw

Processing...
Done!



Dataset MNIST
    Number of datapoints: 60000
    Root location: data
    Split: Train
    StandardTransform
Transform: Compose(
               ToTensor()
               Normalize(mean=(0.1307,), std=(0.3081,))
           )

In [4]:
inputs = sagemaker_session.upload_data(path='data', bucket=bucket, key_prefix=prefix)
print('input spec (in this case, just an S3 path): {}'.format(inputs))

input spec (in this case, just an S3 path): s3://sagemaker-ca-central-1-366756336356/sagemaker/DEMO-pytorch-mnist


In [6]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='mnist.py',
                    role=role,
                    py_version='py3',
                    framework_version='1.8.0',
                    instance_count=2,
                    instance_type='ml.c5.2xlarge',
                    hyperparameters={
                        'epochs': 1,
                        'backend': 'gloo'
                    })

In [7]:
estimator.fit({'training': inputs})

2021-06-10 13:44:40 Starting - Starting the training job...
2021-06-10 13:45:04 Starting - Launching requested ML instances
ProfilerReport-1623332680: InProgress
......
2021-06-10 13:46:04 Starting - Preparing the instances for training............
2021-06-10 13:48:08 Downloading - Downloading input data
2021-06-10 13:48:08 Training - Downloading the training image..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-06-10 13:48:22,996 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-06-10 13:48:22,998 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-06-10 13:48:23,007 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-

In [8]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

---------------!

Let's save our endpoint name in a variable so we can use it later.

In [20]:
ENDPOINT_NAME = predictor.endpoint_name
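
The endpoint name is all a calling service needs to reach the model. As a rough sketch of what such a caller might look like with boto3's `sagemaker-runtime` client: `invoke_mnist_endpoint` is a hypothetical helper, not something from the original example, and it assumes the container's default handlers accept and return `application/json` (the original notebook does not confirm which content types its `mnist.py` supports).

```python
import json


def invoke_mnist_endpoint(runtime_client, endpoint_name, images):
    """Send a batch of images to a SageMaker endpoint and return the decoded JSON reply.

    `images` is a nested list shaped (batch, 1, 28, 28).
    Assumption: the endpoint accepts and returns application/json.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(images),
    )
    # The response body is a streaming object; read it fully before decoding.
    return json.loads(response["Body"].read())


# Usage (needs AWS credentials and a live endpoint):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   blank = [[[[0.0] * 28 for _ in range(28)]]]  # one all-zero 28x28 image
#   scores = invoke_mnist_endpoint(runtime, ENDPOINT_NAME, blank)
```

Because callers depend only on the name, anything we do later should keep that name stable.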

## Improve Your Model

Let's say you deployed your model and it's doing very well. You've set up some monitoring and several internal services are calling your endpoint. Let's try to improve the model by training it for more epochs.

In [16]:
second_estimator = PyTorch(entry_point='mnist.py',
                    role=role,
                    py_version='py3',
                    framework_version='1.8.0',
                    instance_count=2,
                    instance_type='ml.c5.2xlarge',
                    hyperparameters={
                        'epochs': 10,  # 10 epochs instead of 1
                        'backend': 'gloo'
                    })

second_estimator.fit({'training': inputs})

2021-06-10 14:46:04 Starting - Starting the training job...
2021-06-10 14:46:28 Starting - Launching requested ML instances
ProfilerReport-1623336364: InProgress
......
2021-06-10 14:47:28 Starting - Preparing the instances for training......
2021-06-10 14:48:28 Downloading - Downloading input data...
2021-06-10 14:48:58 Training - Downloading the training image..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-06-10 14:49:13,043 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-06-10 14:49:13,045 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-06-10 14:49:13,053 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-06-

We improved our model. Great!
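
The two estimator definitions so far differ only in `epochs`. If you expect to iterate like this often, one option is to keep the shared settings in a single place. This is only a sketch: `estimator_kwargs` is a hypothetical helper introduced here, and its values are copied from the cells above.

```python
def estimator_kwargs(role, epochs):
    """Build the keyword arguments shared by every PyTorch estimator in this notebook."""
    return {
        "entry_point": "mnist.py",
        "role": role,
        "py_version": "py3",
        "framework_version": "1.8.0",
        "instance_count": 2,
        "instance_type": "ml.c5.2xlarge",
        # Only `epochs` changes between experiments; the backend stays fixed.
        "hyperparameters": {"epochs": epochs, "backend": "gloo"},
    }


# Usage (inside the notebook, where `PyTorch` and `role` are already defined):
#   second_estimator = PyTorch(**estimator_kwargs(role, epochs=10))
```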

### Deploying the Improved Model

We want a way of deploying that model to the same endpoint we set up before. That's normally a better idea than creating a new endpoint every time you train a new model, because we don't have to update our calling code.

The [Using Estimators](https://sagemaker.readthedocs.io/en/stable/overview.html#using-estimators) section of the Python SDK documentation states that:

> Additionally, it is possible to deploy a different endpoint configuration, which links to your model, to an already existing SageMaker endpoint. This can be done by specifying the existing endpoint name for the `endpoint_name` parameter along with the `update_endpoint` parameter as True within your `deploy()` call.

Then it goes ahead and shows us a code example doing just that:

```python
mxnet_predictor = mxnet_estimator.deploy(initial_instance_count=1,
                                         instance_type='ml.p2.xlarge',
                                         update_endpoint=True,
                                         endpoint_name='existing-endpoint')
```

So... Let's try it!

In [18]:
second_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    update_endpoint=True,
    endpoint_name=ENDPOINT_NAME
)

update_endpoint is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


TypeError: __init__() got an unexpected keyword argument 'update_endpoint'

Weird... This error message says that we're using deprecated stuff. Surprising, since we read in the `stable` version of the documentation that this should work. Let's read the link the error message gave us to see how we can fix this.

> The `update_endpoint` argument in `deploy()` methods for estimators and models is now a no-op. Please use `sagemaker.predictor.Predictor.update_endpoint()` instead.

Ok, so now we gotta have a `Predictor` before we deploy our model to the existing endpoint. [This piece of documentation](https://sagemaker.readthedocs.io/en/stable/overview.html#how-do-i-make-predictions-against-an-existing-endpoint) (on the same page that proved to be out of date) tells us that we can instantiate a new predictor by passing the existing endpoint name. Let's try that.

In [21]:
from sagemaker.predictor import Predictor

existing_predictor = Predictor(ENDPOINT_NAME)

And now we can look at the [`Predictor.update_endpoint` method documentation](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html#sagemaker.predictor.Predictor.update_endpoint) to figure out how to update our existing endpoint.

Give yourself a minute to try to figure this out...

So what did you get? To me it seems that the `model_name` parameter holds promise, right? We're updating the endpoint to point to a new model. But what's our trained model name? Maybe our `second_estimator` instance holds the answer.

In [24]:
[m for m in dir(second_estimator) if 'model' in m]

['_compiled_models',
 '_model_entry_point',
 '_model_source_dir',
 'compile_model',
 'create_model',
 'model_channel_name',
 'model_data',
 'model_uri']

Ok... Maybe `model_uri`?

In [26]:
print(second_estimator.model_uri)

None


That's not it... Our not-so-trustworthy [Python SDK](https://sagemaker.readthedocs.io/en/stable/) documentation doesn't seem to have an answer for us.
We could spend some time on this, but let's just skip to the fix.

If you're new to SageMaker you might be surprised to discover that **you haven't created a SageMaker model yet**. We have trained a model and we have the model saved on S3, but no "Model object" has been created. Normally the `Estimator.deploy()` method creates a model for you (although the [method documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase.deploy) doesn't say anything about that). Since we're not using the `deploy()` method anymore, we gotta create the model manually using the `create_model` method:

In [29]:
second_model = second_estimator.create_model()

In [35]:
print(f'🎉 Model name: {second_model.name} 🎉')

🎉 Model name: pytorch-training-2021-06-10-17-08-22-677 🎉


Finally! Let's try the `Predictor.update_endpoint` method.

In [38]:
existing_predictor.update_endpoint(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    model_name=second_model.name
)

---------------!

#### Hurrah! 🎉

## Cleaning up

In [40]:
existing_predictor.delete_endpoint()
existing_predictor.delete_model()

ClientError: An error occurred (ValidationException) when calling the DeleteEndpointConfig operation: Could not find endpoint configuration "arn:aws:sagemaker:ca-central-1:366756336356:endpoint-config/pytorch-training-2021-06-10-13-49-23-10-2021-06-10-17-13-11-700".
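
Whatever the cause of that error, the cleanup can also be done directly against the SageMaker API, looking up the names to delete instead of relying on what the SDK remembers. The following is a minimal sketch using boto3 client calls; `delete_endpoint_resources` is a hypothetical helper, and it only removes the endpoint config and models the endpoint currently points at, so configs and models left over from earlier deployments would still need to be deleted separately.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint, its current endpoint config, and the models that config references."""
    # Look everything up first: once the endpoint is gone we can no
    # longer ask SageMaker which config it was using.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)
    return config_name, model_names


# Usage (needs AWS credentials):
#   import boto3
#   sm = boto3.client("sagemaker")
#   delete_endpoint_resources(sm, ENDPOINT_NAME)
```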


## So what did we learn?

Amazon should get their documentation straight. Whenever you search for ways to train or deploy models, the Python SDK seems to be the recommended route, but as we dig further and try to use it in our production pipelines we discover that the SDK hides a lot behind the scenes that shouldn't be hidden! The documentation helps if you're following the happy path of "training a dummy model -> deploying to an endpoint", but as soon as you try to customize this a bit you're bound to run into trouble.

Overall, our recommendation is to **steer clear of the Python SDK**. Instead, learn how to use Amazon's API/CLI to do all of your work. You'll get a much better grip on what is going on and how to customize things.
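
To give a rough idea of what the API route looks like, here is a sketch of the same endpoint update done with boto3. It assumes a SageMaker model named `model_name` is already registered; `point_endpoint_at_model` is a hypothetical helper, and the variant name `AllTraffic` is just a label chosen for this example.

```python
def point_endpoint_at_model(sm_client, endpoint_name, model_name, config_name,
                            instance_type="ml.m4.xlarge", instance_count=1):
    """Create a new endpoint config for `model_name` and switch an existing endpoint to it."""
    # Step 1: an endpoint config ties a model to the instances that serve it.
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",  # arbitrary label for this example
                "ModelName": model_name,
                "InitialInstanceCount": instance_count,
                "InstanceType": instance_type,
            }
        ],
    )
    # Step 2: swap the config behind the endpoint; the endpoint name stays the same.
    sm_client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config_name,
    )
    # Step 3: update_endpoint returns immediately, so block until the endpoint is back in service.
    sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
    return config_name


# Usage (needs AWS credentials; the config name below is made up):
#   import boto3
#   sm = boto3.client("sagemaker")
#   point_endpoint_at_model(sm, ENDPOINT_NAME, second_model.name, "mnist-config-v2")
```

Every resource involved (model, endpoint config, endpoint) shows up as an explicit call here, which is exactly the part the SDK keeps out of sight.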