# Building a TensorFlow Container for Your Training and Hosting Algorithms

To use your own algorithms to train or deploy a model in Amazon SageMaker, package them in Docker containers. By packaging an algorithm in a container, you can use almost any code with Amazon SageMaker, regardless of the programming language, environment, framework, or dependencies. This example shows how to build a TensorFlow Docker container and use it for training and inference in Amazon SageMaker.

1. [Building a TensorFlow Container for Your Training and Hosting Algorithms](#Building-a-TensorFlow-Container-for-Your-Training-and-Hosting-Algorithms)
    1. [When Should You Create a Container for Your Algorithm?](#When-Should-You-Create-a-Container-for-Your-Algorithm?)
    1. [Required Permissions](#Required-Permissions)
    1. [The Example](#The-Example)
    1. [The Presentation](#The-Presentation)
1. [Part 1: Packaging and Uploading Your Algorithm for Use with Amazon SageMaker](#Part-1:-Packaging-and-Uploading-Your-Algorithm-for-Use-with-Amazon-SageMaker)
    1. [Docker Overview](#Docker-Overview)
    1. [How Amazon SageMaker Runs Your Docker Container](#How-Amazon-SageMaker-Runs-Your-Docker-Container)
        1. [How Amazon SageMaker Runs Your Container During Training](#How-Amazon-SageMaker-Runs-Your-Container-During-Training)
            1. [The Input](#The-Input)
            1. [The Output](#The-Output)
        1. [How Amazon SageMaker Runs Your Container During Hosting](#How-Amazon-SageMaker-Runs-Your-Container-During-Hosting)
    1. [The Example Container](#The-Example-Container)
    1. [The Dockerfile](#The-Dockerfile)
    1. [Build and Register the Container](#Build-and-Register-the-Container)
    1. [Test the Algorithm Locally](#Test-the-Algorithm-Locally)
    1. [Download the CIFAR-10 Dataset](#Download-the-CIFAR-10-Dataset)
    1. [Train Locally with the Amazon SageMaker Python SDK](#Train-Locally-with-the-Amazon-SageMaker-Python-SDK)
    1. [Fit, Deploy, and Predict](#Fit,-Deploy,-and-Predict)
    1. [Make Predictions Using the Amazon SageMaker Python SDK](#Make-Predictions-Using-the-Amazon-SageMaker-Python-SDK)
1. [Part 2: Training and Hosting Your Algorithm in Amazon SageMaker](#Part-2:-Training-and-Hosting-Your-Algorithm-in-Amazon-SageMaker)
    1. [Set Up the Environment](#Set-Up-the-Environment)
    1. [Create the Session](#Create-the-Session)
    1. [Upload the Data for Training](#Upload-the-Data-for-Training)
    1. [Train on Amazon SageMaker](#Train-on-Amazon-SageMaker)
    1. [Optional: Clean Up](#Optional:-Clean-Up)
1. [Reference](#Reference)

_or_ I'm impatient, just [let me see the code](#The-Dockerfile)!

### When Should You Create a Container for Your Algorithm?

It's not always necessary to create a container to use your own code in Amazon SageMaker. If you use a framework that's supported by Amazon SageMaker, such as Apache MXNet or TensorFlow, you can use the SDK entry points for that framework to supply the Python code that implements your algorithm. We regularly expand the set of supported frameworks, so always check to see if the framework that your algorithm was written in is supported.

However, even if there is SDK support for your framework, sometimes it's more effective to build your own container. You might want to build your own container if the code that implements your algorithm is complex or if you need to make additions to the framework.

If your framework is supported, you might consider building your own container for the following reasons:

* A specific version isn't supported.
* You need to configure and install your dependencies and environment.
* You use a different training or hosting solution than the one provided.
 

### Required Permissions

This notebook creates new repositories in Amazon Elastic Container Registry (Amazon ECR), so you need permissions beyond those granted by the `AmazonSageMakerFullAccess` managed policy to run it. To add these permissions, attach the `AmazonEC2ContainerRegistryFullAccess` managed policy to the AWS Identity and Access Management (IAM) role that you used to start your notebook instance. The new permissions are available immediately (you don't need to restart your notebook instance).

### The Example

In this example, we show how to package a custom TensorFlow container with an example written in Python. The example uses the [CIFAR-10] dataset for training and TensorFlow Serving for inference. You can use another inference solution by modifying the Docker container.

We use a single image for both training and hosting because it's easier to manage one image. If training and hosting have different requirements, you might want to create a separate Dockerfile for each, then build two images. Choose the approach that is easier to develop and manage.

If you're using Amazon SageMaker only for training or only for hosting, build only the required functionality into your container.

[CIFAR-10]: http://www.cs.toronto.edu/~kriz/cifar.html

### The Presentation

This example is divided into two parts. The first explains how to _build_ the container and the second explains how to _use_ the container.

## Part 1: Packaging and Uploading Your Algorithm for Use with Amazon SageMaker

### Docker Overview

If you're familiar with Docker, you can skip to [How Amazon SageMaker Runs Your Docker Container](#How-Amazon-SageMaker-Runs-Your-Docker-Container).

Docker has become very popular in programming and DevOps communities because of its flexibility and its well-defined specification for how code can be run in its containers. It is the underpinning of many services built in the past few years, such as [Amazon Elastic Container Service (Amazon ECS)][Amazon ECS]. Although Docker containers are unfamiliar to many data scientists, they aren't difficult to build and use, and they can significantly simplify software package deployment.

You use Docker to package arbitrary code into an _image_ that is totally self-contained. After creating the image, you use Docker to run a _container_ based on that image. Running a container is just like running a program, except that a container creates a fully self-contained environment for the program to run in. Containers are isolated from each other and from the host environment, so they run your program the way it is set up, no matter where you run it.

Docker is more powerful than environment managers like Conda or virtualenv because it is completely language independent and because it comprises your whole operating environment, including startup commands and environment variables.

A Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container can start in less than a second, and many containers can run simultaneously on the same physical or virtual machine instance.

Docker uses a simple file called a `Dockerfile` to specify how the image is assembled. You'll see an example in this walkthrough. You can build your Docker images based on Docker images that you've already built or on images built by others.

Amazon SageMaker uses Docker to enable users to train and deploy arbitrary algorithms. In Amazon SageMaker, Docker containers are invoked one way for training and another, slightly different, way for hosting. 

For more information about Docker, see the following:

* [Docker home page](http://www.docker.com)
* [Getting started with Docker](https://docs.docker.com/get-started/)
* [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)
* [`docker run` reference](https://docs.docker.com/engine/reference/run/)

[Amazon ECS]: https://aws.amazon.com/ecs/

### How Amazon SageMaker Runs Your Docker Container

Because you can run the same image in training or hosting, Amazon SageMaker runs your container with the argument `train` or `serve`. How your container processes this argument depends on the container.

* In this example, you don't define an `ENTRYPOINT` in the Dockerfile, so Docker runs the command [`train` at training time](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html) and [`serve` at serving time](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html). In this example, you define these as executable Python scripts, but you could use any program that you want to start in that environment.
* If you specify a program as an `ENTRYPOINT` in the Dockerfile, that program runs at startup, and its first argument is `train` or `serve`. The program looks at that argument and decides what to do.
* If you are building separate containers for training and hosting (or building for only one or the other), you can define a program as an `ENTRYPOINT` in the Dockerfile and ignore (or verify) the first argument that is passed in. 
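
The second option in the list, a single `ENTRYPOINT` program that inspects its first argument, could look roughly like the following Python sketch. The names `run_training`, `run_serving`, and `dispatch` are hypothetical placeholders and are not part of this example's container, which relies on separate `train` and `serve` executables instead.

```python
def run_training():
    # Hypothetical placeholder: a real container would start training here.
    return "training"


def run_serving():
    # Hypothetical placeholder: a real container would start its server here.
    return "serving"


# Map the argument that Amazon SageMaker passes to the matching handler.
HANDLERS = {"train": run_training, "serve": run_serving}


def dispatch(argv):
    """Run the handler named by the first argument after the program name."""
    if len(argv) < 2 or argv[1] not in HANDLERS:
        raise SystemExit("usage: <program> train|serve")
    return HANDLERS[argv[1]]()
```

Used as an entry point, the script would end with a call such as `dispatch(sys.argv)` after importing `sys`.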

#### How Amazon SageMaker Runs Your Container During Training

When Amazon SageMaker runs training, your `train` script runs like a regular Python program. The `/opt/ml` directory has the following layout:

    /opt/ml
    ├── input
    │   ├── config
    │   │   ├── hyperparameters.json
    │   │   └── resourceConfig.json
    │   └── data
    │       └── <channel_name>
    │           └── <input data>
    ├── model
    │   └── <model files>
    └── output
        └── failure

##### The Input
Input files provide the following:
* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary that maps hyperparameter names to values. These values are always strings, so you might need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training.
* `/opt/ml/input/data/<channel_name>/` (for File mode) contains the input data for that channel. The channels are created based on the call to the CreateTrainingJob operation, but it's important that channels match algorithm expectations. The files for each channel are copied from Amazon Simple Storage Service (Amazon S3) to this directory, preserving the tree structure indicated by the S3 key structure. 
* `/opt/ml/input/data/<channel_name>_<epoch_number>` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and increment by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.
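
As a rough illustration of the File mode inputs described above, the following sketch reads the hyperparameters and lists the files in one channel. The helper names are hypothetical, and the directory arguments exist only so that the functions can be tried outside a container; their defaults are the paths listed above.

```python
import json
import os


def load_hyperparameters(config_dir="/opt/ml/input/config"):
    """Read hyperparameters.json; every value arrives as a string."""
    path = os.path.join(config_dir, "hyperparameters.json")
    with open(path) as f:
        return json.load(f)


def list_channel_files(channel, data_dir="/opt/ml/input/data"):
    """Return the sorted paths of all files copied into a File mode channel."""
    channel_dir = os.path.join(data_dir, channel)
    paths = []
    # Walk the whole tree because the S3 key structure is preserved on disk.
    for root, _dirs, files in os.walk(channel_dir):
        for name in files:
            paths.append(os.path.join(root, name))
    return sorted(paths)
```

Because the values are strings, a training script would convert them before use, for example `int(load_hyperparameters()["train-steps"])` for the `train-steps` hyperparameter that this notebook passes later.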

##### The Output 
There are two output directories: 
* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format. It can be a single file or a whole directory tree. Amazon SageMaker packages files in this directory into a compressed tar archive file. This file is made available at the Amazon S3 location returned in the `DescribeTrainingJob` response.
* `/opt/ml/output` is the directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file are returned in the `FailureReason` field of the `DescribeTrainingJob` response. For jobs that succeed, there is no reason to write this file because Amazon SageMaker ignores it.
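
One way to honor both output directories is to wrap the training call so that any exception is recorded in the `failure` file. This is a minimal sketch, not the code used by this example's `train` script; `run_and_report` and `train_fn` are hypothetical names, and `train_fn` stands for whatever function writes your model files.

```python
import os
import traceback


def run_and_report(train_fn, model_dir="/opt/ml/model", output_dir="/opt/ml/output"):
    """Call train_fn(model_dir); on error, write the `failure` file and re-raise."""
    os.makedirs(model_dir, exist_ok=True)
    try:
        train_fn(model_dir)
    except Exception:
        # Record the reason so it can surface in the FailureReason field.
        os.makedirs(output_dir, exist_ok=True)
        with open(os.path.join(output_dir, "failure"), "w") as f:
            f.write(traceback.format_exc())
        # Re-raise so that the process still exits with an error.
        raise
```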

#### How Amazon SageMaker Runs Your Container During Hosting

Hosting requires a very different model than training because hosting responds to inference requests that come in through HTTP. In this example, we use [TensorFlow Serving](https://www.tensorflow.org/serving/), but you can customize the hosting solution. For an example, see [Python serving stack within the scikit-learn example](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb).

Amazon SageMaker hosting uses two URLs that are included in the container:

* `/ping` receives `GET` requests from the infrastructure. If the container is accepting requests, your program returns 200.
* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response is up to the algorithm. If the client supplied `ContentType` and `Accept` headers, these are passed in too. 
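
To make the two routes concrete, here is a minimal sketch that uses only the Python standard library. It is not what this example runs (the example puts nginx in front of TensorFlow Serving), and the "model" is a hypothetical stand-in that only counts the instances it receives. The sketch assumes the container listens on port 8080.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Health check: reply 200 when the container can accept requests.
        if self.path == "/ping":
            self._reply(200, b"")
        else:
            self._reply(404, b"")

    def do_POST(self):
        if self.path != "/invocations":
            self._reply(404, b"")
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        # Hypothetical stand-in for a model: report how many instances arrived.
        body = json.dumps({"received": len(payload.get("instances", []))})
        self._reply(200, body.encode("utf-8"), "application/json")

    def _reply(self, status, body, content_type="text/plain"):
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging to keep the sketch quiet.
        pass


def make_server(host="0.0.0.0", port=8080):
    """Create (but do not start) a server; call serve_forever() to run it."""
    return HTTPServer((host, port), InferenceHandler)
```

A `serve` program built on this sketch would call `make_server().serve_forever()`. A production container would normally use a more robust server, which is why this example relies on nginx and TensorFlow Serving.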

In the container, the model files are in the same place that they were written to during training:

    /opt/ml
    └── model
        └── <model files>



### The Example Container

The `container` directory contains all of the components that you need to package the example algorithm for Amazon SageMaker:

    .
    ├── Dockerfile
    ├── build_and_push.sh
    └── cifar10
        ├── cifar10.py
        ├── resnet_model.py
        ├── nginx.conf
        ├── serve
        └── train

The components perform the following tasks:

* __`Dockerfile`__ describes how to build your Docker container image. 
* __`build_and_push.sh`__ is a script that uses the Dockerfile to build your container image and then pushes it to Amazon ECR. You invoke the commands directly later in this notebook, but you can copy and run the script for your own algorithms.
* __`cifar10`__ contains the files that are installed in the container.

For this simple application, you install only five files in the container: 

* __`cifar10.py`__ is the program that implements your training algorithm.
* __`resnet_model.py`__ is the program that contains your ResNet model. 
* __`nginx.conf`__ is the configuration file for the nginx front end. Generally, you should be able to take this file as is.
* __`serve`__ is the program that is started when the container is started for hosting. It launches nginx and loads your exported model with TensorFlow Serving.
* __`train`__ is the program that is invoked when the container is run for training. Our implementation of this script invokes `cifar10.py` with hyperparameter values retrieved from `/opt/ml/input/config/hyperparameters.json`. We do this to avoid having to modify the training algorithm program.
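
The wrapper idea behind `train` can be sketched as a function that turns the hyperparameter dictionary into a command line. This is an illustration only: `build_training_command` is a hypothetical helper, and the assumption that `cifar10.py` accepts each hyperparameter as a `--name value` flag is not confirmed by this notebook.

```python
def build_training_command(hyperparameters, script="cifar10.py"):
    """Turn {"name": "value"} pairs into ["python", script, "--name", "value", ...]."""
    command = ["python", script]
    # Sort the names so that the resulting command line is deterministic.
    for name in sorted(hyperparameters):
        command += ["--" + name, str(hyperparameters[name])]
    return command
```

A wrapper script would load the dictionary from `/opt/ml/input/config/hyperparameters.json` and pass the resulting list to `subprocess.check_call`, leaving the training program itself unchanged.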

You might need only five files, but if you have many supporting routines, you might want to install more. 

This is the standard structure of our Python containers, although you are free to choose a different toolset and, therefore, could have a different layout. If you're writing in a different programming language, your layout will depend on the framework and tools that you choose.

You probably will want to change two files for your application: `train` and `serve`.

### The Dockerfile

The Dockerfile describes the image that you want to build. It describes the complete operating system installation of the system that you want to run. A running Docker container is much lighter than a full operating system, because it uses Linux on the host machine for basic operations. 

For the Python science stack, start with an official TensorFlow Docker image and run the standard tools to install TensorFlow Serving. Then add the code that implements your specific algorithm to the container, and set up the right environment for it to run under.

Here's the Dockerfile for this example:

In [1]:
!cat container/Dockerfile

# Copyright 2017-2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.

# For more information on creating a Dockerfile
# https://docs.docker.com/compose/gettingstarted/#step-2-create-a-dockerfile
FROM tensorflow/tensorflow:1.8.0-py3

RUN apt-get update && apt-get install -y --no-install-recommends nginx curl

# Download TensorFlow Serving
# https://www.tensorflow.org/serving/setup#installing_the_modelserver
RUN echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-

### Build and Register the Container

The following shell code uses `docker build` to build the container image and `docker push` to push the container image to Amazon ECR. This code is also available as the shell script `container/build_and_push.sh`, which you can run as `build_and_push.sh tensorflow-cifar10-example` to build the image `tensorflow-cifar10-example`.

The code looks for an Amazon ECR repository in the account that you're using and the current default AWS Region (if you're using an Amazon SageMaker notebook instance, this is the Region where the notebook instance was created). If there is no Amazon ECR repository, the script creates it.

In [None]:
%%sh

# The name of our algorithm
algorithm_name=tensorflow-cifar10-example

cd container

chmod +x cifar10/train
chmod +x cifar10/serve

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build  -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}
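
The "create the repository if it doesn't exist" step of the shell code can also be expressed in Python. The sketch below assumes that you pass in an Amazon ECR client created with `boto3.client('ecr')`; `ensure_repository` is a hypothetical helper name and is not part of the notebook.

```python
def ensure_repository(ecr_client, repository_name):
    """Return the repository URI, creating the repository if it is missing."""
    try:
        response = ecr_client.describe_repositories(
            repositoryNames=[repository_name]
        )
        return response["repositories"][0]["repositoryUri"]
    except ecr_client.exceptions.RepositoryNotFoundException:
        # The repository doesn't exist yet, so create it.
        response = ecr_client.create_repository(repositoryName=repository_name)
        return response["repository"]["repositoryUri"]
```

For example, `ensure_repository(boto3.client('ecr'), 'tensorflow-cifar10-example')` would return the URI that the image is tagged with, without the `:latest` suffix.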

### Test the Algorithm Locally 

When you're packaging your first algorithm for use with Amazon SageMaker, it's a good idea to test it to make sure it's working correctly. You use the [Amazon SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to test both locally and on Amazon SageMaker. To test our algorithm, you need to download our dataset.

For more examples of using the Amazon SageMaker Python SDK, see [Amazon SageMaker Examples](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk). 

### Download the CIFAR-10 Dataset
The training algorithm expects training data to be in [TFRecords](https://www.tensorflow.org/guide/datasets) file format. TFRecords format is a simple record-oriented binary format that many TensorFlow applications use for training data.
The following Python script downloads the CIFAR-10 dataset and converts it into TFRecords. It is adapted from the [official TensorFlow CIFAR-10 example](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator).

In [None]:
! python utils/generate_cifar10_tfrecords.py --data-dir=/tmp/cifar-10-data

In [2]:
# There should be three tfrecords. (eval, train, validation)
! ls /tmp/cifar-10-data

eval.tfrecords	train.tfrecords  validation.tfrecords


### Train Locally with the Amazon SageMaker Python SDK 
To represent training, you use the Estimator class. You need to configure the following: 
1. `role` - The IAM role to use as the AWS execution role.
2. `train_instance_count` - The number of instances to use for training.
3. `train_instance_type` - The type of instance to use for training. For training locally, specify `local`.
4. `image_name` - The name of the custom TensorFlow Docker image that we created.
5. `hyperparameters` - The hyperparameters that we want to pass.

To set up the IAM role, you use a helper function in the Amazon SageMaker Python SDK. The function gets metadata from the notebook instance, so if you run it outside of a notebook instance, it throws an exception. To run this notebook outside of a notebook instance, you must provide an IAM role that has the required permissions yourself. For more information, see [Required Permissions](#Required-Permissions).
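
If you expect to run the code both inside and outside a notebook instance, one option is to fall back to an explicit role ARN when the helper fails. This is a hedged sketch: `resolve_role` and its `get_role` parameter are hypothetical, and the broad `except` is used only because the exact exception type is not specified here.

```python
def resolve_role(fallback_role_arn, get_role=None):
    """Use the notebook's execution role if available, else the given ARN."""
    try:
        if get_role is None:
            # Imported lazily so that the sketch loads even without the SDK.
            from sagemaker import get_execution_role as get_role
        return get_role()
    except Exception:
        # Outside a notebook instance, fall back to the role you supply.
        return fallback_role_arn
```

You would call it with the ARN of a role that has the required permissions, for example `role = resolve_role('arn:aws:iam::123456789012:role/MySageMakerRole')`, where the ARN shown is a made-up placeholder.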

In [3]:
from sagemaker import get_execution_role

role = get_execution_role()

### Fit, Deploy, and Predict

Now you can call `fit()` with the path to your local copy of the CIFAR-10 dataset prefixed with `file://`. This invokes the TensorFlow container with `train` and passes in your hyperparameters and other metadata as .json files in /opt/ml/input/config within the container.

After training succeeds, the training algorithm outputs a trained model to the /opt/ml/model directory, which is used to handle predictions.

You can then call `deploy()` with the instance_count and instance_type, `1` and `local`, respectively. This invokes the TensorFlow container with `serve`, which sets up your container to handle prediction requests through TensorFlow Serving. A predictor is returned, which you use to make inferences against your trained model.

After you get your prediction, you can delete the endpoint.

We recommend that you test and train your algorithm locally first, because local runs iterate faster and are easier to debug.

In [4]:
# Let's set up our SageMaker notebook instance for local mode.
!/bin/bash ./utils/setup.sh

SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


In [None]:
from sagemaker.estimator import Estimator

hyperparameters = {'train-steps': 100}

instance_type = 'local'

estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name='tensorflow-cifar10-example:latest',
                      hyperparameters=hyperparameters)

estimator.fit('file:///tmp/cifar-10-data')

predictor = estimator.deploy(1, instance_type)

### Make Predictions Using the Amazon SageMaker Python SDK

To make predictions, you read an image with OpenCV, convert it to JSON, and send it as an inference request. Install OpenCV first so that you can load and resize the image.

The JSON response is the probability that the image belongs to each of the 10 classes and the most likely class that it belongs to. For a list of classes, see the [CIFAR-10 website](https://www.cs.toronto.edu/~kriz/cifar.html). We didn't train the model for long, so we aren't expecting very accurate results.

In [None]:
! pip install opencv-python

In [6]:
import cv2
import numpy

from sagemaker.predictor import json_serializer, json_deserializer

image = cv2.imread("data/cat.png", 1)

# resize, as our model is expecting images in 32x32.
image = cv2.resize(image, (32, 32))

data = {'instances': numpy.asarray(image).astype(float).tolist()}

# The request and response format is JSON for TensorFlow Serving.
# For more information: https://www.tensorflow.org/serving/api_rest#predict_api
predictor.accept = 'application/json'
predictor.content_type = 'application/json'

predictor.serializer = json_serializer
predictor.deserializer = json_deserializer

# For more information on the predictor class.
# https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/predictor.py
predictor.predict(data)

algo-1-L58J2_1  | 172.18.0.1 - - [03/Aug/2018:22:32:52 +0000] "POST /invocations HTTP/1.1" 200 229 "-" "-"


{'predictions': [{'probabilities': [2.29861e-05,
    0.0104983,
    0.147974,
    0.01538,
    0.0478089,
    0.00164997,
    0.758483,
    0.0164191,
    0.00125304,
    0.000510801],
   'classes': 6}]}
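
To read a response like the one above, you can map the predicted class index to a label name. The sketch below assumes that the model's class indices follow the label order published on the CIFAR-10 website; `describe_prediction` is a hypothetical helper, and the `response` literal is copied from the output shown above.

```python
# Label order as listed on the CIFAR-10 website (assumed to match the model).
CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]


def describe_prediction(prediction):
    """Return (label, probability) for the most likely class in one prediction."""
    index = prediction["classes"]
    return CIFAR10_CLASSES[index], prediction["probabilities"][index]


response = {'predictions': [{'probabilities': [2.29861e-05, 0.0104983, 0.147974,
                                               0.01538, 0.0478089, 0.00164997,
                                               0.758483, 0.0164191, 0.00125304,
                                               0.000510801],
                             'classes': 6}]}

label, probability = describe_prediction(response['predictions'][0])
print(label, probability)
```

Under that assumption, class 6 corresponds to `frog`, so the briefly trained model misclassifies the cat image here.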

In [None]:
predictor.delete_endpoint()

## Part 2: Training and Hosting Your Algorithm in Amazon SageMaker
After packaging your container, you can use it to train and serve models. 

### Set Up the Environment
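Specify the S3 key prefix under which the training data is stored. The data is uploaded to the session's default S3 bucket, and the IAM role is the `role` that you retrieved earlier with `get_execution_role()`.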

In [None]:
# S3 prefix
prefix = 'DEMO-tensorflow-cifar10'

### Create the Session

The session remembers connection parameters to Amazon SageMaker. You perform all of your Amazon SageMaker operations in the session.

In [None]:
import sagemaker as sage

sess = sage.Session()

### Upload the Data for Training

To upload the data to a default S3 bucket, you use the tools provided by the Amazon SageMaker Python SDK.

In [None]:
WORK_DIRECTORY = '/tmp/cifar-10-data'

data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)

### Train on Amazon SageMaker
To train a model on Amazon SageMaker, you use the Amazon SageMaker Python SDK in much the same way as you did to train locally, with the following changes:

1. Change the train_instance_type from `local` to one of the [supported EC2 instance types](https://aws.amazon.com/sagemaker/pricing/instance-types/).
2. Specify the URL of the Amazon ECR image that you pushed.
3. Make sure that your local training dataset is in Amazon S3 and that the S3 URL to the dataset is passed into the `fit()` call.

Begin by fetching the URL of the Amazon ECR image.

In [None]:
import boto3

client = boto3.client('sts')
account = client.get_caller_identity()['Account']

my_session = boto3.session.Session()
region = my_session.region_name

algorithm_name = 'tensorflow-cifar10-example'

ecr_image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, algorithm_name)

print(ecr_image)

In [None]:
from sagemaker.estimator import Estimator

hyperparameters = {'train-steps': 100}

instance_type = 'ml.m4.xlarge'

estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name=ecr_image,
                      hyperparameters=hyperparameters)

estimator.fit(data_location)

predictor = estimator.deploy(1, instance_type)

In [None]:
image = cv2.imread("data/cat.png", 1)

# resize, as our model is expecting images in 32x32.
image = cv2.resize(image, (32, 32))

data = {'instances': numpy.asarray(image).astype(float).tolist()}

predictor.accept = 'application/json'
predictor.content_type = 'application/json'

predictor.serializer = json_serializer
predictor.deserializer = json_deserializer

predictor.predict(data)

### Optional: Clean Up
You can see all of the training jobs, models, and endpoints that you created in the Amazon SageMaker console in your AWS account. When you're done with the endpoint, delete it to avoid accruing unnecessary charges.

In [None]:
predictor.delete_endpoint()

## Reference
- [How Amazon SageMaker Runs Your Training Image](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)
- [How Amazon SageMaker Runs Your Inference Image](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html)
- [CIFAR-10 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html)
- [Amazon SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
- [Dockerfile](https://docs.docker.com/engine/reference/builder/)
- [scikit-bring-your-own](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb)