# Packaging your TensorFlow model

With Amazon SageMaker, you can package your own algorithms that can then be trained and deployed in the SageMaker environment. This notebook guides you through an example using TensorFlow that shows you how to build a container for SageMaker and use it for training.

## Permissions

Running this notebook requires permissions in addition to the normal `AmazonSageMakerFullAccess` permissions, because it creates new repositories in Amazon ECR. The easiest way to add these permissions is to attach the managed policy `AmazonEC2ContainerRegistryFullAccess` to the role that you used to start your notebook instance. There's no need to restart your notebook instance when you do this; the new permissions are available immediately.

## The example

In this example, we show how to package a custom TensorFlow algorithm, using a Python example that works with the [CIFAR-10](http://www.cs.toronto.edu/~kriz/cifar.html) dataset.

### The parts of the algorithm

The `src` directory has all the components you need to package the sample algorithm for Amazon SageMaker:

    └── src
        ├── cifar10.py
        └── resnet_model.py

Let's discuss each of these in turn:

* __`src`__ is the directory which contains the files that are installed in the container.

The files that we put in the container are:

* __`cifar10.py`__ is the program that implements our training algorithm.
* __`resnet_model.py`__ is the program that contains our ResNet model.

### Packaging the Training Code

In [1]:
from build_sagemaker_container import build

tag = 'tensorflow-cifar10-example:latest'

build(base_image='tensorflow/tensorflow:1.11.0-py3',
      entrypoint='cifar10.py',
      source_dir='src',
      tag=tag)


FROM tensorflow/tensorflow:1.11.0-py3

RUN apt-get update && apt-get install -y --no-install-recommends git

RUN git clone https://github.com/mvsusp/sagemaker-containers.git -b mvs-sagemaker-containers-train-improvements && cd sagemaker-containers && pip install . --quiet --disable-pip-version-check

COPY src /opt/ml/code

ENV PYTHONPATH /opt/ml/code:$PYTHONPATH
ENV SAGEMAKER_TRAINING_MODULE cifar10


Sending build context to Docker daemon  23.55kB
Step 1/6 : FROM tensorflow/tensorflow:1.11.0-py3
 ---> 7f147470ab6f
Step 2/6 : RUN apt-get update && apt-get install -y --no-install-recommends git
 ---> Using cache
 ---> c93c5af50332
Step 3/6 : RUN git clone https://github.com/mvsusp/sagemaker-containers.git -b mvs-sagemaker-containers-train-improvements && cd sagemaker-containers && pip install . --quiet --disable-pip-version-check
 ---> Using cache
 ---> 3b868bb84114
Step 4/6 : COPY src /opt/ml/code
 ---> Using cache
 ---> 230f3686567a
Step 5/6 : ENV PYTHONPATH /opt/ml/code:$PYTHONPATH
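
The Dockerfile printed above follows fairly directly from the arguments passed to `build`. As a rough illustration only, and not the actual source of `build_sagemaker_container`, the sketch below renders a similar Dockerfile from the same three inputs. The function name `render_dockerfile` and its `setup_commands` parameter are hypothetical, and the `git` and `sagemaker-containers` install steps are left out unless you pass them in.

```python
import os


def render_dockerfile(base_image, entrypoint, source_dir, setup_commands=()):
    """Return Dockerfile text similar to the one printed by build().

    Hypothetical helper for illustration; not part of build_sagemaker_container.
    """
    # The training module name is the entry point file name without ".py".
    module_name = os.path.splitext(os.path.basename(entrypoint))[0]

    lines = ["FROM {}".format(base_image), ""]
    # Optional extra RUN steps, for example installing system packages.
    for command in setup_commands:
        lines += ["RUN {}".format(command), ""]
    lines += [
        "COPY {} /opt/ml/code".format(source_dir),
        "",
        "ENV PYTHONPATH /opt/ml/code:$PYTHONPATH",
        "ENV SAGEMAKER_TRAINING_MODULE {}".format(module_name),
    ]
    return "\n".join(lines) + "\n"


dockerfile = render_dockerfile(
    base_image="tensorflow/tensorflow:1.11.0-py3",
    entrypoint="cifar10.py",
    source_dir="src",
)
print(dockerfile)
```

Keeping the rendering separate from the `docker build` call makes it easy to inspect the Dockerfile before any image is built.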


### The build command

In [2]:
??build

Signature: build(base_image, entrypoint, source_dir, tag, build_commands=None)
Source:
def build(base_image, entrypoint, source_dir, tag, build_commands=None):
    """
    Build your algo using from a Docker `base_image`.

    Args:
        base_image (string): Docker image which your algo is based from.
        entrypoint (string): Path (relative) to the Python source file which should be executed
                as the entry point to training. This should be compatible with either Python 2.7 or Python 3.5.
        source_dir (str): Path (absolute or re

## Testing your algorithm on your local machine

When you're packaging your first algorithm to use with Amazon SageMaker, you probably want to test it yourself to make sure it's working correctly. We use the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to test both locally and on SageMaker. For more examples with the SageMaker Python SDK, see [Amazon SageMaker Examples](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk). To test our algorithm, we first need our dataset.

## Download the CIFAR-10 dataset
Our training algorithm expects its training data in the [TFRecords](https://www.tensorflow.org/guide/datasets) file format, a simple record-oriented binary format that many TensorFlow applications use for training data.
Below is a Python script adapted from the [official TensorFlow CIFAR-10 example](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator), which downloads the CIFAR-10 dataset and converts it into TFRecords.

In [3]:
import utils

data_dir = '/tmp/cifar-10-data'

utils.download_cifar10_tf_records(data_dir)

Download from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz and extract.

Downloading cifar-10-python.tar.gz
Successfully downloaded cifar-10-python.tar.gz 170498071 bytes.
Generating /tmp/cifar-10-data/eval.tfrecords
Generating /tmp/cifar-10-data/train.tfrecords
Generating /tmp/cifar-10-data/validation.tfrecords
Removing original files.
Done!


In [4]:
ls {data_dir}

eval.tfrecords        train.tfrecords       validation.tfrecords
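
To make the phrase "record-oriented binary format" concrete, the sketch below writes and reads the record framing that TFRecord files use, as we understand it: an 8-byte little-endian length, a masked CRC-32C of the length, the payload, and a masked CRC-32C of the payload. It is plain Python with no TensorFlow dependency. The helper names are ours, and the byte strings stand in for the serialized `tf.train.Example` messages that real files typically hold. For real work, use TensorFlow's own readers and writers.

```python
import io
import struct

_MASK_DELTA = 0xA282EAD8  # constant added when masking a CRC


def crc32c(data):
    """Bitwise CRC-32C (Castagnoli); slow but dependency-free."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def masked_crc(data):
    """Rotate the CRC right by 15 bits, then add a constant (mod 2**32)."""
    crc = crc32c(data)
    rotated = (crc >> 15) | ((crc << 17) & 0xFFFFFFFF)
    return (rotated + _MASK_DELTA) & 0xFFFFFFFF


def write_record(stream, payload):
    """Append one framed record to a binary stream."""
    header = struct.pack("<Q", len(payload))
    stream.write(header)
    stream.write(struct.pack("<I", masked_crc(header)))
    stream.write(payload)
    stream.write(struct.pack("<I", masked_crc(payload)))


def read_records(stream):
    """Yield payloads from a binary stream, verifying both checksums."""
    while True:
        header = stream.read(8)
        if not header:
            return
        (length,) = struct.unpack("<Q", header)
        (length_crc,) = struct.unpack("<I", stream.read(4))
        payload = stream.read(length)
        (payload_crc,) = struct.unpack("<I", stream.read(4))
        if length_crc != masked_crc(header) or payload_crc != masked_crc(payload):
            raise ValueError("corrupt record")
        yield payload


buffer = io.BytesIO()
for item in (b"first image", b"second image"):
    write_record(buffer, item)
buffer.seek(0)
records = list(read_records(buffer))
print(records)
```

Each record carries 16 bytes of framing, so a reader can skip or validate records without parsing their contents.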


## Testing with Docker

In [5]:
training_channel = '/opt/ml/input/data/training'

!docker run -v {data_dir}:{training_channel} {tag} train --train-steps 100

2018-10-27 17:20:32,416 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2018-10-27 17:20:32,428 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "job_name": null,
    "current_host": "8954c2624492",
    "input_data_config": {
        "training": {}
    },
    "hyperparameters": {
        "train-steps": 100
    },
    "hosts": [
        "8954c2624492"
    ],
    "output_data_dir": "/opt/ml/output/data",
    "framework_module": "cifar10",
    "log_level": 20,
    "module_dir": "/opt/ml/code",
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "module_name": "None",
    "input_dir": "/opt/ml/input",
    "network_interface_name": "ethwe",
    "resource_config": {
        "current_host": "8954c2624492",
        "hosts": [
            "8954c2624492"
        ]
    },
    "num_cpus": 2,
    "input_config_dir": "/opt/ml/input/config",
    "output_dir": "/opt/ml/output",
    "model_dir": "/opt/ml/model",
 
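
The `docker run` command in the cell above does two things: it mounts the local data directory at the container path of the `training` channel, and it passes each hyperparameter as a command-line flag after `train`. A small sketch of assembling the same command from Python follows. `docker_run_command` is a hypothetical helper, not part of the SageMaker Python SDK, and it assumes every channel lives under `/opt/ml/input/data/`.

```python
import shlex


def docker_run_command(tag, data_dir, channel="training", hyperparameters=None):
    """Build the argument list for a local `docker run ... train` test."""
    # Each channel is mounted at /opt/ml/input/data/<channel> in the container.
    channel_dir = "/opt/ml/input/data/{}".format(channel)
    command = ["docker", "run", "-v", "{}:{}".format(data_dir, channel_dir), tag, "train"]
    # Hyperparameters become "--name value" pairs after the train argument.
    for name, value in (hyperparameters or {}).items():
        command += ["--{}".format(name), str(value)]
    return command


command = docker_run_command(
    tag="tensorflow-cifar10-example:latest",
    data_dir="/tmp/cifar-10-data",
    hyperparameters={"train-steps": 100},
)
print(" ".join(shlex.quote(part) for part in command))
```

The returned list can be passed to `subprocess.run(command, check=True)` on a machine where Docker is installed.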

## SageMaker Python SDK Local Training
To represent our training job, we use the `Estimator` class, which takes five settings:

1. `role` - the IAM role used as our AWS execution role.
2. `train_instance_count` - the number of instances to use for training.
3. `train_instance_type` - the type of instance to use for training. For training locally, we specify `local`.
4. `image_name` - the custom TensorFlow Docker image we created.
5. `hyperparameters` - the hyperparameters we want to pass.

Let's start by setting up our IAM role. We make use of a helper function within the Python SDK. This function throws an exception if run outside of a SageMaker notebook instance, because it reads metadata from the notebook instance. If you are running elsewhere, you must provide an IAM role with the access described above in [Permissions](#Permissions).

In [6]:
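from sagemaker import get_execution_role

# Inside a SageMaker notebook instance, use: role = get_execution_role()
# Outside of one, name an IAM role directly, as done here.
role = 'SageMakerRole'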
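
One way to support both cases is to try the helper first and fall back to a named role when it fails. The sketch below is an assumption about how you might structure that, not SDK behavior: `resolve_role` is a hypothetical function, and `_outside_notebook` is a stand-in that simulates the helper failing outside a notebook instance so the example runs anywhere.

```python
def resolve_role(get_role, fallback_role):
    """Return get_role(), or fallback_role if the helper cannot find a role."""
    try:
        return get_role()
    except Exception as error:  # broad on purpose: the helper's failure modes vary
        print("Falling back to {!r}: {}".format(fallback_role, error))
        return fallback_role


def _outside_notebook():
    # Stand-in for sagemaker.get_execution_role when no notebook metadata exists.
    raise ValueError("not running on a SageMaker notebook instance")


role = resolve_role(_outside_notebook, "SageMakerRole")
print(role)
```

In the notebook, the call would be `resolve_role(get_execution_role, 'SageMakerRole')`.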

## Fit

Now that the rest of our estimator is configured, we can call `fit()` with the path to our local CIFAR-10 dataset, prefixed with `file://`. This invokes our TensorFlow container with the `train` argument and passes in our hyperparameters and other metadata as JSON files in `/opt/ml/input/config` within the container.

After training succeeds, our training algorithm writes the trained model to the `/opt/ml/model` directory, which is used to handle predictions.

We recommend testing and training your algorithm locally first, as it provides quicker iterations and easier debugging.
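
Inside the container, training code can read those JSON files directly. The sketch below assumes the hyperparameters are stored in a file named `hyperparameters.json` under the config directory, with each value serialized as a string. The helper name `load_hyperparameters` is ours, and a temporary directory stands in for `/opt/ml/input/config` so the example runs anywhere.

```python
import json
import os
import tempfile


def load_hyperparameters(config_dir="/opt/ml/input/config"):
    """Read hyperparameters.json and decode values that were stored as strings."""
    path = os.path.join(config_dir, "hyperparameters.json")
    with open(path) as handle:
        raw = json.load(handle)

    decoded = {}
    for name, value in raw.items():
        try:
            decoded[name] = json.loads(value)  # "100" becomes the integer 100
        except (TypeError, ValueError):
            decoded[name] = value  # keep values that are not JSON-encoded
    return decoded


# Simulate the config directory that the container would provide.
with tempfile.TemporaryDirectory() as config_dir:
    with open(os.path.join(config_dir, "hyperparameters.json"), "w") as handle:
        json.dump({"train-steps": "100"}, handle)
    hyperparameters = load_hyperparameters(config_dir)

print(hyperparameters)
```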

In [7]:
from sagemaker.estimator import Estimator

hyperparameters = {'train-steps': 100}

instance_type = 'local'

estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name='tensorflow-cifar10-example:latest',
                      hyperparameters=hyperparameters)

estimator.fit('file:///tmp/cifar-10-data')

INFO:sagemaker:Creating training-job with name: tensorflow-cifar10-example-2018-10-27-17-21-00-224


Creating tmpnnll4yej_algo-1-LQDWW_1 ... 
Attaching to tmpnnll4yej_algo-1-LQDWW_1
algo-1-LQDWW_1  | 2018-10-27 17:21:03,462 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-LQDWW_1  | 2018-10-27 17:21:03,474 sagemaker-containers INFO     Invoking user script
algo-1-LQDWW_1  | 
algo-1-LQDWW_1  | Training Env:
algo-1-LQDWW_1  | 
algo-1-LQDWW_1  | {
algo-1-LQDWW_1  |     "model_dir": "/opt/ml/model",
algo-1-LQDWW_1  |     "log_level": 20,
algo-1-LQDWW_1  |     "hosts": [
algo-1-LQDWW_1  |         "algo-1-LQDWW"
algo-1-LQDWW_1  |     ],
algo-1-LQDWW_1  |     "hyperparameters": {
algo-1-LQDWW_1  |         "train-steps": 100
algo-1-LQDWW_1  |     },
algo-1-LQDWW_1  |     "module_dir": "/opt/ml/code",
algo-1-LQDWW_1  |     "output_dir": "/opt/ml/output",
algo-1-LQDWW_1  |     "output_data_dir": "/opt/ml/o

# Part 2: Training in Amazon SageMaker
Once you have your container pushed to ECR, you can use it to train models. Let's do that with the algorithm we made above.

## Upload the data for training

We will use the tools provided by the SageMaker Python SDK to upload the data to a default bucket.

In [8]:
import sagemaker

# S3 prefix
prefix = 'DEMO-tensorflow-cifar10'

data_location = sagemaker.Session().upload_data(data_dir, key_prefix=prefix)

## Push the image

In [9]:
from build_sagemaker_container import push

ecr_image_name = push(tag)

Login Succeeded
Pushing docker image to ECR repository 369233609183.dkr.ecr.us-west-2.amazonaws.com/tensorflow-cifar10-example:latest

The push refers to repository [369233609183.dkr.ecr.us-west-2.amazonaws.com/tensorflow-cifar10-example]
796eec1122b4: Preparing
9fd96661976b: Preparing
741f5fa65bfc: Preparing
788a01b9cb70: Preparing
4331257a069e: Preparing
a6a13fd7a75f: Preparing
9ff6cd787adb: Preparing
32e1e1d8a456: Preparing
9a0f96301e7d: Preparing
fde791900dd4: Preparing
fa8678ba5abc: Preparing
f157c6afd0c0: Preparing
75b79e19929c: Preparing
4775b2f378bb: Preparing
883eafdbe580: Preparing
19d043c86cbc: Preparing
8823818c4748: Preparing
fa8678ba5abc: Waiting
f157c6afd0c0: Waiting
75b79e19929c: Waiting
4775b2f378bb: Waiting
883eafdbe580: Waiting
19d043c86cbc: Waiting
8823818c4748: Waiting
32e1e1d8a456: Waiting
9a0f96301e7d: Waiting
fde791900dd4: Waiting
9ff6cd787adb: Waiting
a6a13fd7a75f: Waiting
4331257a069e: Layer already exists
796eec1122b4: Layer already exists
788a01b9cb70: Layer
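
The repository address in the output above has a regular shape: account ID, region, and image tag. The sketch below composes such an address. The helper name `ecr_image_uri` is hypothetical, the account ID is a placeholder, and the format assumes a standard commercial AWS region.

```python
def ecr_image_uri(aws_account, aws_region, tag):
    """Compose the ECR address for an image tag such as 'name:latest'."""
    registry = "{}.dkr.ecr.{}.amazonaws.com".format(aws_account, aws_region)
    return "{}/{}".format(registry, tag)


uri = ecr_image_uri("123456789012", "us-west-2", "tensorflow-cifar10-example:latest")
print(uri)
```

A string of this form is what the notebook later passes to the estimator as `image_name`.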

### The push command

In [10]:
??push

Signature: push(tag, aws_account=None, aws_region=None)
Source:
def push(tag, aws_account=None, aws_region=None):
    """
    Push the builded tag to ECR.

    Args:
        tag (string): tag which you named your algo
        aws_account (string): aws account of the ECR repo
        aws_region (string): aws region where the repo is located

    Returns:
        (string): ECR repo image that was pushed
    """
    session = boto3.Session()


## Training on SageMaker

In [11]:
hyperparameters = {'train-steps': 1000}

instance_type = 'ml.c5.xlarge'

estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name=ecr_image_name,
                      hyperparameters=hyperparameters)

estimator.fit({'training':data_location})

INFO:sagemaker:Creating training-job with name: tensorflow-cifar10-example-2018-10-27-17-22-00-357


2018-10-27 17:21:30 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training...
2018-10-27 17:23:09 Downloading - Downloading input data
2018-10-27 17:23:16 Training - Downloading the training image..
2018-10-27 17:23:44,321 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2018-10-27 17:23:44,331 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "input_data_config": {
        "training": {
            "RecordWrapperType": "None",
            "S3DistributionType": "FullyReplicated",
            "TrainingInputMode": "File"
        }
    },
    "hyperparameters": {
        "train-steps": 1000
    },
    "log_level": 20,
    "output_dir": "/opt/ml/output",
    "hosts": [
        "algo-1"
    ],
    "input_config_dir": "/opt/ml/input/config",
    "framework_module": "cifar10",
    "model_dir": "/opt/ml/model",
    "module_dir": "/opt/ml/code",
    "nu

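The training environment printed above lists what the container hands to the entry point: hyperparameters such as `train-steps`, the model directory, and the channel directories. The source of `cifar10.py` is not shown here, so the following is only a sketch of how such a script might read those values with `argparse`. The `--model-dir` and `--training-dir` flags and the `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` environment variables are assumptions; the defaults are the paths shown in the logs.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and directories for a hypothetical entry point."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    # Hyperparameters arrive as command-line flags, e.g. --train-steps 1000.
    parser.add_argument("--train-steps", type=int, default=100)
    # Directory defaults fall back to the container paths seen in the logs.
    parser.add_argument(
        "--model-dir",
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    parser.add_argument(
        "--training-dir",
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
    )
    return parser.parse_args(argv)


args = parse_args(["--train-steps", "1000"])
print(args.train_steps, args.model_dir, args.training_dir)
```

Reading directories from flags with environment-variable defaults lets the same script run unchanged under `docker run`, in local mode, and on SageMaker.
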
# Reference
- [How Amazon SageMaker interacts with your Docker container for training](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)
- [How Amazon SageMaker interacts with your Docker container for inference](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html)
- [CIFAR-10 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html)
- [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
- [Dockerfile](https://docs.docker.com/engine/reference/builder/)
- [scikit-bring-your-own](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb)