# SageMaker BYO Keras Container

This notebook shows how to build a Keras container and test training locally before submitting the job to SageMaker.
The model used in this notebook is a ResNet model, trained on the CIFAR-10 dataset. The
example is based on https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py

## Set up the environment

In [15]:
import os
import numpy as np
import tempfile

import tensorflow as tf

import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator


sagemaker_session = sagemaker.Session()

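# IAM role that SageMaker assumes to run the training job. On a SageMaker
# notebook instance, get_execution_role() can be used instead.
role = 'SageMakerRole'
# URI of the ECR repository (without a tag). The repository must already exist
# before the image is pushed.
ecr_repository = '369233609183.dkr.ecr.us-west-2.amazonaws.com/test'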

NUM_CLASSES = 10
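
The `ecr_repository` value above has the form `<account>.dkr.ecr.<region>.amazonaws.com/<name>`. As a minimal sketch of how such a repository could be created, assuming boto3 is installed and the credentials are allowed to create ECR repositories, the following helpers may be useful. The names `ecr_repository_uri` and `ensure_repository` are hypothetical and are not part of this notebook or of the SageMaker SDK.

```python
def ecr_repository_uri(account_id, region, repository_name):
    """Build a repository URI in the form used by `ecr_repository` above."""
    return '%s.dkr.ecr.%s.amazonaws.com/%s' % (account_id, region, repository_name)


def ensure_repository(ecr_client, repository_name):
    """Create the repository if it does not exist yet (hypothetical helper).

    `ecr_client` is expected to be a boto3 ECR client, for example
    boto3.client('ecr', region_name='us-west-2').
    """
    try:
        ecr_client.describe_repositories(repositoryNames=[repository_name])
    except ecr_client.exceptions.RepositoryNotFoundException:
        ecr_client.create_repository(repositoryName=repository_name)


print(ecr_repository_uri('369233609183', 'us-west-2', 'test'))
```

Passing the client in as an argument keeps the helper free of any hard-coded region or credentials.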

## Complete source code
- [trainer/start.py](trainer/start.py): Keras model
- [trainer/environment.py](trainer/environment.py): Contains information about the SageMaker environment
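
To give an idea of what an environment module like this might hold, here is a rough sketch. The field names and paths are taken from the `TrainerEnvironment(...)` line printed by the local training run later in this notebook. The function `create_trainer_environment` and its `base_dir` parameter are assumptions for illustration, and the real `trainer/environment.py` may look different.

```python
import collections
import posixpath

# Field names mirror the TrainerEnvironment(...) log line; only the fields that
# are visible in the (truncated) log are included here.
TrainerEnvironment = collections.namedtuple(
    'TrainerEnvironment',
    ['input_dir', 'input_config_dir', 'model_dir', 'output_dir'])


def create_trainer_environment(base_dir='/opt/ml'):
    """Return the container paths located below base_dir (hypothetical helper)."""
    # posixpath is used because the paths refer to the Linux container,
    # regardless of the operating system running this notebook.
    input_dir = posixpath.join(base_dir, 'input')
    return TrainerEnvironment(
        input_dir=input_dir,
        input_config_dir=posixpath.join(input_dir, 'config'),
        model_dir=posixpath.join(base_dir, 'model'),
        output_dir=posixpath.join(base_dir, 'output'))


env = create_trainer_environment()
print(env.model_dir)  # /opt/ml/model
```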

## Building the image



In [3]:
import shlex
import subprocess



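def get_tensorflow_version_tag(framework_version, instance_type):
    # Instance types look like 'ml.p2.xlarge'; index 3 is the first letter of
    # the instance family, and 'p' marks the GPU families (p2, p3, ...).
    is_gpu = instance_type[3] == 'p'
    return '%s-gpu' % framework_version if is_gpu else framework_version


def get_image_name(ecr_repository, tensorflow_version_tag):
    # Full image name: <repository URI>:tensorflow-<version tag>
    return '%s:tensorflow-%s' % (ecr_repository, tensorflow_version_tag)


def build_image(name, version):
    # Build the Dockerfile in the current directory, passing the version tag
    # as the VERSION build argument.
    cmd = 'docker build -t %s --build-arg VERSION=%s -f Dockerfile .' % (name, version)
    subprocess.check_call(shlex.split(cmd))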


# Other TensorFlow versions can be chosen here:
tf_version = 'latest'

instance_type = 'ml.c5.xlarge'

tensorflow_version_tag = get_tensorflow_version_tag(tf_version, instance_type)

image_name = get_image_name(ecr_repository, tensorflow_version_tag)

# Note: the docker build logs are written to the console, not to the notebook.
build_image(image_name, tensorflow_version_tag)
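
The naming scheme above yields one image tag per TensorFlow version and processor type. The sketch below reproduces that scheme in a self-contained form and shows the resulting names for a CPU and a GPU instance. The helpers `is_gpu_instance` and `image_name_for` are hypothetical. Treating every family that starts with `p` or `g` as a GPU family is an assumption that goes slightly beyond the check used in this notebook, which only looks for `p`.

```python
def is_gpu_instance(instance_type):
    """Return True for instance types such as 'ml.p2.xlarge' or 'ml.g4dn.xlarge'."""
    # 'ml.p2.xlarge' -> family 'p2'. Assumption: GPU families start with 'p' or 'g'.
    family = instance_type.split('.')[1]
    return family[0] in ('p', 'g')


def image_name_for(ecr_repository, framework_version, instance_type):
    """Combine the repository URI, framework version and CPU/GPU suffix."""
    if is_gpu_instance(instance_type):
        tag = '%s-gpu' % framework_version
    else:
        tag = framework_version
    return '%s:tensorflow-%s' % (ecr_repository, tag)


repository = '369233609183.dkr.ecr.us-west-2.amazonaws.com/test'
print(image_name_for(repository, 'latest', 'ml.c5.xlarge'))
# 369233609183.dkr.ecr.us-west-2.amazonaws.com/test:tensorflow-latest
print(image_name_for(repository, 'latest', 'ml.p2.xlarge'))
# 369233609183.dkr.ecr.us-west-2.amazonaws.com/test:tensorflow-latest-gpu
```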

## Upload the data to an S3 bucket

In [12]:
def upload_channel(channel_name, x, y):
    y = tf.keras.utils.to_categorical(y, NUM_CLASSES)

    file_path = tempfile.mkdtemp()
    np.savez_compressed(os.path.join(file_path, 'cifar-10-npz-compressed.npz'), x=x, y=y)

    return sagemaker_session.upload_data(path=file_path, key_prefix='data/DEMO-keras-cifar10/%s' % channel_name)


def upload_training_data():
    # The data, split between train and test sets:
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

    train_data_location = upload_channel('train', x_train, y_train)
    test_data_location = upload_channel('test', x_test, y_test)

    return {'train': train_data_location, 'test': test_data_location}

channels = upload_training_data()
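
Each channel is stored as a single compressed `.npz` file holding the arrays `x` and `y`. In a SageMaker training container, a channel is typically available under `/opt/ml/input/data/<channel_name>`. The sketch below shows how a training script might read the file back. `load_channel` is a hypothetical helper, the arrays are small stand-ins for CIFAR-10, and the NumPy one-hot encoding stands in for `tf.keras.utils.to_categorical` so that the example runs without TensorFlow.

```python
import os
import tempfile

import numpy as np

NUM_CLASSES = 10
FILE_NAME = 'cifar-10-npz-compressed.npz'


def load_channel(channel_dir):
    """Load the x and y arrays written by upload_channel from a directory."""
    with np.load(os.path.join(channel_dir, FILE_NAME)) as data:
        return data['x'], data['y']


# Stand-in for a CIFAR-10 batch: 4 images of 32x32x3 and labels of shape (4, 1).
x = np.zeros((4, 32, 32, 3), dtype=np.uint8)
labels = np.array([[3], [0], [9], [3]])
# One-hot encode the labels with NumPy only.
y = np.eye(NUM_CLASSES, dtype=np.float32)[labels.ravel()]

# A temporary directory plays the role of /opt/ml/input/data/train.
channel_dir = tempfile.mkdtemp()
np.savez_compressed(os.path.join(channel_dir, FILE_NAME), x=x, y=y)

x_loaded, y_loaded = load_channel(channel_dir)
print(x_loaded.shape, y_loaded.shape)  # (4, 32, 32, 3) (4, 10)
```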

## Setting the hyperparameters

In [13]:
hyperparameters = dict(batch_size=32, data_augmentation=True, learning_rate=.0001, 
                       width_shift_range=.1, height_shift_range=.1)
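
Inside the container, the hyperparameters are made available as a JSON file in the input config directory (`/opt/ml/input/config/hyperparameters.json`), and the values generally arrive as strings. The following sketch shows one way a training script could turn them back into Python values. `parse_value` and `load_hyperparameters` are hypothetical helpers, and the file written here only simulates what the container would see.

```python
import ast
import json
import os
import tempfile


def parse_value(text):
    """Turn '32', '0.0001' or 'True' back into a Python value; keep other strings."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def load_hyperparameters(path):
    """Read a hyperparameters.json file whose values are all strings."""
    with open(path) as f:
        return {key: parse_value(value) for key, value in json.load(f).items()}


# Simulate the file for a subset of the hyperparameters defined above.
path = os.path.join(tempfile.mkdtemp(), 'hyperparameters.json')
with open(path, 'w') as f:
    json.dump({'batch_size': '32', 'data_augmentation': 'True',
               'learning_rate': '0.0001'}, f)

params = load_hyperparameters(path)
print(params)
```

`ast.literal_eval` only evaluates Python literals, so unlike `eval` it does not execute arbitrary code found in the file.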


## Testing the container locally



In [16]:
# train_instance_type='local' runs the container on this machine with Docker.
estimator = Estimator(image_name, role=role, train_instance_count=1,
                      train_instance_type='local', hyperparameters=hyperparameters)

estimator.fit(channels)

INFO:sagemaker:Creating training-job with name: test-2018-04-10-09-21-22-333


sagemaker-us-west-2-369233609183
data/DEMO-keras-cifar10/test
/private/var/folders/r4/6vbcymq564x9g4_bsq1ystss0hvddl/T/tmpu6nXKj/test
/private/var/folders/r4/6vbcymq564x9g4_bsq1ystss0hvddl/T/tmpu6nXKj/test/cifar-10-npz-compressed.npz
sagemaker-us-west-2-369233609183
data/DEMO-keras-cifar10/train
/private/var/folders/r4/6vbcymq564x9g4_bsq1ystss0hvddl/T/tmpu6nXKj/train
/private/var/folders/r4/6vbcymq564x9g4_bsq1ystss0hvddl/T/tmpu6nXKj/train/cifar-10-npz-compressed.npz
Attaching to tmpitbxmk_algo-1-YLNGC_1
[36malgo-1-YLNGC_1  |[0m   from ._conv import register_converters as _register_converters
[36malgo-1-YLNGC_1  |[0m Using TensorFlow backend.
[36malgo-1-YLNGC_1  |[0m creating SageMaker trainer environment:
[36malgo-1-YLNGC_1  |[0m TrainerEnvironment(input_dir='/opt/ml/input', input_config_dir='/opt/ml/input/config', model_dir='/opt/ml/model', output_dir='/opt/ml/output', hyperparamete

KeyboardInterrupt: 

## Building a GPU image

In [21]:
instance_type = 'ml.p2.xlarge'

tensorflow_version_tag = get_tensorflow_version_tag(tf_version, instance_type)

image_name = get_image_name(ecr_repository, tensorflow_version_tag)

# Note: the docker build logs are written to the console, not to the notebook.
build_image(image_name, tensorflow_version_tag)

## Pushing the container to ECR


In [22]:

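def push_image(name):
    # Ask the AWS CLI for a `docker login` command for ECR. This relies on
    # AWS CLI v1; `get-login` was removed in AWS CLI v2.
    cmd = 'aws ecr get-login --no-include-email --region us-west-2'
    login = subprocess.check_output(shlex.split(cmd)).decode('utf-8').strip()

    # Run the returned `docker login ...` command.
    subprocess.check_call(shlex.split(login))

    # Push the tagged image to the ECR repository.
    cmd = 'docker push %s' % name
    subprocess.check_call(shlex.split(cmd))

# Note: the docker push logs are written to the console, not to the notebook.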
push_image(image_name)
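
`aws ecr get-login` is only available in AWS CLI v1. With AWS CLI v2 the login is done with `aws ecr get-login-password`, whose output is piped to `docker login --password-stdin`. The sketch below shows how the push could look in that case. `registry_of`, `region_of` and `push_image_v2` are hypothetical helpers, and deriving the region from the registry host name is an assumption based on the repository URI format used in this notebook.

```python
import subprocess


def registry_of(image_name):
    """Return the registry host, i.e. everything before the first '/'."""
    return image_name.split('/', 1)[0]


def region_of(registry):
    """Extract the region from '<account>.dkr.ecr.<region>.amazonaws.com'."""
    return registry.split('.')[3]


def push_image_v2(image_name):
    """Log in with `aws ecr get-login-password`, then push the image."""
    registry = registry_of(image_name)
    password = subprocess.check_output(
        ['aws', 'ecr', 'get-login-password', '--region', region_of(registry)])
    # Pass the password on stdin so it never appears on the command line.
    login = subprocess.Popen(
        ['docker', 'login', '--username', 'AWS', '--password-stdin', registry],
        stdin=subprocess.PIPE)
    login.communicate(password)
    if login.returncode != 0:
        raise RuntimeError('docker login failed for %s' % registry)
    subprocess.check_call(['docker', 'push', image_name])


name = '369233609183.dkr.ecr.us-west-2.amazonaws.com/test:tensorflow-latest-gpu'
print(registry_of(name))             # 369233609183.dkr.ecr.us-west-2.amazonaws.com
print(region_of(registry_of(name)))  # us-west-2
```

Calling `push_image_v2(image_name)` would then take the place of `push_image(image_name)`; it is not called here because it needs Docker and AWS credentials.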

## Running the GPU container in SageMaker

In [None]:
# Same estimator as in the local test, but running on a SageMaker GPU instance.
estimator = Estimator(image_name, role=role, train_instance_count=1,
                      train_instance_type=instance_type, hyperparameters=hyperparameters)

estimator.fit(channels)

INFO:sagemaker:Creating training-job with name: test-2018-04-10-09-26-17-300


..........
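
The `Estimator` calls in this notebook use the argument names of version 1 of the SageMaker Python SDK. Version 2 renamed `image_name` to `image_uri`, `train_instance_count` to `instance_count` and `train_instance_type` to `instance_type`. As a sketch, a small helper could pick the matching names. `estimator_kwargs` is a hypothetical function, not part of the SDK.

```python
def estimator_kwargs(sdk_major_version, image, role, instance_type, hyperparameters):
    """Return Estimator keyword arguments named for the given SDK major version."""
    if sdk_major_version >= 2:
        image_key, count_key, type_key = 'image_uri', 'instance_count', 'instance_type'
    else:
        image_key, count_key, type_key = (
            'image_name', 'train_instance_count', 'train_instance_type')
    return {
        image_key: image,
        count_key: 1,
        type_key: instance_type,
        'role': role,
        'hyperparameters': hyperparameters,
    }


kwargs = estimator_kwargs(2, 'my-image', 'SageMakerRole', 'ml.p2.xlarge',
                          {'batch_size': 32})
print(sorted(kwargs))
# ['hyperparameters', 'image_uri', 'instance_count', 'instance_type', 'role']
```

With the SDK installed, the major version can be read from `sagemaker.__version__`, and the result passed on as `Estimator(**kwargs)`.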