# Iris Categorisation - TensorFlow - SageMaker

This time we use SageMaker to:

- Accelerate training
- Deploy the model to a production endpoint

In [2]:
import sagemaker
from sagemaker.session import Session

execution_role = sagemaker.get_execution_role()
sm_session = Session()
bucket = sm_session.default_bucket()

## Preparation

Do this only once, to make the data available to any container running the model.

There is no need to fetch the CSV files if they already exist.
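
The existence check can be scripted rather than done by eye. This is a minimal sketch, assuming the two CSV file names used later in this notebook; `missing_files` is a hypothetical helper, not part of the SageMaker SDK.

```python
import os


def missing_files(data_dir, filenames):
    """Return the names in `filenames` that are not present in `data_dir`."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(data_dir, name))]


# Hypothetical usage: only fetch the data when something is missing.
expected = ['iris_training.csv', 'iris_test.csv']
todo = missing_files('../data', expected)
print('Missing files:', todo if todo else 'none')
```

If the list comes back empty, the download step can be skipped.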

In [1]:
!ls  ../data

iris_test.csv  iris_training.csv  validation_data.hdf


In [15]:
# Upload the locally stored data to the SageMaker default bucket
s3_data = sm_session.upload_data(path='../data', key_prefix='iris/data')

In [17]:
!aws s3 ls {s3_data}/

2018-10-10 08:04:24        573 iris_test.csv
2018-10-10 08:04:24       2194 iris_training.csv


## Run locally

Test the training script to ensure it compiles and runs before registering a full training job.

In [10]:
import sys
import os
import tensorflow as tf
sys.path.append('../package')

from trainer.model import estimator_fn, _input_fn, serving_input_fn

In [11]:
outdir = '/tmp/sagemaker/iris/out'

config = {
    'train_file': os.path.join(os.getcwd(), '../data/iris_training.csv'),
    'test_file': os.path.join(os.getcwd(), '../data/iris_test.csv'),
    'train_steps': 10,  # Train with trivially small number of steps at this stage
}

In [16]:
!mkdir -p {outdir}

It is not clear exactly what process SageMaker uses to run the training job, but hopefully it's something similar.

In [23]:
est = estimator_fn(output_dir=outdir)

print('Defining training spec')
train_spec = tf.estimator.TrainSpec(
    input_fn=_input_fn('train', data_path='/tmp', data_filename=config['train_file']),
    max_steps=config['train_steps'],
)

print('Defining eval spec')
eval_spec = tf.estimator.EvalSpec(
    input_fn=_input_fn('eval', data_path='/tmp', data_filename=config['test_file']),
    steps=None,
    start_delay_secs=0,
    throttle_secs=1,
    exporters=tf.estimator.LatestExporter('exporter', serving_input_fn),
)

print('Starting training ...')

tf.estimator.train_and_evaluate(est, train_spec, eval_spec)
print('Finished training')

Defining training spec
Expecting data from:  /home/ec2-user/SageMaker/Cloud-Data-Science/Platform-Comparison/Sagemaker/../data/iris_training.csv
Defining eval spec
Expecting data from:  /home/ec2-user/SageMaker/Cloud-Data-Science/Platform-Comparison/Sagemaker/../data/iris_test.csv
Starting training ...




Finished training


In [18]:
ls {outdir} | head

checkpoint
[0m[01;34meval[0m/
events.out.tfevents.1539158870.ip-172-16-27-120
[01;34mexport[0m/
graph.pbtxt
model.ckpt-10.data-00000-of-00001
model.ckpt-10.index
model.ckpt-10.meta
model.ckpt-1.data-00000-of-00001
model.ckpt-1.index


It compiles and produces output. We'll train it more comprehensively in the cloud.
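
One way to confirm that the short run reached the configured `train_steps` is to read the highest checkpoint step from the file names in the output directory. This is a minimal sketch, assuming checkpoint files follow the `model.ckpt-<step>.index` pattern seen in the listing above; `latest_checkpoint_step` is a hypothetical helper, not a TensorFlow function.

```python
import re


def latest_checkpoint_step(filenames):
    """Return the highest step among names like 'model.ckpt-10.index', or None."""
    pattern = re.compile(r'^model\.ckpt-(\d+)\.index$')
    steps = [int(m.group(1)) for m in map(pattern.match, filenames) if m]
    return max(steps) if steps else None


# A subset of the file names from the listing above.
listing = ['checkpoint', 'graph.pbtxt', 'model.ckpt-1.index', 'model.ckpt-10.index']
print(latest_checkpoint_step(listing))  # 10
```

In the notebook, `os.listdir(outdir)` could supply the file names, and the result could be compared against `config['train_steps']`.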

## Train

In [38]:
from sagemaker.tensorflow import TensorFlow

# Bucket location to save your custom code in tar.gz format.
custom_code_upload_location = 's3://{}/iris/code'.format(bucket)
# Bucket location where results of model training are saved.
model_artifacts_location = 's3://{}/iris/artifacts'.format(bucket)

In [40]:
estimator = TensorFlow(entry_point='../package/trainer/model.py',
                       role=execution_role,
                       framework_version='1.10',
                       output_path=model_artifacts_location,
                       code_location=custom_code_upload_location,
                       train_instance_count=1,
                       training_steps=100,
                       evaluation_steps=10,
                       train_instance_type='ml.c4.xlarge')

In [41]:
%%time
estimator.fit(s3_data)

INFO:sagemaker:Creating training-job with name: sagemaker-tensorflow-2018-10-10-08-17-27-885


2018-10-10 08:17:28 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training......
2018-10-10 08:19:38 Downloading - Downloading input data
2018-10-10 08:19:53 Training - Downloading the training image..
2018-10-10 08:20:09,859 INFO - root - running container entrypoint
2018-10-10 08:20:09,859 INFO - root - starting train task
2018-10-10 08:20:09,864 INFO - container_support.training - Training starting
2018-10-10 08:20:12,520 INFO - tf_container - ----------------------TF_CONFIG--------------------------
2018-10-10 08:20:12,520 INFO - tf_container - {"environment": "cloud", "cluster": {"master": ["algo-1:2222"]}, "task": {"index": 0, "type": "master"}}
2018-10-10 08:20:12,520 INFO - tf_container - ---------------------------------------------------------
2018-10-10 08:20:12,520 INFO - tf_container - creating RunConfig:
2018-10-10 08:20:12,520 INFO - tf_contain


2018-10-10 08:20:21 Uploading - Uploading generated training model
2018-10-10 08:20:26 Completed - Training job completed
2018-10-10 08:20:16.867261: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-10-10 08:20:16.876778: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-10-10 08:20:16.876805: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-10-10 08:20:16.876943: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-10-10 08:20:16.890705: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-10-10 08:20:16.902718: I tensorflow/core/platform/s3/aws_logging.cc:54] Deleting file: /tmp/s3_filesystem_XXXXXX20181010T0820161539159616890
2018-10-10 08:20:16.93

Billable seconds: 48
CPU times: user 452 ms, sys: 40 ms, total: 492 ms
Wall time: 3min 13s


## Deploy

In [42]:
%%time
iris_predictor = estimator.deploy(initial_instance_count=1,
                                  instance_type='ml.t2.medium')

INFO:sagemaker:Creating model with name: sagemaker-tensorflow-2018-10-10-08-17-27-885
INFO:sagemaker:Creating endpoint with name sagemaker-tensorflow-2018-10-10-08-17-27-885


--------------------------------------------------------------!
CPU times: user 276 ms, sys: 16 ms, total: 292 ms
Wall time: 5min 16s


Great! We now apparently have a REST endpoint serving predictions from our model. Let's give it a test!

## Evaluate

Let's send data unseen by the model to the deployed endpoint.

In [43]:
import pandas as pd

In [52]:
valid = pd.read_hdf('../data/validation_data.hdf', 'test1')

features = valid.drop('Species', axis=1)
sample0 = features.loc[0]
dict(sample0)

{'PetalLength': 1.7, 'PetalWidth': 0.5, 'SepalLength': 5.1, 'SepalWidth': 3.3}

It is unclear what format the data must be provided in for the endpoint to serve predictions correctly.
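
While experimenting with formats, it can help to make the feature order and the input key explicit rather than relying on the column order of the DataFrame. This is a sketch only: `build_payload` is a hypothetical helper, and both the feature order and the key `'inputs'` are placeholders. The real values depend on what `serving_input_fn` in `trainer/model.py` declares, which is not shown here, so this is not a confirmed fix for the error below.

```python
def build_payload(sample, feature_order, input_key):
    """Wrap one sample's feature values, in a fixed order, under a single input key."""
    values = [float(sample[name]) for name in feature_order]
    return {input_key: values}


# The sample shown above, as a plain dict.
sample = {'PetalLength': 1.7, 'PetalWidth': 0.5, 'SepalLength': 5.1, 'SepalWidth': 3.3}

# ASSUMPTION: this order and the key 'inputs' are placeholders, not taken from
# the model code. Replace them with whatever serving_input_fn expects.
order = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
payload = build_payload(sample, order, 'inputs')
print(payload)  # {'inputs': [5.1, 3.3, 1.7, 0.5]}
```

The resulting dictionary could then be passed to `iris_predictor.predict(payload)`. If that still fails, the CloudWatch logs for the endpoint are the place to see what the container actually received.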

In [92]:
# iris_predictor.predict(dict(sample0))
# iris_predictor.predict(list(sample0.values))
# iris_predictor.predict({u'': [dict(sample0)]})
iris_predictor.predict({u'': list(sample0.values)})

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "". See https://ap-southeast-2.console.aws.amazon.com/cloudwatch/home?region=ap-southeast-2#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-tensorflow-2018-10-01-02-17-54-088 in account 167464700695 for more information.