# SageMaker Model Training and Prediction
## Introduction
__"Leverage SageMaker from Notebook --> Production Testing"__

---
## 1 - Model Training
### Permissions and Environmental Variables

The following packages are used to prepare the data and train the model:
- [os](https://docs.python.org/3/library/os.html) provides a portable way of using operating-system-dependent functionality; here it manages the data set locally before it is uploaded to S3. In particular, its `environ` object is a mapping that represents the process environment.
- [datetime](https://docs.python.org/3/library/datetime.html) provides classes for manipulating dates and times.
- [numpy](https://www.numpy.org) is the fundamental package for scientific computing with Python.
- [matplotlib](https://matplotlib.org) is a widely used library for plotting graphs in Python.
- [PIL](http://www.pythonware.com/products/pil/) is used here to test the model with your own picture at the end.
- boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which lets Python developers write software that uses services such as Amazon S3 and Amazon EC2.
- json reads and writes JSON, a lightweight data interchange format inspired by JavaScript object literal syntax (although it is not a strict subset of JavaScript).
- uuid creates a unique, random ID.
- io provides the Python interfaces to stream handling.
- redis is the Python interface to the Redis key-value store.
- math is used to determine the last mini-batch size when partitioning the training data set into mini-batches.

In [None]:
# Import libraries
import os
import boto3
import sagemaker
import h5py
import json
import tarfile
import datetime
import matplotlib.pyplot as plt
import numpy as np
import mxnet as mx
from sagemaker.mxnet import MXNet
from mxnet import gluon
from sagemaker import get_execution_role

# Configure SageMaker
sagemaker_session = sagemaker.Session()
role = get_execution_role()

### Input Data Preparation

In [None]:
# Create a local directory for the NumPy arrays (no error if it already exists)
os.makedirs('/tmp/data', exist_ok=True)

# Load the training and testing datasets and save them as NumPy arrays.
# The context manager closes the HDF5 file once the arrays are written.
with h5py.File('datasets/datasets.h5', 'r') as dataset:
    np.save('/tmp/data/train_X.npy', np.array(dataset['train_set_x'][:]))
    np.save('/tmp/data/train_Y.npy', np.array(dataset['train_set_y'][:]))
    np.save('/tmp/data/test_X.npy', np.array(dataset['test_set_x'][:]))
    np.save('/tmp/data/test_Y.npy', np.array(dataset['test_set_y'][:]))

# Upload the Training and Testing Data to S3
inputs = sagemaker_session.upload_data(path='/tmp/data', key_prefix='training_input')
bucket = inputs.split('/')[2]
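
The bucket name above is obtained by splitting the returned S3 URI on `/`. The following is a minimal sketch of what that split does, next to a slightly more explicit alternative built on the standard library's `urllib.parse`. The helper `split_s3_uri` and the bucket name `example-bucket` are hypothetical placeholders, not values from this notebook.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an 's3://bucket/key' URI into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI of the shape 's3://<bucket>/<key_prefix>' (assumed, not real).
example_uri = "s3://example-bucket/training_input"

# Splitting on '/' yields ['s3:', '', 'example-bucket', 'training_input'],
# so index 2 holds the bucket name.
bucket_by_split = example_uri.split("/")[2]

# The parser-based helper returns the same bucket plus the key prefix.
bucket_by_parse, prefix = split_s3_uri(example_uri)
print(bucket_by_split, bucket_by_parse, prefix)
```

Both approaches give the same bucket for a well-formed URI; the helper additionally rejects strings that are not S3 URIs instead of failing later with an unrelated error.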

### Create the Estimator

In [None]:
# Create an MXNet Estimator
mxnet_estimator = MXNet(
    'model.py',
    role=role,
    train_instance_count=1,
    train_instance_type='ml.c4.xlarge',
    output_path='s3://'+bucket,
    hyperparameters={
        'epochs': 2500,
        'optimizer': 'sgd',
        'learning_rate': 0.0075,
        'batch_size': 64
    }
)
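
Hyperparameters are passed to the training script as a plain dictionary, so a misspelled key is not rejected when the estimator is created. As a hedged sketch, a small pre-flight check can flag unknown keys before a training job is launched. The `EXPECTED_KEYS` set and the helper `check_hyperparameters` are assumptions for illustration; the keys that `model.py` actually reads are not shown in this notebook.

```python
import difflib

# Hypothetical: the keys the training script is assumed to read.
EXPECTED_KEYS = {"epochs", "optimizer", "learning_rate", "batch_size"}


def check_hyperparameters(hyperparameters, expected=EXPECTED_KEYS):
    """Return (unknown_key, closest_expected_key_or_None) pairs.

    An empty list means every key is in the expected set.
    """
    problems = []
    for key in hyperparameters:
        if key not in expected:
            close = difflib.get_close_matches(key, sorted(expected), n=1)
            problems.append((key, close[0] if close else None))
    return problems


# A dictionary with one misspelled key, for illustration only.
candidate = {"epochs": 2500, "optmizer": "sgd", "learning_rate": 0.0075, "batch_size": 64}
for unknown, suggestion in check_hyperparameters(candidate):
    print("Unknown hyperparameter {!r}; did you mean {!r}?".format(unknown, suggestion))
```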

### Train the Model

In [None]:
# Create a new Job name for current training run
#job_name = '<<Specific Training Job Name>>'
#mxnet_estimator.fit(inputs, job_name=job_name)
mxnet_estimator.fit(inputs)
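
If an explicit job name is preferred over the generated one, a timestamped name makes each run easy to find later. The sketch below assumes the naming rule documented for SageMaker training jobs (alphanumerics and hyphens, at most 63 characters); the helper `make_job_name` and the prefix `cat-classifier` are hypothetical.

```python
import datetime
import re

# Assumed naming rule for training jobs: letters, digits and hyphens,
# starting and ending with an alphanumeric, at most 63 characters.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def make_job_name(prefix, now=None):
    """Build '<prefix>-YYYY-MM-DD-HH-MM-SS' and validate it against the rule."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    name = "{}-{}".format(prefix, now.strftime("%Y-%m-%d-%H-%M-%S"))
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError("invalid training job name: {!r}".format(name))
    return name


# Hypothetical prefix; any short alphanumeric/hyphen prefix would do.
print(make_job_name("cat-classifier"))
```

The returned string could then be passed as `job_name` to `fit`, as in the commented-out lines of the cell above.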

---
## 2 - Training Analysis

In [None]:
# Download and uncompress output results from model training
%matplotlib inline
job_name = '<<Enter Training Job Name>>'
s3 = boto3.resource('s3')
s3.Bucket(bucket).download_file(job_name+'/output/output.tar.gz', '/tmp/output.tar.gz')
# Extract the archive into the current directory and close it afterwards
with tarfile.open('/tmp/output.tar.gz') as tar:
    tar.extractall()
with open('results.json') as j:
    data = json.load(j)

# Format data for plotting
costs = []
val_acc = []
train_acc = []
# Note: sorted() orders the keys as strings (lexicographically)
for key, value in sorted(data.items()):
    if 'epoch' in key:
        for k, v in value.items():
            if k == 'cost':
                costs.append(v)
            elif k == 'val_acc':
                val_acc.append(v)
            elif k == 'train_acc':
                train_acc.append(v)
    elif 'Start' in key:
        start = datetime.datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f")
    elif 'End' in key:
        end = datetime.datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f")
val_acc = np.array(val_acc)
train_acc = np.array(train_acc)
costs = np.array(costs)
delta = end - start
print("Model Training Time: {} Minute(s)".format(int(delta.total_seconds() / 60)))

# Plot the Learning Curve
plt.rcParams['figure.figsize'] = (11.0, 10.0)
plt.plot(costs)
plt.plot(train_acc)
plt.plot(val_acc)
plt.ylabel('Cost / Accuracy')
plt.xlabel('Epochs (in Hundreds)')
plt.title("Learning Curve")
plt.legend(['Cost', 'Training Accuracy', 'Validation Accuracy'])
plt.show()
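
The loop above relies on `sorted()` to put the epoch entries in order. Because the keys are strings, that order is lexicographic, which may not match numeric epoch order. The sketch below illustrates the difference and one possible fix. The key names (`epoch0`, `epoch100`, ...) and the values are made up for illustration, since the real layout of `results.json` is not shown here, and `epoch_number` is a hypothetical helper.

```python
import re


def epoch_number(key):
    """Return the first integer found in a key such as 'epoch100', or -1 if none."""
    match = re.search(r"\d+", key)
    return int(match.group()) if match else -1


# Hypothetical results in the shape the loop expects; values are placeholders.
sample = {
    "epoch0": {"cost": 0.7},
    "epoch100": {"cost": 0.6},
    "epoch200": {"cost": 0.5},
    "epoch1000": {"cost": 0.1},
}

# Plain string sort: 'epoch1000' lands before 'epoch200'.
lexicographic = [k for k in sorted(sample) if "epoch" in k]

# Sort on the embedded number instead.
numeric = sorted((k for k in sample if "epoch" in k), key=epoch_number)

print(lexicographic)
print(numeric)
```

If the real keys are zero-padded or already sort correctly as strings, the plain `sorted()` call is sufficient and no custom key is needed.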

---
## 3 - Prediction Analysis

In [None]:
predictor = mxnet_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

__Test prediction on unseen image data.__

In [None]:
import glob
import matplotlib.image as mpimg
from skimage import transform

# Get Classes
classes = ["non-cat", "cat"]

# Get Image files
images = []
for img_path in glob.glob('./images/*'):
    images.append(mpimg.imread(img_path))

# Plot predictions
plt.figure(figsize=(20.0,20.0))
columns = 2
for i, image in enumerate(images):
    img = transform.resize(image, (64, 64), mode='constant').reshape((1, 64 * 64 * 3))
    prediction = int(predictor.predict(img.tolist()))
    # Integer division: plt.subplot requires an integer number of rows
    plt.subplot(len(images) // columns + 1, columns, i + 1)
    plt.title('Prediction = "{}" picture.'.format(classes[prediction]))
    plt.imshow(image);
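
The plotting loop sizes the subplot grid by dividing the number of images by the number of columns and adding one, which leaves an empty row whenever the image count is an exact multiple of the column count. As a minimal sketch, a ceiling division gives exactly the rows needed; `grid_rows` is a hypothetical helper, not part of the notebook.

```python
import math


def grid_rows(n_items, columns):
    """Return the number of rows needed to lay out n_items in a grid."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    return math.ceil(n_items / columns)


# Two columns, as in the loop above.
for n in (3, 4, 5):
    print(n, "images ->", grid_rows(n, 2), "rows")
```

The result could replace the row count passed to `plt.subplot`, for example `plt.subplot(grid_rows(len(images), columns), columns, i + 1)`.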