# SageMaker Model Training and Prediction
## Introduction
[Amazon SageMaker](https://aws.amazon.com/sagemaker/?sc_channel=PS&sc_campaign=pac_ps_q4&sc_publisher=google&sc_medium=sagemaker_b_pac_search&sc_content=sagemaker_e&sc_detail=aws%20sagemaker&sc_category=sagemaker&sc_segment=webp&sc_matchtype=e&sc_country=US&sc_geo=namer&sc_outcome=pac&s_kwcid=AL!4422!3!245225393502!e!!g!!aws%20sagemaker&ef_id=WL2I0wAAAIRC8xLB:20180418161912:s) is a fully managed platform that enables Data Scientists to build, train and deploy machine learning models at any scale. It provides the key services necessary to create and manage a Machine Learning (ML) Pipeline from "Notebook" to "Production", as highlighted below:

<img src="images/SageMaker_Workflow.png" style="width:800px;height:200px;">

The following Notebook demonstrates this process by using SageMaker's built-in [Image Classification Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html). The algorithm is built on __ResNet__, a commonly used network architecture for image classification (you can read more about ResNet [here](https://arxiv.org/abs/1512.03385)). It can also start from pre-trained weights, thus allowing for [Transfer Learning](https://en.wikipedia.org/wiki/Transfer_learning): a technique that reduces the time required to train a new model because, instead of training a model from scratch, one can take a modified pre-trained model and continue training it with a new dataset. In essence, the knowledge learned by one model is transferred to another.

By leveraging this methodology, the Data Scientist doesn't need to spend time building, training and optimizing a custom Image Classification model, as was done in [Demo 2](https://github.com/darkreapyre/itsacat/blob/Demo-2/Notebooks/ItsaCat-Gluon_Codebook.ipynb), but can simply provide the training data and let SageMaker perform all the heavy lifting.

---
## 1 - Using the Notebook instance to understand and Manage the Input Data.
The SageMaker Notebook instance is a fully managed compute instance that runs the Jupyter Notebook application and allows the *Data Scientist* to explore and preprocess the dataset that is used to train the ML model. The Notebook instance can also be thought of as an Integrated Development Environment (IDE) for writing the code for the ML model, training the model, and testing/validating the model's performance. For more information on using SageMaker Notebook Instances, see the [SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html).

### Permissions and Environmental Variables

The packages that will be needed to prepare the data and train the model are as follows:
- [datetime](https://docs.python.org/2/library/datetime.html) provides classes for manipulating dates and times in both simple and complex ways.
- [numpy](https://www.numpy.org) is the fundamental package for scientific computing with Python.
- [matplotlib](https://matplotlib.org) is a famous library to plot graphs in Python.
- [PIL](http://www.pythonware.com/products/pil/) is used here to test the model on unseen image data at the end.
- [boto3](https://pypi.python.org/pypi/boto3) is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2.
- [json](https://docs.python.org/3/library/json.html) is a module for encoding and decoding JSON, a lightweight data interchange format inspired by JavaScript object literal syntax (although it is not a strict subset of JavaScript).
- [os](https://docs.python.org/3/library/os.html) is a module that provides a portable way of using operating system dependent functionality. In particular, the `environ` object is a mapping object representing the environment.
- [tarfile](https://docs.python.org/3/library/tarfile.html) is used to read and write tar archives, when extracting the model training results from S3.
- [urllib](https://docs.python.org/3/library/urllib.html) is a package with several modules that are used to work with URLs. The `request` module is used for opening and reading URLs.
- [imageio](https://imageio.github.io) for reading and writing image data.
- [mxnet](http://mxnet.incubator.apache.org) is a flexible and efficient library for deep learning.
- [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) is an open source library for training and deploying machine learning models on Amazon SageMaker.

In [1]:
# Import libraries
import warnings; warnings.simplefilter('ignore')
import os
import boto3
import sagemaker
import h5py
import json
import tarfile
import datetime
import urllib.request
import imageio
import matplotlib.pyplot as plt
import numpy as np
import mxnet as mx
from sagemaker.mxnet import MXNet
from sagemaker.amazon.amazon_estimator import get_image_uri
from mxnet import gluon
%matplotlib inline

# Configure SageMaker
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
training_image = get_image_uri(boto3.Session().region_name, 'image-classification') #Image Classification Estimator

# Helper functions
def download(url):
    """
    Downloads the target file from the given URL.
    
    Arguments:
    url -- Full URL to download
    """
    filename = url.split("/")[-1]
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)

# Build the make_lst function
def make_lst(data, label, name):
    """
    Make a custom tab separated `lst` file for `im2rec.py`
    
    Arguments:
    data -- Numpy array of image data
    label -- "Truth" label for image classification
    name -- "train" or "test" data
    """
    # Create local repository for the images based on name
    if not os.path.exists('./'+name):
        os.mkdir('./'+name)
        
    # Create the lst file
    lst_file = './'+name+'.lst'
    
    # Save each numpy array as a `.jpg` and record it in the index file.
    # The file is opened once, in write mode, so that re-running the cell
    # does not append duplicate entries.
    with open(lst_file, 'w') as f:
        for i in range(len(data)):
            img = data[i]
            img_name = name+'/'+str(i)+'.jpg'
            imageio.imwrite(img_name, img)
            f.write("{}\t{}\t{}\n".format(str(i), str(label[i]), img_name))

try:
    import multiprocessing
except ImportError:
    multiprocessing = None

# Download the tool for creating RecordIO formatted data
download('https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/im2rec.py')

### Input Data Preparation
To train the Neural Network, we are provided with a dataset (`datasets.h5`) containing:
- a training set of $m$ images, each labeled as a cat ($y=1$) or a non-cat ($y=0$).
- a test set of images, each labeled as a cat ($y=1$) or a non-cat ($y=0$).

>**Note:** The original dataset was comprised of two separate files, `test_catvnoncat.h5` and `train_catvnoncat.h5`. For the sake of this implementation a single file is used, `datasets.h5`.

In [2]:
# Load the Training and Testing dataset
dataset = h5py.File('datasets/datasets.h5', 'r')

# Create the Training and Testing data sets
X_train = np.array(dataset['train_set_x'][:])
y_train = np.array(dataset['train_set_y'][:])
X_test = np.array(dataset['test_set_x'][:])
y_test = np.array(dataset['test_set_y'][:])

From the cell above, the training and testing (validation) image data (`train_set_x` and `test_set_x`) are 4-dimensional arrays consisting of $209$ training examples ($m$) and $50$ testing images. Each image has a height, width and depth (**R**ed, **G**reen, **B**lue values) of $64 \times 64 \times 3$. The "true" label arrays (`train_set_y` and `test_set_y`) contain one entry per image, that is, $209$ and $50$ labels respectively.

In [3]:
# Training data set dimensions
print("Training Data Dimension: {}".format(X_train.shape))
print("No. Training Examples: {}".format(X_train.shape[0]))
print("No. Training Features: {}".format(X_train.reshape((-1, 12288)).shape[1]))

Training Data Dimension: (209, 64, 64, 3)
No. Training Examples: 209
No. Training Features: 12288


In [4]:
# Testing data set dimensions
print("Testing Data Dimension: {}".format(X_test.shape))
print("No. Testing Examples: {}".format(X_test.shape[0]))
print("No. Testing Features: {}".format(X_test.reshape((-1, 12288)).shape[1]))

Testing Data Dimension: (50, 64, 64, 3)
No. Testing Examples: 50
No. Testing Features: 12288
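
The feature count printed above follows directly from the image dimensions: flattening a $64 \times 64 \times 3$ image yields $12288$ values. The following is a minimal sketch of that reshape. It uses a zero-filled placeholder array (an assumption, standing in for `X_train`) so that it runs without `datasets.h5`.

```python
import numpy as np

# Placeholder with the same shape as X_train
# (assumption: the real pixel values are not needed to show the reshape)
X_placeholder = np.zeros((209, 64, 64, 3), dtype=np.uint8)

# Each image contributes height * width * channels values
n_features = 64 * 64 * 3

# Flatten every image into one row; -1 lets numpy infer the number of examples
X_flat = X_placeholder.reshape((-1, n_features))

print("Examples: {}, Features per example: {}".format(*X_flat.shape))
```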


Using SageMaker's built-in Image Classification algorithm requires that the dataset be formatted in [RecordIO](https://mxnet.incubator.apache.org/architecture/note_data_loading.html). RecordIO is an efficient file format that feeds images for model training as a stream, thus allowing the entire dataset to be loaded into either CPU or GPU memory and vastly improving the model training time. Some of the benefits include:
- Storing images in a compact format, which greatly reduces the size of the dataset on the disk.
- Packing data together allows continuous reading on the disk.
- RecordIO has a simple way to partition, simplifying the distribution of training data when leveraging distributed training.
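
To illustrate why packing records into a single file allows sequential reads, here is a toy sketch of a length-prefixed record file. It is not MXNet's actual RecordIO on-disk layout. The functions `pack_records` and `iter_records` are hypothetical names introduced for illustration only, and the payloads are stand-ins for encoded image bytes.

```python
import io
import struct

def pack_records(payloads):
    """Concatenate byte payloads, each prefixed with its 4-byte little-endian length."""
    buf = io.BytesIO()
    for payload in payloads:
        buf.write(struct.pack("<I", len(payload)))
        buf.write(payload)
    return buf.getvalue()

def iter_records(blob):
    """Yield the payloads back in order with one sequential pass over the bytes."""
    stream = io.BytesIO(blob)
    while True:
        header = stream.read(4)
        if not header:
            break  # end of the packed data
        (length,) = struct.unpack("<I", header)
        yield stream.read(length)

# Stand-ins for encoded image bytes (assumption: real data would be JPEG bytes)
images = [b"image-0-bytes", b"image-1", b""]
blob = pack_records(images)
recovered = list(iter_records(blob))
print(recovered == images)
```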

Since the Training and Testing data sets are currently stored as Numpy arrays, they will need to be converted to native `.jpg` images and then converted to RecordIO format before uploading them to S3.

However, the associated metadata (i.e. the classification label of the image) needs to be captured along with the image file when converting to the RecordIO format. In order to accomplish this, a `.lst` file needs to be created that captures each image and its associated label.

A `.lst` file is a tab-separated file with three columns that contains a list of image files. The first column specifies the image index, the second column specifies the class label index for the image, and the third column specifies the relative path of the image file. The image index in the first column should be unique across all of the images. The following code cell leverages the `make_lst()` helper function to accomplish this.

In [5]:
# Create `train.lst`
make_lst(X_train, y_train, name='train')

# Create `test.lst`
make_lst(X_test, y_test, name='test')

# View the output of the training `.lst` file
print("Sample output of the `train.lst` file:\n")
!head -n 3 ./train.lst > example.lst
# Use a context manager so the file handle is closed after reading
with open('example.lst', 'r') as f:
    lst_content = f.read()
print(lst_content)

Sample output of the `train.lst` file:

0	0	train/0.jpg
1	0	train/1.jpg
2	1	train/2.jpg
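
As a sanity check on the `.lst` layout described above, the following sketch writes and re-reads a three-column, tab-separated index with Python's `csv` module. The file name `example_index.lst` and the rows are made up for illustration, and the uniqueness check mirrors the requirement that the image index be unique across all images.

```python
import csv
import os
import tempfile

# Hypothetical rows: (image index, class label, relative image path)
rows = [(0, 0, "train/0.jpg"), (1, 0, "train/1.jpg"), (2, 1, "train/2.jpg")]

lst_path = os.path.join(tempfile.mkdtemp(), "example_index.lst")

# Write one tab-separated line per image
with open(lst_path, "w", newline="") as f:
    csv.writer(f, delimiter="\t", lineterminator="\n").writerows(rows)

# Read the file back, converting the first two columns to integers
with open(lst_path, newline="") as f:
    parsed = [(int(idx), int(label), path)
              for idx, label, path in csv.reader(f, delimiter="\t")]

# The image index in the first column must be unique
indices = [idx for idx, _, _ in parsed]
assert len(indices) == len(set(indices)), "image indices must be unique"
print(parsed)
```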



Now that the associated image metadata has been captured, the RecordIO files can be built. This is done by leveraging the [im2rec](https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/im2rec.py) tool that has already been downloaded.

In [6]:
%%bash
# Convert training and validation images to `.rec`
python im2rec.py ./train.lst ./ --quality 100 --pass-through
python im2rec.py ./test.lst ./ --quality 100 --pass-through

Creating .rec file from /home/ec2-user/SageMaker/itsacat/Notebooks/train.lst in /home/ec2-user/SageMaker/itsacat/Notebooks
multiprocessing not available, fall back to single threaded encoding
time: 0.00029468536376953125  count: 0
Creating .rec file from /home/ec2-user/SageMaker/itsacat/Notebooks/test.lst in /home/ec2-user/SageMaker/itsacat/Notebooks
multiprocessing not available, fall back to single threaded encoding
time: 0.0002923011779785156  count: 0


### Input Data Upload
In order for *SageMaker* to execute the training and validation process on the Input Data, the data needs to be uploaded to S3. *SageMaker* provides the handy function, `upload_data()`, to upload the RecordIO files to a default (or specific) S3 location. If the default bucket does not already exist, the function will create it. The resulting S3 bucket will also store the various training and testing output that will be used for creating production *Endpoints* and *Analysis*.

In [8]:
# Upload the Training and Testing Data to S3
training_data = sagemaker_session.upload_data(path='./train.rec', key_prefix='train')
testing_data = sagemaker_session.upload_data(path='./test.rec', key_prefix='test')
bucket = training_data.split('/')[2]
print("S3 Bucket: {}".format(bucket))

S3 Bucket: sagemaker-us-west-2-500842391574
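
`upload_data()` returns an S3 URI, and the cell above extracts the bucket name by splitting that URI on `/`. An alternative sketch using `urllib.parse` is shown below. The URI is a made-up example and `split_s3_uri` is a hypothetical helper, not part of the SageMaker SDK.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")

# Made-up URI in the same form as the value returned by upload_data()
example_uri = "s3://example-bucket/train/train.rec"
bucket_name, key = split_s3_uri(example_uri)
print("Bucket: {}, Key: {}".format(bucket_name, key))
```

The explicit scheme check makes a malformed value fail early instead of silently producing the wrong bucket name.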


## 2 - Training the Classifier as a SageMaker Training Job.
### Training Function
The *Training Function*, `model.py`, contains the instructions that *SageMaker* needs to:
1. Load the Input training and validation data sets from S3; `get_data()`.
2. Pre-process, "vectorize" and scale the image data to be processed by the Neural Network; `transform()`.
3. Train the model and validate the prediction accuracy of the proposed Neural Network model on the Input data; `train()`.
4. Save the model and training results to S3; `save()`.

In [9]:
# The algorithm supports multiple network depths (number of layers): 18, 34, 50, 101, 152 and 200
# For this training, we will use 50 layers
num_layers = 50
# we need to specify the input image shape for the training data
image_shape = "3,64,64"
# we also need to specify the number of training samples in the training set
# for this dataset it is 209
num_training_samples = 209
# specify the number of output classes
num_classes = 2
# batch size for training
mini_batch_size =  32
# number of epochs
epochs = 13
# learning rate
learning_rate = 0.01
# Since we are using transfer learning, we set use_pretrained_model to 1 so that weights can be 
# initialized with pre-trained weights
use_pretrained_model = 1
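
The `CreateTrainingJob` API expects hyperparameter values as strings, which is why the training parameters that follow wrap each value in `str()`. A small sketch of doing that conversion in one place is shown below. `stringify_hyperparameters` is a hypothetical helper, and the values simply mirror the cell above.

```python
def stringify_hyperparameters(params):
    """Return a copy of params with every value converted to str."""
    return {name: str(value) for name, value in params.items()}

# Values copied from the hyperparameter cell above
hyperparameters = stringify_hyperparameters({
    "image_shape": "3,64,64",
    "num_layers": 50,
    "num_training_samples": 209,
    "num_classes": 2,
    "mini_batch_size": 32,
    "epochs": 13,
    "learning_rate": 0.01,
    "use_pretrained_model": 1,
})
print(hyperparameters)
```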

In [10]:
%%time
import time
import boto3
from time import gmtime, strftime


#s3 = boto3.client('s3')
# create unique job name 
job_name_prefix = 'DEMO-imageclassification'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp
training_params = \
{
    # specify the training docker image
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/{}/output'.format(bucket, job_name_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.p2.xlarge",
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "image_shape": image_shape,
        "num_layers": str(num_layers),
        "num_training_samples": str(num_training_samples),
        "num_classes": str(num_classes),
        "mini_batch_size": str(mini_batch_size),
        "epochs": str(epochs),
        "learning_rate": str(learning_rate),
        "use_pretrained_model": str(use_pretrained_model)
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 360000
    },
# Training data is read from the "train" channel (S3 prefix `train/`)
# Validation data is read from the "validation" channel (S3 prefix `test/`)
# The algorithm currently only supports the FullyReplicated mode (where data is copied onto each machine)
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/train/'.format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-recordio",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/test/'.format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-recordio",
            "CompressionType": "None"
        }
    ]
}
print('Training job name: {}'.format(job_name))
print('\nInput Data Location: {}'.format(training_params['InputDataConfig'][0]['DataSource']['S3DataSource']))

Training job name: DEMO-imageclassification-2018-07-31-23-47-38

Input Data Location: {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-west-2-500842391574/train/', 'S3DataDistributionType': 'FullyReplicated'}
CPU times: user 28 ms, sys: 0 ns, total: 28 ms
Wall time: 29.3 ms
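
A timestamp is appended to the job name because training job names need to be unique. The sketch below wraps that logic in a hypothetical helper, `make_job_name`, and also validates the result. The pattern and the 63-character limit reflect the naming constraint as we understand it and should be treated as assumptions to verify against the current `CreateTrainingJob` documentation.

```python
import re
import time

# Assumed constraint: alphanumeric characters, with hyphens allowed between them
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
MAX_JOB_NAME_LENGTH = 63  # assumed maximum length

def make_job_name(prefix, now=None):
    """Append a UTC timestamp to prefix and validate the resulting name."""
    timestamp = time.strftime("-%Y-%m-%d-%H-%M-%S", now or time.gmtime())
    name = prefix + timestamp
    if len(name) > MAX_JOB_NAME_LENGTH or not JOB_NAME_PATTERN.match(name):
        raise ValueError("invalid training job name: {}".format(name))
    return name

print(make_job_name("DEMO-imageclassification"))
```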


In [11]:
# create the Amazon SageMaker training job
# (the client is named `sm_client` so that it does not shadow the imported `sagemaker` module)
sm_client = boto3.client(service_name='sagemaker')
sm_client.create_training_job(**training_params)

# confirm that the training job has started
status = sm_client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

try:
    # wait for the job to finish and report the ending status
    sm_client.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    training_info = sm_client.describe_training_job(TrainingJobName=job_name)
    status = training_info['TrainingJobStatus']
    print("Training job ended with status: " + status)
except Exception:
    # if an exception is raised, the waiter reported a failure
    print('Training failed to start')
    message = sm_client.describe_training_job(TrainingJobName=job_name)['FailureReason']
    print('Training failed with the following error: {}'.format(message))

Training job current status: InProgress
Training job ended with status: Completed


In [12]:
training_info = sm_client.describe_training_job(TrainingJobName=job_name)
status = training_info['TrainingJobStatus']
print("Training job ended with status: " + status)

Training job ended with status: Completed
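
The waiter used above blocks until the training job reaches a terminal state. A hand-rolled equivalent is sketched below to make the control flow explicit. `wait_for_job` is a hypothetical helper, and `fake_status` is a stand-in for a `describe_training_job` call so that the example runs without AWS access.

```python
import time

# Statuses after which a training job no longer changes state
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_job(get_status, poll_seconds=0.0, max_polls=100):
    """Call get_status() until it returns a terminal status or max_polls is reached."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("job did not finish after {} polls".format(max_polls))

# Stand-in for the API call: reports InProgress twice, then Completed
_responses = iter(["InProgress", "InProgress", "Completed"])

def fake_status():
    return next(_responses)

final_status = wait_for_job(fake_status)
print("Training job ended with status: " + final_status)
```

In a real notebook `poll_seconds` would be set to something like 30 or 60 to avoid hammering the API; it defaults to zero here only so the sketch finishes immediately.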


---

---

### MXNet Estimator
*SageMaker* provides built-in functionality to train and host [MXNet](http://mxnet.incubator.apache.org) and [Gluon](http://gluon.mxnet.io) models, using the `MXNet` class of the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk). Leveraging the MXNet Estimator drastically simplifies the handling of end-to-end training as well as deployment of custom MXNet models.

Using the code (below), the model itself, the location of the training data and the Hyperparameters are applied to the MXNet Estimator.

In [None]:
# Create a MXNet Estimator
mxnet_estimator = MXNet(
    'model.py',
    role=role,
    train_instance_count=1,
    train_instance_type='ml.c4.xlarge',
    output_path='s3://'+bucket,
    hyperparameters={
        'epochs': 2500,
        'optmizer': 'sgd',
        'learning_rate': 0.0075,
        'batch_size': 64,
        'threshold': 0.0019
    }
)

### Training Job
By calling the estimator's `fit()` method, with the location of the training data, *SageMaker* can start the model training using the configuration provided. After the training is successfully completed, the training results can be analyzed. Should the results prove that the model is optimal, it can then be deployed to *SageMaker's* hosting services.
>**Note:** Make sure to note the Training Job Name as it will be used in the next step.

In [None]:
##############################################################################################
#                   Create a custom job name for current training run                        #
#job_name = '<<Specific Training Job Name>>'                                                 #
#mxnet_estimator.fit(input_data, job_name=job_name) # Fit the estimator to custom job name   #
##############################################################################################

# Automatically generate training job name
mxnet_estimator.fit(input_data)

---
## 3 - Performance Analysis of the Trained Model.
After a model has been trained and before it can be leveraged in production, it must be tested. This testing process typically takes the form of:
1. **Analyzing the results from the training process:**
    A good indication that the model performs well on the training data is to verify that the overall Training Error (Cost Function) decreases after every iteration of the forward propagation process.
2. **Classification Accuracy (Training data set):**
    While the Training Error provides a good indication of how well the Neural Network output probabilities agree with the observed labels, a common evaluation metric used for classification models is the **Accuracy Score**. This metric summarizes the number of correct predictions the classifier has made as a ratio of all the predictions.
3. **Classification Accuracy (Test/Validation data set):**
    A good practice in machine learning is to create a subset of the training data and keep it separate for testing. This is typically referred to as a hold-out, validation or test set. By testing how well the model performs against this data, further insight can be derived.
    
As can be seen from the output of the training process above, the model learns the features of the training set well enough to accurately classify the observed labels. Additionally, when the model applies the optimized parameters to classify the test data, it achieves an overall accuracy of $80\%$.
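
The accuracy score described above is the fraction of predictions that match the observed labels. A minimal sketch follows. The labels and predictions are made up for illustration and are not results from this training job.

```python
import numpy as np

def accuracy_score(y_true, y_pred):
    """Fraction of predictions that equal the observed labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Made-up labels (1 = cat, 0 = non-cat) and predictions
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)
print("Accuracy: {:.0%}".format(acc))
```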

Since the training function also captures these results to S3, the Cost/Error, training set Accuracy and test set Accuracy can be visualized as follows:
>**Note**: Be sure to enter the name of the above SageMaker training job in `job_name` variable.

In [None]:
# Download and uncompress output results from model training
job_name = '<<Enter Training Job Name>>'
s3 = boto3.resource('s3')
s3.Bucket(bucket).download_file(job_name+'/output/output.tar.gz', '/tmp/output.tar.gz')
tarfile.open('/tmp/output.tar.gz').extractall()
with open('results.json') as j:
    data = json.load(j)#, object_pairs_hook=OrderedDict)

# Format data for plotting
costs = []
val_acc = []
train_acc = []
for key, value in sorted(data.items()):#, key=lambda (k,v): (v, k)):
    if 'epoch' in key:
        for k, v in value.items():
            if k == 'cost':
                costs.append(v)
            elif k == 'val_acc':
                val_acc.append(v)
            elif k == 'train_acc':
                train_acc.append(v)
    elif 'Start' in key:
        start = datetime.datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f")
    elif 'End' in key:
        end = datetime.datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f")
val_acc = np.array(val_acc)
train_acc = np.array(train_acc)
costs = np.array(costs)
delta = end - start
print("Model Training Time: {} Minute(s)".format(int(delta.total_seconds() / 60)))

# Plot the Learning Curve
plt.rcParams['figure.figsize'] = (11.0, 10.0)
plt.grid(True, which='both')
plt.plot(costs)
plt.plot(train_acc)
plt.plot(val_acc)
plt.ylabel('Cost / Accuracy')
plt.xlabel('Epochs (in Hundreds)')
plt.title("Learning Curve")
plt.legend(['Cost', 'Training Accuracy', 'Validation Accuracy'])
plt.show()
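
One caveat with `sorted(data.items())` is that string keys sort lexicographically. If the epoch keys in `results.json` carry an unpadded number (for example `epoch2` and `epoch10`, which is an assumption since the file layout is not shown here), they will not come back in numeric order. A hedged sketch of a numeric sort key, using the hypothetical helper `epoch_number`:

```python
import re

def epoch_number(key):
    """Return the first integer found in key; keys without digits sort first."""
    match = re.search(r"\d+", key)
    return int(match.group()) if match else -1

# Hypothetical keys; the real results.json may use a different naming scheme
keys = ["epoch10", "epoch2", "epoch1", "Start"]

lexicographic = sorted(keys)
numeric = sorted(keys, key=epoch_number)

print("Lexicographic:", lexicographic)
print("Numeric:      ", numeric)
```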

---
## 4 - Performance Analysis of the Inference Endpoint.
Testing the model against an image that is part of neither the training data nor the testing data provides a realistic indication of its performance in production. The following code cells demonstrate how the trained model performs against a selection of images that have pictures of cats as well as "other" pictures.

To further simulate the predictive capabilities of the trained model in a production environment, the `deploy()` method of the estimator is called to host the model on the *SageMaker* [hosting services](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html), which provide an HTTPS endpoint for classification inferences on images.

In [None]:
predictor = mxnet_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

The code cells below show the pseudo production classification inferences on unseen image data by leveraging the hosted predictor.

In [None]:
import glob
import matplotlib.image as mpimg
from skimage import transform

# Get Classes
classes = ["non-cat", "cat"]

# Get Image files
images = []
for img_path in glob.glob('./images/*.jpeg'):
    images.append(mpimg.imread(img_path))

# Plot predictions
plt.figure(figsize=(20.0,20.0))
columns = 2
for i, image in enumerate(images):
    img = transform.resize(image, (64, 64), mode='constant').reshape((1, 64 * 64 * 3))
    prediction = int(predictor.predict(img.tolist()))
    plt.subplot(len(images) // columns + 1, columns, i + 1)
    plt.title('Prediction = "{}" picture.'.format(classes[prediction]))
    plt.imshow(image);
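
`plt.subplot` expects integer row and column counts, so the number of rows in the grid should be computed with integer arithmetic. A small sketch of the grid calculation is shown below, using a hypothetical helper `grid_rows` that returns the smallest row count that fits all images.

```python
import math

def grid_rows(n_items, columns):
    """Smallest number of rows that fits n_items into a grid with `columns` columns."""
    return max(1, math.ceil(n_items / columns))

# Example image counts laid out in a two-column grid
for n in (1, 4, 5):
    print("{} image(s) -> {} row(s) x 2 columns".format(n, grid_rows(n, 2)))
```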

---
# Next: Test the Production API
Now that the model has been trained and validated for production, the **Data Science** part of the ML Pipeline can be integrated into the **DevOps** process. Refer back to the [README](../README.md) on the next steps.
>**Note:** Make sure to remember the name of the training job, as it is necessary to complete the next steps.

---
# Appendix A: Image Classification Model

In [None]:
!cat model.py