# SageMaker Model Training and Prediction - Built-in Image Classification Model
## Introduction
[Amazon SageMaker](https://aws.amazon.com/sagemaker/?sc_channel=PS&sc_campaign=pac_ps_q4&sc_publisher=google&sc_medium=sagemaker_b_pac_search&sc_content=sagemaker_e&sc_detail=aws%20sagemaker&sc_category=sagemaker&sc_segment=webp&sc_matchtype=e&sc_country=US&sc_geo=namer&sc_outcome=pac&s_kwcid=AL!4422!3!245225393502!e!!g!!aws%20sagemaker&ef_id=WL2I0wAAAIRC8xLB:20180418161912:s) is a fully managed platform that enables Data Scientists to build, train and deploy machine learning models at any scale. It provides the key services necessary to create and manage a Machine Learning (ML) Pipeline from "Notebook" to "Production", as highlighted below:

<img src="images/SageMaker_Workflow.png" style="width:800px;height:200px;">

The following Notebook demonstrates this process by using SageMaker's built-in [Image Classification Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html). To accomplish this, SageMaker's Image Classification Algorithm leverages a commonly used, pre-built model for image classification called __ResNet__. You can read more about ResNet [here](https://arxiv.org/abs/1512.03385).

<img src="images/CNN.jpg" style="width:950px;height:400px;">
<caption><left>[*image source](https://www.MathWorks.com)</left></caption><br>

By leveraging this methodology, the Data Scientist or Developer doesn't need to spend time building, training and optimizing a custom Image Classification model, but can simply provide the training data and let SageMaker perform all the heavy lifting.

In [None]:
# Import libraries
import warnings; warnings.simplefilter('ignore')
import os
import boto3
import sagemaker
import time
import h5py
import json
import tarfile
import datetime
import urllib.request
import imageio
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
from time import gmtime, strftime
from IPython.display import Image
from sagemaker.amazon.amazon_estimator import get_image_uri
%matplotlib inline

# Configure SageMaker
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
training_image = get_image_uri(boto3.Session().region_name, 'image-classification') #Image Classification Estimator

# Helper Functions
def make_lst(data, label, name):
    """
    Make a custom tab separated `lst` file for `im2rec.py`
    
    Arguments:
    data -- Numpy array of image data
    label -- "Truth" label for image classification
    name -- "train" or "test" data
    """
    # Create local repository for the images based on name
    if not os.path.exists('./'+name):
        os.mkdir('./'+name)
        
    # Create the `.lst` file
    lst_file = './'+name+'.lst'
    
    # Iterate through the numpy arrays, save each image as `.jpg`
    # and write the matching row to the `.lst` file.
    # Opening once with 'w' avoids duplicate rows when the cell is re-run.
    with open(lst_file, 'w') as f:
        for i in range(len(data)):
            img = data[i]
            img_name = name+'/'+str(i)+'.jpg'
            imageio.imwrite(img_name, img)
            f.write("{}\t{}\t{}\n".format(str(i), str(label[i]), img_name))

---
# Data Overview
## Input Data Preparation
To train the Neural Network, we are provided with a dataset (`datasets.h5`) containing:
- a training set of $m$ images, each labeled as either a cat ($y=1$) or a non-cat ($y=0$) image.
- a test set of $m$ images, each labeled as either a cat ($y=1$) or a non-cat ($y=0$) image.

In [None]:
# Load the Training and Testing dataset
dataset = h5py.File('datasets/datasets.h5', 'r')

# Create the Training and Testing data sets
X_train = np.array(dataset['train_set_x'][:])
y_train = np.array(dataset['train_set_y'][:])
X_test = np.array(dataset['test_set_x'][:])
y_test = np.array(dataset['test_set_y'][:])

From the cell above, the training and testing (validation) image data (`X_train` and `X_test`) are 4-dimensional arrays consisting of $209$ training examples ($m$) and $50$ testing images. Each image has a height, width and depth (__R__ed, __G__reen, __B__lue values) of $64 \times 64 \times 3$. The "true" labels (`y_train` and `y_test`) hold one entry per image, i.e. $209$ and $50$ labels respectively.

In [None]:
# Training data set dimensions
print("Training Data Dimension: {}".format(X_train.shape))
print("No. Training Examples: {}".format(X_train.shape[0]))
print("No. Training Features: {}".format(X_train.reshape((-1, 12288)).shape[1]))

In [None]:
# Testing data set dimensions
print("Testing Data Dimension: {}".format(X_test.shape))
print("No. Testing Examples: {}".format(X_test.shape[0]))
print("No. Testing Features: {}".format(X_test.reshape((-1, 12288)).shape[1]))

SageMaker's built-in Image Classification algorithm works best when the training dataset is stored in Apache MXNet [RecordIO](https://mxnet.incubator.apache.org/architecture/note_data_loading.html) format. RecordIO (__content type:__ application/x-recordio) is an efficient file format that feeds images to the training job as a stream, which can substantially improve model training time. Some of the benefits include:
- Storing images in a compact format, which greatly reduces the size of the dataset on the disk.
- Packing data together allows continuous reading on the disk.
- RecordIO has a simple way to partition, simplifying the distribution of training data when leveraging distributed training.

Using the RecordIO format requires the numpy arrays for training and testing to be converted to this format, uploaded to S3 and then streamed to the training instance memory. This process is referred to as *Pipe mode* in SageMaker. The MXNet community provides a tool, [im2rec](https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/im2rec.py), that helps convert images to RecordIO format. More information can be found in the [tutorial](https://mxnet.incubator.apache.org/faq/recordio.html).

For the sake of this notebook, the native images (__content type:__ application/x-image) will be used instead of RecordIO. This process is referred to as *File mode* in SageMaker. Since the Training and Testing data sets are currently stored as Numpy arrays, they will need to be converted to native `.jpg` images before uploading them to S3.

SageMaker's built-in image classifier has the capability to convert the native images to RecordIO format as part of the training function. However, the associated metadata (i.e. the classification label of the image) needs to be captured along with the image when uploading to S3. In order to accomplish this, a `.lst` file needs to be created that captures each image and its associated label.

An `.lst` file is a tab-separated file with three columns that contains a list of image files. The first column specifies the image index, the second column specifies the class label index for the image, and the third column specifies the relative path of the image file. The image index in the first column should be unique across all of the images. The following code cell leverages the `make_lst()` helper function to accomplish this and shows an example of what the `.lst` file looks like.
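
As a quick sanity check on the format described above, the following minimal sketch parses `.lst` content and verifies the three-column layout and the uniqueness of the image index. The `parse_lst()` helper and the sample rows (including their labels) are hypothetical and only illustrate the layout; they are not part of the SageMaker tooling.

```python
def parse_lst(text):
    """Parse `.lst` content into (index, label, relative_path) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError("expected 3 tab-separated columns, got {}".format(len(fields)))
        index, label, path = fields
        # Labels may be written as "1" or "1.0"; normalise to int
        rows.append((int(index), int(float(label)), path))
    indices = [row[0] for row in rows]
    if len(indices) != len(set(indices)):
        raise ValueError("image indices must be unique")
    return rows

# Hypothetical sample content in the same layout that `make_lst()` writes
sample = "0\t0\ttrain/0.jpg\n1\t0\ttrain/1.jpg\n2\t1\ttrain/2.jpg\n"
rows = parse_lst(sample)
print(rows)
```

The same function could be pointed at the contents of `train.lst` or `test.lst` before uploading, to catch malformed rows early.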

In [None]:
# Extract numpy arrays to images and create `lst` file
make_lst(X_train, y_train, name='train') # Training data
make_lst(X_test, y_test, name='test') # Testing data
# View the output of the training `.lst` file
print("Sample output of the `train.lst` file:\n")
!head -n 3 ./train.lst > example.lst
# Use a context manager so the file handle is closed after reading
with open('example.lst', 'r') as f:
    lst_content = f.read()
print(lst_content)

In [None]:
# Show sample image from file
image = mpimg.imread('./train/2.jpg')
plt.imshow(image);

## Input Data Upload

In order for SageMaker to execute the training and validation process on the Input Data, the data needs to be uploaded to S3. SageMaker provides a handy function, `upload_data()`, to upload the local image folders and `.lst` files to a default (or specific) location. If not already created, the function will create an S3 bucket. The resulting S3 bucket will also store the various training and testing output that will be used for creating production Endpoints and Analysis.

In [None]:
# Upload the Training and Testing Data to S3
training_data = sagemaker_session.upload_data(path='./train', key_prefix='train')
training_lst = sagemaker_session.upload_data(path='./train.lst', key_prefix='train_lst')
test_data = sagemaker_session.upload_data(path='./test', key_prefix='test')
test_lst = sagemaker_session.upload_data(path='./test.lst', key_prefix='test_lst')
bucket = training_data.split('/')[2]
print("S3 Bucket: {}".format(bucket))
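
The cell above recovers the bucket name by splitting the returned S3 URI on `/`. A slightly more explicit alternative, sketched below, parses the URI with the standard library. The `split_s3_uri()` helper and the example URI are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split an `s3://bucket/key` URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {}".format(uri))
    # The bucket is the network location; the key is the path without the leading slash
    return parsed.netloc, parsed.path.lstrip("/")

# Hypothetical URI in the form returned by `upload_data()`
example_uri = "s3://example-bucket/train"
bucket_name, key = split_s3_uri(example_uri)
print("Bucket: {}, Key: {}".format(bucket_name, key))
```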

---

# Training the SageMaker Classifier
## Hyperparameters
There are two kinds of parameters that need to be set for training. The first kind is the set of hyperparameters that are specific to the algorithm. These are:

- __num_layers:__ The number of layers (depth) for the network. We use `18` in this example because the training images are small ($64\times64$). More layers are typically used for much larger training images.
- __image_shape:__ The input image dimensions, 'num_channels, height, width', for the network. It should be no larger than the actual image size. The number of channels should be the same as in the actual image.
    <div class="alert alert-info">
      <strong>Info!</strong> The original image data is shaped as <b>64 x 64 x 3</b>. RecordIO prefers the data formatted with channels first, i.e. <b>3 x 64 x 64</b>. Since the built-in image classifier will automatically convert the native images to RecordIO format before executing the training, the <code>image_shape</code> hyperparameter dimensions are the dimensions of the data after being converted to RecordIO, i.e. <b>3 x 64 x 64</b>.
    </div>
- __num_training_samples:__ This is the total number of training examples. It is set to $209$.
- __num_classes:__ This is the number of output classes for the new dataset. Imagenet was trained with 1000 output classes but the number of output classes can be changed for fine-tuning. For this training set, $2$ is used because it has $2$ object categories, __cat__ or __non-cat__.
- __mini_batch_size:__ The number of training samples used for each mini batch. In distributed training, the number of training samples used per batch will be $N \times$ `mini_batch_size`, where $N$ is the number of hosts on which training is run.
- __epochs:__ Number of training epochs.
- __learning_rate:__ Learning rate for training.
- __use_pretrained_model:__ Set to $0$ since the example will not be using transfer learning.
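
To make the relationship between the array dimensions and these hyperparameters concrete, here is a small sketch. It uses a zero-filled placeholder array with the same shape as `X_train` (an assumption made so the snippet runs on its own), derives the channels-first `image_shape` string, and estimates the number of mini batches per epoch. Whether the algorithm processes or drops a final partial batch is not covered here, so treat the batch count as an upper bound.

```python
import math
import numpy as np

# Placeholder with the same shape as `X_train`: (examples, height, width, channels)
images = np.zeros((209, 64, 64, 3), dtype=np.uint8)
m, height, width, channels = images.shape

# `image_shape` is expressed channels-first: "num_channels,height,width"
image_shape = "{},{},{}".format(channels, height, width)

# The same reordering applied to the array itself, for illustration only
channels_first = np.transpose(images, (0, 3, 1, 2))

# Samples consumed per step is N x mini_batch_size, where N is the number of hosts
mini_batch_size = 42
num_hosts = 1
effective_batch = num_hosts * mini_batch_size
batches_per_epoch = math.ceil(m / effective_batch)

print(image_shape, channels_first.shape, batches_per_epoch)
```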

In [None]:
# The algorithm supports multiple network depth (number of layers). They are 18, 34, 50, 101, 152 and 200
num_layers = 18

# Shape of the training images
image_shape = "3,64,64"

# No. Samples in Training set
num_training_samples = 209

# No. output classes
num_classes = 2

# Batch Size
mini_batch_size =  42

# No. Epochs
epochs = 6

# Learning rate
learning_rate = 0.01

# Since transfer learning is not used, set use_pretrained_model to `0`
# so that weights can be initialized WITHOUT pre-trained weights
use_pretrained_model = 0

## Training Configuration

The second set of parameters are those that are specific to the SageMaker training job. These include:

- __Input specification:__ These are the training and validation channels that specify the path where training data is present. These are specified in the `InputDataConfig` section. The main parameters that need to be set are the `ContentType`, which is set to *application/x-image* since the input data consists of native images, and the `S3Uri`, which specifies the bucket and the folder where the training images are stored.
- __Output specification:__ This is specified in the `OutputDataConfig` section, the path to where the output can be stored after training. 
- __Resource config:__ This section specifies the type of instance on which to run the training and the number of hosts used for training. If `InstanceCount` is more than $1$, then training can be run in a distributed manner.
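
Because the four input channels differ only in name and S3 location, the `InputDataConfig` list can be built with a small helper. The sketch below is one possible way to do this; `make_channel()` and `example-bucket` are hypothetical names introduced for illustration, while the dictionary keys mirror those used in the training parameters of this notebook.

```python
def make_channel(name, s3_uri, content_type="application/x-image"):
    """Build one `InputDataConfig` entry for a File mode image channel."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        "ContentType": content_type,
        "CompressionType": "None",
    }

# Hypothetical bucket name; in the notebook this would be the `bucket` variable
example_bucket = "example-bucket"
input_data_config = [
    make_channel("train", "s3://{}/".format(example_bucket)),
    make_channel("validation", "s3://{}/".format(example_bucket)),
    make_channel("train_lst", "s3://{}/train_lst/".format(example_bucket)),
    make_channel("validation_lst", "s3://{}/test_lst/".format(example_bucket)),
]
print([channel["ChannelName"] for channel in input_data_config])
```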

In [None]:
# Create unique job name 
job_name_prefix = 'sagemaker-imageclassification'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp
training_params = \
{
    # Specify the training docker image
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/{}/output'.format(bucket, job_name_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.p2.xlarge", # GPU Instance
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "image_shape": image_shape,
        "num_layers": str(num_layers),
        "num_training_samples": str(num_training_samples),
        "num_classes": str(num_classes),
        "mini_batch_size": str(mini_batch_size),
        "epochs": str(epochs),
        "learning_rate": str(learning_rate),
        "use_pretrained_model": str(use_pretrained_model)    
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 360000
    },
    "InputDataConfig": [
        {
            "ChannelName": "train", # Training Images Location
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/'.format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation", # Testing Images Location
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/'.format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        },
        {
            "ChannelName": "train_lst", # Image Metadata
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/train_lst/'.format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation_lst", # Image metadata
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/test_lst/'.format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        }
    ]
}
print('Training job name: {}'.format(job_name))

## Training Job
Use the SageMaker `create_training_job()` method to start the training with the above parameters.

In [None]:
# create the Amazon SageMaker training job
sagemaker = boto3.client(service_name='sagemaker')
sagemaker.create_training_job(**training_params)

# confirm that the training job has started
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

try:
    # wait for the job to finish and report the ending status
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = training_info['TrainingJobStatus']
    print("Training job ended with status: " + status)
except Exception as e:
    # the waiter raises if the job fails or the wait times out
    print('Training did not complete successfully: {}'.format(e))
    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    # `FailureReason` is only present when the job has failed
    message = training_info.get('FailureReason', 'No failure reason reported')
    print('Training failed with the following error: {}'.format(message))
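
The waiter used above hides the polling loop. For readers who want to see what such a loop looks like, the sketch below polls a status function until a terminal state is reached. `wait_for_terminal_status()` is a hypothetical helper, not part of boto3, and the example drives it with a fake status sequence so that it runs without an AWS account.

```python
import time

def wait_for_terminal_status(get_status,
                             terminal=("Completed", "Failed", "Stopped"),
                             delay=30, max_attempts=120):
    """Call `get_status()` until it returns a terminal status, then return it."""
    for _ in range(max_attempts):
        status = get_status()
        if status in terminal:
            return status
        time.sleep(delay)  # pause between polls
    raise TimeoutError("status did not become terminal after {} attempts".format(max_attempts))

# Fake status sequence standing in for repeated describe calls
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), delay=0)
print("Final status: {}".format(final_status))
```

With the client from the cell above, `get_status` could be a lambda that returns `describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']`.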

---
# Hosting the SageMaker Model
Now that the image classification model has been trained, it can be used to perform inferences, i.e. is the picture a "cat" or "non-cat" picture. This section involves the following steps:
- __Create model:__ Create model for the training output.
- __Create endpoint configuration:__ Create a configuration defining an endpoint.
- __Create endpoint:__ Use the configuration to create an inference endpoint.
- __Prediction:__ Perform inference on some input data using the endpoint.

## Create Model
Use the model artifacts produced by the training job to create a SageMaker Model, which the Endpoint Configuration will reference.

In [None]:
# Create hosting model
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
model_name = "image-classification" + timestamp
info = sagemaker.describe_training_job(TrainingJobName=job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print("Model name: {}".format(model_name))

# SageMaker Hosting Image
hosting_image = get_image_uri(boto3.Session().region_name, 'image-classification')
primary_container = {
    'Image': hosting_image,
    'ModelDataUrl': model_data,
}

# Create the Model
create_model_response = sagemaker.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container
)

## Create the Endpoint Configuration
Next, configure __REST__ endpoints for hosting multiple models, e.g. for __A/B__ testing purposes. In order to support this, create an endpoint configuration, that describes the distribution of traffic across the models, whether split, shadowed, or sampled in some way. In addition, the endpoint configuration describes the instance type required for model deployment, and at launch will describe the autoscaling configuration.

In [None]:
# Create the Endpoint configuration
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_config_name = job_name_prefix + '-config' + timestamp
endpoint_config_response = sagemaker.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants = [
        {
            'InstanceType':'ml.m5.xlarge', # Non-GPU Instance for hosting
            'InitialInstanceCount':1,
            'ModelName':model_name,
            'VariantName':'AllTraffic'
        }
    ]
)
print('Endpoint configuration name: {}'.format(endpoint_config_name))
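
The configuration above sends all traffic to a single variant. To illustrate the A/B testing case mentioned earlier, the sketch below defines two variants and computes each one's share of traffic from its `InitialVariantWeight`, on the understanding that each variant receives its weight divided by the sum of all weights and that an unspecified weight defaults to `1.0`. The variant names, model names and the `traffic_shares()` helper are hypothetical.

```python
def traffic_shares(production_variants):
    """Return each variant's fraction of traffic, based on its relative weight."""
    weights = {
        v["VariantName"]: v.get("InitialVariantWeight", 1.0)
        for v in production_variants
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Two hypothetical variants splitting traffic 90/10
variants = [
    {
        "VariantName": "ModelA",
        "ModelName": "image-classification-a",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 9.0,
    },
    {
        "VariantName": "ModelB",
        "ModelName": "image-classification-b",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,
    },
]
shares = traffic_shares(variants)
print(shares)
```

A list like `variants` would be passed as `ProductionVariants` when creating the endpoint configuration, provided both models already exist.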

## Create Hosting Endpoint
By specifying the name and configuration defined above, an endpoint is created that will be used to make predictions on unseen data, and that can be incorporated into production applications. This takes 9-11 minutes to complete.

In [None]:
# Create Endpoint
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_name = job_name_prefix + '-endpoint' + timestamp
print('Endpoint name: {}'.format(endpoint_name))

endpoint_params = {
    'EndpointName': endpoint_name,
    'EndpointConfigName': endpoint_config_name,
}
endpoint_response = sagemaker.create_endpoint(**endpoint_params)

# Get the status of the endpoint
response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print('EndpointStatus = {}'.format(status))
    
try:
    sagemaker.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)
finally:
    resp = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Create endpoint ended with status: " + status)
    print("Endpoint Arn: " + resp['EndpointArn'])
    
    if status != 'InService':
        # `FailureReason` is only present when endpoint creation has failed
        message = resp.get('FailureReason', 'No failure reason reported')
        print('Endpoint creation failed with the following error: {}'.format(message))
        raise Exception('Endpoint creation did not succeed')

---
# Prediction

Finally, the endpoint created in the previous step can be invoked through the runtime client to generate classifications from the trained model, i.e. to predict whether a new (previously unseen) image is a **cat** or **non-cat** image.

In [None]:
import glob
import matplotlib.image as mpimg

# Plot predictions
plt.figure(figsize=(20.0,20.0))
columns = 2
threshold = 0.5

# Create label Classes
classes = ["non-cat", "cat"]

# Inference client
runtime = boto3.Session().client(service_name='runtime.sagemaker')

# Get Image files
images = []
for img_path in glob.glob('./images/*.jpeg'):
    images.append(img_path)

# Run each image against the Inference Endpoint and plot results
for i, image in enumerate(images):
    with open(image, 'rb') as f:
        payload = f.read()
        payload = bytearray(payload)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/x-image',
        Body=payload
    )
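    # The endpoint returns one probability per class, ordered by label index
    # (index 0 = "non-cat", index 1 = "cat")
    result = json.loads(response['Body'].read())
    if result[1] > threshold:
        prediction = classes[1]
    else:
        prediction = classes[0]
    # Integer division: `plt.subplot` requires an integer row count
    plt.subplot(len(images) // columns + 1, columns, i + 1)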
    plt.title('Prediction = "{}" picture.'.format(prediction))
    plt.imshow(mpimg.imread(image))
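
The response handling above can also be written so that it generalizes beyond two classes. The sketch below assumes the endpoint returns a JSON list with one probability per class, ordered by label index, and picks the most probable class. `decode_prediction()` and the fake response body are hypothetical and let the snippet run without a live endpoint.

```python
import json

def decode_prediction(body_bytes, classes):
    """Map a JSON list of class probabilities to (label, probability)."""
    probabilities = json.loads(body_bytes)
    if len(probabilities) != len(classes):
        raise ValueError("expected {} probabilities, got {}".format(
            len(classes), len(probabilities)))
    # Index of the largest probability
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best], probabilities[best]

# Fake response body standing in for `response['Body'].read()`
fake_body = b"[0.2, 0.8]"
label, confidence = decode_prediction(fake_body, ["non-cat", "cat"])
print('Prediction = "{}" ({:.0%})'.format(label, confidence))
```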

# Next: Test the Production API
Now that the model has been trained and validated for production, the **Data Science** part of the ML Pipeline can be integrated into the **DevOps** process. Refer back to the [README](../README.md) for the next steps.

<div class="alert alert-danger">
  <strong>Note:</strong> Make sure to remember the name of the training job, as it is necessary to complete the next steps.
</div>