# Train and Host a Keras Sequential Model
## Utilizing Pipe Mode datasets and distributed training with Horovod
This notebook shows how to train and host a Keras Sequential model on SageMaker. The model used for this notebook is a simple deep CNN that was extracted from [the Keras examples](https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py).

## The dataset
The [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) is one of the most popular machine learning datasets. It consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class). Here are the classes in the dataset, as well as 10 random images from each:

![cifar10](https://maet3608.github.io/nuts-ml/_images/cifar10.png)

In this tutorial, we will train a deep CNN to recognize these images.

We'll compare File Mode, Pipe Mode datasets, and distributed training with Horovod.

## Set up the environment

In [None]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

## Download the CIFAR-10 dataset
Downloading the test and training data will take around 5 minutes.

In [None]:
!python generate_cifar10_tfrecords.py --data-dir ./data

## Create a training job using the sagemaker.TensorFlow estimator, running locally
To check that the code will work in SageMaker, we'll first run it in SageMaker local mode.

In [None]:
!/home/ec2-user/sample-notebooks/sagemaker-python-sdk/tensorflow_distributed_mnist/setup.sh

In [None]:
from sagemaker.tensorflow import TensorFlow

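import subprocess
instance_type = 'local'

try:
    # nvidia-smi exits with 0 only when a GPU and its driver are available
    if subprocess.call('nvidia-smi') == 0:
        instance_type = 'local_gpu'
except FileNotFoundError:
    # nvidia-smi is not installed, so stay on the CPU instance type
    pass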
    
local_hyperparameters = {'epochs': 5, 'batch-size' : 256}

source_dir = os.path.join(os.getcwd(), 'source_dir')
estimator = TensorFlow(entry_point='cifar10_keras_main.py',
                       source_dir=source_dir,
                       role=role,
                       framework_version='1.12.0',
                       py_version='py3',
                       hyperparameters=local_hyperparameters,
                       train_instance_count=1, train_instance_type=instance_type)

In [None]:
local_inputs = {'train' : 'file://'+os.getcwd()+'/data/train', 'validation' : 'file://'+os.getcwd()+'/data/validation', 'eval' : 'file://'+os.getcwd()+'/data/eval'}
estimator.fit(local_inputs)
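
The `file://` channel URIs used for local mode can also be built with `pathlib` instead of string concatenation. This is a minimal sketch: the helper name `local_channels` is an assumption made for illustration and is not part of the SageMaker SDK. Note that `as_uri()` percent-encodes special characters such as spaces, which plain concatenation does not.

```python
from pathlib import Path

def local_channels(data_dir, names=('train', 'validation', 'eval')):
    """Map each channel name to a file:// URI under data_dir (hypothetical helper)."""
    base = Path(data_dir).resolve()  # as_uri() requires an absolute path
    return {name: (base / name).as_uri() for name in names}

local_inputs_sketch = local_channels('data')
for channel, uri in local_inputs_sketch.items():
    print(channel, uri)
```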

## Running on SageMaker cloud

### Uploading the data to S3

In [None]:
inputs = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-cifar10-tf')
display(inputs)

## Configuring metrics from the job logs
SageMaker can extract metrics directly from the job logs and send them to CloudWatch Metrics. Each definition pairs a metric name with a regular expression whose capture group holds the metric value.

In [None]:
keras_metric_definition = [
    {'Name': 'train:loss', 'Regex': '.*loss: ([0-9\\.]+) - acc: [0-9\\.]+.*'},
    {'Name': 'train:accuracy', 'Regex': '.*loss: [0-9\\.]+ - acc: ([0-9\\.]+).*'},
    {'Name': 'validation:accuracy', 'Regex': '.*step - loss: [0-9\\.]+ - acc: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_acc: ([0-9\\.]+).*'},
    {'Name': 'validation:loss', 'Regex': '.*step - loss: [0-9\\.]+ - acc: [0-9\\.]+ - val_loss: ([0-9\\.]+) - val_acc: [0-9\\.]+.*'},
    {'Name': 'sec/steps', 'Regex': '.* - \\d+s (\\d+)[mu]s/step - loss: [0-9\\.]+ - acc: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_acc: [0-9\\.]+'}
]
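
To check what each regular expression captures, you can run the patterns against a log line yourself. The sketch below uses Python's `re` module and a made-up log line in the Keras progress-bar format; the line and its values are assumptions for illustration, not real training output. It uses `re.search`, which may differ from how SageMaker applies the patterns internally, although the leading `.*` makes the distinction unimportant here.

```python
import re

# Same patterns as the metric definitions, written as raw strings.
tail = r'loss: [0-9\.]+ - acc: [0-9\.]+ - val_loss: [0-9\.]+ - val_acc: [0-9\.]+'
patterns = {
    'train:loss': r'.*loss: ([0-9\.]+) - acc: [0-9\.]+.*',
    'train:accuracy': r'.*loss: [0-9\.]+ - acc: ([0-9\.]+).*',
    'validation:accuracy': r'.*step - loss: [0-9\.]+ - acc: [0-9\.]+ - val_loss: [0-9\.]+ - val_acc: ([0-9\.]+).*',
    'validation:loss': r'.*step - loss: [0-9\.]+ - acc: [0-9\.]+ - val_loss: ([0-9\.]+) - val_acc: [0-9\.]+.*',
    'sec/steps': r'.* - \d+s (\d+)[mu]s/step - ' + tail,
}

# Hypothetical end-of-epoch log line, for illustration only.
sample_line = ('196/196 [==============================] - 15s 77ms/step'
               ' - loss: 1.2345 - acc: 0.5678 - val_loss: 1.1111 - val_acc: 0.6012')

extracted = {}
for name, pattern in patterns.items():
    match = re.search(pattern, sample_line)
    if match:
        # The single capture group holds the metric value.
        extracted[name] = float(match.group(1))

print(extracted)
```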

In [None]:
hyperparameters = {'epochs': 60, 'batch-size' : 256}

In [None]:
from sagemaker.tensorflow import TensorFlow


source_dir = os.path.join(os.getcwd(), 'source_dir')
estimator = TensorFlow(base_job_name='cifar10-tf',
                       entry_point='cifar10_keras_main.py',
                       source_dir=source_dir,
                       role=role,
                       framework_version='1.12.0',
                       py_version='py3',
                       hyperparameters=hyperparameters,
                       train_instance_count=1, train_instance_type='ml.p3.2xlarge',
                       tags = [{'Key' : 'Project', 'Value' : 'cifar10'},{'Key' : 'TensorBoard', 'Value' : 'file'}],
                       metric_definitions=keras_metric_definition)

In [None]:
remote_inputs = {'train' : inputs+'/train', 'validation' : inputs+'/validation', 'eval' : inputs+'/eval'}
estimator.fit(remote_inputs, wait=True)

## Running on SageMaker with Pipe Mode input
SageMaker Pipe Mode is a mechanism for providing S3 data to a training job via Linux FIFOs (named pipes). Training programs can read from the FIFO and get high-throughput data transfer from S3, without managing the S3 access in the program itself.
Pipe Mode is covered in more detail in the SageMaker [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html#your-algorithms-training-algo-running-container-inputdataconfig).
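
To get a feel for what reading from a FIFO looks like, here is a minimal, SageMaker-independent sketch using only the Python standard library. A background thread plays the role of the data producer and the main thread reads the stream like an ordinary file. The FIFO name and payload are made up for illustration, and `os.mkfifo` is only available on Unix-like systems.

```python
import os
import tempfile
import threading

payload = b'record-1\nrecord-2\nrecord-3\n'  # stand-in for streamed training data

def writer(path, data):
    # Opening a FIFO for writing blocks until a reader opens the other end.
    with open(path, 'wb') as fifo:
        fifo.write(data)

with tempfile.TemporaryDirectory() as tmp_dir:
    fifo_path = os.path.join(tmp_dir, 'demo_fifo')  # hypothetical name
    os.mkfifo(fifo_path)

    producer = threading.Thread(target=writer, args=(fifo_path, payload))
    producer.start()

    # The consumer reads sequentially until the producer closes its end.
    with open(fifo_path, 'rb') as fifo:
        received = fifo.read()

    producer.join()

print(received == payload)
```

The data never lands on disk as a regular file, and the reader can only consume it sequentially.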

In our script, we enabled Pipe Mode using the following code:
```python
from sagemaker_tensorflow import PipeModeDataset
dataset = PipeModeDataset(channel=channel_name, record_format='TFRecord')
```

In [None]:
from sagemaker.tensorflow import TensorFlow


source_dir = os.path.join(os.getcwd(), 'source_dir')
estimator_pipe = TensorFlow(base_job_name='pipe-cifar10-tf',
                       entry_point='cifar10_keras_main.py',
                       source_dir=source_dir,
                       role=role,
                       framework_version='1.12.0',
                       py_version='py3',
                       hyperparameters=hyperparameters,
                       train_instance_count=1, train_instance_type='ml.p3.2xlarge',
                       tags = [{'Key' : 'Project', 'Value' : 'cifar10'},{'Key' : 'TensorBoard', 'Value' : 'pipe'}],
                       metric_definitions=keras_metric_definition,
                       input_mode='Pipe')

In this example, we set ```wait=False``` so the call returns without waiting for the job to finish. If you want to see the output logs, change it to ```wait=True```.

In [None]:
remote_inputs = {'train' : inputs+'/train', 'validation' : inputs+'/validation', 'eval' : inputs+'/eval'}
estimator_pipe.fit(remote_inputs, wait=False)

## Distributed training with Horovod
Horovod is a distributed training framework based on MPI. Horovod is only available with TensorFlow version 1.12 or newer. You can find more details in the Horovod README.

To enable Horovod, we needed to add the following code to our script:
```python
import horovod.keras as hvd
hvd.init()
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))
```

We added the following callbacks:
```python
hvd.callbacks.BroadcastGlobalVariablesCallback(0)
hvd.callbacks.MetricAverageCallback()
```

We configured the optimizer:
```python
# `size` is assumed to hold the number of Horovod processes, i.e. hvd.size()
opt = Adam(lr=learning_rate * size, decay=weight_decay)
opt = hvd.DistributedOptimizer(opt)
```
We also chose to save checkpoints and send TensorBoard logs only from the process where ```hvd.rank() == 0```.
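
The two ideas in this setup, scaling the learning rate by the number of workers and restricting checkpoints and TensorBoard logs to rank 0, can be sketched without Horovod installed. In the sketch below, `rank` and `size` are plain integers standing in for `hvd.rank()` and `hvd.size()`, and the callback names are placeholder strings rather than real Keras or Horovod objects. All function names are assumptions for illustration.

```python
def scaled_learning_rate(base_lr, size):
    """Linear scaling: multiply the base learning rate by the worker count."""
    return base_lr * size

def callbacks_for(rank):
    """Return placeholder callback names for a worker with the given rank."""
    # Every worker broadcasts initial state and averages metrics.
    callbacks = ['broadcast_global_variables', 'metric_average']
    if rank == 0:
        # Only one worker writes to disk, which avoids duplicate or clashing files.
        callbacks += ['model_checkpoint', 'tensorboard']
    return callbacks

size = 8  # hypothetical number of Horovod processes
print(scaled_learning_rate(0.001, size))
for rank in (0, 1):
    print(rank, callbacks_for(rank))
```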

To start a distributed training job with Horovod, you'll need to configure the job distribution:
```python
distributions = {'mpi': {
                    'enabled': True,
                    'processes_per_host': 4  # Number of Horovod processes per host (4 is an example value)
                        }
                }
```

In [None]:
from sagemaker.tensorflow import TensorFlow

distributions = {'mpi': {
                    'enabled': True,
                    'processes_per_host': 4
                        }
                }
hyperparameters = {'epochs': 60, 'batch-size' : 256}

source_dir = os.path.join(os.getcwd(), 'source_dir')
estimator_dist = TensorFlow(base_job_name='dist-cifar10-tf',
                       entry_point='cifar10_keras_main.py',
                       source_dir=source_dir,
                       role=role,
                       framework_version='1.12.0',
                       py_version='py3',
                       hyperparameters=hyperparameters,
                       train_instance_count=2, train_instance_type='ml.p3.8xlarge',
                       tags = [{'Key' : 'Project', 'Value' : 'cifar10'},{'Key' : 'TensorBoard', 'Value' : 'dist'}],
                       metric_definitions=keras_metric_definition,
                       distributions=distributions)
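
The job configuration determines how many Horovod processes take part in training. The arithmetic below is a small sketch of that relationship. It assumes that `batch-size` is applied per process, which depends on how the training script uses the hyperparameter, so treat the global batch size as an estimate.

```python
train_instance_count = 2   # number of ml.p3.8xlarge hosts
processes_per_host = 4     # one process per GPU (ml.p3.8xlarge has 4 GPUs)
batch_size = 256           # value of the 'batch-size' hyperparameter

# Total number of Horovod processes across all hosts.
world_size = train_instance_count * processes_per_host

# Assumption: each process consumes its own batch at every step.
global_batch_size = batch_size * world_size

print('Horovod processes:', world_size)
print('Samples per step across all processes:', global_batch_size)
```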

In this example, we set ```wait=False``` so the call returns without waiting for the job to finish. If you want to see the output logs, change it to ```wait=True```.

In [None]:
remote_inputs = {'train' : inputs+'/train', 'validation' : inputs+'/validation', 'eval' : inputs+'/eval'}
estimator_dist.fit(remote_inputs, wait=False)

### Local TensorBoard command
Using TensorBoard, we can compare the different jobs we ran. The following cell prints the TensorBoard command.
Run it in any environment where you have TensorBoard installed.

In [None]:
!python generate_tensorboard_command.py

Install TensorBoard on your computer and run the command.
You can access TensorBoard locally at http://localhost:6006

## Deploy the trained model

The deploy() method creates an endpoint that serves prediction requests in real time.
The model is saved as Keras artifacts, so it can't be served directly with TensorFlow Serving.
Two relevant options are available:
1. Export the artifacts to the SavedModel format, as described in this [blog post](https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/)
2. Use the SKLearn container to load a Keras model

I chose the SKLearn container for this example.

The endpoint uses inference.py for inference.

In [None]:
!cat inference_source_dir/inference.py

In [None]:
from sagemaker.sklearn import SKLearnModel
model = SKLearnModel(model_data=estimator.model_data,
                     role=role,
                     source_dir='inference_source_dir',
                     entry_point='inference.py'
                    )

In [None]:
predictor = model.deploy(endpoint_name='cifar10',initial_instance_count=1, instance_type='ml.m4.xlarge')

## Make some predictions
Prediction is not the focus of this notebook, so to verify the endpoint's functionality, we'll simply generate random data in the correct shape and make a prediction.

In [None]:
# Creating fake prediction data
import numpy as np
data = np.random.randn(1, 32, 32, 3)
print("Predicted class is {}".format(np.argmax(predictor.predict(data))))
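
The predicted class is an index between 0 and 9. To make it readable, you can map it to the standard CIFAR-10 label names. The sketch below assumes the endpoint returns one score per class for each image, which is what the `np.argmax` call relies on. The score vector is made up for illustration.

```python
# Standard CIFAR-10 label order (index 0 to 9).
CIFAR10_CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']

def top_class(scores):
    """Return (index, name) of the highest-scoring class."""
    index = max(range(len(scores)), key=lambda i: scores[i])
    return index, CIFAR10_CLASSES[index]

# Hypothetical score vector for a single image.
example_scores = [0.01, 0.02, 0.05, 0.60, 0.03, 0.20, 0.04, 0.02, 0.02, 0.01]
print(top_class(example_scores))
```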

### Calculating accuracy and creating a confusion matrix based on the test dataset

In [None]:
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import confusion_matrix
datagen = ImageDataGenerator()

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

def predict(data):
    predictions = predictor.predict(data)
    return predictions

In [None]:
batch_size = 256
predicted = []
actual = []
batches = 0
for data in datagen.flow(x_test,y_test,batch_size=batch_size):
    for i,prediction in enumerate(predict(data[0])):
        predicted.append(np.argmax(prediction))
        actual.append(data[1][i][0])
    batches += 1
    if batches >= len(x_test) / batch_size:
        break
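
`datagen.flow` yields batches indefinitely, which is why the loop needs an explicit `break`. The short calculation below shows how many batches cover the 10,000 test images once and how large the final, partial batch is.

```python
import math

num_samples = 10000   # size of the CIFAR-10 test set
batch_size = 256

full_batches, remainder = divmod(num_samples, batch_size)
num_batches = math.ceil(num_samples / batch_size)

print('Full batches:', full_batches)
print('Samples in the last, partial batch:', remainder)
print('Batches needed for one pass:', num_batches)
```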

In [None]:
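cm = confusion_matrix(y_pred=predicted,y_true=actual)
# CIFAR-10 has 10 classes, so cm is 10x10 (rows: true class, columns: predicted class).
# TP/FP/FN/TN are reported one-vs-rest for class 0; accuracy covers all classes.
TP = cm[0][0]
FP = cm[:, 0].sum() - TP
FN = cm[0, :].sum() - TP
TN = cm.sum() - TP - FP - FN
accuracy = np.trace(cm) / cm.sum()
display('TP : {}'.format(TP))
display('FP : {}'.format(FP))
display('TN : {}'.format(TN))
display('FN : {}'.format(FN))
display('Accuracy: {}%'.format(round(accuracy*100,2)))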
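
For a multi-class problem such as CIFAR-10, the confusion matrix has one row per true class and one column per predicted class. Overall accuracy is the sum of the diagonal divided by the total count, and per-class recall is each diagonal entry divided by its row sum. The sketch below works this out on a small 3-class matrix with made-up counts, using plain Python lists.

```python
def overall_accuracy(cm):
    """Share of samples on the diagonal (correct predictions)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def per_class_recall(cm):
    """For each true class, the share of its samples predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

# Hypothetical 3-class confusion matrix (rows: true, columns: predicted).
toy_cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]

print(round(overall_accuracy(toy_cm), 4))
print(per_class_recall(toy_cm))
```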

In [None]:
import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt

# Normalize each row by its total (np.float was removed in recent NumPy versions)
n_cm = cm.astype(float) / cm.sum(axis=1, keepdims=True)
sn.set(rc={'figure.figsize':(11.7,8.27)})
sn.set(font_scale=1.4)#for label size
sn.heatmap(n_cm, annot=True,annot_kws={"size": 10})# font size

## Cleaning up
To avoid incurring charges to your AWS account for the resources used in this tutorial, you need to delete the SageMaker endpoint:

In [None]:
sagemaker.Session().delete_endpoint(predictor.endpoint)