# SageMaker Python SDK

In this last chapter we'll have a look at the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk), an open-source Python library designed to simplify training and serving (hosting) ML models on SageMaker.

The SDK provides several high-level abstractions to achieve this; see the [overview](https://github.com/aws/sagemaker-python-sdk#sagemaker-python-sdk-overview).

However, to use the SDK's functionality, you have to structure your code around its abstractions. This might force you to rewrite parts of your already working network/algorithm!

## Refactor training script to use the SDK

Since the refactoring is quite big and involves many SDK-specific details, it is not possible to give the same level of explanation as in previous chapters. It will be up to you to read up on the SDK if there are parts you don't understand.

### Estimators

The SageMaker Estimator is the abstraction used to train and deploy a model.

We'll be using these links as inspiration:
- https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/tensorflow_abalone_age_predictor_using_keras
- https://cloud.google.com/blog/big-data/2017/12/new-in-tensorflow-14-converting-a-keras-model-to-a-tensorflow-estimator
- https://stackoverflow.com/questions/48295788/using-a-keras-model-inside-a-tf-estimator

In [None]:
import os
import glob
import pandas as pd
from PIL import Image

def read_tub(path):
    '''
    Read a Tub directory into memory
    
    A Tub contains records in json format, one file for each sample. With a default sample frequency of 20 Hz,
    a 5 minute drive session will contain roughly 6000 files.
    
    A record JSON object has the following properties (per default):
    - 'user/angle'      - wheel angle
    - 'user/throttle'   - speed
    - 'user/mode'       - drive mode (e.g. user or pilot)
    - 'cam/image_array' - relative path to image
    
    Returns a list of dicts, [ { 'record_id', 'angle', 'throttle', 'image', } ]
    '''

    def as_record(file):
        '''Parse a json file into a Pandas Series (vector) object'''
        return pd.read_json(file, typ='series')
    
    def is_valid(record):
        '''Only records with angle, throttle and image are valid'''
        return all(key in record for key in ('user/angle', 'user/throttle', 'cam/image_array'))
        
    def map_record(file, record):
        '''Map a Tub record to a dict'''
        # Force library to eager load the image and close the file pointer to prevent 'too many open files' error
        img = Image.open(os.path.join(path, record['cam/image_array']))
        img.load()
        # Strip directory and 'record_' from file name, and parse it to integer to get a good id
        record_id = int(os.path.splitext(os.path.basename(file))[0][len('record_'):])
        return {
            'record_id': record_id,
            'angle': record['user/angle'],
            'throttle': record['user/throttle'],
            'image': img
        }
    
    json_files = glob.glob(os.path.join(path, '*.json'))
    records = ((file, as_record(file)) for file in json_files)
    return list(map_record(file, record) for (file, record) in records if is_valid(record))
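
To make the record format more concrete, here is a minimal sketch that writes one Tub record to a temporary directory and parses it again using only the standard library. The key names come from the docstring above; the file name `record_42.json`, the sample values, and the helper names `record_id_from_path`, `load_record` and `demo` are made up for illustration and are not part of the Donkey library.

```python
import json
import os
import tempfile

# Keys a record must contain to be usable for training (taken from the docstring above)
REQUIRED_KEYS = ('user/angle', 'user/throttle', 'cam/image_array')


def record_id_from_path(file):
    '''Strip directory, extension and the 'record_' prefix, then parse the rest as an integer'''
    name = os.path.splitext(os.path.basename(file))[0]
    return int(name[len('record_'):])


def load_record(file):
    '''Parse a record file; return None if any required key is missing'''
    with open(file) as f:
        record = json.load(f)
    if not all(key in record for key in REQUIRED_KEYS):
        return None
    return record


def demo():
    '''Write one hypothetical record to a temporary Tub directory and read it back'''
    # Sample values are invented for this example
    sample = {
        'user/angle': 0.25,
        'user/throttle': 0.5,
        'user/mode': 'user',
        'cam/image_array': '42_cam-image_array_.jpg',
    }
    with tempfile.TemporaryDirectory() as tub:
        path = os.path.join(tub, 'record_42.json')
        with open(path, 'w') as f:
            json.dump(sample, f)
        return record_id_from_path(path), load_record(path)
```

Calling `demo()` returns the parsed id together with the record dict, which mirrors what `as_record`, `is_valid` and `map_record` do above, minus the image loading.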

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.python.keras.models import Model
from tensorflow.python.keras.layers import Convolution2D
from tensorflow.python.keras.layers import Input, Dropout, Flatten, Dense
from tensorflow.python.keras.estimator import model_to_estimator

INPUT_TENSOR_NAME = "images"
SIGNATURE_NAME = "serving_default"
LEARNING_RATE = 0.001


def model_fn(features, labels, mode, params):
    '''
    Model function for Estimator.
    '''
    
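    # 1. Create the model as before, but wrap the features tensor in a Keras Input layer
    #    so that img_in is defined when the Model is built below
    img_in = Input(tensor=features[INPUT_TENSOR_NAME], name='img_in')
    x = img_in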
    x = Convolution2D(24, (5,5), strides=(2,2), activation='relu')(x)
    x = Convolution2D(32, (5,5), strides=(2,2), activation='relu')(x)
    x = Convolution2D(64, (5,5), strides=(2,2), activation='relu')(x)
    x = Convolution2D(64, (3,3), strides=(2,2), activation='relu')(x)
    x = Convolution2D(64, (3,3), strides=(1,1), activation='relu')(x)

    x = Flatten(name='flattened')(x)
    x = Dense(100, activation='relu')(x)
    x = Dropout(.1)(x)
    x = Dense(50, activation='relu')(x)
    x = Dropout(.1)(x)

    angle_out = Dense(15, activation='softmax', name='angle_out')(x)
    throttle_out = Dense(1, activation='relu', name='throttle_out')(x)

    model = Model(inputs=[img_in], outputs=[angle_out, throttle_out])
    model.compile(optimizer='adam',
                  loss={'angle_out': 'categorical_crossentropy', 'throttle_out': 'mean_absolute_error'},
                  loss_weights={'angle_out': 0.9, 'throttle_out': .001})
    
    # 2. Create the TensorFlow Estimator from the Keras model.
    #    Untested: a tf.estimator model_fn is normally expected to return an EstimatorSpec, not an Estimator.
    return model_to_estimator(keras_model=model)


def train_input_fn(training_dir, params):
    # TODO: Need to read a Tub into correct format. Could use read_tub from intro, or Donkey library TubGroup???
    records = read_tub(training_dir)
    df = pd.DataFrame.from_records(records).set_index('record_id')
    
    # TODO: Where do we define the input format (nbr of channels):
    # img_in = Input(shape=(120, 160, 3), name='img_in')
    
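    # Stack the PIL images into one float32 array of shape (samples, height, width, channels)
    images = np.array([np.array(img) for img in df['image']], dtype=np.float32)
    # TODO: 'angle_out' is a 15-way softmax, so the angle still has to be binned/one-hot encoded
    labels = {'angle_out': df['angle'].values, 'throttle_out': df['throttle'].values}

    # Return numpy_input_fn
    return tf.estimator.inputs.numpy_input_fn(
        x={INPUT_TENSOR_NAME: images},
        y=labels,
        num_epochs=None,
        shuffle=True)()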

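def eval_input_fn(training_dir, params):
    # TODO: Copy/paste of train_input_fn; factor out the shared parts
    records = read_tub(training_dir)
    df = pd.DataFrame.from_records(records).set_index('record_id')
    images = np.array([np.array(img) for img in df['image']], dtype=np.float32)
    # TODO: the angle still has to be binned/one-hot encoded to match 'angle_out'
    labels = {'angle_out': df['angle'].values, 'throttle_out': df['throttle'].values}
    return tf.estimator.inputs.numpy_input_fn(
        x={INPUT_TENSOR_NAME: images},
        y=labels,
        shuffle=True)()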

def serving_input_fn(params):
    # TODO: Only finish if actually needed. We do not host the endpoint on SageMaker, only train the model
    return None
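
The `angle_out` layer above is a 15-unit softmax trained with `categorical_crossentropy`, so each angle label has to be a one-hot vector of length 15 rather than a single float. The following is a minimal sketch of such an encoding, assuming steering angles lie in the range [-1, 1] and are binned linearly. The helper names `angle_to_one_hot` and `one_hot_to_angle` are hypothetical, and the binning used in earlier chapters or in the Donkey library may differ.

```python
def angle_to_one_hot(angle, bins=15):
    '''Map an angle in [-1, 1] (assumed range) to a one-hot list with `bins` entries'''
    # Clamp so that out-of-range angles end up in the first or last bin
    angle = max(-1.0, min(1.0, angle))
    # Scale [-1, 1] linearly onto the bin indices 0 .. bins - 1
    index = int(round((angle + 1.0) / 2.0 * (bins - 1)))
    one_hot = [0.0] * bins
    one_hot[index] = 1.0
    return one_hot


def one_hot_to_angle(one_hot):
    '''Inverse mapping: take the most probable bin and return the angle at its center'''
    bins = len(one_hot)
    index = max(range(bins), key=lambda i: one_hot[i])
    return index / (bins - 1) * 2.0 - 1.0
```

With 15 bins, full left (-1) lands in bin 0, straight ahead (0) in bin 7, and full right (1) in bin 14. The inverse function can be applied to the softmax output at prediction time to get back an approximate steering angle.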