# re:Invent Chalk Talk - Building, Training, and Deploying Fast.ai Models Using Amazon SageMaker

## Background

This example application trains a fastai-based image classification model, a Convolutional Neural Network (CNN), to distinguish between **Heavy Metal** and **Sports** shirts.

## Setup

*This notebook was created and tested on an ml.p3.2xlarge notebook instance.*

Let's start by creating a SageMaker session and specifying:

* The **S3 bucket** and prefix that you want to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting.
* The **IAM role** ARN used to give training and hosting access to your data. See the documentation for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).

**IMPORTANT:** please make sure the IAM role associated with your SageMaker notebook instance has the following managed IAM policies attached:

* **arn:aws:iam::aws:policy/AmazonSageMakerFullAccess**
* **arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess**

We also need to ensure there are no AWS credentials set up on the instance. We will create them later so that we can train and deploy locally.

In [None]:
%matplotlib inline
import os
import random
from io import BytesIO
import subprocess
from glob import glob

# Image is imported from PIL below; importing it here as well would be shadowed
from IPython.display import display, HTML

import matplotlib.pyplot as plt

from PIL import Image
import requests
import boto3

import sagemaker
from sagemaker.analytics import TrainingJobAnalytics
from sagemaker.local import local_session
from sagemaker.pytorch import PyTorch
from sagemaker.predictor import RealTimePredictor, json_deserializer
from sagemaker.pytorch import PyTorchModel
from sagemaker.utils import name_from_image

In [None]:
def display_images(images, header=None, width="100%"):
    # an integer width is interpreted as pixels
    if isinstance(width, int):
        width = "{}px".format(width)
    html = ["<table style='width:{}'><tr>".format(width)]
    if header is not None:
        html += ["<th>{}</th>".format(h) for h in header] + ["</tr><tr>"]

    for image in images:
        html.append("<td><img src='{}' /></td>".format(image))
    html.append("</tr></table>")
    display(HTML(''.join(html)))

In [None]:
! if [ -e ~/.aws/credentials ]; then rm ~/.aws/credentials; fi

In [None]:
sagemaker_session = sagemaker.Session()

bucket_name = sagemaker_session.default_bucket()
prefix = 'sagemaker/DEMO-shirts-classification'

role = sagemaker.get_execution_role()

We also need Docker images that are specific to running fastai-based models on SageMaker.

The source code and Dockerfile can be found in the GitHub project: https://github.com/aws-samples/amazon-sagemaker-container-with-fastai.

The cell below downloads the Dockerfile, builds a CPU and a GPU image on top of the SageMaker PyTorch image, and pushes both to ECR so that they can be used by SageMaker.

In [None]:
%%bash

# get the Dockerfile from GitHub
wget https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-container-with-fastai/master/Dockerfile

IMAGE="sagemaker-fastai"

# parameters
FASTAI_VERSION="1.0"
PY_VERSION="py36"

# Get the account number associated with the current IAM credentials
account=$(aws sts get-caller-identity --query Account --output text)

if [ $? -ne 0 ]
then
    exit 255
fi

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${IMAGE}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${IMAGE}" > /dev/null
fi

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)

# Get the login command from ECR in order to pull down the SageMaker PyTorch image
$(aws ecr get-login --registry-ids 520713654638 --region ${region} --no-include-email)

# loop for each architecture (cpu & gpu)
for arch in gpu cpu
do  
    echo "Building image with arch=${arch}, region=${region}"
    TAG="${FASTAI_VERSION}-${arch}-${PY_VERSION}"
    FULLNAME="${account}.dkr.ecr.${region}.amazonaws.com/${IMAGE}:${TAG}"
    docker build -t ${IMAGE}:${TAG} --build-arg ARCH="$arch"  --build-arg REGION="${region}"  .
    docker tag ${IMAGE}:${TAG} ${FULLNAME}
    docker push ${FULLNAME}
done

## Prepare Data

### Download training data to notebook instance

We will be utilizing a custom data set with a mixture of pictures of heavy metal t-shirts and sport shirts.

In [None]:
%%bash
if [ ! -d data/shirts ]; then
    mkdir -p data/shirts
    wget -q https://s3-eu-west-1.amazonaws.com/sagemaker-934676248949-eu-west-1/data/shirts_imgs.tar.gz
    tar zxf shirts_imgs.tar.gz -C data/shirts
    rm shirts_imgs.tar.gz
fi

In [None]:
DATA_PATH=f'{os.getcwd()}/data/shirts'

In [None]:
%ls {DATA_PATH}

### View sample images

Let's look at some of the images in the folders.

In [None]:
metal_img = random.choice(glob('data/shirts/metal/*.jpg'))
sport_img = random.choice(glob('data/shirts/sport/*.jpg'))

display_images([metal_img, sport_img],
       header=['Metal', 'Sport'], width="60%")

### Upload training data to S3

We are going to use the `sagemaker.Session.upload_data` function to upload our dataset to an S3 location. The return value, stored in `data_location`, identifies that location; we will use it later when we start the training job. If the images are already in S3, the upload is skipped.

In [None]:
s3 = boto3.client('s3')
key = f'{prefix}/metal/'
response = s3.list_objects_v2(
        Bucket=bucket_name,
        Prefix=key,
)

if response['KeyCount'] > 0:
    print("Images exist in S3!")
    data_location=f's3://{bucket_name}/{prefix}'
else:
    print("Training images not uploaded to S3. Uploading now")
    data_location = sagemaker_session.upload_data(path=DATA_PATH, bucket=bucket_name, key_prefix=prefix)

print(f'training images location: {data_location}')
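
The existence check above looks under `{prefix}/metal/` because, as far as we understand `upload_data`, a directory upload keeps each file's path relative to the uploaded directory and appends it to the key prefix. The following is a minimal sketch of that mapping under this assumption, not the SDK's implementation; `expected_s3_keys` is a hypothetical helper introduced here for illustration, and the temporary directory stands in for `data/shirts`.

```python
import os
import tempfile


def expected_s3_keys(local_dir, key_prefix):
    """Return the S3 keys a directory upload is assumed to produce.

    Assumption: every file keeps its path relative to local_dir,
    joined to key_prefix with forward slashes.
    """
    keys = []
    for dirpath, _dirnames, filenames in os.walk(local_dir):
        for name in filenames:
            rel_path = os.path.relpath(os.path.join(dirpath, name), start=local_dir)
            keys.append(f"{key_prefix}/{rel_path.replace(os.sep, '/')}")
    return sorted(keys)


# Build a tiny stand-in for data/shirts and list the keys it would map to.
with tempfile.TemporaryDirectory() as tmp:
    for label in ("metal", "sport"):
        os.makedirs(os.path.join(tmp, label))
        open(os.path.join(tmp, label, "example.jpg"), "wb").close()
    demo_keys = expected_s3_keys(tmp, "sagemaker/DEMO-shirts-classification")

print(demo_keys)
```

Under this assumption, the class folders `metal` and `sport` end up directly beneath the prefix, which is why checking a single class folder is enough to tell whether the upload has already happened.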

## Train locally

### Training script

We need to provide a training script that can run on the SageMaker platform. The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:

* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to. These artifacts are uploaded to S3 for model hosting.
* `SM_OUTPUT_DATA_DIR`: A string representing the filesystem path to write output artifacts to. Output artifacts may include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed and uploaded to S3 to the same S3 prefix as the model artifacts.

Supposing one input channel, 'training', was used in the call to the PyTorch estimator's `fit()` method, the following will be set, following the format `SM_CHANNEL_[channel_name]`:

* `SM_CHANNEL_TRAINING`: A string representing the path to the directory containing data in the 'training' channel.

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance. For example, the script run by this notebook:

In [None]:
!pygmentize 'src/shirts/train.py'

For more information about training environment variables, please visit [SageMaker Containers](https://github.com/aws/sagemaker-containers).
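
As a rough illustration of how the environment variables and hyperparameters described above can come together, here is a minimal argument-parsing sketch. It is not the notebook's `src/shirts/train.py`: the `parse_args` helper, the `--model-dir`, `--output-data-dir`, and `--data-dir` flag names, and the fallback paths are assumptions made for this example. Only `epochs` and `batch-size` are taken from the hyperparameters used in this notebook.

```python
import argparse
import os


def parse_args(argv=None, environ=None):
    """Parse hyperparameters and SageMaker locations (illustrative only)."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as command-line flags, e.g. --epochs 4 --batch-size 64.
    parser.add_argument("--epochs", type=int, default=4)
    parser.add_argument("--batch-size", type=int, default=64)

    # Locations default to the SageMaker environment variables; the literal
    # fallback paths are assumptions so the sketch also runs outside a container.
    parser.add_argument("--model-dir",
                        default=environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--output-data-dir",
                        default=environ.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data"))
    parser.add_argument("--data-dir",
                        default=environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"))

    # parse_known_args ignores any extra flags instead of failing on them.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate a run with one overridden hyperparameter and a custom model directory.
args = parse_args(["--epochs", "6"], environ={"SM_MODEL_DIR": "/tmp/model"})
print(args.epochs, args.batch_size, args.model_dir, args.data_dir)
```

Passing `argv` and `environ` explicitly keeps the function easy to exercise in a notebook cell; inside a training container both would normally be left at their defaults.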

In the current example we also need to provide the source directory, since the training script imports data and model classes from other modules.

In [None]:
!ls src/shirts

### Create ~/.aws/credentials file (to be removed when local mode is supported)

*TODO: this is a temp fix to a problem with running training in local mode. When pull request [here](https://github.com/aws/sagemaker-python-sdk/pull/499) is merged then it can be removed.*

A new IAM user needs to be created; otherwise the PyTorch local estimator will not work. Create a new IAM user with access keys and attach the managed policy named `arn:aws:iam::aws:policy/AmazonSageMakerFullAccess`.

There is a helper script in `utils/create_save_iam_credentials.sh` that you should run outside this notebook (e.g. on your laptop). It creates the IAM user and access keys and saves them in AWS Secrets Manager.

In [None]:
!pygmentize utils/create_save_iam_credentials.sh

Make sure you modify the IAM role associated with this SageMaker notebook instance, adding permission to read the AWS Secrets Manager entries. The following example AWS CLI command, run on your local laptop, attaches the correct policy to your SageMaker notebook instance IAM role:

```
aws iam put-role-policy --role-name "<role-name>" --policy-name "secrets" --policy-document "{ \"Version\": \"2012-10-17\", \"Statement\": [ { \"Effect\": \"Allow\", \"Action\": \"secretsmanager:GetSecretValue\", \"Resource\": \"arn:aws:secretsmanager:*:*:secret:SageMakerNb*\" } ] }" 
```

Remember to replace the text `<role-name>` with the name of your role. You can obtain it by running the command in the cell below.

In [None]:
# gets the name of the IAM role attached to this notebook instance
print(role.rsplit('/', 1)[-1])
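
The inline policy in the CLI command above is hard to read because of the shell escaping. As a sketch of an alternative, the same document can be built as a Python dictionary and serialized with `json.dumps`. The `secrets_read_policy` helper is a hypothetical name introduced for this example; the statement itself mirrors the one in the command above.

```python
import json


def secrets_read_policy(secret_prefix="SageMakerNb"):
    """Build the policy that allows reading secrets whose names start with secret_prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                "Resource": f"arn:aws:secretsmanager:*:*:secret:{secret_prefix}*",
            }
        ],
    }


# Serialize once; no manual quote escaping is needed.
policy_document = json.dumps(secrets_read_policy(), indent=2)
print(policy_document)
```

One way to use the result is to save it to a file such as `policy.json` (a file name chosen for this example) and pass it to the CLI with `--policy-document file://policy.json`.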

In [None]:
%%bash

if [ ! -e ~/.aws/credentials ]; then
    echo "Writing new credentials file"
    accesskey=$(aws secretsmanager get-secret-value --secret-id "SageMakerNbAccessKey" --query 'SecretString' --output text)
    secretkey=$(aws secretsmanager get-secret-value --secret-id "SageMakerNbSecretKey" --query 'SecretString' --output text)

    cat > ~/.aws/credentials <<EOF
[default]
aws_access_key_id=${accesskey}
aws_secret_access_key=${secretkey}
EOF

else
    echo "Credentials file already exists"
fi

### Configure docker

Make sure Docker is configured for local training.

In [None]:
%%bash
pushd utils
bash setup.sh
popd

### Fit estimator locally

First we need to set the name of the Docker image we want to use for training. If we are training on a notebook instance with a GPU (e.g. `ml.p2.xlarge`) we can use the GPU based image so that training locally runs faster.

In [None]:
region = boto3.session.Session().region_name

client = boto3.client('sts')
account = client.get_caller_identity()['Account']

instance_type = 'local'

base_image_name = 'sagemaker-fastai'

image_name = f'{account}.dkr.ecr.{region}.amazonaws.com/{base_image_name}:1.0-cpu-py36'

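try:
    has_gpu = subprocess.call('nvidia-smi') == 0
except FileNotFoundError:
    # nvidia-smi is not installed: treat this as a CPU-only instance
    has_gpu = False

if has_gpu:
    ## Set type to GPU if one is present
    instance_type = 'local_gpu'
    image_name = f'{account}.dkr.ecr.{region}.amazonaws.com/{base_image_name}:1.0-gpu-py36'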

print("Instance type = " + instance_type)
print(f'Using ECR image for local training: {image_name}')
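
The image name follows the same pattern every time it is built in this notebook: account, region, repository, and a tag of the form `<fastai version>-<arch>-<python version>`. A small helper can make that pattern explicit. This is only a sketch; `ecr_image_uri` is a hypothetical function, and the account number in the example call is a placeholder.

```python
def ecr_image_uri(account, region, repository, arch,
                  fastai_version="1.0", py_version="py36"):
    """Compose the ECR image URI for the cpu or gpu variant of the image."""
    if arch not in ("cpu", "gpu"):
        raise ValueError(f"arch must be 'cpu' or 'gpu', got {arch!r}")
    tag = f"{fastai_version}-{arch}-{py_version}"
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Placeholder account number, for illustration only.
example_uri = ecr_image_uri("123456789012", "us-west-2", "sagemaker-fastai", "cpu")
print(example_uri)
```

Validating `arch` up front turns a mistyped tag into an immediate error rather than a failed image pull later on.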

As we want to train locally, we need to set up a specific local SageMaker session.

This code will depend on this PR being approved: https://github.com/aws/sagemaker-python-sdk/pull/499

In [None]:
#from sagemaker.local import local_session
# Create local session
#sagemaker_session = local_session.LocalSession()
#sagemaker_session.config = {'local': {'local_code': True }}

To represent our training, we use the Estimator class, which needs to be configured in five steps. 
1. IAM role - our AWS execution role (this is ignored for local training and uses the local AWS credentials)
2. train_instance_count - number of instances to use for training.
3. train_instance_type - type of instance to use for training. For training locally, we specify `local` or `local_gpu`.
4. image_name - our custom Fast.ai Docker image we created.
5. hyperparameters - hyperparameters we want to pass.

In [None]:
estimator = PyTorch(entry_point='train.py',
                    source_dir='src/shirts',
                    role=role,
                    framework_version='1',                    
                    train_instance_count=1,
                    train_instance_type=instance_type,
                    #sagemaker_session=sagemaker_session,
                    image_name=image_name,
                    hyperparameters={
                        'epochs': 4, 
                        'batch-size': 64
                    })

Now that the rest of our estimator is configured, we can call `fit()` with the path to our local Shirts dataset prefixed with `file://`. This invokes our fast.ai container with 'train' and passes in our hyperparameters and other metadata as json files in `/opt/ml/input/config` within the container to our program entry point defined in the Dockerfile.

After our training has succeeded, our training algorithm outputs our trained model within the `/opt/ml/model` directory, which is used to handle predictions.

It will run the training job locally using `docker compose`.

In [None]:
estimator.fit(f'file://{DATA_PATH}', job_name=name_from_image('fastai-shirts'))
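
To make the role of the JSON files in `/opt/ml/input/config` more concrete, the sketch below reads a `hyperparameters.json` file the way a container entry point might. It assumes that hyperparameter values arrive as strings and need converting; the `load_hyperparameters` helper and the sample file contents are hypothetical, and a temporary directory stands in for the in-container path.

```python
import json
import os
import tempfile


def load_hyperparameters(config_dir):
    """Read hyperparameters.json and decode values that are JSON-encoded strings."""
    path = os.path.join(config_dir, "hyperparameters.json")
    with open(path) as f:
        raw = json.load(f)

    parsed = {}
    for name, value in raw.items():
        try:
            # "4" becomes 4; a plain word such as "adam" is kept as a string
            parsed[name] = json.loads(value)
        except (TypeError, ValueError):
            parsed[name] = value
    return parsed


# Write a sample config file (hypothetical contents) and read it back.
with tempfile.TemporaryDirectory() as config_dir:
    with open(os.path.join(config_dir, "hyperparameters.json"), "w") as f:
        json.dump({"epochs": "4", "batch-size": "64"}, f)
    hyperparameters = load_hyperparameters(config_dir)

print(hyperparameters)
```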

Once the model is trained locally, we are ready to deploy it locally and check that it makes correct predictions.

## Host locally

### Hosting script

We are going to provide a custom implementation of the `model_fn`, `input_fn`, `output_fn` and `predict_fn` hosting functions in a separate file:

In [None]:
!pygmentize 'src/shirts/serve.py'

You can also put your training and hosting code in the same file but you would need to add a main guard (`if __name__=='__main__':`) for the training code, so that the container does not inadvertently run it at the wrong point in execution during hosting.
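
The sketch below illustrates how the four hosting functions and a main guard can coexist in a single file. It is not the notebook's `serve.py`: the model is a stand-in that labels every image `metal`, the response fields are invented for the example, and `train()` is a placeholder. Only the four function names come from the text above.

```python
import json


def model_fn(model_dir):
    """Load the model from model_dir. Here: a stand-in that ignores model_dir."""
    def fake_model(image_bytes):
        # Hypothetical response shape, for illustration only
        return {"class": "metal", "num_bytes": len(image_bytes)}
    return fake_model


def input_fn(request_body, request_content_type):
    """Turn the raw request into the object that predict_fn expects."""
    if request_content_type != "image/jpeg":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return request_body


def predict_fn(input_object, model):
    """Run the model on the deserialized input."""
    return model(input_object)


def output_fn(prediction, accept):
    """Serialize the prediction; this sketch always returns JSON."""
    return json.dumps(prediction)


def train():
    """Placeholder for the training code."""
    return "training finished"


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when it is imported
    print(train())
```

When the file is imported for hosting, only the function definitions are evaluated, so the training code stays dormant. A request then flows through `input_fn`, `predict_fn`, and `output_fn` in that order, using the object returned by `model_fn`.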

### Create local model

The `PyTorch` model uses an npy serializer and deserializer by default. For this example, since we have a custom implementation of all the hosting functions and plan on using JSON instead, we need a predictor that can serialize and deserialize JSON.

In [None]:
class ImagePredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(ImagePredictor, self).__init__(endpoint_name, sagemaker_session=sagemaker_session, serializer=None, 
                                            deserializer=json_deserializer, content_type='image/jpeg')

Since the hosting functions are implemented outside of the training script, we can't just use the estimator object to deploy the model. Instead, we need to create a `PyTorchModel` object, using the latest training job to get the location of the trained model data. Besides the model data location, we also need to configure `PyTorchModel` with the entry point script, the source directory (because our `serve.py` script requires model and data classes from the source directory), and an IAM role.

In [None]:
# always run inference with the cpu based Docker image
image_name = f'{account}.dkr.ecr.{region}.amazonaws.com/{base_image_name}:1.0-cpu-py36'
print(f'Using ECR image for local hosting: {image_name}')

In [None]:
model = PyTorchModel(name=name_from_image('fastai-shirts'),
                     model_data=estimator.model_data,
                     role=role,
                     framework_version='1',
                     entry_point='serve.py',
                     source_dir='src/shirts',
                     image=image_name,
                     predictor_cls=ImagePredictor)

### Deploy model locally

We can now call `deploy()` with an instance count and instance type, which are `1` and `local`. This invokes our fast.ai container with `'serve'`, which sets up our container to handle prediction requests as defined [here](https://github.com/aws/sagemaker-pytorch-container/blob/master/src/sagemaker_pytorch_container/serving.py#L103). What is returned is a predictor, which is used to make inferences against our trained model.

After our prediction, we can delete our endpoint.

We recommend training and testing your algorithm locally first, as it provides quicker iterations and easier debugging.

In [None]:
predictor = model.deploy(initial_instance_count=1, instance_type='local')

### Infer locally

Now we are ready to call our locally deployed endpoint to check that it works.

In [None]:
# Motorhead T-Shirt
# url = 'https://images.backstreetmerch.com/images/products/bands/clothing/mthd/bsi_mthd281.jpg'
# Judas Priest T-Shirt
#url = 'https://thumbs2.ebaystatic.com/d/l225/m/m7Lc1qRuFN3oFIlQla5V0IA.jpg'
# Iron Maiden T-Shirt
url = 'https://www.ironmaidencollector.com/assets/pages/ab9ed-20180815_121042.jpg'
# All Blacks rugby jersey
#url = 'https://images.sportsdirect.com/images/products/38153703_l.jpg'
# Australia Rugby T-Shirt
#url = 'https://www.lovell-rugby.co.uk/products/products_580x387/40378.jpg'
# Chicago Bulls top
#url = 'https://i.ebayimg.com/images/g/qc0AAOSwBahVN~qm/s-l300.jpg'
# Masters Golf Shirt
#url = 'https://s-media-cache-ak0.pinimg.com/originals/29/6a/15/296a15200e7dd3ed08e12d9052ea4f97.jpg'
img_bytes = requests.get(url).content
img = Image.open(BytesIO(img_bytes))
img

In [None]:
response = predictor.predict(img_bytes)
response

### Cleanup Endpoint

In [None]:
predictor.delete_endpoint()

## Train on SageMaker

Now that we have tested the training and hosting locally, we are ready to train our model using the Amazon SageMaker training service.

Training a model on SageMaker with the Python SDK is done in a way that is similar to the way we trained it locally. This is done by changing our train_instance_type from `local` to one of our [supported EC2 instance types](https://aws.amazon.com/sagemaker/pricing/instance-types/).

In addition, we must now specify the ECR image URL, which we just pushed above.

Finally, our local training dataset has to be in Amazon S3 and the S3 URL to our dataset is passed into the `fit()` call.

Let's first fetch our ECR image url that corresponds to the image we just built and pushed.

In [None]:
image_name = f'{account}.dkr.ecr.{region}.amazonaws.com/{base_image_name}:1.0-gpu-py36'
print(f'Using ECR image for SageMaker training: {image_name}')

Now we will create a `PyTorch` estimator object using the SageMaker SDK. The input parameters are almost exactly the same as when we trained locally, except that the instance type is now a specific instance type offered by the SageMaker training service. In this example we will use the `ml.p3.2xlarge` instance type. We also add metric definitions so that the loss and accuracy values logged by the training script are captured as training metrics.

In [None]:
estimator = PyTorch(entry_point='train.py',
                    source_dir='src/shirts',
                    role=role,
                    train_instance_count=1,
                    train_instance_type='ml.p3.2xlarge',
                    image_name=image_name,
                    framework_version='1',
                    hyperparameters={
                        'epochs': 6, 
                        'batch-size': 64
                    },
                    metric_definitions=[
                        # raw strings keep \S from being read as a string escape sequence
                        {'Name': 'valid:loss',     'Regex': r'#quality_metric: host=\S+, epoch=\S+, valid_loss=(\S+)'},
                        {'Name': 'train:loss',     'Regex': r'#quality_metric: host=\S+, epoch=\S+, train_loss=(\S+)'},
                        {'Name': 'valid:accuracy', 'Regex': r'#quality_metric: host=\S+, epoch=\S+, accuracy=(\S+)'}
                    ])
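
To see what the metric definitions above extract, the sketch below applies the same three patterns (written as raw strings) to a log line. The sample line is hypothetical and only shaped to fit the patterns; the actual output of `train.py` may differ. `extract_metrics` is a helper introduced for this example.

```python
import re

METRIC_DEFINITIONS = [
    {"Name": "valid:loss", "Regex": r"#quality_metric: host=\S+, epoch=\S+, valid_loss=(\S+)"},
    {"Name": "train:loss", "Regex": r"#quality_metric: host=\S+, epoch=\S+, train_loss=(\S+)"},
    {"Name": "valid:accuracy", "Regex": r"#quality_metric: host=\S+, epoch=\S+, accuracy=(\S+)"},
]


def extract_metrics(log_line):
    """Return {metric name: value} for every definition that matches the line."""
    found = {}
    for definition in METRIC_DEFINITIONS:
        match = re.search(definition["Regex"], log_line)
        if match:
            # The single capture group holds the metric value
            found[definition["Name"]] = float(match.group(1))
    return found


# Hypothetical log line, shaped to fit the valid_loss pattern
sample_line = "#quality_metric: host=algo-1, epoch=3, valid_loss=0.2154"
metrics = extract_metrics(sample_line)
print(metrics)
```

Each pattern has exactly one capture group, and that group is what ends up as the metric value. A line that matches none of the patterns simply contributes nothing.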

In [None]:
data_location=f's3://{bucket_name}/{prefix}'
print(f'Training data location: {data_location}')

In [None]:
training_job_name=name_from_image('fastai-shirts')
estimator.fit(data_location, job_name=training_job_name)

Now we can plot the loss and accuracy metrics on graphs, pulling the data from CloudWatch.

### Graph training metrics from SageMaker

In [None]:
# get the dataframe of training metrics
df = TrainingJobAnalytics(training_job_name=training_job_name,metric_names=['train:loss', 'valid:loss', 'valid:accuracy']).dataframe()

# plot the dataframe with matplotlib
fig, (ax1, ax2) = plt.subplots(1,2, sharey=False)
ax1.set_title('Loss')
ax1.set_ylabel('loss')
# group by a plain column name so that key is a string usable as the plot label
for key, grp in df.loc[df['metric_name'] != 'valid:accuracy'].groupby('metric_name'):
    ax = grp.plot(ax=ax1, kind='line', x='timestamp', y='value', label=key)

ax2.set_title('Accuracy')
ax2.set_ylabel('accuracy')
df.loc[df['metric_name'] == 'valid:accuracy'].plot(ax=ax2, kind='line', x='timestamp', y='value', label='accuracy')
plt.tight_layout()

In [None]:
df

## Host model with SageMaker

### Import model into SageMaker

Since the hosting functions are implemented outside of the training script, we can't just use the estimator object to deploy the model. Instead, we need to create a `PyTorchModel` object, using the latest training job to get the S3 location of the trained model data. Besides the model data location in S3, we also need to configure `PyTorchModel` with the entry point script, the source directory (because our `serve.py` script requires model and data classes from the source directory), and an IAM role.

In [None]:
# always run inference with the cpu based Docker image
image_name = f'{account}.dkr.ecr.{region}.amazonaws.com/{base_image_name}:1.0-cpu-py36'
print(f'Using ECR image for SageMaker hosting: {image_name}')

In [None]:
model = PyTorchModel(name=name_from_image('fastai-shirts'),
                     model_data=estimator.model_data,
                     role=role,
                     framework_version='1',
                     entry_point='serve.py',
                     source_dir='src/shirts',
                     image=image_name,
                     predictor_cls=ImagePredictor)

### Deploy model to SageMaker

Now we will take the PyTorch-specific model created above and call its `deploy()` method with a different instance type, so that it is deployed to the SageMaker hosting service. The instance type does not need to be a GPU instance; a CPU instance is perfectly fine for model inference.

In [None]:
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')

### Call SageMaker endpoint

Now we are ready to call the SageMaker endpoint to see if it is making correct inferences against some test data.

In [None]:
# use an existing endpoint
#endpoint_name = 'fastai-shirts-2018-11-27-20-21-19-215'
#predictor = ImagePredictor(endpoint_name=endpoint_name, sagemaker_session=sagemaker_session)

In [None]:
# Motorhead T-Shirt
#url = 'https://images.backstreetmerch.com/images/products/bands/clothing/mthd/bsi_mthd281.jpg'
# Judas Priest T-Shirt
#url = 'https://thumbs2.ebaystatic.com/d/l225/m/m7Lc1qRuFN3oFIlQla5V0IA.jpg'
# Iron Maiden T-Shirt
#url = 'https://www.ironmaidencollector.com/assets/pages/ab9ed-20180815_121042.jpg'
# All Blacks rugby jersey
#url = 'https://images.sportsdirect.com/images/products/38153703_l.jpg'
# Australia Rugby T-Shirt
#url = 'https://www.lovell-rugby.co.uk/products/products_580x387/40378.jpg'
# Chicago Bulls top
url = 'https://i.ebayimg.com/images/g/qc0AAOSwBahVN~qm/s-l300.jpg'
# Masters Golf Shirt
#url = 'https://s-media-cache-ak0.pinimg.com/originals/29/6a/15/296a15200e7dd3ed08e12d9052ea4f97.jpg'
img_bytes = requests.get(url).content
img = Image.open(BytesIO(img_bytes))
img

In [None]:
response = predictor.predict(img_bytes)
response

### Cleanup endpoint

When you're done with the endpoint, you should clean it up.

All of the training jobs, models and endpoints we created can be viewed through the SageMaker console of your AWS account.

In [None]:
predictor.delete_endpoint()
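
Because an endpoint keeps running until it is deleted, it can help to tie the cleanup to a `with` block so that it happens even if a prediction raises an error. The following is a minimal sketch of that pattern, not part of the SageMaker SDK: `temporary_endpoint` is a hypothetical helper, and `FakePredictor` is a stand-in so the example runs without AWS. Any object with a `delete_endpoint()` method, such as the predictor returned by `deploy()`, could be passed in its place.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(predictor):
    """Yield the predictor and always delete its endpoint afterwards."""
    try:
        yield predictor
    finally:
        predictor.delete_endpoint()


class FakePredictor:
    """Stand-in for a real predictor, used only to demonstrate the pattern."""

    def __init__(self):
        self.deleted = False

    def predict(self, data):
        return {"num_bytes": len(data)}

    def delete_endpoint(self):
        self.deleted = True


fake = FakePredictor()
with temporary_endpoint(fake) as endpoint:
    response = endpoint.predict(b"abc")

print(response, fake.deleted)
```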

## Reference

- [How Amazon SageMaker interacts with your Docker container for training](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)
- [How Amazon SageMaker interacts with your Docker container for inference](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html)
- [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
- [Dockerfile](https://docs.docker.com/engine/reference/builder/)
- [PyTorch extending container example](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/pytorch_extending_our_containers/pytorch_extending_our_containers.ipynb)
- [scikit-bring-your-own example](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb)
- [SageMaker fast.ai container](https://github.com/aws-samples/amazon-sagemaker-container-with-fastai)
- [SageMaker fast.ai example](https://github.com/mattmcclean/sagemaker-fastai-example)
- [SageMaker PyTorch container](https://github.com/aws/sagemaker-pytorch-container)