# Building a custom inference container
1. [Part 1: Packaging your code for inference with Amazon SageMaker](#Part-1:-Packaging-your-code-for-inference-with-Amazon-SageMaker)
    1. [How Amazon SageMaker runs your Docker container during hosting](#How-Amazon-SageMaker-runs-your-Docker-container-during-hosting)
    1. [The parts of the sample inference container](#The-parts-of-the-sample-inference-container)
1. [Part 2: Building and registering the container](#Part-2:-Building-and-registering-the-container)
1. [Part 3: Use the container for inference in Amazon SageMaker](#Part-3:-Use-the-container-for-inference-in-Amazon-SageMaker)
    1. [Import model into hosting](#Import-model-into-hosting)
    1. [Create endpoint configuration](#Create-endpoint-configuration)
    1. [Create endpoint](#Create-endpoint)
    1. [Invoke model](#Invoke-model)
1. [(Optional) cleanup](#(Optional)-cleanup)

## Part 1: Packaging your code for inference with Amazon SageMaker

### How Amazon SageMaker runs your Docker container during hosting

Because you can run the same image in training or hosting, Amazon SageMaker runs your container with the argument `train` or `serve`. How your container processes this argument depends on the container. All SageMaker framework containers already cover this requirement and will trigger your defined training algorithm and inference code.

* If you specify a program as an `ENTRYPOINT` in the Dockerfile, that program will be run at startup and its first argument will be `train` or `serve`. The program can then look at that argument and decide what to do.
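
A minimal sketch of such an entrypoint program, assuming it is written in Python. The functions `run_training` and `run_server` are hypothetical placeholders and are not part of the sample container.

```python
def run_training():
    # Hypothetical placeholder for the training logic.
    return "training"


def run_server():
    # Hypothetical placeholder for starting the inference server.
    return "serving"


def dispatch(argv):
    """Choose what to run based on the first argument SageMaker passes in."""
    if len(argv) < 2:
        raise SystemExit("usage: entrypoint {train|serve}")
    mode = argv[1]
    if mode == "train":
        return run_training()
    if mode == "serve":
        return run_server()
    raise SystemExit("unknown mode: {}".format(mode))
```

In a real entrypoint you would call `dispatch(sys.argv)` at the bottom of the script, so that `docker run <image> serve` ends up in `run_server`.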

#### Running your container during hosting

Hosting works very differently from training because a hosted container responds to inference requests that arrive over HTTP.

Amazon SageMaker uses two URLs in the container:

* `/ping` receives `GET` requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.
* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response is up to the algorithm. If the client supplied `ContentType` and `Accept` headers, these are passed in as well. 
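
To make the contract concrete, here is a minimal sketch of the two routes. It is written as a plain WSGI callable using only the standard library so that it runs without extra packages; it is not the Flask-based `predictor.py` that the sample container uses. The echo "model", the `application/json`-only check, and the `call` helper are assumptions made for illustration.

```python
import io
import json


def app(environ, start_response):
    """Plain WSGI callable implementing the two routes SageMaker calls."""
    method = environ.get("REQUEST_METHOD", "GET")
    path = environ.get("PATH_INFO", "/")

    if path == "/ping" and method == "GET":
        # Health check: a 200 response means the container accepts requests.
        start_response("200 OK", [("Content-Type", "application/json")])
        return [b""]

    if path == "/invocations" and method == "POST":
        if environ.get("CONTENT_TYPE") != "application/json":
            start_response("415 Unsupported Media Type", [("Content-Type", "text/plain")])
            return [b"This sketch only accepts application/json"]
        length = int(environ.get("CONTENT_LENGTH") or 0)
        inputs = json.loads(environ["wsgi.input"].read(length))
        # Placeholder "model": echo the inputs back instead of predicting.
        result = json.dumps({"predictions": inputs}).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json")])
        return [result]

    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found"]


def call(method, path, body=b"", content_type=""):
    """Invoke the app in-process, the way a WSGI server would."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_TYPE": content_type,
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    }
    payload = b"".join(app(environ, start_response))
    return captured["status"], payload
```

For example, `call("GET", "/ping")` returns a `200 OK` status, and `call("POST", "/invocations", b"[3, 42, 500]", "application/json")` returns the echoed inputs as JSON.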

If you are using the same container image for both training and serving the model, it will have the model files in the same place that they were written to during training:

    /opt/ml
    `-- model
        `-- <model files>
        
Alternatively, if you are using separate containers for training and inference, the model files are copied into `/opt/ml/model` from the S3 location that the training job wrote them to when the inference container starts.
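
In either case the inference code finds the artifacts under `/opt/ml/model`. A minimal sketch for checking what is there at startup; `list_model_files` is a hypothetical helper, and how you deserialize the files depends on how your training code saved them.

```python
import os


def list_model_files(model_dir="/opt/ml/model"):
    """Return the relative paths of all files under the model directory."""
    found = []
    for root, _dirs, files in os.walk(model_dir):
        for name in files:
            full_path = os.path.join(root, name)
            found.append(os.path.relpath(full_path, model_dir))
    return sorted(found)
```

Logging this list when the server starts is a cheap way to confirm that the artifacts arrived where the inference code expects them.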

### The parts of the sample inference container

To build a production-grade inference server into the container, we use the following stack, which keeps the implementer's job simple:

![The Inference Stack](stack.png)

1. __[nginx][nginx]__ is a lightweight layer that handles the incoming HTTP requests and manages the I/O in and out of the container efficiently.
2. __[gunicorn][gunicorn]__ is a WSGI pre-forking worker server that runs multiple copies of your application and load balances between them.
3. __[flask][flask]__ is a simple web framework used in the inference app that you write. It lets you respond to calls on the `/ping` and `/invocations` endpoints without having to write much code.

The `inference_container` directory has all the components you need to extend the inference logic of the SageMaker scikit-learn container:

    .
    |-- Dockerfile
    `-- light_fm
        |-- nginx.conf
        |-- predictor.py
        |-- serve
        `-- wsgi.py

Let's discuss each of these in turn:

* __`Dockerfile`__: The _Dockerfile_ describes how the image is built and what it contains. It is a recipe for your container and gives you tremendous flexibility to construct almost any execution environment you can imagine. Here, we use the Dockerfile to describe a fairly standard Python science stack and the simple scripts that we're going to add to it. See the [Dockerfile reference][dockerfile] for what's possible here.
* __`serve`__: The wrapper that starts the inference server. In most cases, you can use this file as-is.
* __`wsgi.py`__: The startup shell for the individual server workers. You only need to change this file if you move or rename `predictor.py`.
* __`predictor.py`__: The algorithm-specific inference server. This is the file that you modify with your own algorithm's code.
* __`nginx.conf`__: The configuration for the nginx master server that manages the multiple workers.

### Environment variables

When you create an inference server, you can control some of Gunicorn's options via environment variables. These
can be supplied as part of the CreateModel API call.

    Parameter                Environment Variable              Default Value
    ---------                --------------------              -------------
    number of workers        MODEL_SERVER_WORKERS              the number of CPU cores
    timeout                  MODEL_SERVER_TIMEOUT              60 seconds
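
As a sketch of both sides of this mechanism: inside the container the server start-up code can read the variables with the defaults from the table, and on the client side they are set through the `Environment` field of the container definition passed to CreateModel. The function name `server_settings` and the placeholder image URI and S3 path are assumptions for illustration; the sample's `serve` script may read the variables differently.

```python
import multiprocessing
import os


def server_settings(environ=os.environ):
    """Read the Gunicorn options, falling back to the defaults in the table."""
    workers = int(environ.get("MODEL_SERVER_WORKERS", multiprocessing.cpu_count()))
    timeout = int(environ.get("MODEL_SERVER_TIMEOUT", 60))
    return {"workers": workers, "timeout": timeout}


# Client side: environment variables go into the container definition that is
# passed to CreateModel. Values must be strings. The image URI and S3 path
# below are placeholders.
container = {
    "Image": "<image-uri>",
    "ModelDataUrl": "<s3-path-to-model.tar.gz>",
    "Environment": {"MODEL_SERVER_WORKERS": "2", "MODEL_SERVER_TIMEOUT": "120"},
}

print(server_settings(container["Environment"]))
# {'workers': 2, 'timeout': 120}
```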


[skl]: http://scikit-learn.org "scikit-learn Home Page"
[dockerfile]: https://docs.docker.com/engine/reference/builder/ "The official Dockerfile reference guide"
[ecr]: https://aws.amazon.com/ecr/ "ECR Home Page"
[nginx]: http://nginx.org/
[gunicorn]: http://gunicorn.org/
[flask]: http://flask.pocoo.org/

## Part 2: Building and registering the container

Just like with the training container, we are going to use the new [Amazon SageMaker Studio Image Build CLI](https://aws.amazon.com/blogs/machine-learning/using-the-amazon-sagemaker-studio-image-build-cli-to-build-container-images-from-your-studio-notebooks/).

Open a terminal window and run the following command:
```
cd ~/inference_container
sm-docker build . --repository lightfm-inference:1.0
```

## Part 3: Use the container for inference in Amazon SageMaker

In [None]:
import boto3 
def get_container_uri(ecr_repository, tag):
    account_id = boto3.client('sts').get_caller_identity().get('Account')

    region = boto3.session.Session().region_name

    uri_suffix = 'amazonaws.com'
    if region in ['cn-north-1', 'cn-northwest-1']:
        uri_suffix = 'amazonaws.com.cn'

    return '{}.dkr.ecr.{}.{}/{}:{}'.format(account_id, region, uri_suffix, ecr_repository, tag)
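
To see the shape of the URI that the helper above produces without calling AWS, here is a sketch of the same formatting with explicit inputs. `format_ecr_uri` is a hypothetical name, and `123456789012` is a placeholder account ID.

```python
def format_ecr_uri(account_id, region, ecr_repository, tag):
    """Build an ECR image URI from explicit values (no AWS calls)."""
    # The China regions use a different domain suffix.
    suffix = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return "{}.dkr.ecr.{}.{}/{}:{}".format(account_id, region, suffix, ecr_repository, tag)


print(format_ecr_uri("123456789012", "us-east-1", "lightfm-inference", "1.0"))
# 123456789012.dkr.ecr.us-east-1.amazonaws.com/lightfm-inference:1.0
```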

### Import model into hosting

When creating the Model entity for the endpoint, the container's `ModelDataUrl` is the S3 location of the `model.tar.gz` archive that the training job produced. SageMaker makes its contents available to the container under `/opt/ml/model`.

The `Mode` of the container is set to `SingleModel` because the container hosts a single model.

In [None]:
import boto3
from sagemaker import get_execution_role
from time import gmtime, strftime

role = get_execution_role()
client = boto3.client(service_name='sagemaker')

byoc_image_uri = get_container_uri('lightfm-inference','1.0')
# Replace this with the S3 location of the model.tar.gz from your own training job.
model_url = 's3://sagemaker-us-east-1-718026778991/light-fm-custom-container-train-job-2021-04-26-10-26-53-215/output/model.tar.gz'
model_name = 'Demo-LightFM-Inference-Model-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

container = {
    'Image': byoc_image_uri,
    'ModelDataUrl': model_url,
    'Mode': 'SingleModel'
}

create_model_response = client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    Containers = [container])

print("Model Arn: " + create_model_response['ModelArn'])

### Create endpoint configuration

In [None]:
endpoint_config_name = 'DEMO-LightFM-EndpointConfig-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print('Endpoint config name: ' + endpoint_config_name)

create_endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType': 'ml.m5.xlarge',
        'InitialInstanceCount': 1,
        'InitialVariantWeight': 1,
        'ModelName': model_name,
        'VariantName': 'AllTraffic'}])

print("Endpoint config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

### Create endpoint

In [None]:
import time

endpoint_name = 'DEMO-LightFMEndpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print('Endpoint name: ' + endpoint_name)

create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print('Endpoint Arn: ' + create_endpoint_response['EndpointArn'])

resp = client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Endpoint Status: " + status)

print('Waiting for {} endpoint to be in service...'.format(endpoint_name))
waiter = client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=endpoint_name)
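
The waiter hides the polling loop. If you want to surface a failure reason or control the polling interval yourself, a hand-rolled loop along the following lines is one option. This is a sketch, not the waiter's actual implementation: `wait_until_in_service` is a hypothetical helper, and the `delay` and `max_attempts` defaults are assumptions.

```python
import time


def wait_until_in_service(describe, endpoint_name, delay=30, max_attempts=120):
    """Poll an endpoint's status until it is InService, failing fast on errors.

    `describe` is any callable with the shape of
    `client.describe_endpoint(EndpointName=...)`.
    """
    for _ in range(max_attempts):
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status == "Failed":
            raise RuntimeError(
                "Endpoint creation failed: {}".format(resp.get("FailureReason", "unknown"))
            )
        time.sleep(delay)
    raise TimeoutError("Endpoint {} was not in service in time".format(endpoint_name))
```

With the client from this notebook, the call would be `wait_until_in_service(client.describe_endpoint, endpoint_name)`.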

### Invoke model

Now we invoke the model that we uploaded to S3 previously in the training step. 

The first invocation may be slower than subsequent ones, for example when the inference code loads the model artifacts into memory on the first request.

In [None]:
%%time
# TODO: change input to CSV

import json
import numpy as np

runtime_client = boto3.client(service_name='sagemaker-runtime')

data = np.array([3, 42, 500])
payload = json.dumps(data.tolist())

response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=payload)

print(*json.loads(response['Body'].read()), sep = '\n')
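
The TODO in the cell above mentions switching the input to CSV. A minimal sketch of building such a payload follows. It assumes the container's `/invocations` handler were written to accept `text/csv`, which nothing in this notebook confirms. `to_csv_payload` is a hypothetical helper.

```python
import csv
import io


def to_csv_payload(values):
    """Serialize a flat sequence of values as a single CSV row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(values)
    return buffer.getvalue()


payload = to_csv_payload([3, 42, 500])
print(repr(payload))
# '3,42,500\n'
```

The resulting string would be passed as `Body=payload` together with `ContentType='text/csv'` in `invoke_endpoint`.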

## (Optional) cleanup
When you're done with the endpoint, delete it so that it stops incurring charges.

All of the training jobs, models and endpoints we created can be viewed through the SageMaker console of your AWS account, but you can also run the code below to easily clean up the resources.

In [None]:
client.delete_endpoint(EndpointName=endpoint_name)
client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
client.delete_model(ModelName=model_name)