# Host a Pretrained Model on SageMaker
    
Amazon SageMaker is a service to accelerate the entire machine learning lifecycle. It includes components for building, training and deploying machine learning models. Each SageMaker component is modular, so you're welcome to only use the features needed for your use case. One of the most popular features of SageMaker is model hosting. Using SageMaker hosting, you can deploy your model as a scalable, highly available, multi-process API endpoint with a few lines of code. Read more at [Deploy a Model in Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html). In this notebook, we demonstrate how to host a pretrained BERT model in Amazon SageMaker to extract embeddings from text.

SageMaker provides prebuilt containers that can be used for training, hosting, or data processing. The inference containers include a web serving stack, so you don't need to install and configure one. We use the SageMaker PyTorch container, but you may use the TensorFlow container, or bring your own container if needed. See all containers at [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers).

This notebook walks you through how to deploy a pretrained Hugging Face model as a scalable, highly available, production-ready API.

## Runtime

This notebook takes approximately 5 minutes to run.

## Contents

1. [Retrieve Model Artifacts](#Retrieve-Model-Artifacts)
1. [Write the Inference Script](#Write-the-Inference-Script)
1. [Package Model](#Package-Model)
1. [Deploy Model](#Deploy-Model)
1. [Get Predictions](#Get-Predictions)
1. [Cleanup](#Cleanup)
1. [Conclusion](#Conclusion)

## Retrieve Model Artifacts

First we download the model artifacts for the pretrained BERT model. BERT is a popular natural language processing (NLP) model that extracts meaning and context from text. You can read the original paper, [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).

In [2]:
!pip install transformers==3.3.1 sagemaker==2.15.0 --quiet



In [3]:
import os
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

model_path = "model/"
code_path = "code/"

# Create the model directory if it does not already exist.
os.makedirs(model_path, exist_ok=True)

model.save_pretrained(save_directory=model_path)
tokenizer.save_pretrained(save_directory=model_path)

('model/vocab.txt', 'model/special_tokens_map.json', 'model/added_tokens.json')

## Write the Inference Script

Since we are bringing our own model to SageMaker, we must create an inference script, which runs inside the PyTorch container. The script should include a function for loading the model (`model_fn`) and can optionally include functions for generating predictions and for processing input and output. The PyTorch container provides default implementations for generating a prediction and for input/output processing; any of these functions that you define in your script override the defaults. You can find additional details at [Serve a PyTorch Model](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#serve-a-pytorch-model).
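As a rough illustration of the optional input/output overrides, the sketch below defines `input_fn` and `output_fn` with the handler signatures described in the SageMaker PyTorch serving documentation. The accepted content types, the `"text"` request key, and the `"embedding"` response key are assumptions made for this example; they are not taken from this notebook's inference script.

```python
import json


def input_fn(request_body, request_content_type):
    """Turn the raw request payload into a plain Python string."""
    # The payload may arrive as bytes or str, so normalize it first.
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode("utf-8")

    if request_content_type == "application/json":
        # Assumed request shape for this sketch: {"text": "..."}
        return json.loads(request_body)["text"]
    if request_content_type in ("text/csv", "text/plain"):
        return request_body
    raise ValueError(f"Unsupported content type: {request_content_type}")


def output_fn(prediction, content_type):
    """Serialize the prediction (assumed to be nested Python lists) as JSON."""
    if content_type == "application/json":
        return json.dumps({"embedding": prediction})
    raise ValueError(f"Unsupported response content type: {content_type}")
```

Returning JSON rather than the string form of a tensor makes the response easier for clients to parse, at the cost of converting tensors to lists (for example with `tensor.tolist()`) inside `predict_fn`.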

The next cell shows our inference script, which uses the [Transformers library from Hugging Face](https://huggingface.co/transformers/). This library is not installed in the container by default, so we add it in the next section.

In [4]:
!pygmentize code/inference_code.py

import os
import json
from transformers import BertTokenizer, BertModel

def model_fn(model_dir):
    """
    Load the model for inference
    """

    model_path = os.path.join(model_dir, 'model/')

    # Load BERT tokenizer from disk.
    tokenizer = BertTokenizer.from_pretrained(model_path)

    # Load BERT model from disk.
    model = BertModel.from_pretrained(model_path)

    model_dict = {'model': model, 'tokenizer': tokenizer}

    return model_dict

def predict_fn(input_data, model):
    """
    Apply model to the incoming requ

## Package Model

For hosting, SageMaker requires the deployment package to follow a specific structure. It expects all files to be packaged in a gzip-compressed tar archive named "model.tar.gz". To install additional libraries at container startup, we can add a requirements.txt file that lists the libraries to install with [pip](https://pypi.org/project/pip/). Read more at [Using Third-Party Libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#using-third-party-libraries). Within the archive, the PyTorch container expects the inference code and the requirements.txt file to be inside the code/ directory. See the [Model Directory Structure](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#model-directory-structure) guide for a thorough explanation of the required directory structure.
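Before deploying, it can help to confirm that an archive has the expected layout. The following is a minimal, self-contained sketch that builds an archive from placeholder files in a temporary directory and lists its members. The helper name `build_archive` and the placeholder file `model/config.json` are assumptions for illustration; the real package contains the saved BERT files and the notebook's own `code/` directory.

```python
import os
import tarfile
import tempfile


def build_archive(root, archive_path, directories=("model", "code")):
    """Tar the given sub-directories of `root` at the archive's top level
    and return the sorted list of file members for inspection."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in directories:
            tar.add(os.path.join(root, name), arcname=name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


# Demo with placeholder files standing in for the real artifacts.
with tempfile.TemporaryDirectory() as root:
    for rel in ("model/config.json", "code/inference_code.py", "code/requirements.txt"):
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write("placeholder\n")
    members = build_archive(root, os.path.join(root, "model.tar.gz"))

print(members)
# ['code/inference_code.py', 'code/requirements.txt', 'model/config.json']
```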

In [5]:
import tarfile

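# The archive is written inside model/; tarfile skips the archive file itself
# when that directory is added.
zipped_model_path = os.path.join(model_path, "model.tar.gz")

with tarfile.open(zipped_model_path, "w:gz") as tar:
    tar.add(model_path)  # saved model and tokenizer files
    tar.add(code_path)  # inference script and requirements.txt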

## Deploy Model

Now that we have our deployment package, we can use the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/index.html) to deploy our API endpoint with two calls: one to create the model object and one to deploy it. When we call `deploy()`, the SDK saves our deployment archive to S3, in the default SageMaker bucket (usually named `s3://sagemaker-{region}-{your account ID}`), for the SageMaker endpoint to use. We also need to specify an IAM role for the endpoint. We use the helper function [get_execution_role()](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html?highlight=get_execution_role#sagemaker.session.get_execution_role) to retrieve our current IAM role so we can pass it to the SageMaker endpoint. At a minimum, the role requires read access to the model artifacts in S3 and to the [ECR repository](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) where AWS stores the container image.


You may notice that we specify our PyTorch version and Python version when creating the PyTorchModel object. The SageMaker SDK uses these parameters to determine which PyTorch container to use. 

We use an ml.m5.xlarge instance for our endpoint to ensure we have sufficient memory to serve our model.

In [6]:
from sagemaker.pytorch import PyTorchModel
from sagemaker import get_execution_role
import time

endpoint_name = "bert-base-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

model = PyTorchModel(
    entry_point="inference_code.py",
    model_data=zipped_model_path,
    role=get_execution_role(),
    framework_version="1.5",
    py_version="py3",
)

predictor = model.deploy(
    initial_instance_count=1, instance_type="ml.m5.xlarge", endpoint_name=endpoint_name
)

-----!
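By default, `deploy()` waits until the endpoint has been created. If you need to check an endpoint from a separate process, one option is to poll the `DescribeEndpoint` API. The sketch below is a hypothetical helper, not part of the SageMaker SDK: it accepts any client exposing `describe_endpoint`, such as `boto3.client("sagemaker")`. The `_FakeSageMakerClient` class is an offline stand-in so the example runs without AWS credentials, and the endpoint name `bert-base-demo` is a placeholder.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, max_polls=40):
    """Poll DescribeEndpoint until the endpoint is InService or has failed."""
    for _ in range(max_polls):
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not in service after {max_polls} polls")


class _FakeSageMakerClient:
    """Offline stand-in for a SageMaker client; returns canned statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {"EndpointName": EndpointName, "EndpointStatus": next(self._statuses)}


fake_client = _FakeSageMakerClient(["Creating", "Creating", "InService"])
final_status = wait_until_in_service(fake_client, "bert-base-demo", poll_seconds=0)
print(final_status)
```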

## Get Predictions

Now that our API endpoint is deployed, we send it text to get predictions from our BERT model. You can use the SageMaker SDK or the [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) method of the SageMaker Runtime API to invoke the endpoint. 

In [7]:
import sagemaker

sm = sagemaker.Session().sagemaker_runtime_client

prompt = "The best part of Amazon SageMaker is that it makes machine learning easy."

response = sm.invoke_endpoint(
    EndpointName=endpoint_name, Body=prompt.encode(encoding="UTF-8"), ContentType="text/csv"
)

response["Body"].read()

b'(tensor([[[-0.2462, -0.0988,  0.1747,  ..., -0.4059,  0.0966,  0.6564],\n         [-0.1352, -0.5824, -0.0728,  ..., -0.1726,  0.5765,  0.1273],\n         [-0.1491, -0.4218,  0.2821,  ...,  0.1332,  0.5053, -0.2813],\n         ...,\n         [-0.8054, -0.3126,  0.6776,  ..., -0.0572,  0.0806, -0.0318],\n         [ 0.7608,  0.1367, -0.2650,  ...,  0.1246, -0.5977, -0.2397],\n         [ 0.4660,  0.2762,  0.0636,  ...,  0.1112, -0.5502, -0.2997]]],\n       grad_fn=<NativeLayerNormBackward>), tensor([[-7.0429e-01, -4.2229e-01, -9.7203e-01,  6.3414e-01,  8.6010e-01,\n         -3.5008e-01,  3.8001e-02,  2.2652e-01, -8.5239e-01, -9.9980e-01,\n         -6.4649e-01,  8.4232e-01,  8.9319e-01,  6.2476e-01,  4.8914e-01,\n         -3.7195e-01,  1.0597e-01, -5.0569e-01,  3.3702e-01,  7.3767e-01,\n          6.0322e-01,  1.0000e+00, -3.2281e-01,  4.7648e-01,  4.4296e-01,\n          9.4813e-01, -6.6813e-01,  7.4915e-01,  8.2229e-01,  6.3062e-01,\n         -1.4025e-01,  2.2783e-01, -9.6329e-01, -2.0670
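The invocation above can be wrapped in a small helper that encodes the text, calls `invoke_endpoint`, and decodes the response body. This is a sketch under stated assumptions: `invoke_text` is a hypothetical name, and `_FakeRuntimeClient` is an offline stand-in that echoes the request so the example runs without a deployed endpoint. With a real endpoint, you would pass the runtime client created in the cell above and the notebook's `endpoint_name`.

```python
import io


def invoke_text(runtime_client, endpoint_name, text, content_type="text/csv"):
    """Send UTF-8 text to an endpoint and return the response body as a string."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=text.encode("utf-8"),
        ContentType=content_type,
    )
    # The response body is a stream; read it once and decode it.
    return response["Body"].read().decode("utf-8")


class _FakeRuntimeClient:
    """Offline stand-in for a SageMaker Runtime client; echoes the request body."""

    def invoke_endpoint(self, EndpointName, Body, ContentType):
        return {"Body": io.BytesIO(b"echo: " + Body), "ContentType": ContentType}


result = invoke_text(_FakeRuntimeClient(), "bert-base-demo", "hello")
print(result)
```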

## Cleanup

Delete the model and endpoint to release resources and stop incurring costs.

In [8]:
predictor.delete_model()
predictor.delete_endpoint()

## Conclusion

We have successfully created a scalable, highly available, RESTful API that is backed by a BERT model! It can be used for downstream NLP tasks like text classification. To learn more, check out some of the more advanced features of SageMaker hosting: [Monitor models for data and model quality, bias, and explainability](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html) to detect concept drift, [Automatically Scale Amazon SageMaker Models](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html) to dynamically adjust the number of instances, or [Give SageMaker Hosted Endpoints Access to Resources in Your Amazon VPC](https://docs.aws.amazon.com/sagemaker/latest/dg/host-vpc.html) to control network access to and from your endpoint.

You can also read the blog [Deploy machine learning models to Amazon SageMaker using the ezsmdeploy Python package and a few lines of code](https://aws.amazon.com/blogs/opensource/deploy-machine-learning-models-to-amazon-sagemaker-using-the-ezsmdeploy-python-package-and-a-few-lines-of-code/). The ezsmdeploy package automates most of this process.