<h3 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049; text-align: center; border-radius: 5px 5px; padding: 5px"> Pay as you use SageMaker Serverless inference with GPT-2 </h3>

<img src = "img/Serverless.jpg">

Huge credits 👏 to the AWS team for making the SageMaker Serverless Inference option generally available.

Lately, I've been looking into hosting machine learning models on serverless infrastructure and found that there are multiple ways to achieve that:
1. Using [Serverless framework](https://www.serverless.com/framework/docs/getting-started)

    Two options:
    * Create a Lambda layer (which contains the dependency libraries) and attach it to a Lambda function.
    * Use a Docker container (for example, host Hugging Face BERT models or image classification models on S3 and serve them through the Serverless Framework and Lambda functions).
2. Using [AWS CDK](https://aws.amazon.com/blogs/compute/hosting-hugging-face-models-on-aws-lambda/) (Cloud Development Kit)
3. Using [AWS SAM](https://aws.amazon.com/serverless/sam/) (Serverless Application Model)

    Host deep learning models on S3, load them onto EFS (which acts like a cache for the models), and serve the inference requests.

    Two options:
    * [Using SAM Helloworld template](https://towardsdatascience.com/deploying-sklearn-machine-learning-on-aws-lambda-with-sam-8cc69ee04f47) - Create a Lambda function with code and an API Gateway trigger.
    * Using SAM Machine Learning template - Create a Docker container with all the code, attach it to a Lambda function, and create an API Gateway trigger.
4. Using [SageMaker Serverless inference](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html)

    The problem with the first three options is that you have to build, manage, and maintain all of your containers yourself.
    The SageMaker (SM) Serverless Inference option lets you focus on the model-building process without having to manage the underlying infrastructure. You can choose either a SageMaker built-in container or bring your own.

**SageMaker Serverless inference Use cases**

Use this option when inference requests arrive intermittently rather than steadily throughout the day, for example in a customer feedback service, a chatbot application, or a document data analysis job, and when your application can tolerate cold starts.
Serverless endpoints automatically launch compute resources and scale them in and out based on the workload. You pay only for invocations instead of for idle capacity, which can save a lot of cost.

**Warming up the Cold Starts**

You can create a health-check service that loads the model without running inference on it, and invoke that service periodically or while users are still exploring the application.
You can use a scheduled Amazon CloudWatch rule to keep the Lambda service warm.
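
Here is a minimal sketch of such a warm-up call, assuming some scheduled caller (for example a Lambda function triggered by a CloudWatch rule) runs it every few minutes. The function name `warm_endpoint` and the `"ping"` payload are illustrative assumptions of mine. Note that this sketch sends a real (tiny) inference request instead of a dedicated health check, and it does not guarantee that a warm container will be available.

```python
import json


def warm_endpoint(runtime_client, endpoint_name, payload="ping"):
    """Send a tiny request so the serverless endpoint has a recently used container.

    runtime_client is expected to be boto3.client("sagemaker-runtime").
    Returns True when the endpoint answered with HTTP 200.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),  # hypothetical warm-up payload
        ContentType="text/csv",
    )
    return response["ResponseMetadata"]["HTTPStatusCode"] == 200
```

Passing the client in as an argument keeps the function easy to call from a Lambda handler and easy to test without AWS credentials.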

This article demonstrates how to host a pretrained Transformers model, GPT-2, on a SageMaker Serverless endpoint using the SageMaker boto3 API.

NOTE: At the time of writing, only CPU instances are supported for Serverless endpoints.

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Import necessary libraries and Setup permissions </h2>

NOTE: You can run this demo in SageMaker Studio, on your local machine, or on SageMaker Notebook Instances.

If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).
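
For the local case, here is a small sketch of how you could look up the role ARN by name with the IAM API instead of relying on `get_execution_role()`. The helper name `resolve_role_arn` and the role name in the usage note are hypothetical; substitute the role you created.

```python
def resolve_role_arn(iam_client, role_name):
    """Look up the ARN of an existing IAM role by its name.

    iam_client is expected to be boto3.client("iam"); role_name is the
    (hypothetical) name of a role that has SageMaker permissions.
    """
    return iam_client.get_role(RoleName=role_name)["Role"]["Arn"]
```

On a local machine you could then set `role = resolve_role_arn(boto3.client("iam"), "my-sagemaker-execution-role")`, assuming your local credentials are allowed to call `iam:GetRole`.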

In [None]:
from sagemaker import get_execution_role
import boto3
import sagemaker

role = get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'gpt-serverless-model'
sm_client = boto3.client("sagemaker")


print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {region}")

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Retrieve Model Artifacts </h2>

#### `GPT-2 model`

We will download the model artifacts for the pretrained [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) model. GPT-2 is a popular text generation model that was developed by OpenAI. Given a text prompt it can generate synthetic text that may follow.

In [2]:
!pip install transformers==4.17.0 --quiet

In [3]:
import os
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

model_path = 'model/'

# Create the output directory if it does not exist yet
os.makedirs(model_path, exist_ok=True)
    
model.save_pretrained(save_directory=model_path)
tokenizer.save_vocabulary(save_directory=model_path)

('model/vocab.json', 'model/merges.txt')

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Write the Inference Script </h2> 

#### `GPT-2 model`

The next cells copy our inference script for the GPT-2 model into the model directory and display it.

In [4]:
!mkdir -p model/code

!cp code/inference.py model/code/inference.py

In [5]:
!pygmentize model/code/inference.py

import os
import json
from transformers import GPT2Tokenizer, TextGenerationPipeline, GPT2LMHeadModel

# Load the model for inference
def model_fn(model_dir):

    # Load GPT2 tokenizer from disk.
    vocab_path = os.path.join(model_dir, 'vocab.json')
    merges_path = os.path.join(model_dir, 'merges.txt')
    
    tokenizer = GPT2Tokenizer(vocab_file=vocab_path, merges_file=merges_path)

    # Load GPT2 model from disk.
    model = GPT2LMHeadModel.from_pretrained(model_dir)
    return TextGenerationPipeline(model=model, tokenizer=tokenizer)

# Apply model to the incoming request
def predict_fn(input_data, model):
    re

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Package Model </h2> 

For hosting, SageMaker requires that the deployment package be structured in a compatible format. It expects all files to be packaged in a tar archive named "model.tar.gz" with gzip compression. Within the archive, the Hugging Face container expects all inference code files to be inside the `code/` directory.

In [6]:
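# Write the archive outside model/ first so tar does not read its own output, then move it into place.
!tar -czvf model.tar.gz --exclude=.ipynb_checkpoints --exclude=model.tar.gz -C model/ . && mv model.tar.gz model/model.tar.gz

./
./merges.txt
./vocab.json
./config.json
./code/
./code/inference.py
./pytorch_model.bin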
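
The archive layout described above (model files at the root, inference code under `code/`) can also be produced and checked from Python. This is a sketch using only the standard library; the helper names `package_model` and `list_archive` and the choice to skip `.ipynb_checkpoints` are my own assumptions, not something SageMaker requires.

```python
import os
import tarfile


def package_model(model_dir, archive_path, skip=(".ipynb_checkpoints",)):
    """Create a gzip-compressed tar of model_dir with its files at the archive root.

    Directories whose name is listed in `skip` are left out. Write the archive
    outside model_dir so that it does not try to include itself.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for root, dirs, files in os.walk(model_dir):
            # Prune unwanted directories in place so os.walk does not descend into them.
            dirs[:] = sorted(d for d in dirs if d not in skip)
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Store paths relative to model_dir, e.g. "code/inference.py".
                tar.add(full_path, arcname=os.path.relpath(full_path, model_dir))


def list_archive(archive_path):
    """Return the sorted member names of the archive for a quick sanity check."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())
```

Calling `list_archive` before uploading is a cheap way to confirm that `code/inference.py` really sits where the container expects it.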


<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Upload GPT-2 model to S3 </h2> 

In [None]:
from sagemaker.s3 import S3Uploader

model_data = S3Uploader.upload('model/model.tar.gz', 's3://{0}/{1}'.format(bucket,prefix))
model_data
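
If you prefer plain boto3 over the SageMaker SDK for this step, the upload can be sketched as follows. The helper name `upload_model` is an assumption of mine; `upload_file` is the standard S3 client call. The function returns the same kind of `s3://` URI that `model_data` holds above.

```python
import os


def upload_model(s3_client, local_path, bucket, prefix):
    """Upload a local archive to S3 and return its s3:// URI.

    s3_client is expected to be boto3.client("s3").
    """
    key = f"{prefix}/{os.path.basename(local_path)}"
    s3_client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    return f"s3://{bucket}/{key}"
```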

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Create and Deploy a Serverless GPT-2 model </h2> 

We are using a CPU-based Hugging Face container image to host the inference script because GPUs are not supported on Serverless endpoints. Hopefully the AWS team will add GPU support to Serverless endpoints soon 😄.

In [8]:
image_uri = "763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-cpu-py38-ubuntu20.04"

model_name    = 'gpt-2-serverless-model'
epc_name     = 'gpt-2-serverless-model-epc'
endpoint_name = 'gpt-2-serverless-model-ep'

primary_container = {
    'Image': image_uri,
    'ModelDataUrl': model_data,
    'Environment': {
        'SAGEMAKER_PROGRAM': 'inference.py',
        'SAGEMAKER_REGION': region,
        'SAGEMAKER_SUBMIT_DIRECTORY': model_data
    }
}
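
The image URI above is hard-coded to `us-east-2`. If your session runs in another region, a sketch like the following builds the same URI from the `region` variable. It assumes the registry account `763104351884` also serves your region, which is not true for every AWS region, so check the AWS Deep Learning Containers image list before relying on it. The helper name is hypothetical.

```python
def huggingface_cpu_image_uri(region, account="763104351884"):
    """Build the Hugging Face CPU inference image URI for a given region.

    The account ID is an assumption that holds for many, but not all, regions.
    """
    tag = "1.10.2-transformers4.17.0-cpu-py38-ubuntu20.04"
    return (
        f"{account}.dkr.ecr.{region}.amazonaws.com/"
        f"huggingface-pytorch-inference:{tag}"
    )
```

With this helper, `image_uri = huggingface_cpu_image_uri(region)` keeps the container image in the same region as the endpoint.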

Next, we will create a SageMaker model, an endpoint config, and an endpoint. When creating the endpoint config, we have to specify `ServerlessConfig`, which contains two parameters: `MemorySizeInMB` and `MaxConcurrency`. This is the only difference for a Serverless endpoint; everything else stays the same as for real-time inference.

MemorySizeInMB: One of 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB. The memory size should be at least as large as your model size.

MaxConcurrency: The maximum number of concurrent invocations your serverless endpoint can process.
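
As a rough illustration of the memory rule above, the following sketch picks the smallest allowed `MemorySizeInMB` that fits a model of a given size. The `headroom` multiplier (extra room for the framework and runtime) is purely my assumption; measure the actual memory utilization of your endpoint to tune it.

```python
import math

# Allowed values for MemorySizeInMB, as listed above.
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def pick_memory_size(model_size_bytes, headroom=2.0):
    """Return the smallest allowed memory size that fits the model plus headroom."""
    needed_mb = math.ceil(model_size_bytes * headroom / (1024 * 1024))
    for size in ALLOWED_MEMORY_MB:
        if size >= needed_mb:
            return size
    raise ValueError(
        f"Model needs about {needed_mb} MB, above the {ALLOWED_MEMORY_MB[-1]} MB limit"
    )
```

You could feed it `os.path.getsize('model/pytorch_model.bin')` to get a starting value for the endpoint config.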

In [None]:
# Create/Register a GPT-2 model in SM
from sagemaker import get_execution_role

create_model_response = sm_client.create_model(ModelName = model_name,
                                              ExecutionRoleArn = get_execution_role(),
                                              PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

# Create a SM Serverless endpoint config
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = epc_name,
    ProductionVariants=[
        {
        'ServerlessConfig':{
            'MemorySizeInMB' : 6144,
            'MaxConcurrency' : 5
        },
        'ModelName':model_name,
        'VariantName':'AllTraffic',
        'InitialVariantWeight':1
        }
    ])

print('Endpoint configuration arn:  {}'.format(endpoint_config_response['EndpointConfigArn']))

# Create a SM Serverless endpoint
endpoint_params = {
    'EndpointName': endpoint_name,
    'EndpointConfigName': epc_name,
}
endpoint_response = sm_client.create_endpoint(**endpoint_params)
print('EndpointArn = {}'.format(endpoint_response['EndpointArn']))
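
`create_endpoint` returns while the endpoint is still being created, so it helps to wait until the status is `InService` before invoking it. Below is a minimal polling sketch built on `describe_endpoint`; the helper name and the polling defaults are assumptions. boto3 also ships a built-in waiter, `sm_client.get_waiter('endpoint_in_service')`, that does the same job.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=15, max_polls=80):
    """Poll the endpoint status until it is InService, or raise on failure/timeout.

    sm_client is expected to be boto3.client("sagemaker").
    """
    for _ in range(max_polls):
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not InService after {max_polls} polls")
```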

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Get Predictions </h2>

Now that our Serverless endpoint is deployed, we can send it text to get predictions from our GPT-2 model. You can use the SageMaker Python SDK or the SageMaker Runtime API to invoke the endpoint.

In [11]:
import boto3
import json

invoke_client = boto3.client('sagemaker-runtime')
prompt = "Working with SageMaker makes machine learning "
    
response = invoke_client.invoke_endpoint(EndpointName=endpoint_name, 
                            Body=json.dumps(prompt),
                            ContentType='text/csv')

response['Body'].read().decode('utf-8')

'[{\'generated_text\': \'"Working with SageMaker makes machine learning "a lot easier" than it used to be.\\n\'}]'
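
Note that the response body shown above uses single quotes, so it is a Python literal rather than strict JSON and `json.loads` would reject it. One way to pull out the generated text, sketched here under the assumption that the response always has this list-of-dicts shape, is `ast.literal_eval`. The helper name is hypothetical.

```python
import ast


def extract_generated_text(raw_body):
    """Parse the endpoint's response string and return the generated texts."""
    records = ast.literal_eval(raw_body)  # safely evaluates Python literals only
    return [record["generated_text"] for record in records]
```

For example, `extract_generated_text(response['Body'].read().decode('utf-8'))` would return a list with one string for the request above.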

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Monitor Serverless GPT-2 model endpoint </h2>

The `ModelSetupTime` metric helps you track the cold start time, that is, the time it takes to launch new compute resources and set up the Serverless endpoint. It depends on the size of the model and the container start-up time.

The Serverless endpoint takes around 12 seconds to host the GPT-2 model on the available compute resources and around 3.9 seconds to serve the first inference request.

<img src = "img/se_first_invocation.jpg">

The Serverless GPT-2 model endpoint serves subsequent inference requests within 1 second, which is great news 🙌.

<img src = "img/se_second_invocation.jpg">

The Serverless endpoint utilizes 16.14% of its memory.

<img src = "img/se_memory_utilization.jpg">
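
The same `ModelSetupTime` numbers can be pulled programmatically from CloudWatch. The sketch below only builds the request parameters; the namespace `AWS/SageMaker` and the `EndpointName`/`VariantName` dimensions reflect my understanding of how SageMaker endpoint metrics are published, so verify them in your CloudWatch console. The helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def model_setup_time_query(endpoint_name, variant_name="AllTraffic", hours=1, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",  # assumed namespace for endpoint metrics
        "MetricName": "ModelSetupTime",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Average", "Maximum"],
    }
```

You would pass the result to `boto3.client("cloudwatch").get_metric_statistics(**model_setup_time_query(endpoint_name))` and read the `Datapoints` list from the response.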

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Clean-up </h2>

In [None]:
# Delete in reverse order of creation: endpoint first, then its config, then the model
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=epc_name)
sm_client.delete_model(ModelName=model_name)

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Conclusion </h2>

We successfully deployed GPT-2 (a text generation model) to an Amazon SageMaker Serverless endpoint using the SageMaker boto3 API.

The big advantage of a Serverless endpoint is that your data science team can focus on the model-building process without spending thousands of dollars while implementing a POC or at the start of a new product. After the POC is successful, you can easily deploy your model to real-time endpoints with GPUs to handle the production workload.