# Optimize & deploy BERT on AWS Inferentia2 with Amazon SageMaker

In this end-to-end tutorial, you will learn how to speed up BERT inference down to `1ms` latency for text classification with Hugging Face Transformers, Amazon SageMaker, and AWS Inferentia2.

You will learn how to: 

1. Convert BERT to AWS Neuron (Inferentia2) with `optimum-neuron`
2. Create a custom `inference.py` script for `text-classification`
3. Upload the neuron model and inference script to Amazon S3
4. Deploy a Real-time Inference Endpoint on Amazon SageMaker
5. Run and evaluate Inference performance of BERT on Inferentia2

Let's get started! 🚀

---

*If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).*

## 1. Convert BERT to AWS Neuron (Inferentia2) with `optimum-neuron`

We are going to use [optimum-neuron](https://huggingface.co/docs/optimum-neuron/index). 🤗 Optimum Neuron is the interface between the 🤗 Transformers library and AWS Accelerators, including [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/?nc1=h_ls) and [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/?nc1=h_ls). It provides a set of tools enabling easy model loading, training and inference on single- and multi-Accelerator settings for different downstream tasks.

As a first step, we need to install `optimum-neuron` and the other required packages.

*Tip: If you are using Amazon SageMaker Notebook Instances or Studio you can go with the `conda_python3` conda kernel.*


In [6]:
# Install the required packages
# !pip install "optimum-neuron[neuronx]==0.0.4"  --upgrade
# !python -m pip install "optimum-neuron[neuronx]==0.0.4" "sagemaker==2.162.0"  --upgrade
# pip install sagemaker from github
!pip uninstall sagemaker -y
!pip install 'sagemaker @ git+https://github.com/philschmid/sagemaker-python-sdk@patch-1' --upgrade --no-cache-dir

Found existing installation: sagemaker 2.161.1.dev0
Uninstalling sagemaker-2.161.1.dev0:
  Successfully uninstalled sagemaker-2.161.1.dev0
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting sagemaker@ git+https://github.com/philschmid/sagemaker-python-sdk@patch-1
  Cloning https://github.com/philschmid/sagemaker-python-sdk (to revision patch-1) to /tmp/pip-install-fgbng3_i/sagemaker_8883e196a64941e09ec0374625cad0c7
  Running command git clone --filter=blob:none --quiet https://github.com/philschmid/sagemaker-python-sdk /tmp/pip-install-fgbng3_i/sagemaker_8883e196a64941e09ec0374625cad0c7
  Running command git checkout -b patch-1 --track origin/patch-1
  Switched to a new branch 'patch-1'
  Branch 'patch-1' set up to track remote branch 'patch-1' from 'origin'.
  Resolved https://github.com/philschmid/sagemaker-python-sdk to commit d77485cea67860641e4a3f143029413ed7f31f01
  Preparing metadata (setup.py) ... done
Building wheels for collec

After we have installed `optimum-neuron`, we can load and convert our model.

We are going to use the [yiyanghkust/finbert-tone](https://huggingface.co/yiyanghkust/finbert-tone) model. FinBERT is a BERT model pre-trained on financial communication text, with the goal of enhancing financial NLP research and practice. It is trained on three financial communication corpora with a total size of 4.9B tokens. The released finbert-tone model is the FinBERT model fine-tuned on 10,000 manually annotated (positive, negative, neutral) sentences from analyst reports.

In [3]:
model_id = "yiyanghkust/finbert-tone"

At the time of writing, the [AWS Inferentia2 does not support dynamic shapes for inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/dynamic-shapes.html?highlight=dynamic%20shapes#), which means that the input size needs to be static for compiling and inference. 

In simpler terms, this means that when the model is converted with a sequence length of 16, it can only run inference on inputs with exactly that shape. Shorter inputs have to be padded and longer inputs truncated to the compiled sequence length.
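To make the static-shape constraint concrete, here is a minimal pure-Python sketch of what padding and truncating to a fixed length does to a list of token IDs. The helper name `pad_or_truncate` and the pad ID of `0` are assumptions for illustration only. In this tutorial the tokenizer does this work through `padding="max_length"` and `truncation=True`, and, unlike this sketch, it also preserves special tokens when truncating.

```python
def pad_or_truncate(token_ids, sequence_length, pad_token_id=0):
    """Return (input_ids, attention_mask), both exactly `sequence_length` long.

    Hypothetical helper for illustration; pad_token_id=0 is an assumption.
    """
    # Cut off anything beyond the compiled sequence length.
    ids = list(token_ids)[:sequence_length]
    n_pad = sequence_length - len(ids)
    # Real tokens get mask 1, padding positions get mask 0.
    attention_mask = [1] * len(ids) + [0] * n_pad
    input_ids = ids + [pad_token_id] * n_pad
    return input_ids, attention_mask


# A short input is padded up to the fixed length ...
short_ids, short_mask = pad_or_truncate([101, 2023, 102], sequence_length=8)
# ... and a long input is cut down to it.
long_ids, long_mask = pad_or_truncate(range(200), sequence_length=8)

print(short_ids, short_mask)
print(len(long_ids), sum(long_mask))
```

Whatever the input length, the model always sees tensors of the same shape, which is what the compiled Neuron graph expects.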

_On a `t2.medium` instance, compilation takes around 2-3 minutes._

In [4]:
from pathlib import Path
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from optimum.exporters.neuron import export
from optimum.exporters.neuron.model_configs import DistilBertNeuronConfig

# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# define dummy input, sequence length and configuration
dummy_input = "dummy input which will be padded later"
sequence_length = 128
output_path = Path("tmp")
neuron_config = DistilBertNeuronConfig(
    config=model.config, 
    task="text-classification",
    batch_size=1, 
    sequence_length=sequence_length,
)

# Export BERT to Neuron model (inferentia2)
export(
    model=model,
    config=neuron_config,
    output=output_path / "model.neuron",
    auto_cast="all",
    auto_cast_type="bf16",
)
# include sequence length in model config so the inference script can read it
model.config.neuron_sequence_length = sequence_length
# save tokenizer and model config
tokenizer.save_pretrained(output_path)
model.config.save_pretrained(output_path)


Using Neuron: --auto-cast all
Using Neuron: --auto-cast-type bf16


## 2. Create a custom `inference.py` script for `text-classification`

The [Hugging Face Inference Toolkit](https://github.com/aws/sagemaker-huggingface-inference-toolkit) supports zero-code deployments on top of the [pipeline feature](https://huggingface.co/transformers/main_classes/pipelines.html) from 🤗 Transformers. This allows users to deploy Hugging Face transformers without an inference script [[Example](https://github.com/huggingface/notebooks/blob/master/sagemaker/11_deploy_model_from_hf_hub/deploy_transformer_model_from_hf_hub.ipynb)]. 

Currently, this feature is not supported with AWS Inferentia2, which means we need to provide an `inference.py` for running inference.

In our example, we are going to overwrite the `model_fn` to load our neuron model and the `predict_fn` to run text classification on the incoming request.

If you want to know more about the `inference.py` script check out this [example](https://github.com/huggingface/notebooks/blob/master/sagemaker/17_custom_inference_script/sagemaker-notebook.ipynb). It explains amongst other things what the `model_fn` and `predict_fn` are.
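As a rough mental model of how the two hooks fit together, the sketch below mimics the call order with plain Python stand-ins: `model_fn` runs once when a worker starts, and `predict_fn` runs once per request with whatever `model_fn` returned. This is a simplified illustration, not the toolkit's actual code. The `handle_request` function and the placeholder return values are assumptions, and the real toolkit also handles content types through `input_fn` and `output_fn`.

```python
import json


def model_fn(model_dir):
    # Stand-in for loading the tokenizer and compiled model from model_dir.
    return {"model_dir": model_dir, "loaded": True}


def predict_fn(data, model):
    # Stand-in for tokenization plus the forward pass.
    text = data.pop("inputs", data)
    return [{"label": "placeholder", "score": 1.0, "n_chars": len(text)}]


def handle_request(body, model):
    # Hypothetical request handler: decode JSON, predict, encode JSON.
    data = json.loads(body)
    prediction = predict_fn(data, model)
    return json.dumps(prediction)


# Loaded once per worker at startup ...
model = model_fn("/opt/ml/model")
# ... then reused for every incoming request.
response = handle_request('{"inputs": "hello"}', model)
print(response)
```

Because the object returned by `model_fn` is passed unchanged into `predict_fn`, returning a tuple of model, tokenizer and config from `model_fn` is a convenient way to hand all three to the prediction step.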

In [9]:
!mkdir -p code


We set `NEURON_RT_NUM_CORES=1` so that each HTTP worker uses exactly one Neuron core, which maximizes throughput.

In [5]:
%%writefile code/inference.py
import os
from transformers import AutoConfig, AutoTokenizer
import torch
import torch_neuronx

# To use one neuron core per worker
os.environ["NEURON_RT_NUM_CORES"] = "1"
AWS_NEURON_TRACED_WEIGHTS_NAME = "model.neuron"


def model_fn(model_dir):
    # load tokenizer and neuron model from model_dir
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = torch.jit.load(os.path.join(model_dir, AWS_NEURON_TRACED_WEIGHTS_NAME))
    model_config = AutoConfig.from_pretrained(model_dir)

    return model, tokenizer, model_config


def predict_fn(data, model_tokenizer_model_config):
    # unpack model, tokenizer and model config
    model, tokenizer, model_config = model_tokenizer_model_config

    # tokenize inputs, padding/truncating to the compiled sequence length
    inputs = data.pop("inputs", data)
    embeddings = tokenizer(
        inputs,
        return_tensors="pt",
        max_length=model_config.neuron_sequence_length,
        padding="max_length",
        truncation=True,
    )
    # convert to tuple for neuron model
    neuron_inputs = tuple(embeddings.values())

    # run prediction
    with torch.no_grad():
        predictions = model(*neuron_inputs)[0]
        scores = torch.nn.Softmax(dim=1)(predictions)

    # return a list of dictionaries, which is JSON serializable
    return [{"label": model_config.id2label[item.argmax().item()], "score": item.max().item()} for item in scores]

Overwriting code/inference.py
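The last step of `predict_fn` turns raw logits into a label and a score. The following pure-Python sketch reproduces that post-processing without `torch`, which may help when checking the logic locally. The function names `softmax` and `to_prediction`, the logits, and the `id2label` mapping are all made up for illustration; the real mapping comes from the model's `config.json`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits, id2label):
    """Pick the highest-probability class and return it with its score."""
    scores = softmax(logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    return {"label": id2label[best], "score": scores[best]}


# Hypothetical label mapping and logits, for illustration only.
id2label = {0: "LABEL_0", 1: "LABEL_1", 2: "LABEL_2"}
prediction = to_prediction([0.1, 2.5, -1.0], id2label)
print(prediction)
```

The returned dictionary has the same `label` and `score` keys as each entry in the list produced by `predict_fn`.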


## 3. Upload the neuron model and inference script to Amazon S3

Before we can deploy our neuron model to Amazon SageMaker, we need to create a `model.tar.gz` archive containing all our model artifacts, e.g. `model.neuron`, and upload it to Amazon S3.

To do this we need to set up our permissions.

In [1]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

Couldn't call 'get_role' to get Role ARN from role name philippschmid to get Role path.


sagemaker role arn: arn:aws:iam::558105141721:role/sagemaker_execution_role
sagemaker bucket: sagemaker-us-east-1-558105141721
sagemaker session region: us-east-1


Next, we create our `model.tar.gz`. The `inference.py` script will be placed into a `code/` folder.

In [7]:
# copy inference.py into the code/ directory of the model directory.
!cp -r code/ tmp/code/
# create a model.tar.gz archive with all the model artifacts and the inference.py script.
%cd tmp
!tar zcvf model.tar.gz *
%cd ..

/home/ubuntu/huggingface-inferentia2-samples/bert-transformers/tmp
code/
code/inference.py
config.json
model.neuron
neuron.model
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.txt
/home/ubuntu/huggingface-inferentia2-samples/bert-transformers
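If shell commands are not available, the same archive layout can be produced with Python's standard `tarfile` module. The sketch below is one possible approach, not part of the original notebook. The helper names `build_model_archive` and `list_archive` are assumptions, and the demo uses a temporary directory with placeholder files instead of the real `tmp/` folder.

```python
import tarfile
import tempfile
from pathlib import Path


def build_model_archive(model_dir, archive_path):
    """Pack every file under model_dir into a gzip tarball, paths relative to model_dir."""
    model_dir = Path(model_dir)
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(model_dir.rglob("*")):
            # Skip the archive itself if it lives inside model_dir.
            if path.resolve() == archive_path.resolve():
                continue
            if path.is_file():
                tar.add(path, arcname=path.relative_to(model_dir).as_posix())
    return archive_path


def list_archive(archive_path):
    """Return the sorted member names of a gzip tarball."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demo with placeholder files standing in for the real model artifacts.
with tempfile.TemporaryDirectory() as workdir:
    model_dir = Path(workdir) / "tmp"
    (model_dir / "code").mkdir(parents=True)
    (model_dir / "code" / "inference.py").write_text("# placeholder\n")
    (model_dir / "config.json").write_text("{}")
    archive = build_model_archive(model_dir, model_dir / "model.tar.gz")
    names = list_archive(archive)

print(names)
```

The important detail is that member names are relative to the model directory, so `code/inference.py` and `config.json` sit at the root of the archive, matching the listing above.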


Now we can upload our `model.tar.gz` to our session S3 bucket with `sagemaker`.

In [8]:
from sagemaker.s3 import S3Uploader

# create s3 uri
s3_model_path = f"s3://{sess.default_bucket()}/neuronx/{model_id}"

# upload model.tar.gz
s3_model_uri = S3Uploader.upload(local_path="tmp/model.tar.gz",desired_s3_uri=s3_model_path)
print(f"model artifacts uploaded to {s3_model_uri}")

model artifacts uploaded to s3://sagemaker-us-east-1-558105141721/neuronx/yiyanghkust/finbert-tone/model.tar.gz


## 4. Deploy a Real-time Inference Endpoint on Amazon SageMaker

After we have uploaded our `model.tar.gz` to Amazon S3, we can create a custom `HuggingFaceModel`. This class will be used to create and deploy our real-time inference endpoint on Amazon SageMaker.

In [2]:
s3_model_uri="s3://sagemaker-us-east-1-558105141721/neuronx/yiyanghkust/finbert-tone/model.tar.gz"

In [5]:
from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=s3_model_uri,      # path to your model and script
   role=role,                    # iam role with permissions to create an Endpoint
   transformers_version="4.28.0",  # transformers version used
   pytorch_version="1.13.0",       # pytorch version used
   py_version='py38',            # python version used
   model_server_workers=2,       # number of workers for the model server
)

# Let SageMaker know that we've already compiled the model
huggingface_model._is_compiled_model = True

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,      # number of instances
    instance_type="ml.inf2.xlarge" # AWS Inferentia2 instance
)

Defaulting to the only supported framework/algorithm version: 4.28.1. Ignoring framework/algorithm version: 4.28.0.


ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Requested image 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference-neuronx:1.13.0-transformers4.28.0-neuronx-py38-ubuntu20.04 not found.

In [6]:
huggingface_model = HuggingFaceModel(
    model_data="s3://mybucket/train",
    transformers_version="4.28",
    role=role,
    pytorch_version="1.13",
    py_version="py38",
)
container = huggingface_model.prepare_container_def("ml.inf2.xlarge", inference_tool="neuronx")

In [7]:
container

{'Image': '763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference-neuronx:1.13-transformers4.28-neuronx-py38-ubuntu20.04',
 'Environment': {'SAGEMAKER_PROGRAM': '',
  'SAGEMAKER_SUBMIT_DIRECTORY': '',
  'SAGEMAKER_CONTAINER_LOG_LEVEL': '20',
  'SAGEMAKER_REGION': 'us-east-1'},
 'ModelDataUrl': 's3://mybucket/train'}

In [None]:
# 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference-neuronx:1.13.0-transformers4.28.0-neuronx-py38-ubuntu20.04
# 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference-neuronx:1.13.0-transformers4.28.1-neuronx-py38-sdk2.9.1-ubuntu20.04


In [4]:
huggingface_model.prepare_container_def("ml.inf2.xlarge", inference_tool="neuronx")

Defaulting to the only supported framework/algorithm version: 4.28.1. Ignoring framework/algorithm version: 4.28.0.


ValueError: Unsupported base framework: pytorch1.13.1. You may need to upgrade your SDK version (pip install -U sagemaker) for newer base frameworks. Supported base framework(s): version_aliases, pytorch1.13.0.

## 5. Run and evaluate Inference performance of BERT on Inferentia2

The `.deploy()` method returns a `HuggingFacePredictor` object, which can be used to request inference.

In [None]:
data = {
  "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = predictor.predict(data=data)
res

We managed to deploy our Neuron-compiled BERT to AWS Inferentia2 on Amazon SageMaker. Now, let's test its performance. As a dummy load test, we will loop and send 10,000 synchronous requests to our endpoint.

In [None]:
# send 10000 requests
for i in range(10000):
    resp = predictor.predict(
        data={"inputs": "it 's a charming and often affecting journey ."}
    )
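
The loop above only generates load. To also capture client-side latency, the requests can be timed individually and summarized with percentiles. The sketch below is a minimal illustration using only the standard library. `measure_latency` is a hypothetical helper, and `fake_predict` is a stand-in so the block runs without an endpoint; with a live endpoint you would pass `predictor.predict` instead.

```python
import statistics
import time


def measure_latency(predict, payload, n_requests=100):
    """Call predict(payload) n_requests times and summarize latency in milliseconds."""
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predict(payload)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    # quantiles(n=100) returns 99 cut points; index 94 is p95, index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {
        "count": len(latencies_ms),
        "mean_ms": statistics.mean(latencies_ms),
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }


def fake_predict(payload):
    # Stand-in for predictor.predict so the sketch runs offline.
    return [{"label": "placeholder", "score": 1.0}]


stats = measure_latency(fake_predict, {"inputs": "a short test sentence"}, n_requests=50)
print(stats)
```

Client-side numbers include the network round trip and request serialization, so they will be higher than the `ModelLatency` metric that SageMaker reports in CloudWatch.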

Let's inspect the performance in CloudWatch.

In [None]:
print(f"https://console.aws.amazon.com/cloudwatch/home?region={sess.boto_region_name}#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'ModelLatency~'EndpointName~'{predictor.endpoint_name}~'VariantName~'AllTraffic))~view~'timeSeries~stacked~false~region~'{sess.boto_region_name}~start~'-PT5M~end~'P0D~stat~'Average~period~30);query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20{predictor.endpoint_name}")

The average latency for our BERT model is `3-4.5ms` for a sequence length of 128.  

![performance](./imgs/performance.png)
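
When reading these numbers programmatically rather than from the console, keep in mind that SageMaker reports `ModelLatency` in microseconds. The sketch below shows a small conversion helper for datapoints shaped like the `Datapoints` list returned by the CloudWatch `get_metric_statistics` call. The helper name `average_latency_ms` is an assumption, and the sample values are invented for illustration; they are not measurements from this endpoint.

```python
def average_latency_ms(datapoints):
    """Average the `Average` field of CloudWatch datapoints and convert µs to ms.

    Note: this is an unweighted mean of per-period averages.
    """
    if not datapoints:
        raise ValueError("no datapoints to average")
    values_us = [point["Average"] for point in datapoints]
    return sum(values_us) / len(values_us) / 1000.0


# Made-up datapoints for illustration only.
sample = [
    {"Average": 3000.0, "Unit": "Microseconds"},
    {"Average": 4500.0, "Unit": "Microseconds"},
]
latency = average_latency_ms(sample)
print(f"{latency:.2f} ms")
```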


### Delete model and endpoint

To clean up, we can delete the model and endpoint.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()