# Cost-Efficiently Deploy a Sentence Transformer Model with Optimum and AWS SageMaker Serverless Inference

In this end-to-end tutorial, you will:

1. [Convert a sentence transformer model to ONNX with Optimum.](#1-convert-a-sentence-transformer-model-to-onnx-with-optimum)
2. [Create a custom inference script for the SageMaker endpoint.](#2-create-a-custom-inference-script-for-the-sagemaker-endpoint)
3. [Create an AWS Role with the necessary permissions.](#3-create-an-aws-role-with-the-necessary-permissions)
4. [Upload all necessary files to S3.](#4-upload-all-necessary-files-to-s3)
5. [Create a SageMaker model and serverless endpoint.](#5-create-a-sagemaker-model-and-serverless-endpoint)

---

*This code is meant to be run locally. If you want to run it within an AWS-managed environment, e.g. a SageMaker Studio notebook, step 3 will look a bit different, but the rest should remain very similar or the same.*

<a id='1-convert-a-sentence-transformer-model-to-onnx-with-optimum'></a>
## 1. Convert a sentence transformer model to ONNX with [Optimum](https://github.com/huggingface/optimum).

The first step is to load the desired model with the `ORTModelForFeatureExtraction` class. The `export=True` parameter tells it that the checkpoint is a regular Transformers model, so it can load it properly and convert it to ONNX. We also need the tokenizer, so we load it and save it to the same directory, `onnx_path`, as the converted model.

In this case, we do not optimize the final ONNX model, but several techniques, such as graph optimization and/or dynamic quantization, can squeeze more inferences per second out of it. If you want to know more, you can [check this great article about it.](https://www.philschmid.de/optimize-sentence-transformers)

This tutorial uses a [pretrained model from the Hugging Face Hub](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2), but if you have a fine-tuned model, just replace the value of `model_id` with the path to your local model.

In [None]:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

model_id = "sentence-transformers/all-MiniLM-L12-v2"
onnx_path = "tmp"

# load vanilla transformers and convert to onnx
model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

<a id='2-create-a-custom-inference-script-for-the-sagemaker-endpoint'></a>
## 2. Create a custom inference script for the SageMaker endpoint.

To create this inference endpoint, we are going to use the [SageMaker Hugging Face Inference Toolkit](https://github.com/aws/sagemaker-huggingface-inference-toolkit). Because we are deploying a sentence transformer model, we need to specify how inference is performed, and we do this by creating an `inference.py` script. In this case, the script needs three functions:

* `model_fn` - this receives the path to the model directory and returns the model and tokenizer.
* `predict_fn` - this receives the inputs and the output of `model_fn` and returns the predictions.
* `mean_pooling` - this one is just a helper function, as we need a way to calculate the mean pooling over the outputs of the model.
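
To make the pooling step concrete, here is a minimal sketch of masked mean pooling followed by L2 normalization, written with plain Python lists instead of tensors. The helper names `mean_pool` and `l2_normalize` and the toy values are assumptions made for illustration only; the actual script uses the `torch` operations shown further down.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against division by zero when every position is masked out.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


def l2_normalize(vector):
    """Scale the vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, 1e-12) for v in vector]


# Toy sentence with three token positions; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
embedding = l2_normalize(pooled)
print(pooled, embedding)
```

Because the padded position is masked out, its large values do not affect the average, which is why the attention mask has to be taken into account.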

The inference toolkit expects this script to be located inside a directory named `code`, so we create it and save the custom script inside it.

In [None]:
!mkdir code

In [None]:
%%writefile code/inference.py

from transformers import AutoTokenizer
import torch
import torch.nn.functional as F
from optimum.onnxruntime import ORTModelForFeatureExtraction

model_name = 'model' # Must match the name of the .onnx file (without extension) saved in onnx_path

# Helper: Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(model_output.size()).float()
    return torch.sum(model_output * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def model_fn(model_dir):
    # load tokenizer and ONNX model from model_dir
    model = ORTModelForFeatureExtraction.from_pretrained(model_dir, file_name=f"{model_name}.onnx")
    tokenizer = AutoTokenizer.from_pretrained(model_dir)

    return model, tokenizer


def predict_fn(data, model_tokenizer_model_config):
    # unpack model and tokenizer
    model, tokenizer = model_tokenizer_model_config

    # Tokenize sentences
    inputs = data.pop("inputs", data)
    encoded_inputs = tokenizer(inputs, padding=True, truncation=True, return_tensors='pt')
    
    # Compute token embeddings
    with torch.no_grad():
        model_outputs = model(**encoded_inputs)

    # Perform pooling
    sentence_embeddings = mean_pooling(model_outputs["last_hidden_state"], encoded_inputs['attention_mask'])

    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

    # return a dictionary, which will be JSON serializable
    return {"vectors": sentence_embeddings.tolist()}



Because our custom inference script uses additional libraries, we also need to create a `requirements.txt` file listing the libraries that must be installed for inference to run properly. In our case, that is Optimum with ONNX Runtime support.

In [None]:
%%writefile code/requirements.txt

optimum[onnxruntime]

<a id='3-create-an-aws-role-with-the-necessary-permissions'></a>
## 3. Create an AWS Role with the necessary permissions.

To create a role with all the permissions required to deploy this endpoint, first go to **IAM -> Roles -> Create Role**. Once you're there, select the **AWS Account** option and go next. On the Add permissions step, search for `AmazonSageMakerFullAccess`, select it, and go next. Finally, give the role a name and create it. In this tutorial, an AWS user assumes this role, so we need to edit the role's trust policy. To do this, go to Trust relationships and edit the policy to give your desired user permission to assume the role. The entire JSON should look something like the one below; just replace **arn:aws:iam::XXXXXXXXX:user/username** with your user ARN.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com",
                "AWS": "arn:aws:iam::XXXXXXXXX:user/username"
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
        }
    ]
}
```
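
If you prefer to generate this document rather than edit it by hand, the sketch below builds the same trust policy as a Python dictionary and serializes it with the standard `json` module. The helper name `build_trust_policy` is an assumption for illustration, and the ARN is the same placeholder used above; the resulting JSON still has to be pasted into the role's trust policy yourself.

```python
import json


def build_trust_policy(user_arn):
    """Return the trust policy above with the given user ARN as principal."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "sagemaker.amazonaws.com",
                    "AWS": user_arn,
                },
                "Action": "sts:AssumeRole",
                "Condition": {},
            }
        ],
    }


# Placeholder ARN; replace it with your own user ARN.
policy = build_trust_policy("arn:aws:iam::XXXXXXXXX:user/username")
policy_json = json.dumps(policy, indent=4)
print(policy_json)
```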

Finally, to make sure your code uses the correct role and user, create a `.env` file and add the following environment variables with the correct information:
* AWS_ACCESS_KEY_ID
* AWS_SECRET_ACCESS_KEY
* AWS_DEFAULT_REGION
* AWS_ROLE_ARN

In [None]:
from dotenv import load_dotenv

load_dotenv()
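
A missing variable otherwise only surfaces later as a `KeyError` or a credentials error, so it can help to check for all four up front. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` and the example dictionary are assumptions for illustration.

```python
import os

REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "AWS_ROLE_ARN",
)


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# A plain dict stands in for the real environment in this example.
example_env = {"AWS_ACCESS_KEY_ID": "placeholder", "AWS_DEFAULT_REGION": "us-east-1"}
missing = missing_env_vars(env=example_env)
print(f"missing variables: {missing}")
```

Calling `missing_env_vars()` without arguments checks the real environment after `load_dotenv()` has run.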

Now we can create a SageMaker session using the role we set up previously and use it to deploy our endpoint.

In [None]:
import os
import boto3
import sagemaker

role_arn = os.environ["AWS_ROLE_ARN"]

session = boto3.Session()
sts = session.client("sts")
response = sts.assume_role(
    RoleArn=role_arn,
    RoleSessionName="sagemaker-test"
)

boto_session = boto3.Session(
    aws_access_key_id=response['Credentials']['AccessKeyId'],
    aws_secret_access_key=response['Credentials']['SecretAccessKey'],
    aws_session_token=response['Credentials']['SessionToken']
)

sess = sagemaker.Session(boto_session=boto_session)
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()


sess = sagemaker.Session(boto_session=boto_session, default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role_arn}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

<a id='4-upload-all-necessary-files-to-s3'></a>
## 4. Upload all necessary files to S3.

To create our SageMaker model, we first need to create a `model.tar.gz` file with all the necessary model files and the custom inference code, and upload that file to an S3 bucket.

In [None]:
!cp -r code/ tmp/code/

%cd tmp
!tar zcvf model.tar.gz *
%cd ..
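
As an alternative to the shell commands, the archive can also be built and inspected with Python's standard `tarfile` module. The sketch below is an illustration only: the helper name `build_model_archive` is an assumption, and it runs on a throwaway directory with placeholder files that mirror the layout produced by the cell above (model files at the archive root, the script under `code/`).

```python
import tarfile
import tempfile
from pathlib import Path


def build_model_archive(source_dir, archive_path):
    """Pack the contents of source_dir at the root of a gzipped tarball
    and return the sorted list of member names for inspection."""
    source_dir = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(source_dir.iterdir()):
            # arcname drops the parent directory; directories are added recursively.
            tar.add(path, arcname=path.name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demonstration on a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as workdir:
    src = Path(workdir) / "tmp"
    (src / "code").mkdir(parents=True)
    (src / "model.onnx").write_bytes(b"placeholder")
    (src / "code" / "inference.py").write_text("# placeholder\n")
    members = build_model_archive(src, Path(workdir) / "model.tar.gz")

print(members)
```

Listing the members before uploading is a quick way to confirm that `code/inference.py` and the `.onnx` file ended up where the endpoint will look for them.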

In [None]:
from sagemaker.s3 import S3Uploader

# create s3 uri
s3_model_path = f"s3://{sess.default_bucket()}/onnx"

# upload model.tar.gz
s3_model_uri = S3Uploader.upload(local_path="tmp/model.tar.gz", desired_s3_uri=s3_model_path, sagemaker_session=sess)
print(f"model artifacts uploaded to {s3_model_uri}")


<a id='5-create-a-sagemaker-model-and-serverless-endpoint'></a>
## 5. Create a SageMaker model and serverless endpoint.

Now we can leverage the [HuggingFaceModel](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-model) class to create the SageMaker model. As it is our intention to deploy this model as a serverless endpoint, we need to use a [ServerlessInferenceConfig](https://sagemaker.readthedocs.io/en/v2.203.0/api/inference/serverless.html) to configure the endpoint to our needs. Then we only need to run the model's `deploy` method and pass the configuration object. This step may take a few minutes.

In [None]:
from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.serverless.serverless_inference_config import ServerlessInferenceConfig


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=s3_model_uri,       # path to your model and script
   role=role_arn,                 # iam role with permissions to create an Endpoint
   transformers_version="4.12",   # transformers version used
   pytorch_version="1.9",         # pytorch version used
   py_version='py38',             # python version used
   sagemaker_session=sess
)


serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=3072,
    max_concurrency=1,
)

predictor = huggingface_model.deploy(serverless_inference_config=serverless_config)
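
When experimenting with other endpoint configurations, a small pre-flight check can catch invalid values before the deployment call is made. The sketch below is a hypothetical helper, not part of the SageMaker SDK. It assumes that serverless endpoints accept memory sizes in 1 GB steps from 1024 MB to 6144 MB, so please confirm the currently supported values in the AWS documentation.

```python
# Assumption: memory sizes accepted for serverless endpoints (1 GB steps, 1 GB to 6 GB).
ALLOWED_MEMORY_SIZES_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def check_serverless_settings(memory_size_in_mb, max_concurrency):
    """Return a list of problems; an empty list means the values look plausible."""
    problems = []
    if memory_size_in_mb not in ALLOWED_MEMORY_SIZES_MB:
        problems.append(
            f"memory_size_in_mb={memory_size_in_mb} is not one of {ALLOWED_MEMORY_SIZES_MB}"
        )
    if max_concurrency < 1:
        problems.append(f"max_concurrency={max_concurrency} must be at least 1")
    return problems


# The values used in this tutorial.
problems = check_serverless_settings(memory_size_in_mb=3072, max_concurrency=1)
print(problems)
```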

Now that everything is up and running we can start running inferences.

In [None]:
result = predictor.predict({"inputs": ["this is a test text"]})
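
The response is a dictionary whose `vectors` key holds one embedding per input sentence. Because `predict_fn` L2-normalizes the embeddings, the cosine similarity between two of them reduces to a dot product. The sketch below shows this with small hand-made unit vectors standing in for real model output; the helper name and the values are assumptions for illustration.

```python
def cosine_similarity_normalized(a, b):
    """Cosine similarity of two unit-length vectors, i.e. their dot product."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical stand-in for result["vectors"]; real embeddings have many more dimensions.
vectors = [[0.6, 0.8], [0.8, 0.6], [0.8, -0.6]]

query = vectors[0]
scores = [cosine_similarity_normalized(query, v) for v in vectors]
print(scores)
```

The first score compares the query with itself and is therefore the highest; the last vector is orthogonal to the query and scores close to zero.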

## Delete Resources

To clean up, we can delete the model and endpoint.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()

## Conclusion

Well done! In this tutorial, you learned how to convert a Sentence Transformers model to ONNX and how to deploy the converted model as a SageMaker Serverless Inference endpoint.
Further steps may include testing different ONNX optimizations and/or different endpoint configurations to find the most suitable setting for your scenario.

### References:

1. [Accelerate Sentence Transformers with Hugging Face Optimum](https://www.philschmid.de/optimize-sentence-transformers)
2. [Accelerated document embeddings with Hugging Face Transformers and AWS Inferentia](https://www.philschmid.de/huggingface-sentence-transformers-aws-inferentia#2-create-a-custom-inferencepy-script-for-sentence-embeddings)
3. [SageMaker Serverless Inference](https://github.com/aws/amazon-sagemaker-examples/blob/main/serverless-inference/huggingface-serverless-inference/huggingface-text-classification-serverless-inference.ipynb)