# Deploying a Hugging Face Model via TorchServe for Bulk Embedding Inference

In this notebook, we'll deploy a `sentence_transformers` model with [TorchServe](https://pytorch.org/serve/). To accelerate inference we will also use features from the `optimum` [library](https://github.com/huggingface/optimum) to apply graph optimization and/or quantization to the model. Finally, we'll run some benchmarks to understand performance tradeoffs.

### Install packages

In [2]:
!pip -q install --upgrade transformers datasets "optimum[onnxruntime-gpu]"

### Imports

In [1]:
import torch
import base64
import json
import os
import random
import sys
import transformers

In [2]:
print(f"Notebook runtime: {'GPU' if torch.cuda.is_available() else 'CPU'}")
print(f"PyTorch version : {torch.__version__}")
print(f"Transformers version : {transformers.__version__}")

Notebook runtime: GPU
PyTorch version : 2.0.1+cu118
Transformers version : 4.33.2


## Deployment

### Overview   

We'll deploy a container running [PyTorch's TorchServe](https://pytorch.org/serve/) to serve embeddings from the sentence-embedding model `bge-base-en`, available on the [Hugging Face Hub](https://huggingface.co/BAAI/bge-base-en).

### Save model locally

In [3]:
APP_NAME = "test_bge_embedder"
MODEL_DIR = "./"+APP_NAME+"_predictor"
os.makedirs(MODEL_DIR, exist_ok=True)

In [4]:
from transformers import AutoTokenizer, AutoModel

model_id = "BAAI/bge-base-en"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# save base model
pt_save_directory = os.path.join(MODEL_DIR, "model")
tokenizer.save_pretrained(pt_save_directory)
model.save_pretrained(pt_save_directory)

### Save ORT model locally

The ONNX model can be directly optimized during the ONNX export using [Optimum CLI](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization), by passing the argument `--optimize {O1,O2,O3,O4}` in the CLI. Here, we'll apply O4 graph optimizations and save the ORT model locally.

In [5]:
ort_save_directory = os.path.join(MODEL_DIR, "ort_model")

In [None]:
!optimum-cli export onnx \
  --model $model_id \
  --task feature-extraction \
  --optimize O4 \
  --device cuda \
  $ort_save_directory # output folder

### Create a custom model handler 

Please refer to the [TorchServe documentation](https://pytorch.org/serve/custom_service.html) for defining a custom handler.

In [7]:
USE_ORT = True # True if using ORT model

In [8]:
%%bash -s $APP_NAME $USE_ORT

APP_NAME=$1
USE_ORT=$2

cat << EOF > ./${APP_NAME}_predictor/custom_handler.py

import os
import json
import logging

import torch
from transformers import AutoModel, AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForFeatureExtraction

from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)

class SentenceTransformersHandler(BaseHandler):
    """
    The handler takes an input string and returns the embedding 
    based on the serialized transformers checkpoint.
    """
    def __init__(self):
        super(SentenceTransformersHandler, self).__init__()
        self.initialized = False
        self.use_ort = $USE_ORT

    def initialize(self, ctx):
        """ Loads the serialized model (model.onnx or pytorch_model.bin) and
        instantiates the tokenizer used for preprocessing.
        """
        self.manifest = ctx.manifest

        properties = ctx.system_properties
        model_dir = properties.get("model_dir")
        self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")

        # Read model serialize/pt file
        serialized_file = self.manifest["model"]["serializedFile"]
        model_pt_path = os.path.join(model_dir, serialized_file)
        
        logger.info(f"model_pt_path: {model_pt_path}")
        logger.info(f"model_dir: {model_dir}")
        
        if not os.path.isfile(model_pt_path):
            raise RuntimeError("Missing the model.onnx or pytorch_model.bin file")
        
        # Load model
        if self.use_ort:
            logger.info(f"----Loading ORT model----")
            self.model = ORTModelForFeatureExtraction.from_pretrained(model_dir, provider="CUDAExecutionProvider")
            self.tokenizer = AutoTokenizer.from_pretrained(model_dir, model_max_length=512)
            
        else:
            logger.info(f"----Loading PyTorch model----")
            self.model = AutoModel.from_pretrained(model_dir)
            self.tokenizer = AutoTokenizer.from_pretrained(model_dir, model_max_length=512)
        
        self.model.to(self.device)
        self.initialized = True

    def preprocess(self, requests):
        """ Preprocessing input request by tokenizing
            Extend with your own preprocessing steps as needed
            
            [{'data': b'I am creating an endpoint using TorchServe and HF transformers\n'}]
        """
        input_texts = []
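        for request in requests:
            text = request.get("data")
            if text is None:
                text = request.get("body")

            # Payloads arrive as bytes after base64 decoding; pass str through unchanged
            if isinstance(text, (bytes, bytearray)):
                text = text.decode('utf-8')
            input_texts.append(text)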
        logger.info("Received text: '%s'", input_texts)
        
        return input_texts

    def inference(self, input_texts):
        """ Embed a batch of texts: tokenize, run the model, apply CLS pooling,
            and L2-normalize the result.
        """
        encoded_input = self.tokenizer(input_texts, padding=True, truncation=True, return_tensors='pt').to(self.device)
        
        with torch.no_grad():
            model_output = self.model(**encoded_input)
            sentence_embeddings = model_output[0][:, 0] # Perform pooling. In this case, cls pooling.
        
        sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1).cpu().tolist()
        
        logger.info(f"Model embedded: {len(sentence_embeddings)}")
        logger.info(f"Output type: {type(sentence_embeddings)}")
        return sentence_embeddings

    def postprocess(self, inference_output):
        return inference_output
    
EOF
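
To make the pooling step in `inference` concrete, here is a minimal pure-Python sketch of CLS pooling followed by L2 normalization. It mirrors `model_output[0][:, 0]` and `torch.nn.functional.normalize(..., p=2, dim=1)` on a toy nested list. The helper names `cls_pool` and `l2_normalize` and the toy values are illustrative assumptions, not part of the handler.

```python
import math


def cls_pool(token_embeddings):
    """Take the first ([CLS]) token vector from each sequence.

    token_embeddings is a nested list shaped [batch, seq_len, hidden].
    """
    return [sequence[0] for sequence in token_embeddings]


def l2_normalize(vectors, eps=1e-12):
    """Scale each vector to unit L2 norm; eps guards against division by zero."""
    normalized = []
    for vec in vectors:
        norm = max(math.sqrt(sum(x * x for x in vec)), eps)
        normalized.append([x / norm for x in vec])
    return normalized


# Toy batch: 2 sequences, 3 tokens each, hidden size 2 (made-up values).
toy_output = [
    [[3.0, 4.0], [9.0, 9.0], [9.0, 9.0]],
    [[0.0, 2.0], [7.0, 7.0], [7.0, 7.0]],
]
embeddings = l2_normalize(cls_pool(toy_output))
print(embeddings)  # [[0.6, 0.8], [0.0, 1.0]]
```

Only the first token of each sequence contributes to the embedding, so the padding and remaining tokens affect the result only through the model's attention, not through the pooling itself.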

### Create custom container image

#### 1. Build TorchServe image compatible with ONNX

To make use of ONNX, we need to use an NVIDIA CUDA runtime as the base image for TorchServe as specified [here](https://github.com/pytorch/serve/blob/master/docker/README.md). This requires building a custom image, which the following commands will do for you. The output will be a new Docker image called `pytorch/torchserve:latest-gpu` with the CUDA libraries necessary for running ONNX.

In [15]:
!git clone https://github.com/pytorch/serve.git && \
    cd serve/docker && \
    DOCKER_BUILDKIT=1 docker build --file Dockerfile \
                                --build-arg BASE_IMAGE="nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04" \
                                --build-arg CUDA_VERSION="cu118" \
                                --build-arg PYTHON_VERSION="3.10" \
                                --build-arg BUILD_NIGHTLY="false" \
                                -t "pytorch/torchserve:latest-gpu" \
                                --target production-image .

Cloning into 'serve'...
remote: Enumerating objects: 46147, done.
remote: Counting objects: 100% (7773/7773), done.
remote: Compressing objects: 100% (1167/1167), done.
remote: Total 46147 (delta 6711), reused 7137 (delta 6434), pack-reused 38374
Receiving objects: 100% (46147/46147), 72.29 MiB | 46.56 MiB/s, done.
Resolving deltas: 100% (28602/28602), done.
[+] Building 0.1s (1/2)
 => [internal] load build definition from Dockerfile                       0.1s
 => => transferring dockerfile: 6.63kB                                     0.0s
 => [internal] load .dockerignore                                          0.1s
 => => transferring context: 2B                                            0.0s
[+] Building 0.3s (2/3)

#### 2. Create a Dockerfile with the new TorchServe image as base image

In [16]:
MODEL_PATH = "./ort_model/" if USE_ORT else "./model/"
MODEL_FILE = "model.onnx" if USE_ORT else "pytorch_model.bin"

In [17]:
%%bash -s $APP_NAME $MODEL_FILE $MODEL_PATH

APP_NAME=$1
MODEL_FILE=$2
MODEL_PATH=$3

cat << EOF > ./${APP_NAME}_predictor/Dockerfile

FROM pytorch/torchserve:latest-gpu

# install dependencies
RUN python3 -m pip install --upgrade pip
RUN pip3 install transformers optimum[onnxruntime-gpu]

USER model-server

# copy model artifacts, custom handler and other dependencies
COPY custom_handler.py /home/model-server/
COPY $MODEL_PATH /home/model-server/

# create torchserve configuration file
USER root
RUN printf "\nservice_envelope=json" >> /home/model-server/config.properties
RUN printf "\ninference_address=http://0.0.0.0:7080" >> /home/model-server/config.properties
RUN printf "\nmanagement_address=http://0.0.0.0:7081" >> /home/model-server/config.properties
RUN printf "\njob_queue_size=1000" >> /home/model-server/config.properties

    
# expose health and prediction listener ports from the image
EXPOSE 7080
EXPOSE 7081

# create model archive file packaging model artifacts and dependencies
RUN torch-model-archiver -f \
  --model-name=$APP_NAME \
  --version=1.0 \
  --serialized-file=/home/model-server/$MODEL_FILE \
  --handler=/home/model-server/custom_handler.py \
  --extra-files "/home/model-server/config.json,/home/model-server/tokenizer.json,/home/model-server/tokenizer_config.json,/home/model-server/special_tokens_map.json,/home/model-server/vocab.txt" \
  --export-path=/home/model-server/model-store

# run the TorchServe HTTP server to respond to prediction requests
# note that we don't register a model upon startup...
# we'll use the Management API for this later on
CMD ["torchserve", \
     "--start", \
     "--ts-config=/home/model-server/config.properties", \
     "--model-store", \
     "/home/model-server/model-store"]

EOF

echo "Writing ./${APP_NAME}_predictor/Dockerfile"

Writing ./test_bge_embedder_predictor/Dockerfile
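
The archiver step in the Dockerfile expects the model file and the tokenizer and config files to sit at the top level of the copied model folder. A quick pre-build check could look like the sketch below. `missing_artifacts` is a hypothetical helper, and the file names are the ones listed in `--extra-files` above.

```python
import os

# File names taken from the --extra-files argument in the Dockerfile.
REQUIRED_FILES = [
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "vocab.txt",
]


def missing_artifacts(model_path, model_file, required=REQUIRED_FILES):
    """Return the expected file names that are absent from model_path."""
    expected = [model_file, *required]
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(model_path, name))
    ]
```

For example, `missing_artifacts("./test_bge_embedder_predictor/ort_model", "model.onnx")` should return an empty list before the image is built.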


In [18]:
APP_DIR = APP_NAME+"_predictor"

In [19]:
APP_NAME, APP_DIR

('test_bge_embedder', 'test_bge_embedder_predictor')

#### 3. Build the TorchServe container with our custom handler

In [20]:
CUSTOM_PREDICTOR_IMAGE_URI = f"pytorch_predict_{APP_NAME}_{str(USE_ORT).lower()}"
print(f"CUSTOM_PREDICTOR_IMAGE_URI = {CUSTOM_PREDICTOR_IMAGE_URI}")

CUSTOM_PREDICTOR_IMAGE_URI = pytorch_predict_test_bge_embedder_true


In [21]:
!docker build \
  --tag=$CUSTOM_PREDICTOR_IMAGE_URI \
  ./$APP_DIR

Sending build context to Docker daemon  668.4MB
Step 1/16 : FROM pytorch/torchserve:latest-gpu
 ---> c2fd7df76dc9
Step 2/16 : RUN python3 -m pip install --upgrade pip
 ---> Using cache
 ---> a7407e5de4f1
Step 3/16 : RUN pip3 install transformers optimum[onnxruntime-gpu]
 ---> Using cache
 ---> a3cb1a6a27ff
Step 4/16 : USER model-server
 ---> Using cache
 ---> 136b1aca1136
Step 5/16 : COPY custom_handler.py /home/model-server/
 ---> 867b117a724c
Step 6/16 : COPY ./ort_model /home/model-server/
 ---> 1faf117c6315
Step 7/16 : USER root
 ---> Running in a5b06607c29c
Removing intermediate container a5b06607c29c
 ---> b50496f7ea0a
Step 8/16 : RUN printf "\nservice_envelope=json" >> /home/model-server/config.properties
 ---> Running in 25f9508a5974
Removing intermediate container 25f9508a5974
 ---> ac23e18e2c9b
Step 9/16 : RUN printf "\ninference_address=http://0.0.0.0:7080" >> /home/model-server/config.properties
 ---> Running in c35c8b7809f4
Removing intermediate container c35c8b7809f4
 ---

### Test the TorchServe API locally

#### 1. Run our Docker container and map the necessary ports

In [22]:
!docker run -td --rm -p 7080:7080 -p 7081:7081 -p 8082:8082 --gpus all --name=$APP_NAME $CUSTOM_PREDICTOR_IMAGE_URI
!sleep 5

40ab1b7a9dd71369ea24088b43ce35b700f58bb88eb0d6c67cfe5f492dfcaa61
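
A fixed `sleep 5` may not be long enough on a slower machine. As an alternative, the sketch below polls the `/ping` endpoint until it reports `Healthy`. The helper names, the timeout, and the polling interval are assumptions chosen for illustration.

```python
import http.client
import json
import time
import urllib.request


def ping_torchserve(url="http://localhost:7080/ping"):
    """Return True if the ping endpoint reports a healthy server."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return json.load(resp).get("status") == "Healthy"
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed response: not ready yet.
        return False


def wait_until_healthy(ping=ping_torchserve, timeout_s=60.0, interval_s=1.0,
                       sleep=time.sleep):
    """Call ping() repeatedly until it succeeds or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if ping():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_healthy()` after `docker run` returns `True` as soon as the server answers, and `False` if it never becomes healthy within the timeout.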


#### 2. Use TorchServe's Management API to register our model(s)

When registering the model, we can optionally (a) start multiple workers (replicas) and (b) enable [batched inference](https://pytorch.org/serve/batch_inference_with_ts.html).

In [23]:
BATCH_SIZE = 8
N_WORKERS = 2

In [24]:
# Without batched inference
# !curl -X POST "localhost:7081/models?url=test_bge_embedder.mar&initial_workers=$N_WORKERS&model_name=test_bge_embedder"

# With batched inference
!curl -X POST "localhost:7081/models?url=test_bge_embedder.mar&batch_size=$BATCH_SIZE&max_batch_delay=1000&initial_workers=$N_WORKERS&model_name=test_bge_embedder"

{
  "status": "Model \"test_bge_embedder\" Version: 1.0 registered with 2 initial workers"
}
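
As we understand TorchServe's batching, `batch_size` is the maximum number of requests handed to the handler at once, and `max_batch_delay` is the maximum time in milliseconds to wait for a batch to fill. The sketch below is a simplified model of that behaviour, not TorchServe's actual scheduler: it ignores worker availability and processing time, and `simulate_batches` is a hypothetical helper.

```python
def simulate_batches(arrivals_ms, batch_size, max_batch_delay_ms):
    """Group request arrival times (in ms) into batches.

    A batch opens when its first request arrives and closes when it holds
    batch_size requests or max_batch_delay_ms has passed, whichever is first.
    """
    pending = sorted(arrivals_ms)
    batches = []
    i = 0
    while i < len(pending):
        window_end = pending[i] + max_batch_delay_ms
        batch = [pending[i]]
        i += 1
        while (i < len(pending)
               and len(batch) < batch_size
               and pending[i] <= window_end):
            batch.append(pending[i])
            i += 1
        batches.append(batch)
    return batches


# Three requests arrive close together, two more arrive 1.5 s later.
arrivals = [0, 10, 20, 1500, 1510]
print(simulate_batches(arrivals, batch_size=8, max_batch_delay_ms=1000))
# [[0, 10, 20], [1500, 1510]]
```

Under this model, a large `max_batch_delay` only adds latency when traffic is too sparse to fill a batch. Under heavy load the batch fills before the delay expires.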


##### Run a health check

In [25]:
!curl http://localhost:7080/ping

{
  "status": "Healthy"
}


##### Confirm our deployment's model configuration

In [27]:
!curl http://localhost:7081/models/test_bge_embedder

[
  {
    "modelName": "test_bge_embedder",
    "modelVersion": "1.0",
    "modelUrl": "test_bge_embedder.mar",
    "runtime": "python",
    "minWorkers": 2,
    "maxWorkers": 2,
    "batchSize": 8,
    "maxBatchDelay": 1000,
    "loadedAtStartup": false,
    "workers": [
      {
        "id": "9000",
        "startTime": "2023-09-29T19:36:48.967Z",
        "status": "READY",
        "memoryUsage": 3799924736,
        "pid": 63,
        "gpu": true,
        "gpuUsage": "gpuId::1 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::1148 MiB"
      },
      {
        "id": "9001",
        "startTime": "2023-09-29T19:36:48.968Z",
        "status": "READY",
        "memoryUsage": 3206426624,
        "pid": 64,
        "gpu": true,
        "gpuUsage": "gpuId::0 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::2010 MiB"
      }
    ],
    "jobQueueStatus": {
      "remainingCapacity": 1000,
      "pendingRequests": 0
    }
  }
]
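
Rather than reading the JSON by eye, the worker status can be checked programmatically. The sketch below parses a describe-model response and confirms that at least `minWorkers` workers are `READY`. `all_workers_ready` is a hypothetical helper, and the sample string is a trimmed copy of the response shown above.

```python
import json


def all_workers_ready(describe_response):
    """Return True if every model has at least minWorkers workers in READY state."""
    models = json.loads(describe_response)
    for model in models:
        ready = sum(
            1 for worker in model.get("workers", [])
            if worker.get("status") == "READY"
        )
        if ready < model.get("minWorkers", 1):
            return False
    return bool(models)


# Trimmed copy of the management API response above.
sample_response = """
[
  {
    "modelName": "test_bge_embedder",
    "minWorkers": 2,
    "workers": [
      {"id": "9000", "status": "READY"},
      {"id": "9001", "status": "READY"}
    ]
  }
]
"""
print(all_workers_ready(sample_response))  # True
```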


##### Send test request

Here we'll use a long input to simulate long documents being embedded.

In [46]:
TEST_INPUT = "hello " * 510

In [48]:
%%bash -s $APP_NAME

APP_NAME=$1

cat > ./${APP_NAME}_predictor/instances.json << END
{ 
   "instances": [
     { 
       "data": {
         "b64": "$(printf 'hello %.0s' $(seq 1 510) | base64 --wrap=0)"
       }
     }
   ]
}
END

curl -s -X POST \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @./${APP_NAME}_predictor/instances.json \
  http://localhost:7080/predictions/$APP_NAME/

{"predictions": [[0.0027409277390688658, -0.0005287091480568051, 0.014051479287445545, 0.009391368366777897, 0.04596078768372536, 0.027604931965470314, 0.049034327268600464, 0.019978001713752747, -0.014058594591915607, -0.049546584486961365, -0.015737656503915787, 0.009661726653575897, -0.05654742196202278, 0.015054648742079735, -0.002073927316814661, 0.03420734778046608, 0.05885257571935654, 0.014727373607456684, 0.010892564430832863, 0.004596078768372536, 0.006040357518941164, 0.03010929748415947, 0.019124241545796394, 0.025541676208376884, 0.02521440200507641, -0.010102835483849049, 0.023122688755393028, 0.05202249065041542, -0.10159753262996674, 0.023023081943392754, 0.007890172302722931, 0.005545887630432844, 0.0013073212467133999, -0.01532500609755516, -0.003941528964787722, -0.058909494429826736, -0.024104513227939606, -0.004322163760662079, 0.0014531719498336315, -0.03417889028787613, -0.02850138023495674, -0.019294993951916695, -0.025086337700486183, 0.020746387541294098, -0.0
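
The same request body can also be built in Python, which avoids hand-writing the long input. This is a minimal sketch assuming the JSON envelope format used above (`instances` with base64-encoded `data`). `build_payload` is a hypothetical helper name.

```python
import base64
import json


def build_payload(texts):
    """Wrap each text as a base64-encoded instance in the JSON envelope."""
    instances = [
        {"data": {"b64": base64.b64encode(text.encode("utf-8")).decode("ascii")}}
        for text in texts
    ]
    return {"instances": instances}


TEST_INPUT = "hello " * 510
payload = build_payload([TEST_INPUT])
body = json.dumps(payload)  # string to send or to save as instances.json
```

Writing `body` to `instances.json` in the predictor folder would give a file in the same format as the one the bash cell creates.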

##### Run a load test to benchmark performance with Apache Bench

First, we need to install [Apache Bench](https://httpd.apache.org/docs/2.4/programs/ab.html). Then we can run a load test and save the results to a file.

In [95]:
!apt-get update
!apt-get install -y apache2-utils

In [30]:
REQUESTS = 2000
CONCURRENCY = 1000

!ab -l -c $CONCURRENCY -n $REQUESTS -k -p ./$APP_DIR/instances.json -T "application/json" http://localhost:7080/predictions/$APP_NAME/ > ./$APP_DIR/result.txt

Completed 200 requests
Completed 400 requests
Completed 600 requests
Completed 800 requests
Completed 1000 requests
Completed 1200 requests
Completed 1400 requests
Completed 1600 requests
Completed 1800 requests
Completed 2000 requests
Finished 2000 requests


**Benchmarking Results** (RPS = requests per second)
- 1 GPU, without ORT: 25.75 RPS 
- 1 GPU, with ORT: 86.47 RPS
- 2 GPU, without ORT: 52.31 RPS
- 2 GPU, with ORT: 153.33 RPS
- 2 GPU, with ORT, batch size 4: 171.52 RPS
- 2 GPU, with ORT, batch size 8: 183.93 RPS
- 2 GPU, with ORT, batch size 16: 187.43 RPS
- 2 GPU, with ORT, batch size 32: 185.46 RPS
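
To put these numbers in perspective, the relative speedups can be computed directly from the list above. The dictionary keys and the `speedup` helper are illustrative names; the values are copied from the results.

```python
# Requests per second copied from the benchmarking results above.
RPS = {
    "1gpu_pytorch": 25.75,
    "1gpu_ort": 86.47,
    "2gpu_pytorch": 52.31,
    "2gpu_ort": 153.33,
    "2gpu_ort_bs16": 187.43,
}


def speedup(new, baseline):
    """Ratio of throughput between two configurations, rounded to 2 decimals."""
    return round(RPS[new] / RPS[baseline], 2)


print(speedup("1gpu_ort", "1gpu_pytorch"))    # 3.36: ORT vs PyTorch on 1 GPU
print(speedup("2gpu_ort", "2gpu_pytorch"))    # 2.93: ORT vs PyTorch on 2 GPUs
print(speedup("2gpu_ort", "1gpu_ort"))        # 1.77: second GPU with ORT
print(speedup("2gpu_ort_bs16", "2gpu_ort"))   # 1.22: batch size 16 vs unbatched
```

In these runs, switching to ORT gives the largest gain (roughly 3x), adding a second GPU comes next, and batching adds a further 20% or so, with throughput levelling off between batch sizes 16 and 32.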

In [None]:
!docker stop $APP_NAME

**Resources**:
   - On batching:
        - https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching
   - On TorchServe:
        - https://github.com/pytorch/serve/tree/master/examples/Huggingface_Transformers
   - On Locust:
        - https://medium.com/@tferreiraw/performing-load-tests-with-python-locust-io-62de7d91eebd
        - https://medium.com/@ashmi_banerjee/3-step-tutorial-to-performance-test-ml-serving-apis-using-locust-and-fastapi-40e6cc580adc