# Deploying a Hugging Face model to Google Vertex AI

Inspired by the [GCP tutorial](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/community-content/pytorch_text_classification_using_vertex_sdk_and_gcloud/pytorch-text-classification-vertex-ai-train-tune-deploy.ipynb), we will deploy a `sentence-transformers` model on a [Vertex AI](https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api) endpoint. We will use [TorchServe](https://pytorch.org/serve/) to serve a Hugging Face model available on the [Hub](https://hf.co). To accelerate inference, we will also use the `optimum` [library](https://github.com/huggingface/optimum) to apply graph optimization and/or quantization to the model.

### Set up your local development environment

1. Follow the Google Cloud guide to [setting up a Python development environment](https://cloud.google.com/python/docs/setup) 
2. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/) 
3. Create a virtual environment (virtualenv, pyenv) with Python 3 (<3.9) and activate it.
4. Launch Jupyter Notebook from this environment.



### Install packages

In [None]:
!pip -q install --upgrade google-cloud-aiplatform  # Vertex AI SDK
!pip -q install --upgrade transformers
!pip -q install --upgrade datasets
!pip -q install --upgrade 'optimum[onnxruntime]'

### Set up your Google Cloud project

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager)
1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project)
1. Enable the following APIs in your project; they are required to run this tutorial:
    - [Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com)
    - [Cloud Storage API](https://console.cloud.google.com/flows/enableapi?apiid=storage.googleapis.com)
    - [Container Registry API](https://console.cloud.google.com/flows/enableapi?apiid=containerregistry.googleapis.com)
    - [Cloud Build API](https://console.cloud.google.com/flows/enableapi?apiid=cloudbuild.googleapis.com)
   
1. Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.

### Authenticate to gcloud

 1. In the Cloud Console, go to the [**Create service account key** page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).
 2. Click **Create service account**.
 3. In the **Service account name** field, enter a name, and click **Create**.
 4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "Vertex AI" into the filter box and select **Vertex AI Administrator**. Then type "Storage Object Admin" into the filter box and select **Storage Object Admin**.
 5. Click **Create**. A JSON file that contains your key downloads to your local environment.
 6. Enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.

In [None]:
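# Change the path below to your own service account key.
# Keep the comment on its own line: text after the value on a %env line becomes part of the value.
%env GOOGLE_APPLICATION_CREDENTIALS=./keys/huggingface-ml-e974975230cc.json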

In [None]:
# Get your Google Cloud project ID using google.auth
import google.auth

_, PROJECT_ID = google.auth.default()
print("Project ID: ", PROJECT_ID)

# Or set it manually (this overrides the value detected above)
PROJECT_ID = "huggingface-ml"

### Create a cloud storage bucket

In [None]:
BUCKET_NAME = "gs://florent-bucket"  # <---CHANGE THIS TO YOUR BUCKET
REGION = "us-central1"

**If the bucket doesn't exist, run the following:**

In [None]:
#! gsutil mb -l $REGION $BUCKET_NAME

List the contents of the bucket:

In [None]:
! gsutil ls -al $BUCKET_NAME

### Imports

In [None]:
import torch
import base64
import json
import os
import random
import sys
import transformers

import google.auth
from google.cloud import aiplatform
from google.cloud.aiplatform import gapic as aip
from google.protobuf.json_format import MessageToDict

In [None]:
print(f"Notebook runtime: {'GPU' if torch.cuda.is_available() else 'CPU'}")
print(f"PyTorch version : {torch.__version__}")
print(f"Transformers version : {transformers.__version__}")

In [None]:
APP_NAME = "test_sbert_embedder_optimum"

## Deployment

#### *Overview*

Deploying a PyTorch model on [Vertex AI Predictions](https://cloud.google.com/vertex-ai/docs/predictions/getting-predictions) requires a custom container that serves online predictions. You will deploy a container running [PyTorch's TorchServe](https://pytorch.org/serve/) to serve predictions from the fine-tuned sentence transformer model `msmarco-distilbert-base-tas-b`, available on the [Hugging Face Hub](https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b).

To deploy a PyTorch model on Vertex AI Predictions, the steps are:
1. Package the trained model artifacts, including [default](https://pytorch.org/serve/#default-handlers) or [custom](https://pytorch.org/serve/custom_service.html) handlers, by creating an archive file with the [Torch model archiver](https://github.com/pytorch/serve/tree/master/model-archiver).
2. Build a [custom container](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements) compatible with Vertex AI Predictions to serve the model using TorchServe.
3. Upload the model with the custom container image as a Vertex AI Model resource.
4. Create a Vertex AI Endpoint and [deploy the model](https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api) resource to it.

#### *How to improve latency*

Here the model is deployed on CPU. To reduce latency, we use the [Hugging Face Optimum](https://github.com/huggingface/optimum) library to convert the model to the [ONNX (Open Neural Network eXchange)](http://onnx.ai/) format and apply graph optimization and/or quantization. To learn more about these techniques, consult:
- [Hugging Face Optimum documentation](https://huggingface.co/docs/optimum/quickstart)
- [Convert Transformers to ONNX with Hugging Face Optimum](https://huggingface.co/blog/convert-transformers-to-onnx#2-what-is-hugging-face-optimum)
- [Graph Optimizations in ONNX Runtime](https://onnxruntime.ai/docs/performance/graph-optimizations.html)
- [Quantize ONNX Models](https://onnxruntime.ai/docs/performance/quantization.html)

These operations must be performed before running the Torch model archiver. The exported ONNX model is then loaded in the custom handler.

### Save model locally

In [None]:
!mkdir -p ./predictor

In [None]:
from transformers import AutoTokenizer, AutoModel

model_name = "sentence-transformers/msmarco-distilbert-base-tas-b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


pt_save_directory = "./predictor/model/"

tokenizer.save_pretrained(pt_save_directory)
model.save_pretrained(pt_save_directory)

### Apply optimum optimizations

Graph optimization is enough for the latency we need here, but you can also apply quantization with `ORTQuantizer` if you need faster predictions. Note that quantization may reduce the accuracy of the model. See the [documentation](https://huggingface.co/docs/optimum/main/en/pipelines#quantizing-with-ortquantizer).
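
To illustrate why quantization can cost some accuracy, here is a minimal, library-free sketch of symmetric 8-bit quantization applied to a handful of made-up weights. It is not how `ORTQuantizer` works internally; the function names and values are assumptions chosen only for illustration.

```python
def quantize_int8(values):
    """Map floats to the int8 range using a single symmetric scale (illustrative only)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the quantized integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.031, 0.9, -0.77]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding step loses information: restored values differ slightly from the originals
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, f"max rounding error: {max_error:.5f}")
```

Each weight is stored with at most half a quantization step of error; across millions of weights these small errors can shift the embeddings, which is why the quantized model should be evaluated before deployment.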

In [None]:
from pathlib import Path
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig
from optimum.pipelines import pipeline


pt_save_directory_optimum = "./predictor/optimum/"

save_path = Path("optimum_model")
save_path.mkdir(exist_ok=True)

# Use ORTOptimizer to export the model and define the graph optimization configuration
optimizer = ORTOptimizer(model=model, tokenizer=tokenizer)
optimization_config = OptimizationConfig(optimization_level=2)


# apply the optimization configuration to the model
optimizer.export(
    onnx_model_path=save_path / "model.onnx",
    onnx_optimized_model_output_path=save_path / "model-optimized.onnx",
    optimization_config=optimization_config,
)

optimizer.model.config.save_pretrained(save_path) # saves config.json 

model = ORTModelForFeatureExtraction.from_pretrained(save_path, file_name="model-optimized.onnx")

tokenizer.save_pretrained(pt_save_directory_optimum)
model.save_pretrained(pt_save_directory_optimum)

#You can also push the model to the HF hub
#model.push_to_hub(pt_save_directory_optimum,
#                  repository_id="onnx-msmarco-distilbert-base-tas-b",
#                  use_auth_token=True
#                  )

### Create a custom model handler 

Please refer to the [TorchServe documentation](https://pytorch.org/serve/custom_service.html) for defining a custom handler.

In [None]:
%%writefile predictor/custom_handler.py

import os
import json
import logging

import torch
from transformers import AutoModel, AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig
from optimum.pipelines import pipeline

from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)
torch.set_num_threads(1)

class SentenceTransformersHandler(BaseHandler):
    """
    The handler takes an input string and returns the embedding 
    based on the serialized transformers checkpoint.
    """
    def __init__(self):
        super(SentenceTransformersHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        """ Loads the model.onnx file and initializes the model object.
        Instantiates the tokenizer used for preprocessing and a feature-extraction pipeline.
        """
        self.manifest = ctx.manifest

        properties = ctx.system_properties
        model_dir = properties.get("model_dir")
        #self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")

        # Check that the serialized model file listed in the manifest exists
        serialized_file = self.manifest["model"]["serializedFile"]
        model_pt_path = os.path.join(model_dir, serialized_file)
        if not os.path.isfile(model_pt_path):
            raise RuntimeError("Missing the model.onnx or pytorch_model.bin file")
        
        # Load model
        self.model = ORTModelForFeatureExtraction.from_pretrained(model_dir)
        logger.debug('Transformer model from path {0} loaded successfully'.format(model_dir))
        
        # Load the tokenizer that was saved alongside the model
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir, model_max_length=128)
        
        # Create an optimum pipeline
        self.pipeline = pipeline("feature-extraction", model=self.model, tokenizer=self.tokenizer)

        self.initialized = True

    def preprocess(self, data):
        """ Preprocessing input request by tokenizing
            Extend with your own preprocessing steps as needed
        """
        text = data[0].get("data")
        if text is None:
            text = data[0].get("body")
        sentences = text.decode('utf-8')
        logger.info("Received text: '%s'", sentences)
        return sentences

    def inference(self, sentences):
        """ Compute the sentence embedding from the feature-extraction pipeline output.
        """
        
        def cls_pooling(pipeline_output):
            """
            Return the [CLS] token embedding
            """
            return [_h[0] for _h in pipeline_output]
        
        embeddings = cls_pooling(self.pipeline(sentences))

        logger.info(f"Model embedded: {len(embeddings)}" )
        return embeddings

    def postprocess(self, inference_output):
        return inference_output
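
The `cls_pooling` step in the handler can be checked in isolation. The sketch below assumes that, for a single input string, the feature-extraction pipeline returns a nested list shaped `[1][tokens][hidden_size]`; the numbers are made up and the hidden size is shrunk to 2 for readability.

```python
def cls_pooling(pipeline_output):
    """Keep the vector of the first token ([CLS]) of each sequence."""
    return [token_vectors[0] for token_vectors in pipeline_output]


# Made-up pipeline output: one sentence, three tokens, hidden size 2
fake_output = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]]

embeddings = cls_pooling(fake_output)
print(embeddings)  # [[0.1, 0.2]]
```

The result is a list with one embedding per input sequence, which is what the handler returns as the prediction.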

### Create custom container image

**Create a Dockerfile with TorchServe as base image**

**NB**: to choose suitable TorchServe parameters such as `workers`, consult the [TorchServe performance guide](https://github.com/pytorch/serve/blob/master/docs/performance_guide.md).

In [None]:
%%bash -s $APP_NAME

APP_NAME=$1

cat << EOF > ./predictor/Dockerfile

FROM pytorch/torchserve:latest-cpu

# install dependencies
RUN python3 -m pip install --upgrade pip
RUN pip3 install transformers
RUN pip3 install 'optimum[onnxruntime]'


USER model-server

# copy model artifacts, custom handler and other dependencies
COPY custom_handler.py /home/model-server/
COPY ./optimum/ /home/model-server/

# create torchserve configuration file
USER root
RUN printf "\nservice_envelope=json" >> /home/model-server/config.properties
RUN printf "\ninference_address=http://0.0.0.0:7080" >> /home/model-server/config.properties
RUN printf "\nmanagement_address=http://0.0.0.0:7081" >> /home/model-server/config.properties

# Consult https://github.com/pytorch/serve/blob/master/docs/performance_guide.md to define the right parameters
RUN printf "\nworkers=4" >> /home/model-server/config.properties

# expose health and prediction listener ports from the image
EXPOSE 7080
EXPOSE 7081

# create model archive file packaging model artifacts and dependencies
RUN torch-model-archiver -f \
  --model-name=$APP_NAME \
  --version=1.0 \
  --serialized-file=/home/model-server/model.onnx \
  --handler=/home/model-server/custom_handler.py \
  --extra-files "/home/model-server/config.json,/home/model-server/tokenizer.json,/home/model-server/tokenizer_config.json,/home/model-server/special_tokens_map.json,/home/model-server/vocab.txt" \
  --export-path=/home/model-server/model-store

# run Torchserve HTTP serve to respond to prediction requests
CMD ["torchserve", \
     "--start", \
     "--ts-config=/home/model-server/config.properties", \
     "--models", \
     "$APP_NAME=$APP_NAME.mar", \
     "--model-store", \
     "/home/model-server/model-store"]
EOF

echo "Writing ./predictor/Dockerfile"

**Build container**

In [None]:
CUSTOM_PREDICTOR_IMAGE_URI = f"gcr.io/{PROJECT_ID}/pytorch_predict_{APP_NAME}"
print(f"CUSTOM_PREDICTOR_IMAGE_URI = {CUSTOM_PREDICTOR_IMAGE_URI}")

In [None]:
!docker build \
  --tag=$CUSTOM_PREDICTOR_IMAGE_URI \
  ./predictor

**Run container locally**

In [None]:
!docker stop local_sbert_embedder_optimum
!docker run -t -d --rm -p 7080:7080 --name=local_sbert_embedder_optimum $CUSTOM_PREDICTOR_IMAGE_URI
!sleep 20
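
A fixed `sleep 20` can be too short on a slow machine and needlessly long on a fast one. As an alternative, the sketch below polls the health route until it answers with HTTP 200. The helper name `wait_until_healthy` and its parameters are hypothetical; the `opener` argument exists only so the function can be exercised without a running container.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(url, timeout_s=60.0, interval_s=2.0, opener=urllib.request.urlopen):
    """Poll `url` until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError, OSError):
            pass  # server not ready yet, retry after a short pause
        time.sleep(interval_s)
    return False


# Example (requires the container started above to be running):
# wait_until_healthy("http://localhost:7080/ping")
```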

**Test API locally**

1. Health check

In [None]:
!curl http://localhost:7080/ping

2. Send request

In [None]:
%%bash -s $APP_NAME

APP_NAME=$1

cat > ./predictor/instances.json <<END
{ 
   "instances": [
     { 
       "data": {
         "b64": "$(echo 'I am creating an endpoint using TorchServe and HF transformers' | base64 --wrap=0)"
       }
     }
   ]
}
END

curl -s -X POST \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @./predictor/instances.json \
  http://localhost:7080/predictions/$APP_NAME/

3. Stop the container

In [None]:
!docker stop local_sbert_embedder_optimum
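
The request body used in the local test can also be built in Python, which avoids depending on the GNU-specific `base64 --wrap=0` flag. This is a minimal sketch; the helper name `build_request_body` is hypothetical. Unlike the `echo` version, it does not append a trailing newline to the text before encoding.

```python
import base64
import json


def build_request_body(texts):
    """Return a JSON body with one base64-encoded instance per input text."""
    instances = [
        {"data": {"b64": base64.b64encode(text.encode("utf-8")).decode("ascii")}}
        for text in texts
    ]
    return json.dumps({"instances": instances})


body = build_request_body(
    ["I am creating an endpoint using TorchServe and HF transformers"]
)
print(body)
```

The resulting string has the same `instances` / `data` / `b64` structure as `instances.json` and can be sent as the POST payload.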

### Push the image to Container Registry

In [None]:
!docker push $CUSTOM_PREDICTOR_IMAGE_URI

### Create a model and endpoint on Vertex AI

We create a Model resource on Vertex AI and deploy it to a Vertex AI Endpoint. You must deploy a model to an endpoint before using it. The deployed model runs the custom container image to serve predictions.

**Initialize the Vertex AI SDK for Python**

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

**Create a Model resource with custom serving container**

In [None]:
VERSION = 1
model_display_name = f"{APP_NAME}-v{VERSION}"
model_description = "PyTorch based sentence transformers embedder with custom container"

MODEL_NAME = APP_NAME
health_route = "/ping"
predict_route = f"/predictions/{MODEL_NAME}"
serving_container_ports = [7080]

In [None]:
model = aiplatform.Model.upload(
    display_name=model_display_name,
    description=model_description,
    serving_container_image_uri=CUSTOM_PREDICTOR_IMAGE_URI,
    serving_container_predict_route=predict_route,
    serving_container_health_route=health_route,
    serving_container_ports=serving_container_ports,
)

model.wait()

print(model.display_name)
print(model.resource_name)

**Create an Endpoint for Model with Custom Container**

In [None]:
endpoint_display_name = f"{APP_NAME}-endpoint"
endpoint = aiplatform.Endpoint.create(display_name=endpoint_display_name)

**Deploy the Model to Endpoint**

See more in the [documentation](https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api).

To select a machine type that fits your budget, see the [Google Cloud Pricing Calculator](https://cloud.google.com/products/calculator) and [Finding the ideal machine type](https://cloud.google.com/vertex-ai/docs/predictions/configure-compute#finding_the_ideal_machine_type).

In [None]:
traffic_percentage = 100
machine_type = "n1-standard-8"
deployed_model_display_name = model_display_name
min_replica_count = 1
max_replica_count = 3
sync = True

model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=deployed_model_display_name,
    machine_type=machine_type,
    traffic_percentage=traffic_percentage,
    min_replica_count=min_replica_count,
    max_replica_count=max_replica_count,
    sync=sync,
)

### Invoking the Endpoint with deployed Model using Vertex AI SDK to make predictions

**Get the endpoint id**

In [None]:
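endpoint_display_name = f"{APP_NAME}-endpoint"
# Named endpoint_filter to avoid shadowing the built-in `filter`
endpoint_filter = f'display_name="{endpoint_display_name}"'

for endpoint_info in aiplatform.Endpoint.list(filter=endpoint_filter):
    print(
        f"Endpoint display name = {endpoint_info.display_name}, resource name = {endpoint_info.resource_name}"
    )

# Uses the last endpoint listed above; raises NameError if no endpoint matched the filter
endpoint = aiplatform.Endpoint(endpoint_info.resource_name)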

In [None]:
endpoint.list_models()

**Formatting input for online prediction**

In [None]:
test_instances = [
    b"This is an example of model deployment using a sentence transformers model and optimum",
]*100

In [None]:
# The tokenizer expects str, not bytes, so decode the instance first
len(tokenizer(test_instances[0].decode("utf-8"))["input_ids"])

In [None]:
#test_instances

In [None]:
%%time
print("=" * 100)
for instance in test_instances:
    print(f"Input text: \n\t{instance.decode('utf-8')}\n")
    b64_encoded = base64.b64encode(instance)
    test_instance = [{"data": {"b64": b64_encoded.decode("utf-8")}}]
    print(f"Formatted input: \n{json.dumps(test_instance, indent=4)}\n")
    prediction = endpoint.predict(instances=test_instance)
    #print(f"Prediction response: \n\t{prediction}")
    print("=" * 100)

In [None]:
%%time
prediction = endpoint.predict(instances=test_instance)
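
The `%%time` cells report a single wall-clock figure. To compare configurations (for example with and without graph optimization), it can help to collect per-request latencies and summarize them. The sketch below is one way to do that; the helper names are hypothetical, and a short `time.sleep` stands in for the real `endpoint.predict(instances=test_instance)` call so the block runs without a deployed endpoint.

```python
import math
import statistics
import time


def measure_latency(call, n_requests=20):
    """Time `call()` n_requests times and return the latencies in milliseconds."""
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        call()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


def summarize(latencies_ms):
    """Return mean, median and nearest-rank 95th percentile of the latencies."""
    ordered = sorted(latencies_ms)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "mean_ms": statistics.mean(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
    }


# Stand-in for: lambda: endpoint.predict(instances=test_instance)
latencies = measure_latency(lambda: time.sleep(0.001), n_requests=5)
print(summarize(latencies))
```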