# Deploying a Hugging Face model to Google Vertex AI

Inspired by the [GCP tutorial]( https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/community-content/pytorch_text_classification_using_vertex_sdk_and_gcloud/pytorch-text-classification-vertex-ai-train-tune-deploy.ipynb) we will deploy a `sentence-transformers` model on a [Vertex AI](https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api) endpoint. We will use [TorchServe](https://pytorch.org/serve/) to serve a Hugging Face model available on the [Hub](hf.co). To accelerate inference we will also use features from the `optimum` [library](https://github.com/huggingface/optimum) to apply graph optimization and/or quantization to the model.

### Set up your local development environment

1. Follow the Google Cloud guide to [setting up a Python development environment](https://cloud.google.com/python/docs/setup) 
2. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/) 
3. Create a virtual environment (virtualenv, pyenv) with Python 3 (<3.9) and activate the environment
4. Launch jupyter notebook from this environment


In [None]:
https://cloud.google.com/products/calculator/

### Install packages

In [13]:
# !pip -q install --upgrade google-cloud-aiplatform #Vertex AI sdk
!pip -q install --upgrade transformers
!pip -q install --upgrade datasets

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

### Set up your Google Cloud project

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager)
1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project)
1. Enable following APIs in your project required for running the tutorial
    - [Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com)
    - [Cloud Storage API](https://console.cloud.google.com/flows/enableapi?apiid=storage.googleapis.com)
    - [Container Registry API](https://console.cloud.google.com/flows/enableapi?apiid=containerregistry.googleapis.com)
    - [Cloud Build API](https://console.cloud.google.com/flows/enableapi?apiid=cloudbuild.googleapis.com)
   
1. Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.

### Authenticate to gcloud

 1. In the Cloud Console, go to the [**Create service account key** page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).,
 2. Click **Create service account**.,
 3. In the **Service account name** field, enter a name, and click **Create**,
 4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type \"Vertex AI\" into the filter box, and select **Vertex AI Administrator**. Type \"Storage Object Admin\" into the filter box, and select **Storage Object Admin**.
 5. Click *Create*. A JSON file that contains your key downloads to your local environment.
 6. Enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.

In [2]:
# %env GOOGLE_APPLICATION_CREDENTIALS ./keys/huggingface-ml-e974975230cc.json #change to your service account key

In [1]:
# Get your Google Cloud project ID using google.auth
import google.auth

_, PROJECT_ID = google.auth.default()
print("Project ID: ", PROJECT_ID)

#Or set it yourself manually
# PROJECT_ID = "huggingface-ml"

Project ID:  huggingface-ml


### Create a cloud storage bucket

In [2]:
BUCKET_NAME = "gs://andrew-reed-bucket"  # <---CHANGE THIS TO YOUR BUCKET
REGION = "us-east4"

**If the bucket doesn't exist, run the following:**

In [3]:
# ! gsutil mb -l $REGION $BUCKET_NAME

Access the content of the bucket

In [4]:
! gsutil ls -al $BUCKET_NAME

### Imports

In [5]:
import torch
import base64
import json
import os
import random
import sys
import transformers

import google.auth
from google.cloud import aiplatform
from google.cloud.aiplatform import gapic as aip
from google.protobuf.json_format import MessageToDict

In [6]:
print(f"Notebook runtime: {'GPU' if torch.cuda.is_available() else 'CPU'}")
print(f"PyTorch version : {torch.__version__}")
print(f"Transformers version : {transformers.__version__}")

Notebook runtime: GPU
PyTorch version : 2.0.1+cu118
Transformers version : 4.32.1


In [7]:
APP_NAME = "test_bge_embedder"

## Deployment

#### *Overview*

Deploying a PyTorch model on [Vertex AI Predictions](https://cloud.google.com/vertex-ai/docs/predictions/getting-predictions) requires to use a custom container that serves online predictions. You will deploy a container running [PyTorch's TorchServe](https://pytorch.org/serve/) tool in order to serve predictions from a fine-tuned sentence transformer model `msmarco-distilbert-base-tas-b` available in [Hugging Face Transformers](https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b). 

Essentially, to deploy a PyTorch model on Vertex AI Predictions following are the steps:
1. Package the trained model artifacts including [default](https://pytorch.org/serve/#default-handlers) or [custom](https://pytorch.org/serve/custom_service.html) handlers by creating an archive file using [Torch model archiver](https://github.com/pytorch/serve/tree/master/model-archiver),
2. Build a [custom container](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements) compatible with Vertex AI Predictions to serve the model using Torchserve
3. Upload the model with custom container image to serve predictions as a Vertex AI Model resource,
4. Create a Vertex AI Endpoint and [deploy the model](https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api) resource

### Save model locally

In [8]:
MODEL_DIR = "./"+APP_NAME+"_predictor"
os.makedirs(MODEL_DIR, exist_ok=True)

In [9]:
from transformers import AutoTokenizer, AutoModel

model_id = "BAAI/bge-small-en"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

pt_save_directory = os.path.join(MODEL_DIR, "model")

tokenizer.save_pretrained(pt_save_directory)
model.save_pretrained(pt_save_directory)

In [10]:
pt_save_directory

'./test_bge_embedder_predictor/model'

In [99]:
from transformers import pipeline

def cls_pooling(pipeline_output):
    """
    Return the [CLS] token embedding
    """
    return [_h[0] for _h in pipeline_output]
        
        
        
tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)
model = AutoModel.from_pretrained(pt_save_directory)

BATCH_SIZE = 4
pipe = pipeline("feature-extraction", model=model, tokenizer=tokenizer, batch_size=BATCH_SIZE)

input_texts = ["I like the new ORT pipeline", "blah blah blah", "Hello, I'm andrew"]
embeddings = pipe(input_texts)

In [100]:
len(embeddings)

3

In [96]:
len(embeddings[0])

1

In [90]:
len(embeddings[1][0])

9

In [None]:
pipe.tokenizer.

In [63]:
len(embeddings[0][0])

9

In [58]:
embeddings[0]

[[-0.7370260953903198,
  -0.43500322103500366,
  0.2417694479227066,
  -0.15344509482383728,
  0.08760987967252731,
  0.055698275566101074,
  -0.11618993431329727,
  0.44207069277763367,
  0.3685150146484375,
  0.2258044332265854,
  -0.1241288110613823,
  -0.3396099805831909,
  0.22041895985603333,
  -0.03137054666876793,
  0.022830264642834663,
  0.23484335839748383,
  0.3145812153816223,
  -0.10938600450754166,
  -0.5342490077018738,
  0.09915713965892792,
  0.5146430730819702,
  -0.2800300121307373,
  0.013828051276504993,
  -0.647057056427002,
  -0.06701657176017761,
  0.4996411204338074,
  -0.2386353611946106,
  -0.28559839725494385,
  -0.16580581665039062,
  -1.6529196500778198,
  0.18382528424263,
  -0.38130268454551697,
  0.2521995007991791,
  -0.06463456153869629,
  -0.21382613480091095,
  -0.08582992106676102,
  -0.06464950740337372,
  0.36071184277534485,
  -0.6784002184867859,
  0.28479912877082825,
  0.42383238673210144,
  0.013030575588345528,
  -0.29924923181533813,
  -0

### Create a custom model handler 

Please refer to the [TorchServe documentation](https://pytorch.org/serve/custom_service.html) for defining a custom handler.

In [12]:
%%writefile test_bge_embedder_predictor/custom_handler.py

import os
import json
import logging

import torch
from transformers import AutoModel, AutoTokenizer, pipeline

from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)
# torch.set_num_threads(1)

class SentenceTransformersHandler(BaseHandler):
    """
    The handler takes an input string and returns the embedding 
    based on the serialized transformers checkpoint.
    """
    def __init__(self):
        super(SentenceTransformersHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        """ Loads the model.onnx file and initialized the model object.
        Instantiates Tokenizer for preprocessor to use and a feature extraction pipeline
        """
        self.manifest = ctx.manifest

        properties = ctx.system_properties
        model_dir = properties.get("model_dir")
        self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")

        # Read model serialize/pt file
        serialized_file = self.manifest["model"]["serializedFile"]
        model_pt_path = os.path.join(model_dir, serialized_file)
        if not os.path.isfile(model_pt_path):
            raise RuntimeError("Missing the model.onnx or pytorch_model.bin file")
        
        # Load model
        self.model = AutoModel.from_pretrained(model_dir)
        logger.debug('Transformer model from path {0} loaded successfully'.format(model_dir))
        logger.info(f"model_pt_path: {model_pt_path}")
        logger.info(f"model_dir: {model_dir}")
        
        # Ensure to use the same tokenizer used during training
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir, model_max_length=512)
        
        # Create an optimum pipeline
        # Use BetterTransformer for fused kernel + sparsity optimizations
        # https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2
        self.pipeline = pipeline("feature-extraction",
                                 model=self.model, 
                                 tokenizer=self.tokenizer, 
                                 device=self.device, 
                                 accelerator="bettertransformer")

        self.initialized = True

    def preprocess(self, requests):
        """ Preprocessing input request by tokenizing
            Extend with your own preprocessing steps as needed
            
            [{'data': b'I am creating an endpoint using TorchServe and HF transformers\n'}]
        """
        print(f'{"---"*20}')
        print(requests)
        print(type(requests))
        print(len(requests))
        print(f'{"---"*20}')
        
        input_texts = []
        for idx, request in enumerate(requests):
            text = request.get("data")
            if text is None:
                text = request.get("body")
                
            text = text.decode('utf-8')
            input_texts.append(text)
        logger.info("Received text: '%s'", input_texts)
        
        return input_texts
        
        # text = data[0].get("data")
        # if text is None:
        #     text = data[0].get("body")
        # sentences = text.decode('utf-8')
        # logger.info("Received text: '%s'", sentences)
        # return sentences

    def inference(self, input_texts):
        """ Predict the class of a text using a trained transformer model.
        """
        
        def cls_pooling(pipeline_output):
            """
            Return the [CLS] token embedding
            """
            return [_h[0] for _h in pipeline_output]
        
        embeddings = cls_pooling(self.pipeline(input_texts))

        logger.info(f"Model embedded: {len(embeddings)}")
        return embeddings

    def postprocess(self, inference_output):
        return inference_output

Writing test_bge_embedder_predictor/custom_handler.py


### Create custom container image

**Create a Dockerfile with TorchServe as base image**

**NB**: to define the right Torchserve parameters such as `workers` please consult (https://github.com/pytorch/serve/blob/master/docs/performance_guide.md) 

In [13]:
%%bash -s $APP_NAME

APP_NAME=$1

cat << EOF > ./${APP_NAME}_predictor/Dockerfile

FROM pytorch/torchserve:latest-gpu

# install dependencies
RUN python3 -m pip install --upgrade pip
RUN pip3 install transformers


USER model-server

# copy model artifacts, custom handler and other dependencies
COPY custom_handler.py /home/model-server/
COPY ./model/ / /home/model-server/

# create torchserve configuration file
USER root
RUN printf "\nservice_envelope=json" >> /home/model-server/config.properties
RUN printf "\ninference_address=http://0.0.0.0:7080" >> /home/model-server/config.properties
RUN printf "\nmanagement_address=http://0.0.0.0:7081" >> /home/model-server/config.properties
RUN printf "\nmanagement_address=http://0.0.0.0:7081" >> /home/model-server/config.properties


# expose health and prediction listener ports from the image
EXPOSE 7080
EXPOSE 7081

# create model archive file packaging model artifacts and dependencies
RUN torch-model-archiver -f \
  --model-name=$APP_NAME \
  --version=1.0 \
  --serialized-file=/home/model-server/pytorch_model.bin \
  --handler=/home/model-server/custom_handler.py \
  --extra-files "/home/model-server/config.json,/home/model-server/tokenizer.json,/home/model-server/tokenizer_config.json,/home/model-server/special_tokens_map.json,/home/model-server/vocab.txt" \
  --export-path=/home/model-server/model-store

# run Torchserve HTTP serve to respond to prediction requests
CMD ["torchserve", \
     "--start", \
     "--ts-config=/home/model-server/config.properties", \
     "--models", \
     "$APP_NAME=$APP_NAME.mar", \
     "--model-store", \
     "/home/model-server/model-store"]

EOF

echo "Writing ./predictor/Dockerfile"

Writing ./predictor/Dockerfile


In [28]:
APP_NAME

'test_bge_embedder'

In [35]:
APP_DIR = APP_NAME+"_predictor"

**Build container**

In [38]:
CUSTOM_PREDICTOR_IMAGE_URI = f"gcr.io/{PROJECT_ID}/pytorch_predict_{APP_NAME}"
print(f"CUSTOM_PREDICTOR_IMAGE_URI = {CUSTOM_PREDICTOR_IMAGE_URI}")

CUSTOM_PREDICTOR_IMAGE_URI = gcr.io/huggingface-ml/pytorch_predict_test_bge_embedder


In [39]:
!docker build \
  --tag=$CUSTOM_PREDICTOR_IMAGE_URI \
  ./$APP_DIR

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Sending build context to Docker daemon  134.5MB
Step 1/15 : FROM pytorch/torchserve:latest-gpu
 ---> 04eef250c14e
Step 2/15 : RUN python3 -m pip install --upgrade pip
 ---> Using cache
 ---> 2f06486ed62f
Step 3/15 : RUN pip3 install transformers
 ---> Using cache
 ---> 4ed472215157
Step 4/15 : USER model-server
 ---> Using cache
 ---> 4fa0cd41a25e
Step 5/15 : COPY custom_handler.py /home/model-server/
 ---> Using cache
 ---> 6ffef9cd46b5
Step 6/15 : COPY ./model/ / /home/model-server/
 ---> Using cache
 ---> 1eaec110f389
Step 7/15 : USER root
 ---> Running in 32101cfc0c4d
Removing intermediate container 32101cfc0c4d
 ---> 38839665de46
Step 8/15 : RUN printf "\nservice_envelope=json" >> /home/model-server/con

**Test API locally**

In [40]:
!docker run -td --rm -p 7080:7080 --gpus all --name=$APP_NAME $CUSTOM_PREDICTOR_IMAGE_URI

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
9b2a72e275404c43f2cc2f3702fa85d26a11ac19c1ee43bcc8b8bca6410c7444


1. Health check

In [41]:
!curl http://localhost:7080/ping

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
{
  "status": "Healthy"
}


2. Send request

In [53]:
%%bash -s $APP_NAME

APP_NAME=$1

cat > ./predictor/instances.json <<END
{ 
   "instances": [
     { 
       "data": {
         "b64": "$(echo 'I am creating an endpoint using TorchServe and HF transformers' | base64 --wrap=0)"
       }
     }
   ]
}
END

curl -s -X POST \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @./predictor/instances.json \
  http://localhost:7080/predictions/$APP_NAME/

{"predictions": [[-0.6905803084373474, 0.009051576256752014, 0.30402737855911255, -0.3663185238838196, -0.07281982898712158, 0.5289543271064758, -0.35053467750549316, 0.05200103297829628, -0.20036430656909943, 0.19550664722919464, 0.07047154009342194, -0.9721583724021912, 0.09480097144842148, 0.32743242383003235, -0.07829789817333221, 0.25278550386428833, -0.24017898738384247, 0.42118600010871887, -0.7957032322883606, 0.09498923271894455, 0.6807821393013, -0.4031156897544861, -0.10545552521944046, -0.22044122219085693, -0.2215351164340973, 0.35518744587898254, -0.21126778423786163, -0.18924182653427124, -0.31485188007354736, -1.508244276046753, 0.051057118922472, -0.4146045744419098, -0.031520672142505646, -0.1402014195919037, -0.15924163162708282, -0.3663807213306427, -0.2124873250722885, 0.09935609251260757, -0.26911237835884094, 0.23719154298305511, 0.3093794584274292, -0.13422630727291107, -0.16769321262836456, -0.38768434524536133, -0.3415384888648987, -0.983018696308136, -0.13000

3. Stop the container

In [None]:
!docker stop local_sbert_embedder_optimum

### Push image Container Registry

In [None]:
!docker push $CUSTOM_PREDICTOR_IMAGE_URI

### Create model and endpoint to VertexAI

We create a model resource on Vertex AI and deploy the model to a Vertex AI Endpoints. You must deploy a model to an endpoint before using the model. The deployed model runs the custom container image to serve predictions.

**Initialize the Vertex AI SDK for Python**

In [None]:
aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_NAME)

**Create a Model resource with custom serving container**

In [None]:
VERSION = 1
model_display_name = f"{APP_NAME}-v{VERSION}"
model_description = "PyTorch based sentence transformers embedder with custom container"

MODEL_NAME = APP_NAME
health_route = "/ping"
predict_route = f"/predictions/{MODEL_NAME}"
serving_container_ports = [7080]

In [None]:
model = aiplatform.Model.upload(
    display_name=model_display_name,
    description=model_description,
    serving_container_image_uri=CUSTOM_PREDICTOR_IMAGE_URI,
    serving_container_predict_route=predict_route,
    serving_container_health_route=health_route,
    serving_container_ports=serving_container_ports,
)

model.wait()

print(model.display_name)
print(model.resource_name)

**Create an Endpoint for Model with Custom Container**

In [None]:
endpoint_display_name = f"{APP_NAME}-endpoint"
endpoint = aiplatform.Endpoint.create(display_name=endpoint_display_name)

**Deploy the Model to Endpoint**

See more on the [documentation](https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api).

To select the right machine type according to your budget select go to [Google Cloud Pricing Calculator](https://cloud.google.com/products/calculator) and [Finding the ideal machine type](https://cloud.google.com/vertex-ai/docs/predictions/configure-compute#finding_the_ideal_machine_type).

In [None]:
traffic_percentage = 100
machine_type = "n1-standard-8"
deployed_model_display_name = model_display_name
min_replica_count = 1
max_replica_count = 3
sync = True

model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=deployed_model_display_name,
    machine_type=machine_type,
    traffic_percentage=traffic_percentage,
    min_replica_count=min_replica_count,
    max_replica_count=max_replica_count,
    sync=sync,
)

### Invoking the Endpoint with deployed Model using Vertex AI SDK to make predictions

**Get the endpoint id**

In [None]:
endpoint_display_name = f"{APP_NAME}-endpoint"
filter = f'display_name="{endpoint_display_name}"'

for endpoint_info in aiplatform.Endpoint.list(filter=filter):
    print(
        f"Endpoint display name = {endpoint_info.display_name} resource id ={endpoint_info.resource_name} "
    )

endpoint = aiplatform.Endpoint(endpoint_info.resource_name)

In [None]:
endpoint.list_models()

**Formatting input for online prediction**

In [None]:
test_instances = [
    b"This is an example of model deployment using a sentence transformers model and optimum",
]*100

In [None]:
len(tokenizer(test_instances[0])["input_ids"])

In [None]:
#test_instances

In [None]:
%%time
print("=" * 100)
for instance in test_instances:
    print(f"Input text: \n\t{instance.decode('utf-8')}\n")
    b64_encoded = base64.b64encode(instance)
    test_instance = [{"data": {"b64": f"{str(b64_encoded.decode('utf-8'))}"}}]
    print(f"Formatted input: \n{json.dumps(test_instance, indent=4)}\n")
    prediction = endpoint.predict(instances=test_instance)
    #print(f"Prediction response: \n\t{prediction}")
    print("=" * 100)

In [None]:
%%time
prediction = endpoint.predict(instances=test_instance)