# How to Deploy a Custom Hugging Face Model to Azure ML

This is a step-by-step guide to deploying a custom Hugging Face model to Azure ML. To deploy the model from your local environment to Azure ML, first install the libraries listed in ```requirements.txt```.

Then, we can run the cells in the notebook to prepare the model and deploy it to Azure ML.

## 0. Download the model from Hugging Face Hub and save it locally

In this example, we deploy an embedding model. To keep the example as simple as possible, we use sentence-transformers for inference. Here, we download the model from the Hugging Face Hub and save it locally so that it can be uploaded to and registered in Azure ML.

If we already have the model locally, we can skip this step.

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
model.save("path/to/model/bge-m3")

## 0.1 Import required libraries

In [None]:
# import required libraries
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Environment,
    CodeConfiguration,
)
from azure.identity import DefaultAzureCredential

## 0.2 Authenticate via Azure CLI
To ensure a smooth connection to the workspace below, first install the Azure CLI (see [Install Azure CLI](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli)) and then authenticate via `az login`.

## 1. Connect to Azure ML Workspace

SDK v2 now uses the `MLClient` class to connect to a workspace.

In [None]:
# enter details of your Azure Machine Learning workspace
subscription_id = 'xxx'
resource_group = 'xxx'
workspace = 'xxx'

# get a handle to the workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)
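
As an alternative to hardcoding the workspace details, they could be read from environment variables. The following is a minimal sketch; the variable names `AZURE_SUBSCRIPTION_ID`, `AZURE_RESOURCE_GROUP`, and `AZURE_ML_WORKSPACE` and the helper `workspace_settings_from_env` are illustrative assumptions, not part of the Azure SDK.

```python
import os

# Hypothetical environment variable names mapped to MLClient keyword arguments.
_ENV_TO_ARG = {
    "AZURE_SUBSCRIPTION_ID": "subscription_id",
    "AZURE_RESOURCE_GROUP": "resource_group_name",
    "AZURE_ML_WORKSPACE": "workspace_name",
}


def workspace_settings_from_env(environ=None):
    """Collect the workspace settings from environment variables."""
    environ = os.environ if environ is None else environ
    missing = [var for var in _ENV_TO_ARG if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {arg: environ[var] for var, arg in _ENV_TO_ARG.items()}
```

With the imports from the cells above, the client could then be created as `MLClient(DefaultAzureCredential(), **workspace_settings_from_env())`.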

## 2. Register the model

Registering the model means that the model will be uploaded to the Azure ML Model Registry. Depending on the size of the model, this can take a while.

In [None]:
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

file_model = Model(
    path="path/to/model/bge-m3",
    type=AssetTypes.CUSTOM_MODEL,
    name="bge-m3",
    description="Hugging Face BGE-M3 model uploaded to Azure ML Model Registry.",
)
ml_client.models.create_or_update(file_model)

## 2.1 Get the model from the registry

In [None]:
# Create a Model object using the name and version of the already registered model
model = ml_client.models.get(name="bge-m3", version="1")
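
The version `"1"` above only holds for the first registration of a given name. If the model has been registered several times, the newest version could be looked up instead of hardcoded. This is a minimal sketch under the assumption that versions are numeric strings; `latest_model_version` is a hypothetical helper, not an SDK function.

```python
def latest_model_version(models):
    """Return the highest version among registered model entries.

    `models` is any iterable of objects with a `version` attribute,
    for example the result of ml_client.models.list(name="bge-m3").
    Assumes every version is a numeric string such as "1" or "12".
    """
    versions = [str(m.version) for m in models]
    if not versions:
        raise ValueError("no registered versions found")
    return max(versions, key=int)
```

It could be used as `model = ml_client.models.get(name="bge-m3", version=latest_model_version(ml_client.models.list(name="bge-m3")))`.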

## 3. Define the environment

Define the environment by giving it a distinct name and listing all packages needed to run inference with the model. The packages are listed in the `conda.yaml` file.

```yaml
name: hugging-face-embeddings-env
# channels:
#   - conda-forge
dependencies:
  - python=3.11
  - pip=22.1.2
  - pip:
    - azureml-inference-server-http==1.3.0
    - sentence-transformers
    - scipy==1.10.1
```

In this case, we essentially only need sentence-transformers.

In [None]:
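env = Environment(
    # distinct environment name, matching the name used in conda.yaml
    name="hugging-face-embeddings-env",
    conda_file="azureml-models/b3_bilang_extraction_model/1/conda.yaml",
    image="mcr.microsoft.com/azureml/inference-base-2204:20240530.v1",
)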

ml_client.environments.create_or_update(env)

## 4. Create an inference script

To get inferences from the model, we need to create an inference script. For Azure ML, the script needs an ```init()``` and a ```run()``` function. The ```init()``` function runs only once, when the container starts, and loads the model into memory. The ```run()``` function is called every time a request is made to the endpoint; it unpacks the request data and passes it to the model for inference.

Refer to the ```score.py``` file in this repository for an example.

## 5. Create a deployment configuration and deploy the model

In general, deploying a model to Azure ML requires an endpoint and a deployment within that endpoint. The endpoint is a web service that can be called via a REST API. It has a URI and pre-defined authentication methods, such as key-based authentication. The deployment is the backbone of the web service: it is the actual container that runs the model, with a pre-defined environment and hardware.

Before deploying the model, we need to define the endpoint and the deployment configuration. Here we choose `Standard_E4s_v3` since it has enough CPU power and memory, so the model can be loaded into memory and the container will not run out of memory while performing inference.

The deploy call then starts the deployment process: a container image is built and a VM instance is provisioned to run the container. The whole process can take a while.

If we already registered a model but want to redeploy it, we can grab the model to pass it to the deploy method via `model = ml_client.models.get(name="bge-m3", version="1")`.

Important: Double-check that you use the same Azure ML SDK version (v1 or v2) for both the model registration and the deployment. A model registered via the Studio UI or SDK v2 cannot be deployed via SDK v1.

In [None]:
# Define an endpoint name
endpoint_name = "hugging-face-embeddings-endpoint"

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=endpoint_name,
    description="this is the hugging face embeddings endpoint",
    auth_mode="key"
)

In [None]:
deployment = ManagedOnlineDeployment(
    name="hugging-face-embeddings-deployment-1", 
    endpoint_name=endpoint_name,
    model=model,
    environment=env,
    code_configuration=CodeConfiguration(
        code="path/to/code/folder", scoring_script="score.py"
    ),
    instance_type="Standard_E4s_v3",
    instance_count=1,
)

In [None]:
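# begin_create_or_update returns a poller; wait until the endpoint exists
# before creating a deployment inside it
ml_client.online_endpoints.begin_create_or_update(endpoint).result()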

In [None]:
# wait for the deployment to finish before requesting logs or invoking it
ml_client.online_deployments.begin_create_or_update(deployment).result()

In [None]:
ml_client.online_deployments.get_logs(
    name="hugging-face-embeddings-deployment-1",
    endpoint_name=endpoint_name,
    lines=100,
)

## 6. Call the endpoint and get inferences

To call the endpoint, we can either send a request via HTTPS or use `ml_client.online_endpoints.invoke()`.
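
The `invoke()` call below reads its payload from `sample-request.json`. The exact format depends on what the scoring script expects; as a sketch, assuming the script reads a `texts` field containing a list of strings (an assumption, not confirmed by the repository), the file could be written like this. `write_sample_request` is a hypothetical helper.

```python
import json
from pathlib import Path


def write_sample_request(path, texts):
    """Write a JSON request file with the assumed {"texts": [...]} shape."""
    payload = {"texts": list(texts)}
    Path(path).write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return payload
```

Calling `write_sample_request("sample-request.json", ["first text", "second text"])` would produce a file suitable for the `request_file` argument.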

In [None]:
import json
# test the embeddings deployment with some sample data
response = json.loads(ml_client.online_endpoints.invoke(
    endpoint_name=endpoint_name,
    deployment_name="hugging-face-embeddings-deployment-1",
    request_file="sample-request.json",
))

embeddings = response["embeddings"]
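
For the HTTPS route, the request needs the endpoint's scoring URI and, with key-based authentication, the key sent as a bearer token. The following is a minimal sketch using only the standard library; the helper names and the `texts` payload field are assumptions, and the optional `azureml-model-deployment` header routes the call to a specific deployment.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, api_key, texts, deployment_name=None):
    """Build a POST request for a key-authenticated online endpoint."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    if deployment_name:
        # Target one deployment instead of relying on endpoint traffic rules.
        headers["azureml-model-deployment"] = deployment_name
    body = json.dumps({"texts": list(texts)}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        scoring_uri, data=body, headers=headers, method="POST"
    )


def call_endpoint(request, timeout=60):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The URI and key can be obtained from the client via `ml_client.online_endpoints.get(endpoint_name).scoring_uri` and `ml_client.online_endpoints.get_keys(endpoint_name).primary_key`.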

## 7. Now do the fun part

Once we have the embeddings, we can use them for all kinds of downstream tasks. For instance, we can embed multiple texts and calculate the cosine similarity between them.
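
The cells below import `cosine_similarity` from a `utils` module that is not shown here. As a reference for what such a helper computes, this is a minimal pure-Python sketch for two vectors, using the dot product divided by the product of the vector norms. The name `cosine_similarity_sketch` is hypothetical, and the real helper may accept other input shapes.

```python
import math


def cosine_similarity_sketch(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Values close to 1 indicate vectors pointing in nearly the same direction, which for embeddings is usually read as semantic closeness.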

In [None]:
from utils import cosine_similarity

print(cosine_similarity(embeddings, embeddings))
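
The next cell calls `embed_text`, which is not defined in this notebook. One way it might be written is as a thin wrapper around the endpoint call. This is a sketch under assumptions: `make_embed_text` is a hypothetical factory, the request field `texts` and response field `embeddings` must match the scoring script, and the payload goes through a temporary file because `invoke()` is called with a `request_file` above.

```python
import json
import os
import tempfile


def make_embed_text(invoke, endpoint_name, deployment_name):
    """Return an embed_text(text) function bound to one deployment.

    `invoke` is a callable with the same keyword arguments as
    ml_client.online_endpoints.invoke and returns a JSON string.
    """

    def embed_text(text):
        # Write the single-text payload to a temporary request file.
        with tempfile.NamedTemporaryFile(
            "w", suffix=".json", delete=False, encoding="utf-8"
        ) as fh:
            json.dump({"texts": [text]}, fh)
            path = fh.name
        try:
            raw = invoke(
                endpoint_name=endpoint_name,
                deployment_name=deployment_name,
                request_file=path,
            )
        finally:
            os.remove(path)  # clean up even if the call fails
        return json.loads(raw)["embeddings"][0]

    return embed_text
```

It could be bound with `embed_text = make_embed_text(ml_client.online_endpoints.invoke, endpoint_name, "hugging-face-embeddings-deployment-1")`.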

In [None]:
docs = [
    "The first human landing on the Moon was achieved in 1969.",
    "Neil Armstrong was the first person to walk on the lunar surface.",
    "Apollo 11 was the spaceflight that landed the first two people on the Moon.",
]
query = "Who was the first to walk on the Moon?"

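# embed_text is not defined in this notebook; it is assumed to be a helper
# that sends a single text to the endpoint and returns its embedding
for doc in docs:
    print(cosine_similarity(embed_text(query), embed_text(doc)))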