# Deploy a Hugging Face model to Triton Inference Server on AKS

description: Deploy a BERT model to an AKS GPU cluster

Please note that this Public Preview release is subject to the [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

In [None]:
import os

from azureml.core import Workspace

subscription_id = os.getenv("SUBSCRIPTION_ID", default="<subscription_id>")
resource_group = os.getenv("RESOURCE_GROUP", default="<resource_group>")
workspace_name = os.getenv("WORKSPACE_NAME", default="<workspace_name>")

ws = Workspace.get(
    subscription_id=subscription_id,
    resource_group=resource_group,
    name=workspace_name,
)

print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep="\n")

## Get model

In [None]:
from azureml.core.model import Model

model = Model(ws, id="bert-base:1")

print(model)

## Deploy webservice

Deploy to a pre-created [AksCompute](https://docs.microsoft.com/python/api/azureml-core/azureml.core.compute.aks.akscompute?view=azure-ml-py#provisioning-configuration-agent-count-none--vm-size-none--ssl-cname-none--ssl-cert-pem-file-none--ssl-key-pem-file-none--location-none--vnet-resourcegroup-name-none--vnet-name-none--subnet-name-none--service-cidr-none--dns-service-ip-none--docker-bridge-cidr-none--cluster-purpose-none--load-balancer-type-none-) named `aks-gpu`. For other options, see [our documentation](https://docs.microsoft.com/azure/machine-learning/how-to-deploy-and-where?tabs=azcli).


In [None]:
from azureml.core.webservice import AksWebservice

service_name = "bert-ncd-aks-gpu"

config = AksWebservice.deploy_configuration(
    compute_target_name="aks-gpu",
    gpu_cores=1,
    cpu_cores=1,
    memory_gb=8,
    auth_enabled=True,
)

service = Model.deploy(
    workspace=ws,
    name=service_name,
    models=[model],
    deployment_config=config,
    overwrite=True,
)

service.wait_for_deployment(show_output=True)

In [None]:
print(service.get_logs())

## Test the webservice

In [None]:
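service_key = service.get_keys()[0]
scoring_uri = service.scoring_uri
# The Triton HTTP client is given the address without a scheme, so drop the
# leading "http://" (7 characters). This assumes the scoring URI uses http.
uri = scoring_uri[7:]
print(service_key)
print(scoring_uri)
print(uri)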
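
The slice `scoring_uri[7:]` only works when the URI starts with `http://`. As a hedged alternative, here is a minimal sketch that splits off the scheme with the standard library. `split_scoring_uri` is a hypothetical helper name (not part of the Azure ML SDK or `tritonclient`), and the address in the example is made up.

```python
from urllib.parse import urlsplit


def split_scoring_uri(scoring_uri):
    """Return (host_and_path, uses_tls) for a scoring URI.

    Hypothetical helper: drops the scheme so the remainder has the form
    "host[:port]/path".
    """
    parts = urlsplit(scoring_uri)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unexpected scheme in scoring URI: {scoring_uri!r}")
    host_and_path = parts.netloc + parts.path.rstrip("/")
    return host_and_path, parts.scheme == "https"


# Made-up address, for illustration only.
example_uri, example_tls = split_scoring_uri(
    "https://10.0.0.1:443/api/v1/service/bert-ncd-aks-gpu"
)
print(example_uri, example_tls)
```

If the endpoint were served over TLS, the returned flag could be forwarded to the client's `ssl` argument; whether that is needed depends on how the AKS cluster is configured.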

In [None]:
!curl -v $scoring_uri/v2/health/ready -H 'Authorization: Bearer '"$service_key"''
!curl -k -X POST -v $scoring_uri/v2/service/bert-ncd-aks-gpu/v2/repository/index -H 'Authorization: Bearer '"$service_key"''

In [None]:
import tritonclient.http as tritonhttpclient

headers = {}
headers["Authorization"] = f"Bearer {service_key}"

triton_client = tritonhttpclient.InferenceServerClient(uri)

model_name = "bert-base-cased"

# Check the state of server.
health_ctx = triton_client.is_server_ready(headers=headers)
print("Is server ready - {}".format(health_ctx))

# Check the status of model.
status_ctx = triton_client.is_model_ready(model_name, "1", headers)
print("Is model ready - {}".format(status_ctx))

In [None]:
from bert.tokenization import BertTokenizer
from bert.preprocess import preprocess_tokenized_text

context = "Within the genitourinary and gastrointestinal tracts, commensal flora serve as biological barriers by competing with pathogenic bacteria for food and space and, in some cases, by changing the conditions in their environment, such as pH or available iron. This reduces the probability that pathogens will reach sufficient numbers to cause illness. However, since most antibiotics non-specifically target bacteria and do not affect fungi, oral antibiotics can lead to an overgrowth of fungi and cause conditions such as a vaginal candidiasis (a yeast infection). There is good evidence that re-introduction of probiotic flora, such as pure cultures of the lactobacilli normally found in unpasteurized yogurt, helps restore a healthy balance of microbial populations in intestinal infections in children and encouraging preliminary data in studies on bacterial gastroenteritis, inflammatory bowel diseases, urinary tract infection and post-surgical infections."
query = "Most antibiotics target bacteria and don't affect what class of organisms?"

tokenizer = BertTokenizer('bert/vocab.txt', max_len=512)

query_tokens = tokenizer.tokenize(query)

print(query_tokens)

feature = preprocess_tokenized_text(context, query_tokens, tokenizer)

tensors_for_inference, tokens_for_postprocessing = feature
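
For intuition about what `tokenizer.tokenize` prints, BERT-style tokenizers commonly split words with a greedy longest-match-first WordPiece pass. The following is a minimal sketch of that general technique with a toy vocabulary; it is not the code of the local `bert.tokenization` module, and `wordpiece` and `toy_vocab` are hypothetical names.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, for illustration only.
toy_vocab = {"anti", "##biotic", "##s", "bacteria"}
print(wordpiece("antibiotics", toy_vocab))
print(wordpiece("fungi", toy_vocab))
```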

In [None]:
from tritonclient.utils import triton_to_np_dtype
import numpy as np

model_metadata = triton_client.get_model_metadata(model_name=model_name, headers=headers)

input_meta = model_metadata["inputs"]
output_meta = model_metadata["outputs"]

np_dtype = triton_to_np_dtype(input_meta[0]["datatype"])

input_ids = np.array(tensors_for_inference.input_ids, dtype=np_dtype)[None,...] # make bs=1
segment_ids = np.array(tensors_for_inference.segment_ids, dtype=np_dtype)[None,...] # make bs=1
input_mask = np.array(tensors_for_inference.input_mask, dtype=np_dtype)[None,...] # make bs=1

input_mapping = {
    "input_ids": input_ids,
    "token_type_ids": segment_ids,
    "attention_mask": input_mask,
}

inputs = []
outputs = []

# Populate the inputs array
for in_meta in input_meta:
    input_name = in_meta["name"]
    data = input_mapping[input_name]

    # Named infer_input to avoid shadowing the built-in input()
    infer_input = tritonhttpclient.InferInput(input_name, data.shape, in_meta["datatype"])
    infer_input.set_data_from_numpy(data, binary_data=False)
    inputs.append(infer_input)

# Populate the outputs array
for out_meta in output_meta:
    output_name = out_meta["name"]
    output = tritonhttpclient.InferRequestedOutput(output_name, binary_data=False)
    outputs.append(output)

            
# Run inference
res = triton_client.infer(
    model_name,
    inputs,
    request_id="0",
    outputs=outputs,
    model_version="1",
    headers=headers,
)

# Use the public accessor instead of the private _result attribute
for output in res.get_response()["outputs"]:
    print(output["name"])
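
Because the tensors are sent with `binary_data=False`, the request travels as JSON. The sketch below shows roughly what such a body looks like under the v2 inference protocol, using only the standard library. The helper names, the `INT64` datatype, and the token ids are assumptions for illustration; the notebook itself reads names and datatypes from the model metadata.

```python
import json


def flatten(values):
    """Flatten nested lists into one row-major list."""
    if isinstance(values, list):
        flat = []
        for item in values:
            flat.extend(flatten(item))
        return flat
    return [values]


def shape_of(values):
    """Shape of a rectangular nested list, e.g. [[1, 2, 3]] -> [1, 3]."""
    shape = []
    while isinstance(values, list):
        shape.append(len(values))
        values = values[0] if values else None
    return shape


def build_infer_body(named_inputs, output_names, datatype="INT64", request_id="0"):
    """Assemble a JSON-serializable inference request body."""
    return {
        "id": request_id,
        "inputs": [
            {
                "name": name,
                "shape": shape_of(data),
                "datatype": datatype,  # assumed; normally taken from metadata
                "data": flatten(data),
            }
            for name, data in named_inputs.items()
        ],
        "outputs": [
            {"name": name, "parameters": {"binary_data": False}}
            for name in output_names
        ],
    }


# Toy batch of size 1 with made-up token ids.
body = build_infer_body(
    {"input_ids": [[101, 2087, 102]], "attention_mask": [[1, 1, 1]]},
    ["output_0", "output_1"],
)
print(json.dumps(body, indent=2))
```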

In [None]:
from bert.postprocess import get_answer

start_logits = res.as_numpy("output_0")[0]
end_logits = res.as_numpy("output_1")[0]

# Post-processing: pass the full start and end logit vectors
doc_tokens = context.split()
answer, answers = get_answer(doc_tokens, tokens_for_postprocessing, start_logits, end_logits)
    
# print result
print(answer)
print(answers)
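
To illustrate how start and end logits are typically turned into an answer span in extractive question answering, here is a minimal sketch that picks the highest-scoring pair with the end at or after the start. It is a generic illustration with made-up logits, not the implementation of `get_answer`; `best_span` and `max_answer_len` are hypothetical names.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return ((start, end), score) maximizing start_logit + end_logit.

    Only spans with start <= end and at most max_answer_len tokens are
    considered.
    """
    best = None
    best_score = float("-inf")
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best, best_score


# Made-up logits for a four-token sequence.
toy_start = [0.1, 5.0, 0.2, 0.0]
toy_end = [0.0, 0.3, 4.0, 6.0]
span, score = best_span(toy_start, toy_end)
print(span, score)
```

The selected token indices would then be mapped back to words in the context to produce the answer text.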

## Delete the webservice

In [None]:
service.delete()

## Next steps

Read [our documentation](https://aka.ms/triton-aml-docs) to use Triton with your own models, or check out the other notebooks in this folder for ways to do pre- and post-processing on the server.