### Inference with the updated model

The model is now updated and saved as a Triton backend model. We apply tokenization offline, on the client side, and query the model with the token IDs and the attention mask.
The model returns the token indices of the translated text; we then use the tokenizer again to decode those indices and produce the output.

Later, the tokenizer can run as a separate service that people interact with over HTTP.
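
A separate tokenizer service could look roughly like the sketch below. It uses only the Python standard library, and everything in it is an assumption rather than part of this notebook: the `/tokenize` path, the JSON field names, the port, and the `tokenize` callable, which is expected to return plain nested lists.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def tokenize_payload(body: bytes, tokenize) -> bytes:
    """Turn a JSON request {"text": [...]} into JSON holding input_ids and attention_mask."""
    texts = json.loads(body)["text"]
    input_ids, attention_mask = tokenize(texts)
    return json.dumps(
        {"input_ids": input_ids, "attention_mask": attention_mask}
    ).encode("utf-8")


def make_server(tokenize, host="127.0.0.1", port=8080) -> HTTPServer:
    """Bind an HTTP server exposing POST /tokenize; call serve_forever() to start it."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Hypothetical route name; anything else is rejected.
            if self.path != "/tokenize":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = tokenize_payload(self.rfile.read(length), tokenize)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return HTTPServer((host, port), Handler)
```

In this notebook, `tokenize` could wrap `tokenize_text` and convert both arrays with `.tolist()`, so that the result is JSON serializable. Running `make_server(tokenize).serve_forever()` would then start the service.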

In [1]:
MODEL_NAME = "masakhane/m2m100_418M_en_swa_rel_news"

In [2]:
from transformers import AutoTokenizer

In [3]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [4]:
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import triton_to_np_dtype

In [5]:
client = httpclient.InferenceServerClient(url="localhost:8000")
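
The server may still be starting when the client is created, so it can help to wait until both the server and the model report ready. The sketch below is one way to do that with the client's `is_server_ready` and `is_model_ready` calls; the helper name `wait_until_ready`, the timeout, and the polling interval are assumptions.

```python
import time


def wait_until_ready(client, model_name, timeout_s=60.0, poll_s=1.0):
    """Poll the Triton client until the server and the model are ready, or time out."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if client.is_server_ready() and client.is_model_ready(model_name):
                return True
        except Exception:
            # The connection is typically refused while the server is still starting.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

For example, `wait_until_ready(client, "m2m100_translation_model")` returns `True` once inference requests can be sent, and `False` if the timeout expires first.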

In [6]:
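# Exploratory placeholders; the real inputs are rebuilt with concrete shapes in generate_inference_input.
# The client expects the datatype string "INT64"; "TYPE_INT64" is the config.pbtxt spelling.
inputs_ids = httpclient.InferInput("input_ids", shape=[1, 1], datatype="INT64")
attention_mask = httpclient.InferInput("attention_mask", shape=[1, 1], datatype="INT64")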

In [8]:
text_to_translate = ["I am learning how to use Triton Server for Machine Learning"]

In [12]:
def get_tokenizer(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer


def tokenize_text(model_name, text):
    tokenizer = get_tokenizer(model_name)
    tokenized_text = tokenizer(text, padding=True, return_tensors="np")
    return tokenized_text.input_ids, tokenized_text.attention_mask

In [30]:
def generate_inference_input(model_name, text):
    inputs = []
    input_ids, attention_mask = tokenize_text(model_name, text)
    inputs.append(httpclient.InferInput("input_ids", input_ids.shape, "INT64"))
    inputs.append(httpclient.InferInput("attention_mask", attention_mask.shape, "INT64"))

    inputs[0].set_data_from_numpy(input_ids.astype(np.int64), binary_data=False)
    inputs[1].set_data_from_numpy(attention_mask.astype(np.int64), binary_data=False)
    return inputs
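
To make the two tensors concrete, here is a small pure-Python sketch of what `padding=True` produces for a batch of sentences with different lengths. It assumes right padding, and the token IDs and `pad_id` are hypothetical values chosen for illustration, not real vocabulary entries.

```python
def pad_batch(token_id_lists, pad_id):
    """Right-pad each sequence to the longest length and build the matching mask."""
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical token IDs for two sentences of different length.
batch = [[5, 17, 42, 2], [5, 9, 2]]
ids, mask = pad_batch(batch, pad_id=1)
print(ids)   # [[5, 17, 42, 2], [5, 9, 2, 1]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Both tensors have the same shape, which is why `generate_inference_input` can pass `input_ids.shape` and `attention_mask.shape` directly to `InferInput`.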

In [31]:
inputs = generate_inference_input(MODEL_NAME, text_to_translate)

In [32]:
inputs

[<tritonclient.http._infer_input.InferInput at 0x16afcffd0>,
 <tritonclient.http._infer_input.InferInput at 0x16afcff90>]

In [8]:
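# Stale cell: `query` is never defined in this notebook; inputs are filled in generate_inference_input.
# query.set_data_from_numpy(np.asarray([text_to_translate], dtype=object))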

In [21]:
outputs = httpclient.InferRequestedOutput("generated_indices", binary_data=False)

In [33]:
results = client.infer(model_name="m2m100_translation_model", inputs=inputs, outputs=[outputs])
inference_output = results.as_numpy('generated_indices')
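
With `binary_data=False`, the client sends the tensors as JSON in the KServe v2 inference protocol, posted to `/v2/models/m2m100_translation_model/infer`. The sketch below builds roughly the same request body by hand to show what goes over the wire. The helper name `build_infer_request` and the sample token IDs are assumptions, and the body the client actually produces may carry additional fields.

```python
import json


def build_infer_request(input_ids, attention_mask, output_name="generated_indices"):
    """Build a v2 inference request body from nested lists of token IDs and mask values."""

    def tensor(name, rows):
        return {
            "name": name,
            "shape": [len(rows), len(rows[0])],
            "datatype": "INT64",
            # Tensor data is flattened in row-major order.
            "data": [value for row in rows for value in row],
        }

    return {
        "inputs": [
            tensor("input_ids", input_ids),
            tensor("attention_mask", attention_mask),
        ],
        "outputs": [{"name": output_name, "parameters": {"binary_data": False}}],
    }


# Hypothetical token IDs for a single short sentence.
body = build_infer_request([[5, 17, 2]], [[1, 1, 1]])
print(json.dumps(body, indent=2))
```

Seeing the payload this way makes it easier to debug shape or datatype errors returned by the server.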

In [34]:
inference_output

array([[     2, 128088,  71714,    720,  12089,    438,  85959,    102,
         55728,  37578,  53140,    311,    103,   2447,     82,   2786,
          3194,    720,  12089,    438,  28668,  21552,  55125,    360,
             2]])

In [35]:
decoded_output = tokenizer.batch_decode(inference_output, skip_special_tokens=True)

In [36]:
decoded_output

['Ninajifunza jinsi ya kutumia Mtandao wa Triton ili Kujifunza Kutumia Mashine']

We are now able to run inference through Triton.

The next step is to build the production pipeline that will scale on GCP.

### Later we will update this code to use gRPC, which is generally faster and more efficient than HTTP.

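One of the main advantages of gRPC over HTTP/REST with JSON is that it is faster and more efficient. It runs on HTTP/2, which multiplexes many requests over one connection, and it serializes payloads as compact binary Protocol Buffers instead of JSON text.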
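
A minimal sketch of the same request over gRPC, using `tritonclient.grpc`, could look like this. It assumes the server's gRPC endpoint is reachable on port 8001 and that the arrays are already `int64`, as returned by the tokenizer with `return_tensors="np"`. The function name `infer_over_grpc` and its `grpcclient` parameter, which lets a stand-in module be passed for testing, are hypothetical.

```python
def infer_over_grpc(input_ids, attention_mask, url="localhost:8001",
                    model_name="m2m100_translation_model", grpcclient=None):
    """Send input_ids and attention_mask to Triton over gRPC and return the generated indices."""
    if grpcclient is None:
        # Imported lazily; requires the tritonclient package with gRPC support installed.
        import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url=url)
    inputs = [
        grpcclient.InferInput("input_ids", list(input_ids.shape), "INT64"),
        grpcclient.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
    ]
    # Unlike the HTTP client, the gRPC client takes no binary_data argument here.
    inputs[0].set_data_from_numpy(input_ids)
    inputs[1].set_data_from_numpy(attention_mask)
    outputs = [grpcclient.InferRequestedOutput("generated_indices")]

    result = client.infer(model_name=model_name, inputs=inputs, outputs=outputs)
    return result.as_numpy("generated_indices")
```

The returned array can be passed to `tokenizer.batch_decode` exactly as in the HTTP version.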



In [None]:
#### Start here tomorrow night

https://github.com/kserve/kserve/tree/master/docs/samples/multimodelserving/triton

 docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v ${PWD}/triton_model_repository:/models nvcr.io/nvidia/tritonserver:23.06-py3 bash -c "pip install transformers==4.30.2  sentencepiece==0.1.99 && tritonserver --model-repository=/models"


 - https://medium.com/@fractal.ai/bloom-3b-optimization-deployment-using-triton-server-part-1-f809037fea40