### Inference with the updated model.

The model is now updated and saved as a Triton backend model. We will apply tokenization offline and query the model with the tokenized input IDs and the attention mask.
The model will return the indices of the translated text; we will then use the tokenizer again to decode those indices and produce the output.
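
To make the two model inputs concrete, here is a toy sketch of how a batch of token IDs is padded and how the matching attention mask is built. The token IDs, the `PAD_ID` value, and the `pad_batch` helper are hypothetical and only for illustration; the real M2M100 tokenizer has its own vocabulary and padding ID, and right-padding is assumed here.

```python
# Toy illustration of padding and attention masks.
# Assumptions: made-up token ids, PAD_ID = 0, right-padding.
PAD_ID = 0


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad variable-length id sequences to a common length and build masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep their ids; padding positions get pad_id.
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


batch = [[5, 8, 13], [7, 2]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[5, 8, 13], [7, 2, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

Both arrays have the same `[batch, sequence_length]` shape, which is why the two Triton inputs are always built together.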

We can later expose the tokenizer as a separate service that people can interact with over HTTP.

In [1]:
MODEL_NAME = "masakhane/m2m100_418M_en_swa_rel_news"


In [2]:
from transformers import AutoTokenizer




In [3]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


In [4]:
import numpy as np
import tritonclient.http as httpclient


In [5]:
client = httpclient.InferenceServerClient(url="localhost:8080")


In [6]:
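# Placeholder declarations only; generate_inference_input below rebuilds both inputs with concrete shapes.
# The client expects the datatype string "INT64" ("TYPE_INT64" is the config.pbtxt spelling).
inputs_ids = httpclient.InferInput("input_ids", shape=(-1, 1), datatype="INT64")
attention_mask = httpclient.InferInput("attention_mask", shape=(-1, 1), datatype="INT64")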


In [7]:
text_to_translate = ["I am learning how to use Triton Server for Machine Learning"]


In [8]:
def get_tokenizer(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer


def tokenize_text(model_name, text):
    tokenizer = get_tokenizer(model_name)
    tokenized_text = tokenizer(text, padding=True, return_tensors="np")
    return tokenized_text.input_ids, tokenized_text.attention_mask


In [9]:
def generate_inference_input(model_name, text):
    inputs = []
    input_ids, attention_mask = tokenize_text(model_name, text)
    inputs.append(httpclient.InferInput("input_ids", input_ids.shape, "INT64"))
    inputs.append(httpclient.InferInput("attention_mask", attention_mask.shape, "INT64"))

    inputs[0].set_data_from_numpy(input_ids.astype(np.int64), binary_data=False)
    inputs[1].set_data_from_numpy(attention_mask.astype(np.int64), binary_data=False)
    return inputs


In [10]:
inputs = generate_inference_input(MODEL_NAME, text_to_translate)


In [11]:
inputs


[<tritonclient.http._infer_input.InferInput at 0x106c57a90>,
 <tritonclient.http._infer_input.InferInput at 0x15a6003d0>]
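
The `InferInput` objects above are opaque, so it can help to see roughly what a JSON request body looks like when `binary_data=False`. The sketch below builds such a body by hand, following the v2 inference protocol layout (`inputs` with `name`, `shape`, `datatype`, `data`). The `build_infer_request` helper and the token IDs are hypothetical, and this is not guaranteed to match the client's serialization byte for byte.

```python
import json


def build_infer_request(input_ids, attention_mask, output_name="generated_indices"):
    """Build a JSON-serializable body for POST /v2/models/<model>/infer."""

    def tensor(name, rows):
        # Shape is [batch, sequence_length]; all rows must have equal length.
        shape = [len(rows), len(rows[0])]
        # Tensor data is sent flattened in row-major order.
        flat = [int(value) for row in rows for value in row]
        return {"name": name, "shape": shape, "datatype": "INT64", "data": flat}

    return {
        "inputs": [
            tensor("input_ids", input_ids),
            tensor("attention_mask", attention_mask),
        ],
        # Ask for the output as JSON rather than as a binary blob.
        "outputs": [{"name": output_name, "parameters": {"binary_data": False}}],
    }


# Made-up token ids for a single sentence of three tokens.
body = build_infer_request([[2, 17, 9]], [[1, 1, 1]])
print(json.dumps(body, indent=2))
```

Seeing the body spelled out makes shape and datatype mismatches easier to debug than inspecting the client objects.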

In [12]:
outputs = httpclient.InferRequestedOutput("generated_indices", binary_data=False)


In [13]:
# Authenticate against the login page to obtain a session cookie
import requests

HOST = "http://localhost:8080/"
USERNAME = "user@example.com"
PASSWORD = "12341234"

session = requests.Session()
response = session.get(HOST)

headers = {
    "Content-Type": "application/x-www-form-urlencoded",
}

data = {"login": USERNAME, "password": PASSWORD}
session.post(response.url, headers=headers, data=data)
session_cookie = session.cookies.get_dict()["authservice_session"]


In [14]:
SESSION = session_cookie
SERVICE_HOSTNAME = "m2m100-translation-inference-service.default.example.com"
INGRESS_HOST = "localhost"
INGRESS_PORT = "8080"


In [15]:
url = f"http://{INGRESS_HOST}:{INGRESS_PORT}/v2/models/m2m100_translation_model"

# Define headers
headers = {
    "Cookie": f"authservice_session={SESSION}",
    "Host": SERVICE_HOSTNAME,
}

# Make the HTTP request
response = requests.get(url, headers=headers)

# Check the response
if response.status_code == 200:
    print("Request was successful.")
    print(response.text)
else:
    print(f"Request failed with status code {response.status_code}.")


Request was successful.
{"name":"m2m100_translation_model","versions":["1"],"platform":"python","inputs":[{"name":"input_ids","datatype":"INT64","shape":[-1,-1]},{"name":"attention_mask","datatype":"INT64","shape":[-1,-1]}],"outputs":[{"name":"generated_indices","datatype":"FP32","shape":[-1,-1]}]}


In [16]:
SESSION


'MTY5NTg0MzYyOHxOd3dBTkUwelMwUk5UMXBIVmpkRlNWbFVWbEl5UkROU1FsZE5SVVJYV2xGVVVETlVTMU15UkVaWldUTTBOMEZGTmtSQ1RVNUtTVUU9fJeO7m9NIqQhd7wSduSzaju2JS3xSK6kji2a_JHCZ6xf'

In [17]:
outputs


<tritonclient.http._requested_output.InferRequestedOutput at 0x15b82aa90>

In [18]:
results = client.infer(model_name="m2m100_translation_model", inputs=inputs, outputs=[outputs], headers=headers)
inference_output = results.as_numpy('generated_indices')


<HTTPSocketPoolResponse status=200 headers={'content-length': '266', 'content-type': 'application/json', 'date': 'Wed, 27 Sep 2023 19:40:44 GMT', 'x-envoy-upstream-service-time': '15149', 'server': 'istio-envoy'}>
**********


In [None]:
inference_output


array([[     2, 128088,  71714,    720,  12089,    438,  85959,    102,
         55728,  37578,  53140,    311,    103,   2447,     82,   2786,
          3194,    720,  12089,    438,  28668,  21552,  55125,    360,
             2]])

In [21]:
decoded_output = tokenizer.batch_decode(inference_output, skip_special_tokens=True)


In [22]:
decoded_output


['Ninajifunza jinsi ya kutumia Mtandao wa Triton ili Kujifunza Kutumia Mashine']

We are able to run inference using Triton.

The next step is to build the production pipeline that will scale on GCP.

### Later we will update this code to use gRPC, which is generally faster and more efficient than HTTP with JSON.

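One of the main advantages of gRPC over plain HTTP with JSON is that it is typically faster and more efficient. This is mainly because gRPC runs over HTTP/2 and serializes messages as binary Protocol Buffers instead of JSON text.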



#### Start here tomorrow night

https://github.com/kserve/kserve/tree/master/docs/samples/multimodelserving/triton

https://towardsdatascience.com/kserve-highly-scalable-machine-learning-deployment-with-kubernetes-aa7af0b71202


$(kubectl get inferenceservices m2m100-translation-inference-service -o jsonpath='{.status.url}' | cut -d "/" -f 3)

 docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v ${PWD}/triton_model_repository:/models nvcr.io/nvidia/tritonserver:23.06-py3 bash -c "pip install transformers==4.30.2  sentencepiece==0.1.99 && tritonserver --model-repository=/models"


 - https://medium.com/@fractal.ai/bloom-3b-optimization-deployment-using-triton-server-part-1-f809037fea40

Stopped here because I failed to load the model in the Kubernetes cluster due to memory limitations.

[400] Failed to process the request(s) for model instance 'm2m100_translation_model_0', message: Stub process 'm2m100_translation_model_0' is not healthy.

The next step is to document what I learned and to learn more about memory usage.
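
As a starting point for the memory investigation, a back-of-the-envelope estimate of the weight footprint can be sketched as below. It assumes the "418M" in the model name is the parameter count, and it only counts the weights: activations, generation buffers, the Python runtime, and the libraries installed in the container all add to this. The `weight_memory_gib` helper is a hypothetical name.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed to hold the model weights, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: parameter count read off the "418M" in the model name.
NUM_PARAMS = 418_000_000

# Bytes per parameter for two common floating-point precisions.
for label, nbytes in [("fp32", 4), ("fp16", 2)]:
    gib = weight_memory_gib(NUM_PARAMS, nbytes)
    print(f"{label}: ~{gib:.2f} GiB of weights")
```

Under these assumptions the fp32 weights alone need roughly 1.6 GiB, so a pod memory limit near or below that figure would plausibly explain the unhealthy stub process.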