### Deploying an optimized M2M100 model with ONNX

#### Optimize M2M100 model with ONNX


In this notebook we describe the steps to load an M2M100 translation model from Hugging Face and optimize it with ONNX Runtime. We will also show how to use the optimized model to perform translation.

Once the model is optimized, we will deploy it as an API so that it can be used in a web application.

As a first step we will load the vanilla model from Hugging Face and use it for inference, then we will convert it to ONNX, and finally we will optimize it with ONNX Runtime.

### First Step

We will start by loading our model from the Hugging Face Hub.

Our model is an encoder-decoder model from the M2M100 family. It was trained to translate English to Swahili. Why did I pick Swahili? Because I am a native Swahili speaker. Let us make sure we have the `transformers` library installed, as well as PyTorch.

In [3]:
from transformers import AutoTokenizer, M2M100ForConditionalGeneration, pipeline


In [4]:
MODEL_NAME = "masakhane/m2m100_418M_en_swa_rel_news"


In [5]:
model: M2M100ForConditionalGeneration = M2M100ForConditionalGeneration.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


In [6]:
text_to_translate = "Hello, my name is Espoir Murhabazi,  I am a Software Engineer from Congo DRC but living in UK"


In [7]:
model_input = tokenizer(text_to_translate, return_tensors="pt")


In [8]:
model_input.keys()


dict_keys(['input_ids', 'attention_mask'])

In [9]:
generated_tokens = model.generate(**model_input, forced_bos_token_id=tokenizer.lang_code_to_id["sw"])




At this point our model has generated the translation tokens. The next step is to use our tokenizer to convert the tokens back to text. This is called decoding.

In [12]:
translated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)


In [13]:
translated_text


['Jina langu ni Espoir Murhabazi, Mimi ni mhandisi wa programu za kompyuta kutoka Kongo DRC lakini ninaishi Uingereza']

The translated text shows us that the model is working. The next step is to prepare the model for production.
To productionize our model we will export it to the ONNX format.

#### What is ONNX format?

ONNX stands for Open Neural Network Exchange. It is an open format built to represent machine learning models.

As you may know, neural networks are computation graphs made of inputs, weights, and operations.

The ONNX format is a way of saving a neural network as a computation graph. That computation graph represents the flow of data through the neural network.
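To make the idea of a computation graph concrete, here is a toy sketch in plain Python. It is not the ONNX file format or its API; the `Node` class and the `run_graph` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Node:
    """One operation in the graph: reads named values, writes one named value."""
    op: Callable[..., float]
    inputs: List[str]
    output: str


def run_graph(nodes: List[Node], values: Dict[str, float]) -> Dict[str, float]:
    """Evaluate the nodes in order; `values` holds the graph inputs and weights."""
    values = dict(values)  # do not mutate the caller's dictionary
    for node in nodes:
        args = [values[name] for name in node.inputs]
        values[node.output] = node.op(*args)
    return values


# y = relu(w * x + b), written as three nodes that pass data along the graph.
graph = [
    Node(op=lambda w, x: w * x, inputs=["w", "x"], output="wx"),
    Node(op=lambda wx, b: wx + b, inputs=["wx", "b"], output="z"),
    Node(op=lambda z: max(z, 0.0), inputs=["z"], output="y"),
]

result = run_graph(graph, {"x": 2.0, "w": 3.0, "b": -1.0})
print(result["y"])  # 5.0
```

A real ONNX file stores the same kind of information (operators, their inputs and outputs, and the weights) in a standardized, framework-independent form.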


The key benefits of saving neural networks in the ONNX format are interoperability and hardware access. A neural network saved in the ONNX format can be read by any deep learning platform that supports ONNX. For example, a model trained in PyTorch can be exported to ONNX and then imported into TensorFlow, and vice versa.

You don't need Python to read a model saved as ONNX; you can use other programming languages such as JavaScript, C, or C++.

ONNX makes it easier for a model to access hardware optimizations, and you can apply further optimizations such as quantization to your ONNX model.

Let us see how we can convert our model to the ONNX format to take full advantage of it.

We will export the model and then check that we can load it.

To export the model to the ONNX format we will use the Optimum CLI from Hugging Face.

In [14]:
! optimum-cli export onnx --model masakhane/m2m100_418M_en_swa_rel_news --task seq2seq-lm-with-past --for-ort onnx/m2m100_418M_en_swa_rel_news


The option --for-ort was passed, but its behavior is now the default in the ONNX exporter and passing it is not required anymore.
Framework not specified. Using pt to export to ONNX.
Using framework PyTorch: 2.0.0
Overriding 1 configuration item(s)
	- use_cache -> False
  if max_pos > self.weights.size(0):
  if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
  if attention_mask.size() != (bsz, 1, tgt_len, src_len):
  if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
verbose: False, log level: Level.ERROR

Using framework PyTorch: 2.0.0
Overriding 1 configuration item(s)
	- use_cache -> True
  if input_shape[-1] > 1:
  mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
verbose: False, log level: Level.ERROR

Using framework PyTorch: 2.0.0
Overriding 1 configuration item(s)
	- use_cache -> True
Asked a sequence length of 16, but a sequence length of 1 will be used with use_past == True for `dec

Let us check that the export worked.

If the previous command ran successfully, our model is saved at `onnx/m2m100_418M_en_swa_rel_news`.

Checking the sizes, we notice that our encoder model takes 1.1 GB and our decoder model takes 1.7 GB, which brings the total to 2.8 GB. The same folder also contains the tokenizer data.

In [15]:
from pathlib import Path


In [17]:
base_model_onnx_dir = Path.cwd().joinpath('onnx').joinpath('m2m100_418M_en_swa_rel_news')


In [18]:
base_model_onnx_dir.exists()


True
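The file sizes quoted above can be listed with a few lines of Python. This is a minimal sketch: the helper name `onnx_file_sizes_mb` is made up for this example, and the directory is assumed to be the export folder used in this notebook.

```python
from pathlib import Path
from typing import Dict


def onnx_file_sizes_mb(directory: Path) -> Dict[str, float]:
    """Return {file name: size in MB} for every .onnx file in `directory`."""
    return {
        path.name: path.stat().st_size / (1024 * 1024)
        for path in sorted(directory.glob("*.onnx"))
    }


if __name__ == "__main__":
    # Assumed location: the export directory created by the optimum-cli command.
    export_dir = Path.cwd() / "onnx" / "m2m100_418M_en_swa_rel_news"
    for name, size_mb in onnx_file_sizes_mb(export_dir).items():
        print(f"{name}: {size_mb:.1f} MB")
```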

### Applying Quantization

Quantization is the process of reducing the model size by using fewer bits to represent its parameters. Instead of the 32-bit floating point precision used by most models, with quantization we can use 8 bits to represent a number and consequently reduce the size of the model.

Smaller models resulting from quantization are quicker to deploy and usually have lower latency in production.
For this tutorial we will use quantization to reduce the size of our model for inference.
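As an illustration of the idea, here is a small sketch of 8-bit affine quantization on a handful of numbers. It shows the general principle (a scale and a zero point that map floats to integers between 0 and 255), not the exact procedure ONNX Runtime applies; the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_uint8(values: List[float]) -> Tuple[List[int], float, int]:
    """Map floats to integers in [0, 255] with an affine (scale, zero point) scheme."""
    # The represented range must contain 0 so that 0.0 is stored exactly.
    low, high = min(min(values), 0.0), max(max(values), 0.0)
    scale = (high - low) / 255 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-low / scale)
    quantized = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized: List[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate floats from the stored integers."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("quantized:", q)
print("scale:", scale, "zero point:", zero_point)
print("largest reconstruction error:", max_error)
```

Each value now fits in one byte instead of four, at the cost of a reconstruction error of roughly one quantization step (`scale`).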

In [30]:
from optimum.onnxruntime import ORTQuantizer, ORTModelForSeq2SeqLM
from optimum.onnxruntime.configuration import AutoQuantizationConfig


In [32]:
encoder_quantizer = ORTQuantizer.from_pretrained(base_model_onnx_dir, file_name="encoder_model.onnx")


In [33]:
decoder_quantizer = ORTQuantizer.from_pretrained(base_model_onnx_dir, file_name="decoder_model.onnx")


In [34]:
decoder_with_past_quantizer = ORTQuantizer.from_pretrained(base_model_onnx_dir, file_name="decoder_with_past_model.onnx")


In [35]:
quantizers = [encoder_quantizer, decoder_quantizer, decoder_with_past_quantizer]


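We will apply dynamic quantization to our model. With dynamic quantization the weights are quantized ahead of time, while the quantization parameters of the activations are computed at inference time, so no calibration dataset is needed.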

In [36]:
dynamic_quantization_config = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)


In [37]:
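# Assumed definition (MODEL_SUFFIX is not defined elsewhere in this notebook):
# the last component of the model ID, which matches the folder name in the logs below.
MODEL_SUFFIX = MODEL_NAME.split("/")[-1]
quantized_model_path = Path("onnx").joinpath(f"{MODEL_SUFFIX}_quantized/")
quantized_model_path.mkdir(parents=True, exist_ok=True)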


In [38]:
for quantizer in quantizers:
    quantizer.quantize(quantization_config=dynamic_quantization_config, save_dir=quantized_model_path)


Creating dynamic quantizer: QOperator (mode: IntegerOps, schema: u8/s8, channel-wise: False)
Quantizing model...
Saving quantized model at: onnx/m2m100_418M_en_swa_rel_news_quantized (external data format: False)
Configuration saved in onnx/m2m100_418M_en_swa_rel_news_quantized/ort_config.json
Creating dynamic quantizer: QOperator (mode: IntegerOps, schema: u8/s8, channel-wise: False)
Quantizing model...
Saving quantized model at: onnx/m2m100_418M_en_swa_rel_news_quantized (external data format: False)
Configuration saved in onnx/m2m100_418M_en_swa_rel_news_quantized/ort_config.json
Creating dynamic quantizer: QOperator (mode: IntegerOps, schema: u8/s8, channel-wise: False)
Quantizing model...
Saving quantized model at: onnx/m2m100_418M_en_swa_rel_news_quantized (external data format: False)
Configuration saved in onnx/m2m100_418M_en_swa_rel_news_quantized/ort_config.json


Our models are now saved in quantized form; we can check the size of the quantized models.

In [40]:
# Use a dedicated loop variable so we do not overwrite the `model` loaded earlier
for onnx_file in quantized_model_path.glob("*.onnx"):
    print("the size of the model in MB is: ", onnx_file.stat().st_size / (1024 * 1024))


the size of the model in MB is:  823.3566484451294
the size of the model in MB is:  799.1366529464722
the size of the model in MB is:  649.4426412582397


We can see that we have managed to cut the size of our initial models roughly in half: from about 1.6 GB without quantization to about 800 MB with quantization. Let us see how to use the quantized model for inference.

### Use the quantized model

In [41]:
quantized_model_path = base_model_onnx_dir


In [42]:
quantized_model = ORTModelForSeq2SeqLM.from_pretrained(quantized_model_path, 
                                                       decoder_file_name='decoder_model_quantized.onnx',
                                                       encoder_file_name='encoder_model_quantized.onnx',)


2023-10-17 07:43:33.253277 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/Shape_7_output_0'. It is not used by any node and should be removed from the model.
2023-10-17 07:43:33.253377 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/Constant_22_output_0'. It is not used by any node and should be removed from the model.
2023-10-17 07:43:33.253444 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/Constant_17_output_0'. It is not used by any node and should be removed from the model.
2023-10-17 07:43:33.254122 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/Shape_4_output_0'. It is not used by any node and should be removed from the model.
2023-10-17 07:43:33.255608 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/S

In [43]:
quantized_pipeline = pipeline("translation_en_to_sw", model=quantized_model, tokenizer=tokenizer)


In [44]:
translated_text_quantized = quantized_pipeline(text_to_translate)


In [45]:
print(translated_text_quantized)


[{'translation_text': 'Jina langu ni Espoir Murhabazi, Mimi ni mhandisi wa programu za kompyuta kutoka Kongo DRC lakini ninaishi Uingereza'}]


On this example, quantization reduces the size of the model while producing the same translation. This is in line with the documentation and with experiments performed on other models, but a single sentence is not an evaluation: we still need to compare the quantized and the original model on more data to check for any loss in quality.
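A rough way to run such a check on more sentences is to measure how often the two models return exactly the same translation. The sketch below is only a starting point: `agreement_rate` is a hypothetical helper, and the two stand-in translators exist only so that the example runs on its own. In the notebook they would wrap the original model and `quantized_pipeline`.

```python
from typing import Callable, Sequence


def agreement_rate(
    sentences: Sequence[str],
    translate_reference: Callable[[str], str],
    translate_quantized: Callable[[str], str],
) -> float:
    """Fraction of sentences for which both models return the same translation."""
    if not sentences:
        raise ValueError("need at least one sentence")
    matches = sum(
        translate_reference(s).strip() == translate_quantized(s).strip()
        for s in sentences
    )
    return matches / len(sentences)


# Stand-in translators (not real models): they agree on short inputs only.
def fake_reference(sentence: str) -> str:
    return sentence.upper()


def fake_quantized(sentence: str) -> str:
    return sentence.upper() if len(sentence) < 12 else sentence.lower()


rate = agreement_rate(["hello", "good morning", "thanks"], fake_reference, fake_quantized)
print(f"agreement: {rate:.2f}")
```

Exact match is a strict criterion. Two different translations can both be correct, so a translation metric computed against reference translations would give a fairer picture.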

With our model quantized, let us move to the next step, which is deployment.

### Deploy the Model for inference

At this point we have our model quantized and saved in the ONNX format. We will now deploy it to a production server using Triton Inference Server.
In the first section we will deploy it with Triton running as a Docker container, and then we will use KServe to deploy it to a Kubernetes environment.

### What is Triton Inference Server?

Triton is a software tool for deploying machine learning models for inference. It is designed to deliver high-performance inference across different hardware platforms, whether GPU or CPU. It also supports inference across cloud, data center, and embedded devices.
One of the advantages of Triton that I found is that it supports dynamic batching and concurrent model execution:

- Concurrent model execution is the capacity to run multiple models simultaneously on the same GPU or on multiple GPUs.

- Dynamic batching: for models that support batching, which is the case for most deep learning models, Triton implements scheduling and batching algorithms that combine individual requests together to improve inference throughput.
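The core idea of dynamic batching can be sketched in a few lines. This is a deliberate simplification with a hypothetical helper, `form_batches`: Triton's real scheduler also takes timing into account (for example how long a request may wait in the queue), which this toy version ignores.

```python
from typing import List, Sequence


def form_batches(pending_requests: Sequence[str], max_batch_size: int) -> List[List[str]]:
    """Group queued requests into batches of at most `max_batch_size` items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        list(pending_requests[start:start + max_batch_size])
        for start in range(0, len(pending_requests), max_batch_size)
    ]


queue = ["req-1", "req-2", "req-3", "req-4", "req-5"]
batches = form_batches(queue, max_batch_size=2)
print(batches)  # [['req-1', 'req-2'], ['req-3', 'req-4'], ['req-5']]
```

Running the model once per batch instead of once per request is what improves throughput.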


### Triton Server Backend
Triton supports different backends to execute models. A backend is a wrapper around a deep learning framework such as PyTorch, TensorFlow, TensorRT, or ONNX Runtime.
Two backend types interest us for this post: the Python backend and the ONNX Runtime backend.

The ONNX Runtime backend executes ONNX models, while the Python backend lets us write the model logic in Python.

I decided to go with the Python backend because I struggled to deploy the encoder-decoder model using an ensemble of ONNX models. I still have an open question about it on [Stack Overflow](https://stackoverflow.com/q/76638766/4683950).


#### Uploading the Model to Repository.

The first step before using our model is to upload it to the model repository. To start, we will use our local storage as the model repository, but later we will use remote storage such as Google Cloud Storage or AWS S3 to host our model.

### Configuration

The first step to deploy our model in Triton is to configure it.

The configuration declares the model and defines the shapes and data types of its inputs and outputs.

In [None]:
# %load ./triton_model_repository/m2m100_translation_model/config.pbtxt
name: "m2m100_translation_model"
backend: "python"
max_batch_size: 0
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  },
{
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  }
]
output [
  {
    name: "generated_indices"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  }
]

instance_group [
    {
      count: 1
      kind: KIND_CPU
    }
]


In the above configuration, we can see that the model expects two inputs, the input IDs and the attention mask, and that it returns the generated token indices.

The input IDs and the attention mask are the outputs of the tokenization process. The generated indices are the token indices produced by the model; they will be decoded to obtain the translated text.

The configuration file needs to be saved at the root of the model's folder in our model repository.

#### Create the load model script

The load model script is the Python script that loads our model and runs it for inference.

In [3]:
# %load ./triton_model_repository/m2m100_translation_model/1/model.py
from typing import Dict, List
import triton_python_backend_utils as pb_utils
from pathlib import Path
from optimum.onnxruntime import ORTModelForSeq2SeqLM
import torch

TOKENIZER_SW_LANG_CODE_TO_ID = 128088


class TritonPythonModel:

    def initialize(self, args: Dict[str, str]) -> None:
        """
        Load the quantized ONNX model once, when Triton loads this model
        :param args: arguments from Triton config file
        """
        current_path: Path = Path(args["model_repository"]).parent.absolute()
        model_path = current_path.joinpath("m2m100_translation_model", "1", "m2m100_418M_en_swa_rel_news_quantized")
        self.device = "cpu" if args["model_instance_kind"] == "CPU" else "cuda"
        # more variables in https://github.com/triton-inference-server/python_backend/blob/main/src/python.cc
        self.model = ORTModelForSeq2SeqLM.from_pretrained(model_path,
                                                          decoder_file_name="decoder_model_quantized.onnx",
                                                          encoder_file_name="encoder_model_quantized.onnx")
        if self.device == "cuda":
            # ORT models are not torch modules; they are moved with .to(device)
            self.model = self.model.to("cuda")
        print("TritonPythonModel initialized")

    def execute(self, requests) -> "List[pb_utils.InferenceResponse]":
        """
        Run generation for each request
        :param requests: 1 or more requests received by Triton server.
        :return: one response per request, holding the generated token indices
        """
        responses = []
        # Triton may pass several requests at once; each one gets its own response
        for request in requests:
            # token IDs and attention mask arrive as numpy arrays
            input_ids = pb_utils.get_input_tensor_by_name(request, "input_ids").as_numpy()
            attention_masks = pb_utils.get_input_tensor_by_name(request, "attention_mask").as_numpy()
            input_ids = torch.as_tensor(input_ids, dtype=torch.int64)
            attention_masks = torch.as_tensor(attention_masks, dtype=torch.int64)
            if self.device == "cuda":
                input_ids = input_ids.to("cuda")
                attention_masks = attention_masks.to("cuda")
            model_inputs = {"input_ids": input_ids, "attention_mask": attention_masks}
            generated_indices = self.model.generate(**model_inputs,
                                                    forced_bos_token_id=TOKENIZER_SW_LANG_CODE_TO_ID)
            # move to CPU before converting, in case generation ran on the GPU
            tensor_output = pb_utils.Tensor("generated_indices", generated_indices.cpu().numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[tensor_output]))
        return responses
    
    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')


The script contains a class with three methods:

- `initialize`: uses the ORT model class to load the model into memory.
- `execute`: reads the token IDs and the attention mask from each request received by the Triton server, calls the `generate` method on them, and returns the generated token indices. These indices will later be decoded by the tokenizer.
- `finalize`: optional; called once when the model is unloaded.
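Triton's Python backend expects `execute` to return one response per request, in the same order. That contract can be illustrated without Triton. The classes below are stand-ins made up for this sketch (they are not the real `pb_utils` types), and `fake_generate` merely reverses each sequence in place of a real model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeRequest:
    """Stand-in for a Triton request: named inputs holding lists of token IDs."""
    inputs: Dict[str, List[List[int]]]


@dataclass
class FakeResponse:
    """Stand-in for a Triton response: named outputs."""
    outputs: Dict[str, List[List[int]]]


def fake_generate(input_ids: List[List[int]]) -> List[List[int]]:
    """Placeholder for `model.generate`: here it just reverses each sequence."""
    return [list(reversed(row)) for row in input_ids]


def execute(requests: List[FakeRequest]) -> List[FakeResponse]:
    """Return exactly one response per request, in the same order."""
    responses = []
    for request in requests:
        generated = fake_generate(request.inputs["input_ids"])
        responses.append(FakeResponse(outputs={"generated_indices": generated}))
    return responses


requests = [
    FakeRequest(inputs={"input_ids": [[1, 2, 3]]}),
    FakeRequest(inputs={"input_ids": [[4, 5], [6, 7]]}),
]
responses = execute(requests)
print(len(responses), responses[1].outputs["generated_indices"])
```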

If our configuration is done properly and the model was saved properly, we should have a model repository that looks like this:

```
triton_model_repository
└── m2m100_translation_model
    ├── 1
    │   ├── m2m100_418M_en_swa_rel_news_quantized
    │   │   ├── config.json
    │   │   ├── decoder_model_quantized.onnx
    │   │   ├── decoder_with_past_model_quantized.onnx
    │   │   ├── encoder_model_quantized.onnx
    │   │   └── ort_config.json
    │   └── model.py
    └── config.pbtxt
```

Make sure your files are located at exactly the same paths as mine in order to be able to run the code.
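A quick way to verify the layout before starting the server is a small script like the one below. It is a sketch: the helper name `missing_repository_files` is hypothetical, and it only checks the two files Triton needs for a Python backend model in this layout, not the ONNX files themselves.

```python
from pathlib import Path
from typing import List


def missing_repository_files(repository: Path, model_name: str, version: str = "1") -> List[str]:
    """Return the expected files that are absent from the model repository."""
    model_dir = repository / model_name
    expected = [
        model_dir / "config.pbtxt",
        model_dir / version / "model.py",
    ]
    return [
        path.relative_to(repository).as_posix()
        for path in expected
        if not path.is_file()
    ]


if __name__ == "__main__":
    missing = missing_repository_files(Path("triton_model_repository"), "m2m100_translation_model")
    print("missing files:", missing or "none")
```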


### Launching the docker image

If you look carefully at the code of our Python model, you can see that it imports Optimum's ONNX Runtime integration. However, that runtime is not installed in the base Triton server image, which is why we decided to build our own Triton image.

In [4]:
# %load Dockerfile
# Use the base image
FROM nvcr.io/nvidia/tritonserver:23.06-py3

# Install the required Python packages
RUN pip install optimum==1.9.0 onnxruntime==1.15.1 onnx==1.14.0


The above Dockerfile shows how we build our Docker image.
We start from the base tritonserver image and then add the packages we need to run our model.

Next we can build our image using:

`docker build -t espymur/triton-onnx:dev  -f Dockerfile .`

Please note that the image is huge, around 15 GB. In the next post I will try to reduce the image size using the techniques suggested in the documentation.

Once the image is built, we can start the container with:


`docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002  --shm-size 128M -v ${PWD}/triton_model_repository:/models  espymur/triton-onnx:dev tritonserver --model-repository=/models`

- This command runs the Docker container and maps ports 8000, 8001, and 8002 of the container to the same ports on our local machine.

- It then mounts the `${PWD}/triton_model_repository` path from our local machine at `/models` in the container.

- It also sets the shared memory size to 128 MB.


With this command we can see that our model is running and that we can perform inference without any problem.
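One way to confirm that the server is up is to query its readiness endpoint. The sketch below uses only the standard library and assumes the HTTP port is mapped to `localhost:8000` as in the command above; `triton_is_ready` is a hypothetical helper name.

```python
import urllib.error
import urllib.request


def triton_is_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if the server answers 200 on the readiness endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: treat as not ready.
        return False


if __name__ == "__main__":
    print("server ready:", triton_is_ready())
```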

At this point, we have our model running inside the Docker container. The next step is to make inference requests. Let us see how we can achieve that.

### Making Inference Requests

The model is now deployed behind the Triton Python backend. We will apply tokenization on the client side and query the model with the token IDs and the attention mask.
The model will return the indices of the translated text; we will use the tokenizer again to decode the indices and produce the output.

Later, the tokenizer could become a separate service that people interact with over HTTP.

In [16]:
MODEL_NAME = "masakhane/m2m100_418M_en_swa_rel_news"


In [17]:
from transformers import AutoTokenizer


In [18]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


In [11]:
import numpy as np
import tritonclient.http as httpclient


#### The HTTP client

In [28]:
client = httpclient.InferenceServerClient(url="localhost:8000")


This line creates the client object we will use to interact with our server. To create it, we pass the URL of the inference service as a parameter.

#### The inputs

In [14]:
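# Placeholder declarations: the Triton client expects datatype names such as "INT64".
# The inputs actually sent to the server are built in generate_inference_input below.
input_ids = httpclient.InferInput("input_ids", shape=(-1,1) , datatype="INT64",)
attention_mask = httpclient.InferInput("attention_mask", shape=(-1,1) , datatype="INT64",)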


#### The outputs

In [13]:
outputs = httpclient.InferRequestedOutput("generated_indices", binary_data=False)


To prepare our model inputs and outputs we use the Triton client library.
The code above creates two `InferInput` objects, for the input IDs and the attention mask respectively, and one `InferRequestedOutput` object for the generated indices. We specify the shape of each input and its datatype when creating it.

In addition to our inputs and outputs, we will need some utility functions to perform the tokenization. Here are those functions:

#### Utility Functions

In [20]:
def get_tokenizer(model_name):
    """Returns a tokenizer for a given model name

    Args:
        model_name (str): model ID on the Hugging Face Hub, or path to a local folder

    Returns:
        the tokenizer loaded by AutoTokenizer for that model
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer


In [21]:
from typing import Tuple, List

import numpy as np



In [22]:

def tokenize_text(tokenizer: AutoTokenizer, text: List[str]) -> Tuple[np.ndarray, np.ndarray]:
    # padding=True pads every sentence to the longest one so the batch is rectangular
    tokenized_text = tokenizer(text, padding=True, return_tensors="np")
    return tokenized_text.input_ids, tokenized_text.attention_mask


In [23]:
def generate_inference_input(input_ids: np.ndarray, attention_mask: np.ndarray) -> List[httpclient.InferInput]:
    """
    Generate inference inputs for Triton server

    Args:
        input_ids (np.ndarray): token IDs, shape (batch_size, sequence_length)
        attention_mask (np.ndarray): 1 for real tokens and 0 for padding, same shape as input_ids

    Returns:
        List[httpclient.InferInput]: the two inputs, with their data attached
    """
    inputs = []
    inputs.append(httpclient.InferInput("input_ids", input_ids.shape, "INT64"))
    inputs.append(httpclient.InferInput("attention_mask", attention_mask.shape, "INT64"))

    inputs[0].set_data_from_numpy(input_ids.astype(np.int64), binary_data=False)
    inputs[1].set_data_from_numpy(attention_mask.astype(np.int64), binary_data=False)
    return inputs



In [33]:
text = ["I am learning how to use Triton Server for Machine Learning", "Hello, my name is Espoir Murhabazi,  I am a Software Engineer from Congo DRC but living in UK"]


In [None]:
tokenizer = get_tokenizer(MODEL_NAME)


In [34]:
input_ids, attention_mask = tokenize_text(tokenizer, text)


In [35]:
inference_inputs = generate_inference_input(input_ids, attention_mask)
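Because the two sentences have different lengths, the tokenizer pads the shorter one so that both arrays are rectangular, and the attention mask tells the model which positions are padding. The sketch below reproduces that idea in plain Python. The token IDs are made up, `pad_batch` is a hypothetical helper, and the pad ID of 1 with padding on the right are assumptions about this tokenizer's defaults.

```python
from typing import List, Tuple

PAD_TOKEN_ID = 1  # assumption: pad ID used for illustration


def pad_batch(
    sequences: List[List[int]], pad_id: int = PAD_TOKEN_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token ID lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [8, 9, 1]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```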


With our input prepared we can now make an inference request to our server. Here is the code we will be using to make the inference request.

In [36]:
results = client.infer(model_name="m2m100_translation_model", inputs=inference_inputs, outputs=[outputs])
inference_output = results.as_numpy('generated_indices')


<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '453'}>
**********


If everything goes as planned, we should be able to see the inference response.

In [37]:
inference_output


array([[     2, 128088,  71714,    720,  12089,    438,  51759,    377,
           102,  28668,  21552,  37578,  53140,    311,    103,   2447,
            82,   2786,   3194,    720,  12089,    438,  28668,  21552,
         55125,    360,      2,      1,      1,      1,      1,      1,
             1,      1],
       [     2, 128088,    298,    260, 118240,    243,   6209,  18234,
         10749,   8612,   2956,      4,    100,   1123,    243,    172,
          8245,    649,    311,  29574,    181, 112209,   1777,  14903,
           129,   9382,  22310,    247,  24109,  67338,   7022,    352,
         98264,      2]])

In [38]:
decoded_output = tokenizer.batch_decode(inference_output, skip_special_tokens=True)


In [39]:
decoded_output


['Ninajifunza Jinsi ya Kutumia Mtandao wa Triton ili Kujifunza Kutumia Mashine',
 'Jina langu ni Espoir Murhabazi, Mimi ni mhandisi wa programu za kompyuta kutoka Kongo DRC lakini ninaishi Uingereza']

With the decoded output, we can see that our inference server is working!
In this post, we saw how to start from a raw translation model from Hugging Face, quantize it to reduce its size, and finally deploy it on a Triton server to perform inference.
In the second part of this blog we will learn how to scale the whole prototype and build an end-to-end pipeline using Kubernetes and KServe.