#### Optimize M2M100 model with ONNX


In this notebook we describe the steps to load M2M100 translation models from Hugging Face and optimize them with ONNX Runtime. We will also show how to use the optimized model to perform translation.

Once the models are optimized, we will deploy them as an API so that they can be used in a web application.

As a first step, we will load the vanilla model from Hugging Face and use it for inference. Then we will convert it to ONNX, and finally we will optimize it with ONNX Runtime.

### First Step

Loading the vanilla model from Hugging Face.

In [1]:
import torch

In [2]:
from transformers import AutoTokenizer, M2M100ForConditionalGeneration, pipeline


In [3]:
MODEL_NAME = "masakhane/m2m100_418M_en_swa_rel_news"

In [4]:
model: M2M100ForConditionalGeneration = M2M100ForConditionalGeneration.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [5]:
text_to_translate = "Hello, my name is Espoir Murhabazi,  I am a Software Engineer from Congo DRC but living in UK"

In [6]:
model_input = tokenizer(text_to_translate, return_tensors="pt")

In [7]:
model_input.keys()

dict_keys(['input_ids', 'attention_mask'])

In [8]:
generated_tokens = model.generate(**model_input, forced_bos_token_id=tokenizer.lang_code_to_id["sw"])



In [9]:
generated_tokens.shape

torch.Size([1, 34])

In [10]:
model_input["input_ids"].shape

torch.Size([1, 27])

In [11]:
translated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

In [12]:
translated_text

['Jina langu ni Espoir Murhabazi, Mimi ni mhandisi wa programu za kompyuta kutoka Kongo DRC lakini ninaishi Uingereza']

Next, we try to export the model manually and check whether we can load it.

In [15]:
MODEL_SUFFIX = MODEL_NAME.replace('masakhane/', '')

In [None]:
%%script false --no-raise-error
onnx_inputs, onnx_outputs = export_onnx(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=13,
    output=output_path,
)

In [16]:
MODEL_SUFFIX

'm2m100_418M_en_swa_rel_news'

This command does not work properly: it saves the model as a single file instead of two files, one for the encoder and one for the decoder.

The best approach is to use the CLI, as suggested in the documentation.

`optimum-cli export onnx --model masakhane/m2m100_418M_en_swa_rel_news --task seq2seq-lm-with-past --for-ort onnx`

In [173]:
%%script false --no-raise-error # remove this line if the model has not been exported yet.

! optimum-cli export onnx --model masakhane/m2m100_418M_en_swa_rel_news --task seq2seq-lm-with-past --for-ort onnx/m2m100_418M_en_swa_rel_news

Check that the exported model directory exists.

In [17]:
from pathlib import Path

In [26]:
base_model_onnx_dir = Path.cwd().joinpath("triton_model_repository", "encoder_decoder_model", "1", "m2m100_418M_en_swa_rel_news_quantized")

In [27]:
base_model_onnx_dir.exists()

True
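A slightly stronger check than `exists()` is to confirm that the directory holds separate encoder and decoder graphs. The sketch below assumes the three file names that appear later in this notebook for an unquantized export; `missing_onnx_files` is a hypothetical helper, and a quantized directory would need a different `expected` tuple.

```python
from pathlib import Path

# File names assumed from the quantization cells later in this notebook.
EXPECTED_ONNX_FILES = (
    "encoder_model.onnx",
    "decoder_model.onnx",
    "decoder_with_past_model.onnx",
)


def missing_onnx_files(model_dir, expected=EXPECTED_ONNX_FILES):
    """Return the expected ONNX file names that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in expected if not (model_dir / name).is_file()]
```

Calling `missing_onnx_files(base_model_onnx_dir)` returns an empty list when every expected file is present.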

### Use the optimizer to optimize the model

In this section we will apply the first optimization to the model we saved in the previous step.

We will start by testing a basic optimization to see whether the optimized model still runs.

In [20]:
from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig
from transformers import AutoConfig
from optimum.onnxruntime import ORTModelForSeq2SeqLM

In [21]:
optimization_config = OptimizationConfig(optimization_level=99)



### Loading the Model

In [22]:
onnx_model =  ORTModelForSeq2SeqLM.from_pretrained(base_model_onnx_dir)

OSError: onnx/m2m100_418M_en_swa_rel_news is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.

In [None]:
optimizer = ORTOptimizer.from_pretrained(onnx_model)

In [None]:
optimized_model_path = Path("onnx").joinpath(f"{MODEL_SUFFIX}_optimized/")
optimized_model_path.mkdir(parents=True, exist_ok=True)

In [None]:
optimized_model_path

In [None]:
optimizer.optimize(save_dir=optimized_model_path, optimization_config=optimization_config)

### Use the optimized model

Now that the model is optimized, let us use it to run inference and check that it works.

In [None]:
from optimum.onnxruntime import ORTModelForSeq2SeqLM

In [None]:
optimized_model_path

In [None]:
optimized_model = ORTModelForSeq2SeqLM.from_pretrained(optimized_model_path)

In [None]:
from optimum.pipelines import pipeline

In [None]:
onnx_optimize = pipeline("translation_en_to_sw", model=optimized_model, tokenizer=tokenizer)

In [None]:
translated_text = onnx_optimize(text_to_translate)

In [None]:
translated_text

I have managed to apply the optimization and run inference with the optimized model. The remaining task is to run tests to check whether the performance of the optimized model is good, but at least the model is now working. The next step is to deploy the model.
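One hedged way to run that check is to call each translation function on the same sentence, compare the returned text, and record the mean latency. The sketch below uses only the standard library; `time_translation` is a hypothetical helper, and `fake_translate` stands in for a real pipeline such as `onnx_optimize` so that the cell runs without a model.

```python
import time


def time_translation(translate, text, repeats=5):
    """Call translate(text) several times; return the last output and the mean latency in seconds."""
    output = None
    start = time.perf_counter()
    for _ in range(repeats):
        output = translate(text)
    mean_latency = (time.perf_counter() - start) / repeats
    return output, mean_latency


def fake_translate(text):
    """Stand-in translator used only to make the sketch runnable."""
    return text.upper()


output, latency = time_translation(fake_translate, "hello")
print(output, f"{latency:.6f}s")
```

Running the same helper on the vanilla and the optimized pipelines gives two outputs to compare and two latencies to contrast.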

### Applying Quantization

Learn more about quantization here..
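As a rough intuition for what quantization does to the weights, the sketch below maps a list of float weights to 8-bit integers with a single scale factor and then maps them back. It is a simplified, symmetric per-tensor scheme written for illustration; the function names are hypothetical, and ONNX Runtime's actual implementation differs in its details.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Map int8 values back to approximate floats."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)
# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight is stored in one byte instead of four, which is where the size reduction comes from, at the cost of a small rounding error.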

In [29]:
from optimum.onnxruntime import ORTQuantizer, ORTModelForSeq2SeqLM
from optimum.onnxruntime.configuration import AutoQuantizationConfig

In [None]:
encoder_quantizer = ORTQuantizer.from_pretrained(base_model_onnx_dir, file_name="encoder_model.onnx")

In [None]:
decoder_quantizer = ORTQuantizer.from_pretrained(base_model_onnx_dir, file_name="decoder_model.onnx")

In [None]:
decoder_with_past_quantizer = ORTQuantizer.from_pretrained(base_model_onnx_dir, file_name="decoder_with_past_model.onnx")

In [None]:
quantizers = [encoder_quantizer, decoder_quantizer, decoder_with_past_quantizer]

In [None]:
dynamic_quantization_config = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

In [None]:
quantized_model_path = Path("onnx").joinpath(f"{MODEL_SUFFIX}_quantized/")
quantized_model_path.mkdir(parents=True, exist_ok=True)

In [None]:
for quantizer in quantizers:
    quantizer.quantize(quantization_config=dynamic_quantization_config, save_dir=quantized_model_path)

In [None]:
quantized_model_path.exists()
print(list(quantized_model_path.iterdir()))

### Use the quantized model

In [28]:
quantized_model_path = base_model_onnx_dir

In [33]:
quantized_model = ORTModelForSeq2SeqLM.from_pretrained(quantized_model_path, 
                                                       decoder_file_name='decoder_model_quantized.onnx',
                                                       encoder_file_name='encoder_model_quantized.onnx',)

Generation config file not found, using a generation config created from the model config.


In [34]:
quantized_pipeline = pipeline("translation_en_to_sw", model=quantized_model, tokenizer=tokenizer, num_beams=6)

In [35]:
translated_text_quantized = quantized_pipeline(text_to_translate)

In [36]:
print(translated_text_quantized)

[{'translation_text': 'Jina langu ni Espoir Murhabazi, Mimi ni mhandisi wa programu za kompyuta kutoka Kongo DRC lakini ninaishi Uingereza'}]


Quantization seems to reduce the size of the model while keeping the same translation quality on this example, which matches the documentation and experiments performed on other models. We still need to quantize other models to check whether the performance holds.
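To put a number on the size reduction, a small helper can sum the `.onnx` files in each model directory. This is a minimal sketch using only the standard library; `onnx_size_mb` is a hypothetical name, and it only counts files directly inside the directory.

```python
from pathlib import Path


def onnx_size_mb(model_dir):
    """Sum the sizes of all .onnx files directly inside model_dir, in megabytes."""
    total_bytes = sum(p.stat().st_size for p in Path(model_dir).glob("*.onnx"))
    return total_bytes / 1e6
```

Comparing `onnx_size_mb("onnx/m2m100_418M_en_swa_rel_news")` with `onnx_size_mb(quantized_model_path)` shows how much disk space the quantized export saves.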

#### Loading the model separately:

In this section we will load the model parts separately, without the Hugging Face pipeline abstraction. We will load the tokenizer and use it to generate the input ids and the attention mask. We will then pass the input ids and the attention mask to the encoder to generate the encoded version of the text. Finally, the encoded text will be passed to the decoder to generate the translated text.

### Tokenization

In [None]:
from transformers import AutoTokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [None]:
text_to_translate = "I am learning how to use Triton Server for Machine Learning"

In [None]:
model_inputs = tokenizer(text_to_translate, return_tensors="pt")

In [None]:
model_inputs.get("input_ids")

The model input contains the input ids and the attention mask. The next step is to pass both to the encoder to generate the encoded text.
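To make the role of the attention mask concrete, the sketch below pads two token id sequences to the same length and builds the matching mask, with 1 for real tokens and 0 for padding. The token ids and `pad_id=1` are made-up values for illustration, and `pad_batch` is a hypothetical helper; the real tokenizer does this itself when called with `padding=True`.

```python
def pad_batch(sequences, pad_id=1):
    """Right-pad token id lists to equal length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch([[10, 11, 12, 2], [10, 2]])
print(input_ids)       # [[10, 11, 12, 2], [10, 2, 1, 1]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

With a single sentence, as in this notebook, no padding is needed and the mask is all ones.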

#### Encoder Part

In [None]:
# import the configuration of the model

from transformers import AutoConfig
config = AutoConfig.from_pretrained(MODEL_NAME)

In [None]:
from transformers.models.m2m_100.modeling_m2m_100 import M2M100Encoder

In [None]:
QUANTIZED_MODEL_SUFFIX = MODEL_NAME.replace('masakhane/', '').replace('en_swa_rel_news', 'en_swa_rel_news_quantized')

In [None]:
QUANTIZED_MODEL_SUFFIX

In [None]:
from pathlib import Path

In [None]:
quantized_model_onnx_dir = Path("onnx").joinpath(QUANTIZED_MODEL_SUFFIX)

In [None]:
quantized_model_onnx_dir.exists()

In [None]:
encoder_path = quantized_model_onnx_dir.joinpath("encoder_model_quantized.onnx")
assert encoder_path.exists(), f"Encoder model does not exist at {encoder_path}"

In [None]:
provider = "CPUExecutionProvider"

In [None]:
from optimum.onnxruntime.modeling_ort import ORTModel

In [None]:
from optimum.onnxruntime.modeling_seq2seq import ORTEncoder, ORTDecoderForSeq2Seq

In [None]:
encoder_session = ORTModel.load_model(encoder_path, provider, None, None)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
encoder_output = encoder_session.run(None, {
    "input_ids": model_inputs.get("input_ids").numpy(),
    "attention_mask": model_inputs.get("attention_mask").numpy(),
})

In [None]:
encoder_output[0].shape

Why does the output have this shape?

I need to come back here and learn what the encoder output is and how to pass it to the decoder.

The output of the encoder is the contextual representation of the input text. A shape of (1, 15, 1024) means we have 1 batch, 15 tokens, and 1024 features for each token.

That is all for the encoder part. Let us now move to the decoder and the decoder with attention.

#### Decoder Model 

What does the decoder with attention return?

If only the context vector is passed between the encoder and decoder, that single vector carries the burden of encoding the entire sentence.

Attention allows the decoder network to “focus” on a different part of the encoder’s outputs for every step of the decoder’s own outputs. First we calculate a set of attention weights. These will be multiplied by the encoder output vectors to create a weighted combination. The result (called attn_applied in the code) should contain information about that specific part of the input sequence, and thus help the decoder choose the right output words.
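The weighted combination described above can be sketched in a few lines of plain Python. This is an illustration only: it uses simple dot-product scores as a stand-in for learned attention weights, the vectors are made up, and the multi-head attention inside M2M100 is more involved.

```python
import math


def attend(query, encoder_outputs):
    """Dot-product attention of one decoder query over the encoder output vectors."""
    # One score per encoder position: dot product with the query.
    scores = [sum(q * h for q, h in zip(query, vec)) for vec in encoder_outputs]
    # Softmax turns the scores into weights that sum to 1.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    attn_weights = [e / total for e in exps]
    # Weighted combination of the encoder outputs.
    dim = len(encoder_outputs[0])
    attn_applied = [
        sum(w * vec[i] for w, vec in zip(attn_weights, encoder_outputs))
        for i in range(dim)
    ]
    return attn_weights, attn_applied


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_weights, attn_applied = attend([2.0, 0.0], encoder_outputs)
print(attn_weights, attn_applied)
```

Positions whose vectors align with the query receive larger weights, so they contribute more to `attn_applied`.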

The decoder model takes the last hidden state (the output of the encoder model), the start-of-sequence token, and the attention mask, and it produces an output token and the next hidden state. The output token is then passed to the decoder with attention model to generate the next output token and the next hidden state, and so on until we reach the end-of-sequence token.
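The loop described above can be sketched as a greedy decoder around a step function. This is a simplified illustration, not the `generate` implementation: it does no beam search and keeps no key/value cache, and `greedy_decode`, `toy_step`, and the token ids are hypothetical. In a real setup, `step_fn` would wrap a decoder session call followed by an argmax over the logits.

```python
def greedy_decode(step_fn, start_token_id, eos_token_id, max_new_tokens=20):
    """Generate token ids one at a time until eos_token_id or the length limit.

    step_fn takes the tokens generated so far and returns the next token id.
    """
    tokens = [start_token_id]
    for _ in range(max_new_tokens):
        next_token = step_fn(tokens)
        tokens.append(next_token)
        if next_token == eos_token_id:
            break
    return tokens


# Toy step function: emits 5, 6, 7 and then the end-of-sequence id 2.
canned = {1: 5, 2: 6, 3: 7}


def toy_step(tokens):
    return canned.get(len(tokens), 2)


print(greedy_decode(toy_step, start_token_id=2, eos_token_id=2))  # [2, 5, 6, 7, 2]
```

The `max_new_tokens` limit guards against a step function that never emits the end-of-sequence token.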

In [None]:
decoder_model_quantized_path = quantized_model_onnx_dir.joinpath("decoder_model_quantized.onnx")
assert decoder_model_quantized_path.exists(), f"Decoder model does not exist at {decoder_model_quantized_path}"

In [None]:
decoder_model_session  = ORTModel.load_model(decoder_model_quantized_path, provider, None, None)

The input of the decoder is the start-of-sequence token, plus the encoder output and the attention mask.

The decoder iteratively generates the next tokens, which are its output.

What is passed to the decoder, and what is passed to the decoder with attention? These are the key questions to answer.

On Thursday I got stuck loading the model in Triton Server because of a stupid bug; I will raise an issue on the forum later.

In [None]:
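import numpy as np

decoder_output = decoder_model_session.run(None, {
    # Start decoding from a single start token, shaped (batch_size=1, sequence_length=1).
    "input_ids": np.array(tokenizer.bos_token_id, dtype=np.int64).reshape(1, 1),
    "encoder_hidden_states": encoder_output[0],
    "encoder_attention_mask": model_inputs.get("attention_mask").numpy(),
})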

In [None]:
decoder_output[0].shape

In [None]:
np.array(tokenizer.bos_token_id)

In [None]:
# This seems to work, but what does the output represent?

In [None]:
logits = decoder_output[0]

The first input of the decoder is the start token, which has the id 2.
But on top of that, to generate text using beam search we need to pass it as a vector of shape [beam_size].

In the logits shape, 1 is the batch size, 15 is the sequence length, and 128112 is the vocabulary size.
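Given logits laid out as (batch size, sequence length, vocabulary size), a greedy decoder usually reads only the last sequence position and takes the highest-scoring token id there. The sketch below shows this on toy nested lists; `next_token_ids` is a hypothetical helper and the values are made up.

```python
def next_token_ids(logits):
    """For each batch item, return the argmax token id at the last sequence position.

    logits is a nested list shaped [batch_size][sequence_length][vocab_size].
    """
    next_ids = []
    for sequence_logits in logits:
        last_step = sequence_logits[-1]
        next_ids.append(max(range(len(last_step)), key=last_step.__getitem__))
    return next_ids


# Toy logits: batch of 1, sequence of 2 positions, vocabulary of 4 tokens.
toy_logits = [[[0.1, 0.2, 0.3, 0.4], [0.0, 2.5, 0.1, -1.0]]]
print(next_token_ids(toy_logits))  # [1]
```

With a NumPy array, the same selection can be written as `logits[:, -1, :].argmax(-1)`.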

In [None]:
predicted_output = logits.argmax(-1)

In [None]:
predicted_output = predicted_output.reshape(1, -1)

In [None]:
tokenizer.batch_decode(predicted_output)

This is not working.

To implement the decoding step separately, we need to use the generate method of the model.

Exporting the decoder part to ONNX is a bit challenging; we will try to get back to it later.

- https://discuss.huggingface.co/t/generate-without-using-the-generate-method/11379
- https://forums.developer.nvidia.com/t/deploying-machine-translation-to-triton-inference-server/188612
- https://aws.amazon.com/blogs/machine-learning/create-high-quality-images-with-stable-diffusion-models-and-deploy-them-cost-efficiently-with-amazon-sagemaker/
