#### Optimize M2M100 model with ONNX


In this notebook we describe the steps to load M2M100 translation models from Hugging Face and optimize them with ONNX Runtime. We also show how to use the optimized model to perform translation.

Once the models are optimized, we will deploy them as an API so that they can be used in a web application.

As a first step, we load the vanilla model from Hugging Face and use it for inference; then we convert it to ONNX; and finally we optimize it with ONNX Runtime.

### First Step

Loading the vanilla model from Hugging Face

In [43]:
import torch

In [1]:
from transformers import AutoTokenizer, M2M100ForConditionalGeneration, pipeline

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
MODEL_NAME = "masakhane/m2m100_418M_en_swa_rel_news"

In [59]:
model: M2M100ForConditionalGeneration = M2M100ForConditionalGeneration.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

loading configuration file config.json from cache at /Users/es.py/.cache/huggingface/hub/models--masakhane--m2m100_418M_en_swa_rel_news/snapshots/0a98b0ef693397620fe273e7325d769f2bd58a51/config.json
Model config M2M100Config {
  "_name_or_path": "facebook/m2m100_418M",
  "activation_dropout": 0.0,
  "activation_function": "relu",
  "architectures": [
    "M2M100ForConditionalGeneration"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.05,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.05,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "forced_bos_token_id": 128088,
  "gradient_checkpointing": false,
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "max_length": 200,
  "max_position_embeddings": 1024,
  "model_type": "m2m_100",
  "num_beams":

In [4]:
text_to_translate = "Hello, my name is Espoir Murhabazi,  I am a Software Engineer from Congo DRC but living in UK"

In [5]:
model_input = tokenizer(text_to_translate, return_tensors="pt")

In [60]:
# A plain forward pass needs decoder inputs, so this call raises the error shown below.
model_outputs = model(**model_input)

ValueError: You have to specify either decoder_input_ids or decoder_inputs_embeds
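
The error is expected: an encoder-decoder model needs decoder inputs for a plain forward pass, whereas `generate` builds them one token at a time. The sketch below illustrates that loop in a simplified, greedy form. It is a toy under stated assumptions: `toy_next_token` and its canned token ids are hypothetical stand-ins for the real decoder, and only the special token ids (2 and 128088) are taken from the config printed above. The real `generate` call also uses beam search, which is not shown here.

```python
from typing import Callable, List

DECODER_START_TOKEN_ID = 2    # "decoder_start_token_id" in the config above
EOS_TOKEN_ID = 2              # "eos_token_id" in the config above
FORCED_BOS_TOKEN_ID = 128088  # "forced_bos_token_id" in the config above


def greedy_decode(next_token: Callable[[List[int]], int], max_length: int = 10) -> List[int]:
    """Grow decoder_input_ids one token at a time until EOS or max_length."""
    # The decoder starts from the start token, followed by the forced first token.
    decoder_input_ids = [DECODER_START_TOKEN_ID, FORCED_BOS_TOKEN_ID]
    while len(decoder_input_ids) < max_length:
        token = next_token(decoder_input_ids)
        decoder_input_ids.append(token)
        if token == EOS_TOKEN_ID:
            break
    return decoder_input_ids


def toy_next_token(decoder_input_ids: List[int]) -> int:
    """Hypothetical stand-in for the real decoder: emits canned ids, then EOS."""
    canned = [11, 12, 13]
    produced = len(decoder_input_ids) - 2  # tokens generated so far
    return canned[produced] if produced < len(canned) else EOS_TOKEN_ID


print(greedy_decode(toy_next_token))  # [2, 128088, 11, 12, 13, 2]
```

In the notebook, `model.generate` plays the role of `greedy_decode`, which is why the next cell works where the plain forward pass does not.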

In [6]:
generated_tokens = model.generate(**model_input, forced_bos_token_id=tokenizer.lang_code_to_id["sw"])



In [7]:
translated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

In [8]:
translated_text

['Jina langu ni Espoir Murhabazi, Mimi ni mhandisi wa programu za kompyuta kutoka Kongo DRC lakini ninaishi Uingereza']

Next, we try to export the model manually and check whether we can load it.

In [9]:
MODEL_SUFFIX = MODEL_NAME.replace('masakhane/', '')

In [10]:
%%script false --no-raise-error
onnx_inputs, onnx_outputs = export_onnx(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=13,
    output=output_path,
)

In [12]:
MODEL_SUFFIX

'm2m100_418M_en_swa_rel_news'

This call does not work as expected: it saves the model as a single file instead of separate files for the encoder and the decoder.

The best approach is to use the CLI, as suggested in the documentation.

`optimum-cli export onnx --model masakhane/m2m100_418M_en_swa_rel_news --task seq2seq-lm-with-past --for-ort onnx/m2m100_418M_en_swa_rel_news`

In [13]:
! optimum-cli export onnx --model masakhane/m2m100_418M_en_swa_rel_news --task seq2seq-lm-with-past --for-ort onnx/m2m100_418M_en_swa_rel_news

Framework not specified. Using pt to export to ONNX.
Using framework PyTorch: 1.13.1
Overriding 1 configuration item(s)
	- use_cache -> False
  if max_pos > self.weights.size(0):
  if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
  if attention_mask.size() != (bsz, 1, tgt_len, src_len):
  if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
Using framework PyTorch: 1.13.1
Overriding 1 configuration item(s)
	- use_cache -> True
  if input_shape[-1] > 1:
  mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min))
Using framework PyTorch: 1.13.1
Overriding 1 configuration item(s)
	- use_cache -> True
Asked a sequence length of 16, but a sequence length of 1 will be used with use_past ==True for `decoder_input_ids`.
  if (
Validating ONNX model...
	-[✓] ONNX model output names match reference model (last_hidden_state)
	- Validating ONNX Model output "last_hidden_state":
		-[✓] (2, 16, 1024) matches (2, 16, 1024)
		-[✓] all values clo

Check that the exported model is correct.

In [16]:
from pathlib import Path

In [17]:
base_model_onnx_dir = Path("onnx").joinpath(MODEL_SUFFIX)

In [18]:
base_model_onnx_dir

PosixPath('onnx/m2m100_418M_en_swa_rel_news')
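
One way to make this check concrete is to verify that the export directory contains the separate encoder and decoder files. The helper below is a minimal sketch: `missing_onnx_files` is a hypothetical name, and the expected file names are the ones used by the quantization cells later in this notebook.

```python
from pathlib import Path
from typing import List

# File names taken from the quantization cells later in this notebook.
EXPECTED_ONNX_FILES = (
    "encoder_model.onnx",
    "decoder_model.onnx",
    "decoder_with_past_model.onnx",
)


def missing_onnx_files(model_dir: Path) -> List[str]:
    """Return the expected ONNX files that are absent from model_dir."""
    return [name for name in EXPECTED_ONNX_FILES if not (model_dir / name).is_file()]


# An empty list means every expected file is present.
print(missing_onnx_files(Path("onnx") / "m2m100_418M_en_swa_rel_news"))
```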

### Use the optimizer to optimize the model

In this section we will apply the first optimization to the model we saved in the previous step.

We will start by testing the basic optimization to see whether the optimized model still works.

In [23]:
from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig
from transformers import AutoConfig
from optimum.onnxruntime import ORTModelForSeq2SeqLM

In [22]:
optimization_config = OptimizationConfig(optimization_level=99)



### Loading the Model

In [24]:
onnx_model =  ORTModelForSeq2SeqLM.from_pretrained(base_model_onnx_dir)

In [25]:
optimizer = ORTOptimizer.from_pretrained(onnx_model)

In [26]:
optimized_model_path = Path("onnx").joinpath(f"{MODEL_SUFFIX}_optimized/")
optimized_model_path.mkdir(parents=True, exist_ok=True)

In [27]:
optimizer.optimize(save_dir=optimized_model_path, optimization_config=optimization_config)

2023-03-01 15:37:47.100773 [W:onnxruntime:, inference_session.cc:1546 Initialize] Serializing optimized model with Graph Optimization level greater than ORT_ENABLE_EXTENDED and the NchwcTransformer enabled. The generated model may contain hardware specific optimizations, and should only be used in the same environment the model was optimized in.
symbolic shape infer failed. it's safe to ignore this message if there is no issue with optimized model

PosixPath('onnx/m2m100_418M_en_swa_rel_news_optimized')

### Use the optimized model

Now that the model is optimized, let us use it to run inference and check that it still works.

In [28]:
from optimum.onnxruntime import ORTModelForSeq2SeqLM

In [76]:
optimized_model_path

NameError: name 'optimized_model_path' is not defined

In [29]:
optimized_model = ORTModelForSeq2SeqLM.from_pretrained(optimized_model_path)

In [30]:
from optimum.pipelines import pipeline

In [31]:
onnx_optimize = pipeline("translation_en_to_sw", model=optimized_model, tokenizer=tokenizer)

In [32]:
translated_text = onnx_optimize(text_to_translate)

In [33]:
translated_text

[{'translation_text': 'Jina langu ni Espoir Murhabazi, Mimi ni mhandisi wa programu za kompyuta kutoka Kongo DRC lakini ninaishi Uingereza'}]

I have managed to apply the optimization and run inference with the optimized model. The remaining task is to run tests to check that the performance of the optimized model is good, but at least the model is now working. The next step is to deploy the model.
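
If "performance" here includes speed, a small timing helper is one way to compare the vanilla and optimized models. This is a minimal sketch, not a rigorous benchmark: `measure_latency` is a hypothetical helper, and the workload below is a stand-in so that the block runs on its own.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(fn: Callable[[], object], warmup: int = 2, runs: int = 10) -> Dict[str, float]:
    """Call fn repeatedly and report latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }


# Stand-in workload; in the notebook this could be
# lambda: onnx_optimize(text_to_translate)
print(measure_latency(lambda: sum(range(10_000))))
```

Running the same helper on both pipelines with the same input text would give comparable numbers; translation quality needs a separate check.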

### Applying Quantization

Learn more about quantization here..

In [67]:
from optimum.onnxruntime import ORTQuantizer, ORTModelForSeq2SeqLM
from optimum.onnxruntime.configuration import AutoQuantizationConfig

In [44]:
encoder_quantizer = ORTQuantizer.from_pretrained(base_model_onnx_dir, file_name="encoder_model.onnx")

In [45]:
decoder_quantizer = ORTQuantizer.from_pretrained(base_model_onnx_dir, file_name="decoder_model.onnx")

In [46]:
decoder_with_past_quantizer = ORTQuantizer.from_pretrained(base_model_onnx_dir, file_name="decoder_with_past_model.onnx")

In [47]:
quantizers = [encoder_quantizer, decoder_quantizer, decoder_with_past_quantizer]

In [48]:
dynamic_quantization_config = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
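
The configuration above requests dynamic quantization (`is_static=False`): weights are converted to 8-bit integers ahead of time, while the quantization parameters for activations are computed at run time. The sketch below shows the general idea of 8-bit affine quantization on a handful of numbers. It is an illustration only and does not claim to reproduce ONNX Runtime's exact scheme; `quantize_uint8` and `dequantize` are hypothetical helper names.

```python
from typing import List, Tuple


def quantize_uint8(values: List[float]) -> Tuple[List[int], float, int]:
    """Affine-quantize floats to 8-bit integers; returns (q, scale, zero_point)."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # range must include 0
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q: List[int], scale: float, zero_point: int) -> List[float]:
    """Map 8-bit integers back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, zero_point, max_error)
```

Each value is stored in one byte instead of four, at the cost of a rounding error bounded by the scale.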

In [61]:
quantized_model_path = Path("onnx").joinpath(f"{MODEL_SUFFIX}_quantized/")
quantized_model_path.mkdir(parents=True, exist_ok=True)

In [50]:
for quantizer in quantizers:
    quantizer.quantize(quantization_config=dynamic_quantization_config, save_dir=quantized_model_path)

In [65]:
quantized_model_path.exists()
print(list(quantized_model_path.iterdir()))

[PosixPath('onnx/m2m100_418M_en_swa_rel_news_quantized/decoder_model_quantized.onnx'), PosixPath('onnx/m2m100_418M_en_swa_rel_news_quantized/tokenizer_config.json'), PosixPath('onnx/m2m100_418M_en_swa_rel_news_quantized/special_tokens_map.json'), PosixPath('onnx/m2m100_418M_en_swa_rel_news_quantized/sentencepiece.bpe.model'), PosixPath('onnx/m2m100_418M_en_swa_rel_news_quantized/config.json'), PosixPath('onnx/m2m100_418M_en_swa_rel_news_quantized/decoder_with_past_model_quantized.onnx'), PosixPath('onnx/m2m100_418M_en_swa_rel_news_quantized/vocab.json'), PosixPath('onnx/m2m100_418M_en_swa_rel_news_quantized/encoder_model_quantized.onnx'), PosixPath('onnx/m2m100_418M_en_swa_rel_news_quantized/ort_config.json')]


### Use the quantized model

In [77]:
quantized_model_path

PosixPath('onnx/m2m100_418M_en_swa_rel_news_quantized')

In [68]:
quantized_model = ORTModelForSeq2SeqLM.from_pretrained(quantized_model_path)

loading configuration file onnx/m2m100_418M_en_swa_rel_news_quantized/config.json
Model config M2M100Config {
  "_name_or_path": "onnx/m2m100_418M_en_swa_rel_news_quantized/config.json",
  "activation_dropout": 0.0,
  "activation_function": "relu",
  "architectures": [
    "M2M100ForConditionalGeneration"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.05,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.05,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "forced_bos_token_id": 128088,
  "gradient_checkpointing": false,
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "max_length": 200,
  "max_position_embeddings": 1024,
  "model_type": "m2m_100",
  "num_beams": 5,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "scale_embedding": true,
  "t

In [69]:
quantized_pipeline = pipeline("translation_en_to_sw", model=quantized_model, tokenizer=tokenizer)

In [74]:
translated_text_quantized = quantized_pipeline(text_to_translate)

Generate config GenerationConfig {
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "eos_token_id": 2,
  "forced_bos_token_id": 128088,
  "max_length": 200,
  "num_beams": 5,
  "pad_token_id": 1,
  "transformers_version": "4.26.1"
}



I am called with the following input_ids: True
I am called with the following attention_mask: False
I am called with the following decoder_input_ids: torch.Size([5, 1])
I am called with the following past_keys values: True
I am called with the following input_ids: True
I am called with the following attention_mask: False
I am called with the following decoder_input_ids: torch.Size([5, 2])
I am called with the following past_keys values: False
I am called with the following input_ids: True
I am called with the following attention_mask: False
I am called with the following decoder_input_ids: torch.Size([5, 3])
I am called with the following past_keys values: False
I am called with the following input_ids: True
I am called with the following attention_mask: False
I am called with the following decoder_input_ids: torch.Size([5, 4])
I am called with the following past_keys values: False
I am called with the following input_ids: True
I am called with the following attention_mask: False
I am 

In [75]:
print(translated_text_quantized)

[{'translation_text': 'Ninajaribu kuandika andiko hili nikiwa na mifano ya hatua kwa hatua'}]


Quantization seems to reduce the size of the model while keeping comparable translation quality, in line with the documentation and with experiments performed on other models. We still need to quantize other models and measure their performance to confirm this.
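
The size reduction can be checked directly by summing the `.onnx` files in each directory. This is a minimal sketch: `onnx_size_mb` is a hypothetical helper, and the directory names are the ones used earlier in this notebook. A directory that does not exist simply reports 0.0 MB.

```python
from pathlib import Path


def onnx_size_mb(model_dir: Path) -> float:
    """Total size of the .onnx files directly inside model_dir, in megabytes."""
    total_bytes = sum(p.stat().st_size for p in model_dir.glob("*.onnx"))
    return total_bytes / 1e6


for name in ("m2m100_418M_en_swa_rel_news", "m2m100_418M_en_swa_rel_news_quantized"):
    directory = Path("onnx") / name
    print(f"{name}: {onnx_size_mb(directory):.1f} MB")
```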

#### Loading the model separately:

In this section we will load the model components separately, without the Hugging Face pipeline abstraction. We will load the tokenizer and use it to generate the input ids and the attention mask. These are then passed to the encoder to produce the encoded version of the text, which is in turn passed to the decoder to generate the translated text.

### Tokenization

In [2]:
from transformers import AutoTokenizer

In [4]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [5]:
text_to_translate = "I am trying to translate this text with step by step models"

In [12]:
model_inputs = tokenizer(text_to_translate, return_tensors="pt")

In [16]:
model_inputs.get("attention_mask").shape

torch.Size([1, 15])

The model input contains the input ids and the attention mask. The next step is to pass both to the encoder to generate the encoded text.

#### Encoder Part

In [17]:
# import the configuration of the model

from transformers import AutoConfig
config = AutoConfig.from_pretrained(MODEL_NAME)

In [18]:
config

M2M100Config {
  "_name_or_path": "masakhane/m2m100_418M_en_swa_rel_news",
  "activation_dropout": 0.0,
  "activation_function": "relu",
  "architectures": [
    "M2M100ForConditionalGeneration"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.05,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.05,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "forced_bos_token_id": 128088,
  "gradient_checkpointing": false,
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "max_length": 200,
  "max_position_embeddings": 1024,
  "model_type": "m2m_100",
  "num_beams": 5,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "scale_embedding": true,
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "use_cache": true,
  "vocab_size": 128112
}

In [19]:
from transformers.models.m2m_100.modeling_m2m_100 import M2M100Encoder

In [26]:
QUANTIZED_MODEL_SUFFIX = MODEL_NAME.replace('masakhane/', '').replace('en_swa_rel_news', 'en_swa_rel_news_quantized')

In [27]:
QUANTIZED_MODEL_SUFFIX

'm2m100_418M_en_swa_rel_news_quantized'

In [29]:
from pathlib import Path

In [30]:
quantized_model_onnx_dir = Path("onnx").joinpath(QUANTIZED_MODEL_SUFFIX)

In [31]:
quantized_model_onnx_dir.exists()

True

In [33]:
encoder_path = quantized_model_onnx_dir.joinpath("encoder_model_quantized.onnx")
assert encoder_path.exists(), f"Encoder model does not exist at {encoder_path}"

In [36]:
provider = "CPUExecutionProvider"

In [34]:
from optimum.onnxruntime.modeling_ort import ORTModel

In [57]:
from optimum.onnxruntime.modeling_seq2seq import ORTEncoder, ORTDecoderForSeq2Seq

In [37]:
encoder_session = ORTModel.load_model(encoder_path, provider, None, None)

In [44]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [45]:
encoder = ORTEncoder(
    session=encoder_session,
    config=config,
    device=device,
    use_io_binding=None,
    main_input_name="input_ids",
)

In [50]:
encoder_output = encoder.forward(**model_inputs, return_dict=True)

In [51]:
encoder_output.keys()

odict_keys(['last_hidden_state'])

In [52]:
encoder_output.get("last_hidden_state").shape

torch.Size([1, 15, 1024])

In [None]:
# Let us try to deploy the encoder and see how it works.

That is all for the encoder part; let us now move to the decoder and the decoder with past.

#### Decoder with Past

What does the decoder with attention return?

If only the context vector is passed between the encoder and decoder, that single vector carries the burden of encoding the entire sentence.

Attention allows the decoder network to “focus” on a different part of the encoder’s outputs for every step of the decoder’s own outputs. First we calculate a set of attention weights. These will be multiplied by the encoder output vectors to create a weighted combination. The result (called attn_applied in the code) should contain information about that specific part of the input sequence, and thus help the decoder choose the right output words.
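
The weighted combination described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the scores here are simple dot products between a decoder query and each encoder output, which is one common choice and not necessarily what the quoted description or M2M100 uses internally; `softmax` and `attend` are hypothetical helper names.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: List[float], encoder_outputs: List[List[float]]) -> List[float]:
    """Weighted combination of encoder outputs, using dot-product scores."""
    scores = [sum(q * k for q, k in zip(query, vec)) for vec in encoder_outputs]
    weights = softmax(scores)  # attention weights, one per source position
    dim = len(encoder_outputs[0])
    return [sum(w * vec[i] for w, vec in zip(weights, encoder_outputs)) for i in range(dim)]


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three source positions
context = attend([2.0, 0.0], encoder_outputs)
print(context)
```

The query points along the first dimension, so the source positions with a large first component receive most of the weight.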

In [54]:
decoder_with_past_path = quantized_model_onnx_dir.joinpath("decoder_with_past_model_quantized.onnx")
assert decoder_with_past_path.exists(), f"Decoder with past model does not exist at {decoder_with_past_path}"

In [55]:
decoder_with_past_session  = ORTModel.load_model(decoder_with_past_path, provider, None, None)

In [58]:
decoder_with_past = ORTDecoderForSeq2Seq(
    session=decoder_with_past_session,
    config=config,
    device=device,
    use_io_binding=None,
)

What is passed to the decoder, and what is passed to the decoder with past? These are the key questions to answer.

In [None]:
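# Excerpt that appears to come from optimum's ORTModelForSeq2SeqLM.forward;
# it is not runnable on its own (self, decoder_input_ids, past_key_values,
# attention_mask and labels are defined inside that method).
decoder_outputs = self.decoder_with_past(
    input_ids=decoder_input_ids[:, -1:],  # Cut decoder_input_ids if past is used
    past_key_values=past_key_values,
    encoder_hidden_states=encoder_outputs.last_hidden_state,
    encoder_attention_mask=attention_mask,
    labels=labels,
)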
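
The slice `decoder_input_ids[:, -1:]` hints at the answer: when past key values are available, only the newest token has to be processed, because the work done for earlier tokens is cached. The toy below illustrates that idea with plain integers. It is a sketch only: the "decoder" is a made-up function of the whole prefix, and the cache is a pair of integers instead of attention key and value tensors.

```python
from typing import List, Optional, Tuple


def step_without_past(decoder_input_ids: List[int]) -> int:
    """Recompute from the full prefix at every step (no cache)."""
    return sum(i * t for i, t in enumerate(decoder_input_ids, start=1)) % 97


def step_with_past(last_token: int, past: Optional[Tuple[int, int]]) -> Tuple[int, Tuple[int, int]]:
    """Process only the newest token, reusing cached state from earlier steps."""
    position, running = past if past is not None else (0, 0)
    position += 1
    running += position * last_token
    return running % 97, (position, running)


tokens = [2, 128088, 11, 12, 13]
past = None
for end in range(1, len(tokens) + 1):
    prefix = tokens[:end]
    # Passing only prefix[-1] mirrors decoder_input_ids[:, -1:] in the cell above.
    cached_out, past = step_with_past(prefix[-1], past)
    assert cached_out == step_without_past(prefix)
print("cached and uncached outputs match")
```

Both paths give the same output at every step, but the cached path does a constant amount of work per token instead of revisiting the whole prefix.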

Thursday: stuck on loading the model in Triton server because of a stupid bug; I will raise an issue on the forum later.