#### Onnx Model conversion and Quantization

###### Note load the stuff about ONNX runtime from the machine translation tutorial.

In [1]:
from pathlib import Path
from transformers.convert_graph_to_onnx import convert

  from .autonotebook import tqdm as notebook_tqdm


In [50]:
model_path = Path.cwd().joinpath('models', 'onnx', 'bio-gpt.onnx')

In [51]:
assert model_path.parent.parent.exists(
), f"Model not found at {model_path.parent.parent}"

In [52]:
from torch.onnx import export

In [53]:
from transformers import BioGptTokenizer, BioGptForCausalLM, set_seed


model_id = "microsoft/biogpt"

tokenizer = BioGptTokenizer.from_pretrained(model_id)
model = BioGptForCausalLM.from_pretrained(model_id)

  from .autonotebook import tqdm as notebook_tqdm
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /microsoft/biogpt/resolve/main/vocab.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /microsoft/biogpt/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /microsoft/biogpt/resolve/main/generation_config.json HTTP/1.1" 404 0


In [54]:
def encode_input(input):
    return tokenizer([input],
                     return_tensors='pt',
                     max_length=1024,
                     truncation=True)

In [55]:
from transformers.onnx import FeaturesManager

In [7]:
feature = "seq2seq"

In [56]:
input = f"'question:what is the cause of covid ? context: the cause of covid is a virus'"
encoded_input = tokenizer([input],
                          return_tensors='pt',
                          max_length=1024,
                          truncation=True)

In [78]:
import torch

In [57]:
with torch.no_grad():
    beam_output = model.generate(**encoded_input,
                                 min_length=100,
                                 max_length=1024,
                                 num_beams=5,
                                 early_stopping=True
                                 )

In [58]:
output = model.generate(**encoded_input, max_length=30, num_return_sequences=5, do_sample=True)

In [61]:
generated_text = tokenizer.decode(beam_output[0], skip_special_tokens=True)

In [62]:
generated_text

"'question: what is the cause of covid? context: the cause of covid is a virus': a commentary on 'The cause of covid is a virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context: the virus.context."

In [9]:
export(
    model,
    tuple(encoded_input.values()),
    f=model_path,
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence'},
                  'attention_mask': {0: 'batch_size', 1: 'sequence'},
                  'logits': {0: 'batch_size', 1: 'sequence'}},
    do_constant_folding=True,
    opset_version=13,
)

  if input_shape[-1] > 1:
  mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min))
  if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
  if attention_mask.size() != (bsz, 1, tgt_len, src_len):
  if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):


verbose: False, log level: Level.ERROR



With our model converted to onnx, we will move to the next step which is to perform quantization on the model.

Next step will be exploring quantization approaches to reduce the size of the model and improve the latency for inference.

Ressources: 

- https://www.philschmid.de/static-quantization-optimum.
- https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
- https://github.com/huggingface/notebooks/blob/main/examples/onnx-export.ipynb

#### Quantization

Quantization is a technique to reduce the the size of neural networks by using lower precision datatype to represent the weight and activation function in the neural network. In general weights and activation are represented as 32-bit floating points, but with quantization we can represent those floating points as 16-bit floating point or sometime using int16 or int8.

Quantization have proven to reduce the size of language model hence the inference latency by half while keeping a huge percentage of model accuracy for some downstream tasks. [Source](https://www.philschmid.de/static-quantization-optimum).

The bellow image illustrates the effect of the size and inference of quantization on a BERT model.


We can see that the model size and the inference time is reduce by third size using 8 bit quantization while the performance of the model remain the same.

Quantization does not always keep the same accuracy of the model, so before choosing it we need to make sure we evaluate the performance of the model on the whole dataset.

![image](./images/quantization.webp)



For our model we will convert 32 bits floating points to 16 bits, using the onnx library. 

In [13]:
from onnxruntime.transformers import optimizer

In [14]:
getattr(model.config, "num_attention_heads")

16

In [15]:
model_path.__str__()

'/Users/esp.py/Projects/Personal/end-to-end-rag/models/onnx/bio-gpt.onnx'

In [16]:
optimized_model =  optimizer.optimize_model(model_path.__str__(), 
                                            model_type='gpt2', 
                                            num_heads=model.config.num_attention_heads,
                                            hidden_size=model.config.hidden_size)

In [17]:
optimized_model.convert_float_to_float16()

In [18]:
quantized_model_path = model_path.parent.joinpath(
    'decoder_model_quantized.onnx')

In [19]:
optimized_model.save_model_to_file(quantized_model_path)

In [20]:
for model in model_path.parent.glob("*.onnx"):
    print(f"the size of {model.stem} the model in MB is: {model.stat().st_size / (1024 * 1024)}")

the size of decoder_model_quantized the model in MB is: 744.6517105102539
the size of bio-gpt the model in MB is: 1488.90811252594


We can clearly see that the size of our model have been reduced by 50% using the conversion of floats32 to float 16.

We see with this approach that we applied dynamic quantization of the model and it reduce the size of the model! However we could also aplly dynamic quantization to the model but I haven't yet learned about it.  But in [this blog](https://www.philschmid.de/static-quantization-optimum) it have been shown that static quantization improve the inference of the model.

### Using the Quantized model

In [21]:
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

In [22]:
from pathlib import Path

In [23]:
model_path = Path.cwd().joinpath('models', 'onnx', 'decoder_model_quantized.onnx')

In [25]:
model_path.exists()

True

In [92]:
quantized_model = ORTModelForCausalLM.from_pretrained(model_path.parent,
                                                      decoder_file_name=model_path,
                                                      use_cache=False,
                                                      use_io_binding=False)

/Users/esp.py/Projects/Personal/end-to-end-rag/models/onnx/decoder_model_quantized.onnx ******* the path form dir ***** /Users/esp.py/Projects/Personal/end-to-end-rag/models/onnx/decoder_model_quantized.onnx True


Generation config file not found, using a generation config created from the model config.


In [129]:
input = f"question: Is cytokeratin immunoreactivity useful in the diagnosis of short-segment Barrett's oesophagus in Korea? context: Cytokeratin 7/20 staining has been reported to be helpful in diagnosing Barrett's oesophagus and gastric intestinal metaplasia. However, this is still a matter of some controversy. To determine the diagnostic usefulness of cytokeratin 7/20 immunostaining for short-segment Barrett's oesophagus in Korea. In patients with Barrett's oesophagus, diagnosed endoscopically, at least two biopsy specimens were taken from just below the squamocolumnar junction. If goblet cells were found histologically with alcian blue staining, cytokeratin 7/20 immunohistochemical stains were performed. Intestinal metaplasia at the cardia was diagnosed whenever biopsy specimens taken from within 2 cm below the oesophagogastric junction revealed intestinal metaplasia. Barrett's cytokeratin 7/20 pattern was defined as cytokeratin 20 positivity in only the superficial gland, combined with cytokeratin 7 positivity in both the superficial and deep glands. Barrett's cytokeratin 7/20 pattern was observed in 28 out of 36 cases (77.8%) with short-segment Barrett's oesophagus, 11 out of 28 cases (39.3%) with intestinal metaplasia at the cardia, and nine out of 61 cases (14.8%) with gastric intestinal metaplasia. The sensitivity and specificity of Barrett's cytokeratin 7/20 pattern were 77.8 and 77.5%, respectively. answer: Barrett's cytokeratin 7/20 pattern can be a useful marker for the diagnosis of short-segment Barrett's oesophagus, although the false positive or false negative rate is approximately 25%."
encoded_input = tokenizer([input],
                          return_tensors='pt',
                          max_length=1024,
                          truncation=True)

In [116]:
with torch.no_grad():
    generated_text = model.generate(**encoded_input,
                                min_length=50,
                                max_length=1024,
                                num_beams=5,
                                early_stopping=True)

In [117]:
tokenizer.decode(generated_text[0], skip_special_tokens=True,)

'what is the cause of Covid-19? A case report and review of the literature on Covid-19 in patients with chronic kidney disease (CKD) and end-stage renal disease (ESRD) on hemodialysis (HD) and peritoneal dialysis (PD).'