# Model quantization
Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).

Using fewer bits means the resulting model needs less memory, consumes less energy (in theory), and can run operations such as matrix multiplication much faster with integer arithmetic.
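To make the idea concrete, here is a minimal sketch of per-tensor affine quantization in NumPy. It is an illustration only, not the code path PyTorch or ONNX Runtime uses internally; the helper names `quantize_int8` and `dequantize` and the random weight matrix are assumptions made for this example.

```python
import numpy as np

def quantize_int8(x):
    """Map a float32 array to int8 using one scale and zero point for the whole tensor."""
    qmin, qmax = -128, 127
    # Make sure the representable range includes 0.0 so that zero maps exactly.
    x_min, x_max = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximate float32 tensor from its int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

# Hypothetical weight matrix standing in for one Linear layer.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(768, 768)).astype(np.float32)

q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = float(np.abs(weights - restored).max())

print("bytes before:", weights.nbytes, "bytes after:", q.nbytes)
print("max round-trip error:", max_error, "quantization step:", scale)
```

The int8 tensor takes a quarter of the storage, and the round-trip error stays within one quantization step, which is why accuracy usually drops only slightly.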

In this notebook, we load a fine-tuned IndoBERT sentiment-analysis model in both PyTorch and ONNX formats and quantize each one. We then compare the accuracy, inference time, and size of the quantized PyTorch and ONNX models.

In [18]:
# !pip install onnxruntime onnx transformers optimum

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os
import time
import numpy as np
import pandas as pd
import torch
from torch import optim
import torch.nn.functional as F

from transformers import BertForSequenceClassification, BertConfig, BertTokenizer

from pathlib import Path
import timeit
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType
import onnxruntime as ort
from onnxruntime import InferenceSession
from onnxruntime.transformers.optimizer import optimize_model
from optimum.onnxruntime import ORTModelForSequenceClassification

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

## PyTorch

### Load and quantize model (PyTorch)
We load the fine-tuned model and quantize it with PyTorch's dynamic quantization. We then compare the size of the full-precision model with that of the quantized model.

In [4]:
# Load model
indobert_path = Path("/content/drive/MyDrive/Models/indobert")
model = BertForSequenceClassification.from_pretrained(indobert_path).to("cpu")

In [5]:
# Quantize model
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

In [6]:
def print_size_of_model(model):
    # Serialize the state dict to a temporary file, report its size, then clean up
    tmp_path = "/content/temp.p"
    torch.save(model.state_dict(), tmp_path)
    print('Size (MB):', os.path.getsize(tmp_path)/(1024*1024))
    os.remove(tmp_path)

print_size_of_model(model)
print_size_of_model(quantized_model)

Size (MB): 474.7809534072876
Size (MB): 230.1415147781372


### Evaluate the performance of PyTorch quantization

In [7]:
def evaluate(model):
  i2w = {0: 'positive', 1: 'neutral', 2: 'negative'}

  # Load test data
  test_dataset_path = "/content/test_preprocess.tsv"
  df_test = pd.read_table(test_dataset_path, header=None)
  df_test.rename(columns={0: "text", 1: "label"}, inplace=True)

  tokenizer = BertTokenizer.from_pretrained(indobert_path)

  def infer(text):
    inputs = tokenizer.encode(text)
    inputs = torch.LongTensor(inputs).view(1, -1).to(model.device)

    logits = model(inputs)[0]
    label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()
    return i2w[label]

  df_test['pred'] = df_test['text'].apply(infer)
  acc = accuracy_score(df_test['label'], df_test['pred'])
  pre = precision_score(df_test['label'], df_test['pred'], average="macro")
  rec = recall_score(df_test['label'], df_test['pred'], average="macro")
  f1 = f1_score(df_test['label'], df_test['pred'], average="macro")

  return {"accuracy": acc,
          "precision": pre,
          "recall": rec,
          "f1": f1}

def eval_time(model):
  start = time.time()
  result = evaluate(model)
  print(f"""
        Accuracy:{result['accuracy']}
        Precision:{result['precision']}
        Recall: {result['recall']}
        F1-Score: {result['f1']}
        """)
  end = time.time()
  print(f"Evaluation time: {end-start}\n")

In [9]:
print("Full Model")
eval_time(model)
print("Quantized Model")
eval_time(quantized_model)

Full Model

        Accuracy:0.916
        Precision:0.915580183764327
        Recall: 0.875811280223045
        F1-Score: 0.8905121652109605
        
Evaluation time: 102.87095475196838

Quantized Model

        Accuracy:0.912
        Precision:0.906692476448574
        Recall: 0.8726061520179167
        F1-Score: 0.8854390502105584
        
Evaluation time: 48.859798431396484
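The precision, recall, and F1 values above are macro-averaged (`average="macro"`), so every class counts equally regardless of how many examples it has. That is why they can sit below accuracy when a smaller class is predicted less well. The sketch below reimplements macro F1 in plain Python on a toy, made-up label list; `macro_f1` is a hypothetical helper for illustration, not part of scikit-learn.

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores with equal weight per class."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data: 8 majority-class examples, 2 minority-class examples, 1 mistake.
y_true = ["positive"] * 8 + ["neutral"] * 2
y_pred = ["positive"] * 8 + ["positive", "neutral"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.3f}")                    # accuracy: 0.900
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")    # macro F1: 0.804
```

A single error on the minority class costs one tenth of accuracy but pulls that class's F1 down to about 0.67, which drags the macro average well below the accuracy.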



## ONNX
We will load the ONNX model that has already been optimized.

### Load and quantize ONNX model
We call onnxruntime.quantization.quantize_dynamic to quantize the BERT model. ONNX Runtime supports dynamic quantization with IntegerOps and static quantization with QLinearOps. For activations, ONNX Runtime supports only the uint8 format for now; for weights, it supports both int8 and uint8.

We apply dynamic quantization to the BERT model and use int8 for the weights.
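Conceptually, "dynamic" means the weights are quantized once ahead of time, while the activation scale and zero point are computed from each input at run time. The NumPy sketch below mimics that flow for a single matrix multiplication with uint8 activations and int8 weights. It is a simplified illustration under assumed shapes and random data, and the `quantize` helper is hypothetical; ONNX Runtime's actual integer operators are more elaborate.

```python
import numpy as np

def quantize(x, qmin, qmax, dtype):
    """Affine-quantize x into the integer range [qmin, qmax]."""
    lo, hi = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(dtype)
    return q, scale, zero_point

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.05, size=(16, 8)).astype(np.float32)  # fixed layer weights
x = rng.normal(0.0, 1.0, size=(4, 16)).astype(np.float32)   # activations seen at run time

# Ahead of time: weights are quantized once to int8.
Wq, w_scale, w_zp = quantize(W, -128, 127, np.int8)

# At inference time: the activation range is measured on the fly (the "dynamic" part).
xq, x_scale, x_zp = quantize(x, 0, 255, np.uint8)

# Integer matmul with int32 accumulation, then a single rescale back to float.
acc = (xq.astype(np.int32) - x_zp) @ (Wq.astype(np.int32) - w_zp)
y_quant = acc.astype(np.float32) * (x_scale * w_scale)

y_ref = x @ W
rel_err = float(np.abs(y_quant - y_ref).max() / np.abs(y_ref).max())
print(f"max relative error vs float32 matmul: {rel_err:.4f}")
```

Because no calibration dataset is needed to fix the activation ranges, dynamic quantization can be applied directly to an already trained model, which is what the next cell does.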

In [10]:
def quantize_onnx_model(onnx_model_path, quantized_model_path):
    # Dynamic quantization: weights are stored as int8, activations are quantized at run time
    quantize_dynamic(onnx_model_path,
                     quantized_model_path,
                     weight_type=QuantType.QInt8)

onnx_path = Path("/content/drive/MyDrive/Models/indobert-onnx/optimized.onnx")
quantize_onnx_path = Path("/content/drive/MyDrive/Models/indobert-onnx/quantized_optimized.onnx")
quantize_onnx_model(onnx_path, quantize_onnx_path)

In [11]:
print('ONNX full precision model size (MB):', os.path.getsize(onnx_path)/(1024*1024))
print('ONNX quantized model size (MB):', os.path.getsize(quantize_onnx_path)/(1024*1024))

ONNX full precision model size (MB): 474.74583435058594
ONNX quantized model size (MB): 119.11387634277344


### Evaluate ONNX quantization performance

In [12]:
def evaluate_onnx(session):
  i2w = {0: 'positive', 1: 'neutral', 2: 'negative'}

  # Load test data
  test_dataset_path = "/content/test_preprocess.tsv"
  df_test = pd.read_table(test_dataset_path, header=None)
  df_test.rename(columns={0: "text", 1: "label"}, inplace=True)

  tokenizer = BertTokenizer.from_pretrained(indobert_path)

  def infer(text):
    inputs = tokenizer([text])
    inputs_onnx = dict(
        input_ids=np.array(inputs["input_ids"]).astype("int64"),
        attention_mask=np.array(inputs["attention_mask"]).astype("int64"),
        token_type_ids=np.array(inputs["token_type_ids"]).astype("int64")
    )

    # The first output of the session holds the classification logits
    logits = session.run(None, input_feed=inputs_onnx)[0]
    label = torch.topk(torch.from_numpy(logits), k=1, dim=-1)[1].squeeze().item()
    return i2w[label]

  df_test['pred'] = df_test['text'].apply(infer)
  acc = accuracy_score(df_test['label'], df_test['pred'])
  pre = precision_score(df_test['label'], df_test['pred'], average="macro")
  rec = recall_score(df_test['label'], df_test['pred'], average="macro")
  f1 = f1_score(df_test['label'], df_test['pred'], average="macro")

  return {"accuracy": acc,
          "precision": pre,
          "recall": rec,
          "f1": f1}

def eval_time_onnx(model):
  start = time.time()
  result = evaluate_onnx(model)
  print(f"""
        Accuracy:{result['accuracy']}
        Precision:{result['precision']}
        Recall: {result['recall']}
        F1-Score: {result['f1']}
        """)
  end = time.time()
  print(f"Evaluation time: {end-start}\n")

In [13]:
full_onnx = InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
quantized_onnx = InferenceSession(quantize_onnx_path, providers=["CPUExecutionProvider"])

In [14]:
print("Full Model ONNX")
eval_time_onnx(full_onnx)
print("Quantized Model ONNX")
eval_time_onnx(quantized_onnx)

Full Model ONNX

        Accuracy:0.916
        Precision:0.915580183764327
        Recall: 0.875811280223045
        F1-Score: 0.8905121652109605
        
Evaluation time: 69.74408197402954

Quantized Model ONNX

        Accuracy:0.912
        Precision:0.9090679170218813
        Recall: 0.8704208373326021
        F1-Score: 0.8846322352346448
        
Evaluation time: 39.85102200508118



## Summary
### Model Size
PyTorch quantizes only the torch.nn.Linear modules and reduces the model from 474 MB to 230 MB. ONNX Runtime quantizes not only the Linear (MatMul) layers but also the embedding layer, shrinking the model from 474 MB to 119 MB. That is close to the ideal 4x reduction expected when going from float32 to int8.

| Engine | Full precision (MB) | Quantized (MB) |
| --- | --- | --- |
| PyTorch | 474.8 | 230.1 |
| ONNX | 474.7 | 119.1 |
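The compression ratios can be checked directly from the sizes printed earlier in the notebook. Moving every parameter from float32 (4 bytes) to int8 (1 byte) would give an ideal ratio of 4x; the snippet below is just that arithmetic.

```python
# Sizes in MB, copied from the notebook output above.
sizes_mb = {
    "PyTorch": (474.7809534072876, 230.1415147781372),
    "ONNX": (474.74583435058594, 119.11387634277344),
}

ratios = {}
for engine, (full, quantized) in sizes_mb.items():
    ratios[engine] = full / quantized
    print(f"{engine}: {ratios[engine]:.2f}x smaller")

# PyTorch: 2.06x smaller
# ONNX: 3.99x smaller
```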

### Accuracy and F1-score

The quantized PyTorch and ONNX models achieve similar accuracy and F1 scores, even though the latter is much smaller.


| Metric | Full precision | PyTorch quantization | ORT quantization |
| --- | --- | --- | --- |
| Accuracy | 0.916 | 0.912 | 0.912 |
| F1 score | 0.891 | 0.885 | 0.885 |
| Precision | 0.916 | 0.907 | 0.909 |
| Recall | 0.876 | 0.873 | 0.870 |

### Inference Time
The benchmark below shows that the quantized ONNX model achieves the fastest inference time.

In [15]:
def benchmark(f, name=""):
    # warmup
    for _ in range(10):
        f()
    seconds_per_iter = timeit.timeit(f, number=100) / 100
    print(
        f"{name}:",
        f"{seconds_per_iter * 1000:.3f} ms",
    )

    return seconds_per_iter * 1000

In [16]:
text = 'Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita'

tokenizer = BertTokenizer.from_pretrained(indobert_path)
inputs = tokenizer.encode(text)
inputs = torch.LongTensor(inputs).view(1, -1).to("cpu")

inputs_onnx = tokenizer([text])
inputs_onnx = dict(
    input_ids=np.array(inputs_onnx["input_ids"]).astype("int64"),
    attention_mask=np.array(inputs_onnx["attention_mask"]).astype("int64"),
    token_type_ids=np.array(inputs_onnx["token_type_ids"]).astype("int64")
)

In [17]:
speed_full_pt = benchmark(lambda: model(inputs), "Full")
speed_quant_pt = benchmark(lambda: quantized_model(inputs), "Quantized")
speed_full_onnx = benchmark(lambda: full_onnx.run(None, input_feed=inputs_onnx), "Full ONNX")
speed_quant_onnx = benchmark(lambda: quantized_onnx.run(None, input_feed=inputs_onnx), "Quantized ONNX")

Full: 213.455 ms
Quantized: 59.982 ms
Full ONNX: 69.804 ms
Quantized ONNX: 48.724 ms


Compared with the full-precision PyTorch model, PyTorch quantization achieves a ~3.6x speedup and ORT quantization achieves a ~4.4x speedup. Compared with the full-precision ONNX model, ORT quantization achieves a ~1.4x speedup, and compared with the quantized PyTorch model it is ~1.2x faster.
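The speedup ratios can be recomputed from the per-inference latencies printed by the benchmark cell. The dictionary keys and the `speedup` helper below are names chosen for this sketch.

```python
# Mean latency per inference in milliseconds, copied from the benchmark output above.
latency_ms = {
    "PyTorch full": 213.455,
    "PyTorch quantized": 59.982,
    "ONNX full": 69.804,
    "ONNX quantized": 48.724,
}

def speedup(baseline, candidate):
    """How many times faster `candidate` is than `baseline`."""
    return latency_ms[baseline] / latency_ms[candidate]

print(f"{speedup('PyTorch full', 'PyTorch quantized'):.2f}x")       # 3.56x
print(f"{speedup('PyTorch full', 'ONNX quantized'):.2f}x")          # 4.38x
print(f"{speedup('ONNX full', 'ONNX quantized'):.2f}x")             # 1.43x
print(f"{speedup('PyTorch quantized', 'ONNX quantized'):.2f}x")     # 1.23x
```

These figures come from a single short input on one CPU, so they indicate the relative ordering rather than a general performance guarantee.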