# Introduction

This tutorial demonstrates how to perform post-training quantization (PTQ) on a BERT-base model (the `bertsquad-12.onnx` SQuAD model) with Intel Neural Compressor and ONNX Runtime.
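
To make the idea behind the `approach='dynamic'` setting used later more concrete, here is a minimal pure-Python sketch of 8-bit dynamic quantization arithmetic. It follows the scale and zero-point formulas of the ONNX `DynamicQuantizeLinear` operator as I understand them. The function names `dynamic_quantize` and `dequantize` are hypothetical and are not part of Neural Compressor or ONNX Runtime.

```python
def dynamic_quantize(values):
    """Quantize floats to uint8 using a range taken from the data itself."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo = min(0.0, min(values))
    hi = max(0.0, max(values))
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any positive scale works
    # The zero point is the uint8 code that represents the real value 0.0.
    zero_point = int(min(255, max(0, round(-lo / scale))))
    quantized = [int(min(255, max(0, round(v / scale) + zero_point)))
                 for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map uint8 codes back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


activations = [-1.0, 0.0, 0.5, 2.0]
codes, scale, zero_point = dynamic_quantize(activations)
restored = dequantize(codes, scale, zero_point)
print(codes, scale, zero_point)
print(restored)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`). That rounding error is the source of the small accuracy drop measured at the end of this tutorial.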

## Prerequisites

### 1. Install packages

In [1]:
!pip install neural-compressor onnx onnxruntime torch tensorflow





### 2. Prepare Model


In [1]:
!wget https://github.com/onnx/models/raw/main/text/machine_comprehension/bert-squad/model/bertsquad-12.onnx
!wget -q https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
!unzip uncased_L-12_H-768_A-12.zip

--2023-02-16 10:34:53--  https://github.com/onnx/models/raw/main/text/machine_comprehension/bert-squad/model/bertsquad-12.onnx
Resolving proxy-prc.intel.com (proxy-prc.intel.com)... 10.240.252.16
Connecting to proxy-prc.intel.com (proxy-prc.intel.com)|10.240.252.16|:913... connected.
Proxy request sent, awaiting response... 302 Found
Location: https://media.githubusercontent.com/media/onnx/models/main/text/machine_comprehension/bert-squad/model/bertsquad-12.onnx [following]
--2023-02-16 10:34:54--  https://media.githubusercontent.com/media/onnx/models/main/text/machine_comprehension/bert-squad/model/bertsquad-12.onnx
Connecting to proxy-prc.intel.com (proxy-prc.intel.com)|10.240.252.16|:913... connected.
Proxy request sent, awaiting response... 200 OK
Length: 435852736 (416M) [application/octet-stream]
Saving to: ‘bertsquad-12.onnx’


2023-02-16 10:35:39 (12.2 MB/s) - ‘bertsquad-12.onnx’ saved [435852736/435852736]



### 3. Prepare Dataset

In [2]:
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

--2023-02-17 13:16:52--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
Resolving proxy-prc.intel.com (proxy-prc.intel.com)... 10.240.252.16
Connecting to proxy-prc.intel.com (proxy-prc.intel.com)|10.240.252.16|:913... connected.
Proxy request sent, awaiting response... 200 OK
Length: 4854279 (4.6M) [application/json]
Saving to: ‘dev-v1.1.json’


2023-02-17 13:16:54 (3.19 MB/s) - ‘dev-v1.1.json’ saved [4854279/4854279]



# Run

In [3]:
# dataset

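# Standard library
import json
import os
import sys

# Third-party packages
import numpy as np
import onnx
import onnxruntime
import torch
import tqdm
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

# Local helper modules (expected to be importable from the working directory)
import tokenization
# The wildcard import is kept because later cells use names (RawResult, timer)
# that are not imported explicitly anywhere else.
from run_onnx_squad import *
from run_onnx_squad import read_squad_examples, convert_examples_to_features, write_predictions
from squad_evaluate import evaluate

max_seq_length = 256    # maximum number of tokens per model input
doc_stride = 128        # stride between windows when a long context is split
max_query_length = 64   # questions longer than this many tokens are truncated
n_best_size = 20        # number of candidate answers kept per question
max_answer_length = 30  # longest answer span, in tokens, that may be predicted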

class squadDataset(Dataset):
    def __init__(self, unique_ids, input_ids, input_mask, segment_ids, bs):
        self.unique_ids = unique_ids
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.bs = bs

    def __getitem__(self, index):
        # Returns ((ids, input_ids, input_mask, segment_ids), label). The label is
        # a dummy 0: evaluation scores predictions against the SQuAD file instead.
        return (list(range(index, index + self.bs)),
                self.input_ids[index].astype(np.int64),
                self.input_mask[index].astype(np.int64),
                self.segment_ids[index].astype(np.int64)), 0

    def __len__(self):
        assert len(self.input_ids) == len(self.input_mask)
        assert len(self.input_ids) == len(self.segment_ids)
        return len(self.input_ids)

2023-02-17 13:18:05.251540: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-17 13:18:06.065974: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-02-17 13:18:06.065997: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
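
The `max_seq_length` and `doc_stride` settings above control how a context that is too long for one model input is split into overlapping windows. The sketch below illustrates that splitting under the assumption that `convert_examples_to_features` follows the sliding-window scheme of the original BERT SQuAD script. The helper name `window_spans` and the token counts are hypothetical.

```python
def window_spans(num_doc_tokens, max_tokens_for_doc, doc_stride):
    """Return (start, length) pairs of overlapping windows covering a document."""
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reached the end of the document
        start += min(length, doc_stride)
    return spans


max_seq_length = 256
doc_stride = 128
query_tokens = 10  # hypothetical question length
# Three positions are reserved for the [CLS] token and the two [SEP] tokens.
max_tokens_for_doc = max_seq_length - query_tokens - 3

for start, length in window_spans(400, max_tokens_for_doc, doc_stride):
    print(f"window covers tokens {start}..{start + length - 1}")
```

Because one question can produce several windows, the number of features passed to the model can exceed the number of questions in the dataset.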


In [5]:
# evaluation function

def evaluate_squad(model, dataloader, input_ids, eval_examples, extra_data, input_file):
    # Build an ONNX Runtime session from the in-memory model
    session = onnxruntime.InferenceSession(model.SerializeToString(), None,
        providers=onnxruntime.get_available_providers())
    # Print the model's output and input signatures
    for output_meta in session.get_outputs():
        print(output_meta)
    for input_meta in session.get_inputs():
        print(input_meta)
    all_results = []
    for idx, (batch, label) in tqdm.tqdm(enumerate(dataloader), desc="eval"):
        # Feed names must match the input names printed above
        data = {"unique_ids_raw_output___9:0": np.array(batch[0], dtype=np.int64),
                "input_ids:0": np.array(batch[1], dtype=np.int64),
                "input_mask:0": np.array(batch[2], dtype=np.int64),
                "segment_ids:0": np.array(batch[3], dtype=np.int64)}
        result = session.run(["unique_ids:0","unstack:0", "unstack:1"], data)
        in_batch = result[0].shape[0]
        # Only the first item of each batch is read, so this assumes batch size 1
        start_logits = [float(x) for x in result[1][0].flat]
        end_logits = [float(x) for x in result[2][0].flat]
        for i in range(0, in_batch):
            # Use the running result index as the unique id
            unique_id = len(all_results)
            all_results.append(RawResult(unique_id=unique_id, start_logits=start_logits,end_logits=end_logits))
    
    # postprocessing
    output_dir = './output'
    os.makedirs(output_dir, exist_ok=True)
    output_prediction_file = os.path.join(output_dir, "predictions_mobilebert_fp32.json")
    output_nbest_file = os.path.join(output_dir, "nbest_predictions_mobilebert_fp32.json")
    write_predictions(eval_examples, extra_data, all_results,
                    n_best_size, max_answer_length,
                    True, output_prediction_file, output_nbest_file)

    with open(input_file) as dataset_file:
        dataset_json = json.load(dataset_file)
        expected_version = '1.1'
        if (dataset_json['version'] != expected_version):
            print('Evaluation expects v-' + expected_version +
                    ', but got dataset with v-' + dataset_json['version'],
                    file=sys.stderr)
        dataset = dataset_json['data']
    with open(output_prediction_file) as prediction_file:
        predictions = json.load(prediction_file)
    res = evaluate(dataset, predictions)
    return res['exact_match']
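
The value returned above is the SQuAD exact-match score. As a rough illustration of what that metric measures, here is a minimal sketch modeled on the normalization steps of the official SQuAD v1.1 evaluation script. It is an assumption that `squad_evaluate.evaluate` behaves the same way. The function names and the toy data below are hypothetical.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truths):
    """True if the prediction equals any reference answer after normalization."""
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(gt) for gt in ground_truths)


def exact_match_percentage(predictions, references):
    """Percentage of questions whose prediction matches a reference answer."""
    hits = sum(exact_match(predictions.get(qid, ""), answers)
               for qid, answers in references.items())
    return 100.0 * hits / len(references)


# Toy data: question id -> predicted answer / list of reference answers
predictions = {"q1": "The Eiffel Tower.", "q2": "Paris"}
references = {"q1": ["Eiffel Tower"], "q2": ["London", "Berlin"]}
print(exact_match_percentage(predictions, references))
```

In this toy case one of the two predictions matches after normalization, so the score is 50.0.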

In [10]:
# launcher code

model = onnx.load('bertsquad-12.onnx')
input_file = 'dev-v1.1.json'
eval_examples = read_squad_examples(input_file='dev-v1.1.json')

vocab_file = os.path.join('uncased_L-12_H-768_A-12', 'vocab.txt')
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)
input_ids, input_mask, segment_ids, extra_data = convert_examples_to_features(eval_examples, tokenizer, 
                                                                            max_seq_length, doc_stride, max_query_length)
dataset = squadDataset(eval_examples, input_ids, input_mask, segment_ids, 1) 
eval_dataloader = DataLoader(dataset, batch_size=1)

def eval_func(model):
    return evaluate_squad(model, eval_dataloader, input_ids, eval_examples, extra_data, input_file)

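from neural_compressor import quantization, PostTrainingQuantConfig

# Dynamic PTQ needs no calibration dataloader, so only the evaluation function is passed
config = PostTrainingQuantConfig(approach='dynamic')
# eval_func returns the exact-match score that the tuner compares against the FP32 baseline
q_model = quantization.fit(model,
                           config,
                           eval_func=eval_func)
q_model.save('bert-int8.onnx')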

2023-02-17 13:33:28 [INFO] Get FP32 model baseline.


NodeArg(name='unstack:1', type='tensor(float)', shape=['unk__496', 256])
NodeArg(name='unstack:0', type='tensor(float)', shape=['unk__497', 256])
NodeArg(name='unique_ids:0', type='tensor(int64)', shape=['unk__498'])
NodeArg(name='unique_ids_raw_output___9:0', type='tensor(int64)', shape=['unk__492'])
NodeArg(name='segment_ids:0', type='tensor(int64)', shape=['unk__493', 256])
NodeArg(name='input_mask:0', type='tensor(int64)', shape=['unk__494', 256])
NodeArg(name='input_ids:0', type='tensor(int64)', shape=['unk__495', 256])


eval: 12006it [13:28, 14.85it/s]
2023-02-17 13:47:25 [INFO] Save tuning history to /home/mengniwa/notebook/nc_workspace/2023-02-17_13-22-48/./history.snapshot.
2023-02-17 13:47:25 [INFO] FP32 baseline is: [Accuracy: 80.6717, Duration (seconds): 837.1335]
2023-02-17 13:47:47 [INFO] |******Mixed Precision Statistics******|
2023-02-17 13:47:47 [INFO] +-----------------------+-------+------+
2023-02-17 13:47:47 [INFO] |        Op Type        | Total | INT8 |
2023-02-17 13:47:47 [INFO] +-----------------------+-------+------+
2023-02-17 13:47:47 [INFO] |         Gather        |   1   |  1   |
2023-02-17 13:47:47 [INFO] |         MatMul        |   86  |  86  |
2023-02-17 13:47:47 [INFO] |    DequantizeLinear   |   1   |  1   |
2023-02-17 13:47:47 [INFO] | DynamicQuantizeLinear |   74  |  74  |
2023-02-17 13:47:47 [INFO] +-----------------------+-------+------+
2023-02-17 13:47:47 [INFO] Pass quantize model elapsed t

NodeArg(name='unstack:1', type='tensor(float)', shape=['unk__496', 256])
NodeArg(name='unstack:0', type='tensor(float)', shape=['unk__497', 256])
NodeArg(name='unique_ids:0', type='tensor(int64)', shape=['unk__498'])
NodeArg(name='unique_ids_raw_output___9:0', type='tensor(int64)', shape=['unk__492'])
NodeArg(name='segment_ids:0', type='tensor(int64)', shape=['unk__493', 256])
NodeArg(name='input_mask:0', type='tensor(int64)', shape=['unk__494', 256])
NodeArg(name='input_ids:0', type='tensor(int64)', shape=['unk__495', 256])


eval: 12006it [09:18, 21.49it/s]
2023-02-17 13:57:34 [INFO] Tune 1 result is: [Accuracy (int8|fp32): 80.3311|80.6717, Duration (seconds) (int8|fp32): 586.5366|837.1335], Best tune result is: [Accuracy: 80.3311, Duration (seconds): 586.5366]
2023-02-17 13:57:34 [INFO] |***********************Tune Result Statistics**********************|
2023-02-17 13:57:34 [INFO] +--------------------+-----------+---------------+------------------+
2023-02-17 13:57:34 [INFO] |     Info Type      |  Baseline | Tune 1 result | Best tune result |
2023-02-17 13:57:34 [INFO] +--------------------+-----------+---------------+------------------+
2023-02-17 13:57:34 [INFO] |      Accuracy      |  80.6717  |    80.3311    |     80.3311      |
2023-02-17 13:57:34 [INFO] | Duration (seconds) | 837.1335  |   586.5366    |    586.5366      |
2023-02-17 13:57:34 [INFO] +--------------------+-----------+---------------+------------------+
2023-02-17 13:57:34 [INFO] Save tuning history to /home/mengniwa/notebook/nc_wor
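
The figures in the tuning log above can be summarized with a little arithmetic. The sketch below computes the relative accuracy loss and the evaluation speedup from the logged values. The 1% tolerance is an assumption (I believe it matches Neural Compressor's default relative accuracy criterion, but check the version you have installed).

```python
# Values copied from the tuning log above
fp32_accuracy, int8_accuracy = 80.6717, 80.3311
fp32_seconds, int8_seconds = 837.1335, 586.5366

# Assumed tolerance: 1% relative accuracy loss
TOLERABLE_RELATIVE_LOSS = 0.01

relative_loss = (fp32_accuracy - int8_accuracy) / fp32_accuracy
speedup = fp32_seconds / int8_seconds

print(f"relative accuracy loss: {relative_loss:.2%}")
print(f"evaluation speedup: {speedup:.2f}x")
print("within tolerance:", relative_loss <= TOLERABLE_RELATIVE_LOSS)
```

The durations cover the whole evaluation run, including post-processing, so the ratio reflects end-to-end evaluation time rather than pure model latency.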