# (experimental) Dynamic Quantization on BERT

**Author**: [Jianyu Huang](https://github.com/jianyuh)

**Reviewed by**: [Raghuraman Krishnamoorthi](https://github.com/raghuramank100)

**Edited by**: [Jessica Lin](https://)

# Introduction


In this tutorial, we will apply dynamic quantization to a BERT
model, closely following the BERT model from [the HuggingFace
Transformers examples](https://github.com/huggingface/transformers).
With this step-by-step journey, we would like to demonstrate how to
convert a well-known state-of-the-art model like BERT into a dynamically
quantized model.

-  BERT, or Bidirectional Encoder Representations from Transformers,
   is a method of pre-training language representations which
   achieves state-of-the-art accuracy on many popular
   Natural Language Processing (NLP) tasks, such as question answering,
   text classification, and others. The original paper can be found
   [here](https://arxiv.org/pdf/1810.04805.pdf).

-  Dynamic quantization support in PyTorch converts a float model to a
   quantized model with static int8 or float16 data types for the
   weights and dynamic quantization for the activations. The activations
   are quantized dynamically (per batch) to int8 when the weights are
   quantized to int8. PyTorch provides the [torch.quantization.quantize_dynamic API](https://pytorch.org/docs/stable/quantization.html#torch.quantization.quantize_dynamic),
   which replaces the specified modules with dynamic weight-only quantized
   versions and outputs the quantized model.

-  We demonstrate the accuracy and inference performance results on the
   [Microsoft Research Paraphrase Corpus (MRPC) task](https://www.microsoft.com/en-us/download/details.aspx?id=52398)
   in the General Language Understanding Evaluation benchmark [(GLUE)](https://gluebenchmark.com/). The MRPC (Dolan and Brockett, 2005) is
   a corpus of sentence pairs automatically extracted from online news
   sources, with human annotations of whether the sentences in the pair
   are semantically equivalent. As the classes are imbalanced (68%
   positive, 32% negative), we follow the common practice and report [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).
   MRPC is a common NLP task for sentence pair classification, as shown
   below.

![BERT for sentence pair classification](https://gluon-nlp.mxnet.io/_images/bert-sentence-pair.png)
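
To make the per-batch activation quantization described above more concrete, here is a minimal pure-Python sketch of per-tensor affine (asymmetric) int8 quantization. It illustrates the arithmetic only and is not PyTorch's actual kernel; the helper names `choose_qparams`, `quantize`, and `dequantize` and the sample activation values are assumptions made for this example.

```python
def choose_qparams(values, qmin=-128, qmax=127):
    """Pick a scale and zero point that map the observed range onto [qmin, qmax]."""
    # The range is widened to include 0 so that 0.0 is represented exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Round each float to the nearest integer level and clamp to the int8 range."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point):
    """Map integer levels back to approximate float values."""
    return [(q - zero_point) * scale for q in qvalues]


# Hypothetical activations for one batch; the range is recomputed for every batch,
# which is what makes the scheme "dynamic".
activations = [-0.8, -0.1, 0.0, 0.35, 1.9]
scale, zero_point = choose_qparams(activations)
q = quantize(activations, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - r) for a, r in zip(activations, restored))

print("scale:", scale, "zero_point:", zero_point)
print("quantized:", q)
print("max round-trip error:", max_error)
```

For this input the round-trip error stays below half a quantization step, which is the best a uniform 8-bit grid can do when no value is clipped.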



# 1. Setup
## 1.1 Install PyTorch and HuggingFace Transformers


To start this tutorial, let's first follow the installation instructions for PyTorch [here](https://github.com/pytorch/pytorch/#installation) and for HuggingFace Transformers [here](https://github.com/huggingface/transformers#installation). In addition, we install the [scikit-learn](https://github.com/scikit-learn/scikit-learn) package, as we will reuse its built-in F1 score calculation helper function.


In [34]:
import requests

proxies = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}
try:
    response = requests.get("https://www.google.com", proxies=proxies)
    print("Network connection is working.")
except requests.exceptions.RequestException as e: 
    print("Network connection is not working.")


Network connection is working.


In [10]:
# The 'sklearn' PyPI package is deprecated and fails to install;
# use 'scikit-learn' for pip commands instead.
!pip install scikit-learn
!pip install transformers



Because we will be using experimental parts of PyTorch, it is recommended to install the latest version of torch and torchvision. You can find the most recent instructions on local installation [here](https://pytorch.org/get-started/locally/). For example, to install the nightly CPU build:

In [11]:
!yes y | pip uninstall torch torchvision
!yes y | pip install --pre torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html

yes: standard output: Broken pipe
Looking in links: https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
Collecting torch
  Using cached https://download.pytorch.org/whl/nightly/cpu/torch-2.3.0.dev20240103%2Bcpu-cp39-cp39-linux_x86_64.whl (187.0 MB)
Installing collected packages: torch
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ultralytics 8.0.175 requires torchvision>=0.9.0, which is not installed.
torchaudio 2.1.0+cu118 requires torch==2.1.0, but you have torch 2.3.0.dev20240103+cpu which is incompatible.
Successfully installed torch-2.3.0.dev20240103+cpu
yes: standard output: Broken pipe


## 1.2 Import the necessary modules

In this step we import the necessary Python modules for the tutorial.

In [12]:
from __future__ import absolute_import, division, print_function

import logging
import numpy as np
import os
import random
import sys
import time
import torch

from argparse import Namespace
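from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
# DistributedSampler is referenced by evaluate() in section 2.3.
from torch.utils.data.distributed import DistributedSampler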
from tqdm import tqdm
from transformers import (BertConfig, BertForSequenceClassification, BertTokenizer,)
from transformers import glue_compute_metrics as compute_metrics
from transformers import glue_output_modes as output_modes
from transformers import glue_processors as processors
from transformers import glue_convert_examples_to_features as convert_examples_to_features

# Setup logging
logger = logging.getLogger(__name__)
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.WARN)

logging.getLogger("transformers.modeling_utils").setLevel(
   logging.WARN)  # Reduce logging

print(torch.__version__)



2.3.0.dev20240103+cpu


We set the number of threads to 1 to compare single-thread performance between FP32 and INT8. At the end of the tutorial, the user can set a different number of threads by building PyTorch with the right parallel backend.

In [14]:
torch.set_num_threads(1)
print(torch.__config__.parallel_info())

ATen/Parallel:
	at::get_num_threads() : 1
	at::get_num_interop_threads() : 96
OpenMP 201511 (a.k.a. OpenMP 4.5)
	omp_get_max_threads() : 1
Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
	mkl_get_max_threads() : 1
Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
std::thread::hardware_concurrency() : 96
Environment variables:
	OMP_NUM_THREADS : [not set]
	MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP



## 1.3 Learn about helper functions

The helper functions are built into the transformers library. We mainly use
two of them: one converts the text examples into feature vectors, and the
other measures the F1 score of the predicted result.

The [glue_convert_examples_to_features](https://github.com/huggingface/transformers/blob/master/transformers/data/processors/glue.py) function converts the texts into input features:

-  Tokenize the input sequences;
-  Insert [CLS] at the beginning;
-  Insert [SEP] between the first sentence and the second sentence, and
   at the end;
-  Generate token type ids to indicate whether a token belongs to the
   first sequence or the second sequence.
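
The steps above can be sketched in plain Python. This is a toy illustration under stated assumptions: it splits on whitespace instead of using BERT's WordPiece tokenizer, it keeps tokens as strings rather than vocabulary ids, and the function name `build_pair_features` is hypothetical. Padding and the attention mask are included because the evaluation code later in this tutorial consumes fixed-length inputs with an `attention_mask`.

```python
def build_pair_features(sentence_a, sentence_b, max_length=16):
    """Build BERT-style inputs for a sentence pair (toy whitespace tokenization)."""
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()

    # [CLS] A ... [SEP] B ... [SEP]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # 1 marks real tokens, 0 will mark padding.
    attention_mask = [1] * len(tokens)

    if len(tokens) > max_length:
        raise ValueError("pair is longer than max_length; this sketch does not truncate")

    padding = max_length - len(tokens)
    tokens += ["[PAD]"] * padding
    token_type_ids += [0] * padding
    attention_mask += [0] * padding
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = build_pair_features(
    "The cat sat", "A cat was sitting", max_length=12)
print(tokens)
print(token_type_ids)
print(attention_mask)
```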

The [glue_compute_metrics function](https://github.com/huggingface/transformers/blob/master/transformers/data/processors/glue.py) computes the evaluation metrics, including the [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html), which
can be interpreted as the harmonic mean of the precision and recall,
where an F1 score reaches its best value at 1 and its worst value at 0. The
relative contributions of precision and recall to the F1 score are equal.

-  The equation for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)
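
As a small worked example of the formula, the sketch below computes F1 directly from labels and predictions. The function name `f1_from_predictions` and the synthetic data are assumptions for illustration; the data mirrors the 68% positive / 32% negative class balance mentioned in the introduction and shows why accuracy alone is misleading on imbalanced classes: a classifier that always predicts "positive" already reaches 0.68 accuracy and an F1 of roughly 0.81.

```python
def f1_from_predictions(labels, preds, positive=1):
    """Compute F1 = 2 * (precision * recall) / (precision + recall) for one class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


# Synthetic data: 68 positive and 32 negative examples.
labels = [1] * 68 + [0] * 32
always_positive = [1] * 100  # a trivial classifier that never predicts "negative"

accuracy = sum(y == p for y, p in zip(labels, always_positive)) / len(labels)
f1 = f1_from_predictions(labels, always_positive)
print("accuracy:", accuracy)
print("F1:", f1)
```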





## 1.4 Download the dataset

Before running the MRPC task, we download the [GLUE data](https://gluebenchmark.com/tasks) by running [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) and unpack it into the directory `glue_data/MRPC`.


In [50]:
# !python download_glue_data.py --data_dir='glue_data' --tasks='MRPC' --test_labels=True --path_to_mrpc='./glue_data/MRPC'
!pwd
!ls
!wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
!python download_glue_data.py --data_dir='glue_data' --tasks='MRPC' 
!ls glue_data/MRPC

/dataset01/zwc/myGitHub/Neural-Network-Quantization


'(experimental)_Dynamic_Quantization_on_BERT.ipynb'   download_glue_data.py
 MRPC						      glue_data
 MRPC.zip					      test.ipynb
 README.md
Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/student001/.wget-hsts'. HSTS will be disabled.
--2024-01-03 15:55:53--  https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8225 (8.0K) [text/plain]
Saving to: 'download_glue_data.py.1'


2024-01-03 15:55:55 (30.6 MB/s) - 'download_glue_data.py.1' saved [8225/8225]

Processing MRPC...
Local MRPC data not specified, downloading data from https://dl

# 2. Fine-tune the BERT model

The spirit of BERT is to pre-train the language representations and then fine-tune the deep bidirectional representations on a wide range of tasks with minimal task-dependent parameters, achieving state-of-the-art results. In this tutorial, we focus on fine-tuning the pre-trained BERT model to classify semantically equivalent sentence pairs on the MRPC task.

To fine-tune the pre-trained BERT model ("bert-base-uncased" model in HuggingFace transformers) for the MRPC task, you can follow the command in [examples](https://github.com/huggingface/transformers/tree/master/examples):
    
    export GLUE_DIR=./glue_data
    export TASK_NAME=MRPC
    export OUT_DIR=./$TASK_NAME/
    python ./run_glue.py \
        --model_type bert \
        --model_name_or_path bert-base-uncased \
        --task_name $TASK_NAME \
        --do_train \
        --do_eval \
        --do_lower_case \
        --data_dir $GLUE_DIR/$TASK_NAME \
        --max_seq_length 128 \
        --per_gpu_eval_batch_size=8   \
        --per_gpu_train_batch_size=8   \
        --learning_rate 2e-5 \
        --num_train_epochs 3.0 \
        --save_steps 100000 \
        --output_dir $OUT_DIR


We provide the fine-tuned BERT model for the MRPC task [here](https://download.pytorch.org/tutorial/MRPC.zip).
To save time, you can download the model file (~400 MB) directly into your local folder ``$OUT_DIR``.



In [42]:
!wget https://download.pytorch.org/tutorial/MRPC.zip
!unzip MRPC.zip
!ls
!pwd

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/student001/.wget-hsts'. HSTS will be disabled.
--2024-01-03 15:52:11--  https://download.pytorch.org/tutorial/MRPC.zip
Resolving download.pytorch.org (download.pytorch.org)... 18.164.154.30, 18.164.154.17, 18.164.154.123, ...
Connecting to download.pytorch.org (download.pytorch.org)|18.164.154.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 405365618 (387M) [application/zip]
Saving to: 'MRPC.zip'

MRPC.zip              0%[                    ]  72.00K   358KB/s               


2024-01-03 15:52:53 (9.44 MB/s) - 'MRPC.zip' saved [405365618/405365618]

Archive:  MRPC.zip
   creating: MRPC/
 extracting: MRPC/added_tokens.json  
  inflating: MRPC/tokenizer_config.json  
  inflating: MRPC/special_tokens_map.json  
  inflating: MRPC/config.json        
  inflating: MRPC/training_args.bin  
  inflating: MRPC/vocab.txt          
  inflating: MRPC/pytorch_model.bin  
'(experimental)_Dynamic_Quantization_on_BERT.ipynb'   download_glue_data.py
 MRPC						      glue_data
 MRPC.zip					      test.ipynb
 README.md
/dataset01/zwc/myGitHub/Neural-Network-Quantization


## 2.1 Set global configurations

Here we set the global configurations for evaluating the fine-tuned BERT model before and after the dynamic quantization.

In [15]:
configs = Namespace()

# The output directory for the fine-tuned model.
# configs.output_dir = "/content/MRPC/"
configs.output_dir = "./MRPC/"

# The data directory for the MRPC task in the GLUE benchmark.
# configs.data_dir = "/content/glue_data/MRPC"
configs.data_dir = "./glue_data/MRPC"


# The model name or path for the pre-trained model.
configs.model_name_or_path = "bert-base-uncased"
# The maximum length of an input sequence
configs.max_seq_length = 128

# Prepare GLUE task.
configs.task_name = "MRPC".lower()
configs.processor = processors[configs.task_name]()
configs.output_mode = output_modes[configs.task_name]
configs.label_list = configs.processor.get_labels()
configs.model_type = "bert".lower()
configs.do_lower_case = True

# Set the device, batch size, topology, and caching flags.
configs.device = "cpu"
configs.per_gpu_eval_batch_size = 8
configs.n_gpu = 0
configs.local_rank = -1
configs.overwrite_cache = False


# Set random seed for reproducibility.
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
set_seed(42)



## 2.2 Load the fine-tuned BERT model

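We load the tokenizer and the BERT sequence classifier model (FP32). The commented-out lines load the fine-tuned model from `configs.output_dir`. The cell as executed here instead loads the pre-trained `bert-base-uncased` checkpoint, whose classification head is newly initialized (see the warning in the output), so the scores reported later in this notebook come from a classifier that has not been fine-tuned on MRPC. To evaluate the fine-tuned model, use the commented-out lines instead.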

In [18]:
# tokenizer = BertTokenizer.from_pretrained(
#     configs.output_dir, do_lower_case=configs.do_lower_case)

# model = BertForSequenceClassification.from_pretrained(configs.output_dir)
# model.to(configs.device)


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=configs.do_lower_case)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
model.to(configs.device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

## 2.3 Define the tokenize and evaluation function
We reuse the tokenization and evaluation functions from [HuggingFace](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py).

In [19]:
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

def evaluate(args, model, tokenizer, prefix=""):
    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
    eval_outputs_dirs = (args.output_dir, args.output_dir + '-MM') if args.task_name == "mnli" else (args.output_dir,)

    results = {}
    for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
        eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)

        if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(eval_output_dir)

        args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
        # Note that DistributedSampler samples randomly
        eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
        eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

        # multi-gpu eval
        if args.n_gpu > 1:
            model = torch.nn.DataParallel(model)

        # Eval!
        logger.info("***** Running evaluation {} *****".format(prefix))
        logger.info("  Num examples = %d", len(eval_dataset))
        logger.info("  Batch size = %d", args.eval_batch_size)
        eval_loss = 0.0
        nb_eval_steps = 0
        preds = None
        out_label_ids = None
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            model.eval()
            batch = tuple(t.to(args.device) for t in batch)

            with torch.no_grad():
                inputs = {'input_ids':      batch[0],
                          'attention_mask': batch[1],
                          'labels':         batch[3]}
                if args.model_type != 'distilbert':
                    inputs['token_type_ids'] = batch[2] if args.model_type in ['bert', 'xlnet'] else None  # XLM, DistilBERT and RoBERTa don't use segment_ids
                outputs = model(**inputs)
                tmp_eval_loss, logits = outputs[:2]

                eval_loss += tmp_eval_loss.mean().item()
            nb_eval_steps += 1
            if preds is None:
                preds = logits.detach().cpu().numpy()
                out_label_ids = inputs['labels'].detach().cpu().numpy()
            else:
                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
                out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)

        eval_loss = eval_loss / nb_eval_steps
        if args.output_mode == "classification":
            preds = np.argmax(preds, axis=1)
        elif args.output_mode == "regression":
            preds = np.squeeze(preds)
        result = compute_metrics(eval_task, preds, out_label_ids)
        results.update(result)

        output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results {} *****".format(prefix))
            for key in sorted(result.keys()):
                logger.info("  %s = %s", key, str(result[key]))
                writer.write("%s = %s\n" % (key, str(result[key])))

    return results


def load_and_cache_examples(args, task, tokenizer, evaluate=False):
    if args.local_rank not in [-1, 0] and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    processor = processors[task]()
    output_mode = output_modes[task]
    # Load data features from cache or dataset file
    cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_{}'.format(
        'dev' if evaluate else 'train',
        list(filter(None, args.model_name_or_path.split('/'))).pop(),
        str(args.max_seq_length),
        str(task)))
    if os.path.exists(cached_features_file) and not args.overwrite_cache:
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
        logger.info("Creating features from dataset file at %s", args.data_dir)
        label_list = processor.get_labels()
        if task in ['mnli', 'mnli-mm'] and args.model_type in ['roberta']:
            # HACK(label indices are swapped in RoBERTa pretrained model)
            label_list[1], label_list[2] = label_list[2], label_list[1]
        examples = processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
        features = convert_examples_to_features(examples,
                                                tokenizer,
                                                label_list=label_list,
                                                max_length=args.max_seq_length,
                                                output_mode=output_mode,
                                                # pad_on_left=bool(args.model_type in ['xlnet']),                 # pad on the left for xlnet
                                                # pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
                                                # pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0,
        )
        if args.local_rank in [-1, 0]:
            logger.info("Saving features into cached file %s", cached_features_file)
            torch.save(features, cached_features_file)

    if args.local_rank == 0 and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
    if output_mode == "classification":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
    elif output_mode == "regression":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.float)

    dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
    return dataset


# 3. Apply the dynamic quantization

We call `torch.quantization.quantize_dynamic` on the model to apply dynamic quantization to the HuggingFace BERT model. Specifically,

- We specify that we want the torch.nn.Linear modules in our model to be quantized;
- We specify that we want weights to be converted to quantized int8 values.

In [20]:
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(quantized_model)


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
              (key): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
              (value): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
              (dropout): Dropout(p=0.1, inplace=False)
            

## 3.1 Check the model size
Let's first check the model size. We can observe a significant reduction in model size:

In [21]:
def print_size_of_model(model):
    torch.save(model.state_dict(), "temp.p")
    print('Size (MB):', os.path.getsize("temp.p")/1e6)
    os.remove('temp.p')

print_size_of_model(model)
print_size_of_model(quantized_model)

Size (MB): 438.000505
Size (MB): 181.479765


The BERT model used in this tutorial (bert-base-uncased) has a vocabulary size V of 30522. With an embedding size of 768, the total size of the word embedding table is about 4 (bytes/FP32) * 30522 * 768 ≈ 94 MB. The embedding table is not quantized, so quantization reduces the size of the rest of the model from about 344 MB (FP32 model) to about 88 MB (INT8 model).
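
The size arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the two totals are the sizes printed by `print_size_of_model` above, and the variable names are chosen for this example only.

```python
vocab_size = 30522        # bert-base-uncased vocabulary size
hidden_size = 768         # embedding dimension
bytes_per_fp32 = 4

# The word embedding table stays in FP32 in both models.
embedding_mb = vocab_size * hidden_size * bytes_per_fp32 / 1e6

fp32_total_mb = 438.000505   # printed size of the FP32 model
int8_total_mb = 181.479765   # printed size of the dynamically quantized model

fp32_rest_mb = fp32_total_mb - embedding_mb
int8_rest_mb = int8_total_mb - embedding_mb
reduction = fp32_rest_mb / int8_rest_mb

print(f"word embedding table: {embedding_mb:.1f} MB")
print(f"rest of model, FP32:  {fp32_rest_mb:.1f} MB")
print(f"rest of model, INT8:  {int8_rest_mb:.1f} MB")
print(f"reduction factor:     {reduction:.2f}x")
```

The reduction factor for the non-embedding part comes out slightly below 4x, which is plausible given that 4-byte weights become 1-byte weights in the `Linear` layers while other parameters in that remainder, such as the position embeddings and `LayerNorm` parameters visible in the printed model, stay in FP32.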

## 3.2 Evaluate the inference accuracy and time

Next, let's compare the inference time as well as the evaluation accuracy between the original FP32 model and the INT8 model after the dynamic quantization.

In [22]:
def time_model_evaluation(model, configs, tokenizer):
    eval_start_time = time.time()
    result = evaluate(configs, model, tokenizer, prefix="")
    eval_end_time = time.time()
    eval_duration_time = eval_end_time - eval_start_time
    print(result)
    print("Evaluate total time (seconds): {0:.1f}".format(eval_duration_time))

# Evaluate the original FP32 BERT model
time_model_evaluation(model, configs, tokenizer)

Evaluating:   0%|          | 0/51 [00:00<?, ?it/s]

Evaluating: 100%|██████████| 51/51 [01:14<00:00,  1.46s/it]

{'acc': 0.6838235294117647, 'f1': 0.8122270742358079, 'acc_and_f1': 0.7480253018237863}
Evaluate total time (seconds): 74.6





In [23]:
# Evaluate the INT8 BERT model after the dynamic quantization
time_model_evaluation(quantized_model, configs, tokenizer)

Evaluating:   6%|▌         | 3/51 [00:01<00:31,  1.52it/s]

Evaluating: 100%|██████████| 51/51 [00:33<00:00,  1.54it/s]

{'acc': 0.6838235294117647, 'f1': 0.8122270742358079, 'acc_and_f1': 0.7480253018237863}
Evaluate total time (seconds): 33.2





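Running this locally on a MacBook Pro with the fine-tuned model, inference over all 408 examples in the MRPC dataset takes about 160 seconds without quantization and about 90 seconds with quantization. (The cell outputs above come from a different run on a Linux x86_64 host, which took 74.6 seconds for FP32 and 33.2 seconds for INT8; because that run loaded the `bert-base-uncased` checkpoint rather than the fine-tuned one, both models report the same F1 score of 0.8122.) We summarize the results for running the quantized BERT model inference on a MacBook Pro as follows: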

| Precision | F1 score | Model size | 1 thread | 4 threads |
|-----------|----------|------------|----------|-----------|
| FP32      | 0.9019   | 438 MB     | 160 sec  | 85 sec    |
| INT8      | 0.8953   | 181 MB     | 90 sec   | 46 sec    |
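
The ratios implied by these figures can be computed directly. The sketch below uses only numbers already reported in this tutorial (the MacBook Pro table and the two timings printed by the evaluation cells); the dictionary layout and names are assumptions for illustration.

```python
# Figures from the MacBook Pro summary table.
macbook = {
    "FP32": {"size_mb": 438, "one_thread_s": 160, "four_threads_s": 85},
    "INT8": {"size_mb": 181, "one_thread_s": 90, "four_threads_s": 46},
}


def ratio(metric):
    """FP32 value divided by INT8 value; larger means a bigger win for INT8."""
    return macbook["FP32"][metric] / macbook["INT8"][metric]


size_reduction = ratio("size_mb")
speedup_1t = ratio("one_thread_s")
speedup_4t = ratio("four_threads_s")

# Single-thread timings printed by the evaluation cells in this notebook.
notebook_speedup = 74.6 / 33.2

print(f"model size reduction: {size_reduction:.2f}x")
print(f"speedup, 1 thread:    {speedup_1t:.2f}x")
print(f"speedup, 4 threads:   {speedup_4t:.2f}x")
print(f"speedup, this run:    {notebook_speedup:.2f}x")
```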

The F1 score drops from 0.9019 to 0.8953, about 0.7 percentage points, after applying post-training dynamic quantization to the fine-tuned BERT model on the MRPC task. As a comparison, a [recent paper](https://arxiv.org/pdf/1910.06188.pdf) (Table 1) reports 0.8788 with post-training dynamic quantization and 0.8956 with quantization-aware training. The main difference is that PyTorch supports asymmetric quantization, while that paper supports symmetric quantization only.
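
To illustrate the general distinction between the two schemes, the sketch below compares the quantization step size each one would use for the same 8-bit budget. It is a simplified per-tensor model and does not reproduce either PyTorch's or the paper's implementation; the function names and sample values are hypothetical.

```python
def asymmetric_step(values, levels=256):
    """Step size when the grid spans [min, max] (widened to include 0)."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    return (hi - lo) / (levels - 1)


def symmetric_step(values, levels=256):
    """Step size when the grid spans [-max|x|, +max|x|], centered on 0."""
    bound = max(abs(v) for v in values)
    return bound / (levels // 2 - 1)  # 127 positive levels for int8


# A skewed range: mostly non-negative values with a small negative tail.
skewed = [-0.2, 0.0, 0.5, 1.1, 3.0]
# A range that is already balanced around zero.
balanced = [-3.0, -1.0, 0.0, 1.0, 3.0]

skewed_asym = asymmetric_step(skewed)
skewed_sym = symmetric_step(skewed)
balanced_asym = asymmetric_step(balanced)
balanced_sym = symmetric_step(balanced)

print("skewed:   asymmetric", skewed_asym, "symmetric", skewed_sym)
print("balanced: asymmetric", balanced_asym, "symmetric", balanced_sym)
```

On the skewed input the symmetric grid spends roughly half of its levels on negative values that never occur, so its step is almost twice as coarse; on the balanced input the two schemes are nearly identical.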

Note that we set the number of threads to 1 for the single-thread comparison in this tutorial. Intra-op parallelization is also supported for these quantized INT8 operators. You can enable multiple threads with `torch.set_num_threads(N)` (`N` is the number of intra-op parallelization threads). A prerequisite for intra-op parallelization is to build PyTorch with the right [backend](https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html#build-options) such as OpenMP, Native, or TBB. You can use `torch.__config__.parallel_info()` to check the parallelization settings. On the same MacBook Pro, using PyTorch with the Native backend for parallelization, evaluating the MRPC dataset takes about 46 seconds.

## 3.3 Serialize the quantized model
We can serialize and save the quantized model for future use.

In [33]:
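# In this environment, quantized_model.save_pretrained() failed with
# "AttributeError: 'torch.dtype' object has no attribute 'device'",
# so we save the state dict with torch.save instead.
# The file name below is an arbitrary choice.
quantized_output_dir = configs.output_dir + "quantized/"
os.makedirs(quantized_output_dir, exist_ok=True)
torch.save(quantized_model.state_dict(),
           os.path.join(quantized_output_dir, "quantized_state_dict.pt"))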

# Conclusion
In this tutorial, we demonstrated how to convert a well-known state-of-the-art NLP model like BERT into a dynamically quantized model. Dynamic quantization can reduce the size of the model while having only a limited impact on accuracy.

Thanks for reading! As always, we welcome any feedback, so please create an issue [here](https://github.com/pytorch/pytorch/issues) if you have any.

# References
[1] J. Devlin, M. Chang, K. Lee, & K. Toutanova (2018). [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf).

[2] [HuggingFace Transformers](https://github.com/huggingface/transformers).

[3] O. Zafrir, G. Boudoukh, P. Izsak, & M. Wasserblat (2019). [Q8BERT: Quantized 8bit BERT](https://arxiv.org/pdf/1910.06188.pdf).

