
# Dynamic Quantization with Hugging Face Optimum

In this session, you will learn how to apply _dynamic quantization_ to a 🤗 Transformers model. You will quantize a [DistilBERT model](https://huggingface.co/optimum/distilbert-base-uncased-finetuned-banking77) that's been fine-tuned on the [Banking77 dataset](https://huggingface.co/datasets/banking77) for intent classification. 

Along the way, you'll learn how to use two open-source libraries: 

* [🤗 Optimum](https://github.com/huggingface/optimum): an extension of 🤗 Transformers, which provides a set of performance optimization tools enabling maximum efficiency to train and run models on targeted hardware.
* [🤗 Evaluate](https://github.com/huggingface/evaluate): a library that makes evaluating and comparing models and reporting their performance easier and more standardized.


By the end of this session, you will see how quantization with 🤗 Optimum can significantly decrease model latency while keeping almost 100% of the full-precision model's accuracy.


> This tutorial was created and run on a c6i.xlarge AWS EC2 Instance.

## Learning objectives

By the end of this session, you will know how to:

* Set up a development environment
* Convert a 🤗 Transformers model to ONNX for inference
* Apply dynamic quantization using `ORTQuantizer` from 🤗 Optimum
* Test inference with the quantized model
* Evaluate the model performance with 🤗 Evaluate
* Push the quantized model to the Hub
* Load and run inference with a quantized model from the Hub


Let's get started! 🚀

## 1. Set up development environment

Our first step is to install 🤗 Optimum, along with 🤗 Evaluate and some other libraries. Running the following cell will install all the required packages for us, including 🤗 Transformers, PyTorch, and ONNX Runtime utilities:

In [None]:
%pip install "optimum[onnxruntime]==1.2.2" "git+https://github.com/huggingface/evaluate.git#egg=evaluate[evaluator]" scikit-learn mkl-include mkl

> If you want to run inference on a GPU, you can install 🤗 Optimum with `pip install optimum[onnxruntime-gpu]`.

The final setup step is to log in to our Hugging Face account:

In [17]:
from huggingface_hub import notebook_login

notebook_login()

While we're at it, let's disable the parallelism in the tokenizers to avoid a long list of warnings:

In [1]:
%env TOKENIZERS_PARALLELISM=false

env: TOKENIZERS_PARALLELISM=false
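
The `%env` magic only works inside IPython or Jupyter. If you run these steps as a plain Python script, a minimal equivalent is to set the variable through `os.environ`, ideally before the tokenizer is first used:

```python
import os

# Same effect as `%env TOKENIZERS_PARALLELISM=false`, but usable outside a notebook.
# Set it early in the script, before any tokenizer is created or called.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```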


## 2. Convert a 🤗 Transformers model to ONNX for inference

Before we can quantize our model, we first need to export it to the ONNX format. To do this, we will use the `ORTModelForSequenceClassification` class and call its `from_pretrained()` method. With `from_transformers=True`, this method downloads the PyTorch weights from the Hub and exports them to ONNX. The model we are using is `optimum/distilbert-base-uncased-finetuned-banking77`, a DistilBERT model fine-tuned on the Banking77 dataset for text classification that achieves an accuracy of 92.5%:

In [2]:
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
from pathlib import Path


model_id="optimum/distilbert-base-uncased-finetuned-banking77"
dataset_id="banking77"
onnx_path = Path("onnx")

# load vanilla transformers and convert to onnx
model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

One neat thing about 🤗 Optimum is that it allows you to run ONNX models with the `pipeline()` function from 🤗 Transformers. This means you get all the pre- and post-processing features for free, without needing to re-implement them for each model! Here's how you can run inference with our vanilla ONNX model:

In [3]:
from transformers import pipeline

vanilla_clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
vanilla_clf("Could you assist me in finding my lost card?")

[{'label': 'lost_or_stolen_card', 'score': 0.9664045572280884}]

This looks good, so let's save the model and tokenizer to disk for later usage:

In [4]:
# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

('onnx/tokenizer_config.json',
 'onnx/special_tokens_map.json',
 'onnx/vocab.txt',
 'onnx/added_tokens.json',
 'onnx/tokenizer.json')

If we inspect the `onnx` directory where we've saved the model and tokenizer:

In [5]:
!ls {onnx_path}

config.json		      special_tokens_map.json  vocab.txt
model-quantized-dynamic.onnx  tokenizer.json
model.onnx		      tokenizer_config.json


we can see that there's a `model.onnx` file that corresponds to our exported model. Let's now go ahead and quantize it!

## 3. Apply dynamic quantization using `ORTQuantizer` from 🤗 Optimum

Applying quantization in 🤗 Optimum involves:

* Creating a quantizer for our model
* Defining the type of quantization via a configuration class
* Exporting the quantized model as a new ONNX file

The following code snippet does these steps for us:

In [7]:
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(model_id, feature="sequence-classification")
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)

# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.export(
    onnx_model_path=onnx_path / "model.onnx",
    onnx_quantized_model_output_path=onnx_path / "model-quantized-dynamic.onnx",
    quantization_config=dqconfig,
)

Here we can see that the configuration targets CPUs with the Intel AVX512-VNNI instruction set, uses dynamic quantization (`is_static=False`), and quantizes the weights per channel (`per_channel=True`). If we now take a look at our `onnx` directory:

In [8]:
!ls {onnx_path}

config.json		      special_tokens_map.json  vocab.txt
model-quantized-dynamic.onnx  tokenizer.json
model.onnx		      tokenizer_config.json


we can see we have a new ONNX file called `model-quantized-dynamic.onnx`. Let's do a quick size comparison of the two models:

In [9]:
import os

# get model file size
size = os.path.getsize(onnx_path / "model.onnx")/(1024*1024)
quantized_model = os.path.getsize(onnx_path / "model-quantized-dynamic.onnx")/(1024*1024)

print(f"Model file size: {size:.2f} MB")
print(f"Quantized Model file size: {quantized_model:.2f} MB")

Model file size: 255.68 MB
Quantized Model file size: 134.43 MB


Nice, dynamic quantization has reduced the model size by almost a factor of 2 (from 255.68 MB to 134.43 MB)! A smaller int8 model should also be faster at inference time, so let's now see how we can test the latency of our models.
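
To build some intuition for what quantization does to the weights, here is a small pure-Python sketch of int8 quantization. It is only an illustration of the arithmetic, not ONNX Runtime's implementation: it assumes a simple symmetric scheme (no zero point), and `quantize_int8` and `dequantize` are hypothetical helper names. In dynamic quantization the weights are converted ahead of time like this, while the scaling parameters for the activations are computed on the fly during inference. The second half shows why a per-channel scale (as in `per_channel=True`) can be finer than a single scale for the whole tensor.

```python
def quantize_int8(values):
    """Symmetric int8 quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    ints = [max(-127, min(127, round(v / scale))) for v in values]
    return ints, scale


def dequantize(ints, scale):
    """Map the int8 values back to approximate floats."""
    return [q * scale for q in ints]


# Per-tensor quantization of a toy weight vector
weights = [0.02, -0.51, 0.33, 1.27, -0.98]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)

# Per-channel quantization: one scale per row instead of one for the whole matrix
matrix = [[0.02, -0.05, 0.03], [1.27, -0.98, 0.40]]
per_tensor_scale = quantize_int8([v for row in matrix for v in row])[1]
per_channel_scales = [quantize_int8(row)[1] for row in matrix]
print(per_tensor_scale, per_channel_scales)
```

The rounding error per weight is at most half the scale, so the row with small values is represented much more precisely when it gets its own scale.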

## 4. Test inference with the quantized model

As we saw earlier, 🤗 Optimum has built-in support for 🤗 Transformers pipelines. This allows us to use the same API that we know from PyTorch and TensorFlow models, so we can load our quantized model with the `ORTModelForSequenceClassification` class and pass it to the `pipeline()` function:

In [10]:
model = ORTModelForSequenceClassification.from_pretrained(onnx_path, file_name="model-quantized-dynamic.onnx")
tokenizer = AutoTokenizer.from_pretrained(onnx_path)

quantized_clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
quantized_clf("Could you assist me in finding my lost card?")

[{'label': 'cash_withdrawal_not_recognised', 'score': 0.06697972863912582}]

Next, let's evaluate the accuracy of the quantized model on the Banking77 test set with the `evaluator` from 🤗 Evaluate:

In [11]:
from evaluate import evaluator
from datasets import load_dataset 

eval_pipe = evaluator("text-classification")
eval_dataset = load_dataset("banking77", split="test")
label_mapping = model.config.label2id

results = eval_pipe.compute(
    model_or_pipeline=quantized_clf,
    data=eval_dataset,
    metric="accuracy",
    input_column="text",
    label_column="label",
    label_mapping=label_mapping,
    strategy="simple",
)
print(results)



{'accuracy': 0.012987012987012988}
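
The `label_mapping` argument is needed because the pipeline returns label names, while the dataset stores integer class ids. Conceptually, the accuracy computation boils down to something like the sketch below. This is an illustration under that assumption, not the code 🤗 Evaluate runs internally; `accuracy_with_label_mapping` is a hypothetical helper, and the two-entry `label2id` with its ids is toy data.

```python
def accuracy_with_label_mapping(predictions, references, label2id):
    """Map predicted label names to ids and compare them with the reference ids."""
    predicted_ids = [label2id[p["label"]] for p in predictions]
    correct = sum(int(pred == ref) for pred, ref in zip(predicted_ids, references))
    return correct / len(references)


# Toy data in the same shape as the pipeline output and the dataset labels
label2id = {"card_arrival": 0, "lost_or_stolen_card": 1}
predictions = [
    {"label": "lost_or_stolen_card", "score": 0.97},
    {"label": "card_arrival", "score": 0.88},
    {"label": "card_arrival", "score": 0.60},
]
references = [1, 0, 1]

accuracy = accuracy_with_label_mapping(predictions, references, label2id)
print(f"accuracy: {accuracy:.4f}")
```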


In [12]:
print(f"Vanilla model: 92.5%")
print(f"Quantized model: {results['accuracy']*100:.2f}%")
print(f"The quantized model achieves {round(results['accuracy']/0.925,4)*100:.2f}% accuracy of the fp32 model")

Vanilla model: 92.5%
Quantized model: 1.30%
The quantized model achieves 1.40% accuracy of the fp32 model
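
Alongside the aggregate accuracy, it can be useful to compare the two pipelines example by example and see how often their top labels agree. The following is a minimal sketch, assuming each pipeline returns a list with one `{"label": ..., "score": ...}` dict per input, as in the outputs above. `label_agreement` is a hypothetical helper, and the two `fake_*` functions are stand-ins so the snippet runs without downloading a model.

```python
def label_agreement(pipe_a, pipe_b, texts):
    """Fraction of texts for which both pipelines return the same top label."""
    matches = 0
    for text in texts:
        label_a = pipe_a(text)[0]["label"]
        label_b = pipe_b(text)[0]["label"]
        matches += int(label_a == label_b)
    return matches / len(texts)


# Stand-in callables that mimic the pipeline output format
def fake_vanilla(text):
    label = "lost_or_stolen_card" if "lost" in text else "exchange_rate"
    return [{"label": label, "score": 0.9}]


def fake_quantized(text):
    return [{"label": "exchange_rate", "score": 0.5}]


texts = [
    "Could you assist me in finding my lost card?",
    "What is the exchange rate like on this app?",
]
agreement = label_agreement(fake_vanilla, fake_quantized, texts)
print(f"agreement: {agreement:.2f}")
```

With the real pipelines from this notebook, the call would look like `label_agreement(vanilla_clf, quantized_clf, texts)`.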


Okay, now let's test the performance (latency) of our quantized model. We are going to use a payload with a sequence length of 128 for the benchmark. To keep it simple, we are going to use a Python loop and calculate the average, standard deviation, and p95 latency for our vanilla model and for the quantized model.

In [13]:
from time import perf_counter
import numpy as np 

payload="Hello my name is Philipp. I am getting in touch with you because i didn't get a response from you. What do I need to do to get my new card which I have requested 2 weeks ago? Please help me and answer this email in the next 7 days. Best regards and have a nice weekend "*2
print(f'Payload sequence length: {len(tokenizer(payload)["input_ids"])}')

def measure_latency(pipe):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(payload)
    # Timed run
    for _ in range(300):
        start_time = perf_counter()
        _ =  pipe(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    time_p95_ms = 1000 * np.percentile(latencies,95)
    return f"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f};", time_p95_ms


vanilla_model=measure_latency(vanilla_clf)
quantized_model=measure_latency(quantized_clf)

print(f"Vanilla model: {vanilla_model[0]}")
print(f"Quantized model: {quantized_model[0]}")
print(f"Improvement through quantization: {round(vanilla_model[1]/quantized_model[1],2)}x")

Payload sequence length: 128
Vanilla model: P95 latency (ms) - 134.45046685028507; Average latency (ms) - 124.96 +/- 45.30;
Quantized model: P95 latency (ms) - 92.54824770014238; Average latency (ms) - 72.11 +/- 13.55;
Improvement through quantization: 1.45x
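
If NumPy is not available, the same statistics can be computed with the standard library. The sketch below is a hypothetical variant of the benchmark loop: `percentile` uses linear interpolation between the two nearest ranks, which is the default behaviour of `np.percentile`, and `statistics.pstdev` is the population standard deviation, like `np.std`. The function names, the run counts, and the dummy workload are assumptions for illustration.

```python
import math
import statistics
from time import perf_counter


def percentile(samples, q):
    """q-th percentile (0-100) with linear interpolation between nearest ranks."""
    ordered = sorted(samples)
    position = (len(ordered) - 1) * q / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def summarize_latency(fn, runs=50, warmup=5):
    """Call fn repeatedly and return mean, std, and p95 latency in milliseconds."""
    for _ in range(warmup):
        fn()
    latencies_ms = []
    for _ in range(runs):
        start = perf_counter()
        fn()
        latencies_ms.append(1000 * (perf_counter() - start))
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "std_ms": statistics.pstdev(latencies_ms),
        "p95_ms": percentile(latencies_ms, 95),
    }


# Dummy workload; with the notebook's pipelines this would be e.g. lambda: quantized_clf(payload)
stats = summarize_latency(lambda: sum(range(10_000)))
print(stats)
```

Reporting the p95 next to the mean is a deliberate choice: a few slow outliers can inflate the mean and standard deviation, while the p95 describes the latency that most requests stay under.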


## 5. Push quantized model to the Hub

In [19]:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

tmp_store_directory="onnx_hub_repo"
repository_id="quantized-distilbert-banking77"
model_file_name="model-quantized-dynamic.onnx"

model.latest_model_name=model_file_name # workaround for PR #214
model.save_pretrained(tmp_store_directory)
dynamic_quantizer.tokenizer.save_pretrained(tmp_store_directory)

model.push_to_hub(tmp_store_directory,
                  repository_id=repository_id,
                  use_auth_token=True
                  )



## 6. Load and run inference from the Hub

In [20]:
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import pipeline, AutoTokenizer

model = ORTModelForSequenceClassification.from_pretrained("lewtun/quantized-distilbert-banking77")
tokenizer = AutoTokenizer.from_pretrained("lewtun/quantized-distilbert-banking77")

remote_clx = pipeline("text-classification", model=model, tokenizer=tokenizer)

remote_clx("What is the exchange rate like on this app?")

[{'label': 'cash_withdrawal_not_recognised', 'score': 0.07231996953487396}]