# Accelerating Transformers with Hugging Face Optimum

Lewis Tunstall (open-source @ Hugging Face)

`@_lewtun`

## Who is Lewis?

<div style="text-align: center;">
    <img src="images/about.png" alt="About me" title="Title text">
</div>

* PhD in Physics from University of Adelaide, Australia
* Co-author of O'Reilly book [_Natural Language Processing with Transformers_](https://transformersbook.com/)
* Co-developer of the **free** [Hugging Face course](https://huggingface.co/course/chapter1/1)
* Maintainer of ONNX API in `transformers`

## Outline

* What is Optimum?
* Question answering as a case study
* Making models faster with quantization
* Optimizing inference with ONNX and ONNX Runtime

## What is Optimum?

<div style="text-align: center;">
    <img src="images/optimum.jpeg" alt="Optimum" title="Title text">
</div>

<div style="text-align: center;">
    <img src="images/hardware.png" alt="Hardware" title="Title text">
</div>

Today:

* Running inference with **ONNX Runtime** in Optimum
* Dynamic quantization as a demo

## Question answering as a case study

<div style="text-align: center;">
    <img src="images/marie-curie.png" alt="Marie Curie question answering example" title="Title text" width=800>
</div>

* Low latencies critical for user experience!

In [42]:
import IPython

IPython.display.IFrame("https://hf.space/gradioiframe/abidlabs/question-answering-simple/+", width=1200, height=800)

In [45]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline

model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

question_answerer = pipeline("question-answering", model=model, tokenizer=tokenizer)

context = """Marie Sklodowska was born in Warsaw, Poland, to a family of teachers who believed strongly in education. She moved to Paris to continue her studies and there met Pierre Curie, who became both her husband and colleague in the field of radioactivity. The couple later shared the 1903 Nobel Prize in Physics. Marie was widowed in 1906, but continued the couple's work and went on to become the first person ever to be awarded two Nobel Prizes. During World War I, Curie organized mobile X-ray teams. The Curies' daughter, Irene, was also jointly awarded the Nobel Prize in Chemistry alongside her husband, Frederic Joliot."""
question = "when did marie curie win her first nobel prize?"
pred = question_answerer(question, context)
pred

{'score': 0.7219798564910889, 'start': 277, 'end': 281, 'answer': '1903'}

* Model looks good, deploy to prod?

<div style="text-align: center;">
    <img src="images/prod.jpeg" alt="Production" title="Title text">
</div>

Deployment involves a trade-off among several constraints:

* Model performance (accuracy, F1 score, etc.)
* Latency
* Memory

In [48]:
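from time import perf_counter
import numpy as np

def measure_latency(pipe):
    """Return the mean and standard deviation of the pipeline latency in milliseconds."""
    latencies = []
    # Warm-up runs (not timed) so one-off initialization costs don't skew the results
    for _ in range(10):
        _ = pipe(question=question, context=context)
    # Timed runs
    for _ in range(100):
        start_time = perf_counter()
        _ = pipe(question=question, context=context)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics, converting seconds to milliseconds
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    # The backslash is escaped so the string prints exactly as "+\-"
    return f"average latency (ms) - {time_avg_ms:.2f} +\\- {time_std_ms:.2f}"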

In [49]:
print(f"Vanilla model {measure_latency(question_answerer)}")

Vanilla model average latency (ms) - 97.11 +\- 0.17


## Making models faster with quantization

Basic idea:

* Represent weights and activations with **low-precision data types** like 8-bit integer instead of 32-bit floating point.
* Less memory storage & faster matmuls!

In practice:

* Map range $[f_\mathrm{min}, f_\mathrm{max}]$ of FP values to smaller range $[q_\mathrm{min}, q_\mathrm{max}]$:

$$ f = \left(\frac{f_\mathrm{max} - f_\mathrm{min}}{q_\mathrm{max} - q_\mathrm{min}} \right)(q - Z) = S(q-Z) $$

* $S$ is the _scale factor_ and $Z$ is the _zero point_ (the quantized value $q$ that represents $f=0$)

<div style="text-align: center;">
    <img src="images/fp32-to-int8.png" alt="Quantization" title="Title text">
</div>

_Figure courtesy of Manas Sahni_
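The mapping $f = S(q - Z)$ can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names (`quantization_params`, `quantize`, `dequantize`) are hypothetical, the target range is signed INT8 ($[-128, 127]$), and real runtimes such as ONNX Runtime apply their own, more refined schemes.

```python
import numpy as np

Q_MIN, Q_MAX = -128, 127  # assumed target range: signed 8-bit integers

def quantization_params(f_min, f_max):
    """Compute scale S and zero point Z so that f = S * (q - Z)."""
    f_min, f_max = float(f_min), float(f_max)
    scale = (f_max - f_min) / (Q_MAX - Q_MIN)
    # Solve f_min = S * (Q_MIN - Z) for Z and round to the nearest integer
    zero_point = int(np.round(Q_MIN - f_min / scale))
    return scale, zero_point

def quantize(f, scale, zero_point):
    """Map floating-point values to INT8: q = round(f / S) + Z, clipped to the range."""
    q = np.round(np.asarray(f, dtype=np.float64) / scale) + zero_point
    return np.clip(q, Q_MIN, Q_MAX).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Recover approximate floating-point values: f = S * (q - Z)."""
    return scale * (q.astype(np.float64) - zero_point)

weights = np.array([-0.8, -0.3, 0.0, 0.45, 1.2])
scale, zero_point = quantization_params(weights.min(), weights.max())
q = quantize(weights, scale, zero_point)
recovered = dequantize(q, scale, zero_point)
print("scale:", scale, "zero point:", zero_point)
print("max round-trip error:", np.max(np.abs(weights - recovered)))
```

The round-trip error is bounded by roughly one quantization step $S$, which is why a narrow floating-point range quantizes more accurately than a wide one.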

Three main ways to quantize:

* **Dynamic quantization:** quantize weights ahead of time and activations on-the-fly. Simplest to start with.
* **Static quantization:** precompute the quantization scheme by observing activation patterns on a sample of data. Generally gives better latency, but is more complex to calibrate.
* **Quantization aware training:** simulate quantization during training with "fake" quantization of FP32 values.
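To make the dynamic case more concrete, here is a minimal NumPy sketch of a linear layer whose weights are quantized once, while the activation scale is computed at call time. It assumes a symmetric scheme (zero point fixed at 0) for brevity, and the helper names are hypothetical; it is not how ONNX Runtime or PyTorch implement their kernels.

```python
import numpy as np

def symmetric_quantize(x):
    """Quantize x to INT8 with a symmetric range, i.e. zero point = 0 (an assumption)."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dynamic_quantized_linear(x, w_q, w_scale):
    """Compute x @ W.T with INT8 operands; the activation scale is found on-the-fly."""
    x_q, x_scale = symmetric_quantize(x)                   # per-call activation quantization
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T    # integer matmul, INT32 accumulator
    return acc.astype(np.float32) * (x_scale * w_scale)    # rescale back to floating point

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)   # toy weight matrix
x = rng.normal(size=(2, 16)).astype(np.float32)   # toy batch of activations

w_q, w_scale = symmetric_quantize(w)   # weights are quantized once, ahead of time
y_fp32 = x @ w.T
y_int8 = dynamic_quantized_linear(x, w_q, w_scale)
print("max abs difference:", np.max(np.abs(y_fp32 - y_int8)))
```

Static quantization would replace the per-call `symmetric_quantize(x)` with a scale fixed in advance from calibration data, which removes that runtime cost.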

## Optimizing inference with ONNX and ONNX Runtime

### Step 1: Install Optimum

In [None]:
# pip install "optimum[onnxruntime]==1.2.0"

### Step 2: Export the model to ONNX

In [56]:
from pathlib import Path
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

model_id = "deepset/roberta-base-squad2"
onnx_path = Path("onnx")
task = "question-answering"

# Load PyTorch weights and convert to ONNX
# (`from_transformers=True` matches the pinned Optimum 1.2.0; newer releases use `export=True`)
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save ONNX checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

('onnx/tokenizer_config.json',
 'onnx/special_tokens_map.json',
 'onnx/vocab.json',
 'onnx/merges.txt',
 'onnx/added_tokens.json',
 'onnx/tokenizer.json')

In [53]:
!ls "onnx"

config.json  model-optimized.onnx     tokenizer_config.json
merges.txt   model-quantized.onnx     tokenizer.json
model.onnx   special_tokens_map.json  vocab.json


### Step 3: Quantize the model with ONNX Runtime

In [57]:
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# create ORTQuantizer and define quantization configuration
quantizer = ORTQuantizer.from_pretrained(model_id, feature=task)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)

# apply the quantization configuration to the model
quantized_path = quantizer.export(
    onnx_model_path=onnx_path / "model.onnx",
    onnx_quantized_model_output_path=onnx_path / "model-quantized.onnx",
    quantization_config=qconfig,
)

In [58]:
import os
# get model file size
size = os.path.getsize(onnx_path / "model.onnx")/(1024*1024)
print(f"Vanilla Onnx Model file size: {size:.2f} MB")
size = os.path.getsize(onnx_path / "model-quantized.onnx")/(1024*1024)
print(f"Quantized Onnx Model file size: {size:.2f} MB")

Vanilla Onnx Model file size: 473.34 MB
Quantized Onnx Model file size: 230.83 MB


### Step 4: Run inference with Transformers pipelines

<div style="text-align: center;">
    <img src="images/cat.png" alt="Quantization" title="Title text" width=400>
</div>

In [60]:
# load quantized model
quantized_model = ORTModelForQuestionAnswering.from_pretrained(onnx_path, file_name="model-quantized.onnx")

# test the quantized model using a transformers pipeline
quantized_optimum_qa = pipeline(task, model=quantized_model, tokenizer=tokenizer, handle_impossible_answer=True)
prediction = quantized_optimum_qa(question=question, context=context)
prediction

{'score': 0.4670557975769043, 'start': 277, 'end': 281, 'answer': '1903'}

In [61]:
print(f"Quantized model {measure_latency(quantized_optimum_qa)}")

Quantized model average latency (ms) - 41.55 +\- 7.63


Nice, dynamic quantization gave a roughly 2.3x speed-up (97.11 ms → 41.55 ms) 🤯!

## Evaluation

In [64]:
import datasets
datasets.logging.set_verbosity_error()
from datasets import load_dataset, load_metric

metric = load_metric("squad_v2")
dataset = load_dataset("squad_v2")["validation"]

print(f"length of dataset {len(dataset)}")

length of dataset 11873


In [None]:
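# Baseline pipeline on the unquantized ONNX model exported in Step 2.
# Assumed definition: `optimum_qa` is not defined earlier in this notebook.
optimum_qa = pipeline(task, model=model, tokenizer=tokenizer, handle_impossible_answer=True)

def evaluate(example):
    # Run both pipelines on the same question/context pair
    default = optimum_qa(question=example["question"], context=example["context"])
    quantized = quantized_optimum_qa(question=example["question"], context=example["context"])
    return {
      'reference': {'id': example['id'], 'answers': example['answers']},
      'default': {'id': example['id'],'prediction_text': default['answer'], 'no_answer_probability': 0.},
      'quantized': {'id': example['id'],'prediction_text': quantized['answer'], 'no_answer_probability': 0.},
      }

# Evaluate on a random subset of 2000 validation examples
result = dataset.shuffle(seed=42).select(range(2000)).map(evaluate, num_proc=4)
# Uncomment to run the evaluation on the whole validation set (takes ~1.5h)
# result = dataset.map(evaluate)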

In [66]:
default_metrics = metric.compute(predictions=result["default"], references=result["reference"])
quantized_metrics = metric.compute(predictions=result["quantized"], references=result["reference"])
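For intuition about what the `exact` and `f1` numbers measure, here is a simplified sketch of SQuAD-style scoring for a single prediction against a single reference. The helper names are hypothetical, and the real `squad_v2` metric does more (for example, it takes the best score over several reference answers and handles unanswerable questions), so treat this as an approximation rather than the metric's actual code.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, score 1.0 only when both are empty
        return float(pred_tokens == ref_tokens)
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("1903", "1903"), f1_score("in 1903", "1903"))
```

Exact match is all-or-nothing, while F1 gives partial credit when the predicted span overlaps the reference, which is why the F1 score is always at least as high as exact match.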

In [12]:
print(f"vanilla model: exact={default_metrics['exact']}% f1={default_metrics['f1']}%")
print(f"quantized model: exact={quantized_metrics['exact']}% f1={quantized_metrics['f1']}%")

vanilla model: exact=79.1% f1=81.99621377859617%
quantized model: exact=78.55% f1=81.33489049340754%
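Putting the reported numbers side by side makes the trade-off between latency, memory, and model performance explicit. The values below are copied from the outputs in this notebook; note that the latency baseline is the PyTorch pipeline, while the evaluation baseline is the unquantized ONNX model, and the variable names are only illustrative.

```python
# Numbers copied from the cell outputs reported in this notebook
vanilla_latency_ms, quantized_latency_ms = 97.11, 41.55
vanilla_size_mb, quantized_size_mb = 473.34, 230.83
vanilla_em, quantized_em = 79.1, 78.55
vanilla_f1, quantized_f1 = 81.99621377859617, 81.33489049340754

speedup = vanilla_latency_ms / quantized_latency_ms   # how many times faster
size_ratio = vanilla_size_mb / quantized_size_mb      # how many times smaller
em_drop = vanilla_em - quantized_em                   # percentage points lost
f1_drop = vanilla_f1 - quantized_f1                   # percentage points lost

print(f"Speed-up:       {speedup:.2f}x")
print(f"Size reduction: {size_ratio:.2f}x")
print(f"Exact match:    -{em_drop:.2f} points")
print(f"F1:             -{f1_drop:.2f} points")
```

On the 2000-example subset, the quantized model gives up well under one point of exact match and F1 in exchange for a model that is about half the size and more than twice as fast.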
