# Optimizing BERT for Fast Inference with NVIDIA TensorRT

This notebook uses the `transformer-deploy` library to facilitate `Nvidia TensorRT` INT8 quantization and optimization of BERT that was fine-tuned with a custom implementation in the notebook `fine-tuning.ipynb`.

Modified from the transformer-deploy end-to-end walkthrough: https://github.com/ELS-RD/transformer-deploy/blob/main/demo/quantization/quantization_end_to_end.ipynb

Transformer-deploy documentation: https://els-rd.github.io/transformer-deploy/quantization/quantization_intro/.

Transformer-deploy Github: https://github.com/ELS-RD/transformer-deploy


### Dependencies

To run this notebook, your system must have `Nvidia CUDA 11.X`, `TensorRT 8.2.1`, and `cuBLAS` installed.

To ensure that all dependencies are correctly installed and that the code runs as expected, it is suggested to use the below Docker image, which can be pulled with the below command in your terminal.

In [5]:
# docker pull ghcr.io/els-rd/transformer-deploy:0.6.0

Install packages

In [3]:
# !pip install git+https://github.com/ELS-RD/transformer-deploy.git
# !pip install numpy==1.23.5
# !pip install pandas
# !pip install datasets

Import packages

In [8]:
# standard library imports
import logging
import math
import os
import time
from collections import OrderedDict
from pathlib import Path
from typing import Dict, List, Union
from typing import OrderedDict as OD

# third party imports
import datasets
import numpy as np
import pandas as pd
import tensorrt as trt
import torch
import tqdm
import transformers
from datasets import Dataset, load_dataset, load_metric
from tensorrt.tensorrt import IExecutionContext, Logger, Runtime
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EvalPrediction,
    IntervalStrategy,
    PreTrainedModel,
    PreTrainedTokenizer,
    Trainer,
    TrainingArguments,
    default_data_collator,
)

# local library specific imports
from pytorch_quantization import nn as quant_nn
from transformer_deploy.QDQModels.calibration_utils import QATCalibrate
from transformer_deploy.backends.ort_utils import (
    cpu_quantization,
    create_model_for_provider,
    optimize_onnx,
)
from transformer_deploy.backends.pytorch_utils import convert_to_onnx
from transformer_deploy.backends.trt_utils import (
    build_engine,
    get_binding_idxs,
    infer_tensorrt,
    load_engine,
    save_engine,
)
from transformer_deploy.benchmarks.utils import print_timings, track_infer_time

Check that numpy version 1.23.5 is installed.

In [6]:
print(np.__version__)  # make sure numpy version is 1.23.5, otherwise install

# !pip install numpy==1.23.5  # run this line if not numpy version 1.23.5

1.23.5


### STEP 1: Convert Fine-Tuned Model to Hugging Face Format


In `fine-tune.py` we saved the PyTorch state dictionary of our fine-tuned custom implementations of BERT in the `models/` directory. 

We will use Hugging Face's implementation of BERT moving forward in this notebook.

To do so we need to convert the state dictionary (ex. `bert_base_epoch_1.pt`) to a serielized Hugging Face format that will include a `config.json` and `pytorch_model.bin` file .

In [9]:
# load pretrained BERT-base model
bert_base_fine_tuned = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')

# assign to variable the state dictionary of fine-tuned model from fine-tune.py that you want to quantize
state_dict = torch.load("../models/bert_base_fine_tuned/bert_base_epoch_1.pt")

# load the fine-tuned weights from the state dictionary into the base model
bert_base_fine_tuned.load_state_dict(state_dict["model_state_dict"])

# create directory and filepaths to download into serielized Hugging Face format
dir_path = Path("../models/model_hugging_face")
dir_path.mkdir(parents=True, exist_ok=True)

# save model in serielized Hugging Face format
bert_base_fine_tuned.save_pretrained(dir_path)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

### STEP 2: Data Preprocessing

Set logging to error level for readability in notebook.

In [11]:
log_level = logging.ERROR
logging.getLogger().setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
trt_logger: Logger = trt.Logger(trt.Logger.ERROR)
runtime: Runtime = trt.Runtime(trt_logger)
transformers.logging.set_verbosity_error()

Set directories and other parameters.

In [12]:
model_dir = "../models/model_hugging_face"  # directory to the serielized hugging face model we are optimizing/quantizing
model_quantized_dir = "../models/model_quantized"  # directory to save the model after quantized aware training
num_labels = 2 
batch_size = 32
max_seq_len = 512
timings: Dict[str, List[float]] = dict()
runtime: Runtime = trt.Runtime(trt_logger)
profile_index = 0

Preprocess data.

In [None]:
df = pd.read_csv("data/train-sample.csv")

# dict mapping strings to integers
string_to_int = {
    'open': 0,
    'not a real question': 1,
    'off topic': 1,
    'not constructive': 1,
    'too localized': 1
}

# add new features to dataframe
df['OpenStatusInt'] = df['OpenStatus'].map(string_to_int)  # convert class strings to integers
df['BodyLength'] = df['BodyMarkdown'].apply(lambda x: len(x.split(" ")))  # number of words in body text
df['TitleLength'] = df['Title'].apply(lambda x: len(x.split(" ")))  # number of words in title text
df['TitleConcatWithBody'] = df.apply(lambda x: x.Title +  " " + x.BodyMarkdown, axis=1)  # combine title and body text
df['NumberOfTags'] = df.apply(
    lambda x: len([x[col] for col in ['Tag1', 'Tag2', 'Tag3', 'Tag4', 'Tag5'] if not pd.isna(x[col])]), 
    axis=1,
)  # number of tags

# list of col names with tabular data 
tabular_feature_list = [
    'ReputationAtPostCreation',  
    'BodyLength', 
    'TitleLength', 
    'NumberOfTags',
]

# place the desired data from the dataframe into a dictionary
data_dict = {
    'text': df.TitleConcatWithBody.tolist(),
    'tabular': df[tabular_feature_list].values,
    'label': df.OpenStatusInt.tolist(),
}

# load data into hugging face dataset object
dataset_stackoverflow = Dataset.from_dict(data_dict)

# define the indices at which to split the dataset
n_samples = len(dataset_stackoverflow)
split_idx1 = int(n_samples * 0.8)
split_idx2 = int(n_samples * 0.9)

# shuffle the dataset
shuffled_dataset = dataset_stackoverflow.shuffle(seed=42)

# split dataset training/validation/test
train_dataset = shuffled_dataset.select(range(split_idx1))
val_dataset = shuffled_dataset.select(range(split_idx1, split_idx2))
test_dataset = shuffled_dataset.select(range(split_idx2, n_samples))

# calculate mean and std of each tabular feature
mean_train = torch.mean(torch.tensor(train_dataset['tabular'], dtype=torch.float32), dim=0)
std_train = torch.std(torch.tensor(train_dataset['tabular'], dtype=torch.float32), dim=0)

# define a function to apply standard scaling to the tabular data
def standard_scale(example):
    example['tabular'] = (torch.tensor(example['tabular']) - mean_train) / std_train
    return example

# apply the standard scaling function to the tabular features
train_dataset = train_dataset.map(standard_scale)
val_dataset = val_dataset.map(standard_scale)
test_dataset = test_dataset.map(standard_scale)

# instantiate tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# tokenize dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", max_length=512, truncation=True)

train_tokenized = train_dataset.map(tokenize_function, batched=True)
val_tokenized = val_dataset.map(tokenize_function, batched=True)
test_tokenized = test_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/112217 [00:00<?, ? examples/s]

Map:   0%|          | 0/14027 [00:00<?, ? examples/s]

Map:   0%|          | 0/14028 [00:00<?, ? examples/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/112217 [00:00<?, ? examples/s]

Map:   0%|          | 0/14027 [00:00<?, ? examples/s]

Map:   0%|          | 0/14028 [00:00<?, ? examples/s]

Helper functions.

In [None]:
def compute_metrics(eval_pred):
    """ Function to compute evaluation metrics for Hugging Face's Trainer. """
    
    predictions, labels = eval_pred
    
    return metric.compute(predictions=predictions, references=labels)

def preprocess_logits_for_metrics(logits, labels):
    """ Function to drop unnecessary tensors from the logits that can consume memory. """
    
    if isinstance(logits, tuple):
        logits = logits[0]
    
    return logits.argmax(dim=-1)

def get_trainer(model):
    """ Function that instantiates and returns a Hugging Face Trainer.  """
    
    trainer = Trainer(
        model,
        args,
        train_dataset=train_tokenized,
        eval_dataset=test_tokenized,
        tokenizer=tokenizer,
        preprocess_logits_for_metrics=preprocess_logits_for_metrics,
        compute_metrics=compute_metrics,
    )
    transformers.logging.set_verbosity_error()
    
    return trainer


def convert_tensor(data: OD[str, List[List[int]]], output: str) -> OD[str, Union[np.ndarray, torch.Tensor]]:
    """ Function to convert list inputs into either tensor or numpy array format. """
    
    input: OD[str, Union[np.ndarray, torch.Tensor]] = OrderedDict()
    for k in ["input_ids", "attention_mask", "token_type_ids"]:
        if k in data:
            v = data[k]
            if output == "torch":
                value = torch.tensor(v, dtype=torch.long, device="cuda")
            elif output == "np":
                value = np.asarray(v, dtype=np.int32)
            else:
                raise Exception(f"unknown output type: {output}")
            input[k] = value
    return input


def measure_accuracy(infer, tensor_type: str) -> float:
    """ Function to compute the accuracy from infer_tensorrt output. """
    
    outputs = list()
    for start_index in range(0, len(test_tokenized), batch_size):
        end_index = start_index + batch_size
        data = test_tokenized[start_index:end_index]
        inputs: OD[str, np.ndarray] = convert_tensor(data=data, output=tensor_type)
        output = infer(inputs)["output1"]
        if tensor_type == "torch":
            output = output.detach().cpu().numpy()
        output = np.argmax(output, axis=1).astype(int).tolist()
        outputs.extend(output)
    return np.mean(np.array(outputs) == np.array(test_tokenized['label']))

Set the parameters for training and evaluation with the Hugging Face Trainer.

In [None]:
# load the accuracy metric from the Hugging Face `datasets` library
metric = load_metric("accuracy")

# training arguments for Hugging Face Trainer
nb_step = 1000
strategy = IntervalStrategy.STEPS

args = TrainingArguments(
    output_dir=model_quantized_dir,
    evaluation_strategy=strategy,
    eval_steps=nb_step,
    logging_steps=nb_step,
    save_steps=nb_step,
    save_strategy=strategy,
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size * 2,
    num_train_epochs=1,
    fp16=True,
    group_by_length=True,
    weight_decay=0.01,
    save_total_limit=1,
    load_best_model_at_end=True,
    report_to=[],    
)

  metric = load_metric("accuracy")


Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

### STEP 3: Add Quantization Support to the Model

The main idea is to automate the addition of QDQ nodes into the source code of the model. QDQ nodes, which store information required to map between high and low precision numbers, will be positioned both before and after the operations set for quantization.

More information here: https://els-rd.github.io/transformer-deploy/quantization/quantization_ast/

________________________________________________________________________________
Loop through different percentiles, which are used in the precision mapping, to find the one that results in the highest accuracy. 

In [18]:
config = AutoConfig.from_pretrained(model_dir)
for idx, percentile in enumerate([99.9, 99.99, 99.999, 99.9999]):
    print(f"percentile {idx+1} of 4")
    with QATCalibrate(method="histogram", percentile=percentile) as qat:
        model_q: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(
            model_dir, config=config
        )
        model_q = model_q.cuda()
        qat.setup_model_qat(model_q)  # prepare quantizer to any model

        with torch.no_grad():
            for start_index in tqdm.tqdm(range(0, 128, batch_size)):
                end_index = start_index + batch_size
                data = train_tokenized[start_index:end_index]
                input_torch = {
                    k: torch.tensor(v, dtype=torch.long, device="cuda")
                    for k, v in data.items()
                    if k in ["input_ids", "attention_mask", "token_type_ids"]
                }
                model_q(**input_torch)
    trainer = get_trainer(model_q)
    print(f"percentile: {percentile}")
    print(trainer.evaluate())

percentile 1 of 4


100%|██████████| 4/4 [10:48<00:00, 162.21s/it]


percentile: 99.9
{'eval_loss': 0.5294165015220642, 'eval_accuracy': 0.7392358140861135, 'eval_runtime': 142.6743, 'eval_samples_per_second': 98.322, 'eval_steps_per_second': 1.542}
{'eval_loss': 0.5294165015220642, 'eval_accuracy': 0.7392358140861135, 'eval_runtime': 142.6743, 'eval_samples_per_second': 98.322, 'eval_steps_per_second': 1.542}
percentile 2 of 4


100%|██████████| 4/4 [10:46<00:00, 161.62s/it]


percentile: 99.99
{'eval_loss': 0.5322121977806091, 'eval_accuracy': 0.7344596521243227, 'eval_runtime': 142.9213, 'eval_samples_per_second': 98.152, 'eval_steps_per_second': 1.539}
{'eval_loss': 0.5322121977806091, 'eval_accuracy': 0.7344596521243227, 'eval_runtime': 142.9213, 'eval_samples_per_second': 98.152, 'eval_steps_per_second': 1.539}
percentile 3 of 4


100%|██████████| 4/4 [10:43<00:00, 160.82s/it]


percentile: 99.999
{'eval_loss': 0.5312505960464478, 'eval_accuracy': 0.7581978899344168, 'eval_runtime': 143.0553, 'eval_samples_per_second': 98.06, 'eval_steps_per_second': 1.538}
{'eval_loss': 0.5312505960464478, 'eval_accuracy': 0.7581978899344168, 'eval_runtime': 143.0553, 'eval_samples_per_second': 98.06, 'eval_steps_per_second': 1.538}
percentile 4 of 4


100%|██████████| 4/4 [10:37<00:00, 159.39s/it]


percentile: 99.9999
{'eval_loss': 0.537022590637207, 'eval_accuracy': 0.763116623895067, 'eval_runtime': 142.6047, 'eval_samples_per_second': 98.37, 'eval_steps_per_second': 1.543}
{'eval_loss': 0.537022590637207, 'eval_accuracy': 0.763116623895067, 'eval_runtime': 142.6047, 'eval_samples_per_second': 98.37, 'eval_steps_per_second': 1.543}


Add quantization support using the percentile that resulted in the best results from above.

That would be `99.9999` for this model.

In [None]:
best_percentile = 99.9999  # choose the percentile that had the best evaluation accuracy from the grid search in the cell above

config = AutoConfig.from_pretrained(model_dir)
with QATCalibrate(method="histogram", percentile=best_percentile) as qat:
    model_q: PreTrainedModel = AutoModelForSequenceClassification.from_pretrained(
        model_dir, config=config, ignore_mismatched_sizes=True
    )
    
    model_q = model_q.cuda()
    qat.setup_model_qat(model_q)  # prepare quantizer to any model

    with torch.no_grad():
        for start_index in tqdm.tqdm(range(0, 128, batch_size)):
            end_index = start_index + batch_size
            data = train_tokenized[start_index:end_index]
            input_torch = {
                k: torch.tensor(v, dtype=torch.long, device="cuda")
                for k, v in data.items()
                if k in ["input_ids", "attention_mask", "token_type_ids"]
            }
            model_q(**input_torch)
trainer = get_trainer(model_q)
print(trainer.evaluate())

100%|██████████| 4/4 [11:10<00:00, 167.59s/it]


{'eval_loss': 0.5371240377426147, 'eval_accuracy': 0.7628314798973481, 'eval_runtime': 143.0255, 'eval_samples_per_second': 98.08, 'eval_steps_per_second': 1.538}
{'eval_loss': 0.5371240377426147, 'eval_accuracy': 0.7628314798973481, 'eval_runtime': 143.0255, 'eval_samples_per_second': 98.08, 'eval_steps_per_second': 1.538}


### STEP 4: Quantization Analysis


**Per layer quantization analysis.**

Enable quantization of one layer at a time to detect if the quantization of a specific layer has a larger cost on accuracy than other layers.

Layer 1 and 10 seem to be the most sensitive in the example below making them possible candidates to disable.


In [17]:
for i in range(12):
    layer_name = f"layer.{i}"
    print(layer_name)
    for name, module in model_q.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if layer_name in name:
                module.enable_quant()
            else:
                module.disable_quant()
    print(trainer.evaluate())
    print("----")

layer.0
{'eval_loss': 0.4587136507034302, 'eval_accuracy': 0.7879241516966068, 'eval_runtime': 60.0344, 'eval_samples_per_second': 233.666, 'eval_steps_per_second': 3.665}
{'eval_loss': 0.4587136507034302, 'eval_accuracy': 0.7879241516966068, 'eval_runtime': 60.0344, 'eval_samples_per_second': 233.666, 'eval_steps_per_second': 3.665}
----
layer.1
{'eval_loss': 0.45947614312171936, 'eval_accuracy': 0.7845024237239806, 'eval_runtime': 75.2785, 'eval_samples_per_second': 186.348, 'eval_steps_per_second': 2.922}
{'eval_loss': 0.45947614312171936, 'eval_accuracy': 0.7845024237239806, 'eval_runtime': 75.2785, 'eval_samples_per_second': 186.348, 'eval_steps_per_second': 2.922}
----
layer.2
{'eval_loss': 0.4602658152580261, 'eval_accuracy': 0.785642999714856, 'eval_runtime': 60.3391, 'eval_samples_per_second': 232.486, 'eval_steps_per_second': 3.646}
{'eval_loss': 0.4602658152580261, 'eval_accuracy': 0.785642999714856, 'eval_runtime': 60.3391, 'eval_samples_per_second': 232.486, 'eval_steps_pe

**Operator quantization analysis**

Enable quantization of one operator type at a time to detect if a specific operator has a larger cost on accuracy.

The LayerNorm operation seems to be one of the most sensitive in the example below making it apossible candidate to disable.


In [18]:
for op in ["query", "key", "value",
           "matmul", "matmul_quantizer_0", "matmul_quantizer_1", "matmul_quantizer_2", "matmul_quantizer_3", 
           "dense", "dense._input", "dense._weight", 
           "layernorm", "pooler"]:
    
    for name, module in model_q.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if op in name:
                module.enable_quant()
            else:
                module.disable_quant()
    print(op)
    print(trainer.evaluate())
    print("----")

query
{'eval_loss': 0.4594363272190094, 'eval_accuracy': 0.7865697177074422, 'eval_runtime': 56.8388, 'eval_samples_per_second': 246.803, 'eval_steps_per_second': 3.871}
{'eval_loss': 0.4594363272190094, 'eval_accuracy': 0.7865697177074422, 'eval_runtime': 56.8388, 'eval_samples_per_second': 246.803, 'eval_steps_per_second': 3.871}
----
key
{'eval_loss': 0.4590598940849304, 'eval_accuracy': 0.7869261477045908, 'eval_runtime': 56.6929, 'eval_samples_per_second': 247.438, 'eval_steps_per_second': 3.881}
{'eval_loss': 0.4590598940849304, 'eval_accuracy': 0.7869261477045908, 'eval_runtime': 56.6929, 'eval_samples_per_second': 247.438, 'eval_steps_per_second': 3.881}
----
value
{'eval_loss': 0.45914191007614136, 'eval_accuracy': 0.7858568577131452, 'eval_runtime': 56.8126, 'eval_samples_per_second': 246.917, 'eval_steps_per_second': 3.872}
{'eval_loss': 0.45914191007614136, 'eval_accuracy': 0.7858568577131452, 'eval_runtime': 56.8126, 'eval_samples_per_second': 246.917, 'eval_steps_per_seco

The goal is to disable quantization for as few operations as possible while preserving accuracy as much as possible. 

Through some trial and error of different combinations of layers I have settled on disabling LayerNorm on Layer 2 and 10.

In [18]:
for name, module in model_q.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer): 
        if any([l in name for l in ["layer.2", "layer.10"]]) and "layernorm" in name and "attention" not in name: 
            print(f"disable {name}")
            module.disable_quant() 
        
        else:
            module.enable_quant()

trainer.evaluate()

disable bert.encoder.layer.2.output.layernorm_quantizer_0
disable bert.encoder.layer.2.output.layernorm_quantizer_1
disable bert.encoder.layer.10.output.layernorm_quantizer_0
disable bert.encoder.layer.10.output.layernorm_quantizer_1
{'eval_loss': 0.534470796585083, 'eval_accuracy': 0.7723838038209295, 'eval_runtime': 142.3247, 'eval_samples_per_second': 98.563, 'eval_steps_per_second': 1.546}


{'eval_loss': 0.534470796585083,
 'eval_accuracy': 0.7723838038209295,
 'eval_runtime': 142.3247,
 'eval_samples_per_second': 98.563,
 'eval_steps_per_second': 1.546}

### STEP 5: Quantization aware training

After having quantized the model, fine-tune again with a lower learning rate to recover some of the lost accuracy.

In [19]:
args.learning_rate = 1e-7
args.num_train_epochs = 1
trainer = get_trainer(model_q)
trainer.eval_dataset=val_tokenized
trainer.train()

print("evaluation with test dataset")
trainer.eval_dataset = test_tokenized
print(trainer.evaluate())

model_q.save_pretrained("../models/model-qat")



{'loss': 0.4811, 'learning_rate': 7.14856002281152e-08, 'epoch': 0.29}
{'eval_loss': 0.4795648753643036, 'eval_accuracy': 0.7801383046980823, 'eval_runtime': 141.8049, 'eval_samples_per_second': 98.918, 'eval_steps_per_second': 1.551, 'epoch': 0.29}
{'loss': 0.4497, 'learning_rate': 4.2971200456230394e-08, 'epoch': 0.57}
{'eval_loss': 0.47320234775543213, 'eval_accuracy': 0.7775718257645968, 'eval_runtime': 142.2774, 'eval_samples_per_second': 98.589, 'eval_steps_per_second': 1.546, 'epoch': 0.57}
{'loss': 0.4397, 'learning_rate': 1.4513829483889364e-08, 'epoch': 0.86}
{'eval_loss': 0.4719729423522949, 'eval_accuracy': 0.7802095957795679, 'eval_runtime': 141.266, 'eval_samples_per_second': 99.295, 'eval_steps_per_second': 1.557, 'epoch': 0.86}
{'train_runtime': 2844.0087, 'train_samples_per_second': 39.457, 'train_steps_per_second': 1.233, 'train_loss': 0.45373689179954824, 'epoch': 1.0}
evaluation with test dataset
{'eval_loss': 0.4707139730453491, 'eval_accuracy': 0.7808668377530653,

### STEP 6: Export the QDQ Pytorch model to ONNX

In [20]:
data = train_tokenized[1:3]
input_torch = convert_tensor(data, output="torch")
convert_to_onnx(
    model_pytorch=model_q,
    output_path="../models/model_qat.onnx",
    inputs_pytorch=input_torch,
    quantization=True,
    var_output_seq=False,
    output_names = ["output1"] 
)

  if amax.numel() == 1:
  inputs, amax.item() / bound, 0,
  quant_dim = list(amax.shape).index(list(amax_sequeeze.shape)[0])


verbose: False, log level: Level.ERROR



In [22]:
del model_q
QATCalibrate.restore()

### STEP 7: TensorRT INT8 Quantization Benchmark

Convert ONNX graph to TensorRT engine.

In [23]:
engine = build_engine(
    runtime=runtime,
    onnx_file_path="../models/model_qat.onnx",
    logger=trt_logger,
    min_shape=(1, max_seq_len),
    optimal_shape=(batch_size, max_seq_len),
    max_shape=(batch_size, max_seq_len),
    workspace_size=10000 * 1024 * 1024,
    fp16=True,
    int8=True,
)

[07/25/2023-16:20:01] [TRT] [E] 3: [builderConfig.cpp::validatePool::313] Error Code 3: API Usage Error (Parameter check failed at: optimizer/api/builderConfig.cpp::validatePool::313, condition: false. Setting DLA memory pool size on TensorRT build with DLA disabled.
)


Prepare input and output buffer.

In [24]:
context: IExecutionContext = engine.create_execution_context()
context.set_optimization_profile_async(
    profile_index=profile_index, stream_handle=torch.cuda.current_stream().cuda_stream
)
input_binding_idxs, output_binding_idxs = get_binding_idxs(engine, profile_index)  # type: List[int], List[int]


data = train_tokenized[0:batch_size]
input_torch: OD[str, torch.Tensor] = convert_tensor(data=data, output="torch")
input_np: OD[str, np.ndarray] = convert_tensor(data=data, output="np")

Check that inference is working correctly.

In [25]:
tensorrt_output = infer_tensorrt(
    context=context,
    inputs=input_torch,
    input_binding_idxs=input_binding_idxs,
    output_binding_idxs=output_binding_idxs,
)

print(tensorrt_output)

{'output1': tensor([[ 8.8974e-01, -3.6593e-01],
        [-1.5406e+00,  1.1372e+00],
        [ 7.6251e-01, -1.3573e-01],
        [-1.0282e+00,  9.7014e-01],
        [ 4.2417e-01, -1.8649e-03],
        [-2.6301e-01,  3.6717e-01],
        [ 2.8515e-01,  1.1626e-01],
        [ 7.6680e-01, -1.8062e-01],
        [-8.3431e-01,  7.3697e-01],
        [-1.3840e+00,  8.7692e-01],
        [ 1.8685e+00, -9.7738e-01],
        [ 5.9061e-01, -3.5877e-02],
        [ 1.6593e+00, -7.9597e-01],
        [ 1.7117e+00, -8.3231e-01],
        [-1.3771e-01,  3.2420e-01],
        [-8.6504e-01,  7.9855e-01],
        [ 1.4455e+00, -5.3692e-01],
        [-9.1140e-01,  8.2719e-01],
        [-1.3829e+00,  7.9771e-01],
        [ 1.7107e+00, -7.6251e-01],
        [ 9.0488e-01, -1.5545e-01],
        [ 1.5613e+00, -7.3928e-01],
        [ 8.3291e-01, -2.2420e-01],
        [-3.3598e-01,  5.0985e-01],
        [ 5.8289e-01,  4.4166e-03],
        [-1.1288e+00,  9.4518e-01],
        [ 9.4472e-01, -4.0603e-01],
        [-1.3020

Measure accuracy on test dataset.

In [26]:
infer_trt = lambda inputs: infer_tensorrt(
    context=context,
    inputs=inputs,
    input_binding_idxs=input_binding_idxs,
    output_binding_idxs=output_binding_idxs,
)

measure_accuracy(infer=infer_trt, tensor_type="torch")

0.7811519817507842

Save the engine and reload it into a new variable.

In [27]:
# save the engine
save_engine(engine, "../models/stack_int8_fp16_model.trt")

# del the engine and context from memory
del engine, context

# load the saved engine - this includes the function for inference that handles the context and binding indices
trt_engine_int8_fp16 = load_engine(runtime=runtime, engine_file_path="../models/stack_int8_fp16_model.trt")

Measure accuracy and speed with a batch size of 32.

In [28]:
correct = 0
total = 0

batch_size = 32
n_samples = test_tokenized.num_rows
n_batches = n_samples // batch_size

start_time = time.time()


for i in range(n_batches):

    data = test_tokenized[0 + i*batch_size: 0 + i*batch_size + batch_size]
    input_torch: OD[str, torch.Tensor] = convert_tensor(data=data, output="torch")

    tensorrt_output = trt_engine_int8_fp16(
        inputs=input_torch,
    )
    
    preds=tensorrt_output['output1'].argmax(-1)
    labels = torch.tensor(data['label']).cuda()
    correct+=torch.sum(preds==labels).item()
    total+=labels.shape[0]

    
end_time = time.time()

execution_time = end_time - start_time

print(f"Samples: {n_samples}") 
print(f"Accuracy: {correct/total}")
print(f"Execution time: {round(execution_time, 1)}")
print(f"Samples per second: {math.floor(n_samples / execution_time)}")

Samples: 14028
Accuracy: 0.7816067351598174
Execution time: 25.8
Samples per second: 543


Measure accuracy and speed with a batch size of 1.

In [29]:
correct = 0
total = 0

batch_size = 1
n_samples = test_tokenized.num_rows
n_batches = n_samples // batch_size

start_time = time.time()


for i in range(n_batches):

    data = test_tokenized[0 + i*batch_size: 0 + i*batch_size + batch_size]
    input_torch: OD[str, torch.Tensor] = convert_tensor(data=data, output="torch")

    tensorrt_output = trt_engine_int8_fp16(
        inputs=input_torch,
    )
    
    preds=tensorrt_output['output1'].argmax(-1)
    labels = torch.tensor(data['label']).cuda()
    correct+=torch.sum(preds==labels).item()
    total+=labels.shape[0]

    
end_time = time.time()

execution_time = end_time - start_time

print(f"Samples: {n_samples}") 
print(f"Accuracy: {correct/total}")
print(f"Execution time: {round(execution_time, 1)}")
print(f"Samples per second: {math.floor(n_samples / execution_time)}")

Samples: 14028
Accuracy: 0.7813658397490733
Execution time: 39.0
Samples per second: 359


### STEP 8: Pytorch GPU Benchmarks

#### Floating Point 32
Load the pytorch FP32 version of the model.

In [17]:
pytorch_model_fp32 = AutoModelForSequenceClassification.from_pretrained(
    model_dir, num_labels=num_labels)
pytorch_model_fp32 = pytorch_model_fp32.cuda()
pytorch_model_fp32 = pytorch_model_fp32.eval()

Measure accuracy and speed with a batch size of 32.

In [27]:
correct = 0
total = 0

batch_size = 32
n_samples = test_tokenized.num_rows
n_batches = n_samples // batch_size

start_time = time.time()


for i in range(n_batches):

    data = test_tokenized[0 + i*batch_size: 0 + i*batch_size + batch_size]
    input_torch: OD[str, torch.Tensor] = convert_tensor(data=data, output="torch")

    with torch.inference_mode():
        pytorch_output = pytorch_model_fp32(**input_torch)
    
        preds=pytorch_output.logits.argmax(-1)
        labels = torch.tensor(data['label']).cuda()
        correct+=torch.sum(preds==labels).item()
        total+=labels.shape[0]
        
        
end_time = time.time()

execution_time = end_time - start_time

print(f"Samples: {n_samples}") 
print(f"Accuracy: {correct/total}")
print(f"Execution time: {round(execution_time, 1)}")
print(f"Samples per second: {math.floor(n_samples / execution_time)}")

Samples: 14028
Accuracy: 0.7857448630136986
Execution time: 111.5
Samples per second: 125


Measure accuracy and speed with a batch size of 1.

In [18]:
correct = 0
total = 0

batch_size = 1
n_samples = test_tokenized.num_rows
n_batches = n_samples // batch_size

start_time = time.time()


for i in range(n_batches):

    data = test_tokenized[0 + i*batch_size: 0 + i*batch_size + batch_size]
    input_torch: OD[str, torch.Tensor] = convert_tensor(data=data, output="torch")
    with torch.inference_mode():
        pytorch_output = pytorch_model_fp32(**input_torch)
    
        preds=pytorch_output.logits.argmax(-1)
        labels = torch.tensor(data['label']).cuda()
        correct+=torch.sum(preds==labels).item()
        total+=labels.shape[0]
        
        
end_time = time.time()

execution_time = end_time - start_time

print(f"Samples: {n_samples}") 
print(f"Accuracy: {correct/total}")
print(f"Execution time: {round(execution_time, 1)}")
print(f"Samples per second: {math.floor(n_samples / execution_time)}")

Samples: 14028
Accuracy: 0.7857142857142857
Execution time: 150.2
Samples per second: 93


#### Floating Point 16
Load the pytorch FP16 version of the model.

In [19]:
pytorch_model_fp16 = AutoModelForSequenceClassification.from_pretrained(
    model_dir, num_labels=num_labels)
pytorch_model_fp16 = pytorch_model_fp16.half()
pytorch_model_fp16 = pytorch_model_fp16.cuda()
pytorch_model_fp16 = pytorch_model_fp16.eval()

Measure accuracy and speed with a batch size of 32.

In [29]:
correct = 0
total = 0

batch_size = 32
n_samples = test_tokenized.num_rows
n_batches = n_samples // batch_size

start_time = time.time()


for i in range(n_batches):

    data = test_tokenized[0 + i*batch_size: 0 + i*batch_size + batch_size]
    input_torch: OD[str, torch.Tensor] = convert_tensor(data=data, output="torch")

    with torch.inference_mode():
        pytorch_output = pytorch_model_fp16(**input_torch)
    
        preds=pytorch_output.logits.argmax(-1)
        labels = torch.tensor(data['label']).cuda()
        correct+=torch.sum(preds==labels).item()
        total+=labels.shape[0]

        
end_time = time.time()

execution_time = end_time - start_time

print(f"Samples: {n_samples}") 
print(f"Accuracy: {correct/total}")
print(f"Execution time: {round(execution_time, 1)}")
print(f"Samples per second: {math.floor(n_samples / execution_time)}")

Samples: 14028
Accuracy: 0.7858162100456622
Execution time: 37.8
Samples per second: 370


Measure accuracy and speed with a batch size of 1.

In [20]:
correct = 0
total = 0

batch_size = 1
n_samples = test_tokenized.num_rows
n_batches = n_samples // batch_size

start_time = time.time()


for i in range(n_batches):

    data = test_tokenized[0 + i*batch_size: 0 + i*batch_size + batch_size]
    input_torch: OD[str, torch.Tensor] = convert_tensor(data=data, output="torch")

    with torch.inference_mode():
        pytorch_output = pytorch_model_fp16(**input_torch)
    
        preds=pytorch_output.logits.argmax(-1)
        labels = torch.tensor(data['label']).cuda()
        correct+=torch.sum(preds==labels).item()
        total+=labels.shape[0]

        
end_time = time.time()

execution_time = end_time - start_time

print(f"Samples: {n_samples}") 
print(f"Accuracy: {correct/total}")
print(f"Execution time: {round(execution_time, 1)}")
print(f"Samples per second: {math.floor(n_samples / execution_time)}")

Samples: 14028
Accuracy: 0.7857855717137154
Execution time: 86.0
Samples per second: 163


### STEP 9: TensorRT FP16 Benchmark

Check the performance on mixed precision (FP16, no quantization).

In [8]:
data = train_tokenized[0:batch_size]
input_torch = convert_tensor(data, output="torch")

config = AutoConfig.from_pretrained(model_dir)
baseline_model = AutoModelForSequenceClassification.from_pretrained(
        model_dir, config=config
    )

baseline_model = baseline_model.cuda()

convert_to_onnx(
    baseline_model, output_path="../models/baseline.onnx", inputs_pytorch=input_torch, quantization=False, var_output_seq=False,
    output_names = ["output1"]
)

del baseline_model

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
verbose: False, log level: Level.ERROR



In [9]:
max_seq_len=512
batch_size=32
engine = build_engine(
    runtime=runtime,
    onnx_file_path="../models/baseline.onnx",
    logger=trt_logger,
    min_shape=(1, max_seq_len),
    optimal_shape=(batch_size, max_seq_len),
    max_shape=(batch_size, max_seq_len),
    workspace_size=10000 * 1024 * 1024,
    fp16=True,
    int8=False,
)

[07/22/2023-15:42:51] [TRT] [E] 3: [builderConfig.cpp::validatePool::313] Error Code 3: API Usage Error (Parameter check failed at: optimizer/api/builderConfig.cpp::validatePool::313, condition: false. Setting DLA memory pool size on TensorRT build with DLA disabled.
)


In [None]:
# save the engine
save_engine(engine, "../models/stack_fp16_model.trt")

# del the engine from memory
del engine

# load the saved engine - this includes the function for inference that handles the context and binding indices
trt_engine_fp16 = load_engine(runtime=runtime, engine_file_path="../models/stack_fp16_model.trt")


Measure accuracy and speed with a batch size of 32.

In [23]:
correct = 0
total = 0

batch_size = 32
n_samples = test_tokenized.num_rows
n_batches = n_samples // batch_size

start_time = time.time()


for i in range(n_batches):

    data = test_tokenized[0 + i*batch_size: 0 + i*batch_size + batch_size]
    input_torch: OD[str, torch.Tensor] = convert_tensor(data=data, output="torch")

    tensorrt_output = trt_engine_fp16(
        inputs=input_torch,
    )
    
    preds=tensorrt_output['output1'].argmax(-1)
    labels = torch.tensor(data['label']).cuda()
    correct+=torch.sum(preds==labels).item()
    total+=labels.shape[0]

    
end_time = time.time()

execution_time = end_time - start_time

print(f"Samples: {n_samples}") 
print(f"Accuracy: {correct/total}")
print(f"Execution time: {round(execution_time, 1)}")
print(f"Samples per second: {math.floor(n_samples / execution_time)}")

Samples: 14028
Accuracy: 0.7857448630136986
Execution time: 27.4
Samples per second: 511


Measure accuracy and speed with a batch size of 1.

In [22]:
correct = 0
total = 0

batch_size = 1
n_samples = test_tokenized.num_rows
n_batches = n_samples // batch_size

start_time = time.time()


for i in range(n_batches):

    data = test_tokenized[0 + i*batch_size: 0 + i*batch_size + batch_size]
    input_torch: OD[str, torch.Tensor] = convert_tensor(data=data, output="torch")

    tensorrt_output = trt_engine_fp16(
        inputs=input_torch,
    )
    
    preds=tensorrt_output['output1'].argmax(-1)
    labels = torch.tensor(data['label']).cuda()
    correct+=torch.sum(preds==labels).item()
    total+=labels.shape[0]

    
end_time = time.time()

execution_time = end_time - start_time

print(f"Samples: {n_samples}") 
print(f"Accuracy: {correct/total}")
print(f"Execution time: {round(execution_time, 1)}")
print(f"Samples per second: {math.floor(n_samples / execution_time)}")

Samples: 14028
Accuracy: 0.785642999714856
Execution time: 41.2
Samples per second: 340


### STEP 10: FP16 ONNX Runtime Benchmark

Convert to model to ONNX Runtime using all available cores and enabling any possible optimizations.

In [15]:
optimize_onnx(
    onnx_path="../models/baseline.onnx",
    onnx_optim_model_path="../models/baseline-optimized.onnx",
    fp16=True,
    use_cuda=True,
    num_attention_heads=12,
    hidden_size=768,
    architecture="bert",
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
rm: cannot remove 'baseline-optimized.onnx': No such file or directory


In [23]:
onnx_model_fp16 = create_model_for_provider(path="../models/baseline-optimized.onnx", provider_to_use="CUDAExecutionProvider")

Measure accuracy and speed with a batch size of 32.

In [25]:
correct = 0
total = 0

batch_size = 32
n_samples = test_tokenized.num_rows
n_batches = n_samples // batch_size

start_time = time.time()


for i in range(n_batches):

    data = test_tokenized[0 + i*batch_size: 0 + i*batch_size + batch_size]
    input_np: OD[str, np.ndarray] = convert_tensor(data=data, output="np")

    onnx_output = onnx_model_fp16.run(
        None,
        input_np,
    )
    
    preds=onnx_output[0].argmax(-1)
    labels = np.array(data['label'])
    correct+=np.sum(preds==labels)
    total+=labels.shape[0]


end_time = time.time()

execution_time = end_time - start_time

print(f"Samples: {n_samples}") 
print(f"Accuracy: {correct/total}")
print(f"Execution time: {round(execution_time, 1)}")
print(f"Samples per second: {math.floor(n_samples / execution_time)}")

Samples: 14028
Accuracy: 0.7858162100456622
Execution time: 34.4
Samples per second: 408


Measure accuracy and speed with a batch size of 1.

In [24]:
correct = 0
total = 0

batch_size = 1
n_samples = test_tokenized.num_rows
n_batches = n_samples // batch_size

start_time = time.time()


for i in range(n_batches):

    data = test_tokenized[0 + i*batch_size: 0 + i*batch_size + batch_size]
    input_np: OD[str, np.ndarray] = convert_tensor(data=data, output="np")

    onnx_output = onnx_model_fp16.run(
        None,
        input_np,
    )
    
    preds=onnx_output[0].argmax(-1)
    labels = np.array(data['label'])
    correct+=np.sum(preds==labels)
    total+=labels.shape[0]


end_time = time.time()

execution_time = end_time - start_time

print(f"Samples: {n_samples}") 
print(f"Accuracy: {correct/total}")
print(f"Execution time: {round(execution_time, 1)}")
print(f"Samples per second: {math.floor(n_samples / execution_time)}")

Samples: 14028
Accuracy: 0.7859281437125748
Execution time: 45.4
Samples per second: 308
