# SentenceTransformers ONNX Inference Model

This notebook gives a brief introduction to creating ONNX models from a given SentenceTransformers model. The tutorial only applies to models that have been released on the transformers model hub: https://huggingface.co/sentence-transformers

## Preliminaries

Initially, we start with the required imports:

In [1]:
!rm -rf bert-base-nli-stsb-mean-tokens
import time
import pprint
import multiprocessing
from pathlib import Path

import onnx
import torch
import transformers

import numpy as np
import onnxruntime as rt

from termcolor import colored
from sentence_transformers import SentenceTransformer
from transformers import convert_graph_to_onnx

pp = pprint.PrettyPrinter(indent=4)
pprint = pp.pprint

In order to provide some performance measurements, this notebook has been run on the following hardware:

In [2]:
print(colored(f"GPU available {torch.cuda.is_available()}", "green"))
print(colored(f"GPU Name: {torch.cuda.get_device_name(0)}", "green"))
print(colored(f"GPU Count: {torch.cuda.device_count()}", "green"))
print(colored(f"CORE Count: {multiprocessing.cpu_count()}", "green"))

GPU available True
GPU Name: Tesla V100-SXM2-32GB
GPU Count: 1
CORE Count: 48


Next, we define a test span and some preliminary information about the model to be used. We also load the raw model from this library, both as a benchmark baseline and as a sanity check for our converted ONNX model.

In [3]:
span = "I am a span. A short span, but nonetheless a span"

model_type = "bert"
model_name = f"{model_type}-base-nli-stsb-mean-tokens"
model_access = f"sentence-transformers/{model_name}"

model_raw = SentenceTransformer(model_name, device="cuda")

## Loading the Pipeline

We subsequently load the FeatureExtractionPipeline from the transformers library. Strictly speaking, this step is not necessary if you already know all the input shapes and the config of the model you want to use. However, the corresponding functions from convert_graph_to_onnx make creating our custom model significantly easier.

The resulting variables will be used for the torch export call later in this notebook.

In [4]:
model_pipeline = transformers.FeatureExtractionPipeline(
    model=transformers.AutoModel.from_pretrained(model_access),
    tokenizer=transformers.AutoTokenizer.from_pretrained(model_access, use_fast=True),
    framework="pt",
    device=-1
)

config = model_pipeline.model.config
tokenizer = model_pipeline.tokenizer

with torch.no_grad():
    input_names, output_names, dynamic_axes, tokens = convert_graph_to_onnx.infer_shapes(
        model_pipeline, 
        "pt"
    )
    ordered_input_names, model_args = convert_graph_to_onnx.ensure_valid_input(
        model_pipeline.model, tokens, input_names
    )

Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input token_type_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch', 1: 'sequence'}
Found output output_1 with shape: {0: 'batch'}
Ensuring inputs are in correct order
position_ids is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask', 'token_type_ids']


In [5]:
pprint(input_names)
pprint(output_names)
pprint(dynamic_axes)
pprint(tokens)
pprint(ordered_input_names)
pprint(model_args)

['input_ids', 'token_type_ids', 'attention_mask']
['output_0', 'output_1']
{   'attention_mask': {0: 'batch', 1: 'sequence'},
    'input_ids': {0: 'batch', 1: 'sequence'},
    'output_0': {0: 'batch', 1: 'sequence'},
    'output_1': {0: 'batch'},
    'token_type_ids': {0: 'batch', 1: 'sequence'}}
{   'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]),
    'input_ids': tensor([[ 101, 2023, 2003, 1037, 7099, 6434,  102]]),
    'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]])}
['input_ids', 'attention_mask', 'token_type_ids']
(   tensor([[ 101, 2023, 2003, 1037, 7099, 6434,  102]]),
    tensor([[1, 1, 1, 1, 1, 1, 1]]),
    tensor([[0, 0, 0, 0, 0, 0, 0]]))
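
Note that `input_names` follows the tokenizer's order, while `ordered_input_names` follows the order of the model's `forward` signature. Positional arguments handed to an export call have to follow the `forward` signature of the model that is actually exported. The following minimal sketch illustrates the reordering idea with a stand-in function. `order_by_signature` and `fake_forward` are hypothetical names, and this is not the actual implementation of `ensure_valid_input`.

```python
import inspect


def order_by_signature(fn, available):
    """Return the names in `available`, ordered as they appear in fn's signature."""
    params = inspect.signature(fn).parameters
    return [name for name in params if name in available]


# Stand-in whose leading arguments mirror the order printed in the log above.
def fake_forward(input_ids, attention_mask=None, token_type_ids=None, position_ids=None):
    return None


tokenizer_order = ["input_ids", "token_type_ids", "attention_mask"]
ordered = order_by_signature(fake_forward, tokenizer_order)
print(ordered)  # ['input_ids', 'attention_mask', 'token_type_ids']
```

Arguments that the tokenizer does not produce (here `position_ids`) are simply left out, which matches the log message above.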


For our application, we want to create a custom transformers model whose output differs from that of the feature extraction pipeline.
We still use the output of the feature extractor (the token embeddings) internally, but return only the pooled sentence embedding.

We must add a new output for the pooled sentence embedding. This output is of fixed size, as opposed to (for example) the original output_0, corresponding to the token embeddings.

All other (input) variables can be left unchanged, because they are identical for the models.

Therefore, we change variables as follows:

In [6]:
del dynamic_axes["output_0"] # Delete unused output
del dynamic_axes["output_1"] # Delete unused output

output_names = ["sentence_embedding"]
dynamic_axes["sentence_embedding"] = {0: 'batch'}

# Check that everything worked
pprint(output_names)
pprint(dynamic_axes)

['sentence_embedding']
{   'attention_mask': {0: 'batch', 1: 'sequence'},
    'input_ids': {0: 'batch', 1: 'sequence'},
    'sentence_embedding': {0: 'batch'},
    'token_type_ids': {0: 'batch', 1: 'sequence'}}


## Creating the SentenceTransformer Model

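Next, we create the custom transformers model, which is based on the BertModel of the original model contained in the pipeline (make sure to get the **inheritance** right). Note that the class below reuses the name SentenceTransformer and therefore shadows the class imported from sentence_transformers; model_raw was created earlier and is not affected.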

If you want to add further outputs, you will have to modify the dynamic_axes and output_names accordingly.
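
As a minimal sketch of what such a modification could look like, the helper below registers one more named output. `add_output` and the `token_embeddings` output are hypothetical and only serve as an illustration; the model's `forward` would also have to return the additional tensor, in the same order as `output_names`.

```python
def add_output(output_names, dynamic_axes, name, axes):
    """Return updated copies of output_names and dynamic_axes with one more output."""
    if name in output_names:
        raise ValueError(f"output {name!r} is already registered")
    new_names = list(output_names) + [name]
    new_axes = dict(dynamic_axes)
    new_axes[name] = dict(axes)
    return new_names, new_axes


names = ["sentence_embedding"]
axes = {
    "input_ids": {0: "batch", 1: "sequence"},
    "sentence_embedding": {0: "batch"},
}

# Hypothetical second output: token embeddings vary in batch size and sequence length.
names, axes = add_output(names, axes, "token_embeddings", {0: "batch", 1: "sequence"})
print(names)  # ['sentence_embedding', 'token_embeddings']
```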

In [7]:
class SentenceTransformer(transformers.BertModel):
    def __init__(self, config):
        super().__init__(config)
        # Naming alias for ONNX output specification
        # Makes it easier to identify the layer
        self.sentence_embedding = torch.nn.Identity()

    def forward(self, input_ids, token_type_ids, attention_mask):
        # Get the token embeddings from the base model
        token_embeddings = super().forward(
            input_ids, 
            attention_mask=attention_mask, 
            token_type_ids=token_type_ids
        )[0]
        # Stack the pooling layer on top of it
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size())
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return self.sentence_embedding(sum_embeddings / sum_mask)

# Load the pretrained weights into the new model class, reusing the pipeline's config
# (from_pretrained is a classmethod, so no instance has to be created first)
model = SentenceTransformer.from_pretrained(model_access, config=config)
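
The pooling step inside `forward` is a masked mean: padded positions are zeroed out and the sum is divided by the number of real tokens. The following NumPy sketch reproduces that arithmetic on toy values. `masked_mean_pool` is an illustrative name and the numbers are made up.

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over the sequence axis, ignoring masked positions."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Clamp the token count to avoid dividing by zero for fully masked rows.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# One sentence, three positions, embedding size two; the last position is padding.
token_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(token_embeddings, attention_mask)
print(pooled)  # [[2. 3.]]
```

The padded position (100, 100) does not influence the result, which is why the attention mask has to be passed through to the pooling step.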

Let's make sure that the original model and the newly created model produce the same output:

In [8]:
assert np.allclose(
    model_raw.encode(span),
    model(**tokenizer(span, return_tensors="pt")).squeeze().detach().numpy(),
    atol=1e-6,
)
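
If such an assertion fails, a bare `AssertionError` says little about how far apart the two embeddings are. A small helper like the following can report the deviation. This is only a sketch: `compare_embeddings` is a hypothetical name and the vectors are toy data.

```python
import numpy as np


def compare_embeddings(a, b):
    """Return (maximum absolute difference, cosine similarity) of two embeddings."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    max_abs_diff = float(np.max(np.abs(a - b)))
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max_abs_diff, cosine


reference = np.array([0.1, -0.2, 0.3])
candidate = reference + np.array([1e-7, -1e-7, 0.0])

max_abs_diff, cosine = compare_embeddings(reference, candidate)
print(f"max abs diff: {max_abs_diff:.1e}, cosine similarity: {cosine:.6f}")
```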

## Exporting the Model to ONNX

The following step is heavily based on the original [convert_graph_to_onnx.py](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_graph_to_onnx.py) from the transformers library.

Important note: The opset version defines the version of the ONNX operator set that the model is converted to. For the given model, the highest opset version is required (12 in the export below).

Note: The opset version might cause problems with the subsequent call of optimizer.optimize_model from onnxruntime_tools.

In [9]:
outdir = Path(model_name)
output = outdir / f"{model_name}.onnx"
outdir.mkdir(parents=True, exist_ok=True)

if output.exists():
    print(f"Model {output} exists. Skipping creation")
else:
    print(f"Saving to {output}")
    # This is essentially a copy of transformers.convert_graph_to_onnx.convert
    torch.onnx.export(
        model,
        tuple(tokens[name] for name in input_names),  # positional order must match the custom forward()
        f=output.as_posix(),
        input_names=input_names,
        output_names=output_names,
        dynamic_axes=dynamic_axes,
        do_constant_folding=True,
        use_external_data_format=False,
        enable_onnx_checker=True,
        opset_version=12,
    )

Saving to bert-base-nli-stsb-mean-tokens/bert-base-nli-stsb-mean-tokens.onnx
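
Depending on the installed PyTorch version, some of the keyword arguments used above (for instance `use_external_data_format` and `enable_onnx_checker`) may no longer be accepted. One way to keep such a call portable is to drop the keywords that the function signature does not know. The sketch below shows the idea on a stand-in function. `supported_kwargs` and `fake_export` are hypothetical names, and applying the helper to `torch.onnx.export` is an assumption that we have not verified for every release.

```python
import inspect


def supported_kwargs(fn, kwargs):
    """Keep only the keyword arguments that fn's signature accepts."""
    params = inspect.signature(fn).parameters
    # A **kwargs parameter means that fn accepts arbitrary keywords.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {key: value for key, value in kwargs.items() if key in params}


# Stand-in for an export function that lacks one of the requested options.
def fake_export(model, args, f, opset_version=None, do_constant_folding=True):
    return None


requested = {
    "opset_version": 12,
    "do_constant_folding": True,
    "enable_onnx_checker": True,
}
accepted = supported_kwargs(fake_export, requested)
print(accepted)  # {'opset_version': 12, 'do_constant_folding': True}
```

Silently dropping an option changes behaviour, so it is worth printing which keywords were removed before relying on the exported file.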


Let's quickly check whether the ONNX model is well-formed:

In [10]:
onnx_model = onnx.load(output)
onnx.checker.check_model(onnx_model)
print('The model is checked!')

The model is checked!


## Running Inference

We are finally able to run an inference session based on the ONNX Runtime.

In [11]:
opt = rt.SessionOptions()
opt.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
opt.log_severity_level = 3
opt.execution_mode = rt.ExecutionMode.ORT_SEQUENTIAL

sess = rt.InferenceSession(str(output), opt) # Loads the model

Before doing anything else, let's validate that the outputs of the ONNX model match those of the raw model.

In [12]:
model_input = tokenizer.encode_plus(span)
model_input = {name : np.atleast_2d(value) for name, value in model_input.items()}
onnx_result = sess.run(None, model_input)

assert np.allclose(model_raw.encode(span), onnx_result, atol=1e-6)
assert np.allclose(
    model(**tokenizer(span, return_tensors="pt")).squeeze().detach().numpy(), 
    onnx_result, 
    atol=1e-6
)
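
The feed dictionary passed to `sess.run` has to contain exactly the graph inputs, as two-dimensional integer arrays. Since the model was exported from int64 tensors, it is safer to request that dtype explicitly, because NumPy's default integer type is not int64 on every platform. The following sketch shows one way to build such a feed. `build_feed` is a hypothetical helper and the token ids are toy values.

```python
import numpy as np


def build_feed(encoded, input_names):
    """Convert tokenizer output into int64 arrays of shape (batch, sequence)."""
    missing = [name for name in input_names if name not in encoded]
    if missing:
        raise KeyError(f"tokenizer output lacks inputs: {missing}")
    return {
        name: np.atleast_2d(np.asarray(encoded[name], dtype=np.int64))
        for name in input_names
    }


# Toy tokenizer output with one entry the graph does not expect.
encoded = {
    "input_ids": [101, 2023, 102],
    "token_type_ids": [0, 0, 0],
    "attention_mask": [1, 1, 1],
    "special_tokens_mask": [1, 0, 1],
}
feed = build_feed(encoded, ["input_ids", "token_type_ids", "attention_mask"])
print({name: (value.shape, value.dtype) for name, value in feed.items()})
```

With a real session, the expected names can be read from `[i.name for i in sess.get_inputs()]`.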

Finally, we can run the benchmarks.

## Online Encoding Benchmark

This benchmark simulates encoding spans on the fly, one at a time, without any batching.

In [13]:
%%timeit -n 200
model_raw.encode(span)

14.9 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 200 loops each)


In [14]:
%%timeit -n 200
model_input = tokenizer.encode_plus(span)
model_input = {name : np.atleast_2d(value) for name, value in model_input.items()}
output = sess.run(None, model_input)

2.21 ms ± 56.6 µs per loop (mean ± std. dev. of 7 runs, 200 loops each)


So, in this example we are able to speed up online inference with bert-base-nli-stsb-mean-tokens by a factor of *6-7* (14.9 ms vs. 2.21 ms per span) on an otherwise idle V100.
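
Outside of a notebook, the `%%timeit` magic is not available. A rough stand-in can be written with `time.perf_counter`, as sketched below. `time_call` is a hypothetical helper that is only similar in spirit to `%%timeit`, and the timed function is a dummy. The last lines recompute the speedup from the two measurements reported above.

```python
import statistics
import time


def time_call(fn, loops=200, repeats=7):
    """Return (mean, stdev) of the per-call time in milliseconds; needs repeats >= 2."""
    per_call_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(loops):
            fn()
        elapsed = time.perf_counter() - start
        per_call_ms.append(elapsed / loops * 1000.0)
    return statistics.mean(per_call_ms), statistics.stdev(per_call_ms)


calls = []
mean_ms, std_ms = time_call(lambda: calls.append(1), loops=50, repeats=3)
print(f"{mean_ms:.4f} ms ± {std_ms:.4f} ms per call")

# Speedup implied by the measurements above (14.9 ms vs. 2.21 ms).
speedup = 14.9 / 2.21
print(f"{speedup:.2f}x")  # 6.74x
```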

## Batched Encoding Benchmark

Let's benchmark the model in batches:

In [15]:
no_spans = 100_000
batch_size = 32
sentences = ["I am a very short span." for _ in range(no_spans)]

In [16]:
def convert(sentences, batch_size=batch_size):
    """Encode all sentences batch-wise with the ONNX session (wrapped in a function for the line profiler)."""
    iterator = range(0, len(sentences), batch_size)
    for start_index in iterator:
        sentences_batch = sentences[start_index:start_index+batch_size]

        # No padding is requested: this only works because all benchmark sentences have the same length
        tokens = tokenizer(sentences_batch)
        tokens = {name: np.atleast_2d(value) for name, value in tokens.items()}
        out = sess.run(None, tokens)[0]
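
The batching loop above produces rectangular arrays only because every sentence tokenizes to the same length. For mixed-length input, each batch has to be padded to its longest sequence, with the attention mask marking the real tokens. In practice the tokenizer's own `padding=True` option takes care of this; the sketch below merely shows what that padding amounts to. `batches` and `pad_batch` are hypothetical helpers, the token ids are toy values, and `pad_id=0` is an assumption about the padding token id.

```python
import numpy as np


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def pad_batch(token_id_lists, pad_id=0):
    """Right-pad token id lists to a rectangular int64 array plus attention mask."""
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids = np.full((len(token_id_lists), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros_like(input_ids)
    for row, ids in enumerate(token_id_lists):
        input_ids[row, :len(ids)] = ids
        attention_mask[row, :len(ids)] = 1
    return input_ids, attention_mask


toy = [[101, 7, 102], [101, 7, 8, 9, 102], [101, 102]]
shapes = [pad_batch(batch)[0].shape for batch in batches(toy, batch_size=2)]
print(shapes)  # [(2, 5), (1, 2)]
```

Padding per batch rather than to a global maximum keeps short batches small, which matters for throughput.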

In [17]:
start = time.time()
_ = model_raw.encode(
    sentences=sentences,
    batch_size=batch_size,
)
end = time.time()

In [18]:
sentences_per_second = 1 / ((end-start) / no_spans) 
print(f"{sentences_per_second:.2f}")

1990.25


In [19]:
start = time.time()
convert(sentences)
end = time.time()

In [20]:
sentences_per_second = 1 / ((end-start) / no_spans) 
print(f"{sentences_per_second:.2f}")

3994.79


In the batched case, the ONNX model achieves a roughly 2x speedup (about 3995 vs. 1990 sentences per second).
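
The throughput expression used above, `1 / ((end-start) / no_spans)`, is simply the number of sentences divided by the elapsed time. The short sketch below spells this out and recomputes the speedup from the two reported rates. `throughput` is a hypothetical helper name.

```python
def throughput(n_items, elapsed_seconds):
    """Items processed per second; equal to 1 / (elapsed_seconds / n_items)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return n_items / elapsed_seconds


# Rates reported above, in sentences per second.
raw_rate = 1990.25
onnx_rate = 3994.79

speedup = onnx_rate / raw_rate
print(f"{speedup:.2f}x")  # 2.01x

# Wall-clock time implied by each rate for the 100_000 benchmark sentences.
for label, rate in (("raw", raw_rate), ("onnx", onnx_rate)):
    print(f"{label}: {100_000 / rate:.1f} s")
```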

## Cleanup

In [21]:
!rm -rf bert-base-nli-stsb-mean-tokens