## Converting the PyTorch model to an ONNX model

The converted ONNX model is saved to `onnx_bert_large/bert-large-nli-mean-tokens-onnx.onnx`.

In [1]:
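# Remove any previous export so the converter starts from an empty output folder
!rm -rf onnx_bert_large/
from transformers.convert_graph_to_onnx import convert
# Export the PyTorch ("pt") checkpoint to ONNX using opset 11
convert(framework="pt", model="sentence-transformers/bert-large-nli-mean-tokens", output="onnx_bert_large/bert-large-nli-mean-tokens-onnx.onnx", opset=11)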

ONNX opset version set to: 11
Loading pipeline (model: sentence-transformers/bert-large-nli-mean-tokens, tokenizer: sentence-transformers/bert-large-nli-mean-tokens)
Creating folder onnx_bert_large
Using framework PyTorch: 1.5.0
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input token_type_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch', 1: 'sequence'}
Found output output_1 with shape: {0: 'batch'}
Ensuring inputs are in correct order
position_ids is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask', 'token_type_ids']


## Creating an ONNX Inference Session

In [2]:
from os import environ
from psutil import cpu_count

# Environment variables that tune onnxruntime performance.
# They must be set before onnxruntime is imported.
environ["OMP_NUM_THREADS"] = str(cpu_count(logical=True))
environ["OMP_WAIT_POLICY"] = 'ACTIVE'

from onnxruntime import InferenceSession, SessionOptions, get_all_providers
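
The same setup can be sketched with only the standard library, assuming `os.cpu_count()` is an acceptable stand-in for `psutil.cpu_count(logical=True)`. The helper name `configure_openmp` is hypothetical and not part of any library. These variables are read by the OpenMP runtime, so they only matter if the installed onnxruntime build uses OpenMP.

```python
import os

def configure_openmp(num_threads=None, wait_policy="ACTIVE"):
    """Set the OpenMP variables; call this before the first `import onnxruntime`."""
    if num_threads is None:
        # os.cpu_count() may return None, so fall back to a single thread
        num_threads = os.cpu_count() or 1
    os.environ["OMP_NUM_THREADS"] = str(num_threads)
    os.environ["OMP_WAIT_POLICY"] = wait_policy
    return num_threads

threads = configure_openmp()
print(threads, os.environ["OMP_WAIT_POLICY"])
```

Wrapping the assignments in a function makes the ordering requirement explicit: the call has to come before the import that consumes the variables.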

In [3]:
sess = InferenceSession("onnx_bert_large/bert-large-nli-mean-tokens-onnx.onnx", providers=["CPUExecutionProvider"])

## Defining the mean_pooling function to convert token-level outputs into pooled sentence embeddings

In [4]:
import torch

In [5]:
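def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings: (batch, sequence, hidden)
    token_embeddings = torch.from_numpy(model_output[0])
    attention_mask = torch.from_numpy(attention_mask)
    # Expand the mask to the embedding shape so masked tokens contribute nothing to the sum
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size())
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    # Number of unmasked tokens per sentence; the clamp is meant to guard against division by zero
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    # Return the pooled embeddings plus the intermediate mask tensors for inspection
    return sum_embeddings / sum_mask, input_mask_expanded, sum_mask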
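
To see what the masking does, here is a minimal NumPy sketch of the same idea on a toy batch. It is not the notebook's function: `mean_pooling_np` is a hypothetical name, the data is made up, and the mask is broadcast as `(batch, sequence, 1)` instead of being expanded to the full embedding shape.

```python
import numpy as np

def mean_pooling_np(token_embeddings, attention_mask):
    """Average token embeddings over the sequence axis, ignoring masked positions."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over the hidden axis
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # (batch, 1), never zero
    return summed / counts

# Toy batch: 2 sentences, 3 token positions, hidden size 2 (made-up values)
toy_embeddings = np.array(
    [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],   # last position is padding
     [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]],
    dtype=np.float32,
)
toy_mask = np.array([[1, 1, 0], [1, 1, 1]], dtype=np.int64)

pooled = mean_pooling_np(toy_embeddings, toy_mask)
print(pooled)  # first row averages only the two unmasked tokens
```

The padded position in the first sentence holds large values on purpose: because its mask entry is 0, it does not affect the average.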

## Initializing BERT Tokenizer

In [8]:
from transformers import BertTokenizerFast
# Using the bert-base-uncased tokenizer because the Sentence Transformers model uses the same one
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

## Getting sentence embeddings using the inference session

In [37]:
query = "My name is Bert"

Tokenizing the query and converting the result into the NumPy format required by the ONNX model

In [9]:
model_inputs = tokenizer(query, return_tensors="pt")
inputs_onnx = {k: v.cpu().detach().numpy() for k, v in model_inputs.items()}
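
The resulting dictionary maps each input name to a NumPy array of shape `(batch, sequence)`. The sketch below builds such a dictionary by hand and checks it, assuming the exported graph expects `int64` arrays (the dtype that PyTorch tokenizer tensors convert to). The token IDs are placeholders rather than real tokenizer output, and `check_onnx_inputs` is a hypothetical helper.

```python
import numpy as np

# Placeholder IDs for one 6-token sentence (not real tokenizer output)
seq_len = 6
example_inputs = {
    "input_ids": np.array([[101, 1, 2, 3, 4, 102]], dtype=np.int64),
    "token_type_ids": np.zeros((1, seq_len), dtype=np.int64),
    "attention_mask": np.ones((1, seq_len), dtype=np.int64),
}

def check_onnx_inputs(inputs):
    """Hypothetical helper: validate names, dtype, and matching 2-D shapes."""
    expected = {"input_ids", "token_type_ids", "attention_mask"}
    if set(inputs) != expected:
        raise ValueError(f"expected inputs {sorted(expected)}, got {sorted(inputs)}")
    shapes = {array.shape for array in inputs.values()}
    if len(shapes) != 1:
        raise ValueError(f"inputs have mismatched shapes: {shapes}")
    shape = shapes.pop()
    if len(shape) != 2:
        raise ValueError(f"expected (batch, sequence) arrays, got shape {shape}")
    for name, array in inputs.items():
        if array.dtype != np.int64:
            raise ValueError(f"{name} has dtype {array.dtype}, expected int64")
    return shape

batch_and_sequence = check_onnx_inputs(example_inputs)
print(batch_and_sequence)
```

A check like this can catch a wrong dtype or a missing input before the dictionary is passed to the inference session.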

Running the inference session with `inputs_onnx`

In [14]:
sequence = sess.run(None, inputs_onnx)

Converting the sequence embeddings to pooled embeddings using the `mean_pooling` function

In [38]:
sentence_embeddings = mean_pooling(sequence, inputs_onnx['attention_mask'])

### Sentence embeddings:

In [29]:
sentence_embeddings[0]

tensor([[-0.2796, -0.1890,  0.6042,  ..., -0.0047, -0.1386, -0.3606]])

In [30]:
sentence_embeddings

(tensor([[-0.2796, -0.1890,  0.6042,  ..., -0.0047, -0.1386, -0.3606]]),
 tensor([[[1, 1, 1,  ..., 1, 1, 1],
          [1, 1, 1,  ..., 1, 1, 1],
          [1, 1, 1,  ..., 1, 1, 1],
          [1, 1, 1,  ..., 1, 1, 1],
          [1, 1, 1,  ..., 1, 1, 1],
          [1, 1, 1,  ..., 1, 1, 1]]]),
 tensor([[6, 6, 6,  ..., 6, 6, 6]]))

In [34]:
import numpy as np
np.shape(sentence_embeddings[0])

torch.Size([1, 1024])

In [35]:
np.shape(sentence_embeddings[1])

torch.Size([1, 6, 1024])

In [36]:
np.shape(sentence_embeddings[2])

torch.Size([1, 1024])