# Colab notebook

In [1]:
!pip install --upgrade transformers
!pip install --upgrade onnxruntime
!pip install --upgrade onnx
!pip install torch==1.8.1+cpu torchvision==0.9.1+cpu torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

Requirement already up-to-date: transformers in /usr/local/lib/python3.7/dist-packages (4.6.1)
Requirement already up-to-date: onnxruntime in /usr/local/lib/python3.7/dist-packages (1.7.0)
Requirement already up-to-date: onnx in /usr/local/lib/python3.7/dist-packages (1.9.0)
Looking in links: https://download.pytorch.org/whl/torch_stable.html


## Create ONNX model

In [2]:
from transformers.convert_graph_to_onnx import convert, optimize, quantize
from pathlib import Path

In [3]:
model_to_use = 'bert-base-cased'
tmp_path = Path('./tmp')
onnx_output_path = Path('./ml_model')

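# Export the fill-mask pipeline to ONNX. convert() refuses to write into a
# non-empty folder, hence the separate tmp directory.
convert(
  framework="pt",  # "pt" selects PyTorch
  model=model_to_use,
  output=tmp_path/f'{model_to_use}.onnx',
  opset=13,
  pipeline_name="fill-mask",
)

# Move from tmp to the ml_model directory, creating it if needed.
onnx_output_path.mkdir(parents=True, exist_ok=True)
(tmp_path/f'{model_to_use}.onnx').rename(onnx_output_path/f'{model_to_use}.onnx')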

ONNX opset version set to: 13
Loading pipeline (model: bert-base-cased, tokenizer: bert-base-cased)


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Using framework PyTorch: 1.8.1+cpu
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input token_type_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch', 1: 'sequence'}
Ensuring inputs are in correct order
position_ids is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask', 'token_type_ids']


  input_tensor.shape[chunk_dim] == tensor_shape for input_tensor in input_tensors
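
The export log above lists the graph inputs and the order in which they were generated. At inference time the feed passed to `session.run` is keyed by input name, so it should contain exactly the names the graph declares. A minimal sketch of that filtering, assuming the names would come from `session.get_inputs()` in practice; `build_feed` is a hypothetical helper, and plain lists stand in for the NumPy arrays a tokenizer would return:

```python
from typing import Dict, Iterable, List


def build_feed(encoded: Dict[str, List], input_names: Iterable[str]) -> Dict[str, List]:
    """Keep only the entries the ONNX graph declares as inputs.

    Raises KeyError if the encoded batch lacks a required input.
    """
    return {name: encoded[name] for name in input_names}


# Input names as reported in the export log.
graph_inputs = ['input_ids', 'attention_mask', 'token_type_ids']

# Stand-in for tokenizer output; the ids are illustrative, not real vocabulary ids.
encoded = {
    'input_ids': [[1, 2, 3]],
    'token_type_ids': [[0, 0, 0]],
    'attention_mask': [[1, 1, 1]],
    'special_tokens_mask': [[1, 0, 1]],  # extra key the graph does not declare
}

feed = build_feed(encoded, graph_inputs)
print(list(feed))
```

Filtering by the declared names keeps the feed valid even if the tokenizer is later configured to return additional fields.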


In [4]:
!ls -AcFhlt ml_model

total 4.5G
-rw-r--r-- 1 root root 499M May 26 01:04 bert-base-cased.onnx
-rw-r--r-- 1 root root 348M May 26 01:00 bert-large-cased-optimized-quantized.onnx
-rw-r--r-- 1 root root 1.4G May 26 00:59 bert-large-cased-optimized.onnx
-rw-r--r-- 1 root root 348M May 26 00:58 bert-large-cased-quantized.onnx
-rw-r--r-- 1 root root 1.4G May 26 00:57 bert-large-cased.onnx
-rw-r--r-- 1 root root 126M May 26 00:55 bert-base-cased-optimized-quantized.onnx
-rw-r--r-- 1 root root 499M May 26 00:55 bert-base-cased-optimized.onnx


In [5]:
quantize(onnx_output_path/f'{model_to_use}.onnx')

         Please use quantize_static for static quantization, quantize_dynamic for dynamic quantization.


As of onnxruntime 1.4.0, models larger than 2GB will fail to quantize due to protobuf constraint.
This limitation will be removed in the next release of onnxruntime.
Quantized model has been written at ml_model/bert-base-cased-quantized.onnx: ✔


PosixPath('ml_model/bert-base-cased-quantized.onnx')

In [6]:
optimize(onnx_output_path/f'{model_to_use}.onnx')
quantize(onnx_output_path/f'{model_to_use}-optimized.onnx')

Optimized model has been written at ml_model/bert-base-cased-optimized.onnx: ✔
/!\ Optimized model contains hardware specific operators which might not be portable. /!\


         Please use quantize_static for static quantization, quantize_dynamic for dynamic quantization.


As of onnxruntime 1.4.0, models larger than 2GB will fail to quantize due to protobuf constraint.
This limitation will be removed in the next release of onnxruntime.
Quantized model has been written at ml_model/bert-base-cased-optimized-quantized.onnx: ✔


PosixPath('ml_model/bert-base-cased-optimized-quantized.onnx')
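
Both `quantize()` calls print a notice pointing to `quantize_dynamic`. A hedged sketch of calling it directly, assuming the installed onnxruntime ships `onnxruntime.quantization.quantize_dynamic` and `QuantType`; the result is not necessarily identical to what `quantize()` produces. `quantized_path` and `quantize_model` are hypothetical helpers, and the `-quantized.onnx` suffix simply mirrors the file names above. The onnxruntime import is deferred so the path helper works on its own:

```python
from pathlib import Path


def quantized_path(model_path: Path) -> Path:
    """Mirror the naming used above: model.onnx -> model-quantized.onnx."""
    return model_path.with_name(f'{model_path.stem}-quantized{model_path.suffix}')


def quantize_model(model_path: Path) -> Path:
    """Dynamically quantize the weights to int8 and return the output path."""
    # Imported lazily; assumes an onnxruntime build that includes this module.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    output_path = quantized_path(model_path)
    quantize_dynamic(str(model_path), str(output_path), weight_type=QuantType.QInt8)
    return output_path


print(quantized_path(Path('ml_model/bert-base-cased.onnx')))
```

In the notebook this would be used as `quantize_model(onnx_output_path/f'{model_to_use}.onnx')`.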

In [7]:
!ls -AcFhlt ml_model

total 4.7G
-rw-r--r-- 1 root root 126M May 26 01:04 bert-base-cased-optimized-quantized.onnx
-rw-r--r-- 1 root root 499M May 26 01:04 bert-base-cased-optimized.onnx
-rw-r--r-- 1 root root 126M May 26 01:04 bert-base-cased-quantized.onnx
-rw-r--r-- 1 root root 499M May 26 01:04 bert-base-cased.onnx
-rw-r--r-- 1 root root 348M May 26 01:00 bert-large-cased-optimized-quantized.onnx
-rw-r--r-- 1 root root 1.4G May 26 00:59 bert-large-cased-optimized.onnx
-rw-r--r-- 1 root root 348M May 26 00:58 bert-large-cased-quantized.onnx
-rw-r--r-- 1 root root 1.4G May 26 00:57 bert-large-cased.onnx
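
The listing shows what quantization buys: the base model drops from 499M to 126M and the large model from 1.4G to 348M, while the optimized files stay the same size as their unoptimized counterparts. A quick arithmetic sketch of the ratios, using the sizes as `ls -h` printed them (so the values are rounded). A reduction of roughly 4x is consistent with storing 32-bit float weights as 8-bit integers:

```python
# Sizes in MiB as reported by `ls -h` above (1.4G taken as 1.4 * 1024).
sizes_mib = {
    'bert-base-cased': (499, 126),
    'bert-large-cased': (1.4 * 1024, 348),
}

ratios = {}
for name, (original, quantized) in sizes_mib.items():
    ratios[name] = original / quantized
    saved = 100 * (1 - quantized / original)
    print(f'{name}: {ratios[name]:.2f}x smaller, {saved:.0f}% saved')
```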


## Testing model

In [8]:
import numpy as np
from transformers import BertTokenizerFast
from onnxruntime import ExecutionMode, InferenceSession, SessionOptions

model_to_use = model_to_use or 'bert-base-cased'
MASK_STR = '[MASK]'

# Create the tokenizer.
tokenizer = BertTokenizerFast.from_pretrained(model_to_use)

# Create the InferenceSession.
options = SessionOptions()
options.intra_op_num_threads = 1
options.execution_mode = ExecutionMode.ORT_SEQUENTIAL
session = InferenceSession(
  f'./ml_model/{model_to_use}-quantized.onnx',
  options,
  providers=['CPUExecutionProvider']
)


# Simulate some HTTP request and response.
def some_request(input_sentence: str) -> dict:
  if input_sentence.count(MASK_STR) == 0:
    return {'error': f'{MASK_STR} is missing from the text'}

  return {'data': {'suggestions': fill_mask_onnx(input_sentence)}}


def fill_mask_onnx(input_sentence: str, topn: int = 10) -> list:
  tokens = tokenizer(input_sentence, return_tensors='np')
  # BatchEncoding behaves like a mapping; pass its arrays as the ONNX feed.
  output = session.run(None, dict(tokens))
  token_logits = output[0]

  # Get token indices of the masks.
  mask_token_indices = np.where(
    tokens['input_ids'] == tokenizer.mask_token_id)[1]

  # Get the top tokens for each mask.
  per_mask = []
  for mask_token_index in mask_token_indices:
    mask_token_logits = token_logits[0, mask_token_index, :]
    # Softmax; subtracting the max keeps np.exp from overflowing.
    exp = np.exp(mask_token_logits - mask_token_logits.max())
    score = exp / exp.sum()

    top_idx = (-score).argsort()[:topn]
    per_mask.append([
      {'text': tokenizer.decode([token]), 'score': token_score}
      for token, token_score in zip(top_idx.tolist(), score[top_idx].tolist())
    ])

  # Transpose to one row per rank (one entry per mask), as plain lists so
  # that the result can be serialized.
  return [list(row) for row in zip(*per_mask)]


print(some_request(f'{MASK_STR} {MASK_STR}, also called performance or concert dance, is intended primarily as a spectacle, usually a performance upon a stage by virtuoso dancers. It often tells a story, perhaps using mime, costume and scenery, or else it may simply interpret the musical accompaniment, which is often specially composed. Examples are western ballet and modern dance, Classical Indian dance and Chinese and Japanese song and dance dramas. Most classical forms are centred upon dance alone, but performance dance may also appear in opera and other forms of musical theatre. Participatory dance, on the other hand, whether it be a folk dance, a social dance, a group dance such as a line, circle, chain or square dance, or a partner dance such as is common in Western ballroom dancing, is undertaken primarily for a common purpose, such as social interaction or exercise, or building flexibility of participants rather than to serve any benefit to onlookers. Such dance seldom has any narrative. A group dance and a corps de ballet, a social partner dance and a pas de deux, differ profoundly. Even a solo dance may be undertaken solely for the satisfaction of the dancer. Participatory dancers often all employ the same movements and steps but, for example, in the rave culture of electronic dance music, vast crowds may engage in free dance, uncoordinated with those around them. On the other hand, some cultures lay down strict rules as to the particular dances in which, for example, men, women and children may or must participate.'))

{'data': {'suggestions': [[{'text': 'Performance', 'score': 0.4660041630268097}, {'text': 'dance', 'score': 0.9748988151550293}], [{'text': 'Performing', 'score': 0.05453372746706009}, {'text': 'dancing', 'score': 0.012139721773564816}], [{'text': 'Dance', 'score': 0.03462020307779312}, {'text': 'Dance', 'score': 0.004944453947246075}], [{'text': 'Musical', 'score': 0.034068189561367035}, {'text': 'ballet', 'score': 0.0024087349884212017}], [{'text': 'Concert', 'score': 0.02929711528122425}, {'text': 'dances', 'score': 0.0016606670105829835}], [{'text': 'The', 'score': 0.028143607079982758}, {'text': 'music', 'score': 0.000740378862246871}], [{'text': 'Stage', 'score': 0.018190523609519005}, {'text': 'choreography', 'score': 0.0005686341901309788}], [{'text': 'Live', 'score': 0.017869982868433}, {'text': 'work', 'score': 0.00025001788162626326}], [{'text': 'Public', 'score': 0.016925238072872162}, {'text': '##dance', 'score': 0.00023710304230917245}], [{'text': '.', 'score': 0.01687968
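
The scoring step in `fill_mask_onnx` turns the logits at a mask position into probabilities with a softmax and keeps the `topn` highest. A small standalone sketch of that step in pure Python, with a made-up four-word vocabulary and made-up logits (both are illustrative assumptions, not model output):

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities; subtracting the max avoids overflow."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_n(logits: List[float], vocab: List[str], n: int) -> List[Tuple[str, float]]:
    """Return the n most probable (token, score) pairs, best first."""
    scores = softmax(logits)
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


vocab = ['dance', 'music', 'ballet', 'work']  # hypothetical vocabulary
logits = [4.0, 1.0, 2.5, -1.0]                # hypothetical logits

for token, score in top_n(logits, vocab, n=2):
    print(f'{token}: {score:.3f}')
```

Because the softmax is taken over the whole vocabulary, the scores for one mask sum to 1 across all tokens, not just across the top entries shown.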

## Save to Google Drive

In [9]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
!ls /content/drive/MyDrive/ml_models

bert-base-cased-quantized.onnx	  wikitext-103-raw.model-50.wv.pkl
wikitext-103-raw.model-25.wv.pkl


In [11]:
model_to_use = model_to_use or 'bert-base-cased'

# Move the quantized model to Drive.
!mv ml_model/{model_to_use}-quantized.onnx /content/drive/MyDrive/ml_models

In [12]:
!ls /content/drive/MyDrive/ml_models

bert-base-cased-quantized.onnx	  wikitext-103-raw.model-50.wv.pkl
wikitext-103-raw.model-25.wv.pkl
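
The `mv` in cell 11 removes the quantized model from `ml_model`. If the local copy is still needed, for example to rerun the test cell, copying is an alternative. A minimal sketch using only the standard library; `save_model` is a hypothetical helper, and temporary directories stand in for `ml_model` and the mounted Drive folder:

```python
import shutil
import tempfile
from pathlib import Path


def save_model(src: Path, dest_dir: Path) -> Path:
    """Copy src into dest_dir (created if missing) and return the new path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dest_dir / src.name))


# Demonstrate with throwaway directories instead of the real paths.
with tempfile.TemporaryDirectory() as tmp:
    local_dir = Path(tmp) / 'ml_model'
    drive_dir = Path(tmp) / 'drive' / 'ml_models'
    local_dir.mkdir()
    model_file = local_dir / 'bert-base-cased-quantized.onnx'
    model_file.write_bytes(b'placeholder')  # stand-in for the real model

    saved = save_model(model_file, drive_dir)
    kept_local = model_file.exists()
    copied = saved.read_bytes() == b'placeholder'
    print(saved.name, kept_local, copied)
```

In the notebook the call would be `save_model(Path('ml_model')/f'{model_to_use}-quantized.onnx', Path('/content/drive/MyDrive/ml_models'))`.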
