## Notebook 2 : Post-Optimization of BERT.

### STEP 1 : Checking the free system memory.

In [2]:
!free -m

             total       used       free     shared    buffers     cached
Mem:         61401      15695      45705        515        990      11174
-/+ buffers/cache:       3530      57870
Swap:            0          0          0


### STEP 2 : Downloading the official ONNX Runtime scripts for post-optimizing a BERT model converted with TF2ONNX.

In [3]:
!mkdir bert_op_scripts
!wget -O ./bert_op_scripts/bert_model_optimization.py https://raw.githubusercontent.com/microsoft/onnxruntime/master/onnxruntime/python/tools/bert/bert_model_optimization.py
!wget -O ./bert_op_scripts/BertOnnxModelTF.py https://raw.githubusercontent.com/microsoft/onnxruntime/master/onnxruntime/python/tools/bert/BertOnnxModelTF.py
!wget -O ./bert_op_scripts/BertOnnxModel.py https://raw.githubusercontent.com/microsoft/onnxruntime/master/onnxruntime/python/tools/bert/BertOnnxModel.py
!wget -O ./bert_op_scripts/OnnxModel.py https://raw.githubusercontent.com/microsoft/onnxruntime/master/onnxruntime/python/tools/bert/OnnxModel.py

--2020-02-26 01:40:06--  https://raw.githubusercontent.com/microsoft/onnxruntime/master/onnxruntime/python/tools/bert/bert_model_optimization.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.200.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.200.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7165 (7.0K) [text/plain]
Saving to: ‘./bert_op_scripts/bert_model_optimization.py’


2020-02-26 01:40:06 (79.6 MB/s) - ‘./bert_op_scripts/bert_model_optimization.py’ saved [7165/7165]

--2020-02-26 01:40:06--  https://raw.githubusercontent.com/microsoft/onnxruntime/master/onnxruntime/python/tools/bert/BertOnnxModelTF.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.200.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.200.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26114 (26K) [text/plain]
Saving to: ‘./bert_op_scri
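The four download commands differ only in the file name, so they can be generated from one base URL. The following is a minimal sketch, assuming the files are still served from the same path on the `master` branch (the repository layout may have changed since this notebook was run). `BASE_URL`, `SCRIPT_NAMES` and `script_urls` are names introduced here for illustration only.

```python
# Base path copied from the wget commands above.
BASE_URL = ("https://raw.githubusercontent.com/microsoft/onnxruntime/"
            "master/onnxruntime/python/tools/bert/")

# The four helper scripts downloaded in this step.
SCRIPT_NAMES = [
    "bert_model_optimization.py",
    "BertOnnxModelTF.py",
    "BertOnnxModel.py",
    "OnnxModel.py",
]


def script_urls(target_dir="bert_op_scripts"):
    """Return (url, local_path) pairs for the helper scripts."""
    return [(BASE_URL + name, f"{target_dir}/{name}") for name in SCRIPT_NAMES]


# Print the equivalent wget commands instead of downloading anything.
for url, path in script_urls():
    print(f"wget -O {path} {url}")
```

Each pair could also be passed to `urllib.request.urlretrieve(url, path)` in place of `wget`.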

In [6]:
!ls

bert		 out			     uncased_L-12_H-768_A-12
bert_op_scripts  squad-1.1		     uncased_L-12_H-768_A-12.zip
lost+found	 tensor-bert-pipeline.ipynb  Untitled.ipynb


In [7]:
!cp out/bert.onnx bert_op_scripts/

### STEP 3 : Analysing the script to make the necessary changes.

In [9]:
! cat bert_op_scripts/bert_model_optimization.py

#-------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation.  All rights reserved.
# Licensed under the MIT License.
#--------------------------------------------------------------------------

# Convert Bert ONNX model converted from TensorFlow or exported from PyTorch to use Attention, Gelu,
# SkipLayerNormalization and EmbedLayerNormalization ops to optimize
# performance on NVidia GPU and CPU.

# For Bert model exported from PyTorch, OnnxRuntime has bert model optimization support internally.
# You can use the option --use_onnxruntime to use model optimization from OnnxRuntime package.
# For Bert model file like name.onnx, optimized model for GPU or CPU from OnnxRuntime will output as
# name_ort_gpu.onnx or name_ort_cpu.onnx in the same directory.
# This script is retained for experiment purpose. Useful senarios like the following:
#  (1) Change model from fp32 to fp16.
#  (2) Change input data type from int64 to i

The official script "bert_model_optimization.py" contains the import "from BertOnnxModelKeras import BertOnnxModelKeras".
The official implementation did not provide the Keras script, so running the downloaded script as-is fails with "No module named 'BertOnnxModelKeras'". Since we are using only TensorFlow, we need to modify our copy of the script and remove the Keras import.

In [11]:
! ls bert_op_scripts/


bert_model_optimization.py  BertOnnxModel.py	OnnxModel.py
bert.onnx		    BertOnnxModelTF.py	__pycache__


In [19]:
# Below are three examples to run bert_model_optimization.py. Choose one according to your needs and adjust --input
# --output path names as necessary.

# For CPU
#!python bert_op_scripts/bert_model_optimization.py --input <bert.onnx> --output <bert_cpu.onnx> --framework tensorflow

# # For inferences under NVidia GPU with Tensor Core like V100 and T4
# !python bert_op_scripts/bert_model_optimization.py --input <bert.onnx> --output <bert_gpu_fp16.onnx> --framework tensorflow --gpu_only --float16

# For inferences under other NVidia GPUs except V100 and T4
!python bert_op_scripts/bert_model_optimization.py --input bert.onnx --output bert_gpu_fp32.onnx --framework tensorflow --gpu_only

Traceback (most recent call last):
  File "bert_op_scripts/bert_model_optimization.py", line 36, in <module>
    from BertOnnxModelKeras import BertOnnxModelKeras
ModuleNotFoundError: No module named 'BertOnnxModelKeras'


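Modify the official script so that it works with TensorFlow only. The Keras import and the "bert_keras" entry in MODEL_CLASSES, shown here as they appear in the downloaded script, have to be removed: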

MODEL_CLASSES = {
    "bert" : (BertOnnxModel, "pytorch", False),
    "bert_tf": (BertOnnxModelTF, "tf2onnx", True),
    "bert_keras" : (BertOnnxModelKeras, "keras2onnx", False)
}
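One way to apply this change without editing the file by hand is to drop every line that mentions `BertOnnxModelKeras`. The sketch below is only an illustration: `strip_keras_references` is a hypothetical helper, and the sample text is a shortened stand-in for the real script, built from the import and the dictionary quoted above. It relies on the Keras entry sitting on a line of its own, as it does here.

```python
import ast


def strip_keras_references(source: str) -> str:
    """Remove every line that mentions BertOnnxModelKeras."""
    kept = [line for line in source.splitlines()
            if "BertOnnxModelKeras" not in line]
    patched = "\n".join(kept) + "\n"
    ast.parse(patched)  # raises SyntaxError if the removal broke the file
    return patched


# Shortened stand-in for bert_model_optimization.py (assumption, not the full file).
sample = '''from BertOnnxModel import BertOnnxModel
from BertOnnxModelTF import BertOnnxModelTF
from BertOnnxModelKeras import BertOnnxModelKeras

MODEL_CLASSES = {
    "bert" : (BertOnnxModel, "pytorch", False),
    "bert_tf": (BertOnnxModelTF, "tf2onnx", True),
    "bert_keras" : (BertOnnxModelKeras, "keras2onnx", False)
}
'''

patched_sample = strip_keras_references(sample)
print(patched_sample)
```

To patch the real file, read `bert_op_scripts/bert_model_optimization.py`, pass its text through the helper, and write the result back. The trailing comma left after the "bert_tf" entry is valid Python.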


### STEP 4 : Optimizing the ONNX-BERT for GPU with Single-Precision (fp32).

In [24]:
!python bert_op_scripts/bert_model_optimization.py --input out/bert.onnx --output bert_gpu_fp32.onnx  --gpu_only

    BertOnnxModel.py: Fused LayerNormalization count: 0
    BertOnnxModel.py: Fused Reshape count:0
    BertOnnxModel.py: Fused SkipLayerNormalization count: 0
    BertOnnxModel.py: Fused Attention count:0
    BertOnnxModel.py: skip embed layer fusion since mask input is not found
    BertOnnxModel.py: opset verion: 8
        OnnxModel.py: Output model to bert_gpu_fp32.onnx
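Every fusion counter in the log above is 0, so the script reported no fused nodes for this run. The check can be automated with a small parser. This is a sketch only: the log text is copied from the output above, and `fusion_counts` is a hypothetical helper, not part of the ONNX Runtime scripts.

```python
import re

# Fusion lines copied from the script output above.
LOG = """\
BertOnnxModel.py: Fused LayerNormalization count: 0
BertOnnxModel.py: Fused Reshape count:0
BertOnnxModel.py: Fused SkipLayerNormalization count: 0
BertOnnxModel.py: Fused Attention count:0
"""


def fusion_counts(log_text):
    """Map each fused operator name to the count reported in the log."""
    pattern = re.compile(r"Fused (\w+) count:\s*(\d+)")
    return {name: int(count) for name, count in pattern.findall(log_text)}


counts = fusion_counts(LOG)
print(counts)
if not any(counts.values()):
    print("No fusion was reported for this run.")
```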


### STEP 5 : Inference on the GPU version of BERT-ONNX using ONNXRUNTIME.

In [43]:
import onnxruntime as rt  
import numpy as np
import time

sess_options = rt.SessionOptions()

# Set graph optimization level to ORT_ENABLE_EXTENDED to enable bert optimization.
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED

# Step 4 wrote the optimized model to the working directory.
session = rt.InferenceSession("bert_gpu_fp32.onnx", sess_options)

# evaluate the model
# Generate dummy inputs to the model. Adjust if necessary
inputs = {
    'input_ids:0':   np.random.randint(0, 256, size=[1, 256], dtype=np.int64), # list of numerical ids for the tokenised text
    'segment_ids:0': np.ones(shape=[1, 256], dtype=np.int64),        # dummy list of ones
    'input_mask:0':  np.ones(shape=[1, 256], dtype=np.int64),        # dummy list of ones
    'unique_ids_raw_output___9:0': np.arange(0, 256, dtype=np.int64)
}

start = time.time()
# Run the optimized model with inputs
output_names = ['unstack:1', 'unstack:0', 'unique_ids:0']
res = session.run(output_names, inputs) 
end = time.time()
gpu_time = end - start
print("ONNX GPU Runtime Inference time: ", end - start)

ONNX GPU Runtime Inference time:  1.1366171836853027
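The session above is created without an explicit execution provider, so whether it actually runs on the GPU depends on the installed ONNX Runtime package and version. The sketch below shows one way to request CUDA explicitly. `pick_providers` is a hypothetical helper, and the commented-out `InferenceSession` call assumes an ONNX Runtime version that accepts the `providers` argument.

```python
def pick_providers(available):
    """Prefer CUDA when it is available, and keep CPU as the fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]


try:
    import onnxruntime as rt
    available = rt.get_available_providers()
except ImportError:
    # Assumed fallback so the sketch still runs without ONNX Runtime installed.
    available = ["CPUExecutionProvider"]

providers = pick_providers(available)
print("Providers to request:", providers)

# With a model file present, the session could then be created as:
# session = rt.InferenceSession("bert_gpu_fp32.onnx", sess_options, providers=providers)
```

If `CUDAExecutionProvider` is missing from the printed list, the timing above was measured on the CPU.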


### STEP 6 : Optimizing the ONNX-BERT for CPU(x86).

In [44]:
# Optimizing the model for CPU only (no --gpu_only flag)
!python bert_op_scripts/bert_model_optimization.py --input out/bert.onnx --output bert_cpu_only.onnx

    BertOnnxModel.py: Fused LayerNormalization count: 0
    BertOnnxModel.py: Fused Reshape count:0
    BertOnnxModel.py: Fused SkipLayerNormalization count: 0
    BertOnnxModel.py: Fused Attention count:0
    BertOnnxModel.py: skip embed layer fusion since mask input is not found
    BertOnnxModel.py: opset verion: 8
        OnnxModel.py: Output model to bert_cpu_only.onnx


### STEP 7 : Inference on the CPU version of ONNX-BERT.

In [2]:
!pip install torch

Collecting torch
  Downloading https://files.pythonhosted.org/packages/24/19/4804aea17cd136f1705a5e98a00618cb8f6ccc375ad8bfa437408e09d058/torch-1.4.0-cp36-cp36m-manylinux1_x86_64.whl (753.4MB)
Installing collected packages: torch
Successfully installed torch-1.4.0
You should consider upgrading via the 'pip install --upgrade pip' command.


In [6]:
import torch

print(f'Number of GPU : {torch.cuda.device_count()}')
if torch.cuda.is_available():  # get_device_name(0) raises when no GPU is present
    print(f'Name of the GPU {torch.cuda.get_device_name(0)}')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

Number of GPU : 1
Name of the GPU Tesla K80
Using device: cuda


In [46]:
import onnxruntime as rt  
import numpy as np
import time

sess_options = rt.SessionOptions()

# Set graph optimization level to ORT_ENABLE_EXTENDED to enable bert optimization.
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED

session = rt.InferenceSession("bert_cpu_only.onnx", sess_options)

# evaluate the model
# Generate dummy inputs to the model. Adjust if necessary
inputs = {
    'input_ids:0':   np.random.randint(0, 256, size=[1, 256], dtype=np.int64), # list of numerical ids for the tokenised text
    'segment_ids:0': np.ones(shape=[1, 256], dtype=np.int64),        # dummy list of ones
    'input_mask:0':  np.ones(shape=[1, 256], dtype=np.int64),        # dummy list of ones
    'unique_ids_raw_output___9:0': np.arange(0, 256, dtype=np.int64)
}

start = time.time()
# Run the optimized model with inputs
output_names = ['unstack:1', 'unstack:0', 'unique_ids:0']
res = session.run(output_names, inputs) 
end = time.time()
cpu_time = end - start
print("ONNX CPU Runtime Inference time: ", end - start)

ONNX CPU Runtime Inference time:  1.2086129188537598
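Both timings in this notebook come from a single `session.run` call, which also includes one-time warm-up cost. A more stable figure can be obtained by discarding a few warm-up calls and averaging over several runs. This is a minimal sketch: `benchmark` is a hypothetical helper, and the lambda below is a stand-in for the real inference call.

```python
import time


def benchmark(fn, warmup=3, runs=10):
    """Return the mean wall-clock time of fn() over `runs` calls,
    after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


# Stand-in workload; each call is recorded so the call count can be checked.
calls = []
mean_s = benchmark(lambda: calls.append(1), warmup=2, runs=5)
print(f"mean time per call: {mean_s:.6f} s over 5 timed runs")
```

With the session defined above, the call would look like `benchmark(lambda: session.run(output_names, inputs))`.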


### Conclusion
I used the following configuration for this project:
- Framework - TensorFlow / SageMaker
- EC2 Instance - p2.xlarge
- GPU - K80 (11 GB)

In [47]:
'''
After optimizing the framework-independent ONNX model for both CPU and GPU, we reduced the inference time.
'''
print(f'ONNX GPU Inference time is : {gpu_time}')
print(f'ONNX CPU Inference time is : {cpu_time}')


ONNX GPU Inference time is : 1.1366171836853027
ONNX CPU Inference time is : 1.2086129188537598
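To put the two numbers side by side, the ratio and the absolute difference can be computed directly. The values are copied from the output above. Each comes from a single run, so the comparison is indicative only.

```python
gpu_time = 1.1366171836853027  # seconds, from the GPU run above
cpu_time = 1.2086129188537598  # seconds, from the CPU run above

ratio = cpu_time / gpu_time                 # > 1 means the GPU run was faster
diff_ms = (cpu_time - gpu_time) * 1000      # difference in milliseconds

print(f"CPU / GPU time ratio: {ratio:.3f}")
print(f"Absolute difference: {diff_ms:.1f} ms")
```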
