# Deployment with TVM


In this notebook, we focus on deployment with TVM. [TVM](https://tvm.ai) is an open deep learning compiler for CPUs, GPUs, and specialized accelerators. [Amazon Sagemaker Neo](https://aws.amazon.com/sagemaker/neo/) provides a compilation service built on top of TVM.

In this tutorial, we use a BERT model for the question answering task to show how deployment through TVM works. Specifically, we will learn how to:

- convert an MXNet model to the TVM representation (Relay),
- compile and run the model with TVM, and
- evaluate TVM performance.

## Preparation

We first import the MXNet, GluonNLP, and TVM libraries:

In [1]:
import os
import collections, time, logging
import numpy as np
import gluonnlp as nlp
import mxnet as mx
import bert
import qa_utils
from bert.bert_qa_evaluate import PredResult, predict
from bert.export.hybrid_bert import HybridBERTForQA, get_hybrid_model
from bert.data.qa import SQuADTransform, preprocess_dataset
# TVM libraries
import tvm
from tvm import relay
from tvm import autotvm

### Load hybrid BERT model

To export the model for deployment, we need a hybrid model with a fixed sequence length. Here we set the maximum sequence length to 256.

In [2]:
max_seq_length = 256
base_model, vocab = get_hybrid_model(
    name="bert_12_768_12",
    dataset_name="book_corpus_wiki_en_uncased",
    pretrained=False,
    use_pooler=False,
    use_decoder=False,
    use_classifier=False,
    seq_length=max_seq_length)
net = HybridBERTForQA(base_model)
mx_ctx = mx.cpu()
ckpt = qa_utils.download_qa_ckpt()
net.load_parameters(ckpt, ctx=mx_ctx)

Downloaded checkpoint to ./temp/bert_qa-7eb11865.params


### Prepare the sample dataset

As in the question answering tutorial, we use the SQuAD dataset and create a subset of 10 samples for demonstration.

In [3]:
full_data = nlp.data.SQuAD(segment='dev', version='1.1')

# loading a subset of the dev set of SQuAD
num_target_samples = 10
target_samples = [full_data[i] for i in range(num_target_samples)]
dataset = mx.gluon.data.SimpleDataset(target_samples)

tokenizer = nlp.data.BERTTokenizer(vocab=vocab, lower=True)
transform = bert.data.qa.SQuADTransform(tokenizer, is_pad=False, is_training=False, do_lookup=False)
dev_data_transform, _ = bert.data.qa.preprocess_dataset(dataset, transform)


Done! Transform dataset costs 0.22 seconds.


We then convert the transformed texts to subword indices and prepare the dataloader.

In [4]:
def vocab_lookup(example_id, subwords, type_ids, length, start, end):
    indices = vocab[subwords]
    return example_id, indices, type_ids, length, start, end
dev_data_transform = dev_data_transform.transform(vocab_lookup, lazy=False)

batch_size = 1
dev_dataloader = mx.gluon.data.DataLoader(
    dev_data_transform, batch_size=batch_size, shuffle=False)

## Compile MXNet model with TVM

Now we convert the MXNet model into Relay. At this step we need to provide a mapping from input names to their shapes. The TVM frontend converter supports both MXNet static graphs (symbols) and HybridBlocks.

In [5]:
shape_dict = {
    'data0': (1, max_seq_length), # inputs
    'data1': (1, max_seq_length), # token types
    'data2': (1,) # sequence length
}
mod, params = relay.frontend.from_mxnet(net, shape_dict)
# uncomment the following line to see the converted model in Relay IR
# print(mod)
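
Because the compiled graph is specialized to the shapes in `shape_dict`, it can help to check input arrays against that mapping before feeding them to the runtime. The following is a minimal sketch using only NumPy; `check_shapes` is a hypothetical helper written for illustration and is not part of TVM or GluonNLP.

```python
import numpy as np

max_seq_length = 256  # same value as in the tutorial

# Same input names and shapes as the shape_dict passed to the converter.
shape_dict = {
    "data0": (1, max_seq_length),  # inputs
    "data1": (1, max_seq_length),  # token types
    "data2": (1,),                 # sequence length
}


def check_shapes(arrays, shapes):
    """Hypothetical helper: raise ValueError on a missing input or a shape mismatch."""
    for name, expected in shapes.items():
        if name not in arrays:
            raise ValueError("missing input %r" % name)
        actual = tuple(arrays[name].shape)
        if actual != tuple(expected):
            raise ValueError(
                "input %r has shape %s, expected %s" % (name, actual, expected))


# Placeholder arrays with the expected shapes.
feed = {
    "data0": np.zeros((1, max_seq_length), dtype="float32"),
    "data1": np.zeros((1, max_seq_length), dtype="float32"),
    "data2": np.array([max_seq_length], dtype="float32"),
}
check_shapes(feed, shape_dict)  # returns None when every shape matches
```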

### Load the AutoTVM logs and build module

Next, we load AutoTVM logs produced by a previous tuning run on c5.9x instances.

We won't cover how to tune kernels using AutoTVM in this tutorial. If you are interested, you can check the [auto tuning tutorial](https://docs.tvm.ai/tutorials/autotvm/tune_relay_x86.html).

In [6]:
log_dir = "autotvm_logs"
logs = [os.path.join(log_dir, f) for f in os.listdir(log_dir)]
autotvm_ctx = autotvm.apply_history_best(None)
for log_file in logs:
    autotvm_ctx.load(log_file)
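
Note that `os.listdir` returns every entry in the directory, including subdirectories. If the log directory might contain anything other than log files, a slightly more defensive listing could look like the sketch below. `list_log_files` is a hypothetical helper, and the file names in the demonstration are made up.

```python
import os
import tempfile


def list_log_files(log_dir):
    """Return the sorted paths of regular files directly inside log_dir."""
    paths = (os.path.join(log_dir, name) for name in sorted(os.listdir(log_dir)))
    return [p for p in paths if os.path.isfile(p)]


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "subdir"))  # this entry should be skipped
    for name in ("b.log", "a.log"):
        with open(os.path.join(tmp, name), "w") as fo:
            fo.write("")
    found = [os.path.basename(p) for p in list_log_files(tmp)]
```

Sorting the entries also makes the load order deterministic across runs.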

We then compile the model. We set the target CPU to Skylake with AVX-512 so that the compiler can use vectorized instructions for floating-point operations.

To compile for another device, e.g., an ARM CPU, we need to change the target string, e.g., to "llvm -device=arm_cpu -target=aarch64-linux-gnu".

In [7]:
target = "llvm -mcpu=skylake-avx512"
# change the target when compiling for an ARM CPU
# target = "llvm -device=arm_cpu -target=aarch64-linux-gnu"
with autotvm_ctx:
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(mod[mod.entry_func], target, params=params)

### Export library

Lastly, we export the library, the graph structure, and the parameters to files.

In [8]:
lib.export_library("deploy_lib.tar")
with open("deploy_graph.json", "w") as fo:
    fo.write(graph)
with open("deploy_param.params", "wb") as fo:
    fo.write(relay.save_param_dict(params))

## Evaluate TVM

We now load the graph, library, and parameters back from the files exported earlier, and create the graph runtime to execute the compiled graph.

In [9]:
import tvm.contrib.graph_runtime as runtime

# use context managers so the files are closed after reading
with open("deploy_graph.json") as fi:
    loaded_graph = fi.read()
loaded_lib = tvm.module.load("deploy_lib.tar")
with open("deploy_param.params", "rb") as fi:
    loaded_params = bytearray(fi.read())

tvm_ctx = tvm.cpu()
ex = runtime.create(loaded_graph, loaded_lib, tvm_ctx)
ex.load_params(loaded_params)

Note that the hybrid BERT model requires fixed-length inputs. Therefore, before we feed in the inputs and token types, we need to pad them to the maximum sequence length.

In [10]:
def pad(arr, length, pad_val, dtype="float32"):
    padded = np.full(shape=(1, length), fill_value=pad_val, dtype=dtype)
    padded[0, :arr.shape[1]] = arr.asnumpy()[0]
    return padded

tvm_results = collections.defaultdict(list)
example_ids, inputs, token_types, valid_length, _, _ = next(iter(dev_dataloader))
padded_inputs = pad(inputs, max_seq_length, vocab[vocab.padding_token])
padded_token_types = pad(token_types, max_seq_length, 0)
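
To see what the padding does without loading MXNet, here is a NumPy-only sketch of the same idea. `pad_np` is a hypothetical variant of `pad` that takes a plain NumPy array instead of an MXNet NDArray and adds a length check; the token ids and the pad value are made up.

```python
import numpy as np


def pad_np(arr, length, pad_val, dtype="float32"):
    """Right-pad a (1, n) NumPy array to shape (1, length) with pad_val."""
    if arr.shape[1] > length:
        raise ValueError("sequence is longer than the target length")
    padded = np.full(shape=(1, length), fill_value=pad_val, dtype=dtype)
    padded[0, :arr.shape[1]] = arr[0]
    return padded


tokens = np.array([[101, 2054, 102]])  # made-up token ids
padded = pad_np(tokens, length=8, pad_val=1)
print(padded.shape, padded.dtype)
```

The first three positions keep the original ids and the remaining five hold the pad value.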

Now let's run the graph runtime.

In [11]:
ex.set_input(data0=padded_inputs,
             data1=padded_token_types,
             data2=valid_length.astype('float32').asnumpy())
ex.run()
out = ex.get_output(0)
output = np.split(out.asnumpy(), axis=2, indices_or_sections=2)
example_ids = example_ids.asnumpy().tolist()
pred_start = output[0].reshape((1, -1))
pred_end = output[1].reshape((1, -1))

for example_id, start, end in zip(example_ids, pred_start, pred_end):
    tvm_results[example_id].append(PredResult(start=start, end=end))
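
The post-processing above splits the last axis of the output into start and end scores. The sketch below reproduces that step on random stand-in data, assuming the output has shape (batch, sequence length, 2) as the `axis=2` split implies. The final `argmax` is only a simplified illustration of picking a span; the decoding actually done by `qa_utils.predict` is not shown in this notebook.

```python
import numpy as np

seq_length = 6  # small stand-in for max_seq_length
rng = np.random.RandomState(0)
# Stand-in for out.asnumpy(): shape (batch, seq_length, 2).
out = rng.randn(1, seq_length, 2).astype("float32")

# Split the last axis into start scores and end scores.
start_logits, end_logits = np.split(out, axis=2, indices_or_sections=2)
pred_start = start_logits.reshape((1, -1))  # shape (1, seq_length)
pred_end = end_logits.reshape((1, -1))      # shape (1, seq_length)

# Greedy choice of the highest-scoring positions, for illustration only.
start_idx = int(np.argmax(pred_start[0]))
end_idx = int(np.argmax(pred_end[0]))
print(start_idx, end_idx)
```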

In [12]:
qa_utils.predict(dataset, tvm_results, vocab, number=1)


Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.

Question: which nfl team represented the afc at super bowl 50 ?

Top predictions: 
99.36% 	 Denver Broncos
0.23% 	 The American Football Conference (AFC) champion Denver Broncos
0.20% 	 Broncos



We also run the same example through MXNet to check the correctness of the TVM result.

In [13]:
mx_results = collections.defaultdict(list)
example_ids, inputs, token_types, valid_length, _, _ = next(iter(dev_dataloader))
padded_inputs = pad(inputs, max_seq_length, vocab[vocab.padding_token])
padded_token_types = pad(token_types, max_seq_length, 0)

out = net(mx.nd.array(padded_inputs, mx_ctx),
          mx.nd.array(padded_token_types, mx_ctx),
          valid_length.astype('float32').as_in_context(mx_ctx))
output = mx.nd.split(out, axis=2, num_outputs=2)
example_ids = example_ids.asnumpy().tolist()
pred_start = output[0].reshape((0, -3)).asnumpy()
pred_end = output[1].reshape((0, -3)).asnumpy()
for example_id, start, end in zip(example_ids, pred_start, pred_end):
    mx_results[example_id].append(PredResult(start=start, end=end))

In [14]:
qa_utils.predict(dataset, mx_results, vocab, number=1)


Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.

Question: which nfl team represented the afc at super bowl 50 ?

Top predictions: 
99.36% 	 Denver Broncos
0.23% 	 The American Football Conference (AFC) champion Denver Broncos
0.20% 	 Broncos
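
Matching top predictions are a good sign. A stricter check is to compare the raw scores numerically. The sketch below shows one way to do that with NumPy on stand-in arrays; in practice the two arrays would be `pred_start` (or `pred_end`) from the TVM and MXNet runs. The tolerances are assumptions and may need adjusting, since compiled kernels can reorder floating-point operations.

```python
import numpy as np


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two arrays."""
    return float(np.max(np.abs(np.asarray(a) - np.asarray(b))))


# Stand-in data: a reference output and a copy with a tiny perturbation.
rng = np.random.RandomState(0)
reference = rng.randn(1, 256).astype("float32")
candidate = reference + np.float32(1e-6)

diff = max_abs_diff(reference, candidate)
print("max abs diff: %.2e" % diff)
# Raises AssertionError if the arrays differ beyond the (assumed) tolerances.
np.testing.assert_allclose(candidate, reference, rtol=1e-4, atol=1e-4)
```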



## Benchmark TVM performance

This benchmark compares the mean inference time of TVM and MXNet for a single input of the maximum sequence length, averaged over 10 runs.

In [15]:
inputs = np.random.uniform(size=(1, max_seq_length)).astype('float32')
token_types = np.random.uniform(size=(1, max_seq_length)).astype('float32')
valid_length = np.asarray([max_seq_length]).astype('float32')
ex.set_input(data0=inputs, data1=token_types, data2=valid_length)

ftimer = ex.module.time_evaluator("run", tvm_ctx, number=10)
prof_res = np.array(ftimer().results) * 1000  # convert to millisecond
print("TVM mean inference time: %.2f ms" % np.mean(prof_res))

TVM mean inference time: 93.56 ms


In [16]:
inputs_nd = mx.nd.array(inputs)
token_types_nd = mx.nd.array(token_types)
valid_length_nd = mx.nd.array(valid_length)
# dry run
mx_out = net(inputs_nd, token_types_nd, valid_length_nd)
mx_out.wait_to_read()
# benchmark
number = 10
beg = time.time()
for _ in range(number):
    mx_out = net(inputs_nd, token_types_nd, valid_length_nd)
    mx_out.wait_to_read()
lat = (time.time() - beg) * 1e3 / number
print('MXNet mean inference time: %.2f ms' % lat)

MXNet mean inference time: 85.90 ms
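
A mean over 10 runs can be noisy. One way to make such measurements more robust is to add warm-up calls and repeat the timed loop several times so that the spread is visible. The helper below is a generic sketch, not part of TVM or MXNet; `benchmark` and its parameters are hypothetical, and the matrix product is only a stand-in workload. For MXNet, the timed function would need to include `wait_to_read()` as in the cell above, because execution is asynchronous.

```python
import time

import numpy as np


def benchmark(fn, warmup=2, repeat=5, number=10):
    """Return (mean, std) of the per-call latency of fn in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    per_call_ms = []
    for _ in range(repeat):
        beg = time.perf_counter()
        for _ in range(number):
            fn()
        per_call_ms.append((time.perf_counter() - beg) * 1e3 / number)
    return float(np.mean(per_call_ms)), float(np.std(per_call_ms))


# Stand-in workload: a small matrix product.
a = np.random.uniform(size=(64, 64)).astype("float32")
mean_ms, std_ms = benchmark(lambda: a @ a)
print("mean %.3f ms, std %.3f ms" % (mean_ms, std_ms))
```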
