# Deployment with TVM


Pre-trained language representations have been shown to improve many downstream NLP tasks, such as question answering and natural language inference. Devlin et al. proposed BERT [1] (Bidirectional Encoder Representations from Transformers), which fine-tunes deep bidirectional representations on a wide range of tasks with minimal task-specific parameters and obtains state-of-the-art results.

In this tutorial, we will focus on adapting the BERT model for the question answering task on the SQuAD dataset. Specifically, we will:

- understand how to pre-process the SQuAD dataset to leverage the learnt representation in BERT,
- adapt the BERT model to the question answering task, and
- load a trained model to perform inference on the SQuAD dataset.


## Load MXNet, GluonNLP, and TVM

We first import the libraries:

In [1]:
import os
import collections, time, logging
import numpy as np
import gluonnlp as nlp
import mxnet as mx
import bert
import qa_utils
from bert.bert_qa_evaluate import PredResult, predict
from bert.export.hybrid_bert import HybridBERTForQA, get_hybrid_model
from bert.data.qa import SQuADTransform, preprocess_dataset
# TVM libraries
import tvm
from tvm import relay
from tvm import autotvm

## Load hybrid BERT model

The hybrid BERT model for QA currently does not support control flow, so we need to specify a maximum sequence length and pad every input to it.

In [2]:
from bert.export.hybrid_bert import HybridBERTForQA, get_hybrid_model
max_seq_length = 256
max_query_length = 64
base_model, vocab = get_hybrid_model(
    name="bert_12_768_12",
    dataset_name="book_corpus_wiki_en_uncased",
    pretrained=False,
    use_pooler=False,
    use_decoder=False,
    use_classifier=False,
    seq_length=max_seq_length)

net = HybridBERTForQA(base_model)
mx_ctx = mx.cpu()
ckpt = qa_utils.download_qa_ckpt()
net.load_parameters(ckpt, ctx=mx_ctx)

Downloaded checkpoint to ./temp/bert_qa-7eb11865.params
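
The padding requirement described above can be sketched in plain Python. This is only an illustration under stated assumptions: `pad_to_length` is a hypothetical helper, the token ids are made up, and the actual tutorial pads NumPy arrays later on. The point is that the model always sees `length` tokens, while the separately returned valid length records how many of them are real.

```python
def pad_to_length(token_ids, length, pad_id):
    """Right-pad a list of token ids to a fixed length.

    Returns the padded list and the number of real (non-padding) tokens.
    """
    valid_length = len(token_ids)
    if valid_length > length:
        raise ValueError(
            "sequence of %d tokens exceeds the fixed length %d"
            % (valid_length, length))
    padded = list(token_ids) + [pad_id] * (length - valid_length)
    return padded, valid_length


# Hypothetical token ids; real ids come from the BERT vocabulary.
padded, valid_length = pad_to_length([101, 2029, 5088, 102], length=8, pad_id=0)
print(padded)        # [101, 2029, 5088, 102, 0, 0, 0, 0]
print(valid_length)  # 4
```

Raising an error for over-long inputs, rather than silently truncating them, is a design choice of this sketch; a real pipeline may prefer to split long contexts into several windows.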


## Prepare the sample dataset

In [3]:
full_data = nlp.data.SQuAD(segment='dev', version='1.1')
# loading a subset of the dev set of SQuAD
num_target_samples = 20
target_samples = [full_data[i] for i in range(num_target_samples)]
dataset = mx.gluon.data.SimpleDataset(target_samples)
#print('Number of samples in the created dataset subsampled from SQuAD = %d'%len(dataset))

tokenizer = nlp.data.BERTTokenizer(vocab=vocab, lower=True)
transform = bert.data.qa.SQuADTransform(tokenizer, is_pad=False, is_training=False, do_lookup=False)
dev_data_transform, _ = bert.data.qa.preprocess_dataset(dataset, transform)
#print('The number of examples after preprocessing:{}'.format(len(dev_data_transform)))

def vocab_lookup(example_id, subwords, type_ids, length, start, end):
    indices = vocab[subwords]
    return example_id, indices, type_ids, length, start, end
dev_data_transform = dev_data_transform.transform(vocab_lookup, lazy=False)

batch_size = 1
dev_dataloader = mx.gluon.data.DataLoader(
    dev_data_transform, batch_size=batch_size, shuffle=False)

Done! Transform dataset costs 0.29 seconds.


## Compile MXNet model with TVM

Now we convert the MXNet model into Relay. The conversion needs a mapping from input names to their shapes. The TVM frontend converter supports both MXNet static graphs (symbols) and HybridBlocks.

In [4]:
shape_dict = {
    'data0': (1, max_seq_length), # inputs
    'data1': (1, max_seq_length), # token types
    'data2': (1,) # sequence length
}
mod, params = relay.frontend.from_mxnet(net, shape_dict)
# uncomment the following line to see the converted model in Relay IR
# print(mod)
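
Because the shapes are fixed at conversion time, every array fed to the compiled model later has to match them exactly. The sketch below shows one way to build the same mapping for an arbitrary batch size and to report mismatches before running inference. `make_shape_dict` and `find_shape_mismatches` are hypothetical helpers written for this illustration, not part of TVM or MXNet; only the input names `data0`, `data1`, and `data2` come from the cell above.

```python
def make_shape_dict(batch_size, seq_length):
    """Build the input-name-to-shape mapping used for conversion."""
    return {
        "data0": (batch_size, seq_length),  # token ids
        "data1": (batch_size, seq_length),  # token types
        "data2": (batch_size,),             # valid sequence lengths
    }


def find_shape_mismatches(expected, actual):
    """Return {name: (expected_shape, actual_shape)} for every mismatch."""
    mismatches = {}
    for name, shape in expected.items():
        got = actual.get(name)  # None if the input is missing entirely
        if got != shape:
            mismatches[name] = (shape, got)
    return mismatches


expected = make_shape_dict(batch_size=1, seq_length=256)
# Shapes of the arrays we are about to feed; data1 is deliberately wrong.
actual = {"data0": (1, 256), "data1": (1, 128), "data2": (1,)}
print(find_shape_mismatches(expected, actual))
# {'data1': ((1, 256), (1, 128))}
```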

### Load the AutoTVM logs and build module

Next, we load the AutoTVM search logs that were previously generated by tuning on c5.9x instances. These logs contain thousands of tuning results, and the best schedule found for each operator is applied during compilation.

In [5]:
log_dir = "autotvm_logs"
logs = [os.path.join(log_dir, f) for f in os.listdir(log_dir)]
autotvm_ctx = autotvm.apply_history_best(None)
for log_file in logs:
    autotvm_ctx.load(log_file)
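
Conceptually, applying the best history amounts to keeping, for each tuned workload, the measurement with the lowest cost. The sketch below illustrates that idea only. The record layout (`workload`, `config`, `cost`, `error`) is a simplified, hypothetical stand-in and not the actual AutoTVM log schema, and `best_records` is not how TVM implements the lookup internally.

```python
def best_records(records):
    """Keep the lowest-cost successful record for each workload."""
    best = {}
    for rec in records:
        if rec["error"]:
            continue  # skip measurements that failed to run
        key = rec["workload"]
        if key not in best or rec["cost"] < best[key]["cost"]:
            best[key] = rec
    return best


# Hypothetical tuning results; costs are mean run times in seconds.
records = [
    {"workload": "dense_768x768", "config": "tile=8", "cost": 0.0042, "error": False},
    {"workload": "dense_768x768", "config": "tile=16", "cost": 0.0031, "error": False},
    {"workload": "dense_768x768", "config": "tile=32", "cost": 0.0009, "error": True},
    {"workload": "dense_768x3072", "config": "tile=8", "cost": 0.0120, "error": False},
]

best = best_records(records)
for workload, rec in sorted(best.items()):
    print(workload, rec["config"], rec["cost"])
```

Note that the failed measurement is ignored even though it reports the smallest cost, which is why the sketch checks the error flag before comparing.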

Now we compile the model. We set the target CPU to Skylake with AVX-512 so that the compiler can use vectorized instructions for floating-point operations.

In [6]:
target = "llvm -mcpu=skylake-avx512"
# change the target when compiling on an ARM CPU
# target = "llvm -device=arm_cpu -target=aarch64-linux-gnu"
with autotvm_ctx:
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(mod[mod.entry_func], target, params=params)

## Evaluate TVM

We now create the graph runtime to execute the compiled graph.

In [7]:
import tvm.contrib.graph_runtime as runtime

tvm_ctx = tvm.cpu()
ex = runtime.create(graph, lib, tvm_ctx)
ex.set_input(**params)

In [8]:
def pad(arr, length, pad_val, dtype="float32"):
    padded = np.full(shape=(1, length), fill_value=pad_val, dtype=dtype)
    padded[0, :arr.shape[1]] = arr.asnumpy()[0]
    return padded

tvm_results = collections.defaultdict(list)
epoch_tic = time.time()
total_num = 0
for data in dev_dataloader:
    example_ids, inputs, token_types, valid_length, _, _ = data
    total_num += len(inputs)
    padded_inputs = pad(inputs, max_seq_length, vocab[vocab.padding_token])
    padded_token_types = pad(token_types, max_seq_length, 0)
    ex.set_input(data0=padded_inputs,
                 data1=padded_token_types,
                 data2=valid_length.astype('float32').asnumpy())
    ex.run()
    out = ex.get_output(0)
    output = np.split(out.asnumpy(), axis=2, indices_or_sections=2)
    example_ids = example_ids.asnumpy().tolist()
    pred_start = output[0].reshape((1, -1))
    pred_end = output[1].reshape((1, -1))

    for example_id, start, end in zip(example_ids, pred_start, pred_end):
        tvm_results[example_id].append(PredResult(start=start, end=end))

epoch_toc = time.time()
print('Time cost={:.2f} s, Throughput={:.2f} samples/s'.format(
    epoch_toc - epoch_tic, total_num/(epoch_toc - epoch_tic)))

qa_utils.predict(dataset, tvm_results, vocab, number=1)

Time cost=2.39 s, Throughput=8.36 samples/s

Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.

Question: which nfl team represented the afc at super bowl 50 ?

Top predictions: 
99.36% 	 Denver Broncos
0.23% 	 The American Football Conference (AFC) champion Denver 
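
The model emits one start score and one end score per token, and the predictions above are answer spans ranked by probability. The following sketch shows one common way to turn such scores into ranked spans. It is not the implementation of `qa_utils.predict`: the function names, the `max_answer_length` limit, the toy scores and tokens, and the choice to normalize probabilities over the kept candidates only are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_spans(start_logits, end_logits, n_best=2, max_answer_length=30):
    """Return the n_best (probability, start, end) spans, best first."""
    candidates = []
    for i, start_score in enumerate(start_logits):
        # A valid span ends at or after its start and is not too long.
        last = min(i + max_answer_length, len(end_logits))
        for j in range(i, last):
            candidates.append((start_score + end_logits[j], i, j))
    candidates.sort(reverse=True)
    candidates = candidates[:n_best]
    probs = softmax([score for score, _, _ in candidates])
    return [(p, i, j) for p, (_, i, j) in zip(probs, candidates)]


# Toy example with made-up tokens and scores.
tokens = ["the", "denver", "broncos", "won"]
start_logits = [0.1, 5.0, 0.2, 0.1]
end_logits = [0.0, 0.3, 6.0, 0.2]

spans = top_spans(start_logits, end_logits)
for prob, i, j in spans:
    print("%.2f%%\t%s" % (prob * 100, " ".join(tokens[i:j + 1])))
```

In this toy example the highest-scoring span covers tokens 1 through 2, that is, "denver broncos".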

In [9]:
mx_results = collections.defaultdict(list)

epoch_tic = time.time()
total_num = 0
for data in dev_dataloader:
    example_ids, inputs, token_types, valid_length, _, _ = data
    total_num += len(inputs)
    padded_inputs = pad(inputs, max_seq_length, vocab[vocab.padding_token])
    padded_token_types = pad(token_types, max_seq_length, 0)
    out = net(mx.nd.array(padded_inputs, mx_ctx),
              mx.nd.array(padded_token_types, mx_ctx),
              valid_length.astype('float32').as_in_context(mx_ctx))

    output = mx.nd.split(out, axis=2, num_outputs=2)
    example_ids = example_ids.asnumpy().tolist()
    pred_start = output[0].reshape((0, -3)).asnumpy()
    pred_end = output[1].reshape((0, -3)).asnumpy()

    for example_id, start, end in zip(example_ids, pred_start, pred_end):
        mx_results[example_id].append(PredResult(start=start, end=end))

epoch_toc = time.time()
print('Time cost={:.2f} s, Throughput={:.2f} samples/s'.format(
    epoch_toc - epoch_tic, total_num/(epoch_toc - epoch_tic)))

qa_utils.predict(dataset, mx_results, vocab, number=1)

Time cost=1.77 s, Throughput=11.28 samples/s

Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.

Question: which nfl team represented the afc at super bowl 50 ?

Top predictions: 
99.36% 	 Denver Broncos
0.23% 	 The American Football Conference (AFC) champion Denver

## Benchmark TVM performance

In [10]:
inputs = np.random.uniform(size=(1, max_seq_length)).astype('float32')
token_types = np.random.uniform(size=(1, max_seq_length)).astype('float32')
valid_length = np.asarray([max_seq_length]).astype('float32')

ftimer = ex.module.time_evaluator("run", tvm_ctx, number=20, min_repeat_ms=1000)
prof_res = np.array(ftimer().results) * 1000  # convert to millisecond
print("TVM mean inference time: %.2f ms" % np.mean(prof_res))

TVM mean inference time: 102.72 ms


In [11]:
inputs_nd = mx.nd.array(inputs)
token_types_nd = mx.nd.array(token_types)
valid_length_nd = mx.nd.array(valid_length)
mx_out = net(inputs_nd, token_types_nd, valid_length_nd)
mx_out.wait_to_read()

min_repeat_ms = 1000
number = 20
while True:
    beg = time.time()
    for _ in range(number):
        mx_out = net(inputs_nd, token_types_nd, valid_length_nd)
        mx_out.wait_to_read()
    end = time.time()
    lat = (end - beg) * 1e3
    if lat >= min_repeat_ms:
        break
    number = int(max(min_repeat_ms / (lat / number) + 1, number * 1.618))
print('MXNet mean inference time: %.2f ms' % (lat / number))

MXNet mean inference time: 85.93 ms
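
The MXNet timing loop above grows the number of calls until one batch of calls takes at least `min_repeat_ms`, which keeps timer overhead small relative to the measured work. The same pattern can be packaged as a reusable helper. This is a minimal sketch: `mean_latency_ms` is a hypothetical name, it uses `time.perf_counter` instead of `time.time`, and the sleeping workload merely stands in for a model call.

```python
import time


def mean_latency_ms(fn, number=20, min_repeat_ms=1000):
    """Call fn in batches until one batch lasts min_repeat_ms; return ms per call."""
    fn()  # warm-up call, excluded from the measurement
    while True:
        begin = time.perf_counter()
        for _ in range(number):
            fn()
        elapsed_ms = (time.perf_counter() - begin) * 1e3
        if elapsed_ms >= min_repeat_ms:
            return elapsed_ms / number
        # Batch was too short: estimate how many calls are needed, but grow
        # by at least a factor of 1.618 so the loop always makes progress.
        per_call_ms = max(elapsed_ms / number, 1e-6)  # guard against zero
        number = int(max(min_repeat_ms / per_call_ms + 1, number * 1.618))


# Stand-in workload: sleep for roughly 2 ms per call.
latency = mean_latency_ms(lambda: time.sleep(0.002), number=5, min_repeat_ms=20)
print("mean latency: %.2f ms" % latency)
```

For asynchronous frameworks such as MXNet, `fn` should block until the result is ready (for example by calling `wait_to_read()` as in the cell above); otherwise the loop would only measure the time to enqueue the work.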
