In [None]:
%matplotlib inline


Deploy a Quantized Model on CUDA
================================
**Author**: `Wuwei Lin <https://github.com/vinx13>`_

This article is an introductory tutorial on automatic quantization with TVM.
Automatic quantization is one of the quantization modes in TVM. More details on
the quantization story in TVM can be found
`here <https://discuss.tvm.ai/t/quantization-story/3920>`_.
In this tutorial, we will import a GluonCV model pre-trained on ImageNet into
Relay, quantize the Relay model, and then run inference.



### Install MXNet on the machine

Run `pip3 install mxnet-cu100` to install the MXNet build for CUDA 10.0.

In [2]:
import tvm
from tvm import te
from tvm import relay
import mxnet as mx
from tvm.contrib.download import download_testdata
from mxnet import gluon
import logging
import os

batch_size = 1
model_name = "resnet18_v1"
target = 'cuda'
ctx = tvm.context(target)

Prepare the Dataset
-------------------
We will demonstrate how to prepare the calibration dataset for quantization.
We first download the validation set of ImageNet and pre-process the dataset.



In [3]:
calibration_rec = download_testdata(
    'http://data.mxnet.io.s3-website-us-west-1.amazonaws.com/data/val_256_q90.rec',
    'val_256_q90.rec')

def get_val_data(num_workers=4):
    mean_rgb = [123.68, 116.779, 103.939]
    std_rgb = [58.393, 57.12, 57.375]

    def batch_fn(batch):
        return batch.data[0].asnumpy(), batch.label[0].asnumpy()

    img_size = 299 if model_name == 'inceptionv3' else 224
    val_data = mx.io.ImageRecordIter(
        path_imgrec=calibration_rec,
        preprocess_threads=num_workers,
        shuffle=False,
        batch_size=batch_size,
        resize=256,
        data_shape=(3, img_size, img_size),
        mean_r=mean_rgb[0],
        mean_g=mean_rgb[1],
        mean_b=mean_rgb[2],
        std_r=std_rgb[0],
        std_g=std_rgb[1],
        std_b=std_rgb[2],
    )
    return val_data, batch_fn

Downloading from url http://data.mxnet.io.s3-website-us-west-1.amazonaws.com/data/val_256_q90.rec to /home/hongbing/.tvm_test_data/val_256_q90.rec

The calibration dataset should be an iterable object. We define the
calibration dataset as a generator object in Python. In this tutorial, we
only use a few samples for calibration.



In [4]:
calibration_samples = 10

def calibrate_dataset():
    val_data, batch_fn = get_val_data()
    val_data.reset()
    for i, batch in enumerate(val_data):
        if i * batch_size >= calibration_samples:
            break
        data, _ = batch_fn(batch)
        yield {'data': data}

Import the model
----------------
We use the Relay MXNet frontend to import a model from the Gluon model zoo.



In [5]:
def get_model():
    gluon_model = gluon.model_zoo.vision.get_model(model_name, pretrained=True)
    img_size = 299 if model_name == 'inceptionv3' else 224
    data_shape = (batch_size, 3, img_size, img_size)
    mod, params = relay.frontend.from_mxnet(gluon_model, {"data": data_shape})
    return mod, params

Quantize the Model
------------------
In quantization, we need to find the scale for each weight and intermediate
feature map tensor of each layer.

For weights, the scales are calculated directly from the values of the
weights. Two modes are supported: `power2` and `max`. Both modes first find the
maximum value within the weight tensor. In `power2` mode, the maximum
is rounded down to a power of two. If the scales of both weights and
intermediate feature maps are powers of two, we can use bit shifting for
multiplications, which is computationally more efficient. In `max` mode,
the maximum is used as the scale. Without rounding, `max` mode might have
better accuracy in some cases. When the scales are not powers of two, fixed
point multiplications will be used.
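
As a rough illustration of the two weight-scale modes, the sketch below computes both scales for a small list of weights. It is not TVM's implementation: the helper names `max_scale` and `power2_scale` are hypothetical, taking the maximum in absolute value is an assumption, and the rounding direction simply follows the description above, so TVM's own code may differ in these details.

```python
import math


def max_scale(weights):
    """Scale in `max` mode: the largest absolute weight value (assumption)."""
    return max(abs(w) for w in weights)


def power2_scale(weights):
    """Scale in `power2` mode: the maximum rounded down to a power of two."""
    largest = max_scale(weights)
    if largest == 0:
        return 1.0  # arbitrary fallback for an all-zero tensor (assumption)
    return 2.0 ** math.floor(math.log2(largest))


weights = [0.12, -0.75, 0.31, 0.6]
print(max_scale(weights))     # 0.75
print(power2_scale(weights))  # 0.5

# Multiplying an integer by a power-of-two scale is equivalent to a bit shift.
shift = 3
assert 5 * 2 ** shift == 5 << shift
```

The last assertion shows why power-of-two scales are attractive: the multiplication can be replaced by a shift on integer values.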

For intermediate feature maps, we can find the scales with data-aware
quantization. Data-aware quantization takes a calibration dataset as an
input argument. Scales are calculated by minimizing the KL divergence between
the distributions of activations before and after quantization.
Alternatively, we can use pre-defined global scales. This saves the time
needed for calibration, but the accuracy might be affected.
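
The idea behind KL-divergence calibration can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the algorithm TVM uses internally. All helper names (`histogram`, `kl_divergence`, `fake_quantize`, `pick_threshold`), the candidate thresholds, the bin count and the synthetic Gaussian activations are hypothetical. For each candidate clipping threshold, the activations are quantized onto a symmetric int8-style grid, and the candidate whose quantized histogram is closest to the original one is kept.

```python
import math
import random


def histogram(values, num_bins, lo, hi, eps=1e-6):
    """Normalized histogram over [lo, hi]; eps avoids empty bins in the KL sum."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(idx, num_bins - 1))] += 1
    total = len(values) + eps * num_bins
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def fake_quantize(values, threshold, num_levels=127):
    """Clip to [-threshold, threshold] and round onto a symmetric grid."""
    step = threshold / num_levels
    return [round(max(-threshold, min(threshold, v)) / step) * step
            for v in values]


def pick_threshold(values, candidates, num_bins=64):
    """Return (threshold, kl) for the candidate with the smallest KL divergence."""
    lo, hi = min(values), max(values)
    reference = histogram(values, num_bins, lo, hi)
    scored = []
    for t in candidates:
        quantized = histogram(fake_quantize(values, t), num_bins, lo, hi)
        scored.append((t, kl_divergence(reference, quantized)))
    return min(scored, key=lambda item: item[1])


random.seed(0)
activations = [random.gauss(0.0, 1.0) for _ in range(2000)]  # synthetic data
candidates = [0.5, 1.0, 2.0, 4.0]
best_threshold, best_kl = pick_threshold(activations, candidates)
print(best_threshold, best_kl)
```

A small threshold clips many activations and distorts the tails of the distribution, while a very large one wastes quantization levels on values that rarely occur; minimizing the KL divergence is one way to balance the two.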



In [6]:
def quantize(mod, params, data_aware):
    if data_aware:
        with relay.quantize.qconfig(calibrate_mode='kl_divergence', weight_scale='max'):
            mod = relay.quantize.quantize(mod, params, dataset=calibrate_dataset())
    else:
        with relay.quantize.qconfig(calibrate_mode='global_scale', global_scale=8.0):
            mod = relay.quantize.quantize(mod, params)
    return mod

Run Inference
-------------
We create a Relay VM to build and execute the model.



In [7]:
def run_inference(mod):
    # Build the executor once and reuse its evaluation function for every batch.
    model = relay.create_executor('vm', mod, ctx, target).evaluate()
    val_data, batch_fn = get_val_data()
    for i, batch in enumerate(val_data):
        data, label = batch_fn(batch)
        prediction = model(data)
        if i > 10:  # only run inference on a few samples in this tutorial
            break

def main():
    mod, params = get_model()
    mod = quantize(mod, params, data_aware=True)
    run_inference(mod)

if __name__ == '__main__':
    main()

Downloading /home/hongbing/.mxnet/models/resnet18_v1-a0666292.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
...100%, 0.02 MB, 99 KB/s, 0 seconds passed
ANTLR runtime and generated code versions disagree: 4.8!=4.7.2
ANTLR runtime and generated code versions disagree: 4.8!=4.7.2


TVMError: Traceback (most recent call last):
  [bt] (8) /home/hongbing/Projects/tvm/build/libtvm.so(TVMFuncCall+0x65) [0x7f98cf83a3b5]
  [bt] (7) /home/hongbing/Projects/tvm/build/libtvm.so(+0xb3a7b6) [0x7f98cf7397b6]
  [bt] (6) /home/hongbing/Projects/tvm/build/libtvm.so(tvm::relay::vm::VMCompiler::Codegen()+0xb73) [0x7f98cf7390a3]
  [bt] (5) /home/hongbing/Projects/tvm/build/libtvm.so(tvm::build(tvm::Map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::Array<tvm::tir::LoweredFunc, void>, void, void> const&, tvm::Target const&, tvm::BuildConfig const&)+0x419) [0x7f98cf331859]
  [bt] (4) /home/hongbing/Projects/tvm/build/libtvm.so(tvm::build(tvm::Map<tvm::Target, tvm::Array<tvm::tir::LoweredFunc, void>, void, void> const&, tvm::Target const&, tvm::BuildConfig const&)+0x374) [0x7f98cf330e14]
  [bt] (3) /home/hongbing/Projects/tvm/build/libtvm.so(tvm::codegen::Build(tvm::IRModule, tvm::Target const&)+0x1df) [0x7f98cf35d70f]
  [bt] (2) /home/hongbing/Projects/tvm/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::IRModule)>::AssignTypedLambda<tvm::runtime::Module (*)(tvm::IRModule)>(tvm::runtime::Module (*)(tvm::IRModule))::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0x51) [0x7f98cf3ab0c1]
  [bt] (1) /home/hongbing/Projects/tvm/build/libtvm.so(tvm::codegen::BuildCUDA(tvm::IRModule)+0x780) [0x7f98cf7ca7f0]
  [bt] (0) /home/hongbing/Projects/tvm/build/libtvm.so(+0xc3681b) [0x7f98cf83581b]
  File "/home/hongbing/Projects/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 78, in cfun
    rv = local_pyfunc(*pyargs)
  File "/home/hongbing/Projects/tvm/python/tvm/autotvm/measure/measure_methods.py", line 597, in tvm_callback_cuda_compile
    ptx = nvcc.compile_cuda(code, target=target, arch=AutotvmGlobalScope.current.cuda_target_arch)
  File "/home/hongbing/Projects/tvm/python/tvm/contrib/nvcc.py", line 103, in compile_cuda
    raise RuntimeError(msg)
RuntimeError: Compilation error:
/tmp/tmp40vxtja_/my_kernel.cu(16): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(65): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(84): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(154): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(185): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(204): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(223): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(260): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(297): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(320): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(524): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(551): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(598): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(692): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(717): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(740): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(765): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(786): error: identifier "__dp4a" is undefined

/tmp/tmp40vxtja_/my_kernel.cu(809): error: identifier "__dp4a" is undefined

19 errors detected in the compilation of "/tmp/tmpxft_00002c9d_00000000-6_my_kernel.cpp1.ii".
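
The errors above suggest that nvcc compiled the generated kernels for an architecture that lacks the `__dp4a` intrinsic, which CUDA provides on devices of compute capability 6.1 and higher. A minimal pre-check could look like the sketch below. The helper `supports_dp4a` is hypothetical and not part of TVM, and the capability string is assumed to be obtained elsewhere, for example from the device query tools shipped with CUDA.

```python
def supports_dp4a(compute_capability):
    """Return True if a device with this compute capability provides `__dp4a`.

    `compute_capability` is a string such as "6.1" or "7.5".
    """
    major, minor = (int(part) for part in compute_capability.split("."))
    return (major, minor) >= (6, 1)


for capability in ["5.2", "6.0", "6.1", "7.5"]:
    print(capability, supports_dp4a(capability))
```

If the check fails for the target GPU, or if nvcc is asked to build for an older architecture than the GPU actually supports, the int8 kernels that rely on `__dp4a` cannot be compiled.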
