In [1]:
%matplotlib inline


# Auto-tuning a convolutional network on VTA
**Author**: [Lianmin Zheng](https://github.com/merrymercy), [Thierry Moreau](https://homes.cs.washington.edu/~moreau/)

Auto-tuning for a specific accelerator design is critical for getting the best
performance for any given operator. This is a tutorial showcases how to tune a
whole convolutional network on VTA.

The operator implementation for VTA in TVM is written in template form.
The template has many tunable knobs (tile factor, virtual threads, etc).
We will tune all convolution operators in the neural network. After tuning,
we produce a log file which stores the best schedule parameters for all tuned
operators. When the TVM compiler compiles these operators, it will query this
log file to get the best knob parameters.


## Install dependencies
To use the autotvm package in tvm, we need to install some extra dependencies.
(change "3" to "2" if you use python2):

```bash
pip3 install --user psutil xgboost tornado mxnet requests "Pillow<7" cloudpickle
```
To make TVM run faster during tuning, it is recommended to use cython
as FFI of TVM. In the root directory of TVM, execute
(change "3" to "2" if you use python2):

```bash
pip3 install --user cython
sudo make cython3
```
Now return to python code. Import packages.



In [2]:
import os
from mxnet.gluon.model_zoo import vision
import numpy as np
from PIL import Image

from tvm import topi
import tvm
from tvm import te
from tvm import rpc, autotvm, relay
from tvm.contrib import graph_executor, utils, download
from tvm.autotvm.measure.measure_methods import request_remote
from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner

import vta
from vta.testing import simulator
from vta.top import graph_pack
import torch
import torchvision

In [3]:
help(graph_pack)

Help on function graph_pack in module vta.top.graphpack:

graph_pack(expr, bfactor, cfactor, weight_bits, start_name='nn.max_pool2d', stop_name='nn.global_avg_pool2d', start_name_idx=None, stop_name_idx=None, count_meta=False, device_annot=False, annot_start_name='nn.conv2d', annot_end_name='annotation.stop_fusion')
    Pack the graph into batch&channel packed format.
    
    Parameters
    ----------
    expr : relay.Expr
       The input program.
    
    bfactor : int
       The packing factor in batch
    
    cfactor : int
       The packing factor in channel
    
    weight_bits: int
        The bit-width of the weights.
    
    start_name: str, optional
       Start packing from certain known node when start_name_idx is None.
    
    stop_name: str, optional
       Stop packing from certain known node when stop_name_idx is None.
    
    start_name_idx: int, optional
        When start_name_idx not None, start packing only when node name equal start_name
        and node idx 

## Compile network
Perform vta-specific compilation with Relay from a Gluon model



In [3]:
def compile_network_pytorch(env, target, model, start_pack, start_pack_idx, stop_pack, stop_pack_idx):
    
    input_name = "input0"
    
    # Populate the shape and data type dictionary
    dtype_dict = {"data": "float32"}
    shape_dict = {"data": (env.BATCH, 3, 224, 224)}

    # Get off the shelf gluon model, and convert to relay
    pytorch_model = torch.hub.load('pytorch/vision', model, pretrained=True)
    input_shape = [1, 3, 224, 224]
    input_data = torch.randn(input_shape)
    scripted_model = torch.jit.trace(pytorch_model, input_data).eval()
    
    shape_list = [(input_name, input_shape)]
    mod , params = relay.frontend.from_pytorch(scripted_model, shape_list)    

    # Update shape and type dictionary
    shape_dict.update({k: v.shape for k, v in params.items()})
    dtype_dict.update({k: str(v.dtype) for k, v in params.items()})

    # Perform quantization in Relay
    # Note: We set opt_level to 3 in order to fold batch norm
    with tvm.transform.PassContext(opt_level=3):
        with relay.quantize.qconfig(global_scale=8.0, skip_conv_layers=[0]):
            mod = relay.quantize.quantize(mod, params=params)

    # Perform graph packing and constant folding for VTA target
    if target.device_name == "vta":
        assert env.BLOCK_IN == env.BLOCK_OUT
        relay_prog = graph_pack(
            mod["main"],
            env.BATCH,
            env.BLOCK_OUT,
            env.WGT_WIDTH,
            start_name=start_pack,
            start_name_idx=start_pack_idx,
            stop_name=stop_pack,
            stop_name_idx=stop_pack_idx,
        )

    return relay_prog, params


def compile_network_mxnet(env, target, model, start_pack, start_pack_idx, stop_pack, stop_pack_idx):
        
    # Populate the shape and data type dictionary
    dtype_dict = {"data": "float32"}
    shape_dict = {"data": (env.BATCH, 3, 224, 224)}

    # Get off the shelf gluon model, and convert to relay
    gluon_model = vision.get_model(model, pretrained=True)
    mod, params = relay.frontend.from_mxnet(gluon_model, shape_dict)
    

    # Update shape and type dictionary
    shape_dict.update({k: v.shape for k, v in params.items()})
    dtype_dict.update({k: str(v.dtype) for k, v in params.items()})

    # Perform quantization in Relay
    # Note: We set opt_level to 3 in order to fold batch norm
    with tvm.transform.PassContext(opt_level=3):
        with relay.quantize.qconfig(global_scale=8.0, skip_conv_layers=[0]):
            mod = relay.quantize.quantize(mod, params=params)

    # Perform graph packing and constant folding for VTA target
    if target.device_name == "vta":
        assert env.BLOCK_IN == env.BLOCK_OUT
        relay_prog = graph_pack(
            mod["main"],
            env.BATCH,
            env.BLOCK_OUT,
            env.WGT_WIDTH,
            start_name=start_pack,
            start_name_idx=start_pack_idx,
            stop_name=stop_pack,
            stop_name_idx=stop_pack_idx,
        )

    return relay_prog, params

In [5]:
help(torchvision.models)

Help on package torchvision.models in torchvision:

NAME
    torchvision.models

PACKAGE CONTENTS
    alexnet
    densenet
    inception
    resnet
    squeezenet
    vgg

FILE
    /home/srchand/miniconda3/envs/tvm-build-clone/lib/python3.8/site-packages/torchvision/models/__init__.py




In [4]:
#input_name = "input0"
    
    # Populate the shape and data type dictionary
dtype_dict = {"data": "float32"}
shape_dict = {"data": (1, 3, 224, 224)}

# Get off the shelf gluon model, and convert to relay
#gluon_model = vision.get_model("resnet34_v2", pretrained=True)
pytorch_model = torch.hub.load('pytorch/vision', 'mobilenet_v2', pretrained=True)

input_shape = [1, 3, 224, 224]
input_data = torch.randn(input_shape)
scripted_model = torch.jit.trace(pytorch_model, input_data).eval()

# shape_list = [(input_name, input_shape)]
#mod , params = relay.frontend.from_mxnet(gluon_model, shape_dict)

with tvm.transform.PassContext(opt_level=3):
        with relay.quantize.qconfig(global_scale=8.0, skip_conv_layers=[0]):
            mod = relay.quantize.quantize(mod, params=params)
            
print(mod.astext(show_meta_data=False))

Using cache found in /home/srchand/.cache/torch/hub/pytorch_vision_main


ImportError: cannot import name 'get_weight' from 'torchvision.models' (/home/srchand/miniconda3/envs/tvm-build-clone/lib/python3.8/site-packages/torchvision/models/__init__.py)

## Start RPC Tracker
TVM uses an RPC session to communicate with Pynq boards.
During tuning, the tuner will send the generated code to the board and
measure the speed of code on the board.

To scale up tuning, TVM uses an RPC Tracker to manage multiple devices.
The RPC Tracker is a centralized controller node. We can register all devices to
the tracker. For example, if we have 10 Pynq boards, we can register all of them
to the tracker, and run 10 measurements in parallel, accelerating the tuning process.

To start an RPC tracker, run this command on the host machine. The tracker is
required during the whole tuning process, so we need to open a new terminal for
this command:

```bash
python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
```
The expected output is:

```bash
INFO:RPCTracker:bind to 0.0.0.0:9190
```


## Register devices to RPC Tracker
Now we can register our devices to the tracker. The first step is to
build the TVM runtime for the Pynq devices.

Follow `vta-index`
to build the TVM runtime on the device. Then register the device to the tracker with:

```bash
python -m tvm.exec.rpc_server --tracker=[HOST_IP]:9190 --key=pynq
```
(replace :code:`[HOST_IP]` with the IP address of your host machine)

After registering devices, we can confirm it by querying the rpc_tracker:

```bash
python -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190
```
For example, if we have 6 Pynq boards and 11 Raspberry Pi 3B,
the output can be

```bash
Queue Status
----------------------------------
key          total  free  pending
----------------------------------
pynq         6      6     0
rpi3b        11     11    0
----------------------------------
```
You can register multiple devices to the tracker to accelerate tuning.



## Set Tuning Options
Before tuning, we should apply some configurations.
Here we use an Pynq-Z1 board as an example.



In [5]:
# Tracker host and port can be set by your environment
tracker_host = os.environ.get("TVM_TRACKER_HOST", "127.0.0.1")
tracker_port = int(os.environ.get("TVM_TRACKER_PORT", 9190))

# Load VTA parameters from the 3rdparty/vta-hw/config/vta_config.json file
env = vta.get_env()

# This target is used for cross compilation. You can query it by :code:`gcc -v` on your device.
# Set ``device=arm_cpu`` to run inference on the CPU
# or ``device=vta`` to run inference on the FPGA.
device = "vta"
target = env.target if device == "vta" else env.target_vta_cpu

# Name of Gluon model to compile
# The ``start_pack`` and ``stop_pack`` labels indicate where
# to start and end the graph packing relay pass: in other words
# where to start and finish offloading to VTA.
network = "inceptionv3"
start_pack = "cast"
start_pack_idx=7
stop_pack = "nn.avg_pool2d"
stop_pack_idx = 966

# Tuning option
log_file = "%s.%s.log" % (device, network)
tuning_option = {
    "log_filename": log_file,
    "tuner": "random",
    "n_trial": 1000,
    "early_stopping": None,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.RPCRunner(
            env.TARGET,
            host=tracker_host,
            port=tracker_port,
            number=5,
            timeout=60,
            module_loader=vta.module_loader(),
            # check_correctness=True, # TODO: re-enable when check_correctness works again.
        ),
    ),
}

<div class="alert alert-info"><h4>Note</h4><p>How to set tuning options

  In general, the default values provided here work well.
  If you have enough time budget, you can set :code:`n_trial`, :code:`early_stopping`
  to larger values, makes the tuning run for longer.
  If your device is under-powered or your conv2d operators are large, consider
  setting a longer timeout.</p></div>




## Begin Tuning
Now we can extract tuning tasks from the network and begin tuning.
Here, we provide a simple utility function to tune a list of tasks.
This function is just an initial implementation which tunes them in sequential order.
We will introduce a more sophisticated tuning scheduler in the future.

Given that the tuning will be done on Pynq FPGA boards, make sure that
the ```TARGET`` entry in the ``vta_config.json`` file is set to ``pynq``.



In [6]:
# You can skip the implementation of this function for this tutorial.
def tune_tasks(
    tasks,
    measure_option,
    tuner="xgb",
    n_trial=1000,
    early_stopping=None,
    log_filename="tuning.log",
    use_transfer_learning=True,
):

    # create tmp log file
    tmp_log_file = log_filename + ".tmp"
    if os.path.exists(tmp_log_file):
        os.remove(tmp_log_file)

    for i, tsk in enumerate(reversed(tasks)):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))

        # create tuner
        if tuner == "xgb" or tuner == "xgb-rank":
            tuner_obj = XGBTuner(tsk, loss_type="rank")
        elif tuner == "xgb_knob":
            tuner_obj = XGBTuner(tsk, loss_type="rank", feature_type="knob")
        elif tuner == "ga":
            tuner_obj = GATuner(tsk, pop_size=50)
        elif tuner == "random":
            tuner_obj = RandomTuner(tsk)
        elif tuner == "gridsearch":
            tuner_obj = GridSearchTuner(tsk)
        else:
            raise ValueError("Invalid tuner: " + tuner)

        if use_transfer_learning:
            if os.path.isfile(tmp_log_file):
                tuner_obj.load_history(autotvm.record.load_from_file(tmp_log_file))

        # do tuning
        tsk_trial = min(n_trial, len(tsk.config_space))
        tuner_obj.tune(
            n_trial=tsk_trial,
            early_stopping=early_stopping,
            measure_option=measure_option,
            callbacks=[
                autotvm.callback.progress_bar(tsk_trial, prefix=prefix),
                autotvm.callback.log_to_file(tmp_log_file),
            ],
        )

    # pick best records to a cache file
    autotvm.record.pick_best(tmp_log_file, log_filename)
    os.remove(tmp_log_file)

Register VTA-specific tuning tasks



In [7]:
def register_vta_tuning_tasks():
    from tvm.autotvm.task import TaskExtractEnv

    @tvm.te.tag_scope(tag=topi.tag.ELEMWISE)
    def my_clip(x, a_min, a_max):
        """Unlike topi's current clip, put min and max into two stages."""
        const_min = tvm.tir.const(a_min, x.dtype)
        const_max = tvm.tir.const(a_max, x.dtype)
        x = te.compute(x.shape, lambda *i: tvm.te.min(x(*i), const_max), name="clipA")
        x = te.compute(x.shape, lambda *i: tvm.te.max(x(*i), const_min), name="clipB")
        return x

    # init autotvm env to register VTA operator
    TaskExtractEnv()

    @autotvm.template("conv2d_packed.vta")
    def _topi_nn_conv2d(*args, **kwargs):
        assert not kwargs, "Do not support kwargs in template function call"
        A, W = args[:2]

        with tvm.target.vta():
            res = vta.top.conv2d_packed(*args, **kwargs)
            res = topi.right_shift(res, 8)
            res = my_clip(res, 0, 127)
            res = topi.cast(res, "int8")

        if tvm.target.Target.current().device_name == "vta":
            s = vta.top.schedule_conv2d_packed([res])
        else:
            s = te.create_schedule([res.op])
        return s, [A, W, res]

Finally, we launch tuning jobs and evaluate the end-to-end performance.



In [8]:
def tune_and_evaluate(tuning_opt):

    # Register VTA tuning tasks
    register_vta_tuning_tasks()

    # Perform task extraction on Relay program
    print("Extract tasks...")
    relay_prog, params = compile_network(env, target, network, start_pack, start_pack_idx, stop_pack, stop_pack_idx)
    mod = tvm.IRModule.from_expr(relay_prog)
    tasks = autotvm.task.extract_from_program(
        mod,
        params=params,
        ops=(relay.op.get("nn.conv2d"),),
        target=target,
        target_host=env.target_host,
    )

    # filter out non-packed conv2d task
    tasks = list(filter(lambda t: len(t.args[0][1]) > 4 and "conv" in t.name, tasks))

    print(len(tasks))
    # We should have extracted 10 convolution tasks
    assert len(tasks) == 20
    print("Extracted {} conv2d tasks:".format(len(tasks)))
    for tsk in tasks:
        inp = tsk.args[0][1]
        wgt = tsk.args[1][1]
        batch = inp[0] * inp[4]
        in_filter = inp[1] * inp[5]
        out_filter = wgt[0] * wgt[4]
        height, width = inp[2], inp[3]
        hkernel, wkernel = wgt[2], wgt[3]
        hstride, wstride = tsk.args[2][0], tsk.args[2][1]
        hpad, wpad = tsk.args[3][0], tsk.args[3][1]
        print(
            "({}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {})".format(
                batch,
                height,
                width,
                in_filter,
                out_filter,
                hkernel,
                wkernel,
                hpad,
                wpad,
                hstride,
                wstride,
            )
        )

    # We do not run the tuning in our webpage server since it takes too long.
    # Comment the following line to run it by yourself.
    #return

    # run tuning tasks
    print("Tuning...")
    tune_tasks(tasks, **tuning_opt)
    
    
    #device_host = os.environ.get("VTA_RPC_HOST", "192.168.2.99")
    #device_port = os.environ.get("VTA_RPC_PORT", "9091")
    
    # evaluate with tuning history
    if env.TARGET != "sim":
        # Get remote from fleet node
        remote = autotvm.measure.request_remote(
           env.TARGET, tracker_host, tracker_port, timeout=10000
        )
        #remote = rpc.connect(device_host, int(device_port))
        # Reconfigure the JIT runtime and FPGA.
        vta.reconfig_runtime(remote)
        vta.program_fpga(remote, bitstream=None)
    else:
        # In simulation mode, host the RPC server locally.
        remote = rpc.LocalSession()

    # compile kernels with history best records
    with autotvm.tophub.context(target, extra_files=[log_file]):
        # Compile network
        print("Compile...")
        if target.device_name != "vta":
            with tvm.transform.PassContext(opt_level=3, disabled_pass={"AlterOpLayout"}):
                lib = relay.build(
                    relay_prog, target=target, params=params, target_host=env.target_host
                )
        else:
            with vta.build_config(opt_level=3, disabled_pass={"AlterOpLayout"}):
                lib = relay.build(
                    relay_prog, target=target, params=params, target_host=env.target_host
                )

        # Export library
        print("Upload...")
        temp = utils.tempdir()
        lib.export_library(temp.relpath("graphlib.tar"))
        remote.upload(temp.relpath("graphlib.tar"))
        lib = remote.load_module("graphlib.tar")

        # Generate the graph executor
        ctx = remote.ext_dev(0) if device == "vta" else remote.cpu(0)
        m = graph_executor.GraphModule(lib["default"](ctx))

        # upload parameters to device
        image = tvm.nd.array((np.random.uniform(size=(1, 3, 299, 299))).astype("float32"))
        m.set_input("data", image)

        # evaluate
        print("Evaluate inference time cost...")
        timer = m.module.time_evaluator("run", ctx, number=1, repeat=10)
        tcost = timer()
        prof_res = np.array(tcost.results) * 1000  # convert to millisecond
        print(
            "Mean inference time (std dev): %.2f ms (%.2f ms)"
            % (np.mean(prof_res), np.std(prof_res))
        )


# Run the tuning and evaluate the results
tune_and_evaluate(tuning_option)

Extract tasks...


Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/srchand/miniconda3/envs/tvm-build/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/srchand/miniconda3/envs/tvm-build/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/srchand/Desktop/research/TVM/tvm/python/tvm/autotvm/task/relay_integration.py", line 49, in _lower
    grc.codegen(mod, mod["main"])
  File "/home/srchand/Desktop/research/TVM/tvm/python/tvm/relay/backend/graph_executor_codegen.py", line 87, in codegen
    self._codegen(ir_module, func, default_mod_name)
  File "/home/srchand/Desktop/research/TVM/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
ValueError: Traceback (most recent call last):
  157: TVMFuncCall
  156: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::relay::backend::GraphExecutorCodegenModule::GetFuncti

20
Extracted 20 conv2d tasks:
(1, 149, 149, 32, 32, 3, 3, 0, 0, 1, 1)
(1, 147, 147, 32, 64, 3, 3, 1, 1, 1, 1)
(1, 73, 73, 64, 80, 1, 1, 0, 0, 1, 1)
(1, 73, 73, 80, 192, 3, 3, 0, 0, 1, 1)
(1, 35, 35, 192, 64, 1, 1, 0, 0, 1, 1)
(1, 35, 35, 192, 48, 1, 1, 0, 0, 1, 1)
(1, 35, 35, 48, 64, 5, 5, 2, 2, 1, 1)
(1, 35, 35, 64, 96, 3, 3, 1, 1, 1, 1)
(1, 35, 35, 96, 96, 3, 3, 1, 1, 1, 1)
(1, 35, 35, 192, 32, 1, 1, 0, 0, 1, 1)
(1, 35, 35, 256, 64, 1, 1, 0, 0, 1, 1)
(1, 35, 35, 256, 48, 1, 1, 0, 0, 1, 1)
(1, 35, 35, 288, 64, 1, 1, 0, 0, 1, 1)
(1, 35, 35, 288, 48, 1, 1, 0, 0, 1, 1)
(1, 35, 35, 288, 384, 3, 3, 0, 0, 2, 2)
(1, 35, 35, 96, 96, 3, 3, 0, 0, 2, 2)
(1, 17, 17, 768, 192, 1, 1, 0, 0, 1, 1)
(1, 17, 17, 768, 128, 1, 1, 0, 0, 1, 1)
(1, 17, 17, 128, 128, 1, 7, 0, 3, 1, 1)
(1, 17, 17, 128, 192, 7, 1, 3, 0, 1, 1)
Tuning...
[Task  1/20]  Current/Best:    0.00/  42.36 GFLOPS | Progress: (384/384) | 52.63 s Done.
[Task  2/20]  Current/Best:    4.84/  41.32 GFLOPS | Progress: (256/256) | 276.42 s Done.



ValueError: Traceback (most recent call last):
  172: TVMFuncCall
  171: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::relay::backend::RelayBuildModule::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#3}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  170: tvm::relay::backend::RelayBuildModule::Build(tvm::IRModule, tvm::runtime::Map<tvm::Integer, tvm::Target, void, void> const&, tvm::Target const&, tvm::relay::Executor const&, tvm::relay::Runtime const&, tvm::runtime::String)
  169: tvm::relay::backend::RelayBuildModule::BuildRelay(tvm::IRModule, tvm::runtime::String const&)
  168: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::relay::backend::GraphExecutorCodegenModule::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#2}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  167: tvm::relay::backend::GraphExecutorCodegen::Codegen(tvm::IRModule, tvm::relay::Function, tvm::runtime::String)
  166: tvm::transform::Pass::operator()(tvm::IRModule) const
  165: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  164: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  163: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  162: tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  161: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::IRModule (tvm::IRModule, tvm::transform::PassContext)>::AssignTypedLambda<tvm::relay::tec::LowerTEPass(tvm::runtime::String const&, std::function<void (tvm::BaseFunc)>)::{lambda(tvm::IRModule, tvm::transform::PassContext)#1}>(tvm::relay::tec::LowerTEPass(tvm::runtime::String const&, std::function<void (tvm::BaseFunc)>)::{lambda(tvm::IRModule, tvm::transform::PassContext)#1})::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  160: tvm::relay::tec::LowerTE(tvm::IRModule const&, tvm::runtime::String const&, std::function<void (tvm::BaseFunc)>)
  159: tvm::transform::Pass::operator()(tvm::IRModule) const
  158: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  157: tvm::relay::transform::FunctionPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  156: tvm::runtime::TypedPackedFunc<tvm::relay::Function (tvm::relay::Function, tvm::IRModule, tvm::transform::PassContext)>::AssignTypedLambda<tvm::relay::tec::LowerTensorExpr(tvm::runtime::String const&, tvm::relay::tec::TECompiler, std::function<void (tvm::BaseFunc)>)::{lambda(tvm::relay::Function, tvm::IRModule, tvm::transform::PassContext)#1}>(tvm::relay::tec::LowerTensorExpr(tvm::runtime::String const&, tvm::relay::tec::TECompiler, std::function<void (tvm::BaseFunc)>)::{lambda(tvm::relay::Function, tvm::IRModule, tvm::transform::PassContext)#1})::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const, tvm::runtime::TVMRetValue) const
  155: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  154: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  153: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::FunctionNode const*)
  152: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::FunctionNode const*)
  151: _ZN3tvm5relay9transform22DeviceAwareExprMutator21DeviceAwareVisit
  150: tvm::relay::ExprMutator::VisitExpr_(tvm::relay::FunctionNode const*)
  149: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  148: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  147: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  146: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  145: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  144: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  143: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  142: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  141: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  140: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  139: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  138: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  137: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  136: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  135: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  134: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  133: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  132: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  131: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  130: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  129: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  128: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  127: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  126: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  125: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  124: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  123: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  122: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  121: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  120: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  119: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  118: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  117: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  116: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  115: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  114: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  113: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  112: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  111: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  110: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  109: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  108: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  107: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  106: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  105: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  104: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  103: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  102: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  101: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  100: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  99: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  98: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  97: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  96: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  95: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  94: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  93: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  92: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  91: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  90: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  89: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  88: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  87: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  86: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  85: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  84: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  83: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  82: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  81: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  80: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  79: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  78: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  77: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  76: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  75: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  74: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  73: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  72: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  71: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  70: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  69: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  68: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  67: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  66: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  65: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  64: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  63: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  62: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  61: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  60: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  59: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  58: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  57: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  56: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  55: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  54: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  53: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  52: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  51: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  50: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  49: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  48: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  47: tvm::relay::transform::DeviceAwareExprMutator::VisitExpr_(tvm::relay::CallNode const*)
  46: tvm::relay::tec::LowerTensorExprMutator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  45: tvm::relay::tec::LowerTensorExprMutator::MakeLoweredCall(tvm::relay::Function, tvm::runtime::Array<tvm::RelayExpr, void>, tvm::runtime::Array<tvm::Type, void>, tvm::Span, tvm::Target)
  44: tvm::relay::tec::TECompilerImpl::Lower(tvm::relay::tec::CCacheKey const&, tvm::runtime::String)
  43: tvm::relay::tec::TECompilerImpl::LowerInternal(tvm::relay::tec::CCacheKey const&, std::function<tvm::runtime::String (tvm::runtime::String)>)
  42: tvm::LowerSchedule(tvm::te::Schedule, tvm::runtime::Array<tvm::te::Tensor, void> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<tvm::te::Tensor, tvm::tir::Buffer, std::hash<tvm::te::Tensor>, std::equal_to<tvm::te::Tensor>, std::allocator<std::pair<tvm::te::Tensor const, tvm::tir::Buffer> > > const&, bool)
  41: tvm::LowerSchedule(tvm::te::Schedule, tvm::runtime::Array<tvm::runtime::ObjectRef, void> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<tvm::te::Tensor, tvm::tir::Buffer, std::hash<tvm::te::Tensor>, std::equal_to<tvm::te::Tensor>, std::allocator<std::pair<tvm::te::Tensor const, tvm::tir::Buffer> > > const&, bool)
  40: tvm::LowerWithPassList(tvm::IRModule, tvm::runtime::Array<tvm::transform::Pass, void>)
  39: tvm::transform::Pass::operator()(tvm::IRModule) const
  38: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  37: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  36: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  35: tvm::tir::transform::PrimFuncPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  34: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::tir::PrimFunc (tvm::tir::PrimFunc, tvm::IRModule, tvm::transform::PassContext)>::AssignTypedLambda<tvm::tir::transform::InjectCopyIntrin(tvm::runtime::String, tvm::runtime::PackedFunc)::{lambda(tvm::tir::PrimFunc, tvm::IRModule, tvm::transform::PassContext)#1}>(tvm::tir::transform::InjectCopyIntrin(tvm::runtime::String, tvm::runtime::PackedFunc)::{lambda(tvm::tir::PrimFunc, tvm::IRModule, tvm::transform::PassContext)#1})::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  33: _ZZN3tvm3tir11StmtFunctorIFNS0_4StmtERKS2_EE10InitVTableEvENUlRKNS_7runtime
  32: tvm::tir::StmtMutator::VisitStmt_(tvm::tir::ForNode const*)
  31: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
  30: _ZZN3tvm3tir11StmtFunctorIFNS0_4StmtERKS2_EE10InitVTableEvENUlRKNS_7runtime
  29: tvm::tir::StmtMutator::VisitStmt_(tvm::tir::ForNode const*)
  28: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
  27: _ZZN3tvm3tir11StmtFunctorIFNS0_4StmtERKS2_EE10InitVTableEvENUlRKNS_7runtime
  26: tvm::tir::StmtMutator::VisitStmt_(tvm::tir::ForNode const*)
  25: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
  24: _ZZN3tvm3tir11StmtFunctorIFNS0_4StmtERKS2_EE10InitVTableEvENUlRKNS_7runtime
  23: tvm::tir::StmtMutator::VisitStmt_(tvm::tir::AllocateNode const*)
  22: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
  21: _ZZN3tvm3tir11StmtFunctorIFNS0_4StmtERKS2_EE10InitVTableEvENUlRKNS_7runtime9Ob
  20: tvm::tir::StmtMutator::VisitStmt_(tvm::tir::SeqStmtNode const*)
  19: void tvm::runtime::Array<tvm::tir::Stmt, void>::MutateByApply<tvm::tir::StmtMutator::Internal::Mutate(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, void> const&)::{lambda(tvm::tir::Stmt const&)#1}>(tvm::tir::StmtMutator::Internal::Mutate(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, void> const&)::{lambda(tvm::tir::Stmt const&)#1})
  18: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
  17: _ZZN3tvm3tir11StmtFunctorIFNS0_4StmtERKS2_EE10InitVTableEvENUlRKNS_7runtime9Ob
  16: tvm::tir::StmtMutator::VisitStmt_(tvm::tir::SeqStmtNode const*)
  15: void tvm::runtime::Array<tvm::tir::Stmt, void>::MutateByApply<tvm::tir::StmtMutator::Internal::Mutate(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, void> const&)::{lambda(tvm::tir::Stmt const&)#1}>(tvm::tir::StmtMutator::Internal::Mutate(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, void> const&)::{lambda(tvm::tir::Stmt const&)#1})
  14: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
  13: _ZZN3tvm3tir11StmtFunctorIFNS0_4StmtERKS2_EE10InitVTableEvENUlRKNS_7runtime
  12: tvm::tir::StmtMutator::VisitStmt_(tvm::tir::ForNode const*)
  11: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
  10: _ZZN3tvm3tir11StmtFunctorIFNS0_4StmtERKS2_EE10InitVTableEvENUlRKNS_7runtime
  9: tvm::tir::StmtMutator::VisitStmt_(tvm::tir::AllocateNode const*)
  8: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
  7: _ZZN3tvm3tir11StmtFunctorIFNS0_4StmtERKS2_EE10InitVTableEvENUlRKNS_7runtime9Ob
  6: tvm::tir::StmtMutator::VisitStmt_(tvm::tir::SeqStmtNode const*)
  5: void tvm::runtime::Array<tvm::tir::Stmt, void>::MutateByApply<tvm::tir::StmtMutator::Internal::Mutate(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, void> const&)::{lambda(tvm::tir::Stmt const&)#1}>(tvm::tir::StmtMutator::Internal::Mutate(tvm::tir::StmtMutator*, tvm::runtime::Array<tvm::tir::Stmt, void> const&)::{lambda(tvm::tir::Stmt const&)#1})
  4: tvm::tir::StmtMutator::VisitStmt(tvm::tir::Stmt const&)
  3: _ZZN3tvm3tir11StmtFunctorIFNS0_4StmtERKS2_EE10InitVTableEvENUlRKNS_7runtime
  2: tvm::tir::CopyIntrinInjector::VisitStmt_(tvm::tir::AttrStmtNode const*)
  1: tvm::tir::CopyIntrinInjector::MatchCopyPattern(tvm::tir::Stmt, tvm::tir::Stmt*)
  0: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), TVMFuncCreateFromCFunc::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#2}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&) [clone .cold]
  File "/home/srchand/Desktop/research/TVM/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 81, in cfun
    rv = local_pyfunc(*pyargs)
  File "/home/srchand/Desktop/research/TVM/tvm/vta/python/vta/transform.py", line 557, in _inject_copy
    raise ValueError("Limitation of 2D pad load forbid ndim=%d" % ndim)
ValueError: Limitation of 2D pad load forbid ndim=2

## Sample Output
The tuning needs to compile many programs and extract feature from them.
So a high performance CPU is recommended.
One sample output is listed below.
It takes about 2 hours on a 16T CPU, and 6 Pynq boards.

```bash
Extract tasks...
[Warning] Invalid shape during AutoTVM task creation
Extracted 10 conv2d tasks:
    Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 16, 14, 14, 1, 16), 'int8'), ('TENSOR', (32, 16, 1, 1, 16, 16), 'int8'), (2, 2), (0, 0), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 16, 14, 14, 1, 16, 'int8'), (32, 16, 1, 1, 16, 16, 'int8'), (2, 2), (0, 0), (1, 1), 'NCHW1n16c', 'int32'))
    Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 8, 28, 28, 1, 16), 'int8'), ('TENSOR', (16, 8, 1, 1, 16, 16), 'int8'), (2, 2), (0, 0), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 8, 28, 28, 1, 16, 'int8'), (16, 8, 1, 1, 16, 16, 'int8'), (2, 2), (0, 0), (1, 1), 'NCHW1n16c', 'int32'))
    Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 4, 56, 56, 1, 16), 'int8'), ('TENSOR', (8, 4, 1, 1, 16, 16), 'int8'), (2, 2), (0, 0), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 4, 56, 56, 1, 16, 'int8'), (8, 4, 1, 1, 16, 16, 'int8'), (2, 2), (0, 0), (1, 1), 'NCHW1n16c', 'int32'))
    Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 4, 56, 56, 1, 16), 'int8'), ('TENSOR', (4, 4, 3, 3, 16, 16), 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 4, 56, 56, 1, 16, 'int8'), (4, 4, 3, 3, 16, 16, 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'))
    Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 8, 28, 28, 1, 16), 'int8'), ('TENSOR', (8, 8, 3, 3, 16, 16), 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 8, 28, 28, 1, 16, 'int8'), (8, 8, 3, 3, 16, 16, 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'))
    Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 4, 56, 56, 1, 16), 'int8'), ('TENSOR', (8, 4, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 4, 56, 56, 1, 16, 'int8'), (8, 4, 3, 3, 16, 16, 'int8'), (2, 2), (1, 1), (1, 1), 'NCHW1n16c', 'int32'))
    Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 16, 14, 14, 1, 16), 'int8'), ('TENSOR', (16, 16, 3, 3, 16, 16), 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 16, 14, 14, 1, 16, 'int8'), (16, 16, 3, 3, 16, 16, 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'))
    Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 8, 28, 28, 1, 16), 'int8'), ('TENSOR', (16, 8, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 8, 28, 28, 1, 16, 'int8'), (16, 8, 3, 3, 16, 16, 'int8'), (2, 2), (1, 1), (1, 1), 'NCHW1n16c', 'int32'))
    Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 32, 7, 7, 1, 16), 'int8'), ('TENSOR', (32, 32, 3, 3, 16, 16), 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 32, 7, 7, 1, 16, 'int8'), (32, 32, 3, 3, 16, 16, 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'))
    Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 16, 14, 14, 1, 16), 'int8'), ('TENSOR', (32, 16, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 16, 14, 14, 1, 16, 'int8'), (32, 16, 3, 3, 16, 16, 'int8'), (2, 2), (1, 1), (1, 1), 'NCHW1n16c', 'int32'))
Tuning...
[Task  1/10]  Current/Best:    0.72/  23.24 GFLOPS | Progress: (480/1000) | 640.31 s Done.
[Task  2/10]  Current/Best:    0.00/  27.69 GFLOPS | Progress: (576/1000) | 810.09 s Done.
[Task  3/10]  Current/Best:    0.00/  22.97 GFLOPS | Progress: (1000/1000) | 1125.37 s Done.
[Task  4/10]  Current/Best:    0.00/  31.26 GFLOPS | Progress: (1000/1000) | 1025.52 s Done.
[Task  5/10]  Current/Best:    0.00/  15.15 GFLOPS | Progress: (1000/1000) | 1236.58 s Done.
[Task  6/10]  Current/Best:    0.00/  22.74 GFLOPS | Progress: (1000/1000) | 906.60 s Done.
[Task  7/10]  Current/Best:    0.00/  15.27 GFLOPS | Progress: (1000/1000) | 1056.25 s Done.
[Task  8/10]  Current/Best:    0.00/   2.18 GFLOPS | Progress: (1000/1000) | 2275.29 s Done.
[Task  9/10]  Current/Best:    2.23/   3.99 GFLOPS | Progress: (1000/1000) | 2527.25 s Done.
[Task 10/10]  Current/Best:    1.56/   6.32 GFLOPS | Progress: (480/1000) | 1304.84 s Done.
Compile...
Upload...
Evaluate inference time cost...
Mean inference time (std dev): 621.79 ms (0.14 ms)
```


<div class="alert alert-info"><h4>Note</h4><p>**Experiencing Difficulties?**

  The auto tuning module is error-prone. If you always see " 0.00/ 0.00 GFLOPS",
  then there must be something wrong.

  First, make sure you set the correct configuration of your device.
  Then, you can print debug information by adding these lines in the beginning
  of the script. It will print every measurement result, where you can find useful
  error messages.

```python
import logging
logging.getLogger('autotvm').setLevel(logging.DEBUG)
```
  Finally, always feel free to ask our community for help on https://discuss.tvm.apache.org</p></div>

