[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/uwsampl/tutorial/blob/master/notebook/02_TVM_Tutorial_Relay.ipynb)

Run the following block to make sure TVM is set up for *this notebook*; each notebook may have its own runtime.

In [41]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    ! gsutil cp "gs://tvm-fcrc-binariesd5fce43e-8373-11e9-bfb6-0242ac1c0002/tvm.tar.gz" /tmp/tvm.tar.gz
    ! mkdir -p /tvm
    ! tar -xf /tmp/tvm.tar.gz --strip-components=4 --directory /tvm
    ! ls -la /tvm
    ! bash /tvm/package.sh
    # Add TVM to the Python path.
    import sys
    sys.path.append('/tvm/python')
    sys.path.append('/tvm/topi/python')
    sys.path.append('/tvm/vta/python')
else:
    print("Notebook executing locally, skipping Colab setup ...")

Notebook executing locally, skipping Colab setup ...


# Relay: an Extensible Deep Learning IR

Last year TVM introduced Relay IR – a second generation high-level IR for deep learning. 

Relay's design comes from a simple insight: the critical difference between regular IRs
and deep learning IRs is the primitive values they manipulate. Relay combines well-known
insights from the programming languages community with TVM's existing
infrastructure to provide state-of-the-art performance.

If you are familiar with ideas from programming languages or with existing computation graph
representations, we will connect Relay to that knowledge during this tutorial.

We will first cover the design of Relay, then elaborate on how one can use it to
accomplish a wide variety of tasks. This part of the tutorial focuses directly
on Relay, but Relay will be present throughout all of today's content, since it serves
as the interface layer to TVM.

In [42]:
import tvm
from tvm import relay
import tvm.relay.testing
from tvm.relay.expr_functor import ExprMutator
import torch
import torchvision
import onnx
import numpy

## Language 

We will briefly introduce the concepts of Relay below before showing how to use Relay to accomplish specific tasks.
You can find a full language specification [here](https://docs.tvm.ai/langref/index.html).

### Variables 

In [43]:
# A single Relay variable; the string is only a name hint.
x = relay.var('x')

# A Relay variable with an explicit dtype (the default is float32).
x = relay.var('x', dtype='int32')

# A Relay variable with an explicit shape.
x = relay.var('x', shape=(10, 1))

### Operators

Relay provides high-performance operators defined in TVM that implement the primitive operations needed by deep learning applications. Operators can be applied to arguments just like regular Python or C++ functions. Common arithmetic operations are provided both via names and via operator overloading.

Variables can be used to construct Relay *expressions*, which replace the concept of graphs present in previous frameworks. A Relay expression can be viewed much like a graph with extra functionality, as we will see going forward.

In [44]:
w = relay.op.add(x, x)
print(w)

v0.0.1
free_var %x: Tensor[(10, 1), float32]
add(%x, %x)


In [45]:
z = x + x
print(z)

v0.0.1
free_var %x: Tensor[(10, 1), float32]
add(%x, %x)
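
To make the idea of building expressions through operator overloading concrete, here is a minimal pure-Python sketch. The `Expr`, `Var`, `Call`, `add` and `show` names are hypothetical stand-ins written for this illustration; they are not Relay's classes and only mimic the printed form seen above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Expr:
    """Base class of the toy expression tree (hypothetical, not a TVM class)."""

    def __add__(self, other: "Expr") -> "Call":
        # Overloading `+` builds an IR node instead of computing a value.
        return Call("add", (self, other))


@dataclass(frozen=True)
class Var(Expr):
    name: str


@dataclass(frozen=True)
class Call(Expr):
    op: str
    args: Tuple[Expr, ...]


def add(lhs: Expr, rhs: Expr) -> Call:
    """Named form of the same operator."""
    return Call("add", (lhs, rhs))


def show(expr: Expr) -> str:
    """Render an expression in a Relay-like surface syntax."""
    if isinstance(expr, Var):
        return f"%{expr.name}"
    return f"{expr.op}({', '.join(show(a) for a in expr.args)})"


x = Var("x")
w = add(x, x)  # named operator
z = x + x      # overloaded operator
print(show(w))  # add(%x, %x)
print(w == z)   # True
```

Both spellings produce the same tree, which is why the two printed Relay expressions above are identical.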


### Functions

The fundamental packaging of computation in Relay is the function. A function combines a set of inputs with a Relay expression. One view is that a function is no different from the functions found in programming languages today; another is that it replaces named subgraphs.

In [46]:
f = relay.Function([x], z)
print(f)

v0.0.1
fn (%x: Tensor[(10, 1), float32]) {
  add(%x, %x)
}


### Module

Finally, we can give functions a global name and package many of them together into a module. When we add a function to the module, it is type checked first.

When we print the module you can see the program annotated with all type information. 

In [47]:
mod = relay.Module({})
fname = relay.GlobalVar('f')
mod[fname] = f

print(mod)

v0.0.1
def @f(%x: Tensor[(10, 1), float32]) -> Tensor[(10, 1), float32] {
  add(%x, %x) /* ty=Tensor[(10, 1), float32] */
}
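
The "type check on insertion" behavior can be sketched with a toy module in plain Python. Everything here (`infer`, `ToyModule`, the tuple encoding of expressions) is a hypothetical illustration, and the `add` rule deliberately ignores broadcasting; it is not how `relay.Module` is implemented.

```python
from typing import Dict, Tuple

TensorType = Tuple[Tuple[int, ...], str]  # (shape, dtype)


def infer(expr: tuple, env: Dict[str, TensorType]) -> TensorType:
    """Infer the type of a toy expression: ("var", name) or ("add", lhs, rhs)."""
    kind = expr[0]
    if kind == "var":
        return env[expr[1]]
    if kind == "add":
        lhs, rhs = infer(expr[1], env), infer(expr[2], env)
        if lhs != rhs:  # simplification: no broadcasting in this sketch
            raise TypeError(f"add: {lhs} does not match {rhs}")
        return lhs
    raise TypeError(f"unknown expression kind: {kind}")


class ToyModule:
    """Global name -> checked function. Hypothetical stand-in, not relay.Module."""

    def __init__(self) -> None:
        self.functions: Dict[str, dict] = {}

    def __setitem__(self, name: str, func: Tuple[Dict[str, TensorType], tuple]) -> None:
        params, body = func
        # Type check before storing, so every stored function is well typed.
        ret_type = infer(body, params)
        self.functions[name] = {"params": params, "body": body, "ret_type": ret_type}


toy_mod = ToyModule()
x_var = ("var", "x")
toy_mod["f"] = ({"x": ((10, 1), "float32")}, ("add", x_var, x_var))
print(toy_mod.functions["f"]["ret_type"])
```

The inferred return type corresponds to the `-> Tensor[(10, 1), float32]` annotation that appears once the function has been added to the module.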



## Frontends

Relay comes with a variety of frontends and supports most major frameworks, including TensorFlow, PyTorch, MXNet, ONNX, Keras and Caffe2.

Below we provide a couple of examples of using these frontends to import models into Relay.

You can find specific tutorials on deploying pretrained models below:  

- [ONNX](https://docs.tvm.ai/tutorials/frontend/from_onnx.html#sphx-glr-tutorials-frontend-from-onnx-py)
- [TensorFlow](https://docs.tvm.ai/tutorials/frontend/from_tensorflow.html#sphx-glr-tutorials-frontend-from-tensorflow-py)
- [Keras](https://docs.tvm.ai/tutorials/frontend/from_keras.html#sphx-glr-tutorials-frontend-from-keras-py)
- [PyTorch](https://tvm.ai/2019/05/30/pytorch-frontend)
- [Caffe2](https://docs.tvm.ai/tutorials/frontend/from_caffe2.html#sphx-glr-tutorials-frontend-from-caffe2-py)

In [48]:
torch_resnet18 = torchvision.models.resnet18()
dummy_input = torch.randn(10, 3, 224, 224)
torch.onnx.export(torch_resnet18, dummy_input, "resnet.onnx", verbose=True)
onnx_resnet18 = onnx.load('resnet.onnx')
func, params = relay.frontend.from_onnx(onnx_resnet18, shape={ '0': (10, 3, 224, 224) })
print(func)

graph(%0 : Float(10, 3, 224, 224),
      %conv1.weight : Float(64, 3, 7, 7),
      %bn1.weight : Float(64),
      %bn1.bias : Float(64),
      %bn1.running_mean : Float(64),
      %bn1.running_var : Float(64),
      %bn1.num_batches_tracked : Long(),
      %layer1.0.conv1.weight : Float(64, 64, 3, 3),
      %layer1.0.bn1.weight : Float(64),
      %layer1.0.bn1.bias : Float(64),
      %layer1.0.bn1.running_mean : Float(64),
      %layer1.0.bn1.running_var : Float(64),
      %layer1.0.bn1.num_batches_tracked : Long(),
      %layer1.0.conv2.weight : Float(64, 64, 3, 3),
      %layer1.0.bn2.weight : Float(64),
      %layer1.0.bn2.bias : Float(64),
      %layer1.0.bn2.running_mean : Float(64),
      %layer1.0.bn2.running_var : Float(64),
      %layer1.0.bn2.num_batches_tracked : Long(),
      %layer1.1.conv1.weight : Float(64, 64, 3, 3),
      %layer1.1.bn1.weight : Float(64),
      %layer1.1.bn1.bias : Float(64),
      %layer1.1.bn1.running_mean : Float(64),
      %layer1.1.bn1.running_var



v0.0.1
fn (%v0: Tensor[(10, 3, 224, 224), float32], %conv1.weight: Tensor[(64, 3, 7, 7), float32], %bn1.weight: Tensor[(64,), float32], %bn1.bias: Tensor[(64,), float32], %bn1.running_mean: Tensor[(64,), float32], %bn1.running_var: Tensor[(64,), float32], %layer1.0.conv1.weight: Tensor[(64, 64, 3, 3), float32], %layer1.0.bn1.weight: Tensor[(64,), float32], %layer1.0.bn1.bias: Tensor[(64,), float32], %layer1.0.bn1.running_mean: Tensor[(64,), float32], %layer1.0.bn1.running_var: Tensor[(64,), float32], %layer1.0.conv2.weight: Tensor[(64, 64, 3, 3), float32], %layer1.0.bn2.weight: Tensor[(64,), float32], %layer1.0.bn2.bias: Tensor[(64,), float32], %layer1.0.bn2.running_mean: Tensor[(64,), float32], %layer1.0.bn2.running_var: Tensor[(64,), float32], %layer1.1.conv1.weight: Tensor[(64, 64, 3, 3), float32], %layer1.1.bn1.weight: Tensor[(64,), float32], %layer1.1.bn1.bias: Tensor[(64,), float32], %layer1.1.bn1.running_mean: Tensor[(64,), float32], %layer1.1.bn1.running_var: Tensor[(64,), floa

## Text Format

Relay has a textual representation that can be used to write and print programs. The textual format is still being stabilized but can already be of great use to users today. For example, instead of providing inscrutable graph representations of programs, we can produce human-readable output by default.

There are a few different ways to interact with the textual format. The first is to just print out a Relay expression, as we have seen above.

In [49]:
mlp, params = relay.testing.mlp.get_workload(1)

print(mlp)

v0.0.1
fn (%data: Tensor[(1, 1, 28, 28), float32], %fc1_weight: Tensor[(128, 784), float32], %fc1_bias: Tensor[(128,), float32], %fc2_weight: Tensor[(64, 128), float32], %fc2_bias: Tensor[(64,), float32], %fc3_weight: Tensor[(10, 64), float32], %fc3_bias: Tensor[(10,), float32]) -> Tensor[(1, 10), float32] {
  %0 = nn.batch_flatten(%data) /* ty=Tensor[(1, 784), float32] */
  %1 = nn.dense(%0, %fc1_weight, units=128) /* ty=Tensor[(1, 128), float32] */
  %2 = nn.bias_add(%1, %fc1_bias, axis=-1) /* ty=Tensor[(1, 128), float32] */
  %3 = nn.relu(%2) /* ty=Tensor[(1, 128), float32] */
  %4 = nn.dense(%3, %fc2_weight, units=64) /* ty=Tensor[(1, 64), float32] */
  %5 = nn.bias_add(%4, %fc2_bias, axis=-1) /* ty=Tensor[(1, 64), float32] */
  %6 = nn.relu(%5) /* ty=Tensor[(1, 64), float32] */
  %7 = nn.dense(%6, %fc3_weight, units=10) /* ty=Tensor[(1, 10), float32] */
  %8 = nn.bias_add(%7, %fc3_bias, axis=-1) /* ty=Tensor[(1, 10), float32] */
  nn.softmax(%8) /* ty=Tensor[(1, 10), float32] */
}

By default the textual format prints the version of the format and the code, without metadata. The metadata section of the format contains information such as constants. Imagine we perform an optimization such as inlining the parameters into the program for further optimization. Rendering these constants inline would make the text nearly unreadable, since common models often have hundreds of megabytes of parameters.

In [50]:
print(mlp.astext(show_meta_data=True))

v0.0.1
fn (%data: Tensor[(1, 1, 28, 28), float32], %fc1_weight: Tensor[(128, 784), float32], %fc1_bias: Tensor[(128,), float32], %fc2_weight: Tensor[(64, 128), float32], %fc2_bias: Tensor[(64,), float32], %fc3_weight: Tensor[(10, 64), float32], %fc3_bias: Tensor[(10,), float32]) -> Tensor[(1, 10), float32] {
  %0 = nn.batch_flatten(%data) /* ty=Tensor[(1, 784), float32] */
  %1 = nn.dense(%0, %fc1_weight, units=128) /* ty=Tensor[(1, 128), float32] */
  %2 = nn.bias_add(%1, %fc1_bias, axis=-1) /* ty=Tensor[(1, 128), float32] */
  %3 = nn.relu(%2) /* ty=Tensor[(1, 128), float32] */
  %4 = nn.dense(%3, %fc2_weight, units=64) /* ty=Tensor[(1, 64), float32] */
  %5 = nn.bias_add(%4, %fc2_bias, axis=-1) /* ty=Tensor[(1, 64), float32] */
  %6 = nn.relu(%5) /* ty=Tensor[(1, 64), float32] */
  %7 = nn.dense(%6, %fc3_weight, units=10) /* ty=Tensor[(1, 10), float32] */
  %8 = nn.bias_add(%7, %fc3_bias, axis=-1) /* ty=Tensor[(1, 10), float32] */
  nn.softmax(%8) /* ty=Tensor[(1, 10), float32] */
}

Relay's pretty printer also allows users to attach debugging output and metadata to the IR. For example, the output above shows type information on each expression, and we can customize the annotation process by passing a callback that annotates nodes.

In [51]:
i = 0 
def ann(*args):
    global i
    i += 1
    return f" <expression: {i}>"

print(mlp.astext(show_meta_data=True, annotate=ann))

v0.0.1
fn (%data: Tensor[(1, 1, 28, 28), float32], %fc1_weight: Tensor[(128, 784), float32], %fc1_bias: Tensor[(128,), float32], %fc2_weight: Tensor[(64, 128), float32], %fc2_bias: Tensor[(64,), float32], %fc3_weight: Tensor[(10, 64), float32], %fc3_bias: Tensor[(10,), float32]) -> Tensor[(1, 10), float32] {
  %0 = nn.batch_flatten <expression: 1>(%data) <expression: 2>
  %1 = nn.dense <expression: 3>(%0, %fc1_weight, units=128) <expression: 4>
  %2 = nn.bias_add <expression: 5>(%1, %fc1_bias, axis=-1) <expression: 6>
  %3 = nn.relu <expression: 7>(%2) <expression: 8>
  %4 = nn.dense <expression: 3>(%3, %fc2_weight, units=64) <expression: 9>
  %5 = nn.bias_add <expression: 5>(%4, %fc2_bias, axis=-1) <expression: 10>
  %6 = nn.relu <expression: 7>(%5) <expression: 11>
  %7 = nn.dense <expression: 3>(%6, %fc3_weight, units=10) <expression: 12>
  %8 = nn.bias_add <expression: 5>(%7, %fc3_bias, axis=-1) <expression: 13>
  nn.softmax <expression: 14>(%8) <expression: 15>
} <expression: 16>
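
In the output above every `nn.dense` carries the same `<expression: 3>` tag, which is consistent with the callback being invoked once per distinct node and the result being reused. The sketch below models that behavior with a toy printer; `Node`, `astext` and the cache are assumptions made for illustration and do not describe TVM's actual printer.

```python
import itertools
from typing import Callable, Dict, Sequence


class Node:
    """A toy IR node: a name plus argument nodes (hypothetical)."""

    def __init__(self, name: str, args: Sequence["Node"] = ()) -> None:
        self.name = name
        self.args = list(args)


def astext(root: Node, annotate: Callable[[Node], str]) -> str:
    """Print a node tree, appending annotate(node) after each node's name."""
    cache: Dict[int, str] = {}

    def note(node: Node) -> str:
        # Call the callback once per distinct node object and reuse the result.
        if id(node) not in cache:
            cache[id(node)] = annotate(node)
        return cache[id(node)]

    def render(node: Node) -> str:
        text = node.name + note(node)
        if node.args:
            text += "(" + ", ".join(render(a) for a in node.args) + ")"
        return text

    return render(root)


counter = itertools.count(1)


def ann(node: Node) -> str:
    return f" <expression: {next(counter)}>"


data = Node("%data")
shared = Node("relu", [data])
program = Node("add", [shared, shared])  # `shared` is referenced twice
print(astext(program, ann))
```

Because `shared` is a single object, both of its occurrences print with the same tag, mirroring the repeated numbers in the Relay output.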


Finally, an important part of the Relay text format is the ability to load Relay code
like a normal programming language. We can use the Relay parser to parse code; this is in fact how we define the Relay
*prelude*, the small standard library of utilities shipped with Relay.

## Executing Relay

Now that we have looked at how to write and manipulate a Relay program, we will show you how to run one. Relay has multiple execution mechanisms. The first is a custom *debug interpreter*, which can be used for experimentation and debugging. The second is TVM's graph runtime, the older, existing execution mechanism. The third is the Relay VM, a newly designed execution mechanism whose goal is to execute all of Relay efficiently.

We provide a high-level interface that imposes some wrapping overhead but enables quick experimentation with each one.

In [52]:
mod = relay.Module()
debug_ex = relay.create_executor('debug', mod=mod)
graph_ex = relay.create_executor('graph', mod=mod)
vm_ex = relay.create_executor('vm', mod=mod)

Each executor can be used to evaluate an expression given a Relay module. In this case we use an empty module and evaluate the same expression, an MLP, with each executor.

In [53]:
debug_mlp = debug_ex.evaluate(mlp)
graph_mlp = graph_ex.evaluate(mlp)
vm_mlp = vm_ex.evaluate(mlp)

Each one can be called like a normal Python function with the inputs passed as positional arguments and the parameters as keyword arguments.

In [58]:
data = numpy.random.rand(1, 1, 28, 28).astype('float32')
print("Debug: ", debug_mlp(data, **params))
print("Graph: ", graph_mlp(data, **params))
print("VM: ", vm_mlp(data, **params))

Debug:  [[0.12006255 0.15952827 0.06028504 0.07578056 0.05348781 0.17940904
  0.06953022 0.02311306 0.17301401 0.0857894 ]]
Graph:  [[0.12006255 0.15952827 0.06028504 0.07578056 0.05348781 0.17940904
  0.06953022 0.02311306 0.17301401 0.0857894 ]]
VM:  [[0.12006255 0.15952827 0.06028504 0.07578056 0.05348781 0.17940904
  0.06953022 0.02311306 0.17301401 0.0857894 ]]
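
To make the numbers above less opaque, here is a dependency-free sketch of what the printed MLP computes: three `dense` + `bias_add` layers with ReLU in between and a final softmax. The helper names and the random weights are ours, chosen for illustration, so the values will not match the executor output; the weight layout `(units, in_features)` follows the tensor types shown in the text format.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def dense(x: Matrix, weight: Matrix) -> Matrix:
    """y = x @ weight.T, with weight stored as (units, in_features)."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in weight] for row in x]


def bias_add(x: Matrix, bias: List[float]) -> Matrix:
    return [[v + b for v, b in zip(row, bias)] for row in x]


def relu(x: Matrix) -> Matrix:
    return [[max(v, 0.0) for v in row] for row in x]


def softmax(x: Matrix) -> Matrix:
    out = []
    for row in x:
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def mlp_forward(data: Matrix, layers: List[Tuple[Matrix, List[float]]]) -> Matrix:
    """Apply (weight, bias) layers with ReLU between them and softmax at the end."""
    h = data
    for i, (weight, bias) in enumerate(layers):
        h = bias_add(dense(h, weight), bias)
        if i < len(layers) - 1:
            h = relu(h)
    return softmax(h)


rng = random.Random(0)
sizes = [784, 128, 64, 10]  # layer widths taken from the printed MLP
layers = [
    ([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)], [0.0] * n_out)
    for n_in, n_out in zip(sizes, sizes[1:])
]
data = [[rng.random() for _ in range(784)]]  # already flattened to (1, 784)
probs = mlp_forward(data, layers)
print(len(probs[0]), sum(probs[0]))
```

The final softmax is why each executor prints ten non-negative values that sum to one.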


### Virtual Machine

In particular the Relay virtual machine is worth highlighting ... 


## Pass Manager

Relay has a flexible and configurable pass manager with an elegant API which can be used to easily compose and schedule pass pipelines. We believe an easy-to-configure pipeline is important to enable intelligent exploration between a variety of 
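
As a rough illustration of composing and ordering passes, the sketch below treats a program as a list of operator names and a pass as a function from program to program. The pass names and the `sequential` helper are hypothetical and are not the Relay pass manager API.

```python
from typing import Callable, List

Program = List[str]  # a toy "program": a list of operator names
Pass = Callable[[Program], Program]


def remove_identity(prog: Program) -> Program:
    """Drop no-op `identity` operators."""
    return [op for op in prog if op != "identity"]


def fuse_dense_bias(prog: Program) -> Program:
    """Rewrite an adjacent `dense`, `bias_add` pair into one `dense_bias` operator."""
    out: Program = []
    for op in prog:
        if op == "bias_add" and out and out[-1] == "dense":
            out[-1] = "dense_bias"
        else:
            out.append(op)
    return out


def sequential(passes: List[Pass]) -> Pass:
    """Compose passes into a single pass that applies them in order."""

    def run(prog: Program) -> Program:
        for p in passes:
            prog = p(prog)
        return prog

    return run


program = ["dense", "identity", "bias_add", "relu"]
pipeline = sequential([remove_identity, fuse_dense_bias])
print(pipeline(program))  # ['dense_bias', 'relu']

# Scheduling matters: fusing first misses the pair hidden behind `identity`.
reordered = sequential([fuse_dense_bias, remove_identity])
print(reordered(program))  # ['dense', 'bias_add', 'relu']
```

The two orderings give different results, which is the kind of trade-off a configurable pipeline lets you explore.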


## Optimizations

Defining optimizations to transform your program is straightforward in Relay.

For example, let's define a constant evaluator for Relay.
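
The notebook does not include the evaluator itself, so the following is only a sketch of the idea on a toy IR written in plain Python. `Const`, `Var`, `Add`, `Mul` and `fold_constants` are hypothetical names; a Relay version would instead be written against Relay's own expression classes.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Add:
    lhs: "Expr"
    rhs: "Expr"


@dataclass(frozen=True)
class Mul:
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Const, Var, Add, Mul]


def fold_constants(expr: Expr) -> Expr:
    """Bottom-up rewrite: evaluate any Add/Mul whose operands are both constants."""
    if isinstance(expr, (Const, Var)):
        return expr
    lhs, rhs = fold_constants(expr.lhs), fold_constants(expr.rhs)
    if isinstance(lhs, Const) and isinstance(rhs, Const):
        if isinstance(expr, Add):
            return Const(lhs.value + rhs.value)
        return Const(lhs.value * rhs.value)
    # At least one operand is not constant, so rebuild the node with folded children.
    return type(expr)(lhs, rhs)


# (2 * 3) + x folds to 6 + x; the variable blocks further evaluation.
program = Add(Mul(Const(2.0), Const(3.0)), Var("x"))
print(fold_constants(program))
```

The recursion visits children before parents, so a constant produced deep in the tree can enable further folding higher up.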

## Quantization

## Heterogeneous Execution

Relay supports a high-level interface for scheduling computation across multiple heterogeneous devices. An interesting property of this pass is that it is not special: it is built using Relay's standard machinery for
passes.

We implement this by using an annotation to mark which computations we would like to schedule on which device;
a pass then inserts the appropriate calls to synchronize memory across devices.

The pass below uses this machinery to schedule all convolutions onto the GPU.

In [11]:
class ScheduleConv2d(ExprMutator):
    def __init__(self, device):
        self.device = device
        super().__init__()

    def visit_call(self, expr):
        visit = super().visit_call(expr)
        if expr.op == tvm.relay.op.get("nn.conv2d"):
            return relay.annotation.on_device(visit, self.device)
        else:
            return visit

def schedule_conv2d_on_gpu(expr):
    sched = ScheduleConv2d(tvm.gpu(0))
    return sched.visit(expr)

Relay's testing library provides a few basic models, so we can grab one from there. By default, printing a model renders it in Relay's textual format.

In [12]:
# Load a ResNet workload from Relay's testing library.
resnet, params = relay.testing.resnet.get_workload()
print(resnet)

v0.0.1
fn (%data: Tensor[(1, 3, 224, 224), float32], %bn_data_gamma: Tensor[(3,), float32], %bn_data_beta: Tensor[(3,), float32], %bn_data_moving_mean: Tensor[(3,), float32], %bn_data_moving_var: Tensor[(3,), float32], %conv0_weight: Tensor[(64, 3, 7, 7), float32], %bn0_gamma: Tensor[(64,), float32], %bn0_beta: Tensor[(64,), float32], %bn0_moving_mean: Tensor[(64,), float32], %bn0_moving_var: Tensor[(64,), float32], %stage1_unit1_bn1_gamma: Tensor[(64,), float32], %stage1_unit1_bn1_beta: Tensor[(64,), float32], %stage1_unit1_bn1_moving_mean: Tensor[(64,), float32], %stage1_unit1_bn1_moving_var: Tensor[(64,), float32], %stage1_unit1_conv1_weight: Tensor[(64, 64, 3, 3), float32], %stage1_unit1_bn2_gamma: Tensor[(64,), float32], %stage1_unit1_bn2_beta: Tensor[(64,), float32], %stage1_unit1_bn2_moving_mean: Tensor[(64,), float32], %stage1_unit1_bn2_moving_var: Tensor[(64,), float32], %stage1_unit1_conv2_weight: Tensor[(64, 64, 3, 3), float32], %stage1_unit1_sc_weight: Tensor[(64, 64, 1, 1)

We can now run the customized pass we defined above to schedule individual convolutions on the GPU.

In [13]:
resnet = schedule_conv2d_on_gpu(resnet)
print(resnet)


v0.0.1
fn (%data: Tensor[(1, 3, 224, 224), float32], %bn_data_gamma: Tensor[(3,), float32], %bn_data_beta: Tensor[(3,), float32], %bn_data_moving_mean: Tensor[(3,), float32], %bn_data_moving_var: Tensor[(3,), float32], %conv0_weight: Tensor[(64, 3, 7, 7), float32], %bn0_gamma: Tensor[(64,), float32], %bn0_beta: Tensor[(64,), float32], %bn0_moving_mean: Tensor[(64,), float32], %bn0_moving_var: Tensor[(64,), float32], %stage1_unit1_bn1_gamma: Tensor[(64,), float32], %stage1_unit1_bn1_beta: Tensor[(64,), float32], %stage1_unit1_bn1_moving_mean: Tensor[(64,), float32], %stage1_unit1_bn1_moving_var: Tensor[(64,), float32], %stage1_unit1_conv1_weight: Tensor[(64, 64, 3, 3), float32], %stage1_unit1_bn2_gamma: Tensor[(64,), float32], %stage1_unit1_bn2_beta: Tensor[(64,), float32], %stage1_unit1_bn2_moving_mean: Tensor[(64,), float32], %stage1_unit1_bn2_moving_var: Tensor[(64,), float32], %stage1_unit1_conv2_weight: Tensor[(64, 64, 3, 3), float32], %stage1_unit1_sc_weight: Tensor[(64, 64, 1, 1)

We can then rewrite the device annotations away, which inserts the required copies.

In [14]:
resnet = relay.ir_pass.rewrite_annotated_ops(resnet, 0)
print(resnet)

v0.0.1
fn (%data: Tensor[(1, 3, 224, 224), float32], %bn_data_gamma: Tensor[(3,), float32], %bn_data_beta: Tensor[(3,), float32], %bn_data_moving_mean: Tensor[(3,), float32], %bn_data_moving_var: Tensor[(3,), float32], %conv0_weight: Tensor[(64, 3, 7, 7), float32], %bn0_gamma: Tensor[(64,), float32], %bn0_beta: Tensor[(64,), float32], %bn0_moving_mean: Tensor[(64,), float32], %bn0_moving_var: Tensor[(64,), float32], %stage1_unit1_bn1_gamma: Tensor[(64,), float32], %stage1_unit1_bn1_beta: Tensor[(64,), float32], %stage1_unit1_bn1_moving_mean: Tensor[(64,), float32], %stage1_unit1_bn1_moving_var: Tensor[(64,), float32], %stage1_unit1_conv1_weight: Tensor[(64, 64, 3, 3), float32], %stage1_unit1_bn2_gamma: Tensor[(64,), float32], %stage1_unit1_bn2_beta: Tensor[(64,), float32], %stage1_unit1_bn2_moving_mean: Tensor[(64,), float32], %stage1_unit1_bn2_moving_var: Tensor[(64,), float32], %stage1_unit1_conv2_weight: Tensor[(64, 64, 3, 3), float32], %stage1_unit1_sc_weight: Tensor[(64, 64, 1, 1)
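
Since the printed output is truncated before any copies become visible, here is a toy model of the rewrite step. It assumes a straight-line chain of operators, each optionally annotated with a device, and inserts a copy wherever the device changes. `insert_device_copies` and the string device names are hypothetical and only illustrate the idea, not Relay's implementation.

```python
from typing import List, Optional, Tuple

AnnotatedOp = Tuple[str, Optional[str]]  # (operator name, annotated device or None)


def insert_device_copies(ops: List[AnnotatedOp], fallback: str) -> List[Tuple[str, str]]:
    """Resolve annotations and insert a copy wherever the device changes.

    Models a straight-line chain in which each operator consumes the previous result.
    """
    placed: List[Tuple[str, str]] = []
    current: Optional[str] = None  # device holding the most recent result
    for name, device in ops:
        device = device or fallback  # unannotated operators run on the fallback device
        if current is not None and device != current:
            placed.append((f"device_copy[{current}->{device}]", device))
        placed.append((name, device))
        current = device
    return placed


chain = [("batch_norm", None), ("conv2d", "gpu"), ("relu", None)]
placed = insert_device_copies(chain, fallback="cpu")
for op, device in placed:
    print(f"{device}: {op}")
```

Only the annotated convolution moves to the GPU, and one copy appears on each side of it to move data across the device boundary.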

Finally, we will look at a couple of case studies of what can be built using Relay. We will first look at how Relay is used as a backend in PyTorch integration, then how Relay can be used to compile a model down to traditional hardware, and finally how it can be used to support a custom accelerator, VTA, which we will discuss in detail today.

## Ahead of time compilation

An example of what can be built using Relay can be found with the 

## PyTorch Integration

## VTA
TODO TALK WITH THIERRY