<a href="https://colab.research.google.com/github/aquapapaya/BYOC/blob/main/How_BYOC_annotates_a_Relay_graph_(CUDA).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BYOC Demo
**Author**: [Kuen-Wey Lin](https://github.com/aquapapaya)

We use a simple Relay graph to walk through the BYOC (Bring Your Own Codegen) workflow.


In [1]:
%%shell
# Installs pre-built binaries including CUDA from https://tlcpack.ai/
pip install apache-tvm-cu102 -f https://tlcpack.ai/wheels

Looking in links: https://tlcpack.ai/wheels
Collecting apache-tvm-cu102
  Downloading https://github.com/tlc-pack/tlcpack/releases/download/v0.7.dev1/apache_tvm_cu102-0.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (403.1 MB)
Collecting synr==0.6.0 (from apache-tvm-cu102)
  Downloading synr-0.6.0-py3-none-any.whl (18 kB)
Installing collected packages: synr, apache-tvm-cu102
Successfully installed apache-tvm-cu102-0.9.0 synr-0.6.0




In [2]:
import tvm
from tvm import relay
import tvm.relay.testing

Since the entire Relay graph is fairly large, we use a simple Relay expression visitor to report the total number of operators it contains and how often each one occurs.

In [3]:
def profile_graph(func):
    class OpProfiler(tvm.relay.ExprVisitor):
        def __init__(self):
            super().__init__()
            self.ops = {}

        def visit_call(self, call):
            op = call.op
            if op not in self.ops:
                self.ops[op] = 0
            self.ops[op] += 1
            super().visit_call(call)

        def get_cuda_graph_num(self):
            cnt = 0
            for op in self.ops:
                if str(op).find("cuda") != -1:
                    cnt += 1
            return cnt

    profiler = OpProfiler()
    profiler.visit(func)
    print("Total number of operators: %d" % sum(profiler.ops.values()))
    print("Detail breakdown")
    for op, count in profiler.ops.items():
        print("\t%s: %d" % (op, count))
    print("cuda subgraph #: %d" % profiler.get_cuda_graph_num())
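
The counting idea behind `profile_graph` can be sketched without TVM. The snippet below is only an illustration under an assumed encoding: the nested-tuple tree and the `count_ops` helper are hypothetical stand-ins, not Relay data structures or APIs.

```python
from collections import Counter

# Toy expression tree (hypothetical encoding, not Relay's):
# a call is ("op_name", arg, arg, ...); a leaf is a plain string (a variable).
def count_ops(expr, counts=None):
    """Recursively count how often each operator appears in the tree."""
    if counts is None:
        counts = Counter()
    if isinstance(expr, tuple):
        op, *args = expr
        counts[op] += 1
        for arg in args:
            count_ops(arg, counts)
    return counts

toy_graph = ("nn.relu", ("nn.bias_add", ("nn.conv2d", "data", "weight"), "bias"))
op_counts = count_ops(toy_graph)
print("Total number of operators: %d" % sum(op_counts.values()))
for op, count in op_counts.items():
    print("\t%s: %d" % (op, count))
```

`OpProfiler` follows the same shape: `visit_call` records the operator and then defers to the base class to recurse into the arguments.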

Here we demonstrate how BYOC annotates a Relay graph.
Let's first define a simple Relay graph with supported and unsupported operators.



In [4]:
# Define the neural network
# Get the symbol definition and random weight of a network
mod, params = relay.testing.vgg.get_workload(batch_size=1, num_classes=1000,
    image_shape=(3, 224, 224), dtype='float32', num_layers=11
)
print(mod)
profile_graph(mod["main"])

def @main(%data: Tensor[(1, 3, 224, 224), float32] /* ty=Tensor[(1, 3, 224, 224), float32] */, %conv1_1_weight: Tensor[(64, 3, 3, 3), float32] /* ty=Tensor[(64, 3, 3, 3), float32] */, %conv1_1_bias: Tensor[(64), float32] /* ty=Tensor[(64), float32] */, %conv2_1_weight: Tensor[(128, 64, 3, 3), float32] /* ty=Tensor[(128, 64, 3, 3), float32] */, %conv2_1_bias: Tensor[(128), float32] /* ty=Tensor[(128), float32] */, %conv3_1_weight: Tensor[(256, 128, 3, 3), float32] /* ty=Tensor[(256, 128, 3, 3), float32] */, %conv3_1_bias: Tensor[(256), float32] /* ty=Tensor[(256), float32] */, %conv3_2_weight: Tensor[(256, 256, 3, 3), float32] /* ty=Tensor[(256, 256, 3, 3), float32] */, %conv3_2_bias: Tensor[(256), float32] /* ty=Tensor[(256), float32] */, %conv4_1_weight: Tensor[(512, 256, 3, 3), float32] /* ty=Tensor[(512, 256, 3, 3), float32] */, %conv4_1_bias: Tensor[(512), float32] /* ty=Tensor[(512), float32] */, %conv4_2_weight: Tensor[(512, 512, 3, 3), float32] /* ty=Tensor[(512, 512, 3, 3), flo

Then we define the annotation rules.
Developers can specify both operator-based and pattern-based annotation rules. Here, we declare the single operator `nn.dense` as supported. In addition, we define two supported patterns: `Conv2D - Bias - ReLU` and `Conv2D - ReLU`.



In [5]:
# Operator-based annotation rules
@tvm.ir.register_op_attr("nn.dense", "target.cuda")
def dense(expr):
    return True

# Pattern-based annotation rules
def make_pattern(with_bias=True):
    from tvm.relay.dataflow_pattern import is_op, wildcard
    data = wildcard()
    weight = wildcard()
    bias = wildcard()
    conv = is_op("nn.conv2d")(data, weight)
    if with_bias:
        conv_out = is_op("nn.bias_add")(conv, bias)
    else:
        conv_out = conv
    return is_op("nn.relu")(conv_out)

conv2d_bias_relu_pat = ("cuda.conv2d_relu_with_bias", make_pattern(with_bias=True))
conv2d_relu_pat = ("cuda.conv2d_relu_wo_bias", make_pattern(with_bias=False))
patterns = [conv2d_bias_relu_pat, conv2d_relu_pat]

Now let's perform pattern-based annotation:

In [6]:
mod2 = relay.transform.MergeComposite(patterns)(mod)
print(mod2)
profile_graph(mod2["main"])

def @main(%data: Tensor[(1, 3, 224, 224), float32] /* ty=Tensor[(1, 3, 224, 224), float32] */, %conv1_1_weight: Tensor[(64, 3, 3, 3), float32] /* ty=Tensor[(64, 3, 3, 3), float32] */, %conv1_1_bias: Tensor[(64), float32] /* ty=Tensor[(64), float32] */, %conv2_1_weight: Tensor[(128, 64, 3, 3), float32] /* ty=Tensor[(128, 64, 3, 3), float32] */, %conv2_1_bias: Tensor[(128), float32] /* ty=Tensor[(128), float32] */, %conv3_1_weight: Tensor[(256, 128, 3, 3), float32] /* ty=Tensor[(256, 128, 3, 3), float32] */, %conv3_1_bias: Tensor[(256), float32] /* ty=Tensor[(256), float32] */, %conv3_2_weight: Tensor[(256, 256, 3, 3), float32] /* ty=Tensor[(256, 256, 3, 3), float32] */, %conv3_2_bias: Tensor[(256), float32] /* ty=Tensor[(256), float32] */, %conv4_1_weight: Tensor[(512, 256, 3, 3), float32] /* ty=Tensor[(512, 256, 3, 3), float32] */, %conv4_1_bias: Tensor[(512), float32] /* ty=Tensor[(512), float32] */, %conv4_2_weight: Tensor[(512, 512, 3, 3), float32] /* ty=Tensor[(512, 512, 3, 3), flo

A composite function has two specialized attributes, `PartitionedFromPattern` and `Composite`:
*   PartitionedFromPattern: Indicates the operators in the function body.
*   Composite: Indicates the name of the pattern we defined.
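
To make the two attributes concrete, here is a minimal pure-Python sketch of pattern merging on a linear chain of operator names. It is a toy under stated assumptions: `TOY_PATTERNS` and `merge_composite` are hypothetical names, the matcher only handles straight chains (TVM's dataflow pattern matcher works on graphs), and the `PartitionedFromPattern` string only approximates the format TVM prints.

```python
# Patterns as (composite name, op sequence), longest first so the variant
# with bias wins when both could match. Names mirror the rules in this demo.
TOY_PATTERNS = [
    ("cuda.conv2d_relu_with_bias", ["nn.conv2d", "nn.bias_add", "nn.relu"]),
    ("cuda.conv2d_relu_wo_bias", ["nn.conv2d", "nn.relu"]),
]

def merge_composite(ops, patterns=TOY_PATTERNS):
    """Greedily replace matching runs of ops in a linear chain with composites."""
    out, i = [], 0
    while i < len(ops):
        for name, seq in patterns:
            if ops[i:i + len(seq)] == seq:
                out.append({
                    "Composite": name,
                    # Approximation of the attribute's format, for illustration.
                    "PartitionedFromPattern": "_".join(seq) + "_",
                    "body": seq,
                })
                i += len(seq)
                break
        else:
            out.append(ops[i])  # no pattern matched; keep the op unchanged
            i += 1
    return out

chain = ["nn.conv2d", "nn.bias_add", "nn.relu", "nn.max_pool2d", "nn.dense"]
merged = merge_composite(chain)
print(merged)
```

In this toy, the three matched operators collapse into one entry, while `nn.max_pool2d` and `nn.dense` are left for the operator-based rules.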

Next, let's continue to apply the operator-based annotation rules:

In [7]:
mod3 = relay.transform.AnnotateTarget("cuda")(mod2)
print(mod3)
profile_graph(mod3["main"])

def @main(%data: Tensor[(1, 3, 224, 224), float32] /* ty=Tensor[(1, 3, 224, 224), float32] */, %conv1_1_weight: Tensor[(64, 3, 3, 3), float32] /* ty=Tensor[(64, 3, 3, 3), float32] */, %conv1_1_bias: Tensor[(64), float32] /* ty=Tensor[(64), float32] */, %conv2_1_weight: Tensor[(128, 64, 3, 3), float32] /* ty=Tensor[(128, 64, 3, 3), float32] */, %conv2_1_bias: Tensor[(128), float32] /* ty=Tensor[(128), float32] */, %conv3_1_weight: Tensor[(256, 128, 3, 3), float32] /* ty=Tensor[(256, 128, 3, 3), float32] */, %conv3_1_bias: Tensor[(256), float32] /* ty=Tensor[(256), float32] */, %conv3_2_weight: Tensor[(256, 256, 3, 3), float32] /* ty=Tensor[(256, 256, 3, 3), float32] */, %conv3_2_bias: Tensor[(256), float32] /* ty=Tensor[(256), float32] */, %conv4_1_weight: Tensor[(512, 256, 3, 3), float32] /* ty=Tensor[(512, 256, 3, 3), float32] */, %conv4_1_bias: Tensor[(512), float32] /* ty=Tensor[(512), float32] */, %conv4_2_weight: Tensor[(512, 512, 3, 3), float32] /* ty=Tensor[(512, 512, 3, 3), flo

In [8]:
mod4 = relay.transform.MergeCompilerRegions()(mod3)
print(mod4)

def @main(%data: Tensor[(1, 3, 224, 224), float32] /* ty=Tensor[(1, 3, 224, 224), float32] */, %conv1_1_weight: Tensor[(64, 3, 3, 3), float32] /* ty=Tensor[(64, 3, 3, 3), float32] */, %conv1_1_bias: Tensor[(64), float32] /* ty=Tensor[(64), float32] */, %conv2_1_weight: Tensor[(128, 64, 3, 3), float32] /* ty=Tensor[(128, 64, 3, 3), float32] */, %conv2_1_bias: Tensor[(128), float32] /* ty=Tensor[(128), float32] */, %conv3_1_weight: Tensor[(256, 128, 3, 3), float32] /* ty=Tensor[(256, 128, 3, 3), float32] */, %conv3_1_bias: Tensor[(256), float32] /* ty=Tensor[(256), float32] */, %conv3_2_weight: Tensor[(256, 256, 3, 3), float32] /* ty=Tensor[(256, 256, 3, 3), float32] */, %conv3_2_bias: Tensor[(256), float32] /* ty=Tensor[(256), float32] */, %conv4_1_weight: Tensor[(512, 256, 3, 3), float32] /* ty=Tensor[(512, 256, 3, 3), float32] */, %conv4_1_bias: Tensor[(512), float32] /* ty=Tensor[(512), float32] */, %conv4_2_weight: Tensor[(512, 512, 3, 3), float32] /* ty=Tensor[(512, 512, 3, 3), flo

Almost all nodes in the graph are wrapped in `compiler_begin` and `compiler_end` annotations. Each `compiler_*` node has a `compiler` attribute that indicates which target the enclosed node should go to. In this example, it is either `default` or `cuda`.

Composite function calls are also annotated with `compiler=cuda`, indicating that this entire function can be offloaded.

We use the `MergeCompilerRegions` pass to merge adjacent regions that share a target, which minimizes the number of subgraphs.
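
The effect of merging can be illustrated with a toy that groups consecutive operators by target. This is a sketch under a strong simplifying assumption: the graph is a linear sequence, whereas the real pass operates on a dataflow graph. `merge_regions` and the sample annotations are hypothetical.

```python
from itertools import groupby

def merge_regions(annotated_ops):
    """Group consecutive ops with the same target into one region.

    annotated_ops: list of (op_name, target) pairs in execution order.
    Returns a list of (target, [op_name, ...]) regions.
    """
    return [
        (target, [op for op, _ in group])
        for target, group in groupby(annotated_ops, key=lambda pair: pair[1])
    ]

annotated = [
    ("conv2d_relu_with_bias", "cuda"),
    ("nn.max_pool2d", "default"),
    ("nn.batch_flatten", "default"),
    ("nn.dense", "cuda"),
    ("nn.relu", "default"),
]
regions = merge_regions(annotated)
for target, ops in regions:
    print(target, ops)
```

Five annotated operators become four regions here, because the two adjacent `default` operators share a single region.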

Finally, let's partition this graph:

In [9]:
mod5 = relay.transform.PartitionGraph()(mod4)
print(mod5)

def @main(%data: Tensor[(1, 3, 224, 224), float32] /* ty=Tensor[(1, 3, 224, 224), float32] */, %conv1_1_weight: Tensor[(64, 3, 3, 3), float32] /* ty=Tensor[(64, 3, 3, 3), float32] */, %conv1_1_bias: Tensor[(64), float32] /* ty=Tensor[(64), float32] */, %conv2_1_weight: Tensor[(128, 64, 3, 3), float32] /* ty=Tensor[(128, 64, 3, 3), float32] */, %conv2_1_bias: Tensor[(128), float32] /* ty=Tensor[(128), float32] */, %conv3_1_weight: Tensor[(256, 128, 3, 3), float32] /* ty=Tensor[(256, 128, 3, 3), float32] */, %conv3_1_bias: Tensor[(256), float32] /* ty=Tensor[(256), float32] */, %conv3_2_weight: Tensor[(256, 256, 3, 3), float32] /* ty=Tensor[(256, 256, 3, 3), float32] */, %conv3_2_bias: Tensor[(256), float32] /* ty=Tensor[(256), float32] */, %conv4_1_weight: Tensor[(512, 256, 3, 3), float32] /* ty=Tensor[(512, 256, 3, 3), float32] */, %conv4_1_bias: Tensor[(512), float32] /* ty=Tensor[(512), float32] */, %conv4_2_weight: Tensor[(512, 512, 3, 3), float32] /* ty=Tensor[(512, 512, 3, 3), flo

We can see that 8 subgraphs have been partitioned for `cuda`.



1.   @tvmgen_default_cuda_target_main_0
2.   @tvmgen_default_cuda_target_main_3
3.   @tvmgen_default_cuda_target_main_6
4.   @tvmgen_default_cuda_target_main_11
5.   @tvmgen_default_cuda_target_main_16
6.   @tvmgen_default_cuda_target_main_21
7.   @tvmgen_default_cuda_target_main_23
8.   @tvmgen_default_cuda_target_main_25

Each partitioned function will be sent to the `cuda` codegen for code generation.

As a result, the customized codegen only needs to handle these subgraphs and does not have to worry about the rest of the graph.
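
The division of labor can be sketched as follows. Everything here is a hypothetical stand-in (`toy_module`, `toy_cuda_codegen`, `run_external_codegen`); only the `Compiler` attribute name and the function-name scheme come from the partitioned output above.

```python
# Toy partitioned module: function name -> attributes and body
# (hypothetical encoding, not a tvm.IRModule).
toy_module = {
    "main": {
        "Compiler": None,
        "body": ["call tvmgen_default_cuda_target_main_0", "nn.max_pool2d"],
    },
    "tvmgen_default_cuda_target_main_0": {
        "Compiler": "cuda",
        "body": ["nn.conv2d", "nn.bias_add", "nn.relu"],
    },
}

def toy_cuda_codegen(name, body):
    """Stand-in for an external codegen: it sees one subgraph at a time."""
    return "// %s: %s" % (name, " -> ".join(body))

def run_external_codegen(module, compiler, codegen):
    """Hand every function tagged for `compiler` to `codegen`; skip the rest."""
    return {
        name: codegen(name, func["body"])
        for name, func in module.items()
        if func["Compiler"] == compiler
    }

generated = run_external_codegen(toy_module, "cuda", toy_cuda_codegen)
print(generated)
```

The stand-in codegen never sees `main`; it receives only the function tagged with `Compiler="cuda"`.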



# Build (optimize and generate code) your Relay IR

First, we build the original Relay IR to generate LLVM code and
run it on the CPU.

In [10]:
# Build the original Relay IR to generate LLVM code
with tvm.transform.PassContext(opt_level=3):
  lib = relay.build(mod, target="llvm", params=params)

#print(lib.get_lib().get_source()) # host code
#print(lib.get_lib().imported_modules[0].get_source()) # device code

print("Runtime module structure:")
print("\t %s" % str(lib.get_lib()))
for sub_mod in lib.get_lib().imported_modules:
  print("\t  |- %s" % str(sub_mod))

# Create the runtime module for the generated LLVM code
import tvm.contrib.graph_executor as runtime
run_mod = runtime.GraphModule(lib["default"](tvm.cpu(0)))

# Run the runtime module 10 times
import time
import numpy as np
times = []
for _ in range(10):
  start = time.time()
  run_mod.run()
  times.append(time.time() - start)
print("Median inference latency %.2f ms" % (1000 * np.median(times)))



Runtime module structure:
	 Module(llvm, 5c37c18e8108)
Median inference latency 1307.70 ms
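
The timing loop can be factored into a small reusable helper. The sketch below is one possible variant, not the notebook's method: `median_latency_ms` is a hypothetical name, it uses `time.perf_counter` instead of `time.time`, and it adds an untimed warm-up run, which the original loop does not have.

```python
import statistics
import time

def median_latency_ms(fn, repeat=10, warmup=1):
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return 1000 * statistics.median(samples)

# Stand-in workload; in this notebook the callable would be run_mod.run.
latency = median_latency_ms(lambda: sum(range(10000)))
print("Median inference latency %.2f ms" % latency)
```

Reporting the median rather than the mean keeps a single slow run from dominating the result.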


Then, we dispatch the convolution operators to the GPU to accelerate this model.

In [11]:
# Dispatch convolution operators to the GPU
from tvm.relay.expr_functor import ExprMutator

class ScheduleConv2d(ExprMutator):
    """Wrap every nn.conv2d call in an on_device annotation."""
    def __init__(self, device):
        self.device = device
        super().__init__()

    def visit_call(self, expr):
        # Visit the arguments first so that nested calls are rewritten too
        visit = super().visit_call(expr)
        if expr.op == tvm.relay.op.get("nn.conv2d"):
            return relay.annotation.on_device(visit, self.device)
        else:
            return visit

func = mod["main"]
sched = ScheduleConv2d("cuda")
func = sched.visit(func)
mod["main"] = func
print('Relay IR:\n', mod)

Relay IR:
 def @main(%data: Tensor[(1, 3, 224, 224), float32] /* ty=Tensor[(1, 3, 224, 224), float32] */, %conv1_1_weight: Tensor[(64, 3, 3, 3), float32] /* ty=Tensor[(64, 3, 3, 3), float32] */, %conv1_1_bias: Tensor[(64), float32] /* ty=Tensor[(64), float32] */, %conv2_1_weight: Tensor[(128, 64, 3, 3), float32] /* ty=Tensor[(128, 64, 3, 3), float32] */, %conv2_1_bias: Tensor[(128), float32] /* ty=Tensor[(128), float32] */, %conv3_1_weight: Tensor[(256, 128, 3, 3), float32] /* ty=Tensor[(256, 128, 3, 3), float32] */, %conv3_1_bias: Tensor[(256), float32] /* ty=Tensor[(256), float32] */, %conv3_2_weight: Tensor[(256, 256, 3, 3), float32] /* ty=Tensor[(256, 256, 3, 3), float32] */, %conv3_2_bias: Tensor[(256), float32] /* ty=Tensor[(256), float32] */, %conv4_1_weight: Tensor[(512, 256, 3, 3), float32] /* ty=Tensor[(512, 256, 3, 3), float32] */, %conv4_1_bias: Tensor[(512), float32] /* ty=Tensor[(512), float32] */, %conv4_2_weight: Tensor[(512, 512, 3, 3), float32] /* ty=Tensor[(512, 512,
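
The rewrite performed by the mutator can be pictured with a TVM-free toy. The nested-tuple encoding and `annotate_on_device` are hypothetical; the sketch only mirrors the bottom-up structure of `visit_call`, not Relay's actual `on_device` semantics.

```python
# Toy expression tree: a call is ("op_name", arg, ...); a leaf is a string.
def annotate_on_device(expr, op_name, device):
    """Rebuild the tree bottom-up, wrapping each call to op_name."""
    if not isinstance(expr, tuple):
        return expr  # leaf (variable name): nothing to rewrite
    op, *args = expr
    rebuilt = (op, *[annotate_on_device(a, op_name, device) for a in args])
    if op == op_name:
        return ("on_device", rebuilt, device)
    return rebuilt

toy_graph = ("nn.relu", ("nn.bias_add", ("nn.conv2d", "data", "weight"), "bias"))
annotated = annotate_on_device(toy_graph, "nn.conv2d", "cuda")
print(annotated)
```

Only the `nn.conv2d` call is wrapped; the surrounding `nn.bias_add` and `nn.relu` calls are rebuilt unchanged around it.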

In [12]:
# Build the Relay IR to generate LLVM code and CUDA code
with tvm.transform.PassContext(opt_level=3):
  graph, lib, params = relay.build(mod, target={'cpu':'llvm','gpu':'cuda'}, params=params)

# Create the runtime module for the generated LLVM code and CUDA code
from tvm.contrib import graph_runtime
ctx = [tvm.cpu(0), tvm.cuda(0)]
run_mod = graph_runtime.create(graph, lib, ctx)

# Run inference 10 times
import time
import numpy as np
times = []
for _ in range(10):
  start = time.time()
  run_mod.run()
  times.append(time.time() - start)
print("Median inference latency %.2f ms" % (1000 * np.median(times)))



Median inference latency 99.60 ms
