# Unity BYOC Tutorial

Bring-Your-Own-Codegen (BYOC) is the interface that TVM offers for integrating external libraries such as TensorRT, CUTLASS, and DNNL. This document gives a high-level overview of how to use BYOC in TVM Unity in a composable and modular way.

## User-level Guide

### Setup

Build TVM with your BYOC backend enabled in `config.cmake`.
For example, to use TensorRT:

```cmake
set(USE_TENSORRT_CODEGEN ON)
set(USE_TENSORRT_RUNTIME ON)
```
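
After rebuilding, it can help to confirm that the BYOC pieces were actually compiled in. The sketch below probes TVM's global function registry. The two registry names are assumptions about how the TensorRT backend registers itself and may differ in your TVM version; `probe_byoc_support` is a hypothetical helper, not a TVM API.

```python
def probe_byoc_support(
    names=("relax.ext.tensorrt", "relax.is_tensorrt_runtime_enabled"),
):
    """Return {registry_name: bool} telling whether each global function is registered."""
    try:
        import tvm
    except (ImportError, OSError):
        # TVM itself cannot be loaded, so nothing can be registered.
        return {name: False for name in names}
    # allow_missing=True makes get_global_func return None instead of raising.
    return {
        name: tvm.get_global_func(name, allow_missing=True) is not None
        for name in names
    }


print(probe_byoc_support())
```

If an entry is reported as `False`, the corresponding option was most likely not enabled at build time, or the registry name assumed here does not match your build.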

### Basic workflow

Unity BYOC offers a pattern-based offloading mechanism: users define a set of operator patterns they want to run with their library of interest and apply a pass sequence to perform the offloading. The following example showcases the end-to-end workflow. 

(1) Prepare a model you want to optimize.

```python
import tvm
from tvm import relax
from tvm.script.parser import relax as R
import numpy as np

# Define an example IRModule
@tvm.script.ir_module
class InputModule:
    @R.function
    def main(
        x: R.Tensor((16, 16), "float32"), y: R.Tensor((16, 16), "float32")
    ) -> R.Tensor((16, 16), "float32"):
        with R.dataflow():
            z1 = R.multiply(x, y)
            z2 = R.add(z1, x)
            z3 = R.add(z1, z2)
            z4 = R.multiply(z3, z2)
            z5 = R.add(z4, z1)
            R.output(z5)
        return z5


mod = InputModule
```

(2) Define a list of operator patterns that you want to match and execute with TensorRT. In this example, we match every `multiply` and `add` operator so that the whole graph is offloaded. Each pattern is written in the Relax pattern language and named using the `${LIBRARY_NAME}.${PATTERN_NAME}` convention.

```python
from tvm.relax.dpl import is_op, wildcard

patterns = [
    ("tensorrt.multiply", is_op("relax.multiply")(wildcard(), wildcard())),
    ("tensorrt.add", is_op("relax.add")(wildcard(), wildcard())),
]
```
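
The naming convention matters because the part before the dot identifies the library that should handle the match. A minimal sketch of that convention follows; `split_pattern_name` is a hypothetical helper for illustration, not a TVM API.

```python
def split_pattern_name(name):
    """Split "library.pattern" into (library, pattern), rejecting malformed names."""
    library, sep, pattern = name.partition(".")
    if not (library and sep and pattern):
        raise ValueError(f"expected '<library>.<pattern>', got {name!r}")
    return library, pattern


for name in ("tensorrt.multiply", "tensorrt.add"):
    print(split_pattern_name(name))
```

Running such a check over the pattern list before applying any pass can catch a misspelled or missing library prefix early.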

Although this example uses simple patterns, patterns can be more complicated if necessary. For instance, you may want to match only when the inputs have certain shapes, or to match a certain sequence of operators. To learn more about the Relax pattern language, please see [this reference](https://github.com/tlc-pack/relax/issues/160).
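
As an illustration of a sequence pattern, the sketch below nests one pattern inside another so that an `add` feeding a `multiply` is matched as a single unit. The name `tensorrt.add_multiply` is hypothetical, and whether a given codegen can handle the fused group is not something this sketch checks. Listing the larger pattern first reflects the assumption that earlier entries take priority when several patterns could match.

```python
def build_sequence_patterns():
    """Return a pattern list that also matches `add` followed by `multiply`."""
    # Imported inside the function so the helper can be defined without TVM loaded.
    from tvm.relax.dpl import is_op, wildcard

    add = is_op("relax.add")(wildcard(), wildcard())
    # Passing a pattern as an argument means "this input is produced by that pattern".
    add_multiply = is_op("relax.multiply")(add, wildcard())

    return [
        ("tensorrt.add_multiply", add_multiply),  # hypothetical pattern name
        ("tensorrt.multiply", is_op("relax.multiply")(wildcard(), wildcard())),
        ("tensorrt.add", is_op("relax.add")(wildcard(), wildcard())),
    ]
```

The returned list has the same shape as `patterns`, so it can be passed to `FuseOpsByPattern` in the same way.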

(3) Run the following passes in order:
* `FuseOpsByPattern` 
* `MergeCompositeFunctions`
* `RunCodegen`

In practice, you will run them consecutively within `tvm.transform.Sequential`. To demonstrate what each pass does, this example applies them one by one and walks through the changes.
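
For reference, the combined form might look like the sketch below. It uses only the passes named in this tutorial; the wrapper function `offload_with_byoc` is a hypothetical convenience, not part of TVM.

```python
def offload_with_byoc(mod, patterns):
    """Apply the three BYOC passes in order and return the transformed IRModule."""
    # Imported inside the function so the helper can be defined without TVM loaded.
    import tvm
    from tvm import relax

    seq = tvm.transform.Sequential(
        [
            relax.transform.FuseOpsByPattern(patterns),
            relax.transform.MergeCompositeFunctions(),
            relax.transform.RunCodegen(),
        ]
    )
    return seq(mod)
```

Calling `offload_with_byoc(mod, patterns)` should then yield the same module as applying the three passes one at a time.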

```python
from tvm import relax

mod1 = relax.transform.FuseOpsByPattern(patterns)(mod)
mod1.show()
```

As the printed module shows, `FuseOpsByPattern` splits the graph into a set of composite functions based on the given list of patterns and records each pattern's assigned name as a function attribute.

```python
mod2 = relax.transform.MergeCompositeFunctions()(mod1)
mod2.show()
```

`MergeCompositeFunctions` combines adjacent composite functions. It also annotates each merged function with the target codegen and the global symbol that the subsequent `RunCodegen` pass will use.
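
Besides reading the printed module, the annotations can be collected programmatically. The sketch below walks the top-level functions of a module and gathers BYOC-related attributes. The attribute keys `Composite`, `Codegen`, and `global_symbol` are assumptions about how these passes label functions, and `summarize_byoc_attrs` is a hypothetical helper. Functions nested inside other functions are not visited.

```python
def summarize_byoc_attrs(mod, keys=("Composite", "Codegen", "global_symbol")):
    """Map each top-level function name to the BYOC-related attributes it carries."""
    summary = {}
    for gvar, func in mod.functions.items():
        attrs = getattr(func, "attrs", None)
        if attrs is None:
            continue  # function carries no attributes at all
        found = {key: str(attrs[key]) for key in keys if key in attrs}
        if found:
            summary[gvar.name_hint] = found
    return summary
```

For example, comparing `summarize_byoc_attrs(mod1)` with `summarize_byoc_attrs(mod2)` gives a compact view of what each pass annotated.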

```python
mod3 = relax.transform.RunCodegen()(mod2)
mod3.show()
# The produced runtime module is attached as an IRModule attribute.
print(f"TensorRT runtime module: {mod3.attrs['external_mods']}")
# Example output:
# TensorRT runtime module: [runtime.Module(0x4d951d8)]
```

`RunCodegen` converts the composite functions of interest into external runtime modules by invoking the corresponding codegen. Then, instead of calling the Relax composite functions, the program invokes the BYOC runtime module attached as an IRModule attribute.

(4) Now we are ready to run the model. Check that the final IRModule is well-formed, then build it and run it with the Relax virtual machine.

```python
import tvm.testing

# Check that the output IRModule is well-formed.
assert relax.analysis.well_formed(mod3)

# Define your target hardware and device.
target, dev = tvm.target.Target("cuda"), tvm.cuda()

# Prepare inputs.
np0 = np.random.rand(16, 16).astype(np.float32)
np1 = np.random.rand(16, 16).astype(np.float32)
data0 = tvm.nd.array(np0, dev)
data1 = tvm.nd.array(np1, dev)
inputs = [data0, data1]

# Prepare the expected output with NumPy, mirroring the ops in `main`.
t1 = np.multiply(np0, np1)
t2 = np.add(t1, np0)
t3 = np.add(t1, t2)
t4 = np.multiply(t3, t2)
expected = np.add(t4, t1)

# Build and prepare VM.
ex = relax.build(mod3, target, params={})
vm = relax.VirtualMachine(ex, dev)

# Run VM and compare against the NumPy reference.
out = vm["main"](*inputs)
tvm.testing.assert_allclose(out.numpy(), expected, rtol=1e-6, atol=1e-6)
```



### Mix-and-Match BYOC and Tuning

In Relax, you can flexibly optimize one part of the graph with BYOC while tuning the other parts.
Continuing from the previous example, suppose we want to offload only `add` to TensorRT and optimize the `multiply` kernel with MetaSchedule tuning.

As in the previous section, we define the operator patterns that we want to offload to BYOC.
Here, we target only the `add` operator.

```python
patterns = [
    ("tensorrt.add", is_op("relax.add")(wildcard(), wildcard())),
]
```

For MetaSchedule, you need to provide more specific hardware information. 

```python
# Define your target hardware and device.
target, dev = tvm.target.Target("nvidia/geforce-rtx-3070"), tvm.cuda()
```
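
A target tag such as the one above bundles hardware limits that the tuner relies on, for example the thread and shared-memory budget per block. As a sketch, an explicit target string carrying the same kind of information might be assembled as follows. The attribute names are assumed to match TVM's CUDA target options, the numeric values are assumptions for an RTX 3070-class GPU rather than values taken from this tutorial, and `cuda_target_string` is a hypothetical helper.

```python
def cuda_target_string(arch, max_threads_per_block, max_shared_memory_per_block):
    """Assemble a CUDA target string with the limits a tuner typically needs."""
    return (
        f"cuda -arch={arch} "
        f"-max_threads_per_block={max_threads_per_block} "
        f"-max_shared_memory_per_block={max_shared_memory_per_block}"
    )


# Assumed values; check your GPU's specifications before relying on them.
print(cuda_target_string("sm_86", 1024, 49152))
```

If your GPU has no predefined tag, a string like this could be passed to `tvm.target.Target(...)` instead, provided your TVM version accepts these options.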

Now we are ready to apply the pass sequence. On top of the passes for BYOC offloading, we simply add passes for lowering and MetaSchedule tuning.
Once the BYOC passes offload the target operators (`add` in this example) to TensorRT, the `LegalizeOps` pass lowers the remaining operators (`multiply` in this example) to TIR PrimFuncs. Then, `MetaScheduleTuneIRMod` and `MetaScheduleApplyDatabase` perform tuning and apply the best optimization decision found in the tuning database.

```python
import tempfile
from tvm.relax.transform.tuning_api import Trace

# Offload `add` to TensorRT, then legalize and tune the remaining operators.
with tempfile.TemporaryDirectory() as work_dir:
    with target, tvm.transform.PassContext(trace=Trace(mod)):
        mod4 = tvm.transform.Sequential(
            [
                relax.transform.FuseOpsByPattern(patterns),
                relax.transform.MergeCompositeFunctions(),
                relax.transform.RunCodegen(),
                relax.transform.LegalizeOps(),
                relax.transform.MetaScheduleTuneIRMod(
                    params={}, work_dir=work_dir, max_trials_global=8
                ),
                relax.transform.MetaScheduleApplyDatabase(work_dir),
            ]
        )(mod)

assert relax.analysis.well_formed(mod4)

# Build and prepare VM.
ex = relax.build(mod4, target, params={})
vm = relax.VirtualMachine(ex, dev)

# Run VM.
out = vm["main"](*inputs)
tvm.testing.assert_allclose(out.numpy(), expected, rtol=1e-6, atol=1e-6)
```

```
2023-03-20 10:04:42 [INFO] [task_scheduler.cc:260] Task #0 has finished. Remaining task(s): 0
2023-03-20 10:04:42 [DEBUG] [task_scheduler.cc:318]
 ID |     Name | FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done
-------------------------------------------------------------------------------------------------------
  0 | multiply |  256 |      2 |         0.1619 |       1.5815 |                3.1630 |      4 |    Y
-------------------------------------------------------------------------------------------------------
Total trials: 4
Total latency (us): 3.16303
```
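
The columns of the tuning summary are related by simple arithmetic, which the sketch below reproduces from the logged numbers. It assumes that the weight counts how many times the task occurs in the module (`multiply` appears twice in `main`) and that FLOP counts one operation per element of the 16x16 output. The small difference from the logged total of 3.16303 comes from the latency being rounded in the table.

```python
flop = 256           # 16 x 16 element-wise multiplications
latency_us = 1.5815  # latency of a single call, as logged
weight = 2           # assumed: number of occurrences of the task in the module

# FLOP per second, converted to GFLOPS.
speed_gflops = flop / (latency_us * 1e-6) / 1e9
# Contribution of this task to the total latency.
weighted_latency_us = weight * latency_us

print(f"{speed_gflops:.4f} GFLOPS, {weighted_latency_us:.4f} us")
# prints "0.1619 GFLOPS, 3.1630 us"
```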



