<a href="https://colab.research.google.com/github/mlc-ai/notebooks/blob/main/2_tensor_program_abstraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tensor Program Abstraction in Action





## Install packages 

For this course, we will use ongoing development work in TVM, an open source machine learning compilation framework. The following commands install a packaged version for the MLC course.

In [5]:
%pip install mlc-ai-nightly -f https://mlc.ai/wheels
%pip install tlcpack-nightly -f https://tlcpack.ai/wheels

Looking in links: https://mlc.ai/wheels
Note: you may need to restart the kernel to use updated packages.
Looking in links: https://tlcpack.ai/wheels
Note: you may need to restart the kernel to use updated packages.


## Constructing Tensor Program

Let us begin by constructing a tensor program that adds two vectors.

In [6]:
import tvm
from tvm.ir.module import IRModule
from tvm.script import tir as T
import numpy as np


In [None]:
@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(A: T.Buffer[128, "float32"], 
             B: T.Buffer[128, "float32"], 
             C: T.Buffer[128, "float32"]):
        # extra annotations for the function
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        for i in range(128):
            with T.block("C"):
                # declare a data parallel iterator on spatial domain
                vi = T.axis.spatial(128, i)
                C[vi] = A[vi] + B[vi]

TVMScript lets us express a tensor program as a Python AST. Note that this code does not correspond to an ordinary Python program; it describes a tensor program that can be used in the MLC process. The language is designed to align with Python syntax, with additional structures that facilitate analysis and transformation.
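
To make the semantics concrete, the following plain-Python sketch performs the same element-wise addition that `main` describes. It is only an illustration: `vector_add_reference` is a hypothetical name, not part of TVM, and it operates on Python lists rather than TVM buffers.

```python
def vector_add_reference(a, b, c):
    """Plain-Python analogue of MyModule.main: c[i] = a[i] + b[i]."""
    n = len(a)
    for i in range(n):
        # Each iteration writes a distinct element, which is what the
        # spatial axis `vi` expresses in the TVMScript version.
        c[i] = a[i] + b[i]


a = [float(i) for i in range(128)]
b = [1.0] * 128
c = [0.0] * 128
vector_add_reference(a, b, c)
print(c[:4], c[-1])
```

Unlike this loop, the TVMScript version is not executed by the Python interpreter; it is parsed into an IRModule that can later be transformed and built.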

In [None]:
type(MyModule)

tvm.ir.module.IRModule

MyModule is an instance of an **IRModule** data structure, which is used to hold a collection of tensor functions. 

We can use the `show()` function to get a highlighted, string-based representation of the IRModule. This function is useful for inspecting the module after each transformation step.

In [None]:
MyModule.show()

### Build and run

At any point, we can turn an IRModule into runnable functions by calling a build function.

In [None]:
rt_mod = tvm.build(MyModule, target="llvm")  # The module for CPU backends.
print(type(rt_mod))

<class 'tvm.driver.build_module.OperatorModule'>


After the build, `rt_mod` contains a collection of runnable functions. We can retrieve each function by its name.

In [None]:
func = rt_mod["main"]

In [None]:
func

<tvm.runtime.packed_func.PackedFunc at 0x7fdf00501e40>

In [None]:
a = tvm.nd.array(np.arange(128, dtype="float32"))

In [None]:
b = tvm.nd.array(np.ones(128, dtype="float32")) 

In [None]:
c = tvm.nd.empty((128,), dtype="float32") 

To invoke the function, we create three NDArrays in the TVM runtime (`a`, `b`, and `c` above) and pass them to the generated function.

In [None]:
func(a, b, c)


In [None]:
print(a)
print(b)
print(c)

[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
  14.  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.
  28.  29.  30.  31.  32.  33.  34.  35.  36.  37.  38.  39.  40.  41.
  42.  43.  44.  45.  46.  47.  48.  49.  50.  51.  52.  53.  54.  55.
  56.  57.  58.  59.  60.  61.  62.  63.  64.  65.  66.  67.  68.  69.
  70.  71.  72.  73.  74.  75.  76.  77.  78.  79.  80.  81.  82.  83.
  84.  85.  86.  87.  88.  89.  90.  91.  92.  93.  94.  95.  96.  97.
  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.
 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
 126. 127.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
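
After the call, `c` should hold the element-wise sum of `a` and `b`. A minimal NumPy sketch of the expected values follows; the names `a_np`, `b_np`, and `expected` are introduced here for illustration only.

```python
import numpy as np

a_np = np.arange(128, dtype="float32")
b_np = np.ones(128, dtype="float32")
expected = a_np + b_np  # what c should hold after func(a, b, c)

print(expected[:4], expected[-1])

# With the TVM arrays from the cells above in scope, the check would be
# (assumption: NDArray.numpy() is available in the installed build):
# np.testing.assert_allclose(c.numpy(), expected)
```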

## Transform the Tensor Program

Now let us start to transform the tensor program. A tensor program can be transformed using an auxiliary data structure called a schedule.


In [None]:
sch = tvm.tir.Schedule(MyModule)
print(type(sch))

<class 'tvm.tir.schedule.schedule.Schedule'>


Let us first try to split the loop.

In [None]:
# Get block by its name
block_c = sch.get_block("C")
# Get the loops surrounding the block
(i,) = sch.get_loops(block_c)
# Split the loop into three nested loops; the None factor is inferred.
i_0, i_1, i_2 = sch.split(i, factors=[None, 4, 4])
sch.mod.show()
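
The effect of the split can be emulated in plain Python. This is a sketch of the resulting loop structure, assuming the factor passed as `None` is inferred so that the product of the factors equals the original extent of 128; the variable names mirror the schedule but the code does not use TVM.

```python
# Emulate sch.split(i, factors=[None, 4, 4]) for a loop of extent 128.
extent = 128
f1, f2 = 4, 4
f0 = extent // (f1 * f2)  # the factor left as None is inferred: 128 / 16 = 8

visited = []
for i_0 in range(f0):
    for i_1 in range(f1):
        for i_2 in range(f2):
            # Fuse the three loop variables back into the original index.
            vi = i_0 * (f1 * f2) + i_1 * f2 + i_2
            visited.append(vi)

print(f0, visited[:6])
```

Every index from 0 to 127 is visited exactly once and in the original order, so the split alone does not change what the program computes.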

We can also reorder the loops. Here we move loop i_2 outside of i_1.




In [None]:
sch.reorder(i_0, i_2, i_1)
sch.mod.show()
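
As a rough illustration of what the reorder does, the sketch below walks the loops in the new order in plain Python. It assumes the earlier split with factors `[None, 4, 4]`, so `i_0`, `i_1`, and `i_2` have extents 8, 4, and 4 and the original index is `i_0 * 16 + i_1 * 4 + i_2`.

```python
order = []
for i_0 in range(8):
    for i_2 in range(4):      # previously the innermost loop
        for i_1 in range(4):  # now the innermost loop
            order.append(i_0 * 16 + i_1 * 4 + i_2)

print(order[:8])
```

The same 128 indices are visited, only in a different order, which is legal here because each iteration writes a distinct element of `C`.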

Finally, we can add hints to the program generator. Here we ask it to parallelize the outermost loop.

In [None]:
sch.mod.show()

In [None]:
sch.parallel(i_0)
sch.mod.show()
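
Marking `i_0` as parallel is valid because its iterations write disjoint parts of `C`. The sketch below illustrates that independence with Python threads; it is not how TVM implements parallel loops, and `run_outer_iteration` is a hypothetical helper. It assumes the loop extents 8, 4, and 4 from the earlier split.

```python
from concurrent.futures import ThreadPoolExecutor

a = [float(i) for i in range(128)]
b = [1.0] * 128
c = [0.0] * 128


def run_outer_iteration(i_0):
    """Execute the inner loops for one value of the outer loop variable."""
    for i_2 in range(4):
        for i_1 in range(4):
            vi = i_0 * 16 + i_1 * 4 + i_2
            c[vi] = a[vi] + b[vi]


# The 8 outer iterations touch disjoint slices of c, so they may run in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(run_outer_iteration, range(8)))

print(c[:4], c[-1])
```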

We can build and run the transformed program.


In [None]:
transformed_mod = tvm.build(sch.mod, target="llvm")  # The module for CPU backends.
transformed_mod["main"](a, b, c)

## Constructing Tensor Program using Tensor Expression

In the previous example, we used TVMScript directly to construct the tensor program. In practice, it is usually helpful to construct these functions programmatically from existing definitions. Tensor expression (TE) is an API that helps us build expression-like array computations.

In [None]:
# namespace for tensor expression utility
from tvm import te

# declare the computation using the expression API
A = te.placeholder((128, ), name="A")
B = te.placeholder((128, ), name="B")
C = te.compute((128,), lambda i: A[i] + B[i], name="C")

# create a function with the specified list of arguments. 
func = te.create_prim_func([A, B, C])
# mark that the function name is main
func = func.with_attr("global_symbol", "main")
ir_mod_from_te = IRModule({"main": func})

ir_mod_from_te.show()
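
One way to read `te.compute` is as "a shape plus a function from index to value". The sketch below mimics that idea eagerly in plain Python. `compute_1d` is a hypothetical stand-in written for this illustration; the real `te.compute` builds a symbolic description of the computation instead of evaluating it.

```python
def compute_1d(extent, fcompute):
    """Hypothetical stand-in for te.compute on a 1-D shape: evaluate
    fcompute at every index and collect the results."""
    return [fcompute(i) for i in range(extent)]


A = [float(i) for i in range(128)]
B = [1.0] * 128
C = compute_1d(128, lambda i: A[i] + B[i])
print(C[:4], C[-1])
```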

## Transforming a matrix multiplication program

In the above example, we showed how to transform a vector addition. Now let us apply the same approach to a slightly more complicated program: matrix multiplication. Let us first build the initial code using the tensor expression API.


In [None]:
from tvm import te

M = 1024
K = 1024
N = 1024

# The default tensor type in tvm
dtype = "float32"

target = "llvm"
dev = tvm.device(target, 0)

# Algorithm
k = te.reduce_axis((0, K), "k")
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C")

# Default schedule
func = te.create_prim_func([A, B, C])
func = func.with_attr("global_symbol", "main")
ir_module = IRModule({"main": func})
ir_module.show()


func = tvm.build(ir_module, target="llvm")  # The module for CPU backends.

a = tvm.nd.array(np.random.rand(M, K).astype(dtype), dev)
b = tvm.nd.array(np.random.rand(K, N).astype(dtype), dev)
c = tvm.nd.array(np.zeros((M, N), dtype=dtype), dev)
func(a, b, c)

evaluator = func.time_evaluator(func.entry_name, dev, number=1)
print("Baseline: %f" % evaluator(a, b, c).mean)

Baseline: 4.970728
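
For reference, the computation declared with `te.compute` and `te.sum` corresponds to the naive triple loop sketched below. This is a plain NumPy illustration on a small 8×8 input chosen here for speed; `matmul_reference` is a hypothetical name and the code makes no claim about performance.

```python
import numpy as np


def matmul_reference(a, b):
    """Naive triple loop: C[m, n] = sum over k of A[m, k] * B[k, n]."""
    M, K = a.shape
    _, N = b.shape
    c = np.zeros((M, N), dtype=a.dtype)
    for m in range(M):
        for n in range(N):
            for k in range(K):  # reduction axis
                c[m, n] += a[m, k] * b[k, n]
    return c


rng = np.random.default_rng(0)
a_small = rng.random((8, 8)).astype("float32")
b_small = rng.random((8, 8)).astype("float32")
c_small = matmul_reference(a_small, b_small)
np.testing.assert_allclose(c_small, a_small @ b_small, rtol=1e-5)
```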


We can transform the loop access pattern to make it more cache-friendly. Let us use the following schedule.

In [None]:
sch = tvm.tir.Schedule(ir_module)
print(type(sch))
block_c = sch.get_block("C")
# Get the loops surrounding the block
(y, x, k) = sch.get_loops(block_c)
block_size = 32
yo, yi = sch.split(y, [None, block_size])
xo, xi = sch.split(x, [None, block_size])

sch.reorder(yo, xo, k, yi, xi)
sch.mod.show()

func = tvm.build(sch.mod, target="llvm")  # The module for CPU backends.

c = tvm.nd.array(np.zeros((M, N), dtype=dtype), dev)
func(a, b, c)

evaluator = func.time_evaluator(func.entry_name, dev, number=1)
print("after transformation: %f" % evaluator(a, b, c).mean)

<class 'tvm.tir.schedule.schedule.Schedule'>


after transformation: 0.101731
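
The loop order produced by this schedule (`yo, xo, k, yi, xi`) can be emulated in plain Python to check that blocking does not change the result. This is a sketch on a small 8×8 input with a block size of 4, both chosen here for illustration; `matmul_blocked` is a hypothetical helper, and running it in Python says nothing about the speedup TVM obtains.

```python
import numpy as np


def matmul_blocked(a, b, block_size):
    """Emulate the loop order yo, xo, k, yi, xi from the schedule."""
    M, K = a.shape
    _, N = b.shape
    assert M % block_size == 0 and N % block_size == 0
    c = np.zeros((M, N), dtype=a.dtype)
    for yo in range(M // block_size):
        for xo in range(N // block_size):
            for k in range(K):
                # For a fixed k, update one block_size x block_size tile of C.
                for yi in range(block_size):
                    for xi in range(block_size):
                        y = yo * block_size + yi
                        x = xo * block_size + xi
                        c[y, x] += a[y, k] * b[k, x]
    return c


rng = np.random.default_rng(0)
a_small = rng.random((8, 8)).astype("float32")
b_small = rng.random((8, 8)).astype("float32")
c_blocked = matmul_blocked(a_small, b_small, block_size=4)
np.testing.assert_allclose(c_blocked, a_small @ b_small, rtol=1e-5)
```

Within one `(yo, xo)` pair, the inner loops repeatedly touch the same tile of `C`, which is the reuse pattern the blocking is meant to expose.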


Try changing the value of `block_size` to see what performance you can get. In practice, we will leverage an automated system to search over a set of possible transformations and find a good one.