[Schedule] Support for intra-kernel data placement #436

hecmay · 2022-02-04T03:37:41Z

This PR enhances the .systolic() and .to() primitives to better support intra-kernel data placement for systolic array generation with the AutoSA backend.

The .systolic() primitive is a push-button API that automatically maps a compute kernel to a systolic array, leaving the choice of dataflow pattern to the compiler. The .to() primitive gives expert designers more flexibility to explore the trade-offs between different systolic dataflows.

I have resolved the dependency issues and installed AutoSA on our local server. In this PR, I will also add local CI/CD tests for systolic array programs that use the AutoSA backend.

hecmay · 2022-03-06T22:25:34Z

@zzzDavid @chhzh123 can you maybe take a quick pass on this PR? Thanks!

chhzh123

Sorry for the late review. I’ve looked through the code and think maybe you could add more descriptions for this PR. Seems you have added several new features besides the AutoSA backend.

I notice you introduced new APIs like transpose and pack, and new passes like transform_layout and explicit_unroll. Could you also describe these changes in this PR?

Just a small question: you are not writing a C++ codegen for AutoSA, right? All the compilation happens at the Python level (except for some transformation passes).

chhzh123 · 2022-03-06T22:31:42Z

.github/workflows/main.yml

@@ -26,5 +26,5 @@ jobs:
          source $VITIS/settings64.sh
          source /opt/xilinx/xrt/setup.sh
          export LOCAL_CI_TEST=1
-          which vivado_hls


Is vivado_hls already included in the previous paths?

chhzh123 · 2022-03-06T22:33:07Z

python/heterocl/autosa.py

+def indent(num):
+    return " " * num
+
+def get_function_code(name, code):
+    pos = code.find(name)
+    start_pos = pos - len("inline void")
+    end_pos = code.find("/* Helper", pos)
+    return code[start_pos:end_pos]
+
+
+def get_ser_size(code):
+    lines = code.split("\n")


I am not sure which Python formatter HeteroCL uses, but mixing one and two blank lines between functions seems inconsistent.

chhzh123 · 2022-03-06T22:34:34Z

python/heterocl/autosa.py

+            PART = "10,16"
+            LAT = "2,2"
+            SIMD = 4


What are these magic numbers? Could you add comments or use more specific variable names here?
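One way to address this is sketched below. It is only an illustration, not code from this PR: the class name `SystolicTuning` and its field names are hypothetical, and the meaning attached to each value (array partitioning, latency hiding, SIMD vectorization) is an assumption based on AutoSA's usual tuning terminology rather than something confirmed in this thread.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SystolicTuning:
    """Hypothetical container for the values the diff calls PART, LAT and SIMD."""

    array_part: Tuple[int, ...]  # assumed: array-partitioning tile sizes (PART)
    latency: Tuple[int, ...]     # assumed: latency-hiding tile sizes (LAT)
    simd: Tuple[int, ...]        # assumed: SIMD vectorization factors (SIMD)

    def as_strings(self):
        """Return comma-separated strings in the style used by the code under review."""
        def join(values):
            return ",".join(str(v) for v in values)

        return join(self.array_part), join(self.latency), join(self.simd)


# Values copied from the snippet under review; only the names are new.
GEMM_TUNING = SystolicTuning(array_part=(10, 16), latency=(2, 2), simd=(4,))

part, lat, simd = GEMM_TUNING.as_strings()
print(part, lat, simd)
```

Keeping the numbers in one named, documented place would also make it easier to see which values are workload-specific defaults and which are required by the backend.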

chhzh123 · 2022-03-06T22:35:30Z

python/heterocl/autosa.py

+        print(f"[  INFO  ] input size OC({OC}), OH({OH}), OW({OW}), IC({IC}), R({R}), C({C})")
+        PART = "16,13,13,1"
+        LAT  = "2,1,2"
+        SIMD = "1,1,2,4"


Why "16,13,13,1"? I suppose this is not a test file but a general implementation.

chhzh123 · 2022-03-06T22:41:56Z

tvm/src/pass/adjust_buffer_binding.cc

@@ -1,6 +1,6 @@
 /*!
 *  Copyright (c) 2019 by Contributors
- * \file adjust_buffer_binding.cc
+ * \file loop_partition.cc


I suppose you changed the header by mistake? The file name remains the same.

chhzh123 · 2022-03-06T22:47:53Z

python/heterocl/schedule.py

@@ -265,6 +265,61 @@ def join(self, srcs, dest=None):
                        "inconsistent tensor joining"
            self.sch.join(target, dest, self[src])

+    def transpose(self, tensor=None):


I think there is already one in compute_api.py. What's the difference between these two transpose functions?

chhzh123 · 2022-03-06T22:51:00Z

samples/gemm/gemm_systolic.py

+        return Y
+
+    # Note that you have to make sure AutoSA binary
+    # in on the PATH by running which command, otherwise HCL runtime


"in on" typo. Better add quotation marks for which.

chhzh123 · 2022-03-06T23:00:59Z

python/heterocl/autosa.py

+        extra_flags = "--simd-info=./autosa_tests/cnn/simd_info.json "
+    return ST, PART, LAT, SIMD, extra_flags
+
+def generate_systolic_array(keys, values, code, backend):


It seems that code generation, file copying, and header generation are all done in this function. It might be better to split it into several subfunctions or steps, as we did in runtime.py.

hecmay · 2022-03-07T00:32:09Z

> Sorry for the late review. I’ve looked through the code and think maybe you could add more descriptions for this PR. Seems you have added several new features besides the AutoSA backend.
>
> I notice you introduced new APIs like transpose and pack, and new passes like transform_layout and explicit_unroll. Could you also describe these changes in this PR?
>
> Just a small question: you are not writing a C++ codegen for AutoSA, right? All the compilation happens at the Python level (except for some transformation passes).

Thanks for pointing that out.

These new APIs (e.g., packing and layout transformation) are necessary to generate a high-throughput memory subsystem for the GEMM systolic array. I will add more explanation of these new APIs.

The AutoSA codegen in HCL is currently a mix of C++ and Python: the HLS/OpenCL code generator (the C++ part) calls a utility function (the Python part) that infers the CLI arguments and then invokes AutoSA. I can probably implement that utility function in C++, which would make the flow a bit cleaner.
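The Python side of that flow could look roughly like the sketch below. It is not the code in autosa.py: the function name `build_autosa_command`, the `autosa` executable name, the kernel file name, and the `space_time` value are placeholders, and the `--sa-sizes` string format is an assumption based on AutoSA's documented command-line usage. The PART, LAT, SIMD, and `--simd-info` values are the ones quoted in the review above.

```python
import shlex


def build_autosa_command(kernel_path, space_time, part, lat, simd, extra_flags=""):
    """Assemble an argument list for invoking AutoSA (hypothetical helper).

    The --sa-sizes syntax below is assumed from AutoSA's documentation and
    should be checked against the AutoSA version actually installed.
    """
    sa_sizes = (
        "{"
        f"kernel[]->space_time[{space_time}];"
        f"kernel[]->array_part[{part}];"
        f"kernel[]->latency[{lat}];"
        f"kernel[]->simd[{simd}]"
        "}"
    )
    cmd = ["autosa", kernel_path, f"--sa-sizes={sa_sizes}"]
    # extra_flags is a space-separated string, as in the code under review.
    cmd += shlex.split(extra_flags)
    return cmd


cmd = build_autosa_command(
    "kernel.c",            # placeholder input file
    3,                     # placeholder space-time choice
    "16,13,13,1",          # PART from the CNN branch
    "2,1,2",               # LAT from the CNN branch
    "1,1,2,4",             # SIMD from the CNN branch
    "--simd-info=./autosa_tests/cnn/simd_info.json ",
)
# Print a shell-safe rendering; passing `cmd` to subprocess.run would avoid quoting issues.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list rather than a single string keeps the braces and brackets in `--sa-sizes` away from shell interpretation, which matters wherever the invocation ends up living (Python or C++).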

zzzDavid · 2022-03-07T16:02:46Z

python/heterocl/schedule.py

+
+            self.cascade_tensor = tensor
+            self.cascade_source_stage = None
+            self.sch.transpose(src, tensor, new_shape)


I have a question about this self.sch.transpose function: is data packing actually done by this function? It seems to me that this pack function only calculates the new shape.
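To make the distinction in this question concrete, here is a small pure-Python sketch that separates computing the packed shape from actually packing the data. It is for illustration only and does not describe what self.sch.transpose or the pack primitive in this PR does; the function names, the 8-bit element width, and the little-endian lane order are all assumptions.

```python
def packed_shape(shape, factor):
    """Shape after packing `factor` elements of the last dimension into one word."""
    *outer, last = shape
    if last % factor:
        raise ValueError("last dimension must be divisible by the packing factor")
    return (*outer, last // factor)


def pack_row(row, factor, bits=8):
    """Pack every `factor` consecutive `bits`-wide values into one wider word.

    Lane 0 lands in the least significant bits (an arbitrary choice here).
    """
    if len(row) % factor:
        raise ValueError("row length must be divisible by the packing factor")
    mask = (1 << bits) - 1
    words = []
    for start in range(0, len(row), factor):
        word = 0
        for lane, value in enumerate(row[start:start + factor]):
            word |= (value & mask) << (lane * bits)
        words.append(word)
    return words


# Shape-only step: a 32x64 tensor packed by 16 becomes 32x4.
print(packed_shape((32, 64), 16))

# Data step: eight 8-bit values packed four at a time into two 32-bit words.
print([hex(w) for w in pack_row([0, 1, 2, 3, 4, 5, 6, 7], 4)])
```

A schedule primitive may only need the first function, since it records the new layout and leaves the data movement to a later pass or to generated host code.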

zzzDavid · 2022-03-07T17:47:28Z

tvm/src/pass/stream_inference.cc

+          if (top_arg_names_.find(var_name) != top_arg_names_.end()) {
+            placement_info += "[0]";  // located on off-chip memory
+          } else {
+            placement_info += "[1]";  // loacted on on-chip memory


zzzDavid

I have a few general questions about the newly added passes:

From the test case, we are wrapping a piece of imperative GEMM code into a stage and calling systolic() on the stage. Do we have any checks to see whether the imperative code can be mapped to AutoSA, or would AutoSA complain if it can't map the algorithm?
I want to check whether my understanding is correct. For a piece of code targeting the AutoSA backend, we first generate C code from it and then call AutoSA to generate the systolic array HLS code plus the serialization/deserialization code, which is then wrapped into a stage. Is that correct?
About the "explicit unroll" pass: does it unroll a loop and then outline the loop body into PEs (function calls)?
What does "transform layout" do?

zzzDavid · 2022-03-07T18:12:27Z

python/heterocl/tvm/build_module.py

@@ -380,6 +380,9 @@ def lower(sch,
    stmt = ir_pass.AdjustBufferBinding(stmt, arg_list)
    stmt = ir_pass.InferStream(stmt, arg_list)
    stmt = ir_pass.AdjustBufferBinding(stmt, arg_list)
+    # perform layout transformation
+    stmt = ir_pass.TransformLayout(stmt, arg_list)
+    stmt = ir_pass.AdjustBufferBinding(stmt, arg_list)


What does AdjustBufferBinding do? Why is it called multiple times after each pass?

hecmay added 18 commits February 3, 2022 22:36

systolic primitive and gemm example

0fef016

commit missing python source files; format cpp

22b61d9

format cpp style manual

f3aeaf0

format cpp style fix & add attrs

28103b8

update hcl-autosa integration and hlsc codegen

5848775

add missing import pkg

3f6eb0d

update the test case names

16d38c5

update tensor layout transformation pass

d830bab

clang-format the cpp source

3e71efa

[+] fix format in transform_layout

7977359

[+] fix another format issue in transform_layout

861f6ad

[+] update xocl host codegen for ser/deser

a407632

[+] fix format issue in IR.h

e09052b

[+] update legacy test case (416)

d5ba3b8

[+] add hcl-autosa test in local CI/CD

c1f1dc9

[+] skip csyn in hcl-autosa test case (too slow)

adc8bd1

[+] verify PE number in generated HLS code

93946c9

[-] avoid flooding log into CU runner

39c4e76

hecmay changed the title ~~[Schedule] Improve support for intra-kernel data placement~~ [Schedule] Support for intra-kernel data placement Feb 6, 2022

[+] fix test case checking stmt

a4978a9

hecmay requested review from seanlatias, chhzh123 and zzzDavid February 12, 2022 14:42

chhzh123 reviewed Mar 6, 2022

zzzDavid reviewed Mar 7, 2022

hecmay mentioned this pull request Apr 7, 2022

[Examples] Update code and results for SODA/Stencil backend #442

Merged
