
[Metaschedule] Auto tensorization for CPU / GPU dot product #11088

Merged: 20 commits merged into apache:main on Apr 26, 2022

Conversation

@masahi (Member) commented on Apr 21, 2022:

Building on #11075, this adds the MultiLevelTilingWithIntrin schedule rule and the RewriteTensorize postproc, which together enable auto-tensorization with a single intrinsic, such as a CPU / GPU dot product. This is a simple but non-trivial use of auto-tensorization.

The diff looks large, but most of it is test boilerplate. The actual change enabling auto-tensorization is about 300 lines.

MultiLevelTilingWithIntrin can be used to auto-tensorize schedules with the following intrinsics (see the sketch after this list). We should be able to deprecate the corresponding manual templates in AutoTVM, but detailed performance analysis is yet to be done.

  • VNNI conv2d / dense
  • ARM NCHWc conv2d (with or without sdot) (cc @tkonolige)
  • dp4a for CUDA, SPIR-V integer dot product for Vulkan, and AMDGPU gfx10 sdot4 for ROCm
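For illustration, here is a minimal sketch of instantiating the new rule from Python. The intrinsic name "dot_16x4_vnni" and the exact constructor arguments are assumptions for this sketch, not verbatim from this PR:

```python
# Sketch only: instantiate the new schedule rule with a registered tensor
# intrinsic. "dot_16x4_vnni" is an assumed registration name for the VNNI
# dot product intrin; the other arguments mirror typical CPU tiling configs.
from tvm import meta_schedule as ms

vnni_rule = ms.schedule_rule.MultiLevelTilingWithIntrin(
    intrin_name="dot_16x4_vnni",   # assumed intrin registration name
    structure="SSRSRS",            # CPU-style multi-level tiling structure
    tile_binds=None,               # no GPU thread bindings on CPU
    max_innermost_factor=64,
    vector_load_lens=None,
    reuse_read=None,
    reuse_write=ms.schedule_rule.ReuseType(
        req="may", levels=[1, 2], scope="global"
    ),
)
```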

As a demonstration, I've added integration tests in tests/python/integration/test_meta_schedule_auto_tensorize.py, one of which is E2E auto-tensorization of quantized bert-base x {VNNI, DP4A}. The DP4A tests can also run on AMDGPU via the Vulkan or ROCm backends (@mei-ye @tmoreau89).

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
Co-authored-by: Wuwei Lin <wuwei@apache.org>

@junrushao1994 @vinx13 @comaniac @mbrookhart @spectrometerHBH @Hzfengsy @MasterJH5574 @jinhongyii

if (Optional<String> intrin_name =
        tir::GetAnn<String>(block_sref, tir::attr::meta_schedule_auto_tensorize)) {
  std::string block_name = block_sref->StmtAs<tir::BlockNode>()->name_hint;
  if (block_name.find("init") == std::string::npos) {
@masahi (author):
DecomposeReduction, applied before this postproc, copies the meta_schedule_auto_tensorize attribute to the init block as well. So we need to make sure we don't try to tensorize a block merely because it has the meta_schedule_auto_tensorize annotation. (A Python paraphrase of the guard follows.)
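In Python pseudo-form, the guard above amounts to roughly the following (an illustrative paraphrase of the C++ check, not code from this PR; the annotation key string is assumed to match tir::attr::meta_schedule_auto_tensorize):

```python
# Illustrative paraphrase of the C++ guard above, not code from this PR.
from tvm import tir

def should_tensorize(sch: tir.Schedule, block_rv: tir.schedule.BlockRV) -> bool:
    block = sch.get(block_rv)  # the underlying tir.Block node
    # Assumed annotation key for tir::attr::meta_schedule_auto_tensorize.
    has_intrin = "meta_schedule.auto_tensorize" in block.annotations
    # Skip init blocks: DecomposeReduction copies the annotation onto them too.
    return has_intrin and "init" not in block.name_hint
```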

Reviewer (Member):
There is target-specific handling here. Ideally we could make the init block behavior configurable in the meta schedule rule, but it is fine for now.

ICHECK(child_blocks.size() == 1);
Array<LoopRV> init_loops = sch->GetLoops(child_blocks[0]);
ICHECK(init_loops.size() == 1);
sch->Vectorize(init_loops[0]);
@masahi (author):
Related to the above: since DecomposeReduction introduces a new loop that should be vectorized on CPU, for now I'm applying vectorization to the decomposed init loop here. This could also be done in RewriteReductionBlock.

Reviewer (Member):
Does postproc::RewriteParallelVectorizeUnroll handle this case?

@masahi (author) commented on Apr 21, 2022:
I hoped it would, but it doesn't. Also, since parallelization etc. is supposed to be applied before DecomposeReduction, I don't think running RewriteParallelVectorizeUnroll after RewriteReductionBlock() is a good idea. So vectorization of the init loop has to be done manually somehow.

I'd prefer vectorizing the init loop right after we run DecomposeReduction during RewriteReductionBlock, since vectorization of the init loop should be done on CPU regardless of tensorization. cc @MasterJH5574

@MasterJH5574 (Contributor):
Interesting! What's the order in which the post-processors are applied now? Perhaps we should reflect this order by adding the new post-processor to tune.py:

```python
@staticmethod
def _postproc() -> List[Postproc]:
    from tvm.meta_schedule import postproc as M

    return [
        M.DisallowDynamicLoop(),
        M.RewriteCooperativeFetch(),
        M.RewriteUnboundBlock(),
        M.RewriteParallelVectorizeUnroll(),
        M.RewriteReductionBlock(),
        M.VerifyGPUCode(),
    ]
```

@masahi (author) commented on Apr 22, 2022:

The issue in question is vectorization for CPU targets. I'm using the default postprocs:

```python
def _postproc() -> List[Postproc]:
    from tvm.meta_schedule import postproc as M

    return [
        M.DisallowDynamicLoop(),
        M.RewriteParallelVectorizeUnroll(),
        M.RewriteReductionBlock(),
    ]
```

Since loop parallelization and vectorization check the "compact dataflow" constraint,

    CheckSubtreeCompactDataflow(self, loop_sref);

they need to be applied before DecomposeReduction in RewriteReductionBlock(). So having RewriteParallelVectorizeUnroll before RewriteReductionBlock() in the default postprocs makes sense.

However, this is not sufficient to vectorize the init loop of a reduction block, since that loop is only generated during RewriteReductionBlock(). I don't think we should run RewriteParallelVectorizeUnroll again after RewriteReductionBlock() (and it doesn't work anyway), so we need to manually vectorize the decomposed init loop, either in RewriteReductionBlock or in the new RewriteTensorize postproc I added. I prefer the former. (A standalone sketch of the manual step follows.)
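For concreteness, a self-contained sketch of that manual step on a toy matmul, using the public tir.Schedule API (the decomposition point and names are illustrative, not the postproc's actual code):

```python
# Sketch: after decompose_reduction, the fresh init loop is serial and is
# vectorized by hand, mirroring the idea discussed above.
from tvm import te, tir

# A toy 16x16 matmul workload.
A = te.placeholder((16, 16), name="A")
B = te.placeholder((16, 16), name="B")
k = te.reduce_axis((0, 16), name="k")
C = te.compute((16, 16), lambda i, j: te.sum(A[i, k] * B[j, k], axis=k), name="C")

sch = tir.Schedule(te.create_prim_func([A, B, C]))
block = sch.get_block("C")
i, j, kk = sch.get_loops(block)

# DecomposeReduction: hoist the init statement above loop j, which creates
# a new (serial) loop for the init block.
init_block = sch.decompose_reduction(block, j)

# Manually vectorize the innermost loop of the decomposed init block.
init_loop = sch.get_loops(init_block)[-1]
sch.vectorize(init_loop)
```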

@masahi (author):

In this case I want to tensorize the reduction block. So before DecomposeReduction is called, the loop kind of the reduction is serial, which makes the decomposed init loop serial as well.

@MasterJH5574 (Contributor):

I see. So the block we want to tensorize wasn't touched by the schedule rule ParallelVectorizeUnroll either 🤔?

@masahi (author) commented on Apr 23, 2022:

Ah yes (otherwise tensorize pattern matching fails, because an intrin desc is always serial). I'm not exactly sure what prevents ParallelVectorizeUnroll from tampering with the block we want to tensorize (which is a good thing); maybe the Blockize I do at

    tir::BlockRV outer_block = sch->Blockize(tiled_loop_rv.value());

(after tiling the inner loop nests to be tensorized) is helping?
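As a toy illustration of that blockize step (assumed shapes and split factors, not the rule's actual code):

```python
# Sketch: after tiling, blockize wraps the inner (to-be-tensorized) loop nest
# into a new opaque outer block, so later rules see it as a single unit.
from tvm import te, tir

A = te.placeholder((64, 64), name="A")
B = te.placeholder((64, 64), name="B")
k = te.reduce_axis((0, 64), name="k")
C = te.compute((64, 64), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

sch = tir.Schedule(te.create_prim_func([A, B, C]))
block = sch.get_block("C")
i, j, kk = sch.get_loops(block)
io, ii = sch.split(i, factors=[None, 16])
jo, ji = sch.split(j, factors=[None, 16])
ko, ki = sch.split(kk, factors=[None, 4])
sch.reorder(io, jo, ko, ii, ji, ki)

# Blockize the inner 16x16x4 nest; the returned outer block is what a
# tensor intrinsic would later replace.
outer_block = sch.blockize(ii)
```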

@MasterJH5574 (Contributor) commented on Apr 23, 2022:

Quite interesting. So the case here is: on one hand we don't want the block to be annotated by the rule ParallelVectorizeUnroll, but on the other hand we do want its init block to be vectorized after the decomposition. Am I right?

Since the block wasn't annotated by ParallelVectorizeUnroll before decomposition, the decomposed init block isn't vectorized, which makes sense. In addition, the decomposed init block doesn't carry any information indicating that it's supposed to be vectorized (e.g., it doesn't have a "needs vectorization" annotation). So whether we vectorize the init block loop in RewriteReductionBlock or in RewriteTensorize, it's all based on human knowledge, which I don't think is ideal.

For upstreaming, it might be okay to do the manual vectorization in RewriteTensorize (how does the vectorization in RewriteTensorize bypass the compact dataflow issue, BTW?). But in the long term I suppose we should relax the compact dataflow check to allow such vectorization. After all, such vectorization won't introduce any incorrectness.

cc @junrushao1994 @spectrometerHBH

@masahi (author) commented on Apr 23, 2022:

> Quite interesting. So the case here is: on one hand we don't want the block to be annotated by the rule ParallelVectorizeUnroll, but on the other hand we do want its init block to be vectorized after the decomposition. Am I right?

Exactly.

> How does the vectorization in RewriteTensorize bypass the compact dataflow issue, BTW?

That's a great question! Until recently, vectorization of the init loop after DecomposeReduction was rejected by the compact dataflow check. I brought this topic to @Hzfengsy, and the team came up with a relaxation of the constraint that allows vectorizing the init loop: PR #10705.

Yeah, ideally all outer-loop parallelization and inner-loop vectorization could be done by one pass of ParallelVectorizeUnroll, meaning we'd run it after DecomposeReduction. Currently, outer-loop parallelization after DecomposeReduction would be rejected by the compact dataflow check, but I think that is still too restrictive.

@junrushao (Member):

I'm super excited to see this PR!! Would love to have some helping hands review this PR :-) CC: @vinx13 @spectrometerHBH

@masahi (author) commented on Apr 22, 2022:

Some perf numbers on int8 bert-base:

VNNI, Rocket Lake, 6 cores

 ID |                                                   Name |       FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Terminated
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
  0 |    fused_nn_batch_matmul_multiply_expand_dims_subtract |  228266496 |     12 |      2570.6195 |      88.7982 |             1065.5789 |    256 |
  1 |  fused_nn_batch_matmul_multiply_expand_dims_subtract_1 |  226788096 |     12 |      2354.2875 |      96.3298 |             1155.9579 |    256 |
  2 |                   fused_nn_contrib_dense_pack_subtract |  453279744 |     48 |      2630.4608 |     172.3195 |             8271.3371 |    256 |          Y
  3 |                 fused_nn_contrib_dense_pack_subtract_1 | 1813118976 |     12 |      2773.5020 |     653.7291 |             7844.7493 |    256 |          Y
  4 |                 fused_nn_contrib_dense_pack_subtract_2 | 1812234240 |     12 |      2775.5088 |     652.9377 |             7835.2520 |    256 |          Y
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

RTX 3070 with DP4A (FP32 peak around 16 TFLOPS)

 ID |                    Name |       FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Terminated
----------------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_batch_matmul |  226492416 |     12 |     10978.4995 |      20.6305 |              247.5665 |    512 |          Y
  1 | fused_nn_batch_matmul_1 |  226492416 |     12 |     14038.9348 |      16.1332 |              193.5979 |    512 |
  2 |          fused_nn_dense |  452984832 |     48 |     17875.0444 |      25.3417 |             1216.4038 |    512 |          Y
  3 |        fused_nn_dense_1 | 1811939328 |     12 |     25448.8947 |      71.1991 |              854.3896 |    512 |          Y
  4 |        fused_nn_dense_2 | 1811939328 |     12 |     21945.3012 |      82.5662 |              990.7940 |    512 |          Y
----------------------------------------------------------------------------------------------------------------------------------

AMDGPU RX 6600 XT with DP4A (FP32 peak around 10 TFLOPS)

 ID |                    Name |       FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Terminated
----------------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_batch_matmul |  226492416 |     12 |     10589.5889 |      21.3882 |              256.6586 |    512 |
  1 | fused_nn_batch_matmul_1 |  226492416 |     12 |      9998.6694 |      22.6523 |              271.8271 |    512 |          Y
  2 |          fused_nn_dense |  452984832 |     48 |     13374.8473 |      33.8684 |             1625.6837 |    512 |          Y
  3 |        fused_nn_dense_1 | 1811939328 |     12 |     13873.1209 |     130.6079 |             1567.2949 |    512 |          Y
  4 |        fused_nn_dense_2 | 1811939328 |     12 |     17295.8264 |     104.7617 |             1257.1398 |    512 |          Y
----------------------------------------------------------------------------------------------------------------------------------

@MasterJH5574 (Contributor) left a comment:
Thanks for the efforts! Excited to see auto-tensorization happening!

@MasterJH5574 (Contributor) left a comment:

Should we update the list of post-processors here as well?

class Postproc : public runtime::ObjectRef {

@masahi force-pushed the auto-tensorize-dot branch 3 times, most recently from 6d6c3b4 to e104593 (April 22, 2022, 21:13).
@vinx13 merged commit 6846484 into apache:main on Apr 26, 2022.
shtinsa pushed a commit to Deelvin/tvm that referenced this pull request on May 17, 2022:

[Metaschedule] Auto tensorization for CPU / GPU dot product (#11088)

* [Metaschedule] Auto-tensorization for CPU / GPU dot product

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
Co-authored-by: Wuwei Lin <wuwei@apache.org>

* doc update

* add vnni conv2d test

* add dp4a test

* adding tests for rewrite_tensorize

* add rewrite_tensorize test

* add missing pydoc

* black

* more doc

* adding auto tensorize integration test

* add dp4a test

* fix target name

* fix dtype in test

* skip bert test

* replace hard-coded llvm intrinsic id in test with look up

* remove unnecessary include, add doc for the rest of params

* update postproc.h

* update doc

* fix shape in te matmul workload

* fix newline in cppdoc

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
Co-authored-by: Wuwei Lin <wuwei@apache.org>
juda pushed a commit to juda/tvm that referenced this pull request on Jun 21, 2022 (same commit message as above).