
[TOPI][Tensor Core] Conv2d and Dense ops support on Tensor Core #5099

Merged
10 commits merged into apache:master on Mar 27, 2020

Conversation

Shawn-IEITSystems
Contributor

  • Add Conv2d and Dense op support on Tensor Cores.
  • This enables running CNNs on the Tensor Core units of modern GPUs.
  • Significant performance improvements have been achieved.

Please see RFC link below for details:
Optimization of CNNs on Tensor Core

@Laurawly @vinx13 @masahi Please help to review.


return update(), init(), update()

return te.decl_tensor_intrin(C.op, intrin_func, binds={A: BA, B: BB, C: BC})
Contributor

The two intrin_wmma_gemm functions in conv2d and dense are extremely similar. Do you think it makes sense to have a single, more general gemm intrinsic for both, or is there something preventing that?

Contributor Author

Thanks for the suggestion! The code has been modified to use a general gemm intrinsic for both conv2d and dense.
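For illustration, here is a minimal sketch of how the fragment-level computation behind such a shared gemm intrinsic can be parametrized by the wmma shape and dtypes; the function and variable names are illustrative, not the exact ones used in this PR:

```python
from tvm import te

def wmma_gemm_compute(wmma_m=16, wmma_n=16, wmma_k=16,
                      in_dtype='float16', out_dtype='float32'):
    """Declare the fragment-level matmul that a shared wmma gemm intrinsic
    would tensorize; parametrizing the fragment shape lets one declaration
    serve both the conv2d and dense schedules."""
    A = te.placeholder((wmma_m, wmma_k), name='A', dtype=in_dtype)
    B = te.placeholder((wmma_k, wmma_n), name='B', dtype=in_dtype)
    k = te.reduce_axis((0, wmma_k), name='k')
    C = te.compute(
        (wmma_m, wmma_n),
        lambda i, j: te.sum(A[i, k].astype(out_dtype) * B[k, j].astype(out_dtype),
                            axis=k),
        name='C')
    return A, B, C
```

The actual intrinsic additionally binds wmma-scoped buffers and emits the fill/mma intrinsics inside intrin_func before calling te.decl_tensor_intrin, as in the lines quoted above.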

PaddedInput.shape,
lambda h, w, i, o: PaddedInput[h, w, i, o].astype('float16'))
TransFilter = te.compute(
Filter.shape, lambda h, w, i, o: Filter[h, w, i, o].astype('float16'))
Contributor
@jwfromm Mar 20, 2020

Does it make sense to do this casting as part of the function? Instead we could apply a relay pass to downcast to float16 ahead of time, which should be more efficient and make the conversions more visible to users.

Contributor Author

This is a very good suggestion! Regarding this pass, we think the following two functions should be implemented:
1. Pre-quantizing weights from fp32 to fp16, to avoid converting weights at runtime.
2. Removing coupled conversions, i.e. a cast from fp16 to fp32 immediately followed by a cast from fp32 to fp16 (a small Relay sketch follows below). In this case, the pass can automatically insert or remove conversion operations in the graph and keep the cost of data type conversions as low as possible.

In addition to the functions above, there is an open question: should we place this pass in the quantization module?

The pass is complex enough that we would like to submit it as a separate Tensor Core optimization in the future.

The current submission focuses on supporting Tensor Core for Conv2d and Dense so that CNNs can run on Tensor Cores. The performance is already fairly good even with these data type conversions.
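As an illustration of the coupled-conversion pattern in point 2, here is a minimal Relay sketch (the shape is made up for the example); the proposed pass would fold the back-to-back casts away:

```python
import tvm
from tvm import relay

# A value cast fp16 -> fp32 and immediately back to fp16: the pair of casts
# is redundant work that a conversion-aware pass could remove.
x = relay.var("x", shape=(1, 56, 56, 64), dtype="float16")
y = relay.cast(relay.cast(x, "float32"), "float16")
mod = tvm.IRModule.from_expr(relay.Function([x], y))
print(mod)  # shows the cast(cast(x)) pattern before any simplification
```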

WS = s.cache_read(kernel, 'shared', [OL])
AF = s.cache_read(AS, 'wmma.matrix_a', [OL])
WF = s.cache_read(WS, 'wmma.matrix_b', [OL])
ConvF = s.cache_write(OL, 'wmma.accumulator')
Contributor

this if/else has a lot of duplicated code that could be shared.

Contributor Author

The duplicated code has been moved before the if/else block.

cfg.define_knob("offsetCS", [0, 8])
cfg.define_knob("vec", [1, 2, 4, 8])

#Make it available by default
Contributor

Unclear what this comment means.

Contributor Author

It means that the default parameters are applicable when users do not use AutoTVM to fine-tune them. We have modified the comment to make it clearer.
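A minimal sketch of what "usable by default" can look like with AutoTVM, assuming the knob names quoted above; the fallback values chosen here are placeholders, not the defaults used in the PR:

```python
from tvm.autotvm.task.space import OtherOptionEntity

def apply_default_knobs(cfg):
    """When no tuning log is available, cfg.is_fallback is True and the
    schedule fills in workable defaults so it runs without AutoTVM tuning."""
    if cfg.is_fallback:
        cfg["offsetCS"] = OtherOptionEntity(8)  # placeholder default
        cfg["vec"] = OtherOptionEntity(4)       # placeholder default
```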

@jwfromm
Contributor

jwfromm commented Mar 20, 2020

Hi @Shawn-Inspur, thanks so much for this work. I'm really excited to try out some Tensor Core workloads. To help me better understand how the auto tensorcore codegen introduced in PR #4234 is meant to be used, can you explain why you didn't use it for this PR?

@Hzfengsy
Member

Thank you for your interest, @jwfromm! That is a very good question, and I think other people probably have the same question, so I would like to tell the story in detail.

PR #4136 introduced the low-level Tensor Core intrinsics, which form TVM's Tensor Core infrastructure. They enable many different ways to use Tensor Cores in TVM; this PR and #4234 are exactly two of those ways.

PR #4234 uses a pass, RewriteForTensorCore, to detect the matmul pattern (please see RFC #4105 for the details). The good thing is that users can write a normal matmul schedule and the pass will do the rest. However, this algorithm brings too many constraints:

  • The pass can only detect the gemm pattern; it cannot support conv2d.
  • The algorithm only supports one local fragment per warp, which causes a significant performance regression on large-scale workloads. For a better understanding of this point, please see Figure 7 of the CUTLASS introduction (https://devblogs.nvidia.com/cutlass-linear-algebra-cuda/): it uses 2×4 local fragments for large-scale gemm.

As we know, performance is the most important thing for schedules in topi. Hence, we use the intrinsics directly rather than the Auto Tensor Core CodeGen pass, in a similar way to the tutorial (https://docs.tvm.ai/tutorials/optimize/opt_conv_tensorcore.html). The major differences are the following two things:

  • It enables a traditional data layout (NHWC, and maybe NCHW in the future), whereas the tutorial requires a packed layout (NHWCnc). We ran a lot of experiments to choose the layout and achieved the best performance we could. If you are interested in the details, @Shawn-Inspur may share them.
  • AutoTVM tunes the op schedules. The search space contains different local fragment counts, different warp counts per block, and different memory layout offsets. We can even use AutoTVM to search over the fragment shape (32×16×8, 16×16×16, or 8×16×32); a small sketch of such a knob follows below.
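A minimal sketch of exposing the fragment shape as an AutoTVM knob, as described in the second point above (the knob name and the string encoding of the shape are assumptions for illustration):

```python
from tvm import autotvm

cfg = autotvm.get_config()
# Candidate wmma fragment shapes (m, n, k) supported by Tensor Cores.
cfg.define_knob("fragment_shape", ["32x16x8", "16x16x16", "8x16x32"])
wmma_m, wmma_n, wmma_k = (int(v) for v in cfg["fragment_shape"].val.split("x"))
```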

@jwfromm
Contributor

jwfromm commented Mar 21, 2020

That makes sense @Hzfengsy, thanks for the explanation!


def test_conv2d_nhwc_tensorcore():
"""Test the conv2d with tensorcore for nhwc layout"""
verify_conv2d_nhwc(16, 16, 14, 16, 3, 1, 1)
Contributor

Are we able to run workloads with a batch size of 1? If not, we should put an assertion somewhere in the compute or schedule.

Contributor Author

Currently, a batch size of 1 is not supported. Thanks for the suggestion! An assertion has been added in the compute.


def test_dense_tensorcore():
"""Test cases"""
verify_dense(8, 16, 32, use_bias=True)
Contributor

Is there any shape limitation for dense?

Contributor Author

Currently, the shape of (batch size, input channel, output channel) must be divisible by (16, 16, 16), (32, 16, 8), or (8, 16, 32). An assertion has been added in the compute to make sure the shape is valid for Tensor Core.
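A minimal sketch of such a shape check (parameter names and the message are illustrative; the exact assertion in the PR may differ):

```python
def check_tensorcore_shape(batch, in_channel, out_channel):
    """Tensor Core wmma fragments come in (m, n, k) shapes of (16, 16, 16),
    (32, 16, 8) and (8, 16, 32), so the gemm dimensions must be divisible by
    one of these combinations."""
    assert (batch % 16 == 0 and in_channel % 16 == 0 and out_channel % 16 == 0) or \
           (batch % 32 == 0 and in_channel % 16 == 0 and out_channel % 8 == 0) or \
           (batch % 8 == 0 and in_channel % 16 == 0 and out_channel % 32 == 0), \
        "(batch, in_channel, out_channel) must be divisible by " \
        "(16, 16, 16), (32, 16, 8) or (8, 16, 32) to use Tensor Core"
```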

return Output


def schedule_nhwc_tensorcore_cuda(cfg, s, Conv):
Contributor

We need an assertion on the size requirements for the batch size and output channel size.

Contributor Author

Thanks for the suggestion! An assertion has been added in the compute.

Contributor
@Laurawly left a comment

LGTM

@tqchen tqchen merged commit f4286cc into apache:master Mar 27, 2020
@tqchen
Member

tqchen commented Mar 27, 2020

Thanks @Shawn-Inspur @Hzfengsy @Laurawly @vinx13 @jwfromm !

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Apr 16, 2020
[TOPI][Tensor Core] Conv2d and Dense ops support on Tensor Core (apache#5099)

* [TOPI][Tensor Core] Optimization of CNNs on Tensor Core apache#6004

* update conv2d test

* # pylint: dense_tensorcore.py

* modify

* modify conv2d

* modify the unclear comment,add shape assertion in conv2d compute,combine general gemm intrinsic

* add shape assertion in conv2d compute, combine general gemm intrinsic

Co-authored-by: libaihong <libaihong@inspur.com>
Co-authored-by: libaihong <61525430+libaihong@users.noreply.github.com>
zhiics pushed a commit to neo-ai/tvm that referenced this pull request Apr 17, 2020
[TOPI][Tensor Core] Conv2d and Dense ops support on Tensor Core (apache#5099)

* [TOPI][Tensor Core] Optimization of CNNs on Tensor Core apache#6004

* update conv2d test

* # pylint: dense_tensorcore.py

* modify

* modify conv2d

* modify the unclear comment,add shape assertion in conv2d compute,combine general gemm intrinsic

* add shape assertion in conv2d compute, combine general gemm intrinsic

Co-authored-by: libaihong <libaihong@inspur.com>
Co-authored-by: libaihong <61525430+libaihong@users.noreply.github.com>