
[TOPI][Tensor Core] Conv2d and Dense ops support on Tensor Core #5099

Merged
10 commits merged into apache:master on Mar 27, 2020

Conversation

Shawn-IEITSystems
Contributor

  • Add Conv2d and Dense op support on Tensor Cores.
  • This enables running CNNs on the Tensor Core units of modern GPUs.
  • Significant performance improvements have been achieved.

Please see RFC link below for details:
Optimization of CNNs on Tensor Core

@Laurawly @vinx13 @masahi Please help to review.


return update(), init(), update()

return te.decl_tensor_intrin(C.op, intrin_func, binds={A: BA, B: BB, C: BC})
Contributor

The two intrin_wmma_gemm functions in conv2d and dense are extremely similar. Do you think it makes sense to have a single, more general gemm intrinsic for both, or is there something preventing that?

Contributor Author

Thanks for the suggestion! The code has been modified to use a general gemm intrinsic for both conv2d and dense.
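For illustration, here is a minimal sketch of how the fragment-level computation behind such a shared gemm intrinsic can be parametrized by the wmma shape and dtypes; the function and variable names are illustrative, not the exact ones used in this PR:

```python
from tvm import te

def wmma_gemm_compute(wmma_m=16, wmma_n=16, wmma_k=16,
                      in_dtype='float16', out_dtype='float32'):
    """Declare the fragment-level matmul that a shared wmma gemm intrinsic
    would tensorize; parametrizing the fragment shape lets one declaration
    serve both the conv2d and dense schedules."""
    A = te.placeholder((wmma_m, wmma_k), name='A', dtype=in_dtype)
    B = te.placeholder((wmma_k, wmma_n), name='B', dtype=in_dtype)
    k = te.reduce_axis((0, wmma_k), name='k')
    C = te.compute(
        (wmma_m, wmma_n),
        lambda i, j: te.sum(A[i, k].astype(out_dtype) * B[k, j].astype(out_dtype),
                            axis=k),
        name='C')
    return A, B, C
```

The actual intrinsic additionally binds wmma-scoped buffers and emits the fill/mma intrinsics inside intrin_func before calling te.decl_tensor_intrin, as in the lines quoted above.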

PaddedInput.shape,
lambda h, w, i, o: PaddedInput[h, w, i, o].astype('float16'))
TransFilter = te.compute(
Filter.shape, lambda h, w, i, o: Filter[h, w, i, o].astype('float16'))
Contributor
@jwfromm Mar 20, 2020

Does it make sense to do this casting as part of the function? Instead we could apply a relay pass to downcast to float16 ahead of time, which should be more efficient and make the conversions more visible to users.

Contributor Author

This is a very good suggestion! Regarding this pass, we think the following two functions should be implemented:
1. Pre-quantizing weights from fp32 to fp16, to avoid converting weights at runtime.
2. Removing coupled conversions, i.e. a cast from fp16 to fp32 immediately followed by a cast from fp32 to fp16 (a small Relay sketch follows below). In this case, the pass can automatically insert or remove conversion operations in the graph and keep the cost of data type conversions as low as possible.

In addition to the functions above, there is an open question: should we place this pass in the quantization module?

The pass is complex enough that we would like to submit it as a separate Tensor Core optimization in the future.

The current submission focuses on supporting Tensor Core for Conv2d and Dense so that CNNs can run on Tensor Cores. The performance is already fairly good even with these data type conversions.
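As an illustration of the coupled-conversion pattern in point 2, here is a minimal Relay sketch (the shape is made up for the example); the proposed pass would fold the back-to-back casts away:

```python
import tvm
from tvm import relay

# A value cast fp16 -> fp32 and immediately back to fp16: the pair of casts
# is redundant work that a conversion-aware pass could remove.
x = relay.var("x", shape=(1, 56, 56, 64), dtype="float16")
y = relay.cast(relay.cast(x, "float32"), "float16")
mod = tvm.IRModule.from_expr(relay.Function([x], y))
print(mod)  # shows the cast(cast(x)) pattern before any simplification
```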

WS = s.cache_read(kernel, 'shared', [OL])
AF = s.cache_read(AS, 'wmma.matrix_a', [OL])
WF = s.cache_read(WS, 'wmma.matrix_b', [OL])
ConvF = s.cache_write(OL, 'wmma.accumulator')
Contributor

this if/else has a lot of duplicated code that could be shared.

Contributor Author

The duplicated code has been moved before the if/else block.

cfg.define_knob("offsetCS", [0, 8])
cfg.define_knob("vec", [1, 2, 4, 8])

#Make it available by default
Contributor

Unclear what this comment means.

Contributor Author

It means that the default parameters are applicable when users do not use AutoTVM to fine-tune them. We have modified the comment to make it clearer.
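A minimal sketch of what "usable by default" can look like with AutoTVM, assuming the knob names quoted above; the fallback values chosen here are placeholders, not the defaults used in the PR:

```python
from tvm.autotvm.task.space import OtherOptionEntity

def apply_default_knobs(cfg):
    """When no tuning log is available, cfg.is_fallback is True and the
    schedule fills in workable defaults so it runs without AutoTVM tuning."""
    if cfg.is_fallback:
        cfg["offsetCS"] = OtherOptionEntity(8)  # placeholder default
        cfg["vec"] = OtherOptionEntity(4)       # placeholder default
```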

@jwfromm
Contributor

jwfromm commented Mar 20, 2020

Hi @Shawn-Inspur, thanks so much for this work. I'm really excited to try out some Tensor Core workloads. To help me better understand how the auto tensorcore codegen introduced in PR #4234 is meant to be used, can you explain why you didn't use it for this PR?

@Hzfengsy
Member

Thank you for your interest, @jwfromm! That is a very good question, and I think other people probably have the same question, so I would like to tell the story in detail.

PR #4136 introduced the low-level Tensor Core intrinsics, which form TVM's Tensor Core infrastructure. They enable many different ways to use Tensor Cores in TVM; this PR and #4234 are exactly two of those ways.

PR #4234 uses a pass, RewriteForTensorCore, to detect the matmul pattern (please see RFC #4105 for the details). The good thing is that users can write a normal matmul schedule and the pass will do the rest. However, this algorithm brings too many constraints:

  • The pass can only detect the gemm pattern; it cannot support conv2d.
  • The algorithm only supports one local fragment per warp, which causes a significant performance regression on large-scale workloads. For a better understanding of this point, please see Figure 7 of the CUTLASS introduction (https://devblogs.nvidia.com/cutlass-linear-algebra-cuda/): it uses 2×4 local fragments for large-scale gemm.

As we know, performance is the most important thing for schedules in topi. Hence, we use the intrinsics directly rather than the Auto Tensor Core CodeGen pass, in a similar way to the tutorial (https://docs.tvm.ai/tutorials/optimize/opt_conv_tensorcore.html). The major differences are the following two things:

  • It enables a traditional data layout (NHWC, and maybe NCHW in the future), whereas the tutorial requires a packed layout (NHWCnc). We ran a lot of experiments to choose the layout and achieved the best performance we could. If you are interested in the details, @Shawn-Inspur may share them.
  • AutoTVM tunes the op schedules. The search space contains different local fragment counts, different warp counts per block, and different memory layout offsets. We can even use AutoTVM to search over the fragment shape (32×16×8, 16×16×16, or 8×16×32); a small sketch of such a knob follows below.
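A minimal sketch of exposing the fragment shape as an AutoTVM knob, as described in the second point above (the knob name and the string encoding of the shape are assumptions for illustration):

```python
from tvm import autotvm

cfg = autotvm.get_config()
# Candidate wmma fragment shapes (m, n, k) supported by Tensor Cores.
cfg.define_knob("fragment_shape", ["32x16x8", "16x16x16", "8x16x32"])
wmma_m, wmma_n, wmma_k = (int(v) for v in cfg["fragment_shape"].val.split("x"))
```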

@jwfromm
Contributor

jwfromm commented Mar 21, 2020

That makes sense @Hzfengsy, thanks for the explanation!


def test_conv2d_nhwc_tensorcore():
"""Test the conv2d with tensorcore for nhwc layout"""
verify_conv2d_nhwc(16, 16, 14, 16, 3, 1, 1)
Contributor

Are we able to run workloads with a batch size of 1? If not, we should put an assertion somewhere in the compute or schedule.

Contributor Author

Currently, a batch size of 1 is not supported. Thanks for the suggestion! An assertion has been added in the compute.


def test_dense_tensorcore():
"""Test cases"""
verify_dense(8, 16, 32, use_bias=True)
Contributor

Is there any shape limitation for dense?

Contributor Author

Currently, the shape of (batch size, input channel, output channel) must be divisible by (16, 16, 16), (32, 16, 8), or (8, 16, 32). An assertion has been added in the compute to make sure the shape is valid for Tensor Core.
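A minimal sketch of such a shape check (parameter names and the message are illustrative; the exact assertion in the PR may differ):

```python
def check_tensorcore_shape(batch, in_channel, out_channel):
    """Tensor Core wmma fragments come in (m, n, k) shapes of (16, 16, 16),
    (32, 16, 8) and (8, 16, 32), so the gemm dimensions must be divisible by
    one of these combinations."""
    assert (batch % 16 == 0 and in_channel % 16 == 0 and out_channel % 16 == 0) or \
           (batch % 32 == 0 and in_channel % 16 == 0 and out_channel % 8 == 0) or \
           (batch % 8 == 0 and in_channel % 16 == 0 and out_channel % 32 == 0), \
        "(batch, in_channel, out_channel) must be divisible by " \
        "(16, 16, 16), (32, 16, 8) or (8, 16, 32) to use Tensor Core"
```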

return Output


def schedule_nhwc_tensorcore_cuda(cfg, s, Conv):
Contributor

We need an assertion on the size requirements for the batch size and output channel size.

Contributor Author

Thanks for the suggestion! An assertion has been added in the compute.

Contributor
@Laurawly left a comment

LGTM

@tqchen tqchen merged commit f4286cc into apache:master Mar 27, 2020
@tqchen
Member

tqchen commented Mar 27, 2020

Thanks @Shawn-Inspur @Hzfengsy @Laurawly @vinx13 @jwfromm !

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Apr 16, 2020
[TOPI][Tensor Core] Conv2d and Dense ops support on Tensor Core (apache#5099)

* [TOPI][Tensor Core] Optimization of CNNs on Tensor Core apache#6004

* update conv2d test

* # pylint: dense_tensorcore.py

* modify

* modify conv2d

* modify the unclear comment,add shape assertion in conv2d compute,combine general gemm intrinsic

* add shape assertion in conv2d compute, combine general gemm intrinsic

Co-authored-by: libaihong <libaihong@inspur.com>
Co-authored-by: libaihong <61525430+libaihong@users.noreply.github.com>
zhiics pushed a commit to neo-ai/tvm that referenced this pull request Apr 17, 2020
[TOPI][Tensor Core] Conv2d and Dense ops support on Tensor Core (apache#5099)

* [TOPI][Tensor Core] Optimization of CNNs on Tensor Core apache#6004

* update conv2d test

* # pylint: dense_tensorcore.py

* modify

* modify conv2d

* modify the unclear comment,add shape assertion in conv2d compute,combine general gemm intrinsic

* add shape assertion in conv2d compute, combine general gemm intrinsic

Co-authored-by: libaihong <libaihong@inspur.com>
Co-authored-by: libaihong <61525430+libaihong@users.noreply.github.com>