[TOPI][Tensor Core] Conv2d and Dense ops support on Tensor Core #5099
Conversation
return update(), init(), update()

return te.decl_tensor_intrin(C.op, intrin_func, binds={A: BA, B: BB, C: BC})
The two intrin_wmma_gemm functions in conv2d and dense are extremely similar. Do you think it makes sense to instead have one more general gemm intrinsic for both, or is there something preventing that?
Thanks for the suggestion! The code has been modified to use one general gemm intrinsic for both conv2d and dense.
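The factoring discussed here can be illustrated outside of TVM: rather than two near-duplicate gemm routines, a single factory parameterized by the wmma tile shape serves both callers. A minimal pure-Python sketch of the idea (the name `make_gemm` is illustrative, not from the PR; the real intrinsic is declared with `te.decl_tensor_intrin`):

```python
def make_gemm(wmma_m, wmma_n, wmma_k):
    """Return a tile-level gemm specialized to one wmma shape.

    Both conv2d and dense can call this factory instead of each
    defining a near-identical intrinsic of its own.
    """
    def gemm(A, B, C):
        # A: wmma_m x wmma_k, B: wmma_k x wmma_n; C accumulates in place.
        for i in range(wmma_m):
            for j in range(wmma_n):
                for k in range(wmma_k):
                    C[i][j] += A[i][k] * B[k][j]
        return C
    return gemm

# One definition, specialized per tile shape (tiny 2x2x2 tile for brevity):
gemm_2x2x2 = make_gemm(2, 2, 2)
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0]]  # identity, so C ends up equal to A
C = [[0.0, 0.0], [0.0, 0.0]]
gemm_2x2x2(A, B, C)
```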
PaddedInput.shape,
lambda h, w, i, o: PaddedInput[h, w, i, o].astype('float16'))
TransFilter = te.compute(
    Filter.shape, lambda h, w, i, o: Filter[h, w, i, o].astype('float16'))
Does it make sense to do this casting as part of the function? Instead, we could apply a relay pass to downcast to float16 ahead of time, which should be more efficient and make the conversions more visible to users.
This is a very good suggestion! Regarding this pass, we think the following two functions should be implemented:
1. Pre-quantizing weights from fp32 to fp16, to avoid converting the weights at runtime.
2. Removing coupled conversions, such as a conversion from fp16 to fp32 immediately followed by one from fp32 back to fp16. The pass should automatically insert or remove conversion operations in the graph, keeping the cost of data type conversions as low as possible.
In addition to the functions above, there is an open question: should we place this pass in the quantization module?
The pass is complex enough that we would like to submit it as a separate Tensor Core optimization in the future. The current submission focuses on supporting Tensor Core for conv2d and dense so that CNNs can be computed on Tensor Core. Performance is fairly good on Tensor Core even with these data type conversions.
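The second function described above, removing coupled conversions, can be sketched as a simple peephole rewrite. A real relay pass would rewrite the dataflow graph; this illustration works over a flat op list with a hypothetical `("cast", src, dst)` representation, purely to show the cancellation rule:

```python
def remove_coupled_casts(ops):
    """Drop adjacent cast pairs that undo each other, e.g. a cast
    fp16 -> fp32 immediately followed by a cast fp32 -> fp16.

    Op representation is hypothetical: ("cast", src_dtype, dst_dtype)
    or ("op", name). A real pass would operate on the relay graph.
    """
    out = []
    for op in ops:
        if (out and op[0] == "cast" and out[-1][0] == "cast"
                and out[-1][2] == op[1] and out[-1][1] == op[2]):
            out.pop()  # the two casts cancel each other out
        else:
            out.append(op)
    return out

ops = [("op", "conv2d"),
       ("cast", "float16", "float32"),
       ("cast", "float32", "float16"),
       ("op", "dense")]
simplified = remove_coupled_casts(ops)  # the cast pair is removed
```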
WS = s.cache_read(kernel, 'shared', [OL])
AF = s.cache_read(AS, 'wmma.matrix_a', [OL])
WF = s.cache_read(WS, 'wmma.matrix_b', [OL])
ConvF = s.cache_write(OL, 'wmma.accumulator')
This if/else has a lot of duplicated code that could be shared.
The duplicated code has been moved before the if/else block.
cfg.define_knob("offsetCS", [0, 8])
cfg.define_knob("vec", [1, 2, 4, 8])

# Make it available by default
Unclear what this comment means.
It means that the default parameters are applicable when users do not use AutoTVM to tune them. We have modified the comment to make this clear.
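The fallback behavior described above can be sketched in a few lines: each knob carries a candidate list, and when no AutoTVM tuning record is present the first candidate serves as the default. The class below is illustrative only; it mirrors the `cfg.define_knob` calls from the diff but is not TVM's actual implementation:

```python
class FallbackConfig:
    """Sketch of 'available by default': each knob gets a default
    value so the schedule works without an AutoTVM tuning log."""

    def __init__(self):
        self.knobs = {}

    def define_knob(self, name, candidates):
        # Without a tuning record, fall back to the first candidate.
        self.knobs[name] = candidates[0]

cfg = FallbackConfig()
cfg.define_knob("offsetCS", [0, 8])
cfg.define_knob("vec", [1, 2, 4, 8])
```

With a tuning log, AutoTVM would instead pick the measured best candidate for each knob.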
Hi @Shawn-Inspur, thanks so much for this work. I'm really excited to try out some tensor core workloads. To help me better understand how the auto tensorcore codegen introduced in PR #4234 is meant to be used, can you explain why you didn't use it for this PR?
Thank you for your interest @jwfromm! That is a very good question, and I think other people may share the same confusion, so I would like to tell the story in detail. PR #4136 introduced the Tensor Core low-level intrinsics, which form the TVM Tensor Core infrastructure. It enables many different ways to use Tensor Core in TVM, and this PR and #4234 are exactly two of those ways: PR #4234 uses the Auto Tensor Core CodeGen pass.
As we know, performance is the most important thing for the schedules in topi. Hence, we use the intrinsics directly rather than the Auto Tensor Core CodeGen pass, similar to the approach shown in the tutorial (https://docs.tvm.ai/tutorials/optimize/opt_conv_tensorcore.html). The major differences are the following two things:
That makes sense @Hzfengsy, thanks for the explanation!
def test_conv2d_nhwc_tensorcore():
    """Test the conv2d with tensorcore for nhwc layout"""
    verify_conv2d_nhwc(16, 16, 14, 16, 3, 1, 1)
Are we able to run workloads with a batch size of 1? If not, we should put an assertion somewhere in compute or schedule.
Currently, a batch size of 1 is not supported. Thanks for the suggestion! An assertion has been added in compute.
def test_dense_tensorcore():
    """Test cases"""
    verify_dense(8, 16, 32, use_bias=True)
Is there any shape limitation for dense?
Currently, the shape (batch size, input channel, output channel) must be divisible by (16, 16, 16), (32, 16, 8), or (8, 16, 32). An assertion has been added in compute to make sure the shape is right for Tensor Core.
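The divisibility requirement above can be captured in a small check. The helper name below is hypothetical; in the PR, the assertion lives inside the compute function:

```python
def assert_tensorcore_shape(batch, in_dim, out_dim):
    """Check that (batch, input channel, output channel) is divisible
    by one of the wmma tile shapes the schedule supports."""
    tile_shapes = [(16, 16, 16), (32, 16, 8), (8, 16, 32)]
    ok = any(batch % b == 0 and in_dim % i == 0 and out_dim % o == 0
             for (b, i, o) in tile_shapes)
    assert ok, "shape not supported by the Tensor Core schedule"
    return ok

assert_tensorcore_shape(8, 16, 32)  # matches the (8, 16, 32) tile
```

Note that a batch size of 1 fails every tile shape, which is why the reviewer's earlier question about batch size 1 also ends up covered by this assertion.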
return Output


def schedule_nhwc_tensorcore_cuda(cfg, s, Conv):
An assertion is needed on the size requirements for the batch size and the output channel size.
Thanks for the suggestions! An assertion has been added in compute.
LGTM
…he#5099)
* [TOPI][Tensor Core] Optimization of CNNs on Tensor Core apache#6004
* update conv2d test
* # pylint: dense_tensorcore.py
* modify
* modify conv2d
* modify the unclear comment, add shape assertion in conv2d compute, combine general gemm intrinsic
* add shape assertion in conv2d compute, combine general gemm intrinsic

Co-authored-by: libaihong <libaihong@inspur.com>
Co-authored-by: libaihong <61525430+libaihong@users.noreply.github.com>
Please see RFC link below for details:
Optimization of CNNs on Tensor Core
@Laurawly @vinx13 @masahi Please help to review.