[ARM][Topi] Improving Int8 Perf in Spatial Conv2D schedule. #4277
Conversation
name='data_vec')

if pre_packed:
    kernel_vec = kernel
    if adjusted_dtype != kernel.dtype:
        kernel_vec = tvm.compute(kvshape, lambda co, ci, kh, kw, vc:
is this scheduled correctly? presumably we want this kernel_vec
computation inlined in this case?
I tried to compute it inline in the schedule. But doing that upsets LLVM and we go back to older less-performant assembly code.
An alternative to this PR could be a legalize pass which, on seeing a conv2d with int8 inputs, inserts an upcast to int16. In that case the upcast shows up at the Relay level, and the topi code remains unchanged.
In both cases, the memory footprint takes a hit as we upcast the weight to int16. With the alternate implementation, the larger weights show up in the artifacts (on disk); with the current PR, the allocation happens internally in the conv operator.
Previously we only did parallel for kernel_vec; now we introduce one more compute, and the better way is compute_inline. Could you try this schedule:
s[kernel_vec].unroll(kh)
s[kernel_vec].unroll(kw)
s[kernel_vec].vectorize(vc)
s[kernel_vec].parallel(co)
s[kernel_vec].compute_inline()
This is what we use in our schedule internally, and it can produce the SMLAL instruction when casting into int16. However, I cannot be sure it will work here, because our computation and schedule are not the same.
Thanks for the suggestion @FrozenGene :)
I will come back to it in a day or two and will play with your suggestions to see if they lead to improvements.
Duplicating the message from the comments here for discussion. An alternative to this PR could be a legalize pass which, on seeing a conv2d with int8 inputs, inserts an upcast to int16. In that case the upcast shows up at the Relay level, and the topi code remains unchanged. I manually verified that LLVM gives good assembly when conv2D inputs are int16. In both cases, the memory footprint takes a hit as we upcast the weight to int16. With the alternate implementation, the larger weights show up in the artifacts (on disk); with the current PR, the allocation happens internally in the conv operator.
Ha! This is so interesting - it is very similar to what we tried internally many months ago. I think legalization is a good idea, though it is not easy to read. And we are going to share some end-to-end INT8 benchmarks at the TVM Shanghai Meetup next week (presented by @FrozenGene). Also, from the tuning experience, I recently got the idea of trying a trivial im2col schedule (without tensorize). Maybe we can share some code later :).
@jackwish Yes, sharing code will be very helpful. For now, do you prefer changing the schedule or a Legalize pass?
Changing Schedule
Legalize Pass
I don't have any strong opinion.
@jackwish I'd be very interested in those results. I got some good results for NHWC on ARMv7 by porting the QNNPACK kernels over.
@ajtulloch Thanks for the interest and the great discussion we have had. :-) I want to summarize some of our high-level ideas and will present the results at the next TVM meetup in Shanghai. For Convolution:
We studied QNNPACK, but QNNPACK cannot be used by us directly; there are concepts in QNNPACK we cannot simulate, for example the indirect buffer. So we wrote the kernel ourselves. For Depthwise Convolution:
Yes. We use the INT16 * INT16 + INT32 -> INT32 instruction (SMLAL), which is better than INT32 * INT32 + INT32 -> INT32. The way we do it is that we subtract the input_zero_point / kernel_zero_point before the computation, and at that point we cast the dtype from UINT8 to INT16. For depthwise convolution, even though we don't use tensorize, we still get performance better than QNNPACK (in MobileNet V1 / MobileNet V2, only 2 layers are slower than it; on the others we are faster than QNNPACK). Amazing result. I want to list two key points:
Currently, we have tested MobileNetV2 on rasp; we are 1.34X compared with QNNPACK. On our in-house model, we are ahead by even more compared with QNNPACK. We will present more at the TVM meetup.
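The zero-point trick described above can be sketched numerically. This is a hedged plain-NumPy illustration, not the actual kernel code, and the zero-point and tensor values are made up: subtracting the zero points first lets uint8 data be represented exactly in int16, so the dot product only needs 16x16->32 multiply-accumulates, which is the SMLAL pattern.

```python
import numpy as np

# Illustrative (made-up) quantization zero points.
input_zp, kernel_zp = 128, 120

data = np.array([200, 3, 255], dtype=np.uint8)
kernel = np.array([100, 255, 0], dtype=np.uint8)

# uint8 -> int16 after zero-point subtraction; the result always fits,
# since every value lies in [-255, 255].
data_i16 = data.astype(np.int16) - input_zp
kernel_i16 = kernel.astype(np.int16) - kernel_zp

# int16 * int16 products accumulated in int32: the SMLAL pattern.
acc = np.dot(data_i16.astype(np.int32), kernel_i16.astype(np.int32))

# Reference: the same dot product computed entirely in int32.
ref = np.dot(data.astype(np.int32) - input_zp,
             kernel.astype(np.int32) - kernel_zp)
assert acc == ref
```

Because the subtraction happens before the multiply, the narrower int16 operands lose no information relative to the int32 reference.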
If someone is a schedule person who modifies the schedule frequently, applying it in the schedule is better for them, because they can handle the whole thing in the schedule. However, from the code viewpoint, I suggest adding it in a Pass, because this should be applied in all templates. And for ARMv8.2, the dot product instruction is better, which could be done in legalize too.
@FrozenGene wow, those are very impressive results. Congratulations, looking forward to the talk and the code :)
@ajtulloch I looked a bit into your code, very decent design! I have not tried tensorization with the spatial pack schedule; would you please share some performance results?
Hi @anijain2305, I think a Legalize pass may lead to a cleaner code structure or architecture, while changing the schedule is easier to read. I feel the same about having no strong opinion :( Yet one thing I am considering is that, if the legalization is going to handle all of this, will it make the tensorize compute pattern matching a bit harder to write or debug?
@jackwish Thanks for the comment. If we go the legalize way, the following kind of transformation would happen
So, in this case, even if we want to use tensorize, we will only see int16 inputs in the conv. Does that answer your question? I am somewhat inclined towards going with the schedule, because it is contained in one place for now, and one can easily change it if one decides to use tensorize. Legalize might throw off schedule developers who want to see int8 inputs and are surprised to see int16 inputs.
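As a minimal sketch of what such a Legalize pass would decide (the helper name and its string-based dtype handling are hypothetical, not the actual Relay API), the rewrite only touches the 8-bit-into-int32 case and leaves every other dtype combination alone:

```python
def legalize_conv2d_dtypes(data_dtype, kernel_dtype, out_dtype):
    """Hypothetical dtype rewrite a Legalize pass could perform.

    When both conv2d inputs are 8-bit integers and the accumulator is
    int32, upcast the inputs to int16 so the topi schedule (and any
    tensorize pattern) only ever sees int16 inputs; all other dtype
    combinations pass through unchanged.
    """
    if (data_dtype in ('int8', 'uint8')
            and kernel_dtype in ('int8', 'uint8')
            and out_dtype == 'int32'):
        return 'int16', 'int16', out_dtype
    return data_dtype, kernel_dtype, out_dtype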
Thanks for your kind explanation @anijain2305. That is exactly my consideration, as the legalization approach may confuse the tensorize implementation. In this case, I vote for changing the schedule.
@ajtulloch it would be great if you can manage this PR; you should have the permission to approve and merge.
@jackwish Thanks! Can you please review and let me know if you have more comments? @ajtulloch Let me know if you have any comments.
# because LLVM is able to better interleave vmlal.s16 and vldr instructions,
# leading to higher CPU utilization.
adjusted_dtype = data.dtype
if 'int8' in data.dtype and 'int8' in kernel.dtype and out_dtype == 'int32':
should cover uint8?
Yeah, I think the schedule approach makes more sense to me as well.
@FrozenGene @anijain2305 it would be great if we can follow up on this thread.
Apologies. Will take a look at it again next week.
ping @anijain2305
Closing, as this is not relevant anymore given parallel efforts to improve int8 conv schedules.
I am working on improving the performance of Int8 conv on Raspberry Pi 3.
For Conv2D, there is an upcast from int8 to int32 before performing the dot-product. The ARM ISA has an instruction called vmlal.s16 that multiplies two SIMD registers each containing 4 16-bit values and accumulates into one SIMD register containing 4 32-bit values. However, LLVM (4.0, 6.0 and 8.0) is not able to use this instruction efficiently. In the absence of this PR, the assembly looks something like this
However, if we add an intermediate upcast to int16, i.e. instead of going from int8 to int32 we go from int8 to int16 and then feed this int16 into conv2D, it results in much better packing of compute and memory instructions.
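The effect of that intermediate upcast can be illustrated in plain NumPy (a sketch with random illustrative data, not the topi compute itself): the int8 operands are widened to int16, every int8*int8 product then fits in int16, and the accumulation widens to int32, which is exactly the widening multiply-accumulate that vmlal.s16 implements.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(-128, 128, size=16).astype(np.int8)
kernel = rng.integers(-128, 128, size=16).astype(np.int8)

# int8 -> int16 upcast (the role of adjusted_dtype / the kernel_vec cast).
data_i16 = data.astype(np.int16)
kernel_i16 = kernel.astype(np.int16)

# Each int8 * int8 product fits in int16 (|product| <= 128 * 128 = 16384),
# and the accumulation is widened to int32: the vmlal.s16 pattern.
prod = data_i16 * kernel_i16
acc = prod.astype(np.int32).sum()

# Reference: upcasting straight to int32, as the schedule did before.
ref = np.dot(data.astype(np.int32), kernel.astype(np.int32))
assert acc == ref
```

The two paths give the same integer result; the int16 route only changes which hardware instructions the compiler can select.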
I tested this with one Conv2D and with auto-tuning.
@jackwish @yzhliu @FrozenGene @merrymercy @zhiics @u99127