
[ARM][Topi] Improving Int8 Perf in Spatial Conv2D schedule. #4277

Closed
wants to merge 1 commit into from

Conversation

anijain2305
Contributor

@anijain2305 commented Nov 8, 2019

I am working on improving the performance of Int8 conv on Raspberry Pi 3.

For Conv2D, there is an upcast from int8 to int32 before performing the dot product. The ARM ISA has an instruction called vmlal.s16 that multiplies two 64-bit SIMD registers, each holding four 16-bit values, lane by lane and accumulates the widened products into a 128-bit SIMD register holding four 32-bit values. However, LLVM (4.0, 6.0 and 8.0) is not able to use this instruction efficiently. In the absence of this PR, the assembly looks something like this:

	add	r2, sp, #304
	vmlal.s16	q12, d0, d4
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #208
	vmlal.s16	q12, d10, d0
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #192
	vmlal.s16	q12, d8, d0
	vld1.64	{d0, d1}, [r2:128]
	sub	r2, r6, #7
	mov	r6, r7
	vmlal.s16	q12, d0, d14
	vld1.8	{d0[]}, [r2]
	add	r2, sp, #192
	vmovl.s8	q4, d0
	vst1.64	{d8, d9}, [r2:128]
	add	r2, sp, #240
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #176
	vmlal.s16	q14, d8, d0
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #128
	vmlal.s16	q14, d0, d12
	vld1.64	{d12, d13}, [r2:128]
	add	r2, sp, #288
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #96
	vmlal.s16	q14, d12, d0
	vld1.64	{d8, d9}, [r2:128]
	add	r2, sp, #352
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #224
	vmlal.s16	q14, d8, d0
	vld1.64	{d14, d15}, [r2:128]
	add	r2, sp, #80
	vmlal.s16	q14, d14, d2
	vld1.64	{d2, d3}, [r2:128]
	add	r2, sp, #304
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #208
	vmlal.s16	q14, d6, d4
	vld1.64	{d4, d5}, [r2:128]
	add	r2, sp, #240
	vmlal.s16	q14, d2, d0
	vmlal.s16	q14, d10, d4
	vld1.64	{d10, d11}, [r2:128]
	add	r2, sp, #192
	vld1.64	{d6, d7}, [r2:128]
	add	r2, sp, #160
	vmlal.s16	q13, d7, d11
	vld1.64	{d10, d11}, [r2:128]
	add	r2, sp, #176
	vld1.64	{d6, d7}, [r2:128]
	add	r2, sp, #288
	vmlal.s16	q13, d7, d11
	vld1.64	{d10, d11}, [r2:128]
	add	r2, sp, #352
	vmlal.s16	q13, d13, d11
	vld1.64	{d10, d11}, [r2:128]
	add	r2, sp, #336
	vld1.64	{d6, d7}, [r2:128]
	add	r2, sp, #144
	vmlal.s16	q13, d9, d11
	vld1.64	{d8, d9}, [r2:128]
	add	r2, sp, #320
	vld1.64	{d10, d11}, [r2:128]
	add	r2, sp, #272
	vmlal.s16	q13, d15, d7
	vmlal.s16	q13, d9, d11
	vmlal.s16	q13, d3, d1
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #256
	vmlal.s16	q13, d1, d5
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #112
	vld1.64	{d2, d3}, [r2:128]
	vmlal.s16	q14, d0, d2
	vmlal.s16	q13, d1, d3
	bne	.LBB6_5

However, if we add an intermediate upcast to int16, i.e. instead of going directly from int8 to int32 we first cast from int8 to int16 and feed the int16 tensors to conv2d, LLVM produces much better interleaving of compute and memory instructions:

.LBB7_6:
	add	r2, lr, r8
	add	r1, r12, #32
	vld1.64	{d0, d1}, [r1:128]
	add	r1, r12, #16
	add	r3, r2, #10
	add	r8, r8, #8
	vld1.64	{d4, d5}, [r1:128]
	mov	r1, r2
	vld1.16	{d6[]}, [r1:16], r7
	cmp	r8, #24
	vld1.16	{d7[]}, [r3:16]
	add	r3, r2, #2
	vld1.16	{d10[]}, [r1:16]
	add	r1, r2, #8
	vld1.16	{d2, d3}, [r10:128], r0
	vmlal.s16	q14, d6, d2
	vmlal.s16	q13, d6, d3
	vld1.16	{d6[]}, [r3:16]
	add	r3, r2, #12
	vmlal.s16	q12, d6, d2
	vmlal.s16	q11, d6, d3
	mov	r12, r10
	vmlal.s16	q14, d6, d4
	vld1.16	{d8[]}, [r3:16]
	vmlal.s16	q13, d6, d5
	add	r3, r2, #4
	vld1.16	{d6[]}, [r1:16]
	vmlal.s16	q15, d7, d3
	vmlal.s16	q9, d6, d2
	add	r1, r2, #6
	vmlal.s16	q8, d7, d2
	vld1.16	{d9[]}, [r3:16]
	vmlal.s16	q10, d6, d3
	vmlal.s16	q15, d8, d5
	vld1.16	{d2[]}, [r1:16]
	vmlal.s16	q12, d9, d4
	vmlal.s16	q11, d9, d5
	vmlal.s16	q9, d7, d4
	vmlal.s16	q10, d7, d5
	vmlal.s16	q8, d8, d4
	vmlal.s16	q13, d9, d1
	vmlal.s16	q14, d9, d0
	vmlal.s16	q15, d10, d1
	vmlal.s16	q8, d10, d0
	vmlal.s16	q10, d8, d1
	vmlal.s16	q9, d8, d0
	vmlal.s16	q11, d2, d1
	vmlal.s16	q12, d2, d0
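
As a rough sketch (with hypothetical helper and tensor names, not the actual PR diff), the intermediate upcast can be expressed in TOPI's compute language by materializing int16 copies of the int8 operands before the convolution reduction:

    import tvm

    def upcast_to_int16(data, kernel):
        # Copy the int8 tensors into int16 ones so that the conv2d reduction
        # multiplies int16 lanes and accumulates into int32, which LLVM can
        # lower directly to vmlal.s16.
        data_i16 = tvm.compute(data.shape,
                               lambda *i: data(*i).astype('int16'),
                               name='data_int16')
        kernel_i16 = tvm.compute(kernel.shape,
                                 lambda *i: kernel(*i).astype('int16'),
                                 name='kernel_int16')
        return data_i16, kernel_i16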

I tested this with one Conv2D and with auto-tuning.

  • Workload - input shape (1, 64, 16, 16), kernel (64, 64, 3, 3)
  • Platform - Raspberry Pi, 0.6 GHz
  • Auto-tuning done for both FP32 and Int8 conv
  • FP32 Latency = 3817 us, Int8 latency = 3015 us
  • In the absence of this PR, Int8 latency is worse than FP32.

@jackwish @yzhliu @FrozenGene @merrymercy @zhiics @u99127

@anijain2305 changed the title from "[ARM][Topi] Supporting Int8 in Spatial schedule." to "[ARM][Topi] Improving Int8 Perf in Spatial Conv2D schedule." on Nov 8, 2019
                               name='data_vec')

    if pre_packed:
        kernel_vec = kernel
        if adjusted_dtype != kernel.dtype:
            kernel_vec = tvm.compute(kvshape, lambda co, ci, kh, kw, vc:
Contributor

Is this scheduled correctly? Presumably we want this kernel_vec computation inlined in this case?

Contributor Author

I tried to compute it inline in the schedule, but doing that upsets LLVM and we go back to the older, less performant assembly code.

Contributor Author

An alternative to this PR could be a legalize pass, which, when it sees a conv2d with int8 inputs, inserts an upcast to int16. In this case, the upcast shows up at the Relay level, and the topi remains unchanged.

In both cases, the memory footprint takes a hit because we upcast the weights to int16. In the alternate implementation, the upcast weights show up in the artifacts (on disk). In the current PR, the allocation happens internally in the conv operator.

Member

We only do parallel for kernel_vec before; however, we introduce one more compute now, and the better way is compute_inline. Could you try this schedule:

    s[kernel_vec].unroll(kh)
    s[kernel_vec].unroll(kw)
    s[kernel_vec].vectorize(vc)
    s[kernel_vec].parallel(co)
    s[kernel_vec].compute_inline()

This is what we use in our schedule internally, and it can produce the SMLAL instruction when casting to int16. However, I cannot be sure it works here, because our computation and schedule are not the same.

Contributor Author

Thanks for the suggestion @FrozenGene :)
I will come back to it in a day or two and play with your suggestions to see if they lead to improvements.

@anijain2305
Contributor Author

Duplicating the message from the comments here for discussion.

An alternative to this PR could be a legalize pass, which, when it sees a conv2d with int8 inputs, inserts an upcast to int16. In this case, the upcast shows up at the Relay level, and the topi remains unchanged. I manually verified that LLVM gives good assembly when the conv2d inputs are int16.

In both cases, the memory footprint takes a hit because we upcast the weights to int16. In the alternate implementation, the upcast weights show up in the artifacts (on disk). In the current PR, the allocation happens internally in the conv operator.

@zhenhuaw-me
Contributor

Ha! This is very interesting - it is very similar to what we tried internally many months ago. I think legalization is a good idea, though it is not easy to read.

And we are going to share some end-to-end INT8 benchmarks at the TVM Shanghai Meetup next week (presented by @FrozenGene).

Also, from tuning experience, I recently got the idea of trying a trivial im2col schedule (without tensorize). Maybe we can share some code later :).

@anijain2305
Contributor Author

@jackwish Yes, sharing code will be very helpful. For now, do you prefer changing the schedule or a Legalize pass?

Changing Schedule

  • (+) The change is contained to only one template.
  • (-) Schedules are generally hard to understand, and I try to avoid complicating them.

Legalize Pass

  • (+) The codebase might be cleaner (if one understands Relay passes).
  • (+) Topi schedules are generally hard to understand; this approach avoids complicating them further.
  • (-) Applies to all templates. For example, if I write a new template (like direct or winograd), legalize will upcast there too.

I don't have a strong opinion.

@ajtulloch
Contributor

@jackwish I'd be very interested in those results. I got some good results for NHWC on ARMv7 by porting the QNNPACK kernels over
and tensorizing (https://github.com/ajtulloch/tvm/blob/95e5e2d44a08e2dfb8444706370505944ffb7c91/topi/python/topi/arm_cpu/conv2d_int8.py#L9-L166), and it'd be awesome to see how you folks have approached this problem.

@FrozenGene
Member

> @jackwish I'd be very interested in those results. I got some good results for NHWC on ARMv7 by porting the QNNPACK kernels over and tensorizing (https://github.com/ajtulloch/tvm/blob/95e5e2d44a08e2dfb8444706370505944ffb7c91/topi/python/topi/arm_cpu/conv2d_int8.py#L9-L166), and it'd be awesome to see how you folks have approached this problem.

@ajtulloch Thanks for the interest and for the great discussions we have had. :-)

I want to summarize some of our high-level ideas here; we will present the results at the next TVM meetup in Shanghai.

For Convolution:

  1. We use NHWC layout
  2. Currently, we use Tensorize.

We studied QNNPACK, but we cannot use it directly; there are some concepts in QNNPACK, such as the indirect buffer, that we cannot reproduce. So we wrote the kernel ourselves.

For Depthwise Convolution

  1. We use NHWC layout
  2. We don't use Tensorize.

Yes. We use the INT16 * INT16 + INT32 -> INT32 instruction (SMLAL), which is better than INT32 * INT32 + INT32 -> INT32. The way we do it is to subtract the input_zero_point / kernel_zero_point before the computation; at that point we cast the dtype from UINT8 to INT16.
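
A minimal sketch of that step, assuming hypothetical names for the input tensor and its zero point (this is not the internal kernel):

    import tvm

    def subtract_zero_point(data, zero_point):
        # Cast UINT8 -> INT16 while subtracting the zero point, so the main
        # reduction can use SMLAL-style int16 multiply-accumulate into int32.
        return tvm.compute(
            data.shape,
            lambda *i: data(*i).astype('int16') - tvm.const(zero_point, 'int16'),
            name='data_shifted')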

For depthwise convolution, even though we don't use Tensorize, we still get better performance than QNNPACK (in MobileNet V1 / MobileNet V2 only 2 layers are slower than it; on the others we are faster than QNNPACK). An amazing result. I want to list two key points:

  1. Avoid data packing. In im2col / spatial pack, we do data packing on H / W, which is costly for depthwise convolution; you can compute it directly and just split C, i.e. like this:
    kvshape = (C // VC, M, KH, KW, VC)
    oshape = (N, OH, OW, C)
    dvshape = (N, OH, OW, C // VC, KH, KW, VC)
  2. compute_at is very important in depthwise convolution. data_pad_inline / data_vec_inline / conv_inline should be tunable; this is one important factor in beating QNNPACK.

Currently, we have tested MobileNetV2 on the Raspberry Pi; we are 1.34x faster compared with QNNPACK. On our in-house model, the margin over QNNPACK is even larger. We will present more at the TVM meetup.

@FrozenGene
Member

> @jackwish Yes, sharing code will be very helpful. For now, do you prefer changing the schedule or a Legalize pass?
>
> Changing Schedule
>
>   • (+) The change is contained to only one template.
>   • (-) Schedules are generally hard to understand, and I try to avoid complicating them.
>
> Legalize Pass
>
>   • (+) The codebase might be cleaner (if one understands Relay passes).
>   • (+) Topi schedules are generally hard to understand; this approach avoids complicating them further.
>   • (-) Applies to all templates. For example, if I write a new template (like direct or winograd), legalize will upcast there too.
>
> I don't have a strong opinion.

If someone works on schedules and modifies them frequently, applying this in the schedule is better for them, because they can handle everything within the schedule. However, from the code viewpoint, I suggest adding it in a pass, because this should be applied to all templates. And for ARMv8.2, the dot-product instruction is better, and that could be done in legalize too.

@ajtulloch
Contributor

@FrozenGene wow, those are very impressive results. Congratulations, looking forward to the talk and the code :)

@zhenhuaw-me
Contributor

> @jackwish I'd be very interested in those results. I got some good results for NHWC on ARMv7 by porting the QNNPACK kernels over and tensorizing (https://github.com/ajtulloch/tvm/blob/95e5e2d44a08e2dfb8444706370505944ffb7c91/topi/python/topi/arm_cpu/conv2d_int8.py#L9-L166), and it'd be awesome to see how you folks have approached this problem.

@ajtulloch I looked a bit into your code; very decent design! I have not tried tensorization with the spatial pack schedule; would you please share some performance results?

@zhenhuaw-me
Contributor

zhenhuaw-me commented Nov 10, 2019

> @jackwish Yes, sharing code will be very helpful. For now, do you prefer changing the schedule or a Legalize pass?
>
> Changing Schedule
>
>   • (+) The change is contained to only one template.
>   • (-) Schedules are generally hard to understand, and I try to avoid complicating them.
>
> Legalize Pass
>
>   • (+) The codebase might be cleaner (if one understands Relay passes).
>   • (+) Topi schedules are generally hard to understand; this approach avoids complicating them further.
>   • (-) Applies to all templates. For example, if I write a new template (like direct or winograd), legalize will upcast there too.
>
> I don't have a strong opinion.

Hi @anijain2305, I think a Legalize pass may lead to a cleaner code structure or architecture, while changing the schedule keeps things easier to read. I feel the same about having no strong opinion :(

Yet, one thing I am considering is that, if the legalization handles all of this, will it make the tensorize compute-pattern matching a bit harder to write or debug?
By this I mean:

  • in the schedule we write int8 cast to int32 and multiply
  • legalization rewrites this to int8 cast to int16 and multiplied into int32

So, what compute pattern should a tensorize implementation match? Not sure if this will be a problem :)

@anijain2305
Contributor Author

anijain2305 commented Nov 10, 2019

@jackwish Thanks for the comment.

If we go the legalize way, the following kind of transformation would happen:

Original Relay graph --> conv2d(%int8_data, %int8_kernel)

After-legalize Relay graph -->

%int16_data = cast(%int8_data, dtype="int16")
%int16_kernel = cast(%int8_kernel, dtype="int16")
conv2d(%int16_data, %int16_kernel)
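
For illustration, a rough sketch of how that legalized expression could be constructed with the Relay Python API (the shapes are just the test workload above, and this is not the actual pass implementation):

    from tvm import relay

    int8_data = relay.var("int8_data", shape=(1, 64, 16, 16), dtype="int8")
    int8_kernel = relay.var("int8_kernel", shape=(64, 64, 3, 3), dtype="int8")

    # Insert the upcasts and rebuild the conv2d with int16 inputs.
    int16_data = relay.cast(int8_data, "int16")
    int16_kernel = relay.cast(int8_kernel, "int16")
    out = relay.nn.conv2d(int16_data, int16_kernel,
                          kernel_size=(3, 3), channels=64,
                          out_dtype="int32")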

So, in this case, even if we want to use tensorize, we will only see int16 inputs in the conv. Does that answer your question?

I am somewhat inclined towards the schedule approach, because it is contained in one place for now, and one can easily change it if one decides to use tensorize. Legalize might throw off schedule developers who expect to see int8 inputs and are surprised to see int16 inputs.

@zhenhuaw-me
Contributor

zhenhuaw-me commented Nov 10, 2019

Thanks for your kind explanation @anijain2305. That is exactly my concern: the legalization approach may confuse a tensorize implementation. In this case, I vote for changing the schedule.

@tqchen
Member

tqchen commented Nov 10, 2019

@ajtulloch it would be great if you could manage this PR; you should have the permission to approve and merge.

@anijain2305
Contributor Author

@jackwish Thanks! Can you please review and let me know if you have more comments?

@ajtulloch Let me know if you have any comments.

# because LLVM is able to better interleave vmlal.s16 and vldr instructions,
# leading to higher CPU utilization.
adjusted_dtype = data.dtype
if 'int8' in data.dtype and 'int8' in kernel.dtype and out_dtype == 'int32':
Member

Should this cover uint8 as well?

@ajtulloch
Contributor

Yeah, I think the schedule approach makes more sense to me as well.

@tqchen
Member

tqchen commented Jan 15, 2020

@FrozenGene @anijain2305 it would be great if we could follow up on this thread.

@anijain2305
Contributor Author

Apologies. Will take a look at it again next week.

@tqchen
Member

tqchen commented Mar 30, 2020

ping @anijain2305

@anijain2305
Contributor Author

Closing, as this is no longer relevant given parallel efforts to improve the int8 conv schedules.
