
[ARM][Topi] Improving Int8 Perf in Spatial Conv2D schedule. #4277

Closed
wants to merge 1 commit into from

Conversation

anijain2305
Contributor

@anijain2305 commented Nov 8, 2019

I am working on improving the performance of Int8 conv on Raspberry Pi 3.

For Conv2D, there is an upcast from int8 to int32 before performing the dot product. The ARM ISA has an instruction called vmlal.s16 that multiplies two 64-bit SIMD registers, each holding four 16-bit values, lane by lane and accumulates the widened products into a 128-bit SIMD register holding four 32-bit values. However, LLVM (4.0, 6.0 and 8.0) is not able to use this instruction efficiently. In the absence of this PR, the assembly looks something like this:

	add	r2, sp, #304
	vmlal.s16	q12, d0, d4
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #208
	vmlal.s16	q12, d10, d0
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #192
	vmlal.s16	q12, d8, d0
	vld1.64	{d0, d1}, [r2:128]
	sub	r2, r6, #7
	mov	r6, r7
	vmlal.s16	q12, d0, d14
	vld1.8	{d0[]}, [r2]
	add	r2, sp, #192
	vmovl.s8	q4, d0
	vst1.64	{d8, d9}, [r2:128]
	add	r2, sp, #240
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #176
	vmlal.s16	q14, d8, d0
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #128
	vmlal.s16	q14, d0, d12
	vld1.64	{d12, d13}, [r2:128]
	add	r2, sp, #288
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #96
	vmlal.s16	q14, d12, d0
	vld1.64	{d8, d9}, [r2:128]
	add	r2, sp, #352
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #224
	vmlal.s16	q14, d8, d0
	vld1.64	{d14, d15}, [r2:128]
	add	r2, sp, #80
	vmlal.s16	q14, d14, d2
	vld1.64	{d2, d3}, [r2:128]
	add	r2, sp, #304
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #208
	vmlal.s16	q14, d6, d4
	vld1.64	{d4, d5}, [r2:128]
	add	r2, sp, #240
	vmlal.s16	q14, d2, d0
	vmlal.s16	q14, d10, d4
	vld1.64	{d10, d11}, [r2:128]
	add	r2, sp, #192
	vld1.64	{d6, d7}, [r2:128]
	add	r2, sp, #160
	vmlal.s16	q13, d7, d11
	vld1.64	{d10, d11}, [r2:128]
	add	r2, sp, #176
	vld1.64	{d6, d7}, [r2:128]
	add	r2, sp, #288
	vmlal.s16	q13, d7, d11
	vld1.64	{d10, d11}, [r2:128]
	add	r2, sp, #352
	vmlal.s16	q13, d13, d11
	vld1.64	{d10, d11}, [r2:128]
	add	r2, sp, #336
	vld1.64	{d6, d7}, [r2:128]
	add	r2, sp, #144
	vmlal.s16	q13, d9, d11
	vld1.64	{d8, d9}, [r2:128]
	add	r2, sp, #320
	vld1.64	{d10, d11}, [r2:128]
	add	r2, sp, #272
	vmlal.s16	q13, d15, d7
	vmlal.s16	q13, d9, d11
	vmlal.s16	q13, d3, d1
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #256
	vmlal.s16	q13, d1, d5
	vld1.64	{d0, d1}, [r2:128]
	add	r2, sp, #112
	vld1.64	{d2, d3}, [r2:128]
	vmlal.s16	q14, d0, d2
	vmlal.s16	q13, d1, d3
	bne	.LBB6_5

However, if we add an intermediate upcast to int16, i.e. instead of going directly from int8 to int32 we first cast from int8 to int16 and feed the int16 tensors to conv2d, LLVM produces much better interleaving of compute and memory instructions:

.LBB7_6:
	add	r2, lr, r8
	add	r1, r12, #32
	vld1.64	{d0, d1}, [r1:128]
	add	r1, r12, #16
	add	r3, r2, #10
	add	r8, r8, #8
	vld1.64	{d4, d5}, [r1:128]
	mov	r1, r2
	vld1.16	{d6[]}, [r1:16], r7
	cmp	r8, #24
	vld1.16	{d7[]}, [r3:16]
	add	r3, r2, #2
	vld1.16	{d10[]}, [r1:16]
	add	r1, r2, #8
	vld1.16	{d2, d3}, [r10:128], r0
	vmlal.s16	q14, d6, d2
	vmlal.s16	q13, d6, d3
	vld1.16	{d6[]}, [r3:16]
	add	r3, r2, #12
	vmlal.s16	q12, d6, d2
	vmlal.s16	q11, d6, d3
	mov	r12, r10
	vmlal.s16	q14, d6, d4
	vld1.16	{d8[]}, [r3:16]
	vmlal.s16	q13, d6, d5
	add	r3, r2, #4
	vld1.16	{d6[]}, [r1:16]
	vmlal.s16	q15, d7, d3
	vmlal.s16	q9, d6, d2
	add	r1, r2, #6
	vmlal.s16	q8, d7, d2
	vld1.16	{d9[]}, [r3:16]
	vmlal.s16	q10, d6, d3
	vmlal.s16	q15, d8, d5
	vld1.16	{d2[]}, [r1:16]
	vmlal.s16	q12, d9, d4
	vmlal.s16	q11, d9, d5
	vmlal.s16	q9, d7, d4
	vmlal.s16	q10, d7, d5
	vmlal.s16	q8, d8, d4
	vmlal.s16	q13, d9, d1
	vmlal.s16	q14, d9, d0
	vmlal.s16	q15, d10, d1
	vmlal.s16	q8, d10, d0
	vmlal.s16	q10, d8, d1
	vmlal.s16	q9, d8, d0
	vmlal.s16	q11, d2, d1
	vmlal.s16	q12, d2, d0
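
As a rough sketch (with hypothetical helper and tensor names, not the actual PR diff), the intermediate upcast can be expressed in TOPI's compute language by materializing int16 copies of the int8 operands before the convolution reduction:

    import tvm

    def upcast_to_int16(data, kernel):
        # Copy the int8 tensors into int16 ones so that the conv2d reduction
        # multiplies int16 lanes and accumulates into int32, which LLVM can
        # lower directly to vmlal.s16.
        data_i16 = tvm.compute(data.shape,
                               lambda *i: data(*i).astype('int16'),
                               name='data_int16')
        kernel_i16 = tvm.compute(kernel.shape,
                                 lambda *i: kernel(*i).astype('int16'),
                                 name='kernel_int16')
        return data_i16, kernel_i16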

I tested this with one Conv2D and with auto-tuning.

  • Workload - input shape (1, 64, 16, 16), kernel (64, 64, 3, 3)
  • Platform - Raspberry Pi, 0.6 GHz
  • Auto-tuning done for both FP32 and Int8 conv
  • FP32 Latency = 3817 us, Int8 latency = 3015 us
  • In the absence of this PR, Int8 latency is worse than FP32.

@jackwish @yzhliu @FrozenGene @merrymercy @zhiics @u99127

@anijain2305 changed the title from "[ARM][Topi] Supporting Int8 in Spatial schedule." to "[ARM][Topi] Improving Int8 Perf in Spatial Conv2D schedule." on Nov 8, 2019
                               name='data_vec')

    if pre_packed:
        kernel_vec = kernel
        if adjusted_dtype != kernel.dtype:
            kernel_vec = tvm.compute(kvshape, lambda co, ci, kh, kw, vc:
Contributor

Is this scheduled correctly? Presumably we want this kernel_vec computation inlined in this case?

Contributor Author

I tried to compute it inline in the schedule, but doing that upsets LLVM and we go back to the older, less performant assembly code.

Contributor Author

An alternative to this PR could be a legalize pass, which, when it sees a conv2d with int8 inputs, inserts an upcast to int16. In this case, the upcast shows up at the Relay level, and the topi remains unchanged.

In both cases, the memory footprint takes a hit because we upcast the weights to int16. In the alternate implementation, the upcast weights show up in the artifacts (on disk). In the current PR, the allocation happens internally in the conv operator.

Member

We only do parallel for kernel_vec before; however, we introduce one more compute now, and the better way is compute_inline. Could you try this schedule:

    s[kernel_vec].unroll(kh)
    s[kernel_vec].unroll(kw)
    s[kernel_vec].vectorize(vc)
    s[kernel_vec].parallel(co)
    s[kernel_vec].compute_inline()

This is what we use in our schedule internally, and it can produce the SMLAL instruction when casting to int16. However, I cannot be sure it works here, because our computation and schedule are not the same.

Contributor Author

Thanks for the suggestion @FrozenGene :)
I will come back to it in a day or two and play with your suggestions to see if they lead to improvements.

@anijain2305
Contributor Author

Duplicating the message from the comments here for discussion.

An alternative to this PR could be a legalize pass, which, when it sees a conv2d with int8 inputs, inserts an upcast to int16. In this case, the upcast shows up at the Relay level, and the topi remains unchanged. I manually verified that LLVM gives good assembly when the conv2d inputs are int16.

In both cases, the memory footprint takes a hit because we upcast the weights to int16. In the alternate implementation, the upcast weights show up in the artifacts (on disk). In the current PR, the allocation happens internally in the conv operator.

@zhenhuaw-me
Contributor

Ha! This is very interesting - it is very similar to what we tried internally many months ago. I think legalization is a good idea, though it is not easy to read.

And we are going to share some end-to-end INT8 benchmarks at the TVM Shanghai Meetup next week (presented by @FrozenGene).

Also, from tuning experience, I recently got the idea of trying a trivial im2col schedule (without tensorize). Maybe we can share some code later :).

@anijain2305
Contributor Author

@jackwish Yes, sharing code will be very helpful. For now, do you prefer changing the schedule or a Legalize pass?

Changing Schedule

  • (+) The change is contained to only one template.
  • (-) Schedules are generally hard to understand, and I try to avoid complicating them.

Legalize Pass

  • (+) The codebase might be cleaner (if one understands Relay passes).
  • (+) Topi schedules are generally hard to understand; this approach avoids complicating them further.
  • (-) Applies to all templates. For example, if I write a new template (like direct or winograd), legalize will upcast there too.

I don't have a strong opinion.

@ajtulloch
Contributor

@jackwish I'd be very interested in those results. I got some good results for NHWC on ARMv7 by porting the QNNPACK kernels over
and tensorizing (https://github.com/ajtulloch/tvm/blob/95e5e2d44a08e2dfb8444706370505944ffb7c91/topi/python/topi/arm_cpu/conv2d_int8.py#L9-L166), and it'd be awesome to see how you folks have approached this problem.

@FrozenGene
Member

> @jackwish I'd be very interested in those results. I got some good results for NHWC on ARMv7 by porting the QNNPACK kernels over and tensorizing (https://github.com/ajtulloch/tvm/blob/95e5e2d44a08e2dfb8444706370505944ffb7c91/topi/python/topi/arm_cpu/conv2d_int8.py#L9-L166), and it'd be awesome to see how you folks have approached this problem.

@ajtulloch Thanks for the interest and for the great discussions we have had. :-)

I want to summarize some of our high-level ideas here; we will present the results at the next TVM meetup in Shanghai.

For Convolution:

  1. We use NHWC layout
  2. Currently, we use Tensorize.

We studied QNNPACK, but we cannot use it directly; there are some concepts in QNNPACK, such as the indirect buffer, that we cannot reproduce. So we wrote the kernel ourselves.

For Depthwise Convolution

  1. We use NHWC layout
  2. We don't use Tensorize.

Yes. We use the INT16 * INT16 + INT32 -> INT32 instruction (SMLAL), which is better than INT32 * INT32 + INT32 -> INT32. The way we do it is to subtract the input_zero_point / kernel_zero_point before the computation; at that point we cast the dtype from UINT8 to INT16.
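
A minimal sketch of that step, assuming hypothetical names for the input tensor and its zero point (this is not the internal kernel):

    import tvm

    def subtract_zero_point(data, zero_point):
        # Cast UINT8 -> INT16 while subtracting the zero point, so the main
        # reduction can use SMLAL-style int16 multiply-accumulate into int32.
        return tvm.compute(
            data.shape,
            lambda *i: data(*i).astype('int16') - tvm.const(zero_point, 'int16'),
            name='data_shifted')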

For depthwise convolution, even though we don't use Tensorize, we still get better performance than QNNPACK (in MobileNet V1 / MobileNet V2 only 2 layers are slower than it; on the others we are faster than QNNPACK). An amazing result. I want to list two key points:

  1. Avoid data packing. In im2col / spatial pack, we do data packing on H / W, which is costly for depthwise convolution; you can compute it directly and just split C, i.e. like this:
    kvshape = (C // VC, M, KH, KW, VC)
    oshape = (N, OH, OW, C)
    dvshape = (N, OH, OW, C // VC, KH, KW, VC)
  2. compute_at is very important in depthwise convolution. data_pad_inline / data_vec_inline / conv_inline should be tunable; this is one important factor in beating QNNPACK.

Currently, we have tested MobileNetV2 on the Raspberry Pi; we are 1.34x faster compared with QNNPACK. On our in-house model, the margin over QNNPACK is even larger. We will present more at the TVM meetup.

@FrozenGene
Member

> @jackwish Yes, sharing code will be very helpful. For now, do you prefer changing the schedule or a Legalize pass?
>
> Changing Schedule
>
>   • (+) The change is contained to only one template.
>   • (-) Schedules are generally hard to understand, and I try to avoid complicating them.
>
> Legalize Pass
>
>   • (+) The codebase might be cleaner (if one understands Relay passes).
>   • (+) Topi schedules are generally hard to understand; this approach avoids complicating them further.
>   • (-) Applies to all templates. For example, if I write a new template (like direct or winograd), legalize will upcast there too.
>
> I don't have a strong opinion.

If someone works on schedules and modifies them frequently, applying this in the schedule is better for them, because they can handle everything within the schedule. However, from the code viewpoint, I suggest adding it in a pass, because this should be applied to all templates. And for ARMv8.2, the dot-product instruction is better, and that could be done in legalize too.

@ajtulloch
Contributor

@FrozenGene wow, those are very impressive results. Congratulations, looking forward to the talk and the code :)

@zhenhuaw-me
Contributor

> @jackwish I'd be very interested in those results. I got some good results for NHWC on ARMv7 by porting the QNNPACK kernels over and tensorizing (https://github.com/ajtulloch/tvm/blob/95e5e2d44a08e2dfb8444706370505944ffb7c91/topi/python/topi/arm_cpu/conv2d_int8.py#L9-L166), and it'd be awesome to see how you folks have approached this problem.

@ajtulloch I looked a bit into your code; very decent design! I have not tried tensorization with the spatial pack schedule; would you please share some performance results?

@zhenhuaw-me
Contributor

zhenhuaw-me commented Nov 10, 2019

> @jackwish Yes, sharing code will be very helpful. For now, do you prefer changing the schedule or a Legalize pass?
>
> Changing Schedule
>
>   • (+) The change is contained to only one template.
>   • (-) Schedules are generally hard to understand, and I try to avoid complicating them.
>
> Legalize Pass
>
>   • (+) The codebase might be cleaner (if one understands Relay passes).
>   • (+) Topi schedules are generally hard to understand; this approach avoids complicating them further.
>   • (-) Applies to all templates. For example, if I write a new template (like direct or winograd), legalize will upcast there too.
>
> I don't have a strong opinion.

Hi @anijain2305, I think a Legalize pass may lead to a cleaner code structure or architecture, while changing the schedule keeps things easier to read. I feel the same about having no strong opinion :(

Yet, one thing I am considering is that, if the legalization handles all of this, will it make the tensorize compute-pattern matching a bit harder to write or debug?
By this I mean:

  • in the schedule we write int8 cast to int32 and multiply
  • legalization rewrites this to int8 cast to int16 and multiplied into int32

So, what compute pattern should a tensorize implementation match? Not sure if this will be a problem :)

@anijain2305
Contributor Author

anijain2305 commented Nov 10, 2019

@jackwish Thanks for the comment.

If we go the legalize way, the following kind of transformation would happen:

Original Relay graph --> conv2d(%int8_data, %int8_kernel)

After-legalize Relay graph -->

%int16_data = cast(%int8_data, dtype="int16")
%int16_kernel = cast(%int8_kernel, dtype="int16")
conv2d(%int16_data, %int16_kernel)
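
For illustration, a rough sketch of how that legalized expression could be constructed with the Relay Python API (the shapes are just the test workload above, and this is not the actual pass implementation):

    from tvm import relay

    int8_data = relay.var("int8_data", shape=(1, 64, 16, 16), dtype="int8")
    int8_kernel = relay.var("int8_kernel", shape=(64, 64, 3, 3), dtype="int8")

    # Insert the upcasts and rebuild the conv2d with int16 inputs.
    int16_data = relay.cast(int8_data, "int16")
    int16_kernel = relay.cast(int8_kernel, "int16")
    out = relay.nn.conv2d(int16_data, int16_kernel,
                          kernel_size=(3, 3), channels=64,
                          out_dtype="int32")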

So, in this case, even if we want to use tensorize, we will only see int16 inputs in the conv. Does that answer your question?

I am somewhat inclined towards the schedule approach, because it is contained in one place for now, and one can easily change it if one decides to use tensorize. Legalize might throw off schedule developers who expect to see int8 inputs and are surprised to see int16 inputs.

@zhenhuaw-me
Contributor

zhenhuaw-me commented Nov 10, 2019

Thanks for your kind explanation @anijain2305. That is exactly my concern: the legalization approach may confuse a tensorize implementation. In this case, I vote for changing the schedule.

@tqchen
Member

tqchen commented Nov 10, 2019

@ajtulloch it would be great if you could manage this PR; you should have the permission to approve and merge.

@anijain2305
Contributor Author

@jackwish Thanks! Can you please review and let me know if you have more comments?

@ajtulloch Let me know if you have any comments.

# because LLVM is able to better interleave vmlal.s16 and vldr instructions,
# leading to higher CPU utilization.
adjusted_dtype = data.dtype
if 'int8' in data.dtype and 'int8' in kernel.dtype and out_dtype == 'int32':
Member

Should this cover uint8 as well?

@ajtulloch
Contributor

Yeah, I think the schedule approach makes more sense to me as well.

@tqchen
Member

tqchen commented Jan 15, 2020

@FrozenGene @anijain2305 it would be great if we could follow up on this thread.

@anijain2305
Contributor Author

Apologies. Will take a look at it again next week.

@tqchen
Member

tqchen commented Mar 30, 2020

ping @anijain2305

@anijain2305
Contributor Author

Closing, as this is no longer relevant given parallel efforts to improve the int8 conv schedules.
