[ARM_CPU] Conv2d int8 intrinsic for cortex-A72 #10310

tkonolige · 2022-02-18T17:38:43Z

Add an intrinsic that performs a dot product of 8 4-element vectors at once. Also conditionally inline fused operators into the main convolution loop depending on convolution size: small convolutions are not inlined. On a Raspberry Pi 4, end-to-end performance on MobileNet improves by ~20%, and individual convolutions improve by ~30%.

| model | before | after | speedup |
| --- | --- | --- | --- |
| mobilenet_v2 | 47.7ms | 39.6ms | 20.5% |
| inception_v3 | 282.8ms | 209.1ms | 35.2% |
| resnet_50 | 363.5ms | 321.3ms | 13.1% |
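
The speedup column appears to be computed as `before / after - 1`. The PR does not state the formula, so that is an assumption; the short check below shows it reproduces the reported percentages. The helper name `speedup_percent` is made up for this sketch.

```python
# Latencies in milliseconds, copied from the table in the PR description.
latencies_ms = {
    "mobilenet_v2": (47.7, 39.6),
    "inception_v3": (282.8, 209.1),
    "resnet_50": (363.5, 321.3),
}


def speedup_percent(before_ms, after_ms):
    """Relative speedup in percent, ASSUMING speedup = before / after - 1."""
    return (before_ms / after_ms - 1.0) * 100.0


# Round to one decimal place, matching the precision used in the table.
speedups = {
    model: round(speedup_percent(before, after), 1)
    for model, (before, after) in latencies_ms.items()
}

for model, pct in speedups.items():
    print(f"{model}: {pct}%")
```

Under this reading, 20.5% means the model runs about 1.2x faster, not that latency dropped by 20.5%.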

@masahi @mbrookhart @leandron

masahi · 2022-02-18T18:50:28Z

cc @FrozenGene @giuseros

tmoreau89 · 2022-02-18T21:20:01Z

python/tvm/topi/arm_cpu/tensor_intrin.py

@@ -533,6 +533,138 @@ def _instr(index):
    )


+def dot_int8_int8_int32_neon():
+    """
+    Int8 dot product using vmlal instructions


just came here to say how nice this documentation block is

tmoreau89 · 2022-02-18T21:48:14Z

python/tvm/topi/arm_cpu/conv2d_alter_op.py

@@ -252,12 +254,20 @@ def _alter_conv2d_layout(attrs, inputs, tinfos, out_type):

        return relay.nn.conv2d(*inputs, **new_attrs)

-    if topi_tmpl == "conv2d_NCHWc.x86":
+    if topi_tmpl == "conv2d_NCHWc.arm_cpu":


Do we want to also include the conv2d_NCHWc.x86 topi template here?

I was unhappy that I was seeing conv2d_NCHWc.x86 schedules when running on ARM. This change was meant to fix that, but I don't think I caught all the places. I will split it out into a separate PR.

Ok that's what I had suspected; it sounds like cleaning it up would be what's best to avoid confusion later!

tmoreau89 · 2022-02-18T21:49:44Z

python/tvm/topi/arm_cpu/conv2d_int8.py

@@ -39,10 +40,10 @@ def _get_default_config(cfg, data, kernel, strides, padding, dilation, out_dtype
    wkl = _get_conv2d_workload(data, kernel, strides, padding, dilation, out_dtype)
    is_kernel_1x1 = wkl.kernel_h == 1 and wkl.kernel_w == 1
    if is_kernel_1x1:
-        conv2d_generic.fallback_schedule_cpu_1x1_int8(cfg, wkl, int32_lanes=2, num_int8_elements=4)
+        conv2d_generic.fallback_schedule_cpu_1x1_int8(cfg, wkl, int32_lanes=4, num_int8_elements=4)


What's the reasoning behind this change in int32_lanes? Did we use the wrong value all along?

Yes, it was the wrong value all along. If you look down at schedule_conv2d_NCHWc_int8 (tvm/python/tvm/topi/arm_cpu/conv2d_int8.py, lines 110 to 116 in 2c0a7c2), int32_lanes was set to 4 already:

        conv2d_generic.schedule_conv_NCHWc_cpu_1x1_int8(
            *args, int32_lanes=4, intrin=dot_int8_int8_int32(int32_lanes=4, dtype=dtype)
        )
    else:
        conv2d_generic.schedule_conv_NCHWc_cpu_common_int8(
            *args, int32_lanes=4, intrin=dot_int8_int8_int32(int32_lanes=4, dtype=dtype)
        )
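
For readers unfamiliar with the two parameters, here is a plain-Python sketch of the computation their names suggest: `int32_lanes` output accumulators, each the dot product of `num_int8_elements` int8 values. This is a reference model under that assumption, not the TVM intrinsic; `dot_int8_int8_int32_reference` is a hypothetical name used only here.

```python
def dot_int8_int8_int32_reference(data, kernel, int32_lanes=4, num_int8_elements=4):
    """Reference model (an assumption, not TVM code).

    out[i] = sum over k of data[k] * kernel[i][k]
    Python ints stand in for the int8 inputs and int32 accumulators.
    """
    assert len(data) == num_int8_elements
    assert len(kernel) == int32_lanes
    out = [0] * int32_lanes
    for i in range(int32_lanes):
        assert len(kernel[i]) == num_int8_elements
        for k in range(num_int8_elements):
            out[i] += data[k] * kernel[i][k]
    return out


data = [1, -2, 3, 4]
kernel = [
    [1, 0, 0, 0],      # picks data[0]
    [0, 1, 0, 0],      # picks data[1]
    [1, 1, 1, 1],      # sum of all four elements
    [-1, -1, -1, -1],  # negated sum
]
result = dot_int8_int8_int32_reference(data, kernel)
print(result)
```

Since the schedule quoted above passes `int32_lanes=4`, the fallback configuration is changed to use the same value.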

tmoreau89 · 2022-02-18T21:54:07Z

Great work @tkonolige, left a couple of comments

tmoreau89

LGTM

masahi · 2022-02-21T05:04:55Z

python/tvm/topi/arm_cpu/conv2d_alter_op.py

+            kernel_OHWoIi,
+            (out_channel // oc_bn, kh, kw, oc_bn, in_channel // ic_bn, ic_bn // n_elems, n_elems),
+        )
+        kernel_OIHWioe = relay.transpose(kernel_OHWoIie, axes=(0, 4, 1, 2, 5, 3, 6))


Can clean this up since #9996 is merged. See the change there.
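
To make the reshape and transpose in this hunk easier to follow, here is a NumPy sketch of the same axis permutation. It assumes `relay.transpose` permutes axes the same way `numpy.transpose` does; the channel and block sizes are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical sizes, not taken from the PR.
out_channel, in_channel, kh, kw = 8, 16, 3, 3
oc_bn, ic_bn, n_elems = 4, 8, 4

# 7-D shape passed to the reshape in the diff.
shape_7d = (
    out_channel // oc_bn,
    kh,
    kw,
    oc_bn,
    in_channel // ic_bn,
    ic_bn // n_elems,
    n_elems,
)
kernel_7d = np.arange(np.prod(shape_7d), dtype="int32").reshape(shape_7d)

# Same permutation as the transpose call in the diff.
kernel_OIHWioe = np.transpose(kernel_7d, axes=(0, 4, 1, 2, 5, 3, 6))

print(kernel_OIHWioe.shape)  # (2, 2, 3, 3, 2, 4, 4)
```

After the permutation the axes read as (outer output channel, outer input channel, kh, kw, inner input-channel block, inner output channel, `n_elems`), so the innermost axis is the one the dot-product intrinsic reduces over.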

This PR fixes a bug in TE ARM int8 compute for NCHWc conv2d, introduced in #10310. The compute itself, not the schedule, is broken for the following reasons:

* We are using `n_elems = 8` in https://github.com/apache/tvm/blob/e9091d6c68d5d70c28881e5c75bfe72e385c1f4d/python/tvm/topi/arm_cpu/conv2d_alter_op.py#L350. Thus, the innermost axis of the transformed kernel has extent 8: https://github.com/apache/tvm/blob/e9091d6c68d5d70c28881e5c75bfe72e385c1f4d/python/tvm/topi/arm_cpu/conv2d_alter_op.py#L375
* In the TE compute, we iterate over the innermost axis `ic_s_inner` of the kernel at https://github.com/apache/tvm/blob/f6f252f0abc8f621a96506739f9534083d1fe213/python/tvm/topi/nn/conv2d.py#L577. `ic_s_inner` has extent `n_elems` according to https://github.com/apache/tvm/blob/f6f252f0abc8f621a96506739f9534083d1fe213/python/tvm/topi/nn/conv2d.py#L566. `n_elems` is 4 by default according to https://github.com/apache/tvm/blob/f6f252f0abc8f621a96506739f9534083d1fe213/python/tvm/topi/nn/conv2d.py#L478
* The ARM code that calls this compute does not explicitly pass `n_elems`, according to https://github.com/apache/tvm/blob/e9091d6c68d5d70c28881e5c75bfe72e385c1f4d/python/tvm/topi/arm_cpu/conv2d_int8.py#L106-L108
* Thus, even though the innermost axis of the kernel has extent 8, the TE compute only loops over `n_elems = 4` of the input channel dimension.

Initially, I tried to keep `n_elems = 8` in alter layout and fix the intrinsic definition. But `n_elems = 8` breaks tensorization pattern matching, since now the compute is doing 4x8 innermost loop but this intrinsic is supposed to do 4x4 dot product, see https://github.com/apache/tvm/blob/7896108fc41663a1fecbb52345194a93278e9e28/python/tvm/topi/arm_cpu/tensor_intrin.py#L467-L479. Setting `num_int8_elements = 8` there does fix the tensorize pattern matching, but the result was still incorrect.

Rather than fixing the intrin implementation in https://github.com/apache/tvm/blob/7896108fc41663a1fecbb52345194a93278e9e28/python/tvm/topi/arm_cpu/tensor_intrin.py#L492 to adapt for 4x8 dot product, I settled on setting `n_elems = 4` in alter layout. It turned out this change is enough to get the correct output. Moreover, `n_elems = 8` is simply wrong for the dot product path in https://github.com/apache/tvm/blob/7896108fc41663a1fecbb52345194a93278e9e28/python/tvm/topi/arm_cpu/conv2d_int8.py#L154-L155 which computes 4x4 dot product in one instruction.

@tkonolige I suggest doing perf benchmark again, since the numbers in #10310 are invalid.

cc @mbrookhart @Mousius @junrushao1994 @vinx13
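
The extent mismatch described in this report can be illustrated with a toy reduction in plain Python. This is a simplified model of the symptom, not the TVM compute: `reduce_inner` is a hypothetical helper, and the values are made up.

```python
def reduce_inner(data, kernel, n_elems):
    """Dot product that visits only the first n_elems entries of the innermost axis."""
    return sum(data[k] * kernel[k] for k in range(n_elems))


# Innermost kernel axis with extent 8, as produced when alter layout uses n_elems = 8.
kernel_inner = [1, 1, 1, 1, 1, 1, 1, 1]
data_inner = [1, 2, 3, 4, 5, 6, 7, 8]

# What the result should be if all 8 elements are reduced.
expected = reduce_inner(data_inner, kernel_inner, n_elems=8)

# What a compute that defaults to n_elems = 4 would produce on the same data.
truncated = reduce_inner(data_inner, kernel_inner, n_elems=4)

print(expected, truncated)  # 36 10
```

Half of the innermost axis is never read in the truncated case, which is why making the layout and the compute agree on `n_elems = 4` restores correct output.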

* [ARM_CPU] Conv2d int8 intrinsic for cortex-A72

  Add an intrinsic that performs a dot product of 8 4-element vectors at once. Also conditionally inline fused operators into the main convolution loop depending on convolutions size. Small convolution = no inlining. Performance improves by ~20% on mobilenet on raspberry pi 4 and ~30% improvement on performance for the individual convolutions.
* ignore incorrect lints
* fixup fstring
* revert changes to conv2d_NCHWc (not int8)
* remove error check, apparently tests rely on it
* refactor alter op layout

tkonolige requested review from Laurawly, Huyuwei, kevinthesun, jwfromm, vinx13, masahi, yzhliu, mbrookhart, ZihengJiang, jcf94, jroesch, slyubomirsky, icemelon, MarisaKirisame, zhiics, anijain2305, wweic, junrushao, merrymercy, comaniac, tqchen and areusch as code owners February 18, 2022 17:38

tmoreau89 reviewed Feb 18, 2022

tmoreau89 approved these changes Feb 18, 2022

masahi reviewed Feb 21, 2022

masahi reviewed Feb 22, 2022

Tristan Konolige added 5 commits February 22, 2022 10:14

ignore incorrect lints

b459c2a

fixup fstring

ae2428d

revert changes to conv2d_NCHWc (not int8)

7203c09

remove error check, apparently tests rely on it

c45c45b

tkonolige force-pushed the arm_conv2d branch from 3eaa786 to c45c45b Compare February 22, 2022 18:39

refactor alter op layout

c528b0f

masahi approved these changes Feb 23, 2022

masahi merged commit 6c6e873 into apache:main Feb 23, 2022

masahi mentioned this pull request Mar 31, 2022

[ARM] Fix int8 NCHWc compute and alter layout #10839

Merged

driazati mentioned this pull request Jul 14, 2022

TVM v0.9.0.rc0 Release Candidate Notes #12102

Closed
