
Improve AArch64 depthwise convolution through smlal/smlal2 intrinsic #6711

Merged: 6 commits into apache:main on Nov 3, 2020

Conversation

giuseros (Contributor)

  • Added an intrinsic that loads a single int16x8 vector and produces two
    int32x4 output vectors through smlal/smlal2 instructions (the arithmetic
    pattern is modeled in the sketch below)

  • Changed the NHWC depthwise schedule to accommodate the aforementioned
    intrinsic

Change-Id: I347c3bf98fa8dd87057304dcda0d78e558424c57
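
For reference, here is a minimal NumPy model of the arithmetic pattern the intrinsic implements (this is not the PR's TVM code, just an illustration): a single int16x8 multiply is widened into two int32x4 accumulators, with the low four lanes corresponding to smlal and the high four to smlal2.

```python
import numpy as np

# Hedged NumPy model of the smlal/smlal2 pattern: one int16x8 multiply
# widens into two int32x4 accumulators (low lanes -> smlal, high -> smlal2).
def smlal_smlal2(acc_lo, acc_hi, data, weights):
    prod = data.astype(np.int32) * weights.astype(np.int32)  # widening multiply
    return acc_lo + prod[:4], acc_hi + prod[4:]

data = np.arange(8, dtype=np.int16)      # one int16x8 input vector
weights = np.full(8, 3, dtype=np.int16)  # one int16x8 weight vector
acc_lo = np.zeros(4, dtype=np.int32)     # int32x4 accumulator (smlal)
acc_hi = np.zeros(4, dtype=np.int32)     # int32x4 accumulator (smlal2)
acc_lo, acc_hi = smlal_smlal2(acc_lo, acc_hi, data, weights)
print(acc_lo, acc_hi)  # [0 3 6 9] [12 15 18 21]
```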

@giuseros (Contributor, Author)

cc @u99127 @anijain2305

@FrozenGene (Member)

I like this change. May I ask two questions?

  1. Could we make smlal/smlal2 work for convolution too?
  2. If we don't use tensorize, do we have a way to handle this? For example, adding a custom LLVM optimization pass, or doing lower-level analysis on TIR or LLVM IR?

@giuseros (Contributor, Author) commented Oct 26, 2020

Hi @FrozenGene,
Thanks for your reply!

  1. Yes we can, but there is an issue: it's very hard to have the two quantized strategies (int8 + smull/sadalp vs int16 + smlal/smlal2) available to the auto-tuner at the same time (you already saw this post; I will reply to your comment there). In the depthwise case we only have the int16 strategy, so the issue does not arise.
  2. I'm glad you asked, because I am trying to prototype the idea of a "tir pass" in which this tensorization (and possibly every Arm tensorization in tensor_intrin.py) happens. My hope is to make it work more or less like vectorize; see the sketch below. One of the main reasons to have this is to enable Ansor support.
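
To make the distinction concrete, here is a hedged TVM sketch on a toy elementwise workload (not code from this PR; `smlal_intrin` stands in for a hypothetical intrinsic declared via te.decl_tensor_intrin): vectorize is a generic annotation the compiler lowers by itself, whereas tensorize must be bound to an explicitly declared intrinsic.

```python
import tvm
from tvm import te

# Toy int16 -> int32 widening multiply, scheduled two ways.
A = te.placeholder((1024,), dtype="int16", name="A")
B = te.placeholder((1024,), dtype="int16", name="B")
C = te.compute(
    (1024,),
    lambda i: A[i].astype("int32") * B[i].astype("int32"),
    name="C",
)

s = te.create_schedule(C.op)
xo, xi = s[C].split(C.op.axis[0], factor=8)
s[C].vectorize(xi)  # generic: codegen picks the vector instructions itself
# s[C].tensorize(xi, smlal_intrin)  # explicit: `smlal_intrin` would be a
#                                   # hypothetical te.decl_tensor_intrin
```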

@FrozenGene (Member)

For Ansor, we could make Ansor support tensorize (as with TensorCore, where we need tensorize too). However, if we could do it in LLVM / TIR, Ansor would support it easily. And if we could lift it into a generic pass, I think it would benefit other hardware platforms too.

Review comments on python/tvm/topi/arm_cpu/depthwise_conv2d.py and python/tvm/topi/arm_cpu/tensor_intrin.py (now outdated and resolved).
Giuseppe Rossini added 3 commits October 29, 2020 21:56
- Added an intrinsic to load a single int16x8 vector and produce two
  int32x4 output vectors through smlal/smlal2 instructions

- Changed the NHWC depthwise schedule to accommodate the aforementioned
  intrinsic

Change-Id: I347c3bf98fa8dd87057304dcda0d78e558424c57
@mbaret (Contributor) left a comment

LGTM now. I agree with @FrozenGene that this would be best implemented nearer the LLVM codegen for reusability, but I think this is a good start and demonstrates a worthwhile performance benefit.

@giuseros (Contributor, Author) commented Nov 2, 2020

Hi @mbaret, thanks for the review!

@FrozenGene , any update on this?

@FrozenGene merged commit 01b98c1 into apache:main on Nov 3, 2020
@FrozenGene (Member)

@giuseros @mbaret Thanks, merged now

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 2, 2020
…pache#6711)

* Improve depthwise convolution through smlal/smlal2 intrinsic

- Added an intrinsic to load a single int16x8 vector and produce two
  int32x4 output vectors through smlal/smlal2 instructions

- Changed the NHWC depthwise schedule to accommodate the aforementioned
  intrinsic

Change-Id: I347c3bf98fa8dd87057304dcda0d78e558424c57

* Address review comments

* Rebasing - 2

* Rebasing - 3

* Rebasing - 3

* Fix linting
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020 (same commit message as above)
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020 (same commit message as above)