
Improve AArch64 depthwise convolution through smlal/smlal2 intrinsic #6711

Merged: 6 commits into apache:main on Nov 3, 2020

Conversation

giuseros (Contributor)

  • Added an intrinsic that loads a single int16x8 vector and produces two
    int32x4 output vectors through smlal/smlal2 instructions (the arithmetic
    pattern is modeled in the sketch below)

  • Changed the NHWC depthwise schedule to accommodate the aforementioned
    intrinsic

Change-Id: I347c3bf98fa8dd87057304dcda0d78e558424c57
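
For reference, here is a minimal NumPy model of the arithmetic pattern the intrinsic implements (this is not the PR's TVM code, just an illustration): a single int16x8 multiply is widened into two int32x4 accumulators, with the low four lanes corresponding to smlal and the high four to smlal2.

```python
import numpy as np

# Hedged NumPy model of the smlal/smlal2 pattern: one int16x8 multiply
# widens into two int32x4 accumulators (low lanes -> smlal, high -> smlal2).
def smlal_smlal2(acc_lo, acc_hi, data, weights):
    prod = data.astype(np.int32) * weights.astype(np.int32)  # widening multiply
    return acc_lo + prod[:4], acc_hi + prod[4:]

data = np.arange(8, dtype=np.int16)      # one int16x8 input vector
weights = np.full(8, 3, dtype=np.int16)  # one int16x8 weight vector
acc_lo = np.zeros(4, dtype=np.int32)     # int32x4 accumulator (smlal)
acc_hi = np.zeros(4, dtype=np.int32)     # int32x4 accumulator (smlal2)
acc_lo, acc_hi = smlal_smlal2(acc_lo, acc_hi, data, weights)
print(acc_lo, acc_hi)  # [0 3 6 9] [12 15 18 21]
```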

@giuseros (Contributor, Author)

cc @u99127 @anijain2305

@FrozenGene (Member)

I like this change. May I ask two questions?

  1. Could we make smlal/smlal2 work for convolution too?
  2. If we don't use tensorize, do we have a way to handle this? For example, adding a custom LLVM optimization pass, or doing lower-level analysis on TIR or LLVM IR?

@giuseros (Contributor, Author) commented Oct 26, 2020

Hi @FrozenGene,
Thanks for your reply!

  1. Yes we can, but there is an issue: it's very hard to have the two quantized strategies (int8 + smull/sadalp vs int16 + smlal/smlal2) available to the auto-tuner at the same time (you already saw this post; I will reply to your comment there). In the depthwise case we only have the int16 strategy, so the issue does not arise.
  2. I'm glad you asked, because I am trying to prototype the idea of a "tir pass" in which this tensorization (and possibly every Arm tensorization in tensor_intrin.py) happens. My hope is to make it work more or less like vectorize; see the sketch below. One of the main reasons to have this is to enable Ansor support.
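
To make the distinction concrete, here is a hedged TVM sketch on a toy elementwise workload (not code from this PR; `smlal_intrin` stands in for a hypothetical intrinsic declared via te.decl_tensor_intrin): vectorize is a generic annotation the compiler lowers by itself, whereas tensorize must be bound to an explicitly declared intrinsic.

```python
import tvm
from tvm import te

# Toy int16 -> int32 widening multiply, scheduled two ways.
A = te.placeholder((1024,), dtype="int16", name="A")
B = te.placeholder((1024,), dtype="int16", name="B")
C = te.compute(
    (1024,),
    lambda i: A[i].astype("int32") * B[i].astype("int32"),
    name="C",
)

s = te.create_schedule(C.op)
xo, xi = s[C].split(C.op.axis[0], factor=8)
s[C].vectorize(xi)  # generic: codegen picks the vector instructions itself
# s[C].tensorize(xi, smlal_intrin)  # explicit: `smlal_intrin` would be a
#                                   # hypothetical te.decl_tensor_intrin
```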

@FrozenGene (Member)

For Ansor, we could make Ansor support tensorize (as with TensorCore, where we need tensorize too). However, if we could do it in LLVM / TIR, Ansor would support it easily. And if we could lift it into a generic pass, I think it would benefit other hardware platforms too.

Review comments on python/tvm/topi/arm_cpu/depthwise_conv2d.py and python/tvm/topi/arm_cpu/tensor_intrin.py (now outdated and resolved).
Giuseppe Rossini added 3 commits October 29, 2020 21:56
- Added an intrinsic to load a single int16x8 vector and produce two
  int32x4 output vectors through smlal/smlal2 instructions

- Changed the NHWC depthwise schedule to accommodate the aforementioned
  intrinsic

Change-Id: I347c3bf98fa8dd87057304dcda0d78e558424c57
@mbaret (Contributor) left a comment

LGTM now. I agree with @FrozenGene that this would be best implemented nearer the LLVM codegen for reusability, but I think this is a good start and demonstrates a worthwhile performance benefit.

@giuseros (Contributor, Author) commented Nov 2, 2020

Hi @mbaret, thanks for the review!

@FrozenGene , any update on this?

@FrozenGene merged commit 01b98c1 into apache:main on Nov 3, 2020
@FrozenGene (Member)

@giuseros @mbaret Thanks, merged now

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 2, 2020
…pache#6711)

* Improve depthwise convolution through smlal/smlal2 intrinsic

- Added an intrinsic to load a single int16x8 vector and produce two
  int32x4 output vectors through smlal/smlal2 instructions

- Changed the NHWC depthwise schedule to accommodate the aforementioned
  intrinsic

Change-Id: I347c3bf98fa8dd87057304dcda0d78e558424c57

* Address review comments

* Rebasing - 2

* Rebasing - 3

* Rebasing - 3

* Fix linting
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020 (same commit message as above)
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020 (same commit message as above)