[SME] Introduce scalable fp32 dense schedule #16921

Merged
merged 7 commits into from
May 15, 2024

lhutton1
Contributor

This commit adds a new scalable fp32 dense schedule that calls SME intrinsics according to the SME RFC: apache/tvm-rfcs#107.

Currently the schedule does not make use of predication, meaning the output from the matmul compute must be copied in a subsequent compute stage. This will be removed once support for predication is added.
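To illustrate the extra copy stage described above, here is a minimal pure-Python sketch, not the schedule from this PR. It assumes a fixed `tile` size (the real tile size depends on the hardware vector length), and the helper names `round_up`, `pad_2d` and `padded_dense` are hypothetical. The inputs are zero-padded to a multiple of the tile so every tile is full, the matmul runs without bounds checks, and the valid region is then copied out.

```python
def round_up(value, multiple):
    """Round value up to the nearest multiple."""
    return ((value + multiple - 1) // multiple) * multiple


def pad_2d(matrix, rows, cols):
    """Return a rows x cols copy of matrix, zero-filled outside its bounds."""
    padded = [[0.0] * cols for _ in range(rows)]
    for i, row in enumerate(matrix):
        padded[i][: len(row)] = row
    return padded


def padded_dense(a, b, tile):
    """Compute a @ b on tile-aligned buffers, then copy out the valid region."""
    m, k, n = len(a), len(b), len(b[0])
    m_pad, n_pad = round_up(m, tile), round_up(n, tile)

    a_pad = pad_2d(a, m_pad, k)
    b_pad = pad_2d(b, k, n_pad)

    # Stage 1: every tile is full, so no per-element predicate is needed.
    # Each step over kk accumulates one outer product into the output tile.
    c_pad = [[0.0] * n_pad for _ in range(m_pad)]
    for i0 in range(0, m_pad, tile):
        for j0 in range(0, n_pad, tile):
            for kk in range(k):
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        c_pad[i][j] += a_pad[i][kk] * b_pad[kk][j]

    # Stage 2: the extra copy that predication would make unnecessary.
    return [row[:n] for row in c_pad[:m]]


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[1, 0], [0, 1], [1, 1]]
    print(padded_dense(a, b, tile=4))
```

With predication, stage 1 could mask off the out-of-bounds rows and columns and write directly into an `m` x `n` output, so stage 2 would no longer be required.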

cc @ekalda @Anndrey24 @tqchen @cbalint13 @Lunderberg

Change-Id: I9d5ec03d10b03b0637a48116d0cb4076f0ca8192
Addresses a nitpick comment mentioned here:
apache#16899 (comment)

Change-Id: I5b3dbe2b08dbf3b498b55fb89d9bfc112049baa4
Change-Id: I252e148c3a2e3567bc15b0dcea2efbfc228e2ecb
lhutton1 added a commit to lhutton1/tvm that referenced this pull request Apr 25, 2024
Changes the config script to build TVM with LLVM17. This enables tests
for apache#16921.

Change-Id: Ibfba153c7c9bb399d22d9ad6bd6569c59a1d5648
lhutton1 added a commit that referenced this pull request Apr 29, 2024
Changes the config script to build TVM with LLVM17. This enables tests
for #16921.

There was a failing codegen test when updating to LLVM 17; it seems it
stopped producing vectorized code with LLVM 16. I have checked the
same test with LLVM 18 and it now correctly produces vectorized code. I
made an attempt to track down the commit that fixed the issue in LLVM
but didn't have any success. Therefore, I think the best solution is
to skip the test until a more recent version of LLVM is used in CI.
- Skips the tests for all platforms that don't have the FVP installed
- Reworks the dense_alter_op changes to more closely match other
  implementations and remove warnings for "configs not found"
- Expands matmulinferlayout for transposed layouts

Change-Id: Ic8322e07e410da4ff0b2476ea95fa0f4ef7124c1
@lhutton1 lhutton1 marked this pull request as ready for review April 30, 2024 10:10
Change-Id: I04354dd75492edbc8a6dd3bd9d4f3f5751761ea2
Change-Id: If4912f7bfa8460ed6a6d27e65dd845906a0186fe
Change-Id: I980d9ff0ed842f1b8176057ee070779427b0a896
@lhutton1
Contributor Author

@tvm-bot rerun

@lhutton1
Contributor Author

Retriggering CI since it hasn't been run for a while

@leandron leandron self-assigned this May 14, 2024
Contributor

@leandron leandron left a comment

LGTM, thanks @lhutton1

@lhutton1 lhutton1 merged commit b49468d into apache:main May 15, 2024
22 checks passed
@lhutton1 lhutton1 deleted the sme-fp32-padded-dense-schedule branch May 15, 2024 10:28
@lhutton1
Contributor Author

thanks @Anndrey24 @leandron!

lhutton1 added a commit to lhutton1/tvm that referenced this pull request May 15, 2024
When analyzing scalable expressions, the analyzer will iterate over a
series of known vscale values in the range 1-16. However, we can
tighten this range to only values that are a power of two, as stated
in the [LLVM lang ref](https://llvm.org/docs/LangRef.html#llvm-vscale-intrinsic:~:text=This%20function%20attribute%20indicates%20vscale%20is%20a%20power%2Dof%2Dtwo%20within%20a%20specified%20range)
and more generally the [reference manual](https://developer.arm.com/documentation/ddi0487/latest/).
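
As a rough illustration of why the tighter range helps, the sketch below checks a property against every candidate vscale value. It is not the analyzer code touched by that commit: `candidate_vscale_values` and `holds_for_all_vscale` are hypothetical names, and the upper bound of 16 is taken from the range quoted above.

```python
def candidate_vscale_values(max_vscale=16):
    """Powers of two from 1 up to max_vscale, inclusive."""
    values = []
    v = 1
    while v <= max_vscale:
        values.append(v)
        v *= 2
    return values


def holds_for_all_vscale(predicate, max_vscale=16):
    """True if predicate(vscale) holds for every candidate vscale value."""
    return all(predicate(v) for v in candidate_vscale_values(max_vscale))


def divides_evenly(vscale):
    """Example property: 64 elements split evenly into 4 * vscale lanes."""
    return 64 % (4 * vscale) == 0


if __name__ == "__main__":
    print(candidate_vscale_values())
    print(holds_for_all_vscale(divides_evenly))
    # Checking every integer in 1..16 fails, e.g. at vscale = 3.
    print(all(divides_evenly(v) for v in range(1, 17)))
```

The example property holds for every power of two up to 16 but fails for values such as 3, so restricting the candidates lets it be proven.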

This comes from a discussion in apache#16921 (comment)

Change-Id: Iabbd1478b3853c3a6ad49c1442422bd50b8b08a6
Anndrey24 added a commit to Anndrey24/tvm that referenced this pull request May 16, 2024
This commit adds a scalable `arm_cpu` conv2d NHWC schedule for fp32 which generates SME instructions by using the tensor intrinsics introduced in apache#16921.

Alongside the SME schedule, the logic of the TE schedule `schedule_conv2d_gemm_native()` for both non-scalable and scalable vector implementations has also been translated into the new TIR schedule. This means that the TE compute definition `compute_conv2d_NHWC_hybrid()` is now compatible with both the original TE schedules (e.g. `schedule_conv2d_NHWC_hybrid()`) and the newly introduced TIR schedule `schedule_conv2d_NHWC_hybrid_TIR()`. The corresponding TOPI test has been extended to reflect that.
ekalda pushed a commit that referenced this pull request May 21, 2024
When analyzing scalable expressions, the analyzer will iterate over a
series of known vscale values in the range 1-16. However, we can
tighten this range to only values that are a power of two, as stated
in the [LLVM lang ref](https://llvm.org/docs/LangRef.html#llvm-vscale-intrinsic:~:text=This%20function%20attribute%20indicates%20vscale%20is%20a%20power%2Dof%2Dtwo%20within%20a%20specified%20range)
and more generally the [reference manual](https://developer.arm.com/documentation/ddi0487/latest/).

This comes from a discussion in #16921 (comment)
Anndrey24 added a commit to Anndrey24/tvm that referenced this pull request May 28, 2024
ekalda pushed a commit that referenced this pull request May 28, 2024
This commit adds a scalable `arm_cpu` conv2d NHWC schedule for fp32 which generates SME instructions by using the tensor intrinsics introduced in #16921.

Alongside the SME schedule, the logic of the TE schedule `schedule_conv2d_gemm_native()` for both non-scalable and scalable vector implementations has also been translated into the new TIR schedule. This means that the TE compute definition `compute_conv2d_NHWC_hybrid()` is now compatible with both the original TE schedules (e.g. `schedule_conv2d_NHWC_hybrid()`) and the newly introduced TIR schedule `schedule_conv2d_NHWC_hybrid_TIR()`. The corresponding TOPI test has been extended to reflect that.
3 participants