[SME] Introduce scalable fp32 dense schedule #16921

lhutton1 · 2024-04-24T12:35:18Z

This commit adds a new scalable fp32 dense schedule that calls SME intrinsics according to the SME RFC: apache/tvm-rfcs#107.

Currently the schedule does not make use of predication, meaning the output from the matmul compute must be copied in a subsequent compute stage. This extra copy stage will be removed once support for predication is added.
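To illustrate why the missing predication forces an extra copy, here is a minimal pure-Python sketch. It is not the TVM schedule: `padded_dense`, `pad_to_multiple`, and the `tile` parameter are hypothetical names, and the assumption that M and N are zero-padded up to a tile multiple is mine, inferred from the description above and the branch name.

```python
def pad_to_multiple(n, tile):
    """Round n up to the next multiple of tile."""
    return ((n + tile - 1) // tile) * tile


def padded_dense(a, b, tile):
    """Compute a @ b on zero-padded operands, then copy out the valid region.

    a is M x K and b is K x N (lists of lists). `tile` is a hypothetical
    stand-in for the tile edge length, which is not a compile-time constant
    on scalable hardware.
    """
    m, k, n = len(a), len(b), len(b[0])
    m_pad, n_pad = pad_to_multiple(m, tile), pad_to_multiple(n, tile)

    # Zero-pad rows of a and columns of b so every tile is full.
    # Full tiles need no per-lane predicate (assumption of this sketch).
    a_pad = [[a[i][p] if i < m else 0.0 for p in range(k)] for i in range(m_pad)]
    b_pad = [[b[p][j] if j < n else 0.0 for j in range(n_pad)] for p in range(k)]

    # Stage 1: matmul over the padded shape.
    c_pad = [
        [sum(a_pad[i][p] * b_pad[p][j] for p in range(k)) for j in range(n_pad)]
        for i in range(m_pad)
    ]

    # Stage 2: the extra copy that keeps only the valid M x N region.
    return [row[:n] for row in c_pad[:m]]
```

With predication, stage 1 could write only the valid lanes directly, which is presumably why the copy stage is described as temporary.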

cc @ekalda @Anndrey24 @tqchen @cbalint13 @Lunderberg

Changes the config script to build TVM with LLVM17. This enables tests for #16921. There was a failing codegen test when updating to LLVM 17; it seems it stopped producing vectorized code with LLVM 16. I have checked the same test with LLVM 18 and it now correctly produces vectorized code. I tried to track down the LLVM commit that fixed the issue, without success. Therefore, I think the best solution is to skip the test until a more recent version of LLVM is used in CI.

- Skips the tests for all platforms that don't have the FVP installed
- Reworks the dense_alter_op changes to more closely match other implementations and remove warnings for "configs not found"
- Expands matmulinferlayout for transposed layouts

Change-Id: Ic8322e07e410da4ff0b2476ea95fa0f4ef7124c1

lhutton1 · 2024-05-14T09:48:08Z

@tvm-bot rerun

lhutton1 · 2024-05-14T09:48:28Z

Retriggering CI since it hasn't been run for a while

src/arith/const_int_bound.cc

leandron

LGTM, thanks @lhutton1

lhutton1 · 2024-05-15T10:28:31Z

thanks @Anndrey24 @leandron!

When analyzing scalable expressions, the analyzer will iterate over a series of known vscale values in the range 1-16. However, we can tighten this range to only values that are a power of two, as stated in the [LLVM lang ref](https://llvm.org/docs/LangRef.html#llvm-vscale-intrinsic:~:text=This%20function%20attribute%20indicates%20vscale%20is%20a%20power%2Dof%2Dtwo%20within%20a%20specified%20range) and more generally the [reference manual](https://developer.arm.com/documentation/ddi0487/latest/). This comes from a discussion in apache#16921 (comment) Change-Id: Iabbd1478b3853c3a6ad49c1442422bd50b8b08a6
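The tightening described above can be sketched in a few lines of Python. This is an illustration only, not TVM's analyzer code: `candidate_vscale_values` and `holds_for_all` are hypothetical helpers, and the 1-16 range is taken from the commit message.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (positive integers with a single set bit)."""
    return n > 0 and (n & (n - 1)) == 0


def candidate_vscale_values(lo=1, hi=16):
    """Keep only the power-of-two vscale values in the inclusive range [lo, hi]."""
    return [v for v in range(lo, hi + 1) if is_power_of_two(v)]


def holds_for_all(predicate, candidates):
    """A property is treated as proven if it holds for every candidate vscale."""
    return all(predicate(v) for v in candidates)


# Hypothetical property: the extent 16 is evenly divisible by vscale.
def divides_16(vscale):
    return 16 % vscale == 0


tightened = candidate_vscale_values()
full_range = list(range(1, 17))
print(tightened)
print(holds_for_all(divides_16, tightened))
print(holds_for_all(divides_16, full_range))
```

Checking fewer candidates is cheaper, and properties that fail only for non-power-of-two values such as 3 or 5 become provable once those values are excluded.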

This commit adds a scalable `arm_cpu` conv2d NHWC schedule for fp32 which generates SME instructions by using the tensor intrinsics introduced in apache#16921. Alongside the SME schedule, the logic of the TE schedule `schedule_conv2d_gemm_native()` for both non-scalable and scalable vector implementations has also been translated into the new TIR schedule. This means that the TE compute definition `compute_conv2d_NHWC_hybrid()` is now compatible with both the original TE schedules (e.g. `schedule_conv2d_NHWC_hybrid()`) and the newly introduced TIR schedule `schedule_conv2d_NHWC_hybrid_TIR()`. The corresponding TOPI test has been extended to reflect that.

lhutton1 added 2 commits April 24, 2024 12:25

Address a comment from apache#16899

c557a32

Addresses a nitpick comment mentioned here: apache#16899 (comment) Change-Id: I5b3dbe2b08dbf3b498b55fb89d9bfc112049baa4

lhutton1 mentioned this pull request Apr 24, 2024

[Tracking Issue] Scalable Matrix Extension (SME) upstreaming #16734

Open

11 tasks

github-actions bot requested review from Lunderberg, ekalda and tqchen April 24, 2024 12:36

Correct the order of database initialization

d0e50c8

Change-Id: I252e148c3a2e3567bc15b0dcea2efbfc228e2ecb

lhutton1 added a commit to lhutton1/tvm that referenced this pull request Apr 25, 2024

[CI] Use LLVM17 for tests on ci_cpu

e13c394

Changes the config script to build TVM with LLVM17. This enables tests for apache#16921. Change-Id: Ibfba153c7c9bb399d22d9ad6bd6569c59a1d5648

lhutton1 mentioned this pull request Apr 25, 2024

[CI] Use LLVM17 for tests on ci_cpu #16931

Merged

lhutton1 marked this pull request as ready for review April 30, 2024 10:10

lhutton1 added 3 commits May 1, 2024 08:34

Don't run SME select impl tests on older llvm target

5029c81

Change-Id: I04354dd75492edbc8a6dd3bd9d4f3f5751761ea2

add target context for sme codegen test

59671d4

Change-Id: If4912f7bfa8460ed6a6d27e65dd845906a0186fe

use DenseInferLayout for matmul

4b8d83c

Change-Id: I980d9ff0ed842f1b8176057ee070779427b0a896

lhutton1 mentioned this pull request May 8, 2024

[SME] Add scalable fp16->fp32 dense schedule #16981

Merged

leandron self-assigned this May 14, 2024

Anndrey24 reviewed May 14, 2024

src/arith/const_int_bound.cc (resolved)

leandron approved these changes May 14, 2024

lhutton1 merged commit b49468d into apache:main May 15, 2024
22 checks passed

lhutton1 deleted the sme-fp32-padded-dense-schedule branch May 15, 2024 10:28

lhutton1 mentioned this pull request May 15, 2024

[SVE] Use only powers of two as possible vscale values #17001

Merged

Anndrey24 mentioned this pull request May 16, 2024

[SME][TOPI] Add conv2d NHWC SME fp32 schedule #17003

Merged

felix-ro mentioned this pull request Jun 5, 2024

[BugFix][MetaSchedule] Fix TensorIntrin ‘dot_4x4_i8i8s32_sdot’ is not registered #17066

Merged

ysh329 mentioned this pull request Jul 20, 2024

[Release] v0.17.0 Release Candidate Notes #17178

Closed
