[SME] Introduce scalable fp32 dense schedule #16921
Merged
lhutton1
merged 7 commits into
apache:main
from
lhutton1:sme-fp32-padded-dense-schedule
May 15, 2024
Conversation
This commit adds a new scalable fp32 dense schedule that calls SME intrinsics according to the SME RFC: apache/tvm-rfcs#107. Currently the schedule does not make use of predication, meaning the output from the matmul compute must be copied in a subsequent compute stage. This will be removed once support for predication is added. Change-Id: I9d5ec03d10b03b0637a48116d0cb4076f0ca8192
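The copy stage described above can be illustrated with a small pure-Python sketch. This is not the TVM schedule or its SME intrinsics; the helper names (`pad_to_multiple`, `padded_dense`) and the fixed `tile` size are hypothetical, chosen only to show why, without predication, the matmul works on whole tiles and a second stage must copy out the valid region.

```python
def pad_to_multiple(n, tile):
    """Round n up to the next multiple of tile (hypothetical helper)."""
    return ((n + tile - 1) // tile) * tile


def padded_dense(a, b, tile):
    """Compute a @ b for a (M x K) and b (K x N) using whole tiles only.

    Assumption: without predication the kernel cannot mask partial tiles,
    so inputs are zero-padded and the result is copied out afterwards.
    """
    m, k, n = len(a), len(b), len(b[0])
    m_pad, n_pad = pad_to_multiple(m, tile), pad_to_multiple(n, tile)

    # Zero-pad the operands so every tile is full.
    a_pad = [[a[i][p] if i < m else 0.0 for p in range(k)] for i in range(m_pad)]
    b_pad = [[b[p][j] if j < n else 0.0 for j in range(n_pad)] for p in range(k)]

    # Stage 1: matmul over full tiles, writing into a padded output buffer.
    c_pad = [[0.0] * n_pad for _ in range(m_pad)]
    for i0 in range(0, m_pad, tile):
        for j0 in range(0, n_pad, tile):
            for p in range(k):
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        c_pad[i][j] += a_pad[i][p] * b_pad[p][j]

    # Stage 2: copy only the valid M x N region out of the padded buffer.
    return [row[:n] for row in c_pad[:m]]


a = [[1, 2, 3], [4, 5, 6]]
b = [[1, 0], [0, 1], [1, 1]]
result = padded_dense(a, b, tile=4)
```

With predication, stage 2 would be unnecessary because the partial tiles could be masked when storing the result directly.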
Addresses a nitpick comment mentioned here: apache#16899 (comment) Change-Id: I5b3dbe2b08dbf3b498b55fb89d9bfc112049baa4
Change-Id: I252e148c3a2e3567bc15b0dcea2efbfc228e2ecb
lhutton1
added a commit
to lhutton1/tvm
that referenced
this pull request
Apr 25, 2024
Changes the config script to build TVM with LLVM17. This enables tests for apache#16921. Change-Id: Ibfba153c7c9bb399d22d9ad6bd6569c59a1d5648
lhutton1
added a commit
that referenced
this pull request
Apr 29, 2024
Changes the config script to build TVM with LLVM 17. This enables tests for #16921. One codegen test failed after updating to LLVM 17; it seems it stopped producing vectorized code with LLVM 16. I have checked the same test with LLVM 18 and it now correctly produces vectorized code. I tried to track down the LLVM commit that fixed the issue, but without success. Therefore, I think the best solution is to skip the test until CI uses a more recent version of LLVM.
- Skips the tests for all platforms that don't have the FVP installed
- Reworks the dense_alter_op changes to more closely match other implementations and remove warnings for "configs not found"
- Expands matmulinferlayout for transposed layouts
Change-Id: Ic8322e07e410da4ff0b2476ea95fa0f4ef7124c1
Change-Id: I04354dd75492edbc8a6dd3bd9d4f3f5751761ea2
Change-Id: If4912f7bfa8460ed6a6d27e65dd845906a0186fe
Change-Id: I980d9ff0ed842f1b8176057ee070779427b0a896
@tvm-bot rerun
Retriggering CI since it hasn't been run for a while
Anndrey24
reviewed
May 14, 2024
leandron
approved these changes
May 14, 2024
LGTM, thanks @lhutton1
thanks @Anndrey24 @leandron!
lhutton1
added a commit
to lhutton1/tvm
that referenced
this pull request
May 15, 2024
When analyzing scalable expressions, the analyzer will iterate over a series of known vscale values in the range 1-16. However, we can tighten this range to only values that are a power of two, as stated in the [LLVM lang ref](https://llvm.org/docs/LangRef.html#llvm-vscale-intrinsic:~:text=This%20function%20attribute%20indicates%20vscale%20is%20a%20power%2Dof%2Dtwo%20within%20a%20specified%20range) and more generally the [reference manual](https://developer.arm.com/documentation/ddi0487/latest/). This comes from a discussion in apache#16921 (comment) Change-Id: Iabbd1478b3853c3a6ad49c1442422bd50b8b08a6
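As a rough illustration of why the tighter range helps, the sketch below checks a property against candidate vscale values. It is not TVM's analyzer; `is_power_of_two`, `holds_for_all_vscale`, and the example predicate are hypothetical names used only to show that some properties hold for every power of two in 1-16 but fail for other values in that range.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (hypothetical helper)."""
    return n > 0 and (n & (n - 1)) == 0


# Original candidate set: every value in the range 1-16.
ALL_VSCALE = list(range(1, 17))

# Tightened candidate set: only powers of two within that range.
POW2_VSCALE = [v for v in ALL_VSCALE if is_power_of_two(v)]


def holds_for_all_vscale(predicate, candidates):
    """Check a property by trying each known vscale value in turn."""
    return all(predicate(v) for v in candidates)


def divides_16(vscale):
    """Example property: 16 is divisible by vscale."""
    return 16 % vscale == 0


provable_before = holds_for_all_vscale(divides_16, ALL_VSCALE)
provable_after = holds_for_all_vscale(divides_16, POW2_VSCALE)
```

Here the property fails for values such as 3, so it cannot be established over the full range, but it holds for every power-of-two candidate. The tightened set also means fewer values to iterate over.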
Anndrey24
added a commit
to Anndrey24/tvm
that referenced
this pull request
May 16, 2024
This commit adds a scalable `arm_cpu` conv2d NHWC schedule for fp32 which generates SME instructions by using the tensor intrinsics introduced in apache#16921. Alongside the SME schedule, the logic of the TE schedule `schedule_conv2d_gemm_native()` for both non-scalable and scalable vector implementations has also been translated into the new TIR schedule. This means that the TE compute definition `compute_conv2d_NHWC_hybrid()` is now compatible with both the original TE schedules (e.g. `schedule_conv2d_NHWC_hybrid()`) and the newly introduced TIR schedule `schedule_conv2d_NHWC_hybrid_TIR()`. The corresponding TOPI test has been extended to reflect that.
ekalda
pushed a commit
that referenced
this pull request
May 21, 2024
When analyzing scalable expressions, the analyzer will iterate over a series of known vscale values in the range 1-16. However, we can tighten this range to only values that are a power of two, as stated in the [LLVM lang ref](https://llvm.org/docs/LangRef.html#llvm-vscale-intrinsic:~:text=This%20function%20attribute%20indicates%20vscale%20is%20a%20power%2Dof%2Dtwo%20within%20a%20specified%20range) and more generally the [reference manual](https://developer.arm.com/documentation/ddi0487/latest/). This comes from a discussion in #16921 (comment)
Anndrey24
added a commit
to Anndrey24/tvm
that referenced
this pull request
May 28, 2024
This commit adds a scalable `arm_cpu` conv2d NHWC schedule for fp32 which generates SME instructions by using the tensor intrinsics introduced in apache#16921. Alongside the SME schedule, the logic of the TE schedule `schedule_conv2d_gemm_native()` for both non-scalable and scalable vector implementations has also been translated into the new TIR schedule. This means that the TE compute definition `compute_conv2d_NHWC_hybrid()` is now compatible with both the original TE schedules (e.g. `schedule_conv2d_NHWC_hybrid()`) and the newly introduced TIR schedule `schedule_conv2d_NHWC_hybrid_TIR()`. The corresponding TOPI test has been extended to reflect that.
ekalda
pushed a commit
that referenced
this pull request
May 28, 2024
This commit adds a scalable `arm_cpu` conv2d NHWC schedule for fp32 which generates SME instructions by using the tensor intrinsics introduced in #16921. Alongside the SME schedule, the logic of the TE schedule `schedule_conv2d_gemm_native()` for both non-scalable and scalable vector implementations has also been translated into the new TIR schedule. This means that the TE compute definition `compute_conv2d_NHWC_hybrid()` is now compatible with both the original TE schedules (e.g. `schedule_conv2d_NHWC_hybrid()`) and the newly introduced TIR schedule `schedule_conv2d_NHWC_hybrid_TIR()`. The corresponding TOPI test has been extended to reflect that.
This commit adds a new scalable fp32 dense schedule that calls SME intrinsics according to the SME RFC: apache/tvm-rfcs#107.
Currently the schedule does not make use of predication, meaning the output from the matmul compute must be copied in a subsequent compute stage. This will be removed once support for predication is added.
cc @ekalda @Anndrey24 @tqchen @cbalint13 @Lunderberg