[MetaSchedule] Fuse loops around shared to global store block in MultiLevelTilingTensorCore #13357
Merged
masahi merged 2 commits into apache:main on Nov 11, 2022
Conversation
tvm-bot commented:
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
vinx13 approved these changes on Nov 11, 2022
xinetzone pushed a commit to daobook/tvm that referenced this pull request on Nov 25, 2022:
…tiLevelTilingTensorCore` (apache#13357)
* Fuse shared to global store loops in MultiLevelTilingTensorCore
* update test
Currently, vectorization of the shared-to-global store in tensor core auto-tensorization is not done properly, since most blocks carry a `T.where` predicate that disables vectorization. The predicate is introduced after the `Split` in cooperative fetch rewriting: https://github.com/apache/tvm/blob/main/src/meta_schedule/postproc/rewrite_cooperative_fetch.cc#L159-L162

As the code says, this split is supposed to be applied to a fused loop. That is the case for cache read blocks, where `AddReadReuse` explicitly fuses the loops around them. But `AddWriteReuseTensorCore` doesn't fuse loops after the cache write: https://github.com/apache/tvm/blob/main/src/meta_schedule/schedule_rule/multi_level_tiling_tensor_core.cc#L260-L262. So for cache write blocks, we always try to split a single axis by large factors like `[None, 4, 32, 2]`. Unless the sampled factor for that axis is large, we always get `T.where` in the shared-to-global copy block.

This PR adds the missing fusion. Now all candidate samples have the shared-to-global copy block properly vectorized. Unfortunately, there was no perf improvement from this change after e2e tuning.
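The effect can be reproduced outside MetaSchedule with a few lines of the TIR schedule API. The sketch below is purely illustrative (the copy kernel, buffer shapes, split factors, and thread bindings are made up for this example and are not taken from the PR, and it only loosely mirrors what the cooperative fetch postprocessor does): splitting a single, unfused axis by factors whose product exceeds its extent introduces a `T.where` predicate, while fusing the loops first lets the same factors divide the fused extent evenly, leaving the innermost loop free to be vectorized.

```python
# Hypothetical illustration; shapes and factors are chosen for the example only.
import tvm
from tvm.script import tir as T


@T.prim_func
def shared_to_global(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (48, 128), "float32")
    B = T.match_buffer(b, (48, 128), "float32")
    for i, j in T.grid(48, 128):
        with T.block("store"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj]


# Without fusion: split a single axis (extent 128) by [None, 4, 32, 2].
# Since 4 * 32 * 2 = 256 > 128, the inferred outer factor is 1 and the block
# gets guarded by a T.where predicate, so the inner loop cannot be vectorized.
sch = tvm.tir.Schedule(shared_to_global)
_, j = sch.get_loops(sch.get_block("store"))
sch.split(j, factors=[None, 4, 32, 2])
print(sch.mod)  # note the T.where(...) on the block

# With fusion first: the fused extent 48 * 128 = 6144 is divisible by 256,
# so the same factors produce no predicate and the innermost loop can be
# bound and vectorized.
sch = tvm.tir.Schedule(shared_to_global)
i, j = sch.get_loops(sch.get_block("store"))
fused = sch.fuse(i, j)
outer, ty, tx, vec = sch.split(fused, factors=[None, 4, 32, 2])
sch.bind(ty, "threadIdx.y")
sch.bind(tx, "threadIdx.x")
sch.vectorize(vec)
print(sch.mod)
```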
For quantized workloads, vectorization of the shared-to-global copy is disabled, since we would end up also vectorizing the requantization-related math involving 64-bit arithmetic, and the generated code currently fails to compile.
@vinx13 @junrushao