-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TensorIR][M1b] Scaffolding ScheduleState data structure #7765
[TensorIR][M1b] Scaffolding ScheduleState data structure #7765
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. Most comments are for comments and clarifications.
One major suggestion as listed in the comment is whether we should preserve the block name "root" and disallow duplicated block names in a scope of PrimFunc.
constexpr int kVerifyAffineBinding = static_cast<int>(ScheduleDebugMask::kVerifyAffineBinding); | ||
constexpr int kVerifyRegionCover = static_cast<int>(ScheduleDebugMask::kVerifyRegionCover); | ||
constexpr int kVerifyStagePipeline = static_cast<int>(ScheduleDebugMask::kVerifyStagePipeline); | ||
ICHECK_GE(debug_mode, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just assign 8 to the True
case, so that you don't need to deal with negative debug_mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer not to. The reason is that we don't want to introduce a really magic number, and even worse perhaps in the future we do want to extend the verification so the magic number may change over time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I meant 15 (1111b, a full mask).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah a full mask (like INT_MAX
) definitely works, but I would prefer -1 here to make the logic clear and invariant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... In my experience, I'm thinking that it will be better to use unsigned int as bit masks. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing to unsigned integers makes sense to me, and we could use like (11...11)_2 for the full mask. What I was worrying about is passing unsigned integers around the TVM FFI would cause potential issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since Python doesn't have unsigned int, changing to unsigned int will result in inconsistency between Python and C++. Plus this debug_mode won't be increased to a large number in the future AFAIK, I'll take the current solution.
src/tir/schedule/utils.h
Outdated
* \param Type The type to be casted to, can be Block or For | ||
* \note The `E` in the macro means `error`, which means allowing to customize error message | ||
*/ | ||
#define TVM_SREF_TO_E(Result, SRef, Type) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The
E
in macro looks confusing, and I still cannot get the point even after reading the note... - This macro itself is also a bit confusion, as Result is being assigned in the LHS. I would suggest simply using inline functions for the cases in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inlined functions are generally good, but I am not in favor of it in this particular case - I considered this before but didn't go with that idea, and here is the reason:
- When an error occurs, we want to print the exact line/function/file that throws that error: if we use an inline function, then instead of rendering the caller, it throws in the inline function in utils.h, which is much less informative.
- The caller should be responsible for writing the declaration of the variable. Comparing the following two, I would go for the first one, because it writes the type clearly, allows re-assignment of a variable, and makes it really clear what we are doing.
- The only disadvantage is that we need to repeat the name "block" twice, which is not quite inconvenient IMO.
const BlockNode* block = TVM_SREF_TO_BLOCK(block, block_sref);
// compared with
TVM_SREF_TO_BLOCK(block, block_sref);
A good alternative I considered before is to use inlined lambda function which expands like:
const auto* block = [&]() -> const BlockNode* {
const BlockNode* stmt = sref->StmtAs<BlockNode>();
ICHECK(stmt != nullptr) << "Error Message";
return stmt;
}();
The disadvantage of this approach is that the error message is not customizable, and it really depends on the compiler to optimize the lambda out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The macro with "E" suffix is short for "error", which means the error message is customizable. It is only exposed for flexibility, and rarely used in the codebase. Given it is an internal util macro, and has been well documented, I think it is fine to keep it here. Of course better names are definitely welcome :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reasons you illustrated make sense to me, although I still don't like the coding style. It seems like extracting a common piese of expression to be a macro. Anyways, as you mentioned, this macro is an internal utility which is more like just a helper, so I'm not strongly against it.
For the naming, probably TVM_SREF_AS_OR_ERR
would be more straightforward? "E" is not a proper and common abbrevation of "Error" so I don't think it is informative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think TVM_SREF_AS_OR_ERR
is a better name. I will go with the name :-)
The reason that we introduced this macro is that there are too many places using such conversion-and-check with almost the same error message, because most of the subsequent methods/analysis/primitives are mostly based on sref, not stmt. Introducing such macro could help alleviate the burden of writing several lines of almost identical checks and provide a consistent template of the error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, a better name might be TVM_SREF_STMT_AS_OR_ERR
Would like to get a chance to read this throughly, put some time on calendar to do it tomorrow. |
Although out of the scope of this PR, I am really glad that we have the discussion about block names. @comaniac brought up the point #7765 (comment):
I kinda agree with Cody about his points, but would love to hear more discussion on the block name. Particularly, we have three points to discuss:
|
Thanks @junrushao1994 @comaniac These are great points, although i think they are somewhat parallel to the data structure itself and have things to do with primitive implementations. So we could try to make discussions in parallel with respect to this PR. In terms of the "root" name, given that we are uniquely identifying function already via the global names, an easy way is to just use function name in the module to obtain the root, which removes on concept here. The main Q for the block name uniqueness is about how to enforce them. For manual operations they certainly makes sense. For general automated transformations it might create an extra burden to introduce name tables or allocation mechanism. Since automated transformations rules works on a sub-region and may not be aware of the names from other parts. Due to that reason, allowing pointer uniqueness might still be a better approach. This also aligns with our existing approach to handle loop vars, which saves a lot of trouble during automatic transformations. This being said, we should be able to introduce canonicalization pass to uniquely rename block names. We can also add a flag in the Schedule to enforce such uniqueness if it is turned on |
Make sense. I'm fine with a follow-up PR to implement the result of this discussion.
This is also a good point. IMHO, as long as the interface makes sense to schedule primitive developers, it should be fine.
It makes sense to use unique pointers in the automation framework. One thing I would like to highlight is that even we leverage unique pointer to access blocks and don't have to worry about their names during optimization, it might still be worthwhile to maintain block name uniqueness. The reason is, IIUC, we will have a mechanism to print out the schedule in Python format for debugging and investigation. In the printed schedule, block name will be the only referenced.
Exactly. Calling a canonicalization pass before printing out the schedule could also solve the issue I mentioned above. |
Per discussion with @tqchen. A1. I agree that enforcing name uniqueness in scheduling is important and it is something we should do. We also recognize the name uniqueness is a bit misleading and not super useful in subsequent IR passes. Therefore, we want to divide the problems in two steps:
A2. Reserve names for the root block: Yes, we should do that. We have two proposals:
A3. Yes, we want to enable users who call scheduling primitives to specify the names of the blocks. Particularly, we want to hear some further discussions on the user experience should look like. Here is our proposal:
|
/*! \brief Lookup table for the `src` of dependencies */ | ||
std::unordered_map<StmtSRef, Array<Dependency>, ObjectPtrHash, ObjectPtrEqual> src2deps; | ||
/*! \brief Lookup table for the `dst` of dependencies */ | ||
std::unordered_map<StmtSRef, Array<Dependency>, ObjectPtrHash, ObjectPtrEqual> dst2deps; | ||
/*! \brief The mapping from the buffer to the blocks who write it */ | ||
std::unordered_map<Buffer, Array<StmtSRef>, ObjectPtrHash, ObjectPtrEqual> buffer_writers; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not to use tvm::Map
here instead of std::unordered_map
? Then these members can be visited.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had been trying with tvm::Map<Buffer, Array<StmtSRef>>
in the very beginning, but it turned out that we need the values (the Array<StmtSRef>
) of the map to be mutable to make sure they are maintained properly during transformations, but with tvm::Map
we are unable to do so in an easy way :-( Therefore, we have to provide workarounds like providing APIs get_deps_by_src
on the python side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had been trying with
tvm::Map<Buffer, Array<StmtSRef>>
in the very beginning, but it turned out that we need the values (theArray<StmtSRef>
) of the map to be mutable to make sure they are maintained properly during transformations, but withtvm::Map
we are unable to do so in an easy way :-( Therefore, we have to provide workarounds like providing APIsget_deps_by_src
on the python side.
might be worth writing this down for future people in a NB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I will add this to "\note"
constexpr int kVerifyAffineBinding = static_cast<int>(ScheduleDebugMask::kVerifyAffineBinding); | ||
constexpr int kVerifyRegionCover = static_cast<int>(ScheduleDebugMask::kVerifyRegionCover); | ||
constexpr int kVerifyStagePipeline = static_cast<int>(ScheduleDebugMask::kVerifyStagePipeline); | ||
ICHECK_GE(debug_mode, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... In my experience, I'm thinking that it will be better to use unsigned int as bit masks. 😄
Hey would you guys take another look? Thanks a lot! @comaniac @jcf94 @MasterJH5574 @jroesch @tqchen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Two concerns were addressed per offline discussion with @junrushao1994
- The naming of InlineMark and RootMark should be improved in the future, but I don't have a better suggestion so I'll leave them there for now.
- It's better to use a full mask as the default debug_mode, but since Python doesn't have unsigned int, this will cause inconsistency between Python and C++. Meanwhile, the debug mode shouldn't be increased dramatically, we could avoid spending too much time on this point.
constexpr int kVerifyAffineBinding = static_cast<int>(ScheduleDebugMask::kVerifyAffineBinding); | ||
constexpr int kVerifyRegionCover = static_cast<int>(ScheduleDebugMask::kVerifyRegionCover); | ||
constexpr int kVerifyStagePipeline = static_cast<int>(ScheduleDebugMask::kVerifyStagePipeline); | ||
ICHECK_GE(debug_mode, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since Python doesn't have unsigned int, changing to unsigned int will result in inconsistency between Python and C++. Plus this debug_mode won't be increased to a large number in the future AFAIK, I'll take the current solution.
Per offline discussion with @comaniac: More documentation on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for waiting for my slow review, just spent my energy on the interface and documentation. Most of them should be really easy to apply.
int64_t seq_index; | ||
|
||
void VisitAttrs(AttrVisitor* v) { | ||
// `stmt` is not visited |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this because of the weak pointer optimization? It isn't clear why I can't read these fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh we don't want to visit the weak references in the visitors, because those void pointers are less meaningful on the python side. Instead, we provide FFI functions that return strong references: see block_scope.cc:144-151
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn> Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com> Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com> Co-authored-by: Hongyi Jin <3231950289@qq.com> Co-authored-by: Wuwei Lin <wuwei@apache.org> Co-authored-by: Cody Yu <comaniac0422@gmail.com> Co-authored-by: Jared Roesch <roeschinc@gmail.com>
Thanks @junrushao1994 for keep improving the PR. |
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn> Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com> Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com> Co-authored-by: Hongyi Jin <3231950289@qq.com> Co-authored-by: Wuwei Lin <wuwei@apache.org> Co-authored-by: Cody Yu <comaniac0422@gmail.com> Co-authored-by: Jared Roesch <roeschinc@gmail.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn> Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com> Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com> Co-authored-by: Hongyi Jin <3231950289@qq.com> Co-authored-by: Wuwei Lin <wuwei@apache.org> Co-authored-by: Cody Yu <comaniac0422@gmail.com> Co-authored-by: Jared Roesch <roeschinc@gmail.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn> Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com> Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com> Co-authored-by: Hongyi Jin <3231950289@qq.com> Co-authored-by: Wuwei Lin <wuwei@apache.org> Co-authored-by: Cody Yu <comaniac0422@gmail.com> Co-authored-by: Jared Roesch <roeschinc@gmail.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn> Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com> Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com> Co-authored-by: Hongyi Jin <3231950289@qq.com> Co-authored-by: Wuwei Lin <wuwei@apache.org> Co-authored-by: Cody Yu <comaniac0422@gmail.com> Co-authored-by: Jared Roesch <roeschinc@gmail.com>
This PR is part of the stage M1b, TensorIR upstreaming plan (#7527), on the core data structure, ScheduleState.
This PR introduces two key concepts: BlockScope and ScheduleState. The ScheduleState provides a key method
Replace
, which allows all the schedule primitives to be developed around.Detailed explanation of all the terminologies, concepts and algorithms is provided in the documentation.
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Wuwei Lin <wuwei@apache.org>