[CORE][MVP] POC of fuse ops pass based on the DFPatterns #9628
mikepapadim wants to merge 1 commit into apache:main
Conversation
(force-pushed 052e6d3 to 4e4696d)
mbs-octoml left a comment:
I'll take another look once you've commented pattern_fuse.cc a bit, thanks!
```
Expr PartitionPattern(DFPattern pattern, Expr expr, Map<String, ObjectRef> attrs, PackedFunc check);

/*!
 * \brief Partition all matches of a DFPattern inside an Expr into separate Function calls
```
You'll need to explain the 'hierarchical order' part here. Perhaps explain that they are expected to be in most-specific to most-general form, and the first pattern to succeed is taken.
Oh, now that I look at the impl I see it's not that at all. So yeah, this will need explaining :-)
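For reference, the "most-specific to most-general, first match wins" convention suggested above could be sketched like this. This is a hypothetical Python stand-in for illustration only; `Pattern`, `matches`, and `first_match` are invented names, not TVM's DFPattern API, and this is not what the PR's implementation actually does (per the follow-up comment).

```python
# Hypothetical sketch of "most-specific to most-general, first match wins"
# dispatch over an ordered pattern list. Not TVM's actual API.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Pattern:
    name: str
    matches: Callable[[str], bool]  # stand-in for DFPattern matching


def first_match(patterns: List[Pattern], expr: str) -> Optional[str]:
    # Patterns are assumed pre-sorted from most specific to most general;
    # the first one that succeeds wins and the rest are never tried.
    for pat in patterns:
        if pat.matches(expr):
            return pat.name
    return None


patterns = [
    Pattern("conv2d+bias+relu", lambda e: e == "conv2d_bias_relu"),
    Pattern("conv2d+bias", lambda e: e.startswith("conv2d_bias")),
    Pattern("conv2d", lambda e: e.startswith("conv2d")),
]
```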
```
/*!
 * \brief Annoate primitive functions
 *
 * The result is an update module with annotated the primitive functions originated from the fuse
```
nit: ...updated module with fused functions annotation...
src/relay/ir/dataflow_matcher.cc (outdated)
```
  return Call(func, args);
}

// Expr DispatchVisitExpr(const Expr& pre) override {
  auto post = MixedModeMutator::DispatchVisitExpr(pre);
  if (gid_assignments_.count(pre) && pre == groups_[gid_assignments_[pre]].root_node &&
      static_cast<bool>(check_(pre))) {
  if (gid_assignments_.count(pre) && pre == groups_[gid_assignments_[pre]].root_node) {
```
Sorry, I don't understand this change.
```
Pass AnnotatePostFuseFuncs() {
  auto pass_info = PassInfo(0, "AnnotatePostFuseFuncs", {});
  return tvm::transform::CreateModulePass(
```
This can be a FunctionPass, right?
```
auto func = GetRef<Function>(func_node);

// add check from where it originate
func = WithAttr(std::move(func), attr::kPrimitive, tvm::Integer(1));
```
This will annotate all functions, including user-defined ones.
Meanwhile, isn't fusion rewriting the sub-expression into a call to a function literal? It's those that need the primitive annotation.
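The distinction being drawn here — annotate only the function literals produced by fusion, not every function in the module — could be sketched as follows. These are plain Python stand-ins with invented names (`Func`, `from_fusion`, `annotate_primitive`), not TVM's C++ `WithAttr`/`attr::kPrimitive` machinery.

```python
# Stand-in IR: a module maps names to functions; fusion produces function
# literals tagged here with a `from_fusion` flag. Only those get the
# "Primitive" attribute; user-defined module-level functions are skipped.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Func:
    name: str
    from_fusion: bool = False
    attrs: Dict[str, int] = field(default_factory=dict)


def annotate_primitive(module: Dict[str, Func]) -> Dict[str, Func]:
    for func in module.values():
        if func.from_fusion:  # do not touch user-defined functions
            func.attrs["Primitive"] = 1
    return module


mod = {
    "main": Func("main"),
    "fused_dense_relu": Func("fused_dense_relu", from_fusion=True),
}
annotate_primitive(mod)
```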
src/relay/transforms/pattern_fuse.cc (outdated)
```
 */

/*!
 * \file src/relay/transforms/fold_explicit_padding.cc
```
Could you add comments throughout here?
```
  return PatternPartitioner().Partition(pattern, expr, attrs, check);
}

Expr PartitionPattern(Array<DFPattern> patterns, Expr expr, Map<String, ObjectRef> attrs,
```
I'm not sure, but perhaps this is better expressed as a 'sequence' pattern combinator whose matching rule is what you've written here.
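The 'sequence' combinator suggested here could look roughly like the following. This is a hypothetical Python sketch only; `SequencePattern` and `Matcher` are invented names, and TVM's real DFPattern combinators live in C++ and `tvm.relay.dataflow_pattern`.

```python
# A combinator wrapping an ordered list of sub-patterns, mirroring the
# Array<DFPattern> overload of PartitionPattern quoted above: try each
# sub-pattern in turn and report the first that matches.
from typing import Callable, List, Optional

Matcher = Callable[[str], bool]


class SequencePattern:
    """Tries sub-patterns in order; returns the index of the first match."""

    def __init__(self, patterns: List[Matcher]):
        self.patterns = patterns

    def match(self, expr: str) -> Optional[int]:
        for idx, pat in enumerate(self.patterns):
            if pat(expr):
                return idx  # index of the winning sub-pattern
        return None


seq = SequencePattern([
    lambda e: "dense" in e and "relu" in e,
    lambda e: "dense" in e,
])
```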
(force-pushed 4d93c79 to 2633cf7)
…se of the DFPattern language implementation
(force-pushed 2633cf7 to a7381d9)
ping @mikepapadim and @mbs-octoml
Hi @mikepapadim, in an effort to clean up the pending PRs it would be good if you capture the above comments (I think you already did?) and then close this PR. Though keep it alive in your branch, obviously! Thanks.
This adds a demonstration of extracting, scheduling, and e2e-compiling Relay subgraphs with multiple anchor ops. Since task extraction is no longer tied to TE scheduling, extracting a subgraph with multiple anchor TE computes just works. The test case manually creates a simple fused mod with two `relay.dense`, but in the future an effort like #9628 should make it easier to construct multi-anchor subgraphs. The extracted TensorIR block corresponding to the two TE `dense` computes looks like this:
```
@tvm.script.ir_module
class Module:
    @T.prim_func
    def main(placeholder: T.Buffer[(128, 128), "float32"], placeholder_1: T.Buffer[(128, 128), "float32"], placeholder_2: T.Buffer[(128, 128), "float32"], T_matmul_NT: T.Buffer[(128, 128), "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with T.block("root")
        T_matmul_NT_1 = T.alloc_buffer([128, 128], dtype="float32")
        for i0, i1, i2 in T.grid(128, 128, 128):
            with T.block("T_matmul_NT"):
                i, j, k = T.axis.remap("SSR", [i0, i1, i2])
                T.reads(placeholder[i, k], placeholder_1[j, k])
                T.writes(T_matmul_NT_1[i, j])
                T.block_attr({"layout_free_placeholders": [placeholder_1]})
                with T.init():
                    T_matmul_NT_1[i, j] = T.float32(0)
                T_matmul_NT_1[i, j] = T_matmul_NT_1[i, j] + placeholder[i, k] * placeholder_1[j, k]
        for i0, i1, i2 in T.grid(128, 128, 128):
            with T.block("T_matmul_NT_1"):
                i, j, k = T.axis.remap("SSR", [i0, i1, i2])
                T.reads(T_matmul_NT_1[i, k], placeholder_2[j, k])
                T.writes(T_matmul_NT[i, j])
                T.block_attr({"layout_free_placeholders": [placeholder_2]})
                with T.init():
                    T_matmul_NT[i, j] = T.float32(0)
                T_matmul_NT[i, j] = T_matmul_NT[i, j] + T_matmul_NT_1[i, k] * placeholder_2[j, k]
```
This is a WIP reproducing the functionality of the fuse_ops pass using the pattern language instead.
The main goal is to replace the legacy fuse_ops with a cleaner, easier-to-maintain pass.
We also want to be able to extend it with pattern selection based on specific targets.
This MVP currently showcases the following patterns:
This is a draft, as I am still migrating patterns from other branches, and assertions for IR structural equality are missing.
@mbs-octoml @electriclilies @jroesch
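The target-specific pattern-selection goal mentioned above could be sketched as a simple registry. This is a plain-Python illustration with invented names (`FUSE_PATTERNS`, `patterns_for_target`); the PR itself implements pattern-based fusion in C++ in pattern_fuse.cc.

```python
# Hypothetical registry mapping a target string to the ordered list of
# fusion patterns the pass should try for that target. Names are
# illustrative, not part of TVM's API.
from typing import Dict, List

FUSE_PATTERNS: Dict[str, List[str]] = {
    "llvm": ["dense+relu", "conv2d+bias+relu", "elemwise-chain"],
    "cuda": ["dense+relu", "conv2d+bias+relu", "conv2d+bias", "elemwise-chain"],
}


def patterns_for_target(target: str) -> List[str]:
    # Fall back to a generic element-wise-only set for unknown targets.
    return FUSE_PATTERNS.get(target, ["elemwise-chain"])
```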