
Add Single-Source-Single-Sink Patterns in canonicalize-live-in Pass #180

Merged
tancheng merged 6 commits into coredac:main from ShangkunLi:live-in-patterns
Oct 29, 2025

Conversation

@ShangkunLi
Collaborator

In this PR, we add a new pattern to the canonicalize-live-in pass that identifies the structures shown below and avoids creating block arguments and branch operands for live-ins in the sink block.

The patterns are like:

[Screenshot: the targeted single-source-single-sink patterns]

@ShangkunLi ShangkunLi marked this pull request as ready for review October 28, 2025 15:03
@ShangkunLi ShangkunLi requested a review from tancheng October 28, 2025 15:03
@ShangkunLi ShangkunLi self-assigned this Oct 28, 2025
@ShangkunLi ShangkunLi added the enhancement New feature or request label Oct 28, 2025
@tancheng
Contributor

Can you please elaborate, via your example, on "avoid creating block arguments and branch operands for live-ins in the sink block"? Some args are still needed, but which ones are skipped?

Also, please mention how this PR differs from the suggestion I gave in #159 (comment).

@ShangkunLi
Collaborator Author

Can you please elaborate, via your example, on "avoid creating block arguments and branch operands for live-ins in the sink block"? Some args are still needed, but which ones are skipped?

Also, please mention how this PR differs from the suggestion I gave in #159 (comment).

Q1. Elaborate on the pass via an example.

Sure~ Take relu as an example. The CFG structure of this kernel is

[Screenshot: CFG of the relu kernel]

We identify the live-ins in bb4 and check if they are defined in bb2. If yes, we directly use these defined values in bb4 and avoid passing them through branches & block arguments.
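The check described above can be sketched in Python. This is only an illustrative model, not the pass's actual code; the `Value` class and `split_live_ins` name are made up for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Value:
    name: str
    defining_block: str

def split_live_ins(live_ins, source_block):
    """Partition the sink block's live-ins by where they are defined.

    A live-in defined in the source block itself can be used directly
    in the sink block; every other live-in still needs to be threaded
    through branch operands and block arguments."""
    direct = [v for v in live_ins if v.defining_block == source_block]
    forwarded = [v for v in live_ins if v.defining_block != source_block]
    return direct, forwarded
```

In the relu example, %1 is defined in bb2 (the source block), so it would land in the direct set for sink bb4.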

  1. The canonicalized IR using the new canonicalize-live-in pass looks like,
module {
  func.func @_Z6kernelPiS_(%arg0: !llvm.ptr {llvm.nocapture, llvm.readonly}, %arg1: !llvm.ptr {llvm.nocapture}) -> !llvm.void attributes {CConv = #llvm.cconv<ccc>, accelerator = "neura", frame_pointer = #llvm.framePointerKind<none>, linkage = #llvm.linkage<external>, no_infs_fp_math = false, no_nans_fp_math = false, no_signed_zeros_fp_math = false, no_unwind, passthrough = ["nofree", "norecurse", ["uwtable", "2"], ["correctly-rounded-divide-sqrt-fp-math", "false"], ["disable-tail-calls", "false"], ["less-precise-fpmad", "false"], ["min-legal-vector-width", "0"], ["no-jump-tables", "false"], ["no-trapping-math", "false"], ["stack-protector-buffer-size", "8"], ["target-cpu", "x86-64"], ["use-soft-float", "false"]], target_cpu = "x86-64", target_features = #llvm.target_features<["+cx8", "+fxsr", "+mmx", "+sse", "+sse2", "+x87"]>, unnamed_addr = 1 : i64, unsafe_fp_math = false, visibility_ = 0 : i64} {
    %0 = "neura.constant"() <{value = 0 : i64}> : () -> i64
    neura.br %0 : i64 to ^bb2
  ^bb1:  // pred: ^bb4
    "neura.return"() : () -> ()
  ^bb2(%1: i64):  // 2 preds: ^bb0, ^bb4
    %2 = "neura.gep"(%1) <{operandSegmentSizes = array<i32: 0, 1>}> {lhs_value = "%arg0"} : (i64) -> !llvm.ptr
    %3 = "neura.load"(%2) : (!llvm.ptr) -> i32
    %4 = "neura.icmp"(%3) <{cmpType = "sgt"}> {rhs_value = 0 : i32} : (i32) -> i1
    neura.cond_br %4 : i1 then %1, %3 : i64, i32 to ^bb3 else to ^bb4
  ^bb3(%5: i64, %6: i32):  // pred: ^bb2
    %7 = "neura.gep"(%5) <{operandSegmentSizes = array<i32: 0, 1>}> {lhs_value = "%arg1"} : (i64) -> !llvm.ptr
    %8 = "neura.load"(%7) : (!llvm.ptr) -> i32
    %9 = "neura.add"(%8, %6) : (i32, i32) -> i32
    "neura.store"(%9, %7) : (i32, !llvm.ptr) -> ()
    neura.br to ^bb4
  ^bb4:  // 2 preds: ^bb2, ^bb3
    %10 = "neura.add"(%1) {rhs_value = 1 : i64} : (i64) -> i64
    %11 = "neura.icmp"(%10) <{cmpType = "eq"}> {rhs_value = 32 : i64} : (i64) -> i1
    neura.cond_br %11 : i1 then to ^bb1 else %10 : i64 to ^bb2
  }
}

%1 in bb4 meets the rule described above, so we use it directly in bb4 instead of passing it through branches & block args.

  2. The canonicalized IR using the old canonicalize-live-in pass looks like,
module {
  func.func @_Z6kernelPiS_(%arg0: !llvm.ptr {llvm.nocapture, llvm.readonly}, %arg1: !llvm.ptr {llvm.nocapture}) -> !llvm.void attributes {CConv = #llvm.cconv<ccc>, accelerator = "neura", frame_pointer = #llvm.framePointerKind<none>, linkage = #llvm.linkage<external>, no_infs_fp_math = false, no_nans_fp_math = false, no_signed_zeros_fp_math = false, no_unwind, passthrough = ["nofree", "norecurse", ["uwtable", "2"], ["correctly-rounded-divide-sqrt-fp-math", "false"], ["disable-tail-calls", "false"], ["less-precise-fpmad", "false"], ["min-legal-vector-width", "0"], ["no-jump-tables", "false"], ["no-trapping-math", "false"], ["stack-protector-buffer-size", "8"], ["target-cpu", "x86-64"], ["use-soft-float", "false"]], target_cpu = "x86-64", target_features = #llvm.target_features<["+cx8", "+fxsr", "+mmx", "+sse", "+sse2", "+x87"]>, unnamed_addr = 1 : i64, unsafe_fp_math = false, visibility_ = 0 : i64} {
    %0 = "neura.constant"() <{value = 0 : i64}> : () -> i64
    neura.br %0 : i64 to ^bb2
  ^bb1:  // pred: ^bb4
    "neura.return"() : () -> ()
  ^bb2(%1: i64):  // 2 preds: ^bb0, ^bb4
    %2 = "neura.gep"(%1) <{operandSegmentSizes = array<i32: 0, 1>}> {lhs_value = "%arg0"} : (i64) -> !llvm.ptr
    %3 = "neura.load"(%2) : (!llvm.ptr) -> i32
    %4 = "neura.icmp"(%3) <{cmpType = "sgt"}> {rhs_value = 0 : i32} : (i32) -> i1
    neura.cond_br %4 : i1 then %1, %3 : i64, i32 to ^bb3 else %1 : i64 to ^bb4
  ^bb3(%5: i64, %6: i32):  // pred: ^bb2
    %7 = "neura.gep"(%5) <{operandSegmentSizes = array<i32: 0, 1>}> {lhs_value = "%arg1"} : (i64) -> !llvm.ptr
    %8 = "neura.load"(%7) : (!llvm.ptr) -> i32
    %9 = "neura.add"(%8, %6) : (i32, i32) -> i32
    "neura.store"(%9, %7) : (i32, !llvm.ptr) -> ()
    neura.br %5 : i64 to ^bb4
  ^bb4(%10: i64):  // 2 preds: ^bb2, ^bb3
    %11 = "neura.add"(%10) {rhs_value = 1 : i64} : (i64) -> i64
    %12 = "neura.icmp"(%11) <{cmpType = "eq"}> {rhs_value = 32 : i64} : (i64) -> i1
    neura.cond_br %12 : i1 then to ^bb1 else %11 : i64 to ^bb2
  }
}

Here we need to pass %10 through branches and block arguments.
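The rewrite that the new pass performs on such a sink block can be sketched roughly as follows. This is a hypothetical model (the dict-based representation and `drop_direct_block_arg` name are assumptions for illustration, not the pass's API): the sink block's argument is removed, the matching branch operands in every predecessor are dropped, and uses are redirected to the value defined in the source block.

```python
def drop_direct_block_arg(block_args, branch_operands, uses, arg, source_value):
    """block_args: list of the sink block's argument names.
    branch_operands: dict predecessor -> operands passed to the sink block.
    uses: dict value name -> list of use sites (modeled here as strings).

    Rewrites uses of `arg` to `source_value` and removes the threading."""
    idx = block_args.index(arg)
    block_args.pop(idx)                  # ^bb4(%10: i64) becomes ^bb4
    for pred in branch_operands:
        branch_operands[pred].pop(idx)   # drop the %1 / %5 branch operands
    # redirect every use of the old block argument to the source value
    uses[source_value] = uses.get(source_value, []) + uses.pop(arg, [])
    return block_args, branch_operands, uses
```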

Q2. Difference from Proposal in #159

In #159 (comment), we use simple dominance and post-dominance relationships to identify the source and sink blocks:

  • Step 1 -- Given a live-in (e.g., variable x) of BB_b, identify the dominator BB_a.
  • Step 2 -- Make sure BB_b is post-dominator of BB_a as well.
[Screenshot: example CFG for the dominance / post-dominance check]

This 2-step identification is incomplete for this pass, because Steps 1 & 2 also match block A -> block B in this figure, which is not our target.

We want to identify relationships like B <-> E, B <-> F, etc., so we add another constraint beyond the dominance check: the two blocks must cross a cond_br and merge in the same block.
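The combined check can be sketched as a small, runnable model. Everything here is an illustrative assumption, not the pass's implementation: the CFG is a plain dict of successor lists, `dominators` is a naive iterative solver, and post-dominators are computed as dominators of the reversed CFG.

```python
def dominators(cfg, entry):
    """Naive iterative dominator computation; cfg maps block -> successors."""
    preds = {b: [] for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].append(b)
    dom = {b: set(cfg) for b in cfg}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            if b == entry or not preds[b]:
                continue
            new = {b} | set.intersection(*(dom[p] for p in preds[b]))
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom

def is_single_source_single_sink(cfg, entry, exit_block, source, sink):
    dom = dominators(cfg, entry)
    rcfg = {b: [] for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            rcfg[s].append(b)
    pdom = dominators(rcfg, exit_block)  # post-dominators via reversed CFG
    if source not in dom[sink]:          # Step 1: source dominates sink
        return False
    if sink not in pdom[source]:         # Step 2: sink post-dominates source
        return False
    succs = cfg[source]
    if len(succs) < 2:                   # extra constraint: crosses a cond_br
        return False
    # ...and every path leaving the cond_br merges back exactly at the sink
    return all(s == sink or cfg[s] == [sink] for s in succs)
```

On the relu CFG, (bb2, bb4) passes all three checks, while a plain br edge like bb0 -> bb2 is rejected by the cond_br constraint even though the dominance conditions hold.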

@tancheng
Contributor

We want to identify relationships like B <-> E, B <-> F, etc. So we add another constraint beyond the dominance check --- the two blocks must cross a cond_br and merge in the same block.

A -> B (i.e., unconditional_br) could also benefit from it, right? We ignore such case to decrease the search space? or there would be problem if we include such pattern?

@ShangkunLi
Collaborator Author

ShangkunLi commented Oct 29, 2025

A -> B (i.e., unconditional_br) could also benefit from it, right? We ignore such case to decrease the search space? or there would be problem if we include such pattern?

I think for A->B, they need to be handled in this step, which is orthogonal to this PR. For this PR, we only care about BBs crossing a cond_br.

  • For each live-in (e.g., y), if its use/consumer is already dominated by another live-in (e.g., x), and x and y both come from the same dominator, we can keep y inside the direct_data_flow_live_in set.

    • e.g., %6 in ^bb3(%5: i64, %6: i32).
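This future rule can be sketched loosely as follows. The sketch is a simplification under an assumption: it reduces the "consumer already dominated by another live-in" condition to the two live-ins sharing the same defining (dominator) block, and the `extend_direct_live_ins` name is invented for illustration.

```python
def extend_direct_live_ins(live_in_defs, direct):
    """live_in_defs: dict mapping a live-in name to its defining block.
    direct: set of names already classified as direct-data-flow live-ins.

    A live-in y joins the set when some already-direct live-in x comes
    from the same dominating block."""
    direct_blocks = {live_in_defs[n] for n in direct}
    for name, block in live_in_defs.items():
        if name not in direct and block in direct_blocks:
            direct.add(name)   # e.g. %6 rides along with %5 from ^bb2
    return direct
```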

@tancheng
Contributor


Okay, I feel it is quite trivial: it is like A dominates B via br (rather than cond_br), then B's live_in from A can be viewed as direct_dataflow_live_in.

@ShangkunLi
Collaborator Author

Okay, I feel it is quite trivial: it is like A dominates B via br (rather than cond_br), then B's live_in from A can be viewed as direct_dataflow_live_in.

Yes, it's quite simple to address, but I just do not want to introduce too many patterns in this PR.

I added a TODO in the latest commit as a reminder.


Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

[P0] Overly conservative --canonicalize-live-in increases RecII by propagating read-only variables
[P0] Reduce II in ReLU kernel Mapping

2 participants