
Add Single-Source-Single-Sink Patterns in canonicalize-live-in Pass #180

Merged
tancheng merged 6 commits into coredac:main from ShangkunLi:live-in-patterns
Oct 29, 2025

Conversation

@ShangkunLi
Collaborator

In this PR, we add a new pattern to the canonicalize-live-in pass that identifies the structures shown below and avoids creating block arguments and branch operands for live-ins in the sink block.

The patterns are like:

[Screenshot: the targeted single-source-single-sink patterns]

@ShangkunLi ShangkunLi marked this pull request as ready for review October 28, 2025 15:03
@ShangkunLi ShangkunLi requested a review from tancheng October 28, 2025 15:03
@ShangkunLi ShangkunLi self-assigned this Oct 28, 2025
@ShangkunLi ShangkunLi added the enhancement New feature or request label Oct 28, 2025
@tancheng
Contributor

Can you please elaborate, via your example, on "avoid creating block arguments and branch operands for live-ins in the sink block"? Some args are still needed, but which ones are skipped?

Also, please mention how this PR differs from the suggestion I gave in #159 (comment).

@ShangkunLi
Collaborator Author

Can you please elaborate, via your example, on "avoid creating block arguments and branch operands for live-ins in the sink block"? Some args are still needed, but which ones are skipped?

Also, please mention how this PR differs from the suggestion I gave in #159 (comment).

Q1. Elaborate on the pass via an example.

Sure~ Take relu as an example. The CFG structure of this kernel is

[Screenshot: CFG of the relu kernel]

We identify the live-ins in bb4 and check if they are defined in bb2. If yes, we directly use these defined values in bb4 and avoid passing them through branches & block arguments.
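The check described above can be sketched in Python. This is only an illustrative model, not the pass's actual code; the `Value` class and `split_live_ins` name are made up for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Value:
    name: str
    defining_block: str

def split_live_ins(live_ins, source_block):
    """Partition the sink block's live-ins by where they are defined.

    A live-in defined in the source block itself can be used directly
    in the sink block; every other live-in still needs to be threaded
    through branch operands and block arguments."""
    direct = [v for v in live_ins if v.defining_block == source_block]
    forwarded = [v for v in live_ins if v.defining_block != source_block]
    return direct, forwarded
```

In the relu example, %1 is defined in bb2 (the source block), so it would land in the direct set for sink bb4.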

  1. The canonicalized IR using the new canonicalize-live-in pass looks like,
module {
  func.func @_Z6kernelPiS_(%arg0: !llvm.ptr {llvm.nocapture, llvm.readonly}, %arg1: !llvm.ptr {llvm.nocapture}) -> !llvm.void attributes {CConv = #llvm.cconv<ccc>, accelerator = "neura", frame_pointer = #llvm.framePointerKind<none>, linkage = #llvm.linkage<external>, no_infs_fp_math = false, no_nans_fp_math = false, no_signed_zeros_fp_math = false, no_unwind, passthrough = ["nofree", "norecurse", ["uwtable", "2"], ["correctly-rounded-divide-sqrt-fp-math", "false"], ["disable-tail-calls", "false"], ["less-precise-fpmad", "false"], ["min-legal-vector-width", "0"], ["no-jump-tables", "false"], ["no-trapping-math", "false"], ["stack-protector-buffer-size", "8"], ["target-cpu", "x86-64"], ["use-soft-float", "false"]], target_cpu = "x86-64", target_features = #llvm.target_features<["+cx8", "+fxsr", "+mmx", "+sse", "+sse2", "+x87"]>, unnamed_addr = 1 : i64, unsafe_fp_math = false, visibility_ = 0 : i64} {
    %0 = "neura.constant"() <{value = 0 : i64}> : () -> i64
    neura.br %0 : i64 to ^bb2
  ^bb1:  // pred: ^bb4
    "neura.return"() : () -> ()
  ^bb2(%1: i64):  // 2 preds: ^bb0, ^bb4
    %2 = "neura.gep"(%1) <{operandSegmentSizes = array<i32: 0, 1>}> {lhs_value = "%arg0"} : (i64) -> !llvm.ptr
    %3 = "neura.load"(%2) : (!llvm.ptr) -> i32
    %4 = "neura.icmp"(%3) <{cmpType = "sgt"}> {rhs_value = 0 : i32} : (i32) -> i1
    neura.cond_br %4 : i1 then %1, %3 : i64, i32 to ^bb3 else to ^bb4
  ^bb3(%5: i64, %6: i32):  // pred: ^bb2
    %7 = "neura.gep"(%5) <{operandSegmentSizes = array<i32: 0, 1>}> {lhs_value = "%arg1"} : (i64) -> !llvm.ptr
    %8 = "neura.load"(%7) : (!llvm.ptr) -> i32
    %9 = "neura.add"(%8, %6) : (i32, i32) -> i32
    "neura.store"(%9, %7) : (i32, !llvm.ptr) -> ()
    neura.br to ^bb4
  ^bb4:  // 2 preds: ^bb2, ^bb3
    %10 = "neura.add"(%1) {rhs_value = 1 : i64} : (i64) -> i64
    %11 = "neura.icmp"(%10) <{cmpType = "eq"}> {rhs_value = 32 : i64} : (i64) -> i1
    neura.cond_br %11 : i1 then to ^bb1 else %10 : i64 to ^bb2
  }
}

%1 in bb4 meets the rule described above, so we use it directly in bb4 instead of passing it through branches & block args.

  2. The canonicalized IR using the old canonicalize-live-in pass looks like,
module {
  func.func @_Z6kernelPiS_(%arg0: !llvm.ptr {llvm.nocapture, llvm.readonly}, %arg1: !llvm.ptr {llvm.nocapture}) -> !llvm.void attributes {CConv = #llvm.cconv<ccc>, accelerator = "neura", frame_pointer = #llvm.framePointerKind<none>, linkage = #llvm.linkage<external>, no_infs_fp_math = false, no_nans_fp_math = false, no_signed_zeros_fp_math = false, no_unwind, passthrough = ["nofree", "norecurse", ["uwtable", "2"], ["correctly-rounded-divide-sqrt-fp-math", "false"], ["disable-tail-calls", "false"], ["less-precise-fpmad", "false"], ["min-legal-vector-width", "0"], ["no-jump-tables", "false"], ["no-trapping-math", "false"], ["stack-protector-buffer-size", "8"], ["target-cpu", "x86-64"], ["use-soft-float", "false"]], target_cpu = "x86-64", target_features = #llvm.target_features<["+cx8", "+fxsr", "+mmx", "+sse", "+sse2", "+x87"]>, unnamed_addr = 1 : i64, unsafe_fp_math = false, visibility_ = 0 : i64} {
    %0 = "neura.constant"() <{value = 0 : i64}> : () -> i64
    neura.br %0 : i64 to ^bb2
  ^bb1:  // pred: ^bb4
    "neura.return"() : () -> ()
  ^bb2(%1: i64):  // 2 preds: ^bb0, ^bb4
    %2 = "neura.gep"(%1) <{operandSegmentSizes = array<i32: 0, 1>}> {lhs_value = "%arg0"} : (i64) -> !llvm.ptr
    %3 = "neura.load"(%2) : (!llvm.ptr) -> i32
    %4 = "neura.icmp"(%3) <{cmpType = "sgt"}> {rhs_value = 0 : i32} : (i32) -> i1
    neura.cond_br %4 : i1 then %1, %3 : i64, i32 to ^bb3 else %1 : i64 to ^bb4
  ^bb3(%5: i64, %6: i32):  // pred: ^bb2
    %7 = "neura.gep"(%5) <{operandSegmentSizes = array<i32: 0, 1>}> {lhs_value = "%arg1"} : (i64) -> !llvm.ptr
    %8 = "neura.load"(%7) : (!llvm.ptr) -> i32
    %9 = "neura.add"(%8, %6) : (i32, i32) -> i32
    "neura.store"(%9, %7) : (i32, !llvm.ptr) -> ()
    neura.br %5 : i64 to ^bb4
  ^bb4(%10: i64):  // 2 preds: ^bb2, ^bb3
    %11 = "neura.add"(%10) {rhs_value = 1 : i64} : (i64) -> i64
    %12 = "neura.icmp"(%11) <{cmpType = "eq"}> {rhs_value = 32 : i64} : (i64) -> i1
    neura.cond_br %12 : i1 then to ^bb1 else %11 : i64 to ^bb2
  }
}

Here we need to pass %10 through branches and block arguments.
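The rewrite that the new pass performs on such a sink block can be sketched roughly as follows. This is a hypothetical model (the dict-based representation and `drop_direct_block_arg` name are assumptions for illustration, not the pass's API): the sink block's argument is removed, the matching branch operands in every predecessor are dropped, and uses are redirected to the value defined in the source block.

```python
def drop_direct_block_arg(block_args, branch_operands, uses, arg, source_value):
    """block_args: list of the sink block's argument names.
    branch_operands: dict predecessor -> operands passed to the sink block.
    uses: dict value name -> list of use sites (modeled here as strings).

    Rewrites uses of `arg` to `source_value` and removes the threading."""
    idx = block_args.index(arg)
    block_args.pop(idx)                  # ^bb4(%10: i64) becomes ^bb4
    for pred in branch_operands:
        branch_operands[pred].pop(idx)   # drop the %1 / %5 branch operands
    # redirect every use of the old block argument to the source value
    uses[source_value] = uses.get(source_value, []) + uses.pop(arg, [])
    return block_args, branch_operands, uses
```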

Q2. Difference from Proposal in #159

In #159 (comment), we use simple dominance and post-dominance relationships to identify the source and sink blocks:

  • Step 1 -- Given a live-in (e.g., variable x) of BB_b, identify the dominator BB_a.
  • Step 2 -- Make sure BB_b is post-dominator of BB_a as well.
[Screenshot: example CFG for the dominance / post-dominance check]

This 2-step identification is incomplete for this pass, because Steps 1 & 2 also match block A -> block B in this figure, which is not our target.

We want to identify relationships like B <-> E, B <-> F, etc., so we add another constraint beyond the dominance check: the two blocks must cross a cond_br and merge in the same block.
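The combined check can be sketched as a small, runnable model. Everything here is an illustrative assumption, not the pass's implementation: the CFG is a plain dict of successor lists, `dominators` is a naive iterative solver, and post-dominators are computed as dominators of the reversed CFG.

```python
def dominators(cfg, entry):
    """Naive iterative dominator computation; cfg maps block -> successors."""
    preds = {b: [] for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].append(b)
    dom = {b: set(cfg) for b in cfg}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            if b == entry or not preds[b]:
                continue
            new = {b} | set.intersection(*(dom[p] for p in preds[b]))
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom

def is_single_source_single_sink(cfg, entry, exit_block, source, sink):
    dom = dominators(cfg, entry)
    rcfg = {b: [] for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            rcfg[s].append(b)
    pdom = dominators(rcfg, exit_block)  # post-dominators via reversed CFG
    if source not in dom[sink]:          # Step 1: source dominates sink
        return False
    if sink not in pdom[source]:         # Step 2: sink post-dominates source
        return False
    succs = cfg[source]
    if len(succs) < 2:                   # extra constraint: crosses a cond_br
        return False
    # ...and every path leaving the cond_br merges back exactly at the sink
    return all(s == sink or cfg[s] == [sink] for s in succs)
```

On the relu CFG, (bb2, bb4) passes all three checks, while a plain br edge like bb0 -> bb2 is rejected by the cond_br constraint even though the dominance conditions hold.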

@tancheng
Contributor

We want to identify relationships like B <-> E, B <-> F, etc. So we add another constraint beyond the dominance check --- the two blocks must cross a cond_br and merge in the same block.

A -> B (i.e., unconditional_br) could also benefit from it, right? We ignore such case to decrease the search space? or there would be problem if we include such pattern?

@ShangkunLi
Collaborator Author

ShangkunLi commented Oct 29, 2025

A -> B (i.e., unconditional_br) could also benefit from it, right? We ignore such case to decrease the search space? or there would be problem if we include such pattern?

I think for A->B, they need to be handled in this step, which is orthogonal to this PR. For this PR, we only care about BBs crossing a cond_br.

  • For each live-in (e.g., y), if its use/consumer is already dominated by another live-in (e.g., x), and x and y both come from the same dominator, we can keep y inside the direct_data_flow_live_in set.

    • e.g., %6 in ^bb3(%5: i64, %6: i32).
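This future rule can be sketched loosely as follows. The sketch is a simplification under an assumption: it reduces the "consumer already dominated by another live-in" condition to the two live-ins sharing the same defining (dominator) block, and the `extend_direct_live_ins` name is invented for illustration.

```python
def extend_direct_live_ins(live_in_defs, direct):
    """live_in_defs: dict mapping a live-in name to its defining block.
    direct: set of names already classified as direct-data-flow live-ins.

    A live-in y joins the set when some already-direct live-in x comes
    from the same dominating block."""
    direct_blocks = {live_in_defs[n] for n in direct}
    for name, block in live_in_defs.items():
        if name not in direct and block in direct_blocks:
            direct.add(name)   # e.g. %6 rides along with %5 from ^bb2
    return direct
```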

@tancheng
Contributor


Okay, I feel it is quite trivial: it is like A dominates B via br (rather than cond_br), then B's live_in from A can be viewed as direct_dataflow_live_in.

@ShangkunLi
Collaborator Author

Okay, I feel it is quite trivial: it is like A dominates B via br (rather than cond_br), then B's live_in from A can be viewed as direct_dataflow_live_in.

Yes, it's quite simple to address, but I just do not want to introduce too many patterns in this PR.

I added a TODO in the latest commit as a reminder.


Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

[P0] Overly conservative --canonicalize-live-in increases RecII by propagating read-only variables
[P0] Reduce II in ReLU kernel Mapping

2 participants