JIT: implement tail merging #77103

AndyAyersMS · 2022-10-16T21:46:35Z

Add a phase that looks for common tail statements in a block's predecessors and merges them.

Run it both before and after morph.

Closes #8795.
Closes #76872.
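For readers new to the transformation, here is a minimal sketch of the idea on a toy IR. It assumes every predecessor is distinct and jumps unconditionally to the same successor, and it only handles the case where all predecessors share the tail. `ToyBlock` and `MergeCommonTails` are hypothetical names, not the JIT's types, and statements are compared as plain strings rather than with `GenTree::Compare`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy IR: a block is just an ordered list of statement strings.
// None of these names come from the JIT sources.
struct ToyBlock
{
    std::vector<std::string> stmts;
};

// Moves statements shared by the tails of *all* predecessors to the head of
// `succ`. Assumes the predecessors are distinct blocks, each jumps
// unconditionally to `succ`, and `succ` has no other predecessors.
// Returns the number of statements moved.
inline std::size_t MergeCommonTails(const std::vector<ToyBlock*>& preds, ToyBlock& succ)
{
    std::size_t moved = 0;
    if (preds.size() < 2)
    {
        return moved; // nothing to share
    }

    while (true)
    {
        // Every predecessor must still have a statement, and each last
        // statement must literally match the first predecessor's.
        for (const ToyBlock* p : preds)
        {
            if (p->stmts.empty() || (p->stmts.back() != preds[0]->stmts.back()))
            {
                return moved;
            }
        }

        // Keep one copy at the front of the successor, drop it everywhere else.
        succ.stmts.insert(succ.stmts.begin(), preds[0]->stmts.back());
        for (ToyBlock* p : preds)
        {
            p->stmts.pop_back();
        }
        ++moved;
    }
}
```

With predecessors `{x = 1; y = x + 2; call Log()}` and `{x = 5; y = x + 2; call Log()}`, the two shared statements move to the head of the successor and each predecessor keeps only its distinct first statement.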


ghost · 2022-10-16T21:46:59Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	AndyAyersMS
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

AndyAyersMS · 2022-10-16T21:50:41Z

cc @dotnet/jit-contrib

Depends critically on GenTree::Compare actually being correct. Let's see.

Decent code size improvements. Improvement:Regression byte ratio is 20:1 or better. Haven't looked into regressions that deeply but the few I looked at were LSRA related.

Local TP runs show no cost or even a net improvement. We might be able to raise the merge limit a bit more. There are still size wins to be had in ASP.NET (a no-limit x64 Windows run gets a -127K size improvement) -- evidently a lot of async state machines have big switches with lots of common tails in the switch cases.

Unlimited merging has a ~0.4% TP cost in libraries tests.

AndyAyersMS · 2022-10-16T22:17:59Z

Looks like I need to update the block flags.

  BBF_HAS_NULLCHECK is not set on BB28 but is required because of the following tree 
  N002 (  4,  3) [000565] ---X-------                         *  NULLCHECK byte  
  N001 (  3,  2) [000564] -----------                         \--*  LCL_VAR   ref    V09 loc8         u:1 (last use)

EgorBo · 2022-10-16T22:40:02Z

Nice! I assume if blockA and blockB are semantically the same but blockB has e.g. COMMA(NOP, tree) or even a NOP statement, it won't be taken into account?

AndyAyersMS · 2022-10-17T00:21:10Z

> Nice! I assume if blockA and blockB are semantically the same but blockB has e.g. COMMA(NOP, tree) or even a NOP statement, it won't be taken into account?

Right, it is doing very literal matching right now (modulo allowing swaps). Would not be hard to make it a bit smarter and allow a bigger range of things to match.

I am more worried about the cases where Compare returns true but the trees are actually different.

Skip past GT_NOP, no point considering those for merging. Fix logic error when finding cross jump victim -- need to assess the first block in the loop.

AndyAyersMS · 2022-10-17T11:05:02Z

NAOT failure seems like it could be #76801.

EgorBo · 2022-10-17T11:07:58Z

Wow, nice diffs 🙂

EgorBo · 2022-10-17T11:12:00Z

Perhaps worth running some outerloop?

AndyAyersMS · 2022-10-17T15:56:28Z

Seems like this is in reasonably good shape.

> Perhaps worth running some outerloop?

Sure.

AndyAyersMS · 2022-10-17T15:59:00Z

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

azure-pipelines · 2022-10-17T15:59:28Z

Azure Pipelines successfully started running 3 pipeline(s).

SingleAccretion · 2022-10-17T16:08:05Z

src/coreclr/jit/fgopt.cpp

+                if (GenTree::Compare(baseStmt->GetRootNode(), otherStmt->GetRootNode(), true))
+                {
+                    matchedPredInfo.Push(predInfo.TopRef(j));
+                }


GenTree::Compare does not pay attention to GTF_IND_VOLATILE, so presumably this can do something like below, which does not seem right.

    [ p1 ]                                 [ p1 ]
    [ind<volatile>(addr)]                    |
        |                                    |
        |        [ p2 ]          -->         |        [ p2 ]
        |        [ind(addr)]                 |          |
        |          |                         |          |
        |          |                       [ ind(addr) ]
        [ block ]                          [ block ]

AndyAyersMS · 2022-10-17

Thanks.

Was actually expecting to quickly hit more bugs in GenTree::Compare, but so far haven't run across any.
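To make the concern concrete, here is a minimal sketch of a structural tree comparison that also checks indirection flags. It is a toy, not `GenTree::Compare`: `ToyNode`, `ToyCompare`, `kToyVolatile`, and `kToyIndFlags` are hypothetical stand-ins, and treating only `Add` as commutative for swaps is an assumption made for illustration.

```cpp
#include <cassert>

// Hypothetical toy IR; none of these names come from the JIT sources.
enum class ToyOper { LclVar, Ind, Add };

constexpr unsigned kToyVolatile = 0x1;          // stand-in for a flag like GTF_IND_VOLATILE
constexpr unsigned kToyIndFlags = kToyVolatile; // stand-in for a mask like GTF_IND_FLAGS

struct ToyNode
{
    ToyOper        oper;
    unsigned       flags;
    int            lclNum; // only meaningful for LclVar
    const ToyNode* op1;
    const ToyNode* op2;
};

// Returns true if the two trees are structurally equal. Indirections must
// also agree on their indirection flags, so a volatile load never matches
// a non-volatile one. If swapOK is set, commutative operands may match in
// swapped order.
inline bool ToyCompare(const ToyNode* a, const ToyNode* b, bool swapOK)
{
    if (a == nullptr || b == nullptr)
    {
        return a == b; // equal only if both are absent
    }
    if (a->oper != b->oper)
    {
        return false;
    }
    if (a->oper == ToyOper::Ind && ((a->flags & kToyIndFlags) != (b->flags & kToyIndFlags)))
    {
        return false; // e.g. ind<volatile>(addr) vs ind(addr)
    }
    if (a->oper == ToyOper::LclVar)
    {
        return a->lclNum == b->lclNum;
    }
    if (ToyCompare(a->op1, b->op1, swapOK) && ToyCompare(a->op2, b->op2, swapOK))
    {
        return true;
    }
    // Assumption: only Add is treated as commutative in this toy.
    return swapOK && (a->oper == ToyOper::Add) &&
           ToyCompare(a->op1, b->op2, swapOK) && ToyCompare(a->op2, b->op1, swapOK);
}
```

Without the flag check, the volatile and non-volatile loads in the diagram would compare equal and could be merged into a single non-volatile load.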

jakobbotsch · 2022-10-17T17:58:53Z

src/tests/JIT/Directed/debugging/debuginfo/tests.il

@@ -99,7 +99,7 @@
      // as this is used for the managed-ret-val feature, but the debugger filters out these mappings and does not
      // report them in the ETW event. We should probably change this, those mappings should be useful in any case.
      property int32[] Debug = int32[10]( 0x0 0x6 0xe 0x12 0x1a 0x1c 0x24 0x28 0x2c 0x34 ) 
-      property int32[] Opts = int32[5]( 0x0 0x6 0x12 0x1c 0x2c )
+      property int32[] Opts = int32[4]( 0x0 0x6 0x12 0x1c )


Does this break stepping in the debuggers, or are they able to map the new IP into the shared tail? What about profilers using sampling?

We should think about if we need to add some new debug information for tooling to have a chance to handle this.

AndyAyersMS · 2022-10-18

Yes, we can lose debug info here -- note we can also merge calls, which may create confusing-looking stack traces.

But I don't think we can express this sort of many to one mapping. So not sure what to do about it.

jakobbotsch · 2022-10-17T18:03:30Z

Nice! We should probably audit GenTree::Compare and at least verify that it isn't missing something obvious.

Is this purely a size-decreasing optimization, or do we also expect it to benefit performance?

What is the impact on TP in the contexts where no merging is done?

AndyAyersMS · 2022-10-17T18:18:23Z

> Is this purely a size-decreasing optimization, or do we also expect it to benefit performance?

Perf is tricky to assess. There are some knock-on optimizations this can enable, but absent those, tail merging can actually hurt perf because it might increase the relative density/frequency of taken branches.

> What is the impact on TP in the contexts where no merging is done?

Not sure. Any suggestions on how I could measure that?

AndyAyersMS · 2022-10-17T18:23:32Z

Looks like libraries jitstress and fuzzlyn are hitting an assert:

// Assertion failed '(head->bbJumpDest != top) || (head->bbFlags & BBF_KEEP_BBJ_ALWAYS)' in 'Program:M2(byref):ushort' during 'Find loops' (IL size 72; hash 0x504dda22; FullOpts)
// 
//     File: /Users/runner/work/1/s/src/coreclr/jit/optimizer.cpp Line: 1872
//

jakobbotsch · 2022-10-17T18:24:59Z

> Not sure. Any suggestions on how I could measure that?

I guess the easy way would be to not actually do the transformation after finding an opportunity. The harder, maybe more accurate way, would be to add an assert that triggers in the contexts without the optimization, which should allow you to produce a .mcl file with all those contexts. Then you can pass it via -c argument to superpmi in a pin run.

AndyAyersMS · 2022-10-17T19:12:50Z

@BruceForstall any idea what this assert is guarding against?

runtime/src/coreclr/jit/optimizer.cpp

Lines 1871 to 1872 in 4574ccb

    
    // Cannot enter at the top - should have being caught by redundant jumps
    assert((head->bbJumpDest != top) || (head->bbFlags & BBF_KEEP_BBJ_ALWAYS));

Offhand I don't see the problem if head branches to top, and that's what we have here:

Head   -- BB04
Top    -- BB05
Bottom -- BB07

This won't become a loop anyways.

Previously BB04 and BB07 would branch to BB12. Now they cross-jump to BB05.

AndyAyersMS · 2022-10-17T19:52:27Z

> Nice! I assume if blockA and blockB are semantically the same but blockB has e.g. COMMA(NOP, tree) or even a NOP statement, it won't be taken into account?

> Right, it is doing very literal matching right now (modulo allowing swaps). Would not be hard to make it a bit smarter and allow a bigger range of things to match.

I ended up handling the GT_NOP case (at top level anyways). Didn't make a lot of difference.

Probably I should remove those statements in the pre-screening, but then tail merge phase might have diffs even if it did not do any merging.

Add indir flag checking to `GenTree::Compare`.

AndyAyersMS · 2022-10-18T14:46:35Z

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

azure-pipelines · 2022-10-18T14:47:05Z

Azure Pipelines successfully started running 3 pipeline(s).

AndyAyersMS · 2022-10-19T22:59:45Z

> Not sure. Any suggestions on how I could measure that?

> I guess the easy way would be to not actually do the transformation after finding an opportunity. The harder, maybe more accurate way, would be to add an assert that triggers in the contexts without the optimization, which should allow you to produce a .mcl file with all those contexts. Then you can pass it via -c argument to superpmi in a pin run.

A bit tricky since it runs twice, but I suppose I can try this.

AndyAyersMS · 2022-10-19T23:05:42Z

Jitstress should now be clean (ish...) may retry in a bit.

AndyAyersMS · 2022-10-20T21:20:29Z

Adding the ability to disable even in release.

AndyAyersMS · 2022-10-20T23:40:11Z

> Not sure. Any suggestions on how I could measure that?

> I guess the easy way would be to not actually do the transformation after finding an opportunity. The harder, maybe more accurate way, would be to add an assert that triggers in the contexts without the optimization, which should allow you to produce a .mcl file with all those contexts. Then you can pass it via -c argument to superpmi in a pin run.

> A bit tricky since it runs twice, but I suppose I can try this.

Here is some contextual TP data for the ASP.NET collection:

ASP, all contexts

[16:06:06] Loaded 129381  Jitted 129381  FailedCompile 0 Excluded 0 Missing 0 Diffs 5481

[16:06:06] Total instructions executed by base: 139387630956
[16:06:06] Total instructions executed by diff: 139283948984
[16:06:06] Total instructions executed delta: -103681972 (-0.07% of base)

ASP, no tail merge OPT contexts

[16:08:17] Loaded 55409  Jitted 55409  FailedCompile 0 Excluded 0 Missing 0 Diffs 0

[16:08:17] Total instructions executed by base: 74214970319
[16:08:17] Total instructions executed by diff: 74406798338
[16:08:17] Total instructions executed delta: 191828019 (0.26% of base)

ASP, tail merge OPT contexts

[16:31:10] Loaded 6374  Jitted 6374  FailedCompile 0 Excluded 0 Missing 0 Diffs 5481

[16:31:10] Total instructions executed by base: 43475425380
[16:31:10] Total instructions executed by diff: 43180020551
[16:31:10] Total instructions executed delta: -295404829 (-0.68% of base)

My hunch is that the methods that are tail merge candidates tend to be more complex so despite this kicking in for only 10% or so of methods it is still a net TP improvement.

There is also a subset of ~55K MinOpts methods in here which is not explicitly accounted for.

Also note that in about 10% of the cases where we tail merge we were able to get the same codegen w/o tail merge.
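The deltas and percentages in the logs above follow directly from the base and diff instruction counts. A small sketch of that arithmetic is shown here; `TpDelta` and `ComputeTpDelta` are hypothetical helper names, not part of SuperPMI.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper types; not part of SuperPMI or the JIT.
struct TpDelta
{
    std::int64_t delta;         // diff - base, in instructions executed
    double       percentOfBase; // delta as a percentage of base
};

// Computes the throughput delta between a base and a diff JIT run.
inline TpDelta ComputeTpDelta(std::int64_t baseInstrs, std::int64_t diffInstrs)
{
    const std::int64_t delta = diffInstrs - baseInstrs;
    return {delta, 100.0 * static_cast<double>(delta) / static_cast<double>(baseInstrs)};
}

// Rounds a percentage to two decimal places, as the logs print it.
inline double RoundToHundredths(double value)
{
    return std::round(value * 100.0) / 100.0;
}
```

For example, `ComputeTpDelta(74214970319, 74406798338)` reproduces the "no tail merge OPT contexts" row: a delta of 191828019 instructions, which rounds to 0.26% of base.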

AndyAyersMS · 2022-10-21T03:59:10Z

That 0.26% we pay for not optimizing seems a bit high, let me see if there's some way to trim it down a bit...

AndyAyersMS · 2022-10-21T18:23:30Z

Did a couple of perf tweaks. I had thought GenTree::Compare was going to be the costly bit but profiling is pointing at the initial stage where we find the set of merge candidates. Slimmed that down a bit.

AndyAyersMS · 2022-10-24T15:05:52Z

TP impact is down a little bit more (did not remeasure the splits above). Diffs similar to before.

@dotnet/jit-contrib ping

EgorBo

LGTM! Looking forward to seeing it merged, want to experiment further :)

BruceForstall

Some nice, elegant code.

BruceForstall · 2022-10-26T03:37:49Z

src/coreclr/jit/gentree.cpp

+                    if ((op1->gtFlags & (GTF_IND_FLAGS)) != (op2->gtFlags & (GTF_IND_FLAGS)))
+                    {
+                        return false;
+                    }
+                    FALLTHROUGH;
+
+                case GT_IND:
+                case GT_NULLCHECK:
+                    if ((op1->gtFlags & (GTF_IND_FLAGS)) != (op2->gtFlags & (GTF_IND_FLAGS)))


This code is odd. GT_BLK/GT_OBJ/GT_IND/GT_NULLCHECK are all equivalent; why not share the code? Why does GT_BLK/GT_OBJ have "FALLTHROUGH" to the same code it just executed? Also, (GTF_IND_FLAGS) doesn't need to be parenthesized.
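A sketch of the shape being suggested, with the four cases sharing one body, might look like the following. It assumes these opers need nothing beyond the flag comparison (the diff excerpt does not show what precedes the first check), and `IndirOper`, `kIndirFlagsMask`, and `IndirFlagsMatch` are hypothetical stand-ins for the real opers, `GTF_IND_FLAGS`, and the relevant part of `GenTree::Compare`.

```cpp
#include <cassert>

// Hypothetical stand-ins for GT_BLK, GT_OBJ, GT_IND, GT_NULLCHECK and
// everything else; not the JIT's real enum.
enum class IndirOper { Blk, Obj, Ind, NullCheck, Other };

// Hypothetical stand-in for a mask like GTF_IND_FLAGS.
constexpr unsigned kIndirFlagsMask = 0x3;

// Returns false if two nodes of the same indirection-like oper disagree on
// any flag covered by the mask. Non-indirection opers are not checked here.
inline bool IndirFlagsMatch(IndirOper oper, unsigned flags1, unsigned flags2)
{
    switch (oper)
    {
        // All four opers share one body: no duplicated check, no fallthrough.
        case IndirOper::Blk:
        case IndirOper::Obj:
        case IndirOper::Ind:
        case IndirOper::NullCheck:
            return (flags1 & kIndirFlagsMask) == (flags2 & kIndirFlagsMask);

        default:
            return true;
    }
}
```

Stacking the case labels keeps the check in one place, so a later change to the mask cannot be applied to one oper and missed on another.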

AndyAyersMS · 2022-11-03T16:42:48Z

dotnet/perf-autofiling-issues#9472

AndyAyersMS · 2022-11-03T17:21:15Z

Possibly also: dotnet/perf-autofiling-issues#9468

JIT: implement tail merging

2da39fb

Add a phase that looks for common tail statements in a block's predecessors and merges them. Run it both before and after morph. Closes dotnet#8795. Closes dotnet#76872.

dotnet-issue-labeler bot added the area-CodeGen-coreclr label Oct 16, 2022

ghost assigned AndyAyersMS Oct 16, 2022

update block flags if we move a statement

8dddcb5

This was referenced Oct 17, 2022

System.Text.RegularExpressions.Tests.AttRegexTests test fails in CI #75808

Closed

Assertion failed: (*card_word)==0 in DynamicGenerics tests #76801

Closed

AndyAyersMS added 2 commits October 16, 2022 22:12

Add range enable config setting.

119c260

Skip past GT_NOP, no point considering those for merging. Fix logic error when finding cross jump victim -- need to assess the first block in the loop.

remove assert; fix debug validation test

6e3c219

build-analysis bot mentioned this pull request Oct 17, 2022

Nuget restore error caused by 503s dotnet/arcade#11239

Closed

runfoapp bot mentioned this pull request Oct 17, 2022

Infra improvements for Helix #68176

Closed

AndyAyersMS marked this pull request as ready for review October 17, 2022 15:55

AndyAyersMS mentioned this pull request Oct 17, 2022

JIT: suboptimal block layout #76872

Closed

SingleAccretion reviewed Oct 17, 2022

jakobbotsch reviewed Oct 17, 2022

build-analysis bot mentioned this pull request Oct 17, 2022

Test failure in System.Transactions.Tests.OleTxTests.* tests #76836

Closed

Remove an apparently unnecessary assert from loop recognition.

3c9c1d6

Add indir flag checking to `GenTree::Compare`.

build-analysis bot mentioned this pull request Oct 18, 2022

Tracking issue for CI build timeouts #76454

Closed

add JitEnableTailMerge config setting

cdc4968

AndyAyersMS added 2 commits October 20, 2022 23:42

tweak

d1488f5

tweak

e86d075

EgorBo approved these changes Oct 25, 2022

AndyAyersMS merged commit 9b16818 into dotnet:main Oct 25, 2022

BruceForstall reviewed Oct 26, 2022

dakersnar mentioned this pull request Oct 27, 2022

[Perf] Linux/arm64: 8 Regressions on 10/25/2022 10:16:52 PM #77546

Closed

This was referenced Nov 1, 2022

[Perf] Alpine/x64: 24 Regressions on 10/25/2022 7:09:53 PM dotnet/perf-autofiling-issues#9390

Open

[Perf] Linux/x64: 32 Regressions on 10/25/2022 7:09:53 PM dotnet/perf-autofiling-issues#9302

Closed

tarekgh mentioned this pull request Nov 4, 2022

Regressions in ICU globalization microbenchmarks #77730

Closed

jakobbotsch mentioned this pull request Nov 14, 2022

JIT: Bad codegen - using wrong var for field access #78310

Closed

ghost locked as resolved and limited conversation to collaborators Dec 3, 2022
