Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: implement tail merging #77103

Merged
merged 8 commits into from
Oct 25, 2022
Merged

Conversation

AndyAyersMS
Copy link
Member

Add a phase that looks for common tail statements in a block's predecessors and merges them.

Run it both before and after morph.

Closes #8795.
Closes #76872.

Add a phase that looks for common tail statements in a block's
predecessors and merges them.

Run it both before and after morph.

Closes dotnet#8795.
Closes dotnet#76872.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 16, 2022
@ghost ghost assigned AndyAyersMS Oct 16, 2022
@ghost
Copy link

ghost commented Oct 16, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Add a phase that looks for common tail statements in a block's predecessors and merges them.

Run it both before and after morph.

Closes #8795.
Closes #76872.

Author: AndyAyersMS
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@AndyAyersMS
Copy link
Member Author

cc @dotnet/jit-contrib

Depends critically on GenTree::Compare actually being correct. Let's see.

Decent code size improvements. Improvement:Regression byte ratio is 20:1 or better. Haven't looked into regressions that deeply but the few I looked at were LSRA related.

Local TP runs show no cost or even a net improvement. We might be able to up the merge limit a bit more. There are still size wins to be had in ASP.NET (a no-limit x64 windows gets -127K size improvement) -- evidently a lot of async state machines have big switches with lots of common tails in the switch cases.

Unlimited merging has a ~0.4% TP cost in libraries tests.

@AndyAyersMS
Copy link
Member Author

Looks like I need to update the block flags.

  BBF_HAS_NULLCHECK is not set on BB28 but is required because of the following tree 
  N002 (  4,  3) [000565] ---X-------                         *  NULLCHECK byte  
  N001 (  3,  2) [000564] -----------                         \--*  LCL_VAR   ref    V09 loc8         u:1 (last use)

@EgorBo
Copy link
Member

EgorBo commented Oct 16, 2022

Nice! I assume if blockA and blockB semantically the same but blockB has e.g. COMMA(NOP, tree) or even a NOP-statements - it won't be taken into account?

@AndyAyersMS
Copy link
Member Author

Nice! I assume if blockA and blockB semantically the same but blockB has e.g. COMMA(NOP, tree) or even a NOP-statements - it won't be taken into account?

Right, it is doing very literal matching right now (modulo allowing swaps). Would not be hard to make it a bit smarter and allow a bigger range of things to match.

I am more worried about the cases where Compare returns true but the trees are actually different.

Skip past GT_NOP, no point considering those for merging.

Fix logic error when finding cross jump victim -- need to
assess the first block in the loop.
@AndyAyersMS
Copy link
Member Author

NAOT failure seems like it could be #76801.

@EgorBo
Copy link
Member

EgorBo commented Oct 17, 2022

Wow, nice diffs 🙂

@EgorBo
Copy link
Member

EgorBo commented Oct 17, 2022

Perhaps worth running some outerloop?

@AndyAyersMS AndyAyersMS marked this pull request as ready for review October 17, 2022 15:55
@AndyAyersMS
Copy link
Member Author

Seems like this is in reasonably good shape.

Perhaps worth running some outerloop?

Sure.

@AndyAyersMS
Copy link
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

@azure-pipelines
Copy link

Azure Pipelines successfully started running 3 pipeline(s).

Comment on lines 6727 to 6730
if (GenTree::Compare(baseStmt->GetRootNode(), otherStmt->GetRootNode(), true))
{
matchedPredInfo.Push(predInfo.TopRef(j));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GenTree::Compare does not pay attention to GTF_IND_VOLATILE, so presumably this can do something like below, which does not seem right.

[ p1 ]                  [ p1 ]
[ind<volatile>(addr)]     |
  |                       |
  |  [   p2    ]    -->   |   [ p2 ]
  |  [ind(addr)]          |      |   
  |     |               [ ind(addr) ]
[  block  ]             [   block   ]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

Was actually expecting to quickly hit more bugs in GenTree::Compare, but so far haven't run across any.

@@ -99,7 +99,7 @@
// as this is used for the managed-ret-val feature, but the debugger filters out these mappings and does not
// report them in the ETW event. We should probably change this, those mappings should be useful in any case.
property int32[] Debug = int32[10]( 0x0 0x6 0xe 0x12 0x1a 0x1c 0x24 0x28 0x2c 0x34 )
property int32[] Opts = int32[5]( 0x0 0x6 0x12 0x1c 0x2c )
property int32[] Opts = int32[4]( 0x0 0x6 0x12 0x1c )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this break stepping in the debuggers, or are they able to map the new IP into the shared tail? What about profilers using sampling?

We should think about if we need to add some new debug information for tooling to have a chance to handle this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we can lose debug info here -- note we can also merge calls which may create confusing looking stack traces.

But I don't think we can express this sort of many to one mapping. So not sure what to do about it.

@jakobbotsch
Copy link
Member

Nice! We should probably audit GenTree::Compare and at least verify that it isn't missing something obvious.

Is this purely a size-decreasing optimization, or do we also expect it to benefit performance?

What is the impact on TP in the contexts where no merging is done?

@AndyAyersMS
Copy link
Member Author

Is this purely a size-decreasing optimization, or do we also expect it to benefit performance?

Perf is tricky to assess. There are some knock-on optimizations this can enable, but absent those, tail merging can actually hurt perf because it might increase the relative density/frequency of taken branches.

What is the impact on TP in the contexts where no merging is done?

Not sure. Any suggestions on how I could measure that?

@AndyAyersMS
Copy link
Member Author

Looks like libraries jitstress and fuzzlyn are hitting an assert:

// Assertion failed '(head->bbJumpDest != top) || (head->bbFlags & BBF_KEEP_BBJ_ALWAYS)' in 'Program:M2(byref):ushort' during 'Find loops' (IL size 72; hash 0x504dda22; FullOpts)
// 
//     File: /Users/runner/work/1/s/src/coreclr/jit/optimizer.cpp Line: 1872
// 

@jakobbotsch
Copy link
Member

Not sure. Any suggestions on how I could measure that?

I guess the easy way would be to not actually do the transformation after finding an opportunity. The harder, maybe more accurate way, would be to add an assert that triggers in the contexts without the optimization, which should allow you to produce a .mcl file with all those contexts. Then you can pass it via -c argument to superpmi in a pin run.

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Oct 17, 2022

@BruceForstall any idea what this assert is guarding against?

// Cannot enter at the top - should have being caught by redundant jumps
assert((head->bbJumpDest != top) || (head->bbFlags & BBF_KEEP_BBJ_ALWAYS));

Offhand I don't see the problem if head branches to top, and that's what we have here:

Head   -- BB04
Top    -- BB05
Bottom -- BB07

This won't become a loop anyways.

Previously BB04 and BB07 would branch to BB12. Now they cross-jump to BB05.

image

@AndyAyersMS
Copy link
Member Author

Nice! I assume if blockA and blockB semantically the same but blockB has e.g. COMMA(NOP, tree) or even a NOP-statements - it won't be taken into account?

Right, it is doing very literal matching right now (modulo allowing swaps). Would not be hard to make it a bit smarter and allow a bigger range of things to match.

I ended up handing the GT_NOP case (at top level anyways). Didn't make a lot of difference.

Probably I should remove those statements in the pre-screening, but then tail merge phase might have diffs even if it did not do any merging.

Add indir flag checking to `GenTree::Compare`.
@AndyAyersMS
Copy link
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

@azure-pipelines
Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@AndyAyersMS
Copy link
Member Author

Not sure. Any suggestions on how I could measure that?

I guess the easy way would be to not actually do the transformation after finding an opportunity. The harder, maybe more accurate way, would be to add an assert that triggers in the contexts without the optimization, which should allow you to produce a .mcl file with all those contexts. Then you can pass it via -c argument to superpmi in a pin run.

A bit tricky since it runs twice;, but I suppose I can try this.

@AndyAyersMS
Copy link
Member Author

Jitstress should now be clean (ish...) may retry in a bit.

@AndyAyersMS
Copy link
Member Author

Adding the ability to disable even in release.

@AndyAyersMS
Copy link
Member Author

Not sure. Any suggestions on how I could measure that?

I guess the easy way would be to not actually do the transformation after finding an opportunity. The harder, maybe more accurate way, would be to add an assert that triggers in the contexts without the optimization, which should allow you to produce a .mcl file with all those contexts. Then you can pass it via -c argument to superpmi in a pin run.

A bit tricky since it runs twice;, but I suppose I can try this.

Here is some contextual TP data for the ASP.NET collection:

ASP, all contexts

[16:06:06] Loaded 129381  Jitted 129381  FailedCompile 0 Excluded 0 Missing 0 Diffs 5481

[16:06:06] Total instructions executed by base: 139387630956
[16:06:06] Total instructions executed by diff: 139283948984
[16:06:06] Total instructions executed delta: -103681972 (-0.07% of base)

ASP, no tail merge OPT contexts

[16:08:17] Loaded 55409  Jitted 55409  FailedCompile 0 Excluded 0 Missing 0 Diffs 0

[16:08:17] Total instructions executed by base: 74214970319
[16:08:17] Total instructions executed by diff: 74406798338
[16:08:17] Total instructions executed delta: 191828019 (0.26% of base)

ASP, tail merge OPT contexts

[16:31:10] Loaded 6374  Jitted 6374  FailedCompile 0 Excluded 0 Missing 0 Diffs 5481

[16:31:10] Total instructions executed by base: 43475425380
[16:31:10] Total instructions executed by diff: 43180020551
[16:31:10] Total instructions executed delta: -295404829 (-0.68% of base)

My hunch is that the methods that are tail merge candidates tend to be more complex so despite this kicking in for only 10% or so of methods it is still a net TP improvement.

There is also a subset ~55K min opts methods in here which is not explicitly accounted for.

Also note that in about 10% of the cases where we tail merge we were able to get the same codegen w/o tail merge.

@AndyAyersMS
Copy link
Member Author

That 0.26% we pay for not optimizing seems a bit high, let me see if there's some way to trim it down a bit...

@AndyAyersMS
Copy link
Member Author

Did a couple of perf tweaks. I had thought GenTree::Compare was going to be the costly bit but profiling is pointing at the initial stage where we find the set of merge candidates. Slimmed that down a bit.

@AndyAyersMS
Copy link
Member Author

TP impact is down a little bit more (did not remeasure the splits above). Diffs similar to before.

@dotnet/jit-contrib ping

Copy link
Member

@EgorBo EgorBo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Looking forward to seeing it merged, want to experiment further :)

@AndyAyersMS AndyAyersMS merged commit 9b16818 into dotnet:main Oct 25, 2022
Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nice, elegant code.

Comment on lines +2488 to +2496
if ((op1->gtFlags & (GTF_IND_FLAGS)) != (op2->gtFlags & (GTF_IND_FLAGS)))
{
return false;
}
FALLTHROUGH;

case GT_IND:
case GT_NULLCHECK:
if ((op1->gtFlags & (GTF_IND_FLAGS)) != (op2->gtFlags & (GTF_IND_FLAGS)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is odd. GT_BLK/GT_OBJ/GT_IND/GT_NULLCHECK are all equivalent; why not share the code? Why does GT_BLK/GT_OBJ have "FALLTHROUGH" to the same code it just executed? also, (GTF_IND_FLAGS) doesn't need to be parenthesized.

@AndyAyersMS
Copy link
Member Author

@AndyAyersMS
Copy link
Member Author

Possibly also: dotnet/perf-autofiling-issues#9468

@ghost ghost locked as resolved and limited conversation to collaborators Dec 3, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JIT: suboptimal block layout JIT optimization: tail merge
5 participants