Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: make global morph its own phase, morph blocks in RPO #86822

Closed
wants to merge 3 commits into from

Conversation

AndyAyersMS
Copy link
Member

Some refactoring in antcipation of enabling cross-block local assertion prop during global morph.

Process blocks in RPO. Try and verify that no newly added blocks can alter the set of assertions that flow into the pre-existing ones.

Some refactoring in antcipation of enabling cross-block local assertion prop
during global morph.

Process blocks in RPO. Try and verify that no newly added blocks can alter
the set of assertions that flow into the pre-existing ones.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 26, 2023
@ghost ghost assigned AndyAyersMS May 26, 2023
@ghost
Copy link

ghost commented May 26, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Some refactoring in antcipation of enabling cross-block local assertion prop during global morph.

Process blocks in RPO. Try and verify that no newly added blocks can alter the set of assertions that flow into the pre-existing ones.

Author: AndyAyersMS
Assignees: AndyAyersMS
Labels:

area-CodeGen-coreclr

Milestone: -

@AndyAyersMS
Copy link
Member Author

A bunch of mostly neutral diffs, looks mostly like frame offset churn because temps are created in different orders (which reminds me, we should try and sort locals based on weight or something).

No new functionality actually enabled yet.

@AndyAyersMS
Copy link
Member Author

@jakobbotsch PTAL
cc @dotnet/jit-contrib

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented May 27, 2023

Not surprisingly x86 has some special behavior here I'll have to chase down. The assert I added to try and verify that morph only makes "harmless" edits to flow is firing. I suspect it suffers from both false positives and false negatives.

The main purpose behind the assert is to try and validate that any cross-block AP we might do on the blocks that exist at the start of morph (not yet happening here, but in some future PR) remain valid even if new blocks get introduced and flow is altered during morph.

So far the edits I have seen are the following:

  • Simplify a conditional branch. Main impact is on the (former) target of the branch, it will now have fewer preds. For forward branches (in RPO) this may let us propagate a cleaner set of assertions; for backwards branches we've already processed the block with a too-conservative set, so we're ok but sub-optimal (such is the nature of a one pass flow algorithm).
  • Add a single throw block, and possibly rewire a fallthrough, and remove outbound flow from a now throwing block. In addition to effects similar to the case above, this will introduce a block with no successors (so also doesn't alter forward flow and is fine to visit any time after the initial block set is processed) or an empty block that just jumps to the former fall through (has no impact on forward prop, and no impact from forward prop, so can be safely visited "out of order").
  • Change a tail call to branch back to method entry, possibly also introducing an empty scratch bb. There should be / will be no assertions live into the entry block, so altering the flow into it or nearby should be harmless.
  • For x86 ( and maybe elsewhere) we are invoking fgCreateGCPoll in morph which can introduce control flow. Again it should be harmless to process these blocks out of RPO order but I currently have no way of tracking why new blocks appear and don't want to have to parse their contents to figure out if they are harmless.

Another wrinkle we might consider here is to aggressively prune (or at least mark) unreachable blocks rather than morph them, doing so will also start to trim down the pred sets for blocks and have similar impact as the above. In fact we could do a dynamic evolution of the RPO by turning this into a worklist driven traversal (similar to how the importer runs).

@AndyAyersMS
Copy link
Member Author

TP Impact is currently a bit high, but the hope is we can recoup that by reducing the volume of IR leaving morph., at least when optimizations are enabled, or else justify tje TP hit with better CQ.

But likely we need to iterate in normal block order if we're not optimizing since RPO will offer no benefit.

@AndyAyersMS
Copy link
Member Author

For the time being we can relax or remove the assert, avoid using RPO unless optimized, and look for better ways to track the set of introduced blocks (so we don't have to search the bblist for them later).

@EgorBo
Copy link
Member

EgorBo commented May 27, 2023

The diffs are likely just not representative due to too many of missing contexts (MISSED contexts: base: 313,570 (25.86%), diff: 313,581 (25.86%) I've kicked off a new collection

@AndyAyersMS
Copy link
Member Author

Locally I see fewer missed contexts (or perhaps just am not paying attention).

At any rate not doing RPO unless we're optimizing knocked down the diffs considerably and should remove the minopts TP impact. I also optimized the new block visits for the optimized case and gave up (for now) trying to detect if the flow alterations and possible code motion could pose problems later on.

I am now seeing diffs where there is an order dependence for block morphing. Block morph is sensitive to a local var's DNER state, and this gets set during global morph as needed, so depending on the relative ordering of the block morph and DNER setting we can get different expansions. I don't think it is a correctness issue since we already have this order dependence now.

Comment on lines +13689 to 13692
for (BasicBlock* const block : *fgNewBBs)
{
fgMergeBlockReturn(block);
fgMorphBlock(block);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a significant chunk of throughput cost? What is the improvement of this tracking mechanism alone?

Comment on lines +13662 to +13665
fgRenumberBlocks();
EnsureBasicBlockEpoch();
fgComputeEnterBlocksSet();
fgDfsReversePostorder();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why we need to do so much work to set up for the RPO traversal. SSA's TopologicalSort does not do any of this work; it simply starts at fgFirstBB (well, the BB root, but then the original fgFirstBB very soon after that). Would it be incorrect to do the same here?

Another idea: can we avoid the two step iteration process, and instead have a version of fgDfsReversePostorder that just invokes a callback instead of recording the order into ambient state? Would it make any difference? I'm not totally sure where the throughput cost comes from in this change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why we need to do so much work to set up for the RPO traversa

We can try and do all this on the fly, that's more or less what I was suggesting above:

we could do a dynamic evolution of the RPO by turning this into a worklist driven traversal (similar to how the importer runs).

I'm not totally sure where the throughput cost comes from in this change.

I don't understand it either, seems like this latest version should (aside from building the RPO) be fairly efficient. Will have to take a look.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea: can we avoid the two step iteration process, and instead have a version of fgDfsReversePostorder that just invokes a callback instead of recording the order into ambient state? Would it make any difference? I'm not totally sure where the throughput cost comes from in this change.

I suspect this is tricky for phases like morph that also can alter the flow graph.

@AndyAyersMS
Copy link
Member Author

Note value numbering (via ValueNumberState) does a sort of on the fly RPO to try and maximize forward propagation of VNs. I don't think this approach would be usable in Morph because of the modifications morph can make to the flow graph, and I actually wonder if VN might be better off with a precomputed schedule or callback style processing during a DFS.

@AndyAyersMS
Copy link
Member Author

Via pin analysis it looks like the computation of the RPO is the main culprit for the ~0.25% TP increase.

image

It doesn't seem like it would be substantially cheaper to try and do this on the fly, we still have to enumerate each block's successors which seems like the bulk of the cost here, and which morph currently doesn't need to do.

Presumably assertion state maintenance will add another layer of TP impact, but then the additional assertions could simplify morphing and so recover some of this. Hard to know how much without actually doing the work, but at that point we'd also potentially be seeing CQ wins so could possibly justify a modest TP increase.

@AndyAyersMS
Copy link
Member Author

Not going to happen in .NET 8, so will close for now.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 14, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants