JIT: Flow Graph Modernization and Improved Block Layout #93020

AndyAyersMS · 2023-10-04T17:29:06Z

Overview

The current block layout algorithm in the JIT is based on local permutations of block order. It is complicated and likely far from optimal. We would like to improve the overall block layout algorithm used by the JIT, in particular by adopting a global cost-minimizing approach to layout, for instance one in the style of Young et al.'s Near-optimal intraprocedural branch alignment. Additional complexities arise in our case because of various EH reporting requirements, so the JIT cannot freely reorder all blocks, but we should be able to apply the global ordering techniques within EH regions.

Before we can tackle this problem there are several important (and sizeable) prerequisites, which we can lump together as "flow graph modernization." There are a lot of details here, but at a high level:

Update the flow graph to make fall-through behavior explicit for most phases (until layout). This means disallowing BBJ_NONE and adding explicit fall through successors for BBJ_COND and BBJ_SWITCH. More on this below.
Defer most block reordering work until late in the JIT phase pipeline (ideally, perhaps, waiting until after LSRA has run, so we can properly position any new blocks it introduces).
Leverage the work done in .NET 8 to make flow edges persistent and ensure that those edges accurately describe successor likelihoods. Possibly make successor edge enumeration a first-class concept.

It is not yet clear how much progress we can make during .NET 9. The list of items below is preliminary and subject to change.

Motivation

Past studies have shown that the two phases that benefit most from block-level PGO data are inlining and block layout. In a previous compiler project, the net benefit from PGO was on the order of 15%, with about 12% attributable to inlining, and 2% to improved layout.

The current JIT is likely seeing a much smaller benefit from layout. The goal here is to ensure that we are using accurate PGO data to make informed decisions about the ordering of blocks, with the hope of realizing perhaps a 1 or 2% net benefit across a wide range of applications (with some benefiting much more, and others not at all).

Flow Graph Modernization

Block Layout

Look into running existing layout after LSRA. Note there may be some tricky interactions if, say, reversing a conditional perturbs the allocation for some reason.
Building on the work above, leverage block weights and successor edge likelihoods to build a good initial layout via something like a greedy RPO
- JIT: Implement greedy RPO-based block layout #101473
Implement some sort of layout score describing costs of a layout
(Possibly) find a way of estimating the optimal layout score
(stretch) Implement a scheme to improve layout based on lowering score; one such scheme is described in Near-optimal intraprocedural branch alignment
Consider removing Compiler::fgFindInsertPoint, and similar logic that attempts to maintain reasonable orderings before block layout is run (see comment).
Consider skipping loop compaction and relying on the new layout algorithm to place loop bodies contiguously.
Think about the interaction of layout and hot/cold splitting
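
To make the "layout score" item above concrete, here is a minimal sketch of one possible score: the total weight of edges that do not fall through in a given block order. The `Edge` struct and `LayoutScore` function are hypothetical names for illustration, not JIT code, and the real cost model may well differ (for example, by charging taken and not-taken branches differently).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical edge record; indices refer to blocks, not to JIT data structures.
struct Edge
{
    int    from;
    int    to;
    double likelihood; // probability that 'from' transfers control to 'to'
};

// Scores a layout as the total weight of edges whose target is not placed
// immediately after their source. Lower is better.
// Assumes 'order' is a permutation of 0..n-1 and 'blockWeight' is indexed by block.
double LayoutScore(const std::vector<int>&    order,
                   const std::vector<double>& blockWeight,
                   const std::vector<Edge>&   edges)
{
    // Map each block index to its position in the layout.
    std::vector<std::size_t> position(order.size());
    for (std::size_t i = 0; i < order.size(); i++)
    {
        position[order[i]] = i;
    }

    double score = 0.0;
    for (const Edge& e : edges)
    {
        const bool fallsThrough = (position[e.to] == position[e.from] + 1);
        if (!fallsThrough)
        {
            // Edge weight is derived from the source block weight and the edge likelihood.
            score += blockWeight[e.from] * e.likelihood;
        }
    }
    return score;
}
```

With a block of weight 100 that branches to block 1 with likelihood 0.75 and to block 2 with likelihood 0.25, the order 0, 1, 2 scores 25 and the order 0, 2, 1 scores 75, so the first order is preferred under this measure.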

cc @amanasifkhalid @dotnet/jit-contrib

ghost · 2023-10-04T17:29:11Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Author:	AndyAyersMS
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	9.0.0

Next step for #93020, per conversation on #93772. Replacing BBJ_NONE with BBJ_ALWAYS to the next block helps limit our use of implicit fall-through (though we still expect BBJ_COND to fall through when its false branch is taken; #93772 should eventually address this). I've added a small peephole optimization to skip emitting unconditional branches to the next block during codegen.
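
The peephole described above can be sketched roughly as follows. This is an illustrative model only: `BlockKind`, `Block`, and `NeedsJumpInstruction` are hypothetical stand-ins, not the JIT's `BasicBlock` or its codegen logic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified block model; not the JIT's BasicBlock.
enum class BlockKind { Always, Cond, Return };

struct Block
{
    BlockKind kind;
    int       target; // index of the jump target in the layout, or -1 if none
};

// Returns true if an unconditional jump instruction must be emitted for
// layout[i]. A jump to the lexically next block can be elided, since
// execution reaches that block anyway.
bool NeedsJumpInstruction(const std::vector<Block>& layout, std::size_t i)
{
    const Block& block = layout[i];
    if (block.kind != BlockKind::Always)
    {
        return false; // only unconditional jumps are considered in this sketch
    }
    const bool targetIsNext =
        (i + 1 < layout.size()) && (block.target == static_cast<int>(i + 1));
    return !targetIsNext;
}
```

The point of the design is that the flow graph can state every successor explicitly while the emitted code stays as compact as it was with implicit fall-through.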

jakobbotsch · 2023-12-09T20:45:51Z

As a prerequisite for some of the block reordering work we'll likely need to change loop alignment to be more centralized. We currently identify the initial candidate blocks to place loop alignment instructions in during loop finding and apply some heuristics when computing loop side effects during VN. We will probably need to defer all of these decisions to happen after block reordering. I am also inclined to say that we should just recompute the loops at that point, instead of trying to maintain loop information -- we have a lot of code that works hard to maintain bbNatLoopNum all the way into the backend that we could remove. Since the block reordering is likely to need a DFS as well, the extra TP we'll end up paying is just for the loop identification, which is not that much (tpdiff).

…96265) Part of #93020. Previously, bbFalseTarget was hard-coded to match bbNext in BasicBlock::SetNext. We still require bbFalseTarget to point to the next block for BBJ_COND blocks, but I've removed the logic for updating bbFalseTarget from SetNext, and placed calls to SetFalseTarget wherever bbFalseTarget needs to be updated because the BBJ_COND block has been created or moved relative to its false successor. This helps set us up to start removing logic that enforces the block's false successor is the next block.

…e target (#96431) Next step for #93020. When doing hot/cold splitting, if the first cold block succeeds a BBJ_COND block (meaning the false target is the first cold block), we previously needed to insert a BBJ_ALWAYS block at the end of the hot section to unconditionally jump to the cold section. Since we will need to conditionally generate a jump to the false target depending on its location once bbFalseTarget can diverge from bbNext, this seemed like a nice opportunity to add that logic in, and instead generate a jump to the cold section by checking if a jump is needed to the false target, rather than by appending a BBJ_ALWAYS block to the hot section.

Before finalizing the block layout with optOptimizeLayout, we call fgReorderBlocks in a few optimization passes that modify the flowgraph (though without the intent to actually reorder any blocks, by passing useProfile=false). Removing all of these early calls -- except for the one in optOptimizeFlow, which can probably be replaced by moving fgReorderBlocks's branch optimization logic to fgUpdateFlowGraph -- incurs relatively few diffs, and gets us closer to #93020's goal of deferring block reordering until late in the JIT's optimization phases.

…imization phase (#96609) Next step for #93020. Working backwards through the JIT flowgraph phases, this change allows bbFalseTarget to diverge from bbNext in Compiler::optOptimizeLayout and onwards.

Part of #93020. This change adds back in most of #97191 and #96609, except for any significant changes to the flowgraph optimization passes to reduce churn. With this change, the false target of a BBJ_COND can diverge from the next block until Compiler::optOptimizeLayout, in which we reestablish implicit fall-through with fgConnectFallThrough to preserve the existing block reordering behavior. Note that the deferral of these fall-through fixups causes diffs in the edge weights, which can alter the behavior of fgReorderBlocks, hence some of the size regressions.

Move the full profile check down past the importer. Attempt local repair for cases where the importer alters BBJ_COND. If that is unable to guarantee consistency, mark the PGO data as inconsistent. If the importer alters BBJ_SWITCH don't attempt repair, just mark the profile as inconsistent. If in an OSR method the original method entry is a loop header, and that is not the loop that triggered OSR, mark the profile as inconsistent. If the importer re-imports a LEAVE, there are still orphaned blocks left from the first importation; these can mess up profiles. In that case, mark the profile as inconsistent. Exempt blocks with EH preds (catches, etc) from inbound checking, as profile data propagation along EH edges is not modelled. Modify the post-phase checks to allow either small relative errors or small absolute errors, so that flow out of EH regions through intermediaries (say step blocks) does not trip the checker. Ensure the initial pass of likelihood adjustments pays attention to throws. And only mark throws as rare in the importer if we have not synthesized profile data (which may in fact tell us the throw is not cold). Contributes to dotnet#93020.
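
As an illustration of a check that accepts either a small relative error or a small absolute error, consider the sketch below. The function name and the tolerance values are assumptions chosen for the example; they are not the JIT's actual checker or thresholds.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Returns true if two weights agree within either a small absolute error or
// a small relative error. The default tolerances are illustrative only.
bool WeightsAreClose(double expected, double actual,
                     double absTolerance = 0.01, double relTolerance = 0.01)
{
    const double diff = std::fabs(expected - actual);
    if (diff <= absTolerance)
    {
        return true; // small absolute error, useful when both weights are near zero
    }
    // Otherwise compare the difference against the larger of the two magnitudes.
    const double scale = std::max(std::fabs(expected), std::fabs(actual));
    return diff <= relTolerance * scale;
}
```

Accepting either kind of error keeps the check from firing on tiny weights, where a relative test is too strict, and on large weights, where an absolute test is too strict.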

Part of dotnet#93020. Removes FlowEdge::m_edgeWeightMin and FlowEdge::m_edgeWeightMax, and relies on block weights and edge likelihoods to determine edge weights via FlowEdge::getLikelyWeight.

…1011) Fixes the following areas with proper profile updates:
* GDV chaining
* instrumentation-introduced flow
* OSR step blocks
* fgSplitEdge (used by instrumentation)
Adds checking bypasses for:
* callfinally pair tails
* original method entries in OSR methods
Contributes to dotnet#93020

When dynamic PGO is active we would like for all methods to have some profile data, so we don't have to handle a mixture of profiled and unprofiled methods during or after inlining. But to reduce profiling overhead, the JIT will not instrument methods that have straight-line control flow, or flow where all branches lead to throws (aka "minimal profiling"). When the JIT tries to recover profile data for these methods it won't get any data back. So there is a fairly high volume of these profiled/unprofiled mixtures today and they lead to various poor decisions in the JIT. This change enables the JIT to see if dynamic PGO is active. The JIT does not yet do anything with the information. A subsequent change will have the JIT synthesize data for methods with no profile data in this case. We could also solve this by creating a placeholder PGO schema for these methods with no data, but it seems simpler and less resource intensive to have the runtime tell the JIT that dynamic PGO is active. This also changes the JIT GUID for the new API surface. Contributes to dotnet#93020.

Part of #93020. Compiler::fgDoReversePostOrderLayout reorders blocks based on a RPO of the flowgraph's successor edges. When reordering based on the RPO, we only reorder blocks within the same EH region to avoid breaking up their contiguousness. After establishing an RPO-based layout, we do another pass to move cold blocks to the ends of their regions in fgMoveColdBlocks. The "greedy" part of this layout isn't all that greedy just yet. For now, we use edge likelihoods to make placement decisions only for BBJ_COND blocks' successors. I plan to extend this greediness to other multi-successor block kinds (BBJ_SWITCH, etc) in a follow-up so we can independently evaluate the value in doing so. This new layout is disabled by default for now.
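
A rough sketch of the likelihood-guided RPO idea follows. It is a simplification under stated assumptions: it ignores EH regions and cold-block placement entirely, and `Succ`, `Visit`, and `GreedyRpoLayout` are hypothetical names rather than the JIT's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical successor edge.
struct Succ
{
    int    target;
    double likelihood;
};

static void Visit(int block, const std::vector<std::vector<Succ>>& succs,
                  std::vector<bool>& visited, std::vector<int>& postOrder)
{
    visited[block] = true;

    // Visit less likely successors first. The most likely successor then
    // finishes last, so it lands right after 'block' in the reverse
    // post-order (provided it has not been visited already).
    std::vector<Succ> ordered = succs[block];
    std::stable_sort(ordered.begin(), ordered.end(),
                     [](const Succ& a, const Succ& b) { return a.likelihood < b.likelihood; });

    for (const Succ& s : ordered)
    {
        if (!visited[s.target])
        {
            Visit(s.target, succs, visited, postOrder);
        }
    }
    postOrder.push_back(block);
}

// Returns the blocks reachable from 'entry' in reverse post-order.
std::vector<int> GreedyRpoLayout(const std::vector<std::vector<Succ>>& succs, int entry)
{
    std::vector<bool> visited(succs.size(), false);
    std::vector<int>  postOrder;
    Visit(entry, succs, visited, postOrder);
    std::reverse(postOrder.begin(), postOrder.end());
    return postOrder;
}
```

For a diamond where block 0 branches to block 1 with likelihood 0.25 and to block 2 with likelihood 0.75, and both rejoin at block 3, this yields the order 0, 2, 1, 3, placing the hotter successor directly after the branch.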

If we know dynamic PGO is active, and we do not find a PGO schema for a method, synthesize PGO data. The schema may be missing if the method was prejitted but not covered by static PGO, or was considered too simple to need profiling (aka minimal profiling). This synthesis removes the possibility of a mixed PGO/no PGO situation. These are problematic, especially in methods that do a lot of inlining. Now when dynamic PGO is active all methods that get optimized will have some form of PGO data. Only run profile incorporation when optimizing. Reset BBOPT/pgo vars if we switch away from optimization or have a min opts failover. Contributes to #93020.

Advance profile consistency check through inlining. Turns out there are five reasons why inlining may make profile data inconsistent. Account for these and add metrics. Also add separate metrics for consistency before and after inlining, since pre-inline phases are run on inlinees and so don't give us good insight into overall consistency rates. And add some metrics for inlining itself. Contributes to #93020. Co-authored-by: Aman Khalid <amankhalid@microsoft.com>

…ayout (#102461) Part of #93020. In #102343, we noticed the RPO-based layout sometimes makes suboptimal decisions in terms of placing a block's hottest predecessor before it -- in particular, this affects loops that aren't entered at the top. To address this, after establishing a baseline RPO layout, fgMoveBackwardJumpsToSuccessors will try to move backward unconditional jumps to right behind their targets to create fallthrough, if the predecessor block is sufficiently hot.
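
The reordering step described above might look something like the sketch below, which operates on a plain vector of block indices. `MoveBackwardJumpBeforeTarget` is a hypothetical helper; the hotness test is left to the caller, and the guard that keeps the first block in place is an assumption of this example (the first block is taken to be the method entry).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// If 'jumpBlock' appears after 'target' in 'order' (a backward jump), move it
// to the slot immediately before 'target' so the jump becomes a fall-through.
// Returns true if the order changed. Whether the move is profitable (for
// example, whether 'jumpBlock' is hot enough) is left to the caller.
bool MoveBackwardJumpBeforeTarget(std::vector<int>& order, int jumpBlock, int target)
{
    auto targetIt = std::find(order.begin(), order.end(), target);
    auto jumpIt   = std::find(order.begin(), order.end(), jumpBlock);
    if (targetIt == order.end() || jumpIt == order.end() || jumpIt <= targetIt)
    {
        return false; // missing block, self-loop, or not a backward jump
    }
    if (targetIt == order.begin())
    {
        return false; // assumption: never displace the first (entry) block
    }
    // Rotate [targetIt, jumpIt] right by one: 'jumpBlock' takes the slot where
    // 'target' was, and 'target' plus the blocks in between shift one slot later.
    std::rotate(targetIt, jumpIt, jumpIt + 1);
    return true;
}
```

Starting from the order 0, 1, 2, 3 with block 3 jumping back to block 1, the call produces 0, 3, 1, 2.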

Instead of giving handler regions a fraction of the entry weight, give them a small fixed weight. This is intended to combat the lack of profile propagation out of handler regions, where there are currently sometimes weight discontinuities large enough to cause profile check asserts. Contributes to dotnet#93020.

AndyAyersMS added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Oct 4, 2023

AndyAyersMS added this to the 9.0.0 milestone Oct 4, 2023

amanasifkhalid mentioned this issue Oct 4, 2023

JIT: Make BasicBlock::bbPrev and bbNext private #93032

Merged

amanasifkhalid added a commit that referenced this issue Oct 6, 2023

JIT: Make BasicBlock::bbPrev and bbNext private (#93032)

22d034f

Follow-up to #92908, and next step for #93020.

amanasifkhalid mentioned this issue Oct 6, 2023

JIT: Make BasicBlock jump target private #93152

Merged

JulieLeeMSFT added this to Team User Stories in .NET Core CodeGen Oct 6, 2023

JulieLeeMSFT added the User Story label (a single user-facing feature; can be grouped under an epic) Oct 6, 2023

JulieLeeMSFT assigned AndyAyersMS Oct 6, 2023

amanasifkhalid mentioned this issue Oct 20, 2023

JIT: Add explicit block fallthrough successor #93772

Closed

amanasifkhalid mentioned this issue Oct 31, 2023

JIT: Remove BBJ_NONE #94239

Merged

amanasifkhalid mentioned this issue Dec 7, 2023

JIT: Add explicit successor for BBJ_COND false branch #95773

Merged

amanasifkhalid mentioned this issue Dec 14, 2023

[JIT] Remove BBF_NONE_QUIRK #95998

Closed

amanasifkhalid mentioned this issue Dec 22, 2023

JIT: Set bbFalseTarget upon BBJ_COND initialization/modification #96265

Merged

amanasifkhalid mentioned this issue Jan 3, 2024

JIT: Allow hot/cold splitting between a BBJ_COND block and its false target #96431

Merged

amanasifkhalid mentioned this issue Jan 8, 2024

JIT: Allow BBJ_COND false target to diverge from bbNext in layout optimization phase #96609

Merged

amanasifkhalid mentioned this issue Jan 16, 2024

JIT: Remove some early fgReorderBlocks calls #97012

Merged

This was referenced Jan 19, 2024

JIT: Remove Compiler::fgConnectFallThrough and BBJ_COND fix-ups in loop phases #97191

Closed

JIT: Remove Compiler::fgIsBetterFallThrough #97222

Merged

amanasifkhalid mentioned this issue Jan 25, 2024

JIT: Remove most fgConnectFallThrough calls #97488

Merged

amanasifkhalid mentioned this issue Jan 29, 2024

Remove fgReplaceSwitchJumpTarget; increase usage of fgReplaceJumpTarget #97664

Merged

AndyAyersMS mentioned this issue Apr 30, 2024

JIT: synthesize PGO if no schema, when dynamic PGO is active #101739

Merged

This was referenced May 3, 2024

JIT: profile checking through inlining #101834

Merged

Profile Synthesis Work Items #82964

Closed

This was referenced May 16, 2024

JIT: Enable RPO-based block layout by default #102343

Merged

JIT: Move backward jumps to before their successors after RPO-based layout #102461

Merged

amanasifkhalid mentioned this issue Jun 3, 2024

JIT: Reconsider block weight propagation in loop cloning #103001

Open

amanasifkhalid mentioned this issue Jun 21, 2024

JIT: Enable compaction of all BBJ_ALWAYS blocks #103785

Merged
