JIT: Add a (disabled) prototype for a generalized promotion pass (#83388)

Introduce a "physical" promotion pass that generalizes the existing promotion.
More specifically, it does not have restrictions on field count and it can
handle arbitrary recursive promotion.
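
For illustration (this example is hypothetical, not taken from the PR), the new
pass can consider structs like the following, which the existing promotion
rejects because of its small field-count limit and its inability to recurse
into nested struct fields:

```
// Hypothetical example -- not from the PR. The existing promoter bails
// on structs like Outer, while physical promotion can still consider
// every primitively accessed piece of it, including Inner's fields.
private struct Inner
{
    public int X, Y;
}

private struct Outer
{
    public Inner I;           // promoting I's fields requires recursion
    public int A, B, C, D, E; // more fields than the old promoter allows
}
```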

The pass is physical in the sense that it does not rely on any field metadata
for structs. Instead, it works in two separate passes over the IR:

1. In the first pass we find and analyze how unpromoted struct locals are
accessed. For example, for a simple program like:

```
public static void Main()
{
    S s = default;
    Call(s, s.C);
    Console.WriteLine(s.B + s.C);
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static void Call(S s, byte b)
{
}

private struct S
{
    public byte A, B, C, D, E;
}
```

we see IR like:

```
***** BB01
STMT00000 ( 0x000[E-] ... 0x003 )
               [000003] IA---------                         ▌  ASG       struct (init)
               [000001] D------N---                         ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000002] -----------                         └──▌  CNS_INT   int    0

***** BB01
STMT00001 ( 0x008[E-] ... 0x026 )
               [000008] --C-G------                         ▌  CALL      void   Program:Call(Program+S,ubyte)
               [000004] ----------- arg0                    ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000007] ----------- arg1                    └──▌  LCL_FLD   ubyte  V00 loc0         [+2]

***** BB01
STMT00002 ( 0x014[E-] ... ??? )
               [000016] --C-G------                         ▌  CALL      void   System.Console:WriteLine(int)
               [000015] ----------- arg0                    └──▌  ADD       int   
               [000011] -----------                            ├──▌  LCL_FLD   ubyte  V00 loc0         [+1]
               [000014] -----------                            └──▌  LCL_FLD   ubyte  V00 loc0         [+2]
```

and the analysis produces:

```
Accesses for V00
  [000..005)
    #:                             (2, 200)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (1, 100)
    # as implicit by-ref call arg: (1, 100)
    # as on-stack call arg:        (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  ubyte @ 001
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (0, 0)
    # as implicit by-ref call arg: (0, 0)
    # as on-stack call arg:        (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  ubyte @ 002
    #:                             (2, 200)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (1, 100)
    # as implicit by-ref call arg: (0, 0)
    # as on-stack call arg:        (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)
```

Here each pair is (ref count, weighted ref count), where the weighted count
scales each reference by the weight of its containing basic block.

Based on this accounting, the analysis estimates the profitability of replacing
some of the accessed parts of the struct with new locals. Such replacements can
be costly: overlapping struct accesses (e.g. passing the whole struct as an
argument) may require more expensive codegen after promotion, and the new
locals themselves introduce more register pressure. Currently the profitability
analysis is very crude.
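
As a rough sketch of how such a cost comparison can be structured (the
constants and names below are illustrative assumptions, not the actual
heuristic in promotion.cpp):

```
// Illustrative sketch only -- the costs here are assumed for exposition
// and do not reproduce the JIT's real heuristic or its numbers.
private static class ReplacementCosts
{
    private const int RegAccessCost   = 1; // assumed: access to a promoted local
    private const int StackAccessCost = 4; // assumed: LCL_FLD access on the stack
    private const int WriteBackCost   = 5; // copying a replacement to/from the struct

    public static bool ShouldPromote(int wtdAccesses, int wtdWriteBacks, int wtdReadBacks)
    {
        // Without promotion, every weighted access stays a stack access.
        int costWithout = wtdAccesses * StackAccessCost;

        // With promotion, accesses become register accesses, but overlapping
        // uses force write-backs and overlapping definitions force read-backs.
        int costWith = wtdAccesses * RegAccessCost
                       + (wtdWriteBacks + wtdReadBacks) * WriteBackCost;

        return costWith < costWithout;
    }
}
```

The dump below shows the quantities the real analysis tracks for each
candidate: a single write-back cost, weighted write-back and read-back counts,
and the resulting cost with and without the replacement.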

In this case the logic decides that promotion is not worth it:

```
Evaluating access ubyte @ 001
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1350
  Cost without: 650
  Disqualifying replacement
Evaluating access ubyte @ 002
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1700
  Cost without: 1300
  Disqualifying replacement
```

2. In the second pass the profitable field accesses are replaced with new
locals. For overlapping accesses, this currently involves first writing the
replacement locals back to the struct local. For parameters and OSR locals, it
involves first reading the replacements back from the struct local.
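
In source terms, the replacements for the running example correspond roughly to
this hand-written sketch (not compiler output):

```
// Hand-written sketch -- not compiler output. s.B and s.C get
// replacement locals; because the struct is defined and passed as a
// whole, each replacement is read back from the struct before its use.
public static void Main()
{
    S s = default;

    byte tmpC = s.C;   // read-back for s.C's replacement (V03 below)
    Call(s, tmpC);     // the whole struct is still passed as-is

    byte tmpB = s.B;   // read-back for s.B's replacement (V02 below)
    Console.WriteLine(tmpB + tmpC);
}
```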

In the above case we can override the profitability analysis with stress mode
STRESS_PHYSICAL_PROMOTION_COST and we get:

```
Evaluating access ubyte @ 001
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1350
  Cost without: 650
  Promoting replacement due to stress

lvaGrabTemp returning 2 (V02 tmp1) (a long lifetime temp) called for V00.[001..002).
Evaluating access ubyte @ 002
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1700
  Cost without: 1300
  Promoting replacement due to stress

lvaGrabTemp returning 3 (V03 tmp2) (a long lifetime temp) called for V00.[002..003).
V00 promoted with 2 replacements
  [001..002) promoted as ubyte V02
  [002..003) promoted as ubyte V03

...

***** BB01
STMT00000 ( 0x000[E-] ... 0x003 )
               [000003] IA---------                         ▌  ASG       struct (init)
               [000001] D------N---                         ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000002] -----------                         └──▌  CNS_INT   int    0

***** BB01
STMT00001 ( 0x008[E-] ... 0x026 )
               [000008] -ACXG------                         ▌  CALL      void   Program:Call(Program+S,ubyte)
               [000004] ----------- arg0                    ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000022] -A--------- arg1                    └──▌  COMMA     ubyte 
               [000021] -A---------                            ├──▌  ASG       ubyte 
               [000019] D------N---                            │  ├──▌  LCL_VAR   ubyte  V03 tmp2         
               [000020] -----------                            │  └──▌  LCL_FLD   ubyte  V00 loc0         [+2]
               [000018] -----------                            └──▌  LCL_VAR   ubyte  V03 tmp2         

***** BB01
STMT00002 ( 0x014[E-] ... ??? )
               [000016] -ACXG------                         ▌  CALL      void   System.Console:WriteLine(int)
               [000015] -A--------- arg0                    └──▌  ADD       int   
               [000027] -A---------                            ├──▌  COMMA     ubyte 
               [000026] -A---------                            │  ├──▌  ASG       ubyte 
               [000024] D------N---                            │  │  ├──▌  LCL_VAR   ubyte  V02 tmp1         
               [000025] -----------                            │  │  └──▌  LCL_FLD   ubyte  V00 loc0         [+1]
               [000023] -----------                            │  └──▌  LCL_VAR   ubyte  V02 tmp1         
               [000028] -----------                            └──▌  LCL_VAR   ubyte  V03 tmp2         
```

The pass is still rudimentary and is missing many basic CQ optimizations. For
example, it does not make use of liveness yet, and it has no decomposition
support for struct assignments. Even so, it already shows good potential in
user benchmarks. I have listed some follow-up improvements in #76928.

This PR adds the pass, but it is disabled by default. It can be enabled by
setting DOTNET_JitStressModeNames=STRESS_PHYSICAL_PROMOTION. Two new scenarios
that enable it have been added to jit-experimental for testing purposes.
jakobbotsch committed Apr 11, 2023
1 parent 62e026e commit 388edb6
Showing 16 changed files with 1,442 additions and 41 deletions.
2 changes: 2 additions & 0 deletions eng/pipelines/common/templates/runtimes/run-test-job.yml
@@ -583,6 +583,8 @@ jobs:
- jitpartialcompilation
- jitpartialcompilation_pgo
- jitobjectstackallocation
- jitgeneralizedpromotion
- jitgeneralizedpromotion_full

${{ if in(parameters.testGroup, 'jit-cfg') }}:
scenarios:
2 changes: 2 additions & 0 deletions src/coreclr/jit/CMakeLists.txt
@@ -158,6 +158,7 @@ set( JIT_SOURCES
optimizer.cpp
patchpoint.cpp
phase.cpp
promotion.cpp
rangecheck.cpp
rationalize.cpp
redundantbranchopts.cpp
@@ -348,6 +349,7 @@ set( JIT_HEADERS
objectalloc.h
opcode.h
phase.h
promotion.h
rangecheck.h
rationalize.h
regalloc.h
16 changes: 16 additions & 0 deletions src/coreclr/jit/assertionprop.cpp
@@ -1256,6 +1256,22 @@ AssertionIndex Compiler::optCreateAssertion(GenTree* op1,
goto DONE_ASSERTION; // Don't make an assertion
}

// We process locals when we see the LCL_VAR node instead
// of at its actual use point (its parent). That opens us
// up to problems in a case like the following, assuming we
// allowed creating an assertion like V10 = V35:
//
// └──▌ ADD int
// ├──▌ LCL_VAR int V10 tmp6 -> copy propagated to [V35 tmp31]
// └──▌ COMMA int
// ├──▌ ASG int
// │ ├──▌ LCL_VAR int V35 tmp31
// │ └──▌ LCL_FLD int V03 loc1 [+4]
if (lclVar2->lvRedefinedInEmbeddedStatement)
{
goto DONE_ASSERTION; // Don't make an assertion
}

assertion.op2.kind = O2K_LCLVAR_COPY;
assertion.op2.vn = optConservativeNormalVN(op2);
assertion.op2.lcl.lclNum = lclNum2;
4 changes: 4 additions & 0 deletions src/coreclr/jit/compiler.cpp
@@ -4745,6 +4745,10 @@ void Compiler::compCompile(void** methodCodePtr, uint32_t* methodCodeSize, JitFl
//
DoPhase(this, PHASE_EARLY_LIVENESS, &Compiler::fgEarlyLiveness);

// Promote struct locals based on primitive access patterns
//
DoPhase(this, PHASE_PHYSICAL_PROMOTION, &Compiler::PhysicalPromotion);

// Run a simple forward substitution pass.
//
DoPhase(this, PHASE_FWD_SUB, &Compiler::fgForwardSub);
14 changes: 12 additions & 2 deletions src/coreclr/jit/compiler.h
@@ -661,6 +661,8 @@ class LclVarDsc

unsigned char lvIsOSRExposedLocal : 1; // OSR local that was address exposed in Tier0

unsigned char lvRedefinedInEmbeddedStatement : 1; // Local has redefinitions inside embedded statements that
// disqualify it from local copy prop.
private:
unsigned char lvIsNeverNegative : 1; // The local is known to be never negative

@@ -2030,6 +2032,9 @@ class Compiler
friend class CallArgs;
friend class IndirectCallTransformer;
friend class ProfileSynthesis;
friend class LocalsUseVisitor;
friend class Promotion;
friend class ReplaceVisitor;

#ifdef FEATURE_HW_INTRINSICS
friend struct HWIntrinsicInfo;
@@ -2449,7 +2454,7 @@ class Compiler
GenTree* gtNewOperNode(genTreeOps oper, var_types type, GenTree* op1);

// For binary opers.
GenTree* gtNewOperNode(genTreeOps oper, var_types type, GenTree* op1, GenTree* op2);
GenTreeOp* gtNewOperNode(genTreeOps oper, var_types type, GenTree* op1, GenTree* op2);

GenTreeCC* gtNewCC(genTreeOps oper, var_types type, GenCondition cond);
GenTreeOpCC* gtNewOperCC(genTreeOps oper, var_types type, GenCondition cond, GenTree* op1, GenTree* op2);
@@ -5740,9 +5745,9 @@ class Compiler
private:
void fgInsertStmtNearEnd(BasicBlock* block, Statement* stmt);
void fgInsertStmtAtBeg(BasicBlock* block, Statement* stmt);
void fgInsertStmtAfter(BasicBlock* block, Statement* insertionPoint, Statement* stmt);

public:
void fgInsertStmtAfter(BasicBlock* block, Statement* insertionPoint, Statement* stmt);
void fgInsertStmtBefore(BasicBlock* block, Statement* insertionPoint, Statement* stmt);

private:
@@ -6078,6 +6083,8 @@ class Compiler
PhaseStatus fgMarkAddressExposedLocals();
void fgSequenceLocals(Statement* stmt);

PhaseStatus PhysicalPromotion();

PhaseStatus fgForwardSub();
bool fgForwardSubBlock(BasicBlock* block);
bool fgForwardSubStatement(Statement* statement);
@@ -9720,6 +9727,9 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
STRESS_MODE(SSA_INFO) /* Select lower thresholds for "complex" SSA num encoding */ \
STRESS_MODE(SPLIT_TREES_RANDOMLY) /* Split all statements at a random tree */ \
STRESS_MODE(SPLIT_TREES_REMOVE_COMMAS) /* Remove all GT_COMMA nodes */ \
STRESS_MODE(NO_OLD_PROMOTION) /* Do not use old promotion */ \
STRESS_MODE(PHYSICAL_PROMOTION) /* Use physical promotion */ \
STRESS_MODE(PHYSICAL_PROMOTION_COST) \
\
/* After COUNT_VARN, stress level 2 does all of these all the time */ \
\
1 change: 1 addition & 0 deletions src/coreclr/jit/compmemkind.h
@@ -50,6 +50,7 @@ CompMemKindMacro(LoopHoist)
CompMemKindMacro(Unknown)
CompMemKindMacro(RangeCheck)
CompMemKindMacro(CopyProp)
CompMemKindMacro(Promotion)
CompMemKindMacro(SideEffects)
CompMemKindMacro(ObjectAllocator)
CompMemKindMacro(VariableLiveRanges)
1 change: 1 addition & 0 deletions src/coreclr/jit/compphases.h
@@ -43,6 +43,7 @@ CompPhaseNameMacro(PHASE_UPDATE_FINALLY_FLAGS, "Update finally target flag
CompPhaseNameMacro(PHASE_EARLY_UPDATE_FLOW_GRAPH, "Update flow graph early pass", false, -1, false)
CompPhaseNameMacro(PHASE_STR_ADRLCL, "Morph - Structs/AddrExp", false, -1, false)
CompPhaseNameMacro(PHASE_EARLY_LIVENESS, "Early liveness", false, -1, false)
CompPhaseNameMacro(PHASE_PHYSICAL_PROMOTION, "Physical promotion", false, -1, false)
CompPhaseNameMacro(PHASE_FWD_SUB, "Forward Substitution", false, -1, false)
CompPhaseNameMacro(PHASE_MORPH_IMPBYREF, "Morph - ByRefs", false, -1, false)
CompPhaseNameMacro(PHASE_PROMOTE_STRUCTS, "Morph - Promote Structs", false, -1, false)
67 changes: 37 additions & 30 deletions src/coreclr/jit/fgopt.cpp
@@ -6774,45 +6774,52 @@ PhaseStatus Compiler::fgTailMerge()
//
for (BasicBlock* const predBlock : block->PredBlocks())
{
if ((predBlock->GetUniqueSucc() == block) && BasicBlock::sameEHRegion(block, predBlock))
if (predBlock->GetUniqueSucc() != block)
{
Statement* lastStmt = predBlock->lastStmt();
continue;
}

// Block might be empty.
//
if (lastStmt == nullptr)
{
continue;
}
if (!BasicBlock::sameEHRegion(block, predBlock))
{
continue;
}

// Walk back past any GT_NOPs.
//
Statement* const firstStmt = predBlock->firstStmt();
while (lastStmt->GetRootNode()->OperIs(GT_NOP))
{
if (lastStmt == firstStmt)
{
// predBlock is evidently all GT_NOP.
//
lastStmt = nullptr;
break;
}
Statement* lastStmt = predBlock->lastStmt();

lastStmt = lastStmt->GetPrevStmt();
}
// Block might be empty.
//
if (lastStmt == nullptr)
{
continue;
}

// Block might be effectively empty.
//
if (lastStmt == nullptr)
// Walk back past any GT_NOPs.
//
Statement* const firstStmt = predBlock->firstStmt();
while (lastStmt->GetRootNode()->OperIs(GT_NOP))
{
if (lastStmt == firstStmt)
{
continue;
// predBlock is evidently all GT_NOP.
//
lastStmt = nullptr;
break;
}

// We don't expect to see PHIs but watch for them anyways.
//
assert(!lastStmt->IsPhiDefnStmt());
predInfo.Emplace(predBlock, lastStmt);
lastStmt = lastStmt->GetPrevStmt();
}

// Block might be effectively empty.
//
if (lastStmt == nullptr)
{
continue;
}

// We don't expect to see PHIs but watch for them anyways.
//
assert(!lastStmt->IsPhiDefnStmt());
predInfo.Emplace(predBlock, lastStmt);
}

// Are there enough preds to make it interesting?
6 changes: 5 additions & 1 deletion src/coreclr/jit/fgstmt.cpp
@@ -386,11 +386,15 @@ Statement* Compiler::fgNewStmtFromTree(GenTree* tree, BasicBlock* block, const D
{
Statement* stmt = gtNewStmt(tree, di);

if (fgNodeThreading != NodeThreading::None)
if (fgNodeThreading == NodeThreading::AllTrees)
{
gtSetStmtInfo(stmt);
fgSetStmtSeq(stmt);
}
else if (fgNodeThreading == NodeThreading::AllLocals)
{
fgSequenceLocals(stmt);
}

#if DEBUG
if (block != nullptr)
37 changes: 32 additions & 5 deletions src/coreclr/jit/gentree.cpp
@@ -7012,7 +7012,7 @@ void GenTree::SetVtableForOper(genTreeOps oper)
}
#endif // DEBUGGABLE_GENTREE

GenTree* Compiler::gtNewOperNode(genTreeOps oper, var_types type, GenTree* op1, GenTree* op2)
GenTreeOp* Compiler::gtNewOperNode(genTreeOps oper, var_types type, GenTree* op1, GenTree* op2)
{
assert(op1 != nullptr);
assert(op2 != nullptr);
@@ -7021,7 +7021,7 @@ GenTree* Compiler::gtNewOperNode(genTreeOps oper, var_types type, GenTree* op1,
// should call the appropriate constructor for the extended type.
assert(!GenTree::IsExOp(GenTree::OperKind(oper)));

GenTree* node = new (this, oper) GenTreeOp(oper, type, op1, op2);
GenTreeOp* node = new (this, oper) GenTreeOp(oper, type, op1, op2);

return node;
}
@@ -8397,7 +8397,7 @@ GenTree* Compiler::gtClone(GenTree* tree, bool complexOK)
return nullptr;
}

if (tree->gtOper == GT_FIELD)
if (tree->OperIs(GT_FIELD))
{
GenTree* objp = nullptr;

@@ -16263,6 +16263,34 @@ bool Compiler::gtSplitTree(
return false;
}

bool IsValue(const UseInfo& useInf)
{
GenTree* node = (*useInf.Use)->gtEffectiveVal();
if (!node->IsValue())
{
return false;
}

if (node->OperIs(GT_ASG))
{
return false;
}

GenTree* user = useInf.User;

if (user == nullptr)
{
return false;
}

if (user->OperIs(GT_COMMA) && (&user->AsOp()->gtOp1 == useInf.Use))
{
return false;
}

return true;
}

void SplitOutUse(const UseInfo& useInf, bool userIsReturned)
{
GenTree** use = useInf.Use;
@@ -16328,8 +16356,7 @@
}

Statement* stmt = nullptr;
if (!(*use)->IsValue() || (*use)->gtEffectiveVal()->OperIs(GT_ASG) || (user == nullptr) ||
(user->OperIs(GT_COMMA) && (user->gtGetOp1() == *use)))
if (!IsValue(useInf))
{
GenTree* sideEffects = nullptr;
m_compiler->gtExtractSideEffList(*use, &sideEffects);
1 change: 1 addition & 0 deletions src/coreclr/jit/jitconfigvalues.h
@@ -431,6 +431,7 @@ CONFIG_STRING(JitEnableVNBasedDeadStoreRemovalRange, W("JitEnableVNBasedDeadStor
CONFIG_STRING(JitEnableEarlyLivenessRange, W("JitEnableEarlyLivenessRange"))
CONFIG_STRING(JitOnlyOptimizeRange,
W("JitOnlyOptimizeRange")) // If set, all methods that do _not_ match are forced into MinOpts
CONFIG_STRING(JitEnablePhysicalPromotionRange, W("JitEnablePhysicalPromotionRange"))

CONFIG_INTEGER(JitDoSsa, W("JitDoSsa"), 1) // Perform Static Single Assignment (SSA) numbering on the variables
CONFIG_INTEGER(JitDoValueNumber, W("JitDoValueNumber"), 1) // Perform value numbering on method expressions
2 changes: 1 addition & 1 deletion src/coreclr/jit/jitstd/vector.h
@@ -733,7 +733,7 @@ void vector<T, Allocator>::insert_elements_helper(iterator iter, size_type size,

ensure_capacity(m_nSize + size);

for (int src = m_nSize - 1, dst = m_nSize + size - 1; src >= (int) pos; --src, --dst)
for (int src = (int)(m_nSize - 1), dst = (int)(m_nSize + size - 1); src >= (int) pos; --src, --dst)
{
m_pArray[dst] = m_pArray[src];
}
12 changes: 10 additions & 2 deletions src/coreclr/jit/morph.cpp
@@ -3328,7 +3328,7 @@ GenTreeCall* Compiler::fgMorphArgs(GenTreeCall* call)

if (argLclNum != BAD_VAR_NUM)
{
argObj->ChangeType(argVarDsc->TypeGet());
argx->ChangeType(argVarDsc->TypeGet());
argObj->SetOper(GT_LCL_VAR);
argObj->AsLclVar()->SetLclNum(argLclNum);
}
@@ -3346,7 +3346,7 @@ GenTreeCall* Compiler::fgMorphArgs(GenTreeCall* call)
{
// TODO-CQ: perform this transformation in lowering instead of here and
// avoid marking enregisterable structs DNER.
argObj->ChangeType(structBaseType);
argx->ChangeType(structBaseType);
if (argObj->OperIs(GT_LCL_VAR))
{
argObj->SetOper(GT_LCL_FLD);
@@ -14751,6 +14751,14 @@ PhaseStatus Compiler::fgPromoteStructs()
return PhaseStatus::MODIFIED_NOTHING;
}

#ifdef DEBUG
if (compStressCompile(STRESS_NO_OLD_PROMOTION, 10))
{
JITDUMP(" skipping due to stress\n");
return PhaseStatus::MODIFIED_NOTHING;
}
#endif

#if 0
// The code in this #if has been useful in debugging struct promotion issues, by
// enabling selective enablement of the struct promotion optimization according to
