
[JIT] Add BasicBlock::bbFlags helper methods #95139

Merged
merged 35 commits into dotnet:main on Dec 7, 2023

Conversation

amanasifkhalid
Member

Andy mentioned in #94239 that it would be nice to have a method to handle verbose flag checks like ((block->bbFlags & BBF_SOME_FLAG) != 0), so this adds some helper methods for checking/modifying block flags.
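For illustration only, a minimal sketch of the kind of helpers being added, assuming the existing BasicBlockFlags enum already has bitwise operators defined (as it does in the JIT sources); the exact names, signatures, and whether HasFlag returns a normalized bool or the raw BasicBlockFlags value are debated in the review below:

// Sketch, not the PR's exact code: flag helpers as members of struct BasicBlock.
bool HasFlag(const BasicBlockFlags flag) const
{
    return (bbFlags & flag) != 0;
}

void SetFlags(const BasicBlockFlags flags)
{
    bbFlags |= flags;
}

void RemoveFlags(const BasicBlockFlags flags)
{
    bbFlags &= ~flags;
}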

@ghost ghost assigned amanasifkhalid Nov 22, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Nov 22, 2023
@ghost

ghost commented Nov 22, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@amanasifkhalid
Member Author

cc @dotnet/jit-contrib. With these changes, we only have a few spots where we directly access bbFlags (usually either for assignment, or for calculating its bitwise AND with some flag without converting the result to a boolean). If we've made it this far, would it be worth just making bbFlags private, and adding GetFlags() and SetFlags(...) to cover those last few cases?

Member

@BruceForstall BruceForstall left a comment

I like this change. Left a few notes.

If we've made it this far, would it be worth just making bbFlags private, and adding GetFlags() and SetFlags(...) to cover those last few cases?

With my suggested CopyFlags function, how many more cases are there? I think it would make sense to make it private. I would name the functions GetFlagsRaw/SetFlagsRaw to make it even more clear this is an "escape hatch" and not preferred usage.

@@ -869,8 +894,7 @@ struct BasicBlock : private LIR::Range
{
if (this->bbWeight == BB_ZERO_WEIGHT)
{
- this->bbFlags &= ~BBF_RUN_RARELY; // Clear any RarelyRun flag
- this->bbFlags &= ~BBF_PROF_WEIGHT; // Clear any profile-derived flag
+ this->RemoveFlag((BBF_RUN_RARELY | BBF_PROF_WEIGHT));
Member

nit

Suggested change
this->RemoveFlag((BBF_RUN_RARELY | BBF_PROF_WEIGHT));
this->RemoveFlag(BBF_RUN_RARELY | BBF_PROF_WEIGHT);

@@ -5435,7 +5435,7 @@ void CodeGen::genFnEpilog(BasicBlock* block)
}
#endif // DEBUG

- bool jmpEpilog = ((block->bbFlags & BBF_HAS_JMP) != 0);
+ bool jmpEpilog = (block->HasFlag(BBF_HAS_JMP));
Member

nit

Suggested change
bool jmpEpilog = (block->HasFlag(BBF_HAS_JMP));
bool jmpEpilog = block->HasFlag(BBF_HAS_JMP);


// Use coldness of current block, as this label will
// be contained in it.
- block->bbFlags |= (compiler->compCurBB->bbFlags & BBF_COLD);
+ block->SetFlag(compiler->compCurBB->bbFlags & BBF_COLD);
Member

You could add a CopyFlags function and have:

Suggested change
block->SetFlag(compiler->compCurBB->bbFlags & BBF_COLD);
block->CopyFlags(compiler->compCurBB, BBF_COLD);
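(A hedged sketch of what such a CopyFlags helper could look like, matching the usage suggested above; the exact signature is an assumption, not the PR's final code:)

// Sketch: OR into this block's flags whichever bits of 'mask' are set on 'from'.
void CopyFlags(const BasicBlock* from, const BasicBlockFlags mask)
{
    bbFlags |= (from->bbFlags & mask);
}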

Member Author

Sure thing. We do have a few cases where the second argument to CopyFlags would be multiple flags OR'd together, e.g. in Compiler::fgSplitBlockBeforeTree:
block->SetFlag(originalFlags & (BBF_SPLIT_GAINED | BBF_IMPORTED | BBF_GC_SAFE_POINT | BBF_LOOP_PREHEADER | BBF_RETLESS_CALL));
Would you be ok with this pattern?

Member

Something like:

block->CopyFlags(fromBlock, BBF_IMPORTED | BBF_GC_SAFE_POINT | BBF_LOOP_PREHEADER | BBF_RETLESS_CALL);

would be fine.

The more complex cases where we (1) grab the flags, (2) do some computations, then (3) set the flags, might be where the GetFlagsRaw would come in. So the example you reference might need to use that.

return !CheckFlag(flag, BBF_EMPTY);
}

void SetFlag(const BasicBlockFlags flag)
Member

Maybe it should be SetFlags/RemoveFlags?

return CheckFlag(flag, flag);
}

bool HasFlag(const BasicBlockFlags flag) const
Member

Should this assert that flag has only one bit set? And add an explicit HasAnyFlag that allows for the rare cases where you're checking if any flag is set? (actually, we should probably just make callers use HasFlag(one) || HasFlag(two) in that case)

Member Author

I think that's a good idea. It doesn't come up often, but we do have a few HasFlag(BBF_FLAG1 | BBF_FLAG2) calls, so maybe we can use the templated approach we use for BasicBlock::KindIs? If not, we can just use more than one call to HasFlag in the caller.

Member

I think it's better to make the caller use HasFlag(one) || HasFlag(two), or HasFlag(one) && HasFlag(two).

Yes, we could use the template magic to have HasAnyFlags(one,two,...) and HasAllFlags(one,two,...) but does that happen very often? IMO, we should not have HasFlag(one,two,...) because it's not clear if it's "any" or "all".

Member Author

Yes, we could use the template magic to have HasAnyFlags(one,two,...) and HasAllFlags(one,two,...) but does that happen very often?

I don't recall seeing it enough to justify introducing two more methods.

I think it's better to make the caller use HasFlag(one) || HasFlag(two), or HasFlag(one) && HasFlag(two).

Now that you mention the || vs && cases, I agree it's more readable to just have the caller make two calls.

Member Author

On second thought, there are some cases where multiple calls to HasFlag quickly becomes ugly, so I'll introduce HasAnyFlags to cover these scenarios (I haven't found a need for HasAllFlags anywhere).

Member

nit: in English, the preferred wording would be HasAnyFlag and HasAllFlags, IMO

}

/* Update the flags for block with those found in bNext */

- block->bbFlags |= (bNext->bbFlags & BBF_COMPACT_UPD);
+ block->SetFlag(bNext->bbFlags & BBF_COMPACT_UPD);
Member

Another case for CopyFlags

@@ -3026,7 +3026,7 @@ void Compiler::fgInsertFuncletPrologBlock(BasicBlock* block)
assert(nullptr == fgGetPredForBlock(block, newHead));
fgAddRefPred(block, newHead);

- assert((newHead->bbFlags & BBF_INTERNAL) == BBF_INTERNAL);
+ assert(newHead->CheckFlag(BBF_INTERNAL));
Member

This can probably be HasFlag. It's equivalent if the flag being checked is a single bit. That's probably the case for most/all of the uses of CheckFlag with a single flag.

Member Author

Thanks for pointing this out. I was able to get rid of CheckFlag altogether.

@amanasifkhalid
Member Author

With my suggested CopyFlags function, how many more cases are there?

The cases that require GetFlagsRaw and/or SetFlagsRaw fall into the following categories:

  • We want to cheaply assign bbFlags to some computed value, or to the flags of another block (we could introduce another version of CopyFlags to do the latter, but I think that might be confusing, since one version of CopyFlags would bitwise OR the flags with its arguments, and the other would just assign them). This occurs only a few (~3) times, for example when creating a duplicate block.
  • We want to copy a block's flags to a local variable, do some computations on the local variable, and then pass the computed flags to SetFlags/CopyFlags/etc, or use them as a conditional. A one-line version of the former looks like newBlock->CopyFlags(curr, (succ->GetFlagsRaw() & BBF_BACKWARD_JUMP));
  • We want to print a block's flags.

Overall, usage of the raw getters/setters is uncommon.
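(A hedged sketch of the raw "escape hatch" accessors described above; the exact shape is an assumption:)

// Sketch: raw accessors for the uncommon cases listed above.
BasicBlockFlags GetFlagsRaw() const
{
    return bbFlags;
}

void SetFlagsRaw(const BasicBlockFlags flags)
{
    bbFlags = flags;
}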

@jakobbotsch
Member

I don't have a strong opinion on a change like this, but I will say that I don't think the original pattern is verbose, and bitwise manipulation is such a primitive operation that I don't think this makes things clearer, only a bit inconsistent with other flags throughout the JIT.

If other people on the team prefer these wrapper methods, then I am ok with it.

@amanasifkhalid
Member Author

I don't think this makes things clearer, only a bit inconsistent with other flags throughout the JIT.

I agree this isn't a significant readability improvement, but it does fix the style inconsistencies in how we check the flags (use of implicit boolean conversion, number of parentheses, etc.).

I noticed from the last run that this change has unexpectedly large TP diffs. I suspect the templated definition of HasAnyFlag is to blame, and it would be cheaper to OR all the flags passed in first, and then check if bbFlags contains them with one AND instruction.
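(A sketch of the cheaper shape being described, assuming a C++17 variadic template in the style of BasicBlock::KindIs; the PR's actual implementation may differ:)

// Sketch: fold the arguments into one mask first, then test bbFlags with a single AND.
template <typename... F>
bool HasAnyFlag(const F... flags) const
{
    const BasicBlockFlags mask = (flags | ...);
    return (bbFlags & mask) != 0;
}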

Side note: I tried profiling the TP diff locally, but the recent changes to jiteeversionguid.h are causing the JIT to fail to initialize when SuperPMI tries to run a collection due to a version mismatch. Rebuilding the JIT/SuperPMI and redownloading the collections doesn't fix this. Any ideas?

@amanasifkhalid
Member Author

This all assumes HasFlag is inlined, which I would assume. But maybe that's a bad assumption?

I don't see why HasFlag wouldn't be inlined, considering its size and simplicity (though maybe I'm missing something). Adding the inline directive doesn't change any TP diffs locally (though that doesn't guarantee the compiler will inline it). I'll take a look at the disassembly...

@amanasifkhalid
Member Author

amanasifkhalid commented Nov 30, 2023

I'll take a look at the disassembly...

So the codegen for the BasicBlockFlags and bool implementations match godbolt; here they are, respectively:

mov         rax,qword ptr [rcx+30h]
and         rax,rdx
ret

test        qword ptr [rcx+30h],rdx
setne       al
ret

In 6 places in the JIT, neither version is inlined (though there are no instances where one is inlined and the other isn't). These calls are in larger methods like Compiler::impImportBlockCode (which is almost 5000 lines long), so maybe MSVC is exhausting its inlining budget? The inline directive doesn't change anything here. Similarly, SetFlags is called in 12 places and is not always inlined either. All the other helper methods are inlined everywhere.

We could use __forceinline to get around this. We have a few use cases of it, but I imagine this isn't ideal (though replacing these methods with macros probably wouldn't be any better). But regardless of how we address inlining, I'm still surprised I managed to make up most of the TP diff on Windows x64 by returning a BasicBlockFlags instead. I guess a mov and an and are that much cheaper?
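(For reference, a sketch of what force-inlining the helper might look like; the JIT sources use a FORCEINLINE macro per the later commit note, but whether the methods use that macro or __forceinline directly is an assumption here:)

// Sketch: force the tiny helper to be inlined even inside very large methods
// such as Compiler::impImportBlockCode, where MSVC otherwise skips it.
FORCEINLINE bool HasFlag(const BasicBlockFlags flag) const
{
    return (bbFlags & flag) != 0;
}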

@amanasifkhalid amanasifkhalid marked this pull request as draft November 30, 2023 02:38
@BruceForstall
Member

We could use __forceinline to get around this.

I think this is a good case for the use of __forceinline

I guess a mov and an and are that much cheaper?

Note that TP diffs purely count instructions; they are not a true measure of run-time performance difference. An instruction count is the best proxy we have for run-time perf differences: it is extremely low-noise and reproducible.

@amanasifkhalid amanasifkhalid marked this pull request as ready for review November 30, 2023 12:53
// by checking if it is a power of 2
// (HasFlag expects to check only one flag at a time)
assert(isPow2(flag));
return (bool)(bbFlags & flag);
Member

Does this get us into odd unnormalized bools territory? Why not (bbFlags & flag) != 0?

Member Author

Good point. Though it looks like the codegen is still worse when the boolean version of this method is inlined (diffs). I'll take a look at the inlined instances and share the codegen here.

@amanasifkhalid
Member Author

I think this is a good case for the use of __forceinline

This seemed to fix the remaining diffs on x86.

Note that TP diffs purely count instructions; they are not a true measure of run-time performance difference.

I see, thanks for pointing that out. We're still seeing TP regressions with the inlined boolean implementation of the method. I can confirm that the boolean implementation results in a larger clrjit.dll disassembly. To get an idea of why, I looked at disassemblies of Compiler::fgHasCycleWithoutGCSafePoint (which has two calls to HasFlag, where the method is used as a predicate) with the two different implementations. Both calls to HasFlag are inlined, but the codegen is slightly different. For the BasicBlockFlags implementation, HasFlag is inlined into a bt, while for the boolean implementation, HasFlag is inlined into a shr and a test. So we are executing more instructions with the boolean approach.

If everyone's ok with the slightly less intuitive code, I'm going to revert to the BasicBlockFlags implementation for the better TP.

@BruceForstall
Member

So the implementation with __forceinline and bool return (bbFlags & flag) != 0 has a very slight (+0.02%) regression, but notably no regression on linux/x64 (built with clang)? I'm inclined to say we should just accept that and go for the intuitive implementation.

@amanasifkhalid
Member Author

amanasifkhalid commented Nov 30, 2023

So the implementation with __forceinline and bool return (bbFlags & flag) != 0 has a very slight (+0.02%) regression

Ah sorry, I linked the wrong diffs. It's bigger than that.

@amanasifkhalid
Member Author

amanasifkhalid commented Nov 30, 2023

Also while looking at the disassembly of large methods like impImportBlockCode, I found other small methods like BasicBlock::JumpsToNext not being inlined. I'm thinking of opening a followup PR where we __forceinline these methods to see if we can claw back some more TP (we could also chop up these large methods to fit them in MSVC's inlining budget, though that seems more daunting). What do we think?

@BruceForstall
Member

With the current variation that is running diffs, would you expect TP to universally improve, since there are lots of places using ((block->bbFlags & flag) != 0) that will now be doing (inlined) (block->bbFlags & flag) which you've shown has fewer instructions?

@BruceForstall
Member

I'm thinking of opening a followup PR where we __forceinline these methods to see if we can claw back some more TP (we could also chop up these large methods to fit them in MSVC's inlining budget, though that seems more daunting). What do we think?

Seems like a good experiment to try (the __forceinline). Splitting up large methods is generally "hard", but can be worthwhile. Probably best done while working in an area, not as a "go around breaking up large methods" task.

What really matters is the profile-feedback-based Release build. Note that TP diffs, while they use Release builds, disable the compiler's use of profile feedback to ensure an "apples to apples" comparison that doesn't bias the baseline (which presumably has more accurate profile data).

@amanasifkhalid
Member Author

With the current variation that is running diffs, would you expect TP to universally improve

Good question. This previous run is with __forceinline and returning (bbFlags & flag). There are only some slight TP improvements on win-x86 when targeting Linux ARM. I'll take a look at the codegen with the original ((block->bbFlags & flag) != 0) approach to see if there's any difference.

@amanasifkhalid
Member Author

I checked the disassembly of Compiler::fgHasCycleWithoutGCSafePoint with the HasFlag calls replaced with ((block->bbFlags & flag) != 0), and the codegen is identical. However, there are some odd codegen diffs elsewhere that result in a larger assembly overall with the HasFlag changes. I've attached a snippet below; left is without this PR's changes, and right is with them (sorry for the excessive diff highlighting):

[screenshot: side-by-side disassembly diff from CodeGen::genCodeForBBlist]

That snippet is from CodeGen::genCodeForBBlist; looking at the changes I made there, nothing stands out to me as responsible for this diff. So maybe just some odd MSVC behavior?

@BruceForstall
Member

I looked into the largest Windows TP regression in the "bool" returning "(bbFlags & flags) != 0" implementation, and it looks like a CQ limitation of the MSVC compiler: the clang/gcc compilers optimize this better when inlined. I opened https://developercommunity.visualstudio.com/t/Sub-optimal-code-for-inlined-bool-functi/10535537 to request the MSVC team look at improving this.

I'm ok merging this with the BasicBlockFlags return and maybe doing a separate PR to convert it to bool (I'm ok with the TP regression there, too: it's about 0.2%).

Fix 2 bugs in importer with flags conversion to functions.

Implement full `CopyFlags` to eliminate some GetFlagsRaw cases.

For BasicBlockFlags operators, mark them as FORCEINLINE. They have
always been expected to be fully inlined, but I found some cases
in Release builds where they were not.
@BruceForstall
Member

@amanasifkhalid I submitted a PR to update this PR, with fixes to a couple bugs I noticed while reviewing, as well as adding some additional FORCEINLINE cases (that unfortunately don't materially improve measured TP):

amanasifkhalid#3

Implement HasAllFlags and fix a couple bugs
@amanasifkhalid
Member Author

@BruceForstall thanks for investigating this further, and for the fixes! Hopefully we'll see an MSVC fix for this soon.

Member

@BruceForstall BruceForstall left a comment

LGTM if tests pass

@BruceForstall
Member

Diffs

No asm diffs (as expected). No TP diffs on Linux. Small TP improvements to small regressions on Windows (-0.01% to +0.02%).

@amanasifkhalid
Member Author

Failures are known NativeAOT issues.

@amanasifkhalid amanasifkhalid merged commit 425cfb9 into dotnet:main Dec 7, 2023
124 of 139 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jan 6, 2024