
[JIT] Add BasicBlock::bbFlags helper methods #95139

Merged
merged 35 commits into dotnet:main on Dec 7, 2023

Conversation

amanasifkhalid
Member

Andy mentioned in #94239 that it would be nice to have a method to handle verbose flag checks like ((block->bbFlags & BBF_SOME_FLAG) != 0), so this adds some helper methods for checking/modifying block flags.
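For illustration only, a minimal sketch of the kind of helpers being added, assuming the existing BasicBlockFlags enum already has bitwise operators defined (as it does in the JIT sources); the exact names, signatures, and whether HasFlag returns a normalized bool or the raw BasicBlockFlags value are debated in the review below:

// Sketch, not the PR's exact code: flag helpers as members of struct BasicBlock.
bool HasFlag(const BasicBlockFlags flag) const
{
    return (bbFlags & flag) != 0;
}

void SetFlags(const BasicBlockFlags flags)
{
    bbFlags |= flags;
}

void RemoveFlags(const BasicBlockFlags flags)
{
    bbFlags &= ~flags;
}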

@ghost ghost assigned amanasifkhalid Nov 22, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Nov 22, 2023
@ghost

ghost commented Nov 22, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@amanasifkhalid
Member Author

cc @dotnet/jit-contrib. With these changes, we only have a few spots where we directly access bbFlags (usually either for assignment, or for calculating its bitwise AND with some flag without converting the result to a boolean). If we've made it this far, would it be worth just making bbFlags private, and adding GetFlags() and SetFlags(...) to cover those last few cases?

Member

@BruceForstall BruceForstall left a comment

I like this change. Left a few notes.

If we've made it this far, would it be worth just making bbFlags private, and adding GetFlags() and SetFlags(...) to cover those last few cases?

With my suggested CopyFlags function, how many more cases are there? I think it would make sense to make it private. I would name the functions GetFlagsRaw/SetFlagsRaw to make it even more clear this is an "escape hatch" and not preferred usage.

@@ -869,8 +894,7 @@ struct BasicBlock : private LIR::Range
{
if (this->bbWeight == BB_ZERO_WEIGHT)
{
- this->bbFlags &= ~BBF_RUN_RARELY; // Clear any RarelyRun flag
- this->bbFlags &= ~BBF_PROF_WEIGHT; // Clear any profile-derived flag
+ this->RemoveFlag((BBF_RUN_RARELY | BBF_PROF_WEIGHT));
Member

nit

Suggested change
this->RemoveFlag((BBF_RUN_RARELY | BBF_PROF_WEIGHT));
this->RemoveFlag(BBF_RUN_RARELY | BBF_PROF_WEIGHT);

@@ -5435,7 +5435,7 @@ void CodeGen::genFnEpilog(BasicBlock* block)
}
#endif // DEBUG

- bool jmpEpilog = ((block->bbFlags & BBF_HAS_JMP) != 0);
+ bool jmpEpilog = (block->HasFlag(BBF_HAS_JMP));
Member

nit

Suggested change
bool jmpEpilog = (block->HasFlag(BBF_HAS_JMP));
bool jmpEpilog = block->HasFlag(BBF_HAS_JMP);


// Use coldness of current block, as this label will
// be contained in it.
- block->bbFlags |= (compiler->compCurBB->bbFlags & BBF_COLD);
+ block->SetFlag(compiler->compCurBB->bbFlags & BBF_COLD);
Member

You could add a CopyFlags function and have:

Suggested change
block->SetFlag(compiler->compCurBB->bbFlags & BBF_COLD);
block->CopyFlags(compiler->compCurBB, BBF_COLD);
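(A hedged sketch of what such a CopyFlags helper could look like, matching the usage suggested above; the exact signature is an assumption, not the PR's final code:)

// Sketch: OR into this block's flags whichever bits of 'mask' are set on 'from'.
void CopyFlags(const BasicBlock* from, const BasicBlockFlags mask)
{
    bbFlags |= (from->bbFlags & mask);
}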

Member Author

Sure thing. We do have a few cases where the second argument to CopyFlags would be multiple flags OR'd together, e.g. in Compiler::fgSplitBlockBeforeTree:
block->SetFlag(originalFlags & (BBF_SPLIT_GAINED | BBF_IMPORTED | BBF_GC_SAFE_POINT | BBF_LOOP_PREHEADER | BBF_RETLESS_CALL));
Would you be ok with this pattern?

Member

Something like:

block->CopyFlags(fromBlock, BBF_IMPORTED | BBF_GC_SAFE_POINT | BBF_LOOP_PREHEADER | BBF_RETLESS_CALL);

would be fine.

The more complex cases where we (1) grab the flags, (2) do some computations, then (3) set the flags, might be where the GetFlagsRaw would come in. So the example you reference might need to use that.

return !CheckFlag(flag, BBF_EMPTY);
}

void SetFlag(const BasicBlockFlags flag)
Member

Maybe it should be SetFlags/RemoveFlags?

return CheckFlag(flag, flag);
}

bool HasFlag(const BasicBlockFlags flag) const
Member

Should this assert that flag has only one bit set? And add an explicit HasAnyFlag that allows for the rare cases where you're checking if any flag is set? (actually, we should probably just make callers use HasFlag(one) || HasFlag(two) in that case)

Member Author

I think that's a good idea. It doesn't come up often, but we do have a few HasFlag(BBF_FLAG1 | BBF_FLAG2) calls, so maybe we can use the templated approach we use for BasicBlock::KindIs? If not, we can just use more than one call to HasFlag in the caller.

Member

I think it's better to make the caller use HasFlag(one) || HasFlag(two), or HasFlag(one) && HasFlag(two).

Yes, we could use the template magic to have HasAnyFlags(one,two,...) and HasAllFlags(one,two,...) but does that happen very often? IMO, we should not have HasFlag(one,two,...) because it's not clear if it's "any" or "all".

Member Author

Yes, we could use the template magic to have HasAnyFlags(one,two,...) and HasAllFlags(one,two,...) but does that happen very often?

I don't recall seeing it enough to justify introducing two more methods.

I think it's better to make the caller use HasFlag(one) || HasFlag(two), or HasFlag(one) && HasFlag(two).

Now that you mention the || vs && cases, I agree it's more readable to just have the caller make two calls.

Member Author

On second thought, there are some cases where multiple calls to HasFlag quickly becomes ugly, so I'll introduce HasAnyFlags to cover these scenarios (I haven't found a need for HasAllFlags anywhere).

Member

nit: in English, the preferred wording would be HasAnyFlag and HasAllFlags, IMO

}

/* Update the flags for block with those found in bNext */

- block->bbFlags |= (bNext->bbFlags & BBF_COMPACT_UPD);
+ block->SetFlag(bNext->bbFlags & BBF_COMPACT_UPD);
Member

Another case for CopyFlags

@@ -3026,7 +3026,7 @@ void Compiler::fgInsertFuncletPrologBlock(BasicBlock* block)
assert(nullptr == fgGetPredForBlock(block, newHead));
fgAddRefPred(block, newHead);

- assert((newHead->bbFlags & BBF_INTERNAL) == BBF_INTERNAL);
+ assert(newHead->CheckFlag(BBF_INTERNAL));
Member

This can probably be HasFlag. It's equivalent if the flag being checked is a single bit. That's probably the case for most/all of the uses of CheckFlag with a single flag.

Member Author

Thanks for pointing this out. I was able to get rid of CheckFlag altogether.

@amanasifkhalid
Member Author

With my suggested CopyFlags function, how many more cases are there?

The cases that require GetFlagsRaw and/or SetFlagsRaw fall into the following categories:

  • We want to cheaply assign bbFlags to some computed value, or to the flags of another block (we could introduce another version of CopyFlags to do the latter, but I think that might be confusing, since one version of CopyFlags would bitwise OR the flags with its arguments, and the other would just assign them). This occurs only a few (~3) times, for example when creating a duplicate block.
  • We want to copy a block's flags to a local variable, do some computations on the local variable, and then pass the computed flags to SetFlags/CopyFlags/etc, or use them as a conditional. A one-line version of the former looks like newBlock->CopyFlags(curr, (succ->GetFlagsRaw() & BBF_BACKWARD_JUMP));
  • We want to print a block's flags.

Overall, usage of the raw getters/setters is uncommon.
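(A hedged sketch of the raw "escape hatch" accessors described above; the exact shape is an assumption:)

// Sketch: raw accessors for the uncommon cases listed above.
BasicBlockFlags GetFlagsRaw() const
{
    return bbFlags;
}

void SetFlagsRaw(const BasicBlockFlags flags)
{
    bbFlags = flags;
}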

@jakobbotsch
Member

I don't have a strong opinion on a change like this, but I will say that I don't think the original pattern is verbose, and bitwise manipulation is such a primitive operation that I don't think this makes things clearer, only a bit inconsistent with other flags throughout the JIT.

If other people on the team prefer these wrapper methods, then I am ok with it.

@amanasifkhalid
Member Author

I don't think this makes things clearer, only a bit inconsistent with other flags throughout the JIT.

I agree this isn't a significant readability improvement, but it does fix the style inconsistencies in how we check the flags (use of implicit boolean conversion, number of parentheses, etc.).

I noticed from the last run that this change has unexpectedly large TP diffs. I suspect the templated definition of HasAnyFlag is to blame, and it would be cheaper to OR all the flags passed in first, and then check if bbFlags contains them with one AND instruction.
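(A sketch of the cheaper shape being described, assuming a C++17 variadic template in the style of BasicBlock::KindIs; the PR's actual implementation may differ:)

// Sketch: fold the arguments into one mask first, then test bbFlags with a single AND.
template <typename... F>
bool HasAnyFlag(const F... flags) const
{
    const BasicBlockFlags mask = (flags | ...);
    return (bbFlags & mask) != 0;
}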

Side note: I tried profiling the TP diff locally, but the recent changes to jiteeversionguid.h are causing the JIT to fail to initialize when SuperPMI tries to run a collection due to a version mismatch. Rebuilding the JIT/SuperPMI and redownloading the collections doesn't fix this. Any ideas?

@amanasifkhalid
Member Author

This all assumes HasFlag is inlined, which I would assume. But maybe that's a bad assumption?

I don't see why HasFlag wouldn't be inlined, considering its size and simplicity (though maybe I'm missing something). Adding the inline directive doesn't change any TP diffs locally (though that doesn't guarantee the compiler will inline it). I'll take a look at the disassembly...

@amanasifkhalid
Member Author

amanasifkhalid commented Nov 30, 2023

I'll take a look at the disassembly...

So the codegen for the BasicBlockFlags and bool implementations match godbolt; here they are, respectively:

mov         rax,qword ptr [rcx+30h]
and         rax,rdx
ret

test        qword ptr [rcx+30h],rdx
setne       al
ret

In 6 places in the JIT, neither version is inlined (though there are no instances where one is inlined and the other isn't). These calls are in larger methods like Compiler::impImportBlockCode (which is almost 5000 lines long), so maybe MSVC is exhausting its inlining budget? The inline directive doesn't change anything here. Similarly, SetFlags is called in 12 places and is not always inlined either. All the other helper methods are inlined everywhere.

We could use __forceinline to get around this. We have a few use cases of it, but I imagine this isn't ideal (though replacing these methods with macros probably wouldn't be any better). But regardless of how we address inlining, I'm still surprised I managed to make up most of the TP diff on Windows x64 by returning a BasicBlockFlags instead. I guess a mov and an and are that much cheaper?
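(For reference, a sketch of what force-inlining the helper might look like; the JIT sources use a FORCEINLINE macro per the later commit note, but whether the methods use that macro or __forceinline directly is an assumption here:)

// Sketch: force the tiny helper to be inlined even inside very large methods
// such as Compiler::impImportBlockCode, where MSVC otherwise skips it.
FORCEINLINE bool HasFlag(const BasicBlockFlags flag) const
{
    return (bbFlags & flag) != 0;
}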

@amanasifkhalid amanasifkhalid marked this pull request as draft November 30, 2023 02:38
@BruceForstall
Member

We could use __forceinline to get around this.

I think this is a good case for the use of __forceinline

I guess a mov and an and are that much cheaper?

Note that TP diffs purely count instructions; they are not a true measure of run-time performance difference. An instruction count is the best proxy we have for run-time perf differences: it is extremely low-noise and reproducible.

@amanasifkhalid amanasifkhalid marked this pull request as ready for review November 30, 2023 12:53
// by checking if it is a power of 2
// (HasFlag expects to check only one flag at a time)
assert(isPow2(flag));
return (bool)(bbFlags & flag);
Member

Does this get us into odd unnormalized bools territory? Why not (bbFlags & flag) != 0?

Member Author

Good point. Though it looks like the codegen is still worse when the boolean version of this method is inlined (diffs). I'll take a look at the inlined instances and share the codegen here.

@amanasifkhalid
Member Author

I think this is a good case for the use of __forceinline

This seemed to fix the remaining diffs on x86.

Note that TP diffs purely count instructions; they are not a true measure of run-time performance difference.

I see, thanks for pointing that out. We're still seeing TP regressions with the inlined boolean implementation of the method. I can confirm that the boolean implementation results in a larger clrjit.dll disassembly. To get an idea of why, I looked at disassemblies of Compiler::fgHasCycleWithoutGCSafePoint (which has two calls to HasFlag, where the method is used as a predicate) with the two different implementations. Both calls to HasFlag are inlined, but the codegen is slightly different. For the BasicBlockFlags implementation, HasFlag is inlined into a bt, while for the boolean implementation, HasFlag is inlined into a shr and a test. So we are executing more instructions with the boolean approach.

If everyone's ok with the slightly less intuitive code, I'm going to revert to the BasicBlockFlags implementation for the better TP.

@BruceForstall
Member

So the implementation with __forceinline and bool return (bbFlags & flag) != 0 has a very slight (+0.02%) regression, but notably no regression on linux/x64 (built with clang)? I'm inclined to say we should just accept that and go for the intuitive implementation.

@amanasifkhalid
Member Author

amanasifkhalid commented Nov 30, 2023

So the implementation with __forceinline and bool return (bbFlags & flag) != 0 has a very slight (+0.02%) regression

Ah sorry, I linked the wrong diffs. It's bigger than that.

@amanasifkhalid
Member Author

amanasifkhalid commented Nov 30, 2023

Also while looking at the disassembly of large methods like impImportBlockCode, I found other small methods like BasicBlock::JumpsToNext not being inlined. I'm thinking of opening a followup PR where we __forceinline these methods to see if we can claw back some more TP (we could also chop up these large methods to fit them in MSVC's inlining budget, though that seems more daunting). What do we think?

@BruceForstall
Member

With the current variation that is running diffs, would you expect TP to universally improve, since there are lots of places using ((block->bbFlags & flag) != 0) that will now be doing (inlined) (block->bbFlags & flag) which you've shown has fewer instructions?

@BruceForstall
Member

I'm thinking of opening a followup PR where we __forceinline these methods to see if we can claw back some more TP (we could also chop up these large methods to fit them in MSVC's inlining budget, though that seems more daunting). What do we think?

Seems like a good experiment to try (the __forceinline). Splitting up large methods is generally "hard", but can be worthwhile. Probably best done while working in an area, not as a "go around breaking up large methods" task.

What really matters is the profile-feedback-based Release build. Note that TP diffs, while they use Release builds, disable the compiler's use of profile feedback to ensure an "apples to apples" comparison that doesn't bias the baseline (which presumably has more accurate profile data).

@amanasifkhalid
Member Author

With the current variation that is running diffs, would you expect TP to universally improve

Good question. This previous run is with __forceinline and returning (bbFlags & flag). There are only some slight TP improvements on win-x86 when targeting Linux ARM. I'll take a look at the codegen with the original ((block->bbFlags & flag) != 0) approach to see if there's any difference.

@amanasifkhalid
Member Author

I checked the disassembly of Compiler::fgHasCycleWithoutGCSafePoint with the HasFlag calls replaced with ((block->bbFlags & flag) != 0), and the codegen is identical. However, there are some odd codegen diffs elsewhere that result in a larger assembly overall with the HasFlag changes. I've attached a snippet below; left is without this PR's changes, and right is with them (sorry for the excessive diff highlighting):

[screenshot: side-by-side disassembly diff from CodeGen::genCodeForBBlist]

That snippet is from CodeGen::genCodeForBBlist; looking at the changes I made there, nothing stands out to me as responsible for this diff. So maybe just some odd MSVC behavior?

@BruceForstall
Member

I looked into the largest Windows TP regression in the "bool" returning "(bbFlags & flags) != 0" implementation, and it looks like a CQ limitation of the MSVC compiler: the clang/gcc compilers optimize this better when inlined. I opened https://developercommunity.visualstudio.com/t/Sub-optimal-code-for-inlined-bool-functi/10535537 to request the MSVC team look at improving this.

I'm ok merging this with the BasicBlockFlags return and maybe doing a separate PR to convert it to bool (I'm ok with the TP regression there, too: it's about 0.2%).

Fix 2 bugs in importer with flags conversion to functions.

Implement full `CopyFlags` to eliminate some GetFlagsRaw cases.

For BasicBlockFlags operators, mark them as FORCEINLINE. They have
always been expected to be fully inlined, but I found some cases
in Release builds where they were not.
@BruceForstall
Member

@amanasifkhalid I submitted a PR to update this PR, with fixes to a couple bugs I noticed while reviewing, as well as adding some additional FORCEINLINE cases (that unfortunately don't materially improve measured TP):

amanasifkhalid#3

Implement HasAllFlags and fix a couple bugs
@amanasifkhalid
Member Author

@BruceForstall thanks for investigating this further, and for the fixes! Hopefully we'll see an MSVC fix for this soon.

Member

@BruceForstall BruceForstall left a comment

LGTM if tests pass

@BruceForstall
Member

Diffs

No asm diffs (as expected). No TP diffs on Linux. Small TP improvements to small regressions on Windows (-0.01% to +0.02%).

@amanasifkhalid
Member Author

Failures are known NativeAOT issues.

@amanasifkhalid amanasifkhalid merged commit 425cfb9 into dotnet:main Dec 7, 2023
124 of 139 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jan 6, 2024