Lower TEST(x, LSH(1, y)) to BT(x, y) #13626
Conversation
jit-diff fx summary:
It's obvious from the diff that this pattern doesn't occur often in the framework. Still, this pattern may occur in code that makes use of bit vectors for optimization purposes and where good performance is expected. For example, in some cases the speedup can grow to 3x:

```csharp
int y = 0;
int x = 0;
while (x++ < 10000000) {
    if ((x & (1 << y)) == 0) {
        y++;
    }
}
return y;
```

generates:

```
G_M1752_IG01:
       50           push     rax
G_M1752_IG02:
       33C0         xor      eax, eax
       33D2         xor      edx, edx
       EB15         jmp      SHORT G_M1752_IG04
G_M1752_IG03:
       BA01000000   mov      edx, 1
       8BC8         mov      ecx, eax
       D3E2         shl      edx, cl
       8B4C2404     mov      ecx, dword ptr [rsp+04H]
       85CA         test     ecx, edx
       8BD1         mov      edx, ecx
       7502         jne      SHORT G_M1752_IG04
       FFC0         inc      eax
G_M1752_IG04:
       8D4A01       lea      ecx, [rdx+1]
       894C2404     mov      dword ptr [rsp+04H], ecx
       81FA80969800 cmp      edx, 0x989680
       7CDC         jl       SHORT G_M1752_IG03
G_M1752_IG05:
       4883C408     add      rsp, 8
       C3           ret
```

BT version:

```
G_M1752_IG01:
G_M1752_IG02:
       33C0         xor      eax, eax
       33D2         xor      edx, edx
       EB09         jmp      SHORT G_M1752_IG04
G_M1752_IG03:
       0FA3C1       bt       ecx, eax
       8BD1         mov      edx, ecx
       7202         jb       SHORT G_M1752_IG04
       FFC0         inc      eax
G_M1752_IG04:
       8D4A01       lea      ecx, [rdx+1]
       81FA80969800 cmp      edx, 0x989680
       7CEC         jl       SHORT G_M1752_IG03
G_M1752_IG05:
       C3           ret
```

The fact that BT doesn't need a particular bit index register (ECX) nor a temporary register to compute the bit mask into seems to help a lot.
@dotnet-bot test Tizen armel Cross Debug Build
@dotnet-bot test Windows_NT x64 corefx_baseline
@russellhadley We talked about generating BT in the past; here's one use of it that appears to be useful for both performance and code size reasons.
@dotnet-bot test Tizen armel Cross Debug Build
@DrewScoggins @jorive who owns the Perf Tests correctness? Can you help with this:
@dotnet/jit-contrib it would be great to get this in to unblock sharing more of String with CoreRT.
@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness
@danmosemsft I'm re-running the tests. There was a bug in the PR test generator that was fixed this morning.
It's unfortunate to see all the special-casing required in emitxarch.cpp, but it's not obvious to me how to avoid it. Any idea whether there is any throughput impact?
LGTM, but would like to see additional reviews.
src/jit/gtlist.h
Outdated
```
@@ -226,6 +226,12 @@ GTNODE(JCC , GenTreeCC ,0,GTK_LEAF|GTK_NOVALUE) // Check
GTNODE(SETCC , GenTreeCC ,0,GTK_LEAF) // Checks the condition flags and produces 1 if the condition specified
                                      // by GenTreeCC::gtCondition is true and 0 otherwise.

#ifdef _TARGET_XARCH_
GTNODE(BT  , GenTreeOp ,0,GTK_BINOP|GTK_NOVALUE)
GTNODE(BTC , GenTreeOp ,0,GTK_BINOP)
```
I would prefer that we not add the additional forms as GenTree operators until they are going to be used.
Also, it would be good to add comments describing them.
I would prefer that we not add the additional forms as GenTree operators until they are going to be used.
Will do, they're easy to add back if needed. And it may turn out that they're never needed because they have worse perf characteristics on AMD CPUs. I'm not sure it's worth the trouble to make the JIT use them only on Intel CPUs.
src/jit/lowerxarch.cpp
Outdated
```
// The BT instruction family supports "reg/mem,reg" and "reg/mem,imm" forms.
// However, the "mem,reg" form has a slightly different semantic than the other forms,
// it treats the memory as an array indexed by bit_index / 32/64. This is rarely useful
// as these instructions are tipically produced by transforming shift instructions and
```
Typo: tipically => typically
Indeed it is. When I started doing this (it's been a while and I don't remember the exact details) I thought it would be easy because, like shifts, it too has a one-byte immediate operand. But then shifts require special casing too, and BT needs even more (AFAIR partly because it has a 0F prefix). If I had time I'd refactor the whole emitter; it's long past its design limit (assuming there ever was a design to begin with). It appears that the original idea was to be "data driven", but the instruction table is far too limited to avoid a bunch of special casing in the code. As for performance, corelib crossgen timing doesn't show any change and I've yet to figure out a way to measure things like retired instruction count.
Here's what I've used (courtesy @pgavlin) to measure it for corelib crossgen on Linux:

```
./build.sh x64 release skiptests ninja nopgooptimize
valgrind --tool=callgrind --dump-instr=yes --callgrind-out-file=callgrind.out ./bin/Product/Linux.x64.Release/crossgen /Platform_Assemblies_Paths ./bin/Product/Linux.x64.Release/IL ./bin/Product/Linux.x64.Release/IL/System.Private.CoreLib.dll
kcachegrind callgrind.out
```

And I haven't tried it myself, but @AndyAyersMS has https://github.com/AndyAyersMS/InstructionsRetiredExplorer for measuring the same on Windows.
I know but I don't have enough disk space to install Linux. And AFAIK callgrind doesn't work in a VM...
Ah, xperf + pmc, I was just looking into that, thanks for the pointer!
You can also use Intel 'pin', configured to count instructions: https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool
The instructions retired explorer may require very new xperfs (not 100% sure they've even shipped). Also you can't be running on a box with Hyper-V enabled. And you need to run as admin.
You might be able to enable the PMC ETL event mode via perfview instead. Haven't tried....
Yep, I know that one too. I've yet to get it to build; AFAIR last time I tried it couldn't find the C compiler.
Indeed, it looks like the xperf I have doesn't accept
So around a 0.2% increase in instructions retired, assuming the numbers are correct. See this Excel file for some more details: https://1drv.ms/x/s!Av4baJYSo5pjgrhsTDG39GBXVvRdOA. For example, the reported time does look reasonable. 0.2% is a bit much and it's quite unfortunate that adding a new instruction has such an effect. I'll see if I can do something about it, though I'm not very optimistic when it comes to emitter code...
Probably the best option to speed up the emitter is to add special casing only where it is really needed - in And if only
I replaced all the special casing in the emitter with comments and asserts. It's perhaps a bit of a hack, but the added complexity and the throughput hit are just not worth it considering the low usage of BT.
OSX build failed due to some Jenkins issue
@dotnet-bot test OSX10.12 x64 Checked Build and Test
OSX build got stuck... @dotnet-bot test OSX10.12 x64 Checked Build and Test
@CarolEidt good to merge it looks like?
@dotnet/jit-contrib This LGTM, but I think a second set of eyes on this would be good.
```
@@ -163,6 +163,7 @@ void genCodeForShiftLong(GenTreePtr tree);

#ifdef _TARGET_XARCH_
void genCodeForShiftRMW(GenTreeStoreInd* storeInd);
void genCodeForBT(GenTreeOp* bt);
```
You may want to call this BitTest rather than BT, since we will likely want this for other architectures eventually.
ARM doesn't seem to have anything like BT. ARM64 has TBZ/TBNZ, but that's quite different from BT and I'm not even sure how we could represent it in the JIT's IR; it's a rather special case of conditional branch, maybe a JTRUE with a contained TEST_EQ/NE operand.
Arm64 has tst:

```
tst w24, #24
beq G_M55607_IG12
```

Implemented with PR #13799.
Note:
Test bits (immediate), setting the condition flags and discarding the result: Rn AND imm.
This instruction is an alias of the ANDS (immediate) instruction.
ARM64's tst is similar to x64's test and is represented in IR by GT_TEST_EQ and GT_TEST_NE. BT is a different thing.
What is the difference between these two?
Any bit (0-63) can be tested using Arm64 TST instruction.
The main difference seems to be that the Intel instruction sets the CF and the Arm instruction sets the ZF. (And that a memory operand is allowed on Intel)
Any bit (0-63) can be tested using Arm64 TST instruction.
Yes, but the bit to test is specified by an immediate value. Presumably you can do something like `tst w24, w23`, but then w23 needs to contain a bit mask, not a bit index like in the case of BT.
@dotnet-bot test Windows_NT arm64 Checked
@mikedn
Do not feel obligated to wait for arm64 tests to complete.
I had the same question, and was thinking that perhaps it is time to extract the XARCH-specific code into lowerxarch.cpp
Many of my recent changes are enabling sections of this LowerCompare XARCH block for arm64.
Ah, very good then!
Rebased and fixed conflicts. As it happens, it is now obvious that the code is inside an ifdef.
@mikedn are there any plans to add similar changes for
I did experiment with those and I have some code that handles things like

```
mov eax, [mem]
bts eax, edx
mov [mem], eax
```

This makes these instructions slightly less useful compared to
And interestingly, most native compilers seem to avoid these instructions. AFAIR the only native compiler that I've seen using
I'm happy to run benchmarks on a Ryzen; I've got a 1800X and a 1950X (Threadripper).
Fair enough. If alternative code tends to perform better across all platforms, it's probably better to keep that.
Thanks, then I'll try to put something together when I get some time, next weekend perhaps. I should add that adding support for SHLX/SHRX/SARX might be more useful. On Intel these have better perf than normal shifts while on AMD they have identical perf, so we could use them on all CPUs that support BMI2. They don't require the shift count to be in
@mikedn I used your example in my blog post about how to generate the disassembly of .NET functions and how to diff many of them with BenchmarkDotNet.