Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Lower TEST(x, LSH(1, y)) to BT(x, y) #13626

Merged
merged 3 commits into from Sep 14, 2017
Merged

Lower TEST(x, LSH(1, y)) to BT(x, y) #13626

merged 3 commits into from Sep 14, 2017

Conversation

mikedn
Copy link

@mikedn mikedn commented Aug 28, 2017

No description provided.

@mikedn
Copy link
Author

mikedn commented Aug 28, 2017

jit-diff fx summary:

Total bytes of diff: -67 (0.00% of base)
    diff is an improvement.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file improvements by size (bytes):
         -34 : Microsoft.CodeAnalysis.dasm (0.00% of base)
         -16 : System.Collections.dasm (-0.01% of base)
         -11 : System.Private.CoreLib.dasm (0.00% of base)
          -6 : System.Text.Encoding.CodePages.dasm (-0.01% of base)
4 total files with size differences (4 improved, 0 regressed), 75 unchanged.
Top method improvements by size (bytes):
         -20 : Microsoft.CodeAnalysis.dasm - MetadataSizes:GetTableSize(ubyte,int):int:this
         -19 : System.Private.CoreLib.dasm - String:IsCharBitSet(long,ubyte):bool
         -16 : System.Collections.dasm - BitArray:Get(int):bool:this
         -14 : Microsoft.CodeAnalysis.dasm - MetadataSizes:IsPresent(ubyte):bool:this
7 total methods with size differences (7 improved, 0 regressed), 66286 unchanged.```

@mikedn
Copy link
Author

mikedn commented Aug 28, 2017

It's obvious from the diff that this pattern doesn't occur often in the framework. Still, this pattern may occur in code that makes use of bit vectors for optimization purposes and where good performance is expected.

For example, the managed implementation of String.IndexOfAny in dotnet/corert#4395 is ~1.2x faster when using BT.

And in some cases the speedup can grow to 3x:

int y = 0;
int x = 0;
while (x++ < 10000000) {
    if ((x & (1 << y)) == 0) {
        y++;
    }
}
return y;

generates:

G_M1752_IG01:
       50                   push     rax
G_M1752_IG02:
       33C0                 xor      eax, eax
       33D2                 xor      edx, edx
       EB15                 jmp      SHORT G_M1752_IG04
G_M1752_IG03:
       BA01000000           mov      edx, 1
       8BC8                 mov      ecx, eax
       D3E2                 shl      edx, cl
       8B4C2404             mov      ecx, dword ptr [rsp+04H]
       85CA                 test     ecx, edx
       8BD1                 mov      edx, ecx
       7502                 jne      SHORT G_M1752_IG04
       FFC0                 inc      eax
G_M1752_IG04:
       8D4A01               lea      ecx, [rdx+1]
       894C2404             mov      dword ptr [rsp+04H], ecx
       81FA80969800         cmp      edx, 0x989680
       7CDC                 jl       SHORT G_M1752_IG03
G_M1752_IG05:
       4883C408             add      rsp, 8
       C3                   ret

BT version:

G_M1752_IG01:
G_M1752_IG02:
       33C0                 xor      eax, eax
       33D2                 xor      edx, edx
       EB09                 jmp      SHORT G_M1752_IG04
G_M1752_IG03:
       0FA3C1               bt       ecx, eax
       8BD1                 mov      edx, ecx
       7202                 jb       SHORT G_M1752_IG04
       FFC0                 inc      eax
G_M1752_IG04:
       8D4A01               lea      ecx, [rdx+1]
       81FA80969800         cmp      edx, 0x989680
       7CEC                 jl       SHORT G_M1752_IG03
G_M1752_IG05:
       C3                   ret

The fact that BT doesn't need a particular bit index register (ECX) nor a temporary register to compute the bit mask into seems to help a lot.

@mikedn
Copy link
Author

mikedn commented Aug 28, 2017

@dotnet-bot test Tizen armel Cross Debug Build
test Windows_NT x64 corefx_baseline

@mikedn
Copy link
Author

mikedn commented Aug 29, 2017

@dotnet-bot test Windows_NT x64 corefx_baseline

@mikedn
Copy link
Author

mikedn commented Aug 29, 2017

@russellhadley We talked about generating BT in the past, here's one use of it that appears to be useful for both performance and code size reasons.

@mikedn
Copy link
Author

mikedn commented Aug 30, 2017

@dotnet-bot test Tizen armel Cross Debug Build

@danmoseley
Copy link
Member

@DrewScoggins @jorive who owns the Perf Tests correctness? Can you help with this:

dotnet-bot@WIN2012-2005B0 "D:\j\workspace\perf_perflab_---1c2b0529\bin\sandbox"
[Tue 08/29/2017][11:02:35] $ py.exe "D:\j\workspace\perf_perflab_---1c2b0529\Microsoft.Benchview.JSONFormat\tools\submission.py" measurement.json  --build "D:\j\workspace\perf_perflab_---1c2b0529\build.json" --machine-data "D:\j\workspace\perf_perflab_---1c2b0529\machinedata.json" --metadata "D:\j\workspace\perf_perflab_---1c2b0529\submission-metadata.json" --group "CoreCLR" --type "private" --config-name "Release" --config Configuration "Release" --config OS "Windows_NT" --config Profile "Off" --config OptLevel "full_opt" --config JitName  "ryujit" --architecture "x86" --machinepool "PerfSnake"
[2017-08-29 11:02][ERROR] <class 'ValueError'>: Specified submission-metadata json file "D:\j\workspace\perf_perflab_---1c2b0529\submission-metadata.json" does not exist.
[2017-08-29 11:02][ERROR] Traceback (most recent call last):
  File "D:\j\workspace\perf_perflab_---1c2b0529\Microsoft.Benchview.JSONFormat\tools\submission.py", line 121, in main
    submission = read_submission_metadata(args['metadata'])
  File "D:\j\workspace\perf_perflab_---1c2b0529\Microsoft.Benchview.JSONFormat\tools\benchview\format\JSONReaders.py", line 70, in read_submission_metadata
    raise ValueError('Specified submission-metadata json file "{0}" does not exist.'.format(filename))
ValueError: Specified submission-metadata json file "D:\j\workspace\perf_perflab_---1c2b0529\submission-metadata.json" does not exist.

@danmoseley
Copy link
Member

@dotnet/jit-contrib it would be great to get this in to unblock sharing more of String with CoreRT.

@jorive
Copy link
Member

jorive commented Aug 30, 2017

@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness
@dotnet-bot test Windows_NT x86 full_opt legacy_backend CoreCLR Perf Tests Correctness
@dotnet-bot test Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness

@jorive
Copy link
Member

jorive commented Aug 30, 2017

@danmosemsft I'm rerunning the tests. There was a bug on the PR test generator that was fixed this morning.
Sorry for the inconvenience,
-José

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unfortunate to see all the special-casing required in emitxarch.cpp, but it's not obvious to me how to avoid it. Any idea whether there is any throughput impact?
LGTM, but would like to see additional reviews.

src/jit/gtlist.h Outdated
@@ -226,6 +226,12 @@ GTNODE(JCC , GenTreeCC ,0,GTK_LEAF|GTK_NOVALUE) // Check
GTNODE(SETCC , GenTreeCC ,0,GTK_LEAF) // Checks the condition flags and produces 1 if the condition specified
// by GenTreeCC::gtCondition is true and 0 otherwise.

#ifdef _TARGET_XARCH_
GTNODE(BT , GenTreeOp ,0,GTK_BINOP|GTK_NOVALUE)
GTNODE(BTC , GenTreeOp ,0,GTK_BINOP)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer that we not add the additional forms as GenTree operators until they are going to be used.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it would be good to add comments describing them.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer that we not add the additional forms as GenTree operators until they are going to be used.

Will do, they're easy to add back if needed. And it may turn out that they're never needed because they have worse perf characteristics on AMD CPUs. I'm not sure it's worth the trouble to make the JIT use them only on Intel CPUs.

// The BT instruction family supports "reg/mem,reg" and "reg/mem,imm" forms.
// However, the "mem,reg" form has a slightly different semantic than the other forms,
// it treats the memory as an array indexed by bit_index / 32/64. This is rarely useful
// as these instructions are tipically produced by transforming shift instructions and

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: tipically => typically

@mikedn
Copy link
Author

mikedn commented Aug 31, 2017

It's unfortunate to see all the special-casing required in emitxarch.cpp, but it's not obvious to me how to avoid it

Indeed it is. When I started doing this (it's been a while and I don't remember the exact details) I thought it would be easy because like shifts it too has a one byte immediate operand. But then shift requires special casing too and BT needs even more (AFAIR partly because it has a 0F prefix). If I had time I'd refactor the whole emitter, it's long past its design limit (assuming there ever was a design to begin with). It appears that the original idea was to be "data driven" but the instruction table is far too limited to avoid a bunch of special casing in the code.

As for performance, corelib crossgen timing doesn't show any change and I've yet to figure out a way to measure things like retired instruction count.

@JosephTremoulet
Copy link

I've yet to figure out a way to measure things like retired instruction count

Here's what I've used (courtesy @pgavlin) to measure it for corelib crossgen on Linux:

./build.sh x64 release skiptests ninja nopgooptimize
valgrind --tool=callgrind --dump-instr=yes --callgrind-out-file=callgrind.out ./bin/Product/Linux.x64.Release/crossgen /Platform_Assemblies_Paths ./bin/Product/Linux.x64.Release/IL ./bin/Product/Linux.x64.Release/IL/System.Private.CoreLib.dll
kcachegrind callgrind.out

And I haven't tried it myself, but @AndyAyersMS has https://github.com/AndyAyersMS/InstructionsRetiredExplorer for measuring the same on Windows.

@mikedn
Copy link
Author

mikedn commented Aug 31, 2017

measure it for corelib crossgen on Linux:

I know but I don't have enough disk space to install Linux. And AFAIK callgrind doesn't work in a VM...

https://github.com/AndyAyersMS/InstructionsRetiredExplorer for measuring the same on Windows.

Ah, xperf + pmc, I was just looking into that, thanks for the pointer!

@BruceForstall
Copy link
Member

You can also use Intel 'pin', configured to count instructions: https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool

@AndyAyersMS
Copy link
Member

The instructions retired explorer may require very new xperfs (not 100% sure they've even shipped). Also you can't be running on a box with Hyper-V enabled. And you need to run as admin.

@AndyAyersMS
Copy link
Member

You might be able to enable the PMC ETL event mode via perfview instead. Haven't tried....

@mikedn
Copy link
Author

mikedn commented Sep 1, 2017

You can also use Intel 'pin', configured to count instructions

Yep, I know that one too. Yet to get it to build, AFAIR last time I tried it couldn't find the C compiler.

The instructions retired explorer may require very new xperfs (not 100% sure they've even shipped)

Indeed, it looks like the xperf I have doesn't accept -pmcprofile. But there's tracelog that does accept -pmc. But then InstructionsRetiredExplorer can't seem to find any PerfInfo/PMCSample events in the merged etl file. Yet to figure out why but I tried Bruce Dawson's PMC event parsing script from https://github.com/google/UIforETW/tree/master/LabScripts/ETWPMCDemo and got this:

Process          Ins Retired     Branch Ins     Branch Miss
crossgen.exe     18,067,358,073  3,890,472,755  233,555,507
crossgen.exe     18,062,612,600  3,889,872,528  233,447,303
crossgen.exe     18,060,606,313  3,889,366,433  233,538,253
crossgen.exe     18,078,099,445  3,892,123,129  233,823,258
crossgen.exe     18,064,848,866  3,890,080,253  233,779,378
crossgen_bt.exe  18,102,982,290  3,898,831,420  236,963,113
crossgen_bt.exe  18,110,508,956  3,901,657,373  237,138,253
crossgen_bt.exe  18,099,535,973  3,898,380,857  236,903,141
crossgen_bt.exe  18,095,406,195  3,897,302,086  236,967,909
crossgen_bt.exe  18,099,172,364  3,898,696,667  236,855,791

So around 0.2% increase in instructions retired, assuming the numbers are correct. See this Excel file for some more details: https://1drv.ms/x/s!Av4baJYSo5pjgrhsTDG39GBXVvRdOA. For example, the reported time does look reasonable.

0.2% is a bit much and it's quite unfortunate that adding a new instruction has such an effect. I'll see if I can do something about it though I'm not very optimistic when it comes to emitter code...

@mikedn
Copy link
Author

mikedn commented Sep 4, 2017

Probably the best option to speed up the emitter is to add special casing only where it is really needed - in emitOutputRR - because only BT reg, reg is currently needed. BT reg, imm and BT mem, imm can be useful but they're probably not worth the trouble. BT mem, reg is completely pointless.

And if only BT reg, reg is needed then it turns out that it's possible to avoid all emitter special casing if cheating is an option. Special casing is needed because:

  • The existing code path attempts to set bits 0 and 1 of the opcode. For BT these bits have no special meaning but it happens so that its opcode is A3 so we can cheat - just set them unconditionally.
  • BT's registers operand are swapped - BT ecx, edx has a different encoding than ADD ecx, edx. Here we can cheat and have CodeGen call emitIns_R_R with swapped operands. It's ugly but it works.

@mikedn
Copy link
Author

mikedn commented Sep 7, 2017

I replaced all the special casing from the emitter with comments and asserts. It's perhaps a bit of a hack but the added complexity and the throughput hit is just not worth it considering the low usage of BT.

@mikedn
Copy link
Author

mikedn commented Sep 8, 2017

OSX build failed due to some Jenkins issue

00:14:17 hudson.remoting.RemotingSystemException: java.util.concurrent.ExecutionException: Invalid object ID 389 iota=390
00:14:17 	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:268)
00:14:17 	at com.sun.proxy.$Proxy169.join(Unknown Source)
00:14:17 	at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:991)
00:14:17 	at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:135)
00:14:17 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:95)

@dotnet-bot test OSX10.12 x64 Checked Build and Test

@mikedn
Copy link
Author

mikedn commented Sep 9, 2017

OSX build got stuck...

@dotnet-bot test OSX10.12 x64 Checked Build and Test

@danmoseley
Copy link
Member

@CarolEidt good to merge it looks like?

@CarolEidt
Copy link

@dotnet/jit-contrib This LGTM, but I think a second set of eyes on this would be good.

@@ -163,6 +163,7 @@ void genCodeForShiftLong(GenTreePtr tree);

#ifdef _TARGET_XARCH_
void genCodeForShiftRMW(GenTreeStoreInd* storeInd);
void genCodeForBT(GenTreeOp* bt);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to call this BitTest rather than BT, since we will likely want this for other architectures eventually.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARM doesn't seem to have anything like BT. ARM64 has TBZ/TBNZ but that's quite different from BT and I'm not even sure how we can represent that in JIT's IR, it's a rather special case of conditional branch, maybe a JTRUE with a contained TEST_EQ/NE operand.

Copy link

@briansull briansull Sep 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arm64 has tst:

  tst     w24, #24
  beq     G_M55607_IG12

Implemented with PR #13799

Note:

Test bits (immediate), setting the condition flags and discarding the result: Rn AND imm
This instruction is an alias of the ANDS (immediate) instruction.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARM64's tst is similar to x64's test and represented in IR by GT_TEST_EQ and GT_TEST_NE. BT is a different thing.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between these two?
Any bit (0-63) can be tested using Arm64 TST instruction.

The main difference seems to be that the Intel instruction sets the CF and the Arm instruction sets the ZF. (And that a memory operand is allowed on Intel)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any bit (0-63) can be tested using Arm64 TST instruction.

Yes, but the bit to test is specified by an immediate value. Presumably you can do something like tst w24, w23 but then w23 needs to contain a bit mask, not a bit index like in the case of BT.

@sdmaclea
Copy link

@dotnet-bot test Windows_NT arm64 Checked

@sdmaclea
Copy link

sdmaclea commented Sep 12, 2017

@mikedn It is not obvious to me how this is disabled for arm64. I do notsee the #ifdef in LowerCompare() Is it just inside another large block? Never mind it is just in the large TARGET_XARCH block.

@sdmaclea
Copy link

Do not feel obligated to wait for arm64 tests to complete.

@CarolEidt
Copy link

Never mind it is just in the large TARGET_XARCH block.

I had the same question, and was thinking that perhaps it is time to extract the XARCH-specific code into lowerxarch.cpp

@sdmaclea
Copy link

sdmaclea commented Sep 12, 2017

@CarolEidt

extract the XARCH-specific code into lowerxarch.cpp

Many of my recent changes are enabling sections of this LowerCompare XARCH block for arm64.

@CarolEidt
Copy link

Many of my recent changes are enabling sections of this LowerCompare XARCH block for arm64.

Ah, very good then!

@mikedn
Copy link
Author

mikedn commented Sep 13, 2017

Rebased and fixed conflicts. It happens so that now it is obvious that the code is inside an ifdef.

@tannergooding
Copy link
Member

@mikedn are there any plans to add similar changes for bts, btr, and btc?

@mikedn
Copy link
Author

mikedn commented Sep 13, 2017

are there any plans to add similar changes for bts, btr, and btc?

I did experiment with those and I have some code that handles things like x |= (1 << y). The are some problems:

  • These instructions have a mem, reg form but it has poor performance and problematic behavior. That means that when x is a memory location we need to generate something like
mov eax, [mem]
bts eax, edx
mov [mem], eax

This makes these instructions slightly less useful compared to BT (which only needs an extra mov in such cases).

  • On AMD CPUs (Ryzen really as probably nobody cares about the rest) these instructions have slightly worse perf characteristics than shift instructions. On Intel CPUs it's the other way around. This makes things complicated - we either need to carefully handle only those patterns that work well on all CPUs or use these instructions only on some CPUs. The former might require running benchmarks on a Ryzen (and I don't have one), the later might complicate testing and I'm not sure if the team will agree to that.

And interestingly, most native compilers seem to avoid these instructions. AFAIR the only native compiler that I've seen using bts & co. is VC++.

@tannergooding
Copy link
Member

The former might require running benchmarks on a Ryzen (and I don't have one)

I'm happy to run benchmarks on a Ryzen, I've got a 1800X and a 1950X (Threadripper)

@tannergooding
Copy link
Member

These instructions have a mem, reg form but it has poor performance and problematic behavior. That means that when x is a memory location we need to generate something like

Fair enough. If alternative code tends to perform better across all platforms, its probably better to keep that.

@mikedn
Copy link
Author

mikedn commented Sep 13, 2017

I'm happy to run benchmarks on a Ryzen, I've got a 1800X and a 1950X (Threadripper)

Thanks, then I'll try to put something together when I get some time, next weekend perhaps.

I should add that adding support for SHLX/SHRX/SARX might be more useful. On Intels these have better perf than normal shifts while on AMDs they have identical perf so we can use them on all CPUs that support BMI2. They don't require the shift count to be in ECX (and as seen in an example above this may help with register allocation) and they have a 3 operand form (though I'm not sure if the JIT can fully take advantage of that).

@CarolEidt CarolEidt merged commit f386cd6 into dotnet:master Sep 14, 2017
@mikedn mikedn deleted the test-bt branch December 16, 2017 09:17
@WojciechNagorski
Copy link

@mikedn I used your example in my blog about how to generate the disassembly of .NET functions and how to diff many of them with BenchmarkDotNet.

@WojciechNagorski
Copy link

Link: https://wojciechnagorski.com/2019/01/generates-disassembly-of-.net-functions 😄

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet