JIT Performance regression between 1.1 and 2.0 and between 2.0 and 2.1 #10466
cc: @dotnet/jit-contrib |
I'll take a quick look. Given that it is an x86 regression, my guess is it might be similar to what we see over in #9833. |
Thanks, @AndyAyersMS. I did some further optimization of that inner loop code in my main project that altered results slightly. Don't know if it will help track down the regression. Basically, I added more local variables to hold the data block being hashed. Of course those end up getting spilled, but having them and the accumulator variables in the same stack memory area speeds things up slightly. I also re-ordered some of the operations to reduce register dependencies. Anyway, the updated version of the function now performs the same on 1.1 and 2.0 on the x64 runtimes but has a consistent 4% slowdown in 2.1. x86 shows the same major slowdown between 1.1 and 2.0 and again between 2.0 and 2.1.
|
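The optimization described in the comment above (holding the message block in locals so spills land in one contiguous stack area, and interleaving independent operations) can be illustrated with a toy C sketch. This is hypothetical illustration code, not the real Blake2b implementation; only the two constants are the actual first two Blake2b IV words.

```c
#include <stdint.h>

/* Toy sketch (hypothetical, not the real Blake2b code) of the optimization
   described above: copy the message block into locals up front so any spills
   land in one contiguous stack area, and interleave the two accumulators so
   consecutive operations don't depend on each other's results. */
static uint64_t toy_mix(const uint64_t *block /* 4 words */) {
    /* locals holding the data block being hashed */
    uint64_t m0 = block[0], m1 = block[1], m2 = block[2], m3 = block[3];
    /* accumulators, seeded with the first two Blake2b IV words */
    uint64_t a = 0x6A09E667F3BCC908ULL;
    uint64_t b = 0xBB67AE8584CAA73BULL;
    a += m0; b += m1;   /* independent pair: no register dependency */
    a ^= m2; b ^= m3;   /* independent pair: no register dependency */
    return a ^ b;
}
```

The point of the interleaving is that `a += m0` and `b += m1` have no data dependence on each other, so an out-of-order core can issue them in the same cycle.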
One would hope x86 is only ~2x slower than x64, since the primary data type is long/ulong. But no... Looking at x86, 2.0 vs 2.1, there is a large size regression. Likely RA.
Suffice to say this is something of a torture test for an x86 allocator, leading to long stretches of stack shuffling:

89B574FCFFFF mov dword ptr [ebp-38CH], esi
8BB578FCFFFF mov esi, dword ptr [ebp-388H]
81F66BBD41FB xor esi, 0xFB41BD6B
89B578FCFFFF mov dword ptr [ebp-388H], esi
8BB574FCFFFF mov esi, dword ptr [ebp-38CH]
81F6ABD9831F xor esi, 0x1F83D9AB
89B574FCFFFF mov dword ptr [ebp-38CH], esi
8BB578FCFFFF mov esi, dword ptr [ebp-388H]
89B594FEFFFF mov dword ptr [ebp-16CH], esi
8BB574FCFFFF mov esi, dword ptr [ebp-38CH]
89B590FEFFFF mov dword ptr [ebp-170H], esi
8BF1 mov esi, ecx
03B5F4FEFFFF add esi, dword ptr [ebp-10CH]
89B574FCFFFF mov dword ptr [ebp-38CH], esi
8BF2 mov esi, edx
13B5F0FEFFFF adc esi, dword ptr [ebp-110H]
89B578FCFFFF mov dword ptr [ebp-388H], esi
8B75F0 mov esi, dword ptr [ebp-10H]
8B8574FCFFFF mov eax, dword ptr [ebp-38CH]
0306 add eax, dword ptr [esi]
898574FCFFFF mov dword ptr [ebp-38CH], eax
8B8578FCFFFF mov eax, dword ptr [ebp-388H]
134604 adc eax, dword ptr [esi+4]
8BB574FCFFFF mov esi, dword ptr [ebp-38CH]

The methods are too big to make sense of as is (especially given the semi-arbitrary IG breaks), so we should cut them down to a representative sample. Probably just one round of mixing would be plenty to surface the challenges the jit must overcome. Also need to look into what kind of code jit32 produced in 1.1. |
@saucecontrol Unrolled hash computations are often difficult cases for the allocator because everything ends up highly interdependent. But clearly we have a lot of room for improvement on x86 both on a release/release basis and by comparison to where we are on x64. Speaking of comparisons, do you have a native (C/C++) implementation that you use as a reference point? It also would be interesting to see how far off our x86/x64 performance is from pure native code. |
Sure do. There's a project called Blake2.Bench in that same repo that has comparisons with the reference C version called via PInvoke. My C# version started out with about the same performance on x64, but after optimization, it's quite a bit faster. Performance on x86 has never been as good, but 1.1 did admirably. Here's another run of that last bench with the native results included.
Since register allocation is such an issue, I reckon it's expected x86 would be more than 2x slower, since it has fewer than half the registers to work with. Combine that with the ulong calculations, and 4x slower is probably about right for a target. |
I don't know if this is useful or not, but I also have an implementation of the Blake2s algorithm in the main project in that repo. It's the exact same thing but operates on uint words instead of ulong and does 10 mixing rounds instead of 12.
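For reference, the core mixing step of the 32-bit variant mentioned above is the standard Blake2s G function, which has the same shape as Blake2b's but works on uint words with rotation constants 16/12/8/7. A minimal C sketch:

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits (n must be 1..31 here). */
static inline uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}

/* The Blake2s G mixing function: same structure as Blake2b's, but on
   uint words with rotation constants 16/12/8/7 instead of 32/24/16/63. */
static void g32(uint32_t v[16], int a, int b, int c, int d,
                uint32_t x, uint32_t y) {
    v[a] = v[a] + v[b] + x; v[d] = rotr32(v[d] ^ v[a], 16);
    v[c] = v[c] + v[d];     v[b] = rotr32(v[b] ^ v[c], 12);
    v[a] = v[a] + v[b] + y; v[d] = rotr32(v[d] ^ v[a], 8);
    v[c] = v[c] + v[d];     v[b] = rotr32(v[b] ^ v[c], 7);
}
```

Even in this 32-bit form, every line depends on the result of the previous one, which is why unrolled rounds put so much pressure on the allocator.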
Interesting that RyuJIT-32 does so much better between 1.1 and 2.0 here, but there's a definite regression between 2.0 and 2.1. |
Core 1.1 used the older "jit32" jit for x86, and Core 2.0 and later use RyuJit. Generally speaking this switch to RyuJit resulted in improved performance for x86, but not always. I suspect the perf issues come from long decomposition: it doubles the number of things that deserve registers (vs the same computation on x64), while at the same time the allocator has to cope with the fact that on x86 there are only ~6 allocatable registers, while on x64 there are ~13. Twice as much demand and half as much supply. I don't have anything specific to call out just yet though, or a good explanation of why it started off poorly and has gotten worse. |
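To illustrate what decomposition means here (a sketch, not the JIT's actual IR): on 32-bit targets each long becomes a lo/hi pair of 32-bit values, so a single 64-bit addition turns into an add of the low halves followed by an add-with-carry of the high halves, exactly the add/adc pairs visible in the dumps above.

```c
#include <stdint.h>

/* A long value after decomposition: two 32-bit halves, so every long now
   wants two of the ~6 allocatable x86 registers. */
typedef struct { uint32_t lo, hi; } u64pair;

/* One 64-bit add becomes add (low halves) + adc (high halves + carry). */
static u64pair add64(u64pair a, u64pair b) {
    u64pair r;
    r.lo = a.lo + b.lo;            /* add: low halves              */
    uint32_t carry = r.lo < a.lo;  /* carry out of the low add     */
    r.hi = a.hi + b.hi + carry;    /* adc: high halves plus carry  */
    return r;
}
```

The carry also chains the two halves together, so the allocator cannot freely reorder or split them.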
Ohhhhhhh… I feel like I knew that at one point and it fell out of my brain. Doesn't help that BDN will only run |
I found another case that might be related and is much simpler to follow.

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Xeon CPU E3-1505M v6 3.00GHz, 1 CPU, 8 logical and 4 physical cores
Frequency=2929685 Hz, Resolution=341.3336 ns, Timer=TSC
.NET Core SDK=2.1.301
[Host] : .NET Core 2.1.1 (CoreCLR 4.6.26606.02, CoreFX 4.6.26606.05), 32bit RyuJIT
net46 : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.3110.0
netcoreapp1.1 : .NET Core 1.1.8 (CoreCLR 4.6.26328.01, CoreFX 4.6.24705.01), 32bit RyuJIT
netcoreapp2.0 : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 32bit RyuJIT
netcoreapp2.1 : .NET Core 2.1.1 (CoreCLR 4.6.26606.02, CoreFX 4.6.26606.05), 32bit RyuJIT
This one shows a consistent 6-8% regression between jit32 and RyuJIT on this machine, and more like 10-20% on another I tested.

class GreyConverter
{
unsafe static void ToGreyFixedPoint(byte* prgb, byte* pgrey, int cb)
{
const ushort scale = 1 << 15;
const ushort round = 1 << 14;
const ushort rw = (ushort)(0.299f * scale + 0.5f);
const ushort gw = (ushort)(0.587f * scale + 0.5f);
const ushort bw = (ushort)(0.114f * scale + 0.5f);
byte* end = prgb + cb;
while (prgb < end)
{
int val = prgb[0] * rw + prgb[1] * gw + prgb[2] * bw;
*pgrey = (byte)((val + round) >> 15);
prgb += 3;
pgrey++;
}
}
byte[] Grey;
byte[] Rgb;
public GreyConverter()
{
Grey = new byte[16384];
Rgb = new byte[Grey.Length * 3];
new Random(42).NextBytes(Rgb);
}
[Benchmark]
unsafe public void FixedPoint()
{
fixed (byte* prgb = &Rgb[0], pgrey = &Grey[0])
ToGreyFixedPoint(prgb, pgrey, Rgb.Length);
}
}

Here's the BDN assembly dump from the legacy JIT:

081f9690 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
081f9697 8b7d08 mov edi,dword ptr [ebp+8]
081f969a 03f9 add edi,ecx
081f969c 3bcf cmp ecx,edi
081f969e 7334 jae 081f96d4
081f96a0 0fb611 movzx edx,byte ptr [ecx]
081f96a3 69d246260000 imul edx,edx,2646h
081f96a9 0fb64101 movzx eax,byte ptr [ecx+1]
081f96ad 69c0234b0000 imul eax,eax,4B23h
081f96b3 03d0 add edx,eax
081f96b5 0fb64102 movzx eax,byte ptr [ecx+2]
081f96b9 69c0980e0000 imul eax,eax,0E98h
081f96bf 03d0 add edx,eax
081f96c1 81c200400000 add edx,4000h
081f96c7 c1fa0f sar edx,0Fh
081f96ca 8816 mov byte ptr [esi],dl
081f96cc 83c103 add ecx,3
081f96cf 46 inc esi
081f96d0 3bcf cmp ecx,edi
081f96d2 72cc jb 081f96a0
081f96d4 5e pop esi

And RyuJIT:

05649198 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
0564919e 8bc1 mov eax,ecx
056491a0 034508 add eax,dword ptr [ebp+8]
056491a3 3bc8 cmp ecx,eax
056491a5 7336 jae 056491dd
056491a7 0fb619 movzx ebx,byte ptr [ecx]
056491aa 69f346260000 imul esi,ebx,2646h
056491b0 0fb65901 movzx ebx,byte ptr [ecx+1]
056491b4 69fb234b0000 imul edi,ebx,4B23h
056491ba 03f7 add esi,edi
056491bc 0fb65902 movzx ebx,byte ptr [ecx+2]
056491c0 69fb980e0000 imul edi,ebx,0E98h
056491c6 03f7 add esi,edi
056491c8 81c600400000 add esi,4000h
056491ce 8bde mov ebx,esi
056491d0 c1fb0f sar ebx,0Fh
056491d3 881a mov byte ptr [edx],bl
056491d5 83c103 add ecx,3
056491d8 42 inc edx
056491d9 3bc8 cmp ecx,eax
056491db 72ca jb 056491a7
056491dd 5b pop ebx

The two differences that jump out at me:
|
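For reference, the fixed-point arithmetic in ToGreyFixedPoint can be sanity-checked in plain C; this sketch uses the same 15-bit scale and weights as the C# benchmark, and the resulting constants match the imul immediates (2646h, 4B23h, 0E98h) in both disassemblies.

```c
#include <stdint.h>

/* The weights from ToGreyFixedPoint, recomputed in C. Since
   0.299 + 0.587 + 0.114 = 1.0, the rounded fixed-point weights
   sum to within 1 of the scale, so white maps back to 255. */
enum { SCALE = 1 << 15, ROUND = 1 << 14 };
static const uint16_t RW = (uint16_t)(0.299f * SCALE + 0.5f); /* 2646h */
static const uint16_t GW = (uint16_t)(0.587f * SCALE + 0.5f); /* 4B23h */
static const uint16_t BW = (uint16_t)(0.114f * SCALE + 0.5f); /* 0E98h */

/* One pixel of the benchmark's inner loop. */
static uint8_t grey(uint8_t r, uint8_t g, uint8_t b) {
    int val = r * RW + g * GW + b * BW;
    return (uint8_t)((val + ROUND) >> 15);
}
```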
The fact that the same register is used for both shouldn't make any difference; CPUs handle such cases by register renaming.
That may be. In general, register to register moves should have very low cost, in many cases being handled by register renaming and thus generating no actual μops. Still, they can degrade performance, if only because of the increased code size. |
Hmmm... I didn't think about the register renaming, but if the CPU is doing it in both cases, I'm surprised a 2 byte code size increase would have such a large perf impact. I'll add more cases if I see anything different come up. |
Perf difference might be caused by loop top alignment; note the backedge target addresses:

;;; legacy jit
081f96d2 72cc jb 081f96a0
;;; ryjit
056491db 72ca jb 056491a7

There is a perf benefit to having frequent branch targets be at addresses == 0 mod 16 (or, these days, mod 32). Not sure if legacy jit gets lucky here or would pad above the loop given a slightly different example.
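The alignment claim is easy to check directly from the disassembly addresses; a trivial helper, tested here against the two loop-top addresses quoted above (081f96a0h is 16-byte aligned, 056491a7h is not):

```c
#include <stdint.h>

/* True when addr sits on the given power-of-two boundary. */
static int is_aligned(uintptr_t addr, uintptr_t boundary) {
    return (addr & (boundary - 1)) == 0;
}
```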
Looks like the alignment was purely coincidental. I added an extra increment before the loop to offset each of them by 1. The RyuJIT version still came out 8% slower with its loop top 16-byte aligned
jit32 asm:

074e9690 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
074e9697 41 inc ecx
074e9698 8b7d08 mov edi,dword ptr [ebp+8]
074e969b 03f9 add edi,ecx
074e969d 3bcf cmp ecx,edi
074e969f 7334 jae 074e96d5
074e96a1 0fb611 movzx edx,byte ptr [ecx]
074e96a4 69d246260000 imul edx,edx,2646h
074e96aa 0fb64101 movzx eax,byte ptr [ecx+1]
074e96ae 69c0234b0000 imul eax,eax,4B23h
074e96b4 03d0 add edx,eax
074e96b6 0fb64102 movzx eax,byte ptr [ecx+2]
074e96ba 69c0980e0000 imul eax,eax,0E98h
074e96c0 03d0 add edx,eax
074e96c2 81c200400000 add edx,4000h
074e96c8 c1fa0f sar edx,0Fh
074e96cb 8816 mov byte ptr [esi],dl
074e96cd 83c103 add ecx,3
074e96d0 46 inc esi
074e96d1 3bcf cmp ecx,edi
074e96d3 72cc jb 074e96a1
074e96d5 5e pop esi

and RyuJIT:

084e91a0 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
084e91a6 41 inc ecx
084e91a7 8bc1 mov eax,ecx
084e91a9 034508 add eax,dword ptr [ebp+8]
084e91ac 3bc8 cmp ecx,eax
084e91ae 7336 jae 084e91e6
084e91b0 0fb619 movzx ebx,byte ptr [ecx]
084e91b3 69f346260000 imul esi,ebx,2646h
084e91b9 0fb65901 movzx ebx,byte ptr [ecx+1]
084e91bd 69fb234b0000 imul edi,ebx,4B23h
084e91c3 03f7 add esi,edi
084e91c5 0fb65902 movzx ebx,byte ptr [ecx+2]
084e91c9 69fb980e0000 imul edi,ebx,0E98h
084e91cf 03f7 add esi,edi
084e91d1 81c600400000 add esi,4000h
084e91d7 8bde mov ebx,esi
084e91d9 c1fb0f sar ebx,0Fh
084e91dc 881a mov byte ptr [edx],bl
084e91de 83c103 add ecx,3
084e91e1 42 inc edx
084e91e2 3bc8 cmp ecx,eax
084e91e4 72ca jb 084e91b0
084e91e6 5b pop ebx |
In newer processors, crossing a 32-byte boundary in a loop makes a difference because of the µop cache. It's definitely the case for loops that are 32 bytes long or smaller. Here the loops are larger than 32 bytes, but they cross a 32-byte boundary once in the jit32 case and twice in the RyuJit case. That may affect perf. Can you try aligning the loops at 32 bytes in both cases? See 9.3 in http://www.agner.org/optimize/microarchitecture.pdf for details about the µop cache and 32-byte alignment. |
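Counting boundary crossings from a disassembly is mechanical; this small sketch computes how many block boundaries the byte range [start, end) spans, with block = 32 for the µop-cache discussion above:

```c
#include <stdint.h>

/* Number of `block`-byte boundaries crossed by the byte range [start, end).
   Assumes end > start and block is a power of two (e.g. 32). */
static unsigned boundary_crossings(uintptr_t start, uintptr_t end,
                                   uintptr_t block) {
    return (unsigned)((end - 1) / block - start / block);
}
```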
I managed to get both loop tops 32-byte aligned, but I'm not seeing any changes to the perf numbers. And the perf delta holds up even if I force the jit32 output to be misaligned.
jit32:

07a59690 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
07a59697 8b7d08 mov edi,dword ptr [ebp+8]
07a5969a 03f9 add edi,ecx
07a5969c 3bcf cmp ecx,edi
07a5969e 7334 jae 07a596d4
07a596a0 0fb611 movzx edx,byte ptr [ecx]
07a596a3 69d246260000 imul edx,edx,2646h
07a596a9 0fb64101 movzx eax,byte ptr [ecx+1]
07a596ad 69c0234b0000 imul eax,eax,4B23h
07a596b3 03d0 add edx,eax
07a596b5 0fb64102 movzx eax,byte ptr [ecx+2]
07a596b9 69c0980e0000 imul eax,eax,0E98h
07a596bf 03d0 add edx,eax
07a596c1 81c200400000 add edx,4000h
07a596c7 c1fa0f sar edx,0Fh
07a596ca 8816 mov byte ptr [esi],dl
07a596cc 83c103 add ecx,3
07a596cf 46 inc esi
07a596d0 3bcf cmp ecx,edi
07a596d2 72cc jb 07a596a0
07a596d4 5e pop esi

RyuJIT:

07a191a0 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
07a191a9 81c100010000 add ecx,100h
07a191af 81c200010000 add edx,100h
07a191b5 83c010 add eax,10h
07a191b8 8d440110 lea eax,[ecx+eax+10h]
07a191bc 3bc8 cmp ecx,eax
07a191be 7336 jae 07a191f6
07a191c0 0fb619 movzx ebx,byte ptr [ecx]
07a191c3 69f346260000 imul esi,ebx,2646h
07a191c9 0fb65901 movzx ebx,byte ptr [ecx+1]
07a191cd 69fb234b0000 imul edi,ebx,4B23h
07a191d3 03f7 add esi,edi
07a191d5 0fb65902 movzx ebx,byte ptr [ecx+2]
07a191d9 69fb980e0000 imul edi,ebx,0E98h
07a191df 03f7 add esi,edi
07a191e1 81c600400000 add esi,4000h
07a191e7 8bde mov ebx,esi
07a191e9 c1fb0f sar ebx,0Fh
07a191ec 881a mov byte ptr [edx],bl
07a191ee 83c103 add ecx,3
07a191f1 42 inc edx
07a191f2 3bc8 cmp ecx,eax
07a191f4 72ca jb 07a191c0
07a191f6 5b pop ebx |
json-benchmark (java & .net): the benchmark includes most of the fastest .NET JSON libraries, like Jil and NetJSON... When will the CLR be faster than the JVM? |
@sgf that benchmark is not using .NET Core, though. It might still be slower; just pointing that out. |
I saw there were some recent LSRA changes and decided to revisit this to see if there was any improvement. I managed to distill my original repro down to a much more reasonable size, with only half a round of hash mixing and with just enough math to touch all the variables. Current code is here, and the current numbers look like this:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview-010184
[Host] : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
netcoreapp1.1 : .NET Core 1.1.10 (CoreCLR 4.6.26906.01, CoreFX 4.6.24705.01), 64bit RyuJIT
netcoreapp2.0 : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT
netcoreapp2.1 : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
netcoreapp3.0 : .NET Core 3.0.0-preview-27324-5 (CoreCLR 4.6.27322.0, CoreFX 4.7.19.7311), 64bit RyuJIT
The x64 regression between 1.1 and 2.0 is barely visible, but there is a code size increase to back it up. The x86 regression is quite clear both in the numbers and in the asm. I went ahead and captured JitDisasm and JitDump output for the simplified mixing method on every runtime version (3.0 is from current master) to make this easier to review. Dumps are here: jitdumps.zip |
I get similar numbers:
For the latest RyuJit on x86, we've gotten ourselves into a situation where the only available register for long stretches of code is esi:

lea esi, [ecx+176]
mov dword ptr [ebp-D8H], esi
mov esi, dword ptr [esi]
mov dword ptr [ebp-ECH], esi
mov esi, dword ptr [ebp-D8H]
G_M44156_IG04:
mov esi, dword ptr [esi+4]
mov dword ptr [ebp-F0H], esi
mov esi, dword ptr [ebp-ECH]
mov dword ptr [ebp-78H], esi
mov esi, dword ptr [ebp-F0H]
mov dword ptr [ebp-7CH], esi
lea esi, [ecx+184]
mov dword ptr [ebp-DCH], esi
mov esi, dword ptr [esi]
mov dword ptr [ebp-F0H], esi
mov esi, dword ptr [ebp-DCH]
mov esi, dword ptr [esi+4]
mov dword ptr [ebp-ECH], esi
mov esi, dword ptr [ebp-F0H]
mov dword ptr [ebp-80H], esi
mov esi, dword ptr [ebp-ECH]
mov dword ptr [ebp-84H], esi
lea esi, [ecx+192]
mov dword ptr [ebp-E0H], esi
mov esi, dword ptr [esi]
mov dword ptr [ebp-ECH], esi
mov esi, dword ptr [ebp-E0H]
mov esi, dword ptr [esi+4]
mov dword ptr [ebp-F0H], esi
mov esi, dword ptr [ebp-ECH]
xor esi, 0xD1FFAB1E
mov dword ptr [ebp-ECH], esi
mov esi, dword ptr [ebp-F0H]
xor esi, 0xD1FFAB1E

I think part of the issue here is that when we decompose a long GT_IND, we save the address to a temp, even if that address is something we could trivially recompute. So we see:

lea esi, [ecx+176] // address of low part of some long
mov dword ptr [ebp-D8H], esi // spill the address
mov esi, dword ptr [esi] // load low part
mov dword ptr [ebp-ECH], esi // spill the low part
mov esi, dword ptr [ebp-D8H] // reload the address
G_M44156_IG04:
mov esi, dword ptr [esi+4] // load the high part from address+4
mov dword ptr [ebp-F0H], esi // spill the high part

when we could just as easily have the following, and save a bit of register pressure:

mov esi, dword ptr [ecx + 176] // load low part
mov dword ptr [ebp-ECH], esi // spill the low part
G_M44156_IG04:
mov esi, dword ptr [ecx + 180] // load high part directly at offset+4
mov dword ptr [ebp-F0H], esi // spill the high part

The other thing I saw was that when a long load is dead but might cause an exception, we don't simplify it to something like a null check the way jit32 does. Again, this saves needing a register. Let me look into long decomposition and see how hard it would be to teach it not to create a temp for simple address forms. |
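In C terms, the improved form sketched above corresponds to loading the two halves directly at fixed offsets from the base pointer rather than materializing the address in a temp (little-endian layout assumed, as on x86):

```c
#include <stdint.h>
#include <string.h>

/* Load a decomposed long as two 32-bit halves at base+off and base+off+4,
   mirroring "mov esi, dword ptr [ecx + 176]" / "[ecx + 180]" above: the
   offsets are folded into the loads, so no address register is needed.
   Assumes little-endian layout, as on x86. */
static uint64_t load_long_halves(const uint8_t *base, size_t off) {
    uint32_t lo, hi;
    memcpy(&lo, base + off, sizeof lo);      /* low half  */
    memcpy(&hi, base + off + 4, sizeof hi);  /* high half */
    return (uint64_t)lo | ((uint64_t)hi << 32);
}
```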
I have a prototype here OptimizeDecomposeInd... it helps somewhat but we're still nowhere close to jit32.
Can do similar for stores but since those are all at the end I don't expect that to have a large impact in this case. Also reasonably broad impact across the frameworks....
|
Will need to spend more time looking into this to figure out what else must happen to get x86 up to par (and perhaps arm32 too). Moving this to future as I don't see any quick resolution that would fit into 3.0. |
Was just giving this another look with 3.1 RTM, and whatever was going wrong with LSRA on x86 is now impacting x64 as well. The code size really exploded in 3.1.
And the benchmarks are showing the expected perf drop.

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.18362
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.100
[Host] : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), 64bit RyuJIT
netcoreapp1.1 : .NET Core 1.1.13 (CoreCLR 4.6.27618.02, CoreFX 4.6.24705.01), 64bit RyuJIT
netcoreapp2.0 : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT
netcoreapp2.1 : .NET Core 2.1.14 (CoreCLR 4.6.28207.04, CoreFX 4.6.28208.01), 64bit RyuJIT
netcoreapp3.1 : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), 64bit RyuJIT
JitDisasm and JitDump output for all versions are here: |
cc @CarolEidt -- some interesting stress cases for LSRA |
The x64 disassembly for 3.1 shows that tier 0 was used to generate code...
|
D'oh. Been a while since that got me. Updated code sizes still show another increase but not nearly so extreme. Benchmark numbers were correct.
And here are the updated dumps: simplifiedhash_jitdumps.zip |
Thanks, a quick look over the disassembly diff would indicate that this is more likely to be an issue with null check elimination. I see lots of extra null checks in various places, like here:

- mov r14, qword ptr [rcx+144]
- mov r15, qword ptr [rcx+152]
- mov r12, qword ptr [rcx+160]
- mov r13, qword ptr [rcx+168]
- mov rbx, qword ptr [rcx+176]
- mov rdi, qword ptr [rcx+184]
- mov rsi, 0xD1FFAB1E
- xor rsi, qword ptr [rcx+192]
+ mov rbp, rdx
+ cmp dword ptr [rcx], ecx
+ mov r14, qword ptr [rcx+136]
+ cmp dword ptr [rcx], ecx
+ mov r15, qword ptr [rcx+144]
+ cmp dword ptr [rcx], ecx
+ mov r12, qword ptr [rcx+152]
+ cmp dword ptr [rcx], ecx
+ mov r13, qword ptr [rcx+160]
+ cmp dword ptr [rcx], ecx
+ mov rbx, qword ptr [rcx+168]
+ cmp dword ptr [rcx], ecx
+ mov rdi, qword ptr [rcx+176]
+ cmp dword ptr [rcx], ecx
+ mov rsi, qword ptr [rcx+184] |
Thanks for looking into it. You're right, of course. I actually ran across that issue a few months back and updated the main project (from which the sample is distilled) to replace the struct pointer argument with an interior pointer to the hash state. Running the benchmarks against the main project, there doesn't appear to be a new regression in 3.1... but the drop-off from legacy JIT is still severe. |
The x86 issue is the same as #8846, and is something I plan to be working on soon (see #9399). |
I've looked a bit more into those null checks and the reason they're not removed is pretty hilarious:
So the C# compiler somehow manages to emit a constant tree and the JIT gets rid of it too late:
The reason for the null check is… I haven't checked how come this worked before 3.0, nor if there's any quick fix for this. It looks to me that the main reason for the MAC's mess is the existence of… The generated code also contains a bunch of extra… |
@CarolEidt you might want to look at this one again as at the heart of it all there are likely LSRA challenges (made worse by clunky long decomposition). Hash computations tend to be pretty stressful for the allocator. |
I've been trying to come up with a simple repro for a perf regression I've been fighting, and I think I have it trimmed down as much as possible while still showing the real-world impact. What I have here is a stripped-down version of my Blake2b hashing algorithm implementation.
The mixSimplified method here shows the problem. Code size has grown with each new JIT since 1.1, and performance has dropped. JitDisasm and JitDump output for all versions are here: simplifiedhash_jitdumps.zip
category:cq
theme:register-allocator
skill-level:expert
cost:large