While micro-optimizing a particular piece of code, I found suboptimal codegen that stems from the signature design of the Unsafe class and that the JIT could fix.
Say that I need to read data from two different memory locations, where offset is an int:
matches = Sse2.MoveMask(Sse2.CompareEqual(LoadVector128(ref first, (IntPtr)offset), LoadVector128(ref second, (IntPtr)offset)));
You can see that it performs the same operation (the sign extension of eax) twice:
**movsxd r8,eax
vmovupd xmm0,xmmword ptr [rcx+r8]
**movsxd r8,eax
vmovupd xmm1,xmmword ptr [rdx+r8]
vpcmpeqb xmm0,xmm0,xmm1
vpmovmskb r8d,xmm0
This has been solved (somehow) for AVX2, but it also introduces another strange behavior:
matches = Avx2.MoveMask(Avx2.CompareEqual(LoadVector256(ref first, (IntPtr)offset), LoadVector256(ref second, (IntPtr)offset)));
As you can see, not only do we copy with sign extension, but we also copy the result into r9. While at the microarchitectural level that is a simple register rename (cheaper than the duplicated sign extension), we are still issuing an extra instruction:
**movsxd r8,eax
**mov r9,r8
vmovupd ymm0,ymmword ptr [rcx+r9]
vmovupd ymm1,ymmword ptr [rdx+r8]
vpcmpeqb ymm0,ymm0,ymm1
vpmovmskb r8d,ymm0
What I don't understand is why, given that eax was set in the same code (not coming from anywhere else), the JIT decides to use an extra mov instead of emitting:
vmovupd ymm0,ymmword ptr [rcx+eax]
vmovupd ymm1,ymmword ptr [rdx+eax]
vpcmpeqb ymm0,ymm0,ymm1
vpmovmskb r8d,ymm0
And, furthermore, this can also be optimized to:
vmovupd ymm0,ymmword ptr [rcx+eax]
vpcmpeqb ymm0,ymm0,ymmword ptr [rdx+eax]
vpmovmskb r8d,ymm0
I am running today's nightly build: 3.0.0-preview4-27506-5.
Any idea how I can achieve the latter code?
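For anyone who wants to reproduce this, here is a minimal sketch of the pattern. The LoadVector128 helper below is hypothetical: its definition is not shown above, so I am assuming it is a thin inlined wrapper over Unsafe.AddByteOffset and Unsafe.ReadUnaligned. The class and method names are also made up for illustration. The second method converts the offset to IntPtr once and reuses the local; whether that actually removes the repeated movsxd depends on the JIT build and is not something this sketch verifies.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class OffsetRepro
{
    // Hypothetical helper (assumption): unaligned 16-byte load at start + offset bytes.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector128<byte> LoadVector128(ref byte start, IntPtr offset)
        => Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.AddByteOffset(ref start, offset));

    // Pattern from the post: the int -> IntPtr conversion is written twice.
    public static int MatchesCastTwice(ref byte first, ref byte second, int offset)
        => Sse2.MoveMask(Sse2.CompareEqual(
            LoadVector128(ref first, (IntPtr)offset),
            LoadVector128(ref second, (IntPtr)offset)));

    // Variant: convert once into a native-sized local and reuse it for both loads.
    public static int MatchesCastOnce(ref byte first, ref byte second, int offset)
    {
        IntPtr byteOffset = (IntPtr)offset;
        return Sse2.MoveMask(Sse2.CompareEqual(
            LoadVector128(ref first, byteOffset),
            LoadVector128(ref second, byteOffset)));
    }
}
```

Both methods should return the same mask for the same inputs, so the disassembly of the two can be compared directly. Callers need to check Sse2.IsSupported before using either one.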
category:cq
theme:hardware-intrinsics
skill-level:expert
cost:medium