Split 16-byte SIMD store around GC struct fields into two 8-byte SIMD stores (x86)/ two 8-byte mov-s (x64) #53116
Conversation
@dotnet/jit-contrib PTAL
Should we be using
+1, but I am more worried about performance than code size; the Intel instruction manual says that
I guess it does not apply to the modern CPUs that we care about, but maybe we still should not mix 8/16-byte writes. It looks like we will need perf lab results for this change, not asm diffs.
Right now when we are not always allocating a GP register for
Good suggestion. I can try to run a perf job against this PR and see what has changed.
The size regression would break even with 2+ stores and start winning with 3 or more.
Yes, that is right. But I was talking more about allocating an additional register for it. I will just make such a change and measure its consequences.
I did a couple of experiments, and I found that the strategy "use only mov [m64], r64 and don't use SIMD movs for INIT_BLK when the source of INIT_BLK is not a local address and the corresponding struct layout has a GCPtr" gives the best results in terms of code size. That way we don't have a mix of SIMD and scalar operations writing to neighboring locations in memory, and we don't have to zero both a GP register and a SIMD register for INIT_BLK. We will keep using SIMD movs for the other cases (i.e. no GC fields, or the struct is on the stack). @sandreenko @tannergooding Any opinion on that?
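To make the strategy above concrete, here is a minimal sketch of the decision it describes. All names here are invented for illustration and do not exist in the JIT; it only captures the rule "scalar pointer-sized stores when the layout has GC pointers and the destination may be on the heap":

```cpp
// Hypothetical illustration of the zeroing-strategy choice described above.
// These names are illustrative; the real JIT logic lives in codegenxarch.cpp.
enum class ZeroingStrategy
{
    SimdMov,   // movq / movdqu stores
    ScalarMov, // mov [m64], r64 stores only
};

// Use only pointer-sized scalar stores when the layout contains GC pointers
// and the destination is not a local (stack) address, so every
// GC-pointer-sized write is atomic and SIMD/scalar stores are never mixed.
static ZeroingStrategy ChooseZeroingStrategy(bool layoutHasGCPtr, bool dstIsLocalAddr)
{
    if (layoutHasGCPtr && !dstIsLocalAddr)
    {
        return ZeroingStrategy::ScalarMov;
    }
    return ZeroingStrategy::SimdMov;
}
```

For a struct with GC fields stored through an arbitrary pointer this picks scalar moves; for stack locals or GC-free layouts it keeps SIMD stores.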
I am just curious to see the numbers, if you have them, comparing these strategies.
Sergey covered my view as well. If it's performant and has good code size, then +1. This is only for
It is for
And yes, for Guid we will issue the same code as before.
…LK(addr), 0) for other types. This allows the JIT to maintain information about the class layout.
Always use a sequence of mov [m64],r64 when a struct has GC fields and is not guaranteed to be on the stack.
X86:
1) Use movq [m64],xmm to zero 8 bytes on the stack instead of mov [m32],r32; mov [m32+4],r32
2) Do not use movdqu when a struct has GC fields and is not guaranteed to be on the stack
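For reference, a sketch of the x86 stack-zeroing change in point 1 above, assuming the registers involved have already been zeroed (offsets and register choices are illustrative, not the actual emitted code):

```asm
; Before: zeroing an 8-byte stack slot with two 4-byte GP stores
xor   eax, eax
mov   dword ptr [esp+8], eax
mov   dword ptr [esp+12], eax

; After: one 8-byte SIMD store from a pre-zeroed XMM register
xorps xmm0, xmm0
movq  qword ptr [esp+8], xmm0
```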
/azp run runtime-coreclr outerloop
/azp run runtime-coreclr jitstress
None of the failures seem to be related to the change. Forgot to reply to @sandreenko's request:
I updated the
I also ran microbenchmarks from dotnet/performance locally.
src/coreclr/jit/codegenxarch.cpp (outdated)

assert(size <= INT32_MAX);
assert(dstOffset < (INT32_MAX - static_cast<int>(size)));

// Fill as much as possible using SSE2 stores.
-if (size >= XMM_REGSIZE_BYTES)
+if (AMD64_ONLY(canUse16BytesSimdMov &&)(size >= XMM_REGSIZE_BYTES))
Why is this check for x64 only, shouldn't the same restriction exist for x86?
For x86 we will still emit movq [m64], xmm for structs larger than 16 bytes, since one such instruction saves us from emitting mov [m32], r32; mov [m32+4], r32. So it's really not important whether we can use movdqu or not on x86.
The checks here, or maybe the names, seem a bit confusing then. I'd expect this to be something more like bool canUseSimdMov = AMD64_ONLY(!node->MustUsePointerSizeAtomicStores() &&)(size >= minSimdCopySize). minSimdCopySize would then be 16 on x64 and 8 on x86. It would then also require that (size % minSimdCopySize) == 0.
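A standalone sketch of the shape being suggested here, with minSimdCopySize chosen per target and the atomic-store restriction applied on x64 only. The helper name and the runtime isX64 flag are hypothetical for illustration; in the real JIT the target is fixed at compile time via AMD64_ONLY/TARGET_AMD64:

```cpp
#include <cstddef>

// Hypothetical sketch of the suggested condition; not the actual JIT code.
static bool CanUseSimdMov(size_t size, bool mustUsePointerSizeAtomicStores, bool isX64)
{
    // 16-byte movdqu on x64, 8-byte movq on x86.
    const size_t minSimdCopySize = isX64 ? 16 : 8;

    // On x64, structs with GC fields that may live on the heap must be
    // zeroed with pointer-sized atomic stores, ruling out SIMD stores.
    if (isX64 && mustUsePointerSizeAtomicStores)
    {
        return false;
    }
    return size >= minSimdCopySize;
}
```

(As noted later in the thread, the additional (size % minSimdCopySize) == 0 requirement turns out not to be needed, since smaller remainders can be filled with scalar stores.)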
I guess the logic I stated isn't quite exact either, but the general point I'm trying to make is that this logic isn't exactly straightforward to follow, with the handling for movdqu and movq and x64 and x86 all being together.
Yes, the size of a struct doesn't matter.
I think I could do this to make the logic clearer
#ifdef TARGET_AMD64
// On Amd64 if the JIT needs to guarantee atomicity for stores in INITBLK
// the codegen will avoid using SIMD stores and instead will use mov [m64], r64 for zeroing.
const bool willUseSimdMov = canUse16BytesSimdMov && (size >= XMM_REGSIZE_BYTES);
#else
// On x86, even if we can't use movdqu for zeroing, it's still more beneficial to use
// movq [m64],xmm for structs larger than 16 bytes.
const bool willUseSimdMov = (size >= 16);
#endif
if (willUseSimdMov)
{
}
@tannergooding Would this help?
Also, it might be worth renaming canUse16BytesSimdMov to just be isReferenceOrContainsReferences or something similar. That's the actual term on the managed side (https://docs.microsoft.com/en-us/dotnet/api/system.runtime.compilerservices.runtimehelpers.isreferenceorcontainsreferences?view=net-5.0) and is the limiting factor behind movdqu vs mov.
Yes, the boundary is to be somewhat consistent with current behavior.
I would suggest the name isOnHeapAndContainsReferences or maybeOnHeapAndContainsReferences, since it is definitely not about the type being a reference type, and it is not only about a value type that contains references. It's about a value type that contains references and is not guaranteed to be allocated on the stack.
@tannergooding I pushed the change renaming the method name and updating the comment. Does it read better now?
Yes, it's much easier to follow now. Thanks so much!
LGTM with 2 nits, and +1 for Tanner's suggestion to rename the condition variable.
const bool willUseSimdMov = canUse16BytesSimdMov && (size >= XMM_REGSIZE_BYTES);
#else
// On X86 the JIT will use movq for structs that are larger than 16 bytes
// since it is more beneficial than using two mov-s from a GP register.
nit: This is 4 moves, since 32-bit can do 4-byte moves, not 8
I meant a slightly different thing. By saying "movq is more beneficial than using two mov-s from a GP register" I meant that using movq [m64], xmm is more beneficial than mov [m32], r32; mov [m32+4], r32.
Fixes #51638 by using:
- mov [m64],r64 instead of movdqu [m128],xmm when zeroing structs with GC fields that are not guaranteed to be on the stack on win-x64 or linux-x64
- movq [m64],xmm when zeroing such structs on win-x86

win-x64
win-x86
- asm.benchmarks.run.Linux.x64.checked (Detail diffs)
- asm.benchmarks.run.windows.x64.checked (Detail diffs)
- asm.benchmarks.run.windows.x86.checked (Detail diffs)
- asm.coreclr_tests.pmi.Linux.arm.checked (Detail diffs)
- asm.coreclr_tests.pmi.Linux.arm64.checked (Detail diffs)
- asm.coreclr_tests.pmi.Linux.x64.checked (Detail diffs)
- asm.coreclr_tests.pmi.windows.arm64.checked (Detail diffs)
- asm.coreclr_tests.pmi.windows.x64.checked (Detail diffs)
- asm.coreclr_tests.pmi.windows.x86.checked (Detail diffs)
- asm.libraries.crossgen2.Linux.x64.checked (Detail diffs)
- asm.libraries.crossgen2.windows.x64.checked (Detail diffs)
- asm.libraries.crossgen2.windows.x86.checked (Detail diffs)
- asm.libraries.pmi.Linux.x64.checked (Detail diffs)
- asm.libraries.pmi.windows.x64.checked (Detail diffs)
- asm.libraries.pmi.windows.x86.checked (Detail diffs)
- asm.libraries_tests.pmi.Linux.x64.checked (Detail diffs)
- asm.libraries_tests.pmi.windows.x64.checked (Detail diffs)
- asm.libraries_tests.pmi.windows.x86.checked (Detail diffs)