Updating the HWIntrinsic codegen to support marking LoadVector128 and LoadAlignedVector128 as contained. #16095
Conversation
This is a basic example of marking a HWIntrinsic node as contained. I don't think it should be merged until after #15771, so that we can readily get more test coverage on this (15771 adds some explicit containment tests, and will make it simpler to add others).
```cpp
@@ -230,6 +230,11 @@ void CodeGen::genHWIntrinsic_R_R_RM(GenTreeHWIntrinsic* node, instruction ins)

        compiler->tmpRlsTemp(tmpDsc);
    }
    else if (op2->OperIsHWIntrinsic())
```
Similar code is needed in `genHWIntrinsic_R_R_RM_I`.
Fixed.
Base:
```
C4E1791038 vmovupd xmm7, xmmword ptr [rax]
C4614858C7 vaddps  xmm8, xmm6, xmm7
```

Diff:
```
C4E1485838 vaddps  xmm7, xmm6, xmmword ptr [rax]
```
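For reference, a minimal C# sketch of the kind of source that produces the Base/Diff codegen above. This is not taken from the PR's tests; the method name and the shipped `System.Runtime.Intrinsics.X86` namespace are assumptions here.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class ContainedLoadExample
{
    // When the LoadVector128 node is marked as contained, the add can encode the memory
    // operand directly (vaddps xmm, xmm, [mem]) instead of emitting a separate vmovupd.
    public static Vector128<float> AddFromMemory(Vector128<float> left, float* address)
    {
        return Sse.Add(left, Sse.LoadVector128(address));
    }
}
```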
src/jit/lowerxarch.cpp (outdated)
```cpp
//
bool Lowering::IsContainableHWIntrinsicOp(GenTree* node)
{
    if (!node->OperIsHWIntrinsic())
```
I think this early return can be an assertion.
I'm not positive. Currently this is called for any op2 which is not a containable memory op. I don't know if that will always be a HWIntrinsic node.
I think, for example, a user-defined function which returns a `Vector128<T>` would fail such an assertion.
I see. Thanks.
I've not tested that, however, and @CarolEidt or @mikedn might know for certain.
Seems fine to me as it is. That's how `isContainableMemoryOp` does it as well.
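To illustrate the point above, here is a hedged C# sketch of a case where the second operand is a user call rather than a HWIntrinsic node, so the early return (rather than an assert) is needed. The helper name is made up and the current `System.Runtime.Intrinsics` API is assumed.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class UserCallExample
{
    // If this helper is not inlined, the JIT sees the second operand of Sse.Add below as a
    // call node rather than a GT_HWIntrinsic node, so asserting OperIsHWIntrinsic() on op2
    // would be too strong.
    static Vector128<float> GetOnes() => Vector128.Create(1.0f);

    public static Vector128<float> AddOnes(Vector128<float> value)
    {
        return Sse.Add(value, GetOnes());
    }
}
```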
src/jit/lowerxarch.cpp (outdated)
```cpp
switch (intrinsicID)
{
    case NI_SSE_LoadAlignedVector128:
    case NI_SSE_LoadVector128:
```
`LoadScalarVector128` should also be handled here, but I need to make sure it is emitted properly (since it is an m32 instead of an m128).
> but I need to make sure it is emitted properly (since it is a m32 instead of a m128)

Do you mean only folding `AddScalar(v1, LoadScalarVector128(address))` but not folding `AddScalar(v1, LoadVector128(address))`?
Well, specifically for `Add(v1, LoadScalar(address))`, which would cause a read of too many bits if folded.

I think the other (`AddScalar(v1, LoadVector128(address))`) is safe to fold (we should still determine if we want to do such an optimization) since the read isn't stored and the upper bits don't impact the final result.
We may not want to make the second optimization since the full read has potentially observable side effects (it could cause an AV exception if you read past the end of an array, for example, or if it was an aligned read of an unaligned address, etc.).
**Folding LoadAligned**

When using the non-VEX encoding (generally legacy hardware), the alignment requirement is still enforced by the instruction the load is folded into, so an unaligned address faults either way. However, when using the VEX-encoding (newer hardware that supports AVX or AVX2), the folded memory operand has no alignment requirement, so folding hides the check.

For example, in the below, if you are on newer hardware and the second load is not marked as contained, you will get an AV. However, if it is contained, the code will "just work".

```csharp
var value = Sse.LoadAlignedVector128(address);
var result = Sse.Add(value, Sse.LoadAlignedVector128(address + 2));
```

**Folding larger reads**

For the scalar instructions, the address never needs to be aligned (on newer or older hardware) and is only a partial read (m8, m16, m32 or m64) rather than a full read (m128). If a user were to use a full read and pass it directly into an operation, there would be a chance to fold it.

For example, in the below, the second load will normally read the full 128 bits. However, the value is never stored and the upper bits are not used in the operation. As such, it is "safe" to fold this operation. However, this has potentially observable side-effects with things like caching, page loads, and AVs (if the full read would extend past the end of an array or a page boundary, etc.).

```csharp
var value = Sse.LoadScalarVector128(address);
var result = Sse.AddScalar(value, Sse.LoadVector128(address + 1));
```
I doubt that the fact that the code will just work is an issue. That is, it's unlikely that people will use LoadAligned specifically to get an alignment check.
I'm surprised that this is even up for discussion. Reading more memory than requested is a very dubious thing to do and should only be done in very specific circumstances.
Yes, but it is an observable side-effect, which is something we normally don't allow you to hide (as has been stated in at least a couple other issues where I, or others, have asked for better codegen/inlining 😄).
I imagine we should just not fold these, but I want an explicit decision so that I can document it.
That's not really a side effect in the common sense. It's not as if you're transforming a null pointer dereference into a 0.
It's the other way around. You would need an explicit decision to generate wider loads.
I think any generated exception is an observable side effect according to the spec. Either way, getting an official ruling would be good. NOTE: I am in favor of folding these, but possibly only in release/optimized code (and not in debug/min-opts code).
Why? The user coded … For the other end, …
Obviously I agree that we can't fold something like `Add(v1, LoadScalar(address))`. However, I am also not in favor of any folding that would suppress an exception that would otherwise have happened. Getting optimal code may require the developer to check for (and use) higher-level ISA instructions to get the folding, but I think that's a reasonable expectation given that we already expect the developer to have a detailed knowledge of the hardware - this is, after all, effectively inline assembly.
@dotnet-bot test this please
My concern on this is that, for `LoadAligned`, if a user has already validated that the loads will be aligned, they shouldn't need to write two versions of the algorithm (one that uses `LoadAligned` and one that uses `Load`). That being said, this is a perfect use-case for a proposal like https://github.com/dotnet/corefx/issues/26188.
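As a hypothetical illustration of the "two versions" concern, the following sketch shows the duplication a developer would be tempted to keep if only the unaligned loads were folded. Names are made up, the current API shape is assumed, and `p` is assumed to point at `count` readable floats with `count` a multiple of 4.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class SumKernel
{
    // The version the developer wants to write: the alignment guarantee is expressed
    // through LoadAlignedVector128.
    public static Vector128<float> SumAligned(float* p, int count)
    {
        Vector128<float> acc = Vector128<float>.Zero;
        for (int i = 0; i < count; i += 4)
        {
            acc = Sse.Add(acc, Sse.LoadAlignedVector128(p + i));
        }
        return acc;
    }

    // A near-identical copy using unaligned loads, kept only if those are the loads that
    // end up folded into the add.
    public static Vector128<float> SumUnaligned(float* p, int count)
    {
        Vector128<float> acc = Vector128<float>.Zero;
        for (int i = 0; i < count; i += 4)
        {
            acc = Sse.Add(acc, Sse.LoadVector128(p + i));
        }
        return acc;
    }
}
```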
src/jit/lowerxarch.cpp (outdated)
```cpp
// Only fold a scalar load into a SIMD scalar intrinsic to ensure the number of bits
// read remains the same. Likewise, we can't fold a larger load into a SIMD scalar
// intrinsic as that would read fewer bits than requested.
case NI_SSE_LoadScalar:
```
Probably want to assert here that the baseType of the containing node is the same as the baseType of the load. It should already line up due to the method signatures, but we don't want to end up in a scenario where the `LoadScalar` is for m32, but the containing node would do an m64.
Fixed.
That one is fine; there's nothing interesting about containment in this case. I was talking about the case where a wider load would be generated. Narrowing loads can also be problematic if the data is not aligned; you could end up hiding an invalid memory access.
AFAIR native compilers have no problem doing this. It's not clear why it would be a problem in .NET. Maybe because the intrinsic names are somewhat different in regards to alignment; in C/C++ you basically have `_mm_load_ps` vs `_mm_loadu_ps`.
Indeed, they shouldn't need to write 2 versions, because the Load version would be good enough for current hardware.
Because its implementation is effectively:

```csharp
if ((address % 16) != 0)
{
    throw new AccessViolationException();
}

return LoadUnaligned(address);
```

This is in the same vein as why we can't transform:

```csharp
int Select(bool condition, int trueValue, int falseValue)
{
    return condition ? trueValue : falseValue;
}

int MyMethod(bool condition, int[] a, int[] b)
{
    return Select(condition, a[5], b[6]);
}
```

into:

```csharp
int MyMethod(bool condition, int[] a, int[] b)
{
    return condition ? a[5] : b[6];
}
```

even though the native compiler can, since it would mask a NRE if, for example, `b` is null but `condition` is true.
There are modern processors (such as the Kaby Lake Celeron/Pentiums -- https://ark.intel.com/products/97460/Intel-Pentium-Processor-G4620-3M-Cache-3_70-GHz) which don't support VEX encoding.
Are you really going to use that as an argument? Really really?

And let me guess: people who use such processors expect to get the best performance from a gimped CPU?
Yes :) It was made very clear, in my proposal to allow the folding of such things, why we can't. Folding LoadAligned is the exact same scenario: it would mask an exception that would otherwise be thrown. The solution, I believe, is to move forward with a proposal such as https://github.com/dotnet/corefx/issues/26188, which did garner support, but which required a reasonable use case (such as this) and a prototype to back it up.
Regardless of whether or not you view the processor as "gimped", some people will end up getting processors like that or won't upgrade every new processor generation and some of those people will run .NET. That alone isn't enough of a case to argue for allowing it, without some more concrete numbers, but it is a start.
Code size is also important for throughput in performance-oriented code (such as the general use-case for HWIntrinsics). If you end up doubling the code size because Loads aren't folded, it can make a measurable difference in the total execution time, especially if that code is on a hot path.
LGTM with one optional suggestion
```cpp
@@ -1297,6 +1297,10 @@ void CodeGen::genConsumeRegs(GenTree* tree)
    {
        genConsumeReg(tree->gtGetOp1());
    }
    else if (tree->OperIsHWIntrinsic())
```
It seems like this would be a good place to assert that `HW_Flag_NoContainment` is not set, or that `IsContainableHWIntrinsicOp(tree, tree->gtGetOp1())` holds, but perhaps that's overkill.
I think asserting is a good idea. I'll update.
This is also pending #16114 and #16115. The former is to fix up the various flags, and the latter to make it easier to add some tests with this change (we currently don't have any tests that use `Load`, `LoadScalar`, or `LoadAligned` in a way that they could be contained).
Fixed.
Hmm, actually, I can't easily check `IsContainableHWIntrinsicOp`. It is a member of `Lowering` and depends on `IsContainableMemoryOp`, which itself depends on `m_lsra`. I would need to make `m_pLowering` visible from the compiler somewhere to use that check.

The other check, `flagsOfHWIntrinsic`, currently requires friend access to the compiler, but I don't see any reason why those can't be public.

@CarolEidt, what would you recommend here?
I can't recall why `IsContainableMemoryOp` has its actual implementation on `LinearScan` when all the containment analysis is done in `Lowering` (even before we moved the `TreeNodeInfoInit` to LSRA). In any case, it seems like something we might want to check during code generation, so we may want to expose it in some way. That said, I don't feel that it's critical.

On the other hand, the `flagsOfHwIntrinsic` I think should definitely be public so that they are accessible from codegen.
> I can't recall why IsContainableMemoryOp has its actual implementation on LinearScan when all the containment analysis is done in Lowering (even before we moved the TreeNodeInfoInit to LSRA).

Looks to be one use in LSRA, for `SIMDIntrinsicGetItem`: https://github.com/dotnet/coreclr/blob/master/src/jit/lsraxarch.cpp#L3026
> In any case, it seems like something we might want to check during code generation, so we may want to expose it in some way. That said, I don't feel that it's critical.

Exposing a `getLowering()` method in the compiler is easy enough, but it also requires us to `#include "lower.h"` somewhere where codegen can pick it up (probably just in codegen.h). I've commented out this particular assert with a TODO and will finish looking tomorrow if I get the chance.
> On the other hand, the flagsOfHwIntrinsic I think should definitely be public so that they are accessible from codegen.

It is already available to codegen; I just forgot to put the assert in an `#ifdef _TARGET_XARCH_` since it's in codegenlinear. 😄
Both asserts I tried to add here would have been invalid. `tree` is the node which was contained; we would want to be doing the assert on the node which contains `tree`.
Moved the asserts to `hwintrinsiccodegenxarch.cpp`.
The product code here should be "complete". I'd still like to wait on #16115 and add some explicit tests before merging, however.
Updated the templated tests to also validate using the Load/LoadAligned intrinsics.
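A rough sketch of the shape such a containment test can take; this is not the actual template from the repo, the helper name is made up, and the current API is assumed. The idea is to pass the loads directly as operands so they are containment candidates, then check the result against scalar math.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class ContainmentTest
{
    // Returns true when the hardware path matches the scalar fallback (or SSE is unsupported).
    public static bool RunAddWithLoad(float* left, float* right)
    {
        if (!Sse.IsSupported)
        {
            return true;
        }

        // The loads feed the add directly, so lowering may mark them as contained.
        Vector128<float> result = Sse.Add(Sse.LoadVector128(left), Sse.LoadVector128(right));

        float* actual = stackalloc float[4];
        Sse.Store(actual, result);

        for (int i = 0; i < 4; i++)
        {
            if (actual[i] != left[i] + right[i])
            {
                return false;
            }
        }

        return true;
    }
}
```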
src/jit/hwintrinsicxarch.cpp (outdated)
```cpp
// Set `compFloatingPointUsed` to cover the scenario where an intrinsic is being used on SIMD fields, but
// where no SIMD local vars are in use. This is the same logic as is used for FEATURE_SIMD.
compFloatingPointUsed = true;
```
FYI @CarolEidt, @fiigii: this is the same logic as https://github.com/dotnet/coreclr/blob/master/src/jit/simd.cpp#L3094. An assert was being hit in LSRA for reflection calls which used `LoadVector128`. This needs to be cleaned up to use a flag so we aren't setting it on intrinsics which don't actually use any floating-point nodes/registers (such as Crc32).

Also FYI @sdmaclea: this appears to impact ARM64 as well (based on the FEATURE_SIMD code), and you may need something similar.
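For context, a hypothetical sketch of the kind of reflection call that exercises this path with no SIMD locals in the caller. This is not the failing test itself; the current API and the `Pointer.Box` pattern for passing a pointer argument through reflection are assumptions.

```csharp
using System;
using System.Reflection;
using System.Runtime.Intrinsics.X86;

static class ReflectionRepro
{
    public static unsafe void Run()
    {
        float* data = stackalloc float[4];

        // Invoking the intrinsic via reflection: the caller has no Vector128<T> locals,
        // yet floating-point/SIMD registers are still needed for the call's return value.
        MethodInfo load = typeof(Sse).GetMethod("LoadVector128", new[] { typeof(float*) });
        object result = load.Invoke(null, new object[] { Pointer.Box(data, typeof(float*)) });

        Console.WriteLine(result);
    }
}
```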
I also imagine various code could be updated elsewhere in the JIT to properly support TYP_SIMD as floating-point registers, rather than requiring this special handling here.
```cpp
@@ -242,7 +250,6 @@ void CodeGen::genHWIntrinsic_R_R_RM(GenTreeHWIntrinsic* node, instruction ins)
    offset = 0;

    // Ensure that all the GenTreeIndir values are set to their defaults.
    assert(memBase->gtRegNum == REG_NA);
```
This was found to be incorrect in the larger `emitInsBinary` refactoring, but wasn't also removed from here.

@CarolEidt, @fiigii, could you give another review pass when you get the opportunity? I think I'm satisfied with the changes now.
LGTM - thanks!
```cpp
@@ -215,6 +215,9 @@ void CodeGen::genHWIntrinsic_R_R_RM(GenTreeHWIntrinsic* node, instruction ins)

    if (op2->isContained() || op2->isUsedFromSpillTemp())
    {
        assert((Compiler::flagsOfHWIntrinsic(node->gtHWIntrinsicId) & HW_Flag_NoContainment) == 0);
        assert(compiler->m_pLowering->IsContainableHWIntrinsicOp(node, op2) || op2->IsRegOptional());
```
Thanks!
Thank you for the work.
Failures are from SSE2 and AVX tests attempting to use the not-yet-implemented LoadVector intrinsics for those ISAs; will fix.
Are you going to disable this behavior? I can implement these Load* intrinsics this weekend.
Yes (effectively). I am going to revert the test changes for SSE2/AVX (since those just added the Load/LoadAligned tests) and comment those out from generation for the time being.
Same as before (just squashed into a product-changes commit and a test-changes commit), but with the invalid test changes reverted for SSE2/AVX/AVX2 (since they don't have the required Load/LoadAligned intrinsics implemented yet).
… LoadAlignedVector128 as contained.
test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitincompletehwintrinsic
test Ubuntu x64 Checked jitincompletehwintrinsic
test OSX10.12 x64 Checked jitincompletehwintrinsic
FYI. @fiigii, @CarolEidt, @mikedn