This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Updating the HWIntrinsic codegen to support marking LoadVector128 and LoadAlignedVector128 as contained. #16095

Merged 2 commits into dotnet:master on Feb 3, 2018

Conversation

@tannergooding (Member Author)

This is a basic example of marking a HWIntrinsic node as contained.

I don't think it should be merged until after #15771, so that we can readily get more test coverage on this (15771 adds some explicit containment tests, and will make it simpler to add others).

@@ -230,6 +230,11 @@ void CodeGen::genHWIntrinsic_R_R_RM(GenTreeHWIntrinsic* node, instruction ins)

compiler->tmpRlsTemp(tmpDsc);
}
else if (op2->OperIsHWIntrinsic())
tannergooding (Member Author):

Similar code is needed in genHWIntrinsic_R_R_RM_I.

tannergooding (Member Author):

Fixed.

@tannergooding (Member Author)

Base:

C4E1791038           vmovupd  xmm7, xmmword ptr [rax]
C4614858C7           vaddps   xmm8, xmm6, xmm7

Diff:

C4E1485838           vaddps   xmm7, xmm6, xmmword ptr [rax]
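For context, a hedged C# sketch (not from the PR) of the kind of source that produces the codegen above; the local function shape and the float* parameter are illustrative:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe Vector128<float> AddFromMemory(Vector128<float> left, float* address)
{
    // Base: the load is emitted separately (vmovupd into its own register) and vaddps
    // takes two register operands.
    // Diff: the load is marked as contained and folds into vaddps xmm, xmm, xmmword ptr [mem].
    return Sse.Add(left, Sse.LoadVector128(address));
}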

//
bool Lowering::IsContainableHWIntrinsicOp(GenTree* node)
{
if (!node->OperIsHWIntrinsic())
Reviewer:

I think this early return can be an assertion.

tannergooding (Member Author):

I'm not positive. Currently this is called for any op2 which is not a containable memory op. I don't know if that will always be a HWIntrinsic node.

tannergooding (Member Author):

I think, for example, a user-defined function which returns a Vector128<T> would fail such an assertion.

Reviewer:

I see. Thanks.

tannergooding (Member Author):

I've not tested that, however, and @CarolEidt or @mikedn might know for certain.

Reviewer:

Seems fine to me as it is. That's how isContainableMemoryOp does it as well.

switch (intrinsicID)
{
case NI_SSE_LoadAlignedVector128:
case NI_SSE_LoadVector128:
tannergooding (Member Author):

LoadScalarVector128 should also be handled here, but I need to make sure it is emitted properly (since it is an m32 instead of an m128).

Reviewer:

but I need to make sure it is emitted properly (since it is a m32 instead of a m128)

Do you mean only folding AddScalar(v1, LoadScalarVector128(address)) but not folding AddScalar(v1, LoadVector128(address))?

tannergooding (Member Author):

Well, specifically for Add(v1, LoadScalar(address)), which would cause a read of too many bits if folded.

I think the other (AddScalar(v1, LoadVector128(address))) is safe to fold (we should still determine if we want to do such an optimization), since the read isn't stored and the upper bits don't impact the final result.

tannergooding (Member Author):

We may not want to make the second optimization, since the full read has potentially observable side effects (it could cause an AV exception if you read past the end of an array, for example, or if it was an aligned read of an unaligned address, etc.).

@tannergooding (Member Author)

@CarolEidt, @fiigii.

Folding LoadAligned, on newer hardware, might need a slightly more in-depth discussion and definitely needs an explicit design decision. Folding a 128-bit load into a scalar instruction needs a similar discussion/decision.

Folding LoadAligned

LoadAligned has an observable side-effect in that it will cause an AccessViolation if you pass it an unaligned address.

When using the non-VEX encoding (generally legacy hardware), INS xmm, [mem], the memory location must be aligned or the instruction will also raise an AccessViolation exception. So it is fine to fold LoadAligned here (and it is the only thing we can fold).

However, when using the VEX encoding (newer hardware that supports AVX or AVX2), INS xmm, xmm, [mem], the memory location can be aligned or unaligned and no AccessViolation exception is raised. As such, we can fold regular Loads or LoadAligneds; however, folding the latter can cause an observable change in side effects.

For example, in the code below, if you are on newer hardware and the second load is not marked as contained, you will get an AV. However, if it is contained, the code will "just work".

var value = Sse.LoadAlignedVector128(address);
var result = Sse.Add(value, Sse.LoadAlignedVector128(address + 2));

Folding larger reads

For the scalar instructions, the address never needs to be aligned (on newer or older hardware) and is only a partial read (m8, m16, m32 or m64) rather than a full read (m128).

If a user were to use a full read, and pass it directly into an operation, there would be a chance to fold it.

For example, in the code below, the second load will normally read the full 128 bits. However, the value is never stored and the upper bits are not used in the operation. As such, it is "safe" to fold this operation. However, the full read has potentially observable side effects, such as caching, page loads, or an AV (if the read would extend past the end of an array or across a page boundary), which folding would hide.

var value = Sse.LoadScalarVector128(address);
var result = Sse.AddScalar(value, Sse.LoadVector128(address + 1));
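For illustration, a minimal sketch of that scenario (not from the PR; assumes the System.Runtime.Intrinsics.X86 namespace, and the 4-element array and the offset are assumptions):

static unsafe Vector128<float> NarrowingFoldSketch()
{
    float[] data = new float[4];
    fixed (float* address = data)
    {
        var value = Sse.LoadScalarVector128(address + 3);
        // Unfolded, the next load (movups) reads 16 bytes starting at address + 3, running
        // 12 bytes past the end of the array, which may fault if that crosses a page boundary.
        // Folded into addss, only 4 bytes are read and the potential AV silently disappears.
        return Sse.AddScalar(value, Sse.LoadVector128(address + 3));
    }
}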

@mikedn commented Jan 30, 2018

For example, on the below, if you are on newer hardware, and the second load is not marked as contained, you will get an AV. However, if it is contained, the code will "just work".

I doubt that the fact that the code will just work is an issue. That is, it's unlikely that people will use LoadAligned to check if the data is aligned.

For the scalar instructions, the address never needs to be aligned (on newer or older hardware) and is only a partial read (m8, m16, m32 or m64) rather than a full read (m128).

I'm surprised that this is even up for discussion. Reading more memory than requested is a very dubious thing to do and should only be done in very specific circumstances.

@tannergooding (Member Author)

I doubt that the fact that the code will just work is an issue. That is, it's unlikely that people will use LoadAligned to check if the data is aligned.

Yes, but it is an observable side-effect, which is something we normally don't allow you to hide (as has been stated in at least a couple other issues where I, or others, have asked for better codegen/inlining 😄).

I'm surprised that this is even up for discussion. Reading more memory than requested it's a very dubious thing to do and should only be done in very specific circumstances.

I imagine we should just not fold these, but I want an explicit decision so that I can document it in IsContainableHWIntrinsicOp.

@mikedn commented Jan 30, 2018

Yes, but it is an observable side-effect, which is something we normally don't allow you to hide (as has been stated in at least a couple other issues where I, or others, have asked for better codegen/inlining 😄).

That's not really a side effect in the common sense. It's not as if you're transforming a null pointer dereference into a 0.

I imagine we should just not fold these, but I want an explicit decision so that I can document it in IsContainableHWIntrinsicOp.

It's the other way around. You would need an explicit decision to generate wider loads.

@tannergooding (Member Author) commented Jan 30, 2018

That's not really a side effect in the common sense. It's not as if you're transforming a null pointer dereference in a 0.

I think any generated exception is an observable side effect according to the spec. Either way, getting an official ruling would be good.

NOTE: I am in favor of folding these, but possibly only in release/optimized code (and not in debug/min-opts code).

It's the other way around. You would need an explicit decision to generate wider loads.

Why? The user coded AddScalar(value, LoadVector128(address)); this means the user stated: do a wide load. It should be an explicit decision to allow shrinking such a load (since the folded operation would only read an m32, but the unfolded operation reads an m128).

For the other direction, Add(value, LoadScalarVector128(address)), I don't think we can safely fold (since the folded operation reads an m128, but the unfolded operation only reads an m32), as it could cause reading across a page/cache-line boundary, cause an AV exception, read past the end of an array, etc.
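For illustration, a minimal sketch of the direction that is unsafe to fold (not from the PR; assumes the System.Runtime.Intrinsics.X86 namespace, and the 4-element array is an assumption):

static unsafe Vector128<float> WideningFoldSketch(Vector128<float> value)
{
    float[] data = new float[4];
    fixed (float* p = data)
    {
        // The user asked for a 4-byte read of the last element. Folding this load into
        // addps xmm, [mem] would turn it into a 16-byte read starting at p + 3, which can
        // run past the end of the array or cross a page/cache-line boundary (and, since the
        // upper elements would no longer be zero, it could also change the result).
        return Sse.Add(value, Sse.LoadScalarVector128(p + 3));
    }
}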

@CarolEidt

Obviously I agree that we can't fold something like Add(value, LoadScalarVector128(address)), as it could cause an exception that would not have otherwise occurred (not to mention that I believe it would get a different answer).

However, I am also not in favor of any folding that would suppress an exception that would otherwise have happened. Getting optimal code may require the developer to check for (and use) higher-level ISA instructions to get the folding, but I think that's a reasonable expectation given that we already expect the developer to have a detailed knowledge of the hardware - this is, after all, effectively inline assembly.

@mmitche (Member) commented Jan 30, 2018

@dotnet-bot test this please

@mmitche closed this Jan 30, 2018
@mmitche reopened this Jan 30, 2018
@tannergooding (Member Author)

However, I am also not in favor of any folding that would suppress an exception that would otherwise have happened. Getting optimal code may require the developer to check for (and use) higher-level ISA instructions to get the folding, but I think that's a reasonable expectation given that we already expect the developer to have a detailed knowledge of the hardware - this is, after all, effectively inline assembly.

My concern on this is that for addps xmm, xmm, we only expose Sse.Add and we don't redeclare Add(Vector128<float>, Vector128<float>) for AVX. Instead, Sse.Add emits the VEX encoding on hardware that supports it and the non-VEX encoding on older hardware (or certain newer hardware that doesn't support AVX). For Avx, we only expose Add(Vector256<float>, Vector256<float>) (vaddps ymm, ymm, ymm) which, of course, is a larger read/write and would require a different algorithm.

If a user has already validated that the loads will be aligned, they shouldn't need to write two versions of the algorithm (one that uses LoadAligned on older hardware and one that uses Load on newer hardware) so that folding happens in both places.
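As an illustration of that duplication, a minimal sketch using a hypothetical helper (the name, and Avx.IsSupported as a stand-in for "the VEX encoding is available", are assumptions, not from the PR):

static unsafe Vector128<float> LoadForFolding(float* alignedAddress)
{
    // On VEX-capable hardware a plain Load can be folded into the consuming instruction;
    // on older hardware only LoadAligned can be folded, so without folding LoadAligned on
    // newer hardware the caller would have to pick the load based on the ISA.
    return Avx.IsSupported
        ? Sse.LoadVector128(alignedAddress)
        : Sse.LoadAlignedVector128(alignedAddress);
}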

That being said, this is a perfect use-case for Assume.Alignment() (see https://github.com/dotnet/corefx/issues/26188), which would allow the JIT to fold if optimizations are enabled.

// Only fold a scalar load into a SIMD scalar intrinsic to ensure the number of bits
// read remains the same. Likewise, we can't fold a larger load into a SIMD scalar
// intrinsic as that would read fewer bits than requested.
case NI_SSE_LoadScalar:
tannergooding (Member Author):

Probably want to assert here that the baseType of the containing node is the same as the baseType of the load.

It should already line up due to the method signatures, but we don't want to end up in a scenario where the LoadScalar is for an m32, but the containing node would do an m64.

tannergooding (Member Author):

Fixed.

@mikedn commented Jan 30, 2018

Why? The user coded AddScalar(value, LoadVector128(address))

That one is fine; there's nothing interesting about containment in this case. I was talking about the case where a wider load would be generated. Narrowing loads can also be problematic if the data is not aligned: you could end up hiding an invalid memory access.

@mikedn commented Jan 30, 2018

However, I am also not in favor of any folding that would suppress an exception that would otherwise have happened.

AFAIR native compilers have no problem doing this. It's not clear why it would be a problem in .NET. Maybe because the intrinsic names are somewhat different with regard to alignment: in C/C++ you basically have Load and LoadUnaligned, not Load and LoadAligned.

@mikedn commented Jan 30, 2018

If a user has already validated that the loads will be aligned, they shouldn't need to write two versions of the algorithm (one that uses LoadAligned on older hardware and one that uses Load on newer hardware) so that folding happens in both places.

Indeed, they shouldn't need to write two versions, because the Load version would be good enough for current hardware.

@tannergooding (Member Author) commented Jan 30, 2018

It's not clear why it would be problem in .NET.

Because LoadAligned is one of the few intrinsic instructions that has an exception behavior (this is excluding the floating-point exceptions we mask by default, as per the spec).

Its implementation is effectively:

if ((address % 16) != 0)
{
    throw new AccessViolationException();
}

return LoadUnaligned(address);

This is in the same vein as why we can't transform:

int Select(bool condition, int trueValue, int falseValue)
{
    return condition ? trueValue : falseValue;
}

int MyMethod(bool condition, int[] a, int[] b)
{
    return Select(condition, a[5], b[6]);
}

into:

int MyMethod(bool condition, int[] a, int[] b)
{
    return condition ? a[5] : b[6];
}

even though the native compiler can, since it would mask an NRE if a was null but condition was false.

@tannergooding (Member Author)

Indeed, they shouldn't need to write 2 versions. Because the Load version would be good enough for current hardware.

There are modern processors (such as the Kaby Lake Celeron/Pentiums -- https://ark.intel.com/products/97460/Intel-Pentium-Processor-G4620-3M-Cache-3_70-GHz), which don't support VEX encoding.

@mikedn commented Jan 30, 2018

This is in the same vein as why we can't transform:

Are you really going to use that as an argument? Really really?

There are modern processors (such as the Kaby Lake Celeron/Pentiums -- ) which don't support VEX encoding

And let me guess - people who use such processors expect to get best performance from a gimped processor and are the primary targets for .NET Core apps. Or .NET apps in general. Not to mention that containment in itself won't necessarily make a significant difference. Contained or not, a memory load is a memory load. This is mainly a code size optimization and it may very well not make any difference in many cases.

@tannergooding (Member Author)

Are you really going to use that as an argument? Really really?

Yes :)

It was made very clear, in my proposal to allow the folding of such things, why we can't. Folding LoadAligned is the exact same scenario: it would mask an exception that would otherwise be thrown.

The solution, I believe, is to move forward with a proposal such as https://github.com/dotnet/corefx/issues/26188, which did garner support, but which required a reasonable use case (such as this) and a prototype to back it up.

And let me guess - people who use such processors expect to get best performance from a gimped processor and are the primary targets for .NET Core apps. Or .NET apps in general.

Regardless of whether or not you view the processor as "gimped", some people will end up getting processors like that or won't upgrade every new processor generation, and some of those people will run .NET.

That alone isn't enough of a case to argue for allowing it, without some more concrete numbers, but it is a start.

Not to mention that containment in itself won't necessarily make a significant difference. Contained or not, a memory load is a memory load. This is mainly a code size optimization and it may very well not make any difference in many cases.

Code size is also important for throughput in performance-oriented code (such as the general use case for HWIntrinsics). If you end up doubling the code size because Loads aren't folded, it can make a measurable difference in total execution time, especially if that code is on a hot path.

@CarolEidt left a comment

LGTM with one optional suggestion

@@ -1297,6 +1297,10 @@ void CodeGen::genConsumeRegs(GenTree* tree)
{
genConsumeReg(tree->gtGetOp1());
}
else if (tree->OperIsHWIntrinsic())

CarolEidt:

It seems like this would be a good place to assert that HW_Flag_NoContainment is not set, or that IsContainableHWIntrinsicOp(tree, tree->gtGetOp1()), but perhaps that's overkill.

tannergooding (Member Author):

I think asserting is a good idea. I'll update.

This is also pending #16114 and #16115. The former is to fix up the various flags, and the latter to make it easier to add some tests with this change (we currently don't have any tests that use Load, LoadScalar, or LoadAligned in a way that they could be contained).

tannergooding (Member Author):

Fixed.

tannergooding (Member Author):

Hmm, actually, I can't easily check IsContainableHWIntrinsicOp.

It is a member of Lowering and depends on IsContainableMemoryOp, which itself depends on m_lsra.

I would need to make m_pLowering visible from compiler somewhere to use that check.

The other check, flagsOfHWIntrinsic, currently requires friend access to the compiler, but I don't see any reason why those can't be public.

@CarolEidt, what would you recommend here?

CarolEidt:

I can't recall why IsContainableMemoryOp has its actual implementation on LinearScan when all the containment analysis is done in Lowering (even before we moved the TreeNodeInfoInit to LSRA). In any case, it seems like something we might want to check during code generation, so we may want to expose it in some way. That said, I don't feel that it's critical.

On the other hand, the flagsOfHwIntrinsic I think should definitely be public so that they are accessible from codegen.

tannergooding (Member Author):

I can't recall why IsContainableMemoryOp has its actual implementation on LinearScan when all the containment analysis is done in Lowering (even before we moved the TreeNodeInfoInit to LSRA).

Looks to be one use in LSRA, for SIMDIntrinsicGetItem: https://github.com/dotnet/coreclr/blob/master/src/jit/lsraxarch.cpp#L3026

In any case, it seems like something we might want to check during code generation, so we may want to expose it in some way. That said, I don't feel that it's critical.

Exposing a getLowering() method in the compiler is easy enough, but it also requires us to #include "lower.h" somewhere where codegen can pick it up (probably just in codegen.h). I've commented out this particular assert with a TODO and will finish looking tomorrow if I get the chance.

On the other hand, the flagsOfHwIntrinsic I think should definitely be public so that they are accessible from codegen.

It is already available to codegen; I just forgot to put the assert in an #ifdef _TARGET_XARCH_ since it's in codegenlinear. 😄

tannergooding (Member Author):

Both asserts I tried to add here would have been invalid. tree is the node which was contained; we would want to be doing the assert on the node which contains tree.

tannergooding (Member Author):

Moved the asserts to hwintrinsiccodegenxarch.cpp.

@tannergooding changed the title from "[WIP] Updating the HWIntrinsic codegen to support marking LoadVector128 and LoadAlignedVector128 as contained." to "Updating the HWIntrinsic codegen to support marking LoadVector128 and LoadAlignedVector128 as contained." on Feb 1, 2018
@tannergooding (Member Author)

The product code here should be "complete". I'd still like to wait on #16115 and add some explicit tests before merging, however.

@tannergooding (Member Author) commented Feb 1, 2018

Updated the templated tests to also validate that using LoadVector and LoadAlignedVector works. We should eventually add tests that use LoadScalarVector as well, but that will require slightly more complex changes to the template.


// Set `compFloatingPointUsed` to cover the scenario where an intrinsic is being used on SIMD fields, but
// where no SIMD local vars are in use. This is the same logic as is used for FEATURE_SIMD.
compFloatingPointUsed = true;
tannergooding (Member Author):

FYI. @CarolEidt, @fiigii

This is the same logic as https://github.com/dotnet/coreclr/blob/master/src/jit/simd.cpp#L3094. An assert was being hit in LSRA for reflection calls which used LoadVector128.

This needs to be cleaned up to use a flag so we aren't setting it on intrinsics which don't actually use any floating-point nodes/registers (such as Crc32).

Also FYI. @sdmaclea. This appears to impact ARM64 as well (based on the FEATURE_SIMD code), and you may need something similar.

tannergooding (Member Author):

I also imagine various code could be updated elsewhere in the JIT to properly support TYP_SIMD as floating-point registers, rather than requiring this special handling here.

@@ -242,7 +250,6 @@ void CodeGen::genHWIntrinsic_R_R_RM(GenTreeHWIntrinsic* node, instruction ins)
offset = 0;

// Ensure that all the GenTreeIndir values are set to their defaults.
assert(memBase->gtRegNum == REG_NA);
tannergooding (Member Author):

This was found to be incorrect in the larger emitInsBinary refactoring, but wasn't also removed from here.

@tannergooding (Member Author)

@CarolEidt, @fiigii. Could you give another review pass when you get the opportunity?

I think I'm satisfied with the changes now.

  • We should have test coverage over most of the containment scenarios
    • Explicit tests that use LoadScalar still need coverage, but require a different template (not applicable to Vector256)
    • Explicit tests that cover containment for the commutative case are still needed, but again require a different template
  • We should have asserts in codegen to validate the containment checks are as expected.

@CarolEidt left a comment

LGTM - thanks!

@@ -215,6 +215,9 @@ void CodeGen::genHWIntrinsic_R_R_RM(GenTreeHWIntrinsic* node, instruction ins)

if (op2->isContained() || op2->isUsedFromSpillTemp())
{
assert((Compiler::flagsOfHWIntrinsic(node->gtHWIntrinsicId) & HW_Flag_NoContainment) == 0);
assert(compiler->m_pLowering->IsContainableHWIntrinsicOp(node, op2) || op2->IsRegOptional());

CarolEidt:

Thanks!

@fiigii left a comment

Thank you for the work.

@tannergooding (Member Author) commented Feb 2, 2018

Ubuntu test failures are due to the API name fixes and have already been fixed in #16169. Edit: one of them is due to a change brought in by this PR; fixing.

@tannergooding (Member Author)

Failures are from SSE2 and AVX tests attempting to use the not-yet-implemented LoadVector intrinsics for those ISAs; will fix.

@fiigii commented Feb 3, 2018

SSE2 and AVX tests attempting to use the not yet implemented LoadVector intrinsics

Are you going to disable this behavior? I can implement these Load* intrinsics this weekend.

@tannergooding (Member Author)

Are you going to disable this behavior?

Yes (effectively).

I am going to revert the test changes for SSE2/AVX (since those just added the Load/LoadAligned tests) and comment those out from generation for the time being.

@tannergooding (Member Author)

Same as before (just squashed into a product-changes commit and a test-changes commit), but with the invalid test changes reverted for SSE2/AVX/AVX2 (since they don't have the required Load/LoadAligned intrinsics implemented yet).

@tannergooding (Member Author)

test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd
test Windows_NT x64 Checked jitnox86hwintrinsic

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx
test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
test Windows_NT x86 Checked jitx86hwintrinsicnosimd
test Windows_NT x86 Checked jitnox86hwintrinsic

test Ubuntu x64 Checked jitincompletehwintrinsic
test Ubuntu x64 Checked jitx86hwintrinsicnoavx
test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
test Ubuntu x64 Checked jitx86hwintrinsicnosimd
test Ubuntu x64 Checked jitnox86hwintrinsic

test OSX10.12 x64 Checked jitincompletehwintrinsic
test OSX10.12 x64 Checked jitx86hwintrinsicnoavx
test OSX10.12 x64 Checked jitx86hwintrinsicnoavx2
test OSX10.12 x64 Checked jitx86hwintrinsicnosimd
test OSX10.12 x64 Checked jitnox86hwintrinsic

@tannergooding merged commit eb54e48 into dotnet:master on Feb 3, 2018
@tannergooding deleted the sse-intrinsics branch on May 30, 2018