Consider exposing a HWIntrinsic that allows efficient loading, regardless of encoding. #954

Closed
tannergooding opened this issue Nov 17, 2018 · 84 comments
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime.Intrinsics

@tannergooding
Member

Rationale

On x86 hardware, for SIMD instructions, there are effectively two encoding formats: "legacy" and "VEX".

There are some minor differences between these two encodings, including the number of parameters they take and whether the memory operand has aligned or unaligned semantics.

As a brief summary:

  • The legacy encoding generally takes two operands (ins reg, reg/[mem]), the first serves as both a source and destination, and the last as a source which can generally be a register or memory address. The memory address form has "aligned" semantics and will cause an exception to be thrown if the address is not "naturally aligned" (generally 16-bytes).
  • The VEX encoding generally takes three operands (ins reg, reg, reg/[mem]), the first being the destination, the second being a source, and the last as a source which can generally be a register or memory address. The memory address form has "unaligned" semantics and will not cause an exception to be thrown, regardless of the input.
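
For example (illustrative only; the instruction is arbitrary), the same addition under each encoding:

* addps xmm0, xmm1 ; legacy: xmm0 = xmm0 + xmm1
* addps xmm0, [mem] ; legacy folded load: [mem] must be 16-byte aligned
* vaddps xmm0, xmm1, xmm2 ; VEX: xmm0 = xmm1 + xmm2
* vaddps xmm0, xmm1, [mem] ; VEX folded load: no alignment requirement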

Today, we expose both Load (which has unaligned semantics) and LoadAligned (which has aligned semantics). Users will often want to generate the most efficient code possible, and the JIT, in order to preserve semantics (that is, not silently drop an exception that would otherwise be thrown), changes which method it will "fold" depending on the encoding it is currently emitting. It may therefore be beneficial to expose an intrinsic which allows the user to specify "this address is aligned; do whatever load is most efficient". This would ensure the load can be folded regardless of the current encoding.

Proposed API

namespace System.Runtime.Intrinsics.X86
{
    public abstract partial class Sse
    {
        // New APIs:
        public static unsafe Vector128<float> LoadVector128Unsafe(float* address);

        // Existing APIs:
        public static unsafe Vector128<float> LoadVector128(float* address);
        public static unsafe Vector128<float> LoadAlignedVector128(float* address);
    }

    public abstract partial class Sse2
    {
        // New APIs:
        public static unsafe Vector128<double> LoadVector128Unsafe(double* address);
        public static unsafe Vector128<byte> LoadVector128Unsafe(byte* address);
        public static unsafe Vector128<sbyte> LoadVector128Unsafe(sbyte* address);
        public static unsafe Vector128<short> LoadVector128Unsafe(short* address);
        public static unsafe Vector128<ushort> LoadVector128Unsafe(ushort* address);
        public static unsafe Vector128<int> LoadVector128Unsafe(int* address);
        public static unsafe Vector128<uint> LoadVector128Unsafe(uint* address);
        public static unsafe Vector128<long> LoadVector128Unsafe(long* address);
        public static unsafe Vector128<ulong> LoadVector128Unsafe(ulong* address);

        // Existing APIs:
        public static unsafe Vector128<double> LoadVector128(double* address);
        public static unsafe Vector128<byte> LoadVector128(byte* address);
        public static unsafe Vector128<sbyte> LoadVector128(sbyte* address);
        public static unsafe Vector128<short> LoadVector128(short* address);
        public static unsafe Vector128<ushort> LoadVector128(ushort* address);
        public static unsafe Vector128<int> LoadVector128(int* address);
        public static unsafe Vector128<uint> LoadVector128(uint* address);
        public static unsafe Vector128<long> LoadVector128(long* address);
        public static unsafe Vector128<ulong> LoadVector128(ulong* address);

        public static unsafe Vector128<double> LoadAlignedVector128(double* address);
        public static unsafe Vector128<byte> LoadAlignedVector128(byte* address);
        public static unsafe Vector128<sbyte> LoadAlignedVector128(sbyte* address);
        public static unsafe Vector128<short> LoadAlignedVector128(short* address);
        public static unsafe Vector128<ushort> LoadAlignedVector128(ushort* address);
        public static unsafe Vector128<int> LoadAlignedVector128(int* address);
        public static unsafe Vector128<uint> LoadAlignedVector128(uint* address);
        public static unsafe Vector128<long> LoadAlignedVector128(long* address);
        public static unsafe Vector128<ulong> LoadAlignedVector128(ulong* address);
    }
}

API Semantics

This shows the semantics of each API under the legacy and VEX encoding.

Most instructions support having the last operand be either a register or a memory operand: ins reg, reg/[mem]

When folding does not happen, the load is emitted as an explicit, separate instruction:

For example, Sse.Add(input, Sse.Load(address)) would generate
* movups xmm1, [address]
* addps xmm0, xmm1

Folding, by contrast, merges the load into the consuming instruction:

For example, Sse.Add(input, Sse.Load(address)) would generate
* addps xmm0, [address]

Under the legacy encoding, the folded form of an instruction asserts that the address is aligned (generally this means (address % 16) == 0).
Under the VEX encoding, the folded form does no such validation and allows any input.

Load (mov unaligned)

            Unaligned Input     Aligned Input
Legacy      No Folding          No Folding
VEX         Folds               Folds

LoadAligned (mov aligned)

            Unaligned Input      Aligned Input
Legacy      Folds, Throws        Folds
VEX         No Folding, Throws   No Folding

LoadUnsafe

            Unaligned Input          Aligned Input
Legacy      Folds, Throws            Folds
VEX         Folds, Throws in Debug   Folds

Additional Info

Some open ended questions:

  • Is this possibly better served by analysis or hints provided to the JIT?
    • Not sure if providing analysis is worthwhile, given that modern hardware (since ~2011) almost always has VEX encoding support; this is more a nicety for people supporting older hardware or working around a JIT bug (COMPlus_EnableAVX=0)
  • Should we also expose Store, StoreUnaligned, and StoreAligned counterparts or just keep Store and StoreAligned?
  • Should we also expose equivalents on Avx/Avx2?
    • Noting that Avx/Avx2 instructions require the VEX encoding and therefore always have "unaligned" semantics, but that Avx/Avx2 define and expose LoadAlignedVector256/StoreAligned instructions anyway
@tannergooding
Member Author

CC. @eerhardt, @fiigii, @CarolEidt for thoughts

@gfoidl
Member

gfoidl commented Nov 19, 2018

Today, we expose both Load (which has unaligned semantics) and LoadAligned (which has aligned semantics).

I would rename the unaligned load to what it is, LoadUnaligned; then this proposed method can be called just Load, as the proposed name is quite long.

"This address is aligned, do whatever load is most efficient".

For that, the name Load would match.

Should we also expose StoreAlignedOrUnaligned methods?

If load is exposed, the same should be done for store. Hence rename Store to StoreUnaligned and use the name Store for the method analogous to this proposal.

Should we also expose equivalents on Avx/Avx2?

With my proposed method names this wouldn't be symmetric with Avx/Avx2 anymore, so yes, there should be the same methods -- even if they just forward to the unaligned ones.

@eerhardt
Member

I would rename the unaligned load to that what it is LoadUnaligned, then this proposed method can be called just Load,

I was thinking the same thing.

@tannergooding
Member Author

Updated to be LoadVector128, LoadAlignedVector128, and LoadUnalignedVector128.

@fiigii
Contributor

fiigii commented Nov 29, 2018

Sorry for the delay to reply.
I don't quite understand this proposal. Currently, LoadAligned can be folded into the consumer instruction with SSE and VEX encoding, but Load (LoadUnaligned in this proposal) only can be folded with VEX encoding. This behavior also matches C++ compilers.
So, what is the difference between the proposed Load intrinsic and the existing LoadAligned?

Meanwhile, guaranteeing memory alignment should be the user's responsibility, and the dynamic assertion ("Asserts (address % 16) == 0 when optimizations are disabled") does not always work.

@fiigii
Contributor

fiigii commented Nov 29, 2018

Additionally, on modern Intel CPUs, an unaligned load/store won't be slower than an aligned load/store (if the memory access stays within one cache line), so we suggest always using unaligned loads/stores. That is why the Load intrinsic maps to the unaligned load instruction.

@tannergooding
Member Author

I don't quite understand this proposal.

The JIT has to preserve the semantics of the underlying instruction. Currently LoadAligned will cause a hardware exception to be raised if the input address is not aligned; LoadUnaligned (presently named just Load) currently has unaligned semantics and will not raise an exception, regardless of the alignment of the input address.

When using the legacy encoding, the vast majority of the instructions which allow a memory operand have "aligned" semantics and will cause a hardware exception to be raised if the input address is not aligned.
When using the VEX encoding, the reverse is true, and the vast majority of the instructions which allow a memory operand have "unaligned" semantics and will not cause a hardware exception to be raised if the input address is not aligned.

This means that, when we emit using the legacy encoding, we can only fold and emit efficient code for LoadAligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)) we can emit addps xmm0, [pRight]; however, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)) we have to instead emit movups xmm1, [pRight]; addps xmm0, xmm1

And when using the VEX encoding, we can only fold and emit efficient code for LoadUnaligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)) we have to emit movaps xmm1, [pRight]; vaddps xmm0, xmm0, xmm1; however, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)) we can just emit vaddps xmm0, xmm0, [pRight]

We have to do it this way because you would otherwise risk losing an exception that would have been raised (e.g. you fold LoadAligned when emitting the VEX encoding) or you risk raising an exception that would otherwise not have been (e.g. you fold LoadUnaligned when emitting the legacy encoding).

So, what is the difference between the proposed Load intrinsic and the existing LoadAligned

We can expect that users will want to emit the most efficient code possible; however, we can also expect that users, where possible, will be using various well-known tricks to get their data aligned so that they don't incur the penalty for reading/writing across a cache-line or page boundary (which will occur every 4 reads/writes on most modern processors, when working with 128-bit data).

This leads users to writing algorithms that do the following (where possible):

// Process first few elements to become aligned
// Process data as aligned
// Process trailing elements that don't fit in a single `VectorXXX<T>`
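
A minimal sketch of that shape (a hypothetical helper, assuming pinned float data and 16-byte vector alignment):

using System.Diagnostics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static unsafe float Sum(float* data, int length)
{
    float* p = data;
    float* end = data + length;
    float sum = 0;

    // Process first few elements to become aligned
    while (p < end && ((nint)p % 16) != 0)
        sum += *p++;

    // Process data as aligned
    var acc = Vector128<float>.Zero;
    while (p + Vector128<float>.Count <= end)
    {
        Debug.Assert(((nint)p % 16) == 0);
        acc = Sse.Add(acc, Sse.LoadVector128(p));
        p += Vector128<float>.Count;
    }
    sum += acc.GetElement(0) + acc.GetElement(1) + acc.GetElement(2) + acc.GetElement(3);

    // Process trailing elements that don't fit in a single Vector128<float>
    while (p < end)
        sum += *p++;

    return sum;
}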

They will naturally want to assert their code is correct and ensure codegen is efficient, so they will (in the Process data as aligned section) likely have a Debug.Assert((address % 16) == 0) and will be using LoadUnaligned (presently just called Load).

  • This is what we are doing in ML.NET today, for example

However, it is also not unreasonable that they will want efficient codegen if the user happens to be running on some pre-AVX hardware (as is the case with the "Benchmarks Game", which runs on a Q6600 -- which, it is worth noting, also doesn't have fast unaligned loads). If the user wants to additionally support this scenario, they need to write a helper method which effectively does the following:

public static unsafe Vector128<float> EfficientLoad(float* address)
{
    if (Avx.IsSupported)
    {
        // Must assert, since an unaligned address would fault on the non-AVX path below
        Debug.Assert(((nint)address % 16) == 0);
        return Sse.LoadUnaligned(address);
    }
    else
    {
        return Sse.LoadAligned(address);
    }
}

Which also depends on the JIT being able to successfully recognize and fold this code.

So, this proposal suggests providing this helper method built-in by renaming the existing Load method to LoadUnaligned and exposing a new method, Load (I had originally suggested something like LoadAlignedOrUnaligned, but it was suggested I change it to the current name), which does effectively the above. That is, on hardware with VEX encoding support, it uses LoadUnaligned, and when optimizations are disabled it will validate that the input address is aligned; while on older hardware that uses the legacy encoding, it will use LoadAligned. This ensures that the code can always be folded efficiently and it validates that the semantics remain the same (without hurting perf when users compile their executables with optimizations enabled). We also still have the LoadUnaligned and LoadAligned methods explicitly for the cases where the user can't always do the "optimal" thing (sometimes you just have to work with data that can be unaligned).

@fiigii
Contributor

fiigii commented Nov 29, 2018

And when using the VEX encoding, we can only fold and emit efficient code for LoadUnaligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)) we have to emit movaps xmm1, [pRight]; vaddps xmm0, xmm0, xmm1; however, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)) we can just emit vaddps xmm0, xmm0, [pRight]

We have to do it this way because you would otherwise risk losing an exception that would have been raised (e.g. you fold LoadAligned when emitting the VEX encoding) or you risk raising an exception that would otherwise not have been (e.g. you fold LoadUnaligned when emitting the legacy encoding).

Ah, I see the problem now. But we should fold LoadAligned with the VEX encoding, as it does not throw an exception.

They will naturally want to assert their code is correct and ensure codegen is efficient, so they will (in the Process data as aligned section) likely have a Debug.Assert((address % 16) == 0) and will be using LoadUnaligned (presently just called Load).

This assert is useless. address could be either aligned or unaligned in different runs.

@tannergooding
Member Author

as it does not throw an exception.

That is the problem and why we can't fold it. LoadAligned emits movaps which always throws if the address is unaligned. If you fold it into a VEX-encoded instruction, it will no longer throw. This means that if you passed in an unaligned address, the exception would silently disappear whenever the JIT folded it.

This assert is useless. address could be either aligned or unaligned in different runs.

The point of the assert is to help catch places where the user is calling Load without pinning the data and ensuring that it is properly aligned. It will still be possible for a user to do this incorrectly, but it would be an explicit opt-in.
We could potentially use another name, such as LoadUnsafe, or something that conveys that the data must be aligned but we won't always validate that it is.

@CarolEidt
Contributor

I like this proposal, and agree that the assert, while not guaranteeing to catch all potential dynamic data configurations, is very likely to catch nearly all cases.
Given that there is a great deal of expertise required to use the hardware intrinsics effectively, I don't think it's necessary to encumber this with the Unsafe suffix. Rather, documenting its behavior and asserting in the unoptimized case should be sufficient.

@fiigii
Contributor

fiigii commented Nov 29, 2018

Ah, I looked at the manual again; yes, vmovaps can still throw exceptions with the VEX encoding, but other instructions won't. Thanks.

@fiigii
Contributor

fiigii commented Nov 29, 2018

I don't think it's necessary to encumber this with the Unsafe suffix

Agree; actually, the other Load* are not safe either 😄

@mikedn
Contributor

mikedn commented Nov 30, 2018

and this is more a nicety for if people are supporting older hardware or working around a JIT bug (COMPlus_EnableAVX=0)

Are there enough such people to justify adding more intrinsics?

Should we also expose equivalents on Avx/Avx2?

It's perhaps weird for SSE Load to have one semantic and AVX Load to have a different semantic. But that would imply adding more intrinsics. Or perhaps AVX Load should simply be renamed to LoadUnaligned, though the longer name sucks. Same for stores.

But to quote @CarolEidt "Given that there is a great deal of expertise required to use the hardware intrinsics effectively", I don't think such difference in semantics justifies adding more intrinsics.

Should these instructions have a Debug.Assert((address % 16) == 0)?

Where does this assert go exactly? Is Load a true intrinsic or a helper? Who's using debug builds of corelib, besides the CoreCLR team?

@fiigii
Contributor

fiigii commented Nov 30, 2018

I don't think such difference in semantics justifies adding more intrinsics.

Agree, I suggest that we fold more of the existing load intrinsics rather than adding new ones.

  • SSE encoding: only fold LoadAligned because folding LoadUnaligned would throw exceptions that the users did not expect.
  • VEX encoding: fold both LoadAligned and LoadUnaligned. An individual LoadAligned may throw exceptions while folded ones won't, but I think this is fine. We assume that users "expect" exceptions from LoadAligned and handle them in their own code, and a folded LoadAligned just throws exceptions with 0% probability.

The above codegen strategy is already adopted by all the mainstream C/C++ compilers.

@tannergooding
Member Author

Are there enough such people to justify adding more intrinsics?

Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.

It's perhaps weird for SSE Load to have one semantic and AVX Load to have a different semantic

Yes, if this API is approved, we would (at a minimum) rename the AVX intrinsic to be LoadUnaligned. I already have a note for the API review about whether we should have Store mirror these names.

Where does this assert go exactly? Is Load a true intrinsic or a helper?

A true intrinsic and the JIT would add the additional check when optimizations are disabled. This is the only way to get it to work in end-user code for release builds of the CoreCLR.
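
Conceptually, a sketch of the debug-mode semantics (not the actual JIT expansion; the name is from this proposal):

// Hypothetical expansion of the proposed intrinsic when optimizations are disabled:
if (((nint)address % 16) != 0)
    throw new AccessViolationException(); // surface the #GP(0) that the aligned form would raise
return Sse.LoadVector128(address);        // movups, or folded into the consumer under VEX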

and a folded LoadAligned just throws exceptions with 0% probability

I don't think this is acceptable without an explicit user opt-in. The runtime has had a very strict policy on not silently removing side-effects.

The above codegen strategy is already adopted by all the mainstream C/C++ compilers.

C/C++ compilers make a number of optimizations that we have, at this point, decided not to do, instead addressing them via analyzers or investigating them in more depth later.
We have, from the start, made a lot of effort to ensure that the APIs are deterministic, straightforward, and that they "fit in" with the .NET look, feel, expectations. And, when we want to additionally expose the "less safe" mechanism that C/C++ had, we have discussed whether we believe it is worthwhile and how to expose it independently (CreateScalarUnsafe and ToVector256Unsafe are both good examples of this).

@fiigii
Contributor

fiigii commented Nov 30, 2018

The runtime has had a very strict policy on not silently removing side-effects

It is not removing side-effects; it has no side-effects (hardware exceptions) at all.

@tannergooding
Member Author

It is not removing side-effects; it has no side-effects (hardware exceptions) at all.

All exceptions, including HardwareExceptions are considered side-effects.

The movaps raises a hardware #GP exception if its input is not properly aligned, and the JIT intercepts that and converts it into a System.AccessViolationException.

@mikedn
Contributor

mikedn commented Nov 30, 2018

Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.

I didn't ask about cases where you tried to do that. I asked if there are enough people who are using .NET Core on old or gimped CPUs and expect things to work perfectly rather than just reasonably.

A true intrinsic and the JIT would add the additional check when optimizations are disabled. This is the only way to get it to work in end-user code for release builds of the CoreCLR.

Sounds to me that you don't need some assert or other kind of check; you just need to import Load as LoadAligned. Whether it's a good idea to do that in debug builds is debatable. Some people may end up disabling optimizations to work around a JIT bug or to be able to get a correct stack trace for a hard-to-track-down bug. Now they risk getting exceptions that would not occur with optimizations enabled.

The runtime has had a very strict policy on not silently removing side-effects

Please do not mislead people. The runtime does not impose any semantics on intrinsics, it's the other way around. LoadAligned may be defined as "throws an exception if the address is unaligned" or as "may throw an exception if the address is unaligned". That may or may not be a good idea. But it doesn't have anything to do with any runtime policy.

We have, from the start, made a lot of effort to ensure that the APIs are deterministic, straightforward, and that they "fit in" with the .NET look, feel, expectations. And, when we want to additionally expose the "less safe" mechanism that C/C++ had, we have discussed whether we believe it is worthwhile and how to expose it independently (CreateScalarUnsafe and ToVector256Unsafe are both good examples of this).

Indeed. Squabbles over names resulted in the string intrinsics being removed from .NET Core 3.0. Yet adding non-deterministic intrinsics to support old CPUs is doable.

@fiigii
Contributor

fiigii commented Nov 30, 2018

Sounds to me that you don't need some assert or other kind of check, you just need to import Load as LoadAligned.

I believe the result of this proposal is just forcing most people to change Load to LoadUnaligned in their programs.

Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.

I want to say again: Debug.Assert((address % 16) == 0); is useless, or rather relying on this assert is an incorrect way to treat memory alignment. .NET Core does not have "aligned allocation", so we have to deal with memory alignment in one of two approaches:

for (;IsUnaligned(addr); addr++)
{
    // processing data until the memory get aligned
}
for (; addr < end; addr += Vector128<int>.Count)
{
    var v = LoadAlignedVector128(addr); 
    // or LoadUnalignedVector128(addr); whatever -- we recommend always using LoadUnalignedVector128
    ......
}

Or

for (; addr < end; addr += Vector128<int>.Count)
{
    var v = LoadUnalignedVector128(addr); 
    ......
}

Which approach to use is determined by profiling, and that choice is much more important than folding loads for the SSE encoding.

@tannergooding
Member Author

I didn't ask about cases where you tried to do that

I wasn't saying that with regards to personal attempts.

Now they risk getting exceptions that with optimizations enabled would not occur

Correct, but the API is explicitly documented to do as such and they have the option to enforce the desired behavior by explicitly using one of the other two APIs (LoadAligned or LoadUnaligned).

By not having all three, users have to write their own wrappers for something that we could (and in my opinion should) reasonably handle ourselves.

Additionally, if we only expose two APIs and make one of them "sometimes validate alignment", then we can't provide a guarantee to the end user that they can enforce the semantics where required.

Please do not mislead people. The runtime does not impose any semantics on intrinsics

I don't think this is misleading at all. We have had a very clear contract and design for the platform-specific hardware intrinsics that each is tied to a very particular instruction.

LoadAligned is documented to emit movaps and LoadUnaligned (currently just called Load) will emit movups. This gives a very clear story and guarantees that, if someone uses this API, they will get exactly that instruction.

Because of this explicit design decision and because the runtime says side-effects must be preserved, we can't (and shouldn't) just fold LoadAligned under the VEX-encoding.

It is worth noting, however, that we have also tried to determine common patterns, use-cases, and limitations with the decision to tie a given hardware intrinsic to a particular instruction. When we have found such limitations, we have discussed what we could do about it (both the "ideal" and realistic scenarios) and determined an appropriate solution. So far, out of all the scenarios that have been brought up, we have determined to deviate only for CreateScalarUnsafe and ToVector256Unsafe (where we say that the upper bits will be non-deterministic, rather than explicitly zeroed; but for which we also expose "safe" versions that do explicitly zero the upper bits). This may be another case, that is more trivial to resolve via an additional API and that is worthwhile to do.

Squabbles over names resulted in the string intrinsics being removed from .NET Core 3.0

They have not necessarily been removed from netcoreapp3.0; instead, we decided that we would pull them until we can determine the appropriate way to expose these APIs, as there were concerns over the usability and understandability of said APIs. The underlying instructions behave quite differently from most other intrinsics we've exposed and so they require some additional thought. This is no different from any other API that we decide to expose, in that we don't want to ship something that has a bad surface area, and we are still working towards making sure these APIs ship in a good and usable state.

Yet adding non-deterministic intrinsics to support old CPUs is doable.

As mentioned above, both cases where we exposed an additional non-deterministic API were discussed in length.

@tannergooding
Member Author

I want to say again: Debug.Assert((address % 16) == 0); is useless, or rather relying on this assert is an incorrect way to treat memory alignment. .NET Core does not have "aligned allocation", so we have to deal with memory alignment in one of two approaches

Once you have pinned the memory, it is guaranteed not to move and you can rely on the assert. Anyone dealing with perf-sensitive scenarios and larger blocks of memory (for example, ML.NET) should be pinning their data anyway, as the GC could otherwise move the data, and that can mess with the cache, alignment, etc.

Which approach to use is determined by profiling, and that choice is much more important than folding loads.

The "Intel® 64 and IA-32 Architectures Optimization Reference Manual" has multiple pages/sections all discussing this. The performance impact of performing an unaligned load/store across a cache-line or page boundary is made very clear and there are very clear recommendations to do things like:

  • Always align data when possible
  • Generally prefer aligned stores over aligned loads, if you have to choose
  • Consider just using two 128-bit unaligned loads if you can't align your 256-bit data (see the sketch after this list)
  • etc
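
For the third bullet, a sketch of splitting one (possibly unaligned) 256-bit load into two 128-bit halves (illustrative; assumes p points at 8 floats):

Vector128<float> lo = Sse.LoadVector128(p);      // first 4 floats, movups
Vector128<float> hi = Sse.LoadVector128(p + 4);  // next 4 floats, movups
Vector256<float> v  = Vector256.Create(lo, hi);  // recombine into a 256-bit vector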

All this additional API does is give the user the option to choose between:

  • I can't align my data (they use LoadUnaligned)
  • My data is aligned and should never be unaligned (they use LoadAligned)
  • My data is aligned, but you don't have to check (they use Load)

A good number of vectorizable algorithms (especially the ones used in cases like ML.NET) can be handled with a worst-case scenario of two-unaligned loads maximum, regardless of the alignment of the input data. For such algorithms, the only question of whether to align or not really comes down to how much data is being processed in total and the cases where you don't want to make the data aligned can generally be special-cased.
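
For instance, a copy kernel can handle any tail with a single overlapping unaligned load/store (a hypothetical helper; assumes length >= Vector128<float>.Count):

public static unsafe void Copy(float* src, float* dst, int length)
{
    int last = length - Vector128<float>.Count;
    int i = 0;
    for (; i <= last; i += Vector128<float>.Count)
        Sse.Store(dst + i, Sse.LoadVector128(src + i));
    if (i != length)
        Sse.Store(dst + last, Sse.LoadVector128(src + last)); // overlaps elements already copied
}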

@fiigii
Contributor

fiigii commented Nov 30, 2018

The performance impact of performing an unaligned load/store across a cache-line or page boundary is made very clear and there are very clear recommendations to do things like:

Always align data when possible
Generally prefer aligned stores over aligned loads, if you have to choose
Consider just using two 128-bit unaligned loads if you can't align your 256-bit data
etc

Yes, that is what I meant. And aligned loads != movap*.

I can't align my data (they use LoadUnaligned)
My data is aligned and should never be unaligned (they use LoadAligned)
My data is aligned, but you don't have to check (they use Load)

All these three situations are suggested to use LoadUnaligned.

@mikedn
Contributor

mikedn commented Nov 30, 2018

Because of this explicit design decision and because the runtime says side-effects must be preserved, we can't (and shouldn't) just fold LoadAligned under the VEX-encoding.

For the forty-second time: the runtime has nothing to do with this. I give up. We're probably speaking different languages or something.

@CarolEidt
Contributor

@mikedn, I think the issue is a runtime issue when we start talking about folding the loads into the operations, in ways that might cause a runtime failure to be masked (i.e. if the load would throw on an unaligned address, but the operation into which it is folded would not). And I think what @tannergooding is after here is the ability to have a single load semantic that supports folding.

All that said, I'm still not sure where I fall on this issue.

@fiigii
Contributor

fiigii commented Nov 30, 2018

Importing @CarolEidt's comment from https://github.com/dotnet/coreclr/issues/21308:

I agree that we don't want to silently remove an exception. It is not just that it violates the "strict" behavior that developers expect from the CLR, but could also cause sudden unexpected failures if something changed in the user code (or the optimization settings used) that caused the JIT to no longer fold the instruction.

If users write LoadAligned, they should be responsible for handling the possible exceptions, whether they come from JIT optimization or actual memory misalignment.

@tannergooding
Member Author

tannergooding commented Nov 30, 2018

And aligned loads != movap*.

I would disagree here. We have two instructions that perform explicit loads (including movupd and movapd here):

  • movups, which works for any data
  • movaps, which only works for aligned data

We additionally have the ability to support load semantics in most instructions:

  • Under legacy encoding, this only works for aligned data
  • Under VEX encoding, this works for any data

I would say that the unaligned load could just be called a load, as it loads data with no special semantics (and it is still just a load even if the data happens to be aligned). I would say that an aligned load must also validate that the data is aligned (which only movaps does).

All these three situations are suggested to use LoadUnaligned.

LoadUnaligned does not successfully fulfill all three scenarios by itself. It definitively fills the first scenario, but the second scenario requires an additional validation step (and must fail if the requirements are not met) and the third scenario has an optional validation step (that would fail if the requirements are not met and optimizations are disabled; otherwise, when optimizations are enabled, it would skip the validation check).

@mikedn
Contributor

mikedn commented Nov 30, 2018

I think the issue is a runtime issue when we start talking about folding the loads into the operations

@CarolEidt It most definitely isn't. As far as the runtime/ECMA spec is concerned, intrinsics are just method calls. There's nothing anywhere in the ECMA spec that says that LoadAligned must always throw an exception if the data is not aligned. That's solely in the intrinsic's spec courtyard.

Also, I don't claim that we should change the current LoadAligned semantic. I just want to see reasonable argumentation for adding a bunch of new intrinsics. And I'm not seeing it.

@fiigii
Contributor

fiigii commented Dec 1, 2018

And aligned loads != movap*.

I would disagree here.

LoadUnaligned does not successfully fulfill all three scenarios by itself.

movups and movaps have the same performance and movups is always safe. Why not use it?

The words of "aligned loads" in the optimization manual mean that "unaligned loads" would cause performance issue by cache-line split or page fault rather than movups itself.

@tannergooding
Member Author

movups and movaps have the same performance and movups is always safe. Why not use it?

Because we can't always safely fold a movups. Some examples are being on older/legacy hardware, working around a JIT bug (so you have disabled VEX-encoding support), using modern CPUs that don't have AVX support because you are in a low-power, IOT, or budget scenario, etc...

The words of "aligned loads" in the optimization manual mean that "unaligned loads" would cause performance issue by cache-line split or page fault rather than movups itself.

Yes, and the purpose of the proposed intrinsic is to:

  1. Provide some debug-mode validation that you are not causing cache-line splits or page faults
  2. Allow the intrinsic to be folded, regardless of whether you are using the legacy or VEX encoding

It sounds like your concern is that "most" workloads will just use movups and will not care about cache-line splits or page-faults; and they will not make any attempt to align the input data. As such, you think that calling this intrinsic just Load is undesirable because most people will actually use/want LoadUnaligned. Is that about right?

@fiigii
Contributor

fiigii commented Dec 1, 2018

using modern CPUs that don't have AVX support because you are in a low-power, IOT, or budget scenario, etc...

It sounds like your concern is that "most" workloads will just use movups and will not care about cache-line splits or page-faults;

No, my concern is that "most" workloads will just use movups and will not care about folding on older CPUs (or other low-power devices you mentioned).
Handling cache-line splits and page faults is another story. Again, aligned loads != using movaps.

  1. Provide some debug-mode validation that you are not causing cache-line splits or page faults
  2. Allow the intrinsic to be folded, regardless of whether you are using the legacy or VEX encoding

If users do not guarantee the alignment, the compiler-generated validation (1) is not reliable, so that makes (2) unacceptable on older CPUs.
If users already guarantee the alignment, the compiler-generated validation (1) is useless; then they can either i) use LoadUnaligned normally, or ii) use LoadAligned if code size on older CPUs is critical (needs https://github.com/dotnet/coreclr/issues/21308).

you think that calling this intrinsic just Load is undesirable because most people will actually use/want LoadUnaligned. Is that about right?

Right.

@jkotas
Member

jkotas commented Nov 26, 2019

This API fills a gap where we weren't providing functionality available in native code.

We are not trying to match native code 100%. We are trying to get close enough while staying true to .NET values like simplicity and productivity. Having LoadAlignedVector128, LoadAlignedVector128Unsafe, and LoadVector128 all at once is anything but simple.

@tannergooding
Member Author

At the same time, the hardware intrinsic APIs weren't exactly designed with simplicity in mind. We are expecting users to pin their data, understand the hardware involved that they are targeting, and to write multiple code paths that do functionally the same thing but with different underlying instructions depending on the hardware they are running against.

This makes one aspect of the coding pattern better by allowing them to cleanly write:
Sse.LoadAlignedVector128Unsafe(pAddress) rather than the following (which involves another helper method, the JIT cleanly inlining the call, dropping the dead code paths, and folding the load; which we've already determined can be problematic in some cases):

if (Avx.IsSupported)
{
    Debug.Assert((((nint)pAddress) % 16) == 0);
    return Sse.LoadVector128(pAddress);
}
else
{
    return Sse.LoadAlignedVector128(pAddress);
}

These are also considerations that such code paths commonly need to take into account, especially in places like S.P.Corelib, where we do have existing AOT/R2R support and where it defaults (and will continue to default) to supporting non-VEX-enabled hardware. Users not wanting to deal with the complexity of hardware intrinsics have the option of using the type-safe and portable Vector<T> type instead.

@saucecontrol
Member

I was just writing the same thing. I actually have this code in places to ensure I get folded loads:

// ensure pointer is aligned
if (Avx.IsSupported)
    Sse.Add(vec, Sse.LoadVector128(p));
else
    Sse.Add(vec, Sse.LoadAlignedVector128(p));

… which is far less simple and far less discoverable than LoadAlignedVector128Unsafe. If we could only have the one pair of methods, I'd want the LoadAlignedVector128Unsafe behavior. It's the current LoadAlignedVector that's more difficult to reason about given an understanding of how the native intrinsics work.

@jkotas
Member

jkotas commented Nov 27, 2019

Are there any cases where one would want to use LoadAlignedVector128 instead of LoadAlignedVector128Unsafe?

@saucecontrol
Member

saucecontrol commented Nov 27, 2019

I can't imagine one, particularly if the Unsafe variant can be implemented so it throws in minopts. I assume it would just emit movaps and never fold the load, exactly as LoadAlignedVector128 does on VEX processors today? I can see the reasoning behind wanting deterministic behavior for the .NET implementation, but I can't imagine an actual user of these APIs making the performance tradeoff to get the determinism if the choice is there.

I just had a chance to watch the review meeting video, and it does seem like the API shape and behavior with the two aligned load variants is confusing. I was equally confused until @tannergooding walked through it earlier in this thread.

@jkotas
Member

jkotas commented Nov 27, 2019

So we should just fix the LoadAlignedVector128 to do the right thing. It does not make sense to tell everybody to change LoadAlignedVector128 to LoadAlignedVector128Unsafe in their code for best results.

I do not buy the argument about deterministic behavior for alignment faults. The platform does not guarantee deterministic alignment faults. For example, accessing misaligned double* may throw on some architectures in some situations. We do not specify when exactly nor try to ensure that it stays the same from version to version.

@tannergooding
Member Author

The platform does not guarantee deterministic alignment faults.

x86, however, does guarantee deterministic alignment faults and the hardware intrinsics expose hardware specific functionality. The aligned load/store operations will always raise a #GP(0) (General Protection) fault if the memory address is not correctly aligned (this is to 16-bytes for Vector128 and to 32-bytes for Vector256).

Thus far, we have a 1-to-1 mapping of the exposed API surface to an underlying instruction for the safe APIs. That is, Sse.Shuffle will always emit shufps, Sse.LoadVector128 will always emit movups and Sse.LoadAlignedVector128 will always emit movaps. We've only exposed non-deterministic behavior under the "unsafe" APIs (such as Vector128.CreateScalarUnsafe, which is the equivalent to CreateScalar but leaves the upper bits non-deterministic rather than forcing them to zero). This had been an explicit design decision so users could retain control over the emitted instructions and to ensure they got deterministic behavior. This in turn helps ensure more deterministic perf results and allows them to micro-tune their code to get the best results possible.
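
For example, illustrating that determinism distinction:

Vector128<float> a = Vector128.CreateScalar(1.0f);       // always <1, 0, 0, 0>
Vector128<float> b = Vector128.CreateScalarUnsafe(1.0f); // <1, ?, ?, ?> -- upper elements undefined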

It does not make sense to tell everybody to change LoadAlignedVector128 to LoadAlignedVector128Unsafe in their code for best results.

I don't believe this is what most users will have to do. There are only a few categories of code you can have right now.

You have people who don't care what the alignment is and are explicitly using LoadVector. These users are getting efficient codegen on newer hardware and will need to make no changes.

Then you have users who are explicitly taking alignment into account (this has to be explicit since CoreCLR provides no way to guarantee 16-byte or higher alignment today).

For users who are taking alignment into account, you will either have people who are doing it because they want alignment enforcement or they are doing it for performance.

If they are doing it for alignment enforcement they will only be using LoadAlignedVector and changing the behavior of LoadAlignedVector will be breaking.

If they are instead doing it for performance, then they will either have written two code paths or have written a small helper method like I gave above that uses LoadVector on VEX-enabled hardware and LoadAlignedVector on non-VEX-enabled hardware.

These new APIs target the latter category of user and gives them an API which bridges that gap without also regressing any other scenario. It will only reduce the amount of code they have written and makes it an explicit opt-in that you are willing to not have strict alignment checking in favor of performance.

@jkotas
Member

jkotas commented Nov 27, 2019

changing the behavior of LoadAlignedVector will be breaking.

Throwing fewer exceptions is not a breaking change.

explicit opt-in that you are willing to not have strict alignment checking in favor of performance

There is very little value in this explicit opt-in. Alignment bugs are like threading bugs to a large degree. If you have an alignment bug in your code, you often get lucky because some other part of the code guarantees alignment.

@tannergooding
Member Author

Throwing fewer exceptions is not a breaking change.

I think the difference is that this is a hardware exception raised as the side-effect of a given instruction. It is used to ensure correctness and sometimes because not all hardware has fast unaligned loads (particularly older hardware and sometimes non-VEX enabled hardware).

Alignment bugs are like threading bugs to a large degree.

I don't think I'd agree, particularly when the expectation is that users explicitly pin and check alignment first (such that they can specially handle leading/trailing elements via an unaligned and masked memory operation).

That being said... @GrabYourPitchforks, @terrajobst, @CarolEidt. Do you have a problem with changing the behavior of LoadAlignedVector* to match the desired behavior rather than exposing the new API?

@GrabYourPitchforks
Member

The LoadAlignedVector* APIs are documented to map exactly to [v]movdqa, so as a consumer I'd certainly be surprised if the JIT ended up emitting any other instruction such as [v]movdqu. As we discussed yesterday I have relied on the fact that [v]movdqa AVs as a way to check my implementation while running under a test environment and while fuzzing. (Generally, fuzzing is meant to be run over a fully-optimized application, but I suppose it's possible to kick off two runs: one with optimizations enabled and one without. It's just more work to set up.)

As long as there exists some setting which will force this API to emit [v]movdqa, and as long as we can ensure we have unit tests running under such an environment, my scenario is satisfied.

@jkotas
Member

jkotas commented Nov 27, 2019

As long as there exists some setting which will force this API to emit [v]movdqa,

The configuration switches to explicitly set the different hardware support levels (e.g. COMPlus_EnableAVX, etc.) should do the trick, no?

@saucecontrol
Member

saucecontrol commented Nov 27, 2019

APIs are documented to map exactly to [v]movdqa, so as a consumer I'd certainly be surprised if the JIT ended up emitting any other instruction such as [v]movdqu

It's not a question of it emitting a different instruction than you requested, it's that the load would be folded so it's not emitted at all. So where you would have gotten a fault if vmovdqa were emitted, you won't because the load isn't present at all.

The real question is, if given a choice between a method that always emits vmovdqa and one that potentially eliminates that instruction by folding it into another, would you ever use the one that guarantees never to fold? If the purpose is to avoid a performance hit due to a load that crosses cache line boundaries, how is a load instruction that can't be eliminated/folded an improvement? As long as minopts emits the aligned load always so you can test, I don't see anyone choosing the other method.

Edit:
It's also worth noting that on non-VEX hardware, that method isn't guaranteed to emit movdqa today, because the load may be folded. The only difference is that non-VEX will fault even if the load is folded.

Also, in the Intel intrinsics guide, movdqa is documented as

Load 128-bits of integer data from memory into dst. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.

Note that it says may there

@tannergooding
Member Author

It's not a question of it emitting a different instruction than you requested, it's that the load would be folded so it's not emitted at all.

Which is why the original proposal exposes a new API.

As it is today, if you call LoadAlignedVector128 you will always get codegen that will load 16-bytes from a given address and will generate a #GP(0) if that address is not also 16-byte aligned. Normally this is movaps/movapd/movdqa but it can also safely be folded into the consuming instruction for the non-VEX encoding. This is "safe" to do as most SIMD instructions (all except 2, iirc) that can also encode a memory operand require the address to be aligned and therefore will preserve the fault behavior.
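
For example (illustrative), under the non-VEX encoding:

* movdqa xmm1, [addr] ; faults if addr is not 16-byte aligned
* paddd xmm0, xmm1

can safely become:

* paddd xmm0, [addr] ; the memory operand still requires 16-byte alignment, so the fault is preserved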

When using the VEX-encoding, the instructions were changed to no longer fault if the address wasn't aligned. This means it is no longer "safe" to fold because doing so could remove a #GP(0) that would have otherwise been raised.

It's also worth noting that on non-VEX hardware, that method isn't guaranteed to emit movdqa today because it may fold that

The key point being that we were still providing deterministic behavior and preserving the semantics of the original code.

Note that it says may there

The native intrinsics don't guarantee that _mm_load_* will emit a movaps/movapd/movdqa. Their intrinsic behaves like the proposed LoadAlignedVector128Unsafe. They also don't provide deterministic APIs for things like CreateScalar and expect that you will explicitly zero the bits yourself if required, which is something that we previously said was undesirable for .NET.

@tannergooding
Member Author

The configuration switches to explicitly set the different hardware support levels (e.g. COMPlus_EnableAVX, etc.) should do the trick, no?

By itself it won't do the trick. The behavior would be:

COMPlus_EnableAVX   Optimizations Enabled   Result
0                   0                       #GP(0) if unaligned
0                   1                       #GP(0) if unaligned
1                   0                       #GP(0) if unaligned
1                   1                       No exception raised

@tannergooding
Member Author

The real question is, if given a choice between a method that always emits vmovdqa and one that potentially eliminates that instruction by folding it into another, would you ever use the one that guarantees never to fold? If the purpose is to avoid a performance hit due to a load that crosses cache line boundaries, how is a load instruction that can't be eliminated/folded an improvement? As long as minopts emits the aligned load always so you can test, I don't see anyone choosing the other method.

This question hinges on whether you want "fast code" or "strict code". If you want "fast code", then yes; there is never a reason to want the movaps/movapd/movdqa emitted. You need data to be aligned to get "fast" code on non-VEX hardware, but you still want the load to be folded to get smaller codegen under the newer VEX encoding.

If you want "strict" code, then you would strictly want the alignment checking always, even under the newer VEX encoding and even if it provides a minimal perf hit. Likewise, you would probably not want to do the alignment checking yourself when there is a single instruction that will efficiently do it.

I tend to prefer "fast code" myself and I can't currently think of any scenarios where you would both want the VEX encoding and for "strict" mode to be used. However, at the same time, I don't think exposing a non-deterministic API (one that differs in result between optimizations on/off) is good API design and I feel it tends to go against what we've previously chosen to do for these APIs and for .NET in general.

@tannergooding
Member Author

(To clarify, I mean exposing a non-deterministic API without a deterministic equivalent).

@jkotas
Member

jkotas commented Nov 27, 2019

There are two different reasons why you may want do fuzzing:

  1. Verifying that the implementation is functionally correct.
  2. Verifying that the implementation maintains its internal invariants. This includes alignment as well as many other invariants well-written code asserts for.

You want to do both of these for different hardware support levels.

The hardware architectures out there, and .NET specifically, do not provide strong guarantees around alignment faults across the board, so designing a fuzzing strategy around the fact that a few x86 instructions raise alignment faults is not a viable cross-architecture strategy.

@CarolEidt
Contributor

Do you have a problem with changing the behavior of LoadAlignedVector* to match the desired behavior rather than exposing the new API?

I don't have a problem with that change, and the arguments for seem much more compelling than the arguments against.

In general, we expect developers to use HW Intrinsics in scenarios where perf matters, and we expect them to have a great deal of sophistication with regard to their usage. That said, perhaps the subtleties here merit some guidance on this topic (blog post, anyone?)

@maryamariyan maryamariyan transferred this issue from dotnet/corefx Dec 16, 2019
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Dec 16, 2019
@maryamariyan maryamariyan added api-approved API was approved in API review, it can be implemented area-System.Runtime.Intrinsics labels Dec 16, 2019
@maryamariyan maryamariyan added this to the 5.0 milestone Dec 16, 2019
@tannergooding
Member Author

Closing this. We've updated the containment handling to efficiently do this instead.

@tannergooding tannergooding removed the untriaged New issue has not been triaged by the area owner label Sep 14, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 15, 2020