Consider exposing a HWIntrinsic that allows efficient loading, regardless of encoding. #954
Comments
CC @eerhardt, @fiigii, @CarolEidt for thoughts
I would rename the unaligned load to what it actually is
For the name
If load is exposed, the same should be done for store. Hence rename
In my proposal for method names this wouldn't be in symmetry with
I was thinking the same thing.
Updated to be
Sorry for the delay in replying. Meanwhile, guaranteeing memory alignment should be the user's responsibility, and the dynamic assertion ("Asserts
Additionally, on modern Intel CPUs, the unaligned load/store won't be slower than aligned load/store (if the memory access hits in one cache line), so we suggest always using unaligned load/store. That is why
The JIT has to preserve the semantics of the underlying instruction. Currently, when using the legacy encoding, the vast majority of the instructions which allow a memory operand have "aligned" semantics and will cause a hardware exception to be raised if the input address is not aligned. This means that, when we emit using the legacy encoding, we can only fold and emit efficient code for the aligned load; and when using the VEX encoding, we can only fold and emit efficient code for the unaligned load. We have to do it this way because you would otherwise risk losing an exception that would have been raised (e.g. you fold an aligned load into a VEX-encoded instruction and the alignment fault it would have raised is silently lost).
We can expect that users will want to emit the most efficient code possible; however, we can also expect that users, where possible, will be using various well-known tricks to get their data aligned so that they don't incur the penalty for reading/writing across a cache-line or page boundary (which will occur every 4 reads/writes on most modern processors, when working with 128-bit data). This leads users to writing algorithms that do the following (where possible):

```
// Process first few elements to become aligned
// Process data as aligned
// Process trailing elements that don't fit in a single VectorXXX<T>
```

They will naturally want to assert their code is correct and ensure codegen is efficient, so they will (in the
However, it is also not unreasonable that they will want to emit efficient codegen if the user happens to be running on some pre-AVX hardware (as is the case with the "Benchmarks Game", which runs on a Q6600 -- which it is also worth noting doesn't have fast unaligned loads). If the user wants to additionally support this scenario, they need to write a helper method which effectively does the following:

```csharp
public static Vector128<float> EfficientLoad(float* address)
{
    if (Avx.IsSupported)
    {
        // Must assert, since this would fail for non-AVX otherwise
        Debug.Assert(((nint)address % 16) == 0);
        return Sse.LoadVector128(address);
    }
    else
    {
        return Sse.LoadAlignedVector128(address);
    }
}
```

This also depends on the JIT being able to successfully recognize and fold this code. So, this proposal suggests providing this helper method built-in by renaming the existing
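As a usage sketch (the names `pSrc`, `pEnd`, and `acc`, and the surrounding pinning/alignment setup, are hypothetical and not from the original discussion), the benefit of such a helper only materializes if the JIT inlines it and folds the load into the consuming instruction:

```csharp
// Hypothetical consumption of the EfficientLoad helper above; pSrc/pEnd point into
// pinned data that earlier setup code has brought to a 16-byte boundary.
Vector128<float> acc = Vector128<float>.Zero;
for (float* p = pSrc; p < pEnd; p += Vector128<float>.Count)
{
    // Ideally this becomes a single "addps xmm, [mem]" (or its VEX form) per iteration.
    acc = Sse.Add(acc, EfficientLoad(p));
}
```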
Ah, I see the problem now. But we should fold
This assert is useless.
That is the problem and why we can't fold it.
The point of the assert is to help catch places where the user is calling
I like this proposal, and agree that the assert, while not guaranteeing to catch all potential dynamic data configurations, is very likely to catch nearly all cases.
Ah, I looked at the manual again; yes,
Agree, actually other
Are there enough such people to justify adding more intrinsics?
It's perhaps weird for SSE. But to quote @CarolEidt, "Given that there is a great deal of expertise required to use the hardware intrinsics effectively", I don't think such a difference in semantics justifies adding more intrinsics.
Where does this assert go exactly? Is
Agree, I suggest that we fold more of the existing load intrinsics rather than adding more intrinsics.
The above codegen strategy is already adopted by all the mainstream C/C++ compilers.
Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.
Yes, if this API is approved, we would (at a minimum) rename the AVX intrinsic to be
A true intrinsic, and the JIT would add the additional check when optimizations are disabled. This is the only way to get it to work in end-user code for release builds of the CoreCLR.
I don't think this is acceptable without an explicit user opt-in. The runtime has had a very strict policy on not silently removing side-effects
C/C++ compilers make a number of optimizations that we have, at this point, decided not to do and to instead address via Analyzers or look at in more depth later.
It is not removing side-effects; it has no side-effects (hardware exceptions).
All exceptions, including HardwareExceptions, are considered side-effects. The
I didn't ask about cases where you tried to do that. I asked if there are enough people who are using .NET Core on old or gimped CPU and expect things to work perfectly rather than just reasonably.
Sounds to me that you don't need some assert or other kind of check, you just need to import Load as LoadAligned. Whether it's a good idea to do that in debug builds is debatable. Some people may end up disabling optimizations to work around a JIT bug or to be able to get a correct stack trace for a hard-to-track-down bug. Now they risk getting exceptions that would not occur with optimizations enabled.
Please do not mislead people. The runtime does not impose any semantics on intrinsics, it's the other way around.
Indeed. Squabbles over names resulted in the string intrinsics being removed from .NET Core 3.0. Yet adding non-deterministic intrinsics to support old CPUs is doable.
I believe the result of this proposal is just forcing most people to change
I want to say again:

```csharp
for (; IsUnaligned(addr); addr++)
{
    // process data until the memory gets aligned
}

for (; addr < end; addr += Vector128<int>.Count)
{
    var v = LoadAlignedVector128(addr);
    // or LoadUnalignedVector128(addr), whatever, we recommend always using LoadUnalignedVector128
    // ...
}
```

Or

```csharp
for (; addr < end; addr += Vector128<int>.Count)
{
    var v = LoadUnalignedVector128(addr);
    // ...
}
```

Which approach to use should be based on profiling, and that is much more important than folding loads for the SSE encoding.
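`IsUnaligned` in the sketch above is not an existing API; a minimal helper along those lines (assuming the usual 16-byte `Vector128` alignment) might look like:

```csharp
// Hypothetical helper: true while the pointer has not yet reached a 16-byte boundary.
static unsafe bool IsUnaligned(int* address) => ((nint)address % 16) != 0;
```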
I wasn't saying that with regards to personal attempts.
Correct, but the API is explicitly documented to behave that way, and they have the option to enforce the desired behavior by explicitly using one of the other two APIs (`Load` or `LoadAligned`).

By not having all three, users have to write their own wrappers for something that we could (and in my opinion should) reasonably handle. Additionally, if we only expose two APIs and make one of them "sometimes validate alignment", then we can't provide a guarantee to the end user that they can enforce the semantics where required.
I don't think this is misleading at all. We have had a very clear contract and design for the platform-specific hardware intrinsics that each is tied to a very particular instruction.
Because of this explicit design decision, and because the runtime says side-effects must be preserved, we can't (and shouldn't) just fold `LoadAligned` when emitting the VEX encoding.

It is worth noting, however, that we have also tried to determine common patterns, use-cases, and limitations with the decision to tie a given hardware intrinsic to a particular instruction. When we have found such limitations, we have discussed what we could do about it (both the "ideal" and realistic scenarios) and determined an appropriate solution. So far, out of all the scenarios that have been brought up, we have determined to deviate only for
They have not necessarily been removed from netcoreapp3.0; instead, we decided that we would pull them until we can determine the appropriate way to expose these APIs, as there were concerns over the usability and understandability of said APIs. The underlying instructions behave quite differently from most other intrinsics we've exposed and so they require some additional thought. This is no different from any other API that we decide to expose in that we don't want to ship something that has a bad surface area, and we are still working towards making sure these APIs ship in a good and usable state.
As mentioned above, both cases where we exposed an additional non-deterministic API were discussed at length.
Once you have pinned the memory, it is guaranteed not to move and you can rely on the assert. Anyone dealing with perf-sensitive scenarios and larger blocks of memory (for example, ML.NET) should be pinning their data anyways, as the GC could move the data otherwise and that can mess with the cache, alignment, etc.
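A minimal sketch of that pinning pattern (assuming a `float[] buffer` and illustrative names; the scalar prologue that actually processes the leading elements is omitted):

```csharp
fixed (float* pBuffer = buffer)
{
    // Once pinned, the GC cannot move the buffer, so an alignment check done up front
    // stays valid for the whole region.
    float* p = (float*)(((nint)pBuffer + 15) & ~(nint)15); // first 16-byte aligned element
    Debug.Assert(((nint)p % 16) == 0);
    // ... handle [pBuffer, p) with scalar code, then run the vectorized loop from p ...
}
```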
The "Intel® 64 and IA-32 Architectures Optimization Reference Manual" has multiple pages/sections all discussing this. The performance impact of performing an unaligned load/store across a cache-line or page boundary is made very clear and there are very clear recommendations to do things like:
All this additional API does is give the user the option to choose between:
A good number of vectorizable algorithms (especially the ones used in cases like ML.NET) can be handled with a worst-case scenario of two unaligned loads maximum, regardless of the alignment of the input data. For such algorithms, the only question of whether to align or not really comes down to how much data is being processed in total, and the cases where you don't want to make the data aligned can generally be special-cased.
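A sketch of that shape for an element-wise operation (all names are illustrative; it assumes `length >= Vector128<float>.Count`, pinned pointers, matching relative alignment of source and destination, and that reprocessing the few overlapping elements is harmless, which is what makes the head/tail overlap trick valid):

```csharp
static unsafe void Scale(float* pSrc, float* pDst, int length, Vector128<float> factor)
{
    float* pLastSrc = pSrc + length - Vector128<float>.Count;
    float* pLastDst = pDst + length - Vector128<float>.Count;

    // Head: one possibly-unaligned vector.
    Sse.Store(pDst, Sse.Multiply(Sse.LoadVector128(pSrc), factor));

    // Body: advance to the next 16-byte boundary, then use aligned loads/stores.
    int skip = (int)((-(nint)pSrc & 15) / sizeof(float));
    for (float* s = pSrc + skip, d = pDst + skip; s < pLastSrc; s += Vector128<float>.Count, d += Vector128<float>.Count)
    {
        Sse.StoreAligned(d, Sse.Multiply(Sse.LoadAlignedVector128(s), factor));
    }

    // Tail: one possibly-unaligned vector ending exactly at the last element.
    Sse.Store(pLastDst, Sse.Multiply(Sse.LoadVector128(pLastSrc), factor));
}
```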
Yes, that is what I meant. And aligned loads != movap*.
All three of these situations are suggested to use
For the forty-second time: the runtime has nothing to do with this. I give up. We're probably speaking different languages or something.
@mikedn, I think the issue is a runtime issue when we start talking about folding the loads into the operations, in ways that might cause a runtime failure to be masked (i.e. if the load would throw on an unaligned address, but the operation into which it is folded would not). And I think what @tannergooding is after here is the ability to have a single load semantic that supports folding. All that said, I'm still not sure where I fall on this issue.
Importing @CarolEidt's comment from https://github.com/dotnet/coreclr/issues/21308:
If users write
I would disagree here. We have two instructions that perform explicit loads (including
We additionally have the ability to support load semantics in most instructions:
I would say that the
@CarolEidt It most definitely isn't. As far as the runtime/ECMA spec is concerned, intrinsics are just method calls. There's nothing anywhere in the ECMA spec that says that.

Also, I don't claim that we should change the current
The words "aligned loads" in the optimization manual mean that "unaligned loads" would cause performance issues via cache-line splits or page faults rather than
Because we can't always safely fold a
Yes, and the purpose of the proposed intrinsic is to:
It sounds like your concern is that "most" workloads will just use
No, my concern is that "most" workloads will just use
If users do not guarantee the alignment, the compiler-generated validation (1) is not reliable, so that makes (2) unacceptable on older CPUs.
Right.
We are not trying to match native code 100%. We are trying to get close enough while staying true to the .NET values like simplicity and productivity. Having all
At the same time, the hardware intrinsic APIs weren't exactly designed with simplicity in mind. We are expecting users to pin their data, understand the hardware they are targeting, and write multiple code paths that do functionally the same thing but with different underlying instructions depending on the hardware they are running against. This makes one aspect of that coding pattern better by allowing them to write a single call cleanly, rather than:

```csharp
if (Avx.IsSupported)
{
    Debug.Assert((((nint)pAddress) % 16) == 0);
    return Sse.LoadVector128(pAddress);
}
else
{
    return Sse.LoadAlignedVector128(pAddress);
}
```

It is also something these code paths commonly need to consider, especially in places like S.P.Corelib where we do have existing AOT/R2R support and where it defaults (and will continue to default) to supporting non-VEX enabled hardware. Users not wanting to deal with the complexity of hardware intrinsics have the option of using the type-safe and portable
I was just writing the same thing. I actually have this code in places to ensure I get folded loads:

```csharp
// ensure pointer is aligned
if (Avx.IsSupported)
    Sse.Add(vec, Sse.LoadVector128(p));
else
    Sse.Add(vec, Sse.LoadAlignedVector128(p));
```

… which is far less simple and far less discoverable than
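For comparison, a sketch of the same call site with the single API this issue proposes (using the `LoadUnsafe` name from the proposal below; the exact name and shape were still under discussion):

```csharp
// Hypothetical: one call that the JIT can fold efficiently under either encoding,
// with the alignment contract checked only by a debug-time assert.
Sse.Add(vec, Sse.LoadUnsafe(p));
```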
Are there any cases where one would want to use
I can't imagine one, particularly if the Unsafe variant can be implemented so it throws in minopts. I assume it would just emit

I just had a chance to watch the review meeting video, and it does seem like the API shape and behavior with the two aligned load variants is confusing. I was equally confused until @tannergooding walked through it earlier in this thread.
So we should just fix the

I do not buy the argument about deterministic behavior for alignment faults. The platform does not guarantee deterministic alignment faults. For example, accessing misaligned
x86, however, does guarantee deterministic alignment faults, and the hardware intrinsics expose hardware-specific functionality. The aligned load/store operations will always raise a #GP(0) if the address is not aligned.

Thus far, we have a 1-to-1 mapping of the exposed API surface to an underlying instruction for the safe APIs. That is,
I don't believe this is what most users will have to do. There are only a few categories of code that you can have right now. You have people who don't care what the alignment is and are explicitly using the unaligned load.

Then you have users who are explicitly taking alignment into account (this has to be explicit since CoreCLR provides no way to guarantee 16-byte or higher alignment today). For users who are taking alignment into account, you will either have people who are doing it because they want alignment enforcement, or they are doing it for performance. If they are doing it for alignment enforcement, they will only be using the aligned load. If they are instead doing it for performance, then they will either have written two code paths or have written a small helper method like I gave above that picks one or the other based on `Avx.IsSupported`.

These new APIs target the latter category of user and give them an API which bridges that gap without also regressing any other scenario. It will only reduce the amount of code they have written, and it makes it an explicit opt-in that you are willing to not have strict alignment checking in favor of performance.
Throwing fewer exceptions is not a breaking change.
There is very little value in this explicit opt-in. Alignment bugs are like threading bugs to a large degree. If you have an alignment bug in your code, you often get lucky because some other part of the code guarantees alignment.
I think the difference is that this is a hardware exception raised as the side-effect of a given instruction. It is used to ensure correctness and sometimes because not all hardware has fast unaligned loads (particularly older hardware and sometimes non-VEX enabled hardware).
I don't think I'd agree, particularly when the expectation is that users explicitly pin and check alignment first (such that they can specially handle leading/trailing elements via an unaligned and masked memory operation). That being said... @GrabYourPitchforks, @terrajobst, @CarolEidt. Do you have a problem with changing the behavior of
The

As long as there exists some setting which will force this API to emit
The configuration switches to explicitly set the different hardware support levels (e.g.
It's not a question of it emitting a different instruction than you requested, it's that the load would be folded so it's not emitted at all. So where you would have gotten a fault if

The real question is, if given a choice between a method that always emits

Edit: Also, in the Intel intrinsics guide,
Note that it says may there.
Which is why the original proposal exposes a new API. As it is today, if you call

When using the VEX encoding, the instructions were changed to no longer fault if the address wasn't aligned. This means it is no longer "safe" to fold because doing so could remove a #GP(0) that would have otherwise been raised.
The key point being that we were still providing deterministic behavior and preserving the semantics of the original code.
The native intrinsics don't guarantee that
By itself it won't do the trick. The behavior would be:
This question hinges on whether you are wanting "fast code" or "strict code". If you want "fast code", then yes; there is never a reason to want the

If you want "strict" code, then you would strictly want the alignment checking always, even under the newer VEX encoding and even if it provides a minimal perf hit. Likewise, you would probably not want to do the alignment checking yourself when there is a single instruction that will efficiently do it. I tend to prefer "fast code" myself and I can't currently think of any scenarios where you would both want the VEX encoding and for "strict" mode to be used. However, at the same time, I don't think exposing a non-deterministic API (one that differs in result between optimizations on/off) is good API design, and I feel it tends to go against what we've previously chosen to do for these APIs and for .NET in general.
(To clarify, I mean exposing a non-deterministic API without a deterministic equivalent.)
There are two different reasons why you may want to do fuzzing:
You want to do both of these for different hardware support levels. The hardware architectures out there and .NET specifically do not provide strong guarantees around alignment faults across the board, so designing a fuzzing strategy around the fact that a few x86 instructions emit alignment faults is not a viable cross-architecture strategy.
I don't have a problem with that change, and the arguments for seem much more compelling than the arguments against. In general, we expect developers to use HW Intrinsics in scenarios where perf matters, and we expect them to have a great deal of sophistication with regard to their usage. That said, perhaps the subtleties here merit some guidance on this topic (blog post, anyone?)
Closing this. We've updated the containment handling to efficiently do this instead. |
Rationale
On x86 hardware, for SIMD instructions, there are effectively two encoding formats: "legacy" and "vex".
There are some minor differences between these two encodings, including the number of parameters they take and whether the memory operand has aligned or unaligned semantics.
As a brief summary:
- The legacy encoding generally supports two operands (`ins reg, reg/[mem]`); the first serves as both a source and destination, and the last as a source which can generally be a register or memory address. The memory address form has "aligned" semantics and will cause an exception to be thrown if the address is not "naturally aligned" (generally 16 bytes).
- The VEX encoding generally supports three operands (`ins reg, reg, reg/[mem]`); the first being the destination, the second being a source, and the last as a source which can generally be a register or memory address. The memory address form has "unaligned" semantics and will not cause an exception to be thrown, regardless of the input.

Today, we expose both `Load` (which has unaligned semantics) and `LoadAligned` (which has aligned semantics). Given that a user will often want to generate the "most efficient" code possible, and that the JIT, in order to preserve semantics (not silently get rid of an exception that would otherwise be thrown), changes which method it will "fold" depending on the encoding it is currently emitting, it may be beneficial to expose an intrinsic which allows the user to specify "this address is aligned, do whatever load is most efficient". This would ensure that it can be folded regardless of the current encoding.

Proposed API
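A sketch of the proposed surface for the `float` overload (signatures are illustrative, not the exact proposal text; `LoadUnsafe` is the name used in the semantics below, alongside the existing loads):

```csharp
public static class Sse
{
    // Existing: unaligned load (movups / vmovups).
    public static unsafe Vector128<float> LoadVector128(float* address);

    // Existing: aligned load (movaps / vmovaps); faults if the address is not 16-byte aligned.
    public static unsafe Vector128<float> LoadAlignedVector128(float* address);

    // Proposed: caller promises alignment; the JIT emits/folds whichever load is most
    // efficient for the current encoding, with only a debug-time alignment assert.
    public static unsafe Vector128<float> LoadUnsafe(float* address);
}
```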
API Semantics
This shows the semantics of each API under the legacy and VEX encoding.
Most instructions support having the last operand be either a register or memory operand:

ins reg, reg/[mem]

When folding does not happen, the load is emitted as an explicit separate instruction:

mov xmm1, [mem]   ; explicit unaligned or aligned load
ins xmm0, xmm1

While folding allows the load to be folded directly into the consuming instruction:

ins xmm0, [mem]

On `Legacy` hardware, the folded form of the instructions asserts that `address` is aligned (generally this means `(address % 16) == 0`). On `VEX` hardware, the folded form does no such validation and allows any input.

Load (mov unaligned)
LoadAligned (mov aligned)
LoadUnsafe
Additional Info
Some open ended questions:
- How should this behave when AVX support is explicitly disabled (e.g. `COMPlus_EnableAVX=0`)?
- Should there be `Store`, `StoreUnaligned`, and `StoreAligned` counterparts, or should we just keep `Store` and `StoreAligned`?
- Should this also be exposed on `Avx`/`Avx2`?
- It is worth noting that `Avx`/`Avx2` instructions require the VEX encoding and therefore always have "unaligned" semantics, but that Avx/Avx2 define and expose `LoadAlignedVector256`/`StoreAligned` instructions anyways.