[Arm64] Store Pair of SIMD&FP registers #33532
@TamarChristinaArm Is it correct to say that functionality of |
I think we could have better names than ... I wonder if ... |
How about
It is an option, but I can't tell which one puts more emphasis on the fact that we store a pair. |
It seems to me that it's useful to have all of these start the same, i.e. |
Yeah that's correct |
A question about these store intrinsics in general: how are you planning to deal with the different addressing modes? Or are you only interested in the register addressing modes? |
I had a discussion with @BruceForstall where we briefly discussed how we could benefit from post-index addressing modes if we had something like this:

```csharp
// ST1 { <Vt>.<T> }, [<Xn|SP>], #16
void Store(ref T* address, Vector128<T> value);
```

For example, if we have a loop

```csharp
Vector128<double> val;

for (int i = 0; i < count; i++)
{
    /*
       compute new value of val
    */
    Store(baseAddr + i * 16, val);
}
```

a user might want to do some sort of strength reduction manually and write this

```csharp
Vector128<double> val;
T* ptr = baseAddr;

for (int i = 0; i < count; i++)
{
    /*
       compute new value of val
    */
    Store(ptr, val);
    ptr += 16;
}
```

and that, as a result, becomes

```csharp
Vector128<double> val;
T* ptr = baseAddr;

for (int i = 0; i < count; i++)
{
    /*
       compute new value of val
    */
    Store(ref ptr, val);
}
```
|
Why can't we just detect and emit the right encoding for |
We already support optimizing things like:

```csharp
Sse.LoadVector128(addr + index * 4);
```

into:

```asm
vmovups xmm0, [r8+rax*4]
```
|
I did not know that we were doing this on x86/x64. Then, yes, we can. |
I will open an issue to track this work |
That would make it easier to do e.g. |
Actually, I stand corrected when I said we can do this (well, we can, but it's not that easy): detecting a post-index addressing mode is harder than what you described on x86/x64, since during the writeback stage the instruction modifies the value of the base register. I don't think we use post-index modes anywhere on arm64 other than in hand-written prolog/epilog or cpObj codegen. |
It seems like we would want intrinsics that allow directly specifying pre-indexed or post-indexed addressing, due to writeback; I don't think the JIT will be able to optimize that in all cases. I was wondering whether the APIs should allow specifying a "memcpy" using LD1/ST1 with post-indexing, for example:
instead of:
In simple cases the JIT might be able to optimize this, but we shouldn't necessarily depend on that. |
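As a hedged sketch of what such an API could enable (the `StoreAndAdvance` name and its `ref`-pointer signature are hypothetical, not part of this proposal), a copy loop built on post-index-style stores might look like:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static unsafe class PostIndexSketch
{
    // Hypothetical post-index store: writes the vector, then advances the base
    // pointer, mirroring "ST1 { Vt.16B }, [Xn], #16" with writeback.
    static void StoreAndAdvance(ref byte* address, Vector128<byte> value)
    {
        AdvSimd.Store(address, value);
        address += 16; // the JIT would fold this into the post-index writeback
    }

    public static void Copy(byte* src, byte* dst, int vectorCount)
    {
        for (int i = 0; i < vectorCount; i++)
        {
            Vector128<byte> v = AdvSimd.LoadVector128(src); // ideally: ld1 {v0.16b}, [x0], #16
            src += 16;
            StoreAndAdvance(ref dst, v);                    // ideally: st1 {v0.16b}, [x1], #16
        }
    }
}
```

The point of the sketch is that the pointer increments sit right next to the memory operations, which is the pattern writeback addressing encodes in a single instruction.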
It looks like writeback in this context is the modification of (depending on the post-index overload) |
That is, those inputs are effectively RMW? |
It's modification of |
Ah, yes, I see. The operation description in the manual has the following, and I misread the first if statement:
|
Given the vector instructions look to force |
Oh, nevermind. It's only forced to |
Hmmm, but the native intrinsics don't seem to have variants that take anything other than |
I would suspect that's the case.
I don't know why we'd need to have specific overloads for array[index] for this to be useful. In any event the memory hw intrinsics require a pointer, and it would seem that supporting post-indexing of that pointer would be desirable. |
I meant this more as: if we were to expose the |
I think it remains to be seen whether it would be better to expose this directly or to rely on the JIT to optimize it. But I suspect that the difficulties of determining that |
For one or two loads/stores I suspect it shouldn't matter all that much. When you have a lot of them reading from the same sources or using the same offsets, it becomes more of an issue: if you pick the wrong addressing mode you end up with more instructions and higher register pressure. For instance, if you fail to recognize that you can use a register offset or an immediate offset, you can end up using lots of adds to generate the address for a simpler addressing mode.
That said, C compilers routinely don't use the most efficient addressing modes and it hasn't hurt us terribly so far. In the grand scheme of things there are higher-priority optimization tasks, but recognizing the simple cases would be a good start, I think. |
```csharp
namespace System.Runtime.Intrinsics.Arm
{
    partial class AdvSimd.Arm64
    {
        public static unsafe void StorePair(byte* address, Vector64<byte> value1, Vector64<byte> value2);
        public static unsafe void StorePair(double* address, Vector64<double> value1, Vector64<double> value2);
        public static unsafe void StorePair(short* address, Vector64<short> value1, Vector64<short> value2);
        public static unsafe void StorePair(int* address, Vector64<int> value1, Vector64<int> value2);
        public static unsafe void StorePair(long* address, Vector64<long> value1, Vector64<long> value2);
        public static unsafe void StorePair(sbyte* address, Vector64<sbyte> value1, Vector64<sbyte> value2);
        public static unsafe void StorePair(float* address, Vector64<float> value1, Vector64<float> value2);
        public static unsafe void StorePair(ushort* address, Vector64<ushort> value1, Vector64<ushort> value2);
        public static unsafe void StorePair(uint* address, Vector64<uint> value1, Vector64<uint> value2);
        public static unsafe void StorePair(ulong* address, Vector64<ulong> value1, Vector64<ulong> value2);
        public static unsafe void StorePair(byte* address, Vector128<byte> value1, Vector128<byte> value2);
        public static unsafe void StorePair(double* address, Vector128<double> value1, Vector128<double> value2);
        public static unsafe void StorePair(short* address, Vector128<short> value1, Vector128<short> value2);
        public static unsafe void StorePair(int* address, Vector128<int> value1, Vector128<int> value2);
        public static unsafe void StorePair(long* address, Vector128<long> value1, Vector128<long> value2);
        public static unsafe void StorePair(sbyte* address, Vector128<sbyte> value1, Vector128<sbyte> value2);
        public static unsafe void StorePair(float* address, Vector128<float> value1, Vector128<float> value2);
        public static unsafe void StorePair(ushort* address, Vector128<ushort> value1, Vector128<ushort> value2);
        public static unsafe void StorePair(uint* address, Vector128<uint> value1, Vector128<uint> value2);
        public static unsafe void StorePair(ulong* address, Vector128<ulong> value1, Vector128<ulong> value2);
        public static unsafe void StorePairScalar(int* address, Vector64<int> value1, Vector64<int> value2);
        public static unsafe void StorePairScalar(float* address, Vector64<float> value1, Vector64<float> value2);
        public static unsafe void StorePairScalar(uint* address, Vector64<uint> value1, Vector64<uint> value2);
        public static unsafe void StorePairNonTemporal(byte* address, Vector64<byte> value1, Vector64<byte> value2);
        public static unsafe void StorePairNonTemporal(double* address, Vector64<double> value1, Vector64<double> value2);
        public static unsafe void StorePairNonTemporal(short* address, Vector64<short> value1, Vector64<short> value2);
        public static unsafe void StorePairNonTemporal(int* address, Vector64<int> value1, Vector64<int> value2);
        public static unsafe void StorePairNonTemporal(long* address, Vector64<long> value1, Vector64<long> value2);
        public static unsafe void StorePairNonTemporal(sbyte* address, Vector64<sbyte> value1, Vector64<sbyte> value2);
        public static unsafe void StorePairNonTemporal(float* address, Vector64<float> value1, Vector64<float> value2);
        public static unsafe void StorePairNonTemporal(ushort* address, Vector64<ushort> value1, Vector64<ushort> value2);
        public static unsafe void StorePairNonTemporal(uint* address, Vector64<uint> value1, Vector64<uint> value2);
        public static unsafe void StorePairNonTemporal(ulong* address, Vector64<ulong> value1, Vector64<ulong> value2);
        public static unsafe void StorePairNonTemporal(byte* address, Vector128<byte> value1, Vector128<byte> value2);
        public static unsafe void StorePairNonTemporal(double* address, Vector128<double> value1, Vector128<double> value2);
        public static unsafe void StorePairNonTemporal(short* address, Vector128<short> value1, Vector128<short> value2);
        public static unsafe void StorePairNonTemporal(int* address, Vector128<int> value1, Vector128<int> value2);
        public static unsafe void StorePairNonTemporal(long* address, Vector128<long> value1, Vector128<long> value2);
        public static unsafe void StorePairNonTemporal(sbyte* address, Vector128<sbyte> value1, Vector128<sbyte> value2);
        public static unsafe void StorePairNonTemporal(float* address, Vector128<float> value1, Vector128<float> value2);
        public static unsafe void StorePairNonTemporal(ushort* address, Vector128<ushort> value1, Vector128<ushort> value2);
        public static unsafe void StorePairNonTemporal(uint* address, Vector128<uint> value1, Vector128<uint> value2);
        public static unsafe void StorePairNonTemporal(ulong* address, Vector128<ulong> value1, Vector128<ulong> value2);
        public static unsafe void StorePairScalarNonTemporal(int* address, Vector64<int> value1, Vector64<int> value2);
        public static unsafe void StorePairScalarNonTemporal(float* address, Vector64<float> value1, Vector64<float> value2);
        public static unsafe void StorePairScalarNonTemporal(uint* address, Vector64<uint> value1, Vector64<uint> value2);
    }
}
```
|
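A minimal usage sketch of the proposed `StorePair` surface (the fallback path is illustrative only and assumes `dst` is suitably aligned):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static unsafe class StorePairExample
{
    // Store two 128-bit vectors; with the pair intrinsic this becomes a
    // single instruction: stp q0, q1, [x0]
    public static void WriteTwo(double* dst, Vector128<double> a, Vector128<double> b)
    {
        if (AdvSimd.Arm64.IsSupported)
        {
            AdvSimd.Arm64.StorePair(dst, a, b);
        }
        else
        {
            // Illustrative fallback: two separate 16-byte stores.
            *(Vector128<double>*)dst = a;
            *((Vector128<double>*)dst + 1) = b;
        }
    }
}
```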
@TamarChristinaArm I started implementing StorePair and realized that my original statement above was wrong: VSTM can only store a list of consecutively numbered D-registers, while STP can store an arbitrary pair of registers, so they are not equivalent. I believe the intrinsics in this PR should be Arm64-only. Do you agree? Out of curiosity, why are there no C/C++ intrinsics that store a pair of SIMD/FP registers? |
@echesakovMSFT
Well, sort of. So the ... But yes, in the context of not having convenience intrinsics in CoreCLR, I agree that these need to be different intrinsics. (In case you're wondering: in C, on AArch32, we would have done this by putting the values in a struct in the definition of the intrinsic before expanding to STM. This usually wouldn't produce any extra moves, as the register allocator will, when possible, arrange the values in the right registers immediately, and the struct is optimized away.)
The belief is that you don't need them, and that the compiler should always be able to form pairs when it's possible. To do this, both LLVM and GCC have special passes that aid in pair formation.
In GCC, for instance, we have a scheduler fusion pass that allows the instruction scheduler to move consecutive loads and stores next to each other when the pipeline description says it makes sense based on the data dependencies etc.; i.e. we won't move them if you can't form pairs, so that you don't overload your pipelines with a long chain of loads/stores. After this we peephole them into pairs. After this we have a late scheduling pass that is able to schedule the formed pairs better so that, again, you don't end up with a long chain of them in your pipeline.
Another way it deals with this is that we have modes that are larger than a machine integer register, e.g. |
@TamarChristinaArm Thank you for your thorough reply! |
RyuJIT doesn't have a scheduling pass, nor do we have any peephole-like phases that, for example, use a sliding window of instructions to analyze for optimizations such as this. Not to mention that we have only a very limited capability for dependence analysis to identify interfering memory operations. So the only near-term feasible optimization would be for immediately adjacent instructions. |
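As an illustration of that near-term case, the candidate pattern would be two textually adjacent stores to provably consecutive addresses (a sketch only; the assembly in the comments is what such a merge could produce):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static unsafe class PeepholeCandidate
{
    public static void StoreTwo(double* address, Vector128<double> value1, Vector128<double> value2)
    {
        AdvSimd.Store(address, value1);      // str q0, [x0]
        AdvSimd.Store(address + 2, value2);  // str q1, [x0, #16]
        // An adjacent-instruction peephole could merge these into:
        //   stp q0, q1, [x0]
    }
}
```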