Remaining ARM Intrinsics #37014

Closed
tannergooding opened this issue May 26, 2020 · 9 comments
Labels: api-approved (API was approved in API review, it can be implemented), arch-arm64, area-System.Runtime.Intrinsics

@tannergooding
Member

namespace System.Runtime.Intrinsics.Arm
{
    public static class AdvSimd
    {
        // LDP
        public static unsafe (Vector64<byte>,    Vector64<byte>)    LoadPairVector64(byte*   address);
        public static unsafe (Vector64<sbyte>,   Vector64<sbyte>)   LoadPairVector64(sbyte*  address);
        public static unsafe (Vector64<short>,   Vector64<short>)   LoadPairVector64(short*  address);
        public static unsafe (Vector64<ushort>,  Vector64<ushort>)  LoadPairVector64(ushort* address);
        public static unsafe (Vector64<int>,     Vector64<int>)     LoadPairVector64(int*    address);
        public static unsafe (Vector64<uint>,    Vector64<uint>)    LoadPairVector64(uint*   address);
        public static unsafe (Vector64<float>,   Vector64<float>)   LoadPairVector64(float*  address);

        public static unsafe (Vector128<byte>,   Vector128<byte>)   LoadPairVector128(byte*   address);
        public static unsafe (Vector128<sbyte>,  Vector128<sbyte>)  LoadPairVector128(sbyte*  address);
        public static unsafe (Vector128<short>,  Vector128<short>)  LoadPairVector128(short*  address);
        public static unsafe (Vector128<ushort>, Vector128<ushort>) LoadPairVector128(ushort* address);
        public static unsafe (Vector128<int>,    Vector128<int>)    LoadPairVector128(int*    address);
        public static unsafe (Vector128<uint>,   Vector128<uint>)   LoadPairVector128(uint*   address);
        public static unsafe (Vector128<long>,   Vector128<long>)   LoadPairVector128(long*   address);
        public static unsafe (Vector128<ulong>,  Vector128<ulong>)  LoadPairVector128(ulong*  address);
        public static unsafe (Vector128<float>,  Vector128<float>)  LoadPairVector128(float*  address);

        public static unsafe (Vector64<int>,     Vector64<int>)     LoadPairScalarVector64(int*  address);
        public static unsafe (Vector64<uint>,    Vector64<uint>)    LoadPairScalarVector64(uint* address);
        public static unsafe (Vector64<long>,    Vector64<long>)    LoadPairScalarVector64(long*  address);
        public static unsafe (Vector64<ulong>,   Vector64<ulong>)   LoadPairScalarVector64(ulong* address);
        public static unsafe (Vector64<float>,   Vector64<float>)   LoadPairScalarVector64(float* address);
        public static unsafe (Vector64<double>,  Vector64<double>)  LoadPairScalarVector64(double* address);

        // LDNP
        public static unsafe (Vector64<byte>,    Vector64<byte>)    LoadPairVector64NonTemporal(byte*   address);
        public static unsafe (Vector64<sbyte>,   Vector64<sbyte>)   LoadPairVector64NonTemporal(sbyte*  address);
        public static unsafe (Vector64<short>,   Vector64<short>)   LoadPairVector64NonTemporal(short*  address);
        public static unsafe (Vector64<ushort>,  Vector64<ushort>)  LoadPairVector64NonTemporal(ushort* address);
        public static unsafe (Vector64<int>,     Vector64<int>)     LoadPairVector64NonTemporal(int*    address);
        public static unsafe (Vector64<uint>,    Vector64<uint>)    LoadPairVector64NonTemporal(uint*   address);
        public static unsafe (Vector64<float>,   Vector64<float>)   LoadPairVector64NonTemporal(float*  address);

        public static unsafe (Vector128<byte>,   Vector128<byte>)   LoadPairVector128NonTemporal(byte*   address);
        public static unsafe (Vector128<sbyte>,  Vector128<sbyte>)  LoadPairVector128NonTemporal(sbyte*  address);
        public static unsafe (Vector128<short>,  Vector128<short>)  LoadPairVector128NonTemporal(short*  address);
        public static unsafe (Vector128<ushort>, Vector128<ushort>) LoadPairVector128NonTemporal(ushort* address);
        public static unsafe (Vector128<int>,    Vector128<int>)    LoadPairVector128NonTemporal(int*    address);
        public static unsafe (Vector128<uint>,   Vector128<uint>)   LoadPairVector128NonTemporal(uint*   address);
        public static unsafe (Vector128<long>,   Vector128<long>)   LoadPairVector128NonTemporal(long*   address);
        public static unsafe (Vector128<ulong>,  Vector128<ulong>)  LoadPairVector128NonTemporal(ulong*  address);
        public static unsafe (Vector128<float>,  Vector128<float>)  LoadPairVector128NonTemporal(float*  address);

        public static unsafe (Vector64<int>,     Vector64<int>)     LoadPairScalarVector64NonTemporal(int*  address);
        public static unsafe (Vector64<uint>,    Vector64<uint>)    LoadPairScalarVector64NonTemporal(uint* address);
        public static unsafe (Vector64<long>,    Vector64<long>)    LoadPairScalarVector64NonTemporal(long*  address);
        public static unsafe (Vector64<ulong>,   Vector64<ulong>)   LoadPairScalarVector64NonTemporal(ulong* address);
        public static unsafe (Vector64<float>,   Vector64<float>)   LoadPairScalarVector64NonTemporal(float* address);
        public static unsafe (Vector64<double>,  Vector64<double>)  LoadPairScalarVector64NonTemporal(double* address);

        // SQXTN
        public static Vector64<sbyte>   ExtractNarrowingSaturateLower(Vector128<short>  value);
        public static Vector64<short>   ExtractNarrowingSaturateLower(Vector128<int>    value);
        public static Vector64<int>     ExtractNarrowingSaturateLower(Vector128<long>   value);
        public static Vector128<sbyte>  ExtractNarrowingSaturateUpper(Vector64<short>   lower, Vector128<short>  value);
        public static Vector128<short>  ExtractNarrowingSaturateUpper(Vector64<int>     lower, Vector128<int>    value);
        public static Vector128<int>    ExtractNarrowingSaturateUpper(Vector64<long>    lower, Vector128<long>   value);

        // UQXTN
        public static Vector64<byte>    ExtractNarrowingSaturateLower(Vector128<ushort> value);
        public static Vector64<ushort>  ExtractNarrowingSaturateLower(Vector128<uint>   value);
        public static Vector64<uint>    ExtractNarrowingSaturateLower(Vector128<ulong>  value);
        public static Vector128<byte>   ExtractNarrowingSaturateUpper(Vector64<ushort>  lower, Vector128<ushort> value);
        public static Vector128<ushort> ExtractNarrowingSaturateUpper(Vector64<uint>    lower, Vector128<uint>   value);
        public static Vector128<uint>   ExtractNarrowingSaturateUpper(Vector64<ulong>   lower, Vector128<ulong>  value);

        // SQXTUN
        public static Vector64<byte>    ExtractNarrowingSaturateUnsignedLower(Vector128<short> value);
        public static Vector64<ushort>  ExtractNarrowingSaturateUnsignedLower(Vector128<int>   value);
        public static Vector64<uint>    ExtractNarrowingSaturateUnsignedLower(Vector128<long>  value);
        public static Vector128<byte>   ExtractNarrowingSaturateUnsignedUpper(Vector64<short>  lower, Vector128<short> value);
        public static Vector128<ushort> ExtractNarrowingSaturateUnsignedUpper(Vector64<int>    lower, Vector128<int>   value);
        public static Vector128<uint>   ExtractNarrowingSaturateUnsignedUpper(Vector64<long>   lower, Vector128<long>  value);

        // REV16
        public static Vector64<ushort>  ReverseElementBytes(Vector64<ushort>  value);
        public static Vector64<short>   ReverseElementBytes(Vector64<short>   value);
        public static Vector128<ushort> ReverseElementBytes(Vector128<ushort> value);
        public static Vector128<short>  ReverseElementBytes(Vector128<short>  value);

        // REV32
        public static Vector64<uint>    ReverseElementBytes(Vector64<uint>    value);
        public static Vector64<int>     ReverseElementBytes(Vector64<int>     value);
        public static Vector64<float>   ReverseElementBytes(Vector64<float>   value);
        public static Vector128<uint>   ReverseElementBytes(Vector128<uint>   value);
        public static Vector128<int>    ReverseElementBytes(Vector128<int>    value);
        // Also versions that swap the "halfwords" (each 16-bit portion)

        // REV64
        public static Vector64<ulong>   ReverseElementBytes(Vector64<ulong>   value);
        public static Vector64<long>    ReverseElementBytes(Vector64<long>    value);
        public static Vector128<ulong>  ReverseElementBytes(Vector128<ulong>  value);
        public static Vector128<long>   ReverseElementBytes(Vector128<long>   value);
        // Also versions that swap the "doublewords" (each 32-bit portion)
        // Also versions that swap the "halfwords" (each 16-bit portion)
    }
}
@ghost

ghost commented May 26, 2020

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label May 26, 2020
@tannergooding
Member Author

CC. @CarolEidt, @echesakovMSFT

@tannergooding tannergooding removed the untriaged New issue has not been triaged by the area owner label May 26, 2020
@echesakov echesakov added this to To do general in Hardware Intrinsics via automation Jun 11, 2020
@echesakov echesakov moved this from To do general to API design in Hardware Intrinsics Jun 13, 2020
@echesakov echesakov added this to the 5.0.0 milestone Jun 19, 2020
@echesakov echesakov added the blocking Marks issues that we want to fast track in order to unblock other important work label Jun 24, 2020
@echesakov
Contributor

ExtractNarrowingSaturateUnsignedLower and ExtractNarrowingSaturateUnsignedUpper are needed for the "intrinsification" work that @carlossanlop is doing, so I will implement this next.

@echesakov
Contributor

ExtractNarrowingSaturateUnsignedLower and ExtractNarrowingSaturateUnsignedUpper should correspond to sqxtun and sqxtun2
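
To make the difference between the three narrowing instructions concrete, here is a minimal scalar sketch of what a single lane does when narrowing 16-bit input to 8-bit output. The class and method names are hypothetical and exist only for illustration; they are not part of the proposed API, and the real intrinsics operate on whole vectors rather than one value at a time.

```csharp
using System;

// Hypothetical scalar models of one lane of each narrowing instruction.
static class SaturatingNarrowSketch
{
    // SQXTN: signed input, signed output, clamped to [-128, 127].
    public static sbyte NarrowSignedToSigned(short value) =>
        (sbyte)Math.Clamp((int)value, -128, 127);

    // UQXTN: unsigned input, unsigned output, clamped to [0, 255].
    public static byte NarrowUnsignedToUnsigned(ushort value) =>
        (byte)Math.Min(value, (ushort)255);

    // SQXTUN: signed input, unsigned output, clamped to [0, 255].
    // Negative inputs saturate to 0 instead of wrapping around.
    public static byte NarrowSignedToUnsigned(short value) =>
        (byte)Math.Clamp((int)value, 0, 255);
}
```

Under this model, the "Unsigned" methods in the proposal (signed input, unsigned result) behave like `NarrowSignedToUnsigned`, which is the sqxtun/sqxtun2 behaviour described above.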

@terrajobst terrajobst added the api-approved API was approved in API review, it can be implemented label Jun 25, 2020
@terrajobst
Member

terrajobst commented Jun 25, 2020

Video

label:blocking

namespace System.Runtime.Intrinsics.Arm
{
    public static class AdvSimd
    {
        public static unsafe (Vector64<byte> Value1,    Vector64<byte> Value2)    LoadPairVector64(byte*   address);
        public static unsafe (Vector64<sbyte> Value1,   Vector64<sbyte> Value2)   LoadPairVector64(sbyte*  address);
        public static unsafe (Vector64<short> Value1,   Vector64<short> Value2)   LoadPairVector64(short*  address);
        public static unsafe (Vector64<ushort> Value1,  Vector64<ushort> Value2)  LoadPairVector64(ushort* address);
        public static unsafe (Vector64<int> Value1,     Vector64<int> Value2)     LoadPairVector64(int*    address);
        public static unsafe (Vector64<uint> Value1,    Vector64<uint> Value2)    LoadPairVector64(uint*   address);
        public static unsafe (Vector64<float> Value1,   Vector64<float> Value2)   LoadPairVector64(float*  address);

        public static unsafe (Vector128<byte> Value1,   Vector128<byte> Value2)   LoadPairVector128(byte*   address);
        public static unsafe (Vector128<sbyte> Value1,  Vector128<sbyte> Value2)  LoadPairVector128(sbyte*  address);
        public static unsafe (Vector128<short> Value1,  Vector128<short> Value2)  LoadPairVector128(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2) LoadPairVector128(ushort* address);
        public static unsafe (Vector128<int> Value1,    Vector128<int> Value2)    LoadPairVector128(int*    address);
        public static unsafe (Vector128<uint> Value1,   Vector128<uint> Value2)   LoadPairVector128(uint*   address);
        public static unsafe (Vector128<long> Value1,   Vector128<long> Value2)   LoadPairVector128(long*   address);
        public static unsafe (Vector128<ulong> Value1,  Vector128<ulong> Value2)  LoadPairVector128(ulong*  address);
        public static unsafe (Vector128<float> Value1,  Vector128<float> Value2)  LoadPairVector128(float*  address);

        public static unsafe (Vector64<int> Value1,     Vector64<int> Value2)     LoadPairScalarVector64(int*  address);
        public static unsafe (Vector64<uint> Value1,    Vector64<uint> Value2)    LoadPairScalarVector64(uint* address);
        public static unsafe (Vector64<long> Value1,    Vector64<long> Value2)    LoadPairScalarVector64(long*  address);
        public static unsafe (Vector64<ulong> Value1,   Vector64<ulong> Value2)   LoadPairScalarVector64(ulong* address);
        public static unsafe (Vector64<float> Value1,   Vector64<float> Value2)   LoadPairScalarVector64(float* address);
        public static unsafe (Vector64<double> Value1,  Vector64<double> Value2)  LoadPairScalarVector64(double* address);

        public static unsafe (Vector64<byte> Value1,    Vector64<byte> Value2)    LoadPairVector64NonTemporal(byte*   address);
        public static unsafe (Vector64<sbyte> Value1,   Vector64<sbyte> Value2)   LoadPairVector64NonTemporal(sbyte*  address);
        public static unsafe (Vector64<short> Value1,   Vector64<short> Value2)   LoadPairVector64NonTemporal(short*  address);
        public static unsafe (Vector64<ushort> Value1,  Vector64<ushort> Value2)  LoadPairVector64NonTemporal(ushort* address);
        public static unsafe (Vector64<int> Value1,     Vector64<int> Value2)     LoadPairVector64NonTemporal(int*    address);
        public static unsafe (Vector64<uint> Value1,    Vector64<uint> Value2)    LoadPairVector64NonTemporal(uint*   address);
        public static unsafe (Vector64<float> Value1,   Vector64<float> Value2)   LoadPairVector64NonTemporal(float*  address);

        public static unsafe (Vector128<byte> Value1,   Vector128<byte> Value2)   LoadPairVector128NonTemporal(byte*   address);
        public static unsafe (Vector128<sbyte> Value1,  Vector128<sbyte> Value2)  LoadPairVector128NonTemporal(sbyte*  address);
        public static unsafe (Vector128<short> Value1,  Vector128<short> Value2)  LoadPairVector128NonTemporal(short*  address);
        public static unsafe (Vector128<ushort> Value1, Vector128<ushort> Value2) LoadPairVector128NonTemporal(ushort* address);
        public static unsafe (Vector128<int> Value1,    Vector128<int> Value2)    LoadPairVector128NonTemporal(int*    address);
        public static unsafe (Vector128<uint> Value1,   Vector128<uint> Value2)   LoadPairVector128NonTemporal(uint*   address);
        public static unsafe (Vector128<long> Value1,   Vector128<long> Value2)   LoadPairVector128NonTemporal(long*   address);
        public static unsafe (Vector128<ulong> Value1,  Vector128<ulong> Value2)  LoadPairVector128NonTemporal(ulong*  address);
        public static unsafe (Vector128<float> Value1,  Vector128<float> Value2)  LoadPairVector128NonTemporal(float*  address);

        public static unsafe (Vector64<int> Value1,     Vector64<int> Value2)     LoadPairScalarVector64NonTemporal(int*  address);
        public static unsafe (Vector64<uint> Value1,    Vector64<uint> Value2)    LoadPairScalarVector64NonTemporal(uint* address);
        public static unsafe (Vector64<long> Value1,    Vector64<long> Value2)    LoadPairScalarVector64NonTemporal(long*  address);
        public static unsafe (Vector64<ulong> Value1,   Vector64<ulong> Value2)   LoadPairScalarVector64NonTemporal(ulong* address);
        public static unsafe (Vector64<float> Value1,   Vector64<float> Value2)   LoadPairScalarVector64NonTemporal(float* address);
        public static unsafe (Vector64<double> Value1,  Vector64<double> Value2)  LoadPairScalarVector64NonTemporal(double* address);

        public static Vector64<sbyte>   ExtractNarrowingSaturateLower(Vector128<short>  value);
        public static Vector64<short>   ExtractNarrowingSaturateLower(Vector128<int>    value);
        public static Vector64<int>     ExtractNarrowingSaturateLower(Vector128<long>   value);
        public static Vector128<sbyte>  ExtractNarrowingSaturateUpper(Vector64<short>   lower, Vector128<short>  value);
        public static Vector128<short>  ExtractNarrowingSaturateUpper(Vector64<int>     lower, Vector128<int>    value);
        public static Vector128<int>    ExtractNarrowingSaturateUpper(Vector64<long>    lower, Vector128<long>   value);

        public static Vector64<byte>    ExtractNarrowingSaturateLower(Vector128<ushort> value);
        public static Vector64<ushort>  ExtractNarrowingSaturateLower(Vector128<uint>   value);
        public static Vector64<uint>    ExtractNarrowingSaturateLower(Vector128<ulong>  value);
        public static Vector128<byte>   ExtractNarrowingSaturateUpper(Vector64<ushort>  lower, Vector128<ushort> value);
        public static Vector128<ushort> ExtractNarrowingSaturateUpper(Vector64<uint>    lower, Vector128<uint>   value);
        public static Vector128<uint>   ExtractNarrowingSaturateUpper(Vector64<ulong>   lower, Vector128<ulong>  value);

        public static Vector64<byte>    ExtractNarrowingSaturateUnsignedLower(Vector128<short> value);
        public static Vector64<ushort>  ExtractNarrowingSaturateUnsignedLower(Vector128<int>   value);
        public static Vector64<uint>    ExtractNarrowingSaturateUnsignedLower(Vector128<long>  value);
        public static Vector128<byte>   ExtractNarrowingSaturateUnsignedUpper(Vector64<short>  lower, Vector128<short> value);
        public static Vector128<ushort> ExtractNarrowingSaturateUnsignedUpper(Vector64<int>    lower, Vector128<int>   value);
        public static Vector128<uint>   ExtractNarrowingSaturateUnsignedUpper(Vector64<long>   lower, Vector128<long>  value);

        public static Vector64<ushort>  ReverseElement8(Vector64<ushort>  value);
        public static Vector64<short>   ReverseElement8(Vector64<short>   value);
        public static Vector128<ushort> ReverseElement8(Vector128<ushort> value);
        public static Vector128<short>  ReverseElement8(Vector128<short>  value);

        public static Vector64<uint>    ReverseElement8(Vector64<uint>    value);
        public static Vector64<int>     ReverseElement8(Vector64<int>     value);
        public static Vector64<float>   ReverseElement8(Vector64<float>   value);
        public static Vector128<uint>   ReverseElement8(Vector128<uint>   value);
        public static Vector128<int>    ReverseElement8(Vector128<int>    value);

        public static Vector64<ulong>   ReverseElement8(Vector64<ulong>   value);
        public static Vector64<long>    ReverseElement8(Vector64<long>    value);
        public static Vector128<ulong>  ReverseElement8(Vector128<ulong>  value);
        public static Vector128<long>   ReverseElement8(Vector128<long>   value);

        public static Vector64<uint>    ReverseElement16(Vector64<uint>    value);
        public static Vector64<int>     ReverseElement16(Vector64<int>     value);
        public static Vector64<float>   ReverseElement16(Vector64<float>   value);
        public static Vector128<uint>   ReverseElement16(Vector128<uint>   value);
        public static Vector128<int>    ReverseElement16(Vector128<int>    value);

        public static Vector64<ulong>   ReverseElement16(Vector64<ulong>   value);
        public static Vector64<long>    ReverseElement16(Vector64<long>    value);
        public static Vector128<ulong>  ReverseElement16(Vector128<ulong>  value);
        public static Vector128<long>   ReverseElement16(Vector128<long>   value);

        public static Vector64<ulong>   ReverseElement32(Vector64<ulong>   value);
        public static Vector64<long>    ReverseElement32(Vector64<long>    value);
        public static Vector128<ulong>  ReverseElement32(Vector128<ulong>  value);
        public static Vector128<long>   ReverseElement32(Vector128<long>   value);
    }
}
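
The ReverseElement8/16/32 names encode the width of the sub-elements whose order is reversed inside each vector element. A rough scalar sketch of that reading, applied to a single 32-bit or 64-bit element, is shown below. The class name is hypothetical and this is only a model of the per-element behaviour, not the intrinsic implementation.

```csharp
using System;
using System.Buffers.Binary;
using System.Numerics;

// Hypothetical scalar models of what happens to one vector element.
static class ReverseElementSketch
{
    // Reverse the four 8-bit pieces of a 32-bit element (REV32 on bytes).
    public static uint ReverseElement8(uint element) =>
        BinaryPrimitives.ReverseEndianness(element);

    // Reverse the two 16-bit pieces of a 32-bit element (REV32 on halfwords).
    public static uint ReverseElement16(uint element) =>
        BitOperations.RotateLeft(element, 16);

    // Reverse the four 16-bit pieces of a 64-bit element (REV64 on halfwords).
    public static ulong ReverseElement16(ulong element)
    {
        // Swap the 32-bit halves, then swap the 16-bit pieces inside each half.
        element = BitOperations.RotateLeft(element, 32);
        return ((element & 0xFFFF0000FFFF0000UL) >> 16)
             | ((element & 0x0000FFFF0000FFFFUL) << 16);
    }

    // Reverse the two 32-bit pieces of a 64-bit element (REV64 on words).
    public static ulong ReverseElement32(ulong element) =>
        BitOperations.RotateLeft(element, 32);
}
```

For example, under this model `ReverseElement8(0x11223344u)` gives `0x44332211`, while `ReverseElement16(0x11223344u)` gives `0x33441122`.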

@echesakov echesakov moved this from API design to In progress in Hardware Intrinsics Jun 25, 2020
@juliusfriedman
Contributor

Please correct me if I am wrong, but for the first 50 or so methods there is little chance of failure unless the address is invalid (and they take a pointer). I would therefore strongly suggest Span and/or IntPtr overloads in addition to the unsafe variants proposed here.

@tannergooding
Member Author

Such APIs are out of scope for .NET 5 and would require a separate API proposal.

However, none of the other intrinsics have such overloads. They would likely not be as performant or have as clear semantics, so I would not be in favor of taking them through API review.
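
For readers who want span-based safety today, a caller can layer a bounds-checked wrapper over the pointer-based loads. The sketch below is one way to do that under stated assumptions: `LoadPairSketch.LoadPair` is a hypothetical helper, not a proposed or shipped API; it uses two existing `AdvSimd.LoadVector128` calls rather than the `LoadPairVector128` intrinsic discussed in this issue; and it needs `AllowUnsafeBlocks` enabled to compile. The explicit length check is the kind of extra work that a span overload would add on every call.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class LoadPairSketch
{
    // Hypothetical helper: loads two adjacent 128-bit vectors from a span.
    public static unsafe (Vector128<byte> Value1, Vector128<byte> Value2) LoadPair(ReadOnlySpan<byte> source)
    {
        // The bounds check that a pointer-based intrinsic leaves to the caller.
        if (source.Length < 32)
        {
            throw new ArgumentException("At least 32 bytes are required.", nameof(source));
        }

        if (AdvSimd.IsSupported)
        {
            // Pin the span so the pointer stays valid for both loads.
            fixed (byte* p = source)
            {
                return (AdvSimd.LoadVector128(p), AdvSimd.LoadVector128(p + 16));
            }
        }

        // Portable fallback so the sketch also runs on non-ARM hardware.
        return (MemoryMarshal.Read<Vector128<byte>>(source),
                MemoryMarshal.Read<Vector128<byte>>(source.Slice(16)));
    }
}
```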

@echesakov echesakov removed the blocking Marks issues that we want to fast track in order to unblock other important work label Jul 1, 2020
@carlossanlop carlossanlop added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-ready-for-review labels Jul 6, 2020
@echesakov
Contributor

During the last JIT team meeting, concerns were raised that LoadPairVector64 and LoadPairVector128 are the only intrinsics returning a tuple, and we don't yet know whether they could expose previously unseen issues.

I am going to open a separate issue to track the work of implementing LoadPairVector64 and LoadPairVector128. Depending on the extent of the changes their implementation requires, we might consider moving these intrinsics to 6.0. That work could then be consolidated with the work of implementing intrinsics for LD1-LD4 and ST1-ST4, which operate on multiple registers, and thoroughly tested thereafter.

cc @dotnet/jit-contrib

@echesakov
Contributor

Opened #39243 for LoadPairVector64 and LoadPairVector128

Hardware Intrinsics automation moved this from In progress to Done Jul 14, 2020
@stephentoub stephentoub removed the api-ready-for-review API is ready for review, it is NOT ready for implementation label Oct 23, 2020
@dotnet dotnet locked as resolved and limited conversation to collaborators Dec 9, 2020
7 participants