API Proposal : Arm Shift and Permute intrinsics #31324

Closed
TamarChristinaArm opened this issue Oct 28, 2019 · 20 comments · Fixed by #36818
Labels
api-approved API was approved in API review, it can be implemented arch-arm64 area-System.Runtime.Intrinsics

Comments

@TamarChristinaArm
Contributor

The A32 variants of these are blocked pending resolution of the <lanes>x<copies> implementation in #24790 (e.g. int32x2x2). The permute instructions such as ZIP1 and ZIP2 present an interesting challenge. Intrinsics in CoreCLR/CoreFX are supposed to map down to a single hardware instruction, which makes things a bit awkward here: on A32, ZIP, TRN and UZP are destructive operations that perform both the odd and even shuffles at the same time. So while you could implement the intrinsics on A32 by copying the vector and ignoring one of the outputs, I believe that runs counter to the philosophy here (unless I'm mistaken). It also means that, if they were implemented on A32, a ZIP1/ZIP2 pair should for efficiency be combined into a single ZIP and the moves not generated.

This also means that the Arm ZIP, TRN and UZP intrinsics can't be implemented on A64 as a single intrinsic; the user needs to make two calls instead. This is why the intrinsics in this proposal are A64 only, although it makes intrinsics code a bit less portable between A32 and A64 in this case.
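To make the difference concrete, here is a minimal scalar sketch of the element movement involved, written against plain arrays so it runs on any platform. The `ZipModel` class, the `ZipBoth` helper and the array-based signatures are illustrative assumptions and not part of the proposal: `ZipLow` and `ZipHigh` model what A64 ZIP1 and ZIP2 each produce, and `ZipBoth` models an A32-style operation that yields both results at once.

```csharp
using System;

// Scalar reference model only; it does not call any hardware intrinsic.
public static class ZipModel
{
    // Models A64 ZIP1: interleaves the low halves of the two inputs.
    public static T[] ZipLow<T>(T[] left, T[] right) => Zip(left, right, 0);

    // Models A64 ZIP2: interleaves the high halves of the two inputs.
    public static T[] ZipHigh<T>(T[] left, T[] right) => Zip(left, right, left.Length / 2);

    // Hypothetical A32-style shape: both interleaved halves from one call.
    public static (T[] Low, T[] High) ZipBoth<T>(T[] left, T[] right)
        => (ZipLow(left, right), ZipHigh(left, right));

    private static T[] Zip<T>(T[] left, T[] right, int start)
    {
        if (left.Length != right.Length || left.Length % 2 != 0)
            throw new ArgumentException("Inputs must have the same, even length.");

        var result = new T[left.Length];
        for (int i = 0; i < left.Length / 2; i++)
        {
            result[2 * i] = left[start + i];      // even lanes come from left
            result[2 * i + 1] = right[start + i]; // odd lanes come from right
        }
        return result;
    }
}
```

With `left = {0, 1, 2, 3}` and `right = {4, 5, 6, 7}`, `ZipLow` gives `{0, 4, 1, 5}` and `ZipHigh` gives `{2, 6, 3, 7}`. On A64 those are two separate calls; the pair-returning shape is what the A32 instruction naturally provides.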

Also, to make things easier to read, I combined the documentation headers for the proposal. They will of course be separated out in the actual implementation.

namespace System.Runtime.Intrinsics.Arm
{

    public static class ArmBase
    {
        public static bool IsSupported { get { throw null; } }

        /// <summary>
        /// vslid_n_[su]64
        ///
        /// A64: SLI
        /// A32: VSLI
        /// </summary>
        public static long  LeftShiftAndInsert (long  left, long  right, uint shift) => { throw null };
        public static ulong LeftShiftAndInsert (ulong left, ulong right, uint shift) => { throw null };

        /// <summary>
        /// vsrid_n_[su]64
        ///
        /// A64: SRI
        /// A32: VSRI
        /// </summary>
        public static long  RightShiftAndInsert (long  left, long  right, uint shift) => { throw null };
        public static ulong RightShiftAndInsert (ulong left, ulong right, uint shift) => { throw null };

    }

    public static class AdvSimd
    {
        public static bool IsSupported { get { throw null; } }

        /// <summary>
        /// vsli[q]_n_[su][8,16,32,64]
        ///
        /// A64: SLI
        /// A32: VSLI
        /// </summary>
        public static Vector64<byte>   LeftShiftAndInsert (Vector64<byte>   left, Vector64<byte>   right, uint shift) => { throw null };
        public static Vector64<ushort> LeftShiftAndInsert (Vector64<ushort> left, Vector64<ushort> right, uint shift) => { throw null };
        public static Vector64<uint>   LeftShiftAndInsert (Vector64<uint>   left, Vector64<uint>   right, uint shift) => { throw null };
        public static Vector64<sbyte>  LeftShiftAndInsert (Vector64<sbyte>  left, Vector64<sbyte>  right, uint shift) => { throw null };
        public static Vector64<short>  LeftShiftAndInsert (Vector64<short>  left, Vector64<short>  right, uint shift) => { throw null };
        public static Vector64<int>    LeftShiftAndInsert (Vector64<int>    left, Vector64<int>    right, uint shift) => { throw null };

        public static Vector128<byte>   LeftShiftAndInsert (Vector128<byte>   left, Vector128<byte>   right, uint shift) => { throw null };
        public static Vector128<ushort> LeftShiftAndInsert (Vector128<ushort> left, Vector128<ushort> right, uint shift) => { throw null };
        public static Vector128<uint>   LeftShiftAndInsert (Vector128<uint>   left, Vector128<uint>   right, uint shift) => { throw null };
        public static Vector128<ulong>  LeftShiftAndInsert (Vector128<ulong>  left, Vector128<ulong>  right, uint shift) => { throw null };
        public static Vector128<sbyte>  LeftShiftAndInsert (Vector128<sbyte>  left, Vector128<sbyte>  right, uint shift) => { throw null };
        public static Vector128<short>  LeftShiftAndInsert (Vector128<short>  left, Vector128<short>  right, uint shift) => { throw null };
        public static Vector128<int>    LeftShiftAndInsert (Vector128<int>    left, Vector128<int>    right, uint shift) => { throw null };
        public static Vector128<long>   LeftShiftAndInsert (Vector128<long>   left, Vector128<long>   right, uint shift) => { throw null };

        /// <summary>
        /// vsri[q]_n_[su][8,16,32,64]
        ///
        /// A64: SRI
        /// A32: VSRI
        /// </summary>
        public static Vector64<byte>   RightShiftAndInsert (Vector64<byte>   left, Vector64<byte>   right, uint shift) => { throw null };
        public static Vector64<ushort> RightShiftAndInsert (Vector64<ushort> left, Vector64<ushort> right, uint shift) => { throw null };
        public static Vector64<uint>   RightShiftAndInsert (Vector64<uint>   left, Vector64<uint>   right, uint shift) => { throw null };
        public static Vector64<sbyte>  RightShiftAndInsert (Vector64<sbyte>  left, Vector64<sbyte>  right, uint shift) => { throw null };
        public static Vector64<short>  RightShiftAndInsert (Vector64<short>  left, Vector64<short>  right, uint shift) => { throw null };
        public static Vector64<int>    RightShiftAndInsert (Vector64<int>    left, Vector64<int>    right, uint shift) => { throw null };

        public static Vector128<byte>   RightShiftAndInsert (Vector128<byte>   left, Vector128<byte>   right, uint shift) => { throw null };
        public static Vector128<ushort> RightShiftAndInsert (Vector128<ushort> left, Vector128<ushort> right, uint shift) => { throw null };
        public static Vector128<uint>   RightShiftAndInsert (Vector128<uint>   left, Vector128<uint>   right, uint shift) => { throw null };
        public static Vector128<ulong>  RightShiftAndInsert (Vector128<ulong>  left, Vector128<ulong>  right, uint shift) => { throw null };
        public static Vector128<sbyte>  RightShiftAndInsert (Vector128<sbyte>  left, Vector128<sbyte>  right, uint shift) => { throw null };
        public static Vector128<short>  RightShiftAndInsert (Vector128<short>  left, Vector128<short>  right, uint shift) => { throw null };
        public static Vector128<int>    RightShiftAndInsert (Vector128<int>    left, Vector128<int>    right, uint shift) => { throw null };
        public static Vector128<long>   RightShiftAndInsert (Vector128<long>   left, Vector128<long>   right, uint shift) => { throw null };

        /// <summary>
        /// vmovn_[su][16,32,64]
        ///
        /// A64: XTN
        /// A32: VMOVN
        /// </summary>
        public static Vector64<sbyte>  ExtractAndNarrowLow (Vector128<short>  value) => { throw null };
        public static Vector64<short>  ExtractAndNarrowLow (Vector128<int>    value) => { throw null };
        public static Vector64<int>    ExtractAndNarrowLow (Vector128<long>   value) => { throw null };
        public static Vector64<byte>   ExtractAndNarrowLow (Vector128<ushort> value) => { throw null };
        public static Vector64<ushort> ExtractAndNarrowLow (Vector128<uint>   value) => { throw null };
        public static Vector64<uint>   ExtractAndNarrowLow (Vector128<ulong>  value) => { throw null };

        /// <summary>
        /// vmovn_high_[su][16,32,64]
        ///
        /// A64: XTN2
        /// A32: VMOVN
        /// </summary>
        public static Vector128<sbyte>  ExtractAndNarrowHigh (Vector64<sbyte>  accum, Vector128<short>  value) => { throw null };
        public static Vector128<short>  ExtractAndNarrowHigh (Vector64<short>  accum, Vector128<int>    value) => { throw null };
        public static Vector128<int>    ExtractAndNarrowHigh (Vector64<int>    accum, Vector128<long>   value) => { throw null };
        public static Vector128<byte>   ExtractAndNarrowHigh (Vector64<byte>   accum, Vector128<ushort> value) => { throw null };
        public static Vector128<ushort> ExtractAndNarrowHigh (Vector64<ushort> accum, Vector128<uint>   value) => { throw null };
        public static Vector128<uint>   ExtractAndNarrowHigh (Vector64<uint>   accum, Vector128<ulong>  value) => { throw null };

        public static class Arm64
        {
            public static bool IsSupported { get { throw null; } }

        /// <summary>
        /// vuzp1[q]_[suf][8,16,32,64]
        ///
        /// A64: UZP1
        /// </summary>
        public static Vector64<sbyte>  UnzipEven (Vector64<sbyte>  left, Vector64<sbyte>  right) => { throw null };
        public static Vector64<short>  UnzipEven (Vector64<short>  left, Vector64<short>  right) => { throw null };
        public static Vector64<int>    UnzipEven (Vector64<int>    left, Vector64<int>    right) => { throw null };
        public static Vector64<byte>   UnzipEven (Vector64<byte>   left, Vector64<byte>   right) => { throw null };
        public static Vector64<ushort> UnzipEven (Vector64<ushort> left, Vector64<ushort> right) => { throw null };
        public static Vector64<uint>   UnzipEven (Vector64<uint>   left, Vector64<uint>   right) => { throw null };
        public static Vector64<float>  UnzipEven (Vector64<float>  left, Vector64<float>  right) => { throw null };

        public static Vector128<sbyte>  UnzipEven (Vector128<sbyte>  left, Vector128<sbyte>  right) => { throw null };
        public static Vector128<short>  UnzipEven (Vector128<short>  left, Vector128<short>  right) => { throw null };
        public static Vector128<int>    UnzipEven (Vector128<int>    left, Vector128<int>    right) => { throw null };
        public static Vector128<long>   UnzipEven (Vector128<long>   left, Vector128<long>   right) => { throw null };
        public static Vector128<byte>   UnzipEven (Vector128<byte>   left, Vector128<byte>   right) => { throw null };
        public static Vector128<ushort> UnzipEven (Vector128<ushort> left, Vector128<ushort> right) => { throw null };
        public static Vector128<uint>   UnzipEven (Vector128<uint>   left, Vector128<uint>   right) => { throw null };
        public static Vector128<ulong>  UnzipEven (Vector128<ulong>  left, Vector128<ulong>  right) => { throw null };
        public static Vector128<float>  UnzipEven (Vector128<float>  left, Vector128<float>  right) => { throw null };
        public static Vector128<double> UnzipEven (Vector128<double> left, Vector128<double> right) => { throw null };

        /// <summary>
        /// vuzp2[q]_[suf][8,16,32,64]
        ///
        /// A64: UZP2
        /// </summary>
        public static Vector64<sbyte>  UnzipOdd (Vector64<sbyte>  left, Vector64<sbyte>  right) => { throw null };
        public static Vector64<short>  UnzipOdd (Vector64<short>  left, Vector64<short>  right) => { throw null };
        public static Vector64<int>    UnzipOdd (Vector64<int>    left, Vector64<int>    right) => { throw null };
        public static Vector64<byte>   UnzipOdd (Vector64<byte>   left, Vector64<byte>   right) => { throw null };
        public static Vector64<ushort> UnzipOdd (Vector64<ushort> left, Vector64<ushort> right) => { throw null };
        public static Vector64<uint>   UnzipOdd (Vector64<uint>   left, Vector64<uint>   right) => { throw null };
        public static Vector64<float>  UnzipOdd (Vector64<float>  left, Vector64<float>  right) => { throw null };

        public static Vector128<sbyte>  UnzipOdd (Vector128<sbyte>  left, Vector128<sbyte>  right) => { throw null };
        public static Vector128<short>  UnzipOdd (Vector128<short>  left, Vector128<short>  right) => { throw null };
        public static Vector128<int>    UnzipOdd (Vector128<int>    left, Vector128<int>    right) => { throw null };
        public static Vector128<long>   UnzipOdd (Vector128<long>   left, Vector128<long>   right) => { throw null };
        public static Vector128<byte>   UnzipOdd (Vector128<byte>   left, Vector128<byte>   right) => { throw null };
        public static Vector128<ushort> UnzipOdd (Vector128<ushort> left, Vector128<ushort> right) => { throw null };
        public static Vector128<uint>   UnzipOdd (Vector128<uint>   left, Vector128<uint>   right) => { throw null };
        public static Vector128<ulong>  UnzipOdd (Vector128<ulong>  left, Vector128<ulong>  right) => { throw null };
        public static Vector128<float>  UnzipOdd (Vector128<float>  left, Vector128<float>  right) => { throw null };
        public static Vector128<double> UnzipOdd (Vector128<double> left, Vector128<double> right) => { throw null };

        /// <summary>
        /// vzip1[q]_[suf][8,16,32,64]
        ///
        /// A64: ZIP1
        /// </summary>
        public static Vector64<sbyte>  ZipLow (Vector64<sbyte>  left, Vector64<sbyte>  right) => { throw null };
        public static Vector64<short>  ZipLow (Vector64<short>  left, Vector64<short>  right) => { throw null };
        public static Vector64<int>    ZipLow (Vector64<int>    left, Vector64<int>    right) => { throw null };
        public static Vector64<byte>   ZipLow (Vector64<byte>   left, Vector64<byte>   right) => { throw null };
        public static Vector64<ushort> ZipLow (Vector64<ushort> left, Vector64<ushort> right) => { throw null };
        public static Vector64<uint>   ZipLow (Vector64<uint>   left, Vector64<uint>   right) => { throw null };
        public static Vector64<float>  ZipLow (Vector64<float>  left, Vector64<float>  right) => { throw null };

        public static Vector128<sbyte>  ZipLow (Vector128<sbyte>  left, Vector128<sbyte>  right) => { throw null };
        public static Vector128<short>  ZipLow (Vector128<short>  left, Vector128<short>  right) => { throw null };
        public static Vector128<int>    ZipLow (Vector128<int>    left, Vector128<int>    right) => { throw null };
        public static Vector128<long>   ZipLow (Vector128<long>   left, Vector128<long>   right) => { throw null };
        public static Vector128<byte>   ZipLow (Vector128<byte>   left, Vector128<byte>   right) => { throw null };
        public static Vector128<ushort> ZipLow (Vector128<ushort> left, Vector128<ushort> right) => { throw null };
        public static Vector128<uint>   ZipLow (Vector128<uint>   left, Vector128<uint>   right) => { throw null };
        public static Vector128<ulong>  ZipLow (Vector128<ulong>  left, Vector128<ulong>  right) => { throw null };
        public static Vector128<float>  ZipLow (Vector128<float>  left, Vector128<float>  right) => { throw null };
        public static Vector128<double> ZipLow (Vector128<double> left, Vector128<double> right) => { throw null };

        /// <summary>
        /// vzip2[q]_[suf][8,16,32,64]
        ///
        /// A64: ZIP2
        /// </summary>
        public static Vector64<sbyte>  ZipHigh (Vector64<sbyte>  left, Vector64<sbyte>  right) => { throw null };
        public static Vector64<short>  ZipHigh (Vector64<short>  left, Vector64<short>  right) => { throw null };
        public static Vector64<int>    ZipHigh (Vector64<int>    left, Vector64<int>    right) => { throw null };
        public static Vector64<byte>   ZipHigh (Vector64<byte>   left, Vector64<byte>   right) => { throw null };
        public static Vector64<ushort> ZipHigh (Vector64<ushort> left, Vector64<ushort> right) => { throw null };
        public static Vector64<uint>   ZipHigh (Vector64<uint>   left, Vector64<uint>   right) => { throw null };
        public static Vector64<float>  ZipHigh (Vector64<float>  left, Vector64<float>  right) => { throw null };

        public static Vector128<sbyte>  ZipHigh (Vector128<sbyte>  left, Vector128<sbyte>  right) => { throw null };
        public static Vector128<short>  ZipHigh (Vector128<short>  left, Vector128<short>  right) => { throw null };
        public static Vector128<int>    ZipHigh (Vector128<int>    left, Vector128<int>    right) => { throw null };
        public static Vector128<long>   ZipHigh (Vector128<long>   left, Vector128<long>   right) => { throw null };
        public static Vector128<byte>   ZipHigh (Vector128<byte>   left, Vector128<byte>   right) => { throw null };
        public static Vector128<ushort> ZipHigh (Vector128<ushort> left, Vector128<ushort> right) => { throw null };
        public static Vector128<uint>   ZipHigh (Vector128<uint>   left, Vector128<uint>   right) => { throw null };
        public static Vector128<ulong>  ZipHigh (Vector128<ulong>  left, Vector128<ulong>  right) => { throw null };
        public static Vector128<float>  ZipHigh (Vector128<float>  left, Vector128<float>  right) => { throw null };
        public static Vector128<double> ZipHigh (Vector128<double> left, Vector128<double> right) => { throw null };

        /// <summary>
        /// vtrn1[q]_[suf][8,16,32,64]
        ///
        /// A64: TRN1
        /// </summary>
        public static Vector64<sbyte>  TransposeEven (Vector64<sbyte>  left, Vector64<sbyte>  right) => { throw null };
        public static Vector64<short>  TransposeEven (Vector64<short>  left, Vector64<short>  right) => { throw null };
        public static Vector64<int>    TransposeEven (Vector64<int>    left, Vector64<int>    right) => { throw null };
        public static Vector64<byte>   TransposeEven (Vector64<byte>   left, Vector64<byte>   right) => { throw null };
        public static Vector64<ushort> TransposeEven (Vector64<ushort> left, Vector64<ushort> right) => { throw null };
        public static Vector64<uint>   TransposeEven (Vector64<uint>   left, Vector64<uint>   right) => { throw null };
        public static Vector64<float>  TransposeEven (Vector64<float>  left, Vector64<float>  right) => { throw null };

        public static Vector128<sbyte>  TransposeEven (Vector128<sbyte>  left, Vector128<sbyte>  right) => { throw null };
        public static Vector128<short>  TransposeEven (Vector128<short>  left, Vector128<short>  right) => { throw null };
        public static Vector128<int>    TransposeEven (Vector128<int>    left, Vector128<int>    right) => { throw null };
        public static Vector128<long>   TransposeEven (Vector128<long>   left, Vector128<long>   right) => { throw null };
        public static Vector128<byte>   TransposeEven (Vector128<byte>   left, Vector128<byte>   right) => { throw null };
        public static Vector128<ushort> TransposeEven (Vector128<ushort> left, Vector128<ushort> right) => { throw null };
        public static Vector128<uint>   TransposeEven (Vector128<uint>   left, Vector128<uint>   right) => { throw null };
        public static Vector128<ulong>  TransposeEven (Vector128<ulong>  left, Vector128<ulong>  right) => { throw null };
        public static Vector128<float>  TransposeEven (Vector128<float>  left, Vector128<float>  right) => { throw null };
        public static Vector128<double> TransposeEven (Vector128<double> left, Vector128<double> right) => { throw null };

        /// <summary>
        /// vtrn2[q]_[suf][8,16,32,64]
        ///
        /// A64: TRN2
        /// </summary>
        public static Vector64<sbyte>  TransposeOdd (Vector64<sbyte>  left, Vector64<sbyte>  right) => { throw null };
        public static Vector64<short>  TransposeOdd (Vector64<short>  left, Vector64<short>  right) => { throw null };
        public static Vector64<int>    TransposeOdd (Vector64<int>    left, Vector64<int>    right) => { throw null };
        public static Vector64<byte>   TransposeOdd (Vector64<byte>   left, Vector64<byte>   right) => { throw null };
        public static Vector64<ushort> TransposeOdd (Vector64<ushort> left, Vector64<ushort> right) => { throw null };
        public static Vector64<uint>   TransposeOdd (Vector64<uint>   left, Vector64<uint>   right) => { throw null };
        public static Vector64<float>  TransposeOdd (Vector64<float>  left, Vector64<float>  right) => { throw null };

        public static Vector128<sbyte>  TransposeOdd (Vector128<sbyte>  left, Vector128<sbyte>  right) => { throw null };
        public static Vector128<short>  TransposeOdd (Vector128<short>  left, Vector128<short>  right) => { throw null };
        public static Vector128<int>    TransposeOdd (Vector128<int>    left, Vector128<int>    right) => { throw null };
        public static Vector128<long>   TransposeOdd (Vector128<long>   left, Vector128<long>   right) => { throw null };
        public static Vector128<byte>   TransposeOdd (Vector128<byte>   left, Vector128<byte>   right) => { throw null };
        public static Vector128<ushort> TransposeOdd (Vector128<ushort> left, Vector128<ushort> right) => { throw null };
        public static Vector128<uint>   TransposeOdd (Vector128<uint>   left, Vector128<uint>   right) => { throw null };
        public static Vector128<ulong>  TransposeOdd (Vector128<ulong>  left, Vector128<ulong>  right) => { throw null };
        public static Vector128<float>  TransposeOdd (Vector128<float>  left, Vector128<float>  right) => { throw null };
        public static Vector128<double> TransposeOdd (Vector128<double> left, Vector128<double> right) => { throw null };
        }
  }
}
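For readers unfamiliar with the shift-and-insert instructions, the following is a minimal scalar sketch of what SLI and SRI compute on a single 64-bit element. It is a reference model under stated assumptions, not the proposed implementation: the `ShiftInsertModel` class name is hypothetical, `left` is assumed to map to the destination operand whose untouched bits are preserved, and `right` is assumed to be the operand that gets shifted. The accepted shift ranges (0 to 63 for SLI, 1 to 64 for SRI) follow the instruction encodings.

```csharp
using System;

// Scalar reference model only; it does not call any hardware intrinsic.
public static class ShiftInsertModel
{
    // Models SLI on one 64-bit lane: shift `right` left and keep the
    // low `shift` bits of `left`.
    public static ulong LeftShiftAndInsert(ulong left, ulong right, int shift)
    {
        if (shift < 0 || shift > 63)
            throw new ArgumentOutOfRangeException(nameof(shift));

        ulong keepMask = (1UL << shift) - 1; // low `shift` bits survive from left
        return (right << shift) | (left & keepMask);
    }

    // Models SRI on one 64-bit lane: shift `right` right and keep the
    // high `shift` bits of `left`.
    public static ulong RightShiftAndInsert(ulong left, ulong right, int shift)
    {
        if (shift < 1 || shift > 64)
            throw new ArgumentOutOfRangeException(nameof(shift));

        // C# masks 64-bit shift counts to 6 bits, so a shift of 64 is
        // handled explicitly: nothing is inserted and left is unchanged.
        if (shift == 64)
            return left;

        ulong keepMask = ~(ulong.MaxValue >> shift); // high `shift` bits survive from left
        return (right >> shift) | (left & keepMask);
    }
}
```

For example, `LeftShiftAndInsert(ulong.MaxValue, 1, 4)` yields `0x1F`: the shifted value contributes bit 4 and the destination keeps its low four bits.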

cc @tannergooding @CarolEidt @echesakovMSFT

@tannergooding
Member

Thanks @TamarChristinaArm.

I've added this to the list of APIs to cover tomorrow.

@msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@maryamariyan added the untriaged (New issue has not been triaged by the area owner) label Feb 23, 2020
@tannergooding removed the untriaged (New issue has not been triaged by the area owner) label Mar 3, 2020
@terrajobst added the api-approved (API was approved in API review, it can be implemented) label and removed the api-ready-for-review label Mar 3, 2020
@terrajobst
Member

terrajobst commented Mar 3, 2020

Video

  • Confirm that ShiftLeftLogicalAndInsert under ArmBase should move under AdvSimd, but is it 32 bit or 64 bit?
  • Some of the instructions are destructive on ARM32 and non-destructive on ARM64. Should we split these up into a 32-bit version with a different shape?
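As a rough illustration of the "different shape" question, the sketch below models the unzip and transpose shuffles on plain arrays in two forms: the single-result form matching A64 UZP1/UZP2 and TRN1/TRN2, and a pair-returning form closer to what the ARM32 instructions produce. Everything here is an assumption made for illustration: the `PermuteModel` class and the tuple-returning `UnzipBoth` and `TransposeBoth` helpers are hypothetical and do not appear in the reviewed API.

```csharp
using System;

// Scalar reference model only; it does not call any hardware intrinsic.
public static class PermuteModel
{
    // A64-style shape: one result per call.
    public static T[] UnzipEven<T>(T[] lower, T[] upper) => Unzip(lower, upper, 0);
    public static T[] UnzipOdd<T>(T[] lower, T[] upper) => Unzip(lower, upper, 1);
    public static T[] TransposeEven<T>(T[] left, T[] right) => Transpose(left, right, 0);
    public static T[] TransposeOdd<T>(T[] left, T[] right) => Transpose(left, right, 1);

    // Hypothetical ARM32-style shape: both results from one call.
    public static (T[] Even, T[] Odd) UnzipBoth<T>(T[] lower, T[] upper)
        => (UnzipEven(lower, upper), UnzipOdd(lower, upper));
    public static (T[] Even, T[] Odd) TransposeBoth<T>(T[] left, T[] right)
        => (TransposeEven(left, right), TransposeOdd(left, right));

    // Picks every second element of the concatenation lower:upper.
    private static T[] Unzip<T>(T[] lower, T[] upper, int start)
    {
        Check(lower, upper);
        int n = lower.Length;
        var result = new T[n];
        for (int i = 0; i < n; i++)
        {
            int src = 2 * i + start;
            result[i] = src < n ? lower[src] : upper[src - n];
        }
        return result;
    }

    // Pairs up the even (or odd) elements of left and right.
    private static T[] Transpose<T>(T[] left, T[] right, int start)
    {
        Check(left, right);
        var result = new T[left.Length];
        for (int i = 0; i < left.Length / 2; i++)
        {
            result[2 * i] = left[2 * i + start];
            result[2 * i + 1] = right[2 * i + start];
        }
        return result;
    }

    private static void Check<T>(T[] a, T[] b)
    {
        if (a.Length != b.Length || a.Length % 2 != 0)
            throw new ArgumentException("Inputs must have the same, even length.");
    }
}
```

With inputs `{0, 1, 2, 3}` and `{4, 5, 6, 7}`, `UnzipEven` gives `{0, 2, 4, 6}`, `UnzipOdd` gives `{1, 3, 5, 7}`, `TransposeEven` gives `{0, 4, 2, 6}` and `TransposeOdd` gives `{1, 5, 3, 7}`. The pair-returning helpers simply bundle the two results, which is the trade-off the review question is weighing.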
API
namespace System.Runtime.Intrinsics.Arm
{
    public partial class ArmBase
    {
        /// <summary>
        /// vslid_n_[su]64
        ///
        /// A64: SLI
        /// A32: VSLI
        /// </summary>
        public static Vector64<long>  ShiftLeftLogicalAndInsertScalar(Vector64<long>  left, Vector64<long>  right, byte shift);
        public static Vector64<ulong> ShiftLeftLogicalAndInsertScalar(Vector64<ulong> left, Vector64<ulong> right, byte shift);

        /// <summary>
        /// vsrid_n_[su]64
        ///
        /// A64: SRI
        /// A32: VSRI
        /// </summary>
        public static Vector64<long>  ShiftRightLogicalAndInsertScalar(Vector64<long>  left, Vector64<long>  right, byte shift);
        public static Vector64<ulong> ShiftRightLogicalAndInsertScalar(Vector64<ulong> left, Vector64<ulong> right, byte shift);
    }
    public partial class AdvSimd
    {
        /// <summary>
        /// vsli[q]_n_[su][8,16,32,64]
        ///
        /// A64: SLI
        /// A32: VSLI
        /// </summary>
        public static Vector64<byte>   ShiftLeftLogicalAndInsert(Vector64<byte>   left, Vector64<byte>   right, byte shift);
        public static Vector64<ushort> ShiftLeftLogicalAndInsert(Vector64<ushort> left, Vector64<ushort> right, byte shift);
        public static Vector64<uint>   ShiftLeftLogicalAndInsert(Vector64<uint>   left, Vector64<uint>   right, byte shift);
        public static Vector64<sbyte>  ShiftLeftLogicalAndInsert(Vector64<sbyte>  left, Vector64<sbyte>  right, byte shift);
        public static Vector64<short>  ShiftLeftLogicalAndInsert(Vector64<short>  left, Vector64<short>  right, byte shift);
        public static Vector64<int>    ShiftLeftLogicalAndInsert(Vector64<int>    left, Vector64<int>    right, byte shift);

        public static Vector128<byte>   ShiftLeftLogicalAndInsert(Vector128<byte>   left, Vector128<byte>   right, byte shift);
        public static Vector128<ushort> ShiftLeftLogicalAndInsert(Vector128<ushort> left, Vector128<ushort> right, byte shift);
        public static Vector128<uint>   ShiftLeftLogicalAndInsert(Vector128<uint>   left, Vector128<uint>   right, byte shift);
        public static Vector128<ulong>  ShiftLeftLogicalAndInsert(Vector128<ulong>  left, Vector128<ulong>  right, byte shift);
        public static Vector128<sbyte>  ShiftLeftLogicalAndInsert(Vector128<sbyte>  left, Vector128<sbyte>  right, byte shift);
        public static Vector128<short>  ShiftLeftLogicalAndInsert(Vector128<short>  left, Vector128<short>  right, byte shift);
        public static Vector128<int>    ShiftLeftLogicalAndInsert(Vector128<int>    left, Vector128<int>    right, byte shift);
        public static Vector128<long>   ShiftLeftLogicalAndInsert(Vector128<long>   left, Vector128<long>   right, byte shift);

        /// <summary>
        /// vsri[q]_n_[su][8,16,32,64]
        ///
        /// A64: SRI
        /// A32: VSRI
        /// </summary>
        public static Vector64<byte>   ShiftRightAndInsert(Vector64<byte>   left, Vector64<byte>   right, byte shift);
        public static Vector64<ushort> ShiftRightAndInsert(Vector64<ushort> left, Vector64<ushort> right, byte shift);
        public static Vector64<uint>   ShiftRightAndInsert(Vector64<uint>   left, Vector64<uint>   right, byte shift);
        public static Vector64<sbyte>  ShiftRightAndInsert(Vector64<sbyte>  left, Vector64<sbyte>  right, byte shift);
        public static Vector64<short>  ShiftRightAndInsert(Vector64<short>  left, Vector64<short>  right, byte shift);
        public static Vector64<int>    ShiftRightAndInsert(Vector64<int>    left, Vector64<int>    right, byte shift);

        public static Vector128<byte>   ShiftRightAndInsert(Vector128<byte>   left, Vector128<byte>   right, byte shift);
        public static Vector128<ushort> ShiftRightAndInsert(Vector128<ushort> left, Vector128<ushort> right, byte shift);
        public static Vector128<uint>   ShiftRightAndInsert(Vector128<uint>   left, Vector128<uint>   right, byte shift);
        public static Vector128<ulong>  ShiftRightAndInsert(Vector128<ulong>  left, Vector128<ulong>  right, byte shift);
        public static Vector128<sbyte>  ShiftRightAndInsert(Vector128<sbyte>  left, Vector128<sbyte>  right, byte shift);
        public static Vector128<short>  ShiftRightAndInsert(Vector128<short>  left, Vector128<short>  right, byte shift);
        public static Vector128<int>    ShiftRightAndInsert(Vector128<int>    left, Vector128<int>    right, byte shift);
        public static Vector128<long>   ShiftRightAndInsert(Vector128<long>   left, Vector128<long>   right, byte shift);

        /// <summary>
        /// vmovn_[su][16,32,64]
        ///
        /// A64: XTN
        /// A32: VMOVN
        /// </summary>
        public static Vector64<sbyte>  ExtractAndNarrowLow(Vector128<short>  value);
        public static Vector64<short>  ExtractAndNarrowLow(Vector128<int>    value);
        public static Vector64<int>    ExtractAndNarrowLow(Vector128<long>   value);
        public static Vector64<byte>   ExtractAndNarrowLow(Vector128<ushort> value);
        public static Vector64<ushort> ExtractAndNarrowLow(Vector128<uint>   value);
        public static Vector64<uint>   ExtractAndNarrowLow(Vector128<ulong>  value);

        /// <summary>
        /// vmovn_high_[su][16,32,64]
        ///
        /// A64: XTN2
        /// A32: VMOVN
        /// </summary>
        public static Vector128<sbyte>  ExtractAndNarrowHigh(Vector64<sbyte>  lower, Vector128<short>  value);
        public static Vector128<short>  ExtractAndNarrowHigh(Vector64<short>  lower, Vector128<int>    value);
        public static Vector128<int>    ExtractAndNarrowHigh(Vector64<int>    lower, Vector128<long>   value);
        public static Vector128<byte>   ExtractAndNarrowHigh(Vector64<byte>   lower, Vector128<ushort> value);
        public static Vector128<ushort> ExtractAndNarrowHigh(Vector64<ushort> lower, Vector128<uint>   value);
        public static Vector128<uint>   ExtractAndNarrowHigh(Vector64<uint>   lower, Vector128<ulong>  value);

        public partial class Arm64
        {
            /// <summary>
            /// vuzp1[q]_[suf][8,16,32,64]
            ///
            /// A64: UZP1
            /// </summary>
            public static Vector64<sbyte>  UnzipEven(Vector64<sbyte>  lower, Vector64<sbyte>  upper);
            public static Vector64<short>  UnzipEven(Vector64<short>  lower, Vector64<short>  upper);
            public static Vector64<int>    UnzipEven(Vector64<int>    lower, Vector64<int>    upper);
            public static Vector64<byte>   UnzipEven(Vector64<byte>   lower, Vector64<byte>   upper);
            public static Vector64<ushort> UnzipEven(Vector64<ushort> lower, Vector64<ushort> upper);
            public static Vector64<uint>   UnzipEven(Vector64<uint>   lower, Vector64<uint>   upper);
            public static Vector64<float>  UnzipEven(Vector64<float>  lower, Vector64<float>  upper);

            public static Vector128<sbyte>  UnzipEven(Vector128<sbyte>  lower, Vector128<sbyte>  upper);
            public static Vector128<short>  UnzipEven(Vector128<short>  lower, Vector128<short>  upper);
            public static Vector128<int>    UnzipEven(Vector128<int>    lower, Vector128<int>    upper);
            public static Vector128<long>   UnzipEven(Vector128<long>   lower, Vector128<long>   upper);
            public static Vector128<byte>   UnzipEven(Vector128<byte>   lower, Vector128<byte>   upper);
            public static Vector128<ushort> UnzipEven(Vector128<ushort> lower, Vector128<ushort> upper);
            public static Vector128<uint>   UnzipEven(Vector128<uint>   lower, Vector128<uint>   upper);
            public static Vector128<ulong>  UnzipEven(Vector128<ulong>  lower, Vector128<ulong>  upper);
            public static Vector128<float>  UnzipEven(Vector128<float>  lower, Vector128<float>  upper);
            public static Vector128<double> UnzipEven(Vector128<double> lower, Vector128<double> upper);

            /// <summary>
            /// vuzp2[q]_[suf][8,16,32,64]
            ///
            /// A64: UZP2
            /// </summary>
            public static Vector64<sbyte>  UnzipOdd(Vector64<sbyte>  lower, Vector64<sbyte>  upper);
            public static Vector64<short>  UnzipOdd(Vector64<short>  lower, Vector64<short>  upper);
            public static Vector64<int>    UnzipOdd(Vector64<int>    lower, Vector64<int>    upper);
            public static Vector64<byte>   UnzipOdd(Vector64<byte>   lower, Vector64<byte>   upper);
            public static Vector64<ushort> UnzipOdd(Vector64<ushort> lower, Vector64<ushort> upper);
            public static Vector64<uint>   UnzipOdd(Vector64<uint>   lower, Vector64<uint>   upper);
            public static Vector64<float>  UnzipOdd(Vector64<float>  lower, Vector64<float>  upper);

            public static Vector128<sbyte>  UnzipOdd(Vector128<sbyte>  lower, Vector128<sbyte>  upper);
            public static Vector128<short>  UnzipOdd(Vector128<short>  lower, Vector128<short>  upper);
            public static Vector128<int>    UnzipOdd(Vector128<int>    lower, Vector128<int>    upper);
            public static Vector128<long>   UnzipOdd(Vector128<long>   lower, Vector128<long>   upper);
            public static Vector128<byte>   UnzipOdd(Vector128<byte>   lower, Vector128<byte>   upper);
            public static Vector128<ushort> UnzipOdd(Vector128<ushort> lower, Vector128<ushort> upper);
            public static Vector128<uint>   UnzipOdd(Vector128<uint>   lower, Vector128<uint>   upper);
            public static Vector128<ulong>  UnzipOdd(Vector128<ulong>  lower, Vector128<ulong>  upper);
            public static Vector128<float>  UnzipOdd(Vector128<float>  lower, Vector128<float>  upper);
            public static Vector128<double> UnzipOdd(Vector128<double> lower, Vector128<double> upper);

            /// <summary>
            /// vzip1[q]_[suf][8,16,32,64]
            ///
            /// A64: ZIP1
            /// </summary>
            public static Vector64<sbyte>  ZipLow(Vector64<sbyte>  left, Vector64<sbyte>  right);
            public static Vector64<short>  ZipLow(Vector64<short>  left, Vector64<short>  right);
            public static Vector64<int>    ZipLow(Vector64<int>    left, Vector64<int>    right);
            public static Vector64<byte>   ZipLow(Vector64<byte>   left, Vector64<byte>   right);
            public static Vector64<ushort> ZipLow(Vector64<ushort> left, Vector64<ushort> right);
            public static Vector64<uint>   ZipLow(Vector64<uint>   left, Vector64<uint>   right);
            public static Vector64<float>  ZipLow(Vector64<float>  left, Vector64<float>  right);

            public static Vector128<sbyte>  ZipLow(Vector128<sbyte>  left, Vector128<sbyte>  right);
            public static Vector128<short>  ZipLow(Vector128<short>  left, Vector128<short>  right);
            public static Vector128<int>    ZipLow(Vector128<int>    left, Vector128<int>    right);
            public static Vector128<long>   ZipLow(Vector128<long>   left, Vector128<long>   right);
            public static Vector128<byte>   ZipLow(Vector128<byte>   left, Vector128<byte>   right);
            public static Vector128<ushort> ZipLow(Vector128<ushort> left, Vector128<ushort> right);
            public static Vector128<uint>   ZipLow(Vector128<uint>   left, Vector128<uint>   right);
            public static Vector128<ulong>  ZipLow(Vector128<ulong>  left, Vector128<ulong>  right);
            public static Vector128<float>  ZipLow(Vector128<float>  left, Vector128<float>  right);
            public static Vector128<double> ZipLow(Vector128<double> left, Vector128<double> right);

            /// <summary>
            /// vzip2[q]_[suf][8,16,32,64]
            ///
            /// A64: ZIP2
            /// </summary>
            public static Vector64<sbyte>  ZipHigh(Vector64<sbyte>  left, Vector64<sbyte>  right);
            public static Vector64<short>  ZipHigh(Vector64<short>  left, Vector64<short>  right);
            public static Vector64<int>    ZipHigh(Vector64<int>    left, Vector64<int>    right);
            public static Vector64<byte>   ZipHigh(Vector64<byte>   left, Vector64<byte>   right);
            public static Vector64<ushort> ZipHigh(Vector64<ushort> left, Vector64<ushort> right);
            public static Vector64<uint>   ZipHigh(Vector64<uint>   left, Vector64<uint>   right);
            public static Vector64<float>  ZipHigh(Vector64<float>  left, Vector64<float>  right);

            public static Vector128<sbyte>  ZipHigh(Vector128<sbyte>  left, Vector128<sbyte>  right);
            public static Vector128<short>  ZipHigh(Vector128<short>  left, Vector128<short>  right);
            public static Vector128<int>    ZipHigh(Vector128<int>    left, Vector128<int>    right);
            public static Vector128<long>   ZipHigh(Vector128<long>   left, Vector128<long>   right);
            public static Vector128<byte>   ZipHigh(Vector128<byte>   left, Vector128<byte>   right);
            public static Vector128<ushort> ZipHigh(Vector128<ushort> left, Vector128<ushort> right);
            public static Vector128<uint>   ZipHigh(Vector128<uint>   left, Vector128<uint>   right);
            public static Vector128<ulong>  ZipHigh(Vector128<ulong>  left, Vector128<ulong>  right);
            public static Vector128<float>  ZipHigh(Vector128<float>  left, Vector128<float>  right);
            public static Vector128<double> ZipHigh(Vector128<double> left, Vector128<double> right);

            /// <summary>
            /// vtrn1[q]_[suf][8,16,32,64]
            ///
            /// A64: TRN1
            /// </summary>
            public static Vector64<sbyte>  TransposeEven(Vector64<sbyte>  left, Vector64<sbyte>  right);
            public static Vector64<short>  TransposeEven(Vector64<short>  left, Vector64<short>  right);
            public static Vector64<int>    TransposeEven(Vector64<int>    left, Vector64<int>    right);
            public static Vector64<byte>   TransposeEven(Vector64<byte>   left, Vector64<byte>   right);
            public static Vector64<ushort> TransposeEven(Vector64<ushort> left, Vector64<ushort> right);
            public static Vector64<uint>   TransposeEven(Vector64<uint>   left, Vector64<uint>   right);
            public static Vector64<float>  TransposeEven(Vector64<float>  left, Vector64<float>  right);

            public static Vector128<sbyte>  TransposeEven(Vector128<sbyte>  left, Vector128<sbyte>  right);
            public static Vector128<short>  TransposeEven(Vector128<short>  left, Vector128<short>  right);
            public static Vector128<int>    TransposeEven(Vector128<int>    left, Vector128<int>    right);
            public static Vector128<long>   TransposeEven(Vector128<long>   left, Vector128<long>   right);
            public static Vector128<byte>   TransposeEven(Vector128<byte>   left, Vector128<byte>   right);
            public static Vector128<ushort> TransposeEven(Vector128<ushort> left, Vector128<ushort> right);
            public static Vector128<uint>   TransposeEven(Vector128<uint>   left, Vector128<uint>   right);
            public static Vector128<ulong>  TransposeEven(Vector128<ulong>  left, Vector128<ulong>  right);
            public static Vector128<float>  TransposeEven(Vector128<float>  left, Vector128<float>  right);
            public static Vector128<double> TransposeEven(Vector128<double> left, Vector128<double> right);

            /// <summary>
            /// vtrn2[q]_[suf][8,16,32,64]
            ///
            /// A64: TRN2
            /// </summary>
            public static Vector64<sbyte>  TransposeOdd(Vector64<sbyte>  left, Vector64<sbyte>  right);
            public static Vector64<short>  TransposeOdd(Vector64<short>  left, Vector64<short>  right);
            public static Vector64<int>    TransposeOdd(Vector64<int>    left, Vector64<int>    right);
            public static Vector64<byte>   TransposeOdd(Vector64<byte>   left, Vector64<byte>   right);
            public static Vector64<ushort> TransposeOdd(Vector64<ushort> left, Vector64<ushort> right);
            public static Vector64<uint>   TransposeOdd(Vector64<uint>   left, Vector64<uint>   right);
            public static Vector64<float>  TransposeOdd(Vector64<float>  left, Vector64<float>  right);

            public static Vector128<sbyte>  TransposeOdd(Vector128<sbyte>  left, Vector128<sbyte>  right);
            public static Vector128<short>  TransposeOdd(Vector128<short>  left, Vector128<short>  right);
            public static Vector128<int>    TransposeOdd(Vector128<int>    left, Vector128<int>    right);
            public static Vector128<long>   TransposeOdd(Vector128<long>   left, Vector128<long>   right);
            public static Vector128<byte>   TransposeOdd(Vector128<byte>   left, Vector128<byte>   right);
            public static Vector128<ushort> TransposeOdd(Vector128<ushort> left, Vector128<ushort> right);
            public static Vector128<uint>   TransposeOdd(Vector128<uint>   left, Vector128<uint>   right);
            public static Vector128<ulong>  TransposeOdd(Vector128<ulong>  left, Vector128<ulong>  right);
            public static Vector128<float>  TransposeOdd(Vector128<float>  left, Vector128<float>  right);
            public static Vector128<double> TransposeOdd(Vector128<double> left, Vector128<double> right);
        }
    }
}
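
For readers less familiar with the A64 permutes, here is a scalar sketch of the lane selection performed by UZP1/UZP2, ZIP1/ZIP2 and TRN1/TRN2, following the Arm instruction descriptions. PermuteModel and its methods are hypothetical names for illustration, plain arrays stand in for vector lanes, and an even lane count is assumed; none of this is part of the proposed API.

```csharp
using System;

// Hypothetical scalar model of the A64 permutes; arrays stand in for vector lanes.
public static class PermuteModel
{
    // UZP1 (part = 0) / UZP2 (part = 1): even or odd lanes of lower, then of upper.
    public static T[] Unzip<T>(T[] lower, T[] upper, int part)
    {
        int n = lower.Length;
        var result = new T[n];
        for (int i = 0; i < n / 2; i++)
        {
            result[i] = lower[2 * i + part];
            result[n / 2 + i] = upper[2 * i + part];
        }
        return result;
    }

    // ZIP1 (part = 0) / ZIP2 (part = 1): interleave the low or high halves.
    public static T[] Zip<T>(T[] left, T[] right, int part)
    {
        int n = left.Length;
        int start = part * (n / 2);
        var result = new T[n];
        for (int i = 0; i < n / 2; i++)
        {
            result[2 * i] = left[start + i];
            result[2 * i + 1] = right[start + i];
        }
        return result;
    }

    // TRN1 (part = 0) / TRN2 (part = 1): pair up the even or odd lanes.
    public static T[] Transpose<T>(T[] left, T[] right, int part)
    {
        int n = left.Length;
        var result = new T[n];
        for (int i = 0; i < n / 2; i++)
        {
            result[2 * i] = left[2 * i + part];
            result[2 * i + 1] = right[2 * i + part];
        }
        return result;
    }
}
```

With left = {0, 1, 2, 3} and right = {10, 11, 12, 13}, Unzip(left, right, 0) gives {0, 2, 10, 12}, Zip(left, right, 0) gives {0, 10, 1, 11} and Transpose(left, right, 1) gives {1, 11, 3, 13}.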

@tannergooding
Member

Confirm that ShiftLeftLogicalAndInsert under ArmBase should move under AdvSimd, but is it 32 bit or 64 bit?

@TamarChristinaArm, looking at the encoding + decoding for VSLI, it seems to imply that L=1 is valid and therefore esize is allowed to be 64, so this applies to ARM32 as well. Is that your understanding?

Specifically, the Applies when !(imm6 == 000xxx && L == 0) && Q = ? is stating it applies to everything but L:imm6 == 0000xxx, which is then reiterated in the Decode for all variants of this encoding.

@TamarChristinaArm
Contributor Author

@tannergooding Yeah I believe that's correct, I had missed the ! the first time around.

@TamarChristinaArm
Contributor Author

@tannergooding @echesakovMSFT I'm wondering about ShiftLeftLogicalAndInsert. The shift amount needs to be a compile-time constant. Any idea how I should handle that?

@tannergooding
Member

Non-constant inputs are handled by falling back to a call that contains a jump table covering all possible cases (between 1 and 256 of them); that's part of the reason the APIs are recursive.

Intrinsics which take an 8-bit immediate are marked as HW_Category_IMM. There are then some special modifier flags that indicate how they are handled in codegen:

  • HW_Flag_MaybeIMM - Indicates there is another overload which isn't HW_Category_IMM
  • HW_Flag_NoJmpTableIMM - Indicates that a jump table fallback isn't necessary
  • HW_Flag_FullRangeIMM - Indicates that the jump table fallback covers all 256 possible cases

The important methods for importation are then:

The importation logic currently assumes the "immediate" is the last operand in the list: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsic.cpp#L648. It then uses the above methods to determine how to handle things. This includes things like directly expanding the relevant intrinsic, generating an equivalent intrinsic fallback, or falling back to a method call. If it falls back to the method call it will see the method is recursively calling itself and force expansion and the node will carry a non-constant input through to codegen. It will also insert the relevant range check if one is needed.

An example of how this is handled in codegen for x86 is here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsiccodegenxarch.cpp#L227-L254.
Once we get to codegen, if the operand is a constant we can just emit the instruction. If it isn't, we call genHWIntrinsicJumpTableFallback, which emits the necessary jump table.

@tannergooding
Member

tannergooding commented Apr 15, 2020

This ultimately allows things like reflection or debugging to "just work" and the expectation is users won't use it in actual perf-critical paths. Users are expected to be profiling this code so they should catch the issue relatively quickly if one does exist.

There are a few cases that this won't catch (inputs that the JIT will eventually determine to be constant but which aren't "constant" during importation), and we have issues tracking improvements there: #9989 and #11062. Ideally we would delay the decision to be a "fallback method call" until later in the pipeline (such as lowering).

We already have some support for doing that with existing GT_INTRINSIC nodes and we do it in Rationalization: #11062 (comment), but expanding it for GT_HWINTRINSIC nodes is a bit tougher and hasn't been done yet.
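
The fallback described above can be modeled in plain C#. This is only a sketch under assumptions: JumpTableFallbackModel is a hypothetical name, a delegate table stands in for the jump table the JIT emits, and a left shift on a single byte stands in for an instruction whose last operand must be encoded as an immediate.

```csharp
using System;

// Hypothetical scalar model, not JIT code. Each table entry plays the role of one
// copy of the instruction with a different immediate encoded in it.
public static class JumpTableFallbackModel
{
    // One entry per legal immediate (0..7 for a shift on an 8-bit value).
    private static readonly Func<byte, byte>[] Table = BuildTable(8);

    private static Func<byte, byte>[] BuildTable(int count)
    {
        var table = new Func<byte, byte>[count];
        for (int imm = 0; imm < count; imm++)
        {
            int captured = imm; // per-entry constant, like an encoded immediate
            table[imm] = value => unchecked((byte)(value << captured));
        }
        return table;
    }

    // Non-constant path: range check first, then indexed dispatch.
    public static byte ShiftLeft(byte value, byte shift)
    {
        if (shift >= Table.Length)
            throw new ArgumentOutOfRangeException(nameof(shift));
        return Table[shift](value);
    }
}
```

A call such as JumpTableFallbackModel.ShiftLeft(1, 3) goes through the range check and the table on every invocation, which is the cost a constant shift avoids.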

@TamarChristinaArm
Contributor Author

Awesome thanks, I'll get on those then :)

@echesakov
Contributor

> Awesome thanks, I'll get on those then :)

@TamarChristinaArm I am working at the moment on supporting intrinsic immediate operands on arm64 - I needed this for Extract, Insert and ExtractVector64/128 intrinsics - will have a PR soon

@TamarChristinaArm
Contributor Author

@echesakovMSFT Ah great, I'll do the single register TBL in the meantime then.

@tannergooding
Member

The following APIs are yet to be implemented:

namespace System.Runtime.Intrinsics.Arm
{
    public partial class ArmBase
    {
        /// <summary>
        /// vslid_n_[su]64
        ///
        /// A64: SLI
        /// A32: VSLI
        /// </summary>
        public static Vector64<long>  ShiftLeftLogicalAndInsertScalar(Vector64<long>  left, Vector64<long>  right, byte shift);
        public static Vector64<ulong> ShiftLeftLogicalAndInsertScalar(Vector64<ulong> left, Vector64<ulong> right, byte shift);

        /// <summary>
        /// vsrid_n_[su]64
        ///
        /// A64: SRI
        /// A32: VSRI
        /// </summary>
        public static Vector64<long>  ShiftRightLogicalAndInsertScalar(Vector64<long>  left, Vector64<long>  right, byte shift);
        public static Vector64<ulong> ShiftRightLogicalAndInsertScalar(Vector64<ulong> left, Vector64<ulong> right, byte shift);
    }
    public partial class AdvSimd
    {
        /// <summary>
        /// vsli[q]_n_[su][8,16,32,64]
        ///
        /// A64: SLI
        /// A32: VSLI
        /// </summary>
        public static Vector64<byte>   ShiftLeftLogicalAndInsert(Vector64<byte>   left, Vector64<byte>   right, byte shift);
        public static Vector64<ushort> ShiftLeftLogicalAndInsert(Vector64<ushort> left, Vector64<ushort> right, byte shift);
        public static Vector64<uint>   ShiftLeftLogicalAndInsert(Vector64<uint>   left, Vector64<uint>   right, byte shift);
        public static Vector64<sbyte>  ShiftLeftLogicalAndInsert(Vector64<sbyte>  left, Vector64<sbyte>  right, byte shift);
        public static Vector64<short>  ShiftLeftLogicalAndInsert(Vector64<short>  left, Vector64<short>  right, byte shift);
        public static Vector64<int>    ShiftLeftLogicalAndInsert(Vector64<int>    left, Vector64<int>    right, byte shift);

        public static Vector128<byte>   ShiftLeftLogicalAndInsert(Vector128<byte>   left, Vector128<byte>   right, byte shift);
        public static Vector128<ushort> ShiftLeftLogicalAndInsert(Vector128<ushort> left, Vector128<ushort> right, byte shift);
        public static Vector128<uint>   ShiftLeftLogicalAndInsert(Vector128<uint>   left, Vector128<uint>   right, byte shift);
        public static Vector128<ulong>  ShiftLeftLogicalAndInsert(Vector128<ulong>  left, Vector128<ulong>  right, byte shift);
        public static Vector128<sbyte>  ShiftLeftLogicalAndInsert(Vector128<sbyte>  left, Vector128<sbyte>  right, byte shift);
        public static Vector128<short>  ShiftLeftLogicalAndInsert(Vector128<short>  left, Vector128<short>  right, byte shift);
        public static Vector128<int>    ShiftLeftLogicalAndInsert(Vector128<int>    left, Vector128<int>    right, byte shift);
        public static Vector128<long>   ShiftLeftLogicalAndInsert(Vector128<long>   left, Vector128<long>   right, byte shift);

        /// <summary>
        /// vsri[q]_n_[su][8,16,32,64]
        ///
        /// A64: SRI
        /// A32: VSRI
        /// </summary>
        public static Vector64<byte>   ShiftRightAndInsert(Vector64<byte>   left, Vector64<byte>   right, byte shift);
        public static Vector64<ushort> ShiftRightAndInsert(Vector64<ushort> left, Vector64<ushort> right, byte shift);
        public static Vector64<uint>   ShiftRightAndInsert(Vector64<uint>   left, Vector64<uint>   right, byte shift);
        public static Vector64<sbyte>  ShiftRightAndInsert(Vector64<sbyte>  left, Vector64<sbyte>  right, byte shift);
        public static Vector64<short>  ShiftRightAndInsert(Vector64<short>  left, Vector64<short>  right, byte shift);
        public static Vector64<int>    ShiftRightAndInsert(Vector64<int>    left, Vector64<int>    right, byte shift);

        public static Vector128<byte>   ShiftRightAndInsert(Vector128<byte>   left, Vector128<byte>   right, byte shift);
        public static Vector128<ushort> ShiftRightAndInsert(Vector128<ushort> left, Vector128<ushort> right, byte shift);
        public static Vector128<uint>   ShiftRightAndInsert(Vector128<uint>   left, Vector128<uint>   right, byte shift);
        public static Vector128<ulong>  ShiftRightAndInsert(Vector128<ulong>  left, Vector128<ulong>  right, byte shift);
        public static Vector128<sbyte>  ShiftRightAndInsert(Vector128<sbyte>  left, Vector128<sbyte>  right, byte shift);
        public static Vector128<short>  ShiftRightAndInsert(Vector128<short>  left, Vector128<short>  right, byte shift);
        public static Vector128<int>    ShiftRightAndInsert(Vector128<int>    left, Vector128<int>    right, byte shift);
        public static Vector128<long>   ShiftRightAndInsert(Vector128<long>   left, Vector128<long>   right, byte shift);
    }
}
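
As a reference for what SLI and SRI compute, here is a scalar sketch for a single 8-bit lane, based on the Arm descriptions of the instructions (for 8-bit elements SLI accepts shifts of 0 to 7 and SRI accepts shifts of 1 to 8). ShiftInsertModel is a hypothetical helper, not part of the proposed API; left is the value being inserted into and right is the value being shifted.

```csharp
using System;

// Hypothetical scalar model of SLI / SRI on one 8-bit lane.
public static class ShiftInsertModel
{
    // SLI: shift 'right' left and keep the low 'shift' bits of 'left'.
    public static byte ShiftLeftAndInsert(byte left, byte right, int shift)
    {
        if (shift < 0 || shift > 7)
            throw new ArgumentOutOfRangeException(nameof(shift));
        int keep = (1 << shift) - 1; // low bits of 'left' that survive
        return unchecked((byte)((right << shift) | (left & keep)));
    }

    // SRI: shift 'right' right (logical) and keep the high 'shift' bits of 'left'.
    public static byte ShiftRightAndInsert(byte left, byte right, int shift)
    {
        if (shift < 1 || shift > 8)
            throw new ArgumentOutOfRangeException(nameof(shift));
        int inserted = 0xFF >> shift; // bit positions written by the shifted 'right'
        return (byte)((right >> shift) | (left & ~inserted & 0xFF));
    }
}
```

For example, ShiftLeftAndInsert(0xAF, 0x03, 4) yields 0x3F (the low nibble of left is kept), and ShiftRightAndInsert(0xAF, 0xC0, 4) yields 0xAC (the high nibble of left is kept).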

@BruceForstall added this to To do general in Hardware Intrinsics via automation Apr 16, 2020
@BruceForstall moved this from To do general to To do arm64 in Hardware Intrinsics Apr 16, 2020
@BruceForstall moved this from To do arm64 to API design in Hardware Intrinsics Apr 16, 2020
@TamarChristinaArm
Contributor Author

@echesakovMSFT I see Extract has been committed. Are the prerequisites I need for this in now, then?

@echesakov
Contributor

@TamarChristinaArm Yes, I think so.

@TamarChristinaArm
Contributor Author

@tannergooding @echesakovMSFT I believe we need to move the entries in ArmBase to AdvSimd right? This PR was created before we discussed this with other intrinsics.

@tannergooding
Member

Yes, I believe so as VSRI requires "Enabling Advanced SIMD and floating-point support", which may not always be possible.

@TamarChristinaArm
Contributor Author

@echesakovMSFT the comment in BuildHWIntrinsic on mayNeedBranchTargetReg confuses me. I'm trying to understand what happens if you pass a non-const immediate to an intrinsic that has no non-const immediate variant. The comment seems to indicate it'll allocate an extra register, but I don't understand why, or what it's trying to do.

@echesakov
Contributor

@TamarChristinaArm Suppose you pass a non-const immediate operand and the instruction has no non-const form, i.e. it does not accept a register operand in place of the immediate. (I can't think of an instruction on Arm64 that has such a register form, but I believe it is the case on x64 for many instructions; look for intrinsics marked as HW_Flag_MaybeIMM in hwintrinsiclistxarch.h.) The JIT still needs to be able to compile the intrinsic, so it generates a "switch" table that conceptually does this:

switch(nonConstImm)
{
case 0: 
  inst Vd, Vn, #0; 
  break;
case 1: 
  inst Vd, Vn, #1; 
  break;
case 2: 
  inst Vd, Vn, #2; 
  break;
case 3:
  inst Vd, Vn, #3;
  break;
}

For example, for ExtractVector128(Vector128<byte>) the code looks as follows:

; Assembly listing for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector128(System.Runtime.Intrinsics.Vector128`1[Byte],System.Runtime.Intrinsics.Vector128`1[Byte],ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  3,  3   )  simd16  ->  [fp+0x20]   HFA(simd16)  do-not-enreg[XS] addr-exposed
;  V01 arg1         [V01    ] (  3,  3   )  simd16  ->  [fp+0x10]   HFA(simd16)  do-not-enreg[XS] addr-exposed
;  V02 arg2         [V02,T00] (  3,  3   )   ubyte  ->   x0        
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V04 cse0         [V04,T01] (  3,  3   )     int  ->   x0         "CSE - aggressive"
;
; Lcl frame size = 32

G_M54180_IG01:
        A9BD7BFD          stp     fp, lr, [sp,#-48]!
        910003FD          mov     fp, sp
        3D800BA0          str     q0, [fp,#32]
        3D8007A1          str     q1, [fp,#16]
						;; bbWeight=1    PerfScore 3.50
G_M54180_IG02:
        3DC00BB0          ldr     q16, [fp,#32]
        3DC007B1          ldr     q17, [fp,#16]
        53001C00          uxtb    w0, w0
        7100401F          cmp     w0, #16
        540004C2          bhs     G_M54180_IG21
        10000061          adr     x1, [G_M54180_IG03]
        8B000C21          add     x1, x1, x0, LSL #3
        D61F0020          br      x1
						;; bbWeight=1    PerfScore 8.50
G_M54180_IG03:
        6E110210          ext     v16.16b, v16.16b, v17.16b, #0
        1400001E          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG04:
        6E110A10          ext     v16.16b, v16.16b, v17.16b, #1
        1400001C          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG05:
        6E111210          ext     v16.16b, v16.16b, v17.16b, #2
        1400001A          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG06:
        6E111A10          ext     v16.16b, v16.16b, v17.16b, #3
        14000018          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG07:
        6E112210          ext     v16.16b, v16.16b, v17.16b, #4
        14000016          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG08:
        6E112A10          ext     v16.16b, v16.16b, v17.16b, #5
        14000014          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG09:
        6E113210          ext     v16.16b, v16.16b, v17.16b, #6
        14000012          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG10:
        6E113A10          ext     v16.16b, v16.16b, v17.16b, #7
        14000010          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG11:
        6E114210          ext     v16.16b, v16.16b, v17.16b, #8
        1400000E          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG12:
        6E114A10          ext     v16.16b, v16.16b, v17.16b, #9
        1400000C          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG13:
        6E115210          ext     v16.16b, v16.16b, v17.16b, #10
        1400000A          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG14:
        6E115A10          ext     v16.16b, v16.16b, v17.16b, #11
        14000008          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG15:
        6E116210          ext     v16.16b, v16.16b, v17.16b, #12
        14000006          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG16:
        6E116A10          ext     v16.16b, v16.16b, v17.16b, #13
        14000004          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG17:
        6E117210          ext     v16.16b, v16.16b, v17.16b, #14
        14000002          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG18:
        6E117A10          ext     v16.16b, v16.16b, v17.16b, #15
						;; bbWeight=1    PerfScore 1.00
G_M54180_IG19:
        4EB01E00          mov     v0.16b, v16.16b
						;; bbWeight=1    PerfScore 0.50
G_M54180_IG20:
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG21:
        97FEEB1E          bl      CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION
        D43E0000          bkpt    
						;; bbWeight=0    PerfScore 0.00

; Total bytes of code 192, prolog size 8, PerfScore 64.70, (MethodHash=08b52c5b) for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector128(System.Runtime.Intrinsics.Vector128`1[Byte],System.Runtime.Intrinsics.Vector128`1[Byte],ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
; ============================================================

This requires allocating x1 as a branch target register.
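
The address computation in G_M54180_IG02 can be sketched as follows. In the listing every case occupies two 4-byte instructions (ext plus b), which is why the index is scaled with LSL #3. BranchTargetModel and its parameters are hypothetical names used only to illustrate the arithmetic.

```csharp
using System;

// Hypothetical model of the indexed branch: adr x1, base; add x1, x1, x0, LSL #3; br x1.
public static class BranchTargetModel
{
    // Each case is an 'ext' followed by a 'b', 4 bytes each.
    public const int BytesPerCase = 8;

    public static long Target(long tableBase, int imm, int caseCount)
    {
        // Mirrors the cmp / bhs pair that guards the table.
        if ((uint)imm >= (uint)caseCount)
            throw new ArgumentOutOfRangeException(nameof(imm));
        return tableBase + ((long)imm << 3); // imm * BytesPerCase
    }
}
```

With a table base of 0x1000 and 16 cases, an immediate of 15 lands at 0x1078, the start of the last case.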

However, for cases where the immediate can only be 0 or 1 (e.g. ExtractVector128(Vector128<double>)), we can branch with cbnz and don't need to allocate the additional register, as in the following example:

; Assembly listing for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector128(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],ubyte):System.Runtime.Intrinsics.Vector128`1[Double]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  3,  3   )  simd16  ->  [fp+0x20]   HFA(simd16)  do-not-enreg[XS] addr-exposed
;  V01 arg1         [V01    ] (  3,  3   )  simd16  ->  [fp+0x10]   HFA(simd16)  do-not-enreg[XS] addr-exposed
;  V02 arg2         [V02,T00] (  3,  3   )   ubyte  ->   x0        
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V04 cse0         [V04,T01] (  3,  3   )     int  ->   x0         "CSE - aggressive"
;
; Lcl frame size = 32

G_M55355_IG01:
        A9BD7BFD          stp     fp, lr, [sp,#-48]!
        910003FD          mov     fp, sp
        3D800BA0          str     q0, [fp,#32]
        3D8007A1          str     q1, [fp,#16]
						;; bbWeight=1    PerfScore 3.50
G_M55355_IG02:
        3DC00BB0          ldr     q16, [fp,#32]
        3DC007B1          ldr     q17, [fp,#16]
        53001C00          uxtb    w0, w0
        7100081F          cmp     w0, #2
        54000102          bhs     G_M55355_IG07
        35000060          cbnz    w0, G_M55355_IG04
						;; bbWeight=1    PerfScore 7.00
G_M55355_IG03:
        6E110210          ext     v16.16b, v16.16b, v17.16b, #0
        14000002          b       G_M55355_IG05
						;; bbWeight=1    PerfScore 2.00
G_M55355_IG04:
        6E114210          ext     v16.16b, v16.16b, v17.16b, #8
						;; bbWeight=1    PerfScore 1.00
G_M55355_IG05:
        4EB01E00          mov     v0.16b, v16.16b
						;; bbWeight=1    PerfScore 0.50
G_M55355_IG06:
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00
G_M55355_IG07:
        97FEEB06          bl      CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION
        D43E0000          bkpt    
						;; bbWeight=0    PerfScore 0.00

; Total bytes of code 72, prolog size 8, PerfScore 23.20, (MethodHash=721027c4) for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector128(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],ubyte):System.Runtime.Intrinsics.Vector128`1[Double]
; ============================================================

In your case (i.e. sri and sli), you will need to mark the intrinsics with HW_Category_IMM, compute an upper bound for the immediate operand in HWIntrinsicInfo::lookupImmUpperBound(), and add two cases to the switch in LinearScan::BuildHWIntrinsic that set needBranchTargetReg = !intrin.op3->isContainedIntOrIImmed();

@echesakov
Contributor

@TamarChristinaArm I hope my explanation helped. If not, I would value any feedback on how I can rephrase this comment to make it clearer. Also feel free to ping me offline if you want to chat more about this.

@TamarChristinaArm
Contributor Author

@echesakovMSFT It did! I hadn't realized the JIT was emitting a runtime dispatch for this case, which makes sense; that's where my initial confusion came from :) Thanks for the explanation!

@echesakov
Contributor

@TamarChristinaArm since you are working on this - I have assigned the issue to you.

@echesakov moved this from API design to In progress in Hardware Intrinsics May 6, 2020
@JulieLeeMSFT added this to the 5.0 milestone May 18, 2020
Hardware Intrinsics automation moved this from In progress to Done Jun 11, 2020
@ghost locked as resolved and limited conversation to collaborators Dec 11, 2020