API Proposal : Arm Shift and Permute intrinsics #31324

TamarChristinaArm · 2019-10-28T17:21:48Z

The A32 variants of these are blocked pending resolution of the <lanes>x<copies> implementation in #24790 (e.g. int32x2x2). The permute instructions such as ZIP1 and ZIP2 present an interesting challenge. Intrinsics in CoreCLR/CoreFX are supposed to map down to a single hardware instruction, which is awkward here because on A32 ZIP, TRN and UZP are destructive operations that perform both the odd and even shuffles at the same time. So while you could implement the intrinsics for A32 by copying the vector and ignoring one of the outputs, I believe that goes counter to the philosophy here (unless I'm mistaken). It also means that if they were implemented on A32, then for efficiency a ZIP1, ZIP2 pair should be combined into a single ZIP and the moves not generated.

This also means that the Arm ZIP, TRN and UZP intrinsics can't be implemented on A64 as a single intrinsic; the user needs to make two calls instead. This is the reason the permute intrinsics in this proposal are A64 only, though it makes intrinsics code a bit less portable between A32 and A64 in this case.
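To make the "two calls" point concrete, here is a minimal scalar sketch of what ZIP1 and ZIP2 compute, using plain arrays in place of vector registers. The class name `ZipModel` and the array-based signatures are assumptions for illustration only and are not part of the proposal.

```csharp
public static class ZipModel
{
    // Scalar model of A64 ZIP1: interleave the low halves of both inputs.
    // Assumes both arrays have the same, even length.
    public static T[] ZipLow<T>(T[] left, T[] right)
    {
        var result = new T[left.Length];
        for (int i = 0; i < left.Length / 2; i++)
        {
            result[2 * i] = left[i];
            result[2 * i + 1] = right[i];
        }
        return result;
    }

    // Scalar model of A64 ZIP2: interleave the high halves of both inputs.
    public static T[] ZipHigh<T>(T[] left, T[] right)
    {
        int half = left.Length / 2;
        var result = new T[left.Length];
        for (int i = 0; i < half; i++)
        {
            result[2 * i] = left[half + i];
            result[2 * i + 1] = right[half + i];
        }
        return result;
    }
}
```

With left = {0, 1, 2, 3} and right = {10, 11, 12, 13}, `ZipLow` gives {0, 10, 1, 11} and `ZipHigh` gives {2, 12, 3, 13}. Taken together these are the two results that the destructive A32 instruction writes back into its two operands in a single step.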

Also, to make things easier to read I combined the documentation headers for the proposal. They will of course be separated out in the actual implementation.

namespace System.Runtime.Intrinsics.Arm
{

    public static class ArmBase
    {
        public static bool IsSupported { get { throw null; } }

        /// <summary>
        /// vslid_n_[su]64
        ///
        /// A64: SLI
        /// A32: VSLI
        /// </summary>
        public static long  LeftShiftAndInsert (long  left, long  right, uint shift) { throw null; }
        public static ulong LeftShiftAndInsert (ulong left, ulong right, uint shift) { throw null; }

        /// <summary>
        /// vsrid_n_[su]64
        ///
        /// A64: SRI
        /// A32: VSRI
        /// </summary>
        public static long  RightShiftAndInsert (long  left, long  right, uint shift) { throw null; }
        public static ulong RightShiftAndInsert (ulong left, ulong right, uint shift) { throw null; }

    }

    public static class AdvSimd
    {
        public static bool IsSupported { get { throw null; } }

        /// <summary>
        /// vsli[q]_n_[su][8,16,32,64]
        ///
        /// A64: SLI
        /// A32: VSLI
        /// </summary>
        public static Vector64<byte>   LeftShiftAndInsert (Vector64<byte>   left, Vector64<byte>   right, uint shift) { throw null; }
        public static Vector64<ushort> LeftShiftAndInsert (Vector64<ushort> left, Vector64<ushort> right, uint shift) { throw null; }
        public static Vector64<uint>   LeftShiftAndInsert (Vector64<uint>   left, Vector64<uint>   right, uint shift) { throw null; }
        public static Vector64<sbyte>  LeftShiftAndInsert (Vector64<sbyte>  left, Vector64<sbyte>  right, uint shift) { throw null; }
        public static Vector64<short>  LeftShiftAndInsert (Vector64<short>  left, Vector64<short>  right, uint shift) { throw null; }
        public static Vector64<int>    LeftShiftAndInsert (Vector64<int>    left, Vector64<int>    right, uint shift) { throw null; }

        public static Vector128<byte>   LeftShiftAndInsert (Vector128<byte>   left, Vector128<byte>   right, uint shift) { throw null; }
        public static Vector128<ushort> LeftShiftAndInsert (Vector128<ushort> left, Vector128<ushort> right, uint shift) { throw null; }
        public static Vector128<uint>   LeftShiftAndInsert (Vector128<uint>   left, Vector128<uint>   right, uint shift) { throw null; }
        public static Vector128<ulong>  LeftShiftAndInsert (Vector128<ulong>  left, Vector128<ulong>  right, uint shift) { throw null; }
        public static Vector128<sbyte>  LeftShiftAndInsert (Vector128<sbyte>  left, Vector128<sbyte>  right, uint shift) { throw null; }
        public static Vector128<short>  LeftShiftAndInsert (Vector128<short>  left, Vector128<short>  right, uint shift) { throw null; }
        public static Vector128<int>    LeftShiftAndInsert (Vector128<int>    left, Vector128<int>    right, uint shift) { throw null; }
        public static Vector128<long>   LeftShiftAndInsert (Vector128<long>   left, Vector128<long>   right, uint shift) { throw null; }

        /// <summary>
        /// vsri[q]_n_[su][8,16,32,64]
        ///
        /// A64: SRI
        /// A32: VSRI
        /// </summary>
        public static Vector64<byte>   RightShiftAndInsert (Vector64<byte>   left, Vector64<byte>   right, uint shift) { throw null; }
        public static Vector64<ushort> RightShiftAndInsert (Vector64<ushort> left, Vector64<ushort> right, uint shift) { throw null; }
        public static Vector64<uint>   RightShiftAndInsert (Vector64<uint>   left, Vector64<uint>   right, uint shift) { throw null; }
        public static Vector64<sbyte>  RightShiftAndInsert (Vector64<sbyte>  left, Vector64<sbyte>  right, uint shift) { throw null; }
        public static Vector64<short>  RightShiftAndInsert (Vector64<short>  left, Vector64<short>  right, uint shift) { throw null; }
        public static Vector64<int>    RightShiftAndInsert (Vector64<int>    left, Vector64<int>    right, uint shift) { throw null; }

        public static Vector128<byte>   RightShiftAndInsert (Vector128<byte>   left, Vector128<byte>   right, uint shift) { throw null; }
        public static Vector128<ushort> RightShiftAndInsert (Vector128<ushort> left, Vector128<ushort> right, uint shift) { throw null; }
        public static Vector128<uint>   RightShiftAndInsert (Vector128<uint>   left, Vector128<uint>   right, uint shift) { throw null; }
        public static Vector128<ulong>  RightShiftAndInsert (Vector128<ulong>  left, Vector128<ulong>  right, uint shift) { throw null; }
        public static Vector128<sbyte>  RightShiftAndInsert (Vector128<sbyte>  left, Vector128<sbyte>  right, uint shift) { throw null; }
        public static Vector128<short>  RightShiftAndInsert (Vector128<short>  left, Vector128<short>  right, uint shift) { throw null; }
        public static Vector128<int>    RightShiftAndInsert (Vector128<int>    left, Vector128<int>    right, uint shift) { throw null; }
        public static Vector128<long>   RightShiftAndInsert (Vector128<long>   left, Vector128<long>   right, uint shift) { throw null; }

        /// <summary>
        /// vmovn_[su][16,32,64]
        ///
        /// A64: XTN
        /// A32: VMOVN
        /// </summary>
        public static Vector64<sbyte>  ExtractAndNarrowLow (Vector128<short>  value) { throw null; }
        public static Vector64<short>  ExtractAndNarrowLow (Vector128<int>    value) { throw null; }
        public static Vector64<int>    ExtractAndNarrowLow (Vector128<long>   value) { throw null; }
        public static Vector64<byte>   ExtractAndNarrowLow (Vector128<ushort> value) { throw null; }
        public static Vector64<ushort> ExtractAndNarrowLow (Vector128<uint>   value) { throw null; }
        public static Vector64<uint>   ExtractAndNarrowLow (Vector128<ulong>  value) { throw null; }

        /// <summary>
        /// vmovn_high_[su][16,32,64]
        ///
        /// A64: XTN2
        /// A32: VMOVN
        /// </summary>
        public static Vector128<sbyte>  ExtractAndNarrowHigh (Vector64<sbyte>  accum, Vector128<short>  value) { throw null; }
        public static Vector128<short>  ExtractAndNarrowHigh (Vector64<short>  accum, Vector128<int>    value) { throw null; }
        public static Vector128<int>    ExtractAndNarrowHigh (Vector64<int>    accum, Vector128<long>   value) { throw null; }
        public static Vector128<byte>   ExtractAndNarrowHigh (Vector64<byte>   accum, Vector128<ushort> value) { throw null; }
        public static Vector128<ushort> ExtractAndNarrowHigh (Vector64<ushort> accum, Vector128<uint>   value) { throw null; }
        public static Vector128<uint>   ExtractAndNarrowHigh (Vector64<uint>   accum, Vector128<ulong>  value) { throw null; }

        public static class Arm64
        {
            public static bool IsSupported { get { throw null; } }

            /// <summary>
            /// vuzp1[q]_[suf][8,16,32,64]
            ///
            /// A64: UZP1
            /// </summary>
            public static Vector64<sbyte>  UnzipEven (Vector64<sbyte>  left, Vector64<sbyte>  right) { throw null; }
            public static Vector64<short>  UnzipEven (Vector64<short>  left, Vector64<short>  right) { throw null; }
            public static Vector64<int>    UnzipEven (Vector64<int>    left, Vector64<int>    right) { throw null; }
            public static Vector64<byte>   UnzipEven (Vector64<byte>   left, Vector64<byte>   right) { throw null; }
            public static Vector64<ushort> UnzipEven (Vector64<ushort> left, Vector64<ushort> right) { throw null; }
            public static Vector64<uint>   UnzipEven (Vector64<uint>   left, Vector64<uint>   right) { throw null; }
            public static Vector64<float>  UnzipEven (Vector64<float>  left, Vector64<float>  right) { throw null; }

            public static Vector128<sbyte>  UnzipEven (Vector128<sbyte>  left, Vector128<sbyte>  right) { throw null; }
            public static Vector128<short>  UnzipEven (Vector128<short>  left, Vector128<short>  right) { throw null; }
            public static Vector128<int>    UnzipEven (Vector128<int>    left, Vector128<int>    right) { throw null; }
            public static Vector128<long>   UnzipEven (Vector128<long>   left, Vector128<long>   right) { throw null; }
            public static Vector128<byte>   UnzipEven (Vector128<byte>   left, Vector128<byte>   right) { throw null; }
            public static Vector128<ushort> UnzipEven (Vector128<ushort> left, Vector128<ushort> right) { throw null; }
            public static Vector128<uint>   UnzipEven (Vector128<uint>   left, Vector128<uint>   right) { throw null; }
            public static Vector128<ulong>  UnzipEven (Vector128<ulong>  left, Vector128<ulong>  right) { throw null; }
            public static Vector128<float>  UnzipEven (Vector128<float>  left, Vector128<float>  right) { throw null; }
            public static Vector128<double> UnzipEven (Vector128<double> left, Vector128<double> right) { throw null; }

            /// <summary>
            /// vuzp2[q]_[suf][8,16,32,64]
            ///
            /// A64: UZP2
            /// </summary>
            public static Vector64<sbyte>  UnzipOdd (Vector64<sbyte>  left, Vector64<sbyte>  right) { throw null; }
            public static Vector64<short>  UnzipOdd (Vector64<short>  left, Vector64<short>  right) { throw null; }
            public static Vector64<int>    UnzipOdd (Vector64<int>    left, Vector64<int>    right) { throw null; }
            public static Vector64<byte>   UnzipOdd (Vector64<byte>   left, Vector64<byte>   right) { throw null; }
            public static Vector64<ushort> UnzipOdd (Vector64<ushort> left, Vector64<ushort> right) { throw null; }
            public static Vector64<uint>   UnzipOdd (Vector64<uint>   left, Vector64<uint>   right) { throw null; }
            public static Vector64<float>  UnzipOdd (Vector64<float>  left, Vector64<float>  right) { throw null; }

            public static Vector128<sbyte>  UnzipOdd (Vector128<sbyte>  left, Vector128<sbyte>  right) { throw null; }
            public static Vector128<short>  UnzipOdd (Vector128<short>  left, Vector128<short>  right) { throw null; }
            public static Vector128<int>    UnzipOdd (Vector128<int>    left, Vector128<int>    right) { throw null; }
            public static Vector128<long>   UnzipOdd (Vector128<long>   left, Vector128<long>   right) { throw null; }
            public static Vector128<byte>   UnzipOdd (Vector128<byte>   left, Vector128<byte>   right) { throw null; }
            public static Vector128<ushort> UnzipOdd (Vector128<ushort> left, Vector128<ushort> right) { throw null; }
            public static Vector128<uint>   UnzipOdd (Vector128<uint>   left, Vector128<uint>   right) { throw null; }
            public static Vector128<ulong>  UnzipOdd (Vector128<ulong>  left, Vector128<ulong>  right) { throw null; }
            public static Vector128<float>  UnzipOdd (Vector128<float>  left, Vector128<float>  right) { throw null; }
            public static Vector128<double> UnzipOdd (Vector128<double> left, Vector128<double> right) { throw null; }

            /// <summary>
            /// vzip1[q]_[suf][8,16,32,64]
            ///
            /// A64: ZIP1
            /// </summary>
            public static Vector64<sbyte>  ZipLow (Vector64<sbyte>  left, Vector64<sbyte>  right) { throw null; }
            public static Vector64<short>  ZipLow (Vector64<short>  left, Vector64<short>  right) { throw null; }
            public static Vector64<int>    ZipLow (Vector64<int>    left, Vector64<int>    right) { throw null; }
            public static Vector64<byte>   ZipLow (Vector64<byte>   left, Vector64<byte>   right) { throw null; }
            public static Vector64<ushort> ZipLow (Vector64<ushort> left, Vector64<ushort> right) { throw null; }
            public static Vector64<uint>   ZipLow (Vector64<uint>   left, Vector64<uint>   right) { throw null; }
            public static Vector64<float>  ZipLow (Vector64<float>  left, Vector64<float>  right) { throw null; }

            public static Vector128<sbyte>  ZipLow (Vector128<sbyte>  left, Vector128<sbyte>  right) { throw null; }
            public static Vector128<short>  ZipLow (Vector128<short>  left, Vector128<short>  right) { throw null; }
            public static Vector128<int>    ZipLow (Vector128<int>    left, Vector128<int>    right) { throw null; }
            public static Vector128<long>   ZipLow (Vector128<long>   left, Vector128<long>   right) { throw null; }
            public static Vector128<byte>   ZipLow (Vector128<byte>   left, Vector128<byte>   right) { throw null; }
            public static Vector128<ushort> ZipLow (Vector128<ushort> left, Vector128<ushort> right) { throw null; }
            public static Vector128<uint>   ZipLow (Vector128<uint>   left, Vector128<uint>   right) { throw null; }
            public static Vector128<ulong>  ZipLow (Vector128<ulong>  left, Vector128<ulong>  right) { throw null; }
            public static Vector128<float>  ZipLow (Vector128<float>  left, Vector128<float>  right) { throw null; }
            public static Vector128<double> ZipLow (Vector128<double> left, Vector128<double> right) { throw null; }

            /// <summary>
            /// vzip2[q]_[suf][8,16,32,64]
            ///
            /// A64: ZIP2
            /// </summary>
            public static Vector64<sbyte>  ZipHigh (Vector64<sbyte>  left, Vector64<sbyte>  right) { throw null; }
            public static Vector64<short>  ZipHigh (Vector64<short>  left, Vector64<short>  right) { throw null; }
            public static Vector64<int>    ZipHigh (Vector64<int>    left, Vector64<int>    right) { throw null; }
            public static Vector64<byte>   ZipHigh (Vector64<byte>   left, Vector64<byte>   right) { throw null; }
            public static Vector64<ushort> ZipHigh (Vector64<ushort> left, Vector64<ushort> right) { throw null; }
            public static Vector64<uint>   ZipHigh (Vector64<uint>   left, Vector64<uint>   right) { throw null; }
            public static Vector64<float>  ZipHigh (Vector64<float>  left, Vector64<float>  right) { throw null; }

            public static Vector128<sbyte>  ZipHigh (Vector128<sbyte>  left, Vector128<sbyte>  right) { throw null; }
            public static Vector128<short>  ZipHigh (Vector128<short>  left, Vector128<short>  right) { throw null; }
            public static Vector128<int>    ZipHigh (Vector128<int>    left, Vector128<int>    right) { throw null; }
            public static Vector128<long>   ZipHigh (Vector128<long>   left, Vector128<long>   right) { throw null; }
            public static Vector128<byte>   ZipHigh (Vector128<byte>   left, Vector128<byte>   right) { throw null; }
            public static Vector128<ushort> ZipHigh (Vector128<ushort> left, Vector128<ushort> right) { throw null; }
            public static Vector128<uint>   ZipHigh (Vector128<uint>   left, Vector128<uint>   right) { throw null; }
            public static Vector128<ulong>  ZipHigh (Vector128<ulong>  left, Vector128<ulong>  right) { throw null; }
            public static Vector128<float>  ZipHigh (Vector128<float>  left, Vector128<float>  right) { throw null; }
            public static Vector128<double> ZipHigh (Vector128<double> left, Vector128<double> right) { throw null; }

            /// <summary>
            /// vtrn1[q]_[suf][8,16,32,64]
            ///
            /// A64: TRN1
            /// </summary>
            public static Vector64<sbyte>  TransposeEven (Vector64<sbyte>  left, Vector64<sbyte>  right) { throw null; }
            public static Vector64<short>  TransposeEven (Vector64<short>  left, Vector64<short>  right) { throw null; }
            public static Vector64<int>    TransposeEven (Vector64<int>    left, Vector64<int>    right) { throw null; }
            public static Vector64<byte>   TransposeEven (Vector64<byte>   left, Vector64<byte>   right) { throw null; }
            public static Vector64<ushort> TransposeEven (Vector64<ushort> left, Vector64<ushort> right) { throw null; }
            public static Vector64<uint>   TransposeEven (Vector64<uint>   left, Vector64<uint>   right) { throw null; }
            public static Vector64<float>  TransposeEven (Vector64<float>  left, Vector64<float>  right) { throw null; }

            public static Vector128<sbyte>  TransposeEven (Vector128<sbyte>  left, Vector128<sbyte>  right) { throw null; }
            public static Vector128<short>  TransposeEven (Vector128<short>  left, Vector128<short>  right) { throw null; }
            public static Vector128<int>    TransposeEven (Vector128<int>    left, Vector128<int>    right) { throw null; }
            public static Vector128<long>   TransposeEven (Vector128<long>   left, Vector128<long>   right) { throw null; }
            public static Vector128<byte>   TransposeEven (Vector128<byte>   left, Vector128<byte>   right) { throw null; }
            public static Vector128<ushort> TransposeEven (Vector128<ushort> left, Vector128<ushort> right) { throw null; }
            public static Vector128<uint>   TransposeEven (Vector128<uint>   left, Vector128<uint>   right) { throw null; }
            public static Vector128<ulong>  TransposeEven (Vector128<ulong>  left, Vector128<ulong>  right) { throw null; }
            public static Vector128<float>  TransposeEven (Vector128<float>  left, Vector128<float>  right) { throw null; }
            public static Vector128<double> TransposeEven (Vector128<double> left, Vector128<double> right) { throw null; }

            /// <summary>
            /// vtrn2[q]_[suf][8,16,32,64]
            ///
            /// A64: TRN2
            /// </summary>
            public static Vector64<sbyte>  TransposeOdd (Vector64<sbyte>  left, Vector64<sbyte>  right) { throw null; }
            public static Vector64<short>  TransposeOdd (Vector64<short>  left, Vector64<short>  right) { throw null; }
            public static Vector64<int>    TransposeOdd (Vector64<int>    left, Vector64<int>    right) { throw null; }
            public static Vector64<byte>   TransposeOdd (Vector64<byte>   left, Vector64<byte>   right) { throw null; }
            public static Vector64<ushort> TransposeOdd (Vector64<ushort> left, Vector64<ushort> right) { throw null; }
            public static Vector64<uint>   TransposeOdd (Vector64<uint>   left, Vector64<uint>   right) { throw null; }
            public static Vector64<float>  TransposeOdd (Vector64<float>  left, Vector64<float>  right) { throw null; }

            public static Vector128<sbyte>  TransposeOdd (Vector128<sbyte>  left, Vector128<sbyte>  right) { throw null; }
            public static Vector128<short>  TransposeOdd (Vector128<short>  left, Vector128<short>  right) { throw null; }
            public static Vector128<int>    TransposeOdd (Vector128<int>    left, Vector128<int>    right) { throw null; }
            public static Vector128<long>   TransposeOdd (Vector128<long>   left, Vector128<long>   right) { throw null; }
            public static Vector128<byte>   TransposeOdd (Vector128<byte>   left, Vector128<byte>   right) { throw null; }
            public static Vector128<ushort> TransposeOdd (Vector128<ushort> left, Vector128<ushort> right) { throw null; }
            public static Vector128<uint>   TransposeOdd (Vector128<uint>   left, Vector128<uint>   right) { throw null; }
            public static Vector128<ulong>  TransposeOdd (Vector128<ulong>  left, Vector128<ulong>  right) { throw null; }
            public static Vector128<float>  TransposeOdd (Vector128<float>  left, Vector128<float>  right) { throw null; }
            public static Vector128<double> TransposeOdd (Vector128<double> left, Vector128<double> right) { throw null; }
        }
    }
}
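
For readers less familiar with SLI and SRI, the following is a rough scalar sketch of what the shift-and-insert intrinsics above compute on a single 8-bit lane. It assumes, following the ACLE `vsli_n`/`vsri_n` argument order, that `left` is the value whose bits are preserved and `right` is the value that gets shifted in. The class name `ShiftInsertModel` and the `int` shift parameter are illustrative only and not part of the proposed surface.

```csharp
using System;

public static class ShiftInsertModel
{
    // Model of SLI on one 8-bit lane: shift `right` left by `shift` (0..7)
    // and keep the low `shift` bits of `left`.
    public static byte ShiftLeftAndInsert(byte left, byte right, int shift)
    {
        if (shift < 0 || shift > 7) throw new ArgumentOutOfRangeException(nameof(shift));
        int mask = (0xFF << shift) & 0xFF;           // bits written by the shifted value
        return (byte)((left & ~mask) | ((right << shift) & mask));
    }

    // Model of SRI on one 8-bit lane: shift `right` right by `shift` (1..8)
    // and keep the high `shift` bits of `left`.
    public static byte ShiftRightAndInsert(byte left, byte right, int shift)
    {
        if (shift < 1 || shift > 8) throw new ArgumentOutOfRangeException(nameof(shift));
        int mask = 0xFF >> shift;                    // bits written by the shifted value
        return (byte)((left & ~mask) | (right >> shift));
    }
}
```

For example, `ShiftLeftAndInsert(0x0F, 0xAB, 4)` gives 0xBF (low nibble from `left`, high nibble from `right` shifted up), and `ShiftRightAndInsert(0xF0, 0xAB, 4)` gives 0xFA.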

cc @tannergooding @CarolEidt @echesakovMSFT


tannergooding · 2019-10-28T21:15:24Z

Thanks @TamarChristinaArm.

I've added this to the list of APIs to cover tomorrow.

terrajobst · 2020-03-03T19:25:26Z

Video

Confirm that ShiftLeftLogicalAndInsert under ArmBase should move under AdvSimd, but is it 32 bit or 64 bit?
Some of the instructions are destructive on ARM32 and non-destructive on ARM64. Should we split these up into a 32-bit version with a different shape?
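
As a purely illustrative sketch of what "a different shape" could mean, the scalar model below contrasts the two single-result A64 operations (TRN1, TRN2) with one call that returns both results, which is roughly what the destructive ARM32 instruction produces. The tuple-returning `Transpose` method and the class name `TransposeModel` are hypothetical; nothing here was decided in the review.

```csharp
public static class TransposeModel
{
    // Scalar model of A64 TRN1: even-indexed lanes of both inputs, interleaved.
    // Assumes both arrays have the same, even length.
    public static T[] TransposeEven<T>(T[] left, T[] right)
    {
        var result = new T[left.Length];
        for (int i = 0; i < left.Length; i += 2)
        {
            result[i] = left[i];
            result[i + 1] = right[i];
        }
        return result;
    }

    // Scalar model of A64 TRN2: odd-indexed lanes of both inputs, interleaved.
    public static T[] TransposeOdd<T>(T[] left, T[] right)
    {
        var result = new T[left.Length];
        for (int i = 0; i < left.Length; i += 2)
        {
            result[i] = left[i + 1];
            result[i + 1] = right[i + 1];
        }
        return result;
    }

    // Hypothetical 32-bit shape: a single call yielding both results,
    // mirroring an instruction that overwrites both of its operands.
    public static (T[] Even, T[] Odd) Transpose<T>(T[] left, T[] right)
        => (TransposeEven(left, right), TransposeOdd(left, right));
}
```

With left = {0, 1, 2, 3} and right = {10, 11, 12, 13}, `Transpose` returns Even = {0, 10, 2, 12} and Odd = {1, 11, 3, 13}.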

API

namespace System.Runtime.Intrinsics.Arm
{
    public partial class ArmBase
    {
        /// <summary>
        /// vslid_n_[su]64
        ///
        /// A64: SLI
        /// A32: VSLI
        /// </summary>
        public static Vector64<long>  ShiftLeftLogicalAndInsertScalar(Vector64<long>  left, Vector64<long>  right, byte shift);
        public static Vector64<ulong> ShiftLeftLogicalAndInsertScalar(Vector64<ulong> left, Vector64<ulong> right, byte shift);

        /// <summary>
        /// vsrid_n_[su]64
        ///
        /// A64: SRI
        /// A32: VSRI
        /// </summary>
        public static Vector64<long>  ShiftRightLogicalAndInsertScalar(Vector64<long>  left, Vector64<long>  right, byte shift);
        public static Vector64<ulong> ShiftRightLogicalAndInsertScalar(Vector64<ulong> left, Vector64<ulong> right, byte shift);
    }
    public partial class AdvSimd
    {
        /// <summary>
        /// vsli[q]_n_[su][8,16,32,64]
        ///
        /// A64: SLI
        /// A32: VSLI
        /// </summary>
        public static Vector64<byte>   ShiftLeftLogicalAndInsert(Vector64<byte>   left, Vector64<byte>   right, byte shift);
        public static Vector64<ushort> ShiftLeftLogicalAndInsert(Vector64<ushort> left, Vector64<ushort> right, byte shift);
        public static Vector64<uint>   ShiftLeftLogicalAndInsert(Vector64<uint>   left, Vector64<uint>   right, byte shift);
        public static Vector64<sbyte>  ShiftLeftLogicalAndInsert(Vector64<sbyte>  left, Vector64<sbyte>  right, byte shift);
        public static Vector64<short>  ShiftLeftLogicalAndInsert(Vector64<short>  left, Vector64<short>  right, byte shift);
        public static Vector64<int>    ShiftLeftLogicalAndInsert(Vector64<int>    left, Vector64<int>    right, byte shift);

        public static Vector128<byte>   ShiftLeftLogicalAndInsert(Vector128<byte>   left, Vector128<byte>   right, byte shift);
        public static Vector128<ushort> ShiftLeftLogicalAndInsert(Vector128<ushort> left, Vector128<ushort> right, byte shift);
        public static Vector128<uint>   ShiftLeftLogicalAndInsert(Vector128<uint>   left, Vector128<uint>   right, byte shift);
        public static Vector128<ulong>  ShiftLeftLogicalAndInsert(Vector128<ulong>  left, Vector128<ulong>  right, byte shift);
        public static Vector128<sbyte>  ShiftLeftLogicalAndInsert(Vector128<sbyte>  left, Vector128<sbyte>  right, byte shift);
        public static Vector128<short>  ShiftLeftLogicalAndInsert(Vector128<short>  left, Vector128<short>  right, byte shift);
        public static Vector128<int>    ShiftLeftLogicalAndInsert(Vector128<int>    left, Vector128<int>    right, byte shift);
        public static Vector128<long>   ShiftLeftLogicalAndInsert(Vector128<long>   left, Vector128<long>   right, byte shift);

        /// <summary>
        /// vsri[q]_n_[su][8,16,32,64]
        ///
        /// A64: SRI
        /// A32: VSRI
        /// </summary>
        public static Vector64<byte>   ShiftRightAndInsert(Vector64<byte>   left, Vector64<byte>   right, byte shift);
        public static Vector64<ushort> ShiftRightAndInsert(Vector64<ushort> left, Vector64<ushort> right, byte shift);
        public static Vector64<uint>   ShiftRightAndInsert(Vector64<uint>   left, Vector64<uint>   right, byte shift);
        public static Vector64<sbyte>  ShiftRightAndInsert(Vector64<sbyte>  left, Vector64<sbyte>  right, byte shift);
        public static Vector64<short>  ShiftRightAndInsert(Vector64<short>  left, Vector64<short>  right, byte shift);
        public static Vector64<int>    ShiftRightAndInsert(Vector64<int>    left, Vector64<int>    right, byte shift);

        public static Vector128<byte>   ShiftRightAndInsert(Vector128<byte>   left, Vector128<byte>   right, byte shift);
        public static Vector128<ushort> ShiftRightAndInsert(Vector128<ushort> left, Vector128<ushort> right, byte shift);
        public static Vector128<uint>   ShiftRightAndInsert(Vector128<uint>   left, Vector128<uint>   right, byte shift);
        public static Vector128<ulong>  ShiftRightAndInsert(Vector128<ulong>  left, Vector128<ulong>  right, byte shift);
        public static Vector128<sbyte>  ShiftRightAndInsert(Vector128<sbyte>  left, Vector128<sbyte>  right, byte shift);
        public static Vector128<short>  ShiftRightAndInsert(Vector128<short>  left, Vector128<short>  right, byte shift);
        public static Vector128<int>    ShiftRightAndInsert(Vector128<int>    left, Vector128<int>    right, byte shift);
        public static Vector128<long>   ShiftRightAndInsert(Vector128<long>   left, Vector128<long>   right, byte shift);

        /// <summary>
        /// vmovn_[su][16,32,64]
        ///
        /// A64: XTN
        /// A32: VMOVN
        /// </summary>
        public static Vector64<sbyte>  ExtractAndNarrowLow(Vector128<short>  value);
        public static Vector64<short>  ExtractAndNarrowLow(Vector128<int>    value);
        public static Vector64<int>    ExtractAndNarrowLow(Vector128<long>   value);
        public static Vector64<byte>   ExtractAndNarrowLow(Vector128<ushort> value);
        public static Vector64<ushort> ExtractAndNarrowLow(Vector128<uint>   value);
        public static Vector64<uint>   ExtractAndNarrowLow(Vector128<ulong>  value);

        /// <summary>
        /// vmovn_high_[su][16,32,64]
        ///
        /// A64: XTN2
        /// A32: VMOVN
        /// </summary>
        public static Vector128<sbyte>  ExtractAndNarrowHigh(Vector64<sbyte>  lower, Vector128<short>  value);
        public static Vector128<short>  ExtractAndNarrowHigh(Vector64<short>  lower, Vector128<int>    value);
        public static Vector128<int>    ExtractAndNarrowHigh(Vector64<int>    lower, Vector128<long>   value);
        public static Vector128<byte>   ExtractAndNarrowHigh(Vector64<byte>   lower, Vector128<ushort> value);
        public static Vector128<ushort> ExtractAndNarrowHigh(Vector64<ushort> lower, Vector128<uint>   value);
        public static Vector128<uint>   ExtractAndNarrowHigh(Vector64<uint>   lower, Vector128<ulong>  value);

        public partial class Arm64
        {
            /// <summary>
            /// vuzp1[q]_[suf][8,16,32,64]
            ///
            /// A64: UZP1
            /// </summary>
            public static Vector64<sbyte>  UnzipEven(Vector64<sbyte>  lower, Vector64<sbyte>  upper);
            public static Vector64<short>  UnzipEven(Vector64<short>  lower, Vector64<short>  upper);
            public static Vector64<int>    UnzipEven(Vector64<int>    lower, Vector64<int>    upper);
            public static Vector64<byte>   UnzipEven(Vector64<byte>   lower, Vector64<byte>   upper);
            public static Vector64<ushort> UnzipEven(Vector64<ushort> lower, Vector64<ushort> upper);
            public static Vector64<uint>   UnzipEven(Vector64<uint>   lower, Vector64<uint>   upper);
            public static Vector64<float>  UnzipEven(Vector64<float>  lower, Vector64<float>  upper);

            public static Vector128<sbyte>  UnzipEven(Vector128<sbyte>  lower, Vector128<sbyte>  upper);
            public static Vector128<short>  UnzipEven(Vector128<short>  lower, Vector128<short>  upper);
            public static Vector128<int>    UnzipEven(Vector128<int>    lower, Vector128<int>    upper);
            public static Vector128<long>   UnzipEven(Vector128<long>   lower, Vector128<long>   upper);
            public static Vector128<byte>   UnzipEven(Vector128<byte>   lower, Vector128<byte>   upper);
            public static Vector128<ushort> UnzipEven(Vector128<ushort> lower, Vector128<ushort> upper);
            public static Vector128<uint>   UnzipEven(Vector128<uint>   lower, Vector128<uint>   upper);
            public static Vector128<ulong>  UnzipEven(Vector128<ulong>  lower, Vector128<ulong>  upper);
            public static Vector128<float>  UnzipEven(Vector128<float>  lower, Vector128<float>  upper);
            public static Vector128<double> UnzipEven(Vector128<double> lower, Vector128<double> upper);

            /// <summary>
            /// vuzp2[q]_[suf][8,16,32,64]
            ///
            /// A64: UZP2
            /// </summary>
            public static Vector64<sbyte>  UnzipOdd(Vector64<sbyte>  lower, Vector64<sbyte>  upper);
            public static Vector64<short>  UnzipOdd(Vector64<short>  lower, Vector64<short>  upper);
            public static Vector64<int>    UnzipOdd(Vector64<int>    lower, Vector64<int>    upper);
            public static Vector64<byte>   UnzipOdd(Vector64<byte>   lower, Vector64<byte>   upper);
            public static Vector64<ushort> UnzipOdd(Vector64<ushort> lower, Vector64<ushort> upper);
            public static Vector64<uint>   UnzipOdd(Vector64<uint>   lower, Vector64<uint>   upper);
            public static Vector64<float>  UnzipOdd(Vector64<float>  lower, Vector64<float>  upper);

            public static Vector128<sbyte>  UnzipOdd(Vector128<sbyte>  lower, Vector128<sbyte>  upper);
            public static Vector128<short>  UnzipOdd(Vector128<short>  lower, Vector128<short>  upper);
            public static Vector128<int>    UnzipOdd(Vector128<int>    lower, Vector128<int>    upper);
            public static Vector128<long>   UnzipOdd(Vector128<long>   lower, Vector128<long>   upper);
            public static Vector128<byte>   UnzipOdd(Vector128<byte>   lower, Vector128<byte>   upper);
            public static Vector128<ushort> UnzipOdd(Vector128<ushort> lower, Vector128<ushort> upper);
            public static Vector128<uint>   UnzipOdd(Vector128<uint>   lower, Vector128<uint>   upper);
            public static Vector128<ulong>  UnzipOdd(Vector128<ulong>  lower, Vector128<ulong>  upper);
            public static Vector128<float>  UnzipOdd(Vector128<float>  lower, Vector128<float>  upper);
            public static Vector128<double> UnzipOdd(Vector128<double> lower, Vector128<double> upper);

            /// <summary>
            /// vzip1[q]_[suf][8,16,32,64]
            ///
            /// A64: ZIP1
            /// </summary>
            public static Vector64<sbyte>  ZipLow(Vector64<sbyte>  left, Vector64<sbyte>  right);
            public static Vector64<short>  ZipLow(Vector64<short>  left, Vector64<short>  right);
            public static Vector64<int>    ZipLow(Vector64<int>    left, Vector64<int>    right);
            public static Vector64<byte>   ZipLow(Vector64<byte>   left, Vector64<byte>   right);
            public static Vector64<ushort> ZipLow(Vector64<ushort> left, Vector64<ushort> right);
            public static Vector64<uint>   ZipLow(Vector64<uint>   left, Vector64<uint>   right);
            public static Vector64<float>  ZipLow(Vector64<float>  left, Vector64<float>  right);

            public static Vector128<sbyte>  ZipLow(Vector128<sbyte>  left, Vector128<sbyte>  right);
            public static Vector128<short>  ZipLow(Vector128<short>  left, Vector128<short>  right);
            public static Vector128<int>    ZipLow(Vector128<int>    left, Vector128<int>    right);
            public static Vector128<long>   ZipLow(Vector128<long>   left, Vector128<long>   right);
            public static Vector128<byte>   ZipLow(Vector128<byte>   left, Vector128<byte>   right);
            public static Vector128<ushort> ZipLow(Vector128<ushort> left, Vector128<ushort> right);
            public static Vector128<uint>   ZipLow(Vector128<uint>   left, Vector128<uint>   right);
            public static Vector128<ulong>  ZipLow(Vector128<ulong>  left, Vector128<ulong>  right);
            public static Vector128<float>  ZipLow(Vector128<float>  left, Vector128<float>  right);
            public static Vector128<double> ZipLow(Vector128<double> left, Vector128<double> right);

            /// <summary>
            /// vzip2[q]_[suf][8,16,32,64]
            ///
            /// A64: ZIP2
            /// </summary>
            public static Vector64<sbyte>  ZipHigh(Vector64<sbyte>  left, Vector64<sbyte>  right);
            public static Vector64<short>  ZipHigh(Vector64<short>  left, Vector64<short>  right);
            public static Vector64<int>    ZipHigh(Vector64<int>    left, Vector64<int>    right);
            public static Vector64<byte>   ZipHigh(Vector64<byte>   left, Vector64<byte>   right);
            public static Vector64<ushort> ZipHigh(Vector64<ushort> left, Vector64<ushort> right);
            public static Vector64<uint>   ZipHigh(Vector64<uint>   left, Vector64<uint>   right);
            public static Vector64<float>  ZipHigh(Vector64<float>  left, Vector64<float>  right);

            public static Vector128<sbyte>  ZipHigh(Vector128<sbyte>  left, Vector128<sbyte>  right);
            public static Vector128<short>  ZipHigh(Vector128<short>  left, Vector128<short>  right);
            public static Vector128<int>    ZipHigh(Vector128<int>    left, Vector128<int>    right);
            public static Vector128<long>   ZipHigh(Vector128<long>   left, Vector128<long>   right);
            public static Vector128<byte>   ZipHigh(Vector128<byte>   left, Vector128<byte>   right);
            public static Vector128<ushort> ZipHigh(Vector128<ushort> left, Vector128<ushort> right);
            public static Vector128<uint>   ZipHigh(Vector128<uint>   left, Vector128<uint>   right);
            public static Vector128<ulong>  ZipHigh(Vector128<ulong>  left, Vector128<ulong>  right);
            public static Vector128<float>  ZipHigh(Vector128<float>  left, Vector128<float>  right);
            public static Vector128<double> ZipHigh(Vector128<double> left, Vector128<double> right);

            /// <summary>
            /// vtrn1[q]_[suf][8,16,32,64]
            ///
            /// A64: TRN1
            /// </summary>
            public static Vector64<sbyte>  TransposeEven(Vector64<sbyte>  left, Vector64<sbyte>  right);
            public static Vector64<short>  TransposeEven(Vector64<short>  left, Vector64<short>  right);
            public static Vector64<int>    TransposeEven(Vector64<int>    left, Vector64<int>    right);
            public static Vector64<byte>   TransposeEven(Vector64<byte>   left, Vector64<byte>   right);
            public static Vector64<ushort> TransposeEven(Vector64<ushort> left, Vector64<ushort> right);
            public static Vector64<uint>   TransposeEven(Vector64<uint>   left, Vector64<uint>   right);
            public static Vector64<float>  TransposeEven(Vector64<float>  left, Vector64<float>  right);

            public static Vector128<sbyte>  TransposeEven(Vector128<sbyte>  left, Vector128<sbyte>  right);
            public static Vector128<short>  TransposeEven(Vector128<short>  left, Vector128<short>  right);
            public static Vector128<int>    TransposeEven(Vector128<int>    left, Vector128<int>    right);
            public static Vector128<long>   TransposeEven(Vector128<long>   left, Vector128<long>   right);
            public static Vector128<byte>   TransposeEven(Vector128<byte>   left, Vector128<byte>   right);
            public static Vector128<ushort> TransposeEven(Vector128<ushort> left, Vector128<ushort> right);
            public static Vector128<uint>   TransposeEven(Vector128<uint>   left, Vector128<uint>   right);
            public static Vector128<ulong>  TransposeEven(Vector128<ulong>  left, Vector128<ulong>  right);
            public static Vector128<float>  TransposeEven(Vector128<float>  left, Vector128<float>  right);
            public static Vector128<double> TransposeEven(Vector128<double> left, Vector128<double> right);

            /// <summary>
            /// vtrn2[q]_[suf][8,16,32,64]
            ///
            /// A64: TRN2
            /// </summary>
            public static Vector64<sbyte>  TransposeOdd(Vector64<sbyte>  left, Vector64<sbyte>  right);
            public static Vector64<short>  TransposeOdd(Vector64<short>  left, Vector64<short>  right);
            public static Vector64<int>    TransposeOdd(Vector64<int>    left, Vector64<int>    right);
            public static Vector64<byte>   TransposeOdd(Vector64<byte>   left, Vector64<byte>   right);
            public static Vector64<ushort> TransposeOdd(Vector64<ushort> left, Vector64<ushort> right);
            public static Vector64<uint>   TransposeOdd(Vector64<uint>   left, Vector64<uint>   right);
            public static Vector64<float>  TransposeOdd(Vector64<float>  left, Vector64<float>  right);

            public static Vector128<sbyte>  TransposeOdd(Vector128<sbyte>  left, Vector128<sbyte>  right);
            public static Vector128<short>  TransposeOdd(Vector128<short>  left, Vector128<short>  right);
            public static Vector128<int>    TransposeOdd(Vector128<int>    left, Vector128<int>    right);
            public static Vector128<long>   TransposeOdd(Vector128<long>   left, Vector128<long>   right);
            public static Vector128<byte>   TransposeOdd(Vector128<byte>   left, Vector128<byte>   right);
            public static Vector128<ushort> TransposeOdd(Vector128<ushort> left, Vector128<ushort> right);
            public static Vector128<uint>   TransposeOdd(Vector128<uint>   left, Vector128<uint>   right);
            public static Vector128<ulong>  TransposeOdd(Vector128<ulong>  left, Vector128<ulong>  right);
            public static Vector128<float>  TransposeOdd(Vector128<float>  left, Vector128<float>  right);
            public static Vector128<double> TransposeOdd(Vector128<double> left, Vector128<double> right);
        }
    }
}
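
As a reading aid for the permute overloads above, the following is a minimal scalar sketch of what each instruction computes per lane. It is written in C++ purely as a reference model and is not the proposed C# API. The `std::array` representation and the template helpers are assumptions made for illustration, with index 0 standing for the lowest lane.

```cpp
#include <array>
#include <cstddef>

// Scalar reference model only: lane 0 is the lowest lane and N must be even.

// ZIP1: interleave the low halves of left and right.
template <typename T, std::size_t N>
std::array<T, N> ZipLow(const std::array<T, N>& left, const std::array<T, N>& right)
{
    std::array<T, N> result{};
    for (std::size_t i = 0; i < N / 2; ++i)
    {
        result[2 * i]     = left[i];
        result[2 * i + 1] = right[i];
    }
    return result;
}

// ZIP2: interleave the high halves of left and right.
template <typename T, std::size_t N>
std::array<T, N> ZipHigh(const std::array<T, N>& left, const std::array<T, N>& right)
{
    std::array<T, N> result{};
    for (std::size_t i = 0; i < N / 2; ++i)
    {
        result[2 * i]     = left[N / 2 + i];
        result[2 * i + 1] = right[N / 2 + i];
    }
    return result;
}

// TRN1: pair up the even-numbered lanes of left and right.
template <typename T, std::size_t N>
std::array<T, N> TransposeEven(const std::array<T, N>& left, const std::array<T, N>& right)
{
    std::array<T, N> result{};
    for (std::size_t i = 0; i < N / 2; ++i)
    {
        result[2 * i]     = left[2 * i];
        result[2 * i + 1] = right[2 * i];
    }
    return result;
}

// TRN2: pair up the odd-numbered lanes of left and right.
template <typename T, std::size_t N>
std::array<T, N> TransposeOdd(const std::array<T, N>& left, const std::array<T, N>& right)
{
    std::array<T, N> result{};
    for (std::size_t i = 0; i < N / 2; ++i)
    {
        result[2 * i]     = left[2 * i + 1];
        result[2 * i + 1] = right[2 * i + 1];
    }
    return result;
}

// UZP2: odd-numbered lanes of lower, followed by odd-numbered lanes of upper.
template <typename T, std::size_t N>
std::array<T, N> UnzipOdd(const std::array<T, N>& lower, const std::array<T, N>& upper)
{
    std::array<T, N> result{};
    for (std::size_t i = 0; i < N / 2; ++i)
    {
        result[i]         = lower[2 * i + 1];
        result[N / 2 + i] = upper[2 * i + 1];
    }
    return result;
}
```

With left = {0, 1, 2, 3} and right = {10, 11, 12, 13}, this model gives ZipLow = {0, 10, 1, 11}, ZipHigh = {2, 12, 3, 13}, TransposeEven = {0, 10, 2, 12}, TransposeOdd = {1, 11, 3, 13} and UnzipOdd = {1, 3, 11, 13}.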

tannergooding · 2020-03-09T16:47:29Z

Confirm that ShiftLeftLogicalAndInsert under ArmBase should move under AdvSimd, but is it 32 bit or 64 bit?

@TamarChristinaArm, looking at the encoding + decoding for VSLI, it seems to imply that L=1 is valid and therefore esize is allowed to be 64, so this applies to ARM32 as well. Is that your understanding?

Specifically, the condition "Applies when !(imm6 == 000xxx && L == 0) && Q = ?" states that the encoding applies to everything except L:imm6 == 0000xxx, which is then reiterated in the Decode step for all variants of this encoding.
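
The decode rule discussed here can be sketched as a small function. This is one reading of the A32 VSLI decode of L:imm6, written in C++ for illustration only. The struct and function names are hypothetical, and the authoritative definition remains the Arm Architecture Reference Manual.

```cpp
// Hypothetical result type: valid is false when the bit pattern is not a VSLI encoding.
struct VsliDecode
{
    bool     valid;
    unsigned esize; // element size in bits
    unsigned shift; // left shift amount
};

// Models the decode of the 7-bit field L:imm6 for VSLI (a left shift).
VsliDecode DecodeVsliImmediate(unsigned L, unsigned imm6)
{
    imm6 &= 0x3F;
    if (L & 1)       return VsliDecode{true, 64, imm6};      // 1xxxxxx
    if (imm6 & 0x20) return VsliDecode{true, 32, imm6 - 32}; // 01xxxxx
    if (imm6 & 0x10) return VsliDecode{true, 16, imm6 - 16}; // 001xxxx
    if (imm6 & 0x08) return VsliDecode{true, 8, imm6 - 8};   // 0001xxx
    return VsliDecode{false, 0, 0};                          // 0000xxx: a different encoding
}
```

Under this reading, L == 1 always selects 64-bit elements, which is why the 64-bit forms are available on ARM32 as well.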

TamarChristinaArm · 2020-03-10T10:17:42Z

@tannergooding Yeah I believe that's correct, I had missed the ! the first time around.

TamarChristinaArm · 2020-04-15T14:25:15Z

@tannergooding @echesakovMSFT I'm wondering about ShiftLeftLogicalAndInsert. The shift operand needs to be a compile-time constant. Any idea how I should handle that?

tannergooding · 2020-04-15T15:59:45Z

Non-constant inputs are handled by falling back to a call which contains a jump table handling all possible cases (between 1 and 256 of them); it's part of the reason the APIs are recursive.

Intrinsics which take an 8-bit immediate are marked as HW_Category_IMM. There are then some special modifier flags that indicate how it is handled in codegen:

HW_Flag_MaybeIMM - Indicates there is another overload which isn't HW_Category_IMM
HW_Flag_NoJmpTableIMM - Indicates that a jump table fallback isn't necessary
HW_Flag_FullRangeIMM - Indicates that the jump table fallback covers all 256 possible cases

The important methods for importation are then:

HWIntrinsicInfo::isImmOp - Indicates whether the given intrinsic is HW_Category_IMM and, if it is HW_Flag_MaybeIMM, whether the corresponding operand is TYP_INT
Compiler::addRangeCheckIfNeeded - Inserts a bounds check for a HW_Category_IMM where it is not HW_Flag_FullRangeIMM
HWIntrinsicInfo::lookupImmUpperBound - Gets the upper bound for a HW_Category_IMM where it is not HW_Flag_FullRangeIMM
HWIntrinsicInfo::isInImmRange - Determines if the constant value is in range for the intrinsic
HWIntrinsicInfo::impNonConstFallback - Replaces the codegen with a non-immediate equivalent if one exists. For example on x86 we also have overloads that take the shift amount in a vector register: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsicxarch.cpp#L340

The importation logic currently assumes the "immediate" is the last operand in the list: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsic.cpp#L648. It then uses the above methods to determine how to handle things. This includes directly expanding the relevant intrinsic, generating an equivalent intrinsic fallback, or falling back to a method call. If it falls back to the method call, the JIT will see that the method is recursively calling itself and force expansion, and the node will carry a non-constant input through to codegen. It will also insert the relevant range check if one is needed.

An example of how this is handled in codegen for x86 is here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsiccodegenxarch.cpp#L227-L254.
Once we get to codegen, if the operand is a constant we can just emit the instruction. If it isn't, we call genHWIntrinsicJumpTableFallback, which emits the necessary jump table.
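
The three outcomes described above can be summarized in a short sketch. This is a simplified model written for illustration, not the JIT's actual importer logic: the enum and function names are hypothetical, and the handling of constant operands that are out of range is deliberately left out.

```cpp
// Hypothetical names; they only label the three outcomes described in the text.
enum class ImportAction
{
    ExpandDirectly,       // immediate is a constant: emit the instruction with it
    NonImmediateFallback, // use an equivalent intrinsic that takes the value in a register
    MethodCallFallback    // keep the call; the recursive body is expanded with a jump table
};

// immIsConstant: the last operand is a constant at importation time.
// hasNonImmediateEquivalent: a non-immediate form of the instruction exists.
ImportAction ChooseImportAction(bool immIsConstant, bool hasNonImmediateEquivalent)
{
    if (immIsConstant)
    {
        return ImportAction::ExpandDirectly;
    }
    if (hasNonImmediateEquivalent)
    {
        return ImportAction::NonImmediateFallback;
    }
    return ImportAction::MethodCallFallback;
}
```

In this model, a non-constant shift on an instruction with no register form ends up in MethodCallFallback, which is the path that carries a non-constant input through to codegen.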

tannergooding · 2020-04-15T16:10:36Z

This ultimately allows things like reflection or debugging to "just work" and the expectation is users won't use it in actual perf-critical paths. Users are expected to be profiling this code so they should catch the issue relatively quickly if one does exist.

There are a few cases that this won't catch (of inputs that the JIT will eventually determine to be constant but which aren't "constant" during importation) and we have an issue tracking improving that: #9989 and #11062. Ideally we would delay the decision to be a "fallback method call" until later in the pipeline (such as lowering).

We already have some support for doing that with existing GT_INTRINSIC nodes and we do it in Rationalization: #11062 (comment), but expanding it for GT_HWINTRINSIC nodes is a bit tougher and hasn't been done yet.

TamarChristinaArm · 2020-04-15T16:35:27Z

Awesome thanks, I'll get on those then :)

echesakov · 2020-04-15T17:29:57Z

> Awesome thanks, I'll get on those then :)

@TamarChristinaArm I am working at the moment on supporting intrinsic immediate operands on arm64 - I needed this for Extract, Insert and ExtractVector64/128 intrinsics - will have a PR soon

TamarChristinaArm · 2020-04-15T17:52:35Z

@echesakovMSFT Ah great, I'll do the single register TBL in the meantime then.

tannergooding · 2020-04-16T00:36:40Z

The following APIs are yet to be implemented:

namespace System.Runtime.Intrinsics.Arm
{
    public partial class ArmBase
    {
        /// <summary>
        /// vslid_n_[su]64
        ///
        /// A64: SLI
        /// A32: VSLI
        /// </summary>
        public static Vector64<long>  ShiftLeftLogicalAndInsertScalar(Vector64<long>  left, Vector64<long>  right, byte shift);
        public static Vector64<ulong> ShiftLeftLogicalAndInsertScalar(Vector64<ulong> left, Vector64<ulong> right, byte shift);

        /// <summary>
        /// vsrid_n_[su]64
        ///
        /// A64: SRI
        /// A32: VSRI
        /// </summary>
        public static Vector64<long>  ShiftRightLogicalAndInsertScalar(Vector64<long>  left, Vector64<long>  right, byte shift);
        public static Vector64<ulong> ShiftRightLogicalAndInsertScalar(Vector64<ulong> left, Vector64<ulong> right, byte shift);
    }
    public partial class AdvSimd
    {
        /// <summary>
        /// vsli[q]_n_[su][8,16,32,64]
        ///
        /// A64: SLI
        /// A32: VSLI
        /// </summary>
        public static Vector64<byte>   ShiftLeftLogicalAndInsert(Vector64<byte>   left, Vector64<byte>   right, byte shift);
        public static Vector64<ushort> ShiftLeftLogicalAndInsert(Vector64<ushort> left, Vector64<ushort> right, byte shift);
        public static Vector64<uint>   ShiftLeftLogicalAndInsert(Vector64<uint>   left, Vector64<uint>   right, byte shift);
        public static Vector64<sbyte>  ShiftLeftLogicalAndInsert(Vector64<sbyte>  left, Vector64<sbyte>  right, byte shift);
        public static Vector64<short>  ShiftLeftLogicalAndInsert(Vector64<short>  left, Vector64<short>  right, byte shift);
        public static Vector64<int>    ShiftLeftLogicalAndInsert(Vector64<int>    left, Vector64<int>    right, byte shift);

        public static Vector128<byte>   ShiftLeftLogicalAndInsert(Vector128<byte>   left, Vector128<byte>   right, byte shift);
        public static Vector128<ushort> ShiftLeftLogicalAndInsert(Vector128<ushort> left, Vector128<ushort> right, byte shift);
        public static Vector128<uint>   ShiftLeftLogicalAndInsert(Vector128<uint>   left, Vector128<uint>   right, byte shift);
        public static Vector128<ulong>  ShiftLeftLogicalAndInsert(Vector128<ulong>  left, Vector128<ulong>  right, byte shift);
        public static Vector128<sbyte>  ShiftLeftLogicalAndInsert(Vector128<sbyte>  left, Vector128<sbyte>  right, byte shift);
        public static Vector128<short>  ShiftLeftLogicalAndInsert(Vector128<short>  left, Vector128<short>  right, byte shift);
        public static Vector128<int>    ShiftLeftLogicalAndInsert(Vector128<int>    left, Vector128<int>    right, byte shift);
        public static Vector128<long>   ShiftLeftLogicalAndInsert(Vector128<long>   left, Vector128<long>   right, byte shift);

        /// <summary>
        /// vsri[q]_n_[su][8,16,32,64]
        ///
        /// A64: SRI
        /// A32: VSRI
        /// </summary>
        public static Vector64<byte>   ShiftRightAndInsert(Vector64<byte>   left, Vector64<byte>   right, byte shift);
        public static Vector64<ushort> ShiftRightAndInsert(Vector64<ushort> left, Vector64<ushort> right, byte shift);
        public static Vector64<uint>   ShiftRightAndInsert(Vector64<uint>   left, Vector64<uint>   right, byte shift);
        public static Vector64<sbyte>  ShiftRightAndInsert(Vector64<sbyte>  left, Vector64<sbyte>  right, byte shift);
        public static Vector64<short>  ShiftRightAndInsert(Vector64<short>  left, Vector64<short>  right, byte shift);
        public static Vector64<int>    ShiftRightAndInsert(Vector64<int>    left, Vector64<int>    right, byte shift);

        public static Vector128<byte>   ShiftRightAndInsert(Vector128<byte>   left, Vector128<byte>   right, byte shift);
        public static Vector128<ushort> ShiftRightAndInsert(Vector128<ushort> left, Vector128<ushort> right, byte shift);
        public static Vector128<uint>   ShiftRightAndInsert(Vector128<uint>   left, Vector128<uint>   right, byte shift);
        public static Vector128<ulong>  ShiftRightAndInsert(Vector128<ulong>  left, Vector128<ulong>  right, byte shift);
        public static Vector128<sbyte>  ShiftRightAndInsert(Vector128<sbyte>  left, Vector128<sbyte>  right, byte shift);
        public static Vector128<short>  ShiftRightAndInsert(Vector128<short>  left, Vector128<short>  right, byte shift);
        public static Vector128<int>    ShiftRightAndInsert(Vector128<int>    left, Vector128<int>    right, byte shift);
        public static Vector128<long>   ShiftRightAndInsert(Vector128<long>   left, Vector128<long>   right, byte shift);
    }
}
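
To make the shift-and-insert semantics concrete, here is a minimal scalar sketch for a single 8-bit lane. It is written in C++ as a reference model and is not the proposed C# API. It assumes that `left` corresponds to the destination register, whose untouched bits are preserved, and that `right` is the value being shifted in. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// SLI on one 8-bit lane: shift `right` left by `shift` and insert it into `left`,
// keeping the low `shift` bits of `left`.
std::uint8_t ShiftLeftLogicalAndInsert8(std::uint8_t left, std::uint8_t right, unsigned shift)
{
    assert(shift <= 7); // valid SLI shifts for 8-bit lanes
    const std::uint8_t inserted = static_cast<std::uint8_t>(0xFFu << shift); // bits taken from `right`
    return static_cast<std::uint8_t>((left & ~inserted) | ((right << shift) & inserted));
}

// SRI on one 8-bit lane: shift `right` right by `shift` and insert it into `left`,
// keeping the high `shift` bits of `left`.
std::uint8_t ShiftRightAndInsert8(std::uint8_t left, std::uint8_t right, unsigned shift)
{
    assert(shift >= 1 && shift <= 8); // valid SRI shifts for 8-bit lanes
    const std::uint8_t inserted = static_cast<std::uint8_t>(0xFFu >> shift); // bits taken from `right`
    return static_cast<std::uint8_t>((left & ~inserted) | ((right >> shift) & inserted));
}
```

For example, in this model ShiftLeftLogicalAndInsert8(0xFF, 0x01, 4) yields 0x1F and ShiftRightAndInsert8(0xFF, 0x80, 4) yields 0xF8.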

TamarChristinaArm · 2020-04-27T10:58:57Z

@echesakovMSFT I see Extract has been committed. Are the prerequisites I need for this in now?

echesakov · 2020-04-27T17:12:21Z

@TamarChristinaArm Yes, I think so.

TamarChristinaArm · 2020-04-30T12:26:49Z

@tannergooding @echesakovMSFT I believe we need to move the entries in ArmBase to AdvSimd right? This PR was created before we discussed this with other intrinsics.

tannergooding · 2020-04-30T14:56:22Z

Yes, I believe so as VSRI requires "Enabling Advanced SIMD and floating-point support", which may not always be possible.

TamarChristinaArm · 2020-05-01T14:54:22Z

@echesakovMSFT the comment in BuildHWIntrinsic on mayNeedBranchTargetReg confuses me. I'm trying to understand what it's going to do if you pass a non-const immediate to an intrinsic that has no non-const immediate variant. The comment seems to indicate it'll allocate an extra register, but I don't understand why or what it's trying to do.

echesakov · 2020-05-01T16:31:39Z

@TamarChristinaArm If you pass a non-const immediate operand and the instruction does not have a non-const form (i.e. one that accepts a register operand instead of the immediate operand), the JIT still needs to be able to compile the intrinsic. I can't think of an instruction on Arm64 that has such a fallback, but I believe that is the case on x64 for many instructions (look for intrinsics marked as HW_Flag_MaybeIMM in hwintrinsiclistxarch.h). So the JIT generates a "switch" table that conceptually does this:

switch(nonConstImm)
{
case 0: 
  inst Vd, Vn, #0; 
  break;
case 1: 
  inst Vd, Vn, #1; 
  break;
case 2: 
  inst Vd, Vn, #2; 
  break;
case 3:
  inst Vd, Vn, #3;
  break;
}

For example, for ExtractVector128(Vector128<byte>) the code will look as follows:

; Assembly listing for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector128(System.Runtime.Intrinsics.Vector128`1[Byte],System.Runtime.Intrinsics.Vector128`1[Byte],ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  3,  3   )  simd16  ->  [fp+0x20]   HFA(simd16)  do-not-enreg[XS] addr-exposed
;  V01 arg1         [V01    ] (  3,  3   )  simd16  ->  [fp+0x10]   HFA(simd16)  do-not-enreg[XS] addr-exposed
;  V02 arg2         [V02,T00] (  3,  3   )   ubyte  ->   x0        
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V04 cse0         [V04,T01] (  3,  3   )     int  ->   x0         "CSE - aggressive"
;
; Lcl frame size = 32

G_M54180_IG01:
        A9BD7BFD          stp     fp, lr, [sp,#-48]!
        910003FD          mov     fp, sp
        3D800BA0          str     q0, [fp,#32]
        3D8007A1          str     q1, [fp,#16]
						;; bbWeight=1    PerfScore 3.50
G_M54180_IG02:
        3DC00BB0          ldr     q16, [fp,#32]
        3DC007B1          ldr     q17, [fp,#16]
        53001C00          uxtb    w0, w0
        7100401F          cmp     w0, #16
        540004C2          bhs     G_M54180_IG21
        10000061          adr     x1, [G_M54180_IG03]
        8B000C21          add     x1, x1, x0, LSL #3
        D61F0020          br      x1
						;; bbWeight=1    PerfScore 8.50
G_M54180_IG03:
        6E110210          ext     v16.16b, v16.16b, v17.16b, #0
        1400001E          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG04:
        6E110A10          ext     v16.16b, v16.16b, v17.16b, #1
        1400001C          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG05:
        6E111210          ext     v16.16b, v16.16b, v17.16b, #2
        1400001A          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG06:
        6E111A10          ext     v16.16b, v16.16b, v17.16b, #3
        14000018          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG07:
        6E112210          ext     v16.16b, v16.16b, v17.16b, #4
        14000016          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG08:
        6E112A10          ext     v16.16b, v16.16b, v17.16b, #5
        14000014          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG09:
        6E113210          ext     v16.16b, v16.16b, v17.16b, #6
        14000012          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG10:
        6E113A10          ext     v16.16b, v16.16b, v17.16b, #7
        14000010          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG11:
        6E114210          ext     v16.16b, v16.16b, v17.16b, #8
        1400000E          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG12:
        6E114A10          ext     v16.16b, v16.16b, v17.16b, #9
        1400000C          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG13:
        6E115210          ext     v16.16b, v16.16b, v17.16b, #10
        1400000A          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG14:
        6E115A10          ext     v16.16b, v16.16b, v17.16b, #11
        14000008          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG15:
        6E116210          ext     v16.16b, v16.16b, v17.16b, #12
        14000006          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG16:
        6E116A10          ext     v16.16b, v16.16b, v17.16b, #13
        14000004          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG17:
        6E117210          ext     v16.16b, v16.16b, v17.16b, #14
        14000002          b       G_M54180_IG19
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG18:
        6E117A10          ext     v16.16b, v16.16b, v17.16b, #15
						;; bbWeight=1    PerfScore 1.00
G_M54180_IG19:
        4EB01E00          mov     v0.16b, v16.16b
						;; bbWeight=1    PerfScore 0.50
G_M54180_IG20:
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00
G_M54180_IG21:
        97FEEB1E          bl      CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION
        D43E0000          bkpt    
						;; bbWeight=0    PerfScore 0.00

; Total bytes of code 192, prolog size 8, PerfScore 64.70, (MethodHash=08b52c5b) for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector128(System.Runtime.Intrinsics.Vector128`1[Byte],System.Runtime.Intrinsics.Vector128`1[Byte],ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
; ============================================================

This requires allocating x1 as a branch target register.
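
The address arithmetic in G_M54180_IG02 can be modelled in a few lines. This is an illustrative C++ sketch with hypothetical names, based only on the listing above: each case is two 4-byte instructions (ext followed by b), so consecutive cases are 8 bytes apart, which is where the LSL #3 comes from.

```cpp
#include <cstdint>

// Two 4-byte A64 instructions (ext + b) per case, hence the LSL #3 in the listing.
constexpr std::uint64_t CaseSizeBytes = 8;

// Hypothetical model of the dispatch. Returns false when the immediate is out of
// range (the bhs to the throw helper); otherwise stores the address that
// `br x1` would jump to in *target.
bool TryComputeCaseAddress(std::uint64_t firstCaseAddress, std::uint32_t imm,
                           std::uint32_t caseCount, std::uint64_t* target)
{
    if (imm >= caseCount)
    {
        return false; // cmp w0, #16 / bhs G_M54180_IG21
    }
    *target = firstCaseAddress + imm * CaseSizeBytes; // adr x1, ... / add x1, x1, x0, LSL #3
    return true;
}
```

The computed address has to live in a general-purpose register for the indirect branch, which is the extra register the allocator must reserve.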

However, for cases when the immediate can only be 0 or 1 (e.g. ExtractVector128(Vector128<double>)) we can branch with cbnz and don't need to allocate the additional register, as in the following example:

; Assembly listing for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector128(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],ubyte):System.Runtime.Intrinsics.Vector128`1[Double]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  3,  3   )  simd16  ->  [fp+0x20]   HFA(simd16)  do-not-enreg[XS] addr-exposed
;  V01 arg1         [V01    ] (  3,  3   )  simd16  ->  [fp+0x10]   HFA(simd16)  do-not-enreg[XS] addr-exposed
;  V02 arg2         [V02,T00] (  3,  3   )   ubyte  ->   x0        
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V04 cse0         [V04,T01] (  3,  3   )     int  ->   x0         "CSE - aggressive"
;
; Lcl frame size = 32

G_M55355_IG01:
        A9BD7BFD          stp     fp, lr, [sp,#-48]!
        910003FD          mov     fp, sp
        3D800BA0          str     q0, [fp,#32]
        3D8007A1          str     q1, [fp,#16]
						;; bbWeight=1    PerfScore 3.50
G_M55355_IG02:
        3DC00BB0          ldr     q16, [fp,#32]
        3DC007B1          ldr     q17, [fp,#16]
        53001C00          uxtb    w0, w0
        7100081F          cmp     w0, #2
        54000102          bhs     G_M55355_IG07
        35000060          cbnz    w0, G_M55355_IG04
						;; bbWeight=1    PerfScore 7.00
G_M55355_IG03:
        6E110210          ext     v16.16b, v16.16b, v17.16b, #0
        14000002          b       G_M55355_IG05
						;; bbWeight=1    PerfScore 2.00
G_M55355_IG04:
        6E114210          ext     v16.16b, v16.16b, v17.16b, #8
						;; bbWeight=1    PerfScore 1.00
G_M55355_IG05:
        4EB01E00          mov     v0.16b, v16.16b
						;; bbWeight=1    PerfScore 0.50
G_M55355_IG06:
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00
G_M55355_IG07:
        97FEEB06          bl      CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION
        D43E0000          bkpt    
						;; bbWeight=0    PerfScore 0.00

; Total bytes of code 72, prolog size 8, PerfScore 23.20, (MethodHash=721027c4) for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector128(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],ubyte):System.Runtime.Intrinsics.Vector128`1[Double]
; ============================================================

In your case (i.e. sri and sli), you will need to mark the intrinsics with HW_Category_IMM, compute an upper bound for the immediate operand in HWIntrinsicInfo::lookupImmUpperBound(), and add two cases to the switch in LinearScan::BuildHWIntrinsic that set needBranchTargetReg = !intrin.op3->isContainedIntOrIImmed();
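
As a rough guide to the bounds involved, the sketch below computes the valid immediate range for each instruction from the element size. It is a hypothetical C++ helper written for illustration, not the JIT's lookupImmUpperBound. It assumes the architectural ranges are 0 to esize - 1 for SLI and 1 to esize for SRI, where esize is the element size in bits.

```cpp
#include <cassert>

// Hypothetical type: both bounds are inclusive.
struct ImmRange
{
    unsigned lower;
    unsigned upper;
};

// Assumed ranges: SLI accepts 0..esize-1, SRI accepts 1..esize.
ImmRange ShiftAndInsertImmRange(bool isRightShift, unsigned elementSizeInBytes)
{
    assert(elementSizeInBytes == 1 || elementSizeInBytes == 2 ||
           elementSizeInBytes == 4 || elementSizeInBytes == 8);
    const unsigned esize = elementSizeInBytes * 8;
    if (isRightShift)
    {
        return ImmRange{1, esize}; // SRI
    }
    return ImmRange{0, esize - 1}; // SLI
}
```

Note that under this assumption SRI has a lower bound of 1, which an upper bound alone does not express.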

echesakov · 2020-05-01T17:11:55Z

@TamarChristinaArm I hope my explanation helped. If not, I would value any feedback on how I can rephrase this comment to make it clearer. Also feel free to ping me offline if you want to chat more about this.

TamarChristinaArm · 2020-05-04T12:17:41Z

@echesakovMSFT It did! I hadn't realized the JIT was emitting a runtime dispatch for this case, which makes sense; that's where my initial confusion came from :) Thanks for the explanation!

echesakov · 2020-05-06T20:45:09Z

@TamarChristinaArm since you are working on this - I have assigned the issue to you.

msftgits transferred this issue from dotnet/corefx Feb 1, 2020

maryamariyan added the untriaged label Feb 23, 2020

TamarChristinaArm mentioned this issue Mar 3, 2020

Arm64: Add xtn and xtn2 intrinsics codegen, api and tests. #33108

Merged

tannergooding removed the untriaged label Mar 3, 2020

terrajobst added the api-approved label and removed the api-ready-for-review label Mar 3, 2020

BruceForstall added this to To do general in Hardware Intrinsics via automation Apr 16, 2020

BruceForstall moved this from To do general to To do arm64 in Hardware Intrinsics Apr 16, 2020

BruceForstall moved this from To do arm64 to API design in Hardware Intrinsics Apr 16, 2020

echesakov assigned TamarChristinaArm May 6, 2020

echesakov moved this from API design to In progress in Hardware Intrinsics May 6, 2020

JulieLeeMSFT added this to the 5.0 milestone May 18, 2020

TamarChristinaArm mentioned this issue May 21, 2020

Implement Shift and Inserts scalar and SIMD intrinsics. #36818

Merged

echesakov closed this as completed in #36818 Jun 11, 2020

Hardware Intrinsics automation moved this from In progress to Done Jun 11, 2020

ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
