Optimize System.Runtime.Intrinsics using arm64 intrinsics #33496

BruceForstall · 2020-03-11T21:38:17Z

This item tracks converting the APIs in the System.Runtime.Intrinsics namespace to use arm64 intrinsics.

Related: #33308

tannergooding · 2020-03-11T21:39:34Z

@BruceForstall, how does this compare to #33495?

tannergooding · 2020-03-11T21:40:42Z

Ah, based on #33308, this should be System.Runtime.Intrinsics and the original comment should be updated.

tannergooding · 2020-03-24T17:45:26Z

@TamarChristinaArm, is there an existing reference for how to efficiently create a Vector64/Vector128 from non-constant inputs?

We have both of the following kinds of methods:

public static unsafe Vector128<byte> Create(byte value);
public static unsafe Vector128<byte> Create(byte e0, byte e1, byte e2, byte e3, byte e4, byte e5, byte e6, byte e7, byte e8, byte e9, byte e10, byte e11, byte e12, byte e13, byte e14, byte e15);

In the case of the former, the given value is duplicated to all elements in the Vector.
In the case of the latter, we are given the value for each element separately.
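A minimal sketch of the difference between the two overloads, assuming .NET 5 or later (for top-level statements); the variable names are illustrative and not from the thread:

```csharp
using System;
using System.Runtime.Intrinsics;

// Broadcast overload: the single value is copied into all 16 byte lanes.
Vector128<byte> broadcast = Vector128.Create((byte)7);

// Per-element overload: each of the 16 lanes gets its own value.
Vector128<byte> perElement = Vector128.Create(
    (byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);

Console.WriteLine(broadcast.GetElement(15));  // prints 7
Console.WriteLine(perElement.GetElement(15)); // prints 15
```

Both calls are portable; which instructions they turn into is up to the JIT on each platform.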

I don't see any existing reference that can be used on the C++ side. But, on the x86/x64 side the former is basically a single broadcast instruction and the latter is a series of inserts.

Would the same be the correct thing to do for ARM64?

TamarChristinaArm · 2020-03-24T18:55:26Z

> @TamarChristinaArm, is there an existing reference for how to efficiently create a Vector64/Vector128 from non-constant inputs?

Hmm not that I'm aware of, though we have a very limited number of instructions for this.

> In the case of the former, the given value is duplicated to all elements in the Vector.
> In the case of the latter, we are given the value for each element separately.
>
> I don't see any existing reference that can be used on the C++ side. But, on the x86/x64 side the former is basically a single broadcast instruction and the latter is a series of inserts.
>
> Would the same be the correct thing to do for ARM64?

Yeah, the former case is just a dup, and the latter is indeed a series of ins instructions, where the first element is placed with an fmov that creates the vector.
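The dup versus fmov-plus-ins shape described above can be sketched with the explicit Arm intrinsics, assuming .NET 5 or later. The instruction names in the comments are the ones mentioned in this thread; the code the JIT actually emits may differ, and the element values are made up for illustration. The non-Arm branch only exists so the sketch runs anywhere:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

int e0 = 10, e1 = 20, e2 = 30, e3 = 40;

Vector128<int> dup;
Vector128<int> built;

if (AdvSimd.IsSupported)
{
    // Broadcast case: one value replicated to every lane (dup).
    dup = AdvSimd.DuplicateToVector128(e0);

    // Per-element case: the first value creates the vector (fmov);
    // the upper lanes are undefined until they are overwritten below.
    built = Vector128.CreateScalarUnsafe(e0);
    // Each remaining value is inserted into its own lane (ins).
    built = AdvSimd.Insert(built, 1, e1);
    built = AdvSimd.Insert(built, 2, e2);
    built = AdvSimd.Insert(built, 3, e3);
}
else
{
    // Portable fallback so the sketch also runs on non-Arm hardware.
    dup = Vector128.Create(e0);
    built = Vector128.Create(e0, e1, e2, e3);
}

Console.WriteLine(dup);
Console.WriteLine(built);
```

The lane indices passed to Insert are constants on purpose, so that each call has a chance of becoming a single instruction.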

Though the context they're used in is important, since e.g. doing a load followed by Create(byte value); should ideally result in a ld1r.

tannergooding · 2020-03-24T19:00:03Z

> Though the context they're used in is important, since e.g. doing a load followed by Create(byte value); should ideally result in a ld1r.

Right, and if all or most of the inputs are constant, we could just construct a 64-bit or 128-bit constant (or simplify codegen in other ways) and load that instead.

But those should hopefully be optimizations handled in the JIT 😄

TamarChristinaArm · 2020-03-24T19:05:35Z

Yup :) And if you're creating one of the YxZ_t types, like int32x4x4, then the Create call itself should be a no-op: where it can, the register allocator should just arrange for the values to be placed in the correct registers when they're created, so the call is zero cost :)

kunalspathak · 2020-06-03T23:19:13Z

@BruceForstall - All the APIs under System.Runtime.Intrinsics are optimized. Thank you @tannergooding, @echesakovMSFT, and @TamarChristinaArm for your valuable feedback throughout.

BruceForstall · 2020-06-03T23:52:47Z

@kunalspathak That's great! Thanks for all the work!

BruceForstall added arch-arm64 area-System.Numerics labels Mar 11, 2020

BruceForstall added this to the 5.0 milestone Mar 11, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Mar 11, 2020

BruceForstall changed the title ~~Optimize System.Numerics.Intrinsics~~ Optimize System.Numerics.Intrinsics using arm64 intrinsics Mar 11, 2020

BruceForstall assigned kunalspathak Mar 11, 2020

tannergooding removed the untriaged New issue has not been triaged by the area owner label Mar 11, 2020

BruceForstall mentioned this issue Mar 11, 2020

Optimize library code using arm64 intrinsics #33308

Closed

BruceForstall changed the title ~~Optimize System.Numerics.Intrinsics using arm64 intrinsics~~ Optimize System.Runtime.Intrinsics using arm64 intrinsics Mar 18, 2020

tannergooding mentioned this issue Mar 20, 2020

Vectorise BitArray for ARM64 #33749

Merged

tannergooding mentioned this issue Apr 2, 2020

Optimize System.Numerics.BitOperations using arm64 intrinsics #33495

Closed

kunalspathak mentioned this issue Apr 3, 2020

Add fmov arm64 intrinsic in JIT to implement Vector*.CreateScalarUnsafe API #34485

Closed

BruceForstall added this to To do general in Hardware Intrinsics via automation Apr 16, 2020

BruceForstall moved this from To do general to To do arm64 in Hardware Intrinsics Apr 16, 2020

john-h-k mentioned this issue Apr 24, 2020

Optimize System.Buffers for arm64 using cross-platform intrinsics #35033

Closed

2 tasks

This was referenced Apr 28, 2020

ARM64 intrinsic support for Vector64.Create() and Vector128.Create() #35590

Merged

Optimize Vector64<T>.ToScalar() and Vector128<T>.ToScalar() #35736

Closed

kunalspathak mentioned this issue May 9, 2020

Optimize ToScalar() and GetElement() to use arm64 intrinsic #36156

Merged

kunalspathak mentioned this issue May 20, 2020

Optimize ToVector128, ToVector128Unsafe and Vector128.GetLower() #36732

Merged

This was referenced Jun 1, 2020

Optimize WithLower, WithUpper, Create, AsInt64, AsUInt64, AsDouble with ARM64 hardware intrinsics #37139

Merged

Optimize AsVector, AsVector128, GetUpper, As and WithElement with ARM64 intrinsics #37338

Merged

kunalspathak closed this as completed Jun 3, 2020

Hardware Intrinsics automation moved this from To do arm64 to Done Jun 3, 2020

ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
