Optimize System.Runtime.Intrinsics using arm64 intrinsics #33496

Closed
BruceForstall opened this issue Mar 11, 2020 · 8 comments

Comments

@BruceForstall
Member

BruceForstall commented Mar 11, 2020

This item tracks optimizing the APIs in the System.Runtime.Intrinsics namespace to use arm64 intrinsics.

Related: #33308

@BruceForstall added this to the 5.0 milestone Mar 11, 2020
@Dotnet-GitSync-Bot added the "untriaged" label (New issue has not been triaged by the area owner) Mar 11, 2020
@BruceForstall changed the title from "Optimize System.Numerics.Intrinsics" to "Optimize System.Numerics.Intrinsics using arm64 intrinsics" Mar 11, 2020
@tannergooding
Member

@BruceForstall, how does this compare to #33495?

@tannergooding removed the "untriaged" label (New issue has not been triaged by the area owner) Mar 11, 2020
@tannergooding
Member

Ah, based on #33308, this should be System.Runtime.Intrinsics and the original comment should be updated.

@BruceForstall changed the title from "Optimize System.Numerics.Intrinsics using arm64 intrinsics" to "Optimize System.Runtime.Intrinsics using arm64 intrinsics" Mar 18, 2020
@tannergooding
Member

@TamarChristinaArm, is there an existing reference for how to efficiently create a Vector64/Vector128 from non-constant inputs?

We have methods like both of the following:

public static unsafe Vector128<byte> Create(byte value);
public static unsafe Vector128<byte> Create(byte e0, byte e1, byte e2, byte e3, byte e4, byte e5, byte e6, byte e7, byte e8, byte e9, byte e10, byte e11, byte e12, byte e13, byte e14, byte e15);

In the case of the former, the given value is duplicated to all elements in the Vector.
In the case of the latter, we are given the value for each element separately.

I don't see any existing reference that can be used on the C++ side. But, on the x86/x64 side the former is basically a single broadcast instruction and the latter is a series of inserts.

Would the same be the correct thing to do for ARM64?
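For readers following along, here is a minimal sketch of the two overloads being discussed. It uses only the portable `Vector128.Create` APIs; the class name `CreateOverloads` and the helpers `Splat` and `Ramp` are hypothetical names chosen for illustration and are not part of the thread or the runtime.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper class, for illustration only.
public static class CreateOverloads
{
    // Broadcast overload: all 16 byte lanes receive the same value.
    public static Vector128<byte> Splat(byte value) => Vector128.Create(value);

    // Per-element overload: each of the 16 lanes is supplied separately.
    // Here the lanes hold start, start + 1, ..., start + 15 (wrapping on overflow).
    public static Vector128<byte> Ramp(byte start) => Vector128.Create(
        start,               (byte)(start + 1),  (byte)(start + 2),  (byte)(start + 3),
        (byte)(start + 4),   (byte)(start + 5),  (byte)(start + 6),  (byte)(start + 7),
        (byte)(start + 8),   (byte)(start + 9),  (byte)(start + 10), (byte)(start + 11),
        (byte)(start + 12),  (byte)(start + 13), (byte)(start + 14), (byte)(start + 15));
}
```

With these helpers, `CreateOverloads.Splat(7).GetElement(15)` reads back 7 and `CreateOverloads.Ramp(10).GetElement(3)` reads back 13, which is the semantic difference the question is about.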

@TamarChristinaArm
Contributor

> @TamarChristinaArm, is there an existing reference for how to efficiently create a Vector64/Vector128 from non-constant inputs?

Hmm not that I'm aware of, though we have a very limited number of instructions for this.

> In the case of the former, the given value is duplicated to all elements in the Vector.
> In the case of the latter, we are given the value for each element separately.
>
> I don't see any existing reference that can be used on the C++ side. But, on the x86/x64 side the former is basically a single broadcast instruction and the latter is a series of inserts.
>
> Would the same be the correct thing to do for ARM64?

Yeah, the former case is just a dup, and the latter is indeed a series of ins instructions, where the first one is an fmov to create the vector.
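As a rough sketch of what that answer describes, the two shapes can be written with the `AdvSimd` hardware intrinsics. This is an illustration under stated assumptions, not the runtime's actual implementation: the class name `Arm64CreateSketch` and its method names are hypothetical, `int` lanes are used only to keep the example short, and the instruction named in each comment is what the intrinsic is expected to map to rather than something the code can guarantee.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// Hypothetical helper class, for illustration only.
public static class Arm64CreateSketch
{
    // Broadcast one value to every lane.
    public static Vector128<int> Broadcast(int value)
    {
        if (AdvSimd.IsSupported)
        {
            // Expected to map to a single `dup` from a general-purpose register.
            return AdvSimd.DuplicateToVector128(value);
        }

        // Portable fallback for non-arm64 hardware.
        return Vector128.Create(value);
    }

    // Build a vector from four separately supplied lanes.
    public static Vector128<int> FromElements(int e0, int e1, int e2, int e3)
    {
        if (AdvSimd.IsSupported)
        {
            // Seed lane 0 (the move that "creates the vector"); the upper lanes
            // are undefined here, but every one of them is overwritten below.
            Vector128<int> v = Vector128.CreateScalarUnsafe(e0);

            // Each Insert with a constant index is expected to map to one `ins`.
            v = AdvSimd.Insert(v, 1, e1);
            v = AdvSimd.Insert(v, 2, e2);
            v = AdvSimd.Insert(v, 3, e3);
            return v;
        }

        // Portable fallback for non-arm64 hardware.
        return Vector128.Create(e0, e1, e2, e3);
    }
}
```

On hardware without AdvSimd the fallback branches run, so `Arm64CreateSketch.FromElements(1, 2, 3, 4)` yields the same lanes as `Vector128.Create(1, 2, 3, 4)` either way.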

Though the context they're used in is important, since e.g. doing a load followed by Create(byte value); should ideally result in a ld1r.
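The load-then-broadcast case can also be expressed directly. The sketch below assumes the `AdvSimd.LoadAndReplicateToVector128` intrinsic, which is expected to map to `ld1r`; the class name `LoadBroadcastSketch` and the span-plus-index signature are hypothetical choices for illustration, and the code needs unsafe blocks enabled to compile.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// Hypothetical helper class, for illustration only.
public static class LoadBroadcastSketch
{
    // Reads one byte from memory and replicates it to all 16 lanes.
    public static unsafe Vector128<byte> LoadAndBroadcast(ReadOnlySpan<byte> source, int index)
    {
        if ((uint)index >= (uint)source.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(index));
        }

        if (AdvSimd.IsSupported)
        {
            fixed (byte* p = source)
            {
                // Load and replicate in one step (expected to map to `ld1r`).
                return AdvSimd.LoadAndReplicateToVector128(p + index);
            }
        }

        // Portable fallback: a scalar load followed by a broadcast.
        return Vector128.Create(source[index]);
    }
}
```

The point made in the thread is that the JIT would ideally recognize the fallback pattern (a load feeding `Create(byte value)`) and emit the same single instruction without the caller writing the intrinsic by hand.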

@tannergooding
Member

> Though the context they're used in is important, since e.g. doing a load followed by Create(byte value); should ideally result in a ld1r.

Right, and if all or most of the inputs are constant, we could just construct a 64-bit or 128-bit constant (or simplify codegen in other ways) and load that instead.

But those should hopefully be optimizations handled in the JIT 😄

@TamarChristinaArm
Contributor

Yup :) And if you're creating one of the YxZ_t types, like int32x4x4, then the Create call itself should be a no-op: the register allocator should just arrange for the values to be placed in the correct registers when they're created, if it can, so it's zero cost :)

@kunalspathak
Member

@BruceForstall - All the APIs under System.Runtime.Intrinsics are optimized. Thank you @tannergooding, @echesakovMSFT and @TamarChristinaArm for your valuable feedback throughout.

Hardware Intrinsics automation moved this from To do arm64 to Done Jun 3, 2020
@BruceForstall
Member Author

@kunalspathak That's great! Thanks for all the work!

@ghost locked as resolved and limited conversation to collaborators Dec 10, 2020