Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ShuffleUnsafe methods #99596

Closed
wants to merge 54 commits into from
Closed

Implement ShuffleUnsafe methods #99596

wants to merge 54 commits into from

Conversation

hamarb123
Copy link
Contributor

@hamarb123 hamarb123 commented Mar 12, 2024

  • Fixes [API Proposal]: Vector128.ShuffleUnsafe #81609 (only implements for byte, sbyte)
  • Support optimised Shuffle with variable indices on coreclr (for all types)
  • Support optimised cross-lane Shuffle on Vector256 (with signed/unsigned bytes and shorts)
  • Optimise Vector256 shuffle with Avx2.Shuffle (for signed/unsigned bytes and shorts)

Todo tasks:

  • Optimise variable shuffle for short/int/long (and other types of same size)
  • Implement VectorXXX.ShuffleUnsafe for vectors of other element types
  • Implement additional tests
  • Validate that the variable shuffles are actually faster for the larger types

Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Mar 12, 2024
@hamarb123
Copy link
Contributor Author

hamarb123 commented Mar 12, 2024

Benchmark results of my AVX2 code (ShuffleUnsafe256):
image
https://gist.github.com/hamarb123/c4e994a896653a46c2788df4cd6bfc74

Yes, this is a very micro benchmark, but results are pretty reproducible on my machine (within ~%10 usually), and are probably pretty close to reality since it should be pretty quick (but obviously this doesn't measure the overhead with surrounding code due to more pipeline usage, etc.).

Copy link
Member

@MihaZupan MihaZupan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for looking into this again.

We're already using the so-far-internal Vector128.ShuffleUnsafe in a bunch of places. Should we be using Vector256.ShuffleUnsafe somewhere?

@hamarb123
Copy link
Contributor Author

We're already using the so-far-internal Vector128.ShuffleUnsafe in a bunch of places. Should we be using Vector256.ShuffleUnsafe somewhere?

It seems to me that all the current uses of Vector128.ShuffleUnsafe have Avx2.Shuffle on their V256 branch, which indicates they do not need the additional guarantees that Vector256.ShuffleUnsafe provides (around between-lane shuffling working), so I don't think there is currently spot we should be using it in dotnet/runtime currently, unless I've missed one.

- Used when non-constant indices are given to Shuffle, and an intrinsic implementation of ShuffleUnsafe is available
- Optimises byte and sbyte cases only
- Re-implement ShuffleUnsafe in JIT
- Also re-implement the Shuffle optimisations in JIT
- Add basic JIT support for mono (ShuffleUnsafe just gets the same implementation as Shuffle)
- Implement support for variable index Shuffle & ShuffleUnsafe (for bytes)
- Implement support for cross-lane shuffling in JIT (for bytes)
- Optimise Vector128 shuffle for bytes in JIT to use Avx2.Shuffle
@hamarb123
Copy link
Contributor Author

Can someone please check I won't accidentally regress mono :)
I've attempted to implement a very basic ShuffleUnsafe for mono, but it might need to be more advanced.

@filipnavara filipnavara added the tenet-performance Performance related issue label Mar 25, 2024
@MihaZupan
Copy link
Member

Re: 9868e73
SearchValues rely on Vector128.ShuffleUnsafe using exactly Ssse3.Shuffle semantics whenever Ssse3.IsSupported (hence the comment). If we're not making that guarantee anymore, we'll need to stop using this helper there.

@hamarb123
Copy link
Contributor Author

hamarb123 commented Mar 26, 2024

Re: 9868e73 SearchValues rely on Vector128.ShuffleUnsafe using exactly Ssse3.Shuffle semantics whenever Ssse3.IsSupported (hence the comment). If we're not making that guarantee anymore, we'll need to stop using this helper there.

I don't think there's any issue with the runtime relying on specific behaviour.

For external libraries, I think one of the following approaches makes sense:

  • We document the current behaviour and say that it's only guaranteed to be exactly this in this .NET version; future ones might add additional optimisations
  • Or we just don't document how they work and which specific instructions they use

I think the approach needs to be consistent for all of them, so I removed the Vector128.ShuffleUnsafe comment. I can include it within the method implementation if you'd like, as a reference for the runtime developers.

Another option, which I briefly mentioned in a comment somewhere, is to expose a variant like VectorXXX.ShuffleUnsafeHighZero - which would mean that if the high bit is set (of whatever byte we think is most convenient for the larger types, probably the first one in memory I would think / or of the value, would have to think more about what would be the best solution), then you're guaranteed to get 0. This would only require some of the implementations to have special handling to ensure this, while the rest could remain the same, still giving greater performance than VectorXXX.Shuffle in the general case, but also allowing external consumers of the API to guarantee they can get 0 when they need it.

@MihaZupan
Copy link
Member

I'm fine with only documenting "anything above 15 is UB".
Just wanted to note that this is behavior we are currently relying on internally, and we'll want to either keep the behavior or account for it with internal callers.

@hamarb123
Copy link
Contributor Author

hamarb123 commented Mar 26, 2024

I'm fine with only documenting "anything above 15 is UB". Just wanted to note that this is behavior we are currently relying on internally, and we'll want to either keep the behavior or account for it with internal callers.

Yes, I've been careful to not use the AVX-512 one for this method for this reason. I will add a comment at some point to explain this in the method (assuming I don't forget).

Copy link
Contributor

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@hamarb123
Copy link
Contributor Author

hamarb123 commented May 2, 2024

I'll be able to work on this in a few weeks when I have a few days to sit down and implement all the remaining things & make sure it's all correct :) (or maybe the sporadic day here and there also before that). I will ping someone when that day comes to get it reopened (or you can just reopen it now if you want).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Runtime.Intrinsics community-contribution Indicates that the PR has been added by a community member new-api-needs-documentation tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[API Proposal]: Vector128.ShuffleUnsafe
6 participants