Begin using the xplat hardware intrinsics in BitArray #63722
Conversation
Tagging subscribers to this area: @dotnet/area-system-collections
```csharp
{
    // JIT does not support code hoisting for SIMD yet
    Vector256<byte> zero = Vector256<byte>.Zero;
    fixed (bool* ptr = values)
```
We now expose helper intrinsics that directly operate on `ref`: `LoadUnsafe(ref T source, nuint elementOffset)`. This helps avoid pinning, which can have measurable overhead for small counts and which can hinder the GC in the case of long inputs. It likewise helps improve readability over the pattern we were already utilizing in parts of the BCL, where we were using `Unsafe.ReadUnaligned` + `Unsafe.Add` + `Unsafe.As`.
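As a sketch of the contrast (illustrative helper names, not code from this PR), assuming a `bool[]` input like the one above:

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class LoadPatterns
{
    // Old BCL pattern: reinterpret the reference and read unaligned via Unsafe.
    static Vector256<byte> LoadOld(bool[] values, nuint offset)
    {
        ref byte start = ref Unsafe.As<bool, byte>(ref MemoryMarshal.GetArrayDataReference(values));
        return Unsafe.ReadUnaligned<Vector256<byte>>(ref Unsafe.Add(ref start, offset));
    }

    // New pattern: one helper that takes a ref plus an element offset; no pinning needed.
    static Vector256<byte> LoadNew(bool[] values, nuint offset)
    {
        ref byte start = ref Unsafe.As<bool, byte>(ref MemoryMarshal.GetArrayDataReference(values));
        return Vector256.LoadUnsafe(ref start, offset);
    }
}
```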
```csharp
Vector256<byte> vector = Vector256.LoadUnsafe(ref value, i);
Vector256<byte> isFalse = Vector256.Equals(vector, Vector256<byte>.Zero);

uint result = isFalse.ExtractMostSignificantBits();
```
`ExtractMostSignificantBits` behaves just like `MoveMask` on x86/x64. This is also exposed by WASM as `bitmask`.
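A quick hypothetical of consuming the mask the same way a `MoveMask` result would be used (not from the diff):

```csharp
using System.Runtime.Intrinsics;

// Equals sets each matching byte lane to 0xFF, so the extracted mask has one bit
// per lane; all 32 bits set means every lane compared equal (here: all zero).
Vector256<byte> isFalse = Vector256.Equals(Vector256<byte>.Zero, Vector256<byte>.Zero);
uint mask = isFalse.ExtractMostSignificantBits();
bool allFalse = mask == uint.MaxValue;
```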
```csharp
if (Avx2.IsSupported)
ref byte value = ref Unsafe.As<bool, byte>(ref MemoryMarshal.GetArrayDataReference<bool>(values));

if (Vector256.IsHardwareAccelerated)
```
I've preserved the `Vector256` path given that it was already here and I would presume has undergone the necessary checks to ensure it is worth doing on x86/x64. Arm64 doesn't support `V256` and so will only go down the `V128` codepath.
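The resulting tiering then looks roughly like this (a simplified sketch, not the PR's exact code):

```csharp
if (Vector256.IsHardwareAccelerated)
{
    // x86/x64 with AVX2: process 32 bytes per iteration.
}
else if (Vector128.IsHardwareAccelerated)
{
    // x86/x64 with SSE2, or Arm64 with AdvSimd: process 16 bytes per iteration.
}
else
{
    // Scalar fallback.
}
```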
```csharp
Vector128<int> rightVec = Sse2.LoadVector128(rightPtr + i);
Sse2.Store(leftPtr + i, Sse2.And(leftVec, rightVec));
}
Vector256<int> result = Vector256.LoadUnsafe(ref left, i) & Vector256.LoadUnsafe(ref right, i);
```
The xplat helper intrinsics support operators and so we can make this "more readable" by just using `x & y`.
```csharp
Sse2.Store(leftPtr + i, Sse2.And(leftVec, rightVec));
}
Vector256<int> result = Vector256.LoadUnsafe(ref left, i) & Vector256.LoadUnsafe(ref right, i);
result.StoreUnsafe(ref left, i);
```
Storing a vector likewise no longer requires pinning or complex `Unsafe` logic.
```csharp
}
}
else if (Sse2.IsSupported)
else if (Vector128.IsHardwareAccelerated)
```
In what situation would we also want a Vector64 code path?
`Vector64` can be beneficial for cases where you know the inputs are going to be commonly small and for handling the "trailing" elements (rather than falling back to a scalar loop or manually unrolled loop). We aren't currently taking advantage of this anywhere and it would need some more work/profiling to show the extra complexity is worthwhile.
- The extra complexity isn't from using `Vector64<T>` but rather from changing out the "fallback" from `for (; index < length; index++)` to using `Vector64<T>` or `Vector128<T>` with appropriate backtracking and masking, as sketched below.
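For concreteness, here is a minimal sketch of that backtracking idea (assuming at least one full vector of input and an idempotent operation like AND; `AndVectorized` is a hypothetical name, not code from this PR):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class AndHelper
{
    // ANDs 'right' into 'left'. Assumes left.Length == right.Length >= Vector128<int>.Count.
    static void AndVectorized(Span<int> left, ReadOnlySpan<int> right)
    {
        ref int l = ref MemoryMarshal.GetReference(left);
        ref int r = ref MemoryMarshal.GetReference(right);
        nuint length = (nuint)left.Length;
        nuint i = 0;

        for (; i + (nuint)Vector128<int>.Count <= length; i += (nuint)Vector128<int>.Count)
        {
            (Vector128.LoadUnsafe(ref l, i) & Vector128.LoadUnsafe(ref r, i)).StoreUnsafe(ref l, i);
        }

        if (i != length)
        {
            // Backtrack so the final vector ends exactly at the end of the span. The
            // overlapping lanes get recomputed, which is harmless for idempotent ops.
            i = length - (nuint)Vector128<int>.Count;
            (Vector128.LoadUnsafe(ref l, i) & Vector128.LoadUnsafe(ref r, i)).StoreUnsafe(ref l, i);
        }
    }
}
```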
Force-pushed from 7e82986 to 9b6ea8d.
Rebased onto main to pick up some important fixes. Will share diffs and perf numbers in a little bit.
Jit Diff: most of the diff is from being able to remove the pinning.
Is that because the code you wrote is better and we could have done the same thing with the intrinsics directly, or is this something that could be improved in the JIT's handling of the intrinsics as well?
We could've written a better implementation here, but the original authors likely weren't aware of the available optimization. Everything the xplat helper intrinsics do is implemented directly in terms of the underlying platform-specific intrinsics, so there is nothing they can do that you cannot also do yourself. The benefit is that you don't have to work out the optimal approach for each and every platform. They also provide a simplification for working with unpinned memory, as you can just use `LoadUnsafe` and `StoreUnsafe`.
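To illustrate the equivalence (a hypothetical comparison, not code from the PR), the per-platform spelling and the xplat spelling compile down to the same instructions:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

static class AndExamples
{
    // Hand-written per-platform dispatch.
    static Vector128<int> AndPlatformSpecific(Vector128<int> x, Vector128<int> y)
    {
        if (Sse2.IsSupported)
        {
            return Sse2.And(x, y);    // x86/x64: PAND
        }
        if (AdvSimd.IsSupported)
        {
            return AdvSimd.And(x, y); // Arm: AND (vector)
        }
        throw new PlatformNotSupportedException();
    }

    // The xplat helper; the JIT emits the same instruction on each platform,
    // with a software fallback where no acceleration exists.
    static Vector128<int> AndXplat(Vector128<int> x, Vector128<int> y) => x & y;
}
```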
Not a huge difference on perf, but it is there, likely mostly from removing the big pinning blocks/logic.
Easier to read + optimized for all platforms automatically + faster as well. Beautiful.
This should also be ready to merge pending area owner sign-off: @eiriktsarpalis @krwq @layomia
LGTM wrt area ownership, chatted offline with @tannergooding for basic overview of changes.
Arm64 improvements: dotnet/perf-autofiling-issues#3437
This itself isn't a significant change, but it does represent a significant stepping stone for .NET 7 by simplifying some of the existing hardware intrinsic usage to use the "Cross Platform Hardware Intrinsics" as described in #49397.

The user story here is that many libraries want to write performant code, and the number of platforms that may need to be considered is constantly increasing. A few years ago, RyuJIT only supported x86 and x64, which were similar enough that the same code could support both. However, beginning in .NET 5, we started adding the same SIMD acceleration to ARM64 and the number of platforms expanded. Because ARM64 was a new platform for RyuJIT, many users did not have hardware available on which to test, and moreover may not have been familiar with some of the intricacies of the platform, which sometimes made providing equivalent support difficult. This was made worse by the fact that the code paths supporting the full set of x86, x64, and ARM64 were often very similar, generally differing just in ISA (`Sse` on x86/x64 vs `AdvSimd` on ARM64) or even naming conventions (`Horizontal` on x86/x64 vs `Pairwise` on ARM64). Finally, there are potentially even more platforms that will be adding support in the future (such as WASM), and even existing platforms that Mono supports which likewise have their own SIMD support.

The cross-platform hardware intrinsic helper APIs are the solution. These APIs provide functionality common to all the SIMD-supporting platforms on the fixed-size hardware intrinsic ABI types (`Vector64<T>`, `Vector128<T>`, and `Vector256<T>`). This gives users a good understanding of the potential performance characteristics and iteration patterns, lets them utilize APIs that cannot easily be exposed on existing variable-length SIMD types (such as `Vector<T>`), and lets them trivially fall back to platform-specific intrinsics where that extra bit of perf can be grabbed from functionality only available on a single platform, since they are already using the types that the platform-specific intrinsics require.

`BitArray` is just the first of the BCL APIs to take advantage of these new helper APIs, and it shows how simple supporting accelerated SIMD code on all the target architectures can be.
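As an illustration of that last point (a hypothetical sketch, not code from this PR), mostly-xplat code can drop down to a platform-specific intrinsic for a single step precisely because both operate on the same vector types:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class SumExample
{
    // Sums the four int lanes of a vector.
    static int SumLanes(Vector128<int> v)
    {
        if (AdvSimd.Arm64.IsSupported)
        {
            // Arm64 exposes a single horizontal add-across instruction (ADDV).
            return AdvSimd.Arm64.AddAcross(v).ToScalar();
        }

        // Generic path using only xplat shuffles and adds.
        Vector128<int> t = v + Vector128.Shuffle(v, Vector128.Create(2, 3, 0, 1));
        t += Vector128.Shuffle(t, Vector128.Create(1, 0, 3, 2));
        return t.ToScalar();
    }
}
```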