
Begin using the xplat hardware intrinsics in BitArray #63722

Merged (2 commits), Feb 2, 2022

Conversation

tannergooding (Member) commented Jan 13, 2022:

This isn't a significant change by itself, but it represents an important stepping stone for .NET 7 by simplifying some of the existing hardware intrinsic usage to use the "Cross Platform Hardware Intrinsics" described in #49397.

The user story here is that many libraries want to write performant code, and the number of platforms that may need to be considered keeps growing. A few years ago RyuJIT only supported x86 and x64, which were similar enough that the same code could support both. Beginning in .NET 5, however, we started adding the same SIMD acceleration to Arm64 and the number of platforms expanded. Because it was a new platform for RyuJIT, many users did not have hardware available on which to test, and often were not familiar with the intricacies of the platform, which sometimes made providing equivalent support difficult. This was made worse by the fact that the code paths supporting x86, x64, and Arm64 were often very similar, generally differing only in ISA (Sse on x86/x64 vs AdvSimd on Arm64) or in naming conventions (Horizontal on x86/x64 vs Pairwise on Arm64). Finally, there are potentially more platforms that will add support in the future (such as WASM), as well as existing platforms supported by Mono that likewise have their own SIMD support.

The cross-platform hardware intrinsic helper APIs are the solution. These APIs provide functionality common to all the SIMD-supporting platforms on the fixed-size hardware intrinsic ABI types (Vector64<T>, Vector128<T>, and Vector256<T>). This gives users a good understanding of the potential performance characteristics and iteration patterns; it lets them use APIs that cannot easily be exposed on variable-length SIMD types (such as Vector<T>); and, because they are already working with the types the platform-specific intrinsics require (Vector64<T>, Vector128<T>, and Vector256<T>), it lets them trivially fall back to platform-specific intrinsics where that extra bit of perf can be grabbed from functionality only a single platform provides.

BitArray is just the first of the BCL APIs to take advantage of these new helper APIs and it shows how simple supporting accelerated SIMD code on all the target architectures can be.
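To make the shape of that concrete, here is a minimal sketch (not code from this PR) of a loop written against the helpers; the method name and spans are illustrative and assume left and right have equal length:

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class XplatSketch
{
    // Illustrative only: AND 'right' into 'left' using the cross-platform helpers,
    // falling back to scalar code for any remaining elements.
    public static void AndInPlace(Span<int> left, ReadOnlySpan<int> right)
    {
        int i = 0;

        if (Vector128.IsHardwareAccelerated && left.Length >= Vector128<int>.Count)
        {
            ref int l = ref MemoryMarshal.GetReference(left);
            ref int r = ref MemoryMarshal.GetReference(right);

            for (; i <= left.Length - Vector128<int>.Count; i += Vector128<int>.Count)
            {
                Vector128<int> result = Vector128.LoadUnsafe(ref l, (nuint)i) & Vector128.LoadUnsafe(ref r, (nuint)i);
                result.StoreUnsafe(ref l, (nuint)i);
            }
        }

        for (; i < left.Length; i++)
        {
            left[i] &= right[i];
        }
    }
}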

ghost commented Jan 13, 2022:

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Issue Details: null
Author: tannergooding
Assignees: tannergooding
Labels: area-System.Collections
Milestone: -

tannergooding (Member, Author):

CC. @jeffhandley, @danmoseley, @stephentoub

{
// JIT does not support code hoisting for SIMD yet
Vector256<byte> zero = Vector256<byte>.Zero;
fixed (bool* ptr = values)
tannergooding (Member, Author) commented on this diff:

We now expose helper intrinsics that directly operate on ref: LoadUnsafe(ref T source, nuint elementOffset).

This helps avoid pinning, which can have measurable overhead for small counts and which can hinder the GC in the case of long inputs.

It likewise improves readability over the pattern already used in parts of the BCL, where we combined Unsafe.ReadUnaligned + Unsafe.Add + Unsafe.As.
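Roughly, the readability difference being described (illustrative snippet, not from this change):

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static class LoadSketch
{
    // Older style used in parts of the BCL: offset + unaligned read; for non-byte
    // elements an extra Unsafe.As reinterpret is also needed.
    static Vector256<byte> LoadOldStyle(ref byte source, nuint elementOffset) =>
        Unsafe.ReadUnaligned<Vector256<byte>>(ref Unsafe.Add(ref source, (nint)elementOffset));

    // Equivalent load with the new helper: no pinning and no Unsafe plumbing.
    static Vector256<byte> LoadNewStyle(ref byte source, nuint elementOffset) =>
        Vector256.LoadUnsafe(ref source, elementOffset);
}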

Vector256<byte> vector = Vector256.LoadUnsafe(ref value, i);
Vector256<byte> isFalse = Vector256.Equals(vector, Vector256<byte>.Zero);

uint result = isFalse.ExtractMostSignificantBits();
tannergooding (Member, Author) commented on this diff:

ExtractMostSignificantBits behaves just like MoveMask on x86/x64. WASM exposes the same operation as bitmask.
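As a hedged illustration of how that mask is typically consumed (FirstFalseIndex is a hypothetical helper, not something in this PR):

using System.Numerics;
using System.Runtime.Intrinsics;

static class MaskSketch
{
    // Returns the index of the first zero (false) byte in the vector, or -1 if none.
    // Each zero byte produces an all-ones comparison lane, whose most significant
    // bit becomes one bit of 'mask'.
    static int FirstFalseIndex(Vector256<byte> vector)
    {
        Vector256<byte> isFalse = Vector256.Equals(vector, Vector256<byte>.Zero);
        uint mask = isFalse.ExtractMostSignificantBits();
        return mask == 0 ? -1 : BitOperations.TrailingZeroCount(mask);
    }
}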

if (Avx2.IsSupported)
ref byte value = ref Unsafe.As<bool, byte>(ref MemoryMarshal.GetArrayDataReference<bool>(values));

if (Vector256.IsHardwareAccelerated)
tannergooding (Member, Author) commented on this diff:

I've preserved the Vector256 path given that it was already here, and I presume it has undergone the necessary checks to ensure it is worth doing on x86/x64.

Arm64 doesn't support V256 and so will only go down the V128 codepath.
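The dispatch shape being described, as a sketch (branch bodies elided):

using System.Runtime.Intrinsics;

static class DispatchSketch
{
    // Prefer the widest accelerated vector width, then narrow, then scalar.
    // On Arm64, Vector256.IsHardwareAccelerated is false, so only the
    // Vector128 (or scalar) path is taken.
    static void Process(int length)
    {
        if (Vector256.IsHardwareAccelerated && length >= Vector256<int>.Count)
        {
            // 256-bit path (e.g. AVX2 on x86/x64)
        }
        else if (Vector128.IsHardwareAccelerated && length >= Vector128<int>.Count)
        {
            // 128-bit path (e.g. SSE2 on x86/x64, AdvSimd on Arm64)
        }
        else
        {
            // scalar path
        }
    }
}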

Vector128<int> rightVec = Sse2.LoadVector128(rightPtr + i);
Sse2.Store(leftPtr + i, Sse2.And(leftVec, rightVec));
}
Vector256<int> result = Vector256.LoadUnsafe(ref left, i) & Vector256.LoadUnsafe(ref right, i);
tannergooding (Member, Author) commented on this diff:

The xplat helper intrinsics support operators and so we can make this "more readable" by just using x & y.

Sse2.Store(leftPtr + i, Sse2.And(leftVec, rightVec));
}
Vector256<int> result = Vector256.LoadUnsafe(ref left, i) & Vector256.LoadUnsafe(ref right, i);
result.StoreUnsafe(ref left, i);
tannergooding (Member, Author) commented on this diff:

Storing a vector likewise no longer requires pinning or complex Unsafe logic.

}
}
else if (Sse2.IsSupported)
else if (Vector128.IsHardwareAccelerated)
Member:

In what situation would we also want a Vector64 code path?

tannergooding (Member, Author):

Vector64 can be beneficial for cases where you know the inputs are going to be commonly small and for handling the "trailing" elements (rather than falling back to a scalar loop or manually unrolled loop).

We aren't currently taking advantage of this anywhere and it would need some more work/profiling to show the extra complexity is worthwhile.

  • The extra complexity isn't from using Vector64<T> but rather from changing out the "fallback" from for (; index < length; index++) to using Vector64<T> or Vector128<T> with appropriate backtracking and masking
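Not code from this PR, but a sketch of what that backtracking tail handling could look like for an idempotent operation like And, where re-processing a few overlapping elements is harmless:

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class TailSketch
{
    // AND 'right' into 'left' with a final overlapping vector instead of a scalar
    // tail loop. Assumes left and right have equal length.
    static void AndWithVectorTail(Span<int> left, ReadOnlySpan<int> right)
    {
        if (!Vector128.IsHardwareAccelerated || left.Length < Vector128<int>.Count)
        {
            for (int j = 0; j < left.Length; j++) left[j] &= right[j];
            return;
        }

        ref int l = ref MemoryMarshal.GetReference(left);
        ref int r = ref MemoryMarshal.GetReference(right);

        nuint lastBlock = (nuint)(left.Length - Vector128<int>.Count);
        nuint i = 0;

        for (; i < lastBlock; i += (nuint)Vector128<int>.Count)
        {
            (Vector128.LoadUnsafe(ref l, i) & Vector128.LoadUnsafe(ref r, i)).StoreUnsafe(ref l, i);
        }

        // Backtrack so the final vector ends exactly at the last element; up to
        // Count - 1 already-processed elements are AND-ed a second time, which
        // produces the same result.
        (Vector128.LoadUnsafe(ref l, lastBlock) & Vector128.LoadUnsafe(ref r, lastBlock)).StoreUnsafe(ref l, lastBlock);
    }
}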

tannergooding (Member, Author):

Merging main to try and get CI to pass (seems more jobs have passed today).

tannergooding (Member, Author):

Rebased onto main to pick up some important fixes. Will share diffs and perf numbers in a little bit.

tannergooding (Member, Author):

Jit Diff:

Top method improvements (bytes):
        -123 (-17.70% of base) : System.Collections.dasm - BitArray:Xor(BitArray):BitArray:this
        -102 (-15.13% of base) : System.Collections.dasm - BitArray:And(BitArray):BitArray:this
        -102 (-15.13% of base) : System.Collections.dasm - BitArray:Or(BitArray):BitArray:this
         -64 (-16.62% of base) : System.Collections.dasm - BitArray:Not():BitArray:this
         -51 (-4.46% of base) : System.Collections.dasm - BitArray:.ctor(ref):this (3 methods)

Most of the diff is from being able to remove the pinning:

Before (BitArray:And(BitArray):BitArray:this):

						;; bbWeight=0.50 PerfScore 4.88
G_M4042_IG10:
       xor      r11d, r11d
       mov      gword ptr [rsp+30H], rax
       mov      rsi, gword ptr [rsp+30H]
       cmp      dword ptr [rsi+8], 0
       jne      SHORT G_M4042_IG11
       xor      esi, esi
       jmp      SHORT G_M4042_IG12
						;; bbWeight=0.50 PerfScore 4.25
G_M4042_IG11:
       mov      rsi, gword ptr [rsp+30H]
       cmp      dword ptr [rsi+8], 0
       jbe      G_M4042_IG22
       mov      rsi, gword ptr [rsp+30H]
       add      rsi, 16
						;; bbWeight=0.50 PerfScore 3.12
G_M4042_IG12:
       mov      gword ptr [rsp+28H], r8
       mov      rdi, gword ptr [rsp+28H]
       cmp      dword ptr [rdi+8], 0
       jne      SHORT G_M4042_IG13
       xor      edi, edi
       jmp      SHORT G_M4042_IG14
						;; bbWeight=0.50 PerfScore 4.12
G_M4042_IG13:
       mov      rdi, gword ptr [rsp+28H]
       cmp      dword ptr [rdi+8], 0
       jbe      G_M4042_IG22
       mov      rdi, gword ptr [rsp+28H]
       add      rdi, 16
						;; bbWeight=0.50 PerfScore 3.12
G_M4042_IG14:
       lea      ebx, [r10-7]
       test     ebx, ebx
       jbe      SHORT G_M4042_IG16
						;; bbWeight=0.50 PerfScore 0.88
G_M4042_IG15:
       mov      ebp, r11d
       vmovdqu  ymm0, ymmword ptr[rsi+4*rbp]
       vmovdqu  ymm1, ymmword ptr[rdi+4*rbp]
       vpand    ymm0, ymm0, ymm1
       vmovdqu  ymmword ptr[rsi+4*rbp], ymm0
       add      r11d, 8
       cmp      r11d, ebx
       jb       SHORT G_M4042_IG15
						;; bbWeight=4    PerfScore 56.33
G_M4042_IG16:
       xor      rsi, rsi
       mov      gword ptr [rsp+28H], rsi
       mov      gword ptr [rsp+30H], rsi
       cmp      r11d, r10d
       jae      SHORT G_M4042_IG18
       align    [0 bytes for IG17]
						;; bbWeight=0.50 PerfScore 1.75

After:

G_M4042_IG10:
       xor      r11d, r11d
       lea      rsi, bword ptr [rax+16]
       lea      rdi, bword ptr [r8+16]
       lea      ebx, [r10-7]
       test     ebx, ebx
       jbe      SHORT G_M4042_IG12
						;; bbWeight=0.50 PerfScore 1.50
G_M4042_IG11:
       mov      ebp, r11d
       vmovdqu  ymm0, ymmword ptr[rsi+4*rbp]
       vpand    ymm0, ymm0, ymmword ptr[rdi+4*rbp]
       vmovdqu  ymmword ptr[rsi+4*rbp], ymm0
       add      r11d, 8
       cmp      r11d, ebx
       jb       SHORT G_M4042_IG11
						;; bbWeight=4    PerfScore 47.00
G_M4042_IG12:
       cmp      r11d, r10d
       jae      SHORT G_M4042_IG14
       align    [0 bytes for IG13]
						;; bbWeight=0.50 PerfScore 0.62

We also see some improvements because things like Not are implemented "better".

Before:

      xor      r9d, r9d
       vpcmpeqd ymm0, ymm0, ymm0
       mov      gword ptr [rsp+20H], rdx
       test     rdx, rdx
       je       SHORT G_M11410_IG11
       mov      rax, gword ptr [rsp+20H]
       cmp      dword ptr [rax+8], 0
       jne      SHORT G_M11410_IG12
						;; bbWeight=0.50 PerfScore 4.00
G_M11410_IG11:
       xor      eax, eax
       jmp      SHORT G_M11410_IG13
						;; bbWeight=0.50 PerfScore 1.12
G_M11410_IG12:
       mov      rax, gword ptr [rsp+20H]
       cmp      dword ptr [rax+8], 0
       jbe      SHORT G_M11410_IG19
       mov      rax, gword ptr [rsp+20H]
       add      rax, 16
						;; bbWeight=0.50 PerfScore 3.12
G_M11410_IG13:
       lea      r10d, [r8-7]
       test     r10d, r10d
       jbe      SHORT G_M11410_IG15
						;; bbWeight=0.50 PerfScore 0.88
G_M11410_IG14:
       mov      r11d, r9d
       vmovdqu  ymm1, ymmword ptr[rax+4*r11]
       vpxor    ymm1, ymm1, ymm0
       vmovdqu  ymmword ptr[rax+4*r11], ymm1
       add      r9d, 8
       cmp      r9d, r10d
       jb       SHORT G_M11410_IG14
						;; bbWeight=4    PerfScore 36.33
G_M11410_IG15:
       xor      rax, rax
       mov      gword ptr [rsp+20H], rax
       cmp      r9d, r8d
       jae      SHORT G_M11410_IG17
       mov      eax, dword ptr [rdx+8]
       align    [4 bytes for IG16]
						;; bbWeight=0.50 PerfScore 2.38

After:

G_M11410_IG10:
       xor      r9d, r9d
       cmp      dword ptr [rdx], edx
       lea      rax, bword ptr [rdx+16]
       lea      r10d, [r8-7]
       test     r10d, r10d
       jbe      SHORT G_M11410_IG12
						;; bbWeight=0.50 PerfScore 2.75
G_M11410_IG11:
       mov      r11d, r9d
       vpcmpeqd ymm0, ymm0, ymm0
       vpxor    ymm0, ymm0, ymmword ptr[rax+4*r11]
       vmovdqu  ymmword ptr[rax+4*r11], ymm0
       add      r9d, 8
       cmp      r9d, r10d
       jb       SHORT G_M11410_IG11
						;; bbWeight=4    PerfScore 29.00
G_M11410_IG12:
       cmp      r9d, r8d
       jae      SHORT G_M11410_IG14
       mov      eax, dword ptr [rdx+8]
       align    [11 bytes for IG13]
						;; bbWeight=0.50 PerfScore 1.75

stephentoub (Member):

We also see some improvements because things like Not are implemented "better"

Is that because the code you wrote is better and we could have done the same thing with the intrinsics directly, or is this something that could be improved in the JIT's handling of the intrinsics as well?

tannergooding (Member, Author) commented Jan 21, 2022:

Is that because the code you wrote is better and we could have done the same thing with the intrinsics directly, or is this something that could be improved in the JIT's handling of the intrinsics as well?

We could've written a better implementation here but the original authors likely weren't aware of the available optimization.

Everything the xplat helper intrinsics do is implemented directly in terms of the underlying platform specific intrinsics and so there is nothing they can do that you cannot also do yourself.

The benefit is that you don't have to work out the optimal approach on each and every platform. You don't have to know that ~x is best implemented as Not(x) on Arm64 vs x ^ ~0 on x86/x64, or that Arm64 has Abs while x86/x64 needs you to do x & 0x7FFFF..., etc.

They also simplify working with unpinned memory: you can just write LoadUnsafe(ref value, index) rather than Unsafe.ReadUnaligned<Vector128<T>>(ref Unsafe.As<T, byte>(ref Unsafe.Add(ref value, index))), and directly write things like x == y rather than Compare + MoveMask, etc.
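For instance (illustrative snippets, not from this PR):

using System.Runtime.Intrinsics;

static class AbstractionSketch
{
    // ~x: a single Not on Arm64, an XOR with all-ones on x86/x64; the helper
    // picks the right lowering for you.
    static Vector128<int> Complement(Vector128<int> x) => ~x;

    // Abs: a single instruction on Arm64; built from a sign-bit mask on x86/x64.
    static Vector128<float> Abs(Vector128<float> x) => Vector128.Abs(x);

    // x == y: true when all lanes are equal, replacing an explicit
    // Compare + MoveMask sequence.
    static bool AllEqual(Vector128<int> x, Vector128<int> y) => x == y;
}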

tannergooding (Member, Author):

Not a huge difference on perf, but it is there, likely mostly from removing the big pinning blocks/logic:

Before

| Method | Size | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| BitArrayAnd | 4 | 1.669 ns | 0.0121 ns | 0.0113 ns | 1.670 ns | 1.648 ns | 1.688 ns | - | - | - |
| BitArrayNot | 4 | 1.316 ns | 0.0143 ns | 0.0119 ns | 1.316 ns | 1.298 ns | 1.340 ns | - | - | - |
| BitArrayOr | 4 | 1.656 ns | 0.0106 ns | 0.0082 ns | 1.659 ns | 1.641 ns | 1.668 ns | - | - | - |
| BitArrayXor | 4 | 1.504 ns | 0.0215 ns | 0.0191 ns | 1.505 ns | 1.467 ns | 1.533 ns | - | - | - |
| BitArrayBoolArrayCtor | 4 | 9.555 ns | 0.4905 ns | 0.5648 ns | 9.599 ns | 8.797 ns | 10.61 ns | 0.0038 | - | 64 B |
| BitArrayAnd | 512 | 9.331 ns | 0.0417 ns | 0.0390 ns | 9.333 ns | 9.258 ns | 9.405 ns | - | - | - |
| BitArrayNot | 512 | 5.716 ns | 0.0364 ns | 0.0340 ns | 5.720 ns | 5.661 ns | 5.772 ns | - | - | - |
| BitArrayOr | 512 | 9.184 ns | 0.1009 ns | 0.0944 ns | 9.151 ns | 9.082 ns | 9.363 ns | - | - | - |
| BitArrayXor | 512 | 9.648 ns | 0.0411 ns | 0.0364 ns | 9.645 ns | 9.600 ns | 9.716 ns | - | - | - |
| BitArrayBoolArrayCtor | 512 | 25.635 ns | 0.5980 ns | 0.6887 ns | 25.769 ns | 24.590 ns | 26.70 ns | 0.0071 | - | 120 B |

After

| Method | Size | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| BitArrayAnd | 4 | 1.361 ns | 0.0189 ns | 0.0176 ns | 1.369 ns | 1.324 ns | 1.383 ns | - | - | - |
| BitArrayNot | 4 | 1.106 ns | 0.0208 ns | 0.0195 ns | 1.110 ns | 1.076 ns | 1.143 ns | - | - | - |
| BitArrayOr | 4 | 1.360 ns | 0.0268 ns | 0.0251 ns | 1.371 ns | 1.316 ns | 1.398 ns | - | - | - |
| BitArrayXor | 4 | 1.356 ns | 0.0261 ns | 0.0244 ns | 1.368 ns | 1.324 ns | 1.391 ns | - | - | - |
| BitArrayBoolArrayCtor | 4 | 9.068 ns | 0.1507 ns | 0.1410 ns | 9.077 ns | 8.847 ns | 9.336 ns | 0.0038 | - | 64 B |
| BitArrayAnd | 512 | 8.229 ns | 0.0474 ns | 0.0396 ns | 8.229 ns | 8.154 ns | 8.301 ns | - | - | - |
| BitArrayNot | 512 | 5.325 ns | 0.0323 ns | 0.0302 ns | 5.333 ns | 5.266 ns | 5.360 ns | - | - | - |
| BitArrayOr | 512 | 8.947 ns | 0.0951 ns | 0.0843 ns | 8.970 ns | 8.761 ns | 9.046 ns | - | - | - |
| BitArrayXor | 512 | 8.146 ns | 0.0587 ns | 0.0549 ns | 8.157 ns | 8.043 ns | 8.219 ns | - | - | - |
| BitArrayBoolArrayCtor | 512 | 23.134 ns | 0.5760 ns | 0.6633 ns | 23.229 ns | 21.916 ns | 23.975 ns | 0.0072 | - | 120 B |

danmoseley (Member):

Easier to read + optimized for all platforms automatically + faster as well. Beautiful.

tannergooding (Member, Author):

This should also be ready to merge pending area owner sign-off: @eiriktsarpalis @krwq @layomia

layomia (Contributor) left a comment:

LGTM wrt area ownership, chatted offline with @tannergooding for basic overview of changes.

EgorBo (Member) commented Feb 17, 2022:

Arm64 improvements: dotnet/perf-autofiling-issues#3437

@ghost locked as resolved and limited conversation to collaborators Mar 19, 2022