
Begin using the xplat hardware intrinsics in BitArray #63722

Merged (2 commits), Feb 2, 2022

Conversation

tannergooding (Member) commented Jan 13, 2022:

This isn't a significant change by itself, but it represents an important stepping stone for .NET 7 by simplifying some of the existing hardware intrinsic usage to use the "Cross Platform Hardware Intrinsics" described in #49397.

The user story here is that many libraries want to write performant code, and the number of platforms that may need to be considered keeps growing. A few years ago RyuJIT only supported x86 and x64, which were similar enough that the same code could support both. Beginning in .NET 5, however, we started adding the same SIMD acceleration to Arm64 and the number of platforms expanded. Because it was a new platform for RyuJIT, many users did not have hardware available on which to test, and often were not familiar with the intricacies of the platform, which sometimes made providing equivalent support difficult. This was made worse by the fact that the code paths supporting x86, x64, and Arm64 were often very similar, generally differing only in ISA (Sse on x86/x64 vs AdvSimd on Arm64) or in naming conventions (Horizontal on x86/x64 vs Pairwise on Arm64). Finally, there are potentially more platforms that will add support in the future (such as WASM), as well as existing platforms supported by Mono that likewise have their own SIMD support.

The cross-platform hardware intrinsic helper APIs are the solution. These APIs provide functionality common to all the SIMD-supporting platforms on the fixed-size hardware intrinsic ABI types (Vector64<T>, Vector128<T>, and Vector256<T>). This gives users a good understanding of the potential performance characteristics and iteration patterns; it lets them use APIs that cannot easily be exposed on variable-length SIMD types (such as Vector<T>); and, because they are already working with the types the platform-specific intrinsics require (Vector64<T>, Vector128<T>, and Vector256<T>), it lets them trivially fall back to platform-specific intrinsics where that extra bit of perf can be grabbed from functionality only a single platform provides.

BitArray is just the first of the BCL APIs to take advantage of these new helper APIs and it shows how simple supporting accelerated SIMD code on all the target architectures can be.
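To make the shape of that concrete, here is a minimal sketch (not code from this PR) of a loop written against the helpers; the method name and spans are illustrative and assume left and right have equal length:

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class XplatSketch
{
    // Illustrative only: AND 'right' into 'left' using the cross-platform helpers,
    // falling back to scalar code for any remaining elements.
    public static void AndInPlace(Span<int> left, ReadOnlySpan<int> right)
    {
        int i = 0;

        if (Vector128.IsHardwareAccelerated && left.Length >= Vector128<int>.Count)
        {
            ref int l = ref MemoryMarshal.GetReference(left);
            ref int r = ref MemoryMarshal.GetReference(right);

            for (; i <= left.Length - Vector128<int>.Count; i += Vector128<int>.Count)
            {
                Vector128<int> result = Vector128.LoadUnsafe(ref l, (nuint)i) & Vector128.LoadUnsafe(ref r, (nuint)i);
                result.StoreUnsafe(ref l, (nuint)i);
            }
        }

        for (; i < left.Length; i++)
        {
            left[i] &= right[i];
        }
    }
}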

ghost commented Jan 13, 2022:

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Issue Details: null
Author: tannergooding
Assignees: tannergooding
Labels: area-System.Collections
Milestone: -

tannergooding (Member, Author):

CC. @jeffhandley, @danmoseley, @stephentoub

{
// JIT does not support code hoisting for SIMD yet
Vector256<byte> zero = Vector256<byte>.Zero;
fixed (bool* ptr = values)
tannergooding (Member, Author) commented on this diff:

We now expose helper intrinsics that directly operate on ref: LoadUnsafe(ref T source, nuint elementOffset).

This helps avoid pinning, which can have measurable overhead for small counts and which can hinder the GC in the case of long inputs.

It likewise improves readability over the pattern already used in parts of the BCL, where we combined Unsafe.ReadUnaligned + Unsafe.Add + Unsafe.As.
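Roughly, the readability difference being described (illustrative snippet, not from this change):

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static class LoadSketch
{
    // Older style used in parts of the BCL: offset + unaligned read; for non-byte
    // elements an extra Unsafe.As reinterpret is also needed.
    static Vector256<byte> LoadOldStyle(ref byte source, nuint elementOffset) =>
        Unsafe.ReadUnaligned<Vector256<byte>>(ref Unsafe.Add(ref source, (nint)elementOffset));

    // Equivalent load with the new helper: no pinning and no Unsafe plumbing.
    static Vector256<byte> LoadNewStyle(ref byte source, nuint elementOffset) =>
        Vector256.LoadUnsafe(ref source, elementOffset);
}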

Vector256<byte> vector = Vector256.LoadUnsafe(ref value, i);
Vector256<byte> isFalse = Vector256.Equals(vector, Vector256<byte>.Zero);

uint result = isFalse.ExtractMostSignificantBits();
tannergooding (Member, Author) commented on this diff:

ExtractMostSignificantBits behaves just like MoveMask on x86/x64. WASM exposes the same operation as bitmask.
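As a hedged illustration of how that mask is typically consumed (FirstFalseIndex is a hypothetical helper, not something in this PR):

using System.Numerics;
using System.Runtime.Intrinsics;

static class MaskSketch
{
    // Returns the index of the first zero (false) byte in the vector, or -1 if none.
    // Each zero byte produces an all-ones comparison lane, whose most significant
    // bit becomes one bit of 'mask'.
    static int FirstFalseIndex(Vector256<byte> vector)
    {
        Vector256<byte> isFalse = Vector256.Equals(vector, Vector256<byte>.Zero);
        uint mask = isFalse.ExtractMostSignificantBits();
        return mask == 0 ? -1 : BitOperations.TrailingZeroCount(mask);
    }
}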

if (Avx2.IsSupported)
ref byte value = ref Unsafe.As<bool, byte>(ref MemoryMarshal.GetArrayDataReference<bool>(values));

if (Vector256.IsHardwareAccelerated)
tannergooding (Member, Author) commented on this diff:

I've preserved the Vector256 path given that it was already here, and I presume it has undergone the necessary checks to ensure it is worth doing on x86/x64.

Arm64 doesn't support V256 and so will only go down the V128 codepath.
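The dispatch shape being described, as a sketch (branch bodies elided):

using System.Runtime.Intrinsics;

static class DispatchSketch
{
    // Prefer the widest accelerated vector width, then narrow, then scalar.
    // On Arm64, Vector256.IsHardwareAccelerated is false, so only the
    // Vector128 (or scalar) path is taken.
    static void Process(int length)
    {
        if (Vector256.IsHardwareAccelerated && length >= Vector256<int>.Count)
        {
            // 256-bit path (e.g. AVX2 on x86/x64)
        }
        else if (Vector128.IsHardwareAccelerated && length >= Vector128<int>.Count)
        {
            // 128-bit path (e.g. SSE2 on x86/x64, AdvSimd on Arm64)
        }
        else
        {
            // scalar path
        }
    }
}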

Vector128<int> rightVec = Sse2.LoadVector128(rightPtr + i);
Sse2.Store(leftPtr + i, Sse2.And(leftVec, rightVec));
}
Vector256<int> result = Vector256.LoadUnsafe(ref left, i) & Vector256.LoadUnsafe(ref right, i);
tannergooding (Member, Author) commented on this diff:

The xplat helper intrinsics support operators and so we can make this "more readable" by just using x & y.

Sse2.Store(leftPtr + i, Sse2.And(leftVec, rightVec));
}
Vector256<int> result = Vector256.LoadUnsafe(ref left, i) & Vector256.LoadUnsafe(ref right, i);
result.StoreUnsafe(ref left, i);
tannergooding (Member, Author) commented on this diff:

Storing a vector likewise no longer requires pinning or complex Unsafe logic.

}
}
else if (Sse2.IsSupported)
else if (Vector128.IsHardwareAccelerated)
Member:

In what situation would we also want a Vector64 code path?

tannergooding (Member, Author):

Vector64 can be beneficial for cases where you know the inputs are going to be commonly small and for handling the "trailing" elements (rather than falling back to a scalar loop or manually unrolled loop).

We aren't currently taking advantage of this anywhere and it would need some more work/profiling to show the extra complexity is worthwhile.

  • The extra complexity isn't from using Vector64<T> but rather from changing out the "fallback" from for (; index < length; index++) to using Vector64<T> or Vector128<T> with appropriate backtracking and masking
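Not code from this PR, but a sketch of what that backtracking tail handling could look like for an idempotent operation like And, where re-processing a few overlapping elements is harmless:

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class TailSketch
{
    // AND 'right' into 'left' with a final overlapping vector instead of a scalar
    // tail loop. Assumes left and right have equal length.
    static void AndWithVectorTail(Span<int> left, ReadOnlySpan<int> right)
    {
        if (!Vector128.IsHardwareAccelerated || left.Length < Vector128<int>.Count)
        {
            for (int j = 0; j < left.Length; j++) left[j] &= right[j];
            return;
        }

        ref int l = ref MemoryMarshal.GetReference(left);
        ref int r = ref MemoryMarshal.GetReference(right);

        nuint lastBlock = (nuint)(left.Length - Vector128<int>.Count);
        nuint i = 0;

        for (; i < lastBlock; i += (nuint)Vector128<int>.Count)
        {
            (Vector128.LoadUnsafe(ref l, i) & Vector128.LoadUnsafe(ref r, i)).StoreUnsafe(ref l, i);
        }

        // Backtrack so the final vector ends exactly at the last element; up to
        // Count - 1 already-processed elements are AND-ed a second time, which
        // produces the same result.
        (Vector128.LoadUnsafe(ref l, lastBlock) & Vector128.LoadUnsafe(ref r, lastBlock)).StoreUnsafe(ref l, lastBlock);
    }
}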

tannergooding (Member, Author):

Merging main to try and get CI to pass (seems more jobs have passed today).

tannergooding (Member, Author):

Rebased onto main to pick up some important fixes. Will share diffs and perf numbers in a little bit.

tannergooding (Member, Author):

Jit Diff:

Top method improvements (bytes):
        -123 (-17.70% of base) : System.Collections.dasm - BitArray:Xor(BitArray):BitArray:this
        -102 (-15.13% of base) : System.Collections.dasm - BitArray:And(BitArray):BitArray:this
        -102 (-15.13% of base) : System.Collections.dasm - BitArray:Or(BitArray):BitArray:this
         -64 (-16.62% of base) : System.Collections.dasm - BitArray:Not():BitArray:this
         -51 (-4.46% of base) : System.Collections.dasm - BitArray:.ctor(ref):this (3 methods)

Most of the diff is from being able to remove the pinning:

Before (BitArray:And(BitArray):BitArray:this):

						;; bbWeight=0.50 PerfScore 4.88
G_M4042_IG10:
       xor      r11d, r11d
       mov      gword ptr [rsp+30H], rax
       mov      rsi, gword ptr [rsp+30H]
       cmp      dword ptr [rsi+8], 0
       jne      SHORT G_M4042_IG11
       xor      esi, esi
       jmp      SHORT G_M4042_IG12
						;; bbWeight=0.50 PerfScore 4.25
G_M4042_IG11:
       mov      rsi, gword ptr [rsp+30H]
       cmp      dword ptr [rsi+8], 0
       jbe      G_M4042_IG22
       mov      rsi, gword ptr [rsp+30H]
       add      rsi, 16
						;; bbWeight=0.50 PerfScore 3.12
G_M4042_IG12:
       mov      gword ptr [rsp+28H], r8
       mov      rdi, gword ptr [rsp+28H]
       cmp      dword ptr [rdi+8], 0
       jne      SHORT G_M4042_IG13
       xor      edi, edi
       jmp      SHORT G_M4042_IG14
						;; bbWeight=0.50 PerfScore 4.12
G_M4042_IG13:
       mov      rdi, gword ptr [rsp+28H]
       cmp      dword ptr [rdi+8], 0
       jbe      G_M4042_IG22
       mov      rdi, gword ptr [rsp+28H]
       add      rdi, 16
						;; bbWeight=0.50 PerfScore 3.12
G_M4042_IG14:
       lea      ebx, [r10-7]
       test     ebx, ebx
       jbe      SHORT G_M4042_IG16
						;; bbWeight=0.50 PerfScore 0.88
G_M4042_IG15:
       mov      ebp, r11d
       vmovdqu  ymm0, ymmword ptr[rsi+4*rbp]
       vmovdqu  ymm1, ymmword ptr[rdi+4*rbp]
       vpand    ymm0, ymm0, ymm1
       vmovdqu  ymmword ptr[rsi+4*rbp], ymm0
       add      r11d, 8
       cmp      r11d, ebx
       jb       SHORT G_M4042_IG15
						;; bbWeight=4    PerfScore 56.33
G_M4042_IG16:
       xor      rsi, rsi
       mov      gword ptr [rsp+28H], rsi
       mov      gword ptr [rsp+30H], rsi
       cmp      r11d, r10d
       jae      SHORT G_M4042_IG18
       align    [0 bytes for IG17]
						;; bbWeight=0.50 PerfScore 1.75

After:

G_M4042_IG10:
       xor      r11d, r11d
       lea      rsi, bword ptr [rax+16]
       lea      rdi, bword ptr [r8+16]
       lea      ebx, [r10-7]
       test     ebx, ebx
       jbe      SHORT G_M4042_IG12
						;; bbWeight=0.50 PerfScore 1.50
G_M4042_IG11:
       mov      ebp, r11d
       vmovdqu  ymm0, ymmword ptr[rsi+4*rbp]
       vpand    ymm0, ymm0, ymmword ptr[rdi+4*rbp]
       vmovdqu  ymmword ptr[rsi+4*rbp], ymm0
       add      r11d, 8
       cmp      r11d, ebx
       jb       SHORT G_M4042_IG11
						;; bbWeight=4    PerfScore 47.00
G_M4042_IG12:
       cmp      r11d, r10d
       jae      SHORT G_M4042_IG14
       align    [0 bytes for IG13]
						;; bbWeight=0.50 PerfScore 0.62

We also see some improvements because things like Not are implemented "better".

Before:

      xor      r9d, r9d
       vpcmpeqd ymm0, ymm0, ymm0
       mov      gword ptr [rsp+20H], rdx
       test     rdx, rdx
       je       SHORT G_M11410_IG11
       mov      rax, gword ptr [rsp+20H]
       cmp      dword ptr [rax+8], 0
       jne      SHORT G_M11410_IG12
						;; bbWeight=0.50 PerfScore 4.00
G_M11410_IG11:
       xor      eax, eax
       jmp      SHORT G_M11410_IG13
						;; bbWeight=0.50 PerfScore 1.12
G_M11410_IG12:
       mov      rax, gword ptr [rsp+20H]
       cmp      dword ptr [rax+8], 0
       jbe      SHORT G_M11410_IG19
       mov      rax, gword ptr [rsp+20H]
       add      rax, 16
						;; bbWeight=0.50 PerfScore 3.12
G_M11410_IG13:
       lea      r10d, [r8-7]
       test     r10d, r10d
       jbe      SHORT G_M11410_IG15
						;; bbWeight=0.50 PerfScore 0.88
G_M11410_IG14:
       mov      r11d, r9d
       vmovdqu  ymm1, ymmword ptr[rax+4*r11]
       vpxor    ymm1, ymm1, ymm0
       vmovdqu  ymmword ptr[rax+4*r11], ymm1
       add      r9d, 8
       cmp      r9d, r10d
       jb       SHORT G_M11410_IG14
						;; bbWeight=4    PerfScore 36.33
G_M11410_IG15:
       xor      rax, rax
       mov      gword ptr [rsp+20H], rax
       cmp      r9d, r8d
       jae      SHORT G_M11410_IG17
       mov      eax, dword ptr [rdx+8]
       align    [4 bytes for IG16]
						;; bbWeight=0.50 PerfScore 2.38

After:

G_M11410_IG10:
       xor      r9d, r9d
       cmp      dword ptr [rdx], edx
       lea      rax, bword ptr [rdx+16]
       lea      r10d, [r8-7]
       test     r10d, r10d
       jbe      SHORT G_M11410_IG12
						;; bbWeight=0.50 PerfScore 2.75
G_M11410_IG11:
       mov      r11d, r9d
       vpcmpeqd ymm0, ymm0, ymm0
       vpxor    ymm0, ymm0, ymmword ptr[rax+4*r11]
       vmovdqu  ymmword ptr[rax+4*r11], ymm0
       add      r9d, 8
       cmp      r9d, r10d
       jb       SHORT G_M11410_IG11
						;; bbWeight=4    PerfScore 29.00
G_M11410_IG12:
       cmp      r9d, r8d
       jae      SHORT G_M11410_IG14
       mov      eax, dword ptr [rdx+8]
       align    [11 bytes for IG13]
						;; bbWeight=0.50 PerfScore 1.75

stephentoub (Member):

We also see some improvements because things like Not are implemented "better"

Is that because the code you wrote is better and we could have done the same thing with the intrinsics directly, or is this something that could be improved in the JIT's handling of the intrinsics as well?

tannergooding (Member, Author) commented Jan 21, 2022:

Is that because the code you wrote is better and we could have done the same thing with the intrinsics directly, or is this something that could be improved in the JIT's handling of the intrinsics as well?

We could've written a better implementation here but the original authors likely weren't aware of the available optimization.

Everything the xplat helper intrinsics do is implemented directly in terms of the underlying platform specific intrinsics and so there is nothing they can do that you cannot also do yourself.

The benefit is that you don't have to work out the optimal approach on each and every platform. You don't have to know that ~x is best implemented as Not(x) on Arm64 vs x ^ ~0 on x86/x64, or that Arm64 has Abs while x86/x64 needs you to do x & 0x7FFFF..., etc.

They also simplify working with unpinned memory: you can just write LoadUnsafe(ref value, index) rather than Unsafe.ReadUnaligned<Vector128<T>>(ref Unsafe.As<T, byte>(ref Unsafe.Add(ref value, index))), and directly write things like x == y rather than Compare + MoveMask, etc.
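For instance (illustrative snippets, not from this PR):

using System.Runtime.Intrinsics;

static class AbstractionSketch
{
    // ~x: a single Not on Arm64, an XOR with all-ones on x86/x64; the helper
    // picks the right lowering for you.
    static Vector128<int> Complement(Vector128<int> x) => ~x;

    // Abs: a single instruction on Arm64; built from a sign-bit mask on x86/x64.
    static Vector128<float> Abs(Vector128<float> x) => Vector128.Abs(x);

    // x == y: true when all lanes are equal, replacing an explicit
    // Compare + MoveMask sequence.
    static bool AllEqual(Vector128<int> x, Vector128<int> y) => x == y;
}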

tannergooding (Member, Author):

Not a huge difference on perf, but it is there, likely mostly from removing the big pinning blocks/logic:

Before

| Method | Size | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| BitArrayAnd | 4 | 1.669 ns | 0.0121 ns | 0.0113 ns | 1.670 ns | 1.648 ns | 1.688 ns | - | - | - |
| BitArrayNot | 4 | 1.316 ns | 0.0143 ns | 0.0119 ns | 1.316 ns | 1.298 ns | 1.340 ns | - | - | - |
| BitArrayOr | 4 | 1.656 ns | 0.0106 ns | 0.0082 ns | 1.659 ns | 1.641 ns | 1.668 ns | - | - | - |
| BitArrayXor | 4 | 1.504 ns | 0.0215 ns | 0.0191 ns | 1.505 ns | 1.467 ns | 1.533 ns | - | - | - |
| BitArrayBoolArrayCtor | 4 | 9.555 ns | 0.4905 ns | 0.5648 ns | 9.599 ns | 8.797 ns | 10.61 ns | 0.0038 | - | 64 B |
| BitArrayAnd | 512 | 9.331 ns | 0.0417 ns | 0.0390 ns | 9.333 ns | 9.258 ns | 9.405 ns | - | - | - |
| BitArrayNot | 512 | 5.716 ns | 0.0364 ns | 0.0340 ns | 5.720 ns | 5.661 ns | 5.772 ns | - | - | - |
| BitArrayOr | 512 | 9.184 ns | 0.1009 ns | 0.0944 ns | 9.151 ns | 9.082 ns | 9.363 ns | - | - | - |
| BitArrayXor | 512 | 9.648 ns | 0.0411 ns | 0.0364 ns | 9.645 ns | 9.600 ns | 9.716 ns | - | - | - |
| BitArrayBoolArrayCtor | 512 | 25.635 ns | 0.5980 ns | 0.6887 ns | 25.769 ns | 24.590 ns | 26.70 ns | 0.0071 | - | 120 B |

After

| Method | Size | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| BitArrayAnd | 4 | 1.361 ns | 0.0189 ns | 0.0176 ns | 1.369 ns | 1.324 ns | 1.383 ns | - | - | - |
| BitArrayNot | 4 | 1.106 ns | 0.0208 ns | 0.0195 ns | 1.110 ns | 1.076 ns | 1.143 ns | - | - | - |
| BitArrayOr | 4 | 1.360 ns | 0.0268 ns | 0.0251 ns | 1.371 ns | 1.316 ns | 1.398 ns | - | - | - |
| BitArrayXor | 4 | 1.356 ns | 0.0261 ns | 0.0244 ns | 1.368 ns | 1.324 ns | 1.391 ns | - | - | - |
| BitArrayBoolArrayCtor | 4 | 9.068 ns | 0.1507 ns | 0.1410 ns | 9.077 ns | 8.847 ns | 9.336 ns | 0.0038 | - | 64 B |
| BitArrayAnd | 512 | 8.229 ns | 0.0474 ns | 0.0396 ns | 8.229 ns | 8.154 ns | 8.301 ns | - | - | - |
| BitArrayNot | 512 | 5.325 ns | 0.0323 ns | 0.0302 ns | 5.333 ns | 5.266 ns | 5.360 ns | - | - | - |
| BitArrayOr | 512 | 8.947 ns | 0.0951 ns | 0.0843 ns | 8.970 ns | 8.761 ns | 9.046 ns | - | - | - |
| BitArrayXor | 512 | 8.146 ns | 0.0587 ns | 0.0549 ns | 8.157 ns | 8.043 ns | 8.219 ns | - | - | - |
| BitArrayBoolArrayCtor | 512 | 23.134 ns | 0.5760 ns | 0.6633 ns | 23.229 ns | 21.916 ns | 23.975 ns | 0.0072 | - | 120 B |

danmoseley (Member):

Easier to read + optimized for all platforms automatically + faster as well. Beautiful.

tannergooding (Member, Author):

This should also be ready to merge pending area owner sign-off: @eiriktsarpalis @krwq @layomia

layomia (Contributor) left a comment:

LGTM wrt area ownership, chatted offline with @tannergooding for basic overview of changes.

EgorBo (Member) commented Feb 17, 2022:

Arm64 improvements: dotnet/perf-autofiling-issues#3437

@ghost locked as resolved and limited conversation to collaborators Mar 19, 2022