Vectorise BitArray #41896

Merged
adamsitnik merged 6 commits into dotnet:master from Gnbrkm41:speedupbitarray on Nov 7, 2019

Collaborator

Gnbrkm41 commented Oct 18, 2019

Fixes #41762 and #37946
Related #39173

This PR continues from the previous PR from @BruceForstall (#39173) in an attempt to speed up various operations of BitArray by vectorisation and using AVX2 256-bit wide instructions.

The performance differences relative to the code before these optimisations, when operating on arrays of size 4/512/32768, are as follows (threshold 5%):

summary:
better: 17, geomean: 3.773
worse: 3, geomean: 2.094
total diff: 20

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality |
| --- | ---: | ---: | ---: | --- |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 4) | 4.76 | 0.78 | 3.72 | |
| System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 4) | 1.79 | 0.92 | 1.64 | |
| System.Collections.Tests.Perf_BitArray.BitArrayBoolArrayCtor(Size: 4) | 1.08 | 9.83 | 10.61 | |

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
| --- | ---: | ---: | ---: | --- |
| System.Collections.Tests.Perf_BitArray.BitArrayBoolArrayCtor(Size: 32768) | 88.64 | 116382.63 | 1312.93 | |
| System.Collections.Tests.Perf_BitArray.BitArrayCopyToBoolArray(Size: 32768) | 33.88 | 323823.66 | 9557.69 | |
| System.Collections.Tests.Perf_BitArray.BitArrayCopyToBoolArray(Size: 512) | 25.40 | 5096.80 | 200.63 | |
| System.Collections.Tests.Perf_BitArray.BitArrayBoolArrayCtor(Size: 512) | 14.80 | 407.56 | 27.54 | |
| System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 32768) | 7.83 | 3679.64 | 469.90 | |
| System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 512) | 6.55 | 61.56 | 9.39 | |
| System.Collections.Tests.Perf_BitArray.BitArrayCopyToBoolArray(Size: 4) | 1.98 | 71.92 | 36.26 | |
| System.Collections.Tests.Perf_BitArray.BitArrayOr(Size: 512) | 1.82 | 19.69 | 10.83 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 512) | 1.75 | 53.17 | 30.31 | |
| System.Collections.Tests.Perf_BitArray.BitArrayAnd(Size: 512) | 1.66 | 17.88 | 10.78 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 32768) | 1.63 | 3078.29 | 1883.64 | |
| System.Collections.Tests.Perf_BitArray.BitArrayXor(Size: 512) | 1.55 | 17.15 | 11.06 | |
| System.Collections.Tests.Perf_BitArray.BitArrayAnd(Size: 32768) | 1.49 | 1259.86 | 847.41 | |
| System.Collections.Tests.Perf_BitArray.BitArrayXor(Size: 32768) | 1.48 | 1261.98 | 853.47 | |
| System.Collections.Tests.Perf_BitArray.BitArrayOr(Size: 32768) | 1.47 | 1238.11 | 842.76 | |
| System.Collections.Tests.Perf_BitArray.BitArrayLeftShift(Size: 4) | 1.16 | 4.63 | 3.98 | |
| System.Collections.Tests.Perf_BitArray.BitArrayCopyToByteArray(Size: 4) | 1.10 | 23.14 | 21.08 | |

Regarding the slowdown of BitArraySetAll, I re-ran the benchmarks with various sizes to see at which point the new implementation outruns the current one.
(Threshold 5%)

summary:
better: 6, geomean: 1.345
worse: 2, geomean: 1.729
total diff: 8

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality |
| --- | ---: | ---: | ---: | --- |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 4) | 2.69 | 1.39 | 3.74 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 16) | 1.11 | 3.16 | 3.50 | |

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
| --- | ---: | ---: | ---: | --- |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 512) | 1.76 | 53.79 | 30.51 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 64) | 1.35 | 7.03 | 5.23 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 96) | 1.31 | 9.52 | 7.25 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 128) | 1.30 | 11.93 | 9.21 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 256) | 1.22 | 22.09 | 18.09 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 32) | 1.20 | 4.08 | 3.40 | |

This suggests that the new implementation may be faster when filling a BitArray that contains more than 32 elements. One thing to note, though, is that the numbers for small sizes seem to fluctuate, so I suppose the results may be inaccurate. (Even for this benchmark, I would expect similar numbers for sizes 4/16/32, since each fits in a single int and the fill should therefore be just a single int store; yet they all give different results.)

Furthermore, since the current implementation of SetAll operates on the whole backing array, it may copy unnecessarily into the unused area when BitArray.Length has been set to make the BitArray smaller but the backing array has not been resized because the new length does not meet _ShrinkThreshold (counted in ints):

private const int _ShrinkThreshold = 256;

The new implementation uses the GetInt32ArrayLengthFromBitLength(Length) method to calculate where the used area ends and only copies to that region. Unfortunately, since this also happens for smaller arrays, the check by itself seems to result in an approximately 0.7x slowdown when the array has fewer than 32 elements.
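To make the point about the used region concrete, here is a minimal sketch, not the code from this PR. The method name GetInt32ArrayLengthFromBitLength and the BitsPerInt32 constant come from the PR; the method body, the `BitArraySketch` class, and this `SetAll` signature are assumptions made for illustration only.

```csharp
using System;

static class BitArraySketch
{
    private const int BitsPerInt32 = 32;

    // Number of ints needed to hold n bits, i.e. ceil(n / 32) for n >= 0.
    // Assumed body; the real helper may be written differently.
    public static int GetInt32ArrayLengthFromBitLength(int n)
        => n > 0 ? (n - 1) / BitsPerInt32 + 1 : 0;

    // Hypothetical SetAll: fills only the ints that back `bitLength` bits
    // and leaves the rest of the (possibly oversized) backing array alone.
    public static void SetAll(int[] backing, int bitLength, bool value)
    {
        int used = GetInt32ArrayLengthFromBitLength(bitLength);
        backing.AsSpan(0, used).Fill(value ? -1 : 0);
    }
}
```

Under this sketch, bit lengths 4, 16 and 32 all map to one int, which is why similar timings would be expected for those sizes.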

Regarding the use of AVX2, I found that AVX2 generally improved performance despite the concerns about downclocking. This is an example comparison between the various paths for BitArray.CopyTo(Array, int) with bool arrays (see #41762 (comment) and #41762 (comment) for benchmarks of And/Or/Xor/Not and BitArray(bool[])):

// * Summary *
BenchmarkDotNet=v0.11.5.1159-nightly, OS=Windows 10.0.18999
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-alpha1-014899
  [Host]              : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
  Job-VRDFCM          : .NET Core ? (CoreCLR 5.0.19.51405, CoreFX 5.0.19.51801), X64 RyuJIT
  AVX2 Disabled       : .NET Core ? (CoreCLR 5.0.19.51405, CoreFX 5.0.19.51801), X64 RyuJIT
  Intrinsics Disabled : .NET Core ? (CoreCLR 5.0.19.51405, CoreFX 5.0.19.51801), X64 RyuJIT

| Method | Job | EnvironmentVariables | PowerPlanMode | Toolchain | IterationTime | MaxIterationCount | MinIterationCount | WarmupCount | Size | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- | --- | --- | --- |
| BitArrayCopyToBoolArray | Default | Empty | 00000000-0000-0000-0000-000000000000 | CoreRun | 250.0000 ms | 20 | 15 | 1 | 4 | 35.90 ns | 0.218 ns | 0.203 ns | 35.84 ns | 35.60 ns | 36.26 ns | - | - | - | - |
| BitArrayCopyToBoolArray | AVX2 Disabled | COMPlus_EnableAVX2=0 | 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c | Before | Default | Default | Default | Default | 4 | 35.90 ns | 0.142 ns | 0.132 ns | 35.89 ns | 35.65 ns | 36.13 ns | - | - | - | - |
| BitArrayCopyToBoolArray | Intrinsics Disabled | COMPlus_EnableHWIntrinsic=0 | 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c | Before | Default | Default | Default | Default | 4 | 73.08 ns | 0.333 ns | 0.311 ns | 73.06 ns | 72.56 ns | 73.76 ns | - | - | - | - |
| BitArrayCopyToBoolArray | Default | Empty | 00000000-0000-0000-0000-000000000000 | CoreRun | 250.0000 ms | 20 | 15 | 1 | 512 | 191.24 ns | 1.056 ns | 0.988 ns | 191.25 ns | 189.72 ns | 192.84 ns | - | - | - | - |
| BitArrayCopyToBoolArray | AVX2 Disabled | COMPlus_EnableAVX2=0 | 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c | Before | Default | Default | Default | Default | 512 | 220.78 ns | 0.819 ns | 0.766 ns | 220.75 ns | 219.00 ns | 221.94 ns | - | - | - | - |
| BitArrayCopyToBoolArray | Intrinsics Disabled | COMPlus_EnableHWIntrinsic=0 | 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c | Before | Default | Default | Default | Default | 512 | 5,221.48 ns | 16.221 ns | 12.664 ns | 5,227.59 ns | 5,202.04 ns | 5,235.70 ns | - | - | - | - |
| BitArrayCopyToBoolArray | Default | Empty | 00000000-0000-0000-0000-000000000000 | CoreRun | 250.0000 ms | 20 | 15 | 1 | 32768 | 9,776.74 ns | 89.691 ns | 74.896 ns | 9,756.03 ns | 9,696.10 ns | 9,976.58 ns | - | - | - | - |
| BitArrayCopyToBoolArray | AVX2 Disabled | COMPlus_EnableAVX2=0 | 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c | Before | Default | Default | Default | Default | 32768 | 11,746.34 ns | 50.812 ns | 42.431 ns | 11,735.14 ns | 11,679.19 ns | 11,834.34 ns | - | - | - | - |
| BitArrayCopyToBoolArray | Intrinsics Disabled | COMPlus_EnableHWIntrinsic=0 | 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c | Before | Default | Default | Default | Default | 32768 | 331,026.89 ns | 936.011 ns | 781.612 ns | 331,110.79 ns | 330,069.53 ns | 332,353.66 ns | - | - | - | - |

// The extracted bits can be anywhere between 0 and 255, so we normalise the value to either 0 or 1
// to ensure compatibility with "C# bool" (0 for false, 1 for true, rest undefined)
Vector256<byte> normalized = Avx2.Min(extracted, ones);

EgorBo Oct 18, 2019

Contributor

It seems you don't do this kind of normalization for the BitArray(bool[]) constructor

Gnbrkm41 Oct 18, 2019

Author Collaborator

This is handled by comparing the bytes with zero (checking if the bytes are false) then negating the result: 72477e7#diff-e2f01cf03382b7d63fc3a67ad77fcedcR140-R142
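As a rough illustration of the "compare with zero, then negate" idea for packing bool bytes into bits, here is a hedged sketch. It is not the code from the linked commit: the `BoolPacking` class and `Pack16` method are hypothetical, it handles only 16 bytes with SSE2, and it reads the vector through MemoryMarshal rather than pointers. Any non-zero byte counts as true, so no separate normalisation step is needed.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class BoolPacking
{
    // Hypothetical helper: packs the first 16 bool bytes into the low 16 bits
    // of an int. Bit i is set when byte i is non-zero.
    public static int Pack16(ReadOnlySpan<byte> bools)
    {
        if (bools.Length < 16)
            throw new ArgumentException("Need at least 16 bytes.", nameof(bools));

        if (Sse2.IsSupported)
        {
            Vector128<byte> v = MemoryMarshal.Read<Vector128<byte>>(bools);
            // 0xFF in every lane whose byte is zero, i.e. false.
            Vector128<byte> isFalse = Sse2.CompareEqual(v, Vector128<byte>.Zero);
            // One bit per lane, taken from each lane's most significant bit.
            int falseMask = Sse2.MoveMask(isFalse);
            // Negate so that set bits mean "true"; keep only the 16 lane bits.
            return ~falseMask & 0xFFFF;
        }

        // Scalar fallback with the same semantics.
        int result = 0;
        for (int i = 0; i < 16; i++)
        {
            if (bools[i] != 0)
                result |= 1 << i;
        }
        return result;
    }
}
```

Because the comparison is against zero, a byte of 255 and a byte of 1 both produce a set bit.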

{
for (; (i + Vector256<byte>.Count) <= m_length; i += Vector256<byte>.Count)
{
int bits = m_array[i / BitsPerInt32];

EgorBo Oct 18, 2019

Contributor

Again, you can load m_array as a Vector and spawn vectors for each integer

@@ -275,16 +309,34 @@ public unsafe BitArray And(BitArray value)
if (Length != value.Length || (uint)count > (uint)thisArray.Length || (uint)count > (uint)valueArray.Length)
throw new ArgumentException(SR.Arg_ArrayLengthsDiffer);

// Unroll loop for count less than Vector256 size.

gfoidl Oct 18, 2019

Contributor

After the vectorized version there's a sequential loop to process the remaining elements. Why not jump to this switch instead of the loop?

(Of course, keep the loop if no Avx2 or Sse2 is available.)
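For readers unfamiliar with the pattern under discussion, the following is a minimal sketch of an unrolled switch for small counts with a plain loop as the general path. The `SmallAnd` class, the `And` helper, and the cut-off of 3 are assumptions for illustration; the PR's actual switch covers more cases and sits alongside the vectorised paths.

```csharp
using System;

static class SmallAnd
{
    // Hypothetical helper: ANDs the first `count` ints of `right` into `left`.
    public static void And(int[] left, int[] right, int count)
    {
        if ((uint)count > (uint)left.Length || (uint)count > (uint)right.Length)
            throw new ArgumentException("Array lengths differ.");

        // Unrolled path for very small counts: each case handles one element
        // and then jumps to the next smaller case.
        switch (count)
        {
            case 3: left[2] &= right[2]; goto case 2;
            case 2: left[1] &= right[1]; goto case 1;
            case 1: left[0] &= right[0]; return;
            case 0: return;
        }

        // General sequential path for larger counts.
        for (int i = 0; i < count; i++)
        {
            left[i] &= right[i];
        }
    }
}
```

Reusing such a switch for the tail after a vectorised loop would additionally require indexing relative to the first unprocessed element, which this sketch does not attempt.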

@maryamariyan maryamariyan requested a review from safern Nov 1, 2019
BruceForstall and others added 5 commits Jul 3, 2019
1. Use AVX2, if available, for And/Or/Xor
2. Vectorize Not
3. Use Span<T>.Fill() for SetAll()
4. Add more test sizes to account for And/Or/Xor/Not loop unrolling cases
* Fix bugs present in BitArray(bool[])
* Vectorise CopyTo(Array, int) when copying to a bool[]
* Add test data for random values & larger array
* Use Vector128/256.Create and store it in static readonly field instead of loading from PE header
Member

danmosemsft commented Nov 4, 2019

@Gnbrkm41 thanks for your work on this PR. As you probably saw, this repo will move to a new one, so we are hoping to finish as many active PRs as possible by 11/13 so they don't have to be manually re-created. Are you able to keep moving this along?

Collaborator Author

Gnbrkm41 commented Nov 5, 2019

I'll push a few commits today. I've been investigating whether fetching and storing int elements in bulk using vector instructions is worth it, per @EgorBo's comment, but it feels like either my code is not good enough or it isn't worth it, because the results seemed worse than the current version.

I think it'll be fine to get this merged with the current logic (that is, after I push my commits). Alternatively I'm also fine with digging more into it then just re-opening the PR after the consolidation.

@Gnbrkm41 Gnbrkm41 force-pushed the Gnbrkm41:speedupbitarray branch from b491574 to 29ea3ec Nov 5, 2019
Member

adamsitnik commented Nov 6, 2019

@tannergooding @BruceForstall could you please take a look? I would love to merge it before repo consolidation

Member

tannergooding left a comment

Overall looks good/correct to me.

Member

maryamariyan commented Nov 6, 2019

Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:

  1. In your corefx repository clone, create patch by running git format-patch origin
  2. In your runtime repository clone, apply the patch by running git apply --directory src/corefx <path to the patch created in step 1>
@adamsitnik adamsitnik merged commit a4f0447 into dotnet:master Nov 7, 2019
15 checks passed
WIP: Ready for review
corefx-ci: Build #20191104.79 succeeded
corefx-ci (Linux Build RedHat6_x64_Release): succeeded
corefx-ci (Linux Build arm64_Debug): succeeded
corefx-ci (Linux Build arm_Debug): succeeded
corefx-ci (Linux Build musl_arm64_Debug): succeeded
corefx-ci (Linux Build musl_x64_Debug): succeeded
corefx-ci (Linux Build wasm_Release): succeeded
corefx-ci (Linux Build x64_Debug): succeeded
corefx-ci (MacOS Build x64_Debug): succeeded
corefx-ci (Windows Build NETFX_x86_Release): succeeded
corefx-ci (Windows Build x64_Debug): succeeded
corefx-ci (Windows Build x86_Release): succeeded
corefx-ci (Windows Packaging All Configurations x64_Debug): succeeded
license/cla: All CLA requirements met.
Member

adamsitnik commented Nov 7, 2019

@Gnbrkm41 thank you!

@Gnbrkm41 Gnbrkm41 deleted the Gnbrkm41:speedupbitarray branch Nov 8, 2019