Vectorise BitArray #41896

Gnbrkm41 · 2019-10-18T14:52:35Z

Fixes https://github.com/dotnet/corefx/issues/41762 and https://github.com/dotnet/corefx/issues/37946
Related #39173

This PR continues from the previous PR from @BruceForstall (#39173) in an attempt to speed up various operations of BitArray by vectorisation and using AVX2 256-bit wide instructions.

The performance differences compared to the code before the optimizations were applied, when operating on arrays of size 4/512/32768, are as follows (Threshold 5%):

summary:
better: 17, geomean: 3.773
worse: 3, geomean: 2.094
total diff: 20

Slower	diff/base	Base Median (ns)	Diff Median (ns)
System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 4)	4.76	0.78	3.72
System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 4)	1.79	0.92	1.64
System.Collections.Tests.Perf_BitArray.BitArrayBoolArrayCtor(Size: 4)	1.08	9.83	10.61

Faster	base/diff	Base Median (ns)	Diff Median (ns)
System.Collections.Tests.Perf_BitArray.BitArrayBoolArrayCtor(Size: 32768)	88.64	116382.63	1312.93
System.Collections.Tests.Perf_BitArray.BitArrayCopyToBoolArray(Size: 32768)	33.88	323823.66	9557.69
System.Collections.Tests.Perf_BitArray.BitArrayCopyToBoolArray(Size: 512)	25.40	5096.80	200.63
System.Collections.Tests.Perf_BitArray.BitArrayBoolArrayCtor(Size: 512)	14.80	407.56	27.54
System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 32768)	7.83	3679.64	469.90
System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 512)	6.55	61.56	9.39
System.Collections.Tests.Perf_BitArray.BitArrayCopyToBoolArray(Size: 4)	1.98	71.92	36.26
System.Collections.Tests.Perf_BitArray.BitArrayOr(Size: 512)	1.82	19.69	10.83
System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 512)	1.75	53.17	30.31
System.Collections.Tests.Perf_BitArray.BitArrayAnd(Size: 512)	1.66	17.88	10.78
System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 32768)	1.63	3078.29	1883.64
System.Collections.Tests.Perf_BitArray.BitArrayXor(Size: 512)	1.55	17.15	11.06
System.Collections.Tests.Perf_BitArray.BitArrayAnd(Size: 32768)	1.49	1259.86	847.41
System.Collections.Tests.Perf_BitArray.BitArrayXor(Size: 32768)	1.48	1261.98	853.47
System.Collections.Tests.Perf_BitArray.BitArrayOr(Size: 32768)	1.47	1238.11	842.76
System.Collections.Tests.Perf_BitArray.BitArrayLeftShift(Size: 4)	1.16	4.63	3.98
System.Collections.Tests.Perf_BitArray.BitArrayCopyToByteArray(Size: 4)	1.10	23.14	21.08
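
For reference, the geomean figures in the summary can be approximated from the ratio columns. The following is a minimal sketch, assuming the comparison tool takes the geometric mean of the per-benchmark ratios; `GeomeanSketch` and `GeometricMean` are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;

static class GeomeanSketch
{
    // Geometric mean of positive ratios, computed as exp(mean(log(r))).
    // Assumption: this mirrors how the summary lines are derived.
    public static double GeometricMean(double[] ratios)
        => Math.Exp(ratios.Average(r => Math.Log(r)));
}
```

Feeding in the three diff/base ratios from the "Slower" table (4.76, 1.79, 1.08) gives roughly 2.09, in line with the reported 2.094; the small difference would come from the ratios in the table being rounded.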

Regarding the slowdown of BitArraySetAll, I re-ran the benchmarks with various sizes to see at which point the new implementation outruns the current implementation.
(Threshold 5%)

summary:
better: 6, geomean: 1.345
worse: 2, geomean: 1.729
total diff: 8

Slower	diff/base	Base Median (ns)	Diff Median (ns)	Modality
System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 4)	2.69	1.39	3.74
System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 16)	1.11	3.16	3.50

Faster	base/diff	Base Median (ns)	Diff Median (ns)
System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 512)	1.76	53.79	30.51
System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 64)	1.35	7.03	5.23
System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 96)	1.31	9.52	7.25
System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 128)	1.30	11.93	9.21
System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 256)	1.22	22.09	18.09
System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 32)	1.20	4.08	3.40

This suggests that the new implementation may be faster for filling a BitArray that contains more than 32 elements. One thing to note, though, is that the numbers for small sizes seem to fluctuate, so I suppose the results may be inaccurate. (Even for this benchmark, I would expect the numbers to be similar for sizes 4/16/32, since they are all stored in one int and filling them should therefore be just a single copy of an int; yet they all give different results.)

Furthermore, since the current implementation of SetAll operates on the whole of the backing array, this may result in unnecessary copying to unused area when the BitArray.Length has been set to make the BitArray smaller but the backing array hasn't been resized due to the new length not meeting the _ShrinkThreshold (in int counts):

corefx/src/System.Collections/src/System/Collections/BitArray.cs

Line 23 in 2b92fc0

private const int _ShrinkThreshold = 256;

The new implementation uses the GetInt32ArrayLengthFromBitLength(Length) method to calculate where the used area is and only copies to that region. Unfortunately, since this happens for smaller arrays as well, this check by itself seems to result in an approximately 0.7x slowdown when the array has fewer than 32 elements.
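
To illustrate the idea of filling only the used region, here is a minimal sketch. It is not the code in this PR: `SetAllSketch` and its `SetAll(int[], int, bool)` signature are hypothetical stand-ins, and the length helper is written as a plain round-up division by 32 under the assumption that this is what the real method computes.

```csharp
using System;

static class SetAllSketch
{
    // Number of ints needed to hold n bits, i.e. ceil(n / 32) for n >= 0.
    // The cast to uint keeps the result correct even if n - 1 + 32 wraps around.
    public static int GetInt32ArrayLengthFromBitLength(int n)
        => (int)((uint)(n - 1 + (1 << 5)) >> 5);

    // Hypothetical stand-in for SetAll: writes only the ints that back
    // `bitLength` bits and leaves the rest of the backing array untouched.
    public static void SetAll(int[] backing, int bitLength, bool value)
    {
        int used = GetInt32ArrayLengthFromBitLength(bitLength);
        backing.AsSpan(0, used).Fill(value ? -1 : 0);
    }
}
```

With a backing array of 8 ints and a bit length of 40, only the first 2 ints are written. The extra length computation is a fixed cost per call, which is consistent with it showing up mainly in the smallest sizes.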

Regarding the use of AVX2, I found that AVX2 generally improved performance despite the concerns about downclocking. This is an example comparison between the various paths for BitArray.CopyTo(Array, int) with bool arrays (see https://github.com/dotnet/corefx/issues/41762#issuecomment-542658154 and https://github.com/dotnet/corefx/issues/41762#issuecomment-542831649 for benchmarks of And/Or/Xor/Not and BitArray(bool[])):

// * Summary *                                                                                                          
BenchmarkDotNet=v0.11.5.1159-nightly, OS=Windows 10.0.18999
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-alpha1-014899
  [Host]              : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
  Job-VRDFCM          : .NET Core ? (CoreCLR 5.0.19.51405, CoreFX 5.0.19.51801), X64 RyuJIT
  AVX2 Disabled       : .NET Core ? (CoreCLR 5.0.19.51405, CoreFX 5.0.19.51801), X64 RyuJIT
  Intrinsics Disabled : .NET Core ? (CoreCLR 5.0.19.51405, CoreFX 5.0.19.51801), X64 RyuJIT

Method	Job	EnvironmentVariables	PowerPlanMode	Toolchain	IterationTime	MaxIterationCount	MinIterationCount	WarmupCount	Size	Mean	Error	StdDev	Median	Min	Max	Gen 0	Gen 1	Gen 2	Allocated
BitArrayCopyToBoolArray	Default	Empty	00000000-0000-0000-0000-000000000000	CoreRun	250.0000 ms	20	15	1	4	35.90 ns	0.218 ns	0.203 ns	35.84 ns	35.60 ns	36.26 ns	-	-	-	-
BitArrayCopyToBoolArray	AVX2 Disabled	COMPlus_EnableAVX2=0	8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c	Before	Default	Default	Default	Default	4	35.90 ns	0.142 ns	0.132 ns	35.89 ns	35.65 ns	36.13 ns	-	-	-	-
BitArrayCopyToBoolArray	Intrinsics Disabled	COMPlus_EnableHWIntrinsic=0	8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c	Before	Default	Default	Default	Default	4	73.08 ns	0.333 ns	0.311 ns	73.06 ns	72.56 ns	73.76 ns	-	-	-	-
BitArrayCopyToBoolArray	Default	Empty	00000000-0000-0000-0000-000000000000	CoreRun	250.0000 ms	20	15	1	512	191.24 ns	1.056 ns	0.988 ns	191.25 ns	189.72 ns	192.84 ns	-	-	-	-
BitArrayCopyToBoolArray	AVX2 Disabled	COMPlus_EnableAVX2=0	8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c	Before	Default	Default	Default	Default	512	220.78 ns	0.819 ns	0.766 ns	220.75 ns	219.00 ns	221.94 ns	-	-	-	-
BitArrayCopyToBoolArray	Intrinsics Disabled	COMPlus_EnableHWIntrinsic=0	8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c	Before	Default	Default	Default	Default	512	5,221.48 ns	16.221 ns	12.664 ns	5,227.59 ns	5,202.04 ns	5,235.70 ns	-	-	-	-
BitArrayCopyToBoolArray	Default	Empty	00000000-0000-0000-0000-000000000000	CoreRun	250.0000 ms	20	15	1	32768	9,776.74 ns	89.691 ns	74.896 ns	9,756.03 ns	9,696.10 ns	9,976.58 ns	-	-	-	-
BitArrayCopyToBoolArray	AVX2 Disabled	COMPlus_EnableAVX2=0	8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c	Before	Default	Default	Default	Default	32768	11,746.34 ns	50.812 ns	42.431 ns	11,735.14 ns	11,679.19 ns	11,834.34 ns	-	-	-	-
BitArrayCopyToBoolArray	Intrinsics Disabled	COMPlus_EnableHWIntrinsic=0	8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c	Before	Default	Default	Default	Default	32768	331,026.89 ns	936.011 ns	781.612 ns	331,110.79 ns	330,069.53 ns	332,353.66 ns	-	-	-	-
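
The three jobs above correspond to three code paths that are chosen at run time through the `IsSupported` checks: the 256-bit AVX2 path, a 128-bit path when AVX2 is disabled, and a scalar loop when hardware intrinsics are disabled. The sketch below shows that dispatch pattern on a bitwise And over two int arrays rather than on CopyTo, since it is shorter. It is an illustration only, not the code in this PR; `VectorisedAndSketch` is a hypothetical name, and the project needs unsafe code enabled to compile it.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class VectorisedAndSketch
{
    // Hypothetical helper: dst[i] &= src[i] for every element.
    public static unsafe void And(int[] dst, int[] src)
    {
        if (dst.Length != src.Length)
            throw new ArgumentException("Array lengths must match.");

        int count = dst.Length;
        int i = 0;

        fixed (int* pDst = dst, pSrc = src)
        {
            if (Avx2.IsSupported)
            {
                // 256-bit path: 8 ints per iteration.
                for (; i <= count - Vector256<int>.Count; i += Vector256<int>.Count)
                {
                    Vector256<int> a = Avx.LoadVector256(pDst + i);
                    Vector256<int> b = Avx.LoadVector256(pSrc + i);
                    Avx.Store(pDst + i, Avx2.And(a, b));
                }
            }
            else if (Sse2.IsSupported)
            {
                // 128-bit path: 4 ints per iteration.
                for (; i <= count - Vector128<int>.Count; i += Vector128<int>.Count)
                {
                    Vector128<int> a = Sse2.LoadVector128(pDst + i);
                    Vector128<int> b = Sse2.LoadVector128(pSrc + i);
                    Sse2.Store(pDst + i, Sse2.And(a, b));
                }
            }

            // Scalar tail; also the only path when no intrinsics are available.
            for (; i < count; i++)
            {
                pDst[i] &= pSrc[i];
            }
        }
    }
}
```

Because the `IsSupported` properties are treated as constants by the JIT, the untaken branches are expected to be removed from the generated code, so the dispatch itself should add no measurable cost.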