This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
corefx (public archive)

Speed up BitArray #39173

Closed
wants to merge 1 commit

Conversation

BruceForstall
Member

  1. Use AVX2, if available, for And/Or/Xor
  2. Vectorize Not
  3. Use Span.Fill() for SetAll()
  4. Add more test sizes to account for And/Or/Xor/Not loop unrolling cases
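The And/Or/Xor change above uses Avx2.And/Or/Xor over Vector256&lt;int&gt; in the C# source. As a rough, portable illustration of the loop shape only (a chunked body plus a scalar tail), here is a hypothetical C sketch of the AND case — not the PR's actual code:

```c
#include <stddef.h>
#include <stdint.h>

/* Word-wise AND of two BitArray backing stores (int32 words).
   The PR does this with Avx2.And over Vector256<int>, i.e. 8 ints
   per iteration; this portable sketch keeps the same loop shape:
   a chunked body plus a scalar tail for the leftover words. */
static void bitarray_and(int32_t *left, const int32_t *right, size_t count)
{
    size_t i = 0;
    /* body: 8 words per step, the width of one Vector256<int> */
    for (; i + 8 <= count; i += 8)
        for (size_t j = 0; j < 8; j++)
            left[i + j] &= right[i + j];
    /* tail: remaining words, one at a time */
    for (; i < count; i++)
        left[i] &= right[i];
}
```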

@BruceForstall
Member Author

Related: dotnet/performance#610

@BruceForstall
Member Author

Perf results:

summary:
better: 22, geomean: 1.722
worse: 2, geomean: 1.128
total diff: 24

| Slower                                                          | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| --------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Collections.Tests.Perf_BitArray.BitArrayOr(Size: 32)     |      1.18 |             4.27 |             5.03 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayGet(Size: 20480) |      1.08 |        155654.16 |        168092.73 |         |

| Faster                                                                      | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| --------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------- |
| System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 512)               |      5.96 |            64.24 |            10.79 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 20480)             |      5.91 |          2095.84 |           354.42 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 256)               |      5.74 |            35.80 |             6.24 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 32)                |      2.35 |             5.07 |             2.16 |         |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 512)            |      1.76 |            64.45 |            36.52 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayXor(Size: 512)               |      1.72 |            22.85 |            13.28 |         |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 20480)          |      1.67 |          2331.80 |          1396.23 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayOr(Size: 512)                |      1.57 |            20.96 |            13.34 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayAnd(Size: 512)               |      1.56 |            21.63 |            13.83 | bimodal |
| System.Collections.Tests.Perf_BitArray.BitArrayXor(Size: 20480)             |      1.52 |           992.41 |           651.68 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayAnd(Size: 20480)             |      1.52 |           988.19 |           649.90 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayOr(Size: 20480)              |      1.52 |           990.61 |           651.54 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayCopyToIntArray(Size: 32)     |      1.45 |            22.64 |            15.59 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayAnd(Size: 256)               |      1.37 |            12.74 |             9.31 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayOr(Size: 256)                |      1.33 |            12.48 |             9.36 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayXor(Size: 256)               |      1.33 |            12.35 |             9.31 |         |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 256)            |      1.21 |            22.79 |            18.87 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayAnd(Size: 32)                |      1.20 |             4.35 |             3.62 |         |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 32)             |      1.17 |             4.87 |             4.14 |         |
| System.Collections.Tests.Perf_BitArray.BitArraySetLengthShrink(Size: 256)   |      1.09 |            98.23 |            90.07 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayLengthValueCtor(Size: 20480) |      1.08 |           378.30 |           349.60 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayCopyToByteArray(Size: 32)    |      1.06 |            33.76 |            31.84 |         |

I presume the "Slower" cases are due to an insufficiently quiescent machine.

@GrabYourPitchforks
Member

GrabYourPitchforks commented Jul 3, 2019

Are these modifications "safe"? If somebody (like a separate thread) is modifying either BitArray instance concurrently with this pointer-based code running, is there a potential that memory corruption could occur due to references being swapped out mid-stream without anybody noticing?

Edit: I see that this change isn't introducing unsafe code; it's modifying existing unsafe code. I'll follow up offline regarding the existing code. Meanwhile don't let my comment here block checkin.

@BruceForstall
Member Author

@GrabYourPitchforks Well, I think my changes are no less safe than what was already there.

And the documentation (https://docs.microsoft.com/en-us/dotnet/api/system.collections.bitarray) says:

Thread Safety
Public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe.

This implementation does not provide a synchronized (thread safe) wrapper for a BitArray.

@BruceForstall
Member Author

@stephentoub

btw, are changes like this acceptable at this point in 3.0, or should it wait?

@stephentoub
Member

btw, are changes like this acceptable at this point in 3.0, or should it wait?

Thanks, Bruce. I suggest we wait on this until after we fork master. In the past we've sometimes had some bug tail from vectorization work, and at this point we don't have a lot of time to react to that well.

}

int arrayLength = GetInt32ArrayLengthFromBitLength(Length);
m_array.AsSpan(0, arrayLength).Fill(fillValue);
Member

@ahsonkhan had previously looked into doing this and found there were regressions for smaller inputs. See #33367 (comment) for a discussion with him and @GrabYourPitchforks.

Member Author

I expected that, but the perf results for a single "int" (32-bit) "SetAll" still showed an improvement, which is surprising:

| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 32) | 1.17 | 4.87 | 4.14 | |

Member Author

I should probably run them again, and try harder to get a quiescent machine.
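For reference, the word count that the Fill hunk above computes — GetInt32ArrayLengthFromBitLength(Length) — is just a round-up division of the bit length by 32. A C sketch of that arithmetic (the function name here is hypothetical):

```c
#include <stdint.h>

/* Number of int32 words needed to hold n bits: ceil(n / 32).
   A sketch of what GetInt32ArrayLengthFromBitLength computes;
   n is assumed non-negative and small enough that n + 31 does
   not overflow. */
static int32_t int32_array_length_from_bit_length(int32_t n)
{
    return (int32_t)(((uint32_t)n + 31u) >> 5);  /* >> 5 divides by 32 */
}
```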

{
for (; i < count - (Vector256<int>.Count - 1); i += Vector256<int>.Count)
{
Vector256<int> leftVec = Avx.LoadVector256(leftPtr + i);
Member

Are there any potential alignment issues here?

Member Author

@tannergooding Are there any Vector256 alignment issues to be aware of? (also, if you could look at this change, that would be great)

mikedn

Not from a correctness point of view as these loads/stores are unaligned anyway. It may be a perf issue as unaligned access that straddles cache lines is kind of slow. Typical vectorized implementations attempt to first align the pointer but it looks like the existing Vector128 code didn't bother with that so...

@EgorBo (Member) commented Jul 4, 2019

While I agree that AVX + aligned loads + partially unrolled loops doing 4 XORs per iteration are super fast for large inputs, I find SSE + unaligned loads more friendly for small inputs, and the code looks (and actually is) much smaller. Just take a look at this "calculate a sum of floats" impl with all of those in mind: https://github.com/dotnet/machinelearning/blob/287bc3e9b84f640e9b75e7b288c2663a536a9863/src/Microsoft.ML.CpuMath/AvxIntrinsics.cs#L988-L1095

Btw, the extended jump table probably regresses BitArrays with Length > 3 && Length < 8

Member

Yeah, as mikedn said, these are all "unaligned" loads anyways and the only real concern is for reads/writes that cross a cache-line or page boundary, which can be significantly slower. The current optimization manual indicates ~4.5 cycles vs 1 cycle for modern processors for cache-line load splits and ~150 cycles for a cache-line store split.

So, you might see some perf benefit (especially for large arrays) by aligning the pointer. You can generally still perform the first (handling unaligned elements) and last (handling trailing elements) via vectorization by masking out elements you don't care about.
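The head/body/tail split suggested here starts by computing how many leading elements to handle before the pointer reaches a 32-byte boundary. A hypothetical C sketch of that computation (assuming the backing store is at least 4-byte aligned, as an int[] would be):

```c
#include <stddef.h>
#include <stdint.h>

/* Head count for a head/body/tail split: how many int32 elements
   to process scalar (or masked) before `p` reaches a 32-byte,
   Vector256-sized boundary. Hypothetical helper, not the PR's
   code; assumes p is at least 4-byte aligned. */
static size_t head_count_to_align32(const int32_t *p, size_t count)
{
    uintptr_t misalign = (uintptr_t)p & 31u;          /* bytes past the last 32-byte boundary */
    if (misalign == 0)
        return 0;                                     /* already aligned */
    size_t head = (32u - misalign) / sizeof(int32_t); /* elements up to the next boundary */
    return head < count ? head : count;               /* clamp to the array length */
}
```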

@danmoseley added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Jul 3, 2019
@danmoseley
Member

danmoseley commented Jul 3, 2019

Marking no-merge to wait for the last merge from master into release.

@danmoseley
Member

I assume you'll run the tests with the various COMPlus_xx environment variables so it goes down each of the various ISA paths? I don't think we (@tannergooding?) have set up infrastructure yet to make that easy in CoreFX tests.

@mikedn

mikedn commented Jul 4, 2019

Are these modifications "safe"? If somebody (like a separate thread) is modifying either BitArray instance concurrently with this pointer-based code running, is there a potential that memory corruption could occur due to references being swapped out mid-stream without anybody noticing?

The current BitArray code is already broken, as it makes no attempt to prevent races leading to memory corruption. That should be fixed: it is OK for BitArray not to be thread safe, but in general a lack of thread safety in .NET shouldn't lead to memory corruption. It is expected that changing the length of the BitArray while an operation is in progress may lead to an IndexOutOfRangeException or something similar in the worst case.

case 3: m_array[2] &= value.m_array[2]; goto case 2;
case 2: m_array[1] &= value.m_array[1]; goto case 1;
case 1: m_array[0] &= value.m_array[0]; goto Done;
case 0: goto Done;
}

int i = 0;
if (Sse2.IsSupported)
if (Avx2.IsSupported)
Member

Do we know if BitArray is commonly large enough that we won't regress perf for the 16-32 byte scenario, since Avx2 will force that back to the scalar case?

It's likely worth determining the cutoff point at which AVX2 starts significantly winning out over SSE2 and accounting for that (I have seen it generally be at 48-128 bytes, rather than at exactly 32 bytes).
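One way to express such a measured cutoff is a small dispatch helper. The thresholds in this C sketch are purely illustrative placeholders — as the comment above says, the real crossover should come from measurement:

```c
#include <stddef.h>

/* Size-based dispatch between scalar, SSE2-width, and AVX2-width
   paths. Thresholds below are hypothetical, not measured values. */
enum simd_path { PATH_SCALAR, PATH_SSE2, PATH_AVX2 };

static enum simd_path choose_path(size_t count_int32, int has_sse2, int has_avx2)
{
    if (has_avx2 && count_int32 >= 16) /* e.g. at least two Vector256<int>'s worth */
        return PATH_AVX2;
    if (has_sse2 && count_int32 >= 4)  /* at least one Vector128<int>'s worth */
        return PATH_SSE2;
    return PATH_SCALAR;                /* small inputs: scalar/jump-table path */
}
```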

@GrabYourPitchforks
Member

It looks like the unsafe code was originally added back in #33781. I followed up with some folks offline and the consensus is that thread safety violations shouldn't lead to type safety violations or memory corruption. This gels with @mikedn's observations above and fits the pattern already used by some of our other types like Memory<T>.

I opened https://github.com/dotnet/corefx/issues/39204 to track the issue. I have no preference on whether the issue is addressed as part of this PR or as a separate follow-up PR.

@jkotas
Member

jkotas commented Jul 4, 2019

no preference on whether the issue is addressed as part of this PR or as a separate follow-up PR.

#39204 should be addressed in separate PR for 3.0. This vectorization work is not for 3.0.

@Gnbrkm41

Gnbrkm41 commented Jul 5, 2019

Please close #37946 if this gets merged :^)

@BruceForstall
Member Author

To summarize the work left to do:

  1. Test and measure performance with AVX2 disabled, and AVX2 and SSE2 both disabled. For performance, this is to see when AVX2 starts being better.
  2. Measure the performance of SetAll for various small sizes to see if the previous simple loop implementation is faster, and at what point the vectorized implementation in Span.Fill() is better.
  3. Investigate aligning Vector256 (and Vector128) reads/writes: split operations into unaligned head, aligned body, unaligned tail. Measure by forcing unaligned and forcing aligned incoming pointers to see the difference. Measure for small and large BitArray sizes.
  4. Merge with Ensure BitArray unsafe accesses are within bounds #39270

@stephentoub removed the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Jul 17, 2019
@adamsitnik added the tenet-performance (Performance related issue) label Jul 31, 2019
@adamsitnik
Member

To summarize the work left to do

@BruceForstall any progress on this? Could you please add a test case of length=4 to the existing performance repo benchmark:

https://github.com/dotnet/performance/blob/fb4d42a60f1e79e41212a46a75583bd41653f71d/src/benchmarks/micro/corefx/System.Collections/Perf.BitArray.cs#L17

- [Params(512)]
+ [Params(4, 512)]

@BruceForstall
Member Author

@adamsitnik I've been waiting until master is post-3.0 (i.e., now), and to find some time. Thanks for the suggestion; will do that.

@karelz added this to the 5.0 milestone Aug 3, 2019
@maryamariyan
Member

maryamariyan commented Aug 7, 2019

@BruceForstall master branch is now open for 5.0 submissions. There are some conflicts on this PR; would you be updating this, or do you prefer to close it and reopen it at a later time?

Asking because the aim is to have fewer PRs open that are inactive for multiple weeks.

@BruceForstall
Member Author

@maryamariyan I plan to update it. Hopefully relatively soon.

@ViktorHofer
Member

ping @BruceForstall

@BruceForstall
Member Author

@ViktorHofer I haven't forgotten, but new responsibilities + vacation have gotten in the way. If someone else wants to pick this up and finish it, I'm willing to help. Or, I'll get to it eventually...

@maryamariyan
Member

maryamariyan commented Oct 8, 2019

If nobody will be working on this PR anytime soon, it is best to link this PR to an issue and close it.
@BruceForstall do you agree? (and mark it as up-for-grabs or something)

@BruceForstall
Member Author

@maryamariyan There is already an issue: https://github.com/dotnet/corefx/issues/37946. I'd prefer to keep this open a little longer, but if you have a "not recently touched PR" requirement to close it, then I guess that's ok.

@ViktorHofer
Member

We don't have such a requirement, but we prefer to keep the list of open PRs from getting too long. That said, I'm fine with keeping this one open.

@maryamariyan
Member

Thanks Bruce. That makes sense. We're trying to reduce the backlog of old PRs so we can focus on being responsive on the active ones. I'm going to close this for now. Please do reopen it when you're ready to pick this up again.

@adamsitnik
Member

I would really love to see these optimizations merged. I've created a new issue to keep track of it and added the Hacktoberfest label, hoping that some contributor is going to pick it up.

https://github.com/dotnet/corefx/issues/41762
