Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BlockingCollection<T>.TryTakeFromAny throws InvalidOperationException when underlying collection is ConcurrentBag<T> #30781

Closed
ReubenBond opened this Issue Jul 2, 2018 · 17 comments

Comments

Projects
None yet
6 participants
@ReubenBond
Copy link

ReubenBond commented Jul 2, 2018

When the underlying collection for BlockingCollection<T> is ConcurrentBag<T>, concurrent calls to the static BlockingCollection<T>.TryTakeFromAny method can sometimes throw InvalidOperationException with the message "The underlying collection was modified from outside of the BlockingCollection". This can occur without any external modification to the collection.

This behavior is present in .NET Core 2.0/2.1 but not in .NET Framework 4.6.1.

Repro: https://gist.github.com/ReubenBond/98de2cede0d57a989ededa8e113b0f39#file-blockingcollection_concurrentbag_issue-cs

EDIT: This can be reproduced without TryTakeFromAny by replacing that line in the repro with success = blockingCollection.TryTake(out _).

EDIT 2: This does not reproduce with an underlying collection of type ConcurrentQueue<T>

EDIT 3: Updated repro to use ThreadPool instead of tasks - it reproduces much more frequently now.

@stephentoub

This comment has been minimized.

Copy link
Member

stephentoub commented Jul 2, 2018

I believe the problem is that when we rewrote ConcurrentBag for .NET Core to significantly improve its performance, as one small part of that we effectively removed this check:
https://referencesource.microsoft.com/#System/sys/system/collections/concurrent/ConcurrentBag.cs,397

The problem with that, as this repro highlights, is that if multiple threads are taking/stealing, it's possible that a thread may miss an item if it's taken by another thread. Consider a situation with four threads each with their own local queue, all of which are currently empty. Then consider this ordering of operations:

  1. Thread 2 adds an item, incrementing the BlockingCollection's semaphore to 1.
  2. Thread 4 tries to take an item; there is one, so it decrements the semaphore's count to 0, finds its local queue empty, and starts searching for an item to steal. It looks at thread 1 and finds its local queue empty.
  3. Thread 1 adds an item, incrementing the BC's semaphore back to 1. Thread 4 just checked Thread 1's queue, so it's not going to check it again.
  4. Thread 2 takes an item, decrementing the semaphore back to 0. It checks it own queue and finds that it contains an item. Success.
  5. Thread 4 continues its search: it finds thread 2's list is empty and then finds thread 3's list is empty. It's now looked at all of the lists, and returns false from TryTake, even though thread 1's list contains an item and it successfully decremented the semaphore's count. BlockingCollection throws.

So, even though there was an item in the collection that could have been taken, it missed it.

This sequence highlights why "Updated repro to use ThreadPool instead of tasks - it reproduces much more frequently now" made a difference: ThreadPool.QueueUserWorkItem(callback) puts work items into the global queue, whereas Task.Run from a thread pool thread puts the task into the thread's queue... that means the thread that's doing the add is very likely to keep doing adds rather than takes, which means it'll be much less likely to get into a situation like with steps (1) and (4) in the above sequence, where the same thread needed to add then take.

Unfortunately I think we're going to need to put back some kind of versioning check, where steals that fail check the versions, and if anything's been added since, it tries again.

cc: @kouvel, @benaadams

@ReubenBond

This comment has been minimized.

Copy link
Author

ReubenBond commented Jul 4, 2018

Thank you for taking a look, @stephentoub.

One effect of this bug is that some items which are added to the collection cannot be retrieved even after successive calls to TryTake, as seen in this modified repro, where the results typically look something like this:
image

IsCompleted returns true while the items are still in the underlying collection. The underlying collection show the lost items, but blockingCollection.Count == 0.

Maybe this is because IsCompleted is implemented as IsAddingCompleted && _occupiedNodes.Currentcount == 0, but when the InvalidOperationException is thrown from TryTakeWithNoTimeValidation, the semaphore is not released (even though an item was not taken), and on subsequent calls, that method will terminate early if IsCompleted is true.

@stephentoub

This comment has been minimized.

Copy link
Member

stephentoub commented Jul 11, 2018

@ReubenBond, would you be able to test the fix in #30947 and confirm that it fixes your issue?

@danmosemsft

This comment has been minimized.

Copy link
Member

danmosemsft commented Jul 13, 2018

@ReubenBond ? If you guys can OK master, that would make us more comfortable with taking this into 2.1 in servicing.

@ReubenBond

This comment has been minimized.

Copy link
Author

ReubenBond commented Jul 13, 2018

Apologies, I'll try to build CoreFx and test today.
Thank you for addressing it so quickly.

Please note the second issue which resides in BlockingCollection<T> itself: #30781 (comment)

@stephentoub

This comment has been minimized.

Copy link
Member

stephentoub commented Jul 13, 2018

Please note the second issue

There's no issue if the wrapped collection behaves correctly, though, right?

@ReubenBond

This comment has been minimized.

Copy link
Author

ReubenBond commented Jul 13, 2018

@stephentoub that's right, so this fix should rectify our problems. It's just a question of resiliency, since a transient error with the underlying collection will break the BlockingCollection

@ReubenBond

This comment has been minimized.

Copy link
Author

ReubenBond commented Jul 13, 2018

Apologies, @stephentoub @danmosemsft, could you point me in the right direction for using a local corefx build from a csproj?

@danmosemsft

This comment has been minimized.

@ReubenBond

This comment has been minimized.

Copy link
Author

ReubenBond commented Jul 14, 2018

Thanks, @danmosemsft.

It worked! This change fixes the issue we were experiencing.

As a bonus, I see a massive speed improvement in our little Orleans repro project when running netcoreapp3.0 with the latest nightly SDK compared to netcoreapp2.0 on SDK 2.1.301.

Before: 41.48 ms per iteration on average
After: 14.07 ms per iteration on average

EDIT: I'm not sure of the etiquette/workflow for corefx, but please feel free to close this.

@stephentoub

This comment has been minimized.

Copy link
Member

stephentoub commented Jul 14, 2018

Thanks for validating!

I see a massive speed improvement in our little Orleans repro project when running netcoreapp3.0 with the latest nightly SDK compared to netcoreapp2.0 on SDK 2.1.301.

Did you try with netcoreapp2.1? My assumption is that's where the bulk of the wins are coming from, though there has already been some additional perf work for netcoreapp3.0, just not as much.

I'm not sure of the etiquette/workflow for corefx, but please feel free to close this.

Thanks. We'll close it when we either close or merge the release/2.1 port PR.

@danmosemsft

This comment has been minimized.

Copy link
Member

danmosemsft commented Jul 17, 2018

Shiproom template

Description

ConcurrentBag.TryTake may fail to take an item from the collection even if it’s known to be there.  This in turn causes problems for wrappers that assume if they know the collection contains an item that TryTake will succeed, like BlockingCollection.  Race conditions can result in BlockingCollection throwing exceptions and getting into a corrupted state due to TryTake failing to return an item when it should have been able to.

Customer Impact

Exceptions / corrupted data structures / deadlocks when multiple threads access a BlockingCollection wrapped around a ConcurrentBag and race in a manner that results in takes on the bag failing.
Reported by Orleans.

Regression?

Regression from 1.x

Risk

Low:
- Small perf hit due to additional synchronization on some code paths, but that synchronization was already there pre-.NET Core 2.0 and is in netfx.
- The fix involves retries if a particular status changes during the operation, and so in theory it's possible that each try could hit that same condition and result in livelock, but the chances of that are so small as to not be relevant, and the same issue was present in pre-.NET Core 2.0 and is in netfx.

@danmosemsft

This comment has been minimized.

Copy link
Member

danmosemsft commented Jul 17, 2018

Ported to 2.1 with #31009

@danmosemsft

This comment has been minimized.

Copy link
Member

danmosemsft commented Jul 17, 2018

Pulled this temporarily as it missed 2.1.3 and they want a clean branch in case they need to rebuild 2.1.3 for some reason. keeping this issue open meantime.

@joshfree

This comment has been minimized.

Copy link
Member

joshfree commented Aug 9, 2018

Moving this to 2.1.5 per shiproom today.

@tarekgh would you be able to help load balance this from @danmosemsft

@danmosemsft

This comment has been minimized.

Copy link
Member

danmosemsft commented Aug 14, 2018

Moving label to PR per new process

BTW I think the PR is already ready, no work required

@karelz

This comment has been minimized.

Copy link
Member

karelz commented Sep 6, 2018

Fixed in release/2.1 branch in PR #31162 (2nd attempt after the first attempt in PR #31009 was reverted by PR #31132). The fix will ship as part of 2.1.5 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.