
BlockingCollection<T>.TryTakeFromAny throws InvalidOperationException when underlying collection is ConcurrentBag<T> #26671

Closed
ReubenBond opened this issue Jul 2, 2018 · 17 comments

Comments

@ReubenBond
Member

When the underlying collection for BlockingCollection<T> is ConcurrentBag<T>, concurrent calls to the static BlockingCollection<T>.TryTakeFromAny method can sometimes throw InvalidOperationException with the message "The underlying collection was modified from outside of the BlockingCollection". This can occur without any external modification to the collection.

This behavior is present in .NET Core 2.0/2.1 but not in .NET Framework 4.6.1.

Repro: https://gist.github.com/ReubenBond/98de2cede0d57a989ededa8e113b0f39#file-blockingcollection_concurrentbag_issue-cs

EDIT: This can be reproduced without TryTakeFromAny by replacing that line in the repro with success = blockingCollection.TryTake(out _).

EDIT 2: This does not reproduce with an underlying collection of type ConcurrentQueue<T>

EDIT 3: Updated repro to use ThreadPool instead of tasks - it reproduces much more frequently now.
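For readers who don't want to open the gist, here is a minimal sketch of the shape of the repro (illustrative only, not the gist's exact code; the class name and loop structure are made up). Several ThreadPool work items repeatedly add to a BlockingCollection<int> backed by a ConcurrentBag<int> and then call the static TryTakeFromAny:

```csharp
// Illustrative sketch only -- not the linked gist. Several ThreadPool work items each
// add an item to a BlockingCollection<int> backed by a ConcurrentBag<int> and then
// try to take one back via the static TryTakeFromAny.
using System;
using System.Collections.Concurrent;
using System.Threading;

class BlockingCollectionOverBagSketch
{
    static void Main()
    {
        var blockingCollection = new BlockingCollection<int>(new ConcurrentBag<int>());
        var collections = new[] { blockingCollection };

        for (int i = 0; i < Environment.ProcessorCount; i++)
        {
            ThreadPool.QueueUserWorkItem(_ =>
            {
                while (true)
                {
                    blockingCollection.Add(1);
                    try
                    {
                        // Occasionally throws InvalidOperationException ("The underlying
                        // collection was modified from outside of the BlockingCollection")
                        // even though nothing outside the BlockingCollection touches the bag.
                        BlockingCollection<int>.TryTakeFromAny(collections, out _);
                    }
                    catch (InvalidOperationException ex)
                    {
                        Console.WriteLine(ex.Message);
                    }
                }
            });
        }

        Thread.Sleep(Timeout.Infinite);
    }
}
```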

@stephentoub
Member

I believe the problem is that when we rewrote ConcurrentBag for .NET Core to significantly improve its performance, as one small part of that we effectively removed this check:
https://referencesource.microsoft.com/#System/sys/system/collections/concurrent/ConcurrentBag.cs,397

The problem with that, as this repro highlights, is that if multiple threads are taking/stealing, it's possible that a thread may miss an item if it's taken by another thread. Consider a situation with four threads each with their own local queue, all of which are currently empty. Then consider this ordering of operations:

  1. Thread 2 adds an item, incrementing the BlockingCollection's semaphore to 1.
  2. Thread 4 tries to take an item; there is one, so it decrements the semaphore's count to 0, finds its local queue empty, and starts searching for an item to steal. It looks at thread 1 and finds its local queue empty.
  3. Thread 1 adds an item, incrementing the BC's semaphore back to 1. Thread 4 just checked Thread 1's queue, so it's not going to check it again.
  4. Thread 2 takes an item, decrementing the semaphore back to 0. It checks its own queue and finds that it contains an item. Success.
  5. Thread 4 continues its search: it finds thread 2's list is empty and then finds thread 3's list is empty. It's now looked at all of the lists, and returns false from TryTake, even though thread 1's list contains an item and it successfully decremented the semaphore's count. BlockingCollection throws.

So, even though there was an item in the collection that could have been taken, it missed it.

This sequence highlights why "Updated repro to use ThreadPool instead of tasks - it reproduces much more frequently now" made a difference: ThreadPool.QueueUserWorkItem(callback) puts work items into the global queue, whereas Task.Run from a thread pool thread puts the task into that thread's local queue. That means the thread doing the adds is very likely to keep doing adds rather than takes, so it is much less likely to get into a situation like steps (1) and (4) in the above sequence, where the same thread needed to add and then take.
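To make that scheduling difference concrete, here is a small standalone illustration (the `work` delegate is only a placeholder for one add-then-TryTake iteration of the repro, not its actual code):

```csharp
// Standalone illustration of where each API places its work item; `work` is a
// placeholder for one add-then-TryTake iteration of the repro.
using System;
using System.Threading;
using System.Threading.Tasks;

class QueuePlacement
{
    static void Main()
    {
        Action work = () => { /* add to the blocking collection, then TryTake */ };

        // Lands in the ThreadPool's global queue, so any worker thread may pick it up;
        // the thread that just did an add is then often the one that does the next take,
        // exactly the interleaving in steps (1) and (4) above.
        ThreadPool.QueueUserWorkItem(_ => work());

        // Task.Run, when called *from a thread pool thread*, pushes the task onto that
        // thread's local queue, so the adding thread tends to keep servicing its own adds
        // and the racy add-then-take pattern on a single thread is much rarer.
        ThreadPool.QueueUserWorkItem(_ => Task.Run(work));

        Thread.Sleep(100); // let the queued work run before the process exits
    }
}
```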

Unfortunately I think we're going to need to put back some kind of versioning check, where steals that fail check the versions, and if anything's been added since, it tries again.
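A rough sketch of the kind of versioning/retry pattern being described follows (purely illustrative: VersionedStealSketch is a made-up type and this is not the actual corefx change, which lives inside ConcurrentBag<T> itself):

```csharp
// Illustrative only: a made-up VersionedStealSketch<T> showing the retry idea, not the
// real ConcurrentBag<T> implementation. Adds bump a version counter; a take that scans
// every queue and finds nothing retries if the counter moved during its scan.
using System.Collections.Concurrent;
using System.Threading;

class VersionedStealSketch<T>
{
    private long _addVersion;                       // incremented on every add
    private readonly ConcurrentQueue<T>[] _queues;  // stand-in for per-thread work-stealing queues

    public VersionedStealSketch(int queueCount)
    {
        _queues = new ConcurrentQueue<T>[queueCount];
        for (int i = 0; i < queueCount; i++)
            _queues[i] = new ConcurrentQueue<T>();
    }

    public void Add(int queueIndex, T item)
    {
        _queues[queueIndex].Enqueue(item);
        Interlocked.Increment(ref _addVersion);     // publish "something was added"
    }

    public bool TryTake(out T item)
    {
        while (true)
        {
            long versionBefore = Interlocked.Read(ref _addVersion);

            foreach (ConcurrentQueue<T> queue in _queues)
            {
                if (queue.TryDequeue(out item))
                    return true;
            }

            // Every queue looked empty, but an add may have raced with the scan (the
            // situation in steps 1-5 above). Only report failure if nothing was added
            // while we were looking; otherwise scan again.
            if (Interlocked.Read(ref _addVersion) == versionBefore)
            {
                item = default;
                return false;
            }
        }
    }
}
```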

cc: @kouvel, @benaadams

@stephentoub stephentoub self-assigned this Jul 2, 2018
@ReubenBond
Member Author

Thank you for taking a look, @stephentoub.

One effect of this bug is that some items which are added to the collection cannot be retrieved even after successive calls to TryTake, as seen in this modified repro, where the results typically look something like this:
[screenshot: sample output showing unretrievable items remaining in the underlying collection]

IsCompleted returns true while the items are still in the underlying collection. The underlying collection shows the lost items, but blockingCollection.Count == 0.

Maybe this is because IsCompleted is implemented as IsAddingCompleted && _occupiedNodes.CurrentCount == 0: when the InvalidOperationException is thrown from TryTakeWithNoTimeValidation, the semaphore is not released (even though an item was not taken), and on subsequent calls that method terminates early because IsCompleted is true.
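A small sketch of how that discrepancy can be observed, assuming the caller kept a reference to the ConcurrentBag<T> it passed into the constructor (as the modified repro does):

```csharp
// Sketch of how the discrepancy above can be observed; assumes the caller kept a
// reference to the bag it passed into the BlockingCollection constructor.
using System;
using System.Collections.Concurrent;

class StrandedItemsCheck
{
    static void Main()
    {
        var bag = new ConcurrentBag<int>();
        var blockingCollection = new BlockingCollection<int>(bag);

        // ... run the concurrent add/take workload that hits the InvalidOperationException ...

        blockingCollection.CompleteAdding();

        // Because a failed take decremented the internal semaphore without removing an
        // item, blockingCollection.Count can reach 0 (so IsCompleted reports true) while
        // bag.Count still shows the stranded items.
        Console.WriteLine($"IsCompleted: {blockingCollection.IsCompleted}");
        Console.WriteLine($"blockingCollection.Count = {blockingCollection.Count}, bag.Count = {bag.Count}");
    }
}
```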

@stephentoub
Member

@ReubenBond, would you be able to test the fix in dotnet/corefx#30947 and confirm that it fixes your issue?

@danmoseley
Member

@ReubenBond? If you can OK the fix on master, that would make us more comfortable with taking this into 2.1 in servicing.

@ReubenBond
Member Author

Apologies, I'll try to build CoreFx and test today.
Thank you for addressing it so quickly.

Please note the second issue which resides in BlockingCollection<T> itself: https://github.com/dotnet/corefx/issues/30781#issuecomment-402371237

@stephentoub
Member

Please note the second issue

There's no issue if the wrapped collection behaves correctly, though, right?

@ReubenBond
Member Author

@stephentoub that's right, so this fix should rectify our problems. It's just a question of resiliency, since a transient error with the underlying collection will break the BlockingCollection.

@ReubenBond
Member Author

Apologies, @stephentoub @danmosemsft, could you point me in the right direction for using a local corefx build from a csproj?

@danmoseley
Member

@ReubenBond
Member Author

ReubenBond commented Jul 14, 2018

Thanks, @danmosemsft.

It worked! This change fixes the issue we were experiencing.

As a bonus, I see a massive speed improvement in our little Orleans repro project when running netcoreapp3.0 with the latest nightly SDK compared to netcoreapp2.0 on SDK 2.1.301.

Before: 41.48 ms per iteration on average
After: 14.07 ms per iteration on average

EDIT: I'm not sure of the etiquette/workflow for corefx, but please feel free to close this.

@stephentoub
Member

Thanks for validating!

I see a massive speed improvement in our little Orleans repro project when running netcoreapp3.0 with the latest nightly SDK compared to netcoreapp2.0 on SDK 2.1.301.

Did you try with netcoreapp2.1? My assumption is that's where the bulk of the wins are coming from, though there has already been some additional perf work for netcoreapp3.0, just not as much.

I'm not sure of the etiquette/workflow for corefx, but please feel free to close this.

Thanks. We'll close it when we either close or merge the release/2.1 port PR.

@danmoseley
Member

Shiproom template

Description

ConcurrentBag.TryTake may fail to take an item from the collection even when the item is known to be there. This in turn causes problems for wrappers, like BlockingCollection, that assume TryTake will succeed whenever the collection is known to contain an item. Race conditions can result in BlockingCollection throwing exceptions and getting into a corrupted state because TryTake failed to return an item when it should have been able to.

Customer Impact

Exceptions / corrupted data structures / deadlocks when multiple threads access a BlockingCollection wrapped around a ConcurrentBag and race in a manner that results in takes on the bag failing.
Reported by Orleans.

Regression?

Regression from 1.x

Risk

Low:
- Small perf hit due to additional synchronization on some code paths, but that synchronization was already there pre-.NET Core 2.0 and is in netfx.
- The fix involves retrying if a particular state changes during the operation, so in theory every retry could hit the same condition and result in livelock; however, the chances of that are so small as to not be relevant, and the same possibility was present pre-.NET Core 2.0 and is in netfx.

@danmoseley
Member

Ported to 2.1 with dotnet/corefx#31009

@danmoseley
Member

Pulled this temporarily as it missed 2.1.3 and they want a clean branch in case they need to rebuild 2.1.3 for some reason. Keeping this issue open in the meantime.

@danmoseley danmoseley reopened this Jul 17, 2018
@joshfree
Member

joshfree commented Aug 9, 2018

Moving this to 2.1.5 per shiproom today.

@tarekgh, would you be able to help load-balance this from @danmosemsft?

@danmoseley
Member

Moving label to PR per new process

BTW I think the PR is already ready, no work required

@karelz
Member

karelz commented Sep 6, 2018

Fixed in release/2.1 branch in PR dotnet/corefx#31162 (2nd attempt after the first attempt in PR dotnet/corefx#31009 was reverted by PR dotnet/corefx#31132). The fix will ship as part of 2.1.5 release.

@karelz karelz closed this as completed Sep 6, 2018
@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 2.1.x milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 16, 2020