Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CancellationTokenRegistration.Unregister race condition #309

Merged
merged 1 commit into from Dec 2, 2019

Conversation

stephentoub
Copy link
Member

There's a race condition that exists if multiple threads are accessing the same CancellationTokenRegistration field, with one Unregistering and zero'ing out the field while another Unregisters. We shouldn't ever be in a situation where we have a non-null node and a 0 id, but due to struct tearing we can end up in that exact situation inside of Unregister (if the thread zero'ing out the struct succeeded in zero'ing out the id but we still read a valid node). If the node was then already unregistered, it will contain 0 for its id, in which case we'll see that the ids match and assume that means the node is still in the list. At that point we proceed to dereference nodes in the list and null ref.

The fix is simple: rather than just asserting that we'll never get 0, explicitly check for 0 and return if it is.

Fixes https://github.com/dotnet/coreclr/issues/22946
cc: @tarekgh, @kouvel, @rynowak

This has shown up sporadically in a variety of workloads that use Task.Delay(timeout, token), because if just the right interleaving occurs it can exhibit this behavior. It's easy to repro though by stressing this pattern:

using System;
using System.Threading;

class Program
{
    static void Main()
    {
        CancellationTokenRegistration reg = default;
        var cts = new CancellationTokenSource();

        ThreadPool.QueueUserWorkItem(delegate
        {
            while (true)
            {
                reg = cts.Token.Register(() => { });
                reg.Unregister();
                reg = default;
            }
        });

        ThreadPool.QueueUserWorkItem(delegate
        {
            while (true)
            {
                reg.Unregister();
            }
        });

        Console.ReadLine();
    }
}

There's a race condition that exists if multiple threads are accessing the same CancellationTokenRegistration field, with one Unregistering and zero'ing out the field while another Unregisters.  We shouldn't ever be in a situation where we have a non-null node and a 0 id, but due to struct tearing we can end up in that exact situation inside of Unregister (if the thread zero'ing out the struct succeeded in zero'ing out the id but we still read a valid node).  If the node was then already unregistered, it will contain 0 for its id, in which case we'll see that the ids match and assume that means the node is still in the list.  At that point we proceed to dereference nodes in the list and null ref.

The fix is simple: rather than just asserting that we'll never get 0, explicitly check for 0 and return if it is.
Debug.Assert(node != null, "Expected non-null node");

if (id == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not familiar with the calling code, so just curious...

If it's possible for id == 0 && node == null (I think this was the case that caused the exception) is it also possible to see id != 0 && node == null? What guarantees that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not here. If node is null, we won't ever make it here because all the callers check for node being null and don't make this call at all if it is. And the callers that check for node being null don't care about the value of id. The id only exists to ensure that the holder of the struct has the right to mutate the node instance, and we enter the lock while examining/mutating the node state.

@danmoseley
Copy link
Member

Is the repro above worth putting into a test? You mention it fails within a second.

@tactical-drone
Copy link

Aah I was playing around with Task.Delay(timeout, token) at the time I got this Unregister issue. I eventually removed all Task.Delay from my code as Task.Delay is completely broken IMO. Does not work at all like you think it should in async/await code.

@stephentoub
Copy link
Member Author

Is the repro above worth putting into a test?

We can, but it won't pass until we have live/live builds working. I can put up a separate PR for it.

@stephentoub
Copy link
Member Author

Can someone sign-off on this for me? :)

@stephentoub
Copy link
Member Author

Thanks :)

@stephentoub stephentoub merged commit d4ef0d1 into dotnet:master Dec 2, 2019
@stephentoub stephentoub deleted the fixctrace branch December 2, 2019 21:45
@dotnet dotnet locked as resolved and limited conversation to collaborators Dec 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants