3.0 preview 2: NullReferenceException at CancellationTokenSource.CallbackPartition.Unregister #22946

Closed
tactical-drone opened this issue Mar 1, 2019 · 7 comments · Fixed by dotnet/runtime#309

@tactical-drone tactical-drone commented Mar 1, 2019

On Linux with the new preview I have stumbled upon a strange, rare bug that does not occur on Windows. The result is a CLR crash with SIGSEGV.

When I ran the program overnight with gdb --args dotnet run ... it ran for 7 hours and then crashed with:

Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Threading.CancellationTokenSource.CallbackPartition.Unregister(Int64 id, CallbackNode node)
   at System.Threading.Tasks.Task.DelayPromiseWithCancellation.Cleanup()
   at System.Threading.Tasks.Task.DelayPromise.CompleteTimedOut()
   at System.Threading.Tasks.Task.DelayPromise.<>c.<.ctor>b__1_0(Object state)
   at System.Threading.TimerQueueTimer.CallCallback(Boolean isThreadPool)
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()
[New Thread 0x7fffff580700 (LWP 4018)]
[New Thread 0x7ffff6fb0700 (LWP 4019)]
[Thread 0x7ffffcea0700 (LWP 815) exited]
[Thread 0x7ffffc690700 (LWP 816) exited]
[Thread 0x7ffffbe80700 (LWP 817) exited]
[Thread 0x7ffffb2f0700 (LWP 818) exited]
[Thread 0x7fffff5f0700 (LWP 819) exited]
[Thread 0x7ffff6580700 (LWP 824) exited]
[Thread 0x7fff83900700 (LWP 827) exited]
[Thread 0x7fff825b0700 (LWP 845) exited]
[Thread 0x7fffff4a0700 (LWP 927) exited]
[Thread 0x7ffff77c0700 (LWP 4015) exited]
[Thread 0x7fffff580700 (LWP 4018) exited]
[Thread 0x7ffff6fb0700 (LWP 4019) exited]
[Inferior 1 (process 814) exited with code 0206]
(gdb)

I have no idea why the output looks like that or why a C#-like stack trace was printed. I was expecting some machine code or something.

This did not occur on .NET Core 3.0 preview 1. I am highly confident the issue was introduced with preview 2, but there is still a small chance my program is causing this somehow, since I interop with C code, even though that C code does not use pointers at all.

@sergiy-k sergiy-k added this to the 3.0 milestone Jun 5, 2019
@kouvel kouvel removed their assignment Jun 5, 2019
Member

@kouvel kouvel commented Jun 5, 2019

Member

@tarekgh tarekgh commented Jun 6, 2019

@stephentoub can you think of any way CallbackPartition.Unregister could throw a NullReferenceException? I have looked at all the callers and we always ensure the passed node is not null. Inside CallbackPartition.Unregister, I don't see a way to throw a NullReferenceException unless somehow node != Callbacks while node.Prev is null. The only way that could happen is if CallbackPartition.Unregister were called from two threads at the same time with the exact same node object. I am not sure whether that is possible.

Member

@tarekgh tarekgh commented Jun 6, 2019

Looking more, even the scenario I mentioned, calling CallbackPartition.Unregister twice at the same time with the exact same node, wouldn't cause this problem either, because we reset node.Id. I have no clue now how this NullReferenceException can happen.
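
The reasoning in the two comments above can be illustrated with a simplified sketch. This is not the runtime's actual source: SketchNode and SketchPartition are hypothetical stand-ins for a lock-protected doubly linked list of callbacks, written only to show why a second Unregister of the same node is harmless once the Id has been reset, and why a non-head node with a null Prev (simulated here by deliberate corruption) is the one case that dereferences null.

```csharp
using System;

// Hypothetical, simplified stand-ins; NOT the runtime's actual types.
sealed class SketchNode
{
    public long Id;
    public SketchNode Prev;
    public SketchNode Next;
}

sealed class SketchPartition
{
    private readonly object _lock = new object();
    private long _nextId = 1;
    public SketchNode Head;

    // Pushes a new node onto the front of the list.
    public SketchNode Register()
    {
        lock (_lock)
        {
            var node = new SketchNode { Id = _nextId++, Next = Head };
            if (Head != null) Head.Prev = node;
            Head = node;
            return node;
        }
    }

    public bool Unregister(long id, SketchNode node)
    {
        lock (_lock)
        {
            // Already removed (Id was reset to 0) or a stale id: nothing to do.
            if (id == 0 || node.Id != id) return false;

            if (Head == node)
                Head = node.Next;
            else
                node.Prev.Next = node.Next; // only throws if a non-head node has Prev == null

            if (node.Next != null) node.Next.Prev = node.Prev;

            // Reset the node so a repeated Unregister becomes a no-op.
            node.Id = 0;
            node.Prev = node.Next = null;
            return true;
        }
    }
}

static class SketchProgram
{
    static void Main()
    {
        var partition = new SketchPartition();
        var first = partition.Register();
        var second = partition.Register(); // second is now the head; first.Prev == second
        long firstId = first.Id;

        Console.WriteLine(partition.Unregister(firstId, first)); // True
        Console.WriteLine(partition.Unregister(firstId, first)); // False: Id was reset

        var third = partition.Register();  // third is now the head; second.Prev == third
        second.Prev = null;                // simulated corruption, never done by the list itself
        try
        {
            partition.Unregister(second.Id, second);
        }
        catch (NullReferenceException)
        {
            Console.WriteLine("NullReferenceException");
        }
    }
}
```

In this sketch the exception is reachable only after the list has been tampered with from outside the lock, which matches the suspicion in the thread that something other than Unregister itself corrupted the state.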

Member

@stephentoub stephentoub commented Jun 6, 2019

I also don't see how that could be coming from Unregister, unless corruption was introduced either via some code elsewhere in the process or a code gen issue.

Has this happened again? I don't think there's much we can do at this point without more info.

If you can get and share a core dump if it happens again, that would be helpful.

Author

@tactical-drone tactical-drone commented Jun 6, 2019

I am not sure how I eventually got rid of this error, but I think my program caused it by adding to a blocking collection without specifying a BoundedCapacity. That might have caused some kind of OOM situation (on Linux only), causing teardown. My program is full of CancellationToken.Unregister paths, but that one came from a timer, so who knows.

After I specified that bound on the blocking collection, I think the problem went away. It had a low probability of occurring, so I am not entirely sure when it got fixed. I went back through my commits and this is all I could find that could have fixed it.

I therefore think you can safely ignore this error. That was also the only time it failed with a .NET stack trace; it has never done so before or since. So it might have been a random error due to corruption.
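
For readers unfamiliar with the bound mentioned above, here is a minimal sketch of a BlockingCollection<T> constructed with a bounded capacity. The capacity of 2 and the class name BoundedQueueSketch are arbitrary choices for illustration and are not taken from the reporter's program. Without a bound, producers can keep adding items for as long as memory allows; with one, TryAdd reports failure once the collection is full.

```csharp
using System;
using System.Collections.Concurrent;

static class BoundedQueueSketch
{
    static void Main()
    {
        // A capacity of 2 is an arbitrary value chosen for illustration.
        using (var queue = new BlockingCollection<int>(boundedCapacity: 2))
        {
            Console.WriteLine(queue.TryAdd(1));       // True
            Console.WriteLine(queue.TryAdd(2));       // True
            Console.WriteLine(queue.TryAdd(3));       // False: the collection is full
            Console.WriteLine(queue.BoundedCapacity); // 2
        }
    }
}
```

Add, by contrast, blocks the producer until space becomes available, which is what keeps memory use bounded when producers outpace consumers.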

Member

@tarekgh tarekgh commented Jun 6, 2019

@tactical-drone I am closing the issue per your last comment, but feel free to come back if you hit the problem again. It would be nice to have a core dump at that point. Thanks for your report.

@tarekgh tarekgh closed this Jun 6, 2019
Member

@rynowak rynowak commented Nov 26, 2019

I'm not able to reactivate this issue, but I'm pretty certain I have a repro of this failure: aspnet/AspNetCore-Internal#3299

This program runs for 20 minutes before finishing, and even then it doesn't reproduce the failure 100% of the time.

Here's a slimmed down version of this unit test:

using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

namespace CoolRepro
{
    class Program
    {
        static async Task Main(string[] args)
        {
            await Task.WhenAny(Enumerable.Range(0, 8).Select(_ => Task.Run(async () =>
            {
                for (var i = 0; i < 1_000_000; i++)
                {
                    await RunTest();
                }
            })));
        }

        private static async Task RunTest()
        {
            // Arrange
            var insideCheck = new TaskCompletionSource<object>();

            var service = CreateHealthChecksService(b =>
            {
                b.AddAsyncCheck("cancels", async ct =>
                {
                    insideCheck.SetResult(null);

                    await Task.Delay(10000, ct);
                    return HealthCheckResult.Unhealthy();
                });
            });

            var cancel = new CancellationTokenSource();
            var task = service.CheckHealthAsync(cancel.Token);

            // After this returns we know the check has started
            await insideCheck.Task;

            cancel.Cancel();

            try
            {
                await task;
            }
            catch (TaskCanceledException)
            {
            }
        }

        private static HealthCheckService CreateHealthChecksService(Action<IHealthChecksBuilder> configure)
        {
            var services = new ServiceCollection();
            services.AddLogging();
            services.AddOptions();

            var builder = services.AddHealthChecks();
            if (configure != null)
            {
                configure(builder);
            }

            return services.BuildServiceProvider(validateScopes: true).GetRequiredService<HealthCheckService>();
        }
    }
}

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp3.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="Microsoft.Extensions.DependencyInjection" Version="3.0.1" />
    <PackageReference Include="Microsoft.Extensions.Diagnostics.HealthChecks" Version="3.0.1" />
    <PackageReference Include="Microsoft.Extensions.Logging" Version="3.0.1" />
  </ItemGroup>

</Project>

Implementation of HealthCheckService is here: https://github.com/aspnet/Extensions/blob/master/src/HealthChecks/HealthChecks/src/DefaultHealthCheckService.cs#L38
