Memory leak from linked cancellation token registrations #78180

Closed
ogxd opened this issue Nov 10, 2022 · 9 comments

Comments

@ogxd
Contributor

ogxd commented Nov 10, 2022

Description

Context

We currently work on an ASP.NET Core service that handles several thousand requests per second per instance. Internally, each request leads to several additional I/Os, each with different response-time requirements. In order to avoid unnecessarily waiting for an I/O completion when a request must be completed within a given time, we rely heavily on cancellation tokens, and more specifically on CancellationTokenSource.CreateLinkedTokenSource to merge the different timeout policies. All good so far; this is probably quite a standard way to do this.
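For reference, the per-request pattern looks roughly like this (a simplified sketch; the helper name, FetchAsync and the timeout value are illustrative, not our actual service code):

using System;
using System.Threading;
using System.Threading.Tasks;

static class Example
{
    // Hypothetical helper: merge the caller's token with a per-call timeout.
    public static async Task<string> FetchWithTimeoutAsync(CancellationToken requestAborted)
    {
        using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(requestAborted);
        linkedCts.CancelAfter(TimeSpan.FromMilliseconds(200));

        // The downstream I/O observes whichever happens first:
        // the request being aborted or the timeout expiring.
        return await FetchAsync(linkedCts.Token);
    }

    // Stand-in for the real downstream call.
    private static async Task<string> FetchAsync(CancellationToken token)
    {
        await Task.Delay(50, token);
        return "ok";
    }
}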

A while ago, we noticed that our cancellation system was causing a lot of thread contention (about 100k contention events/s according to ETW ContentionStart/Stop events).
Fortunately, we came across this 11-year-old blog post from @stephentoub, and by sharing the cancellation tokens the thread contention went down from 100k to something like 10 🎉

The problem

Since we have been "pooling" cancellation tokens, despite the thread contention improvements, we have noticed weird memory patterns (some kind of memory leak). We've tried different implementations for pooling tokens, but the results were always the same, to the point that we began questioning the framework itself.
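To give an idea of what we mean by pooling (a rough illustration of the concept only, not the repro's actual CoalescingTimeoutCancellationTokenProvider): requests whose deadlines fall into the same time bucket share a single timeout CancellationTokenSource instead of each creating their own.

using System;
using System.Collections.Concurrent;
using System.Threading;

// Illustrative sketch only: coalesce timeout tokens by rounding the absolute
// deadline to a bucket, so concurrent requests share one source.
// Cleanup of expired buckets is omitted for brevity.
sealed class CoalescingTimeoutTokenProvider
{
    private const long BucketMilliseconds = 100;

    private readonly ConcurrentDictionary<long, CancellationTokenSource> _buckets = new();

    public CancellationToken GetToken(TimeSpan timeout)
    {
        long deadline = Environment.TickCount64 + (long)timeout.TotalMilliseconds;
        long bucket = deadline / BucketMilliseconds;

        CancellationTokenSource cts = _buckets.GetOrAdd(bucket, _ =>
        {
            var source = new CancellationTokenSource();
            source.CancelAfter(timeout);
            return source;
        });

        return cts.Token;
    }
}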

The repro

It wasn't clear until we made a simple repro. You can find it here: cancellation-tokens-mem-leak-repro.
In this repro you can:

  • Switch between not pooling tokens (FrameworkTimeoutCancellationTokenProvider) and pooling them (CoalescingTimeoutCancellationTokenProvider)
  • Switch between using the framework way to merge two cancellation tokens (FrameworkLinkedCancellationTokenProvider) and two custom implementations (CustomLinkedCancellationTokenProvider and CustomLinkedNoRegistrationCancellationTokenProvider)

In short, the results are that whenever we pool tokens and keep a reference to cancellation token registrations, there is a memory leak.

Here are some graphs I've made using dotMemory:

No pooling + Registrations = No leak (Reference)

[image]
The profile is flat after a few seconds = no leak (verified by looking at GC survivors).

Pooling + Registrations = Leak

(The results are the same whether I use FrameworkLinkedCancellationTokenProvider or CustomLinkedCancellationTokenProvider, so no need to show both.)
[image]
Memory is growing over time: it's leaking.
So what is leaking?
[image]
As you can see, a lot of registrations and Linked2CancellationTokenSource instances survive GC (they shouldn't).
With dotMemory we can trace back the retention path for a single surviving Linked2CancellationTokenSource instance. It's very deep, so here is the beginning:
[image]
And here is the end (GC roots):
[image]

Pooling + No registration = No leak ???

I have created CustomLinkedNoRegistrationCancellationTokenProvider, which is similar to Linked2CancellationTokenSource BUT doesn't keep a reference to the registrations. It turns out that by doing so we can completely get rid of the leak. It feels weird to me, however, not to dispose an IDisposable, and this is why I would like to investigate this with you. This is probably more a workaround than a proper solution.
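For clarity, the workaround boils down to registering a callback on each parent token but discarding the returned CancellationTokenRegistration, so there is nothing left to dispose (or unregister) later. Roughly (a sketch, not the exact repro code):

using System.Threading;

// Sketch of the workaround: link two tokens but intentionally drop the
// registrations returned by Register.
sealed class LinkedNoRegistrationCancellationTokenSource : CancellationTokenSource
{
    public LinkedNoRegistrationCancellationTokenSource(CancellationToken token1, CancellationToken token2)
    {
        if (token1.CanBeCanceled)
            token1.Register(static s => ((CancellationTokenSource)s!).Cancel(), this);

        if (token2.CanBeCanceled)
            token2.Register(static s => ((CancellationTokenSource)s!).Cancel(), this);
    }
}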

Here is the memory profile anyway:
[image]
After a few seconds it's flat, which is a good sign. We can also look at GC survivors over 20 seconds:
[image]
There are far fewer cancellation token sources surviving. The remaining ones are probably expected given how the repro is made.

Reproduction Steps

https://github.com/ogxd/cancellation-tokens-mem-leak-repro

Expected behavior

Expected behavior is no memory leak

Actual behavior

It seems there is a memory leak.

Regression?

I don't know

Known Workarounds

The CustomLinkedNoRegistrationCancellationTokenProvider from the repro is a workaround, but it would be better to understand/fix the issue before relying on this in a production environment.

Configuration

  • Tested on macOS, Windows and Ubuntu 20.
  • Framework is .NET SDK 6.0 and runtime 6.0

Other information

No response


@MihaZupan
Member

If you run the repro longer, does the leak persist, or is it bounded at some point?

It feels weird to me however to not dispose an IDisposable

Your repro isn't disposing anything though.

Regardless of the exact cause of the issue here, you may find https://github.com/microsoft/reverse-proxy/blob/main/src/ReverseProxy/Utilities/ActivityCancellationTokenSource.cs useful, depending on exactly what you're doing with these cancellation tokens. This approach is practically allocation-free when timeouts aren't occurring.
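The rough shape of that approach (an illustration only, not the ReverseProxy implementation; it relies on .NET 6's CancellationTokenSource.TryReset):

using System;
using System.Collections.Concurrent;
using System.Threading;

// Pooled, reusable timeout sources: a source is rented with a timeout and,
// if the timeout never fired, reset and returned to the pool instead of a
// new one being allocated per operation.
static class TimeoutCtsPool
{
    private static readonly ConcurrentQueue<CancellationTokenSource> s_pool = new();

    public static CancellationTokenSource Rent(TimeSpan timeout)
    {
        if (!s_pool.TryDequeue(out CancellationTokenSource? cts))
            cts = new CancellationTokenSource();

        cts.CancelAfter(timeout);
        return cts;
    }

    public static void Return(CancellationTokenSource cts)
    {
        // TryReset succeeds only if cancellation hasn't been requested and the
        // timer hasn't fired; otherwise the source must not be reused.
        if (cts.TryReset())
            s_pool.Enqueue(cts);
        else
            cts.Dispose();
    }
}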

@ogxd
Contributor Author

ogxd commented Nov 10, 2022

If you run the repro longer, does the leak persist, or is it bounded at some point?

Yes, here I ran it for 8 minutes:
[image]
[image]
[image]
Here is about the same duration with the workaround instead (not keeping a reference to the registrations):
[image]

Your repro isn't disposing anything though.

I was expecting the destructor to invoke Dispose() even if I don't explicitly call it

public class LinkedCancellationTokenSource : CancellationTokenSource
{
    [...]

    protected override void Dispose(bool disposing)
    {
        reg1?.Dispose();
        reg2?.Dispose();
      
        base.Dispose(disposing);
    }

    ~LinkedCancellationTokenSource()
    {
        Dispose(true);
    }
}

@stephentoub
Member

stephentoub commented Nov 10, 2022

I was expecting the destructor to invoke Dispose() even if I don't explicitly call it

That's the finalizer, which is only invoked when there are no more rooted references to this instance and the GC is able to collect it. And those registrations have a reference back to this instance. When you have:

CancellationTokenSource cts1 = ...;
CancellationTokenSource cts2 = CancellationTokenSource.CreateLinkedTokenSource(cts1.Token);

that's effectively storing a reference to cts2 into cts1 so that when cts1 has cancellation requested, that request will propagate to cts2. Until you Dispose of cts2, cts1 is going to continue to hold that reference. If you don't Dispose cts2, as long as cts1 is rooted, cts2 will also thus be rooted, and it will never have its finalizer called.
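In other words (a minimal illustration):

using System.Threading;

// A long-lived / shared source, e.g. a pooled timeout source.
var sharedCts = new CancellationTokenSource();

// CreateLinkedTokenSource registers a callback on sharedCts, so sharedCts
// now holds a reference to the linked source. While sharedCts is rooted and
// the linked source is not disposed, the linked source stays reachable and
// its finalizer never runs.
var linked = CancellationTokenSource.CreateLinkedTokenSource(sharedCts.Token);

// Disposing removes that registration from sharedCts, breaking the reference
// and allowing the linked source to be collected.
linked.Dispose();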

@ogxd
Contributor Author

ogxd commented Nov 10, 2022

Given what you said, I was wondering: why not unregister right after the cancellation occurs?
Like this:

public class LinkedCancellationTokenSource : CancellationTokenSource
{
    private readonly CancellationTokenRegistration? _registration1;
    private readonly CancellationTokenRegistration? _registration2;
         
    // first bit = canceled
    // second bit = disposed
    private int _state = 0;

    public LinkedCancellationTokenSource(CancellationToken token1, CancellationToken token2)
    {
        if (token1.CanBeCanceled)
            _registration1 = token1.Register(CancelAndUnregister);

        if (token2.CanBeCanceled)
            _registration2 = token2.Register(CancelAndUnregister);
    }

    private void CancelAndUnregister()
    {
        int state = Interlocked.Or(ref _state, 1);
        if (state == 0)
        {
            Unregister();
            Cancel();
        }
    }

    protected override void Dispose(bool disposing)
    {
        int state = Interlocked.Or(ref _state, 2);
        if ((state & 2) != 2)
        {
            Debug.Assert(state < 2, "State should either be default or canceled, but not disposed");
            Unregister();
            base.Dispose(disposing);
        }
    }

    private void Unregister()
    {
        _registration1?.Unregister();
        _registration2?.Unregister();
    }
}

It seems to work very well.
[image]

Do you see any issue in doing this?

@ogxd
Contributor Author

ogxd commented Nov 11, 2022

I played a little more with the repro and I think the reason for this "memory leak" is probably not a bug but the combination of these factors:

  • Using Linked2CancellationTokenSource to create token registrations (from CancellationTokenSource.CreateLinkedTokenSource)
  • Pooling the cancellation tokens (thus implying multiple registrations per token, amplifying the leak)
  • Not disposing Linked2CancellationTokenSource

By disposing the Linked2CancellationTokenSource I no longer have memory issues. We were missing this in our application.

using var cts = CancellationTokenSource.CreateLinkedTokenSource(token1, token2);
await FooAsync(cts.Token);

Still, I wonder if the solution proposed above (unregister in the token cancel callback instead of unregistering only in Linked2CancellationTokenSource.Dispose) isn't a nice usability/safety improvement. It's always best not to forget to dispose, but time has shown that people tend to forget it at times (like we did; a quick search on GitHub shows that many others forget it too and probably aren't aware that their code leaks memory because of this).
To avoid this easy mistake, the proposed change could be a nice way to prevent this kind of leak from happening in the first place. What do you think?

@ogxd
Contributor Author

ogxd commented Nov 16, 2022

Here I come back with an answer to my own question 😀
The solution I proposed does seem to help, but there is still a major issue with not disposing, even with this approach: the case where the LinkedCancellationTokenSource isn't canceled yet but goes out of scope. In that case it is neither disposed nor canceled, so it will outlive its scope. This means that despite addressing the main leak concern, it may live longer than it should and thus use more memory.
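For example (illustrative only, reusing the FooAsync placeholder from above and the LinkedCancellationTokenSource sketch from my earlier comment):

// Even with unregister-on-cancel, nothing cleans up here if no cancellation
// ever fires: the linked source goes out of scope without being canceled or
// disposed, so its registrations on the (long-lived) parent tokens remain
// and keep it alive.
Task CallAsync(CancellationToken sharedTimeoutToken, CancellationToken requestToken)
{
    var linked = new LinkedCancellationTokenSource(sharedTimeoutToken, requestToken);
    return FooAsync(linked.Token);
}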

I still think this API is not very practical and can easily lead to memory issues, but at least now I understand it. For now, I decided to leverage Roslyn analyzers with CA2000 to enforce disposal at build time. At least that's one less concern.

Closing this issue.

@ogxd ogxd closed this as completed Nov 16, 2022
@stephentoub
Member

Still, I wonder if the solution proposed above (unregister in the token cancel callback instead of unregistering only in Linked2CancellationTokenSource.Dispose) isn't a nice usability/safety improvement.

We've avoided doing that in the past as it would require extra virtual method invocations somewhere along the Cancel path, whereas right now the only one that's involved is the one for Dispose. But, technically I expect it could be done.
