High object retention via timers in finalization queue #233
Comments
fyi: @xinchen10, @JoshLove-msft, @danielmarbach
Probably from here? azure-amqp/src/Fx/Singleton.cs, line 55 in c6242a5.

#234 would fix at least that usage.
Good find @danielmarbach. I'm not sure if that is the cause of my specific retention issue, but it might be 🤷. It all depends on whether the ASB client uses that method, and I'm just too overburdened at the moment to investigate. It definitely looks like an issue, though! Regarding contributing a fix for the explicit timer dispose: I just don't have enough confidence that I wouldn't do harm by touching those code paths. I can't wrap my head around the object lifetimes in that area and wouldn't know how to properly test or validate any of it.
@aarondandy how many ReceivingAmqpLink+ReceiveAsyncResult objects did you see? Also 10,000+? The receiving link always disposes the cancellation token registration when the async result is completed. If you saw roughly the same number of ReceiveAsyncResult objects as CallbackNode objects, it would mean the AMQP receiving link does not dispose the registration in some cases. Otherwise the leaks could come from other places where registrations are created, especially if you use the same cancellation token source for all the calls.
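For illustration (not code from either library): every Register call on a token adds a callback node to its source, and that node stays reachable until the registration is disposed or the source itself is cancelled or disposed, which is how a long-lived shared source ends up with huge CallbackNode chains:

```csharp
using System;
using System.Threading;

// A long-lived source, e.g. one shared across all receive calls.
var cts = new CancellationTokenSource();

for (int i = 0; i < 10_000; i++)
{
    // Register adds a node (internally a CallbackNode) to the source's list.
    CancellationTokenRegistration registration =
        cts.Token.Register(() => Console.WriteLine("cancelled"));

    // If this Dispose is skipped while the source stays alive, the node and
    // everything its callback captures remain reachable from the source.
    registration.Dispose();
}
```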
The crazy number of CallbackNode instances could be some EF, Polly, or other shenanigans we are up to. The core issue for us in this context, though, is just the TimerHolder that is hanging on to all of them. I don't have the dumps anymore, but there were maybe only 100-1000 TimerHolder instances in the finalization queue. Also, possibly related, I think there are some framework-level optimizations around CallbackNodes which may be contributing to them growing so large, but 🤷. Even if there were fewer CallbackNodes, I don't think our finalization queues would be processed fast enough to rely on finalization over explicit disposal. Also worth noting: I added a background thread to force the finalization queue to be processed, and that has greatly reduced the symptoms of this issue for us.
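The hack itself is nothing clever; roughly something like this (a simplified sketch, not the exact code from our service):

```csharp
using System;
using System.Threading;

// Background thread that keeps the finalization queue from backing up by
// forcing it to drain periodically.
var pump = new Thread(() =>
{
    while (true)
    {
        GC.Collect();                  // push unreachable finalizable objects to the finalizer thread
        GC.WaitForPendingFinalizers(); // block until that queue has been drained
        GC.Collect();                  // reclaim the objects whose finalizers just ran
        Thread.Sleep(TimeSpan.FromSeconds(30));
    }
})
{
    IsBackground = true,
    Name = "finalization-pump"
};
pump.Start();
```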
The timers are disposed in the receiving link. They show up in the finalization queue because the type has a finalizer, but the finalizer should be skipped after Dispose is called. Without a dump it is going to be difficult to find the root cause, unless we can reproduce it by simulating what your code does.
Could be. I'm making the assumption that it uses GC.SuppressFinalize as part of the dispose pattern and therefore won't end up in the finalization queue when explicitly disposed, but 🤷.
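In other words, I'm assuming the textbook pattern is in play, roughly like this (TimerOwner is a made-up stand-in, not a type from this library):

```csharp
using System;
using System.Threading;

// GC.SuppressFinalize tells the runtime the finalizer no longer needs to run,
// so a properly disposed instance should not be waiting on finalization.
sealed class TimerOwner : IDisposable
{
    private readonly Timer _timer = new Timer(_ => { /* timeout callback */ });

    public void Dispose()
    {
        _timer.Dispose();
        GC.SuppressFinalize(this); // skip the finalizer for this instance
    }

    ~TimerOwner()
    {
        // Only reached if Dispose was never called; the timer is then left to
        // its own finalizer, which is what ends up in the finalization queue.
    }
}
```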
I'm not comfortable sharing my dumps, but I can undo my finalization hack, let it run for a few days, and then we can hop on a call and pick through the pile together if you want. I haven't been able to create a minimal reproduction of this, as I don't know a good way to prevent the finalization queue from being processed promptly on my local machine. I also just don't have time to dig into the internals of the library, sadly.
The Azure service bus library
I'm not really familiar with how that works, but if this is related to the older Close issue, I would guess they use 60 seconds.
I think we get at least one message roughly every second, so it wouldn't be waiting too long. This is not a problem in our test environment, where most ReceiveAsync calls would not retrieve a message, but that also means it has plenty of time to clean up the finalization queue.
EF, Polly, and some of my code may create linked CancellationTokenSources, but all of them should lead back to the CancellationToken we get from ProcessMessageAsync in Azure Service Bus.
I have a new memory dump with over 700 MB retained by a Timer created by ReceivingAmqpLink+ReceiveAsyncResult. I can see that the TimerQueueTimer._canceled field is true. I can also see that the CancellationTokenSource+CallbackNode's CallbackState held by the cancellationTokenRegistration no longer points to the ReceiveAsyncResult callback state, but to a completely different object instance's state. I think this is enough to indicate that both timer.Dispose() and cancellationTokenRegistration.Dispose() have been called here as expected. So even though the root is the Timer created by ReceiveAsyncResult, the reason this CallbackNode chain is so large may be unrelated to the ASB or AMQP libraries, contrary to what I thought. Looking at the remaining chain of CancellationTokenSource+Registrations, I wonder if this is actually due to Microsoft.Data.SqlClient failing to dispose a CancellationTokenRegistration, so I guess I'll go poke around over there. I'll hold onto this dump for a few days if anybody out there is curious enough to dig through it with me.
@aarondandy you might be affected by |
Oh, I think this is exactly it! I feel like I came across this a month ago but wasn't smart enough to piece it together at the time. I'm not sure if this will work for me, but while I wait for that fix, I wonder if wrapping the cancellationToken I get from ASB/AMQP in a new CancellationTokenSource would be enough to decouple the linked lists held by the ReceiveAsyncResult timer and the one used by SqlCommand, so the GC can take care of things quicker.
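Roughly what I have in mind (the names are made up, and whether it actually helps is a guess):

```csharp
using System.Threading;
using System.Threading.Tasks;

static class Workaround
{
    // Hand SqlClient a short-lived wrapper token instead of the ASB token, so
    // any registration it fails to dispose is rooted by the per-message source
    // and released when that source is disposed.
    public static async Task HandleMessageAsync(CancellationToken asbToken)
    {
        using var wrapper = CancellationTokenSource.CreateLinkedTokenSource(asbToken);
        await RunSqlAsync(wrapper.Token); // stand-in for the EF / SqlCommand work
    }

    private static Task RunSqlAsync(CancellationToken token) => Task.Delay(100, token);
}
```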
Looking at the code behind CancellationTokenSource, I think I'm just doomed and going to need to try their preview release or go through constant service restarts 😬. I'll give the preview a try for a bit, and if that resolves the issue then I think it definitely proves my issues are unrelated to ASB or AMQP.
Damn, sorry to hear that. I wish I could be of more help, but I'm just a community sidekick occasionally sneaking into some repos and making some noise.
Microsoft.Data.SqlClient 5.1.0-preview2.22314.2 brought the memory usage down from gigabytes to megabytes 🎉. Thanks for the tip @danielmarbach. I'm closing this as I'm pretty sure it's unrelated to this library now, @xinchen10.
While investigating OutOfMemoryExceptions in our production environment, I noticed a large number of CancellationTokenSource+CallbackNode chains being retained. Sometimes these chains can be tens of thousands of instances long and retain gigabytes of memory in total. When investigating common retention paths in these memory dumps, the majority of them look like they originate from a TimerHolder in the finalization queue, with a retention path involving ReceivingAmqpLink+ReceiveAsyncResult. Our systems, especially those hosted in Azure Container Apps, are unable to work through this finalization queue and reclaim memory in time. I'm having some trouble following this codebase, so I don't know what the root cause could be, but I saw a commit from 2020 on the development master branch involving timer disposal which might be relevant. If that commit is for a known Dispose issue in the current release, would it be possible to get it onto the currently active hotfix branch as well?

Relevant commit: 9fad696
Retention path: