RumViewScope creates memory leaks #1762
Comments
If the view is already stopped, there's no need to determine whether it needs to be stopped. See DataDog#1762
Hi @pyricau, thank you for all the details provided regarding this bug, tremendous work there. This is indeed a very interesting discovery, so if I understand correctly because the [...] In any case, we are going to fix this in our next version, probably by just applying the quick fix first, but in the long term I totally agree with you that we should find a better equivalent for those keys.
Thanks for the quick reply @mariusc83 I don't think this necessarily has to do with the use of a background thread, though the executor used here probably increases the delays. Here's how I'd think about it:
I think long term I'd suggest something like: 1) Stop using the actual activity & fragment instances as keys, and create unique identifiers for those instead (I don't know if you have any code expecting actual activity / fragment instances as keys though?). Then 2) stop using a weak ref, just keep a strong reference to the key, and make it clear that the key should be an identifier, not a live UI object.
Could we call WeakReference.refersTo instead? The docs say "Prefer this to a comparison with the result of get"
See DataDog#1762 for details. According to https://developer.android.com/reference/java/lang/ref/Reference#refersTo(T) we should prefer using refersTo over a comparison with the result of get. refersTo should avoid creating a strong ref even temporarily, thereby avoiding this leak.
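A minimal sketch of what that comparison could look like. `KeyHolder` and `isSameKey` are hypothetical names used only for illustration, not the SDK's code, and the sketch assumes a runtime that provides `Reference.refersTo` (JDK 16 or newer on the JVM). Note that `refersTo` compares by identity, whereas the existing check uses `equals`.

```kotlin
import java.lang.ref.WeakReference

// Hypothetical stand-in for a scope that tracks a view key weakly.
class KeyHolder(key: Any) {
    private val keyRef = WeakReference(key)

    // Identity comparison that does not hand back the referent,
    // so no strong local reference to the key is created here.
    fun isSameKey(candidate: Any): Boolean = keyRef.refersTo(candidate)
}

fun main() {
    val activity = Any() // stands in for an Activity instance
    val holder = KeyHolder(activity)
    println(holder.isSameKey(activity)) // true
    println(holder.isSameKey(Any()))    // false
}
```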
@pyricau pointed out that [...] If there's interest, I can do something similar with System.identityHashCode. That is, I could avoid calling WeakReference.get() except when the identity hash codes match. They would match when both are the same object, but in that case the object is definitely still alive, so no leak would be created. A spurious match would be very unlikely since identity hash codes are 32 bits. I'm happy to prepare a PR if you're interested in such a fix. Let me know.
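The identity hash idea could be sketched roughly as follows. `HashedKeyRef` and `matches` are hypothetical names, and the sketch assumes an identity comparison is acceptable once the hashes match (the existing code uses `equals` instead).

```kotlin
import java.lang.ref.WeakReference

// Hypothetical wrapper: remembers the key's identity hash at creation time.
class HashedKeyRef(key: Any) {
    private val keyRef = WeakReference(key)
    private val keyIdentityHash = System.identityHashCode(key)

    fun matches(candidate: Any): Boolean {
        // Cheap pre-filter: different identity hashes mean different objects,
        // so the weak reference is not touched at all.
        if (System.identityHashCode(candidate) != keyIdentityHash) return false
        // Hashes match: almost certainly the same, still-alive object.
        // Only now read the referent to confirm.
        return keyRef.get() === candidate
    }
}

fun main() {
    val activity = Any() // stands in for an Activity instance
    val ref = HashedKeyRef(activity)
    println(ref.matches(activity)) // true
}
```

A stop event for some other view would usually fail the hash comparison and return before `get()` is ever called.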
@tcmulcahy thank you for all the suggestions. We discussed it internally and we will take the time to prepare a more robust solution there. We will handle this on our end. As @pyricau pointed out, and I agree, the best approach would be to not have those activities/fragments as keys in there and totally remove the [...] On the other hand, in the meantime we will add the quick fix in our next SDK version by not accessing the reference if the view was already stopped. Let me know if this is ok with you @pyricau @tcmulcahy.
Sounds good! I had a PR with the quick fix at #1763 but happy to close it.
Sounds good to me too.
@pyricau no need to close it, the PR is actually good, I just approved it and we will merge it soon.
The issue referenced is fixed in #1779, but we will keep this ticket open for the remaining work to be done.
The fix for this issue is now available in version 2.4.0.
Describe what happened
RumViewScope prevents destroyed activities from being garbage collected, causing an increase in memory pressure.
This leak was hard to find: the destroyed activities were not being garbage collected, yet every time LeakCanary or I looked at a heap dump, they were only weakly reachable. I ended up tweaking the shortest path algorithm that LeakCanary uses to include weak references that aren't set by the Android Framework.
I eventually figured out that the culprit is the way the SDK abuses weak references and revives the references by frequently invoking `WeakReference.get()` and recreating temporary strong local references to the destroyed activities.
Leak Trace:
Root Cause
The core bug is in `RumViewScope#onStopView` (source):
Every time a `RumRawEvent.StopView` event is sent, all `RumViewScope` instances that are children of `RumViewManagerScope` will invoke `onStopView()` and call `keyRef.get()` to do an equals check to see if the stop view is for themselves. Unfortunately this has the side effect of creating a strong local reference to the activity, preventing it from being garbage collected in the next GC run.
The short term fix is simple: check whether `stopped` is true before invoking `val startedKey = keyRef.get()`, and return immediately if it is.
Now, of course, one might wonder: "why would we send `RumRawEvent.StopView` events to a `RumViewScope` associated to a view that's already stopped, shouldn't it be removed from its `RumViewManagerScope` parent?"
A `RumViewScope` is removed from its `RumViewManagerScope` parent when the `RumViewScope` returns `null` in `handleEvent`, which happens only if `isViewComplete()` returns true. I have a heap dump showing a `RumViewScope` instance where `stopped` is true, `activeResourceScopes` is empty but `pendingResourceCount = 2` (and the other pending count fields set to 0), which makes `isViewComplete()` return false. I'm not sure exactly how this happens. `DatadogRumMonitor.executorService` has an empty queue so this isn't about pending events.
Unfortunately, I'm only able to look at heap dumps long after the issue has happened. I have tried to reproduce locally and debug through it, and I didn't manage to reproduce the exact scenario above. However, I did get a leak when I went offline: the number of events increased so much (more errors) that the single thread executor in `DatadogRumMonitor.executorService` was lagging behind and its queue was filling up, so much that the StopView event took a really long time to get delivered, and in the meantime we kept leaking the activity by calling `get()` on the weak reference.
Here's an example heap dump where I found this issue. I stripped all primitive arrays of their data and filled them with 0s to avoid sharing any secrets.
2023-12-13_12-20-26_265-e767bc81-376b-47e6-acc9-92a1a41eb297-stripped.hprof.zip
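The short term fix described under Root Cause can be illustrated with a minimal sketch. `ViewScopeSketch` is a hypothetical, heavily simplified stand-in and not the actual `RumViewScope` code; the `keyReads` counter exists only to make the effect of the guard observable.

```kotlin
import java.lang.ref.WeakReference

// Hypothetical, simplified stand-in for a view scope.
class ViewScopeSketch(key: Any) {
    private val keyRef = WeakReference(key)

    var stopped = false
        private set

    // Illustration only: counts how often the weak reference is dereferenced.
    var keyReads = 0
        private set

    fun onStopView(eventKey: Any) {
        // Quick fix: an already stopped view never needs to inspect its key,
        // so no temporary strong reference is created.
        if (stopped) return

        keyReads++
        val startedKey = keyRef.get()
        if (eventKey == startedKey) {
            stopped = true
        }
    }
}

fun main() {
    val activity = Any() // stands in for an Activity instance
    val scope = ViewScopeSketch(activity)
    scope.onStopView(activity)                // stops the view, reads the key once
    repeat(3) { scope.onStopView(activity) }  // later stop events return early
    println(scope.keyReads) // 1
}
```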
Proper fixes
The current implementation around keys is strange. The key has a type of Any, the framework holds a weak reference to it but calls get() often, then makes an equals check against the event key. This implies a requirement that the keys implement equals properly. That requirement most definitely doesn't work for activities and fragments, as it's very possible that developers override equals for their own needs, leading to unexpected behavior on the Datadog side.
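To make the equals concern concrete, here is a small sketch. `ProductScreen` is a hypothetical class standing in for an activity or fragment whose author overrides equals for app-specific reasons.

```kotlin
// Hypothetical screen class with an app-specific equals override.
class ProductScreen(val productId: Int) {
    override fun equals(other: Any?): Boolean =
        other is ProductScreen && other.productId == productId

    override fun hashCode(): Int = productId
}

fun main() {
    val first = ProductScreen(42)
    val second = ProductScreen(42)
    // Two distinct instances compare as equal, so an equals-based key check
    // could treat a stop event for one screen as a stop for the other.
    println(first == second)  // true
    println(first === second) // false
}
```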
On top of that, events are delivered async through a single threaded executor service. So events are always late. Sometimes they're a little bit late, but if the executor queue fills up they can be very late. When they're very late, they're held in memory for longer, and the event key is as well. So it's really not appropriate for the keys to be activities and fragments; they should instead be small objects that represent an identity but that we don't mind keeping in memory for longer.
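One possible shape for such identity keys, as a sketch only. `ViewKey` and `ViewKeys` are hypothetical names that do not exist in the SDK, and the sketch assumes the caller creates the key when the view starts and reuses the same key when it stops.

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Hypothetical small, immutable key: safe to hold strongly for as long as
// events are queued, because it does not reference the UI object.
data class ViewKey(val id: Long, val name: String)

// Hypothetical factory handing out unique keys.
object ViewKeys {
    private val nextId = AtomicLong()

    // Only the class name is copied from the view; the instance is not retained.
    fun create(view: Any): ViewKey =
        ViewKey(nextId.incrementAndGet(), view.javaClass.name)
}

fun main() {
    val activity = Any() // stands in for an Activity instance
    val startKey = ViewKeys.create(activity)
    val otherKey = ViewKeys.create(activity)
    println(startKey == startKey.copy()) // true: equality is purely by value
    println(startKey == otherKey)        // false: every created key is unique
}
```

With keys like these, equality no longer depends on how an app's activities or fragments implement equals, and a delayed event only keeps a small value object alive.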