RumViewScope creates memory leaks #1762

pyricau · 2023-12-14T06:32:41Z

Describe what happened

RumViewScope prevents destroyed activities from being garbage collected, causing an increase in memory pressure.

This leak was hard to find: the destroyed activities were not being garbage collected, yet every time LeakCanary or I looked at a heap dump, they were always weakly reachable. I ended up tweaking the shortest path algorithm that LeakCanary uses to include weak references that aren't set by the Android Framework.

I eventually figured out that the culprit is the way the SDK abuses weak references and revives the references by frequently invoking WeakReference.get() and recreating temporary strong local references to the destroyed activities.

Leak Trace:

┬───
│ GC Root: Global variable in native code
│
├─ dalvik.system.PathClassLoader instance
│    Leaking: NO (GlobalRumMonitor↓ is not leaking and A ClassLoader is never leaking)
│    ↓ ClassLoader.runtimeInternalObjects
├─ java.lang.Object[] array
│    Leaking: NO (GlobalRumMonitor↓ is not leaking)
│    ↓ Object[23579]
├─ com.datadog.android.rum.GlobalRumMonitor class
│    Leaking: NO (a class is never leaking)
│    ↓ static GlobalRumMonitor.registeredMonitors
│                              ~~~~~~~~~~~~~~~~~~
├─ java.util.LinkedHashMap instance
│    ↓ LinkedHashMap[instance @378137616 of com.datadog.android.core.DatadogCore]
│                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
├─ com.datadog.android.rum.internal.monitor.DatadogRumMonitor instance
│    ↓ DatadogRumMonitor.rootScope
│                        ~~~~~~~~~
├─ com.datadog.android.rum.internal.domain.scope.RumApplicationScope instance
│    ↓ RumApplicationScope.childScopes
│                          ~~~~~~~~~~~
├─ java.util.ArrayList instance
│    ↓ ArrayList[0]
│               ~~~
├─ com.datadog.android.rum.internal.domain.scope.RumSessionScope instance
│    ↓ RumSessionScope.childScope
│                      ~~~~~~~~~~
├─ com.datadog.android.rum.internal.domain.scope.RumViewManagerScope instance
│    ↓ RumViewManagerScope.childrenScopes
│                          ~~~~~~~~~~~~~~
├─ java.util.ArrayList instance
│    ↓ ArrayList[1]
│               ~~~
├─ com.datadog.android.rum.internal.domain.scope.RumViewScope instance
│    ↓ RumViewScope.keyRef
│                   ~~~~~~
├─ java.lang.ref.WeakReference instance
│    referent instance of com.squareup.featureflags.devdrawer.LdFeaturesOverrideActivity with mDestroyed = true
│    ↓ Reference.referent
│                ~~~~~~~~
╰→ com.squareup.featureflags.devdrawer.LdFeaturesOverrideActivity instance
     Leaking: YES (Activity#mDestroyed is true)
     Retaining 690.4 kB in 14512 objects

Root Cause

The core bug is in RumViewScope#onStopView:

    private fun onStopView(
        event: RumRawEvent.StopView,
        writer: DataWriter<Any>
    ) {
        delegateEventToChildren(event, writer)
        val startedKey = keyRef.get()

Every time a RumRawEvent.StopView event is sent, all RumViewScope instances that are children of RumViewManagerScope will invoke onStopView() and call keyRef.get() to do an equals check to see if the stop view is for themselves. Unfortunately this has the side effect of creating a strong local reference to the activity, preventing it from being garbage collected in the next GC run.
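To make the mechanism concrete, here is a minimal sketch (not SDK code; FakeActivity and holdsReferentAlive are made-up names for illustration). It shows that the value returned by WeakReference.get() is an ordinary strong reference, so the referent stays reachable for as long as that local is live:

```kotlin
import java.lang.ref.WeakReference

// Hypothetical stand-in for an Activity; not an Android or SDK type.
class FakeActivity(val name: String)

// Returns true if the referent survived a GC request that was made while
// a local obtained from WeakReference.get() was still live.
fun holdsReferentAlive(): Boolean {
    val activity = FakeActivity("LdFeaturesOverrideActivity")
    val keyRef = WeakReference(activity)

    // get() hands back a normal strong reference (or null if already cleared).
    val startedKey: FakeActivity? = keyRef.get()
    val sameObject = startedKey === activity // last use of `activity`

    // From here on, `startedKey` is the only strong path to the object.
    System.gc()

    // `startedKey` is still used below, so the collector has to treat the
    // referent as strongly reachable and cannot clear keyRef.
    return sameObject && keyRef.get() === startedKey
}
```

In the SDK the local only lives for the duration of onStopView(), but per the description above the call is repeated on every StopView event, so a GC run can keep landing inside that window.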

The short-term fix is simple: check whether stopped is true before invoking val startedKey = keyRef.get(), and return immediately if it is.
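A minimal sketch of that early return, assuming a heavily simplified scope. SketchViewScope and CountingWeakReference are hypothetical names, and the rule that a cleared reference counts as a match is an assumption of this sketch, not a statement about the SDK:

```kotlin
import java.lang.ref.WeakReference

// Hypothetical helper: a WeakReference that counts how often get() is called.
class CountingWeakReference<T : Any>(referent: T) : WeakReference<T>(referent) {
    var reads = 0
        private set

    override fun get(): T? {
        reads++
        return super.get()
    }
}

// Simplified stand-in for RumViewScope; only the fields needed for the sketch.
class SketchViewScope(key: Any) {
    val keyRef = CountingWeakReference(key)
    var stopped = false
        private set

    fun onStopView(eventKey: Any) {
        // Quick fix: a stopped view has nothing left to decide,
        // so it never dereferences keyRef again.
        if (stopped) return

        val startedKey = keyRef.get()
        // Assumption: a cleared reference (null) is treated as a match.
        if (startedKey == null || startedKey == eventKey) {
            stopped = true
        }
    }
}
```

With this guard, StopView events that arrive after the view has stopped no longer create a temporary strong reference to the destroyed activity.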

Now, of course, one might wonder: "Why would we send RumRawEvent.StopView events to a RumViewScope associated with a view that's already stopped? Shouldn't it be removed from its RumViewManagerScope parent?"

A RumViewScope is removed from its RumViewManagerScope parent when the RumViewScope returns null in handleEvent, which happens only if isViewComplete() returns true. I have a heap dump showing a RumViewScope instance where stopped is true and activeResourceScopes is empty, but pendingResourceCount = 2 (with the other pending count fields set to 0), which makes isViewComplete() return false. I'm not sure exactly how this happens. DatadogRumMonitor.executorService has an empty queue, so this isn't about pending events.
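The completion check described above can be sketched roughly as follows. This is an assumption-laden illustration: ViewCompletionState is a hypothetical type, the names of the pending counters other than pendingResourceCount are guesses, and the exact formula in the SDK may differ:

```kotlin
// Hypothetical snapshot of the fields that decide whether a view is complete.
// Only stopped, activeResourceScopes and pendingResourceCount are named in the
// issue; the remaining counter names are assumptions.
data class ViewCompletionState(
    val stopped: Boolean,
    val activeResourceScopeCount: Int,
    val pendingResourceCount: Long,
    val pendingActionCount: Long = 0,
    val pendingErrorCount: Long = 0,
    val pendingLongTaskCount: Long = 0
) {
    // Assumed rule: complete only when stopped, with no active resource scopes
    // and no pending events of any kind.
    fun isViewComplete(): Boolean {
        val pending = pendingResourceCount + pendingActionCount +
            pendingErrorCount + pendingLongTaskCount
        return stopped && activeResourceScopeCount == 0 && pending <= 0
    }
}

// The state observed in the heap dump: stopped, no active scopes, 2 pending resources.
val heapDumpState = ViewCompletionState(
    stopped = true,
    activeResourceScopeCount = 0,
    pendingResourceCount = 2
)
```

Under that assumed rule, heapDumpState.isViewComplete() is false, so the scope would stay in its parent's list and keep receiving StopView events.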

Unfortunately, I'm only able to look at heap dumps long after the issue has happened. I tried to reproduce it locally and debug through it, but I didn't manage to reproduce the exact scenario above. However, I did get a leak when I went offline: the number of events increased so much (more errors) that the single thread executor in DatadogRumMonitor.executorService lagged behind and its queue filled up. The StopView event then took a really long time to be delivered, and in the meantime we kept leaking the activity by calling get() on the weak reference.

Here's an example heap dump where I found this issue. I stripped all primitive arrays of their data and filled them with 0s to avoid sharing any secrets.

2023-12-13_12-20-26_265-e767bc81-376b-47e6-acc9-92a1a41eb297-stripped.hprof.zip

Proper fixes

The current implementation around keys is strange. The key has a type of Any; the framework holds a weak reference to it but calls get() often, then makes an equals check against the event key. This implies a requirement that the keys implement equals properly. That requirement most definitely doesn't work for activities and fragments, as it's very possible that developers override equals for their own needs, leading to unexpected behavior on the Datadog side.

On top of that, events are delivered asynchronously through a single threaded executor service, so events are always late. Sometimes they're a little bit late, but if the executor queue fills up they can be very late. When they're very late, they're held in memory for longer, and so is the event key. So it's really not appropriate for the keys to be activities and fragments. They should instead be small objects that represent an identity but that we don't mind keeping in memory for longer.
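The queueing effect can be illustrated with a small sketch, assuming a plain single-thread ThreadPoolExecutor rather than the SDK's actual executor setup. StopViewEvent and queuedEventsWhileBusy are hypothetical names:

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.ThreadPoolExecutor
import java.util.concurrent.TimeUnit

// Hypothetical event type: like the real events, it carries its key with it.
class StopViewEvent(val key: Any)

// Submits `eventCount` events while the single worker thread is busy and
// returns how many of them are sitting in the queue, keys included.
fun queuedEventsWhileBusy(eventCount: Int): Int {
    val queue = LinkedBlockingQueue<Runnable>()
    val executor = ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, queue)
    val workerStarted = CountDownLatch(1)
    val releaseWorker = CountDownLatch(1)
    try {
        // Simulate a slow task that keeps the only worker thread occupied.
        executor.execute {
            workerStarted.countDown()
            releaseWorker.await()
        }
        workerStarted.await()

        repeat(eventCount) {
            val event = StopViewEvent(key = Any())
            // The queued task captures the event, and therefore its key.
            executor.execute { event.key.hashCode() }
        }
        return queue.size
    } finally {
        releaseWorker.countDown()
        executor.shutdown()
        executor.awaitTermination(5, TimeUnit.SECONDS)
    }
}
```

Every queued task strongly references its event and key until the worker catches up, which is why a key that is an activity or fragment stays in memory for as long as the queue is backed up.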

mariusc83 · 2023-12-14T08:50:18Z

Hi @pyricau, thank you for all the details provided regarding this bug, tremendous work there.

This is indeed a very interesting discovery. If I understand correctly, because onStopView is called on a different thread, calling keyRef.get() there prevents that activity or fragment from being collected by a GC cycle that could run in parallel. Is that correct? Please correct me if I am wrong, but this doesn't mean that the activity or fragment will always be leaked, just that it is not going to be GC'd fast enough?

In any case, we are going to fix this in our next version, probably by applying the quick fix first, but in the long term I totally agree with you that we should find a better equivalent for those keys.

pyricau · 2023-12-14T14:49:08Z

Thanks for the quick reply @mariusc83

I don't think this necessarily has to do with the use of a background thread, though the executor used here probably increases the delays.

Here's how I'd think about it:

In a Java-based runtime, a memory leak is a programming error that prevents the GC from collecting an object that is no longer needed.
Activities and fragments should become unreachable immediately after their onDestroy() method is invoked. Preventing them from being garbage collected, even if it's only temporary, is still a memory leak. E.g. here if we were switching activities fast enough we'd be retaining quite a lot of memory.
This leak is particularly interesting in that the duration of the leak isn't well defined. It could be very short or very long, as it depends on a specific set of conditions: 1) RumViewScope staying as a child of RumViewManagerScope after the activity is destroyed. 2) Any StopView event being sent close enough to a GC that the GC sees the Java local reference when running and concludes that the object is reachable. Unfortunately, the more you use an app, the more the GC runs, but also the more screens go in and out and the more we get StopView events.

I think long term I'd suggest something like: 1) Stop using the actual activity & fragment instances as keys, and create unique identifiers for those instead (I don't know if you have any code expecting actual activity / fragment instances as keys though?). Then 2) Stop using a weak ref, just keep a strong reference to the key, and make it clear that the key should be an identifier, not a live UI object.
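One possible shape for that suggestion, as a sketch only. ViewKey, ViewKeys and KeyedViewScope are hypothetical names and not what the SDK ended up shipping:

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Hypothetical identifier: small, immutable, and safe to keep in memory
// for as long as events referring to it are in flight.
data class ViewKey(val id: Long, val name: String)

object ViewKeys {
    private val nextId = AtomicLong()

    // Derives a fresh key from a screen without retaining the screen itself.
    fun create(screen: Any): ViewKey =
        ViewKey(nextId.incrementAndGet(), screen.javaClass.name)
}

// Simplified scope that holds its key strongly; no WeakReference needed.
class KeyedViewScope(val key: ViewKey) {
    var stopped = false
        private set

    fun onStopView(eventKey: ViewKey) {
        // data class equality compares id and name; it never calls into user code.
        if (!stopped && eventKey == key) {
            stopped = true
        }
    }
}
```

The caller would still need to remember which ViewKey belongs to which screen between start and stop (for example in its lifecycle callbacks); that bookkeeping is left out here.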

tcmulcahy · 2023-12-14T17:27:36Z

Could we call WeakReference.refersTo instead? The docs say "Prefer this to a comparison with the result of get"

See DataDog#1762 for details. According to https://developer.android.com/reference/java/lang/ref/Reference#refersTo(T) we should prefer using refersTo over a comparison with the result of get. refersTo should avoid creating a strong ref even temporarily, thereby avoiding this leak.

tcmulcahy · 2023-12-14T23:54:18Z

@pyricau pointed out that refersTo is only available on 33+. :( I'll close my PR.

If there's interest I can do something similar with System.identityHashCode. i.e. I could avoid calling WeakReference.get() except when the identity hash codes matched. This would be the case when they were the same object, but in such cases the object would definitely still be alive, so there would be no leak problem created. It would be very unlikely to happen spuriously since identity hash codes are 32 bits.
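A rough sketch of that idea, for illustration only. IdentityKeyRef is a hypothetical name, and note that it compares by identity, whereas the existing code is described above as using an equals check:

```kotlin
import java.lang.ref.WeakReference

// Hypothetical wrapper: remembers the key's identity hash code so that most
// comparisons can be answered without dereferencing the weak reference.
class IdentityKeyRef(key: Any) {
    private val ref = WeakReference(key)
    private val identityHash = System.identityHashCode(key)

    // Number of times the weak reference was actually dereferenced.
    var derefCount = 0
        private set

    fun matches(candidate: Any): Boolean {
        // Different identity hash codes mean different objects: no get() needed.
        if (System.identityHashCode(candidate) != identityHash) return false

        // Same hash: almost certainly the same object, which the caller holds
        // strongly anyway, so dereferencing here cannot extend its lifetime.
        derefCount++
        return ref.get() === candidate
    }
}
```

A hash collision with an unrelated object would still trigger one get(), which is the rare spurious case mentioned above.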

I'm happy to prepare a PR if you're interested in such a fix. Let me know.

mariusc83 · 2023-12-15T09:13:57Z

@tcmulcahy thank you for all the suggestions, we discussed it internally and we will take the time to prepare a more robust solution there. We will handle this on our end. As @pyricau was pointing out and I also agree with, the best approach would be to not have those activities/fragments as keys in there and totally remove the WeakRef. As that code there is quite complex and has a lot of implications we will need more time to assess and properly fix this in a PR.

On the other hand in the meantime we will add the quick fix for our next sdk version by not accessing the reference if the view was already stopped.

Let me know if this is ok with you @pyricau @tcmulcahy.

pyricau · 2023-12-16T06:53:34Z

Sounds good! I had a PR with the quick fix at #1763 but happy to close it.

tcmulcahy · 2023-12-16T12:23:56Z

Sounds good to me too.

mariusc83 · 2023-12-18T08:45:11Z

@pyricau no need to close it, the PR is actually good, I just approved it and we will merge it soon.

0xnm · 2023-12-20T11:35:47Z

The issue referenced is fixed in #1779, but we will keep this ticket open for the remaining work to be done.

0xnm · 2023-12-21T16:47:58Z

The fix for this issue is now available in version 2.4.0.

0xnm · 2024-01-16T08:36:06Z

Hi @pyricau! We've made changes to our code and no longer hold Activities inside the RUM processing pipeline. This change was released in version 2.5.0.

I am closing this issue, don't hesitate to re-open it if needed.

pyricau added the bug label Dec 14, 2023

pyricau added a commit to pyricau/dd-sdk-android that referenced this issue Dec 14, 2023

Fix leak caused by repeated calls to WeakReference.get()

2e39314

If the view is already stopped, there's no need to determine whether it needs to be stopped. See DataDog#1762

pyricau mentioned this issue Dec 14, 2023

Fix leak caused by repeated calls to WeakReference.get() #1763

Closed

tcmulcahy mentioned this issue Dec 14, 2023

Fix WeakReference leak #1767

Closed

mariusc83 self-assigned this Dec 15, 2023

0xnm pushed a commit that referenced this issue Dec 20, 2023

Fix leak caused by repeated calls to WeakReference.get()

75bb88a

If the view is already stopped, there's no need to determine whether it needs to be stopped. See #1762

xgouchet pushed a commit that referenced this issue Dec 20, 2023

Fix leak caused by repeated calls to WeakReference.get()

9875896

If the view is already stopped, there's no need to determine whether it needs to be stopped. See #1762

0xnm closed this as completed Jan 16, 2024
