Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RumViewScope creates memory leaks #1762

Closed
pyricau opened this issue Dec 14, 2023 · 11 comments
Closed

RumViewScope creates memory leaks #1762

pyricau opened this issue Dec 14, 2023 · 11 comments
Assignees
Labels
bug Something isn't working

Comments

@pyricau
Copy link
Contributor

pyricau commented Dec 14, 2023

Describe what happened

RumViewScope prevents destroyed activities from being garbage collected, causing an increase in memory pressure.

This leak was hard to find: the destroyed activities were not being garbage collected, however every time myself or LeakCanary would look at a heap dump, they were always weakly reachable. I ended up tweaking the shortest path algorithm that LeakCanary uses to include weak references that aren't set by the Android Framework.

I eventually figured out that the culprit is the way the SDK abuses weak references and revives the references by frequently invoking WeakReference.get() and recreating temporary strong local references to the destroyed activities.

Leak Trace:

┬───
│ GC Root: Global variable in native code
│
├─ dalvik.system.PathClassLoader instance
│    Leaking: NO (GlobalRumMonitor↓ is not leaking and A ClassLoader is never leaking)
│    ↓ ClassLoader.runtimeInternalObjects
├─ java.lang.Object[] array
│    Leaking: NO (GlobalRumMonitor↓ is not leaking)
│    ↓ Object[23579]
├─ com.datadog.android.rum.GlobalRumMonitor class
│    Leaking: NO (a class is never leaking)
│    ↓ static GlobalRumMonitor.registeredMonitors
│                              ~~~~~~~~~~~~~~~~~~
├─ java.util.LinkedHashMap instance
│    ↓ LinkedHashMap[instance @378137616 of com.datadog.android.core.DatadogCore]
│                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
├─ com.datadog.android.rum.internal.monitor.DatadogRumMonitor instance
│    ↓ DatadogRumMonitor.rootScope
│                        ~~~~~~~~~
├─ com.datadog.android.rum.internal.domain.scope.RumApplicationScope instance
│    ↓ RumApplicationScope.childScopes
│                          ~~~~~~~~~~~
├─ java.util.ArrayList instance
│    ↓ ArrayList[0]
│               ~~~
├─ com.datadog.android.rum.internal.domain.scope.RumSessionScope instance
│    ↓ RumSessionScope.childScope
│                      ~~~~~~~~~~
├─ com.datadog.android.rum.internal.domain.scope.RumViewManagerScope instance
│    ↓ RumViewManagerScope.childrenScopes
│                          ~~~~~~~~~~~~~~
├─ java.util.ArrayList instance
│    ↓ ArrayList[1]
│               ~~~
├─ com.datadog.android.rum.internal.domain.scope.RumViewScope instance
│    ↓ RumViewScope.keyRef
│                   ~~~~~~
├─ java.lang.ref.WeakReference instance
│    referent instance of com.squareup.featureflags.devdrawer.LdFeaturesOverrideActivity with mDestroyed = true
│    ↓ Reference.referent
│                ~~~~~~~~
╰→ com.squareup.featureflags.devdrawer.LdFeaturesOverrideActivity instance
     Leaking: YES (Activity#mDestroyed is true)
     Retaining 690.4 kB in 14512 objects

Root Cause

The core bug is in RumViewScope#onStopView (source) :

    private fun onStopView(
        event: RumRawEvent.StopView,
        writer: DataWriter<Any>
    ) {
        delegateEventToChildren(event, writer)
        val startedKey = keyRef.get()

Every time a RumRawEvent.StopView event is sent, all RumViewScope instances that are children of RumViewManagerScope will invoke onStopView() and call keyRef.get() to do an equals check to see if the stop view is for themselves. Unfortunately this has the side effect of creating a strong local reference to the activity, preventing it from being garbage collected in the next GC run.

The short term fix is simple: check if stopped is false before invoking val startedKey = keyRef.get() and immediately returning if stopped is true.

Now, of course, one might wonder: "why would we send RumRawEvent.StopView events to a RumViewScope associated to a view that's already stopped, shouldn't it be removed from its RumViewManagerScope parent?

A RumViewScope is removed from its RumViewManagerScope parent when the RumViewScope returns null in handleEvent, which happens only if isViewComplete() returns true. I have a heap dump showing a RumViewScope instance where stopped is true, activeResourceScopes is empty but pendingResourceCount = 2 (and the other pending fields counts set to 0), which makes isViewComplete() return false. I'm not sure exactly how this happens. DatadogRumMonitor.executorService has an empty queue so this isn't about pending events.

image

Unfortunately, I'm only able to look at heap dumps long after the issue has happened. I have tried to reproduce locally and debug through it, and I didn't manage to reproduce the exact scenario above, however I did get a leak when I went offline and the number of events increased so much (more errors) that the single thread executor in DatadogRumMonitor.executorService was lagging behind and its queue was filling up, so much that the StopView event took a really long time to get delivered and in the meantime we kept leaking the activity by calling get() on the weak reference.

Here's an example heap dump where I found this issue. I stripped all primitive arrays of their data and filled them with 0s to avoid sharing any secrets.

2023-12-13_12-20-26_265-e767bc81-376b-47e6-acc9-92a1a41eb297-stripped.hprof.zip

Proper fixes

The current implementation around keys is strange. The key has a type of Any, the framework holds a weak reference to it but calls get() often, then makes an equals check against the event key. This implies a requirement that the keys implement equals properly. That requirement most definitely doesn't work for activities and fragments, as it's very possible that developers override equals for their own need, leading to unexpected behavior on the datadog side.

On top of that, events are delivered async through a single threaded executor service. So events are always late. Sometimes they're a little bit late, but if the executor queue fills up they can be very late. When they're very late, they're held in memory for longer, and the event key is as well. So it's really not appropriate for the keys to be activities and fragments, they should instead be small objects that represent an identity but that we don't mind keeping in memory for longer.

@pyricau pyricau added the bug Something isn't working label Dec 14, 2023
pyricau added a commit to pyricau/dd-sdk-android that referenced this issue Dec 14, 2023
If the view is already stopped, there's no need to determine whether it needs to be stopped.

See DataDog#1762
@mariusc83
Copy link
Collaborator

Hi @pyricau, thank you for all the details provided regarding this bug, tremendous work there.

This is indeed a very interesting discovery, so if I understand correctly because the onStopView is being called in a different Thread by using the keyRef.get() in there we are impeding that activity or fragment to be recycled in the GC cycle that could run in parallel. Is that correct ? Please correct me if I am wrong but this doesn't mean that the activity or fragment will always be leaked, is just that this is not going to be GC fast enough ?

In any case we are going to fix this on our next version probably by just applying the quick fix first but on the long term I totally agree with you that we should find a better equivalent for those keys.

@pyricau
Copy link
Contributor Author

pyricau commented Dec 14, 2023

Thanks for the quick reply @mariusc83

I don't think this necessarily has to do with the use of a background thread, though the executor used here probably increases the delays.

Here's how I'd think about it:

  • In a Java based runtime, a memory leak is a programming error that prevents the GC from collecting an object that is no longer needed.
  • Activities and fragments should become unreachable immediately after their onDestroy() method is invoked. Preventing them from being garbage collected, even if it's only temporary, is still a memory leak. E.g. here if we were switching activities fast enough we'd be retaining quite a lot of memory.
  • This leak is particularly interesting in that the duration of the leak isn't well defined. It could be very short or very long, as it depends on a specific set of conditions: 1) RumViewScope staying as a child of RumViewManagerScope after the activity is destroyed 2) Any StopView event being sent close enough to a GC that the GC sees the java local reference when running and concludes that the object is reachable. Unfortunately, the more you use an app, the more the GC runs but also the more screens go in and out and the more we get StopView events.

I think long term I'd suggest something like: 1) Stop using the actual activity & fragment instances as keys, create unique identifiers for those instead (I don't know if you have any code expecting actual activity / fragment instances as keys though?). then 2) Stop using a weak ref, just keep a strong reference to the key, and make it clear that key should be an identifier not a live UI object.

@tcmulcahy
Copy link

Could we call WeakReference.refersTo instead? The docs say "Prefer this to a comparison with the result of get"

tcmulcahy added a commit to tcmulcahy/dd-sdk-android that referenced this issue Dec 14, 2023
See DataDog#1762 for details.
According to https://developer.android.com/reference/java/lang/ref/Reference#refersTo(T)
we should prefer using refersTo over a comparison with the result of
get. refersTo should avoid creating a strong ref even temporarily,
thereby avoiding this leak.
@tcmulcahy
Copy link

@pyricau pointed out that refersTo is only available on 33+. :( I'll close my PR.

If there's interest I can do something similar with System.identityHashCode. i.e. I could avoid calling WeakReference.get() except when the identity hash codes matched. This would be the case when they were the same object, but in such cases the object would definitely still be alive, so there would be no leak problem created. It would be very unlikely to happen spuriously since identity hash codes are 32 bits.

I'm happy to prepare a PR if you're interested in such a fix. Let me know.

@mariusc83
Copy link
Collaborator

@tcmulcahy thank you for all the suggestions, we discussed it internally and we will take the time to prepare a more robust solution there. We will handle this on our end. As @pyricau was pointing out and I also agree with, the best approach would be to not have those activities/fragments as keys in there and totally remove the WeakRef. As that code there is quite complex and has a lot of implications we will need more time to assess and properly fix this in a PR.

On the other hand in the meantime we will add the quick fix for our next sdk version by not accessing the reference if the view was already stopped.

Let me know if this is ok with you @pyricau @tcmulcahy.

@mariusc83 mariusc83 self-assigned this Dec 15, 2023
@pyricau
Copy link
Contributor Author

pyricau commented Dec 16, 2023

Sounds good! I had a PR with the quick fix at #1763 but happy to close it.

@tcmulcahy
Copy link

tcmulcahy commented Dec 16, 2023 via email

@mariusc83
Copy link
Collaborator

@pyricau no need to close it, the PR is actually good, I just approved it and we will merge it soon.

0xnm pushed a commit that referenced this issue Dec 20, 2023
If the view is already stopped, there's no need to determine whether it needs to be stopped.

See #1762
xgouchet pushed a commit that referenced this issue Dec 20, 2023
If the view is already stopped, there's no need to determine whether it needs to be stopped.

See #1762
@0xnm
Copy link
Contributor

0xnm commented Dec 20, 2023

The issue referenced is fixed in #1779, but we will keep this ticket open for the remaining work to be done.

@0xnm
Copy link
Contributor

0xnm commented Dec 21, 2023

The fix for this issue is now available in version 2.4.0.

@0xnm
Copy link
Contributor

0xnm commented Jan 16, 2024

Hi @pyricau! We've made changes to our code and no more hold Activities inside the RUM processing pipeline. This change was released in version 2.5.0.

I am closing this issue, don't hesitate to re-open it if needed.

@0xnm 0xnm closed this as completed Jan 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants