
Possible memory-leak in multi-threading inference using Tensorflow and having org.bytedeco.javacpp.nopointergc=true (CPU) #690

Closed
skirdey opened this issue Feb 24, 2021 · 56 comments
Labels: bug (Something isn't working), Tensorflow (Tensorflow Engine)

Comments

@skirdey
Contributor

skirdey commented Feb 24, 2021

Description

Possible memory-leak in multi-threading inference using Tensorflow and having org.bytedeco.javacpp.nopointergc=true

CPU inference.

Expected Behavior

Garbage collection removing objects from Old Generation

Error Message

Java OOM

How to Reproduce?

Run multi-threaded inference for 30 minutes with -Dorg.bytedeco.javacpp.nopointergc=true so that you don't have the JavaCPP Deallocator thread blocking.

./gradlew benchmark -Dorg.bytedeco.javacpp.nopointergc=true -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location="https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet" --args='-n tf_resnet -t 10 -c 1000000 -s 1,224,224,3'

@skirdey skirdey added the bug label Feb 24, 2021
@skirdey
Contributor Author

skirdey commented Feb 24, 2021

I have a high-level overview of what happens, captured via YourKit. At some point (>30 min of continuous predictions) there is a spike in GC collections:
[screenshot: Screen Shot 2021-02-24 at 3 30 55 PM]

and in pause time:
[screenshot: Screen Shot 2021-02-24 at 3 30 44 PM]

Also, the Old Generation takes up most of the heap, leaving nothing for new allocations:
[screenshot: Screen Shot 2021-02-24 at 2 49 52 PM]

Here is a memory snapshot after several forced GCs:
[screenshot: Screen Shot 2021-02-24 at 3 36 43 PM]

@skirdey skirdey changed the title Possible memory-leak in multi-threading inference using Tensorflow and having org.bytedeco.javacpp.nopoinergc=true Possible memory-leak in multi-threading inference using Tensorflow and having org.bytedeco.javacpp.nopointergc=true Feb 24, 2021
@stu1130 stu1130 added the Tensorflow label Feb 25, 2021
@skirdey skirdey changed the title Possible memory-leak in multi-threading inference using Tensorflow and having org.bytedeco.javacpp.nopointergc=true Possible memory-leak in multi-threading inference using Tensorflow and having org.bytedeco.javacpp.nopointergc=true (CPU) Feb 25, 2021
@skirdey
Contributor Author

skirdey commented Feb 28, 2021

The same benchmark doing multi-threaded inference with PyTorch has a minimal memory footprint and no GC pauses:
./gradlew benchmark -Dai.djl.default_engine=PyTorch -Dai.djl.repository.zoo.location="https://alpha-djl-demos.s3.amazonaws.com/model/djl-blockrunner/pytorch_resnet18.zip?model_name=traced_resnet18" --args='-c 100000 -d 1440 -t 10 -s 1,3,224,224'
[screenshot: Screen Shot 2021-02-27 at 4 21 11 PM]

@karllessard

Please see tensorflow/java#229 for a potential fix for this problem in the TF Java library.

@skirdey
Contributor Author

skirdey commented Mar 12, 2021

@karllessard is it possible to release TF Java 0.2.1 with the hotfix? It doesn't look like a breaking change, so a patch release would be nice.

DJL uses released versions of TF Java.

@saudet

saudet commented Mar 12, 2021

@skirdey 0.3.0-SNAPSHOT is pretty much up now, so please give that a try, but like I tried to explain on the pull request, I don't believe this will fix any issues for DJL.

@karllessard

Good news @skirdey! Important build issues that were preventing us from releasing 0.3.0 are now fixed. I will provide more details a bit later, but you can also follow this thread to see the progress.

@skirdey
Contributor Author

skirdey commented Mar 12, 2021 via email

@saudet

saudet commented Mar 12, 2021

That doesn't fix any leak. If there's a leak, it's somewhere else.

@stu1130
Contributor

stu1130 commented Mar 12, 2021

The DJL team can publish a 0.11.0-SNAPSHOT that depends on 0.3.0-SNAPSHOT. I am on it.

@karllessard

We just had our community meeting and I'm targeting a 0.3.0 release by March 15th based on the current snapshot. The only caveat is that the Windows MKL+GPU platform won't be available for that release (at least not by that date); all other platforms will be.

@karllessard

Hi, just to let you know that 0.3.0 is now available on Maven Central. Please let us know how it goes with your experiments on this issue, thank you

@skirdey
Contributor Author

skirdey commented Mar 20, 2021

We tried it out and the memory still leaks: very quickly when nopointergc=true, and it takes about 5 hours to halt when nopointergc=false.
@karllessard @saudet

@stu1130
Contributor

stu1130 commented Mar 20, 2021

As @skirdey mentioned, here is my setup:
c5.2xlarge instance with jvmArgs -Xmx256m (to make it easier to hit OOM)
command:

./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 100000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.nopointergc=true

-Dorg.bytedeco.javacpp.nopointergc=true took 2m 31s
-Dorg.bytedeco.javacpp.nopointergc=false took 5h 11m 6s

As a next step, we are going to integrate TfNDManager with PointerScope, as @saudet suggested.

@saudet

saudet commented Mar 20, 2021

@stu1130 Can you point us to the code in DJL for that benchmark?

@stu1130
Contributor

stu1130 commented Mar 20, 2021

@saudet

saudet commented Mar 20, 2021

I mean, what does it end up calling in TF Java? For example, @roywei says that

Yes, we closed the input tensor after TfNDArray creation and only keep the constant operand for future eager operations (usually preprocessing before prediction). The preprocessing output will be converted back to a Tensor and fed into a classic Session (loaded from the SavedModel) for inference. We use try/catch blocks around NDManagers, similar to using eager sessions.

It sounds like there are dangling references to "constant operands" that are not closed with their sessions. Why is that? How do you make sure they get deallocated?

@saudet

saudet commented Mar 21, 2021

Another thing you need to be careful about is that, currently, output Tensor objects are not closed automatically when a Session is closed, so you need to make sure that all output tensors are closed before the Session is closed. If you do not close all output Tensors before closing their Session, then even if they are closed later on, that may still result in memory leaks. It's a bit weird, and I hope the people from TF Java realize that it's not a great way to do things, but I haven't been able to convince them yet. For now, we need to manage all of that manually.
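For illustration, a minimal sketch of that close-output-tensors-first ordering, assuming the TF Java 0.3.x Graph/Session/Tensor API (loadGraph(), inputTensor, and the operation names are placeholders, not DJL code):

import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.Tensor;

static void runAndCloseInOrder(Tensor inputTensor) {      // inputTensor: assumed pre-built input
    try (Graph graph = loadGraph();                        // loadGraph(): hypothetical Graph loader
         Session session = new Session(graph)) {
        try (Tensor result = session.runner()
                                    .feed("input", inputTensor)
                                    .fetch("output")
                                    .run()
                                    .get(0)) {
            // ... read the result here ...
        } // the output tensor is closed first
    } // the session and the graph are closed afterwards
}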

@stu1130
Contributor

stu1130 commented Mar 22, 2021

I mean, what does it end up calling in TF Java? For example, @roywei says that

Yes, we closed the input tensor after TfNDArray creation and only keep the constant operand for future eager operations (usually preprocessing before prediction). The preprocessing output will be converted back to a Tensor and fed into a classic Session (loaded from the SavedModel) for inference. We use try/catch blocks around NDManagers, similar to using eager sessions.

It sounds like there are dangling references to "constant operands" that are not closed with their sessions. Why is that? How do you make sure they get deallocated?

When TfNDArray.close() is called, the operand is set to null.

Another thing you need to be careful about is that, currently, output Tensor objects are not closed automatically when a Session is closed, so you need to make sure that all output tensors are closed before the Session is closed. If you do not close all output Tensors before closing their Session, then even if they are closed later on, that may still result in memory leaks. It's a bit weird, and I hope the people from TF Java realize that it's not a great way to do things, but I haven't been able to convince them yet. For now, we need to manage all of that manually.

My understanding is that when we create a Tensor from an EagerSession or Session, the pointer to the Tensor is attached to the EagerSession/Session. Now, with the new fix, the Tensor is held with a WeakReference and can therefore be released by GC. How does that result in memory leaks? When we close the EagerSession/Session, shouldn't it free all attached Tensors?

@karllessard

karllessard commented Mar 22, 2021

The behavior is not the same in eager and graph mode.

In eager mode, it is true that the lifetime of a Tensor is attached to an eager session, and closing this session will release all tensors it has implicitly allocated (e.g. when converting operation results to a tensor). Tensors allocated manually by a user still need to be freed manually as well, since the user has ownership of them.

In graph mode though, all tensors returned by a session.run() must be deallocated manually by the user; neither closing the graph nor the session will close them, as explained here. We are already working on a solution so that users won't need to worry too much about this in the near future.

When TfNDArray.close() is called, the operand is set to null.

In eager mode, this should indeed do the job, as the GC will eventually free the constant. The leak is probably elsewhere then. But note that one drawback of relying on the GC to free up resources is that it does not know the size of the native resources being allocated. Therefore, while the JVM might think that there is plenty of memory available and that a GC cycle can wait, you might actually be running low, especially if you are holding on to very large tensors/constants for a long time. Again, the safest route is to close the eager session as soon as you can, but I'm not sure how that's possible in your current architecture; I'd have to take a deeper look to understand clearly how it works.
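As a rough illustration of the "close the eager session as soon as you can" advice, a minimal sketch assuming the TF Java EagerSession/Ops API (the constant value is just a placeholder):

import org.tensorflow.EagerSession;
import org.tensorflow.op.Ops;

// Scope the eager session to the work that needs it, so the tensors it
// implicitly allocates are released deterministically instead of waiting on GC.
try (EagerSession session = EagerSession.create()) {
    Ops tf = Ops.create(session);
    var constant = tf.constant(new float[] {1f, 2f, 3f});
    // ... run the eager preprocessing here ...
} // closing the session releases the tensors it allocated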

@stu1130
Contributor

stu1130 commented Mar 22, 2021

The behavior is not the same in eager and graph mode.

In eager mode, it is true that the lifetime of a Tensor is attached to an eager session, and closing this session will release all tensors it has implicitly allocated (e.g. when converting operation results to a tensor). Tensors allocated manually by a user still need to be freed manually as well, since the user has ownership of them.

In graph mode though, all tensors returned by a session.run() must be deallocated manually by the user; neither closing the graph nor the session will close them, as explained here. We are already working on a solution so that users won't need to worry too much about this in the near future.

When TfNDArray.close() is called, the operand is set to null.

In eager mode, this should indeed do the job, as the GC will eventually free the constant. The leak is probably elsewhere then. But note that one drawback of relying on the GC to free up resources is that it does not know the size of the native resources being allocated. Therefore, while the JVM might think that there is plenty of memory available and that a GC cycle can wait, you might actually be running low, especially if you are holding on to very large tensors/constants for a long time. Again, the safest route is to close the eager session as soon as you can, but I'm not sure how that's possible in your current architecture; I'd have to take a deeper look to understand clearly how it works.

Thanks @karllessard for those pointers. We will try to close tensors for both EagerSession/Session ASAP.

@saudet

saudet commented Mar 23, 2021

My understanding is that when we create a Tensor from an EagerSession or Session, the pointer to the Tensor is attached to the EagerSession/Session. Now, with the new fix, the Tensor is held with a WeakReference and can therefore be released by GC. How does that result in memory leaks? When we close the EagerSession/Session, shouldn't it free all attached Tensors?

The current implementation is incorrect, but they (the rest of the SIG JVM) are not concerned about this issue. Please make sure that SIG JVM knows that you would like this to be fixed in TF Java. I will keep trying to make this happen as part of SIG JVM, but I can't promise anything, so any help is welcome.

In eager mode, this should indeed do the job, as the GC will eventually free the constant. The leak is probably elsewhere then. But note that one drawback of relying on the GC to free up resources is that it does not know the size of the native resources being allocated. Therefore, while the JVM might think that there is plenty of memory available and that a GC cycle can wait, you might actually be running low, especially if you are holding on to very large tensors/constants for a long time. Again, the safest route is to close the eager session as soon as you can, but I'm not sure how that's possible in your current architecture; I'd have to take a deeper look to understand clearly how it works.

That's not true. I've explained this many times already, so I won't try to reexplain this here, but with the way you've "fixed" this, leaks will still happen, unless of course everything is done manually.

@karllessard

Hi @stu1130 ,

I was experimenting with DJL's benchmark this evening, and while I cannot draw conclusions yet on the TF side, I've noticed that the benchmark collects multiple metrics on each inference, which are piled up in a list for the whole duration of the test and reach millions in number after a few hours. While instances of Metric seem quite small in size, it could be interesting to see if you still observe OOM on a 256 MB JVM when metric collection is disabled, as it seems possible to do here, to make sure this constantly growing list does not interfere with the experiment. What do you think?

@saudet

saudet commented Mar 24, 2021

My understanding is that when we create a Tensor from an EagerSession or Session, the pointer to the Tensor is attached to the EagerSession/Session. Now, with the new fix, the Tensor is held with a WeakReference and can therefore be released by GC. How does that result in memory leaks? When we close the EagerSession/Session, shouldn't it free all attached Tensors?

The current implementation is incorrect, but they (the rest of the SIG JVM) are not concerned about this issue. Please make sure that SIG JVM knows that you would like this to be fixed in TF Java. I will keep trying to make this happen as part of SIG JVM, but I can't promise anything, so any help is welcome.

To clarify what I wrote above, the "issue" I'm referring to isn't the one in this thread about memory leaks in DJL, but another one concerning the incorrect use of WeakReference that @stu1130 inquired about. I think everyone is on the same page about fixing memory leaks in general, and right now we can avoid them by making sure that everything gets deallocated manually. Even if we started using WeakReference correctly somehow, it is my opinion that relying on GC is a bad idea anyway, and this is where my opinion diverges from the other members of SIG JVM. We all agree that calling things like Tensor.close() should work though, so if there's a bug there, we'll fix that.

@Craigacp

Craigacp commented Mar 24, 2021

When running the benchmark above I get a bunch of warnings out of JavaCPP. @saudet any idea what's causing that?

Warning: Could not load Loader: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load Pointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load BytePointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Load library 2.4.1 in 296.217 ms.
[INFO ] - Running MultithreadedBenchmark on: cpu().
Loading:     100% |████████████████████████████████████████|
Warning: Could not load PointerPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Model tf_resnet loaded in: 2050.987 ms.
[INFO ] - Multithreaded inference with 10 threads.
Warning: Could not load IntPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib

I don't usually get this out of TF-Java when we run it separately, and I know that DJL uses its own library loader which partially delegates to ours.

@skirdey
Contributor Author

skirdey commented Mar 26, 2021 via email

@Craigacp

Do you have this problem with Java 9 cleaners too? They are increasingly used inside the JVM to deal with resource cleanup and it's basically the same idiom.
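For reference, the Cleaner idiom being referred to looks roughly like this, using only the plain JDK 9+ API; the class and field names are made up for illustration and are not DJL or TF Java code:

import java.lang.ref.Cleaner;

class NativeResource implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup action must not hold a reference back to the resource object,
    // otherwise the object would never become phantom reachable.
    private static final class State implements Runnable {
        private final long nativeHandle;
        State(long nativeHandle) { this.nativeHandle = nativeHandle; }
        @Override public void run() { /* free(nativeHandle) via JNI/JavaCPP */ }
    }

    private final Cleaner.Cleanable cleanable;

    NativeResource(long nativeHandle) {
        this.cleanable = CLEANER.register(this, new State(nativeHandle));
    }

    // Explicit close is preferred; the Cleaner only acts as a GC-driven backstop.
    @Override public void close() { cleanable.clean(); }
}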

@skirdey
Contributor Author

skirdey commented Mar 26, 2021 via email

@skirdey
Contributor Author

skirdey commented Mar 26, 2021 via email

@karllessard

karllessard commented Mar 26, 2021

Thanks @skirdey , please don't forget to disable DJL's metrics since they can grow drastically during the benchmark and cause an OOM if you are running on a low-memory JVM.

@stu1130
Contributor

stu1130 commented Mar 29, 2021

@saudet I am testing my new refactoring that depends purely on the JavaCPP layer. What we do now is disable GC, track every native resource, and close them as soon as we don't need them. But I found there is a small incremental leak of native memory. Is there a way to find it? I tried the command you showed in another issue but it didn't print anything. Does it only work on Pointers attached to a PointerScope, or does it work on every pointer we create?

./gradlew benchmark -Dorg.bytedeco.javacpp.nopointergc=true -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location="https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet" --args='-n tf_resnet -c 10000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.logger.debug -DargLine=-Xmx200m 2>&1 | grep Collecting | grep -v 'ownerAddress=0x0'

I found that we don't release the TF_Output pointer in session run. Could that be the reason?

@saudet

saudet commented Mar 29, 2021

When running the benchmark above I get a bunch of warnings out of JavaCPP. @saudet any idea what's causing that?

@Craigacp It just means that the jnijavacpp library itself is missing, but same as tensorflow/java#189 (comment), we can ignore those warnings, unless we get some other loading error that is...

@saudet I am testing my new refactoring that depends purely on the JavaCPP layer. What we do now is disable GC, track every native resource, and close them as soon as we don't need them. But I found there is a small incremental leak of native memory. Is there a way to find it? I tried the command you showed in another issue but it didn't print anything. Does it only work on Pointers attached to a PointerScope, or does it work on every pointer we create?

./gradlew benchmark -Dorg.bytedeco.javacpp.nopointergc=true -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location="https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet" --args='-n tf_resnet -c 10000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.logger.debug -DargLine=-Xmx200m 2>&1 | grep Collecting | grep -v 'ownerAddress=0x0'

I found that we don't release the TF_Output pointer in session run. Could that be the reason?

When "org.bytedeco.javacpp.nopointergc" is true, JavaCPP won't be trying to use GC, so we're not going to get any "Collecting ..." lines in the debug log. You'll need to enable GC temporarily to see if it can pick up something that way. It won't be able to track anything that isn't registered with it though, so if you're still not seeing anything with GC enabled, I would say it's something it doesn't know about that is leaking. However, native arrays such as the one that you're linking to for TF_Output get allocated with the new operator from C++, and JavaCPP automatically registers deallocators that call their destructors using the delete operator for those, so if they were "leaking", it would most likely show up in JavaCPP's debug log. Also, it's created inside a PointerScope, so it's almost certainly getting deallocated there. To make sure, you can check that outputs.isNull() is true after leaving the scope.

As for detecting memory leaks in native code in general, I find that Valgrind works the least badly with the JVM:
https://github.com/deeplearning4j/libnd4j/wiki/Debugging-libnd4j
Nevertheless, with a bit more hacking, it's also possible to use AddressSanitizer:
http://btorpey.github.io/blog/2014/03/27/using-clangs-address-sanitizer/

@stu1130
Contributor

stu1130 commented Mar 29, 2021

So here are some interesting results to share on my side. As I suggested in my previous post, I explicitly disabled the collection of DJL metrics inside the benchmark, and I was able to complete the task successfully in 18 hours on my Ubuntu laptop with only 128 MB of JVM memory, without OOM (GC was enabled).

Now that does not necessarily mean the DJL metrics were the only thing leaking; maybe disabling them just gave a bit more room for other leaks, so it would be interesting to run the test with another scenario. @skirdey, didn't you have another application you wanted to test version 0.3.0 with?

BTW, there is a new leak that was introduced in 0.3.0 when upgrading to TF 2.4.1, but it is only related to String tensors (which are not used in the current benchmark AFAIK). Please just make sure you don't use String tensors in your test applications until we push a fix in the upcoming 0.3.1 version.

I did the same test again (GC enabled && setMetric commented out && jvmArgs = ["-Xmx128", "-Xmx128m"]) but unfortunately got OOM.
Command:

./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 100000 -s 1,224,224,3'
[ERROR] -
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
        at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
        at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
        at ai.djl.examples.inference.benchmark.MultithreadedBenchmark.predict(MultithreadedBenchmark.java:86) [main/:?]
        at ai.djl.examples.inference.benchmark.util.AbstractBenchmark.runBenchmark(AbstractBenchmark.java:132) [main/:?]
        at ai.djl.examples.inference.benchmark.Benchmark.main(Benchmark.java:31) [main/:?]
Caused by: java.lang.OutOfMemoryError: Java heap space
[ERROR] - Only 0/10 threads finished.

> Task :examples:benchmark FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':examples:benchmark'.
> Process 'command '/usr/lib/jvm/java-12-openjdk-amd64/bin/java'' finished with non-zero exit value 255

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full
insights.

* Get more help at https://help.gradle.org

Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.7.1/userguide/command_line_interface.html#sec:command_line_warnings

BUILD FAILED in 2h 1m 59s

Maybe I am using a more powerful CPU machine (c5.2xlarge), so small objects like Pointer accumulate faster and reach OOM.

@karllessard

I did the same test again (GC enabled && setMetric commented out && jvmArgs = ["-Xmx128", "-Xmx128m"])

This is strange indeed, and not easy to debug if I cannot reproduce it on my side. @Craigacp, do you still have access to those fast machines so we can run the same test on them?

BTW, TF Java 0.3.1 was released this morning; it not only fixes a leak with the new String tensors but also potentially another leak (never observed so far) that could happen when closing multiple graph sessions. It might be better to work with this version from now on.

@saudet

saudet commented Mar 30, 2021

@stu1130 128 MB is not a lot of memory for the JVM itself + a couple of things from TF and DJL.
Make sure to try with something a bit larger like at least 256 or 512 MB as well.

@stu1130
Contributor

stu1130 commented Mar 30, 2021

@saudet I tested my new refactoring (1 GB JVM heap && nopointergc=true) and it went OOM after 2 hours. I used jconsole and found that native (non-heap) memory is pretty stable, but the old gen kept growing until the OOM exception, with several old-gen GC attempts. I will share my experiment with nopointergc=false (GC enabled) as a baseline.

@stu1130
Contributor

stu1130 commented Mar 31, 2021

@saudet
Experiment 1:

./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 100000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.nopointergc=true
  • EC2: c5.2xlarge
  • -Xmx512m -Xms512m G1GC
  • -Dorg.bytedeco.javacpp.nopointergc=true
  • OOM in 55 mins
  • non-heap memory usage is stable

Here is a JVM heap dump from 3-4 minutes before OOM. I hit the "perform GC" button several times.
It looks like DeallocatorReference, NativeDeallocator, and AtomicInteger objects are the root cause of the leak.
[screenshot: Screen Shot 2021-03-31 at 3 44 37 PM]

Experiment 2:

./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 1000000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.nopointergc=false
  • EC2: c5.2xlarge
  • -Xmx512m -Xms512m G1GC
  • -Dorg.bytedeco.javacpp.nopointergc=false
  • passed the test after 5 hours
  • non-heap memory usage is stable

Here is the jconsole snapshot:
[screenshot: Screen Shot 2021-03-31 at 11 08 37 PM]
Here is the heap dump:
[screenshot: Screen Shot 2021-04-01 at 8 17 48 AM]
It turns out the majority of the memory usage is from the Metric objects. So with GC enabled, the system is able to release the Pointer objects.
@karllessard it also confirms that removing the metrics helps a lot.

@saudet Is there a way to allow GC to release those Pointer objects while -Dorg.bytedeco.javacpp.nopointergc=true is set?

I think I found the problem. When nopointergc is true, deallocators keep getting appended to a linked list via pointer.deallocator(new DeleteDeallocator(o)) or pointer.deallocator(new Pointer.NativeDeallocator(this, ownerAddress, deallocatorAddress)) when the pointer is a PointerPointer or IntPointer, but remove is never called. Each node in the linked list holds a strong reference to the next node, causing the memory leak.

saudet added a commit to bytedeco/javacpp that referenced this issue Apr 1, 2021
@saudet

saudet commented Apr 1, 2021

Yes, I've noticed that as well. I fixed this in commit bytedeco/javacpp@7f27899, along with proper testing for that case. Sorry about that. In practice though, there is little to be gained by disabling GC entirely. As long as we make sure to deallocate everything explicitly, we're not going to lose any performance.

Incidentally, this gives us a benchmark about the kind of performance we can expect with and without relying on GC. On my machine (Fedora 32), PointerTest#testDeallocator produces these kinds of numbers:

  • Without explicit deallocation: Took 11330 ms
  • With explicit deallocation: Took 52 ms

(Windows gives me similar but slightly worse results.) The time for the latter explicit case doesn't change regardless of whether "org.bytedeco.javacpp.nopointergc" is true or not. Clearly, the problem isn't only with GC itself, but also with allocating native memory in one thread, and deallocating it in another thread...

@stu1130
Contributor

stu1130 commented Apr 1, 2021

@saudet awesome! I was testing my new code with an explicit deallocate() call once I am done with each pointer. Looks like I don't need to create that PR! So I can now rely entirely on PointerScope without any explicit deallocate() call, right? Or do you still recommend adding a deallocate() call at the end of the PointerScope? The reason we would like nopointergc=true is not only about GC; we also don't want the blocking deallocator thread that calls System.gc(). But thanks for that experiment; we are clearer now about which direction is right. Let me know when the 1.5.6-SNAPSHOT is out and when you are going to release 1.5.6 with that fix! Thanks!

@saudet

saudet commented Apr 1, 2021

PointerScope.close() ends up calling Pointer.deallocate() under the hood, so it's the same as calling it directly. In other words, it's explicit deallocation that doesn't rely on GC. About System.gc(), another way to prevent calls to that is by setting maxBytes to 0, but that's already being done in DJL by default:
https://github.com/frankfliu/djl/blob/master/tensorflow/tensorflow-engine/src/main/java/ai/djl/tensorflow/engine/LibUtils.java#L52
So it should never get called anyway and you don't need to have a new version of JavaCPP for that. Are you concerned about anything else though?
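Roughly, that setup looks like this; a sketch assuming JavaCPP's system property is spelled org.bytedeco.javacpp.maxbytes and is set before any Pointer class is loaded (it can equally be passed as a -D flag on the command line; the BytePointer allocation is only an example):

import org.bytedeco.javacpp.BytePointer;
import org.bytedeco.javacpp.PointerScope;

// With maxBytes set to 0, JavaCPP does not track a native allocation limit
// and therefore never falls back to calling System.gc().
System.setProperty("org.bytedeco.javacpp.maxbytes", "0");

try (PointerScope scope = new PointerScope()) {
    BytePointer buffer = new BytePointer(1024); // attached to the open scope
    // ... use the buffer ...
} // closing the scope calls deallocate() on the attached pointers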

@stu1130
Contributor

stu1130 commented Apr 1, 2021

That's all I need, thanks. I checked JavaCPP 1.6.0-SNAPSHOT but found that the fix bytedeco/javacpp@7f27899 is not included yet. Let me know when it is available; I would like to try it out!

@saudet

saudet commented Apr 1, 2021

It's in 1.5.6-SNAPSHOT: https://github.com/bytedeco/javacpp/blob/7f27899578dfa18e22738a3dd49701e1806b464a/pom.xml#L7

1.6.0-SNAPSHOT is something else that's probably not going to happen.

@stu1130
Contributor

stu1130 commented Apr 1, 2021

@saudet So far I no longer see accumulating DeallocatorReference & NativeReference objects with the new fix!
One more question: what is the life cycle of the Pointer returned by TF_TensorData(TF_Tensor)? Should we call close() or deallocate()? I found that it doesn't call setNull(), so isNull() is always false even after I close the TF_Tensor.

@skirdey
Contributor Author

skirdey commented Apr 1, 2021

Accumulating DeallocatorReference & NativeReference objects is what I also saw at the root of this thread, so it is awesome that it is fixed! I don't think there were any other issues when running DJL / TF / nopointergc=true in prod, but I need to try running inference again at scale.

@saudet

saudet commented Apr 2, 2021

@saudet So far I no longer see accumulating DeallocatorReference & NativeReference objects with the new fix!
One more question: what is the life cycle of the Pointer returned by TF_TensorData(TF_Tensor)? Should we call close() or deallocate()? I found that it doesn't call setNull(), so isNull() is always false even after I close the TF_Tensor.

TF_TensorData() is just a C function that returns a pointer and doesn't allocate anything. The Pointer object it returns is merely a wrapper around a long value, so we don't need to worry about it. Pointer.close() doesn't do anything when a deallocator hasn't been registered, that's normal.
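To make the ownership concrete, a minimal sketch assuming the JavaCPP-generated C API bindings in org.tensorflow.internal.c_api (tensor stands for a live TF_Tensor):

import static org.tensorflow.internal.c_api.global.tensorflow.*;

import org.bytedeco.javacpp.Pointer;
import org.tensorflow.internal.c_api.TF_Tensor;

// TF_TensorData() only returns a view into the tensor's existing buffer;
// it does not allocate, so there is nothing extra to close or deallocate.
Pointer data = TF_TensorData(tensor);
long size = TF_TensorByteSize(tensor);
// ... copy out up to `size` bytes while `tensor` is still alive ...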

Accumulating DeallocatorReference & NativeReference objects is what I also saw at the root of this thread, so it is awesome that it is fixed! I don't think there were any other issues when running DJL / TF / nopointergc=true in prod, but I need to try running inference again at scale.

There should be no discernible difference in speed between JavaCPP 1.5.5 with maxBytes = 0 and JavaCPP 1.5.6-SNAPSHOT with noPointerGC = true, but if you do notice a difference, please let me know and I'll release JavaCPP 1.5.5-1 or something with the fix. Thanks for testing!

@saudet

saudet commented May 7, 2021

FYI, I've just updated JavaCPP to skip over all synchronized code when GC fallback is disabled, see bytedeco/javacpp@d788390. That may be able to help further with latency, but I'm not seeing any difference in my tests.

@saudet

saudet commented May 31, 2021

@skirdey BTW, latency may be potentially even lower with TF Lite, so if your models are compatible with it, please give it a try.
DJL uses the official Java API of TF Lite, but it is limited and inefficient, as per its own documentation:

So I've created builds that expose the C++ API with JavaCPP:

Those are currently low-level wrappers for the C++ API, but DJL could of course use this instead of the official API.

@stu1130 If this looks like something you guys may want to use but would prefer to maintain them here, please let me know!

@saudet

saudet commented Aug 3, 2021

BTW, I've released JavaCPP 1.5.6, so you can start using that.

@frankfliu
Contributor

@saudet Thank you so much!

We have updated to 1.5.6 on the master branch and will include your fix in the next release.

@skirdey skirdey closed this as completed Apr 2, 2022
@skirdey
Contributor Author

skirdey commented Apr 2, 2022

The issue was resolved.
