
Possible memory-leak in multi-threading inference using Tensorflow and having org.bytedeco.javacpp.nopointergc=true (CPU) #690

Closed
skirdey opened this issue Feb 24, 2021 · 56 comments
Labels: bug (Something isn't working), Tensorflow (Tensorflow Engine)

Comments

@skirdey
Contributor

skirdey commented Feb 24, 2021

Description

Possible memory-leak in multi-threading inference using Tensorflow and having org.bytedeco.javacpp.nopointergc=true

CPU inference.

Expected Behavior

Garbage collection removing objects from Old Generation

Error Message

Java OOM

How to Reproduce?

Run multi-threaded inference for 30 minutes with -Dorg.bytedeco.javacpp.nopointergc=true so that you don't have the JavaCPP Deallocator thread blocking.

./gradlew benchmark -Dorg.bytedeco.javacpp.nopointergc=true -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location="https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet" --args='-n tf_resnet -t 10 -c 1000000 -s 1,224,224,3'

@skirdey skirdey added the bug label Feb 24, 2021
@skirdey
Contributor Author

skirdey commented Feb 24, 2021

I have a high-level overview of what happens, captured via YourKit. At some point (>30 min of continuous predictions) there is a spike in GC collections:
[screenshot: Screen Shot 2021-02-24 at 3 30 55 PM]

and in pause time:
[screenshot: Screen Shot 2021-02-24 at 3 30 44 PM]

Also, the Old Generation takes up most of the heap, leaving nothing for new allocations:
[screenshot: Screen Shot 2021-02-24 at 2 49 52 PM]

Here is a memory snapshot after several forced GCs:
[screenshot: Screen Shot 2021-02-24 at 3 36 43 PM]

@skirdey skirdey changed the title Possible memory-leak in multi-threading inference using Tensorflow and having org.bytedeco.javacpp.nopoinergc=true Possible memory-leak in multi-threading inference using Tensorflow and having org.bytedeco.javacpp.nopointergc=true Feb 24, 2021
@stu1130 stu1130 added the Tensorflow label Feb 25, 2021
@skirdey skirdey changed the title Possible memory-leak in multi-threading inference using Tensorflow and having org.bytedeco.javacpp.nopointergc=true Possible memory-leak in multi-threading inference using Tensorflow and having org.bytedeco.javacpp.nopointergc=true (CPU) Feb 25, 2021
@skirdey
Contributor Author

skirdey commented Feb 28, 2021

The same benchmark doing multi-threaded inference with PyTorch has a minimal memory footprint and no GC pauses:
./gradlew benchmark -Dai.djl.default_engine=PyTorch -Dai.djl.repository.zoo.location="https://alpha-djl-demos.s3.amazonaws.com/model/djl-blockrunner/pytorch_resnet18.zip?model_name=traced_resnet18" --args='-c 100000 -d 1440 -t 10 -s 1,3,224,224'
[screenshot: Screen Shot 2021-02-27 at 4 21 11 PM]

@karllessard

Please see tensorflow/java#229 for a potential fix for this problem in the TF Java library.

@skirdey
Contributor Author

skirdey commented Mar 12, 2021

@karllessard is it possible to release TF Java 0.2.1 with the hotfix? It doesn't look like a breaking change, so a patch release would be nice.

DJL uses released versions of TF Java.

@saudet

saudet commented Mar 12, 2021

@skirdey 0.3.0-SNAPSHOT is pretty much up now, so please give that a try, but like I tried to explain on the pull request, I don't believe this will fix any issues for DJL.

@karllessard

Good news @skirdey! Important build issues that were preventing us from releasing 0.3.0 are now fixed. I will provide more details a bit later, but you can also follow this thread to see the progress.

@skirdey
Contributor Author

skirdey commented Mar 12, 2021 via email

@saudet

saudet commented Mar 12, 2021

That doesn't fix any leak. If there's a leak, it's somewhere else.

@stu1130
Contributor

stu1130 commented Mar 12, 2021

The DJL team can publish a 0.11.0-SNAPSHOT that depends on 0.3.0-SNAPSHOT. I am on it.

@karllessard

We just had our community meeting and I'm targeting a 0.3.0 release by March 15th based on the current snapshot. The only caveat is that the Windows MKL+GPU platform won't be available for that release (at least not by that date); all other platforms will be.

@karllessard

Hi, just to let you know that 0.3.0 is now available on Maven Central. Please let us know how it goes with your experiments on this issue, thank you

@skirdey
Contributor Author

skirdey commented Mar 20, 2021

We tried it out and the memory still leaks: very quickly when nopointergc=true, and it takes about 5 hours to halt when nopointergc=false.
@karllessard @saudet

@stu1130
Contributor

stu1130 commented Mar 20, 2021

As @skirdey mentioned, here is my setup:
c5.2xlarge instance with jvmArgs -Xmx256m (to make it easier to hit OOM)
command:

./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 100000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.nopointergc=true

-Dorg.bytedeco.javacpp.nopointergc=true took 2m 31s
-Dorg.bytedeco.javacpp.nopointergc=false took 5h 11m 6s

As a next step, we are going to integrate TfNDManager with PointerScope, as @saudet suggested.

@saudet

saudet commented Mar 20, 2021

@stu1130 Can you point us to the code in DJL for that benchmark?

@stu1130
Contributor

stu1130 commented Mar 20, 2021

@saudet

saudet commented Mar 20, 2021

I mean, what does it end up calling in TF Java? For example, @roywei says that

Yes, we closed the input tensor after TfNDArray creation and only keep the constant operand for future eager operations (usually preprocessing before prediction). The preprocessing output will be converted back to a Tensor and fed into a classic Session (loaded from the SavedModel) for inference. We use try/catch blocks around NDManagers, similar to using eager sessions.

It sounds like there are dangling references to "constant operands" that are not closed with their sessions. Why is that? How do you make sure they get deallocated?

@saudet

saudet commented Mar 21, 2021

Another thing you need to be careful about is that, currently, output Tensor objects are not closed automatically when a Session is closed, so you need to make sure that all output tensors are closed before the Session is closed. If you do not close all output Tensors before closing their Session, then even if they are closed later on, that may still result in memory leaks. It's a bit weird, and I hope the people from TF Java realize that it's not a great way to do things, but I haven't been able to convince them yet. For now, we need to manage all of that manually.
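For illustration, a minimal sketch of that close-output-tensors-first ordering, assuming the TF Java 0.3.x Graph/Session/Tensor API (loadGraph(), inputTensor, and the operation names are placeholders, not DJL code):

import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.Tensor;

static void runAndCloseInOrder(Tensor inputTensor) {      // inputTensor: assumed pre-built input
    try (Graph graph = loadGraph();                        // loadGraph(): hypothetical Graph loader
         Session session = new Session(graph)) {
        try (Tensor result = session.runner()
                                    .feed("input", inputTensor)
                                    .fetch("output")
                                    .run()
                                    .get(0)) {
            // ... read the result here ...
        } // the output tensor is closed first
    } // the session and the graph are closed afterwards
}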

@stu1130
Contributor

stu1130 commented Mar 22, 2021

I mean, what does it end up calling in TF Java? For example, @roywei says that

Yes, we closed the input tensor after TfNDArray creation and only keep the constant operand for future eager operations (usually preprocessing before prediction). The preprocessing output will be converted back to a Tensor and fed into a classic Session (loaded from the SavedModel) for inference. We use try/catch blocks around NDManagers, similar to using eager sessions.

It sounds like there are dangling references to "constant operands" that are not closed with their sessions. Why is that? How do you make sure they get deallocated?

When TfNDArray.close() is called, the operand is set to null.

Another thing you need to be careful about is that, currently, output Tensor objects are not closed automatically when a Session is closed, so you need to make sure that all output tensors are closed before the Session is closed. If you do not close all output Tensors before closing their Session, then even if they are closed later on, that may still result in memory leaks. It's a bit weird, and I hope the people from TF Java realize that it's not a great way to do things, but I haven't been able to convince them yet. For now, we need to manage all of that manually.

My understanding is that when we create a Tensor from an EagerSession or Session, the pointer to the Tensor is attached to the EagerSession/Session. Now, with the new fix, the Tensor is held with a WeakReference and can therefore be released by GC. How does that result in memory leaks? When we close the EagerSession/Session, shouldn't it free all attached Tensors?

@karllessard

karllessard commented Mar 22, 2021

The behavior is not the same in eager and graph mode.

In eager mode, it is true that the lifetime of a Tensor is attached to an eager session, and closing this session will release all tensors it has implicitly allocated (e.g. when converting operation results to a tensor). Tensors allocated manually by a user still need to be freed manually as well, since the user has ownership of them.

In graph mode though, all tensors returned by a session.run() must be deallocated manually by the user; neither closing the graph nor the session will close them, as explained here. We are already working on a solution so that users won't need to worry too much about this in the near future.

When TfNDArray.close() is called, the operand is set to null.

In eager mode, this should indeed do the job, as the GC will eventually free the constant. The leak is probably elsewhere then. But note that one drawback of relying on the GC to free up resources is that it does not know the size of the native resources being allocated. Therefore, while the JVM might think that there is plenty of memory available and that a GC cycle can wait, you might actually be running low, especially if you are holding on to very large tensors/constants for a long time. Again, the safest route is to close the eager session as soon as you can, but I'm not sure how that's possible in your current architecture; I'd have to take a deeper look to understand clearly how it works.
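As a rough illustration of the "close the eager session as soon as you can" advice, a minimal sketch assuming the TF Java EagerSession/Ops API (the constant value is just a placeholder):

import org.tensorflow.EagerSession;
import org.tensorflow.op.Ops;

// Scope the eager session to the work that needs it, so the tensors it
// implicitly allocates are released deterministically instead of waiting on GC.
try (EagerSession session = EagerSession.create()) {
    Ops tf = Ops.create(session);
    var constant = tf.constant(new float[] {1f, 2f, 3f});
    // ... run the eager preprocessing here ...
} // closing the session releases the tensors it allocated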

@stu1130
Contributor

stu1130 commented Mar 22, 2021

The behavior is not the same in eager and graph mode.

In eager mode, it is true that the lifetime of a Tensor is attached to an eager session, and closing this session will release all tensors it has implicitly allocated (e.g. when converting operation results to a tensor). Tensors allocated manually by a user still need to be freed manually as well, since the user has ownership of them.

In graph mode though, all tensors returned by a session.run() must be deallocated manually by the user; neither closing the graph nor the session will close them, as explained here. We are already working on a solution so that users won't need to worry too much about this in the near future.

When TfNDArray.close() is called, the operand is set to null.

In eager mode, this should indeed do the job, as the GC will eventually free the constant. The leak is probably elsewhere then. But note that one drawback of relying on the GC to free up resources is that it does not know the size of the native resources being allocated. Therefore, while the JVM might think that there is plenty of memory available and that a GC cycle can wait, you might actually be running low, especially if you are holding on to very large tensors/constants for a long time. Again, the safest route is to close the eager session as soon as you can, but I'm not sure how that's possible in your current architecture; I'd have to take a deeper look to understand clearly how it works.

Thanks @karllessard for those pointers. We will try to close tensors for both EagerSession/Session ASAP.

@saudet

saudet commented Mar 23, 2021

My understanding is that when we create a Tensor from an EagerSession or Session, the pointer to the Tensor is attached to the EagerSession/Session. Now, with the new fix, the Tensor is held with a WeakReference and can therefore be released by GC. How does that result in memory leaks? When we close the EagerSession/Session, shouldn't it free all attached Tensors?

The current implementation is incorrect, but they (the rest of the SIG JVM) are not concerned about this issue. Please make sure that SIG JVM knows that you would like this to be fixed in TF Java. I will keep trying to make this happen as part of SIG JVM, but I can't promise anything, so any help is welcome.

In eager mode, this should indeed do the job, as the GC will eventually free the constant. The leak is probably elsewhere then. But note that one drawback of relying on the GC to free up resources is that it does not know the size of the native resources being allocated. Therefore, while the JVM might think that there is plenty of memory available and that a GC cycle can wait, you might actually be running low, especially if you are holding on to very large tensors/constants for a long time. Again, the safest route is to close the eager session as soon as you can, but I'm not sure how that's possible in your current architecture; I'd have to take a deeper look to understand clearly how it works.

That's not true. I've explained this many times already, so I won't try to reexplain this here, but with the way you've "fixed" this, leaks will still happen, unless of course everything is done manually.

@karllessard

Hi @stu1130 ,

I was experimenting with DJL's benchmark this evening, and while I cannot draw conclusions yet on the TF side, I've noticed that the benchmark collects multiple metrics on each inference, which are piled up in a list for the whole duration of the test and reach millions in number after a few hours. While instances of Metric seem quite small in size, it could be interesting to see if you still observe OOM on a 256 MB JVM when metric collection is disabled, as it seems possible to do here, to make sure this constantly growing list does not interfere with the experiment. What do you think?

@saudet

saudet commented Mar 24, 2021

My understanding is that when we create a Tensor from an EagerSession or Session, the pointer to the Tensor is attached to the EagerSession/Session. Now, with the new fix, the Tensor is held with a WeakReference and can therefore be released by GC. How does that result in memory leaks? When we close the EagerSession/Session, shouldn't it free all attached Tensors?

The current implementation is incorrect, but they (the rest of the SIG JVM) are not concerned about this issue. Please make sure that SIG JVM knows that you would like this to be fixed in TF Java. I will keep trying to make this happen as part of SIG JVM, but I can't promise anything, so any help is welcome.

To clarify what I wrote above, the "issue" I'm referring to isn't the one in this thread about memory leaks in DJL, but another one concerning the incorrect use of WeakReference that @stu1130 inquired about. I think everyone is on the same page about fixing memory leaks in general, and right now we can avoid them by making sure that everything gets deallocated manually. Even if we started using WeakReference correctly somehow, it is my opinion that relying on GC is a bad idea anyway, and this is where my opinion diverges from the other members of SIG JVM. We all agree that calling things like Tensor.close() should work though, so if there's a bug there, we'll fix that.

@Craigacp

Craigacp commented Mar 24, 2021

When running the benchmark above I get a bunch of warnings out of JavaCPP. @saudet any idea what's causing that?

Warning: Could not load Loader: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load Pointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load BytePointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Load library 2.4.1 in 296.217 ms.
[INFO ] - Running MultithreadedBenchmark on: cpu().
Loading:     100% |████████████████████████████████████████|
Warning: Could not load PointerPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Model tf_resnet loaded in: 2050.987 ms.
[INFO ] - Multithreaded inference with 10 threads.
Warning: Could not load IntPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib

I don't usually get this out of TF-Java when we run it separately, and I know that DJL uses its own library loader which partially delegates to ours.

@skirdey
Contributor Author

skirdey commented Mar 26, 2021 via email

@Craigacp

Do you have this problem with Java 9 cleaners too? They are increasingly used inside the JVM to deal with resource cleanup and it's basically the same idiom.
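For reference, the Cleaner idiom being referred to looks roughly like this, using only the plain JDK 9+ API; the class and field names are made up for illustration and are not DJL or TF Java code:

import java.lang.ref.Cleaner;

class NativeResource implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup action must not hold a reference back to the resource object,
    // otherwise the object would never become phantom reachable.
    private static final class State implements Runnable {
        private final long nativeHandle;
        State(long nativeHandle) { this.nativeHandle = nativeHandle; }
        @Override public void run() { /* free(nativeHandle) via JNI/JavaCPP */ }
    }

    private final Cleaner.Cleanable cleanable;

    NativeResource(long nativeHandle) {
        this.cleanable = CLEANER.register(this, new State(nativeHandle));
    }

    // Explicit close is preferred; the Cleaner only acts as a GC-driven backstop.
    @Override public void close() { cleanable.clean(); }
}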

@skirdey
Contributor Author

skirdey commented Mar 26, 2021 via email

@skirdey
Contributor Author

skirdey commented Mar 26, 2021 via email

@karllessard

karllessard commented Mar 26, 2021

Thanks @skirdey , please don't forget to disable DJL's metrics since they can grow drastically during the benchmark and cause an OOM if you are running on a low-memory JVM.

@stu1130
Contributor

stu1130 commented Mar 29, 2021

@saudet I am testing my new refactoring that depends purely on the JavaCPP layer. What we do now is disable GC, track every native resource, and close them as soon as we don't need them. But I found there is a small incremental leak of native memory. Is there a way to find it? I tried the command you showed in another issue but it didn't print anything. Does it only work on Pointers attached to a PointerScope, or does it work on every pointer we create?

./gradlew benchmark -Dorg.bytedeco.javacpp.nopointergc=true -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location="https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet" --args='-n tf_resnet -c 10000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.logger.debug -DargLine=-Xmx200m 2>&1 | grep Collecting | grep -v 'ownerAddress=0x0'

I found that we don't release the TF_Output pointer in session run. Could that be the reason?

@saudet

saudet commented Mar 29, 2021

When running the benchmark above I get a bunch of warnings out of JavaCPP. @saudet any idea what's causing that?

@Craigacp It just means that the jnijavacpp library itself is missing, but same as tensorflow/java#189 (comment), we can ignore those warnings, unless we get some other loading error that is...

@saudet I am testing my new refactoring that depends purely on the JavaCPP layer. What we do now is disable GC, track every native resource, and close them as soon as we don't need them. But I found there is a small incremental leak of native memory. Is there a way to find it? I tried the command you showed in another issue but it didn't print anything. Does it only work on Pointers attached to a PointerScope, or does it work on every pointer we create?

./gradlew benchmark -Dorg.bytedeco.javacpp.nopointergc=true -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location="https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet" --args='-n tf_resnet -c 10000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.logger.debug -DargLine=-Xmx200m 2>&1 | grep Collecting | grep -v 'ownerAddress=0x0'

I found that we don't release the TF_Output pointer in session run. Could that be the reason?

When "org.bytedeco.javacpp.nopointergc" is true, JavaCPP won't be trying to use GC, so we're not going to get any "Collecting ..." lines in the debug log. You'll need to enable GC temporarily to see if it can pick up something that way. It won't be able to track anything that isn't registered with it though, so if you're still not seeing anything with GC enabled, I would say it's something it doesn't know about that is leaking. However, native arrays such as the one that you're linking to for TF_Output get allocated with the new operator from C++, and JavaCPP automatically registers deallocators that call their destructors using the delete operator for those, so if they were "leaking", it would most likely show up in JavaCPP's debug log. Also, it's created inside a PointerScope, so it's almost certainly getting deallocated there. To make sure, you can check that outputs.isNull() is true after leaving the scope.

As for detecting memory leaks in native code in general, I find that Valgrind works the least badly with the JVM:
https://github.com/deeplearning4j/libnd4j/wiki/Debugging-libnd4j
Nevertheless, with a bit more hacking, it's also possible to use AddressSanitizer:
http://btorpey.github.io/blog/2014/03/27/using-clangs-address-sanitizer/

@stu1130
Contributor

stu1130 commented Mar 29, 2021

So here are some interesting results to share on my side. As I suggested in my previous post, I explicitly disabled the collection of DJL metrics inside the benchmark, and I was able to complete the task successfully in 18 hours on my Ubuntu laptop with only 128 MB of JVM memory, without OOM (GC was enabled).

Now that does not necessarily mean the DJL metrics were the only thing leaking; maybe disabling them just gave a bit more room for other leaks, so it would be interesting to run the test with another scenario. @skirdey, didn't you have another application you wanted to test version 0.3.0 with?

BTW, there is a new leak that was introduced in 0.3.0 when upgrading to TF 2.4.1, but it is only related to String tensors (which are not used in the current benchmark AFAIK). Please just make sure you don't use String tensors in your test applications until we push a fix in the upcoming 0.3.1 version.

I did the same test again (GC enabled && setMetric commented out && jvmArgs = ["-Xmx128", "-Xmx128m"]) but unfortunately got OOM.
Command:

./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 100000 -s 1,224,224,3'
[ERROR] -
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
        at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
        at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
        at ai.djl.examples.inference.benchmark.MultithreadedBenchmark.predict(MultithreadedBenchmark.java:86) [main/:?]
        at ai.djl.examples.inference.benchmark.util.AbstractBenchmark.runBenchmark(AbstractBenchmark.java:132) [main/:?]
        at ai.djl.examples.inference.benchmark.Benchmark.main(Benchmark.java:31) [main/:?]
Caused by: java.lang.OutOfMemoryError: Java heap space
[ERROR] - Only 0/10 threads finished.

> Task :examples:benchmark FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':examples:benchmark'.
> Process 'command '/usr/lib/jvm/java-12-openjdk-amd64/bin/java'' finished with non-zero exit value 255

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full
insights.

* Get more help at https://help.gradle.org

Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.7.1/userguide/command_line_interface.html#sec:command_line_warnings

BUILD FAILED in 2h 1m 59s

Maybe I am using a more powerful CPU machine (c5.2xlarge), so small objects like Pointer accumulate faster and reach OOM.

@karllessard

I did the same test again (GC enabled && setMetric commented out && jvmArgs = ["-Xmx128", "-Xmx128m"])

This is strange indeed, and not easy to debug if I cannot reproduce it on my side. @Craigacp, do you still have access to those fast machines so we can run the same test on them?

BTW, TF Java 0.3.1 was released this morning; it not only fixes a leak with the new String tensors but also potentially another leak (never observed so far) that could happen when closing multiple graph sessions. It might be better to work with this version from now on.

@saudet

saudet commented Mar 30, 2021

@stu1130 128 MB is not a lot of memory for the JVM itself + a couple of things from TF and DJL.
Make sure to try with something a bit larger like at least 256 or 512 MB as well.

@stu1130
Contributor

stu1130 commented Mar 30, 2021

@saudet I tested my new refactoring (1 GB JVM heap && nopointergc=true) and it went OOM after 2 hours. I used jconsole and found that native (non-heap) memory is pretty stable, but the old gen kept growing until the OOM exception, with several old-gen GC attempts. I will share my experiment with nopointergc=false (GC enabled) as a baseline.

@stu1130
Contributor

stu1130 commented Mar 31, 2021

@saudet
Experiment 1:

./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 100000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.nopointergc=true
  • EC2: c5.2xlarge
  • -Xmx512m -Xms512m G1GC
  • -Dorg.bytedeco.javacpp.nopointergc=true
  • OOM in 55 mins
  • non-heap memory usage is stable

Here is a JVM heap dump from 3-4 minutes before OOM. I hit the "perform GC" button several times.
It looks like DeallocatorReference, NativeDeallocator, and AtomicInteger objects are the root cause of the leak.
[screenshot: Screen Shot 2021-03-31 at 3 44 37 PM]

Experiment 2:

./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 1000000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.nopointergc=false
  • EC2: c5.2xlarge
  • -Xmx512m -Xms512m G1GC
  • -Dorg.bytedeco.javacpp.nopointergc=false
  • passed the test after 5 hours
  • non-heap memory usage is stable

Here is the jconsole snapshot:
[screenshot: Screen Shot 2021-03-31 at 11 08 37 PM]
Here is the heap dump:
[screenshot: Screen Shot 2021-04-01 at 8 17 48 AM]
It turns out the majority of the memory usage is from the Metric objects. So with GC enabled, the system is able to release the Pointer objects.
@karllessard it also confirms that removing the metrics helps a lot.

@saudet Is there a way to allow GC to release those Pointer objects while -Dorg.bytedeco.javacpp.nopointergc=true is set?

I think I found the problem. When nopointergc is true, deallocators keep getting appended to a linked list via pointer.deallocator(new DeleteDeallocator(o)) or pointer.deallocator(new Pointer.NativeDeallocator(this, ownerAddress, deallocatorAddress)) when the pointer is a PointerPointer or IntPointer, but remove is never called. Each node in the linked list holds a strong reference to the next node, causing the memory leak.

saudet added a commit to bytedeco/javacpp that referenced this issue Apr 1, 2021
@saudet

saudet commented Apr 1, 2021

Yes, I've noticed that as well. I fixed this in commit bytedeco/javacpp@7f27899, along with proper testing for that case. Sorry about that. In practice though, there is little to be gained by disabling GC entirely. As long as we make sure to deallocate everything explicitly, we're not going to lose any performance.

Incidentally, this gives us a benchmark about the kind of performance we can expect with and without relying on GC. On my machine (Fedora 32), PointerTest#testDeallocator produces these kinds of numbers:

  • Without explicit deallocation: Took 11330 ms
  • With explicit deallocation: Took 52 ms

(Windows gives me similar but slightly worse results.) The time for the latter explicit case doesn't change regardless of whether "org.bytedeco.javacpp.nopointergc" is true or not. Clearly, the problem isn't only with GC itself, but also with allocating native memory in one thread, and deallocating it in another thread...

@stu1130
Contributor

stu1130 commented Apr 1, 2021

@saudet awesome! I was testing my new code with an explicit deallocate() call once I am done with each pointer. Looks like I don't need to create that PR! So I can now rely entirely on PointerScope without any explicit deallocate() call, right? Or do you still recommend adding a deallocate() call at the end of the PointerScope? The reason we would like nopointergc=true is not only about GC; we also don't want the blocking deallocator thread that calls System.gc(). But thanks for that experiment; we are clearer now about which direction is right. Let me know when the 1.5.6-SNAPSHOT is out and when you are going to release 1.5.6 with that fix! Thanks!

@saudet

saudet commented Apr 1, 2021

PointerScope.close() ends up calling Pointer.deallocate() under the hood, so it's the same as calling it directly. In other words, it's explicit deallocation that doesn't rely on GC. About System.gc(), another way to prevent calls to that is by setting maxBytes to 0, but that's already being done in DJL by default:
https://github.com/frankfliu/djl/blob/master/tensorflow/tensorflow-engine/src/main/java/ai/djl/tensorflow/engine/LibUtils.java#L52
So it should never get called anyway and you don't need to have a new version of JavaCPP for that. Are you concerned about anything else though?
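Roughly, that setup looks like this; a sketch assuming JavaCPP's system property is spelled org.bytedeco.javacpp.maxbytes and is set before any Pointer class is loaded (it can equally be passed as a -D flag on the command line; the BytePointer allocation is only an example):

import org.bytedeco.javacpp.BytePointer;
import org.bytedeco.javacpp.PointerScope;

// With maxBytes set to 0, JavaCPP does not track a native allocation limit
// and therefore never falls back to calling System.gc().
System.setProperty("org.bytedeco.javacpp.maxbytes", "0");

try (PointerScope scope = new PointerScope()) {
    BytePointer buffer = new BytePointer(1024); // attached to the open scope
    // ... use the buffer ...
} // closing the scope calls deallocate() on the attached pointers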

@stu1130
Contributor

stu1130 commented Apr 1, 2021

That's all I need, thanks. I checked JavaCPP 1.6.0-SNAPSHOT but found that the fix bytedeco/javacpp@7f27899 is not included yet. Let me know when it is available; I would like to try it out!

@saudet

saudet commented Apr 1, 2021

It's in 1.5.6-SNAPSHOT: https://github.com/bytedeco/javacpp/blob/7f27899578dfa18e22738a3dd49701e1806b464a/pom.xml#L7

1.6.0-SNAPSHOT is something else that's probably not going to happen.

@stu1130
Contributor

stu1130 commented Apr 1, 2021

@saudet So far I no longer see accumulating DeallocatorReference & NativeReference objects with the new fix!
One more question: what is the life cycle of the Pointer returned by TF_TensorData(TF_Tensor)? Should we call close() or deallocate()? I found that it doesn't call setNull(), so isNull() is always false even after I close the TF_Tensor.

@skirdey
Contributor Author

skirdey commented Apr 1, 2021

Accumulating DeallocatorReference & NativeReference objects is what I also saw at the root of this thread, so it is awesome that it is fixed! I don't think there were any other issues when running DJL / TF / nopointergc=true in prod, but I need to try running inference again at scale.

@saudet

saudet commented Apr 2, 2021

@saudet So far I no longer see accumulating DeallocatorReference & NativeReference objects with the new fix!
One more question: what is the life cycle of the Pointer returned by TF_TensorData(TF_Tensor)? Should we call close() or deallocate()? I found that it doesn't call setNull(), so isNull() is always false even after I close the TF_Tensor.

TF_TensorData() is just a C function that returns a pointer and doesn't allocate anything. The Pointer object it returns is merely a wrapper around a long value, so we don't need to worry about it. Pointer.close() doesn't do anything when a deallocator hasn't been registered, that's normal.
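To make the ownership concrete, a minimal sketch assuming the JavaCPP-generated C API bindings in org.tensorflow.internal.c_api (tensor stands for a live TF_Tensor):

import static org.tensorflow.internal.c_api.global.tensorflow.*;

import org.bytedeco.javacpp.Pointer;
import org.tensorflow.internal.c_api.TF_Tensor;

// TF_TensorData() only returns a view into the tensor's existing buffer;
// it does not allocate, so there is nothing extra to close or deallocate.
Pointer data = TF_TensorData(tensor);
long size = TF_TensorByteSize(tensor);
// ... copy out up to `size` bytes while `tensor` is still alive ...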

Accumulating DeallocatorReference & NativeReference objects is what I also saw at the root of this thread, so it is awesome that it is fixed! I don't think there were any other issues when running DJL / TF / nopointergc=true in prod, but I need to try running inference again at scale.

There should be no discernible difference in speed between JavaCPP 1.5.5 with maxBytes = 0 and JavaCPP 1.5.6-SNAPSHOT with noPointerGC = true, but if you do notice a difference, please let me know and I'll release JavaCPP 1.5.5-1 or something with the fix. Thanks for testing!

@saudet

saudet commented May 7, 2021

FYI, I've just updated JavaCPP to skip over all synchronized code when GC fallback is disabled, see bytedeco/javacpp@d788390. That may be able to help further with latency, but I'm not seeing any difference in my tests.

@saudet

saudet commented May 31, 2021

@skirdey BTW, latency may be potentially even lower with TF Lite, so if your models are compatible with it, please give it a try.
DJL uses the official Java API of TF Lite, but it is limited and inefficient, as per its own documentation:

So I've created builds that expose the C++ API with JavaCPP:

Those are currently low-level wrappers for the C++ API, but DJL could of course use this instead of the official API.

@stu1130 If this looks like something you guys may want to use but would prefer to maintain them here, please let me know!

@saudet

saudet commented Aug 3, 2021

BTW, I've released JavaCPP 1.5.6, so you can start using that.

@frankfliu
Contributor

@saudet Thank you so much!

We have updated to 1.5.6 on the master branch and will include your fix in the next release.

@skirdey skirdey closed this as completed Apr 2, 2022
@skirdey
Contributor Author

skirdey commented Apr 2, 2022

The issue was resolved.
