Pointer.deallocator is a performance killer #231

Closed · jentfoo opened this issue Mar 20, 2018 · 9 comments

jentfoo commented Mar 20, 2018

In my profiling I am seeing a ton of blocking on this lock:
https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/Pointer.java#L546

I don't see why the synchronization here needs to happen across all instances. Should this not synchronize on `this` instead?

I am also unsure about holding the lock while doing a Thread.sleep. I assume this is to try and let the gc() have effect before trimMemory() is invoked (which still may or may not happen). Could the gc / sleep / trim happen outside of the lock?

If none of the above are possible (I am rather unfamiliar with this code), then maybe we could do an optimistic check of volatile state before we lock?

Again, just shots in the dark, but I am seeing this in my profiles a fair bit.
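
To make that last idea concrete, here is a minimal sketch of the kind of optimistic fast path I mean (hypothetical names, not the actual Pointer.java code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: skip the global lock entirely while the tracked total stays
// under the limit, and only serialize on the slow path when we are actually near it.
public class OptimisticAddSketch {
    private static final AtomicLong totalBytes = new AtomicLong();
    private static final long maxBytes = 1L << 30;  // assumed limit for illustration
    private static final Object lock = new Object();

    static void add(long bytes) {
        if (totalBytes.addAndGet(bytes) <= maxBytes) {
            return;  // fast path: no lock, no gc(), no sleep
        }
        totalBytes.addAndGet(-bytes);  // roll back and take the slow path
        synchronized (lock) {
            if (totalBytes.get() + bytes > maxBytes) {
                System.gc();  // give pending deallocators a chance to run
            }
            // In the real code the total would shrink as deallocators run; this
            // sketch only shows where the final check and accounting would go.
            if (totalBytes.addAndGet(bytes) > maxBytes) {
                totalBytes.addAndGet(-bytes);
                throw new OutOfMemoryError("tracked native memory limit exceeded");
            }
        }
    }
}
```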

saudet commented Mar 21, 2018

This needs to be synchronized across all threads because if System.gc() gives us some memory back, we need to take that memory in add() before some other thread steals it from us. This happens only when we're actually running out of memory, so if this is happening a lot, the code is probably allocating too often anyway. Native memory allocation usually takes more time than that lock. In any case, if you can figure out how to make this work well without a lock, that'd be great!

jentfoo closed this as completed Mar 21, 2018
jentfoo reopened this Mar 21, 2018

jentfoo commented Mar 21, 2018

Sorry, I closed this by accident; I decided instead to revisit it and comment later.

jentfoo commented Mar 21, 2018

@saudet I have been looking at this a little bit more.

> This happens only when we're actually running out of memory, so if this is happening a lot, the code is probably allocating too often anyway.

After looking at this more and building some more precise tests, I realize I was not quite accurate. I do agree that these may be allocated too often, and maybe there are improvements to be had there, but I think there are potential improvements to be had here as well.

It turns out that the sleep is not so much the problem as the fact that the global lock exists at all. I am running a very highly parallelized workload, which I believe is bringing this out more than usual. Even though each thread holds the synchronized block for less than a millisecond, other threads may end up blocking on the lock for hundreds of milliseconds simply because of the level of lock contention.

I have some prototypes of an optimistic lock-free approach that falls back to locking, and so far they are showing significant improvements. But I was hoping you could help me. This project is still fairly new to me, and I have not branched out into the native code yet. However, conceptually I don't understand the difference between physicalBytes (returned by native code) and totalBytes (stored in DeallocatorReference). As I understand it (assuming a single thread), as long as trimMemory() is invoked, physicalBytes should equal totalBytes. Is that accurate?

saudet commented Mar 22, 2018

It was lock-free with a fallback previously, but it didn't work, because we have to update the linked list atomically anyway. If you can implement a lock-free linked list that we can update atomically, then that would probably work. This looks interesting: https://en.wikipedia.org/wiki/Non-blocking_linked_list
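
For reference, a minimal sketch of the kind of CAS-based insertion such a list needs (insertion only; names are hypothetical, and removal of arbitrary nodes is the hard part that the article covers):

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: lock-free insertion at the head of a singly linked list of
// deallocator records. Removing arbitrary nodes without a lock is the hard part
// (see the non-blocking linked list article) and is not shown here.
public class LockFreeListSketch {
    static final class Node {
        final Runnable deallocator;
        final Node next;
        Node(Runnable deallocator, Node next) {
            this.deallocator = deallocator;
            this.next = next;
        }
    }

    private final AtomicReference<Node> head = new AtomicReference<>();

    public void add(Runnable deallocator) {
        Node oldHead;
        Node newHead;
        do {
            oldHead = head.get();
            newHead = new Node(deallocator, oldHead);
        } while (!head.compareAndSet(oldHead, newHead));  // retry if another thread won the race
    }
}
```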

saudet commented Mar 22, 2018

physicalBytes() is a number reported by the OS, which includes off-heap memory, the Java heap, etc., while totalBytes() is tracked internally by JavaCPP. It isn't aware of anything that we don't explicitly give it, which at the moment includes only base classes like IntPointer, FloatPointer, etc.
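
For example, something like this shows the two counters side by side (a minimal sketch; if I recall correctly, both methods are static on org.bytedeco.javacpp.Pointer):

```java
import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.Pointer;

public class MemoryCountersSketch {
    public static void main(String[] args) {
        FloatPointer p = new FloatPointer(1_000_000);  // ~4 MB that JavaCPP tracks
        // totalBytes(): only what JavaCPP explicitly tracks (IntPointer, FloatPointer, ...)
        System.out.println("totalBytes    = " + Pointer.totalBytes());
        // physicalBytes(): resident memory of the whole process as reported by the OS,
        // including the Java heap and native allocations JavaCPP knows nothing about
        System.out.println("physicalBytes = " + Pointer.physicalBytes());
        p.deallocate();
    }
}
```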

jentfoo commented Mar 22, 2018

> If you can implement a lock-free linked list that we can update atomically, then that would probably work.

I already got this done. I actually see a couple of ways we could do this, and maybe you can help me choose: I have my one use case with kmeans on deeplearning4j, but I don't know what other usage patterns are common. Does this linked list ever get too long? Is removing from it common? We can solve it simply by using a ConcurrentLinkedQueue, or we can use an AtomicReference, depending on whether we prefer performance or code readability.

What I am finding hard to deal with is the physical bytes. I can estimate them, but keeping that value atomically updated without knowing how much memory is actually going to be used is really hard. Do you think some slight fudging of that value would be acceptable? It means some over-provisioning might be possible, but hopefully not too much, and it should not get worse over time.
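
A rough sketch of the direction I have in mind, assuming a ConcurrentLinkedQueue for the deallocator records and an AtomicLong for the estimated byte count (all names here are hypothetical):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: no global lock on the common path. The byte count is an
// estimate kept in an AtomicLong, so it can drift slightly from what the OS reports.
public class ConcurrentTrackingSketch {
    static final class Entry {
        final Runnable deallocator;
        final long bytes;
        Entry(Runnable deallocator, long bytes) {
            this.deallocator = deallocator;
            this.bytes = bytes;
        }
    }

    private final ConcurrentLinkedQueue<Entry> entries = new ConcurrentLinkedQueue<>();
    private final AtomicLong estimatedBytes = new AtomicLong();

    public Entry add(Runnable deallocator, long bytes) {
        Entry e = new Entry(deallocator, bytes);
        entries.add(e);                       // lock-free (CAS-based) insertion
        estimatedBytes.addAndGet(bytes);
        return e;
    }

    public void remove(Entry e) {
        if (entries.remove(e)) {              // O(n) scan; fine only if the queue stays short
            estimatedBytes.addAndGet(-e.bytes);
        }
    }
}
```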

saudet commented Mar 22, 2018

A long time ago, I tried to use containers like that, but the garbage collector was never able to recognize those as phantom reachable. Maybe that has changed. If you could give it a try, that would be great!

Of course, in general, we can't know what other parts of the process are doing with the memory and can never be sure exactly how much memory we are using.
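
For reference, a minimal sketch of the phantom-reference pattern under discussion (simplified and with hypothetical names, not the actual Pointer.java code):

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of phantom-reference cleanup. The referent must become unreachable
// for its reference to be enqueued; anything holding a strong reference to the referent
// prevents that. The reference objects themselves, however, must be kept strongly
// reachable somewhere until they are processed.
class PhantomCleanupSketch {
    static final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    static final Set<TrackedRef> live = ConcurrentHashMap.newKeySet();

    static final class TrackedRef extends PhantomReference<Object> {
        final Runnable deallocator;
        TrackedRef(Object referent, Runnable deallocator) {
            super(referent, queue);
            this.deallocator = deallocator;
        }
    }

    static void register(Object referent, Runnable deallocator) {
        live.add(new TrackedRef(referent, deallocator));
    }

    static void drainQueue() {
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            TrackedRef t = (TrackedRef) ref;
            live.remove(t);
            t.deallocator.run();  // free the native memory of the collected referent
        }
    }
}
```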

gcmcnutt commented Feb 1, 2023

I've also been working on better scaling of this implementation -- running into the same concurrency issues on Scanner.scan.

I've prototyped a second interface where a BytePointer can be passed into the scan call and managed by the caller. This helps quite a bit.

Is anyone here available to review pull requests?
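
Roughly the shape of what I prototyped, with a caller-managed, reusable buffer; the scan signature below is illustrative only, not the actual API:

```java
import org.bytedeco.javacpp.BytePointer;

// Hypothetical sketch: the caller owns and reuses one native buffer instead of
// letting every scan call allocate (and later deallocate) its own BytePointer.
public class ReusableBufferSketch {
    private final BytePointer buffer = new BytePointer(1 << 20);  // 1 MB scratch buffer

    // Hypothetical interface standing in for the real scan call.
    public interface Scanner {
        void scan(BytePointer data, long length);
    }

    public void scanAll(Iterable<byte[]> inputs, Scanner scanner) {
        for (byte[] input : inputs) {
            // Reuse the same native memory for every call (assumes each input fits).
            buffer.position(0).put(input, 0, input.length);
            scanner.scan(buffer, input.length);
        }
    }

    public void close() {
        buffer.deallocate();  // the caller decides when the native memory is released
    }
}
```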

saudet commented Feb 1, 2023

For performance, we should disable all that anyway, and deallocate everything manually, perhaps with the help of a class such as PointerScope: http://bytedeco.org/news/2018/07/17/bytedeco-as-distribution/
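
For example, a minimal sketch with PointerScope and try-with-resources:

```java
import org.bytedeco.javacpp.IntPointer;
import org.bytedeco.javacpp.PointerScope;

public class ScopeSketch {
    public static void main(String[] args) {
        // Pointers created while a PointerScope is open are attached to it and
        // deallocated when the scope closes, without waiting for the garbage collector.
        try (PointerScope scope = new PointerScope()) {
            IntPointer data = new IntPointer(1024);  // registered with the innermost open scope
            data.put(0, 42);
            // ... use data ...
        }  // data's native memory is released here
    }
}
```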
