Pointer.deallocator is a performance killer #231

Closed · jentfoo opened this issue Mar 20, 2018 · 9 comments

jentfoo commented Mar 20, 2018

In my profiling I am seeing a ton of blocking on this lock:
https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/Pointer.java#L546

I don't see why the synchronization here needs to happen across all instances. Should this not synchronize on `this` instead?

I am also unsure about holding the lock while doing a Thread.sleep. I assume this is to try and let the gc() have effect before trimMemory() is invoked (which still may or may not happen). Could the gc / sleep / trim happen outside of the lock?

If none of the above are possible (I am rather unfamiliar with this code), then maybe we could do an optimistic check of volatile state before we lock?

Again, just shots in the dark, but I am seeing this in my profiles a fair bit.
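
To make that last idea concrete, here is a minimal sketch of the kind of optimistic fast path I mean (hypothetical names, not the actual Pointer.java code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: skip the global lock entirely while the tracked total stays
// under the limit, and only serialize on the slow path when we are actually near it.
public class OptimisticAddSketch {
    private static final AtomicLong totalBytes = new AtomicLong();
    private static final long maxBytes = 1L << 30;  // assumed limit for illustration
    private static final Object lock = new Object();

    static void add(long bytes) {
        if (totalBytes.addAndGet(bytes) <= maxBytes) {
            return;  // fast path: no lock, no gc(), no sleep
        }
        totalBytes.addAndGet(-bytes);  // roll back and take the slow path
        synchronized (lock) {
            if (totalBytes.get() + bytes > maxBytes) {
                System.gc();  // give pending deallocators a chance to run
            }
            // In the real code the total would shrink as deallocators run; this
            // sketch only shows where the final check and accounting would go.
            if (totalBytes.addAndGet(bytes) > maxBytes) {
                totalBytes.addAndGet(-bytes);
                throw new OutOfMemoryError("tracked native memory limit exceeded");
            }
        }
    }
}
```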

saudet commented Mar 21, 2018

This needs to be synchronized across all threads because if System.gc() gives us some memory back, we need to take that memory in add() before some other thread steals it from us. This happens only when we're actually running out of memory, so if this is happening a lot, the code is probably allocating too often anyway. Native memory allocation usually takes more time than that lock. In any case, if you can figure out how to make this work well without a lock, that'd be great!

jentfoo closed this as completed Mar 21, 2018
jentfoo reopened this Mar 21, 2018

jentfoo commented Mar 21, 2018

Sorry, I closed this by accident; I decided instead to revisit it and comment later.

jentfoo commented Mar 21, 2018

@saudet I have been looking at this a little bit more.

> This happens only when we're actually running out of memory, so if this is happening a lot, the code is probably allocating too often anyway.

After looking at this more and building some more precise tests, I realize I was not quite accurate. I do agree that these may be allocated too often, and maybe there are improvements to be had there, but I think there are potential improvements to be had here as well.

It turns out that the sleep is not so much the problem as the fact that the global lock exists at all. I am running a very highly parallelized workload, which I believe is bringing this out more than usual. Even though each thread holds the synchronized block for less than a millisecond, other threads may end up blocking on the lock for hundreds of milliseconds simply because of the level of lock contention.

I have some prototypes of an optimistic lock-free approach that falls back to locking, and so far they are showing significant improvements. But I was hoping you could help me. This project is still fairly new to me, and I have not branched out into the native code yet. However, conceptually I don't understand the difference between physicalBytes (returned by native code) and totalBytes (stored in DeallocatorReference). As I understand it (assuming a single thread), as long as trimMemory() is invoked, physicalBytes should equal totalBytes. Is that accurate?

saudet commented Mar 22, 2018

It was lock-free with a fallback previously, but it didn't work, because we have to update the linked list atomically anyway. If you can implement a lock-free linked list that we can update atomically, then that would probably work. This looks interesting: https://en.wikipedia.org/wiki/Non-blocking_linked_list
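
For reference, a minimal sketch of the kind of CAS-based insertion such a list needs (insertion only; names are hypothetical, and removal of arbitrary nodes is the hard part that the article covers):

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: lock-free insertion at the head of a singly linked list of
// deallocator records. Removing arbitrary nodes without a lock is the hard part
// (see the non-blocking linked list article) and is not shown here.
public class LockFreeListSketch {
    static final class Node {
        final Runnable deallocator;
        final Node next;
        Node(Runnable deallocator, Node next) {
            this.deallocator = deallocator;
            this.next = next;
        }
    }

    private final AtomicReference<Node> head = new AtomicReference<>();

    public void add(Runnable deallocator) {
        Node oldHead;
        Node newHead;
        do {
            oldHead = head.get();
            newHead = new Node(deallocator, oldHead);
        } while (!head.compareAndSet(oldHead, newHead));  // retry if another thread won the race
    }
}
```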

saudet commented Mar 22, 2018

physicalBytes() is a number reported by the OS, which includes off-heap memory, the Java heap, etc., while totalBytes() is tracked internally by JavaCPP. It isn't aware of anything that we don't explicitly give it, which at the moment includes only base classes like IntPointer, FloatPointer, etc.
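
For example, something like this shows the two counters side by side (a minimal sketch; if I recall correctly, both methods are static on org.bytedeco.javacpp.Pointer):

```java
import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.Pointer;

public class MemoryCountersSketch {
    public static void main(String[] args) {
        FloatPointer p = new FloatPointer(1_000_000);  // ~4 MB that JavaCPP tracks
        // totalBytes(): only what JavaCPP explicitly tracks (IntPointer, FloatPointer, ...)
        System.out.println("totalBytes    = " + Pointer.totalBytes());
        // physicalBytes(): resident memory of the whole process as reported by the OS,
        // including the Java heap and native allocations JavaCPP knows nothing about
        System.out.println("physicalBytes = " + Pointer.physicalBytes());
        p.deallocate();
    }
}
```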

jentfoo commented Mar 22, 2018

> If you can implement a lock-free linked list that we can update atomically, then that would probably work.

I already got this done. I actually see a couple of ways we could do this, and maybe you can help me choose: I have my one use case with kmeans on deeplearning4j, but I don't know what other usage patterns are common. Does this linked list ever get too long? Is removing from it common? We can solve it simply by using a ConcurrentLinkedQueue, or we can use an AtomicReference, depending on whether we prefer performance or code readability.

What I am finding hard to deal with is the physical bytes. I can estimate them, but keeping that value atomically updated without knowing how much memory is actually going to be used is really hard. Do you think some slight fudging of that value would be acceptable? It means some over-provisioning might be possible, but hopefully not too much, and it should not get worse over time.
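
A rough sketch of the direction I have in mind, assuming a ConcurrentLinkedQueue for the deallocator records and an AtomicLong for the estimated byte count (all names here are hypothetical):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: no global lock on the common path. The byte count is an
// estimate kept in an AtomicLong, so it can drift slightly from what the OS reports.
public class ConcurrentTrackingSketch {
    static final class Entry {
        final Runnable deallocator;
        final long bytes;
        Entry(Runnable deallocator, long bytes) {
            this.deallocator = deallocator;
            this.bytes = bytes;
        }
    }

    private final ConcurrentLinkedQueue<Entry> entries = new ConcurrentLinkedQueue<>();
    private final AtomicLong estimatedBytes = new AtomicLong();

    public Entry add(Runnable deallocator, long bytes) {
        Entry e = new Entry(deallocator, bytes);
        entries.add(e);                       // lock-free (CAS-based) insertion
        estimatedBytes.addAndGet(bytes);
        return e;
    }

    public void remove(Entry e) {
        if (entries.remove(e)) {              // O(n) scan; fine only if the queue stays short
            estimatedBytes.addAndGet(-e.bytes);
        }
    }
}
```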

saudet commented Mar 22, 2018

A long time ago, I tried to use containers like that, but the garbage collector was never able to recognize those as phantom reachable. Maybe that has changed. If you could give it a try, that would be great!

Of course, in general, we can't know what other parts of the process are doing with the memory and can never be sure exactly how much memory we are using.
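
For reference, a minimal sketch of the phantom-reference pattern under discussion (simplified and with hypothetical names, not the actual Pointer.java code):

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of phantom-reference cleanup. The referent must become unreachable
// for its reference to be enqueued; anything holding a strong reference to the referent
// prevents that. The reference objects themselves, however, must be kept strongly
// reachable somewhere until they are processed.
class PhantomCleanupSketch {
    static final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    static final Set<TrackedRef> live = ConcurrentHashMap.newKeySet();

    static final class TrackedRef extends PhantomReference<Object> {
        final Runnable deallocator;
        TrackedRef(Object referent, Runnable deallocator) {
            super(referent, queue);
            this.deallocator = deallocator;
        }
    }

    static void register(Object referent, Runnable deallocator) {
        live.add(new TrackedRef(referent, deallocator));
    }

    static void drainQueue() {
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            TrackedRef t = (TrackedRef) ref;
            live.remove(t);
            t.deallocator.run();  // free the native memory of the collected referent
        }
    }
}
```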

gcmcnutt commented Feb 1, 2023

I've also been working on better scaling of this implementation -- running into the same concurrency issues on Scanner.scan.

I've prototyped a second interface where a BytePointer can be passed into the scan call and managed by the caller. This helps quite a bit.

Is anyone here available to review pull requests?
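
Roughly the shape of what I prototyped, with a caller-managed, reusable buffer; the scan signature below is illustrative only, not the actual API:

```java
import org.bytedeco.javacpp.BytePointer;

// Hypothetical sketch: the caller owns and reuses one native buffer instead of
// letting every scan call allocate (and later deallocate) its own BytePointer.
public class ReusableBufferSketch {
    private final BytePointer buffer = new BytePointer(1 << 20);  // 1 MB scratch buffer

    // Hypothetical interface standing in for the real scan call.
    public interface Scanner {
        void scan(BytePointer data, long length);
    }

    public void scanAll(Iterable<byte[]> inputs, Scanner scanner) {
        for (byte[] input : inputs) {
            // Reuse the same native memory for every call (assumes each input fits).
            buffer.position(0).put(input, 0, input.length);
            scanner.scan(buffer, input.length);
        }
    }

    public void close() {
        buffer.deallocate();  // the caller decides when the native memory is released
    }
}
```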

saudet commented Feb 1, 2023

For performance, we should disable all that anyway, and deallocate everything manually, perhaps with the help of a class such as PointerScope: http://bytedeco.org/news/2018/07/17/bytedeco-as-distribution/
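
For example, a minimal sketch with PointerScope and try-with-resources:

```java
import org.bytedeco.javacpp.IntPointer;
import org.bytedeco.javacpp.PointerScope;

public class ScopeSketch {
    public static void main(String[] args) {
        // Pointers created while a PointerScope is open are attached to it and
        // deallocated when the scope closes, without waiting for the garbage collector.
        try (PointerScope scope = new PointerScope()) {
            IntPointer data = new IntPointer(1024);  // registered with the innermost open scope
            data.put(0, 42);
            // ... use data ...
        }  // data's native memory is released here
    }
}
```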
