
Significant performance improvements to Pointer.deallocator #232

Closed · wants to merge 6 commits

Conversation

@jentfoo (Author) commented Mar 23, 2018

This reduces thread contention on the global lock for Pointer allocations in most cases.
The global lock can be a performance killer in highly parallel operations, even when memory pressure is not high.
In this change we do optimistic provisioning using atomic operations, and only fall back to locking when there is memory pressure.

For totalBytes and the internal linked list this was an easy change: switching to an AtomicLong and a ConcurrentLinkedQueue lets us perform those common operations without locking.
The handling of the physical bytes, however, is a bit different in this implementation. We use an AtomicLong to approximate the size based on the internally tracked bytes. On de-allocations, however, this counter is NOT decremented (while totalBytes is decremented). This means one of two other mechanisms must sync physicalBytes back to reality: the first (hopefully the most common) is a sync that occurs every 1000 allocations; the second is that if we fail to allocate, we do a trim, which syncs this state as well. A simplified sketch of this scheme follows.
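
The sketch below is illustrative only, not the actual patch; the class, method names, and limit are made up to show the optimistic fast path, the locked slow path, and the periodic sync:

import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch of the optimistic accounting described above.
class OptimisticAccounting {
    static final long MAX_BYTES = 1L << 32;              // illustrative limit
    static final AtomicLong totalBytes = new AtomicLong();
    static final AtomicLong allocationCount = new AtomicLong();

    static void onAllocate(long bytes) {
        // Optimistically reserve the bytes with a lock-free atomic add.
        if (totalBytes.addAndGet(bytes) > MAX_BYTES) {
            totalBytes.addAndGet(-bytes);                // undo the reservation
            synchronized (OptimisticAccounting.class) {  // slow path under memory pressure
                // trim, sync physicalBytes, retry, or throw OutOfMemoryError here
            }
        }
        // Periodically sync the approximate physical byte count back to reality.
        if (allocationCount.incrementAndGet() % 1000 == 0) {
            syncPhysicalBytes();
        }
    }

    static void syncPhysicalBytes() { /* query the real value natively */ }
}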

@saudet, give me your thoughts on this as a way to resolve #231.

My biggest concern is how I am approximating the physical bytes. If there is a better way to estimate the size, or if you think the sync interval needs to be different, let me know; for my use case the sync interval could be every million allocations or two. That said, these changes DRAMATICALLY speed up deeplearning4j's kmeans implementation for me.

@saudet (Member) commented Mar 23, 2018

PhantomReference actually works with collections now? Good to know! Though we don't get the node of the linked list back from the ReferenceQueue, so remove operations are no longer O(1). Maybe we should use something like ConcurrentHashMap?

@jentfoo (Author) commented Mar 23, 2018

Though we don't get the node of the linked list back from the ReferenceQueue, so remove operations are no longer O(1)

True; this is the performance-vs-simplicity trade-off I mentioned in the issue. To get O(1) removal back, we would basically need to implement the linked list ourselves using AtomicReferences. A ConcurrentHashMap would provide faster removals if the collection is large; if it's small, the CLQ may be faster. The only other reason I see that a CLQ may be better than a CHM is that the CHM has more memory overhead (though we could use a CHM with a concurrency level of only 2 to help with that). A minimal sketch of the CHM approach is below.
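
The following is illustrative only, not part of this PR: keying the map on the reference itself makes removal O(1) when the ReferenceQueue hands the reference back, and a low concurrency level keeps the memory overhead down.

import java.lang.ref.PhantomReference;
import java.util.concurrent.ConcurrentHashMap;

class ReferenceRegistry {
    // Initial capacity 16, default load factor, concurrency level 2.
    final ConcurrentHashMap<PhantomReference<?>, Boolean> live =
            new ConcurrentHashMap<>(16, 0.75f, 2);

    void add(PhantomReference<?> ref)    { live.put(ref, Boolean.TRUE); }
    void remove(PhantomReference<?> ref) { live.remove(ref); }  // O(1), unlike scanning a CLQ
}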

Since I only have one benchmark case, I would like to defer to you on which one we should go with.

PhantomReference actually works with collections now?

I forgot you mentioned in the issue that this was not working. I still need to verify this 100%. I have only run the unit tests and confirmed that the kmeans use case that brought this up has been running OK for a while now. I will get back to you later today or tomorrow once I have looked into this.

Edit: even if PhantomReference is not working right, I am pretty sure we can swap that component out without going back to locking. At a minimum, I figured it was worth getting questions and thoughts sooner rather than later.

@saudet (Member) commented Mar 23, 2018

Ok, let me think about this.

About physicalBytes(): why do you need an approximation? Is it just to avoid the lock, or is the call itself too slow?

@jentfoo (Author) commented Mar 23, 2018

About physicalBytes(): why do you need an approximation? Is it just to avoid the lock, or is the call itself too slow?

It looked like the native call was also synchronized. I can give it a try and see how it does, but the goal was just to avoid locking entirely if possible.

Sorry about the quick post / delete. I at least confirmed that PhantomReference works with the DeallocatorThread running (i.e., they are added to the queue for cleanup). Right after I posted that, I realized you can disable that cleanup thread, so I am still trying to verify it works with that thread disabled.

@jentfoo (Author) commented Mar 23, 2018

Actually, I am not sure that PhantomReference works in master with that thread disabled. I am pretty sure we need the ReferenceQueue for this to work. We could avoid the thread and instead process the queue at allocation time or at some other point, though. Otherwise, as best I can tell, I am replicating master's PhantomReference behavior.

Let me know your thoughts, thanks!

@saudet (Member) commented Mar 23, 2018

I've made the call to physicalBytes() thread safe and removed the lock in commit 273d4bc, so we can remove these approximations now.

We do need the ReferenceQueue, of course. We can process it in another thread with Pointer.deallocateReferences(). Thanks for testing this out!

(Commit) … our limit: This means that as we get closer to the limit, our internal representation should be more accurate.
@jentfoo (Author) commented Mar 23, 2018

@saudet I love it! I just pushed changes that no longer use the AtomicLong for the physical bytes; instead we just call the native function directly. Since we reference the native value directly, I also changed the trim function to take the number of bytes we need and to keep looping until we can allocate that much. The exception-throwing code needs to remain the same, however, since there is still the potential for another thread to steal the memory we freed before we get a chance to use it.

I feel a lot better about this without approximating the physical bytes; that was the most concerning aspect to me personally. Let me know your thoughts (and also whether you want me to substitute in the ConcurrentHashMap).

@saudet (Member) commented Mar 23, 2018

Good! Thanks. Before I start reviewing this in more detail, there is one thing that bothers me. Even if we were to implement a linked list manually using AtomicReference, I don't see a way to get O(1) removal. Even the Iterator of ConcurrentLinkedQueue doesn't actually remove the Node; it just hopes that the user eventually traverses the list at some point in the future to clean it up, which would never happen if we implemented add() and remove() with AtomicReference. Am I missing something obvious, or is this a harder problem than it looks?

BTW, the idea with "org.bytedeco.javacpp.nopointergc" is to disable all that, to remove the burden from the GC when the user wants to manage everything manually, and for that we need to not create a ReferenceQueue.

@jentfoo (Author) commented Mar 23, 2018

the idea with "org.bytedeco.javacpp.nopointergc" is to disable all that, to remove the burden from the GC when the user wants to manage everything manually

OK, so I assume you would like me to revert my last commit? There is some oddness in using these PhantomReferences without a queue to clean them up, but if the idea is to let them be GC'ed and force the user to manually clean up the native side, I can understand that. However, it means that when this is enabled (deallocator thread disabled), deallocateReferences() won't work. I can mention that in the javadocs if you want, unless you have an idea for making deallocateReferences() work when the deallocator thread is not there (i.e., does it make sense to have a reference queue if there is no deallocator thread?).

Even if we were to implement a linked list manually using AtomicReference, I don't see a way to get O(1) removal.

O(1) might be hard; it may be possible assuming no concurrent conflicts, but I would expect the typical cost to be higher than that. Having gone down this road before, I would suggest we not do it here. To give you an idea of what something like this might look like, here is a similar implementation I did in threadly. It uses an AtomicReference for the head, then a volatile reference for each link of the chain. However, it does require synchronization when removing from the list.

Head of list: https://github.com/threadly/threadly/blob/master/src/main/java/org/threadly/concurrent/PriorityScheduler.java#L386

Chain link: https://github.com/threadly/threadly/blob/master/src/main/java/org/threadly/concurrent/PriorityScheduler.java#L867

Adding to and removing from the list (you will notice a lock on removal):
https://github.com/threadly/threadly/blob/master/src/main/java/org/threadly/concurrent/PriorityScheduler.java#L662-L718

Something like that could work here too; I just question whether the complexity is worth it at this point in the code. For my use case, a CHM or CLQ works fine. That said, I can obviously build things like this (and have), and will if you prefer that type of implementation. The reason it exists in threadly, though, is countless hours of benchmarking proving it truly is the best solution there. My concern here is adding the complexity without a proven benefit. A heavily simplified sketch of that kind of structure follows.
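
The sketch below is illustrative only, not the threadly code: insertion CASes a new node onto the head lock-free, while removal is serialized by a lock.

import java.util.concurrent.atomic.AtomicReference;

class Chain<T> {
    static final class Node<T> {
        final T value;
        volatile Node<T> next;
        Node(T value) { this.value = value; }
    }

    final AtomicReference<Node<T>> head = new AtomicReference<>();

    void add(T value) {                          // lock-free insert at the head
        Node<T> n = new Node<>(value);
        Node<T> h;
        do {
            h = head.get();
            n.next = h;
        } while (!head.compareAndSet(h, n));
    }

    synchronized boolean remove(T value) {       // removals are serialized by the lock
        restart:
        while (true) {
            Node<T> prev = null;
            for (Node<T> n = head.get(); n != null; n = n.next) {
                if (n.value == value) {
                    if (prev == null) {
                        // A concurrent add() may have changed the head; retry on failure.
                        if (head.compareAndSet(n, n.next)) return true;
                        continue restart;
                    }
                    prev.next = n.next;          // unlink; safe since removes hold the lock
                    return true;
                }
                prev = n;
            }
            return false;
        }
    }
}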

Let me know your thoughts, thanks!

@saudet (Member) commented Mar 23, 2018

PhantomReference is a lightweight class. If they're not registered with a ReferenceQueue, they're not doing any harm. They're not bringing anything either, but the subclasses still implement the Deallocator interface for users to call manually. Yes, deallocateReferences() won't work, but I haven't come across a case where that would have been useful yet. Just making the docs clearer for now would be best, I think, yes.

I also want to keep things simple, but I don't want to replace one problem with another one either... Eliminating locks isn't the goal in itself here though, right? If we could make most calls from different threads lock different objects most of the time, that would also be acceptable, correct?

@jentfoo (Author) commented Mar 23, 2018

Just making the docs clearer for now would be best, I think, yes.

Sounds good, I will revert that commit and update the docs.

Eliminating locks isn't the goal in itself here though, right? If we could make most calls from different threads lock different objects most of the time, that would also be acceptable, correct?

I guess my goal was just to provide the maximum amount of improvement I could. I noticed a significant issue in my use of kmeans, so anything that improves that is good enough for me.

What did you have in mind? Or were you referring to the synchronized block when removing from my linked list?

In any case, I don't have any philosophical problem with locks; they are an extremely useful tool, and in some cases lock-free designs can be worse (if the CAS operations fail a lot, for example). It just seemed to me that a lock-free design was more natural in this part of the code than a granular locking solution. I considered reader/writer lock solutions as well, but in the end I thought this was better.

@jentfoo (Author) commented Mar 23, 2018

P.S. There may be a problem here. After long runs I am running out of memory, and I am not exactly sure why yet. It may be a problem in my code; it's hard to evaluate whether this would also happen on master, because the speed differences are so dramatic that I would have to let it run for a week on master. If you notice anything, let me know.

@saudet (Member) commented Mar 24, 2018

What about using, for example, 16 linked lists and adding the reference to the list based on the least significant bits of the thread ID? Would that be efficient based on your experience?
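
To make the striping idea concrete, here is a minimal sketch (illustrative names; a real implementation would also need removal to find the right stripe, e.g. by storing the stripe index in the reference):

import java.util.LinkedList;

class StripedLists<T> {
    @SuppressWarnings("unchecked")
    final LinkedList<T>[] stripes = new LinkedList[16];
    {
        for (int i = 0; i < stripes.length; i++) stripes[i] = new LinkedList<>();
    }

    void add(T value) {
        // Threads whose IDs differ in the low 4 bits lock different lists,
        // so most of the time they do not contend with each other.
        LinkedList<T> list = stripes[(int) (Thread.currentThread().getId() & 15)];
        synchronized (list) {
            list.add(value);
        }
    }
}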

@jentfoo (Author) commented Mar 24, 2018

Something like that would probably work fine too. Personally, I would be more likely to use a ConcurrentHashMap, though. Your solution may very well perform better, but without knowing more about the usage patterns, and without more benchmarks, I tend to go with the simpler solution. That said, I do think a striped-lock-style design could work just fine as well. :)

@saudet (Member) commented Mar 24, 2018 via email

(Commit) Also improved javadocs to describe the condition under which `deallocateReferences()` becomes a no-op.
@jentfoo (Author) commented Mar 26, 2018

@saudet sorry for the delay. I just pushed the change to use a CHM. I actually set a concurrency level of only 1, which seemed fine in my benchmarks and attempts to minimize memory usage. This seemed to perform a little better than the original CLQ.

I have also started setting the max physical bytes to zero, and instead just use max bytes to limit things. That has improved speed a lot, so there is this escape hatch for those willing to tune a reasonable limit using only the tracked in-heap byte count. However, if we could figure out a scaling constant between max bytes and physical bytes, we would get the best of both worlds. Not needed, just something to think about.

@saudet (Member) commented Mar 28, 2018

Sounds good! Thanks. We probably want to make these parameters configurable via system properties, but let's keep that for later...

To avoid OutOfMemoryError, try increasing the "org.bytedeco.javacpp.maxretries" system property to some large value like 1000. System.gc() might take a while to have an effect under heavy load. Let me know if you're having any issues even with a large number of retries, though. Remember that other threads might be stealing memory away, and any given thread might never get the chance to allocate memory, so make sure the logic accounts for fairness.
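
For example, on the command line with -Dorg.bytedeco.javacpp.maxretries=1000, or programmatically (assuming the property is read when the Pointer class initializes, so set it early):

// Must run before the first Pointer allocation for the property to take effect.
System.setProperty("org.bytedeco.javacpp.maxretries", "1000");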

@jentfoo (Author) commented Mar 28, 2018

I actually think I got my OutOfMemoryErrors sorted out. It was a problem on my end, to be sure; I don't think anything is wrong with this PR.

However I do have some ideas for more improvements. I am not sure if we should include them in this PR, or another, but give me your thoughts on this:
• I think the Thread.sleep should be configurable. In theory, someone might want to configure this to a very high value (say 10 seconds with 100 retries) and use this mechanism as a form of throttling; i.e., this code would then block the application while it is memory-limited, rather than making the developer limit it by other means (thread pool size, semaphore, etc.). Of course this is not a significant burden on people; I just don't see any reason the code can't serve both usage types equally.

• I am still not 100% happy with the deallocator thread design. I feel you should be able to do de-allocations with the reference queue without requiring an entire thread for it. Personally, I would like to add a static function which shuts down the deallocator thread and returns the ReferenceQueue so it can be handled manually. A user could also choose to ignore the returned queue and manually invoke deallocateReferences(), so if you prefer to keep the queue hidden we could force people to rely on that. If there is a safety concern, we could add a lazy check of the queue before new allocations (similar to the commit I removed). This seems slightly better to me as a static function than as a system property, because you are basically taking over this functionality within your own code anyway. A rough sketch of what I mean follows.
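
Purely a proposal sketch; neither of these methods exists in JavaCPP, and all names here are hypothetical:

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

class ManualDeallocation {
    static final ReferenceQueue<Object> queue = new ReferenceQueue<>();

    // Hypothetical: stop the deallocator thread and expose its queue to the caller.
    static ReferenceQueue<Object> takeOverDeallocation() {
        // stop and join the deallocator thread here (hypothetical)
        return queue;
    }

    // The caller then drains the queue whenever convenient.
    static void drain() {
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            // The reference subclasses implement the Deallocator interface
            // (as noted above), so the native memory would be freed here.
        }
    }
}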

@jentfoo (Author) commented Mar 28, 2018

@saudet Sorry to add more to this, but I found out some interesting additional details.

I ended up playing with the currently released 1.3.3 version, and it's actually a little faster IF you set max physical bytes to zero. I was seeing gains in this implementation when I disabled checking physical bytes, but I never went back and re-tested master. It seems the master version is actually a little faster if you are not making that native call, which leads me to believe that the real burden here is the native call inside the synchronized block.

I still suggest that we go forward with these changes, just because physical byte checking is enabled by default. But it's worth being aware of at least.

Also, I played with the Thread.sleep / Thread.yield a little. I was not able to realize any significant benefit. I have implementations I think are better, but I just don't think I am hitting that case often enough to really expose any deficiencies. I am unsure whether you would like me to PR potential options (maybe in another PR, or this one?), or whether I should just remove the TODO note.

Thanks as always!

@saudet (Member) commented Mar 29, 2018

DeallocatorThread was introduced to reduce lock contention on ReferenceQueue (issue #103). Of course we can add options to let users do whatever they want, but let's get this running well enough with default values first!

So, are you saying that the current simple linked list with locking is fast enough now? Not having to rely on ConcurrentHashMap would sure be a good thing IMO, but let's make sure we're doing the right thing here.

@saudet (Member) commented Mar 29, 2018

BTW, if the goal is performance, we shouldn't be fiddling with the garbage collector anyway. For that, we should be working on "scopes". Basically the API would look like this:

while (...) {
    try (Pointer.Scope scope = new Pointer.Scope()) {
        doProcessingHere();
    }
}

What would happen behind the curtain is that new Pointer.Scope() would first put itself in a list in thread-local storage; then, for each call to new Pointer() in doProcessingHere(), when it finds a scope in TLS, it would put itself on the stack of the scope instead of registering with the ReferenceQueue. Then, on Pointer.Scope.close(), it would pop all the Pointers on the stack one by one and call close()/deallocate() on them, before removing itself from the list in the TLS. That's pretty much the closest we can get to C++'s RAII in Java: see "Scoped Types for Real-time Java" and what Panama is also proposing (but not for C++ types, unfortunately): https://www.slideshare.net/ChristophEngelbert/project-panama-beyond-the-jvm-wall
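
A rough sketch of that mechanism, with made-up names (this is not an actual JavaCPP implementation):

import java.util.ArrayDeque;
import java.util.Deque;

class Scope implements AutoCloseable {
    // One stack of scopes per thread; the innermost scope is on top.
    static final ThreadLocal<Deque<Scope>> SCOPES =
            ThreadLocal.withInitial(ArrayDeque::new);

    final Deque<AutoCloseable> attached = new ArrayDeque<>();

    Scope() { SCOPES.get().push(this); }          // register in thread-local storage

    // Called from "new Pointer()": attach to the innermost scope, if any,
    // instead of registering with the ReferenceQueue.
    static void attach(AutoCloseable pointer) {
        Deque<Scope> scopes = SCOPES.get();
        if (!scopes.isEmpty()) scopes.peek().attached.push(pointer);
    }

    @Override public void close() {
        SCOPES.get().remove(this);                // remove self from the TLS list
        while (!attached.isEmpty()) {
            try { attached.pop().close(); }       // deallocate, one by one
            catch (Exception e) { /* log and keep closing the rest */ }
        }
    }
}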

/cc @cypof

@jentfoo (Author) commented Mar 30, 2018

So, are you saying that the current simple linked list with locking is fast enough now?

Not exactly. If you disable max physical bytes, it does not make the native call inside the synchronized block, and it seems that this alone gets us most of the benefits. But I was still recommending we go with this because physical byte limits are enabled by default, and when they are enabled there are still significant advantages here.

Fwiw, the root of this issue is the deeplearning4j implementation of kmeans. I decided to re-implement that today and got another 200x improvement!!

I still suggest going with these changes because they seem like an improvement to me, but for me personally they are not as critical as they used to be.

@saudet (Member) commented Mar 30, 2018

Yes, not relying on the garbage collector is the way to gain performance. :)

In any case, I think your code can produce resource starvation. Could you convince me otherwise? The following scenario appears plausible to me:

  1. Thread 1 enters deallocator() and fails on add(), starts calling trim(), but is suspended by the system
  2. The system frees enough memory for thread 1
  3. Thread 2 enters deallocator() and add() succeeds, taking away the just freed memory
  4. Thread 1 resumes execution, keeps looping over trim() because of lack of memory, and the system suspends it again
  5. Goto 2

How is this prevented from happening? This is exactly why I had to put a lock there in the first place.

@jentfoo (Author) commented Mar 30, 2018

This is possible, but since the synchronized block is not fair anyway, I don't see why it matters which thread wins. If you are over-allocating, you are over-allocating, and OOM conditions seem inevitable to me. Unless we make the sleep time configurable, or otherwise turn this into a queue that blocks until memory is available, I don't think you have really solved that concern anyway.

@saudet (Member) commented Mar 30, 2018

You're right, the order isn't specified, but it did solve the problem :) I suppose I'd go with ReentrantLock and its "fair" flag, though. Any way to emulate that without locks? The idea here is to not throw an OutOfMemoryError while there is actually still memory, and fairness does solve this issue...
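
For reference, a fair ReentrantLock grants the lock to the longest-waiting thread, which directly rules out the starvation scenario above; a small sketch:

import java.util.concurrent.locks.ReentrantLock;

class FairAllocation {
    // true = fair ordering: the longest-waiting thread acquires the lock next,
    // so no thread can be starved indefinitely.
    static final ReentrantLock lock = new ReentrantLock(true);

    static void allocate(long bytes) {
        lock.lock();
        try {
            // attempt the allocation here, trimming on failure, etc.
        } finally {
            lock.unlock();
        }
    }
}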

@jentfoo (Author) commented Mar 30, 2018

If it were me, I would do the following:

  • Make the sleep time between gc attempts configurable. Document that it should be set to between 1/2 and 1/10 of the time a Pointer is likely to be retained. Defaulting to 100 ms is fine.
  • Get rid of the retry count; instead, allow a max wait time to be configured. Default this to at least 1 second IMO, maybe even higher.

This would accomplish what you want, which is to not error out if memory can still become available. It does make this class less fair; instead, it is simply willing to block longer, if needed, until memory is available. A minimal sketch of the idea is below.
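
The property names and defaults here are illustrative, not existing JavaCPP options; tryAllocate() stands in for the real limit check:

class DeadlineWait {
    static final long GC_SLEEP_MS = Long.getLong("org.bytedeco.javacpp.gcsleep", 100);  // hypothetical
    static final long MAX_WAIT_MS = Long.getLong("org.bytedeco.javacpp.maxwait", 1000); // hypothetical

    static void waitForMemory(long bytes) throws InterruptedException {
        long deadline = System.currentTimeMillis() + MAX_WAIT_MS;
        while (!tryAllocate(bytes)) {
            if (System.currentTimeMillis() >= deadline) {
                throw new OutOfMemoryError("still over the limit after " + MAX_WAIT_MS + " ms");
            }
            System.gc();                 // ask for a collection...
            Thread.sleep(GC_SLEEP_MS);   // ...and give it time to take effect
        }
    }

    static boolean tryAllocate(long bytes) { return true; }  // placeholder
}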

But this implementation will always be unfair. I personally don't think that is a significant issue that anyone would notice. But I tend to favor unfair designs based on my experience.

@saudet (Member) commented Mar 30, 2018

About System.gc(): sometimes the JVM ignores it (there is even an option, -XX:+DisableExplicitGC, to ignore it all the time), so we do need to call it multiple times under normal circumstances anyway.

Actually, PointerTest.Deallocator is supposed to be testing against resource starvation:
https://github.com/bytedeco/javacpp/blob/master/src/test/java/org/bytedeco/javacpp/PointerTest.java#L688
So this passes with your patch, it seems? Interesting...

@jentfoo (Author) commented Mar 30, 2018

Yeah, I noticed that test during development, and yes it passes fine.

@saudet (Member) commented Mar 31, 2018

I'm trying to test things out, but on your branch PointerTest.Deallocator takes ~60 s, while it only takes ~10 s on master. Also I'm not observing any performance improvements using either this pull request or -Dorg.bytedeco.javacpp.maxphysicalbytes=0 on master with Deeplearning4j and something like this:

KMeansClustering kMeansClustering = KMeansClustering.setup(500, 500, "euclidean");
List<Point> points = Point.toPoints(Nd4j.randn(500, 500));
ClusterSet clusterSet = kMeansClustering.applyTo(points);

Could you post some code that demonstrates the issue?

saudet added a commit that referenced this pull request on Apr 1, 2018: …r.deallocator()` to reduce contention (pull #232)
@saudet (Member) commented Apr 1, 2018

Your observation about physicalBytes() made me realize something. The first call to it is the bottleneck for the great majority of calls to Pointer.deallocator() (that is, when there is enough memory for the allocation), but we can easily move it outside the lock, so I've done just that in the commit above. That should give good results for the case you were testing? Let me know! Thanks
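
In other words, a simplified before/after of the idea (not the literal commit; physicalBytes() is stubbed out here):

class Before {
    static synchronized long deallocator() {     // native call made inside the lock
        long bytes = physicalBytes();            // slow call serializes every thread
        // ... check limits, register the deallocator ...
        return bytes;
    }
    static long physicalBytes() { return 0; }    // stand-in for the native call
}

class After {
    static long deallocator() {
        long bytes = physicalBytes();            // slow call now runs concurrently
        synchronized (After.class) {             // lock only guards the bookkeeping
            // ... check limits against 'bytes', register the deallocator ...
        }
        return bytes;
    }
    static long physicalBytes() { return 0; }    // stand-in for the native call
}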

@jentfoo (Author) commented Apr 2, 2018

@saudet Unfortunately, my ML code is proprietary right now. I tried to create an example benchmark that you could use, but as I was building it I started to see what you did as well. Even in my company's application, where this branch does show a significant benefit over the current release version, moving the native call before the synchronized block performs equally well for me. I suggest we just go with that for now (which is why I am closing this PR).

Thanks for looking into this and making that change; I do think it provides a significant benefit on high-core-count systems.

@jentfoo closed this Apr 2, 2018
@saudet (Member) commented Apr 3, 2018 via email

@saudet (Member) commented May 13, 2018

FYI, I've implemented PointerScope in commit d98d323.

/cc @cypof
