
Significant performance improvements to Pointer.deallocator #232

Closed · wants to merge 6 commits

Conversation

@jentfoo (Author) commented Mar 23, 2018

This reduces thread contention on the global lock for Pointer allocations in most cases.
The global lock can be a performance killer in highly parallel operations, even when memory pressure is not high.
In this change we do optimistic provisioning using atomic operations, and only fall back to locking when there is memory pressure.

For totalBytes and the internal linked list this was an easy change: switching to an AtomicLong and a ConcurrentLinkedQueue lets us perform those common operations without locking.
The handling of the physical bytes, however, is a bit different in this implementation. We use an AtomicLong to approximate the size based on the internally tracked bytes. On de-allocations, however, this counter is NOT decremented (while totalBytes is decremented). This means one of two other mechanisms must sync physicalBytes back to reality: the first (hopefully the most common) is a sync that occurs every 1000 allocations; the second is that if we fail to allocate, we do a trim, which syncs this state as well. A simplified sketch of this scheme follows.
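
The sketch below is illustrative only, not the actual patch; the class, method names, and limit are made up to show the optimistic fast path, the locked slow path, and the periodic sync:

import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch of the optimistic accounting described above.
class OptimisticAccounting {
    static final long MAX_BYTES = 1L << 32;              // illustrative limit
    static final AtomicLong totalBytes = new AtomicLong();
    static final AtomicLong allocationCount = new AtomicLong();

    static void onAllocate(long bytes) {
        // Optimistically reserve the bytes with a lock-free atomic add.
        if (totalBytes.addAndGet(bytes) > MAX_BYTES) {
            totalBytes.addAndGet(-bytes);                // undo the reservation
            synchronized (OptimisticAccounting.class) {  // slow path under memory pressure
                // trim, sync physicalBytes, retry, or throw OutOfMemoryError here
            }
        }
        // Periodically sync the approximate physical byte count back to reality.
        if (allocationCount.incrementAndGet() % 1000 == 0) {
            syncPhysicalBytes();
        }
    }

    static void syncPhysicalBytes() { /* query the real value natively */ }
}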

@saudet, give me your thoughts on this as a way to resolve #231.

My biggest concern is how I am approximating the physical bytes. If there is a better way to estimate the size, or if you think the sync interval needs to be different, let me know; for my use case the sync interval could be every million allocations or two. That said, these changes DRAMATICALLY speed up deeplearning4j's kmeans implementation for me.

@saudet (Member) commented Mar 23, 2018

PhantomReference actually works with collections now? Good to know! Though we don't get the node of the linked list back from the ReferenceQueue, so remove operations are no longer O(1). Maybe we should use something like ConcurrentHashMap?

@jentfoo (Author) commented Mar 23, 2018

Though we don't get the node of the linked list back from the ReferenceQueue, so remove operations are no longer O(1)

True; this is the performance-vs-simplicity trade-off I mentioned in the issue. To get O(1) removal back, we would basically need to implement the linked list ourselves using AtomicReferences. A ConcurrentHashMap would provide faster removals if the collection is large; if it's small, the CLQ may be faster. The only other reason I see that a CLQ may be better than a CHM is that the CHM has more memory overhead (though we could use a CHM with a concurrency level of only 2 to help with that). A minimal sketch of the CHM approach is below.
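
The following is illustrative only, not part of this PR: keying the map on the reference itself makes removal O(1) when the ReferenceQueue hands the reference back, and a low concurrency level keeps the memory overhead down.

import java.lang.ref.PhantomReference;
import java.util.concurrent.ConcurrentHashMap;

class ReferenceRegistry {
    // Initial capacity 16, default load factor, concurrency level 2.
    final ConcurrentHashMap<PhantomReference<?>, Boolean> live =
            new ConcurrentHashMap<>(16, 0.75f, 2);

    void add(PhantomReference<?> ref)    { live.put(ref, Boolean.TRUE); }
    void remove(PhantomReference<?> ref) { live.remove(ref); }  // O(1), unlike scanning a CLQ
}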

Since I only have one benchmark case, I would like to defer to you on which one we should go with.

PhantomReference actually works with collections now?

I forgot you mentioned in the issue that this was not working. I still need to verify this 100%. I have only run the unit tests and confirmed that the kmeans use case that brought this up has been running OK for a while now. I will get back to you later today or tomorrow once I have looked into this.

Edit: even if PhantomReference is not working right, I am pretty sure we can swap that component out without going back to locking. At a minimum, I figured it was worth getting questions and thoughts sooner rather than later.

@saudet (Member) commented Mar 23, 2018

Ok, let me think about this.

About physicalBytes(): why do you need an approximation? Is it just to avoid the lock, or is the call itself too slow?

@jentfoo (Author) commented Mar 23, 2018

About physicalBytes(): why do you need an approximation? Is it just to avoid the lock, or is the call itself too slow?

It looked like the native call was also synchronized. I can give it a try and see how it does, but the goal was just to avoid locking entirely if possible.

Sorry about the quick post / delete. I at least confirmed that PhantomReference works with the DeallocatorThread running (i.e., they are added to the queue for cleanup). Right after I posted that, I realized you can disable that cleanup thread, so I am still trying to verify it works with that thread disabled.

@jentfoo (Author) commented Mar 23, 2018

Actually, I am not sure that PhantomReference works in master with that thread disabled. I am pretty sure we need the ReferenceQueue for this to work. We could avoid the thread and instead process the queue at allocation time or at some other point, though. Otherwise, as best I can tell, I am replicating master's PhantomReference behavior.

Let me know your thoughts, thanks!

@saudet (Member) commented Mar 23, 2018

I've made the call to physicalBytes() thread safe and removed the lock in commit 273d4bc, so we can remove these approximations now.

We do need the ReferenceQueue, of course. We can process it in another thread with Pointer.deallocateReferences(). Thanks for testing this out!

(Commit) … our limit: This means that as we get closer to the limit, our internal representation should be more accurate.
@jentfoo (Author) commented Mar 23, 2018

@saudet I love it! I just pushed changes that no longer use the AtomicLong for the physical bytes; instead we just call the native function directly. Since we reference the native value directly, I also changed the trim function to take the number of bytes we need and to keep looping until we can allocate that much. The exception-throwing code needs to remain the same, however, since there is still the potential for another thread to steal the memory we freed before we get a chance to use it.

I feel a lot better about this without approximating the physical bytes; that was the most concerning aspect to me personally. Let me know your thoughts (and also whether you want me to substitute in the ConcurrentHashMap).

@saudet (Member) commented Mar 23, 2018

Good! Thanks. Before I start reviewing this in more detail, there is one thing that bothers me. Even if we were to implement a linked list manually using AtomicReference, I don't see a way to get O(1) removal. Even the Iterator of ConcurrentLinkedQueue doesn't actually remove the Node; it just hopes that the user eventually traverses the list at some point in the future to clean it up, which would never happen if we implemented add() and remove() with AtomicReference. Am I missing something obvious, or is this a harder problem than it looks?

BTW, the idea with "org.bytedeco.javacpp.nopointergc" is to disable all that, to remove the burden from the GC when the user wants to manage everything manually, and for that we need to not create a ReferenceQueue.

@jentfoo (Author) commented Mar 23, 2018

the idea with "org.bytedeco.javacpp.nopointergc" is to disable all that, to remove the burden from the GC when the user wants to manage everything manually

OK, so I assume you would like me to revert my last commit? There is some oddness in using these PhantomReferences without a queue to clean them up, but if the idea is to let them be GC'ed and force the user to manually clean up the native side, I can understand that. However, it means that when this is enabled (deallocator thread disabled), deallocateReferences() won't work. I can mention that in the javadocs if you want, unless you have an idea for making deallocateReferences() work when the deallocator thread is not there (i.e., does it make sense to have a reference queue if there is no deallocator thread?).

Even if we were to implement a linked list manually using AtomicReference, I don't see a way to get O(1) removal.

O(1) might be hard; it may be possible assuming no concurrent conflicts, but I would expect the typical cost to be higher than that. Having gone down this road before, I would suggest we not do it here. To give you an idea of what something like this might look like, here is a similar implementation I did in threadly. It uses an AtomicReference for the head, then a volatile reference for each link of the chain. However, it does require synchronization when removing from the list.

Head of list: https://github.com/threadly/threadly/blob/master/src/main/java/org/threadly/concurrent/PriorityScheduler.java#L386

Chain link: https://github.com/threadly/threadly/blob/master/src/main/java/org/threadly/concurrent/PriorityScheduler.java#L867

Adding to and removing from the list (you will notice a lock on removal):
https://github.com/threadly/threadly/blob/master/src/main/java/org/threadly/concurrent/PriorityScheduler.java#L662-L718

Something like that could work here too; I just question whether the complexity is worth it at this point in the code. For my use case, a CHM or CLQ works fine. That said, I can obviously build things like this (and have), and will if you prefer that type of implementation. The reason it exists in threadly, though, is countless hours of benchmarking proving it truly is the best solution there. My concern here is adding the complexity without a proven benefit. A heavily simplified sketch of that kind of structure follows.
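
The sketch below is illustrative only, not the threadly code: insertion CASes a new node onto the head lock-free, while removal is serialized by a lock.

import java.util.concurrent.atomic.AtomicReference;

class Chain<T> {
    static final class Node<T> {
        final T value;
        volatile Node<T> next;
        Node(T value) { this.value = value; }
    }

    final AtomicReference<Node<T>> head = new AtomicReference<>();

    void add(T value) {                          // lock-free insert at the head
        Node<T> n = new Node<>(value);
        Node<T> h;
        do {
            h = head.get();
            n.next = h;
        } while (!head.compareAndSet(h, n));
    }

    synchronized boolean remove(T value) {       // removals are serialized by the lock
        restart:
        while (true) {
            Node<T> prev = null;
            for (Node<T> n = head.get(); n != null; n = n.next) {
                if (n.value == value) {
                    if (prev == null) {
                        // A concurrent add() may have changed the head; retry on failure.
                        if (head.compareAndSet(n, n.next)) return true;
                        continue restart;
                    }
                    prev.next = n.next;          // unlink; safe since removes hold the lock
                    return true;
                }
                prev = n;
            }
            return false;
        }
    }
}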

Let me know your thoughts, thanks!

@saudet (Member) commented Mar 23, 2018

PhantomReference is a lightweight class. If they're not registered with a ReferenceQueue, they're not doing any harm. They're not bringing anything either, but the subclasses still implement the Deallocator interface for users to call manually. Yes, deallocateReferences() won't work, but I haven't come across a case where that would have been useful yet. Just making the docs clearer for now would be best, I think, yes.

I also want to keep things simple, but I don't want to replace one problem with another one either... Eliminating locks isn't the goal in itself here though, right? If we could make most calls from different threads lock different objects most of the time, that would also be acceptable, correct?

@jentfoo (Author) commented Mar 23, 2018

Just making the docs clearer for now would be best, I think, yes.

Sounds good, I will revert that commit and update the docs.

Eliminating locks isn't the goal in itself here though, right? If we could make most calls from different threads lock different objects most of the time, that would also be acceptable, correct?

I guess my goal was just to provide the maximum amount of improvement I could. I noticed a significant issue in my use of kmeans, so anything that improves that is good enough for me.

What did you have in mind? Or were you referring to the synchronized block when removing from my linked list?

In any case, I don't have any philosophical problem with locks; they are an extremely useful tool, and in some cases lock-free designs can be worse (if the CAS operations fail a lot, for example). It just seemed to me that a lock-free design was more natural in this part of the code than a granular locking solution. I considered reader/writer lock solutions as well, but in the end I thought this was better.

@jentfoo (Author) commented Mar 23, 2018

P.S. There may be a problem here. After long runs I am running out of memory, and I am not exactly sure why yet. It may be a problem in my code; it's hard to evaluate whether this would also happen on master, because the speed differences are so dramatic that I would have to let it run for a week on master. If you notice anything, let me know.

@saudet (Member) commented Mar 24, 2018

What about using, for example, 16 linked lists and adding the reference to the list based on the least significant bits of the thread ID? Would that be efficient based on your experience?
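
To make the striping idea concrete, here is a minimal sketch (illustrative names; a real implementation would also need removal to find the right stripe, e.g. by storing the stripe index in the reference):

import java.util.LinkedList;

class StripedLists<T> {
    @SuppressWarnings("unchecked")
    final LinkedList<T>[] stripes = new LinkedList[16];
    {
        for (int i = 0; i < stripes.length; i++) stripes[i] = new LinkedList<>();
    }

    void add(T value) {
        // Threads whose IDs differ in the low 4 bits lock different lists,
        // so most of the time they do not contend with each other.
        LinkedList<T> list = stripes[(int) (Thread.currentThread().getId() & 15)];
        synchronized (list) {
            list.add(value);
        }
    }
}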

@jentfoo (Author) commented Mar 24, 2018

Something like that would probably work fine too. Personally, I would be more likely to use a ConcurrentHashMap, though. Your solution may very well perform better, but without knowing more about the usage patterns, and without more benchmarks, I tend to go with the simpler solution. That said, I do think a striped-lock-style design could work just fine as well. :)

@saudet (Member) commented Mar 24, 2018 via email

(Commit) Also improved javadocs to describe the condition under which `deallocateReferences()` becomes a no-op.
@jentfoo (Author) commented Mar 26, 2018

@saudet sorry for the delay. I just pushed the change to use a CHM. I actually set a concurrency level of only 1, which seemed fine in my benchmarks and attempts to minimize memory usage. This seemed to perform a little better than the original CLQ.

I have also started setting the max physical bytes to zero, and instead just use max bytes to limit things. That has improved speed a lot, so there is this escape hatch for those willing to tune a reasonable limit using only the tracked in-heap byte count. However, if we could figure out a scaling constant between max bytes and physical bytes, we would get the best of both worlds. Not needed, just something to think about.

@saudet (Member) commented Mar 28, 2018

Sounds good! Thanks. We probably want to make these parameters configurable via system properties, but let's keep that for later...

To avoid OutOfMemoryError, try increasing the "org.bytedeco.javacpp.maxretries" system property to some large value like 1000. System.gc() might take a while to have an effect under heavy load. Let me know if you're having any issues even with a large number of retries, though. Remember that other threads might be stealing memory away, and any given thread might never get the chance to allocate memory, so make sure the logic accounts for fairness.
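
For example, on the command line with -Dorg.bytedeco.javacpp.maxretries=1000, or programmatically (assuming the property is read when the Pointer class initializes, so set it early):

// Must run before the first Pointer allocation for the property to take effect.
System.setProperty("org.bytedeco.javacpp.maxretries", "1000");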

@jentfoo (Author) commented Mar 28, 2018

I actually think I got my OutOfMemoryErrors sorted out. It was a problem on my end, to be sure; I don't think anything is wrong with this PR.

However I do have some ideas for more improvements. I am not sure if we should include them in this PR, or another, but give me your thoughts on this:
• I think the Thread.sleep should be configurable. In theory, someone might want to configure this to a very high value (say 10 seconds with 100 retries) and use this mechanism as a form of throttling; i.e., this code would then block the application while it is memory-limited, rather than making the developer limit it by other means (thread pool size, semaphore, etc.). Of course this is not a significant burden on people; I just don't see any reason the code can't serve both usage types equally.

• I am still not 100% happy with the deallocator thread design. I feel you should be able to do de-allocations with the reference queue without requiring an entire thread for it. Personally, I would like to add a static function which shuts down the deallocator thread and returns the ReferenceQueue so it can be handled manually. A user could also choose to ignore the returned queue and manually invoke deallocateReferences(), so if you prefer to keep the queue hidden we could force people to rely on that. If there is a safety concern, we could add a lazy check of the queue before new allocations (similar to the commit I removed). This seems slightly better to me as a static function than as a system property, because you are basically taking over this functionality within your own code anyway. A rough sketch of what I mean follows.
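
Purely a proposal sketch; neither of these methods exists in JavaCPP, and all names here are hypothetical:

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

class ManualDeallocation {
    static final ReferenceQueue<Object> queue = new ReferenceQueue<>();

    // Hypothetical: stop the deallocator thread and expose its queue to the caller.
    static ReferenceQueue<Object> takeOverDeallocation() {
        // stop and join the deallocator thread here (hypothetical)
        return queue;
    }

    // The caller then drains the queue whenever convenient.
    static void drain() {
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            // The reference subclasses implement the Deallocator interface
            // (as noted above), so the native memory would be freed here.
        }
    }
}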

@jentfoo (Author) commented Mar 28, 2018

@saudet Sorry to add more to this, but I found out some interesting additional details.

I ended up playing with the currently released 1.3.3 version, and it's actually a little faster IF you set max physical bytes to zero. I was seeing gains in this implementation when I disabled checking physical bytes, but I never went back and re-tested master. It seems the master version is actually a little faster if you are not making that native call, which leads me to believe that the real burden here is the native call inside the synchronized block.

I still suggest that we go forward with these changes, just because physical byte checking is enabled by default. But it's worth being aware of at least.

Also, I played with the Thread.sleep / Thread.yield a little. I was not able to realize any significant benefit. I have implementations I think are better, but I just don't think I am hitting that case often enough to really expose any deficiencies. I am unsure whether you would like me to PR potential options (maybe in another PR, or this one?), or whether I should just remove the TODO note.

Thanks as always!

@saudet (Member) commented Mar 29, 2018

DeallocatorThread was introduced to reduce lock contention on ReferenceQueue (issue #103). Of course we can add options to let users do whatever they want, but let's get this running well enough with default values first!

So, are you saying that the current simple linked list with locking is fast enough now? Not having to rely on ConcurrentHashMap would sure be a good thing IMO, but let's make sure we're doing the right thing here.

@saudet (Member) commented Mar 29, 2018

BTW, if the goal is performance, we shouldn't be fiddling with the garbage collector anyway. For that, we should be working on "scopes". Basically the API would look like this:

while (...) {
    try (Pointer.Scope scope = new Pointer.Scope()) {
        doProcessingHere();
    }
}

What would happen behind the curtain is that new Pointer.Scope() would first put itself in a list in thread-local storage; then, for each call to new Pointer() in doProcessingHere(), when it finds a scope in TLS, it would put itself on the stack of the scope instead of registering with the ReferenceQueue. Then, on Pointer.Scope.close(), it would pop all the Pointers on the stack one by one and call close()/deallocate() on them, before removing itself from the list in the TLS. That's pretty much the closest we can get to C++'s RAII in Java: see "Scoped Types for Real-time Java" and what Panama is also proposing (but not for C++ types, unfortunately): https://www.slideshare.net/ChristophEngelbert/project-panama-beyond-the-jvm-wall
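
A rough sketch of that mechanism, with made-up names (this is not an actual JavaCPP implementation):

import java.util.ArrayDeque;
import java.util.Deque;

class Scope implements AutoCloseable {
    // One stack of scopes per thread; the innermost scope is on top.
    static final ThreadLocal<Deque<Scope>> SCOPES =
            ThreadLocal.withInitial(ArrayDeque::new);

    final Deque<AutoCloseable> attached = new ArrayDeque<>();

    Scope() { SCOPES.get().push(this); }          // register in thread-local storage

    // Called from "new Pointer()": attach to the innermost scope, if any,
    // instead of registering with the ReferenceQueue.
    static void attach(AutoCloseable pointer) {
        Deque<Scope> scopes = SCOPES.get();
        if (!scopes.isEmpty()) scopes.peek().attached.push(pointer);
    }

    @Override public void close() {
        SCOPES.get().remove(this);                // remove self from the TLS list
        while (!attached.isEmpty()) {
            try { attached.pop().close(); }       // deallocate, one by one
            catch (Exception e) { /* log and keep closing the rest */ }
        }
    }
}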

/cc @cypof

@jentfoo (Author) commented Mar 30, 2018

So, are you saying that the current simple linked list with locking is fast enough now?

Not exactly. If you disable max physical bytes, it does not make the native call inside the synchronized block, and it seems that this alone gets us most of the benefits. But I was still recommending we go with this because physical byte limits are enabled by default, and when they are enabled there are still significant advantages here.

Fwiw, the root of this issue is the deeplearning4j implementation of kmeans. I decided to re-implement that today and got another 200x improvement!!

I still suggest going with these changes because they seem like an improvement to me, but for me personally they are not as critical as they used to be.

@saudet (Member) commented Mar 30, 2018

Yes, not relying on the garbage collector is the way to gain performance. :)

In any case, I think your code can produce resource starvation. Could you convince me otherwise? The following scenario appears plausible to me:

  1. Thread 1 enters deallocator() and fails on add(), starts calling trim(), but is suspended by the system
  2. The system frees enough memory for thread 1
  3. Thread 2 enters deallocator() and add() succeeds, taking away the just freed memory
  4. Thread 1 resumes execution, keeps looping over trim() because of lack of memory, and the system suspends it again
  5. Goto 2

How is this prevented from happening? This is exactly why I had to put a lock there in the first place.

@jentfoo (Author) commented Mar 30, 2018

This is possible, but since the synchronized block is not fair anyway, I don't see why it matters which thread wins. If you are over-allocating, you are over-allocating, and OOM conditions seem inevitable to me. Unless we make the sleep time configurable, or otherwise turn this into a queue that blocks until memory is available, I don't think you have really solved that concern anyway.

@saudet (Member) commented Mar 30, 2018

You're right, the order isn't specified, but it did solve the problem :) I suppose I'd go with ReentrantLock and its "fair" flag, though. Any way to emulate that without locks? The idea here is to not throw an OutOfMemoryError while there is actually still memory, and fairness does solve this issue...
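
For reference, a fair ReentrantLock grants the lock to the longest-waiting thread, which directly rules out the starvation scenario above; a small sketch:

import java.util.concurrent.locks.ReentrantLock;

class FairAllocation {
    // true = fair ordering: the longest-waiting thread acquires the lock next,
    // so no thread can be starved indefinitely.
    static final ReentrantLock lock = new ReentrantLock(true);

    static void allocate(long bytes) {
        lock.lock();
        try {
            // attempt the allocation here, trimming on failure, etc.
        } finally {
            lock.unlock();
        }
    }
}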

@jentfoo (Author) commented Mar 30, 2018

If it were me, I would do the following:

  • Make the sleep time between gc attempts configurable. Document that it should be set to between 1/2 and 1/10 of the time a Pointer is likely to be retained. Defaulting to 100 ms is fine.
  • Get rid of the retry count; instead, allow a max wait time to be configured. Default this to at least 1 second IMO, maybe even higher.

This would accomplish what you want, which is to not error out if memory can still become available. It does make this class less fair; instead, it is simply willing to block longer, if needed, until memory is available. A minimal sketch of the idea is below.
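
The property names and defaults here are illustrative, not existing JavaCPP options; tryAllocate() stands in for the real limit check:

class DeadlineWait {
    static final long GC_SLEEP_MS = Long.getLong("org.bytedeco.javacpp.gcsleep", 100);  // hypothetical
    static final long MAX_WAIT_MS = Long.getLong("org.bytedeco.javacpp.maxwait", 1000); // hypothetical

    static void waitForMemory(long bytes) throws InterruptedException {
        long deadline = System.currentTimeMillis() + MAX_WAIT_MS;
        while (!tryAllocate(bytes)) {
            if (System.currentTimeMillis() >= deadline) {
                throw new OutOfMemoryError("still over the limit after " + MAX_WAIT_MS + " ms");
            }
            System.gc();                 // ask for a collection...
            Thread.sleep(GC_SLEEP_MS);   // ...and give it time to take effect
        }
    }

    static boolean tryAllocate(long bytes) { return true; }  // placeholder
}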

But this implementation will always be unfair. I personally don't think that is a significant issue that anyone would notice. But I tend to favor unfair designs based on my experience.

@saudet (Member) commented Mar 30, 2018

About System.gc(): sometimes the JVM ignores it (there is even an option, -XX:+DisableExplicitGC, to ignore it all the time), so we do need to call it multiple times under normal circumstances anyway.

Actually, PointerTest.Deallocator is supposed to be testing against resource starvation:
https://github.com/bytedeco/javacpp/blob/master/src/test/java/org/bytedeco/javacpp/PointerTest.java#L688
So this passes with your patch, it seems? Interesting...

@jentfoo (Author) commented Mar 30, 2018

Yeah, I noticed that test during development, and yes it passes fine.

@saudet (Member) commented Mar 31, 2018

I'm trying to test things out, but on your branch PointerTest.Deallocator takes ~60 s, while it only takes ~10 s on master. Also I'm not observing any performance improvements using either this pull request or -Dorg.bytedeco.javacpp.maxphysicalbytes=0 on master with Deeplearning4j and something like this:

KMeansClustering kMeansClustering = KMeansClustering.setup(500, 500, "euclidean");
List<Point> points = Point.toPoints(Nd4j.randn(500, 500));
ClusterSet clusterSet = kMeansClustering.applyTo(points);

Could you post some code that demonstrates the issue?

saudet added a commit that referenced this pull request on Apr 1, 2018: …r.deallocator()` to reduce contention (pull #232)
@saudet (Member) commented Apr 1, 2018

Your observation about physicalBytes() made me realize something. The first call to it is the bottleneck for the great majority of calls to Pointer.deallocator() (that is, when there is enough memory for the allocation), but we can easily move it outside the lock, so I've done just that in the commit above. That should give good results for the case you were testing? Let me know! Thanks
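
In other words, a simplified before/after of the idea (not the literal commit; physicalBytes() is stubbed out here):

class Before {
    static synchronized long deallocator() {     // native call made inside the lock
        long bytes = physicalBytes();            // slow call serializes every thread
        // ... check limits, register the deallocator ...
        return bytes;
    }
    static long physicalBytes() { return 0; }    // stand-in for the native call
}

class After {
    static long deallocator() {
        long bytes = physicalBytes();            // slow call now runs concurrently
        synchronized (After.class) {             // lock only guards the bookkeeping
            // ... check limits against 'bytes', register the deallocator ...
        }
        return bytes;
    }
    static long physicalBytes() { return 0; }    // stand-in for the native call
}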

@jentfoo (Author) commented Apr 2, 2018

@saudet Unfortunately, my ML code is proprietary right now. I tried to create an example benchmark that you could use, but as I was building it I started to see what you did as well. Even in my company's application, where this branch does show a significant benefit over the current release version, moving the native call before the synchronized block performs equally well for me. I suggest we just go with that for now (which is why I am closing this PR).

Thanks for looking into this and making that change; I do think it provides a significant benefit on high-core-count systems.

@jentfoo closed this Apr 2, 2018
@saudet (Member) commented Apr 3, 2018 via email

@saudet (Member) commented May 13, 2018

FYI, I've implemented PointerScope in commit d98d323.

/cc @cypof
