Significant performance improvements to Pointer.deallocator #232
Conversation
PhantomReference actually works with collections now? Good to know! Though we don't get the node of the linked list back from the ReferenceQueue, so remove operations are no longer O(1). Maybe we should use something like ConcurrentHashMap?
This is true; this is what I was talking about in the issue about performance vs. simplicity. We would need to basically implement the linked list ourselves using something like AtomicReference. Since I only have one benchmark case, I would like to defer to you on which one we should go with.
I forgot you mentioned in the issue that this was not working. I actually still need to verify this 100%. I have only run the unit tests and verified that my kmeans use case that brought this up has been running OK for a while now. I will get back to you later today or tomorrow as I look into this to be sure. Edit: even if PhantomReference is not working right, I am pretty sure we can swap that component out without needing to go back to locking. At a minimum I figured it was worth raising questions / thoughts earlier rather than later.
Ok, let me think about this. About physicalBytes(): why do you need an approximation? Is it just to avoid the lock, or is the call itself too slow?
It looked like the native call was also synchronized. I can give it a try and see how it does, but the goal was just to avoid locking entirely if possible. Sorry about the quick post / delete. I at least confirmed that PhantomReference works with the ConcurrentLinkedQueue.
I am not sure that … Let me know your thoughts, thanks!
I've made the call to … We need to …
This improves thread contention when locking on the global lock for Pointer allocations in most cases. The global lock can be a performance killer in highly parallel operations, even when memory pressure is not high. In this change we do optimistic provisioning using atomic operations, and only fall back to locking when there is memory pressure. For totalBytes and the internal linked list this was an easy change: switching to an `AtomicLong` and a `ConcurrentLinkedQueue` lets us perform those common actions without needing to lock. The handling of the physical bytes, however, is a bit different in this implementation. We use an `AtomicLong` to approximate the size based on the internally tracked bytes. However, on deallocations this is NOT decremented (while totalBytes is decremented). This means that one of two other mechanisms needs to sync physicalBytes back to reality: the first (hopefully most common) is a sync that occurs every 1000 allocations; the second is that if we fail to allocate, we do a `trim`, which syncs this state as well.
… our limit. This means that as we get closer to the limit, our internal representation should be more accurate.
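The optimistic-provisioning pattern described above can be sketched roughly as follows. This is not the actual Pointer implementation; all names here are hypothetical, and the slow path is reduced to a simple locked re-check where the real code would try deallocation or a trim:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of optimistic byte provisioning: reserve with a CAS
// on an AtomicLong, and only fall back to a locked slow path when the
// optimistic reservation would exceed the limit.
public class OptimisticCounter {
    private final long maxBytes;
    private final AtomicLong totalBytes = new AtomicLong();

    public OptimisticCounter(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Fast path: CAS-reserve bytes; fall back to a locked slow path under pressure. */
    public boolean reserve(long bytes) {
        while (true) {
            long current = totalBytes.get();
            long next = current + bytes;
            if (next > maxBytes) {
                return reserveUnderPressure(bytes); // slow path under memory pressure
            }
            if (totalBytes.compareAndSet(current, next)) {
                return true; // optimistic reservation succeeded without locking
            }
            // CAS lost to another thread; retry
        }
    }

    private synchronized boolean reserveUnderPressure(long bytes) {
        // In real code this is where deallocation / trim / retry would happen.
        long current = totalBytes.get();
        if (current + bytes <= maxBytes) {
            totalBytes.addAndGet(bytes);
            return true;
        }
        return false;
    }

    public void release(long bytes) {
        totalBytes.addAndGet(-bytes);
    }
}
```

The point of the design is that under normal load, threads never touch the lock at all; contention only reappears when the limit is actually being approached.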
Force-pushed from f6c2c42 to 0a23673.
@saudet I love it! I just pushed some changes which no longer have the physical bytes approximation. I feel a lot better about this without approximating the physical bytes; that was the most concerning aspect to me personally. Let me know your thoughts (also let me know if you want me to sub in the ConcurrentHashMap as well).
Good! Thanks. Before I start reviewing this in more detail, there is one thing that bothers me. Even if we were to manually implement a linked list using AtomicReference, I don't see a way to get O(1) removal. Even the Iterator of ConcurrentLinkedQueue doesn't actually remove the Node; it just hopes that the user eventually goes through the list at some point in the future to clean it up, which would never happen if we were to implement add() and remove() with AtomicReference. Am I missing something obvious, or is this a harder problem than it looks like? BTW, the idea with "org.bytedeco.javacpp.nopointergc" is to disable all that, to remove the burden from the GC when the user wants to manage everything manually, and for that we need to not create a ReferenceQueue.
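The ConcurrentHashMap idea discussed in this thread can be sketched like this (a hypothetical illustration, not the actual JavaCPP code): because the ReferenceQueue hands back the very PhantomReference object that was registered, that reference can serve as the map key, making removal O(1) instead of a linked-list scan:

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: map each PhantomReference to its native deallocator, so that
// draining the ReferenceQueue gives O(1) removal from the registry.
public class DeallocatorRegistry {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private final ConcurrentHashMap<Reference<?>, Runnable> deallocators =
            new ConcurrentHashMap<>();

    /** Register a native deallocator to run once {@code owner} becomes unreachable. */
    public void register(Object owner, Runnable deallocator) {
        PhantomReference<Object> ref = new PhantomReference<>(owner, queue);
        deallocators.put(ref, deallocator);
    }

    /** Drain the queue, running and unregistering any pending deallocators. */
    public void drain() {
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            Runnable d = deallocators.remove(ref); // O(1) removal by reference identity
            if (d != null) {
                d.run();
            }
        }
    }

    public int pending() {
        return deallocators.size();
    }
}
```

Note that with "org.bytedeco.javacpp.nopointergc" as described above, no queue would be created at all and none of this machinery would run.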
Ok, so I assume you would like me to revert my last commit? I feel there is some oddness in using these PhantomReferences without a queue to clean them up. But if the idea is to let them be GC'ed and then force the user to manually clean up the native side, I guess I can understand that. However, that means that when this is enabled (deallocator thread disabled) …
O(1) might be hard. It might be possible assuming no concurrent conflicts, but I would expect it to be higher than that. Having gone down this road before, I would suggest we don't do it here. To give you an idea of what something like this might look like, here is a similar implementation I did in threadly. Head of list: https://github.com/threadly/threadly/blob/master/src/main/java/org/threadly/concurrent/PriorityScheduler.java#L386 Adding and removing from the list: (you will notice a lock on removal). Something like that could totally work here too; I just question if the complexity is worth it at this point in the code. I feel like in my use case a CHM or CLQ works fine. That said, I obviously can build things like this (and have), and will if you prefer that type of implementation. The reason it exists in threadly, though, is because of countless hours of benchmarking proving that it truly is the best solution. My concern here is adding the complexity without a truly proven benefit. Let me know your thoughts, thanks!
I also want to keep things simple, but I also don't want to replace one problem with another one... Eliminating locks isn't the goal here though, right? If we could make most calls from different threads lock different objects most of the time, that would also be acceptable, correct?
Sounds good, I will revert that commit and update the docs.
I guess my goal was just to provide the maximum amount of improvement I could. I noticed a significant issue in my use of kmeans, so anything that improves that is good enough for me. What did you have in mind? Or were you referring to the synchronized block when removing from my linked list? In any case, I don't have any philosophical problem with locking or anything, they are an extremely useful tool, and in some cases lock free designs can be worse (if the CAS operations fail a lot for example). It just seemed to me that a lock free design in this part of the code was more natural than a granular lock solution. I did consider potential reader / writer lock solutions as well, but in the end I just thought this was better. |
P.S. There may be a problem in this. After long runs I am running out of memory, and I am not exactly sure why yet. It may be a problem in my code; it's hard for me to evaluate whether this would happen in …
What about using, for example, 16 linked lists and adding the reference to the list based on the least significant bits of the thread ID? Would that be efficient based on your experience? |
Something like that would probably work fine too. I personally would be more likely to use a ConcurrentHashMap though. Your solution may very well perform better, but without knowing more about the use patterns and other benchmarks I would tend to go with the simpler solution. But I do think a striped-lock design could work just fine as well. :)
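The striping idea proposed above (16 lists selected by the low bits of the thread ID) could be sketched as follows. All names are hypothetical; this is just the pattern, with the caveat noted in the comment that removals may happen on a different thread than the add:

```java
import java.util.LinkedList;

// Sketch of lock striping: 16 independently synchronized lists, chosen by
// the least significant bits of the current thread's ID, so threads mostly
// contend on different locks.
public class StripedLists<E> {
    private static final int STRIPES = 16; // must be a power of two

    @SuppressWarnings("unchecked")
    private final LinkedList<E>[] stripes = new LinkedList[STRIPES];

    public StripedLists() {
        for (int i = 0; i < STRIPES; i++) {
            stripes[i] = new LinkedList<>();
        }
    }

    private LinkedList<E> stripeForCurrentThread() {
        // Low bits of the thread ID pick the stripe.
        return stripes[(int) (Thread.currentThread().getId() & (STRIPES - 1))];
    }

    public void add(E e) {
        LinkedList<E> list = stripeForCurrentThread();
        synchronized (list) {
            list.add(e);
        }
    }

    public boolean remove(E e) {
        // Removal may run on a different thread than the add (e.g. the
        // deallocator thread), so every stripe must be searched.
        for (LinkedList<E> list : stripes) {
            synchronized (list) {
                if (list.remove(e)) {
                    return true;
                }
            }
        }
        return false;
    }

    public int size() {
        int n = 0;
        for (LinkedList<E> list : stripes) {
            synchronized (list) {
                n += list.size();
            }
        }
        return n;
    }
}
```

The cross-stripe search in `remove()` is exactly the complexity trade-off being debated here: adds become nearly contention-free, but removals stay O(n) unless each element remembers its stripe.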
I have been thinking about ConcurrentHashMap as well, but that looks like quite the monster. It would be easy to test though. Could you run your benchmarks on that and let's see how it does?
Force-pushed from 0a23673 to 9c51b21.
Also improved javadocs to describe condition where `deallocateReferences()` will become a no-op.
@saudet sorry for the delay. I just pushed the change to use a CHM. I actually only set a concurrency level of … I have also started setting the max physical bytes to zero, and instead just using max bytes to limit things. It has improved speed a lot. So there is this option for those willing to do the tuning to set a reasonable limit just by the in-heap bytes. However, if we could figure out a scaling constant between the max bytes and physical bytes, we would get the best of both worlds. Not needed, just something to think about.
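For context on the "concurrency level" mentioned above: it maps to ConcurrentHashMap's three-argument constructor. The exact value used in this PR is not shown in the thread, so the level below is purely illustrative (and since Java 8 it is only a sizing hint anyway):

```java
import java.util.concurrent.ConcurrentHashMap;

public class ChmConfig {
    // Illustrative only: the third argument estimates the number of
    // concurrently updating threads. Since Java 8 it is just a sizing hint.
    static <K, V> ConcurrentHashMap<K, V> newMap(int concurrencyLevel) {
        return new ConcurrentHashMap<>(16 /* initial capacity */,
                                       0.75f /* load factor */,
                                       concurrencyLevel);
    }
}
```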
Sounds good! Thanks. We probably want to make these parameters configurable via system properties, but let's keep that for later... To avoid …
I actually think I got my … However, I do have some ideas for more improvements. I am not sure if we should include them in this PR or another, but give me your thoughts on this: I still am not 100% happy with the deallocator thread design. I feel like you should be able to do deallocations with the reference queue without requiring an entire thread for it. I personally would like to add a static function which shuts down the deallocator thread and returns the …
@saudet Sorry to add more to this, but I found some interesting additional details. I ended up playing with the currently released … I still suggest that we go forward with these changes, just because physical byte checking is enabled by default. But it's worth being aware of at least. Also, I played with the Thread.sleep / Thread.yield a little bit. I was not able to realize any significant benefits. I have implementations I think are better, but I just don't think I am hitting that case often enough to really expose any deficiencies. I am unsure if you would like me to PR potential options (maybe another PR? Or this one?), or should I just remove the TODO note? Thanks as always!
DeallocatorThread was introduced to reduce lock contention on the ReferenceQueue (issue #103). Of course we can add options to let users do whatever they want, but let's get this running well enough with default values first! So, are you saying that the current simple linked list with locking is fast enough now? Not having to rely on ConcurrentHashMap would sure be a good thing IMO, but let's make sure we're doing the right thing here.
BTW, if the goal is performance, we shouldn't be fiddling with the garbage collector anyway. For that, we should be working on "scopes". Basically the API would look like this:

```java
while (...) {
    try (Pointer.Scope scope = new Pointer.Scope()) {
        doProcessingHere();
    }
}
```

And what would happen behind the curtain is that … /cc @cypof
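One plausible reading of the scope proposal can be sketched as follows. None of these names exist in JavaCPP at this point; this is an assumed design in which a thread-local scope collects deallocators registered while it is open and runs them deterministically on close, bypassing the GC entirely:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a try-with-resources scope for native allocations.
// All names are assumptions, not the actual JavaCPP API.
public class Scope implements AutoCloseable {
    private static final ThreadLocal<Deque<Scope>> SCOPES =
            ThreadLocal.withInitial(ArrayDeque::new);

    private final Deque<Runnable> deallocators = new ArrayDeque<>();

    public Scope() {
        SCOPES.get().push(this); // become the innermost scope on this thread
    }

    /** Called by allocations made while a scope is open on this thread. */
    public static void register(Runnable deallocator) {
        Scope current = SCOPES.get().peek();
        if (current != null) {
            current.deallocators.push(deallocator);
        }
    }

    @Override
    public void close() {
        SCOPES.get().pop();
        // Free everything allocated in this scope, newest first.
        while (!deallocators.isEmpty()) {
            deallocators.pop().run();
        }
    }
}
```

The appeal is that deallocation cost becomes deterministic and per-iteration, instead of depending on when phantom references happen to be enqueued.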
Not exactly. If you disable max physical bytes, it does not do the native call inside the synchronized block. It seems that this alone gets us most of the benefits. But I was still recommending we go with this because physical byte limits are enabled by default, and when enabled there are still significant advantages here. FWIW, the root of this issue is the deeplearning4j implementation of kmeans. I decided to re-implement that today and got another 200x improvement! I still suggest going with these changes because they seem like an improvement to me, but for me personally they are not as critical as they used to be.
Yes, not relying on the garbage collector is the way to gain performance. :) In any case, I think your code can produce resource starvation. Could you prove me otherwise? The following scenario appears plausible to me: …
How is this prevented from happening? This is exactly why I had to put a lock there in the first place. |
This is possible, but since the synchronized block is not fair anyway, I don't see why it matters which thread wins. If you are over-allocating, you are over-allocating, and OOM conditions seem inevitable to me. Unless we make the sleep time configurable or otherwise make this a queue which blocks until memory is available, I don't think you have really solved that concern anyway.
You're right, the order isn't specified, but it did solve the problem :) I suppose I'd go with …
If it were me, I would do the following: …
This would accomplish what you want, which is to not error out if memory can become available. It does change this class to be less fair, and instead just willing to block longer if needed until memory is available. But this implementation will always be unfair. I personally don't think that is a significant issue anyone would notice, but I tend to favor unfair designs based on my experience.
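The "block longer until memory is available" behavior being discussed could look roughly like this sketch (hypothetical names; the real code would trigger deallocation rather than only `System.gc()`): on pressure, retry with short sleeps until a deadline instead of failing immediately, accepting that whichever thread's retry lands first wins:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of blocking reservation with retry: unfair by design, but it only
// fails after memory has had a chance to be released.
public class BlockingReserver {
    private final long maxBytes;
    private final AtomicLong usedBytes = new AtomicLong();

    public BlockingReserver(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    public void release(long bytes) {
        usedBytes.addAndGet(-bytes);
    }

    /** Reserve bytes, retrying for up to {@code timeoutMillis} before giving up. */
    public boolean reserve(long bytes, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            long current = usedBytes.get();
            if (current + bytes <= maxBytes) {
                if (usedBytes.compareAndSet(current, current + bytes)) {
                    return true;
                }
                continue; // lost the CAS; retry immediately
            }
            if (System.currentTimeMillis() >= deadline) {
                return false; // out of time; in real code this would throw OOM
            }
            System.gc();       // give phantom references a chance to be enqueued
            Thread.sleep(10);  // back off before retrying
        }
    }
}
```

There is no queueing here, so ordering between waiting threads is unspecified, which is the unfairness trade-off acknowledged above.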
About … Actually, …
Yeah, I noticed that test during development, and yes it passes fine. |
I'm trying to test things out, but on your branch:

```java
KMeansClustering kMeansClustering = KMeansClustering.setup(500, 500, "euclidean");
List<Point> points = Point.toPoints(Nd4j.randn(500, 500));
ClusterSet clusterSet = kMeansClustering.applyTo(points);
```

Could you post some code that demonstrates the issue?
… `Pointer.deallocator()` to reduce contention (pull #232)
Your observation about …
@saudet Unfortunately my ML code right now is proprietary. I tried to create an example benchmark that you could use, but as I was building it, I started to witness what you did as well. But even in my company's application, where this branch does show significant benefit over the current release version, moving the native call before the synchronized block performs equally well for me. I suggest we just go with that for now (which is why I am closing this PR). Thanks for looking into this and making that change; I do think it provides a significant benefit on high core count systems.
Awesome! Thanks for the feedback and for testing this out!
@saudet give me your thoughts on this to resolve #231
My biggest concerns are around how I am approximating the physical bytes. If there is a better way to estimate the size, or if you think the sync interval needs to be different, let me know. For my use case I could be syncing every million or two. That said, these changes DRAMATICALLY speed up deeplearning4j's kmeans implementation for me.