Fix resource leak caused by not cleaning up unused threads #3734
We had a small resource leak in the code base: some thread pools were not shut down when they were no longer needed. As a result, their threads and the parent threads were kept alive, which led to hundreds of stale threads over the course of several days.
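The pattern behind the leak can be sketched roughly like this. All names here are hypothetical and do not match Bisq's actual classes; it only illustrates the shape of the bug and the fix:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the leak and its fix; names are made up.
class Connection {
    // One single-thread executor created per connection.
    private final ExecutorService sendExecutor =
            Executors.newSingleThreadExecutor();

    void send(Runnable task) {
        sendExecutor.submit(task);
    }

    // Before the patch, close() never touched the executor, so its worker
    // thread stayed alive (in WAITING state) after the connection was gone,
    // and the executor itself could never be GCed.
    void close() {
        sendExecutor.shutdownNow(); // the fix: release the worker thread
    }
}
```

With `close()` calling `shutdownNow()`, the worker thread terminates and the executor becomes eligible for garbage collection along with the connection.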
What was the memory footprint change after running with your patch? One leaked executor per connection does seem bad, but it would be good to understand the full impact.
Any additional information on which test cases may need to be re-run to verify everything works as expected if this goes into v1.2.4?
Any screenshots or tools we can use to help verify the new behavior?
I think this type of bug is way too easy to make right now. Does anyone have an RAII solution to these single thread pool use cases that could be added to guarantee that shutdownNow() is called for each of these resources so they can be GCed?
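One possible RAII-style answer, assuming a pre-Java-19 code base (Java 19+ makes `ExecutorService` itself `AutoCloseable`), is a small wrapper that guarantees `shutdownNow()` runs. This is a hypothetical sketch, not existing Bisq code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical AutoCloseable wrapper around a single-thread executor.
// try-with-resources then guarantees shutdownNow() is always called.
final class ScopedExecutor implements AutoCloseable {
    private final ExecutorService delegate =
            Executors.newSingleThreadExecutor();

    ExecutorService get() {
        return delegate;
    }

    @Override
    public void close() {
        delegate.shutdownNow(); // always runs when the scope exits
    }
}
```

Usage: `try (ScopedExecutor ex = new ScopedExecutor()) { ex.get().submit(task); }`. The caveat is that try-with-resources only covers lexically scoped lifetimes; for per-connection executors that outlive a scope, the connection's own `close()` would still have to call `close()` on the wrapper explicitly.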
Long term, it may make more sense to attach the per-connection executors to the manager object instead. Right now the threads grow with the number of connections which depends on the manager logic to decide how many connections to keep around.
This is much harder to understand from a performance point of view than a parent thread pool with a fixed size regardless of the number of connections.
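The manager-owned alternative could look something like this sketch (class and method names are invented for illustration):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: one fixed-size pool owned by the manager and shared
// by all connections, instead of one single-thread executor per connection.
class ConnectionManager {
    // Thread count stays bounded no matter how many connections exist.
    private final ExecutorService sharedPool = Executors.newFixedThreadPool(4);

    ExecutorService executorFor(Object connection) {
        return sharedPool; // connections borrow the manager's pool
    }

    void shutdown() {
        sharedPool.shutdownNow(); // a single cleanup point
    }
}
```

One trade-off to keep in mind: a per-connection single-thread executor serializes that connection's tasks, while a shared pool does not, so task ordering per connection would need to be preserved some other way if it matters.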
julianknutsen left a comment
BUNDLE_OF_ENVELOPES looks deprecated, so nothing should be scheduled on the executor, right? What are the backwards-compatibility guarantees here, and is there a point to even keeping this code path around?
I tested v1.2.4 without this patch and don't see the threads increasing with simple disconnect/reconnect. Is there something I need to do to exercise the code path?
Just as I attempted to get numbers, jProfiler crashed after a 48h test run, taking down my Bisq-under-test and the whole GNOME Shell with it. I did, however, see the fix doing its job. All in all, I decided not to waste another couple of days rerunning anything, but to put the PR up for merging, as this is a real bug that needs fixing.
Due to the executors not being shut down, these threads stayed alive and, with them, the parent threads per connection as well. Over a couple of days this behavior led to hundreds of (unused) threads. But only if there is high network load and the
A JVM profiler showed me a huge number of threads, all in the WAITING state. So I tracked down their source, found that there is indeed a bug causing a "thread leak", and fixed it. My guess is that with the recently forced update, the bundle-of-envelopes code finally became really active, as all clients now make use of it.
I honestly do not know whether this bug is responsible for some of the resource-related issues reported recently. There was incorrect behavior and it is fixed now. I would leave it to time to tell whether it had that great an impact in the first place.
As always with our P2P part, no exhaustive testing is possible right now. Since the change only cleans things up, there should be no need to redo any test-plan test cases. However, I suggest maybe redoing the trading tests just to be sure.
Well, take a JVM profiler, start up 1.2.3 and this fix, and compare the number of threads after 2 days of runtime. With the fix, there should be substantially fewer threads. Please note that I am not entirely confident that there isn't another "thread leak" somewhere, still. But that is for another day.
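Without a profiler, a rough version of the same comparison can be scripted with the standard `ThreadMXBean` API; this sketch just counts live JVM threads by state. A WAITING count that grows steadily over days in 1.2.3 but stays flat with the fix would show the same thing the profiler does:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Rough substitute for a profiler's thread view: count live threads
// that are currently in the given state.
class ThreadCensus {
    static long countByState(Thread.State state) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long count = 0;
        for (ThreadInfo info : bean.getThreadInfo(bean.getAllThreadIds())) {
            // Entries can be null if a thread died between the two calls.
            if (info != null && info.getThreadState() == state) {
                count++;
            }
        }
        return count;
    }
}
```

Logging `countByState(Thread.State.WAITING)` periodically in both builds gives a comparable time series.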
Agreed. Generally, thread handling and synchronization is a mess right now. Needs fixing sometime.
Yes, the capability BUNDLE_OF_ENVELOPES has been deprecated as we made it the default with the forced update. So the feature is still there and active with current clients, but the capability is not required anymore.
The code path only gets triggered if the throttle limits get hit (defined in
As said above, this stuff is hard to test. After initial tests I had either version run in production for days with a profiler attached...
julianknutsen left a comment
Luckily, verifying this didn't take the two days you suggested. I spent some time learning more about jProfiler views and updating the executor to set the thread name so it could be tracked. Here is the comparison before and after your patch, verifying that the threads no longer leak:
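For reference, naming the executor's threads can be done with a custom `ThreadFactory`; this is a minimal sketch (the class name and prefix are made up, not the change I actually made):

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Give an executor's threads a recognizable name so they stand out
// in jProfiler's thread view.
class NamedThreadFactory implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger();

    NamedThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r, prefix + "-" + counter.incrementAndGet());
        return t;
    }
}
```

It plugs in via `Executors.newSingleThreadExecutor(new NamedThreadFactory("connection-send"))`, after which leaked threads are easy to count by name prefix.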
From the heap inspection, it looks like each executor is about 300-600 bytes, so that is the magnitude of the leak on each disconnect that isn't cleaned up.
I ran this through a few localnet smoke tests that take nodes up and down to verify there are no egregious stack traces. As suggested, it is probably a good idea to rerun the trade tests that take nodes up and down if this is cherry-picked to the branch.
I audited the other executor shutdown paths. This one follows the same pattern as the other executor in the Connection object, so any bug in the shutdown path would likely have been triggered there already.