Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix resource leak caused by not cleaning up unused threads #3734

Merged
merged 1 commit into from Dec 4, 2019

Conversation

@freimair
Copy link
Member

freimair commented Dec 3, 2019

We had a small memory leak in the code base. Namely, some thread pools have not been shutdown when they have been no longer needed. Result was that the threads and the parent threads have been kept alive which lead to hundreds of stale threads over the course of several days.

We had a small memory leak in the code base. Namely, there have been some
threadpools in use but not shutdown when they have been no longer needed.
Result was that the threads and the parent threads have been kept alive
which lead to hundreds of stale threads over the course of several days.
@freimair freimair requested a review from ripcurlx as a code owner Dec 3, 2019
@ripcurlx

This comment has been minimized.

Copy link
Member

ripcurlx commented Dec 3, 2019

@freimair Could you extend the title a little bit, so it is more clear what kind of memory leak is fixed if someone is just looking through the PR titles (e.g. during release note creation)

Copy link
Member

ripcurlx left a comment

utACK

@freimair freimair changed the title Fix memory leak Fix memory leak caused by not cleaning up unused threads Dec 3, 2019
@freimair freimair mentioned this pull request Dec 3, 2019
@julianknutsen

This comment has been minimized.

Copy link
Contributor

julianknutsen commented Dec 3, 2019

What was the memory footprint change after running with your patch? One leaked executor per connection does seem bad, but it would be good to understand the full impact.

Any additional information on which test cases may need to be re-run to verify everything works as expected if this goes into v1.2.4?

Any screenshots or tools we can use to help verify the new behavior?

@julianknutsen

This comment has been minimized.

Copy link
Contributor

julianknutsen commented Dec 3, 2019

I think this type of bug is way too easy to make right now. Does anyone have an RAII solution to these single thread pool use cases that could be added to guarantee that shutdownNow() is called for each of these resources so they can be GCed?

Long term, it may make more sense to attach the per-connection executors to the manager object instead. Right now the threads grow with the number of connections which depends on the manager logic to decide how many connections to keep around.

This is much harder to understand from a performance point of view than a parent thread pool with a fixed size regardless of the number of connections.

Copy link
Contributor

julianknutsen left a comment

BUNDLE_OF_ENVELOPES looks deprecated so nothing should be scheduled on the executor, right?. What is the supported backwards-compatibility here and Is there a point to even keeping this code path around?

I tested v1.2.4 without this patch and don't see the threads increasing with simple disconnect/reconnect. Is there something I need to do to exercise the code path?

@freimair

This comment has been minimized.

Copy link
Member Author

freimair commented Dec 3, 2019

What was the memory footprint change after running with your patch?

Just as I attempted to get numbers, jProfiler crashed after a 48h test run and with it my bisq-under-test and the whole gnome shell. I did, however, see the fix doing its job. All in all, I decided to not waste another couple of days to rerun anything but put the PR up for merging as this is a real bug that needs fixing.

One leaked executor per connection does seem bad, but it would be good to understand the full impact.

due to the executors not being shut down, these threads stayed alive and with them, the parent threads per connection as well. Over a couple of days this behavior lead to hundreds of (unused) threads. But only if there is high network load and the Bundle_of_envelopes feature triggered.
Overall, the memory-leak part hasn't been that bad as it seems (there isn't that much change in a clients connection pool) but I have seen increased CPU usage as well - although I have no hard evidence that this is related.

A JVM profiler showed me that there has been a huge number of threads all in waiting-state. So I tracked down their source, found that there is indeed a bug causing a "thread-leak" and fixed it. My guess is that with the recently forced update, the bundle-of-envelopes stuff finally went really active as all clients made use of this.

I honestly do not know if that bug is responsible for some resource-related issues reported recently. There has been incorrect behavior and it is fixed now. I would leave it for time to tell if it has been that great of an impact in the first place.

Any additional information on which test cases may need to be re-run to verify everything works as expected if this goes into v1.2.4?

As always with our P2P part, there is no exhaustive testing possible right now. Since the change only cleanes up stuff, there should be no need for any testplan testcases to be redone. However, I suggest to maybe redo the trading tests just to be sure.

Any screenshots or tools we can use to help verify the new behavior?

Well take a JVM profiler, start up 1.2.3 and this fix and compare the number of threads after 2 days of runtime. With the fix, there should be substantially less threads. Please note that I am not entirely confident that there isn't another "thread-leak" somewhere, still. But that is for another day.

One leaked executor per connection does seem bad, but it would be good to understand the full impact.

Agreed. Generally, thread handling and synchronization is a mess right now. Needs fixing sometime.

@freimair freimair changed the title Fix memory leak caused by not cleaning up unused threads Fix resource leak caused by not cleaning up unused threads Dec 3, 2019
@freimair

This comment has been minimized.

Copy link
Member Author

freimair commented Dec 4, 2019

BUNDLE_OF_ENVELOPES looks deprecated so nothing should be scheduled on the executor, right?. What is the supported backwards-compatibility here and Is there a point to even keeping this code path around?

yes, the capability "BUNDLE_OF_ENVELOPES" has been deprecated as we made it default with the forced update. So the feature is still there and active with current clients, but the capability is not required anymore.

I tested v1.2.4 without this patch and don't see the threads increasing with simple disconnect/reconnect. Is there something I need to do to exercise the code path?

The code path only gets triggered if the throttle limits get hit (defined in

connectionConfig = new ConnectionConfig(MSG_THROTTLE_PER_SEC, MSG_THROTTLE_PER_10_SEC, SEND_MSG_THROTTLE_TRIGGER, SEND_MSG_THROTTLE_SLEEP);
)

as said above, this stuff is hard to test. After initial tests I had either version run in production for days with a profiler attached...

Copy link
Contributor

julianknutsen left a comment

ACK

Luckily verifying this didn't take the two days you suggested. I spent some time learning more about jprofiler views and updating the executor to set the thread name so it could be tracked. Here is the comparison before and after your patch to verify the threads no longer leak:

Before: (bundleSender > # connected peers) & no dead thread
9V12

After: (bundleSender == # connected peers) & 1 dead thread
dead_thread

From the heap inspection, it looks like each executor is about 300-600 bytes each so that is the magnitude of the leak on each disconnect that isn't cleaned up.

Testing

I ran this through a few localnet smoke tests that take nodes up and down cases to verify no egregious stack traces. As suggested, it is probably a good idea to rerun trade tests that take nodes up and down if this is cherry-picked to the branch.

I audited the other executor shutdown paths. It follows the same pattern as the other executor in the Connection object so any bug in the shutdown path would likely have triggered there already.

@julianknutsen julianknutsen mentioned this pull request Dec 4, 2019
@ripcurlx ripcurlx merged commit d0bed93 into bisq-network:master Dec 4, 2019
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@freimair freimair deleted the freimair:fix_thread_leak branch Dec 4, 2019
@freimair freimair restored the freimair:fix_thread_leak branch Dec 6, 2019
@freimair freimair deleted the freimair:fix_thread_leak branch Dec 6, 2019
@ripcurlx ripcurlx mentioned this pull request Dec 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.