-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Issue 10010][Client] fixed memory leak #10028
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this, can you add some unit test around it? Like test throwing exception while creating pulsarclient and verify resources are properly released.
state.set(State.Open); | ||
} catch (Throwable t) { | ||
shutdown(); | ||
shutdownEventLoopGroup(eventLoopGroup); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this redundant? We already call shutdownEventLoopGroup(eventLoopGroup)
in shutDown()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this call is for the eventLoopGroup passed as a constructor argument. if an exception occurs before we assign it to our class member, we will need to close the parameter. once assigned, they will be the same and hence I have added a check whether it's already shutdown or not
try { | ||
// Shutting down eventLoopGroup separately because in some cases, cnxPool might be using different | ||
// eventLoopGroup. | ||
shutdownEventLoopGroup(eventLoopGroup); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If connectionPool and PulsarClient is using the same eventLoopGroup, this might cause call shutDown on it multiple times, not sure what will happen in this situation? Probably move it to after closing cnxPool
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't cause any issue as I am checking for already shutdown case. If so, I will not attempt to shut it down. Though I can move it below if you suggest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add some tests for the new change to avoid the regression?
@codelipenghui @MarvinCai I have added some tests. I tried to run some of the failed tests from the previous build but they passed on my local system |
/pulsarbot run-failure-checks |
/pulsarbot run-failure-checks |
pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarClientImpl.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, the shutdown of a Producer will cause other Consumers and Producers to also shut down. This will be incompatible with the current behavior. @codelipenghui
@315157973 this is the elaborated version of your previous comment right? Now, we are not calling the close method so producers and consumers won't be shut down. tests are still failing :( |
lookup.close(); | ||
cnxPool.close(); | ||
timer.stop(); | ||
externalExecutorProvider.shutdownNow(); | ||
internalExecutorService.shutdownNow(); | ||
conf.getAuthentication().close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for the leak is that when a close is abnormal, the following close will not be executed
So , we just need to ensure that these are closed normally. But now you shutdown the eventLoopGroup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what should be done then? How does close get invoked if there occurs some exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abhilashmandaliya Sorry I missed that, in some case the eventloopgroup is passed in when constructing the PulsarClient, so it's not a resource created or owned by the PulsarClient and it might be used somewhere else. In this case PulsarClient shouldn't just shut the eventloopgroup down when closing itself which might break other component using it. So we should check if the eventloopgroup is created by this PulsarClient before trying to shut it down, I think a simple boolean flag should be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarvinCai @315157973
if we go with the boolean flag option, then the same needs to be passed to the ConnectionPool as it is also using the same eventLoopGroup. Putting a check on cnxPool.close at the caller site is not safe as ConnectionPool.close is closing some other resources as well. Adding such boolean parameter to the public constructor looks odd. So we can create a protected or package private constructor in ConnectionPool which will accept this new boolean param. Does that look good to you people or you have some other better choice here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed with Abhilash, we think it is reasonable to just close whatever resources the PulsarClientImpl create and own when constructor error out, e.g. a cnxPool or eventLoopGroup & cnxPool, and user should be responsible to close whatever resources they passed in the PulsarClientImpl's constructor when catching the exception. We will update the documentation of the API to remind user.
With this change, situation should only be better in exception case since now the PulsarClientImpl constructor is not closing any resources if there's an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarvinCai @315157973 I have made the changes as we discussed. Also updated the java doc for the same. kindly review it and let me know if any further change is required.
pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarClientImpl.java
Show resolved
Hide resolved
e33793a
to
8e798ae
Compare
@@ -281,7 +281,8 @@ static ClientBuilder builder() { | |||
* Perform immediate shutdown of PulsarClient. | |||
* | |||
* <p>Release all the resources and close all the producer, consumer and reader instances without waiting | |||
* for ongoing operations to complete. | |||
* for ongoing operations to complete. Resources passed from the outside like EventLoopGroup or ConnectionPool | |||
* will not be closed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EventLoopGroup
and ConnectionPool
classes are not part of the public API (even though they can be passed to a PulsarClientImpl
constructor, which is not public), so they should not be mentioned here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@merlimat @MarvinCai was suggesting to document it so I added it there. Any alternative or we don't need to mention that at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PulsarClientImpl's constructors are actually public(probably shouldn't be as we don't expect user direct create one themself?)
@abhilashmandaliya I was thinking about adding doc on PulsarClientImpl's constructor, but as it not intended to be used directly by user. And seems the ClientBuilder doesn't support pass in eventloopgroup and cnxpool for now, it doesn't make much sense to tell user about the internal of pulsarclient, we probably don't need that doc for now, and if in the future we support pass in eventloopgroup/cnxpool we can add them then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alright. I have reverted the java doc comment. Hope tests pass this time 🤞
/pulsarbot run-failure-checks |
1 similar comment
/pulsarbot run-failure-checks |
62501f5
to
ce71b40
Compare
/pulsarbot run-failure-checks |
ce71b40
to
6844d80
Compare
/pulsarbot run-failure-checks |
6844d80
to
7b8eb75
Compare
/pulsarbot run-failure-checks |
2 similar comments
/pulsarbot run-failure-checks |
/pulsarbot run-failure-checks |
trying to close all the resources even if previous one fails during close/shutdown code clean up as per the idea suggestions
Co-authored-by: Matteo Merli <mmerli@apache.org>
This reverts commit 3c75a2c.
Potential reason could be: multiple clients are closed and close method closes more resources now
7b8eb75
to
4b78b94
Compare
/pulsarbot run-failure-checks |
This patch does not apply to branch-7.2 |
trying to close all the resources even if previous one fails during close/shutdown code clean up as per the idea suggestions Fixes apache#10010 Fixed memory leak caused by reception thrown from the PulsarClientImpl constructor in some cases. Resources were not getting closed which used to end up throwing OOM if user code tries to create a client unless succeeded. Handled exception in the constructor with try-catch and updated shutdown method to close all resources correctly. Previously, it was not attempting to close further resource if the previous one failed. Now it will try to close all of them individully. This change is a trivial rework / code cleanup without any test coverage. - Dependencies (does it add or upgrade a dependency): (yes / no): no - The public API: (yes / no): no - The schema: (yes / no / don't know): no - The default values of configurations: (yes / no): no - The wire protocol: (yes / no): no - The rest endpoints: (yes / no): no - The admin cli options: (yes / no): no - Anything that affects deployment: (yes / no / don't know): no - Does this pull request introduce a new feature? (yes / no): no - If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented): not applicable - If a feature is not applicable for documentation, explain why?: nothing to concerned with the end-user - If a feature is not documented yet in this PR, please create a followup issue for adding the documentation: not required (cherry picked from commit 197bd93)
trying to close all the resources even if previous one fails during close/shutdown code clean up as per the idea suggestions Fixes apache#10010 Fixed memory leak caused by reception thrown from the PulsarClientImpl constructor in some cases. Resources were not getting closed which used to end up throwing OOM if user code tries to create a client unless succeeded. Handled exception in the constructor with try-catch and updated shutdown method to close all resources correctly. Previously, it was not attempting to close further resource if the previous one failed. Now it will try to close all of them individully. This change is a trivial rework / code cleanup without any test coverage. - Dependencies (does it add or upgrade a dependency): (yes / no): no - The public API: (yes / no): no - The schema: (yes / no / don't know): no - The default values of configurations: (yes / no): no - The wire protocol: (yes / no): no - The rest endpoints: (yes / no): no - The admin cli options: (yes / no): no - Anything that affects deployment: (yes / no / don't know): no - Does this pull request introduce a new feature? (yes / no): no - If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented): not applicable - If a feature is not applicable for documentation, explain why?: nothing to concerned with the end-user - If a feature is not documented yet in this PR, please create a followup issue for adding the documentation: not required (cherry picked from commit 197bd93)
trying to close all the resources even if previous one fails during close/shutdown
code clean up as per the idea suggestions
Fixes #10010
Motivation
Fixed memory leak caused by reception thrown from the PulsarClientImpl constructor in some cases. Resources were not getting closed which used to end up throwing OOM if user code tries to create a client unless succeeded.
Modifications
Handled exception in the constructor with try-catch and updated shutdown method to close all resources correctly. Previously, it was not attempting to close further resource if the previous one failed. Now it will try to close all of them individully.
This change is a trivial rework / code cleanup without any test coverage.
Does this pull request potentially affect one of the following parts:
Documentation