Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue 10010][Client] fixed memory leak #10028

Merged
merged 10 commits into from
Apr 13, 2021

Conversation

abhilashmandaliya
Copy link
Contributor

trying to close all the resources even if previous one fails during close/shutdown
code clean up as per the idea suggestions

Fixes #10010

Motivation

Fixed memory leak caused by reception thrown from the PulsarClientImpl constructor in some cases. Resources were not getting closed which used to end up throwing OOM if user code tries to create a client unless succeeded.

Modifications

Handled exception in the constructor with try-catch and updated shutdown method to close all resources correctly. Previously, it was not attempting to close further resource if the previous one failed. Now it will try to close all of them individully.

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no): no
  • The public API: (yes / no): no
  • The schema: (yes / no / don't know): no
  • The default values of configurations: (yes / no): no
  • The wire protocol: (yes / no): no
  • The rest endpoints: (yes / no): no
  • The admin cli options: (yes / no): no
  • Anything that affects deployment: (yes / no / don't know): no

Documentation

  • Does this pull request introduce a new feature? (yes / no): no
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented): not applicable
  • If a feature is not applicable for documentation, explain why?: nothing to concerned with the end-user
  • If a feature is not documented yet in this PR, please create a followup issue for adding the documentation: not required

Copy link
Contributor

@MarvinCai MarvinCai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, can you add some unit test around it? Like test throwing exception while creating pulsarclient and verify resources are properly released.

state.set(State.Open);
} catch (Throwable t) {
shutdown();
shutdownEventLoopGroup(eventLoopGroup);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this redundant? We already call shutdownEventLoopGroup(eventLoopGroup) in shutDown()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this call is for the eventLoopGroup passed as a constructor argument. if an exception occurs before we assign it to our class member, we will need to close the parameter. once assigned, they will be the same and hence I have added a check whether it's already shutdown or not

try {
// Shutting down eventLoopGroup separately because in some cases, cnxPool might be using different
// eventLoopGroup.
shutdownEventLoopGroup(eventLoopGroup);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If connectionPool and PulsarClient is using the same eventLoopGroup, this might cause call shutDown on it multiple times, not sure what will happen in this situation? Probably move it to after closing cnxPool?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't cause any issue as I am checking for already shutdown case. If so, I will not attempt to shut it down. Though I can move it below if you suggest.

Copy link
Contributor

@codelipenghui codelipenghui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add some tests for the new change to avoid the regression?

@abhilashmandaliya
Copy link
Contributor Author

Could you please add some tests for the new change to avoid the regression?

@codelipenghui @MarvinCai I have added some tests.

I tried to run some of the failed tests from the previous build but they passed on my local system

@codelipenghui
Copy link
Contributor

/pulsarbot run-failure-checks

@MarvinCai
Copy link
Contributor

/pulsarbot run-failure-checks

Copy link
Contributor

@315157973 315157973 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, the shutdown of a Producer will cause other Consumers and Producers to also shut down. This will be incompatible with the current behavior. @codelipenghui

@abhilashmandaliya
Copy link
Contributor Author

abhilashmandaliya commented Mar 26, 2021

Sorry, the shutdown of a Producer will cause other Consumers and Producers to also shut down. This will be incompatible with the current behavior. @codelipenghui

@315157973 this is the elaborated version of your previous comment right? Now, we are not calling the close method so producers and consumers won't be shut down.

tests are still failing :(

Comment on lines 672 to 690
lookup.close();
cnxPool.close();
timer.stop();
externalExecutorProvider.shutdownNow();
internalExecutorService.shutdownNow();
conf.getAuthentication().close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for the leak is that when a close is abnormal, the following close will not be executed
So , we just need to ensure that these are closed normally. But now you shutdown the eventLoopGroup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what should be done then? How does close get invoked if there occurs some exception?

Copy link
Contributor

@MarvinCai MarvinCai Mar 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abhilashmandaliya Sorry I missed that, in some case the eventloopgroup is passed in when constructing the PulsarClient, so it's not a resource created or owned by the PulsarClient and it might be used somewhere else. In this case PulsarClient shouldn't just shut the eventloopgroup down when closing itself which might break other component using it. So we should check if the eventloopgroup is created by this PulsarClient before trying to shut it down, I think a simple boolean flag should be enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarvinCai @315157973
if we go with the boolean flag option, then the same needs to be passed to the ConnectionPool as it is also using the same eventLoopGroup. Putting a check on cnxPool.close at the caller site is not safe as ConnectionPool.close is closing some other resources as well. Adding such boolean parameter to the public constructor looks odd. So we can create a protected or package private constructor in ConnectionPool which will accept this new boolean param. Does that look good to you people or you have some other better choice here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with Abhilash, we think it is reasonable to just close whatever resources the PulsarClientImpl create and own when constructor error out, e.g. a cnxPool or eventLoopGroup & cnxPool, and user should be responsible to close whatever resources they passed in the PulsarClientImpl's constructor when catching the exception. We will update the documentation of the API to remind user.
With this change, situation should only be better in exception case since now the PulsarClientImpl constructor is not closing any resources if there's an exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarvinCai @315157973 I have made the changes as we discussed. Also updated the java doc for the same. kindly review it and let me know if any further change is required.

@@ -281,7 +281,8 @@ static ClientBuilder builder() {
* Perform immediate shutdown of PulsarClient.
*
* <p>Release all the resources and close all the producer, consumer and reader instances without waiting
* for ongoing operations to complete.
* for ongoing operations to complete. Resources passed from the outside like EventLoopGroup or ConnectionPool
* will not be closed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EventLoopGroup and ConnectionPool classes are not part of the public API (even though they can be passed to a PulsarClientImpl constructor, which is not public), so they should not be mentioned here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merlimat @MarvinCai was suggesting to document it so I added it there. Any alternative or we don't need to mention that at all?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PulsarClientImpl's constructors are actually public(probably shouldn't be as we don't expect user direct create one themself?)
@abhilashmandaliya I was thinking about adding doc on PulsarClientImpl's constructor, but as it not intended to be used directly by user. And seems the ClientBuilder doesn't support pass in eventloopgroup and cnxpool for now, it doesn't make much sense to tell user about the internal of pulsarclient, we probably don't need that doc for now, and if in the future we support pass in eventloopgroup/cnxpool we can add them then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright. I have reverted the java doc comment. Hope tests pass this time 🤞

@abhilashmandaliya
Copy link
Contributor Author

/pulsarbot run-failure-checks

1 similar comment
@abhilashmandaliya
Copy link
Contributor Author

/pulsarbot run-failure-checks

@abhilashmandaliya
Copy link
Contributor Author

/pulsarbot run-failure-checks

@abhilashmandaliya
Copy link
Contributor Author

/pulsarbot run-failure-checks

@abhilashmandaliya
Copy link
Contributor Author

/pulsarbot run-failure-checks

2 similar comments
@abhilashmandaliya
Copy link
Contributor Author

/pulsarbot run-failure-checks

@abhilashmandaliya
Copy link
Contributor Author

/pulsarbot run-failure-checks

@abhilashmandaliya
Copy link
Contributor Author

/pulsarbot run-failure-checks

@MarvinCai
Copy link
Contributor

@codelipenghui @315157973 @merlimat PTAL

@codelipenghui codelipenghui merged commit 197bd93 into apache:master Apr 13, 2021
@eolivelli
Copy link
Contributor

This patch does not apply to branch-7.2

eolivelli pushed a commit to datastax/pulsar that referenced this pull request May 17, 2021
trying to close all the resources even if previous one fails during close/shutdown
code clean up as per the idea suggestions

Fixes apache#10010

Fixed memory leak caused by reception thrown from the PulsarClientImpl constructor in some cases. Resources were not getting closed which used to end up throwing OOM if user code tries to create a client unless succeeded.

Handled exception in the constructor with try-catch and updated shutdown method to close all resources correctly. Previously, it was not attempting to close further resource if the previous one failed. Now it will try to close all of them individully.

This change is a trivial rework / code cleanup without any test coverage.

  - Dependencies (does it add or upgrade a dependency): (yes / no): no
  - The public API: (yes / no): no
  - The schema: (yes / no / don't know): no
  - The default values of configurations: (yes / no): no
  - The wire protocol: (yes / no): no
  - The rest endpoints: (yes / no): no
  - The admin cli options: (yes / no): no
  - Anything that affects deployment: (yes / no / don't know): no

  - Does this pull request introduce a new feature? (yes / no): no
  - If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented): not applicable
  - If a feature is not applicable for documentation, explain why?: nothing to concerned with the end-user
  - If a feature is not documented yet in this PR, please create a followup issue for adding the documentation: not required

(cherry picked from commit 197bd93)
eolivelli pushed a commit to datastax/pulsar that referenced this pull request May 17, 2021
trying to close all the resources even if previous one fails during close/shutdown
code clean up as per the idea suggestions

Fixes apache#10010

Fixed memory leak caused by reception thrown from the PulsarClientImpl constructor in some cases. Resources were not getting closed which used to end up throwing OOM if user code tries to create a client unless succeeded.

Handled exception in the constructor with try-catch and updated shutdown method to close all resources correctly. Previously, it was not attempting to close further resource if the previous one failed. Now it will try to close all of them individully.

This change is a trivial rework / code cleanup without any test coverage.

  - Dependencies (does it add or upgrade a dependency): (yes / no): no
  - The public API: (yes / no): no
  - The schema: (yes / no / don't know): no
  - The default values of configurations: (yes / no): no
  - The wire protocol: (yes / no): no
  - The rest endpoints: (yes / no): no
  - The admin cli options: (yes / no): no
  - Anything that affects deployment: (yes / no / don't know): no

  - Does this pull request introduce a new feature? (yes / no): no
  - If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented): not applicable
  - If a feature is not applicable for documentation, explain why?: nothing to concerned with the end-user
  - If a feature is not documented yet in this PR, please create a followup issue for adding the documentation: not required

(cherry picked from commit 197bd93)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Client] Potential resource leak in Java PulsarClient
7 participants