ZOOKEEPER-2743: Netty connection leaks JMX connection bean. #211

Closed
hanm wants to merge 1 commit into master from hanm:ZOOKEEPER-2743

Conversation

2 participants
@hanm
Contributor

hanm commented Apr 1, 2017

See https://issues.apache.org/jira/browse/ZOOKEEPER-2743 for details on the symptom and diagnostic.

There are many ways to fix this. For example, we could enforce a certain ordering of the close operations to eliminate the race condition, but that would incur a non-trivial performance penalty due to the required synchronization. I think the solution in this patch is simple and effective: it observes that the bean unregister call is idempotent and cheap, so we always unregister the connection bean when closing the connection, ensuring the bean is unregistered regardless of how the races play out.

I've also stress tested this solution on an internal Jenkins setup, which has shown no failures of https://issues.apache.org/jira/browse/ZOOKEEPER-2707 for the past two days.

As a side note, the NIO connection could in theory suffer from the same problem, but I have been unable to reproduce the race with the existing unit tests, so I am leaving NIO as is for now.

ZOOKEEPER-2743: Netty connection leaks JMX connection bean upon connection close in certain race conditions.

Always unregister connection upon close to prevent connection bean leak under certain race conditions.
// ZOOKEEPER-2743:
// Always unregister connection upon close to prevent
// connection bean leak under certain race conditions.
factory.unregisterConnection(this);
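
For a bit more context, here is a minimal, self-contained sketch of the approach. The class and field names below are illustrative stand-ins, not the actual NettyServerCnxn/NettyServerCnxnFactory code; only the factory.unregisterConnection(this) call mirrors the patch above.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only; the real implementation lives in
// NettyServerCnxn and NettyServerCnxnFactory.
class CnxnFactorySketch {
    // Stand-in for the JMX connection bean registry.
    private final Set<Object> connectionBeans = ConcurrentHashMap.newKeySet();

    void registerConnection(Object cnxn) {
        connectionBeans.add(cnxn);
    }

    // Idempotent: removing an absent entry is a no-op, so calling this on
    // every close() is safe even if the bean was never registered or was
    // already unregistered by a racing close().
    void unregisterConnection(Object cnxn) {
        connectionBeans.remove(cnxn);
    }
}

class CnxnSketch {
    private final CnxnFactorySketch factory;

    CnxnSketch(CnxnFactorySketch factory) {
        this.factory = factory;
    }

    void close() {
        // ZOOKEEPER-2743: always unregister the connection bean, no matter
        // which racing code path reached close() first.
        factory.unregisterConnection(this);
        // ... the rest of the teardown (closing the channel, removing the
        // connection from the factory's tracking set) follows here.
    }
}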

@rakeshadr

rakeshadr Apr 6, 2017

Contributor

Any chance of another race condition? [1] Session finalization is in progress; [2] the client cnxn is closed due to server shutdown. Assume step [1] is in progress and has not yet registered the connection bean, while step [2] runs and, per this patch, unregisters the bean regardless of whether the cnxn exists in the factory, then returns. Now assume step [1] continues and registers the connection bean. That would leak the cnxn bean again, right?

@hanm

hanm Apr 6, 2017

Contributor

That is fine. I might be able to provide a formal verification of this later, but here is a quick proof for that case:

  • Assume close is called before the connection bean is registered [1].
  • The bean unregister in that close call is a no-op because the bean is not registered, but the channel is closed as part of the close call.
  • Now, before session finalization returns, an exception will be thrown because the channel is closed, probably here [2].
  • As part of handling that exception, close is called again. This time it will unregister the bean (before this fix it would not, which is how this edge case was missed).

Basically we are safe because close can be called multiple times, and at least one close call is guaranteed to happen after the connection bean is registered.

[1] https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java#L699
[2] https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java#L716
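
To make that argument concrete, here is a small, runnable toy model of the interleaving (hypothetical names, not ZooKeeper code): because the unregister is idempotent, whichever close call runs after registration removes the bean.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the race discussed above, not actual ZooKeeper code.
public class IdempotentUnregisterDemo {
    static final Set<String> registeredBeans = ConcurrentHashMap.newKeySet();

    static void registerConnection(String cnxn) {
        registeredBeans.add(cnxn);     // stands in for JMX bean registration
    }

    static void unregisterConnection(String cnxn) {
        registeredBeans.remove(cnxn);  // no-op if absent, i.e. idempotent
    }

    static void close(String cnxn) {
        unregisterConnection(cnxn);    // the ZOOKEEPER-2743 fix: always unregister
        // ... channel close and other cleanup would happen here ...
    }

    public static void main(String[] args) {
        String cnxn = "session-0x1";
        close(cnxn);               // [2] shutdown closes first: bean not yet registered, no-op
        registerConnection(cnxn);  // [1] session finalization continues and registers the bean
        close(cnxn);               // exception path triggers a second close: bean is removed now
        System.out.println("leaked beans: " + registeredBeans);  // prints: leaked beans: []
    }
}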

@hanm

hanm Apr 6, 2017

Contributor

Also, it is guaranteed that close will be called at least once after the connection bean is registered (e.g. when the server shuts down, the cnxn factory iterates through every connection and closes it).
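
A rough sketch of that shutdown path, with illustrative names rather than the actual NettyServerCnxnFactory code:

import java.util.List;

// Illustrative only; the real factory keeps its own connection set.
class FactoryShutdownSketch {
    interface Connection {
        // close() unconditionally unregisters the connection bean (ZOOKEEPER-2743).
        void close();
    }

    // On shutdown the factory walks every tracked connection and closes it,
    // so each connection sees at least one close() after its bean could
    // have been registered.
    static void closeAll(List<Connection> connections) {
        for (Connection cnxn : connections) {
            cnxn.close();
        }
    }
}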

@rakeshadr

rakeshadr Apr 10, 2017

Contributor

Agreed, Michael. I've noticed the NIOServerCnxnFactory#removeCnxn() execution path; we should correct that part as well, as you mentioned in the PR comment. Could you please raise a new jira to track the changes?

+1, the Netty code changes look good to me.

@hanm

hanm Apr 10, 2017

Contributor

Could you please raise a new jira to track the changes?

Created https://issues.apache.org/jira/browse/ZOOKEEPER-2751 for this. Merging this PR now.

asfgit closed this in b3ef40b Apr 10, 2017

asfgit pushed a commit that referenced this pull request Apr 10, 2017

ZOOKEEPER-2743: Netty connection leaks JMX connection bean.

Author: Michael Han <hanm@apache.org>

Reviewers: Rakesh Radhakrishnan <rakeshr@apache.org>

Closes #211 from hanm/ZOOKEEPER-2743

(cherry picked from commit b3ef40b)
Signed-off-by: Michael Han <hanm@apache.org>

asfgit pushed a commit that referenced this pull request Apr 17, 2017

ZOOKEEPER-2743: Netty connection leaks JMX connection bean.

Author: Michael Han <hanm@apache.org>

Reviewers: Rakesh Radhakrishnan <rakeshr@apache.org>

Closes #211 from hanm/ZOOKEEPER-2743

lvfangmin pushed a commit to lvfangmin/zookeeper that referenced this pull request Jun 17, 2018

ZOOKEEPER-2743: Netty connection leaks JMX connection bean.

Author: Michael Han <hanm@apache.org>

Reviewers: Rakesh Radhakrishnan <rakeshr@apache.org>

Closes apache#211 from hanm/ZOOKEEPER-2743