Skip to content

Conversation

@anmolnar
Copy link
Contributor

@anmolnar anmolnar commented Nov 12, 2018

Fixes the Java 11 build issue.

Issue

NIOServerCnxnFactory doesn't properly close the socket when the shutdown() is called before the factory has even started. This is mostly the case in tests which use QuorumUtil that creates multiple QuorumPeers when instantiated without starting them and when startAll() gets called it shuts down the previous ones.

Reason

Selectors which have been registered with the socket must be closed in order to properly release the socket. NIOServerCnxnFactory registers selectors on instantiation, but only releases them in the thread run() method. So, if factory doesn't get started, it won't release the registered selectors.

This wasn't the issue before Java 11 and probably caused by:
https://www.oracle.com/technetwork/java/javase/11-relnote-issues-5012449.html#JDK-8198562

Also this is not an issue when ZooKeeper is used as a separate process (not embedded), because on shutdown the entire JVM stops anyway.

Resolution

I decided to try fixing the issue in the connection factory instead of fixing the tests only, because originally it's a bug in the way factory works. Resolution is to open selectors in lazy way: only when accept and selector thread starts, so they don't need to be released if the thread was not even started.

@anmolnar anmolnar changed the title ZOOKEEPER-1441 Some test cases are failing because Port bind issue. ZOOKEEPER-1441 - JAVA 11 - Some test cases are failing because Port bind issue. Nov 12, 2018
Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for fixing this problem.
It may be a nuisance in tests of down stream applications.
I left a couple of questions

}
}
} catch (IOException e) {
LOG.error("Exception when running accept thread. Unable to register selector?", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we abort the server?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Any suggestion how to make it properly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anmolnar sorry for late reply.
I think it will be tricky. I will spend some time and try to find a suggestion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for delay. I did not have time to find a good solution, I don't want to block this change

private class AcceptThread extends AbstractSelectThread {
private final ServerSocketChannel acceptSocket;
private final SelectionKey acceptKey;
private SelectionKey acceptKey;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this variable is nullable.
Shouldn't we add some null check at every access?

Copy link
Contributor Author

@anmolnar anmolnar Nov 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. However the only access in this class is called from the select() method which is called only when the registration was successful, so I think we're fine.

@anmolnar
Copy link
Contributor Author

@hanm @lvfangmin @phunt
Please review this, looks like a quite easy win to get Java11 builds green. Thanks!

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

}
}
} catch (IOException e) {
LOG.error("Exception when running accept thread. Unable to register selector?", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are binding a port that's already in use here then I think we'll have a problem because the expectation here is the acceptor thread will always be available unless explicitly being shut down by caller (for this reason we caught but ignored all checked and unchecked exception.). The problem is the control flow does not reach higher level from within this acceptor thread - thus if we have an instantiated but stopped NIOServerCnxnFactory due to acceptor thread exceptions, the entire zk process could end up in a weird state. Previous code does not have this issue because it tries to bind port early and complain if it can't such that caller would catch the issue earlier before the acceptor threads started. This should be easily testable if we spin up a ZK server with an unavailable port with this fix and see what happens.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also curious why 3.4 does not have this test issue. Technically 3.4 works similar with master and 3.5 w.r.t. the selector registration - both register selector before the server factory is 'started' (though in 3.5 and trunk, we made NIOServerCnxn multi-threaded`).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3.4 doesn't have this issue, because the selector is explicitly closed in the shutdown method. In 3.5 and master only the owning thread (accept) is able to close it, but it won't, if it doesn't even started.

I'll do the test that you suggested, however I'm not sure how port binding is related to selector registration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2018-11-20 15:12:15,758 [myid:] - INFO  [main:NIOServerCnxnFactory@685] - binding to port 0.0.0.0/0.0.0.0:2181
2018-11-20 15:12:15,760 [myid:] - ERROR [main:ZooKeeperServerMain@87] - Unexpected exception, exiting abnormally
java.net.BindException: Address already in use
	at sun.nio.ch.Net.bind0(Native Method)
	at sun.nio.ch.Net.bind(Net.java:433)
	at sun.nio.ch.Net.bind(Net.java:425)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:67)
	at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:686)
	at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:157)
	at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:110)
	at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:68)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:133)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:87)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, please ignore the port binding part of my statement previously. I was actually referring to port selection. Before this fix, we have selector registration in constructor of AcceptorThread:
this.acceptKey = acceptSocket.register(selector, SelectionKey.OP_ACCEPT);
If we get an exception here, we will bail out and zk server will not start.

If we selector registration inside AcceptorThread and got an exception, then the acceptor thread will shutdown, but the caller will not be aware of this so the zk server could be in a weird state. That's my only concern of this patch. Does this sound a reasonable concern?

…et. Stop threads before reconfig to properly close sockets.
@anmolnar
Copy link
Contributor Author

anmolnar commented Nov 21, 2018

@eolivelli @hanm Now, an entirely different approach. Solving 2 issues at the same time:

  1. Close selector in shutdown() when Accept and Select threads haven't been started yet. This will keep the original behaviour of eagerly registering the selector in the constructor and addresses your concerns.
  2. Turned out that we have a similar issue with Reconfig: ReconfigTest.testPortChangeToBlockedPort failed, becuase it expects the original port to be released even if the new port cannot be bound. Sounds like a reasonable expectation and the exception handler in the reconfigure() method confirms this. Hence I fixed the code, not the test.

@hanm
Copy link
Contributor

hanm commented Nov 21, 2018

This will keep the original behaviour of eagerly registering the selector in the constructor and addresses your concerns.

LGTM, thanks for doing this!

ReconfigTest.testPortChangeToBlockedPort failed, becuase it expects the original port to be released even if the new port cannot be bound.

The proposed change looks reasonable. Though I am wondering why the current logic would fail this test: if new port fails to bind, the old port will be closed as part of tryClose(oldSS); in the catch block. Mind to elaborate a little bit regarding which cases the current code fails to close the old socket?

@anmolnar
Copy link
Contributor Author

@hanm The issue with Java 11 builds is the refactoring that was made in NIO from Java 11:
https://www.oracle.com/technetwork/java/javase/11-relnote-issues-5012449.html#JDK-8198562
The key part is:

Closing a connected SocketChannel that is registered with a Selector will now consistently delay closing the underlying connection until the closed channel is flushed from all selectors that it is registered with.

In other words, we have to close all registered Selectors to properly close the socket. Therefore tryClose(oldSS) on its own is not enough.

@hanm
Copy link
Contributor

hanm commented Nov 21, 2018

@anmolnar So if I understand this correctly, this test ReconfigTest.testPortChangeToBlockedPort will consistently failing under Java 11, because this.ss = ServerSocketChannel.open(); will prevent the socket close in first tryClose(oldSS), which is why you moved the code block of this.ss = ServerSocketChannel.open(); after tryClose(oldSS), right?

@anmolnar
Copy link
Contributor Author

retest this please

@anmolnar
Copy link
Contributor Author

anmolnar commented Nov 21, 2018

@hanm Not exactly.
The test case is about reconfig the server to a port which is unavailable (blocked). Steps:

  1. Bind the server to "oldClientPort"
  2. Bind "newClientPort" with custom socket
  3. Reconfig the server to bind "newClientPort" - fails
  4. Check if "oldClientPort" has become available
    ...

Step 4) fails, because of the aforementioned reason: reconfig fails to bind the port, throws exception, but exception handler doesn't close the selector, only the socket.
I do things in swapped order: close the thread and socket first, then try to bind the new port.

@anmolnar
Copy link
Contributor Author

retest this please

@hanm
Copy link
Contributor

hanm commented Nov 22, 2018

@anmolnar gotcha. +1, will commit this when i have a chance.

@asfgit asfgit closed this in c3babb9 Nov 23, 2018
asfgit pushed a commit that referenced this pull request Nov 23, 2018
…nd issue.

Fixes the Java 11 build issue.

**Issue**

`NIOServerCnxnFactory` doesn't properly close the socket when the shutdown() is called before the factory has even started. This is mostly the case in tests which use `QuorumUtil` that creates multiple QuorumPeers when instantiated without starting them and when `startAll()` gets called it shuts down the previous ones.

**Reason**

`Selectors` which have been registered with the socket must be closed in order to properly release the socket. `NIOServerCnxnFactory` registers selectors on instantiation, but only releases them in the thread run() method. So, if factory doesn't get started, it won't release the registered selectors.

This wasn't the issue before Java 11 and probably caused by:
https://www.oracle.com/technetwork/java/javase/11-relnote-issues-5012449.html#JDK-8198562

Also this is not an issue when ZooKeeper is used as a separate process (not embedded), because on shutdown the entire JVM stops anyway.

**Resolution**

I decided to try fixing the issue in the connection factory instead of fixing the tests only, because originally it's a bug in the way factory works. Resolution is to open selectors in lazy way: only when accept and selector thread starts, so they don't need to be released if the thread was not even started.

Author: Andor Molnar <andor@apache.org>

Reviewers: hanm@apache.org

Closes #700 from anmolnar/ZOOKEEPER-1441

(cherry picked from commit c3babb9)
Signed-off-by: Andor Molnar <andor@apache.org>
@anmolnar
Copy link
Contributor Author

Being the bad guy again: committed my own patch to master and 3.5 branches. Thanks @hanm and @eolivelli !

@anmolnar anmolnar deleted the ZOOKEEPER-1441 branch November 23, 2018 10:53
RokLenarcic pushed a commit to RokLenarcic/zookeeper that referenced this pull request Sep 3, 2022
…nd issue.

Fixes the Java 11 build issue.

**Issue**

`NIOServerCnxnFactory` doesn't properly close the socket when the shutdown() is called before the factory has even started. This is mostly the case in tests which use `QuorumUtil` that creates multiple QuorumPeers when instantiated without starting them and when `startAll()` gets called it shuts down the previous ones.

**Reason**

`Selectors` which have been registered with the socket must be closed in order to properly release the socket. `NIOServerCnxnFactory` registers selectors on instantiation, but only releases them in the thread run() method. So, if factory doesn't get started, it won't release the registered selectors.

This wasn't the issue before Java 11 and probably caused by:
https://www.oracle.com/technetwork/java/javase/11-relnote-issues-5012449.html#JDK-8198562

Also this is not an issue when ZooKeeper is used as a separate process (not embedded), because on shutdown the entire JVM stops anyway.

**Resolution**

I decided to try fixing the issue in the connection factory instead of fixing the tests only, because originally it's a bug in the way factory works. Resolution is to open selectors in lazy way: only when accept and selector thread starts, so they don't need to be released if the thread was not even started.

Author: Andor Molnar <andor@apache.org>

Reviewers: hanm@apache.org

Closes apache#700 from anmolnar/ZOOKEEPER-1441
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants