ZOOKEEPER-1441 - JAVA 11 - Some test cases are failing because Port bind issue. #700
eolivelli
left a comment
Thank you for fixing this problem.
It may be a nuisance in tests of downstream applications.
I left a couple of questions.
}
}
} catch (IOException e) {
LOG.error("Exception when running accept thread. Unable to register selector?", e);
Should we abort the server?
Makes sense. Any suggestions on how to do it properly?
@anmolnar sorry for the late reply.
I think it will be tricky. I will spend some time and try to find a suggestion.
Sorry for the delay. I did not have time to find a good solution, and I don't want to block this change.
private class AcceptThread extends AbstractSelectThread {
private final ServerSocketChannel acceptSocket;
- private final SelectionKey acceptKey;
+ private SelectionKey acceptKey;
Now this variable is nullable.
Shouldn't we add some null check at every access?
You're right. However, the only access in this class is from the select() method, which is called only when the registration was successful, so I think we're fine.
@hanm @lvfangmin @phunt
eolivelli
left a comment
+1
}
}
} catch (IOException e) {
LOG.error("Exception when running accept thread. Unable to register selector?", e);
If we are binding a port that's already in use here, then I think we'll have a problem, because the expectation is that the acceptor thread will always be available unless it is explicitly shut down by the caller (for this reason we catch but ignore all checked and unchecked exceptions). The problem is that control flow does not reach a higher level from within this acceptor thread, so if we have an instantiated but stopped NIOServerCnxnFactory due to acceptor thread exceptions, the entire zk process could end up in a weird state. The previous code does not have this issue because it tries to bind the port early and complains if it can't, so the caller catches the issue before the acceptor threads are started. This should be easily testable: spin up a ZK server with an unavailable port with this fix applied and see what happens.
I am also curious why 3.4 does not have this test issue. Technically, 3.4 works similarly to master and 3.5 w.r.t. the selector registration: both register the selector before the server factory is 'started' (though in 3.5 and trunk, we made NIOServerCnxn multi-threaded).
3.4 doesn't have this issue because the selector is explicitly closed in the shutdown method. In 3.5 and master, only the owning thread (accept) is able to close it, but it won't if it was never started.
I'll do the test that you suggested; however, I'm not sure how port binding is related to selector registration.
2018-11-20 15:12:15,758 [myid:] - INFO [main:NIOServerCnxnFactory@685] - binding to port 0.0.0.0/0.0.0.0:2181
2018-11-20 15:12:15,760 [myid:] - ERROR [main:ZooKeeperServerMain@87] - Unexpected exception, exiting abnormally
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:67)
at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:686)
at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:157)
at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:110)
at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:68)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:133)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:87)
Yeah, please ignore the port binding part of my previous statement. I was actually referring to port selection. Before this fix, we have selector registration in the constructor of AcceptThread:
this.acceptKey = acceptSocket.register(selector, SelectionKey.OP_ACCEPT);
If we get an exception here, we will bail out and the zk server will not start.
If we do selector registration inside AcceptThread and get an exception, then the acceptor thread will shut down, but the caller will not be aware of this, so the zk server could be in a weird state. That's my only concern with this patch. Does this sound like a reasonable concern?
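The concern can be illustrated with a minimal Java sketch. The class and method names below (RegistrationFailureDemo, EagerWorker, LazyWorker) are hypothetical and are not ZooKeeper code; the simulated IOException stands in for a failed selector registration. It only shows the general behavior: an exception thrown from a constructor reaches the caller, while the same exception caught inside run() ends the thread without the caller noticing.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration; not taken from NIOServerCnxnFactory.
class RegistrationFailureDemo {

    /** Fails in the constructor, so the caller sees the exception. */
    static class EagerWorker extends Thread {
        EagerWorker(boolean fail) throws IOException {
            if (fail) {
                throw new IOException("simulated registration failure in constructor");
            }
        }
    }

    /** Fails inside run(), so only this thread observes the exception. */
    static class LazyWorker extends Thread {
        private final boolean fail;
        final AtomicReference<IOException> failure = new AtomicReference<>();

        LazyWorker(boolean fail) {
            this.fail = fail;
        }

        @Override
        public void run() {
            try {
                if (fail) {
                    throw new IOException("simulated registration failure in run()");
                }
            } catch (IOException e) {
                // Recorded and swallowed: the thread ends, the caller is not told.
                failure.set(e);
            }
        }
    }

    /** Returns true if constructing an eager worker throws to the caller. */
    static boolean eagerFailureReachesCaller() {
        try {
            new EagerWorker(true);
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    /** Returns true if start() returned normally although run() failed. */
    static boolean lazyFailureIsHidden() {
        LazyWorker worker = new LazyWorker(true);
        worker.start(); // no exception surfaces here
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return worker.failure.get() != null;
    }
}
```

In this sketch the caller of the lazy variant would need some extra signal (a shared failure field, a callback, or similar) to learn that the thread gave up, which is the gap the comment points at.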
…et. Stop threads before reconfig to properly close sockets.
8c8870d to
17c9e20
@eolivelli @hanm Now, an entirely different approach. Solving 2 issues at the same time:
LGTM, thanks for doing this!
The proposed change looks reasonable. Though I am wondering why the current logic would fail this test: if new port fails to bind, the old port will be closed as part of
@hanm The issue with Java 11 builds is the refactoring that was made in NIO from Java 11:
In other words, we have to close all registered Selectors to properly close the socket. Therefore
@anmolnar So if I understand this correctly, this test
retest this please
@hanm Not exactly.
Step 4) fails for the aforementioned reason: reconfig fails to bind the port and throws an exception, but the exception handler closes only the socket, not the selector.
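As a rough illustration of that failure path, here is a self-contained Java sketch. It does not reproduce the reconfig code in NIOServerCnxnFactory; the class and method names are hypothetical. It occupies a loopback port, tries to bind a second, selector-registered channel to the same port, and in the exception handler closes the selector as well as the channel.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

// Hypothetical sketch; names are illustrative and not taken from ZooKeeper.
class BindFailureCleanupDemo {

    /**
     * Tries to bind a selector-registered channel to a port that is already
     * taken, and reports whether the failure handler released both resources.
     */
    static boolean failedBindReleasesChannelAndSelector() {
        try (ServerSocketChannel occupant = ServerSocketChannel.open()) {
            // Occupy an ephemeral loopback port so the second bind must fail.
            occupant.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            InetSocketAddress taken = (InetSocketAddress) occupant.getLocalAddress();

            ServerSocketChannel channel = ServerSocketChannel.open();
            Selector selector = Selector.open();
            try {
                channel.configureBlocking(false);
                channel.register(selector, SelectionKey.OP_ACCEPT);
                channel.bind(taken);
                // Unexpected success: still release everything.
                channel.close();
                selector.close();
                return false;
            } catch (IOException e) {
                // Closing only the channel would leave the selector open.
                channel.close();
                selector.close();
            }
            return !channel.isOpen() && !selector.isOpen();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether leaving the selector open actually keeps the port busy depends on the JDK version and platform, so the sketch only checks that both resources end up closed.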
retest this please
@anmolnar gotcha. +1, will commit this when I have a chance.
…nd issue.

Fixes the Java 11 build issue.

**Issue**
`NIOServerCnxnFactory` doesn't properly close the socket when the shutdown() is called before the factory has even started. This is mostly the case in tests which use `QuorumUtil` that creates multiple QuorumPeers when instantiated without starting them and when `startAll()` gets called it shuts down the previous ones.

**Reason**
`Selectors` which have been registered with the socket must be closed in order to properly release the socket. `NIOServerCnxnFactory` registers selectors on instantiation, but only releases them in the thread run() method. So, if factory doesn't get started, it won't release the registered selectors.
This wasn't the issue before Java 11 and probably caused by: https://www.oracle.com/technetwork/java/javase/11-relnote-issues-5012449.html#JDK-8198562
Also this is not an issue when ZooKeeper is used as a separate process (not embedded), because on shutdown the entire JVM stops anyway.

**Resolution**
I decided to try fixing the issue in the connection factory instead of fixing the tests only, because originally it's a bug in the way factory works. Resolution is to open selectors in lazy way: only when accept and selector thread starts, so they don't need to be released if the thread was not even started.

Author: Andor Molnar <andor@apache.org>
Reviewers: hanm@apache.org
Closes #700 from anmolnar/ZOOKEEPER-1441
(cherry picked from commit c3babb9)
Signed-off-by: Andor Molnar <andor@apache.org>
Being the bad guy again: committed my own patch to master and 3.5 branches. Thanks @hanm and @eolivelli!
Fixes the Java 11 build issue.
Issue
NIOServerCnxnFactory doesn't properly close the socket when the shutdown() is called before the factory has even started. This is mostly the case in tests which use QuorumUtil that creates multiple QuorumPeers when instantiated without starting them and when startAll() gets called it shuts down the previous ones.
Reason
Selectors which have been registered with the socket must be closed in order to properly release the socket. NIOServerCnxnFactory registers selectors on instantiation, but only releases them in the thread run() method. So, if factory doesn't get started, it won't release the registered selectors.
This wasn't the issue before Java 11 and probably caused by:
https://www.oracle.com/technetwork/java/javase/11-relnote-issues-5012449.html#JDK-8198562
Also this is not an issue when ZooKeeper is used as a separate process (not embedded), because on shutdown the entire JVM stops anyway.
Resolution
I decided to try fixing the issue in the connection factory instead of fixing the tests only, because originally it's a bug in the way factory works. Resolution is to open selectors in lazy way: only when accept and selector thread starts, so they don't need to be released if the thread was not even started.
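The lazy approach described in the resolution could look roughly like the sketch below. This is not the actual AcceptThread or SelectorThread code; LazySelectorThread and its methods are hypothetical names, and the 100 ms select timeout is an arbitrary choice for the example. The selector is opened inside run(), by the same thread that later closes it, so an instance that is shut down without ever being started holds no selector at all.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Selector;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch; not the actual AcceptThread or SelectorThread code.
class LazySelectorThread extends Thread {
    private volatile Selector selector;   // stays null until run() begins
    private volatile boolean stopped;
    private final CountDownLatch ready = new CountDownLatch(1);

    @Override
    public void run() {
        // The selector is opened by the thread that will also close it.
        try (Selector s = Selector.open()) {
            selector = s;
            ready.countDown();
            while (!stopped) {
                s.select(100);   // short timeout keeps the sketch simple
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            ready.countDown();   // never leave a waiter blocked
        }
    }

    /** Safe to call whether or not the thread was ever started. */
    void shutdown() {
        stopped = true;
        Selector s = selector;
        if (s != null && s.isOpen()) {
            s.wakeup();
        }
    }

    boolean holdsOpenSelector() {
        Selector s = selector;
        return s != null && s.isOpen();
    }

    /** Starts a thread, stops it, and checks that its selector got closed. */
    static boolean startedThreadClosesSelector() {
        LazySelectorThread t = new LazySelectorThread();
        t.start();
        try {
            t.ready.await();
            boolean openWhileRunning = t.holdsOpenSelector();
            t.shutdown();
            t.join();
            return openWhileRunning && !t.holdsOpenSelector();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Calling shutdown() on a never-started instance simply sets the stop flag; there is no selector to release, which matches the situation in tests that create peers and shut them down before starting them.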