-
Notifications
You must be signed in to change notification settings - Fork 7.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ZOOKEEPER-2184 Zookeeper Client should re-resolve hosts when connection attempts fail #534
Conversation
@hanm Did you have a chance to take a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a brief look, I think this pull request is missing some of the comment changes from #451:
- The comments change in HostProvider interface regarding caching at different levels, etc.
- Some comments in StaticHostProvider regarding the semantics of next().
@@ -73,15 +80,27 @@ | |||
* if serverAddresses is empty or resolves to an empty list | |||
*/ | |||
public StaticHostProvider(Collection<InetSocketAddress> serverAddresses) { | |||
sourceOfRandomness = new Random(System.currentTimeMillis() ^ this.hashCode()); | |||
init(serverAddresses, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Call StaticHostProvider(Collection serverAddresses, Resolver resolver).
Please change the other two constructor as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @lvfangmin , I might be missing the point here. Shall I change the signature to use non-generic Collection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what @lvfangmin meant is instead of duplicating the code inside three constructors, only keep one concrete parameterized constructor implementation and let the other two invoke that one (with different parameters.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What @hanm said, instead of call init, call this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried that. Cannot call the other constructor, because this
is referenced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think referencing this before the object is constructed is prohibited. I think we can probably get around it but it looks like not worth the extra effort (I don't know a simple of way of doing it without dropping using the this.hashCode).
So I am ok with the current form of having all ctors invoking init method.
@@ -314,7 +340,7 @@ public InetSocketAddress next(long spinDelay) { | |||
addr = nextHostInReconfigMode(); | |||
if (addr != null) { | |||
currentIndex = serverAddresses.indexOf(addr); | |||
return addr; | |||
return resolve(addr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do some test to find out performance/latency of resolving an addr? If it's costly, maybe we should cache the resolved one instead of resolve it every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree. We should not do any caching in our codebase, because there're multiple levels of caching already present in DNS infrastructure, like JVM caching, os-level caching, DNS servers caching, etc. resolve()
will eventually become a no-op if any of these caches find a hit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know there is DNS caching, not aware that we have JVM or os caching for this, if we tested this is trivial, I totally agree we should keep this simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO JVM caching is overkill too. It's not application's responsibility to deal with DNS re-resolution, because it shouldn't happen in failure scenarios only. On the flipside to avoid DNS server flooding some caching in the app would be reasonable and JVM is a standard way of doing that.
} | ||
Collections.shuffle(resolvedAddresses); | ||
return new InetSocketAddress(resolvedAddresses.get(0), address.getPort()); | ||
} catch (UnknownHostException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should log the exception here. Does the caller handle the unknown address correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. The caller will end up getting UnknownHostException when trying to open the socket to the unresolvable address:
2018-06-04 12:31:26,022 [myid:huhuuhujkdshgfjksgd.com:2181] - WARN [main-SendThread(huhuuhujkdshgfjksgd.com:2181):ClientCnxn$SendThread@1237] - Session 0x0 for server huhuuhujkdshgfjksgd.com:2181, unexpected error, closing socket connection and attempting reconnect
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at org.apache.zookeeper.ClientCnxnSocketNIO.registerAndConnect(ClientCnxnSocketNIO.java:275)
at org.apache.zookeeper.ClientCnxnSocketNIO.connect(ClientCnxnSocketNIO.java:285)
at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1091)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1133)
The reason for doing this way is to avoid API change of next()
.
Logging makes sense, I added an error log entry to make it clear.
@hanm Thanks, I updated the comments to be consistent with the original PR. |
62a8978
to
d2a5696
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -73,15 +80,27 @@ | |||
* if serverAddresses is empty or resolves to an empty list | |||
*/ | |||
public StaticHostProvider(Collection<InetSocketAddress> serverAddresses) { | |||
sourceOfRandomness = new Random(System.currentTimeMillis() ^ this.hashCode()); | |||
init(serverAddresses, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what @lvfangmin meant is instead of duplicating the code inside three constructors, only keep one concrete parameterized constructor implementation and let the other two invoke that one (with different parameters.).
// Resolve server addresses and shuffle them | ||
List<InetSocketAddress> resolvedList = resolveAndShuffle(serverAddresses); | ||
if (resolvedList.isEmpty()) { | ||
List<InetSocketAddress> shuffledList = shuffle(serverAddresses); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why we change the resolveAndShuffle
to shuffle
here, they are semantically different (one tries to resolve address, the other does not and only does shuffle.). The serverAddresses
passed in this method is unresolved address, and we need it resolved because we rely on the resolved addresses to compare the old / new server list (in the context of probability rebalancing clients for dynamic reconfiguration).
Without resolving I think the client rebalance logic will be broken. A side note that all tests still passed probably indicate we don't have a 100% coverage for the logic in our tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this change, the old / new server list both contains unresolved address, so it's comparable, but it's meaningless to compare addr.getAddress because it will return null when it's unresolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolveAndShuffle
has been split into two methods: resolve
and shuffle
, because we have to do it separately. Github shows the diff as original has been renamed and new one has been created.
I'll double check the rebalance functionality to make sure it's working properly and will also check related unit tests. I think what @lvfangmin is saying makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comparison works for unresolved addresses too, because of the last if condition here:
if (addr.getPort() == myServer.getPort()
&& ((addr.getAddress() != null
&& myServer.getAddress() != null && addr
.getAddress().equals(myServer.getAddress())) || addr
.getHostString().equals(myServer.getHostString()))) {
As long as getHostString()
works for unresolved addresses, we're fine. However, exactly the same functionality has already been implemented in the InetSocketAddress.equals()
method, so I refactored this part to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, exactly the same functionality has already been implemented in the InetSocketAddress.equals() method
I am wondering if this is the case or not. I just did a random peek at https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/net/InetSocketAddress.java#L300, looks like if we compare an unresolved address with a resolved address, the equals will return false - but the code you pasted will return true if getHostString works for both resolved and unresolved address... could you double check this behavior?
Also, is it possible to add a test case to cover the case where the second parameter of updateServerList is a resolved address? The existing test cases only cover the case where the second parameter (myServer) is unresolved. In practice I think the method updateServerList is called by ZooKeeper's updateServerList method with second parameter as a resolved address (the remote server where current client connected to.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hanm Good catch, that makes perfect sense. The implementation is slightly different.
I added 4 new test cases to cover the different combinations of the 2nd and 3rd parameter. Some of them could be redundant (as you said tests present already), but I've already opened a Jira to clean-up this test file, so it should be okay. Test scenarios:
Given: list of servers contains 1 element, client is connected (currentHost (myServer) != null), trying to replace server list with the same address (replaceHost).
New impl
currentHost
resolved,replaceHost
unresolved => client should disconnect,currentHost
resolved,replaceHost
resolved => client should not disconnect,currentHost
unresolved,replaceHost
unresolved => client should disconnect,currentHost
unresolved,replaceHost
resolved => client should disconnect
Old impl
currentHost
resolved,replaceHost
unresolved => client should not disconnect,currentHost
resolved,replaceHost
resolved => client should not disconnect,currentHost
unresolved,replaceHost
unresolved => client should disconnect,currentHost
unresolved,replaceHost
resolved => client should not disconnect
Basically the difference is in the case when comparing resolved address with unresolved. The built-in implementation treats them as different address, hence forcing the client to disconnect which makes slightly more sense to me.
Anyway I'm happy to revert the change if you think this is not acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for quick update, @anmolnar !
The built-in implementation treats them as different address
That's what I am not sure about. If an unresolved address and a resolved address actually maps same address (after unresolved address gets resolved), should both be treated as same address (old behavior, via getHostString), instead of different address?
What I am thinking is to instead of using equal - use the old verbose code which will work (so far) for all combinations of comparing resolved and unresolved address. What do you think about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only concern is with the first case: IP address might have changed of the host, so client should disconnect and trigger a re-resolution.
Anyway. No issue has been reported with the logic we're talking about, so let's just leave it out from this patch. Reverted.
@anmolnar do you have a chance to address the issues from code review? Please let me know if you plan to work on this in next few days - otherwise I'll just go ahead and do the 3.4 release without waiting on this merged to 3.5 and master. |
@hanm Sorry for the delay. I'll work on it today. |
…nit tests to cover unresolved hostnames cases
@hanm @lvfangmin Reconfig tests have previously been implemented only for IP addresses and hasn't covered hostnames which need to be resolved. I added a few unit tests for coverage, but found that existing tests should also be good to refactor. I don't want to overload this patch, so will create a separate Jira for the refactoring instead. |
Unit test refactoring: https://issues.apache.org/jira/browse/ZOOKEEPER-3065 |
@hanm @lvfangmin Are you happy with commiting the patch? |
a2565d7
to
b557298
Compare
…esolved/unresolved addresses
b557298
to
420f050
Compare
@lvfangmin Unit tests failed, but it seems like we also introduced a findbug issue with #562 |
I think this one is good to commit - I'll commit in next few days if no one raise other concerns. |
be04719
to
6bccacd
Compare
…ion attempts fail This is the master/3.5 port of #451 Author: Andor Molnar <andor@cloudera.com> Author: Andor Molnar <andor@apache.org> Reviewers: Michael Han <hanm@apache.org>, Flavio Junqueira <fpj@apache.org>, Edward Ribeiro <edward.ribeiro@gmail.com>, Mark Fenes <mfenes@cloudera.com>, Abraham Fine <afine@apache.org> Closes #534 from anmolnar/ZOOKEEPER-2184_master (cherry picked from commit 0a31187) Signed-off-by: Michael Han <hanm@apache.org>
…ion attempts fail This is the master/3.5 port of apache#451 Author: Andor Molnar <andor@cloudera.com> Author: Andor Molnar <andor@apache.org> Reviewers: Michael Han <hanm@apache.org>, Flavio Junqueira <fpj@apache.org>, Edward Ribeiro <edward.ribeiro@gmail.com>, Mark Fenes <mfenes@cloudera.com>, Abraham Fine <afine@apache.org> Closes apache#534 from anmolnar/ZOOKEEPER-2184_master
This is the master/3.5 port of #451