Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZOOKEEPER-2184 Zookeeper Client should re-resolve hosts when connection attempts fail #534

Closed
wants to merge 7 commits into from

Conversation

anmolnar
Copy link
Contributor

This is the master/3.5 port of #451

@anmolnar
Copy link
Contributor Author

@hanm Did you have a chance to take a look?

Copy link
Contributor

@hanm hanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a brief look, I think this pull request is missing some of the comment changes from #451:

  • The comments change in HostProvider interface regarding caching at different levels, etc.
  • Some comments in StaticHostProvider regarding the semantics of next().

@@ -73,15 +80,27 @@
* if serverAddresses is empty or resolves to an empty list
*/
public StaticHostProvider(Collection<InetSocketAddress> serverAddresses) {
sourceOfRandomness = new Random(System.currentTimeMillis() ^ this.hashCode());
init(serverAddresses,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Call StaticHostProvider(Collection serverAddresses, Resolver resolver).

Please change the other two constructor as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @lvfangmin , I might be missing the point here. Shall I change the signature to use non-generic Collection?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what @lvfangmin meant is instead of duplicating the code inside three constructors, only keep one concrete parameterized constructor implementation and let the other two invoke that one (with different parameters.).

Copy link
Contributor

@lvfangmin lvfangmin Jun 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What @hanm said, instead of call init, call this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried that. Cannot call the other constructor, because this is referenced.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think referencing this before the object is constructed is prohibited. I think we can probably get around it but it looks like not worth the extra effort (I don't know a simple of way of doing it without dropping using the this.hashCode).
So I am ok with the current form of having all ctors invoking init method.

@@ -314,7 +340,7 @@ public InetSocketAddress next(long spinDelay) {
addr = nextHostInReconfigMode();
if (addr != null) {
currentIndex = serverAddresses.indexOf(addr);
return addr;
return resolve(addr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do some test to find out performance/latency of resolving an addr? If it's costly, maybe we should cache the resolved one instead of resolve it every time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. We should not do any caching in our codebase, because there're multiple levels of caching already present in DNS infrastructure, like JVM caching, os-level caching, DNS servers caching, etc. resolve() will eventually become a no-op if any of these caches find a hit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know there is DNS caching, not aware that we have JVM or os caching for this, if we tested this is trivial, I totally agree we should keep this simple.

Copy link
Contributor Author

@anmolnar anmolnar Jun 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO JVM caching is overkill too. It's not application's responsibility to deal with DNS re-resolution, because it shouldn't happen in failure scenarios only. On the flipside to avoid DNS server flooding some caching in the app would be reasonable and JVM is a standard way of doing that.

}
Collections.shuffle(resolvedAddresses);
return new InetSocketAddress(resolvedAddresses.get(0), address.getPort());
} catch (UnknownHostException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should log the exception here. Does the caller handle the unknown address correctly?

Copy link
Contributor Author

@anmolnar anmolnar Jun 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. The caller will end up getting UnknownHostException when trying to open the socket to the unresolvable address:

2018-06-04 12:31:26,022 [myid:huhuuhujkdshgfjksgd.com:2181] - WARN  [main-SendThread(huhuuhujkdshgfjksgd.com:2181):ClientCnxn$SendThread@1237] - Session 0x0 for server huhuuhujkdshgfjksgd.com:2181, unexpected error, closing socket connection and attempting reconnect
java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:101)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
	at org.apache.zookeeper.ClientCnxnSocketNIO.registerAndConnect(ClientCnxnSocketNIO.java:275)
	at org.apache.zookeeper.ClientCnxnSocketNIO.connect(ClientCnxnSocketNIO.java:285)
	at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1091)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1133)

The reason for doing this way is to avoid API change of next().

Logging makes sense, I added an error log entry to make it clear.

@anmolnar
Copy link
Contributor Author

anmolnar commented Jun 4, 2018

@hanm Thanks, I updated the comments to be consistent with the original PR.
Unfortunately the comment of HostProvider wasn't updated, so I also made changes in there.

Copy link
Contributor

@nkalmar nkalmar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -73,15 +80,27 @@
* if serverAddresses is empty or resolves to an empty list
*/
public StaticHostProvider(Collection<InetSocketAddress> serverAddresses) {
sourceOfRandomness = new Random(System.currentTimeMillis() ^ this.hashCode());
init(serverAddresses,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what @lvfangmin meant is instead of duplicating the code inside three constructors, only keep one concrete parameterized constructor implementation and let the other two invoke that one (with different parameters.).

// Resolve server addresses and shuffle them
List<InetSocketAddress> resolvedList = resolveAndShuffle(serverAddresses);
if (resolvedList.isEmpty()) {
List<InetSocketAddress> shuffledList = shuffle(serverAddresses);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why we change the resolveAndShuffle to shuffle here, they are semantically different (one tries to resolve address, the other does not and only does shuffle.). The serverAddresses passed in this method is unresolved address, and we need it resolved because we rely on the resolved addresses to compare the old / new server list (in the context of probability rebalancing clients for dynamic reconfiguration).

Without resolving I think the client rebalance logic will be broken. A side note that all tests still passed probably indicate we don't have a 100% coverage for the logic in our tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this change, the old / new server list both contains unresolved address, so it's comparable, but it's meaningless to compare addr.getAddress because it will return null when it's unresolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolveAndShuffle has been split into two methods: resolve and shuffle, because we have to do it separately. Github shows the diff as original has been renamed and new one has been created.

I'll double check the rebalance functionality to make sure it's working properly and will also check related unit tests. I think what @lvfangmin is saying makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparison works for unresolved addresses too, because of the last if condition here:

if (addr.getPort() == myServer.getPort()
    && ((addr.getAddress() != null
    && myServer.getAddress() != null && addr
    .getAddress().equals(myServer.getAddress())) || addr
    .getHostString().equals(myServer.getHostString()))) {

As long as getHostString() works for unresolved addresses, we're fine. However, exactly the same functionality has already been implemented in the InetSocketAddress.equals() method, so I refactored this part to use it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, exactly the same functionality has already been implemented in the InetSocketAddress.equals() method

I am wondering if this is the case or not. I just did a random peek at https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/net/InetSocketAddress.java#L300, looks like if we compare an unresolved address with a resolved address, the equals will return false - but the code you pasted will return true if getHostString works for both resolved and unresolved address... could you double check this behavior?

Also, is it possible to add a test case to cover the case where the second parameter of updateServerList is a resolved address? The existing test cases only cover the case where the second parameter (myServer) is unresolved. In practice I think the method updateServerList is called by ZooKeeper's updateServerList method with second parameter as a resolved address (the remote server where current client connected to.).

Copy link
Contributor Author

@anmolnar anmolnar Jul 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanm Good catch, that makes perfect sense. The implementation is slightly different.

I added 4 new test cases to cover the different combinations of the 2nd and 3rd parameter. Some of them could be redundant (as you said tests present already), but I've already opened a Jira to clean-up this test file, so it should be okay. Test scenarios:

Given: list of servers contains 1 element, client is connected (currentHost (myServer) != null), trying to replace server list with the same address (replaceHost).

New impl

  1. currentHost resolved, replaceHost unresolved => client should disconnect,
  2. currentHost resolved, replaceHost resolved => client should not disconnect,
  3. currentHost unresolved, replaceHost unresolved => client should disconnect,
  4. currentHost unresolved, replaceHost resolved => client should disconnect

Old impl

  1. currentHost resolved, replaceHost unresolved => client should not disconnect,
  2. currentHost resolved, replaceHost resolved => client should not disconnect,
  3. currentHost unresolved, replaceHost unresolved => client should disconnect,
  4. currentHost unresolved, replaceHost resolved => client should not disconnect

Basically the difference is in the case when comparing resolved address with unresolved. The built-in implementation treats them as different address, hence forcing the client to disconnect which makes slightly more sense to me.

Anyway I'm happy to revert the change if you think this is not acceptable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for quick update, @anmolnar !

The built-in implementation treats them as different address

That's what I am not sure about. If an unresolved address and a resolved address actually maps same address (after unresolved address gets resolved), should both be treated as same address (old behavior, via getHostString), instead of different address?

What I am thinking is to instead of using equal - use the old verbose code which will work (so far) for all combinations of comparing resolved and unresolved address. What do you think about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern is with the first case: IP address might have changed of the host, so client should disconnect and trigger a re-resolution.

Anyway. No issue has been reported with the logic we're talking about, so let's just leave it out from this patch. Reverted.

@hanm
Copy link
Contributor

hanm commented Jun 18, 2018

@anmolnar do you have a chance to address the issues from code review? Please let me know if you plan to work on this in next few days - otherwise I'll just go ahead and do the 3.4 release without waiting on this merged to 3.5 and master.

@anmolnar
Copy link
Contributor Author

@hanm Sorry for the delay. I'll work on it today.

…nit tests to cover unresolved hostnames cases
@anmolnar
Copy link
Contributor Author

anmolnar commented Jun 18, 2018

@hanm @lvfangmin Reconfig tests have previously been implemented only for IP addresses and hasn't covered hostnames which need to be resolved. I added a few unit tests for coverage, but found that existing tests should also be good to refactor.

I don't want to overload this patch, so will create a separate Jira for the refactoring instead.

@anmolnar
Copy link
Contributor Author

Unit test refactoring: https://issues.apache.org/jira/browse/ZOOKEEPER-3065

@anmolnar
Copy link
Contributor Author

anmolnar commented Jul 9, 2018

@hanm @lvfangmin Are you happy with commiting the patch?

@anmolnar anmolnar force-pushed the ZOOKEEPER-2184_master branch 2 times, most recently from a2565d7 to b557298 Compare July 11, 2018 15:30
@anmolnar
Copy link
Contributor Author

@lvfangmin Unit tests failed, but it seems like we also introduced a findbug issue with #562
Shall I fix it in this patch or would you like to do it separately?

@hanm
Copy link
Contributor

hanm commented Jul 12, 2018

I think this one is good to commit - I'll commit in next few days if no one raise other concerns.

@hanm
Copy link
Contributor

hanm commented Jul 13, 2018

it seems like we also introduced a findbug issue with #562

Let's do this as a separate issue. I committed the patch, so I feel obligated to fix it:
#568

asfgit pushed a commit that referenced this pull request Jul 16, 2018
…ion attempts fail

This is the master/3.5 port of #451

Author: Andor Molnar <andor@cloudera.com>
Author: Andor Molnar <andor@apache.org>

Reviewers: Michael Han <hanm@apache.org>, Flavio Junqueira <fpj@apache.org>, Edward Ribeiro <edward.ribeiro@gmail.com>, Mark Fenes <mfenes@cloudera.com>, Abraham Fine <afine@apache.org>

Closes #534 from anmolnar/ZOOKEEPER-2184_master

(cherry picked from commit 0a31187)
Signed-off-by: Michael Han <hanm@apache.org>
@asfgit asfgit closed this in 0a31187 Jul 16, 2018
@anmolnar anmolnar deleted the ZOOKEEPER-2184_master branch July 16, 2018 09:43
RokLenarcic pushed a commit to RokLenarcic/zookeeper that referenced this pull request Sep 3, 2022
…ion attempts fail

This is the master/3.5 port of apache#451

Author: Andor Molnar <andor@cloudera.com>
Author: Andor Molnar <andor@apache.org>

Reviewers: Michael Han <hanm@apache.org>, Flavio Junqueira <fpj@apache.org>, Edward Ribeiro <edward.ribeiro@gmail.com>, Mark Fenes <mfenes@cloudera.com>, Abraham Fine <afine@apache.org>

Closes apache#534 from anmolnar/ZOOKEEPER-2184_master
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants