ZOOKEEPER-3188: Improve resilience to network
This PR is the rebase of the [previous pull request](#730), so all the kudos should go to the original authors...

In [ZOOKEEPER-3188](https://issues.apache.org/jira/browse/ZOOKEEPER-3188) we add the ability to specify several addresses for quorum operations. It also adds reconnection attempts if the connection to the leader is lost.

In this PR I rebased the changes on the current master, resolving some minor conflicts with:
- [ZOOKEEPER-3296](https://issues.apache.org/jira/browse/ZOOKEEPER-3296): Explicitly closing the sslsocket when it failed handshake to prevent issue where peers cannot join quorum
- [ZOOKEEPER-3320](https://issues.apache.org/jira/browse/ZOOKEEPER-3320): Leader election port stop listen when hostname unresolvable for some time
- [ZOOKEEPER-3385](https://issues.apache.org/jira/browse/ZOOKEEPER-3385): Add admin command to display leader
- [ZOOKEEPER-3386](https://issues.apache.org/jira/browse/ZOOKEEPER-3386): Add admin command to display voting view
- [ZOOKEEPER-3398](https://issues.apache.org/jira/browse/ZOOKEEPER-3398): Learner.connectToLeader() may take too long to time-out

I still want to test the feature manually (e.g. using docker containers with multiple virtual networks / interfaces). The steps of the manual test could be recorded in the [Google doc](https://docs.google.com/document/d/1iGVwxeHp57qogwfdodCh9b32P2_kOQaJZ2GDo7j36fI/edit?usp=sharing) as well.

Also I think we could add a few more unit tests where we are using multiple addresses. The current tests are using a single address only.

Also the ZooKeeper documentation needs to be changed (e.g. by a follow-up Jira?) to promote the new feature and the new config format (possibly also including the admin command documentation related to [ZOOKEEPER-3386](https://issues.apache.org/jira/browse/ZOOKEEPER-3386) and [ZOOKEEPER-3461](https://issues.apache.org/jira/browse/ZOOKEEPER-3461)).

Author: Mate Szalay-Beko <szalay.beko.mate@gmail.com>
Author: Mate Szalay-Beko <mszalay@cloudera.com>

Reviewers: eolivelli@apache.org, andor@apache.org

Closes #1048 from symat/ZOOKEEPER-3188 and squashes the following commits:

3c6fc52 [Mate Szalay-Beko] Merge remote-tracking branch 'apache/master' into ZOOKEEPER-3188
356882d [Mate Szalay-Beko] ZOOKEEPER-3188: document new configuration format for using multiple addresses
45b6c0f [Mate Szalay-Beko] Merge remote-tracking branch 'apache/master' into ZOOKEEPER-3188
4b6bcea [Mate Szalay-Beko] ZOOKEEPER-3188: MultiAddress unit tests for Quorum TLS and Kerberos/Digest authentication
40bc44c [Mate Szalay-Beko] Merge remote-tracking branch 'apache/master' into ZOOKEEPER-3188
f875f5c [Mate Szalay-Beko] Merge remote-tracking branch 'apache/master' into ZOOKEEPER-3188
31805e7 [Mate Szalay-Beko] Merge remote-tracking branch 'apache/master' into ZOOKEEPER-3188
0f95678 [Mate Szalay-Beko] ZOOKEEPER-3188: skip unreachable addresses when Learner connects to Leader
e232c55 [Mate Szalay-Beko] ZOOKEEPER-3188: fix flaky unit MultiAddress unit test
e892d8d [Mate Szalay-Beko] Merge remote-tracking branch 'apache/master' into ZOOKEEPER-3188
6f2ab75 [Mate Szalay-Beko] Merge remote-tracking branch 'apache/master' into ZOOKEEPER-3188
2eedf26 [Mate Szalay-Beko] ZOOKEEPER-3188: fix PR commits; handle case when Leader can not bind to port on startup
483d2fc [Mate Szalay-Beko] Merge remote-tracking branch 'apache/master' into ZOOKEEPER-3188
a5d6bcb [Mate Szalay-Beko] ZOOKEEPER-3188: support for dynamic reconfig + add more unit tests
ed31d2c [Mate Szalay-Beko] ZOOKEEPER-3188: better shutdown for executors (following PR comments)
8713a5b [Mate Szalay-Beko] ZOOKEEPER-3188: add fixes for PR comments
05eae83 [Mate Szalay-Beko] Merge remote-tracking branch 'apache/master' into ZOOKEEPER-3188
e823af4 [Mate Szalay-Beko] Merge remote-tracking branch 'origin/master' into ZOOKEEPER-3188
de7bad2 [Mate Szalay-Beko] Merge remote-tracking branch 'origin/master' into ZOOKEEPER-3188
da98a8d [Mate Szalay-Beko] ZOOKEEPER-3188: fix JDK-13 warning
5bd1f4e [Mate Szalay-Beko] ZOOKEEPER-3188: supress spotbugs warning
42a52a6 [Mate Szalay-Beko] ZOOKEEPER-3188: improve based on code review comments
6c4220a [Mate Szalay-Beko] ZOOKEEPER-3188: fix SendWorker.asyncValidateIfSocketIsStillReachable
5b22432 [Mate Szalay-Beko] ZOOKEEPER-3188: fix LeaderElection to work with multiple election addresses
7bfbe7e [Mate Szalay-Beko] ZOOKEEPER-3188: Improve resilience to network
symat authored and anmolnar committed Nov 29, 2019
1 parent 8e89050 commit 815c8f2
Showing 33 changed files with 2,159 additions and 508 deletions.
25 changes: 23 additions & 2 deletions zookeeper-docs/src/main/resources/markdown/zookeeperAdmin.md
@@ -202,7 +202,13 @@ ensemble:
though about a few here:
Every machine that is part of the ZooKeeper ensemble should know
about every other machine in the ensemble. You accomplish this with
the series of lines of the form **server.id=host:port:port**. The parameters **host** and **port** are straightforward. You attribute the
the series of lines of the form **server.id=host:port:port**.
(The parameters **host** and **port** are straightforward: for each server
you need to specify first a quorum port, then a dedicated port for ZooKeeper leader
election.) Since ZooKeeper 3.6.0 you can also [specify multiple addresses](#id_multi_address)
for each ZooKeeper server instance (this can increase availability when multiple physical
network interfaces can be used in parallel in the cluster).
You attribute the
server id to each machine by creating a file named
*myid*, one for each server, which resides in
that server's data directory, as specified by the configuration file
@@ -1050,7 +1056,7 @@ of servers -- that is, when deploying clusters of servers.
>Turning on leader selection is highly recommended when
you have more than three ZooKeeper servers in an ensemble.

* *server.x=[hostname]:nnnnn[:nnnnn], etc* :
* *server.x=[hostname]:nnnnn[:nnnnn] etc* :
(No Java system property)
servers making up the ZooKeeper ensemble. When the server
starts up, it determines which server it is by looking for the
@@ -1065,6 +1071,21 @@ of servers -- that is, when deploying clusters of servers.
The first followers use to connect to the leader, and the second is for
leader election. If you want to test multiple servers on a single machine, then
different ports can be used for each server.


<a name="id_multi_address"></a>
Since ZooKeeper 3.6.0 it is possible to specify **multiple addresses** for each
ZooKeeper server (see [ZOOKEEPER-3188](https://issues.apache.org/jira/projects/ZOOKEEPER/issues/ZOOKEEPER-3188)).
This helps to increase availability and adds network-level
resiliency to ZooKeeper. When multiple physical network interfaces are used
for the servers, ZooKeeper is able to bind on all interfaces and switch at runtime
to a working interface in case of a network error. The different addresses can be specified
in the config using a pipe ('|') character. A valid configuration using multiple addresses looks like:

server.1=zoo1-net1:2888:3888|zoo1-net2:2889:3889
server.2=zoo2-net1:2888:3888|zoo2-net2:2889:3889
server.3=zoo3-net1:2888:3888|zoo3-net2:2889:3889

* *syncLimit* :
(No Java system property)
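
To illustrate the pipe-separated format documented in this hunk, here is a minimal sketch of how one such value could be split into per-interface address pairs. `MultiAddressLine` and `Entry` are hypothetical names made up for this example; this is not the parser ZooKeeper itself uses, and it assumes plain hostnames or IPv4 addresses (IPv6 literals are not handled).

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of ZooKeeper: splits a multi-address server value. */
final class MultiAddressLine {

    /** One "host:quorumPort:electionPort" entry of the value. */
    static final class Entry {
        final InetSocketAddress quorumAddr;
        final InetSocketAddress electionAddr;

        Entry(InetSocketAddress quorumAddr, InetSocketAddress electionAddr) {
            this.quorumAddr = quorumAddr;
            this.electionAddr = electionAddr;
        }
    }

    /** Parses the value of a server.N line, e.g. "zoo1-net1:2888:3888|zoo1-net2:2889:3889". */
    static List<Entry> parse(String value) {
        List<Entry> entries = new ArrayList<>();
        // The pipe is a regex metacharacter, so it has to be escaped for split().
        for (String part : value.split("\\|")) {
            String[] fields = part.trim().split(":");
            if (fields.length != 3) {
                throw new IllegalArgumentException("expected host:port:port but got: " + part);
            }
            int quorumPort = Integer.parseInt(fields[1]);
            int electionPort = Integer.parseInt(fields[2]);
            // createUnresolved avoids a DNS lookup, so the sketch also works offline.
            entries.add(new Entry(
                InetSocketAddress.createUnresolved(fields[0], quorumPort),
                InetSocketAddress.createUnresolved(fields[0], electionPort)));
        }
        return entries;
    }

    public static void main(String[] args) {
        for (Entry e : parse("zoo1-net1:2888:3888|zoo1-net2:2889:3889")) {
            System.out.println(e.quorumAddr.getHostString()
                + " quorum=" + e.quorumAddr.getPort()
                + " election=" + e.electionAddr.getPort());
        }
    }
}
```

Each entry keeps its own quorum and election port, which matches the documented example where the second interface uses 2889 and 3889 instead of 2888 and 3888.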
21 changes: 21 additions & 0 deletions zookeeper-docs/src/main/resources/markdown/zookeeperReconfig.md
@@ -19,6 +19,7 @@ limitations under the License.
* [Overview](#ch_reconfig_intro)
* [Changes to Configuration Format](#ch_reconfig_format)
* [Specifying the client port](#sc_reconfig_clientport)
* [Specifying multiple server addresses](#sc_multiaddress)
* [The standaloneEnabled flag](#sc_reconfig_standaloneEnabled)
* [The reconfigEnabled flag](#sc_reconfig_reconfigEnabled)
* [Dynamic configuration file](#sc_reconfig_file)
@@ -109,6 +110,26 @@ Examples of legal server statements:
server.5 = 125.23.63.23:1234:1235;125.23.63.24:1236
server.5 = 125.23.63.23:1234:1235:participant;125.23.63.23:1236


<a name="sc_multiaddress"></a>

### Specifying multiple server addresses

Since ZooKeeper 3.6.0 it is possible to specify multiple addresses for each
ZooKeeper server (see [ZOOKEEPER-3188](https://issues.apache.org/jira/projects/ZOOKEEPER/issues/ZOOKEEPER-3188)).
This helps to increase availability and adds network-level
resiliency to ZooKeeper. When multiple physical network interfaces are used
for the servers, ZooKeeper is able to bind on all interfaces and switch at runtime
to a working interface in case of a network error. The different addresses can be
specified in the config using a pipe ('|') character.

Examples of valid configurations using multiple addresses:

server.2=zoo2-net1:2888:3888|zoo2-net2:2889:3889;2188
server.2=zoo2-net1:2888:3888|zoo2-net2:2889:3889|zoo2-net3:2890:3890;2188
server.2=zoo2-net1:2888:3888|zoo2-net2:2889:3889;zoo2-net1:2188
server.2=zoo2-net1:2888:3888:observer|zoo2-net2:2889:3889:observer;2188
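
The dynamic-config examples above combine three things: the pipe-separated quorum addresses, an optional role on each address, and a client part after the semicolon. The sketch below pulls these apart. `DynamicServerStatement` is a hypothetical class written for this illustration, not ZooKeeper code. Two behaviors are assumptions of the sketch rather than facts taken from the patch: a missing role is treated as `participant`, and conflicting roles within one statement are rejected.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the statement layout; not ZooKeeper's own parser. */
final class DynamicServerStatement {
    final List<String> quorumAddresses = new ArrayList<>(); // each "host:port:port"
    String role;            // "participant" or "observer"
    String clientPart = ""; // text after ';', e.g. "2188" or "zoo2-net1:2188"

    static DynamicServerStatement parse(String value) {
        DynamicServerStatement statement = new DynamicServerStatement();
        // Everything after the first ';' describes the client port (and optional host).
        String[] halves = value.split(";", 2);
        if (halves.length == 2) {
            statement.clientPart = halves[1].trim();
        }
        for (String entry : halves[0].split("\\|")) {
            String[] fields = entry.trim().split(":");
            if (fields.length < 3 || fields.length > 4) {
                throw new IllegalArgumentException("expected host:port:port[:role] but got: " + entry);
            }
            // Assumption of this sketch: an omitted role means "participant".
            String entryRole = fields.length == 4 ? fields[3] : "participant";
            // Choice made in this sketch: all addresses of one server must agree on the role.
            if (statement.role != null && !statement.role.equals(entryRole)) {
                throw new IllegalArgumentException("conflicting roles in: " + value);
            }
            statement.role = entryRole;
            statement.quorumAddresses.add(fields[0] + ":" + fields[1] + ":" + fields[2]);
        }
        return statement;
    }

    public static void main(String[] args) {
        DynamicServerStatement s =
            parse("zoo2-net1:2888:3888:observer|zoo2-net2:2889:3889:observer;2188");
        System.out.println(s.quorumAddresses + " role=" + s.role + " client=" + s.clientPart);
    }
}
```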

<a name="sc_reconfig_standaloneEnabled"></a>

### The _standaloneEnabled_ flag
@@ -355,6 +355,11 @@ server have its own machine. It must be a completely separate
physical server. Multiple virtual machines on the same physical
host are still vulnerable to the complete failure of that host.

>If you have multiple network interfaces in your ZooKeeper machines,
you can also instruct ZooKeeper to bind on all of your interfaces and
automatically switch to a healthy interface in case of a network failure.
For details, see the [Configuration Parameters](zookeeperAdmin.html#id_multi_address).

<a name="other-optimizations"></a>

### Other Optimizations
@@ -18,6 +18,7 @@

package org.apache.zookeeper.server;

import java.net.InetSocketAddress;
import org.apache.zookeeper.server.quorum.Observer;
import org.apache.zookeeper.server.quorum.ObserverMXBean;
import org.apache.zookeeper.server.quorum.QuorumPeer;
@@ -49,10 +50,11 @@ public String getQuorumAddress() {

public String getLearnerMaster() {
QuorumPeer.QuorumServer learnerMaster = observer.getCurrentLearnerMaster();
if (learnerMaster == null || learnerMaster.addr == null) {
if (learnerMaster == null || learnerMaster.addr.isEmpty()) {
return "Unknown";
}
return learnerMaster.addr.getAddress().getHostAddress() + ":" + learnerMaster.addr.getPort();
InetSocketAddress address = learnerMaster.addr.getReachableOrOne();
return address.getAddress().getHostAddress() + ":" + address.getPort();
}

public void setLearnerMaster(String learnerMaster) {
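
The hunk above replaces direct use of a single `addr` with `addr.getReachableOrOne()`. The `MultipleAddresses` class itself is not part of this excerpt, so the following is only a guess at the idea suggested by the method name: prefer an address that currently answers a reachability probe, and otherwise fall back to one of the configured addresses so the caller still gets something to log or connect to. `ReachableOrOneSketch` and `pick` are hypothetical names, and the sequential probing with `InetAddress.isReachable` is an assumption of this sketch.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.List;

/** Hypothetical illustration of a "reachable or any one" selection policy. */
final class ReachableOrOneSketch {

    /**
     * Returns the first address that passes a reachability probe,
     * or the first configured address if none does.
     */
    static InetSocketAddress pick(List<InetSocketAddress> addresses, int timeoutMs) {
        if (addresses.isEmpty()) {
            throw new IllegalArgumentException("no addresses configured");
        }
        for (InetSocketAddress candidate : addresses) {
            try {
                // Unresolved hostnames have no InetAddress to probe, so they are skipped.
                if (!candidate.isUnresolved()
                        && candidate.getAddress().isReachable(timeoutMs)) {
                    return candidate;
                }
            } catch (IOException e) {
                // A failed probe is treated like an unreachable address.
            }
        }
        // Fallback: nothing answered, so hand back one address anyway.
        return addresses.get(0);
    }
}
```

With such a policy, a caller like `getLearnerMaster()` always receives a concrete address, which is why the null check in the diff can become an `isEmpty()` check on the address collection.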
@@ -18,7 +18,9 @@

package org.apache.zookeeper.server.admin;

import com.fasterxml.jackson.annotation.JsonAnyGetter;
import com.fasterxml.jackson.annotation.JsonProperty;
import edu.umd.cs.findbugs.annotations.SuppressFBWarnings;
import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
@@ -43,7 +45,9 @@
import org.apache.zookeeper.server.quorum.FollowerZooKeeperServer;
import org.apache.zookeeper.server.quorum.Leader;
import org.apache.zookeeper.server.quorum.LeaderZooKeeperServer;
import org.apache.zookeeper.server.quorum.MultipleAddresses;
import org.apache.zookeeper.server.quorum.QuorumPeer;
import org.apache.zookeeper.server.quorum.QuorumPeer.LearnerType;
import org.apache.zookeeper.server.quorum.QuorumZooKeeperServer;
import org.apache.zookeeper.server.quorum.ReadOnlyZooKeeperServer;
import org.apache.zookeeper.server.quorum.flexible.QuorumVerifier;
@@ -673,53 +677,62 @@ public CommandResponse run(ZooKeeperServer zkServer, Map<String, String> kwargs)
CommandResponse response = initializeResponse();
if (zkServer instanceof QuorumZooKeeperServer) {
QuorumPeer peer = ((QuorumZooKeeperServer) zkServer).self;
VotingView votingView = new VotingView(peer.getVotingView());
Map<Long, QuorumServerView> votingView = peer.getVotingView().entrySet().stream()
.collect(Collectors.toMap(Map.Entry::getKey, e -> new QuorumServerView(e.getValue())));
response.put("current_config", votingView);
} else {
response.put("current_config", Collections.emptyMap());
}
return response;
}

private static class VotingView {

private final Map<Long, String> view;

VotingView(Map<Long, QuorumPeer.QuorumServer> view) {
this.view = view.entrySet()
.stream()
.filter(e -> e.getValue().addr != null)
.collect(Collectors.toMap(
Map.Entry::getKey,
e -> String.format(
"%s:%d%s:%s%s",
QuorumPeer.QuorumServer.delimitedHostString(e.getValue().addr),
e.getValue().addr.getPort(),
e.getValue().electionAddr == null ? "" : ":" + e.getValue().electionAddr.getPort(),
e.getValue().type.equals(QuorumPeer.LearnerType.PARTICIPANT) ? "participant" : "observer",
e.getValue().clientAddr == null || e.getValue().isClientAddrFromStatic
? ""
: String.format(
";%s:%d",
QuorumPeer.QuorumServer.delimitedHostString(e.getValue().clientAddr),
e.getValue().clientAddr.getPort())),
(v1, v2) -> v1, // cannot get duplicates as this straight draws from the other map
TreeMap::new));
@SuppressFBWarnings(value = "URF_UNREAD_FIELD", justification = "class is used only for JSON serialization")
private static class QuorumServerView {

@JsonProperty
private List<String> serverAddresses;

@JsonProperty
private List<String> electionAddresses;

@JsonProperty
private String clientAddress;

@JsonProperty
private String learnerType;

public QuorumServerView(QuorumPeer.QuorumServer quorumServer) {
this.serverAddresses = getMultiAddressString(quorumServer.addr);
this.electionAddresses = getMultiAddressString(quorumServer.electionAddr);
this.learnerType = quorumServer.type.equals(LearnerType.PARTICIPANT) ? "participant" : "observer";
this.clientAddress = getAddressString(quorumServer.clientAddr);
}

@JsonAnyGetter
public Map<Long, String> getView() {
return view;
private static List<String> getMultiAddressString(MultipleAddresses multipleAddresses) {
if (multipleAddresses == null) {
return Collections.emptyList();
}

return multipleAddresses.getAllAddresses().stream()
.map(QuorumServerView::getAddressString)
.collect(Collectors.toList());
}

}
private static String getAddressString(InetSocketAddress address) {
if (address == null) {
return "";
}
return String.format("%s:%d", QuorumPeer.QuorumServer.delimitedHostString(address), address.getPort());
}
}

}

/**
* Watch information aggregated by session. Returned Map contains:
* - "session_id_to_watched_paths": Map&lt;Long, Set&lt;String&gt;&gt; session ID -&gt; watched paths
* @see DataTree#getWatches()
* @see DataTree#getWatches()
*/
public static class WatchCommand extends CommandBase {
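
The hunk above changes the shape of the `current_config` response: the removed `VotingView` produced one formatted string per server id, while the new `QuorumServerView` exposes a list of server addresses, a list of election addresses, a client address, and a learner type. The sketch below mimics that shape with plain maps so it runs without Jackson. `VotingViewShapeSketch` is a hypothetical name, and the bracket handling for IPv6 hosts is an assumption about what `delimitedHostString` does, since that method is not shown in this excerpt.

```java
import java.net.InetSocketAddress;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical stand-in for the per-server view in the diff; uses plain maps instead of Jackson. */
final class VotingViewShapeSketch {

    /** Formats one address as "host:port"; null becomes an empty string, as in the diff. */
    static String addressString(InetSocketAddress address) {
        if (address == null) {
            return "";
        }
        String host = address.getHostString();
        // Assumption: IPv6 literals get brackets so the port separator stays unambiguous.
        if (host.contains(":")) {
            host = "[" + host + "]";
        }
        return host + ":" + address.getPort();
    }

    /** Builds one entry of the view, keyed by the same property names as the diff's fields. */
    static Map<String, Object> serverView(List<InetSocketAddress> quorumAddrs,
                                          List<InetSocketAddress> electionAddrs,
                                          InetSocketAddress clientAddr,
                                          boolean participant) {
        Map<String, Object> view = new LinkedHashMap<>();
        view.put("serverAddresses", format(quorumAddrs));
        view.put("electionAddresses", format(electionAddrs));
        view.put("clientAddress", addressString(clientAddr));
        view.put("learnerType", participant ? "participant" : "observer");
        return view;
    }

    private static List<String> format(List<InetSocketAddress> addresses) {
        return addresses.stream()
            .map(VotingViewShapeSketch::addressString)
            .collect(Collectors.toList());
    }
}
```

Consumers of the admin command that parsed the old single-string value would need to read the two address lists instead.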

@@ -22,6 +22,7 @@
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketException;
import java.nio.ByteBuffer;
@@ -712,7 +713,8 @@ private void process(ToSend m) {
}

for (QuorumServer server : self.getVotingView().values()) {
InetSocketAddress saddr = new InetSocketAddress(server.addr.getAddress(), port);
InetAddress address = server.addr.getReachableOrOne().getAddress();
InetSocketAddress saddr = new InetSocketAddress(address, port);
addrChallengeMap.put(saddr, new ConcurrentHashMap<Long, Long>());
}

@@ -740,7 +742,7 @@ public AuthFastLeaderElection(QuorumPeer self) {

private void starter(QuorumPeer self) {
this.self = self;
port = self.getVotingView().get(self.getId()).electionAddr.getPort();
port = self.getVotingView().get(self.getId()).electionAddr.getAllPorts().get(0);
proposedLeader = -1;
proposedZxid = -1;

@@ -763,14 +765,15 @@ private void leaveInstance() {
private void sendNotifications() {
for (QuorumServer server : self.getView().values()) {

InetSocketAddress address = self.getView().get(server.id).electionAddr.getReachableOrOne();
ToSend notmsg = new ToSend(
ToSend.mType.notification,
AuthFastLeaderElection.sequencer++,
proposedLeader,
proposedZxid,
logicalclock.get(),
QuorumPeer.ServerState.LOOKING,
self.getView().get(server.id).electionAddr);
address);

sendqueue.offer(notmsg);
}

3 comments on commit 815c8f2

@aishwaryasoni1991

@symat Is there a way we can backport this fix for 3.5.5 or 3.5.8 as it got released recently? I am having trouble with zookeeper 3.6.1 (along with some misbehavior from its dependent services) and I need this fix.

@eolivelli
Contributor


This fix was the cause of several regressions in 3.6.
It is better to fix 3.6

@symat
Contributor Author

@symat symat commented on 815c8f2 May 20, 2020


It is a large/complex patch also including some leader election message protocol version changes. Also there were several subsequent bugfixes related to the MultiAddress feature later (after this PR). I don't think we should backport all these to 3.5.
(we don't backport new major features to older branches to make sure we don't break anything in a bugfix release)
