
First seed node not rejoining cluster #25361

Closed
tpantelis opened this issue Jul 17, 2018 · 13 comments

@tpantelis

tpantelis commented Jul 17, 2018

akka version: 2.5.11

In a 3-node cluster setup, upon killing and restarting the first seed node (as listed in the akka.conf), we sometimes see that the node does not, or is not allowed to, rejoin the cluster. After the seed-node timeout, which we have set to 12s, it joins itself and forms a cluster island. This is automated in a robot test environment.
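For reference, a sketch of the relevant akka.conf fragment (standard Akka cluster settings, with the node addresses from the example below; 12s is the seed-node timeout mentioned above):

akka {
  cluster {
    seed-nodes = [
      "akka.tcp://opendaylight-cluster-data@10.30.170.94:2550",
      "akka.tcp://opendaylight-cluster-data@10.30.170.26:2550",
      "akka.tcp://opendaylight-cluster-data@10.30.170.84:2550"
    ]
    seed-node-timeout = 12s
  }
}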

As an example, we have node1 (first seed node - address 10.30.170.94), node2 (10.30.170.26), and node3 (10.30.170.84). Initially all the nodes are started and connected and the cluster is working correctly. Then a kill -9 is issued to node1. Both node2 and node3 lose connection and mark node1 as unreachable as expected. On node2, we see:

2018-07-16T11:17:14,609 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-20 | ReliableDeliverySupervisor       | 41 - com.typesafe.akka.slf4j - 2.5.11 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]] Caused by: [Connection refused: /10.30.170.94:2550]

2018-07-16T11:17:18,229 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.94:2550: Unreachable [Unreachable] (1), akka.tcp://opendaylight-cluster-data@10.30.170.84:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.94:2550: Unreachable [Unreachable] (1)], member status: [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.84:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.94:2550 Up seen=false]

On node3:

2018-07-16T11:18:14,859 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | ReliableDeliverySupervisor       | 41 - com.typesafe.akka.slf4j - 2.5.11 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]] Caused by: [Connection refused: /10.30.170.94:2550]

After restarting node1, we see that it sends InitJoin messages to node2 and node3 and actually reports an InitJoinAck from node2 but eventually joins itself after 22s:

2018-07-16T11:18:20,251 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-21 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] - Received InitJoinAck message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.26:2550/system/cluster/core/daemon#-1596937295]] to [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]

2018-07-16T11:18:42,585 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] is JOINING, roles [member-1, dc-default]

2018-07-16T11:18:42,614 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] to [Up]

On node2, we see InitJoin messages incoming and InitJoinAcks outgoing. We also see that it noticed node1 was restarted with a new incarnation, marked the old incarnation as Down, and stated it was removing the previous unreachable incarnation:

2018-07-16T11:18:20,105 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#-2094984711]] to [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550]

2018-07-16T11:18:20,106 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#-2094984711]]

2018-07-16T11:18:20,291 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - New incarnation of existing member [Member(address = akka.tcp://opendaylight-cluster-data@10.30.170.94:2550, status = Up)] is trying to join. **Existing will be removed from the cluster and then new member will be allowed to join.**

2018-07-16T11:18:20,292 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - **Marking unreachable node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] as [Down]**

2018-07-16T11:18:21,226 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-22 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - **Leader is removing unreachable node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]**

2018-07-16T11:18:30,564 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-2#2006731653]] to [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550]

2018-07-16T11:18:30,565 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-2#2006731653]]

followed by several more InitJoin/InitJoinAck sequences...

node3 shows similar InitJoin/InitJoinAck sequences.

I can't see any reason from this output why node1 was not allowed to rejoin. Clearly there was network communication between node1 and node2/node3 after the restart, and node2 and node3 remained connected (i.e. there were no disconnect/disassociation log messages reported). So the old incarnation of node1 should be removed (as was reported in the logs) and the new incarnation should be allowed to join.

Note this happens sporadically - most of the time it works as expected.

Also note that this only seems to happen when the node is killed - we haven't seen it when the node is gracefully shut down.

@johanandren
Member

For reference, same issue in forum: https://discuss.lightbend.com/t/first-seed-node-not-rejoining-cluster/1637

@johanandren johanandren added t:cluster 0 - new Ticket is unclear on it's purpose or if it is valid or not labels Jul 25, 2018
@tpantelis
Author

We reproduced it with akka.remote and akka.cluster debug enabled. I've attached the full logs for each node during the time node1 was restarted. The IPs are:

node1: 172.28.5.1
node2: 172.28.5.2
node3: 172.28.5.3

According to the output, the bottom line is that node1 was able to send messages (InitJoin) to nodes 2 and 3, but no messages from nodes 2 and 3 were received by node1 (InitJoinAck, Heartbeat). This differs slightly from the failure case described above, where node1 did actually receive one InitJoinAck message.

node1.log
node2.log
node3.log

@patriknw
Member

@tpantelis Thanks for the logs. I can confirm what you said: node1 doesn't receive the messages from node2 and node3. Since it happens for both node2 and node3, I don't think it's one of those nasty race condition bugs that we have been hunting over the years in classic remoting.

Perhaps you can try this config:

akka.remote.use-passive-connections = off

That will open a new TCP connection for the replies from node2 and node3 instead of reusing the inbound TCP connection. It might be a workaround, or we might understand something from other error messages (if any).

Can we completely exclude a bug or misconfiguration in the OpenDaylight network stack? Could packets be dropped in one direction of a TCP connection for some reason? Could you use tcpdump/wireshark or similar to investigate what is sent/received at a lower level?
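For example (the exact command is just an illustration), capturing the remoting traffic on node1's port 2550 could look something like:

tcpdump -i any -w node1.pcap 'tcp port 2550'

That would record everything on port 2550 so we could check whether the InitJoinAck bytes actually reach node1's interface.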

@jluhrsen

Thanks for the reply, @patriknw.

I am reproducing this with a 3-node OpenDaylight cluster in 3 Docker containers on a single laptop. I will see if I can grab some packet traces, but it might take some time to get right. I have to run this test in a loop until it hits (sometimes hours).

I also learned today (from @tpantelis) that Artery uses UDP, so I can also try that configuration separately to see if we can learn anything.

We can see if there is something at a lower level in ODL w/ regard to what is sent/received.

This is definitely a fun one :)

@chbatey
Member

chbatey commented Aug 1, 2018

FYI Artery can use UDP or TCP.
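Roughly, the transport is selected via the standard Artery settings, something like the following (whether the tcp option is available depends on the Akka version you run):

akka.remote.artery {
  enabled = on
  transport = aeron-udp   # or tcp
}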

@tpantelis
Author

We were able to get a packet capture. It basically shows what we've seen in the logs, i.e. node1 is sending out InitJoin messages to nodes 2 & 3 and InitJoinAck messages are coming back. Akka Heartbeat messages are also incoming from 2 & 3, along with our (OpenDaylight) AppendEntries heartbeats. So node1 is sending out messages and messages from 2 & 3 are getting to node1 (so there are no network issues) but are not getting processed for some reason.

I attached the zipped pcap file.
105.pcap.gz

@jluhrsen

Hey guys, is there some module we can enable debugging on to confirm at the Akka level that it is (or isn't) receiving these InitJoinAck messages? We see them at the network layer with tcpdump, so I wonder if we can drill up somehow to find where they are getting lost.

@tpantelis
Author

We've already enabled debug which logs all incoming and outgoing messages. I assume we'd have to get down to the lowest layer which actually reads from the socket.
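For reference, the flags that produce that logging are along these lines (the standard classic remoting settings, assuming nothing ODL-specific on top):

akka {
  loglevel = "DEBUG"
  remote {
    log-sent-messages = on
    log-received-messages = on
  }
}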

We also have a packet capture for a successful run where it did re-join (I forgot to include that before). In this case, node1 sent an InternalClusterAction$Join message to node2 after receiving the InitJoinAck.
104.pcap.gz

@patriknw
Member

Thanks for those traces; good that we are able to narrow in on this. The next step would probably be to add some additional logging to Akka and run with a snapshot of that, such as in https://github.com/akka/akka/blob/master/akka-remote/src/main/scala/akka/remote/transport/netty/TcpSupport.scala#L50 and https://github.com/akka/akka/blob/master/akka-remote/src/main/scala/akka/remote/Endpoint.scala#L938

@patriknw
Member

@tpantelis I see that we never got further here. Is this still a problem or should we close? Have you tried with Artery TCP?

@tpantelis
Author

The problem has not been seen with Artery UDP (was not tried with Artery TCP). It's still a problem with netty TCP but it doesn't seem there's much point in pursuing that code base.

@patriknw
Member

Thanks for the update. Closing.

@gbrd
Contributor

gbrd commented Dec 18, 2019

The problem has not been seen with Artery UDP (was not tried with Artery TCP). It's still a problem with netty TCP but it doesn't seem there's much point in pursuing that code base.

... maybe it could be because migrating from Netty to Artery seems to have code impact in some cases.
Our product is built on the principle that no messages are lost until Terminated is received (our actors use a stopping strategy) AND Terminated is the last message received from an ActorRef. With Artery, Terminated can be received before the last messages from an ActorRef (because it is delivered out of band...).
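A tiny sketch (hypothetical, not our actual code) of the pattern that relies on this ordering:

import akka.actor.{Actor, ActorRef, Terminated}

// Hypothetical watcher that assumes classic remoting's ordering: every message
// the worker sent before stopping arrives before the Terminated for it.
class Watcher(worker: ActorRef) extends Actor {
  context.watch(worker)

  def receive: Receive = {
    case result: String =>
      // handle the worker's final in-flight results
      println(s"got result: $result")
    case Terminated(`worker`) =>
      // assumed to be the last thing seen from the worker;
      // with Artery this can overtake the final results above
      context.stop(self)
  }
}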
