
Node not rejoining when restarted with other node unreachable #18067

Closed
tpantelis opened this issue Jul 24, 2015 · 20 comments
Labels
obsolete – reopen if necessary Ticket closed as currently obsolete, reopen if discussion still relevant t:cluster

Comments

@tpantelis

We're using Akka (version 2.3.10) in a large open source project, the OpenDaylight SDN controller. We ran into an issue where a node is not allowed to rejoin a cluster while other node(s) are also down/unreachable, until they all become reachable or are started.

Here's the scenario:

I have a 3-node cluster - call them node1, node2 and node3 - with node1 being the cluster leader. I stopped node2 and node3 (my app, that is). node1 quickly declared the other nodes unreachable and continued to retry the connection, i.e.:

2015-07-23 02:55:40,271 | WARN  | lt-dispatcher-22 | receive$1$$anonfun$applyOrElse$2 | 71 | 236 - com.typesafe.akka.slf4j - 2.3.10 |  | Association with remote system [akka.tcp://opendaylight-cluster-data@127.0.0.1:2552] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@127.0.0.1:2552]] Caused by: [Connection refused: /127.0.0.1:2552]

I also got this message, which is expected based on the Akka docs:

2015-07-23 02:56:16,234 | INFO  | ult-dispatcher-2 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 |  | Cluster Node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550 -> akka.tcp://opendaylight-cluster-data@127.0.0.1:2552: Unreachable [Unreachable] (1), akka.tcp://opendaylight-cluster-data@127.0.0.1:2550 -> akka.tcp://opendaylight-cluster-data@127.0.0.1:2554: Unreachable [Unreachable] (2)], member status: [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@127.0.0.1:2552 Up seen=false, akka.tcp://opendaylight-cluster-data@127.0.0.1:2554 Up seen=false]

After I restarted node2, it tried to join, and this output from node1 seems to indicate it did:

2015-07-23 02:56:38,965 | INFO  | lt-dispatcher-24 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 |  | Cluster Node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550] - New incarnation of existing member [Member(address = akka.tcp://opendaylight-cluster-data@127.0.0.1:2552, status = Up)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
2015-07-23 02:56:38,965 | INFO  | lt-dispatcher-24 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 |  | Cluster Node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550] - Marking unreachable node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2552] as [Down]

But it actually didn't, and the "New incarnation of existing member..." message repeated over and over every 11 seconds.

After more testing I found that node1 would only allow node2 back in after node3 was started or was "downed", either via auto-down or manual down via JMX.

Setting auto-down-unreachable-after to a small value like 10s gives us the behavior we want, but auto-down-unreachable-after causes another issue, so we disable it.
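For reference, a minimal sketch of supplying this setting via the usual Typesafe Config bootstrap (the system name and the 10s value are just the ones from this report):

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Overlay the auto-down setting discussed above on top of application.conf.
val config = ConfigFactory.parseString(
  """
  akka.cluster {
    # Automatically mark unreachable members as Down after this period.
    # A low value lets a restarted node rejoin quickly, but (as noted later
    # in this thread) it also downs nodes that are only temporarily
    # partitioned, and those then need a restart to get back in.
    auto-down-unreachable-after = 10s
  }
  """).withFallback(ConfigFactory.load())

val system = ActorSystem("opendaylight-cluster-data", config)
```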

The akka docs say "When a member is considered by the failure detector to be unreachable the leader is not allowed to perform its duties, such as changing status of new joining members to 'Up'. The node must first become reachable again, or the status of the unreachable member must be changed to 'Down'.".

I take this to mean a new joining member that hadn't previously joined. However, in my case, node2 and node3 were already in the cluster and thus weren't new joining members. Also, node2 did become reachable again. The behavior seems to be that all unreachable nodes must become reachable/downed before any node is allowed to rejoin.

I'm not clear on the reasoning behind this behavior, and I can live with the former case above since new nodes aren't commonly added in our environment (clusters are mainly static). But for the latter case, it seems like a bug to me that node2, which was already in the cluster, isn't allowed to be transitioned to Up (i.e. rejoin) until node3 is restarted. It is perfectly fine to have node1 and node2 running in the cluster without node3. In fact, I can start node1 and node2 from an empty cluster and they join and form a cluster just fine with node3 not running. Therefore node2 should be able to rejoin node1 on restart with node3 not running. To me that's basic functionality.

On a side note, while node2 was in cluster "purgatory", node1 had a connection to node2 at the remoting layer, and actors on node1 were able to send messages to node2. All this while Akka clustering still deemed and reported node2 as unreachable. That seems broken: a major disconnect between the two components.

@ktoso ktoso added 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted t:cluster labels Jul 28, 2015
@ktoso
Member

ktoso commented Jul 28, 2015

Thanks for reporting @tpantelis, we should soon have some capacity to look into this (have not read it in detail yet). AFAIR it could be that we don't allow nodes to join if there is no convergence in the cluster, but I will have to double check.

@patriknw
Member

patriknw commented Aug 4, 2015

As you have found, node2 cannot join until its previous incarnation has been removed. When the new incarnation tries to join, the old one is downed automatically because we see a new UID. It will not be removed until all other unreachable members have been downed as well, i.e. node3 must be downed. Then all will be removed and node2 can join. If you had tried joining node3 at the same time, this would have happened automatically.

This works as designed.
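For reference, the manual "down" mentioned earlier can also be issued programmatically from any running node; a minimal sketch, using the node3 address from the logs in this report purely as an illustration:

```scala
import akka.actor.{ActorSystem, Address}
import akka.cluster.Cluster

// Assumes an already-running ActorSystem on one of the reachable nodes.
def downNode3(system: ActorSystem): Unit = {
  val cluster = Cluster(system)
  // Address taken from the logs above, purely illustrative.
  val node3 = Address("akka.tcp", "opendaylight-cluster-data", "127.0.0.1", 2554)
  // Downing node3 lets the leader regain convergence, remove node2's old
  // incarnation, and admit the restarted node2.
  cluster.down(node3) // same effect as a manual down issued via JMX
}
```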

@tpantelis
Author

It may be working as designed, but I'm questioning the design. I'll reiterate a couple of passages from above:

"It is perfectly fine to have node1 and node2 running in the cluster without node3. In fact, I can start node1 and node2 from an empty cluster and they join and form a cluster just fine with node3 not running."

But once all 3 are running, if node2 and node3 are both taken down, node2 can't rejoin until node3 is also restarted, unless both are auto-downed. This is the part I don't quite get. node2 was already in the cluster, so, when restarted, why can't it rejoin without node3, regardless of being "downed"? That seems inconsistent to me, as it is allowed to join from an empty cluster without node3. We're using Akka for HA, so we only need two nodes running.

We want to disable auto-down because it has an undesirable effect when a node is unreachable but not actually down: the node becomes quarantined and is not allowed back in until it is restarted. We need a node to automatically rejoin regardless. We're developing a commercial product, and end users will expect automatic recovery, as would I. I have no control over when or how many nodes will become unreachable or stopped in real deployments.

So we have kind of a catch-22 with auto-downing: if we set it low we get one undesirable behavior, and if we set it high we get another. How can we find a happy medium here?

@patriknw
Member

patriknw commented Aug 6, 2015

Why would you want to leave node-3 in the cluster if it is unreachable?

@tpantelis
Author

We're using Akka to provide an HA solution (using Raft), so the cluster is for the most part static (seed-nodes is configured with all IPs in the cluster during initial setup). If a node is unreachable or down it will eventually come back, except in the uncommon case where the VM/machine suffers catastrophic failure and needs to be replaced by another machine; this will require some manual intervention, which is fine.

So regardless of whether a node becomes unreachable due to a network partition or is shut down/restarted (e.g. for an upgrade), we want it to seamlessly rejoin the cluster without manual intervention. Whether or not the node is actually removed as a member by the leader doesn't really matter as long as the shutdown/unreachable node is allowed to rejoin once restarted/healed. Disabling auto-down-unreachable-after does this, except that if more than one node is shut down/unreachable, one of them can't rejoin until all of them are able to rejoin; this is not desirable. If we enable auto-down-unreachable-after (with a small value), then one shutdown/unreachable node can rejoin without the others, but then a node that is unreachable due to a network partition can't rejoin unless restarted, which is also not desirable. Temporary network blips will occur in production environments, and it would be unacceptable for customers to have to know when a blip occurred and restart nodes.

So that's our quandary.

@patriknw
Member

What you are requesting is not that simple. Even if it were allowed to join, its status would not be changed to Up until there is convergence, i.e. no unreachable members. See http://doc.akka.io/docs/akka/2.4-M2/common/cluster.html#Membership_Lifecycle

I think the solution is to use a proper downing strategy, which is something we are working on improving.

@tpantelis
Author

Setting auto-down to a low value (which we had originally) would be fine except for the network partition case. Why can't a node rejoin with the same UID after it's been downed? Can a flag be added to allow this?

We could also potentially auto-restart the actor system if we could programmatically know it is not being allowed to rejoin because it needs to be restarted. Is there any way to detect this (a cluster event would be nice)?
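For what it's worth, a hedged sketch of the kind of detection being asked about here, assuming the node can still receive the gossip that carries its own removal (the actor name and restart callback are illustrative, not an established Akka recipe):

```scala
import akka.actor.{Actor, ActorLogging, ActorSystem, Props}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{CurrentClusterState, MemberRemoved}

// Watches for the local member's Removed event and invokes a callback, e.g.
// to shut down and recreate the ActorSystem so the node can rejoin.
class SelfRemovedWatcher(onRemoved: () => Unit) extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  override def preStart(): Unit = cluster.subscribe(self, classOf[MemberRemoved])
  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive: Receive = {
    case _: CurrentClusterState => // ignore the initial state snapshot
    case MemberRemoved(member, _) if member.address == cluster.selfAddress =>
      log.warning("This node has been removed from the cluster; restarting the actor system")
      onRemoved()
    case _: MemberRemoved => // some other member was removed; ignore
  }
}

// Usage sketch: restart the system (or exit and let a process supervisor do it).
// val system = ActorSystem("opendaylight-cluster-data")
// system.actorOf(Props(new SelfRemovedWatcher(() => system.shutdown())))
```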

@nilok

nilok commented Aug 12, 2015

Hi Patrik,

I work with Tom on OpenDaylight. In essence, I see three solutions to this, and I'd love to get feedback on which one makes sense and how to go about implementing it.

The first is to fix Akka's clustering logic to allow us to make use of reachable nodes as soon as they are reachable, without conditions on the reachability of other nodes—at least in certain configurations.

The second is to reboot the actor system when we discover that we've been downed but can still talk to other nodes. We don't know how to tell whether the local node has been downed, but presumably that is possible.

The third is to avoid Akka clustering altogether and just use Akka remoting, which we'd rather not do because we actually like the failure detectors and other features.

Based on the comments on this thread, my guess is that the second option is really the only valid one in the short run. As Tom asks, could you provide some insight there?

@patriknw
Member

Please send me an email in private (patrik dot nordwall at typesafe.com) and I will give you some information about a feature that is not released yet. Are you a Typesafe subscriber by the way?

@nilok

nilok commented Aug 12, 2015

We're an open source project, so we don't typically subscribe to commercial versions or support for our dependencies. Is there any way you can talk publicly about the feature and give us an idea of when we might be able to use it?

@patriknw
Member

I understand. Let's continue here then. I'd love to talk publicly about the feature, but can't just yet. Hopefully that will change within a few weeks.

I repeat: what are you going to do with node-3, the unreachable node that has not been downed? Akka Cluster is not designed for having unreachable nodes lingering around forever. At some point you have to decide that they are dead and will never come back (no zombies please). Then node-2 would be removed and the restarted node-2 can join.

Akka Cluster is designed around the principle that some membership state transitions are only performed when there is so-called convergence, and that can only happen when all unreachable members have been marked as down.

One such transition is for changing state from Down to Removed (and actually removing the member from the cluster). Another such transition is for changing Joining to Up.

I understand that you challenge the design, but changing that is not something we can work on any time soon.

You must solve the downing of unreachable nodes anyway. That is not a trivial problem to solve. Our default recommendation is that the decision of what to do should be taken by a human operator or an external monitoring system. That is not the solution you are looking for.

Then you must build something that decides based on membership state. You must ensure that both sides of a network partition can make this decision by themselves and that they take the same decision about which part will keep running and which part will shut itself down.
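To make that concrete, here is a rough sketch (not an Akka API; all names are illustrative) of a keep-majority decision based on the cluster's membership state. Both sides of a partition see the same member list, so after the same grace period they reach complementary verdicts:

```scala
import akka.actor.ActorSystem
import akka.cluster.{Cluster, MemberStatus}

object MajorityDowningDecision {
  // Call this only after unreachability has persisted for some grace period.
  def decide(system: ActorSystem): Unit = {
    val cluster = Cluster(system)
    val state = cluster.state
    val total = state.members.count(_.status == MemberStatus.Up)
    val unreachable = state.unreachable.filter(_.status == MemberStatus.Up)
    val reachable = total - unreachable.size

    if (reachable * 2 > total) {
      // Majority side: down the unreachable members so the leader can make
      // progress again (remove old incarnations, move Joining members to Up).
      unreachable.foreach(m => cluster.down(m.address))
    } else {
      // Minority side: shut ourselves down so both sides agree on the outcome.
      system.shutdown() // terminate() on later Akka versions
    }
  }
}
```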

Another solution is to coordinate via an external component, such as a database or ZooKeeper. The problem then is what happens if that component is unavailable?

@He-Pin
Member

He-Pin commented Aug 13, 2015

@nilok We run an Akka cluster for our game server too, and we ran into this problem when we just wanted to quickly restart part of the cluster (for an update). In our case, we provide an external endpoint via spray that sends a command asking the cluster to remove/down the specific node we want to restart and then rejoin to the cluster.

I think you may need to take some manual control over it.

@He-Pin
Member

He-Pin commented Aug 13, 2015

@nilok Still, I don't think restarting the Akka node via the kill command is the right way; ConductR may provide some help, but we have not tried it.

@nilok

nilok commented Aug 13, 2015

Thanks for all the feedback. This all makes sense.

In essence we are doing a version of what was suggested with ZooKeeper, but we're building it on top of Akka remoting using Raft. Since that shares fate with the Akka nodes, we're not worried about it being unavailable independently of Akka.

I think what we're really looking for is to use Akka clustering's failure detector, i.e., the reachable/unreachable flag, without having to use the full Akka clustering membership since we provide our own notion of consensus without needing to rely on the Akka clustering notion of convergence.

Can you expand a bit on how people have used something like Zookeeper for this in the past?

@nilok

nilok commented Aug 13, 2015

@hepin1989 It sounds like you're basically just doing a graceful shutdown and restart. Can you expand on exactly how you did that? I know that @tpantelis tried to do that, but ran into some issues where it wasn't behaving like he'd want.

@tpantelis
Author

I tried to programmatically issue a "leave" on graceful shutdown, just prior to shutting down the actor system, but the cluster leader didn't receive the message. Perhaps the actor system was shut down too quickly, such that the leave message wasn't processed; I didn't try a delay or wait to verify the member was removed.
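For reference, a hedged sketch of that "wait before shutting down" variant: issue leave() for the local member, then wait for its MemberRemoved event before stopping the actor system. The 10-second timeout and all names are illustrative:

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

import akka.actor.{Actor, ActorSystem, Props}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{CurrentClusterState, MemberRemoved}

object GracefulLeave {
  def leaveAndShutdown(system: ActorSystem): Unit = {
    val cluster = Cluster(system)
    val removed = new CountDownLatch(1)

    // Listener that trips the latch once our own member reaches Removed.
    system.actorOf(Props(new Actor {
      cluster.subscribe(self, classOf[MemberRemoved])
      def receive: Receive = {
        case _: CurrentClusterState => // ignore the initial snapshot
        case MemberRemoved(m, _) if m.address == cluster.selfAddress =>
          removed.countDown()
        case _ => // events for other members; ignore
      }
    }))

    cluster.leave(cluster.selfAddress)
    // Give the Leaving/Exiting/Removed transitions time to propagate before
    // the actor system goes away; otherwise the leader may never see the Leave.
    removed.await(10, TimeUnit.SECONDS)
    system.shutdown()
  }
}
```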

@He-Pin
Member

He-Pin commented Aug 13, 2015

@tpantelis I don't think the leave should be issued from the node that is leaving; a better way, I think, would be to use a cluster singleton to issue it, and the cluster singleton would be better off running on the master.
I have noticed what you saw too: the other nodes will see it as unreachable. But AFAIK, a node can now join the cluster again if all other nodes saw it as unreachable and it comes up with a new UID.

I don't know why the node must be restarted in order to rejoin the cluster; why couldn't we let the other nodes drop the stashed messages and let it in? Maybe it would be great if an Akka cluster could be associated dynamically.

@He-Pin
Member

He-Pin commented Aug 13, 2015

@nilok In practice the ZooKeeper cluster is small too, 3/5 nodes, and it is more sensitive to latency. We could build what ZooKeeper does on top of an Akka cluster too, with 3/5 nodes. If you need more control over the Akka cluster, I think you need to control/assist it from the outside.

@rkuhn rkuhn removed the 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted label Mar 9, 2016
@patriknw patriknw added the obsolete – reopen if necessary Ticket closed as currently obsolete, reopen if discussion still relevant label Mar 9, 2016
@patriknw patriknw closed this as completed Mar 9, 2016
@nelsonblaha
Contributor

nelsonblaha commented Jan 29, 2018

We're seeing similar behavior that I could use some help with. We have three nodes in different AWS data centers, one of which is the sole seed node and exclusive possessor of a singleton, accomplished by using .withDataCenter on the singleton proxy settings. We can get our cluster to work as designed by starting the seed node and then the others, but if any of the nodes go down, it seems the only way to get them talking again is to restart the whole cluster in the same way. We'd like these to try to reconnect to the seed node and resume normal operation when they can.

When I take down a non-seed node, the seed node marks it as UNREACHABLE and begins to periodically log the following:

Association with remote system [akka.tcp://application@xxx.xx.x.xxx:xxxx] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://application@xxx.xx.x.xxx:xxxx]] Caused by: [connection timed out: /xxx.xx.x.xxx:xxxx]

Fair enough. When I bring the node back up, however, the newly-started node begins to repeat:

2018-01-29 22:59:09,587 [DEBUG]: akka.cluster.ClusterCoreDaemon in application-akka.actor.default-dispatcher-18 - 
now supervising Actor[akka://application/system/cluster/core/daemon/joinSeedNodeProcess-16#-1572745962]

2018-01-29 22:59:09,587 [DEBUG]: akka.cluster.JoinSeedNodeProcess in application-akka.actor.default-dispatcher-3 - 
started (akka.cluster.JoinSeedNodeProcess@2ae57537)

2018-01-29 22:59:09,755 [DEBUG]: akka.cluster.JoinSeedNodeProcess in application-akka.actor.default-dispatcher-2 - 
stopped

The seed node logs:

2018-01-29 22:56:25,442 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-4 - 
Cluster Node [akka.tcp://application@52.xx.xxx.xx:xxxx] dc [asia] - New incarnation of existing member [Member(address = akka.tcp://application@172.xx.x.xxx:xxxx, dataCenter = indonesia, status = Up)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.

2018-01-29 22:56:25,443 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-18 - 
Cluster Node [akka.tcp://application@52.xx.xxx.xx:xxxx] dc [asia] - Marking unreachable node [akka.tcp://application@172.xx.x.xxx:xxxx] as [Down]

and repeatedly thereafter:

2018-01-29 22:57:41,659 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-18 - 
Cluster Node [akka.tcp://application@52.xx.xxx.xx:xxxx] dc [asia] - Sending InitJoinAck message from node [akka.tcp://application@52.xx.xxx.xx:xxxx] to [Actor[akka.tcp://application@172.xx.x.xxx:xxxx/system/cluster/core/daemon/joinSeedNodeProcess-8#-1322646338]]

2018-01-29 22:57:41,827 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-18 - 
Cluster Node [akka.tcp://application@52.xx.xxx.xx:xxxx] dc [asia] - New incarnation of existing member [Member(address = akka.tcp://application@172.xx.x.xxx:xxxx, dataCenter = indonesia, status = Down)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.

It seems strange to me that the log says things "will" happen that don't: the existing member being removed and the new member being allowed to join. I've been googling that message and can't find an explanation of what I need to do to make that actually happen.

@nick-nachos
Contributor

nick-nachos commented Feb 8, 2018

@nelsonblaha this scenario manifests when the cluster is in a non-convergent state while a node is trying to re-join it after a restart: as soon as the restarted member issues a Join request to a cluster member, that cluster member finds the older version of it (same address, different UID) and marks it as Down. The old version of the member will then be removed from the cluster only if the cluster is in a convergent state, i.e. NO unreachable members exist. If that is not the case, the leader (who is responsible for removing members) cannot take any action, so the old version of the member is not removed and thus the new incarnation is not able to join the cluster. This ends up triggering the join-timeout, which starts the join process from scratch. If the unreachability persists for any reason, you end up in an endless cycle, which is probably what happened in your case.
