Cluster initial seed node intermittently fails to rejoin cluster on restart #18757
Comments
---
I re-did the tests again with seed-node-timeout=10s, which eliminated the issue. This probably means that the issue was something specific to my system that was preventing dnode1 and dnode2 from responding within 5 seconds. Unless you see something odd in the logs, I'm okay with having this issue closed. Thanks,
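For reference, the setting being changed here is akka.cluster.seed-node-timeout (default 5s); a minimal application.conf sketch of the change described above:

```
# application.conf -- raise the timeout for contacting the seed nodes
# during join from the default 5s to 10s
akka {
  cluster {
    seed-node-timeout = 10s
  }
}
```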
---
@garyiwu Glad to see that you found a solution. I would still like to investigate this somewhat before closing, because it is not good if we have default values that do not work. Are dnode1 and dnode2 loaded when you perform this test, so that it is possible they are not responsive during the 5-second timeout? It's rather strange that in dnode0 we can see that it is actually receiving messages from dnode1, but it still doesn't join it:

Message [org.opendaylight.controller.cluster.raft.messages.AppendEntries] from Actor[akka.tcp://opendaylight-cluster-data@dnode1:2550/user/shardmanager-config/member-2-shard-inventory-config#1943505944] to Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-inventory-config] was not delivered. [4] dead letters encountered.

I'll try to reproduce it myself, but it would be helpful if you can try one more time with full debug logging turned on. Please enable these config options: http://doc.akka.io/docs/akka/2.4.0/scala/logging.html#Auxiliary_remote_logging_options
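The auxiliary remote logging options on that docs page amount to a small block of settings; a sketch, assuming a standard application.conf (akka.loglevel must be DEBUG for them to produce output):

```
# Debug logging for remoting, per the linked Akka docs page
akka {
  loglevel = "DEBUG"
  remote {
    # log every user message sent/received over the wire
    log-sent-messages = on
    log-received-messages = on
  }
}
```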
---
New logs with full debugging enabled:
dnode0
dnode1 part 1
dnode1 part 2
dnode2 part 1
dnode2 part 2
---
My three nodes are each running in their own Docker container, but all three are on the same physical machine. dnode1 and dnode2 are idle, but dnode0's startup creates a heavy load, so it's possible that dnode0 is affecting the responsiveness of dnode1 and dnode2. I have seen before that even when dnode0 fails to join the existing cluster, there can be some Associated log messages recorded between dnode0 and the other nodes, which is odd.
---
Thanks for taking the time to produce those logs. There we see that dnode0 repeatedly sends its join request to the seed nodes; however, we don't see any traces of the reply. Another interesting thing: you can see that dnode1 and dnode2 send heartbeat messages to each other, logged by the ClusterHeartbeatSender. Do you run any blocking tasks on the default-dispatcher? (ClusterHeartbeatSender is also running on the default dispatcher, so that theory is weak.)
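For context on the dispatcher question: blocking work is normally isolated on a dedicated dispatcher so it cannot starve the default-dispatcher threads that heartbeating and the join handshake run on. A minimal sketch; the dispatcher name here is hypothetical:

```
# Hypothetical dedicated dispatcher for blocking tasks, defined in
# application.conf and selected via Props.withDispatcher("my-blocking-dispatcher")
my-blocking-dispatcher {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    core-pool-size-min = 4
    core-pool-size-factor = 2.0
    core-pool-size-max = 16
  }
  # process a single message per actor before giving the thread back
  throughput = 1
}
```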
---
This is part of the OpenDaylight controller, which is a pretty complex system, so I can't be certain that there are no blocking tasks. However, I would be surprised if there were any.
---
There's a mechanism that gates an unreachable node for 5 sec (I see log messages to this effect). I'm not sure of the reason for it, but I assume it disallows re-connecting for that time period. Is it possible that, if the timing is right, the 5 sec gate could cause the seed-node timeout to be exceeded? That would explain why increasing it to 10 sec alleviates the issue. If that's the case, then the default seed-node-timeout needs to be > 5 sec (actually > the gating period).
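If those gating messages come from Akka remoting's gate, the knob for the gating period is presumably akka.remote.retry-gate-closed-for. A sketch of the relationship being suggested here (treating the observed 5 sec as that setting's value is an assumption):

```
# Keep the seed-node join timeout longer than the remoting gate,
# so a briefly gated seed node can still be retried before the
# join attempt gives up
akka {
  remote.retry-gate-closed-for = 5s   # the suspected gating period
  cluster.seed-node-timeout = 10s     # keep this > the gating period
}
```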
---
That was also what I suspected when I first saw this issue, but when I analysed the logs they did not point in that direction. I might be wrong though, so if you can play around with different settings, such as decreasing the gating period, and it solves the issue, I'm open to changing the defaults.
---
We ended up increasing the seed-node-timeout to 12s and that has fixed the issue. On a side note, is seed-node-timeout only used/significant for the first seed node?
---
Right, it's only important for the first seed node, but it is also used by the others for the retry frequency. First: https://github.com/akka/akka/blob/master/akka-cluster/src/main/scala/akka/cluster/ClusterDaemon.scala#L1125
---
Cool. Thanks.
---
I have a three-node cluster with nodes dnode0, dnode1, and dnode2. Every node has the same seed node configuration, as follows:
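The configuration itself did not survive in this copy of the issue; judging from the actor paths visible in the logs (system opendaylight-cluster-data, port 2550), it was presumably shaped like this reconstruction:

```
# Reconstructed sketch -- system name, hosts, and port taken from the
# actor paths in the logs; dnode0 listed first per the description below
akka.cluster.seed-nodes = [
  "akka.tcp://opendaylight-cluster-data@dnode0:2550",
  "akka.tcp://opendaylight-cluster-data@dnode1:2550",
  "akka.tcp://opendaylight-cluster-data@dnode2:2550"
]
```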
I first start all three nodes to form a cluster. This works fine.
I then repeatedly restart dnode0 (the first listed seed node). About 10% of the time, dnode0 fails to rejoin the existing cluster and instead forms an island by itself. This happens on both 2.3.10 and 2.3.14.
This problem does not seem to occur when dnode1 or dnode2 is restarted.
My questions are:
Thanks,
Gary
Looks like I don't have permission to attach documents, so the logs are inlined below:
dnode0
dnode1
dnode2