
the first seed node won't join the cluster via other seed node after restarting #29983

Closed
Zhen-hao opened this issue Jan 29, 2021 · 32 comments

@Zhen-hao

To reproduce:

The cluster is of 3 nodes: node1, node2, node3
the seed nodes are [node1, node2]
they will form a cluster of 3 nodes if deployed at the same time.

now restart node1.

you can see the log from node2

Jan 29 16:22:33 node2: Cluster Node [akka://node2:2552] - Received InitJoin message from [Actor[akka://node1:2552/system/cluster/core/daemon/firstSeedNodeProcess-1#1473844315]] to [akka://node2:2552]
Jan 29 16:22:33 node2: Cluster Node [akka://node2:2552] - Sending InitJoinAck message from node [akka://node2:2552] to [Actor[akka://node1:2552/system/cluster/core/daemon/firstSeedNodeProcess-1#1473844315]] (version [2>

but on node1 the logs are:

Jan 29 16:21:54 node1: Cluster Node [akka://node1:2552] - Node [akka://node1:2552] is JOINING itself (with roles [backend, dc-default], version [1.0.0]) and forming new cluster
Jan 29 16:21:54 node1: Cluster Node [akka://node1:2552] - is the new leader among reachable nodes (more leaders may exist)

so node1 did try to join node2, but it apparently never processed the response from node2, as there is no debug log about it.
The end result is that node1 becomes a new standalone cluster.

@Zhen-hao
Author

tested with both TCP and Aeron-UDP

@Zhen-hao
Author

Zhen-hao commented Jan 29, 2021

I checked the source code (master branch). My conclusion is that the InitJoinAck message from node2 to node1 is never received by node1, because receiving it would have resulted in some log output.

I also could NOT find the following log on node1:

Couldn't join other seed nodes, will join myself

@patriknw
Member

patriknw commented Feb 1, 2021

We have a test for this so I doubt that it would be broken.

I also tried with the akka-sample-cluster-scala and it works fine.

sbt "runMain sample.cluster.simple.App 25251"
sbt "runMain sample.cluster.simple.App 25252"

restart 25251 and it will join 25252 again.

@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

we found the issue in our multi-node deployment.
One difference in our deployment from the Akka documentation is that we don't use the default port:
we have akka.remote.artery.canonical = 2552 and akka.remote.artery.bind = ""

@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

I mean we have the same config on all the nodes. We were using 2.6.11.
As a workaround, we set the seed nodes on node1 to [node2], on node2 to [node1], and on node3 to [node1, node2].
This solves the problem for rolling updates, but the cluster would fail to boot if all nodes restart at the same time.
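
Roughly, the workaround looks like this per node (a sketch with placeholder addresses; the actual actor system name and hostnames differ):

# node1's application.conf
akka.cluster.seed-nodes = ["akka://my-cluster@node2:2552"]
# node2's application.conf
akka.cluster.seed-nodes = ["akka://my-cluster@node1:2552"]
# node3's application.conf
akka.cluster.seed-nodes = ["akka://my-cluster@node1:2552", "akka://my-cluster@node2:2552"]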

@patriknw
Member

patriknw commented Feb 1, 2021

That doesn't sound like the right approach for configuring seed nodes. The easiest is to use the same list on all nodes, and there must be at least 2 (better more) entries in the list:

seed-nodes = [node1, node2, node3]

Read the docs carefully: https://doc.akka.io/docs/akka/current/typed/cluster.html#joining and consider using Cluster Bootstrap.
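
Spelled out with full addresses, every node would carry the identical list, for example (a sketch, not your actual addresses or system name):

akka.cluster.seed-nodes = [
  "akka://my-cluster@node1:2552",
  "akka://my-cluster@node2:2552",
  "akka://my-cluster@node3:2552"
]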

Closing this, since it's more of a question that we would prefer to handle at http://discuss.akka.io/

@patriknw patriknw closed this as completed Feb 1, 2021
@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

@patriknw I think you misunderstood my last comment.
The issue occurred with seed-nodes = [node1, node2, node3],

and we changed to the latter config as a workaround to temporarily overcome the issue.

@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

there is still a possibility that this is a real bug

@patriknw
Member

patriknw commented Feb 1, 2021

If you can show a minimized example that illustrates the problem we can look more at it. I suggest that you take the https://github.com/akka/akka-samples/tree/2.6/akka-sample-cluster-scala and show us how it doesn't work.

What do you mean with akka.remote.artery.bind = "" ? Exactly what akka.remote.artery.* and seed-nodes configuration are you using?

@patriknw patriknw reopened this Feb 1, 2021
@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

What do you mean with akka.remote.artery.bind = "" ? Exactly what akka.remote.artery.* and seed-nodes configuration are you using?

sorry, I meant akka.remote.artery.bind.port = ""
as in https://github.com/akka/akka/blob/master/akka-remote/src/main/resources/reference.conf#L783

my full artery config is

artery {
  enabled = on
  transport = aeron-udp

  canonical {
    port = 2552              // to be filled by the deployment
    hostname = "127.0.0.1"   // to be filled by the deployment
  }

  bind {
    port = ""
    hostname = ""
    bind-timeout = 3s
  }

  large-message-destinations = []
  untrusted-mode = off
  trusted-selection-paths = []
  log-received-messages = off
  log-sent-messages = off
  log-frame-size-exceeding = off

  advanced {
    maximum-frame-size = 2 MiB # 256 KiB
    buffer-pool-size = 128
    maximum-large-frame-size = 6 MiB
    large-buffer-pool-size = 32
    test-mode = off
    materializer = ${akka.stream.materializer}
    use-dispatcher = "akka.remote.default-remote-dispatcher"
    use-control-stream-dispatcher = "akka.actor.internal-dispatcher"
    inbound-lanes = 4
    outbound-lanes = 1
    outbound-message-queue-size = 3072
    outbound-control-queue-size = 20000
    outbound-large-message-queue-size = 256
    system-message-buffer-size = 20000
    system-message-resend-interval = 1 second
    handshake-timeout = 20 seconds
    handshake-retry-interval = 1 second
    inject-handshake-interval = 1 second
    give-up-system-message-after = 6 hours
    stop-idle-outbound-after = 5 minutes
    quarantine-idle-outbound-after = 6 hours
    stop-quarantined-after-idle = 3 seconds
    remove-quarantined-association-after = 1 h
    shutdown-flush-timeout = 1 second
    death-watch-notification-flush-timeout = 3 seconds
    inbound-restart-timeout = 5 seconds
    inbound-max-restarts = 5
    outbound-restart-backoff = 1 second
    outbound-restart-timeout = 5 seconds
    outbound-max-restarts = 5

    compression {
      actor-refs {
        max = 256
        advertisement-interval = 1 minute
      }
      manifests {
        max = 256
        advertisement-interval = 1 minute
      }
    }

    instruments = ${?akka.remote.artery.advanced.instruments} []

    # Only used when transport is aeron-udp
    aeron {
      log-aeron-counters = false
      embedded-media-driver = on
      aeron-dir = ""
      delete-aeron-dir = yes
      idle-cpu-level = 5
      give-up-message-after = 60 seconds
      client-liveness-timeout = 20 seconds
      publication-unblock-timeout = 40 seconds
      image-liveness-timeout = 10 seconds
      driver-timeout = 20 seconds
    }
  }
  ssl {
  }
}

the seed nodes are [node1, node2] on all 3 nodes. (same problem with [node1, node2, node3])

@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

I will try to make a minimal reproducible setup.

@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

I can't find a small reproducer yet.
Looking at the logs again, I see a lot of

sending remote message [HandshakeRsp(akka://my-cluster@node2:2552#2877356506108829124)] to [] from []

on node2. I don't know if this is an issue

and on node1, I see

Resolve (deserialization) of path [system/cluster/core/daemon/firstSeedNodeProcess-1#-1590522306] doesn't match an active actor. It has probably been stopped, using deadLetters.

and

received message [InitJoinAck(akka://my-cluster@node2:2552,CompatibleConfig(Config(SimpleConfigObject({"akka":{"cluster":{"downing-provider-class":"akka.cluster.sbr.SplitBrainResolverProvider","sharding":{"number-of-shards":1000,"state-store-mode":"ddata"},"split-brain-resolver":{"active-strategy":"keep-majority"},"typed":{"receptionist":{"distributed-key-count":5}}},"version":"2.6.12"}}))))] to [Actor[akka://my-cluster/system/cluster/core/daemon/firstSeedNodeProcess-1#-1590522306]] from [Actor[akka://my-cluster@node2:2552/system/cluster/core/daemon#-729820782]]

@patriknw can you help me understand the "It has probably been stopped, using deadLetters." part?
This can explain why InitJoinAck is never handled on node1, but I don't know what could cause firstSeedNodeProcess-1#-1590522306 to stop before it receives a response.

@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

on node2, I do see

Cluster Node [akka://my-cluster@node2:2552] - Received InitJoin message from [Actor[akka://my-cluster@node1:2552/system/cluster/core/daemon/firstSeedNodeProcess-1#-1590522306]] to [akka://my-cluster@node2:2552]

so node1 used firstSeedNodeProcess-1#-1590522306 to send the InitJoin message but didn't receive the response before it stopped.
And there is no log about its stopping...

@patriknw
Member

patriknw commented Feb 2, 2021

firstSeedNodeProcess stops when it has waited long enough, deciding to join itself instead. Have you changed, for example, the akka.cluster.seed-node-timeout configuration? After how long do you see the log message that node1 joins itself?

I might see better if you share the full akka debug level logs from node1 and node2.

@Zhen-hao
Author

Zhen-hao commented Feb 2, 2021

I have akka.cluster.seed-node-timeout = 5s.
I couldn't see node1 joining itself in the last debug run because I also have min-nr-of-members = 2, so node1 is now stuck in the joining state.
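
For reference, the relevant settings on my side are (a sketch of my config, not the defaults):

akka.cluster {
  seed-node-timeout = 5s   # how long the first seed node waits for a join response before joining itself
  min-nr-of-members = 2    # the leader won't move members to Up until at least this many have joined
}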

in the previous log example,

node2 received InitJoin at 22:10:42.273, 22:10:42.556, 22:10:42.685, 22:10:42.812, and 22:10:42.992

node1 received InitJoinAck at 22:10:42.820, 22:10:42.827, 22:10:42.833, 22:10:42.998, and 22:10:43.109

I can export the full logs of the 3 nodes from Elasticsearch if you think it is helpful.

@msb1

msb1 commented Feb 6, 2021

Just had the problem:

Cluster Node [akka://XXXXXXXX@localhost:2552] - Couldn't join seed nodes after [2] attempts, will try again. seed-nodes=[akka://XXXXXXXXX@0.0.0.0:2551, akka://XXXXXXX@0.0.0.0:2552]

Using these entries in application.conf:

akka.remote.artery {
  canonical.port = 2551
  canonical.hostname = localhost
}

akka.cluster {
  seed-nodes = [
    "akka://XXXXXXXX@0.0.0.0:2551",
    "akka://XXXXXXXX@0.0.0.0:2552"
  ]
  sharding {
    number-of-shards = 100
    remember-entities = off
    remember-entities-store = "eventsourced"
  }
  jmx.multi-mbeans-in-same-jvm = on
  downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
}

Changing application.conf to

akka.remote.artery {
  canonical.port = 0
  canonical.hostname = 0.0.0.0
}

Fixes the problem.

Note that the canonical.port is overridden in Scala code:

import com.typesafe.config.{ Config, ConfigFactory }

private def configWithPort(port: Int): Config =
  ConfigFactory.parseString(s"""akka.remote.artery.canonical.port = $port""").withFallback(ConfigFactory.load())
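
For context, that helper is used roughly like this when starting each node (a sketch; the guardian behavior and system name here are placeholders, not the actual application code):

import akka.actor.typed.ActorSystem
import akka.actor.typed.scaladsl.Behaviors

def startNode(port: Int): ActorSystem[Nothing] =
  // each node gets its own canonical port while sharing the rest of application.conf
  ActorSystem[Nothing](Behaviors.empty[Nothing], "ClusterSystem", configWithPort(port))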

Could also have overridden akka.remote.artery.canonical.hostname, but that appears to be set properly in the config file for now. From the Akka Cluster documentation:

"Remoting is the mechanism by which Actors on different nodes talk to each other internally. When building an Akka application, you would usually not use the Remoting concepts directly, but instead use the more high-level Akka Cluster utilities or technology-agnostic protocols such as HTTP, gRPC etc."

"The port number needs to be unique for each actor system on the same machine even if the actor systems have different names. This is because each actor system has its own networking subsystem listening for connections and handling messages as not to interfere with other actor systems."

It appears that the cluster nodes need this set properly so they can join the seed nodes. A bit more from the documentation:

  • Change provider from local. We recommend using Akka Cluster over using remoting directly.
  • Enable Artery to use it as the remoting implementation
  • Add host name - the machine you want to run the actor system on; this host name is exactly what is passed to remote systems in order to identify this system, and consequently used for connecting back to this system if need be, hence set it to a reachable IP address or resolvable name in case you want to communicate across the network.
  • Add port number - the port the actor system should listen on, set to 0 to have it chosen automatically

Artery Remoting Documentation link: https://doc.akka.io/docs/akka/current/remoting-artery.html#configuration

@Zhen-hao
Author

Zhen-hao commented Feb 6, 2021

(quoting @msb1's comment above in full)

I don't think it is the same issue. (It is not clear what you were trying to do. I can only assume you were trying to run the cluster example locally.)

@msb1

msb1 commented Feb 6, 2021 via email

@patriknw
Member

I'm rather confused by all the questions and comments here. Some just acknowledge how it's supposed to work. Some look like trial-and-error config changes.

Using 0.0.0.0 in the seed-nodes config will not work; that must be a specific address, same as canonical.hostname.

If you are new to Akka, start with artery-tcp, since that's easier to use than aeron-udp.

I'd recommend that you start from an example. It's not much configuration needed if you just want to try an Akka Cluster locally. See https://developer.lightbend.com/start/?group=akka&project=akka-samples-cluster-scala

I'd also recommend the tutorial in the Akka Platform Guide. It has downloadable samples, and for example the shopping-order-service is a "minimal" cluster app. https://developer.lightbend.com/docs/akka-platform-guide/microservices-tutorial/index.html

@Zhen-hao
Author

The content posted by @msb1 is not really relevant to the original issue. I agree it is confusing.

To summarize the issue (so future readers don't need to read everything above):

I have a 3-node cluster with node1 and node2 as seed nodes. During a rolling update (or just a restart), I've seen:

node2 received InitJoin at 22:10:42.273, 22:10:42.556, 22:10:42.685, 22:10:42.812, and 22:10:42.992

node1 received InitJoinAck at 22:10:42.820, 22:10:42.827, 22:10:42.833, 22:10:42.998, and 22:10:43.109
yet on node1 there are errors like

Resolve (deserialization) of path [system/cluster/core/daemon/firstSeedNodeProcess-1#-1590522306] doesn't match an active actor. It has probably been stopped, using deadLetters.

indicating that firstSeedNodeProcess has died before handling InitJoinAck. I don't think it is caused by a timeout, because node2 always replies within one second.

I couldn't reproduce this issue by modifying the example Akka cluster project.
Currently, I use the following workaround to make rolling updates work in production:

  1. node1's config only contains node2 as seed node
  2. node2's config only contains node1 as seed node
  3. node3's config contains node1 and node2 as seed nodes

Note: This workaround only works for rolling updates. We need to set seed nodes to node1 and node2 on each node for a fresh deployment.

I read the source code and couldn't figure out what could cause firstSeedNodeProcess to die early. Everything about its lifecycle looks fine to me.

@patriknw
Member

What happens if you define seed-nodes = [node1, node2, node3] in all 3? Same config for all.

@Zhen-hao
Author

Zhen-hao commented Mar 29, 2021

What happens if you define seed-nodes = [node1, node2, node3] in all 3? Same config for all.

It is the same: node1 gets stuck at joining after restarting. (It gets stuck because min-nr-of-members is set to 2.)
They will form a cluster again only when I restart node2 and node3.

@patriknw
Member

You have all 3 running (Up), restart node1. Then it will join node2 or node3. It will be moved to Up because 3 >= 2.
Can you share some evidence of that? An example application with which I can reproduce the problem, or logs from all 3 nodes?

@msb1

msb1 commented Mar 29, 2021 via email

@Zhen-hao
Author

Zhen-hao commented Mar 29, 2021

@patriknw I uploaded the full debug logs to Google Drive.
The logs for each node start around 20:14-20:15.
The restart of node1 happened around 20:16. I stopped all nodes after node1 had been in "joining" for a while.

FYI,
node1 is 192.168.1.102
node2 is 192.168.1.103
node3 is 192.168.1.202

@Zhen-hao
Author

I got help from someone on PairTime and found the cause in my code. I still want someone to explain this though.

The fix is "stupid":
I had

val codeSet = compileTimeMacroGeneratedCode
import codeSet._

in my serializer

class MySerializer(system: ExtendedActorSystem) extends SerializerWithStringManifest

After making codeSet lazy and removing the import, the seed node issue is gone.
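
In other words, the serializer now looks roughly like this (a sketch; the identifier and the actual (de)serialization logic are placeholders, and compileTimeMacroGeneratedCode is the macro-generated object from the snippet above):

import akka.actor.ExtendedActorSystem
import akka.serialization.SerializerWithStringManifest

class MySerializer(system: ExtendedActorSystem) extends SerializerWithStringManifest {
  // was: val codeSet = compileTimeMacroGeneratedCode; import codeSet._
  // lazy defers the (apparently expensive) initialization until the first (de)serialization
  lazy val codeSet = compileTimeMacroGeneratedCode

  override def identifier: Int = 4711 // placeholder id
  override def manifest(o: AnyRef): String = ???
  override def toBinary(o: AnyRef): Array[Byte] = ???
  override def fromBinary(bytes: Array[Byte], manifest: String): AnyRef = ???
}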

This doesn't make sense to me because codeSet doesn't have any Akka dependencies or interact with the Akka cluster. All it contains are some pure functions and values generated by some macros.

On the other hand, I'm happy to get rid of this issue on my side.

@patriknw
Member

Thanks for reporting back. I think I see the reason.

Mar 29 20:16:39 compute2 systemd[1]: Starting nt-ui.service...
Mar 29 20:16:45 compute2 nt-ui[2049]: Starting outbound message stream to [akka://nt-ui@192.168.1.202:2552]
Mar 29 20:16:50 compute2 nt-ui[2049]: Cluster Node [akka://nt-ui@192.168.1.102:2552] - Couldn't join other seed nodes, will join myself. seed-nodes=[akka://nt-ui@192.168.1.102:2552, akka://nt-ui@192.168.1.103:2552, akka://nt-ui@192.168.1.202:2552]

Note that there is no logging between 16:45 and 16:50. Something is blocking all progress. That could be the macro stuff.

This join self timeout can be configured with akka.cluster.seed-node-timeout, but that wouldn't solve the real problem.

It would probably be good to trigger the macro stuff earlier, before starting the ActorSystem, if it causes a complete 5-second halt. Having such a halt later (lazy) will not be good either.
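
For example, something along these lines in the application's main, before the ActorSystem is created (a sketch; codeSet is the macro-generated value from the earlier comments, not an Akka API):

object Main extends App {
  // Force the expensive macro-generated initialization up front,
  // before cluster formation starts, so it cannot stall the seed node process.
  val codeSet = compileTimeMacroGeneratedCode
  println(s"codeSet initialized, size ${codeSet.size}")

  // ... then create the ActorSystem and join the cluster as before
}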

@Zhen-hao
Author

It would probably be good to trigger the macro stuff earlier, before starting the ActorSystem, if it causes a complete 5-second halt. Having such a halt later (lazy) will not be good either.

To work around that, I wrap a reference to codeSet in a Future, something like

Future {
  // referencing codeSet here forces its initialization on some ExecutionContext, off the startup path
  log.info(s"starting codeSet of size ${codeSet.size}")
}

@patriknw I have two questions about your explanation.

  1. I doubt the macro code takes 5 seconds to run. The hard work is done at compile time, and it is just (possibly large) values at runtime. Even if it blocks the serializer thread, why does it affect the seed node process?
  2. Is the initialization order stated somewhere? I'm surprised that user-defined serializers can affect seed node discovery. They should run in a strict order, or on two completely separate threads.

@leviramsey
Contributor

The delay would seem to me to be classloading. Forcing the classes referred to by the serializer (e.g. the message classes you're sending over the wire or persisting) to be loaded earlier might help.

@patriknw
Copy link
Member

Yeah, I think you have a global lock/halt (probably because the macro stuff hooks into class loading).

@Zhen-hao
Author

Is there a documented best practice to ensure a certain class loading order?
I know nothing about the locking mechanism in the class loader...

@leviramsey
Contributor

Referring to the classes you're serializing before starting the cluster would probably be a good bet. Disclaimer: classloading is a "here be dragons" area of the JVM for me, so I can't claim expertise.
