
the first seed node won't join the cluster via other seed node after restarting #29983

Closed
Zhen-hao opened this issue Jan 29, 2021 · 32 comments

@Zhen-hao

To reproduce:

The cluster is of 3 nodes: node1, node2, node3
the seed nodes are [node1, node2]
they will form a cluster of 3 nodes if deployed at the same time.

now restart node1.

you can see the log from node2

Jan 29 16:22:33 node2: Cluster Node [akka://node2:2552] - Received InitJoin message from [Actor[akka://node1:2552/system/cluster/core/daemon/firstSeedNodeProcess-1#1473844315]] to [akka://node2:2552]
Jan 29 16:22:33 node2: Cluster Node [akka://node2:2552] - Sending InitJoinAck message from node [akka://node2:2552] to [Actor[akka://node1:2552/system/cluster/core/daemon/firstSeedNodeProcess-1#1473844315]] (version [2>

but on node1 the logs are:

Jan 29 16:21:54 node1: Cluster Node [akka://node1:2552] - Node [akka://node1:2552] is JOINING itself (with roles [backend, dc-default], version [1.0.0]) and forming new cluster
Jan 29 16:21:54 node1: Cluster Node [akka://node1:2552] - is the new leader among reachable nodes (more leaders may exist)

so node1 did try to join node2, but it apparently never processed the response from node2, as there is no debug log about it.
The end result is that node1 becomes a new standalone cluster.

@Zhen-hao
Author

tested with both TCP and Aeron-UDP

@Zhen-hao
Author

Zhen-hao commented Jan 29, 2021

I checked the source code (master branch). My conclusion is that the InitJoinAck message from node2 to node1 is never received by node1, because receiving it would have resulted in some log output.

I also could NOT find the following log on node1:

Couldn't join other seed nodes, will join myself

@patriknw
Member

patriknw commented Feb 1, 2021

We have a test for this so I doubt that it would be broken.

I also tried with the akka-sample-cluster-scala and it works fine.

sbt "runMain sample.cluster.simple.App 25251"
sbt "runMain sample.cluster.simple.App 25252"

restart 25251 and it will join 25252 again.

@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

we found the issue in our multi-node deployment.
One difference in our deployment from the Akka documentation is that we don't use the default port:
we have akka.remote.artery.canonical = 2552 and akka.remote.artery.bind = ""

@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

I mean we have the same config on all the nodes. We were using 2.6.11.
As a workaround, we set the seed nodes on node1 to [node2], on node2 to [node1], and on node3 to [node1, node2].
This solves the problem for rolling updates, but the cluster would fail to boot if all nodes restart at the same time.
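
Roughly, the workaround looks like this per node (a sketch with placeholder addresses; the actual actor system name and hostnames differ):

# node1's application.conf
akka.cluster.seed-nodes = ["akka://my-cluster@node2:2552"]
# node2's application.conf
akka.cluster.seed-nodes = ["akka://my-cluster@node1:2552"]
# node3's application.conf
akka.cluster.seed-nodes = ["akka://my-cluster@node1:2552", "akka://my-cluster@node2:2552"]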

@patriknw
Member

patriknw commented Feb 1, 2021

That doesn't sound like the right approach for configuring seed nodes. The easiest is to use the same list on all nodes, and there must be at least 2 (better more) entries in the list:

seed-nodes = [node1, node2, node3]

Read the docs carefully: https://doc.akka.io/docs/akka/current/typed/cluster.html#joining and consider using Cluster Bootstrap.
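
Spelled out with full addresses, every node would carry the identical list, for example (a sketch, not your actual addresses or system name):

akka.cluster.seed-nodes = [
  "akka://my-cluster@node1:2552",
  "akka://my-cluster@node2:2552",
  "akka://my-cluster@node3:2552"
]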

Closing this, since it's more of a question that we would prefer to handle at http://discuss.akka.io/

@patriknw patriknw closed this as completed Feb 1, 2021
@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

@patriknw I think you misunderstood my last comment.
The issue occurred with seed-nodes = [node1, node2, node3],

and we changed to the latter config as a workaround to temporarily overcome the issue.

@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

there is still a possibility that this is a real bug

@patriknw
Member

patriknw commented Feb 1, 2021

If you can show a minimized example that illustrates the problem we can look more at it. I suggest that you take the https://github.com/akka/akka-samples/tree/2.6/akka-sample-cluster-scala and show us how it doesn't work.

What do you mean with akka.remote.artery.bind = "" ? Exactly what akka.remote.artery.* and seed-nodes configuration are you using?

@patriknw patriknw reopened this Feb 1, 2021
@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

What do you mean with akka.remote.artery.bind = "" ? Exactly what akka.remote.artery.* and seed-nodes configuration are you using?

sorry, I meant akka.remote.artery.bind.port = ""
as in https://github.com/akka/akka/blob/master/akka-remote/src/main/resources/reference.conf#L783

my full artery config is

artery {
  enabled = on
  transport = aeron-udp

  canonical {
    port = 2552              // to be filled by the deployment
    hostname = "127.0.0.1"   // to be filled by the deployment
  }

  bind {
    port = ""
    hostname = ""
    bind-timeout = 3s
  }

  large-message-destinations = []
  untrusted-mode = off
  trusted-selection-paths = []
  log-received-messages = off
  log-sent-messages = off
  log-frame-size-exceeding = off

  advanced {
    maximum-frame-size = 2 MiB # 256 KiB
    buffer-pool-size = 128
    maximum-large-frame-size = 6 MiB
    large-buffer-pool-size = 32
    test-mode = off
    materializer = ${akka.stream.materializer}
    use-dispatcher = "akka.remote.default-remote-dispatcher"
    use-control-stream-dispatcher = "akka.actor.internal-dispatcher"
    inbound-lanes = 4
    outbound-lanes = 1
    outbound-message-queue-size = 3072
    outbound-control-queue-size = 20000
    outbound-large-message-queue-size = 256
    system-message-buffer-size = 20000
    system-message-resend-interval = 1 second
    handshake-timeout = 20 seconds
    handshake-retry-interval = 1 second
    inject-handshake-interval = 1 second
    give-up-system-message-after = 6 hours
    stop-idle-outbound-after = 5 minutes
    quarantine-idle-outbound-after = 6 hours
    stop-quarantined-after-idle = 3 seconds
    remove-quarantined-association-after = 1 h
    shutdown-flush-timeout = 1 second
    death-watch-notification-flush-timeout = 3 seconds
    inbound-restart-timeout = 5 seconds
    inbound-max-restarts = 5
    outbound-restart-backoff = 1 second
    outbound-restart-timeout = 5 seconds
    outbound-max-restarts = 5

    compression {
      actor-refs {
        max = 256
        advertisement-interval = 1 minute
      }
      manifests {
        max = 256
        advertisement-interval = 1 minute
      }
    }

    instruments = ${?akka.remote.artery.advanced.instruments} []

    # Only used when transport is aeron-udp
    aeron {
      log-aeron-counters = false
      embedded-media-driver = on
      aeron-dir = ""
      delete-aeron-dir = yes
      idle-cpu-level = 5
      give-up-message-after = 60 seconds
      client-liveness-timeout = 20 seconds
      publication-unblock-timeout = 40 seconds
      image-liveness-timeout = 10 seconds
      driver-timeout = 20 seconds
    }
  }
  ssl {
  }
}

the seed nodes are [node1, node2] on all 3 nodes. (same problem with [node1, node2, node3])

@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

I will try to make a minimal reproducible setup.

@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

I can't find a small reproducer yet.
Looking at the logs again, I see a lot of

sending remote message [HandshakeRsp(akka://my-cluster@node2:2552#2877356506108829124)] to [] from []

on node2. I don't know if this is an issue

and on node1, I see

Resolve (deserialization) of path [system/cluster/core/daemon/firstSeedNodeProcess-1#-1590522306] doesn't match an active actor. It has probably been stopped, using deadLetters.

and

received message [InitJoinAck(akka://my-cluster@node2:2552,CompatibleConfig(Config(SimpleConfigObject({"akka":{"cluster":{"downing-provider-class":"akka.cluster.sbr.SplitBrainResolverProvider","sharding":{"number-of-shards":1000,"state-store-mode":"ddata"},"split-brain-resolver":{"active-strategy":"keep-majority"},"typed":{"receptionist":{"distributed-key-count":5}}},"version":"2.6.12"}}))))] to [Actor[akka://my-cluster/system/cluster/core/daemon/firstSeedNodeProcess-1#-1590522306]] from [Actor[akka://my-cluster@node2:2552/system/cluster/core/daemon#-729820782]]

@patriknw can you help me understand the "It has probably been stopped, using deadLetters." part?
This can explain why InitJoinAck is never handled on node1, but I don't know what could cause firstSeedNodeProcess-1#-1590522306 to stop before it receives a response.

@Zhen-hao
Author

Zhen-hao commented Feb 1, 2021

on node2, I do see

Cluster Node [akka://my-cluster@node2:2552] - Received InitJoin message from [Actor[akka://my-cluster@node1:2552/system/cluster/core/daemon/firstSeedNodeProcess-1#-1590522306]] to [akka://my-cluster@node2:2552]

so node1 used firstSeedNodeProcess-1#-1590522306 to send the InitJoin message but didn't receive the response before it stopped.
And there is no log about its stopping...

@patriknw
Member

patriknw commented Feb 2, 2021

firstSeedNodeProcess stops when it has waited long enough, deciding to join itself instead. Have you changed, for example, the akka.cluster.seed-node-timeout configuration? After how long do you see the log message that node1 joins itself?

I might see better if you share the full akka debug level logs from node1 and node2.

@Zhen-hao
Author

Zhen-hao commented Feb 2, 2021

I have akka.cluster.seed-node-timeout = 5s.
I couldn't see node1 joining itself in the last debug run because I also have min-nr-of-members = 2, so node1 is now stuck in the joining state.
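
For reference, the relevant settings on my side are (a sketch of my config, not the defaults):

akka.cluster {
  seed-node-timeout = 5s   # how long the first seed node waits for a join response before joining itself
  min-nr-of-members = 2    # the leader won't move members to Up until at least this many have joined
}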

in the previous log example,

node2 received InitJoin at 22:10:42.273, 22:10:42.556, 22:10:42.685, 22:10:42.812, and 22:10:42.992

node1 received InitJoinAck at 22:10:42.820, 22:10:42.827, 22:10:42.833, 22:10:42.998, and 22:10:43.109

I can export the full logs of the 3 nodes from Elasticsearch if you think it is helpful.

@msb1

msb1 commented Feb 6, 2021

Just had the problem:

Cluster Node [akka://XXXXXXXX@localhost:2552] - Couldn't join seed nodes after [2] attempts, will try again. seed-nodes=[akka://XXXXXXXXX@0.0.0.0:2551, akka://XXXXXXX@0.0.0.0:2552]

Using these entries in application.conf:

akka.remote.artery {
  canonical.port = 2551
  canonical.hostname = localhost
}

akka.cluster {
  seed-nodes = [
    "akka://XXXXXXXX@0.0.0.0:2551",
    "akka://XXXXXXXX@0.0.0.0:2552"
  ]
  sharding {
    number-of-shards = 100
    remember-entities = off
    remember-entities-store = "eventsourced"
  }
  jmx.multi-mbeans-in-same-jvm = on
  downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
}

Changing application.conf to

akka.remote.artery {
  canonical.port = 0
  canonical.hostname = 0.0.0.0
}

Fixes the problem.

Note that the canonical.port is overridden in Scala code:

import com.typesafe.config.{ Config, ConfigFactory }

private def configWithPort(port: Int): Config =
  ConfigFactory.parseString(s"""akka.remote.artery.canonical.port = $port""").withFallback(ConfigFactory.load())
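
For context, that helper is used roughly like this when starting each node (a sketch; the guardian behavior and system name here are placeholders, not the actual application code):

import akka.actor.typed.ActorSystem
import akka.actor.typed.scaladsl.Behaviors

def startNode(port: Int): ActorSystem[Nothing] =
  // each node gets its own canonical port while sharing the rest of application.conf
  ActorSystem[Nothing](Behaviors.empty[Nothing], "ClusterSystem", configWithPort(port))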

Could also have overridden akka.remote.artery.canonical.hostname, but that appears to be set properly in the config file for now. From the Akka Cluster documentation:

"Remoting is the mechanism by which Actors on different nodes talk to each other internally. When building an Akka application, you would usually not use the Remoting concepts directly, but instead use the more high-level Akka Cluster utilities or technology-agnostic protocols such as HTTP, gRPC etc."

"The port number needs to be unique for each actor system on the same machine even if the actor systems have different names. This is because each actor system has its own networking subsystem listening for connections and handling messages as not to interfere with other actor systems."

It appears that the cluster nodes need this set properly so they can join the seed nodes. A bit more from the documentation:

  • Change provider from local. We recommend using Akka Cluster over using remoting directly.
  • Enable Artery to use it as the remoting implementation
  • Add host name - the machine you want to run the actor system on; this host name is exactly what is passed to remote systems in order to identify this system, and consequently used for connecting back to this system if need be, hence set it to a reachable IP address or resolvable name in case you want to communicate across the network.
  • Add port number - the port the actor system should listen on, set to 0 to have it chosen automatically

Artery Remoting Documentation link: https://doc.akka.io/docs/akka/current/remoting-artery.html#configuration

@Zhen-hao
Author

Zhen-hao commented Feb 6, 2021

(quoting @msb1's comment above in full)

I don't think it is the same issue. (It is not clear what you were trying to do. I can only assume you were trying to run the cluster example locally.)

@msb1

msb1 commented Feb 6, 2021 via email

@patriknw
Member

I'm rather confused by all the questions and comments here. Some just acknowledge how it's supposed to work. Some look like trial-and-error config changes.

Using 0.0.0.0 in the seed-nodes config will not work; that must be a specific address, same as canonical.hostname.

If you are new to Akka, start with artery-tcp, since that's easier to use than aeron-udp.

I'd recommend that you start from an example. It's not much configuration needed if you just want to try an Akka Cluster locally. See https://developer.lightbend.com/start/?group=akka&project=akka-samples-cluster-scala

I'd also recommend the tutorial in the Akka Platform Guide. It has downloadable samples, and for example the shopping-order-service is a "minimal" cluster app. https://developer.lightbend.com/docs/akka-platform-guide/microservices-tutorial/index.html

@Zhen-hao
Author

The content posted by @msb1 is not really relevant to the original issue. I agree it is confusing.

To summarize the issue (so future readers don't need to read everything above):

I have a 3-node cluster with node1 and node2 as seed nodes. During a rolling update (or just a restart), I've seen:

node2 received InitJoin at 22:10:42.273, 22:10:42.556, 22:10:42.685, 22:10:42.812, and 22:10:42.992

node1 received InitJoinAck at 22:10:42.820, 22:10:42.827, 22:10:42.833, 22:10:42.998, and 22:10:43.109
yet on node1 there are errors like

Resolve (deserialization) of path [system/cluster/core/daemon/firstSeedNodeProcess-1#-1590522306] doesn't match an active actor. It has probably been stopped, using deadLetters.

indicating that firstSeedNodeProcess has died before handling InitJoinAck. I don't think it is caused by a timeout, because node2 always replies within one second.

I couldn't reproduce this issue by modifying the example Akka cluster project.
Currently, I use the following workaround to make rolling updates work in production:

  1. node1's config only contains node2 as seed node
  2. node2's config only contains node1 as seed node
  3. node3's config contains node1 and node2 as seed nodes

Note: This workaround only works for rolling updates. We need to set seed nodes to node1 and node2 on each node for a fresh deployment.

I read the source code and couldn't figure out what could cause firstSeedNodeProcess to die early. Everything about its lifecycle looks fine to me.

@patriknw
Member

What happens if you define seed-nodes = [node1, node2, node3] in all 3? Same config for all.

@Zhen-hao
Author

Zhen-hao commented Mar 29, 2021

What happens if you define seed-nodes = [node1, node2, node3] in all 3? Same config for all.

It is the same: node1 gets stuck at joining after restarting. (It gets stuck because min-nr-of-members is set to 2.)
They will form a cluster again only when I restart node2 and node3.

@patriknw
Member

You have all 3 running (Up), restart node1. Then it will join node2 or node3. It will be moved to Up because 3 >= 2.
Can you share some evidence of that? An example application with which I can reproduce the problem, or logs from all 3 nodes?

@msb1

msb1 commented Mar 29, 2021 via email

@Zhen-hao
Author

Zhen-hao commented Mar 29, 2021

@patriknw I uploaded the full debug logs to Google Drive.
The logs for each node start around 20:14-20:15.
The restart of node1 happened around 20:16. I stopped all nodes after node1 had been in "joining" for a while.

FYI,
node1 is 192.168.1.102
node2 is 192.168.1.103
node3 is 192.168.1.202

@Zhen-hao
Author

I got help from someone on PairTime and found the cause in my code. I still want someone to explain this though.

The fix is "stupid":
I had

val codeSet = compileTimeMacroGeneratedCode
import codeSet._

in my serializer

class MySerializer(system: ExtendedActorSystem) extends SerializerWithStringManifest

After making codeSet lazy and removing the import, the seed node issue is gone.
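
In other words, the serializer now looks roughly like this (a sketch; the identifier and the actual (de)serialization logic are placeholders, and compileTimeMacroGeneratedCode is the macro-generated object from the snippet above):

import akka.actor.ExtendedActorSystem
import akka.serialization.SerializerWithStringManifest

class MySerializer(system: ExtendedActorSystem) extends SerializerWithStringManifest {
  // was: val codeSet = compileTimeMacroGeneratedCode; import codeSet._
  // lazy defers the (apparently expensive) initialization until the first (de)serialization
  lazy val codeSet = compileTimeMacroGeneratedCode

  override def identifier: Int = 4711 // placeholder id
  override def manifest(o: AnyRef): String = ???
  override def toBinary(o: AnyRef): Array[Byte] = ???
  override def fromBinary(bytes: Array[Byte], manifest: String): AnyRef = ???
}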

This doesn't make sense to me because codeSet doesn't have any Akka dependencies or interact with the Akka cluster. All it contains are some pure functions and values generated by some macros.

On the other hand, I'm happy to get rid of this issue on my side.

@patriknw
Member

Thanks for reporting back. I think I see the reason.

Mar 29 20:16:39 compute2 systemd[1]: Starting nt-ui.service...
Mar 29 20:16:45 compute2 nt-ui[2049]: Starting outbound message stream to [akka://nt-ui@192.168.1.202:2552]
Mar 29 20:16:50 compute2 nt-ui[2049]: Cluster Node [akka://nt-ui@192.168.1.102:2552] - Couldn't join other seed nodes, will join myself. seed-nodes=[akka://nt-ui@192.168.1.102:2552, akka://nt-ui@192.168.1.103:2552, akka://nt-ui@192.168.1.202:2552]

Note that there is no logging between 16:45 and 16:50. Something is blocking all progress. That could be the macro stuff.

This join self timeout can be configured with akka.cluster.seed-node-timeout, but that wouldn't solve the real problem.

It would probably be good to trigger the macro stuff earlier, before starting the ActorSystem, if it causes a complete 5-second halt. Having such a halt later (lazy) will not be good either.
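
For example, something along these lines in the application's main, before the ActorSystem is created (a sketch; codeSet is the macro-generated value from the earlier comments, not an Akka API):

object Main extends App {
  // Force the expensive macro-generated initialization up front,
  // before cluster formation starts, so it cannot stall the seed node process.
  val codeSet = compileTimeMacroGeneratedCode
  println(s"codeSet initialized, size ${codeSet.size}")

  // ... then create the ActorSystem and join the cluster as before
}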

@Zhen-hao
Author

It would probably be good to trigger the macro stuff earlier, before starting the ActorSystem, if it causes a complete 5-second halt. Having such a halt later (lazy) will not be good either.

To work around that, I wrap a reference to codeSet in a Future, something like

Future {
  // referencing codeSet here forces its initialization on some ExecutionContext, off the startup path
  log.info(s"starting codeSet of size ${codeSet.size}")
}

@patriknw I have two questions about your explanation.

  1. I doubt the macro code takes 5 seconds to run. The hard work is done at compile time, and it is just (possibly large) values at runtime. Even if it blocks the serializer thread, why does it affect the seed node process?
  2. Is the initialization order stated somewhere? I'm surprised that user-defined serializers can affect seed node discovery. They should run in a strict order, or on two completely separate threads.

@leviramsey
Contributor

The delay would seem to me to be classloading. Forcing the classes referred to by the serializer (e.g. the message classes you're sending over the wire or persisting) to be loaded earlier might help.

@patriknw
Copy link
Member

Yeah, I think you have a global lock/halt (probably because the macro stuff hooks into class loading).

@Zhen-hao
Author

Is there a documented best practice to ensure a certain class loading order?
I know nothing about the locking mechanism in the class loader...

@leviramsey
Contributor

Referring to the classes you're serializing before starting the cluster would probably be a good bet. Disclaimer: classloading is a "here be dragons" area of the JVM for me, so I can't claim expertise.
