the first seed node won't join the cluster via other seed node after restarting #29983
Comments
tested with both TCP and Aeron-UDP |
I checked the source code (master branch). my conclusion is that the I also could NOT find the on
|
We have a test for this so I doubt that it would be broken. I also tried with the akka-sample-cluster-scala and it works fine.
restart 25251 and it will join 25252 again. |
we found the issue in our multi-node deployment. |
I mean we have the same config on all the nodes. We were using |
That doesn't sound like the right approach for configuring seed nodes. The easiest would be to use the same list on all nodes, and there must be at least 2 (preferably more) entries in the list.
Read the docs carefully: https://doc.akka.io/docs/akka/current/typed/cluster.html#joining and consider using Cluster Bootstrap. Closing this, since it's more of a question that we would prefer to handle at http://discuss.akka.io/ |
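A minimal sketch of the "same list on all nodes" advice, assuming a hypothetical helper `seedNodesConfig` that renders one shared `akka.cluster.seed-nodes` line. The system name `ClusterSystem` and port 25520 are placeholders, not values from this thread; only the host names node1 and node2 come from the issue.

```scala
object SeedNodesSketch {
  // Hypothetical helper (not part of Akka): renders one seed-nodes setting
  // that every node in the cluster is meant to load unchanged.
  def seedNodesConfig(systemName: String, seeds: Seq[(String, Int)]): String = {
    // The advice in this thread: at least two entries in the list.
    require(seeds.size >= 2, "use at least two seed nodes")
    val entries = seeds.map { case (host, port) =>
      "\"" + s"akka://$systemName@$host:$port" + "\""
    }
    entries.mkString("akka.cluster.seed-nodes = [", ", ", "]")
  }
}
```

The resulting string could be passed to `ConfigFactory.parseString(...).withFallback(ConfigFactory.load())`, the same pattern as the `configWithPort` snippet quoted in this thread, so that each node overrides only its own canonical host and port.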
@patriknw I think you misunderstood my last comment, and we changed to the latter config as a workaround to temporarily overcome the issue. |
there is still a possibility that this is a real bug |
If you can show a minimized example that illustrates the problem we can look more at it. I suggest that you take the https://github.com/akka/akka-samples/tree/2.6/akka-sample-cluster-scala and show us how it doesn't work. What do you mean with |
sorry, I meant my full artery config is
the seed nodes are |
I will try to make a minimal reproducible setup. |
I can't find a small reproducer yet.
on node2. I don't know if this is an issue and on node1, I see
and
@patriknw can you help me understand the |
on node2, I do see
so node1 used |
firstSeedNodeProcess stops when it has waited long enough, deciding to join itself instead. Have you changed for example I might see better if you share the full akka debug level logs from node1 and node2. |
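The behaviour described here (wait for another seed node, then give up and join itself) can be pictured with a toy model. This is a sketch under stated assumptions, not Akka's actual code: `decide`, `replyDelaysMs` and `joinSelfTimeoutMs` are hypothetical names, and the model only captures "a reply handled before the timeout wins, otherwise join self".

```scala
object FirstSeedNodeModel {
  sealed trait Decision
  final case class JoinExisting(seed: String) extends Decision
  case object JoinSelf extends Decision

  // replyDelaysMs: for each other seed node, how long (in ms) until its reply
  // is actually processed by the restarted node; None means no reply at all.
  def decide(replyDelaysMs: Map[String, Option[Long]], joinSelfTimeoutMs: Long): Decision = {
    // Keep only replies that are processed before the join-self timeout fires.
    val inTime = replyDelaysMs.toSeq.collect {
      case (seed, Some(delay)) if delay < joinSelfTimeoutMs => (delay, seed)
    }
    // No timely reply: the first seed node forms a new cluster with itself.
    if (inTime.isEmpty) JoinSelf else JoinExisting(inTime.minBy(_._1)._2)
  }
}
```

In this model a reply that was sent promptly but handled late (for example because the receiving JVM was stalled) counts as a large delay, so `decide(Map("node2" -> Some(6000L)), 5000L)` yields `JoinSelf`. The 5000 ms figure is only illustrative.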
I have in the previous log example, node2 received node1 received I can export the full logs of the 3 nodes from Elasticsearch if you think it is helpful. |
Just had the problem: Cluster Node [akka://XXXXXXXX@localhost:2552] - Couldn't join seed nodes after [2] attempts, will try again. seed-nodes=[akka://XXXXXXXXX@0.0.0.0:2551, akka://XXXXXXX@0.0.0.0:2552] Using these entries in application.conf:
Changing application.conf to
Fixes the problem. Note that the canonical.port is overridden in Scala code:
I could also have overridden akka.remote.artery.canonical.hostname, but that appears to be set properly in the config file for now.
From the Akka Cluster documentation:
"Remoting is the mechanism by which Actors on different nodes talk to each other internally. When building an Akka application, you would usually not use the Remoting concepts directly, but instead use the more high-level Akka Cluster utilities or technology-agnostic protocols such as HTTP, gRPC etc."
"The port number needs to be unique for each actor system on the same machine even if the actor systems have different names. This is because each actor system has its own networking subsystem listening for connections and handling messages as not to interfere with other actor systems."
It appears that the cluster nodes need this set properly so they can join the seed nodes. A bit more from the documentation:
Artery Remoting Documentation link: https://doc.akka.io/docs/akka/current/remoting-artery.html#configuration |
I don't think it is the same issue. (It is not clear what you were trying to do. I can only assume you were trying to run the cluster example locally.) |
This is part of a service belonging to a group of microservices. I changed the remote artery host and port in a revision of the application.conf that I was working on, and I received these messages. After changing back, all worked well: I can connect and reconnect with no issues. I was trying to emphasize that it is important to set the remote artery settings appropriately in an Akka Cluster. I do not think the documentation explains this well. The message was intended to convey this and hopefully help others who experience this difficulty.
On 2/6/2021 7:51:35 AM, Zhenhao Li wrote:
>Just had the problem:
>
>Cluster Node [akka://XXXXXXXX@localhost:2552] - Couldn't join seed
>nodes after [2] attempts, will try again.
>seed-nodes=[akka://XXXXXXXXX@0.0.0.0:2551,
>akka://XXXXXXX@0.0.0.0:2552]
>
>Using these entries in application.conf:
>
>akka.remote.artery {
> canonical.port = 2551
> canonical.hostname = localhost
>}
>
>akka.cluster {
> seed-nodes = [
> "akka://XXXXXXXXX@0.0.0.0:2551",
> "akka://XXXXXXX@0.0.0.0:2552"
> ]
> sharding {
> number-of-shards = 100
> remember-entities = off
> remember-entities-store = "eventsourced"
> }
> jmx.multi-mbeans-in-same-jvm = on
> downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
>}
>
>Changing application.conf to
>
>akka.remote.artery {
> canonical.port = 0
> canonical.hostname = 0.0.0.0
>}
>
>Fixes the problem.
>
>Note that the canonical.port is overridden in Scala code:
>
>private def configWithPort(port: Int): Config =
> ConfigFactory.parseString(s"""akka.remote.artery.canonical.port = $port""").withFallback(ConfigFactory.load())
>
>Could have also overridden akka.remote.artery.canonical.hostname as well
>but appear to have that set properly in the config file for now. From
>Akka Cluster Documentation:
>
>"Remoting is the mechanism by which Actors on different nodes talk to
>each other internally. When building an Akka application, you would
>usually not use the Remoting concepts directly, but instead use the
>more high-level Akka Cluster utilities or technology-agnostic
>protocols such as HTTP, gRPC etc."
>
>"The port number needs to be unique for each actor system on the same
>machine even if the actor systems have different names. This is
>because each actor system has its own networking subsystem listening
>for connections and handling messages as not to interfere with other
>actor systems."
>
>It appears that the clusters need this set properly so they can join
>the seed nodes. A bit more from the documentation:
>
>Change provider from local. We recommend using Akka Cluster over using
>remoting directly.
>Enable Artery to use it as the remoting implementation
>Add host name - the machine you want to run the actor system on; this
>host name is exactly what is passed to remote systems - in order to
>identify this system and consequently used for connecting back to this
>system if need be, hence set it to a reachable - IP address or
>resolvable name in case you want to communicate across the network.
>Add port number - the port the actor system should listen on, set to 0
>to have it chosen automatically
>Artery Remoting Documentation link:
>https://doc.akka.io/docs/akka/current/remoting-artery.html#configuration
>
I don't think it is the same issue. (It is not clear what you were
trying to do. I can only assume you were trying to run the cluster
example locally.)
|
I'm rather confused by all the questions and comments here. Some just acknowledge how it's supposed to work. Some look like trial-and-error config changes. Using If you are new to Akka, start with I'd recommend that you start from an example. Not much configuration is needed if you just want to try an Akka Cluster locally. See https://developer.lightbend.com/start/?group=akka&project=akka-samples-cluster-scala I'd also recommend the tutorial in the Akka Platform Guide. It has downloadable samples, and for example the shopping-order-service is a "minimal" cluster app. https://developer.lightbend.com/docs/akka-platform-guide/microservices-tutorial/index.html |
content posted by @msb1 is not really relevant to the original issue. I agree it is confusing. To summarize the issue (so future readers don't need to read everything above): I have a 3-node cluster with
indicating that I couldn't reproduce this issue modifying the example Akka cluster project.
Note: This workaround only works for rolling updates. We need to set seed nodes to I read the source code and couldn't figure out what can cause |
What happens if you define |
It is the same. |
You have all 3 running (Up), restart node1. Then it will join node2 or node3. It will be moved to Up because 3 >= 2. |
Apologies if I created any confusion; of course that was not my intent. I first read the dialog and misunderstood it to mean that three seed nodes were running locally, but that apparently is not the case.
Although not relevant to this discussion, the wildcard IP address 0.0.0.0 resolves and works fine if seed nodes are running locally, at least on my test setup. (I wouldn't use it otherwise.)
Anyway, the behavior described is observed (in my systems) when the artery remote canonical host and port are not set properly for all nodes. Perhaps Patrik can elaborate. I only know that from the documentation: https://doc.akka.io/docs/akka/current/remoting-artery.html#remote-configuration-artery
"In order for remoting to work properly, where each system can send messages to any other system on the same network (for example a system forwards a message to a third system, and the third replies directly to the sender system) it is essential for every system to have a unique, globally reachable address and port. This address is part of the unique name of the system and will be used by other systems to open a connection to it and send messages. This means that if a host has multiple names (different DNS records pointing to the same IP address) then only one of these can be canonical. If a message arrives to a system but it contains a different hostname than the expected canonical name then the message will be dropped. If multiple names for a system would be allowed, then equality checks among ActorRef instances would no longer be trusted and this would violate the fundamental assumption that an actor has a globally unique reference on a given network. As a consequence, this also means that localhost addresses (e.g. 127.0.0.1) cannot be used in general (apart from local development) since they are not unique addresses in a real network."
Moreover, the same documentation further elaborates:
"In setups involving Network Address Translation (NAT), Load Balancers or Docker containers the hostname and port pair that Akka binds to will be different than the “logical” host name and port pair that is used to connect to the system from the outside. This requires special configuration that sets both the logical and the bind pairs for remoting."
Perhaps this is applicable to your setup?
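The canonical-address rule quoted from the Artery documentation can be illustrated with a toy model. This is only a sketch: `Address` and `accepts` are hypothetical names rather than Akka classes, and the system name, hosts and port are made-up placeholders.

```scala
object CanonicalAddressModel {
  // Simplified stand-in for a remote address: system name, host, port.
  final case class Address(system: String, host: String, port: Int)

  // Toy rule taken from the quoted documentation: a message is only accepted
  // when the address it was sent to equals the receiver's canonical address.
  def accepts(canonical: Address, recipient: Address): Boolean =
    canonical == recipient
}
```

With a canonical address of `Address("ClusterSystem", "node1", 25520)`, a message addressed to the same system via a different host name or IP is rejected in this model, even if both names resolve to the same machine.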
On Monday, March 29, 2021 7:02 AM, Patrik Nordwall wrote:
You have all 3 running (Up), restart node1. Then it will join node2 or node3. It will be moved to Up because 3 >= 2.
Can you share some evidence of that? Example application that I can reproduce the problem or logs from all 3 nodes?
|
@patriknw I uploaded the full debug logs to Google Drive FYI, |
I got help from someone on PairTime and found the cause in my code. I still want someone to explain this though. The fix is "stupid":
val codeSet = compileTimeMacroGeneratedCode
import codeSet._
in my serializer
class MySerializer(system: ExtendedActorSystem) extends SerializerWithStringManifest
After making This doesn't make sense to me because On the other hand, I'm happy to get rid of this issue on my side. |
Thanks for reporting back. I think I see the reason.
Note that there is no logging between 16:45 and 16:50. Something is blocking all progress. That could be the macro stuff. This join self timeout can be configured with It would probably be good to trigger the macro stuff earlier, before starting the ActorSystem, if that causes a complete 5-second halt. Having such a halt later (lazily) would not be good either. |
To work around that, I wrap a reference of
Future {
  log.info(s"starting codeSet of size ${codeSet.size}")
}
@patriknw I have two questions about your explanation.
|
The delay would seem to me to be classloading. Forcing the classes referred to by the serializer (e.g. the message classes you're sending over the wire or persisting) to be loaded earlier might help. |
yeah, I think you have a global lock/halt (probably because the macro stuff hooks into class loading). |
is there a documented best practice to ensure a certain class loading order? |
Referring to the classes you're serializing before starting the cluster would probably be a good bet. Disclaimer: classloading is a "here be dragons" area of the JVM for me, so I can't claim expertise. |
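One way to "refer to the classes before starting the cluster" is to force them to load explicitly. The sketch below assumes a hypothetical `preload` helper and uses only `Class.forName` from the JDK. Whether it removes the halt seen in this issue is not verified here, and the class names passed in would be whatever your serializer handles.

```scala
object PreloadSketch {
  // Forces loading and static initialization of the named classes.
  // Returns the names that could not be loaded, so the caller can log them.
  def preload(classNames: Seq[String],
              loader: ClassLoader = getClass.getClassLoader): Seq[String] =
    classNames.filter { name =>
      try {
        Class.forName(name, true, loader) // true = run static initializers now
        false
      } catch {
        case _: ClassNotFoundException | _: LinkageError => true
      }
    }
}
```

The idea would be to call `PreloadSketch.preload(...)` in `main` before the ActorSystem is created, so that any slow class loading or macro-generated initialization happens before the node starts its join attempt rather than in the middle of it.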
To reproduce:
The cluster has 3 nodes: node1, node2, node3.
The seed nodes are [node1, node2].
They will form a cluster of 3 nodes if deployed at the same time.
Now restart node1. You can see the log from node2, but on node1 the logs are:
So node1 did try to join node2 but ignored the response from node2, as there is no debug log about it.
The end result is that node1 becomes a new standalone cluster.