
Cluster Deployment fails completely after only one disconnected Node: AssociationError #3887

Closed
Kenji-Tanaka opened this issue Aug 21, 2019 · 7 comments · Fixed by #4088

Comments

@Kenji-Tanaka

Greetings.
I implemented a simple cluster application in Akka with Scala, which I also implemented in Akka.NET with C# for testing purposes. As soon as one disconnects a non-seed cluster node in Akka.NET, the cluster fails to deploy to all nodes that join at a later time. Strangely enough, the Akka.NET version does not work, although it corresponds exactly to the Akka version, which works fine.

I am using DotNet Core 2.2.401 under Windows 10 and the following packages:

  • Akka: 1.3.14
  • Akka.Cluster: 1.3.14

I created a repository that contains the code to reproduce this behaviour:
https://github.com/Kenji-Tanaka/AkkaNetClusterSample.git

Steps to reproduce:

  1. git clone https://github.com/Kenji-Tanaka/AkkaNetClusterSample.git
  2. run one seed: dotnet run seed
  3. when the seed is UP, run a node: dotnet run node
  4. after a few seconds, both print the time they were started, once per second
  5. press any key to shutdown the node
  6. start a new node with dotnet run node

Expected behaviour:

  • The node joins, an actor is deployed to it, and both print the time they were started every second again.

Actual behaviour:

  • The cluster logs warnings (AssociationError) and no actor is deployed.
    The log for the seed node is attached: log.txt

I guess I might be doing something wrong here. However, I don't see what it could be. Any hint would be gratefully received. Akka.NET is great, keep up the good work!

Thank You!

@Aaronontheweb
Member

Thanks for submitting this - we should take a look at the sample and see if it's a configuration issue of some kind. That's usually the culprit.

@Kenji-Tanaka
Author

The problem described above is still present in Akka.NET 1.3.15 using DotNet Core 3.0.
I wonder whether it can really be a configuration issue, since I use the same configuration in the original actor system implemented in Scala, where it works correctly.

@valdisz
Contributor

valdisz commented Nov 23, 2019

I'll look into this problem

@valdisz
Contributor

valdisz commented Nov 23, 2019

After a quick investigation here are some details:

In the sample, seed and worker nodes are started on random ports. When the worker node exits, the seed marks it as unreachable, but when the worker tries to rejoin, it uses a new port, and that's when the problem occurs.

If the worker uses a static port, everything works as expected. It looks like the problem is in the cluster-joining code.
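For reference, pinning the worker to a static port is a one-line config change. A minimal HOCON sketch, assuming the sample uses Akka.NET 1.3.x with the default DotNetty transport (port 0 means "pick a random port", which is what triggers the behaviour described above); the hostname and port values here are placeholders, not the sample's actual settings:

```hocon
akka {
  remote {
    dot-netty.tcp {
      hostname = "127.0.0.1"
      # A fixed, non-zero port makes the worker rejoin with the
      # same address after a restart, instead of a fresh random one.
      port = 8081
    }
  }
}
```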

@Aaronontheweb
Member

Yes, this is just a configuration issue - on the JVM they might have akka.cluster.allow-weakly-up-members set to on by default, which isn't currently the case in Akka.NET. As @valdisz points out, having the child node restart on a different port makes the cluster think that a new node is joining, which it won't allow while an unreachable node is still missing (and thus the cluster can't vote on any changes in membership).

I explain this in a lot of depth here and offer a few different solutions for tackling these types of problems: https://petabridge.com/blog/proper-care-of-akkadotnet-clusters/

@Kenji-Tanaka
Author

Excellent, when adding allow-weakly-up-members = on to the configuration, this example works like the one in Akka (Scala). Thanks also for the link to the article!
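For anyone landing here later, the fix described above is a single HOCON setting. A minimal sketch of where it lives in the config tree:

```hocon
akka {
  cluster {
    # Allow nodes that join while another member is unreachable to be
    # promoted to WeaklyUp, instead of being stuck in the Joining state
    # until the unreachable member is downed or comes back.
    allow-weakly-up-members = on
  }
}
```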

@Aaronontheweb
Member

@Kenji-Tanaka we've now updated Akka.NET to automatically have this setting on by default via #4087
