ThisActorSystemQuarantinedEvent not received when nodes quarantine each other #24764
Would be interesting to see if we have the same problem in Artery.
I am also facing this in Akka 4.1.15. Can someone look into it?
patriknw added a commit that referenced this issue on Nov 22, 2018
* When node A quarantines node B, it sends a Quarantined message to B, which will be published to the eventStream as OtherHasQuarantinedThisActorSystemEvent at B.
* If there was a network partition, B might not receive the above message and might continue to attempt to send messages to A (e.g. heartbeat messages). A will then send back Quarantined, which will be published to the eventStream as OtherHasQuarantinedThisActorSystemEvent at B.
* ClusterDaemon subscribes to OtherHasQuarantinedThisActorSystemEvent and downs itself.
* However, there can be situations of false self-downing, as illustrated in the test.
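
The flow in the commit message can be sketched as a toy model. This is plain Java with no Akka dependency; the class `QuarantineReplyModel`, its `Node` type, and all method names are hypothetical stand-ins chosen for illustration, and the model assumes nothing beyond what the bullets above state.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the flow described in the commit message; not Akka code. */
final class QuarantineReplyModel {

    /** Hypothetical stand-in for a cluster node. */
    static final class Node {
        final String name;
        final Set<String> quarantined = new HashSet<>();    // peers this node has quarantined
        final List<String> eventStream = new ArrayList<>(); // events published locally
        boolean downed = false;

        Node(String name) {
            this.name = name;
        }

        /** Quarantine a peer; the Quarantined message is lost if the link is partitioned. */
        void quarantine(Node peer, boolean partitioned) {
            quarantined.add(peer.name);
            if (!partitioned) {
                peer.onQuarantinedBy(this);
            }
        }

        /** Send a heartbeat; a peer that has quarantined this node replies with Quarantined. */
        void sendHeartbeat(Node peer) {
            if (peer.quarantined.contains(name)) {
                onQuarantinedBy(peer);
            }
        }

        /** Receiving Quarantined publishes the event; the subscriber then downs the node. */
        void onQuarantinedBy(Node peer) {
            eventStream.add("OtherHasQuarantinedThisActorSystemEvent(" + peer.name + ")");
            downed = true; // models ClusterDaemon downing itself
        }
    }

    public static void main(String[] args) {
        Node a = new Node("A");
        Node b = new Node("B");
        a.quarantine(b, true); // partition: B never sees the first Quarantined
        System.out.println("B downed after lost notification: " + b.downed);
        b.sendHeartbeat(a);    // partition healed: A replies with Quarantined
        System.out.println("B downed after heartbeat reply: " + b.downed);
    }
}
```

In this model B only learns of the quarantine once it sends something to A after the partition heals, which is the second bullet's scenario.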
See #29565
In our project, we use the ThisActorSystemQuarantinedEvent added by #18758 to detect when the local actor system gets quarantined, so that the system can be restarted to recover and rejoin the cluster. This works fine when one node is quarantined by the other. However, we notice that when a network partition lasts for a considerable time, both nodes can end up quarantining each other. When the partition is removed, the message below is seen continuously on both nodes:
Since neither node processes the association with the remote node due to the quarantine, neither of them receives the ThisActorSystemQuarantinedEvent, so they remain in this state forever. A way is needed to detect this state so that recovery can be triggered.
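
The stuck state described above can be illustrated with a small self-contained model. It is a sketch under one stated assumption: a node drops everything arriving from a peer it has itself quarantined, including that peer's quarantine notice. That assumption is inferred from the description in this issue, not taken from Akka's source, and `MutualQuarantineModel`, `Node`, and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of one-way versus mutual quarantine; not Akka code. */
final class MutualQuarantineModel {

    /** Hypothetical stand-in for a node with a local event stream. */
    static final class Node {
        final String name;
        final Set<String> quarantined = new HashSet<>();    // peers this node has quarantined
        final List<String> eventStream = new ArrayList<>(); // events published locally

        Node(String name) {
            this.name = name;
        }

        void quarantine(Node peer) {
            quarantined.add(peer.name);
        }

        /** Tell the peer it has been quarantined; returns false if the peer dropped the notice. */
        boolean notifyQuarantined(Node peer) {
            return peer.receiveQuarantined(this);
        }

        boolean receiveQuarantined(Node from) {
            // Assumption: inbound traffic from a peer we quarantined is not processed.
            if (quarantined.contains(from.name)) {
                return false;
            }
            eventStream.add("ThisActorSystemQuarantinedEvent(" + from.name + ")");
            return true;
        }
    }

    /** Runs the scenario and returns {A, B} for inspection. */
    static Node[] run(boolean mutual) {
        Node a = new Node("A");
        Node b = new Node("B");
        // During the partition:
        a.quarantine(b);
        if (mutual) {
            b.quarantine(a);
        }
        // After the partition is removed:
        a.notifyQuarantined(b);
        if (mutual) {
            b.notifyQuarantined(a);
        }
        return new Node[] {a, b};
    }

    public static void main(String[] args) {
        Node[] oneWay = run(false);
        System.out.println("one-way: B events = " + oneWay[1].eventStream.size());
        Node[] both = run(true);
        System.out.println("mutual: A events = " + both[0].eventStream.size()
                + ", B events = " + both[1].eventStream.size());
    }
}
```

In the one-way run B publishes the event and could restart; in the mutual run neither event stream receives anything, so a restart hook that depends on the event never fires.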