ThisActorSystemQuarantinedEvent not received when nodes quarantine each other #24764

Closed
ajayslele opened this issue Mar 20, 2018 · 4 comments

ajayslele commented Mar 20, 2018

In our project, we use the ThisActorSystemQuarantinedEvent added by #18758 to detect when the local actor system gets quarantined, so that the system can be restarted to recover and rejoin the cluster. This works fine when one node is quarantined by the other. However, we notice that when there is a network partition for a considerable time, both nodes can end up quarantining each other. When the partition is removed, the messages below are logged continuously on both nodes:

[DEBUG] [03/20/2018 10:19:04.786] [akka-sample-akka.remote.default-remote-dispatcher-35] [akka.tcp://akka-sample@10.18.130.90:2552/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fakka-sample%4010.18.130.115%3A2552-1085] Association between local [tcp://akka-sample@10.18.130.90:53355] and remote [tcp://akka-sample@10.18.130.115:2552] was disassociated because the ProtocolStateActor failed: ForbiddenUidReason
[WARN] [03/20/2018 10:19:04.786] [akka-sample-akka.remote.default-remote-dispatcher-35] [akka.tcp://akka-sample@10.18.130.90:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fakka-sample%4010.18.130.115%3A2552-680/endpointWriter] AssociationError [akka.tcp://akka-sample@10.18.130.90:2552] -> [akka.tcp://akka-sample@10.18.130.115:2552]: Error [Invalid address: akka.tcp://akka-sample@10.18.130.115:2552] [
akka.remote.InvalidAssociation: Invalid address: akka.tcp://akka-sample@10.18.130.115:2552
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has a UID that has been quarantined. Association aborted.
]
[INFO] [03/20/2018 10:19:04.786] [akka-sample-akka.remote.default-remote-dispatcher-35] [akka.remote.Remoting] Quarantined address [akka.tcp://akka-sample@10.18.130.115:2552] is still unreachable or has not been restarted. Keeping it quarantined.
[DEBUG] [03/20/2018 10:19:04.786] [akka-sample-akka.remote.default-remote-dispatcher-35] [akka.tcp://akka-sample@10.18.130.90:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fakka-sample%4010.18.130.115%3A2552-680/endpointWriter] Disassociated [akka.tcp://akka-sample@10.18.130.90:2552] -> [akka.tcp://akka-sample@10.18.130.115:2552]

Since neither node processes the association with the remote node because of the quarantine, neither of them receives the ThisActorSystemQuarantinedEvent, so they remain in this state forever. A way is needed to detect this state so that recovery can be triggered.
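
To make the failure mode concrete, here is a minimal toy model in plain Java. It does not use Akka and is not how Akka remoting is implemented. The `Node` class, its methods, and the rule that only a refusal from the remote side publishes the event are assumptions made for illustration, based on the behaviour and log lines described above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the behaviour described above; it is not Akka's implementation. */
final class QuarantineModel {

    /** A node that tracks which peers it has quarantined and which events it has published. */
    static final class Node {
        final String name;
        final Set<String> quarantinedPeers = new HashSet<>();
        final List<String> publishedEvents = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        void quarantine(Node peer) {
            quarantinedPeers.add(peer.name);
        }

        /** Attempts an outbound association; returns true if the attempt reached the peer. */
        boolean associateWith(Node peer) {
            if (quarantinedPeers.contains(peer.name)) {
                // Assumption mirroring "UID that has been quarantined. Association aborted.":
                // the attempt is dropped locally, so the peer never gets a chance to answer.
                return false;
            }
            if (peer.quarantinedPeers.contains(name)) {
                // Assumption: only a refusal coming from the peer publishes the event locally.
                publishedEvents.add("ThisActorSystemQuarantinedEvent");
            }
            return true;
        }
    }

    /** Runs one association attempt in each direction and returns the events seen by a and b. */
    static List<List<String>> reconnect(Node a, Node b) {
        a.associateWith(b);
        b.associateWith(a);
        return List.of(a.publishedEvents, b.publishedEvents);
    }

    public static void main(String[] args) {
        Node a = new Node("A");
        Node b = new Node("B");
        a.quarantine(b); // one-sided quarantine: B learns about it
        System.out.println(reconnect(a, b)); // [[], [ThisActorSystemQuarantinedEvent]]

        Node c = new Node("C");
        Node d = new Node("D");
        c.quarantine(d);
        d.quarantine(c); // mutual quarantine: nobody learns anything
        System.out.println(reconnect(c, d)); // [[], []]
    }
}
```

In the one-sided case the quarantined node still reaches its peer and is told about the quarantine. In the mutual case both attempts are dropped locally before the peer can respond, which matches the repeating log output.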

patriknw added the "1 - triaged" and "t:remoting" labels Mar 21, 2018
@patriknw
Member

Would be interesting to see if we have the same problem in Artery.

@manonthegithub
Contributor

manonthegithub commented Mar 28, 2018

@patriknw No, at least not with 'aeron-udp'. See #24807.

@helenflorida

I am also facing this in akka 4.1.15. Can someone look into it?

patriknw self-assigned this Nov 20, 2018
patriknw added the "3 - in progress" label and removed the "1 - triaged" label Nov 20, 2018
patriknw added a commit that referenced this issue Nov 22, 2018
* When node A quarantines node B, it sends a Quarantined message to B,
  which will be published to the eventStream as OtherHasQuarantinedThisActorSystemEvent
  at B.
* If there was a network partition, B might not receive the above message and might continue
  to attempt to send messages to A (e.g. heartbeat messages). Then A will send back Quarantined,
  which will be published to the eventStream as OtherHasQuarantinedThisActorSystemEvent at B.
* ClusterDaemon subscribes to OtherHasQuarantinedThisActorSystemEvent and downs itself.
* However, there can be situations of false self-downing, as illustrated in the test.
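
The sequence in the commit description can be sketched as a small toy model in plain Java. It does not use Akka. The `Node` class, its method names, and the `partitioned` flag are hypothetical, and the `downedItself` field merely stands in for the event being published and ClusterDaemon reacting to it.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the commit description above; it is not Akka's implementation. */
final class QuarantineNoticeModel {

    static final class Node {
        final String name;
        final Set<String> quarantinedPeers = new HashSet<>();
        boolean downedItself = false;

        Node(String name) {
            this.name = name;
        }

        /** Quarantines the peer and sends it a Quarantined notice, which a partition swallows. */
        void quarantine(Node peer, boolean partitioned) {
            quarantinedPeers.add(peer.name);
            if (!partitioned) {
                peer.onQuarantinedNotice();
            }
        }

        /** Sends an ordinary message such as a heartbeat; a quarantining peer answers Quarantined. */
        void sendHeartbeatTo(Node peer, boolean partitioned) {
            if (!partitioned && peer.quarantinedPeers.contains(name)) {
                onQuarantinedNotice();
            }
        }

        /** Stands in for publishing OtherHasQuarantinedThisActorSystemEvent and self-downing. */
        void onQuarantinedNotice() {
            downedItself = true;
        }
    }

    /** Returns {B downed while partitioned, B downed after the partition heals}. */
    static boolean[] quarantineDuringPartition() {
        Node a = new Node("A");
        Node b = new Node("B");
        a.quarantine(b, true);       // the first Quarantined notice is lost
        b.sendHeartbeatTo(a, true);  // still partitioned: nothing gets through
        boolean during = b.downedItself;
        b.sendHeartbeatTo(a, false); // healed: A answers the heartbeat with Quarantined
        return new boolean[] {during, b.downedItself};
    }

    public static void main(String[] args) {
        boolean[] result = quarantineDuringPartition();
        System.out.println("downed during partition: " + result[0]); // false
        System.out.println("downed after heal: " + result[1]);       // true
    }
}
```

The model covers only the first three bullets. It makes no attempt to reproduce the false self-downing case mentioned in the last bullet.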
patriknw removed the "3 - in progress" label Mar 15, 2021
@patriknw
Member

See #29565
