Add named exception to detect when a cluster node has been quarantined by others #18758

Closed
garyiwu opened this issue Oct 21, 2015 · 4 comments

@garyiwu

garyiwu commented Oct 21, 2015

When a cluster node gets auto-downed/quarantined, it has to restart its own actor system before it can rejoin the cluster. To do this we need a way to reliably detect when a node has been quarantined by other nodes.

Currently, the only way to do this appears to be doing string matching on the exception message of an AssociationError, which is fragile. Ideally there should be a named exception that we can check, so that the node can restart itself as appropriate.

@patriknw patriknw added this to the 2.4.x milestone Oct 22, 2015
@patriknw patriknw added 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted t:remoting labels Oct 22, 2015
@jdevelop
Contributor

jdevelop commented Dec 2, 2015

👍

@jdevelop
Contributor

jdevelop commented Dec 9, 2015

I faced the same problem and came up with the following solution:

  • add a dedicated case class, AssociationHandle.ActorSystemQuarantinedEvent, that contains the local address of the system being quarantined and the remote address of the system that has quarantined it.
  • upon receipt of the quarantine message in the Endpoint, publish a new AssociationHandle.ActorSystemQuarantinedEvent to the event stream of the current ActorSystem.

Any actor can then subscribe to those events and react to a quarantine, e.g. by triggering a system restart.
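
To make the subscription side concrete, here is a rough sketch of what a listener could look like. It assumes the event is a plain case class with the two addresses described above; the top-level ActorSystemQuarantinedEvent defined here is only a hypothetical stand-in for the proposed class, and QuarantineListener, onQuarantined and QuarantineListenerExample are names made up for illustration. Only the event stream subscribe/unsubscribe calls are existing Akka API.

```scala
import akka.actor.{Actor, ActorLogging, ActorSystem, Address, Props}

// Hypothetical stand-in for the proposed event: the name and fields follow
// the description in this comment, not a confirmed Akka class.
final case class ActorSystemQuarantinedEvent(localAddress: Address, remoteAddress: Address)

// Illustrative listener: subscribes to the event stream and runs a
// caller-supplied reaction when this system has been quarantined.
class QuarantineListener(onQuarantined: ActorSystemQuarantinedEvent => Unit)
    extends Actor with ActorLogging {

  override def preStart(): Unit = {
    // Register for the event type on the local system's event stream.
    context.system.eventStream.subscribe(self, classOf[ActorSystemQuarantinedEvent])
  }

  override def postStop(): Unit = {
    // Remove all subscriptions held by this actor.
    context.system.eventStream.unsubscribe(self)
  }

  def receive: Receive = {
    case evt: ActorSystemQuarantinedEvent =>
      log.warning("{} has been quarantined by {}", evt.localAddress, evt.remoteAddress)
      onQuarantined(evt)
  }
}

object QuarantineListenerExample {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("example")
    // The reaction here only terminates the system; starting a fresh one
    // afterwards is application specific and left out of this sketch.
    system.actorOf(
      Props(new QuarantineListener(_ => system.terminate())),
      "quarantine-listener")
  }
}
```

Passing the reaction in as a function keeps the listener independent of how a particular application chooses to restart itself.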

I found this pretty flexible to use; however, I need somebody to review the approach and the proposed patch. Thanks.

rkuhn added a commit that referenced this issue Dec 20, 2015
…opagation-18758

#18758 Send appropriate events on remote actor system shutdown and quarantine
@jdevelop
Contributor

Perhaps this one could be marked as fixed?

@rkuhn rkuhn modified the milestones: 2.4.2, 2.4.x Dec 21, 2015
@rkuhn
Contributor

rkuhn commented Dec 21, 2015

Indeed, thanks!

@rkuhn rkuhn closed this as completed Dec 21, 2015
@ktoso ktoso removed the 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted label Dec 21, 2015