New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StreamRef does not terminate stream on node failure #25960

Closed
ollyw opened this Issue Nov 21, 2018 · 7 comments

Comments

Projects
None yet
4 participants
@ollyw
Copy link
Contributor

ollyw commented Nov 21, 2018

It seems that StreamRefs don't reliably detect the failure of their partner, despite the fact that they do watch their partner and act accordingly.

After some investigation, I found that StreamRefs handle failure of the remote side of the conversation if the remote subscribed actor gets terminated, but in the case of full node failure (e.g. node killed or downed after network partition), the StreamRef doesn't receive a termination message from its partner. This is because the FunctionRef actor type is used for streams, which doesn't extend the DeathWatch trait. FunctionRef receives Terminations from the stopped actor, but it doesn't listen for address down events, and the Remote Watcher doesn't notify it, as it assumed it implemented DeathWatch.

It seems that there are multiple ways to fix this. Here are four I considered:

  1. Implement address watching in the FunctionRef, similar to what is in the DeathWatch trait (which adds lots of complexity/duplication). The other issue is that FunctionRef doesn't have obvious access to the ActorSystem, which it needs to subscribe to address down events.
  2. Change streams to use an ActorRef type that inherits DeathWatch. This is a big change, and could have serious performance implications.
  3. Change the RemoteWatcher to treat those actors which don't inherit DeathWatch to send the message directly in the way that it does when the address doesn't go down. I suspect that this is not optimal for performance, but given that there seems to be very few cases (only StreamRefs?) that this effects, it may be acceptable. It is also the most bullet proof, as no ActorRef gets left behind.
  4. Update the documentation and tell all users to implement a keepalive of some sort between the StreamRef pair at the application level if they want to reliably detect termination of the other party. This will cause lots of unnecessary messaging, and probably will not be spotted by most users.

Superficially, option 3 seems the best route, so I have implemented it and tested locally, and it works #25959

The issue is whether/how to do automated testing for this. The change is in the Akka remoting project and a multi-jvm test could be done for it, but it can't use a StreamRef, as that is in the Akka Streams project. The akka streams project doesn't have multi-jvm tests. The StreamRef implementation itself seems to have minimal testing.

I would be grateful for any feedback as to which route is preferred for implementation and testing. Thanks!

@hepin1989

This comment has been minimized.

Copy link
Contributor

hepin1989 commented Nov 22, 2018

Maybe we could build RSocket on top of Akka TCP stream and then build StreamRef like thing on top of that?

@ollyw

This comment has been minimized.

Copy link
Contributor Author

ollyw commented Nov 22, 2018

@hepin1989 rebuilding StreamRefs ontop of RSocket+Akka TCP is an interesting idea, and related to #24276. Some thinking would have to be done regarding how well it works for failure detection, latency and firewalls. Currently the stream can be on-top of the existing cluster comms, including aeron, which performs pretty well.

Certainly for the scope of this bug, I hope that it can be fixed satisfactorily without a rewrite of StreamRefs, as it a painful bug which affects current users and a rewrite would take some time for sure!

@patriknw

This comment has been minimized.

Copy link
Member

patriknw commented Nov 25, 2018

That seems to be a rather serious shortcoming of StageActor that we should try to fix.

Thanks for reporting.

@chbatey

This comment has been minimized.

Copy link
Member

chbatey commented Nov 26, 2018

👍 to adding multi-jvm tests for in akka-stream-tests to test StreamRefs

@ollyw

This comment has been minimized.

Copy link
Contributor Author

ollyw commented Nov 26, 2018

@chbatey thanks for the reply. I can add the multi-jvm tests. Is the proposed fix #25959 ok with you, assuming the tests are in place?

@patriknw patriknw self-assigned this Jan 7, 2019

@patriknw

This comment has been minimized.

Copy link
Member

patriknw commented Jan 7, 2019

I'll take a stab at this now.

patriknw added a commit that referenced this issue Jan 7, 2019

Terminate StreamRef on node failure, #25960
* manage AddressTerminated subscription in FunctionRef
* implementation can be compared with akka/actor/dungeon/DeathWatch.scala

patriknw added a commit that referenced this issue Jan 10, 2019

Terminate StreamRef on node failure, #25960
* manage AddressTerminated subscription in FunctionRef
* implementation can be compared with akka/actor/dungeon/DeathWatch.scala

patriknw added a commit that referenced this issue Jan 14, 2019

Terminate StreamRef on node failure, #25960
* manage AddressTerminated subscription in FunctionRef
* implementation can be compared with akka/actor/dungeon/DeathWatch.scala

patriknw added a commit that referenced this issue Jan 17, 2019

Terminate StreamRef on node failure, #25960
* manage AddressTerminated subscription in FunctionRef
* implementation can be compared with akka/actor/dungeon/DeathWatch.scala
* use synchronized access to the watch state and AddressTerminatedTopic
* use OptionVal for _watchedBy

@patriknw patriknw added this to the 2.5.20 milestone Jan 17, 2019

@patriknw patriknw closed this Jan 17, 2019

@ollyw

This comment has been minimized.

Copy link
Contributor Author

ollyw commented Jan 17, 2019

Awesome @patriknw. Thanks for fixing this 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment