StreamRef does not terminate stream on node failure #25960
It seems that StreamRefs don't reliably detect the failure of their partner, despite the fact that they do watch their partner and act accordingly.
After some investigation, I found that StreamRefs handle failure of the remote side of the conversation if the remote subscribed actor gets terminated, but in the case of full node failure (e.g. node killed or downed after network partition), the StreamRef doesn't receive a termination message from its partner. This is because the FunctionRef actor type is used for streams, which doesn't extend the
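To make the two failure modes concrete, here is a toy model in plain Java. It is only a sketch and does not use or reproduce any Akka API: `RemoteNode`, `Watcher`, and the `nodeFailureAware` flag are hypothetical names invented for illustration. The flag stands in for whether the watcher's local side reacts to node-down events. When an actor stops on a live node, every watcher is notified; when the whole node crashes, only watchers that handle node failure locally learn that their partner is gone.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these classes are Akka APIs.
final class DeathWatchModel {

    /** Records the names of watched partners this watcher was told have terminated. */
    static final class Watcher {
        final String name;
        // Hypothetical flag: does the local side react to node-down events?
        final boolean nodeFailureAware;
        final List<String> terminated = new ArrayList<>();

        Watcher(String name, boolean nodeFailureAware) {
            this.name = name;
            this.nodeFailureAware = nodeFailureAware;
        }
    }

    /** A remote node hosting named actors, each with a set of watchers. */
    static final class RemoteNode {
        private final Map<String, Set<Watcher>> watchersByActor = new LinkedHashMap<>();

        void watch(Watcher watcher, String actor) {
            watchersByActor.computeIfAbsent(actor, k -> new LinkedHashSet<>()).add(watcher);
        }

        /** Actor stops while the node is alive: the node notifies every watcher. */
        void stopActor(String actor) {
            Set<Watcher> watchers = watchersByActor.remove(actor);
            if (watchers == null) return;
            for (Watcher w : watchers) w.terminated.add(actor);
        }

        /** Node crashes: it sends nothing, so only node-failure-aware watchers find out. */
        void crash() {
            for (Map.Entry<String, Set<Watcher>> e : watchersByActor.entrySet()) {
                for (Watcher w : e.getValue()) {
                    if (w.nodeFailureAware) w.terminated.add(e.getKey());
                }
            }
            watchersByActor.clear();
        }
    }

    public static void main(String[] args) {
        RemoteNode node = new RemoteNode();
        Watcher regular = new Watcher("regular-actor", true);
        Watcher streamSide = new Watcher("stream-side", false);
        node.watch(regular, "partner");
        node.watch(streamSide, "partner");

        node.crash();

        System.out.println(regular.terminated);    // [partner]
        System.out.println(streamSide.terminated); // []
    }
}
```

In this model the stream-side watcher never sees its partner terminate after a crash, so it would wait indefinitely, which is the symptom described in this issue.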
It seems that there are multiple ways to fix this. Here are four I considered:
Superficially, option 3 seems the best route, so I have implemented it and tested it locally, and it works: see #25959.
The open question is whether and how to add automated tests for this. The change is in the Akka remoting project, and a multi-jvm test could be written for it, but such a test can't use a StreamRef, because that lives in the Akka Streams project. The Akka Streams project doesn't have multi-jvm tests, and the StreamRef implementation itself seems to have minimal testing.
I would be grateful for any feedback as to which route is preferred for implementation and testing. Thanks!
referenced this issue
Nov 21, 2018
@hepin1989 rebuilding StreamRefs on top of RSocket+Akka TCP is an interesting idea, and related to #24276. Some thinking would have to be done regarding how well it works for failure detection, latency and firewalls. Currently the stream can run on top of the existing cluster comms, including Aeron, which performs pretty well.
Certainly for the scope of this bug, I hope that it can be fixed satisfactorily without a rewrite of StreamRefs, as it is a painful bug which affects current users and a rewrite would take some time for sure!