
When using Artery remoting, RemoteActorRefProvider's resolveActorRef can return a stale actor reference #29828

Closed
jhyoty opened this issue Nov 20, 2020 · 9 comments

jhyoty commented Nov 20, 2020

We recently upgraded from classic remoting to Akka 2.6 with Artery remoting (no Akka Cluster in use). One system started to see seemingly random timeouts, and logging confirmed that one ActorSystem sometimes dropped the response message:

DEBUG [akka.actor.default-dispatcher-6726] akka.remote.artery.Association - Dropping message [...] from [Actor[akka://Server/user/RequestProcessor#-2013075229]] to [Actor[akka://Client@127.0.0.1:37995/temp/RequestProcessor$a]] due to quarantined system [akka://Client@127.0.0.1:37995]

The "server" is a long-running process, but the "client" is a batch program that is periodically started, sends few requests to the Server, and stops. Both run in the same machine. The logs show that the Server always receives the messages from the Client, processes them, and sends a response, but sometimes the association drops the response message. There are couple of requests per client execution, and sometimes one of the requests fails to get a response, but another one works. The client uses "Patterns.ask" (seems that the temporary actors have predictable short names) and the client-side actorsystem uses a random port.

Further investigation suggests that this particular messaging pattern "poisons" the thread-local ActorRef cache in the RemoteActorRefProvider, and sometimes an ActorRef that references a stale Association (one that has already been removed from the AssociationRegistry) is returned. It seems that using a fixed port on the client side and making remove-quarantined-association-after long enough resolves the problem for us. However, the caching behavior does not look correct, and the RemoteActorRefProvider should e.g. check that cached actor references do not refer to an expired Association.

@jhyoty jhyoty changed the title Artery RemoteActorRefProvider's resolveActorRef can return a stale actor reference When using Artery remoting, RemoteActorRefProvider's resolveActorRef can return a stale actor reference Nov 20, 2020
@patriknw (Member)

@jhyoty Thanks for reporting.
One part that I don't understand: "fixed port on the client-side... resolves the problem for us". Wouldn't that have the opposite result? If it's a new random port it would not be in the cache, whereas a fixed port is more likely to be in the cache. However, a random port can also end up reusing the same port.

@patriknw patriknw self-assigned this Nov 23, 2020
@patriknw patriknw added t:remoting:artery 3 - in progress Someone is working on this ticket bug labels Nov 23, 2020
jhyoty (Author) commented Nov 23, 2020

Using a fixed port makes the server-side system detect that this is a new incarnation, so the association is updated and the problem does not occur, as long as the quarantined-association cleanup interval is long enough (but if you make the cleanup time short and fix the port, the problem becomes easy to reproduce).
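
In config terms, the client-side variation that makes it easy to reproduce looks roughly like this (a sketch; the hostname, port, and interval values are illustrative):

akka {
  actor.provider = remote
  remote.artery {
    canonical.hostname = "127.0.0.1"
    # fixed port instead of 0 (random), so the server detects a new incarnation on every run
    canonical.port = 25521
    # a short cleanup interval makes the stale-cache problem easy to hit;
    # a long enough value works around it for us
    advanced.remove-quarantined-association-after = 2s
  }
}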

The following program and config reproduce the issue (at least on my machine):
https://gist.github.com/jhyoty/b4e81e14fbe26a5c43b121ad635452d7

@patriknw (Member)

thanks

jhyoty (Author) commented Nov 23, 2020

I'm probably mistaken about the thread-local cache in RemoteActorRefProvider being the problem. It seemed logical because of the nondeterminism, but now that I have run the reproducer in a debugger, it points to akka.remote.artery.ActorRefResolveCacheWithAddress. The issue seems to be the same: an ActorRef referring to an expired association is returned.

@patriknw (Member)

Yes, I have noticed that. It's similar and they should work the same.

patriknw added a commit that referenced this issue Nov 23, 2020
* problem described in the issue
* temp (ask) ActorRef shouldn't be cached
* similar to the ActorRef compression that was fixed
  in #28823
* temp ActorRef doesn't contain the ActorRef uid so it can
  be resolved to an ActorRef with an old cached Association
* additionally, invalidate the cached Association if it has
  been removed after quarantine
* test that reproduces the problem in the issue
* also verified with the Main example in the issue
patriknw added a commit that referenced this issue Dec 4, 2020
* problem described in the issue
* temp (ask) ActorRef shouldn't be cached
* similar to the ActorRef compression that was fixed
  in #28823
* temp ActorRef doesn't contain the ActorRef uid so it can
  be resolved to an ActorRef with an old cached Association
* additionally, invalidate the cached Association if it has
  been removed after quarantine
* test that reproduces the problem in the issue
* also verified with the Main example in the issue
@patriknw patriknw removed the 3 - in progress Someone is working on this ticket label Dec 4, 2020
@patriknw patriknw added this to the 2.6.11 milestone Dec 4, 2020
@patriknw patriknw closed this as completed Dec 4, 2020
kpritam (Contributor) commented Feb 20, 2021

Hi @patriknw, we are facing exactly the same issue. The only difference in our case is that we are using a remote actorRef with use-unsafe-remote-features-outside-cluster = on and are not using actorSelection.

Versions tried: Akka Typed 2.6.11, 2.6.12

We have the following setup:

  1. Actor1 => runs in one JVM and has registered its remote actorRef in our custom LocationService (think of it as similar to the Receptionist)

  2. Actor2 => runs in a second JVM and has registered its address with the LocationService

Scenario:

  1. Actor1 sends a Restart message to Actor2 in a loop, 2000 times.
  2. On the Restart message, Actor2 terminates its ActorSystem and starts a new one on a random port (port = 0); a rough sketch of this is included below.
  3. This works fine for 1600-1700 iterations.
  4. But it eventually fails with the following message:
Dropping message [esw.ocs.api.actor.messages.SequencerMessages$GetSequenceComponent] from
[Actor[akka://sequence-manager/deadLetters]] to 
[Actor[akka://sequencer-system@192.168.1.3:50944/user/ESW.Perf#1625616268]] 
due to quarantined system [akka://sequencer-system@192.168.1.3:50944]

Here,
sequence-manager = Actor1
sequencer-system = Actor2
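
A rough sketch of the restart behaviour on the Actor2 side (Akka Typed, Scala; the message protocol and config values are illustrative, and the registration in our custom LocationService is elided):

import akka.actor.typed.{ActorSystem, Behavior}
import akka.actor.typed.scaladsl.Behaviors
import com.typesafe.config.ConfigFactory
import scala.concurrent.ExecutionContext.Implicits.global

object Actor2 {
  sealed trait Command
  case object Restart extends Command

  // every incarnation binds to a random port, so the remote address changes
  // on each restart while the system name stays "sequencer-system"
  private val config = ConfigFactory.parseString(
    """
    akka.actor.provider = remote
    akka.remote.artery.canonical.port = 0
    """).withFallback(ConfigFactory.load())

  def start(): ActorSystem[Command] = {
    val system = ActorSystem(behavior, "sequencer-system", config)
    // registering the new actorRef in the LocationService is elided here
    system
  }

  private def behavior: Behavior[Command] =
    Behaviors.receive[Command] { (ctx, _) =>
      // on Restart: terminate the whole ActorSystem and start a fresh incarnation;
      // the old association eventually gets quarantined on the Actor1 side
      ctx.system.terminate()
      ctx.system.whenTerminated.foreach(_ => start())
      Behaviors.same
    }

  def main(args: Array[String]): Unit = start()
}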

Let me know if you would like to take a look at the detailed log file; it is 150 MB in size, hence I am not attaching it here.

@patriknw (Member)

@kpritam please share the logs. I probably don't need all of them; cut out the failed iteration and a few successful iterations before that. It's important to have logs from all involved ActorSystems.

Quarantine and watch are exactly what is unsafe about use-unsafe-remote-features-outside-cluster=on, but I can take a look to fully understand what is going on.

kpritam (Contributor) commented Feb 22, 2021

I have attached debug.log; it contains 4 iterations. A short summary is below.

  1. ---------> 67 <--------- : SUCCESS - First time 50944 is used
  2. ---------> 75 <--------- : SUCCESS - contains the following error related to 50944:

Association to [akka://sequencer-system@192.168.1.3:50944] having UID [2832747598290303135] has been stopped. 
All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated

Last two iterations:

  3. ---------> 1205 <--------- : SUCCESS
  4. ---------> 1206 <--------- : FAILED, with the following reason:

Dropping message [esw.ocs.api.actor.messages.SequencerMessages$GetSequenceComponent]
from [Actor[akka://sequence-manager/deadLetters]] to 
[Actor[akka://sequencer-system@192.168.1.3:50944/user/ESW.Perf#1625616268]] 
due to quarantined system [akka://sequencer-system@192.168.1.3:50944]

Additional info:

  • With the default settings, the test fails at around the 1600th iteration.
  • With akka.remote.artery.advanced.remove-quarantined-association-after = 10s, the test fails at around the 4400th iteration.

debug.log

@patriknw (Member)

Thanks for the logs, I have created a new issue for this: #30054
