
When using Artery remoting, RemoteActorRefProvider's resolveActorRef can return a stale actor reference #29828

Closed
jhyoty opened this issue Nov 20, 2020 · 9 comments

jhyoty commented Nov 20, 2020

We recently upgraded from classic remoting to Akka 2.6 with Artery remoting (no Akka Cluster in use). One system started to see seemingly random timeouts, and logging confirmed that one ActorSystem sometimes dropped the response message:

DEBUG [akka.actor.default-dispatcher-6726] akka.remote.artery.Association - Dropping message [...] from [Actor[akka://Server/user/RequestProcessor#-2013075229]] to [Actor[akka://Client@127.0.0.1:37995/temp/RequestProcessor$a]] due to quarantined system [akka://Client@127.0.0.1:37995]

The "server" is a long-running process, but the "client" is a batch program that is periodically started, sends few requests to the Server, and stops. Both run in the same machine. The logs show that the Server always receives the messages from the Client, processes them, and sends a response, but sometimes the association drops the response message. There are couple of requests per client execution, and sometimes one of the requests fails to get a response, but another one works. The client uses "Patterns.ask" (seems that the temporary actors have predictable short names) and the client-side actorsystem uses a random port.

Further investigation suggests that this particular messaging pattern "poisons" the thread-local ActorRef cache in the RemoteActorRefProvider, and sometimes an ActorRef that references a stale Association (one that has already been removed from the AssociationRegistry) is returned. It seems that using a fixed port on the client side and making remove-quarantined-association-after long enough resolves the problem for us. However, the caching behavior does not look correct, and the RemoteActorRefProvider should e.g. check that cached actor references do not refer to an expired Association.

@jhyoty jhyoty changed the title Artery RemoteActorRefProvider's resolveActorRef can return a stale actor reference When using Artery remoting, RemoteActorRefProvider's resolveActorRef can return a stale actor reference Nov 20, 2020
@patriknw (Member)

@jhyoty Thanks for reporting.
One part that I don't understand: "fixed port on the client-side... resolves the problem for us". Wouldn't that have the opposite result? If it's a new random port it would not be in the cache, whereas a fixed port is more likely to be in the cache. However, a random port can also end up reusing the same port.

@patriknw patriknw self-assigned this Nov 23, 2020
@patriknw patriknw added t:remoting:artery 3 - in progress Someone is working on this ticket bug labels Nov 23, 2020
jhyoty (Author) commented Nov 23, 2020

Using a fixed port makes the server-side system detect that this is a new incarnation, so the association is updated and the problem does not occur, as long as the quarantined-association cleanup interval is long enough (but if you make the cleanup time short and fix the port, the problem becomes easy to reproduce).
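
In config terms, the client-side variation that makes it easy to reproduce looks roughly like this (a sketch; the hostname, port, and interval values are illustrative):

akka {
  actor.provider = remote
  remote.artery {
    canonical.hostname = "127.0.0.1"
    # fixed port instead of 0 (random), so the server detects a new incarnation on every run
    canonical.port = 25521
    # a short cleanup interval makes the stale-cache problem easy to hit;
    # a long enough value works around it for us
    advanced.remove-quarantined-association-after = 2s
  }
}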

The following program and config reproduce the issue (at least on my machine):
https://gist.github.com/jhyoty/b4e81e14fbe26a5c43b121ad635452d7

@patriknw (Member)

thanks

jhyoty (Author) commented Nov 23, 2020

I'm probably mistaken about the thread-local cache in RemoteActorRefProvider being the problem. It seemed logical because of the nondeterminism, but now that I have run the reproducer in a debugger, it points to akka.remote.artery.ActorRefResolveCacheWithAddress. The issue seems to be the same: an ActorRef referring to an expired association is returned.

@patriknw (Member)

Yes, I have noticed that. It's similar and they should work the same.

patriknw added a commit that referenced this issue Nov 23, 2020
* problem described in the issue
* temp (ask) ActorRef shouldn't be cached
* similar to the ActorRef compression that was fixed
  in #28823
* temp ActorRef doesn't contain the ActorRef uid so it can
  be resolved to an ActorRef with an old cached Association
* additionally, invalidate the cached Association if it has
  been removed after quarantine
* test that reproduces the problem in the issue
* also verified with the Main example in the issue
patriknw added a commit that referenced this issue Dec 4, 2020
* problem described in the issue
* temp (ask) ActorRef shouldn't be cached
* similar to the ActorRef compression that was fixed
  in #28823
* temp ActorRef doesn't contain the ActorRef uid so it can
  be resolved to an ActorRef with an old cached Association
* additionally, invalidate the cached Association if it has
  been removed after quarantine
* test that reproduces the problem in the issue
* also verified with the Main example in the issue
@patriknw patriknw removed the 3 - in progress Someone is working on this ticket label Dec 4, 2020
@patriknw patriknw added this to the 2.6.11 milestone Dec 4, 2020
@patriknw patriknw closed this as completed Dec 4, 2020
kpritam (Contributor) commented Feb 20, 2021

Hi @patriknw, we are facing exactly the same issue. The only difference in our case is that we are using a remote actorRef with use-unsafe-remote-features-outside-cluster = on and are not using actorSelection.

Versions tried: Akka Typed 2.6.11, 2.6.12

We have the following setup:

  1. Actor1 => runs in one JVM and has registered its remote actorRef in our custom LocationService (think of it as similar to the Receptionist)

  2. Actor2 => runs in a second JVM and has registered its address with the LocationService

Scenario:

  1. Actor1 sends a Restart message to Actor2 in a loop, 2000 times.
  2. On the Restart message, Actor2 terminates its ActorSystem and starts a new one on a random port (port = 0); a rough sketch of this is included below.
  3. This works fine for 1600-1700 iterations.
  4. But it eventually fails with the following message:
Dropping message [esw.ocs.api.actor.messages.SequencerMessages$GetSequenceComponent] from
[Actor[akka://sequence-manager/deadLetters]] to 
[Actor[akka://sequencer-system@192.168.1.3:50944/user/ESW.Perf#1625616268]] 
due to quarantined system [akka://sequencer-system@192.168.1.3:50944]

Here,
sequence-manager = Actor1
sequencer-system = Actor2
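
A rough sketch of the restart behaviour on the Actor2 side (Akka Typed, Scala; the message protocol and config values are illustrative, and the registration in our custom LocationService is elided):

import akka.actor.typed.{ActorSystem, Behavior}
import akka.actor.typed.scaladsl.Behaviors
import com.typesafe.config.ConfigFactory
import scala.concurrent.ExecutionContext.Implicits.global

object Actor2 {
  sealed trait Command
  case object Restart extends Command

  // every incarnation binds to a random port, so the remote address changes
  // on each restart while the system name stays "sequencer-system"
  private val config = ConfigFactory.parseString(
    """
    akka.actor.provider = remote
    akka.remote.artery.canonical.port = 0
    """).withFallback(ConfigFactory.load())

  def start(): ActorSystem[Command] = {
    val system = ActorSystem(behavior, "sequencer-system", config)
    // registering the new actorRef in the LocationService is elided here
    system
  }

  private def behavior: Behavior[Command] =
    Behaviors.receive[Command] { (ctx, _) =>
      // on Restart: terminate the whole ActorSystem and start a fresh incarnation;
      // the old association eventually gets quarantined on the Actor1 side
      ctx.system.terminate()
      ctx.system.whenTerminated.foreach(_ => start())
      Behaviors.same
    }

  def main(args: Array[String]): Unit = start()
}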

Let me know if you would like to take a look at the detailed log file; it is 150 MB in size, hence I am not attaching it here.

@patriknw (Member)

@kpritam please share the logs. I probably don't need all of them; cut out the failed iteration and a few successful iterations before that. It's important to have logs from all involved ActorSystems.

Quarantine and watch are exactly what is unsafe about use-unsafe-remote-features-outside-cluster=on, but I can take a look to fully understand what is going on.

kpritam (Contributor) commented Feb 22, 2021

I have attached debug.log; it contains 4 iterations. A short summary is below.

  1. ---------> 67 <--------- : SUCCESS - First time 50944 is used
  2. ---------> 75 <--------- : SUCCESS - contains the following error related to 50944:

Association to [akka://sequencer-system@192.168.1.3:50944] having UID [2832747598290303135] has been stopped. 
All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated

Last two iterations:

  3. ---------> 1205 <--------- : SUCCESS
  4. ---------> 1206 <--------- : FAILED, with the following reason:

Dropping message [esw.ocs.api.actor.messages.SequencerMessages$GetSequenceComponent]
from [Actor[akka://sequence-manager/deadLetters]] to 
[Actor[akka://sequencer-system@192.168.1.3:50944/user/ESW.Perf#1625616268]] 
due to quarantined system [akka://sequencer-system@192.168.1.3:50944]

Additional info:

  • With the default settings, the test fails at around the 1600th iteration.
  • With akka.remote.artery.advanced.remove-quarantined-association-after = 10s, the test fails at around the 4400th iteration.

debug.log

@patriknw (Member)

Thanks for the logs, I have created a new issue for this: #30054
