When using Artery remoting, RemoteActorRefProvider's resolveActorRef can return a stale actor reference #29828
Comments
@jhyoty Thanks for reporting.
Using a fixed port lets the server-side system detect that this is a new incarnation; the association is updated, and the problem does not occur if the quarantined association cleanup interval is long enough. (If you make the cleanup time short and fix the port, the problem becomes easy to reproduce.) The following program and config reproduce the issue (at least on my machine):
Thanks.
I'm probably mistaken about the thread-local cache in RemoteActorRefProvider being the problem. It seemed logical because of the nondeterminism, but now that I have run the reproducer in a debugger, it points to akka.remote.artery.ActorRefResolveCacheWithAddress. The issue seems to be the same: an ActorRef referring to an expired association is returned.
Yes, I have noticed that. It's similar and they should work the same. |
* problem described in the issue
* temp (ask) ActorRef shouldn't be cached
* similar to the ActorRef compression that was fixed in #28823
* temp ActorRef doesn't contain the ActorRef uid, so it can be resolved to an ActorRef with an old cached Association
* additionally, invalidate the cached Association if it has been removed after quarantine
* test that reproduces the problem in the issue
* also verified with the Main example in the issue
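To illustrate the "temp (ask) ActorRef shouldn't be cached" point from the commit message, here is a minimal toy sketch. It is not the Akka implementation: `TempAwareResolveCache`, `isTemp`, and the string-keyed map are hypothetical names made up for this example, and it assumes that temporary actors can be recognised by a `/temp/` segment in the path, as in the log line quoted in the issue.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model only: these names are hypothetical and are not Akka's internal API.
final class TempAwareResolveCache<R> {
    private final Map<String, R> cache = new HashMap<>();
    private final Function<String, R> resolver;

    TempAwareResolveCache(Function<String, R> resolver) {
        this.resolver = resolver;
    }

    // Assumption: temporary (ask) actors live under a "/temp/" path segment,
    // as in the dropped-message log line quoted in the issue.
    static boolean isTemp(String path) {
        return path.contains("/temp/");
    }

    R resolve(String path) {
        if (isTemp(path)) {
            // A temp path carries no uid, so a cached entry could belong to an
            // earlier incarnation of the remote system; always resolve fresh.
            return resolver.apply(path);
        }
        // Non-temp paths are cached by their full string (including any uid).
        return cache.computeIfAbsent(path, resolver);
    }

    int cachedEntries() {
        return cache.size();
    }
}
```

In this sketch, resolving the same temp path twice calls the resolver twice, while a path under `/user/` is resolved once and then served from the map.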
Hi @patriknw, we are facing a very similar issue. The only difference in our case is that we are using remote actorRef with
Tried versions:
We have the following setup:
Scenario:
Here, Let me know if you would like to take a look at the detailed log file; it is 150 MB in size, hence not attached here.
@kpritam please share logs. I probably don't need all of it. Cut out the failed iteration and a few successful iterations before that. It is important to have logs from all involved ActorSystems. Quarantine and watch are exactly what is unsafe about use-unsafe-remote-features-outside-cluster=on, but I can take a look to fully understand what is going on.
I have attached
Contains following error related to
Last two iterations:
Additional Info:
Thanks for the logs. I have created a new issue for this: #30054
We recently upgraded to Akka 2.6 and moved from classic remoting to Artery remoting (no Akka Cluster in use). One system started to have seemingly random timeouts, and logging confirmed that one ActorSystem sometimes dropped the response message:
DEBUG [akka.actor.default-dispatcher-6726] akka.remote.artery.Association - Dropping message [...] from [Actor[akka://Server/user/RequestProcessor#-2013075229]] to [Actor[akka://Client@127.0.0.1:37995/temp/RequestProcessor$a]] due to quarantined system [akka://Client@127.0.0.1:37995]
The "server" is a long-running process, but the "client" is a batch program that is periodically started, sends a few requests to the Server, and stops. Both run on the same machine. The logs show that the Server always receives the messages from the Client, processes them, and sends a response, but sometimes the association drops the response message. There are a couple of requests per client execution, and sometimes one of the requests fails to get a response while another one works. The client uses "Patterns.ask" (it seems that the temporary actors have predictable short names), and the client-side ActorSystem uses a random port.
Further investigation suggests that this particular messaging pattern "poisons" the thread-local ActorRef cache in the RemoteActorRefProvider, and sometimes an ActorRef that references a stale Association (one that has already been removed from the AssociationRegistry) is returned. It seems that using a fixed port on the client side and making remove-quarantined-association-after long enough resolves the problem for us. However, the caching behavior does not seem correct, and the RemoteActorRefProvider should, for example, check that cached actor references do not refer to an expired Association.
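The check suggested above can be sketched with a toy model. This is only an illustration under stated assumptions, not how Akka implements it: `ValidatingRefCache`, `Registry`, `Association`, `isRemoved`, and `removeQuarantined` are hypothetical stand-ins invented for this example. The idea is that a cache hit is only trusted while the cached association is still registered; otherwise the entry is resolved again.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical stand-ins, not Akka's Association or AssociationRegistry.
final class ValidatingRefCache {

    static final class Association {
        final String address;
        private boolean removed;

        Association(String address) {
            this.address = address;
        }

        boolean isRemoved() {
            return removed;
        }

        void markRemoved() {
            removed = true;
        }
    }

    static final class Registry {
        private final Map<String, Association> byAddress = new HashMap<>();

        // Returns the live association for an address, creating one if needed.
        Association associationFor(String address) {
            return byAddress.computeIfAbsent(address, Association::new);
        }

        // Models the cleanup that drops a quarantined association from the registry.
        void removeQuarantined(String address) {
            Association a = byAddress.remove(address);
            if (a != null) {
                a.markRemoved();
            }
        }
    }

    private final Registry registry;
    private final Map<String, Association> cache = new HashMap<>();

    ValidatingRefCache(Registry registry) {
        this.registry = registry;
    }

    Association resolve(String address) {
        Association cached = cache.get(address);
        if (cached == null || cached.isRemoved()) {
            // Miss, or the cached association is stale: ask the registry again
            // and overwrite the cache entry.
            cached = registry.associationFor(address);
            cache.put(address, cached);
        }
        return cached;
    }
}
```

Without the `isRemoved()` check, the second lookup after `removeQuarantined` would keep returning the removed association, which corresponds to the stale-reference behavior described in this issue.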