Akka-Persistence: Zombie Actor on missing recovery permission (waitingRecoveryPermit) #28658

Open
motmot80 opened this issue Feb 28, 2020 · 6 comments
Labels
0 - new Ticket is unclear on it's purpose or if it is valid or not

Comments

@motmot80

We are having issues with PersistentActors staying in the waitingRecoveryPermit state when the permit answer RecoveryPermitGranted is lost in high-load scenarios.

private def waitingRecoveryPermit(recovery: Recovery) = new State {

Because every message is stashed in this state, there is no way to stop or restart the actor.

In addition to the existing max-concurrent-recoveries, there should be a setting for how long the PersistentActor waits for a permit before stopping itself (die trying).

For example, something like max-concurrent-recoveries-wait-timeout:

  • -1: no timeout
  • X ms: stop the actor after X milliseconds, giving control back to its supervisor
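
A minimal sketch of how the two proposed values could be interpreted, assuming the value is read as a number of milliseconds. The object and method names are hypothetical, and max-concurrent-recoveries-wait-timeout is only the setting suggested here, not an existing Akka configuration key.

```scala
import scala.concurrent.duration._

// Hypothetical helper for the proposed setting; not part of Akka.
object RecoveryPermitWaitTimeout {

  // Interprets the configured value in milliseconds.
  // A negative value (-1) means "no timeout"; any other value is the maximum wait.
  def fromMillis(value: Long): Option[FiniteDuration] =
    if (value < 0) None
    else Some(value.millis)

  // True when an actor that has already waited `waited` for its permit
  // should stop itself and hand control back to its supervisor.
  def expired(timeout: Option[FiniteDuration], waited: FiniteDuration): Boolean =
    timeout.exists(waited >= _)
}
```

With this reading, `fromMillis(-1)` disables the timeout entirely, so `expired` never returns true and the actor keeps waiting for its permit.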

Thanks in advance and best regards
Thomas

@johanandren
Member

Can you clarify in what scenario the RecoveryPermitGranted would be lost? It's a completely local message, so there shouldn't be any circumstances where it is never received, unless the journal is completely stuck and the current recoveries never make progress.

@johanandren johanandren added the 0 - new Ticket is unclear on it's purpose or if it is valid or not label Mar 2, 2020
@patriknw
Member

It can be a bug, so if you have any more information @motmot80 that would be valuable.

@motmot80
Author

motmot80 commented Sep 3, 2020

We are still having this issue.

The message wasn't lost - it seems the implementation is causing this issue:

When a permit request is buffered in RecoveryPermitter, it seems that the permit is not automatically granted to the requesting EventSourced once the buffer has drained in the meantime.

So EventSourced times out (RecoveryTick), stops the actor, and tells the RecoveryPermitter to forget it.

Maybe I took a wrong turn - but I think the RecoveryPermitter permit request should be answered as soon as the buffer drained under max-concurrent-recoveries, so the answer can reach the permit requester (EventSourced) within its timeout window.

That way, actor recoveries can be queued so that the persistence layer is not overloaded, without restarting the persistent actors several times.
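
The queueing behaviour expected here can be sketched as a small Akka-free model. This is only an illustration under simplifying assumptions, not the actual RecoveryPermitter code: `PermitterModel`, `request`, `release`, and `forget` are hypothetical names, and requesters are identified by plain strings instead of actor references.

```scala
import scala.collection.immutable.Queue

// Simplified model of a recovery permitter; all names are illustrative.
final case class PermitterModel(
    max: Int,                                // plays the role of max-concurrent-recoveries
    used: Set[String] = Set.empty,           // requesters currently holding a permit
    pending: Queue[String] = Queue.empty) {  // requesters waiting for a permit

  // Grants a permit immediately if one is free, otherwise queues the requester.
  def request(id: String): PermitterModel =
    if (used.size < max) copy(used = used + id)
    else copy(pending = pending.enqueue(id))

  // Returning a permit hands it straight to the oldest pending requester, if any.
  def release(id: String): PermitterModel = {
    val freed = copy(used = used - id)
    freed.pending.dequeueOption match {
      case Some((next, rest)) if freed.used.size < max =>
        freed.copy(used = freed.used + next, pending = rest)
      case _ => freed
    }
  }

  // A requester that stops is dropped; a permit it held goes to the next in line.
  def forget(id: String): PermitterModel =
    if (used.contains(id)) release(id)
    else copy(pending = pending.filterNot(_ == id))
}
```

In this model, with `max = 1`, a second requester is queued and receives its permit in the same step in which the first one releases its own, which is the behaviour asked for above.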

Thanks in advance.

@patriknw
Member

patriknw commented Sep 5, 2020

What you describe is the intended behavior, but maybe there is a bug. I’ll look into it. Let us know if you see what is wrong. The code for this is rather difficult in EventSourced due to binary compatibility limitations.

@patriknw
Member

patriknw commented Sep 7, 2020

“I think the RecoveryPermitter permit request should be answered as soon as the buffer drained under max-concurrent-recoveries”

That is how it is implemented: https://github.com/akka/akka/blob/master/akka-persistence/src/main/scala/akka/persistence/RecoveryPermitter.scala#L75

and there is a test for it: https://github.com/akka/akka/blob/master/akka-persistence/src/test/scala/akka/persistence/RecoveryPermitterSpec.scala#L81

so there must be something else

@patriknw
Member

patriknw commented Sep 7, 2020

After reading the original issue description again it seems like you would like to have a configurable timeout for how long to wait for the permit. Your system is overloaded and you would prefer to stop the actors when they have been waiting for too long for the permit. That seems like a fair request.
