Akka-Persistence: Zombie Actor on missing recovery permission (waitingRecoveryPermit) #28658

motmot80 · 2020-02-28T08:01:00Z

We are having issues with PersistentActors staying in the waitingRecoveryPermit state when the permit answer RecoveryPermitGranted is lost in high-load scenarios.

akka/akka-persistence/src/main/scala/akka/persistence/Eventsourced.scala

Line 601 in 6fe2f66

private def waitingRecoveryPermit(recovery: Recovery) = new State {

Because every message is stashed there's no way to stop or restart the actor in this state.

In addition to the existing max-concurrent-recoveries setting, there should be a setting for how long the PersistentActor waits for a permit before stopping itself (die trying).

For example, something like max-concurrent-recoveries-wait-timeout:

-1: no timeout
X msec: stop the actor after X milliseconds, giving control back to its supervisor
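
A minimal sketch of how the proposed setting might look in application.conf. Only max-concurrent-recoveries is an existing setting (the value shown is illustrative); max-concurrent-recoveries-wait-timeout is the hypothetical key suggested in this issue and does not exist in Akka.

```hocon
akka.persistence {
  # Existing setting: upper bound on the number of simultaneous recoveries.
  max-concurrent-recoveries = 50

  # HYPOTHETICAL setting proposed in this issue; not an existing Akka option.
  # -1 would disable the timeout; any other value is the wait in milliseconds
  # before the actor stops itself and hands control back to its supervisor.
  max-concurrent-recoveries-wait-timeout = 30000
}
```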

Thanks in advance and best regards
Thomas


johanandren · 2020-02-28T15:39:27Z

Can you clarify in what scenario the RecoveryPermitGranted would be lost? It's a completely local message, so there shouldn't be any circumstances where it is never received, unless the journal is permanently stuck and never completes the current recoveries.

patriknw · 2020-03-17T16:13:46Z

It can be a bug, so if you have any more information @motmot80 that would be valuable.

motmot80 · 2020-09-03T19:13:31Z

We are still having this issue.

The message wasn't lost - it seems the implementation is causing this issue:

When a permit request is buffered in RecoveryPermitter, it seems that the permit is not automatically granted to the requesting Eventsourced once the buffer has drained in the meantime.

So Eventsourced times out (RecoveryTick), stops the actor, and tells the RecoveryPermitter to forget it.

Maybe I took a wrong turn, but I think the RecoveryPermitter permit request should be answered as soon as the buffer drained under max-concurrent-recoveries, so the answer could reach the permit requester (Eventsourced) within its timeout window.

That way actor recoveries can be queued to avoid overloading the persistence layer, without restarting the persistent actors several times.
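
The expected behaviour can be modelled roughly as follows. This is a simplified, single-threaded sketch in plain Scala, not the Akka RecoveryPermitter code; PermitPool and all of its members are hypothetical names chosen for illustration.

```scala
import scala.collection.mutable

// Simplified model of a recovery permit pool (illustrative only, not Akka code).
final class PermitPool(maxPermits: Int) {
  private val pending = mutable.Queue.empty[String] // buffered requesters, FIFO
  private var usedPermits = 0
  val granted = mutable.ListBuffer.empty[String]    // order in which permits were granted

  // Grant immediately while permits are free, otherwise buffer the request.
  def request(id: String): Unit =
    if (usedPermits >= maxPermits) pending.enqueue(id)
    else grant(id)

  // A finished recovery hands its permit back; the oldest buffered
  // request, if any, is answered right away.
  def giveBack(): Unit = {
    require(usedPermits > 0, "no permit in use")
    usedPermits -= 1
    if (pending.nonEmpty) grant(pending.dequeue())
  }

  // A requester that stopped while waiting is dropped from the buffer.
  def forget(id: String): Unit = {
    pending.dequeueAll(_ == id)
    ()
  }

  private def grant(id: String): Unit = {
    usedPermits += 1
    granted += id
  }
}

object PermitPoolDemo {
  def main(args: Array[String]): Unit = {
    val pool = new PermitPool(maxPermits = 1)
    pool.request("a") // granted immediately
    pool.request("b") // buffered, the only permit is in use
    pool.giveBack()   // "a" is done, so "b" is granted now
    println(pool.granted.mkString(", "))
  }
}
```

In this model a buffered request is answered as soon as a permit is handed back, so a requester only stays blocked while all permits remain in use.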

Thanks in advance.

patriknw · 2020-09-05T06:42:19Z

What you describe is intended, but maybe there is a bug. I’ll look into it. Let us know if you see what is wrong. The code for this is rather difficult in Eventsourced due to binary compatibility limitations.

patriknw · 2020-09-07T13:36:04Z

> I think the RecoveryPermitter permit request should be answered as soon as the buffer drained under max-concurrent-recoveries

That is how it is implemented: https://github.com/akka/akka/blob/master/akka-persistence/src/main/scala/akka/persistence/RecoveryPermitter.scala#L75

and there is a test for it: https://github.com/akka/akka/blob/master/akka-persistence/src/test/scala/akka/persistence/RecoveryPermitterSpec.scala#L81

so there must be something else

patriknw · 2020-09-07T13:51:56Z

After reading the original issue description again it seems like you would like to have a configurable timeout for how long to wait for the permit. Your system is overloaded and you would prefer to stop the actors when they have been waiting for too long for the permit. That seems like a fair request.

johanandren added the 0 - new Ticket is unclear on it's purpose or if it is valid or not label Mar 2, 2020
