
Some persistent actors are stuck with RecoveryTimedOutException after circuit breaker opens #4265

Closed
object opened this issue Feb 27, 2020 · 13 comments · Fixed by #4953

@object
Contributor

object commented Feb 27, 2020

This issue looks similar to #3870; however, it happens when using the latest version of Akka.NET.

OS: Windows Server 2016
Platform: .NET Core 3.1
Akka.NET packages: 1.4.0-beta14 (used in a cluster)

Scenario:

  1. Akka.Persistence.SqlServer.Journal.BatchingSqlServerJournal raises an exception with the message "Circuit Breaker is open; calls are failing fast", most likely due to a temporary db outage

  2. Attempts to recover the state of some persistent actors fail with RecoveryTimedOutException. Here's a typical sequence of events, taken from our log:

Started (Akka.Pattern.BackoffOnRestartSupervisor)
now supervising akka://Oddjob/system/sharding/upload/0/ps~msui30002111/msc:ps~msui30002111
now watched by [akka://Oddjob/system/sharding/upload/0/ps~msui30002111#1585193596]
now watched by [akka://Oddjob/system/recoveryPermitter#1099929798]
Spawned MediaSetController actor
now watched by [akka://Oddjob/system/sharding/upload/0#1224240942]
Started (Akkling.Persistence.FunPersistentActor`1[System.Object])
Restoring state from snapshot
(after 1 minute)
["", null, "Akka.Persistence.RecoveryTimedOutException: Recovery timed out, didn't get event within 60s, highest sequence number seen 312."] {AckInfo} {Exception}
Passivating started on entity "ps~msui30002111"
received AutoReceiveMessage <Terminated>: [akka://Oddjob/system/sharding/upload/0/ps~msui30002111#1585193596] - ExistenceConfirmed=True
Entity stopped after passivation ["ps~msui30002111"]

  3. Once a persistent actor fails with such an exception, it is stuck until the system is restarted. Other actors may still be recovered successfully.
@object
Contributor Author

object commented Feb 27, 2020

And here's an extract from our HOCON:

akka {
  persistence{
    journal {
      plugin = "akka.persistence.journal.sql-server"
      sql-server {
        class = "Akka.Persistence.SqlServer.Journal.BatchingSqlServerJournal, Akka.Persistence.SqlServer"
        recovery-event-timeout = 60s
        schema-name = dbo
        table-name = EventJournal
        auto-initialize = off
      }
    }
    snapshot-store {
      plugin = "akka.persistence.snapshot-store.sql-server"
      sql-server {
        class = "Akka.Persistence.SqlServer.Snapshot.SqlServerSnapshotStore, Akka.Persistence.SqlServer"
        serializer = protobuf
        schema-name = dbo
        table-name = SnapshotStore
        auto-initialize = off
      }
    }
  }
}
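
For completeness: the batching journal also accepts circuit-breaker settings in the same plugin block, which are not shown in the extract above. A sketch, assuming the standard Akka.Persistence.Sql.Common keys (the values here are illustrative, not a recommendation):

akka.persistence.journal.sql-server {
  circuit-breaker {
    max-failures = 5    # consecutive failures before the breaker opens
    call-timeout = 20s  # how long a call may run before it counts as a failure
    reset-timeout = 30s # how long the breaker stays open before retrying
  }
}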


@Aaronontheweb
Member

Think I have an idea on how to reproduce this with SQL Server - it's tough to reproduce in this repository with Sqlite because I can't easily simulate a database outage there. Should be doable with SQL Server though.

@Aaronontheweb Aaronontheweb added this to the 1.4.1 and Later milestone Feb 27, 2020
@object
Contributor Author

object commented Feb 27, 2020

I once managed to trigger this scenario by simply executing a long-running query that locked the whole EventJournal table. After the query completed, the persistent actors that had failed during its execution were still stuck, and I had to restart the service to recover them.

@Aaronontheweb
Member

@object https://github.com/akkadotnet/Akka.Persistence.SqlServer/blob/dev/src/Akka.Persistence.SqlServer.Tests/SqlServerFixture.cs - we can add a method to stop the Docker container we use for integration testing in here and a second method to start it again. That would be a pretty robust way of doing it.
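
A minimal sketch of what such fixture methods could look like, assuming Docker.DotNet (the client library mentioned later in this thread); the class, field names, and endpoint below are illustrative, not the fixture's actual API:

using System;
using System.Threading.Tasks;
using Docker.DotNet;
using Docker.DotNet.Models;

public class SqlServerOutageFixture
{
    // npipe endpoint for Docker on Windows; use unix:///var/run/docker.sock on Linux.
    private readonly DockerClient _client =
        new DockerClientConfiguration(new Uri("npipe://./pipe/docker_engine")).CreateClient();

    // Captured when the fixture starts the SQL Server container.
    private string _containerId;

    // Simulates a database outage by stopping the container mid-test.
    public Task StopSqlServerAsync() =>
        _client.Containers.StopContainerAsync(_containerId, new ContainerStopParameters());

    // Brings the database back so recovery can be retried.
    public Task StartSqlServerAsync() =>
        _client.Containers.StartContainerAsync(_containerId, new ContainerStartParameters());
}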

@ismaelhamed
Member

This still stands: akkadotnet/Akka.Persistence.SqlServer#114 (comment)

@Aaronontheweb
Member

@ismaelhamed that would do it

@object
Contributor Author

object commented Feb 27, 2020

@Aaronontheweb we are not using Docker for integration tests yet, but stopping a container looks like a good way to reproduce this scenario.

@Aaronontheweb
Member

Aaronontheweb commented Feb 27, 2020

I'm going to leave this issue open (edit: meaning myself and my team can't get on it right away) for the time being as we're pretty tied up with the 1.4.0 release (trying to get a release candidate with a stable API shipped today) - but I think we can get this reproduced and patched in short order.

@Aaronontheweb Aaronontheweb modified the milestones: 1.4.2, 1.4.3, 1.4.4 Mar 13, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.4, 1.4.5 Mar 31, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.5, 1.4.6 Apr 29, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.6, 1.4.7 May 12, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.7, 1.4.8 May 26, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.8, 1.4.9 Jun 17, 2020
@Arkatufus
Contributor

@ismaelhamed I'm still working on reproducing this bug, using Docker.DotNet to simulate a database outage

Arkatufus referenced this issue in Arkatufus/Akka.Persistence.SqlServer Jul 2, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.9, 1.4.10 Jul 21, 2020
@Aaronontheweb Aaronontheweb removed this from the 1.4.10 milestone Aug 20, 2020
@Aaronontheweb Aaronontheweb added this to the 1.4.11 milestone Aug 20, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.11, 1.4.12 Nov 5, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.12, 1.4.13 Nov 16, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.13, 1.4.14 Dec 16, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.14, 1.4.15 Dec 30, 2020
@object
Contributor Author

object commented Jan 12, 2021

I've finally managed to allocate some time to investigate this one further. I made a couple of tests that trigger state recovery of 10,000 persistent actors. Here's what happens:

Test 1: Recover the state of 10K persistent actors in a non-clustered environment. The test creates an actor system and instantiates 10K actors. I ran this test multiple times and it seems to work without any errors.

Test 2: Recover the state of 10K persistent actors activated using cluster sharding and the full version of the actors, which spawn additional work. The persistent actors are spread across various shards, and the test sends activation requests via a Web API that uses cluster sharding to instantiate the persistent actors on their respective nodes. Some actors fail to recover their state; here's what's logged:

akka://Oddjob/system/akka.persistence.journal.sql-server | Circuit Breaker is open; calls are failing fast
akka.tcp://Oddjob@maodatest02:1966/system/sharding/uploadmediaset/4/psdkbu10027519/msc:psdkbu10027519 | ["", null, "Akka.Persistence.RecoveryTimedOutException: Recovery timed out, didn't get event within 60s, highest sequence number seen 0."] {AckInfo} {Exception}

followed by a few hundred similar Akka.Persistence.RecoveryTimedOutException log entries.
Increasing the circuit breaker timeout sometimes helps keep the circuit breaker from opening, but timeout exceptions still occur.

Some of the failed actors can be recovered later; others are stuck and require a restart of the nodes they live on.

To investigate this further I can try two directions:

  1. Try to extract a reasonably small code base that can be used to reproduce this error, but this can be hard. It's a lot of domain-specific code written in F# and Akkling, so reducing it to code that is easy to understand and debug can be tricky.
  2. Try to investigate further what's happening in Akka.Persistence.SqlServer, perhaps by building a custom version with more logging.

@Aaronontheweb @IgorFedchenko @ismaelhamed which approach do you think is better to take first? Any other suggestions?

@object
Contributor Author

object commented Jan 14, 2021

I experimented more with this; some further observations:

  • If I extract an environment where the system deals almost exclusively with instantiating persistent actors, not doing much more than that, the errors are gone: 10K requests are handled quickly, all actor states are recovered, and there are no timeouts.
  • If the same actors do their regular work after recovering their state (spawning worker actors that in turn spawn other worker actors that call external services, etc.), then running just 2K requests triggers Akka.Persistence.RecoveryTimedOutException (and opens the circuit breaker). These additional activities don't spawn other persistent actors.

So it looks like I need to build custom versions of the Akka.Persistence DLLs with extra logging to better understand what's going on.

@Arkatufus
Contributor

This could be caused by a bug in BatchingSqlJournal. BatchingSqlJournal uses a counter (_remainingOperations) to track how many database batch operations it is allowed to run concurrently. The counter is decremented every time it processes a batch chunk and incremented for each completed batch.
The problem is the circuit breaker code: it does not emit a BatchComplete message, so every retry "consumes" a slot that never gets returned. Once the counter reaches zero, the batching journal effectively blocks all database operations until it is restarted.
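
A simplified model of that counter leak (illustrative pseudocode of the pattern, not the actual BatchingSqlJournal source; everything except _remainingOperations is a made-up name):

using System;
using System.Threading.Tasks;

class BatchingJournalModel
{
    // Caps how many database batches may be in flight at once.
    private int _remainingOperations = 4;

    public async Task ProcessChunkAsync(Func<Task> executeBatch)
    {
        if (_remainingOperations == 0)
        {
            // Every slot has been "consumed": from here on the journal
            // rejects all work until the actor is restarted.
            Console.WriteLine("Journal wedged: no remaining operations");
            return;
        }

        _remainingOperations--; // take a slot for this batch
        try
        {
            await executeBatch();
            _remainingOperations++; // modeled BatchComplete: slot returned
        }
        catch (Exception)
        {
            // Pre-fix behavior: the circuit-breaker failure path emitted no
            // BatchComplete, so the slot taken above was never given back.
        }
    }
}

Four failed batches in a row (e.g. while the breaker is open and "calls are failing fast") drive the counter to zero, after which every chunk hits the early return above.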

@Arkatufus
Contributor

This problem should be fixed by #4953: the new code has only a single code path for failures, and it always increments the counter for each failure.
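
In terms of the model sketched above, the fix is equivalent to returning the slot on every outcome, e.g. via try/finally (again a sketch, not the actual #4953 diff):

public async Task ProcessChunkFixedAsync(Func<Task> executeBatch)
{
    if (_remainingOperations == 0) return;

    _remainingOperations--; // take a slot
    try
    {
        await executeBatch();
    }
    finally
    {
        // Single code path for success and failure alike: the slot is
        // always returned, so the counter can no longer leak to zero.
        _remainingOperations++;
    }
}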
