
Some persistent actors are stuck with RecoveryTimedOutException after circuit breaker opens #4265

Closed
object opened this issue Feb 27, 2020 · 13 comments · Fixed by #4953

@object
Contributor

object commented Feb 27, 2020

This issue looks similar to #3870; however, it happens when using the latest version of Akka.NET.

OS: Windows Server 2016
Platform: .NET Core 3.1
Akka.NET packages: 1.4.0-beta14 (used in a cluster)

Scenario:

  1. Akka.Persistence.SqlServer.Journal.BatchingSqlServerJournal raises an exception with the message "Circuit Breaker is open; calls are failing fast", most likely due to a temporary db outage

  2. Attempts to recover the state of some persistent actors fail with RecoveryTimedOutException. Here's a typical sequence of events, taken from our log:

Started (Akka.Pattern.BackoffOnRestartSupervisor)
now supervising akka://Oddjob/system/sharding/upload/0/ps~msui30002111/msc:ps~msui30002111
now watched by [akka://Oddjob/system/sharding/upload/0/ps~msui30002111#1585193596]
now watched by [akka://Oddjob/system/recoveryPermitter#1099929798]
Spawned MediaSetController actor
now watched by [akka://Oddjob/system/sharding/upload/0#1224240942]
Started (Akkling.Persistence.FunPersistentActor`1[System.Object])
Restoring state from snapshot
(after 1 minute)
["", null, "Akka.Persistence.RecoveryTimedOutException: Recovery timed out, didn't get event within 60s, highest sequence number seen 312."] {AckInfo} {Exception}
Passivating started on entity "ps~msui30002111"
received AutoReceiveMessage <Terminated>: [akka://Oddjob/system/sharding/upload/0/ps~msui30002111#1585193596] - ExistenceConfirmed=True
Entity stopped after passivation ["ps~msui30002111"]

  3. Once a persistent actor fails with such an exception, it is stuck until the system is restarted. Other actors may still be recovered successfully.
@object
Contributor Author

object commented Feb 27, 2020

And here's an extract from our HOCON:

akka {
  persistence{
    journal {
      plugin = "akka.persistence.journal.sql-server"
      sql-server {
        class = "Akka.Persistence.SqlServer.Journal.BatchingSqlServerJournal, Akka.Persistence.SqlServer"
        recovery-event-timeout = 60s
        schema-name = dbo
        table-name = EventJournal
        auto-initialize = off
      }
    }
    snapshot-store {
      plugin = "akka.persistence.snapshot-store.sql-server"
      sql-server {
        class = "Akka.Persistence.SqlServer.Snapshot.SqlServerSnapshotStore, Akka.Persistence.SqlServer"
        serializer = protobuf
        schema-name = dbo
        table-name = SnapshotStore
        auto-initialize = off
      }
    }
  }
}
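
For completeness: the batching journal also accepts circuit-breaker settings in the same plugin block, which are not shown in the extract above. A sketch, assuming the standard Akka.Persistence.Sql.Common keys (the values here are illustrative, not a recommendation):

akka.persistence.journal.sql-server {
  circuit-breaker {
    max-failures = 5    # consecutive failures before the breaker opens
    call-timeout = 20s  # how long a call may run before it counts as a failure
    reset-timeout = 30s # how long the breaker stays open before retrying
  }
}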


@Aaronontheweb
Member

Think I have an idea on how to reproduce this with SQL Server - it's tough to reproduce in this repository with Sqlite because I can't easily simulate a database outage there. Should be doable with SQL Server though.

@Aaronontheweb Aaronontheweb added this to the 1.4.1 and Later milestone Feb 27, 2020
@object
Contributor Author

object commented Feb 27, 2020

I once managed to trigger this scenario by simply executing a long-running query that locked the whole EventJournal table. After the query completed, the persistent actors that had failed during its execution were still stuck, and I had to restart the service to recover them.

@Aaronontheweb
Member

@object https://github.com/akkadotnet/Akka.Persistence.SqlServer/blob/dev/src/Akka.Persistence.SqlServer.Tests/SqlServerFixture.cs - we can add a method to stop the Docker container we use for integration testing in here and a second method to start it again. That would be a pretty robust way of doing it.
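
A minimal sketch of what such fixture methods could look like, assuming Docker.DotNet (the client library mentioned later in this thread); the class, field names, and endpoint below are illustrative, not the fixture's actual API:

using System;
using System.Threading.Tasks;
using Docker.DotNet;
using Docker.DotNet.Models;

public class SqlServerOutageFixture
{
    // npipe endpoint for Docker on Windows; use unix:///var/run/docker.sock on Linux.
    private readonly DockerClient _client =
        new DockerClientConfiguration(new Uri("npipe://./pipe/docker_engine")).CreateClient();

    // Captured when the fixture starts the SQL Server container.
    private string _containerId;

    // Simulates a database outage by stopping the container mid-test.
    public Task StopSqlServerAsync() =>
        _client.Containers.StopContainerAsync(_containerId, new ContainerStopParameters());

    // Brings the database back so recovery can be retried.
    public Task StartSqlServerAsync() =>
        _client.Containers.StartContainerAsync(_containerId, new ContainerStartParameters());
}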

@ismaelhamed
Member

This still stands: akkadotnet/Akka.Persistence.SqlServer#114 (comment)

@Aaronontheweb
Member

@ismaelhamed that would do it

@object
Contributor Author

object commented Feb 27, 2020

@Aaronontheweb we are not using Docker for integration tests yet, but stopping a container looks like a good way to reproduce this scenario.

@Aaronontheweb
Member

Aaronontheweb commented Feb 27, 2020

I'm going to leave this issue open (edit: meaning myself and my team can't get on it right away) for the time being as we're pretty tied up with the 1.4.0 release (trying to get a release candidate with a stable API shipped today) - but I think we can get this reproduced and patched in short order.

@Aaronontheweb Aaronontheweb modified the milestones: 1.4.2, 1.4.3, 1.4.4 Mar 13, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.4, 1.4.5 Mar 31, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.5, 1.4.6 Apr 29, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.6, 1.4.7 May 12, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.7, 1.4.8 May 26, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.8, 1.4.9 Jun 17, 2020
@Arkatufus
Contributor

@ismaelhamed I'm still working on reproducing this bug, using Docker.DotNet to simulate a database outage

Arkatufus referenced this issue in Arkatufus/Akka.Persistence.SqlServer Jul 2, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.9, 1.4.10 Jul 21, 2020
@Aaronontheweb Aaronontheweb removed this from the 1.4.10 milestone Aug 20, 2020
@Aaronontheweb Aaronontheweb added this to the 1.4.11 milestone Aug 20, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.11, 1.4.12 Nov 5, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.12, 1.4.13 Nov 16, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.13, 1.4.14 Dec 16, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.14, 1.4.15 Dec 30, 2020
@object
Contributor Author

object commented Jan 12, 2021

I've finally managed to allocate some time to investigate this one further. I made a couple of tests that trigger state recovery of 10,000 persistent actors. Here's what happens:

Test 1: Recover the state of 10K persistent actors in a non-clustered environment. The test creates an actor system and instantiates 10K actors. I ran this test multiple times and it seems to work without any errors.

Test 2: Recover the state of 10K persistent actors activated using cluster sharding and the full version of the actors, which spawn additional work. The persistent actors are spread across various shards, and the test sends activation requests via a Web API that uses cluster sharding to instantiate the persistent actors on their respective nodes. Some actors fail to recover their state; here's what's logged:

akka://Oddjob/system/akka.persistence.journal.sql-server | Circuit Breaker is open; calls are failing fast
akka.tcp://Oddjob@maodatest02:1966/system/sharding/uploadmediaset/4/psdkbu10027519/msc:psdkbu10027519 | ["", null, "Akka.Persistence.RecoveryTimedOutException: Recovery timed out, didn't get event within 60s, highest sequence number seen 0."] {AckInfo} {Exception}

followed by a few hundred similar Akka.Persistence.RecoveryTimedOutException log entries.
Increasing the circuit breaker timeout sometimes helps keep the circuit breaker from opening, but timeout exceptions still occur.

Some of the failed actors can be recovered later; others are stuck and require a restart of the nodes they live on.

To investigate this further I can try two directions:

  1. Try to extract a reasonably small code base that can be used to reproduce this error, but this can be hard. It's a lot of domain-specific code written in F# and Akkling, so reducing it to code that is easy to understand and debug can be tricky.
  2. Try to investigate further what's happening in Akka.Persistence.SqlServer, perhaps by building a custom version with more logging.

@Aaronontheweb @IgorFedchenko @ismaelhamed which approach do you think is better to take first? Any other suggestions?

@object
Contributor Author

object commented Jan 14, 2021

I experimented more with this; some further observations:

  • If I extract an environment where the system deals almost exclusively with instantiating persistent actors, not doing much more than that, the errors are gone: 10K requests are handled quickly, all actor states are recovered, and there are no timeouts.
  • If the same actors do their regular work after recovering their state (spawning worker actors that in turn spawn other worker actors that call external services, etc.), then running just 2K requests triggers Akka.Persistence.RecoveryTimedOutException (and opens the circuit breaker). These additional activities don't spawn other persistent actors.

So it looks like I need to build custom versions of the Akka.Persistence DLLs with extra logging to better understand what's going on.

@Arkatufus
Contributor

This could be caused by a bug in BatchingSqlJournal. BatchingSqlJournal uses a counter (_remainingOperations) to track how many database batch operations it is allowed to run concurrently. The counter is decremented every time it processes a batch chunk and incremented for each completed batch.
The problem is the circuit breaker code: it does not emit a BatchComplete message, so every retry "consumes" a slot that never gets returned. Once the counter reaches zero, the batching journal effectively blocks all database operations until it is restarted.
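
A simplified model of that counter leak (illustrative pseudocode of the pattern, not the actual BatchingSqlJournal source; everything except _remainingOperations is a made-up name):

using System;
using System.Threading.Tasks;

class BatchingJournalModel
{
    // Caps how many database batches may be in flight at once.
    private int _remainingOperations = 4;

    public async Task ProcessChunkAsync(Func<Task> executeBatch)
    {
        if (_remainingOperations == 0)
        {
            // Every slot has been "consumed": from here on the journal
            // rejects all work until the actor is restarted.
            Console.WriteLine("Journal wedged: no remaining operations");
            return;
        }

        _remainingOperations--; // take a slot for this batch
        try
        {
            await executeBatch();
            _remainingOperations++; // modeled BatchComplete: slot returned
        }
        catch (Exception)
        {
            // Pre-fix behavior: the circuit-breaker failure path emitted no
            // BatchComplete, so the slot taken above was never given back.
        }
    }
}

Four failed batches in a row (e.g. while the breaker is open and "calls are failing fast") drive the counter to zero, after which every chunk hits the early return above.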

@Arkatufus
Contributor

This problem should be fixed by #4953: the new code has only a single code path for failures, and it always increments the counter for each failure.
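
In terms of the model sketched above, the fix is equivalent to returning the slot on every outcome, e.g. via try/finally (again a sketch, not the actual #4953 diff):

public async Task ProcessChunkFixedAsync(Func<Task> executeBatch)
{
    if (_remainingOperations == 0) return;

    _remainingOperations--; // take a slot
    try
    {
        await executeBatch();
    }
    finally
    {
        // Single code path for success and failure alike: the slot is
        // always returned, so the counter can no longer leak to zero.
        _remainingOperations++;
    }
}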
