Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pool routers use Directive.Escalate by default when supervising routees - this causes unintended consequences #6295

Closed
Aaronontheweb opened this issue Dec 6, 2022 · 5 comments · Fixed by #6370

Comments

@Aaronontheweb
Copy link
Member

Version Information
Version of Akka.NET? 1.4.46 and earlier, 1.5.0-alpha3 and earlier.
Which Akka.NET Modules? Akka

Describe the bug

Currently, the default SupervisionStrategy on all Pool routers uses Directive.Escalate, which causes the router to restart and kill all routees every time any of the routees restart. This can be problematic, as it means that all routees are restarted even if only one of them has failed.

Instead, I suggest changing the default SupervisionStrategy to return a Directive.Restart instead. This would cause only the failed routee to be restarted, rather than restarting all routees in the pool. This would allow for more granular control over how failures are handled, and can prevent unnecessary restarts of healthy routees.

To Reproduce

Reproduced with one of my unit tests on #6294

// unit test to assert that a pool router actor's children generate telemetry events, but not the router actor itself
[Fact]
public async Task ActorTelemetry_must_be_accurate_for_pool_router()
{
    // create a TelemetrySubscriber actor
    var subscriber = Sys.ActorOf(Props.Create<TelemetrySubscriber>(), "subscriber");
    
    // request current telemetry values (ensure that the actor has started, so counter values will be accurate)
    await subscriber.Ask<TelemetrySubscriber.GetTelemetry>(TelemetrySubscriber.GetTelemetryRequest.Instance);
    
    // create a pool router
    var router = Sys.ActorOf(Props.Create<ChildActor>().WithRouter(new RoundRobinPool(10)), "router");

    // awaitassert collecting data from the telemetry subscriber until we can see that 10 actors have been created
    await AwaitAssertAsync(async () =>
    {
        var telemetry = await subscriber.Ask<TelemetrySubscriber.GetTelemetry>(TelemetrySubscriber.GetTelemetryRequest.Instance);
        Assert.Equal(12, telemetry.ActorCreated);
        // assert no restarts or stops recorded
        Assert.Equal(0, telemetry.ActorRestarted);
        Assert.Equal(0, telemetry.ActorStopped);
    }, RemainingOrDefault);
    
    // send a message to the router to restart all children
    router.Tell(new Broadcast(RestartChildren.Instance));

    // await assert collecting data from the telemetry subscriber until we can see that 10 actors have been restarted
    await AwaitAssertAsync(async () =>
    {
        var telemetry = await subscriber.Ask<TelemetrySubscriber.GetTelemetry>(TelemetrySubscriber.GetTelemetryRequest.Instance);
        // assert that actor start count is still 10
        Assert.Equal(12, telemetry.ActorCreated);
        Assert.Equal(10, telemetry.ActorRestarted);
        // assert no stops recorded
        Assert.Equal(0, telemetry.ActorStopped);
    }, RemainingOrDefault);
    
    // GracefulStop router actor and assert that 10 actors have been stopped
    await router.GracefulStop(RemainingOrDefault);
    await AwaitAssertAsync(async () =>
    {
        var telemetry = await subscriber.Ask<TelemetrySubscriber.GetTelemetry>(TelemetrySubscriber.GetTelemetryRequest.Instance);
        // assert that actor start count is still 10
        Assert.Equal(12, telemetry.ActorCreated);
        Assert.Equal(10, telemetry.ActorRestarted);
        Assert.Equal(10, telemetry.ActorStopped);
    }, RemainingOrDefault);
}

Expected behavior

Each child should be restarted individually and independently from each other.

Actual behavior

Assert.Equal(10, telemetry.ActorRestarted); should be true, but instead the telemetry.ActorRestarted value is 110!

That's because the following line:

// send a message to the router to restart all children
  router.Tell(new Broadcast(RestartChildren.Instance));

Causes the router to restart all of its children 10 times apiece.

@Aaronontheweb Aaronontheweb added akka-actor potential bug dependencies Pull requests that update a dependency file discussion and removed dependencies Pull requests that update a dependency file labels Dec 6, 2022
@to11mtm
Copy link
Member

to11mtm commented Dec 7, 2022

Can we add this as an arg to the routers if it is not there already?

TBH never realized this was a thing and explains a lot, a fix would be good but we should be careful to respect existing code that guards condition

@Aaronontheweb
Copy link
Member Author

This can be configured via Router.WithSupervisionStrategy - but most users never specify that.

@Aaronontheweb
Copy link
Member Author

FWIW I still think we should change the default here, even though that's technically a breaking behavioral change - but I expect that the impact of the change will be more positive than not.

@Aaronontheweb Aaronontheweb added this to the 1.4.47 milestone Dec 8, 2022
@Aaronontheweb
Copy link
Member Author

We're just going to reverse this in 1.4.47. I have no doubt it does more harm than good in most cases.

@Aaronontheweb
Copy link
Member Author

Actually, we'll push this out to 1.4.48

@Aaronontheweb Aaronontheweb modified the milestones: 1.4.47, 1.4.48 Dec 9, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.48, 1.4.49 Jan 5, 2023
Aaronontheweb added a commit that referenced this issue Jan 26, 2023
…6366)

* close #6295 - set default PoolRouter SupervisorStrategy to Restart

* use the `SupervisionStrategy.DefaultStrategy`

* Fix unit test

* Change unit test to use RoundRobinPool router

Co-authored-by: Gregorius Soedharmo <arkatufus@yahoo.com>
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.49, 1.4.50 Jan 26, 2023
Arkatufus pushed a commit to Arkatufus/akka.net that referenced this issue Jan 26, 2023
… Restart (akkadotnet#6366)

* close akkadotnet#6295 - set default PoolRouter SupervisorStrategy to Restart
* use the `SupervisionStrategy.DefaultStrategy`
* Fix unit test
* Change unit test to use RoundRobinPool router

(cherry-picked from ffd9a9e)
Aaronontheweb added a commit that referenced this issue Jan 30, 2023
)

* close #6295 - set default PoolRouter SupervisorStrategy to Restart (#6366)

* close #6295 - set default PoolRouter SupervisorStrategy to Restart
* use the `SupervisionStrategy.DefaultStrategy`
* Fix unit test
* Change unit test to use RoundRobinPool router

(cherry-picked from ffd9a9e)

* Fix actor telemetry unit test

---------

Co-authored-by: Aaron Stannard <aaron@petabridge.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants