akka-cluster-sharding HandOffStopper issue #27647
We have an application that uses Akka Cluster Sharding: it reads a message from a queue and sends it to interested consumers that may live in different shards; once all consumers have persisted the message, it is acked on the queue. From time to time we get a lot of nacks caused by timeouts to different shards, and to resolve the problem we need to restart the whole application. Since some consumers keep working fine while we see many timeouts to particular shards, we suspect the sharding mechanism is the issue here.
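For context, a minimal sketch of the setup described above using the classic Cluster Sharding API; the names (`Envelope`, `ConsumerActor`) and the shard count are illustrative assumptions, not taken from the actual application:

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.cluster.sharding.{ClusterSharding, ClusterShardingSettings, ShardRegion}

// Hypothetical message envelope: consumerId decides which entity (and shard) gets the message.
final case class Envelope(consumerId: String, payload: String)

// Hypothetical consumer entity: persists the payload and replies so the queue reader can ack.
class ConsumerActor extends Actor {
  def receive: Receive = {
    case Envelope(_, payload) =>
      // ... persist `payload` here ...
      sender() ! "persisted"
  }
}

object ConsumerSharding {
  private val numberOfShards = 100

  val extractEntityId: ShardRegion.ExtractEntityId = {
    case msg @ Envelope(id, _) => (id, msg)
  }

  val extractShardId: ShardRegion.ExtractShardId = {
    case Envelope(id, _) => (math.abs(id.hashCode) % numberOfShards).toString
  }

  // Starts the shard region; the queue reader sends every message to the returned ActorRef
  // and acks on the queue only after all interested consumers have confirmed persistence.
  def start(system: ActorSystem) =
    ClusterSharding(system).start(
      typeName = "Consumer",
      entityProps = Props(new ConsumerActor),
      settings = ClusterShardingSettings(system),
      extractEntityId = extractEntityId,
      extractShardId = extractShardId)
}
```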
I’ve noticed that logs from
What I’ve noticed is that
If I understood correctly, it is blocking sending
I think the explanation could be that one or more entities terminated, the shard saw that and triggered writing that the entity stopped, but that write hasn't completed yet when a handoff is requested in the meantime.
That should make it safe to only look at
Looking a bit more, both the persistent and the ddata shard stash incoming messages while updating the state, so state.entities shouldn't ever be out of sync when receiving the Terminated message (which is where the actor is removed from idByRef).
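To make the ordering concrete, here is a toy sketch of the suspected race; the names are hypothetical and this is deliberately not the real Shard implementation, only an illustration of the sequence described above:

```scala
import akka.actor.{Actor, ActorRef, Terminated}

// Toy model of the suspected ordering, NOT the real akka.cluster.sharding.Shard code.
// Assumptions: `entities`/`idByRef` stand in for the shard's remembered-entities state,
// and the asynchronous "entity stopped" write is modelled by a StateWriteAck sent back later.
final case class StartEntity(id: String, ref: ActorRef)
final case class EntityStopped(id: String)
final case class StateWriteAck(id: String)
case object HandOff

class ToyShard(store: ActorRef) extends Actor {
  private var entities = Map.empty[String, ActorRef]
  private var idByRef = Map.empty[ActorRef, String]

  def receive: Receive = {
    case StartEntity(id, ref) =>
      context.watch(ref)
      entities += id -> ref
      idByRef += ref -> id

    case Terminated(ref) =>
      // (1) the entity stopped; kick off an asynchronous state write...
      idByRef.get(ref).foreach(id => store ! EntityStopped(id))

    case StateWriteAck(id) =>
      // (3) ...and remove it from the maps only once the write is acknowledged.
      entities.get(id).foreach(ref => idByRef -= ref)
      entities -= id

    case HandOff =>
      // (2) a hand-off arriving between (1) and (3) still sees the already-terminated
      //     entity in `entities`, i.e. it operates on state that is not yet updated.
      entities.values.foreach(_ ! "stop")
  }
}
```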
Any other hints about what your app was up to when this happens, e.g. changing topology? Is remember entities enabled? Are you using passivate idle?
Hi, thanks for looking into it!
Remember entities is enabled; relevant snippet from the configuration:
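An illustrative example of how remember entities is typically enabled (example values, not the exact application configuration):

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object ShardingConfigExample {
  // Example only; these values are illustrative, not the actual application configuration.
  val config = ConfigFactory.parseString(
    """
      |akka.cluster.sharding {
      |  remember-entities = on
      |  # state-store-mode is either "persistence" or "ddata"
      |  state-store-mode = persistence
      |}
    """.stripMargin).withFallback(ConfigFactory.load())

  def createSystem(): ActorSystem = ActorSystem("app", config)
}
```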
We're using the passivation feature for part of our
I don't remember the details of this particular situation, but we observe such problems mainly around redeployments, when recreating some nodes of the Akka cluster.
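For reference, explicit passivation in a classic sharded entity looks roughly like this; the actor name and the idle timeout below are illustrative assumptions, not the actual application code:

```scala
import scala.concurrent.duration._
import akka.actor.{Actor, PoisonPill, ReceiveTimeout}
import akka.cluster.sharding.ShardRegion.Passivate

// Illustrative entity: after an idle period it asks the parent Shard to passivate it,
// so the stop is coordinated by the shard and buffered messages are not lost.
class PassivatingConsumer extends Actor {
  context.setReceiveTimeout(2.minutes)

  def receive: Receive = {
    case ReceiveTimeout =>
      context.parent ! Passivate(stopMessage = PoisonPill)
    case _ =>
      // ... handle the consumer's normal messages here ...
  }
}
```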
Are you enabling
Given when you primarily see this, around rebalancing, I might first try to increase
I would not be surprised by that happening during redeployment, and I suspect that with some tuning it could be mitigated as a first step.
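Assuming the timeout in question is the sharding hand-off timeout (that is not certain here, and the value below is only an example), such tuning could look like:

```scala
import com.typesafe.config.ConfigFactory

object ShardingTuningExample {
  // Assumption: the hand-off timeout is the setting meant above; 120s is an example value.
  val tuned = ConfigFactory.parseString(
    """
      |akka.cluster.sharding {
      |  handoff-timeout = 120s
      |}
    """.stripMargin).withFallback(ConfigFactory.load())
}
```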
I see the described problem in both cases: shards that use passivation and shards with passivation disabled.
I can increase the timeout, but from the logs it is clear that
To be clear, I'm not 100% sure that my problems with sharding are entirely related to this weird behaviour. From time to time we see issues with sharding that can only be solved by restarting nodes, and this is the only trace I can find in the logs that could be related to the problem.
Thanks for reporting @sebarys. This is a bug.
It is very difficult to follow the state changes in those actors, as we have acknowledged previously, and we will look into that when we improve remember entities. After a quick look it seems very wrong to use
@sebarys Since you have a branch that you can try, could you try changing to
I'll do the same and run our tests.