
akka-cluster-sharding HandOffStopper issue #27647

Open
sebarys opened this issue Sep 6, 2019 · 8 comments

@sebarys commented Sep 6, 2019

Hi,

We have an application using Akka Cluster Sharding that reads messages from a queue and sends each one to interested consumers, which may be placed in different shards; once all consumers have persisted a message, it is acked on the queue. From time to time we get a lot of nacks caused by timeouts to different shards, and to solve the problem we have to restart the whole application. Since some consumers work fine while we see a lot of timeouts to particular shards, we think the sharding mechanism could be the issue here.

I’ve noticed that the log from HandOffStopper, `HandOffStopMessage[akka.actor.PoisonPill$] is not handled by some of the entities ...`, appears very frequently, at constant intervals, for the same shards. I decided to add some additional logging around this processing and build a SNAPSHOT version of the akka library; you can check the changes here: https://github.com/sebarys/akka/pull/1/files - just additional logging.

What I’ve noticed is that the HandOffStopper is created with an empty Set() of entities to terminate, e.g. example logs:

```
$ gunzip -c * | ggrep "Dispatcher/11" | ggrep "HandOffStopper" | jq -s -c 'sort_by(.time)[] | .kubernetes.host + " AA " + .time + " BB " + .log.logger + " CC " + .log.actor + " DD " + .log.message' | head -n 50
"ip-XXX.ec2.internal AA 2019-09-04T08:2:08.277+00:00 BB akka.cluster.sharding.PersistentShard CC akka.tcp://ProfileAkkaClusterSystem@XXX:2551/system/sharding/Dispatcher/11 DD Starting HandOffStopper for shard 11 to terminate following entities: Set(Actor[akka://ProfileAkkaClusterSystem/system/sharding/Dispatcher/11/XXX%2Fa0eb829f-9f66-3f29-a6df-9934e9a5bed8%2FXXX#-1195859775])"
"ip-XXX.ec2.internal AA 2019-09-04T08:2:08.307+00:00 BB akka.cluster.sharding.ShardRegion$HandOffStopper CC akka.tcp://ProfileAkkaClusterSystem@XXX:2551/system/sharding/Dispatcher/11/$b DD HandOffStopper of the `11` shard received Terminated message for the ActorRef Actor[akka://ProfileAkkaClusterSystem/system/sharding/Dispatcher/11/XXX%2Fa0eb829f-9f66-3f29-a6df-9934e9a5bed8%2FXXX#-1195859775], the remaining entities: Set()"
"ip-XXX.ec2.internal AA 2019-09-04T08:2:08.307+00:00 BB akka.cluster.sharding.ShardRegion$HandOffStopper CC akka.tcp://ProfileAkkaClusterSystem@XXX:2551/system/sharding/Dispatcher/11/$b DD HandOffStopper remaining entities are empty, sending ShardStopped message for shard 11"
"ip-XXX.ec2.internal AA 2019-09-04T08:2:54.496+00:00 BB akka.cluster.sharding.PersistentShard CC akka.tcp://ProfileAkkaClusterSystem@XXX:2551/system/sharding/Dispatcher/11 DD Starting HandOffStopper for shard 11 to terminate following entities: Set()"
"ip-XXX.ec2.internal AA 2019-09-04T08:22:49.515+00:00 BB akka.cluster.sharding.ShardRegion$HandOffStopper CC akka.tcp://ProfileAkkaClusterSystem@XXX:2551/system/sharding/Dispatcher/11/$a DD HandOffStopMessage[akka.actor.PoisonPill$] is not handled by some of the entities of the `11` shard after 55000 milliseconds, stopping the remaining entities: Set()"
"ip-XXX.ec2.internal AA 2019-09-04T08:23:44.535+00:00 BB akka.cluster.sharding.ShardRegion$HandOffStopper CC akka.tcp://ProfileAkkaClusterSystem@XXX:2551/system/sharding/Dispatcher/11/$a DD HandOffStopMessage[akka.actor.PoisonPill$] is not handled by some of the entities of the `11` shard after 55000 milliseconds, stopping the remaining entities: Set()"
"ip-XXX.ec2.internal AA 2019-09-04T08:24:39.555+00:00 BB akka.cluster.sharding.ShardRegion$HandOffStopper CC akka.tcp://ProfileAkkaClusterSystem@XXX:2551/system/sharding/Dispatcher/11/$a DD HandOffStopMessage[akka.actor.PoisonPill$] is not handled by some of the entities of the `11` shard after 55000 milliseconds, stopping the remaining entities: Set()"
"ip-XXX.ec2.internal AA 2019-09-04T08:25:34.575+00:00 BB akka.cluster.sharding.ShardRegion$HandOffStopper CC akka.tcp://ProfileAkkaClusterSystem@XXX:2551/system/sharding/Dispatcher/11/$a DD HandOffStopMessage[akka.actor.PoisonPill$] is not handled by some of the entities of the `11` shard after 55000 milliseconds, stopping the remaining entities: Set()"
"ip-XXX.ec2.internal AA 2019-09-04T08:26:29.595+00:00 BB akka.cluster.sharding.ShardRegion$HandOffStopper CC akka.tcp://ProfileAkkaClusterSystem@XXX:2551/system/sharding/Dispatcher/11/$a DD HandOffStopMessage[akka.actor.PoisonPill$] is not handled by some of the entities of the `11` shard after 55000 milliseconds, stopping the remaining entities: Set()"
"ip-XXX.ec2.internal AA 2019-09-04T08:27:24.615+00:00 BB akka.cluster.sharding.ShardRegion$HandOffStopper CC akka.tcp://ProfileAkkaClusterSystem@XXX:2551/system/sharding/Dispatcher/11/$a DD HandOffStopMessage[akka.actor.PoisonPill$] is not handled by some of the entities of the `11` shard after 55000 milliseconds, stopping the remaining entities: Set()"
"ip-XXX.ec2.internal AA 2019-09-04T08:28:19.635+00:00 BB akka.cluster.sharding.ShardRegion$HandOffStopper CC akka.tcp://ProfileAkkaClusterSystem@XXX:2551/system/sharding/Dispatcher/11/$a DD HandOffStopMessage[akka.actor.PoisonPill$] is not handled by some of the entities of the `11` shard after 55000 milliseconds, stopping the remaining entities: Set()"
"ip-XXX.ec2.internal AA 2019-09-04T08:29:14.655+00:00 BB akka.cluster.sharding.ShardRegion$HandOffStopper CC akka.tcp://ProfileAkkaClusterSystem@XXX:2551/system/sharding/Dispatcher/11/$a DD HandOffStopMessage[akka.actor.PoisonPill$] is not handled by some of the entities of the `11` shard after 55000 milliseconds, stopping the remaining entities: Set()"
...
```

If I understood correctly, this blocks sending the ShardStopped(shardId) message back and finishing the rebalancing process.
Could anyone verify my assumption?

The HandOffStopper is created when the `if (state.entities.nonEmpty)` condition is fulfilled, but then `idByRef.keySet` from `akka.cluster.sharding.Shard` is passed as the Set of ActorRefs that should be terminated. I'm not confident what the relation between these two sets is, but since a HandOffStopper with an empty Set will wait forever for the Termination of the provided ActorRefs, it looks like an undesired situation. Couldn't we just create this actor when `idByRef.keySet.nonEmpty`?
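For reference, the pattern in question looks roughly like this (a simplified sketch of Shard.handOff from my reading of the sources, not a verbatim copy):

```scala
// Simplified sketch of akka.cluster.sharding.Shard.handOff (not verbatim).
// Note the mismatch: the guard checks state.entities, but the HandOffStopper
// receives idByRef.keySet.
def handOff(replyTo: ActorRef): Unit =
  if (state.entities.nonEmpty) {
    // If an entity has already terminated but the write recording that stop is
    // still in flight, state.entities still contains its id while idByRef does
    // not: the stopper then starts with Set(), watches no actors, never sees a
    // Terminated message, and never replies with ShardStopped(shardId).
    handOffStopper = Some(context.watch(context.actorOf(
      handOffStopperProps(shardId, replyTo, idByRef.keySet, handOffStopMessage))))
  } else {
    replyTo ! ShardStopped(shardId)
    context.stop(self)
  }
```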

@johanandren (Member) commented Sep 10, 2019

idByRef is supposed to be kept in sync with state; I think there is a bug lurking here.

@johanandren (Member) commented Sep 10, 2019

I think the explanation could be that one (or more) entities terminated, the shard saw the termination and triggered a write recording that the entity stopped, but that write hadn't completed yet when, in the meantime, a handoff was requested.

That should make it safe to only look at idByRef to determine which actual actors need stopping.

@johanandren (Member) commented Sep 10, 2019

Looking a bit more, both the persistent and the ddata shard stash incoming messages while updating the state, so state.entities shouldn't ever be out of sync when receiving the Terminated message (which is where the actor is removed from idByRef).

Any other hints about what your app was up to when this happens, e.g. changing topology? Is remember-entities enabled? Are you using passivate idle?

@sebarys (Author) commented Sep 10, 2019

Hi, thanks for looking into it!

Remember entities is enabled; the relevant snippet from our configuration:

```hocon
akka.cluster {
  sharding {
    state-store-mode = persistence
    remember-entities = on
    entity-recovery-strategy = "all"
  }
}

akka.persistence {
  snapshot-store.plugin = "cassandra-snapshot-store"
  journal.plugin = "cassandra-journal"
}
```

We're using the passivation feature for part of our ClusterSharding actors (4 different ClusterSharding components; one uses passivation).

I don't remember the details of this particular situation, but we observe such problems mainly around redeployment, when recreating some nodes of the akka cluster.

@johanandren removed their assignment Sep 16, 2019

@helena (Member) commented Sep 16, 2019

Hi @sebarys,

> Remember entities is enabled
> We're using the passivation feature for part of our ClusterSharding actors (4 different ClusterSharding components; one uses passivation).

Are you enabling remember-entities on all of them?
And are you seeing this behavior on the "one using passivation"? Enabling remember-entities disables passivation, so I'm wondering if this is related.
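For context, by passivation I mean the usual pattern where the entity asks its parent Shard to stop it (a minimal sketch; the Consumer class, the idle-timeout trigger, and PoisonPill as the stop message are just illustrative choices):

```scala
import akka.actor.{ Actor, PoisonPill }
import akka.cluster.sharding.ShardRegion

class Consumer extends Actor {
  def receive: Receive = {
    case "idle-timeout" =>
      // Ask the parent Shard to passivate this entity: the Shard buffers new
      // messages for it and delivers the stop message when it is safe to do so.
      context.parent ! ShardRegion.Passivate(stopMessage = PoisonPill)
    case msg =>
      // handle domain messages ...
  }
}
```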

> we observe such problems mainly around redeployment, when recreating some nodes of the akka cluster.

Given when you primarily see this, and the rebalancing involved, I might first try increasing akka.cluster.sharding.handoff-timeout (defaults to 60s) to allow entities more time to process the stopMessage in the mailbox (they may be busy) before the timeout causes them to be stopped forcefully. You may see better behavior during rolling restarts.
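Something like this (the 120s value is just an illustration; tune it to your workload):

```hocon
akka.cluster.sharding {
  # default is 60s
  handoff-timeout = 120s
}
```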

> From time to time we get a lot of nacks caused by timeouts to different shards

I would not be surprised by that during redeployment, and I suspect that with some tuning it could be mitigated as a first step.

@sebarys (Author) commented Sep 17, 2019

Hi @helena,
We're redeploying the cluster using a rolling update strategy. It is totally fine to see such issues during redeployment, but the problem is that it is not resolved after a longer time, and the only way to fix it is to recreate/restart nodes.

I see the described problem in both shards that use passivation and shards with passivation disabled.

I can increase the timeout, but from the logs it is clear that the HandOffStopper is created with an empty Set of ActorRefs. In that case, IMO, increasing the timeout will not help, as it will wait forever for Terminated messages from non-existent actors (maybe an additional guard condition ensuring that the HandOffStopper's set of entities is not empty would help, as sketched below).
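Roughly what I mean by that guard (a rough sketch against the HandOffStopper shape visible in the logs above; shard, replyTo, and remaining stand for the stopper's own fields):

```scala
// Rough sketch: if the set of entities to stop is already empty when the
// HandOffStopper starts, reply immediately instead of waiting forever for
// Terminated messages that will never arrive.
if (remaining.isEmpty) {
  replyTo ! ShardStopped(shard)
  context.stop(self)
}
```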

To be clear: I'm not 100% sure that my problems with sharding are entirely related to this weird behaviour. From time to time we see issues with sharding that can only be solved by restarting nodes, and this is the only trace I can find in the logs that could be related to this problem.

@patriknw (Member) commented Sep 17, 2019

Thanks for reporting, @sebarys. This is a bug.

It is very difficult to follow the state changes in those actors, as we have acknowledged previously, and we will look into that when we improve remembering entities. After a quick look, it seems very wrong to use state.entities.nonEmpty in the if and then use idByRef. I think there are cases where they are out of sync (probably for good reasons, or at least some reason).

I think idByRef is the right thing to use here, because entries are only removed from it when the child is terminated.

@sebarys Since you have a branch that you can try, could you try changing it to

```scala
if (idByRef.nonEmpty)
```

I'll do the same and run our tests.
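In context, the change lands in the handOff sketch above (a sketch, pending verification):

```diff
 // akka.cluster.sharding.Shard.handOff
-if (state.entities.nonEmpty) {
+if (idByRef.nonEmpty) {
```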

@patriknw (Member) commented Sep 17, 2019

All cluster sharding tests are green on my side with that change.
