Cluster sharding entities are not being restarted when coordinator is moved to another node #21892

Closed
Valentin-vak opened this Issue Nov 24, 2016 · 11 comments

@Valentin-vak
Valentin-vak commented Nov 24, 2016 edited

Description

Started 2 nodes, manually downed the oldest one. Coordinator has been moved to the new node, recovery completed successfully, but the entities were not restarted.
This is broken only for akka versions >= 2.4.12. It works fine for 2.4.11.
The issue is most likely due to this commit

Versions

akka >=2.4.12
scala 2.11.8

Config

cluster {
  sharding {
    remember-entities = on
  }
}

persistence {
  journal {
    plugin = "akka.persistence.journal.leveldb-shared"
    leveldb-shared.store.dir = "target/shared-journal"
  }
  snapshot-store {
    plugin = "akka.persistence.snapshot-store.local"
    local.dir = "target/snapshots"
  }
}

Log

...Shard coordinator recovery completes successfully....
[2016-11-23 14:28:11,947] [debug] [***-akka.actor.default-dispatcher-16] a.c.s.PersistentShardCoordinator - ShardRegion terminated: [Actor[akka.tcp://***@127.0.0.1:30001/system/sharding/***-api#-803326708]]
...
[2016-11-23 14:28:12,468] [debug] [***-akka.actor.default-dispatcher-3] a.c.s.PersistentShardCoordinator - GetShardHome [8] request ignored, because not all regions have registered yet.
[2016-11-23 14:28:12,468] [debug] [***-akka.actor.default-dispatcher-3] a.c.s.PersistentShardCoordinator - GetShardHome [11] request ignored, because not all regions have registered yet.
[2016-11-23 14:28:12,468] [debug] [***-akka.actor.default-dispatcher-3] a.c.s.PersistentShardCoordinator - GetShardHome [9] request ignored, because not all regions have registered yet.
[2016-11-23 14:28:12,468] [debug] [***-akka.actor.default-dispatcher-3] a.c.s.PersistentShardCoordinator - GetShardHome [10] request ignored, because not all regions have registered yet.
[2016-11-23 14:28:12,469] [debug] [***-akka.actor.default-dispatcher-3] a.c.s.PersistentShardCoordinator - GetShardHome [1] request ignored, because not all regions have registered yet.
[2016-11-23 14:28:14,361] [debug] [***-akka.actor.default-dispatcher-23] a.c.s.PersistentShardCoordinator - ShardRegion registered: [Actor[akka://***/system/sharding/***-api#426635209]]
@patriknw
Member

@Valentin-vak Thanks for reporting. You are right that "request ignored, because not all regions have registered yet" comes from that commit. I have not tried to reproduce yet, but when looking at the code I'm surprised that it's not started when the registration arrives, https://github.com/akka/akka/blob/master/akka-cluster-sharding/src/main/scala/akka/cluster/sharding/ShardCoordinator.scala#L449

What setting do you use for min-members? If you use 2, have two nodes, and shut down one, then it will not start until you have joined a new node, i.e. it still requires 2 nodes.

@Valentin-vak

I have not set min-members, so I suppose it's = 1

@patriknw
Member

right, I'll take a closer look at this next week.

No other interesting logging after "ShardRegion registered"?

@Valentin-vak

No. After ShardRegion registered it proceeds normally.
Thanks!

@patriknw
Member

ok, but still not started after that? I expected that it would start after the registration.

@Valentin-vak

No. Just tried to reproduce it locally. Previously we had only 1 shard region, and it was working as I described. Now we added another region and after recovery I can see the following log:

[2016-11-25 19:17:02,676] [debug] [***-akka.actor.default-dispatcher-23] akka.cluster.sharding.ShardRegion - Retry request for shard [8] homes from coordinator at [Actor[akka://***/system/sharding/store-apiCoordinator/singleton/coordinator#754049355]]. [2] buffered messages.
[2016-11-25 19:17:02,676] [debug] [***-akka.actor.default-dispatcher-23] akka.cluster.sharding.ShardRegion - Retry request for shard [11] homes from coordinator at [Actor[akka://***/system/sharding/store-apiCoordinator/singleton/coordinator#754049355]]. [2] buffered messages.
[2016-11-25 19:17:02,676] [debug] [***-akka.actor.default-dispatcher-23] akka.cluster.sharding.ShardRegion - Retry request for shard [9] homes from coordinator at [Actor[akka://***/system/sharding/store-apiCoordinator/singleton/coordinator#754049355]]. [2] buffered messages.
[2016-11-25 19:17:02,676] [debug] [***-akka.actor.default-dispatcher-23] akka.cluster.sharding.ShardRegion - Retry request for shard [10] homes from coordinator at [Actor[akka://***/system/sharding/store-apiCoordinator/singleton/coordinator#754049355]]. [2] buffered messages.
[2016-11-25 19:17:02,676] [debug] [***-akka.actor.default-dispatcher-23] akka.cluster.sharding.ShardRegion - Retry request for shard [1] homes from coordinator at [Actor[akka://***/system/sharding/store-apiCoordinator/singleton/coordinator#754049355]]. [2] buffered messages.

If I remove the second shard region, nothing happens after "ShardRegion registered"

@patriknw
Member

@Valentin-vak I have been able to reproduce the issue. I will continue investigating how to solve it.

@patriknw self-assigned this Nov 28, 2016
@Valentin-vak

@patriknw great, thanks!

@patriknw added a commit that referenced this issue Nov 28, 2016
fix regression in remember entities, #21892 (d005374)
* regression was introduced by 141318e in 2.4.12

@patriknw added this to the 2.4.15 milestone Nov 28, 2016
@patriknw added a commit that referenced this issue Dec 14, 2016
fix regression in remember entities, #21892 (d41374b)
* regression was introduced by 141318e in 2.4.12
(cherry picked from commit d005374)

@patriknw closed this Dec 14, 2016
@lukas-phaf

I also ran into this bug in 2.4.14, and used 2.4.11 to avoid it. Today I tested this in 2.4.16.

The entities are now indeed restarted when the coordinator recovers on another node, but I did find a case where this is not working. I am using a cluster of 4 nodes, with akka.cluster.min-nr-of-members = 3. If I crash the oldest node, the shard coordinator is recovered on the second node, and the entities are restarted on the surviving nodes. If I now crash the node with the new shard coordinator, the shard coordinator is indeed recovered on one of the two remaining nodes, but the entities are not restarted.

If I repeat the same experiment with akka.cluster.min-nr-of-members not set, the entities are restarted after the second coordinator recovery.

So my conclusion is that entities are not restarted on coordinator recovery if the remaining number of nodes is below akka.cluster.min-nr-of-members.

I do not see the same problem in 2.4.11.
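
For reference, a minimal sketch of the configuration for the failing case described above. Only min-nr-of-members = 3 is stated in this comment; the remember-entities setting is assumed to match the original report, and the nesting under akka is assumed to be the usual application.conf layout.

```
akka {
  cluster {
    # Value stated in the comment above; the Akka default is 1.
    min-nr-of-members = 3

    sharding {
      # Assumption: enabled as in the original report, not stated in this comment.
      remember-entities = on
    }
  }
}
```

With 4 nodes and this setting, the first coordinator failover leaves 3 nodes (at the threshold) and the second leaves 2 (below it), which is the case where the entities were not restarted.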

@patriknw
Member

@lukas-phaf does that mean that you only had 2 nodes left? Please create a new issue and we'll look into it. Thanks for testing and reporting.

@lukas-phaf

Yes, after the second failover of the coordinator there were only 2 nodes left. I have created a new issue #22064.
