Shard reallocation problems #27259

Closed
jroper opened this issue Jul 4, 2019 · 13 comments · Fixed by #27261
Labels
1 - triaged, t:cluster-sharding
Milestone
2.6.0-M6

Comments

@jroper
Contributor

jroper commented Jul 4, 2019

I see this issue a lot when scaling a cluster down. I think the problem happens when more than one node leaves the cluster at a time. I'll describe the symptom first; here are the logs:

02:32:41.620 DEBUG c.l.s.autoscaler.Autoscaler - Scaling down to 2 because desired nodes for user function 1 and desired nodes for request handling 1 is below cluster members 4
02:32:44.901 INFO  a.c.Cluster(akka://statefulserverless-backend) - Cluster Node [akka.tcp://statefulserverless-backend@10.52.14.20:2552] - Exiting confirmed [akka.tcp://statefulserverless-backend@10.52.13.17:2552]
02:32:45.010 INFO  akka.remote.EndpointWriter - AssociationError [akka.tcp://statefulserverless-backend@10.52.14.20:2552] <- [akka.tcp://statefulserverless-backend@10.52.13.17:2552]: Error [Shut down address: akka.tcp://statefulserverless-backend@10.52.13.17:2552] [
akka.remote.ShutDownAssociation: Shut down address: akka.tcp://statefulserverless-backend@10.52.13.17:2552
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
]
02:32:46.254 INFO  a.c.s.ClusterSingletonManager - Member removed [akka.tcp://statefulserverless-backend@10.52.13.17:2552]
02:32:46.264 INFO  a.c.s.ClusterSingletonManager - Member removed [akka.tcp://statefulserverless-backend@10.52.13.17:2552]
02:32:52.942 WARN  akka.cluster.sharding.ShardRegion - shopping-cart: Retry request for shard [94] homes from coordinator at [Actor[akka://statefulserverless-backend/system/sharding/shopping-cartCoordinator/singleton/coordinator#1273361376]]. [17] buffered messages.
[snip - above message is repeated 18 times per second, differing only in the shard id and buffered messages]
02:32:53.137 INFO  a.c.Cluster(akka://statefulserverless-backend) - Cluster Node [akka.tcp://statefulserverless-backend@10.52.14.20:2552] - Exiting confirmed [akka.tcp://statefulserverless-backend@10.52.12.18:2552]
02:32:53.225 INFO  akka.remote.EndpointWriter - AssociationError [akka.tcp://statefulserverless-backend@10.52.14.20:2552] <- [akka.tcp://statefulserverless-backend@10.52.12.18:2552]: Error [Shut down address: akka.tcp://statefulserverless-backend@10.52.12.18:2552] [
akka.remote.ShutDownAssociation: Shut down address: akka.tcp://statefulserverless-backend@10.52.12.18:2552
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
]
02:32:53.243 INFO  a.c.Cluster(akka://statefulserverless-backend) - Cluster Node [akka.tcp://statefulserverless-backend@10.52.14.20:2552] - is the new leader among reachable nodes (more leaders may exist)
02:32:54.242 INFO  a.c.Cluster(akka://statefulserverless-backend) - Cluster Node [akka.tcp://statefulserverless-backend@10.52.14.20:2552] - Leader is removing confirmed Exiting node [akka.tcp://statefulserverless-backend@10.52.12.18:2552]
02:32:54.250 INFO  a.c.s.ClusterSingletonManager - Member removed [akka.tcp://statefulserverless-backend@10.52.12.18:2552]
02:32:54.255 INFO  a.c.s.ClusterSingletonManager - Member removed [akka.tcp://statefulserverless-backend@10.52.12.18:2552]
02:32:54.942 WARN  akka.cluster.sharding.ShardRegion - shopping-cart: Retry request for shard [94] homes from coordinator at [Actor[akka://statefulserverless-backend/system/sharding/shopping-cartCoordinator/singleton/coordinator#1273361376]]. [17] buffered messages.
[snip - above message is repeated 18 times per second, differing only in the shard id and buffered messages]

The 18 retry request warnings are then repeated every second for about 50 seconds; the last one appears 50 seconds after the first message was output:

02:33:42.932 WARN  akka.cluster.sharding.ShardRegion - shopping-cart: Retry request for shard [84] homes from coordinator at [Actor[akka://statefulserverless-backend/system/sharding/shopping-cartCoordinator/singleton/coordinator#1273361376]]. [35] buffered messages.

Then everything goes back to normal and the shards start working again, and 16 seconds after that:

02:33:58.432 WARN  a.r.transport.netty.NettyTransport - Remote connection to [null] failed with org.jboss.netty.channel.ConnectTimeoutException: connection timed out: /10.52.12.18:2552
02:33:58.439 WARN  a.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://statefulserverless-backend@10.52.12.18:2552] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://statefulserverless-backend@10.52.12.18:2552]] Caused by: [org.jboss.netty.channel.ConnectTimeoutException: connection timed out: /10.52.12.18:2552]

So, in the above, 10.52.13.17 shut down first, followed by 10.52.12.18, but it appears that the node still attempts to communicate with 10.52.12.18 after it's left the cluster, hence the last error message?

I suspect what is happening is that during shard allocation after the first node leaves the cluster, shards can be allocated to the second node that is also concurrently leaving, and there is a race condition: if that node shuts down while the shards are being allocated to it, the shard coordinator does not attempt to fix that and reallocate them for a minute.

@jroper
Contributor Author

jroper commented Jul 4, 2019

I guess hand-off is timing out in the RebalanceWorker. We can see the scaling decision was made at 2:32:41, and it would have taken a second or two for the request made on the Kubernetes API to result in the node actually shutting down. So at around 2:32:43 the shard coordinator would have received the GracefulShutdownReq and started the RebalanceWorker, which defaults to a 60 second rebalance hand-off timeout, so that would have timed out at 2:33:43. That's why the last retry request message is seen at 2:33:42. I don't see any other 60 second timeouts that might cause this behaviour.
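For reference, if I'm reading the reference config right, that 60 second window comes from `akka.cluster.sharding.handoff-timeout`, which could be shortened while experimenting. A minimal sketch (the 20 s value is just an example, not a recommendation):

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Override the rebalance hand-off timeout (60 s by default) so a stuck
// hand-off is abandoned sooner while experimenting. Value is illustrative.
val config = ConfigFactory
  .parseString("akka.cluster.sharding.handoff-timeout = 20 s")
  .withFallback(ConfigFactory.load())

val system = ActorSystem("statefulserverless-backend", config)
```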

So, why is it timing out? I don't know exactly, but we can see that at 2:32:44 the node leaves the cluster, and communication with it shuts down shortly after that. At some point the shard coordinator receives a Terminated for that region; to handle this it updates its state and removes the region from its list of gracefully terminating regions, but it doesn't remove the shards that were currently being rebalanced from the rebalanceInProgress set. This might be a bug: if a node that previously indicated it is gracefully terminating goes away, isn't it implied that hand-off finished? Shouldn't the shards being rebalanced from it be removed?
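To make the suspected gap concrete, here's a heavily simplified sketch of the coordinator bookkeeping as I understand it; the names only approximate the internal state and this is not the actual ShardCoordinator code:

```scala
import akka.actor.ActorRef

// Heavily simplified model of the coordinator state described above; the
// field names approximate ShardCoordinator's internals, not the real code.
final case class CoordinatorState(
    regions: Map[ActorRef, Set[String]],        // region -> shards allocated to it
    gracefulShutdownInProgress: Set[ActorRef],  // regions that requested graceful shutdown
    rebalanceInProgress: Set[String])           // shard ids currently being handed off

// What appears to happen today when a region terminates: the region and its
// graceful-shutdown entry are dropped, but any of its shards that were
// mid-rebalance stay in rebalanceInProgress until the hand-off timeout fires.
def regionTerminated(region: ActorRef, state: CoordinatorState): CoordinatorState =
  state.copy(
    regions = state.regions - region,
    gracefulShutdownInProgress = state.gracefulShutdownInProgress - region)

// What the question above suggests instead: a terminated region implies its
// hand-off is done, so its shards could also be cleared from rebalanceInProgress.
def regionTerminatedClearingRebalance(region: ActorRef, state: CoordinatorState): CoordinatorState = {
  val shardsOfRegion = state.regions.getOrElse(region, Set.empty[String])
  regionTerminated(region, state)
    .copy(rebalanceInProgress = state.rebalanceInProgress -- shardsOfRegion)
}
```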

As for the reason hand-off is timing out... I'll take a guess. There's no ack sent from the RebalanceWorker when ShardStopped is sent to it. The Shard actor sends the ShardStopped message to the RebalanceWorker and then just stops itself, then the region stops, and the coordinated shutdown promise is redeemed. At that point, I guess, remoting shuts down and the actor system terminates. But did the ShardStopped message get through? If it was buffered, I guess not. Does remoting drain all buffers when it's told to shut down? And I'm seeing these messages during load tests, where I guess there's a high chance of there being a buffer of unsent messages.

So one potential solution to this might be for the RebalanceWorker to ack the ShardStopped, so that the Shard actor can know the message got through before it shuts down. That's assuming that's the bug. What I will do is insert a delay into coordinated shutdown that tries to ensure any messages that are meant to be sent through remoting are drained before remoting is shut down; if the problem goes away, then I guess that's the bug.

@jroper
Contributor Author

jroper commented Jul 4, 2019

Just looked into it a bit deeper. Cluster sharding is shut down before the node leaves the cluster, and leaving the cluster requires sending a message to all the nodes in the cluster. That message won't be sent until after the ShardStopped messages are sent, and since that message did get through, the ShardStopped messages must have got through too. So that's not the bug.

@jroper
Contributor Author

jroper commented Jul 4, 2019

I'll see if I can capture more info by enabling debug for cluster sharding.
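(Turning that on is just the standard Akka logging setting, nothing specific to sharding; something like:)

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Raise the Akka log level so the cluster sharding DEBUG messages are emitted.
// The logging backend (e.g. logback) must also allow DEBUG for the
// akka.cluster.sharding loggers.
val debugConfig = ConfigFactory
  .parseString("akka.loglevel = DEBUG")
  .withFallback(ConfigFactory.load())

val system = ActorSystem("statefulserverless-backend", debugConfig)
```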

@jroper
Contributor Author

jroper commented Jul 4, 2019

@jroper
Contributor Author

jroper commented Jul 4, 2019

Actually, I'm not sure that gist captured it. I'll try again.

@jroper
Contributor Author

jroper commented Jul 4, 2019

Ok, definitely captured now, and I've updated the above gist. It's very noisy, even though I filtered out all the Forwarding debug messages.

@jroper
Contributor Author

jroper commented Jul 4, 2019

Update: from my analysis of the logs, the first member requests graceful shutdown from cluster sharding, successfully hands off all its shards, and then leaves the cluster, all within a couple of seconds. The second requests graceful shutdown from cluster sharding, and begin hand-off is logged, but no shards are successfully deallocated; then, 10 seconds later, it leaves the cluster. That 10 seconds, I guess, is the cluster sharding coordinated shutdown phase timeout. Then, 50 seconds after that, hand-off times out.
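If that guess is right, the 10 seconds would be the `cluster-sharding-shutdown-region` coordinated shutdown phase timeout (10 s by default, as far as I can tell), which could be stretched while investigating. A sketch, with an illustrative value:

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Lengthen the coordinated-shutdown phase that waits for shard regions to
// finish their graceful hand-off, just to see whether the hand-off would
// eventually complete. The 30 s value is only an example.
val shutdownConfig = ConfigFactory
  .parseString(
    "akka.coordinated-shutdown.phases.cluster-sharding-shutdown-region.timeout = 30 s")
  .withFallback(ConfigFactory.load())

val system = ActorSystem("statefulserverless-backend", shutdownConfig)
```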

So, either the hand off messages aren't actually being sent, or there's a problem on the terminating node that causes it to fail to terminate every single shard it has? That doesn't sound right. I'll need to try and capture the logs on the terminating nodes too.

@patriknw
Member

patriknw commented Jul 4, 2019

Thanks for reporting and investigating. Good that you captured detailed logs that we can look into.

I guess that in those scaling scenarios we have no control over which nodes are shut down, or in which order? For the smoothest possible leaving, it's always best that the coordinator (the singleton on the oldest node) is kept alive.

@jroper
Contributor Author

jroper commented Jul 4, 2019

I've found one race condition; I'm not sure if it's causing the problem here. The rebalance worker won't tell the node to hand off its shards until all shard regions have acked the BeginHandOff message, and this includes regions that are themselves gracefully shutting down. So if the ShardRegion actor of the first node shutting down finishes its own hand-off and is stopped before the BeginHandOff message reaches it, it's never going to ack, and that will prevent the hand-off of the second region from starting.

Is this causing the problem? In the logs I see 715 ms between when the RebalanceWorker starts and when the coordinator logs that the region has terminated. There is no configured removal margin, so that log message is not delayed by that. It feels like maybe that's a little too long? Can there be a delay that long between an actor terminating and a remote watcher receiving the termination signal?

To fix the problem, I guess the rebalance worker needs to watch the shard regions itself. If that's expensive, it can just watch the regions that are currently gracefully shutting down.
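A rough sketch of the shape such a fix could take (not the actual implementation; the message types here are simplified stand-ins for the internal coordinator protocol):

```scala
import akka.actor.{ Actor, ActorRef, Terminated }

// Simplified stand-ins for the internal coordinator protocol messages
// (the real ones live inside ShardCoordinator).
final case class BeginHandOff(shard: String)
final case class BeginHandOffAck(shard: String)
final case class HandOff(shard: String)

// Sketch of a rebalance worker that watches the regions it expects a
// BeginHandOffAck from, so a region that terminates before acking (e.g. a
// second node shutting down at the same time) no longer blocks hand-off.
class RebalanceWorkerSketch(shard: String, shardOwner: ActorRef, regions: Set[ActorRef])
    extends Actor {

  regions.foreach(context.watch)
  regions.foreach(_ ! BeginHandOff(shard))
  private var waitingFor: Set[ActorRef] = regions

  def receive: Receive = {
    case BeginHandOffAck(`shard`) => acked(sender())
    case Terminated(region)       => acked(region) // a dead region can never ack
  }

  private def acked(region: ActorRef): Unit = {
    waitingFor -= region
    if (waitingFor.isEmpty) {
      // All regions have acked (or terminated): tell the owning region to hand off.
      // The real worker would then wait for ShardStopped or the hand-off timeout.
      shardOwner ! HandOff(shard)
      context.stop(self)
    }
  }
}
```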

@jroper
Contributor Author

jroper commented Jul 4, 2019

Ok, I managed to find the logs (it turns out I've got GKE Stackdriver logging enabled, which means I can see all the logs for nodes that have shut down), and I can confirm that only 3 out of 4 nodes received the BeginHandOff messages, and the one that didn't receive it was the first node to shut down. The second shutting-down node then didn't receive a HandOff message.

jroper added a commit to jroper/akka that referenced this issue Jul 4, 2019
Fixes akka#27259.

The RebalanceWorker actor needs to watch the shard regions that it's
expecting a BeginHandOffAck message from, in case the ShardRegion shuts
down before it can receive the BeginHandOff message, preventing hand
off.  This can be a problem when two nodes are shut down at about the
same time.
@jroper
Contributor Author

jroper commented Jul 4, 2019

PR: #27261

@jroper
Contributor Author

jroper commented Jul 4, 2019

@patriknw I only just noticed your comment above. It's using k8s deployments, so nodes are shut down newest first, keeping the shard coordinator alive.

jroper added a commit to jroper/akka that referenced this issue Jul 5, 2019
jroper added a commit to jroper/akka that referenced this issue Jul 8, 2019
jroper added a commit to jroper/akka that referenced this issue Jul 8, 2019
jroper added a commit to jroper/akka that referenced this issue Jul 8, 2019
jroper added a commit to jroper/akka that referenced this issue Jul 18, 2019
@chbatey added the 0 - new and discuss labels Jul 22, 2019
patriknw pushed a commit to jroper/akka that referenced this issue Aug 15, 2019
patriknw pushed a commit that referenced this issue Aug 15, 2019
* RebalanceWorker should watch shard regions
@patriknw added this to the 2.6.0-M6 milestone Aug 15, 2019
@patriknw added the 1 - triaged and t:cluster-sharding labels and removed the 0 - new and discuss labels Aug 15, 2019
johanandren pushed a commit to johanandren/akka that referenced this issue Aug 28, 2019
* RebalanceWorker should watch shard regions
@gmarti

gmarti commented Sep 17, 2019

I stumbled upon this issue while playing with Akka cluster sharding in 2.5.x.
I've looked at the #27261 fix, but it seems that we cannot send the HandOff message to the shard if it's terminated. Right?
I opened PR #27736 to handle that case.
