Shard reallocation problems #27259

Closed
jroper opened this issue Jul 4, 2019 · 13 comments · Fixed by #27261
Labels
1 - triaged, t:cluster-sharding
Milestone
2.6.0-M6

Comments

@jroper
Contributor

jroper commented Jul 4, 2019

I see this issue a lot when scaling a cluster down. I think the problem happens when more than one node leaves the cluster at a time. I'll describe the symptom first; here are the logs:

02:32:41.620 DEBUG c.l.s.autoscaler.Autoscaler - Scaling down to 2 because desired nodes for user function 1 and desired nodes for request handling 1 is below cluster members 4
02:32:44.901 INFO  a.c.Cluster(akka://statefulserverless-backend) - Cluster Node [akka.tcp://statefulserverless-backend@10.52.14.20:2552] - Exiting confirmed [akka.tcp://statefulserverless-backend@10.52.13.17:2552]
02:32:45.010 INFO  akka.remote.EndpointWriter - AssociationError [akka.tcp://statefulserverless-backend@10.52.14.20:2552] <- [akka.tcp://statefulserverless-backend@10.52.13.17:2552]: Error [Shut down address: akka.tcp://statefulserverless-backend@10.52.13.17:2552] [
akka.remote.ShutDownAssociation: Shut down address: akka.tcp://statefulserverless-backend@10.52.13.17:2552
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
]
02:32:46.254 INFO  a.c.s.ClusterSingletonManager - Member removed [akka.tcp://statefulserverless-backend@10.52.13.17:2552]
02:32:46.264 INFO  a.c.s.ClusterSingletonManager - Member removed [akka.tcp://statefulserverless-backend@10.52.13.17:2552]
02:32:52.942 WARN  akka.cluster.sharding.ShardRegion - shopping-cart: Retry request for shard [94] homes from coordinator at [Actor[akka://statefulserverless-backend/system/sharding/shopping-cartCoordinator/singleton/coordinator#1273361376]]. [17] buffered messages.
[snip - above message is repeated 18 times per second, differing only in the shard id and buffered messages]
02:32:53.137 INFO  a.c.Cluster(akka://statefulserverless-backend) - Cluster Node [akka.tcp://statefulserverless-backend@10.52.14.20:2552] - Exiting confirmed [akka.tcp://statefulserverless-backend@10.52.12.18:2552]
02:32:53.225 INFO  akka.remote.EndpointWriter - AssociationError [akka.tcp://statefulserverless-backend@10.52.14.20:2552] <- [akka.tcp://statefulserverless-backend@10.52.12.18:2552]: Error [Shut down address: akka.tcp://statefulserverless-backend@10.52.12.18:2552] [
akka.remote.ShutDownAssociation: Shut down address: akka.tcp://statefulserverless-backend@10.52.12.18:2552
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
]
02:32:53.243 INFO  a.c.Cluster(akka://statefulserverless-backend) - Cluster Node [akka.tcp://statefulserverless-backend@10.52.14.20:2552] - is the new leader among reachable nodes (more leaders may exist)
02:32:54.242 INFO  a.c.Cluster(akka://statefulserverless-backend) - Cluster Node [akka.tcp://statefulserverless-backend@10.52.14.20:2552] - Leader is removing confirmed Exiting node [akka.tcp://statefulserverless-backend@10.52.12.18:2552]
02:32:54.250 INFO  a.c.s.ClusterSingletonManager - Member removed [akka.tcp://statefulserverless-backend@10.52.12.18:2552]
02:32:54.255 INFO  a.c.s.ClusterSingletonManager - Member removed [akka.tcp://statefulserverless-backend@10.52.12.18:2552]
02:32:54.942 WARN  akka.cluster.sharding.ShardRegion - shopping-cart: Retry request for shard [94] homes from coordinator at [Actor[akka://statefulserverless-backend/system/sharding/shopping-cartCoordinator/singleton/coordinator#1273361376]]. [17] buffered messages.
[snip - above message is repeated 18 times per second, differing only in the shard id and buffered messages]

The 18 retry request warnings are then repeated every second for about 50 seconds; the last one appears 50 seconds after the first message was output:

02:33:42.932 WARN  akka.cluster.sharding.ShardRegion - shopping-cart: Retry request for shard [84] homes from coordinator at [Actor[akka://statefulserverless-backend/system/sharding/shopping-cartCoordinator/singleton/coordinator#1273361376]]. [35] buffered messages.

Then everything goes back to normal and the shards start working again, and 16 seconds after that:

02:33:58.432 WARN  a.r.transport.netty.NettyTransport - Remote connection to [null] failed with org.jboss.netty.channel.ConnectTimeoutException: connection timed out: /10.52.12.18:2552
02:33:58.439 WARN  a.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://statefulserverless-backend@10.52.12.18:2552] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://statefulserverless-backend@10.52.12.18:2552]] Caused by: [org.jboss.netty.channel.ConnectTimeoutException: connection timed out: /10.52.12.18:2552]

So, in the above, 10.52.13.17 shut down first, followed by 10.52.12.18, but it appears that the node still attempts to communicate with 10.52.12.18 after it's left the cluster, hence the last error message?

I suspect what is happening is that during shard allocation after the first node leaves the cluster, shards can be allocated to the second node that is also concurrently leaving, and there is a race condition: if that node shuts down while the shards are being allocated to it, the shard coordinator does not attempt to fix that and reallocate them for a minute.

@jroper
Contributor Author

jroper commented Jul 4, 2019

I guess hand-off is timing out in the RebalanceWorker. We can see the scaling decision was made at 2:32:41, and it would have taken a second or two for the request made on the Kubernetes API to result in the node actually shutting down. So at around 2:32:43 the shard coordinator would have received the GracefulShutdownReq and started the RebalanceWorker, which defaults to a 60 second rebalance hand-off timeout, so that would have timed out at 2:33:43. That's why the last retry request message is seen at 2:33:42. I don't see any other 60 second timeouts that might cause this behaviour.
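For reference, if I'm reading the reference config right, that 60 second window comes from `akka.cluster.sharding.handoff-timeout`, which could be shortened while experimenting. A minimal sketch (the 20 s value is just an example, not a recommendation):

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Override the rebalance hand-off timeout (60 s by default) so a stuck
// hand-off is abandoned sooner while experimenting. Value is illustrative.
val config = ConfigFactory
  .parseString("akka.cluster.sharding.handoff-timeout = 20 s")
  .withFallback(ConfigFactory.load())

val system = ActorSystem("statefulserverless-backend", config)
```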

So, why is it timing out? I don't know exactly, but we can see that at 2:32:44 the node leaves the cluster, and communication with it shuts down shortly after that. At some point the shard coordinator receives a Terminated for that region; to handle this it updates its state and removes the region from its list of gracefully terminating regions, but it doesn't remove the shards that were currently being rebalanced from the rebalanceInProgress set. This might be a bug: if a node that previously indicated it is gracefully terminating goes away, isn't it implied that hand-off finished? Shouldn't the shards being rebalanced from it be removed?
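To make the suspected gap concrete, here's a heavily simplified sketch of the coordinator bookkeeping as I understand it; the names only approximate the internal state and this is not the actual ShardCoordinator code:

```scala
import akka.actor.ActorRef

// Heavily simplified model of the coordinator state described above; the
// field names approximate ShardCoordinator's internals, not the real code.
final case class CoordinatorState(
    regions: Map[ActorRef, Set[String]],        // region -> shards allocated to it
    gracefulShutdownInProgress: Set[ActorRef],  // regions that requested graceful shutdown
    rebalanceInProgress: Set[String])           // shard ids currently being handed off

// What appears to happen today when a region terminates: the region and its
// graceful-shutdown entry are dropped, but any of its shards that were
// mid-rebalance stay in rebalanceInProgress until the hand-off timeout fires.
def regionTerminated(region: ActorRef, state: CoordinatorState): CoordinatorState =
  state.copy(
    regions = state.regions - region,
    gracefulShutdownInProgress = state.gracefulShutdownInProgress - region)

// What the question above suggests instead: a terminated region implies its
// hand-off is done, so its shards could also be cleared from rebalanceInProgress.
def regionTerminatedClearingRebalance(region: ActorRef, state: CoordinatorState): CoordinatorState = {
  val shardsOfRegion = state.regions.getOrElse(region, Set.empty[String])
  regionTerminated(region, state)
    .copy(rebalanceInProgress = state.rebalanceInProgress -- shardsOfRegion)
}
```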

As for the reason hand-off is timing out... I'll take a guess. There's no ack sent from the RebalanceWorker when ShardStopped is sent to it. The Shard actor sends the ShardStopped message to the RebalanceWorker and then just stops itself, then the region stops, and the coordinated shutdown promise is redeemed. At that point, I guess, remoting shuts down and the actor system terminates. But did the ShardStopped message get through? If it was buffered, I guess not. Does remoting drain all buffers when it's told to shut down? And I'm seeing these messages during load tests, where I guess there's a high chance of there being a buffer of unsent messages.

So one potential solution to this might be for the RebalanceWorker to ack the ShardStopped, so that the Shard actor can know the message got through before it shuts down. That's assuming that's the bug. What I will do is insert a delay into coordinated shutdown that tries to ensure any messages that are meant to be sent through remoting are drained before remoting is shut down; if the problem goes away, then I guess that's the bug.

@jroper
Contributor Author

jroper commented Jul 4, 2019

Just looked into it a bit deeper. Cluster sharding is shut down before the node leaves the cluster, and leaving the cluster requires sending a message to all the nodes in the cluster. That message won't be sent until after the ShardStopped messages are sent, and since that message did get through, the ShardStopped messages must have got through too. So that's not the bug.

@jroper
Contributor Author

jroper commented Jul 4, 2019

I'll see if I can capture more info by enabling debug for cluster sharding.
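(Turning that on is just the standard Akka logging setting, nothing specific to sharding; something like:)

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Raise the Akka log level so the cluster sharding DEBUG messages are emitted.
// The logging backend (e.g. logback) must also allow DEBUG for the
// akka.cluster.sharding loggers.
val debugConfig = ConfigFactory
  .parseString("akka.loglevel = DEBUG")
  .withFallback(ConfigFactory.load())

val system = ActorSystem("statefulserverless-backend", debugConfig)
```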

@jroper
Contributor Author

jroper commented Jul 4, 2019

@jroper
Contributor Author

jroper commented Jul 4, 2019

Actually, I'm not sure that gist captured it. I'll try again.

@jroper
Contributor Author

jroper commented Jul 4, 2019

Ok, definitely captured now, and I've updated the above gist. It's very noisy, even though I filtered out all the Forwarding debug messages.

@jroper
Contributor Author

jroper commented Jul 4, 2019

Update: from my analysis of the logs, the first member requests graceful shutdown from cluster sharding, successfully hands off all its shards, and then leaves the cluster, all within a couple of seconds. The second requests graceful shutdown from cluster sharding, and begin hand-off is logged, but no shards are successfully deallocated; then, 10 seconds later, it leaves the cluster. That 10 seconds, I guess, is the cluster sharding coordinated shutdown phase timeout. Then, 50 seconds after that, hand-off times out.
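If that guess is right, the 10 seconds would be the `cluster-sharding-shutdown-region` coordinated shutdown phase timeout (10 s by default, as far as I can tell), which could be stretched while investigating. A sketch, with an illustrative value:

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Lengthen the coordinated-shutdown phase that waits for shard regions to
// finish their graceful hand-off, just to see whether the hand-off would
// eventually complete. The 30 s value is only an example.
val shutdownConfig = ConfigFactory
  .parseString(
    "akka.coordinated-shutdown.phases.cluster-sharding-shutdown-region.timeout = 30 s")
  .withFallback(ConfigFactory.load())

val system = ActorSystem("statefulserverless-backend", shutdownConfig)
```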

So, either the hand off messages aren't actually being sent, or there's a problem on the terminating node that causes it to fail to terminate every single shard it has? That doesn't sound right. I'll need to try and capture the logs on the terminating nodes too.

@patriknw
Member

patriknw commented Jul 4, 2019

Thanks for reporting and investigating. Good that you captured detailed logs that we can look into.

I guess that in those scaling scenarios we have no control over which nodes are shut down, or in which order? For the smoothest possible leaving, it's always best that the coordinator (the singleton on the oldest node) is kept alive.

@jroper
Contributor Author

jroper commented Jul 4, 2019

I've found one race condition; I'm not sure if it's causing the problem here. The rebalance worker won't tell the node to hand off its shards until all shard regions have acked the BeginHandOff message, and this includes regions that are themselves gracefully shutting down. So if the ShardRegion actor of the first node shutting down finishes its own hand-off and is stopped before the BeginHandOff message reaches it, it's never going to ack, and that will prevent the hand-off of the second region from starting.

Is this causing the problem? In the logs I see 715 ms between when the RebalanceWorker starts and when the coordinator logs that the region has terminated. There is no configured removal margin, so that log message is not delayed by that. It feels like maybe that's a little too long? Can there be a delay that long between an actor terminating and a remote watcher receiving the termination signal?

To fix the problem, I guess the rebalance worker needs to watch the shard regions itself. If that's expensive, it can just watch the regions that are currently gracefully shutting down.
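A rough sketch of the shape such a fix could take (not the actual implementation; the message types here are simplified stand-ins for the internal coordinator protocol):

```scala
import akka.actor.{ Actor, ActorRef, Terminated }

// Simplified stand-ins for the internal coordinator protocol messages
// (the real ones live inside ShardCoordinator).
final case class BeginHandOff(shard: String)
final case class BeginHandOffAck(shard: String)
final case class HandOff(shard: String)

// Sketch of a rebalance worker that watches the regions it expects a
// BeginHandOffAck from, so a region that terminates before acking (e.g. a
// second node shutting down at the same time) no longer blocks hand-off.
class RebalanceWorkerSketch(shard: String, shardOwner: ActorRef, regions: Set[ActorRef])
    extends Actor {

  regions.foreach(context.watch)
  regions.foreach(_ ! BeginHandOff(shard))
  private var waitingFor: Set[ActorRef] = regions

  def receive: Receive = {
    case BeginHandOffAck(`shard`) => acked(sender())
    case Terminated(region)       => acked(region) // a dead region can never ack
  }

  private def acked(region: ActorRef): Unit = {
    waitingFor -= region
    if (waitingFor.isEmpty) {
      // All regions have acked (or terminated): tell the owning region to hand off.
      // The real worker would then wait for ShardStopped or the hand-off timeout.
      shardOwner ! HandOff(shard)
      context.stop(self)
    }
  }
}
```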

@jroper
Contributor Author

jroper commented Jul 4, 2019

Ok, I managed to find the logs (it turns out I've got GKE Stackdriver logging enabled, which means I can see all the logs for nodes that have shut down), and I can confirm that only 3 out of 4 nodes received the BeginHandOff messages, and the one that didn't receive it was the first node to shut down. The second shutting-down node then didn't receive a HandOff message.

jroper added a commit to jroper/akka that referenced this issue Jul 4, 2019
Fixes akka#27259.

The RebalanceWorker actor needs to watch the shard regions that it's
expecting a BeginHandOffAck message from, in case the ShardRegion shuts
down before it can receive the BeginHandOff message, preventing hand
off.  This can be a problem when two nodes are shut down at about the
same time.
@jroper
Contributor Author

jroper commented Jul 4, 2019

PR: #27261

@jroper
Contributor Author

jroper commented Jul 4, 2019

@patriknw I only just noticed your comment above. It's using k8s deployments, so nodes are shut down newest first, keeping the shard coordinator alive.

jroper added a commit to jroper/akka that referenced this issue Jul 5, 2019
jroper added a commit to jroper/akka that referenced this issue Jul 8, 2019
jroper added a commit to jroper/akka that referenced this issue Jul 8, 2019
jroper added a commit to jroper/akka that referenced this issue Jul 8, 2019
jroper added a commit to jroper/akka that referenced this issue Jul 18, 2019
@chbatey added the 0 - new and discuss labels Jul 22, 2019
patriknw pushed a commit to jroper/akka that referenced this issue Aug 15, 2019
patriknw pushed a commit that referenced this issue Aug 15, 2019
* RebalanceWorker should watch shard regions
@patriknw added this to the 2.6.0-M6 milestone Aug 15, 2019
@patriknw added the 1 - triaged and t:cluster-sharding labels and removed the 0 - new and discuss labels Aug 15, 2019
johanandren pushed a commit to johanandren/akka that referenced this issue Aug 28, 2019
* RebalanceWorker should watch shard regions
@gmarti

gmarti commented Sep 17, 2019

I stumbled upon this issue while playing with Akka cluster sharding in 2.5.x.
I've looked at the #27261 fix, but it seems that we cannot send the HandOff message to the shard if it's terminated. Right?
I opened PR #27736 to handle that case.
