Shard reallocation problems #27259
I guess hand off is timing out in the … So, why is it timing out? Well, I don't know exactly, but we can see at … As for the reason hand off is timing out, I'll take a guess. There's no ack sent from the … So one potential solution to this might be for the …
Just looked into it a bit deeper: cluster sharding is shut down before the cluster is left. When the cluster leaves, it has to send a message to all the nodes in the cluster, and that message won't be sent until after the shard stopped messages are sent. Since that message did get through, the shard stopped messages must have got through. So that's not the bug.
I'll see if I can capture more info by enabling debug for cluster sharding.
Actually, I'm not sure if that gist actually captured it. I'll try again.
Ok, definitely captured now, and I've updated the above gist. It's very noisy, even though I filtered out all the Forwarding debug messages.
Update: from my analysis of the logs, the first member requests graceful shutdown from cluster sharding, successfully hands off all shards, and then leaves the cluster, all within a couple of seconds. The second requests graceful shutdown from cluster sharding, and begin hand off is logged, but no shards are successfully deallocated; then 10 seconds later it leaves the cluster. That 10 seconds, I guess, is the cluster sharding coordinated shutdown phase timeout. Then, 50 seconds after that, hand off times out. So either the hand off messages aren't actually being sent, or there's a problem on the terminating node that causes it to fail to terminate every single shard it has? That doesn't sound right. I'll need to try and capture the logs on the terminating nodes too.
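For what it's worth, the timings above line up neatly. A rough reconstruction (the numbers are only the ones observed in this report, not authoritative Akka defaults):

```python
# Rough timeline reconstruction of the second node's shutdown, using the
# durations observed in the logs above (seconds, relative to the moment
# "begin hand off" was logged). These are observations, not Akka defaults.
begin_hand_off_logged = 0
node_leaves_cluster = begin_hand_off_logged + 10   # presumed phase timeout
hand_off_times_out = node_leaves_cluster + 50      # 50s after leaving

# Measured from BeginHandOff, the total is consistent with a 60-second
# hand-off timeout that starts ticking when hand off begins.
total = hand_off_times_out - begin_hand_off_logged
print(total)
```

So the 10s + 50s split would just be one 60-second hand-off timer, interrupted partway through by the node leaving.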
Thanks for reporting and investigating. Good that you captured detailed logs that we can look into. I guess that in those scaling scenarios we have no control over which nodes are shut down and in which order? For as smooth a leaving as possible, it's always best that the coordinator (singleton on oldest node) is kept alive.
I've found one race condition, though I'm not sure it's causing the problem here. The rebalance worker won't tell the node to hand off its shards until all shard regions have acked the begin hand off message. This includes regions that are currently gracefully shutting down. So, if sometime before the … Is this causing the problem? In the logs I see 715ms between when the … To fix the problem, I guess the rebalance worker needs to watch the shard regions itself. If that's expensive, it can just watch the regions that are currently gracefully shutting down.
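The race and the proposed fix can be sketched as a small state machine. This is a hypothetical Python model, not Akka's actual RebalanceWorker code; all names here are illustrative:

```python
# Toy model of the RebalanceWorker's BeginHandOffAck bookkeeping,
# assuming the fix described above: a region that terminates before
# acking is dropped from the pending set instead of blocking hand off.

class RebalanceWorker:
    def __init__(self, shard, regions):
        self.shard = shard
        self.pending_acks = set(regions)  # regions sent BeginHandOff
        self.hand_off_started = False

    def _maybe_begin_hand_off(self):
        # Hand off proceeds only once no acks remain outstanding.
        if not self.pending_acks and not self.hand_off_started:
            self.hand_off_started = True

    def on_begin_hand_off_ack(self, region):
        self.pending_acks.discard(region)
        self._maybe_begin_hand_off()

    def on_region_terminated(self, region):
        # The fix: without this, a region that shuts down before acking
        # leaves pending_acks non-empty forever, and hand off times out.
        self.pending_acks.discard(region)
        self._maybe_begin_hand_off()
```

In Akka terms, `on_region_terminated` would correspond to handling a `Terminated` message after `context.watch`-ing each region; the key point is that termination and acking both shrink the same pending set.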
Ok, I managed to find the logs (it turns out I've got GKE Stackdriver logging enabled, which means I can see all the logs for nodes that have shut down), and I can confirm that only 3 out of 4 nodes received the …
Fixes akka#27259. The RebalanceWorker actor needs to watch the shard regions that it's expecting a BeginHandOffAck message from, in case the ShardRegion shuts down before it can receive the BeginHandOff message, preventing hand off. This can be a problem when two nodes are shut down at about the same time.
PR: #27261
@patriknw only just noticed your comment above. It's using k8s deployments, so they are shut down newest first, keeping the shard coordinator alive.
I see this issue a lot when scaling a cluster down. I think the problem happens when more than one node leaves the cluster at a time. I'll describe the symptom first; here are the logs:
The 18 retry request warnings are then repeated every second for about 50 seconds, until the last one, 50 seconds after the first message was output:
And then, everything goes back to normal, the shards start working again, and then 16 seconds after that:
So, in the above, 10.52.13.17 shut down first, followed by 10.52.12.18, but it appears that the node still attempts to communicate with 10.52.12.18 after it's left the cluster, hence the last error message? I suspect what is happening is that during shard allocation after the first node leaves the cluster, shards can be allocated to the second node that is also concurrently leaving, and there is a race condition: if that node shuts down while the shards are being allocated to it, the shard coordinator does not attempt to fix that and reallocate for a minute.
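The suspected allocation race can be illustrated with a toy model (again, hypothetical Python, not Akka's actual allocation strategy; region names reuse the IPs from the logs purely for illustration):

```python
# Toy model of the suspected race: a least-loaded allocation strategy
# that doesn't know a region's node is itself leaving can hand a shard
# to a node that is about to disappear.

def allocate_shard(regions, leaving):
    # Naive strategy: region with fewest shards wins; leaving nodes
    # are NOT excluded, which is the suspected bug.
    return min(regions, key=lambda r: len(regions[r]))

def allocate_shard_excluding_leaving(regions, leaving):
    # What we'd want instead: never allocate to a leaving region.
    candidates = {r: s for r, s in regions.items() if r not in leaving}
    return min(candidates, key=lambda r: len(candidates[r]))

regions = {"10.52.12.18": [], "10.52.14.20": ["shard-1", "shard-2"]}
leaving = {"10.52.12.18"}

naive = allocate_shard(regions, leaving)                    # the race
fixed = allocate_shard_excluding_leaving(regions, leaving)  # avoids it
```

Here the naive strategy picks 10.52.12.18 (it has the fewest shards), even though that node is concurrently leaving, so the shard's owner then shuts down mid-allocation.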