Node takes a long time to restart after killing during workload. #38531
Describe the problem
When running a workload on a 3 node cluster, if one node is stopped and then restarted, the restart takes a very long time, to the point that I assumed the node was dead and was not going to come back.
Log entry from when the node was killed reads:
I restarted the node after it was marked as dead in the Admin UI. After 10+ minutes it still was not shown as live in the Admin UI, and the logs from the cluster read:
Later on the logs read
The node never successfully restarted according to the logs at this point.
However ~12 minutes later on in the logs:
And checking in the Admin UI the node is up and running again.
Steps to Reproduce
Kill the third node:
`roachprod stop $CLUSTER:3`
I had made a few cluster settings changes as per the issue that first reported this. They are as follows:
Here is the log file from node 3.
Reasonably sure this is just another rediscovery of #37906. We're trying to get a mitigation into 19.1.3, and so far it looks like we'll succeed. The main PR is #38484 and will hopefully go a long way already.
Note that on a 3 node cluster the <5 vs >5 minute distinction doesn't matter because there's nowhere else for the replicas to go, so they stay on the dead node indefinitely.
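To illustrate the point about there being nowhere else for the replicas to go, here is a toy sketch in Python. It is not CockroachDB code: the function `rebalance_targets` and the node numbering are hypothetical, and it assumes a replication factor of 3, so that on a 3 node cluster every node holds a replica of every range.

```python
def rebalance_targets(all_nodes, replica_nodes, dead_nodes):
    """Return live nodes that do not already hold a replica of the range.

    Toy model only: a replica on a dead node can move only to a node
    that is live and does not already hold a copy of the same range.
    """
    return sorted(set(all_nodes) - set(replica_nodes) - set(dead_nodes))


# 3 node cluster, replication factor 3 (assumed): node 3 is dead,
# but nodes 1 and 2 already hold replicas, so there is no target.
print(rebalance_targets({1, 2, 3}, {1, 2, 3}, {3}))

# Hypothetical 5 node cluster for contrast: nodes 4 and 5 are free targets.
print(rebalance_targets({1, 2, 3, 4, 5}, {1, 2, 3}, {3}))
```

In the 3 node case the candidate list is empty, which is why the replicas stay assigned to the dead node regardless of how long it has been down.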