-
Notifications
You must be signed in to change notification settings - Fork 4k
Description
TLDR; I'm seeing the application of a change_replicas command take sometimes more than one sec on a follower in a unit test (one sec since the command was committed).
While working on #38529 I started poking at the seemingly slow initial replication for a unit test - TestCluster.WaitForFullReplication() is taking seconds on a 5 minute cluster.
There seems to be a number of causes involving lease transfers, snapshot throttling and also a more egregious one - application of log entries on some replicas seems to take up to 1s.
I'd like this issue to track this slow replication primarily, and then the whole investigation around WaitForFullReplication() speed secondarily.
So, for the slow replication:
I have a branch with a test that fails consistently, seeming showing a replica not apply a command for up to 1s after the command was committed.
https://github.com/andreimatei/cockroach/tree/storage.slow-initial-replication-repro
Run with
make testshort PKG=./pkg/storage TESTS=TestLargeUnsplittableRangeReplicate TESTFLAGS="-count 1 -v --vmodule=replicate_queue=2,queue=2"
This is a test that creates a 5-node cluster, with ranges configured to be quite small. The rest of the test is irrelevant; only the cluster creation is of note. As part of a TestCluster startup, we wait for all the ranges to be fully replicated - TestCluster.WaitForFullReplication(). This method pokes the replication queue a bit and then waits for the underreplicated metric to say 0. It was slow, so I've tried to improve this waiting code.
I've made it kick all the replication queues repeatedly until they claim there's no more work to do, and then I assert that all the replicas on all the nodes have descriptors reflecting the desired replication factor. And this assertion fails.
The code isn't particularly correct: it doesn't take into account lease transfers and it doesn't take into account the fact that only the leaseholder is guaranteed to have an up-to-date descriptor. But the degree to which the latter bites me I find very troubling.
I'm observing that sometimes a replica is up to 1s behind the leader (and the quorum). I'm attaching a log showing such a bad case.
In this specific log, I see:
26.797922 - a preemptive snapshot of r17 is sent from n1 to n2 and applied by n2. Then immediately an ADD_REPLICA for n2 is proposed and committed.
27.738875 (almost 1s later) - the descriptor for r17 on n2 does not reflect that ADD_REPLICA. So n2 is still unaware that it itself is a member of the group. Even though it does have a replica (I guess ever since it applied the snapshot?).
So it takes us 1s for n2 to apply some command in a unit test? And this happens consistently. It always take more than 200ms. Doesn't seem to take more than 1s.
I'm thinking there must be some pathology here that's worth investigating.
FWIW, anecdotally I know that WaitForFull replication is quite slow generally and it's affecting a ton of tests. Perhaps it'll speed up if we figure something out here.