storage: add_replica takes 1s to apply and slow WaitForFullReplication

TLDR; I'm seeing the application of a change_replicas command take sometimes more than one sec on a follower in a unit test (one sec since the command was committed).

While working on #38529 I started poking at the seemingly slow initial replication for a unit test - `TestCluster.WaitForFullReplication()` is taking seconds on a 5 minute cluster.
There seems to be a number of causes involving lease transfers, snapshot throttling and also a more egregious one - application of log entries on some replicas seems to take up to 1s.
I'd like this issue to track this slow replication primarily, and then the whole investigation around `WaitForFullReplication()` speed secondarily.

So, for the slow replication:
I have a branch with a test that fails consistently, seeming showing a replica not apply a command for up to 1s after the command was committed.
https://github.com/andreimatei/cockroach/tree/storage.slow-initial-replication-repro
Run with
```
make testshort PKG=./pkg/storage TESTS=TestLargeUnsplittableRangeReplicate TESTFLAGS="-count 1 -v --vmodule=replicate_queue=2,queue=2"
```

This is a test that creates a 5-node cluster, with ranges configured to be quite small. The rest of the test is irrelevant; only the cluster creation is of note. As part of a `TestCluster` startup, we wait for all the ranges to be fully replicated - `TestCluster.WaitForFullReplication()`. This method pokes the replication queue a bit and then waits for the underreplicated metric to say 0. It was slow, so I've tried to improve this waiting code.
I've made it kick all the replication queues repeatedly until they claim there's no more work to do, and then I assert that all the replicas on all the nodes have descriptors reflecting the desired replication factor. And this assertion fails.

The code isn't particularly correct: it doesn't take into account lease transfers and it doesn't take into account the fact that only the leaseholder is guaranteed to have an up-to-date descriptor. But the degree to which the latter bites me I find very troubling.

I'm observing that sometimes a replica is up to 1s behind the leader (and the quorum). I'm attaching a log showing such a bad case.
In this specific log, I see:
26.797922 - a preemptive snapshot of r17 is sent from `n1` to `n2` and applied by `n2`. Then immediately an `ADD_REPLICA` for `n2` is proposed and committed.
27.738875 (almost 1s later) - the descriptor for `r17` on `n2` does not reflect that `ADD_REPLICA`. So `n2` is still unaware that it itself is a member of the group. Even though it does have a replica (I guess ever since it applied the snapshot?).

So it takes us 1s for n2 to apply some command in a unit test? And this happens consistently. It always take more than 200ms. Doesn't seem to take more than 1s.
I'm thinking there must be some pathology here that's worth investigating.

FWIW, anecdotally I know that `WaitForFull` replication is quite slow generally and it's affecting a ton of tests. Perhaps it'll speed up if we figure something out here.


[out-slow-replication.txt](https://github.com/cockroachdb/cockroach/files/3340960/out-slow-replication.txt)



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

storage: add_replica takes 1s to apply and slow WaitForFullReplication #38565

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

storage: add_replica takes 1s to apply and slow WaitForFullReplication #38565

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions