
storage: very slow restart of local cluster #26391

Closed
a-robinson opened this issue Jun 4, 2018 · 13 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.

Comments

@a-robinson (Contributor)

I created a 3-node local cluster using roachdemo on top of 98b1ceb with minor unrelated modifications to the SQL layer. I created a simple table with 100 ranges but no data in it (create table foo (k int primary key, v int); alter table foo split at select generate_series(0, 10000, 100);).

I stopped the cluster for about a minute and then restarted it. After the restart, any query that touched the table took multiple minutes to complete (e.g. SHOW TESTING_RANGES FOR TABLE foo; or SELECT * FROM foo;). After the first such query succeeds, additional ones complete in well under a second.

The cluster is hardly using any CPU, so it's not spinning on a bunch of real work.

The most suspicious bit I've noticed so far is a lot of raft election traffic until the moment when things start working:

[screenshot: 2018-06-04, 3:02:40 PM]

[screenshot: 2018-06-04, 3:29:30 PM]

Also showing the slow startup is the very slow ramp-up of the replicas and leaseholders metrics:

[screenshot: 2018-06-04, 3:14:14 PM]

[screenshot: 2018-06-04, 3:14:18 PM]

Possibly relevant is an "Invalid Lease" shown for range 9, the eventlog range, although that seems unlikely to matter, particularly since the lease remains "Invalid" even after the cluster has started responding to queries quickly.

[screenshot: 2018-06-04, 3:16:24 PM]

a-robinson added the A-kv-replication Relating to Raft, consensus, and coordination. label Jun 4, 2018
a-robinson self-assigned this Jun 4, 2018

@bdarnell (Contributor) commented Jun 4, 2018

That looks like #26257. That commit was before the fix/workaround was merged in #26291.

@a-robinson (Contributor, Author)

Actually, it looks like it's different. I've rebased on top of HEAD (5d2feac) and am still seeing the same behavior.

The trace suggests that the SQL plan for the query is hitting each range serially with a LeaseInfo request, and that each range takes the full 3-second raft ticks timeout to respond. It's possibly still related to the recent unquiesced-replica changes, but it isn't the exact same issue. For example:

| 2018-06-04 19:47:13.389449+00:00 | 3s601ms392µs381ns    | heartbeat 1528141642.885732004,0                                         | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | liveness heartbeat                                    |   68 |
| 2018-06-04 19:47:13.389479+00:00 | 3s601ms422µs462ns    | read-write path                                                          | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:13.389507+00:00 | 3s601ms450µs111ns    | applied timestamp cache                                                  | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:13.389815+00:00 | 3s601ms758µs530ns    | evaluated request                                                        | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:13.389826+00:00 | 3s601ms769µs139ns    | acquired {raft,replica}mu                                                | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.378799+00:00 | 6s590ms742µs654ns    | retry proposal 50397e81a2bd7e83: reasonTicks                             | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.378956+00:00 | 6s590ms899µs155ns    | applied timestamp cache                                                  | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.379316+00:00 | 6s591ms259µs198ns    | evaluated request                                                        | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.379326+00:00 | 6s591ms269µs722ns    | acquired {raft,replica}mu                                                | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.581442+00:00 | 6s793ms384µs914ns    | retry proposal 52d17b0bb61814c0: reasonNewLeader                         | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.581726+00:00 | 6s793ms669µs21ns     | applied timestamp cache                                                  | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.583367+00:00 | 6s795ms310µs768ns    | evaluated request                                                        | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.583413+00:00 | 6s795ms355µs946ns    | acquired {raft,replica}mu                                                | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.585453+00:00 | 6s797ms395µs860ns    | retry proposal 7ce5ca9e7298e9d4: reasonNewLeaderOrConfigChange           | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.585601+00:00 | 6s797ms544µs811ns    | applied timestamp cache                                                  | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.58639+00:00  | 6s798ms333µs721ns    | evaluated request                                                        | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.586431+00:00 | 6s798ms373µs989ns    | acquired {raft,replica}mu                                                | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.589585+00:00 | 6s801ms528µs468ns    | command proposed from replica (n1,s1):1 with lease #3 incompatible       | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
|                                  |                      | to repl=(n1,s1):1 seq=4 start=1528141568.365333322,0 epo=3               |                                                                                                                       |                                 |                                                       |      |
|                                  |                      | pro=1528141633.385683914,0                                               |                                                                                                                       |                                 |                                                       |      |
| 2018-06-04 19:47:16.58965+00:00  | 6s801ms593µs206ns    | applying command with forced error: cannot replace lease repl=(n1,s1):1  | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
|                                  |                      | seq=4 start=1528141568.365333322,0 epo=3 pro=1528141633.385683914,0      |                                                                                                                       |                                 |                                                       |      |
|                                  |                      | with repl=(n1,s1):1 seq=4 start=1528141568.365333322,0 epo=3             |                                                                                                                       |                                 |                                                       |      |
|                                  |                      | pro=1528141633.385683914,0: proposed under invalid lease                 |                                                                                                                       |                                 |                                                       |      |
| 2018-06-04 19:47:16.590403+00:00 | 6s802ms346µs692ns    | replica.Send got error: cannot replace lease repl=(n1,s1):1 seq=4        | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
|                                  |                      | start=1528141568.365333322,0 epo=3 pro=1528141633.385683914,0            |                                                                                                                       |                                 |                                                       |      |
|                                  |                      | with repl=(n1,s1):1 seq=4 start=1528141568.365333322,0 epo=3             |                                                                                                                       |                                 |                                                       |      |
|                                  |                      | pro=1528141633.385683914,0: proposed under invalid lease                 |                                                                                                                       |                                 |                                                       |      |
| 2018-06-04 19:47:16.590588+00:00 | 6s802ms530µs888ns    | read has no clock uncertainty                                            | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | /cockroach.roachpb.Internal/Batch                     |   65 |
| 2018-06-04 19:47:16.590613+00:00 | 6s802ms556µs777ns    | command queue                                                            | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | /cockroach.roachpb.Internal/Batch                     |   65 |
| 2018-06-04 19:47:16.590658+00:00 | 6s802ms600µs991ns    | waiting for read lock                                                    | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | /cockroach.roachpb.Internal/Batch                     |   65 |
| 2018-06-04 19:47:16.590753+00:00 | 6s802ms696µs119ns    | read completed                                                           | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | /cockroach.roachpb.Internal/Batch                     |   65 |
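
For scale: serializing one such ~3s stall per range across the 100 ranges in this table adds up to roughly five minutes, which matches the "multiple minutes" symptom above. A back-of-the-envelope sketch (the ~3s stall and the 100-range count are figures from this issue, not from the code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Figures taken from this issue: the test table has ~100 ranges, and each
	// lease acquisition sits out roughly the 3s raft ticks timeout before the
	// reasonTicks retry finally goes through.
	const numRanges = 100
	const perRangeStall = 3 * time.Second

	// The trace shows the LeaseInfo requests being issued serially, so the
	// per-range stalls add up instead of overlapping.
	fmt.Println(perRangeStall * numRanges) // 5m0s
}
```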

@a-robinson (Contributor, Author)

I'm still seeing the problem even after reverting your 4 recent unquiesced replica changes. Do you have any thoughts on what to look into? The initial election for each raft group appears to hit a stalemate where each replica votes for itself:

[screenshot: 2018-06-04, 5:10:53 PM]

And then I guess it takes until reasonTicks fires for a new election to get kicked off? After that things move reasonably quickly.

This would mostly be a non-issue if batches were sent to more/all ranges in parallel (which I'd at least expect for SELECT * FROM foo;), but it still seems undesirable.
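
To make the suspected stalemate concrete, here is a toy model of the raft voting rule (at most one vote granted per term). It is an illustration only, not CockroachDB or etcd/raft code: if all three replicas campaign for the same term before stepping anyone else's vote request, each one locks in a vote for itself and has to reject the other two, so no candidate reaches a 2-of-3 quorum until a later timeout starts a fresh election.

```go
package main

import "fmt"

type node struct {
	id       int
	term     uint64
	votedFor int // 0 means no vote cast in the current term
}

// campaign models a replica starting an election: bump the term and vote for itself.
func (n *node) campaign() {
	n.term++
	n.votedFor = n.id
}

// handleVoteRequest models the raft voting rule: at most one vote per term.
func (n *node) handleVoteRequest(term uint64, candidate int) bool {
	if term > n.term {
		n.term, n.votedFor = term, 0
	}
	if term == n.term && (n.votedFor == 0 || n.votedFor == candidate) {
		n.votedFor = candidate
		return true
	}
	return false
}

func main() {
	nodes := []*node{{id: 1}, {id: 2}, {id: 3}}

	// All three replicas wake up and campaign before processing any messages.
	for _, n := range nodes {
		n.campaign()
	}

	// Each candidate's vote requests now reach the others, and every request
	// is rejected because the recipient already voted for itself this term.
	for _, cand := range nodes {
		granted := 1 // its own vote
		for _, n := range nodes {
			if n != cand && n.handleVoteRequest(cand.term, cand.id) {
				granted++
			}
		}
		fmt.Printf("candidate %d: %d/3 votes (quorum needs 2)\n", cand.id, granted)
	}
}
```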

@bdarnell (Contributor) commented Jun 4, 2018

Which specific ones did you revert? This stalemate sounds like it could be caused by 786954d or 44e3977.

@a-robinson (Contributor, Author)

Ah, I had missed that PR, but even after reverting those commits I'm still seeing the problem. I've now reverted #26297, #26291, #26269, #24956, and #24920, and it's still happening. It does not reproduce on 2.0.0.

@a-robinson (Contributor, Author)

Wait, never mind. After reverting all of #24920 and restarting a couple of times, the queries now succeed quickly after the restart. I'm not sure why my first couple of attempts after reverting #24920 were still taking over a minute, but it may be because I wasn't waiting as long between stopping and restarting the cluster as I had been before.

Are you up for taking over? The repro steps are: create a 3-node local cluster, create a simple table with 100 ranges in it as shown in the initial comment, kill the cluster, wait until the last node has finished "gracefully" shutting down, then restart it and run select * from yourtable. It should return in no more than a couple of seconds; in cases when it isn't working, it takes minutes.

a-robinson assigned bdarnell and unassigned a-robinson Jun 4, 2018
tbg added this to On the horizon in KV Jun 5, 2018
@bdarnell (Contributor) commented Jun 5, 2018

The initial election for each raft group appears to hit a stalemate where each replica votes for itself:

And yet they're all in the follower state, not candidate. I think what's happening is that the range is being awakened for an incoming MsgVote, but before it processes that message it initiates its own campaign (committing its vote for that term and sending out its own MsgVotes), and then it has to reject the incoming vote request because it has already voted for itself.
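
The fix that eventually landed in #26463 (quoted in the merge below) disables that campaign-on-wake in one spot. A rough sketch of the gating condition, under stated assumptions: replica here is a pared-down stand-in for the real storage.Replica, the method name stepWakeMessage is made up for illustration, and the etcd raft import path is the one in use around that time.

```go
package storage // illustrative stand-in, not the real package

import (
	"log"

	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// replica is a pared-down stand-in for storage.Replica with only the pieces
// this sketch needs.
type replica struct {
	quiescent bool
	raftGroup *raft.RawNode
}

// stepWakeMessage sketches the shape of the eventual fix: when a quiesced
// follower is woken by an incoming raft message, skip the usual
// campaign-on-wake if that message is a MsgVote, so the follower grants the
// incoming vote instead of first voting for itself.
func (r *replica) stepWakeMessage(msg raftpb.Message) error {
	if r.quiescent {
		r.quiescent = false
		// Waking up for someone else's vote request means an election is
		// already underway; campaigning here splits the vote and produces the
		// "everyone votes for itself" stalemate described above.
		if msg.Type != raftpb.MsgVote {
			if err := r.raftGroup.Campaign(); err != nil {
				log.Printf("campaign failed: %v", err)
			}
		}
	}
	return r.raftGroup.Step(msg)
}
```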

@bdarnell (Contributor) commented Jun 5, 2018

I've written this up as a roachtest in this commit. It only fails about half the time for me (in local mode on my laptop), making it annoying to figure out whether my change has made a difference.

@tbg (Member) commented Jun 6, 2018

I think I'm seeing the same thing restarting a (remote) three node cluster with 100k ranges:

[image]

The kv load I'm running hovers at 200qps and is slowly climbing, but it used to do over 1000qps before I restarted the cluster.

It eventually works itself out, with "other" messages dropping to basically zero:

[image]

bdarnell added a commit to bdarnell/cockroach that referenced this issue Jun 6, 2018
When the incoming message is a MsgVote (which is likely the case for
the first message received by a quiesced follower), immediate
campaigning will cause the election to fail.

This is similar to reverting commit 44e3977, but only disables
campaigning in one location.

Fixes cockroachdb#26391

Release note: None
craig bot pushed a commit that referenced this issue Jun 6, 2018
26441: distsql: add NewFinalIterator to the rowIterator interface r=asubiotto a=asubiotto

Some implementations of the rowIterator interface would destroy rows as
they were iterated over to free memory eagerly. NewFinalIterator is
introduced in this change to provide non-reusable behavior and
NewIterator is explicitly described as reusable.

A reusable iterator has been added to the memRowContainer to satisfy
these new interface semantics.

Release note: None

26463: storage: Disable campaign-on-wake when receiving raft messages r=bdarnell a=bdarnell

When the incoming message is a MsgVote (which is likely the case for
the first message received by a quiesced follower), immediate
campaigning will cause the election to fail.

This is similar to reverting commit 44e3977, but only disables
campaigning in one location.

Fixes #26391

Release note: None

26469: lint: Fix a file descriptor leak r=bdarnell a=bdarnell

This leak is enough to cause `make lintshort` to fail when run under the
default file descriptor limit on macOS (256).
default file descriptor limit on macos (256).

Release note: None

26470: build: Pin go.uuid to the version currently in use r=bdarnell a=bdarnell

Updates #26332

Release note: None

Co-authored-by: Alfonso Subiotto Marqués <alfonso@cockroachlabs.com>
Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
craig bot closed this as completed in #26463 Jun 6, 2018
KV automation moved this from On the horizon to Finished (milestone 2, ends 6/25) Jun 6, 2018
@tbg (Member) commented Jun 12, 2018

Tried this again for #26448 -- sometimes 1s, but a run just now took 2m16s... I don't think we can call this closed.

tbg reopened this Jun 12, 2018
KV automation moved this from Finished (milestone 2, ends 6/25) to Milestone 2 Jun 12, 2018
@tbg (Member) commented Jun 12, 2018

Hmm, very confusing. I wrote a bash script to repro this, but the last couple of runs were all pretty fast.

(Note: I edited the script to insert a sleep before creating the table, to allow for upreplication; it didn't seem to change anything.)

#!/bin/bash

set -euxo pipefail

rm -rf cockroach-data
killall -9 cockroach || true

function start {
	./cockroach start --insecure --port=26259 --http-port=26260 --store=cockroach-data/2 --join=localhost:26257 &
	./cockroach start --insecure --port=26261 --http-port=26262 --store=cockroach-data/3 --join=localhost:26257 &
	./cockroach start --insecure --port=26257 --http-port=26258 --store=cockroach-data/1 --background
}

start

# Allow upreplication of initial ranges (and by extension all splits below).
sleep 20

./cockroach sql --insecure <<EOF
create table foo (k int primary key, v int); alter table foo split at select generate_series(0, 10000, 100);
EOF

killall -9 cockroach
start

./cockroach sql --insecure <<EOF
show trace for select * from foo;
EOF

tbg moved this from Milestone 2 to Backlog in KV Jun 12, 2018
tbg added the C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. label Jul 22, 2018
tbg added this to the 2.1 milestone Jul 22, 2018
bdarnell modified the milestones: 2.1, 2.2 Aug 15, 2018
petermattis removed this from the 2.2 milestone Oct 5, 2018
@tbg (Member) commented Oct 11, 2018

What I was seeing here is likely along the lines of #30613 and we're fixing all of these bugs right now.

tbg closed this as completed Oct 11, 2018
KV automation moved this from Backlog to Closed Oct 11, 2018