
storage: very slow restart of local cluster #26391

Closed
a-robinson opened this issue Jun 4, 2018 · 13 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.

Comments

@a-robinson (Contributor)

I created a 3-node local cluster using roachdemo on top of 98b1ceb with minor unrelated modifications to the SQL layer. I created a simple table with 100 ranges but no data in it (create table foo (k int primary key, v int); alter table foo split at select generate_series(0, 10000, 100);).

I stopped the cluster for about a minute and then restarted it. After the restart, any query that touched the table took multiple minutes to complete (e.g. SHOW TESTING_RANGES FOR TABLE foo; or SELECT * FROM foo;). After the first such query succeeds, additional ones complete in well under a second.

The cluster is hardly using any CPU, so it's not spinning on a bunch of real work.

The most suspicious bit I've noticed so far is a lot of raft election traffic until the moment when things start working:

[screenshot: 2018-06-04, 3:02:40 PM]

[screenshot: 2018-06-04, 3:29:30 PM]

Also showing the slow startup is the very slow ramp-up of the replicas and leaseholders metrics:

[screenshot: 2018-06-04, 3:14:14 PM]

[screenshot: 2018-06-04, 3:14:18 PM]

Possibly relevant is an "Invalid Lease" shown for range 9, the eventlog range, although that seems unlikely to matter, particularly since the lease remains "Invalid" even after the cluster has started responding to queries quickly.

[screenshot: 2018-06-04, 3:16:24 PM]

a-robinson added the A-kv-replication Relating to Raft, consensus, and coordination. label Jun 4, 2018
a-robinson self-assigned this Jun 4, 2018

@bdarnell (Contributor) commented Jun 4, 2018

That looks like #26257. That commit was before the fix/workaround was merged in #26291.

@a-robinson (Contributor, Author)

Actually, it looks like it's different. I've rebased on top of HEAD (5d2feac) and am still seeing the same behavior.

The trace suggests that the SQL plan for the query is hitting each range serially with a LeaseInfo request, and that each range takes the full 3-second raft ticks timeout to respond. It's possibly still related to the recent unquiesced-replica changes, but it isn't the exact same issue. For example:

| 2018-06-04 19:47:13.389449+00:00 | 3s601ms392µs381ns    | heartbeat 1528141642.885732004,0                                         | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | liveness heartbeat                                    |   68 |
| 2018-06-04 19:47:13.389479+00:00 | 3s601ms422µs462ns    | read-write path                                                          | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:13.389507+00:00 | 3s601ms450µs111ns    | applied timestamp cache                                                  | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:13.389815+00:00 | 3s601ms758µs530ns    | evaluated request                                                        | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:13.389826+00:00 | 3s601ms769µs139ns    | acquired {raft,replica}mu                                                | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.378799+00:00 | 6s590ms742µs654ns    | retry proposal 50397e81a2bd7e83: reasonTicks                             | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.378956+00:00 | 6s590ms899µs155ns    | applied timestamp cache                                                  | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.379316+00:00 | 6s591ms259µs198ns    | evaluated request                                                        | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.379326+00:00 | 6s591ms269µs722ns    | acquired {raft,replica}mu                                                | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.581442+00:00 | 6s793ms384µs914ns    | retry proposal 52d17b0bb61814c0: reasonNewLeader                         | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.581726+00:00 | 6s793ms669µs21ns     | applied timestamp cache                                                  | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.583367+00:00 | 6s795ms310µs768ns    | evaluated request                                                        | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.583413+00:00 | 6s795ms355µs946ns    | acquired {raft,replica}mu                                                | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.585453+00:00 | 6s797ms395µs860ns    | retry proposal 7ce5ca9e7298e9d4: reasonNewLeaderOrConfigChange           | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.585601+00:00 | 6s797ms544µs811ns    | applied timestamp cache                                                  | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.58639+00:00  | 6s798ms333µs721ns    | evaluated request                                                        | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.586431+00:00 | 6s798ms373µs989ns    | acquired {raft,replica}mu                                                | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
| 2018-06-04 19:47:16.589585+00:00 | 6s801ms528µs468ns    | command proposed from replica (n1,s1):1 with lease #3 incompatible       | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
|                                  |                      | to repl=(n1,s1):1 seq=4 start=1528141568.365333322,0 epo=3               |                                                                                                                       |                                 |                                                       |      |
|                                  |                      | pro=1528141633.385683914,0                                               |                                                                                                                       |                                 |                                                       |      |
| 2018-06-04 19:47:16.58965+00:00  | 6s801ms593µs206ns    | applying command with forced error: cannot replace lease repl=(n1,s1):1  | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
|                                  |                      | seq=4 start=1528141568.365333322,0 epo=3 pro=1528141633.385683914,0      |                                                                                                                       |                                 |                                                       |      |
|                                  |                      | with repl=(n1,s1):1 seq=4 start=1528141568.365333322,0 epo=3             |                                                                                                                       |                                 |                                                       |      |
|                                  |                      | pro=1528141633.385683914,0: proposed under invalid lease                 |                                                                                                                       |                                 |                                                       |      |
| 2018-06-04 19:47:16.590403+00:00 | 6s802ms346µs692ns    | replica.Send got error: cannot replace lease repl=(n1,s1):1 seq=4        | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | [async] storage.pendingLeaseRequest: requesting lease |   67 |
|                                  |                      | start=1528141568.365333322,0 epo=3 pro=1528141633.385683914,0            |                                                                                                                       |                                 |                                                       |      |
|                                  |                      | with repl=(n1,s1):1 seq=4 start=1528141568.365333322,0 epo=3             |                                                                                                                       |                                 |                                                       |      |
|                                  |                      | pro=1528141633.385683914,0: proposed under invalid lease                 |                                                                                                                       |                                 |                                                       |      |
| 2018-06-04 19:47:16.590588+00:00 | 6s802ms530µs888ns    | read has no clock uncertainty                                            | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | /cockroach.roachpb.Internal/Batch                     |   65 |
| 2018-06-04 19:47:16.590613+00:00 | 6s802ms556µs777ns    | command queue                                                            | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | /cockroach.roachpb.Internal/Batch                     |   65 |
| 2018-06-04 19:47:16.590658+00:00 | 6s802ms600µs991ns    | waiting for read lock                                                    | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | /cockroach.roachpb.Internal/Batch                     |   65 |
| 2018-06-04 19:47:16.590753+00:00 | 6s802ms696µs119ns    | read completed                                                           | [n1,s1,r41/1:/Table/52/1/1{700-800}]                                                                                  |                                 | /cockroach.roachpb.Internal/Batch                     |   65 |
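
For scale: serializing one such ~3s stall per range across the 100 ranges in this table adds up to roughly five minutes, which matches the "multiple minutes" symptom above. A back-of-the-envelope sketch (the ~3s stall and the 100-range count are figures from this issue, not from the code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Figures taken from this issue: the test table has ~100 ranges, and each
	// lease acquisition sits out roughly the 3s raft ticks timeout before the
	// reasonTicks retry finally goes through.
	const numRanges = 100
	const perRangeStall = 3 * time.Second

	// The trace shows the LeaseInfo requests being issued serially, so the
	// per-range stalls add up instead of overlapping.
	fmt.Println(perRangeStall * numRanges) // 5m0s
}
```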

@a-robinson (Contributor, Author)

I'm still seeing the problem even after reverting your 4 recent unquiesced replica changes. Do you have any thoughts on what to look into? The initial election for each raft group appears to hit a stalemate where each replica votes for itself:

[screenshot: 2018-06-04, 5:10:53 PM]

And then I guess it takes until reasonTicks fires for a new election to get kicked off? After that things move reasonably quickly.

This would mostly be a non-issue if batches were sent to more/all ranges in parallel (which I'd at least expect for SELECT * FROM foo;), but it still seems undesirable.
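
To make the suspected stalemate concrete, here is a toy model of the raft voting rule (at most one vote granted per term). It is an illustration only, not CockroachDB or etcd/raft code: if all three replicas campaign for the same term before stepping anyone else's vote request, each one locks in a vote for itself and has to reject the other two, so no candidate reaches a 2-of-3 quorum until a later timeout starts a fresh election.

```go
package main

import "fmt"

type node struct {
	id       int
	term     uint64
	votedFor int // 0 means no vote cast in the current term
}

// campaign models a replica starting an election: bump the term and vote for itself.
func (n *node) campaign() {
	n.term++
	n.votedFor = n.id
}

// handleVoteRequest models the raft voting rule: at most one vote per term.
func (n *node) handleVoteRequest(term uint64, candidate int) bool {
	if term > n.term {
		n.term, n.votedFor = term, 0
	}
	if term == n.term && (n.votedFor == 0 || n.votedFor == candidate) {
		n.votedFor = candidate
		return true
	}
	return false
}

func main() {
	nodes := []*node{{id: 1}, {id: 2}, {id: 3}}

	// All three replicas wake up and campaign before processing any messages.
	for _, n := range nodes {
		n.campaign()
	}

	// Each candidate's vote requests now reach the others, and every request
	// is rejected because the recipient already voted for itself this term.
	for _, cand := range nodes {
		granted := 1 // its own vote
		for _, n := range nodes {
			if n != cand && n.handleVoteRequest(cand.term, cand.id) {
				granted++
			}
		}
		fmt.Printf("candidate %d: %d/3 votes (quorum needs 2)\n", cand.id, granted)
	}
}
```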

@bdarnell (Contributor) commented Jun 4, 2018

Which specific ones did you revert? This stalemate sounds like it could be caused by 786954d or 44e3977.

@a-robinson (Contributor, Author)

Ah, I had missed that PR, but even after reverting those commits I'm still seeing the problem. I've now reverted #26297, #26291, #26269, #24956, and #24920, and it's still happening. It does not reproduce on 2.0.0.

@a-robinson (Contributor, Author)

Wait, never mind. After reverting all of #24920 and restarting a couple of times, the queries now succeed quickly after the restart. I'm not sure why my first couple of attempts after reverting #24920 were still taking over a minute, but it may be because I wasn't waiting as long between stopping and restarting the cluster as I had been before.

Are you up for taking over? The repro steps are: create a 3-node local cluster, create a simple table with 100 ranges in it as shown in the initial comment, kill the cluster, wait until the last node has finished "gracefully" shutting down, then restart it and run select * from yourtable. It should return in no more than a couple of seconds; in cases when it isn't working, it takes minutes.

a-robinson assigned bdarnell and unassigned a-robinson Jun 4, 2018
tbg added this to On the horizon in KV Jun 5, 2018
@bdarnell (Contributor) commented Jun 5, 2018

The initial election for each raft group appears to hit a stalemate where each replica votes for itself:

And yet they're all in the follower state, not candidate. I think what's happening is that the range is being awakened for an incoming MsgVote, but before it processes that message it initiates its own campaign (committing its vote for that term and sending out its own MsgVotes), and then it has to reject the incoming vote request because it has already voted for itself.
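
The fix that eventually landed in #26463 (quoted in the merge below) disables that campaign-on-wake in one spot. A rough sketch of the gating condition, under stated assumptions: replica here is a pared-down stand-in for the real storage.Replica, the method name stepWakeMessage is made up for illustration, and the etcd raft import path is the one in use around that time.

```go
package storage // illustrative stand-in, not the real package

import (
	"log"

	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// replica is a pared-down stand-in for storage.Replica with only the pieces
// this sketch needs.
type replica struct {
	quiescent bool
	raftGroup *raft.RawNode
}

// stepWakeMessage sketches the shape of the eventual fix: when a quiesced
// follower is woken by an incoming raft message, skip the usual
// campaign-on-wake if that message is a MsgVote, so the follower grants the
// incoming vote instead of first voting for itself.
func (r *replica) stepWakeMessage(msg raftpb.Message) error {
	if r.quiescent {
		r.quiescent = false
		// Waking up for someone else's vote request means an election is
		// already underway; campaigning here splits the vote and produces the
		// "everyone votes for itself" stalemate described above.
		if msg.Type != raftpb.MsgVote {
			if err := r.raftGroup.Campaign(); err != nil {
				log.Printf("campaign failed: %v", err)
			}
		}
	}
	return r.raftGroup.Step(msg)
}
```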

@bdarnell (Contributor) commented Jun 5, 2018

I've written this up as a roachtest in this commit. It only fails about half the time for me (in local mode on my laptop), making it annoying to figure out whether my change has made a difference.

@tbg (Member) commented Jun 6, 2018

I think I'm seeing the same thing restarting a (remote) three node cluster with 100k ranges:

[image]

The kv load I'm running hovers at 200qps and is slowly climbing, but it used to do over 1000qps before I restarted the cluster.

It eventually works itself out, with "other" messages dropping to basically zero:

[image]

bdarnell added a commit to bdarnell/cockroach that referenced this issue Jun 6, 2018
When the incoming message is a MsgVote (which is likely the case for
the first message received by a quiesced follower), immediate
campaigning will cause the election to fail.

This is similar to reverting commit 44e3977, but only disables
campaigning in one location.

Fixes cockroachdb#26391

Release note: None
craig bot pushed a commit that referenced this issue Jun 6, 2018
26441: distsql: add NewFinalIterator to the rowIterator interface r=asubiotto a=asubiotto

Some implementations of the rowIterator interface would destroy rows as
they were iterated over to free memory eagerly. NewFinalIterator is
introduced in this change to provide non-reusable behavior and
NewIterator is explicitly described as reusable.

A reusable iterator has been added to the memRowContainer to satisfy
these new interface semantics.

Release note: None

26463: storage: Disable campaign-on-wake when receiving raft messages r=bdarnell a=bdarnell

When the incoming message is a MsgVote (which is likely the case for
the first message received by a quiesced follower), immediate
campaigning will cause the election to fail.

This is similar to reverting commit 44e3977, but only disables
campaigning in one location.

Fixes #26391

Release note: None

26469: lint: Fix a file descriptor leak r=bdarnell a=bdarnell

This leak is enough to cause `make lintshort` to fail when run under the
default file descriptor limit on macOS (256).
default file descriptor limit on macos (256).

Release note: None

26470: build: Pin go.uuid to the version currently in use r=bdarnell a=bdarnell

Updates #26332

Release note: None

Co-authored-by: Alfonso Subiotto Marqués <alfonso@cockroachlabs.com>
Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
craig bot closed this as completed in #26463 Jun 6, 2018
KV automation moved this from On the horizon to Finished (milestone 2, ends 6/25) Jun 6, 2018
@tbg (Member) commented Jun 12, 2018

Tried this again for #26448 -- sometimes 1s, but a run just now took 2m16s... I don't think we can call this closed.

tbg reopened this Jun 12, 2018
KV automation moved this from Finished (milestone 2, ends 6/25) to Milestone 2 Jun 12, 2018
@tbg (Member) commented Jun 12, 2018

Hmm, very confusing. I wrote a bash script to repro this, but the last couple of runs were all pretty fast.

(Note: I edited the script to insert a sleep before creating the table, to allow for upreplication; it didn't seem to change anything.)

#!/bin/bash

set -euxo pipefail

rm -rf cockroach-data
killall -9 cockroach || true

function start {
	./cockroach start --insecure --port=26259 --http-port=26260 --store=cockroach-data/2 --join=localhost:26257 &
	./cockroach start --insecure --port=26261 --http-port=26262 --store=cockroach-data/3 --join=localhost:26257 &
	./cockroach start --insecure --port=26257 --http-port=26258 --store=cockroach-data/1 --background
}

start

# Allow upreplication of initial ranges (and by extension all splits below).
sleep 20

./cockroach sql --insecure <<EOF
create table foo (k int primary key, v int); alter table foo split at select generate_series(0, 10000, 100);
EOF

killall -9 cockroach
start

./cockroach sql --insecure <<EOF
show trace for select * from foo;
EOF

tbg moved this from Milestone 2 to Backlog in KV Jun 12, 2018
tbg added the C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. label Jul 22, 2018
tbg added this to the 2.1 milestone Jul 22, 2018
bdarnell modified the milestones: 2.1, 2.2 Aug 15, 2018
petermattis removed this from the 2.2 milestone Oct 5, 2018
@tbg (Member) commented Oct 11, 2018

What I was seeing here is likely along the lines of #30613 and we're fixing all of these bugs right now.

tbg closed this as completed Oct 11, 2018
KV automation moved this from Backlog to Closed Oct 11, 2018