
roachtest: election-after-restart failed #30613

Closed
cockroach-teamcity opened this issue Sep 25, 2018 · 23 comments · Fixed by #31013

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/b2bd8e8b6446a566b667be6094e019c1040ed98d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=923560&tab=buildLog

The test failed on master:
	test.go:498,cluster.go:890,election.go:60: /home/agent/work/.go/bin/roachprod run teamcity-923560-election-after-restart:1 -- ./cockroach sql --insecure -e "
		        SELECT * FROM test.kv" returned:
		stderr:
		
		stdout:
		Error: pq: unable to dial n2: breaker open
		Failed running "sql"
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Sep 25, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Sep 25, 2018

tbg commented Sep 25, 2018

These errors likely originate in DistSQL. I assume they get sent back to the client verbatim (otherwise there would be a structured error return at the matches below, and I've checked that there isn't one).

$ rg -F -g '!*_test.go' 'nodeDialer.Dial('
pkg/sql/distsql_running.go
65:	conn, err := req.nodeDialer.Dial(req.ctx, req.nodeID)

pkg/sql/distsqlrun/outbox.go
211:			conn, err = m.flowCtx.nodeDialer.Dial(ctx, m.nodeID)

DistSender definitely works differently: it wraps these in either a SendError or an AmbiguousResultError, but won't leak them verbatim (see (*DistSender).sendToReplicas).
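
For illustration, here is a minimal, self-contained Go sketch (toy names; it is neither the actual outbox nor the actual DistSender code) contrasting the two behaviors: returning the dial error verbatim vs. wrapping it as a send-level failure that callers can treat as retryable.

package main

import (
	"errors"
	"fmt"
)

// errBreakerOpen stands in for the circuit-breaker error returned by the dialer.
var errBreakerOpen = errors.New("unable to dial n2: breaker open")

// sendError mirrors the idea of DistSender's SendError: a wrapper marking the
// failure as a transport-level problem rather than a query-level error.
type sendError struct{ cause error }

func (e *sendError) Error() string {
	return fmt.Sprintf("failed to send RPC: %v", e.cause)
}

// dialVerbatim models the DistSQL outbox path: the raw dialer error propagates
// straight back to the SQL client ("pq: unable to dial n2: breaker open").
func dialVerbatim() error { return errBreakerOpen }

// dialWrapped models the DistSender-style path: the dial failure is wrapped so
// the caller can recognize it and retry elsewhere instead of surfacing it.
func dialWrapped() error { return &sendError{cause: errBreakerOpen} }

func main() {
	fmt.Println("verbatim:", dialVerbatim())
	fmt.Println("wrapped: ", dialWrapped())
}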

What is it that we want to happen here in a DistSQL world?

cc @jordanlewis

@tbg tbg assigned jordanlewis and unassigned bdarnell Sep 25, 2018
@tbg tbg added the A-sql-execution Relating to SQL execution. label Sep 25, 2018
@tbg tbg modified the milestones: 2.2, 2.1 Sep 25, 2018
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/9f191cdad64747f7f19ec4990e559cbb19e6d372

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=928248&tab=buildLog

The test failed on master:
	test.go:498,cluster.go:890,election.go:60: /home/agent/work/.go/bin/roachprod run teamcity-928248-election-after-restart:1 -- ./cockroach sql --insecure -e "
		        SELECT * FROM test.kv" returned:
		stderr:
		
		stdout:
		Error: pq: unable to dial n2: breaker open
		Failed running "sql"
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/137b282943695b8d2aabcbd5d69be85b58746acd

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=933504&tab=buildLog

The test failed on master:
	test.go:503,election.go:72: expected query to succeed in less than 15s, took 1m9.906603757s

tbg added a commit to tbg/cockroach that referenced this issue Oct 1, 2018
craig bot pushed a commit that referenced this issue Oct 1, 2018
30384: workload/tpcc: adjust audit checks for 1 warehouse r=solongordon a=solongordon

I updated the TPCC audit checks not to complain about a lack of remote
orders/payments when there is only one warehouse.

Fixes #29275

Release note: None

30817: roachtest: trace slow query in election test r=benesch a=tschottdorf

See
#30613 (comment).

Touches #30613

Release note: None

30820: roachtest: update comment on scaledata tests r=petermattis a=tschottdorf

We're already running them under basic chaos.

Release note: None

30822: Update issue templates r=knz a=knz

I took the proposed default GitHub templates and "blended" our existing issue template into them.

I also created a new "performance inquiry" template given the large number of incoming inquiries that are miscategorized as bugs.

30825: rpc,server: authenticate all gRPC methods r=bdarnell,petermattis a=benesch

Previously only the roachpb.Batch RPC was correctly checking for an
authenticated user. All other RPCs were open to the public, even when
the server was running in secure mode.

To prevent future accidents of this kind, hoist the authentication check
to a gRPC interceptor that is guaranteed to run before all RPCs.

Release note (bug fix): A security vulnerability in which data could be
leaked from or tampered with in a cluster in secure mode has been fixed.
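
For reference, a minimal sketch of the interceptor approach described above, written against plain gRPC for Go; checkAuth and newServer are hypothetical stand-ins, not the actual CockroachDB code.

package server

import (
	"context"
	"errors"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// checkAuth is a placeholder for whatever certificate/user check the server performs.
func checkAuth(ctx context.Context) error {
	// ... inspect the peer TLS/user info carried in ctx ...
	return errors.New("no authenticated user")
}

// authUnaryInterceptor runs before every unary RPC handler, so no individual
// RPC can accidentally skip the authentication check. A matching
// grpc.StreamInterceptor would cover streaming RPCs.
func authUnaryInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	if err := checkAuth(ctx); err != nil {
		return nil, status.Error(codes.Unauthenticated, err.Error())
	}
	return handler(ctx, req)
}

// newServer installs the interceptor once, covering all registered services.
func newServer() *grpc.Server {
	return grpc.NewServer(grpc.UnaryInterceptor(authUnaryInterceptor))
}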

Co-authored-by: Solon Gordon <solon@cockroachlabs.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: kena <knz@users.noreply.github.com>
Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Oct 2, 2018
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/f78cc21bb4d7bd3a79e44b4608c04684edf28179

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=939960&tab=buildLog

The test failed on master:
	test.go:500,election.go:67: /home/agent/work/.go/bin/roachprod run teamcity-939960-election-after-restart:1 -- ./cockroach sql --insecure -e "
		SET TRACING = on;
		SELECT * FROM test.kv;
		SET TRACING = off;
		SHOW TRACE FOR SESSION;
		": exit status 1


tbg commented Oct 3, 2018

Repro'ed in a few minutes of running

while [ $? -eq 0 ]; do rm -rf ~/artifacts ; ./bin/roachtest run --local election-after-restart; done

on gceworker, with the output intact (not as above).

Artifacts attached: out.zip

Has the trace in test.log.


tbg commented Oct 3, 2018

Looks like the only thing we really need to look at is the following. There are basically two ranges that return RangeNotFound for significant periods of time (i.e. dozens of seconds).

$ rg 'r[0-9]+ was not found' test.log
967:2018-10-03 08:00:01.91582+00:00	108ms944\302\265s18ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
1211:2018-10-03 08:00:01.967777+00:00	160ms900\302\265s749ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
1806:2018-10-03 08:00:02.08134+00:00	274ms463\302\265s658ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
2913:2018-10-03 08:00:02.290029+00:00	483ms152\302\265s889ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
3125:2018-10-03 08:00:02.633424+00:00	826ms547\302\265s948ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
3166:2018-10-03 08:00:03.385153+00:00	1s578ms277\302\265s307ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
3207:2018-10-03 08:00:04.481621+00:00	2s674ms744\302\265s954ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
3248:2018-10-03 08:00:05.620749+00:00	3s813ms873\302\265s157ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
3289:2018-10-03 08:00:06.717929+00:00	4s911ms52\302\265s906ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471

[ ... many similar lines emitted at 1s spacing ...]

7107:2018-10-03 08:01:35.076396+00:00	1m33s269ms520\302\265s253ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7148:2018-10-03 08:01:35.133587+00:00	1m33s326ms711\302\265s111ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7189:2018-10-03 08:01:35.239665+00:00	1m33s432ms788\302\265s920ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7230:2018-10-03 08:01:35.438885+00:00	1m33s632ms8\302\265s945ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7271:2018-10-03 08:01:35.827366+00:00	1m34s20ms489\302\265s626ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7312:2018-10-03 08:01:36.623312+00:00	1m34s816ms435\302\265s591ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7353:2018-10-03 08:01:37.477719+00:00	1m35s670ms843\302\265s89ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7397:2018-10-03 08:01:38.467285+00:00	1m36s660ms408\302\265s731ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7438:2018-10-03 08:01:39.42067+00:00	1m37s613ms794\302\265s460ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7479:2018-10-03 08:01:40.36893+00:00	1m38s562ms53\302\265s647ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7520:2018-10-03 08:01:41.501695+00:00	1m39s694ms818\302\265s812ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7561:2018-10-03 08:01:42.458582+00:00	1m40s651ms705\302\265s813ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7602:2018-10-03 08:01:43.526777+00:00	1m41s719ms901\302\265s462ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7643:2018-10-03 08:01:44.440548+00:00	1m42s633ms671\302\265s873ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7684:2018-10-03 08:01:45.323977+00:00	1m43s517ms101\302\265s51ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7725:2018-10-03 08:01:46.307719+00:00	1m44s500ms842\302\265s711ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7766:2018-10-03 08:01:47.387228+00:00	1m45s580ms352\302\265s161ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7810:2018-10-03 08:01:48.390339+00:00	1m46s583ms463\302\265s318ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7851:2018-10-03 08:01:49.245899+00:00	1m47s439ms23\302\265s495ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7892:2018-10-03 08:01:50.278959+00:00	1m48s472ms82\302\265s974ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7933:2018-10-03 08:01:51.167815+00:00	1m49s360ms939\302\265s24ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7974:2018-10-03 08:01:52.160355+00:00	1m50s353ms479\302\265s388ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
8019:2018-10-03 08:01:53.073666+00:00	1m51s266ms790\302\265s492ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
8060:2018-10-03 08:01:54.001229+00:00	1m52s194ms353\302\265s99ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
8101:2018-10-03 08:01:54.942158+00:00	1m53s135ms281\302\265s899ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471


tbg commented Oct 3, 2018

Got this error spuriously too (probably what happened in the nightly failure above; I tweaked the test to report stderr in that case):

--- FAIL: election-after-restart (13.91s)
	test.go:505,election.go:67: Error: cannot dial server.
		Is the server running?
		If the server is running, check --host client-side and --advertise server-side.

		read tcp 127.0.0.1:51928->127.0.0.1:26257: i/o timeout
		Failed running "sql"
		Error:  exit status 1


		/home/tschottdorf/go/bin/roachprod run local:1 -- ./cockroach sql --insecure -e "
		SET TRACING = on;
		SELECT * FROM test.kv;
		SET TRACING = off;
		SHOW TRACE FOR SESSION;
		": exit status 1


tbg commented Oct 3, 2018

Heh, I increased the connection timeout; now I see

--- FAIL: election-after-restart (3.88s)
	test.go:505,election.go:67: Error: pq: unable to dial n2: breaker open
		Failed running "sql"
		Error:  exit status 1


		/home/tschottdorf/go/bin/roachprod run local:1 -- ./cockroach sql --insecure -e "
		SET TRACING = on;
		SELECT * FROM test.kv;
		SET TRACING = off;
		SHOW TRACE FOR SESSION;
		": exit status 1

cc @petermattis that thing really is everywhere.


tbg commented Oct 3, 2018

Meh, now it's just sitting there and not repro'ing anything. Well. It'll fail eventually.

@petermattis

cc @petermattis that thing really is everywhere.

Ack. I'm working on it.


tbg commented Oct 3, 2018

Still not failing (except with the occasional breaker open). I must've gotten really lucky the first time around.


tbg commented Oct 3, 2018

Finally got one. The RangeNotFoundErrors all originate here:

// getReplicaDescriptorRLocked is like getReplicaDescriptor, but assumes that
// r.mu is held for either reading or writing.
func (r *Replica) getReplicaDescriptorRLocked() (roachpb.ReplicaDescriptor, error) {
	repDesc, ok := r.mu.state.Desc.GetReplicaDescriptor(r.store.StoreID())
	if ok {
		return repDesc, nil
	}
	return roachpb.ReplicaDescriptor{}, roachpb.NewRangeNotFoundError(r.RangeID)
}

This gets called in lots of places, but I think we're getting it from the lease renewal code. This smells like replica removal followed by waiting for replica GC, but in a 3x cluster? Time to sleuth some logs.


tbg commented Oct 5, 2018

I looked at another couple of repros (which, by the way, now happen every couple of minutes) but didn't catch the snapshot thing again. Oh well. I assume it's unrelated and a sign that when a lot of Raft log is created right after cluster start, we may sometimes truncate too aggressively (maybe the Raft status isn't fully populated at that point).

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/dc0e73c728e533fdb3bec63e53eec174e920ff22

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=952220&tab=buildLog

The test failed on release-2.1:
	test.go:570,election.go:67: /home/agent/work/.go/bin/roachprod run teamcity-952220-election-after-restart:1 -- ./cockroach sql --insecure -e "
		SET TRACING = on;
		SELECT * FROM test.kv;
		SET TRACING = off;
		SHOW TRACE FOR SESSION;
		": exit status 1

@tbg tbg assigned tbg and unassigned jordanlewis Oct 9, 2018
@tbg tbg removed the A-sql-execution Relating to SQL execution. label Oct 9, 2018
@tbg tbg added this to Incoming in KV via automation Oct 9, 2018
tbg added a commit to tbg/cockroach that referenced this issue Oct 9, 2018
Previously, if a Batch RPC came back with a RangeNotFoundError,
we would immediately stop trying to send to more replicas, evict the
range descriptor, and start a new attempt after a back-off.

This new attempt could end up using the same replica, so if the
RangeNotFoundError persisted for some amount of time, so would the
unsuccessful retries for requests to it as DistSender doesn't
aggressively shuffle the replicas.

It turns out that there are such situations, and the
election-after-restart roachtest spuriously hit one of them:

1. new replica receives a preemptive snapshot and the ConfChange
2. cluster restarts
3. now the new replica is in this state until the range wakes
   up, which may not happen for some time.
4. the first request to the range runs into the above problem

@nvanbenschoten: I think there is an issue to be filed about the
tendency of DistSender to get stuck in unfortunate configurations.

Fixes cockroachdb#30613.

Release note (bug fix): Avoid repeatedly trying a replica that was found
to be in the process of being added.
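
To make the described retry change concrete, here is a minimal, runnable Go sketch (toy types, not the actual DistSender code) of trying the next replica on RangeNotFoundError instead of aborting the attempt.

package main

import (
	"errors"
	"fmt"
)

// rangeNotFoundError stands in for roachpb.RangeNotFoundError: the replica is
// reachable (it may only have a preemptive snapshot) but doesn't know the range.
type rangeNotFoundError struct{ rangeID int }

func (e *rangeNotFoundError) Error() string {
	return fmt.Sprintf("r%d was not found", e.rangeID)
}

// replica is a toy stand-in that either serves the request or reports that it
// doesn't know the range.
type replica struct {
	name     string
	hasRange bool
}

func (r replica) send() (string, error) {
	if !r.hasRange {
		return "", &rangeNotFoundError{rangeID: 39}
	}
	return "reply from " + r.name, nil
}

// sendToReplicas sketches the fixed behavior: a RangeNotFoundError no longer
// ends the attempt; we simply move on to the next replica in the list.
func sendToReplicas(replicas []replica) (string, error) {
	var lastErr error
	for _, r := range replicas {
		reply, err := r.send()
		if err == nil {
			return reply, nil
		}
		var rnfe *rangeNotFoundError
		if errors.As(err, &rnfe) {
			// Previously: evict the descriptor and back off, which could hit
			// the same replica again on the next attempt.
			lastErr = err
			continue
		}
		return "", err
	}
	return "", fmt.Errorf("all replicas tried, last error: %w", lastErr)
}

func main() {
	replicas := []replica{
		{name: "n2", hasRange: false}, // woke up knowing only the preemptive snapshot
		{name: "n1", hasRange: true},
	}
	fmt.Println(sendToReplicas(replicas))
}
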
@tbg tbg modified the milestones: 2.1, 2.2 Oct 9, 2018
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/ac2f39fcc6be7366bc786d231890ee91e84f1c3c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=955173&tab=buildLog

The test failed on master:
	test.go:570,election.go:67: /home/agent/work/.go/bin/roachprod run teamcity-955173-election-after-restart:1 -- ./cockroach sql --insecure -e "
		SET TRACING = on;
		SELECT * FROM test.kv;
		SET TRACING = off;
		SHOW TRACE FOR SESSION;
		": exit status 1

tbg added a commit to tbg/cockroach that referenced this issue Oct 10, 2018
craig bot pushed a commit that referenced this issue Oct 10, 2018
31013: kv: try next replica on RangeNotFoundError r=nvanbenschoten,bdarnell a=tschottdorf

Previously, if a Batch RPC came back with a RangeNotFoundError, we would
immediately stop trying to send to more replicas, evict the range
descriptor, and start a new attempt after a back-off.

This new attempt could end up using the same replica, so if the
RangeNotFoundError persisted for some amount of time, so would the
unsuccessful retries for requests to it as DistSender doesn't aggressively
shuffle the replicas.

It turns out that there are such situations, and the election-after-restart
roachtest spuriously hit one of them:

1. new replica receives a preemptive snapshot and the ConfChange
2. cluster restarts
3. now the new replica is in this state until the range wakes
   up, which may not happen for some time.
4. the first request to the range runs into the above problem

@nvanbenschoten: I think there is an issue to be filed about the tendency
of DistSender to get stuck in unfortunate configurations.

Fixes #30613.

Release note (bug fix): Avoid repeatedly trying a replica that was found to
be in the process of being added.

31187: roachtest: add synctest r=bdarnell a=tschottdorf

This new roachtest sets up charybdefs on a single (Ubuntu) node and runs the
`synctest` CLI command against a nemesis that injects random I/O errors.

The synctest command is new. It simulates a Raft log and can be directed at a
filesystem that is being hit with random failures.

The workload essentially writes ascending keys (flushing each one to disk
synchronously) until an I/O error occurs, at which point it re-opens the
instance to verify that all persisted writes are still there. If the
RocksDB instance was permanently corrupted, it switches to a new, pristine
directory.
This is used in the roachtest, but it is also useful for manual runs in user
deployments where we suspect a failure to persist data to disk.
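
As a rough illustration, here is a self-contained Go sketch of that write-until-error-then-verify loop; toyStore and the injected error are hypothetical stand-ins, not the real synctest implementation.

package main

import (
	"errors"
	"fmt"
)

// toyStore is a stand-in for the RocksDB instance that synctest writes to; it
// injects an I/O error after a fixed number of synced writes.
type toyStore struct {
	persisted int
	failAfter int
}

func (s *toyStore) syncPut(key int) error {
	if key > s.failAfter {
		return errors.New("injected I/O error")
	}
	s.persisted = key // pretend the key was flushed to disk
	return nil
}

func (s *toyStore) reopen() int { return s.persisted }

// run writes ascending keys until an I/O error occurs, then re-opens the store
// and checks that every write acknowledged as synced is still there.
func run(s *toyStore) error {
	written := 0
	for {
		if err := s.syncPut(written + 1); err != nil {
			break
		}
		written++
	}
	if got := s.reopen(); got < written {
		return fmt.Errorf("lost writes: persisted up to %d, expected %d", got, written)
	}
	return nil
}

func main() {
	fmt.Println(run(&toyStore{failAfter: 100})) // prints <nil>: no acknowledged writes were lost
}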

This hasn't found anything, but it's fun to watch and also shows us a
number of errors that we know and love from Sentry.

Release note: None

31215: storage: deflake TestStoreRangeMergeWatcher r=tschottdorf a=benesch

This test could deadlock if the LHS replica on store2 was shut down
before it processed the split at "b". Teach the test to wait for the LHS
replica on store2 to process the split before blocking Raft traffic to
it.

Fixes #31096.
Fixes #31149.
Fixes #31160.
Fixes #31167.

Release note: None

31217: importccl: add explicit default to mysql testdata timestamp r=dt a=dt

This makes the testdata work on MySQL 8.0.2+, where the timestamp type no longer has implicit defaults.

Release note: none.

31221: cluster: Create final cluster version for 2.1 r=bdarnell a=bdarnell

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Ben Darnell <ben@bendarnell.com>
@craig craig bot closed this as completed in #31013 Oct 10, 2018
KV automation moved this from Incoming to Closed Oct 10, 2018
tbg added a commit to tbg/cockroach that referenced this issue Oct 11, 2018
tbg added a commit to tbg/cockroach that referenced this issue Oct 12, 2018
tbg added a commit to tbg/cockroach that referenced this issue Oct 15, 2018