
roachtest: election-after-restart failed #30613

Closed
cockroach-teamcity opened this issue Sep 25, 2018 · 23 comments · Fixed by #31013

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/b2bd8e8b6446a566b667be6094e019c1040ed98d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=923560&tab=buildLog

The test failed on master:
	test.go:498,cluster.go:890,election.go:60: /home/agent/work/.go/bin/roachprod run teamcity-923560-election-after-restart:1 -- ./cockroach sql --insecure -e "
		        SELECT * FROM test.kv" returned:
		stderr:
		
		stdout:
		Error: pq: unable to dial n2: breaker open
		Failed running "sql"
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Sep 25, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Sep 25, 2018

tbg commented Sep 25, 2018

These errors likely originate in DistSQL. I assume they get sent back to the client verbatim (otherwise there would be a structured error return at the matches below, and I've checked that there isn't one).

$ rg -F -g '!*_test.go' 'nodeDialer.Dial('
pkg/sql/distsql_running.go
65:	conn, err := req.nodeDialer.Dial(req.ctx, req.nodeID)

pkg/sql/distsqlrun/outbox.go
211:			conn, err = m.flowCtx.nodeDialer.Dial(ctx, m.nodeID)

DistSender definitely works differently: it wraps these in either a SendError or an AmbiguousResultError, but won't leak them verbatim (see (*DistSender).sendToReplicas).
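
For illustration, here is a minimal, self-contained Go sketch (toy names; it is neither the actual outbox nor the actual DistSender code) contrasting the two behaviors: returning the dial error verbatim vs. wrapping it as a send-level failure that callers can treat as retryable.

package main

import (
	"errors"
	"fmt"
)

// errBreakerOpen stands in for the circuit-breaker error returned by the dialer.
var errBreakerOpen = errors.New("unable to dial n2: breaker open")

// sendError mirrors the idea of DistSender's SendError: a wrapper marking the
// failure as a transport-level problem rather than a query-level error.
type sendError struct{ cause error }

func (e *sendError) Error() string {
	return fmt.Sprintf("failed to send RPC: %v", e.cause)
}

// dialVerbatim models the DistSQL outbox path: the raw dialer error propagates
// straight back to the SQL client ("pq: unable to dial n2: breaker open").
func dialVerbatim() error { return errBreakerOpen }

// dialWrapped models the DistSender-style path: the dial failure is wrapped so
// the caller can recognize it and retry elsewhere instead of surfacing it.
func dialWrapped() error { return &sendError{cause: errBreakerOpen} }

func main() {
	fmt.Println("verbatim:", dialVerbatim())
	fmt.Println("wrapped: ", dialWrapped())
}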

What is it that we want to happen here in a DistSQL world?

cc @jordanlewis

@tbg tbg assigned jordanlewis and unassigned bdarnell Sep 25, 2018
@tbg tbg added the A-sql-execution Relating to SQL execution. label Sep 25, 2018
@tbg tbg modified the milestones: 2.2, 2.1 Sep 25, 2018
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/9f191cdad64747f7f19ec4990e559cbb19e6d372

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=928248&tab=buildLog

The test failed on master:
	test.go:498,cluster.go:890,election.go:60: /home/agent/work/.go/bin/roachprod run teamcity-928248-election-after-restart:1 -- ./cockroach sql --insecure -e "
		        SELECT * FROM test.kv" returned:
		stderr:
		
		stdout:
		Error: pq: unable to dial n2: breaker open
		Failed running "sql"
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/137b282943695b8d2aabcbd5d69be85b58746acd

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=933504&tab=buildLog

The test failed on master:
	test.go:503,election.go:72: expected query to succeed in less than 15s, took 1m9.906603757s

tbg added a commit to tbg/cockroach that referenced this issue Oct 1, 2018
craig bot pushed a commit that referenced this issue Oct 1, 2018
30384: workload/tpcc: adjust audit checks for 1 warehouse r=solongordon a=solongordon

I updated the TPCC audit checks not to complain about a lack of remote
orders/payments when there is only one warehouse.

Fixes #29275

Release note: None

30817: roachtest: trace slow query in election test r=benesch a=tschottdorf

See
#30613 (comment).

Touches #30613

Release note: None

30820: roachtest: update comment on scaledata tests r=petermattis a=tschottdorf

We're already running them under basic chaos.

Release note: None

30822: Update issue templates r=knz a=knz

I took the proposed default GitHub templates and "blended" our existing issue template into them.

I also created a new "performance inquiry" template given the large number of incoming inquiries that are miscategorized as bugs.

30825: rpc,server: authenticate all gRPC methods r=bdarnell,petermattis a=benesch

Previously only the roachpb.Batch RPC was correctly checking for an
authenticated user. All other RPCs were open to the public, even when
the server was running in secure mode.

To prevent future accidents of this kind, hoist the authentication check
to a gRPC interceptor that is guaranteed to run before all RPCs.

Release note (bug fix): A security vulnerability in which data could be
leaked from or tampered with in a cluster in secure mode has been fixed.
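
For reference, a minimal sketch of the interceptor approach described above, written against plain gRPC for Go; checkAuth and newServer are hypothetical stand-ins, not the actual CockroachDB code.

package server

import (
	"context"
	"errors"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// checkAuth is a placeholder for whatever certificate/user check the server performs.
func checkAuth(ctx context.Context) error {
	// ... inspect the peer TLS/user info carried in ctx ...
	return errors.New("no authenticated user")
}

// authUnaryInterceptor runs before every unary RPC handler, so no individual
// RPC can accidentally skip the authentication check. A matching
// grpc.StreamInterceptor would cover streaming RPCs.
func authUnaryInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	if err := checkAuth(ctx); err != nil {
		return nil, status.Error(codes.Unauthenticated, err.Error())
	}
	return handler(ctx, req)
}

// newServer installs the interceptor once, covering all registered services.
func newServer() *grpc.Server {
	return grpc.NewServer(grpc.UnaryInterceptor(authUnaryInterceptor))
}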

Co-authored-by: Solon Gordon <solon@cockroachlabs.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: kena <knz@users.noreply.github.com>
Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Oct 2, 2018
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/f78cc21bb4d7bd3a79e44b4608c04684edf28179

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=939960&tab=buildLog

The test failed on master:
	test.go:500,election.go:67: /home/agent/work/.go/bin/roachprod run teamcity-939960-election-after-restart:1 -- ./cockroach sql --insecure -e "
		SET TRACING = on;
		SELECT * FROM test.kv;
		SET TRACING = off;
		SHOW TRACE FOR SESSION;
		": exit status 1


tbg commented Oct 3, 2018

Repro'ed in a few minutes of running

while [ $? -eq 0 ]; do rm -rf ~/artifacts ; ./bin/roachtest run --local election-after-restart; done

on gceworker, with the output intact (not as above).

Artifacts attached: out.zip

Has the trace in test.log.


tbg commented Oct 3, 2018

Looks like the only thing we really need to look at is the following. There are basically two ranges that return RangeNotFound for significant periods of time (i.e. dozens of seconds).

$ rg 'r[0-9]+ was not found' test.log
967:2018-10-03 08:00:01.91582+00:00	108ms944\302\265s18ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
1211:2018-10-03 08:00:01.967777+00:00	160ms900\302\265s749ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
1806:2018-10-03 08:00:02.08134+00:00	274ms463\302\265s658ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
2913:2018-10-03 08:00:02.290029+00:00	483ms152\302\265s889ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
3125:2018-10-03 08:00:02.633424+00:00	826ms547\302\265s948ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
3166:2018-10-03 08:00:03.385153+00:00	1s578ms277\302\265s307ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
3207:2018-10-03 08:00:04.481621+00:00	2s674ms744\302\265s954ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
3248:2018-10-03 08:00:05.620749+00:00	3s813ms873\302\265s157ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471
3289:2018-10-03 08:00:06.717929+00:00	4s911ms52\302\265s906ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/1200,/Table/53/1/1300): r39 was not found	[n2,txn=e9c7f45e]		dist sender send	471

[ ... many similar lines emitted at 1s spacing ...]

7107:2018-10-03 08:01:35.076396+00:00	1m33s269ms520\302\265s253ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7148:2018-10-03 08:01:35.133587+00:00	1m33s326ms711\302\265s111ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7189:2018-10-03 08:01:35.239665+00:00	1m33s432ms788\302\265s920ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7230:2018-10-03 08:01:35.438885+00:00	1m33s632ms8\302\265s945ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7271:2018-10-03 08:01:35.827366+00:00	1m34s20ms489\302\265s626ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7312:2018-10-03 08:01:36.623312+00:00	1m34s816ms435\302\265s591ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7353:2018-10-03 08:01:37.477719+00:00	1m35s670ms843\302\265s89ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7397:2018-10-03 08:01:38.467285+00:00	1m36s660ms408\302\265s731ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7438:2018-10-03 08:01:39.42067+00:00	1m37s613ms794\302\265s460ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7479:2018-10-03 08:01:40.36893+00:00	1m38s562ms53\302\265s647ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7520:2018-10-03 08:01:41.501695+00:00	1m39s694ms818\302\265s812ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7561:2018-10-03 08:01:42.458582+00:00	1m40s651ms705\302\265s813ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7602:2018-10-03 08:01:43.526777+00:00	1m41s719ms901\302\265s462ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7643:2018-10-03 08:01:44.440548+00:00	1m42s633ms671\302\265s873ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7684:2018-10-03 08:01:45.323977+00:00	1m43s517ms101\302\265s51ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7725:2018-10-03 08:01:46.307719+00:00	1m44s500ms842\302\265s711ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7766:2018-10-03 08:01:47.387228+00:00	1m45s580ms352\302\265s161ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7810:2018-10-03 08:01:48.390339+00:00	1m46s583ms463\302\265s318ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7851:2018-10-03 08:01:49.245899+00:00	1m47s439ms23\302\265s495ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7892:2018-10-03 08:01:50.278959+00:00	1m48s472ms82\302\265s974ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7933:2018-10-03 08:01:51.167815+00:00	1m49s360ms939\302\265s24ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
7974:2018-10-03 08:01:52.160355+00:00	1m50s353ms479\302\265s388ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
8019:2018-10-03 08:01:53.073666+00:00	1m51s266ms790\302\265s492ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
8060:2018-10-03 08:01:54.001229+00:00	1m52s194ms353\302\265s99ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471
8101:2018-10-03 08:01:54.942158+00:00	1m53s135ms281\302\265s899ns	reply error [txn: e9c7f45e], Scan [/Table/53/1/5400,/Table/53/1/5500): r81 was not found	[n2,txn=e9c7f45e]		dist sender send	471


tbg commented Oct 3, 2018

Got this error spuriously too (probably what happened in the nightly failure above; I tweaked the test to report stderr in that case):

--- FAIL: election-after-restart (13.91s)
	test.go:505,election.go:67: Error: cannot dial server.
		Is the server running?
		If the server is running, check --host client-side and --advertise server-side.

		read tcp 127.0.0.1:51928->127.0.0.1:26257: i/o timeout
		Failed running "sql"
		Error:  exit status 1


		/home/tschottdorf/go/bin/roachprod run local:1 -- ./cockroach sql --insecure -e "
		SET TRACING = on;
		SELECT * FROM test.kv;
		SET TRACING = off;
		SHOW TRACE FOR SESSION;
		": exit status 1


tbg commented Oct 3, 2018

Heh, I increased the connection timeout; now I see

--- FAIL: election-after-restart (3.88s)
	test.go:505,election.go:67: Error: pq: unable to dial n2: breaker open
		Failed running "sql"
		Error:  exit status 1


		/home/tschottdorf/go/bin/roachprod run local:1 -- ./cockroach sql --insecure -e "
		SET TRACING = on;
		SELECT * FROM test.kv;
		SET TRACING = off;
		SHOW TRACE FOR SESSION;
		": exit status 1

cc @petermattis that thing really is everywhere.


tbg commented Oct 3, 2018

Meh, now it's just sitting there and not repro'ing anything. Well. It'll fail eventually.

@petermattis

cc @petermattis that thing really is everywhere.

Ack. I'm working on it.


tbg commented Oct 3, 2018

Still not failing (except with the occasional breaker open). I must've gotten really lucky the first time around.


tbg commented Oct 3, 2018

Finally got one. The RangeNotFoundErrors all originate here:

// getReplicaDescriptorRLocked is like getReplicaDescriptor, but assumes that
// r.mu is held for either reading or writing.
func (r *Replica) getReplicaDescriptorRLocked() (roachpb.ReplicaDescriptor, error) {
	repDesc, ok := r.mu.state.Desc.GetReplicaDescriptor(r.store.StoreID())
	if ok {
		return repDesc, nil
	}
	return roachpb.ReplicaDescriptor{}, roachpb.NewRangeNotFoundError(r.RangeID)
}

This gets called in lots of places, but I think we're getting it from the lease renewal code. This smells like replica removal followed by waiting for replica GC, but in a 3x cluster? Time to sleuth some logs.


tbg commented Oct 5, 2018

I looked at another couple of repros (which, by the way, now happen every couple of minutes) but didn't catch the snapshot thing again. Oh well. I assume it's unrelated and a sign that when a lot of Raft log is created right after cluster start, we may sometimes truncate too aggressively (maybe the Raft status isn't fully populated at that point).

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/dc0e73c728e533fdb3bec63e53eec174e920ff22

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=952220&tab=buildLog

The test failed on release-2.1:
	test.go:570,election.go:67: /home/agent/work/.go/bin/roachprod run teamcity-952220-election-after-restart:1 -- ./cockroach sql --insecure -e "
		SET TRACING = on;
		SELECT * FROM test.kv;
		SET TRACING = off;
		SHOW TRACE FOR SESSION;
		": exit status 1

@tbg tbg assigned tbg and unassigned jordanlewis Oct 9, 2018
@tbg tbg removed the A-sql-execution Relating to SQL execution. label Oct 9, 2018
@tbg tbg added this to Incoming in KV via automation Oct 9, 2018
tbg added a commit to tbg/cockroach that referenced this issue Oct 9, 2018
Previously, if a Batch RPC came back with a RangeNotFoundError,
we would immediately stop trying to send to more replicas, evict the
range descriptor, and start a new attempt after a back-off.

This new attempt could end up using the same replica, so if the
RangeNotFoundError persisted for some amount of time, so would the
unsuccessful retries for requests to it as DistSender doesn't
aggressively shuffle the replicas.

It turns out that there are such situations, and the
election-after-restart roachtest spuriously hit one of them:

1. new replica receives a preemptive snapshot and the ConfChange
2. cluster restarts
3. now the new replica is in this state until the range wakes
   up, which may not happen for some time.
4. the first request to the range runs into the above problem

@nvanbenschoten: I think there is an issue to be filed about the
tendency of DistSender to get stuck in unfortunate configurations.

Fixes cockroachdb#30613.

Release note (bug fix): Avoid repeatedly trying a replica that was found
to be in the process of being added.
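
To make the described retry change concrete, here is a minimal, runnable Go sketch (toy types, not the actual DistSender code) of trying the next replica on RangeNotFoundError instead of aborting the attempt.

package main

import (
	"errors"
	"fmt"
)

// rangeNotFoundError stands in for roachpb.RangeNotFoundError: the replica is
// reachable (it may only have a preemptive snapshot) but doesn't know the range.
type rangeNotFoundError struct{ rangeID int }

func (e *rangeNotFoundError) Error() string {
	return fmt.Sprintf("r%d was not found", e.rangeID)
}

// replica is a toy stand-in that either serves the request or reports that it
// doesn't know the range.
type replica struct {
	name     string
	hasRange bool
}

func (r replica) send() (string, error) {
	if !r.hasRange {
		return "", &rangeNotFoundError{rangeID: 39}
	}
	return "reply from " + r.name, nil
}

// sendToReplicas sketches the fixed behavior: a RangeNotFoundError no longer
// ends the attempt; we simply move on to the next replica in the list.
func sendToReplicas(replicas []replica) (string, error) {
	var lastErr error
	for _, r := range replicas {
		reply, err := r.send()
		if err == nil {
			return reply, nil
		}
		var rnfe *rangeNotFoundError
		if errors.As(err, &rnfe) {
			// Previously: evict the descriptor and back off, which could hit
			// the same replica again on the next attempt.
			lastErr = err
			continue
		}
		return "", err
	}
	return "", fmt.Errorf("all replicas tried, last error: %w", lastErr)
}

func main() {
	replicas := []replica{
		{name: "n2", hasRange: false}, // woke up knowing only the preemptive snapshot
		{name: "n1", hasRange: true},
	}
	fmt.Println(sendToReplicas(replicas))
}
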
@tbg tbg modified the milestones: 2.1, 2.2 Oct 9, 2018
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/ac2f39fcc6be7366bc786d231890ee91e84f1c3c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=955173&tab=buildLog

The test failed on master:
	test.go:570,election.go:67: /home/agent/work/.go/bin/roachprod run teamcity-955173-election-after-restart:1 -- ./cockroach sql --insecure -e "
		SET TRACING = on;
		SELECT * FROM test.kv;
		SET TRACING = off;
		SHOW TRACE FOR SESSION;
		": exit status 1

tbg added a commit to tbg/cockroach that referenced this issue Oct 10, 2018
craig bot pushed a commit that referenced this issue Oct 10, 2018
31013: kv: try next replica on RangeNotFoundError r=nvanbenschoten,bdarnell a=tschottdorf

Previously, if a Batch RPC came back with a RangeNotFoundError, we would
immediately stop trying to send to more replicas, evict the range
descriptor, and start a new attempt after a back-off.

This new attempt could end up using the same replica, so if the
RangeNotFoundError persisted for some amount of time, so would the
unsuccessful retries for requests to it as DistSender doesn't aggressively
shuffle the replicas.

It turns out that there are such situations, and the election-after-restart
roachtest spuriously hit one of them:

1. new replica receives a preemptive snapshot and the ConfChange
2. cluster restarts
3. now the new replica is in this state until the range wakes
   up, which may not happen for some time.
4. the first request to the range runs into the above problem

@nvanbenschoten: I think there is an issue to be filed about the tendency
of DistSender to get stuck in unfortunate configurations.

Fixes #30613.

Release note (bug fix): Avoid repeatedly trying a replica that was found to
be in the process of being added.

31187: roachtest: add synctest r=bdarnell a=tschottdorf

This new roachtest sets up charybdefs on a single (Ubuntu) node and runs the
`synctest` CLI command against a nemesis that injects random I/O errors.

The synctest command is new. It simulates a Raft log and can be directed at a
filesystem that is being hit with random failures.

The workload essentially writes ascending keys (flushing each one to disk
synchronously) until an I/O error occurs, at which point it re-opens the
instance to verify that all persisted writes are still there. If the
RocksDB instance was permanently corrupted, it switches to a new, pristine
directory.
This is used in the roachtest, but it is also useful for manual runs in user
deployments where we suspect a failure to persist data to disk.
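
As a rough illustration, here is a self-contained Go sketch of that write-until-error-then-verify loop; toyStore and the injected error are hypothetical stand-ins, not the real synctest implementation.

package main

import (
	"errors"
	"fmt"
)

// toyStore is a stand-in for the RocksDB instance that synctest writes to; it
// injects an I/O error after a fixed number of synced writes.
type toyStore struct {
	persisted int
	failAfter int
}

func (s *toyStore) syncPut(key int) error {
	if key > s.failAfter {
		return errors.New("injected I/O error")
	}
	s.persisted = key // pretend the key was flushed to disk
	return nil
}

func (s *toyStore) reopen() int { return s.persisted }

// run writes ascending keys until an I/O error occurs, then re-opens the store
// and checks that every write acknowledged as synced is still there.
func run(s *toyStore) error {
	written := 0
	for {
		if err := s.syncPut(written + 1); err != nil {
			break
		}
		written++
	}
	if got := s.reopen(); got < written {
		return fmt.Errorf("lost writes: persisted up to %d, expected %d", got, written)
	}
	return nil
}

func main() {
	fmt.Println(run(&toyStore{failAfter: 100})) // prints <nil>: no acknowledged writes were lost
}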

This hasn't found anything, but it's fun to watch and also shows us a
number of errors that we know and love from Sentry.

Release note: None

31215: storage: deflake TestStoreRangeMergeWatcher r=tschottdorf a=benesch

This test could deadlock if the LHS replica on store2 was shut down
before it processed the split at "b". Teach the test to wait for the LHS
replica on store2 to process the split before blocking Raft traffic to
it.

Fixes #31096.
Fixes #31149.
Fixes #31160.
Fixes #31167.

Release note: None

31217: importccl: add explicit default to mysql testdata timestamp r=dt a=dt

This makes the testdata work on MySQL 8.0.2+, where the timestamp type no longer has implicit defaults.

Release note: none.

31221: cluster: Create final cluster version for 2.1 r=bdarnell a=bdarnell

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Ben Darnell <ben@bendarnell.com>
@craig craig bot closed this as completed in #31013 Oct 10, 2018
KV automation moved this from Incoming to Closed Oct 10, 2018
tbg added a commit to tbg/cockroach that referenced this issue Oct 11, 2018
tbg added a commit to tbg/cockroach that referenced this issue Oct 12, 2018
tbg added a commit to tbg/cockroach that referenced this issue Oct 15, 2018