teamcity: failed test: TestStoreRangeMergeWatcher #31096

cockroach-teamcity · 2018-10-08T19:16:55Z

The following tests appear to have failed on release-2.1 (test): TestStoreRangeMergeWatcher, TestStoreRangeMergeWatcher/inject-failures=false, TestStoreRangeMergeWatcher/inject-failures=true

You may want to check for open issues.

#951855:

TestStoreRangeMergeWatcher
--- FAIL: testrace/TestStoreRangeMergeWatcher (3.170s)
Test ended in panic.




TestStoreRangeMergeWatcher
--- FAIL: test/TestStoreRangeMergeWatcher (0.000s)
Test ended in panic.




TestStoreRangeMergeWatcher/inject-failures=true
...ync/cond.go:56 +0x80
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc421ca5100, 0x2c06600, 0xc421c0fbc0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:196 +0x7c
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x2c06600, 0xc421c0fbc0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc421db44c0, 0xc420c63830, 0xc421db44b0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad

goroutine 47921 [semacquire]:
sync.runtime_notifyListWait(0xc4209a5410, 0x7876)
	/usr/local/go/src/runtime/sema.go:510 +0x10b
sync.(*Cond).Wait(0xc4209a5400)
	/usr/local/go/src/sync/cond.go:56 +0x80
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc420977100, 0x2c06600, 0xc421e2cc30)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:196 +0x7c
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x2c06600, 0xc421e2cc30)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4216fb6f0, 0xc4213f9d40, 0xc4216fb6e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad

goroutine 47894 [semacquire]:
sync.runtime_notifyListWait(0xc4209a5410, 0x7897)
	/usr/local/go/src/runtime/sema.go:510 +0x10b
sync.(*Cond).Wait(0xc4209a5400)
	/usr/local/go/src/sync/cond.go:56 +0x80
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc420977100, 0x2c06600, 0xc421e2c720)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:196 +0x7c
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x2c06600, 0xc421e2c720)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4216fabe0, 0xc4213f9d40, 0xc4216fabd0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad

goroutine 48185 [select]:
github.com/cockroachdb/cockroach/pkg/gossip.(*server).Gossip(0xc421790700, 0x2c1d060, 0xc421775bf0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/gossip/server.go:198 +0x733
github.com/cockroachdb/cockroach/pkg/gossip._Gossip_Gossip_Handler(0x260c7c0, 0xc421790700, 0x2c189e0, 0xc42184a210, 0xc420100088, 0xc421308360)
	/go/src/github.com/cockroachdb/cockroach/pkg/gossip/gossip.pb.go:293 +0xb2
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).processStreamingRPC(0xc421c4b180, 0x2c21da0, 0xc420456a00, 0xc42171c700, 0xc421914ea0, 0x3afca60, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1160 +0xa2d
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).handleStream(0xc421c4b180, 0x2c21da0, 0xc420456a00, 0xc42171c700, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1253 +0x12b1
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc4203eb760, 0xc421c4b180, 0x2c21da0, 0xc420456a00, 0xc42171c700)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:680 +0x9f
created by github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:678 +0xa1



TestStoreRangeMergeWatcher/inject-failures=false
... minor_val:0 patch:0 unstable:0 > build_tag:"" started_at:0 
I181008 18:51:16.645461 47437 gossip/client.go:129  [n2] started gossip client to 127.0.0.1:44677
I181008 18:51:16.697141 47249 storage/store.go:1562  [s3] [n3,s3]: failed initial metrics computation: [n3,s3]: system config not yet available
W181008 18:51:16.699555 47249 gossip/gossip.go:1499  [n3] no incoming or outgoing connections
I181008 18:51:16.699627 47249 gossip/gossip.go:395  [n3] NodeDescriptor set to node_id:3 address:<network_field:"tcp" address_field:"127.0.0.1:42397" > attrs:<> locality:<> ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > build_tag:"" started_at:0 
I181008 18:51:16.700591 47789 gossip/client.go:129  [n3] started gossip client to 127.0.0.1:44677
I181008 18:51:16.764202 47249 rpc/nodedialer/nodedialer.go:92  [s1,r1/1:/M{in-ax}] connection to n2 established
I181008 18:51:16.768114 47249 storage/store_snapshot.go:615  [s1,r1/1:/M{in-ax}] sending preemptive snapshot 6ec9031e at applied index 16
I181008 18:51:16.768524 47249 storage/store_snapshot.go:657  [s1,r1/1:/M{in-ax}] streamed snapshot to (n2,s2):?: kv pairs: 49, log entries: 6, rate-limit: 2.0 MiB/sec, 4ms
I181008 18:51:16.769077 47442 storage/replica_raftstorage.go:803  [s2,r1/?:{-}] applying preemptive snapshot at index 16 (id=6ec9031e, encoded size=8300, 1 rocksdb batches, 6 log entries)
I181008 18:51:16.769935 47442 storage/replica_raftstorage.go:809  [s2,r1/?:/M{in-ax}] applied preemptive snapshot in 1ms [clear=0ms batch=0ms entries=1ms commit=0ms]
I181008 18:51:16.770938 47249 storage/replica_command.go:812  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n2,s2):2): read existing descriptor r1:/M{in-ax} [(n1,s1):1, next=2, gen=0]
I181008 18:51:16.774240 47249 storage/replica.go:3836  [s1,r1/1:/M{in-ax},txn=d4aeb76e] proposing ADD_REPLICA((n2,s2):2): updated=[(n1,s1):1 (n2,s2):2] next=3
I181008 18:51:16.775324 47249 rpc/nodedialer/nodedialer.go:92  [s1,r1/1:/M{in-ax}] connection to n3 established
I181008 18:51:16.776173 47249 storage/store_snapshot.go:615  [s1,r1/1:/M{in-ax}] sending preemptive snapshot bf31e98e at applied index 18
I181008 18:51:16.776472 47249 storage/store_snapshot.go:657  [s1,r1/1:/M{in-ax}] streamed snapshot to (n3,s3):?: kv pairs: 52, log entries: 8, rate-limit: 2.0 MiB/sec, 1ms
I181008 18:51:16.776850 47282 storage/replica_raftstorage.go:803  [s3,r1/?:{-}] applying preemptive snapshot at index 18 (id=bf31e98e, encoded size=9242, 1 rocksdb batches, 8 log entries)
I181008 18:51:16.777606 47282 storage/replica_raftstorage.go:809  [s3,r1/?:/M{in-ax}] applied preemptive snapshot in 1ms [clear=0ms batch=0ms entries=0ms commit=0ms]
I181008 18:51:16.778062 47249 storage/replica_command.go:812  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n3,s3):3): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, next=3, gen=0]
I181008 18:51:16.779538 47604 rpc/nodedialer/nodedialer.go:92  connection to n1 established
I181008 18:51:16.781604 47249 storage/replica.go:3836  [s1,r1/1:/M{in-ax},txn=c65449b4] proposing ADD_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n2,s2):2 (n3,s3):3] next=4
I181008 18:51:16.936946 47249 storage/replica_command.go:298  [s1,r1/1:/M{in-ax}] initiating a split of this range at key "b" [r2]
I181008 18:51:16.947054 47697 storage/replica_proposal.go:211  [s3,r2/3:{b-/Max}] new range lease repl=(n3,s3):3 seq=2 start=0.000000123,278 epo=1 pro=0.000000123,279 following repl=(n1,s1):1 seq=1 start=0.000000000,0 exp=0.900000123,5 pro=0.000000123,6
I181008 18:51:16.947848 47249 storage/replica_command.go:430  [s1,r1/1:{/Min-b}] initiating a merge of r2:{b-/Max} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=0] into this range
I181008 18:51:16.962187 47515 storage/store.go:2740  [s1,r1/1:{/Min-b},txn=790a505e] removing replica r2/1
I181008 18:51:16.963968 47704 storage/store.go:2740  [s3,r1/3:{/Min-b}] removing replica r2/3
I181008 18:51:16.964485 47606 storage/store.go:2740  [s2,r1/2:{/Min-b}] removing replica r2/2

Please assign, take a look and update the issue accordingly.


benesch · 2018-10-08T21:51:59Z

Man, I stressed this thing for 30m on my GCE worker. Just once I want to merge a non-flaky test.

tbg · 2018-10-09T10:01:06Z

I think @andy-kimball once called flaky tests the "bane of our existence". That is just to say, we feel ya.

This test could deadlock if the LHS replica on store2 was shut down before it processed the split at "b". Teach the test to wait for the LHS replica on store2 to process the split before blocking Raft traffic to it. Fixes cockroachdb#31096. Fixes cockroachdb#31149. Fixes cockroachdb#31160. Fixes cockroachdb#31167. Release note: None


31013: kv: try next replica on RangeNotFoundError r=nvanbenschoten,bdarnell a=tschottdorf

Previously, if a Batch RPC came back with a RangeNotFoundError, we would immediately stop trying to send to more replicas, evict the range descriptor, and start a new attempt after a back-off. This new attempt could end up using the same replica, so if the RangeNotFoundError persisted for some amount of time, so would the unsuccessful retries for requests to it as DistSender doesn't aggressively shuffle the replicas.

It turns out that there are such situations, and the election-after-restart roachtest spuriously hit one of them:

1. new replica receives a preemptive snapshot and the ConfChange
2. cluster restarts
3. now the new replica is in this state until the range wakes up, which may not happen for some time.
4. the first request to the range runs into the above problem

@nvanbenschoten: I think there is an issue to be filed about the tendency of DistSender to get stuck in unfortunate configurations.

Fixes #30613.

Release note (bug fix): Avoid repeatedly trying a replica that was found to be in the process of being added.

31187: roachtest: add synctest r=bdarnell a=tschottdorf

This new roachtest sets up a charybdefs on a single (Ubuntu) node and runs the `synctest` cli command against a nemesis that injects random I/O errors.

The synctest command is new. It simulates a Raft log and can be directed at a filesystem that is being hit with random failures. The workload essentially writes ascending keys (flushing each one to disk synchronously) until an I/O error occurs, at which point it re-opens the instance to verify that all persisted writes are still there. If the RocksDB instance was permanently corrupted, it switches to a new, pristine, directory. This is used in the roachtest, but is also useful for manual use in user deployments in which we suspect there is a failure to persist data to disk.

This hasn't found anything, but it's fun to watch and also shows us a number of errors that we know and love from sentry.

Release note: None

31215: storage: deflake TestStoreRangeMergeWatcher r=tschottdorf a=benesch

This test could deadlock if the LHS replica on store2 was shut down before it processed the split at "b". Teach the test to wait for the LHS replica on store2 to process the split before blocking Raft traffic to it.

Fixes #31096.
Fixes #31149.
Fixes #31160.
Fixes #31167.

Release note: None

31217: importccl: add explicit default to mysql testdata timestamp r=dt a=dt

this makes the testdata work on mysql 8.0.2+, where the timestamp type no longer has the implicit defaults.

Release note: none.

31221: cluster: Create final cluster version for 2.1 r=bdarnell a=bdarnell

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Ben Darnell <ben@bendarnell.com>


cockroach-teamcity added this to the 2.1 milestone Oct 8, 2018

cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Oct 8, 2018

jordanlewis assigned benesch Oct 8, 2018

jordanlewis mentioned this issue Oct 9, 2018

ccl/partitionccl: TestInitialPartitioning failed under stress [skipped] #28789

Closed

tbg mentioned this issue Oct 9, 2018

teamcity: failed test: TestStoreRangeMergeWatcher #31114

Closed

benesch mentioned this issue Oct 10, 2018

storage: deflake TestStoreRangeMergeWatcher #31215

Merged

craig bot closed this as completed in #31215 Oct 10, 2018

tbg mentioned this issue Oct 11, 2018

backport-2.1: storage: deflake TestStoreRangeMergeWatcher #31248

Merged
