kv: general test flakiness due to Pebble close error #51544

Closed
knz opened this issue Jul 17, 2020 · 19 comments
Labels
C-test-failure Broken test (automatically or manually discovered). S-3-productivity Severe issues that impede the productivity of CockroachDB developers. T-storage Storage Team

knz commented Jul 17, 2020

Describe the problem

make stress PKG=./pkg/kv/kvclient/kvcoord is failing reliably with the following stack trace:

panic: pebble: closed [recovered]
        panic: pebble: closed [recovered]
        panic: pebble: closed

goroutine 9311 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc001803b00, 0x1702240, 0xc000431680)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:183 +0x11f
panic(0x9b3820, 0xc0002931a0)
        /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send.func1(0xc000b1af58, 0xc000b1afe0, 0xc000b1afd8, 0xc000476e00)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:103 +0x1e6
panic(0x9b3820, 0xc0002931a0)
        /usr/local/go/src/runtime/panic.go:975 +0x3e3
github.com/cockroachdb/pebble.(*DB).newIterInternal(0xc000441500, 0x177b5a0, 0xc000bf8280, 0x0, 0x0, 0x0, 0xc00095c370, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/db.go:668 +0xd94
github.com/cockroachdb/pebble.(*Batch).NewIter(0xc00043b180, 0xc00095c370, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/batch.go:675 +0x1df
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).init(0xc00095c368, 0x16e6680, 0xc00043b180, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:125 +0x46b
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleBatch).NewIterator(0xc00095c340, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_batch.go:207 +0x14f
github.com/cockroachdb/cockroach/pkg/storage.MVCCGet(0x1702300, 0xc001d389c0, 0x82ece9408, 0xc001204780, 0xc0004461c0, 0x26, 0x40, 0x0, 0x0, 0xc000000000, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:754 +0x9b
github.com/cockroachdb/cockroach/pkg/storage.MVCCGetProto(0x1702300, 0xc001d389c0, 0x82ece9408, 0xc001204780, 0xc0004461c0, 0x26, 0x40, 0x0, 0x0, 0x173e060, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:630 +0xd7
github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval.EndTxn(0x1702300, 0xc001d389c0, 0x82ed61598, 0xc001204780, 0x179c260, 0xc00038b000, 0x7b, 0x3, 0x100000001, 0x1, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval/cmd_end_transaction.go:200 +0x309
github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateCommand(0x1702300, 0xc001d389c0, 0xc00058a190, 0x8, 0x0, 0x82ed61598, 0xc001204780, 0x179c260, 0xc00038b000, 0xc000734000, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_evaluate.go:471 +0x235
github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateBatch(0x1702300, 0xc001d389c0, 0xc00058a190, 0x8, 0x82ed61598, 0xc001204780, 0x179c260, 0xc00038b000, 0xc000734000, 0xc000bf8080, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_evaluate.go:241 +0x3c2
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWrapper(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc00058a190, 0x8, 0x179c260, 0xc00038b000, 0xc000734000, 0xc000bf8080, 0xc0012043c0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_write.go:557 +0x144
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWithServersideRefreshes(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc00058a190, 0x8, 0x179c260, 0xc00038b000, 0xc000734000, 0xc000bf8080, 0xc0012043c0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_write.go:526 +0x135
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatch(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc00058a190, 0x8, 0xc000bf8080, 0xc0012043c0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_write.go:349 +0x212
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateProposal(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc00058a190, 0x8, 0xc000bf8080, 0xc0012043c0, 0x0, 0xc000aa5600, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:735 +0x127
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).requestToProposal(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc00058a190, 0x8, 0xc000bf8080, 0xc0012043c0, 0x0, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:855 +0x8e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evalAndPropose(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc000bf8080, 0xc000744000, 0xc000b1a330, 0x0, 0x0, 0x0, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:73 +0xee
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeWriteBatch(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc000bf8080, 0x0, 0x0, 0xc0004b2da0, 0x100000001, 0x1, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_write.go:133 +0x769
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc00038b000, 0x1702300, 0xc001d389c0, 0xc000bf8080, 0xdfed58, 0x0, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:275 +0x3e7
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc00038b000, 0x1702300, 0xc001d38990, 0x2, 0xc000bf8080, 0x0, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:95 +0x6b2
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(0xc00038b000, 0x1702300, 0xc001d38990, 0x7b, 0x3, 0x100000001, 0x1, 0x0, 0x2, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:37 +0x91
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send(0xc000476e00, 0x1702300, 0xc001d38810, 0x7b, 0x3, 0x100000001, 0x1, 0x0, 0x2, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:194 +0x5a2
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send(0xc000caf300, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x100000001, 0x1, 0x0, 0x2, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:177 +0xed
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*senderTransport).SendNext(0xc000b04000, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x100000001, 0x1, 0x0, 0x2, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:299 +0x21e
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*localTestClusterTransport).SendNext(0xc000f200a0, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/local_test_cluster_util.go:44 +0x8f
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas(0xc000578840, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1795 +0x6b1
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc000578840, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1459 +0x305
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc000578840, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1099 +0x18ef
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc000578840, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:739 +0x8e4
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked(0xc00014cd38, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:86 +0x11c
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnMetricRecorder).SendLocked(0xc00014cd00, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:46 +0x8d
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnCommitter).SendLocked(0xc00014ccd0, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:190 +0x53b
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0xc00014cc38, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:247 +0x9b
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked(0xc00014cc38, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:182 +0x180
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked(0xc00014cb78, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:252 +0x159
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked(0xc00014cb58, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:105 +0x20c
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).SendLocked(0xc00014cab8, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:172 +0x1a9
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send(0xc00014c900, 0x1702300, 0xc001d38810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:499 +0x3cc
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0xc000caf400, 0x17022c0, 0xc001204360, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:742 +0x122
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send(0xc0016e6360, 0x17022c0, 0xc001204360, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:911 +0x11e
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).rollback.func1.1(0x17022c0, 0xc001204360, 0xb2d05e00, 0x17022c0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:733 +0x98
github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout(0x17022c0, 0xc001204360, 0xc8b302, 0x12, 0xb2d05e00, 0xc000bcee68, 0x0, 0x0)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:135 +0x9e
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).rollback.func1(0x1702240, 0xc000431680)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:732 +0x14b
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc001803b00, 0x1702240, 0xc000431680, 0xc000d34020, 0x16, 0x0, 0x0, 0xc001d78140)
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:323 +0xee
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
        /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:318 +0x131

To Reproduce

make stress PKG=./pkg/kv/kvclient/kvcoord

Jira issue: CRDB-4032


blathers-crl bot commented Jul 17, 2020

Hi @knz, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@knz knz added C-test-failure Broken test (automatically or manually discovered). S-3-productivity Severe issues that impede the productivity of CockroachDB developers. labels Jul 17, 2020
@knz knz added this to To do in KV 20.2 via automation Jul 17, 2020
@knz knz added this to Incoming in Storage via automation Jul 17, 2020

knz commented Jul 17, 2020

cc @tbg @petermattis for triage.


knz commented Jul 17, 2020

I am going to bisect this and see where it came from.


knz commented Jul 17, 2020

This commit seems OK: 12b58af

Will bisect between now and then.

edit: that selection was incorrect


knz commented Jul 17, 2020

I have tried very hard to use git bisect to find the root cause, but every time git bisect reaches a conclusion, I find that my outputs are contradictory.

It may be that our Makefile / Go dependency tracking is not right, so that when I switch branches the vendor dir is not rebuilt properly.


knz commented Jul 17, 2020

I will not investigate this further myself.


knz commented Jul 17, 2020

For reference, this one is very probably OK, after running from a clean repo:

[91ae9bc] Merge #51232 #51299


knz commented Jul 17, 2020

(I'm re-running my bisect session with 'make clean' in-between each step, just to be sure)


knz commented Jul 17, 2020

Here's my latest bisect log:

# bad: [af0031e3004327b8d09e23e99eb9659abf7d82de] Merge #51375 #51520
# good: [753fe8f51291aff12e4dad5a6fc850988c25dd82] Merge #51235
git bisect start 'master' '753fe8f51291aff12e4dad5a6fc850988c25dd82'
# good: [d0c79625eda85d3aa38afad5b0254d419a9bc4cd] Merge #51023 #51154 #51256 #51303
git bisect good d0c79625eda85d3aa38afad5b0254d419a9bc4cd
# good: [91ae9bc70d00868e46c83139c0e5621e4e4971f3] Merge #51232 #51299
git bisect good 91ae9bc70d00868e46c83139c0e5621e4e4971f3
# good: [2f05eb02306768de6d8c35310851c5d4684d55a5] geoviz: add improvements
git bisect good 2f05eb02306768de6d8c35310851c5d4684d55a5
# good: [2d21db0fca674cd448c1d9d2bf14c53244de56f3] builtins: add underscore variant for each index builtin
git bisect good 2d21db0fca674cd448c1d9d2bf14c53244de56f3
# good: [2d21db0fca674cd448c1d9d2bf14c53244de56f3] builtins: add underscore variant for each index builtin
git bisect good 2d21db0fca674cd448c1d9d2bf14c53244de56f3
# good: [2d21db0fca674cd448c1d9d2bf14c53244de56f3] builtins: add underscore variant for each index builtin
git bisect good 2d21db0fca674cd448c1d9d2bf14c53244de56f3
# good: [f00c3ff59db097746011ffb1586fd4ce67f5596d] Merge #51225 #51311
git bisect good f00c3ff59db097746011ffb1586fd4ce67f5596d
# bad: [9062cb0fb03e7af73745601e5ad44f3b8b87d519] Merge #50771
git bisect bad 9062cb0fb03e7af73745601e5ad44f3b8b87d519
# bad: [51398fb4a822bb9257ddba87682380ec0745f6ae] Merge #51327
git bisect bad 51398fb4a822bb9257ddba87682380ec0745f6ae
# bad: [0f36c8df07b6d7f948989a68926141b1e9866368] Merge #51351
git bisect bad 0f36c8df07b6d7f948989a68926141b1e9866368
# bad: [a215af7dd3a7387a4454082c27c1528ff54b1698] Merge #51319
git bisect bad a215af7dd3a7387a4454082c27c1528ff54b1698
# good: [d86e265eb443e9eeac8a743ad558b5bfa58e3720] kv: consolidate RangeInfos in RPC responses
git bisect good d86e265eb443e9eeac8a743ad558b5bfa58e3720
# good: [3a299690b26b528a1fdbf28865e1bc81ae697264] Merge #51310
git bisect good 3a299690b26b528a1fdbf28865e1bc81ae697264
# good: [dfdd8bb563dd7f6899c96a963a3906257784c135] geoviz: expose --geo_libs flag
git bisect good dfdd8bb563dd7f6899c96a963a3906257784c135
# bad: [a9e9c9cdf75f07c51dfca3773305d4af379c5fb7] Merge #51168
git bisect bad a9e9c9cdf75f07c51dfca3773305d4af379c5fb7
# first bad commit: [a9e9c9cdf75f07c51dfca3773305d4af379c5fb7] Merge #51168

This points to #51168 as the culprit, but I don't understand how it causes the problem.

Maybe someone else needs to re-run the bisect to confirm.


knz commented Jul 17, 2020

The problem with this bisect result is the following:

  • the merge commit a9e9c9c triggers the bug
  • however the only commit inside that branch, d86e265, does not.

(This is why I wrote that it's inconsistent.)


tbg commented Jul 17, 2020 via email


tbg commented Jul 17, 2020

Looking at this with Andrei. As you suggested, it's probably https://github.com/cockroachdb/cockroach/pull/51310/files exposing an existing problem.

@andreimatei

Hopefully this was fixed enough by #51413

Storage automation moved this from Incoming to Done (milestone C) Jul 24, 2020
KV 20.2 automation moved this from In progress to Done Jul 24, 2020

ajwerner commented Aug 8, 2020

@ajwerner ajwerner reopened this Aug 8, 2020
Storage automation moved this from Done (milestone C) to Incoming Aug 8, 2020

ajwerner commented Aug 8, 2020

One thing I'll note is that we seem to lack synchronization ensuring that all of our outstanding gRPC requests have concluded when the stopper stops. This seems like a problem.

It seems like we should add something to the server shutdown code that prevents new requests and waits for outstanding requests on certain gRPC services, in particular Batch and RangeFeed from Internal, and others like PerReplicaClient.
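A minimal sketch of the kind of gate described above. All names here (`requestGate`, `Enter`, `Drain`) are illustrative, not CockroachDB's actual shutdown code: new requests are rejected once draining begins, and `Drain` waits for all in-flight requests before it is safe to close the engine.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// requestGate is a hypothetical gate: it rejects new requests once
// shutdown begins, and lets shutdown wait for all in-flight requests
// to finish before the storage engine is closed.
type requestGate struct {
	mu       sync.Mutex
	draining bool
	inflight sync.WaitGroup
}

var errShuttingDown = errors.New("server is shutting down")

// Enter registers a request; it fails once draining has started.
func (g *requestGate) Enter() error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.draining {
		return errShuttingDown
	}
	g.inflight.Add(1)
	return nil
}

// Exit marks a request as finished.
func (g *requestGate) Exit() { g.inflight.Done() }

// Drain blocks new requests, then waits for outstanding ones.
func (g *requestGate) Drain() {
	g.mu.Lock()
	g.draining = true
	g.mu.Unlock()
	g.inflight.Wait() // safe to close the engine after this returns
}

func main() {
	var g requestGate
	if err := g.Enter(); err != nil {
		panic(err)
	}
	g.Exit()
	g.Drain()
	fmt.Println(g.Enter()) // prints: server is shutting down
}
```

The key property is that `Drain` linearizes with `Enter`: once the mutex-protected flag flips, no request can slip in after `inflight.Wait()` returns.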

@andreimatei

I'll get new energy on #51566

andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 14, 2020
This patch makes the Store reject requests once its stopper is
quiescing.

Before this patch, we didn't seem to have good protection against
requests not running after the stopper has been stopped. We've seen this
in some tests, where requests were racing with the engine closing.
Running after the stopper has stopped is generally pretty undefined
behavior, so let's avoid it.
I think the reason why we didn't see a lot of errors from such races is
that we're stopping the gRPC server even before we start quiescing, so
at least for requests coming from remote nodes we had some draining
window.

This is a lighter version of cockroachdb#51566. That patch was trying to run
requests as tasks, so that they properly synchronize with server
shutdown. This patch still allows races between requests that started
evaluating before the server started quiescing and server shutdown.

Touches cockroachdb#51544

Release note: None
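The check the patch describes can be sketched roughly like this. This is a toy stand-in, not the real Store/Stopper code (CockroachDB's Stopper does expose a quiescence channel, but `send` and the error here are hypothetical): the request path consults the channel up front and refuses to proceed once shutdown has begun, instead of racing with the engine closing.

```go
package main

import (
	"errors"
	"fmt"
)

// stopper is a minimal stand-in for a quiescence signal: Quiesce
// closes the channel, which all pending selects observe.
type stopper struct{ quiesce chan struct{} }

func newStopper() *stopper { return &stopper{quiesce: make(chan struct{})} }
func (s *stopper) Quiesce() { close(s.quiesce) }
func (s *stopper) ShouldQuiesce() <-chan struct{} { return s.quiesce }

var errQuiescing = errors.New("node is shutting down")

// send checks for quiescence before touching the engine.
func send(s *stopper) error {
	select {
	case <-s.ShouldQuiesce():
		return errQuiescing // reject instead of racing with engine close
	default:
	}
	// ... evaluate the batch against the engine ...
	return nil
}

func main() {
	s := newStopper()
	fmt.Println(send(s)) // <nil>
	s.Quiesce()
	fmt.Println(send(s)) // node is shutting down
}
```

As the commit message notes, this only narrows the window: a request that passes the check just before `Quiesce` can still race with shutdown.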
andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 14, 2020
This patch runs some infrequent operations that might use the storage
engine as tasks, and thus synchronizes them with server shutdown.
In cockroachdb#51544 we've seen one of these cause a crash when executing after
Pebble was shut down.

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 17, 2020
…tasks

Prior to this change, it was possible for a rangefeed request to be issued
concurrently with shutting down which could lead to an iterator being
constructed after the engine has been closed.

Touches cockroachdb#51544

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 17, 2020
…eation

This commit optimizes the Stopper for task creation by ripping out the
existing heavyweight task tracking in production builds. I realized that
my biggest concern with most of the proposals (cockroachdb#52843 and cockroachdb#51566) being
floated to address cockroachdb#51544 was that they bought more into the inefficient
tracking in the Stopper, not that they were doing anything inherently
wrong themselves.

Before this change, creating a task acquired an exclusive mutex and then
wrote to a hashmap. At high levels of concurrency, this would have
become a performance chokepoint. After this change, the cost of
launching a Task is three atomic increments – one to acquire a read
lock, one to register with a WaitGroup, and one to release the read
lock. When no one is draining the Stopper, these are all wait-free
operations, which means that task creation becomes wait-free.

With a change like this, I would feel much more comfortable pushing on
Stopper tasks to solve cockroachdb#51544.
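The scheme described above — task creation takes only a read lock plus a WaitGroup registration, while draining takes the write lock so no new tasks can start — can be sketched as follows (illustrative names, not the actual commit):

```go
package main

import (
	"fmt"
	"sync"
)

// lightStopper tracks only the number of running tasks. Creating a
// task is cheap: a read lock, a WaitGroup increment, a read unlock.
// Stop takes the write lock to flip the draining flag, then waits
// for the task count to drop to zero.
type lightStopper struct {
	mu       sync.RWMutex
	draining bool
	tasks    sync.WaitGroup
}

// RunTask runs f if the stopper is not draining; it reports whether
// the task was accepted.
func (s *lightStopper) RunTask(f func()) bool {
	s.mu.RLock()
	if s.draining {
		s.mu.RUnlock()
		return false
	}
	s.tasks.Add(1)
	s.mu.RUnlock()
	defer s.tasks.Done()
	f()
	return true
}

// Stop rejects new tasks and waits for running ones to finish.
func (s *lightStopper) Stop() {
	s.mu.Lock()
	s.draining = true
	s.mu.Unlock()
	s.tasks.Wait()
}

func main() {
	var s lightStopper
	n := 0
	ok := s.RunTask(func() { n++ })
	s.Stop()
	fmt.Println(ok, n, s.RunTask(func() { n++ })) // true 1 false
}
```

When nothing is draining, the read lock and WaitGroup operations are uncontended atomic updates, which is what makes it plausible to wrap every request in a task without a hot-path bottleneck.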
ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 19, 2020
…tasks

craig bot pushed a commit that referenced this issue Aug 20, 2020
52844: kvserver,rangefeed: ensure that iterators are only constructed under tasks r=andreimatei a=ajwerner



52996: partialidx: prove implication for comparisons with two variables r=RaduBerinde a=mgartner

This commit adds support for proving partial index predicates are
implied by query filters when they contain comparison expressions with
two variables and they are not identical expressions.

Below are some examples where the left expression implies (=>) the right
expression. The right is guaranteed to contain the left despite both
expressions having no constant values.

    a > b  =>  a >= b
    a = b  =>  a >= b
    b < a  =>  a >= b
    a > b  =>  a != b

Release note: None

53113: roachprod: introduce --skip-init to `roachprod start` r=irfansharif a=irfansharif

..and `roachprod init`. I originally attempted to introduce this flag in
#51329, but ultimately abandoned it. I still think it's a good idea to
have such a thing, especially given that we're now writing integration tests
that want to control `init` behaviour. It's much easier to write them
with this --skip-init flag than it is to work around roachprod's magical
auto-init behavior.

To perform the steps that are skipped when using --skip-init, we introduce a
`roachprod init` subcommand.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
RichardJCai pushed a commit to RichardJCai/cockroach that referenced this issue Aug 20, 2020
…tasks


ajwerner commented Oct 2, 2020

The vast majority of these seem to have to do with range merges happening after shutdown.

tbg added a commit to tbg/cockroach that referenced this issue Feb 1, 2021
We are likely going to invest more in the stopper-conferred
observability in the near future as part of initiatives such as cockroachdb#58164,
but the task tracking that has been a part of the stopper since near
its conception has not proven to be useful in practice, while at the
same time raising concern about stopper use in hot paths.

When shutting down a running server, we don't particularly care about
leaking goroutines (as the process will end anyway). In tests, we want
to ensure goroutine hygiene, but if a test hangs during `Stop`, it is easier to
look at the stacks to find out why than to consult the task map.

Together, this left little reason to do anything more complicated than
what's left after this commit: we keep track of the running number of
tasks, and wait until this drops to zero.

With this change in, we should feel comfortable using the stopper
extensively and, for example, ensuring that any CRDB goroutine is
anchored in a Stopper task; this is the right approach for test flakes
such as in cockroachdb#51544 and makes sense for all of the reasons mentioned in
issue cockroachdb#58164 as well.

In a future change, we should make the Stopper more configurable and,
through this configurability, we could in principle bring a version of
the task map back (in debug builds) without backing it into the stopper,
though I don't anticipate that we'll want to.

Closes cockroachdb#52894.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Feb 2, 2021
tbg added a commit to tbg/cockroach that referenced this issue Feb 3, 2021
tbg added a commit to tbg/cockroach that referenced this issue Feb 3, 2021
craig bot pushed a commit that referenced this issue Feb 3, 2021
59647: stop: rip out expensive task tracking r=knz a=tbg

First commit was put up for PR separately, ignore it here.



59732: backupccl: add an owner column behind the WITH PRIVILEGES option r=pbardea a=Elliebababa

Previously, when users performed RESTORE, they were unaware of the original owner.

This PR surfaces the original owner as a column behind the WITH PRIVILEGES option.

Resolves: #57906.

Release note: None.

59746: opt: switch checks to use CrdbTestBuild instead of RaceEnabled r=RaduBerinde a=RaduBerinde

The RaceEnabled flag is not very useful for checks; e.g. apparently
execbuilder tests aren't run routinely in race mode. These checks are
now "live" in any test build, using the crdb_test build tag.

Release note: None

59747: tree: correct StatementTag of ALTER TABLE ... LOCALITY r=ajstorm a=otan

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: elliebababa <ellie24.huang@gmail.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
@jlinder jlinder added the T-storage Storage Team label Jun 16, 2021

jbowens commented Aug 15, 2022

Closing out as unactionable. I think most instances have been fixed throughout the cockroach repo.

@jbowens jbowens closed this as completed Aug 15, 2022
Storage automation moved this from Incoming to Done Aug 15, 2022