-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv/kvserver: TestReplicaDrainLease failed #81511
Comments
@cockroachdb/storage Does this ring any bells?
|
For context: in #81133 we made the synchronous disk write during node heartbeats use a separate goroutine, to handle context cancellation. Looks like this can race with engine shutdown, possibly due to non-atomic access of atomic variables. cockroach/pkg/kv/kvserver/liveness/liveness.go Lines 1273 to 1275 in 6568292
cockroach/pkg/storage/engine.go Lines 929 to 942 in 6568292
|
Took a quick look at this just now. I think you're right. This field is is implicitly atomic: Yet we read from it here non-atomically here: Seems like the trivial fix would be to read atomically. Though we should probably do a deeper audit of the write / read sites. I'll file a bug and link back. Will assign this Storage. Thanks for the heads up! |
Yeah, I did a quick trial with making all accesses atomic, but I still tripped the race detector. May have missed something though. |
This is also trivially reproducible: $ ./dev test ./pkg/kv/kvserver --filter TestReplicaDrainLease --race --stress One of the stacks was missing in the original log. Here's the full output, for additional context. OutputWARNING: DATA RACE
Read at 0x00c0035e2000 by goroutine 42:
github.com/cockroachdb/pebble/record.(*LogWriter).emitEOFTrailer()
github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:678 +0x4b
github.com/cockroachdb/pebble/record.(*LogWriter).Close()
github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:589 +0x32
github.com/cockroachdb/pebble.(*DB).Close()
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:1194 +0x664
github.com/cockroachdb/cockroach/pkg/storage.(*Pebble).Close()
github.com/cockroachdb/cockroach/pkg/storage/pebble.go:1026 +0x71
github.com/cockroachdb/cockroach/pkg/server.(*Engines).Close()
github.com/cockroachdb/cockroach/pkg/server/config.go:505 +0xb7
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop()
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:563 +0x390
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServerLocked()
github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:186 +0xfd
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers()
github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:122 +0x41c
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start.func2()
github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:367 +0x4a
github.com/cockroachdb/cockroach/pkg/util/stop.CloserFn.Close()
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:107 +0x2e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop()
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:563 +0x390
github.com/cockroachdb/cockroach/pkg/kv/kvserver.TestReplicaDrainLease·dwrap·485()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_test.go:1430 +0x58
github.com/cockroachdb/cockroach/pkg/kv/kvserver.TestReplicaDrainLease()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_test.go:1525 +0xe4c
testing.tRunner()
GOROOT/src/testing/testing.go:1259 +0x22f
testing.(*T).Run·dwrap·21()
GOROOT/src/testing/testing.go:1306 +0x47
Previous write at 0x00c0035e2000 by goroutine 706: [26/2603]
sync/atomic.StoreInt32()
src/runtime/race_amd64.s:255 +0xb
sync/atomic.StoreInt32()
<autogenerated>:1 +0x1a
github.com/cockroachdb/pebble/record.(*LogWriter).SyncRecord()
github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:646 +0xe4
github.com/cockroachdb/pebble.(*DB).commitWrite()
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:851 +0x371
github.com/cockroachdb/pebble.(*DB).commitWrite-fm()
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:802 +0x64
github.com/cockroachdb/pebble.(*commitPipeline).prepare()
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/commit.go:379 +0x2db
github.com/cockroachdb/pebble.(*commitPipeline).Commit()
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/commit.go:253 +0xa5
github.com/cockroachdb/pebble.(*DB).Apply()
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:757 +0x465
github.com/cockroachdb/pebble.(*Batch).Commit()
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/batch.go:932 +0xbb
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleBatch).Commit()
github.com/cockroachdb/cockroach/pkg/storage/pebble_batch.go:516 +0x5a
github.com/cockroachdb/cockroach/pkg/storage.WriteSyncNoop()
github.com/cockroachdb/cockroach/pkg/storage/engine.go:938 +0x105
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLiveness.func1()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1274 +0x3e
github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall()
github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:128 +0x67
github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).DoChan·dwrap·1()
github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:121 +0x71
Goroutine 42 (running) created at:
testing.(*T).Run()
GOROOT/src/testing/testing.go:1306 +0x726
testing.runTests.func1()
GOROOT/src/testing/testing.go:1598 +0x99
testing.tRunner()
GOROOT/src/testing/testing.go:1259 +0x22f
testing.runTests()
GOROOT/src/testing/testing.go:1596 +0x7ca
testing.(*M).Run()
GOROOT/src/testing/testing.go:1504 +0x9d1
github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestMain()
github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/main_test.go:49 +0x105
main.main()
main/bazel-out/darwin-fastbuild/bin/pkg/kv/kvserver/kvserver_test_/testmain.go:1435 +0x5c6
Goroutine 706 (finished) created at:
github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).DoChan()
github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:121 +0x59a
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLiveness()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1273 +0x996
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).heartbeatInternal()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:947 +0xbc4
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1.1()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:762 +0x5b9
github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout()
github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:91 +0xed
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:745 +0x4c6
github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout()
github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:91 +0xed
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:745 +0x4c6
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494 +0x235 Seems like there's a non-atomic write happening in there too. |
Currently, a block's `written` field is intended to be used atomically. There are existing call sites where the field is read from or written to non-atomically, which can result in a data race (as observed in cockroachdb/cockroach#81511). Make reads and writes of `block.written` atomic. This results in up to a 8% reduction in latency and increase in throughput, depending on the size of the record, due to a change to use a stack variable rather that dereferencing a field on the struct. However, it should be noted that this particular benchmark does not touch the filesystem (it makes use of `io.Discard`). In practice the savings would be dwarfed by the filesystem operation (sync / write). ``` name old time/op new time/op delta RecordWrite/size=8-16 42.9ns ± 5% 40.7ns ± 2% -5.24% (p=0.000 n=10+10) RecordWrite/size=16-16 42.7ns ± 4% 40.7ns ± 1% -4.89% (p=0.000 n=10+10) RecordWrite/size=32-16 44.0ns ± 6% 40.4ns ± 1% -8.24% (p=0.000 n=10+10) RecordWrite/size=64-16 44.1ns ± 1% 41.6ns ± 0% -5.58% (p=0.000 n=10+10) RecordWrite/size=256-16 67.7ns ± 4% 65.2ns ± 1% -3.71% (p=0.000 n=10+9) RecordWrite/size=1028-16 124ns ± 2% 119ns ± 1% -4.04% (p=0.000 n=10+10) RecordWrite/size=4096-16 306ns ± 5% 296ns ± 1% -3.25% (p=0.000 n=10+9) RecordWrite/size=65536-16 4.54µs ± 1% 4.52µs ± 2% ~ (p=0.247 n=10+10) name old speed new speed delta RecordWrite/size=8-16 186MB/s ± 5% 197MB/s ± 2% +5.49% (p=0.000 n=10+10) RecordWrite/size=16-16 374MB/s ± 3% 394MB/s ± 1% +5.12% (p=0.000 n=10+10) RecordWrite/size=32-16 727MB/s ± 5% 792MB/s ± 1% +8.91% (p=0.000 n=10+10) RecordWrite/size=64-16 1.45GB/s ± 1% 1.54GB/s ± 0% +5.91% (p=0.000 n=10+10) RecordWrite/size=256-16 3.78GB/s ± 4% 3.92GB/s ± 1% +3.82% (p=0.000 n=10+9) RecordWrite/size=1028-16 8.31GB/s ± 2% 8.66GB/s ± 1% +4.21% (p=0.000 n=10+10) RecordWrite/size=4096-16 13.4GB/s ± 5% 13.8GB/s ± 1% +3.30% (p=0.000 n=10+9) RecordWrite/size=65536-16 14.4GB/s ± 1% 14.5GB/s ± 2% ~ (p=0.247 n=10+10) name old alloc/op new alloc/op delta RecordWrite/size=8-16 0.00B 0.00B ~ (all equal) RecordWrite/size=16-16 0.00B 0.00B ~ (all equal) RecordWrite/size=32-16 0.00B 0.00B ~ (all equal) RecordWrite/size=64-16 0.00B 0.00B ~ (all equal) RecordWrite/size=256-16 0.00B 0.00B ~ (all equal) RecordWrite/size=1028-16 0.00B 0.00B ~ (all equal) RecordWrite/size=4096-16 0.00B 0.00B ~ (all equal) RecordWrite/size=65536-16 2.00B ± 0% 2.00B ± 0% ~ (all equal) name old allocs/op new allocs/op delta RecordWrite/size=8-16 0.00 0.00 ~ (all equal) RecordWrite/size=16-16 0.00 0.00 ~ (all equal) RecordWrite/size=32-16 0.00 0.00 ~ (all equal) RecordWrite/size=64-16 0.00 0.00 ~ (all equal) RecordWrite/size=256-16 0.00 0.00 ~ (all equal) RecordWrite/size=1028-16 0.00 0.00 ~ (all equal) RecordWrite/size=4096-16 0.00 0.00 ~ (all equal) RecordWrite/size=65536-16 0.00 0.00 ~ (all equal) ``` Fixes cockroachdb#1692.
As discussed in cockroachdb/pebble#1719, we'll do this through CRDB synchronization instead (i.e. don't attempt to write when we're closing the database). |
kv/kvserver.TestReplicaDrainLease failed with artifacts on master @ ce66acdbba801de88f1dd645eaedeb3834f23dbd:
Help
See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:
This test on roachdash | Improve this report!
Jira issue: CRDB-15111
The text was updated successfully, but these errors were encountered: