Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestReplicaDrainLease failed #81511

Closed
cockroach-teamcity opened this issue May 19, 2022 · 6 comments · Fixed by #81813
Closed

kv/kvserver: TestReplicaDrainLease failed #81511

cockroach-teamcity opened this issue May 19, 2022 · 6 comments · Fixed by #81813
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-storage Storage Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented May 19, 2022

kv/kvserver.TestReplicaDrainLease failed with artifacts on master @ ce66acdbba801de88f1dd645eaedeb3834f23dbd:

=== RUN   TestReplicaDrainLease
    test_log_scope.go:79: test logs captured to: /artifacts/tmp/_tmp/751d67000aac5f3394c2369309253f02/logTestReplicaDrainLease1320269877
    test_log_scope.go:80: use -show-logs to present logs inline
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • TAGS=bazel,gss

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-15111

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels May 19, 2022
@cockroach-teamcity cockroach-teamcity added this to roachtest/unit test backlog in KV May 19, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label May 19, 2022
@erikgrinaker
Copy link
Contributor

erikgrinaker commented May 19, 2022

@cockroachdb/storage Does this ring any bells?

=== RUN   TestReplicaDrainLease
    test_log_scope.go:79: test logs captured to: /artifacts/tmp/_tmp/751d67000aac5f3394c2369309253f02/logTestReplicaDrainLease1320269877
    test_log_scope.go:80: use -show-logs to present logs inline
==================
WARNING: DATA RACE
Write at 0x00c0020b6000 by goroutine 2422:
  sync/atomic.StoreInt32()
      src/runtime/race_amd64.s:255 +0xb
  sync/atomic.StoreInt32()
      <autogenerated>:1 +0x1a
  github.com/cockroachdb/pebble/record.(*LogWriter).Close()
      github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:589 +0x32
  github.com/cockroachdb/pebble.(*DB).Close()
      github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:1194 +0x664
  github.com/cockroachdb/cockroach/pkg/storage.(*Pebble).Close()
      github.com/cockroachdb/cockroach/pkg/storage/pebble.go:1026 +0x71
  github.com/cockroachdb/cockroach/pkg/server.(*Engines).Close()
      github.com/cockroachdb/cockroach/pkg/server/config.go:505 +0xb7
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop()
      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:563 +0x390
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServerLocked()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:186 +0xfd
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:122 +0x41c
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start.func2()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:367 +0x4a
  github.com/cockroachdb/cockroach/pkg/util/stop.CloserFn.Close()
      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:107 +0x2e
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop()
      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:563 +0x390
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.TestReplicaDrainLease·dwrap·485()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_test.go:1430 +0x58
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.TestReplicaDrainLease()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_test.go:1525 +0xe4c
  github.com/cockroachdb/cockroach/pkg/server.(*Server).PreStart()
      github.com/cockroachdb/cockroach/pkg/server/server.go:1454 +0x406f
  github.com/cockroachdb/cockroach/pkg/server.(*Server).Start()
      github.com/cockroachdb/cockroach/pkg/server/server.go:911 +0x44
  github.com/cockroachdb/cockroach/pkg/server.(*TestServer).Start()
      github.com/cockroachdb/cockroach/pkg/server/testserver.go:498 +0x109
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).startServer()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:501 +0x10a
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:327 +0x4b5
  github.com/cockroachdb/cockroach/pkg/testutils/serverutils.StartNewTestCluster()
      github.com/cockroachdb/cockroach/pkg/testutils/serverutils/test_cluster_shim.go:229 +0x141
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.TestReplicaDrainLease()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_test.go:1429 +0x2b6
  testing.tRunner()
      GOROOT/src/testing/testing.go:1259 +0x22f
  testing.(*T).Run·dwrap·21()
      GOROOT/src/testing/testing.go:1306 +0x47

Previous read at 0x00c0020b6000 by goroutine 3744:
  [failed to restore the stack]

Goroutine 3744 (finished) created at:
  github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).DoChan()
      github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:121 +0x59a
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLiveness()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1273 +0x996
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).heartbeatInternal()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:947 +0xbc4
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1.1()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:762 +0x5b9
  github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout()
      github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:91 +0xed
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:745 +0x4c6
  github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout()
      github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:91 +0xed
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:745 +0x4c6
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494 +0x235

@erikgrinaker erikgrinaker added this to Incoming in (Deprecated) Storage via automation May 19, 2022
@blathers-crl blathers-crl bot added the T-storage Storage Team label May 19, 2022
@erikgrinaker erikgrinaker removed the T-kv KV Team label May 19, 2022
@erikgrinaker erikgrinaker removed this from roachtest/unit test backlog in KV May 19, 2022
@erikgrinaker
Copy link
Contributor

erikgrinaker commented May 19, 2022

For context: in #81133 we made the synchronous disk write during node heartbeats use a separate goroutine, to handle context cancellation. Looks like this can race with engine shutdown, possibly due to non-atomic access of atomic variables.

resultCs[i], _ = nl.engineSyncs.DoChan(strconv.Itoa(i), func() (interface{}, error) {
return nil, storage.WriteSyncNoop(eng)
})

// WriteSyncNoop carries out a synchronous no-op write to the engine.
func WriteSyncNoop(eng Engine) error {
batch := eng.NewBatch()
defer batch.Close()
if err := batch.LogData(nil); err != nil {
return err
}
if err := batch.Commit(true /* sync */); err != nil {
return err
}
return nil
}

@nicktrav
Copy link
Collaborator

Took a quick look at this just now. I think you're right.

This field is is implicitly atomic:

https://github.com/cockroachdb/pebble/blob/9bd431430a38da6564dc46e2adfcf8ff77e10c23/record/log_writer.go#L26

Yet we read from it here non-atomically here:

https://github.com/cockroachdb/pebble/blob/9bd431430a38da6564dc46e2adfcf8ff77e10c23/record/log_writer.go#L678

Seems like the trivial fix would be to read atomically. Though we should probably do a deeper audit of the write / read sites.

I'll file a bug and link back. Will assign this Storage. Thanks for the heads up!

@nicktrav nicktrav added the A-storage Relating to our storage engine (Pebble) on-disk storage. label May 19, 2022
@nicktrav nicktrav moved this from Incoming to Prioritized in (Deprecated) Storage May 19, 2022
@erikgrinaker
Copy link
Contributor

Yeah, I did a quick trial with making all accesses atomic, but I still tripped the race detector. May have missed something though.

@nicktrav
Copy link
Collaborator

This is also trivially reproducible:

$ ./dev test ./pkg/kv/kvserver --filter TestReplicaDrainLease --race --stress

One of the stacks was missing in the original log. Here's the full output, for additional context.

Output
WARNING: DATA RACE
Read at 0x00c0035e2000 by goroutine 42:
  github.com/cockroachdb/pebble/record.(*LogWriter).emitEOFTrailer()
      github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:678 +0x4b
  github.com/cockroachdb/pebble/record.(*LogWriter).Close()
      github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:589 +0x32
  github.com/cockroachdb/pebble.(*DB).Close()
      github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:1194 +0x664
  github.com/cockroachdb/cockroach/pkg/storage.(*Pebble).Close()
      github.com/cockroachdb/cockroach/pkg/storage/pebble.go:1026 +0x71
  github.com/cockroachdb/cockroach/pkg/server.(*Engines).Close()
      github.com/cockroachdb/cockroach/pkg/server/config.go:505 +0xb7
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop()
      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:563 +0x390
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServerLocked()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:186 +0xfd
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:122 +0x41c
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start.func2()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:367 +0x4a
  github.com/cockroachdb/cockroach/pkg/util/stop.CloserFn.Close()
      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:107 +0x2e
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop()
      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:563 +0x390
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.TestReplicaDrainLease·dwrap·485()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_test.go:1430 +0x58
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.TestReplicaDrainLease()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_test.go:1525 +0xe4c
  testing.tRunner()
      GOROOT/src/testing/testing.go:1259 +0x22f
  testing.(*T).Run·dwrap·21()
      GOROOT/src/testing/testing.go:1306 +0x47

Previous write at 0x00c0035e2000 by goroutine 706:                                                                                                                                                                                   [26/2603]
  sync/atomic.StoreInt32()
      src/runtime/race_amd64.s:255 +0xb
  sync/atomic.StoreInt32()
      <autogenerated>:1 +0x1a
  github.com/cockroachdb/pebble/record.(*LogWriter).SyncRecord()
      github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:646 +0xe4
  github.com/cockroachdb/pebble.(*DB).commitWrite()
      github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:851 +0x371
  github.com/cockroachdb/pebble.(*DB).commitWrite-fm()
      github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:802 +0x64
  github.com/cockroachdb/pebble.(*commitPipeline).prepare()
      github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/commit.go:379 +0x2db
  github.com/cockroachdb/pebble.(*commitPipeline).Commit()
      github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/commit.go:253 +0xa5
  github.com/cockroachdb/pebble.(*DB).Apply()
      github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:757 +0x465
  github.com/cockroachdb/pebble.(*Batch).Commit()
      github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/batch.go:932 +0xbb
  github.com/cockroachdb/cockroach/pkg/storage.(*pebbleBatch).Commit()
      github.com/cockroachdb/cockroach/pkg/storage/pebble_batch.go:516 +0x5a
  github.com/cockroachdb/cockroach/pkg/storage.WriteSyncNoop()
      github.com/cockroachdb/cockroach/pkg/storage/engine.go:938 +0x105
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLiveness.func1()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1274 +0x3e
  github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall()
      github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:128 +0x67
  github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).DoChan·dwrap·1()
      github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:121 +0x71

Goroutine 42 (running) created at:
  testing.(*T).Run()
      GOROOT/src/testing/testing.go:1306 +0x726
  testing.runTests.func1()
      GOROOT/src/testing/testing.go:1598 +0x99
  testing.tRunner()
      GOROOT/src/testing/testing.go:1259 +0x22f
  testing.runTests()
      GOROOT/src/testing/testing.go:1596 +0x7ca
  testing.(*M).Run()
      GOROOT/src/testing/testing.go:1504 +0x9d1
  github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestMain()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/main_test.go:49 +0x105
  main.main()
      main/bazel-out/darwin-fastbuild/bin/pkg/kv/kvserver/kvserver_test_/testmain.go:1435 +0x5c6

Goroutine 706 (finished) created at:
  github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).DoChan()
      github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:121 +0x59a
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLiveness()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1273 +0x996
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).heartbeatInternal()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:947 +0xbc4
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1.1()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:762 +0x5b9
  github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout()
      github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:91 +0xed
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:745 +0x4c6
  github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout()
      github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:91 +0xed
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:745 +0x4c6
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494 +0x235

Seems like there's a non-atomic write happening in there too.

@jlinder jlinder added sync-me and removed sync-me labels May 20, 2022
@nicktrav nicktrav self-assigned this May 23, 2022
nicktrav added a commit to nicktrav/pebble that referenced this issue May 23, 2022
Currently, a block's `written` field is intended to be used atomically.
There are existing call sites where the field is read from or written to
non-atomically, which can result in a data race (as observed in
cockroachdb/cockroach#81511).

Make reads and writes of `block.written` atomic.

This results in up to a 8% reduction in latency and increase in
throughput, depending on the size of the record, due to a change to use
a stack variable rather that dereferencing a field on the struct.
However, it should be noted that this particular benchmark does not
touch the filesystem (it makes use of `io.Discard`). In practice the
savings would be dwarfed by the filesystem operation (sync / write).

```
name                       old time/op    new time/op    delta
RecordWrite/size=8-16        42.9ns ± 5%    40.7ns ± 2%  -5.24%  (p=0.000 n=10+10)
RecordWrite/size=16-16       42.7ns ± 4%    40.7ns ± 1%  -4.89%  (p=0.000 n=10+10)
RecordWrite/size=32-16       44.0ns ± 6%    40.4ns ± 1%  -8.24%  (p=0.000 n=10+10)
RecordWrite/size=64-16       44.1ns ± 1%    41.6ns ± 0%  -5.58%  (p=0.000 n=10+10)
RecordWrite/size=256-16      67.7ns ± 4%    65.2ns ± 1%  -3.71%  (p=0.000 n=10+9)
RecordWrite/size=1028-16      124ns ± 2%     119ns ± 1%  -4.04%  (p=0.000 n=10+10)
RecordWrite/size=4096-16      306ns ± 5%     296ns ± 1%  -3.25%  (p=0.000 n=10+9)
RecordWrite/size=65536-16    4.54µs ± 1%    4.52µs ± 2%    ~     (p=0.247 n=10+10)

name                       old speed      new speed      delta
RecordWrite/size=8-16       186MB/s ± 5%   197MB/s ± 2%  +5.49%  (p=0.000 n=10+10)
RecordWrite/size=16-16      374MB/s ± 3%   394MB/s ± 1%  +5.12%  (p=0.000 n=10+10)
RecordWrite/size=32-16      727MB/s ± 5%   792MB/s ± 1%  +8.91%  (p=0.000 n=10+10)
RecordWrite/size=64-16     1.45GB/s ± 1%  1.54GB/s ± 0%  +5.91%  (p=0.000 n=10+10)
RecordWrite/size=256-16    3.78GB/s ± 4%  3.92GB/s ± 1%  +3.82%  (p=0.000 n=10+9)
RecordWrite/size=1028-16   8.31GB/s ± 2%  8.66GB/s ± 1%  +4.21%  (p=0.000 n=10+10)
RecordWrite/size=4096-16   13.4GB/s ± 5%  13.8GB/s ± 1%  +3.30%  (p=0.000 n=10+9)
RecordWrite/size=65536-16  14.4GB/s ± 1%  14.5GB/s ± 2%    ~     (p=0.247 n=10+10)

name                       old alloc/op   new alloc/op   delta
RecordWrite/size=8-16         0.00B          0.00B         ~     (all equal)
RecordWrite/size=16-16        0.00B          0.00B         ~     (all equal)
RecordWrite/size=32-16        0.00B          0.00B         ~     (all equal)
RecordWrite/size=64-16        0.00B          0.00B         ~     (all equal)
RecordWrite/size=256-16       0.00B          0.00B         ~     (all equal)
RecordWrite/size=1028-16      0.00B          0.00B         ~     (all equal)
RecordWrite/size=4096-16      0.00B          0.00B         ~     (all equal)
RecordWrite/size=65536-16     2.00B ± 0%     2.00B ± 0%    ~     (all equal)

name                       old allocs/op  new allocs/op  delta
RecordWrite/size=8-16          0.00           0.00         ~     (all equal)
RecordWrite/size=16-16         0.00           0.00         ~     (all equal)
RecordWrite/size=32-16         0.00           0.00         ~     (all equal)
RecordWrite/size=64-16         0.00           0.00         ~     (all equal)
RecordWrite/size=256-16        0.00           0.00         ~     (all equal)
RecordWrite/size=1028-16       0.00           0.00         ~     (all equal)
RecordWrite/size=4096-16       0.00           0.00         ~     (all equal)
RecordWrite/size=65536-16      0.00           0.00         ~     (all equal)
```

Fixes cockroachdb#1692.
@erikgrinaker
Copy link
Contributor

As discussed in cockroachdb/pebble#1719, we'll do this through CRDB synchronization instead (i.e. don't attempt to write when we're closing the database).

@craig craig bot closed this as completed in 5b0453e May 25, 2022
Replication automation moved this from Incoming to Done May 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-storage Storage Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants