record: ensure block.written is accessed atomically #1719
Conversation
Currently, a block's `written` field is intended to be used atomically. There are existing call sites where the field is read from or written to non-atomically, which can result in a data race (as observed in cockroachdb/cockroach#81511). Make reads and writes of `block.written` atomic.

This results in up to an 8% reduction in latency and an increase in throughput, depending on the size of the record, due to a change to use a stack variable rather than dereferencing a field on the struct. However, it should be noted that this particular benchmark does not touch the filesystem (it makes use of `io.Discard`). In practice the savings would be dwarfed by the filesystem operation (sync / write).

```
name                       old time/op    new time/op    delta
RecordWrite/size=8-16        42.9ns ± 5%    40.7ns ± 2%   -5.24%  (p=0.000 n=10+10)
RecordWrite/size=16-16       42.7ns ± 4%    40.7ns ± 1%   -4.89%  (p=0.000 n=10+10)
RecordWrite/size=32-16       44.0ns ± 6%    40.4ns ± 1%   -8.24%  (p=0.000 n=10+10)
RecordWrite/size=64-16       44.1ns ± 1%    41.6ns ± 0%   -5.58%  (p=0.000 n=10+10)
RecordWrite/size=256-16      67.7ns ± 4%    65.2ns ± 1%   -3.71%  (p=0.000 n=10+9)
RecordWrite/size=1028-16      124ns ± 2%     119ns ± 1%   -4.04%  (p=0.000 n=10+10)
RecordWrite/size=4096-16      306ns ± 5%     296ns ± 1%   -3.25%  (p=0.000 n=10+9)
RecordWrite/size=65536-16    4.54µs ± 1%    4.52µs ± 2%     ~     (p=0.247 n=10+10)

name                       old speed      new speed      delta
RecordWrite/size=8-16       186MB/s ± 5%   197MB/s ± 2%   +5.49%  (p=0.000 n=10+10)
RecordWrite/size=16-16      374MB/s ± 3%   394MB/s ± 1%   +5.12%  (p=0.000 n=10+10)
RecordWrite/size=32-16      727MB/s ± 5%   792MB/s ± 1%   +8.91%  (p=0.000 n=10+10)
RecordWrite/size=64-16     1.45GB/s ± 1%  1.54GB/s ± 0%   +5.91%  (p=0.000 n=10+10)
RecordWrite/size=256-16    3.78GB/s ± 4%  3.92GB/s ± 1%   +3.82%  (p=0.000 n=10+9)
RecordWrite/size=1028-16   8.31GB/s ± 2%  8.66GB/s ± 1%   +4.21%  (p=0.000 n=10+10)
RecordWrite/size=4096-16   13.4GB/s ± 5%  13.8GB/s ± 1%   +3.30%  (p=0.000 n=10+9)
RecordWrite/size=65536-16  14.4GB/s ± 1%  14.5GB/s ± 2%     ~     (p=0.247 n=10+10)

name                       old alloc/op   new alloc/op   delta
RecordWrite/size=8-16         0.00B          0.00B          ~     (all equal)
RecordWrite/size=16-16        0.00B          0.00B          ~     (all equal)
RecordWrite/size=32-16        0.00B          0.00B          ~     (all equal)
RecordWrite/size=64-16        0.00B          0.00B          ~     (all equal)
RecordWrite/size=256-16       0.00B          0.00B          ~     (all equal)
RecordWrite/size=1028-16      0.00B          0.00B          ~     (all equal)
RecordWrite/size=4096-16      0.00B          0.00B          ~     (all equal)
RecordWrite/size=65536-16     2.00B ± 0%     2.00B ± 0%     ~     (all equal)

name                       old allocs/op  new allocs/op  delta
RecordWrite/size=8-16          0.00           0.00          ~     (all equal)
RecordWrite/size=16-16         0.00           0.00          ~     (all equal)
RecordWrite/size=32-16         0.00           0.00          ~     (all equal)
RecordWrite/size=64-16         0.00           0.00          ~     (all equal)
RecordWrite/size=256-16        0.00           0.00          ~     (all equal)
RecordWrite/size=1028-16       0.00           0.00          ~     (all equal)
RecordWrite/size=4096-16       0.00           0.00          ~     (all equal)
RecordWrite/size=65536-16      0.00           0.00          ~     (all equal)
```

Fixes cockroachdb#1692.
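For reference, here is a minimal sketch of the pattern the description refers to, assuming a `block` struct with a byte buffer and a `written` offset. The field names, buffer size, and method names are illustrative and are not copied from Pebble's `record` package:

```go
// Sketch only: illustrative types, not Pebble's actual record package.
package record

import "sync/atomic"

type block struct {
	buf [32 * 1024]byte
	// written is the number of bytes written to buf. It is accessed
	// atomically: the writer stores to it and the flusher loads from it.
	written int32
}

// append copies p into the block and publishes the new offset with an atomic
// store. Loading the offset into a stack variable up front avoids repeatedly
// dereferencing the struct field.
func (b *block) append(p []byte) int {
	written := atomic.LoadInt32(&b.written)
	n := copy(b.buf[written:], p)
	atomic.StoreInt32(&b.written, written+int32(n))
	return n
}

// flushableLength is called by the flushing goroutine concurrently with
// append, so it must use an atomic load.
func (b *block) flushableLength() int32 {
	return atomic.LoadInt32(&b.written)
}
```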
This is just one approach to solving this issue: honoring the contract that the struct field comment implies regarding atomicity. It looks like there are also some comments around using the commit pipeline mutex for external synchronization (for example), though evidently something is slipping through the cracks and we're seeing the race in Cockroach. For context, in cockroachdb/cockroach#81133, parallel writes are made to each engine's WAL as part of liveness checking, which can race if the context is cancelled and the log is closed. It seems this code in Pebble has been this way since inception, and we're only tickling the issue now due to the nature of the concurrent access introduced in the Cockroach PR.
There are some gaps in the comments which make it difficult to tease out how this is intended to work. I believe the synchronization is designed for a concurrent single-consumer, single-producer scheme. Each log writer has a single long-lived flushing goroutine (the consumer). The commit pipeline (`d.commit.mu`) ensures that at most a single writer is adding pending data to be flushed at a time (the producer).

Since we're guaranteed to have at most one writer appending data to be flushed, methods that are only used in the writing code path do not need to use atomic loads: there's no potential for another goroutine to write to `b.written` concurrently, because only the goroutine holding `d.commit.mu` is permitted to write to `b.written`. However, the flushing goroutine needs to be able to access `b.written` concurrently with writes, so it uses atomic loads, and the writing goroutine must use atomic stores.
I think the data race is caused by the omission of a commit-pipeline lock in `DB.Close`. Closing the WAL writer normally requires holding the commit pipeline mutex (e.g., `makeRoomForWrite` always holds `d.commit.mu`).
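A hedged sketch of what that adjustment could look like, using stand-in types rather than Pebble's real ones (only `d.commit.mu` and the `makeRoomForWrite` analogy come from the discussion):

```go
// Sketch only: stand-in types showing Close holding the commit pipeline
// mutex while the WAL writer is closed.
package main

import "sync"

type walWriter struct{} // stand-in for record.LogWriter

func (w *walWriter) Close() error { return nil }

type commitPipeline struct {
	mu sync.Mutex
}

type DB struct {
	commit commitPipeline
	wal    *walWriter
}

// Close holds d.commit.mu, mirroring makeRoomForWrite, so no commit can be
// appending to the WAL while the writer is closed.
func (d *DB) Close() error {
	d.commit.mu.Lock()
	defer d.commit.mu.Unlock()
	// ... other teardown elided ...
	return d.wal.Close()
}

func main() {
	d := &DB{wal: &walWriter{}}
	_ = d.Close()
}
```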
I think there's still a problem, though. The `(*pebble.DB)` contract is that calling any method concurrently with `Close` is prohibited:

```go
// Close closes the DB.
//
// It is not safe to close a DB until all outstanding iterators are closed
// or to call Close concurrently with any other DB method. It is not valid
// to call any of a DB's methods after the DB has been closed.
```
If we were to make the adjustment to lock the commit pipeline at the beginning of `DB.Close`, the sync write batch application would panic when it acquires the mutex:

```go
func (d *DB) Apply(batch *Batch, opts *WriteOptions) error {
	if err := d.closed.Load(); err != nil {
		panic(err)
	}
	// ...
```
It seems like we could change this into an ordinary error return, but I'm not sure whether that solves the problem. What happens in cockroachdb/cockroach#81133 if the liveness write returns an error? Can we avoid the race between engine close and the liveness write? cc @erikgrinaker
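For concreteness, a sketch of the "ordinary error return" alternative, with minimal stand-in types rather than the real `pebble` ones; this is not a proposed patch:

```go
// Sketch only: minimal stand-in types; not a patch against Pebble.
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var ErrClosed = errors.New("pebble: closed")

type Batch struct{}
type WriteOptions struct{}

type DB struct {
	closed atomic.Value // holds an error once the DB is closed
}

// Apply reports ErrClosed to the caller instead of panicking when the DB has
// already been closed.
func (d *DB) Apply(batch *Batch, opts *WriteOptions) error {
	if err, _ := d.closed.Load().(error); err != nil {
		return err // rather than panic(err)
	}
	// ... commit the batch ...
	return nil
}

func main() {
	d := &DB{}
	d.closed.Store(ErrClosed)
	fmt.Println(d.Apply(&Batch{}, &WriteOptions{})) // pebble: closed
}
```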
> If we were to make the adjustment to lock the commit pipeline at the beginning of DB.Close, the sync write batch application would panic when it acquires the mutex.

This is not quite right. There's already the risk of a panic if `Close()` completes its atomic store before the liveness goroutine enters `Apply`.

It's hard to reason about what a concurrent liveness write would do, because the internal state of the database is ~undefined during `Close`. I think we should find a way to avoid the batch application racing with the closing of the database above Pebble, and then adjust `Close` to make concurrent operations a little easier to catch, e.g. changing the first line of `Close()` to swap `d.closed`'s value to `ErrClosed`, increasing the chances that a concurrent operation observes the `ErrClosed`.
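A sketch of that idea, again with stand-in types (requires Go 1.17+ for `atomic.Value.Swap`); the real `d.closed` handling in Pebble may differ:

```go
// Sketch only: stand-in types; requires Go 1.17+ for atomic.Value.Swap.
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var ErrClosed = errors.New("pebble: closed")

type DB struct {
	closed atomic.Value // nil while open, an error once closed
}

// Close publishes the closed state as its very first action, so a concurrent
// operation is more likely to observe ErrClosed and fail loudly rather than
// race with the rest of teardown.
func (d *DB) Close() error {
	if prev := d.closed.Swap(ErrClosed); prev != nil {
		return prev.(error) // already closed
	}
	// ... tear down the commit pipeline, WAL, memtables, etc. ...
	return nil
}

func main() {
	d := &DB{}
	fmt.Println(d.Close()) // <nil>
	fmt.Println(d.Close()) // pebble: closed
}
```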
> I believe the synchronization is designed for a concurrent single-consumer, single-producer scheme

This was my read too. The commentary was a little confusing, though.

While this patch fixes this particular issue, we're still getting a race (when used in Cockroach) between the following two call sites when we close the writer:

Read (line 677 at bb2c150): `b := w.block`

Write (line 573 at bb2c150): `w.block = nextBlock`

Rather than sprinkling mutexes and atomics around Pebble to cover use cases above Pebble, I wonder if there's a way we can continue to honor the existing contract and adapt what we're doing above Pebble:

> I think we should find a way to avoid the batch application racing with the closing of the database above Pebble

Basically +1 on this part.
I'm going to close this out as a signal that the code shouldn't be reviewed, but we can continue the conversation here.
> Rather than sprinkling mutexes and atomics around Pebble to cover use cases above Pebble, I wonder if there's a way we can continue to honor the existing contract and adapt what we're doing above Pebble:
>
> > I think we should find a way to avoid the batch application racing with the closing of the database above Pebble
>
> Basically +1 on this part.

OK, I'll add some synchronization in CRDB.
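Purely as an illustration of what such synchronization above Pebble could look like, here is a hypothetical guard that serializes engine close against in-flight writes; the `engineGuard` type and its methods are invented for this sketch and do not reflect the actual CockroachDB change:

```go
// Sketch only: a hypothetical guard above Pebble; names are invented and do
// not reflect the actual CockroachDB change.
package main

import (
	"errors"
	"sync"
)

var errEngineClosed = errors.New("engine is closed")

// engineGuard serializes engine close against in-flight writes: writers hold
// a read lock for the duration of a batch application, while close takes the
// write lock, so close waits for in-flight writes and later writes fail fast.
type engineGuard struct {
	mu     sync.RWMutex
	closed bool
}

func (g *engineGuard) write(apply func() error) error {
	g.mu.RLock()
	defer g.mu.RUnlock()
	if g.closed {
		return errEngineClosed
	}
	return apply()
}

func (g *engineGuard) close(closeEngine func() error) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closed {
		return errEngineClosed
	}
	g.closed = true
	return closeEngine()
}

func main() {
	var g engineGuard
	_ = g.write(func() error { return nil /* e.g. the liveness batch commit */ })
	_ = g.close(func() error { return nil /* e.g. pebble DB.Close */ })
}
```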