engine: avoid committing empty batches #30398
Conversation
Force-pushed from 0932c8e to 78efc63
A 10% performance win would be surprising to me, but it really depends on how many WAL syncs we're saving. I think it would be worth adding in some metrics/logging and checking how often we were performing superfluous syncs (wasted syncs/sec) that we will now avoid.
Reviewed 6 of 6 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/engine/rocksdb_test.go, line 1585 at r1 (raw file):
```go
func TestRocksDBWALFile(t *testing.T) {
```
nit: this could use a more targeted name. We're testing something very specific.
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/storage/engine/rocksdb.go, line 1116 at r1 (raw file):
```go
// GetSortedWALFiles retrieves information about all of the write-ahead log
// files in this engine in order from oldest to newest.
func (r *RocksDB) GetSortedWALFiles() ([]WALFileInfo, error) {
```
This is a lot of implementation to verify the WAL is the same size. If you used an on-disk RocksDB instance, you could have done something like `filepath.Glob(dir, "*.log")`.
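For reference, `filepath.Glob` takes a single pattern argument, so the suggestion presumably means joining the pattern onto the directory. A minimal sketch of that approach (the `listWALFiles` helper and the directory path are hypothetical, not code from this PR):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// listWALFiles returns the paths of the RocksDB write-ahead log files
// in dir. RocksDB names its WAL files with a ".log" suffix, so a glob
// is enough for a test that only needs the files and their sizes.
func listWALFiles(dir string) ([]string, error) {
	return filepath.Glob(filepath.Join(dir, "*.log"))
}

func main() {
	// Hypothetical on-disk RocksDB directory for illustration.
	files, err := listWALFiles("/tmp/rocksdb-test")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(files)
}
```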
pkg/storage/engine/rocksdb.go, line 1772 at r1 (raw file):
```go
r.distinctOpen = false
if r.flushes == 0 && r.builder.count == 0 {
```
Hmm, we have similar logic in `commitInternal`. I'm guessing that isn't preventing empty batches because `len(r.builder.repr) > 0` when batches have been grouped together.
Committing an empty batch previously wrote an empty entry to the RocksDB WAL file. Coupled with a bug in Raft (etcd-io/etcd#10106), this was causing unnecessary synchronous writes to disk during Raft heartbeats. A non-rigorous benchmark shows that this yields a nearly 10% performance improvement on a three-node cluster under a write-heavy workload (kv5%).

Release note: None
Force-pushed from 78efc63 to 5c7532c
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/engine/rocksdb.go, line 1116 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
This is a lot of implementation to verify the WAL is the same size. If you used an on-disk RocksDB instance, you could have done something like `filepath.Glob(dir, "*.log")`.
Yeah, I considered doing that, but I thought it might be nice to have an interface to access the WAL metadata available in case we want to stick it in the logs/metrics. Happy to rip that logic out if you don't think we'll ever do that.
pkg/storage/engine/rocksdb.go, line 1772 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Hmm, we have similar logic in `commitInternal`. I'm guessing that isn't preventing empty batches because `len(r.builder.repr) > 0` when batches have been grouped together.
Precisely. If 50 empty batches are committed concurrently with a non-empty batch, that one non-empty batch will make the grouped batch look non-empty.
I've adjusted the logic in `commitInternal` to panic if it sees an empty batch.
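The shape of the fix can be sketched as follows. This is a simplified illustration of the idea, not the actual CockroachDB code: each batch checks its own emptiness before committing, because once batches are grouped, the leader only sees the concatenated representation and a single non-empty batch masks any empty ones. The `batch` type and `syncWAL` callback are hypothetical stand-ins.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for a RocksDB write batch.
type batch struct {
	count int    // number of mutations in this batch
	repr  []byte // serialized contents
}

// commit short-circuits empty batches so they never reach the WAL.
// Previously an empty batch still produced an empty WAL entry and a
// synchronous disk write.
func (b *batch) commit(syncWAL func([]byte)) {
	if b.count == 0 {
		return
	}
	commitInternal(b, syncWAL)
}

// commitInternal assumes the emptiness check already happened, and
// panics otherwise, mirroring the adjustment described above.
func commitInternal(b *batch, syncWAL func([]byte)) {
	if b.count == 0 {
		panic("commitInternal called on empty batch")
	}
	syncWAL(b.repr)
}

func main() {
	syncs := 0
	sync := func([]byte) { syncs++ }
	empty := &batch{}
	full := &batch{count: 1, repr: []byte("put k v")}
	empty.commit(sync)
	full.commit(sync)
	fmt.Println("WAL syncs:", syncs) // prints "WAL syncs: 1"
}
```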
pkg/storage/engine/rocksdb_test.go, line 1585 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: this could use a more targeted name. We're testing something very specific.
Done.
Reviewed 4 of 6 files at r1, 2 of 2 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/engine/rocksdb.go, line 1116 at r1 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Yeah, I considered doing that, but I thought it might be nice to have an interface to access the WAL metadata available in case we want to stick it in the logs/metrics. Happy to rip that logic out if you don't think we'll ever do that.
I don't have a strong opinion here. I'm not seeing this as being something we'll use outside of a test, but I could be mistaken. On the other hand, you've already written the code.
@nvanbenschoten the numbers seem to be reproducible. I'm getting between an 8% and 9% performance win. Details:
```
$ ./workload run kv --init --drop --duration=5m --read-percent=5 --concurrency=192 --splits=1000
```

Before:

```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  300.0s        0         610958         2036.5     94.3     92.3    184.5    234.9    671.1
```

After:

```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  300.0s        0         660517         2201.7     87.2     79.7    176.2    234.9    453.0
```
I added a metric and found that before the change we performed 118.6k WAL syncs in the 5m that the workload ran, and after the change we performed 121.3k syncs. That's not as much of a difference as I would have expected, but I guess it's significant? It's possible my benchmarking is still flawed. Like, do we really expect a 90ms avg latency for kv5?
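As a quick sanity check on the claimed win, the cumulative ops/sec figures from the two runs above work out to roughly 8%:

```go
package main

import "fmt"

func main() {
	// Cumulative ops/sec from the before/after benchmark runs above.
	before, after := 2036.5, 2201.7
	improvement := (after - before) / before * 100
	fmt.Printf("throughput improvement: %.1f%%\n", improvement) // prints "throughput improvement: 8.1%"
}
```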
bors r=nvanbenschoten,petermattis,bdarnell
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/engine/rocksdb.go, line 1116 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I don't have a strong opinion here. I'm not seeing this as being something we'll use outside of a test, but I could be mistaken. On the other hand, you've already written the code.
Ack. In that case I'm just going to move forward as is.
30398: engine: avoid committing empty batches r=nvanbenschoten,petermattis,bdarnell a=benesch

Committing an empty batch previously wrote an empty entry to the RocksDB WAL file. Coupled with a bug in Raft (etcd-io/etcd#10106), this was causing unnecessary synchronous writes to disk during Raft heartbeats. A non-rigorous benchmark shows that this yields a nearly 10% performance improvement on a three-node cluster under a write-heavy workload (kv5%).

^ @nvanbenschoten can this be right??? I don't trust my numbers, but I don't have time to run another benchmark tonight.

Note that the explosion of C++ is just to get at the RocksDB WAL's file size in a unit test. It's not a code path exercised anywhere but tests.

Release note: None

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
I see the `--drop` flag. Are we wiping between runs? Also, those latencies are an order of magnitude larger than I'd expect. Try dropping your concurrency down to 24.

FWIW we don't really ever run `kv5`.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
Build succeeded
I ran some benchmarks before and after this change on a three-node GCE cluster with 4 vCPU machines and local SSDs. It certainly looks like this provides a throughput improvement on write-heavy workloads, especially when running with slow disks (i.e. with the SSD write barrier enabled).
Awesome find again @benesch!