engine: avoid committing empty batches #30398
Conversation
Force-pushed from 0932c8e to 78efc63
A 10% performance win would be surprising to me, but it really depends on how many WAL syncs we're saving. I think it would be worth adding in some metrics/logging and checking how often we were performing superfluous syncs (wasted syncs/sec) that we will now avoid.
Reviewed 6 of 6 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/engine/rocksdb_test.go, line 1585 at r1 (raw file):
```go
func TestRocksDBWALFile(t *testing.T) {
```
nit: this could use a more targeted name. We're testing something very specific.
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/storage/engine/rocksdb.go, line 1116 at r1 (raw file):
```go
// GetSortedWALFiles retrieves information about all of the write-ahead log
// files in this engine in order from oldest to newest.
func (r *RocksDB) GetSortedWALFiles() ([]WALFileInfo, error) {
```
This is a lot of implementation to verify the WAL is the same size. If you used an on-disk RocksDB instance, you could have done something like `filepath.Glob(dir, "*.log")`.
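For reference, `filepath.Glob` takes a single pattern argument, so the suggestion presumably means joining the pattern onto the directory. A minimal sketch of that approach (the `listWALFiles` helper and the directory path are hypothetical, not code from this PR):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// listWALFiles returns the paths of the RocksDB write-ahead log files
// in dir. RocksDB names its WAL files with a ".log" suffix, so a glob
// is enough for a test that only needs the files and their sizes.
func listWALFiles(dir string) ([]string, error) {
	return filepath.Glob(filepath.Join(dir, "*.log"))
}

func main() {
	// Hypothetical on-disk RocksDB directory for illustration.
	files, err := listWALFiles("/tmp/rocksdb-test")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(files)
}
```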
pkg/storage/engine/rocksdb.go, line 1772 at r1 (raw file):
```go
r.distinctOpen = false
if r.flushes == 0 && r.builder.count == 0 {
```
Hmm, we have similar logic in `commitInternal`. I'm guessing that isn't preventing empty batches because `len(r.builder.repr) > 0` when batches have been grouped together.
Committing an empty batch previously wrote an empty entry to the RocksDB WAL file. Coupled with a bug in Raft (etcd-io/etcd#10106), this was causing unnecessary synchronous writes to disk during Raft heartbeats. A non-rigorous benchmark shows that this yields a nearly 10% performance improvement on a three-node cluster under a write-heavy workload (kv5%).

Release note: None
Force-pushed from 78efc63 to 5c7532c
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/engine/rocksdb.go, line 1116 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
This is a lot of implementation to verify the WAL is the same size. If you used an on-disk RocksDB instance, you could have done something like `filepath.Glob(dir, "*.log")`.
Yeah, I considered doing that, but I thought it might be nice to have an interface to access the WAL metadata available in case we want to stick it in the logs/metrics. Happy to rip that logic out if you don't think we'll ever do that.
pkg/storage/engine/rocksdb.go, line 1772 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Hmm, we have similar logic in `commitInternal`. I'm guessing that isn't preventing empty batches because `len(r.builder.repr) > 0` when batches have been grouped together.
Precisely. If 50 empty batches are committed concurrently with a non-empty batch, that one non-empty batch will make the grouped batch look non-empty.
I've adjusted the logic in `commitInternal` to panic if it sees an empty batch.
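The shape of the fix can be sketched as follows. This is a simplified illustration of the idea, not the actual CockroachDB code: each batch checks its own emptiness before committing, because once batches are grouped, the leader only sees the concatenated representation and a single non-empty batch masks any empty ones. The `batch` type and `syncWAL` callback are hypothetical stand-ins.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for a RocksDB write batch.
type batch struct {
	count int    // number of mutations in this batch
	repr  []byte // serialized contents
}

// commit short-circuits empty batches so they never reach the WAL.
// Previously an empty batch still produced an empty WAL entry and a
// synchronous disk write.
func (b *batch) commit(syncWAL func([]byte)) {
	if b.count == 0 {
		return
	}
	commitInternal(b, syncWAL)
}

// commitInternal assumes the emptiness check already happened, and
// panics otherwise, mirroring the adjustment described above.
func commitInternal(b *batch, syncWAL func([]byte)) {
	if b.count == 0 {
		panic("commitInternal called on empty batch")
	}
	syncWAL(b.repr)
}

func main() {
	syncs := 0
	sync := func([]byte) { syncs++ }
	empty := &batch{}
	full := &batch{count: 1, repr: []byte("put k v")}
	empty.commit(sync)
	full.commit(sync)
	fmt.Println("WAL syncs:", syncs) // prints "WAL syncs: 1"
}
```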
pkg/storage/engine/rocksdb_test.go, line 1585 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: this could use a more targeted name. We're testing something very specific.
Done.
Reviewed 4 of 6 files at r1, 2 of 2 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/engine/rocksdb.go, line 1116 at r1 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Yeah, I considered doing that, but I thought it might be nice to have an interface to access the WAL metadata available in case we want to stick it in the logs/metrics. Happy to rip that logic out if you don't think we'll ever do that.
I don't have a strong opinion here. I'm not seeing this as being something we'll use outside of a test, but I could be mistaken. On the other hand, you've already written the code.
@nvanbenschoten the numbers seem to be reproducible. I'm getting between an 8% and 9% performance win. Details:
```
$ ./workload run kv --init --drop --duration=5m --read-percent=5 --concurrency=192 --splits=1000
```

Before:

```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  300.0s        0         610958         2036.5     94.3     92.3    184.5    234.9    671.1
```

After:

```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  300.0s        0         660517         2201.7     87.2     79.7    176.2    234.9    453.0
```
I added a metric and found that before the change we performed 118.6k WAL syncs in the 5m that the workload ran, and after the change we performed 121.3k syncs. That's not as much of a difference as I would have expected, but I guess it's significant? It's possible my benchmarking is still flawed. Like, do we really expect a 90ms avg latency for kv5?
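As a quick sanity check on the claimed win, the cumulative ops/sec figures from the two runs above work out to roughly 8%:

```go
package main

import "fmt"

func main() {
	// Cumulative ops/sec from the before/after benchmark runs above.
	before, after := 2036.5, 2201.7
	improvement := (after - before) / before * 100
	fmt.Printf("throughput improvement: %.1f%%\n", improvement) // prints "throughput improvement: 8.1%"
}
```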
bors r=nvanbenschoten,petermattis,bdarnell
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/engine/rocksdb.go, line 1116 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I don't have a strong opinion here. I'm not seeing this as being something we'll use outside of a test, but I could be mistaken. On the other hand, you've already written the code.
Ack. In that case I'm just going to move forward as is.
30398: engine: avoid committing empty batches r=nvanbenschoten,petermattis,bdarnell a=benesch

Committing an empty batch previously wrote an empty entry to the RocksDB WAL file. Coupled with a bug in Raft (etcd-io/etcd#10106), this was causing unnecessary synchronous writes to disk during Raft heartbeats. A non-rigorous benchmark shows that this yields a nearly 10% performance improvement on a three-node cluster under a write-heavy workload (kv5%).

^ @nvanbenschoten can this be right??? I don't trust my numbers, but I don't have time to run another benchmark tonight.

Note that the explosion of C++ is just to get at the RocksDB WAL's file size in a unit test. It's not a code path exercised anywhere but tests.

Release note: None

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
I see the `--drop` flag. Are we wiping between runs? Also, those latencies are an order of magnitude larger than I'd expect. Try dropping your concurrency down to 24.

FWIW we don't really ever run `kv5`.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
Build succeeded
I ran some benchmarks before and after this change on a three-node GCE cluster with 4 vCPU machines and local SSDs. It certainly looks like this provides a throughput improvement on write-heavy workloads, especially when running with slow disks (i.e. with the SSD write barrier enabled).
Awesome find again @benesch!