
engine: avoid committing empty batches #30398

Merged 1 commit into cockroachdb:master on Sep 19, 2018

Conversation

@benesch (Contributor) commented Sep 19, 2018

Committing an empty batch previously wrote an empty entry to the RocksDB
WAL file. Coupled with a bug in Raft (etcd-io/etcd#10106), this was
causing unnecessary synchronous writes to disk during Raft heartbeats.

A non-rigorous benchmark shows that this yields a nearly 10% performance
improvement on a three node cluster under a write-heavy workload (kv5%).

^ @nvanbenschoten can this be right??? I don't trust my numbers, but I don't have time to run another benchmark tonight.

Note that the explosion of C++ is just to get at the RocksDB WAL's file size in a unit test. It's not a code path exercised anywhere but tests.

Release note: None
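For readers skimming the diff, here is a minimal Go sketch of the idea. It is not the actual engine code; the `batch` type and its fields are hypothetical stand-ins for the fields discussed in the review below (`flushes`, `builder.count`).

```go
package main

import "fmt"

// batch is a hypothetical stand-in for the engine's RocksDB batch type; the
// field names loosely mirror the ones discussed in this review.
type batch struct {
	flushes int // times the builder has already been flushed to RocksDB
	count   int // mutations buffered in the current builder
}

// Commit returns early when the batch is empty, so no empty entry (and no
// accompanying fsync) ever reaches the write-ahead log.
func (b *batch) Commit(sync bool) error {
	if b.flushes == 0 && b.count == 0 {
		return nil // nothing to write: skip the WAL entirely
	}
	// In the real engine this is where the batch repr is handed to RocksDB.
	fmt.Println("writing batch to WAL, sync =", sync)
	return nil
}

func main() {
	empty := &batch{}
	_ = empty.Commit(true) // no WAL write, no sync

	nonEmpty := &batch{count: 1}
	_ = nonEmpty.Commit(true) // normal commit path
}
```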


@nvanbenschoten (Member) left a comment

:lgtm:

A 10% performance win would be surprising to me, but it really depends on how many WAL syncs we're saving. I think it would be worth adding in some metrics/logging and checking how often we were performing superfluous syncs (wasted syncs/sec) that we will now avoid.
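A rough sketch of what such a measurement could look like, using a plain atomic counter rather than CockroachDB's metrics machinery; every name here is hypothetical.

```go
package main

import (
	"log"
	"sync/atomic"
	"time"
)

// skippedEmptyCommits is a hypothetical counter, not an existing CockroachDB
// metric: it counts commits that would previously have written (and possibly
// synced) an empty entry to the WAL.
var skippedEmptyCommits int64

func commit(empty bool) {
	if empty {
		atomic.AddInt64(&skippedEmptyCommits, 1)
		return // the new fast path: no WAL write, no sync
	}
	// ... normal commit path ...
}

func main() {
	start := time.Now()
	for i := 0; i < 100000; i++ {
		commit(i%3 != 0) // simulate a heartbeat-heavy mix of empty commits
	}
	// Dividing the count by the run duration gives the "wasted syncs/sec" figure.
	n := atomic.LoadInt64(&skippedEmptyCommits)
	log.Printf("avoided %d empty-batch WAL writes (%.0f/sec)",
		n, float64(n)/time.Since(start).Seconds())
}
```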

Reviewed 6 of 6 files at r1.


pkg/storage/engine/rocksdb_test.go, line 1585 at r1 (raw file):

}

func TestRocksDBWALFile(t *testing.T) {

nit: this could use a more targeted name. We're testing something very specific.

@petermattis (Collaborator) left a comment


pkg/storage/engine/rocksdb.go, line 1116 at r1 (raw file):

// GetSortedWALFiles retrieves information about all of the write-ahead log
// files in this engine in order from oldest to newest.
func (r *RocksDB) GetSortedWALFiles() ([]WALFileInfo, error) {

This is a lot of implementation to verify the WAL is the same size. If you used an on-disk RocksDB instance, you could have done something like filepath.Glob(filepath.Join(dir, "*.log")).
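For reference, a sketch of that glob-based alternative, assuming the engine is backed by an on-disk data directory; the helper name `walFileSizes` is made up (filepath.Glob takes a single pattern, so the directory is joined into it).

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// walFileSizes returns the size of every RocksDB WAL segment (*.log) in dir.
// This is the glob-based approach suggested above; it only works when the
// engine is backed by an on-disk directory.
func walFileSizes(dir string) (map[string]int64, error) {
	paths, err := filepath.Glob(filepath.Join(dir, "*.log"))
	if err != nil {
		return nil, err
	}
	sizes := make(map[string]int64, len(paths))
	for _, p := range paths {
		fi, err := os.Stat(p)
		if err != nil {
			return nil, err
		}
		sizes[p] = fi.Size()
	}
	return sizes, nil
}

func main() {
	sizes, err := walFileSizes(os.TempDir()) // stand-in for the engine's data dir
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(sizes)
}
```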


pkg/storage/engine/rocksdb.go, line 1772 at r1 (raw file):

	r.distinctOpen = false

	if r.flushes == 0 && r.builder.count == 0 {

Hmm, we have similar logic in commitInternal. I'm guessing that isn't preventing empty batches because len(r.builder.repr) > 0 when batches have been grouped together.

@benesch (Contributor, Author) left a comment


pkg/storage/engine/rocksdb.go, line 1116 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This is a lot of implementation to verify the WAL is the same size. If you used an on-disk RocksDB instance, you could have done something like filepath.Glob(filepath.Join(dir, "*.log")).

Yeah, I considered doing that, but I thought it might be nice to have an interface to access the WAL metadata available in case we want to stick it in the logs/metrics. Happy to rip that logic out if you don't think we'll ever do that.


pkg/storage/engine/rocksdb.go, line 1772 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Hmm, we have similar logic in commitInternal. I'm guessing that isn't preventing empty batches because len(r.builder.repr) > 0 when batches have been grouped together.

Precisely. If 50 empty batches are committed concurrently with a non-empty batch, that one non-empty batch will make the grouped batch look non-empty.

I've adjusted the logic in commitInternal to panic if it sees an empty batch.
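A minimal sketch of that ordering, with hypothetical types standing in for the real batch and commitInternal: the emptiness check runs per batch before grouping, and the grouped commit path only asserts.

```go
package main

// pendingBatch is a hypothetical stand-in for a batch waiting to be grouped.
type pendingBatch struct {
	count int    // mutations in this batch
	repr  []byte // serialized writes
}

// enqueue filters empty batches out before they join the commit group; this
// is where the per-batch check has to live, since the grouped repr looks
// non-empty as soon as any one member has writes.
func enqueue(b *pendingBatch, group *[][]byte) {
	if b.count == 0 {
		return // nothing to write; never reaches the WAL
	}
	*group = append(*group, b.repr)
}

// commitGrouped stands in for commitInternal: by the time a batch gets here,
// empty batches should already have been filtered, so seeing one is a bug.
func commitGrouped(group [][]byte) {
	for _, repr := range group {
		if len(repr) == 0 {
			panic("empty batch reached the grouped commit path")
		}
	}
	// ... write the grouped repr to the RocksDB WAL ...
}

func main() {
	var group [][]byte
	enqueue(&pendingBatch{}, &group)                                  // dropped
	enqueue(&pendingBatch{count: 1, repr: []byte("put a=1")}, &group) // kept
	commitGrouped(group)
}
```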


pkg/storage/engine/rocksdb_test.go, line 1585 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: this could use a more targeted name. We're testing something very specific.

Done.

@bdarnell (Contributor) left a comment

Reviewed 4 of 6 files at r1, 2 of 2 files at r2.

@petermattis (Collaborator) left a comment


pkg/storage/engine/rocksdb.go, line 1116 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Yeah, I considered doing that, but I thought it might be nice to have an interface to access the WAL metadata available in case we want to stick it in the logs/metrics. Happy to rip that logic out if you don't think we'll ever do that.

I don't have a strong opinion here. I'm not seeing this as being something we'll use outside of a test, but I could be mistaken. On the other hand, you've already written the code.

@benesch (Contributor, Author) left a comment

@nvanbenschoten the numbers seem to be reproducible. I'm getting an 8-9% performance win. Details:

$ ./workload run kv --init --drop --duration=5m --read-percent=5 --concurrency=192 --splits=1000

Before:
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  300.0s        0         610958         2036.5     94.3     92.3    184.5    234.9    671.1  

After:
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  300.0s        0         660517         2201.7     87.2     79.7    176.2    234.9    453.0  

I added a metric and found that before the change we performed 118.6k WAL syncs in the 5m that the workload ran and after the change we performed 121.3k syncs. That's not as much of a difference as I would have expected, but I guess it's significant? Possible my benchmarking is still flawed. Like, do we really expect a 90ms avg latency for kv5?

bors r=nvanbenschoten,petermattis,bdarnell



pkg/storage/engine/rocksdb.go, line 1116 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I don't have a strong opinion here. I'm not seeing this as being something we'll use outside of a test, but I could be mistaken. On the other hand, you've already written the code.

Ack. In that case I'm just going to move forward as is.

craig bot pushed a commit that referenced this pull request Sep 19, 2018
30398: engine: avoid committing empty batches r=nvanbenschoten,petermattis,bdarnell a=benesch

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>

@nvanbenschoten (Member) left a comment

I see the --drop flag. Are we wiping between runs? Also, those latencies are an order of magnitude larger than I'd expect. Try dropping your concurrency down to 24.

FWIW we don't really ever run kv5.



craig bot commented Sep 19, 2018

Build succeeded

craig bot merged commit 5c7532c into cockroachdb:master on Sep 19, 2018
@benesch deleted the empty-batch branch on September 24, 2018 18:09
@nvanbenschoten (Member) commented

I ran some benchmarks before and after this change on a three-node GCE cluster with 4 vCPU machines and local SSDs. It certainly looks like this provides a throughput improvement on write-heavy workloads, especially when running with slow disks (i.e. with the SSD write barrier enabled).

name                  old ops/sec  new ops/sec  delta
kv5_conc192            2.01k ± 1%   2.18k ± 1%   +8.90%  (p=0.100 n=3+3)
kv95_conc64            15.1k ± 1%   15.1k ± 2%     ~     (p=0.700 n=3+3)
kv0_conc64             1.81k ± 1%   1.95k ± 0%   +7.94%  (p=0.100 n=3+3)
kv0_conc64_nobarrier   5.69k ± 1%   5.84k ± 1%   +2.54%  (p=0.100 n=3+3)

name                  old p50(ms)  new p50(ms)  delta
kv5_conc192             95.1 ± 3%    85.3 ± 3%  -10.30%  (p=0.100 n=3+3)
kv95_conc64             2.27 ± 6%    2.40 ± 0%     ~     (p=0.400 n=3+3)
kv0_conc64              31.5 ± 0%    29.0 ± 3%   -7.83%  (p=0.100 n=3+3)
kv0_conc64_nobarrier    8.40 ± 0%    8.30 ± 2%     ~     (p=1.000 n=3+3)

name                  old p99(ms)  new p99(ms)  delta
kv5_conc192              226 ± 4%     232 ± 2%     ~     (p=0.700 n=3+3)
kv95_conc64             41.9 ± 0%    40.5 ± 3%     ~     (p=0.400 n=3+3)
kv0_conc64              93.7 ± 7%    86.7 ± 3%     ~     (p=0.300 n=3+3)
kv0_conc64_nobarrier    75.5 ± 0%    71.3 ± 0%   -5.56%  (p=0.100 n=3+3)

Awesome find again @benesch!
