
Proposal: Increase quota-backend-bytes default to 8GB #9771

Closed
jpbetz opened this issue May 24, 2018 · 16 comments

Comments

jpbetz commented May 24, 2018

Now that the bbolt freelist is no longer persisted, we should consider increasing the etcd default storage limit to 8GB. We could potentially go higher, but 8GB keeps the snapshot/restore operations at roughly 1 minute each. We can always increase this further in future releases based on feedback.

Based on the data below:

  • etcd's size limit should never exceed the memory available to it. If running on a dedicated machine with, say, 16GB of memory, etcd's storage limit needs to be less than 16GB, with a healthy margin (4GB?)
  • Throughput and latency appear stable up to at least 16GB
  • Snapshot and restore operation latency increases linearly up to at least 16GB
  • At 8GB, snapshot and restore take about 1 minute each (TODO: do we hit any thresholds here? Anything we should update to support 1-minute snapshots/restores?)

The benchmark was constructed using the following flow (a rough sketch of the write loop follows the list):

  • Write 1KB values randomly to a fixed-size keyspace of 100,000,000 keys
  • Start with a 15s compaction interval; each time the compaction interval is exceeded, continue to write for 1 minute, then increase the compaction interval by 15s
  • At each GB of DB file size growth, perform a snapshot followed by a restore
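
This is a minimal sketch of that write loop (not the actual benchmark harness; the endpoint, key format, and rates are placeholders), using the clientv3 API:

```go
// Rough sketch of the benchmark flow above (not the actual harness):
// random 1KB values over a fixed keyspace of 100M keys, with a compaction
// interval that starts at 15s and grows by 15s each round.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	value := string(make([]byte, 1024)) // 1KB value
	compactionInterval := 15 * time.Second
	roundStart := time.Now()
	var lastRev int64

	for {
		key := fmt.Sprintf("key-%09d", rand.Intn(100000000)) // fixed 100M keyspace
		resp, err := cli.Put(context.Background(), key, value)
		if err != nil {
			panic(err)
		}
		lastRev = resp.Header.Revision

		// Once the interval plus one extra minute of writes has elapsed,
		// compact up to the latest revision and grow the interval by 15s.
		if time.Since(roundStart) > compactionInterval+time.Minute {
			if _, err := cli.Compact(context.Background(), lastRev); err != nil {
				panic(err)
			}
			compactionInterval += 15 * time.Second
			roundStart = time.Now()
		}
	}
}
```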

[Chart: write latency (99th percentile) vs. DB size (GB)]
[Chart: write throughput (writes/s) vs. DB size (GB)]
[Chart: save and restore latency (ms) vs. DB size (GB)]

Debian/linux 4.9.0-amd64
6 core - Intel(R) Xeon(R) @ 3.60GHz
64 GB memory (4x 16GiB DIMM DDR4 2400 MHz)
HDD

cc @gyuho @wenjiaswe

gyuho commented May 24, 2018

Interesting. How did we measure restore latencies?

xiang90 commented May 24, 2018

What is your memory size? When the DB size grows beyond the memory size, performance will decrease significantly.

xiang90 commented May 24, 2018

A more interesting test is how etcd performs when the free list contains a lot of pages and the write size is small.

jpbetz commented May 24, 2018

For restore latency, we measured how long bin/etcdctl snapshot restore snap.out took. This is for a single-member cluster; I'd like to do a 3-member cluster next. For that one I'm thinking of stopping a single member, nuking its DB, and starting it again. To get the timing for the 3-member case, I guess I'll need to poll either the endpoint status or the cluster health?
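
For the 3-member case, something like the following could time how long the wiped-and-restarted member takes before it answers a status request again (a rough sketch; the endpoint and poll interval are placeholders, and the client is assumed to be created while the member is still reachable):

```go
// Rough sketch: measure how long a restarted member takes to answer a
// status request again, as a proxy for per-member recovery time.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Endpoint of the member whose DB gets wiped; create the client while
	// the cluster is still reachable, then restart the member externally.
	endpoint := "http://127.0.0.1:2379"

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 2 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	start := time.Now()
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		resp, err := cli.Status(ctx, endpoint)
		cancel()
		if err == nil {
			fmt.Printf("member responding after %v (db size: %d bytes)\n",
				time.Since(start), resp.DbSize)
			return
		}
		time.Sleep(500 * time.Millisecond)
	}
}
```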

jpbetz commented May 24, 2018

@xiang90 Good idea. I'll try a test where we do a bunch of puts/deletes with small objects and see what happens at the bbolt layer when we're allocating against a large freelist with high fragmentation.
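
A minimal sketch of that kind of churn test (key range, value size, and compaction cadence are placeholders), again using clientv3:

```go
// Rough sketch: churn small keys with puts and deletes, compacting
// periodically so freed pages accumulate in the bbolt freelist.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	small := string(make([]byte, 64)) // small value to maximize fragmentation
	var lastRev int64

	for i := 0; ; i++ {
		key := fmt.Sprintf("churn-%07d", rand.Intn(1000000))
		resp, err := cli.Put(context.Background(), key, small)
		if err != nil {
			panic(err)
		}
		lastRev = resp.Header.Revision

		// Delete roughly half the keys right after writing them.
		if i%2 == 0 {
			if _, err := cli.Delete(context.Background(), key); err != nil {
				panic(err)
			}
		}

		// Compact periodically; it is the compaction that actually removes
		// old revisions and returns their pages to the bolt freelist.
		if i > 0 && i%100000 == 0 {
			if _, err := cli.Compact(context.Background(), lastRev); err != nil {
				panic(err)
			}
		}
	}
}
```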

xiang90 commented May 24, 2018

@jpbetz Also, for the restore test, we are now mostly testing how the IO layer performs. I would expect index rebuilding to dominate the time as the number of keys grows.

xiang90 commented May 24, 2018

@jpbetz

What is your memory size? When the DB size grows beyond the memory size, performance will decrease significantly.

I think the big jump around 52GB is because of this issue.

gyuho commented May 24, 2018

nuking its DB, starting it again

The main motivation for soft-limiting the database size was to limit mean time to recovery. So I would also measure how long it takes to rebuild the mvcc state on restart (using mvcc.New), as @xiang90 suggests.

jpbetz commented May 24, 2018

Sounds good. I've added the machine stats to the description. I'll try with a range of object sizes, including very small ones, to get more data on worst-case recovery times.

@jpbetz jpbetz self-assigned this May 25, 2018
jpbetz commented Jun 5, 2018

I've updated the testing based on the feedback here. The new flow creates a larger number of small objects, many of which get deleted and compacted away over time, producing freelist entries in bolt as well as putting pressure on the snapshot and restore operations.

@jpbetz jpbetz changed the title [WIP] Proposal: Increase quota-backend-bytes 8GB default limit [WIP] Proposal: Increase quota-backend-bytes default and recommended limits Jun 5, 2018
@jpbetz jpbetz changed the title [WIP] Proposal: Increase quota-backend-bytes default and recommended limits Proposal: Increase quota-backend-bytes default to 8GB Jun 5, 2018
xiang90 commented Jun 5, 2018

Start with a 15s compaction interval; each time the compaction interval is exceeded, continue to write for 1 minute, then increase the compaction interval by 15s

If we randomly write to a 100M-key keyspace at a rate of around 15k/s, the probability of overwriting a key is pretty low. So I am not sure the compaction actually does anything.

xiang90 commented Jun 5, 2018

Regardless, I think 16GB is a reasonable goal that we can achieve with some effort in the short term.

jpbetz commented Jun 6, 2018

Hm. Running time was about 1 day, so about 12 writes per key (at ~15k writes/s, that's roughly 1.3B total writes spread over 100M keys). Not a lot of compaction or history. I'll drop that down to 1M keys and see what happens the next time we run this.

xiang90 commented Jun 6, 2018

@jpbetz I also want to see what happens when the majority of the 16GB DB's pages are free pages. That is the extreme case. If boltdb can still perform well, then great :P. Otherwise we might need to do some optimization there.

gyuho commented Jun 6, 2018

Throughput and latency appear stable up to at least 16GB

I'm seeing a similar pattern with small values (100 KB).

I also logged the growing freelist size (if we keep writing data, the freelist grows up to 2 GB for a 10 GB DB), and it doesn't seem to have much effect (writing large values slows down more quickly, with a much smaller freelist).

I tested restoring a 10 GB DB file and saw most of the time spent rebuilding MVCC storage here:

https://github.com/coreos/etcd/blob/25f4d809800542a2fa85568f5c5cd0c881f7e010/mvcc/kvstore.go#L363-L380

Will keep experimenting.

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
