
Proposal: Increase quota-backend-bytes default to 8GB #9771

Closed
jpbetz opened this issue May 24, 2018 · 16 comments

Comments

jpbetz commented May 24, 2018

Now that the bbolt freelist is no longer persisted, we should consider increasing the etcd default storage limit to 8GB. We could potentially go higher, but 8GB keeps the snapshot/restore operations at roughly 1 minute each. We can always increase this further in future releases based on feedback.

Based on the data below:

  • etcd's size limit should never exceed the memory available to it. If running on a dedicated machine with, say, 16GB of memory, etcd's storage limit needs to be less than 16GB, with a healthy margin (4GB?)
  • Throughput and latency appear stable up to at least 16GB
  • Snapshot and restore operation latency increases linearly up to at least 16GB
  • At 8GB, snapshot and restore take about 1 minute each (TODO: do we hit any thresholds here? Anything we should update to support 1-minute snapshots/restores?)

The benchmark was constructed using the following flow (a rough sketch of the write loop follows the list):

  • Write 1KB values randomly to a fixed-size keyspace of 100,000,000 keys
  • Start with a 15s compaction interval; each time the compaction interval is exceeded, continue to write for 1 minute, then increase the compaction interval by 15s
  • At each GB of DB file size growth, perform a snapshot followed by a restore
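
This is a minimal sketch of that write loop (not the actual benchmark harness; the endpoint, key format, and rates are placeholders), using the clientv3 API:

```go
// Rough sketch of the benchmark flow above (not the actual harness):
// random 1KB values over a fixed keyspace of 100M keys, with a compaction
// interval that starts at 15s and grows by 15s each round.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	value := string(make([]byte, 1024)) // 1KB value
	compactionInterval := 15 * time.Second
	roundStart := time.Now()
	var lastRev int64

	for {
		key := fmt.Sprintf("key-%09d", rand.Intn(100000000)) // fixed 100M keyspace
		resp, err := cli.Put(context.Background(), key, value)
		if err != nil {
			panic(err)
		}
		lastRev = resp.Header.Revision

		// Once the interval plus one extra minute of writes has elapsed,
		// compact up to the latest revision and grow the interval by 15s.
		if time.Since(roundStart) > compactionInterval+time.Minute {
			if _, err := cli.Compact(context.Background(), lastRev); err != nil {
				panic(err)
			}
			compactionInterval += 15 * time.Second
			roundStart = time.Now()
		}
	}
}
```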

[Chart: write latency (99th percentile) vs. DB size (GB)]
[Chart: write throughput (writes/s) vs. DB size (GB)]
[Chart: save and restore latency (ms) vs. DB size (GB)]

Debian/linux 4.9.0-amd64
6 core - Intel(R) Xeon(R) @ 3.60GHz
64 GB memory (4x 16GiB DIMM DDR4 2400 MHz)
HDD

cc @gyuho @wenjiaswe

gyuho commented May 24, 2018

Interesting. How did we measure restore latencies?

xiang90 commented May 24, 2018

What is your memory size? When the DB size grows beyond the memory size, performance will decrease significantly.

xiang90 commented May 24, 2018

A more interesting test is how etcd performs when the free list contains a lot of pages and the write size is small.

jpbetz commented May 24, 2018

For restore latency, we measured how long bin/etcdctl snapshot restore snap.out took. This is for a single-member cluster; I'd like to do a 3-member cluster next. For that one I'm thinking of stopping a single member, nuking its DB, and starting it again. To get the timing for the 3-member case, I guess I'll need to poll either the endpoint status or the cluster health?
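
For the 3-member case, something like the following could time how long the wiped-and-restarted member takes before it answers a status request again (a rough sketch; the endpoint and poll interval are placeholders, and the client is assumed to be created while the member is still reachable):

```go
// Rough sketch: measure how long a restarted member takes to answer a
// status request again, as a proxy for per-member recovery time.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Endpoint of the member whose DB gets wiped; create the client while
	// the cluster is still reachable, then restart the member externally.
	endpoint := "http://127.0.0.1:2379"

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 2 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	start := time.Now()
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		resp, err := cli.Status(ctx, endpoint)
		cancel()
		if err == nil {
			fmt.Printf("member responding after %v (db size: %d bytes)\n",
				time.Since(start), resp.DbSize)
			return
		}
		time.Sleep(500 * time.Millisecond)
	}
}
```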

jpbetz commented May 24, 2018

@xiang90 Good idea. I'll try a test where we do a bunch of puts/deletes with small objects and see what happens at the bbolt layer when we're allocating against a large freelist with high fragmentation.
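
A minimal sketch of that kind of churn test (key range, value size, and compaction cadence are placeholders), again using clientv3:

```go
// Rough sketch: churn small keys with puts and deletes, compacting
// periodically so freed pages accumulate in the bbolt freelist.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	small := string(make([]byte, 64)) // small value to maximize fragmentation
	var lastRev int64

	for i := 0; ; i++ {
		key := fmt.Sprintf("churn-%07d", rand.Intn(1000000))
		resp, err := cli.Put(context.Background(), key, small)
		if err != nil {
			panic(err)
		}
		lastRev = resp.Header.Revision

		// Delete roughly half the keys right after writing them.
		if i%2 == 0 {
			if _, err := cli.Delete(context.Background(), key); err != nil {
				panic(err)
			}
		}

		// Compact periodically; it is the compaction that actually removes
		// old revisions and returns their pages to the bolt freelist.
		if i > 0 && i%100000 == 0 {
			if _, err := cli.Compact(context.Background(), lastRev); err != nil {
				panic(err)
			}
		}
	}
}
```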

xiang90 commented May 24, 2018

@jpbetz Also, for the restore test, we are now mostly testing how the IO layer performs. I would expect index rebuilding to dominate the time as the number of keys grows.

xiang90 commented May 24, 2018

@jpbetz

What is your memory size? When the DB size grows beyond the memory size, performance will decrease significantly.

I think the big jump around 52GB is because of this issue.

gyuho commented May 24, 2018

nuking its DB, starting it again

The main motivation for soft-limiting the database size was to limit mean time to recovery. So I would also measure how long it takes to rebuild the mvcc state on restart (using mvcc.New), as @xiang90 suggests.

jpbetz commented May 24, 2018

Sounds good. I've added the machine stats to the description. I'll try with a range of object sizes, including very small ones, to get more data on worst-case recovery times.

@jpbetz jpbetz self-assigned this May 25, 2018
jpbetz commented Jun 5, 2018

I've updated the testing based on the feedback here. The new flow creates a larger number of small objects, many of which get deleted and compacted away over time, producing freelist entries in bolt as well as putting pressure on the snapshot and restore operations.

@jpbetz jpbetz changed the title [WIP] Proposal: Increase quota-backend-bytes 8GB default limit [WIP] Proposal: Increase quota-backend-bytes default and recommended limits Jun 5, 2018
@jpbetz jpbetz changed the title [WIP] Proposal: Increase quota-backend-bytes default and recommended limits Proposal: Increase quota-backend-bytes default to 8GB Jun 5, 2018
xiang90 commented Jun 5, 2018

Start with a 15s compaction interval; each time the compaction interval is exceeded, continue to write for 1 minute, then increase the compaction interval by 15s

If we randomly write to a 100M-key keyspace at a rate of around 15k/s, the probability of overwriting a key is pretty low. So I am not sure the compaction actually does anything.

xiang90 commented Jun 5, 2018

Regardless, I think 16GB is a reasonable goal that we can achieve with some effort in the short term.

jpbetz commented Jun 6, 2018

Hm. Running time was about 1 day, so about 12 writes per key (at ~15k writes/s, that's roughly 1.3B total writes spread over 100M keys). Not a lot of compaction or history. I'll drop that down to 1M keys and see what happens the next time we run this.

xiang90 commented Jun 6, 2018

@jpbetz I also want to see what happens when the majority of the 16GB DB's pages are free pages. That is the extreme case. If boltdb can still perform well, then great :P. Otherwise we might need to do some optimization there.

gyuho commented Jun 6, 2018

Throughput and latency appear stable up to at least 16GB

I'm seeing a similar pattern with small values (100 KB).

I also logged the growing freelist size (if we keep writing data, the freelist grows up to 2 GB for a 10 GB DB), and it doesn't seem to have much effect (writing large values slows down more quickly, with a much smaller freelist).

I tested restoring a 10 GB DB file and saw most of the time spent rebuilding MVCC storage here:

https://github.com/coreos/etcd/blob/25f4d809800542a2fa85568f5c5cd0c881f7e010/mvcc/kvstore.go#L363-L380

Will keep experimenting.

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
