Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mvcc: bad performance when range prefix with limit in large amount of keys #11057

Closed
maomaozgw opened this issue Aug 20, 2019 · 8 comments
Closed

Comments

@maomaozgw
Copy link

I have a ETCD cluster with 2004938 keys, my application will range all keys from ETCD after started and keep watch key changes with clientv3/mirror, I found that will cost about 5 minutes to sync base data and during syncing the instance connected will found large number of etcd_server_slow_apply_total appeared.

After profiling the ETCD instance, I saw https://github.com/etcd-io/etcd/blob/master/mvcc/index.go#L106 will visit all indexes match the start & end, and most of the data will be dropped.

To resolve the problem, I change the behavior in Revisions to stop iterate after limitation reached when limitation larger than zero, and the sync time in reduce from 5 minutes to 7 seconds in this case.

Is it a good option to add a range option [noTotalCount], which can work with range with limitation to improve the performance of range requests ?

ETCD Version: 3.3.13

@jingyih
Copy link
Contributor

jingyih commented Aug 21, 2019

My personal take on this:

  1. Adding range option noTotalCount changes semantics of some returned fields such as count and more. This breaks backward compatibility (new client + old server).

  2. If there are enough use case, this can be implemented as a new API.

  3. Normally, indexing in memory is pretty quick. Can you make sure the memory available to etcd is bigger than the total size of the keys? When memory swapping happens, etcd's performance degrades severely [1].

Ref:
[1] #9771

@maomaozgw
Copy link
Author

Yes, I have tested offline with cpu profiling (30s) and sync full data from online cluster.
Both of my server is 32C 64G and the etcd_mvcc_db_total_size_in_use_in_bytes in cluster is 530MiB (5.52923136e+08). Only ETCD deployed in these machine.

In old version, when sync data, the most usage of cpu in treeIndex.Revisions
image
After limit the range keys, the new profiling (30s) info show that treeIndex.Revisions disappear and most time cost to (*readTx).UnsafeRange to get data from bbolt db
image

Why I consider to add the option noTotalCount in range request for it's more likely a declaration from client which means I don't care about the count field in range response (the more field does't been affect in this change), so even in new client and old server, only the performance not optimized, no other break changes of range API.

@maomaozgw
Copy link
Author

First time I found this issue, I also think this may about memory in slice growth. After I change my first version which add a counter in treeIndex.Revisions and skip add to revs after limitation reached. The time cost of full sync reduced from 5 minutes to 4m 46s, the profiling shows the most costing also be treeIndex.Revisions and runtime.growthslice disappeared
image

@jingyih
Copy link
Contributor

jingyih commented Aug 21, 2019

@maomaozgw Thanks. What you said makes sense to me.

Getting all rev pairs from memory index tree when client is only asking a limited number of keys, seems to be wasteful. Another potential concert is sorting provided by range API. Not sure if that will be violated - depends on if currently it trims before sort, or sort before trim.

Another related issue: #11001

@gyuho @xiang90 Any comment or concerns on this topic?

@maomaozgw
Copy link
Author

That's another issue, depends on current code, storeTxnRead.Range trims keys before sorted.
I considered to fix this issue by add sort type in RangeOptions, but it's a real break change for range with both limitation and sort type, also let the range be hard to use for the start/end key selection.

@jingyih
Copy link
Contributor

jingyih commented Aug 22, 2019

Oh actually it temporarily clears the limit if client specifies sorting:

etcd/etcdserver/apply.go

Lines 258 to 263 in c798242

if r.SortOrder != pb.RangeRequest_NONE ||
r.MinModRevision != 0 || r.MaxModRevision != 0 ||
r.MinCreateRevision != 0 || r.MaxCreateRevision != 0 {
// fetch everything; sort and truncate afterwards
limit = 0
}

So the code looks like "trim before sort" but there will be no trimming when fetching key-values if sorting is needed.

@jingyih
Copy link
Contributor

jingyih commented Sep 9, 2019

cc @YoyinZyc

@stale
Copy link

stale bot commented Apr 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 6, 2020
@stale stale bot closed this as completed Apr 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

3 participants