mvcc: bad performance when range prefix with limit in large amount of keys #11057

maomaozgw · 2019-08-20T07:00:08Z

I have a ETCD cluster with 2004938 keys, my application will range all keys from ETCD after started and keep watch key changes with clientv3/mirror, I found that will cost about 5 minutes to sync base data and during syncing the instance connected will found large number of etcd_server_slow_apply_total appeared.

After profiling the ETCD instance, I saw https://github.com/etcd-io/etcd/blob/master/mvcc/index.go#L106 will visit all indexes match the start & end, and most of the data will be dropped.

To resolve the problem, I change the behavior in Revisions to stop iterate after limitation reached when limitation larger than zero, and the sync time in reduce from 5 minutes to 7 seconds in this case.

Is it a good option to add a range option [noTotalCount], which can work with range with limitation to improve the performance of range requests ?

ETCD Version: 3.3.13

jingyih · 2019-08-21T07:48:15Z

My personal take on this:

Adding range option noTotalCount changes semantics of some returned fields such as count and more. This breaks backward compatibility (new client + old server).
If there are enough use case, this can be implemented as a new API.
Normally, indexing in memory is pretty quick. Can you make sure the memory available to etcd is bigger than the total size of the keys? When memory swapping happens, etcd's performance degrades severely [1].

Ref:
[1] #9771

maomaozgw · 2019-08-21T12:30:37Z

Yes, I have tested offline with cpu profiling (30s) and sync full data from online cluster.
Both of my server is 32C 64G and the etcd_mvcc_db_total_size_in_use_in_bytes in cluster is 530MiB (5.52923136e+08). Only ETCD deployed in these machine.

In old version, when sync data, the most usage of cpu in treeIndex.Revisions

After limit the range keys, the new profiling (30s) info show that treeIndex.Revisions disappear and most time cost to (*readTx).UnsafeRange to get data from bbolt db

Why I consider to add the option noTotalCount in range request for it's more likely a declaration from client which means I don't care about the count field in range response (the more field does't been affect in this change), so even in new client and old server, only the performance not optimized, no other break changes of range API.

maomaozgw · 2019-08-21T12:39:36Z

First time I found this issue, I also think this may about memory in slice growth. After I change my first version which add a counter in treeIndex.Revisions and skip add to revs after limitation reached. The time cost of full sync reduced from 5 minutes to 4m 46s, the profiling shows the most costing also be treeIndex.Revisions and runtime.growthslice disappeared

jingyih · 2019-08-21T22:34:44Z

@maomaozgw Thanks. What you said makes sense to me.

Getting all rev pairs from memory index tree when client is only asking a limited number of keys, seems to be wasteful. Another potential concert is sorting provided by range API. Not sure if that will be violated - depends on if currently it trims before sort, or sort before trim.

Another related issue: #11001

@gyuho @xiang90 Any comment or concerns on this topic?

maomaozgw · 2019-08-22T05:55:54Z

That's another issue, depends on current code, storeTxnRead.Range trims keys before sorted.
I considered to fix this issue by add sort type in RangeOptions, but it's a real break change for range with both limitation and sort type, also let the range be hard to use for the start/end key selection.

jingyih · 2019-08-22T21:24:27Z

Oh actually it temporarily clears the limit if client specifies sorting:

etcd/etcdserver/apply.go

Lines 258 to 263 in c798242

    
           if r.SortOrder != pb.RangeRequest_NONE || 
        
           	r.MinModRevision != 0 || r.MaxModRevision != 0 || 
        
           	r.MinCreateRevision != 0 || r.MaxCreateRevision != 0 { 
        
           	// fetch everything; sort and truncate afterwards 
        
           	limit = 0 
        
           }

So the code looks like "trim before sort" but there will be no trimming when fetching key-values if sorting is needed.

jingyih · 2019-09-09T22:28:43Z

cc @YoyinZyc

stale · 2020-04-06T21:55:35Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

xiang90 added the type/question label Aug 21, 2019

jingyih added the area/performance label Sep 9, 2019

stale bot added the stale label Apr 6, 2020

stale bot closed this as completed Apr 28, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mvcc: bad performance when range prefix with limit in large amount of keys #11057

mvcc: bad performance when range prefix with limit in large amount of keys #11057

maomaozgw commented Aug 20, 2019

jingyih commented Aug 21, 2019

maomaozgw commented Aug 21, 2019

maomaozgw commented Aug 21, 2019

jingyih commented Aug 21, 2019

maomaozgw commented Aug 22, 2019

jingyih commented Aug 22, 2019

jingyih commented Sep 9, 2019

stale bot commented Apr 6, 2020

mvcc: bad performance when range prefix with limit in large amount of keys #11057

mvcc: bad performance when range prefix with limit in large amount of keys #11057

Comments

maomaozgw commented Aug 20, 2019

jingyih commented Aug 21, 2019

maomaozgw commented Aug 21, 2019

maomaozgw commented Aug 21, 2019

jingyih commented Aug 21, 2019

maomaozgw commented Aug 22, 2019

jingyih commented Aug 22, 2019

jingyih commented Sep 9, 2019

stale bot commented Apr 6, 2020