mvcc.store.restore taking too long triggers snapshot cycle #5317
Comments
Fix etcd-io#5281. Fix etcd-io#5317. 1. Wait first before it gets revision, hash, as we did in the past. 2. Increase setHealthKey timeout in case we inject network latency.
Fix coreos#5281. Fix coreos#5317. 1. Wait first before it gets revision, hash, as we did in the past. 2. Increase setHealthKey timeout in case we inject network latency. 3. Slow down stresser in case it takes longer to catch up and prevent stressers from exceeding the storage quota.
Fix coreos#5281. Fix coreos#5317. 1. Wait first before it gets revision, hash, as we did in the past, 2. Increase setHealthKey timeout in case we inject network latency.
Fix coreos#5281. Fix coreos#5317. 1. Wait first before it gets revision, hash, as we did in the past. 2. Retry Compact request in case it times out from injected network latency.
Fix coreos#5281. Fix coreos#5317. Wait first before it gets revision, hash, as we did in the past.
Ideally we want etcd to recover from 40M small keys within 30 seconds. 40M is roughly one hour of a 10k keys/second put workload without any compaction (10,000 keys/s × 3,600 s ≈ 36M). This is probably the max throughput we want to support for etcd. /cc @nekto0n @sinsharat @vimalk78 It would be great if any of you could work on it.
According to @gyuho most of the time is taken by kvindex.Restore.
Using only 100k keys (simulating multiple revisions) I get:
So it takes a bit longer than a second to perform a 1M-key recovery, and a bit less for 100k keys with a total revision count of 2M. Am I right that we need to optimize this a bit more, or should we concoct a more involved test that looks more like real-world usage?
Is it growing linearly? 100K -> 10 revs per key = 1s (1M keys in total); 100k -> 10 revs = 2s (1M keys in total)? I remember I did some experiments a while ago and found some low-hanging fruit.
I did a bit of experimenting with the benchmark and got these numbers:
Can you share some directions on where those low-hanging fruits might be? :)
Sure. During the restore, we keep touching the treeIndex with single-key gets and puts: one get and one put for every key and every key modification. Each tree operation is O(logN), so the whole restore is O(NlogN), which is not ideal. To improve this, we can use an unordered map to hold the temporary data. We only touch the unordered map while restoring the index. After we have iterated over all the revisions, we range over the unordered map and convert it into the tree index. This can reduce the cost significantly if we have N versions per key. Some simple code:

```go
// use an unordered map to hold the temp index data to speed up
// the initial key index recovery.
// we will convert this unordered map into the tree index later.
unordered := make(map[string]*keyIndex, 100000)

// TODO: limit N to reduce max memory usage
keys, vals := tx.UnsafeRange(keyBucketName, min, max, 0)
for i, key := range keys {
	var kv mvccpb.KeyValue
	if err := kv.Unmarshal(vals[i]); err != nil {
		plog.Fatalf("cannot unmarshal event: %v", err)
	}

	rev := bytesToRev(key[:revBytesLen])

	// restore index
	switch {
	case isTombstone(key):
		if ki, ok := unordered[string(kv.Key)]; ok {
			ki.tombstone(rev.main, rev.sub)
		}
		delete(keyToLease, string(kv.Key))
	default:
		ki, ok := unordered[string(kv.Key)]
		if ok {
			ki.put(rev.main, rev.sub)
		} else {
			ki = &keyIndex{key: kv.Key}
			ki.restore(revision{kv.CreateRevision, 0}, rev, kv.Version)
			unordered[string(kv.Key)] = ki
		}
		if lid := lease.LeaseID(kv.Lease); lid != lease.NoLease {
			keyToLease[string(kv.Key)] = lid
		} else {
			delete(keyToLease, string(kv.Key))
		}
	}

	// update revision
	s.currentRev = rev
}

// restore the tree index from the unordered index.
for _, v := range unordered {
	s.kvindex.Insert(v)
}
```

```go
func (ti *treeIndex) Insert(ki *keyIndex) {
	ti.Lock()
	defer ti.Unlock()
	ti.tree.ReplaceOrInsert(ki)
}
```

On my laptop, it reduces the recovery time for 10M revisions (1M keys and 10 versions per key) from 40s to 10s. For the unordered map recovery, we can make it even fancier by translating it into a map-reduce style and increasing concurrency. But that is a future step and we might not even need it.
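To see why this helps, here is a small self-contained micro-benchmark sketch (not from the issue) using github.com/google/btree, the library underlying etcd's treeIndex. The key counts and the item struct are arbitrary stand-ins; it only contrasts doing one tree get/put per revision against accumulating revisions in a map and inserting each key into the tree once.

```go
package main

import (
	"fmt"
	"time"

	"github.com/google/btree"
)

// item is a stand-in for etcd's keyIndex: ordered by key, counting revisions.
type item struct {
	key  string
	revs int
}

func (a *item) Less(b btree.Item) bool { return a.key < b.(*item).key }

func main() {
	const keys, revsPerKey = 200000, 10

	// Old path: one tree Get + ReplaceOrInsert per revision, O(logN) each.
	t0 := time.Now()
	tr := btree.New(32)
	for r := 0; r < revsPerKey; r++ {
		for k := 0; k < keys; k++ {
			key := fmt.Sprintf("key-%d", k)
			if it := tr.Get(&item{key: key}); it != nil {
				it.(*item).revs++
			} else {
				tr.ReplaceOrInsert(&item{key: key, revs: 1})
			}
		}
	}
	perRevision := time.Since(t0)

	// New path: accumulate revisions in an unordered map, then insert each
	// distinct key into the tree exactly once.
	t1 := time.Now()
	m := make(map[string]*item, keys)
	for r := 0; r < revsPerKey; r++ {
		for k := 0; k < keys; k++ {
			key := fmt.Sprintf("key-%d", k)
			if it, ok := m[key]; ok {
				it.revs++
			} else {
				m[key] = &item{key: key, revs: 1}
			}
		}
	}
	tr2 := btree.New(32)
	for _, it := range m {
		tr2.ReplaceOrInsert(it)
	}
	mapThenInsert := time.Since(t1)

	fmt.Printf("per-revision tree ops: %v, map-then-insert: %v\n", perRevision, mapThenInsert)
}
```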
/cc @nekto0n @sinsharat
So we change
@nekto0n Some tests need to be changed. The benchmark needs to be updated or added.
@nekto0n are you going to add the code as well as the benchmark, or can I take any part of it?
@sinsharat actually I'm pretty busy this week, so feel free to take all of it.
@nekto0n Sure, will start looking into it then.
@xiang90 sorry for the delay on this. I was stuck with some business commitments that I was asked to work on this entire week. I will try to finish this and raise a PR by tomorrow.
Fixed by #6846. There are a few other places that can be improved, but we will not optimize them until they become a problem in the real world.
Problem
functional-tester etcd-tester's setHealthKey gets context deadline exceeded timeouts, while etcd servers are getting etcdserver: publish error: etcdserver: request timed out.
How to reproduce
The recovering member reads the db file from the disk and rebuilds the index of 8M keys by calling kvstore.Restore. It will be either a compaction timeout or a recovery timeout in functional-tests.
For example, use this benchmark command to stress the etcd cluster:
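The original command from the report was not preserved above. Purely as an illustration, a write-heavy run with etcd's benchmark tool could look like the following; the endpoint, key count, and sizes below are made up for the example.

```sh
# Illustrative only: values are not the ones used in the original report.
benchmark --endpoints=127.0.0.1:2379 --conns=100 --clients=1000 \
  put --key-size=8 --sequential-keys --total=10000000 --val-size=256
```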
And when the snapshot is triggered, one can see these in the leader's logs:
And the follower logs will be something like:
Root Cause
The root cause was that the writes were too intensive, with the follower constantly falling behind. When an etcd node recovers, it needs to rebuild the key-index from disk. Reading the backend file itself doesn't take long: iterating 10M keys in boltdb takes only about 3 seconds on an SSD.
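As a rough way to check that claim against one's own data, a minimal sketch using the bbolt API is shown below; the db path is illustrative (point it at a copy of the member's backend file), and "key" is the bucket etcd's backend uses for key-value data.

```go
package main

import (
	"fmt"
	"log"
	"time"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Path is illustrative; use a copy of the member's backend db file.
	db, err := bolt.Open("member/snap/db", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	start := time.Now()
	n := 0
	err = db.View(func(tx *bolt.Tx) error {
		// etcd keeps its key-value data in the "key" bucket.
		b := tx.Bucket([]byte("key"))
		if b == nil {
			return fmt.Errorf("bucket %q not found", "key")
		}
		return b.ForEach(func(k, v []byte) error {
			n++
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("iterated %d revisions in %v\n", n, time.Since(start))
}
```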
Rebuilding the key-index from scratch takes a long time.
Restoring 8M keys from scratch can take more than 30 seconds. And if there were concurrent writes (10K~15K QPS) on other nodes, the recovering node will need to receive a snapshot right after the key-index rebuild, and that snapshot will trigger another index rebuild. This cycle repeats as long as there are ongoing concurrent writes, with each restore taking longer and longer.
Conclusion
This only happens with extremely intense writes (a large number of keys to rebuild in the btree), and we believe it will rarely happen in production. Even under this heavy workload, the etcd leader kept operating. If anyone is hitting this issue, please check your workload, try slowing it down, or consider compacting your database.
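For reference, a minimal sketch of requesting a compaction through the clientv3 API is shown below; the endpoint, timeouts, probe key, and the choice of compacting up to the current revision are illustrative assumptions, and the import path follows the current etcd module layout rather than the one from 2016.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint and timeouts are illustrative.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Read any key just to learn the current store revision, then compact
	// everything older than it so a later restore has fewer revisions to replay.
	resp, err := cli.Get(ctx, "health")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Compact(ctx, resp.Header.Revision); err != nil {
		log.Fatal(err)
	}
	log.Printf("compacted up to revision %d", resp.Header.Revision)
}
```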
There was nothing wrong with etcd's correctness.
We mark this as unplanned.
In the future, we will make the restore operation faster.
More details
Did more timing:
So it's not the decode that takes much time; it's kvindex.Restore that takes most of the time.
/cc @heyitsanthony @xiang90
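A rough sketch of how such a decode-vs-index timing split might be measured is shown below. It is a fragment, not a runnable program: it reuses names from the restore snippet earlier in the thread (mvccpb.KeyValue, bytesToRev, revBytesLen, plog), and the helper itself is hypothetical rather than the instrumentation actually used for the numbers above.

```go
// timeRestore is a hypothetical helper: keys/vals mirror the slices returned
// by tx.UnsafeRange, and restoreIndex stands in for s.kvindex.Restore.
func timeRestore(keys, vals [][]byte, restoreIndex func(kv mvccpb.KeyValue, rev revision)) (decode, index time.Duration) {
	for i := range keys {
		t0 := time.Now()
		var kv mvccpb.KeyValue
		if err := kv.Unmarshal(vals[i]); err != nil {
			plog.Fatalf("cannot unmarshal event: %v", err)
		}
		decode += time.Since(t0)

		t1 := time.Now()
		restoreIndex(kv, bytesToRev(keys[i][:revBytesLen]))
		index += time.Since(t1)
	}
	return decode, index
}
```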