mvcc.store.restore taking too long triggers snapshot cycle #5317

Closed
gyuho opened this issue May 11, 2016 · 13 comments

gyuho commented May 11, 2016

Problem

The functional-tester (etcd-tester) setHealthKey call fails with context deadline exceeded, while the etcd servers report etcdserver: publish error: etcdserver: request timed out.


How to reproduce

  1. Set up a 3-node cluster
  2. Write 8M keys
  3. Kill one of the followers
  4. In parallel, keep writing keys to the other two members (10K~15K QPS)
  5. Wait until 10K+ more keys are written (just enough to trigger a snapshot)
  6. Recover the follower
  7. The follower reads the db file from disk and rebuilds the index of 8M keys by calling kvstore.Restore
  8. The follower is now available again (able to talk to the leader)
  9. The follower has fallen behind by as many entries as were written to the leader while it was down
  10. The leader sends a snapshot to this follower
  11. The follower receives this snapshot, which requires rebuilding the index of 8M + 10K keys
  12. While the follower restores from this snapshot, it returns 'etcdserver: publish error: etcdserver:'
  13. The stresser has been writing to the leader in parallel, which triggers another snapshot
  14. The cycle repeats from step 10.

In functional tests, this surfaces as either a compaction timeout or a recovery timeout.

For example, use this benchmark command to stress the etcd cluster:

benchmark --endpoints=${HOST_1}:2379,${HOST_2}:2379,${HOST_3}:2379 \
    --conns=100 --clients=1000 \
    put --key-size=10 --val-size=1 --total=3000000

When the snapshot is triggered, one can see these lines in the leader's logs:

2016-06-07 17:29:32.776501 I | etcdserver: wrote database snapshot out [total bytes: 235487232]
2016-06-07 17:29:33.649890 I | rafthttp: database snapshot [index: 3552177, to: 5fbb5cbbbef242b7] sent out successfully

And the follower's logs will look something like:

2016-06-07 17:29:33.649748 I | snap: saved database snapshot to disk [total bytes: 235487232]
2016-06-07 17:29:33.649800 I | rafthttp: received and saved database snapshot [index: 3552177, from: b88fca7fa2ea1289] successfully
2016-06-07 17:29:33.650026 I | raft: 5fbb5cbbbef242b7 [commit: 3000778, lastindex: 3000778, lastterm: 5] starts to restore snapshot [index: 3552177, term: 5]
2016-06-07 17:29:33.650089 I | raft: log [committed=3000778, applied=3000778, unstable.offset=3000779, len(unstable.Entries)=0] starts to restore snapshot [index: 3552177, term: 5]
2016-06-07 17:29:33.650129 I | raft: 5fbb5cbbbef242b7 restored progress of 5fbb5cbbbef242b7 [next = 3552178, match = 3552177, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2016-06-07 17:29:33.650148 I | raft: 5fbb5cbbbef242b7 restored progress of b88fca7fa2ea1289 [next = 3552178, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2016-06-07 17:29:33.650163 I | raft: 5fbb5cbbbef242b7 restored progress of c4268f53965ad024 [next = 3552178, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2016-06-07 17:29:33.650179 I | raft: 5fbb5cbbbef242b7 [commit: 3552177] restored snapshot [index: 3552177, term: 5]
2016-06-07 17:29:33.650323 I | etcdserver: applying snapshot at index 3000641...
2016-06-07 17:29:33.653349 I | etcdserver: raft applied incoming snapshot at index 3552177
2016-06-07 17:29:33.664600 I | etcdserver: restoring mvcc store...
2016-06-07 17:29:57.641605 I | etcdserver: finished restoring mvcc store
...

Root Cause

The root cause was that the writes were too intensive, so the follower kept falling behind. When an etcd node recovers, it needs to rebuild the key-index from disk. Reading the backend file itself doesn't take long: iterating 10M keys in boltdb takes only about 3 seconds on an SSD.

Rebuilding the key-index from scratch, however, takes a long time.

Restoring 8M keys from scratch can take more than 30 seconds. And if there are concurrent writes (10K~15K QPS) to the other nodes, the recovering node has to receive a snapshot right after the key-index rebuild, and that snapshot triggers another index rebuild. This cycle repeats for as long as the concurrent writes continue, with each restore taking longer and longer.
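As a rough illustration of why the raw backend read is cheap, here is a minimal standalone sketch (not code from this issue) that only scans the boltdb backend with a read-only cursor; the db path and the "key" bucket name are assumptions about a typical etcd data directory:

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/boltdb/bolt"
)

func main() {
    // Open the backend read-only; the path is an assumption for illustration.
    db, err := bolt.Open("member/snap/db", 0600, &bolt.Options{ReadOnly: true})
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    start := time.Now()
    n := 0
    err = db.View(func(tx *bolt.Tx) error {
        b := tx.Bucket([]byte("key")) // etcd keeps revisions in the "key" bucket
        if b == nil {
            return fmt.Errorf("bucket %q not found", "key")
        }
        // Walk every revision; this sequential scan is the cheap part of a restore.
        c := b.Cursor()
        for k, _ := c.First(); k != nil; k, _ = c.Next() {
            n++
        }
        return nil
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("scanned %d revisions in %v\n", n, time.Since(start))
}

The expensive part is what happens per revision after this scan: decoding the protobuf and, above all, updating the treeIndex.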


Conclusion

This only happens with extremely intense writes (a large number of keys for which the btree has to be rebuilt), and we believe it will rarely happen in production. Even under this heavy workload, the etcd leader kept operating. If anyone is hitting this issue, please check your workload, try slowing it down, or consider compacting your database.

There was nothing wrong with etcd's correctness.

We are marking this as unplanned.

In the future, we will make the restore operation faster.


More details

Did more timing:

// Inside the restore loop; dTime and rTime are time.Duration accumulators
// declared before the loop.
var kv mvccpb.KeyValue
st1 := time.Now()
if err := kv.Unmarshal(vals[i]); err != nil {
    plog.Fatalf("cannot unmarshal event: %v", err)
}
dTime += time.Since(st1) // total time spent decoding

st2 := time.Now()
s.kvindex.Restore(kv.Key, revision{kv.CreateRevision, 0}, rev, kv.Version)
rTime += time.Since(st2) // total time spent rebuilding the key index

mvcc: [DEBUG]
store.restore took 36.38789088s
decode took 2.106427628s
kvindex.restore took 31.230108334s

So it's not the decoding that takes much time; kvindex.Restore accounts for most of it.


/cc @heyitsanthony @xiang90
@gyuho gyuho self-assigned this May 11, 2016
@gyuho gyuho added this to the v3.0.0 milestone May 11, 2016
gyuho added a commit to gyuho/etcd that referenced this issue May 11, 2016
Fix etcd-io#5281.
Fix etcd-io#5317.

1. Wait first before it gets revision, hash, as we did in the past.
2. Increase setHealthKey timeout in case we inject network latency.
gyuho referenced this issue in gyuho/etcd May 11, 2016
Fix coreos#5281.
Fix coreos#5317.

1. Wait first before it gets revision, hash, as we did in the past.
2. Increase setHealthKey timeout in case we inject network latency.
3. Slow down stresser in case it takes longer to catch up and prevent
stressers from exceeding the storage quota.
gyuho referenced this issue in gyuho/etcd May 13, 2016
Fix coreos#5281.
Fix coreos#5317.

1. Wait first before it gets revision, hash, as we did in the past.
2. Retry Compact request in case it times out from injected network latency.
@gyuho gyuho closed this as completed May 24, 2016
@gyuho gyuho reopened this May 24, 2016
@gyuho gyuho modified the milestones: unplanned, v3.0.0 Jun 7, 2016
@gyuho gyuho changed the title functional-tester: recover RPC call times out mvcc.store.restore taking too long, and etcd-tester recover RPC call times out Jun 7, 2016
@gyuho gyuho changed the title mvcc.store.restore taking too long, and etcd-tester recover RPC call times out mvcc.store.restore taking too long triggers snapshot cycle Jun 16, 2016

xiang90 commented Oct 25, 2016

Ideally we want etcd to recover from 40M small keys within 30 seconds.

40M is around one hour of a 10k keys/second put workload without any compaction (10,000 keys/s × 3,600 s ≈ 36M). This is probably the max throughput we want to support for etcd.

/cc @nekto0n @sinsharat @vimalk78 It would be great if any of you can work on it.


nekto0n commented Oct 25, 2016

According to @gyuho, most of the time is taken by kvindex.Restore. There is a BenchmarkIndexRestore which on my laptop gives these numbers:

1000000       1411 ns/op         178 B/op          3 allocs/op

Using only 100k keys (simulating multiple revisions per key), I get:

2000000        687 ns/op         119 B/op          1 allocs/op

So it takes a bit more than a second to recover 1M keys, and a bit less for 100k keys with a total of 2M revisions. Am I right that we need to optimize this further, or should we put together a more involved benchmark that looks more like the real code path?


xiang90 commented Oct 25, 2016

So it takes a bit more than a second to recover 1M keys, and a bit less for 100k keys with a total of 2M revisions.

Is it growing linearly? 100K -> 10 revs per key = 1s (1M keys in total); 100k -> 10 revs = 2s (1M keys in total)?

I remember I did some experiments a while ago and found some low-hanging fruit.


nekto0n commented Oct 26, 2016

I did a bit of experimenting with the benchmark and got these numbers:
go test -v github.com/coreos/etcd/mvcc -bench BenchmarkIndexRestore -run ^$ -benchtime 10s

BenchmarkIndexRestore1-4        10000000          1314 ns/op         178 B/op          3 allocs/op
BenchmarkIndexRestore10-4       20000000           737 ns/op         123 B/op          1 allocs/op
BenchmarkIndexRestore100-4      30000000           624 ns/op         105 B/op          1 allocs/op

I remember I did some experiments a while ago and found some low-hanging fruit.

Can you share some directions where they may be hanging from? :)

@xiang90 xiang90 modified the milestones: v3.2.0, unplanned Nov 8, 2016

xiang90 commented Nov 8, 2016

Can you share some directions where they may be hanging from? :)

Sure. During the restore, we keep touching the treeIndex with single-key gets and puts: it does one get and one put for every key and every key modification. Over the whole restore these tree operations add up to O(N log N), which is not ideal.

To improve this, we can use an unordered map to hold the temporary data. We only touch the unordered map while restoring the index. After we have iterated over all the revisions, we range over the unordered map and convert it into the tree index. This can reduce the cost significantly when keys have multiple versions.

Some simple code:

    // use an unordered map to hold the temp index data to speed up
    // the initial key index recovery.
    // we will convert this unordered map into the tree index later.
    unordered := make(map[string]*keyIndex, 100000)

    // TODO: limit N to reduce max memory usage
    keys, vals := tx.UnsafeRange(keyBucketName, min, max, 0)
    for i, key := range keys {
        var kv mvccpb.KeyValue
        if err := kv.Unmarshal(vals[i]); err != nil {
            plog.Fatalf("cannot unmarshal event: %v", err)
        }

        rev := bytesToRev(key[:revBytesLen])

        // restore index
        switch {
        case isTombstone(key):
            if ki, ok := unordered[string(kv.Key)]; ok {
                ki.tombstone(rev.main, rev.sub)
            }
            delete(keyToLease, string(kv.Key))

        default:
            ki, ok := unordered[string(kv.Key)]
            if ok {
                ki.put(rev.main, rev.sub)
            } else {
                ki = &keyIndex{key: kv.Key}
                ki.restore(revision{kv.CreateRevision, 0}, rev, kv.Version)
                unordered[string(kv.Key)] = ki
            }

            if lid := lease.LeaseID(kv.Lease); lid != lease.NoLease {
                keyToLease[string(kv.Key)] = lid
            } else {
                delete(keyToLease, string(kv.Key))
            }
        }

        // update revision
        s.currentRev = rev
    }

    // restore the tree index from the unordered index.
    for _, v := range unordered {
        s.kvindex.Insert(v)
    }

func (ti *treeIndex) Insert(ki *keyIndex) {
    ti.Lock()
    defer ti.Unlock()
    ti.tree.ReplaceOrInsert(ki)
}

On my laptop, it reduces the recovery time for 10M revisions (1M keys and 10 versions per key) from 40s to 10s.

For the unordered map recovery, we can make it even fancier by translating it into a map-reduce style and increasing concurrency. But that is a future step and we might not even need it.
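To make that possible future step a little more concrete, here is a rough, self-contained sketch of a sharded (map-reduce style) restore. It is not etcd code: rev and item are simplified stand-ins for revision and keyIndex, and github.com/google/btree stands in for the tree index.

package restore

import (
    "hash/fnv"
    "sync"

    "github.com/google/btree"
)

// rev and item are simplified stand-ins for etcd's revision and keyIndex.
type rev struct{ main, sub int64 }

type item struct {
    key  string
    revs []rev // revisions of this key, in order
}

func (a *item) Less(b btree.Item) bool { return a.key < b.(*item).key }

// restoreConcurrently builds per-shard unordered maps in parallel, then
// converts them into a single B-tree. keys and revs are parallel slices in
// global revision order, as they would come out of the backend scan.
func restoreConcurrently(keys []string, revs []rev, shards int) *btree.BTree {
    type kr struct {
        k string
        r rev
    }
    chs := make([]chan kr, shards)
    maps := make([]map[string]*item, shards)
    var wg sync.WaitGroup
    for i := range chs {
        chs[i] = make(chan kr, 1024)
        maps[i] = make(map[string]*item)
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            // Each worker owns maps[i], so no locking is needed.
            for e := range chs[i] {
                ki, ok := maps[i][e.k]
                if !ok {
                    ki = &item{key: e.k}
                    maps[i][e.k] = ki
                }
                ki.revs = append(ki.revs, e.r)
            }
        }(i)
    }

    // Fan out by key hash so all revisions of a key land in the same shard,
    // preserving per-key revision order.
    for i, k := range keys {
        h := fnv.New32a()
        h.Write([]byte(k))
        chs[h.Sum32()%uint32(shards)] <- kr{k, revs[i]}
    }
    for _, ch := range chs {
        close(ch)
    }
    wg.Wait()

    // Single-threaded conversion into the ordered tree index.
    t := btree.New(32)
    for _, m := range maps {
        for _, ki := range m {
            t.ReplaceOrInsert(ki)
        }
    }
    return t
}

Whether the goroutine and channel overhead pays off in practice would need benchmarking; as noted above, it may not be needed at all.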


xiang90 commented Nov 8, 2016

/cc @nekto0n @sinsharat


nekto0n commented Nov 8, 2016

So we replace O(n log n) (for every key revision we do a get and a put against the tree) with O(n) + O(k log k) for k distinct keys (every revision goes into a hash table, and then only the distinct keys, about 1/10 of the entries here, are inserted into the tree).
All that's needed is to copy-paste the code above into a PR, sounds doable :)
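To make that comparison concrete, here is a minimal, self-contained benchmark sketch of the two strategies. The types are hypothetical stand-ins, and github.com/google/btree is used directly instead of etcd's internal treeIndex/keyIndex.

package restore

import (
    "fmt"
    "testing"

    "github.com/google/btree"
)

// ki is a hypothetical stand-in for keyIndex: one tree item per key,
// carrying all of that key's revisions.
type ki struct {
    key  string
    revs []int64
}

func (a *ki) Less(b btree.Item) bool { return a.key < b.(*ki).key }

// restoreTreeOnly mimics the old path: one Get and one ReplaceOrInsert
// against the tree for every revision.
func restoreTreeOnly(keys []string, revs []int64) *btree.BTree {
    t := btree.New(32)
    for i, k := range keys {
        var entry *ki
        if got := t.Get(&ki{key: k}); got != nil {
            entry = got.(*ki)
        } else {
            entry = &ki{key: k}
        }
        entry.revs = append(entry.revs, revs[i])
        t.ReplaceOrInsert(entry)
    }
    return t
}

// restoreMapFirst mimics the new path: aggregate revisions per key in a
// hash map, then insert each distinct key into the tree exactly once.
func restoreMapFirst(keys []string, revs []int64) *btree.BTree {
    m := make(map[string]*ki, len(keys))
    for i, k := range keys {
        entry, ok := m[k]
        if !ok {
            entry = &ki{key: k}
            m[k] = entry
        }
        entry.revs = append(entry.revs, revs[i])
    }
    t := btree.New(32)
    for _, entry := range m {
        t.ReplaceOrInsert(entry)
    }
    return t
}

// BenchmarkRestore compares the two strategies for 100k keys x 10 revisions.
func BenchmarkRestore(b *testing.B) {
    const nKeys, revsPerKey = 100000, 10
    keys := make([]string, 0, nKeys*revsPerKey)
    revs := make([]int64, 0, nKeys*revsPerKey)
    for r := 0; r < revsPerKey; r++ {
        for k := 0; k < nKeys; k++ {
            keys = append(keys, fmt.Sprintf("key-%08d", k))
            revs = append(revs, int64(r*nKeys+k))
        }
    }
    b.Run("tree-only", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            restoreTreeOnly(keys, revs)
        }
    })
    b.Run("map-first", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            restoreMapFirst(keys, revs)
        }
    })
}

Run it with something like go test -bench Restore -run '^$' to compare the two.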


xiang90 commented Nov 8, 2016

@nekto0n Some tests need to be changed, and the benchmark needs to be updated or added.

@sinsharat

@nekto0n are you going to add the code as well as the benchmark, or can I take any part of it?


nekto0n commented Nov 9, 2016

@sinsharat actually I'm pretty busy this week, so feel free to take all of it.

@sinsharat

@nekto0n Sure, will start looking into it then.
Thanks!

@sinsharat

@xiang90 sorry for the delay on this. I was stuck with some business commitments that I was asked to work on this entire week. Will try to finish this and raise a PR by tomorrow.
Thank you!


xiang90 commented Nov 18, 2016

Fixed by #6846. There are a few other places that can be improved, but we will hold off on further optimization until it becomes a problem in the real world.

@xiang90 xiang90 closed this as completed Nov 18, 2016