mvcc.store.restore taking too long triggers snapshot cycle #5317

Closed
gyuho opened this issue May 11, 2016 · 13 comments

gyuho commented May 11, 2016

Problem

The functional-tester (etcd-tester) setHealthKey call fails with context deadline exceeded, while the etcd servers report etcdserver: publish error: etcdserver: request timed out.


How to reproduce

  1. Set up a 3-node cluster
  2. Write 8M keys
  3. Kill one of the followers
  4. In parallel, keep writing keys to the other two members (10K~15K QPS)
  5. Wait until 10K+ more keys are written (just enough to trigger a snapshot)
  6. Recover the follower
  7. The follower reads the db file from disk and rebuilds the index of 8M keys by calling kvstore.Restore
  8. The follower is now available again (able to talk to the leader)
  9. The follower has fallen behind by as many entries as were written to the leader while it was down
  10. The leader sends a snapshot to this follower
  11. The follower receives this snapshot, which requires rebuilding the index of 8M + 10K keys
  12. While the follower restores from this snapshot, it returns 'etcdserver: publish error: etcdserver:'
  13. The stresser has been writing to the leader in parallel, which triggers another snapshot
  14. The cycle repeats from step 10.

In functional tests, this surfaces as either a compaction timeout or a recovery timeout.

For example, use this benchmark command to stress the etcd cluster:

benchmark --endpoints=${HOST_1}:2379,${HOST_2}:2379,${HOST_3}:2379 \
    --conns=100 --clients=1000 \
    put --key-size=10 --val-size=1 --total=3000000

When the snapshot is triggered, one can see these lines in the leader's logs:

2016-06-07 17:29:32.776501 I | etcdserver: wrote database snapshot out [total bytes: 235487232]
2016-06-07 17:29:33.649890 I | rafthttp: database snapshot [index: 3552177, to: 5fbb5cbbbef242b7] sent out successfully

And the follower's logs will look something like:

2016-06-07 17:29:33.649748 I | snap: saved database snapshot to disk [total bytes: 235487232]
2016-06-07 17:29:33.649800 I | rafthttp: received and saved database snapshot [index: 3552177, from: b88fca7fa2ea1289] successfully
2016-06-07 17:29:33.650026 I | raft: 5fbb5cbbbef242b7 [commit: 3000778, lastindex: 3000778, lastterm: 5] starts to restore snapshot [index: 3552177, term: 5]
2016-06-07 17:29:33.650089 I | raft: log [committed=3000778, applied=3000778, unstable.offset=3000779, len(unstable.Entries)=0] starts to restore snapshot [index: 3552177, term: 5]
2016-06-07 17:29:33.650129 I | raft: 5fbb5cbbbef242b7 restored progress of 5fbb5cbbbef242b7 [next = 3552178, match = 3552177, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2016-06-07 17:29:33.650148 I | raft: 5fbb5cbbbef242b7 restored progress of b88fca7fa2ea1289 [next = 3552178, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2016-06-07 17:29:33.650163 I | raft: 5fbb5cbbbef242b7 restored progress of c4268f53965ad024 [next = 3552178, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2016-06-07 17:29:33.650179 I | raft: 5fbb5cbbbef242b7 [commit: 3552177] restored snapshot [index: 3552177, term: 5]
2016-06-07 17:29:33.650323 I | etcdserver: applying snapshot at index 3000641...
2016-06-07 17:29:33.653349 I | etcdserver: raft applied incoming snapshot at index 3552177
2016-06-07 17:29:33.664600 I | etcdserver: restoring mvcc store...
2016-06-07 17:29:57.641605 I | etcdserver: finished restoring mvcc store
...

Root Cause

The root cause was that the writes were too intensive, so the follower kept falling behind. When an etcd node recovers, it needs to rebuild the key-index from disk. Reading the backend file itself doesn't take long: iterating 10M keys in boltdb takes only about 3 seconds on an SSD.

Rebuilding the key-index from scratch, however, takes a long time.

Restoring 8M keys from scratch can take more than 30 seconds. And if there are concurrent writes (10K~15K QPS) to the other nodes, the recovering node has to receive a snapshot right after the key-index rebuild, and that snapshot triggers another index rebuild. This cycle repeats for as long as the concurrent writes continue, with each restore taking longer and longer.
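As a rough illustration of why the raw backend read is cheap, here is a minimal standalone sketch (not code from this issue) that only scans the boltdb backend with a read-only cursor; the db path and the "key" bucket name are assumptions about a typical etcd data directory:

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/boltdb/bolt"
)

func main() {
    // Open the backend read-only; the path is an assumption for illustration.
    db, err := bolt.Open("member/snap/db", 0600, &bolt.Options{ReadOnly: true})
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    start := time.Now()
    n := 0
    err = db.View(func(tx *bolt.Tx) error {
        b := tx.Bucket([]byte("key")) // etcd keeps revisions in the "key" bucket
        if b == nil {
            return fmt.Errorf("bucket %q not found", "key")
        }
        // Walk every revision; this sequential scan is the cheap part of a restore.
        c := b.Cursor()
        for k, _ := c.First(); k != nil; k, _ = c.Next() {
            n++
        }
        return nil
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("scanned %d revisions in %v\n", n, time.Since(start))
}

The expensive part is what happens per revision after this scan: decoding the protobuf and, above all, updating the treeIndex.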


Conclusion

This only happens with extremely intense writes (a large number of keys for which the btree has to be rebuilt), and we believe it will rarely happen in production. Even under this heavy workload, the etcd leader kept operating. If anyone is hitting this issue, please check your workload, try slowing it down, or consider compacting your database.

There was nothing wrong with etcd's correctness.

We are marking this as unplanned.

In the future, we will make the restore operation faster.


More details

Did more timing:

// Inside the restore loop; dTime and rTime are time.Duration accumulators
// declared before the loop.
var kv mvccpb.KeyValue
st1 := time.Now()
if err := kv.Unmarshal(vals[i]); err != nil {
    plog.Fatalf("cannot unmarshal event: %v", err)
}
dTime += time.Since(st1) // total time spent decoding

st2 := time.Now()
s.kvindex.Restore(kv.Key, revision{kv.CreateRevision, 0}, rev, kv.Version)
rTime += time.Since(st2) // total time spent rebuilding the key index

mvcc: [DEBUG]
store.restore took 36.38789088s
decode took 2.106427628s
kvindex.restore took 31.230108334s

So it's not the decoding that takes much time; kvindex.Restore accounts for most of it.


/cc @heyitsanthony @xiang90
@gyuho gyuho self-assigned this May 11, 2016
@gyuho gyuho added this to the v3.0.0 milestone May 11, 2016
gyuho added a commit to gyuho/etcd that referenced this issue May 11, 2016
Fix etcd-io#5281.
Fix etcd-io#5317.

1. Wait first before it gets revision, hash, as we did in the past.
2. Increase setHealthKey timeout in case we inject network latency.
gyuho referenced this issue in gyuho/etcd May 11, 2016
Fix coreos#5281.
Fix coreos#5317.

1. Wait first before it gets revision, hash, as we did in the past.
2. Increase setHealthKey timeout in case we inject network latency.
3. Slow down stresser in case it takes longer to catch up and prevent
stressers from exceeding the storage quota.
gyuho referenced this issue in gyuho/etcd May 13, 2016
Fix coreos#5281.
Fix coreos#5317.

1. Wait first before it gets revision, hash, as we did in the past.
2. Retry Compact request in case it times out from injected network latency.
@gyuho gyuho closed this as completed May 24, 2016
@gyuho gyuho reopened this May 24, 2016
@gyuho gyuho modified the milestones: unplanned, v3.0.0 Jun 7, 2016
@gyuho gyuho changed the title functional-tester: recover RPC call times out mvcc.store.restore taking too long, and etcd-tester recover RPC call times out Jun 7, 2016
@gyuho gyuho changed the title mvcc.store.restore taking too long, and etcd-tester recover RPC call times out mvcc.store.restore taking too long triggers snapshot cycle Jun 16, 2016

xiang90 commented Oct 25, 2016

Ideally we want etcd to recover from 40M small keys within 30 seconds.

40M is around one hour of a 10k keys/second put workload without any compaction (10,000 keys/s × 3,600 s ≈ 36M). This is probably the max throughput we want to support for etcd.

/cc @nekto0n @sinsharat @vimalk78 It would be great if any of you can work on it.


nekto0n commented Oct 25, 2016

According to @gyuho, most of the time is taken by kvindex.Restore. There is a BenchmarkIndexRestore which on my laptop gives these numbers:

1000000       1411 ns/op         178 B/op          3 allocs/op

Using only 100k keys (simulating multiple revisions per key), I get:

2000000        687 ns/op         119 B/op          1 allocs/op

So it takes a bit more than a second to recover 1M keys, and a bit less for 100k keys with a total of 2M revisions. Am I right that we need to optimize this further, or should we put together a more involved benchmark that looks more like the real code path?


xiang90 commented Oct 25, 2016

So it takes a bit more than a second to recover 1M keys, and a bit less for 100k keys with a total of 2M revisions.

Is it growing linearly? 100K -> 10 revs per key = 1s (1M keys in total); 100k -> 10 revs = 2s (1M keys in total)?

I remember I did some experiments a while ago and found some low-hanging fruit.


nekto0n commented Oct 26, 2016

I did a bit of experimenting with the benchmark and got these numbers:
go test -v github.com/coreos/etcd/mvcc -bench BenchmarkIndexRestore -run ^$ -benchtime 10s

BenchmarkIndexRestore1-4        10000000          1314 ns/op         178 B/op          3 allocs/op
BenchmarkIndexRestore10-4       20000000           737 ns/op         123 B/op          1 allocs/op
BenchmarkIndexRestore100-4      30000000           624 ns/op         105 B/op          1 allocs/op

I remember I did some experiments a while ago and found some low-hanging fruit.

Can you share some directions where they may be hanging from? :)

@xiang90 xiang90 modified the milestones: v3.2.0, unplanned Nov 8, 2016

xiang90 commented Nov 8, 2016

Can you share some directions where they may be hanging from? :)

Sure. During the restore, we keep touching the treeIndex with single-key gets and puts: it does one get and one put for every key and every key modification. Over the whole restore these tree operations add up to O(N log N), which is not ideal.

To improve this, we can use an unordered map to hold the temporary data. We only touch the unordered map while restoring the index. After we have iterated over all the revisions, we range over the unordered map and convert it into the tree index. This can reduce the cost significantly when keys have multiple versions.

Some simple code:

    // use an unordered map to hold the temp index data to speed up
    // the initial key index recovery.
    // we will convert this unordered map into the tree index later.
    unordered := make(map[string]*keyIndex, 100000)

    // TODO: limit N to reduce max memory usage
    keys, vals := tx.UnsafeRange(keyBucketName, min, max, 0)
    for i, key := range keys {
        var kv mvccpb.KeyValue
        if err := kv.Unmarshal(vals[i]); err != nil {
            plog.Fatalf("cannot unmarshal event: %v", err)
        }

        rev := bytesToRev(key[:revBytesLen])

        // restore index
        switch {
        case isTombstone(key):
            if ki, ok := unordered[string(kv.Key)]; ok {
                ki.tombstone(rev.main, rev.sub)
            }
            delete(keyToLease, string(kv.Key))

        default:
            ki, ok := unordered[string(kv.Key)]
            if ok {
                ki.put(rev.main, rev.sub)
            } else {
                ki = &keyIndex{key: kv.Key}
                ki.restore(revision{kv.CreateRevision, 0}, rev, kv.Version)
                unordered[string(kv.Key)] = ki
            }

            if lid := lease.LeaseID(kv.Lease); lid != lease.NoLease {
                keyToLease[string(kv.Key)] = lid
            } else {
                delete(keyToLease, string(kv.Key))
            }
        }

        // update revision
        s.currentRev = rev
    }

    // restore the tree index from the unordered index.
    for _, v := range unordered {
        s.kvindex.Insert(v)
    }

func (ti *treeIndex) Insert(ki *keyIndex) {
    ti.Lock()
    defer ti.Unlock()
    ti.tree.ReplaceOrInsert(ki)
}

On my laptop, it reduces the recovery time for 10M revisions (1M keys and 10 versions per key) from 40s to 10s.

For the unordered map recovery, we can make it even fancier by translating it into a map-reduce style and increasing concurrency. But that is a future step and we might not even need it.
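To make that possible future step a little more concrete, here is a rough, self-contained sketch of a sharded (map-reduce style) restore. It is not etcd code: rev and item are simplified stand-ins for revision and keyIndex, and github.com/google/btree stands in for the tree index.

package restore

import (
    "hash/fnv"
    "sync"

    "github.com/google/btree"
)

// rev and item are simplified stand-ins for etcd's revision and keyIndex.
type rev struct{ main, sub int64 }

type item struct {
    key  string
    revs []rev // revisions of this key, in order
}

func (a *item) Less(b btree.Item) bool { return a.key < b.(*item).key }

// restoreConcurrently builds per-shard unordered maps in parallel, then
// converts them into a single B-tree. keys and revs are parallel slices in
// global revision order, as they would come out of the backend scan.
func restoreConcurrently(keys []string, revs []rev, shards int) *btree.BTree {
    type kr struct {
        k string
        r rev
    }
    chs := make([]chan kr, shards)
    maps := make([]map[string]*item, shards)
    var wg sync.WaitGroup
    for i := range chs {
        chs[i] = make(chan kr, 1024)
        maps[i] = make(map[string]*item)
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            // Each worker owns maps[i], so no locking is needed.
            for e := range chs[i] {
                ki, ok := maps[i][e.k]
                if !ok {
                    ki = &item{key: e.k}
                    maps[i][e.k] = ki
                }
                ki.revs = append(ki.revs, e.r)
            }
        }(i)
    }

    // Fan out by key hash so all revisions of a key land in the same shard,
    // preserving per-key revision order.
    for i, k := range keys {
        h := fnv.New32a()
        h.Write([]byte(k))
        chs[h.Sum32()%uint32(shards)] <- kr{k, revs[i]}
    }
    for _, ch := range chs {
        close(ch)
    }
    wg.Wait()

    // Single-threaded conversion into the ordered tree index.
    t := btree.New(32)
    for _, m := range maps {
        for _, ki := range m {
            t.ReplaceOrInsert(ki)
        }
    }
    return t
}

Whether the goroutine and channel overhead pays off in practice would need benchmarking; as noted above, it may not be needed at all.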


xiang90 commented Nov 8, 2016

/cc @nekto0n @sinsharat


nekto0n commented Nov 8, 2016

So we replace O(n log n) (for every key revision we do a get and a put against the tree) with O(n) + O(k log k) for k distinct keys (every revision goes into a hash table, and then only the distinct keys, about 1/10 of the entries here, are inserted into the tree).
All that's needed is to copy-paste the code above into a PR, sounds doable :)
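To make that comparison concrete, here is a minimal, self-contained benchmark sketch of the two strategies. The types are hypothetical stand-ins, and github.com/google/btree is used directly instead of etcd's internal treeIndex/keyIndex.

package restore

import (
    "fmt"
    "testing"

    "github.com/google/btree"
)

// ki is a hypothetical stand-in for keyIndex: one tree item per key,
// carrying all of that key's revisions.
type ki struct {
    key  string
    revs []int64
}

func (a *ki) Less(b btree.Item) bool { return a.key < b.(*ki).key }

// restoreTreeOnly mimics the old path: one Get and one ReplaceOrInsert
// against the tree for every revision.
func restoreTreeOnly(keys []string, revs []int64) *btree.BTree {
    t := btree.New(32)
    for i, k := range keys {
        var entry *ki
        if got := t.Get(&ki{key: k}); got != nil {
            entry = got.(*ki)
        } else {
            entry = &ki{key: k}
        }
        entry.revs = append(entry.revs, revs[i])
        t.ReplaceOrInsert(entry)
    }
    return t
}

// restoreMapFirst mimics the new path: aggregate revisions per key in a
// hash map, then insert each distinct key into the tree exactly once.
func restoreMapFirst(keys []string, revs []int64) *btree.BTree {
    m := make(map[string]*ki, len(keys))
    for i, k := range keys {
        entry, ok := m[k]
        if !ok {
            entry = &ki{key: k}
            m[k] = entry
        }
        entry.revs = append(entry.revs, revs[i])
    }
    t := btree.New(32)
    for _, entry := range m {
        t.ReplaceOrInsert(entry)
    }
    return t
}

// BenchmarkRestore compares the two strategies for 100k keys x 10 revisions.
func BenchmarkRestore(b *testing.B) {
    const nKeys, revsPerKey = 100000, 10
    keys := make([]string, 0, nKeys*revsPerKey)
    revs := make([]int64, 0, nKeys*revsPerKey)
    for r := 0; r < revsPerKey; r++ {
        for k := 0; k < nKeys; k++ {
            keys = append(keys, fmt.Sprintf("key-%08d", k))
            revs = append(revs, int64(r*nKeys+k))
        }
    }
    b.Run("tree-only", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            restoreTreeOnly(keys, revs)
        }
    })
    b.Run("map-first", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            restoreMapFirst(keys, revs)
        }
    })
}

Run it with something like go test -bench Restore -run '^$' to compare the two.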


xiang90 commented Nov 8, 2016

@nekto0n Some tests need to be changed, and the benchmark needs to be updated or added.

@sinsharat

@nekto0n are you going to add the code as well as the benchmark, or can I take any part of it?


nekto0n commented Nov 9, 2016

@sinsharat actually I'm pretty busy this week, so feel free to take all of it.

@sinsharat

@nekto0n Sure, will start looking into it then.
Thanks!

@sinsharat

@xiang90 sorry for the delay on this. I was stuck with some business commitments that I was asked to work on this entire week. Will try to finish this and raise a PR by tomorrow.
Thank you!


xiang90 commented Nov 18, 2016

Fixed by #6846. There are a few other places that can be improved, but we will hold off on further optimization until it becomes a problem in the real world.

@xiang90 xiang90 closed this as completed Nov 18, 2016