
Restoring etcd without breaking API guarantees, aka etcd restore for Kubernetes #16028

Closed
5 tasks done
Tracked by #118501
serathius opened this issue Jun 7, 2023 · 19 comments

Comments

@serathius
Member

serathius commented Jun 7, 2023

What would you like to be added?

This is the etcd-side issue for kubernetes/kubernetes#118501.

tl;dr: The etcd restore operation breaks existing etcd guarantees about revisions never decreasing.

Proposal

  • The etcd restore operation should support bumping the revision of subsequent etcd operations by an arbitrary number.
  • The etcd restore operation should support marking old revisions as compacted.
  • The etcd restore operation should support bumping the revision of the latest key versions.

Tracking work

  • Implement an optional --bump-revision flag for the etcdutl snapshot restore operation.
  • Implement an optional --mark-compacted flag for the etcdutl snapshot restore operation.
  • Confirm that those changes are backport-compatible with v3.5 and v3.4.
  • Add e2e tests for etcd restore that confirm (see the sketch after this list):
    • Revision doesn't decrease if the value provided via --bump-revision is high enough
    • Clients are notified about the restore by their watches being broken
    • Clients cannot access revisions from between the backup and the restore operation
  • Backport the changes and tests to etcd v3.5 and v3.4.
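As a rough illustration of the first e2e check, a minimal client-side sketch (assuming a local test endpoint, a hypothetical probe key, and the backup/restore cycle happening out of band) could compare the store revision observed before the backup with the first revision observed after the restore:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// revisionOf returns the current store revision as reported in a response header.
func revisionOf(ctx context.Context, cli *clientv3.Client) (int64, error) {
	resp, err := cli.Get(ctx, "probe-key") // hypothetical key; only the header matters
	if err != nil {
		return 0, err
	}
	return resp.Header.Revision, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumed local test endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	before, err := revisionOf(ctx, cli)
	if err != nil {
		log.Fatal(err)
	}

	// ... snapshot save, disaster, and `etcdutl snapshot restore --bump-revision --mark-compacted`
	// happen out of band here, then the restored member is started again ...

	after, err := revisionOf(ctx, cli)
	if err != nil {
		log.Fatal(err)
	}
	if after < before {
		log.Fatalf("revision decreased after restore: %d -> %d", before, after)
	}
	fmt.Printf("revision is monotonic: %d -> %d\n", before, after)
}
```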

Why is this needed?

With no official guidance, each vendor/administrator had a different plan for etcd restore, many of which were broken or incorrect. Having an official, guided restore operation allows the whole community to work together and share experience, leading to improvements in Kubernetes disaster recovery handling.

@serathius
Member Author

cc @ptabor @ahrtr

@serathius serathius changed the title Restoring etcd in client friendly way, aka etcd restore for Kubernetes Restoring etcd without breaking API guarantees, aka etcd restore for Kubernetes Jun 7, 2023
@ahrtr
Member

ahrtr commented Jun 8, 2023

Offline compaction

Offline compaction makes sense to me. It's useful to perform compaction + defragmentation before we restore a snapshot. We may need to handle it from both the etcd and bbolt perspectives. Let me take care of it. But it may take some time, because I will be away on personal matters starting next Tuesday for 1-2 weeks.

I propose adding a separate command etcdutl compact [flags], similar to the existing command etcdutl defrag [flags]. Let's not change the existing etcdctl snapshot ... commands.

Offline bumping revision

I am not against offline bumping revision, but please see my concern/comment in #15381 (comment). Would like to get more feedback before we move on.

@tjungblu
Contributor

tjungblu commented Jun 8, 2023

I think we already have etcdctl compaction <rev> that does that, albeit online.

Maybe before we go into the debate about what we need to implement for compaction: do we even need to compact?
I think what Marek meant is that we return the current revision as "compacted" to force a reset of the informer caches, not that we do an actual compaction.

@serathius
Member Author

serathius commented Jun 8, 2023

Offline compaction

Offline compaction makes sense to me. It's useful to perform compaction + defragmentation before we restore a snapshot. We may need to handle it from both the etcd and bbolt perspectives. Let me take care of it. But it may take some time, because I will be away on personal matters starting next Tuesday for 1-2 weeks.

As @tjungblu said, nowhere did I mention that I want to run offline compaction. I just want to mark revisions as compacted so they are inaccessible to clients. For large db files it's preferable that we can do that without doing any actual work.

Please don't agree to work on the task and then change the design; if you don't agree with the design, say so.

I don't think you understood the design. We don't want to perform offline compaction, and we don't care about offline defragmentation; we just want to make restore work. What I mean by "work" is for etcd to uphold https://etcd.io/docs/v3.6/learning/api_guarantees/, meaning revisions are unique and ordered.

@tjungblu
Contributor

tjungblu commented Jun 8, 2023

(moving your arguments over from #15381 (comment) so we don't split the discussion into too many places)

Usually it is risky for an offline tool to modify the db data

Agreed, what @dusk125 wrote so far is just adding a single bump transaction on top of the snapshot data:
https://github.com/etcd-io/etcd/pull/16029/files#diff-c0f9abda5049ad8050c02966bd327db7fcd27b6848a54deac44d78d5b003f954R347-R368

So we're neither modifying the snapshot you restore from (important for integrity checks), nor are we hacking surgically around the binary files. I think Marek's approach goes even a step further, by just adding "+x" to the next transaction that's executed - not sure this is enough. Maybe @dusk125 has some more input on what worked and what didn't; I believe we tried that before.
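For illustration only (this is not the PR's code): one way to "add a single bump transaction on top" of a restored db file is to write one dummy key-value at a revision `bump` higher than the snapshot's highest one, so the restarted server rebuilds its index with the bumped current revision. The sketch below assumes the internal bolt layout of the db file (a "key" bucket whose keys encode the revision as 8-byte big-endian main, '_', 8-byte sub, and whose values are marshaled mvccpb.KeyValue messages); the marker key name is hypothetical:

```go
package main

import (
	"encoding/binary"
	"errors"
	"log"

	bolt "go.etcd.io/bbolt"
	"go.etcd.io/etcd/api/v3/mvccpb"
)

// revKey encodes a (main, sub) revision the way the mvcc "key" bucket is assumed
// to do it: 8-byte big-endian main revision, '_', 8-byte big-endian sub revision.
func revKey(main, sub int64) []byte {
	b := make([]byte, 17)
	binary.BigEndian.PutUint64(b[0:8], uint64(main))
	b[8] = '_'
	binary.BigEndian.PutUint64(b[9:17], uint64(sub))
	return b
}

// bumpRevision writes one dummy key at newRev so that the restored server's
// current revision starts at newRev instead of the snapshot's revision.
func bumpRevision(dbPath string, newRev int64) error {
	db, err := bolt.Open(dbPath, 0600, nil)
	if err != nil {
		return err
	}
	defer db.Close()

	return db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("key"))
		if b == nil {
			return errors.New("no key bucket in db file")
		}
		kv := mvccpb.KeyValue{
			Key:            []byte("__restore_bump__"), // hypothetical marker key
			Value:          []byte{},
			CreateRevision: newRev,
			ModRevision:    newRev,
			Version:        1,
		}
		v, err := kv.Marshal()
		if err != nil {
			return err
		}
		return b.Put(revKey(newRev, 0), v)
	})
}

func main() {
	// Hypothetical usage against a restored member's db file.
	if err := bumpRevision("default.etcd/member/snap/db", 1_000_000); err != nil {
		log.Fatal(err)
	}
}
```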

When you restore an etcd cluster, usually it means that you already ran into a disastrous failure. In other words, there is already downtime. But if it's a huge cluster (e.g. with hundreds or thousands of nodes), it might be inconvenient to restart all operators and nodes (actually kubelet, kube-proxy and CNI?). But if it's just a small cluster with a couple of nodes, it might be fine to just restart them.

We need to differentiate on what's down: quorum loss definitely means the k8s API is not writable and no linearizable reads are possible. Workloads on the worker nodes can still continue happily on their own. Operators/controllers (also consider third-party stuff you installed with a helm chart) might be impacted, depending on what they're doing with the API.

The restarting definitely works functionally, but I think there is a window of time between you restoring to old_rev - 1000 and the operators catching up again that could be disastrous. Imagine you add a thousand revs again very quickly; then your informer cache gets a totally different state of the world + some new state. David called this "alternate history", which I found to be a good term for it.

But for stale data, it means the snapshot isn't up to date, and there is some data loss... what's the potential impact?

Whatever we bring up from a restored snapshot must be guaranteed far enough into the future that we don't create that alternate history. There must be data loss by definition, but that should be reconciled correctly by the operators (:crossed_fingers:).

@ahrtr
Member

ahrtr commented Jun 8, 2023

I think what Marek meant is that we return the current revision as "compacted"

Could you please elaborate on this?

@dusk125
Contributor

dusk125 commented Jun 8, 2023

So we're neither modifying the snapshot you restore from (important for integrity checks), nor are we hacking surgically around the binary files. I think Marek's approach goes even a step further, by just adding "+x" to the next transaction that's executed - not sure this is enough. Maybe @dusk125 has some more input on what worked and what didn't; I believe we tried that before.

In the work I had been doing, I did not modify the snapshot, nor do I think we should modify it in whatever path we take. If, for whatever reason, the restore fails, the user could remove the proposed flags, go through the normal restore, and restart their workload; we wouldn't be able to do this if the snapshot file had been changed.

My testing was focused on getting a watcher to resync automatically as if the key it's watching had changed, without it actually needing to change. The watcher would reconnect to the restored etcd and get a notification that the value was modified, and thus get its "new" value and hopefully continue operating as normal.

If the revision you're watching is shown as compacted, then the watch fails. If the watcher contract is to invalidate and re-watch when a watch fails due to a compacted revision, then I think @serathius's suggestion of marking everything as compacted after restore (not touching the snapshot file) and adding a single dummy revision on top of everything could work; I can start testing this out.
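To illustrate the watcher contract being discussed, here is a minimal client-side sketch (the endpoint, prefix, and start revision are hypothetical; real Kubernetes clients go through the reflector/informer machinery instead) that re-lists and re-watches when its watch is cancelled with a compacted revision:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"

	"go.etcd.io/etcd/api/v3/v3rpc/rpctypes"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchWithRelist watches prefix starting at rev. If the watch is cancelled
// because rev has been compacted (as it would be after a restore with
// --mark-compacted), it re-lists to obtain a fresh revision and starts over,
// mimicking the "invalidate and re-watch" contract discussed above.
func watchWithRelist(ctx context.Context, cli *clientv3.Client, prefix string, rev int64) {
	for ctx.Err() == nil {
		wch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(rev))
		for wresp := range wch {
			if errors.Is(wresp.Err(), rpctypes.ErrCompacted) {
				log.Printf("watch compacted at rev %d, re-listing", wresp.CompactRevision)
				resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
				if err != nil {
					log.Printf("re-list failed: %v", err)
					return
				}
				// A real client would rebuild its local cache from resp.Kvs here.
				rev = resp.Header.Revision + 1
				break
			}
			for _, ev := range wresp.Events {
				log.Printf("event: %s %s mod=%d", ev.Type, ev.Kv.Key, ev.Kv.ModRevision)
			}
		}
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumed local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	watchWithRelist(context.Background(), cli, "/registry/", 1) // hypothetical prefix and start revision
}
```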

@chaochn47
Member

Hi @serathius @dusk125 @tjungblu, this is a great proposal. Could you please share how invalidating the client cache (operator/controller) based on a watch failing on a compacted, higher revision will work, or how it works conceptually as of today? I think that's the fundamental assumption (theory) we are making that drives this design.

It should clear some confusion in this thread.

(or should we move this conversation to k/k issue?)

@tjungblu
Contributor

tjungblu commented Jun 8, 2023

Thanks for all the discussion in the meantime. I think I'm also a little further along in my understanding and will try to summarize it in more detailed pieces.

Let's look at this from a K8s API point of view first. We want to ensure that all operators/controllers consuming the watchlist API will receive a signal to invalidate their caches. Working backwards from the official docs, we can find that a 410 should be returned to signal a fresh listing from the start.

This, in turn, is always triggered by etcd returning etcdrpc.ErrCompacted when using etcd watches.


From etcd's point of view, we want to ensure that we have a high enough revision to start from after restoring from a snapshot, before we serve any data to the outside world. Yet we also need to ensure that we don't serve any history between the backup revision (BR) and the restored revision (NR, basically BR + a big enough integer). Conveniently, this is exactly what compaction does.

For that, do we really need to run a full compaction? Or is it enough to mark a revision range as already compacted?

I think there are multiple approaches here. One that's hacky would be to use UnsafeSetScheduledCompact and the scheduledCompactRev schema key to bump it; not sure if this will already do the trick, @dusk125 wanted to check that out next. There's also an opportunity in changing compactMainRev in mvcc's kvstore.go, going through the code.

Rewinding a little: how do we establish the revision in the snapshot?

Similar to Allen's approach, we discussed looping O(n) through each key in the snapshot to establish which revision is the highest. If you look into kvstore.go, it already does that. I think there were some reservations about the performance impact of its linear runtime.

Since creating a snapshot already requires iteration, we would also propose caching this information in the snapshot proto as additional/trailing metadata. When contained in a snapshot, it would allow us to skip iterating the whole key space when restoring.
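Purely as an illustration of the O(n) scan discussed above (and assuming the internal layout of the "key" bucket, where each key starts with the 8-byte big-endian main revision), reading the highest revision out of a snapshot db file with bbolt could look like this:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

// highestMainRevision scans the "key" bucket of a snapshot db file and returns
// the largest main revision it finds (assumption: bucket keys are encoded as
// 8-byte big-endian main revision, '_', 8-byte sub revision).
func highestMainRevision(dbPath string) (int64, error) {
	db, err := bolt.Open(dbPath, 0400, &bolt.Options{ReadOnly: true})
	if err != nil {
		return 0, err
	}
	defer db.Close()

	var max int64
	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("key"))
		if b == nil {
			return errors.New("no key bucket in db file")
		}
		return b.ForEach(func(k, _ []byte) error {
			if len(k) >= 8 {
				if rev := int64(binary.BigEndian.Uint64(k[:8])); rev > max {
					max = rev
				}
			}
			return nil
		})
	})
	return max, err
}

func main() {
	rev, err := highestMainRevision("snapshot.db") // hypothetical snapshot file
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("highest main revision:", rev)
}
```

Since bbolt keys are sorted, a cursor's Last() would give the same answer without the full scan; caching the value as snapshot metadata, as proposed above, avoids even that.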

@dusk125
Contributor

dusk125 commented Jun 8, 2023

I would be happy to take this issue on; can you assign it to me, please?

@serathius
Member Author

@tjungblu @dusk125 Thanks for jumping on a call and writing down our discussion. One thing I noticed during our discussion was that my proposal didn't handle this correctly, so a client could still observe the revision fluctuating 100 -> 110 -> 100 during a restore.

@ahrtr
Member

ahrtr commented Jun 9, 2023

Thanks all for the explanation, but it seems that we are still playing with a not-well-explained concept: "mark a revision range as already compacted". etcd supports compaction, and returns ErrCompacted if the requested revision is older than the latest compacted revision, or ErrFutureRev if the requested revision is bigger than the current revision. I have never seen "mark a revision range as already compacted" before, nor a clear explanation of it so far.

ensure that we don't serve any history between the backup revision (BR) and restored revision (NR, basically BR + a big enough integer).

Yes, indeed there might be a gap in between. If users request a revision in the gap, then they get nothing. Users may be confused because etcd's revisions are supposed to be continuous. So far I do not see any feasible & good approach other than a complete offline compaction just as I mentioned in #16028 (comment)

@tjungblu
Contributor

tjungblu commented Jun 9, 2023

Users may be confused because etcd's revisions are supposed to be continuous.

Fair point, I think this is just not practically possible because we don't know what the very latest revision was in a disaster recovery scenario. If all you have is a backup snapshot from two days ago and everything else is somehow lost, how do you find out the most recently served revision?

So far I do not see any feasible & good approach other than a complete offline compaction just as I mentioned in #16028 (comment)

We should settle on some non-functional requirements for a restoration. Marek made a good point about the process being as fast and reliable as possible, to avoid spooking admins/SREs during an already stressful task. Waiting for a full offline compaction on big snapshots, which also might fail, could take quite some time.

From my experience, starting a big etcd from disk (let's say close to our 8 GB limit) until it is ready to serve the first byte of client requests (TTFB, if you want to call it that) takes on the order of several minutes. Does anyone have some reliable numbers from their cloud fleet they can share? Sadly, I don't have this metric in our telemetry.

@ahrtr
Member

ahrtr commented Jun 9, 2023

Just had a quick discussion with @serathius.

There are two things.

  1. Bumping the revision. We are on the same page. The purpose is to make sure the revision will never decrease.
  2. Compaction. Two options:
    • @serathius 's suggestion: "marking old revisions as compacted". It's actually just setting the scheduledCompactRev field in the Meta bucket. When finishedCompactRev != scheduledCompactRev, it means that a previous compaction did not finish; when etcd starts, it will try to finish that compaction (so there is no performance difference between the two approaches). It's the minimal change in etcdutl, and zero change for etcdserver. The only concern is that it's a little hacky; we need to test it carefully. Let's follow this approach for now.
    • @ahrtr 's suggestion: offline compaction, just as I mentioned in Restoring etcd without breaking API guarantees, aka etcd restore for Kubernetes #16028 (comment). We already have offline defrag, so it's natural to enhance etcdutl to support offline compaction. We can discuss this separately. FYI, we might get rid of defrag completely after "Support customizing the rebalance threshold" (bbolt#422) is resolved, but we will always need offline compaction.

@dusk125
Contributor

dusk125 commented Jun 12, 2023

I've been able to verify that marking a revision as compacted via mvcc.UnsafeSetScheduledCompact(tx, revision) does cause the server to think that it has a compaction in progress, which it will finish sometime on startup after the restore, and it causes the watchers to get canceled with the "required revision has been compacted" error.

The following is from running the server after restoring with my most recent changes in my draft PR, with the --bump-revision 1000000 --mark-compacted arguments:

{"caller":"mvcc/kvstore.go:385","msg":"kvstore restored","current-rev":1000004}
{"caller":"mvcc/kvstore.go:392","msg":"resume scheduled compaction","scheduled-compact-revision":1000004}
{"caller":"mvcc/index.go:194","msg":"compact tree index","revision":1000004}
{"caller":"mvcc/kvstore_compaction.go:69","msg":"finished scheduled compaction","compact-revision":1000004,"took":"3.403894ms","hash":1818873421}

@chaochn47
Member

chaochn47 commented Jun 30, 2023

From an etcd cluster administrator's perspective, I would like the etcdctl snapshot restore flag to be named desired-compacted-revision.

I don't want to calculate how much I should bump the revision since the last etcd snapshot; I just want to supply a desired resource version to be compacted, based on observations in the apiserver logs.

As a mitigation, naming the snapshot db file with the revision as a suffix and doing the calculation afterwards also works... it's just not as convenient as supplying the desired revision directly.

@ahrtr
Member

ahrtr commented Jul 6, 2023

I don't want to calculate how much I should bump the revision since the last etcd snapshot; I just want to supply a desired resource version to be compacted, based on observations in the apiserver logs

It is a good point. The concern is that you might not always be able to know the desired target revision, but you are always able to provide a rough, big enough bump. As long as we enhance the etcdutl tool to report the current highest revision, we can easily calculate the bump (desired revision - current highest revision). It doesn't seem like a big problem.
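For example (with made-up numbers), if the snapshot's highest revision is 5000 and the desired compacted revision observed in the apiserver logs is 8000, the bump to supply would be 8000 - 5000 = 3000.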

@ahrtr
Member

ahrtr commented Jul 20, 2023

This task should be complete now, right?

@serathius
Member Author

Yes, the remaining work should be on the Kubernetes side.

@dusk125 @tjungblu would you be interested in finishing kubernetes/kubernetes#118501 ?
