Skip to content

Commit

Permalink
Add ADR for etcd snapshot CRD migration
Browse files Browse the repository at this point in the history
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
  • Loading branch information
brandond committed Oct 12, 2023
1 parent 9bb1ce1 commit 22065af
Showing 1 changed file with 53 additions and 0 deletions.
53 changes: 53 additions & 0 deletions docs/adrs/etcd-snapshot-cr.md
@@ -0,0 +1,53 @@
# Store etcd snapshot metadata in a Custom Resource

Date: 2023-07-27

## Status

Accepted

## Context

K3s currently stores a list of etcd snapshots and associated metadata in a ConfigMap. Other downstream
projects and controllers consume the content of this ConfigMap in order to present cluster administrators with
a list of snapshots that can be restored.

On clusters with more than a handful of nodes, and reasonable snapshot intervals and retention periods, the snapshot
list ConfigMap frequently reaches the maximum size allowed by Kubernetes, and fails to store any additional information.
The snapshots are still created, but they cannot be discovered by users or accessed by tools that consume information
from the ConfigMap.

When this occurs, the K3s service log shows errors such as:
```
level=error msg="failed to save local snapshot data to configmap: ConfigMap \"k3s-etcd-snapshots\" is invalid: []: Too long: must have at most 1048576 bytes"
```

Reference:
* https://github.com/rancher/rke2/issues/4495
* https://github.com/k3s-io/k3s/blob/36645e7311e9bdbbf2adb79ecd8bd68556bc86f6/pkg/etcd/etcd.go#L1503-L1516

### Existing Work

Rancher already has a `rke.cattle.io/v1 ETCDSnapshot` Custom Resource that contains the same information after it's been
imported by the management cluster:
* https://github.com/rancher/rancher/blob/027246f77f03b82660dc2e91df6bf2cd549163f0/pkg/apis/rke.cattle.io/v1/etcd.go#L48-L74

It is unlikely that we would want to use this custom resource in its current package; we may be able to negotiate moving
it into a neutral project for use by both projects.

## Decision

1. Instead of populating snapshots into a ConfigMap using the JSON serialization of the private `snapshotFile` type, K3s
will manage creation of an new Custom Resource Definition with similar fields.
2. Metadata on each snapshot will be stored in a distinct Custom Resource.
3. The new Custom Resource will be cluster-scoped, as etcd and its snapshots are a cluster-level resource.
4. Downstream consumers of etcd snapshot lists will migrate to watching the Custom Resource, instead of the ConfigMap.
5. K3s will observe a three minor version transition period, where both the new Custom Resource, and the existing
ConfigMap, will both be used.
6. During the transition period, older snapshot metadata may be removed from the ConfigMap while those snapshots still
exist and are referenced by new Custom Resources, if the ConfigMap exceeds a preset size or key count limit.

## Consequences

* Snapshot metadata will no longer be lost when the number of snapshots exceeds what can be stored in the ConfigMap.
* There will be some additional complexity in managing the new Custom Resource, and working with other projects to migrate to using it.

0 comments on commit 22065af

Please sign in to comment.