[Feature] Replace full snapshots at regular interval with compacted snapshots #231

abdasgupta · 2021-09-13T17:02:53Z

Feature (What you would like to be added):
We recently added a compaction subcommand to ETCD Backup Restore here. This subcommand compacts and defragments the ETCD database and then takes a full snapshot of it. We can run this subcommand in parallel with ETCD BR at regular intervals instead of taking full snapshots. However, the first full snapshot and a full snapshot every 24 hours are still needed. The first snapshot is needed because compaction cannot run in parallel unless at least one full snapshot already exists in backup storage. A full snapshot every 24 hours is needed because some clusters may end up in a situation where not a single compacted snapshot is taken within 24 hours, and it would be critical for those clusters to go 24 hours without any full snapshot.
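The rule described above could be sketched roughly as follows. This is only an illustration, not code from ETCD Backup Restore: the function name `next_step`, the return values, and the idea of passing in the time elapsed since the last scheduled full snapshot are all assumptions made for the example.

```python
from datetime import timedelta
from typing import Optional

# Interval taken from the issue text: a scheduled full snapshot every 24 hours.
FULL_SNAPSHOT_INTERVAL = timedelta(hours=24)


def next_step(time_since_last_full: Optional[timedelta]) -> str:
    """Decide what the backup sidecar should do next (hypothetical helper).

    time_since_last_full is None when no full snapshot exists in backup
    storage yet; otherwise it is the time elapsed since the last scheduled
    full snapshot.
    """
    if time_since_last_full is None:
        # Compaction needs at least one full snapshot as a base.
        return "full_snapshot"
    if time_since_last_full >= FULL_SNAPSHOT_INTERVAL:
        # Safety net: do not rely on compaction alone for more than 24 hours.
        return "full_snapshot"
    # In between, compacted snapshots may replace regular full snapshots.
    return "compaction_allowed"
```

For example, `next_step(None)` asks for the initial full snapshot, while `next_step(timedelta(hours=3))` leaves room for compaction.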

Motivation (Why is this needed?):
We need this because we want our snapshots to take less space in backup storage. An ETCD DB restored from our compacted snapshots will also take less space in main memory. Moreover, regular compacted snapshots will keep the number of events in delta snapshots limited. Please check this.

Approach/Hint to the implement solution (optional):

shreyas-s-rao · 2023-03-03T12:47:23Z

As mentioned by @vlerenc in gardener/etcd-backup-restore#587 (comment), we need to make snapshot compaction much smarter than it is today if it is to replace scheduled full snapshots.

Paraphrasing the points from Vedran's comment that relate to snapshot compaction here, since etcd-druid handles snapshot compaction:

We need to improve the conditions that trigger a snapshot compaction job.

Today, a threshold-based trigger is used, based on the number of events accumulated in the latest set of delta snapshots in the snapstore.
This needs to be enhanced to also account for the size of the accumulated events, which is required for clusters that write and update huge resources even though the number of events may be relatively small.
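A minimal sketch of such a combined trigger is shown below. It is not etcd-druid code: the `DeltaSnapshot` type, the `should_trigger_compaction` function, and both threshold parameters are hypothetical names chosen for illustration. The only idea taken from the discussion is that compaction should start when either the event count or the accumulated size crosses its threshold.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DeltaSnapshot:
    """Hypothetical summary of one delta snapshot in the snapstore."""
    events: int      # number of etcd events recorded in this delta snapshot
    size_bytes: int  # size of the delta snapshot object


def should_trigger_compaction(
    deltas: Iterable[DeltaSnapshot],
    max_events: int,
    max_bytes: int,
) -> bool:
    """Return True if the delta snapshots accumulated since the last full
    snapshot reach either the event-count or the size threshold."""
    deltas = list(deltas)
    total_events = sum(d.events for d in deltas)
    total_bytes = sum(d.size_bytes for d in deltas)
    return total_events >= max_events or total_bytes >= max_bytes


# A cluster that writes a few very large resources: the event count stays
# low, but the accumulated size alone is enough to trigger compaction.
few_large = [DeltaSnapshot(events=20, size_bytes=600 * 1024 * 1024)]
print(should_trigger_compaction(few_large, max_events=1_000_000,
                                max_bytes=512 * 1024 * 1024))
```

Using a logical OR keeps the existing event-count behaviour intact while adding the size condition as a second, independent trigger.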

We also need to improve the alerts:

If we plan to compact or take full snapshots every 1M revisions, fire the alert if 2M revisions have accumulated since the last full snapshot (compacted or explicitly obtained)
If we plan to compact or take full snapshots every 24h cluster runtime, fire the alert if 48h have passed (do not use wall-clock time for condition and/or alert)…
If we plan to compact or take full snapshots every 200 delta snapshots, fire the alert if 400 delta snapshots have accumulated…
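The three alert conditions share one pattern: alert once twice the planned amount has accumulated since the last full snapshot. A rough sketch of that pattern follows. The `Progress` type, the `firing_alerts` function, and the alert names are assumptions for illustration only, and `runtime_hours` is meant to be accumulated cluster runtime supplied by the caller, not wall-clock time.

```python
from dataclasses import dataclass
from typing import List

# Factor from the discussion: alert at twice the planned interval.
ALERT_FACTOR = 2


@dataclass(frozen=True)
class Progress:
    """Hypothetical counters; used both for the plan and for what has
    accumulated since the last full (or compacted) snapshot."""
    revisions: int
    runtime_hours: float  # cluster runtime, not wall-clock time
    delta_snapshots: int


def firing_alerts(plan: Progress, since_last_full: Progress) -> List[str]:
    """Return the names of all alert conditions that currently hold."""
    alerts = []
    if since_last_full.revisions >= ALERT_FACTOR * plan.revisions:
        alerts.append("revisions")
    if since_last_full.runtime_hours >= ALERT_FACTOR * plan.runtime_hours:
        alerts.append("runtime")
    if since_last_full.delta_snapshots >= ALERT_FACTOR * plan.delta_snapshots:
        alerts.append("delta_snapshots")
    return alerts


plan = Progress(revisions=1_000_000, runtime_hours=24, delta_snapshots=200)
```

With this plan, `firing_alerts(plan, Progress(2_000_000, 10, 50))` reports only the revision alert, because the other two counters are still below twice their planned values.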

Additionally, I would like to add that we need to compare the cost of the current and proposed approaches and check whether there is a cost improvement and, if not, whether the added cost is acceptable. My gut feeling is that since the proposed approach plans to use cluster runtime rather than wall-clock time, we might see a cost reduction for the average cluster by avoiding "unnecessary" full snapshots. For larger clusters, we will definitely see more frequent full snapshots due to the higher rate/size of events, but that is acceptable and necessary to avoid slow restorations after potential data corruption.

abdasgupta added the kind/enhancement Enhancement, improvement, extension label Sep 13, 2021

abdasgupta assigned abdasgupta and aaronfern Sep 13, 2021

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Mar 13, 2022

gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Sep 10, 2022

shreyas-s-rao mentioned this issue Mar 3, 2023

[Enhancement] Backup-restore should calculate previous cron schedule of full snapshot gardener/etcd-backup-restore#587

Open
