[Feature] Replace full snapshots at regular interval with compacted snapshots #231

Open
abdasgupta opened this issue Sep 13, 2021 · 1 comment
Labels
kind/enhancement Enhancement, improvement, extension lifecycle/rotten Nobody worked on this for 12 months (final aging stage)

Comments

@abdasgupta
Contributor

Feature (What you would like to be added):
We recently added a compaction subcommand to ETCD Backup Restore here . This subcommand compacts and defragments the ETCD database and then takes a full snapshot of it. We can run this subcommand in parallel with ETCD BR at regular intervals instead of taking full snapshots. However, the first full snapshot and a full snapshot every 24 hours are still needed. The first full snapshot is needed because compaction cannot run in parallel unless at least one full snapshot already exists in backup storage. We also need a full snapshot every 24 hours because some clusters may go 24 hours without a single compacted snapshot being taken, and it would be critical for those clusters to have no full snapshot at all for 24 hours.
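The scheduling rule described above could be sketched roughly as follows. This is only an illustration and not code from ETCD Backup Restore: `Action`, `NextAction`, and `fullSnapshotMaxAge` are hypothetical names, and whether a compacted snapshot should reset the 24-hour timer is left open here.

```go
package main

import (
	"fmt"
	"time"
)

// Action is the kind of snapshot work to schedule next (hypothetical type).
type Action string

const (
	TakeFullSnapshot Action = "full-snapshot"
	RunCompaction    Action = "compaction"
)

// fullSnapshotMaxAge is the 24-hour safety interval from the proposal.
const fullSnapshotMaxAge = 24 * time.Hour

// NextAction decides what to run next.
// haveFull reports whether at least one full snapshot exists in backup storage.
// sinceLastFull is the time elapsed since that full snapshot was taken.
func NextAction(haveFull bool, sinceLastFull time.Duration) Action {
	// Compaction needs a full snapshot to start from, so the first one is mandatory.
	if !haveFull {
		return TakeFullSnapshot
	}
	// Safety net: never go longer than 24 hours without a full snapshot.
	if sinceLastFull >= fullSnapshotMaxAge {
		return TakeFullSnapshot
	}
	return RunCompaction
}

func main() {
	fmt.Println(NextAction(false, 0))
	fmt.Println(NextAction(true, 3*time.Hour))
	fmt.Println(NextAction(true, 26*time.Hour))
}
```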

Motivation (Why is this needed?):
We need this because we want our snapshots to take less space in backup storage. An ETCD DB restored from our compacted snapshots will also take less space in main memory. Moreover, regular compacted snapshots will keep the number of events in delta snapshots limited. Please check this

Approach/Hint to the implement solution (optional):

@abdasgupta abdasgupta added the kind/enhancement Enhancement, improvement, extension label Sep 13, 2021
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Mar 13, 2022
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Sep 10, 2022
@shreyas-s-rao
Contributor

As mentioned by @vlerenc in gardener/etcd-backup-restore#587 (comment), we need to make snapshot compaction much smarter than it is today if it is to replace scheduled full snapshots.

Paraphrasing here the points from Vedran's comment that relate to snapshot compaction, since etcd-druid handles snapshot compaction:


We need to improve the conditions that trigger a snapshot compaction job.

  • Today a threshold-based trigger is used, based on the number of events accumulated in the latest set of delta snapshots in the snapstore.
  • It needs to be enhanced to also account for the size of the accumulated events. This is required for clusters that write and update huge resources, where the number of events may be relatively small.
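A combined trigger along these lines might look like the sketch below. It is an assumption-laden illustration, not etcd-druid's actual logic: `DeltaSnapshot`, `Thresholds`, and `ShouldCompact` are hypothetical names, and the limits used in `main` are made up.

```go
package main

import "fmt"

// DeltaSnapshot summarizes one delta snapshot in the snapstore (hypothetical type).
type DeltaSnapshot struct {
	Events    int64 // number of events recorded in the snapshot
	SizeBytes int64 // stored size of the snapshot
}

// Thresholds holds illustrative limits; a zero value disables that limit.
type Thresholds struct {
	MaxEvents int64
	MaxBytes  int64
}

// ShouldCompact reports whether the accumulated delta snapshots exceed
// either the event-count limit or the size limit.
func ShouldCompact(deltas []DeltaSnapshot, t Thresholds) bool {
	var totalEvents, totalBytes int64
	for _, d := range deltas {
		totalEvents += d.Events
		totalBytes += d.SizeBytes
	}
	byEvents := t.MaxEvents > 0 && totalEvents >= t.MaxEvents
	byBytes := t.MaxBytes > 0 && totalBytes >= t.MaxBytes
	return byEvents || byBytes
}

func main() {
	// Few events, but large payloads: only the size limit is reached.
	deltas := []DeltaSnapshot{
		{Events: 1200, SizeBytes: 900 << 20},
		{Events: 800, SizeBytes: 700 << 20},
	}
	limits := Thresholds{MaxEvents: 1000000, MaxBytes: 1 << 30}
	fmt.Println(ShouldCompact(deltas, limits))
}
```

Treating the two limits as alternatives means a cluster with few but very large events still triggers compaction, which is the case the second point is concerned with.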

We also need to improve the alerts:

  • If we plan to compact or take full snapshots every 1M revisions, fire the alert if 2M revisions have accumulated since the last full snapshot (compacted or explicitly obtained)
  • If we plan to compact or take full snapshots every 24h cluster runtime, fire the alert if 48h have passed (do not use wall-clock time for condition and/or alert)…
  • If we plan to compact or take full snapshots every 200 delta snapshots, fire the alert if 400 delta snapshots have accumulated…
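The three alert conditions share one pattern: alert once twice the planned interval has accumulated since the last full snapshot. A minimal sketch of that pattern is shown here, assuming the counters are already available; `Limits`, `OverdueReasons`, `alertFactor`, and `examplePlan` are hypothetical names and do not refer to an existing alerting API.

```go
package main

import (
	"fmt"
	"time"
)

// Limits describes either the planned snapshot interval or the progress
// accumulated since the last full (or compacted) snapshot. Hypothetical type.
type Limits struct {
	Revisions      int64
	Runtime        time.Duration // cluster runtime, not wall-clock time
	DeltaSnapshots int64
}

// alertFactor: alert once twice the planned interval has accumulated.
const alertFactor = 2

// examplePlan mirrors the numbers used in the comment above.
var examplePlan = Limits{Revisions: 1000000, Runtime: 24 * time.Hour, DeltaSnapshots: 200}

// OverdueReasons returns the names of all dimensions in which the progress
// since the last full snapshot has reached alertFactor times the plan.
// A zero value in the plan disables that dimension.
func OverdueReasons(plan, since Limits) []string {
	var reasons []string
	if plan.Revisions > 0 && since.Revisions >= alertFactor*plan.Revisions {
		reasons = append(reasons, "revisions")
	}
	if plan.Runtime > 0 && since.Runtime >= alertFactor*plan.Runtime {
		reasons = append(reasons, "runtime")
	}
	if plan.DeltaSnapshots > 0 && since.DeltaSnapshots >= alertFactor*plan.DeltaSnapshots {
		reasons = append(reasons, "delta-snapshots")
	}
	return reasons
}

func main() {
	since := Limits{Revisions: 2100000, Runtime: 30 * time.Hour, DeltaSnapshots: 150}
	fmt.Println(OverdueReasons(examplePlan, since))
}
```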

Additionally, I would like to add that we need to compare the cost of the current and proposed approaches and check whether the proposal brings a cost improvement. If it does not, we need to decide whether the added cost is acceptable. My gut feeling is that since the proposed approach plans to utilize cluster runtime rather than wall-clock time, we might see a cost reduction for the average cluster by avoiding "unnecessary" full snapshots. For larger clusters, we will definitely see more frequent full snapshots due to the higher rate/size of events, but that is acceptable and necessary to avoid slow restorations after potential data corruption.


4 participants