Enabling scheduled backup compaction for ETCD via gardenlet feature gate #4511

Closed
amshuman-kr opened this issue Aug 13, 2021 · 7 comments
Labels: area/control-plane (control plane related), area/cost (cost related), area/disaster-recovery (disaster recovery related), area/high-availability (high availability related), kind/enhancement (enhancement, improvement, extension), priority/2 (lower number equals higher priority)


@amshuman-kr

How to categorize this issue?

/area control-plane cost high-availability disaster-recovery
/kind enhancement
/priority 2

What would you like to be added:

The upcoming release of etcd-druid includes a feature to schedule periodic compaction of the incremental snapshots in the ETCD backup. This improves disaster-recovery behaviour by putting an upper bound on the time taken to restore ETCDs from backup.

However, enabling this for all ETCDs in the landscape might incur additional cost, especially for network traffic.

To gain more control over the rollout of this feature, it would be good to introduce a new gardenlet feature gate (say, ETCDBackupCompaction) so that the feature can be enabled only in the soil clusters first (to gain control over the ETCD restoration time for the seed ETCDs) and only later in other seed clusters if there is a need.
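
A minimal sketch of the staged rollout described above, assuming each seed's gardenlet carries its own feature-gate map. `ETCDBackupCompaction` is the gate name suggested in this issue; the `FeatureGates` type and the `SeedsWithCompaction` helper are hypothetical and not taken from the gardenlet code base.

```go
package main

import "sort"

// ETCDBackupCompaction is the gate name floated in this issue. It is a
// proposal only; no such gate is assumed to exist in gardenlet.
const ETCDBackupCompaction = "ETCDBackupCompaction"

// FeatureGates maps a gate name to its state, in the style of the
// featureGates map used by Kubernetes-style component configurations.
// (Hypothetical type for this sketch.)
type FeatureGates map[string]bool

// Enabled reports whether the named gate is on. Missing gates count as off.
func (g FeatureGates) Enabled(name string) bool {
	return g[name]
}

// SeedsWithCompaction returns, in sorted order, the seeds whose gardenlet
// has the compaction gate switched on.
func SeedsWithCompaction(gardenlets map[string]FeatureGates) []string {
	var seeds []string
	for seed, gates := range gardenlets {
		if gates.Enabled(ETCDBackupCompaction) {
			seeds = append(seeds, seed)
		}
	}
	sort.Strings(seeds)
	return seeds
}
```

With the gate set only in the soil cluster's gardenlet, `SeedsWithCompaction` would return just that cluster; other seeds stay untouched until their gardenlet configuration is changed.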

Further,

  1. Would it also be a good idea to introduce an overall gardener feature-gate?
  2. If we decide to switch on the feature on other seed clusters (other than the soil cluster) eventually, would it also be a good idea to have a mechanism to control which shoot ETCDs get this functionality (maybe based on an additional field or the existing purpose field in the Shoot spec)?

cc @dguendisch @dkistner @mliepold @ashwani2k

Why is this needed:

ETCD restoration time can be quite long (45m to 1h or even longer) if there are a lot of incremental snapshots to be applied during restoration (the incremental snapshots can only be applied sequentially).

This affects larger shoot clusters where the number of changes flowing into the ETCD is quite large (such as seed clusters).

@amshuman-kr added the kind/enhancement label on Aug 13, 2021
@gardener-robot added the area/control-plane, area/cost, area/disaster-recovery, area/high-availability, and priority/2 labels on Aug 13, 2021
@amshuman-kr
Author

/assign @abdasgupta

@dkistner
Member

> If we decide to switch on the feature on other seed clusters (other than the soil cluster) eventually, would it also be a good idea to have a mechanism to control which shoot ETCDs get this functionality (maybe based on an additional field or the existing purpose field in the Shoot spec)?

How about using an annotation on the Shoot resource for this?
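
One way the two ideas could combine, as a sketch only: an explicit annotation wins, and the existing purpose field serves as the fallback. The annotation key, the trimmed-down `Shoot` type, and the choice of `production` as the default-on purpose are all hypothetical and not part of the Gardener API.

```go
package main

// CompactionAnnotation is a hypothetical annotation key used only in this
// sketch; it is not defined by Gardener.
const CompactionAnnotation = "example.gardener.cloud/etcd-backup-compaction"

// Shoot carries just the two fields the decision looks at. It is a stand-in
// for the real Shoot resource.
type Shoot struct {
	Annotations map[string]string
	Purpose     string
}

// WantsCompaction lets an explicit annotation win. Without one, it falls
// back to the purpose field (assumed default: only "production" shoots).
func WantsCompaction(s Shoot) bool {
	if v, ok := s.Annotations[CompactionAnnotation]; ok {
		return v == "true"
	}
	return s.Purpose == "production"
}
```

An annotation keeps the Shoot spec unchanged and lets operators opt individual shoots in or out, while the purpose fallback avoids having to annotate every shoot.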

@vpnachev
Member

vpnachev commented Aug 16, 2021

> Would it also be a good idea to introduce an overall gardener feature-gate?

TBH, I don't think a gardenlet feature gate is the correct implementation. You can simply expose some etcd-druid configuration options via the gardenlet config, just like it is done for logging and SNI:

// Logging contains an optional configurations for the logging stack deployed
// by the Gardenlet in the seed clusters.
// +optional
Logging *Logging `json:"logging,omitempty"`
// SNI contains an optional configuration for the APIServerSNI feature used
// by the Gardenlet in the seed clusters.
// +optional
SNI *SNI `json:"sni,omitempty"`
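
Following the same pattern, a compaction switch could be exposed as an optional config section. This is a sketch under assumptions: the type names, field names, and JSON keys below are hypothetical and may not match what the gardenlet configuration actually uses.

```go
package main

import "encoding/json"

// BackupCompaction is a hypothetical optional section that toggles compaction.
type BackupCompaction struct {
	// Enabled is a pointer so that "unset" can be told apart from "false".
	Enabled *bool `json:"enabled,omitempty"`
}

// ETCDConfig is a hypothetical container for etcd-druid related options.
type ETCDConfig struct {
	BackupCompaction *BackupCompaction `json:"backupCompaction,omitempty"`
}

// GardenletConfig is a stand-in for the real gardenlet configuration type.
type GardenletConfig struct {
	ETCDConfig *ETCDConfig `json:"etcdConfig,omitempty"`
}

// CompactionEnabled parses a JSON config and reports whether compaction is
// on. Every missing level defaults to "off".
func CompactionEnabled(raw []byte) (bool, error) {
	var cfg GardenletConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return false, err
	}
	if cfg.ETCDConfig == nil ||
		cfg.ETCDConfig.BackupCompaction == nil ||
		cfg.ETCDConfig.BackupCompaction.Enabled == nil {
		return false, nil
	}
	return *cfg.ETCDConfig.BackupCompaction.Enabled, nil
}
```

Unlike a feature gate, an option like this has no graduation path attached, so operators can leave it on or off per seed indefinitely.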

@rfranzke
Member

rfranzke commented Dec 2, 2021

@abdasgupta @timuthy Do you still plan to have such a feature gate? In the recent PRs, I think etcd backup compaction can be configured via the gardenlet configuration. Is this enough?

@timuthy
Member

timuthy commented Dec 2, 2021

Thanks for the reminder, @rfranzke. You're right, we discussed this internally and decided to go with regular configuration options instead of introducing a feature gate. Feature gates usually come with a graduation plan, i.e. the gate itself will eventually be removed again. Compaction, however, should be something operators can turn on or off whenever they see demand for it, so it is independent of any graduation state.

@timuthy
Member

timuthy commented Dec 2, 2021

/close

@rfranzke
Member

rfranzke commented Dec 2, 2021

Great, thank you!
