
[Feature] Allow druid to delete and trigger recreation of PVCs on-demand #481

Open
shreyas-s-rao opened this issue Dec 7, 2022 · 9 comments
Labels
kind/enhancement Enhancement, improvement, extension priority/3 Priority (lower number equals higher priority)

Comments

@shreyas-s-rao
Contributor

shreyas-s-rao commented Dec 7, 2022

Feature (What you would like to be added):
I would like to leverage the data restoration functionality of the etcd-backup-restore sidecar (in both single- and multi-node etcd clusters) to allow druid to delete PVC(s) on-demand and trigger recreation of the PVC(s) later.

Motivation (Why is this needed?):
There are use cases where a user wants to switch the PVC's storage class to a better one, change the volume size of the etcd disk, or switch to encrypted disks. Today, any change to the volume configuration in the Etcd CR leads to reconciliation errors, because the existing statefulset forbids updates to the volumeClaimTemplate, as can be seen from gardener/gardener-extension-provider-aws#646 (comment). I would like druid to catch such errors and, ideally, handle them gracefully in a well-defined way that I can opt into, for example via annotation(s) on the Etcd CR.

Approach/Hint to the implement solution (optional):
Introduce annotations on the Etcd CR, something like druid.gardener.cloud/operation: "recreate-pvc/etcd-main-0" and druid.gardener.cloud/operation: recreate-sts (if necessary), which druid sees and acts upon as follows (see the sketch after the list):

  1. Scale down the statefulset (if it isn't already scaled down)
  2. Delete the PVC associated with the pod specified in the annotation value - maybe there needs to be a better way to specify this annotation
  3. Delete the statefulset (to allow updating its volumeClaimTemplate, which is a "forbidden" field, i.e. the statefulset spec does not allow it to be updated) - based on the recreate-sts annotation
  4. Continue with regular reconciliation of the Etcd resource, which will recreate the statefulset and subsequently recreate the PVC and restore the data (in case of single-node etcd) or sync data with the leader (in case of multi-node etcd)
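
A minimal sketch of what steps 1-3 could look like inside druid, assuming a controller-runtime client; the function name handleRecreatePVC, the annotation parsing, and the PVC-name construction are illustrative assumptions, not the actual implementation:

```go
package druid

import (
	"context"
	"fmt"
	"strings"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/pointer"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// operationAnnotation is the annotation key proposed in this issue (assumption).
const operationAnnotation = "druid.gardener.cloud/operation"

// handleRecreatePVC sketches steps 1-3: scale down the statefulset, delete the
// PVC of the pod named in the annotation value, and delete the statefulset so
// that regular reconciliation (step 4) can recreate it with an updated
// volumeClaimTemplate.
func handleRecreatePVC(ctx context.Context, c client.Client, namespace, stsName string, annotations map[string]string) error {
	// Expect an annotation value like "recreate-pvc/etcd-main-0".
	op, ok := annotations[operationAnnotation]
	if !ok {
		return nil // nothing to do
	}
	parts := strings.SplitN(op, "/", 2)
	if len(parts) != 2 || parts[0] != "recreate-pvc" {
		return fmt.Errorf("unsupported operation %q", op)
	}
	podName := parts[1]

	// 1. Scale down the statefulset if it isn't already scaled down.
	sts := &appsv1.StatefulSet{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: stsName}, sts); err != nil {
		return err
	}
	if sts.Spec.Replicas == nil || *sts.Spec.Replicas != 0 {
		patch := client.MergeFrom(sts.DeepCopy())
		sts.Spec.Replicas = pointer.Int32(0)
		if err := c.Patch(ctx, sts, patch); err != nil {
			return err
		}
	}

	// 2. Delete the PVC associated with the named pod. The PVC name follows the
	// statefulset convention "<volumeClaimTemplate name>-<pod name>".
	if len(sts.Spec.VolumeClaimTemplates) == 0 {
		return fmt.Errorf("statefulset %s has no volumeClaimTemplates", stsName)
	}
	pvcName := fmt.Sprintf("%s-%s", sts.Spec.VolumeClaimTemplates[0].Name, podName)
	pvc := &corev1.PersistentVolumeClaim{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: pvcName}, pvc); err != nil {
		return err
	}
	if err := c.Delete(ctx, pvc); err != nil {
		return err
	}

	// 3. Delete the statefulset so the next reconciliation can recreate it with
	// the otherwise forbidden change to the volumeClaimTemplate.
	return c.Delete(ctx, sts)
}
```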

Also enhance the immutableFieldUpdate function introduced in #408 and see how this can be leveraged for the new druid.gardener.cloud/operation: recreate-sts annotation, if necessary.

cc @unmarshall @timuthy @vlerenc

@shreyas-s-rao shreyas-s-rao added the kind/enhancement Enhancement, improvement, extension label Dec 7, 2022
@timuthy
Member

timuthy commented Dec 12, 2022

API-wise, this approach seems procedural, which we probably want to avoid.

Do we need to expose a function-style operation, or isn't it possible to stick to Kubernetes's well-known desired/actual state paradigm, i.e. API changes trigger actions that Druid performs in order to transform the actual state into the desired one?

@shreyas-s-rao
Contributor Author

isn't it possible to stick to Kubernetes's well-known desired/actual state paradigm, i.e. API changes trigger actions that Druid performs in order to transform the actual state into the desired one?

That was my first thought as well. The issue here is that any change to the volumeClaimTemplate of a statefulset is forbidden. If we are to allow changing the volume-related spec in the Etcd resource (storageCapacity, storageClass), we'll need to define a strict behavior for it, like "if a change is made to either of these fields, druid must always recreate the statefulset, accompanied by a deletion of the volume". But that's risky, because it means deletion of etcd data.

Of course we can safeguard this operation with something like a druid.gardener.cloud/pvc-deletion-confirmation annotation, and also require druid to always trigger a full snapshot of the etcd data before scaling down the statefulset. But the fact remains that the whole operation is more or less "procedural" in a strict sense.
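
A minimal sketch of such a safeguard check; the annotation key, the expected value "true", and the helper name are assumptions of this sketch, not an agreed design:

```go
package druid

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// pvcDeletionConfirmation is the safeguard annotation suggested above (assumed key).
const pvcDeletionConfirmation = "druid.gardener.cloud/pvc-deletion-confirmation"

// volumeSpecChangeConfirmed reports whether the operator has explicitly
// acknowledged that reconciling a storageCapacity/storageClass change may
// delete and recreate the etcd volumes.
func volumeSpecChangeConfirmed(obj metav1.Object) bool {
	return obj.GetAnnotations()[pvcDeletionConfirmation] == "true"
}
```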

Also, the Etcd resource does not currently store any state for the volumes it uses (unless we plan to do that using the EtcdMembers field), so there is no strict desired/actual state maintained for the etcd volumes, at least at the moment.

@shreyas-s-rao
Contributor Author

Also, other use cases for this functionality of being able to programmatically delete a PVC on-demand are:

  1. Volume deletion upon shoot cluster hibernation: allows us to save costs when the etcd statefulset is scaled down, as long as we have safeguarded the etcd data using backups. Druid can trigger a full snapshot of the etcd data before PVC deletion.
  2. Handling permanent quorum loss: deletion of PVCs on-demand moves us at least one step closer to automating the handling of permanent quorum loss for multi-node etcd clusters. Of course, the main challenge there is accurately detecting a permanent quorum loss in the first place, but the operator would then only need to add one annotation to the Etcd resource and the rest would be taken care of by druid, as part of a handlePermanentQuorumLoss flow which reuses the deletePVC flow.

/cc @abdasgupta

@vlerenc
Member

vlerenc commented Dec 12, 2022

Pro:

  • We need to find a way (either recommended manual steps or automation) to replace unencrypted volumes with encrypted volumes (we changed that >1y ago, but there are still Gardener adopters sitting on unencrypted volumes).
  • We want to find a way (either recommended manual steps or automation) to replace over-/undersized volumes with properly sized volumes (e.g. on AWS there is no longer a need to over-provision ETCD volumes to get the desired IOPS).

Con:

  • I am naturally fearful of PV deletion code, as we discussed during the permanent quorum loss scenario, but that's not a showstopper - I am just mentioning it here because this scenario was brought up as something we may be able to "improve", though we hope we rather don't need it (transient quorum loss is handled automatically).
  • I would hope we can avoid scaling down the stateful set. For instance, with HA we explicitly hoped to be able to roll out these volume changes (encrypted, properly sized) without any downtime, i.e. in-flight. Yes, the stateful set would show "the wrong volume template", but it would adopt the recreated volumes, so these changes could be implemented without any downtime, right?

So, whether or not we automate the volume replacement (scripts have risks themselves, so a well-tested and safeguarded druid implementation may still be the better option, even if we are all fearful of volume deletion code), I would still like to raise the question whether we can offer a way to do this without downtime, as initially hoped/planned (HA became the prerequisite for volume replacement on critical clusters, and scaling down the stateful set would defeat this goal).

@shreyas-s-rao
Contributor Author

Yes, the stateful set would show "the wrong volume template", but it would adopt recreated volumes, so that these changes can be implemented without any downtime, right?

This would lead to an inconsistency in the well-known Kubernetes desired/actual state paradigm mentioned earlier by @timuthy. The zero-downtime update can still be done, using the steps provided by @unmarshall in gardener/gardener-extension-provider-aws#646 (comment). So essentially, we can make an Etcd spec change for storageSize / storageClass, accompanied by something like a confirmation annotation to make sure that volumes aren't deleted without explicit confirmation. @vlerenc WDYT?

@vlerenc
Member

vlerenc commented Dec 12, 2022

Yes, exactly. I was commenting on the sentence "Scale down the statefulset (if it isn't already scaled down)", and if you have the capacity to automate gardener/gardener-extension-provider-aws#646 (comment), that would be most welcome.

@shreyas-s-rao
Contributor Author

/assign

@shreyas-s-rao
Contributor Author

/assign @seshachalam-yv

@shreyas-s-rao shreyas-s-rao added priority/3 Priority (lower number equals higher priority) and removed priority/1 Priority (lower number equals higher priority) labels Nov 7, 2023
@shreyas-s-rao
Contributor Author

Blocked until #588 is implemented
