
[Feature] Allow druid to delete and trigger recreation of PVCs on-demand #481

Open
shreyas-s-rao opened this issue Dec 7, 2022 · 9 comments
Labels
kind/enhancement Enhancement, improvement, extension priority/3 Priority (lower number equals higher priority)

Comments

@shreyas-s-rao
Contributor

shreyas-s-rao commented Dec 7, 2022

Feature (What you would like to be added):
I would like to leverage the data restoration functionality of the etcd-backup-restore sidecar (in both single- and multi-node etcd clusters) to allow druid to delete PVC(s) on-demand and trigger recreation of the PVC(s) later.

Motivation (Why is this needed?):
There are use cases where a user wants to switch the PVC's storage class to a better one, change the volume size of the etcd disk, or switch to encrypted disks. Today, any change to the volume configuration in the Etcd CR leads to reconciliation errors, because the existing statefulset forbids updates to the volumeClaimTemplate, as can be seen from gardener/gardener-extension-provider-aws#646 (comment). I would like druid to catch such errors and, ideally, handle them gracefully in a well-defined way that I can opt into, for example via annotation(s) on the Etcd CR.

Approach/Hint to the implement solution (optional):
Introduce annotations on the Etcd CR, something like druid.gardener.cloud/operation: "recreate-pvc/etcd-main-0" and druid.gardener.cloud/operation: recreate-sts (if necessary), which druid sees and acts upon as follows (see the sketch after the list):

  1. Scale down the statefulset (if it isn't already scaled down)
  2. Delete the PVC associated with the pod specified in the annotation value - maybe there needs to be a better way to specify this annotation
  3. Delete the statefulset (to allow updating its volumeClaimTemplate, which is a "forbidden" field, i.e. the statefulset spec does not allow it to be updated) - based on the recreate-sts annotation
  4. Continue with regular reconciliation of the Etcd resource, which will recreate the statefulset and subsequently recreate the PVC and restore the data (in case of single-node etcd) or sync data with the leader (in case of multi-node etcd)
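
A minimal sketch of what steps 1-3 could look like inside druid, assuming a controller-runtime client; the function name handleRecreatePVC, the annotation parsing, and the PVC-name construction are illustrative assumptions, not the actual implementation:

```go
package druid

import (
	"context"
	"fmt"
	"strings"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/pointer"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// operationAnnotation is the annotation key proposed in this issue (assumption).
const operationAnnotation = "druid.gardener.cloud/operation"

// handleRecreatePVC sketches steps 1-3: scale down the statefulset, delete the
// PVC of the pod named in the annotation value, and delete the statefulset so
// that regular reconciliation (step 4) can recreate it with an updated
// volumeClaimTemplate.
func handleRecreatePVC(ctx context.Context, c client.Client, namespace, stsName string, annotations map[string]string) error {
	// Expect an annotation value like "recreate-pvc/etcd-main-0".
	op, ok := annotations[operationAnnotation]
	if !ok {
		return nil // nothing to do
	}
	parts := strings.SplitN(op, "/", 2)
	if len(parts) != 2 || parts[0] != "recreate-pvc" {
		return fmt.Errorf("unsupported operation %q", op)
	}
	podName := parts[1]

	// 1. Scale down the statefulset if it isn't already scaled down.
	sts := &appsv1.StatefulSet{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: stsName}, sts); err != nil {
		return err
	}
	if sts.Spec.Replicas == nil || *sts.Spec.Replicas != 0 {
		patch := client.MergeFrom(sts.DeepCopy())
		sts.Spec.Replicas = pointer.Int32(0)
		if err := c.Patch(ctx, sts, patch); err != nil {
			return err
		}
	}

	// 2. Delete the PVC associated with the named pod. The PVC name follows the
	// statefulset convention "<volumeClaimTemplate name>-<pod name>".
	if len(sts.Spec.VolumeClaimTemplates) == 0 {
		return fmt.Errorf("statefulset %s has no volumeClaimTemplates", stsName)
	}
	pvcName := fmt.Sprintf("%s-%s", sts.Spec.VolumeClaimTemplates[0].Name, podName)
	pvc := &corev1.PersistentVolumeClaim{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: pvcName}, pvc); err != nil {
		return err
	}
	if err := c.Delete(ctx, pvc); err != nil {
		return err
	}

	// 3. Delete the statefulset so the next reconciliation can recreate it with
	// the otherwise forbidden change to the volumeClaimTemplate.
	return c.Delete(ctx, sts)
}
```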

Also enhance the immutableFieldUpdate function introduced in #408 and see how this can be leveraged for the new druid.gardener.cloud/operation: recreate-sts annotation, if necessary.

cc @unmarshall @timuthy @vlerenc

@shreyas-s-rao shreyas-s-rao added the kind/enhancement Enhancement, improvement, extension label Dec 7, 2022
@timuthy
Member

timuthy commented Dec 12, 2022

API-wise, this approach seems procedural, which we probably want to avoid.

Do we need to expose a function-style operation, or isn't it possible to stick to Kubernetes's well-known desired/actual state paradigm, i.e. API changes trigger actions that Druid performs in order to transform the actual state into the desired one?

@shreyas-s-rao
Contributor Author

isn't it possible to stick to Kubernetes's well-known desired/actual state paradigm, i.e. API changes trigger actions that Druid performs in order to transform the actual state into the desired one?

That was my first thought as well. The issue here is that any change to the volumeClaimTemplate of a statefulset is forbidden. If we are to allow changing the volume-related spec in the Etcd resource (storageCapacity, storageClass), we'll need to define a strict behavior for it, like "if a change is made to either of these fields, druid must always recreate the statefulset, accompanied by a deletion of the volume". But that's risky, because it means deletion of etcd data.

Of course we can safeguard this operation with something like a druid.gardener.cloud/pvc-deletion-confirmation annotation, and also require druid to always trigger a full snapshot of the etcd data before scaling down the statefulset. But the fact remains that the whole operation is more or less "procedural" in a strict sense.
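
A minimal sketch of such a safeguard check; the annotation key, the expected value "true", and the helper name are assumptions of this sketch, not an agreed design:

```go
package druid

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// pvcDeletionConfirmation is the safeguard annotation suggested above (assumed key).
const pvcDeletionConfirmation = "druid.gardener.cloud/pvc-deletion-confirmation"

// volumeSpecChangeConfirmed reports whether the operator has explicitly
// acknowledged that reconciling a storageCapacity/storageClass change may
// delete and recreate the etcd volumes.
func volumeSpecChangeConfirmed(obj metav1.Object) bool {
	return obj.GetAnnotations()[pvcDeletionConfirmation] == "true"
}
```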

Also, the Etcd resource does not currently store any state for the volumes it uses (unless we plan to do that using the EtcdMembers field), so there is no strict desired/actual state maintained for the etcd volumes, at least at the moment.

@shreyas-s-rao
Contributor Author

Also, other use cases for this functionality of being able to programmatically delete a PVC on-demand are:

  1. Volume deletion upon shoot cluster hibernation: allows us to save costs when the etcd statefulset is scaled down, as long as we have safeguarded the etcd data using backups. Druid can trigger a full snapshot of the etcd data before PVC deletion.
  2. Handling permanent quorum loss: deletion of PVCs on-demand moves us at least one step closer to automating the handling of permanent quorum loss for multi-node etcd clusters. Of course, the main challenge there is accurately detecting a permanent quorum loss in the first place, but the operator would then only need to add one annotation to the Etcd resource and the rest would be taken care of by druid, as part of a handlePermanentQuorumLoss flow which reuses the deletePVC flow.

/cc @abdasgupta

@vlerenc
Member

vlerenc commented Dec 12, 2022

Pro:

  • We need to find a way (either recommended manual steps or automation) to replace unencrypted volumes with encrypted volumes (we changed that >1y ago, but there are still Gardener adopters sitting on unencrypted volumes).
  • We want to find a way (either recommended manual steps or automation) to replace over-/undersized volumes with properly sized volumes (e.g. on AWS there is no longer a need to over-provision ETCD volumes to get the desired IOPS).

Con:

  • I am naturally fearful of PV deletion code, as we discussed during the permanent quorum loss scenario, but that's not a showstopper - I am just mentioning it here because this scenario was brought up as something we may be able to "improve", though we hope we rather don't need it (transient quorum loss is handled automatically).
  • I would hope we can avoid scaling down the stateful set. For instance, with HA we explicitly hoped to be able to roll out these volume changes (encrypted, properly sized) without any downtime, i.e. in-flight. Yes, the stateful set would show "the wrong volume template", but it would adopt the recreated volumes, so these changes could be implemented without any downtime, right?

So, whether or not we automate the volume replacement (scripts have risks themselves, so a well-tested and safeguarded druid implementation may still be the better option, even if we are all fearful of volume deletion code), I would still like to raise the question whether we can offer a way to do this without downtime, as initially hoped/planned (HA became the prerequisite for volume replacement on critical clusters, and scaling down the stateful set would defeat this goal).

@shreyas-s-rao
Contributor Author

Yes, the stateful set would show "the wrong volume template", but it would adopt recreated volumes, so that these changes can be implemented without any downtime, right?

This would lead to an inconsistency in the well-known Kubernetes desired/actual state paradigm mentioned earlier by @timuthy. The zero-downtime update can still be done, using the steps provided by @unmarshall in gardener/gardener-extension-provider-aws#646 (comment). So essentially, we can make an Etcd spec change for storageSize / storageClass, accompanied by something like a confirmation annotation to make sure that volumes aren't deleted without explicit confirmation. @vlerenc WDYT?

@vlerenc
Member

vlerenc commented Dec 12, 2022

Yes, exactly. I was commenting on the sentence "Scale down the statefulset (if it isn't already scaled down)", and if you have the capacity to automate gardener/gardener-extension-provider-aws#646 (comment), that would be most welcome.

@shreyas-s-rao
Contributor Author

/assign

@shreyas-s-rao
Contributor Author

/assign @seshachalam-yv

@shreyas-s-rao shreyas-s-rao added priority/3 Priority (lower number equals higher priority) and removed priority/1 Priority (lower number equals higher priority) labels Nov 7, 2023
@shreyas-s-rao
Contributor Author

Blocked until #588 is implemented
