Add a new configurable field fullSnapshotLeaseUpdateInterval in spec.backup section of Etcd CR #764

@anveshreddy18 anveshreddy18 commented Feb 8, 2024

How to categorize this PR?

/area usability
/kind enhancement

What this PR does / why we need it:

This PR adds a new field fullSnapshotLeaseUpdateInterval to the spec.backup section of the Etcd CR and makes the changes needed to configure the full-snapshot-lease-update-interval parameter, which sets the interval at which updating the full snapshot lease is retried.

  • The backup-restore PR#711 introduces a new flag, full-snapshot-lease-update-interval, to configure the retry interval for updating the full snapshot lease. Adding the fullSnapshotLeaseUpdateInterval field to the Etcd CR lets users control how often the full snapshot lease update is retried.

Note: The field is optional. When it is not set, backup-restore applies its own default value.
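
For illustration, an Etcd resource using the proposed field might look like the fragment below. This is only a sketch: the resource name and the `5m` value are placeholders assumed for the example, not values taken from this PR, and all other spec fields are omitted.

```yaml
apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
  name: etcd-example   # placeholder name, assumed for this sketch
spec:
  backup:
    # Optional. When omitted, backup-restore falls back to its own default.
    # The value is a duration string; "5m" is an assumed example value.
    fullSnapshotLeaseUpdateInterval: 5m
```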

Which issue(s) this PR fixes:
Fixes #763

Special notes for your reviewer:

Release note:

The `full-snapshot-lease-update-interval` flag can now be configured through the Etcd resource spec field `.spec.backup.fullSnapshotLeaseUpdateInterval`.

@anveshreddy18 anveshreddy18 requested a review from a team as a code owner February 8, 2024 11:51
@gardener-robot gardener-robot added area/usability Usability related kind/enhancement Enhancement, improvement, extension needs/review Needs review size/s Size of pull request is small (see gardener-robot robot/bots/size.py) labels Feb 8, 2024
@anveshreddy18 anveshreddy18 self-assigned this Feb 8, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Feb 8, 2024
@shreyas-s-rao shreyas-s-rao left a comment

Thanks for the PR, @anveshreddy18!

A couple of comments from my side:

  1. PTAL at the points I mentioned in "Add a new configurable field fullSnapshotLeaseUpdateInterval in spec.backup section of Etcd CR" #763 (comment).
  2. Please use a PR-built image of etcdbr in this PR (in charts/images.yaml) so that this PR can be tested easily. You can obtain the PR-built image from the Concourse publish step of "Full snapshot lease update retry on failure" (etcd-backup-restore#711).

@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Feb 8, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Feb 9, 2024
@anveshreddy18 anveshreddy18 changed the title Add a new configurable field fullsnapLeaseUpdateRetryInterval in spec.backup section of Etcd CR Add a new configurable field fullSnapshotLeaseUpdateInterval in spec.backup section of Etcd CR Feb 9, 2024
@ishan16696 ishan16696 left a comment

LGTM!!

@shreyas-s-rao

/hold until PR for #728 gets merged, since that will bring changes in the component model for resources deployed by druid.

@gardener-robot gardener-robot added the reviewed/do-not-merge Has no approval for merging as it may break things, be of poor quality or have (ext.) dependencies label Mar 21, 2024
@anveshreddy18 anveshreddy18 force-pushed the configure/fullsnapshot-lease-update-retry-interval branch from 7c4f934 to fefae40 Compare June 24, 2024 11:52
@gardener-robot gardener-robot added size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) and removed size/s Size of pull request is small (see gardener-robot robot/bots/size.py) labels Jun 24, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 24, 2024
@@ -159,6 +159,9 @@ type BackupSpec struct {
// All full snapshots beyond this limit will be garbage collected.
// +optional
MaxBackupsLimitBasedGC *int32 `json:"maxBackupsLimitBasedGC,omitempty"`
// FullSnapshotLeaseUpdateInterval defines the interval for retrying to update the full snapshot lease.
// +optional
FullSnapshotLeaseUpdateInterval *metav1.Duration `json:"fullSnapshotLeaseUpdateInterval,omitempty"`
Contributor

Why is an API change required, especially since we will no longer use snapshot leases once we work on steward? Introducing something in the API now that is going to be removed anyway is not ideal.
Also, the original issue was to retry the update of the full snapshot lease from backup-restore if the first attempt failed. Why does this now result in an Etcd API change?

Contributor Author

I agree that etcdbr has a default interval for this, but we can also make it configurable from druid, so that users have a choice when the default value doesn't suit their purposes.

Regarding the API change: since we have made a lot of API changes in 777, maybe we can push everything together and remove this later when steward arrives, as the changes are minimal. I say this because I presume steward is not planned for the near future, as priorities have shifted to improving druid and addressing other security issues. So I'm guessing it will take some time to get steward running; until then, we can provide this option.

Also, if snapshot leases will be removed in steward, we can say the same for the PR #820 which decouples the ready condition for snapshot leases. But I think it's important to let users configure this and have better knowledge of the conditions, even if they will only be present for a few months. WDYT?

@unmarshall unmarshall Jun 27, 2024

Can you explain why we need a consumer to define when and how often a retry to update a snapshot lease should be done? Using snapshot leases is an implementation detail and should not be exposed as part of the druid API anyway. Implementations change, and one cannot keep changing APIs with every change in the implementation.

Contributor Author

Sorry, by user I meant Gardener here. The plan was to let Gardener modify this as needed: if they feel the 3 min default is not suitable (as it fills the logs with retries) and want to increase it, they would have the option to. But yes, I agree that this is not a configuration that warrants an API change. I'll close this.

Contributor

> we can say the same for the PR #820 which decouples the ready condition for snapshot leases.

These are not the same thing. What #820 does is quite different: it changes what goes into the status, which cannot be influenced by any consumer of druid anyway. The intent of #820 is to improve monitoring of the full and delta snapshots taken by an etcd cluster. We do not leak any implementation detail into the API.

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jun 27, 2024